Modeling Airline Delay¶
Overview¶
Statistics on whether a flight was delayed, and for how long, are available from government databases for all the major carriers. It would be useful to predict, before scheduling a flight, whether it is likely to be delayed. In this example, DataRobot will try to model whether a flight will be delayed, based on information such as the scheduled departure time and whether it rained on the day of the flight.
Prerequisites¶
In order to run this notebook yourself, you will need the following:
- This notebook. If you are viewing this in the HTML documentation bundle, you can download all of the example notebooks and supporting materials from Downloads.
- The datasets required for this notebook. These are in the same directory as this notebook.
- A DataRobot API token. You can find your API token by logging into the DataRobot Web User Interface and looking in your
Profile
.
Set Up¶
This example assumes that the DataRobot Python client package has been installed and configured with the credentials of a DataRobot user with API access permissions.
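If the client is not yet installed, a minimal setup sketch looks like the following (assuming a standard Python environment with pip available; supplying credentials is covered in the Configure the Python Client section below):

# The client is published on PyPI as "datarobot".
# Run the install from a shell, or prefix it with "!" in a notebook cell:
#   pip install datarobot
import datarobot as dr  # the client used throughout this notebook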
Data Sources¶
Information on flights and flight delays is made available by the Bureau of Transportation Statistics at https://www.transtats.bts.gov/ONTIME/Departures.aspx. To narrow down the amount of data involved, the datasets assembled for this use case are limited to US Airways flights out of Boston Logan in 2013 and 2014, although the script for interacting with DataRobot is sufficiently general that any dataset with the correct format could be used. A flight was declared to be delayed if it ultimately took off at least fifteen minutes after its scheduled departure time.
In addition to flight information, each record in the prepared dataset notes the amount of rain and whether it rained on the day of the flight. This information came from the National Oceanic and Atmospheric Administration’s Quality Controlled Local Climatological Data (QCLCD), available at http://www.ncdc.noaa.gov/qclcd/QCLCD. The daily rainfall for each day in 2013 and 2014 was taken from the recorded daily summaries of water-equivalent precipitation at the Boston Logan station. For some days, the QCLCD reports only a trace amount of rainfall; these were recorded as 0 inches of rain.
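The exact preparation code is not part of this notebook, but a minimal sketch of the labeling and weather-merge logic described above might look like the following. The raw column names (date, dep_delay_minutes, precip_inches) are hypothetical placeholders, not the actual BTS/QCLCD field names:

import pandas as pd

# Hypothetical raw extracts; real BTS/QCLCD files use different column names.
flights = pd.DataFrame({
    'date': ['02/01/2013', '02/01/2013'],
    'dep_delay_minutes': [3, 22],
})
weather = pd.DataFrame({
    'date': ['02/01/2013'],
    'precip_inches': ['T'],  # 'T' marks a trace amount in the QCLCD summaries
})

# A flight counts as delayed if it took off at least 15 minutes late.
flights['was_delayed'] = flights['dep_delay_minutes'] >= 15

# Trace precipitation is treated as 0 inches before converting to float.
weather['daily_rainfall'] = weather['precip_inches'].replace('T', 0).astype(float)
weather['did_rain'] = weather['daily_rainfall'] > 0

prepared = flights.merge(weather[['date', 'daily_rainfall', 'did_rain']], on='date')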
Dataset Structure¶
Each row in the assembled dataset contains the following columns:
- was_delayed (bool): whether the flight was delayed
- daily_rainfall (float): the amount of rain, in inches, on the day of the flight
- did_rain (bool): whether it rained on the day of the flight
- Carrier Code (str): the carrier code of the airline; US for all entries in the assembled dataset
- Date (str, MM/DD/YYYY format): the date of the flight
- Flight Number (str): the flight number for the flight
- Tail Number (str): the tail number of the aircraft
- Destination Airport (str): the three-letter airport code of the destination airport
- Scheduled Departure Time (str): the 24-hour scheduled departure time of the flight, in the origin airport’s timezone
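Before modeling, a quick schema check like the following can catch a malformed file early (a minimal sketch; the expected column list mirrors the table above and the CSV header shown below):

EXPECTED_COLUMNS = [
    'was_delayed', 'daily_rainfall', 'did_rain', 'Carrier Code',
    'Date (MM/DD/YYYY)', 'Flight Number', 'Tail Number',
    'Destination Airport', 'Scheduled Departure Time',
]

def check_schema(df):
    """Raise if any expected column is missing from the prepared dataset."""
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError('Dataset is missing columns: {}'.format(sorted(missing)))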
[1]:
import pandas as pd
import datarobot as dr
[2]:
data_path = "logan-US-2013.csv"
logan_2013 = pd.read_csv(data_path)
logan_2013.head()
[2]:
was_delayed | daily_rainfall | did_rain | Carrier Code | Date (MM/DD/YYYY) | Flight Number | Tail Number | Destination Airport | Scheduled Departure Time | |
---|---|---|---|---|---|---|---|---|---|
0 | False | 0.0 | False | US | 02/01/2013 | 225 | N662AW | PHX | 16:20 |
1 | False | 0.0 | False | US | 02/01/2013 | 280 | N822AW | PHX | 06:00 |
2 | False | 0.0 | False | US | 02/01/2013 | 303 | N653AW | CLT | 09:35 |
3 | True | 0.0 | False | US | 02/01/2013 | 604 | N640AW | PHX | 09:55 |
4 | False | 0.0 | False | US | 02/01/2013 | 722 | N715UW | PHL | 18:30 |
We want to be able to make predictions for future data, so the date column should be transformed into features (such as day of week and month) that will still be populated for future flights:
[3]:
def prepare_modeling_dataset(df):
date_column_name = 'Date (MM/DD/YYYY)'
date = pd.to_datetime(df[date_column_name])
modeling_df = df.drop(date_column_name, axis=1)
days = {0: 'Mon', 1: 'Tues', 2: 'Weds', 3: 'Thurs', 4: 'Fri', 5: 'Sat',
6: 'Sun'}
modeling_df['day_of_week'] = date.apply(lambda x: days[x.dayofweek])
modeling_df['month'] = date.dt.month
return modeling_df
[4]:
logan_2013_modeling = prepare_modeling_dataset(logan_2013)
logan_2013_modeling.head()
[4]:
was_delayed | daily_rainfall | did_rain | Carrier Code | Flight Number | Tail Number | Destination Airport | Scheduled Departure Time | day_of_week | month | |
---|---|---|---|---|---|---|---|---|---|---|
0 | False | 0.0 | False | US | 225 | N662AW | PHX | 16:20 | Fri | 2 |
1 | False | 0.0 | False | US | 280 | N822AW | PHX | 06:00 | Fri | 2 |
2 | False | 0.0 | False | US | 303 | N653AW | CLT | 09:35 | Fri | 2 |
3 | True | 0.0 | False | US | 604 | N640AW | PHX | 09:55 | Fri | 2 |
4 | False | 0.0 | False | US | 722 | N715UW | PHL | 18:30 | Fri | 2 |
DataRobot Modeling¶
As part of this use case, in model_flight_ontime.py, a DataRobot project will be created and used to run a variety of models against the assembled datasets. By default, DataRobot runs autopilot on the automatically generated Informative Features list, which excludes certain pathological features (like Carrier Code in this example, which has the same value in every row). We will also create a custom feature list that excludes the rainfall features, since the amount of rain on the day of the flight would not be known when scheduling a flight.
This notebook shows how to use the Python API client to create a project, create feature lists, train models with different sample percents and feature lists, and view the models that have been run. It will:
- create a project
- create a new feature list (no foreknowledge) excluding the rainfall features
- set the target to was_delayed and run DataRobot autopilot on the Informative Features list
- rerun autopilot on the new feature list
- make predictions on a new dataset
Configure the Python Client¶
Configuring the client requires the following two things:
- A DataRobot endpoint - where the API server can be found
- A DataRobot API token - a token the server uses to identify and validate the user making API requests
The endpoint is usually the URL you would use to log into the DataRobot Web User Interface (e.g., https://app.datarobot.com) with “/api/v2/” appended (e.g., https://app.datarobot.com/api/v2/).
You can find your API token by logging into the DataRobot Web User Interface and looking in your Profile.
The Python client can be configured in several ways. The approach used in this notebook is to point to a YAML file containing the needed information. This is a text file with two lines like this:
endpoint: https://app.datarobot.com/api/v2/
token: not-my-real-token
If you want to run this notebook without changes, save your configuration in a file located under your home directory called ~/.config/datarobot/drconfig.yaml.
[5]:
# Initialization with arguments
# dr.Client(token='<API TOKEN>', endpoint='https://<YOUR ENDPOINT>/api/v2/')
# Initialization with a config file in the same directory as this notebook
# dr.Client(config_path='drconfig.yaml')
# Initialization with a config file located at ~/.config/datarobot/drconfig.yaml (the default)
dr.Client()
[5]:
<datarobot.rest.RESTClientObject at 0x114014510>
Starting a Project¶
[6]:
project = dr.Project.start(logan_2013_modeling,
project_name='Airline Delays - was_delayed',
target="was_delayed")
print('Project ID: {}'.format(project.id))
Project ID: 5c0012ca6523cd0200c4a017
Jobs and the Project Queue¶
You can view the project in your browser:
[7]:
# If running notebook remotely
project.open_leaderboard_browser()
[7]:
True
[8]:
# Set worker count higher.
# Passing -1 sets it to the maximum available to your account.
project.set_worker_count(-1)
[8]:
Project(Airline Delays - was_delayed)
[9]:
project.pause_autopilot()
[9]:
True
[10]:
# More jobs will go in the queue in each stage of autopilot.
# This gets the currently inprogress and queued jobs
project.get_model_jobs()
[10]:
[ModelJob(Logistic Regression, status=inprogress),
ModelJob(Regularized Logistic Regression (L2), status=inprogress),
ModelJob(Auto-tuned K-Nearest Neighbors Classifier (Euclidean Distance), status=inprogress),
ModelJob(Majority Class Classifier, status=inprogress),
ModelJob(Gradient Boosted Trees Classifier, status=inprogress),
ModelJob(Breiman and Cutler Random Forest Classifier, status=inprogress),
ModelJob(RuleFit Classifier, status=inprogress),
ModelJob(Regularized Logistic Regression (L2), status=inprogress),
ModelJob(TensorFlow Neural Network Classifier, status=inprogress),
ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping, status=inprogress),
ModelJob(Elastic-Net Classifier (L2 / Binomial Deviance), status=inprogress),
ModelJob(Nystroem Kernel SVM Classifier, status=inprogress),
ModelJob(RandomForest Classifier (Gini), status=inprogress),
ModelJob(Vowpal Wabbit Classifier, status=inprogress),
ModelJob(Generalized Additive2 Model, status=inprogress),
ModelJob(Light Gradient Boosted Trees Classifier with Early Stopping, status=queue),
ModelJob(Light Gradient Boosting on ElasticNet Predictions , status=queue),
ModelJob(Regularized Logistic Regression (L2), status=queue),
ModelJob(Elastic-Net Classifier (L2 / Binomial Deviance) with Binned numeric features, status=queue),
ModelJob(Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance), status=queue),
ModelJob(RandomForest Classifier (Entropy), status=queue),
ModelJob(ExtraTrees Classifier (Gini), status=queue),
ModelJob(Gradient Boosted Trees Classifier with Early Stopping, status=queue),
ModelJob(Gradient Boosted Greedy Trees Classifier with Early Stopping, status=queue),
ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping, status=queue),
ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features, status=queue),
ModelJob(Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) with Unsupervised Learning Features, status=queue),
ModelJob(Auto-tuned K-Nearest Neighbors Classifier (Euclidean Distance), status=queue),
ModelJob(Eureqa Generalized Additive Model Classifier (3645 Generations), status=inprogress),
ModelJob(Naive Bayes combiner classifier, status=inprogress),
ModelJob(RandomForest Classifier (Gini), status=inprogress),
ModelJob(Gradient Boosted Trees Classifier, status=inprogress),
ModelJob(Decision Tree Classifier (Gini), status=inprogress)]
[11]:
project.unpause_autopilot()
[11]:
True
Features¶
[12]:
features = project.get_features()
features
[12]:
[Feature(did_rain),
Feature(Destination Airport),
Feature(Carrier Code),
Feature(Flight Number),
Feature(Tail Number),
Feature(day_of_week),
Feature(month),
Feature(Scheduled Departure Time),
Feature(daily_rainfall),
Feature(was_delayed),
Feature(Scheduled Departure Time (Hour of Day))]
[13]:
pd.DataFrame([f.__dict__ for f in features])
[13]:
date_format | feature_type | id | importance | low_information | max | mean | median | min | na_count | name | project_id | std_dev | target_leakage | time_series_eligibility_reason | time_series_eligible | time_step | time_unit | unique_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | Boolean | 2 | 0.029045 | False | 1 | 0.36 | 0 | 0 | 0 | did_rain | 5c0012ca6523cd0200c4a017 | 0.48 | FALSE | notADate | False | None | None | 2 |
1 | None | Categorical | 6 | 0.003714 | True | None | None | None | None | 0 | Destination Airport | 5c0012ca6523cd0200c4a017 | None | FALSE | notADate | False | None | None | 5 |
2 | None | Categorical | 3 | NaN | True | None | None | None | None | 0 | Carrier Code | 5c0012ca6523cd0200c4a017 | None | SKIPPED_DETECTION | notADate | False | None | None | 1 |
3 | None | Numeric | 4 | 0.005900 | False | 2165 | 1705.63 | 2021 | 67 | 0 | Flight Number | 5c0012ca6523cd0200c4a017 | 566.67 | FALSE | notADate | False | None | None | 329 |
4 | None | Categorical | 5 | -0.004512 | True | None | None | None | None | 0 | Tail Number | 5c0012ca6523cd0200c4a017 | None | FALSE | notADate | False | None | None | 296 |
5 | None | Categorical | 8 | 0.003452 | True | None | None | None | None | 0 | day_of_week | 5c0012ca6523cd0200c4a017 | None | FALSE | notADate | False | None | None | 7 |
6 | None | Numeric | 9 | 0.003043 | True | 12 | 6.47 | 6 | 1 | 0 | month | 5c0012ca6523cd0200c4a017 | 3.38 | FALSE | notADate | False | None | None | 12 |
7 | %H:%M | Time | 7 | 0.058245 | False | 21:30 | 12:26 | 12:00 | 05:00 | 0 | Scheduled Departure Time | 5c0012ca6523cd0200c4a017 | 0.19 days | FALSE | notADate | False | None | None | 77 |
8 | None | Numeric | 1 | 0.034295 | False | 3.07 | 0.12 | 0 | 0 | 0 | daily_rainfall | 5c0012ca6523cd0200c4a017 | 0.33 | FALSE | notADate | False | None | None | 58 |
9 | None | Boolean | 0 | 1.000000 | False | 1 | 0.098 | 0 | 0 | 0 | was_delayed | 5c0012ca6523cd0200c4a017 | 0.3 | SKIPPED_DETECTION | notADate | False | None | None | 2 |
10 | None | Categorical | 10 | 0.053047 | False | None | None | None | None | 0 | Scheduled Departure Time (Hour of Day) | 5c0012ca6523cd0200c4a017 | None | FALSE | notADate | False | None | None | 17 |
Three feature lists are automatically created:
- Raw Features: one for all features
- Informative Features: one based on features with any information (columns with no variation are excluded)
- Univariate Selections: one based on univariate importance (this is only created after the target is set)
Informative Features is the one used by default in autopilot.
[14]:
feature_lists = project.get_featurelists()
feature_lists
[14]:
[Featurelist(Raw Features),
Featurelist(Informative Features),
Featurelist(Univariate Selections)]
[15]:
# create a featurelist without the rain features
# (since they leak future information)
informative_feats = [lst for lst in feature_lists if
lst.name == 'Informative Features'][0]
no_foreknowledge_features = list(
set(informative_feats.features) - {'daily_rainfall', 'did_rain'})
[16]:
no_foreknowledge = project.create_featurelist('no foreknowledge',
no_foreknowledge_features)
no_foreknowledge
[16]:
Featurelist(no foreknowledge)
[17]:
project.get_status()
[17]:
{u'autopilot_done': False,
u'stage': u'modeling',
u'stage_description': u'Ready for modeling'}
[18]:
# This waits until autopilot is complete:
project.wait_for_autopilot(check_interval=90)
In progress: 20, queued: 13 (waited: 0s)
In progress: 20, queued: 13 (waited: 1s)
In progress: 19, queued: 13 (waited: 1s)
In progress: 20, queued: 12 (waited: 2s)
In progress: 20, queued: 12 (waited: 4s)
In progress: 20, queued: 12 (waited: 6s)
In progress: 20, queued: 12 (waited: 10s)
In progress: 19, queued: 2 (waited: 17s)
In progress: 10, queued: 0 (waited: 30s)
In progress: 2, queued: 0 (waited: 56s)
In progress: 4, queued: 0 (waited: 108s)
In progress: 1, queued: 0 (waited: 198s)
In progress: 13, queued: 0 (waited: 289s)
In progress: 0, queued: 0 (waited: 379s)
In progress: 5, queued: 0 (waited: 470s)
In progress: 4, queued: 0 (waited: 560s)
In progress: 0, queued: 0 (waited: 651s)
[19]:
project.start_autopilot(no_foreknowledge.id)
[20]:
project.wait_for_autopilot(check_interval=90)
In progress: 0, queued: 0 (waited: 0s)
In progress: 0, queued: 0 (waited: 0s)
In progress: 0, queued: 0 (waited: 1s)
In progress: 0, queued: 0 (waited: 1s)
In progress: 0, queued: 0 (waited: 3s)
In progress: 0, queued: 0 (waited: 4s)
In progress: 0, queued: 1 (waited: 8s)
In progress: 20, queued: 13 (waited: 15s)
In progress: 20, queued: 1 (waited: 28s)
In progress: 3, queued: 0 (waited: 54s)
In progress: 16, queued: 0 (waited: 106s)
In progress: 20, queued: 12 (waited: 196s)
In progress: 0, queued: 0 (waited: 287s)
Models¶
[21]:
models = project.get_models()
example_model = models[0]
example_model
[21]:
Model(u'eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features')
Models represent fitted models and have various data about the model, including metrics:
[22]:
example_model.metrics
[22]:
{u'AUC': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.755494,
u'holdout': 0.76509,
u'validation': 0.75702},
u'FVE Binomial': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.14855,
u'holdout': 0.14992,
u'validation': 0.15364},
u'Gini Norm': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.510988,
u'holdout': 0.53018,
u'validation': 0.51404},
u'Kolmogorov-Smirnov': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.398738,
u'holdout': 0.42279,
u'validation': 0.40472},
u'LogLoss': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.272296,
u'holdout': 0.27178,
u'validation': 0.27079},
u'RMSE': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.27529400000000004,
u'holdout': 0.27627,
u'validation': 0.27448},
u'Rate@Top10%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.379522,
u'holdout': 0.35792,
u'validation': 0.38908},
u'Rate@Top5%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.489794,
u'holdout': 0.45902,
u'validation': 0.5034},
u'Rate@TopTenth%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.8000019999999999,
u'holdout': 0.75,
u'validation': 0.66667}}
[23]:
def sorted_by_log_loss(models, test_set):
models_with_score = [model for model in models if
model.metrics['LogLoss'][test_set] is not None]
return sorted(models_with_score,
key=lambda model: model.metrics['LogLoss'][test_set])
Let’s choose the models (from each feature set, to compare the scores) with the best LogLoss score from those with the rain and those without:
[24]:
models = project.get_models()
fair_models = [mod for mod in models if
mod.featurelist_id == no_foreknowledge.id]
rain_cheat_models = [mod for mod in models if
mod.featurelist_id == informative_feats.id]
[25]:
models[0].metrics['LogLoss']
[25]:
{u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.272296,
u'holdout': 0.27178,
u'validation': 0.27079}
[26]:
best_fair_model = sorted_by_log_loss(fair_models, 'crossValidation')[0]
best_cheat_model = sorted_by_log_loss(rain_cheat_models, 'crossValidation')[0]
best_fair_model.metrics, best_cheat_model.metrics
[26]:
({u'AUC': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.7132720000000001,
u'holdout': None,
u'validation': 0.71811},
u'FVE Binomial': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.089814,
u'holdout': None,
u'validation': 0.09341},
u'Gini Norm': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.426544,
u'holdout': None,
u'validation': 0.43622},
u'Kolmogorov-Smirnov': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.322424,
u'holdout': None,
u'validation': 0.31053},
u'LogLoss': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.291076,
u'holdout': None,
u'validation': 0.29006},
u'RMSE': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.285848,
u'holdout': None,
u'validation': 0.28579},
u'Rate@Top10%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.294882,
u'holdout': None,
u'validation': 0.29352},
u'Rate@Top5%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.36734799999999995,
u'holdout': None,
u'validation': 0.39456},
u'Rate@TopTenth%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.600002,
u'holdout': None,
u'validation': 0.66667}},
{u'AUC': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.7604420000000001,
u'holdout': None,
u'validation': 0.75549},
u'FVE Binomial': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.15306999999999998,
u'holdout': None,
u'validation': 0.15124},
u'Gini Norm': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.520884,
u'holdout': None,
u'validation': 0.51098},
u'Kolmogorov-Smirnov': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.406068,
u'holdout': None,
u'validation': 0.39472},
u'LogLoss': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.270848,
u'holdout': None,
u'validation': 0.27156},
u'RMSE': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.274772,
u'holdout': None,
u'validation': 0.27497},
u'Rate@Top10%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.38498399999999994,
u'holdout': None,
u'validation': 0.38908},
u'Rate@Top5%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.504762,
u'holdout': None,
u'validation': 0.5034},
u'Rate@TopTenth%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.933334,
u'holdout': None,
u'validation': 1.0}})
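The full metrics dictionaries above are verbose; a small sketch like the following (using only objects already defined in this notebook) condenses the comparison into a table of cross-validation scores:

# Compare the two best models on a few metrics of interest.
metrics_of_interest = ['LogLoss', 'AUC', 'Rate@Top10%']
comparison = pd.DataFrame({
    'no foreknowledge': {m: best_fair_model.metrics[m]['crossValidation']
                         for m in metrics_of_interest},
    'with rain features': {m: best_cheat_model.metrics[m]['crossValidation']
                           for m in metrics_of_interest},
})
comparison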
Visualizing Models¶
This is a good time to use Feature Fit and Feature Effects (not yet available via the API) to visualize the models:
[27]:
best_fair_model.open_model_browser()
[27]:
True
[28]:
best_cheat_model.open_model_browser()
[28]:
True
Unlocking the Holdout¶
To maintain holdout scores as a valid estimate of out-of-sample error, we recommend not looking at them until late in the project. For this reason, holdout scores are locked until you unlock them.
[29]:
project.unlock_holdout()
[29]:
Project(Airline Delays - was_delayed)
[30]:
best_fair_model = dr.Model.get(project.id, best_fair_model.id)
best_cheat_model = dr.Model.get(project.id, best_cheat_model.id)
[31]:
best_fair_model.metrics['LogLoss'], best_cheat_model.metrics['LogLoss']
[31]:
({u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.291076,
u'holdout': 0.29408,
u'validation': 0.29006},
{u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.270848,
u'holdout': 0.27193,
u'validation': 0.27156})
Retrain on 100%¶
When ready to use the final model, you will probably get the best performance by retraining on 100% of the data.
[32]:
model_job_fair_100pct_id = best_fair_model.train(sample_pct=100)
model_job_fair_100pct_id
[32]:
'211'
Wait for the model to complete:
[33]:
model_fair_100pct = dr.models.modeljob.wait_for_async_model_creation(
project.id, model_job_fair_100pct_id)
model_fair_100pct.id
[33]:
u'5c0016b76523cd026cc49f99'
Predictions¶
Let’s make predictions for some new data. This new data needs the same transformations applied to it as were applied to the training data.
[34]:
logan_2014 = pd.read_csv("logan-US-2014.csv")
logan_2014_modeling = prepare_modeling_dataset(logan_2014)
logan_2014_modeling.head()
[34]:
was_delayed | daily_rainfall | did_rain | Carrier Code | Flight Number | Tail Number | Destination Airport | Scheduled Departure Time | day_of_week | month | |
---|---|---|---|---|---|---|---|---|---|---|
0 | False | 0.0 | False | US | 450 | N809AW | PHX | 10:00 | Sat | 2 |
1 | False | 0.0 | False | US | 553 | N814AW | PHL | 07:00 | Sat | 2 |
2 | False | 0.0 | False | US | 582 | N820AW | PHX | 06:10 | Sat | 2 |
3 | False | 0.0 | False | US | 601 | N678AW | PHX | 16:20 | Sat | 2 |
4 | False | 0.0 | False | US | 657 | N662AW | CLT | 09:45 | Sat | 2 |
[35]:
prediction_dataset = project.upload_dataset(logan_2014_modeling)
predict_job = model_fair_100pct.request_predictions(prediction_dataset.id)
prediction_dataset.id
[35]:
u'5c0016cf6523cd0018c4a0d3'
[36]:
predictions = predict_job.get_result_when_complete()
[37]:
pd.concat([logan_2014_modeling, predictions], axis=1).head()
[37]:
was_delayed | daily_rainfall | did_rain | Carrier Code | Flight Number | Tail Number | Destination Airport | Scheduled Departure Time | day_of_week | month | positive_probability | prediction | prediction_threshold | row_id | class_0.0 | class_1.0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | 0.0 | False | US | 450 | N809AW | PHX | 10:00 | Sat | 2 | 0.055054 | 0.0 | 0.5 | 0 | 0.944946 | 0.055054 |
1 | False | 0.0 | False | US | 553 | N814AW | PHL | 07:00 | Sat | 2 | 0.045004 | 0.0 | 0.5 | 1 | 0.954996 | 0.045004 |
2 | False | 0.0 | False | US | 582 | N820AW | PHX | 06:10 | Sat | 2 | 0.030196 | 0.0 | 0.5 | 2 | 0.969804 | 0.030196 |
3 | False | 0.0 | False | US | 601 | N678AW | PHX | 16:20 | Sat | 2 | 0.201461 | 0.0 | 0.5 | 3 | 0.798539 | 0.201461 |
4 | False | 0.0 | False | US | 657 | N662AW | CLT | 09:45 | Sat | 2 | 0.072447 | 0.0 | 0.5 | 4 | 0.927553 | 0.072447 |
Let’s have a look at our results. Since this is a binary classification problem, a positive_probability near zero means the row is a strong candidate for the negative class (the flight will leave on time), while a value near one means the positive class (the flight will be delayed) is more likely. From the KDE (Kernel Density Estimate) plot below, we can see that this sample of the data is weighted heavily toward the negative class.
[38]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
[39]:
matplotlib.rcParams['figure.figsize'] = (15, 10) # make charts bigger
[40]:
sns.set(color_codes=True)
sns.kdeplot(predictions.positive_probability, shade=True, cut=0,
label='Positive Probability')
plt.xlim((0, 1))
plt.ylim((0, None))
plt.xlabel('Probability of Event')
plt.ylabel('Probability Density')
plt.title('Prediction Distribution')
[40]:
Text(0.5,1,'Prediction Distribution')
Exploring Prediction Explanations¶
Computing prediction explanations is a resource-intensive task, but you can reduce the runtime by setting prediction value thresholds. You can learn more about prediction explanations by searching the online documentation available in the DataRobot web interface.
When are they useful?¶
A common question when evaluating data is: why is a certain data point considered high-risk (or low-risk) for a certain event?
A sample case for prediction explanations:
Clark is a business analyst at a large manufacturing firm. She does not have a lot of data science expertise, but has been using DataRobot with great success to predict likely product failures at her manufacturing plant. Her manager is now asking for recommendations for reducing the defect rate, based on these predictions. Clark would like DataRobot to produce prediction explanations for the expected product failures so that she can identify the key drivers of product failures based on a higher-level aggregation of explanations. Her business team can then use this report to address the causes of failure.
Other common use cases and possible explanations include:
- What are indicators that a transaction could be at high risk for fraud? Possible explanations include transactions out of a cardholder’s home area, transactions out of their “normal usage” time range, and transactions that are too large or small.
- What are some explanations for setting a higher auto insurance price? The applicant is single, male, age under 30 years, and has received traffic tickets. A married homeowner may receive a lower rate.
Preparation¶
We are almost ready to compute prediction explanations. Two prerequisite steps must be completed first; however, these commands only need to be run once per model.
Feature Impact¶
A prerequisite to computing prediction explanations is computing the feature impact for your model (this only needs to be done once per model):
[41]:
%%time
feature_impacts = model_fair_100pct.get_or_request_feature_impact()
CPU times: user 25.4 ms, sys: 5.09 ms, total: 30.5 ms
Wall time: 11.3 s
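The returned feature impact records can be inspected directly. A minimal sketch, assuming each record is a dict with 'featureName' and 'impactNormalized' keys (an assumption about the client's return format; adjust the key names if your client version differs):

# Turn the feature impact records into a DataFrame and show the strongest features.
# The key names below are an assumption about the client's return format.
impact_df = pd.DataFrame(feature_impacts)
impact_df.sort_values('impactNormalized', ascending=False).head()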
Prediction Explanations Initialization¶
After Feature Impact has been computed, you must also create a Prediction Explanations Initialization for your model:
[42]:
%%time
try:
# Test to see if they are already computed
dr.PredictionExplanationsInitialization.get(project.id,
model_fair_100pct.id)
except dr.errors.ClientError as e:
assert e.status_code == 404 # haven't been computed
init_job = dr.PredictionExplanationsInitialization.create(
project.id,
model_fair_100pct.id
)
init_job.wait_for_completion()
CPU times: user 24.9 ms, sys: 5.16 ms, total: 30 ms
Wall time: 11 s
Computing the prediction explanations¶
Now that we have computed the feature impact and initialized the prediction explanations, and have also uploaded a dataset and computed predictions on it, we are ready to request prediction explanations for every row of the dataset. Computing prediction explanations supports a couple of parameters:
- max_explanations is the maximum number of prediction explanations to compute for each row.
- threshold_low and threshold_high are thresholds on the prediction value of the row. Prediction explanations will be computed for a row if the row’s prediction value is higher than threshold_high or lower than threshold_low. If no thresholds are specified, prediction explanations will be computed for all rows.
Note: for binary classification projects (like this one), the thresholds correspond to the positive_probability prediction value, whereas for regression problems they correspond to the actual predicted value.
Since we’ve already examined our prediction distribution above, let’s use it to guide our choice of thresholds. Most flights depart on time, so let’s examine explanations only for flights with an above-normal probability of being delayed. We will use a threshold_high of 0.456, which means that for every row where the predicted positive_probability is at least 0.456, we will compute the prediction explanations for that row. For simplicity, we will also limit DataRobot to computing 5 explanations per row.
[43]:
%%time
number_of_explanations = 5
pe_job = dr.PredictionExplanations.create(
project.id,
model_fair_100pct.id,
prediction_dataset.id,
max_explanations=number_of_explanations,
threshold_low=None,
threshold_high=0.456
)
pe = pe_job.get_result_when_complete()
all_rows = pe.get_all_as_dataframe()
CPU times: user 4.1 s, sys: 131 ms, total: 4.23 s
Wall time: 22.4 s
Let’s clean up the DataFrame we got back by trimming it down to just the interesting columns. Also, since most rows have prediction values outside our thresholds, let’s drop the uninteresting rows (i.e., ones with null values).
[44]:
import pandas as pd
pd.options.display.max_rows = 10 # default display is too verbose
# These rows are all redundant or of little value for this example
redundant_cols = ['row_id', 'class_0_label', 'class_1_probability',
'class_1_label']
explanations = all_rows.drop(redundant_cols, axis=1)
explanations.drop(['explanation_{}_label'.format(i)
for i in range(number_of_explanations)],
axis=1, inplace=True)
# These are rows that didn't meet our thresholds
explanations.dropna(inplace=True)
# Rename columns to be more consistent with the terms we have been using
explanations.rename(index=str,
columns={'class_0_probability': 'positive_probability'},
inplace=True)
explanations
[44]:
prediction | positive_probability | explanation_0_feature | explanation_0_feature_value | explanation_0_qualitative_strength | explanation_0_strength | explanation_1_feature | explanation_1_feature_value | explanation_1_qualitative_strength | explanation_1_strength | ... | explanation_2_qualitative_strength | explanation_2_strength | explanation_3_feature | explanation_3_feature_value | explanation_3_qualitative_strength | explanation_3_strength | explanation_4_feature | explanation_4_feature_value | explanation_4_qualitative_strength | explanation_4_strength | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39 | 0.0 | 0.471055 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.072288 | day_of_week | Sun | ++ | 0.455652 | ... | ++ | 0.362867 | Destination Airport | CLT | ++ | 0.345914 | Tail Number | N537UW | ++ | 0.242375 |
392 | 0.0 | 0.478501 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.072288 | day_of_week | Sun | ++ | 0.455652 | ... | ++ | 0.362867 | Destination Airport | CLT | ++ | 0.345914 | Tail Number | N536UW | ++ | 0.272234 |
13043 | 0.0 | 0.465055 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.202299 | Tail Number | N194UW | ++ | 0.416944 | ... | ++ | 0.391831 | day_of_week | Sun | ++ | 0.286239 | month | 12 | ++ | 0.273073 |
13259 | 0.0 | 0.463182 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.141272 | Destination Airport | CLT | ++ | 0.391831 | ... | ++ | 0.373726 | Tail Number | N563UW | ++ | 0.321922 | month | 12 | ++ | 0.256552 |
13843 | 0.0 | 0.498733 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.270218 | Flight Number | 586 | ++ | 0.440506 | ... | ++ | 0.355779 | Tail Number | N647AW | ++ | 0.241246 | day_of_week | Thurs | ++ | 0.224909 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18015 | 0.0 | 0.497778 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.565999 | month | 7 | ++ | 0.809545 | ... | ++ | 0.347827 | Tail Number | N534UW | ++ | 0.247029 | day_of_week | Thurs | + | 0.224909 |
18165 | 0.0 | 0.466710 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.368628 | month | 7 | ++ | 0.368182 | ... | ++ | 0.347827 | Tail Number | N173US | ++ | 0.314294 | Flight Number | 800 | + | 0.093169 |
18382 | 0.0 | 0.481914 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.281047 | Flight Number | 586 | ++ | 0.440506 | ... | ++ | 0.396207 | day_of_week | Thurs | ++ | 0.224909 | Tail Number | N660AW | + | 0.164530 |
18392 | 1.0 | 0.506051 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.334738 | month | 7 | ++ | 0.424888 | ... | ++ | 0.347827 | Tail Number | N170US | ++ | 0.280126 | day_of_week | Thurs | ++ | 0.224909 |
18406 | 1.0 | 0.511845 | Scheduled Departure Time | -2.208927e+09 | +++ | 1.357411 | month | 7 | ++ | 0.855629 | ... | ++ | 0.676216 | Scheduled Departure Time (Hour of Day) | 17 | ++ | 0.455910 | Destination Airport | CLT | ++ | 0.344885 |
24 rows × 22 columns
Explore Prediction Explanations results¶
Now let’s see how often various features show up among the top explanations impacting the probability of a flight being delayed.
[45]:
from functools import reduce
# Create a combined histogram of all our explanations
explanations_hist = reduce(
lambda x, y: x.add(y, fill_value=0),
(explanations['explanation_{}_feature'.format(i)].value_counts()
for i in range(number_of_explanations)))
[46]:
explanations_hist.plot.bar()
plt.xticks(rotation=45, ha='right')
[46]:
(array([0, 1, 2, 3, 4, 5, 6]), <a list of 7 Text xticklabel objects>)
Knowing the feature impact for this model from the Diving Deeper notebook, the high occurrence of Scheduled Departure Time as a prediction explanation is not entirely surprising, because it was one of the top-ranked features in the impact chart (daily_rainfall cannot appear here, since this model was trained on the no foreknowledge feature list, which excludes the rainfall features). Therefore, let’s take a small detour and investigate some of the rows that had less expected explanations.
Below is some helper code. It can largely be ignored, as it is specific to this exercise and not needed for a general understanding of the DataRobot APIs.
[47]:
from operator import or_
from functools import reduce
from itertools import chain
def find_rows_with_explanation(df, feature_name, nexpls):
"""
    Given a prediction explanations DataFrame, return a slice of that
    data where the given feature appears among the top N explanations
"""
all_expl_columns = (df['explanation_{}_feature'.format(i)] == feature_name
for i in range(nexpls))
df_filter = reduce(or_, all_expl_columns)
return favorite_expl_columns(df[df_filter], nexpls)
def favorite_expl_columns(df, nexpls):
"""
    Only display the most useful columns of a prediction explanations DataFrame.
"""
# Use chain to flatten our list of tuples
columns = list(chain.from_iterable((
'explanation_{}_feature'.format(i),
'explanation_{}_feature_value'.format(i),
'explanation_{}_strength'.format(i))
for i in range(nexpls)))
return df[columns]
def find_feature_in_row(feature, row, nexpls):
"""
    Return the value of the given feature if it appears among
    a row's top explanations (None otherwise)
"""
for i in range(nexpls):
if row['explanation_{}_feature'.format(i)] == feature:
return row['explanation_{}_feature_value'.format(i)]
def collect_feature_values(df, feature, nexpls):
"""
Return a list of all values of a given prediction explanation
from a DataFrame
"""
return [find_feature_in_row(feature, row, nexpls)
for index, row in df.iterrows()]
Investigation: Destination Airport¶
It looks like there were a small number of rows where Destination Airport was one of the top N explanations for a given prediction.
[48]:
feature_name = 'Destination Airport'
flight_nums = find_rows_with_explanation(explanations,
feature_name,
number_of_explanations)
flight_nums
[48]:
explanation_0_feature | explanation_0_feature_value | explanation_0_strength | explanation_1_feature | explanation_1_feature_value | explanation_1_strength | explanation_2_feature | explanation_2_feature_value | explanation_2_strength | explanation_3_feature | explanation_3_feature_value | explanation_3_strength | explanation_4_feature | explanation_4_feature_value | explanation_4_strength | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39 | Scheduled Departure Time | -2.208920e+09 | 1.072288 | day_of_week | Sun | 0.455652 | month | 2 | 0.362867 | Destination Airport | CLT | 0.345914 | Tail Number | N537UW | 0.242375 |
392 | Scheduled Departure Time | -2.208920e+09 | 1.072288 | day_of_week | Sun | 0.455652 | month | 2 | 0.362867 | Destination Airport | CLT | 0.345914 | Tail Number | N536UW | 0.272234 |
13043 | Scheduled Departure Time | -2.208920e+09 | 1.202299 | Tail Number | N194UW | 0.416944 | Destination Airport | CLT | 0.391831 | day_of_week | Sun | 0.286239 | month | 12 | 0.273073 |
13259 | Scheduled Departure Time | -2.208920e+09 | 1.141272 | Destination Airport | CLT | 0.391831 | day_of_week | Thurs | 0.373726 | Tail Number | N563UW | 0.321922 | month | 12 | 0.256552 |
14226 | Scheduled Departure Time | -2.208920e+09 | 1.339540 | month | 6 | 0.401657 | Destination Airport | CLT | 0.347827 | day_of_week | Thurs | 0.224909 | Tail Number | N190UW | 0.147016 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
17638 | Scheduled Departure Time | -2.208920e+09 | 1.340564 | month | 7 | 0.411066 | Destination Airport | CLT | 0.347827 | day_of_week | Thurs | 0.224909 | Flight Number | 800 | 0.120877 |
18015 | Scheduled Departure Time | -2.208920e+09 | 1.565999 | month | 7 | 0.809545 | Destination Airport | CLT | 0.347827 | Tail Number | N534UW | 0.247029 | day_of_week | Thurs | 0.224909 |
18165 | Scheduled Departure Time | -2.208920e+09 | 1.368628 | month | 7 | 0.368182 | Destination Airport | CLT | 0.347827 | Tail Number | N173US | 0.314294 | Flight Number | 800 | 0.093169 |
18392 | Scheduled Departure Time | -2.208920e+09 | 1.334738 | month | 7 | 0.424888 | Destination Airport | CLT | 0.347827 | Tail Number | N170US | 0.280126 | day_of_week | Thurs | 0.224909 |
18406 | Scheduled Departure Time | -2.208927e+09 | 1.357411 | month | 7 | 0.855629 | Tail Number | N818AW | 0.676216 | Scheduled Departure Time (Hour of Day) | 17 | 0.455910 | Destination Airport | CLT | 0.344885 |
14 rows × 15 columns
[49]:
all_flights = collect_feature_values(flight_nums,
feature_name,
number_of_explanations)
pd.DataFrame(all_flights)[0].value_counts().plot.bar()
plt.xticks(rotation=0)
[49]:
(array([0]), <a list of 1 Text xticklabel objects>)
Many a frequent flier will tell you horror stories about flying in and out of certain airports. While any given prediction explanation can have a positive or a negative impact on a prediction (indicated by both the strength and qualitative_strength columns), due to the thresholds we configured earlier for this tutorial it is likely that the above airports are contributing to flight delays.
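To double-check the direction of these contributions, a quick sketch can pull out the matching strength values and confirm they are all positive (find_strength_in_row is a hypothetical helper modeled on find_feature_in_row above):

def find_strength_in_row(feature, row, nexpls):
    """Return the strength of the given feature if it appears in a row's explanations."""
    for i in range(nexpls):
        if row['explanation_{}_feature'.format(i)] == feature:
            return row['explanation_{}_strength'.format(i)]

airport_strengths = [find_strength_in_row(feature_name, row, number_of_explanations)
                     for _, row in flight_nums.iterrows()]
# Positive strengths push these predictions toward the positive class (delayed).
pd.Series(airport_strengths).describe()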
Investigation: Scheduled Departure Time¶
DataRobot correctly identified the Scheduled Departure Time input as a time value, but the prediction explanation output shows its internal representation as a Unix epoch value, so let’s convert it back into a human-readable format:
[50]:
# For simplicity, let's just look at rows where `Scheduled Departure Time`
# was the first/top explanation.
feature_name = 'Scheduled Departure Time'
bad_times = explanations[explanations.explanation_0_feature == feature_name]
# Now let's convert the epoch to a datetime
pd.to_datetime(bad_times.explanation_0_feature_value, unit='s')
[50]:
39 1900-01-01 19:15:00
392 1900-01-01 19:15:00
13043 1900-01-01 19:10:00
13259 1900-01-01 19:10:00
13843 1900-01-01 19:15:00
...
18015 1900-01-01 19:10:00
18165 1900-01-01 19:10:00
18382 1900-01-01 19:15:00
18392 1900-01-01 19:10:00
18406 1900-01-01 17:05:00
Name: explanation_0_feature_value, Length: 24, dtype: datetime64[ns]
It appears as though all departures occurred on Jan. 1, 1900. This is because the original value was only a time of day with no date, so only the time portion of the converted epoch is meaningful. We will clean this up in the graph below:
[51]:
from matplotlib.ticker import FuncFormatter
from time import gmtime, strftime
scale_factor = 9 # make the difference in strengths more visible
depart = explanations[explanations.explanation_0_feature == feature_name]
true_only = depart[depart.prediction == 1]
false_only = depart[depart.prediction == 0]
plt.scatter(x=true_only.explanation_0_feature_value,
y=true_only.positive_probability,
c='green',
s=true_only.explanation_0_strength ** scale_factor,
label='Will be delayed')
plt.scatter(x=false_only.explanation_0_feature_value,
y=false_only.positive_probability,
c='purple',
s=false_only.explanation_0_strength ** scale_factor,
label='Will not')
# Convert the Epoch values into human time stamps
formatter = FuncFormatter(lambda x, pos: strftime('%H:%M', gmtime(x)))
plt.gca().xaxis.set_major_formatter(formatter)
plt.xlabel('Scheduled Departure Time')
plt.ylabel('Positive Probability')
plt.legend(markerscale=.5, frameon=True, facecolor="white")
plt.title("Relationship of Depart Time and being delayed")
[51]:
Text(0.5,1,'Relationship of Depart Time and being delayed')
The above plot shows each prediction where the top influencer was Scheduled Departure Time. The positive_probability is plotted on the Y-axis, and the size of each point represents the strength that departure time had on the prediction (relative to the other features of that data point). Finally, as a visual aid, the positive and negative outcomes are colored differently.
As we can see from the time scale on the X-axis, it doesn’t cover the full 24 hours; this is telling. Since we filtered our data earlier to show only predictions leaning towards being delayed, and this chart clusters in the afternoon and evening, there may be a correlation between a later scheduled departure time and a higher probability of being delayed. With a little domain knowledge, this makes sense: an airplane and its crew make many flights in a day (typically hopping between cities), so small delays in the morning compound into the evening hours.
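One way to sanity-check this pattern across the full 2014 dataset, rather than only the rows that crossed our explanation threshold, is to group the earlier predictions by scheduled departure hour. A minimal sketch using the predictions and logan_2014_modeling objects defined above:

# Average predicted delay probability by scheduled departure hour (2014 data).
hours = pd.to_datetime(logan_2014_modeling['Scheduled Departure Time'],
                       format='%H:%M').dt.hour
by_hour = predictions['positive_probability'].groupby(hours).mean()
by_hour.plot.bar()
plt.xlabel('Scheduled Departure Hour')
plt.ylabel('Mean Predicted Probability of Delay')
plt.title('Predicted Delay Probability by Departure Hour')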