Modeling Airline Delay

Overview

Statistics on whether a flight was delayed and for how long are available from government databases for all the major carriers. It would be useful to be able to predict before scheduling a flight whether or not it was likely to be delayed. In this example, DataRobot will try to model whether a flight will be delayed, based on information such as the scheduled departure time and whether rained the day of the flight.

Prerequisites

In order to run this notebook yourself, you will need the following:

  • This notebook. If you are viewing this in the HTML documentation bundle, you can download all of the example notebooks and supporting materials from Downloads.
  • The datasets required for this notebook. These are in the same directory as this notebook.
  • A DataRobot API token. You can find your API token by logging into the DataRobot Web User Interface and looking in your Profile.

Set Up

This example assumes that the DataRobot Python client package has been installed and configured with the credentials of a DataRobot user with API access permissions.

Data Sources

Information on flights and flight delays is made available by the Bureau of Transportation Statistics at https://www.transtats.bts.gov/ONTIME/Departures.aspx. To narrow down the amount of data involved, the datasets assembled for this use case are limited to US Airways flights out of Boston Logan in 2013 and 2014, although the script for interacting with DataRobot is sufficiently general that any dataset with the correct format could be used. A flight was declared to be delayed if it ultimately took off at least fifteen minutes after its scheduled departure time.

In additional to flight information, each record in the prepared dataset notes the amount of rain and whether it rained on the day of the flight. This information came from the National Oceanic and Atmospheric Administration’s Quality Controlled Local Climatological Data, available at http://www.ncdc.noaa.gov/qclcd/QCLCD. By looking at the recorded daily summaries of the water equivalent precipitation at the Boston Logan station, the daily rainfall for each day in 2013 and 2014 was measured. For some days, the QCLCD reports trace amounts of rainfall, which was recorded as 0 inches of rain.

Dataset Structure

Each row in the assembled dataset contains the following columns

  • was_delayed
    • boolean
    • whether the flight was delayed
  • daily_rainfall
    • float
    • the amount of rain, in inches, on the day of the flight
  • did_rain
    • bool
    • whether it rained on the day of the flight
  • Carrier Code
    • str
    • the carrier code of the airline - US for all entries in assembled dataset
  • Date
    • str (MM/DD/YYYY format)
    • the date of the flight
  • Flight Number
    • str
    • the flight number for the flight
  • Tail Number
    • str
    • the tail number of the aircraft
  • Destination Airport
    • str
    • the three-letter airport code of the destination airport
  • Scheduled Departure Time
    • str
    • the 24-hour scheduled departure time of the flight, in the origin airport’s timezone
[1]:
import pandas as pd
import datarobot as dr
[2]:
data_path = "logan-US-2013.csv"
logan_2013 = pd.read_csv(data_path)
logan_2013.head()
[2]:
was_delayed daily_rainfall did_rain Carrier Code Date (MM/DD/YYYY) Flight Number Tail Number Destination Airport Scheduled Departure Time
0 False 0.0 False US 02/01/2013 225 N662AW PHX 16:20
1 False 0.0 False US 02/01/2013 280 N822AW PHX 06:00
2 False 0.0 False US 02/01/2013 303 N653AW CLT 09:35
3 True 0.0 False US 02/01/2013 604 N640AW PHX 09:55
4 False 0.0 False US 02/01/2013 722 N715UW PHL 18:30

We want to be able to make predictions for future data, so the “date” column should be transformed in a way that avoids values that won’t be populated for future data:

[3]:
def prepare_modeling_dataset(df):
    date_column_name = 'Date (MM/DD/YYYY)'
    date = pd.to_datetime(df[date_column_name])
    modeling_df = df.drop(date_column_name, axis=1)
    days = {0: 'Mon', 1: 'Tues', 2: 'Weds', 3: 'Thurs', 4: 'Fri', 5: 'Sat',
            6: 'Sun'}
    modeling_df['day_of_week'] = date.apply(lambda x: days[x.dayofweek])
    modeling_df['month'] = date.dt.month
    return modeling_df
[4]:
logan_2013_modeling = prepare_modeling_dataset(logan_2013)
logan_2013_modeling.head()
[4]:
was_delayed daily_rainfall did_rain Carrier Code Flight Number Tail Number Destination Airport Scheduled Departure Time day_of_week month
0 False 0.0 False US 225 N662AW PHX 16:20 Fri 2
1 False 0.0 False US 280 N822AW PHX 06:00 Fri 2
2 False 0.0 False US 303 N653AW CLT 09:35 Fri 2
3 True 0.0 False US 604 N640AW PHX 09:55 Fri 2
4 False 0.0 False US 722 N715UW PHL 18:30 Fri 2

DataRobot Modeling

As part of this use case, in model_flight_ontime.py, a DataRobot project will be created and used to run a variety of models against the assembled datasets. By default, DataRobot will run autopilot on the automatically generated Informative Features list, which excludes certain pathological features (like Carrier Code in this example, which is always the same value), and we will also create a custom feature list excluding the amount of rainfall on the day of the flight.

This notebook shows how to use the Python API client to create a project, create feature lists, train models with different sample percents and feature lists, and view the models that have been run. It will:

  • create a project
  • create a new feature list (no foreknowledge) excluding the rainfall features
  • set the target to was_delayed, and run DataRobot autopilot on the Informative Features list
  • rerun autopilot on a new feature list
  • make predictions on a new data set

Configure the Python Client

Configuring the client requires the following two things:

  • A DataRobot endpoint - where the API server can be found
  • A DataRobot API token - a token the server uses to identify and validate the user making API requests

The endpoint is usually the URL you would use to log into the DataRobot Web User Interface (e.g., https://app.datarobot.com) with “/api/v2/” appended, e.g., (https://app.datarobot.com/api/v2/).

You can find your API token by logging into the DataRobot Web User Interface and looking in your Profile.

The Python client can be configured in several ways. The example we’ll use in this notebook is to point to a yaml file that has the information. This is a text file containing two lines like this:

endpoint: https://app.datarobot.com/api/v2/
token: not-my-real-token

If you want to run this notebook without changes, please save your configuration in a file located under your home directory called ~/.config/datarobot/drconfig.yaml.

[5]:
# Initialization with arguments
# dr.Client(token='<API TOKEN>', endpoint='https://<YOUR ENDPOINT>/api/v2/')

# Initialization with a config file in the same directory as this notebook
# dr.Client(config_path='drconfig.yaml')

# Initialization with config file located at ~/.config/datarobot/dr.config.yaml
dr.Client()
[5]:
<datarobot.rest.RESTClientObject at 0x114014510>

Starting a Project

[6]:
project = dr.Project.start(logan_2013_modeling,
                           project_name='Airline Delays - was_delayed',
                           target="was_delayed")
print('Project ID: {}'.format(project.id))
Project ID: 5c0012ca6523cd0200c4a017

Jobs and the Project Queue

You can view the project in your browser:

[7]:
#  If running notebook remotely
project.open_leaderboard_browser()
[7]:
True
[8]:
#  Set worker count higher.
#  Passing -1 sets it to the maximum available to your account.
project.set_worker_count(-1)
[8]:
Project(Airline Delays - was_delayed)
[9]:
project.pause_autopilot()
[9]:
True
[10]:
#  More jobs will go in the queue in each stage of autopilot.
#  This gets the currently inprogress and queued jobs
project.get_model_jobs()
[10]:
[ModelJob(Logistic Regression, status=inprogress),
 ModelJob(Regularized Logistic Regression (L2), status=inprogress),
 ModelJob(Auto-tuned K-Nearest Neighbors Classifier (Euclidean Distance), status=inprogress),
 ModelJob(Majority Class Classifier, status=inprogress),
 ModelJob(Gradient Boosted Trees Classifier, status=inprogress),
 ModelJob(Breiman and Cutler Random Forest Classifier, status=inprogress),
 ModelJob(RuleFit Classifier, status=inprogress),
 ModelJob(Regularized Logistic Regression (L2), status=inprogress),
 ModelJob(TensorFlow Neural Network Classifier, status=inprogress),
 ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping, status=inprogress),
 ModelJob(Elastic-Net Classifier (L2 / Binomial Deviance), status=inprogress),
 ModelJob(Nystroem Kernel SVM Classifier, status=inprogress),
 ModelJob(RandomForest Classifier (Gini), status=inprogress),
 ModelJob(Vowpal Wabbit Classifier, status=inprogress),
 ModelJob(Generalized Additive2 Model, status=inprogress),
 ModelJob(Light Gradient Boosted Trees Classifier with Early Stopping, status=queue),
 ModelJob(Light Gradient Boosting on ElasticNet Predictions , status=queue),
 ModelJob(Regularized Logistic Regression (L2), status=queue),
 ModelJob(Elastic-Net Classifier (L2 / Binomial Deviance) with Binned numeric features, status=queue),
 ModelJob(Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance), status=queue),
 ModelJob(RandomForest Classifier (Entropy), status=queue),
 ModelJob(ExtraTrees Classifier (Gini), status=queue),
 ModelJob(Gradient Boosted Trees Classifier with Early Stopping, status=queue),
 ModelJob(Gradient Boosted Greedy Trees Classifier with Early Stopping, status=queue),
 ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping, status=queue),
 ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features, status=queue),
 ModelJob(Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) with Unsupervised Learning Features, status=queue),
 ModelJob(Auto-tuned K-Nearest Neighbors Classifier (Euclidean Distance), status=queue),
 ModelJob(Eureqa Generalized Additive Model Classifier (3645 Generations), status=inprogress),
 ModelJob(Naive Bayes combiner classifier, status=inprogress),
 ModelJob(RandomForest Classifier (Gini), status=inprogress),
 ModelJob(Gradient Boosted Trees Classifier, status=inprogress),
 ModelJob(Decision Tree Classifier (Gini), status=inprogress)]
[11]:
project.unpause_autopilot()
[11]:
True

Features

[12]:
features = project.get_features()
features
[12]:
[Feature(did_rain),
 Feature(Destination Airport),
 Feature(Carrier Code),
 Feature(Flight Number),
 Feature(Tail Number),
 Feature(day_of_week),
 Feature(month),
 Feature(Scheduled Departure Time),
 Feature(daily_rainfall),
 Feature(was_delayed),
 Feature(Scheduled Departure Time (Hour of Day))]
[13]:
pd.DataFrame([f.__dict__ for f in features])
[13]:
date_format feature_type id importance low_information max mean median min na_count name project_id std_dev target_leakage time_series_eligibility_reason time_series_eligible time_step time_unit unique_count
0 None Boolean 2 0.029045 False 1 0.36 0 0 0 did_rain 5c0012ca6523cd0200c4a017 0.48 FALSE notADate False None None 2
1 None Categorical 6 0.003714 True None None None None 0 Destination Airport 5c0012ca6523cd0200c4a017 None FALSE notADate False None None 5
2 None Categorical 3 NaN True None None None None 0 Carrier Code 5c0012ca6523cd0200c4a017 None SKIPPED_DETECTION notADate False None None 1
3 None Numeric 4 0.005900 False 2165 1705.63 2021 67 0 Flight Number 5c0012ca6523cd0200c4a017 566.67 FALSE notADate False None None 329
4 None Categorical 5 -0.004512 True None None None None 0 Tail Number 5c0012ca6523cd0200c4a017 None FALSE notADate False None None 296
5 None Categorical 8 0.003452 True None None None None 0 day_of_week 5c0012ca6523cd0200c4a017 None FALSE notADate False None None 7
6 None Numeric 9 0.003043 True 12 6.47 6 1 0 month 5c0012ca6523cd0200c4a017 3.38 FALSE notADate False None None 12
7 %H:%M Time 7 0.058245 False 21:30 12:26 12:00 05:00 0 Scheduled Departure Time 5c0012ca6523cd0200c4a017 0.19 days FALSE notADate False None None 77
8 None Numeric 1 0.034295 False 3.07 0.12 0 0 0 daily_rainfall 5c0012ca6523cd0200c4a017 0.33 FALSE notADate False None None 58
9 None Boolean 0 1.000000 False 1 0.098 0 0 0 was_delayed 5c0012ca6523cd0200c4a017 0.3 SKIPPED_DETECTION notADate False None None 2
10 None Categorical 10 0.053047 False None None None None 0 Scheduled Departure Time (Hour of Day) 5c0012ca6523cd0200c4a017 None FALSE notADate False None None 17

Three feature lists are automatically created:

  • Raw Features: one for all features
  • Informative Features: one based on features with any information (columns with no variation are excluded)
  • Univariate Importance: one based on univariate importance (this is only created after the target is set)

Informative Features is the one used by default in autopilot.

[14]:
feature_lists = project.get_featurelists()
feature_lists
[14]:
[Featurelist(Raw Features),
 Featurelist(Informative Features),
 Featurelist(Univariate Selections)]
[15]:
# create a featurelist without the rain features
# (since they leak future information)
informative_feats = [lst for lst in feature_lists if
                     lst.name == 'Informative Features'][0]
no_foreknowledge_features = list(
    set(informative_feats.features) - {'daily_rainfall', 'did_rain'})
[16]:
no_foreknowledge = project.create_featurelist('no foreknowledge',
                                              no_foreknowledge_features)
no_foreknowledge
[16]:
Featurelist(no foreknowledge)
[17]:
project.get_status()
[17]:
{u'autopilot_done': False,
 u'stage': u'modeling',
 u'stage_description': u'Ready for modeling'}
[18]:
# This waits until autopilot is complete:
project.wait_for_autopilot(check_interval=90)
In progress: 20, queued: 13 (waited: 0s)
In progress: 20, queued: 13 (waited: 1s)
In progress: 19, queued: 13 (waited: 1s)
In progress: 20, queued: 12 (waited: 2s)
In progress: 20, queued: 12 (waited: 4s)
In progress: 20, queued: 12 (waited: 6s)
In progress: 20, queued: 12 (waited: 10s)
In progress: 19, queued: 2 (waited: 17s)
In progress: 10, queued: 0 (waited: 30s)
In progress: 2, queued: 0 (waited: 56s)
In progress: 4, queued: 0 (waited: 108s)
In progress: 1, queued: 0 (waited: 198s)
In progress: 13, queued: 0 (waited: 289s)
In progress: 0, queued: 0 (waited: 379s)
In progress: 5, queued: 0 (waited: 470s)
In progress: 4, queued: 0 (waited: 560s)
In progress: 0, queued: 0 (waited: 651s)
[19]:
project.start_autopilot(no_foreknowledge.id)
[20]:
project.wait_for_autopilot(check_interval=90)
In progress: 0, queued: 0 (waited: 0s)
In progress: 0, queued: 0 (waited: 0s)
In progress: 0, queued: 0 (waited: 1s)
In progress: 0, queued: 0 (waited: 1s)
In progress: 0, queued: 0 (waited: 3s)
In progress: 0, queued: 0 (waited: 4s)
In progress: 0, queued: 1 (waited: 8s)
In progress: 20, queued: 13 (waited: 15s)
In progress: 20, queued: 1 (waited: 28s)
In progress: 3, queued: 0 (waited: 54s)
In progress: 16, queued: 0 (waited: 106s)
In progress: 20, queued: 12 (waited: 196s)
In progress: 0, queued: 0 (waited: 287s)

Models

[21]:
models = project.get_models()
example_model = models[0]
example_model
[21]:
Model(u'eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features')

Models represent fitted models and have various data about the model, including metrics:

[22]:
example_model.metrics
[22]:
{u'AUC': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.755494,
  u'holdout': 0.76509,
  u'validation': 0.75702},
 u'FVE Binomial': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.14855,
  u'holdout': 0.14992,
  u'validation': 0.15364},
 u'Gini Norm': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.510988,
  u'holdout': 0.53018,
  u'validation': 0.51404},
 u'Kolmogorov-Smirnov': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.398738,
  u'holdout': 0.42279,
  u'validation': 0.40472},
 u'LogLoss': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.272296,
  u'holdout': 0.27178,
  u'validation': 0.27079},
 u'RMSE': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.27529400000000004,
  u'holdout': 0.27627,
  u'validation': 0.27448},
 u'Rate@Top10%': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.379522,
  u'holdout': 0.35792,
  u'validation': 0.38908},
 u'Rate@Top5%': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.489794,
  u'holdout': 0.45902,
  u'validation': 0.5034},
 u'Rate@TopTenth%': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.8000019999999999,
  u'holdout': 0.75,
  u'validation': 0.66667}}
[23]:
def sorted_by_log_loss(models, test_set):
    models_with_score = [model for model in models if
                         model.metrics['LogLoss'][test_set] is not None]
    return sorted(models_with_score,
                  key=lambda model: model.metrics['LogLoss'][test_set])

Let’s choose the models (from each feature set, to compare the scores) with the best LogLoss score from those with the rain and those without:

[24]:
models = project.get_models()
fair_models = [mod for mod in models if
               mod.featurelist_id == no_foreknowledge.id]
rain_cheat_models = [mod for mod in models if
                     mod.featurelist_id == informative_feats.id]
[25]:
models[0].metrics['LogLoss']

[25]:
{u'backtesting': None,
 u'backtestingScores': None,
 u'crossValidation': 0.272296,
 u'holdout': 0.27178,
 u'validation': 0.27079}
[26]:
best_fair_model = sorted_by_log_loss(fair_models, 'crossValidation')[0]
best_cheat_model = sorted_by_log_loss(rain_cheat_models, 'crossValidation')[0]
best_fair_model.metrics, best_cheat_model.metrics
[26]:
({u'AUC': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.7132720000000001,
   u'holdout': None,
   u'validation': 0.71811},
  u'FVE Binomial': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.089814,
   u'holdout': None,
   u'validation': 0.09341},
  u'Gini Norm': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.426544,
   u'holdout': None,
   u'validation': 0.43622},
  u'Kolmogorov-Smirnov': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.322424,
   u'holdout': None,
   u'validation': 0.31053},
  u'LogLoss': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.291076,
   u'holdout': None,
   u'validation': 0.29006},
  u'RMSE': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.285848,
   u'holdout': None,
   u'validation': 0.28579},
  u'Rate@Top10%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.294882,
   u'holdout': None,
   u'validation': 0.29352},
  u'Rate@Top5%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.36734799999999995,
   u'holdout': None,
   u'validation': 0.39456},
  u'Rate@TopTenth%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.600002,
   u'holdout': None,
   u'validation': 0.66667}},
 {u'AUC': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.7604420000000001,
   u'holdout': None,
   u'validation': 0.75549},
  u'FVE Binomial': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.15306999999999998,
   u'holdout': None,
   u'validation': 0.15124},
  u'Gini Norm': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.520884,
   u'holdout': None,
   u'validation': 0.51098},
  u'Kolmogorov-Smirnov': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.406068,
   u'holdout': None,
   u'validation': 0.39472},
  u'LogLoss': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.270848,
   u'holdout': None,
   u'validation': 0.27156},
  u'RMSE': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.274772,
   u'holdout': None,
   u'validation': 0.27497},
  u'Rate@Top10%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.38498399999999994,
   u'holdout': None,
   u'validation': 0.38908},
  u'Rate@Top5%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.504762,
   u'holdout': None,
   u'validation': 0.5034},
  u'Rate@TopTenth%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.933334,
   u'holdout': None,
   u'validation': 1.0}})

Visualizing Models

This is a good time to use Feature Fit and Feature Effects (not yet available via the API) to visualize the models:

[27]:
best_fair_model.open_model_browser()
[27]:
True
[28]:
best_cheat_model.open_model_browser()
[28]:
True

Unlocking the Holdout

To maintain holdout scores as a valid estimate of out-of-sample error, we recommend not looking at them until late in the project. For this reason, holdout scores are locked until you unlock them.

[29]:
project.unlock_holdout()
[29]:
Project(Airline Delays - was_delayed)
[30]:
best_fair_model = dr.Model.get(project.id, best_fair_model.id)
best_cheat_model = dr.Model.get(project.id, best_cheat_model.id)
[31]:
best_fair_model.metrics['LogLoss'], best_cheat_model.metrics['LogLoss']
[31]:
({u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.291076,
  u'holdout': 0.29408,
  u'validation': 0.29006},
 {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.270848,
  u'holdout': 0.27193,
  u'validation': 0.27156})

Retrain on 100%

When ready to use the final model, you will probably get the best performance by retraining on 100% of the data.

[32]:
model_job_fair_100pct_id = best_fair_model.train(sample_pct=100)
model_job_fair_100pct_id
[32]:
'211'

Wait for the model to complete:

[33]:
model_fair_100pct = dr.models.modeljob.wait_for_async_model_creation(
    project.id, model_job_fair_100pct_id)
model_fair_100pct.id
[33]:
u'5c0016b76523cd026cc49f99'

Predictions

Let’s make predictions for some new data. This new data will need to have the same transformations applied as we applied to the training data.

[34]:
logan_2014 = pd.read_csv("logan-US-2014.csv")
logan_2014_modeling = prepare_modeling_dataset(logan_2014)
logan_2014_modeling.head()
[34]:
was_delayed daily_rainfall did_rain Carrier Code Flight Number Tail Number Destination Airport Scheduled Departure Time day_of_week month
0 False 0.0 False US 450 N809AW PHX 10:00 Sat 2
1 False 0.0 False US 553 N814AW PHL 07:00 Sat 2
2 False 0.0 False US 582 N820AW PHX 06:10 Sat 2
3 False 0.0 False US 601 N678AW PHX 16:20 Sat 2
4 False 0.0 False US 657 N662AW CLT 09:45 Sat 2
[35]:
prediction_dataset = project.upload_dataset(logan_2014_modeling)
predict_job = model_fair_100pct.request_predictions(prediction_dataset.id)
prediction_dataset.id
[35]:
u'5c0016cf6523cd0018c4a0d3'
[36]:
predictions = predict_job.get_result_when_complete()
[37]:
pd.concat([logan_2014_modeling, predictions], axis=1).head()
[37]:
was_delayed daily_rainfall did_rain Carrier Code Flight Number Tail Number Destination Airport Scheduled Departure Time day_of_week month positive_probability prediction prediction_threshold row_id class_0.0 class_1.0
0 False 0.0 False US 450 N809AW PHX 10:00 Sat 2 0.055054 0.0 0.5 0 0.944946 0.055054
1 False 0.0 False US 553 N814AW PHL 07:00 Sat 2 0.045004 0.0 0.5 1 0.954996 0.045004
2 False 0.0 False US 582 N820AW PHX 06:10 Sat 2 0.030196 0.0 0.5 2 0.969804 0.030196
3 False 0.0 False US 601 N678AW PHX 16:20 Sat 2 0.201461 0.0 0.5 3 0.798539 0.201461
4 False 0.0 False US 657 N662AW CLT 09:45 Sat 2 0.072447 0.0 0.5 4 0.927553 0.072447

Let’s have a look at our results. Since this is a binary classification problem, as the positive_probability approaches zero this row is a stronger candidate for the negative class (the flight will leave on-time), while as it approaches one, the outcome is more likely to be of the positive class (the flight will be delayed). From the KDE (Kernel Density Estimate) plot below, we can see that this sample of the data is weighted stronger to the negative class.

[38]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
[39]:
matplotlib.rcParams['figure.figsize'] = (15, 10)  # make charts bigger
[40]:
sns.set(color_codes=True)
sns.kdeplot(predictions.positive_probability, shade=True, cut=0,
            label='Positive Probability')
plt.xlim((0, 1))
plt.ylim((0, None))
plt.xlabel('Probability of Event')
plt.ylabel('Probability Density')
plt.title('Prediction Distribution')
[40]:
Text(0.5,1,'Prediction Distribution')
../../_images/examples_airline_ontime_example_Modeling_Airline_Delay_57_1.png

Exploring Prediction Explanations

Computing prediction explanations is a resource-intensive task, but you can help reduce the runtime for computing them by setting prediction value thresholds. You can learn more about prediction explanations by searching the online documentation available in the DataRobot web interface.

When are they useful?

A common question when evaluating data is “why is a certain data-point considered high-risk (or low-risk) for a certain event”?

A sample case for prediction explanations:

Clark is a business analyst at a large manufacturing firm. She does not have a lot of data science expertise, but has been using DataRobot with great success to predict likely product failures at her manufacturing plant. Her manager is now asking for recommendations for reducing the defect rate, based on these predictions. Clark would like DataRobot to produce prediction explanations for the expected product failures so that she can identify the key drivers of product failures based on a higher-level aggregation of explanations. Her business team can then use this report to address the causes of failure.

Other common use cases and possible explanations include:

  • What are indicators that a transaction could be at high risk for fraud? Possible explanations include transactions out of a cardholder’s home area, transactions out of their “normal usage” time range, and transactions that are too large or small.
  • What are some explanations for setting a higher auto insurance price? The applicant is single, male, age under 30 years, and has received traffic tickets. A married homeowner may receive a lower rate.

Preparation

We are almost ready to compute prediction explanations. Prediction explanations require two prerequisites to be performed first; however, these commands only need to be run once per model.

Feature Impact

A prerequisite to computing prediction explanations is that you need to compute the feature impact for your model (this only needs to be done once per model):

[41]:
%%time
feature_impacts = model_fair_100pct.get_or_request_feature_impact()
CPU times: user 25.4 ms, sys: 5.09 ms, total: 30.5 ms
Wall time: 11.3 s
Prediction Explanations Initialization

After Feature Impact has been computed, you also must create a Prediction Explanations Initialization for your model:

[42]:
%%time
try:
    # Test to see if they are already computed
    dr.PredictionExplanationsInitialization.get(project.id,
                                                model_fair_100pct.id)
except dr.errors.ClientError as e:
    assert e.status_code == 404  # haven't been computed
    init_job = dr.PredictionExplanationsInitialization.create(
        project.id,
        model_fair_100pct.id
    )
    init_job.wait_for_completion()
CPU times: user 24.9 ms, sys: 5.16 ms, total: 30 ms
Wall time: 11 s

Computing the prediction explanations

Now that we have computed the feature impact and initialized the prediction explanations, and also uploaded a dataset and computed predictions on it, we are ready to make a request to compute the prediction explanations for every row of the dataset. Computing prediction explanations supports a couple of parameters:

  • max_explanations are the maximum number of prediction explanations to compute for each row.
  • threshold_low and threshold_high are thresholds for the value of the prediction of the row. Prediction explanations will be computed for a row if the row’s prediction value is higher than threshold_high or lower than threshold_low. If no thresholds are specified, prediction explanations will be computed for all rows.

Note: for binary classification projects (like this one), the thresholds correspond to the positive_probability prediction value whereas for regression problems, it corresponds to the actual predicted value.

Since we’ve already examined our prediction distribution from above, let’s use that to influence what we set for our thresholds. It looks like most flights depart on-time so let’s just examine the explanations for flights that have an above normal probability for being delayed. We will use a threshold_high of 0.456 which means for all rows where the predicted positive_probability is at least 0.456 we will compute the prediction explanations for that row. For the simplicity of this tutorial, we will also limit DataRobot to only compute 5 explanations for us.

[43]:
%%time
number_of_explanations = 5
pe_job = dr.PredictionExplanations.create(
    project.id,
    model_fair_100pct.id,
    prediction_dataset.id,
    max_explanations=number_of_explanations,
    threshold_low=None,
    threshold_high=0.456
)
pe = pe_job.get_result_when_complete()
all_rows = pe.get_all_as_dataframe()
CPU times: user 4.1 s, sys: 131 ms, total: 4.23 s
Wall time: 22.4 s

Let’s cleanup the DataFrame we got back by trimming it down to just the interesting columns. Also, since most rows will have prediction values outside our thresholds, let’s drop all the uninteresting rows (i.e. ones with null values).

[44]:
import pandas as pd
pd.options.display.max_rows = 10  # default display is too verbose

# These rows are all redundant or of little value for this example
redundant_cols = ['row_id', 'class_0_label', 'class_1_probability',
                  'class_1_label']
explanations = all_rows.drop(redundant_cols, axis=1)
explanations.drop(['explanation_{}_label'.format(i)
                   for i in range(number_of_explanations)],
                  axis=1, inplace=True)

# These are rows that didn't meet our thresholds
explanations.dropna(inplace=True)

# Rename columns to be more consistent with the terms we have been using
explanations.rename(index=str,
                    columns={'class_0_probability': 'positive_probability'},
                    inplace=True)
explanations
[44]:
prediction positive_probability explanation_0_feature explanation_0_feature_value explanation_0_qualitative_strength explanation_0_strength explanation_1_feature explanation_1_feature_value explanation_1_qualitative_strength explanation_1_strength ... explanation_2_qualitative_strength explanation_2_strength explanation_3_feature explanation_3_feature_value explanation_3_qualitative_strength explanation_3_strength explanation_4_feature explanation_4_feature_value explanation_4_qualitative_strength explanation_4_strength
39 0.0 0.471055 Scheduled Departure Time -2.208920e+09 +++ 1.072288 day_of_week Sun ++ 0.455652 ... ++ 0.362867 Destination Airport CLT ++ 0.345914 Tail Number N537UW ++ 0.242375
392 0.0 0.478501 Scheduled Departure Time -2.208920e+09 +++ 1.072288 day_of_week Sun ++ 0.455652 ... ++ 0.362867 Destination Airport CLT ++ 0.345914 Tail Number N536UW ++ 0.272234
13043 0.0 0.465055 Scheduled Departure Time -2.208920e+09 +++ 1.202299 Tail Number N194UW ++ 0.416944 ... ++ 0.391831 day_of_week Sun ++ 0.286239 month 12 ++ 0.273073
13259 0.0 0.463182 Scheduled Departure Time -2.208920e+09 +++ 1.141272 Destination Airport CLT ++ 0.391831 ... ++ 0.373726 Tail Number N563UW ++ 0.321922 month 12 ++ 0.256552
13843 0.0 0.498733 Scheduled Departure Time -2.208920e+09 +++ 1.270218 Flight Number 586 ++ 0.440506 ... ++ 0.355779 Tail Number N647AW ++ 0.241246 day_of_week Thurs ++ 0.224909
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18015 0.0 0.497778 Scheduled Departure Time -2.208920e+09 +++ 1.565999 month 7 ++ 0.809545 ... ++ 0.347827 Tail Number N534UW ++ 0.247029 day_of_week Thurs + 0.224909
18165 0.0 0.466710 Scheduled Departure Time -2.208920e+09 +++ 1.368628 month 7 ++ 0.368182 ... ++ 0.347827 Tail Number N173US ++ 0.314294 Flight Number 800 + 0.093169
18382 0.0 0.481914 Scheduled Departure Time -2.208920e+09 +++ 1.281047 Flight Number 586 ++ 0.440506 ... ++ 0.396207 day_of_week Thurs ++ 0.224909 Tail Number N660AW + 0.164530
18392 1.0 0.506051 Scheduled Departure Time -2.208920e+09 +++ 1.334738 month 7 ++ 0.424888 ... ++ 0.347827 Tail Number N170US ++ 0.280126 day_of_week Thurs ++ 0.224909
18406 1.0 0.511845 Scheduled Departure Time -2.208927e+09 +++ 1.357411 month 7 ++ 0.855629 ... ++ 0.676216 Scheduled Departure Time (Hour of Day) 17 ++ 0.455910 Destination Airport CLT ++ 0.344885

24 rows × 22 columns

Explore Prediction Explanations results

Now let’s see how often various features are showing up as the top explanation for impacting the probability of a flight being delayed.

[45]:
from functools import reduce

# Create a combined histogram of all our explanations
explanations_hist = reduce(
    lambda x, y: x.add(y, fill_value=0),
    (explanations['explanation_{}_feature'.format(i)].value_counts()
     for i in range(number_of_explanations)))
[46]:
explanations_hist.plot.bar()
plt.xticks(rotation=45, ha='right')
[46]:
(array([0, 1, 2, 3, 4, 5, 6]), <a list of 7 Text xticklabel objects>)
../../_images/examples_airline_ontime_example_Modeling_Airline_Delay_70_1.png

Knowing the feature impact for this model from the Diving Deeper notebook, the high occurrence of the daily_rainfall and Scheduled Departure Time as prediction explanations is not entirely surprising because these were some of the top ranked features in the impact chart. Therefore, let’s take a small detour investigating some of the rows that had less expected explanations.


Below is some helper code. It can largely be ignored as it is mostly relevant for this exercise and not needed for a general understanding of the DataRobot APIs

[47]:
from operator import or_
from functools import reduce
from itertools import chain


def find_rows_with_explanation(df, feature_name, nexpls):
    """
    Given a prediction explanations DataFrame, return a slice
    of that data where the top N explanations match the given feature
    """
    all_expl_columns = (df['explanation_{}_feature'.format(i)] == feature_name
                        for i in range(nexpls))
    df_filter = reduce(or_, all_expl_columns)
    return favorite_expl_columns(df[df_filter], nexpls)


def favorite_expl_columns(df, nexpls):
    """
    Only display the most useful rows of a prediction explanations DataFrame.
    """
    # Use chain to flatten our list of tuples
    columns = list(chain.from_iterable((
        'explanation_{}_feature'.format(i),
        'explanation_{}_feature_value'.format(i),
        'explanation_{}_strength'.format(i))
        for i in range(nexpls)))
    return df[columns]


def find_feature_in_row(feature, row, nexpls):
    """
    Return the value of a given feature
    """
    for i in range(nexpls):
        if row['explanation_{}_feature'.format(i)] == feature:
            return row['explanation_{}_feature_value'.format(i)]


def collect_feature_values(df, feature, nexpls):
    """
    Return a list of all values of a given prediction explanation
    from a DataFrame
    """
    return [find_feature_in_row(feature, row, nexpls)
            for index, row in df.iterrows()]

Investigation: Destination Airport

It looks like there was a small number of rows where the Destination Airport was one of the top N explanations for a given prediction

[48]:
feature_name = 'Destination Airport'
flight_nums = find_rows_with_explanation(explanations,
                                         feature_name,
                                         number_of_explanations)
flight_nums
[48]:
explanation_0_feature explanation_0_feature_value explanation_0_strength explanation_1_feature explanation_1_feature_value explanation_1_strength explanation_2_feature explanation_2_feature_value explanation_2_strength explanation_3_feature explanation_3_feature_value explanation_3_strength explanation_4_feature explanation_4_feature_value explanation_4_strength
39 Scheduled Departure Time -2.208920e+09 1.072288 day_of_week Sun 0.455652 month 2 0.362867 Destination Airport CLT 0.345914 Tail Number N537UW 0.242375
392 Scheduled Departure Time -2.208920e+09 1.072288 day_of_week Sun 0.455652 month 2 0.362867 Destination Airport CLT 0.345914 Tail Number N536UW 0.272234
13043 Scheduled Departure Time -2.208920e+09 1.202299 Tail Number N194UW 0.416944 Destination Airport CLT 0.391831 day_of_week Sun 0.286239 month 12 0.273073
13259 Scheduled Departure Time -2.208920e+09 1.141272 Destination Airport CLT 0.391831 day_of_week Thurs 0.373726 Tail Number N563UW 0.321922 month 12 0.256552
14226 Scheduled Departure Time -2.208920e+09 1.339540 month 6 0.401657 Destination Airport CLT 0.347827 day_of_week Thurs 0.224909 Tail Number N190UW 0.147016
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
17638 Scheduled Departure Time -2.208920e+09 1.340564 month 7 0.411066 Destination Airport CLT 0.347827 day_of_week Thurs 0.224909 Flight Number 800 0.120877
18015 Scheduled Departure Time -2.208920e+09 1.565999 month 7 0.809545 Destination Airport CLT 0.347827 Tail Number N534UW 0.247029 day_of_week Thurs 0.224909
18165 Scheduled Departure Time -2.208920e+09 1.368628 month 7 0.368182 Destination Airport CLT 0.347827 Tail Number N173US 0.314294 Flight Number 800 0.093169
18392 Scheduled Departure Time -2.208920e+09 1.334738 month 7 0.424888 Destination Airport CLT 0.347827 Tail Number N170US 0.280126 day_of_week Thurs 0.224909
18406 Scheduled Departure Time -2.208927e+09 1.357411 month 7 0.855629 Tail Number N818AW 0.676216 Scheduled Departure Time (Hour of Day) 17 0.455910 Destination Airport CLT 0.344885

14 rows × 15 columns

[49]:
all_flights = collect_feature_values(flight_nums,
                                     feature_name,
                                     number_of_explanations)
pd.DataFrame(all_flights)[0].value_counts().plot.bar()
plt.xticks(rotation=0)
[49]:
(array([0]), <a list of 1 Text xticklabel objects>)
../../_images/examples_airline_ontime_example_Modeling_Airline_Delay_76_1.png

Many a frequent flier will tell you horror stories about flying in and out of certain airports. While any given prediction explanation can have a positive or a negative impact to a prediction (this is indicated by both the strength and qualitative_strength columns), due to the thresholds we configured earlier for this tutorial it is likely that the above airports are causing flight delays.


Investigation: Scheduled Departure Time

DataRobot correctly identified the Scheduled Departure Time input as a timestamp but in the prediction explanation output, we are seeing the internal representation of the time value as a Unix epoch value so let’s put it back into a format that humans can understand better:

[50]:
# For simplicity, let's just look at rows where `Scheduled Departure Time`
# was the first/top explanation.
feature_name = 'Scheduled Departure Time'
bad_times = explanations[explanations.explanation_0_feature == feature_name]

# Now let's convert the epoch to a datetime
pd.to_datetime(bad_times.explanation_0_feature_value, unit='s')
[50]:
39      1900-01-01 19:15:00
392     1900-01-01 19:15:00
13043   1900-01-01 19:10:00
13259   1900-01-01 19:10:00
13843   1900-01-01 19:15:00
                ...
18015   1900-01-01 19:10:00
18165   1900-01-01 19:10:00
18382   1900-01-01 19:15:00
18392   1900-01-01 19:10:00
18406   1900-01-01 17:05:00
Name: explanation_0_feature_value, Length: 24, dtype: datetime64[ns]

We can see that it appears as though all departures occurred on Jan. 1st, 1900. This is because the original value was simply a timestamp so only the time portion of the epoch is meaningful. We will clean this up in our graph below:

[51]:
from matplotlib.ticker import FuncFormatter
from time import gmtime, strftime

scale_factor = 9  # make the difference in strengths more visible

depart = explanations[explanations.explanation_0_feature == feature_name]
true_only = depart[depart.prediction == 1]
false_only = depart[depart.prediction == 0]
plt.scatter(x=true_only.explanation_0_feature_value,
            y=true_only.positive_probability,
            c='green',
            s=true_only.explanation_0_strength ** scale_factor,
            label='Will be delayed')
plt.scatter(x=false_only.explanation_0_feature_value,
            y=false_only.positive_probability,
            c='purple',
            s=false_only.explanation_0_strength ** scale_factor,
            label='Will not')

# Convert the Epoch values into human time stamps
formatter = FuncFormatter(lambda x, pos: strftime('%H:%M', gmtime(x)))
plt.gca().xaxis.set_major_formatter(formatter)

plt.xlabel('Scheduled Departure Time')
plt.ylabel('Positive Probability')
plt.legend(markerscale=.5, frameon=True, facecolor="white")
plt.title("Relationship of Depart Time and being delayed")
[51]:
Text(0.5,1,'Relationship of Depart Time and being delayed')
../../_images/examples_airline_ontime_example_Modeling_Airline_Delay_81_1.png

The above plot shows each prediction where the top influencer of the prediction was the Scheduled Departure Time. It’s plotted against the positive_probability on the Y-axis and the size of the plots represent the strength that departure time had on the prediction (relative to the other features of that given data point). Finally to aid visually, the positive vs. negative outcomes are colored.

As we can see by the time-scale on the X-axis, it doesn’t represent the full 24 hours; this is telling. Since we filtered our data earlier to only show predictions that were leaning towards being delayed, and this chart is leaning towards times in the afternoon and evening there may be a correlation between later scheduled departure time and a higher probability of being delayed. With a little bit of domain knowledge, one will understand that an airplane and its crew make many flights in a day (typically hopping between cities) so small delays in the morning compound into the evening hours.