Modeling Airline Delay

Overview

Statistics on whether a flight was delayed and for how long are available from government databases for all the major carriers. It would be useful to be able to predict before scheduling a flight whether or not it was likely to be delayed. In this example, DataRobot will try to model whether a flight will be delayed, based on information such as the scheduled departure time and whether rained the day of the flight.

Set Up

This example assumes that the DataRobot Python client package has been installed and configured with the credentials of a DataRobot user with API access permissions.

Data Sources

Information on flights and flight delays is made available by the Bureau of Transportation Statistics at https://www.transtats.bts.gov/ONTIME/Departures.aspx. To narrow down the amount of data involved, the datasets assembled for this use case are limited to US Airways flights out of Boston Logan in 2013 and 2014, although the script for interacting with DataRobot is sufficiently general that any dataset with the correct format could be used. A flight was declared to be delayed if it ultimately took off at least fifteen minutes after its scheduled departure time.

In additional to flight information, each record in the prepared dataset notes the amount of rain and whether it rained on the day of the flight. This information came from the National Oceanic and Atmospheric Administration’s Quality Controlled Local Climatological Data, available at http://www.ncdc.noaa.gov/qclcd/QCLCD. By looking at the recorded daily summaries of the water equivalent precipitation at the Boston Logan station, the daily rainfall for each day in 2013 and 2014 was measured. For some days, the QCLCD reports trace amounts of rainfall, which was recorded as 0 inches of rain.

Dataset Structure

Each row in the assembled dataset contains the following columns

  • was_delayed
    • boolean
    • whether the flight was delayed
  • daily_rainfall
    • float
    • the amount of rain, in inches, on the day of the flight
  • did_rain
    • bool
    • whether it rained on the day of the flight
  • Carrier Code
    • str
    • the carrier code of the airline - US for all entries in assembled dataset
  • Date
    • str (MM/DD/YYYY format)
    • the date of the flight
  • Flight Number
    • str
    • the flight number for the flight
  • Tail Number
    • str
    • the tail number of the aircraft
  • Destination Airport
    • str
    • the three-letter airport code of the destination airport
  • Scheduled Departure Time
    • str
    • the 24-hour scheduled departure time of the flight, in the origin airport’s timezone
In [1]:
import pandas as pd
import datarobot as dr
In [2]:
data_path = "logan-US-2013.csv"
logan_2013 = pd.read_csv(data_path)
logan_2013.head()
Out[2]:
was_delayed daily_rainfall did_rain Carrier Code Date (MM/DD/YYYY) Flight Number Tail Number Destination Airport Scheduled Departure Time
0 False 0.0 False US 02/01/2013 225 N662AW PHX 16:20
1 False 0.0 False US 02/01/2013 280 N822AW PHX 06:00
2 False 0.0 False US 02/01/2013 303 N653AW CLT 09:35
3 True 0.0 False US 02/01/2013 604 N640AW PHX 09:55
4 False 0.0 False US 02/01/2013 722 N715UW PHL 18:30

We want to be able to make predictions for future data, so the “date” column should be transformed in a way that avoids values that won’t be populated for future data:

In [3]:
def prepare_modeling_dataset(df):
    date_column_name = 'Date (MM/DD/YYYY)'
    date = pd.to_datetime(df[date_column_name])
    modeling_df = df.drop(date_column_name, axis=1)
    days = {0: 'Mon', 1: 'Tues', 2: 'Weds', 3: 'Thurs', 4: 'Fri', 5: 'Sat',
            6: 'Sun'}
    modeling_df['day_of_week'] = date.apply(lambda x: days[x.dayofweek])
    modeling_df['month'] = date.dt.month
    return modeling_df
In [4]:
logan_2013_modeling = prepare_modeling_dataset(logan_2013)
logan_2013_modeling.head()
Out[4]:
was_delayed daily_rainfall did_rain Carrier Code Flight Number Tail Number Destination Airport Scheduled Departure Time day_of_week month
0 False 0.0 False US 225 N662AW PHX 16:20 Fri 2
1 False 0.0 False US 280 N822AW PHX 06:00 Fri 2
2 False 0.0 False US 303 N653AW CLT 09:35 Fri 2
3 True 0.0 False US 604 N640AW PHX 09:55 Fri 2
4 False 0.0 False US 722 N715UW PHL 18:30 Fri 2

DataRobot Modeling

As part of this use case, in model_flight_ontime.py, a DataRobot project will be created and used to run a variety of models against the assembled datasets. By default, DataRobot will run autopilot on the automatically generated Informative Features list, which excludes certain pathological features (like Carrier Code in this example, which is always the same value), and we will also create a custom feature list excluding the amount of rainfall on the day of the flight.

This notebook shows how to use the Python API client to create a project, create feature lists, train models with different sample percents and feature lists, and view the models that have been run. It will:

  • create a project
  • create a new feature list (no foreknowledge) excluding the rainfall features
  • set the target to was_delayed, and run DataRobot autopilot on the Informative Features list
  • rerun autopilot on a new feature list
  • make predictions on a new data set

Starting a Project

In [5]:
project = dr.Project.start(logan_2013_modeling,
                           project_name='Airline Delays - was_delayed',
                           target="was_delayed")
project.id
Out[5]:
u'5b2b0fbd96a0250378fc36a5'

Jobs and the Project Queue

You can view the project in your browser:

In [6]:
#  If running notebook remotely
project.open_leaderboard_browser()
Out[6]:
True
In [7]:
#  Set worker count higher. This will fail if you don't have 10 workers.
project.set_worker_count(10)
Out[7]:
Project(Airline Delays - was_delayed)
In [8]:
project.pause_autopilot()
Out[8]:
True
In [9]:
#  More jobs will go in the queue in each stage of autopilot.
#  This gets the currently inprogress and queued jobs
project.get_model_jobs()
Out[9]:
[ModelJob(Logistic Regression, status=inprogress),
 ModelJob(Regularized Logistic Regression (L2), status=inprogress),
 ModelJob(Auto-tuned K-Nearest Neighbors Classifier (Minkowski Distance), status=inprogress),
 ModelJob(Majority Class Classifier, status=inprogress),
 ModelJob(Gradient Boosted Trees Classifier, status=inprogress),
 ModelJob(Breiman and Cutler Random Forest Classifier, status=queue),
 ModelJob(RuleFit Classifier, status=queue),
 ModelJob(Regularized Logistic Regression (L2), status=queue),
 ModelJob(TensorFlow Neural Network Classifier, status=queue),
 ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping, status=queue),
 ModelJob(Elastic-Net Classifier (L2 / Binomial Deviance), status=queue),
 ModelJob(Nystroem Kernel SVM Classifier, status=queue),
 ModelJob(RandomForest Classifier (Gini), status=queue),
 ModelJob(Vowpal Wabbit Classifier, status=queue),
 ModelJob(Generalized Additive2 Model, status=queue),
 ModelJob(Light Gradient Boosted Trees Classifier with Early Stopping, status=queue),
 ModelJob(Light Gradient Boosting on ElasticNet Predictions , status=queue),
 ModelJob(Regularized Logistic Regression (L2), status=queue),
 ModelJob(Elastic-Net Classifier (L2 / Binomial Deviance) with Binned numeric features, status=queue),
 ModelJob(Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance), status=queue),
 ModelJob(RandomForest Classifier (Entropy), status=queue),
 ModelJob(ExtraTrees Classifier (Gini), status=queue),
 ModelJob(Gradient Boosted Trees Classifier with Early Stopping, status=queue),
 ModelJob(Gradient Boosted Greedy Trees Classifier with Early Stopping, status=queue),
 ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping, status=queue),
 ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features, status=queue),
 ModelJob(Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) with Unsupervised Learning Features, status=queue),
 ModelJob(Auto-tuned K-Nearest Neighbors Classifier (Minkowski Distance), status=queue),
 ModelJob(Isolation Forest Anomaly Detection, status=queue),
 ModelJob(Eureqa Generalized Additive Model Classifier (3645 Generations), status=inprogress),
 ModelJob(Naive Bayes combiner classifier, status=inprogress),
 ModelJob(RandomForest Classifier (Gini), status=inprogress),
 ModelJob(Gradient Boosted Trees Classifier, status=inprogress),
 ModelJob(Decision Tree Classifier (Gini), status=inprogress)]
In [10]:
project.unpause_autopilot()
Out[10]:
True

Features

In [11]:
features = project.get_features()
features
Out[11]:
[Feature(did_rain),
 Feature(Destination Airport),
 Feature(Carrier Code),
 Feature(Flight Number),
 Feature(Tail Number),
 Feature(day_of_week),
 Feature(month),
 Feature(Scheduled Departure Time),
 Feature(daily_rainfall),
 Feature(was_delayed),
 Feature(Scheduled Departure Time (Hour of Day))]
In [12]:
pd.DataFrame([f.__dict__ for f in features])
Out[12]:
date_format feature_type id importance low_information max mean median min na_count name project_id std_dev target_leakage time_series_eligibility_reason time_series_eligible time_step time_unit unique_count
0 None Boolean 2 0.029045 False 1 0.36 0 0 0 did_rain 5b2b0fbd96a0250378fc36a5 0.48 FALSE notADate False None None 2
1 None Categorical 6 0.003714 True None None None None 0 Destination Airport 5b2b0fbd96a0250378fc36a5 None FALSE notADate False None None 5
2 None Categorical 3 NaN True None None None None 0 Carrier Code 5b2b0fbd96a0250378fc36a5 None SKIPPED_DETECTION notADate False None None 1
3 None Numeric 4 0.005900 False 2165 1705.63 2021 67 0 Flight Number 5b2b0fbd96a0250378fc36a5 566.67 FALSE notADate False None None 329
4 None Categorical 5 -0.004512 True None None None None 0 Tail Number 5b2b0fbd96a0250378fc36a5 None FALSE notADate False None None 296
5 None Categorical 8 0.003452 True None None None None 0 day_of_week 5b2b0fbd96a0250378fc36a5 None FALSE notADate False None None 7
6 None Numeric 9 0.003043 True 12 6.47 6 1 0 month 5b2b0fbd96a0250378fc36a5 3.38 FALSE notADate False None None 12
7 %H:%M Time 7 0.058245 False 21:30 12:26 12:00 05:00 0 Scheduled Departure Time 5b2b0fbd96a0250378fc36a5 0.19 days FALSE notADate False None None 77
8 None Numeric 1 0.034295 False 3.07 0.12 0 0 0 daily_rainfall 5b2b0fbd96a0250378fc36a5 0.33 FALSE notADate False None None 58
9 None Boolean 0 1.000000 False 1 0.098 0 0 0 was_delayed 5b2b0fbd96a0250378fc36a5 0.3 SKIPPED_DETECTION notADate False None None 2
10 None Categorical 10 0.053047 False None None None None 0 Scheduled Departure Time (Hour of Day) 5b2b0fbd96a0250378fc36a5 None FALSE notADate False None None 17

Three feature lists are automatically created:

  • Raw Features: one for all features
  • Informative Features: one based on features with any information (columns with no variation are excluded)
  • Univariate Importance: one based on univariate importance (this is only created after the target is set)

Informative Features is the one used by default in autopilot.

In [13]:
feature_lists = project.get_featurelists()
feature_lists
Out[13]:
[Featurelist(Raw Features),
 Featurelist(Informative Features),
 Featurelist(Univariate Selections)]
In [14]:
# create a featurelist without the rain features
# (since they leak future information)
informative_feats = [lst for lst in feature_lists if
                     lst.name == 'Informative Features'][0]
no_foreknowledge_features = list(
    set(informative_feats.features) - {'daily_rainfall', 'did_rain'})
In [15]:
no_foreknowledge = project.create_featurelist('no foreknowledge',
                                              no_foreknowledge_features)
no_foreknowledge
Out[15]:
Featurelist(no foreknowledge)
In [16]:
project.get_status()
Out[16]:
{u'autopilot_done': False,
 u'stage': u'modeling',
 u'stage_description': u'Ready for modeling'}
In [17]:
# This waits until autopilot is complete:
project.wait_for_autopilot(check_interval=90)
In progress: 10, queued: 24 (waited: 0s)
In progress: 10, queued: 24 (waited: 1s)
In progress: 10, queued: 24 (waited: 2s)
In progress: 10, queued: 24 (waited: 3s)
In progress: 10, queued: 24 (waited: 5s)
In progress: 10, queued: 24 (waited: 7s)
In progress: 10, queued: 24 (waited: 11s)
In progress: 10, queued: 24 (waited: 18s)
In progress: 10, queued: 23 (waited: 32s)
In progress: 10, queued: 15 (waited: 58s)
In progress: 10, queued: 6 (waited: 110s)
In progress: 10, queued: 6 (waited: 201s)
In progress: 1, queued: 0 (waited: 292s)
In progress: 1, queued: 0 (waited: 382s)
In progress: 9, queued: 7 (waited: 472s)
In progress: 0, queued: 0 (waited: 563s)
In progress: 1, queued: 0 (waited: 654s)
In [18]:
project.start_autopilot(no_foreknowledge.id)
In [19]:
project.wait_for_autopilot(check_interval=90)
In progress: 1, queued: 0 (waited: 0s)
In progress: 1, queued: 0 (waited: 0s)
In progress: 1, queued: 0 (waited: 1s)
In progress: 1, queued: 0 (waited: 2s)
In progress: 1, queued: 0 (waited: 3s)
In progress: 1, queued: 0 (waited: 4s)
In progress: 1, queued: 1 (waited: 8s)
In progress: 9, queued: 25 (waited: 15s)
In progress: 10, queued: 24 (waited: 28s)
In progress: 10, queued: 22 (waited: 54s)
In progress: 10, queued: 13 (waited: 106s)
In progress: 8, queued: 0 (waited: 197s)
In progress: 10, queued: 6 (waited: 288s)
In progress: 8, queued: 0 (waited: 378s)
In progress: 10, queued: 22 (waited: 469s)
In progress: 5, queued: 0 (waited: 559s)
In progress: 4, queued: 0 (waited: 650s)
In progress: 0, queued: 0 (waited: 740s)

Models

In [20]:
models = project.get_models()
example_model = models[0]
example_model
Out[20]:
Model(u'Light Gradient Boosted Trees Classifier with Early Stopping')

Models represent fitted models and have various data about the model, including metrics:

In [21]:
example_model.metrics
Out[21]:
{u'AUC': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.759622,
  u'holdout': None,
  u'validation': 0.75541},
 u'FVE Binomial': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.14890399999999998,
  u'holdout': None,
  u'validation': 0.14899},
 u'Gini Norm': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.519244,
  u'holdout': None,
  u'validation': 0.51082},
 u'Kolmogorov-Smirnov': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.40499799999999997,
  u'holdout': None,
  u'validation': 0.40223},
 u'LogLoss': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.272182,
  u'holdout': None,
  u'validation': 0.27228},
 u'RMSE': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.275546,
  u'holdout': None,
  u'validation': 0.27554},
 u'Rate@Top10%': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.368602,
  u'holdout': None,
  u'validation': 0.38567},
 u'Rate@Top5%': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.492518,
  u'holdout': None,
  u'validation': 0.47619},
 u'Rate@TopTenth%': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.933334,
  u'holdout': None,
  u'validation': 1.0}}
In [22]:
def sorted_by_log_loss(models, test_set):
    models_with_score = [model for model in models if
                         model.metrics['LogLoss'][test_set] is not None]
    return sorted(models_with_score,
                  key=lambda model: model.metrics['LogLoss'][test_set])

Let’s choose the models (from each feature set, to compare the scores) with the best LogLoss score from those with the rain and those without:

In [23]:
models = project.get_models()
fair_models = [mod for mod in models if
               mod.featurelist_id == no_foreknowledge.id]
rain_cheat_models = [mod for mod in models if
                     mod.featurelist_id == informative_feats.id]
In [24]:
models[0].metrics['LogLoss']

Out[24]:
{u'backtesting': None,
 u'backtestingScores': None,
 u'crossValidation': 0.272182,
 u'holdout': None,
 u'validation': 0.27228}
In [25]:
best_fair_model = sorted_by_log_loss(fair_models, 'crossValidation')[0]
best_cheat_model = sorted_by_log_loss(rain_cheat_models, 'crossValidation')[0]
best_fair_model.metrics, best_cheat_model.metrics
Out[25]:
({u'AUC': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.7147939999999999,
   u'holdout': None,
   u'validation': 0.71893},
  u'FVE Binomial': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.09153199999999999,
   u'holdout': None,
   u'validation': 0.09336},
  u'Gini Norm': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.429588,
   u'holdout': None,
   u'validation': 0.43786},
  u'Kolmogorov-Smirnov': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.32391,
   u'holdout': None,
   u'validation': 0.31819},
  u'LogLoss': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.29053000000000007,
   u'holdout': None,
   u'validation': 0.29008},
  u'RMSE': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.285702,
   u'holdout': None,
   u'validation': 0.28599},
  u'Rate@Top10%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.295562,
   u'holdout': None,
   u'validation': 0.2901},
  u'Rate@Top5%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.374152,
   u'holdout': None,
   u'validation': 0.40136},
  u'Rate@TopTenth%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.46666799999999997,
   u'holdout': None,
   u'validation': 0.0}},
 {u'AUC': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.759622,
   u'holdout': None,
   u'validation': 0.75541},
  u'FVE Binomial': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.14890399999999998,
   u'holdout': None,
   u'validation': 0.14899},
  u'Gini Norm': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.519244,
   u'holdout': None,
   u'validation': 0.51082},
  u'Kolmogorov-Smirnov': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.40499799999999997,
   u'holdout': None,
   u'validation': 0.40223},
  u'LogLoss': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.272182,
   u'holdout': None,
   u'validation': 0.27228},
  u'RMSE': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.275546,
   u'holdout': None,
   u'validation': 0.27554},
  u'Rate@Top10%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.368602,
   u'holdout': None,
   u'validation': 0.38567},
  u'Rate@Top5%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.492518,
   u'holdout': None,
   u'validation': 0.47619},
  u'Rate@TopTenth%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.933334,
   u'holdout': None,
   u'validation': 1.0}})

Visualizing Models

This is a good time to use Feature Fit and Feature Effects (not yet available via the API) to visualize the models:

In [26]:
best_fair_model.open_model_browser()
Out[26]:
True
In [27]:
best_cheat_model.open_model_browser()
Out[27]:
True

Unlocking the Holdout

To maintain holdout scores as a valid estimate of out-of-sample error, we recommend not looking at them until late in the project. For this reason, holdout scores are locked until you unlock them.

In [28]:
project.unlock_holdout()
Out[28]:
Project(Airline Delays - was_delayed)
In [29]:
best_fair_model = dr.Model.get(project.id, best_fair_model.id)
best_cheat_model = dr.Model.get(project.id, best_cheat_model.id)
In [30]:
best_fair_model.metrics['LogLoss'], best_cheat_model.metrics['LogLoss']
Out[30]:
({u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.29053000000000007,
  u'holdout': 0.29269,
  u'validation': 0.29008},
 {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.272182,
  u'holdout': 0.27481,
  u'validation': 0.27228})

Retrain on 100%

When ready to use the final model, you will probably get the best performance by retraining on 100% of the data.

In [31]:
model_job_fair_100pct_id = best_fair_model.train(sample_pct=100)
model_job_fair_100pct_id
Out[31]:
'211'

Wait for the model to complete:

In [32]:
model_fair_100pct = dr.models.modeljob.wait_for_async_model_creation(
    project.id, model_job_fair_100pct_id)
model_fair_100pct.id
Out[32]:
u'5b2b157c96a0250015bc5c81'

Predictions

Let’s make predictions for some new data. This new data will need to have the same transformations applied as we applied to the training data.

In [33]:
logan_2014 = pd.read_csv("logan-US-2014.csv")
logan_2014_modeling = prepare_modeling_dataset(logan_2014)
logan_2014_modeling.head()
Out[33]:
was_delayed daily_rainfall did_rain Carrier Code Flight Number Tail Number Destination Airport Scheduled Departure Time day_of_week month
0 False 0.0 False US 450 N809AW PHX 10:00 Sat 2
1 False 0.0 False US 553 N814AW PHL 07:00 Sat 2
2 False 0.0 False US 582 N820AW PHX 06:10 Sat 2
3 False 0.0 False US 601 N678AW PHX 16:20 Sat 2
4 False 0.0 False US 657 N662AW CLT 09:45 Sat 2
In [34]:
prediction_dataset = project.upload_dataset(logan_2014_modeling)
predict_job = model_fair_100pct.request_predictions(prediction_dataset.id)
prediction_dataset.id
Out[34]:
u'5b2b15b196a02500148bae57'
In [35]:
predictions = predict_job.get_result_when_complete()
In [36]:
pd.concat([logan_2014_modeling, predictions], axis=1).head()
Out[36]:
was_delayed daily_rainfall did_rain Carrier Code Flight Number Tail Number Destination Airport Scheduled Departure Time day_of_week month positive_probability prediction row_id class_0.0 class_1.0
0 False 0.0 False US 450 N809AW PHX 10:00 Sat 2 0.043375 0.0 0 0.956625 0.043375
1 False 0.0 False US 553 N814AW PHL 07:00 Sat 2 0.035594 0.0 1 0.964406 0.035594
2 False 0.0 False US 582 N820AW PHX 06:10 Sat 2 0.023461 0.0 2 0.976539 0.023461
3 False 0.0 False US 601 N678AW PHX 16:20 Sat 2 0.128815 0.0 3 0.871185 0.128815
4 False 0.0 False US 657 N662AW CLT 09:45 Sat 2 0.050310 0.0 4 0.949690 0.050310

Let’s have a look at our results. Since this is a binary classification problem, as the positive_probability approaches zero this row is a stronger candidate for the negative class (the flight will leave on-time), while as it approaches one, the outcome is more likely to be of the positive class (the flight will be delayed). From the KDE (Kernel Density Estimate) plot below, we can see that this sample of the data is weighted stronger to the negative class.

In [37]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
In [38]:
matplotlib.rcParams['figure.figsize'] = (15, 10)  # make charts bigger
In [39]:
sns.set(color_codes=True)
sns.kdeplot(predictions.positive_probability, shade=True, cut=0,
            label='Positive Probability')
plt.xlim((0, 1))
plt.ylim((0, None))
plt.xlabel('Probability of Event')
plt.ylabel('Probability Density')
plt.title('Prediction Distribution')
Out[39]:
<matplotlib.text.Text at 0x7f4f05227990>
../../_images/examples_airline_ontime_example_Modeling_Airline_Delay_55_1.png

Exploring Prediction Explanations

Computing prediction explanations is a resource-intensive task, but you can help reduce the runtime for computing them by setting prediction value thresholds. You can learn more about prediction explanations by searching the online documentation available in the DataRobot web interface.

When are they useful?

A common question when evaluating data is “why is a certain data-point considered high-risk (or low-risk) for a certain event”?

A sample case for prediction explanations:

Clark is a business analyst at a large manufacturing firm. She does not have a lot of data science expertise, but has been using DataRobot with great success to predict likely product failures at her manufacturing plant. Her manager is now asking for recommendations for reducing the defect rate, based on these predictions. Clark would like DataRobot to produce prediction explanations for the expected product failures so that she can identify the key drivers of product failures based on a higher-level aggregation of explanations. Her business team can then use this report to address the causes of failure.

Other common use cases and possible explanations include:

  • What are indicators that a transaction could be at high risk for fraud? Possible explanations include transactions out of a cardholder’s home area, transactions out of their “normal usage” time range, and transactions that are too large or small.
  • What are some explanations for setting a higher auto insurance price? The applicant is single, male, age under 30 years, and has received traffic tickets. A married homeowner may receive a lower rate.

Preparation

We are almost ready to compute prediction explanations. Prediction explanations require two prerequisites to be performed first; however, these commands only need to be run once per model.

Feature Impact

A prerequisite to computing prediction explanations is that you need to compute the feature impact for your model (this only needs to be done once per model):

In [40]:
%%time
try:
    impact_job = model_fair_100pct.request_feature_impact()
    impact_job.wait_for_completion(4 * 60)
except dr.errors.JobAlreadyRequested:
    pass  # already computed
CPU times: user 27.3 ms, sys: 3.94 ms, total: 31.2 ms
Wall time: 12.7 s
Prediction Explanations Initialization

After Feature Impact has been computed, you also must create a Prediction Explanations Initialization for your model:

In [41]:
%%time
try:
    # Test to see if they are already computed
    dr.PredictionExplanationsInitialization.get(project.id,
                                                model_fair_100pct.id)
except dr.errors.ClientError as e:
    assert e.status_code == 404  # haven't been computed
    init_job = dr.PredictionExplanationsInitialization.create(
        project.id,
        model_fair_100pct.id
    )
    init_job.wait_for_completion()
CPU times: user 46.5 ms, sys: 5.06 ms, total: 51.6 ms
Wall time: 12.1 s

Computing the prediction explanations

Now that we have computed the feature impact and initialized the prediction explanations, and also uploaded a dataset and computed predictions on it, we are ready to make a request to compute the prediction explanations for every row of the dataset. Computing prediction explanations supports a couple of parameters:

  • max_explanations are the maximum number of prediction explanations to compute for each row.
  • threshold_low and threshold_high are thresholds for the value of the prediction of the row. Prediction explanations will be computed for a row if the row’s prediction value is higher than threshold_high or lower than threshold_low. If no thresholds are specified, prediction explanations will be computed for all rows.

Note: for binary classification projects (like this one), the thresholds correspond to the positive_probability prediction value whereas for regression problems, it corresponds to the actual predicted value.

Since we’ve already examined our prediction distribution from above, let’s use that to influence what we set for our thresholds. It looks like most flights depart on-time so let’s just examine the explanations for flights that have an above normal probability for being delayed. We will use a threshold_high of 0.456 which means for all rows where the predicted positive_probability is at least 0.456 we will compute the prediction explanations for that row. For the simplicity of this tutorial, we will also limit DataRobot to only compute 5 explanations for us.

In [42]:
%%time
number_of_explanations = 5
pe_job = dr.PredictionExplanations.create(
    project.id,
    model_fair_100pct.id,
    prediction_dataset.id,
    max_explanations=number_of_explanations,
    threshold_low=None,
    threshold_high=0.456
)
pe = pe_job.get_result_when_complete()
all_rows = pe.get_all_as_dataframe()
CPU times: user 3.22 s, sys: 137 ms, total: 3.36 s
Wall time: 23.4 s

Let’s cleanup the DataFrame we got back by trimming it down to just the interesting columns. Also, since most rows will have prediction values outside our thresholds, let’s drop all the uninteresting rows (i.e. ones with null values).

In [43]:
import pandas as pd
pd.options.display.max_rows = 10  # default display is too verbose

# These rows are all redundant or of little value for this example
redundant_cols = ['row_id', 'class_0_label', 'class_1_probability',
                  'class_1_label']
explanations = all_rows.drop(redundant_cols, axis=1)
explanations.drop(['explanation_{}_label'.format(i)
                   for i in range(number_of_explanations)],
                  axis=1, inplace=True)

# These are rows that didn't meet our thresholds
explanations.dropna(inplace=True)

# Rename columns to be more consistent with the terms we have been using
explanations.rename(index=str,
                    columns={'class_0_probability': 'positive_probability'},
                    inplace=True)
explanations
Out[43]:
prediction positive_probability explanation_0_feature explanation_0_feature_value explanation_0_qualitative_strength explanation_0_strength explanation_1_feature explanation_1_feature_value explanation_1_qualitative_strength explanation_1_strength ... explanation_2_qualitative_strength explanation_2_strength explanation_3_feature explanation_3_feature_value explanation_3_qualitative_strength explanation_3_strength explanation_4_feature explanation_4_feature_value explanation_4_qualitative_strength explanation_4_strength
9498 1.0 0.521672 Scheduled Departure Time -2.208920e+09 +++ 1.411063 Tail Number N170US ++ 0.522242 ... ++ 0.355082 Flight Number 800 ++ 0.247061 day_of_week Thurs ++ 0.240676
12373 1.0 0.505737 Scheduled Departure Time -2.208920e+09 ++ 0.858645 Flight Number 897 ++ 0.848086 ... ++ 0.522828 month 12 ++ 0.312428 day_of_week Mon ++ 0.276766
13254 0.0 0.466474 Scheduled Departure Time -2.208920e+09 +++ 0.937670 Flight Number 897 +++ 0.850898 ... ++ 0.335550 day_of_week Thurs ++ 0.308574 Destination Airport PHX - -0.145671
13351 0.0 0.484007 Scheduled Departure Time -2.208920e+09 +++ 1.124481 Flight Number 897 ++ 0.863671 ... ++ 0.371650 day_of_week Sun ++ 0.343775 month 12 ++ 0.341253
13536 1.0 0.512797 Scheduled Departure Time -2.208920e+09 +++ 1.229332 Flight Number 897 ++ 0.861893 ... ++ 0.486845 day_of_week Sun ++ 0.343775 month 12 ++ 0.319640
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18015 0.0 0.494928 Scheduled Departure Time -2.208920e+09 +++ 1.902487 month 7 ++ 0.743062 ... + 0.230507 day_of_week Thurs + 0.220769 Flight Number 800 + 0.216649
18165 0.0 0.492765 Scheduled Departure Time -2.208920e+09 +++ 1.447815 month 7 ++ 0.517627 ... ++ 0.349980 Flight Number 800 ++ 0.290879 Destination Airport CLT + 0.183091
18392 1.0 0.584637 Scheduled Departure Time -2.208920e+09 +++ 1.494192 month 7 ++ 0.610177 ... ++ 0.522249 Flight Number 800 ++ 0.281044 day_of_week Thurs ++ 0.243063
18396 0.0 0.456992 Scheduled Departure Time -2.208919e+09 +++ 1.442328 month 7 ++ 0.600896 ... ++ 0.338863 day_of_week Thurs ++ 0.217166 Scheduled Departure Time (Hour of Day) 19 + 0.065168
18406 1.0 0.515136 Scheduled Departure Time -2.208927e+09 +++ 1.766550 month 7 ++ 0.832155 ... ++ 0.815748 Scheduled Departure Time (Hour of Day) 17 ++ 0.323491 Destination Airport CLT + 0.272840

27 rows × 22 columns

Explore Prediction Explanations results

Now let’s see how often various features are showing up as the top explanation for impacting the probability of a flight being delayed.

In [44]:
from functools import reduce

# Create a combined histogram of all our explanations
explanations_hist = reduce(
    lambda x, y: x.add(y, fill_value=0),
    (explanations['explanation_{}_feature'.format(i)].value_counts()
     for i in range(number_of_explanations)))
In [45]:
explanations_hist.plot.bar()
plt.xticks(rotation=45, ha='right')
Out[45]:
(array([0, 1, 2, 3, 4, 5, 6]), <a list of 7 Text xticklabel objects>)
../../_images/examples_airline_ontime_example_Modeling_Airline_Delay_68_1.png

Knowing the feature impact for this model from the Diving Deeper notebook, the high occurrence of the daily_rainfall and Scheduled Departure Time as prediction explanations is not entirely surprising because these were some of the top ranked features in the impact chart. Therefore, let’s take a small detour investigating some of the rows that had less expected explanations.


Below is some helper code. It can largely be ignored as it is mostly relevant for this exercise and not needed for a general understanding of the DataRobot APIs

In [46]:
from operator import or_
from functools import reduce
from itertools import chain


def find_rows_with_explanation(df, feature_name, nexpls):
    """
    Given a prediction explanations DataFrame, return a slice
    of that data where the top N explanations match the given feature
    """
    all_expl_columns = (df['explanation_{}_feature'.format(i)] == feature_name
                        for i in range(nexpls))
    df_filter = reduce(or_, all_expl_columns)
    return favorite_expl_columns(df[df_filter], nexpls)


def favorite_expl_columns(df, nexpls):
    """
    Only display the most useful rows of a prediction explanations DataFrame.
    """
    # Use chain to flatten our list of tuples
    columns = list(chain.from_iterable((
        'explanation_{}_feature'.format(i),
        'explanation_{}_feature_value'.format(i),
        'explanation_{}_strength'.format(i))
        for i in range(nexpls)))
    return df[columns]


def find_feature_in_row(feature, row, nexpls):
    """
    Return the value of a given feature
    """
    for i in range(nexpls):
        if row['explanation_{}_feature'.format(i)] == feature:
            return row['explanation_{}_feature_value'.format(i)]


def collect_feature_values(df, feature, nexpls):
    """
    Return a list of all values of a given prediction explanation
    from a DataFrame
    """
    return [find_feature_in_row(feature, row, nexpls)
            for index, row in df.iterrows()]

Investigation: Destination Airport

It looks like there was a small number of rows where the Destination Airport was one of the top N explanations for a given prediction

In [47]:
feature_name = 'Destination Airport'
flight_nums = find_rows_with_explanation(explanations,
                                         feature_name,
                                         number_of_explanations)
flight_nums
Out[47]:
explanation_0_feature explanation_0_feature_value explanation_0_strength explanation_1_feature explanation_1_feature_value explanation_1_strength explanation_2_feature explanation_2_feature_value explanation_2_strength explanation_3_feature explanation_3_feature_value explanation_3_strength explanation_4_feature explanation_4_feature_value explanation_4_strength
13254 Scheduled Departure Time -2.208920e+09 0.937670 Flight Number 897 0.850898 Tail Number N657AW 0.335550 day_of_week Thurs 0.308574 Destination Airport PHX -0.145671
14226 Scheduled Departure Time -2.208920e+09 1.435292 month 6 0.459697 Flight Number 800 0.280207 day_of_week Thurs 0.251885 Destination Airport CLT 0.201186
14601 Scheduled Departure Time -2.208920e+09 1.422922 month 6 0.381899 Flight Number 800 0.278981 day_of_week Thurs 0.248532 Destination Airport CLT 0.201186
14855 Scheduled Departure Time -2.208920e+09 1.376668 month 6 0.455120 Tail Number N163US 0.345858 Flight Number 800 0.308118 Destination Airport CLT 0.186002
14978 Scheduled Departure Time -2.208920e+09 1.435292 month 6 0.459697 Flight Number 800 0.280207 day_of_week Thurs 0.251885 Destination Airport CLT 0.201186
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
17576 Scheduled Departure Time -2.208920e+09 1.512200 month 7 0.623913 Tail Number N170US 0.544830 Flight Number 800 0.305650 Destination Airport CLT 0.186923
17638 Scheduled Departure Time -2.208920e+09 1.523302 month 7 0.461251 Flight Number 800 0.282270 day_of_week Thurs 0.246416 Destination Airport CLT 0.202109
18015 Scheduled Departure Time -2.208920e+09 1.902487 month 7 0.743062 Destination Airport CLT 0.230507 day_of_week Thurs 0.220769 Flight Number 800 0.216649
18165 Scheduled Departure Time -2.208920e+09 1.447815 month 7 0.517627 Tail Number N173US 0.349980 Flight Number 800 0.290879 Destination Airport CLT 0.183091
18406 Scheduled Departure Time -2.208927e+09 1.766550 month 7 0.832155 Tail Number N818AW 0.815748 Scheduled Departure Time (Hour of Day) 17 0.323491 Destination Airport CLT 0.272840

12 rows × 15 columns

In [48]:
all_flights = collect_feature_values(flight_nums,
                                     feature_name,
                                     number_of_explanations)
pd.DataFrame(all_flights)[0].value_counts().plot.bar()
plt.xticks(rotation=0)
Out[48]:
(array([0, 1]), <a list of 2 Text xticklabel objects>)
../../_images/examples_airline_ontime_example_Modeling_Airline_Delay_74_1.png

Many a frequent flier will tell you horror stories about flying in and out of certain airports. While any given prediction explanation can have a positive or a negative impact to a prediction (this is indicated by both the strength and qualitative_strength columns), due to the thresholds we configured earlier for this tutorial it is likely that the above airports are causing flight delays.


Investigation: Scheduled Departure Time

DataRobot correctly identified the Scheduled Departure Time input as a timestamp but in the prediction explanation output, we are seeing the internal representation of the time value as a Unix epoch value so let’s put it back into a format that humans can understand better:

In [49]:
# For simplicity, let's just look at rows where `Scheduled Departure Time`
# was the first/top explanation.
feature_name = 'Scheduled Departure Time'
bad_times = explanations[explanations.explanation_0_feature == feature_name]

# Now let's convert the epoch to a datetime
pd.to_datetime(bad_times.explanation_0_feature_value, unit='s')
Out[49]:
9498    1900-01-01 19:10:00
12373   1900-01-01 19:00:00
13254   1900-01-01 19:00:00
13351   1900-01-01 19:00:00
13536   1900-01-01 19:00:00
                ...
18015   1900-01-01 19:10:00
18165   1900-01-01 19:10:00
18392   1900-01-01 19:10:00
18396   1900-01-01 19:30:00
18406   1900-01-01 17:05:00
Name: explanation_0_feature_value, Length: 27, dtype: datetime64[ns]

We can see that it appears as though all departures occurred on Jan. 1st, 1900. This is because the original value was simply a timestamp so only the time portion of the epoch is meaningful. We will clean this up in our graph below:

In [50]:
from matplotlib.ticker import FuncFormatter
from time import gmtime, strftime

scale_factor = 9  # make the difference in strengths more visible

depart = explanations[explanations.explanation_0_feature == feature_name]
true_only = depart[depart.prediction == 1]
false_only = depart[depart.prediction == 0]
plt.scatter(x=true_only.explanation_0_feature_value,
            y=true_only.positive_probability,
            c='green',
            s=true_only.explanation_0_strength ** scale_factor,
            label='Will be delayed')
plt.scatter(x=false_only.explanation_0_feature_value,
            y=false_only.positive_probability,
            c='purple',
            s=false_only.explanation_0_strength ** scale_factor,
            label='Will not')

# Convert the Epoch values into human time stamps
formatter = FuncFormatter(lambda x, pos: strftime('%H:%M', gmtime(x)))
plt.gca().xaxis.set_major_formatter(formatter)

plt.xlabel('Scheduled Departure Time')
plt.ylabel('Positive Probability')
plt.legend(markerscale=.5, frameon=True, facecolor="white")
plt.title("Relationship of Depart Time and being delayed")
Out[50]:
<matplotlib.text.Text at 0x7f4effb4f290>
../../_images/examples_airline_ontime_example_Modeling_Airline_Delay_79_1.png

The above plot shows each prediction where the top influencer of the prediction was the Scheduled Departure Time. It’s plotted against the positive_probability on the Y-axis and the size of the plots represent the strength that departure time had on the prediction (relative to the other features of that given data point). Finally to aid visually, the positive vs. negative outcomes are colored.

As we can see by the time-scale on the X-axis, it doesn’t represent the full 24 hours; this is telling. Since we filtered our data earlier to only show predictions that were leaning towards being delayed, and this chart is leaning towards times in the afternoon and evening there may be a correlation between later scheduled departure time and a higher probability of being delayed. With a little bit of domain knowledge, one will understand that an airplane and its crew make many flights in a day (typically hopping between cities) so small delays in the morning compound into the evening hours.