Modeling Airline Delay

Overview

Statistics on whether a flight was delayed, and for how long, are available from government databases for all the major carriers. It would be useful to predict, before scheduling a flight, whether it is likely to be delayed. In this example, DataRobot will try to model whether a flight will be delayed, based on information such as the scheduled departure time and whether it rained on the day of the flight.

Set Up

This example assumes that the DataRobot Python client package has been installed and configured with the credentials of a DataRobot user with API access permissions.
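If the client has not yet been configured, credentials can also be supplied directly when the script starts. A minimal sketch, where the endpoint and token values are placeholders to be replaced with your own:

```python
import datarobot as dr

# Placeholder credentials -- substitute your own API token and, if using
# a self-hosted installation, your own endpoint URL.
dr.Client(endpoint='https://app.datarobot.com/api/v2',
          token='YOUR_API_TOKEN')
```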

Data Sources

Information on flights and flight delays is made available by the Bureau of Transportation Statistics at https://www.transtats.bts.gov/ONTIME/Departures.aspx. To narrow down the amount of data involved, the datasets assembled for this use case are limited to US Airways flights out of Boston Logan in 2013 and 2014, although the script for interacting with DataRobot is sufficiently general that any dataset with the correct format could be used. A flight was declared to be delayed if it ultimately took off at least fifteen minutes after its scheduled departure time.
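The fifteen-minute rule can be sketched in pandas; the column names below are hypothetical stand-ins for the scheduled and actual departure fields in the BTS export, and this sketch ignores flights whose actual departure rolls past midnight:

```python
import pandas as pd

# Hypothetical column names standing in for the BTS departure fields.
raw = pd.DataFrame({
    'Scheduled departure time': ['16:20', '06:00'],
    'Actual departure time': ['16:25', '06:40'],
})

def minutes_late(scheduled, actual):
    # Parse the HH:MM strings and take the difference in minutes.
    sched = pd.to_datetime(scheduled, format='%H:%M')
    act = pd.to_datetime(actual, format='%H:%M')
    return (act - sched).dt.total_seconds() / 60

# A flight is delayed if it took off at least fifteen minutes late.
raw['was_delayed'] = minutes_late(
    raw['Scheduled departure time'], raw['Actual departure time']) >= 15
```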

In addition to flight information, each record in the prepared dataset notes the amount of rain, and whether it rained, on the day of the flight. This information came from the National Oceanic and Atmospheric Administration’s Quality Controlled Local Climatological Data (QCLCD), available at http://www.ncdc.noaa.gov/qclcd/QCLCD. The daily rainfall for each day in 2013 and 2014 was taken from the recorded daily summaries of the water-equivalent precipitation at the Boston Logan station. For some days, the QCLCD reports only trace amounts of rainfall; these were recorded as 0 inches of rain.
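Normalizing the trace values can be sketched as follows; the `PrecipTotal` column name and the `'T'` marker for trace precipitation are assumptions about the QCLCD export format, not part of the script above:

```python
import pandas as pd

# Hypothetical QCLCD-style daily summary: trace precipitation appears
# as the string 'T' rather than a number.
qclcd = pd.DataFrame({'PrecipTotal': ['0.00', 'T', '0.35']})

# Treat trace amounts as 0 inches of rain, then parse as floats.
daily_rainfall = (qclcd['PrecipTotal']
                  .replace('T', '0.00')
                  .astype(float))
did_rain = daily_rainfall > 0
```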

Dataset Structure

Each row in the assembled dataset contains the following columns:

  • was_delayed
    • boolean
    • whether the flight was delayed
  • daily_rainfall
    • float
    • the amount of rain, in inches, on the day of the flight
  • did_rain
    • bool
    • whether it rained on the day of the flight
  • Carrier Code
    • str
    • the carrier code of the airline - US for all entries in the assembled dataset
  • Date
    • str (MM/DD/YYYY format)
    • the date of the flight
  • Flight Number
    • str
    • the flight number for the flight
  • Tail Number
    • str
    • the tail number of the aircraft
  • Destination Airport
    • str
    • the three-letter airport code of the destination airport
  • Scheduled Departure Time
    • str
    • the 24-hour scheduled departure time of the flight, in the origin airport’s timezone
In [1]:
import pandas as pd
import datarobot as dr
In [2]:
data_path = "logan-US-2013.csv"
logan_2013 = pd.read_csv(data_path)
logan_2013.head()
Out[2]:
was_delayed daily_rainfall did_rain Carrier Code Date (MM/DD/YYYY) Flight Number Tail Number Destination Airport Scheduled Departure Time
0 False 0.0 False US 02/01/2013 225 N662AW PHX 16:20
1 False 0.0 False US 02/01/2013 280 N822AW PHX 06:00
2 False 0.0 False US 02/01/2013 303 N653AW CLT 09:35
3 True 0.0 False US 02/01/2013 604 N640AW PHX 09:55
4 False 0.0 False US 02/01/2013 722 N715UW PHL 18:30

We want to be able to make predictions for future data, so instead of using the raw date column (whose values will never recur in future data), we derive day-of-week and month features from it:

In [3]:
def prepare_modeling_dataset(df):
    date_column_name = 'Date (MM/DD/YYYY)'
    date = pd.to_datetime(df[date_column_name])
    modeling_df = df.drop(date_column_name, axis=1)
    days = {0: 'Mon', 1: 'Tues', 2: 'Weds', 3: 'Thurs', 4: 'Fri', 5: 'Sat',
            6: 'Sun'}
    modeling_df['day_of_week'] = date.apply(lambda x: days[x.dayofweek])
    modeling_df['month'] = date.dt.month
    return modeling_df
In [4]:
logan_2013_modeling = prepare_modeling_dataset(logan_2013)
logan_2013_modeling.head()
Out[4]:
was_delayed daily_rainfall did_rain Carrier Code Flight Number Tail Number Destination Airport Scheduled Departure Time day_of_week month
0 False 0.0 False US 225 N662AW PHX 16:20 Fri 2
1 False 0.0 False US 280 N822AW PHX 06:00 Fri 2
2 False 0.0 False US 303 N653AW CLT 09:35 Fri 2
3 True 0.0 False US 604 N640AW PHX 09:55 Fri 2
4 False 0.0 False US 722 N715UW PHL 18:30 Fri 2

DataRobot Modeling

As part of this use case, in model_flight_ontime.py, a DataRobot project will be created and used to run a variety of models against the assembled datasets. By default, DataRobot runs autopilot on the automatically generated Informative Features list, which excludes certain pathological features (like Carrier Code in this example, which has the same value for every row). We will also create a custom feature list that additionally excludes the rainfall features, since the amount of rain on the day of the flight would not be known when scheduling a flight.

This notebook shows how to use the Python API client to create a project, create feature lists, train models with different sample percents and feature lists, and view the models that have been run. It will:

  • create a project
  • create a new feature list (no foreknowledge) excluding the rainfall features
  • set the target to was_delayed, and run DataRobot autopilot on the Informative Features list
  • rerun autopilot on a new feature list
  • make predictions on a new data set

Starting a Project

In [5]:
project = dr.Project.start(logan_2013_modeling,
                           project_name='Airline Delays - was_delayed',
                           target="was_delayed")
project.id
Out[5]:
u'5963ddefc8089169ef1637c2'

Jobs and the Project Queue

You can view the project in your browser:

In [ ]:
#  If running notebook remotely
project.open_leaderboard_browser()
In [ ]:
#  Set worker count higher. This will fail if you don't have 10 workers.
project.set_worker_count(10)
In [6]:
project.pause_autopilot()
Out[6]:
True
In [7]:
#  More jobs will go in the queue in each stage of autopilot.
#  This gets the currently in-progress and queued jobs
project.get_model_jobs()
Out[7]:
[ModelJob(Gradient Boosted Trees Classifier, status=inprogress),
 ModelJob(Breiman and Cutler Random Forest Classifier, status=inprogress),
 ModelJob(RuleFit Classifier, status=queue),
 ModelJob(Regularized Logistic Regression (L2), status=queue),
 ModelJob(Elastic-Net Classifier (L2 / Binomial Deviance), status=queue),
 ModelJob(RandomForest Classifier (Gini), status=queue),
 ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping, status=queue),
 ModelJob(Nystroem Kernel SVM Classifier, status=queue),
 ModelJob(Regularized Logistic Regression (L2), status=queue),
 ModelJob(Elastic-Net Classifier (L2 / Binomial Deviance) with Binned numeric features, status=queue),
 ModelJob(Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance), status=queue),
 ModelJob(RandomForest Classifier (Entropy), status=queue),
 ModelJob(ExtraTrees Classifier (Gini), status=queue),
 ModelJob(Gradient Boosted Trees Classifier with Early Stopping, status=queue),
 ModelJob(Gradient Boosted Greedy Trees Classifier with Early Stopping, status=queue),
 ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping, status=queue),
 ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features, status=queue),
 ModelJob(Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) with Unsupervised Learning Features, status=queue),
 ModelJob(Auto-tuned K-Nearest Neighbors Classifier (Minkowski Distance), status=queue),
 ModelJob(Vowpal Wabbit Classifier, status=queue)]
In [8]:
project.unpause_autopilot()
Out[8]:
True

Features

In [9]:
features = project.get_features()
features
Out[9]:
[Feature(did_rain),
 Feature(Destination Airport),
 Feature(Carrier Code),
 Feature(Flight Number),
 Feature(Tail Number),
 Feature(day_of_week),
 Feature(month),
 Feature(Scheduled Departure Time),
 Feature(daily_rainfall),
 Feature(was_delayed)]
In [10]:
pd.DataFrame([f.__dict__ for f in features])
Out[10]:
date_format feature_type id importance low_information na_count name project_id unique_count
0 None Boolean 2 0.029045 False 0 did_rain 5963ddefc8089169ef1637c2 2
1 None Categorical 6 0.003714 True 0 Destination Airport 5963ddefc8089169ef1637c2 5
2 None Categorical 3 NaN True 0 Carrier Code 5963ddefc8089169ef1637c2 1
3 None Numeric 4 0.005900 False 0 Flight Number 5963ddefc8089169ef1637c2 329
4 None Categorical 5 -0.004512 True 0 Tail Number 5963ddefc8089169ef1637c2 296
5 None Categorical 8 0.003452 True 0 day_of_week 5963ddefc8089169ef1637c2 7
6 None Numeric 9 0.003043 True 0 month 5963ddefc8089169ef1637c2 12
7 %H:%M Time 7 0.058245 False 0 Scheduled Departure Time 5963ddefc8089169ef1637c2 77
8 None Numeric 1 0.034295 False 0 daily_rainfall 5963ddefc8089169ef1637c2 58
9 None Boolean 0 1.000000 False 0 was_delayed 5963ddefc8089169ef1637c2 2

Three feature lists are automatically created:

  • Raw Features: one for all features
  • Informative Features: one based on features with any information (columns with no variation are excluded)
  • Univariate Selections: one based on univariate importance (this is only created after the target is set)

Informative Features is the one used by default in autopilot.

In [11]:
feature_lists = project.get_featurelists()
feature_lists
Out[11]:
[Featurelist(Informative Features),
 Featurelist(Raw Features),
 Featurelist(Univariate Selections)]
In [12]:
# create a featurelist without the rain features
# (since they leak future information)
informative_feats = [lst for lst in feature_lists if
                     lst.name == 'Informative Features'][0]
no_foreknowledge_features = list(
    set(informative_feats.features) - {'daily_rainfall', 'did_rain'})
In [13]:
no_foreknowledge = project.create_featurelist('no foreknowledge',
                                              no_foreknowledge_features)
no_foreknowledge
Out[13]:
Featurelist(no foreknowledge)
In [14]:
project.get_status()
Out[14]:
{u'autopilot_done': False,
 u'stage': u'modeling',
 u'stage_description': u'Ready for modeling'}
In [16]:
# This waits until autopilot is complete:
project.wait_for_autopilot(check_interval=90)
In progress: 2, queued: 2 (waited: 0s)
In progress: 2, queued: 2 (waited: 0s)
In progress: 2, queued: 2 (waited: 1s)
In progress: 2, queued: 2 (waited: 2s)
In progress: 2, queued: 2 (waited: 3s)
In progress: 2, queued: 2 (waited: 4s)
In progress: 2, queued: 2 (waited: 8s)
In progress: 2, queued: 2 (waited: 14s)
In progress: 2, queued: 2 (waited: 27s)
In progress: 2, queued: 0 (waited: 53s)
In progress: 2, queued: 0 (waited: 105s)
In progress: 0, queued: 0 (waited: 195s)
In progress: 0, queued: 0 (waited: 286s)
In [17]:
project.start_autopilot(no_foreknowledge.id)
In [24]:
project.wait_for_autopilot(check_interval=90)
In progress: 2, queued: 26 (waited: 0s)
In progress: 2, queued: 26 (waited: 0s)
In progress: 2, queued: 26 (waited: 1s)
In progress: 2, queued: 26 (waited: 2s)
In progress: 2, queued: 26 (waited: 3s)
In progress: 2, queued: 26 (waited: 5s)
In progress: 1, queued: 26 (waited: 8s)
In progress: 4, queued: 23 (waited: 15s)
In progress: 6, queued: 17 (waited: 28s)
In progress: 7, queued: 6 (waited: 54s)
In progress: 5, queued: 9 (waited: 105s)
In progress: 7, queued: 1 (waited: 196s)
In progress: 7, queued: 20 (waited: 287s)
In progress: 7, queued: 3 (waited: 378s)
In progress: 4, queued: 0 (waited: 469s)
In progress: 3, queued: 0 (waited: 559s)
In progress: 0, queued: 0 (waited: 650s)

Models

In [25]:
models = project.get_models()
example_model = models[0]
example_model
Out[25]:
Model(u'Gradient Boosted Trees Classifier with Early Stopping')

Models represent fitted models and have various data about the model, including metrics:

In [26]:
example_model.metrics
Out[26]:
{u'AUC': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.751662,
  u'holdout': None,
  u'validation': 0.74957},
 u'FVE Binomial': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.139262,
  u'holdout': None,
  u'validation': 0.14529},
 u'Gini Norm': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.503324,
  u'holdout': None,
  u'validation': 0.49914},
 u'LogLoss': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.275264,
  u'holdout': None,
  u'validation': 0.27347},
 u'RMSE': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.27734,
  u'holdout': None,
  u'validation': 0.27582},
 u'Rate@Top10%': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.362458,
  u'holdout': None,
  u'validation': 0.37884},
 u'Rate@Top5%': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.47347,
  u'holdout': None,
  u'validation': 0.4898},
 u'Rate@TopTenth%': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.866668,
  u'holdout': None,
  u'validation': 1.0}}
In [27]:
def sorted_by_log_loss(models, test_set):
    models_with_score = [model for model in models if
                         model.metrics['LogLoss'][test_set] is not None]
    return sorted(models_with_score,
                  key=lambda model: model.metrics['LogLoss'][test_set])

Let’s select, from each feature list, the model with the best cross-validation LogLoss score, so we can compare the models trained with the rain features against those trained without them:

In [28]:
models = project.get_models()
fair_models = [mod for mod in models if
               mod.featurelist_id == no_foreknowledge.id]
rain_cheat_models = [mod for mod in models if
                     mod.featurelist_id == informative_feats.id]
In [29]:
models[0].metrics['LogLoss']

Out[29]:
{u'backtesting': None,
 u'backtestingScores': None,
 u'crossValidation': 0.275264,
 u'holdout': None,
 u'validation': 0.27347}
In [30]:
best_fair_model = sorted_by_log_loss(fair_models, 'crossValidation')[0]
best_cheat_model = sorted_by_log_loss(rain_cheat_models, 'crossValidation')[0]
best_fair_model.metrics, best_cheat_model.metrics
Out[30]:
({u'AUC': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.71437,
   u'holdout': None,
   u'validation': 0.7187},
  u'FVE Binomial': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.089798,
   u'holdout': None,
   u'validation': 0.09167},
  u'Gini Norm': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.42874,
   u'holdout': None,
   u'validation': 0.4374},
  u'LogLoss': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.29108199999999995,
   u'holdout': None,
   u'validation': 0.29062},
  u'RMSE': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.28612,
   u'holdout': None,
   u'validation': 0.28617},
  u'Rate@Top10%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.288738,
   u'holdout': None,
   u'validation': 0.28669},
  u'Rate@Top5%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.37415,
   u'holdout': None,
   u'validation': 0.39456},
  u'Rate@TopTenth%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.633334,
   u'holdout': None,
   u'validation': 1.0}},
 {u'AUC': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.758114,
   u'holdout': None,
   u'validation': 0.75345},
  u'FVE Binomial': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.14579400000000003,
   u'holdout': None,
   u'validation': 0.14438},
  u'Gini Norm': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.516228,
   u'holdout': None,
   u'validation': 0.5069},
  u'LogLoss': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.273176,
   u'holdout': None,
   u'validation': 0.27376},
  u'RMSE': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.27671,
   u'holdout': None,
   u'validation': 0.27686},
  u'Rate@Top10%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.370648,
   u'holdout': None,
   u'validation': 0.38225},
  u'Rate@Top5%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.48163600000000006,
   u'holdout': None,
   u'validation': 0.4898},
  u'Rate@TopTenth%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.933334,
   u'holdout': None,
   u'validation': 1.0}})

Visualizing Models

This is a good time to use Model XRay (not yet available via the API) to visualize the models:

In [ ]:
best_fair_model.open_model_browser()
In [ ]:
best_cheat_model.open_model_browser()

Unlocking the Holdout

To maintain holdout scores as a valid estimate of out-of-sample error, we recommend not looking at them until late in the project. For this reason, holdout scores are locked until you unlock them.

In [31]:
project.unlock_holdout()
Out[31]:
Project(Airline Delays - was_delayed)
In [32]:
best_fair_model = dr.Model.get(project.id, best_fair_model.id)
best_cheat_model = dr.Model.get(project.id, best_cheat_model.id)
In [33]:
best_fair_model.metrics['LogLoss'], best_cheat_model.metrics['LogLoss']
Out[33]:
({u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.29108199999999995,
  u'holdout': 0.29344,
  u'validation': 0.29062},
 {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.273176,
  u'holdout': 0.27542,
  u'validation': 0.27376})

Retrain on 100%

When ready to use the final model, you will probably get the best performance by retraining on 100% of the data.

In [34]:
model_job_fair_100pct_id = best_fair_model.train(sample_pct=100)
model_job_fair_100pct_id
Out[34]:
'188'

Wait for the model to complete:

In [35]:
model_fair_100pct = dr.models.modeljob.wait_for_async_model_creation(
    project.id, model_job_fair_100pct_id)

Predictions

Let’s make predictions for some new data. This new data will need the same transformations that were applied to the training data.

In [36]:
logan_2014 = pd.read_csv("logan-US-2014.csv")
logan_2014_modeling = prepare_modeling_dataset(logan_2014)
logan_2014_modeling.head()
Out[36]:
was_delayed daily_rainfall did_rain Carrier Code Flight Number Tail Number Destination Airport Scheduled Departure Time day_of_week month
0 False 0.0 False US 450 N809AW PHX 10:00 Sat 2
1 False 0.0 False US 553 N814AW PHL 07:00 Sat 2
2 False 0.0 False US 582 N820AW PHX 06:10 Sat 2
3 False 0.0 False US 601 N678AW PHX 16:20 Sat 2
4 False 0.0 False US 657 N662AW CLT 09:45 Sat 2
In [37]:
prediction_dataset = project.upload_dataset(logan_2014_modeling)
predict_job = model_fair_100pct.request_predictions(prediction_dataset.id)
In [38]:
predictions = predict_job.get_result_when_complete()
In [39]:
pd.concat([logan_2014_modeling, predictions], axis=1).head()
Out[39]:
was_delayed daily_rainfall did_rain Carrier Code Flight Number Tail Number Destination Airport Scheduled Departure Time day_of_week month positive_probability prediction row_id
0 False 0.0 False US 450 N809AW PHX 10:00 Sat 2 0.050824 0.0 0
1 False 0.0 False US 553 N814AW PHL 07:00 Sat 2 0.040017 0.0 1
2 False 0.0 False US 582 N820AW PHX 06:10 Sat 2 0.032445 0.0 2
3 False 0.0 False US 601 N678AW PHX 16:20 Sat 2 0.122692 0.0 3
4 False 0.0 False US 657 N662AW CLT 09:45 Sat 2 0.054400 0.0 4
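Since the 2014 data includes the observed outcomes, the returned probabilities can be sanity-checked against them. A minimal sketch, assuming scikit-learn is available; `score_predictions` is a hypothetical helper, not part of the DataRobot client:

```python
from sklearn.metrics import log_loss, roc_auc_score

def score_predictions(actuals, positive_probabilities):
    """Score predicted delay probabilities against observed outcomes."""
    return {
        'LogLoss': log_loss(actuals, positive_probabilities),
        'AUC': roc_auc_score(actuals, positive_probabilities),
    }

# e.g. score_predictions(logan_2014_modeling['was_delayed'],
#                        predictions['positive_probability'])
```

The resulting LogLoss can then be compared against the cross-validation and holdout scores seen earlier to check that the model generalizes to the following year.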