Time Series Modeling

Overview

This example provides an introduction to a few of DataRobot’s time series modeling capabilities, using a sales dataset. In this notebook, we will touch on the following:

  • Installing the datarobot package
  • Configuring the client
  • Creating a project
  • Denoting known-in-advance features
  • Specifying a partitioning scheme
  • Running the automated modeling process
  • Generating predictions

Prerequisites

In order to run this notebook yourself, you will need the following:

  • This notebook. If you are viewing this in the HTML documentation bundle, you can download all of the example notebooks and supporting materials from Downloads.
  • The required datasets, which are included in the same directory as this notebook.
  • A DataRobot API token. You can find your API token by logging into the DataRobot Web User Interface and looking in your Profile.
  • The xlrd Python package is needed for the pandas read_excel function. You can install this with pip install xlrd.

Installing the datarobot package

The datarobot package is hosted on PyPI. You can install it via:

pip install datarobot

from the command line. Its main dependencies are numpy and pandas, which can take some time to install on a new system. We highly recommend using a virtualenv to avoid conflicts with other dependencies in your system-wide Python installation.

Getting Started

This line imports the datarobot package. By convention, we always import it with the alias dr.

[1]:
import datarobot as dr

Other Important Imports

We’ll use these in this notebook as well. If the previous cell and the following cell both run without issue, you’re in good shape.

[2]:
import datetime
import pandas as pd

Configure the Python Client

Configuring the client requires the following two things:

  • A DataRobot endpoint - where the API server can be found
  • A DataRobot API token - a token the server uses to identify and validate the user making API requests

The endpoint is usually the URL you would use to log into the DataRobot Web User Interface (e.g., https://app.datarobot.com) with “/api/v2/” appended (e.g., https://app.datarobot.com/api/v2/).

You can find your API token by logging into the DataRobot Web User Interface and looking in your Profile.

The Python client can be configured in several ways. The approach we’ll use in this notebook is to point the client at a YAML file containing the needed information. This is a text file with two lines like the following:

endpoint: https://app.datarobot.com/api/v2/
token: not-my-real-token

If you want to run this notebook without changes, please save your configuration in a file located under your home directory called ~/.config/datarobot/drconfig.yaml.
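If your installed version of the client supports environment-variable configuration (recent releases document the DATAROBOT_ENDPOINT and DATAROBOT_API_TOKEN variables; verify against your version’s documentation), the same two values can also be supplied without a config file. A minimal sketch:

import os
os.environ['DATAROBOT_ENDPOINT'] = 'https://app.datarobot.com/api/v2/'
os.environ['DATAROBOT_API_TOKEN'] = 'not-my-real-token'  # substitute your real token
dr.Client()  # picks the values up from the environment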

[3]:
# Initialization with arguments
# dr.Client(token='<API TOKEN>', endpoint='https://<YOUR ENDPOINT>/api/v2/')

# Initialization with a config file in the same directory as this notebook
# dr.Client(config_path='drconfig.yaml')

# Initialization with a config file located at
# ~/.config/datarobot/drconfig.yaml
dr.Client()
[3]:
<datarobot.rest.RESTClientObject at 0x115b3f850>

Create the Project

Here, we use the datarobot package to upload a new file and create a project. The project name is optional, but it can be helpful when sorting through many projects on DataRobot.

[4]:
filename = 'DR_Demo_Sales_Multiseries_training.xlsx'
now = datetime.datetime.now().strftime('%Y-%m-%dT%H:%M')
project_name = 'DR_Demo_Sales_Multiseries_{}'.format(now)
proj = dr.Project.create(sourcedata=filename,
                         project_name=project_name,
                         max_wait=3600)
print('Project ID: {}'.format(proj.id))
Project ID: 5c0086ba784cc602226a9e3f
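
If you come back to this analysis in a later session, there is no need to re-upload the data: you can re-attach to the existing project by its ID (printed above) using Project.get.

proj = dr.Project.get('5c0086ba784cc602226a9e3f')  # substitute your own project ID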

Identify Known-In-Advance Features

This dataset has five columns that will always be known-in-advance and available for prediction.

[5]:
known_in_advance = ['Marketing', 'Near_Xmas', 'Near_BlackFriday',
                    'Holiday', 'DestinationEvent']
feature_settings = [dr.FeatureSettings(feat_name,
                                       known_in_advance=True)
                    for feat_name in known_in_advance]
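
As a quick, optional sanity check (assuming the training file sits in the working directory, as elsewhere in this notebook), you can confirm these columns actually exist in the dataset before building the settings:

cols = pd.read_excel('DR_Demo_Sales_Multiseries_training.xlsx').columns
missing = [c for c in known_in_advance if c not in cols]
assert not missing, 'Missing known-in-advance columns: {}'.format(missing)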

Create a Partition Specification

This problem has a time component to it, and it would be bad practice to train on data from the present and predict on the past. We could manually add a column to the dataset to indicate which rows should be used for training, validation, and testing, but it is straightforward to let DataRobot partition the data automatically. This dataset contains sales data from multiple individual stores, so we use multiseries_id_columns to tell DataRobot that the file actually contains multiple time series and to indicate which column identifies the series each row belongs to.

[6]:
time_partition = dr.DatetimePartitioningSpecification(
    datetime_partition_column='Date',
    multiseries_id_columns=['Store'],
    use_time_series=True,
    feature_settings=feature_settings,
)
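
The specification above relies on DataRobot’s defaults for the number of backtests and for the forecast window. DatetimePartitioningSpecification accepts optional arguments to override these; the exact argument set varies by package version, so treat the commented sketch below (with illustrative values) as an assumption to verify against your version’s documentation:

# time_partition = dr.DatetimePartitioningSpecification(
#     datetime_partition_column='Date',
#     multiseries_id_columns=['Store'],
#     use_time_series=True,
#     feature_settings=feature_settings,
#     number_of_backtests=3,     # evaluate models over three backtest folds
#     forecast_window_start=1,   # predict from one time step ahead...
#     forecast_window_end=7,     # ...through seven time steps ahead
# )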

Run the Automated Modeling Process

Now we can start the modeling process. The target for this problem is called Sales, and we let DataRobot automatically select the metric for scoring and comparing models.

The partitioning_method argument tells DataRobot to use the partitioning scheme we specified previously.

Finally, the worker_count parameter specifies how many workers should be used for this project. Passing a value of -1 tells DataRobot to set the worker count to the maximum available to you. You can also specify the exact number of workers to use, but the command will fail if you request more workers than your account allows; if you need more resources than have been allocated to you, consider upgrading your license.

The second command provides a URL that can be used to see the project execute on the DataRobot UI.

The last command in this cell is just a blocking loop that periodically checks on the project to see if it is done, printing out the number of jobs in progress and in the queue along the way so you can see progress. The automated model exploration process will occasionally add more jobs to the queue, so don’t be alarmed if the number of jobs does not strictly decrease over time.

[7]:
proj.set_target(
    target='Sales',
    partitioning_method=time_partition,
    max_wait=3600,
    worker_count=-1
)

print(proj.get_leaderboard_ui_permalink())

proj.wait_for_autopilot()
https://staging.datarobot.com/projects/5c0086ba784cc602226a9e3f/models
In progress: 20, queued: 1 (waited: 0s)
In progress: 20, queued: 1 (waited: 1s)
In progress: 20, queued: 1 (waited: 2s)
In progress: 20, queued: 1 (waited: 3s)
In progress: 20, queued: 1 (waited: 4s)
In progress: 20, queued: 1 (waited: 7s)
In progress: 20, queued: 1 (waited: 11s)
In progress: 20, queued: 1 (waited: 18s)
In progress: 19, queued: 0 (waited: 31s)
In progress: 19, queued: 0 (waited: 52s)
In progress: 17, queued: 0 (waited: 72s)
In progress: 16, queued: 0 (waited: 93s)
In progress: 15, queued: 0 (waited: 114s)
In progress: 13, queued: 0 (waited: 134s)
In progress: 12, queued: 0 (waited: 155s)
In progress: 12, queued: 0 (waited: 175s)
In progress: 10, queued: 0 (waited: 196s)
In progress: 9, queued: 0 (waited: 217s)
In progress: 7, queued: 0 (waited: 238s)
In progress: 6, queued: 0 (waited: 258s)
In progress: 6, queued: 0 (waited: 278s)
In progress: 2, queued: 0 (waited: 299s)
In progress: 1, queued: 0 (waited: 320s)
In progress: 8, queued: 0 (waited: 340s)
In progress: 8, queued: 0 (waited: 360s)
In progress: 8, queued: 0 (waited: 381s)
In progress: 6, queued: 0 (waited: 402s)
In progress: 5, queued: 0 (waited: 422s)
In progress: 5, queued: 0 (waited: 442s)
In progress: 3, queued: 0 (waited: 463s)
In progress: 3, queued: 0 (waited: 483s)
In progress: 3, queued: 0 (waited: 504s)
In progress: 1, queued: 0 (waited: 524s)
In progress: 0, queued: 0 (waited: 545s)
In progress: 1, queued: 0 (waited: 565s)
In progress: 1, queued: 0 (waited: 586s)
In progress: 1, queued: 0 (waited: 606s)
In progress: 1, queued: 0 (waited: 626s)
In progress: 1, queued: 0 (waited: 647s)
In progress: 1, queued: 0 (waited: 667s)
In progress: 0, queued: 0 (waited: 688s)
In progress: 1, queued: 0 (waited: 708s)
In progress: 1, queued: 0 (waited: 728s)
In progress: 1, queued: 0 (waited: 749s)
In progress: 1, queued: 0 (waited: 769s)
In progress: 1, queued: 0 (waited: 790s)
In progress: 1, queued: 0 (waited: 810s)
In progress: 1, queued: 0 (waited: 830s)
In progress: 1, queued: 0 (waited: 851s)
In progress: 1, queued: 0 (waited: 871s)
In progress: 1, queued: 0 (waited: 892s)
In progress: 1, queued: 0 (waited: 912s)
In progress: 0, queued: 0 (waited: 932s)

Choose the Best Model

First, we take a look at the top of the leaderboard. In this example, we choose the model with the lowest backtesting error; as the code below shows, that score is stored under the crossValidation key of a model’s metrics.

[8]:
proj.get_models()[:10]
[8]:
[Model(u'AVG Blender'),
 Model(u'eXtreme Gradient Boosted Trees Regressor with Early Stopping'),
 Model(u'eXtreme Gradient Boosted Trees Regressor with Early Stopping'),
 Model(u'eXtreme Gradient Boosted Trees Regressor with Early Stopping'),
 Model(u'eXtreme Gradient Boosted Trees Regressor with Early Stopping'),
 Model(u'Light Gradient Boosting on ElasticNet Predictions '),
 Model(u'eXtreme Gradient Boosting on ElasticNet Predictions'),
 Model(u'Light Gradient Boosting on ElasticNet Predictions '),
 Model(u'Ridge Regressor with Forecast Distance Modeling'),
 Model(u'eXtreme Gradient Boosting on ElasticNet Predictions')]
[9]:
lb = proj.get_models()
valid_models = [m for m in lb if
                m.metrics[proj.metric]['crossValidation']]
best_model = min(valid_models,
                 key=lambda m: m.metrics[proj.metric]['crossValidation'])

print(best_model.model_type)
print(best_model.get_leaderboard_ui_permalink())
AVG Blender
https://staging.datarobot.com/projects/5c0086ba784cc602226a9e3f/models/5c008a2ce23dec598947eb1d
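
Since the selection above keys on the crossValidation entry of each model’s metrics (the backtesting score for datetime-partitioned projects), it can be useful to print that score for the chosen model:

print('{}: {}'.format(proj.metric,
                      best_model.metrics[proj.metric]['crossValidation']))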

Generate Predictions

This example notebook uses the modeling API to make predictions, which means the predictions are scored on modeling servers. If you have dedicated prediction servers, you should use the prediction API instead for faster performance.

Finish training

First, we unlock the holdout data to fully train the best model. The last command in the next cell prints the URL to examine the fully-trained model in the DataRobot UI.

[10]:
proj.unlock_holdout()
job = best_model.request_frozen_datetime_model()
retrained_model = job.get_result_when_complete()

print(retrained_model.get_leaderboard_ui_permalink())
https://staging.datarobot.com/projects/5c0086ba784cc602226a9e3f/models/5c008b29784cc6020c6a9e8c

Execute a prediction job

First, we find the latest date in the training data. Then, we upload a dataset to predict from, setting the starting forecast_point to be the end of the training data. Finally, we execute the prediction request.

[11]:
d = pd.read_excel('DR_Demo_Sales_Multiseries_training.xlsx')
last_train_date = pd.to_datetime(d['Date']).max()

dataset = proj.upload_dataset(
    'DR_Demo_Sales_Multiseries_prediction.xlsx',
    forecast_point=last_train_date
)

pred_job = retrained_model.request_predictions(dataset_id=dataset.id)
preds = pred_job.get_result_when_complete()

Each row of the resulting predictions contains a prediction of sales at a timestamp for a particular series_id and can be matched to the uploaded prediction dataset through the row_id field (a join sketch follows the sample output below). The forecast_distance is the number of time units after the forecast point for a given row.

[12]:
preds.head()

# we could also write predictions out to a file for subsequent analysis
# preds.to_csv('DR_Demo_Sales_Multiseries_prediction_output.csv', index=False)
[12]:
   forecast_distance               forecast_point     prediction  row_id   series_id                    timestamp
0                  1  2014-06-14T00:00:00.000000Z  148181.314360     714  Louisville  2014-06-15T00:00:00.000000Z
1                  2  2014-06-14T00:00:00.000000Z  139278.257114     715  Louisville  2014-06-16T00:00:00.000000Z
2                  3  2014-06-14T00:00:00.000000Z  139419.155936     716  Louisville  2014-06-17T00:00:00.000000Z
3                  4  2014-06-14T00:00:00.000000Z  135730.704195     717  Louisville  2014-06-18T00:00:00.000000Z
4                  5  2014-06-14T00:00:00.000000Z  140947.763900     718  Louisville  2014-06-19T00:00:00.000000Z
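
Because row_id refers to the row positions of the uploaded prediction file, you can join the predictions back to the original rows. A minimal sketch, assuming the file loads with pandas’ default integer index:

scoring_data = pd.read_excel('DR_Demo_Sales_Multiseries_prediction.xlsx')
# align each prediction with its source row via the positional row_id
merged = scoring_data.merge(preds, left_index=True, right_on='row_id')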