DataRobot Python Package

Getting Started

Installation

You will need the following

  • Python 2.7 or 3.4+
  • DataRobot account
  • pip

Installing for Cloud DataRobot

If you are using the cloud version of DataRobot, the easiest way to get the latest version of the package is:

pip install datarobot

Note

If you are not running in a Python virtualenv, you probably want to use pip install --user datarobot.

Installing for an On-Site Deploy

If you are using an on-site deploy of DataRobot, the latest version of the package is not the most appropriate for you. Contact your CFDS for guidance on the appropriate version range.

pip install "datarobot>=$(MIN_VERSION),<$(EXCLUDE_VERSION)"

For some particular installation of DataRobot, the correct value of $(MIN_VERSION) could be 2.0 with an $(EXCLUDE_VERSION) of 2.3. This ensures that all the features the client expects to be present on the backend will always be correct.

Note

If you are not running in a Python virtualenv, you probably want to use pip install --user "datarobot>=$(MIN_VERSION),<$(MAX_VERSION).

Configuration

Each authentication method will specify credentials for DataRobot, as well as the location of the DataRobot deployment. We currently support configuration using a configuration file, by setting environment variables, or within the code itself.

Credentials

You will have to specify an API token and an endpoint in order to use the client. You can manage your API tokens in the DataRobot webapp, in your profile. This section describes how to use these options. Their order of precedence is as follows, noting that the first available option will be used:

  1. Setting endpoint and token in code using datarobot.Client
  2. Configuring from a config file as specified directly using datarobot.Client
  3. Configuring from a config file as specified by the environment variable DATAROBOT_CONFIG_FILE
  4. Configuring from the environment variables DATAROBOT_ENDPOINT and DATAROBOT_API_TOKEN
  5. Searching for a config file in the home directory of the current user, at ~/.config/datarobot/drconfig.yaml

Note

If you access the DataRobot webapp at https://app.datarobot.com, then the correct endpoint to specify would be https://app.datarobot.com/api/v2. If you have a local installation, update the endpoint accordingly to point at the installation of DataRobot available on your local network.

Set Credentials Explicitly in Code

Explicitly set credentials in code:

import datarobot as dr
dr.Client(token='your_token', endpoint='https://app.datarobot.com/api/v2')

You can also point to a YAML config file to use:

import datarobot as dr
dr.Client(config_path='/home/user/my_datarobot_config.yaml')

Use a Configuration File

You can use a configuration file to specify the client setup.

The following is an example configuration file that should be saved as ~/.config/datarobot/drconfig.yaml:

token: yourtoken
endpoint: https://app.datarobot.com/api/v2

You can specify a different location for the DataRobot configuration file by setting the DATAROBOT_CONFIG_FILE environment variable. Note that if you specify a filepath, you should use an absolute path so that the API client will work when run from any location.

Set Credentials Using Environment Variables

Set up an endpoint by setting environment variables in the UNIX shell:

export DATAROBOT_ENDPOINT='https://app.datarobot.com/api/v2'
export DATAROBOT_API_TOKEN=your_token

Common Issues

This section has examples of cases that can cause issues with using the DataRobot client, as well as known fixes.

InsecurePlatformWarning

On versions of Python earlier than 2.7.9 you might have InsecurePlatformWarning in your output. To prevent this without updating your Python version you should install pyOpenSSL package:

pip install pyopenssl ndg-httpsclient pyasn1

AttributeError: ‘EntryPoint’ object has no attribute ‘resolve’

Some earlier versions of setuptools will cause an error on importing DataRobot. The recommended fix is upgrading setuptools. If you are unable to upgrade setuptools, pinning trafaret to version <=7.4 will correct this issue.

>>> import datarobot as dr
...
File "/home/clark/.local/lib/python2.7/site-packages/trafaret/__init__.py", line 1550, in load_contrib
  trafaret_class = entrypoint.resolve()
AttributeError: 'EntryPoint' object has no attribute 'resolve'

To prevent this upgrade your setuptools:

pip install --upgrade setuptools

ConnectTimeout

If you have a slow connection to your DataRobot installation, you may see a traceback like

ConnectTimeout: HTTPSConnectionPool(host='my-datarobot.com', port=443): Max
retries exceeded with url: /api/v2/projects/
(Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f130fc76150>,
'Connection to my-datarobot.com timed out. (connect timeout=6.05)'))

You can configure a larger connect timeout (the amount of time to wait on each request attempting to connect to the DataRobot server before giving up) using a connect_timeout value in either a configuration file or via a direct call to datarobot.Client.

project.open_leaderboard_browser

Calling the project.open_leaderboard_browser may block if ran with a text-mode browser or running on a server that doesn’t have an ability to open a browser.

Configuration

This section describes all of the settings that can be configured in the DataRobot configuration file. This file is by default looked for inside the user’s home directory at ~/.config/datarobot/drconfig.yaml, but the default location can be overridden by specifying an environment variable DATAROBOT_CONFIG_FILE, or within the code by setting the global client with dr.Client(config_path='/path/to/config.yaml').

Configurable Variables

These are the variables available for configuration for the DataRobot client:

endpoint
This parameter is required. It is the URL of the DataRobot endpoint. For example, the default endpoint on the cloud installation of DataRobot is https://app.datarobot.com/api/v2
token
This parameter is required. It is the token of your DataRobot account. This can be found in the user settings page of DataRobot
connect_timeout
This parameter is optional. It specifies the number of seconds that the client should be willing to wait to establish a connection to the remote server. Users with poor connections may need to increase this value. By default DataRobot uses the value 6.05.
ssl_verify
This parameter is optional. It controls the SSL certificate verification of the DataRobot client. DataRobot is built with the python requests library, and this variable is used as the verify parameter in that library. More information can be found in their documentation. The default value is true, which means that requests will use your computer’s set of trusted certificate chains by default.
max_retries

This parameter is optional. It controls the number of retries to attempt for each connection. More information can be found in the requests documentation. By default, errors implying the request never made it to the server are retried always and read timeouts (where the request began running and did not finish) are not retried. More granular control by be acquired by passing a Retry object from urllib3 into a direct instantiation of dr.Client.

import datarobot as dr
dr.Client(endpoint='https://app.datarobot.com/api/v2', token='this-is-a-fake-token',
          max_retries=Retry(connect=3, read=3))

Proxy support

DataRobot API can work behind a non-transparent HTTP proxy server. Please set environment variable HTTP_PROXY containing proxy URL to route all the DataRobot traffic through that proxy server, e.g. HTTP_PROXY="http://my-proxy.local:3128" python my_datarobot_script.py.

QuickStart

Note

You must set up credentials in order to access the DataRobot API. For more information, see Credentials

All of the modeling within DataRobot happens within a project. Each project has one dataset that is used as the source from which to train models.

There are three steps required to begin modeling:

  1. Create an empty project.
  2. Upload a data file to model.
  3. Select parameters and start training models with the autopilot.

The following command includes these three steps. It is equivalent to choosing all of the default settings recommended by DataRobot.

import datarobot as dr
project = dr.Project.start(project_name='My new project',
                        sourcedata='/home/user/data/last_week_data.csv',
                        target='ItemsPurchased')

Where:

  • name is the name of the new DataRobot project.
  • sourcedata is the path to the dataset.
  • target is the name of the target feature column in the dataset.

Projects

All of the modeling within DataRobot happens within a project. Each project has one dataset that is used as the source from which to train models.

Create a Project

You can use the following command to create a new project. You must specify a path to data file, file object, raw file contents, or a pandas.DataFrame object when creating a new project. Path to file can be either a path to a local file or a publicly accessible URL.

import datarobot as dr
project = dr.Project.create('/home/user/data/last_week_data.csv',
                            project_name='New Project')

You can use the following commands to view the project ID and name:

project.id
>>> u'5506fcd38bd88f5953219da0'
project.project_name
>>> u'New Project'

Select Modeling Parameters

The final information needed to begin modeling includes the target feature, the queue mode, the metric for comparing models, and the optional parameters such as weights, offset, exposure and downsampling.

Target

The target must be the name of one of the columns of data uploaded to the project.

Metric

The optimization metric used to compare models is an important factor in building accurate models. If a metric is not specified, the default metric recommended by DataRobot will be used. You can use the following code to view a list of valid metrics for a specified target:

target_name = 'ItemsPurchased'
project.get_metrics(target_name)
>>> {'available_metrics': [
         'Gini Norm',
         'Weighted Gini Norm',
         'Weighted R Squared',
         'Weighted RMSLE',
         'Weighted MAPE',
         'Weighted Gamma Deviance',
         'Gamma Deviance',
         'RMSE',
         'Weighted MAD',
         'Tweedie Deviance',
         'MAD',
         'RMSLE',
         'Weighted Tweedie Deviance',
         'Weighted RMSE',
         'MAPE',
         'Weighted Poisson Deviance',
         'R Squared',
         'Poisson Deviance'],
     'feature_name': 'SalePrice'}

Partitioning Method

DataRobot projects always have a holdout set used for final model validation. We use two different approaches for testing prior to the holdout set:

  • split the remaining data into training and validation sets
  • cross-validation, in which the remaining data is split into a number of folds; each fold serves as a validation set, with models trained on the other folds and evaluated on that fold.

There are several other options you can control. To specify a partition method, create an instance of one of the Partition Classes, and pass it as the partitioning_method argument in your call to project.set_target or project.start. See here for more information on using datetime partitioning.

Several partitioning methods include parameters for validation_pct and holdout_pct, specifying desired percentages for the validation and holdout sets. Note that there may be constraints that prevent the actual percentages used from exactly (or some cases, even closely) matching the requested percentages.

Queue Mode

You can use the API to set the DataRobot modeling process to run in either automatic or manual mode.

Autopilot mode means that the modeling process will proceed completely automatically, including running recommended models, running at different sample sizes, and blending.

Manual mode means that DataRobot will populate a list of recommended models, but will not insert any of them into the queue. Manual mode lets you select which models to execute before starting the modeling process.

Quick mode means that a smaller set of Blueprints is used, so autopilot finishes faster.

Weights

DataRobot also supports using a weight parameter. A full discussion of the use of weights in data science is not within the scope of this document, but weights are often used to help compensate for rare events in data. You can specify a column name in the project dataset to be used as a weight column.

Offsets

Starting with version v2.6 DataRobot also supports using an offset parameter. Offsets are commonly used in insurance modeling to include effects that are outside of the training data due to regulatory compliance or constraints. You can specify the names of several columns in the project dataset to be used as the offset columns.

Exposure

Starting with version v2.6 DataRobot also supports using an exposure parameter. Exposure is often used to model insurance premiums where strict proportionality of premiums to duration is required. You can specify the name of the column in the project dataset to be used as an exposure column.

Start Modeling

Once you have selected modeling parameters, you can use the following code structure to specify parameters and start the modeling process.

import datarobot as dr
project.set_target(target='ItemsPurchased',
                   metric='Tweedie Deviance',
                   mode=dr.AUTOPILOT_MODE.FULL_AUTO)

You can also pass additional optional parameters to project.set_target to change parameters of the modeling process. Currently supported parameters are:

  • worker_count – int, sets number of workers used for modeling.
  • partitioning_methodPartitioningMethod object.
  • positive_class – str, float, or int; Specifies a level of the target column that should treated as the positive class for binary classification. May only be specified for binary classification targets.
  • advanced_optionsAdvancedOptions object, used to set advanced options of modeling process.
  • target_type – str, override the automaticially selected target_type. An example usage would be setting the target_type=TARGET_TYPE.MULTICLASS when you want to perform a multiclass classification task on a numeric column that has a low cardinality.

You can run with different autopilot modes by changing the parameter to mode. AUTOPILOT_MODE.FULL_AUTO is the default. Other accepted modes include AUTOPILOT_MODE.MANUAL for manual mode (choose your own models to run rather than use the DataRobot autopilot) and AUTOPILOT_MODE.QUICK for quickrun (run on a more limited set of models to get insights more quickly).

Quickly Start a Project

Project creation, file upload and target selection are all combined in Project.start method:

import datarobot as dr
project = dr.Project.start('/home/user/data/last_week_data.csv',
                        target='ItemsPurchased',
                        project_name='New Project')

You can also pass additional optional parameters to Project.start:

  • worker_count – int, sets number of workers used for modeling.
  • metric - str, name of metric to use.
  • autopilot_on - boolean, defaults to True; set whether or not to begin modeling automatically.
  • blueprint_threshold – int, number of hours the model is permitted to run. Minimum 1.
  • response_cap – float, Quantile of the response distribution to use for response capping. Must be in range 0.5..1.0
  • partitioning_methodPartitioningMethod object.
  • positive_class – str, float, or int; Specifies a level of the target column that should treated as the positive class for binary classification. May only be specified for binary classification targets.
  • target_type – str, override the automaticially selected target_type. An example usage would be setting the target_type=TARGET_TYPE.MULTICLASS when you want to perform a multiclass classification task on a numeric column that has a low cardinality.

Interact with a Project

The following commands can be used to manage DataRobot projects.

List Projects

Returns a list of projects associated with current API user.

import datarobot as dr
dr.Project.list()
>>> [Project(Project One), Project(Two)]

dr.Project.list(search_params={'project_name': 'One'})
>>> [Project(One)]

You can pass following parameters to change result:

  • search_params – dict, used to filter returned projects. Currently you can query projects only by project_name

Get an existing project

Rather than querying the full list of projects every time you need to interact with a project, you can retrieve its id value and use that to reference the project.

import datarobot as dr
project = dr.Project.get(project_id='5506fcd38bd88f5953219da0')
project.id
>>> '5506fcd38bd88f5953219da0'
project.project_name
>>> 'Churn Projection'

Update a project

You can update various attributes of a project.

To update the name of the project:

project.rename(new_name)

To update the number of workers used by your project (this will fail if you request more workers than you have available):

project.set_worker_count(num_workers)

To unlock the holdout set, allowing holdout scores to be shown and models to be trained on more data:

project.unlock_holdout()

Delete a project

Use the following command to delete a project:

project.delete()

Wait for Autopilot to Finish

Once the modeling autopilot is started, in some cases you will want to wait for autopilot to finish:

project.wait_for_autopilot()

Play/Pause the autopilot

If your project is running in autopilot mode, it will continually use available workers, subject to the number of workers allocated to the project and the total number of simultaneous workers allowed according to the user permissions.

To pause a project running in autopilot mode:

project.pause_autopilot()

To resume running a paused project:

project.unpause_autopilot()

Start autopilot on another Featurelist

You can start autopilot on an existing featurelist.

import datarobot as dr

featurelist = project.create_featurelist('test', ['feature 1', 'feature 2'])
project.start_autopilot(featurelist.id, mode=dr.AUTOPILOT_MODE.FULL_AUTO)
>>> True

# Starting autopilot that is already running on the provided featurelist
project.start_autopilot(featurelist.id, mode=dr.AUTOPILOT_MODE.FULL_AUTO)
>>> dr.errors.AppPlatformError

Note

This method should be used on a project where the target has already been set. An error will be raised if autopilot is currently running on or has already finished running on the provided featurelist.

Further reading

The Blueprints and Models sections of this document will describe how to create new models based on the Blueprints recommended by DataRobot.

Datetime Partitioned Projects

If your dataset is modeling events taking place over time, datetime partitioning may be appropriate. Datetime partitioning ensures that when partitioning the dataset for training and validation, rows are ordered according to the value of the date partition feature.

Setting Up a Datetime Partitioned Project

After creating a project and before setting the target, create a DatetimePartitioningSpecification to define how the project should be partitioned. By passing the specification into DatetimePartitioning.generate, the full partitioning can be previewed before finalizing the partitioning. After verifying that the partitioning is correct for the project dataset, pass the specification into Project.set_target via the partitioning_method argument. Once modeling begins, the project can be used as normal.

The following code block shows the basic workflow for creating datetime partitioned projects.

import datarobot as dr

project = dr.Project.create('some_data.csv')
spec = dr.DatetimePartitioningSpecification('my_date_column')
# can customize the spec as needed

partitioning_preview = dr.DatetimePartitioning.generate(project.id, spec)
# the preview generated is based on the project's data

print partitioning_preview.to_dataframe()
# hmm ... I want more backtests
spec.number_of_backtests = 5
partitioning_preview = dr.DatetimePartitioning.generate(project.id, spec)
print partitioning_preview.to_dataframe()
# looks good

project.set_target('target_column', partitioning_method=spec)

Modeling with a Datetime Partitioned Project

While Model objects can still be used to interact with the project, DatetimeModel objects, which are only retrievable from datetime partitioned projects, provide more information including which date ranges and how many rows are used in training and scoring the model as well as scores and statuses for individual backtests.

The autopilot workflow is the same as for other projects, but to manually train a model, Project.train_datetime and Model.train_datetime should be used in the place of Project.train and Model.train. To create frozen models, Model.request_frozen_datetime_model should be used in place of DatetimeModel.request_frozen_datetime_model. Unlike other projects, to trigger computation of scores for all backtests use DatetimeModel.score_backtests instead of using the scoring_type argument in the train methods.

Dates, Datetimes, and Durations

When specifying a date or datetime for datetime partitioning, the client expects to receive and will return a datetime. Timezones may be specified, and will be assumed to be UTC if left unspecified. All dates returned from DataRobot are in UTC with a timezone specified.

Datetimes may include a time, or specify only a date; however, they may have a non-zero time component only if the partition column included a time component in its date format. If the partition column included only dates like “24/03/2015”, then the time component of any datetimes, if present, must be zero.

When date ranges are specified with a start and an end date, the end date is exclusive, so only dates earlier than the end date are included, but the start date is inclusive, so dates equal to or later than the start date are included. If the start and end date are the same, then no dates are included in the range.

Durations are specified using a subset of ISO8601. Durations will be of the form PnYnMnDTnHnMnS where each “n” may be replaced with an integer value. Within the duration string,

  • nY represents the number of years
  • the nM following the “P” represents the number of months
  • nD represents the number of days
  • nH represents the number of hours
  • the nM following the “T” represents the number of minutes
  • nS represents the number of seconds

and “P” is used to indicate that the string represents a period and “T” indicates the beginning of the time component of the string. Any section with a value of 0 may be excluded. As with datetimes, if the partition column did not include a time component in its date format, the time component of any duration must be either unspecified or consist only of zeros.

Example Durations:

  • “P3Y6M” (three years, six months)
  • “P1Y0M0DT0H0M0S” (one year)
  • “P1Y5DT10H” (one year, 5 days, 10 hours)

datarobot.helpers.partitioning_methods.construct_duration_string is a helper method that can be used to construct appropriate duration strings.

Time Series Projects

Time series projects, like OTV projects, use datetime partitioning, and all the workflow changes that apply to other datetime partitioned projects also apply to them. Unlike other projects, time series projects produce different types of models which forecast multiple future predictions instead of an individual prediction for each row.

DataRobot uses a general time series framework to configure how time series features are created and what future values the models will output. This framework consists of a Forecast Point (defining a time a prediction is being made), a Feature Derivation Window (a rolling window used to create features), and a Forecast Window (a rolling window of future values to predict). These components are described in more detail below.

Time series projects will automatically transform the dataset provided in order to apply this framework. During the transformation, DataRobot uses the Feature Derivation Window to derive time series features (such as lags and rolling statistics), and uses the Forecast Window to provide examples of forecasting different distances in the future (such as time shifts). After project creation, a new dataset and a new feature list are generated and used to train the models. This process is reapplied automatically at prediction time as well in order to generate future predictions based on the original data features.

The time_unit and time_step used to define the Feature Derivation and Forecast Windows are taken from the datetime partition column, and can be retrieved for a given column in the input data by looking at the corresponding attributes on the datarobot.Feature object.

Setting Up A Time Series Project

To set up a time series project, follow the standard datetime partitioning workflow and use the six new time series specific parameters on the datarobot.DatetimePartitioningSpecification object:

use_time_series
bool, set this to True to enable time series for the project.
default_to_a_priori
bool, set this to True to default to treating all features as a priori features. Otherwise they will not be handled as a priori features. See the prediction documentation for more information.
feature_derivation_window_start
int, the offset into the past to the start of the feature derivation window.
feature_derivation_window_end
int, the offset into the past to the end of the feature derivation window.
forecast_window_start
int, the offset into the future to the start of the forecast window.
forecast_window_end
int, the offset into the future to the end of the forecast window.
feature_settings
list of FeatureSettings specifying per feature settings, can be left unspecified

Feature Derivation Window

The Feature Derivation window represents the rolling window that is used to derive time series features and lags, relative to the Forecast Point. It is defined in terms of feature_derivation_window_start and feature_derivation_window_end which are integer values representing datetime offsets in terms of the time_unit (e.g. hours or days).

The Feature Derivation Window start and end must be less than or equal to zero, indicating they are positioned before the forecast point. Additionally, the window must be specified as an integer multiple of the time_step which defines the expected difference in time units between rows in the data.

The window is closed, meaning the edges are considered to be inside the window.

Forecast Window

The Forecast Window represents the rolling window of future values to predict, relative to the Forecast Point. It is defined in terms of the forecast_window_start and forecast_window_end, which are positive integer values indicating datetime offsets in terms of the time_unit (e.g. hours or days).

The Forecast Window start and end must be positive integers, indicating they are positioned after the forecast point. Additionally, the window must be specified as an integer multiple of the time_step which defines the expected difference in time units between rows in the data.

The window is closed, meaning the edges are considered to be inside the window.

Multiseries Projects

Certain time series problems represent multiple separate series of data, e.g. “I have five different stores that all have different customer bases. I want to predict how many units of a particular item will sell, and account for the different behavior of each store”. When setting up the project, a column specifying series ids must be identified, so that each row from the same series has the same value in the multiseries id column.

Using a multiseries id column changes which partition columns are eligible for time series, as each series is required to be unique and regular, instead of the entire partition column being required to have those properties. In order to use a multiseries id column for partitioning, a detection job must first be run to analyze the relationship between the partition and multiseries id columns. If needed, it will be automatically triggered by calling datarobot.Feature.get_multiseries_properties() on the desired partition column. The previously computed multiseries properties for a particular partition column can then be accessed via that method. The computation will also be automatically triggered when calling datarobot.DatetimePartitioning.generate() or datarobot.Project.set_target() with a multiseries id column specified.

Note that currently only one multiseries id column is supported, but all interfaces accept lists of id columns to ensure multiple id columns will be able to be supported in the future.

In order to create a multiseries project:

  1. Set up a datetime partitioning specification with the desired partition column and multiseries id columns.
  2. (Optionally) Use datarobot.Feature.get_multiseries_properties() to confirm the inferred time step and time unit of the partition column when used with the specified multiseries id column.
  3. (Optionally) Specify the multiseries id column in order to preview the full datetime partitioning settings using datarobot.DatetimePartitioning.generate().
  4. Specify the multiseries id column when sending the target and partitioning settings via datarobot.Project.set_target().
project = dr.Project.create('path/to/multiseries.csv', project_name='my multiseries project')
partitioning_spec = dr.DatetimePartitioningSpecification(
    'timestamp', use_time_series=True, multiseries_id_columns=['multiseries_id']
)

# manually confirm time step and time unit are as expected
datetime_feature = dr.Feature.get(project.id, 'timestamp')
multiseries_props = datetime_feature.get_multiseries_properties(['multiseries_id'])
print(multiseries_props)

# manually check out the partitioning settings like feature derivation window and backtests
# to make sure they make sense before moving on
full_part = dr.DatetimePartitioning.generate(project.id, partitioning_spec)
print(full_part.feature_derivation_window_start, full_part.feature_derivation_window_end)
print(full_part.to_dataframe())

# finalize the project and start the autopilot
project.set_target('target', partitioning_method=partitioning_spec)

Feature Settings

datarobot.FeatureSettings constructor receives feature_name and settings. For now only a_priori setting supported.

# I have 10 features, 8 of them are a priori and two are not
not_a_priori_features = ['previous_day_sales', 'amount_in_stock']
feature_settings = [FeatureSettings(feat_name, a_priori=False) for feat_name in not_a_priori_features]
spec = DatetimePartitioningSpecification(
    # ...
    default_to_a_priori=True,
    feature_settings=feature_settings
)

Modeling Data and Time Series Features

In time series projects, a new set of modeling features is created after setting the partitioning options. If a featurelist is specified with the partitioning options, it will be used to select which features should be used to derived modeling features; if a featurelist is not specified, the default featurelist will be used.

These features are automatically derived from those in the project’s dataset and are the features used for modeling - note that the Project methods get_featurelists and get_modeling_featurelists will return different data in time series projects. Modeling featurelists are the ones that can be used for modeling and will be accepted by the backend, while regular featurelists will continue to exist but cannot be used. Modeling features are only accessible once the target and partitioning options have been set. In projects that don’t use time series modeling, once the target has been set, modeling and regular features and featurelists will behave the same.

Making Predictions

Prediction datasets are uploaded as normal. However, when uploading a prediction dataset, a new parameter forecast_point can be specified. The forecast point of a prediction dataset identifies the point in time relative which predictions should be generated, and if one is not specified when uploading a dataset, the server will choose the most recent possible forecast point. The forecast window specified when setting the partitioning options for the project determines how far into the future from the forecast point predictions should be calculated.

When setting up a time series project, input features could be identified as a priori features. These features are not used to generate lags, and are expected to be known for the rows in the forecast window at predict time (e.g. “how much money will have been spent on marketing”, “is this a holiday”).

When uploading datasets to a time series project, the dataset might look something like the following, if “Time” is the datetime partition column, “Target” is the target column, and “Temp.” is an input feature. If the dataset was uploaded with a forecast point of “2017-01-08” and during partitioning the feature derivation window start and end were set to -5 and -3 and the forecast window start and end were set to 1 and 3, then rows 1 through 3 are historical data, row 6 is the forecast point, and rows 7 though 9 are forecast rows that will have predictions when predictions are computed.

Row, Time, Target, Temp.
1, 2017-01-03, 16443, 72
2, 2017-01-04, 3013, 72
3, 2017-01-05, 1643, 68
4, 2017-01-06, ,
5, 2017-01-07, ,
6, 2017-01-08, ,
7, 2017-01-09, ,
8, 2017-01-10, ,
9, 2017-01-11, ,

On the other hand, if the project instead used “Holiday” as an a priori input feature, the uploaded dataset might look like the following.

Row, Time, Target, Holiday
1, 2017-01-03, 16443, TRUE
2, 2017-01-04, 3013, FALSE
3, 2017-01-05, 1643, FALSE
4, 2017-01-06, , FALSE
5, 2017-01-07, , FALSE
6, 2017-01-08, , FALSE
7, 2017-01-09, , TRUE
8, 2017-01-10, , FALSE
9, 2017-01-11, , FALSE

Blueprints

The set of computation paths that a dataset passes through before producing predictions from data is called a blueprint. A blueprint can be trained on a dataset to generate a model.

Quick Reference

The following code block summarizes the interactions available for blueprints.

# Get the set of blueprints recommended by datarobot
import datarobot as dr
my_projects = dr.Project.list()
project = my_projects[0]
menu = project.get_blueprints()

first_blueprint = menu[0]
project.train(first_blueprint)

List Blueprints

When a file is uploaded to a project and the target is set, DataRobot recommends a set of blueprints that are appropriate for the task at hand. You can use the get_blueprints method to get the list of blueprints recommended for a project:

project = dr.Project.get('5506fcd38bd88f5953219da0')
menu = project.get_blueprints()
blueprint = menu[0]

Get a blueprint

If you already have a blueprint_id from a model you can retrieve the blueprint directly.

project_id = '5506fcd38bd88f5953219da0'
project = dr.Project.get(project_id)
models = project.get_models()
model = models[0]
blueprint = Blueprint.get(project_id, model.blueprint_id)

Get a blueprint chart

For all blueprints - either from blueprint menu or already used in model - you can retrieve its chart. You can also get its representation in graphviz DOT format to render it into format you need.

project_id = '5506fcd38bd88f5953219da0'
blueprint_id = '4321fcd38bd88f595321554223'
bp_chart = BlueprintChart.get(project_id, blueprint_id)
print(bp_chart.to_graphviz())

Get a blueprint documentation

You can retrieve documentation on tasks used in blueprint. It will contain information about task, its parameters and (when available) links and references to additional sources. All documents are instances of BlueprintTaskDocument class.

project_id = '5506fcd38bd88f5953219da0'
blueprint_id = '4321fcd38bd88f595321554223'
bp = Blueprint.get(project_id, blueprint_id)
docs = bp.get_documents()
print(docs[0].task)
>>> Average Blend
print(docs[0].links[0]['url'])
>>> https://en.wikipedia.org/wiki/Ensemble_learning

Blueprint Attributes

The Blueprint class holds the data required to use the blueprint for modeling. This includes the blueprint_id and project_id. There are also two attributes that help distinguish blueprints: model_type and processes.

print(blueprint.id)
>>> u'8956e1aeecffa0fa6db2b84640fb3848'
print(blueprint.project_id)
>>> u5506fcd38bd88f5953219da0'
print(blueprint.model_type)
>>> Logistic Regression
print(blueprint.processes)
>>> [u'One-Hot Encoding',
     u'Missing Values Imputed',
     u'Standardize',
     u'Logistic Regression']

Create a Model from a Blueprint

You can use a blueprint instance to train a model. The default dataset for the project is used.

model_job_id = project.train(blueprint, sample_pct=25)

This method will put a new modeling job into the queue and returns id of created ModelJob. You can pass ModelJob id to wait_for_async_model_creation function, that polls async model creation status and returns newly created model when it’s finished.

Models

When a blueprint has been trained on a specific dataset at a specified sample size, the result is a model. Models can be inspected to analyze their accuracy.

Quick Reference

# Get all models of an existing project

import datarobot as dr
my_projects = dr.Project.list()
project = my_projects[0]
models = project.get_models()

List Finished Models

You can use the get_models method to return a list of the project models that have finished training:

import datarobot as dr
project = dr.Project.get('5506fcd38bd88f5953219da0')
models = project.get_models()
print(models[:5])
>>> [Model(Decision Tree Classifier (Gini)),
     Model(Auto-tuned K-Nearest Neighbors Classifier (Minkowski Distance)),
     Model(Gradient Boosted Trees Classifier (R)),
     Model(Gradient Boosted Trees Classifier),
     Model(Logistic Regression)]
model = models[0]

project.id
>>> u'5506fcd38bd88f5953219da0'
model.id
>>> u'5506fcd98bd88f1641a720a3'

You can pass following parameters to change result:

  • search_params – dict, used to filter returned projects. Currently you can query models by

    • name
    • sample_pct
  • order_by – str or list, if passed returned models are ordered by this attribute or attributes.

  • with_metric – str, If not None, the returned models will only have scores for this metric. Otherwise all the metrics are returned.

List Models Example:

Project('pid').get_models(order_by=['-created_time', 'sample_pct', 'metric'])

# Getting models that contain "Ridge" in name
# and with sample_pct more than 64
Project('pid').get_models(
    search_params={
        'sample_pct__gt': 64,
        'name': "Ridge"
    })

Retrieve a Known Model

If you know the model_id and project_id values of a model, you can retrieve it directly:

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
                     model_id=model_id)

You can also use an instance of Project as the parameter for get

model = dr.Model.get(project=project,
                     model_id=model_id)

Train a Model on a Different Sample Size

One of the key insights into a model and the data behind it is how its performance varies with more training data. In Autopilot mode, DataRobot will run at several sample sizes by default, but you can also create a job that will run at a specific sample size. You can also specify featurelist that should be used for training of new model and scoring type. train method of Model instance will put a new modeling job into the queue and return id of created ModelJob. You can pass ModelJob id to wait_for_async_model_creation function, that polls async model creation status and returns newly created model when it’s finished.

model_job_id = model.train(sample_pct=33)

# retraining model on custom featurelist using cross validation
import datarobot as dr
model_job_id = model.train(
    sample_pct=55,
    featurelist_id=custom_featurelist.id,
    scoring_type=dr.SCORING_TYPE.cross_validation,
)

Find the Features Used

Because each project can have many associated featurelists, it is important to know which features a model requires in order to run. This helps ensure that the the necessary features are provided when generating predictions.

feature_names = model.get_features_used()
print(feature_names)
>>> ['MonthlyIncome',
     'VisitsLast8Weeks',
     'Age']

Feature Impact

Feature Impact measures how much worse a model’s error score would be if DataRobot made predictions after randomly shuffling a particular column (a technique sometimes called Permutation Importance).

The following example code snippet shows how a featurelist with just the features with the highest feature impact could be created.

import datarobot as dr

max_num_features = 10
time_to_wait_for_impact = 4 * 60  # seconds

try:
    feature_impacts = model.get_feature_impact()  # if they've already been computed
except dr.errors.ClientError as e:
    assert e.status_code == 404  # the feature impact score haven't been computed yet
    impact_job = model.request_feature_impact()
    feature_impacts = impact_job.get_result_when_complete(time_to_wait_for_impact)

feature_impacts.sort(key=lambda x: x['impactNormalized'], reverse=True)
final_names = [f['featureName'] for f in feature_impacts[:max_num_features]]

project.create_featurelist('highest_impact', final_names)

Predict new data

After creating models you can use them to generate predictions on new data. See PredictJob for further information on how to request predictions from a model.

Model IDs Vs. Blueprint IDs

Each model has both an model_id and a blueprint_id. What is the difference between these two IDs?

A model is the result of training a blueprint on a dataset at a specified sample percentage. The blueprint_id is used to keep track of which blueprint was used to train the model, while the model_id is used to locate the trained model in the system.

Model parameters

Some models can have parameters that provide data needed to reproduce its predictions.

For additional usage information see DataRobot documentation, section “Coefficients tab and pre-processing details”

import datarobot as dr

model = dr.Model.get(project=project, model_id=model_id)
mp = model.get_parameters()
print mp.derived_features
>>> [{
         'coefficient': -0.015,
         'originalFeature': u'A1Cresult',
         'derivedFeature': u'A1Cresult->7',
         'type': u'CAT',
         'transformations': [{'name': u'One-hot', 'value': u"'>7'"}]
    }]

Create a Blender

You can blend multiple models; in many cases, the resulting blender model is more accurate than the parent models. To do so you need to select parent models and a blender method from datarobot.enums.BLENDER_METHOD.

Be aware that the tradeoff for better prediction accuracy is bigger resource consumption and slower predictions.

import datarobot as dr

pr = dr.Project.get(pid)
models = pr.get_models()
parent_models = [model.id for model in models[:2]]
pr.blend(parent_models, dr.enums.BLENDER_METHOD.AVERAGE)

Lift chart retrieval

You can use Model methods get_lift_chart and get_all_lift_charts to retrieve lift chart data. First will get it from specific source (validation data, cross validation or holdout, if holdout unlocked) and second will list all available data. Please refer to Advanced model information notebook for additional information about lift charts and how they can be visualised.

ROC curve retrieval

Same as with the lift chart you can use Model methods get_roc_curve and get_all_roc_curves to retrieve ROC curve data. Please refer to Advanced model information notebook for additional information about ROC curves and how they can be visualised. More information about working with ROC curves can be found in DataRobot web application documentation section “ROC Curve tab details”.

Word Cloud

If your dataset contains text columns, DataRobot can create text processing models that will contain word cloud insight data. An example of such model is any “Auto-Tuned Word N-Gram Text Modeler” model. You can use Model.get_word_cloud method to retrieve those insights - it will provide up to 200 most important ngrams in the model and data about their influence. The Advanced model information notebook contains examples of how you can use that data and build a visualization in a way similar to how the DataRobot webapp does.

Scoring Code

Subset of models in DataRobot supports code generation. For each of those models you can download a JAR file with scoring code to make predictions locally using method Model.download_scoring_code. For details on how to do that see “Code Generation” section in DataRobot web application documentation. Optionally you can download source code in Java to see what calculations those models do internally.

Be aware that source code JAR isn’t compiled so it cannot be used for making predictions.

Get a model blueprint chart

For all models you can retrieve its blueprint chart. You can also get its representation in graphviz DOT format to render it into format you need.

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
                     model_id=model_id)
bp_chart = model.get_model_blueprint_chart()
print(bp_chart.to_graphviz())

Get a blueprint documentation

You can retrieve documentation on tasks used to build a model. It will contain information about task, its parameters and (when available) links and references to additional sources. All documents are instances of BlueprintTaskDocument class.

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
                     model_id=model_id)
docs = model.get_model_blueprint_documents()
print(docs[0].task)
>>> Average Blend
print(docs[0].links[0]['url'])
>>> https://en.wikipedia.org/wiki/Ensemble_learning

Request training predictions

You can request a model’s predictions for a particular subset of its training data. See datarobot.models.Model.request_training_predictions() reference for all the valid subsets.

See training predictions reference for more details.

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
                     model_id=model_id)
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.HOLDOUT)
training_predictions = training_predictions_job.get_result_when_complete()
for row in training_predictions.iterate_rows():
    print(row.row_id, row.prediction)

Jobs

The Job (API reference) class is a generic representation of jobs running through a project’s queue. Many tasks involved in modeling, such as creating a new model or computing feature impact for a model, will use a job to track the worker usage and progress of the associated task.

Checking the Contents of the Queue

To see what jobs running or waiting in the queue for a project, use the Project.get_all_jobs method.

from datarobot.enums import QUEUE_STATUS

jobs_list = project.get_all_jobs()  # gives all jobs queued or inprogress
jobs_by_type = {}
for job in jobs_list:
    if job.job_type not in jobs_by_type:
        jobs_by_type[job.job_type] = [0, 0]
    if job.status == QUEUE_STATUS.QUEUE:
        jobs_by_type[job.job_type][0] += 1
    else:
        jobs_by_type[job.job_type][1] += 1
for type in jobs_by_type:
    (num_queued, num_inprogress) = jobs_by_type[type]
    print('{} jobs: {} queued, {} inprogress'.format(type, num_queued, num_inprogress)')

Cancelling a Job

If a job is taking too long to run or no longer necessary, it can be cancelled easily from the Job object.

from datarobot.enums import QUEUE_STATUS

project.pause_autopilot()
bad_jobs = project.get_all_jobs(status=QUEUE_STATUS.QUEUE)
for job in bad_jobs:
    job.cancel()
project.unpause_autopilot()

Retrieving Results From a Job

Once you’ve found a particular job of interest, you can retrieve the results once it is complete. Note that the type of the returned object will vary depending on the job_type. All return types are documented in Job.get_result.

from datarobot.enums import JOB_TYPE

time_to_wait = 60 * 60  # how long to wait for the job to finish (in seconds) - i.e. an hour
assert my_job.job_type == JOB_TYPE.MODEL
my_model = my_job.get_result_when_complete(max_wait=time_to_wait)

ModelJobs

Model creation is asynchronous process. This means than when explicitly invoking new model creation (with project.train or model.train for example) all you get is id of process, responsible for model creation. With this id you can get info about model that is being created or the model itself, when creation process is finished. For this you should use ModelJob (API reference) class.

Get an existing ModelJob

To retrieve existing ModelJob use ModelJob.get method. For this you need id of Project that is used for model creation and id of ModelJob. Having ModelJob might be useful if you want to know parameters of model creation, automatically chosen by API backend, before actual model was created.

If model is already created, ModelJob.get will raise PendingJobFinished exception

import time

import datarobot as dr

blueprint_id = '5506fcd38bd88f5953219da0'
model_job_id = project.train(blueprint_id)
model_job = dr.ModelJob.get(project=project.id,
                            model_job_id=model_job_id)
model_job.sample_pct
>>> 64.0

# wait for model to be created (in a very inefficient way)
time.sleep(10 * 60)
model_job = dr.ModelJob.get(project=project.id,
                            model_job_id=model_job_id)
>>> datarobot.errors.PendingJobFinished

Get created model

After model is created, you can use ModelJob.get_model to get newly created model.

import datarobot as dr

model = dr.ModelJob.get_model(project=project.id,
                              model_job_id=model_job_id)

wait_for_async_model_creation function

If you just want to get created model after getting ModelJob id, you can use wait_for_async_model_creation function. It will poll for status of model creation process until it’s finished, and then will return newly created model.

from datarobot.models.modeljob import wait_for_async_model_creation

# used during training based on blueprint
model_job_id = project.train(blueprint, sample_pct=33)
new_model = wait_for_async_model_creation(
    project_id=project.id,
    model_job_id=model_job_id,
)

# used during training based on existing model
model_job_id = existing_model.train(sample_pct=33)
new_model = wait_for_async_model_creation(
    project_id=existing_model.project_id,
    model_job_id=model_job_id,
)

Predictions

Predictions generation is an asynchronous process. This means that when starting predictions with Model.request_predictions you will receive back a PredictJob for tracking the process responsible for fulfilling your request.

With this object you can get info about the predictions generation process before it has finished and be rerouted to the predictions themselves when the process is finished. For this you should use the PredictJob (API reference) class.

Starting predictions generation

Before actually requesting predictions, you should upload the dataset you wish to predict via Project.upload_dataset. Previously uploaded datasets can be seen under Project.get_datasets. When uploading the dataset you can provide the path to a local file, a file object, raw file content, a pandas.DataFrame object, or the url to a publicly available dataset.

To start predicting on new data using a finished model use Model.request_predictions. It will create a new predictions generation process and return a PredictJob object tracking this process. With it, you can monitor an existing PredictJob and retrieve generated predictions when the corresponding PredictJob is finished.

import datarobot as dr

project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
project = dr.Project.get(project_id)
model = dr.Model.get(project=project_id,
                     model_id=model_id)

# Using path to local file to generate predictions
dataset_from_path = project.upload_dataset('./data_to_predict.csv')

# Using file object to generate predictions
with open('./data_to_predict.csv') as data_to_predict:
    dataset_from_file = project.upload_dataset(data_to_predict)

predict_job_1 = model.request_predictions(dataset_from_path.id)
predict_job_2 = model.request_predictions(dataset_from_file.id)

Get an existing PredictJob

To retrieve an existing PredictJob use the PredictJob.get method. This will give you a PredictJob matching the latest status of the job if it has not completed.

If predictions have finished building, PredictJob.get will raise a PendingJobFinished exception.

import time

import datarobot as dr

predict_job = dr.PredictJob.get(project_id=project_id,
                                predict_job_id=predict_job_id)
predict_job.status
>>> 'queue'

# wait for generation of predictions (in a very inefficient way)
time.sleep(10 * 60)
predict_job = dr.PredictJob.get(project_id=project_id,
                                predict_job_id=predict_job_id)
>>> dr.errors.PendingJobFinished

# now the predictions are finished
predictions = dr.PredictJob.get_predictions(project_id=project.id,
                                            predict_job_id=predict_job_id)

Get generated predictions

After predictions are generated, you can use PredictJob.get_predictions to get newly generated predictions.

If predictions have not yet been finished, it will raise a JobNotFinished exception.

import datarobot as dr

predictions = dr.PredictJob.get_predictions(project_id=project.id,
                                            predict_job_id=predict_job_id)

Wait for and Retrieve results

If you just want to get generated predictions from a PredictJob, you can use the PredictJob.get_result_when_complete function. It will poll the status of predictions generation process until it has finished, and then will return predictions.

dataset = project.get_datasets()[0]
predict_job = model.request_predictions(dataset.id)
predictions = predict_job.get_result_when_complete()

DataRobot Prime

DataRobot Prime is a premium feature intended to allow downloading executable code approximating models. If the feature is unavailable to you, please contact your Account Representative. For more information about this feature, see the documentation within the DataRobot webapp.

Approximate a Model

Given a Model you wish to approximate, Model.request_approximation will start a job creating several Ruleset objects approximating the parent model. Each of those rulesets will identify how many rules were used to approximate the model, as well as the validation score the approximation achieved.

rulesets_job = model.request_approximation()
rulesets = rulesets_job.get_result_when_complete()
for ruleset in rulesets:
    info = (ruleset.id, ruleset.rule_count, ruleset.score)
    print('id: {}, rule_count: {}, score: {}'.format(*info))

Prime Models vs. Models

Given a ruleset, you can create a model based on that ruleset. We consider such models to be Prime models. The PrimeModel class inherits from the Model class, so anything a Model can do, as PrimeModel can do as well.

The PrimeModel objects available within a Project can be listed by project.get_prime_models, or a particular one can be retrieve via PrimeModel.get. If a ruleset has not yet had a model built for it, ruleset.request_model can be used to start a job to make a PrimeModel using a particular ruleset.

rulesets = parent_model.get_rulesets()
selected_ruleset = sorted(rulesets, key=lambda x: x.score)[-1]
if selected_ruleset.model_id:
    prime_model = PrimeModel.get(selected_ruleset.project_id, selected_ruleset.model_id)
else:
    prime_job = selected_ruleset.request_model()
    prime_model = prime_job.get_result_when_complete()

The PrimeModel class has two additional attributes and one additional method. The attributes are ruleset, which is the Ruleset used in the PrimeModel, and parent_model_id which is the id of the model it approximates.

Finally, the new method defined is request_download_validation which is used to prepare code download for the model and is discussed later on in this document.

Retrieving Code from a PrimeModel

Given a PrimeModel, you can download the code used to approximate the parent model, and view and execute it locally.

The first step is to validate the PrimeModel, which runs some basic validation of the generated code, as well as preparing it for download. We use the PrimeFile object to represent code that is ready to download. PrimeFiles can be prepared by the request_download_validation method on PrimeModel objects, and listed from a project with the get_prime_files method.

Once you have a PrimeFile you can check the is_valid attribute to verify the code passed basic validation, and then download it to a local file with download.

validation_job = prime_model.request_download_validation(enums.PRIME_LANGUAGE.PYTHON)
prime_file = validation_job.get_result_when_complete()
if not prime_file.is_valid:
    raise ValueError('File was not valid')
prime_file.download('/home/myuser/drCode/primeModelCode.py')

Reason Codes

To compute reason codes you need to have feature impact computed for a model, and predictions for an uploaded dataset computed with a selected model.

Computing reason codes is a resource-intensive task, but you can configure it with maximum codes, and prediction value thresholds to speed up the process.

Quick Reference

import datarobot as dr
# Get project
my_projects = dr.Project.list()
project = my_projects[0]
# Get model
models = project.get_models()
model = models[0]
# Compute feature impact
impact_job = model.request_feature_impact()
impact_job.wait_for_completion()
# Upload dataset
dataset = project.upload_dataset('./data_to_predict.csv')
# Compute predictions
predict_job = model.request_predictions(dataset.id)
predict_job.wait_for_completion()
# Initialize reason codes
rci_job = dr.ReasonCodesInitialization.create(project.id, model.id)
rci_job.wait_for_completion()
# Compute reason codes with default parameters
rc_job = dr.ReasonCodes.create(project.id, model.id, dataset.id)
rc = rc_job.get_result_when_complete()
# Iterate through predictions with reason codes
for row in rc.get_rows():
    print row.prediction
    print row.reason_codes
# download to a CSV file
rc.download_to_csv('reason_codes.csv')

List Reason Codes

You can use the ReasonCodes.list() method to return a list of reason codes computed for a project’s models:

import datarobot as dr
reason_codes = dr.ReasonCodes.list('58591727100d2b57196701b3')
print(reason_codes)
>>> [ReasonCodes(id=585967e7100d2b6afc93b13b,
                 project_id=58591727100d2b57196701b3,
                 model_id=585932c5100d2b7c298b8acf),
     ReasonCodes(id=58596bc2100d2b639329eae4,
                 project_id=58591727100d2b57196701b3,
                 model_id=585932c5100d2b7c298b8ac5),
     ReasonCodes(id=58763db4100d2b66759cc187,
                 project_id=58591727100d2b57196701b3,
                 model_id=585932c5100d2b7c298b8ac5),
     ...]
rc = reason_codes[0]

rc.project_id
>>> u'58591727100d2b57196701b3'
rc.model_id
>>> u'585932c5100d2b7c298b8acf'

You can pass following parameters to filter the result:

  • model_id – str, used to filter returned reason codes by model_id.
  • limit – int, limit for number of items returned, default: no limit.
  • offset – int, number of items to skip, default: 0.

List Reason Codes Example:

dr.ReasonCodes.list('pid', model_id='model_id', limit=20, offset=100)

Initialize Reason Codes

In order to compute reason codes you have to initialize it for a particular model.

dr.ReasonCodesInitialization.create(project_id, model_id)

Compute Reason Codes

If all prerequisites are in place, you can compute reason codes in the following way:

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
dataset_id = '5506fcd98bd88a8142b725c8'
rc_job = dr.ReasonCodes.create(project_id, model_id, dataset_id,
                               max_codes=2, threshold_low=0.2, threshold_high=0.8)
rc = rc_job.get_result_when_complete()

Where:

  • max_codes are the maximum number of reason codes to compute for each row.
  • threshold_low and threshold_high are thresholds for the value of the prediction of the row. Reason codes will be computed for a row if the row’s prediction value is higher than threshold_high or lower than threshold_low. If no thresholds are specified, reason codes will be computed for all rows.

Retrieving Reason Codes

You have three options for retrieving reason codes.

Note

ReasonCodes.get_all_as_dataframe() and ReasonCodes.download_to_csv() reformat reason codes to match the schema of CSV file downloaded from UI (RowId, Prediction, Reason 1 Strength, Reason 1 Feature, Reason 1 Value, ..., Reason N Strength, Reason N Feature, Reason N Value)

Get reason codes rows one by one as dr.models.reason_codes.ReasonCodesRow objects:

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
reason_codes_id = '5506fcd98bd88f1641a720a3'
rc = dr.ReasonCodes.get(project_id, reason_codes_id)
for row in rc.get_rows():
    print row.reason_codes

Get all rows as pandas.DataFrame:

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
reason_codes_id = '5506fcd98bd88f1641a720a3'
rc = dr.ReasonCodes.get(project_id, reason_codes_id)
reason_codes_df = rc.get_all_as_dataframe()

Download all rows to a file as CSV document:

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
reason_codes_id = '5506fcd98bd88f1641a720a3'
rc = dr.ReasonCodes.get(project_id, reason_codes_id)
rc.download_to_csv('reason_codes.csv')

Adjusted Predictions In Reason Codes

In some projects such as insurance projects, the prediction adjusted by exposure is more useful compared with raw prediction. For example, the raw prediction (e.g. claim counts) is divided by exposure (e.g. time) in the project with exposure column. The adjusted prediction provides insights with regard to the predicted claim counts per unit of time. To include that information, set exclude_adjusted_predictions to False in correspondent method calls.

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
reason_codes_id = '5506fcd98bd88f1641a720a3'
rc = dr.ReasonCodes.get(project_id, reason_codes_id)
rc.download_to_csv('reason_codes.csv', exclude_adjusted_predictions=False)
reason_codes_df = rc.get_all_as_dataframe(exclude_adjusted_predictions=False)

Rating Table

A rating table is an exportable csv representation of a Generalized Additive Model. They contain information about the features and coefficients used to make predictions. Users can influence predictions by downloading and editing values in a rating table, then reuploading the table and using it to create a new model.

See the page about interpreting Generalized Additive Models’ output in the Datarobot user guide for more details on how to interpret and edit rating tables.

Download A Rating Table

You can retrieve a rating table from the list of rating tables in a project:

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
project = dr.Project.get(project_id)
rating_tables = project.get_rating_tables()
rating_table = rating_tables[0]

Or you can retrieve a rating table from a specific model. The model must already exist:

import datarobot as dr
from datarobot.models import RatingTableModel, RatingTable
project_id = '5506fcd38bd88f5953219da0'
project = dr.Project.get(project_id)

# Get model from list of models with a rating table
rating_table_models = project.get_rating_table_models()
rating_table_model = rating_table_models[0]

# Or retrieve model by id. The model must have a rating table.
model_id = '5506fcd98bd88f1641a720a3'
rating_table_model = dr.RatingTableModel.get(project=project_id, model_id=model_id)

# Then retrieve the rating table from the model
rating_table_id = rating_table_model.rating_table_id
rating_table = dr.RatingTable.get(projcet_id, rating_table_id)

Then you can download the contents of the rating table:

rating_table.download('./my_rating_table.csv')

Uploading A Rating Table

After you’ve retrieved the rating table CSV and made the necessary edits, you can re-upload the CSV so you can create a new model from it:

job = dr.RatingTable.create(project_id, model_id, './my_rating_table.csv')
new_rating_table = job.get_result_when_complete()
job = new_rating_table.create_model()
model = job.get_result_when_complete()

Training Predictions

The training predictions interface allows computing and retrieving out-of-sample predictions for a model using the original project dataset. The predictions can be computed for all the rows, or restricted to validation or holdout data. As the predictions generated will be out-of-sample, they can be expected to have different results than if the project dataset were reuploaded as a prediction dataset.

Quick reference

Training predictions generation is an asynchronous process. This means that when starting predictions with datarobot.models.Model.request_training_predictions() you will receive back a datarobot.models.TrainingPredictionsJob for tracking the process responsible for fulfilling your request. Actual predictions may be obtained with the help of a datarobot.models.training_predictions.TrainingPredictions object returned as the result of the training predictions job. There are three ways to retrieve them:

  1. Iterate prediction rows one by one as named tuples:
import datarobot as dr

# Calculate new training predictions on all dataset
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.ALL)
training_predictions = training_predictions_job.get_result_when_complete()

# Fetch rows from API and print them
for prediction in training_predictions.iterate_rows(batch_size=250):
    print(prediction.row_id, prediction.prediction)
  1. Get all prediction rows as a pandas.DataFrame object:
import datarobot from dr

# Calculate new training predictions on holdout partition of dataset
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.HOLDOUT)
training_predictions = training_predictions_job.get_result_when_complete()

# Fetch training predictions as data frame
dataframe = training_predictions.get_all_as_dataframe()
  1. Download all prediction rows to a file as a CSV document:
import datarobot from dr

# Calculate new training predictions on all dataset
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.ALL)
training_predictions = training_predictions_job.get_result_when_complete()

# Fetch training predictions and save them to file
training_predictions.download_to_csv('my-training-predictions.csv')

Model Deployment

Model deployments are records that we create when user deploys a model to dedicated prediction cluster.

Warning

This interface is now deprecated and will be removed in the v2.13 release of the DataRobot client.

Warning

Model Deployments feature is in beta state and requires additional configuration for proper usage. Please contact Support/CFDS for help with setup and usage of model deployment functionality.

Warning

Users can still predict using models which have NOT been deployed. Deployment, in the current state of the system, only means making database records which we then associate monitoring data with. In other words, users can’t access monitoring info for predictions using models without an associated model deployment record.

Creating Model Deployment

To create new ModelDeployment we need to have a project_id and model_id we want to deploy. If we are going to create ModelDeployment of Model that is deployed to instance we need instance_id of this instance. For creation of new ModelDeployment we will use ModelDeployment.create. For new ModelDeployment we will need to set some readable label. It can also have custom description and status.

import datarobot as dr

project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
instance_id = '5a8d4bf9962d7415f7cce05a'
label = 'New Model Deployment'

model_deployment = dr.ModelDeployment.create(label=label, model_id=model_id,
                                             project_id=project_id, instance_id=instance_id)

print(model_deployment.id)
>>> '5a8eabe8962d743607c5009'

Get list of Model Deployments

To retrieve list of all ModelDeployment items we use ModelDeployment.list. List could be queried using query parameter, ordered by order_by and filtered by status parameters. Also we can slice results using limit and offset parameters.

import datarobot as dr

model_deployments = dr.ModelDeployment.list()
print(model_deployments)
>>> [<datarobot.models.model_deployment.ModelDeployment object at 0x7efebf513c10>,
<datarobot.models.model_deployment.ModelDeployment object at 0x7efebf513a50>,
<datarobot.models.model_deployment.ModelDeployment object at 0x7efebf513ad0>]

Get single ModelDeployment

To get ModelDeployment instance we use ModelDeployment.get with model_deployment_id as an argument.

import datarobot as dr

model_deployment_id = '5a8eabe8962d743607c5009'

model_deployment = dr.ModelDeployment.get(model_deployment_id)

print(model_deployment.service_health_messages)
>>> [{'message': 'No successful predictions in 24 hours', 'msg_id': 'NO_GOOD_REQUESTS', :level': 'passing'}]

When we have an instance of ModelDeployment we can update its label, description or status. You can chose status value from datarobot.enums.MODEL_DEPLOYMENT_STATUS

from datarobot.enums import MODEL_DEPLOYMENT_STATUS

model_deployment.update(label='Old deployment', description='Deactivated model deployment',
                        status=MODEL_DEPLOYMENT_STATUS.ARCHIVED)

We can also get service health of ModelDeployment instance using get_service_statistics method. It accepts start_data and end_date as optional parameters for setting period of statistics

model_deployment.get_service_statistics(start_date='2017-01-01')
>>> {'consumers': 0,
     'load': {'median': 0.0, 'peak': 0.0},
     'period': {'end': datetime.datetime(2018, 2, 22, 12, 5, 40, 764294, tzinfo=tzutc()),
     'start': datetime.datetime(2017, 1, 1, 0, 0, tzinfo=tzutc())},
     'server_error_rate': {'current': 0.0, 'previous': 0.0},
     'total_requests': 0,
     'user_error_rate': {'current': 0.0, 'previous': 0.0}}

History of ModelDeployment instance is available via action_log method

model_deployment.action_log()
>>> [{'action': 'created',
      'performed_at': datetime.datetime(2018, 2, 21, 12, 4, 5, 804305),
      'performed_by': {'id': '5a86c0e0e7c354c960cd0540',
       'username': 'user@datarobot.com'}},
     {'action': 'deployed',
      'performed_at': datetime.datetime(2018, 2, 22, 11, 39, 20, 34000),
      'performed_by': {'id': '5a86c0e0e7c354c960cd0540',
       'username': 'user@datarobot.com'}}]

Monotonic Constraints

Training with monotonic constraints allows users to force models to learn monotonic relationships with respect to some features and the target. This helps users create accurate models that comply with regulations (e.g. insurance, banking). Currently, only certain blueprints (e.g. xgboost) support this feature, and it is only supported for regression and binary classification projects. Typically working with monotonic constraints follows the following two workflows:

Workflow one - Running a project with default monotonic constraints

  • set the target and specify default constraint lists for the project
  • when running autopilot or manually training models without overriding constraint settings, all blueprints that support monotonic constraints will use the specified default constraint featurelists

Workflow two - Running a model with specific monotonic constraints

  • create featurelists for monotonic constraints
  • train a blueprint that supports monotonic constraints while specifying monotonic constraint featurelists
  • the specified constraints will be used, regardless of the defaults on the blueprint

Creating featurelists

When specifying monotonic constraints, users must pass a reference to a featurelist containing only the features to be constrained, one for features that should monotonically increase with the target and another for those that should monotonically decrease with the target.

import datarobot as dr
project = dr.Project.get(project_id)
features_mono_up = ['feature_0', 'feature_1']  # features that have monotonically increasing relationship with target
features_mono_down = ['feature_2', 'feature_3']  # features that have monotonically decreasing relationship with target
flist_mono_up = project.create_featurelist(name='mono_up',
                                           features=features_mono_up)
flist_mono_down = project.create_featurelist(name='mono_down',
                                             features=features_mono_down)

Specify default monotonic constraints for a project

When setting the target, the user can specify default monotonic constraints for the project, to ensure that autopilot models use the desired settings, and optionally to ensure that only blueprints supporting monotonic constraints appear in the project. Regardless of the defaults specified during target selection, the user can override them when manually training a particular model.

import datarobot as dr
from datarobot.enums import AUTOPILOT_MODE
advanced_options = dr.AdvancedOptions(
    monotonic_increasing_featurelist_id=flist_mono_up.id,
    monotonic_decreasing_featurelist_id=flist_mono_down.id,
    only_include_monotonic_blueprints=True)
project = dr.Project.get(project_id)
project.set_target(target='target', mode=AUTOPILOT_MODE.FULL_AUTO, advanced_options=advanced_options)

Retrieve models and blueprints using monotonic constraints

When retrieving models, users can inspect to see which supports monotonic constraints, and which actually enforces them. Some models will not support monotonic constraints at all, and some may support constraints but not have any constrained features specified.

import datarobot as dr
project = dr.Project.get(project_id)
models = project.get_models()
# retrieve models that support monotonic constraints
models_support_mono = [model for model in models if model.supports_monotonic_constraints]
# retrieve models that support and enforce monotonic constraints
models_enforce_mono = [model for model in models
                       if (model.monotonic_increasing_featurelist_id or
                           model.monotonic_decreasing_featurelist_id)]

When retrieving blueprints, users can check if they support monotonic constraints and see what default contraint lists are associated with them. The monotonic featurelist ids associated with a blueprint will be used everytime it is trained, unless the user specifically overrides them at model submission time.

import datarobot as dr
project = dr.Project.get(project_id)
blueprints = project.get_blueprints()
# retrieve blueprints that support monotonic constraints
blueprints_support_mono = [blueprint for blueprint in blueprints if blueprint.supports_monotonic_constraints]
# retrieve blueprints that support and enforce monotonic constraints
blueprints_enforce_mono = [blueprint for blueprint in blueprints
                           if (blueprint.monotonic_increasing_featurelist_id or
                               blueprint.monotonic_decreasing_featurelist_id)]

Train a model with specific monotonic constraints

Even after specifiying default settings for the project, users can override them to train a new model with different constraints, if desired.

import datarobot as dr
features_mono_up = ['feature_2', 'feature_3']  # features that have monotonically increasing relationship with target
features_mono_down = ['feature_0', 'feature_1']  # features that have monotonically decreasing relationship with target
project = dr.Project.get(project_id)
flist_mono_up = project.create_featurelist(name='mono_up',
                                           features=features_mono_up)
flist_mono_down = project.create_featurelist(name='mono_down',
                                             features=features_mono_down)
model_job_id = project.train(
    blueprint,
    sample_pct=55,
    featurelist_id=featurelist.id,
    monotonic_increasing_featurelist_id=flist_mono_up.id,
    monotonic_decreasing_featurelist_id=flist_mono_down.id
)

Database Connectivity

Databases are a widely used tool for carrying valuable business data. To enable integration with a variety of enterprise databases, DataRobot provides a “self-service” JDBC platform for database connectivity setup. Once configured, you can read data from production databases for model building and predictions. This allows you to quickly train and retrain models on that data, and avoids the unnecessary step of exporting data from your enterprise database to a CSV for ingest to DataRobot. It allows access to more diverse data, which results in more accurate models.

The steps describing how to set up your database connections use the following terminology:

  • DataStore: A configured connection to a database&mdash; it has a name, a specified driver, and a JDBC URL. You can register data stores with DataRobot for ease of re-use. A data store has one connector but can have many data sources.
  • DataSource: A configured connection to the backing data store (the location of data within a given endpoint). A data source specifies, via SQL query or selected table and schema data, which data to extract from the data store to use for modeling or predictions. A data source has one data store and one connector but can have many datasets.
  • DataDriver: The software that allows the DataRobot application to interact with a database; each data store is associated with one driver (created the admin). The driver configuration saves the storage location in DataRobot of the JAR file and any additional dependency files associated with the driver.
  • Dataset: Data, a file or the content of a data source, at a particular point in time. A data source can produce multiple datasets; a dataset has exactly one data source.

The expected workflow when setting up projects or prediction datasets is:

  1. The administrator sets up a datarobot.DataDriver for accessing a particular database. For any particular driver, this setup is done once for the entire system and then the resulting driver is used by all users.
  2. Users create a datarobot.DataStore which represents an interface to a particular database, using that driver.
  3. Users create a datarobot.DataSource representing a particular set of data to be extracted from the DataStore.
  4. Users create projects and prediction datasets from a DataSource.

Besides the described workflow for creating projects and prediction datasets, users can manage their DataStores and DataSources and admins can manage Drivers by listing, retrieving, updating and deleting existing instances.

Cloud users: This feature is turned off by default. To enable the feature, contact your CFDS or DataRobot Support.

Creating Drivers

The admin should specify class_name, the name of the Java class in the Java archive which implements the java.sql.Driver interface; canonical_name, a user-friendly name for resulting driver to display in the API and the GUI; and files, a list of local files which contain the driver.

>>> import datarobot as dr
>>> driver = dr.DataDriver.create(
...     class_name='org.postgresql.Driver',
...     canonical_name='PostgreSQL',
...     files=['/tmp/postgresql-42.2.2.jar']
... )
>>> driver
DataDriver('PostgreSQL')

Creating DataStores

After the admin has created drivers, any user can use them for DataStore creation. A DataStore represents a JDBC database. When creating them, users should specify type, which currently must be jdbc; canonical_name, a user-friendly name to display in the API and GUI for the DataStore; driver_id, the id of the driver to use to connect to the database; and jdbc_url, the full URL specifying the database connection settings like database type, server address, port, and database name.

>>> import datarobot as dr
>>> data_store = dr.DataStore.create(
...     data_store_type='jdbc',
...     canonical_name='Demo DB',
...     driver_id='5a6af02eb15372000117c040',
...     jdbc_url='jdbc:postgresql://my.db.address.org:5432/perftest'
... )
>>> data_store
DataStore('Demo DB')
>>> data_store.test(username='username', password='password')
{'message': 'Connection successful'}

Creating DataSources

Once users have a DataStore, they can can query datasets via the DataSource entity, which represents a query. When creating a DataSource, users first create a datarobot.DataSourceParameter object from a DataStore’s id and a query, and then create the DataSource with a type, currently always jdbc; a canonical_name, the user-friendly name to display in the API and GUI, and params, the DataSourceParameter object.

>>> import datarobot as dr
>>> params = dr.DataSourceParameters(
...     data_store_id='5a8ac90b07a57a0001be501e',
...     query='SELECT * FROM airlines10mb WHERE "Year" >= 1995;'
... )
>>> data_source = dr.DataSource.create(
...     data_source_type='jdbc',
...     canonical_name='airlines stats after 1995',
...     params=params
... )
>>> data_source
DataSource('airlines stats after 1995')

Creating Projects

Given a DataSource, users can create new projects from it.

>>> import datarobot as dr
>>> project = dr.Project.create_from_data_source(
...     data_source_id='5ae6eee9962d740dd7b86886',
...     username='username',
...     password='password'
... )

Creating Predictions

Given a DataSource, new prediction datasets can be created for any project.

>>> import datarobot as dr
>>> project = dr.Project.get('5ae6f296962d740dd7b86887')
>>> prediction_dataset = project.upload_dataset_from_data_source(
...     data_source_id='5ae6eee9962d740dd7b86886',
...     username='username',
...     password='password'
... )

API Reference

Project API

class datarobot.models.Project(id=None, project_name=None, mode=None, target=None, target_type=None, holdout_unlocked=None, metric=None, stage=None, partition=None, positive_class=None, created=None, advanced_options=None, recommender=None, max_train_pct=None, max_train_rows=None, scaleout_max_train_pct=None, scaleout_max_train_rows=None, file_name=None)

A project built from a particular training dataset

Attributes

id (str) the id of the project
project_name (str) the name of the project
mode (int) the autopilot mode currently selected for the project - 0 for full autopilot, 1 for semi-automatic, and 2 for manual
target (str) the name of the selected target features
target_type (str) Indicating what kind of modeling is being done in this project Options are: ‘Regression’, ‘Binary’ (Binary classification), ‘Multiclass’ (Multiclass classification)
holdout_unlocked (bool) whether the holdout has been unlocked
metric (str) the selected project metric (e.g. LogLoss)
stage (str) the stage the project has reached - one of datarobot.enums.PROJECT_STAGE
partition (dict) information about the selected partitioning options
positive_class (str) for binary classification projects, the selected positive class; otherwise, None
created (datetime) the time the project was created
advanced_options (dict) information on the advanced options that were selected for the project settings, e.g. a weights column or a cap of the runtime of models that can advance autopilot stages
recommender (dict) information on the recommender settings of the project (i.e. whether it is a recommender project, or the id columns)
max_train_pct (float) the maximum percentage of the project dataset that can be used without going into the validation data or being too large to submit any blueprint for training
max_train_rows (int) the maximum number of rows that can be trained on without going into the validation data or being too large to submit any blueprint for training
scaleout_max_train_pct (float) the maximum percentage of the project dataset that can be used to successfully train a scaleout model without going into the validation data. May exceed max_train_pct, in which case only scaleout models can be trained up to this point.
scaleout_max_train_rows (int) the maximum number of rows that can be used to successfully train a scaleout model without going into the validation data. May exceed max_train_rows, in which case only scaleout models can be trained up to this point.
file_name (str) the name of the file uploaded for the project dataset
classmethod get(project_id)

Gets information about a project.

Parameters:

project_id : str

The identifier of the project you want to load.

Returns:

project : Project

The queried project

Examples

import datarobot as dr
p = dr.Project.get(project_id='54e639a18bd88f08078ca831')
p.id
>>>'54e639a18bd88f08078ca831'
p.project_name
>>>'Some project name'
classmethod create(sourcedata, project_name='Untitled Project', max_wait=600, read_timeout=600)

Creates a project with provided data.

Project creation is asynchronous process, which means that after initial request we will keep polling status of async process that is responsible for project creation until it’s finished. For SDK users this only means that this method might raise exceptions related to it’s async nature.

Parameters:

sourcedata : basestring, file or pandas.DataFrame

Data to be used for predictions. If string can be either a path to a local file, url to publicly available file or raw file content. If using a file, the filename must consist of ASCII characters only.

project_name : str, unicode, optional

The name to assign to the empty project.

max_wait : int, optional

Time in seconds after which project creation is considered unsuccessful

read_timeout: int

The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete

Returns:

project : Project

Instance with initialized data.

Raises:

InputNotUnderstoodError

Raised if sourcedata isn’t one of supported types.

AsyncFailureError

Polling for status of async process resulted in response with unsupported status code. Beginning in version 2.1, this will be ProjectAsyncFailureError, a subclass of AsyncFailureError

AsyncProcessUnsuccessfulError

Raised if project creation was unsuccessful

AsyncTimeoutError

Raised if project creation took more time, than specified by max_wait parameter

Examples

p = Project.create('/home/datasets/somedataset.csv',
                   project_name="New API project")
p.id
>>> '5921731dkqshda8yd28h'
p.project_name
>>> 'New API project'
classmethod encrypted_string(plaintext)

Sends a string to DataRobot to be encrypted

This is used for passwords that DataRobot uses to access external data sources

Parameters:

plaintext : str

The string to encrypt

Returns:

ciphertext : str

The encrypted string

classmethod create_from_mysql(*args, **kwargs)

Note

Deprecated in v2.11 in favor of datarobot.models.Project.create_from_data_source().

Create a project from a MySQL table

Parameters:

server : str

The address of the MySQL server

database : str

The name of the database to use

table : str

The name of the table to fetch

user : str

The username to use to access the database

port : int, optional

The port to reach the MySQL server. If not specified, will use the default specified by DataRobot (3306).

prefetch : int, optional

If specified, specifies the number of rows to stream at a time from the database. If not specified, fetches all results at once. This is an optimization for reading from the database

project_name : str, optional

A name to give to the project

password : str, optional

The plaintext password for this user. Will be first encrypted with DataRobot. Only use this _or_ encrypted_password, not both.

encrypted_password : str, optional

The encrypted password for this user. Will be sent directly to DataRobot. Only use this _or_ password, not both.

max_wait : int

The maximum number of seconds to wait before giving up.

Returns:

Project

Raises:

ValueError

If both password and encrypted_password were used.

classmethod create_from_oracle(*args, **kwargs)

Note

Deprecated in v2.11 in favor of datarobot.models.Project.create_from_data_source().

Create a project from an Oracle table

Parameters:

dbq : str

tnsnames.ora entry in host:port/sid format

table : str

The name of the table to fetch

username : str

The username to use to access the database

fetch_buffer_size : int, optional

If specified, specifies the size of buffer that will be used to stream data from the database. Otherwise will use DataRobot default value.

project_name : str, optional

A name to give to the project

password : str, optional

The plaintext password for this user. Will be first encrypted with DataRobot. Only use this _or_ encrypted_password, not both.

encrypted_password : str, optional

The encrypted password for this user. Will be sent directly to DataRobot. Only use this _or_ password, not both.

max_wait : int

The maximum number of seconds to wait before giving up.

Returns:

Project

Raises:

ValueError

If both password and encrypted_password were used.

classmethod create_from_postgresql(*args, **kwargs)

Note

Deprecated in v2.11 in favor of datarobot.models.Project.create_from_data_source().

Create a project from a PostgreSQL table

Parameters:

server : str

The address of the PostgreSQL server

database : str

The name of the database to use

table : str

The name of the table to fetch

username : str

The username to use to access the database

port : int, optional

The port to reach the PostgreSQL server. If not specified, will use the default specified by DataRobot (5432).

driver : str, optional

Specify ODBC driver to use. If not specified - use DataRobot default. See the values within datarobot.enums.POSTGRESQL_DRIVER

fetch : int, optional

If specified, specifies the number of rows to stream at a time from the database. If not specified, use default value in DataRobot.

use_declare_fetch : bool, optional

On True, server will fetch result as available using DB cursor. On False it will try to retrieve entire result set - not recommended for big tables. If not specified - use the default specified by DataRobot.

project_name : str, optional

A name to give to the project

password : str, optional

The plaintext password for this user. Will be first encrypted with DataRobot. Only use this _or_ encrypted_password, not both.

encrypted_password : str, optional

The encrypted password for this user. Will be sent directly to DataRobot. Only use this _or_ password, not both.

max_wait : int

The maximum number of seconds to wait before giving up.

Returns:

Project

Raises:

ValueError

If both password and encrypted_password were used.

classmethod create_from_hdfs(url, port=None, project_name=None, max_wait=600)

Create a project from a datasource on a WebHDFS server.

Parameters:

url : str

The location of the WebHDFS file, both server and full path. Per the DataRobot specification, must begin with hdfs://, e.g. hdfs:///tmp/10kDiabetes.csv

port : int, optional

The port to use. If not specified, will default to the server default (50070)

project_name : str, optional

A name to give to the project

max_wait : int

The maximum number of seconds to wait before giving up.

Returns:

Project

Examples

p = Project.create_from_hdfs('hdfs:///tmp/somedataset.csv',
                             project_name="New API project")
p.id
>>> '5921731dkqshda8yd28h'
p.project_name
>>> 'New API project'
classmethod create_from_data_source(data_source_id, username, password, project_name=None, max_wait=600)

Create a project from a data source. Either data_source or data_source_id should be specified.

Parameters:

data_source_id : str

the identifier of the data source.

username : str

the username for database authentication.

password : str

the password for database authentication. The password is encrypted at server side and never saved / stored.

project_name : str, optional

optional, a name to give to the project.

max_wait : int

optional, the maximum number of seconds to wait before giving up.

Returns:

Project

classmethod from_async(async_location, max_wait=600)

Given a temporary async status location poll for no more than max_wait seconds until the async process (project creation or setting the target, for example) finishes successfully, then return the ready project

Parameters:

async_location : str

The URL for the temporary async status resource. This is returned as a header in the response to a request that initiates an async process

max_wait : int

The maximum number of seconds to wait before giving up.

Returns:

project : Project

The project, now ready

Raises:

ProjectAsyncFailureError

If the server returned an unexpected response while polling for the asynchronous operation to resolve

AsyncProcessUnsuccessfulError

If the final result of the asynchronous operation was a failure

AsyncTimeoutError

If the asynchronous operation did not resolve within the time specified

classmethod start(sourcedata, target, project_name='Untitled Project', worker_count=None, metric=None, autopilot_on=True, blueprint_threshold=None, response_cap=None, partitioning_method=None, positive_class=None, target_type=None)

Chain together project creation, file upload, and target selection.

Parameters:

sourcedata : str or pandas.DataFrame

The path to the file to upload. Can be either a path to a local file or a publicly accessible URL. If the source is a DataFrame, it will be serialized to a temporary buffer. If using a file, the filename must consist of ASCII characters only.

target : str

The name of the target column in the uploaded file.

project_name : str

The project name.

Returns:

project : Project

The newly created and initialized project.

Other Parameters:
 

worker_count : int, optional

The number of workers that you want to allocate to this project.

metric : str, optional

The name of metric to use.

autopilot_on : boolean, default True

Whether or not to begin modeling automatically.

blueprint_threshold : int, optional

Number of hours the model is permitted to run. Minimum 1

response_cap : float, optional

Quantile of the response distribution to use for response capping Must be in range 0.5 .. 1.0

partitioning_method : PartitioningMethod object, optional

It should be one of PartitioningMethod object.

positive_class : str, float, or int; optional

Specifies a level of the target column that should treated as the positive class for binary classification. May only be specified for binary classification targets.

target_type : str, optional

Override the automaticially selected target_type. An example usage would be setting the target_type=’Mutliclass’ when you want to preform a multiclass classification task on a numeric column that has a low cardinality. You can use TARGET_TYPE enum.

Raises:

AsyncFailureError

Polling for status of async process resulted in response with unsupported status code

AsyncProcessUnsuccessfulError

Raised if project creation or target setting was unsuccessful

AsyncTimeoutError

Raised if project creation or target setting timed out

Examples

Project.start("./tests/fixtures/file.csv",
              "a_target",
              project_name="test_name",
              worker_count=4,
              metric="a_metric")
classmethod list(search_params=None)

Returns the projects associated with this account.

Parameters:

search_params : dict, optional.

If not None, the returned projects are filtered by lookup. Currently you can query projects by:

  • project_name
Returns:

projects : list of Project instances

Contains a list of projects associated with this user account.

Raises:

TypeError

Raised if search_params parameter is provided, but is not of supported type.

Examples

List all projects .. code-block:: python

p_list = Project.list() p_list >>> [Project(‘Project One’), Project(‘Two’)]

Search for projects by name .. code-block:: python

Project.list(search_params={‘project_name’: ‘red’}) >>> [Project(‘Predtime’), Project(‘Fred Project’)]
refresh()

Fetches the latest state of the project, and updates this object with that information. This is an inplace update, not a new object.

Returns:

self : Project

the now-updated project

delete()

Removes this project from your account.

set_target(target, mode='auto', metric=None, quickrun=None, worker_count=None, positive_class=None, partitioning_method=None, featurelist_id=None, advanced_options=None, max_wait=600, target_type=None)

Set target variable of an existing project that has a file uploaded to it.

Target setting is asynchronous process, which means that after initial request we will keep polling status of async process that is responsible for target setting until it’s finished. For SDK users this only means that this method might raise exceptions related to it’s async nature.

Parameters:

target : str

Name of variable.

mode : str, optional

You can use AUTOPILOT_MODE enum to choose between

  • AUTOPILOT_MODE.FULL_AUTO
  • AUTOPILOT_MODE.MANUAL
  • AUTOPILOT_MODE.QUICK

If unspecified, FULL_AUTO is used

metric : str, optional

Name of the metric to use for evaluating models. You can query the metrics available for the target by way of Project.get_metrics. If none is specified, then the default recommended by DataRobot is used.

quickrun : bool, optional

Deprecated - pass AUTOPILOT_MODE.QUICK as mode instead. Sets whether project should be run in quick run mode. This setting causes DataRobot to recommend a more limited set of models in order to get a base set of models and insights more quickly.

worker_count : int, optional

The number of concurrent workers to request for this project. If None, then the default is used

partitioning_method : PartitioningMethod object, optional

It should be one of PartitioningMethod object.

positive_class : str, float, or int; optional

Specifies a level of the target column that should treated as the positive class for binary classification. May only be specified for binary classification targets.

featurelist_id : str, optional

Specifies which feature list to use.

advanced_options : AdvancedOptions, optional

Used to set advanced options of project creation.

max_wait : int, optional

Time in seconds after which target setting is considered unsuccessful.

target_type : str, optional

Override the automaticially selected target_type. An example usage would be setting the target_type=’Mutliclass’ when you want to preform a multiclass classification task on a numeric column that has a low cardinality. You can use TARGET_TYPE enum.

Returns:

project : Project

The instance with updated attributes.

Raises:

AsyncFailureError

Polling for status of async process resulted in response with unsupported status code

AsyncProcessUnsuccessfulError

Raised if target setting was unsuccessful

AsyncTimeoutError

Raised if target setting took more time, than specified by max_wait parameter

TypeError

Raised if advanced_options, partitioning_method or target_type is provided, but is not of supported type

See also

Project.start
combines project creation, file upload, and target selection
get_models(order_by=None, search_params=None, with_metric=None)

List all completed, successful models in the leaderboard for the given project.

Parameters:

order_by : str or list of strings, optional

If not None, the returned models are ordered by this attribute. If None, the default return is the order of default project metric.

Allowed attributes to sort by are:

  • metric
  • sample_pct

If the sort attribute is preceded by a hyphen, models will be sorted in descending order, otherwise in ascending order.

Multiple sort attributes can be included as a comma-delimited string or in a list e.g. order_by=`sample_pct,-metric` or order_by=[sample_pct, -metric]

Using metric to sort by will result in models being sorted according to their validation score by how well they did according to the project metric.

search_params : dict, optional.

If not None, the returned models are filtered by lookup. Currently you can query models by:

  • name
  • sample_pct

with_metric : str, optional.

If not None, the returned models will only have scores for this metric. Otherwise all the metrics are returned.

Returns:

models : a list of Model instances.

All of the models that have been trained in this project.

Raises:

TypeError

Raised if order_by or search_params parameter is provided, but is not of supported type.

Examples

Project.get('pid').get_models(order_by=['-sample_pct',
                              'metric'])

# Getting models that contain "Ridge" in name
# and with sample_pct more than 64
Project.get('pid').get_models(
    search_params={
        'sample_pct__gt': 64,
        'name': "Ridge"
    })
get_datetime_models()

List all models in the project as DatetimeModels

Requires the project to be datetime partitioned. If it is not, a ClientError will occur.

Returns:

models : list of DatetimeModel

the datetime models

get_prime_models()

List all DataRobot Prime models for the project Prime models were created to approximate a parent model, and have downloadable code.

Returns:models : list of PrimeModel
get_prime_files(parent_model_id=None, model_id=None)

List all downloadable code files from DataRobot Prime for the project

Parameters:

parent_model_id : str, optional

Filter for only those prime files approximating this parent model

model_id : str, optional

Filter for only those prime files with code for this prime model

Returns:

files: list of PrimeFile

get_datasets()

List all the datasets that have been uploaded for predictions

Returns:datasets : list of PredictionDataset instances
upload_dataset(sourcedata, max_wait=600, read_timeout=600, forecast_point=None, predictions_start_date=None, predictions_end_date=None)

Upload a new dataset to make predictions against

Parameters:

sourcedata : str, file or pandas.DataFrame

Data to be used for predictions. If string can be either a path to a local file, url to publicly available file or raw file content. If using a file on disk, the filename must consist of ASCII characters only.

max_wait : int, optional

The maximum number of seconds to wait for the uploaded dataset to be processed before raising an error

read_timeout : int, optional

The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete

forecast_point : datetime.datetime or None, optional

(New in version v2.8) May only be specified for time series projects, otherwise the upload will be rejected. The time in the dataset relative to which predictions should be generated in a time series project. See the time series documentation for more information.

predictions_start_date : datetime.datetime or None, optional

(New in version v2.11) May only be specified for time series projects. The start date for bulk predictions. This parameter should be provided in conjunction with predictions_end_date. Can’t be provided with forecastPoint parameter.

predictions_end_date : datetime.datetime or None, optional

(New in version v2.11) May only be specified for time series projects. The end date for bulk predictions. This parameter should be provided in conjunction with predictions_start_date. Can’t be provided with forecastPoint parameter.

Returns:

dataset : PredictionDataset

the newly uploaded dataset

Raises:

InputNotUnderstoodError

Raised if sourcedata isn’t one of supported types.

AsyncFailureError

Polling for status of async process resulted in response with unsupported status code

AsyncProcessUnsuccessfulError

Raised if project creation was unsuccessful (i.e. the server reported an error in uploading the dataset)

AsyncTimeoutError

Raised if processing the uploaded dataset took more time than specified by max_wait parameter

ValueError

Raised if forecast_point is provided, but is not of supported type

upload_dataset_from_data_source(data_source_id, username, password, max_wait=600, forecast_point=None)

Upload a new dataset from a data source to make predictions against

Parameters:

data_source_id : str

the identifier of the data source.

username : str

the username for database authentication.

password : str

the password for database authentication. The password is encrypted at server side and never saved / stored.

max_wait : int

optional, the maximum number of seconds to wait before giving up.

forecast_point : datetime.datetime or None, optional

(New in version v2.8) May only be specified for time series projects, otherwise the upload will be rejected. The time in the dataset relative to which predictions should be generated in a time series project. See the time series documentation for more information.

Returns:

dataset : PredictionDataset

the newly uploaded dataset

get_blueprints()

List all blueprints recommended for a project.

Returns:

menu : list of Blueprint instances

All the blueprints recommended by DataRobot for a project

get_features()

List all features for this project

Returns:

list of Feature

all features for this project

get_modeling_features(batch_size=None)

List all modeling features for this project

Only available once the target and partitioning settings have been set. For more information on the distinction between input and modeling features, see the time series documentation<input_vs_modeling>.

Parameters:

batch_size : int, optional

The number of features to retrieve in a single API call. If specified, the client may make multiple calls to retrieve the full list of features. If not specified, an appropriate default will be chosen by the server.

Returns:

list of ModelingFeature

All modeling features in this project

get_featurelists()

List all featurelists created for this project

Returns:

list of Featurelist

all featurelists created for this project

get_modeling_featurelists(batch_size=None)

List all modeling featurelists created for this project

Modeling featurelists can only be created after the target and partitioning options have been set for a project. In time series projects, these are the featurelists that can be used for modeling; in other projects, they behave the same as regular featurelists.

See the time series documentation for more information.

Parameters:

batch_size : int, optional

The number of featurelists to retrieve in a single API call. If specified, the client may make multiple calls to retrieve the full list of features. If not specified, an appropriate default will be chosen by the server.

Returns:

list of ModelingFeaturelist

all modeling featurelists in this project

create_type_transform_feature(name, parent_name, variable_type, replacement=None, date_extraction=None, max_wait=600)

Create a new feature by transforming the type of an existing feature in the project

Note that only the following transformations are supported:

  1. Text to categorical or numeric
  2. Categorical to text or numeric
  3. Numeric to categorical
  4. Date to categorical or numeric

Note

Special considerations when casting numeric to categorical

There are two parameters which can be used for variableType to convert numeric data to categorical levels. These differ in the assumptions they make about the input data, and are very important when considering the data that will be used to make predictions. The assumptions that each makes are:

  • categorical : The data in the column is all integral, and there are no missing values. If either of these conditions do not hold in the training set, the transformation will be rejected. During predictions, if any of the values in the parent column are missing, the predictions will error
  • categoricalInt : New in v2.6 All of the data in the column should be considered categorical in its string form when cast to an int by truncation. For example the value 3 will be cast as the string 3 and the value 3.14 will also be cast as the string 3. Further, the value -3.6 will become the string -3. Missing values will still be recognized as missing.

For convenience these are represented in the enum VARIABLE_TYPE_TRANSFORM with the names CATEGORICAL and CATEGORICAL_INT

Parameters:

name : str

The name to give to the new feature

parent_name : str

The name of the feature to transform

variable_type : str

The type the new column should have. See the values within datarobot.enums.VARIABLE_TYPE_TRANSFORM

replacement : str or float, optional

The value that missing or unconverable data should have

date_extraction : str, optional

Must be specified when parent_name is a date column (and left None otherwise). Specifies which value from a date should be extracted. See the list of values in datarobot.enums.DATE_EXTRACTION

max_wait : int, optional

The maximum amount of time to wait for DataRobot to finish processing the new column. This process can take more time with more data to process. If this operation times out, an AsyncTimeoutError will occur. DataRobot continues the processing and the new column may successfully be constucted.

Returns:

Feature

The data of the new Feature

Raises:

AsyncFailureError

If any of the responses from the server are unexpected

AsyncProcessUnsuccessfulError

If the job being waited for has failed or has been cancelled

AsyncTimeoutError

If the resource did not resolve in time

create_featurelist(name, features)

Creates a new featurelist

Parameters:

name : str

The name to give to this new featurelist. Names must be unique, so an error will be returned from the server if this name has already been used in this project.

features : list of str

The names of the features. Each feature must exist in the project already.

Returns:

Featurelist

newly created featurelist

Raises:

DuplicateFeaturesError

Raised if features variable contains duplicate features

Examples

project = Project.get('5223deadbeefdeadbeef0101')
flists = project.get_featurelists()

# Create a new featurelist using a subset of features from an
# existing featurelist
flist = flists[0]
features = flist.features[::2]  # Half of the features

new_flist = project.create_featurelist(name='Feature Subset',
                                       features=features)
create_modeling_featurelist(name, features)

Create a new modeling featurelist

Modeling featurelists can only be created after the target and partitioning options have been set for a project. In time series projects, these are the featurelists that can be used for modeling; in other projects, they behave the same as regular featurelists.

See the time series documentation for more information.

Parameters:

name : str

the name of the modeling featurelist to create. Names must be unique within the project, or the server will return an error.

features : list of str

the names of the features to include in the modeling featurelist. Each feature must be a modeling feature.

Returns:

featurelist : ModelingFeaturelist

the newly created featurelist

Examples

project = Project.get('1234deadbeeffeeddead4321')
modeling_features = project.get_modeling_features()
selected_features = [feat.name for feat in modeling_features][:5]  # select first five
new_flist = project.create_modeling_featurelist('Model This', selected_features)
get_metrics(feature_name)

Get the metrics recommended for modeling on the given feature.

Parameters:

feature_name : str

The name of the feature to query regarding which metrics are recommended for modeling.

Returns:

names : list of str

The names of the recommended metrics.

get_status()

Query the server for project status.

Returns:

status : dict

Contains:

  • autopilot_done : a boolean.
  • stage : a short string indicating which stage the project is in.
  • stage_description : a description of what stage means.

Examples

{"autopilot_done": False,
 "stage": "modeling",
 "stage_description": "Ready for modeling"}
pause_autopilot()

Pause autopilot, which stops processing the next jobs in the queue.

Returns:

paused : boolean

Whether the command was acknowledged

unpause_autopilot()

Unpause autopilot, which restarts processing the next jobs in the queue.

Returns:

unpaused : boolean

Whether the command was acknowledged.

start_autopilot(featurelist_id)

Starts autopilot on provided featurelist.

Only one autopilot can be running at the time. That’s why any ongoing autopilot on a different featurelist will be halted - modeling jobs in queue would not be affected but new jobs would not be added to queue by the halted autopilot.

Parameters:

featurelist_id : str

Identifier of featurelist that should be used for autopilot

Raises:

AppPlatformError

Raised if autopilot is currently running on or has already finished running on the provided featurelist. Also raised if project’s target was not selected.

train(trainable, sample_pct=None, featurelist_id=None, source_project_id=None, scoring_type=None, training_row_count=None, monotonic_increasing_featurelist_id=<object object>, monotonic_decreasing_featurelist_id=<object object>)

Submit a job to the queue to train a model.

Either sample_pct or training_row_count can be used to specify the amount of data to use, but not both. If neither are specified, a default of the maximum amount of data that can safely be used to train any blueprint without going into the validation data will be selected.

In smart-sampled projects, sample_pct and training_row_count are assumed to be in terms of rows of the minority class.

Note

If the project uses datetime partitioning, use train_datetime instead

Parameters:

trainable : str or Blueprint

For str, this is assumed to be a blueprint_id. If no source_project_id is provided, the project_id will be assumed to be the project that this instance represents.

Otherwise, for a Blueprint, it contains the blueprint_id and source_project_id that we want to use. featurelist_id will assume the default for this project if not provided, and sample_pct will default to using the maximum training value allowed for this project’s partition setup. source_project_id will be ignored if a Blueprint instance is used for this parameter

sample_pct : float, optional

The amount of data to use for training, as a percentage of the project dataset from 0 to 100.

featurelist_id : str, optional

The identifier of the featurelist to use. If not defined, the default for this project is used.

source_project_id : str, optional

Which project created this blueprint_id. If None, it defaults to looking in this project. Note that you must have read permissions in this project.

scoring_type : str, optional

Either SCORING_TYPE.validation or SCORING_TYPE.cross_validation. SCORING_TYPE.validation is available for every partitioning type, and indicates that the default model validation should be used for the project. If the project uses a form of cross-validation partitioning, SCORING_TYPE.cross_validation can also be used to indicate that all of the available training/validation combinations should be used to evaluate the model.

training_row_count : int, optional

The number of rows to use to train the requested model.

monotonic_increasing_featurelist_id : str, optional

(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. Passing None disables increasing monotonicity constraint. Default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.

monotonic_decreasing_featurelist_id : str, optional

(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. Passing None disables decreasing monotonicity constraint. Default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.

Returns:

model_job_id : str

id of created job, can be used as parameter to ModelJob.get method or wait_for_async_model_creation function

Examples

Use a Blueprint instance:

blueprint = project.get_blueprints()[0]
model_job_id = project.train(blueprint, training_row_count=project.max_train_rows)

Use a blueprint_id, which is a string. In the first case, it is assumed that the blueprint was created by this project. If you are using a blueprint used by another project, you will need to pass the id of that other project as well.

blueprint_id = 'e1c7fc29ba2e612a72272324b8a842af'
project.train(blueprint, training_row_count=project.max_train_rows)

another_project.train(blueprint, source_project_id=project.id)

You can also easily use this interface to train a new model using the data from an existing model:

model = project.get_models()[0]
model_job_id = project.train(model.blueprint.id,
                             sample_pct=100)
train_datetime(blueprint_id, featurelist_id=None, training_row_count=None, training_duration=None, source_project_id=None)

Create a new model in a datetime partitioned project

If the project is not datetime partitioned, an error will occur.

Parameters:

blueprint_id : str

the blueprint to use to train the model

featurelist_id : str, optional

the featurelist to use to train the model. If not specified, the project default will be used.

training_row_count : int, optional

the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.

training_duration : str, optional

a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.

source_project_id : str, optional

the id of the project this blueprint comes from, if not this project. If left unspecified, the blueprint must belong to this project.

Returns:

job : ModelJob

the created job to build the model

blend(model_ids, blender_method)

Submit a job for creating blender model. Upon success, the new job will be added to the end of the queue.

Parameters:

model_ids : list of str

List of model ids that will be used to create blender. These models should have completed validation stage without errors, and can’t be blenders, DataRobot Prime or scaleout models.

blender_method : str

Chosen blend method, one from datarobot.enums.BLENDER_METHOD

Returns:

model_job : ModelJob

New ModelJob instance for the blender creation job in queue.

get_all_jobs(status=None)

Get a list of jobs

This will give Jobs representing any type of job, including modeling or predict jobs.

Parameters:

status : QUEUE_STATUS enum, optional

If called with QUEUE_STATUS.INPROGRESS, will return the jobs that are currently running.

If called with QUEUE_STATUS.QUEUE, will return the jobs that are waiting to be run.

If called with QUEUE_STATUS.ERROR, will return the jobs that have errored.

If no value is provided, will return all jobs currently running or waiting to be run.

Returns:

jobs : list

Each is an instance of Job

get_blenders()

Get a list of blender models.

Returns:

list of BlenderModel

list of all blender models in project.

get_frozen_models()

Get a list of frozen models

Returns:

list of FrozenModel

list of all frozen models in project.

get_model_jobs(status=None)

Get a list of modeling jobs

Parameters:

status : QUEUE_STATUS enum, optional

If called with QUEUE_STATUS.INPROGRESS, will return the modeling jobs that are currently running.

If called with QUEUE_STATUS.QUEUE, will return the modeling jobs that are waiting to be run.

If called with QUEUE_STATUS.ERROR, will return the modeling jobs that have errored.

If no value is provided, will return all modeling jobs currently running or waiting to be run.

Returns:

jobs : list

Each is an instance of ModelJob

get_predict_jobs(status=None)

Get a list of prediction jobs

Parameters:

status : QUEUE_STATUS enum, optional

If called with QUEUE_STATUS.INPROGRESS, will return the prediction jobs that are currently running.

If called with QUEUE_STATUS.QUEUE, will return the prediction jobs that are waiting to be run.

If called with QUEUE_STATUS.ERROR, will return the prediction jobs that have errored.

If called without a status, will return all prediction jobs currently running or waiting to be run.

Returns:

jobs : list

Each is an instance of PredictJob

wait_for_autopilot(check_interval=20.0, timeout=86400, verbosity=1)

Blocks until autopilot is finished. This will raise an exception if the autopilot mode is changed from AUTOPILOT_MODE.FULL_AUTO.

It makes API calls to sync the project state with the server and to look at which jobs are enqueued.

Parameters:

check_interval : float or int

The maximum time (in seconds) to wait between checks for whether autopilot is finished

timeout : float or int or None

After this long (in seconds), we give up. If None, never timeout.

verbosity:

This should be VERBOSITY_LEVEL.SILENT or VERBOSITY_LEVEL.VERBOSE. For VERBOSITY_LEVEL.SILENT, nothing will be displayed about progress. For VERBOSITY_LEVEL.VERBOSE, the number of jobs in progress or queued is shown. Note that new jobs are added to the queue along the way.

Raises:

AsyncTimeoutError

If autopilot does not finished in the amount of time specified

RuntimeError

If a condition is detected that indicates that autopilot will not complete on its own

rename(project_name)

Update the name of the project.

Parameters:

project_name : str

The new name

unlock_holdout()

Unlock the holdout for this project.

This will cause subsequent queries of the models of this project to contain the metric values for the holdout set, if it exists.

Take care, as this cannot be undone. Remember that best practice is to select a model before analyzing the model performance on the holdout set

set_worker_count(worker_count)

Sets the number of workers allocated to this project.

Note that this value is limited to the number allowed by your account. Lowering the number will not stop currently running jobs, but will cause the queue to wait for the appropriate number of jobs to finish before attempting to run more jobs.

Parameters:

worker_count : int

The number of concurrent workers to request from the pool of workers

Returns:

url : str

Permanent static hyperlink to a project leaderboard.

open_leaderboard_browser()

Opens project leaderboard in web browser.

Note: If text-mode browsers are used, the calling process will block until the user exits the browser.

get_rating_table_models()

Get a list of models with a rating table

Returns:

list of RatingTableModel

list of all models with a rating table in project.

get_rating_tables()

Get a list of rating tables

Returns:

list of RatingTable

list of rating tables in project.

Partitioning API

class datarobot.RandomCV(holdout_pct, reps, seed=0)

A partition in which observations are randomly assigned to cross-validation groups and the holdout set.

Parameters:

holdout_pct : int

the desired percentage of dataset to assign to holdout set

reps : int

number of cross validation folds to use

seed : int

a seed to use for randomization

class datarobot.StratifiedCV(holdout_pct, reps, seed=0)

A partition in which observations are randomly assigned to cross-validation groups and the holdout set, preserving in each group the same ratio of positive to negative cases as in the original data.

Parameters:

holdout_pct : int

the desired percentage of dataset to assign to holdout set

reps : int

number of cross validation folds to use

seed : int

a seed to use for randomization

class datarobot.GroupCV(holdout_pct, reps, partition_key_cols, seed=0)

A partition in which one column is specified, and rows sharing a common value for that column are guaranteed to stay together in the partitioning into cross-validation groups and the holdout set.

Parameters:

holdout_pct : int

the desired percentage of dataset to assign to holdout set

reps : int

number of cross validation folds to use

partition_key_cols : list

a list containing a single string, where the string is the name of the column whose values should remain together in partitioning

seed : int

a seed to use for randomization

class datarobot.UserCV(user_partition_col, cv_holdout_level, seed=0)

A partition where the cross-validation folds and the holdout set are specified by the user.

Parameters:

user_partition_col : string

the name of the column containing the partition assignments

cv_holdout_level

the value of the partition column indicating a row is part of the holdout set

seed : int

a seed to use for randomization

class datarobot.RandomTVH(holdout_pct, validation_pct, seed=0)

Specifies a partitioning method in which rows are randomly assigned to training, validation, and holdout.

Parameters:

holdout_pct : int

the desired percentage of dataset to assign to holdout set

validation_pct : int

the desired percentage of dataset to assign to validation set

seed : int

a seed to use for randomization

class datarobot.UserTVH(user_partition_col, training_level, validation_level, holdout_level, seed=0)

Specifies a partitioning method in which rows are assigned by the user to training, validation, and holdout sets.

Parameters:

user_partition_col : string

the name of the column containing the partition assignments

training_level

the value of the partition column indicating a row is part of the training set

validation_level

the value of the partition column indicating a row is part of the validation set

holdout_level

the value of the partition column indicating a row is part of the holdout set (use None if you want no holdout set)

seed : int

a seed to use for randomization

class datarobot.StratifiedTVH(holdout_pct, validation_pct, seed=0)

A partition in which observations are randomly assigned to train, validation, and holdout sets, preserving in each group the same ratio of positive to negative cases as in the original data.

Parameters:

holdout_pct : int

the desired percentage of dataset to assign to holdout set

validation_pct : int

the desired percentage of dataset to assign to validation set

seed : int

a seed to use for randomization

class datarobot.GroupTVH(holdout_pct, validation_pct, partition_key_cols, seed=0)

A partition in which one column is specified, and rows sharing a common value for that column are guaranteed to stay together in the partitioning into the training, validation, and holdout sets.

Parameters:

holdout_pct : int

the desired percentage of dataset to assign to holdout set

validation_pct : int

the desired percentage of dataset to assign to validation set

partition_key_cols : list

a list containing a single string, where the string is the name of the column whose values should remain together in partitioning

seed : int

a seed to use for randomization

class datarobot.DatetimePartitioningSpecification(datetime_partition_column, autopilot_data_selection_method=None, validation_duration=None, holdout_start_date=None, holdout_duration=None, disable_holdout=None, gap_duration=None, number_of_backtests=None, backtests=None, use_time_series=False, default_to_a_priori=False, default_to_known_in_advance=False, feature_derivation_window_start=None, feature_derivation_window_end=None, feature_settings=None, forecast_window_start=None, forecast_window_end=None, treat_as_exponential=None, differencing_method=None, periodicities=None, multiseries_id_columns=None)

Uniquely defines a DatetimePartitioning for some project

Includes only the attributes of DatetimePartitioning that are directly controllable by users, not those determined by the DataRobot application based on the project dataset and the user-controlled settings.

This is the specification that should be passed to Project.set_target via the partitioning_method parameter. To see the full partitioning based on the project dataset, use DatetimePartitioning.generate.

All durations should be specified with a duration string such as those returned by the partitioning_methods.construct_duration_string helper method.

Attributes

datetime_partition_column (str) the name of the column whose values as dates are used to assign a row to a particular partition
autopilot_data_selection_method (str) one of datarobot.enums.DATETIME_AUTOPILOT_DATA_SELECTION_METHOD. Whether models created by the autopilot should use “rowCount” or “duration” as their data_selection_method.
validation_duration (str or None) the default validation_duration for the backtests
holdout_start_date (datetime.datetime or None) The start date of holdout scoring data. If holdout_start_date is specified, holdout_duration must also be specified. If disable_holdout is set to True, neither holdout_duration nor holdout_start_date must be specified.
holdout_duration (str or None) The duration of the holdout scoring data. If holdout_duration is specified, holdout_start_date must also be specified. If disable_holdout is set to True, neither holdout_duration nor holdout_start_date must be specified.
disable_holdout (bool or None) (New in version v2.8) Whether to suppress allocating a holdout fold. If set to True, holdout_start_date and holdout_duration must not be specified.
gap_duration (str or None) The duration of the gap between training and holdout scoring data
number_of_backtests (int or None) the number of backtests to use
backtests (list of BacktestSpecification) the exact specification of backtests to use. The indexes of the specified backtests should range from 0 to number_of_backtests - 1. If any backtest is left unspecified, a default configuration will be chosen.
use_time_series (bool) (New in version v2.8) Whether to create a time series project (if True) or an OTV project which uses datetime partitioning (if False). The default behaviour is to create an OTV project.
default_to_a_priori (bool) (Deprecated in version v2.11) Optional, renamed to default_to_known_in_advance, see below for more detail.
default_to_known_in_advance (bool) (New in version v2.11) Optional, only used for time series projects. Whether to default to treating features as known in advance. If not specified, defaults to False. Known in advance features are expected to be known for dates in the future when making predictions, e.g. “is this a holiday”.
feature_derivation_window_start (int or None) (New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should start. Expressed in terms of the time_unit of the datetime_partition_column and should be negative or zero.
feature_derivation_window_end (int or None) (New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should end. Expressed in terms of the time_unit of the datetime_partition_column, and should be a positive value.
feature_settings (list of FeatureSettings objects) (New in version v2.9) Optional, a list specifying per feature settings, can be left unspecified.
forecast_window_start (int or None) (New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should start. Expressed in terms of the time_unit of the datetime_partition_column.
forecast_window_end (int or None) (New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should end. Expressed in terms of the time_unit of the datetime_partition_column.
treat_as_exponential (string, optional) (New in version v2.9) defaults to “auto”. Used to specify whether to treat data as exponential trend and apply transformations like log-transform. Use values from datarobot.enums.TREAT_AS_EXPONENTIAL enum.
differencing_method (string, optional) (New in version v2.9) defaults to “auto”. Used to specify which differencing method to apply of case if data is stationary. Use values from datarobot.enums.DIFFERENCING_METHOD enum.
periodicities (list of Periodicity, optional) (New in version v2.9) a list of datarobot.Periodicity
multiseries_id_columns (list of str or null) (New in version v2.11) a list of the names of multiseries id columns to define series within the training data. Currently only one multiseries id column is supported.
class datarobot.BacktestSpecification(index, gap_duration, validation_start_date, validation_duration)

Uniquely defines a Backtest used in a DatetimePartitioning

Includes only the attributes of a backtest directly controllable by users. The other attributes are assigned by the DataRobot application based on the project dataset and the user-controlled settings.

All durations should be specified with a duration string such as those returned by the partitioning_methods.construct_duration_string helper method.

Attributes

index (int) the index of the backtest to update
gap_duration (str) the desired duration of the gap between training and validation scoring data for the backtest
validation_start_date (datetime.datetime) the desired start date of the validation scoring data for this backtest
validation_duration (datetime.datetime) the desired duration of the validation scoring data for this backtest
class datarobot.FeatureSettings(feature_name, known_in_advance=False, a_priori=None)

Per feature settings

Attributes

feature_name (string) name of the feature
a_priori (bool) (Deprecated in v2.11) Optional, renamed to known_in_advance, see below for more detail.
known_in_advance (bool) (New in version v2.11) Optional, whether the feature is known in advance, i.e. expected to be known for dates in the future at prediction time. Features that don’t have a feature setting specifying whether they are known in advance use the value from the default_to_known_in_advance flag.
class datarobot.Periodicity(time_steps, time_unit)

Periodicity configuration

Parameters:

time_steps : int

Time step value

time_unit : string

Time step unit, valid options are values from datarobot.enums.PERIODICITY_TIME_UNITS

Examples

from datarobot as dr
periodicities = [
    dr.Periodicity(time_steps=10, time_unit=dr.enums.PERIODICITY_TIME_UNITS.HOUR),
    dr.Periodicity(time_steps=600, time_unit=dr.enums.PERIODICITY_TIME_UNITS.MINUTE)]
spec = dr.DatetimePartitioningSpecification(
    # ...
    periodicities=periodicities
)
class datarobot.DatetimePartitioning(project_id=None, datetime_partition_column=None, date_format=None, autopilot_data_selection_method=None, validation_duration=None, available_training_start_date=None, available_training_duration=None, available_training_row_count=None, available_training_end_date=None, primary_training_start_date=None, primary_training_duration=None, primary_training_row_count=None, primary_training_end_date=None, gap_start_date=None, gap_duration=None, gap_row_count=None, gap_end_date=None, holdout_start_date=None, holdout_duration=None, holdout_row_count=None, holdout_end_date=None, number_of_backtests=None, backtests=None, total_row_count=None, use_time_series=False, default_to_a_priori=False, default_to_known_in_advance=False, feature_derivation_window_start=None, feature_derivation_window_end=None, feature_settings=None, forecast_window_start=None, forecast_window_end=None, treat_as_exponential=None, differencing_method=None, periodicities=None, multiseries_id_columns=None)

Full partitioning of a project for datetime partitioning

Includes both the attributes specified by the user, as well as those determined by the DataRobot application based on the project dataset. In order to use a partitioning to set the target, call to_specification and pass the resulting DatetimePartitioningSpecification to Project.set_target.

The available training data corresponds to all the data available for training, while the primary training data corresponds to the data that can be used to train while ensuring that all backtests are available. If a model is trained with more data than is available in the primary training data, then all backtests may not have scores available.

Attributes

project_id (str) the id of the project this partitioning applies to
datetime_partition_column (str) the name of the column whose values as dates are used to assign a row to a particular partition
date_format (str) the format (e.g. “%Y-%m-%d %H:%M:%S”) by which the partition column was interpreted (compatible with strftime [https://docs.python.org/2/library/time.html#time.strftime] )
autopilot_data_selection_method (str) one of datarobot.enums.DATETIME_AUTOPILOT_DATA_SELECTION_METHOD. Whether models created by the autopilot use “rowCount” or “duration” as their data_selection_method.
validation_duration (str) the validation duration specified when initializing the partitioning - not directly significant if the backtests have been modified, but used as the default validation_duration for the backtests
available_training_start_date (datetime.datetime) The start date of the available training data for scoring the holdout
available_training_duration (str) The duration of the available training data for scoring the holdout
available_training_row_count (int or None) The number of rows in the available training data for scoring the holdout. Only available when retrieving the partitioning after setting the target.
available_training_end_date (datetime.datetime) The end date of the available training data for scoring the holdout
primary_training_start_date (datetime.datetime or None) The start date of primary training data for scoring the holdout. Unavailable when the holdout fold is disabled.
primary_training_duration (str) The duration of the primary training data for scoring the holdout
primary_training_row_count (int or None) The number of rows in the primary training data for scoring the holdout. Only available when retrieving the partitioning after setting the target.
primary_training_end_date (datetime.datetime or None) The end date of the primary training data for scoring the holdout. Unavailable when the holdout fold is disabled.
gap_start_date (datetime.datetime or None) The start date of the gap between training and holdout scoring data. Unavailable when the holdout fold is disabled.
gap_duration (str) The duration of the gap between training and holdout scoring data
gap_row_count (int or None) The number of rows in the gap between training and holdout scoring data. Only available when retrieving the partitioning after setting the target.
gap_end_date (datetime.datetime or None) The end date of the gap between training and holdout scoring data. Unavailable when the holdout fold is disabled.
holdout_start_date (datetime.datetime or None) The start date of holdout scoring data. Unavailable when the holdout fold is disabled.
holdout_duration (str) The duration of the holdout scoring data
holdout_row_count (int or None) The number of rows in the holdout scoring data. Only available when retrieving the partitioning after setting the target.
holdout_end_date (datetime.datetime or None) The end date of the holdout scoring data. Unavailable when the holdout fold is disabled.
number_of_backtests (int) the number of backtests used
backtests (list of partitioning_methods.Backtest) the configured Backtests
total_row_count (int) the number of rows in the project dataset. Only available when retrieving the partitioning after setting the target.
use_time_series (bool) (New in version v2.8) Whether to create a time series project (if True) or an OTV project which uses datetime partitioning (if False). The default behaviour is to create an OTV project.
default_to_a_priori (bool) (Deprecated in version v2.11) Optional, renamed to defaultToKnownInAdvance, see below for more detail.
default_to_a_priori (bool) (Deprecated in version v2.11) Optional, renamed to default_to_known_in_advance, see below for more detail.
default_to_known_in_advance (bool) (New in version v2.11) Optional, only used for time series projects. Whether to default to treating features as known in advance. If not specified, defaults to False. Known in advance features are expected to be known for dates in the future when making predictions, e.g. “is this a holiday”.
feature_derivation_window_start (int or None) (New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should start. Expressed in terms of the time_unit of the datetime_partition_column.
feature_derivation_window_end (int or None) (New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should end. Expressed in terms of the time_unit of the datetime_partition_column.
feature_settings (list of FeatureSettings) (New in version v2.9) Optional, a list specifying per feature settings, can be left unspecified.
forecast_window_start (int or None) (New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should start. Expressed in terms of the time_unit of the datetime_partition_column.
forecast_window_end (int or None) (New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should end. Expressed in terms of the time_unit of the datetime_partition_column.
treat_as_exponential (string, optional) (New in version v2.9) defaults to “auto”. Used to specify whether to treat data as exponential trend and apply transformations like log-transform. Use values from datarobot.enums.TREAT_AS_EXPONENTIAL enum.
differencing_method (string, optional) (New in version v2.9) defaults to “auto”. Used to specify which differencing method to apply of case if data is stationary. Use values from datarobot.enums.DIFFERENCING_METHOD enum.
periodicities (list of Periodicity, optional) (New in version v2.9) a list of datarobot.Periodicity
multiseries_id_columns (list of str or null) (New in version v2.11) a list of the names of multiseries id columns to define series within the training data. Currently only one multiseries id column is supported.
classmethod generate(project_id, spec, max_wait=600)

Preview the full partitioning determined by a DatetimePartitioningSpecification

Based on the project dataset and the partitioning specification, inspect the full partitioning that would be used if the same specification were passed into Project.set_target.

Parameters:

project_id : str

the id of the project

spec : DatetimePartitioningSpec

the desired partitioning

max_wait : int, optional

For some settings (e.g. generating a partitioning preview for a multiseries project for the first time), an asynchronous task must be run to analyze the dataset. max_wait governs the maximum time (in seconds) to wait before giving up. In all non-multiseries projects, this is unused.

Returns:

DatetimePartitioning :

the full generated partitioning

classmethod get(project_id)

Retrieve the DatetimePartitioning from a project

Only available if the project has already set the target as a datetime project.

Parameters:

project_id : str

the id of the project to retrieve partitioning for

Returns:

DatetimePartitioning : the full partitioning for the project

classmethod feature_log_list(project_id, offset=None, limit=None)

Retrieve the feature derivation log content and log length for a time series project.

The Time Series Feature Log provides details about the feature generation process for a time series project. It includes information about which features are generated and their priority, as well as the detected properties of the time series data such as whether the series is stationary, and periodicities detected.

This route is only supported for time series projects that have finished partitioning.

The feature derivation log will include information about:

  • Detected stationarity of the series:
    e.g. ‘Series detected as non-stationary’
  • Detected presence of multiplicative trend in the series:
    e.g. ‘Multiplicative trend detected’
  • Detected presence of multiplicative trend in the series:
    e.g. ‘Detected periodicities: 7 day’
  • Maximum number of feature to be generated:
    e.g. ‘Maximum number of feature to be generated is 1440’
  • Window sizes used in rolling statistics / lag extractors
    e.g. ‘The window sizes chosen to be: 2 months
    (because the time step is 1 month and Feature Derivation Window is 2 months)’
  • Features that are specified as known-in-advance
    e.g. ‘Variables treated as apriori: holiday’
  • Details about why certain variables are transformed in the input data
    e.g. ‘Generating variable “y (log)” from “y” because multiplicative trend
    is detected’
  • Details about features generated as timeseries features, and their priority
    e.g. ‘Generating feature “date (actual)” from “date” (priority: 1)’
Parameters:

project_id : str

project id to retrieve a feature derivation log for.

offset : int

optional, defaults is 0, this many results will be skipped.

limit : int

optional, defaults to 100, at most this many results are returned. To specify

no limit, use 0. The default may change without notice.

classmethod feature_log_retrieve(project_id)

Retrieve the feature derivation log content and log length for a time series project.

The Time Series Feature Log provides details about the feature generation process for a time series project. It includes information about which features are generated and their priority, as well as the detected properties of the time series data such as whether the series is stationary, and periodicities detected.

This route is only supported for time series projects that have finished partitioning.

The feature derivation log will include information about:

  • Detected stationarity of the series:
    e.g. ‘Series detected as non-stationary’
  • Detected presence of multiplicative trend in the series:
    e.g. ‘Multiplicative trend detected’
  • Detected presence of multiplicative trend in the series:
    e.g. ‘Detected periodicities: 7 day’
  • Maximum number of feature to be generated:
    e.g. ‘Maximum number of feature to be generated is 1440’
  • Window sizes used in rolling statistics / lag extractors
    e.g. ‘The window sizes chosen to be: 2 months
    (because the time step is 1 month and Feature Derivation Window is 2 months)’
  • Features that are specified as known-in-advance
    e.g. ‘Variables treated as apriori: holiday’
  • Details about why certain variables are transformed in the input data
    e.g. ‘Generating variable “y (log)” from “y” because multiplicative trend
    is detected’
  • Details about features generated as timeseries features, and their priority
    e.g. ‘Generating feature “date (actual)” from “date” (priority: 1)’
Parameters:

project_id : str

project id to retrieve a feature derivation log for.

to_specification()

Render the DatetimePartitioning as a DatetimePartitioningSpecification

The resulting specification can be used when setting the target, and contains only the attributes directly controllable by users.

Returns:

DatetimePartitioningSpecification:

the specification for this partitioning

to_dataframe()

Render the partitioning settings as a dataframe for convenience of display

Excludes project_id, datetime_partition_column, date_format, autopilot_data_selection_method, validation_duration, and number_of_backtests, as well as the row count information, if present.

Also excludes the time series specific parameters for use_time_series, default_to_known_in_advance, and defining the feature derivation and forecast windows.

class datarobot.helpers.partitioning_methods.Backtest(index=None, available_training_start_date=None, available_training_duration=None, available_training_row_count=None, available_training_end_date=None, primary_training_start_date=None, primary_training_duration=None, primary_training_row_count=None, primary_training_end_date=None, gap_start_date=None, gap_duration=None, gap_row_count=None, gap_end_date=None, validation_start_date=None, validation_duration=None, validation_row_count=None, validation_end_date=None, total_row_count=None)

A backtest used to evaluate models trained in a datetime partitioned project

When setting up a datetime partitioning project, backtests are specified by a BacktestSpecification.

The available training data corresponds to all the data available for training, while the primary training data corresponds to the data that can be used to train while ensuring that all backtests are available. If a model is trained with more data than is available in the primary training data, then all backtests may not have scores available.

Attributes

index (int) the index of the backtest
available_training_start_date (datetime.datetime) the start date of the available training data for this backtest
available_training_duration (str) the duration of available training data for this backtest
available_training_row_count (int or None) the number of rows of available training data for this backtest. Only available when retrieving from a project where the target is set.
available_training_end_date (datetime.datetime) the end date of the available training data for this backtest
primary_training_start_date (datetime.datetime) the start date of the primary training data for this backtest
primary_training_duration (str) the duration of the primary training data for this backtest
primary_training_row_count (int or None) the number of rows of primary training data for this backtest. Only available when retrieving from a project where the target is set.
primary_training_end_date (datetime.datetime) the end date of the primary training data for this backtest
gap_start_date (datetime.datetime) the start date of the gap between training and validation scoring data for this backtest
gap_duration (str) the duration of the gap between training and validation scoring data for this backtest
gap_row_count (int or None) the number of rows in the gap between training and validation scoring data for this backtest. Only available when retrieving from a project where the target is set.
gap_end_date (datetime.datetime) the end date of the gap between training and validation scoring data for this backtest
validation_start_date (datetime.datetime) the start date of the validation scoring data for this backtest
validation_duration (str) the duration of the validation scoring data for this backtest
validation_row_count (int or None) the number of rows of validation scoring data for this backtest. Only available when retrieving from a project where the target is set.
validation_end_date (datetime.datetime) the end date of the validation scoring data for this backtest
total_row_count (int or None) the number of rows in this backtest. Only available when retrieving from a project where the target is set.
to_specification()

Render this backtest as a BacktestSpecification

A BacktestSpecification includes only the attributes users can directly control, not those indirectly determined by the project dataset.

Returns:

BacktestSpecification

the specification for this backtest

to_dataframe()

Render this backtest as a dataframe for convenience of display

Returns:

backtest_partitioning : pandas.Dataframe

the backtest attributes, formatted into a dataframe

datarobot.helpers.partitioning_methods.construct_duration_string(years=0, months=0, days=0, hours=0, minutes=0, seconds=0)

Construct a valid string representing a duration in accordance with ISO8601

A duration of six months, 3 days, and 12 hours could be represented as P6M3DT12H.

Parameters:

years : int

the number of years in the duration

months : int

the number of months in the duration

days : int

the number of days in the duration

hours : int

the number of hours in the duration

minutes : int

the number of minutes in the duration

seconds : int

the number of seconds in the duration

Returns:

duration_string: str

The duration string, specified compatibly with ISO8601

Blueprint API

class datarobot.models.Blueprint(id=None, processes=None, model_type=None, project_id=None, blueprint_category=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None)

A Blueprint which can be used to fit models

Attributes

id (str) the id of the blueprint
processes (list of str) the processes used by the blueprint
model_type (str) the model produced by the blueprint
project_id (str) the project the blueprint belongs to
blueprint_category (str) (New in version v2.6) Describes the category of the blueprint and the kind of model it produces.
classmethod get(project_id, blueprint_id)

Retrieve a blueprint.

Parameters:

project_id : str

The project’s id.

blueprint_id : str

Id of blueprint to retrieve.

Returns:

blueprint : Blueprint

The queried blueprint.

get_chart()

Retrieve a chart.

Returns:

BlueprintChart

The current blueprint chart.

get_documents()

Get documentation for tasks used in the blueprint.

Returns:

list of BlueprintTaskDocument

All documents available for blueprint.

class datarobot.models.BlueprintTaskDocument(title=None, task=None, description=None, parameters=None, links=None, references=None)

Document describing a task from a blueprint.

Attributes

title (str) Title of document.
task (str) Name of the task described in document.
description (str) Task description.
parameters (list of dict(name, type, description)) Parameters that task can receive in human-readable format.
links (list of dict(name, url)) External links used in document
references (list of dict(name, url)) References used in document. When no link available url equals None.
class datarobot.models.BlueprintChart(nodes, edges)

A Blueprint chart that can be used to understand data flow in blueprint.

Attributes

nodes (list of dict (id, label)) Chart nodes, id unique in chart.
edges (list of tuple (id1, id2)) Directions of data flow between blueprint chart nodes.
classmethod get(project_id, blueprint_id)

Retrieve a blueprint chart.

Parameters:

project_id : str

The project’s id.

blueprint_id : str

Id of blueprint to retrieve chart.

Returns:

BlueprintChart

The queried blueprint chart.

to_graphviz()

Get blueprint chart in graphviz DOT format.

Returns:

unicode

String representation of chart in graphviz DOT language.

class datarobot.models.ModelBlueprintChart(nodes, edges)

A Blueprint chart that can be used to understand data flow in model. Model blueprint chart represents reduced repository blueprint chart with only elements that used to build this particular model.

Attributes

nodes (list of dict (id, label)) Chart nodes, id unique in chart.
edges (list of tuple (id1, id2)) Directions of data flow between blueprint chart nodes.
classmethod get(project_id, model_id)

Retrieve a model blueprint chart.

Parameters:

project_id : str

The project’s id.

model_id : str

Id of model to retrieve model blueprint chart.

Returns:

ModelBlueprintChart

The queried model blueprint chart.

to_graphviz()

Get blueprint chart in graphviz DOT format.

Returns:

unicode

String representation of chart in graphviz DOT language.

Model API

class datarobot.models.Model(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, project=None, data=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None)

A model trained on a project’s dataset capable of making predictions

Attributes

id (str) the id of the model
project_id (str) the id of the project the model belongs to
processes (list of str) the processes used by the model
featurelist_name (str) the name of the featurelist used by the model
featurelist_id (str) the id of the featurelist used by the model
sample_pct (float or None) the percentage of the project dataset used in training the model. If the project uses datetime partitioning, the sample_pct will be None. See training_row_count, training_duration, and training_start_date and training_end_date instead.
training_row_count (int or None) the number of rows of the project dataset used in training the model. In a datetime partitioned project, if specified, defines the number of rows used to train the model and evaluate backtest scores; if unspecified, either training_duration or training_start_date and training_end_date was used to determine that instead.
training_duration (str or None) only present for models in datetime partitioned projects. If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores.
training_start_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
training_end_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
model_type (str) what model this is, e.g. ‘Nystroem Kernel SVM Regressor’
model_category (str) what kind of model this is - ‘prime’ for DataRobot Prime models, ‘blend’ for blender models, and ‘model’ for other models
is_frozen (bool) whether this model is a frozen model
blueprint_id (str) the id of the blueprint used in this model
metrics (dict) a mapping from each metric to the model’s scores for that metric
monotonic_increasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
monotonic_decreasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
supports_monotonic_constraints (bool) optinonal, whether this model supports enforcing montonic constraints
classmethod get(project, model_id)

Retrieve a specific model.

Parameters:

project : str

The project’s id.

model_id : str

The model_id of the leaderboard item to retrieve.

Returns:

model : Model

The queried instance.

Raises:

ValueError

passed project parameter value is of not supported type

classmethod fetch_resource_data(*args, **kwargs)

(Deprecated.) Used to acquire model data directly from its url.

Consider using get instead, as this is a convenience function used for development of datarobot

Parameters:

url : string

The resource we are acquiring

join_endpoint : boolean, optional

Whether the client’s endpoint should be joined to the URL before sending the request. Location headers are returned as absolute locations, so will _not_ need the endpoint

Returns:

model_data : dict

The queried model’s data

get_features_used()

Query the server to determine which features were used.

Note that the data returned by this method is possibly different than the names of the features in the featurelist used by this model. This method will return the raw features that must be supplied in order for predictions to be generated on a new set of data. The featurelist, in contrast, would also include the names of derived features.

Returns:

features : list of str

The names of the features used in the model.

delete()

Delete a model from the project’s leaderboard.

Returns:

url : str

Permanent static hyperlink to this model at leaderboard.

open_model_browser()

Opens model at project leaderboard in web browser.

Note: If text-mode browsers are used, the calling process will block until the user exits the browser.

train(sample_pct=None, featurelist_id=None, scoring_type=None, training_row_count=None, monotonic_increasing_featurelist_id=<object object>, monotonic_decreasing_featurelist_id=<object object>)

Train the blueprint used in model on a particular featurelist or amount of data.

This method creates a new training job for worker and appends it to the end of the queue for this project. After the job has finished you can get the newly trained model by retrieving it from the project leaderboard, or by retrieving the result of the job.

Either sample_pct or training_row_count can be used to specify the amount of data to use, but not both. If neither are specified, a default of the maximum amount of data that can safely be used to train any blueprint without going into the validation data will be selected.

In smart-sampled projects, sample_pct and training_row_count are assumed to be in terms of rows of the minority class.

Note

For datetime partitioned projects, use train_datetime instead.

Parameters:

sample_pct : float, optional

The amount of data to use for training, as a percentage of the project dataset from 0 to 100.

featurelist_id : str, optional

The identifier of the featurelist to use. If not defined, the featurelist of this model is used.

scoring_type : str, optional

Either SCORING_TYPE.validation or SCORING_TYPE.cross_validation. SCORING_TYPE.validation is available for every partitioning type, and indicates that the default model validation should be used for the project. If the project uses a form of cross-validation partitioning, SCORING_TYPE.cross_validation can also be used to indicate that all of the available training/validation combinations should be used to evaluate the model.

training_row_count : int, optional

The number of rows to use to train the requested model.

monotonic_increasing_featurelist_id : str

(new in version 2.11) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. Passing None disables increasing monotonicity constraint. Default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.

monotonic_decreasing_featurelist_id : str

(new in version 2.11) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. Passing None disables decreasing monotonicity constraint. Default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.

Returns:

model_job_id : str

id of created job, can be used as parameter to ModelJob.get method or wait_for_async_model_creation function

Examples

project = Project.get('p-id')
model = Model.get('p-id', 'l-id')
model_job_id = model.train(training_row_count=project.max_train_rows)
train_datetime(featurelist_id=None, training_row_count=None, training_duration=None, time_window_sample_pct=None)

Train this model on a different featurelist or amount of data

Requires that this model is part of a datetime partitioned project; otherwise, an error will occur.

Parameters:

featurelist_id : str, optional

the featurelist to use to train the model. If not specified, the featurelist of this model is used.

training_row_count : int, optional

the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.

training_duration : str, optional

a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.

time_window_sample_pct : int, optional

may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.

Returns:

job : ModelJob

the created job to build the model

request_predictions(dataset_id)

Request predictions against a previously uploaded dataset

Parameters:

dataset_id : string

The dataset to make predictions against (as uploaded from Project.upload_dataset)

Returns:

job : PredictJob

The job computing the predictions

get_feature_impact()

Retrieve the computed Feature Impact results, a measure of the relevance of each feature in the model.

Elsewhere this technique is sometimes called ‘Permutation Importance’.

Requires that Feature Impact has already been computed with request_feature_impact.

Returns:

feature_impacts : list[dict]

The feature impact data. Each item is a dict with the keys ‘featureName’, ‘impactNormalized’, and ‘impactUnnormalized’. See the help for Model.request_feature_impact for more details.

Raises:

ClientError (404)

If the feature impacts have not been computed.

request_feature_impact()

Request feature impacts to be computed for the model.

Feature Impact is computed for each column by creating new data with that column randomly permuted (but the others left unchanged), and seeing how the error metric score for the predictions is affected. The ‘impactUnnormalized’ is how much worse the error metric score is when making predictions on this modified data. The ‘impactNormalized’ is normalized so that the largest value is 1. In both cases, larger values indicate more important features.

Returns:

job : Job

A Job representing the feature impact computation. To get the completed feature impact data, use job.get_result or job.get_result_when_complete.

Raises:

JobAlreadyRequested (422)

If the feature impacts have already been requested.

get_prime_eligibility()

Check if this model can be approximated with DataRobot Prime

Returns:

prime_eligibility : dict

a dict indicating whether a model can be approximated with DataRobot Prime (key can_make_prime) and why it may be ineligible (key message)

request_approximation()

Request an approximation of this model using DataRobot Prime

This will create several rulesets that could be used to approximate this model. After comparing their scores and rule counts, the code used in the approximation can be downloaded and run locally.

Returns:

job : Job

the job generating the rulesets

get_rulesets()

List the rulesets approximating this model generated by DataRobot Prime

If this model hasn’t been approximated yet, will return an empty list. Note that these are rulesets approximating this model, not rulesets used to construct this model.

Returns:rulesets : list of Ruleset
download_export(filepath)

Download an exportable model file for use in an on-premise DataRobot standalone prediction environment.

This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.

Parameters:

filepath : str

The path at which to save the exported model file.

request_transferable_export()

Request generation of an exportable model file for use in an on-premise DataRobot standalone prediction environment.

This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.

This function does not download the exported file. Use download_export for that.

Examples

model = datarobot.Model.get('p-id', 'l-id')
job = model.request_transferable_export()
job.wait_for_completion()
model.download_export('my_exported_model.drmodel')

# Client must be configured to use standalone prediction server for import:
datarobot.Client(token='my-token-at-standalone-server',
                 endpoint='standalone-server-url/api/v2')

imported_model = datarobot.ImportedModel.create('my_exported_model.drmodel')
request_frozen_model(sample_pct=None, training_row_count=None)

Train a new frozen model with parameters from this model

Note

This method only works if project the model belongs to is not datetime partitioned. If it is, use request_frozen_datetime_model instead.

Frozen models use the same tuning parameters as their parent model instead of independently optimizing them to allow efficiently retraining models on larger amounts of the training data.

Parameters:

sample_pct : float

optional, the percentage of the dataset to use with the model. If not provided, will use the value from this model.

training_row_count : int

(New in version v2.9) optional, the integer number of rows of the dataset to use with the model. Only one of sample_pct and training_row_count should be specified.

Returns:

model_job : ModelJob

the modeling job training a frozen model

request_frozen_datetime_model(training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, time_window_sample_pct=None)

Train a new frozen model with parameters from this model

Requires that this model belongs to a datetime partitioned project. If it does not, an error will occur when submitting the job.

Frozen models use the same tuning parameters as their parent model instead of independently optimizing them to allow efficiently retraining models on larger amounts of the training data.

In addition of training_row_count and training_duration, frozen datetime models may be trained on an exact date range. Only one of training_row_count, training_duration, or training_start_date and training_end_date should be specified.

Models specified using training_start_date and training_end_date are the only ones that can be trained into the holdout data (once the holdout is unlocked).

Parameters:

training_row_count : int, optional

the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.

training_duration : str, optional

a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.

training_start_date : datetime.datetime, optional

the start date of the data to train to model on. Only rows occurring at or after this datetime will be used. If training_start_date is specified, training_end_date must also be specified.

training_end_date : datetime.datetime, optional

the end date of the data to train the model on. Only rows occurring strictly before this datetime will be used. If training_end_date is specified, training_start_date must also be specified.

time_window_sample_pct : int, optional

may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.

Returns:

model_job : ModelJob

the modeling job training a frozen model

get_parameters()

Retrieve model parameters.

Returns:

ModelParameters

Model parameters for this model.

get_lift_chart(source)

Retrieve model lift chart for the specified source.

Parameters:

source : str

Lift chart data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.

Returns:

LiftChart

Model lift chart data

get_all_lift_charts()

Retrieve a list of all lift charts available for the model.

Returns:

list of LiftChart

Data for all available model lift charts.

get_confusion_chart(source)

Retrieve model’s confusion chart for the specified source.

Parameters:

source : str

Confusion chart source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.

Returns:

ConfusionChart

Model ConfusionChart data

get_all_confusion_charts()

Retrieve a list of all confusion charts available for the model.

Returns:

list of ConfusionChart

Data for all available confusion charts for model.

get_roc_curve(source)

Retrieve model ROC curve for the specified source.

Parameters:

source : str

ROC curve data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.

Returns:

RocCurve

Model ROC curve data

get_all_roc_curves()

Retrieve a list of all ROC curves available for the model.

Returns:

list of RocCurve

Data for all available model ROC curves.

get_word_cloud(exclude_stop_words=False)

Retrieve a word cloud data for the model.

Parameters:

exclude_stop_words : bool, optional

Set to True if you want stopwords filtered out of response.

Returns:

WordCloud

Word cloud data for the model.

download_scoring_code(file_name, source_code=False)

Download scoring code JAR.

Parameters:

file_name : str

File path where scoring code will be saved.

source_code : bool, optional

Set to True to download source code archive. It will not be executable.

get_model_blueprint_documents()

Get documentation for tasks used in this model.

Returns:

list of BlueprintTaskDocument

All documents available for the model.

get_model_blueprint_chart()

Retrieve a model blueprint chart that can be used to understand data flow in blueprint.

Returns:

ModelBlueprintChart

The queried model blueprint chart.

request_training_predictions(data_subset)

Start a job to build training predictions

Parameters:

data_subset : str

data set definition to build predictions on. Choices are:

  • dr.enums.DATA_SUBSET.ALL for all data available
  • dr.enums.DATA_SUBSET.VALIDATION_AND_HOLDOUT for all data except training set
  • dr.enums.DATA_SUBSET.HOLDOUT for holdout data set only
Returns:

Job

an instance of created async job

cross_validate()

Run Cross Validation on this model.

Note

To perform Cross Validation on a new model with new parameters, use train instead.

Returns:

ModelJob

The created job to build the model

PrimeModel API

class datarobot.models.PrimeModel(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, parent_model_id=None, ruleset_id=None, rule_count=None, score=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None)

A DataRobot Prime model approximating a parent model with downloadable code

Attributes

id (str) the id of the model
project_id (str) the id of the project the model belongs to
processes (list of str) the processes used by the model
featurelist_name (str) the name of the featurelist used by the model
featurelist_id (str) the id of the featurelist used by the model
sample_pct (float) the percentage of the project dataset used in training the model
training_row_count (int or None) the number of rows of the project dataset used in training the model. In a datetime partitioned project, if specified, defines the number of rows used to train the model and evaluate backtest scores; if unspecified, either training_duration or training_start_date and training_end_date was used to determine that instead.
training_duration (str or None) only present for models in datetime partitioned projects. If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores.
training_start_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
training_end_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
model_type (str) what model this is, e.g. ‘DataRobot Prime’
model_category (str) what kind of model this is - always ‘prime’ for DataRobot Prime models
is_frozen (bool) whether this model is a frozen model
blueprint_id (str) the id of the blueprint used in this model
metrics (dict) a mapping from each metric to the model’s scores for that metric
ruleset (Ruleset) the ruleset used in the Prime model
parent_model_id (str) the id of the model that this Prime model approximates
monotonic_increasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
monotonic_decreasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
supports_monotonic_constraints (bool) optinonal, whether this model supports enforcing montonic constraints
classmethod get(project_id, model_id)

Retrieve a specific prime model.

Parameters:

project_id : str

The id of the project the prime model belongs to

model_id : str

The model_id of the prime model to retrieve.

Returns:

model : PrimeModel

The queried instance.

request_download_validation(language)

Prep and validate the downloadable code for the ruleset associated with this model

Parameters:

language : str

the language the code should be downloaded in - see datarobot.enums.PRIME_LANGUAGE for available languages

Returns:

job : Job

A job tracking the code preparation and validation

cross_validate()

Run Cross Validation on this model.

Note

To perform Cross Validation on a new model with new parameters, use train instead.

Returns:

ModelJob

The created job to build the model

delete()

Delete a model from the project’s leaderboard.

download_export(filepath)

Download an exportable model file for use in an on-premise DataRobot standalone prediction environment.

This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.

Parameters:

filepath : str

The path at which to save the exported model file.

download_scoring_code(file_name, source_code=False)

Download scoring code JAR.

Parameters:

file_name : str

File path where scoring code will be saved.

source_code : bool, optional

Set to True to download source code archive. It will not be executable.

fetch_resource_data(*args, **kwargs)

(Deprecated.) Used to acquire model data directly from its url.

Consider using get instead, as this is a convenience function used for development of datarobot

Parameters:

url : string

The resource we are acquiring

join_endpoint : boolean, optional

Whether the client’s endpoint should be joined to the URL before sending the request. Location headers are returned as absolute locations, so will _not_ need the endpoint

Returns:

model_data : dict

The queried model’s data

get_all_confusion_charts()

Retrieve a list of all confusion charts available for the model.

Returns:

list of ConfusionChart

Data for all available confusion charts for model.

get_all_lift_charts()

Retrieve a list of all lift charts available for the model.

Returns:

list of LiftChart

Data for all available model lift charts.

get_all_roc_curves()

Retrieve a list of all ROC curves available for the model.

Returns:

list of RocCurve

Data for all available model ROC curves.

get_confusion_chart(source)

Retrieve model’s confusion chart for the specified source.

Parameters:

source : str

Confusion chart source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.

Returns:

ConfusionChart

Model ConfusionChart data

get_feature_impact()

Retrieve the computed Feature Impact results, a measure of the relevance of each feature in the model.

Elsewhere this technique is sometimes called ‘Permutation Importance’.

Requires that Feature Impact has already been computed with request_feature_impact.

Returns:

feature_impacts : list[dict]

The feature impact data. Each item is a dict with the keys ‘featureName’, ‘impactNormalized’, and ‘impactUnnormalized’. See the help for Model.request_feature_impact for more details.

Raises:

ClientError (404)

If the feature impacts have not been computed.

get_features_used()

Query the server to determine which features were used.

Note that the data returned by this method is possibly different than the names of the features in the featurelist used by this model. This method will return the raw features that must be supplied in order for predictions to be generated on a new set of data. The featurelist, in contrast, would also include the names of derived features.

Returns:

features : list of str

The names of the features used in the model.

Returns:

url : str

Permanent static hyperlink to this model at leaderboard.

get_lift_chart(source)

Retrieve model lift chart for the specified source.

Parameters:

source : str

Lift chart data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.

Returns:

LiftChart

Model lift chart data

get_model_blueprint_chart()

Retrieve a model blueprint chart that can be used to understand data flow in blueprint.

Returns:

ModelBlueprintChart

The queried model blueprint chart.

get_model_blueprint_documents()

Get documentation for tasks used in this model.

Returns:

list of BlueprintTaskDocument

All documents available for the model.

get_parameters()

Retrieve model parameters.

Returns:

ModelParameters

Model parameters for this model.

get_prime_eligibility()

Check if this model can be approximated with DataRobot Prime

Returns:

prime_eligibility : dict

a dict indicating whether a model can be approximated with DataRobot Prime (key can_make_prime) and why it may be ineligible (key message)

get_roc_curve(source)

Retrieve model ROC curve for the specified source.

Parameters:

source : str

ROC curve data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.

Returns:

RocCurve

Model ROC curve data

get_rulesets()

List the rulesets approximating this model generated by DataRobot Prime

If this model hasn’t been approximated yet, will return an empty list. Note that these are rulesets approximating this model, not rulesets used to construct this model.

Returns:rulesets : list of Ruleset
get_word_cloud(exclude_stop_words=False)

Retrieve a word cloud data for the model.

Parameters:

exclude_stop_words : bool, optional

Set to True if you want stopwords filtered out of response.

Returns:

WordCloud

Word cloud data for the model.

open_model_browser()

Opens model at project leaderboard in web browser.

Note: If text-mode browsers are used, the calling process will block until the user exits the browser.

request_feature_impact()

Request feature impacts to be computed for the model.

Feature Impact is computed for each column by creating new data with that column randomly permuted (but the others left unchanged), and seeing how the error metric score for the predictions is affected. The ‘impactUnnormalized’ is how much worse the error metric score is when making predictions on this modified data. The ‘impactNormalized’ is normalized so that the largest value is 1. In both cases, larger values indicate more important features.

Returns:

job : Job

A Job representing the feature impact computation. To get the completed feature impact data, use job.get_result or job.get_result_when_complete.

Raises:

JobAlreadyRequested (422)

If the feature impacts have already been requested.

request_predictions(dataset_id)

Request predictions against a previously uploaded dataset

Parameters:

dataset_id : string

The dataset to make predictions against (as uploaded from Project.upload_dataset)

Returns:

job : PredictJob

The job computing the predictions

request_training_predictions(data_subset)

Start a job to build training predictions

Parameters:

data_subset : str

data set definition to build predictions on. Choices are:

  • dr.enums.DATA_SUBSET.ALL for all data available
  • dr.enums.DATA_SUBSET.VALIDATION_AND_HOLDOUT for all data except training set
  • dr.enums.DATA_SUBSET.HOLDOUT for holdout data set only
Returns:

Job

an instance of created async job

request_transferable_export()

Request generation of an exportable model file for use in an on-premise DataRobot standalone prediction environment.

This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.

This function does not download the exported file. Use download_export for that.

Examples

model = datarobot.Model.get('p-id', 'l-id')
job = model.request_transferable_export()
job.wait_for_completion()
model.download_export('my_exported_model.drmodel')

# Client must be configured to use standalone prediction server for import:
datarobot.Client(token='my-token-at-standalone-server',
                 endpoint='standalone-server-url/api/v2')

imported_model = datarobot.ImportedModel.create('my_exported_model.drmodel')

BlenderModel API

class datarobot.models.BlenderModel(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, model_ids=None, blender_method=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None)

Blender model that combines prediction results from other models.

Attributes

id (str) the id of the model
project_id (str) the id of the project the model belongs to
processes (list of str) the processes used by the model
featurelist_name (str) the name of the featurelist used by the model
featurelist_id (str) the id of the featurelist used by the model
sample_pct (float) the percentage of the project dataset used in training the model
training_row_count (int or None) the number of rows of the project dataset used in training the model. In a datetime partitioned project, if specified, defines the number of rows used to train the model and evaluate backtest scores; if unspecified, either training_duration or training_start_date and training_end_date was used to determine that instead.
training_duration (str or None) only present for models in datetime partitioned projects. If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores.
training_start_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
training_end_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
model_type (str) what model this is, e.g. ‘DataRobot Prime’
model_category (str) what kind of model this is - always ‘prime’ for DataRobot Prime models
is_frozen (bool) whether this model is a frozen model
blueprint_id (str) the id of the blueprint used in this model
metrics (dict) a mapping from each metric to the model’s scores for that metric
model_ids (list of str) List of model ids used in blender
blender_method (str) Method used to blend results from underlying models
monotonic_increasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
monotonic_decreasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
supports_monotonic_constraints (bool) optinonal, whether this model supports enforcing montonic constraints
classmethod get(project_id, model_id)

Retrieve a specific blender.

Parameters:

project_id : str

The project’s id.

model_id : str

The model_id of the leaderboard item to retrieve.

Returns:

model : BlenderModel

The queried instance.

cross_validate()

Run Cross Validation on this model.

Note

To perform Cross Validation on a new model with new parameters, use train instead.

Returns:

ModelJob

The created job to build the model

delete()

Delete a model from the project’s leaderboard.

download_export(filepath)

Download an exportable model file for use in an on-premise DataRobot standalone prediction environment.

This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.

Parameters:

filepath : str

The path at which to save the exported model file.

download_scoring_code(file_name, source_code=False)

Download scoring code JAR.

Parameters:

file_name : str

File path where scoring code will be saved.

source_code : bool, optional

Set to True to download source code archive. It will not be executable.

fetch_resource_data(*args, **kwargs)

(Deprecated.) Used to acquire model data directly from its url.

Consider using get instead, as this is a convenience function used for development of datarobot

Parameters:

url : string

The resource we are acquiring

join_endpoint : boolean, optional

Whether the client’s endpoint should be joined to the URL before sending the request. Location headers are returned as absolute locations, so will _not_ need the endpoint

Returns:

model_data : dict

The queried model’s data

get_all_confusion_charts()

Retrieve a list of all confusion charts available for the model.

Returns:

list of ConfusionChart

Data for all available confusion charts for model.

get_all_lift_charts()

Retrieve a list of all lift charts available for the model.

Returns:

list of LiftChart

Data for all available model lift charts.

get_all_roc_curves()

Retrieve a list of all ROC curves available for the model.

Returns:

list of RocCurve

Data for all available model ROC curves.

get_confusion_chart(source)

Retrieve model’s confusion chart for the specified source.

Parameters:

source : str

Confusion chart source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.

Returns:

ConfusionChart

Model ConfusionChart data

get_feature_impact()

Retrieve the computed Feature Impact results, a measure of the relevance of each feature in the model.

Elsewhere this technique is sometimes called ‘Permutation Importance’.

Requires that Feature Impact has already been computed with request_feature_impact.

Returns:

feature_impacts : list[dict]

The feature impact data. Each item is a dict with the keys ‘featureName’, ‘impactNormalized’, and ‘impactUnnormalized’. See the help for Model.request_feature_impact for more details.

Raises:

ClientError (404)

If the feature impacts have not been computed.

get_features_used()

Query the server to determine which features were used.

Note that the data returned by this method is possibly different than the names of the features in the featurelist used by this model. This method will return the raw features that must be supplied in order for predictions to be generated on a new set of data. The featurelist, in contrast, would also include the names of derived features.

Returns:

features : list of str

The names of the features used in the model.

Returns:

url : str

Permanent static hyperlink to this model at leaderboard.

get_lift_chart(source)

Retrieve model lift chart for the specified source.

Parameters:

source : str

Lift chart data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.

Returns:

LiftChart

Model lift chart data

get_model_blueprint_chart()

Retrieve a model blueprint chart that can be used to understand data flow in blueprint.

Returns:

ModelBlueprintChart

The queried model blueprint chart.

get_model_blueprint_documents()

Get documentation for tasks used in this model.

Returns:

list of BlueprintTaskDocument

All documents available for the model.

get_parameters()

Retrieve model parameters.

Returns:

ModelParameters

Model parameters for this model.

get_prime_eligibility()

Check if this model can be approximated with DataRobot Prime

Returns:

prime_eligibility : dict

a dict indicating whether a model can be approximated with DataRobot Prime (key can_make_prime) and why it may be ineligible (key message)

get_roc_curve(source)

Retrieve model ROC curve for the specified source.

Parameters:

source : str

ROC curve data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.

Returns:

RocCurve

Model ROC curve data

get_rulesets()

List the rulesets approximating this model generated by DataRobot Prime

If this model hasn’t been approximated yet, will return an empty list. Note that these are rulesets approximating this model, not rulesets used to construct this model.

Returns:rulesets : list of Ruleset
get_word_cloud(exclude_stop_words=False)

Retrieve a word cloud data for the model.

Parameters:

exclude_stop_words : bool, optional

Set to True if you want stopwords filtered out of response.

Returns:

WordCloud

Word cloud data for the model.

open_model_browser()

Opens model at project leaderboard in web browser.

Note: If text-mode browsers are used, the calling process will block until the user exits the browser.

request_approximation()

Request an approximation of this model using DataRobot Prime

This will create several rulesets that could be used to approximate this model. After comparing their scores and rule counts, the code used in the approximation can be downloaded and run locally.

Returns:

job : Job

the job generating the rulesets

request_feature_impact()

Request feature impacts to be computed for the model.

Feature Impact is computed for each column by creating new data with that column randomly permuted (but the others left unchanged), and seeing how the error metric score for the predictions is affected. The ‘impactUnnormalized’ is how much worse the error metric score is when making predictions on this modified data. The ‘impactNormalized’ is normalized so that the largest value is 1. In both cases, larger values indicate more important features.

Returns:

job : Job

A Job representing the feature impact computation. To get the completed feature impact data, use job.get_result or job.get_result_when_complete.

Raises:

JobAlreadyRequested (422)

If the feature impacts have already been requested.

request_frozen_datetime_model(training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, time_window_sample_pct=None)

Train a new frozen model with parameters from this model

Requires that this model belongs to a datetime partitioned project. If it does not, an error will occur when submitting the job.

Frozen models use the same tuning parameters as their parent model instead of independently optimizing them to allow efficiently retraining models on larger amounts of the training data.

In addition of training_row_count and training_duration, frozen datetime models may be trained on an exact date range. Only one of training_row_count, training_duration, or training_start_date and training_end_date should be specified.

Models specified using training_start_date and training_end_date are the only ones that can be trained into the holdout data (once the holdout is unlocked).

Parameters:

training_row_count : int, optional

the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.

training_duration : str, optional

a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.

training_start_date : datetime.datetime, optional

the start date of the data to train to model on. Only rows occurring at or after this datetime will be used. If training_start_date is specified, training_end_date must also be specified.

training_end_date : datetime.datetime, optional

the end date of the data to train the model on. Only rows occurring strictly before this datetime will be used. If training_end_date is specified, training_start_date must also be specified.

time_window_sample_pct : int, optional

may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.

Returns:

model_job : ModelJob

the modeling job training a frozen model

request_frozen_model(sample_pct=None, training_row_count=None)

Train a new frozen model with parameters from this model

Note

This method only works if project the model belongs to is not datetime partitioned. If it is, use request_frozen_datetime_model instead.

Frozen models use the same tuning parameters as their parent model instead of independently optimizing them to allow efficiently retraining models on larger amounts of the training data.

Parameters:

sample_pct : float

optional, the percentage of the dataset to use with the model. If not provided, will use the value from this model.

training_row_count : int

(New in version v2.9) optional, the integer number of rows of the dataset to use with the model. Only one of sample_pct and training_row_count should be specified.

Returns:

model_job : ModelJob

the modeling job training a frozen model

request_predictions(dataset_id)

Request predictions against a previously uploaded dataset

Parameters:

dataset_id : string

The dataset to make predictions against (as uploaded from Project.upload_dataset)

Returns:

job : PredictJob

The job computing the predictions

request_training_predictions(data_subset)

Start a job to build training predictions

Parameters:

data_subset : str

data set definition to build predictions on. Choices are:

  • dr.enums.DATA_SUBSET.ALL for all data available
  • dr.enums.DATA_SUBSET.VALIDATION_AND_HOLDOUT for all data except training set
  • dr.enums.DATA_SUBSET.HOLDOUT for holdout data set only
Returns:

Job

an instance of created async job

request_transferable_export()

Request generation of an exportable model file for use in an on-premise DataRobot standalone prediction environment.

This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.

This function does not download the exported file. Use download_export for that.

Examples

model = datarobot.Model.get('p-id', 'l-id')
job = model.request_transferable_export()
job.wait_for_completion()
model.download_export('my_exported_model.drmodel')

# Client must be configured to use standalone prediction server for import:
datarobot.Client(token='my-token-at-standalone-server',
                 endpoint='standalone-server-url/api/v2')

imported_model = datarobot.ImportedModel.create('my_exported_model.drmodel')
train(sample_pct=None, featurelist_id=None, scoring_type=None, training_row_count=None, monotonic_increasing_featurelist_id=<object object>, monotonic_decreasing_featurelist_id=<object object>)

Train the blueprint used in model on a particular featurelist or amount of data.

This method creates a new training job for worker and appends it to the end of the queue for this project. After the job has finished you can get the newly trained model by retrieving it from the project leaderboard, or by retrieving the result of the job.

Either sample_pct or training_row_count can be used to specify the amount of data to use, but not both. If neither are specified, a default of the maximum amount of data that can safely be used to train any blueprint without going into the validation data will be selected.

In smart-sampled projects, sample_pct and training_row_count are assumed to be in terms of rows of the minority class.

Note

For datetime partitioned projects, use train_datetime instead.

Parameters:

sample_pct : float, optional

The amount of data to use for training, as a percentage of the project dataset from 0 to 100.

featurelist_id : str, optional

The identifier of the featurelist to use. If not defined, the featurelist of this model is used.

scoring_type : str, optional

Either SCORING_TYPE.validation or SCORING_TYPE.cross_validation. SCORING_TYPE.validation is available for every partitioning type, and indicates that the default model validation should be used for the project. If the project uses a form of cross-validation partitioning, SCORING_TYPE.cross_validation can also be used to indicate that all of the available training/validation combinations should be used to evaluate the model.

training_row_count : int, optional

The number of rows to use to train the requested model.

monotonic_increasing_featurelist_id : str

(new in version 2.11) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. Passing None disables increasing monotonicity constraint. Default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.

monotonic_decreasing_featurelist_id : str

(new in version 2.11) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. Passing None disables decreasing monotonicity constraint. Default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.

Returns:

model_job_id : str

id of created job, can be used as parameter to ModelJob.get method or wait_for_async_model_creation function

Examples

project = Project.get('p-id')
model = Model.get('p-id', 'l-id')
model_job_id = model.train(training_row_count=project.max_train_rows)
train_datetime(featurelist_id=None, training_row_count=None, training_duration=None, time_window_sample_pct=None)

Train this model on a different featurelist or amount of data

Requires that this model is part of a datetime partitioned project; otherwise, an error will occur.

Parameters:

featurelist_id : str, optional

the featurelist to use to train the model. If not specified, the featurelist of this model is used.

training_row_count : int, optional

the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.

training_duration : str, optional

a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.

time_window_sample_pct : int, optional

may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.

Returns:

job : ModelJob

the created job to build the model

DatetimeModel API

class datarobot.models.DatetimeModel(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, time_window_sample_pct=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, training_info=None, holdout_score=None, holdout_status=None, data_selection_method=None, backtests=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None)

A model from a datetime partitioned project

Only one of training_row_count, training_duration, and training_start_date and training_end_date will be specified, depending on the data_selection_method of the model. Whichever method was selected determines the amount of data used to train on when making predictions and scoring the backtests and the holdout.

Attributes

id (str) the id of the model
project_id (str) the id of the project the model belongs to
processes (list of str) the processes used by the model
featurelist_name (str) the name of the featurelist used by the model
featurelist_id (str) the id of the featurelist used by the model
sample_pct (float) the percentage of the project dataset used in training the model
training_row_count (int or None) If specified, an int specifying the number of rows used to train the model and evaluate backtest scores.
training_duration (str or None) If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores.
training_start_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
training_end_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
time_window_sample_pct (int or None) An integer between 1 and 99 indicating the percentage of sampling within the training window. The points kept are determined by a random uniform sample. If not specified, no sampling was done.
model_type (str) what model this is, e.g. ‘Nystroem Kernel SVM Regressor’
model_category (str) what kind of model this is - ‘prime’ for DataRobot Prime models, ‘blend’ for blender models, and ‘model’ for other models
is_frozen (bool) whether this model is a frozen model
blueprint_id (str) the id of the blueprint used in this model
metrics (dict) a mapping from each metric to the model’s scores for that metric. The keys in metrics are the different metrics used to evaluate the model, and the values are the results. The dictionaries inside of metrics will be as described here: ‘validation’, the score for a single backtest; ‘crossValidation’, always None; ‘backtesting’, the average score for all backtests if all are available and computed, or None otherwise; ‘backtestingScores’, a list of scores for all backtests where the score is None if that backtest does not have a score available; and ‘holdout’, the score for the holdout or None if the holdout is locked or the score is unavailable.
backtests (list of dict) describes what data was used to fit each backtest, the score for the project metric, and why the backtest score is unavailable if it is not provided.
data_selection_method (str) which of training_row_count, training_duration, or training_start_data and training_end_date were used to determine the data used to fit the model. One of ‘rowCount’, ‘duration’, or ‘selectedDateRange’.
training_info (dict) describes which data was used to train on when scoring the holdout and making predictions. training_info` will have the following keys: holdout_training_start_date, holdout_training_duration, holdout_training_row_count, holdout_training_end_date, prediction_training_start_date, prediction_training_duration, prediction_training_row_count, prediction_training_end_date. Start and end dates will be datetimes, durations will be duration strings, and rows will be integers.
holdout_score (float or None) the score against the holdout, if available and the holdout is unlocked, according to the project metric.
holdout_status (string or None) the status of the holdout score, e.g. “COMPLETED”, “HOLDOUT_BOUNDARIES_EXCEEDED”. Unavailable if the holdout fold was disabled in the partitioning configuration.
monotonic_increasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
monotonic_decreasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
supports_monotonic_constraints (bool) optinonal, whether this model supports enforcing montonic constraints
classmethod get(project, model_id)

Retrieve a specific datetime model

If the project does not use datetime partitioning, a ClientError will occur.

Parameters:

project : str

the id of the project the model belongs to

model_id : str

the id of the model to retrieve

Returns:

model : DatetimeModel

the model

score_backtests()

Compute the scores for all available backtests

Some backtests may be unavailable if the model is trained into their validation data.

Returns:

job : Job

a job tracking the backtest computation. When it is complete, all available backtests will have scores computed.

cross_validate()

Inherited from Model - DatetimeModels cannot request Cross Validation,

Use score_backtests instead.

delete()

Delete a model from the project’s leaderboard.

download_export(filepath)

Download an exportable model file for use in an on-premise DataRobot standalone prediction environment.

This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.

Parameters:

filepath : str

The path at which to save the exported model file.

download_scoring_code(file_name, source_code=False)

Download scoring code JAR.

Parameters:

file_name : str

File path where scoring code will be saved.

source_code : bool, optional

Set to True to download source code archive. It will not be executable.

fetch_resource_data(*args, **kwargs)

(Deprecated.) Used to acquire model data directly from its url.

Consider using get instead, as this is a convenience function used for development of datarobot

Parameters:

url : string

The resource we are acquiring

join_endpoint : boolean, optional

Whether the client’s endpoint should be joined to the URL before sending the request. Location headers are returned as absolute locations, so will _not_ need the endpoint

Returns:

model_data : dict

The queried model’s data

get_all_confusion_charts()

Retrieve a list of all confusion charts available for the model.

Returns:

list of ConfusionChart

Data for all available confusion charts for model.

get_all_lift_charts()

Retrieve a list of all lift charts available for the model.

Returns:

list of LiftChart

Data for all available model lift charts.

get_all_roc_curves()

Retrieve a list of all ROC curves available for the model.

Returns:

list of RocCurve

Data for all available model ROC curves.

get_confusion_chart(source)

Retrieve model’s confusion chart for the specified source.

Parameters:

source : str

Confusion chart source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.

Returns:

ConfusionChart

Model ConfusionChart data

get_feature_impact()

Retrieve the computed Feature Impact results, a measure of the relevance of each feature in the model.

Elsewhere this technique is sometimes called ‘Permutation Importance’.

Requires that Feature Impact has already been computed with request_feature_impact.

Returns:

feature_impacts : list[dict]

The feature impact data. Each item is a dict with the keys ‘featureName’, ‘impactNormalized’, and ‘impactUnnormalized’. See the help for Model.request_feature_impact for more details.

Raises:

ClientError (404)

If the feature impacts have not been computed.

get_features_used()

Query the server to determine which features were used.

Note that the data returned by this method is possibly different than the names of the features in the featurelist used by this model. This method will return the raw features that must be supplied in order for predictions to be generated on a new set of data. The featurelist, in contrast, would also include the names of derived features.

Returns:

features : list of str

The names of the features used in the model.

Returns:

url : str

Permanent static hyperlink to this model at leaderboard.

get_lift_chart(source)

Retrieve model lift chart for the specified source.

Parameters:

source : str

Lift chart data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.

Returns:

LiftChart

Model lift chart data

get_model_blueprint_chart()

Retrieve a model blueprint chart that can be used to understand data flow in blueprint.

Returns:

ModelBlueprintChart

The queried model blueprint chart.

get_model_blueprint_documents()

Get documentation for tasks used in this model.

Returns:

list of BlueprintTaskDocument

All documents available for the model.

get_parameters()

Retrieve model parameters.

Returns:

ModelParameters

Model parameters for this model.

get_prime_eligibility()

Check if this model can be approximated with DataRobot Prime

Returns:

prime_eligibility : dict

a dict indicating whether a model can be approximated with DataRobot Prime (key can_make_prime) and why it may be ineligible (key message)

get_roc_curve(source)

Retrieve model ROC curve for the specified source.

Parameters:

source : str

ROC curve data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.

Returns:

RocCurve

Model ROC curve data

get_rulesets()

List the rulesets approximating this model generated by DataRobot Prime

If this model hasn’t been approximated yet, will return an empty list. Note that these are rulesets approximating this model, not rulesets used to construct this model.

Returns:rulesets : list of Ruleset
get_word_cloud(exclude_stop_words=False)

Retrieve a word cloud data for the model.

Parameters:

exclude_stop_words : bool, optional

Set to True if you want stopwords filtered out of response.

Returns:

WordCloud

Word cloud data for the model.

open_model_browser()

Opens model at project leaderboard in web browser.

Note: If text-mode browsers are used, the calling process will block until the user exits the browser.

request_approximation()

Request an approximation of this model using DataRobot Prime

This will create several rulesets that could be used to approximate this model. After comparing their scores and rule counts, the code used in the approximation can be downloaded and run locally.

Returns:

job : Job

the job generating the rulesets

request_feature_impact()

Request feature impacts to be computed for the model.

Feature Impact is computed for each column by creating new data with that column randomly permuted (but the others left unchanged), and seeing how the error metric score for the predictions is affected. The ‘impactUnnormalized’ is how much worse the error metric score is when making predictions on this modified data. The ‘impactNormalized’ is normalized so that the largest value is 1. In both cases, larger values indicate more important features.

Returns:

job : Job

A Job representing the feature impact computation. To get the completed feature impact data, use job.get_result or job.get_result_when_complete.

Raises:

JobAlreadyRequested (422)

If the feature impacts have already been requested.

request_frozen_datetime_model(training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, time_window_sample_pct=None)

Train a new frozen model with parameters from this model

Requires that this model belongs to a datetime partitioned project. If it does not, an error will occur when submitting the job.

Frozen models use the same tuning parameters as their parent model instead of independently optimizing them to allow efficiently retraining models on larger amounts of the training data.

In addition of training_row_count and training_duration, frozen datetime models may be trained on an exact date range. Only one of training_row_count, training_duration, or training_start_date and training_end_date should be specified.

Models specified using training_start_date and training_end_date are the only ones that can be trained into the holdout data (once the holdout is unlocked).

Parameters:

training_row_count : int, optional

the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.

training_duration : str, optional

a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.

training_start_date : datetime.datetime, optional

the start date of the data to train to model on. Only rows occurring at or after this datetime will be used. If training_start_date is specified, training_end_date must also be specified.

training_end_date : datetime.datetime, optional

the end date of the data to train the model on. Only rows occurring strictly before this datetime will be used. If training_end_date is specified, training_start_date must also be specified.

time_window_sample_pct : int, optional

may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.

Returns:

model_job : ModelJob

the modeling job training a frozen model

request_predictions(dataset_id)

Request predictions against a previously uploaded dataset

Parameters:

dataset_id : string

The dataset to make predictions against (as uploaded from Project.upload_dataset)

Returns:

job : PredictJob

The job computing the predictions

request_training_predictions(data_subset)

Start a job to build training predictions

Parameters:

data_subset : str

data set definition to build predictions on. Choices are:

  • dr.enums.DATA_SUBSET.ALL for all data available
  • dr.enums.DATA_SUBSET.VALIDATION_AND_HOLDOUT for all data except training set
  • dr.enums.DATA_SUBSET.HOLDOUT for holdout data set only
Returns:

Job

an instance of created async job

request_transferable_export()

Request generation of an exportable model file for use in an on-premise DataRobot standalone prediction environment.

This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.

This function does not download the exported file. Use download_export for that.

Examples

model = datarobot.Model.get('p-id', 'l-id')
job = model.request_transferable_export()
job.wait_for_completion()
model.download_export('my_exported_model.drmodel')

# Client must be configured to use standalone prediction server for import:
datarobot.Client(token='my-token-at-standalone-server',
                 endpoint='standalone-server-url/api/v2')

imported_model = datarobot.ImportedModel.create('my_exported_model.drmodel')
train_datetime(featurelist_id=None, training_row_count=None, training_duration=None, time_window_sample_pct=None)

Train this model on a different featurelist or amount of data

Requires that this model is part of a datetime partitioned project; otherwise, an error will occur.

Parameters:

featurelist_id : str, optional

the featurelist to use to train the model. If not specified, the featurelist of this model is used.

training_row_count : int, optional

the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.

training_duration : str, optional

a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.

time_window_sample_pct : int, optional

may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.

Returns:

job : ModelJob

the created job to build the model

RatingTableModel API

class datarobot.models.RatingTableModel(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, rating_table_id=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None)

A model that has a rating table.

Attributes

id (str) the id of the model
project_id (str) the id of the project the model belongs to
processes (list of str) the processes used by the model
featurelist_name (str) the name of the featurelist used by the model
featurelist_id (str) the id of the featurelist used by the model
sample_pct (float or None) the percentage of the project dataset used in training the model. If the project uses datetime partitioning, the sample_pct will be None. See training_row_count, training_duration, and training_start_date and training_end_date instead.
training_row_count (int or None) the number of rows of the project dataset used in training the model. In a datetime partitioned project, if specified, defines the number of rows used to train the model and evaluate backtest scores; if unspecified, either training_duration or training_start_date and training_end_date was used to determine that instead.
training_duration (str or None) only present for models in datetime partitioned projects. If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores.
training_start_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
training_end_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
model_type (str) what model this is, e.g. ‘Nystroem Kernel SVM Regressor’
model_category (str) what kind of model this is - ‘prime’ for DataRobot Prime models, ‘blend’ for blender models, and ‘model’ for other models
is_frozen (bool) whether this model is a frozen model
blueprint_id (str) the id of the blueprint used in this model
metrics (dict) a mapping from each metric to the model’s scores for that metric
rating_table_id (str) the id of the rating table that belongs to this model
monotonic_increasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
monotonic_decreasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
supports_monotonic_constraints (bool) optinonal, whether this model supports enforcing montonic constraints
classmethod get(project_id, model_id)

Retrieve a specific rating table model

If the project does not have a rating table, a ClientError will occur.

Parameters:

project_id : str

the id of the project the model belongs to

model_id : str

the id of the model to retrieve

Returns:

model : RatingTableModel

the model

classmethod create_from_rating_table(project_id, rating_table_id)

Creates a new model from a validated rating table record. The RatingTable must not be associated with an existing model.

Parameters:

project_id : str

the id of the project the rating table belongs to

rating_table_id : str

the id of the rating table to create this model from

Returns:

job: Job

an instance of created async job

Raises:

ClientError (422)

Raised if creating model from a RatingTable that failed validation

JobAlreadyRequested

Raised if creating model from a RatingTable that is already associated with a RatingTableModel

cross_validate()

Run Cross Validation on this model.

Note

To perform Cross Validation on a new model with new parameters, use train instead.

Returns:

ModelJob

The created job to build the model

delete()

Delete a model from the project’s leaderboard.

download_export(filepath)

Download an exportable model file for use in an on-premise DataRobot standalone prediction environment.

This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.

Parameters:

filepath : str

The path at which to save the exported model file.

download_scoring_code(file_name, source_code=False)

Download scoring code JAR.

Parameters:

file_name : str

File path where scoring code will be saved.

source_code : bool, optional

Set to True to download source code archive. It will not be executable.

fetch_resource_data(*args, **kwargs)

(Deprecated.) Used to acquire model data directly from its url.

Consider using get instead, as this is a convenience function used for development of datarobot

Parameters:

url : string

The resource we are acquiring

join_endpoint : boolean, optional

Whether the client’s endpoint should be joined to the URL before sending the request. Location headers are returned as absolute locations, so will _not_ need the endpoint

Returns:

model_data : dict

The queried model’s data

get_all_confusion_charts()

Retrieve a list of all confusion charts available for the model.

Returns:

list of ConfusionChart

Data for all available confusion charts for model.

get_all_lift_charts()

Retrieve a list of all lift charts available for the model.

Returns:

list of LiftChart

Data for all available model lift charts.

get_all_roc_curves()

Retrieve a list of all ROC curves available for the model.

Returns:

list of RocCurve

Data for all available model ROC curves.

get_confusion_chart(source)

Retrieve model’s confusion chart for the specified source.

Parameters:

source : str

Confusion chart source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.

Returns:

ConfusionChart

Model ConfusionChart data

get_feature_impact()

Retrieve the computed Feature Impact results, a measure of the relevance of each feature in the model.

Elsewhere this technique is sometimes called ‘Permutation Importance’.

Requires that Feature Impact has already been computed with request_feature_impact.

Returns:

feature_impacts : list[dict]

The feature impact data. Each item is a dict with the keys ‘featureName’, ‘impactNormalized’, and ‘impactUnnormalized’. See the help for Model.request_feature_impact for more details.

Raises:

ClientError (404)

If the feature impacts have not been computed.

get_features_used()

Query the server to determine which features were used.

Note that the data returned by this method is possibly different than the names of the features in the featurelist used by this model. This method will return the raw features that must be supplied in order for predictions to be generated on a new set of data. The featurelist, in contrast, would also include the names of derived features.

Returns:

features : list of str

The names of the features used in the model.

Returns:

url : str

Permanent static hyperlink to this model at leaderboard.

get_lift_chart(source)

Retrieve model lift chart for the specified source.

Parameters:

source : str

Lift chart data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.

Returns:

LiftChart

Model lift chart data

get_model_blueprint_chart()

Retrieve a model blueprint chart that can be used to understand data flow in blueprint.

Returns:

ModelBlueprintChart

The queried model blueprint chart.

get_model_blueprint_documents()

Get documentation for tasks used in this model.

Returns:

list of BlueprintTaskDocument

All documents available for the model.

get_parameters()

Retrieve model parameters.

Returns:

ModelParameters

Model parameters for this model.

get_prime_eligibility()

Check if this model can be approximated with DataRobot Prime

Returns:

prime_eligibility : dict

a dict indicating whether a model can be approximated with DataRobot Prime (key can_make_prime) and why it may be ineligible (key message)

get_roc_curve(source)

Retrieve model ROC curve for the specified source.

Parameters:

source : str

ROC curve data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.

Returns:

RocCurve

Model ROC curve data

get_rulesets()

List the rulesets approximating this model generated by DataRobot Prime

If this model hasn’t been approximated yet, will return an empty list. Note that these are rulesets approximating this model, not rulesets used to construct this model.

Returns:rulesets : list of Ruleset
get_word_cloud(exclude_stop_words=False)

Retrieve a word cloud data for the model.

Parameters:

exclude_stop_words : bool, optional

Set to True if you want stopwords filtered out of response.

Returns:

WordCloud

Word cloud data for the model.

open_model_browser()

Opens model at project leaderboard in web browser.

Note: If text-mode browsers are used, the calling process will block until the user exits the browser.

request_approximation()

Request an approximation of this model using DataRobot Prime

This will create several rulesets that could be used to approximate this model. After comparing their scores and rule counts, the code used in the approximation can be downloaded and run locally.

Returns:

job : Job

the job generating the rulesets

request_feature_impact()

Request feature impacts to be computed for the model.

Feature Impact is computed for each column by creating new data with that column randomly permuted (but the others left unchanged), and seeing how the error metric score for the predictions is affected. The ‘impactUnnormalized’ is how much worse the error metric score is when making predictions on this modified data. The ‘impactNormalized’ is normalized so that the largest value is 1. In both cases, larger values indicate more important features.

Returns:

job : Job

A Job representing the feature impact computation. To get the completed feature impact data, use job.get_result or job.get_result_when_complete.

Raises:

JobAlreadyRequested (422)

If the feature impacts have already been requested.

request_frozen_datetime_model(training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, time_window_sample_pct=None)

Train a new frozen model with parameters from this model

Requires that this model belongs to a datetime partitioned project. If it does not, an error will occur when submitting the job.

Frozen models use the same tuning parameters as their parent model instead of independently optimizing them to allow efficiently retraining models on larger amounts of the training data.

In addition of training_row_count and training_duration, frozen datetime models may be trained on an exact date range. Only one of training_row_count, training_duration, or training_start_date and training_end_date should be specified.

Models specified using training_start_date and training_end_date are the only ones that can be trained into the holdout data (once the holdout is unlocked).

Parameters:

training_row_count : int, optional

the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.

training_duration : str, optional

a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.

training_start_date : datetime.datetime, optional

the start date of the data to train to model on. Only rows occurring at or after this datetime will be used. If training_start_date is specified, training_end_date must also be specified.

training_end_date : datetime.datetime, optional

the end date of the data to train the model on. Only rows occurring strictly before this datetime will be used. If training_end_date is specified, training_start_date must also be specified.

time_window_sample_pct : int, optional

may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.

Returns:

model_job : ModelJob

the modeling job training a frozen model

request_frozen_model(sample_pct=None, training_row_count=None)

Train a new frozen model with parameters from this model

Note

This method only works if project the model belongs to is not datetime partitioned. If it is, use request_frozen_datetime_model instead.

Frozen models use the same tuning parameters as their parent model instead of independently optimizing them to allow efficiently retraining models on larger amounts of the training data.

Parameters:

sample_pct : float

optional, the percentage of the dataset to use with the model. If not provided, will use the value from this model.

training_row_count : int

(New in version v2.9) optional, the integer number of rows of the dataset to use with the model. Only one of sample_pct and training_row_count should be specified.

Returns:

model_job : ModelJob

the modeling job training a frozen model

request_predictions(dataset_id)

Request predictions against a previously uploaded dataset

Parameters:

dataset_id : string

The dataset to make predictions against (as uploaded from Project.upload_dataset)

Returns:

job : PredictJob

The job computing the predictions

request_training_predictions(data_subset)

Start a job to build training predictions

Parameters:

data_subset : str

data set definition to build predictions on. Choices are:

  • dr.enums.DATA_SUBSET.ALL for all data available
  • dr.enums.DATA_SUBSET.VALIDATION_AND_HOLDOUT for all data except training set
  • dr.enums.DATA_SUBSET.HOLDOUT for holdout data set only
Returns:

Job

an instance of created async job

request_transferable_export()

Request generation of an exportable model file for use in an on-premise DataRobot standalone prediction environment.

This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.

This function does not download the exported file. Use download_export for that.

Examples

model = datarobot.Model.get('p-id', 'l-id')
job = model.request_transferable_export()
job.wait_for_completion()
model.download_export('my_exported_model.drmodel')

# Client must be configured to use standalone prediction server for import:
datarobot.Client(token='my-token-at-standalone-server',
                 endpoint='standalone-server-url/api/v2')

imported_model = datarobot.ImportedModel.create('my_exported_model.drmodel')
train(sample_pct=None, featurelist_id=None, scoring_type=None, training_row_count=None, monotonic_increasing_featurelist_id=<object object>, monotonic_decreasing_featurelist_id=<object object>)

Train the blueprint used in model on a particular featurelist or amount of data.

This method creates a new training job for worker and appends it to the end of the queue for this project. After the job has finished you can get the newly trained model by retrieving it from the project leaderboard, or by retrieving the result of the job.

Either sample_pct or training_row_count can be used to specify the amount of data to use, but not both. If neither are specified, a default of the maximum amount of data that can safely be used to train any blueprint without going into the validation data will be selected.

In smart-sampled projects, sample_pct and training_row_count are assumed to be in terms of rows of the minority class.

Note

For datetime partitioned projects, use train_datetime instead.

Parameters:

sample_pct : float, optional

The amount of data to use for training, as a percentage of the project dataset from 0 to 100.

featurelist_id : str, optional

The identifier of the featurelist to use. If not defined, the featurelist of this model is used.

scoring_type : str, optional

Either SCORING_TYPE.validation or SCORING_TYPE.cross_validation. SCORING_TYPE.validation is available for every partitioning type, and indicates that the default model validation should be used for the project. If the project uses a form of cross-validation partitioning, SCORING_TYPE.cross_validation can also be used to indicate that all of the available training/validation combinations should be used to evaluate the model.

training_row_count : int, optional

The number of rows to use to train the requested model.

monotonic_increasing_featurelist_id : str

(new in version 2.11) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. Passing None disables increasing monotonicity constraint. Default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.

monotonic_decreasing_featurelist_id : str

(new in version 2.11) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. Passing None disables decreasing monotonicity constraint. Default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.

Returns:

model_job_id : str

id of created job, can be used as parameter to ModelJob.get method or wait_for_async_model_creation function

Examples

project = Project.get('p-id')
model = Model.get('p-id', 'l-id')
model_job_id = model.train(training_row_count=project.max_train_rows)
train_datetime(featurelist_id=None, training_row_count=None, training_duration=None, time_window_sample_pct=None)

Train this model on a different featurelist or amount of data

Requires that this model is part of a datetime partitioned project; otherwise, an error will occur.

Parameters:

featurelist_id : str, optional

the featurelist to use to train the model. If not specified, the featurelist of this model is used.

training_row_count : int, optional

the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.

training_duration : str, optional

a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.

time_window_sample_pct : int, optional

may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.

Returns:

job : ModelJob

the created job to build the model

Job API

class datarobot.models.Job(data, completed_resource_url=None)

Tracks asynchronous work being done within a project

Attributes

id (int) the id of the job
project_id (str) the id of the project the job belongs to
status (str) the status of the job - will be one of datarobot.enums.QUEUE_STATUS
job_type (str) what kind of work the job is doing - will be one of datarobot.enums.JOB_TYPE
classmethod get(project_id, job_id)

Fetches one job.

Parameters:

project_id : str

The identifier of the project in which the job resides

job_id : str

The job id

Returns:

job : Job

The job

Raises:

AsyncFailureError

Querying this resource gave a status code other than 200 or 303

cancel()

Cancel this job. If this job has not finished running, it will be removed and canceled.

get_result()
Returns:

result : object

Return type depends on the job type:
  • for model jobs, a Model is returned
  • for predict jobs, a pandas.DataFrame (with predictions) is returned
  • for featureImpact jobs, a list of dicts (whose keys are featureName and impact)
  • for primeRulesets jobs, a list of Rulesets
  • for primeModel jobs, a PrimeModel
  • for primeDownloadValidation jobs, a PrimeFile
  • for reasonCodesInitialization jobs, a ReasonCodesInitialization
  • for reasonCodes jobs, a ReasonCodes
Raises:

JobNotFinished

If the job is not finished, the result is not available.

AsyncProcessUnsuccessfulError

If the job errored or was aborted

get_result_when_complete(max_wait=600)
Parameters:

max_wait : int, optional

How long to wait for the job to finish.

Returns:

result: object

Return type is the same as would be returned by Job.get_result.

Raises:

AsyncTimeoutError

If the job does not finish in time

AsyncProcessUnsuccessfulError

If the job errored or was aborted

refresh()

Update this object with the latest job data from the server.

wait_for_completion(max_wait=600)

Waits for job to complete.

Parameters:

max_wait : int, optional

How long to wait for the job to finish.

class datarobot.models.TrainingPredictionsJob(data, model_id, data_subset, **kwargs)
classmethod get(project_id, job_id, model_id=None, data_subset=None)

Fetches one job. The resulting datarobot.models.TrainingPredictions object will be annotated with model_id and data_subset.

Parameters:

project_id : str

The identifier of the project in which the job resides

job_id : str

The job id

model_id : str

The identifier of the model used for computing training predictions

data_subset : dr.enums.DATA_SUBSET, optional

Data subset used for computing training predictions

Returns:

job : TrainingPredictionsJob

The job

Raises:

AsyncFailureError

Querying this resource gave a status code other than 200 or 303

refresh()

Update this object with the latest job data from the server.

cancel()

Cancel this job. If this job has not finished running, it will be removed and canceled.

get_result()
Returns:

result : object

Return type depends on the job type:
  • for model jobs, a Model is returned
  • for predict jobs, a pandas.DataFrame (with predictions) is returned
  • for featureImpact jobs, a list of dicts (whose keys are featureName and impact)
  • for primeRulesets jobs, a list of Rulesets
  • for primeModel jobs, a PrimeModel
  • for primeDownloadValidation jobs, a PrimeFile
  • for reasonCodesInitialization jobs, a ReasonCodesInitialization
  • for reasonCodes jobs, a ReasonCodes
Raises:

JobNotFinished

If the job is not finished, the result is not available.

AsyncProcessUnsuccessfulError

If the job errored or was aborted

get_result_when_complete(max_wait=600)
Parameters:

max_wait : int, optional

How long to wait for the job to finish.

Returns:

result: object

Return type is the same as would be returned by Job.get_result.

Raises:

AsyncTimeoutError

If the job does not finish in time

AsyncProcessUnsuccessfulError

If the job errored or was aborted

wait_for_completion(max_wait=600)

Waits for job to complete.

Parameters:

max_wait : int, optional

How long to wait for the job to finish.

ModelJob API

datarobot.models.modeljob.wait_for_async_model_creation(project_id, model_job_id, max_wait=600)

Given a Project id and ModelJob id poll for status of process responsible for model creation until model is created.

Parameters:

project_id : str

The identifier of the project

model_job_id : str

The identifier of the ModelJob

max_wait : int, optional

Time in seconds after which model creation is considered unsuccessful

Returns:

model : Model

Newly created model

Raises:

AsyncModelCreationError

Raised if status of fetched ModelJob object is error

AsyncTimeoutError

Model wasn’t created in time, specified by max_wait parameter

class datarobot.models.ModelJob(data, completed_resource_url=None)

Tracks asynchronous work being done within a project

Attributes

id (int) the id of the job
project_id (str) the id of the project the job belongs to
status (str) the status of the job - will be one of datarobot.enums.QUEUE_STATUS
job_type (str) what kind of work the job is doing - will be ‘model’ for modeling jobs
sample_pct (float) the percentage of the project’s dataset used in this modeling job
model_type (str) the model this job builds (e.g. ‘Nystroem Kernel SVM Regressor’)
processes (list of str) the processes used by the model
featurelist_id (str) the id of the featurelist used in this modeling job
blueprint (Blueprint) the blueprint used in this modeling job
classmethod from_job(job)

Transforms a generic Job into a ModelJob

Parameters:

job: Job

A generic job representing a ModelJob

Returns:

model_job: ModelJob

A fully populated ModelJob with all the details of the job

Raises:

ValueError:

If the generic Job was not a model job, e.g. job_type != JOB_TYPE.MODEL

classmethod get(project_id, model_job_id)

Fetches one ModelJob. If the job finished, raises PendingJobFinished exception.

Parameters:

project_id : str

The identifier of the project the model belongs to

model_job_id : str

The identifier of the model_job

Returns:

model_job : ModelJob

The pending ModelJob

Raises:

PendingJobFinished

If the job being queried already finished, and the server is re-routing to the finished model.

AsyncFailureError

Querying this resource gave a status code other than 200 or 303

classmethod get_model(project_id, model_job_id)

Fetches a finished model from the job used to create it.

Parameters:

project_id : str

The identifier of the project the model belongs to

model_job_id : str

The identifier of the model_job

Returns:

model : Model

The finished model

Raises:

JobNotFinished

If the job has not finished yet

AsyncFailureError

Querying the model_job in question gave a status code other than 200 or 303

cancel()

Cancel this job. If this job has not finished running, it will be removed and canceled.

get_result()
Returns:

result : object

Return type depends on the job type:
  • for model jobs, a Model is returned
  • for predict jobs, a pandas.DataFrame (with predictions) is returned
  • for featureImpact jobs, a list of dicts (whose keys are featureName and impact)
  • for primeRulesets jobs, a list of Rulesets
  • for primeModel jobs, a PrimeModel
  • for primeDownloadValidation jobs, a PrimeFile
  • for reasonCodesInitialization jobs, a ReasonCodesInitialization
  • for reasonCodes jobs, a ReasonCodes
Raises:

JobNotFinished

If the job is not finished, the result is not available.

AsyncProcessUnsuccessfulError

If the job errored or was aborted

get_result_when_complete(max_wait=600)
Parameters:

max_wait : int, optional

How long to wait for the job to finish.

Returns:

result: object

Return type is the same as would be returned by Job.get_result.

Raises:

AsyncTimeoutError

If the job does not finish in time

AsyncProcessUnsuccessfulError

If the job errored or was aborted

refresh()

Update this object with the latest job data from the server.

wait_for_completion(max_wait=600)

Waits for job to complete.

Parameters:

max_wait : int, optional

How long to wait for the job to finish.

Prediction Dataset API

class datarobot.models.PredictionDataset(project_id, id, name, created, num_rows, num_columns, forecast_point=None, predictions_start_date=None, predictions_end_date=None)

A dataset uploaded to make predictions

Typically created via project.upload_dataset

Attributes

id (str) the id of the dataset
project_id (str) the id of the project the dataset belongs to
created (str) the time the dataset was created
name (str) the name of the dataset
num_rows (int) the number of rows in the dataset
num_columns (int) the number of columns in the dataset
forecast_point (datetime.datetime or None) Only specified in time series projects. The point relative to which predictions will be generated, based on the forecast window of the project. See the time series documentation for more information.
predictions_start_date (datetime.datetime or None, optional) Only specified in time series projects. The start date for bulk predictions. This parameter should be provided in conjunction with predictions_end_date. Can’t be provided with forecastPoint parameter.
predictions_end_date (datetime.datetime or None, optional) Only specified in time series projects. The end date for bulk predictions. This parameter should be provided in conjunction with predictions_start_date. Can’t be provided with forecastPoint parameter.
classmethod get(project_id, dataset_id)

Retrieve information about a dataset uploaded for predictions

Parameters:

project_id:

the id of the project to query

dataset_id:

the id of the dataset to retrieve

Returns:

dataset: PredictionDataset

A dataset uploaded to make predictions

delete()

Delete a dataset uploaded for predictions

Will also delete predictions made using this dataset and cancel any predict jobs using this dataset.

PredictJob API

datarobot.models.predict_job.wait_for_async_predictions(project_id, predict_job_id, max_wait=600)

Given a Project id and PredictJob id poll for status of process responsible for predictions generation until it’s finished

Parameters:

project_id : str

The identifier of the project

predict_job_id : str

The identifier of the PredictJob

max_wait : int, optional

Time in seconds after which predictions creation is considered unsuccessful

Returns:

predictions : pandas.DataFrame

Generated predictions.

Raises:

AsyncPredictionsGenerationError

Raised if status of fetched PredictJob object is error

AsyncTimeoutError

Predictions weren’t generated in time, specified by max_wait parameter

class datarobot.models.PredictJob(data, completed_resource_url=None)

Tracks asynchronous work being done within a project

Attributes

id (int) the id of the job
project_id (str) the id of the project the job belongs to
status (str) the status of the job - will be one of datarobot.enums.QUEUE_STATUS
job_type (str) what kind of work the job is doing - will be ‘predict’ for predict jobs
message (str) a message about the state of the job, typically explaining why an error occurred
classmethod from_job(job)

Transforms a generic Job into a PredictJob

Parameters:

job: Job

A generic job representing a PredictJob

Returns:

predict_job: PredictJob

A fully populated PredictJob with all the details of the job

Raises:

ValueError:

If the generic Job was not a predict job, e.g. job_type != JOB_TYPE.PREDICT

classmethod create(*args, **kwargs)

Note

Deprecated in v2.3 in favor of Project.upload_dataset and Model.request_predictions. That workflow allows you to reuse the same dataset for predictions from multiple models within one project.

Starts predictions generation for provided data using previously created model.

Parameters:

model : Model

Model to use for predictions generation

sourcedata : str, file or pandas.DataFrame

Data to be used for predictions. If this parameter is a str, it can be either a path to a local file or raw file content. If using a file on disk, the filename must consist of ASCII characters only. The file must be a CSV, and cannot be compressed

Returns:

predict_job_id : str

id of created job, can be used as parameter to PredictJob.get or PredictJob.get_predictions methods or wait_for_async_predictions function

Raises:

InputNotUnderstoodError

If the parameter for sourcedata didn’t resolve into known data types

Examples

model = Model.get('p-id', 'l-id')
predict_job = PredictJob.create(model, './data_to_predict.csv')
classmethod get(project_id, predict_job_id)

Fetches one PredictJob. If the job finished, raises PendingJobFinished exception.

Parameters:

project_id : str

The identifier of the project the model on which prediction was started belongs to

predict_job_id : str

The identifier of the predict_job

Returns:

predict_job : PredictJob

The pending PredictJob

Raises:

PendingJobFinished

If the job being queried already finished, and the server is re-routing to the finished predictions.

AsyncFailureError

Querying this resource gave a status code other than 200 or 303

classmethod get_predictions(project_id, predict_job_id, class_prefix='class_')

Fetches finished predictions from the job used to generate them.

Note

The prediction API for classifications now returns an additional prediction_values dictionary that is converted into a series of class_prefixed columns in the final dataframe. For example, <label> = 1.0 is converted to ‘class_1.0’. If you are on an older version of the client (prior to v2.8), you must update to v2.8 to correctly pivot this data.

Parameters:

project_id : str

The identifier of the project to which belongs the model used for predictions generation

predict_job_id : str

The identifier of the predict_job

class_prefix : str

The prefix to append to labels in the final dataframe (e.g., apple -> class_apple)

Returns:

predictions : pandas.DataFrame

Generated predictions

Raises:

JobNotFinished

If the job has not finished yet

AsyncFailureError

Querying the predict_job in question gave a status code other than 200 or 303

cancel()

Cancel this job. If this job has not finished running, it will be removed and canceled.

get_result()
Returns:

result : object

Return type depends on the job type:
  • for model jobs, a Model is returned
  • for predict jobs, a pandas.DataFrame (with predictions) is returned
  • for featureImpact jobs, a list of dicts (whose keys are featureName and impact)
  • for primeRulesets jobs, a list of Rulesets
  • for primeModel jobs, a PrimeModel
  • for primeDownloadValidation jobs, a PrimeFile
  • for reasonCodesInitialization jobs, a ReasonCodesInitialization
  • for reasonCodes jobs, a ReasonCodes
Raises:

JobNotFinished

If the job is not finished, the result is not available.

AsyncProcessUnsuccessfulError

If the job errored or was aborted

get_result_when_complete(max_wait=600)
Parameters:

max_wait : int, optional

How long to wait for the job to finish.

Returns:

result: object

Return type is the same as would be returned by Job.get_result.

Raises:

AsyncTimeoutError

If the job does not finish in time

AsyncProcessUnsuccessfulError

If the job errored or was aborted

refresh()

Update this object with the latest job data from the server.

wait_for_completion(max_wait=600)

Waits for job to complete.

Parameters:

max_wait : int, optional

How long to wait for the job to finish.

Feature List API

class datarobot.models.Featurelist(id=None, name=None, features=None, project_id=None)

A set of features used in modeling

Attributes

id (str) the id of the featurelist
name (str) the name of the featurelist
features (list of str) the names of all the Features in the Featurelist
project_id (str) the project the Featurelist belongs to
classmethod get(project_id, featurelist_id)

Retrieve a known feature list

Parameters:

project_id : str

The id of the project the featurelist is associated with

featurelist_id : str

The ID of the featurelist to retrieve

Returns:

featurelist : Featurelist

The queried instance

class datarobot.models.ModelingFeaturelist(id=None, name=None, features=None, project_id=None)

A set of features that can be used to build a model

In time series projects, a new set of modeling features is created after setting the partitioning options. These features are automatically derived from those in the project’s dataset and are the features used for modeling. Modeling features are only accessible once the target and partitioning options have been set. In projects that don’t use time series modeling, once the target has been set, ModelingFeaturelists and Featurelists will behave the same.

For more information about input and modeling features, see the time series documentation.

Attributes

id (str) the id of the modeling featurelist
project_id (str) the id of the project the modeling featurelist belongs to
name (str) the name of the modeling featurelist
features (list of str) a list of the names of features included in this modeling featurelist
classmethod get(project_id, featurelist_id)

Retrieve a modeling featurelist

Modeling featurelists can only be retrieved once the target and partitioning options have been set.

Parameters:

project_id : str

the id of the project the modeling featurelist belongs to

featurelist_id : str

the id of the modeling featurelist to retrieve

Returns:

featurelist : ModelingFeaturelist

the specified featurelist

Feature API

class datarobot.models.Feature(id, project_id=None, name=None, feature_type=None, importance=None, low_information=None, unique_count=None, na_count=None, date_format=None, min=None, max=None, mean=None, median=None, std_dev=None, time_series_eligible=None, time_series_eligibility_reason=None, time_step=None, time_unit=None, target_leakage=None)

A feature from a project’s dataset

These are features either included in the originally uploaded dataset or added to it via feature transformations. In time series projects, these will be distinct from the ModelingFeature s created during partitioning; otherwise, they will correspond to the same features. For more information about input and modeling features, see the time series documentation.

The min, max, mean, median, and std_dev attributes provide information about the distribution of the feature in the EDA sample data. For non-numeric features or features created prior to these summary statistics becoming available, they will be None. For features where the summary statistics are available, they will be in a format compatible with the data type, i.e. date type features will have their summary statistics expressed as ISO-8601 formatted date strings.

Attributes

id (int) the id for the feature - note that name is used to reference the feature instead of id
project_id (str) the id of the project the feature belongs to
name (str) the name of the feature
feature_type (str) the type of the feature, e.g. ‘Categorical’, ‘Text’
importance (float or None) numeric measure of the strength of relationship between the feature and target (independent of any model or other features); may be None for non-modeling features such as partition columns
low_information (bool) whether a feature is considered too uninformative for modeling (e.g. because it has too few values)
unique_count (int) number of unique values
na_count (int or None) number of missing values
date_format (str or None) For Date features, the date format string for how this feature was interpreted, compatible with https://docs.python.org/2/library/time.html#time.strftime . For other feature types, None.
min (str, int, float, or None) The minimum value of the source data in the EDA sample
max (str, int, float, or None) The maximum value of the source data in the EDA sample
mean (str, int, or, float) The arithmetic mean of the source data in the EDA sample
median (str, int, float, or None) The median of the source data in the EDA sample
std_dev (str, int, float, or None) The standard deviation of the source data in the EDA sample
time_series_eligible (bool) Whether this feature can be used as the datetime partition column in a time series project.
time_series_eligibility_reason (str) Why the feature is ineligible for the datetime partition column in a time series project, or ‘suitable’ when it is eligible.
time_step (int or None) For time series eligible features, a positive integer determining the interval at which windows can be specified. If used as the datetime partition column on a time series project, the feature derivation and forecast windows must start and end at an integer multiple of this value. None for features that are not time series eligible.
time_unit (str or None) For time series eligible features, the time unit covered by a single time step, e.g. ‘HOUR’, or None for features that are not time series eligible.
target_leakage (str) Whether a feature is considered to have target leakage or not. A value of ‘SKIPPED_DETECTION’ indicates that target leakage detection was not run on the feature. ‘FALSE’ indicates no leakage, ‘MODERATE’ indicates a moderate risk of target leakage, and ‘HIGH_RISK’ indicates a high risk of target leakage
classmethod get(project_id, feature_name)

Retrieve a single feature

Parameters:

project_id : str

The ID of the project the feature is associated with.

feature_name : str

The name of the feature to retrieve

Returns:

feature : Feature

The queried instance

get_multiseries_properties(multiseries_id_columns, max_wait=600)

Retrieve time series properties for a potential multiseries datetime partition column

Multiseries time series projects use multiseries id columns to model multiple distinct series within a single project. This function returns the time series properties (time step and time unit) of this column if it were used as a datetime partition column with the specified multiseries id columns, running multiseries detection automatically if it had not previously been successfully ran.

Parameters:

multiseries_id_columns : list of str

the name(s) of the multiseries id columns to use with this datetime partition column. Currently only one multiseries id column is supported.

max_wait : int, optional

if a multiseries detection task is run, the maximum amount of time to wait for it to complete before giving up

Returns:

properties : dict

A dict with three keys:

  • time_series_eligible : bool, whether the column can be used as a partition column
  • time_unit : str or null, the inferred time unit if used as a partition column
  • time_step : int or null, the inferred time step if used as a partition column
class datarobot.models.ModelingFeature(project_id=None, name=None, feature_type=None, importance=None, low_information=None, unique_count=None, na_count=None, date_format=None, min=None, max=None, mean=None, median=None, std_dev=None, parent_feature_names=None)

A feature used for modeling

In time series projects, a new set of modeling features is created after setting the partitioning options. These features are automatically derived from those in the project’s dataset and are the features used for modeling. Modeling features are only accessible once the target and partitioning options have been set. In projects that don’t use time series modeling, once the target has been set, ModelingFeatures and Features will behave the same.

For more information about input and modeling features, see the time series documentation.

As with the dr.models.feature.Feature object, the min, max, `mean, median, and std_dev attributes provide information about the distribution of the feature in the EDA sample data. For non-numeric features, they will be None. For features where the summary statistics are available, they will be in a format compatible with the data type, i.e. date type features will have their summary statistics expressed as ISO-8601 formatted date strings.

Attributes

project_id (str) the id of the project the feature belongs to
name (str) the name of the feature
feature_type (str) the type of the feature, e.g. ‘Categorical’, ‘Text’
importance (float or None) numeric measure of the strength of relationship between the feature and target (independent of any model or other features); may be None for non-modeling features such as partition columns
low_information (bool) whether a feature is considered too uninformative for modeling (e.g. because it has too few values)
unique_count (int) number of unique values
na_count (int or None) number of missing values
date_format (str or None) For Date features, the date format string for how this feature was interpreted, compatible with https://docs.python.org/2/library/time.html#time.strftime . For other feature types, None.
min (str, int, float, or None) The minimum value of the source data in the EDA sample
max (str, int, float, or None) The maximum value of the source data in the EDA sample
mean (str, int, or, float) The arithmetic mean of the source data in the EDA sample
median (str, int, float, or None) The median of the source data in the EDA sample
std_dev (str, int, float, or None) The standard deviation of the source data in the EDA sample
parent_feature_names (list of str) A list of the names of input features used to derive this modeling feature. In cases where the input features and modeling features are the same, this will simply contain the feature’s name. Note that if a derived feature was used to create this modeling feature, the values here will not necessarily correspond to the features that must be supplied at prediction time.
classmethod get(project_id, feature_name)

Retrieve a single modeling feature

Parameters:

project_id : str

The ID of the project the feature is associated with.

feature_name : str

The name of the feature to retrieve

Returns:

feature : ModelingFeature

The requested feature

Ruleset API

class datarobot.models.Ruleset(project_id=None, parent_model_id=None, model_id=None, ruleset_id=None, rule_count=None, score=None)

Represents an approximation of a model with DataRobot Prime

Attributes

id (str) the id of the ruleset
rule_count (int) the number of rules used to approximate the model
score (float) the validation score of the approximation
project_id (str) the project the approximation belongs to
parent_model_id (str) the model being approximated
model_id (str or None) the model using this ruleset (if it exists). Will be None if no such model has been trained.
request_model()

Request training for a model using this ruleset

Training a model using a ruleset is a necessary prerequisite for being able to download the code for a ruleset.

Returns:

job: Job

the job fitting the new Prime model

PrimeFile API

class datarobot.models.PrimeFile(id=None, project_id=None, parent_model_id=None, model_id=None, ruleset_id=None, language=None, is_valid=None)

Represents an executable file available for download of the code for a DataRobot Prime model

Attributes

id (str) the id of the PrimeFile
project_id (str) the id of the project this PrimeFile belongs to
parent_model_id (str) the model being approximated by this PrimeFile
model_id (str) the prime model this file represents
ruleset_id (int) the ruleset being used in this PrimeFile
language (str) the language of the code in this file - see enums.LANGUAGE for possibilities
is_valid (bool) whether the code passed basic validation
download(filepath)

Download the code and save it to a file

Parameters:

filepath: string

the location to save the file to

Frozen Model API

class datarobot.models.FrozenModel(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, parent_model_id=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None)

A model tuned with parameters which are derived from another model

Attributes

id (str) the id of the model
project_id (str) the id of the project the model belongs to
processes (list of str) the processes used by the model
featurelist_name (str) the name of the featurelist used by the model
featurelist_id (str) the id of the featurelist used by the model
sample_pct (float) the percentage of the project dataset used in training the model
training_row_count (int or None) the number of rows of the project dataset used in training the model. In a datetime partitioned project, if specified, defines the number of rows used to train the model and evaluate backtest scores; if unspecified, either training_duration or training_start_date and training_end_date was used to determine that instead.
training_duration (str or None) only present for models in datetime partitioned projects. If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores.
training_start_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
training_end_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
model_type (str) what model this is, e.g. ‘Nystroem Kernel SVM Regressor’
model_category (str) what kind of model this is - ‘prime’ for DataRobot Prime models, ‘blend’ for blender models, and ‘model’ for other models
is_frozen (bool) whether this model is a frozen model
parent_model_id (str) the id of the model that tuning parameters are derived from
blueprint_id (str) the id of the blueprint used in this model
metrics (dict) a mapping from each metric to the model’s scores for that metric
monotonic_increasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
monotonic_decreasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
supports_monotonic_constraints (bool) optinonal, whether this model supports enforcing montonic constraints
classmethod get(project_id, model_id)

Retrieve a specific frozen model.

Parameters:

project_id : str

The project’s id.

model_id : str

The model_id of the leaderboard item to retrieve.

Returns:

model : FrozenModel

The queried instance.

Advanced Options API

class datarobot.helpers.AdvancedOptions(weights=None, response_cap=None, blueprint_threshold=None, seed=None, smart_downsampled=False, majority_downsampling_rate=None, offset=None, exposure=None, accuracy_optimized_mb=None, scaleout_modeling_mode=None, events_count=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, only_include_monotonic_blueprints=None)

Used when setting the target of a project to set advanced options of modeling process.

Parameters:

weights : string, optional

The name of a column indicating the weight of each row

response_cap : float in [0.5, 1), optional

Quantile of the response distribution to use for response capping.

blueprint_threshold : int, optional

Number of hours models are permitted to run before being excluded from later autopilot stages Minimum 1

seed : int

a seed to use for randomization

smart_downsampled : bool

whether to use smart downsampling to throw away excess rows of the majority class. Only applicable to classification and zero-boosted regression projects.

majority_downsampling_rate : float

the percentage between 0 and 100 of the majority rows that should be kept. Specify only if using smart downsampling. May not cause the majority class to become smaller than the minority class.

offset : list of str, optional

(New in version v2.6) the list of the names of the columns containing the offset of each row

exposure : string, optional

(New in version v2.6) the name of a column containing the exposure of each row

accuracy_optimized_mb : bool, optional

(New in version v2.6) Include additional, longer-running models that will be run by the autopilot and available to run manually.

scaleout_modeling_mode : string, optional

(New in version v2.8) Specifies the behavior of Scaleout models for the project. This is one of datarobot.enums.SCALEOUT_MODELING_MODE. If datarobot.enums.SCALEOUT_MODELING_MODE.DISABLED, no models will run during autopilot or show in the list of available blueprints. Scaleout models must be disabled for some partitioning settings including projects using datetime partitioning or projects using offset or exposure columns. If datarobot.enums.SCALEOUT_MODELING_MODE.REPOSITORY_ONLY, scaleout models will be in the list of available blueprints but not run during autopilot. If datarobot.enums.SCALEOUT_MODELING_MODE.AUTOPILOT, scaleout models will run during autopilot and be in the list of available blueprints. Scaleout models are only supported in the Hadoop enviroment with the corresponding user permission set.

events_count : string, optional

(New in version v2.8) the name of a column specifying events count.

monotonic_increasing_featurelist_id : string, optional

(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced. When specified, this will set a default for the project that can be overriden at model submission time if desired.

monotonic_decreasing_featurelist_id : string, optional

(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced. When specified, this will set a default for the project that can be overriden at model submission time if desired.

only_include_monotonic_blueprints : bool, optional

(new in version 2.11) when true, only blueprints that support enforcing monotonic constraints will be available in the project or selected for the autopilot.

Examples

import datarobot as dr
advanced_options = dr.AdvancedOptions(
    weights='weights_column',
    offset=['offset_column'],
    exposure='exposure_column',
    response_cap=0.7,
    blueprint_threshold=2,
    smart_downsampled=True, majority_downsampling_rate=75.0)

Imported Model API

Note

Imported Models are used in Stand Alone Scoring Engines. If you are not an administrator of such an engine, they are not relevant to you.

class datarobot.models.ImportedModel(id, imported_at=None, model_id=None, target=None, featurelist_name=None, dataset_name=None, model_name=None, project_id=None, version=None, note=None, origin_url=None, imported_by_username=None, project_name=None, created_by_username=None, created_by_id=None, imported_by_id=None, display_name=None)

Represents an imported model available for making predictions. These are only relevant for administrators of on-premise Stand Alone Scoring Engines.

ImportedModels are trained in one DataRobot application, exported as a .drmodel file, and then imported for use in a Stand Alone Scoring Engine.

Attributes

id (str) id of the import
model_name (str) model type describing the model generated by DataRobot
display_name (str) manually specified human-readable name of the imported model
note (str) manually added node about this imported model
imported_at (datetime) the time the model was imported
imported_by_username (str) username of the user who imported the model
imported_by_id (str) id of the user who imported the model
origin_url (str) URL of the application the model was exported from
model_id (str) original id of the model prior to export
featurelist_name (str) name of the featurelist used to train the model
project_id (str) id of the project the model belonged to prior to export
project_name (str) name of the project the model belonged to prior to export
target (str) the target of the project the model belonged to prior to export
version (float) project version of the project the model belonged to
dataset_name (str) filename of the dataset used to create the project the model belonged to
created_by_username (str) username of the user who created the model prior to export
created_by_id (str) id of the user who created the model prior to export
classmethod create(path)

Import a previously exported model for predictions.

Parameters:

path : str

The path to the exported model file

classmethod get(import_id)

Retrieve imported model info

Parameters:

import_id : str

The ID of the imported model.

Returns:

imported_model : ImportedModel

The ImportedModel instance

classmethod list(limit=None, offset=None)

List the imported models.

Parameters:

limit : int

The number of records to return. The server will use a (possibly finite) default if not specified.

offset : int

The number of records to skip.

Returns:

imported_models : list[ImportedModel]

update(display_name=None, note=None)

Update the display name or note for an imported model. The ImportedModel object is updated in place.

Parameters:

display_name : str

The new display name.

note : str

The new note.

delete()

Delete this imported model.

Reason Codes API

class datarobot.ReasonCodesInitialization(project_id, model_id, reason_codes_sample=None)

Represents a reason codes initialization of a model.

Attributes

project_id (str) id of the project the model belongs to
model_id (str) id of the model reason codes initialization is for
reason_codes_sample (list of dict) a small sample of reason codes that could be generated for the model
classmethod get(project_id, model_id)

Retrieve the reason codes initialization for a model.

Reason codes initializations are a prerequisite for computing reason codes, and include a sample what the computed reason codes for a prediction dataset would look like.

Parameters:

project_id : str

id of the project the model belongs to

model_id : str

id of the model reason codes initialization is for

Returns:

reason_codes_initialization : ReasonCodesInitialization

The queried instance.

Raises:

ClientError (404)

If the project or model does not exist or the initialization has not been computed.

classmethod create(project_id, model_id)

Create a reason codes initialization for the specified model.

Parameters:

project_id : str

id of the project the model belongs to

model_id : str

id of the model for which initialization is requested

Returns:

job : Job

an instance of created async job

delete()

Delete this reason codes initialization.

class datarobot.ReasonCodes(id, project_id, model_id, dataset_id, max_codes, num_columns, finish_time, reason_codes_location, threshold_low=None, threshold_high=None)

Represents reason codes metadata and provides access to computation results.

Examples

reason_codes = dr.ReasonCodes.get(project_id, reason_codes_id)
for row in reason_codes.get_rows():
    print(row)  # row is an instance of ReasonCodesRow

Attributes

id (str) id of the record and reason codes computation result
project_id (str) id of the project the model belongs to
model_id (str) id of the model reason codes initialization is for
dataset_id (str) id of the prediction dataset reason codes were computed for
max_codes (int) maximum number of reason codes to supply per row of the dataset
threshold_low (float) the lower threshold, below which a prediction must score in order for reason codes to be computed for a row in the dataset
threshold_high (float) the high threshold, above which a prediction must score in order for reason codes to be computed for a row in the dataset
num_columns (int) the number of columns reason codes were computed for
finish_time (float) timestamp referencing when computation for these reason codes finished
reason_codes_location (str) where to retrieve the reason codes
classmethod get(project_id, reason_codes_id)

Retrieve a specific reason codes.

Parameters:

project_id : str

id of the project the model belongs to

reason_codes_id : str

id of the reason codes

Returns:

reason_codes : ReasonCodes

The queried instance.

classmethod create(project_id, model_id, dataset_id, max_codes=None, threshold_low=None, threshold_high=None)

Create a reason codes for the specified dataset.

In order to create ReasonCodesPage for a particular model and dataset, you must first:

  • Compute feature impact for the model via datarobot.Model.get_feature_impact()
  • Compute a ReasonCodesInitialization for the model via datarobot.ReasonCodesInitialization.create(project_id, model_id)
  • Compute predictions for the model and dataset via datarobot.Model.request_predictions(dataset_id)

threshold_high and threshold_low are optional filters applied to speed up computation. When at least one is specified, only the selected outlier rows will have reason codes computed. Rows are considered to be outliers if their predicted value (in case of regression projects) or probability of being the positive class (in case of classification projects) is less than threshold_low or greater than thresholdHigh. If neither is specified, reason codes will be computed for all rows.

Parameters:

project_id : str

id of the project the model belongs to

model_id : str

id of the model for which reason codes are requested

dataset_id : str

id of the prediction dataset for which reason codes are requested

threshold_low : float, optional

the lower threshold, below which a prediction must score in order for reason codes to be computed for a row in the dataset. If neither threshold_high nor threshold_low is specified, reason codes will be computed for all rows.

threshold_high : float, optional

the high threshold, above which a prediction must score in order for reason codes to be computed. If neither threshold_high nor threshold_low is specified, reason codes will be computed for all rows.

max_codes : int, optional

the maximum number of reason codes to supply per row of the dataset, default: 3.

Returns:

job: Job

an instance of created async job

classmethod list(project_id, model_id=None, limit=None, offset=None)

List of reason codes for a specified project.

Parameters:

project_id : str

id of the project to list reason codes for

model_id : str, optional

if specified, only reason codes computed for this model will be returned

limit : int or None

at most this many results are returned, default: no limit

offset : int or None

this many results will be skipped, default: 0

Returns:

reason_codes : list[ReasonCodes]

get_rows(batch_size=None, exclude_adjusted_predictions=True)

Retrieve reason codes rows.

Parameters:

batch_size : int

maximum number of reason codes rows to retrieve per request

exclude_adjusted_predictions : bool

Optional, defaults to True. Set to False to include adjusted predictions, which will differ from the predictions on some projects, e.g. those with an exposure column specified.

Yields:

reason_codes_row : ReasonCodesRow

Represents reason codes computed for a prediction row.

get_all_as_dataframe(exclude_adjusted_predictions=True)

Retrieve all reason codes rows and return them as a pandas.DataFrame.

Returned dataframe has the following structure:

  • row_id : row id from prediction dataset
  • prediction : the output of the model for this row
  • adjusted_prediction : adjusted prediction values (only appears for projects that utilize prediction adjustments, e.g. projects with an exposure column)
  • class_0_label : a class level from the target (only appears for classification projects)
  • class_0_probability : the probability that the target is this class (only appears for classification projects)
  • class_1_label : a class level from the target (only appears for classification projects)
  • class_1_probability : the probability that the target is this class (only appears for classification projects)
  • reason_0_feature : the name of the feature contributing to the prediction for this reason
  • reason_0_feature_value : the value the feature took on
  • reason_0_label : the output being driven by this reason. For regression projects, this is the name of the target feature. For classification projects, this is the class label whose probability increasing would correspond to a positive strength.
  • reason_0_qualitative_strength : a human-readable description of how strongly the feature affected the prediction (e.g. ‘+++’, ‘–’, ‘+’) for this reason
  • reason_0_strength : the amount this feature’s value affected the prediction
  • ...
  • reason_N_feature : the name of the feature contributing to the prediction for this reason
  • reason_N_feature_value : the value the feature took on
  • reason_N_label : the output being driven by this reason. For regression projects, this is the name of the target feature. For classification projects, this is the class label whose probability increasing would correspond to a positive strength.
  • reason_N_qualitative_strength : a human-readable description of how strongly the feature affected the prediction (e.g. ‘+++’, ‘–’, ‘+’) for this reason
  • reason_N_strength : the amount this feature’s value affected the prediction
Parameters:

exclude_adjusted_predictions : bool

Optional, defaults to True. Set this to False to include adjusted prediction values in the returned dataframe.

Returns:

dataframe: pandas.DataFrame

download_to_csv(filename, encoding='utf-8', exclude_adjusted_predictions=True)

Save reason codes rows into CSV file.

Parameters:

filename : str or file object

path or file object to save reason codes rows

encoding : string, optional

A string representing the encoding to use in the output file, defaults to ‘utf-8’

exclude_adjusted_predictions : bool

Optional, defaults to True. Set to False to include adjusted predictions, which will differ from the predictions on some projects, e.g. those with an exposure column specified.

get_reason_codes_page(limit=None, offset=None, exclude_adjusted_predictions=True)

Get reason codes.

If you don’t want use a generator interface, you can access paginated reason codes directly.

Parameters:

limit : int or None

the number of records to return, the server will use a (possibly finite) default if not specified

offset : int or None

the number of records to skip, default 0

exclude_adjusted_predictions : bool

Optional, defaults to True. Set to False to include adjusted predictions, which will differ from the predictions on some projects, e.g. those with an exposure column specified.

Returns:

reason_codes : ReasonCodesPage

delete()

Delete this reason codes.

class datarobot.models.reason_codes.ReasonCodesRow(row_id, prediction, prediction_values, reason_codes=None, adjusted_prediction=None, adjusted_prediction_values=None)

Represents reason codes computed for a prediction row.

Notes

PredictionValue contains:

  • label : describes what this model output corresponds to. For regression projects, it is the name of the target feature. For classification projects, it is a level from the target feature.
  • value : the output of the prediction. For regression projects, it is the predicted value of the target. For classification projects, it is the predicted probability the row belongs to the class identified by the label.

ReasonCode contains:

  • label : described what output was driven by this reason code. For regression projects, it is the name of the target feature. For classification projects, it is the class whose probability increasing would correspond to a positive strength of this reason code.
  • feature : the name of the feature contributing to the prediction
  • feature_value : the value the feature took on for this row
  • strength : the amount this feature’s value affected the prediction
  • qualitativate_strength : a human-readable description of how strongly the feature affected the prediction (e.g. ‘+++’, ‘–’, ‘+’)

Attributes

row_id (int) which row this ReasonCodeRow describes
prediction (float) the output of the model for this row
adjusted_prediction (float or None) adjusted prediction value for projects that provide this information, None otherwise
prediction_values (list) an array of dictionaries with a schema described as PredictionValue
adjusted_prediction_values (list) same as prediction_values but for adjusted predictions
reason_codes (list) an array of dictionaries with a schema described as ReasonCode
class datarobot.models.reason_codes.ReasonCodesPage(id, count=None, previous=None, next=None, data=None, reason_codes_record_location=None, adjustment_method=None)

Represents batch of reason codes received by one request.

Attributes

id (str) id of the reason codes computation result
data (list[dict]) list of raw reason codes, each row corresponds to a row of the prediction dataset
count (int) total number of rows computed
previous_page (str) where to retrieve previous page of reason codes, None if current page is the first
next_page (str) where to retrieve next page of reason codes, None if current page is the last
reason_codes_record_location (str) where to retrieve the reason codes metadata
adjustment_method (str) Adjustment method that was applied to predictions, or ‘N/A’ if no adjustments were done.
classmethod get(project_id, reason_codes_id, limit=None, offset=0, exclude_adjusted_predictions=True)

Retrieve reason codes.

Parameters:

project_id : str

id of the project the model belongs to

reason_codes_id : str

id of the reason codes

limit : int or None

the number of records to return, the server will use a (possibly finite) default if not specified

offset : int or None

the number of records to skip, default 0

exclude_adjusted_predictions : bool

Optional, defaults to True. Set to False to include adjusted predictions, which will differ from the predictions on some projects, e.g. those with an exposure column specified.

Returns:

reason_codes : ReasonCodesPage

The queried instance.

Lift Chart API

class datarobot.models.lift_chart.LiftChart(source, bins)

Lift chart data for model.

Attributes

source (str) Lift chart data source. Can be ‘validation’, ‘crossValidation’ or ‘holdout’.
bins (list of dict) List of lift chart bins information. Dictionary keys: actual : float Sum of actual target values in bin predicted : float Sum of predicted target values in bin bin_weight : float The weight of the bin. For weighted projects, it is the sum of the weights of the rows in the bin. For unweighted projects, it is the number of rows in the bin.

ROC Curve API

class datarobot.models.roc_curve.RocCurve(source, roc_points, negative_class_predictions, positive_class_predictions)

ROC curve data for model.

Attributes

source (str) ROC curve data source. Can be ‘validation’, ‘crossValidation’ or ‘holdout’.
roc_points (list of dict) List of precalculated metrics associated with thresholds for ROC curve.
negative_class_predictions (list of float) List of predictions from example for negative class
positive_class_predictions (list of float) List of predictions from example for positive class
estimate_threshold(threshold)

Return metrics estimation for given threshold.

Parameters:

threshold : float from [0, 1] interval

Threshold we want estimation for

Returns:

dict

Dictionary of estimated metrics in form of {metric_name: metric_value}. Metrics are ‘accuracy’, ‘f1_score’, ‘false_negative_score’, ‘true_negative_score’, ‘true_negative_rate’, ‘matthews_correlation_coefficient’, ‘true_positive_score’, ‘positive_predictive_value’, ‘false_positive_score’, ‘false_positive_rate’, ‘negative_predictive_value’, ‘true_positive_rate’.

Raises:

ValueError

Given threshold isn’t from [0, 1] interval

get_best_f1_threshold()

Return value of threshold that corresponds to max F1 score. This is threshold that will be preselected in DataRobot when you open “ROC curve” tab.

Returns:

float

Threhold with best F1 score.

Word Cloud API

class datarobot.models.word_cloud.WordCloud(ngrams)

Word cloud data for the model.

Attributes

ngrams: list of dict List of the wordcloud ngrams and corresponding data. Dictionary has following keys: ngram: str Word or ngram value. coefficient: float Value from [-1.0, 1.0] range, describes effect of this ngram on the target. Large negative value means strong effect toward negative class in classification and smaller target value in regression models. Large positive - toward positive class and bigger value respectively. count: int Number of rows in the training sample where this ngram appears. frequency: float Value from (0.0, 1.0] range, relative frequency of given ngram to most frequent ngram. is_stopword: bool True for ngrams that DataRobot evaluates as stopwords.
most_frequent(top_n=5)

Return most frequent ngrams in the word cloud.

Parameters:

top_n : int

Number of ngrams to return

Returns:

list of dict

Up to top_n top most frequent ngrams in the word cloud. If top_n bigger then total number of ngrams in word cloud - return all sorted by frequency in descending order.

most_important(top_n=5)

Return most important ngrams in the word cloud.

Parameters:

top_n : int

Number of ngrams to return

Returns:

list of dict

Up to top_n top most important ngrams in the word cloud. If top_n bigger then total number of ngrams in word cloud - return all sorted by absolute coefficient value in descending order.

Rating Table API

class datarobot.models.RatingTable(id, rating_table_name, original_filename, project_id, parent_model_id, model_id=None, model_job_id=None, validation_job_id=None, validation_error=None)

Interface to modify and download rating tables.

Attributes

id (str) The id of the rating table.
project_id (str) The id of the project this rating table belongs to.
rating_table_name (str) The name of the rating table.
original_filename (str) The name of the file used to create the rating table.
parent_model_id (str) The model id of the model the rating table was validated against.
model_id (str) The model id of the model that was created from the rating table. Can be None if a model has not been created from the rating table.
model_job_id (str) The id of the job to create a model from this rating table. Can be None if a model has not been created from the rating table.
validation_job_id (str) The id of the created job to validate the rating table. Can be None if the rating table has not been validated.
validation_error (str) Contains a description of any errors caused during validation.
classmethod get(project_id, rating_table_id)

Retrieve a single rating table

Parameters:

project_id : str

The ID of the project the rating table is associated with.

rating_table_id : str

The ID of the rating table

Returns:

rating_table : RatingTable

The queried instance

classmethod create(project_id, parent_model_id, filename, rating_table_name='Uploaded Rating Table')

Uploads and validates a new rating table CSV

Parameters:

project_id : str

id of the project the rating table belongs to

parent_model_id : str

id of the model for which this rating table should be validated against

filename : str

The path of the CSV file containing the modified rating table.

rating_table_name : str, optional

A human friendly name for the new rating table. The string may be truncated and a suffix may be added to maintain unique names of all rating tables.

Returns:

job: Job

an instance of created async job

Raises:

InputNotUnderstoodError

Raised if filename isn’t one of supported types.

ClientError (400)

Raised if parent_model_id is invalid.

download(filepath)

Download a csv file containing the contents of this rating table

Parameters:

filepath : str

The path at which to save the rating table file.

rename(rating_table_name)

Renames a rating table to a different name.

Parameters:

rating_table_name : str

The new name to rename the rating table to.

create_model()

Creates a new model from this rating table record. This rating table must not already be associated with a model and must be valid.

Returns:

job: Job

an instance of created async job

Raises:

ClientError (422)

Raised if creating model from a RatingTable that failed validation

JobAlreadyRequested

Raised if creating model from a RatingTable that is already associated with a RatingTableModel

Confusion Chart API

class datarobot.models.confusion_chart.ConfusionChart(source, data)

Confusion Chart data for model.

Attributes

source (str) Confusion Chart data source. Can be ‘validation’, ‘crossValidation’ or ‘holdout’.
raw_data (dict) All of the raw data for the Confusion Chart
confusion_matrix (list of list) The NxN confusion matrix
classes (list) The names of each of the classes
class_metrics (list of dict) Containing all of the metrics for each of the classes. Dictionary keys: className : string name of the class actualCount : int number of times this class is seen in the validation data predictedCount : int number of times this class has been predicted for the validation data f1 : float F1 score recall : float recall score precision : float precision score wasActualPercentages : list of dict one vs all actual percentages in a format specified below, Dictionary keys: otherClassName : string the name of the other class percentage : float the percentage of the times this class was predicted when is was actually class (from 0 to 1) wasPredictedPercentages : list of dict one vs all predicted percentages in a format specified below, Dictionary keys: otherClassName : string the name of the other class percentage : float the percentage of the times this class was actual predicted (from 0 to 1) confusionMatrixOneVsAll : list of list 2d list representing 2x2 one vs all matrix. This represents the True/False Negative/Positive rates as integer for each class. The data structure looks like: [ [ True Negative, False Positive ], [ False Negative, True Positive ] ]

Training Predictions API

class datarobot.models.training_predictions.TrainingPredictionsIterator(client, path, limit=None)

Lazily fetches training predictions from DataRobot API in chunks of specified size and then iterates rows from responses as named tuples. Each row represents a training prediction computed for a dataset’s row. Each named tuple has the following structure:

Notes

Each PredictionValue dict contains these keys:

label
describes what this model output corresponds to. For regression projects, it is the name of the target feature. For classification and multiclass projects, it is a label from the target feature.
value
the output of the prediction. For regression projects, it is the predicted value of the target. For classification and multiclass projects, it is the predicted probability that the row belongs to the class identified by the label.

Examples

import datarobot as dr

# Fetch existing training predictions by their id
training_predictions = dr.TrainingPredictions.get(project_id, prediction_id)

# Iterate over predictions
for row in training_predictions.iterate_rows()
    print(row.row_id, row.prediction)

Attributes

row_id (int) id of the record in original dataset for which training prediction is calculated
partition_id (str or float) id of the data partition that the row belongs to
prediction (float) the model’s prediction for this data row
prediction_values (list of dictionaries) an array of dictionaries with a schema described as PredictionValue
timestamp (str or None) (New in version v2.11) an ISO string representing the time of the prediction in time series project; may be None for non-time series projects
forecast_point (str or None) (New in version v2.11) an ISO string representing the point in time used as a basis to generate the predictions in time series project; may be None for non-time series projects
forecast_distance (str or None) (New in version v2.11) how many time steps are between the forecast point and the timestamp in time series project; None for non-time series projects
series_id (str or None) (New in version v2.11) the id of the series in a multiseries project; may be NaN for single series projects; None for non-time series projects
class datarobot.models.training_predictions.TrainingPredictions(project_id, prediction_id, model_id=None, data_subset=None)

Represents training predictions metadata and provides access to prediction results.

Examples

Compute training predictions for a model on the whole dataset

import datarobot as dr

# Request calculation of training predictions
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.ALL)
training_predictions = training_predictions_job.get_result_when_complete()
print('Training predictions {} are ready'.format(training_predictions.prediction_id))

# Iterate over actual predictions
for row in training_predictions.iterate_rows():
    print(row.row_id, row.partition_id, row.prediction)

List all training predictions for a project

import datarobot as dr

# Fetch all training predictions for a project
all_training_predictions = dr.TrainingPredictions.list(project_id)

# Inspect all calculated training predictions
for training_predictions in all_training_predictions:
    print(
        'Prediction {} is made for data subset "{}"'.format(
            training_predictions.prediction_id,
            training_predictions.data_subset,
        )
    )

Retrieve training predictions by id

import datarobot as dr

# Getting training predictions by id
training_predictions = dr.TrainingPredictions.get(project_id, prediction_id)

# Iterate over actual predictions
for row in training_predictions.iterate_rows():
    print(row.row_id, row.partition_id, row.prediction)

Attributes

project_id (str) id of the project the model belongs to
model_id (str) id of the model
prediction_id (str) id of generated predictions
classmethod list(project_id)

Fetch all the computed training predictions for a project.

Parameters:

project_id : str

id of the project

Returns:

A list of TrainingPredictions objects

classmethod get(project_id, prediction_id)

Retrieve training predictions on a specified data set.

Parameters:

project_id : str

id of the project the model belongs to

prediction_id : str

id of the prediction set

Returns:

TrainingPredictions object which is ready to operate with specified predictions

iterate_rows(batch_size=None)

Retrieve training prediction rows as an iterator.

Parameters:

batch_size : int, optional

maximum number of training prediction rows to fetch per request

Returns:

iterator : TrainingPredictionsIterator

an iterator which yields named tuples representing training prediction rows

get_all_as_dataframe(class_prefix='class_')

Retrieve all training prediction rows and return them as a pandas.DataFrame.

Returned dataframe has the following structure:
  • row_id : row id from the original dataset
  • prediction : the model’s prediction for this row
  • class_<label> : the probability that the target is this class (only appears for classification and multiclass projects)
  • timestamp : the time of the prediction (only appears for time series projects)
  • forecast_point : the point in time used as a basis to generate the predictions (only appears for time series projects)
  • forecast_distance : how many time steps are between timestamp and forecast_point (only appears for time series projects)
  • series_id : he id of the series in a multiseries project or None for a single series project (only appears for time series projects)
Parameters:

class_prefix : str, optional

The prefix to append to labels in the final dataframe. Default is class_ (e.g., apple -> class_apple)

Returns:

dataframe: pandas.DataFrame

download_to_csv(filename, encoding='utf-8')

Save training prediction rows into CSV file.

Parameters:

filename : str or file object

path or file object to save training prediction rows

encoding : string, optional

A string representing the encoding to use in the output file, defaults to ‘utf-8’

ModelDeployment API

Warning

This interface is now deprecated and will be removed in the v2.13 release of the DataRobot client.

class datarobot.models.ModelDeployment(id, model=None, project=None, type=None, status=None, user=None, organization_id=None, instance=None, label=None, description=None, prediction_endpoint=None, deployed=None, created_at=None, updated_at=None, service_health=None, service_health_messages=None, recent_request_count=None, prev_request_count=None, relative_requests_trend=None, trend_time_window=None, request_rates=None)

ModelDeployments provide an interface for tracking the health and activity of predictions made against a deployment model. The get_service_statistics method can be used to see current and historical trends in requests made and in user and server error rates

Notes

HealthMessage dict contains:

  • level : error level, one of [passing, warning, failing]
  • msg_id : identifier for message, like USER_ERRORS, SERVER_ERRORS, NO_GOOD_REQUESTS
  • message : human-readable message

Instance dict contains:

  • id : id of the dedicated prediction instance the model is deployed to
  • host_name : host name of the dedicated prediction instance
  • private_ip : IP address of the dedicated prediction instance
  • orm_version : On-demand Resource Manager version of the dedicated prediction instance

Model dict contains:

  • id : id of the deployed model
  • model_type : identifies the model, e.g. Nystroem Kernel SVM Regressor
  • uid : id of the user who created this model

User dict contains:

  • username : the user’s username
  • first_name : the user’s first name
  • last_name : the user’s last name

Attributes

id (str) id of the model deployment
model (dict) model associated with the model deployment
project (dict) project associated with the model deployment
type (str) type of the model deployment. Can be one of [sse, dedicated, legacy_dedicated]
status (str) status of the model deployment. Can be one of [active, inactive, archived]
user (dict) user who created the model deployment
organization_id (str) id of the organization associated with the model deployment
instance (dict) instance associated with the model deployment
label (str) label of the model deployment
description (str) description of the model deployment
prediction_endpoint (str) URL where the model is deployed and available for serving predictions
deployed (bool) has the model deployment deployed process finished or not.
created_at (datetime) timestamp when the model deployment was created
updated_at (datetime) timestamp when the model deployment was updated
service_health (str) display model health status. Can be one of [passing, warning or failing]
service_health_messages (list) list of HealthMessage objects for service health state
recent_request_count (int) the number of recent requests, within the recent time window specified in trend_time_window
prev_request_count (int) the number of requests, within the previous time window specified in trend_time_window
relative_requests_trend (float) relative difference (as a percentage) between the number of prediction requests performed within the current time window and one time window before that. The size of the time window is specified by trend_time_window
trend_time_window (str) time window (in full days from “now”) trend is calculated for
request_rates (list) history of request rates per day sorted in chronological order (last entry being the most recent, i.e. today).
classmethod create(project_id, model_id, label, instance_id=None, description=None, status=None)

Create model deployment.

Parameters:

project_id : str

id of the project the model belongs to

model_id : str

id of the model for deployment

label : str

human-readable name for the model deployment

instance_id : str, optional

id of the instance in DataRobot cloud being deployed to

description : str, optional

description for the model deployment

status : str, optional

status for the model deployment. Can be [active, inactive, archived].

Returns:

job : Job

an instance of created async job

classmethod list(limit=None, offset=None, query=None, order_by=None, status=None)

List of model_deployments

Parameters:

limit : int or None

at most this many results are returned, default: no limit

offset : int or None

this many results will be skipped, default: 0

query : str, optional

Filter the model deployments by matching labels and descriptions with the specified string. Partial matches are included, too. Matches are case insensitive

order_by : str, optional

the model deployments. Supported attributes for ordering: label, exportTarget, status, type. Prefix attribute name with dash to sort in descending order, e.g. orderBy=-label. Only one field can be selected

status : str, optional

Filter the list of deployments by status. Must be one of: [active, inactive, archived]

Returns:

model_deployments : list[ModelDeployment]

classmethod get(model_deployment_id)

Retrieve sa single model_deployment

Parameters:

model_deployment_id:

the id of the model_deployment to query

Returns:

model_deployment : ModelDeployment

The queried instance

update(label=None, description=None, status=None)

Update model_deployment object

Parameters:

label : str, optional

The new value for label to be set

description : str, optional

The new value for description to be set

status : str, optional

The new value for status to be set, Can be one of [active, inactive, archived]

get_service_statistics(start_date=None, end_date=None)

Retrieve health overview of current model_deployment

Parameters:

start_date : str, optional

datetime string that filter statistic from this timestamp

end_date: str, optional

datetime string that filter statistic till this timestamp

Returns:

service_health : dict

dict that represent ServiceHealth object

Notes

ServiceHealth dict contains:

  • total_requests: total number of requests performed. 0, if there were no requests
  • consumers : total number of unique users performing requests. 0, if there were no requests
  • period : dict with two fields - start and end, that denote the boundaries of the time period the stats are reported for. Note, that a half-open time interval is used: [start: end)
  • user_error_rate : dict with two fields - current and previous, that denote the ratio of user errors to the total number of requests performed for the given period and one time period before that. 0.0, if there were no errors (or requests)
  • server_error_rate : dict with two fields - current and previous, that denote the ratio of server errors to the total number of requests performed for the given period and one time period before that. 0.0, if there were no errors (or requests)
  • load : dict with two fields - peak and median, that denote the max and the median of the request rate (in requests per minute) across all requests for the duration of the given time period. Both will be equal to 0.0, if there were no requests.
  • median_execution_time : the median of the execution time across all performed requests (in seconds). null, if there were no requests
action_log(limit=None, offset=None)

List of actions taken affecting this deployment

Allows insight into when the ModelDeployment was created or deployed.

Parameters:

limit : int or None

at most this many results are returned, default: no limit

offset : int or None

this many results will be skipped, default: 0

Returns:

action_log : list of dict [ActionLog]

Notes

ActionLog dict contains:

  • action : identifies the action. Can be one of [deployed, created]
  • performed_by : dict with id, username, first_name and last_name of the user who performed the action.
  • performed_at : date/time when the action was performed in ISO-8601 format.

Database Connectivity API

class datarobot.DataDriver(id=None, creator=None, base_names=None, class_name=None, canonical_name=None)

A data driver

Attributes

id (str) the id of the driver.
class_name (str) the Java class name for the driver.
canonical_name (str) the user-friendly name of the driver.
creator (str) the id of the user who created the driver.
base_names (list of str) a list of the file name(s) of the jar files.
classmethod list()

Returns list of available drivers.

Returns:

drivers : list of DataDriver instances

contains a list of available drivers.

Examples

>>> import datarobot as dr
>>> drivers = dr.DataDriver.list()
>>> drivers
[DataDriver('mysql'), DataDriver('RedShift'), DataDriver('PostgreSQL')]
classmethod get(driver_id)

Gets the driver.

Parameters:

driver_id : str

the identifier of the driver.

Returns:

driver : DataDriver

the required driver.

Examples

>>> import datarobot as dr
>>> driver = dr.DataDriver.get('5ad08a1889453d0001ea7c5c')
>>> driver
DataDriver('PostgreSQL')
classmethod create(class_name, canonical_name, files)

Creates the driver. Only available to admin users.

Parameters:

class_name : str

the Java class name for the driver.

canonical_name : str

the user-friendly name of the driver.

files : list of str

a list of the file paths on file system file_path(s) for the driver.

Returns:

driver : DataDriver

the created driver.

Raises:

ClientError

raised if user is not granted for Can manage JDBC database drivers feature

Examples

>>> import datarobot as dr
>>> driver = dr.DataDriver.create(
...     class_name='org.postgresql.Driver',
...     canonical_name='PostgreSQL',
...     files=['/tmp/postgresql-42.2.2.jar']
... )
>>> driver
DataDriver('PostgreSQL')
update(class_name=None, canonical_name=None)

Updates the driver. Only available to admin users.

Parameters:

class_name : str

the Java class name for the driver.

canonical_name : str

the user-friendly name of the driver.

Raises:

ClientError

raised if user is not granted for Can manage JDBC database drivers feature

Examples

>>> import datarobot as dr
>>> driver = dr.DataDriver.get('5ad08a1889453d0001ea7c5c')
>>> driver.canonical_name
'PostgreSQL'
>>> driver.update(canonical_name='postgres')
>>> driver.canonical_name
'postgres'
delete()

Removes the driver. Only available to admin users.

Raises:

ClientError

raised if user is not granted for Can manage JDBC database drivers feature

class datarobot.DataStore(data_store_id=None, data_store_type=None, canonical_name=None, creator=None, updated=None, params=None)

A data store. Represents database

Attributes

id (str) the id of the data store.
data_store_type (str) the type of data store.
canonical_name (str) the user-friendly name of the data store.
creator (str) the id of the user who created the data store.
updated (datetime.datetime) the time of the last update
params (DataStoreParameters) a list specifying data store parameters.
classmethod list()

Returns list of available data stores.

Returns:

data_stores : list of DataStore instances

contains a list of available data stores.

Examples

>>> import datarobot as dr
>>> data_stores = dr.DataStore.list()
>>> data_stores
[DataStore('Demo'), DataStore('Airlines')]
classmethod get(data_store_id)

Gets the data store.

Parameters:

data_store_id : str

the identifier of the data store.

Returns:

data_store : DataStore

the required data store.

Examples

>>> import datarobot as dr
>>> data_store = dr.DataStore.get('5a8ac90b07a57a0001be501e')
>>> data_store
DataStore('Demo')
classmethod create(data_store_type, canonical_name, driver_id, jdbc_url)

Creates the data store.

Parameters:

data_store_type : str

the type of data store.

canonical_name : str

the user-friendly name of the data store.

driver_id : str

the identifier of the DataDriver.

jdbc_url : str

the full JDBC url, for example jdbc:postgresql://my.dbaddress.org:5432/my_db.

Returns:

data_store : DataStore

the created data store.

Examples

>>> import datarobot as dr
>>> data_store = dr.DataStore.create(
...     data_store_type='jdbc',
...     canonical_name='Demo DB',
...     driver_id='5a6af02eb15372000117c040',
...     jdbc_url='jdbc:postgresql://my.db.address.org:5432/perftest'
... )
>>> data_store
DataStore('Demo DB')
update(canonical_name=None, driver_id=None, jdbc_url=None)

Updates the data store.

Parameters:

canonical_name : str

optional, the user-friendly name of the data store.

driver_id : str

optional, the identifier of the DataDriver.

jdbc_url : str

optional, the full JDBC url, for example jdbc:postgresql://my.dbaddress.org:5432/my_db.

Examples

>>> import datarobot as dr
>>> data_store = dr.DataStore.get('5ad5d2afef5cd700014d3cae')
>>> data_store
DataStore('Demo DB')
>>> data_store.update(canonical_name='Demo DB updated')
>>> data_store
DataStore('Demo DB updated')
delete()

Removes the DataStore

test(username, password)

Tests database connection.

Parameters:

username : str

the username for database authentication.

password : str

the password for database authentication. The password is encrypted at server side and never saved / stored

Returns:

message : dict

message with status.

Examples

>>> import datarobot as dr
>>> data_store = dr.DataStore.get('5ad5d2afef5cd700014d3cae')
>>> data_store.test(username='db_username', password='db_password')
{'message': 'Connection successful'}
schemas(username, password)

Returns list of available schemas.

Parameters:

username : str

the username for database authentication.

password : str

the password for database authentication. The password is encrypted at server side and never saved / stored

Returns:

response : dict

dict with database name and list of str - available schemas

Examples

>>> import datarobot as dr
>>> data_store = dr.DataStore.get('5ad5d2afef5cd700014d3cae')
>>> data_store.schemas(username='db_username', password='db_password')
{'catalog': 'perftest', 'schemas': ['demo', 'information_schema', 'public']}
tables(username, password, schema=None)

Returns list of available tables in schema.

Parameters:

username : str

optional, the username for database authentication.

password : str

optional, the password for database authentication. The password is encrypted at server side and never saved / stored

schema : str

optional, the schema name.

Returns:

response : dict

dict with catalog name and tables info

Examples

>>> import datarobot as dr
>>> data_store = dr.DataStore.get('5ad5d2afef5cd700014d3cae')
>>> data_store.tables(username='db_username', password='db_password', schema='demo')
{'tables': [{'type': 'TABLE', 'name': 'diagnosis', 'schema': 'demo'}, {'type': 'TABLE',
'name': 'kickcars', 'schema': 'demo'}, {'type': 'TABLE', 'name': 'patient',
'schema': 'demo'}, {'type': 'TABLE', 'name': 'transcript', 'schema': 'demo'}],
'catalog': 'perftest'}
class datarobot.DataSource(data_source_id=None, data_source_type=None, canonical_name=None, creator=None, updated=None, params=None)

A data source. Represents data request

Attributes

data_source_id (str) the id of the data source.
data_source_type (str) the type of data source.
canonical_name (str) the user-friendly name of the data source.
creator (str) the id of the user who created the data source.
updated (datetime.datetime) the time of the last update.
params (DataSourceParameters) a list specifying data source parameters.
classmethod list()

Returns list of available data sources.

Returns:

data_sources : list of DataSource instances

contains a list of available data sources.

Examples

>>> import datarobot as dr
>>> data_sources = dr.DataSource.list()
>>> data_sources
[DataSource('Diagnostics'), DataSource('Airlines 100mb'), DataSource('Airlines 10mb')]
classmethod get(data_source_id)

Gets the data source.

Parameters:

data_source_id : str

the identifier of the data source.

Returns:

data_source : DataSource

the requested data source.

Examples

>>> import datarobot as dr
>>> data_source = dr.DataSource.get('5a8ac9ab07a57a0001be501f')
>>> data_source
DataSource('Diagnostics')
classmethod create(data_source_type, canonical_name, params)

Creates the data source.

Parameters:

data_source_type : str

the type of data source.

canonical_name : str

the user-friendly name of the data source.

params : DataSourceParameters

a list specifying data source parameters.

Returns:

data_source : DataSource

the created data source.

Examples

>>> import datarobot as dr
>>> params = dr.DataSourceParameters(
...     data_store_id='5a8ac90b07a57a0001be501e',
...     query='SELECT * FROM airlines10mb WHERE "Year" >= 1995;'
... )
>>> data_source = dr.DataSource.create(
...     data_source_type='jdbc',
...     canonical_name='airlines stats after 1995',
...     params=params
... )
>>> data_source
DataSource('airlines stats after 1995')
update(canonical_name=None, params=None)

Creates the data source.

Parameters:

canonical_name : str

optional, the user-friendly name of the data source.

params : DataSourceParameters

optional, the identifier of the DataDriver.

Examples

>>> import datarobot as dr
>>> data_source = dr.DataSource.get('5ad840cc613b480001570953')
>>> data_source
DataSource('airlines stats after 1995')
>>> params = dr.DataSourceParameters(
...     query='SELECT * FROM airlines10mb WHERE "Year" >= 1990;'
... )
>>> data_source.update(
...     canonical_name='airlines stats after 1990',
...     params=params
... )
>>> data_source
DataSource('airlines stats after 1990')
delete()

Removes the DataSource

Examples

Note

You are able to install all of the requirements needed to run the example notebooks with: pip install datarobot[examples].

Modeling Airline Delay

Overview

Statistics on whether a flight was delayed and for how long are available from government databases for all the major carriers. It would be useful to be able to predict before scheduling a flight whether or not it was likely to be delayed. In this example, DataRobot will try to model whether a flight will be delayed, based on information such as the scheduled departure time and whether rained the day of the flight.

Set Up

This example assumes that the DataRobot Python client package has been installed and configured with the credentials of a DataRobot user with API access permissions.

Data Sources

Information on flights and flight delays is made available by the Bureau of Transportation Statistics at https://www.transtats.bts.gov/ONTIME/Departures.aspx. To narrow down the amount of data involved, the datasets assembled for this use case are limited to US Airways flights out of Boston Logan in 2013 and 2014, although the script for interacting with DataRobot is sufficiently general that any dataset with the correct format could be used. A flight was declared to be delayed if it ultimately took off at least fifteen minutes after its scheduled departure time.

In additional to flight information, each record in the prepared dataset notes the amount of rain and whether it rained on the day of the flight. This information came from the National Oceanic and Atmospheric Administration’s Quality Controlled Local Climatological Data, available at http://www.ncdc.noaa.gov/qclcd/QCLCD. By looking at the recorded daily summaries of the water equivalent precipitation at the Boston Logan station, the daily rainfall for each day in 2013 and 2014 was measured. For some days, the QCLCD reports trace amounts of rainfall, which was recorded as 0 inches of rain.

Dataset Structure

Each row in the assembled dataset contains the following columns

  • was_delayed
    • boolean
    • whether the flight was delayed
  • daily_rainfall
    • float
    • the amount of rain, in inches, on the day of the flight
  • did_rain
    • bool
    • whether it rained on the day of the flight
  • Carrier Code
    • str
    • the carrier code of the airline - US for all entries in assembled dataset
  • Date
    • str (MM/DD/YYYY format)
    • the date of the flight
  • Flight Number
    • str
    • the flight number for the flight
  • Tail Number
    • str
    • the tail number of the aircraft
  • Destination Airport
    • str
    • the three-letter airport code of the destination airport
  • Scheduled Departure Time
    • str
    • the 24-hour scheduled departure time of the flight, in the origin airport’s timezone
In [1]:
import pandas as pd
import datarobot as dr
In [2]:
data_path = "logan-US-2013.csv"
logan_2013 = pd.read_csv(data_path)
logan_2013.head()
Out[2]:
was_delayed daily_rainfall did_rain Carrier Code Date (MM/DD/YYYY) Flight Number Tail Number Destination Airport Scheduled Departure Time
0 False 0.0 False US 02/01/2013 225 N662AW PHX 16:20
1 False 0.0 False US 02/01/2013 280 N822AW PHX 06:00
2 False 0.0 False US 02/01/2013 303 N653AW CLT 09:35
3 True 0.0 False US 02/01/2013 604 N640AW PHX 09:55
4 False 0.0 False US 02/01/2013 722 N715UW PHL 18:30

We want to be able to make predictions for future data, so the “date” column should be transformed in a way that avoids values that won’t be populated for future data:

In [3]:
def prepare_modeling_dataset(df):
    date_column_name = 'Date (MM/DD/YYYY)'
    date = pd.to_datetime(df[date_column_name])
    modeling_df = df.drop(date_column_name, axis=1)
    days = {0: 'Mon', 1: 'Tues', 2: 'Weds', 3: 'Thurs', 4: 'Fri', 5: 'Sat',
            6: 'Sun'}
    modeling_df['day_of_week'] = date.apply(lambda x: days[x.dayofweek])
    modeling_df['month'] = date.dt.month
    return modeling_df
In [4]:
logan_2013_modeling = prepare_modeling_dataset(logan_2013)
logan_2013_modeling.head()
Out[4]:
was_delayed daily_rainfall did_rain Carrier Code Flight Number Tail Number Destination Airport Scheduled Departure Time day_of_week month
0 False 0.0 False US 225 N662AW PHX 16:20 Fri 2
1 False 0.0 False US 280 N822AW PHX 06:00 Fri 2
2 False 0.0 False US 303 N653AW CLT 09:35 Fri 2
3 True 0.0 False US 604 N640AW PHX 09:55 Fri 2
4 False 0.0 False US 722 N715UW PHL 18:30 Fri 2

DataRobot Modeling

As part of this use case, in model_flight_ontime.py, a DataRobot project will be created and used to run a variety of models against the assembled datasets. By default, DataRobot will run autopilot on the automatically generated Informative Features list, which excludes certain pathological features (like Carrier Code in this example, which is always the same value), and we will also create a custom feature list excluding the amount of rainfall on the day of the flight.

This notebook shows how to use the Python API client to create a project, create feature lists, train models with different sample percents and feature lists, and view the models that have been run. It will:

  • create a project
  • create a new feature list (no foreknowledge) excluding the rainfall features
  • set the target to was_delayed, and run DataRobot autopilot on the Informative Features list
  • rerun autopilot on a new feature list
  • make predictions on a new data set

Starting a Project

In [5]:
project = dr.Project.start(logan_2013_modeling,
                           project_name='Airline Delays - was_delayed',
                           target="was_delayed")
project.id
Out[5]:
u'5963ddefc8089169ef1637c2'

Jobs and the Project Queue

You can view the project in your browser:

In [ ]:
#  If running notebook remotely
project.open_leaderboard_browser()
In [ ]:
#  Set worker count higher. This will fail if you don't have 10 workers.
project.set_worker_count(10)
In [6]:
project.pause_autopilot()
Out[6]:
True
In [7]:
#  More jobs will go in the queue in each stage of autopilot.
#  This gets the currently inprogress and queued jobs
project.get_model_jobs()
Out[7]:
[ModelJob(Gradient Boosted Trees Classifier, status=inprogress),
 ModelJob(Breiman and Cutler Random Forest Classifier, status=inprogress),
 ModelJob(RuleFit Classifier, status=queue),
 ModelJob(Regularized Logistic Regression (L2), status=queue),
 ModelJob(Elastic-Net Classifier (L2 / Binomial Deviance), status=queue),
 ModelJob(RandomForest Classifier (Gini), status=queue),
 ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping, status=queue),
 ModelJob(Nystroem Kernel SVM Classifier, status=queue),
 ModelJob(Regularized Logistic Regression (L2), status=queue),
 ModelJob(Elastic-Net Classifier (L2 / Binomial Deviance) with Binned numeric features, status=queue),
 ModelJob(Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance), status=queue),
 ModelJob(RandomForest Classifier (Entropy), status=queue),
 ModelJob(ExtraTrees Classifier (Gini), status=queue),
 ModelJob(Gradient Boosted Trees Classifier with Early Stopping, status=queue),
 ModelJob(Gradient Boosted Greedy Trees Classifier with Early Stopping, status=queue),
 ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping, status=queue),
 ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features, status=queue),
 ModelJob(Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) with Unsupervised Learning Features, status=queue),
 ModelJob(Auto-tuned K-Nearest Neighbors Classifier (Minkowski Distance), status=queue),
 ModelJob(Vowpal Wabbit Classifier, status=queue)]
In [8]:
project.unpause_autopilot()
Out[8]:
True

Features

In [9]:
features = project.get_features()
features
Out[9]:
[Feature(did_rain),
 Feature(Destination Airport),
 Feature(Carrier Code),
 Feature(Flight Number),
 Feature(Tail Number),
 Feature(day_of_week),
 Feature(month),
 Feature(Scheduled Departure Time),
 Feature(daily_rainfall),
 Feature(was_delayed)]
In [10]:
pd.DataFrame([f.__dict__ for f in features])
Out[10]:
date_format feature_type id importance low_information na_count name project_id unique_count
0 None Boolean 2 0.029045 False 0 did_rain 5963ddefc8089169ef1637c2 2
1 None Categorical 6 0.003714 True 0 Destination Airport 5963ddefc8089169ef1637c2 5
2 None Categorical 3 NaN True 0 Carrier Code 5963ddefc8089169ef1637c2 1
3 None Numeric 4 0.005900 False 0 Flight Number 5963ddefc8089169ef1637c2 329
4 None Categorical 5 -0.004512 True 0 Tail Number 5963ddefc8089169ef1637c2 296
5 None Categorical 8 0.003452 True 0 day_of_week 5963ddefc8089169ef1637c2 7
6 None Numeric 9 0.003043 True 0 month 5963ddefc8089169ef1637c2 12
7 %H:%M Time 7 0.058245 False 0 Scheduled Departure Time 5963ddefc8089169ef1637c2 77
8 None Numeric 1 0.034295 False 0 daily_rainfall 5963ddefc8089169ef1637c2 58
9 None Boolean 0 1.000000 False 0 was_delayed 5963ddefc8089169ef1637c2 2

Three feature lists are automatically created:

  • Raw Features: one for all features
  • Informative Features: one based on features with any information (columns with no variation are excluded)
  • Univariate Importance: one based on univariate importance (this is only created after the target is set)

Informative Features is the one used by default in autopilot.

In [11]:
feature_lists = project.get_featurelists()
feature_lists
Out[11]:
[Featurelist(Informative Features),
 Featurelist(Raw Features),
 Featurelist(Univariate Selections)]
In [12]:
# create a featurelist without the rain features
# (since they leak future information)
informative_feats = [lst for lst in feature_lists if
                     lst.name == 'Informative Features'][0]
no_foreknowledge_features = list(
    set(informative_feats.features) - {'daily_rainfall', 'did_rain'})
In [13]:
no_foreknowledge = project.create_featurelist('no foreknowledge',
                                              no_foreknowledge_features)
no_foreknowledge
Out[13]:
Featurelist(no foreknowledge)
In [14]:
project.get_status()
Out[14]:
{u'autopilot_done': False,
 u'stage': u'modeling',
 u'stage_description': u'Ready for modeling'}
In [15]:
# This waits until autopilot is complete:
project.wait_for_autopilot(check_interval=90)
In progress: 2, queued: 2 (waited: 0s)
In progress: 2, queued: 2 (waited: 0s)
In progress: 2, queued: 2 (waited: 1s)
In progress: 2, queued: 2 (waited: 2s)
In progress: 2, queued: 2 (waited: 3s)
In progress: 2, queued: 2 (waited: 4s)
In progress: 2, queued: 2 (waited: 8s)
In progress: 2, queued: 2 (waited: 14s)
In progress: 2, queued: 2 (waited: 27s)
In progress: 2, queued: 0 (waited: 53s)
In progress: 2, queued: 0 (waited: 105s)
In progress: 0, queued: 0 (waited: 195s)
In progress: 0, queued: 0 (waited: 286s)
In [16]:
project.start_autopilot(no_foreknowledge.id)
In [17]:
project.wait_for_autopilot(check_interval=90)
In progress: 2, queued: 26 (waited: 0s)
In progress: 2, queued: 26 (waited: 0s)
In progress: 2, queued: 26 (waited: 1s)
In progress: 2, queued: 26 (waited: 2s)
In progress: 2, queued: 26 (waited: 3s)
In progress: 2, queued: 26 (waited: 5s)
In progress: 1, queued: 26 (waited: 8s)
In progress: 4, queued: 23 (waited: 15s)
In progress: 6, queued: 17 (waited: 28s)
In progress: 7, queued: 6 (waited: 54s)
In progress: 5, queued: 9 (waited: 105s)
In progress: 7, queued: 1 (waited: 196s)
In progress: 7, queued: 20 (waited: 287s)
In progress: 7, queued: 3 (waited: 378s)
In progress: 4, queued: 0 (waited: 469s)
In progress: 3, queued: 0 (waited: 559s)
In progress: 0, queued: 0 (waited: 650s)

Models

In [18]:
models = project.get_models()
example_model = models[0]
example_model
Out[18]:
Model(u'Gradient Boosted Trees Classifier with Early Stopping')

Models represent fitted models and have various data about the model, including metrics:

In [19]:
example_model.metrics
Out[19]:
{u'AUC': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.751662,
  u'holdout': None,
  u'validation': 0.74957},
 u'FVE Binomial': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.139262,
  u'holdout': None,
  u'validation': 0.14529},
 u'Gini Norm': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.503324,
  u'holdout': None,
  u'validation': 0.49914},
 u'LogLoss': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.275264,
  u'holdout': None,
  u'validation': 0.27347},
 u'RMSE': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.27734,
  u'holdout': None,
  u'validation': 0.27582},
 u'Rate@Top10%': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.362458,
  u'holdout': None,
  u'validation': 0.37884},
 u'Rate@Top5%': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.47347,
  u'holdout': None,
  u'validation': 0.4898},
 u'Rate@TopTenth%': {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.866668,
  u'holdout': None,
  u'validation': 1.0}}
In [20]:
def sorted_by_log_loss(models, test_set):
    models_with_score = [model for model in models if
                         model.metrics['LogLoss'][test_set] is not None]
    return sorted(models_with_score,
                  key=lambda model: model.metrics['LogLoss'][test_set])

Let’s choose the models (from each feature set, to compare the scores) with the best LogLoss score from those with the rain and those without:

In [21]:
models = project.get_models()
fair_models = [mod for mod in models if
               mod.featurelist_id == no_foreknowledge.id]
rain_cheat_models = [mod for mod in models if
                     mod.featurelist_id == informative_feats.id]
In [22]:
models[0].metrics['LogLoss']

Out[22]:
{u'backtesting': None,
 u'backtestingScores': None,
 u'crossValidation': 0.275264,
 u'holdout': None,
 u'validation': 0.27347}
In [23]:
best_fair_model = sorted_by_log_loss(fair_models, 'crossValidation')[0]
best_cheat_model = sorted_by_log_loss(rain_cheat_models, 'crossValidation')[0]
best_fair_model.metrics, best_cheat_model.metrics
Out[23]:
({u'AUC': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.71437,
   u'holdout': None,
   u'validation': 0.7187},
  u'FVE Binomial': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.089798,
   u'holdout': None,
   u'validation': 0.09167},
  u'Gini Norm': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.42874,
   u'holdout': None,
   u'validation': 0.4374},
  u'LogLoss': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.29108199999999995,
   u'holdout': None,
   u'validation': 0.29062},
  u'RMSE': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.28612,
   u'holdout': None,
   u'validation': 0.28617},
  u'Rate@Top10%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.288738,
   u'holdout': None,
   u'validation': 0.28669},
  u'Rate@Top5%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.37415,
   u'holdout': None,
   u'validation': 0.39456},
  u'Rate@TopTenth%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.633334,
   u'holdout': None,
   u'validation': 1.0}},
 {u'AUC': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.758114,
   u'holdout': None,
   u'validation': 0.75345},
  u'FVE Binomial': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.14579400000000003,
   u'holdout': None,
   u'validation': 0.14438},
  u'Gini Norm': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.516228,
   u'holdout': None,
   u'validation': 0.5069},
  u'LogLoss': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.273176,
   u'holdout': None,
   u'validation': 0.27376},
  u'RMSE': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.27671,
   u'holdout': None,
   u'validation': 0.27686},
  u'Rate@Top10%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.370648,
   u'holdout': None,
   u'validation': 0.38225},
  u'Rate@Top5%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.48163600000000006,
   u'holdout': None,
   u'validation': 0.4898},
  u'Rate@TopTenth%': {u'backtesting': None,
   u'backtestingScores': None,
   u'crossValidation': 0.933334,
   u'holdout': None,
   u'validation': 1.0}})

Visualizing Models

This is a good time to use Model XRay (not yet available via the API) to visualize the models:

In [ ]:
best_fair_model.open_model_browser()
In [ ]:
best_cheat_model.open_model_browser()

Unlocking the Holdout

To maintain holdout scores as a valid estimate of out-of-sample error, we recommend not looking at them until late in the project. For this reason, holdout scores are locked until you unlock them.

In [24]:
project.unlock_holdout()
Out[24]:
Project(Airline Delays - was_delayed)
In [25]:
best_fair_model = dr.Model.get(project.id, best_fair_model.id)
best_cheat_model = dr.Model.get(project.id, best_cheat_model.id)
In [26]:
best_fair_model.metrics['LogLoss'], best_cheat_model.metrics['LogLoss']
Out[26]:
({u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.29108199999999995,
  u'holdout': 0.29344,
  u'validation': 0.29062},
 {u'backtesting': None,
  u'backtestingScores': None,
  u'crossValidation': 0.273176,
  u'holdout': 0.27542,
  u'validation': 0.27376})

Retrain on 100%

When ready to use the final model, you will probably get the best performance by retraining on 100% of the data.

In [27]:
model_job_fair_100pct_id = best_fair_model.train(sample_pct=100)
model_job_fair_100pct_id
Out[27]:
'188'

Wait for the model to complete:

In [28]:
model_fair_100pct = dr.models.modeljob.wait_for_async_model_creation(
    project.id, model_job_fair_100pct_id)
model_fair_100pct.id
Out[28]:
u'5aa015f8fe075913b47c67ff'

Predictions

Let’s make predictions for some new data. This new data will need to have the same transformations applied as we applied to the training data.

In [29]:
logan_2014 = pd.read_csv("logan-US-2014.csv")
logan_2014_modeling = prepare_modeling_dataset(logan_2014)
logan_2014_modeling.head()
Out[29]:
was_delayed daily_rainfall did_rain Carrier Code Flight Number Tail Number Destination Airport Scheduled Departure Time day_of_week month
0 False 0.0 False US 450 N809AW PHX 10:00 Sat 2
1 False 0.0 False US 553 N814AW PHL 07:00 Sat 2
2 False 0.0 False US 582 N820AW PHX 06:10 Sat 2
3 False 0.0 False US 601 N678AW PHX 16:20 Sat 2
4 False 0.0 False US 657 N662AW CLT 09:45 Sat 2
In [30]:
prediction_dataset = project.upload_dataset(logan_2014_modeling)
predict_job = model_fair_100pct.request_predictions(prediction_dataset.id)
prediction_dataset.id
Out[30]:
u'5aa01634fe0759146b80ab2c'
In [31]:
predictions = predict_job.get_result_when_complete()
In [32]:
pd.concat([logan_2014_modeling, predictions], axis=1).head()
Out[32]:
was_delayed daily_rainfall did_rain Carrier Code Flight Number Tail Number Destination Airport Scheduled Departure Time day_of_week month positive_probability prediction row_id
0 False 0.0 False US 450 N809AW PHX 10:00 Sat 2 0.050824 0.0 0
1 False 0.0 False US 553 N814AW PHL 07:00 Sat 2 0.040017 0.0 1
2 False 0.0 False US 582 N820AW PHX 06:10 Sat 2 0.032445 0.0 2
3 False 0.0 False US 601 N678AW PHX 16:20 Sat 2 0.122692 0.0 3
4 False 0.0 False US 657 N662AW CLT 09:45 Sat 2 0.054400 0.0 4

Let’s have a look at our results. Since this is a binary classification problem, as the positive_probability approaches zero this row is a stronger candidate for the negative class (the flight will leave on-time), while as it approaches one, the outcome is more likely to be of the positive class (the flight will be delayed). From the KDE (Kernel Density Estimate) plot below, we can see that this sample of the data is weighted stronger to the negative class.

In [33]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
In [34]:
matplotlib.rcParams['figure.figsize'] = (15, 10)  # make charts bigger
In [35]:
sns.set(color_codes=True)
sns.kdeplot(predictions.positive_probability, shade=True, cut=0,
            label='Positive Probability')
plt.xlim((0, 1))
plt.ylim((0, None))
plt.xlabel('Probability of Event')
plt.ylabel('Probability Density')
plt.title('Prediction Distribution')
Out[35]:
Text(0.5,1,u'Prediction Distribution')
_images/examples_airline_ontime_example_Modeling_Airline_Delay_55_1.png

Exploring Reason Codes

Computing reason codes is a resource-intensive task, but you can help reduce the runtime for computing them by setting prediction value thresholds. You can learn more about reason codes by searching the online documentation available in the DataRobot web interface (it may be referred to as Prediction Explanations).

When are they useful?

A common question when evaluating data is “why is a certain data-point considered high-risk (or low-risk) for a certain event”?

A sample case for reason codes:

Clark is a business analyst at a large manufacturing firm. She does not have a lot of data science expertise, but has been using DataRobot with great success to predict likely product failures at her manufacturing plant. Her manager is now asking for recommendations for reducing the defect rate, based on these predictions. Clark would like DataRobot to produce reason codes for the expected product failures so that she can identify the key drivers of product failures based on a higher-level aggregation of reasons. Her business team can then use this report to address the causes of failure.

Other common use cases and possible reasons include:

  • What are indicators that a transaction could be at high risk for fraud? Possible reasons include transactions out of a cardholder’s home area, transactions out of their “normal usage” time range, and transactions that are too large or small.
  • What are some reasons for setting a higher auto insurance price? The applicant is single, male, age under 30 years, and has received traffic tickets. A married homeowner may receive a lower rate.
Preparation

We are almost ready to compute reason codes. Reason codes require two prerequisites to be performed first; however, these commands only need to be run once per model.

Feature Impact

A prerequisite to computing reason codes is that you need to compute the feature impact for your model (this only needs to be done once per model):

In [36]:
%%time
try:
    impact_job = model_fair_100pct.request_feature_impact()
    impact_job.wait_for_completion(4 * 60)
except dr.errors.JobAlreadyRequested:
    pass  # already computed
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 10.6 s
Reason Codes Initialization

After Feature Impact has been computed, you also must create a Reason Codes Initialization for your model:

In [37]:
%%time
try:
    # Test to see if they are already computed
    dr.ReasonCodesInitialization.get(project.id, model_fair_100pct.id)
except dr.errors.ClientError as e:
    assert e.status_code == 404  # haven't been computed
    init_job = dr.ReasonCodesInitialization.create(project.id,
                                                   model_fair_100pct.id)
    init_job.wait_for_completion()
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 96.9 ms
Computing the reason codes

Now that we have computed the feature impact and initialized the reason codes, and also uploaded a dataset and computed predictions on it, we are ready to make a request to compute the reason codes for every row of the dataset. Computing reason codes supports a couple of parameters:

  • max_codes are the maximum number of reason codes to compute for each row.
  • threshold_low and threshold_high are thresholds for the value of the prediction of the row. Reason codes will be computed for a row if the row’s prediction value is higher than threshold_high or lower than threshold_low. If no thresholds are specified, reason codes will be computed for all rows.

Note: for binary classification projects (like this one), the thresholds correspond to the positive_probability prediction value whereas for regression problems, it corresponds to the actual predicted value.

Since we’ve already examined our prediction distribution from above let’s use that to influence what we set for our thresholds. It looks like most flights depart on-time so let’s just examine the reasons for flights that have an above normal probability for being delayed. We will use a threshold_high of 0.456 which means for all rows where the predicted positive_probability is at least 0.456 we will compute the reason codes for that row. For the simplicity of this tutorial, we will also limit DataRobot to only compute 5 codes for us.

In [38]:
%%time
number_of_reasons = 5
rc_job = dr.ReasonCodes.create(project.id,
                               model_fair_100pct.id,
                               prediction_dataset.id,
                               max_codes=number_of_reasons,
                               threshold_low=None,
                               threshold_high=0.456)
rc = rc_job.get_result_when_complete()
all_rows = rc.get_all_as_dataframe()
CPU times: user 4.3 s, sys: 108 ms, total: 4.41 s
Wall time: 29.3 s

Let’s cleanup the DataFrame we got back by trimming it down to just the interesting columns. Also, since most rows will have prediction values outside our thresholds, let’s drop all the uninteresting rows (i.e. ones with null values).

In [39]:
import pandas as pd
pd.options.display.max_rows = 10  # default display is too verbose

# These rows are all redundant or of little value for this example
redundant_cols = ['row_id', 'class_0_label', 'class_1_probability',
                  'class_1_label']
reasons = all_rows.drop(redundant_cols, axis=1)
reasons.drop(['reason_{}_label'.format(i) for i in range(number_of_reasons)],
             axis=1, inplace=True)

# These are rows that didn't meet our thresholds
reasons.dropna(inplace=True)

# Rename columns to be more consistent with the terms we have been using
reasons.rename(index=str,
               columns={'class_0_probability': 'positive_probability'},
               inplace=True)
reasons
Out[39]:
prediction positive_probability reason_0_feature reason_0_feature_value reason_0_qualitative_strength reason_0_strength reason_1_feature reason_1_feature_value reason_1_qualitative_strength reason_1_strength ... reason_2_qualitative_strength reason_2_strength reason_3_feature reason_3_feature_value reason_3_qualitative_strength reason_3_strength reason_4_feature reason_4_feature_value reason_4_qualitative_strength reason_4_strength
9498 1.0 0.521672 Scheduled Departure Time -2.208920e+09 +++ 1.411063 Tail Number N170US ++ 0.522242 ... ++ 0.355082 Flight Number 800 ++ 0.247061 day_of_week Thurs ++ 0.240676
12373 1.0 0.505737 Scheduled Departure Time -2.208920e+09 ++ 0.858645 Flight Number 897 ++ 0.848086 ... ++ 0.522828 month 12 ++ 0.312428 day_of_week Mon ++ 0.276766
13254 0.0 0.466474 Scheduled Departure Time -2.208920e+09 +++ 0.937670 Flight Number 897 +++ 0.850898 ... ++ 0.335550 day_of_week Thurs ++ 0.308574 Destination Airport PHX - -0.145671
13351 0.0 0.484007 Scheduled Departure Time -2.208920e+09 +++ 1.124481 Flight Number 897 ++ 0.863671 ... ++ 0.371650 day_of_week Sun ++ 0.343775 month 12 ++ 0.341253
13536 1.0 0.512797 Scheduled Departure Time -2.208920e+09 +++ 1.229332 Flight Number 897 ++ 0.861893 ... ++ 0.486845 day_of_week Sun ++ 0.343775 month 12 ++ 0.319640
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18015 0.0 0.494928 Scheduled Departure Time -2.208920e+09 +++ 1.902487 month 7 ++ 0.743062 ... + 0.230507 day_of_week Thurs + 0.220769 Flight Number 800 + 0.216649
18165 0.0 0.492765 Scheduled Departure Time -2.208920e+09 +++ 1.447815 month 7 ++ 0.517627 ... ++ 0.349980 Flight Number 800 ++ 0.290879 Destination Airport CLT + 0.183091
18392 1.0 0.584637 Scheduled Departure Time -2.208920e+09 +++ 1.494192 month 7 ++ 0.610177 ... ++ 0.522249 Flight Number 800 ++ 0.281044 day_of_week Thurs ++ 0.243063
18396 0.0 0.456992 Scheduled Departure Time -2.208919e+09 +++ 1.442328 month 7 ++ 0.600896 ... ++ 0.338863 day_of_week Thurs ++ 0.217166 Scheduled Departure Time (Hour of Day) 19 + 0.065168
18406 1.0 0.515136 Scheduled Departure Time -2.208927e+09 +++ 1.766550 month 7 ++ 0.832155 ... ++ 0.815748 Scheduled Departure Time (Hour of Day) 17 ++ 0.323491 Destination Airport CLT + 0.272840

27 rows × 22 columns

Explore Reason Code results

Now let’s see how often various features are showing up as the top reason for impacting the probability of a flight being delayed.

In [40]:
from functools import reduce

# Create a combined histogram of all our reasons
reasons_hist = reduce(lambda x, y: x.add(y, fill_value=0),
                      (reasons['reason_{}_feature'.format(i)].value_counts()
                       for i in range(number_of_reasons)))
In [41]:
reasons_hist.plot.bar()
plt.xticks(rotation=45, ha='right')
Out[41]:
(array([0, 1, 2, 3, 4, 5, 6]), <a list of 7 Text xticklabel objects>)
_images/examples_airline_ontime_example_Modeling_Airline_Delay_68_1.png

Knowing the feature impact for this model from the Diving Deeper notebook, the high occurrence of the daily_rainfall and Scheduled Departure Time as reason codes is not entirely surprising because these were some of the top ranked features in the impact chart. Therefore, let’s take a small detour investigating some of the rows that had less expected reasons.


Below is some helper code. It can largely be ignored as it is mostly relevant for this exercise and not needed for a general understanding of the DataRobot APIs

In [42]:
from operator import or_
from functools import reduce
from itertools import chain


def find_rows_with_reason(df, feature_name, nreasons):
    """
    Given a reason codes DataFrame, return a slice of that data where the
    top N reasons match the given feature
    """
    all_reason_columns = (df['reason_{}_feature'.format(i)] == feature_name
                          for i in range(nreasons))
    df_filter = reduce(or_, all_reason_columns)
    return favorite_reason_columns(df[df_filter], nreasons)


def favorite_reason_columns(df, nreasons):
    """
    Only display the most useful rows of a reason codes DataFrame.
    """
    # Use chain to flatten our list of tuples
    columns = list(chain.from_iterable(('reason_{}_feature'.format(i),
                                        'reason_{}_feature_value'.format(i),
                                        'reason_{}_strength'.format(i))
                                       for i in range(nreasons)))
    return df[columns]


def find_feature_in_row(feature, row, nreasons):
    """
    Return the value of a given feature
    """
    for i in range(nreasons):
        if row['reason_{}_feature'.format(i)] == feature:
            return row['reason_{}_feature_value'.format(i)]


def collect_feature_values(df, feature, nreasons):
    """
    Return a list of all values of a given reason code from a DataFrame
    """
    return [find_feature_in_row(feature, row, nreasons)
            for index, row in df.iterrows()]

Investigation: Destination Airport

It looks like there was a small number of rows where the Destination Airport was one of the top N reasons for a given prediction

In [43]:
feature_name = 'Destination Airport'
flight_nums = find_rows_with_reason(reasons, feature_name, number_of_reasons)
flight_nums
Out[43]:
reason_0_feature reason_0_feature_value reason_0_strength reason_1_feature reason_1_feature_value reason_1_strength reason_2_feature reason_2_feature_value reason_2_strength reason_3_feature reason_3_feature_value reason_3_strength reason_4_feature reason_4_feature_value reason_4_strength
13254 Scheduled Departure Time -2.208920e+09 0.937670 Flight Number 897 0.850898 Tail Number N657AW 0.335550 day_of_week Thurs 0.308574 Destination Airport PHX -0.145671
14226 Scheduled Departure Time -2.208920e+09 1.435292 month 6 0.459697 Flight Number 800 0.280207 day_of_week Thurs 0.251885 Destination Airport CLT 0.201186
14601 Scheduled Departure Time -2.208920e+09 1.422922 month 6 0.381899 Flight Number 800 0.278981 day_of_week Thurs 0.248532 Destination Airport CLT 0.201186
14855 Scheduled Departure Time -2.208920e+09 1.376668 month 6 0.455120 Tail Number N163US 0.345858 Flight Number 800 0.308118 Destination Airport CLT 0.186002
14978 Scheduled Departure Time -2.208920e+09 1.435292 month 6 0.459697 Flight Number 800 0.280207 day_of_week Thurs 0.251885 Destination Airport CLT 0.201186
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
17576 Scheduled Departure Time -2.208920e+09 1.512200 month 7 0.623913 Tail Number N170US 0.544830 Flight Number 800 0.305650 Destination Airport CLT 0.186923
17638 Scheduled Departure Time -2.208920e+09 1.523302 month 7 0.461251 Flight Number 800 0.282270 day_of_week Thurs 0.246416 Destination Airport CLT 0.202109
18015 Scheduled Departure Time -2.208920e+09 1.902487 month 7 0.743062 Destination Airport CLT 0.230507 day_of_week Thurs 0.220769 Flight Number 800 0.216649
18165 Scheduled Departure Time -2.208920e+09 1.447815 month 7 0.517627 Tail Number N173US 0.349980 Flight Number 800 0.290879 Destination Airport CLT 0.183091
18406 Scheduled Departure Time -2.208927e+09 1.766550 month 7 0.832155 Tail Number N818AW 0.815748 Scheduled Departure Time (Hour of Day) 17 0.323491 Destination Airport CLT 0.272840

12 rows × 15 columns

In [44]:
all_flights = collect_feature_values(flight_nums,
                                     feature_name,
                                     number_of_reasons)
pd.DataFrame(all_flights)[0].value_counts().plot.bar()
plt.xticks(rotation=0)
Out[44]:
(array([0, 1]), <a list of 2 Text xticklabel objects>)
_images/examples_airline_ontime_example_Modeling_Airline_Delay_74_1.png

Many a frequent flier will tell you horror stories about flying in and out of certain airports. While any given reason code can have a positive or a negative impact to a prediction (this is indicated by both the strength and qualitative_strength columns), due to the thresholds we configured earlier for this tutorial it is likely that the above airports are causing flight delays.


Investigation: Scheduled Departure Time

DataRobot correctly identified the Scheduled Departure Time input as a timestamp but in the reason code output, we are seeing the internal representation of the time value as a Unix epoch value so let’s put it back into a format that humans can understand better:

In [45]:
# For simplicity, let's just look at rows where `Scheduled Departure Time`
# was the first/top reason.
bad_times = reasons[reasons.reason_0_feature == 'Scheduled Departure Time']

# Now let's convert the epoch to a datetime
pd.to_datetime(bad_times.reason_0_feature_value, unit='s')
Out[45]:
9498    1900-01-01 19:10:00
12373   1900-01-01 19:00:00
13254   1900-01-01 19:00:00
13351   1900-01-01 19:00:00
13536   1900-01-01 19:00:00
                ...
18015   1900-01-01 19:10:00
18165   1900-01-01 19:10:00
18392   1900-01-01 19:10:00
18396   1900-01-01 19:30:00
18406   1900-01-01 17:05:00
Name: reason_0_feature_value, Length: 27, dtype: datetime64[ns]

We can see that it appears as though all departures occurred on Jan. 1st, 1900. This is because the original value was simply a timestamp so only the time portion of the epoch is meaningful. We will clean this up in our graph below:

In [46]:
from matplotlib.ticker import FuncFormatter
from time import gmtime, strftime

scale_factor = 9  # make the difference in strengths more visible

depart = reasons[reasons.reason_0_feature == 'Scheduled Departure Time']
true_only = depart[depart.prediction == 1]
false_only = depart[depart.prediction == 0]
plt.scatter(x=true_only.reason_0_feature_value,
            y=true_only.positive_probability,
            c='green',
            s=true_only.reason_0_strength ** scale_factor,
            label='Will be delayed')
plt.scatter(x=false_only.reason_0_feature_value,
            y=false_only.positive_probability,
            c='purple',
            s=false_only.reason_0_strength ** scale_factor,
            label='Will not')

# Convert the Epoch values into human time stamps
formatter = FuncFormatter(lambda x, pos: strftime('%H:%M', gmtime(x)))
plt.gca().xaxis.set_major_formatter(formatter)

plt.xlabel('Scheduled Departure Time')
plt.ylabel('Positive Probability')
plt.legend(markerscale=.5, frameon=True, facecolor="white")
plt.title("Relationship of Depart Time and being delayed")
Out[46]:
Text(0.5,1,u'Relationship of Depart Time and being delayed')
_images/examples_airline_ontime_example_Modeling_Airline_Delay_79_1.png

The above plot shows each prediction where the top influencer of the prediction was the Scheduled Departure Time. It’s plotted against the positive_probability on the Y-axis and the size of the plots represent the strength that departure time had on the prediction (relative to the other features of that given data point). Finally to aid visually, the positive vs. negative outcomes are colored.

As we can see by the time-scale on the X-axis, it doesn’t represent the full 24 hours; this is telling. Since we filtered our data earlier to only show predictions that were leaning towards being delayed, and this chart is leaning towards times in the afternoon and evening there may be a correlation between later scheduled departure time and a higher probability of being delayed. With a little bit of domain knowledge, one will understand that an airplane and its crew make many flights in a day (typically hopping between cities) so small delays in the morning compound into the evening hours.

Financial Data

This example retrieves financial data from the Federal Reserve Bank of St. Louis and builds models in DataRobot to predict recession.

Creating the Dataset

This notebook shows some of the steps required in creating a dataset from a third party’s data. It has very little to do with DataRobot, and if you’re mostly interested in learning about how to use the DataRobot Python Client, then you could skip reading this section and miss out on very little. In order to have the data necessary for the other notebook, you will need to make sure that this notebook runs.

What do I need to do?
Get an API Key

The data we will be using is owned by the Federal Reserve Bank of St. Louis. They have an API for which you will need a key. The key is free, don’t worry. Grab one at https://research.stlouisfed.org/docs/api/fred/

To run this notebook without any changes, you will need to save your API key in a file in the same directory from which you run this notebook, and call the name of the file api_key.

Install the fredapi package

You will also need this python client package, which makes accessing the data incredibly easy.

pip install fredapi

What will we do with this data?

We’re going to predict the future and get rich.

More concretely, we’re going to use historical economic data to build a forecasting model for whether or not the US Economy will be in recession in 13 weeks from now.

The FRED Economic Data

The Federal Reserve Bank of St. Louis provides a rich set of historical financial data, plus a REST API to access this data.

We have also written some utilities in order to make it easy to combine data series with different date frequencies in a technique known as Last Observation Carried Forward.

In [1]:
import warnings

import datetime
import fredapi
import pandas as pd
import timetools
fred = fredapi.Fred(api_key_file='api_key')
Get the data

There is a lot of data accessible through the FRED API. More than a quarter million data series, actually. That’s probably too much to all be useful.

We selected this set of series by starting with a subset of data specifically related to the US Economy, and started filtering out forecast data, data that was a pseudo-indicator date (which is a big data leak for this problem), eventually ending up with the collection of series you see in the cell below. It wasn’t really a scientific process, there are certainly more robust ways to go about it.

You can go learn about any of these on the FRED website, like this: https://research.stlouisfed.org/fred2/series/A007RO1Q156NBEA That is the webpage for the first data series in the cell below. You can also get much of that data through the API, using the get_series_info method like we do in the following cell.

In [2]:
good_columns = [
    u'A007RO1Q156NBEA', u'A011RE1Q156NBEA', u'A011RJ2Q224SBEA',
    u'A021RO1Q156NBEA', u'A021RY2Q224SBEA', u'A191RV1Q225SBEA',
    u'A765RL1Q225SBEA', u'A798RS2Q224SBEA', u'B808RA3Q086SBEA',
    u'CLSACBQ158SBOG', u'CORESTICKM158SFRBATL', u'DLTRUCKSSAAR',
    u'DNDGRY2Q224SBEA', u'DONGRS2Q224SBEA', u'DPCERV1Q225SBEA',
    u'DTRSRZ2Q224SBEA', u'LNS14024886', u'LNU02300000', u'LNU04000003',
    u'M1V', u'M2MOWN', u'M2V', u'MVAAUTOASS',
    u'NECDFNA066MNFRBPHI', u'NOCDSA156MSFRBPHI', u'PERMIT',
    u'PERMITMWNSA', u'PRS84006173', u'RCPHBS',
    u'STICKCPIXSHLTRM158SFRBATL', u'W004RZ2Q224SBEA', u'W087RA3Q086SBEA',
    u'W111RA3Q086SBEA', u'W117RL1Q225SBEA', u'W130RA3Q086SBEA',
    u'W368RG3Q066SBEA', u'WAAA', u'WGS10YR', u'WTB3MS',
    u'Y020RY2Q224SBEA', u'Y033RV1Q225SBEA', u'Y033RZ2Q224SBEA',
    u'Y034RA3Q086SBEA', u'Y034RY2Q224SBEA', u'Y052RL1Q225SBEA',
    u'Y054RG3Q086SBEA', u'Y060RZ2Q224SBEA', u'Y694RY2Q224SBEA']
Get the metadata

We’ll be needing to know the frequency of the observations in order to merge the data correctly. That information is available from the API. Each call to get_series_info involves an API call, so this may take some time.

In [3]:
metadata = {}
for series_id in good_columns:
    try:
        metadata[series_id] = fred.get_series_info(series_id)
    except ValueError:
        # Series sometimes get retired from FRED
        warnings.warn('Series {} not found on FRED API'.format(series_id))
Get the data

This is where we actually acquire the data. This next step may take a while.

In [4]:
def get_series_data(series_id):
    series_data = fred.get_series_first_release(series_id)

    series_index = [ix.strftime('%Y-%m-%d') for ix in series_data.index]
    series_data.index = series_index
    return series_data

obs = {}
for series_id in metadata.keys():
    series_data = get_series_data(series_id)
    obs[series_id] = series_data
Organize by data frequency

Here we make a few groups of the series we just acquired. The ones that have the same update frequency can be put into one dataframe very easily.

In [5]:
weekly = [series_id for series_id, meta
          in metadata.iteritems()
          if meta['frequency'] == 'Weekly, Ending Friday']
quarterly = [series_id for series_id, meta
             in metadata.iteritems()
             if meta['frequency'] == 'Quarterly']
monthly = [series_id for series_id, meta
           in metadata.iteritems()
           if meta['frequency'] == 'Monthly']
In [6]:
all_weekly = pd.DataFrame({metadata[series_id]['title']: obs[series_id]
                           for series_id in weekly})

all_monthly = pd.DataFrame({metadata[series_id]['title']: obs[series_id]
                            for series_id in monthly})

all_quarterly = pd.DataFrame({metadata[series_id]['title']: obs[series_id]
                              for series_id in quarterly})
Combine the data of different frequencies

We wrote a little helper to take care of merging dataframes that have differing date indexes. It comes in handy right here.

We also drop some rows that extend into the future - some of the series from FRED come back like that, and it’s not good for modeling.

In [7]:
fin_data = timetools.expand_frame_merge(all_weekly, all_monthly)
fin_data = timetools.expand_frame_merge(fin_data, all_quarterly)

fin_data = fin_data[fin_data.index <
                    datetime.datetime.today().strftime('%Y-%m-%d')]
Create the target

The whole point of all this is to see if we can predict if there will be a recession in the future, so we’ll need to get historical data on the state of the US economy.

Of course, predicting if we are in a recession on any given day is kind of a no-brainer. So we’ll slide the series in such a way that for any given date, we’re looking at whether there is a recession 13 weeks from that day.

In [8]:
usrec = fred.get_series_first_release('USREC')
usrec.index = [ix.isoformat().split('T')[0] for ix in usrec.index]
bool_match = usrec.index > '1918-01-01'
target_series = usrec[bool_match]


target_name = 'US Recession in 13 Weeks'
timetools.slide(target_series, 7 * 13)
target_frame = pd.DataFrame({target_name: target_series})

modeling_frame = timetools.expand_frame_merge(fin_data, target_frame)
Trim some (mostly useless) data

Some of these series only started gathering data in the late 1940’s. So we’ll just drop rows from before then, since there isn’t much information in those weeks. While this step isn’t necessary, it does mean we’ll be modeling on some more informative data.

In [9]:
na_counts = modeling_frame.isnull().sum(axis=1)
earliest_useful_day = na_counts[na_counts < 20].index[0]
earliest_useful_day

modeling_frame = modeling_frame[modeling_frame.index >= earliest_useful_day]
Create the partition column

We’ll be training on data before 1980, validating on data from 1980 to 1995, and withholding the data for 1995 onward. This is mostly arbitrary, but does ensure that each time interval has more than one recession. If we create a column with these labels, DataRobot will let us use that column to partition the data into training, validation, and holdout.

In [10]:
n_rows = len(modeling_frame)

validation_first_day = modeling_frame[modeling_frame.index >=
                                      '1980-01-01'].index[0]
validation_point = modeling_frame.index.get_loc(validation_first_day)
holdout_first_day = modeling_frame[modeling_frame.index >=
                                   '1995-01-01'].index[0]
holdout_point = modeling_frame.index.get_loc(holdout_first_day)

tvh = pd.Series(['T'] * n_rows)
tvh.loc[validation_point:holdout_point] = 'V'
tvh.loc[holdout_point:] = 'H'
tvh.index = modeling_frame.index

modeling_frame['TVH'] = tvh

Write the dataset to disk

In [12]:
fname = 'financials-{}.csv'.format(datetime.datetime.today().
                                   strftime('%Y-%m-%d'))
modeling_frame.to_csv(fname, index=True, index_label='Date', encoding='utf-8')

Predicting Recessions with DataRobot

In this use case, we’ll try to predict whether or not the US economy is heading into a recession within the next three months. While hopefully it’s not necessary to say so, let’s just get this out of the way up front: don’t actually invest your real money according to the results of this notebook. The real value comes from learning about how to use the Python client of the DataRobot API.

Topics Covered in this Notebook

Here is a list of things we’ll touch on during this notebook:

  • Installing the datarobot package
  • Configuring the client
  • Creating a project
  • Using a column from the dataset for a custom partitioning scheme
  • Omitting one of the source columns from the modeling process
  • Run the automated modeling process
  • Generating predictions from a finished model

The dataset required for this notebook can be produced by running the notebook Generating a Dataset from FRBSL, located in this same directory.

Prerequisites

In order to run this notebook yourself, you will need the following:

  • A DataRobot API token
  • matplotlib for the visualizations at the end
Installing the datarobot package

The datarobot package is hosted on PyPI. You can install it via:

pip install datarobot

from the command line. Its main dependencies are numpy and pandas, which could take some time to install on a new system. We highly recommend use of virtualenvs to avoid conflicts with other dependencies in your system-wide python installation.

Getting started

This line imports the datarobot package. By convention, we always import it with the alias dr.

In [1]:
import datarobot as dr
Other important imports

We’ll use these in this notebook as well. If the previous cell and the following cell both run without issue, you’re in good shape.

In [2]:
import re
import os
import datetime
import matplotlib.pyplot as plt
import pandas as pd
%pylab inline
Populating the interactive namespace from numpy and matplotlib
Configure the Python Client

Configuring the client requires the following two things:

  • A DataRobot endpoint - where the API server can be found
  • A DataRobot API token

The client can be configured in several ways. The example we’ll use in this notebook is to point to a yaml file that has the information. This is the structure of that file:

endpoint: https://app.datarobot.com/api/v2/
token: not-my-real-token

If you want to run this notebook without changes, please save your configuration in the same directory from which you run this notebook, and call it drconfig.yaml.

In [3]:
dr.Client(config_path='drconfig.yaml')
Out[3]:
<datarobot.rest.RESTClientObject at 0x2d958d0>
Find the data in your filesystem

If you have run the other notebook, it will have written a file to disk. In the next cell, we’ll try to find it in this directory. If it’s not here, you can help the notebook continue by defining the variable filename to point to that file.

In [4]:
usecase_name_regex = re.compile('financials-.*\.csv')

files = [fname for fname in os.listdir('.')
         if usecase_name_regex.match(fname)]
filename = files[0]
print('Using {}'.format(filename))
Using financials-2017-07-10.csv
Create the Project

Here, we use the datarobot package to upload a new file and create a project. The name of the project is optional, but can be helpful when trying to sort among many projects on DataRobot.

In [5]:
now = datetime.datetime.now().strftime('%Y-%m-%dT%H:%M')
project_name = 'FRB{}'.format(now)
proj = dr.Project.create(sourcedata=filename,
                         project_name=project_name)
Create a custom partition scheme

This problem has a time component to it, so it wouldn’t do us very much good to train on data from the present and predict on the past. In creating the dataset, the column TVH was used to indicate which partition each row should belong to. The training (T) data all precedes the validation (V) data in time, which in turn precedes the holdout (H) data. By using a UserTVH column we can specify this partition should be used by DataRobot. Absent this information, DataRobot defaults to randomly separating rows into training, validation, and holdout.

In [6]:
proj_partition = dr.UserTVH(user_partition_col='TVH',
                            training_level='T',
                            validation_level='V',
                            holdout_level='H')
Omit a column from modeling

The Date column is a data leak, so we don’t want it to be included in the modeling process. We can accomplish this by creating a featurelist that does not include it, and using that featurelist during modeling.

In [7]:
features = proj.get_features()
names_without_date = [feature.name for feature in features
                      if feature.name != 'Date']
flist = proj.create_featurelist('Without Date', names_without_date)
Run the automated modeling process

Now we can start the modeling process. The target for this problem is called US Recession in 13 Weeks - a binary variable indicating whether or not the US economy was in recession 13 weeks after the week that a row represents.

We specify that the metric that should be used is AUC. Without a specification DataRobot would use the metric it recommends (in this case, it would have been LogLoss).

The partitioning_method is used to specify that we would like DataRobot to use the partitioning schema we specified previously.

The featurelist_id parameter tells DataRobot to model on that specific featurelist, rather than the default Informative Features.

Finally, the worker_count parameter specifies how many workers should be used for this project. Keep in mind, you might not have access to 10 workers. If you need more resources than what has been allocated to you, you should think about upgrading your license.

The last command in this cell is just a blocking loop that periodically checks on the project to see if it is done, printing out the number of jobs in progress and in the queue along the way so you can see progress. The automated model exploration process will occasionally add more jobs to the queue, so don’t be alarmed if the number of jobs does not strictly decrease over time.

In [8]:
target_name = 'US Recession in 13 Weeks'
proj.set_target(target_name,
                metric='AUC',
                partitioning_method=proj_partition,
                featurelist_id=flist.id,
                worker_count=8)

proj.wait_for_autopilot()
In progress: 7, queued: 23 (waited: 0s)
In progress: 7, queued: 23 (waited: 0s)
In progress: 7, queued: 23 (waited: 1s)
In progress: 7, queued: 23 (waited: 2s)
In progress: 7, queued: 23 (waited: 3s)
In progress: 7, queued: 23 (waited: 4s)
In progress: 6, queued: 21 (waited: 8s)
In progress: 7, queued: 17 (waited: 15s)
In progress: 7, queued: 14 (waited: 28s)
In progress: 7, queued: 6 (waited: 48s)
In progress: 7, queued: 2 (waited: 68s)
In progress: 5, queued: 0 (waited: 89s)
In progress: 4, queued: 0 (waited: 109s)
In progress: 2, queued: 0 (waited: 129s)
In progress: 0, queued: 0 (waited: 149s)
In progress: 6, queued: 0 (waited: 170s)
In progress: 4, queued: 0 (waited: 190s)
In progress: 2, queued: 0 (waited: 210s)
In progress: 1, queued: 0 (waited: 230s)
In progress: 4, queued: 0 (waited: 251s)
In progress: 0, queued: 0 (waited: 271s)
In progress: 0, queued: 0 (waited: 291s)
What just happened?

We can see how many models DataRobot built for this project by querying. Each of them has been tuned individually. Models that appear to have the same name differ either in the amount of data used in training or in the preprocessing steps used (or both).

In [9]:
models = proj.get_models()
for idx, model in enumerate(models):
    print('[{}]: {} - {}'.
          format(idx, model.metrics['AUC']['validation'], model.model_type))
[0]: 0.96738 - ExtraTrees Classifier (Gini)
[1]: 0.96279 - ExtraTrees Classifier (Gini)
[2]: 0.94981 - Vowpal Wabbit Classifier
[3]: 0.94803 - eXtreme Gradient Boosted Trees Classifier
[4]: 0.94741 - AVG Blender
[5]: 0.94396 - eXtreme Gradient Boosted Trees Classifier with Unsupervised Learning Features
[6]: 0.9437 - eXtreme Gradient Boosted Trees Classifier
[7]: 0.9437 - ENET Blender
[8]: 0.94274 - ENET Blender
[9]: 0.94215 - Elastic-Net Classifier (L2 / Binomial Deviance)
[10]: 0.9401 - Regularized Logistic Regression (L2)
[11]: 0.93376 - Advanced AVG Blender
[12]: 0.93321 - Regularized Logistic Regression (L2)
[13]: 0.93229 - Support Vector Classifier (Radial Kernel)
[14]: 0.92888 - Regularized Logistic Regression (L2)
[15]: 0.92879 - Support Vector Classifier (Radial Kernel)
[16]: 0.9245 - Regularized Logistic Regression (L2)
[17]: 0.91793 - eXtreme Gradient Boosted Trees Classifier with Unsupervised Learning Features
[18]: 0.91719 - eXtreme Gradient Boosted Trees Classifier
[19]: 0.90894 - RandomForest Classifier (Entropy)
[20]: 0.90451 - Gradient Boosted Greedy Trees Classifier
[21]: 0.90188 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) with Unsupervised Learning Features
[22]: 0.8934 - Elastic-Net Classifier (L2 / Binomial Deviance) with Binned numeric features
[23]: 0.89184 - Breiman and Cutler Random Forest Classifier
[24]: 0.89151 - Gradient Boosted Trees Classifier
[25]: 0.89137 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)
[26]: 0.88992 - Gradient Boosted Trees Classifier
[27]: 0.88978 - RandomForest Classifier (Gini)
[28]: 0.8574 - RuleFit Classifier
[29]: 0.85148 - Auto-tuned K-Nearest Neighbors Classifier (Minkowski Distance)
[30]: 0.84975 - Vowpal Wabbit Classifier
[31]: 0.8471 - RandomForest Classifier (Gini)
[32]: 0.83946 - Logistic Regression
[33]: 0.81802 - Gradient Boosted Trees Classifier
[34]: 0.80683 - TensorFlow Neural Network Classifier
[35]: 0.7483 - Elastic-Net Classifier (L2 / Binomial Deviance)
[36]: 0.7375 - Decision Tree Classifier (Gini)
[37]: 0.70172 - Nystroem Kernel SVM Classifier
[38]: 0.61144 - Auto-tuned K-Nearest Neighbors Classifier (Minkowski Distance)
[39]: 0.57843 - Regularized Logistic Regression (L2)
[40]: 0.55107 - Naive Bayes combiner classifier
[41]: 0.5 - Majority Class Classifier
Generating predictions from a finished model

So, what do these models think about the likelihood of a recession in the next 3 months? We can make predictions on the latest data to see what they see.

These may not be the predictions you are looking for...

There are two ways to generate predictions in DataRobot, one using modeling workers and one using dedicated prediction workers. In this notebook we will use the former, which is slower, occupies one of your modeling worker slots, and has no real guarantees about latency because the jobs go through the project queue.

Why do we even have this slow prediction mechanism? Because of its limitations it is much easier for anticipate the load that it adds to the system, so we can provide it to everyone in a shared environment.

For the faster, low latency, dedicated prediction solution, we would encourage you to look into an upgraded license of DataRobot, specifically one with dedicated prediction workers.

Three step process

As just mentioned, these predictions go through the modeling queue, so there is a three-step process. The first step is to upload your dataset; the second is to generate the prediction jobs themselves. Finally, you need to retreive your predictions when the job is done.

In this case, we are generating predictions from the top 10 models in the project.

In [10]:
dataset = proj.upload_dataset(filename)

pred_jobs = [models[i].request_predictions(dataset.id) for i in range(10)]
all_preds = [pred_job.get_result_when_complete() for pred_job in pred_jobs]
Bonus Section: Predicting the future

That concludes the “how-to” portion of the notebook. But we won’t just leave you hanging... we’ve gone through all this trouble to try to predict the future. We might as well tell you what we saw.

Get Ready to plot

It will be easier to plot the data if it all shares the same time-based index. Here in this cell we read the modeling data and use its index, then we attach the predictions from each of the models to that dataframe

In [11]:
plot_data = pd.read_csv(filename, index_col=0)

for idx, pred in enumerate(all_preds):
    plot_data['pred_{}'.format(idx)] = pred['positive_probability'].tolist()
Plots!

We start by defining a helper function to plot the predictions together on the same plot.

Here we plot the predictions for every week in the dataset after the year 2000 (the holdout was all the data after the start of 1995).

In [20]:
def plot_date_data(dataframe, column_names):
    x_axis = [datetime.datetime.strptime(x, '%Y-%m-%d')
              for x in dataframe.index]
    import matplotlib.dates as mdates
    years = mdates.YearLocator()
    months = mdates.MonthLocator()
    years_fmt = mdates.DateFormatter('%Y')
    fig, ax = plt.subplots()

    for column_name in column_names:
        data = dataframe[column_name]
        ax.plot(x_axis, data)
    ax.xaxis.set_major_locator(years)
    ax.xaxis.set_major_formatter(years_fmt)
    ax.xaxis.set_minor_locator(months)
    ax.format_xdata = mdates.DateFormatter('%Y-%m-%d')
    ax.grid(True)
    fig.autofmt_xdate()


plot_date_data(plot_data[plot_data.index > '2000-01-01'],
               ['pred_{}'.format(i) for i in range(10)])
_images/examples_financial_data_predict_your_fortunes_24_0.png

The two spikes correspond to the dotcom bubble bursting in early 2001 and the Great Recession.

But... were the models predictive or postdictive?

A closer look at the Great Recession.

Let’s zoom in on 2007 and 2008, when things really went sideways.

In [21]:
plot_date_data(plot_data[(plot_data.index > '2007-01-01') &
                         (plot_data.index < '2009-01-01')],
               ['pred_{}'.format(i) for i in range(10)])
_images/examples_financial_data_predict_your_fortunes_26_0.png

Some of these models were picking up on some signal in the early months of 2008, shortly before stocks went for a dive. But then again, they flatlined before the real tumult happened, so take it with a grain of salt.

But what about now? Are we headed for a recession?
In [22]:
plot_date_data(plot_data[plot_data.index > '2011-01-01'],
               ['pred_{}'.format(i) for i in range(10)])
_images/examples_financial_data_predict_your_fortunes_28_0.png

Nope. (As of 7/1/2017)

What can we say about these models?

It would seem like we used a lot of information in building and evaluation these models. It does include more than 3000 weeks of data. But how much information is in this data?

For this specific problem, we know that the state of the economy does not jump around with great velocity. So we don’t really have 3000 independent observations, because the observations in one week convey a lot of information about the values of the nearby weeks. So what information do we actually have?

In this case, while we had many weeks in which there were observed recessions in the economy, we are actually only looking at the event of entering (or exiting) a recession, which is limited by the total number of recessions. In this case that number was only 11; 6 were used in training, 3 in validation, and 2 in the holdout. That’s not a lot of information to train on.

In [23]:
plot_date_data(plot_data, ['pred_{}'.format(i) for i in range(10)])
_images/examples_financial_data_predict_your_fortunes_31_0.png

Advanced Model Insights

Preparation

This notebook explores additional options for model insights added in the v2.7 release of the DataRobot API.

Let’s start with importing some packages that will help us with presentation (if you don’t have them installed already, you’ll have to install them to run this notebook).

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import pandas as pd
import datarobot as dr
import numpy as np
from datarobot.enums import AUTOPILOT_MODE
from datarobot.errors import ClientError

Set Up

Now configure your DataRobot client (unless you’re using a configuration file)...

In [2]:
dr.Client(token='<API TOKEN>', endpoint='https://<YOUR ENDPOINT>/api/v2/')
Out[2]:
<datarobot.rest.RESTClientObject at 0x10bc01e50>

Create Project with features

Create a new project using the 10K_diabetes dataset. This dataset contains a binary classification on the target readmitted. This project is an excellent example of the advanced model insights available from DataRobot models.

In [3]:
url = 'https://s3.amazonaws.com/datarobot_public_datasets/10k_diabetes.xlsx'
project = dr.Project.create(url, project_name='10K Advanced Modeling')
print('Project ID: {}'.format(project.id))
Project ID: 598dec4bc8089177139da4ad
In [4]:
# Increase the worker count to make the project go faster.
project.set_worker_count(8)
Out[4]:
Project(10K Advanced Modeling)
In [5]:
project.set_target('readmitted', mode=AUTOPILOT_MODE.QUICK)
Out[5]:
Project(10K Advanced Modeling)
In [6]:
project.wait_for_autopilot()
In progress: 2, queued: 0 (waited: 0s)
In progress: 2, queued: 0 (waited: 1s)
In progress: 2, queued: 0 (waited: 2s)
In progress: 2, queued: 0 (waited: 2s)
In progress: 2, queued: 0 (waited: 4s)
In progress: 2, queued: 0 (waited: 6s)
In progress: 2, queued: 0 (waited: 9s)
In progress: 2, queued: 0 (waited: 16s)
In progress: 2, queued: 0 (waited: 29s)
In progress: 1, queued: 0 (waited: 50s)
In progress: 1, queued: 0 (waited: 71s)
In progress: 1, queued: 0 (waited: 91s)
In progress: 1, queued: 0 (waited: 111s)
In progress: 1, queued: 0 (waited: 132s)
In progress: 1, queued: 0 (waited: 152s)
In progress: 1, queued: 0 (waited: 172s)
In progress: 1, queued: 0 (waited: 193s)
In progress: 1, queued: 0 (waited: 213s)
In progress: 1, queued: 0 (waited: 233s)
In progress: 1, queued: 0 (waited: 254s)
In progress: 1, queued: 0 (waited: 274s)
In progress: 0, queued: 1 (waited: 295s)
In progress: 1, queued: 0 (waited: 315s)
In progress: 0, queued: 0 (waited: 335s)
In progress: 0, queued: 0 (waited: 356s)
In [7]:
models = project.get_models()
model = models[0]
model
Out[7]:
Model(u'AVG Blender')

Let’s set some color constants to replicate visual style of DataRobot lift chart.

In [8]:
dr_dark_blue = '#08233F'
dr_blue = '#1F77B4'
dr_orange = '#FF7F0E'
dr_red = '#BE3C28'

Feature Impact

Feature Impact is available for all model types and works by altering input data and observing the effect on a model’s score. It is an on-demand feature, meaning that you must initiate a calculation to see the results. Once you have had DataRobot compute the feature impact for a model, that information is saved with the project.

Feature Impact measures how important a feature is in the context of a model. That is, it measures how much the accuracy of a model would decrease if that feature were removed.

In [9]:
try:
    # Check first if they've already been computed
    feature_impacts = model.get_feature_impact()
except dr.errors.ClientError as e:
    # Status code of 404 means the feature impact hasn't been computed yet
    assert e.status_code == 404
    impact_job = model.request_feature_impact()
    # We must wait for the async job to finish; 4 minutes should be plenty
    feature_impacts = impact_job.get_result_when_complete(4 * 60)
In [10]:
# Formats the ticks from a float into a percent
percent_tick_fmt = mtick.PercentFormatter(xmax=1.0)

impact_df = pd.DataFrame(feature_impacts)
impact_df.sort_values(by='impactNormalized', ascending=True, inplace=True)

# Positive values are blue, negative are red
bar_colors = impact_df.impactNormalized.apply(lambda x: dr_red if x < 0
                                              else dr_blue)

ax = impact_df.plot.barh(x='featureName', y='impactNormalized',
                         legend=False,
                         color=bar_colors,
                         figsize=(10, 14))
ax.xaxis.set_major_formatter(percent_tick_fmt)
ax.xaxis.set_tick_params(labeltop=True)
ax.xaxis.grid(True, alpha=0.2)
ax.set_facecolor(dr_dark_blue)

plt.ylabel('')
plt.xlabel('Effect')
plt.xlim((None, 1))  # Allow for negative impact
plt.title('Feature Impact', y=1.04)
Out[10]:
Text(0.5,1.04,u'Feature Impact')
_images/examples_advanced_model_insights_Advanced_Model_Insights_14_1.png

Lift Chart

A lift chart will show you how close in general model predictions are to the actual target values in the training data.

The lift chart data we retrieve from the server includes the average model prediction and the average actual target values, sorted by the prediction values in ascending order and split into up to 60 bins.

bin_weight parameter shows how much weight is in each bin (number of rows for unweighted projects).

In [11]:
lc = model.get_lift_chart('validation')
lc
Out[11]:
LiftChart(validation)
In [12]:
bins_df = pd.DataFrame(lc.bins)
bins_df.head()
Out[12]:
actual bin_weight predicted
0 0.037037 27.0 0.088575
1 0.111111 27.0 0.131661
2 0.192308 26.0 0.153389
3 0.222222 27.0 0.167035
4 0.111111 27.0 0.179245

Let’s define our rebinning and plotting functions.

In [13]:
def rebin_df(raw_df, number_of_bins):
    cols = ['bin', 'actual_mean', 'predicted_mean', 'bin_weight']
    new_df = pd.DataFrame(columns=cols)
    current_prediction_total = 0
    current_actual_total = 0
    current_row_total = 0
    x_index = 1
    bin_size = 60 / number_of_bins
    for rowId, data in raw_df.iterrows():
        current_prediction_total += data['predicted'] * data['bin_weight']
        current_actual_total += data['actual'] * data['bin_weight']
        current_row_total += data['bin_weight']

        if ((rowId + 1) % bin_size == 0):
            x_index += 1
            bin_properties = {
                'bin': ((round(rowId + 1) / 60) * number_of_bins),
                'actual_mean': current_actual_total / current_row_total,
                'predicted_mean': current_prediction_total / current_row_total,
                'bin_weight': current_row_total
            }

            new_df = new_df.append(bin_properties, ignore_index=True)
            current_prediction_total = 0
            current_actual_total = 0
            current_row_total = 0
    return new_df


def matplotlib_lift(bins_df, bin_count, ax):
    grouped = rebin_df(bins_df, bin_count)
    ax.plot(range(1, len(grouped) + 1), grouped['predicted_mean'],
            marker='+', lw=1, color=dr_blue)
    ax.plot(range(1, len(grouped) + 1), grouped['actual_mean'],
            marker='*', lw=1, color=dr_orange)
    ax.set_xlim([0, len(grouped) + 1])
    ax.set_facecolor(dr_dark_blue)
    ax.legend(loc='best')
    ax.set_title('Lift chart {} bins'.format(bin_count))
    ax.set_xlabel('Sorted Prediction')
    ax.set_ylabel('Value')
    return grouped

Now we can show all lift charts we propose in DataRobot web application.

Note 1 : While this method will work for any bin count less then 60 - the most reliable result will be achieved when the number of bins is a divisor of 60.

Note 2 : This visualization method will NOT work for bin count > 60 because DataRobot does not provide enough information for a larger resolution.

In [14]:
bin_counts = [10, 12, 15, 20, 30, 60]
f, axarr = plt.subplots(len(bin_counts))
f.set_size_inches((8, 4 * len(bin_counts)))

rebinned_dfs = []
for i in range(len(bin_counts)):
    rebinned_dfs.append(matplotlib_lift(bins_df, bin_counts[i], axarr[i]))
plt.tight_layout()
_images/examples_advanced_model_insights_Advanced_Model_Insights_21_0.png

Rebinned Data

You may want to interact with the raw re-binned data for use in third party tools, or for additional evaluation.

In [15]:
for rebinned in rebinned_dfs:
    print('Number of bins: {}'.format(len(rebinned.index)))
    print(rebinned)
Number of bins: 10
    bin  actual_mean  predicted_mean  bin_weight
0   1.0      0.13125        0.151517       160.0
1   2.0      0.20000        0.225520       160.0
2   3.0      0.23125        0.272101       160.0
3   4.0      0.31250        0.310227       160.0
4   5.0      0.40000        0.350982       160.0
5   6.0      0.40000        0.395550       160.0
6   7.0      0.43750        0.441662       160.0
7   8.0      0.55625        0.494121       160.0
8   9.0      0.60625        0.561798       160.0
9  10.0      0.69375        0.710759       160.0
Number of bins: 12
     bin  actual_mean  predicted_mean  bin_weight
0    1.0     0.134328        0.143911       134.0
1    2.0     0.180451        0.211710       133.0
2    3.0     0.225564        0.253760       133.0
3    4.0     0.276119        0.289034       134.0
4    5.0     0.308271        0.320351       133.0
5    6.0     0.406015        0.354336       133.0
6    7.0     0.406015        0.391651       133.0
7    8.0     0.395522        0.430018       134.0
8    9.0     0.518797        0.470626       133.0
9   10.0     0.639098        0.519144       133.0
10  11.0     0.586466        0.583965       133.0
11  12.0     0.686567        0.728384       134.0
Number of bins: 15
     bin  actual_mean  predicted_mean  bin_weight
0    1.0     0.140187        0.134995       107.0
1    2.0     0.149533        0.195819       107.0
2    3.0     0.207547        0.235178       106.0
3    4.0     0.242991        0.264718       107.0
4    5.0     0.280374        0.292256       107.0
5    6.0     0.292453        0.316757       106.0
6    7.0     0.373832        0.344156       107.0
7    8.0     0.452830        0.372372       106.0
8    9.0     0.373832        0.403261       107.0
9   10.0     0.401869        0.433869       107.0
10  11.0     0.528302        0.465610       106.0
11  12.0     0.560748        0.504174       107.0
12  13.0     0.603774        0.547079       106.0
13  14.0     0.635514        0.612989       107.0
14  15.0     0.710280        0.747934       107.0
Number of bins: 20
     bin  actual_mean  predicted_mean  bin_weight
0    1.0       0.1125        0.124181        80.0
1    2.0       0.1500        0.178852        80.0
2    3.0       0.1875        0.211547        80.0
3    4.0       0.2125        0.239493        80.0
4    5.0       0.2375        0.260820        80.0
5    6.0       0.2250        0.283381        80.0
6    7.0       0.3375        0.300590        80.0
7    8.0       0.2875        0.319864        80.0
8    9.0       0.3750        0.340949        80.0
9   10.0       0.4250        0.361015        80.0
10  11.0       0.4000        0.383998        80.0
11  12.0       0.4000        0.407102        80.0
12  13.0       0.4125        0.429924        80.0
13  14.0       0.4625        0.453401        80.0
14  15.0       0.5250        0.479391        80.0
15  16.0       0.5875        0.508850        80.0
16  17.0       0.6125        0.541193        80.0
17  18.0       0.6000        0.582403        80.0
18  19.0       0.6750        0.649406        80.0
19  20.0       0.7125        0.772112        80.0
Number of bins: 30
     bin  actual_mean  predicted_mean  bin_weight
0    1.0     0.074074        0.110118        54.0
1    2.0     0.207547        0.160341        53.0
2    3.0     0.113208        0.184872        53.0
3    4.0     0.185185        0.206563        54.0
4    5.0     0.207547        0.227254        53.0
5    6.0     0.207547        0.243102        53.0
6    7.0     0.240741        0.257413        54.0
7    8.0     0.245283        0.272161        53.0
8    9.0     0.207547        0.287006        53.0
9   10.0     0.351852        0.297408        54.0
10  11.0     0.301887        0.310547        53.0
11  12.0     0.283019        0.322968        53.0
12  13.0     0.396226        0.337750        53.0
13  14.0     0.351852        0.350444        54.0
14  15.0     0.452830        0.364761        53.0
15  16.0     0.452830        0.379984        53.0
16  17.0     0.351852        0.395395        54.0
17  18.0     0.396226        0.411274        53.0
18  19.0     0.358491        0.425801        53.0
19  20.0     0.444444        0.441788        54.0
20  21.0     0.509434        0.457396        53.0
21  22.0     0.547170        0.473825        53.0
22  23.0     0.490566        0.494573        53.0
23  24.0     0.629630        0.513596        54.0
24  25.0     0.716981        0.534683        53.0
25  26.0     0.490566        0.559476        53.0
26  27.0     0.611111        0.590690        54.0
27  28.0     0.660377        0.635708        53.0
28  29.0     0.622642        0.695099        53.0
29  30.0     0.796296        0.799789        54.0
Number of bins: 60
     bin  actual_mean  predicted_mean  bin_weight
0    1.0     0.037037        0.088575        27.0
1    2.0     0.111111        0.131661        27.0
2    3.0     0.192308        0.153389        26.0
3    4.0     0.222222        0.167035        27.0
4    5.0     0.111111        0.179245        27.0
5    6.0     0.115385        0.190716        26.0
6    7.0     0.185185        0.201566        27.0
7    8.0     0.185185        0.211559        27.0
8    9.0     0.192308        0.221900        26.0
9   10.0     0.222222        0.232409        27.0
10  11.0     0.074074        0.239081        27.0
11  12.0     0.346154        0.247278        26.0
12  13.0     0.222222        0.253636        27.0
13  14.0     0.259259        0.261190        27.0
14  15.0     0.230769        0.267897        26.0
15  16.0     0.259259        0.276266        27.0
16  17.0     0.185185        0.283961        27.0
17  18.0     0.230769        0.290167        26.0
18  19.0     0.296296        0.294495        27.0
19  20.0     0.407407        0.300322        27.0
20  21.0     0.307692        0.307198        26.0
21  22.0     0.296296        0.313772        27.0
22  23.0     0.269231        0.319444        26.0
23  24.0     0.296296        0.326361        27.0
24  25.0     0.370370        0.334460        27.0
25  26.0     0.423077        0.341167        26.0
26  27.0     0.333333        0.347227        27.0
27  28.0     0.370370        0.353661        27.0
28  29.0     0.423077        0.361275        26.0
29  30.0     0.481481        0.368118        27.0
30  31.0     0.481481        0.376098        27.0
31  32.0     0.423077        0.384019        26.0
32  33.0     0.296296        0.391877        27.0
33  34.0     0.407407        0.398914        27.0
34  35.0     0.423077        0.407656        26.0
35  36.0     0.370370        0.414758        27.0
36  37.0     0.259259        0.421825        27.0
37  38.0     0.461538        0.429930        26.0
38  39.0     0.518519        0.438017        27.0
39  40.0     0.370370        0.445558        27.0
40  41.0     0.423077        0.453398        26.0
41  42.0     0.592593        0.461246        27.0
42  43.0     0.500000        0.468806        26.0
43  44.0     0.592593        0.478657        27.0
44  45.0     0.481481        0.490318        27.0
45  46.0     0.500000        0.498991        26.0
46  47.0     0.592593        0.507938        27.0
47  48.0     0.666667        0.519255        27.0
48  49.0     0.692308        0.528170        26.0
49  50.0     0.740741        0.540955        27.0
50  51.0     0.407407        0.553971        27.0
51  52.0     0.576923        0.565192        26.0
52  53.0     0.666667        0.582203        27.0
53  54.0     0.555556        0.599178        27.0
54  55.0     0.730769        0.619919        26.0
55  56.0     0.592593        0.650911        27.0
56  57.0     0.703704        0.676295        27.0
57  58.0     0.538462        0.714627        26.0
58  59.0     0.814815        0.763131        27.0
59  60.0     0.777778        0.836447        27.0

ROC curve

The receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.

To retrieve ROC curve information use the Model.get_roc_curve method.

In [16]:
roc = model.get_roc_curve('validation')
roc
Out[16]:
RocCurve(validation)
In [17]:
df = pd.DataFrame(roc.roc_points)
df.head()
Out[17]:
accuracy f1_score false_negative_score false_positive_rate false_positive_score matthews_correlation_coefficient negative_predictive_value positive_predictive_value threshold true_negative_rate true_negative_score true_positive_rate true_positive_score
0 0.603125 0.000000 635 0.000000 0 0.000000 0.603125 0.000000 1.000000 1.000000 965 0.000000 0
1 0.605000 0.009404 632 0.000000 0 0.053430 0.604258 1.000000 0.925734 1.000000 965 0.004724 3
2 0.605625 0.012520 631 0.000000 0 0.061715 0.604637 1.000000 0.897726 1.000000 965 0.006299 4
3 0.609375 0.031008 625 0.000000 0 0.097764 0.606918 1.000000 0.843124 1.000000 965 0.015748 10
4 0.610000 0.037037 623 0.001036 1 0.097343 0.607435 0.923077 0.812854 0.998964 964 0.018898 12
Threshold operations

You can get the recommended threshold value with maximal F1 score using RocCurve.get_best_f1_threshold method. That is the same threshold that is preselected in DataRobot when you open “ROC curve” tab.

In [18]:
threshold = roc.get_best_f1_threshold()
threshold
Out[18]:
0.3359943414397026

To estimate metrics for different threshold values just pass it to the RocCurve.estimate_threshold method. This will produce the same results as updating threshold on the DataRobot “ROC curve” tab.

In [19]:
metrics = roc.estimate_threshold(threshold)
metrics
Out[19]:
{'accuracy': 0.626875,
 'f1_score': 0.6219126029132362,
 'false_negative_score': 144,
 'false_positive_rate': 0.4694300518134715,
 'false_positive_score': 453,
 'matthews_correlation_coefficient': 0.30220241744619025,
 'negative_predictive_value': 0.7804878048780488,
 'positive_predictive_value': 0.5201271186440678,
 'threshold': 0.3359943414397026,
 'true_negative_rate': 0.5305699481865285,
 'true_negative_score': 512,
 'true_positive_rate': 0.7732283464566929,
 'true_positive_score': 491}
Confusion matrix

Using a few keys from the retrieved metrics we now can build a confusion matrix for the selected threshold.

In [20]:
roc_df = pd.DataFrame({
    'Predicted Negative': [metrics['true_negative_score'],
                           metrics['false_negative_score'],
                           metrics['true_negative_score'] + metrics[
                               'false_negative_score']],
    'Predicted Positive': [metrics['false_positive_score'],
                           metrics['true_positive_score'],
                           metrics['true_positive_score'] + metrics[
                               'false_positive_score']],
    'Total': [len(roc.negative_class_predictions),
              len(roc.positive_class_predictions),
              len(roc.negative_class_predictions) + len(
                  roc.positive_class_predictions)]})
roc_df.index = pd.MultiIndex.from_tuples([
    ('Actual', '-'), ('Actual', '+'), ('Total', '')])
roc_df.columns = pd.MultiIndex.from_tuples([
    ('Predicted', '-'), ('Predicted', '+'), ('Total', '')])
roc_df.style.set_properties(**{'text-align': 'right'})
roc_df
Out[20]:
Predicted Total
- +
Actual - 512 453 962
+ 144 491 638
Total 656 944 1600
ROC curve plot
In [21]:
dr_roc_green = '#03c75f'
white = '#ffffff'
dr_purple = '#65147D'
dr_dense_green = '#018f4f'

fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)

plt.scatter(df.false_positive_rate, df.true_positive_rate, color=dr_roc_green)
plt.plot(df.false_positive_rate, df.true_positive_rate, color=dr_roc_green)
plt.plot([0, 1], [0, 1], color=white, alpha=0.25)
plt.title('ROC curve')
plt.xlabel('False Positive Rate (Fallout)')
plt.xlim([0, 1])
plt.ylabel('True Positive Rate (Sensitivity)')
plt.ylim([0, 1])
Out[21]:
(0, 1)
_images/examples_advanced_model_insights_Advanced_Model_Insights_34_1.png
Prediction distribution plot

There are a few different methods for visualizing it, which one to use depends on what packages you have installed. Below you will find 3 different examples.

Using seaborn

In [22]:
import seaborn as sns
sns.set_style("whitegrid", {'axes.grid': False})

fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)

shared_params = {'shade': True, 'clip': (0, 1), 'bw': 0.2}
sns.kdeplot(np.array(roc.negative_class_predictions),
            color=dr_purple, **shared_params)
sns.kdeplot(np.array(roc.positive_class_predictions),
            color=dr_dense_green, **shared_params)

plt.title('Prediction Distribution')
plt.xlabel('Probability of Event')
plt.xlim([0, 1])
plt.ylabel('Probability Density')
Out[22]:
Text(0,0.5,'Probability Density')
_images/examples_advanced_model_insights_Advanced_Model_Insights_36_1.png

Using SciPy

In [23]:
from scipy.stats import gaussian_kde

fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)
xs = np.linspace(0, 1, 100)

density_neg = gaussian_kde(roc.negative_class_predictions, bw_method=0.2)
plt.plot(xs, density_neg(xs), color=dr_purple)
plt.fill_between(xs, 0, density_neg(xs), color=dr_purple, alpha=0.3)

density_pos = gaussian_kde(roc.positive_class_predictions, bw_method=0.2)
plt.plot(xs, density_pos(xs), color=dr_dense_green)
plt.fill_between(xs, 0, density_pos(xs), color=dr_dense_green, alpha=0.3)

plt.title('Prediction Distribution')
plt.xlabel('Probability of Event')
plt.xlim([0, 1])
plt.ylabel('Probability Density')
Out[23]:
Text(0,0.5,'Probability Density')
_images/examples_advanced_model_insights_Advanced_Model_Insights_38_1.png

Using scikit-learn

This way will be most consistent with how we display this plot in DataRobot, because scikit-learn supports additional kernel options, and we can configure the same kernel as we use in web application (epanichkov kernel with size 0.05).

Other examples above use a gaussian kernel, so they may slightly differ from the plot in the DataRobot interface.

In [24]:
from sklearn.neighbors import KernelDensity

fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)
xs = np.linspace(0, 1, 100)

X_neg = np.asarray(roc.negative_class_predictions)[:, np.newaxis]
density_neg = KernelDensity(bandwidth=0.05, kernel='epanechnikov').fit(X_neg)
plt.plot(xs, np.exp(density_neg.score_samples(xs[:, np.newaxis])),
         color=dr_purple)
plt.fill_between(xs, 0, np.exp(density_neg.score_samples(xs[:, np.newaxis])),
                 color=dr_purple, alpha=0.3)

X_pos = np.asarray(roc.positive_class_predictions)[:, np.newaxis]
density_pos = KernelDensity(bandwidth=0.05, kernel='epanechnikov').fit(X_pos)
plt.plot(xs, np.exp(density_pos.score_samples(xs[:, np.newaxis])),
         color=dr_dense_green)
plt.fill_between(xs, 0, np.exp(density_pos.score_samples(xs[:, np.newaxis])),
                 color=dr_dense_green, alpha=0.3)

plt.title('Prediction Distribution')
plt.xlabel('Probability of Event')
plt.xlim([0, 1])
plt.ylabel('Probability Density')
Out[24]:
Text(0,0.5,'Probability Density')
_images/examples_advanced_model_insights_Advanced_Model_Insights_40_1.png

Word Cloud

Word cloud is a type of insight available for some text-processing models for datasets containing text columns. You can get information about how the appearance of each ngram (word or sequence of words) in the text field affects the predicted target value.

This example will show you how to obtain word cloud data and visualize it in similar to DataRobot web application way.

The visualization example here uses colour and wordcloud packages, so if you don’t have them, you will need to install them.

First, we will create a color palette similar to what we use in DataRobot.

In [25]:
from colour import Color
import wordcloud
In [26]:
colors = [Color('#2458EB')]
colors.extend(list(Color('#2458EB').range_to(Color('#31E7FE'), 81))[1:])
colors.extend(list(Color('#31E7FE').range_to(Color('#8da0a2'), 21))[1:])
colors.extend(list(Color('#a18f8c').range_to(Color('#ffad9e'), 21))[1:])
colors.extend(list(Color('#ffad9e').range_to(Color('#d80909'), 81))[1:])
webcolors = [c.get_web() for c in colors]

Variable webcolors now contains 201 ([-1, 1] interval with step 0.01) colors that will be used in the word cloud. Let’s look at our palette.

In [27]:
from matplotlib.colors import LinearSegmentedColormap
dr_cmap = LinearSegmentedColormap.from_list('DataRobot',
                                            webcolors,
                                            N=len(colors))
x = np.arange(-1, 1.01, 0.01)
y = np.arange(0, 40, 1)
X = np.meshgrid(x, y)[0]
plt.xticks([0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200],
           ['-1', '-0.8', '-0.6', '-0.4', '-0.2', '0',
            '0.2', '0.4', '0.6', '0.8', '1'])
plt.yticks([], [])
im = plt.imshow(X, interpolation='nearest', origin='lower', cmap=dr_cmap)
_images/examples_advanced_model_insights_Advanced_Model_Insights_45_0.png

Now we will pick some model that provides a word cloud in the DataRobot. Any “Auto-Tuned Word N-Gram Text Modeler” should work.

In [28]:
models = project.get_models()
In [29]:
model_with_word_cloud = None
for model in models:
    try:
        model.get_word_cloud()
        model_with_word_cloud = model
        break
    except ClientError as e:
        if 'No word cloud data found for model' in e:
            pass

model_with_word_cloud
Out[29]:
Model(u'Auto-Tuned Word N-Gram Text Modeler using token occurrences - diag_1_desc')
In [30]:
wc = model_with_word_cloud.get_word_cloud(exclude_stop_words=True)
In [31]:
def word_cloud_plot(wc, font_path=None):
    # Stopwords usually dominate any word cloud, so we will filter them out
    dict_freq = {wc_word['ngram']: wc_word['frequency']
                 for wc_word in wc.ngrams
                 if not wc_word['is_stopword']}
    dict_coef = {wc_word['ngram']: wc_word['coefficient']
                 for wc_word in wc.ngrams}

    def color_func(*args, **kwargs):
        word = args[0]
        palette_index = int(round(dict_coef[word] * 100)) + 100
        r, g, b = colors[palette_index].get_rgb()
        return 'rgb({:.0f}, {:.0f}, {:.0f})'.format(int(r * 255),
                                                    int(g * 255),
                                                    int(b * 255))

    wc_image = wordcloud.WordCloud(stopwords=set(),
                                   width=1024, height=1024,
                                   relative_scaling=0.5,
                                   prefer_horizontal=1,
                                   color_func=color_func,
                                   background_color=(0, 10, 29),
                                   font_path=font_path).fit_words(dict_freq)
    plt.imshow(wc_image, interpolation='bilinear')
    plt.axis('off')
In [32]:
word_cloud_plot(wc)
_images/examples_advanced_model_insights_Advanced_Model_Insights_51_0.png

You can use the word cloud to get information about most frequent and most important (highest absolute coefficient value) ngrams in your text.

In [33]:
wc.most_frequent(5)
Out[33]:
[{'coefficient': 0.622977418480506,
  'count': 534,
  'frequency': 0.21876280213027446,
  'is_stopword': False,
  'ngram': u'failure'},
 {'coefficient': 0.5680375262833832,
  'count': 524,
  'frequency': 0.21466612044244163,
  'is_stopword': False,
  'ngram': u'atherosclerosis'},
 {'coefficient': 0.5163937133054939,
  'count': 520,
  'frequency': 0.21302744776730848,
  'is_stopword': False,
  'ngram': u'atherosclerosis of'},
 {'coefficient': 0.3793240551174481,
  'count': 505,
  'frequency': 0.2068824252355592,
  'is_stopword': False,
  'ngram': u'infarction'},
 {'coefficient': 0.46897343056956153,
  'count': 453,
  'frequency': 0.18557968045882836,
  'is_stopword': False,
  'ngram': u'heart'}]
In [34]:
wc.most_important(5)
Out[34]:
[{'coefficient': -0.8759179138969192,
  'count': 38,
  'frequency': 0.015567390413764851,
  'is_stopword': False,
  'ngram': u'obesity unspecified'},
 {'coefficient': -0.8655105382141891,
  'count': 38,
  'frequency': 0.015567390413764851,
  'is_stopword': False,
  'ngram': u'obesity'},
 {'coefficient': 0.8329465952065772,
  'count': 9,
  'frequency': 0.0036870135190495697,
  'is_stopword': False,
  'ngram': u'nephroptosis'},
 {'coefficient': -0.8198621557218905,
  'count': 45,
  'frequency': 0.01843506759524785,
  'is_stopword': False,
  'ngram': u'of kidney'},
 {'coefficient': 0.7444542252245915,
  'count': 452,
  'frequency': 0.18517001229004507,
  'is_stopword': False,
  'ngram': u'heart failure'}]

Non-ASCII Texts

Word cloud has full Unicode support but if you want to visualize it using the recipe from this notebook - you should use the font_path parameter that leads to font supporting symbols used in your text. For example for Japanese text in the model below you should use one of the CJK fonts.

In [35]:
jp_project = dr.Project.create('jp_10k.csv', project_name='Japanese 10K')

print('Project ID: {}'.format(project.id))
Project ID: 598dec4bc8089177139da4ad
In [36]:
jp_project.set_target('readmitted_再入院', mode=AUTOPILOT_MODE.QUICK)
jp_project.wait_for_autopilot()
In progress: 10, queued: 3 (waited: 0s)
In progress: 10, queued: 3 (waited: 0s)
In progress: 10, queued: 3 (waited: 1s)
In progress: 10, queued: 3 (waited: 2s)
In progress: 10, queued: 3 (waited: 3s)
In progress: 10, queued: 3 (waited: 5s)
In progress: 10, queued: 3 (waited: 8s)
In progress: 10, queued: 1 (waited: 15s)
In progress: 6, queued: 0 (waited: 28s)
In progress: 1, queued: 0 (waited: 49s)
In progress: 0, queued: 0 (waited: 69s)
In progress: 8, queued: 0 (waited: 90s)
In progress: 5, queued: 0 (waited: 110s)
In progress: 1, queued: 0 (waited: 130s)
In progress: 0, queued: 14 (waited: 151s)
In progress: 10, queued: 6 (waited: 171s)
In progress: 10, queued: 2 (waited: 191s)
In progress: 8, queued: 0 (waited: 212s)
In progress: 2, queued: 0 (waited: 232s)
In progress: 2, queued: 0 (waited: 253s)
In progress: 1, queued: 0 (waited: 273s)
In progress: 1, queued: 0 (waited: 293s)
In progress: 0, queued: 0 (waited: 314s)
In [37]:
jp_models = jp_project.get_models()
jp_model_with_word_cloud = None

for model in jp_models:
    try:
        model.get_word_cloud()
        jp_model_with_word_cloud = model
        break
    except ClientError as e:
        if 'No word cloud data found for model' in e:
            pass

jp_model_with_word_cloud
Out[37]:
Model(u'Auto-Tuned Word N-Gram Text Modeler using token occurrences and tfidf - diag_1_desc_\u8a3a\u65ad1\u8aac\u660e')
In [38]:
jp_wc = jp_model_with_word_cloud.get_word_cloud(exclude_stop_words=True)
In [39]:
word_cloud_plot(jp_wc, font_path='CJK.ttf')
_images/examples_advanced_model_insights_Advanced_Model_Insights_60_0.png

Changelog

2.11.2

Enhancements

  • Python 3.7 is now suported

Bugfixes

Deprecation Summary

  • The Model Deployment interface has been deprecated and will be removed in 2.13, in order to allow the interface to mature. The raw API will continue to be available as a “beta” API without full backwards compatibility support.

2.11.1

New Features

  • Time series projects now support multiseries as well as single series data. See the multiseries section in the Time Series Projects documentation for more detail.

2.11.0

New Features

  • The new ModelRecommendation class can be used to retrieve the recommended models for a project.
  • A new helper method cross_validate was added to class Model. This method can be used to request Model’s Cross Validation score.
  • Training a model with monotonic constraints is now supported. Training with monotonic constraints allows users to force models to learn monotonic relationships with respect to some features and the target. This helps users create accurate models that comply with regulations (e.g. insurance, banking). Currently, only certain blueprints (e.g. xgboost) support this feature, and it is only supported for regression and binary classification projects.
  • DataRobot now supports “Database Connectivity”, allowing databases to be used as the source of data for projects and prediction datasets. The feature works on top of the JDBC standard, so a variety of databases conforming to that standard are available; a list of databases with tested support for DataRobot is available in the user guide in the web application. See Database Connectivity for details.
  • Added a new feature to retrieve feature logs for time series projects. Check datarobot.DatetimePartitioning.feature_log_list() and datarobot.DatetimePartitioning.feature_log_retrieve() for details.

API Changes

Configuration Changes

  • Retry settings compatible with those offered by urllib3’s Retry interface can now be configured. By default, we will now retry connection errors that prevented requests from arriving at the server.

Documentation Changes

  • “Advanced Model Insights” example has been updated to properly handle bin weights when rebinning.

2.9.0

New Features

  • New ModelDeployment class can be used to track status and health of models deployed for predictions.

Enhancements

  • DataRobot API now supports creating 3 new blender types - Random Forest, TensorFlow, LightGBM.
  • Multiclass projects now support blenders creation for 3 new blender types as well as Average and ENET blenders.
  • Models can be trained by requesting a particular row count using the new training_row_count argument with Project.train, Model.train and Model.request_frozen_model in non-datetime partitioned projects, as an alternative to the previous option of specifying a desired percentage of the project dataset. Specifying model size by row count is recommended when the float precision of sample_pct could be problematic, e.g. when training on a small percentage of the dataset or when training up to partition boundaries.
  • New attributes max_train_rows, scaleout_max_train_pct, and scaleout_max_train_rows have been added to Project. max_train_rows specified the equivalent value to the existing max_train_pct as a row count. The scaleout fields can be used to see how far scaleout models can be trained on projects, which for projects taking advantage of scalable ingest may exceed the limits on the data available to non-scaleout blueprints.
  • Individual features can now be marked as a priori or not a priori using the new feature_settings attribute when setting the target or specifying datetime partitioning settings on time series projects. Any features not specified in the feature_settings parameter will be assigned according to the default_to_a_priori value.
  • Three new options have been made available in the datarobot.DatetimePartitioningSpecification class to fine-tune how time-series projects derive modeling features. treat_as_exponential can control whether data is analyzed as an exponential trend and transformations like log-transform are applied. differencing_method can control which differencing method to use for stationary data. periodicities can be used to specify periodicities occuring within the data. All are optional and defaults will be chosen automatically if they are unspecified.

API Changes

  • Now training_row_count is available on non-datetime models as well as “rowCount” based datetime models. It reports the number of rows used to train the model (equivalent to sample_pct).
  • Features retrieved from Feature.get now include target_leakage.

2.8.1

Bugfixes

  • The documented default connect_timeout will now be correctly set for all configuration mechanisms, so that requests that fail to reach the DataRobot server in a reasonable amount of time will now error instead of hanging indefinitely. If you observe that you have started seeing ConnectTimeout errors, please configure your connect_timeout to a larger value.
  • Version of trafaret library this package depends on is now pinned to trafaret>=0.7,<1.1 since versions outside that range are known to be incompatible.

2.8.0

New Features

  • The DataRobot API supports the creation, training, and predicting of multiclass classification projects. DataRobot, by default, handles a dataset with a numeric target column as regression. If your data has a numeric cardinality of fewer than 11 classes, you can override this behavior to instead create a multiclass classification project from the data. To do so, use the set_target function, setting target_type=’Multiclass’. If DataRobot recognizes your data as categorical, and it has fewer than 11 classes, using multiclass will create a project that classifies which label the data belongs to.
  • The DataRobot API now includes Rating Tables. A rating table is an exportable csv representation of a model. Users can influence predictions by modifying them and creating a new model with the modified table. See the documentation for more information on how to use rating tables.
  • scaleout_modeling_mode has been added to the AdvancedOptions class used when setting a project target. It can be used to control whether scaleout models appear in the autopilot and/or available blueprints. Scaleout models are only supported in the Hadoop enviroment with the corresponding user permission set.
  • A new premium add-on product, Time Series, is now available. New projects can be created as time series projects which automatically derive features from past data and forecast the future. See the time series documentation for more information.
  • The Feature object now returns the EDA summary statistics (i.e., mean, median, minum, maximum, and standard deviation) for features where this is available (e.g., numeric, date, time, currency, and length features). These summary statistics will be formatted in the same format as the data it summarizes.
  • The DataRobot API now supports Training Predictions workflow. Training predictions are made by a model for a subset of data from original dataset. User can start a job which will make those predictions and retrieve them. See the documentation for more information on how to use training predictions.
  • DataRobot now supports retrieving a model blueprint chart and a model blueprint docs.
  • With the introduction of Multiclass Classification projects, DataRobot needed a better way to explain the performance of a multiclass model so we created a new Confusion Chart. The API now supports retrieving and interacting with confusion charts.

Enhancements

  • DatetimePartitioningSpecification now includes the optional disable_holdout flag that can be used to disable the holdout fold when creating a project with datetime partitioning.
  • When retrieving reason codes on a project using an exposure column, predictions that are adjusted for exposure can be retrieved.
  • File URIs can now be used as sourcedata when creating a project or uploading a prediction dataset. The file URI must refer to an allowed location on the server, which is configured as described in the user guide documentation.
  • The advanced options available when setting the target have been extended to include the new parameter ‘events_count’ as a part of the AdvancedOptions object to allow specifying the events count column. See the user guide documentation in the webapp for more information on events count.
  • PredictJob.get_predictions now returns predicted probability for each class in the dataframe.
  • PredictJob.get_predictions now accepts prefix parameter to prefix the classes name returned in the predictions dataframe.

API Changes

  • Add target_type parameter to set_target() and start(), used to override the project default.

2.7.2

Documentation Changes

  • Updated link to the publicly hosted documentation.

2.7.1

Documentation Changes

  • Online documentation hosting has migrated from PythonHosted to Read The Docs. Minor code changes have been made to support this.

2.7.0

New Features

  • Lift chart data for models can be retrieved using the Model.get_lift_chart and Model.get_all_lift_charts methods.
  • ROC curve data for models in classification projects can be retrieved using the Model.get_roc_curve and Model.get_all_roc_curves methods.
  • Semi-automatic autopilot mode is removed.
  • Word cloud data for text processing models can be retrieved using Model.get_word_cloud method.
  • Scoring code JAR file can be downloaded for models supporting code generation.

Enhancements

  • A __repr__ method has been added to the PredictionDataset class to improve readability when using the client interactively.
  • Model.get_parameters now includes an additional key in the derived features it includes, showing the coefficients for individual stages of multistage models (e.g. Frequency-Severity models).
  • When training a DatetimeModel on a window of data, a time_window_sample_pct can be specified to take a uniform random sample of the training data instead of using all data within the window.
  • Installing of DataRobot package now has an “Extra Requirements” section that will install all of the dependencies needed to run the example notebooks.

Documentation Changes

  • A new example notebook describing how to visualize some of the newly available model insights including lift charts, ROC curves, and word clouds has been added to the examples section.
  • A new section for Common Issues has been added to Getting Started to help debug issues related to client installation and usage.

2.6.1

Bugfixes

  • Fixed a bug with Model.get_parameters raising an exception on some valid parameter values.

Documentation Changes

  • Fixed sorting order in Feature Impact example code snippet.

2.6.0

New Features

  • A new partitioning method (datetime partitioning) has been added. The recommended workflow is to preview the partitioning by creating a DatetimePartitioningSpecification and passing it into DatetimePartitioning.generate, inspect the results and adjust as needed for the specific project dataset by adjusting the DatetimePartitioningSpecification and re-generating, and then set the target by passing the final DatetimePartitioningSpecification object to the partitioning_method parameter of Project.set_target.
  • When interacting with datetime partitioned projects, DatetimeModel can be used to access more information specific to models in datetime partitioned projects. See the documentation for more information on differences in the modeling workflow for datetime partitioned projects.
  • The advanced options available when setting the target have been extended to include the new parameters ‘offset’ and ‘exposure’ (part of the AdvancedOptions object) to allow specifying offset and exposure columns to apply to predictions generated by models within the project. See the user guide documentation in the webapp for more information on offset and exposure columns.
  • Blueprints can now be retrieved directly by project_id and blueprint_id via Blueprint.get.
  • Blueprint charts can now be retrieved directly by project_id and blueprint_id via BlueprintChart.get. If you already have an instance of Blueprint you can retrieve its chart using Blueprint.get_chart.
  • Model parameters can now be retrieved using ModelParameters.get. If you already have an instance of Model you can retrieve its parameters using Model.get_parameters.
  • Blueprint documentation can now be retrieved using Blueprint.get_documents. It will contain information about the task, its parameters and (when available) links and references to additional sources.
  • The DataRobot API now includes Reason Codes. You can now compute reason codes for prediction datasets. You are able to specify thresholds on which rows to compute reason codes for to speed up computation by skipping rows based on the predictions they generate. See the reason codes documentation for more information.

Enhancements

  • A new parameter has been added to the AdvancedOptions used with Project.set_target. By specifying accuracyOptimizedMb=True when creating AdvancedOptions, longer-running models that may have a high accuracy will be included in the autopilot and made available to run manually.
  • A new option for Project.create_type_transform_feature has been added which explicitly truncates data when casting numerical data as categorical data.
  • Added 2 new blenders for projects that use MAD or Weighted MAD as a metric. The MAE blender uses BFGS optimization to find linear weights for the blender that minimize mean absolute error (compared to the GLM blender, which finds linear weights that minimize RMSE), and the MAEL1 blender uses BFGS optimization to find linear weights that minimize MAE + a L1 penalty on the coefficients (compared to the ENET blender, which minimizes RMSE + a combination of the L1 and L2 penalty on the coefficients).

Bugfixes

  • Fixed a bug (affecting Python 2 only) with printing any model (including frozen and prime models) whose model_type is not ascii.
  • FrozenModels were unable to correctly use methods inherited from Model. This has been fixed.
  • When calling get_result for a Job, ModelJob, or PredictJob that has errored, AsyncProcessUnsuccessfulError will now be raised instead of JobNotFinished, consistently with the behaviour of get_result_when_complete.

Deprecation Summary

  • Support for the experimental Recommender Problems projects has been removed. Any code relying on RecommenderSettings or the recommender_settings argument of Project.set_target and Project.start will error.
  • Project.update, deprecated in v2.2.32, has been removed in favor of specific updates: rename, unlock_holdout, set_worker_count.

Documentation Changes

  • The link to Configuration from the Quickstart page has been fixed.

2.5.1

Bugfixes

  • Fixed a bug (affecting Python 2 only) with printing blueprints whose names are not ascii.
  • Fixed an issue where the weights column (for weighted projects) did not appear in the advanced_options of a Project.

2.5.0

New Features

  • Methods to work with blender models have been added. Use Project.blend method to create new blenders, Project.get_blenders to get the list of existing blenders and BlenderModel.get to retrieve a model with blender-specific information.
  • Projects created via the API can now use smart downsampling when setting the target by passing smart_downsampled and majority_downsampling_rate into the AdvancedOptions object used with Project.set_target. The smart sampling options used with an existing project will be available as part of Project.advanced_options.
  • Support for frozen models, which use tuning parameters from a parent model for more efficient training, has been added. Use Model.request_frozen_model to create a new frozen model, Project.get_frozen_models to get the list of existing frozen models and FrozenModel.get to retrieve a particular frozen model.

Enhancements

  • The inferred date format (e.g. “%Y-%m-%d %H:%M:%S”) is now included in the Feature object. For non-date features, it will be None.
  • When specifying the API endpoint in the configuration, the client will now behave correctly for endpoints with and without trailing slashes.

2.4.0

New Features

  • The premium add-on product DataRobot Prime has been added. You can now approximate a model on the leaderboard and download executable code for it. See documentation for further details, or talk to your account representative if the feature is not available on your account.
  • (Only relevant for on-premise users with a Standalone Scoring cluster.) Methods (request_transferable_export and download_export) have been added to the Model class for exporting models (which will only work if model export is turned on). There is a new class ImportedModel for managing imported models on a Standalone Scoring cluster.
  • It is now possible to create projects from a WebHDFS, PostgreSQL, Oracle or MySQL data source. For more information see the documentation for the relevant Project classmethods: create_from_hdfs, create_from_postgresql, create_from_oracle and create_from_mysql.
  • Job.wait_for_completion, which waits for a job to complete without returning anything, has been added.

Enhancements

  • The client will now check the API version offered by the server specified in configuration, and give a warning if the client version is newer than the server version. The DataRobot server is always backwards compatible with old clients, but new clients may have functionality that is not implemented on older server versions. This issue mainly affects users with on-premise deployments of DataRobot.

Bugfixes

  • Fixed an issue where Model.request_predictions might raise an error when predictions finished very quickly instead of returning the job.

API Changes

  • To set the target with quickrun autopilot, call Project.set_target with mode=AUTOPILOT_MODE.QUICK instead of specifying quickrun=True.

Deprecation Summary

  • Semi-automatic mode for autopilot has been deprecated and will be removed in 3.0. Use manual or fully automatic instead.
  • Use of the quickrun argument in Project.set_target has been deprecated and will be removed in 3.0. Use mode=AUTOPILOT_MODE.QUICK instead.

Configuration Changes

  • It is now possible to control the SSL certificate verification by setting the parameter ssl_verify in the config file.

Documentation Changes

  • The “Modeling Airline Delay” example notebook has been updated to work with the new 2.3 enhancements.
  • Documentation for the generic Job class has been added.
  • Class attributes are now documented in the API Reference section of the documentation.
  • The changelog now appears in the documentation.
  • There is a new section dedicated to configuration, which lists all of the configuration options and their meanings.

2.3.0

New Features

  • The DataRobot API now includes Feature Impact, an approach to measuring the relevance of each feature that can be applied to any model. The Model class now includes methods request_feature_impact (which creates and returns a feature impact job) and get_feature_impact (which can retrieve completed feature impact results).
  • A new improved workflow for predictions now supports first uploading a dataset via Project.upload_dataset, then requesting predictions via Model.request_predictions. This allows us to better support predictions on larger datasets and non-ascii files.
  • Datasets previously uploaded for predictions (represented by the PredictionDataset class) can be listed from Project.get_datasets and retrieve and deleted via PredictionDataset.get and PredictionDataset.delete.
  • You can now create a new feature by re-interpreting the type of an existing feature in a project by using the Project.create_type_transform_feature method.
  • The Job class now includes a get method for retrieving a job and a cancel method for canceling a job.
  • All of the jobs classes (Job, ModelJob, PredictJob) now include the following new methods: refresh (for refreshing the data in the job object), get_result (for getting the completed resource resulting from the job), and get_result_when_complete (which waits until the job is complete and returns the results, or times out).
  • A new method Project.refresh can be used to update Project objects with the latest state from the server.
  • A new function datarobot.async.wait_for_async_resolution can be used to poll for the resolution of any generic asynchronous operation on the server.

Enhancements

  • The JOB_TYPE enum now includes FEATURE_IMPACT.
  • The QUEUE_STATUS enum now includes ABORTED and COMPLETED.
  • The Project.create method now has a read_timeout parameter which can be used to keep open the connection to DataRobot while an uploaded file is being processed. For very large files this time can be substantial. Appropriately raising this value can help avoid timeouts when uploading large files.
  • The method Project.wait_for_autopilot has been enhanced to error if the project enters a state where autopilot may not finish. This avoids a situation that existed previously where users could wait indefinitely on their project that was not going to finish. However, users are still responsible to make sure a project has more than zero workers, and that the queue is not paused.
  • Feature.get now supports retrieving features by feature name. (For backwards compatibility, feature IDs are still supported until 3.0.)
  • File paths that have unicode directory names can now be used for creating projects and PredictJobs. The filename itself must still be ascii, but containing directory names can have other encodings.
  • Now raises more specific JobAlreadyRequested exception when we refuse a model fitting request as a duplicate. Users can explicitly catch this exception if they want it to be ignored.
  • A file_name attribute has been added to the Project class, identifying the file name associated with the original project dataset. Note that if the project was created from a data frame, the file name may not be helpful.
  • The connect timeout for establishing a connection to the server can now be set directly. This can be done in the yaml configuration of the client, or directly in the code. The default timeout has been lowered from 60 seconds to 6 seconds, which will make detecting a bad connection happen much quicker.

Bugfixes

  • Fixed a bug (affecting Python 2 only) with printing features and featurelists whose names are not ascii.

API Changes

  • Job class hierarchy is rearranged to better express the relationship between these objects. See documentation for datarobot.models.job for details.
  • Featurelist objects now have a project_id attribute to indicate which project they belong to. Directly accessing the project attribute of a Featurelist object is now deprecated
  • Support INI-style configuration, which was deprecated in v2.1, has been removed. yaml is the only supported configuration format.
  • The method Project.get_jobs method, which was deprecated in v2.1, has been removed. Users should use the Project.get_model_jobs method instead to get the list of model jobs.

Deprecation Summary

  • PredictJob.create has been deprecated in favor of the alternate workflow using Model.request_predictions.
  • Feature.converter (used internally for object construction) has been made private.
  • Model.fetch_resource_data has been deprecated and will be removed in 3.0. To fetch a model from
    its ID, use Model.get.
  • The ability to use Feature.get with feature IDs (rather than names) is deprecated and will be removed in 3.0.
  • Instantiating a Project, Model, Blueprint, Featurelist, or Feature instance from a dict of data is now deprecated. Please use the from_data classmethod of these classes instead. Additionally, instantiating a Model from a tuple or by using the keyword argument data is also deprecated.
  • Use of the attribute Featurelist.project is now deprecated. You can use the project_id attribute of a Featurelist to instantiate a Project instance using Project.get.
  • Use of the attributes Model.project, Model.blueprint, and Model.featurelist are all deprecated now to avoid use of partially instantiated objects. Please use the ids of these objects instead.
  • Using a Project instance as an argument in Featurelist.get is now deprecated. Please use a project_id instead. Similarly, using a Project instance in Model.get is also deprecated, and a project_id should be used in its place.

Configuration Changes

  • Previously it was possible (though unintended) that the client configuration could be mixed through environment variables, configuration files, and arguments to datarobot.Client. This logic is now simpler - please see the Getting Started section of the documentation for more information.

2.2.33

Bugfixes

  • Fixed a bug with non-ascii project names using the package with Python 2.
  • Fixed an error that occurred when printing projects that had been constructed from an ID only or printing printing models that had been constructed from a tuple (which impacted printing PredictJobs).
  • Fixed a bug with project creation from non-ascii file names. Project creation from non-ascii file names is not supported, so this now raises a more informative exception. The project name is no longer used as the file name in cases where we do not have a file name, which prevents non-ascii project names from causing problems in those circumstances.
  • Fixed a bug (affecting Python 2 only) with printing projects, features, and featurelists whose names are not ascii.

2.2.32

New Features

  • Project.get_features and Feature.get methods have been added for feature retrieval.
  • A generic Job entity has been added for use in retrieving the entire queue at once. Calling Project.get_all_jobs will retrieve all (appropriately filtered) jobs from the queue. Those can be cancelled directly as generic jobs, or transformed into instances of the specific job class using ModelJob.from_job and PredictJob.from_job, which allow all functionality previously available via the ModelJob and PredictJob interfaces.
  • Model.train now supports featurelist_id and scoring_type parameters, similar to Project.train.

Enhancements

  • Deprecation warning filters have been updated. By default, a filter will be added ensuring that usage of deprecated features will display a warning once per new usage location. In order to hide deprecation warnings, a filter like warnings.filterwarnings(‘ignore’, category=DataRobotDeprecationWarning) can be added to a script so no such warnings are shown. Watching for deprecation warnings to avoid reliance on deprecated features is recommended.
  • If your client is misconfigured and does not specify an endpoint, the cloud production server is no longer used as the default as in many cases this is not the correct default.
  • This changelog is now included in the distributable of the client.

Bugfixes

  • Fixed an issue where updating the global client would not affect existing objects with cached clients. Now the global client is used for every API call.
  • An issue where mistyping a filepath for use in a file upload has been resolved. Now an error will be raised if it looks like the raw string content for modeling or predictions is just one single line.

API Changes

  • Use of username and password to authenticate is no longer supported - use an API token instead.
  • Usage of start_time and finish_time parameters in Project.get_models is not supported both in filtering and ordering of models
  • Default value of sample_pct parameter of Model.train method is now None instead of 100. If the default value is used, models will be trained with all of the available training data based on project configuration, rather than with entire dataset including holdout for the previous default value of 100.
  • order_by parameter of Project.list which was deprecated in v2.0 has been removed.
  • recommendation_settings parameter of Project.start which was deprecated in v0.2 has been removed.
  • Project.status method which was deprecated in v0.2 has been removed.
  • Project.wait_for_aim_stage method which was deprecated in v0.2 has been removed.
  • Delay, ConstantDelay, NoDelay, ExponentialBackoffDelay, RetryManager classes from retry module which were deprecated in v2.1 were removed.
  • Package renamed to datarobot.

Deprecation Summary

  • Project.update deprecated in favor of specific updates: rename, unlock_holdout, set_worker_count.

Documentation Changes

  • A new use case involving financial data has been added to the examples directory.
  • Added documentation for the partition methods.

2.1.31

Bugfixes

  • In Python 2, using a unicode token to instantiate the client will now work correctly.

2.1.30

Bugfixes

  • The minimum required version of trafaret has been upgraded to 0.7.1 to get around an incompatibility between it and setuptools.

2.1.29

Enhancements

  • Minimal used version of requests_toolbelt package changed from 0.4 to 0.6

2.1.28

New Features

  • Default to reading YAML config file from ~/.config/datarobot/drconfig.yaml
  • Allow config_path argument to client
  • wait_for_autopilot method added to Project. This method can be used to block execution until autopilot has finished running on the project.
  • Support for specifying which featurelist to use with initial autopilot in Project.set_target
  • Project.get_predict_jobs method has been added, which looks up all prediction jobs for a project
  • Project.start_autopilot method has been added, which starts autopilot on specified featurelist
  • The schema for PredictJob in DataRobot API v2.1 now includes a message. This attribute has been added to the PredictJob class.
  • PredictJob.cancel now exists to cancel prediction jobs, mirroring ModelJob.cancel
  • Project.from_async is a new classmethod that can be used to wait for an async resolution in project creation. Most users will not need to know about it as it is used behind the scenes in Project.create and Project.set_target, but power users who may run into periodic connection errors will be able to catch the new ProjectAsyncFailureError and decide if they would like to resume waiting for async process to resolve

Enhancements

  • AUTOPILOT_MODE enum now uses string names for autopilot modes instead of numbers

Deprecation Summary

  • ConstantDelay, NoDelay, ExponentialBackoffDelay, and RetryManager utils are now deprecated
  • INI-style config files are now deprecated (in favor of YAML config files)
  • Several functions in the utils submodule are now deprecated (they are being moved elsewhere and are not considered part of the public interface)
  • Project.get_jobs has been renamed Project.get_model_jobs for clarity and deprecated
  • Support for the experimental date partitioning has been removed in DataRobot API, so it is being removed from the client immediately.

API Changes

  • In several places where AppPlatformError was being raised, now TypeError, ValueError or InputNotUnderstoodError are now used. With this change, one can now safely assume that when catching an AppPlatformError it is because of an unexpected response from the server.
  • AppPlatformError has gained a two new attributes, status_code which is the HTTP status code of the unexpected response from the server, and error_code which is a DataRobot-defined error code. error_code is not used by any routes in DataRobot API 2.1, but will be in the future. In cases where it is not provided, the instance of AppPlatformError will have the attribute error_code set to None.
  • Two new subclasses of AppPlatformError have been introduced, ClientError (for 400-level response status codes) and ServerError (for 500-level response status codes). These will make it easier to build automated tooling that can recover from periodic connection issues while polling.
  • If a ClientError or ServerError occurs during a call to Project.from_async, then a ProjectAsyncFailureError (a subclass of AsyncFailureError) will be raised. That exception will have the status_code of the unexpected response from the server, and the location that was being polled to wait for the asynchronous process to resolve.

2.0.27

New Features

  • PredictJob class was added to work with prediction jobs
  • wait_for_async_predictions function added to predict_job module

Deprecation Summary

  • The order_by parameter of the Project.list is now deprecated.

0.2.26

Enhancements

  • Projet.set_target will re-fetch the project data after it succeeds, keeping the client side in sync with the state of the project on the server
  • Project.create_featurelist now throws DuplicateFeaturesError exception if passed list of features contains duplicates
  • Project.get_models now supports snake_case arguments to its order_by keyword

Deprecation Summary

  • Project.wait_for_aim_stage is now deprecated, as the REST Async flow is a more reliable method of determining that project creation has completed successfully
  • Project.status is deprecated in favor of Project.get_status
  • recommendation_settings parameter of Project.start is deprecated in favor of recommender_settings

Bugfixes

  • Project.wait_for_aim_stage changed to support Python 3
  • Fixed incorrect value of SCORING_TYPE.cross_validation
  • Models returned by Project.get_models will now be correctly ordered when the order_by keyword is used

0.2.25

  • Pinned versions of required libraries

0.2.24

Official release of v0.2

0.1.24

  • Updated documentation
  • Renamed parameter name of Project.create and Project.start to project_name
  • Removed Model.predict method
  • wait_for_async_model_creation function added to modeljob module
  • wait_for_async_status_service of Project class renamed to _wait_for_async_status_service
  • Can now use auth_token in config file to configure SDK

0.1.23

  • Fixes a method that pointed to a removed route

0.1.22

  • Added featurelist_id attribute to ModelJob class

0.1.21

  • Removes model attribute from ModelJob class

0.1.20

  • Project creation raises AsyncProjectCreationError if it was unsuccessful
  • Removed Model.list_prime_rulesets and Model.get_prime_ruleset methods
  • Removed Model.predict_batch method
  • Removed Project.create_prime_model method
  • Removed PrimeRuleSet model
  • Adds backwards compatibility bridge for ModelJob async
  • Adds ModelJob.get and ModelJob.get_model

0.1.19

  • Minor bugfixes in wait_for_async_status_service

0.1.18

  • Removes submit_model from Project until serverside implementation is improved
  • Switches training URLs for new resource-based route at /projects/<project_id>/models/
  • Job renamed to ModelJob, and using modelJobs route
  • Fixes an inconsistency in argument order for train methods

0.1.17

  • wait_for_async_status_service timeout increased from 60s to 600s

0.1.16

  • Project.create will now handle both async/sync project creation

0.1.15

  • All routes pluralized to sync with changes in API
  • Project.get_jobs will request all jobs when no param specified
  • dataframes from predict method will have pythonic names
  • Project.get_status created, Project.status now deprecated
  • Project.unlock_holdout created.
  • Added quickrun parameter to Project.set_target
  • Added modelCategory to Model schema
  • Add permalinks featrue to Project and Model objects.
  • Project.create_prime_model created

0.1.14

  • Project.set_worker_count fix for compatibility with API change in project update.

0.1.13

  • Add positive class to set_target.
  • Change attributes names of Project, Model, Job and Blueprint
    • features in Model, Job and Blueprint are now processes
    • dataset_id and dataset_name migrated to featurelist_id and featurelist_name.
    • samplepct -> sample_pct
  • Model has now blueprint, project, and featurlist attributes.
  • Minor bugfixes.

0.1.12

  • Minor fixes regarding rename Job attributes. features attributes now named processes, samplepct now is sample_pct.

0.1.11

(May 27, 2015)

  • Minor fixes regarding migrating API from under_score names to camelCase.

0.1.10

(May 20, 2015)

  • Remove Project.upload_file, Project.upload_file_from_url and Project.attach_file methods. Moved all logic that uploading file to Project.create method.

0.1.9

(May 15, 2015)

  • Fix uploading file causing a lot of memory usage. Minor bugfixes.

Indices and tables