DataRobot Python Package¶
Getting Started¶
Installation¶
You will need the following:
- Python 2.7 or 3.4+
- DataRobot account
- pip
Installing for Cloud DataRobot¶
If you are using the cloud version of DataRobot, the easiest way to get the latest version of the package is:
pip install datarobot
Note
If you are not running in a Python virtualenv, you probably want to use pip install --user datarobot.
Installing for an On-Site Deploy¶
If you are using an on-site deploy of DataRobot, the latest version of the package is not the most appropriate for you. Contact your CFDS for guidance on the appropriate version range.
pip install "datarobot>=$(MIN_VERSION),<$(EXCLUDE_VERSION)"
For a particular installation of DataRobot, the correct value of $(MIN_VERSION) could be 2.0 with an $(EXCLUDE_VERSION) of 2.3. This ensures that all the features the client expects will be present on the backend.
Note
If you are not running in a Python virtualenv, you probably want to use pip install --user "datarobot>=$(MIN_VERSION),<$(EXCLUDE_VERSION)".
Configuration¶
Each authentication method will specify credentials for DataRobot, as well as the location of the DataRobot deployment. We currently support configuration using a configuration file, by setting environment variables, or within the code itself.
Credentials¶
You will have to specify an API token and an endpoint in order to use the client. You can manage your API tokens in the DataRobot webapp, in your profile. This section describes how to use these options. Their order of precedence is as follows, noting that the first available option will be used:
- Setting endpoint and token in code using datarobot.Client
- Configuring from a config file as specified directly using datarobot.Client
- Configuring from a config file as specified by the environment variable DATAROBOT_CONFIG_FILE
- Configuring from the environment variables DATAROBOT_ENDPOINT and DATAROBOT_API_TOKEN
- Searching for a config file in the home directory of the current user, at ~/.config/datarobot/drconfig.yaml
Note
If you access the DataRobot webapp at https://app.datarobot.com, then the correct endpoint to specify would be https://app.datarobot.com/api/v2. If you have a local installation, update the endpoint accordingly to point at the installation of DataRobot available on your local network.
Set Credentials Explicitly in Code¶
Explicitly set credentials in code:
import datarobot as dr
dr.Client(token='your_token', endpoint='https://app.datarobot.com/api/v2')
You can also point to a YAML config file to use:
import datarobot as dr
dr.Client(config_path='/home/user/my_datarobot_config.yaml')
Use a Configuration File¶
You can use a configuration file to specify the client setup.
The following is an example configuration file that should be saved as ~/.config/datarobot/drconfig.yaml:
token: yourtoken
endpoint: https://app.datarobot.com/api/v2
You can specify a different location for the DataRobot configuration file by setting
the DATAROBOT_CONFIG_FILE
environment variable. Note that if you specify a filepath, you should
use an absolute path so that the API client will work when run from any location.
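For example, you could point the client at a configuration file kept outside the default location (the path below is illustrative):

```shell
# Use an absolute path so the client finds the file from any working directory
export DATAROBOT_CONFIG_FILE=/etc/datarobot/drconfig.yaml
```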
Set Credentials Using Environment Variables¶
Set up an endpoint by setting environment variables in the UNIX shell:
export DATAROBOT_ENDPOINT='https://app.datarobot.com/api/v2'
export DATAROBOT_API_TOKEN=your_token
Common Issues¶
This section has examples of cases that can cause issues with using the DataRobot client, as well as known fixes.
InsecurePlatformWarning¶
On versions of Python earlier than 2.7.9, you might see an InsecurePlatformWarning in your output. To prevent this without updating your Python version, install the pyOpenSSL package:
pip install pyopenssl ndg-httpsclient pyasn1
AttributeError: ‘EntryPoint’ object has no attribute ‘resolve’¶
Some earlier versions of setuptools will cause an error on importing DataRobot. The recommended fix is upgrading setuptools. If you are unable to upgrade setuptools, pinning trafaret to version <=7.4 will correct this issue.
>>> import datarobot as dr
...
File "/home/clark/.local/lib/python2.7/site-packages/trafaret/__init__.py", line 1550, in load_contrib
trafaret_class = entrypoint.resolve()
AttributeError: 'EntryPoint' object has no attribute 'resolve'
To prevent this, upgrade your setuptools:
pip install --upgrade setuptools
ConnectTimeout¶
If you have a slow connection to your DataRobot installation, you may see a traceback like
ConnectTimeout: HTTPSConnectionPool(host='my-datarobot.com', port=443): Max
retries exceeded with url: /api/v2/projects/
(Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f130fc76150>,
'Connection to my-datarobot.com timed out. (connect timeout=6.05)'))
You can configure a larger connect timeout (the amount of time to wait on each request attempting to connect to the DataRobot server before giving up) using a connect_timeout value in either a configuration file or via a direct call to datarobot.Client.
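As a sketch, a larger timeout could be passed directly when creating the client (the 30-second value and token are illustrative; this call contacts your DataRobot server):

```python
import datarobot as dr

# Allow up to 30 seconds to establish each connection (the default is 6.05)
dr.Client(token='your_token',
          endpoint='https://app.datarobot.com/api/v2',
          connect_timeout=30)
```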
project.open_leaderboard_browser¶
Calling project.open_leaderboard_browser may block if run with a text-mode browser or when running on a server that cannot open a browser.
Configuration¶
This section describes all of the settings that can be configured in the DataRobot
configuration file. This file is by default looked for inside the user’s home
directory at ~/.config/datarobot/drconfig.yaml, but the default location can be overridden by specifying the environment variable DATAROBOT_CONFIG_FILE, or within the code by setting the global client with dr.Client(config_path='/path/to/config.yaml').
Configurable Variables¶
These are the variables available for configuration for the DataRobot client:
- endpoint: This parameter is required. It is the URL of the DataRobot endpoint. For example, the default endpoint on the cloud installation of DataRobot is https://app.datarobot.com/api/v2
- token: This parameter is required. It is the token of your DataRobot account. This can be found in the user settings page of DataRobot.
- connect_timeout: This parameter is optional. It specifies the number of seconds that the client should be willing to wait to establish a connection to the remote server. Users with poor connections may need to increase this value. By default DataRobot uses the value 6.05.
- ssl_verify: This parameter is optional. It controls the SSL certificate verification of the DataRobot client. DataRobot is built with the python requests library, and this variable is used as the verify parameter in that library. More information can be found in their documentation. The default value is true, which means that requests will use your computer's set of trusted certificate chains by default.
- max_retries: This parameter is optional. It controls the number of retries to attempt for each connection. More information can be found in the requests documentation. By default, errors implying the request never made it to the server are always retried, while read timeouts (where the request began running and did not finish) are not retried. More granular control may be acquired by passing a Retry object from urllib3 into a direct instantiation of dr.Client:
from urllib3.util.retry import Retry

import datarobot as dr
dr.Client(endpoint='https://app.datarobot.com/api/v2',
          token='this-is-a-fake-token',
          max_retries=Retry(connect=3, read=3))
Proxy support¶
The DataRobot client can work behind a non-transparent HTTP proxy server. Set the environment variable HTTP_PROXY containing the proxy URL to route all DataRobot traffic through that proxy server, e.g. HTTP_PROXY="http://my-proxy.local:3128" python my_datarobot_script.py.
QuickStart¶
Note
You must set up credentials in order to access the DataRobot API. For more information, see Credentials.
All of the modeling within DataRobot happens within a project. Each project has one dataset that is used as the source from which to train models.
There are three steps required to begin modeling:
- Create an empty project.
- Upload a data file to model.
- Select parameters and start training models with the autopilot.
The following command includes these three steps. It is equivalent to choosing all of the default settings recommended by DataRobot.
import datarobot as dr
project = dr.Project.start(project_name='My new project',
sourcedata='/home/user/data/last_week_data.csv',
target='ItemsPurchased')
Where:
- project_name is the name of the new DataRobot project.
- sourcedata is the path to the dataset.
- target is the name of the target feature column in the dataset.
Projects¶
All of the modeling within DataRobot happens within a project. Each project has one dataset that is used as the source from which to train models.
Create a Project¶
You can use the following command to create a new project. You must specify a path to a data file, a file object, raw file contents, or a pandas.DataFrame object when creating a new project. The path can be either a path to a local file or a publicly accessible URL.
import datarobot as dr
project = dr.Project.create('/home/user/data/last_week_data.csv',
project_name='New Project')
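Since Project.create also accepts a pandas.DataFrame, you can create a project from in-memory data as well; a sketch (the file path and project name are illustrative, and the call uploads data to your DataRobot server):

```python
import datarobot as dr
import pandas as pd

# Load the data into memory first, then upload the DataFrame directly
df = pd.read_csv('/home/user/data/last_week_data.csv')
project = dr.Project.create(df, project_name='New Project from DataFrame')
```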
You can use the following commands to view the project ID and name:
project.id
>>> u'5506fcd38bd88f5953219da0'
project.project_name
>>> u'New Project'
Select Modeling Parameters¶
The final information needed to begin modeling includes the target feature, the queue mode, the metric for comparing models, and the optional parameters such as weights, offset, exposure and downsampling.
Target¶
The target must be the name of one of the columns of data uploaded to the project.
Metric¶
The optimization metric used to compare models is an important factor in building accurate models. If a metric is not specified, the default metric recommended by DataRobot will be used. You can use the following code to view a list of valid metrics for a specified target:
target_name = 'ItemsPurchased'
project.get_metrics(target_name)
>>> {'available_metrics': [
'Gini Norm',
'Weighted Gini Norm',
'Weighted R Squared',
'Weighted RMSLE',
'Weighted MAPE',
'Weighted Gamma Deviance',
'Gamma Deviance',
'RMSE',
'Weighted MAD',
'Tweedie Deviance',
'MAD',
'RMSLE',
'Weighted Tweedie Deviance',
'Weighted RMSE',
'MAPE',
'Weighted Poisson Deviance',
'R Squared',
'Poisson Deviance'],
'feature_name': 'SalePrice'}
Partitioning Method¶
DataRobot projects always have a holdout set used for final model validation. We use two different approaches for testing prior to the holdout set:
- split the remaining data into training and validation sets
- cross-validation, in which the remaining data is split into a number of folds; each fold serves as a validation set, with models trained on the other folds and evaluated on that fold.
There are several other options you can control. To specify a partition method, create an instance of one of the Partition Classes, and pass it as the partitioning_method argument in your call to project.set_target or project.start. See the Datetime Partitioned Projects section for more information on using datetime partitioning.
Several partitioning methods include parameters for validation_pct and holdout_pct, specifying desired percentages for the validation and holdout sets. Note that there may be constraints that prevent the actual percentages used from exactly (or in some cases, even closely) matching the requested percentages.
Queue Mode¶
You can use the API to set the DataRobot modeling process to run in either automatic or manual mode.
Autopilot mode means that the modeling process will proceed completely automatically, including running recommended models, running at different sample sizes, and blending.
Manual mode means that DataRobot will populate a list of recommended models, but will not insert any of them into the queue. Manual mode lets you select which models to execute before starting the modeling process.
Quick mode means that a smaller set of Blueprints is used, so autopilot finishes faster.
Weights¶
DataRobot also supports using a weight parameter. A full discussion of the use of weights in data science is not within the scope of this document, but weights are often used to help compensate for rare events in data. You can specify a column name in the project dataset to be used as a weight column.
Offsets¶
Starting with version v2.6 DataRobot also supports using an offset parameter. Offsets are commonly used in insurance modeling to include effects that are outside of the training data due to regulatory compliance or constraints. You can specify the names of several columns in the project dataset to be used as the offset columns.
Exposure¶
Starting with version v2.6 DataRobot also supports using an exposure parameter. Exposure is often used to model insurance premiums where strict proportionality of premiums to duration is required. You can specify the name of the column in the project dataset to be used as an exposure column.
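Weight, offset, and exposure columns are all passed through an AdvancedOptions object when setting the target. A sketch, assuming your dataset has columns named 'row_weight', 'offset_a', and 'duration' (hypothetical column names):

```python
import datarobot as dr

# project is an existing dr.Project instance
opts = dr.AdvancedOptions(weights='row_weight',   # single weight column
                          offset=['offset_a'],    # offsets accept several columns
                          exposure='duration')    # single exposure column
project.set_target(target='ItemsPurchased', advanced_options=opts)
```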
Start Modeling¶
Once you have selected modeling parameters, you can use the following code structure to specify parameters and start the modeling process.
import datarobot as dr
project.set_target(target='ItemsPurchased',
metric='Tweedie Deviance',
mode=dr.AUTOPILOT_MODE.FULL_AUTO)
You can also pass additional optional parameters to project.set_target to change parameters of the modeling process. Currently supported parameters are:
- worker_count: int, sets the number of workers used for modeling.
- partitioning_method: a PartitioningMethod object.
- positive_class: str, float, or int; specifies a level of the target column that should be treated as the positive class for binary classification. May only be specified for binary classification targets.
- advanced_options: an AdvancedOptions object, used to set advanced options of the modeling process.
- target_type: str, overrides the automatically selected target_type. An example usage would be setting target_type=TARGET_TYPE.MULTICLASS when you want to perform a multiclass classification task on a numeric column that has low cardinality.
You can run with different autopilot modes by changing the parameter to mode. AUTOPILOT_MODE.FULL_AUTO is the default. Other accepted modes include AUTOPILOT_MODE.MANUAL for manual mode (choose your own models to run rather than use the DataRobot autopilot) and AUTOPILOT_MODE.QUICK for quickrun (run on a more limited set of models to get insights more quickly).
Quickly Start a Project¶
Project creation, file upload and target selection are all combined in Project.start
method:
import datarobot as dr
project = dr.Project.start('/home/user/data/last_week_data.csv',
target='ItemsPurchased',
project_name='New Project')
You can also pass additional optional parameters to Project.start:
- worker_count: int, sets the number of workers used for modeling.
- metric: str, name of the metric to use.
- autopilot_on: boolean, defaults to True; sets whether or not to begin modeling automatically.
- blueprint_threshold: int, number of hours a model is permitted to run. Minimum 1.
- response_cap: float, quantile of the response distribution to use for response capping. Must be in the range 0.5..1.0.
- partitioning_method: a PartitioningMethod object.
- positive_class: str, float, or int; specifies a level of the target column that should be treated as the positive class for binary classification. May only be specified for binary classification targets.
- target_type: str, overrides the automatically selected target_type. An example usage would be setting target_type=TARGET_TYPE.MULTICLASS when you want to perform a multiclass classification task on a numeric column that has low cardinality.
Interact with a Project¶
The following commands can be used to manage DataRobot projects.
List Projects¶
Returns a list of projects associated with the current API user.
import datarobot as dr
dr.Project.list()
>>> [Project(Project One), Project(Two)]
dr.Project.list(search_params={'project_name': 'One'})
>>> [Project(One)]
You can pass the following parameter to change the result:
- search_params: dict, used to filter returned projects. Currently you can query projects only by project_name.
Get an existing project¶
Rather than querying the full list of projects every time you need
to interact with a project, you can retrieve its id
value and use that to reference the project.
import datarobot as dr
project = dr.Project.get(project_id='5506fcd38bd88f5953219da0')
project.id
>>> '5506fcd38bd88f5953219da0'
project.project_name
>>> 'Churn Projection'
Update a project¶
You can update various attributes of a project.
To update the name of the project:
project.rename(new_name)
To update the number of workers used by your project (this will fail if you request more workers than you have available):
project.set_worker_count(num_workers)
To unlock the holdout set, allowing holdout scores to be shown and models to be trained on more data:
project.unlock_holdout()
Wait for Autopilot to Finish¶
Once the modeling autopilot is started, in some cases you will want to wait for autopilot to finish:
project.wait_for_autopilot()
Play/Pause the autopilot¶
If your project is running in autopilot mode, it will continually use available workers, subject to the number of workers allocated to the project and the total number of simultaneous workers allowed according to the user permissions.
To pause a project running in autopilot mode:
project.pause_autopilot()
To resume running a paused project:
project.unpause_autopilot()
Start autopilot on another Featurelist¶
You can start autopilot on an existing featurelist.
import datarobot as dr
featurelist = project.create_featurelist('test', ['feature 1', 'feature 2'])
project.start_autopilot(featurelist.id, mode=dr.AUTOPILOT_MODE.FULL_AUTO)
>>> True
# Starting autopilot that is already running on the provided featurelist
project.start_autopilot(featurelist.id, mode=dr.AUTOPILOT_MODE.FULL_AUTO)
>>> dr.errors.AppPlatformError
Note
This method should be used on a project where the target has already been set. An error will be raised if autopilot is currently running on or has already finished running on the provided featurelist.
Further reading¶
The Blueprints and Models sections of this document will describe how to create new models based on the Blueprints recommended by DataRobot.
Datetime Partitioned Projects¶
If your dataset is modeling events taking place over time, datetime partitioning may be appropriate. Datetime partitioning ensures that when partitioning the dataset for training and validation, rows are ordered according to the value of the date partition feature.
Setting Up a Datetime Partitioned Project¶
After creating a project and before setting the target, create a
DatetimePartitioningSpecification to define how the project should
be partitioned. By passing the specification into DatetimePartitioning.generate
, the full
partitioning can be previewed before finalizing the partitioning. After verifying that the
partitioning is correct for the project dataset, pass the specification into Project.set_target
via the partitioning_method
argument. Once modeling begins, the project can be used as normal.
The following code block shows the basic workflow for creating datetime partitioned projects.
import datarobot as dr
project = dr.Project.create('some_data.csv')
spec = dr.DatetimePartitioningSpecification('my_date_column')
# can customize the spec as needed
partitioning_preview = dr.DatetimePartitioning.generate(project.id, spec)
# the preview generated is based on the project's data
print(partitioning_preview.to_dataframe())
# hmm ... I want more backtests
spec.number_of_backtests = 5
partitioning_preview = dr.DatetimePartitioning.generate(project.id, spec)
print(partitioning_preview.to_dataframe())
# looks good
project.set_target('target_column', partitioning_method=spec)
Modeling with a Datetime Partitioned Project¶
While Model
objects can still be used to interact with the project,
DatetimeModel objects, which are only retrievable from datetime partitioned
projects, provide more information including which date ranges and how many rows are used in
training and scoring the model as well as scores and statuses for individual backtests.
The autopilot workflow is the same as for other projects, but to manually train a model,
Project.train_datetime
and Model.train_datetime
should be used in the place of
Project.train
and Model.train
. To create frozen models,
Model.request_frozen_datetime_model
should be used in place of
Model.request_frozen_model
. Unlike other projects, to trigger computation of
scores for all backtests use DatetimeModel.score_backtests
instead of using the scoring_type
argument in the train
methods.
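Putting those methods together, a minimal sketch of manually training and backtesting a model in a datetime partitioned project (the blueprint choice is arbitrary, and the calls run jobs on your DataRobot server):

```python
import datarobot as dr

# project is a datetime partitioned dr.Project with the target already set
blueprint = project.get_blueprints()[0]
model_job = project.train_datetime(blueprint.id)  # instead of project.train
model = model_job.get_result_when_complete()      # a DatetimeModel

# Trigger scoring of all backtests (no scoring_type argument here)
model.score_backtests()
```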
Dates, Datetimes, and Durations¶
When specifying a date or datetime for datetime partitioning, the client expects to receive and
will return a datetime
. Timezones may be specified, and will be assumed to be UTC if left
unspecified. All dates returned from DataRobot are in UTC with a timezone specified.
Datetimes may include a time, or specify only a date; however, they may have a non-zero time component only if the partition column included a time component in its date format. If the partition column included only dates like “24/03/2015”, then the time component of any datetimes, if present, must be zero.
When date ranges are specified with a start and an end date, the end date is exclusive, so only dates earlier than the end date are included, but the start date is inclusive, so dates equal to or later than the start date are included. If the start and end date are the same, then no dates are included in the range.
Durations are specified using a subset of ISO8601. Durations will be of the form PnYnMnDTnHnMnS where each “n” may be replaced with an integer value. Within the duration string,
- nY represents the number of years
- the nM following the “P” represents the number of months
- nD represents the number of days
- nH represents the number of hours
- the nM following the “T” represents the number of minutes
- nS represents the number of seconds
and “P” is used to indicate that the string represents a period and “T” indicates the beginning of the time component of the string. Any section with a value of 0 may be excluded. As with datetimes, if the partition column did not include a time component in its date format, the time component of any duration must be either unspecified or consist only of zeros.
Example Durations:
- “P3Y6M” (three years, six months)
- “P1Y0M0DT0H0M0S” (one year)
- “P1Y5DT10H” (one year, 5 days, 10 hours)
datarobot.helpers.partitioning_methods.construct_duration_string is a helper method that can be used to construct appropriate duration strings.
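For example, the helper can be called with keyword arguments for each component (verify the exact signature against your installed client version; unspecified components default to zero):

```python
from datarobot.helpers.partitioning_methods import construct_duration_string

# Build a duration string for three years and six months
duration = construct_duration_string(years=3, months=6)
print(duration)
```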
Time Series Projects¶
Time series projects, like OTV projects, use datetime partitioning, and all the workflow changes that apply to other datetime partitioned projects also apply to them. Unlike other projects, time series projects produce different types of models which forecast multiple future predictions instead of an individual prediction for each row.
DataRobot uses a general time series framework to configure how time series features are created and what future values the models will output. This framework consists of a Forecast Point (defining a time a prediction is being made), a Feature Derivation Window (a rolling window used to create features), and a Forecast Window (a rolling window of future values to predict). These components are described in more detail below.
Time series projects will automatically transform the dataset provided in order to apply this framework. During the transformation, DataRobot uses the Feature Derivation Window to derive time series features (such as lags and rolling statistics), and uses the Forecast Window to provide examples of forecasting different distances in the future (such as time shifts). After project creation, a new dataset and a new feature list are generated and used to train the models. This process is reapplied automatically at prediction time as well in order to generate future predictions based on the original data features.
The time_unit
and time_step
used to define the Feature Derivation and Forecast Windows are
taken from the datetime partition column, and can be retrieved for a given column in the input data
by looking at the corresponding attributes on the datarobot.Feature
object.
Setting Up A Time Series Project¶
To set up a time series project, follow the standard datetime partitioning
workflow and use the six new time series specific parameters on the
datarobot.DatetimePartitioningSpecification
object:
- use_time_series: bool, set this to True to enable time series for the project.
- default_to_a_priori: bool, set this to True to default to treating all features as a priori features. Otherwise they will not be handled as a priori features. See the prediction documentation for more information.
- feature_derivation_window_start: int, the offset into the past to the start of the feature derivation window.
- feature_derivation_window_end: int, the offset into the past to the end of the feature derivation window.
- forecast_window_start: int, the offset into the future to the start of the forecast window.
- forecast_window_end: int, the offset into the future to the end of the forecast window.
- feature_settings: list of FeatureSettings specifying per-feature settings; can be left unspecified.
Feature Derivation Window¶
The Feature Derivation window represents the rolling window that is used to derive
time series features and lags, relative to the Forecast Point. It is defined in terms of
feature_derivation_window_start
and feature_derivation_window_end
which are integer values
representing datetime offsets in terms of the time_unit
(e.g. hours or days).
The Feature Derivation Window start and end must be less than or equal to zero, indicating they are
positioned before the forecast point. Additionally, the window must be specified as an integer
multiple of the time_step
which defines the expected difference in time units between rows in
the data.
The window is closed, meaning the edges are considered to be inside the window.
Forecast Window¶
The Forecast Window represents the rolling window of future values to predict, relative to the
Forecast Point. It is defined in terms of the forecast_window_start
and forecast_window_end
,
which are positive integer values indicating datetime offsets in terms of the time_unit
(e.g.
hours or days).
The Forecast Window start and end must be positive integers, indicating they are
positioned after the forecast point. Additionally, the window must be specified as an integer
multiple of the time_step
which defines the expected difference in time units between rows in
the data.
The window is closed, meaning the edges are considered to be inside the window.
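The windows described above map directly onto the specification parameters. A sketch, assuming a partition column named 'timestamp' (an illustrative name) with a daily time unit:

```python
import datarobot as dr

spec = dr.DatetimePartitioningSpecification(
    'timestamp',
    use_time_series=True,
    feature_derivation_window_start=-5,  # derive features from 5 time units back...
    feature_derivation_window_end=-3,    # ...through 3 time units back
    forecast_window_start=1,             # predict from 1 time unit ahead...
    forecast_window_end=3,               # ...through 3 time units ahead
)
```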
Multiseries Projects¶
Certain time series problems represent multiple separate series of data, e.g. “I have five different stores that all have different customer bases. I want to predict how many units of a particular item will sell, and account for the different behavior of each store”. When setting up the project, a column specifying series ids must be identified, so that each row from the same series has the same value in the multiseries id column.
Using a multiseries id column changes which partition columns are eligible for time series, as
each series is required to be unique and regular, instead of the entire partition column being
required to have those properties. In order to use a multiseries id column for partitioning,
a detection job must first be run to analyze the relationship between the partition and multiseries
id columns. If needed, it will be automatically triggered by calling
datarobot.Feature.get_multiseries_properties()
on the desired partition column. The
previously computed multiseries properties for a particular partition column can then be accessed
via that method. The computation will also be automatically triggered when calling
datarobot.DatetimePartitioning.generate()
or datarobot.Project.set_target()
with
a multiseries id column specified.
Note that currently only one multiseries id column is supported, but all interfaces accept lists of id columns so that multiple id columns can be supported in the future.
In order to create a multiseries project:
- Set up a datetime partitioning specification with the desired partition column and multiseries id columns.
- (Optionally) Use datarobot.Feature.get_multiseries_properties() to confirm the inferred time step and time unit of the partition column when used with the specified multiseries id column.
- (Optionally) Specify the multiseries id column in order to preview the full datetime partitioning settings using datarobot.DatetimePartitioning.generate().
- Specify the multiseries id column when sending the target and partitioning settings via datarobot.Project.set_target().
project = dr.Project.create('path/to/multiseries.csv', project_name='my multiseries project')
partitioning_spec = dr.DatetimePartitioningSpecification(
'timestamp', use_time_series=True, multiseries_id_columns=['multiseries_id']
)
# manually confirm time step and time unit are as expected
datetime_feature = dr.Feature.get(project.id, 'timestamp')
multiseries_props = datetime_feature.get_multiseries_properties(['multiseries_id'])
print(multiseries_props)
# manually check out the partitioning settings like feature derivation window and backtests
# to make sure they make sense before moving on
full_part = dr.DatetimePartitioning.generate(project.id, partitioning_spec)
print(full_part.feature_derivation_window_start, full_part.feature_derivation_window_end)
print(full_part.to_dataframe())
# finalize the project and start the autopilot
project.set_target('target', partitioning_method=partitioning_spec)
Feature Settings¶
The datarobot.FeatureSettings constructor receives a feature_name and settings. For now, only the a_priori setting is supported.
import datarobot as dr

# I have 10 features, 8 of them are a priori and two are not
not_a_priori_features = ['previous_day_sales', 'amount_in_stock']
feature_settings = [dr.FeatureSettings(feat_name, a_priori=False)
                    for feat_name in not_a_priori_features]
spec = dr.DatetimePartitioningSpecification(
    # ...
    default_to_a_priori=True,
    feature_settings=feature_settings
)
Modeling Data and Time Series Features¶
In time series projects, a new set of modeling features is created after setting the partitioning options. If a featurelist is specified with the partitioning options, it will be used to select which features should be used to derive modeling features; if a featurelist is not specified, the default featurelist will be used.
These features are automatically derived from those in the project’s
dataset and are the features used for modeling - note that the Project methods
get_featurelists
and get_modeling_featurelists
will return different data in time series
projects. Modeling featurelists are the ones that can be used for modeling and will be accepted by
the backend, while regular featurelists will continue to exist but cannot be used. Modeling
features are only accessible once the target and partitioning options have been
set. In projects that don’t use time series modeling, once the target has been set,
modeling and regular features and featurelists will behave the same.
Making Predictions¶
Prediction datasets are uploaded as normal. However, when uploading a prediction dataset, a new parameter forecast_point can be specified. The forecast point of a prediction dataset identifies the point in time relative to which predictions should be generated, and if one is not specified when uploading a dataset, the server will choose the most recent possible forecast point. The forecast window specified when setting the partitioning options for the project determines how far into the future from the forecast point predictions should be calculated.
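A sketch of uploading a prediction dataset with an explicit forecast point and then requesting predictions (the file path is illustrative, and model is assumed to be a trained model from the project):

```python
from datetime import datetime

import datarobot as dr

# project is a time series dr.Project; model is a trained model in it
dataset = project.upload_dataset('/home/user/data/predictions.csv',
                                 forecast_point=datetime(2017, 1, 8))
predict_job = model.request_predictions(dataset.id)
predictions = predict_job.get_result_when_complete()
```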
When setting up a time series project, input features could be identified as a priori features. These features are not used to generate lags, and are expected to be known for the rows in the forecast window at predict time (e.g. “how much money will have been spent on marketing”, “is this a holiday”).
When uploading datasets to a time series project, the dataset might look something like the following, where “Time” is the datetime partition column, “Target” is the target column, and “Temp.” is an input feature. If the dataset was uploaded with a forecast point of “2017-01-08”, and during partitioning the feature derivation window start and end were set to -5 and -3 and the forecast window start and end were set to 1 and 3, then rows 1 through 3 are historical data, row 6 is the forecast point, and rows 7 through 9 are forecast rows that will have predictions when predictions are computed.
Row, Time, Target, Temp.
1, 2017-01-03, 16443, 72
2, 2017-01-04, 3013, 72
3, 2017-01-05, 1643, 68
4, 2017-01-06, ,
5, 2017-01-07, ,
6, 2017-01-08, ,
7, 2017-01-09, ,
8, 2017-01-10, ,
9, 2017-01-11, ,
On the other hand, if the project instead used “Holiday” as an a priori input feature, the uploaded dataset might look like the following:
Row, Time, Target, Holiday
1, 2017-01-03, 16443, TRUE
2, 2017-01-04, 3013, FALSE
3, 2017-01-05, 1643, FALSE
4, 2017-01-06, , FALSE
5, 2017-01-07, , FALSE
6, 2017-01-08, , FALSE
7, 2017-01-09, , TRUE
8, 2017-01-10, , FALSE
9, 2017-01-11, , FALSE
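The row roles in the tables above follow directly from the window arithmetic. A minimal sketch of that bookkeeping (plain Python, not part of the datarobot package):

```python
from datetime import date, timedelta

def classify_rows(times, forecast_point, fdw=(-5, -3), fw=(1, 3)):
    """Label each row by its day offset from the forecast point, using the
    feature derivation window (fdw) and forecast window (fw) boundaries."""
    labels = {}
    for t in times:
        offset = (t - forecast_point).days
        if fdw[0] <= offset <= fdw[1]:
            labels[t] = 'derivation'
        elif fw[0] <= offset <= fw[1]:
            labels[t] = 'forecast'
        elif offset == 0:
            labels[t] = 'forecast point'
        else:
            labels[t] = 'other'
    return labels

# the nine rows from the example dataset above
times = [date(2017, 1, 3) + timedelta(days=i) for i in range(9)]
labels = classify_rows(times, forecast_point=date(2017, 1, 8))
# rows 1-3 fall in the derivation window, rows 7-9 in the forecast window
```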
Blueprints¶
The set of computation paths that a dataset passes through before producing predictions from data is called a blueprint. A blueprint can be trained on a dataset to generate a model.
Quick Reference¶
The following code block summarizes the interactions available for blueprints.
# Get the set of blueprints recommended by DataRobot
import datarobot as dr
my_projects = dr.Project.list()
project = my_projects[0]
menu = project.get_blueprints()
first_blueprint = menu[0]
project.train(first_blueprint)
List Blueprints¶
When a file is uploaded to a project and the target is set, DataRobot
recommends a set of blueprints that are appropriate for the task at hand.
You can use the get_blueprints
method to get the list of blueprints recommended for a project:
project = dr.Project.get('5506fcd38bd88f5953219da0')
menu = project.get_blueprints()
blueprint = menu[0]
Get a blueprint¶
If you already have a blueprint_id
from a model, you can retrieve the blueprint directly.
project_id = '5506fcd38bd88f5953219da0'
project = dr.Project.get(project_id)
models = project.get_models()
model = models[0]
blueprint = dr.Blueprint.get(project_id, model.blueprint_id)
Get a blueprint chart¶
For any blueprint, whether it comes from the blueprint menu or is already used in a model, you can retrieve its chart. You can also get its representation in graphviz DOT format to render it into whatever format you need.
project_id = '5506fcd38bd88f5953219da0'
blueprint_id = '4321fcd38bd88f595321554223'
bp_chart = dr.BlueprintChart.get(project_id, blueprint_id)
print(bp_chart.to_graphviz())
Get blueprint documentation¶
You can retrieve documentation on the tasks used in a blueprint. It will contain information about
each task, its parameters and (when available) links and references to additional sources.
All documents are instances of BlueprintTaskDocument
class.
project_id = '5506fcd38bd88f5953219da0'
blueprint_id = '4321fcd38bd88f595321554223'
bp = dr.Blueprint.get(project_id, blueprint_id)
docs = bp.get_documents()
print(docs[0].task)
>>> Average Blend
print(docs[0].links[0]['url'])
>>> https://en.wikipedia.org/wiki/Ensemble_learning
Blueprint Attributes¶
The Blueprint
class holds the data required to use the blueprint
for modeling. This includes the blueprint_id
and project_id
.
There are also two attributes that help distinguish blueprints: model_type
and processes
.
print(blueprint.id)
>>> u'8956e1aeecffa0fa6db2b84640fb3848'
print(blueprint.project_id)
>>> u'5506fcd38bd88f5953219da0'
print(blueprint.model_type)
>>> Logistic Regression
print(blueprint.processes)
>>> [u'One-Hot Encoding',
u'Missing Values Imputed',
u'Standardize',
u'Logistic Regression']
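These attributes make it easy to filter a blueprint menu locally. A sketch using hypothetical in-memory stand-ins for Blueprint objects (the dicts below are illustrative, not real API objects):

```python
# hypothetical stand-ins for Blueprint objects, for illustration only
menu = [
    {'model_type': 'Logistic Regression',
     'processes': ['One-Hot Encoding', 'Missing Values Imputed',
                   'Standardize', 'Logistic Regression']},
    {'model_type': 'Gradient Boosted Trees Classifier',
     'processes': ['Ordinal encoding of categorical variables',
                   'Gradient Boosted Trees Classifier']},
]

# keep only blueprints whose preprocessing includes one-hot encoding
one_hot_blueprints = [bp for bp in menu
                      if 'One-Hot Encoding' in bp['processes']]
```

With real `Blueprint` instances the same filter would inspect `bp.processes` attributes instead of dict keys.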
Create a Model from a Blueprint¶
You can use a blueprint instance to train a model. The default dataset for the project is used.
model_job_id = project.train(blueprint, sample_pct=25)
This method will put a new modeling job into the queue and return the id of the created ModelJob. You can pass the ModelJob id to the wait_for_async_model_creation function, which polls the async model creation status and returns the newly created model when it is finished.
Models¶
When a blueprint has been trained on a specific dataset at a specified sample size, the result is a model. Models can be inspected to analyze their accuracy.
Quick Reference¶
# Get all models of an existing project
import datarobot as dr
my_projects = dr.Project.list()
project = my_projects[0]
models = project.get_models()
List Finished Models¶
You can use the get_models
method to return a list of the project models
that have finished training:
import datarobot as dr
project = dr.Project.get('5506fcd38bd88f5953219da0')
models = project.get_models()
print(models[:5])
>>> [Model(Decision Tree Classifier (Gini)),
Model(Auto-tuned K-Nearest Neighbors Classifier (Minkowski Distance)),
Model(Gradient Boosted Trees Classifier (R)),
Model(Gradient Boosted Trees Classifier),
Model(Logistic Regression)]
model = models[0]
project.id
>>> u'5506fcd38bd88f5953219da0'
model.id
>>> u'5506fcd98bd88f1641a720a3'
You can pass the following parameters to change the result:
- search_params – dict, used to filter the returned models. Currently you can query models by name and sample_pct.
- order_by – str or list, if passed, the returned models are ordered by this attribute or attributes.
- with_metric – str, if not None, the returned models will only have scores for this metric; otherwise all the metrics are returned.
List Models Example:
Project('pid').get_models(order_by=['-created_time', 'sample_pct', 'metric'])
# Getting models that contain "Ridge" in name
# and with sample_pct more than 64
Project('pid').get_models(
    search_params={
        'sample_pct__gt': 64,
        'name': "Ridge"
    })
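The double-underscore suffix follows the familiar Django-style lookup convention. A local sketch of those filtering semantics (illustrative only; how the server actually matches name, e.g. substring vs. exact match, is an assumption here):

```python
import operator

OPS = {'gt': operator.gt, 'gte': operator.ge,
       'lt': operator.lt, 'lte': operator.le}

def matches(model, search_params):
    """Evaluate Django-style lookups such as 'sample_pct__gt' against a dict."""
    for key, expected in search_params.items():
        field, _, op = key.partition('__')
        value = model[field]
        if op:                               # comparison lookup, e.g. sample_pct__gt
            if not OPS[op](value, expected):
                return False
        elif expected not in str(value):     # name matching by containment (assumed)
            return False
    return True

# hypothetical in-memory stand-ins for Model objects
models = [
    {'name': 'Ridge Regressor', 'sample_pct': 80},
    {'name': 'Ridge Regressor', 'sample_pct': 32},
    {'name': 'Logistic Regression', 'sample_pct': 80},
]
hits = [m for m in models
        if matches(m, {'sample_pct__gt': 64, 'name': 'Ridge'})]
```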
Retrieve a Known Model¶
If you know the model_id
and project_id
values of a model, you can
retrieve it directly:
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
model_id=model_id)
You can also use an instance of Project
as the parameter for get
model = dr.Model.get(project=project,
model_id=model_id)
Train a Model on a Different Sample Size¶
One of the key insights into a model and the data behind it is how its
performance varies with more training data.
In Autopilot mode, DataRobot will run at several sample sizes by default,
but you can also create a job that will run at a specific sample size.
You can also specify the featurelist and the scoring type that should be used to train the new model.
The train
method of a Model
instance will put a new modeling job into the queue and return the id of the created
ModelJob.
You can pass the ModelJob id to the wait_for_async_model_creation function,
which polls the async model creation status and returns the newly created model when it is finished.
model_job_id = model.train(sample_pct=33)
# retraining model on custom featurelist using cross validation
import datarobot as dr
model_job_id = model.train(
sample_pct=55,
featurelist_id=custom_featurelist.id,
scoring_type=dr.SCORING_TYPE.cross_validation,
)
Find the Features Used¶
Because each project can have many associated featurelists, it is important to know which features a model requires in order to run. This helps ensure that the necessary features are provided when generating predictions.
feature_names = model.get_features_used()
print(feature_names)
>>> ['MonthlyIncome',
'VisitsLast8Weeks',
'Age']
Feature Impact¶
Feature Impact measures how much worse a model’s error score would be if DataRobot made predictions after randomly shuffling a particular column (a technique sometimes called Permutation Importance).
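The shuffling idea can be sketched outside DataRobot with a toy model (illustrative only; this is not DataRobot's actual implementation or error metric):

```python
import random

def mean_abs_error(model, rows, targets):
    return sum(abs(model(r) - t) for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, n_features):
    """Impact of each feature: how much the error grows after shuffling it."""
    base_error = mean_abs_error(model, rows, targets)
    random.seed(0)
    impacts = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        shuffled = column[:]
        while shuffled == column:        # make sure the order actually changed
            random.shuffle(shuffled)
        permuted = [row[:j] + (value,) + row[j + 1:]
                    for row, value in zip(rows, shuffled)]
        impacts.append(mean_abs_error(model, permuted, targets) - base_error)
    return impacts

model = lambda row: 2.0 * row[0]         # toy model that only uses feature 0
rows = [(float(i), float(i % 3)) for i in range(20)]
targets = [model(row) for row in rows]
impacts = permutation_importance(model, rows, targets, n_features=2)
# shuffling feature 0 hurts the error; shuffling the ignored feature 1 does not
```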
The following example code snippet shows how a featurelist with just the features with the highest feature impact could be created.
import datarobot as dr
max_num_features = 10
time_to_wait_for_impact = 4 * 60 # seconds
try:
    feature_impacts = model.get_feature_impact()  # if they've already been computed
except dr.errors.ClientError as e:
    assert e.status_code == 404  # the feature impact scores haven't been computed yet
    impact_job = model.request_feature_impact()
    feature_impacts = impact_job.get_result_when_complete(time_to_wait_for_impact)
feature_impacts.sort(key=lambda x: x['impactNormalized'], reverse=True)
final_names = [f['featureName'] for f in feature_impacts[:max_num_features]]
project.create_featurelist('highest_impact', final_names)
Predict new data¶
After creating models you can use them to generate predictions on new data. See PredictJob for further information on how to request predictions from a model.
Model IDs Vs. Blueprint IDs¶
Each model has both an model_id
and a blueprint_id
. What is the difference between these two IDs?
A model is the result of training a blueprint on a dataset at a specified
sample percentage. The blueprint_id
is used to keep track of which
blueprint was used to train the model, while the model_id
is used to
locate the trained model in the system.
Model parameters¶
Some models have parameters that provide the data needed to reproduce their predictions.
For additional usage information, see the DataRobot documentation, section “Coefficients tab and pre-processing details”.
import datarobot as dr
model = dr.Model.get(project=project, model_id=model_id)
mp = model.get_parameters()
print(mp.derived_features)
>>> [{
'coefficient': -0.015,
'originalFeature': u'A1Cresult',
'derivedFeature': u'A1Cresult->7',
'type': u'CAT',
'transformations': [{'name': u'One-hot', 'value': u"'>7'"}]
}]
Create a Blender¶
You can blend multiple models; in many cases, the resulting blender model is more accurate
than the parent models. To do so, you need to select the parent models and a blender method from
datarobot.enums.BLENDER_METHOD
.
Be aware that the tradeoff for better prediction accuracy is greater resource consumption and slower predictions.
import datarobot as dr
pr = dr.Project.get(pid)
models = pr.get_models()
parent_models = [model.id for model in models[:2]]
pr.blend(parent_models, dr.enums.BLENDER_METHOD.AVERAGE)
Lift chart retrieval¶
You can use the Model
methods get_lift_chart
and get_all_lift_charts
to retrieve
lift chart data. The first gets data from a specific source (validation data, cross validation, or
holdout, if the holdout is unlocked), and the second lists all available data. Please refer to the
Advanced model information notebook for additional
information about lift charts and how they can be visualised.
ROC curve retrieval¶
As with the lift chart, you can use the Model
methods get_roc_curve
and
get_all_roc_curves
to retrieve ROC curve data. Please refer to the
Advanced model information notebook for additional
information about ROC curves and how they can be visualised. More information about working with ROC
curves can be found in the DataRobot web application documentation, section “ROC Curve tab details”.
Word Cloud¶
If your dataset contains text columns, DataRobot can create text processing models that
contain word cloud insight data. An example of such a model is any “Auto-Tuned Word N-Gram Text
Modeler” model. You can use the Model.get_word_cloud
method to retrieve those insights - it will
provide up to the 200 most important n-grams in the model and data about their influence.
The Advanced model information notebook contains
examples of how you can use that data to build a visualization similar to the one in the
DataRobot webapp.
Scoring Code¶
A subset of models in DataRobot supports code generation. For each of those models you can download
a JAR file with scoring code to make predictions locally using the method
Model.download_scoring_code
. For details on how to do that, see the “Code Generation” section in the
DataRobot web application documentation. Optionally, you can download the source code in Java to see
what calculations those models do internally.
Be aware that the source code JAR isn't compiled, so it cannot be used for making predictions.
Get a model blueprint chart¶
For any model, you can retrieve its blueprint chart. You can also get its representation in graphviz DOT format to render it into whatever format you need.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
model_id=model_id)
bp_chart = model.get_model_blueprint_chart()
print(bp_chart.to_graphviz())
Get a model’s Missing Values report¶
For many models, you can retrieve their Missing Values reports describing how missing values were handled for each numeric and categorical feature. A Missing Values report is collected for those features which are considered eligible for a given blueprint task. For instance, a categorical feature with a lot of unique values may not be considered eligible by the One-Hot encoding task.
Models need to have at least one numeric imputation or categorical converter task, like One-Hot Encoding, in order to have a Missing Values report, and some models, such as blenders and scaleout models, don't support Missing Values reports. Only models built after the feature was introduced will have Missing Values reports. Missing Values reports are only available to users with access to uncensored blueprints.
Please refer to the Missing report attributes description for help on the exact details of the report interface.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id, model_id=model_id)
missing_reports_per_feature = model.get_missing_report_info()
for report_per_feature in missing_reports_per_feature:
    print(report_per_feature)
Consider the following example. Given a Decision Tree Classifier (Gini) blueprint chart representation
print(blueprint_chart.to_graphviz())
>>> digraph "Blueprint Chart" {
graph [rankdir=LR]
0 [label="Data"]
-2 [label="Numeric Variables"]
2 [label="Missing Values Imputed"]
3 [label="Decision Tree Classifier (Gini)"]
4 [label="Prediction"]
-1 [label="Categorical Variables"]
1 [label="Ordinal encoding of categorical variables"]
0 -> -2
-2 -> 2
2 -> 3
3 -> 4
0 -> -1
-1 -> 1
1 -> 3
}
and a missing report
>>> for feature_report in model.get_missing_report_info():
...     print(feature_report)
...     print(feature_report.missing_count, feature_report.missing_percentage)
...     for task_report in feature_report.tasks:
...         print(task_report.id, task_report.name)
...         print(task_report.descriptions)
MissingReportPerFeature(feature=VehYear, type=Numeric)
(150, 50.0)
(u'2', u'Missing Values Imputed')
[u'Imputed value: 2006']
MissingReportPerFeature(feature=Model, type=Categorical)
(100, 33.33)
(u'1', u'Ordinal Encoding of categorical variables')
[u'Imputed value: -2']
The results can be interpreted in the following way:
The numeric feature “Veh Year” is missing in 150 rows, or 50% of training data. It was transformed by the “Missing Values Imputed” task with an imputed value of 2006. The task has an id of “2”, which the BlueprintChart shows goes into the “Decision Tree Classifier (Gini)” task.
The categorical feature “Model” was transformed by the “Ordinal encoding of categorical variables” task with an imputed value of -2, which the BlueprintChart shows also goes into the “Decision Tree Classifier (Gini)” task.
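The link between task ids in the missing report and nodes in the blueprint chart can be traced mechanically from the DOT edge list. A small sketch, with the edges transcribed from the chart shown above:

```python
# edges transcribed from the blueprint chart's DOT output
edges = [('0', '-2'), ('-2', '2'), ('2', '3'), ('3', '4'),
         ('0', '-1'), ('-1', '1'), ('1', '3')]

def upstream_of(node, edges):
    """Return the ids of nodes with an edge leading directly into `node`."""
    return sorted(src for src, dst in edges if dst == node)

feeders = upstream_of('3', edges)  # tasks feeding the Decision Tree Classifier node
# ['1', '2'] -- the two imputation/encoding tasks named in the missing report
```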
Get blueprint documentation¶
You can retrieve documentation on the tasks used to build a model. It will contain information about each task, its parameters and (when available) links and references to additional sources.
All documents are instances of BlueprintTaskDocument
class.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
model_id=model_id)
docs = model.get_model_blueprint_documents()
print(docs[0].task)
>>> Average Blend
print(docs[0].links[0]['url'])
>>> https://en.wikipedia.org/wiki/Ensemble_learning
Request training predictions¶
You can request a model’s predictions for a particular subset of its training data.
See datarobot.models.Model.request_training_predictions()
reference for all the valid subsets.
See training predictions reference for more details.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
model_id=model_id)
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.HOLDOUT)
training_predictions = training_predictions_job.get_result_when_complete()
for row in training_predictions.iterate_rows():
print(row.row_id, row.prediction)
Jobs¶
The Job (API reference) class is a generic representation of jobs running through a project’s queue. Many tasks involved in modeling, such as creating a new model or computing feature impact for a model, will use a job to track the worker usage and progress of the associated task.
Checking the Contents of the Queue¶
To see what jobs are running or waiting in the queue for a project, use the Project.get_all_jobs
method.
from datarobot.enums import QUEUE_STATUS
jobs_list = project.get_all_jobs()  # gives all jobs queued or in progress
jobs_by_type = {}
for job in jobs_list:
    if job.job_type not in jobs_by_type:
        jobs_by_type[job.job_type] = [0, 0]
    if job.status == QUEUE_STATUS.QUEUE:
        jobs_by_type[job.job_type][0] += 1
    else:
        jobs_by_type[job.job_type][1] += 1
for job_type, (num_queued, num_inprogress) in jobs_by_type.items():
    print('{} jobs: {} queued, {} inprogress'.format(job_type, num_queued, num_inprogress))
Cancelling a Job¶
If a job is taking too long to run or is no longer necessary, it can be cancelled easily from the
Job
object.
from datarobot.enums import QUEUE_STATUS
project.pause_autopilot()
bad_jobs = project.get_all_jobs(status=QUEUE_STATUS.QUEUE)
for job in bad_jobs:
    job.cancel()
project.unpause_autopilot()
Retrieving Results From a Job¶
Once you’ve found a particular job of interest, you can retrieve the results once it is complete.
Note that the type of the returned object will vary depending on the job_type
. All return types
are documented in Job.get_result
.
from datarobot.enums import JOB_TYPE
time_to_wait = 60 * 60 # how long to wait for the job to finish (in seconds) - i.e. an hour
assert my_job.job_type == JOB_TYPE.MODEL
my_model = my_job.get_result_when_complete(max_wait=time_to_wait)
ModelJobs¶
Model creation is an asynchronous process. This means that when explicitly invoking
new model creation (with project.train
or model.train
, for example) all you get back
is the id of the process responsible for model creation. With this id you can
get info about the model that is being created, or the model itself once the
creation process is finished. For this you should use the ModelJob
(API reference) class.
Get an existing ModelJob¶
To retrieve an existing ModelJob, use the ModelJob.get
method.
For this you need the id of the Project used for model
creation and the id of the ModelJob. Having the ModelJob might be useful if you want to
know the parameters of model creation, automatically chosen by the API backend,
before the actual model is created.
If the model has already been created, ModelJob.get
will raise a PendingJobFinished
exception.
import time
import datarobot as dr
blueprint_id = '5506fcd38bd88f5953219da0'
model_job_id = project.train(blueprint_id)
model_job = dr.ModelJob.get(project=project.id,
model_job_id=model_job_id)
model_job.sample_pct
>>> 64.0
# wait for model to be created (in a very inefficient way)
time.sleep(10 * 60)
model_job = dr.ModelJob.get(project=project.id,
model_job_id=model_job_id)
>>> datarobot.errors.PendingJobFinished
Get created model¶
After the model is created, you can use ModelJob.get_model to get the newly created model.
import datarobot as dr
model = dr.ModelJob.get_model(project=project.id,
model_job_id=model_job_id)
wait_for_async_model_creation function¶
If you just want to get the created model after obtaining the ModelJob id, you can use the wait_for_async_model_creation function. It will poll the status of the model creation process until it is finished, and then return the newly created model.
from datarobot.models.modeljob import wait_for_async_model_creation
# used during training based on blueprint
model_job_id = project.train(blueprint, sample_pct=33)
new_model = wait_for_async_model_creation(
project_id=project.id,
model_job_id=model_job_id,
)
# used during training based on existing model
model_job_id = existing_model.train(sample_pct=33)
new_model = wait_for_async_model_creation(
project_id=existing_model.project_id,
model_job_id=model_job_id,
)
Predictions¶
Prediction generation is an asynchronous process. This means that when starting
predictions with Model.request_predictions
, you will receive a PredictJob for tracking
the process responsible for fulfilling your request.
With this object you can get info about the prediction generation process before it has finished, and then retrieve the predictions themselves once the process is finished. For this you should use the PredictJob (API reference) class.
Starting predictions generation¶
Before actually requesting predictions, you should upload the dataset you wish to predict via
Project.upload_dataset
. Previously uploaded datasets can be seen under Project.get_datasets
.
When uploading the dataset you can provide the path to a local file, a file object, raw file content,
a pandas.DataFrame
object, or the url to a publicly available dataset.
To start predicting on new data using a finished model use Model.request_predictions
.
It will create a new predictions generation process and return a PredictJob object tracking this process.
With it, you can monitor an existing PredictJob and retrieve generated predictions when the corresponding
PredictJob is finished.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
project = dr.Project.get(project_id)
model = dr.Model.get(project=project_id,
model_id=model_id)
# Using path to local file to generate predictions
dataset_from_path = project.upload_dataset('./data_to_predict.csv')
# Using file object to generate predictions
with open('./data_to_predict.csv') as data_to_predict:
    dataset_from_file = project.upload_dataset(data_to_predict)
predict_job_1 = model.request_predictions(dataset_from_path.id)
predict_job_2 = model.request_predictions(dataset_from_file.id)
Get an existing PredictJob¶
To retrieve an existing PredictJob use the PredictJob.get
method. This will give you
a PredictJob matching the latest status of the job if it has not completed.
If predictions have finished building, PredictJob.get
will raise a PendingJobFinished
exception.
import time
import datarobot as dr
predict_job = dr.PredictJob.get(project_id=project_id,
predict_job_id=predict_job_id)
predict_job.status
>>> 'queue'
# wait for generation of predictions (in a very inefficient way)
time.sleep(10 * 60)
predict_job = dr.PredictJob.get(project_id=project_id,
predict_job_id=predict_job_id)
>>> dr.errors.PendingJobFinished
# now the predictions are finished
predictions = dr.PredictJob.get_predictions(project_id=project.id,
predict_job_id=predict_job_id)
Get generated predictions¶
After predictions are generated, you can use PredictJob.get_predictions
to get newly generated predictions.
If predictions have not yet finished, it will raise a JobNotFinished
exception.
import datarobot as dr
predictions = dr.PredictJob.get_predictions(project_id=project.id,
predict_job_id=predict_job_id)
Wait for and Retrieve results¶
If you just want to get generated predictions from a PredictJob, you
can use the PredictJob.get_result_when_complete
function.
It will poll the status of the prediction generation process until it has finished, and
then return the predictions.
dataset = project.get_datasets()[0]
predict_job = model.request_predictions(dataset.id)
predictions = predict_job.get_result_when_complete()
DataRobot Prime¶
DataRobot Prime is a premium feature intended to allow downloading executable code approximating models. If the feature is unavailable to you, please contact your Account Representative. For more information about this feature, see the documentation within the DataRobot webapp.
Approximate a Model¶
Given a Model you wish to approximate, Model.request_approximation
will start a job creating
several Ruleset
objects approximating the parent model. Each of those rulesets will identify
how many rules were used to approximate the model, as well as the validation score
the approximation achieved.
rulesets_job = model.request_approximation()
rulesets = rulesets_job.get_result_when_complete()
for ruleset in rulesets:
    info = (ruleset.id, ruleset.rule_count, ruleset.score)
    print('id: {}, rule_count: {}, score: {}'.format(*info))
Prime Models vs. Models¶
Given a ruleset, you can create a model based on that ruleset. We consider such models to be Prime
models. The PrimeModel
class inherits from the Model
class, so anything a Model can do,
a PrimeModel can do as well.
The PrimeModel
objects available within a Project
can be listed by
project.get_prime_models
, or a particular one can be retrieved via PrimeModel.get
. If a
ruleset has not yet had a model built for it, ruleset.request_model
can be used to start
a job to make a PrimeModel using a particular ruleset.
rulesets = parent_model.get_rulesets()
selected_ruleset = sorted(rulesets, key=lambda x: x.score)[-1]
if selected_ruleset.model_id:
    prime_model = PrimeModel.get(selected_ruleset.project_id, selected_ruleset.model_id)
else:
    prime_job = selected_ruleset.request_model()
    prime_model = prime_job.get_result_when_complete()
The PrimeModel
class has two additional attributes and one additional method. The attributes
are ruleset
, which is the Ruleset used in the PrimeModel, and parent_model_id
which is
the id of the model it approximates.
Finally, the new method defined is request_download_validation
which is used to prepare code
download for the model and is discussed later on in this document.
Retrieving Code from a PrimeModel¶
Given a PrimeModel, you can download the code used to approximate the parent model, and view and execute it locally.
The first step is to validate the PrimeModel, which runs some basic validation of the generated
code, as well as preparing it for download. We use the PrimeFile
object to represent code
that is ready to download. PrimeFiles
can be prepared by the request_download_validation
method on PrimeModel
objects, and listed from a project with the get_prime_files
method.
Once you have a PrimeFile
you can check the is_valid
attribute to verify the code passed
basic validation, and then download it to a local file with download
.
validation_job = prime_model.request_download_validation(enums.PRIME_LANGUAGE.PYTHON)
prime_file = validation_job.get_result_when_complete()
if not prime_file.is_valid:
    raise ValueError('File was not valid')
prime_file.download('/home/myuser/drCode/primeModelCode.py')
Reason Codes¶
To compute reason codes, you need to have feature impact computed for the model, and predictions for an uploaded dataset computed with the selected model.
Computing reason codes is a resource-intensive task, but you can configure it with a maximum number of codes and prediction value thresholds to speed up the process.
Quick Reference¶
import datarobot as dr
# Get project
my_projects = dr.Project.list()
project = my_projects[0]
# Get model
models = project.get_models()
model = models[0]
# Compute feature impact
impact_job = model.request_feature_impact()
impact_job.wait_for_completion()
# Upload dataset
dataset = project.upload_dataset('./data_to_predict.csv')
# Compute predictions
predict_job = model.request_predictions(dataset.id)
predict_job.wait_for_completion()
# Initialize reason codes
rci_job = dr.ReasonCodesInitialization.create(project.id, model.id)
rci_job.wait_for_completion()
# Compute reason codes with default parameters
rc_job = dr.ReasonCodes.create(project.id, model.id, dataset.id)
rc = rc_job.get_result_when_complete()
# Iterate through predictions with reason codes
for row in rc.get_rows():
    print(row.prediction)
    print(row.reason_codes)
# download to a CSV file
rc.download_to_csv('reason_codes.csv')
List Reason Codes¶
You can use the ReasonCodes.list()
method to return a list of reason codes computed for
a project’s models:
import datarobot as dr
reason_codes = dr.ReasonCodes.list('58591727100d2b57196701b3')
print(reason_codes)
>>> [ReasonCodes(id=585967e7100d2b6afc93b13b,
project_id=58591727100d2b57196701b3,
model_id=585932c5100d2b7c298b8acf),
ReasonCodes(id=58596bc2100d2b639329eae4,
project_id=58591727100d2b57196701b3,
model_id=585932c5100d2b7c298b8ac5),
ReasonCodes(id=58763db4100d2b66759cc187,
project_id=58591727100d2b57196701b3,
model_id=585932c5100d2b7c298b8ac5),
...]
rc = reason_codes[0]
rc.project_id
>>> u'58591727100d2b57196701b3'
rc.model_id
>>> u'585932c5100d2b7c298b8acf'
You can pass the following parameters to filter the result:
- model_id – str, used to filter the returned reason codes by model_id.
- limit – int, limit for the number of items returned; default: no limit.
- offset – int, number of items to skip; default: 0.
List Reason Codes Example:
dr.ReasonCodes.list('pid', model_id='model_id', limit=20, offset=100)
Initialize Reason Codes¶
In order to compute reason codes, you first have to initialize them for a particular model.
dr.ReasonCodesInitialization.create(project_id, model_id)
Compute Reason Codes¶
If all prerequisites are in place, you can compute reason codes in the following way:
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
dataset_id = '5506fcd98bd88a8142b725c8'
rc_job = dr.ReasonCodes.create(project_id, model_id, dataset_id,
max_codes=2, threshold_low=0.2, threshold_high=0.8)
rc = rc_job.get_result_when_complete()
Where:
- max_codes is the maximum number of reason codes to compute for each row.
- threshold_low and threshold_high are thresholds for the value of the prediction of the row. Reason codes will be computed for a row if the row's prediction value is higher than threshold_high or lower than threshold_low. If no thresholds are specified, reason codes will be computed for all rows.
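The thresholding rule can be stated compactly. A sketch of the row-selection logic (illustrative, not the client's actual code):

```python
def needs_reason_codes(prediction, threshold_low=None, threshold_high=None):
    """Reason codes are computed for a row when its prediction falls below
    threshold_low or above threshold_high; with no thresholds, for every row."""
    if threshold_low is None and threshold_high is None:
        return True
    if threshold_low is not None and prediction < threshold_low:
        return True
    if threshold_high is not None and prediction > threshold_high:
        return True
    return False

selected = [p for p in (0.1, 0.5, 0.9) if needs_reason_codes(p, 0.2, 0.8)]
# only the confident predictions near 0 or 1 are selected
```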
Retrieving Reason Codes¶
You have three options for retrieving reason codes.
Note
ReasonCodes.get_all_as_dataframe()
and ReasonCodes.download_to_csv()
reformat
reason codes to match the schema of the CSV file downloaded from the UI (RowId, Prediction,
Reason 1 Strength, Reason 1 Feature, Reason 1 Value, ..., Reason N Strength,
Reason N Feature, Reason N Value)
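A sketch of that flattening, with hypothetical field names for the per-code records (the real row objects expose attributes, not these dict keys):

```python
def flatten_row(row_id, prediction, reason_codes, max_codes):
    """Reshape one row's reason codes into the flat schema:
    RowId, Prediction, Reason 1 Strength, Reason 1 Feature, Reason 1 Value, ..."""
    flat = [row_id, prediction]
    for i in range(max_codes):
        if i < len(reason_codes):
            code = reason_codes[i]       # field names here are illustrative
            flat += [code['strength'], code['feature'], code['value']]
        else:
            flat += [None, None, None]   # pad rows that have fewer codes
    return flat

row = flatten_row(0, 0.93, [{'strength': 0.4, 'feature': 'Age', 'value': 35}], 2)
# [0, 0.93, 0.4, 'Age', 35, None, None, None]
```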
Get reason codes rows one by one as dr.models.reason_codes.ReasonCodesRow
objects:
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
reason_codes_id = '5506fcd98bd88f1641a720a3'
rc = dr.ReasonCodes.get(project_id, reason_codes_id)
for row in rc.get_rows():
    print(row.reason_codes)
Get all rows as a pandas.DataFrame:
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
reason_codes_id = '5506fcd98bd88f1641a720a3'
rc = dr.ReasonCodes.get(project_id, reason_codes_id)
reason_codes_df = rc.get_all_as_dataframe()
Download all rows to a file as a CSV document:
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
reason_codes_id = '5506fcd98bd88f1641a720a3'
rc = dr.ReasonCodes.get(project_id, reason_codes_id)
rc.download_to_csv('reason_codes.csv')
Adjusted Predictions In Reason Codes¶
In some projects, such as insurance projects, the prediction adjusted by exposure is more useful than the raw prediction. For example, in a project with an exposure column, the raw prediction (e.g. claim counts) is divided by the exposure (e.g. time), so the adjusted prediction provides insight into the predicted claim counts per unit of time. To include that information, set exclude_adjusted_predictions to False in the corresponding method calls.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
reason_codes_id = '5506fcd98bd88f1641a720a3'
rc = dr.ReasonCodes.get(project_id, reason_codes_id)
rc.download_to_csv('reason_codes.csv', exclude_adjusted_predictions=False)
reason_codes_df = rc.get_all_as_dataframe(exclude_adjusted_predictions=False)
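The adjustment itself is simple arithmetic. A sketch of the claims-per-unit-of-exposure idea (toy numbers, for illustration):

```python
raw_claim_counts = [4.0, 9.0, 2.0]   # raw predictions
exposure_years = [2.0, 3.0, 0.5]     # exposure recorded for each row

# adjusted prediction: predicted claims per unit of exposure (per year here)
adjusted = [raw / exposure
            for raw, exposure in zip(raw_claim_counts, exposure_years)]
# [2.0, 3.0, 4.0]
```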
Rating Table¶
A rating table is an exportable CSV representation of a Generalized Additive Model. It contains information about the features and coefficients used to make predictions. Users can influence predictions by downloading a rating table and editing its values, then re-uploading the table and using it to create a new model.
See the page about interpreting Generalized Additive Models' output in the DataRobot user guide for more details on how to interpret and edit rating tables.
Download A Rating Table¶
You can retrieve a rating table from the list of rating tables in a project:
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
project = dr.Project.get(project_id)
rating_tables = project.get_rating_tables()
rating_table = rating_tables[0]
Or you can retrieve a rating table from a specific model. The model must already exist:
import datarobot as dr
from datarobot.models import RatingTableModel, RatingTable
project_id = '5506fcd38bd88f5953219da0'
project = dr.Project.get(project_id)
# Get model from list of models with a rating table
rating_table_models = project.get_rating_table_models()
rating_table_model = rating_table_models[0]
# Or retrieve model by id. The model must have a rating table.
model_id = '5506fcd98bd88f1641a720a3'
rating_table_model = dr.RatingTableModel.get(project=project_id, model_id=model_id)
# Then retrieve the rating table from the model
rating_table_id = rating_table_model.rating_table_id
rating_table = dr.RatingTable.get(project_id, rating_table_id)
Then you can download the contents of the rating table:
rating_table.download('./my_rating_table.csv')
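Between downloading and re-uploading, the rating table can be edited with ordinary CSV tooling. A minimal standard-library sketch; the column names below are invented stand-ins, not DataRobot's actual rating table schema (see the user guide for the real format):

```python
import csv

# Create a stand-in file so the sketch is self-contained; in practice this
# would be the CSV downloaded via rating_table.download(...).
with open('my_rating_table.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['feature', 'coefficient'])
    writer.writeheader()
    writer.writerows([{'feature': 'age', 'coefficient': '0.5'},
                      {'feature': 'income', 'coefficient': '1.2'}])

# Read the table, adjust one coefficient, and write the edited table back.
with open('my_rating_table.csv', newline='') as f:
    rows = list(csv.DictReader(f))
for row in rows:
    if row['feature'] == 'age':
        row['coefficient'] = '0.7'
with open('my_rating_table.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['feature', 'coefficient'])
    writer.writeheader()
    writer.writerows(rows)
```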
Uploading A Rating Table¶
After you’ve retrieved the rating table CSV and made the necessary edits, you can re-upload the CSV so you can create a new model from it:
job = dr.RatingTable.create(project_id, model_id, './my_rating_table.csv')
new_rating_table = job.get_result_when_complete()
job = new_rating_table.create_model()
model = job.get_result_when_complete()
Training Predictions¶
The training predictions interface allows computing and retrieving out-of-sample predictions for a model using the original project dataset. The predictions can be computed for all rows, or restricted to the validation or holdout data. Because the generated predictions are out-of-sample, they can be expected to differ from the results of re-uploading the project dataset as a prediction dataset.
Quick reference¶
Training predictions generation is an asynchronous process. This means that when starting
predictions with datarobot.models.Model.request_training_predictions()
you will receive back a
datarobot.models.TrainingPredictionsJob
for tracking the process responsible for fulfilling your request.
Actual predictions may be obtained with the help of a
datarobot.models.training_predictions.TrainingPredictions
object returned as the result of
the training predictions job.
There are three ways to retrieve them:
- Iterate prediction rows one by one as named tuples:
import datarobot as dr
# Calculate new training predictions on the entire dataset
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.ALL)
training_predictions = training_predictions_job.get_result_when_complete()
# Fetch rows from API and print them
for prediction in training_predictions.iterate_rows(batch_size=250):
    print(prediction.row_id, prediction.prediction)
- Get all prediction rows as a
pandas.DataFrame
object:
import datarobot as dr
# Calculate new training predictions on the holdout partition of the dataset
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.HOLDOUT)
training_predictions = training_predictions_job.get_result_when_complete()
# Fetch training predictions as data frame
dataframe = training_predictions.get_all_as_dataframe()
- Download all prediction rows to a file as a CSV document:
import datarobot as dr
# Calculate new training predictions on the entire dataset
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.ALL)
training_predictions = training_predictions_job.get_result_when_complete()
# Fetch training predictions and save them to file
training_predictions.download_to_csv('my-training-predictions.csv')
Model Deployment¶
Model deployments are records created when a user deploys a model to a dedicated prediction cluster.
Warning
This interface is now deprecated and will be removed in the v2.13 release of the DataRobot client.
Warning
The Model Deployments feature is in a beta state and requires additional configuration for proper usage. Please contact Support or your CFDS for help with the setup and usage of model deployment functionality.
Warning
Users can still make predictions using models which have NOT been deployed. Deployment, in the current state of the system, only means creating database records with which monitoring data is then associated. In other words, users cannot access monitoring info for predictions made with models that have no associated model deployment record.
Creating Model Deployment¶
To create a new ModelDeployment, we need the project_id and model_id of the model we want to deploy. If we are creating a ModelDeployment for a Model that is deployed to a dedicated prediction instance, we also need the instance_id of that instance. We create the new ModelDeployment with ModelDeployment.create, which requires a readable label; it can also take a custom description and status.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
instance_id = '5a8d4bf9962d7415f7cce05a'
label = 'New Model Deployment'
model_deployment = dr.ModelDeployment.create(label=label, model_id=model_id,
project_id=project_id, instance_id=instance_id)
print(model_deployment.id)
>>> '5a8eabe8962d743607c5009'
Get list of Model Deployments¶
To retrieve a list of all ModelDeployment items, we use ModelDeployment.list. The list can be queried using the query parameter, ordered using order_by, and filtered using status. We can also slice the results using the limit and offset parameters.
import datarobot as dr
model_deployments = dr.ModelDeployment.list()
print(model_deployments)
>>> [<datarobot.models.model_deployment.ModelDeployment object at 0x7efebf513c10>,
<datarobot.models.model_deployment.ModelDeployment object at 0x7efebf513a50>,
<datarobot.models.model_deployment.ModelDeployment object at 0x7efebf513ad0>]
Get single ModelDeployment¶
To get a ModelDeployment instance, we use ModelDeployment.get with model_deployment_id as an argument.
import datarobot as dr
model_deployment_id = '5a8eabe8962d743607c5009'
model_deployment = dr.ModelDeployment.get(model_deployment_id)
print(model_deployment.service_health_messages)
>>> [{'message': 'No successful predictions in 24 hours', 'msg_id': 'NO_GOOD_REQUESTS', 'level': 'passing'}]
When we have an instance of ModelDeployment, we can update its label, description, or status. You can choose a status value from datarobot.enums.MODEL_DEPLOYMENT_STATUS:
from datarobot.enums import MODEL_DEPLOYMENT_STATUS
model_deployment.update(label='Old deployment', description='Deactivated model deployment',
status=MODEL_DEPLOYMENT_STATUS.ARCHIVED)
We can also get the service health of a ModelDeployment instance using the get_service_statistics method. It accepts start_date and end_date as optional parameters for setting the period of the statistics:
model_deployment.get_service_statistics(start_date='2017-01-01')
>>> {'consumers': 0,
'load': {'median': 0.0, 'peak': 0.0},
'period': {'end': datetime.datetime(2018, 2, 22, 12, 5, 40, 764294, tzinfo=tzutc()),
'start': datetime.datetime(2017, 1, 1, 0, 0, tzinfo=tzutc())},
'server_error_rate': {'current': 0.0, 'previous': 0.0},
'total_requests': 0,
'user_error_rate': {'current': 0.0, 'previous': 0.0}}
The history of a ModelDeployment instance is available via the action_log method:
model_deployment.action_log()
>>> [{'action': 'created',
'performed_at': datetime.datetime(2018, 2, 21, 12, 4, 5, 804305),
'performed_by': {'id': '5a86c0e0e7c354c960cd0540',
'username': 'user@datarobot.com'}},
{'action': 'deployed',
'performed_at': datetime.datetime(2018, 2, 22, 11, 39, 20, 34000),
'performed_by': {'id': '5a86c0e0e7c354c960cd0540',
'username': 'user@datarobot.com'}}]
Monotonic Constraints¶
Training with monotonic constraints allows users to force models to learn monotonic relationships with respect to some features and the target. This helps users create accurate models that comply with regulations (e.g. in insurance and banking). Currently, only certain blueprints (e.g. xgboost) support this feature, and it is only supported for regression and binary classification projects. Working with monotonic constraints typically follows one of two workflows:
Workflow one - Running a project with default monotonic constraints
- set the target and specify default constraint lists for the project
- when running autopilot or manually training models without overriding constraint settings, all blueprints that support monotonic constraints will use the specified default constraint featurelists
Workflow two - Running a model with specific monotonic constraints
- create featurelists for monotonic constraints
- train a blueprint that supports monotonic constraints while specifying monotonic constraint featurelists
- the specified constraints will be used, regardless of the defaults on the blueprint
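The property being enforced can be stated as a simple predicate: for any two inputs x1 <= x2 on a constrained feature, the model's predictions satisfy f(x1) <= f(x2). A self-contained sketch of that check, with invented prediction values:

```python
def is_monotonic_increasing(values):
    """True if each value is >= the previous one (ties allowed)."""
    return all(a <= b for a, b in zip(values, values[1:]))

# Predictions sampled along increasing values of a constrained feature.
predictions = [0.10, 0.12, 0.12, 0.35]
print(is_monotonic_increasing(predictions))         # True
print(is_monotonic_increasing([0.10, 0.08, 0.35]))  # False
```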
Creating featurelists¶
When specifying monotonic constraints, users must pass a reference to a featurelist containing only the features to be constrained, one for features that should monotonically increase with the target and another for those that should monotonically decrease with the target.
import datarobot as dr
project = dr.Project.get(project_id)
features_mono_up = ['feature_0', 'feature_1'] # features that have monotonically increasing relationship with target
features_mono_down = ['feature_2', 'feature_3'] # features that have monotonically decreasing relationship with target
flist_mono_up = project.create_featurelist(name='mono_up',
features=features_mono_up)
flist_mono_down = project.create_featurelist(name='mono_down',
features=features_mono_down)
Specify default monotonic constraints for a project¶
When setting the target, the user can specify default monotonic constraints for the project, to ensure that autopilot models use the desired settings, and optionally to ensure that only blueprints supporting monotonic constraints appear in the project. Regardless of the defaults specified during target selection, the user can override them when manually training a particular model.
import datarobot as dr
from datarobot.enums import AUTOPILOT_MODE
advanced_options = dr.AdvancedOptions(
monotonic_increasing_featurelist_id=flist_mono_up.id,
monotonic_decreasing_featurelist_id=flist_mono_down.id,
only_include_monotonic_blueprints=True)
project = dr.Project.get(project_id)
project.set_target(target='target', mode=AUTOPILOT_MODE.FULL_AUTO, advanced_options=advanced_options)
Retrieve models and blueprints using monotonic constraints¶
When retrieving models, users can inspect them to see which support monotonic constraints, and which actually enforce them. Some models will not support monotonic constraints at all, and some may support constraints but not have any constrained features specified.
import datarobot as dr
project = dr.Project.get(project_id)
models = project.get_models()
# retrieve models that support monotonic constraints
models_support_mono = [model for model in models if model.supports_monotonic_constraints]
# retrieve models that support and enforce monotonic constraints
models_enforce_mono = [model for model in models
if (model.monotonic_increasing_featurelist_id or
model.monotonic_decreasing_featurelist_id)]
When retrieving blueprints, users can check whether they support monotonic constraints and see which default constraint featurelists are associated with them. The monotonic featurelist ids associated with a blueprint will be used every time it is trained, unless the user specifically overrides them at model submission time.
import datarobot as dr
project = dr.Project.get(project_id)
blueprints = project.get_blueprints()
# retrieve blueprints that support monotonic constraints
blueprints_support_mono = [blueprint for blueprint in blueprints if blueprint.supports_monotonic_constraints]
# retrieve blueprints that support and enforce monotonic constraints
blueprints_enforce_mono = [blueprint for blueprint in blueprints
if (blueprint.monotonic_increasing_featurelist_id or
blueprint.monotonic_decreasing_featurelist_id)]
Train a model with specific monotonic constraints¶
Even after specifying default settings for the project, users can override them to train a new model with different constraints, if desired.
import datarobot as dr
features_mono_up = ['feature_2', 'feature_3'] # features that have monotonically increasing relationship with target
features_mono_down = ['feature_0', 'feature_1'] # features that have monotonically decreasing relationship with target
project = dr.Project.get(project_id)
flist_mono_up = project.create_featurelist(name='mono_up',
features=features_mono_up)
flist_mono_down = project.create_featurelist(name='mono_down',
features=features_mono_down)
# `blueprint` and `featurelist` are assumed to come from earlier steps,
# e.g. project.get_blueprints() and project.get_featurelists()
model_job_id = project.train(
blueprint,
sample_pct=55,
featurelist_id=featurelist.id,
monotonic_increasing_featurelist_id=flist_mono_up.id,
monotonic_decreasing_featurelist_id=flist_mono_down.id
)
Database Connectivity¶
Databases are a widely used tool for carrying valuable business data. To enable integration with a variety of enterprise databases, DataRobot provides a “self-service” JDBC platform for database connectivity setup. Once configured, you can read data from production databases for model building and predictions. This allows you to quickly train and retrain models on that data, and avoids the unnecessary step of exporting data from your enterprise database to a CSV for ingest to DataRobot. It allows access to more diverse data, which results in more accurate models.
The steps describing how to set up your database connections use the following terminology:
DataStore: A configured connection to a database. It has a name, a specified driver, and a JDBC URL. You can register data stores with DataRobot for ease of re-use. A data store has one connector but can have many data sources.
DataSource: A configured connection to the backing data store (the location of data within a given endpoint). A data source specifies, via SQL query or selected table and schema data, which data to extract from the data store to use for modeling or predictions. A data source has one data store and one connector but can have many datasets.
DataDriver: The software that allows the DataRobot application to interact with a database; each data store is associated with one driver (created by the admin). The driver configuration saves the storage location in DataRobot of the JAR file and any additional dependency files associated with the driver.
Dataset: Data, a file or the content of a data source, at a particular point in time. A data source can produce multiple datasets; a dataset has exactly one data source.
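The cardinalities among these entities can be sketched as plain records; everything here (ids, field names) is invented for illustration and is not the client's actual data model:

```python
# Invented ids and field names; only the shape of the relationships matters:
# one driver per data store, many data sources per data store, and many
# datasets per data source.
driver = {'id': 'drv-1', 'class_name': 'org.postgresql.Driver'}
data_store = {'id': 'store-1', 'driver_id': driver['id']}
data_sources = [
    {'id': 'src-1', 'data_store_id': data_store['id']},
    {'id': 'src-2', 'data_store_id': data_store['id']},
]
datasets = [
    {'id': 'dset-1', 'data_source_id': 'src-1'},
    {'id': 'dset-2', 'data_source_id': 'src-1'},
]
```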
The expected workflow when setting up projects or prediction datasets is:
- The administrator sets up a datarobot.DataDriver for accessing a particular database. For any particular driver, this setup is done once for the entire system and then the resulting driver is used by all users.
- Users create a datarobot.DataStore, which represents an interface to a particular database, using that driver.
- Users create a datarobot.DataSource representing a particular set of data to be extracted from the DataStore.
- Users create projects and prediction datasets from a DataSource.
Besides the described workflow for creating projects and prediction datasets, users can manage their DataStores and DataSources, and admins can manage drivers, by listing, retrieving, updating, and deleting existing instances.
Cloud users: This feature is turned off by default. To enable the feature, contact your CFDS or DataRobot Support.
Creating Drivers¶
The admin should specify class_name, the name of the Java class in the Java archive which implements the java.sql.Driver interface; canonical_name, a user-friendly name for the resulting driver to display in the API and the GUI; and files, a list of local files which contain the driver.
>>> import datarobot as dr
>>> driver = dr.DataDriver.create(
... class_name='org.postgresql.Driver',
... canonical_name='PostgreSQL',
... files=['/tmp/postgresql-42.2.2.jar']
... )
>>> driver
DataDriver('PostgreSQL')
Creating DataStores¶
After the admin has created drivers, any user can use them for DataStore creation. A DataStore represents a JDBC database. When creating one, users should specify type, which currently must be jdbc; canonical_name, a user-friendly name to display in the API and GUI for the DataStore; driver_id, the id of the driver to use to connect to the database; and jdbc_url, the full URL specifying the database connection settings such as database type, server address, port, and database name.
>>> import datarobot as dr
>>> data_store = dr.DataStore.create(
... data_store_type='jdbc',
... canonical_name='Demo DB',
... driver_id='5a6af02eb15372000117c040',
... jdbc_url='jdbc:postgresql://my.db.address.org:5432/perftest'
... )
>>> data_store
DataStore('Demo DB')
>>> data_store.test(username='username', password='password')
{'message': 'Connection successful'}
Creating DataSources¶
Once users have a DataStore, they can query datasets via the DataSource entity, which represents a query. When creating a DataSource, users first create a datarobot.DataSourceParameters object from a DataStore's id and a query, and then create the DataSource with a type, currently always jdbc; a canonical_name, the user-friendly name to display in the API and GUI; and params, the DataSourceParameters object.
>>> import datarobot as dr
>>> params = dr.DataSourceParameters(
... data_store_id='5a8ac90b07a57a0001be501e',
... query='SELECT * FROM airlines10mb WHERE "Year" >= 1995;'
... )
>>> data_source = dr.DataSource.create(
... data_source_type='jdbc',
... canonical_name='airlines stats after 1995',
... params=params
... )
>>> data_source
DataSource('airlines stats after 1995')
Creating Projects¶
Given a DataSource, users can create new projects from it.
>>> import datarobot as dr
>>> project = dr.Project.create_from_data_source(
... data_source_id='5ae6eee9962d740dd7b86886',
... username='username',
... password='password'
... )
Creating Predictions¶
Given a DataSource, new prediction datasets can be created for any project.
>>> import datarobot as dr
>>> project = dr.Project.get('5ae6f296962d740dd7b86887')
>>> prediction_dataset = project.upload_dataset_from_data_source(
... data_source_id='5ae6eee9962d740dd7b86886',
... username='username',
... password='password'
... )
API Reference¶
Project API¶
-
class
datarobot.models.
Project
(id=None, project_name=None, mode=None, target=None, target_type=None, holdout_unlocked=None, metric=None, stage=None, partition=None, positive_class=None, created=None, advanced_options=None, recommender=None, max_train_pct=None, max_train_rows=None, scaleout_max_train_pct=None, scaleout_max_train_rows=None, file_name=None)¶ A project built from a particular training dataset
Attributes
id (str) the id of the project
project_name (str) the name of the project
mode (int) the autopilot mode currently selected for the project - 0 for full autopilot, 1 for semi-automatic, and 2 for manual
target (str) the name of the selected target features
target_type (str) indicates what kind of modeling is being done in this project; options are: 'Regression', 'Binary' (binary classification), 'Multiclass' (multiclass classification)
holdout_unlocked (bool) whether the holdout has been unlocked
metric (str) the selected project metric (e.g. LogLoss)
stage (str) the stage the project has reached - one of datarobot.enums.PROJECT_STAGE
partition (dict) information about the selected partitioning options
positive_class (str) for binary classification projects, the selected positive class; otherwise, None
created (datetime) the time the project was created
advanced_options (dict) information on the advanced options that were selected for the project settings, e.g. a weights column or a cap on the runtime of models that can advance autopilot stages
recommender (dict) information on the recommender settings of the project (i.e. whether it is a recommender project, or the id columns)
max_train_pct (float) the maximum percentage of the project dataset that can be used without going into the validation data or being too large to submit any blueprint for training
max_train_rows (int) the maximum number of rows that can be trained on without going into the validation data or being too large to submit any blueprint for training
scaleout_max_train_pct (float) the maximum percentage of the project dataset that can be used to successfully train a scaleout model without going into the validation data; may exceed max_train_pct, in which case only scaleout models can be trained up to this point
scaleout_max_train_rows (int) the maximum number of rows that can be used to successfully train a scaleout model without going into the validation data; may exceed max_train_rows, in which case only scaleout models can be trained up to this point
file_name (str) the name of the file uploaded for the project dataset
-
classmethod
get
(project_id)¶ Gets information about a project.
Parameters: project_id : str
The identifier of the project you want to load.
Returns: project : Project
The queried project
Examples
import datarobot as dr
p = dr.Project.get(project_id='54e639a18bd88f08078ca831')
p.id
>>> '54e639a18bd88f08078ca831'
p.project_name
>>> 'Some project name'
-
classmethod
create
(sourcedata, project_name='Untitled Project', max_wait=600, read_timeout=600)¶ Creates a project with provided data.
Project creation is an asynchronous process: after the initial request, we keep polling the status of the async process responsible for project creation until it finishes. For SDK users this only means that this method might raise exceptions related to its asynchronous nature.
Parameters: sourcedata : basestring, file or pandas.DataFrame
Data to be used for predictions. If a string, it can be either a path to a local file, a URL to a publicly available file, or raw file content. If using a file, the filename must consist of ASCII characters only.
project_name : str, unicode, optional
The name to assign to the empty project.
max_wait : int, optional
Time in seconds after which project creation is considered unsuccessful
read_timeout: int
The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete
Returns: project : Project
Instance with initialized data.
Raises: InputNotUnderstoodError
Raised if sourcedata isn’t one of supported types.
AsyncFailureError
Polling for status of async process resulted in response with unsupported status code. Beginning in version 2.1, this will be ProjectAsyncFailureError, a subclass of AsyncFailureError
AsyncProcessUnsuccessfulError
Raised if project creation was unsuccessful
AsyncTimeoutError
Raised if project creation took more time than specified by the max_wait parameter.
Examples
p = Project.create('/home/datasets/somedataset.csv', project_name="New API project")
p.id
>>> '5921731dkqshda8yd28h'
p.project_name
>>> 'New API project'
-
classmethod
encrypted_string
(plaintext)¶ Sends a string to DataRobot to be encrypted
This is used for passwords that DataRobot uses to access external data sources
Parameters: plaintext : str
The string to encrypt
Returns: ciphertext : str
The encrypted string
-
classmethod
create_from_mysql
(*args, **kwargs)¶ Note
Deprecated in v2.11 in favor of datarobot.models.Project.create_from_data_source().
Create a project from a MySQL table
Parameters: server : str
The address of the MySQL server
database : str
The name of the database to use
table : str
The name of the table to fetch
user : str
The username to use to access the database
port : int, optional
The port to reach the MySQL server. If not specified, will use the default specified by DataRobot (3306).
prefetch : int, optional
If specified, specifies the number of rows to stream at a time from the database. If not specified, fetches all results at once. This is an optimization for reading from the database
project_name : str, optional
A name to give to the project
password : str, optional
The plaintext password for this user. Will be first encrypted with DataRobot. Only use this _or_ encrypted_password, not both.
encrypted_password : str, optional
The encrypted password for this user. Will be sent directly to DataRobot. Only use this _or_ password, not both.
max_wait : int
The maximum number of seconds to wait before giving up.
Returns: Project
Raises: ValueError
If both password and encrypted_password were used.
-
classmethod
create_from_oracle
(*args, **kwargs)¶ Note
Deprecated in v2.11 in favor of datarobot.models.Project.create_from_data_source().
Create a project from an Oracle table
Parameters: dbq : str
tnsnames.ora entry in host:port/sid format
table : str
The name of the table to fetch
username : str
The username to use to access the database
fetch_buffer_size : int, optional
If specified, specifies the size of buffer that will be used to stream data from the database. Otherwise will use DataRobot default value.
project_name : str, optional
A name to give to the project
password : str, optional
The plaintext password for this user. Will be first encrypted with DataRobot. Only use this _or_ encrypted_password, not both.
encrypted_password : str, optional
The encrypted password for this user. Will be sent directly to DataRobot. Only use this _or_ password, not both.
max_wait : int
The maximum number of seconds to wait before giving up.
Returns: Project
Raises: ValueError
If both password and encrypted_password were used.
-
classmethod
create_from_postgresql
(*args, **kwargs)¶ Note
Deprecated in v2.11 in favor of datarobot.models.Project.create_from_data_source().
Create a project from a PostgreSQL table
Parameters: server : str
The address of the PostgreSQL server
database : str
The name of the database to use
table : str
The name of the table to fetch
username : str
The username to use to access the database
port : int, optional
The port to reach the PostgreSQL server. If not specified, will use the default specified by DataRobot (5432).
driver : str, optional
Specify ODBC driver to use. If not specified - use DataRobot default. See the values within
datarobot.enums.POSTGRESQL_DRIVER
fetch : int, optional
If specified, specifies the number of rows to stream at a time from the database. If not specified, use default value in DataRobot.
use_declare_fetch : bool, optional
On True, server will fetch result as available using DB cursor. On False it will try to retrieve entire result set - not recommended for big tables. If not specified - use the default specified by DataRobot.
project_name : str, optional
A name to give to the project
password : str, optional
The plaintext password for this user. Will be first encrypted with DataRobot. Only use this _or_ encrypted_password, not both.
encrypted_password : str, optional
The encrypted password for this user. Will be sent directly to DataRobot. Only use this _or_ password, not both.
max_wait : int
The maximum number of seconds to wait before giving up.
Returns: Project
Raises: ValueError
If both password and encrypted_password were used.
-
classmethod
create_from_hdfs
(url, port=None, project_name=None, max_wait=600)¶ Create a project from a datasource on a WebHDFS server.
Parameters: url : str
The location of the WebHDFS file, both server and full path. Per the DataRobot specification, must begin with hdfs://, e.g. hdfs:///tmp/10kDiabetes.csv
port : int, optional
The port to use. If not specified, will default to the server default (50070)
project_name : str, optional
A name to give to the project
max_wait : int
The maximum number of seconds to wait before giving up.
Returns: Project
Examples
p = Project.create_from_hdfs('hdfs:///tmp/somedataset.csv', project_name="New API project")
p.id
>>> '5921731dkqshda8yd28h'
p.project_name
>>> 'New API project'
-
classmethod
create_from_data_source
(data_source_id, username, password, project_name=None, max_wait=600)¶ Create a project from a data source.
Parameters: data_source_id : str
the identifier of the data source.
username : str
the username for database authentication.
password : str
the password for database authentication. The password is encrypted at server side and never saved / stored.
project_name : str, optional
optional, a name to give to the project.
max_wait : int
optional, the maximum number of seconds to wait before giving up.
Returns: Project
-
classmethod
from_async
(async_location, max_wait=600)¶ Given a temporary async status location, poll for no more than max_wait seconds until the async process (project creation or setting the target, for example) finishes successfully, then return the ready project.
Parameters: async_location : str
The URL for the temporary async status resource. This is returned as a header in the response to a request that initiates an async process
max_wait : int
The maximum number of seconds to wait before giving up.
Returns: project : Project
The project, now ready
Raises: ProjectAsyncFailureError
If the server returned an unexpected response while polling for the asynchronous operation to resolve
AsyncProcessUnsuccessfulError
If the final result of the asynchronous operation was a failure
AsyncTimeoutError
If the asynchronous operation did not resolve within the time specified
-
classmethod
start
(sourcedata, target, project_name='Untitled Project', worker_count=None, metric=None, autopilot_on=True, blueprint_threshold=None, response_cap=None, partitioning_method=None, positive_class=None, target_type=None)¶ Chain together project creation, file upload, and target selection.
Parameters: sourcedata : str or pandas.DataFrame
The path to the file to upload. Can be either a path to a local file or a publicly accessible URL. If the source is a DataFrame, it will be serialized to a temporary buffer. If using a file, the filename must consist of ASCII characters only.
target : str
The name of the target column in the uploaded file.
project_name : str
The project name.
Returns: project : Project
The newly created and initialized project.
Other Parameters: worker_count : int, optional
The number of workers that you want to allocate to this project.
metric : str, optional
The name of metric to use.
autopilot_on : boolean, default
True
Whether or not to begin modeling automatically.
blueprint_threshold : int, optional
Number of hours the model is permitted to run. Minimum 1
response_cap : float, optional
Quantile of the response distribution to use for response capping. Must be in the range 0.5 .. 1.0.
partitioning_method : PartitioningMethod object, optional
It should be a PartitioningMethod object.
positive_class : str, float, or int; optional
Specifies a level of the target column that should be treated as the positive class for binary classification. May only be specified for binary classification targets.
target_type : str, optional
Override the automatically selected target_type. An example usage would be setting target_type='Multiclass' when you want to perform a multiclass classification task on a numeric column that has a low cardinality. You can use
TARGET_TYPE
enum.
Raises: AsyncFailureError
Polling for status of async process resulted in response with unsupported status code
AsyncProcessUnsuccessfulError
Raised if project creation or target setting was unsuccessful
AsyncTimeoutError
Raised if project creation or target setting timed out
Examples
Project.start("./tests/fixtures/file.csv", "a_target", project_name="test_name", worker_count=4, metric="a_metric")
-
classmethod
list
(search_params=None)¶ Returns the projects associated with this account.
Parameters: search_params : dict, optional.
If not None, the returned projects are filtered by lookup. Currently you can query projects by:
project_name
Returns: projects : list of Project instances
Contains a list of projects associated with this user account.
Raises: TypeError
Raised if
search_params
parameter is provided, but is not of a supported type.
Examples
List all projects:
p_list = Project.list()
p_list
>>> [Project('Project One'), Project('Two')]
Search for projects by name:
Project.list(search_params={'project_name': 'red'})
>>> [Project('Predtime'), Project('Fred Project')]
-
refresh
()¶ Fetches the latest state of the project, and updates this object with that information. This is an in-place update, not a new object.
Returns: self : Project
the now-updated project
-
delete
()¶ Removes this project from your account.
-
set_target
(target, mode='auto', metric=None, quickrun=None, worker_count=None, positive_class=None, partitioning_method=None, featurelist_id=None, advanced_options=None, max_wait=600, target_type=None)¶ Set target variable of an existing project that has a file uploaded to it.
Target setting is an asynchronous process: after the initial request, the client keeps polling the status of the async process responsible for target setting until it finishes. For SDK users this only means that this method might raise exceptions related to its asynchronous nature.
Parameters: target : str
Name of variable.
mode : str, optional
You can use the AUTOPILOT_MODE enum to choose between:
AUTOPILOT_MODE.FULL_AUTO
AUTOPILOT_MODE.MANUAL
AUTOPILOT_MODE.QUICK
If unspecified, FULL_AUTO is used.
metric : str, optional
Name of the metric to use for evaluating models. You can query the metrics available for the target by way of Project.get_metrics. If none is specified, then the default recommended by DataRobot is used.
quickrun : bool, optional
Deprecated - pass AUTOPILOT_MODE.QUICK as mode instead. Sets whether the project should be run in quick run mode. This setting causes DataRobot to recommend a more limited set of models in order to get a base set of models and insights more quickly.
worker_count : int, optional
The number of concurrent workers to request for this project. If None, then the default is used
partitioning_method : PartitioningMethod object, optional
An instance of one of the PartitioningMethod subclasses.
positive_class : str, float, or int; optional
Specifies a level of the target column that should be treated as the positive class for binary classification. May only be specified for binary classification targets.
featurelist_id : str, optional
Specifies which feature list to use.
advanced_options : AdvancedOptions, optional
Used to set advanced options of project creation.
max_wait : int, optional
Time in seconds after which target setting is considered unsuccessful.
target_type : str, optional
Override the automatically selected target_type. An example usage would be setting target_type='Multiclass' when you want to perform a multiclass classification task on a numeric column that has low cardinality. You can use the TARGET_TYPE enum.
Returns: project : Project
The instance with updated attributes.
Raises: AsyncFailureError
Polling for status of async process resulted in response with unsupported status code
AsyncProcessUnsuccessfulError
Raised if target setting was unsuccessful
AsyncTimeoutError
Raised if target setting took more time than specified by the max_wait parameter
TypeError
Raised if advanced_options, partitioning_method, or target_type is provided, but is not of a supported type
See also
Project.start
- combines project creation, file upload, and target selection
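The asynchronous behavior described above (keep polling until the target-setting process finishes, giving up after max_wait seconds) can be sketched with the standard library alone. poll_until_done and its status strings are hypothetical illustrations, not part of the datarobot client:

```python
import time

def poll_until_done(get_status, max_wait=600, interval=1.0):
    """Poll a status callable until it reports completion or max_wait elapses.

    get_status is any zero-argument callable returning one of the
    hypothetical values 'RUNNING', 'COMPLETED', or 'ERROR'.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        status = get_status()
        if status == 'COMPLETED':
            return status
        if status == 'ERROR':
            # mirrors AsyncProcessUnsuccessfulError in the client
            raise RuntimeError('async process was unsuccessful')
        time.sleep(interval)
    # mirrors AsyncTimeoutError in the client
    raise TimeoutError('async process did not finish within max_wait')
```

The client's AsyncTimeoutError corresponds to the timeout branch here: the work may still finish on the server even though the client has stopped waiting.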
-
get_models
(order_by=None, search_params=None, with_metric=None)¶ List all completed, successful models in the leaderboard for the given project.
Parameters: order_by : str or list of strings, optional
If not None, the returned models are ordered by this attribute. If None, the models are returned in order of the default project metric.
Allowed attributes to sort by are:
metric
sample_pct
If the sort attribute is preceded by a hyphen, models will be sorted in descending order, otherwise in ascending order.
Multiple sort attributes can be included as a comma-delimited string or in a list, e.g. order_by='sample_pct,-metric' or order_by=['sample_pct', '-metric'].
Sorting by metric will order models according to their validation score on the project metric.
search_params : dict, optional.
If not None, the returned models are filtered by lookup. Currently you can query models by:
name
sample_pct
with_metric : str, optional.
If not None, the returned models will only have scores for this metric. Otherwise all the metrics are returned.
Returns: models : a list of Model instances.
All of the models that have been trained in this project.
Raises: TypeError
Raised if
order_by
orsearch_params
parameter is provided, but is not of a supported type.
Examples
Project.get('pid').get_models(order_by=['-sample_pct', 'metric'])
# Getting models that contain "Ridge" in name
# and with sample_pct more than 64
Project.get('pid').get_models(
    search_params={
        'sample_pct__gt': 64,
        'name': "Ridge"
    })
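The order_by convention above (hyphen prefix for descending, comma-delimited string or list) can be illustrated with a small pure-Python parser; parse_order_by is a hypothetical helper, not part of the client:

```python
def parse_order_by(order_by):
    """Split an order_by value into (attribute, descending) pairs.

    Accepts either a comma-delimited string such as 'sample_pct,-metric'
    or a list such as ['sample_pct', '-metric']; a leading hyphen marks
    descending order, mirroring the convention described above.
    """
    attrs = order_by.split(',') if isinstance(order_by, str) else order_by
    keys = []
    for attr in attrs:
        attr = attr.strip()
        if attr.startswith('-'):
            keys.append((attr[1:], True))   # descending
        else:
            keys.append((attr, False))      # ascending
    return keys
```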
-
get_datetime_models
()¶ List all models in the project as DatetimeModels
Requires the project to be datetime partitioned. If it is not, a ClientError will occur.
Returns: models : list of DatetimeModel
the datetime models
-
get_prime_models
()¶ List all DataRobot Prime models for the project Prime models were created to approximate a parent model, and have downloadable code.
Returns: models : list of PrimeModel
-
get_prime_files
(parent_model_id=None, model_id=None)¶ List all downloadable code files from DataRobot Prime for the project
Parameters: parent_model_id : str, optional
Filter for only those prime files approximating this parent model
model_id : str, optional
Filter for only those prime files with code for this prime model
Returns: files: list of PrimeFile
-
get_datasets
()¶ List all the datasets that have been uploaded for predictions
Returns: datasets : list of PredictionDataset instances
-
upload_dataset
(sourcedata, max_wait=600, read_timeout=600, forecast_point=None, predictions_start_date=None, predictions_end_date=None)¶ Upload a new dataset to make predictions against
Parameters: sourcedata : str, file or pandas.DataFrame
Data to be used for predictions. If a string, it can be either a path to a local file, a URL to a publicly available file, or raw file content. If using a file on disk, the filename must consist of ASCII characters only.
max_wait : int, optional
The maximum number of seconds to wait for the uploaded dataset to be processed before raising an error
read_timeout : int, optional
The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete
forecast_point : datetime.datetime or None, optional
(New in version v2.8) May only be specified for time series projects, otherwise the upload will be rejected. The time in the dataset relative to which predictions should be generated in a time series project. See the time series documentation for more information.
predictions_start_date : datetime.datetime or None, optional
(New in version v2.11) May only be specified for time series projects. The start date for bulk predictions. This parameter should be provided in conjunction with predictions_end_date. Cannot be provided together with the forecast_point parameter.
predictions_end_date : datetime.datetime or None, optional
(New in version v2.11) May only be specified for time series projects. The end date for bulk predictions. This parameter should be provided in conjunction with predictions_start_date. Cannot be provided together with the forecast_point parameter.
Returns: dataset : PredictionDataset
the newly uploaded dataset
Raises: InputNotUnderstoodError
Raised if
sourcedata
isn't one of the supported types.
AsyncFailureError
Polling for status of async process resulted in response with unsupported status code
AsyncProcessUnsuccessfulError
Raised if project creation was unsuccessful (i.e. the server reported an error in uploading the dataset)
AsyncTimeoutError
Raised if processing the uploaded dataset took more time than specified by
max_wait
parameter
ValueError
Raised if forecast_point is provided, but is not of a supported type
-
upload_dataset_from_data_source
(data_source_id, username, password, max_wait=600, forecast_point=None)¶ Upload a new dataset from a data source to make predictions against
Parameters: data_source_id : str
the identifier of the data source.
username : str
the username for database authentication.
password : str
the password for database authentication. The password is encrypted at server side and never saved / stored.
max_wait : int
optional, the maximum number of seconds to wait before giving up.
forecast_point : datetime.datetime or None, optional
(New in version v2.8) May only be specified for time series projects, otherwise the upload will be rejected. The time in the dataset relative to which predictions should be generated in a time series project. See the time series documentation for more information.
Returns: dataset : PredictionDataset
the newly uploaded dataset
-
get_blueprints
()¶ List all blueprints recommended for a project.
Returns: menu : list of Blueprint instances
All the blueprints recommended by DataRobot for a project
-
get_features
()¶ List all features for this project
Returns: list of Feature
all features for this project
-
get_modeling_features
(batch_size=None)¶ List all modeling features for this project
Only available once the target and partitioning settings have been set. For more information on the distinction between input and modeling features, see the time series documentation<input_vs_modeling>.
Parameters: batch_size : int, optional
The number of features to retrieve in a single API call. If specified, the client may make multiple calls to retrieve the full list of features. If not specified, an appropriate default will be chosen by the server.
Returns: list of ModelingFeature
All modeling features in this project
-
get_featurelists
()¶ List all featurelists created for this project
Returns: list of Featurelist
all featurelists created for this project
-
get_modeling_featurelists
(batch_size=None)¶ List all modeling featurelists created for this project
Modeling featurelists can only be created after the target and partitioning options have been set for a project. In time series projects, these are the featurelists that can be used for modeling; in other projects, they behave the same as regular featurelists.
See the time series documentation for more information.
Parameters: batch_size : int, optional
The number of featurelists to retrieve in a single API call. If specified, the client may make multiple calls to retrieve the full list of features. If not specified, an appropriate default will be chosen by the server.
Returns: list of ModelingFeaturelist
all modeling featurelists in this project
-
create_type_transform_feature
(name, parent_name, variable_type, replacement=None, date_extraction=None, max_wait=600)¶ Create a new feature by transforming the type of an existing feature in the project
Note that only the following transformations are supported:
- Text to categorical or numeric
- Categorical to text or numeric
- Numeric to categorical
- Date to categorical or numeric
Note
Special considerations when casting numeric to categorical
There are two values which can be used for variableType to convert numeric data to categorical levels. These differ in the assumptions they make about the input data, and are very important when considering the data that will be used to make predictions. The assumptions that each makes are:
categorical : The data in the column is all integral, and there are no missing values. If either of these conditions does not hold in the training set, the transformation will be rejected. During predictions, if any of the values in the parent column are missing, the predictions will error.
categoricalInt : (New in v2.6) All of the data in the column should be considered categorical in its string form when cast to an int by truncation. For example, the value 3 will be cast as the string 3, and the value 3.14 will also be cast as the string 3. Further, the value -3.6 will become the string -3. Missing values will still be recognized as missing.
For convenience these are represented in the enum VARIABLE_TYPE_TRANSFORM with the names CATEGORICAL and CATEGORICAL_INT.
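The categoricalInt casting rules in the note above can be reproduced in plain Python, since int() truncates toward zero (categorical_int_cast is an illustrative sketch, not the client's implementation):

```python
def categorical_int_cast(value):
    """Mimic the categoricalInt conversion described above: cast to int
    by truncation (toward zero), then take the string form. Missing
    values stay missing. A sketch for illustration only.
    """
    if value is None:
        return None              # missing values remain missing
    return str(int(value))       # int() truncates toward zero: -3.6 -> -3
```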
Parameters: name : str
The name to give to the new feature
parent_name : str
The name of the feature to transform
variable_type : str
The type the new column should have. See the values within
datarobot.enums.VARIABLE_TYPE_TRANSFORM
replacement : str or float, optional
The value that missing or unconvertible data should have
date_extraction : str, optional
Must be specified when parent_name is a date column (and left None otherwise). Specifies which value from a date should be extracted. See the list of values in
datarobot.enums.DATE_EXTRACTION
max_wait : int, optional
The maximum amount of time to wait for DataRobot to finish processing the new column. This process can take more time with more data to process. If this operation times out, an AsyncTimeoutError will occur. DataRobot continues the processing, and the new column may still be successfully constructed.
Returns: Feature
The data of the new Feature
Raises: AsyncFailureError
If any of the responses from the server are unexpected
AsyncProcessUnsuccessfulError
If the job being waited for has failed or has been cancelled
AsyncTimeoutError
If the resource did not resolve in time
-
create_featurelist
(name, features)¶ Creates a new featurelist
Parameters: name : str
The name to give to this new featurelist. Names must be unique, so an error will be returned from the server if this name has already been used in this project.
features : list of str
The names of the features. Each feature must exist in the project already.
Returns: Featurelist
newly created featurelist
Raises: DuplicateFeaturesError
Raised if features variable contains duplicate features
Examples
project = Project.get('5223deadbeefdeadbeef0101')
flists = project.get_featurelists()
# Create a new featurelist using a subset of features from an
# existing featurelist
flist = flists[0]
features = flist.features[::2]  # Half of the features
new_flist = project.create_featurelist(name='Feature Subset', features=features)
-
create_modeling_featurelist
(name, features)¶ Create a new modeling featurelist
Modeling featurelists can only be created after the target and partitioning options have been set for a project. In time series projects, these are the featurelists that can be used for modeling; in other projects, they behave the same as regular featurelists.
See the time series documentation for more information.
Parameters: name : str
the name of the modeling featurelist to create. Names must be unique within the project, or the server will return an error.
features : list of str
the names of the features to include in the modeling featurelist. Each feature must be a modeling feature.
Returns: featurelist : ModelingFeaturelist
the newly created featurelist
Examples
project = Project.get('1234deadbeeffeeddead4321')
modeling_features = project.get_modeling_features()
selected_features = [feat.name for feat in modeling_features][:5]  # select first five
new_flist = project.create_modeling_featurelist('Model This', selected_features)
-
get_metrics
(feature_name)¶ Get the metrics recommended for modeling on the given feature.
Parameters: feature_name : str
The name of the feature to query regarding which metrics are recommended for modeling.
Returns: names : list of str
The names of the recommended metrics.
-
get_status
()¶ Query the server for project status.
Returns: status : dict
Contains:
autopilot_done : a boolean.
stage : a short string indicating which stage the project is in.
stage_description : a description of what stage means.
Examples
{"autopilot_done": False, "stage": "modeling", "stage_description": "Ready for modeling"}
-
pause_autopilot
()¶ Pause autopilot, which stops processing the next jobs in the queue.
Returns: paused : boolean
Whether the command was acknowledged
-
unpause_autopilot
()¶ Unpause autopilot, which restarts processing the next jobs in the queue.
Returns: unpaused : boolean
Whether the command was acknowledged.
-
start_autopilot
(featurelist_id)¶ Starts autopilot on provided featurelist.
Only one autopilot can be running at a time, so any ongoing autopilot on a different featurelist will be halted: modeling jobs already in the queue are not affected, but the halted autopilot will not add new jobs to the queue.
Parameters: featurelist_id : str
Identifier of featurelist that should be used for autopilot
Raises: AppPlatformError
Raised if autopilot is currently running on or has already finished running on the provided featurelist. Also raised if project’s target was not selected.
-
train
(trainable, sample_pct=None, featurelist_id=None, source_project_id=None, scoring_type=None, training_row_count=None, monotonic_increasing_featurelist_id=<object object>, monotonic_decreasing_featurelist_id=<object object>)¶ Submit a job to the queue to train a model.
Either sample_pct or training_row_count can be used to specify the amount of data to use, but not both. If neither are specified, a default of the maximum amount of data that can safely be used to train any blueprint without going into the validation data will be selected.
In smart-sampled projects, sample_pct and training_row_count are assumed to be in terms of rows of the minority class.
Note
If the project uses datetime partitioning, use
train_datetime
instead.
Parameters: trainable : str or Blueprint
For str, this is assumed to be a blueprint_id. If no source_project_id is provided, the project_id will be assumed to be the project that this instance represents.
Otherwise, for a Blueprint, it contains the blueprint_id and source_project_id that we want to use. featurelist_id will assume the default for this project if not provided, and sample_pct will default to using the maximum training value allowed for this project's partition setup. source_project_id will be ignored if a Blueprint instance is used for this parameter.
sample_pct : float, optional
The amount of data to use for training, as a percentage of the project dataset from 0 to 100.
featurelist_id : str, optional
The identifier of the featurelist to use. If not defined, the default for this project is used.
source_project_id : str, optional
Which project created this blueprint_id. If
None
, it defaults to looking in this project. Note that you must have read permissions in this project.
scoring_type : str, optional
Either SCORING_TYPE.validation or SCORING_TYPE.cross_validation. SCORING_TYPE.validation is available for every partitioning type, and indicates that the default model validation should be used for the project. If the project uses a form of cross-validation partitioning, SCORING_TYPE.cross_validation can also be used to indicate that all of the available training/validation combinations should be used to evaluate the model.
training_row_count : int, optional
The number of rows to use to train the requested model.
monotonic_increasing_featurelist_id : str, optional
(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. Passing None disables the increasing monotonicity constraint. The default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
monotonic_decreasing_featurelist_id : str, optional
(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. Passing None disables the decreasing monotonicity constraint. The default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
Returns: model_job_id : str
id of the created job, which can be used as a parameter to the ModelJob.get method or the wait_for_async_model_creation function.
Examples
Use a Blueprint instance:
blueprint = project.get_blueprints()[0]
model_job_id = project.train(blueprint, training_row_count=project.max_train_rows)
Use a blueprint_id, which is a string. In the first case, it is assumed that the blueprint was created by this project. If you are using a blueprint from another project, you will need to pass the id of that other project as well.
blueprint_id = 'e1c7fc29ba2e612a72272324b8a842af'
project.train(blueprint_id, training_row_count=project.max_train_rows)
another_project.train(blueprint_id, source_project_id=project.id)
You can also easily use this interface to train a new model using the data from an existing model:
model = project.get_models()[0]
model_job_id = project.train(model.blueprint.id, sample_pct=100)
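The rule above that sample_pct and training_row_count are mutually exclusive can be sketched as a small validator (check_training_size is a hypothetical helper, not part of the client):

```python
def check_training_size(sample_pct=None, training_row_count=None):
    """Enforce the rule above: at most one of sample_pct and
    training_row_count may be specified, and sample_pct is a
    percentage of the project dataset from 0 to 100.
    """
    if sample_pct is not None and training_row_count is not None:
        raise ValueError('specify either sample_pct or '
                         'training_row_count, not both')
    if sample_pct is not None and not 0 < sample_pct <= 100:
        raise ValueError('sample_pct must be a percentage in (0, 100]')
```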
-
train_datetime
(blueprint_id, featurelist_id=None, training_row_count=None, training_duration=None, source_project_id=None)¶ Create a new model in a datetime partitioned project
If the project is not datetime partitioned, an error will occur.
Parameters: blueprint_id : str
the blueprint to use to train the model
featurelist_id : str, optional
the featurelist to use to train the model. If not specified, the project default will be used.
training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
source_project_id : str, optional
the id of the project this blueprint comes from, if not this project. If left unspecified, the blueprint must belong to this project.
Returns: job : ModelJob
the created job to build the model
-
blend
(model_ids, blender_method)¶ Submit a job for creating blender model. Upon success, the new job will be added to the end of the queue.
Parameters: model_ids : list of str
List of model ids that will be used to create blender. These models should have completed validation stage without errors, and can’t be blenders, DataRobot Prime or scaleout models.
blender_method : str
Chosen blend method, one from
datarobot.enums.BLENDER_METHOD
Returns: model_job : ModelJob
New
ModelJob
instance for the blender creation job in queue.
-
get_all_jobs
(status=None)¶ Get a list of jobs
This will give Jobs representing any type of job, including modeling or predict jobs.
Parameters: status : QUEUE_STATUS enum, optional
If called with QUEUE_STATUS.INPROGRESS, will return the jobs that are currently running.
If called with QUEUE_STATUS.QUEUE, will return the jobs that are waiting to be run.
If called with QUEUE_STATUS.ERROR, will return the jobs that have errored.
If no value is provided, will return all jobs currently running or waiting to be run.
Returns: jobs : list
Each is an instance of Job
-
get_blenders
()¶ Get a list of blender models.
Returns: list of BlenderModel
list of all blender models in project.
-
get_frozen_models
()¶ Get a list of frozen models
Returns: list of FrozenModel
list of all frozen models in project.
-
get_model_jobs
(status=None)¶ Get a list of modeling jobs
Parameters: status : QUEUE_STATUS enum, optional
If called with QUEUE_STATUS.INPROGRESS, will return the modeling jobs that are currently running.
If called with QUEUE_STATUS.QUEUE, will return the modeling jobs that are waiting to be run.
If called with QUEUE_STATUS.ERROR, will return the modeling jobs that have errored.
If no value is provided, will return all modeling jobs currently running or waiting to be run.
Returns: jobs : list
Each is an instance of ModelJob
-
get_predict_jobs
(status=None)¶ Get a list of prediction jobs
Parameters: status : QUEUE_STATUS enum, optional
If called with QUEUE_STATUS.INPROGRESS, will return the prediction jobs that are currently running.
If called with QUEUE_STATUS.QUEUE, will return the prediction jobs that are waiting to be run.
If called with QUEUE_STATUS.ERROR, will return the prediction jobs that have errored.
If called without a status, will return all prediction jobs currently running or waiting to be run.
Returns: jobs : list
Each is an instance of PredictJob
-
wait_for_autopilot
(check_interval=20.0, timeout=86400, verbosity=1)¶ Blocks until autopilot is finished. This will raise an exception if the autopilot mode is changed from AUTOPILOT_MODE.FULL_AUTO.
It makes API calls to sync the project state with the server and to look at which jobs are enqueued.
Parameters: check_interval : float or int
The maximum time (in seconds) to wait between checks for whether autopilot is finished
timeout : float or int or None
After this long (in seconds), we give up. If None, never timeout.
verbosity:
This should be VERBOSITY_LEVEL.SILENT or VERBOSITY_LEVEL.VERBOSE. For VERBOSITY_LEVEL.SILENT, nothing will be displayed about progress. For VERBOSITY_LEVEL.VERBOSE, the number of jobs in progress or queued is shown. Note that new jobs are added to the queue along the way.
Raises: AsyncTimeoutError
If autopilot does not finish in the amount of time specified
RuntimeError
If a condition is detected that indicates that autopilot will not complete on its own
-
rename
(project_name)¶ Update the name of the project.
Parameters: project_name : str
The new name
-
unlock_holdout
()¶ Unlock the holdout for this project.
This will cause subsequent queries of the models of this project to contain the metric values for the holdout set, if it exists.
Take care, as this cannot be undone. Remember that best practice is to select a model before analyzing model performance on the holdout set.
-
set_worker_count
(worker_count)¶ Sets the number of workers allocated to this project.
Note that this value is limited to the number allowed by your account. Lowering the number will not stop currently running jobs, but will cause the queue to wait for the appropriate number of jobs to finish before attempting to run more jobs.
Parameters: worker_count : int
The number of concurrent workers to request from the pool of workers
-
get_leaderboard_ui_permalink
()¶ Returns: url : str
Permanent static hyperlink to a project leaderboard.
-
open_leaderboard_browser
()¶ Opens project leaderboard in web browser.
Note: If text-mode browsers are used, the calling process will block until the user exits the browser.
-
get_rating_table_models
()¶ Get a list of models with a rating table
Returns: list of RatingTableModel
list of all models with a rating table in project.
-
get_rating_tables
()¶ Get a list of rating tables
Returns: list of RatingTable
list of rating tables in project.
Partitioning API¶
-
class
datarobot.
RandomCV
(holdout_pct, reps, seed=0)¶ A partition in which observations are randomly assigned to cross-validation groups and the holdout set.
Parameters: holdout_pct : int
the desired percentage of dataset to assign to holdout set
reps : int
number of cross validation folds to use
seed : int
a seed to use for randomization
-
class
datarobot.
StratifiedCV
(holdout_pct, reps, seed=0)¶ A partition in which observations are randomly assigned to cross-validation groups and the holdout set, preserving in each group the same ratio of positive to negative cases as in the original data.
Parameters: holdout_pct : int
the desired percentage of dataset to assign to holdout set
reps : int
number of cross validation folds to use
seed : int
a seed to use for randomization
-
class
datarobot.
GroupCV
(holdout_pct, reps, partition_key_cols, seed=0)¶ A partition in which one column is specified, and rows sharing a common value for that column are guaranteed to stay together in the partitioning into cross-validation groups and the holdout set.
Parameters: holdout_pct : int
the desired percentage of dataset to assign to holdout set
reps : int
number of cross validation folds to use
partition_key_cols : list
a list containing a single string, where the string is the name of the column whose values should remain together in partitioning
seed : int
a seed to use for randomization
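The guarantee GroupCV provides, that rows sharing a value in the partition key column always land in the same cross-validation fold, can be illustrated by deriving the fold from a deterministic hash of the key. This is a sketch of the idea, not DataRobot's actual assignment algorithm:

```python
import zlib

def group_fold(partition_key, reps):
    """Assign a row to one of `reps` folds based only on its partition
    key, so every row sharing that key gets the same fold. crc32 is
    used here because it is deterministic across processes, unlike
    Python's salted built-in hash().
    """
    key_bytes = str(partition_key).encode('utf-8')
    return zlib.crc32(key_bytes) % reps
```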
-
class
datarobot.
UserCV
(user_partition_col, cv_holdout_level, seed=0)¶ A partition where the cross-validation folds and the holdout set are specified by the user.
Parameters: user_partition_col : string
the name of the column containing the partition assignments
cv_holdout_level
the value of the partition column indicating a row is part of the holdout set
seed : int
a seed to use for randomization
-
class
datarobot.
RandomTVH
(holdout_pct, validation_pct, seed=0)¶ Specifies a partitioning method in which rows are randomly assigned to training, validation, and holdout.
Parameters: holdout_pct : int
the desired percentage of dataset to assign to holdout set
validation_pct : int
the desired percentage of dataset to assign to validation set
seed : int
a seed to use for randomization
-
class
datarobot.
UserTVH
(user_partition_col, training_level, validation_level, holdout_level, seed=0)¶ Specifies a partitioning method in which rows are assigned by the user to training, validation, and holdout sets.
Parameters: user_partition_col : string
the name of the column containing the partition assignments
training_level
the value of the partition column indicating a row is part of the training set
validation_level
the value of the partition column indicating a row is part of the validation set
holdout_level
the value of the partition column indicating a row is part of the holdout set (use None if you want no holdout set)
seed : int
a seed to use for randomization
-
class
datarobot.
StratifiedTVH
(holdout_pct, validation_pct, seed=0)¶ A partition in which observations are randomly assigned to train, validation, and holdout sets, preserving in each group the same ratio of positive to negative cases as in the original data.
Parameters: holdout_pct : int
the desired percentage of dataset to assign to holdout set
validation_pct : int
the desired percentage of dataset to assign to validation set
seed : int
a seed to use for randomization
-
class
datarobot.
GroupTVH
(holdout_pct, validation_pct, partition_key_cols, seed=0)¶ A partition in which one column is specified, and rows sharing a common value for that column are guaranteed to stay together in the partitioning into the training, validation, and holdout sets.
Parameters: holdout_pct : int
the desired percentage of dataset to assign to holdout set
validation_pct : int
the desired percentage of dataset to assign to validation set
partition_key_cols : list
a list containing a single string, where the string is the name of the column whose values should remain together in partitioning
seed : int
a seed to use for randomization
-
class
datarobot.
DatetimePartitioningSpecification
(datetime_partition_column, autopilot_data_selection_method=None, validation_duration=None, holdout_start_date=None, holdout_duration=None, disable_holdout=None, gap_duration=None, number_of_backtests=None, backtests=None, use_time_series=False, default_to_a_priori=False, default_to_known_in_advance=False, feature_derivation_window_start=None, feature_derivation_window_end=None, feature_settings=None, forecast_window_start=None, forecast_window_end=None, treat_as_exponential=None, differencing_method=None, periodicities=None, multiseries_id_columns=None)¶ Uniquely defines a DatetimePartitioning for some project
Includes only the attributes of DatetimePartitioning that are directly controllable by users, not those determined by the DataRobot application based on the project dataset and the user-controlled settings.
This is the specification that should be passed to Project.set_target via the partitioning_method parameter. To see the full partitioning based on the project dataset, use DatetimePartitioning.generate.
All durations should be specified with a duration string such as those returned by the partitioning_methods.construct_duration_string helper method.
Attributes
datetime_partition_column (str) the name of the column whose values as dates are used to assign a row to a particular partition autopilot_data_selection_method (str) one of datarobot.enums.DATETIME_AUTOPILOT_DATA_SELECTION_METHOD
. Whether models created by the autopilot should use “rowCount” or “duration” as their data_selection_method.validation_duration (str or None) the default validation_duration for the backtests holdout_start_date (datetime.datetime or None) The start date of holdout scoring data. If holdout_start_date is specified, holdout_duration must also be specified. If disable_holdout is set to True, neither holdout_duration nor holdout_start_date must be specified. holdout_duration (str or None) The duration of the holdout scoring data. If holdout_duration is specified, holdout_start_date must also be specified. If disable_holdout is set to True, neither holdout_duration nor holdout_start_date must be specified. disable_holdout (bool or None) (New in version v2.8) Whether to suppress allocating a holdout fold. If set to True, holdout_start_date and holdout_duration must not be specified. gap_duration (str or None) The duration of the gap between training and holdout scoring data number_of_backtests (int or None) the number of backtests to use backtests (list of BacktestSpecification) the exact specification of backtests to use. The indexes of the specified backtests should range from 0 to number_of_backtests - 1. If any backtest is left unspecified, a default configuration will be chosen. use_time_series (bool) (New in version v2.8) Whether to create a time series project (if True) or an OTV project which uses datetime partitioning (if False). The default behaviour is to create an OTV project. default_to_a_priori (bool) (Deprecated in version v2.11) Optional, renamed to default_to_known_in_advance, see below for more detail. default_to_known_in_advance (bool) (New in version v2.11) Optional, only used for time series projects. Whether to default to treating features as known in advance. If not specified, defaults to False. Known in advance features are expected to be known for dates in the future when making predictions, e.g. “is this a holiday”. 
feature_derivation_window_start (int or None) (New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should start. Expressed in terms of the time_unit of the datetime_partition_column and should be negative or zero. feature_derivation_window_end (int or None) (New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should end. Expressed in terms of the time_unit of the datetime_partition_column, and should be a positive value. feature_settings (list of FeatureSettings
objects) (New in version v2.9) Optional, a list specifying per-feature settings; can be left unspecified. forecast_window_start (int or None) (New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should start. Expressed in terms of the time_unit of the datetime_partition_column. forecast_window_end (int or None) (New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should end. Expressed in terms of the time_unit of the datetime_partition_column. treat_as_exponential (string, optional) (New in version v2.9) defaults to “auto”. Used to specify whether to treat the data as an exponential trend and apply transformations like log-transform. Use values from the datarobot.enums.TREAT_AS_EXPONENTIAL
enum. differencing_method (string, optional) (New in version v2.9) defaults to “auto”. Used to specify which differencing method to apply if the data is not stationary. Use values from the datarobot.enums.DIFFERENCING_METHOD
enum. periodicities (list of Periodicity, optional) (New in version v2.9) a list of datarobot.Periodicity
multiseries_id_columns (list of str or null) (New in version v2.11) a list of the names of multiseries id columns to define series within the training data. Currently only one multiseries id column is supported.
-
class
datarobot.
BacktestSpecification
(index, gap_duration, validation_start_date, validation_duration)¶ Uniquely defines a Backtest used in a DatetimePartitioning
Includes only the attributes of a backtest directly controllable by users. The other attributes are assigned by the DataRobot application based on the project dataset and the user-controlled settings.
All durations should be specified with a duration string such as those returned by the partitioning_methods.construct_duration_string helper method.
Attributes
index (int) the index of the backtest to update gap_duration (str) the desired duration of the gap between training and validation scoring data for the backtest validation_start_date (datetime.datetime) the desired start date of the validation scoring data for this backtest validation_duration (str) the desired duration of the validation scoring data for this backtest
-
class
datarobot.
FeatureSettings
(feature_name, known_in_advance=False, a_priori=None)¶ Per feature settings
Attributes
feature_name (string) name of the feature a_priori (bool) (Deprecated in v2.11) Optional, renamed to known_in_advance, see below for more detail. known_in_advance (bool) (New in version v2.11) Optional, whether the feature is known in advance, i.e. expected to be known for dates in the future at prediction time. Features that don’t have a feature setting specifying whether they are known in advance use the value from the default_to_known_in_advance flag.
-
class
datarobot.
Periodicity
(time_steps, time_unit)¶ Periodicity configuration
Parameters: time_steps : int
Time step value
time_unit : string
Time step unit, valid options are values from datarobot.enums.PERIODICITY_TIME_UNITS
Examples
import datarobot as dr
periodicities = [
    dr.Periodicity(time_steps=10, time_unit=dr.enums.PERIODICITY_TIME_UNITS.HOUR),
    dr.Periodicity(time_steps=600, time_unit=dr.enums.PERIODICITY_TIME_UNITS.MINUTE),
]
spec = dr.DatetimePartitioningSpecification(
    # ...
    periodicities=periodicities,
)
-
class
datarobot.
DatetimePartitioning
(project_id=None, datetime_partition_column=None, date_format=None, autopilot_data_selection_method=None, validation_duration=None, available_training_start_date=None, available_training_duration=None, available_training_row_count=None, available_training_end_date=None, primary_training_start_date=None, primary_training_duration=None, primary_training_row_count=None, primary_training_end_date=None, gap_start_date=None, gap_duration=None, gap_row_count=None, gap_end_date=None, holdout_start_date=None, holdout_duration=None, holdout_row_count=None, holdout_end_date=None, number_of_backtests=None, backtests=None, total_row_count=None, use_time_series=False, default_to_a_priori=False, default_to_known_in_advance=False, feature_derivation_window_start=None, feature_derivation_window_end=None, feature_settings=None, forecast_window_start=None, forecast_window_end=None, treat_as_exponential=None, differencing_method=None, periodicities=None, multiseries_id_columns=None)¶ Full partitioning of a project for datetime partitioning
Includes both the attributes specified by the user, as well as those determined by the DataRobot application based on the project dataset. In order to use a partitioning to set the target, call to_specification and pass the resulting DatetimePartitioningSpecification to Project.set_target.
The available training data corresponds to all the data available for training, while the primary training data corresponds to the data that can be used to train while ensuring that all backtests are available. If a model is trained with more data than is available in the primary training data, then all backtests may not have scores available.
Attributes
project_id (str) the id of the project this partitioning applies to datetime_partition_column (str) the name of the column whose values as dates are used to assign a row to a particular partition date_format (str) the format (e.g. “%Y-%m-%d %H:%M:%S”) by which the partition column was interpreted (compatible with strftime [https://docs.python.org/2/library/time.html#time.strftime] ) autopilot_data_selection_method (str) one of datarobot.enums.DATETIME_AUTOPILOT_DATA_SELECTION_METHOD
. Whether models created by the autopilot use “rowCount” or “duration” as their data_selection_method.validation_duration (str) the validation duration specified when initializing the partitioning - not directly significant if the backtests have been modified, but used as the default validation_duration for the backtests available_training_start_date (datetime.datetime) The start date of the available training data for scoring the holdout available_training_duration (str) The duration of the available training data for scoring the holdout available_training_row_count (int or None) The number of rows in the available training data for scoring the holdout. Only available when retrieving the partitioning after setting the target. available_training_end_date (datetime.datetime) The end date of the available training data for scoring the holdout primary_training_start_date (datetime.datetime or None) The start date of primary training data for scoring the holdout. Unavailable when the holdout fold is disabled. primary_training_duration (str) The duration of the primary training data for scoring the holdout primary_training_row_count (int or None) The number of rows in the primary training data for scoring the holdout. Only available when retrieving the partitioning after setting the target. primary_training_end_date (datetime.datetime or None) The end date of the primary training data for scoring the holdout. Unavailable when the holdout fold is disabled. gap_start_date (datetime.datetime or None) The start date of the gap between training and holdout scoring data. Unavailable when the holdout fold is disabled. gap_duration (str) The duration of the gap between training and holdout scoring data gap_row_count (int or None) The number of rows in the gap between training and holdout scoring data. Only available when retrieving the partitioning after setting the target. 
gap_end_date (datetime.datetime or None) The end date of the gap between training and holdout scoring data. Unavailable when the holdout fold is disabled. holdout_start_date (datetime.datetime or None) The start date of holdout scoring data. Unavailable when the holdout fold is disabled. holdout_duration (str) The duration of the holdout scoring data holdout_row_count (int or None) The number of rows in the holdout scoring data. Only available when retrieving the partitioning after setting the target. holdout_end_date (datetime.datetime or None) The end date of the holdout scoring data. Unavailable when the holdout fold is disabled. number_of_backtests (int) the number of backtests used backtests (list of partitioning_methods.Backtest) the configured Backtests total_row_count (int) the number of rows in the project dataset. Only available when retrieving the partitioning after setting the target. use_time_series (bool) (New in version v2.8) Whether to create a time series project (if True) or an OTV project which uses datetime partitioning (if False). The default behaviour is to create an OTV project. default_to_a_priori (bool) (Deprecated in version v2.11) Optional, renamed to default_to_known_in_advance, see below for more detail. default_to_known_in_advance (bool) (New in version v2.11) Optional, only used for time series projects. Whether to default to treating features as known in advance. If not specified, defaults to False. Known in advance features are expected to be known for dates in the future when making predictions, e.g. “is this a holiday”. feature_derivation_window_start (int or None) (New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should start. 
Expressed in terms of the time_unit of the datetime_partition_column. feature_derivation_window_end (int or None) (New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should end. Expressed in terms of the time_unit of the datetime_partition_column. feature_settings (list of FeatureSettings) (New in version v2.9) Optional, a list specifying per feature settings, can be left unspecified. forecast_window_start (int or None) (New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should start. Expressed in terms of the time_unit of the datetime_partition_column. forecast_window_end (int or None) (New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should end. Expressed in terms of the time_unit of the datetime_partition_column. treat_as_exponential (string, optional) (New in version v2.9) defaults to “auto”. Used to specify whether to treat data as exponential trend and apply transformations like log-transform. Use values from datarobot.enums.TREAT_AS_EXPONENTIAL
enum. differencing_method (string, optional) (New in version v2.9) defaults to “auto”. Used to specify which differencing method to apply if the data is not stationary. Use values from the datarobot.enums.DIFFERENCING_METHOD
enum. periodicities (list of Periodicity, optional) (New in version v2.9) a list of datarobot.Periodicity
multiseries_id_columns (list of str or null) (New in version v2.11) a list of the names of multiseries id columns to define series within the training data. Currently only one multiseries id column is supported. -
classmethod
generate
(project_id, spec, max_wait=600)¶ Preview the full partitioning determined by a DatetimePartitioningSpecification
Based on the project dataset and the partitioning specification, inspect the full partitioning that would be used if the same specification were passed into Project.set_target.
Parameters: project_id : str
the id of the project
spec : DatetimePartitioningSpec
the desired partitioning
max_wait : int, optional
For some settings (e.g. generating a partitioning preview for a multiseries project for the first time), an asynchronous task must be run to analyze the dataset. max_wait governs the maximum time (in seconds) to wait before giving up. In all non-multiseries projects, this is unused.
Returns: DatetimePartitioning :
the full generated partitioning
-
classmethod
get
(project_id)¶ Retrieve the DatetimePartitioning from a project
Only available if the project has already set the target as a datetime project.
Parameters: project_id : str
the id of the project to retrieve partitioning for
Returns: DatetimePartitioning : the full partitioning for the project
-
classmethod
feature_log_list
(project_id, offset=None, limit=None)¶ Retrieve the feature derivation log content and log length for a time series project.
The Time Series Feature Log provides details about the feature generation process for a time series project. It includes information about which features are generated and their priority, as well as the detected properties of the time series data such as whether the series is stationary, and periodicities detected.
This route is only supported for time series projects that have finished partitioning.
The feature derivation log will include information about:
- Detected stationarity of the series, e.g. ‘Series detected as non-stationary’
- Detected presence of multiplicative trend in the series, e.g. ‘Multiplicative trend detected’
- Detected periodicities in the series, e.g. ‘Detected periodicities: 7 day’
- Maximum number of features to be generated, e.g. ‘Maximum number of feature to be generated is 1440’
- Window sizes used in rolling statistics / lag extractors, e.g. ‘The window sizes chosen to be: 2 months (because the time step is 1 month and Feature Derivation Window is 2 months)’
- Features that are specified as known-in-advance, e.g. ‘Variables treated as apriori: holiday’
- Details about why certain variables are transformed in the input data, e.g. ‘Generating variable “y (log)” from “y” because multiplicative trend is detected’
- Details about features generated as time series features, and their priority, e.g. ‘Generating feature “date (actual)” from “date” (priority: 1)’
Parameters: project_id : str
project id to retrieve a feature derivation log for.
offset : int
optional, defaults to 0; this many results will be skipped.
limit : int
optional, defaults to 100; at most this many results are returned. To specify no limit, use 0. The default may change without notice.
-
classmethod
feature_log_retrieve
(project_id)¶ Retrieve the feature derivation log content and log length for a time series project.
The Time Series Feature Log provides details about the feature generation process for a time series project. It includes information about which features are generated and their priority, as well as the detected properties of the time series data such as whether the series is stationary, and periodicities detected.
This route is only supported for time series projects that have finished partitioning.
The feature derivation log will include information about:
- Detected stationarity of the series, e.g. ‘Series detected as non-stationary’
- Detected presence of multiplicative trend in the series, e.g. ‘Multiplicative trend detected’
- Detected periodicities in the series, e.g. ‘Detected periodicities: 7 day’
- Maximum number of features to be generated, e.g. ‘Maximum number of feature to be generated is 1440’
- Window sizes used in rolling statistics / lag extractors, e.g. ‘The window sizes chosen to be: 2 months (because the time step is 1 month and Feature Derivation Window is 2 months)’
- Features that are specified as known-in-advance, e.g. ‘Variables treated as apriori: holiday’
- Details about why certain variables are transformed in the input data, e.g. ‘Generating variable “y (log)” from “y” because multiplicative trend is detected’
- Details about features generated as time series features, and their priority, e.g. ‘Generating feature “date (actual)” from “date” (priority: 1)’
Parameters: project_id : str
project id to retrieve a feature derivation log for.
-
to_specification
()¶ Render the DatetimePartitioning as a DatetimePartitioningSpecification
The resulting specification can be used when setting the target, and contains only the attributes directly controllable by users.
Returns: DatetimePartitioningSpecification:
the specification for this partitioning
-
to_dataframe
()¶ Render the partitioning settings as a dataframe for convenience of display
Excludes project_id, datetime_partition_column, date_format, autopilot_data_selection_method, validation_duration, and number_of_backtests, as well as the row count information, if present.
Also excludes the time series specific parameters for use_time_series, default_to_known_in_advance, and defining the feature derivation and forecast windows.
-
class
datarobot.helpers.partitioning_methods.
Backtest
(index=None, available_training_start_date=None, available_training_duration=None, available_training_row_count=None, available_training_end_date=None, primary_training_start_date=None, primary_training_duration=None, primary_training_row_count=None, primary_training_end_date=None, gap_start_date=None, gap_duration=None, gap_row_count=None, gap_end_date=None, validation_start_date=None, validation_duration=None, validation_row_count=None, validation_end_date=None, total_row_count=None)¶ A backtest used to evaluate models trained in a datetime partitioned project
When setting up a datetime partitioning project, backtests are specified by a BacktestSpecification.
The available training data corresponds to all the data available for training, while the primary training data corresponds to the data that can be used to train while ensuring that all backtests are available. If a model is trained with more data than is available in the primary training data, then all backtests may not have scores available.
Attributes
index (int) the index of the backtest available_training_start_date (datetime.datetime) the start date of the available training data for this backtest available_training_duration (str) the duration of available training data for this backtest available_training_row_count (int or None) the number of rows of available training data for this backtest. Only available when retrieving from a project where the target is set. available_training_end_date (datetime.datetime) the end date of the available training data for this backtest primary_training_start_date (datetime.datetime) the start date of the primary training data for this backtest primary_training_duration (str) the duration of the primary training data for this backtest primary_training_row_count (int or None) the number of rows of primary training data for this backtest. Only available when retrieving from a project where the target is set. primary_training_end_date (datetime.datetime) the end date of the primary training data for this backtest gap_start_date (datetime.datetime) the start date of the gap between training and validation scoring data for this backtest gap_duration (str) the duration of the gap between training and validation scoring data for this backtest gap_row_count (int or None) the number of rows in the gap between training and validation scoring data for this backtest. Only available when retrieving from a project where the target is set. gap_end_date (datetime.datetime) the end date of the gap between training and validation scoring data for this backtest validation_start_date (datetime.datetime) the start date of the validation scoring data for this backtest validation_duration (str) the duration of the validation scoring data for this backtest validation_row_count (int or None) the number of rows of validation scoring data for this backtest. Only available when retrieving from a project where the target is set. 
validation_end_date (datetime.datetime) the end date of the validation scoring data for this backtest total_row_count (int or None) the number of rows in this backtest. Only available when retrieving from a project where the target is set. -
to_specification
()¶ Render this backtest as a BacktestSpecification
A BacktestSpecification includes only the attributes users can directly control, not those indirectly determined by the project dataset.
Returns: BacktestSpecification
the specification for this backtest
-
to_dataframe
()¶ Render this backtest as a dataframe for convenience of display
Returns: backtest_partitioning : pandas.DataFrame
the backtest attributes, formatted into a dataframe
-
-
datarobot.helpers.partitioning_methods.
construct_duration_string
(years=0, months=0, days=0, hours=0, minutes=0, seconds=0)¶ Construct a valid string representing a duration in accordance with ISO8601
A duration of 6 months, 3 days, and 12 hours could be represented as P6M3DT12H.
Parameters: years : int
the number of years in the duration
months : int
the number of months in the duration
days : int
the number of days in the duration
hours : int
the number of hours in the duration
minutes : int
the number of minutes in the duration
seconds : int
the number of seconds in the duration
Returns: duration_string: str
The duration string, specified compatibly with ISO8601
Blueprint API¶
-
class
datarobot.models.
Blueprint
(id=None, processes=None, model_type=None, project_id=None, blueprint_category=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None)¶ A Blueprint which can be used to fit models
Attributes
id (str) the id of the blueprint processes (list of str) the processes used by the blueprint model_type (str) the model produced by the blueprint project_id (str) the project the blueprint belongs to blueprint_category (str) (New in version v2.6) Describes the category of the blueprint and the kind of model it produces. -
classmethod
get
(project_id, blueprint_id)¶ Retrieve a blueprint.
Parameters: project_id : str
The project’s id.
blueprint_id : str
Id of blueprint to retrieve.
Returns: blueprint : Blueprint
The queried blueprint.
-
get_chart
()¶ Retrieve a chart.
Returns: BlueprintChart
The current blueprint chart.
-
get_documents
()¶ Get documentation for tasks used in the blueprint.
Returns: list of BlueprintTaskDocument
All documents available for blueprint.
-
class
datarobot.models.
BlueprintTaskDocument
(title=None, task=None, description=None, parameters=None, links=None, references=None)¶ Document describing a task from a blueprint.
Attributes
title (str) Title of document. task (str) Name of the task described in document. description (str) Task description. parameters (list of dict(name, type, description)) Parameters that the task can receive, in human-readable format. links (list of dict(name, url)) External links used in the document. references (list of dict(name, url)) References used in the document. When no link is available, url equals None.
-
class
datarobot.models.
BlueprintChart
(nodes, edges)¶ A Blueprint chart that can be used to understand data flow in blueprint.
Attributes
nodes (list of dict (id, label)) Chart nodes, id unique in chart. edges (list of tuple (id1, id2)) Directions of data flow between blueprint chart nodes. -
classmethod
get
(project_id, blueprint_id)¶ Retrieve a blueprint chart.
Parameters: project_id : str
The project’s id.
blueprint_id : str
Id of blueprint to retrieve chart.
Returns: BlueprintChart
The queried blueprint chart.
-
to_graphviz
()¶ Get blueprint chart in graphviz DOT format.
Returns: unicode
String representation of chart in graphviz DOT language.
-
class
datarobot.models.
ModelBlueprintChart
(nodes, edges)¶ A Blueprint chart that can be used to understand data flow in a model. A model blueprint chart represents a reduced repository blueprint chart, containing only the elements used to build this particular model.
Attributes
nodes (list of dict (id, label)) Chart nodes, id unique in chart. edges (list of tuple (id1, id2)) Directions of data flow between blueprint chart nodes. -
classmethod
get
(project_id, model_id)¶ Retrieve a model blueprint chart.
Parameters: project_id : str
The project’s id.
model_id : str
Id of model to retrieve model blueprint chart.
Returns: ModelBlueprintChart
The queried model blueprint chart.
-
to_graphviz
()¶ Get blueprint chart in graphviz DOT format.
Returns: unicode
String representation of chart in graphviz DOT language.
Model API¶
-
class
datarobot.models.
Model
(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, project=None, data=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None)¶ A model trained on a project’s dataset capable of making predictions
Attributes
id (str) the id of the model project_id (str) the id of the project the model belongs to processes (list of str) the processes used by the model featurelist_name (str) the name of the featurelist used by the model featurelist_id (str) the id of the featurelist used by the model sample_pct (float or None) the percentage of the project dataset used in training the model. If the project uses datetime partitioning, the sample_pct will be None. See training_row_count, training_duration, and training_start_date and training_end_date instead. training_row_count (int or None) the number of rows of the project dataset used in training the model. In a datetime partitioned project, if specified, defines the number of rows used to train the model and evaluate backtest scores; if unspecified, either training_duration or training_start_date and training_end_date was used to determine that instead. training_duration (str or None) only present for models in datetime partitioned projects. If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores. training_start_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model. training_end_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model. model_type (str) what model this is, e.g. 
‘Nystroem Kernel SVM Regressor’ model_category (str) what kind of model this is - ‘prime’ for DataRobot Prime models, ‘blend’ for blender models, and ‘model’ for other models is_frozen (bool) whether this model is a frozen model blueprint_id (str) the id of the blueprint used in this model metrics (dict) a mapping from each metric to the model’s scores for that metric monotonic_increasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced. monotonic_decreasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced. supports_monotonic_constraints (bool) optional, whether this model supports enforcing monotonic constraints -
classmethod
get
(project, model_id)¶ Retrieve a specific model.
Parameters: project : str
The project’s id.
model_id : str
The
model_id
of the leaderboard item to retrieve.Returns: model : Model
The queried instance.
Raises: ValueError
the passed project parameter value is not of a supported type
-
classmethod
fetch_resource_data
(*args, **kwargs)¶ (Deprecated.) Used to acquire model data directly from its url.
Consider using get instead, as this is a convenience function used for development of datarobot
Parameters: url : string
The resource we are acquiring
join_endpoint : boolean, optional
Whether the client’s endpoint should be joined to the URL before sending the request. Location headers are returned as absolute locations, so will _not_ need the endpoint
Returns: model_data : dict
The queried model’s data
-
get_features_used
()¶ Query the server to determine which features were used.
Note that the data returned by this method is possibly different than the names of the features in the featurelist used by this model. This method will return the raw features that must be supplied in order for predictions to be generated on a new set of data. The featurelist, in contrast, would also include the names of derived features.
Returns: features : list of str
The names of the features used in the model.
-
delete
()¶ Delete a model from the project’s leaderboard.
-
get_leaderboard_ui_permalink
()¶ Returns: url : str
Permanent static hyperlink to this model at leaderboard.
-
open_model_browser
()¶ Opens model at project leaderboard in web browser.
Note: If text-mode browsers are used, the calling process will block until the user exits the browser.
-
train
(sample_pct=None, featurelist_id=None, scoring_type=None, training_row_count=None, monotonic_increasing_featurelist_id=<object object>, monotonic_decreasing_featurelist_id=<object object>)¶ Train the blueprint used in model on a particular featurelist or amount of data.
This method creates a new training job for worker and appends it to the end of the queue for this project. After the job has finished you can get the newly trained model by retrieving it from the project leaderboard, or by retrieving the result of the job.
Either sample_pct or training_row_count can be used to specify the amount of data to use, but not both. If neither is specified, the default is the maximum amount of data that can safely be used to train any blueprint without going into the validation data.
In smart-sampled projects, sample_pct and training_row_count are assumed to be in terms of rows of the minority class.
Note
For datetime partitioned projects, use
train_datetime
instead.Parameters: sample_pct : float, optional
The amount of data to use for training, as a percentage of the project dataset from 0 to 100.
featurelist_id : str, optional
The identifier of the featurelist to use. If not defined, the featurelist of this model is used.
scoring_type : str, optional
Either
SCORING_TYPE.validation
orSCORING_TYPE.cross_validation
.SCORING_TYPE.validation
is available for every partitioning type, and indicates that the default model validation should be used for the project. If the project uses a form of cross-validation partitioning,SCORING_TYPE.cross_validation
can also be used to indicate that all of the available training/validation combinations should be used to evaluate the model.training_row_count : int, optional
The number of rows to use to train the requested model.
monotonic_increasing_featurelist_id : str, optional
(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. Passing None disables the increasing monotonicity constraint. The default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
monotonic_decreasing_featurelist_id : str, optional
(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. Passing None disables the decreasing monotonicity constraint. The default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
Returns: model_job_id : str
id of the created job; can be used as a parameter to the ModelJob.get method or the wait_for_async_model_creation function
Examples
project = Project.get('p-id')
model = Model.get('p-id', 'l-id')
model_job_id = model.train(training_row_count=project.max_train_rows)
-
train_datetime
(featurelist_id=None, training_row_count=None, training_duration=None, time_window_sample_pct=None)¶ Train this model on a different featurelist or amount of data
Requires that this model is part of a datetime partitioned project; otherwise, an error will occur.
Parameters: featurelist_id : str, optional
the featurelist to use to train the model. If not specified, the featurelist of this model is used.
training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
time_window_sample_pct : int, optional
may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.
Returns: job : ModelJob
the created job to build the model
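The training_duration argument expects a duration string. Assuming the backend accepts standard ISO 8601 durations (e.g. 'P120D' for 120 days), a small hypothetical helper for composing them might look like this; only the train_datetime call itself comes from the docs above:

```python
# Hypothetical helper for composing ISO 8601 duration strings for use as
# train_datetime's training_duration argument (assumption: the backend
# accepts standard ISO 8601 durations such as 'P120D').
def duration_string(years=0, months=0, days=0):
    if not any((years, months, days)):
        raise ValueError("at least one component is required")
    out = "P"
    if years:
        out += "%dY" % years
    if months:
        out += "%dM" % months
    if days:
        out += "%dD" % days
    return out

# The retraining job itself would then be submitted along these lines:
# job = model.train_datetime(training_duration=duration_string(days=120))
```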
-
request_predictions
(dataset_id)¶ Request predictions against a previously uploaded dataset
Parameters: dataset_id : string
The dataset to make predictions against (as uploaded from Project.upload_dataset)
Returns: job : PredictJob
The job computing the predictions
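A minimal sketch tying the upload and prediction steps together. Only the method names come from the docs above; the helper itself and the dataset's id attribute are assumptions, and running it requires a configured datarobot client, so this is illustrative rather than a definitive recipe:

```python
def score_dataset(project, model, path):
    """Sketch: upload a file with Project.upload_dataset, then request
    predictions against it and wait for the result. Requires a configured
    datarobot client; the dataset's `id` attribute is an assumption."""
    dataset = project.upload_dataset(path)
    predict_job = model.request_predictions(dataset.id)
    return predict_job.get_result_when_complete()
```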
-
get_feature_impact
()¶ Retrieve the computed Feature Impact results, a measure of the relevance of each feature in the model.
Elsewhere this technique is sometimes called ‘Permutation Importance’.
Requires that Feature Impact has already been computed with request_feature_impact.
Returns: feature_impacts : list[dict]
The feature impact data. Each item is a dict with the keys ‘featureName’, ‘impactNormalized’, and ‘impactUnnormalized’. See the help for Model.request_feature_impact for more details.
Raises: ClientError (404)
If the feature impacts have not been computed.
-
request_feature_impact
()¶ Request feature impacts to be computed for the model.
Feature Impact is computed for each column by creating new data with that column randomly permuted (but the others left unchanged), and seeing how the error metric score for the predictions is affected. The ‘impactUnnormalized’ is how much worse the error metric score is when making predictions on this modified data. The ‘impactNormalized’ is normalized so that the largest value is 1. In both cases, larger values indicate more important features.
Returns: job : Job
A Job representing the feature impact computation. To get the completed feature impact data, use job.get_result or job.get_result_when_complete.
Raises: JobAlreadyRequested (422)
If the feature impacts have already been requested.
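Once the job from request_feature_impact completes (e.g. via job.get_result_when_complete), get_feature_impact returns plain dicts that can be post-processed locally. A sketch that ranks features by normalized impact; the dict keys are the documented ones, while the sample values are made up:

```python
# Rank features by the documented 'impactNormalized' key; larger values
# indicate more important features.
def top_features(feature_impacts, n=3):
    ranked = sorted(feature_impacts,
                    key=lambda fi: fi["impactNormalized"], reverse=True)
    return [fi["featureName"] for fi in ranked[:n]]

# Made-up sample in the documented shape:
sample = [
    {"featureName": "age", "impactNormalized": 1.0, "impactUnnormalized": 0.08},
    {"featureName": "income", "impactNormalized": 0.4, "impactUnnormalized": 0.03},
    {"featureName": "zip", "impactNormalized": 0.1, "impactUnnormalized": 0.008},
]
print(top_features(sample, n=2))  # → ['age', 'income']
```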
-
get_prime_eligibility
()¶ Check if this model can be approximated with DataRobot Prime
Returns: prime_eligibility : dict
a dict indicating whether a model can be approximated with DataRobot Prime (key can_make_prime) and why it may be ineligible (key message)
-
request_approximation
()¶ Request an approximation of this model using DataRobot Prime
This will create several rulesets that could be used to approximate this model. After comparing their scores and rule counts, the code used in the approximation can be downloaded and run locally.
Returns: job : Job
the job generating the rulesets
-
get_rulesets
()¶ List the rulesets approximating this model generated by DataRobot Prime
If this model hasn’t been approximated yet, will return an empty list. Note that these are rulesets approximating this model, not rulesets used to construct this model.
Returns: rulesets : list of Ruleset
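Comparing scores and rule counts, as suggested above, is ordinary client-side work. A sketch using a stand-in for Ruleset, assuming the real objects expose score and rule_count attributes and that a higher score is better (for error metrics such as LogLoss you would invert the comparison):

```python
from collections import namedtuple

# Stand-in for datarobot.models.Ruleset; attribute names are assumptions
# based on the description above.
Ruleset = namedtuple("Ruleset", "ruleset_id score rule_count")

def best_ruleset(rulesets, max_rules):
    """Pick the best-scoring ruleset that stays within a rule-count budget."""
    eligible = [r for r in rulesets if r.rule_count <= max_rules]
    return max(eligible, key=lambda r: r.score) if eligible else None

candidates = [Ruleset("a", 0.81, 5), Ruleset("b", 0.86, 40), Ruleset("c", 0.84, 12)]
print(best_ruleset(candidates, max_rules=15).ruleset_id)  # → c
```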
-
download_export
(filepath)¶ Download an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
Parameters: filepath : str
The path at which to save the exported model file.
-
request_transferable_export
()¶ Request generation of an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
This function does not download the exported file. Use download_export for that.
Examples
model = datarobot.Model.get('p-id', 'l-id')
job = model.request_transferable_export()
job.wait_for_completion()
model.download_export('my_exported_model.drmodel')

# Client must be configured to use standalone prediction server for import:
datarobot.Client(token='my-token-at-standalone-server',
                 endpoint='standalone-server-url/api/v2')
imported_model = datarobot.ImportedModel.create('my_exported_model.drmodel')
-
request_frozen_model
(sample_pct=None, training_row_count=None)¶ Train a new frozen model with parameters from this model
Note
This method only works if the project the model belongs to is not datetime partitioned. If it is, use request_frozen_datetime_model instead.
Frozen models use the same tuning parameters as their parent model instead of independently optimizing them, to allow efficiently retraining models on larger amounts of the training data.
Parameters: sample_pct : float
optional, the percentage of the dataset to use with the model. If not provided, will use the value from this model.
training_row_count : int
(New in version v2.9) optional, the integer number of rows of the dataset to use with the model. Only one of sample_pct and training_row_count should be specified.
Returns: model_job : ModelJob
the modeling job training a frozen model
-
request_frozen_datetime_model
(training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, time_window_sample_pct=None)¶ Train a new frozen model with parameters from this model
Requires that this model belongs to a datetime partitioned project. If it does not, an error will occur when submitting the job.
Frozen models use the same tuning parameters as their parent model instead of independently optimizing them to allow efficiently retraining models on larger amounts of the training data.
In addition to training_row_count and training_duration, frozen datetime models may be trained on an exact date range. Only one of training_row_count, training_duration, or training_start_date and training_end_date should be specified.
Models specified using training_start_date and training_end_date are the only ones that can be trained into the holdout data (once the holdout is unlocked).
Parameters: training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
training_start_date : datetime.datetime, optional
the start date of the data to train the model on. Only rows occurring at or after this datetime will be used. If training_start_date is specified, training_end_date must also be specified.
training_end_date : datetime.datetime, optional
the end date of the data to train the model on. Only rows occurring strictly before this datetime will be used. If training_end_date is specified, training_start_date must also be specified.
time_window_sample_pct : int, optional
may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.
Returns: model_job : ModelJob
the modeling job training a frozen model
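The mutual-exclusion rules above can be checked client-side before submitting a job. A minimal sketch of such a check; this local helper is not part of the package:

```python
# Local mirror of the rule described above: specify at most one of
# training_row_count, training_duration, or a start/end date pair, and
# the two dates only ever together.
def check_frozen_args(training_row_count=None, training_duration=None,
                      training_start_date=None, training_end_date=None):
    if (training_start_date is None) != (training_end_date is None):
        raise ValueError(
            "training_start_date and training_end_date must be given together")
    modes = [training_row_count is not None,
             training_duration is not None,
             training_start_date is not None]
    if sum(modes) > 1:
        raise ValueError("specify only one way of selecting training data")
```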
-
get_parameters
()¶ Retrieve model parameters.
Returns: ModelParameters
Model parameters for this model.
-
get_lift_chart
(source)¶ Retrieve model lift chart for the specified source.
Parameters: source : str
Lift chart data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
Returns: LiftChart
Model lift chart data
-
get_all_lift_charts
()¶ Retrieve a list of all lift charts available for the model.
Returns: list of LiftChart
Data for all available model lift charts.
-
get_confusion_chart
(source)¶ Retrieve model’s confusion chart for the specified source.
Parameters: source : str
Confusion chart source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
Returns: ConfusionChart
Model ConfusionChart data
-
get_all_confusion_charts
()¶ Retrieve a list of all confusion charts available for the model.
Returns: list of ConfusionChart
Data for all available confusion charts for model.
-
get_roc_curve
(source)¶ Retrieve model ROC curve for the specified source.
Parameters: source : str
ROC curve data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
Returns: RocCurve
Model ROC curve data
-
get_all_roc_curves
()¶ Retrieve a list of all ROC curves available for the model.
Returns: list of RocCurve
Data for all available model ROC curves.
-
get_word_cloud
(exclude_stop_words=False)¶ Retrieve word cloud data for the model.
Parameters: exclude_stop_words : bool, optional
Set to True if you want stopwords filtered out of the response.
Returns: WordCloud
Word cloud data for the model.
-
download_scoring_code
(file_name, source_code=False)¶ Download scoring code JAR.
Parameters: file_name : str
File path where scoring code will be saved.
source_code : bool, optional
Set to True to download source code archive. It will not be executable.
-
get_model_blueprint_documents
()¶ Get documentation for tasks used in this model.
Returns: list of BlueprintTaskDocument
All documents available for the model.
-
get_model_blueprint_chart
()¶ Retrieve a model blueprint chart that can be used to understand data flow in blueprint.
Returns: ModelBlueprintChart
The queried model blueprint chart.
-
get_missing_report_info
()¶ Retrieve a model’s Missing Values report
The report explains for numeric and categorical features how many times they were missing in the training data and how various tasks in the model handled the missing values.
Returns: MissingValuesReport
A Missing Values report: an iterable of datarobot.models.missing_report.MissingReportPerFeature instances
-
request_training_predictions
(data_subset)¶ Start a job to build training predictions
Parameters: data_subset : str
data set definition to build predictions on. Choices are:
- dr.enums.DATA_SUBSET.ALL for all data available
- dr.enums.DATA_SUBSET.VALIDATION_AND_HOLDOUT for all data except training set
- dr.enums.DATA_SUBSET.HOLDOUT for holdout data set only
Returns: Job
an instance of the created async job
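A short sketch of kicking off a training-predictions job and collecting its result. The method names are the documented ones; the helper itself is an assumption and requires a configured datarobot client:

```python
def training_predictions_for(model, data_subset):
    """Sketch: request training predictions for one of the documented
    dr.enums.DATA_SUBSET choices and block until the async job finishes."""
    job = model.request_training_predictions(data_subset)
    return job.get_result_when_complete()
```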
-
cross_validate
()¶ Run Cross Validation on this model.
Note
To perform Cross Validation on a new model with new parameters, use train instead.
Returns: ModelJob
The created job to build the model
PrimeModel API¶
-
class
datarobot.models.
PrimeModel
(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, parent_model_id=None, ruleset_id=None, rule_count=None, score=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None)¶ A DataRobot Prime model approximating a parent model with downloadable code
Attributes
id (str) the id of the model
project_id (str) the id of the project the model belongs to
processes (list of str) the processes used by the model
featurelist_name (str) the name of the featurelist used by the model
featurelist_id (str) the id of the featurelist used by the model
sample_pct (float) the percentage of the project dataset used in training the model
training_row_count (int or None) the number of rows of the project dataset used in training the model. In a datetime partitioned project, if specified, defines the number of rows used to train the model and evaluate backtest scores; if unspecified, either training_duration or training_start_date and training_end_date was used to determine that instead.
training_duration (str or None) only present for models in datetime partitioned projects. If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores.
training_start_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
training_end_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
model_type (str) what model this is, e.g. ‘DataRobot Prime’
model_category (str) what kind of model this is - always ‘prime’ for DataRobot Prime models
is_frozen (bool) whether this model is a frozen model
blueprint_id (str) the id of the blueprint used in this model
metrics (dict) a mapping from each metric to the model’s scores for that metric
ruleset (Ruleset) the ruleset used in the Prime model
parent_model_id (str) the id of the model that this Prime model approximates
monotonic_increasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
monotonic_decreasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
supports_monotonic_constraints (bool) optional, whether this model supports enforcing monotonic constraints
-
classmethod
get
(project_id, model_id)¶ Retrieve a specific prime model.
Parameters: project_id : str
The id of the project the prime model belongs to
model_id : str
The model_id of the prime model to retrieve.
Returns: model : PrimeModel
The queried instance.
-
request_download_validation
(language)¶ Prep and validate the downloadable code for the ruleset associated with this model
Parameters: language : str
the language the code should be downloaded in - see datarobot.enums.PRIME_LANGUAGE for available languages
Returns: job : Job
A job tracking the code preparation and validation
-
cross_validate
()¶ Run Cross Validation on this model.
Note
To perform Cross Validation on a new model with new parameters, use train instead.
Returns: ModelJob
The created job to build the model
-
delete
()¶ Delete a model from the project’s leaderboard.
-
download_export
(filepath)¶ Download an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
Parameters: filepath : str
The path at which to save the exported model file.
-
download_scoring_code
(file_name, source_code=False)¶ Download scoring code JAR.
Parameters: file_name : str
File path where scoring code will be saved.
source_code : bool, optional
Set to True to download source code archive. It will not be executable.
-
fetch_resource_data
(*args, **kwargs)¶ (Deprecated.) Used to acquire model data directly from its url.
Consider using get instead; this is a convenience function used in the development of the datarobot package.
Parameters: url : string
The resource we are acquiring
join_endpoint : boolean, optional
Whether the client’s endpoint should be joined to the URL before sending the request. Location headers are returned as absolute locations, so they will not need the endpoint.
Returns: model_data : dict
The queried model’s data
-
get_all_confusion_charts
()¶ Retrieve a list of all confusion charts available for the model.
Returns: list of ConfusionChart
Data for all available confusion charts for model.
-
get_all_lift_charts
()¶ Retrieve a list of all lift charts available for the model.
Returns: list of LiftChart
Data for all available model lift charts.
-
get_all_roc_curves
()¶ Retrieve a list of all ROC curves available for the model.
Returns: list of RocCurve
Data for all available model ROC curves.
-
get_confusion_chart
(source)¶ Retrieve model’s confusion chart for the specified source.
Parameters: source : str
Confusion chart source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
Returns: ConfusionChart
Model ConfusionChart data
-
get_feature_impact
()¶ Retrieve the computed Feature Impact results, a measure of the relevance of each feature in the model.
Elsewhere this technique is sometimes called ‘Permutation Importance’.
Requires that Feature Impact has already been computed with request_feature_impact.
Returns: feature_impacts : list[dict]
The feature impact data. Each item is a dict with the keys ‘featureName’, ‘impactNormalized’, and ‘impactUnnormalized’. See the help for Model.request_feature_impact for more details.
Raises: ClientError (404)
If the feature impacts have not been computed.
-
get_features_used
()¶ Query the server to determine which features were used.
Note that the data returned by this method may differ from the names of the features in the featurelist used by this model. This method returns the raw features that must be supplied in order for predictions to be generated on a new set of data; the featurelist, in contrast, would also include the names of derived features.
Returns: features : list of str
The names of the features used in the model.
-
get_leaderboard_ui_permalink
()¶ Returns: url : str
Permanent static hyperlink to this model in the leaderboard.
-
get_lift_chart
(source)¶ Retrieve model lift chart for the specified source.
Parameters: source : str
Lift chart data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
Returns: LiftChart
Model lift chart data
-
get_missing_report_info
()¶ Retrieve a model’s Missing Values report
The report explains for numeric and categorical features how many times they were missing in the training data and how various tasks in the model handled the missing values.
Returns: MissingValuesReport
A Missing Values report: an iterable of datarobot.models.missing_report.MissingReportPerFeature instances
-
get_model_blueprint_chart
()¶ Retrieve a model blueprint chart that can be used to understand data flow in blueprint.
Returns: ModelBlueprintChart
The queried model blueprint chart.
-
get_model_blueprint_documents
()¶ Get documentation for tasks used in this model.
Returns: list of BlueprintTaskDocument
All documents available for the model.
-
get_parameters
()¶ Retrieve model parameters.
Returns: ModelParameters
Model parameters for this model.
-
get_prime_eligibility
()¶ Check if this model can be approximated with DataRobot Prime
Returns: prime_eligibility : dict
a dict indicating whether a model can be approximated with DataRobot Prime (key can_make_prime) and why it may be ineligible (key message)
-
get_roc_curve
(source)¶ Retrieve model ROC curve for the specified source.
Parameters: source : str
ROC curve data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
Returns: RocCurve
Model ROC curve data
-
get_rulesets
()¶ List the rulesets approximating this model generated by DataRobot Prime
If this model hasn’t been approximated yet, will return an empty list. Note that these are rulesets approximating this model, not rulesets used to construct this model.
Returns: rulesets : list of Ruleset
-
get_word_cloud
(exclude_stop_words=False)¶ Retrieve word cloud data for the model.
Parameters: exclude_stop_words : bool, optional
Set to True if you want stopwords filtered out of the response.
Returns: WordCloud
Word cloud data for the model.
-
open_model_browser
()¶ Opens model at project leaderboard in web browser.
Note: If text-mode browsers are used, the calling process will block until the user exits the browser.
-
request_feature_impact
()¶ Request feature impacts to be computed for the model.
Feature Impact is computed for each column by creating new data with that column randomly permuted (but the others left unchanged), and seeing how the error metric score for the predictions is affected. The ‘impactUnnormalized’ is how much worse the error metric score is when making predictions on this modified data. The ‘impactNormalized’ is normalized so that the largest value is 1. In both cases, larger values indicate more important features.
Returns: job : Job
A Job representing the feature impact computation. To get the completed feature impact data, use job.get_result or job.get_result_when_complete.
Raises: JobAlreadyRequested (422)
If the feature impacts have already been requested.
-
request_predictions
(dataset_id)¶ Request predictions against a previously uploaded dataset
Parameters: dataset_id : string
The dataset to make predictions against (as uploaded from Project.upload_dataset)
Returns: job : PredictJob
The job computing the predictions
-
request_training_predictions
(data_subset)¶ Start a job to build training predictions
Parameters: data_subset : str
data set definition to build predictions on. Choices are:
- dr.enums.DATA_SUBSET.ALL for all data available
- dr.enums.DATA_SUBSET.VALIDATION_AND_HOLDOUT for all data except training set
- dr.enums.DATA_SUBSET.HOLDOUT for holdout data set only
Returns: Job
an instance of the created async job
-
request_transferable_export
()¶ Request generation of an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
This function does not download the exported file. Use download_export for that.
Examples
model = datarobot.Model.get('p-id', 'l-id')
job = model.request_transferable_export()
job.wait_for_completion()
model.download_export('my_exported_model.drmodel')

# Client must be configured to use standalone prediction server for import:
datarobot.Client(token='my-token-at-standalone-server',
                 endpoint='standalone-server-url/api/v2')
imported_model = datarobot.ImportedModel.create('my_exported_model.drmodel')
BlenderModel API¶
-
class
datarobot.models.
BlenderModel
(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, model_ids=None, blender_method=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None)¶ Blender model that combines prediction results from other models.
Attributes
id (str) the id of the model
project_id (str) the id of the project the model belongs to
processes (list of str) the processes used by the model
featurelist_name (str) the name of the featurelist used by the model
featurelist_id (str) the id of the featurelist used by the model
sample_pct (float) the percentage of the project dataset used in training the model
training_row_count (int or None) the number of rows of the project dataset used in training the model. In a datetime partitioned project, if specified, defines the number of rows used to train the model and evaluate backtest scores; if unspecified, either training_duration or training_start_date and training_end_date was used to determine that instead.
training_duration (str or None) only present for models in datetime partitioned projects. If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores.
training_start_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
training_end_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
model_type (str) what model this is
model_category (str) what kind of model this is - ‘blend’ for blender models
is_frozen (bool) whether this model is a frozen model
blueprint_id (str) the id of the blueprint used in this model
metrics (dict) a mapping from each metric to the model’s scores for that metric
model_ids (list of str) List of model ids used in blender
blender_method (str) Method used to blend results from underlying models
monotonic_increasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
monotonic_decreasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
supports_monotonic_constraints (bool) optional, whether this model supports enforcing monotonic constraints
-
classmethod
get
(project_id, model_id)¶ Retrieve a specific blender.
Parameters: project_id : str
The project’s id.
model_id : str
The model_id of the leaderboard item to retrieve.
Returns: model : BlenderModel
The queried instance.
-
cross_validate
()¶ Run Cross Validation on this model.
Note
To perform Cross Validation on a new model with new parameters, use train instead.
Returns: ModelJob
The created job to build the model
-
delete
()¶ Delete a model from the project’s leaderboard.
-
download_export
(filepath)¶ Download an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
Parameters: filepath : str
The path at which to save the exported model file.
-
download_scoring_code
(file_name, source_code=False)¶ Download scoring code JAR.
Parameters: file_name : str
File path where scoring code will be saved.
source_code : bool, optional
Set to True to download source code archive. It will not be executable.
-
fetch_resource_data
(*args, **kwargs)¶ (Deprecated.) Used to acquire model data directly from its url.
Consider using get instead; this is a convenience function used in the development of the datarobot package.
Parameters: url : string
The resource we are acquiring
join_endpoint : boolean, optional
Whether the client’s endpoint should be joined to the URL before sending the request. Location headers are returned as absolute locations, so they will not need the endpoint.
Returns: model_data : dict
The queried model’s data
-
get_all_confusion_charts
()¶ Retrieve a list of all confusion charts available for the model.
Returns: list of ConfusionChart
Data for all available confusion charts for model.
-
get_all_lift_charts
()¶ Retrieve a list of all lift charts available for the model.
Returns: list of LiftChart
Data for all available model lift charts.
-
get_all_roc_curves
()¶ Retrieve a list of all ROC curves available for the model.
Returns: list of RocCurve
Data for all available model ROC curves.
-
get_confusion_chart
(source)¶ Retrieve model’s confusion chart for the specified source.
Parameters: source : str
Confusion chart source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
Returns: ConfusionChart
Model ConfusionChart data
-
get_feature_impact
()¶ Retrieve the computed Feature Impact results, a measure of the relevance of each feature in the model.
Elsewhere this technique is sometimes called ‘Permutation Importance’.
Requires that Feature Impact has already been computed with request_feature_impact.
Returns: feature_impacts : list[dict]
The feature impact data. Each item is a dict with the keys ‘featureName’, ‘impactNormalized’, and ‘impactUnnormalized’. See the help for Model.request_feature_impact for more details.
Raises: ClientError (404)
If the feature impacts have not been computed.
-
get_features_used
()¶ Query the server to determine which features were used.
Note that the data returned by this method may differ from the names of the features in the featurelist used by this model. This method returns the raw features that must be supplied in order for predictions to be generated on a new set of data; the featurelist, in contrast, would also include the names of derived features.
Returns: features : list of str
The names of the features used in the model.
-
get_leaderboard_ui_permalink
()¶ Returns: url : str
Permanent static hyperlink to this model in the leaderboard.
-
get_lift_chart
(source)¶ Retrieve model lift chart for the specified source.
Parameters: source : str
Lift chart data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
Returns: LiftChart
Model lift chart data
-
get_missing_report_info
()¶ Retrieve a model’s Missing Values report
The report explains for numeric and categorical features how many times they were missing in the training data and how various tasks in the model handled the missing values.
Returns: MissingValuesReport
A Missing Values report: an iterable of datarobot.models.missing_report.MissingReportPerFeature instances
-
get_model_blueprint_chart
()¶ Retrieve a model blueprint chart that can be used to understand data flow in blueprint.
Returns: ModelBlueprintChart
The queried model blueprint chart.
-
get_model_blueprint_documents
()¶ Get documentation for tasks used in this model.
Returns: list of BlueprintTaskDocument
All documents available for the model.
-
get_parameters
()¶ Retrieve model parameters.
Returns: ModelParameters
Model parameters for this model.
-
get_prime_eligibility
()¶ Check if this model can be approximated with DataRobot Prime
Returns: prime_eligibility : dict
a dict indicating whether a model can be approximated with DataRobot Prime (key can_make_prime) and why it may be ineligible (key message)
-
get_roc_curve
(source)¶ Retrieve model ROC curve for the specified source.
Parameters: source : str
ROC curve data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
Returns: RocCurve
Model ROC curve data
-
get_rulesets
()¶ List the rulesets approximating this model generated by DataRobot Prime
If this model hasn’t been approximated yet, will return an empty list. Note that these are rulesets approximating this model, not rulesets used to construct this model.
Returns: rulesets : list of Ruleset
-
get_word_cloud
(exclude_stop_words=False)¶ Retrieve word cloud data for the model.
Parameters: exclude_stop_words : bool, optional
Set to True if you want stopwords filtered out of the response.
Returns: WordCloud
Word cloud data for the model.
-
open_model_browser
()¶ Opens model at project leaderboard in web browser.
Note: If text-mode browsers are used, the calling process will block until the user exits the browser.
-
request_approximation
()¶ Request an approximation of this model using DataRobot Prime
This will create several rulesets that could be used to approximate this model. After comparing their scores and rule counts, the code used in the approximation can be downloaded and run locally.
Returns: job : Job
the job generating the rulesets
-
request_feature_impact
()¶ Request feature impacts to be computed for the model.
Feature Impact is computed for each column by creating new data with that column randomly permuted (but the others left unchanged), and seeing how the error metric score for the predictions is affected. The ‘impactUnnormalized’ is how much worse the error metric score is when making predictions on this modified data. The ‘impactNormalized’ is normalized so that the largest value is 1. In both cases, larger values indicate more important features.
Returns: job : Job
A Job representing the feature impact computation. To get the completed feature impact data, use job.get_result or job.get_result_when_complete.
Raises: JobAlreadyRequested (422)
If the feature impacts have already been requested.
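Putting the request and retrieval together (a sketch: the dictionary keys follow the description above, while the call sequence and sorting step are assumed typical usage):

```python
def feature_impact_ranked(model):
    """Request Feature Impact and return results sorted by normalized
    impact, most important feature first."""
    job = model.request_feature_impact()
    impacts = job.get_result_when_complete()
    return sorted(impacts, key=lambda f: f["impactNormalized"], reverse=True)
```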
-
request_frozen_datetime_model
(training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, time_window_sample_pct=None)¶ Train a new frozen model with parameters from this model
Requires that this model belongs to a datetime partitioned project. If it does not, an error will occur when submitting the job.
Frozen models use the same tuning parameters as their parent model instead of independently optimizing them, allowing models to be retrained efficiently on larger amounts of the training data.
In addition to training_row_count and training_duration, frozen datetime models may be trained on an exact date range. Only one of training_row_count, training_duration, or training_start_date and training_end_date should be specified.
Models specified using training_start_date and training_end_date are the only ones that can be trained into the holdout data (once the holdout is unlocked).
Parameters: training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
training_start_date : datetime.datetime, optional
the start date of the data to train the model on. Only rows occurring at or after this datetime will be used. If training_start_date is specified, training_end_date must also be specified.
training_end_date : datetime.datetime, optional
the end date of the data to train the model on. Only rows occurring strictly before this datetime will be used. If training_end_date is specified, training_start_date must also be specified.
time_window_sample_pct : int, optional
may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.
Returns: model_job : ModelJob
the modeling job training a frozen model
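The date-range form described above might be wrapped as follows (a sketch; the up-front validation simply restates the documented constraints):

```python
def freeze_on_date_range(model, start, end):
    """Train a frozen model on rows at or after `start` and strictly
    before `end` (both datetime.datetime), per the constraints above."""
    if start is None or end is None:
        raise ValueError("training_start_date and training_end_date "
                         "must be specified together")
    if start >= end:
        raise ValueError("training_start_date must precede training_end_date")
    return model.request_frozen_datetime_model(
        training_start_date=start, training_end_date=end)
```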
-
request_frozen_model
(sample_pct=None, training_row_count=None)¶ Train a new frozen model with parameters from this model
Note
This method only works if the project the model belongs to is not datetime partitioned. If it is, use request_frozen_datetime_model instead.
Frozen models use the same tuning parameters as their parent model instead of independently optimizing them, allowing models to be retrained efficiently on larger amounts of the training data.
Parameters: sample_pct : float
optional, the percentage of the dataset to use with the model. If not provided, will use the value from this model.
training_row_count : int
(New in version v2.9) optional, the integer number of rows of the dataset to use with the model. Only one of sample_pct and training_row_count should be specified.
Returns: model_job : ModelJob
the modeling job training a frozen model
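The mutual exclusion of sample_pct and training_row_count noted above can be enforced before calling (a sketch of assumed typical usage):

```python
def freeze(model, sample_pct=None, training_row_count=None):
    """Train a frozen copy of `model`, passing at most one sizing
    argument (or neither, to reuse this model's value)."""
    if sample_pct is not None and training_row_count is not None:
        raise ValueError("specify only one of sample_pct and training_row_count")
    return model.request_frozen_model(sample_pct=sample_pct,
                                      training_row_count=training_row_count)
```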
-
request_predictions
(dataset_id)¶ Request predictions against a previously uploaded dataset
Parameters: dataset_id : string
The dataset to make predictions against (as uploaded from Project.upload_dataset)
Returns: job : PredictJob
The job computing the predictions
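End to end, the upload-then-predict flow reads roughly as below (a sketch: the dataset's .id attribute and the blocking get_result_when_complete call on the returned job are assumptions):

```python
def score_file(project, model, path):
    """Upload `path` via Project.upload_dataset, then score it with `model`."""
    dataset = project.upload_dataset(path)
    predict_job = model.request_predictions(dataset.id)
    return predict_job.get_result_when_complete()
```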
-
request_training_predictions
(data_subset)¶ Start a job to build training predictions
Parameters: data_subset : str
data set definition to build predictions on. Choices are:
- dr.enums.DATA_SUBSET.ALL for all data available
- dr.enums.DATA_SUBSET.VALIDATION_AND_HOLDOUT for all data except training set
- dr.enums.DATA_SUBSET.HOLDOUT for holdout data set only
Returns: Job
an instance of created async job
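A sketch of starting the job and waiting for its output (the data-subset constant would normally come from dr.enums.DATA_SUBSET as listed above; the blocking retrieval call is an assumption from the Job API):

```python
def training_predictions(model, data_subset):
    """Start a training-predictions job for `data_subset` and block
    until the predictions are available."""
    job = model.request_training_predictions(data_subset)
    return job.get_result_when_complete()
```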
-
request_transferable_export
()¶ Request generation of an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
This function does not download the exported file. Use download_export for that.
Examples
model = datarobot.Model.get('p-id', 'l-id')
job = model.request_transferable_export()
job.wait_for_completion()
model.download_export('my_exported_model.drmodel')

# Client must be configured to use standalone prediction server for import:
datarobot.Client(token='my-token-at-standalone-server',
                 endpoint='standalone-server-url/api/v2')
imported_model = datarobot.ImportedModel.create('my_exported_model.drmodel')
-
train
(sample_pct=None, featurelist_id=None, scoring_type=None, training_row_count=None, monotonic_increasing_featurelist_id=<object object>, monotonic_decreasing_featurelist_id=<object object>)¶ Train the blueprint used in model on a particular featurelist or amount of data.
This method creates a new training job for worker and appends it to the end of the queue for this project. After the job has finished you can get the newly trained model by retrieving it from the project leaderboard, or by retrieving the result of the job.
Either sample_pct or training_row_count can be used to specify the amount of data to use, but not both. If neither is specified, the maximum amount of data that can safely be used to train any blueprint without going into the validation data will be selected by default.
In smart-sampled projects, sample_pct and training_row_count are assumed to be in terms of rows of the minority class.
Note
For datetime partitioned projects, use train_datetime instead.
Parameters: sample_pct : float, optional
The amount of data to use for training, as a percentage of the project dataset from 0 to 100.
featurelist_id : str, optional
The identifier of the featurelist to use. If not defined, the featurelist of this model is used.
scoring_type : str, optional
Either SCORING_TYPE.validation or SCORING_TYPE.cross_validation. SCORING_TYPE.validation is available for every partitioning type, and indicates that the default model validation should be used for the project. If the project uses a form of cross-validation partitioning, SCORING_TYPE.cross_validation can also be used to indicate that all of the available training/validation combinations should be used to evaluate the model.
training_row_count : int, optional
The number of rows to use to train the requested model.
monotonic_increasing_featurelist_id : str
(new in version 2.11) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. Passing None disables the increasing monotonicity constraint. The default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
monotonic_decreasing_featurelist_id : str
(new in version 2.11) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. Passing None disables the decreasing monotonicity constraint. The default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
Returns: model_job_id : str
id of the created job; can be used as a parameter to the ModelJob.get method or the wait_for_async_model_creation function
Examples
project = Project.get('p-id')
model = Model.get('p-id', 'l-id')
model_job_id = model.train(training_row_count=project.max_train_rows)
-
train_datetime
(featurelist_id=None, training_row_count=None, training_duration=None, time_window_sample_pct=None)¶ Train this model on a different featurelist or amount of data
Requires that this model is part of a datetime partitioned project; otherwise, an error will occur.
Parameters: featurelist_id : str, optional
the featurelist to use to train the model. If not specified, the featurelist of this model is used.
training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
time_window_sample_pct : int, optional
may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.
Returns: job : ModelJob
the created job to build the model
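For example, retraining on a duration with window sampling might be wrapped like this (a sketch; the range check restates the documented 1-99 bound on time_window_sample_pct):

```python
def retrain_on_window(model, training_duration, time_window_sample_pct=None):
    """Retrain `model` on a time window given by a duration string,
    optionally sampling within the window."""
    if time_window_sample_pct is not None and not 1 <= time_window_sample_pct <= 99:
        raise ValueError("time_window_sample_pct must be between 1 and 99")
    return model.train_datetime(training_duration=training_duration,
                                time_window_sample_pct=time_window_sample_pct)
```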
-
DatetimeModel API¶
-
class
datarobot.models.
DatetimeModel
(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, time_window_sample_pct=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, training_info=None, holdout_score=None, holdout_status=None, data_selection_method=None, backtests=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None)¶ A model from a datetime partitioned project
Only one of training_row_count, training_duration, or the pair training_start_date and training_end_date will be specified, depending on the data_selection_method of the model. Whichever method was selected determines the amount of data used to train on when making predictions and scoring the backtests and the holdout.
Attributes
id (str) the id of the model
project_id (str) the id of the project the model belongs to
processes (list of str) the processes used by the model
featurelist_name (str) the name of the featurelist used by the model
featurelist_id (str) the id of the featurelist used by the model
sample_pct (float) the percentage of the project dataset used in training the model
training_row_count (int or None) if specified, the number of rows used to train the model and evaluate backtest scores
training_duration (str or None) if specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores
training_start_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
training_end_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
time_window_sample_pct (int or None) an integer between 1 and 99 indicating the percentage of sampling within the training window. The points kept are determined by a random uniform sample. If not specified, no sampling was done.
model_type (str) what model this is, e.g. ‘Nystroem Kernel SVM Regressor’
model_category (str) what kind of model this is - ‘prime’ for DataRobot Prime models, ‘blend’ for blender models, and ‘model’ for other models
is_frozen (bool) whether this model is a frozen model
blueprint_id (str) the id of the blueprint used in this model
metrics (dict) a mapping from each metric to the model’s scores for that metric. The keys in metrics are the different metrics used to evaluate the model, and the values are the results. The dictionaries inside of metrics will be as described here: ‘validation’, the score for a single backtest; ‘crossValidation’, always None; ‘backtesting’, the average score for all backtests if all are available and computed, or None otherwise; ‘backtestingScores’, a list of scores for all backtests where the score is None if that backtest does not have a score available; and ‘holdout’, the score for the holdout or None if the holdout is locked or the score is unavailable.
backtests (list of dict) describes what data was used to fit each backtest, the score for the project metric, and why the backtest score is unavailable if it is not provided
data_selection_method (str) which of training_row_count, training_duration, or training_start_date and training_end_date were used to determine the data used to fit the model. One of ‘rowCount’, ‘duration’, or ‘selectedDateRange’.
training_info (dict) describes which data was used to train on when scoring the holdout and making predictions. training_info will have the following keys: holdout_training_start_date, holdout_training_duration, holdout_training_row_count, holdout_training_end_date, prediction_training_start_date, prediction_training_duration, prediction_training_row_count, prediction_training_end_date. Start and end dates will be datetimes, durations will be duration strings, and rows will be integers.
holdout_score (float or None) the score against the holdout, if available and the holdout is unlocked, according to the project metric
holdout_status (string or None) the status of the holdout score, e.g. “COMPLETED”, “HOLDOUT_BOUNDARIES_EXCEEDED”. Unavailable if the holdout fold was disabled in the partitioning configuration.
monotonic_increasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
monotonic_decreasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
supports_monotonic_constraints (bool) optional, whether this model supports enforcing monotonic constraints
-
classmethod
get
(project, model_id)¶ Retrieve a specific datetime model
If the project does not use datetime partitioning, a ClientError will occur.
Parameters: project : str
the id of the project the model belongs to
model_id : str
the id of the model to retrieve
Returns: model : DatetimeModel
the model
-
score_backtests
()¶ Compute the scores for all available backtests
Some backtests may be unavailable if the model is trained into their validation data.
Returns: job : Job
a job tracking the backtest computation. When it is complete, all available backtests will have scores computed.
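Once scored, the averaged backtest score surfaces under the ‘backtesting’ key of model.metrics, as described in the Attributes section above; a sketch (the wait_for_completion call is assumed from the Job API, and in practice the model may need to be re-retrieved to see updated metrics):

```python
def average_backtest_score(model, metric_name):
    """Score all available backtests, then return the averaged
    'backtesting' entry for `metric_name` (None if incomplete)."""
    job = model.score_backtests()
    job.wait_for_completion()
    return model.metrics[metric_name]["backtesting"]
```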
-
cross_validate
()¶ Inherited from Model - DatetimeModels cannot request Cross Validation; use score_backtests instead.
-
delete
()¶ Delete a model from the project’s leaderboard.
-
download_export
(filepath)¶ Download an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
Parameters: filepath : str
The path at which to save the exported model file.
-
download_scoring_code
(file_name, source_code=False)¶ Download scoring code JAR.
Parameters: file_name : str
File path where scoring code will be saved.
source_code : bool, optional
Set to True to download source code archive. It will not be executable.
-
fetch_resource_data
(*args, **kwargs)¶ (Deprecated.) Used to acquire model data directly from its url.
Consider using get instead, as this is a convenience function used for development of the datarobot client
Parameters: url : string
The resource we are acquiring
join_endpoint : boolean, optional
Whether the client’s endpoint should be joined to the URL before sending the request. Location headers are returned as absolute locations, so will _not_ need the endpoint
Returns: model_data : dict
The queried model’s data
-
get_all_confusion_charts
()¶ Retrieve a list of all confusion charts available for the model.
Returns: list of ConfusionChart
Data for all available confusion charts for model.
-
get_all_lift_charts
()¶ Retrieve a list of all lift charts available for the model.
Returns: list of LiftChart
Data for all available model lift charts.
-
get_all_roc_curves
()¶ Retrieve a list of all ROC curves available for the model.
Returns: list of RocCurve
Data for all available model ROC curves.
-
get_confusion_chart
(source)¶ Retrieve model’s confusion chart for the specified source.
Parameters: source : str
Confusion chart source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
Returns: ConfusionChart
Model ConfusionChart data
-
get_feature_impact
()¶ Retrieve the computed Feature Impact results, a measure of the relevance of each feature in the model.
Elsewhere this technique is sometimes called ‘Permutation Importance’.
Requires that Feature Impact has already been computed with request_feature_impact.
Returns: feature_impacts : list[dict]
The feature impact data. Each item is a dict with the keys ‘featureName’, ‘impactNormalized’, and ‘impactUnnormalized’. See the help for Model.request_feature_impact for more details.
Raises: ClientError (404)
If the feature impacts have not been computed.
-
get_features_used
()¶ Query the server to determine which features were used.
Note that the data returned by this method is possibly different than the names of the features in the featurelist used by this model. This method will return the raw features that must be supplied in order for predictions to be generated on a new set of data. The featurelist, in contrast, would also include the names of derived features.
Returns: features : list of str
The names of the features used in the model.
-
get_leaderboard_ui_permalink
()¶ Returns: url : str
Permanent static hyperlink to this model at leaderboard.
-
get_lift_chart
(source)¶ Retrieve model lift chart for the specified source.
Parameters: source : str
Lift chart data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
Returns: LiftChart
Model lift chart data
-
get_missing_report_info
()¶ Retrieve a model’s Missing Values report
The report explains for numeric and categorical features how many times they were missing in the training data and how various tasks in the model handled the missing values.
Returns: MissingValuesReport
A Missing Values report is an iterable containing several
datarobot.models.missing_report.MissingReportPerFeature objects
-
get_model_blueprint_chart
()¶ Retrieve a model blueprint chart that can be used to understand data flow in blueprint.
Returns: ModelBlueprintChart
The queried model blueprint chart.
-
get_model_blueprint_documents
()¶ Get documentation for tasks used in this model.
Returns: list of BlueprintTaskDocument
All documents available for the model.
-
get_parameters
()¶ Retrieve model parameters.
Returns: ModelParameters
Model parameters for this model.
-
get_prime_eligibility
()¶ Check if this model can be approximated with DataRobot Prime
Returns: prime_eligibility : dict
a dict indicating whether a model can be approximated with DataRobot Prime (key can_make_prime) and why it may be ineligible (key message)
-
get_roc_curve
(source)¶ Retrieve model ROC curve for the specified source.
Parameters: source : str
ROC curve data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
Returns: RocCurve
Model ROC curve data
-
get_rulesets
()¶ List the rulesets approximating this model that were generated by DataRobot Prime
If this model hasn’t been approximated yet, an empty list is returned. Note that these are rulesets approximating this model, not rulesets used to construct this model.
Returns: rulesets : list of Ruleset
-
get_word_cloud
(exclude_stop_words=False)¶ Retrieve word cloud data for the model.
Parameters: exclude_stop_words : bool, optional
Set to True if you want stopwords filtered out of the response.
Returns: WordCloud
Word cloud data for the model.
-
open_model_browser
()¶ Opens model at project leaderboard in web browser.
Note: If text-mode browsers are used, the calling process will block until the user exits the browser.
-
request_approximation
()¶ Request an approximation of this model using DataRobot Prime
This will create several rulesets that could be used to approximate this model. After comparing their scores and rule counts, the code used in the approximation can be downloaded and run locally.
Returns: job : Job
the job generating the rulesets
-
request_feature_impact
()¶ Request feature impacts to be computed for the model.
Feature Impact is computed for each column by creating new data with that column randomly permuted (but the others left unchanged), and seeing how the error metric score for the predictions is affected. The ‘impactUnnormalized’ is how much worse the error metric score is when making predictions on this modified data. The ‘impactNormalized’ is normalized so that the largest value is 1. In both cases, larger values indicate more important features.
Returns: job : Job
A Job representing the feature impact computation. To get the completed feature impact data, use job.get_result or job.get_result_when_complete.
Raises: JobAlreadyRequested (422)
If the feature impacts have already been requested.
-
request_frozen_datetime_model
(training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, time_window_sample_pct=None)¶ Train a new frozen model with parameters from this model
Requires that this model belongs to a datetime partitioned project. If it does not, an error will occur when submitting the job.
Frozen models use the same tuning parameters as their parent model instead of independently optimizing them, allowing models to be retrained efficiently on larger amounts of the training data.
In addition to training_row_count and training_duration, frozen datetime models may be trained on an exact date range. Only one of training_row_count, training_duration, or training_start_date and training_end_date should be specified.
Models specified using training_start_date and training_end_date are the only ones that can be trained into the holdout data (once the holdout is unlocked).
Parameters: training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
training_start_date : datetime.datetime, optional
the start date of the data to train the model on. Only rows occurring at or after this datetime will be used. If training_start_date is specified, training_end_date must also be specified.
training_end_date : datetime.datetime, optional
the end date of the data to train the model on. Only rows occurring strictly before this datetime will be used. If training_end_date is specified, training_start_date must also be specified.
time_window_sample_pct : int, optional
may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.
Returns: model_job : ModelJob
the modeling job training a frozen model
-
request_predictions
(dataset_id)¶ Request predictions against a previously uploaded dataset
Parameters: dataset_id : string
The dataset to make predictions against (as uploaded from Project.upload_dataset)
Returns: job : PredictJob
The job computing the predictions
-
request_training_predictions
(data_subset)¶ Start a job to build training predictions
Parameters: data_subset : str
data set definition to build predictions on. Choices are:
- dr.enums.DATA_SUBSET.ALL for all data available
- dr.enums.DATA_SUBSET.VALIDATION_AND_HOLDOUT for all data except training set
- dr.enums.DATA_SUBSET.HOLDOUT for holdout data set only
Returns: Job
an instance of created async job
-
request_transferable_export
()¶ Request generation of an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
This function does not download the exported file. Use download_export for that.
Examples
model = datarobot.Model.get('p-id', 'l-id')
job = model.request_transferable_export()
job.wait_for_completion()
model.download_export('my_exported_model.drmodel')

# Client must be configured to use standalone prediction server for import:
datarobot.Client(token='my-token-at-standalone-server',
                 endpoint='standalone-server-url/api/v2')
imported_model = datarobot.ImportedModel.create('my_exported_model.drmodel')
-
train_datetime
(featurelist_id=None, training_row_count=None, training_duration=None, time_window_sample_pct=None)¶ Train this model on a different featurelist or amount of data
Requires that this model is part of a datetime partitioned project; otherwise, an error will occur.
Parameters: featurelist_id : str, optional
the featurelist to use to train the model. If not specified, the featurelist of this model is used.
training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
time_window_sample_pct : int, optional
may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.
Returns: job : ModelJob
the created job to build the model
-
RatingTableModel API¶
-
class
datarobot.models.
RatingTableModel
(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, rating_table_id=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None)¶ A model that has a rating table.
Attributes
id (str) the id of the model
project_id (str) the id of the project the model belongs to
processes (list of str) the processes used by the model
featurelist_name (str) the name of the featurelist used by the model
featurelist_id (str) the id of the featurelist used by the model
sample_pct (float or None) the percentage of the project dataset used in training the model. If the project uses datetime partitioning, the sample_pct will be None. See training_row_count, training_duration, and training_start_date and training_end_date instead.
training_row_count (int or None) the number of rows of the project dataset used in training the model. In a datetime partitioned project, if specified, defines the number of rows used to train the model and evaluate backtest scores; if unspecified, either training_duration or training_start_date and training_end_date was used to determine that instead.
training_duration (str or None) only present for models in datetime partitioned projects. If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores.
training_start_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
training_end_date (datetime or None) only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
model_type (str) what model this is, e.g. ‘Nystroem Kernel SVM Regressor’
model_category (str) what kind of model this is - ‘prime’ for DataRobot Prime models, ‘blend’ for blender models, and ‘model’ for other models
is_frozen (bool) whether this model is a frozen model
blueprint_id (str) the id of the blueprint used in this model
metrics (dict) a mapping from each metric to the model’s scores for that metric
rating_table_id (str) the id of the rating table that belongs to this model
monotonic_increasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
monotonic_decreasing_featurelist_id (str) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
supports_monotonic_constraints (bool) optional, whether this model supports enforcing monotonic constraints
-
classmethod
get
(project_id, model_id)¶ Retrieve a specific rating table model
If the project does not have a rating table, a ClientError will occur.
Parameters: project_id : str
the id of the project the model belongs to
model_id : str
the id of the model to retrieve
Returns: model : RatingTableModel
the model
-
classmethod
create_from_rating_table
(project_id, rating_table_id)¶ Creates a new model from a validated rating table record. The RatingTable must not be associated with an existing model.
Parameters: project_id : str
the id of the project the rating table belongs to
rating_table_id : str
the id of the rating table to create this model from
Returns: job: Job
an instance of created async job
Raises: ClientError (422)
Raised when creating a model from a RatingTable that failed validation
JobAlreadyRequested
Raised when creating a model from a RatingTable that is already associated with a RatingTableModel
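The create-and-wait pattern for this classmethod might look like the sketch below (the class is passed in explicitly so the helper stays testable, and get_result_when_complete is assumed from the Job API):

```python
def model_from_rating_table(rating_table_model_cls, project_id, rating_table_id):
    """Create a RatingTableModel from a validated rating table and
    return the finished model. `rating_table_model_cls` would normally
    be datarobot.models.RatingTableModel."""
    job = rating_table_model_cls.create_from_rating_table(project_id,
                                                          rating_table_id)
    return job.get_result_when_complete()
```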
-
cross_validate
()¶ Run Cross Validation on this model.
Note
To perform Cross Validation on a new model with new parameters, use train instead.
Returns: ModelJob
The created job to build the model
-
delete
()¶ Delete a model from the project’s leaderboard.
-
download_export
(filepath)¶ Download an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
Parameters: filepath : str
The path at which to save the exported model file.
-
download_scoring_code
(file_name, source_code=False)¶ Download scoring code JAR.
Parameters: file_name : str
File path where scoring code will be saved.
source_code : bool, optional
Set to True to download source code archive. It will not be executable.
-
fetch_resource_data
(*args, **kwargs)¶ (Deprecated.) Used to acquire model data directly from its url.
Consider using get instead, as this is a convenience function used for development of the datarobot client
Parameters: url : string
The resource we are acquiring
join_endpoint : boolean, optional
Whether the client’s endpoint should be joined to the URL before sending the request. Location headers are returned as absolute locations, so will _not_ need the endpoint
Returns: model_data : dict
The queried model’s data
-
get_all_confusion_charts
()¶ Retrieve a list of all confusion charts available for the model.
Returns: list of ConfusionChart
Data for all available confusion charts for model.
-
get_all_lift_charts
()¶ Retrieve a list of all lift charts available for the model.
Returns: list of LiftChart
Data for all available model lift charts.
-
get_all_roc_curves
()¶ Retrieve a list of all ROC curves available for the model.
Returns: list of RocCurve
Data for all available model ROC curves.
-
get_confusion_chart
(source)¶ Retrieve model’s confusion chart for the specified source.
Parameters: source : str
Confusion chart source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
Returns: ConfusionChart
Model ConfusionChart data
-
get_feature_impact
()¶ Retrieve the computed Feature Impact results, a measure of the relevance of each feature in the model.
Elsewhere this technique is sometimes called ‘Permutation Importance’.
Requires that Feature Impact has already been computed with request_feature_impact.
Returns: feature_impacts : list[dict]
The feature impact data. Each item is a dict with the keys ‘featureName’, ‘impactNormalized’, and ‘impactUnnormalized’. See the help for Model.request_feature_impact for more details.
Raises: ClientError (404)
If the feature impacts have not been computed.
-
get_features_used
()¶ Query the server to determine which features were used.
Note that the data returned by this method is possibly different from the names of the features in the featurelist used by this model. This method returns the raw features that must be supplied in order for predictions to be generated on a new set of data. The featurelist, in contrast, would also include the names of derived features.
Returns: features : list of str
The names of the features used in the model.
-
get_leaderboard_ui_permalink
()¶ Returns: url : str
Permanent static hyperlink to this model in the leaderboard.
-
get_lift_chart
(source)¶ Retrieve model lift chart for the specified source.
Parameters: source : str
Lift chart data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
Returns: LiftChart
Model lift chart data
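A lift chart reports average predicted and actual values per bin of rows sorted by prediction. As a rough, purely illustrative sketch of that idea (not the client's computation):

```python
def lift_bins(actuals, predictions, n_bins=2):
    """Sort rows by predicted value and average predictions and actuals
    within each bin, as a lift chart summarizes."""
    paired = sorted(zip(predictions, actuals))
    size = len(paired) // n_bins
    bins = []
    for i in range(n_bins):
        chunk = paired[i * size:(i + 1) * size]
        bins.append({
            "avg_predicted": sum(p for p, _ in chunk) / len(chunk),
            "avg_actual": sum(a for _, a in chunk) / len(chunk),
        })
    return bins

bins = lift_bins([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9], n_bins=2)
```

A well-calibrated model shows avg_predicted tracking avg_actual across bins.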
-
get_missing_report_info
()¶ Retrieve a model’s Missing Values report
For numeric and categorical features, the report explains how many times they were missing in the training data and how the various tasks in the model handled those missing values.
Returns: MissingValuesReport
A Missing Values report is an iterable containing several
datarobot.models.missing_report.MissingReportPerFeature objects
-
get_model_blueprint_chart
()¶ Retrieve a model blueprint chart that can be used to understand data flow in blueprint.
Returns: ModelBlueprintChart
The queried model blueprint chart.
-
get_model_blueprint_documents
()¶ Get documentation for tasks used in this model.
Returns: list of BlueprintTaskDocument
All documents available for the model.
-
get_parameters
()¶ Retrieve model parameters.
Returns: ModelParameters
Model parameters for this model.
-
get_prime_eligibility
()¶ Check if this model can be approximated with DataRobot Prime
Returns: prime_eligibility : dict
a dict indicating whether a model can be approximated with DataRobot Prime (key can_make_prime) and why it may be ineligible (key message)
-
get_roc_curve
(source)¶ Retrieve model ROC curve for the specified source.
Parameters: source : str
ROC curve data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
Returns: RocCurve
Model ROC curve data
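For intuition about the data behind a ROC curve, here is a hedged, pure-Python sketch of computing one (false positive rate, true positive rate) point at a given score threshold; the client returns these points precomputed by the server:

```python
def roc_point(actuals, scores, threshold):
    """Compute (fpr, tpr) for binary actuals (0/1) at one score threshold."""
    tp = sum(1 for a, s in zip(actuals, scores) if a == 1 and s >= threshold)
    fp = sum(1 for a, s in zip(actuals, scores) if a == 0 and s >= threshold)
    pos = sum(actuals)
    neg = len(actuals) - pos
    return fp / neg, tp / pos

fpr, tpr = roc_point([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1], threshold=0.5)
```

Sweeping the threshold from 1 down to 0 traces out the full curve.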
-
get_rulesets
()¶ List the rulesets approximating this model generated by DataRobot Prime
If this model hasn’t been approximated yet, this will return an empty list. Note that these are rulesets approximating this model, not rulesets used to construct this model.
Returns: rulesets : list of Ruleset
-
get_word_cloud
(exclude_stop_words=False)¶ Retrieve word cloud data for the model.
Parameters: exclude_stop_words : bool, optional
Set to True if you want stop words filtered out of the response.
Returns: WordCloud
Word cloud data for the model.
-
open_model_browser
()¶ Opens model at project leaderboard in web browser.
Note: If text-mode browsers are used, the calling process will block until the user exits the browser.
-
request_approximation
()¶ Request an approximation of this model using DataRobot Prime
This will create several rulesets that could be used to approximate this model. After comparing their scores and rule counts, the code used in the approximation can be downloaded and run locally.
Returns: job : Job
the job generating the rulesets
-
request_feature_impact
()¶ Request feature impacts to be computed for the model.
Feature Impact is computed for each column by creating new data with that column randomly permuted (but the others left unchanged), and seeing how the error metric score for the predictions is affected. The ‘impactUnnormalized’ is how much worse the error metric score is when making predictions on this modified data. The ‘impactNormalized’ is normalized so that the largest value is 1. In both cases, larger values indicate more important features.
Returns: job : Job
A Job representing the feature impact computation. To get the completed feature impact data, use job.get_result or job.get_result_when_complete.
Raises: JobAlreadyRequested (422)
If the feature impacts have already been requested.
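The permutation technique described above can be illustrated with a small self-contained sketch. The function name, data layout, and toy model below are hypothetical, not the client's internals:

```python
import random

def sketch_feature_impact(rows, target, predict_fn, seed=0):
    """Illustrative permutation importance, not the client's implementation:
    shuffle one column at a time and measure how much the error metric
    (here, mean absolute error) worsens."""
    mae = lambda y, p: sum(abs(a - b) for a, b in zip(y, p)) / len(y)
    rng = random.Random(seed)
    base = mae(target, [predict_fn(r) for r in rows])
    unnormalized = {}
    for col in rows[0]:
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        if shuffled == [r[col] for r in rows]:
            # make sure the column visibly changes for this tiny illustration
            shuffled = shuffled[1:] + shuffled[:1]
        permuted = [dict(r, **{col: v}) for r, v in zip(rows, shuffled)]
        # 'impactUnnormalized' analogue: how much worse the score gets
        unnormalized[col] = mae(target, [predict_fn(r) for r in permuted]) - base
    top = max(unnormalized.values())
    # 'impactNormalized' analogue: scale so the largest value is 1
    return {c: u / top for c, u in unnormalized.items()}

# a toy "model" that just returns the 'x' column; 'noise' is irrelevant to it
rows = [{"x": i, "noise": 0} for i in range(10)]
impacts = sketch_feature_impact(rows, list(range(10)), lambda r: r["x"])
```

As expected, the column the toy model depends on gets normalized impact 1, and the irrelevant column gets 0.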
-
request_frozen_datetime_model
(training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, time_window_sample_pct=None)¶ Train a new frozen model with parameters from this model
Requires that this model belongs to a datetime partitioned project. If it does not, an error will occur when submitting the job.
Frozen models use the same tuning parameters as their parent model instead of independently optimizing them, which allows models to be retrained efficiently on larger amounts of the training data.
In addition to training_row_count and training_duration, frozen datetime models may be trained on an exact date range. Only one of training_row_count, training_duration, or training_start_date and training_end_date should be specified.
Models specified using training_start_date and training_end_date are the only ones that can be trained into the holdout data (once the holdout is unlocked).
Parameters: training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
training_start_date : datetime.datetime, optional
the start date of the data to train the model on. Only rows occurring at or after this datetime will be used. If training_start_date is specified, training_end_date must also be specified.
training_end_date : datetime.datetime, optional
the end date of the data to train the model on. Only rows occurring strictly before this datetime will be used. If training_end_date is specified, training_start_date must also be specified.
time_window_sample_pct : int, optional
may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.
Returns: model_job : ModelJob
the modeling job training a frozen model
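The mutual-exclusivity rules above can be expressed as a small client-side check. The helper below is hypothetical (not part of the package) and only mirrors the documented constraints:

```python
def validate_frozen_args(training_row_count=None, training_duration=None,
                         training_start_date=None, training_end_date=None):
    """Hypothetical helper mirroring the documented exclusivity rules;
    the real client and server perform their own validation."""
    if (training_start_date is None) != (training_end_date is None):
        raise ValueError(
            "training_start_date and training_end_date must be specified together")
    chosen = [training_row_count is not None,
              training_duration is not None,
              training_start_date is not None]
    if sum(chosen) > 1:
        raise ValueError(
            "specify only one of training_row_count, training_duration, "
            "or the training_start_date/training_end_date pair")
    return True
```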
-
request_frozen_model
(sample_pct=None, training_row_count=None)¶ Train a new frozen model with parameters from this model
Note
This method only works if the project the model belongs to is not datetime partitioned. If it is, use
request_frozen_datetime_model
instead.
Frozen models use the same tuning parameters as their parent model instead of independently optimizing them, which allows models to be retrained efficiently on larger amounts of the training data.
Parameters: sample_pct : float
optional, the percentage of the dataset to use with the model. If not provided, will use the value from this model.
training_row_count : int
(New in version v2.9) optional, the integer number of rows of the dataset to use with the model. Only one of sample_pct and training_row_count should be specified.
Returns: model_job : ModelJob
the modeling job training a frozen model
-
request_predictions
(dataset_id)¶ Request predictions against a previously uploaded dataset
Parameters: dataset_id : string
The dataset to make predictions against (as uploaded from Project.upload_dataset)
Returns: job : PredictJob
The job computing the predictions
-
request_training_predictions
(data_subset)¶ Start a job to build training predictions
Parameters: data_subset : str
data set definition to build predictions on. Choices are:
- dr.enums.DATA_SUBSET.ALL for all data available
- dr.enums.DATA_SUBSET.VALIDATION_AND_HOLDOUT for all data except training set
- dr.enums.DATA_SUBSET.HOLDOUT for holdout data set only
Returns: Job
an instance of created async job
-
request_transferable_export
()¶ Request generation of an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
This function does not download the exported file. Use download_export for that.
Examples
model = datarobot.Model.get('p-id', 'l-id')
job = model.request_transferable_export()
job.wait_for_completion()
model.download_export('my_exported_model.drmodel')

# Client must be configured to use standalone prediction server for import:
datarobot.Client(token='my-token-at-standalone-server',
                 endpoint='standalone-server-url/api/v2')
imported_model = datarobot.ImportedModel.create('my_exported_model.drmodel')
-
train
(sample_pct=None, featurelist_id=None, scoring_type=None, training_row_count=None, monotonic_increasing_featurelist_id=<object object>, monotonic_decreasing_featurelist_id=<object object>)¶ Train the blueprint used in model on a particular featurelist or amount of data.
This method creates a new training job for worker and appends it to the end of the queue for this project. After the job has finished you can get the newly trained model by retrieving it from the project leaderboard, or by retrieving the result of the job.
Either sample_pct or training_row_count can be used to specify the amount of data to use, but not both. If neither are specified, a default of the maximum amount of data that can safely be used to train any blueprint without going into the validation data will be selected.
In smart-sampled projects, sample_pct and training_row_count are assumed to be in terms of rows of the minority class.
Note
For datetime partitioned projects, use
train_datetime
instead.
Parameters: sample_pct : float, optional
The amount of data to use for training, as a percentage of the project dataset from 0 to 100.
featurelist_id : str, optional
The identifier of the featurelist to use. If not defined, the featurelist of this model is used.
scoring_type : str, optional
Either SCORING_TYPE.validation or SCORING_TYPE.cross_validation.
SCORING_TYPE.validation is available for every partitioning type, and indicates that the default model validation should be used for the project. If the project uses a form of cross-validation partitioning, SCORING_TYPE.cross_validation can also be used to indicate that all of the available training/validation combinations should be used to evaluate the model.
training_row_count : int, optional
The number of rows to use to train the requested model.
monotonic_increasing_featurelist_id : str
(new in version 2.11) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. Passing None disables the increasing monotonicity constraint. The default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
monotonic_decreasing_featurelist_id : str
(new in version 2.11) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. Passing None disables the decreasing monotonicity constraint. The default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
Returns: model_job_id : str
id of the created job; can be used as a parameter to the ModelJob.get method or the wait_for_async_model_creation function.
Examples
project = Project.get('p-id')
model = Model.get('p-id', 'l-id')
model_job_id = model.train(training_row_count=project.max_train_rows)
-
train_datetime
(featurelist_id=None, training_row_count=None, training_duration=None, time_window_sample_pct=None)¶ Train this model on a different featurelist or amount of data
Requires that this model is part of a datetime partitioned project; otherwise, an error will occur.
Parameters: featurelist_id : str, optional
the featurelist to use to train the model. If not specified, the featurelist of this model is used.
training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
time_window_sample_pct : int, optional
may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.
Returns: job : ModelJob
the created job to build the model
-
Job API¶
-
class
datarobot.models.
Job
(data, completed_resource_url=None)¶ Tracks asynchronous work being done within a project
Attributes
id (int) the id of the job project_id (str) the id of the project the job belongs to status (str) the status of the job - will be one of datarobot.enums.QUEUE_STATUS
job_type (str) what kind of work the job is doing - will be one of datarobot.enums.JOB_TYPE
-
classmethod
get
(project_id, job_id)¶ Fetches one job.
Parameters: project_id : str
The identifier of the project in which the job resides
job_id : str
The job id
Returns: job : Job
The job
Raises: AsyncFailureError
Querying this resource gave a status code other than 200 or 303
-
cancel
()¶ Cancel this job. If this job has not finished running, it will be removed and canceled.
-
get_result
()¶ Returns: result : object
- Return type depends on the job type:
- for model jobs, a Model is returned
- for predict jobs, a pandas.DataFrame (with predictions) is returned
- for featureImpact jobs, a list of dicts (whose keys are featureName and impact)
- for primeRulesets jobs, a list of Rulesets
- for primeModel jobs, a PrimeModel
- for primeDownloadValidation jobs, a PrimeFile
- for reasonCodesInitialization jobs, a ReasonCodesInitialization
- for reasonCodes jobs, a ReasonCodes
Raises: JobNotFinished
If the job is not finished, the result is not available.
AsyncProcessUnsuccessfulError
If the job errored or was aborted
-
get_result_when_complete
(max_wait=600)¶ Parameters: max_wait : int, optional
How long to wait for the job to finish.
Returns: result: object
Return type is the same as would be returned by Job.get_result.
Raises: AsyncTimeoutError
If the job does not finish in time
AsyncProcessUnsuccessfulError
If the job errored or was aborted
-
refresh
()¶ Update this object with the latest job data from the server.
-
wait_for_completion
(max_wait=600)¶ Waits for job to complete.
Parameters: max_wait : int, optional
How long to wait for the job to finish.
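The wait-and-timeout behaviour of wait_for_completion (and get_result_when_complete) amounts to polling with a deadline. Below is a stand-alone sketch of that pattern with a fake job object; the class names and statuses are illustrative stand-ins, not the client's internals:

```python
import time

class AsyncTimeoutError(Exception):
    """Illustrative stand-in for the client's timeout error."""

def wait_for(job, max_wait=600, interval=0.01):
    """Poll job.refresh() until job.status is terminal or max_wait elapses."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        job.refresh()
        if job.status in ("COMPLETED", "ERROR", "ABORTED"):
            return job.status
        time.sleep(interval)
    raise AsyncTimeoutError("job did not finish within max_wait seconds")

class FakeJob:
    """A pretend job that completes after a few refreshes."""
    def __init__(self, ticks):
        self.ticks, self.status = ticks, "inprogress"
    def refresh(self):
        self.ticks -= 1
        if self.ticks <= 0:
            self.status = "COMPLETED"

status = wait_for(FakeJob(3), max_wait=5)
```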
-
-
class
datarobot.models.
TrainingPredictionsJob
(data, model_id, data_subset, **kwargs)¶ -
classmethod
get
(project_id, job_id, model_id=None, data_subset=None)¶ Fetches one job. The resulting
datarobot.models.TrainingPredictions
object will be annotated with model_id and data_subset.
Parameters: project_id : str
The identifier of the project in which the job resides
job_id : str
The job id
model_id : str
The identifier of the model used for computing training predictions
data_subset : dr.enums.DATA_SUBSET, optional
Data subset used for computing training predictions
Returns: job : TrainingPredictionsJob
The job
Raises: AsyncFailureError
Querying this resource gave a status code other than 200 or 303
-
refresh
()¶ Update this object with the latest job data from the server.
-
cancel
()¶ Cancel this job. If this job has not finished running, it will be removed and canceled.
-
get_result
()¶ Returns: result : object
- Return type depends on the job type:
- for model jobs, a Model is returned
- for predict jobs, a pandas.DataFrame (with predictions) is returned
- for featureImpact jobs, a list of dicts (whose keys are featureName and impact)
- for primeRulesets jobs, a list of Rulesets
- for primeModel jobs, a PrimeModel
- for primeDownloadValidation jobs, a PrimeFile
- for reasonCodesInitialization jobs, a ReasonCodesInitialization
- for reasonCodes jobs, a ReasonCodes
Raises: JobNotFinished
If the job is not finished, the result is not available.
AsyncProcessUnsuccessfulError
If the job errored or was aborted
-
get_result_when_complete
(max_wait=600)¶ Parameters: max_wait : int, optional
How long to wait for the job to finish.
Returns: result: object
Return type is the same as would be returned by Job.get_result.
Raises: AsyncTimeoutError
If the job does not finish in time
AsyncProcessUnsuccessfulError
If the job errored or was aborted
-
wait_for_completion
(max_wait=600)¶ Waits for job to complete.
Parameters: max_wait : int, optional
How long to wait for the job to finish.
-
ModelJob API¶
-
datarobot.models.modeljob.
wait_for_async_model_creation
(project_id, model_job_id, max_wait=600)¶ Given a project id and a ModelJob id, poll the status of the process responsible for model creation until the model is created.
Parameters: project_id : str
The identifier of the project
model_job_id : str
The identifier of the ModelJob
max_wait : int, optional
Time in seconds after which model creation is considered unsuccessful
Returns: model : Model
Newly created model
Raises: AsyncModelCreationError
Raised if the status of the fetched ModelJob object is error
AsyncTimeoutError
The model wasn’t created in the time specified by the max_wait parameter
-
class
datarobot.models.
ModelJob
(data, completed_resource_url=None)¶ Tracks asynchronous work being done within a project
Attributes
id (int) the id of the job project_id (str) the id of the project the job belongs to status (str) the status of the job - will be one of datarobot.enums.QUEUE_STATUS
job_type (str) what kind of work the job is doing - will be ‘model’ for modeling jobs sample_pct (float) the percentage of the project’s dataset used in this modeling job model_type (str) the model this job builds (e.g. ‘Nystroem Kernel SVM Regressor’) processes (list of str) the processes used by the model featurelist_id (str) the id of the featurelist used in this modeling job blueprint (Blueprint) the blueprint used in this modeling job -
classmethod
from_job
(job)¶ Transforms a generic Job into a ModelJob
Parameters: job: Job
A generic job representing a ModelJob
Returns: model_job: ModelJob
A fully populated ModelJob with all the details of the job
Raises: ValueError:
If the generic Job was not a model job, e.g. job_type != JOB_TYPE.MODEL
-
classmethod
get
(project_id, model_job_id)¶ Fetches one ModelJob. If the job has finished, raises a PendingJobFinished exception.
Parameters: project_id : str
The identifier of the project the model belongs to
model_job_id : str
The identifier of the model_job
Returns: model_job : ModelJob
The pending ModelJob
Raises: PendingJobFinished
If the job being queried already finished, and the server is re-routing to the finished model.
AsyncFailureError
Querying this resource gave a status code other than 200 or 303
-
classmethod
get_model
(project_id, model_job_id)¶ Fetches a finished model from the job used to create it.
Parameters: project_id : str
The identifier of the project the model belongs to
model_job_id : str
The identifier of the model_job
Returns: model : Model
The finished model
Raises: JobNotFinished
If the job has not finished yet
AsyncFailureError
Querying the model_job in question gave a status code other than 200 or 303
-
cancel
()¶ Cancel this job. If this job has not finished running, it will be removed and canceled.
-
get_result
()¶ Returns: result : object
- Return type depends on the job type:
- for model jobs, a Model is returned
- for predict jobs, a pandas.DataFrame (with predictions) is returned
- for featureImpact jobs, a list of dicts (whose keys are featureName and impact)
- for primeRulesets jobs, a list of Rulesets
- for primeModel jobs, a PrimeModel
- for primeDownloadValidation jobs, a PrimeFile
- for reasonCodesInitialization jobs, a ReasonCodesInitialization
- for reasonCodes jobs, a ReasonCodes
Raises: JobNotFinished
If the job is not finished, the result is not available.
AsyncProcessUnsuccessfulError
If the job errored or was aborted
-
get_result_when_complete
(max_wait=600)¶ Parameters: max_wait : int, optional
How long to wait for the job to finish.
Returns: result: object
Return type is the same as would be returned by Job.get_result.
Raises: AsyncTimeoutError
If the job does not finish in time
AsyncProcessUnsuccessfulError
If the job errored or was aborted
-
refresh
()¶ Update this object with the latest job data from the server.
-
wait_for_completion
(max_wait=600)¶ Waits for job to complete.
Parameters: max_wait : int, optional
How long to wait for the job to finish.
-
Prediction Dataset API¶
-
class
datarobot.models.
PredictionDataset
(project_id, id, name, created, num_rows, num_columns, forecast_point=None, predictions_start_date=None, predictions_end_date=None)¶ A dataset uploaded to make predictions
Typically created via project.upload_dataset
Attributes
id (str) the id of the dataset project_id (str) the id of the project the dataset belongs to created (str) the time the dataset was created name (str) the name of the dataset num_rows (int) the number of rows in the dataset num_columns (int) the number of columns in the dataset forecast_point (datetime.datetime or None) Only specified in time series projects. The point relative to which predictions will be generated, based on the forecast window of the project. See the time series documentation for more information. predictions_start_date (datetime.datetime or None, optional) Only specified in time series projects. The start date for bulk predictions. This parameter should be provided in conjunction with predictions_end_date. Can’t be provided with forecastPoint parameter. predictions_end_date (datetime.datetime or None, optional) Only specified in time series projects. The end date for bulk predictions. This parameter should be provided in conjunction with predictions_start_date. Can’t be provided with forecastPoint parameter. -
classmethod
get
(project_id, dataset_id)¶ Retrieve information about a dataset uploaded for predictions
Parameters: project_id:
the id of the project to query
dataset_id:
the id of the dataset to retrieve
Returns: dataset: PredictionDataset
A dataset uploaded to make predictions
-
delete
()¶ Delete a dataset uploaded for predictions
Will also delete predictions made using this dataset and cancel any predict jobs using this dataset.
-
PredictJob API¶
-
datarobot.models.predict_job.
wait_for_async_predictions
(project_id, predict_job_id, max_wait=600)¶ Given a project id and a PredictJob id, poll the status of the process responsible for predictions generation until it finishes
Parameters: project_id : str
The identifier of the project
predict_job_id : str
The identifier of the PredictJob
max_wait : int, optional
Time in seconds after which predictions creation is considered unsuccessful
Returns: predictions : pandas.DataFrame
Generated predictions.
Raises: AsyncPredictionsGenerationError
Raised if the status of the fetched PredictJob object is error
AsyncTimeoutError
Predictions weren’t generated in the time specified by the max_wait parameter
-
class
datarobot.models.
PredictJob
(data, completed_resource_url=None)¶ Tracks asynchronous work being done within a project
Attributes
id (int) the id of the job project_id (str) the id of the project the job belongs to status (str) the status of the job - will be one of datarobot.enums.QUEUE_STATUS
job_type (str) what kind of work the job is doing - will be ‘predict’ for predict jobs message (str) a message about the state of the job, typically explaining why an error occurred -
classmethod
from_job
(job)¶ Transforms a generic Job into a PredictJob
Parameters: job: Job
A generic job representing a PredictJob
Returns: predict_job: PredictJob
A fully populated PredictJob with all the details of the job
Raises: ValueError:
If the generic Job was not a predict job, e.g. job_type != JOB_TYPE.PREDICT
-
classmethod
create
(*args, **kwargs)¶ Note
Deprecated in v2.3 in favor of
Project.upload_dataset and Model.request_predictions. That workflow allows you to reuse the same dataset for predictions from multiple models within one project.
Starts predictions generation for the provided data using a previously created model.
Parameters: model : Model
Model to use for predictions generation
sourcedata : str, file or pandas.DataFrame
Data to be used for predictions. If this parameter is a str, it can be either a path to a local file or raw file content. If using a file on disk, the filename must consist of ASCII characters only. The file must be a CSV, and cannot be compressed
Returns: predict_job_id : str
id of the created job; can be used as a parameter to the PredictJob.get or PredictJob.get_predictions methods, or the wait_for_async_predictions function.
Raises: InputNotUnderstoodError
If the parameter for sourcedata didn’t resolve into known data types
Examples
model = Model.get('p-id', 'l-id')
predict_job = PredictJob.create(model, './data_to_predict.csv')
-
classmethod
get
(project_id, predict_job_id)¶ Fetches one PredictJob. If the job has finished, raises a PendingJobFinished exception.
Parameters: project_id : str
The identifier of the project that the model used for predictions belongs to
predict_job_id : str
The identifier of the predict_job
Returns: predict_job : PredictJob
The pending PredictJob
Raises: PendingJobFinished
If the job being queried already finished, and the server is re-routing to the finished predictions.
AsyncFailureError
Querying this resource gave a status code other than 200 or 303
-
classmethod
get_predictions
(project_id, predict_job_id, class_prefix='class_')¶ Fetches finished predictions from the job used to generate them.
Note
The prediction API for classifications now returns an additional prediction_values dictionary that is converted into a series of class_prefixed columns in the final dataframe. For example, <label> = 1.0 is converted to ‘class_1.0’. If you are on an older version of the client (prior to v2.8), you must update to v2.8 to correctly pivot this data.
Parameters: project_id : str
The identifier of the project that the model used for predictions generation belongs to
predict_job_id : str
The identifier of the predict_job
class_prefix : str
The prefix to append to labels in the final dataframe (e.g., apple -> class_apple)
Returns: predictions : pandas.DataFrame
Generated predictions
Raises: JobNotFinished
If the job has not finished yet
AsyncFailureError
Querying the predict_job in question gave a status code other than 200 or 303
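The class_prefix pivoting described in the note above can be illustrated without the client. Plain dicts stand in for the pandas.DataFrame, and the row layout here is a simplified assumption, not the exact API payload:

```python
def pivot_prediction_values(rows, class_prefix="class_"):
    """Turn each row's prediction_values entries into class_-prefixed columns,
    e.g. label 1.0 becomes the column 'class_1.0'."""
    out = []
    for row in rows:
        flat = {"prediction": row["prediction"]}
        for pv in row["prediction_values"]:
            flat[class_prefix + str(pv["label"])] = pv["value"]
        out.append(flat)
    return out

rows = [{"prediction": 1.0,
         "prediction_values": [{"label": 1.0, "value": 0.8},
                               {"label": 0.0, "value": 0.2}]}]
pivoted = pivot_prediction_values(rows)
```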
-
cancel
()¶ Cancel this job. If this job has not finished running, it will be removed and canceled.
-
get_result
()¶ Returns: result : object
- Return type depends on the job type:
- for model jobs, a Model is returned
- for predict jobs, a pandas.DataFrame (with predictions) is returned
- for featureImpact jobs, a list of dicts (whose keys are featureName and impact)
- for primeRulesets jobs, a list of Rulesets
- for primeModel jobs, a PrimeModel
- for primeDownloadValidation jobs, a PrimeFile
- for reasonCodesInitialization jobs, a ReasonCodesInitialization
- for reasonCodes jobs, a ReasonCodes
Raises: JobNotFinished
If the job is not finished, the result is not available.
AsyncProcessUnsuccessfulError
If the job errored or was aborted
-
get_result_when_complete
(max_wait=600)¶ Parameters: max_wait : int, optional
How long to wait for the job to finish.
Returns: result: object
Return type is the same as would be returned by Job.get_result.
Raises: AsyncTimeoutError
If the job does not finish in time
AsyncProcessUnsuccessfulError
If the job errored or was aborted
-
refresh
()¶ Update this object with the latest job data from the server.
-
wait_for_completion
(max_wait=600)¶ Waits for job to complete.
Parameters: max_wait : int, optional
How long to wait for the job to finish.
-
Feature List API¶
-
class
datarobot.models.
Featurelist
(id=None, name=None, features=None, project_id=None)¶ A set of features used in modeling
Attributes
id (str) the id of the featurelist name (str) the name of the featurelist features (list of str) the names of all the Features in the Featurelist project_id (str) the project the Featurelist belongs to -
classmethod
get
(project_id, featurelist_id)¶ Retrieve a known feature list
Parameters: project_id : str
The id of the project the featurelist is associated with
featurelist_id : str
The ID of the featurelist to retrieve
Returns: featurelist : Featurelist
The queried instance
-
-
class
datarobot.models.
ModelingFeaturelist
(id=None, name=None, features=None, project_id=None)¶ A set of features that can be used to build a model
In time series projects, a new set of modeling features is created after setting the partitioning options. These features are automatically derived from those in the project’s dataset and are the features used for modeling. Modeling features are only accessible once the target and partitioning options have been set. In projects that don’t use time series modeling, once the target has been set, ModelingFeaturelists and Featurelists will behave the same.
For more information about input and modeling features, see the time series documentation.
Attributes
id (str) the id of the modeling featurelist project_id (str) the id of the project the modeling featurelist belongs to name (str) the name of the modeling featurelist features (list of str) a list of the names of features included in this modeling featurelist -
classmethod
get
(project_id, featurelist_id)¶ Retrieve a modeling featurelist
Modeling featurelists can only be retrieved once the target and partitioning options have been set.
Parameters: project_id : str
the id of the project the modeling featurelist belongs to
featurelist_id : str
the id of the modeling featurelist to retrieve
Returns: featurelist : ModelingFeaturelist
the specified featurelist
-
Feature API¶
-
class
datarobot.models.
Feature
(id, project_id=None, name=None, feature_type=None, importance=None, low_information=None, unique_count=None, na_count=None, date_format=None, min=None, max=None, mean=None, median=None, std_dev=None, time_series_eligible=None, time_series_eligibility_reason=None, time_step=None, time_unit=None, target_leakage=None)¶ A feature from a project’s dataset
These are features either included in the originally uploaded dataset or added to it via feature transformations. In time series projects, these will be distinct from the
ModelingFeatures created during partitioning; otherwise, they will correspond to the same features. For more information about input and modeling features, see the time series documentation.
The min, max, mean, median, and std_dev attributes provide information about the distribution of the feature in the EDA sample data. For non-numeric features or features created prior to these summary statistics becoming available, they will be None. For features where the summary statistics are available, they will be in a format compatible with the data type, i.e. date type features will have their summary statistics expressed as ISO-8601 formatted date strings.
Attributes
- id (int): the id for the feature; note that name, not id, is used to reference the feature
- project_id (str): the id of the project the feature belongs to
- name (str): the name of the feature
- feature_type (str): the type of the feature, e.g. 'Categorical', 'Text'
- importance (float or None): numeric measure of the strength of relationship between the feature and target (independent of any model or other features); may be None for non-modeling features such as partition columns
- low_information (bool): whether a feature is considered too uninformative for modeling (e.g. because it has too few values)
- unique_count (int): number of unique values
- na_count (int or None): number of missing values
- date_format (str or None): for Date features, the date format string describing how this feature was interpreted, compatible with https://docs.python.org/2/library/time.html#time.strftime ; None for other feature types
- min (str, int, float, or None): the minimum value of the source data in the EDA sample
- max (str, int, float, or None): the maximum value of the source data in the EDA sample
- mean (str, int, float, or None): the arithmetic mean of the source data in the EDA sample
- median (str, int, float, or None): the median of the source data in the EDA sample
- std_dev (str, int, float, or None): the standard deviation of the source data in the EDA sample
- time_series_eligible (bool): whether this feature can be used as the datetime partition column in a time series project
- time_series_eligibility_reason (str): why the feature is ineligible for use as the datetime partition column in a time series project, or 'suitable' when it is eligible
- time_step (int or None): for time series eligible features, a positive integer determining the interval at which windows can be specified. If used as the datetime partition column on a time series project, the feature derivation and forecast windows must start and end at an integer multiple of this value. None for features that are not time series eligible.
- time_unit (str or None): for time series eligible features, the time unit covered by a single time step, e.g. 'HOUR'; None for features that are not time series eligible
- target_leakage (str): whether a feature is considered to have target leakage. A value of 'SKIPPED_DETECTION' indicates that target leakage detection was not run on the feature, 'FALSE' indicates no leakage, 'MODERATE' indicates a moderate risk of target leakage, and 'HIGH_RISK' indicates a high risk of target leakage
-
classmethod
get
(project_id, feature_name)¶ Retrieve a single feature
Parameters: project_id : str
The ID of the project the feature is associated with.
feature_name : str
The name of the feature to retrieve
Returns: feature : Feature
The queried instance
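A minimal usage sketch; the project id and feature name below are hypothetical placeholders, and a configured client (credentials and endpoint, as described in the Configuration section) is assumed:

```python
import datarobot as dr

# Hypothetical ids; a configured client is assumed
feature = dr.Feature.get('5506fcd38bd88f5953219da0', 'MonthlyIncome')
print(feature.feature_type, feature.na_count, feature.mean)
```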
-
get_multiseries_properties
(multiseries_id_columns, max_wait=600)¶ Retrieve time series properties for a potential multiseries datetime partition column
Multiseries time series projects use multiseries id columns to model multiple distinct series within a single project. This function returns the time series properties (time step and time unit) of this column if it were used as a datetime partition column with the specified multiseries id columns, running multiseries detection automatically if it has not previously been successfully run.
Parameters: multiseries_id_columns : list of str
the name(s) of the multiseries id columns to use with this datetime partition column. Currently only one multiseries id column is supported.
max_wait : int, optional
if a multiseries detection task is run, the maximum amount of time to wait for it to complete before giving up
Returns: properties : dict
A dict with three keys:
- time_series_eligible : bool, whether the column can be used as a partition column
- time_unit : str or null, the inferred time unit if used as a partition column
- time_step : int or null, the inferred time step if used as a partition column
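A usage sketch of checking a candidate partition column; the ids and column names are hypothetical placeholders, and a configured client is assumed:

```python
import datarobot as dr

# Hypothetical ids and column names; a configured client is assumed
feature = dr.Feature.get('5506fcd38bd88f5953219da0', 'timestamp')
props = feature.get_multiseries_properties(['series_id'], max_wait=600)
if props['time_series_eligible']:
    print(props['time_unit'], props['time_step'])
```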
-
class
datarobot.models.
ModelingFeature
(project_id=None, name=None, feature_type=None, importance=None, low_information=None, unique_count=None, na_count=None, date_format=None, min=None, max=None, mean=None, median=None, std_dev=None, parent_feature_names=None)¶ A feature used for modeling
In time series projects, a new set of modeling features is created after setting the partitioning options. These features are automatically derived from those in the project’s dataset and are the features used for modeling. Modeling features are only accessible once the target and partitioning options have been set. In projects that don’t use time series modeling, once the target has been set, ModelingFeatures and Features will behave the same.
For more information about input and modeling features, see the time series documentation.
As with the
dr.models.feature.Feature
object, the min, max, mean, median, and std_dev attributes provide information about the distribution of the feature in the EDA sample data. For non-numeric features, they will be None. Where the summary statistics are available, they will be in a format compatible with the data type, e.g. date type features will have their summary statistics expressed as ISO-8601 formatted date strings.
Attributes
- project_id (str): the id of the project the feature belongs to
- name (str): the name of the feature
- feature_type (str): the type of the feature, e.g. 'Categorical', 'Text'
- importance (float or None): numeric measure of the strength of relationship between the feature and target (independent of any model or other features); may be None for non-modeling features such as partition columns
- low_information (bool): whether a feature is considered too uninformative for modeling (e.g. because it has too few values)
- unique_count (int): number of unique values
- na_count (int or None): number of missing values
- date_format (str or None): for Date features, the date format string describing how this feature was interpreted, compatible with https://docs.python.org/2/library/time.html#time.strftime ; None for other feature types
- min (str, int, float, or None): the minimum value of the source data in the EDA sample
- max (str, int, float, or None): the maximum value of the source data in the EDA sample
- mean (str, int, float, or None): the arithmetic mean of the source data in the EDA sample
- median (str, int, float, or None): the median of the source data in the EDA sample
- std_dev (str, int, float, or None): the standard deviation of the source data in the EDA sample
- parent_feature_names (list of str): a list of the names of input features used to derive this modeling feature. In cases where the input features and modeling features are the same, this will simply contain the feature's name. Note that if a derived feature was used to create this modeling feature, the values here will not necessarily correspond to the features that must be supplied at prediction time.
-
classmethod
get
(project_id, feature_name)¶ Retrieve a single modeling feature
Parameters: project_id : str
The ID of the project the feature is associated with.
feature_name : str
The name of the feature to retrieve
Returns: feature : ModelingFeature
The requested feature
Ruleset API¶
-
class
datarobot.models.
Ruleset
(project_id=None, parent_model_id=None, model_id=None, ruleset_id=None, rule_count=None, score=None)¶ Represents an approximation of a model with DataRobot Prime
Attributes
- id (str): the id of the ruleset
- rule_count (int): the number of rules used to approximate the model
- score (float): the validation score of the approximation
- project_id (str): the project the approximation belongs to
- parent_model_id (str): the model being approximated
- model_id (str or None): the model using this ruleset (if it exists); None if no such model has been trained
-
request_model
()¶ Request training for a model using this ruleset
Training a model using a ruleset is a necessary prerequisite for being able to download the code for a ruleset.
Returns: job: Job
the job fitting the new Prime model
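A usage sketch of training a Prime model from the best-scoring ruleset. The ids are hypothetical placeholders, and the `Model.get_rulesets` and `Job.get_result_when_complete` helpers are assumed to be available as elsewhere in this client:

```python
import datarobot as dr

# Hypothetical ids; assumes rulesets were produced by a Prime approximation
model = dr.Model.get('5506fcd38bd88f5953219da0', '5506fcd98bd88f1641a720a3')
rulesets = model.get_rulesets()
best = max(rulesets, key=lambda r: r.score)
job = best.request_model()                # fit a Prime model on this ruleset
prime_model = job.get_result_when_complete()
```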
-
PrimeFile API¶
-
class
datarobot.models.
PrimeFile
(id=None, project_id=None, parent_model_id=None, model_id=None, ruleset_id=None, language=None, is_valid=None)¶ Represents a downloadable file containing the code for a DataRobot Prime model
Attributes
- id (str): the id of the PrimeFile
- project_id (str): the id of the project this PrimeFile belongs to
- parent_model_id (str): the model being approximated by this PrimeFile
- model_id (str): the prime model this file represents
- ruleset_id (int): the ruleset being used in this PrimeFile
- language (str): the language of the code in this file - see enums.LANGUAGE for possibilities
- is_valid (bool): whether the code passed basic validation
-
download
(filepath)¶ Download the code and save it to a file
Parameters: filepath: string
the location to save the file to
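A download sketch; the project id is a hypothetical placeholder, and `Project.get_prime_files` is assumed to be available as the way to enumerate a project's PrimeFiles:

```python
import datarobot as dr

# Hypothetical id; assumes a validated PrimeFile already exists
project = dr.Project.get('5506fcd38bd88f5953219da0')
for prime_file in project.get_prime_files():
    if prime_file.is_valid:
        prime_file.download('prime_model_code.py')
        break
```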
-
Frozen Model API¶
-
class
datarobot.models.
FrozenModel
(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, parent_model_id=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None)¶ A model tuned with parameters which are derived from another model
Attributes
- id (str): the id of the model
- project_id (str): the id of the project the model belongs to
- processes (list of str): the processes used by the model
- featurelist_name (str): the name of the featurelist used by the model
- featurelist_id (str): the id of the featurelist used by the model
- sample_pct (float): the percentage of the project dataset used in training the model
- training_row_count (int or None): the number of rows of the project dataset used in training the model. In a datetime partitioned project, if specified, defines the number of rows used to train the model and evaluate backtest scores; if unspecified, either training_duration or training_start_date and training_end_date was used to determine that instead.
- training_duration (str or None): only present for models in datetime partitioned projects. If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores.
- training_start_date (datetime or None): only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
- training_end_date (datetime or None): only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
- model_type (str): what model this is, e.g. 'Nystroem Kernel SVM Regressor'
- model_category (str): what kind of model this is - 'prime' for DataRobot Prime models, 'blend' for blender models, and 'model' for other models
- is_frozen (bool): whether this model is a frozen model
- parent_model_id (str): the id of the model that tuning parameters are derived from
- blueprint_id (str): the id of the blueprint used in this model
- metrics (dict): a mapping from each metric to the model's scores for that metric
- monotonic_increasing_featurelist_id (str): optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
- monotonic_decreasing_featurelist_id (str): optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
- supports_monotonic_constraints (bool): optional, whether this model supports enforcing monotonic constraints
-
classmethod
get
(project_id, model_id)¶ Retrieve a specific frozen model.
Parameters: project_id : str
The project’s id.
model_id : str
The
model_id
of the leaderboard item to retrieve.
Returns: model : FrozenModel
The queried instance.
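A retrieval sketch; both ids are hypothetical placeholders and a configured client is assumed:

```python
import datarobot as dr

# Hypothetical ids; a configured client is assumed
frozen = dr.FrozenModel.get('5506fcd38bd88f5953219da0',
                            '5506fcd98bd88f1641a720a3')
print(frozen.is_frozen, frozen.parent_model_id)
```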
Advanced Options API¶
-
class
datarobot.helpers.
AdvancedOptions
(weights=None, response_cap=None, blueprint_threshold=None, seed=None, smart_downsampled=False, majority_downsampling_rate=None, offset=None, exposure=None, accuracy_optimized_mb=None, scaleout_modeling_mode=None, events_count=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, only_include_monotonic_blueprints=None)¶ Used when setting the target of a project to specify advanced options for the modeling process.
Parameters: weights : string, optional
The name of a column indicating the weight of each row
response_cap : float in [0.5, 1), optional
Quantile of the response distribution to use for response capping.
blueprint_threshold : int, optional
Number of hours models are permitted to run before being excluded from later autopilot stages. Minimum 1.
seed : int
a seed to use for randomization
smart_downsampled : bool
whether to use smart downsampling to throw away excess rows of the majority class. Only applicable to classification and zero-boosted regression projects.
majority_downsampling_rate : float
the percentage (between 0 and 100) of the majority rows that should be kept. Specify only when using smart downsampling. The rate may not cause the majority class to become smaller than the minority class.
offset : list of str, optional
(New in version v2.6) the list of the names of the columns containing the offset of each row
exposure : string, optional
(New in version v2.6) the name of a column containing the exposure of each row
accuracy_optimized_mb : bool, optional
(New in version v2.6) Include additional, longer-running models that will be run by the autopilot and available to run manually.
scaleout_modeling_mode : string, optional
(New in version v2.8) Specifies the behavior of Scaleout models for the project. This is one of datarobot.enums.SCALEOUT_MODELING_MODE. If datarobot.enums.SCALEOUT_MODELING_MODE.DISABLED, no scaleout models will run during autopilot or show in the list of available blueprints. Scaleout models must be disabled for some partitioning settings, including projects using datetime partitioning or projects using offset or exposure columns. If datarobot.enums.SCALEOUT_MODELING_MODE.REPOSITORY_ONLY, scaleout models will be in the list of available blueprints but will not run during autopilot. If datarobot.enums.SCALEOUT_MODELING_MODE.AUTOPILOT, scaleout models will run during autopilot and be in the list of available blueprints. Scaleout models are only supported in the Hadoop environment with the corresponding user permission set.
events_count : string, optional
(New in version v2.8) the name of a column specifying events count.
monotonic_increasing_featurelist_id : string, optional
(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced. When specified, this will set a default for the project that can be overridden at model submission time if desired.
monotonic_decreasing_featurelist_id : string, optional
(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced. When specified, this will set a default for the project that can be overridden at model submission time if desired.
only_include_monotonic_blueprints : bool, optional
(new in version 2.11) when true, only blueprints that support enforcing monotonic constraints will be available in the project or selected for the autopilot.
Examples
import datarobot as dr
advanced_options = dr.AdvancedOptions(
    weights='weights_column',
    offset=['offset_column'],
    exposure='exposure_column',
    response_cap=0.7,
    blueprint_threshold=2,
    smart_downsampled=True,
    majority_downsampling_rate=75.0)
Imported Model API¶
Note
Imported Models are used in Stand Alone Scoring Engines. If you are not an administrator of such an engine, they are not relevant to you.
-
class
datarobot.models.
ImportedModel
(id, imported_at=None, model_id=None, target=None, featurelist_name=None, dataset_name=None, model_name=None, project_id=None, version=None, note=None, origin_url=None, imported_by_username=None, project_name=None, created_by_username=None, created_by_id=None, imported_by_id=None, display_name=None)¶ Represents an imported model available for making predictions. These are only relevant for administrators of on-premise Stand Alone Scoring Engines.
ImportedModels are trained in one DataRobot application, exported as a .drmodel file, and then imported for use in a Stand Alone Scoring Engine.
Attributes
- id (str): id of the import
- model_name (str): model type describing the model generated by DataRobot
- display_name (str): manually specified human-readable name of the imported model
- note (str): manually added note about this imported model
- imported_at (datetime): the time the model was imported
- imported_by_username (str): username of the user who imported the model
- imported_by_id (str): id of the user who imported the model
- origin_url (str): URL of the application the model was exported from
- model_id (str): original id of the model prior to export
- featurelist_name (str): name of the featurelist used to train the model
- project_id (str): id of the project the model belonged to prior to export
- project_name (str): name of the project the model belonged to prior to export
- target (str): the target of the project the model belonged to prior to export
- version (float): project version of the project the model belonged to
- dataset_name (str): filename of the dataset used to create the project the model belonged to
- created_by_username (str): username of the user who created the model prior to export
- created_by_id (str): id of the user who created the model prior to export
-
classmethod
create
(path)¶ Import a previously exported model for predictions.
Parameters: path : str
The path to the exported model file
-
classmethod
get
(import_id)¶ Retrieve imported model info
Parameters: import_id : str
The ID of the imported model.
Returns: imported_model : ImportedModel
The ImportedModel instance
-
classmethod
list
(limit=None, offset=None)¶ List the imported models.
Parameters: limit : int
The number of records to return. The server will use a (possibly finite) default if not specified.
offset : int
The number of records to skip.
Returns: imported_models : list[ImportedModel]
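A paging sketch for listing imported models on a Stand Alone Scoring Engine (administrator access assumed; the page size of 20 is arbitrary):

```python
import datarobot as dr

# Paging sketch; assumes a configured client with administrator access
offset = 0
while True:
    page = dr.ImportedModel.list(limit=20, offset=offset)
    if not page:
        break
    for imported in page:
        print(imported.display_name, imported.imported_at)
    offset += len(page)
```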
-
update
(display_name=None, note=None)¶ Update the display name or note for an imported model. The ImportedModel object is updated in place.
Parameters: display_name : str
The new display name.
note : str
The new note.
-
delete
()¶ Delete this imported model.
Reason Codes API¶
-
class
datarobot.
ReasonCodesInitialization
(project_id, model_id, reason_codes_sample=None)¶ Represents a reason codes initialization of a model.
Attributes
- project_id (str): id of the project the model belongs to
- model_id (str): id of the model the reason codes initialization is for
- reason_codes_sample (list of dict): a small sample of reason codes that could be generated for the model
-
classmethod
get
(project_id, model_id)¶ Retrieve the reason codes initialization for a model.
Reason codes initializations are a prerequisite for computing reason codes, and include a sample of what the computed reason codes for a prediction dataset would look like.
Parameters: project_id : str
id of the project the model belongs to
model_id : str
id of the model reason codes initialization is for
Returns: reason_codes_initialization : ReasonCodesInitialization
The queried instance.
Raises: ClientError (404)
If the project or model does not exist or the initialization has not been computed.
-
classmethod
create
(project_id, model_id)¶ Create a reason codes initialization for the specified model.
Parameters: project_id : str
id of the project the model belongs to
model_id : str
id of the model for which initialization is requested
Returns: job : Job
an instance of created async job
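A sketch of creating the initialization and waiting on it; the ids are hypothetical placeholders, and `Job.get_result_when_complete` is assumed to be available as elsewhere in this client:

```python
import datarobot as dr

# Hypothetical ids; a configured client is assumed
init_job = dr.ReasonCodesInitialization.create('5506fcd38bd88f5953219da0',
                                               '5506fcd98bd88f1641a720a3')
initialization = init_job.get_result_when_complete()
print(initialization.reason_codes_sample[:2])
```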
-
delete
()¶ Delete this reason codes initialization.
-
class
datarobot.
ReasonCodes
(id, project_id, model_id, dataset_id, max_codes, num_columns, finish_time, reason_codes_location, threshold_low=None, threshold_high=None)¶ Represents reason codes metadata and provides access to computation results.
Examples
reason_codes = dr.ReasonCodes.get(project_id, reason_codes_id)
for row in reason_codes.get_rows():
    print(row)  # row is an instance of ReasonCodesRow
Attributes
- id (str): id of the record and reason codes computation result
- project_id (str): id of the project the model belongs to
- model_id (str): id of the model the reason codes initialization is for
- dataset_id (str): id of the prediction dataset reason codes were computed for
- max_codes (int): maximum number of reason codes to supply per row of the dataset
- threshold_low (float): the lower threshold, below which a prediction must score in order for reason codes to be computed for a row in the dataset
- threshold_high (float): the high threshold, above which a prediction must score in order for reason codes to be computed for a row in the dataset
- num_columns (int): the number of columns reason codes were computed for
- finish_time (float): timestamp referencing when computation for these reason codes finished
- reason_codes_location (str): where to retrieve the reason codes
-
classmethod
get
(project_id, reason_codes_id)¶ Retrieve a specific reason codes record.
Parameters: project_id : str
id of the project the model belongs to
reason_codes_id : str
id of the reason codes
Returns: reason_codes : ReasonCodes
The queried instance.
-
classmethod
create
(project_id, model_id, dataset_id, max_codes=None, threshold_low=None, threshold_high=None)¶ Create reason codes for the specified dataset.
In order to create reason codes for a particular model and dataset, you must first:
- Compute feature impact for the model via
datarobot.Model.get_feature_impact()
- Compute a ReasonCodesInitialization for the model via
datarobot.ReasonCodesInitialization.create(project_id, model_id)
- Compute predictions for the model and dataset via
datarobot.Model.request_predictions(dataset_id)
threshold_high
andthreshold_low
are optional filters applied to speed up computation. When at least one is specified, only the selected outlier rows will have reason codes computed. Rows are considered to be outliers if their predicted value (in case of regression projects) or probability of being the positive class (in case of classification projects) is less than threshold_low or greater than threshold_high. If neither is specified, reason codes will be computed for all rows.
Parameters: project_id : str
id of the project the model belongs to
model_id : str
id of the model for which reason codes are requested
dataset_id : str
id of the prediction dataset for which reason codes are requested
threshold_low : float, optional
the lower threshold, below which a prediction must score in order for reason codes to be computed for a row in the dataset. If neither threshold_high nor threshold_low is specified, reason codes will be computed for all rows.
threshold_high : float, optional
the high threshold, above which a prediction must score in order for reason codes to be computed. If neither threshold_high nor threshold_low is specified, reason codes will be computed for all rows.
max_codes : int, optional
the maximum number of reason codes to supply per row of the dataset, default: 3.
Returns: job: Job
an instance of created async job
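An end-to-end sketch of the three prerequisites listed above followed by the create call. All ids are hypothetical placeholders, a configured client is assumed, and the `wait_for_completion` / `get_result_when_complete` job helpers are assumed to be available as elsewhere in this client:

```python
import datarobot as dr

# Hypothetical ids; a configured client is assumed
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
dataset_id = '5506fcd98bd88f1641a720b2'

model = dr.Model.get(project_id, model_id)
model.get_feature_impact()                             # 1. feature impact
init_job = dr.ReasonCodesInitialization.create(project_id, model_id)
init_job.wait_for_completion()                         # 2. initialization
pred_job = model.request_predictions(dataset_id)
pred_job.wait_for_completion()                         # 3. predictions

rc_job = dr.ReasonCodes.create(project_id, model_id, dataset_id,
                               max_codes=5,
                               threshold_low=0.2, threshold_high=0.8)
reason_codes = rc_job.get_result_when_complete()
```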
-
classmethod
list
(project_id, model_id=None, limit=None, offset=None)¶ List reason codes for a specified project.
Parameters: project_id : str
id of the project to list reason codes for
model_id : str, optional
if specified, only reason codes computed for this model will be returned
limit : int or None
at most this many results are returned, default: no limit
offset : int or None
this many results will be skipped, default: 0
Returns: reason_codes : list[ReasonCodes]
-
get_rows
(batch_size=None, exclude_adjusted_predictions=True)¶ Retrieve reason codes rows.
Parameters: batch_size : int
maximum number of reason codes rows to retrieve per request
exclude_adjusted_predictions : bool
Optional, defaults to True. Set to False to include adjusted predictions, which will differ from the predictions on some projects, e.g. those with an exposure column specified.
Yields: reason_codes_row : ReasonCodesRow
Represents reason codes computed for a prediction row.
-
get_all_as_dataframe
(exclude_adjusted_predictions=True)¶ Retrieve all reason codes rows and return them as a pandas.DataFrame.
Returned dataframe has the following structure:
- row_id : row id from prediction dataset
- prediction : the output of the model for this row
- adjusted_prediction : adjusted prediction values (only appears for projects that utilize prediction adjustments, e.g. projects with an exposure column)
- class_0_label : a class level from the target (only appears for classification projects)
- class_0_probability : the probability that the target is this class (only appears for classification projects)
- class_1_label : a class level from the target (only appears for classification projects)
- class_1_probability : the probability that the target is this class (only appears for classification projects)
- reason_0_feature : the name of the feature contributing to the prediction for this reason
- reason_0_feature_value : the value the feature took on
- reason_0_label : the output being driven by this reason. For regression projects, this is the name of the target feature. For classification projects, this is the class label whose probability increasing would correspond to a positive strength.
- reason_0_qualitative_strength : a human-readable description of how strongly the feature affected the prediction (e.g. '+++', '--', '+') for this reason
- reason_0_strength : the amount this feature’s value affected the prediction
- ...
- reason_N_feature : the name of the feature contributing to the prediction for this reason
- reason_N_feature_value : the value the feature took on
- reason_N_label : the output being driven by this reason. For regression projects, this is the name of the target feature. For classification projects, this is the class label whose probability increasing would correspond to a positive strength.
- reason_N_qualitative_strength : a human-readable description of how strongly the feature affected the prediction (e.g. '+++', '--', '+') for this reason
- reason_N_strength : the amount this feature’s value affected the prediction
Parameters: exclude_adjusted_predictions : bool
Optional, defaults to True. Set this to False to include adjusted prediction values in the returned dataframe.
Returns: dataframe: pandas.DataFrame
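A sketch of pulling everything into pandas for analysis; the ids are hypothetical placeholders and a configured client is assumed:

```python
import datarobot as dr

# Hypothetical ids; a configured client is assumed
reason_codes = dr.ReasonCodes.get('5506fcd38bd88f5953219da0',
                                  '5506fcd98bd88f1641a720c7')
df = reason_codes.get_all_as_dataframe()
# e.g. which feature most often drives the top reason
print(df['reason_0_feature'].value_counts().head())
```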
-
download_to_csv
(filename, encoding='utf-8', exclude_adjusted_predictions=True)¶ Save reason codes rows into CSV file.
Parameters: filename : str or file object
path or file object to save reason codes rows
encoding : string, optional
A string representing the encoding to use in the output file, defaults to ‘utf-8’
exclude_adjusted_predictions : bool
Optional, defaults to True. Set to False to include adjusted predictions, which will differ from the predictions on some projects, e.g. those with an exposure column specified.
-
get_reason_codes_page
(limit=None, offset=None, exclude_adjusted_predictions=True)¶ Get reason codes.
If you don’t want to use the generator interface, you can access paginated reason codes directly.
Parameters: limit : int or None
the number of records to return, the server will use a (possibly finite) default if not specified
offset : int or None
the number of records to skip, default 0
exclude_adjusted_predictions : bool
Optional, defaults to True. Set to False to include adjusted predictions, which will differ from the predictions on some projects, e.g. those with an exposure column specified.
Returns: reason_codes : ReasonCodesPage
-
delete
()¶ Delete these reason codes.
-
class
datarobot.models.reason_codes.
ReasonCodesRow
(row_id, prediction, prediction_values, reason_codes=None, adjusted_prediction=None, adjusted_prediction_values=None)¶ Represents reason codes computed for a prediction row.
Notes
PredictionValue contains:
- label : describes what this model output corresponds to. For regression projects, it is the name of the target feature. For classification projects, it is a level from the target feature.
- value : the output of the prediction. For regression projects, it is the predicted value of the target. For classification projects, it is the predicted probability that the row belongs to the class identified by the label.
ReasonCode contains:
- label : describes what output was driven by this reason code. For regression projects, it is the name of the target feature. For classification projects, it is the class whose probability increasing would correspond to a positive strength of this reason code.
- feature : the name of the feature contributing to the prediction
- feature_value : the value the feature took on for this row
- strength : the amount this feature's value affected the prediction
- qualitative_strength : a human-readable description of how strongly the feature affected the prediction (e.g. '+++', '--', '+')
Attributes
- row_id (int): which row this ReasonCodesRow describes
- prediction (float): the output of the model for this row
- adjusted_prediction (float or None): adjusted prediction value for projects that provide this information, None otherwise
- prediction_values (list): an array of dictionaries with a schema described as PredictionValue
- adjusted_prediction_values (list): same as prediction_values but for adjusted predictions
- reason_codes (list): an array of dictionaries with a schema described as ReasonCode
-
class
datarobot.models.reason_codes.
ReasonCodesPage
(id, count=None, previous=None, next=None, data=None, reason_codes_record_location=None, adjustment_method=None)¶ Represents a batch of reason codes received in one request.
Attributes
- id (str): id of the reason codes computation result
- data (list[dict]): list of raw reason codes; each row corresponds to a row of the prediction dataset
- count (int): total number of rows computed
- previous_page (str): where to retrieve the previous page of reason codes, None if the current page is the first
- next_page (str): where to retrieve the next page of reason codes, None if the current page is the last
- reason_codes_record_location (str): where to retrieve the reason codes metadata
- adjustment_method (str): adjustment method that was applied to predictions, or 'N/A' if no adjustments were done
-
classmethod
get
(project_id, reason_codes_id, limit=None, offset=0, exclude_adjusted_predictions=True)¶ Retrieve reason codes.
Parameters: project_id : str
id of the project the model belongs to
reason_codes_id : str
id of the reason codes
limit : int or None
the number of records to return, the server will use a (possibly finite) default if not specified
offset : int or None
the number of records to skip, default 0
exclude_adjusted_predictions : bool
Optional, defaults to True. Set to False to include adjusted predictions, which will differ from the predictions on some projects, e.g. those with an exposure column specified.
Returns: reason_codes : ReasonCodesPage
The queried instance.
Lift Chart API¶
-
class
datarobot.models.lift_chart.
LiftChart
(source, bins)¶ Lift chart data for a model.
Attributes
- source (str): lift chart data source; can be 'validation', 'crossValidation' or 'holdout'
- bins (list of dict): list of lift chart bin information. Dictionary keys:
  - actual (float): sum of actual target values in the bin
  - predicted (float): sum of predicted target values in the bin
  - bin_weight (float): the weight of the bin. For weighted projects, it is the sum of the weights of the rows in the bin; for unweighted projects, it is the number of rows in the bin.
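Because the bins store sums, per-bin averages must be derived by dividing by bin_weight. A small standalone illustration with made-up bin data (not from any real project):

```python
def bin_averages(bins):
    """Average actual and predicted target value per lift chart bin."""
    return [
        (b['actual'] / b['bin_weight'], b['predicted'] / b['bin_weight'])
        for b in bins
        if b['bin_weight']  # skip empty bins to avoid division by zero
    ]

# Made-up example bins, in the shape described above
bins = [
    {'actual': 10.0, 'predicted': 12.0, 'bin_weight': 4.0},
    {'actual': 30.0, 'predicted': 27.0, 'bin_weight': 5.0},
]
print(bin_averages(bins))  # [(2.5, 3.0), (6.0, 5.4)]
```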
Missing Values Report API¶
-
class
datarobot.models.missing_report.
MissingValuesReport
(missing_values_report)¶ A Missing Values report for a particular model
For each relevant feature, the report breaks down how many missing values it had and how each task treated those missing values.
The report is an iterable containing
datarobot.models.missing_report.MissingReportPerFeature
-
classmethod
get
(project_id, model_id)¶ Retrieve a missing report.
Parameters: project_id : str
The project’s id.
model_id : str
The model’s id.
Returns: MissingValuesReport
The queried missing report.
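An iteration sketch over the report; the ids are hypothetical placeholders and a configured client is assumed:

```python
from datarobot.models.missing_report import MissingValuesReport

# Hypothetical ids; a configured client is assumed
report = MissingValuesReport.get('5506fcd38bd88f5953219da0',
                                 '5506fcd98bd88f1641a720a3')
for per_feature in report:
    print(per_feature.feature, per_feature.missing_percentage)
    for task in per_feature.tasks:
        print('  ', task.name, '; '.join(task.descriptions))
```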
-
class
datarobot.models.missing_report.
MissingReportPerFeature
(report_per_feature_dict)¶ Represents how missing values were handled for a particular feature
Attributes
- feature (basestring): the name of the feature
- type (str): the type of the feature, e.g. 'Categorical' or 'Numeric'
- missing_count (int): the number of rows from the model's training data where the feature was missing
- missing_percentage (float): the percentage of the model's training data where the feature was missing, as a float between 0.0 and 100.0
- tasks (list of MissingReportPerTask): information about how tasks within the model handled missing values for this feature
-
class
datarobot.models.missing_report.
MissingReportPerTask
(task_id, info)¶ Represents how a particular task handled missing values
Attributes
- id (basestring): the id of the task, corresponding to the same ids used by datarobot.models.blueprint.BlueprintChart
- name (basestring): the name of the task, e.g. "One-Hot Encoding". These are values that appear in the datarobot.models.Model processes attribute.
- descriptions (list of basestring): human-readable aggregated information about how the task handles missing values. The following descriptions may be present: what value is imputed for missing values, whether the feature being missing is itself treated as a feature by the task, whether missing values are treated as infrequent values, whether infrequent values are treated as missing values, and whether missing values are ignored.
ROC Curve API¶
-
class
datarobot.models.roc_curve.
RocCurve
(source, roc_points, negative_class_predictions, positive_class_predictions)¶ ROC curve data for a model.
Attributes
source (str) ROC curve data source. Can be ‘validation’, ‘crossValidation’ or ‘holdout’.
roc_points (list of dict) List of precalculated metrics associated with thresholds for the ROC curve.
negative_class_predictions (list of float) List of predictions for negative class examples.
positive_class_predictions (list of float) List of predictions for positive class examples.
-
estimate_threshold
(threshold)¶ Return metrics estimation for given threshold.
Parameters: threshold : float from [0, 1] interval
Threshold we want estimation for
Returns: dict
Dictionary of estimated metrics in form of {metric_name: metric_value}. Metrics are ‘accuracy’, ‘f1_score’, ‘false_negative_score’, ‘true_negative_score’, ‘true_negative_rate’, ‘matthews_correlation_coefficient’, ‘true_positive_score’, ‘positive_predictive_value’, ‘false_positive_score’, ‘false_positive_rate’, ‘negative_predictive_value’, ‘true_positive_rate’.
Raises: ValueError
Given threshold isn’t from [0, 1] interval
-
get_best_f1_threshold
()¶ Return the threshold value that corresponds to the maximum F1 score. This is the threshold preselected in DataRobot when you open the “ROC curve” tab.
Returns: float
Threshold with the best F1 score.
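The preselected threshold can also be reproduced locally from roc_points. A minimal sketch, assuming each roc_points entry is a dict carrying a 'threshold' key alongside metrics such as 'f1_score'; the sample points are invented for illustration:

```python
def best_f1_threshold(roc_points):
    """Pick the threshold with the highest F1 score from precomputed ROC points.

    Assumes each entry is a dict with 'threshold' and 'f1_score' keys,
    mirroring what RocCurve.get_best_f1_threshold returns.
    """
    best = max(roc_points, key=lambda point: point['f1_score'])
    return best['threshold']

# Hypothetical precomputed points:
points = [
    {'threshold': 0.2, 'f1_score': 0.61},
    {'threshold': 0.5, 'f1_score': 0.74},
    {'threshold': 0.8, 'f1_score': 0.58},
]
print(best_f1_threshold(points))  # 0.5
```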
-
Word Cloud API¶
-
class
datarobot.models.word_cloud.
WordCloud
(ngrams)¶ Word cloud data for the model.
Attributes
ngrams : list of dict
List of the word cloud ngrams and corresponding data. Each dictionary has the following keys:
- ngram : str - word or ngram value
- coefficient : float - value from the [-1.0, 1.0] range describing the effect of this ngram on the target. A large negative value means a strong effect toward the negative class in classification and a smaller target value in regression models; a large positive value means a strong effect toward the positive class and a bigger target value, respectively.
- count : int - number of rows in the training sample where this ngram appears
- frequency : float - value from the (0.0, 1.0] range; the frequency of this ngram relative to the most frequent ngram
- is_stopword : bool - True for ngrams that DataRobot evaluates as stopwords
-
most_frequent
(top_n=5)¶ Return most frequent ngrams in the word cloud.
Parameters: top_n : int
Number of ngrams to return
Returns: list of dict
Up to top_n of the most frequent ngrams in the word cloud. If top_n is bigger than the total number of ngrams in the word cloud, all ngrams are returned, sorted by frequency in descending order.
-
most_important
(top_n=5)¶ Return most important ngrams in the word cloud.
Parameters: top_n : int
Number of ngrams to return
Returns: list of dict
Up to top_n of the most important ngrams in the word cloud. If top_n is bigger than the total number of ngrams in the word cloud, all ngrams are returned, sorted by absolute coefficient value in descending order.
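The ranking behind most_important can be illustrated with plain dicts shaped like the ngrams entries documented above; the sample ngrams are invented for illustration:

```python
def most_important(ngrams, top_n=5):
    """Mirror WordCloud.most_important: sort by absolute coefficient, descending."""
    ranked = sorted(ngrams, key=lambda ng: abs(ng['coefficient']), reverse=True)
    return ranked[:top_n]

# Hypothetical word cloud data:
ngrams = [
    {'ngram': 'refund', 'coefficient': -0.9, 'count': 40, 'frequency': 0.4, 'is_stopword': False},
    {'ngram': 'great', 'coefficient': 0.7, 'count': 100, 'frequency': 1.0, 'is_stopword': False},
    {'ngram': 'the', 'coefficient': 0.1, 'count': 90, 'frequency': 0.9, 'is_stopword': True},
]
print([ng['ngram'] for ng in most_important(ngrams, top_n=2)])  # ['refund', 'great']
```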
-
Rating Table API¶
-
class
datarobot.models.
RatingTable
(id, rating_table_name, original_filename, project_id, parent_model_id, model_id=None, model_job_id=None, validation_job_id=None, validation_error=None)¶ Interface to modify and download rating tables.
Attributes
id (str) The id of the rating table.
project_id (str) The id of the project this rating table belongs to.
rating_table_name (str) The name of the rating table.
original_filename (str) The name of the file used to create the rating table.
parent_model_id (str) The model id of the model the rating table was validated against.
model_id (str) The model id of the model that was created from the rating table. Can be None if a model has not been created from the rating table.
model_job_id (str) The id of the job to create a model from this rating table. Can be None if a model has not been created from the rating table.
validation_job_id (str) The id of the created job to validate the rating table. Can be None if the rating table has not been validated.
validation_error (str) Contains a description of any errors caused during validation.
-
classmethod
get
(project_id, rating_table_id)¶ Retrieve a single rating table
Parameters: project_id : str
The ID of the project the rating table is associated with.
rating_table_id : str
The ID of the rating table
Returns: rating_table : RatingTable
The queried instance
-
classmethod
create
(project_id, parent_model_id, filename, rating_table_name='Uploaded Rating Table')¶ Uploads and validates a new rating table CSV
Parameters: project_id : str
id of the project the rating table belongs to
parent_model_id : str
id of the model against which this rating table should be validated
filename : str
The path of the CSV file containing the modified rating table.
rating_table_name : str, optional
A human friendly name for the new rating table. The string may be truncated and a suffix may be added to maintain unique names of all rating tables.
Returns: job: Job
an instance of created async job
Raises: InputNotUnderstoodError
Raised if filename isn’t one of supported types.
ClientError (400)
Raised if parent_model_id is invalid.
-
download
(filepath)¶ Download a csv file containing the contents of this rating table
Parameters: filepath : str
The path at which to save the rating table file.
-
rename
(rating_table_name)¶ Renames a rating table to a different name.
Parameters: rating_table_name : str
The new name to rename the rating table to.
-
create_model
()¶ Creates a new model from this rating table record. This rating table must not already be associated with a model and must be valid.
Returns: job: Job
an instance of created async job
Raises: ClientError (422)
Raised if creating model from a RatingTable that failed validation
JobAlreadyRequested
Raised if creating model from a RatingTable that is already associated with a RatingTableModel
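A hedged end-to-end sketch of the rating table workflow (upload, validate, then build a model). It assumes the class is exported at the package top level as dr.RatingTable and that the validation job resolves to the RatingTable instance; all ids and the file path are placeholders:

```python
def upload_and_model_rating_table(project_id, parent_model_id, csv_path):
    """Sketch: upload a modified rating table CSV, wait for validation,
    then build a model from it. Ids and the path are placeholders.
    """
    import datarobot as dr

    # Upload and validate the CSV against the parent model
    validation_job = dr.RatingTable.create(
        project_id, parent_model_id, csv_path,
        rating_table_name='Manually adjusted rating table',
    )
    # Assumption: the job resolves to the validated RatingTable
    rating_table = validation_job.get_result_when_complete()
    if rating_table.validation_error:
        raise ValueError(rating_table.validation_error)

    # Build a model from the validated rating table
    model_job = rating_table.create_model()
    return model_job.get_result_when_complete()
```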
-
Confusion Chart API¶
-
class
datarobot.models.confusion_chart.
ConfusionChart
(source, data)¶ Confusion Chart data for model.
Attributes
source (str) Confusion Chart data source. Can be ‘validation’, ‘crossValidation’ or ‘holdout’.
raw_data (dict) All of the raw data for the Confusion Chart.
confusion_matrix (list of list) The NxN confusion matrix.
classes (list) The names of each of the classes.
class_metrics (list of dict) All of the metrics for each of the classes. Dictionary keys:
- className : string - name of the class
- actualCount : int - number of times this class is seen in the validation data
- predictedCount : int - number of times this class has been predicted for the validation data
- f1 : float - F1 score
- recall : float - recall score
- precision : float - precision score
- wasActualPercentages : list of dict - one vs all actual percentages. Dictionary keys: otherClassName : string - the name of the other class; percentage : float - the percentage of the time the other class was predicted when this class was the actual class (from 0 to 1)
- wasPredictedPercentages : list of dict - one vs all predicted percentages. Dictionary keys: otherClassName : string - the name of the other class; percentage : float - the percentage of the time the other class was the actual class when this class was predicted (from 0 to 1)
- confusionMatrixOneVsAll : list of list - a 2x2 one vs all matrix representing the True/False Negative/Positive counts as integers for each class, structured as [ [ True Negative, False Positive ], [ False Negative, True Positive ] ]
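As an illustration of the confusion_matrix layout, per-class recall can be recovered from it directly. This sketch assumes rows correspond to actual classes and columns to predicted classes; the 2x2 matrix and class names are made up:

```python
def per_class_recall(confusion_matrix, classes):
    """Compute recall for each class from an NxN confusion matrix.

    Assumes rows are actual classes and columns are predicted classes;
    recall = correct predictions for a class / actual count of that class.
    """
    recalls = {}
    for i, name in enumerate(classes):
        actual_count = sum(confusion_matrix[i])
        correct = confusion_matrix[i][i]
        recalls[name] = correct / float(actual_count) if actual_count else 0.0
    return recalls

matrix = [[50, 10], [5, 35]]  # hypothetical 2-class matrix
print(per_class_recall(matrix, ['cat', 'dog']))
```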
Training Predictions API¶
-
class
datarobot.models.training_predictions.
TrainingPredictionsIterator
(client, path, limit=None)¶ Lazily fetches training predictions from DataRobot API in chunks of specified size and then iterates rows from responses as named tuples. Each row represents a training prediction computed for a dataset’s row. Each named tuple has the following structure:
Notes
Each
PredictionValue
dict contains these keys:- label
- describes what this model output corresponds to. For regression projects, it is the name of the target feature. For classification and multiclass projects, it is a label from the target feature.
- value
- the output of the prediction. For regression projects, it is the predicted value of the target. For classification and multiclass projects, it is the predicted probability that the row belongs to the class identified by the label.
Examples
import datarobot as dr

# Fetch existing training predictions by their id
training_predictions = dr.TrainingPredictions.get(project_id, prediction_id)

# Iterate over predictions
for row in training_predictions.iterate_rows():
    print(row.row_id, row.prediction)
Attributes
row_id (int) id of the record in the original dataset for which the training prediction is calculated
partition_id (str or float) id of the data partition that the row belongs to
prediction (float) the model’s prediction for this data row
prediction_values (list of dictionaries) an array of dictionaries with a schema described as PredictionValue
timestamp (str or None) (New in version v2.11) an ISO string representing the time of the prediction in a time series project; may be None for non-time series projects
forecast_point (str or None) (New in version v2.11) an ISO string representing the point in time used as a basis to generate the predictions in a time series project; may be None for non-time series projects
forecast_distance (str or None) (New in version v2.11) how many time steps are between the forecast point and the timestamp in a time series project; None for non-time series projects
series_id (str or None) (New in version v2.11) the id of the series in a multiseries project; may be NaN for single series projects; None for non-time series projects
-
class
datarobot.models.training_predictions.
TrainingPredictions
(project_id, prediction_id, model_id=None, data_subset=None)¶ Represents training predictions metadata and provides access to prediction results.
Examples
Compute training predictions for a model on the whole dataset
import datarobot as dr

# Request calculation of training predictions
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.ALL)
training_predictions = training_predictions_job.get_result_when_complete()
print('Training predictions {} are ready'.format(training_predictions.prediction_id))

# Iterate over actual predictions
for row in training_predictions.iterate_rows():
    print(row.row_id, row.partition_id, row.prediction)
List all training predictions for a project
import datarobot as dr

# Fetch all training predictions for a project
all_training_predictions = dr.TrainingPredictions.list(project_id)

# Inspect all calculated training predictions
for training_predictions in all_training_predictions:
    print(
        'Prediction {} is made for data subset "{}"'.format(
            training_predictions.prediction_id,
            training_predictions.data_subset,
        )
    )
Retrieve training predictions by id
import datarobot as dr

# Getting training predictions by id
training_predictions = dr.TrainingPredictions.get(project_id, prediction_id)

# Iterate over actual predictions
for row in training_predictions.iterate_rows():
    print(row.row_id, row.partition_id, row.prediction)
Attributes
project_id (str) id of the project the model belongs to
model_id (str) id of the model
prediction_id (str) id of generated predictions
-
classmethod
list
(project_id)¶ Fetch all the computed training predictions for a project.
Parameters: project_id : str
id of the project
Returns: A list of
TrainingPredictions
objects
-
classmethod
get
(project_id, prediction_id)¶ Retrieve training predictions on a specified data set.
Parameters: project_id : str
id of the project the model belongs to
prediction_id : str
id of the prediction set
Returns: TrainingPredictions
object which is ready to operate with specified predictions
-
iterate_rows
(batch_size=None)¶ Retrieve training prediction rows as an iterator.
Parameters: batch_size : int, optional
maximum number of training prediction rows to fetch per request
Returns: iterator :
TrainingPredictionsIterator
an iterator which yields named tuples representing training prediction rows
-
get_all_as_dataframe
(class_prefix='class_')¶ Retrieve all training prediction rows and return them as a pandas.DataFrame.
- Returned dataframe has the following structure:
- row_id : row id from the original dataset
- prediction : the model’s prediction for this row
- class_<label> : the probability that the target is this class (only appears for classification and multiclass projects)
- timestamp : the time of the prediction (only appears for time series projects)
- forecast_point : the point in time used as a basis to generate the predictions (only appears for time series projects)
- forecast_distance : how many time steps are between timestamp and forecast_point (only appears for time series projects)
- series_id : the id of the series in a multiseries project, or None for a single series project (only appears for time series projects)
Parameters: class_prefix : str, optional
The prefix to append to labels in the final dataframe. Default is
class_
(e.g., apple -> class_apple)
Returns: dataframe : pandas.DataFrame
-
download_to_csv
(filename, encoding='utf-8')¶ Save training prediction rows into CSV file.
Parameters: filename : str or file object
path or file object to save training prediction rows
encoding : string, optional
A string representing the encoding to use in the output file, defaults to ‘utf-8’
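A short sketch combining get_all_as_dataframe and download_to_csv; the ids and output path are placeholders:

```python
def export_training_predictions(project_id, prediction_id, csv_path):
    """Sketch: fetch computed training predictions and export them two ways.

    `project_id`, `prediction_id` and `csv_path` are placeholders.
    """
    import datarobot as dr

    training_predictions = dr.TrainingPredictions.get(project_id, prediction_id)

    # As a pandas DataFrame, with probability columns named like 'class_<label>'
    df = training_predictions.get_all_as_dataframe(class_prefix='class_')
    print(df.columns)

    # Or streamed straight to a CSV file on disk
    training_predictions.download_to_csv(csv_path, encoding='utf-8')
    return df
```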
-
ModelDeployment API¶
Warning
This interface is now deprecated and will be removed in the v2.13 release of the DataRobot client.
-
class
datarobot.models.
ModelDeployment
(id, model=None, project=None, type=None, status=None, user=None, organization_id=None, instance=None, label=None, description=None, prediction_endpoint=None, deployed=None, created_at=None, updated_at=None, service_health=None, service_health_messages=None, recent_request_count=None, prev_request_count=None, relative_requests_trend=None, trend_time_window=None, request_rates=None)¶ ModelDeployments provide an interface for tracking the health and activity of predictions made against a deployment model. The get_service_statistics method can be used to see current and historical trends in requests made and in user and server error rates
Notes
HealthMessage
dict contains:- level : error level, one of [passing, warning, failing]
- msg_id : identifier for message, like USER_ERRORS, SERVER_ERRORS, NO_GOOD_REQUESTS
- message : human-readable message
Instance
dict contains:- id : id of the dedicated prediction instance the model is deployed to
- host_name : host name of the dedicated prediction instance
- private_ip : IP address of the dedicated prediction instance
- orm_version : On-demand Resource Manager version of the dedicated prediction instance
Model
dict contains:- id : id of the deployed model
- model_type : identifies the model, e.g. Nystroem Kernel SVM Regressor
- uid : id of the user who created this model
User
dict contains:- username : the user’s username
- first_name : the user’s first name
- last_name : the user’s last name
Attributes
id (str) id of the model deployment
model (dict) model associated with the model deployment
project (dict) project associated with the model deployment
type (str) type of the model deployment. Can be one of [sse, dedicated, legacy_dedicated]
status (str) status of the model deployment. Can be one of [active, inactive, archived]
user (dict) user who created the model deployment
organization_id (str) id of the organization associated with the model deployment
instance (dict) instance associated with the model deployment
label (str) label of the model deployment
description (str) description of the model deployment
prediction_endpoint (str) URL where the model is deployed and available for serving predictions
deployed (bool) whether the model deployment process has finished
created_at (datetime) timestamp when the model deployment was created
updated_at (datetime) timestamp when the model deployment was updated
service_health (str) model health status. Can be one of [passing, warning, failing]
service_health_messages (list) list of HealthMessage objects for the service health state
recent_request_count (int) the number of requests within the recent time window specified in trend_time_window
prev_request_count (int) the number of requests within the previous time window specified in trend_time_window
relative_requests_trend (float) relative difference (as a percentage) between the number of prediction requests performed within the current time window and the one before it. The size of the time window is specified by trend_time_window
trend_time_window (str) time window (in full days from “now”) the trend is calculated for
request_rates (list) history of request rates per day, sorted in chronological order (the last entry being the most recent, i.e. today).
-
classmethod
create
(project_id, model_id, label, instance_id=None, description=None, status=None)¶ Create model deployment.
Parameters: project_id : str
id of the project the model belongs to
model_id : str
id of the model for deployment
label : str
human-readable name for the model deployment
instance_id : str, optional
id of the instance in DataRobot cloud being deployed to
description : str, optional
description for the model deployment
status : str, optional
status for the model deployment. Can be [active, inactive, archived].
Returns: job : Job
an instance of created async job
-
classmethod
list
(limit=None, offset=None, query=None, order_by=None, status=None)¶ List of model_deployments
Parameters: limit : int or None
at most this many results are returned, default: no limit
offset : int or None
this many results will be skipped, default: 0
query : str, optional
Filter the model deployments by matching labels and descriptions with the specified string. Partial matches are included, too. Matches are case insensitive
order_by : str, optional
the attribute to order the model deployments by. Supported attributes for ordering: label, exportTarget, status, type. Prefix the attribute name with a dash to sort in descending order, e.g. orderBy=-label. Only one field can be selected
status : str, optional
Filter the list of deployments by status. Must be one of: [active, inactive, archived]
Returns: model_deployments : list[ModelDeployment]
-
classmethod
get
(model_deployment_id)¶ Retrieve a single model_deployment
Parameters: model_deployment_id:
the id of the model_deployment to query
Returns: model_deployment : ModelDeployment
The queried instance
-
update
(label=None, description=None, status=None)¶ Update model_deployment object
Parameters: label : str, optional
The new value for label to be set
description : str, optional
The new value for description to be set
status : str, optional
The new value for status to be set, Can be one of [active, inactive, archived]
-
get_service_statistics
(start_date=None, end_date=None)¶ Retrieve a health overview of the current model_deployment
Parameters: start_date : str, optional
datetime string; only statistics from this timestamp onward are included
end_date: str, optional
datetime string; only statistics up to this timestamp are included
Returns: service_health : dict
dict that represent ServiceHealth object
Notes
ServiceHealth dict contains:
- total_requests: total number of requests performed. 0, if there were no requests
- consumers : total number of unique users performing requests. 0, if there were no requests
- period : dict with two fields - start and end, that denote the boundaries of the time period the stats are reported for. Note, that a half-open time interval is used: [start: end)
- user_error_rate : dict with two fields - current and previous, that denote the ratio of user errors to the total number of requests performed for the given period and one time period before that. 0.0, if there were no errors (or requests)
- server_error_rate : dict with two fields - current and previous, that denote the ratio of server errors to the total number of requests performed for the given period and one time period before that. 0.0, if there were no errors (or requests)
- load : dict with two fields - peak and median, that denote the max and the median of the request rate (in requests per minute) across all requests for the duration of the given time period. Both will be equal to 0.0, if there were no requests.
- median_execution_time : the median of the execution time across all performed requests (in seconds). null, if there were no requests
-
action_log
(limit=None, offset=None)¶ List of actions taken affecting this deployment
Allows insight into when the ModelDeployment was created or deployed.
Parameters: limit : int or None
at most this many results are returned, default: no limit
offset : int or None
this many results will be skipped, default: 0
Returns: action_log : list of dict [ActionLog]
Notes
ActionLog
dict contains:- action : identifies the action. Can be one of [deployed, created]
- performed_by : dict with id, username, first_name and last_name of the user who performed the action.
- performed_at : date/time when the action was performed in ISO-8601 format.
Recommended Model API¶
-
class
datarobot.models.
ModelRecommendation
(project_id, model_id, recommendation_type)¶ A collection of information about a recommended model for a project.
Attributes
project_id (str) the id of the project the model belongs to
model_id (str) the id of the recommended model
recommendation_type (str) the type of model recommendation
-
classmethod
get
(project_id)¶ Retrieves the default recommended model available.
Parameters: project_id : str
The project’s id.
Returns: recommended_model : ModelRecommendation
-
classmethod
get_all
(project_id)¶ Retrieves all of the current recommended models for the project.
Parameters: project_id : str
The project’s id.
Returns: recommended_models : list of ModelRecommendation
-
classmethod
get_recommendation
(recommended_models, recommendation_type)¶ Returns the model in the given list with the requested type.
Parameters: recommended_models : list of ModelRecommendation
recommendation_type : enums.RECOMMENDED_MODEL_TYPE
the type of model to extract from the recommended_models list
Returns: recommended_model : ModelRecommendation or None if no model with the requested type exists
-
get_model
()¶ Returns the Model associated with this ModelRecommendation.
Returns: recommended_model : Model
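A minimal sketch of fetching the default recommendation for a project and the Model behind it, assuming the class is exported at the package top level as dr.ModelRecommendation; the project id is a placeholder:

```python
def fetch_recommended_model(project_id):
    """Sketch: retrieve the default recommended model for a project
    and return the underlying Model object. `project_id` is a placeholder.
    """
    import datarobot as dr

    recommendation = dr.ModelRecommendation.get(project_id)
    print(recommendation.recommendation_type)
    return recommendation.get_model()
```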
-
Database Connectivity API¶
-
class
datarobot.
DataDriver
(id=None, creator=None, base_names=None, class_name=None, canonical_name=None)¶ A data driver
Attributes
id (str) the id of the driver.
class_name (str) the Java class name for the driver.
canonical_name (str) the user-friendly name of the driver.
creator (str) the id of the user who created the driver.
base_names (list of str) a list of the file name(s) of the jar files.
-
classmethod
list
()¶ Returns list of available drivers.
Returns: drivers : list of DataDriver instances
contains a list of available drivers.
Examples
>>> import datarobot as dr
>>> drivers = dr.DataDriver.list()
>>> drivers
[DataDriver('mysql'), DataDriver('RedShift'), DataDriver('PostgreSQL')]
-
classmethod
get
(driver_id)¶ Gets the driver.
Parameters: driver_id : str
the identifier of the driver.
Returns: driver : DataDriver
the required driver.
Examples
>>> import datarobot as dr
>>> driver = dr.DataDriver.get('5ad08a1889453d0001ea7c5c')
>>> driver
DataDriver('PostgreSQL')
-
classmethod
create
(class_name, canonical_name, files)¶ Creates the driver. Only available to admin users.
Parameters: class_name : str
the Java class name for the driver.
canonical_name : str
the user-friendly name of the driver.
files : list of str
a list of the file paths, on the local file system, of the driver file(s).
Returns: driver : DataDriver
the created driver.
Raises: ClientError
raised if the user is not granted the Can manage JDBC database drivers permission
Examples
>>> import datarobot as dr
>>> driver = dr.DataDriver.create(
...     class_name='org.postgresql.Driver',
...     canonical_name='PostgreSQL',
...     files=['/tmp/postgresql-42.2.2.jar']
... )
>>> driver
DataDriver('PostgreSQL')
-
update
(class_name=None, canonical_name=None)¶ Updates the driver. Only available to admin users.
Parameters: class_name : str
the Java class name for the driver.
canonical_name : str
the user-friendly name of the driver.
Raises: ClientError
raised if the user is not granted the Can manage JDBC database drivers permission
Examples
>>> import datarobot as dr
>>> driver = dr.DataDriver.get('5ad08a1889453d0001ea7c5c')
>>> driver.canonical_name
'PostgreSQL'
>>> driver.update(canonical_name='postgres')
>>> driver.canonical_name
'postgres'
-
delete
()¶ Removes the driver. Only available to admin users.
Raises: ClientError
raised if the user is not granted the Can manage JDBC database drivers permission
-
class
datarobot.
DataStore
(data_store_id=None, data_store_type=None, canonical_name=None, creator=None, updated=None, params=None)¶ A data store. Represents a database.
Attributes
id (str) the id of the data store.
data_store_type (str) the type of data store.
canonical_name (str) the user-friendly name of the data store.
creator (str) the id of the user who created the data store.
updated (datetime.datetime) the time of the last update.
params (DataStoreParameters) a list specifying data store parameters.
-
classmethod
list
()¶ Returns list of available data stores.
Returns: data_stores : list of DataStore instances
contains a list of available data stores.
Examples
>>> import datarobot as dr
>>> data_stores = dr.DataStore.list()
>>> data_stores
[DataStore('Demo'), DataStore('Airlines')]
-
classmethod
get
(data_store_id)¶ Gets the data store.
Parameters: data_store_id : str
the identifier of the data store.
Returns: data_store : DataStore
the required data store.
Examples
>>> import datarobot as dr
>>> data_store = dr.DataStore.get('5a8ac90b07a57a0001be501e')
>>> data_store
DataStore('Demo')
-
classmethod
create
(data_store_type, canonical_name, driver_id, jdbc_url)¶ Creates the data store.
Parameters: data_store_type : str
the type of data store.
canonical_name : str
the user-friendly name of the data store.
driver_id : str
the identifier of the DataDriver.
jdbc_url : str
the full JDBC url, for example jdbc:postgresql://my.dbaddress.org:5432/my_db.
Returns: data_store : DataStore
the created data store.
Examples
>>> import datarobot as dr
>>> data_store = dr.DataStore.create(
...     data_store_type='jdbc',
...     canonical_name='Demo DB',
...     driver_id='5a6af02eb15372000117c040',
...     jdbc_url='jdbc:postgresql://my.db.address.org:5432/perftest'
... )
>>> data_store
DataStore('Demo DB')
-
update
(canonical_name=None, driver_id=None, jdbc_url=None)¶ Updates the data store.
Parameters: canonical_name : str
optional, the user-friendly name of the data store.
driver_id : str
optional, the identifier of the DataDriver.
jdbc_url : str
optional, the full JDBC url, for example jdbc:postgresql://my.dbaddress.org:5432/my_db.
Examples
>>> import datarobot as dr
>>> data_store = dr.DataStore.get('5ad5d2afef5cd700014d3cae')
>>> data_store
DataStore('Demo DB')
>>> data_store.update(canonical_name='Demo DB updated')
>>> data_store
DataStore('Demo DB updated')
-
delete
()¶ Removes the DataStore
-
test
(username, password)¶ Tests database connection.
Parameters: username : str
the username for database authentication.
password : str
the password for database authentication. The password is encrypted server-side and never saved or stored
Returns: message : dict
message with status.
Examples
>>> import datarobot as dr
>>> data_store = dr.DataStore.get('5ad5d2afef5cd700014d3cae')
>>> data_store.test(username='db_username', password='db_password')
{'message': 'Connection successful'}
-
schemas
(username, password)¶ Returns list of available schemas.
Parameters: username : str
the username for database authentication.
password : str
the password for database authentication. The password is encrypted server-side and never saved or stored
Returns: response : dict
dict with database name and list of str - available schemas
Examples
>>> import datarobot as dr
>>> data_store = dr.DataStore.get('5ad5d2afef5cd700014d3cae')
>>> data_store.schemas(username='db_username', password='db_password')
{'catalog': 'perftest', 'schemas': ['demo', 'information_schema', 'public']}
-
tables
(username, password, schema=None)¶ Returns list of available tables in schema.
Parameters: username : str
the username for database authentication.
password : str
the password for database authentication. The password is encrypted server-side and never saved or stored
schema : str
optional, the schema name.
Returns: response : dict
dict with catalog name and tables info
Examples
>>> import datarobot as dr
>>> data_store = dr.DataStore.get('5ad5d2afef5cd700014d3cae')
>>> data_store.tables(username='db_username', password='db_password', schema='demo')
{'tables': [{'type': 'TABLE', 'name': 'diagnosis', 'schema': 'demo'},
            {'type': 'TABLE', 'name': 'kickcars', 'schema': 'demo'},
            {'type': 'TABLE', 'name': 'patient', 'schema': 'demo'},
            {'type': 'TABLE', 'name': 'transcript', 'schema': 'demo'}],
 'catalog': 'perftest'}
-
class
datarobot.
DataSource
(data_source_id=None, data_source_type=None, canonical_name=None, creator=None, updated=None, params=None)¶ A data source. Represents a data request.
Attributes
data_source_id (str) the id of the data source.
data_source_type (str) the type of data source.
canonical_name (str) the user-friendly name of the data source.
creator (str) the id of the user who created the data source.
updated (datetime.datetime) the time of the last update.
params (DataSourceParameters) a list specifying data source parameters.
-
classmethod
list
()¶ Returns list of available data sources.
Returns: data_sources : list of DataSource instances
contains a list of available data sources.
Examples
>>> import datarobot as dr
>>> data_sources = dr.DataSource.list()
>>> data_sources
[DataSource('Diagnostics'), DataSource('Airlines 100mb'), DataSource('Airlines 10mb')]
-
classmethod
get
(data_source_id)¶ Gets the data source.
Parameters: data_source_id : str
the identifier of the data source.
Returns: data_source : DataSource
the requested data source.
Examples
>>> import datarobot as dr
>>> data_source = dr.DataSource.get('5a8ac9ab07a57a0001be501f')
>>> data_source
DataSource('Diagnostics')
-
classmethod
create
(data_source_type, canonical_name, params)¶ Creates the data source.
Parameters: data_source_type : str
the type of data source.
canonical_name : str
the user-friendly name of the data source.
params : DataSourceParameters
a list specifying data source parameters.
Returns: data_source : DataSource
the created data source.
Examples
>>> import datarobot as dr
>>> params = dr.DataSourceParameters(
...     data_store_id='5a8ac90b07a57a0001be501e',
...     query='SELECT * FROM airlines10mb WHERE "Year" >= 1995;'
... )
>>> data_source = dr.DataSource.create(
...     data_source_type='jdbc',
...     canonical_name='airlines stats after 1995',
...     params=params
... )
>>> data_source
DataSource('airlines stats after 1995')
-
update
(canonical_name=None, params=None)¶ Updates the data source.
Parameters: canonical_name : str
optional, the user-friendly name of the data source.
params : DataSourceParameters
optional, the updated data source parameters.
Examples
>>> import datarobot as dr
>>> data_source = dr.DataSource.get('5ad840cc613b480001570953')
>>> data_source
DataSource('airlines stats after 1995')
>>> params = dr.DataSourceParameters(
...     query='SELECT * FROM airlines10mb WHERE "Year" >= 1990;'
... )
>>> data_source.update(
...     canonical_name='airlines stats after 1990',
...     params=params
... )
>>> data_source
DataSource('airlines stats after 1990')
-
delete
()¶ Removes the DataSource
-
Examples¶
Note
You are able to install all of the requirements needed to run the example notebooks with: pip install datarobot[examples].
Modeling Airline Delay¶
Overview¶
Statistics on whether a flight was delayed and for how long are available from government databases for all the major carriers. It would be useful to be able to predict, before scheduling a flight, whether or not it is likely to be delayed. In this example, DataRobot will try to model whether a flight will be delayed, based on information such as the scheduled departure time and whether it rained on the day of the flight.
Set Up¶
This example assumes that the DataRobot Python client package has been installed and configured with the credentials of a DataRobot user with API access permissions.
Data Sources¶
Information on flights and flight delays is made available by the Bureau of Transportation Statistics at https://www.transtats.bts.gov/ONTIME/Departures.aspx. To narrow down the amount of data involved, the datasets assembled for this use case are limited to US Airways flights out of Boston Logan in 2013 and 2014, although the script for interacting with DataRobot is sufficiently general that any dataset with the correct format could be used. A flight was declared to be delayed if it ultimately took off at least fifteen minutes after its scheduled departure time.
In addition to flight information, each record in the prepared dataset notes the amount of rain and whether it rained on the day of the flight. This information came from the National Oceanic and Atmospheric Administration’s Quality Controlled Local Climatological Data (QCLCD), available at http://www.ncdc.noaa.gov/qclcd/QCLCD. The daily rainfall for each day in 2013 and 2014 was taken from the recorded daily summaries of water equivalent precipitation at the Boston Logan station. For some days, the QCLCD reports trace amounts of rainfall; these were recorded as 0 inches of rain.
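The rainfall preparation described above can be sketched in pandas. This is a minimal, hypothetical sketch: the raw column names ('DATE', 'PrecipTotal') and the 'T' trace marker are assumptions for illustration, not the verified QCLCD schema.

```python
import pandas as pd

# Tiny synthetic stand-ins for the flight records and QCLCD daily summaries
flights = pd.DataFrame({
    'Date (MM/DD/YYYY)': ['02/01/2013', '02/02/2013'],
    'Flight Number': ['225', '280'],
})
qclcd_daily = pd.DataFrame({
    'DATE': ['2013-02-01', '2013-02-02'],
    'PrecipTotal': ['0.12', 'T'],  # assume 'T' marks a trace amount
})

# Trace rainfall is recorded as 0 inches, as described above
qclcd_daily['daily_rainfall'] = (
    pd.to_numeric(qclcd_daily['PrecipTotal'], errors='coerce').fillna(0.0)
)
qclcd_daily['did_rain'] = qclcd_daily['daily_rainfall'] > 0

# Join each flight to that day's rainfall summary
flights['_date'] = pd.to_datetime(flights['Date (MM/DD/YYYY)'])
qclcd_daily['_date'] = pd.to_datetime(qclcd_daily['DATE'])
merged = flights.merge(
    qclcd_daily[['_date', 'daily_rainfall', 'did_rain']],
    on='_date').drop('_date', axis=1)
```

The merged frame then has the daily_rainfall and did_rain columns alongside each flight record, matching the dataset structure below.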
Dataset Structure¶
Each row in the assembled dataset contains the following columns:
- was_delayed (boolean): whether the flight was delayed
- daily_rainfall (float): the amount of rain, in inches, on the day of the flight
- did_rain (bool): whether it rained on the day of the flight
- Carrier Code (str): the carrier code of the airline; US for all entries in the assembled dataset
- Date (str, MM/DD/YYYY format): the date of the flight
- Flight Number (str): the flight number for the flight
- Tail Number (str): the tail number of the aircraft
- Destination Airport (str): the three-letter airport code of the destination airport
- Scheduled Departure Time (str): the 24-hour scheduled departure time of the flight, in the origin airport’s timezone
In [1]:
import pandas as pd
import datarobot as dr
In [2]:
data_path = "logan-US-2013.csv"
logan_2013 = pd.read_csv(data_path)
logan_2013.head()
Out[2]:
was_delayed | daily_rainfall | did_rain | Carrier Code | Date (MM/DD/YYYY) | Flight Number | Tail Number | Destination Airport | Scheduled Departure Time | |
---|---|---|---|---|---|---|---|---|---|
0 | False | 0.0 | False | US | 02/01/2013 | 225 | N662AW | PHX | 16:20 |
1 | False | 0.0 | False | US | 02/01/2013 | 280 | N822AW | PHX | 06:00 |
2 | False | 0.0 | False | US | 02/01/2013 | 303 | N653AW | CLT | 09:35 |
3 | True | 0.0 | False | US | 02/01/2013 | 604 | N640AW | PHX | 09:55 |
4 | False | 0.0 | False | US | 02/01/2013 | 722 | N715UW | PHL | 18:30 |
We want to be able to make predictions for future data, so the date column should be transformed into features (such as day of week and month) that will still be populated for future flights:
In [3]:
def prepare_modeling_dataset(df):
date_column_name = 'Date (MM/DD/YYYY)'
date = pd.to_datetime(df[date_column_name])
modeling_df = df.drop(date_column_name, axis=1)
days = {0: 'Mon', 1: 'Tues', 2: 'Weds', 3: 'Thurs', 4: 'Fri', 5: 'Sat',
6: 'Sun'}
modeling_df['day_of_week'] = date.apply(lambda x: days[x.dayofweek])
modeling_df['month'] = date.dt.month
return modeling_df
In [4]:
logan_2013_modeling = prepare_modeling_dataset(logan_2013)
logan_2013_modeling.head()
Out[4]:
was_delayed | daily_rainfall | did_rain | Carrier Code | Flight Number | Tail Number | Destination Airport | Scheduled Departure Time | day_of_week | month | |
---|---|---|---|---|---|---|---|---|---|---|
0 | False | 0.0 | False | US | 225 | N662AW | PHX | 16:20 | Fri | 2 |
1 | False | 0.0 | False | US | 280 | N822AW | PHX | 06:00 | Fri | 2 |
2 | False | 0.0 | False | US | 303 | N653AW | CLT | 09:35 | Fri | 2 |
3 | True | 0.0 | False | US | 604 | N640AW | PHX | 09:55 | Fri | 2 |
4 | False | 0.0 | False | US | 722 | N715UW | PHL | 18:30 | Fri | 2 |
DataRobot Modeling¶
As part of this use case, in model_flight_ontime.py, a DataRobot project will be created and used to run a variety of models against the assembled datasets. By default, DataRobot will run autopilot on the automatically generated Informative Features list, which excludes certain pathological features (like Carrier Code in this example, which is always the same value), and we will also create a custom feature list excluding the amount of rainfall on the day of the flight.
This notebook shows how to use the Python API client to create a project, create feature lists, train models with different sample percents and feature lists, and view the models that have been run. It will:
- create a project
- create a new feature list (no foreknowledge) excluding the rainfall features
- set the target to was_delayed, and run DataRobot autopilot on the Informative Features list
- rerun autopilot on a new feature list
- make predictions on a new data set
Starting a Project¶
In [5]:
project = dr.Project.start(logan_2013_modeling,
project_name='Airline Delays - was_delayed',
target="was_delayed")
project.id
Out[5]:
u'5963ddefc8089169ef1637c2'
Jobs and the Project Queue¶
You can view the project in your browser:
In [ ]:
# If running notebook remotely
project.open_leaderboard_browser()
In [ ]:
# Set worker count higher. This will fail if you don't have 10 workers.
project.set_worker_count(10)
In [6]:
project.pause_autopilot()
Out[6]:
True
In [7]:
# More jobs will go into the queue at each stage of autopilot.
# This gets the currently in-progress and queued jobs.
project.get_model_jobs()
Out[7]:
[ModelJob(Gradient Boosted Trees Classifier, status=inprogress),
ModelJob(Breiman and Cutler Random Forest Classifier, status=inprogress),
ModelJob(RuleFit Classifier, status=queue),
ModelJob(Regularized Logistic Regression (L2), status=queue),
ModelJob(Elastic-Net Classifier (L2 / Binomial Deviance), status=queue),
ModelJob(RandomForest Classifier (Gini), status=queue),
ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping, status=queue),
ModelJob(Nystroem Kernel SVM Classifier, status=queue),
ModelJob(Regularized Logistic Regression (L2), status=queue),
ModelJob(Elastic-Net Classifier (L2 / Binomial Deviance) with Binned numeric features, status=queue),
ModelJob(Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance), status=queue),
ModelJob(RandomForest Classifier (Entropy), status=queue),
ModelJob(ExtraTrees Classifier (Gini), status=queue),
ModelJob(Gradient Boosted Trees Classifier with Early Stopping, status=queue),
ModelJob(Gradient Boosted Greedy Trees Classifier with Early Stopping, status=queue),
ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping, status=queue),
ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features, status=queue),
ModelJob(Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) with Unsupervised Learning Features, status=queue),
ModelJob(Auto-tuned K-Nearest Neighbors Classifier (Minkowski Distance), status=queue),
ModelJob(Vowpal Wabbit Classifier, status=queue)]
In [8]:
project.unpause_autopilot()
Out[8]:
True
Features¶
In [9]:
features = project.get_features()
features
Out[9]:
[Feature(did_rain),
Feature(Destination Airport),
Feature(Carrier Code),
Feature(Flight Number),
Feature(Tail Number),
Feature(day_of_week),
Feature(month),
Feature(Scheduled Departure Time),
Feature(daily_rainfall),
Feature(was_delayed)]
In [10]:
pd.DataFrame([f.__dict__ for f in features])
Out[10]:
date_format | feature_type | id | importance | low_information | na_count | name | project_id | unique_count | |
---|---|---|---|---|---|---|---|---|---|
0 | None | Boolean | 2 | 0.029045 | False | 0 | did_rain | 5963ddefc8089169ef1637c2 | 2 |
1 | None | Categorical | 6 | 0.003714 | True | 0 | Destination Airport | 5963ddefc8089169ef1637c2 | 5 |
2 | None | Categorical | 3 | NaN | True | 0 | Carrier Code | 5963ddefc8089169ef1637c2 | 1 |
3 | None | Numeric | 4 | 0.005900 | False | 0 | Flight Number | 5963ddefc8089169ef1637c2 | 329 |
4 | None | Categorical | 5 | -0.004512 | True | 0 | Tail Number | 5963ddefc8089169ef1637c2 | 296 |
5 | None | Categorical | 8 | 0.003452 | True | 0 | day_of_week | 5963ddefc8089169ef1637c2 | 7 |
6 | None | Numeric | 9 | 0.003043 | True | 0 | month | 5963ddefc8089169ef1637c2 | 12 |
7 | %H:%M | Time | 7 | 0.058245 | False | 0 | Scheduled Departure Time | 5963ddefc8089169ef1637c2 | 77 |
8 | None | Numeric | 1 | 0.034295 | False | 0 | daily_rainfall | 5963ddefc8089169ef1637c2 | 58 |
9 | None | Boolean | 0 | 1.000000 | False | 0 | was_delayed | 5963ddefc8089169ef1637c2 | 2 |
Three feature lists are automatically created:
- Raw Features: one for all features
- Informative Features: one based on features with any information (columns with no variation are excluded)
- Univariate Selections: one based on univariate importance (this is only created after the target is set)
Informative Features is the one used by default in autopilot.
In [11]:
feature_lists = project.get_featurelists()
feature_lists
Out[11]:
[Featurelist(Informative Features),
Featurelist(Raw Features),
Featurelist(Univariate Selections)]
In [12]:
# create a featurelist without the rain features
# (since they leak future information)
informative_feats = [lst for lst in feature_lists if
lst.name == 'Informative Features'][0]
no_foreknowledge_features = list(
set(informative_feats.features) - {'daily_rainfall', 'did_rain'})
In [13]:
no_foreknowledge = project.create_featurelist('no foreknowledge',
no_foreknowledge_features)
no_foreknowledge
Out[13]:
Featurelist(no foreknowledge)
In [14]:
project.get_status()
Out[14]:
{u'autopilot_done': False,
u'stage': u'modeling',
u'stage_description': u'Ready for modeling'}
In [15]:
# This waits until autopilot is complete:
project.wait_for_autopilot(check_interval=90)
In progress: 2, queued: 2 (waited: 0s)
In progress: 2, queued: 2 (waited: 0s)
In progress: 2, queued: 2 (waited: 1s)
In progress: 2, queued: 2 (waited: 2s)
In progress: 2, queued: 2 (waited: 3s)
In progress: 2, queued: 2 (waited: 4s)
In progress: 2, queued: 2 (waited: 8s)
In progress: 2, queued: 2 (waited: 14s)
In progress: 2, queued: 2 (waited: 27s)
In progress: 2, queued: 0 (waited: 53s)
In progress: 2, queued: 0 (waited: 105s)
In progress: 0, queued: 0 (waited: 195s)
In progress: 0, queued: 0 (waited: 286s)
In [16]:
project.start_autopilot(no_foreknowledge.id)
In [17]:
project.wait_for_autopilot(check_interval=90)
In progress: 2, queued: 26 (waited: 0s)
In progress: 2, queued: 26 (waited: 0s)
In progress: 2, queued: 26 (waited: 1s)
In progress: 2, queued: 26 (waited: 2s)
In progress: 2, queued: 26 (waited: 3s)
In progress: 2, queued: 26 (waited: 5s)
In progress: 1, queued: 26 (waited: 8s)
In progress: 4, queued: 23 (waited: 15s)
In progress: 6, queued: 17 (waited: 28s)
In progress: 7, queued: 6 (waited: 54s)
In progress: 5, queued: 9 (waited: 105s)
In progress: 7, queued: 1 (waited: 196s)
In progress: 7, queued: 20 (waited: 287s)
In progress: 7, queued: 3 (waited: 378s)
In progress: 4, queued: 0 (waited: 469s)
In progress: 3, queued: 0 (waited: 559s)
In progress: 0, queued: 0 (waited: 650s)
Models¶
In [18]:
models = project.get_models()
example_model = models[0]
example_model
Out[18]:
Model(u'Gradient Boosted Trees Classifier with Early Stopping')
Models represent fitted models and have various data about the model, including metrics:
In [19]:
example_model.metrics
Out[19]:
{u'AUC': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.751662,
u'holdout': None,
u'validation': 0.74957},
u'FVE Binomial': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.139262,
u'holdout': None,
u'validation': 0.14529},
u'Gini Norm': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.503324,
u'holdout': None,
u'validation': 0.49914},
u'LogLoss': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.275264,
u'holdout': None,
u'validation': 0.27347},
u'RMSE': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.27734,
u'holdout': None,
u'validation': 0.27582},
u'Rate@Top10%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.362458,
u'holdout': None,
u'validation': 0.37884},
u'Rate@Top5%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.47347,
u'holdout': None,
u'validation': 0.4898},
u'Rate@TopTenth%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.866668,
u'holdout': None,
u'validation': 1.0}}
In [20]:
def sorted_by_log_loss(models, test_set):
models_with_score = [model for model in models if
model.metrics['LogLoss'][test_set] is not None]
return sorted(models_with_score,
key=lambda model: model.metrics['LogLoss'][test_set])
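As a quick sanity check of this helper outside of a live project, here is a self-contained sketch using hypothetical stand-in objects that expose the same .metrics dictionary shape as real DataRobot models:

```python
# Hypothetical stand-ins for DataRobot Model objects, used only to
# exercise the sorting helper; these are not part of the datarobot package.
class FakeModel(object):
    def __init__(self, name, log_loss):
        self.name = name
        self.metrics = {'LogLoss': {'validation': log_loss,
                                    'crossValidation': log_loss,
                                    'holdout': None}}


def sorted_by_log_loss(models, test_set):
    # Same helper as defined above, repeated so this sketch is self-contained
    models_with_score = [model for model in models if
                         model.metrics['LogLoss'][test_set] is not None]
    return sorted(models_with_score,
                  key=lambda model: model.metrics['LogLoss'][test_set])


models = [FakeModel('gbm', 0.291), FakeModel('xgb', 0.273),
          FakeModel('unscored', None)]
# 'unscored' is dropped (no score on this partition); the rest come back
# ordered best-first (lowest LogLoss first)
ranked = sorted_by_log_loss(models, 'crossValidation')
```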
Let’s choose, from each feature list, the model with the best LogLoss score, so we can compare the scores of the models trained with the rain features and those trained without them:
In [21]:
models = project.get_models()
fair_models = [mod for mod in models if
mod.featurelist_id == no_foreknowledge.id]
rain_cheat_models = [mod for mod in models if
mod.featurelist_id == informative_feats.id]
In [22]:
models[0].metrics['LogLoss']
Out[22]:
{u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.275264,
u'holdout': None,
u'validation': 0.27347}
In [23]:
best_fair_model = sorted_by_log_loss(fair_models, 'crossValidation')[0]
best_cheat_model = sorted_by_log_loss(rain_cheat_models, 'crossValidation')[0]
best_fair_model.metrics, best_cheat_model.metrics
Out[23]:
({u'AUC': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.71437,
u'holdout': None,
u'validation': 0.7187},
u'FVE Binomial': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.089798,
u'holdout': None,
u'validation': 0.09167},
u'Gini Norm': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.42874,
u'holdout': None,
u'validation': 0.4374},
u'LogLoss': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.29108199999999995,
u'holdout': None,
u'validation': 0.29062},
u'RMSE': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.28612,
u'holdout': None,
u'validation': 0.28617},
u'Rate@Top10%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.288738,
u'holdout': None,
u'validation': 0.28669},
u'Rate@Top5%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.37415,
u'holdout': None,
u'validation': 0.39456},
u'Rate@TopTenth%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.633334,
u'holdout': None,
u'validation': 1.0}},
{u'AUC': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.758114,
u'holdout': None,
u'validation': 0.75345},
u'FVE Binomial': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.14579400000000003,
u'holdout': None,
u'validation': 0.14438},
u'Gini Norm': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.516228,
u'holdout': None,
u'validation': 0.5069},
u'LogLoss': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.273176,
u'holdout': None,
u'validation': 0.27376},
u'RMSE': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.27671,
u'holdout': None,
u'validation': 0.27686},
u'Rate@Top10%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.370648,
u'holdout': None,
u'validation': 0.38225},
u'Rate@Top5%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.48163600000000006,
u'holdout': None,
u'validation': 0.4898},
u'Rate@TopTenth%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.933334,
u'holdout': None,
u'validation': 1.0}})
Visualizing Models¶
This is a good time to use Model XRay (not yet available via the API) to visualize the models:
In [ ]:
best_fair_model.open_model_browser()
In [ ]:
best_cheat_model.open_model_browser()
Unlocking the Holdout¶
To maintain holdout scores as a valid estimate of out-of-sample error, we recommend not looking at them until late in the project. For this reason, holdout scores are locked until you unlock them.
In [24]:
project.unlock_holdout()
Out[24]:
Project(Airline Delays - was_delayed)
In [25]:
best_fair_model = dr.Model.get(project.id, best_fair_model.id)
best_cheat_model = dr.Model.get(project.id, best_cheat_model.id)
In [26]:
best_fair_model.metrics['LogLoss'], best_cheat_model.metrics['LogLoss']
Out[26]:
({u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.29108199999999995,
u'holdout': 0.29344,
u'validation': 0.29062},
{u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.273176,
u'holdout': 0.27542,
u'validation': 0.27376})
Retrain on 100%¶
When ready to use the final model, you will probably get the best performance by retraining on 100% of the data.
In [27]:
model_job_fair_100pct_id = best_fair_model.train(sample_pct=100)
model_job_fair_100pct_id
Out[27]:
'188'
Wait for the model to complete:
In [28]:
model_fair_100pct = dr.models.modeljob.wait_for_async_model_creation(
project.id, model_job_fair_100pct_id)
model_fair_100pct.id
Out[28]:
u'5aa015f8fe075913b47c67ff'
Predictions¶
Let’s make predictions for some new data. This new data needs the same transformations that we applied to the training data.
In [29]:
logan_2014 = pd.read_csv("logan-US-2014.csv")
logan_2014_modeling = prepare_modeling_dataset(logan_2014)
logan_2014_modeling.head()
Out[29]:
was_delayed | daily_rainfall | did_rain | Carrier Code | Flight Number | Tail Number | Destination Airport | Scheduled Departure Time | day_of_week | month | |
---|---|---|---|---|---|---|---|---|---|---|
0 | False | 0.0 | False | US | 450 | N809AW | PHX | 10:00 | Sat | 2 |
1 | False | 0.0 | False | US | 553 | N814AW | PHL | 07:00 | Sat | 2 |
2 | False | 0.0 | False | US | 582 | N820AW | PHX | 06:10 | Sat | 2 |
3 | False | 0.0 | False | US | 601 | N678AW | PHX | 16:20 | Sat | 2 |
4 | False | 0.0 | False | US | 657 | N662AW | CLT | 09:45 | Sat | 2 |
In [30]:
prediction_dataset = project.upload_dataset(logan_2014_modeling)
predict_job = model_fair_100pct.request_predictions(prediction_dataset.id)
prediction_dataset.id
Out[30]:
u'5aa01634fe0759146b80ab2c'
In [31]:
predictions = predict_job.get_result_when_complete()
In [32]:
pd.concat([logan_2014_modeling, predictions], axis=1).head()
Out[32]:
was_delayed | daily_rainfall | did_rain | Carrier Code | Flight Number | Tail Number | Destination Airport | Scheduled Departure Time | day_of_week | month | positive_probability | prediction | row_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | 0.0 | False | US | 450 | N809AW | PHX | 10:00 | Sat | 2 | 0.050824 | 0.0 | 0 |
1 | False | 0.0 | False | US | 553 | N814AW | PHL | 07:00 | Sat | 2 | 0.040017 | 0.0 | 1 |
2 | False | 0.0 | False | US | 582 | N820AW | PHX | 06:10 | Sat | 2 | 0.032445 | 0.0 | 2 |
3 | False | 0.0 | False | US | 601 | N678AW | PHX | 16:20 | Sat | 2 | 0.122692 | 0.0 | 3 |
4 | False | 0.0 | False | US | 657 | N662AW | CLT | 09:45 | Sat | 2 | 0.054400 | 0.0 | 4 |
Let’s have a look at our results. Since this is a binary classification problem, a positive_probability approaching zero makes a row a stronger candidate for the negative class (the flight will leave on time), while a value approaching one makes the positive class more likely (the flight will be delayed). From the KDE (Kernel Density Estimate) plot below, we can see that this sample of the data is weighted more strongly toward the negative class.
In [33]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
In [34]:
matplotlib.rcParams['figure.figsize'] = (15, 10) # make charts bigger
In [35]:
sns.set(color_codes=True)
sns.kdeplot(predictions.positive_probability, shade=True, cut=0,
label='Positive Probability')
plt.xlim((0, 1))
plt.ylim((0, None))
plt.xlabel('Probability of Event')
plt.ylabel('Probability Density')
plt.title('Prediction Distribution')
Out[35]:
Text(0.5,1,u'Prediction Distribution')

Exploring Reason Codes¶
Computing reason codes is a resource-intensive task, but you can reduce the runtime by setting prediction value thresholds. You can learn more about reason codes by searching the online documentation available in the DataRobot web interface (where they may be referred to as Prediction Explanations).
When are they useful?¶
A common question when evaluating data is: “Why is a certain data point considered high-risk (or low-risk) for a certain event?”
A sample case for reason codes:
Clark is a business analyst at a large manufacturing firm. She does not have a lot of data science expertise, but has been using DataRobot with great success to predict likely product failures at her manufacturing plant. Her manager is now asking for recommendations for reducing the defect rate, based on these predictions. Clark would like DataRobot to produce reason codes for the expected product failures so that she can identify the key drivers of product failures based on a higher-level aggregation of reasons. Her business team can then use this report to address the causes of failure.
Other common use cases and possible reasons include:
- What are indicators that a transaction could be at high risk for fraud? Possible reasons include transactions out of a cardholder’s home area, transactions out of their “normal usage” time range, and transactions that are too large or small.
- What are some reasons for setting a higher auto insurance price? The applicant is single, male, age under 30 years, and has received traffic tickets. A married homeowner may receive a lower rate.
Preparation¶
We are almost ready to compute reason codes. Reason codes require two prerequisites, but each of these commands only needs to be run once per model. The first prerequisite is computing the feature impact for your model:
In [36]:
%%time
try:
impact_job = model_fair_100pct.request_feature_impact()
impact_job.wait_for_completion(4 * 60)
except dr.errors.JobAlreadyRequested:
pass # already computed
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 10.6 s
After Feature Impact has been computed, you must also create a Reason Codes Initialization for your model:
In [37]:
%%time
try:
# Test to see if they are already computed
dr.ReasonCodesInitialization.get(project.id, model_fair_100pct.id)
except dr.errors.ClientError as e:
assert e.status_code == 404 # haven't been computed
init_job = dr.ReasonCodesInitialization.create(project.id,
model_fair_100pct.id)
init_job.wait_for_completion()
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 96.9 ms
Computing the reason codes¶
Now that we have computed the feature impact and initialized the reason codes, and have also uploaded a dataset and computed predictions on it, we are ready to request the reason codes for every row of the dataset. Computing reason codes supports a couple of parameters:

- max_codes is the maximum number of reason codes to compute for each row.
- threshold_low and threshold_high are thresholds on the prediction value of the row. Reason codes will be computed for a row only if the row’s prediction value is higher than threshold_high or lower than threshold_low. If no thresholds are specified, reason codes will be computed for all rows.
Note: for binary classification projects (like this one), the thresholds correspond to the positive_probability prediction value, whereas for regression problems, they correspond to the actual predicted value.
Since we’ve already examined the prediction distribution above, let’s use it to choose our thresholds. Most flights depart on time, so let’s examine only the reasons for flights with an above-normal probability of being delayed. We will use a threshold_high of 0.456, meaning reason codes will be computed for every row whose predicted positive_probability is at least 0.456. For the simplicity of this tutorial, we will also limit DataRobot to computing only 5 codes per row.
In [38]:
%%time
number_of_reasons = 5
rc_job = dr.ReasonCodes.create(project.id,
model_fair_100pct.id,
prediction_dataset.id,
max_codes=number_of_reasons,
threshold_low=None,
threshold_high=0.456)
rc = rc_job.get_result_when_complete()
all_rows = rc.get_all_as_dataframe()
CPU times: user 4.3 s, sys: 108 ms, total: 4.41 s
Wall time: 29.3 s
Let’s clean up the DataFrame we got back by trimming it down to just the interesting columns. Also, since most rows will have prediction values outside our thresholds, let’s drop all the uninteresting rows (i.e., ones with null values).
In [39]:
import pandas as pd
pd.options.display.max_rows = 10 # default display is too verbose
# These rows are all redundant or of little value for this example
redundant_cols = ['row_id', 'class_0_label', 'class_1_probability',
'class_1_label']
reasons = all_rows.drop(redundant_cols, axis=1)
reasons.drop(['reason_{}_label'.format(i) for i in range(number_of_reasons)],
axis=1, inplace=True)
# These are rows that didn't meet our thresholds
reasons.dropna(inplace=True)
# Rename columns to be more consistent with the terms we have been using
reasons.rename(index=str,
columns={'class_0_probability': 'positive_probability'},
inplace=True)
reasons
Out[39]:
prediction | positive_probability | reason_0_feature | reason_0_feature_value | reason_0_qualitative_strength | reason_0_strength | reason_1_feature | reason_1_feature_value | reason_1_qualitative_strength | reason_1_strength | ... | reason_2_qualitative_strength | reason_2_strength | reason_3_feature | reason_3_feature_value | reason_3_qualitative_strength | reason_3_strength | reason_4_feature | reason_4_feature_value | reason_4_qualitative_strength | reason_4_strength | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9498 | 1.0 | 0.521672 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.411063 | Tail Number | N170US | ++ | 0.522242 | ... | ++ | 0.355082 | Flight Number | 800 | ++ | 0.247061 | day_of_week | Thurs | ++ | 0.240676 |
12373 | 1.0 | 0.505737 | Scheduled Departure Time | -2.208920e+09 | ++ | 0.858645 | Flight Number | 897 | ++ | 0.848086 | ... | ++ | 0.522828 | month | 12 | ++ | 0.312428 | day_of_week | Mon | ++ | 0.276766 |
13254 | 0.0 | 0.466474 | Scheduled Departure Time | -2.208920e+09 | +++ | 0.937670 | Flight Number | 897 | +++ | 0.850898 | ... | ++ | 0.335550 | day_of_week | Thurs | ++ | 0.308574 | Destination Airport | PHX | - | -0.145671 |
13351 | 0.0 | 0.484007 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.124481 | Flight Number | 897 | ++ | 0.863671 | ... | ++ | 0.371650 | day_of_week | Sun | ++ | 0.343775 | month | 12 | ++ | 0.341253 |
13536 | 1.0 | 0.512797 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.229332 | Flight Number | 897 | ++ | 0.861893 | ... | ++ | 0.486845 | day_of_week | Sun | ++ | 0.343775 | month | 12 | ++ | 0.319640 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18015 | 0.0 | 0.494928 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.902487 | month | 7 | ++ | 0.743062 | ... | + | 0.230507 | day_of_week | Thurs | + | 0.220769 | Flight Number | 800 | + | 0.216649 |
18165 | 0.0 | 0.492765 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.447815 | month | 7 | ++ | 0.517627 | ... | ++ | 0.349980 | Flight Number | 800 | ++ | 0.290879 | Destination Airport | CLT | + | 0.183091 |
18392 | 1.0 | 0.584637 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.494192 | month | 7 | ++ | 0.610177 | ... | ++ | 0.522249 | Flight Number | 800 | ++ | 0.281044 | day_of_week | Thurs | ++ | 0.243063 |
18396 | 0.0 | 0.456992 | Scheduled Departure Time | -2.208919e+09 | +++ | 1.442328 | month | 7 | ++ | 0.600896 | ... | ++ | 0.338863 | day_of_week | Thurs | ++ | 0.217166 | Scheduled Departure Time (Hour of Day) | 19 | + | 0.065168 |
18406 | 1.0 | 0.515136 | Scheduled Departure Time | -2.208927e+09 | +++ | 1.766550 | month | 7 | ++ | 0.832155 | ... | ++ | 0.815748 | Scheduled Departure Time (Hour of Day) | 17 | ++ | 0.323491 | Destination Airport | CLT | + | 0.272840 |
27 rows × 22 columns
Explore Reason Code results¶
Now let’s see how often various features are showing up as the top reason for impacting the probability of a flight being delayed.
In [40]:
from functools import reduce
# Create a combined histogram of all our reasons
reasons_hist = reduce(lambda x, y: x.add(y, fill_value=0),
(reasons['reason_{}_feature'.format(i)].value_counts()
for i in range(number_of_reasons)))
In [41]:
reasons_hist.plot.bar()
plt.xticks(rotation=45, ha='right')
Out[41]:
(array([0, 1, 2, 3, 4, 5, 6]), <a list of 7 Text xticklabel objects>)

Knowing the feature impact for this model from the Diving Deeper notebook, the high occurrence of daily_rainfall and Scheduled Departure Time as reason codes is not entirely surprising, because these were some of the top-ranked features in the impact chart. Therefore, let’s take a small detour to investigate some of the rows that had less expected reasons.
Below is some helper code. It can largely be ignored, as it is specific to this exercise and not needed for a general understanding of the DataRobot APIs.
In [42]:
from operator import or_
from functools import reduce
from itertools import chain
def find_rows_with_reason(df, feature_name, nreasons):
"""
Given a reason codes DataFrame, return a slice of that data where the
top N reasons match the given feature
"""
all_reason_columns = (df['reason_{}_feature'.format(i)] == feature_name
for i in range(nreasons))
df_filter = reduce(or_, all_reason_columns)
return favorite_reason_columns(df[df_filter], nreasons)
def favorite_reason_columns(df, nreasons):
"""
Only display the most useful rows of a reason codes DataFrame.
"""
# Use chain to flatten our list of tuples
columns = list(chain.from_iterable(('reason_{}_feature'.format(i),
'reason_{}_feature_value'.format(i),
'reason_{}_strength'.format(i))
for i in range(nreasons)))
return df[columns]
def find_feature_in_row(feature, row, nreasons):
"""
Return the value of a given feature
"""
for i in range(nreasons):
if row['reason_{}_feature'.format(i)] == feature:
return row['reason_{}_feature_value'.format(i)]
def collect_feature_values(df, feature, nreasons):
"""
Return a list of all values of a given reason code from a DataFrame
"""
return [find_feature_in_row(feature, row, nreasons)
for index, row in df.iterrows()]
It looks like there were a small number of rows where Destination Airport was one of the top N reasons for a given prediction:
In [43]:
feature_name = 'Destination Airport'
flight_nums = find_rows_with_reason(reasons, feature_name, number_of_reasons)
flight_nums
Out[43]:
reason_0_feature | reason_0_feature_value | reason_0_strength | reason_1_feature | reason_1_feature_value | reason_1_strength | reason_2_feature | reason_2_feature_value | reason_2_strength | reason_3_feature | reason_3_feature_value | reason_3_strength | reason_4_feature | reason_4_feature_value | reason_4_strength | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13254 | Scheduled Departure Time | -2.208920e+09 | 0.937670 | Flight Number | 897 | 0.850898 | Tail Number | N657AW | 0.335550 | day_of_week | Thurs | 0.308574 | Destination Airport | PHX | -0.145671 |
14226 | Scheduled Departure Time | -2.208920e+09 | 1.435292 | month | 6 | 0.459697 | Flight Number | 800 | 0.280207 | day_of_week | Thurs | 0.251885 | Destination Airport | CLT | 0.201186 |
14601 | Scheduled Departure Time | -2.208920e+09 | 1.422922 | month | 6 | 0.381899 | Flight Number | 800 | 0.278981 | day_of_week | Thurs | 0.248532 | Destination Airport | CLT | 0.201186 |
14855 | Scheduled Departure Time | -2.208920e+09 | 1.376668 | month | 6 | 0.455120 | Tail Number | N163US | 0.345858 | Flight Number | 800 | 0.308118 | Destination Airport | CLT | 0.186002 |
14978 | Scheduled Departure Time | -2.208920e+09 | 1.435292 | month | 6 | 0.459697 | Flight Number | 800 | 0.280207 | day_of_week | Thurs | 0.251885 | Destination Airport | CLT | 0.201186 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
17576 | Scheduled Departure Time | -2.208920e+09 | 1.512200 | month | 7 | 0.623913 | Tail Number | N170US | 0.544830 | Flight Number | 800 | 0.305650 | Destination Airport | CLT | 0.186923 |
17638 | Scheduled Departure Time | -2.208920e+09 | 1.523302 | month | 7 | 0.461251 | Flight Number | 800 | 0.282270 | day_of_week | Thurs | 0.246416 | Destination Airport | CLT | 0.202109 |
18015 | Scheduled Departure Time | -2.208920e+09 | 1.902487 | month | 7 | 0.743062 | Destination Airport | CLT | 0.230507 | day_of_week | Thurs | 0.220769 | Flight Number | 800 | 0.216649 |
18165 | Scheduled Departure Time | -2.208920e+09 | 1.447815 | month | 7 | 0.517627 | Tail Number | N173US | 0.349980 | Flight Number | 800 | 0.290879 | Destination Airport | CLT | 0.183091 |
18406 | Scheduled Departure Time | -2.208927e+09 | 1.766550 | month | 7 | 0.832155 | Tail Number | N818AW | 0.815748 | Scheduled Departure Time (Hour of Day) | 17 | 0.323491 | Destination Airport | CLT | 0.272840 |
12 rows × 15 columns
In [44]:
all_flights = collect_feature_values(flight_nums,
                                     feature_name,
                                     number_of_reasons)
pd.DataFrame(all_flights)[0].value_counts().plot.bar()
plt.xticks(rotation=0)
Out[44]:
(array([0, 1]), <a list of 2 Text xticklabel objects>)

Many a frequent flier will tell you horror stories about flying in and out of certain airports. While any given reason code can have a positive or a negative impact on a prediction (indicated by both the strength and qualitative_strength columns), due to the thresholds we configured earlier for this tutorial it is likely that the above airports are causing flight delays.
DataRobot correctly identified the Scheduled Departure Time input as a timestamp, but the reason code output shows the internal representation of the time value as a Unix epoch, so let's put it back into a format that humans can read:
In [45]:
# For simplicity, let's just look at rows where `Scheduled Departure Time`
# was the first/top reason.
bad_times = reasons[reasons.reason_0_feature == 'Scheduled Departure Time']
# Now let's convert the epoch to a datetime
pd.to_datetime(bad_times.reason_0_feature_value, unit='s')
Out[45]:
9498 1900-01-01 19:10:00
12373 1900-01-01 19:00:00
13254 1900-01-01 19:00:00
13351 1900-01-01 19:00:00
13536 1900-01-01 19:00:00
...
18015 1900-01-01 19:10:00
18165 1900-01-01 19:10:00
18392 1900-01-01 19:10:00
18396 1900-01-01 19:30:00
18406 1900-01-01 17:05:00
Name: reason_0_feature_value, Length: 27, dtype: datetime64[ns]
We can see that it appears as though all departures occurred on Jan. 1st, 1900. This is because the original value was only a time of day, so only the time portion of the epoch is meaningful. We will clean this up in our graph below:
In [46]:
from matplotlib.ticker import FuncFormatter
from time import gmtime, strftime
scale_factor = 9 # make the difference in strengths more visible
depart = reasons[reasons.reason_0_feature == 'Scheduled Departure Time']
true_only = depart[depart.prediction == 1]
false_only = depart[depart.prediction == 0]
plt.scatter(x=true_only.reason_0_feature_value,
            y=true_only.positive_probability,
            c='green',
            s=true_only.reason_0_strength ** scale_factor,
            label='Will be delayed')
plt.scatter(x=false_only.reason_0_feature_value,
            y=false_only.positive_probability,
            c='purple',
            s=false_only.reason_0_strength ** scale_factor,
            label='Will not')
# Convert the Epoch values into human time stamps
formatter = FuncFormatter(lambda x, pos: strftime('%H:%M', gmtime(x)))
plt.gca().xaxis.set_major_formatter(formatter)
plt.xlabel('Scheduled Departure Time')
plt.ylabel('Positive Probability')
plt.legend(markerscale=.5, frameon=True, facecolor="white")
plt.title("Relationship of Depart Time and being delayed")
Out[46]:
Text(0.5,1,u'Relationship of Depart Time and being delayed')

The above plot shows each prediction where the top influencer of the prediction was Scheduled Departure Time. It is plotted against the positive_probability on the Y-axis, and the size of each point represents the strength that departure time had on the prediction (relative to the other features of that data point). Finally, to aid the eye, positive and negative outcomes are colored differently.
As the time scale on the X-axis shows, it doesn't cover the full 24 hours; this is telling. Since we filtered our data earlier to only show predictions leaning towards a delay, and this chart leans towards afternoon and evening times, there may be a correlation between a later scheduled departure time and a higher probability of being delayed. With a little domain knowledge this makes sense: an airplane and its crew make many flights in a day (typically hopping between cities), so small delays in the morning compound into the evening hours.
Financial Data¶
This example retrieves financial data from the Federal Reserve Bank of St. Louis and builds models in DataRobot to predict recession.
Creating the Dataset¶
This notebook shows some of the steps required in creating a dataset from a third party’s data. It has very little to do with DataRobot, and if you’re mostly interested in learning about how to use the DataRobot Python Client, then you could skip reading this section and miss out on very little. In order to have the data necessary for the other notebook, you will need to make sure that this notebook runs.
What do I need to do?¶
Get an API Key¶
The data we will be using is owned by the Federal Reserve Bank of St. Louis. They have an API for which you will need a key. The key is free, don’t worry. Grab one at https://research.stlouisfed.org/docs/api/fred/
To run this notebook without any changes, you will need to save your API key in a file named api_key in the same directory from which you run this notebook.
Install the fredapi package¶
You will also need this Python client package, which makes accessing the data incredibly easy.
pip install fredapi
What will we do with this data?¶
We’re going to predict the future and get rich.
More concretely, we're going to use historical economic data to build a forecasting model for whether or not the US economy will be in recession 13 weeks from now.
The FRED Economic Data¶
The Federal Reserve Bank of St. Louis provides a rich set of historical financial data, plus a REST API to access this data.
We have also written some utilities in order to make it easy to combine data series with different date frequencies in a technique known as Last Observation Carried Forward.
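The timetools module imported below is a local helper written for this example, not a package on PyPI. The core idea it implements, merging series observed at different frequencies with last observation carried forward, can be sketched in plain pandas (the series names and values here are hypothetical):

```python
import pandas as pd

# Two hypothetical series observed at different frequencies
monthly = pd.Series([1.0, 2.0],
                    index=pd.to_datetime(['2000-01-01', '2000-02-01']))
weekly = pd.Series([10.0, 11.0, 12.0],
                   index=pd.to_datetime(['2000-01-07', '2000-01-14',
                                         '2000-01-21']))

# Building a DataFrame from both series aligns them on the union of their
# indexes; forward-filling then carries each last observation forward to
# the dates on which only the other series reported.
combined = pd.DataFrame({'monthly': monthly, 'weekly': weekly})
combined = combined.sort_index().ffill()
```

After the fill, the January monthly value of 1.0 is repeated on every weekly date until the February observation arrives.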
In [1]:
import warnings
import datetime
import fredapi
import pandas as pd
import timetools
fred = fredapi.Fred(api_key_file='api_key')
Get the data¶
There is a lot of data accessible through the FRED API. More than a quarter million data series, actually. That’s probably too much to all be useful.
We selected this set of series by starting with a subset of data specifically related to the US economy, then filtering out forecast data and data that was a pseudo-indicator of the date (a big data leak for this problem), eventually ending up with the collection of series you see in the cell below. It wasn't a very scientific process; there are certainly more robust ways to go about it.
You can learn about any of these series on the FRED website; for example, https://research.stlouisfed.org/fred2/series/A007RO1Q156NBEA is the webpage for the first data series in the cell below. You can also get much of that data through the API, using the get_series_info method as we do in the following cell.
In [2]:
good_columns = [
    u'A007RO1Q156NBEA', u'A011RE1Q156NBEA', u'A011RJ2Q224SBEA',
    u'A021RO1Q156NBEA', u'A021RY2Q224SBEA', u'A191RV1Q225SBEA',
    u'A765RL1Q225SBEA', u'A798RS2Q224SBEA', u'B808RA3Q086SBEA',
    u'CLSACBQ158SBOG', u'CORESTICKM158SFRBATL', u'DLTRUCKSSAAR',
    u'DNDGRY2Q224SBEA', u'DONGRS2Q224SBEA', u'DPCERV1Q225SBEA',
    u'DTRSRZ2Q224SBEA', u'LNS14024886', u'LNU02300000', u'LNU04000003',
    u'M1V', u'M2MOWN', u'M2V', u'MVAAUTOASS',
    u'NECDFNA066MNFRBPHI', u'NOCDSA156MSFRBPHI', u'PERMIT',
    u'PERMITMWNSA', u'PRS84006173', u'RCPHBS',
    u'STICKCPIXSHLTRM158SFRBATL', u'W004RZ2Q224SBEA', u'W087RA3Q086SBEA',
    u'W111RA3Q086SBEA', u'W117RL1Q225SBEA', u'W130RA3Q086SBEA',
    u'W368RG3Q066SBEA', u'WAAA', u'WGS10YR', u'WTB3MS',
    u'Y020RY2Q224SBEA', u'Y033RV1Q225SBEA', u'Y033RZ2Q224SBEA',
    u'Y034RA3Q086SBEA', u'Y034RY2Q224SBEA', u'Y052RL1Q225SBEA',
    u'Y054RG3Q086SBEA', u'Y060RZ2Q224SBEA', u'Y694RY2Q224SBEA']
Get the metadata¶
We'll need to know the frequency of the observations in order to merge the data correctly. That information is available from the API. Each call to get_series_info involves an API request, so this may take some time.
In [3]:
metadata = {}
for series_id in good_columns:
    try:
        metadata[series_id] = fred.get_series_info(series_id)
    except ValueError:
        # Series sometimes get retired from FRED
        warnings.warn('Series {} not found on FRED API'.format(series_id))
Get the data¶
This is where we actually acquire the data. This next step may take a while.
In [4]:
def get_series_data(series_id):
    series_data = fred.get_series_first_release(series_id)
    series_index = [ix.strftime('%Y-%m-%d') for ix in series_data.index]
    series_data.index = series_index
    return series_data


obs = {}
for series_id in metadata.keys():
    series_data = get_series_data(series_id)
    obs[series_id] = series_data
Organize by data frequency¶
Here we make a few groups of the series we just acquired. The ones that have the same update frequency can be put into one dataframe very easily.
In [5]:
# Use .items() so this works on both Python 2 and Python 3
# (the original .iteritems() is Python 2 only)
weekly = [series_id for series_id, meta
          in metadata.items()
          if meta['frequency'] == 'Weekly, Ending Friday']
quarterly = [series_id for series_id, meta
             in metadata.items()
             if meta['frequency'] == 'Quarterly']
monthly = [series_id for series_id, meta
           in metadata.items()
           if meta['frequency'] == 'Monthly']
In [6]:
all_weekly = pd.DataFrame({metadata[series_id]['title']: obs[series_id]
                           for series_id in weekly})
all_monthly = pd.DataFrame({metadata[series_id]['title']: obs[series_id]
                            for series_id in monthly})
all_quarterly = pd.DataFrame({metadata[series_id]['title']: obs[series_id]
                              for series_id in quarterly})
Combine the data of different frequencies¶
We wrote a little helper to take care of merging dataframes that have differing date indexes. It comes in handy right here.
We also drop some rows that extend into the future - some of the series from FRED come back like that, and it’s not good for modeling.
In [7]:
fin_data = timetools.expand_frame_merge(all_weekly, all_monthly)
fin_data = timetools.expand_frame_merge(fin_data, all_quarterly)
fin_data = fin_data[fin_data.index <
                    datetime.datetime.today().strftime('%Y-%m-%d')]
Create the target¶
The whole point of all this is to see if we can predict if there will be a recession in the future, so we’ll need to get historical data on the state of the US economy.
Of course, predicting if we are in a recession on any given day is kind of a no-brainer. So we’ll slide the series in such a way that for any given date, we’re looking at whether there is a recession 13 weeks from that day.
In [8]:
usrec = fred.get_series_first_release('USREC')
usrec.index = [ix.isoformat().split('T')[0] for ix in usrec.index]
bool_match = usrec.index > '1918-01-01'
target_series = usrec[bool_match]
target_name = 'US Recession in 13 Weeks'
timetools.slide(target_series, 7 * 13)
target_frame = pd.DataFrame({target_name: target_series})
modeling_frame = timetools.expand_frame_merge(fin_data, target_frame)
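timetools.slide is another local helper. On a regularly spaced weekly series, the same effect, labeling each date with the value observed some weeks later, can be sketched with pandas shift (toy data; the notebook's actual horizon is 7 * 13 days, i.e. 13 weeks):

```python
import pandas as pd

# Hypothetical weekly 0/1 recession indicator
dates = pd.date_range('2000-01-07', periods=5, freq='7D')
usrec = pd.Series([0, 0, 1, 1, 1], index=dates)

# shift(-n) pulls future values back n steps, so each week is labeled
# with the recession state n weeks later (here n=2 for brevity;
# the notebook uses 13). The last n labels become NaN.
target = usrec.shift(-2)
```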
Trim some (mostly useless) data¶
Some of these series only started gathering data in the late 1940s, so we'll drop rows from before then, since there isn't much information in those weeks. While this step isn't strictly necessary, it does mean we'll be modeling on more informative data.
In [9]:
na_counts = modeling_frame.isnull().sum(axis=1)
earliest_useful_day = na_counts[na_counts < 20].index[0]
earliest_useful_day
modeling_frame = modeling_frame[modeling_frame.index >= earliest_useful_day]
Create the partition column¶
We’ll be training on data before 1980, validating on data from 1980 to 1995, and withholding the data for 1995 onward. This is mostly arbitrary, but does ensure that each time interval has more than one recession. If we create a column with these labels, DataRobot will let us use that column to partition the data into training, validation, and holdout.
In [10]:
n_rows = len(modeling_frame)
validation_first_day = modeling_frame[modeling_frame.index >=
                                      '1980-01-01'].index[0]
validation_point = modeling_frame.index.get_loc(validation_first_day)
holdout_first_day = modeling_frame[modeling_frame.index >=
                                   '1995-01-01'].index[0]
holdout_point = modeling_frame.index.get_loc(holdout_first_day)

tvh = pd.Series(['T'] * n_rows)
tvh.loc[validation_point:holdout_point] = 'V'
tvh.loc[holdout_point:] = 'H'
tvh.index = modeling_frame.index
modeling_frame['TVH'] = tvh
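As a sanity check, the same labels can be built more directly by comparing the date index against the two boundaries with numpy.where (a sketch on a toy index; the cell above does the equivalent via positional slicing):

```python
import numpy as np
import pandas as pd

# Toy date-string index straddling the two partition boundaries
idx = pd.Index(['1979-12-28', '1980-01-04', '1994-12-30', '1995-01-06'])

# Before 1980 -> training, 1980 through 1994 -> validation, 1995 on -> holdout
tvh = pd.Series(np.where(idx < '1980-01-01', 'T',
                         np.where(idx < '1995-01-01', 'V', 'H')),
                index=idx)
```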
Write the dataset to disk
In [12]:
fname = 'financials-{}.csv'.format(datetime.datetime.today()
                                   .strftime('%Y-%m-%d'))
modeling_frame.to_csv(fname, index=True, index_label='Date', encoding='utf-8')
Predicting Recessions with DataRobot¶
In this use case, we’ll try to predict whether or not the US economy is heading into a recession within the next three months. While hopefully it’s not necessary to say so, let’s just get this out of the way up front: don’t actually invest your real money according to the results of this notebook. The real value comes from learning about how to use the Python client of the DataRobot API.
Topics Covered in this Notebook¶
Here is a list of things we’ll touch on during this notebook:
- Installing the datarobot package
- Configuring the client
- Creating a project
- Using a column from the dataset for a custom partitioning scheme
- Omitting one of the source columns from the modeling process
- Running the automated modeling process
- Generating predictions from a finished model
The dataset required for this notebook can be produced by running the notebook Generating a Dataset from FRBSL, located in this same directory.
Prerequisites¶
In order to run this notebook yourself, you will need the following:
- A DataRobot API token
- matplotlib for the visualizations at the end
Installing the datarobot package¶
The datarobot package is hosted on PyPI. You can install it from the command line via:
pip install datarobot
Its main dependencies are numpy and pandas, which could take some time to install on a new system. We highly recommend the use of virtualenvs to avoid conflicts with other dependencies in your system-wide Python installation.
Getting started¶
This line imports the datarobot package. By convention, we always import it with the alias dr.
In [1]:
import datarobot as dr
Other important imports¶
We’ll use these in this notebook as well. If the previous cell and the following cell both run without issue, you’re in good shape.
In [2]:
import re
import os
import datetime
import matplotlib.pyplot as plt
import pandas as pd
%pylab inline
Populating the interactive namespace from numpy and matplotlib
Configure the Python Client¶
Configuring the client requires the following two things:
- A DataRobot endpoint - where the API server can be found
- A DataRobot API token
The client can be configured in several ways. The example we'll use in this notebook is to point to a yaml file that has the information. This is the structure of that file:
endpoint: https://app.datarobot.com/api/v2/
token: not-my-real-token
If you want to run this notebook without changes, save your configuration in a file named drconfig.yaml in the same directory from which you run this notebook.
In [3]:
dr.Client(config_path='drconfig.yaml')
Out[3]:
<datarobot.rest.RESTClientObject at 0x2d958d0>
Find the data in your filesystem¶
If you have run the other notebook, it will have written a file to disk. In the next cell, we'll try to find it in this directory. If it's not here, you can help the notebook continue by defining the variable filename to point to that file.
In [4]:
usecase_name_regex = re.compile(r'financials-.*\.csv')  # raw string avoids an invalid escape
files = [fname for fname in os.listdir('.')
         if usecase_name_regex.match(fname)]
filename = files[0]
print('Using {}'.format(filename))
Using financials-2017-07-10.csv
Create the Project¶
Here, we use the datarobot package to upload a new file and create a project. The name of the project is optional, but can be helpful when trying to sort among many projects on DataRobot.
In [5]:
now = datetime.datetime.now().strftime('%Y-%m-%dT%H:%M')
project_name = 'FRB{}'.format(now)
proj = dr.Project.create(sourcedata=filename,
                         project_name=project_name)
Create a custom partition scheme¶
This problem has a time component to it, so it wouldn't do us much good to train on data from the present and predict on the past. In creating the dataset, the column TVH was used to indicate which partition each row should belong to. The training (T) data all precedes the validation (V) data in time, which in turn precedes the holdout (H) data. Using a UserTVH partitioning scheme, we can tell DataRobot to honor this partition. Absent this information, DataRobot defaults to randomly separating rows into training, validation, and holdout.
In [6]:
proj_partition = dr.UserTVH(user_partition_col='TVH',
                            training_level='T',
                            validation_level='V',
                            holdout_level='H')
Omit a column from modeling¶
The Date column is a data leak, so we don't want it included in the modeling process. We can accomplish this by creating a featurelist that does not include it, and using that featurelist during modeling.
In [7]:
features = proj.get_features()
names_without_date = [feature.name for feature in features
                      if feature.name != 'Date']
flist = proj.create_featurelist('Without Date', names_without_date)
Run the automated modeling process¶
Now we can start the modeling process. The target for this problem is called US Recession in 13 Weeks - a binary variable indicating whether or not the US economy was in recession 13 weeks after the week that a row represents.
We specify AUC as the metric to optimize. Without a specification, DataRobot would use the metric it recommends (in this case, LogLoss).
The partitioning_method argument tells DataRobot to use the partitioning scheme we specified previously.
The featurelist_id parameter tells DataRobot to model on that specific featurelist, rather than the default Informative Features.
Finally, the worker_count parameter specifies how many workers should be used for this project. Keep in mind that you might not have access to that many workers. If you need more resources than have been allocated to you, consider upgrading your license.
The last command in this cell is just a blocking loop that periodically checks on the project to see if it is done, printing out the number of jobs in progress and in the queue along the way so you can see progress. The automated model exploration process will occasionally add more jobs to the queue, so don’t be alarmed if the number of jobs does not strictly decrease over time.
In [8]:
target_name = 'US Recession in 13 Weeks'
proj.set_target(target_name,
                metric='AUC',
                partitioning_method=proj_partition,
                featurelist_id=flist.id,
                worker_count=8)
proj.wait_for_autopilot()
In progress: 7, queued: 23 (waited: 0s)
In progress: 7, queued: 23 (waited: 0s)
In progress: 7, queued: 23 (waited: 1s)
In progress: 7, queued: 23 (waited: 2s)
In progress: 7, queued: 23 (waited: 3s)
In progress: 7, queued: 23 (waited: 4s)
In progress: 6, queued: 21 (waited: 8s)
In progress: 7, queued: 17 (waited: 15s)
In progress: 7, queued: 14 (waited: 28s)
In progress: 7, queued: 6 (waited: 48s)
In progress: 7, queued: 2 (waited: 68s)
In progress: 5, queued: 0 (waited: 89s)
In progress: 4, queued: 0 (waited: 109s)
In progress: 2, queued: 0 (waited: 129s)
In progress: 0, queued: 0 (waited: 149s)
In progress: 6, queued: 0 (waited: 170s)
In progress: 4, queued: 0 (waited: 190s)
In progress: 2, queued: 0 (waited: 210s)
In progress: 1, queued: 0 (waited: 230s)
In progress: 4, queued: 0 (waited: 251s)
In progress: 0, queued: 0 (waited: 271s)
In progress: 0, queued: 0 (waited: 291s)
What just happened?¶
We can see how many models DataRobot built for this project by querying. Each of them has been tuned individually. Models that appear to have the same name differ either in the amount of data used in training or in the preprocessing steps used (or both).
In [9]:
models = proj.get_models()
for idx, model in enumerate(models):
    print('[{}]: {} - {}'.format(
        idx, model.metrics['AUC']['validation'], model.model_type))
[0]: 0.96738 - ExtraTrees Classifier (Gini)
[1]: 0.96279 - ExtraTrees Classifier (Gini)
[2]: 0.94981 - Vowpal Wabbit Classifier
[3]: 0.94803 - eXtreme Gradient Boosted Trees Classifier
[4]: 0.94741 - AVG Blender
[5]: 0.94396 - eXtreme Gradient Boosted Trees Classifier with Unsupervised Learning Features
[6]: 0.9437 - eXtreme Gradient Boosted Trees Classifier
[7]: 0.9437 - ENET Blender
[8]: 0.94274 - ENET Blender
[9]: 0.94215 - Elastic-Net Classifier (L2 / Binomial Deviance)
[10]: 0.9401 - Regularized Logistic Regression (L2)
[11]: 0.93376 - Advanced AVG Blender
[12]: 0.93321 - Regularized Logistic Regression (L2)
[13]: 0.93229 - Support Vector Classifier (Radial Kernel)
[14]: 0.92888 - Regularized Logistic Regression (L2)
[15]: 0.92879 - Support Vector Classifier (Radial Kernel)
[16]: 0.9245 - Regularized Logistic Regression (L2)
[17]: 0.91793 - eXtreme Gradient Boosted Trees Classifier with Unsupervised Learning Features
[18]: 0.91719 - eXtreme Gradient Boosted Trees Classifier
[19]: 0.90894 - RandomForest Classifier (Entropy)
[20]: 0.90451 - Gradient Boosted Greedy Trees Classifier
[21]: 0.90188 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) with Unsupervised Learning Features
[22]: 0.8934 - Elastic-Net Classifier (L2 / Binomial Deviance) with Binned numeric features
[23]: 0.89184 - Breiman and Cutler Random Forest Classifier
[24]: 0.89151 - Gradient Boosted Trees Classifier
[25]: 0.89137 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)
[26]: 0.88992 - Gradient Boosted Trees Classifier
[27]: 0.88978 - RandomForest Classifier (Gini)
[28]: 0.8574 - RuleFit Classifier
[29]: 0.85148 - Auto-tuned K-Nearest Neighbors Classifier (Minkowski Distance)
[30]: 0.84975 - Vowpal Wabbit Classifier
[31]: 0.8471 - RandomForest Classifier (Gini)
[32]: 0.83946 - Logistic Regression
[33]: 0.81802 - Gradient Boosted Trees Classifier
[34]: 0.80683 - TensorFlow Neural Network Classifier
[35]: 0.7483 - Elastic-Net Classifier (L2 / Binomial Deviance)
[36]: 0.7375 - Decision Tree Classifier (Gini)
[37]: 0.70172 - Nystroem Kernel SVM Classifier
[38]: 0.61144 - Auto-tuned K-Nearest Neighbors Classifier (Minkowski Distance)
[39]: 0.57843 - Regularized Logistic Regression (L2)
[40]: 0.55107 - Naive Bayes combiner classifier
[41]: 0.5 - Majority Class Classifier
Generating predictions from a finished model¶
So, what do these models think about the likelihood of a recession in the next 3 months? We can make predictions on the latest data to see what they see.
These may not be the predictions you are looking for...¶
There are two ways to generate predictions in DataRobot, one using modeling workers and one using dedicated prediction workers. In this notebook we will use the former, which is slower, occupies one of your modeling worker slots, and has no real guarantees about latency because the jobs go through the project queue.
Why do we even have this slower prediction mechanism? Because of its limitations, it is much easier to anticipate the load that it adds to the system, so we can provide it to everyone in a shared environment.
For the faster, low latency, dedicated prediction solution, we would encourage you to look into an upgraded license of DataRobot, specifically one with dedicated prediction workers.
Three step process¶
As just mentioned, these predictions go through the modeling queue, so there is a three-step process. The first step is to upload your dataset; the second is to generate the prediction jobs themselves. Finally, you need to retrieve your predictions when the job is done.
In this case, we are generating predictions from the top 10 models in the project.
In [10]:
dataset = proj.upload_dataset(filename)
pred_jobs = [models[i].request_predictions(dataset.id) for i in range(10)]
all_preds = [pred_job.get_result_when_complete() for pred_job in pred_jobs]
Bonus Section: Predicting the future¶
That concludes the “how-to” portion of the notebook. But we won’t just leave you hanging... we’ve gone through all this trouble to try to predict the future. We might as well tell you what we saw.
Get Ready to plot¶
It will be easier to plot the data if it all shares the same time-based index. In this cell we read the modeling data and use its index, then attach the predictions from each of the models to that dataframe.
In [11]:
plot_data = pd.read_csv(filename, index_col=0)
for idx, pred in enumerate(all_preds):
    plot_data['pred_{}'.format(idx)] = pred['positive_probability'].tolist()
Plots!¶
We start by defining a helper function to plot the predictions together on the same plot.
Here we plot the predictions for every week in the dataset after the year 2000 (the holdout was all the data after the start of 1995).
In [20]:
import matplotlib.dates as mdates


def plot_date_data(dataframe, column_names):
    x_axis = [datetime.datetime.strptime(x, '%Y-%m-%d')
              for x in dataframe.index]
    years = mdates.YearLocator()
    months = mdates.MonthLocator()
    years_fmt = mdates.DateFormatter('%Y')
    fig, ax = plt.subplots()
    for column_name in column_names:
        data = dataframe[column_name]
        ax.plot(x_axis, data)
    ax.xaxis.set_major_locator(years)
    ax.xaxis.set_major_formatter(years_fmt)
    ax.xaxis.set_minor_locator(months)
    ax.format_xdata = mdates.DateFormatter('%Y-%m-%d')
    ax.grid(True)
    fig.autofmt_xdate()


plot_date_data(plot_data[plot_data.index > '2000-01-01'],
               ['pred_{}'.format(i) for i in range(10)])

The two spikes correspond to the dotcom bubble bursting in early 2001 and the Great Recession.
But... were the models predictive or postdictive?
A closer look at the Great Recession.¶
Let’s zoom in on 2007 and 2008, when things really went sideways.
In [21]:
plot_date_data(plot_data[(plot_data.index > '2007-01-01') &
                         (plot_data.index < '2009-01-01')],
               ['pred_{}'.format(i) for i in range(10)])

Some of these models were picking up on some signal in the early months of 2008, shortly before stocks went for a dive. But then again, they flatlined before the real tumult happened, so take it with a grain of salt.
But what about now? Are we headed for a recession?¶
In [22]:
plot_date_data(plot_data[plot_data.index > '2011-01-01'],
               ['pred_{}'.format(i) for i in range(10)])

Nope. (As of 7/1/2017)
What can we say about these models?¶
It would seem like we used a lot of information in building and evaluating these models; the dataset does include more than 3000 weeks of data. But how much information is really in this data?
For this specific problem, we know that the state of the economy does not jump around with great velocity. So we don’t really have 3000 independent observations, because the observations in one week convey a lot of information about the values of the nearby weeks. So what information do we actually have?
In this case, while we had many weeks in which there were observed recessions in the economy, we are actually only looking at the event of entering (or exiting) a recession, which is limited by the total number of recessions. In this case that number was only 11; 6 were used in training, 3 in validation, and 2 in the holdout. That’s not a lot of information to train on.
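Counting distinct recessions in a 0/1 indicator series amounts to counting the 0-to-1 transitions; a quick sketch on toy data:

```python
import pandas as pd

# Toy 0/1 recession indicator containing three separate recession episodes
indicator = pd.Series([0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1])

# diff() == 1 marks each week that *enters* a recession.
# (If the series started at 1, that first episode would need adding too.)
n_recessions = int((indicator.diff() == 1).sum())
```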
In [23]:
plot_date_data(plot_data, ['pred_{}'.format(i) for i in range(10)])

Advanced Model Insights¶
Preparation¶
This notebook explores additional options for model insights added in the v2.7 release of the DataRobot API.
Let’s start with importing some packages that will help us with presentation (if you don’t have them installed already, you’ll have to install them to run this notebook).
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import pandas as pd
import datarobot as dr
import numpy as np
from datarobot.enums import AUTOPILOT_MODE
from datarobot.errors import ClientError
Set Up¶
Now configure your DataRobot client (unless you’re using a configuration file)...
In [2]:
dr.Client(token='<API TOKEN>', endpoint='https://<YOUR ENDPOINT>/api/v2/')
Out[2]:
<datarobot.rest.RESTClientObject at 0x10bc01e50>
Create Project with features¶
Create a new project using the 10K_diabetes dataset. This dataset has a binary classification target, readmitted. This project is an excellent example of the advanced model insights available from DataRobot models.
In [3]:
url = 'https://s3.amazonaws.com/datarobot_public_datasets/10k_diabetes.xlsx'
project = dr.Project.create(url, project_name='10K Advanced Modeling')
print('Project ID: {}'.format(project.id))
Project ID: 598dec4bc8089177139da4ad
In [4]:
# Increase the worker count to make the project go faster.
project.set_worker_count(8)
Out[4]:
Project(10K Advanced Modeling)
In [5]:
project.set_target('readmitted', mode=AUTOPILOT_MODE.QUICK)
Out[5]:
Project(10K Advanced Modeling)
In [6]:
project.wait_for_autopilot()
In progress: 2, queued: 0 (waited: 0s)
In progress: 2, queued: 0 (waited: 1s)
In progress: 2, queued: 0 (waited: 2s)
In progress: 2, queued: 0 (waited: 2s)
In progress: 2, queued: 0 (waited: 4s)
In progress: 2, queued: 0 (waited: 6s)
In progress: 2, queued: 0 (waited: 9s)
In progress: 2, queued: 0 (waited: 16s)
In progress: 2, queued: 0 (waited: 29s)
In progress: 1, queued: 0 (waited: 50s)
In progress: 1, queued: 0 (waited: 71s)
In progress: 1, queued: 0 (waited: 91s)
In progress: 1, queued: 0 (waited: 111s)
In progress: 1, queued: 0 (waited: 132s)
In progress: 1, queued: 0 (waited: 152s)
In progress: 1, queued: 0 (waited: 172s)
In progress: 1, queued: 0 (waited: 193s)
In progress: 1, queued: 0 (waited: 213s)
In progress: 1, queued: 0 (waited: 233s)
In progress: 1, queued: 0 (waited: 254s)
In progress: 1, queued: 0 (waited: 274s)
In progress: 0, queued: 1 (waited: 295s)
In progress: 1, queued: 0 (waited: 315s)
In progress: 0, queued: 0 (waited: 335s)
In progress: 0, queued: 0 (waited: 356s)
In [7]:
models = project.get_models()
model = models[0]
model
Out[7]:
Model(u'AVG Blender')
Let's set some color constants to replicate the visual style of the DataRobot lift chart.
In [8]:
dr_dark_blue = '#08233F'
dr_blue = '#1F77B4'
dr_orange = '#FF7F0E'
dr_red = '#BE3C28'
Feature Impact¶
Feature Impact is available for all model types and works by altering input data and observing the effect on a model’s score. It is an on-demand feature, meaning that you must initiate a calculation to see the results. Once you have had DataRobot compute the feature impact for a model, that information is saved with the project.
Feature Impact measures how important a feature is in the context of a model. That is, it measures how much the accuracy of a model would decrease if that feature were removed.
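DataRobot's exact computation happens server-side, but conceptually it resembles permutation importance: scramble one column and measure how much the model's score degrades. A toy sketch with a hypothetical threshold "model" and synthetic data:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)  # only feature 0 carries signal

def accuracy(X, y):
    # Toy "model": predict 1 whenever feature 0 is positive
    preds = (X[:, 0] > 0).astype(int)
    return (preds == y).mean()

base = accuracy(X, y)
impacts = []
for col in range(X.shape[1]):
    X_shuffled = X.copy()
    X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
    # Impact = how much the score drops when this column is scrambled
    impacts.append(base - accuracy(X_shuffled, y))
```

Shuffling the informative column destroys about half the accuracy, while shuffling the ignored column changes nothing, mirroring the large and near-zero bars in a Feature Impact chart.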
In [9]:
try:
    # Check first if they've already been computed
    feature_impacts = model.get_feature_impact()
except dr.errors.ClientError as e:
    # Status code of 404 means the feature impact hasn't been computed yet
    assert e.status_code == 404
    impact_job = model.request_feature_impact()
    # We must wait for the async job to finish; 4 minutes should be plenty
    feature_impacts = impact_job.get_result_when_complete(4 * 60)
In [10]:
# Formats the ticks from a float into a percent
percent_tick_fmt = mtick.PercentFormatter(xmax=1.0)

impact_df = pd.DataFrame(feature_impacts)
impact_df.sort_values(by='impactNormalized', ascending=True, inplace=True)

# Positive values are blue, negative are red
bar_colors = impact_df.impactNormalized.apply(lambda x: dr_red if x < 0
                                              else dr_blue)

ax = impact_df.plot.barh(x='featureName', y='impactNormalized',
                         legend=False,
                         color=bar_colors,
                         figsize=(10, 14))
ax.xaxis.set_major_formatter(percent_tick_fmt)
ax.xaxis.set_tick_params(labeltop=True)
ax.xaxis.grid(True, alpha=0.2)
ax.set_facecolor(dr_dark_blue)

plt.ylabel('')
plt.xlabel('Effect')
plt.xlim((None, 1))  # Allow for negative impact
plt.title('Feature Impact', y=1.04)
Out[10]:
Text(0.5,1.04,u'Feature Impact')
[Figure: Feature Impact bar chart]
Lift Chart¶
A lift chart shows how close, in general, the model’s predictions are to the actual target values in the training data.
The lift chart data we retrieve from the server includes the average model prediction and the average actual target value, sorted by the prediction values in ascending order and split into up to 60 bins. The bin_weight parameter shows how much weight is in each bin (the number of rows for unweighted projects).
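The server does this binning for you, but the construction is easy to sketch locally with made-up (prediction, actual) pairs:

```python
# Hypothetical (prediction, actual) pairs from a binary classification model
pairs = [(0.1, 0), (0.3, 0), (0.2, 1), (0.8, 1), (0.6, 0), (0.9, 1)]
pairs.sort(key=lambda p: p[0])          # sort ascending by prediction
n_bins = 3
size = len(pairs) // n_bins
bins = [pairs[i * size:(i + 1) * size] for i in range(n_bins)]
lift = [{'predicted': sum(p for p, _ in b) / len(b),
         'actual': sum(a for _, a in b) / len(b),
         'bin_weight': len(b)}
        for b in bins]
# The last bin holds the highest predictions; for a good model it should
# also have the highest actual rate.
```
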
In [11]:
lc = model.get_lift_chart('validation')
lc
Out[11]:
LiftChart(validation)
In [12]:
bins_df = pd.DataFrame(lc.bins)
bins_df.head()
Out[12]:
 | actual | bin_weight | predicted
---|---|---|---
0 | 0.037037 | 27.0 | 0.088575 |
1 | 0.111111 | 27.0 | 0.131661 |
2 | 0.192308 | 26.0 | 0.153389 |
3 | 0.222222 | 27.0 | 0.167035 |
4 | 0.111111 | 27.0 | 0.179245 |
Let’s define our rebinning and plotting functions.
In [13]:
def rebin_df(raw_df, number_of_bins):
    cols = ['bin', 'actual_mean', 'predicted_mean', 'bin_weight']
    new_df = pd.DataFrame(columns=cols)
    current_prediction_total = 0
    current_actual_total = 0
    current_row_total = 0
    bin_size = 60 // number_of_bins
    for rowId, data in raw_df.iterrows():
        # Accumulate the weighted totals until the larger bin is full
        current_prediction_total += data['predicted'] * data['bin_weight']
        current_actual_total += data['actual'] * data['bin_weight']
        current_row_total += data['bin_weight']
        if (rowId + 1) % bin_size == 0:
            bin_properties = {
                'bin': (rowId + 1) / bin_size,
                'actual_mean': current_actual_total / current_row_total,
                'predicted_mean': current_prediction_total / current_row_total,
                'bin_weight': current_row_total
            }
            new_df = new_df.append(bin_properties, ignore_index=True)
            current_prediction_total = 0
            current_actual_total = 0
            current_row_total = 0
    return new_df
def matplotlib_lift(bins_df, bin_count, ax):
    grouped = rebin_df(bins_df, bin_count)
    ax.plot(range(1, len(grouped) + 1), grouped['predicted_mean'],
            marker='+', lw=1, color=dr_blue, label='predicted')
    ax.plot(range(1, len(grouped) + 1), grouped['actual_mean'],
            marker='*', lw=1, color=dr_orange, label='actual')
    ax.set_xlim([0, len(grouped) + 1])
    ax.set_facecolor(dr_dark_blue)
    ax.legend(loc='best')
    ax.set_title('Lift chart {} bins'.format(bin_count))
    ax.set_xlabel('Sorted Prediction')
    ax.set_ylabel('Value')
    return grouped
Now we can reproduce all of the lift charts offered in the DataRobot web application.
Note 1: While this method will work for any bin count less than 60, the most reliable results are achieved when the number of bins is a divisor of 60.
Note 2: This visualization method will NOT work for bin counts greater than 60, because DataRobot does not provide enough information for a higher resolution.
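The divisor requirement is easy to check; the bin counts used below are drawn from the divisors of 60:

```python
# Bin counts that divide 60 evenly produce bins of uniform size
divisors_of_60 = [n for n in range(1, 61) if 60 % n == 0]
print(divisors_of_60)  # [1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60]
```
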
In [14]:
bin_counts = [10, 12, 15, 20, 30, 60]
f, axarr = plt.subplots(len(bin_counts))
f.set_size_inches((8, 4 * len(bin_counts)))
rebinned_dfs = []
for i in range(len(bin_counts)):
    rebinned_dfs.append(matplotlib_lift(bins_df, bin_counts[i], axarr[i]))
plt.tight_layout()
[Figure: lift charts for each bin count]
Rebinned Data¶
You may want to interact with the raw re-binned data for use in third-party tools, or for additional evaluation.
In [15]:
for rebinned in rebinned_dfs:
    print('Number of bins: {}'.format(len(rebinned.index)))
    print(rebinned)
Number of bins: 10
bin actual_mean predicted_mean bin_weight
0 1.0 0.13125 0.151517 160.0
1 2.0 0.20000 0.225520 160.0
2 3.0 0.23125 0.272101 160.0
3 4.0 0.31250 0.310227 160.0
4 5.0 0.40000 0.350982 160.0
5 6.0 0.40000 0.395550 160.0
6 7.0 0.43750 0.441662 160.0
7 8.0 0.55625 0.494121 160.0
8 9.0 0.60625 0.561798 160.0
9 10.0 0.69375 0.710759 160.0
Number of bins: 12
bin actual_mean predicted_mean bin_weight
0 1.0 0.134328 0.143911 134.0
1 2.0 0.180451 0.211710 133.0
2 3.0 0.225564 0.253760 133.0
3 4.0 0.276119 0.289034 134.0
4 5.0 0.308271 0.320351 133.0
5 6.0 0.406015 0.354336 133.0
6 7.0 0.406015 0.391651 133.0
7 8.0 0.395522 0.430018 134.0
8 9.0 0.518797 0.470626 133.0
9 10.0 0.639098 0.519144 133.0
10 11.0 0.586466 0.583965 133.0
11 12.0 0.686567 0.728384 134.0
Number of bins: 15
bin actual_mean predicted_mean bin_weight
0 1.0 0.140187 0.134995 107.0
1 2.0 0.149533 0.195819 107.0
2 3.0 0.207547 0.235178 106.0
3 4.0 0.242991 0.264718 107.0
4 5.0 0.280374 0.292256 107.0
5 6.0 0.292453 0.316757 106.0
6 7.0 0.373832 0.344156 107.0
7 8.0 0.452830 0.372372 106.0
8 9.0 0.373832 0.403261 107.0
9 10.0 0.401869 0.433869 107.0
10 11.0 0.528302 0.465610 106.0
11 12.0 0.560748 0.504174 107.0
12 13.0 0.603774 0.547079 106.0
13 14.0 0.635514 0.612989 107.0
14 15.0 0.710280 0.747934 107.0
Number of bins: 20
bin actual_mean predicted_mean bin_weight
0 1.0 0.1125 0.124181 80.0
1 2.0 0.1500 0.178852 80.0
2 3.0 0.1875 0.211547 80.0
3 4.0 0.2125 0.239493 80.0
4 5.0 0.2375 0.260820 80.0
5 6.0 0.2250 0.283381 80.0
6 7.0 0.3375 0.300590 80.0
7 8.0 0.2875 0.319864 80.0
8 9.0 0.3750 0.340949 80.0
9 10.0 0.4250 0.361015 80.0
10 11.0 0.4000 0.383998 80.0
11 12.0 0.4000 0.407102 80.0
12 13.0 0.4125 0.429924 80.0
13 14.0 0.4625 0.453401 80.0
14 15.0 0.5250 0.479391 80.0
15 16.0 0.5875 0.508850 80.0
16 17.0 0.6125 0.541193 80.0
17 18.0 0.6000 0.582403 80.0
18 19.0 0.6750 0.649406 80.0
19 20.0 0.7125 0.772112 80.0
Number of bins: 30
bin actual_mean predicted_mean bin_weight
0 1.0 0.074074 0.110118 54.0
1 2.0 0.207547 0.160341 53.0
2 3.0 0.113208 0.184872 53.0
3 4.0 0.185185 0.206563 54.0
4 5.0 0.207547 0.227254 53.0
5 6.0 0.207547 0.243102 53.0
6 7.0 0.240741 0.257413 54.0
7 8.0 0.245283 0.272161 53.0
8 9.0 0.207547 0.287006 53.0
9 10.0 0.351852 0.297408 54.0
10 11.0 0.301887 0.310547 53.0
11 12.0 0.283019 0.322968 53.0
12 13.0 0.396226 0.337750 53.0
13 14.0 0.351852 0.350444 54.0
14 15.0 0.452830 0.364761 53.0
15 16.0 0.452830 0.379984 53.0
16 17.0 0.351852 0.395395 54.0
17 18.0 0.396226 0.411274 53.0
18 19.0 0.358491 0.425801 53.0
19 20.0 0.444444 0.441788 54.0
20 21.0 0.509434 0.457396 53.0
21 22.0 0.547170 0.473825 53.0
22 23.0 0.490566 0.494573 53.0
23 24.0 0.629630 0.513596 54.0
24 25.0 0.716981 0.534683 53.0
25 26.0 0.490566 0.559476 53.0
26 27.0 0.611111 0.590690 54.0
27 28.0 0.660377 0.635708 53.0
28 29.0 0.622642 0.695099 53.0
29 30.0 0.796296 0.799789 54.0
Number of bins: 60
bin actual_mean predicted_mean bin_weight
0 1.0 0.037037 0.088575 27.0
1 2.0 0.111111 0.131661 27.0
2 3.0 0.192308 0.153389 26.0
3 4.0 0.222222 0.167035 27.0
4 5.0 0.111111 0.179245 27.0
5 6.0 0.115385 0.190716 26.0
6 7.0 0.185185 0.201566 27.0
7 8.0 0.185185 0.211559 27.0
8 9.0 0.192308 0.221900 26.0
9 10.0 0.222222 0.232409 27.0
10 11.0 0.074074 0.239081 27.0
11 12.0 0.346154 0.247278 26.0
12 13.0 0.222222 0.253636 27.0
13 14.0 0.259259 0.261190 27.0
14 15.0 0.230769 0.267897 26.0
15 16.0 0.259259 0.276266 27.0
16 17.0 0.185185 0.283961 27.0
17 18.0 0.230769 0.290167 26.0
18 19.0 0.296296 0.294495 27.0
19 20.0 0.407407 0.300322 27.0
20 21.0 0.307692 0.307198 26.0
21 22.0 0.296296 0.313772 27.0
22 23.0 0.269231 0.319444 26.0
23 24.0 0.296296 0.326361 27.0
24 25.0 0.370370 0.334460 27.0
25 26.0 0.423077 0.341167 26.0
26 27.0 0.333333 0.347227 27.0
27 28.0 0.370370 0.353661 27.0
28 29.0 0.423077 0.361275 26.0
29 30.0 0.481481 0.368118 27.0
30 31.0 0.481481 0.376098 27.0
31 32.0 0.423077 0.384019 26.0
32 33.0 0.296296 0.391877 27.0
33 34.0 0.407407 0.398914 27.0
34 35.0 0.423077 0.407656 26.0
35 36.0 0.370370 0.414758 27.0
36 37.0 0.259259 0.421825 27.0
37 38.0 0.461538 0.429930 26.0
38 39.0 0.518519 0.438017 27.0
39 40.0 0.370370 0.445558 27.0
40 41.0 0.423077 0.453398 26.0
41 42.0 0.592593 0.461246 27.0
42 43.0 0.500000 0.468806 26.0
43 44.0 0.592593 0.478657 27.0
44 45.0 0.481481 0.490318 27.0
45 46.0 0.500000 0.498991 26.0
46 47.0 0.592593 0.507938 27.0
47 48.0 0.666667 0.519255 27.0
48 49.0 0.692308 0.528170 26.0
49 50.0 0.740741 0.540955 27.0
50 51.0 0.407407 0.553971 27.0
51 52.0 0.576923 0.565192 26.0
52 53.0 0.666667 0.582203 27.0
53 54.0 0.555556 0.599178 27.0
54 55.0 0.730769 0.619919 26.0
55 56.0 0.592593 0.650911 27.0
56 57.0 0.703704 0.676295 27.0
57 58.0 0.538462 0.714627 26.0
58 59.0 0.814815 0.763131 27.0
59 60.0 0.777778 0.836447 27.0
ROC curve¶
The receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
To retrieve ROC curve information, use the Model.get_roc_curve method.
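Each ROC point is derived from the confusion counts at one threshold; the arithmetic is a small sketch (the counts used here are taken from the first rows of the table that follows):

```python
def roc_point(true_positive, false_positive, true_negative, false_negative):
    # TPR = TP / (TP + FN); FPR = FP / (FP + TN)
    return {
        'true_positive_rate': true_positive / (true_positive + false_negative),
        'false_positive_rate': false_positive / (false_positive + true_negative),
    }

# At threshold 1.0 everything is predicted negative (965 TN, 635 FN):
point = roc_point(0, 0, 965, 635)  # both rates are 0.0

# A lower threshold admits a few positives (12 TP, 1 FP, 964 TN, 623 FN):
roc_point(12, 1, 964, 623)
```
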
In [16]:
roc = model.get_roc_curve('validation')
roc
Out[16]:
RocCurve(validation)
In [17]:
df = pd.DataFrame(roc.roc_points)
df.head()
Out[17]:
 | accuracy | f1_score | false_negative_score | false_positive_rate | false_positive_score | matthews_correlation_coefficient | negative_predictive_value | positive_predictive_value | threshold | true_negative_rate | true_negative_score | true_positive_rate | true_positive_score
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.603125 | 0.000000 | 635 | 0.000000 | 0 | 0.000000 | 0.603125 | 0.000000 | 1.000000 | 1.000000 | 965 | 0.000000 | 0 |
1 | 0.605000 | 0.009404 | 632 | 0.000000 | 0 | 0.053430 | 0.604258 | 1.000000 | 0.925734 | 1.000000 | 965 | 0.004724 | 3 |
2 | 0.605625 | 0.012520 | 631 | 0.000000 | 0 | 0.061715 | 0.604637 | 1.000000 | 0.897726 | 1.000000 | 965 | 0.006299 | 4 |
3 | 0.609375 | 0.031008 | 625 | 0.000000 | 0 | 0.097764 | 0.606918 | 1.000000 | 0.843124 | 1.000000 | 965 | 0.015748 | 10 |
4 | 0.610000 | 0.037037 | 623 | 0.001036 | 1 | 0.097343 | 0.607435 | 0.923077 | 0.812854 | 0.998964 | 964 | 0.018898 | 12 |
Threshold operations¶
You can get the recommended threshold value with maximal F1 score using the RocCurve.get_best_f1_threshold method. That is the same threshold that is preselected in DataRobot when you open the “ROC curve” tab.
In [18]:
threshold = roc.get_best_f1_threshold()
threshold
Out[18]:
0.3359943414397026
To estimate metrics for different threshold values, just pass the threshold to the RocCurve.estimate_threshold method. This will produce the same results as updating the threshold on the DataRobot “ROC curve” tab.
In [19]:
metrics = roc.estimate_threshold(threshold)
metrics
Out[19]:
{'accuracy': 0.626875,
'f1_score': 0.6219126029132362,
'false_negative_score': 144,
'false_positive_rate': 0.4694300518134715,
'false_positive_score': 453,
'matthews_correlation_coefficient': 0.30220241744619025,
'negative_predictive_value': 0.7804878048780488,
'positive_predictive_value': 0.5201271186440678,
'threshold': 0.3359943414397026,
'true_negative_rate': 0.5305699481865285,
'true_negative_score': 512,
'true_positive_rate': 0.7732283464566929,
'true_positive_score': 491}
Confusion matrix¶
Using a few keys from the retrieved metrics, we can now build a confusion matrix for the selected threshold.
In [20]:
roc_df = pd.DataFrame({
    'Predicted Negative': [metrics['true_negative_score'],
                           metrics['false_negative_score'],
                           metrics['true_negative_score']
                           + metrics['false_negative_score']],
    'Predicted Positive': [metrics['false_positive_score'],
                           metrics['true_positive_score'],
                           metrics['true_positive_score']
                           + metrics['false_positive_score']],
    'Total': [len(roc.negative_class_predictions),
              len(roc.positive_class_predictions),
              len(roc.negative_class_predictions)
              + len(roc.positive_class_predictions)]})
roc_df.index = pd.MultiIndex.from_tuples([
    ('Actual', '-'), ('Actual', '+'), ('Total', '')])
roc_df.columns = pd.MultiIndex.from_tuples([
    ('Predicted', '-'), ('Predicted', '+'), ('Total', '')])
roc_df.style.set_properties(**{'text-align': 'right'})
roc_df
Out[20]:
 |  | Predicted - | Predicted + | Total
---|---|---|---|---
Actual | - | 512 | 453 | 962
 | + | 144 | 491 | 638
Total |  | 656 | 944 | 1600
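The headline metrics reported by estimate_threshold can all be re-derived from these four cells; a quick sanity check using the counts above:

```python
# Counts at the selected threshold, from the confusion matrix above
tp, fp, tn, fn = 491, 453, 512, 144

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)    # positive_predictive_value
recall = tp / (tp + fn)       # true_positive_rate
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 6), round(f1, 6))  # 0.626875 0.621913
```

These match the 'accuracy' and 'f1_score' values in the metrics dictionary retrieved earlier.
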
ROC curve plot¶
In [21]:
dr_roc_green = '#03c75f'
white = '#ffffff'
dr_purple = '#65147D'
dr_dense_green = '#018f4f'
fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)
plt.scatter(df.false_positive_rate, df.true_positive_rate, color=dr_roc_green)
plt.plot(df.false_positive_rate, df.true_positive_rate, color=dr_roc_green)
plt.plot([0, 1], [0, 1], color=white, alpha=0.25)
plt.title('ROC curve')
plt.xlabel('False Positive Rate (Fallout)')
plt.xlim([0, 1])
plt.ylabel('True Positive Rate (Sensitivity)')
plt.ylim([0, 1])
Out[21]:
(0, 1)
[Figure: ROC curve]
Prediction distribution plot¶
There are a few different ways to visualize the prediction distribution; which one to use depends on which packages you have installed. Below you will find three different examples.
Using seaborn
In [22]:
import seaborn as sns
sns.set_style("whitegrid", {'axes.grid': False})
fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)
shared_params = {'shade': True, 'clip': (0, 1), 'bw': 0.2}
sns.kdeplot(np.array(roc.negative_class_predictions),
color=dr_purple, **shared_params)
sns.kdeplot(np.array(roc.positive_class_predictions),
color=dr_dense_green, **shared_params)
plt.title('Prediction Distribution')
plt.xlabel('Probability of Event')
plt.xlim([0, 1])
plt.ylabel('Probability Density')
Out[22]:
Text(0,0.5,'Probability Density')
[Figure: prediction distribution, seaborn]
Using SciPy
In [23]:
from scipy.stats import gaussian_kde
fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)
xs = np.linspace(0, 1, 100)
density_neg = gaussian_kde(roc.negative_class_predictions, bw_method=0.2)
plt.plot(xs, density_neg(xs), color=dr_purple)
plt.fill_between(xs, 0, density_neg(xs), color=dr_purple, alpha=0.3)
density_pos = gaussian_kde(roc.positive_class_predictions, bw_method=0.2)
plt.plot(xs, density_pos(xs), color=dr_dense_green)
plt.fill_between(xs, 0, density_pos(xs), color=dr_dense_green, alpha=0.3)
plt.title('Prediction Distribution')
plt.xlabel('Probability of Event')
plt.xlim([0, 1])
plt.ylabel('Probability Density')
Out[23]:
[Figure: prediction distribution, SciPy]

Using scikit-learn
This approach will be the most consistent with how we display this plot in DataRobot, because scikit-learn supports additional kernel options, so we can configure the same kernel as the web application uses (an Epanechnikov kernel with bandwidth 0.05).
The other examples above use a Gaussian kernel, so they may differ slightly from the plot in the DataRobot interface.
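For reference, the Epanechnikov kernel is just a truncated parabola. A minimal pure-Python density estimator using it (an illustration of the kernel, not the scikit-learn or DataRobot implementation):

```python
def epanechnikov_kde(grid, data, bandwidth=0.05):
    def kernel(u):
        # Parabolic kernel, nonzero only on (-1, 1); integrates to 1
        return 0.75 * (1 - u * u) if abs(u) < 1 else 0.0
    return [sum(kernel((x - d) / bandwidth) for d in data)
            / (len(data) * bandwidth)
            for x in grid]
```

As with the KDE plots above, the estimated density integrates to one over its support.
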
In [24]:
from sklearn.neighbors import KernelDensity
fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)
xs = np.linspace(0, 1, 100)
X_neg = np.asarray(roc.negative_class_predictions)[:, np.newaxis]
density_neg = KernelDensity(bandwidth=0.05, kernel='epanechnikov').fit(X_neg)
plt.plot(xs, np.exp(density_neg.score_samples(xs[:, np.newaxis])),
color=dr_purple)
plt.fill_between(xs, 0, np.exp(density_neg.score_samples(xs[:, np.newaxis])),
color=dr_purple, alpha=0.3)
X_pos = np.asarray(roc.positive_class_predictions)[:, np.newaxis]
density_pos = KernelDensity(bandwidth=0.05, kernel='epanechnikov').fit(X_pos)
plt.plot(xs, np.exp(density_pos.score_samples(xs[:, np.newaxis])),
color=dr_dense_green)
plt.fill_between(xs, 0, np.exp(density_pos.score_samples(xs[:, np.newaxis])),
color=dr_dense_green, alpha=0.3)
plt.title('Prediction Distribution')
plt.xlabel('Probability of Event')
plt.xlim([0, 1])
plt.ylabel('Probability Density')
Out[24]:
Text(0,0.5,'Probability Density')
[Figure: prediction distribution, scikit-learn]
Word Cloud¶
A word cloud is a type of insight available for some text-processing models on datasets containing text columns. It shows how the appearance of each ngram (a word or sequence of words) in the text field affects the predicted target value.
This example will show you how to obtain word cloud data and visualize it in a way similar to the DataRobot web application.
The visualization example here uses the colour and wordcloud packages, so if you don’t have them, you will need to install them.
First, we will create a color palette similar to the one we use in DataRobot.
In [25]:
from colour import Color
import wordcloud
In [26]:
colors = [Color('#2458EB')]
colors.extend(list(Color('#2458EB').range_to(Color('#31E7FE'), 81))[1:])
colors.extend(list(Color('#31E7FE').range_to(Color('#8da0a2'), 21))[1:])
colors.extend(list(Color('#a18f8c').range_to(Color('#ffad9e'), 21))[1:])
colors.extend(list(Color('#ffad9e').range_to(Color('#d80909'), 81))[1:])
webcolors = [c.get_web() for c in colors]
The variable webcolors now contains 201 colors (covering the [-1, 1] interval with step 0.01) that will be used in the word cloud. Let’s look at our palette.
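The mapping from a word’s coefficient in [-1, 1] to one of the 201 palette entries is a simple shift-and-scale; the same arithmetic appears later in word_cloud_plot:

```python
def palette_index(coefficient):
    # Map a coefficient in [-1, 1] to an index in range(201)
    return int(round(coefficient * 100)) + 100

print(palette_index(-1.0), palette_index(0.0), palette_index(1.0))  # 0 100 200
```
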
In [27]:
from matplotlib.colors import LinearSegmentedColormap
dr_cmap = LinearSegmentedColormap.from_list('DataRobot',
webcolors,
N=len(colors))
x = np.arange(-1, 1.01, 0.01)
y = np.arange(0, 40, 1)
X = np.meshgrid(x, y)[0]
plt.xticks([0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200],
['-1', '-0.8', '-0.6', '-0.4', '-0.2', '0',
'0.2', '0.4', '0.6', '0.8', '1'])
plt.yticks([], [])
im = plt.imshow(X, interpolation='nearest', origin='lower', cmap=dr_cmap)
[Figure: color palette]
Now we will pick a model that provides a word cloud in DataRobot. Any “Auto-Tuned Word N-Gram Text Modeler” should work.
In [28]:
models = project.get_models()
In [29]:
model_with_word_cloud = None
for model in models:
    try:
        model.get_word_cloud()
        model_with_word_cloud = model
        break
    except dr.errors.ClientError as e:
        # Skip models without word cloud data; re-raise anything else
        if 'No word cloud data found for model' not in str(e):
            raise
model_with_word_cloud
Out[29]:
Model(u'Auto-Tuned Word N-Gram Text Modeler using token occurrences - diag_1_desc')
In [30]:
wc = model_with_word_cloud.get_word_cloud(exclude_stop_words=True)
In [31]:
def word_cloud_plot(wc, font_path=None):
    # Stopwords usually dominate any word cloud, so we will filter them out
    dict_freq = {wc_word['ngram']: wc_word['frequency']
                 for wc_word in wc.ngrams
                 if not wc_word['is_stopword']}
    dict_coef = {wc_word['ngram']: wc_word['coefficient']
                 for wc_word in wc.ngrams}

    def color_func(*args, **kwargs):
        word = args[0]
        palette_index = int(round(dict_coef[word] * 100)) + 100
        r, g, b = colors[palette_index].get_rgb()
        return 'rgb({:.0f}, {:.0f}, {:.0f})'.format(int(r * 255),
                                                    int(g * 255),
                                                    int(b * 255))

    wc_image = wordcloud.WordCloud(stopwords=set(),
                                   width=1024, height=1024,
                                   relative_scaling=0.5,
                                   prefer_horizontal=1,
                                   color_func=color_func,
                                   background_color=(0, 10, 29),
                                   font_path=font_path).fit_words(dict_freq)
    plt.imshow(wc_image, interpolation='bilinear')
    plt.axis('off')
In [32]:
word_cloud_plot(wc)
[Figure: word cloud]
You can use the word cloud to get information about the most frequent and the most important (highest absolute coefficient value) ngrams in your text.
In [33]:
wc.most_frequent(5)
Out[33]:
[{'coefficient': 0.622977418480506,
'count': 534,
'frequency': 0.21876280213027446,
'is_stopword': False,
'ngram': u'failure'},
{'coefficient': 0.5680375262833832,
'count': 524,
'frequency': 0.21466612044244163,
'is_stopword': False,
'ngram': u'atherosclerosis'},
{'coefficient': 0.5163937133054939,
'count': 520,
'frequency': 0.21302744776730848,
'is_stopword': False,
'ngram': u'atherosclerosis of'},
{'coefficient': 0.3793240551174481,
'count': 505,
'frequency': 0.2068824252355592,
'is_stopword': False,
'ngram': u'infarction'},
{'coefficient': 0.46897343056956153,
'count': 453,
'frequency': 0.18557968045882836,
'is_stopword': False,
'ngram': u'heart'}]
In [34]:
wc.most_important(5)
Out[34]:
[{'coefficient': -0.8759179138969192,
'count': 38,
'frequency': 0.015567390413764851,
'is_stopword': False,
'ngram': u'obesity unspecified'},
{'coefficient': -0.8655105382141891,
'count': 38,
'frequency': 0.015567390413764851,
'is_stopword': False,
'ngram': u'obesity'},
{'coefficient': 0.8329465952065772,
'count': 9,
'frequency': 0.0036870135190495697,
'is_stopword': False,
'ngram': u'nephroptosis'},
{'coefficient': -0.8198621557218905,
'count': 45,
'frequency': 0.01843506759524785,
'is_stopword': False,
'ngram': u'of kidney'},
{'coefficient': 0.7444542252245915,
'count': 452,
'frequency': 0.18517001229004507,
'is_stopword': False,
'ngram': u'heart failure'}]
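Under the hood these are simple rankings: most_frequent sorts by frequency, and most_important sorts by absolute coefficient. A sketch of the same selection over a few made-up ngram entries:

```python
# Hypothetical ngram entries in the shape returned by get_word_cloud()
ngrams = [
    {'ngram': 'obesity', 'coefficient': -0.87, 'frequency': 0.016},
    {'ngram': 'nephroptosis', 'coefficient': 0.83, 'frequency': 0.004},
    {'ngram': 'heart', 'coefficient': 0.47, 'frequency': 0.186},
]
most_important = sorted(ngrams, key=lambda w: abs(w['coefficient']),
                        reverse=True)[:2]
most_frequent = sorted(ngrams, key=lambda w: w['frequency'],
                       reverse=True)[:2]
print([w['ngram'] for w in most_important])  # ['obesity', 'nephroptosis']
print([w['ngram'] for w in most_frequent])   # ['heart', 'obesity']
```

Note that a negative coefficient (like 'obesity' here) can still rank first in importance because the magnitude is what matters.
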
Non-ASCII Texts¶
The word cloud has full Unicode support, but if you want to visualize it using the recipe from this notebook, you should pass a font_path parameter pointing to a font that supports the symbols used in your text. For example, for the Japanese text in the model below you should use one of the CJK fonts.
In [35]:
jp_project = dr.Project.create('jp_10k.csv', project_name='Japanese 10K')
print('Project ID: {}'.format(jp_project.id))
Project ID: 598dec4bc8089177139da4ad
In [36]:
jp_project.set_target('readmitted_再入院', mode=dr.AUTOPILOT_MODE.QUICK)
jp_project.wait_for_autopilot()
In progress: 10, queued: 3 (waited: 0s)
In progress: 10, queued: 3 (waited: 0s)
In progress: 10, queued: 3 (waited: 1s)
In progress: 10, queued: 3 (waited: 2s)
In progress: 10, queued: 3 (waited: 3s)
In progress: 10, queued: 3 (waited: 5s)
In progress: 10, queued: 3 (waited: 8s)
In progress: 10, queued: 1 (waited: 15s)
In progress: 6, queued: 0 (waited: 28s)
In progress: 1, queued: 0 (waited: 49s)
In progress: 0, queued: 0 (waited: 69s)
In progress: 8, queued: 0 (waited: 90s)
In progress: 5, queued: 0 (waited: 110s)
In progress: 1, queued: 0 (waited: 130s)
In progress: 0, queued: 14 (waited: 151s)
In progress: 10, queued: 6 (waited: 171s)
In progress: 10, queued: 2 (waited: 191s)
In progress: 8, queued: 0 (waited: 212s)
In progress: 2, queued: 0 (waited: 232s)
In progress: 2, queued: 0 (waited: 253s)
In progress: 1, queued: 0 (waited: 273s)
In progress: 1, queued: 0 (waited: 293s)
In progress: 0, queued: 0 (waited: 314s)
In [37]:
jp_models = jp_project.get_models()
jp_model_with_word_cloud = None
for model in jp_models:
    try:
        model.get_word_cloud()
        jp_model_with_word_cloud = model
        break
    except dr.errors.ClientError as e:
        # Skip models without word cloud data; re-raise anything else
        if 'No word cloud data found for model' not in str(e):
            raise
jp_model_with_word_cloud
Out[37]:
Model(u'Auto-Tuned Word N-Gram Text Modeler using token occurrences and tfidf - diag_1_desc_\u8a3a\u65ad1\u8aac\u660e')
In [38]:
jp_wc = jp_model_with_word_cloud.get_word_cloud(exclude_stop_words=True)
In [39]:
word_cloud_plot(jp_wc, font_path='CJK.ttf')
[Figure: Japanese word cloud]
Changelog¶
2.12.2¶
Bugfixes¶
- The Database Connectivity interface now works when used with more recent versions of DataRobot, e.g., the cloud environment.
Deprecation Summary¶
- The Model Deployment interface has been deprecated and will be removed in 2.13, in order to allow the interface to mature. The raw API will continue to be available as a “beta” API without full backwards compatibility support.
2.12.0¶
New Features¶
- Some models now have Missing Value reports allowing users with access to uncensored blueprints to retrieve a detailed breakdown of how numeric imputation and categorical converter tasks handled missing values. See the documentation for more information on the report.
- Time series projects now support multiseries as well as single series data. See the multiseries section in the Time Series Projects documentation for more detail.
2.11.0¶
New Features¶
- The new ModelRecommendation class can be used to retrieve the recommended models for a project.
- A new helper method, cross_validate, was added to the Model class. This method can be used to request a model’s cross-validation score.
- Training a model with monotonic constraints is now supported. Training with monotonic constraints allows users to force models to learn monotonic relationships with respect to some features and the target. This helps users create accurate models that comply with regulations (e.g. insurance, banking). Currently, only certain blueprints (e.g. xgboost) support this feature, and it is only supported for regression and binary classification projects.
- DataRobot now supports “Database Connectivity”, allowing databases to be used as the source of data for projects and prediction datasets. The feature works on top of the JDBC standard, so a variety of databases conforming to that standard are available; a list of databases with tested support for DataRobot is available in the user guide in the web application. See Database Connectivity for details.
API Changes¶
- New attributes supporting monotonic constraints have been added to the AdvancedOptions, Project, Model, and Blueprint classes. See monotonic constraints for more information on how to configure them.
- New parameters predictions_start_date and predictions_end_date have been added to datarobot.models.Project.upload_dataset() to support bulk predictions upload for time series projects.
Deprecation Summary¶
- The datarobot.models.Project.create_from_mysql(), datarobot.models.Project.create_from_oracle(), and datarobot.models.Project.create_from_postgresql() methods have been deprecated and will be removed in 2.14. Use datarobot.models.Project.create_from_data_source() instead.
- The datarobot.FeatureSettings.a_priori attribute has been deprecated and will be removed in 2.14. Use datarobot.FeatureSettings.known_in_advance instead.
- The datarobot.DatetimePartitioning.default_to_a_priori attribute has been deprecated and will be removed in 2.14. Use datarobot.DatetimePartitioning.known_in_advance instead.
- The datarobot.DatetimePartitioningSpecification.default_to_a_priori attribute has been deprecated and will be removed in 2.14. Use datarobot.DatetimePartitioningSpecification.known_in_advance instead.
Configuration Changes¶
- Retry settings compatible with those offered by urllib3’s Retry interface can now be configured. By default, we will now retry connection errors that prevented requests from arriving at the server.
Documentation Changes¶
- “Advanced Model Insights” example has been updated to properly handle bin weights when rebinning.
2.9.0¶
New Features¶
- The new ModelDeployment class can be used to track the status and health of models deployed for predictions.
Enhancements¶
- The DataRobot API now supports creating three new blender types: Random Forest, TensorFlow, and LightGBM.
- Multiclass projects now support blender creation for the three new blender types, as well as Average and ENET blenders.
- Models can be trained by requesting a particular row count using the new training_row_count argument with Project.train, Model.train, and Model.request_frozen_model in non-datetime-partitioned projects, as an alternative to the previous option of specifying a desired percentage of the project dataset. Specifying model size by row count is recommended when the float precision of sample_pct could be problematic, e.g. when training on a small percentage of the dataset or when training up to partition boundaries.
- New attributes max_train_rows, scaleout_max_train_pct, and scaleout_max_train_rows have been added to datarobot.Project. max_train_rows specifies the equivalent value to the existing max_train_pct as a row count. The scaleout fields can be used to see how far scaleout models can be trained on projects, which for projects taking advantage of scalable ingest may exceed the limits on the data available to non-scaleout blueprints.
- Individual features can now be marked as a priori or not a priori using the new feature_settings attribute when setting the target or specifying datetime partitioning settings on time series projects. Any features not specified in the feature_settings parameter will be assigned according to the default_to_a_priori value.
- Three new options have been made available in the datarobot.DatetimePartitioningSpecification class to fine-tune how time series projects derive modeling features. treat_as_exponential can control whether data is analyzed as an exponential trend and transformations like log-transform are applied. differencing_method can control which differencing method to use for stationary data. periodicities can be used to specify periodicities occurring within the data. All are optional, and defaults will be chosen automatically if they are unspecified.
API Changes¶
- training_row_count is now available on non-datetime models as well as “rowCount”-based datetime models. It reports the number of rows used to train the model (equivalent to sample_pct).
- Features retrieved from Feature.get now include target_leakage.
2.8.1¶
Bugfixes¶
- The documented default connect_timeout will now be correctly set for all configuration mechanisms, so that requests that fail to reach the DataRobot server in a reasonable amount of time will now error instead of hanging indefinitely. If you observe that you have started seeing ConnectTimeout errors, please configure your connect_timeout to a larger value.
- The version of the trafaret library this package depends on is now pinned to trafaret>=0.7,<1.1, since versions outside that range are known to be incompatible.
2.8.0¶
New Features¶
- The DataRobot API supports the creation, training, and predicting of multiclass classification projects. By default, DataRobot handles a dataset with a numeric target column as regression. If your numeric target has fewer than 11 unique values, you can override this behavior to instead create a multiclass classification project from the data. To do so, use the set_target function with target_type='Multiclass'. If DataRobot recognizes your data as categorical and it has fewer than 11 classes, using multiclass will create a project that classifies which label the data belongs to.
- The DataRobot API now includes Rating Tables. A rating table is an exportable csv representation of a model. Users can influence predictions by modifying them and creating a new model with the modified table. See the documentation for more information on how to use rating tables.
- scaleout_modeling_mode has been added to the AdvancedOptions class used when setting a project target. It can be used to control whether scaleout models appear in the autopilot and/or available blueprints. Scaleout models are only supported in the Hadoop environment with the corresponding user permission set.
- A new premium add-on product, Time Series, is now available. New projects can be created as time series projects which automatically derive features from past data and forecast the future. See the time series documentation for more information.
- The Feature object now returns the EDA summary statistics (i.e., mean, median, minimum, maximum, and standard deviation) for features where this is available (e.g., numeric, date, time, currency, and length features). These summary statistics are formatted in the same format as the data they summarize.
- The DataRobot API now supports the Training Predictions workflow. Training predictions are made by a model for a subset of data from the original dataset. Users can start a job which will make those predictions and then retrieve them. See the documentation for more information on how to use training predictions.
- DataRobot now supports retrieving model blueprint charts and model blueprint documentation.
- With the introduction of multiclass classification projects, DataRobot needed a better way to explain the performance of a multiclass model, so we created a new Confusion Chart. The API now supports retrieving and interacting with confusion charts.
Enhancements¶
- DatetimePartitioningSpecification now includes the optional disable_holdout flag that can be used to disable the holdout fold when creating a project with datetime partitioning.
- When retrieving reason codes on a project using an exposure column, predictions that are adjusted for exposure can be retrieved.
- File URIs can now be used as sourcedata when creating a project or uploading a prediction dataset. The file URI must refer to an allowed location on the server, which is configured as described in the user guide documentation.
- The advanced options available when setting the target have been extended to include the new parameter ‘events_count’ as a part of the AdvancedOptions object to allow specifying the events count column. See the user guide documentation in the webapp for more information on events count.
- PredictJob.get_predictions now returns predicted probability for each class in the dataframe.
- PredictJob.get_predictions now accepts prefix parameter to prefix the classes name returned in the predictions dataframe.
API Changes¶
- Add target_type parameter to set_target() and start(), used to override the project default.
2.7.1¶
Documentation Changes¶
- Online documentation hosting has migrated from PythonHosted to Read The Docs. Minor code changes have been made to support this.
2.7.0¶
New Features¶
- Lift chart data for models can be retrieved using the Model.get_lift_chart and Model.get_all_lift_charts methods.
- ROC curve data for models in classification projects can be retrieved using the Model.get_roc_curve and Model.get_all_roc_curves methods.
- Semi-automatic autopilot mode is removed.
- Word cloud data for text processing models can be retrieved using Model.get_word_cloud method.
- A scoring code JAR file can be downloaded for models that support code generation.
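The model insight retrieval methods introduced above can be sketched as follows (assumes the datarobot package is installed, a client is configured, and the hypothetical IDs refer to a model in a binary classification project with text features):

```python
def download_model_insights(project_id, model_id):
    """Fetch lift chart, ROC curve, and word cloud data for a model (sketch).

    The IDs are hypothetical; ROC curves require a classification project and
    word clouds require a text-processing model.
    """
    import datarobot as dr  # assumes the package is installed and configured

    model = dr.Model.get(project_id, model_id)
    lift = model.get_lift_chart('validation')  # lift chart for one partition
    roc = model.get_roc_curve('validation')    # classification projects only
    word_cloud = model.get_word_cloud()        # text-processing models only
    return lift, roc, word_cloud
```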
Enhancements¶
- A __repr__ method has been added to the PredictionDataset class to improve readability when using the client interactively.
- Model.get_parameters now includes an additional key in the derived features it includes, showing the coefficients for individual stages of multistage models (e.g. Frequency-Severity models).
- When training a DatetimeModel on a window of data, a time_window_sample_pct can be specified to take a uniform random sample of the training data instead of using all data within the window.
- Installation of the DataRobot package now includes an “Extra Requirements” option that installs all of the dependencies needed to run the example notebooks.
Documentation Changes¶
- A new example notebook describing how to visualize some of the newly available model insights including lift charts, ROC curves, and word clouds has been added to the examples section.
- A new section for Common Issues has been added to Getting Started to help debug issues related to client installation and usage.
2.6.1¶
Bugfixes¶
- Fixed a bug with Model.get_parameters raising an exception on some valid parameter values.
Documentation Changes¶
- Fixed sorting order in Feature Impact example code snippet.
2.6.0¶
New Features¶
- A new partitioning method (datetime partitioning) has been added. The recommended workflow is to preview the partitioning by creating a DatetimePartitioningSpecification and passing it into DatetimePartitioning.generate; inspect the results, adjusting the DatetimePartitioningSpecification and re-generating as needed for the specific project dataset; and then set the target by passing the final DatetimePartitioningSpecification object to the partitioning_method parameter of Project.set_target.
- When interacting with datetime partitioned projects, DatetimeModel can be used to access more information specific to models in datetime partitioned projects. See the documentation for more information on differences in the modeling workflow for datetime partitioned projects.
- The advanced options available when setting the target have been extended to include the new parameters ‘offset’ and ‘exposure’ (part of the AdvancedOptions object) to allow specifying offset and exposure columns to apply to predictions generated by models within the project. See the user guide documentation in the webapp for more information on offset and exposure columns.
- Blueprints can now be retrieved directly by project_id and blueprint_id via Blueprint.get.
- Blueprint charts can now be retrieved directly by project_id and blueprint_id via BlueprintChart.get. If you already have an instance of Blueprint you can retrieve its chart using Blueprint.get_chart.
- Model parameters can now be retrieved using ModelParameters.get. If you already have an instance of Model you can retrieve its parameters using Model.get_parameters.
- Blueprint documentation can now be retrieved using Blueprint.get_documents. It will contain information about the task, its parameters and (when available) links and references to additional sources.
- The DataRobot API now includes Reason Codes. You can now compute reason codes for prediction datasets. You are able to specify thresholds on which rows to compute reason codes for to speed up computation by skipping rows based on the predictions they generate. See the reason codes documentation for more information.
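The recommended datetime-partitioning workflow above can be sketched as follows (assumes the datarobot package is installed and a client is configured; the column names are illustrative):

```python
def preview_and_set_datetime_partitioning(project, date_column, target):
    """Sketch of the preview-then-set datetime partitioning workflow.

    project is an existing datarobot.Project whose target has not been set;
    date_column and target are hypothetical column names.
    """
    import datarobot as dr  # assumes the package is installed and configured

    # 1. Describe the desired partitioning.
    spec = dr.DatetimePartitioningSpecification(date_column)

    # 2. Preview how the data would be split; inspect and adjust as needed.
    preview = dr.DatetimePartitioning.generate(project.id, spec)

    # 3. When satisfied, set the target using the final specification.
    project.set_target(target, partitioning_method=spec)
    return preview
```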
Enhancements¶
- A new parameter has been added to the AdvancedOptions used with Project.set_target. By specifying accuracyOptimizedMb=True when creating AdvancedOptions, longer-running models that may have a high accuracy will be included in the autopilot and made available to run manually.
- A new option for Project.create_type_transform_feature has been added which explicitly truncates data when casting numerical data as categorical data.
- Added two new blenders for projects that use MAD or Weighted MAD as a metric. The MAE blender uses BFGS optimization to find linear weights for the blender that minimize mean absolute error (compared to the GLM blender, which finds linear weights that minimize RMSE), and the MAEL1 blender uses BFGS optimization to find linear weights that minimize MAE plus an L1 penalty on the coefficients (compared to the ENET blender, which minimizes RMSE plus a combination of the L1 and L2 penalties on the coefficients).
Bugfixes¶
- Fixed a bug (affecting Python 2 only) with printing any model (including frozen and prime models) whose model_type is not ascii.
- FrozenModels were unable to correctly use methods inherited from Model. This has been fixed.
- When calling get_result for a Job, ModelJob, or PredictJob that has errored, AsyncProcessUnsuccessfulError will now be raised instead of JobNotFinished, consistent with the behaviour of get_result_when_complete.
Deprecation Summary¶
- Support for the experimental Recommender Problems projects has been removed. Any code relying on RecommenderSettings or the recommender_settings argument of Project.set_target and Project.start will error.
- Project.update, deprecated in v2.2.32, has been removed in favor of the specific update methods: rename, unlock_holdout, and set_worker_count.
Documentation Changes¶
- The link to Configuration from the Quickstart page has been fixed.
2.5.1¶
Bugfixes¶
- Fixed a bug (affecting Python 2 only) with printing blueprints whose names are not ascii.
- Fixed an issue where the weights column (for weighted projects) did not appear in the advanced_options of a Project.
2.5.0¶
New Features¶
- Methods to work with blender models have been added. Use Project.blend method to create new blenders, Project.get_blenders to get the list of existing blenders and BlenderModel.get to retrieve a model with blender-specific information.
- Projects created via the API can now use smart downsampling when setting the target by passing smart_downsampled and majority_downsampling_rate into the AdvancedOptions object used with Project.set_target. The smart sampling options used with an existing project will be available as part of Project.advanced_options.
- Support for frozen models, which use tuning parameters from a parent model for more efficient training, has been added. Use Model.request_frozen_model to create a new frozen model, Project.get_frozen_models to get the list of existing frozen models and FrozenModel.get to retrieve a particular frozen model.
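As a sketch of the new blender workflow (assumes the datarobot package is installed and a client is configured; the choice of the top two leaderboard models and the AVERAGE blend method are illustrative):

```python
def blend_top_models(project_id):
    """Create a blender from the two best models on the leaderboard (sketch).

    project_id is a hypothetical identifier; BLENDER_METHOD.AVERAGE is one of
    the blend method constants assumed to live in datarobot.enums.
    """
    import datarobot as dr  # assumes the package is installed and configured

    project = dr.Project.get(project_id)
    models = project.get_models()            # returned in leaderboard order
    model_ids = [m.id for m in models[:2]]   # blend the top two models
    blend_job = project.blend(model_ids, dr.enums.BLENDER_METHOD.AVERAGE)
    return blend_job.get_result_when_complete()
```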
Enhancements¶
- The inferred date format (e.g. “%Y-%m-%d %H:%M:%S”) is now included in the Feature object. For non-date features, it will be None.
- When specifying the API endpoint in the configuration, the client will now behave correctly for endpoints with and without trailing slashes.
2.4.0¶
New Features¶
- The premium add-on product DataRobot Prime has been added. You can now approximate a model on the leaderboard and download executable code for it. See documentation for further details, or talk to your account representative if the feature is not available on your account.
- (Only relevant for on-premise users with a Standalone Scoring cluster.) Methods (request_transferable_export and download_export) have been added to the Model class for exporting models (which will only work if model export is turned on). There is a new class ImportedModel for managing imported models on a Standalone Scoring cluster.
- It is now possible to create projects from a WebHDFS, PostgreSQL, Oracle or MySQL data source. For more information see the documentation for the relevant Project classmethods: create_from_hdfs, create_from_postgresql, create_from_oracle and create_from_mysql.
- Job.wait_for_completion, which waits for a job to complete without returning anything, has been added.
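The new Job.wait_for_completion can be used to block on any queued job, as in this sketch (assumes the datarobot package is installed and a client is configured; the IDs are hypothetical):

```python
def wait_for_job(project_id, job_id):
    """Block until a queued job completes (sketch).

    project_id and job_id are hypothetical identifiers. Note that
    wait_for_completion returns nothing; fetch the result separately
    with get_result_when_complete if you need it.
    """
    import datarobot as dr  # assumes the package is installed and configured

    job = dr.Job.get(project_id, job_id)
    job.wait_for_completion()  # waits without returning the job's result
```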
Enhancements¶
- The client will now check the API version offered by the server specified in configuration, and give a warning if the client version is newer than the server version. The DataRobot server is always backwards compatible with old clients, but new clients may have functionality that is not implemented on older server versions. This issue mainly affects users with on-premise deployments of DataRobot.
Bugfixes¶
- Fixed an issue where Model.request_predictions might raise an error when predictions finished very quickly instead of returning the job.
API Changes¶
- To set the target with quickrun autopilot, call Project.set_target with mode=AUTOPILOT_MODE.QUICK instead of specifying quickrun=True.
Deprecation Summary¶
- Semi-automatic mode for autopilot has been deprecated and will be removed in 3.0. Use manual or fully automatic instead.
- Use of the quickrun argument in Project.set_target has been deprecated and will be removed in 3.0. Use mode=AUTOPILOT_MODE.QUICK instead.
Configuration Changes¶
- It is now possible to control the SSL certificate verification by setting the parameter ssl_verify in the config file.
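A minimal drconfig.yaml using this option might look like the following (the endpoint URL and token value are placeholders):

```yaml
# ~/.config/datarobot/drconfig.yaml
endpoint: https://app.datarobot.com/api/v2   # placeholder endpoint
token: your_api_token_here                   # placeholder API token
ssl_verify: false                            # disable SSL certificate verification
```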
Documentation Changes¶
- The “Modeling Airline Delay” example notebook has been updated to work with the new 2.3 enhancements.
- Documentation for the generic Job class has been added.
- Class attributes are now documented in the API Reference section of the documentation.
- The changelog now appears in the documentation.
- There is a new section dedicated to configuration, which lists all of the configuration options and their meanings.
2.3.0¶
New Features¶
- The DataRobot API now includes Feature Impact, an approach to measuring the relevance of each feature that can be applied to any model. The Model class now includes methods request_feature_impact (which creates and returns a feature impact job) and get_feature_impact (which can retrieve completed feature impact results).
- A new improved workflow for predictions now supports first uploading a dataset via Project.upload_dataset, then requesting predictions via Model.request_predictions. This allows us to better support predictions on larger datasets and non-ascii files.
- Datasets previously uploaded for predictions (represented by the PredictionDataset class) can be listed from Project.get_datasets and retrieved and deleted via PredictionDataset.get and PredictionDataset.delete.
- You can now create a new feature by re-interpreting the type of an existing feature in a project by using the Project.create_type_transform_feature method.
- The Job class now includes a get method for retrieving a job and a cancel method for canceling a job.
- All of the jobs classes (Job, ModelJob, PredictJob) now include the following new methods: refresh (for refreshing the data in the job object), get_result (for getting the completed resource resulting from the job), and get_result_when_complete (which waits until the job is complete and returns the results, or times out).
- A new method Project.refresh can be used to update Project objects with the latest state from the server.
- A new function datarobot.async.wait_for_async_resolution can be used to poll for the resolution of any generic asynchronous operation on the server.
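The improved upload-then-predict workflow described above can be sketched as follows (assumes project and model are existing datarobot objects and dataset_path points to a local CSV; all names are illustrative):

```python
def predict_on_new_data(project, model, dataset_path):
    """Sketch of the new predictions workflow: upload once, then predict.

    project and model are assumed to be existing datarobot.Project and
    datarobot.Model instances; dataset_path is a hypothetical local file.
    """
    # 1. Upload the scoring data once; it can be reused across models.
    dataset = project.upload_dataset(dataset_path)

    # 2. Request predictions from a specific model against that dataset.
    predict_job = model.request_predictions(dataset.id)

    # 3. Wait for the job and fetch the resulting predictions.
    return predict_job.get_result_when_complete()
```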
Enhancements¶
- The JOB_TYPE enum now includes FEATURE_IMPACT.
- The QUEUE_STATUS enum now includes ABORTED and COMPLETED.
- The Project.create method now has a read_timeout parameter which can be used to keep open the connection to DataRobot while an uploaded file is being processed. For very large files this time can be substantial. Appropriately raising this value can help avoid timeouts when uploading large files.
- The method Project.wait_for_autopilot has been enhanced to error if the project enters a state where autopilot may not finish. This avoids a situation that existed previously where users could wait indefinitely on a project that was never going to finish. However, users are still responsible for making sure that a project has more than zero workers and that the queue is not paused.
- Feature.get now supports retrieving features by feature name. (For backwards compatibility, feature IDs are still supported until 3.0.)
- File paths that have unicode directory names can now be used for creating projects and PredictJobs. The filename itself must still be ascii, but containing directory names can have other encodings.
- The client now raises the more specific JobAlreadyRequested exception when a model fitting request is refused as a duplicate. Users can explicitly catch this exception if they want it to be ignored.
- A file_name attribute has been added to the Project class, identifying the file name associated with the original project dataset. Note that if the project was created from a data frame, the file name may not be helpful.
- The connect timeout for establishing a connection to the server can now be set directly. This can be done in the yaml configuration of the client, or directly in the code. The default timeout has been lowered from 60 seconds to 6 seconds, which will make detecting a bad connection happen much quicker.
Bugfixes¶
- Fixed a bug (affecting Python 2 only) with printing features and featurelists whose names are not ascii.
API Changes¶
- Job class hierarchy is rearranged to better express the relationship between these objects. See documentation for datarobot.models.job for details.
- Featurelist objects now have a project_id attribute to indicate which project they belong to. Directly accessing the project attribute of a Featurelist object is now deprecated.
- Support for INI-style configuration, which was deprecated in v2.1, has been removed. YAML is the only supported configuration format.
- The Project.get_jobs method, which was deprecated in v2.1, has been removed. Users should use the Project.get_model_jobs method instead to get the list of model jobs.
Deprecation Summary¶
- PredictJob.create has been deprecated in favor of the alternate workflow using Model.request_predictions.
- Feature.converter (used internally for object construction) has been made private.
- Model.fetch_resource_data has been deprecated and will be removed in 3.0. To fetch a model from its ID, use Model.get.
- The ability to use Feature.get with feature IDs (rather than names) is deprecated and will be removed in 3.0.
- Instantiating a Project, Model, Blueprint, Featurelist, or Feature instance from a dict of data is now deprecated. Please use the from_data classmethod of these classes instead. Additionally, instantiating a Model from a tuple or by using the keyword argument data is also deprecated.
- Use of the attribute Featurelist.project is now deprecated. You can use the project_id attribute of a Featurelist to instantiate a Project instance using Project.get.
- Use of the attributes Model.project, Model.blueprint, and Model.featurelist are all deprecated now to avoid use of partially instantiated objects. Please use the ids of these objects instead.
- Using a Project instance as an argument in Featurelist.get is now deprecated. Please use a project_id instead. Similarly, using a Project instance in Model.get is also deprecated, and a project_id should be used in its place.
Configuration Changes¶
- Previously it was possible (though unintended) that the client configuration could be mixed through environment variables, configuration files, and arguments to datarobot.Client. This logic is now simpler - please see the Getting Started section of the documentation for more information.
2.2.33¶
Bugfixes¶
- Fixed a bug with non-ascii project names using the package with Python 2.
- Fixed an error that occurred when printing projects that had been constructed from an ID only or printing models that had been constructed from a tuple (which impacted printing PredictJobs).
- Fixed a bug with project creation from non-ascii file names. Project creation from non-ascii file names is not supported, so this now raises a more informative exception. The project name is no longer used as the file name in cases where we do not have a file name, which prevents non-ascii project names from causing problems in those circumstances.
- Fixed a bug (affecting Python 2 only) with printing projects, features, and featurelists whose names are not ascii.
2.2.32¶
New Features¶
- Project.get_features and Feature.get methods have been added for feature retrieval.
- A generic Job entity has been added for use in retrieving the entire queue at once. Calling Project.get_all_jobs will retrieve all (appropriately filtered) jobs from the queue. Those can be cancelled directly as generic jobs, or transformed into instances of the specific job class using ModelJob.from_job and PredictJob.from_job, which allow all functionality previously available via the ModelJob and PredictJob interfaces.
- Model.train now supports featurelist_id and scoring_type parameters, similar to Project.train.
Enhancements¶
- Deprecation warning filters have been updated. By default, a filter will be added ensuring that usage of deprecated features will display a warning once per new usage location. In order to hide deprecation warnings, a filter like warnings.filterwarnings(‘ignore’, category=DataRobotDeprecationWarning) can be added to a script so no such warnings are shown. Watching for deprecation warnings to avoid reliance on deprecated features is recommended.
- If your client is misconfigured and does not specify an endpoint, the cloud production server is no longer used as the default as in many cases this is not the correct default.
- This changelog is now included in the distributable of the client.
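The deprecation-warning filter described above can be demonstrated with the standard warnings module. This self-contained sketch uses a locally defined stand-in class for the client's DataRobotDeprecationWarning (the class name and its import location are assumptions):

```python
import warnings

# Stand-in for the client's DataRobotDeprecationWarning (assumed name);
# defined locally so this sketch runs without the datarobot package.
class DataRobotDeprecationWarning(DeprecationWarning):
    pass

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning by default
    # The filter recommended in the changelog entry above:
    warnings.filterwarnings("ignore", category=DataRobotDeprecationWarning)
    warnings.warn("deprecated call", DataRobotDeprecationWarning)
    warnings.warn("unrelated warning", UserWarning)

# Only the UserWarning is recorded; the deprecation warning was filtered out.
print([w.category.__name__ for w in caught])  # → ['UserWarning']
```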
Bugfixes¶
- Fixed an issue where updating the global client would not affect existing objects with cached clients. Now the global client is used for every API call.
- An issue where mistyping a filepath for use in a file upload has been resolved. Now an error will be raised if it looks like the raw string content for modeling or predictions is just one single line.
API Changes¶
- Use of username and password to authenticate is no longer supported - use an API token instead.
- Usage of the start_time and finish_time parameters in Project.get_models is no longer supported, for either filtering or ordering of models.
- The default value of the sample_pct parameter of the Model.train method is now None instead of 100. If the default value is used, models will be trained with all of the available training data based on project configuration, rather than with the entire dataset including holdout as under the previous default value of 100.
- The order_by parameter of Project.list, which was deprecated in v2.0, has been removed.
- The recommendation_settings parameter of Project.start, which was deprecated in v0.2, has been removed.
- The Project.status method, which was deprecated in v0.2, has been removed.
- The Project.wait_for_aim_stage method, which was deprecated in v0.2, has been removed.
- The Delay, ConstantDelay, NoDelay, ExponentialBackoffDelay, and RetryManager classes from the retry module, which were deprecated in v2.1, have been removed.
- The package has been renamed to datarobot.
Deprecation Summary¶
- Project.update has been deprecated in favor of the specific update methods: rename, unlock_holdout, and set_worker_count.
Documentation Changes¶
- A new use case involving financial data has been added to the examples directory.
- Added documentation for the partition methods.
2.1.31¶
Bugfixes¶
- In Python 2, using a unicode token to instantiate the client will now work correctly.
2.1.30¶
Bugfixes¶
- The minimum required version of trafaret has been upgraded to 0.7.1 to get around an incompatibility between it and setuptools.
2.1.28¶
New Features¶
- Default to reading the YAML config file from ~/.config/datarobot/drconfig.yaml
- Allow a config_path argument to the client
- A wait_for_autopilot method has been added to Project. This method can be used to block execution until autopilot has finished running on the project.
- Support for specifying which featurelist to use with initial autopilot in Project.set_target
- A Project.get_predict_jobs method has been added, which looks up all prediction jobs for a project
- A Project.start_autopilot method has been added, which starts autopilot on a specified featurelist
- The schema for PredictJob in DataRobot API v2.1 now includes a message. This attribute has been added to the PredictJob class.
- PredictJob.cancel now exists to cancel prediction jobs, mirroring ModelJob.cancel
- Project.from_async is a new classmethod that can be used to wait for an async resolution in project creation. Most users will not need to know about it, as it is used behind the scenes in Project.create and Project.set_target, but power users who may run into periodic connection errors will be able to catch the new ProjectAsyncFailureError and decide if they would like to resume waiting for the async process to resolve
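The new wait_for_autopilot method can be used to block until modeling finishes, as in this sketch (assumes the datarobot package is installed, a client is configured, and the project has workers assigned and an unpaused queue; the ID is hypothetical):

```python
def block_until_autopilot_done(project_id):
    """Wait for autopilot to finish on a project, then list its models (sketch).

    project_id is a hypothetical identifier. Make sure the project has more
    than zero workers and the queue is not paused before calling this.
    """
    import datarobot as dr  # assumes the package is installed and configured

    project = dr.Project.get(project_id)
    project.wait_for_autopilot()  # blocks until autopilot has finished
    return project.get_models()
```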
Enhancements¶
- The AUTOPILOT_MODE enum now uses string names for autopilot modes instead of numbers
Deprecation Summary¶
- The ConstantDelay, NoDelay, ExponentialBackoffDelay, and RetryManager utils are now deprecated
- INI-style config files are now deprecated (in favor of YAML config files)
- Several functions in the utils submodule are now deprecated (they are being moved elsewhere and are not considered part of the public interface)
- Project.get_jobs has been renamed Project.get_model_jobs for clarity and deprecated
- Support for the experimental date partitioning has been removed from the DataRobot API, so it is being removed from the client immediately.
API Changes¶
- In several places where AppPlatformError was being raised, TypeError, ValueError or InputNotUnderstoodError are now used. With this change, one can now safely assume that when catching an AppPlatformError it is because of an unexpected response from the server.
- AppPlatformError has gained two new attributes: status_code, which is the HTTP status code of the unexpected response from the server, and error_code, which is a DataRobot-defined error code. error_code is not used by any routes in DataRobot API 2.1, but will be in the future. In cases where it is not provided, the instance of AppPlatformError will have the attribute error_code set to None.
- Two new subclasses of AppPlatformError have been introduced: ClientError (for 400-level response status codes) and ServerError (for 500-level response status codes). These will make it easier to build automated tooling that can recover from periodic connection issues while polling.
- If a ClientError or ServerError occurs during a call to Project.from_async, then a ProjectAsyncFailureError (a subclass of AsyncFailureError) will be raised. That exception will have the status_code of the unexpected response from the server and the location that was being polled to wait for the asynchronous process to resolve.
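The exception hierarchy described above enables recovery logic such as the following sketch (assumes the datarobot package is installed and that ProjectAsyncFailureError is importable from datarobot.errors; the single-retry policy is illustrative):

```python
def resilient_from_async(async_location):
    """Resume waiting after a transient server failure during polling (sketch).

    async_location is the URL being polled; the import path of
    ProjectAsyncFailureError is an assumption.
    """
    import datarobot as dr  # assumes the package is installed and configured
    from datarobot.errors import ProjectAsyncFailureError

    try:
        return dr.Project.from_async(async_location)
    except ProjectAsyncFailureError as exc:
        # 500-level failures during polling may be transient: retry once.
        if exc.status_code >= 500:
            return dr.Project.from_async(async_location)
        raise
```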
2.0.27¶
New Features¶
- A PredictJob class was added to work with prediction jobs
- A wait_for_async_predictions function was added to the predict_job module
Deprecation Summary¶
- The order_by parameter of Project.list is now deprecated.
0.2.26¶
Enhancements¶
- Project.set_target will re-fetch the project data after it succeeds, keeping the client side in sync with the state of the project on the server
- Project.create_featurelist now throws a DuplicateFeaturesError exception if the passed list of features contains duplicates
- Project.get_models now supports snake_case arguments to its order_by keyword
Deprecation Summary¶
- Project.wait_for_aim_stage is now deprecated, as the REST Async flow is a more reliable method of determining that project creation has completed successfully
- Project.status is deprecated in favor of Project.get_status
- The recommendation_settings parameter of Project.start is deprecated in favor of recommender_settings
Bugfixes¶
- Project.wait_for_aim_stage changed to support Python 3
- Fixed incorrect value of SCORING_TYPE.cross_validation
- Models returned by Project.get_models will now be correctly ordered when the order_by keyword is used
0.2.25¶
- Pinned versions of required libraries
0.2.24¶
Official release of v0.2
0.1.24¶
- Updated documentation
- Renamed the name parameter of Project.create and Project.start to project_name
- Removed Model.predict method
- wait_for_async_model_creation function added to modeljob module
- wait_for_async_status_service of Project class renamed to _wait_for_async_status_service
- Can now use auth_token in config file to configure SDK
0.1.23¶
- Fixes a method that pointed to a removed route
0.1.22¶
- Added featurelist_id attribute to ModelJob class
0.1.21¶
- Removes model attribute from ModelJob class
0.1.20¶
- Project creation raises AsyncProjectCreationError if it was unsuccessful
- Removed Model.list_prime_rulesets and Model.get_prime_ruleset methods
- Removed Model.predict_batch method
- Removed Project.create_prime_model method
- Removed PrimeRuleSet model
- Adds backwards compatibility bridge for ModelJob async
- Adds ModelJob.get and ModelJob.get_model
0.1.19¶
- Minor bugfixes in wait_for_async_status_service
0.1.18¶
- Removes submit_model from Project until serverside implementation is improved
- Switches training URLs for new resource-based route at /projects/<project_id>/models/
- Job renamed to ModelJob, and using modelJobs route
- Fixes an inconsistency in argument order for train methods
0.1.17¶
- wait_for_async_status_service timeout increased from 60s to 600s
0.1.16¶
- Project.create will now handle both async/sync project creation
0.1.15¶
- All routes pluralized to sync with changes in API
- Project.get_jobs will request all jobs when no param specified
- dataframes from predict method will have pythonic names
- Project.get_status created, Project.status now deprecated
- Project.unlock_holdout created.
- Added quickrun parameter to Project.set_target
- Added modelCategory to Model schema
- Added permalinks feature to Project and Model objects.
- Project.create_prime_model created
0.1.14¶
- Project.set_worker_count fix for compatibility with API change in project update.
0.1.13¶
- Add positive class to set_target.
- Change attributes names of Project, Model, Job and Blueprint
- features in Model, Job and Blueprint are now processes
- dataset_id and dataset_name migrated to featurelist_id and featurelist_name.
- samplepct -> sample_pct
- Model now has blueprint, project, and featurelist attributes.
- Minor bugfixes.
0.1.12¶
- Minor fixes regarding renamed Job attributes: the features attribute is now named processes, and samplepct is now sample_pct.
0.1.10¶
(May 20, 2015)
- Removed the Project.upload_file, Project.upload_file_from_url and Project.attach_file methods. All file-upload logic has been moved into the Project.create method.