DataRobot Python Package¶
Getting Started¶
Installation¶
You will need the following:
- Python 2.7 or 3.4+
- DataRobot account
- pip
Installing for Cloud DataRobot¶
If you are using the cloud version of DataRobot, the easiest way to get the latest version of the package is:
pip install datarobot
Note
If you are not running in a Python virtualenv, you probably want to use pip install --user datarobot.
Installing for an On-Site Deploy¶
If you are using an on-site deploy of DataRobot, the latest version of the package is not the most appropriate for you. Contact your CFDS for guidance on the appropriate version range.
pip install "datarobot>=$(MIN_VERSION),<$(EXCLUDE_VERSION)"
For a particular installation of DataRobot, the correct value of $(MIN_VERSION) could be 2.0 with an $(EXCLUDE_VERSION) of 2.3. This ensures that all the features the client expects to be present on the backend will actually be available.
Note
If you are not running in a Python virtualenv, you probably want to use pip install --user "datarobot>=$(MIN_VERSION),<$(EXCLUDE_VERSION)".
Configuration¶
Each authentication method will specify credentials for DataRobot, as well as the location of the DataRobot deployment. We currently support configuration using a configuration file, by setting environment variables, or within the code itself.
Credentials¶
You will have to specify an API token and an endpoint in order to use the client. You can manage your API tokens in the DataRobot webapp, in your profile. This section describes how to use these options. Their order of precedence is as follows, noting that the first available option will be used:
- Setting endpoint and token in code using datarobot.Client
- Configuring from a config file as specified directly using datarobot.Client
- Configuring from a config file as specified by the environment variable DATAROBOT_CONFIG_FILE
- Configuring from the environment variables DATAROBOT_ENDPOINT and DATAROBOT_API_TOKEN
- Searching for a config file in the home directory of the current user, at ~/.config/datarobot/drconfig.yaml
Note
If you access the DataRobot webapp at https://app.datarobot.com, then the correct endpoint to specify would be https://app.datarobot.com/api/v2. If you have a local installation, update the endpoint accordingly to point at the installation of DataRobot available on your local network.
Set Credentials Explicitly in Code¶
Explicitly set credentials in code:
import datarobot as dr
dr.Client(token='your_token', endpoint='https://app.datarobot.com/api/v2')
You can also point to a YAML config file to use:
import datarobot as dr
dr.Client(config_path='/home/user/my_datarobot_config.yaml')
Use a Configuration File¶
You can use a configuration file to specify the client setup.
The following is an example configuration file that should be saved as ~/.config/datarobot/drconfig.yaml:
token: yourtoken
endpoint: https://app.datarobot.com/api/v2
You can specify a different location for the DataRobot configuration file by setting the DATAROBOT_CONFIG_FILE environment variable. Note that if you specify a filepath, you should use an absolute path so that the API client will work when run from any location.
Set Credentials Using Environment Variables¶
Set up an endpoint by setting environment variables in the UNIX shell:
export DATAROBOT_ENDPOINT='https://app.datarobot.com/api/v2'
export DATAROBOT_API_TOKEN=your_token
Common Issues¶
This section has examples of cases that can cause issues with using the DataRobot client, as well as known fixes.
InsecurePlatformWarning¶
On versions of Python earlier than 2.7.9, you might see InsecurePlatformWarning in your output. To prevent this without updating your Python version, install the pyOpenSSL package:
pip install pyopenssl ndg-httpsclient pyasn1
AttributeError: ‘EntryPoint’ object has no attribute ‘resolve’¶
Some earlier versions of setuptools will cause an error on importing DataRobot. The recommended fix is upgrading setuptools. If you are unable to upgrade setuptools, pinning trafaret to version <=7.4 will correct this issue.
>>> import datarobot as dr
...
File "/home/clark/.local/lib/python2.7/site-packages/trafaret/__init__.py", line 1550, in load_contrib
trafaret_class = entrypoint.resolve()
AttributeError: 'EntryPoint' object has no attribute 'resolve'
To prevent this, upgrade your setuptools:
pip install --upgrade setuptools
Connection Errors¶
The Configuration section describes how to configure the DataRobot client with the max_retries parameter to fine-tune behaviors like the number of times it attempts to retry failed connections.
ConnectTimeout¶
If you have a slow connection to your DataRobot installation, you may see a traceback like
ConnectTimeout: HTTPSConnectionPool(host='my-datarobot.com', port=443): Max
retries exceeded with url: /api/v2/projects/
(Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f130fc76150>,
'Connection to my-datarobot.com timed out. (connect timeout=6.05)'))
You can configure a larger connect timeout (the amount of time to wait on each request attempting to connect to the DataRobot server before giving up) using a connect_timeout value in either a configuration file or via a direct call to datarobot.Client.
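For example, a minimal sketch assuming the client is configured in code as in the Configuration section (the 30 second value is arbitrary):
import datarobot as dr
# raise the connect timeout well above the 6.05 second default for slow networks
dr.Client(token='your_token',
          endpoint='https://app.datarobot.com/api/v2',
          connect_timeout=30)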
project.open_leaderboard_browser¶
Calling the project.open_leaderboard_browser method may block if run with a text-mode browser or on a server that cannot open a browser.
Configuration¶
This section describes all of the settings that can be configured in the DataRobot configuration file. By default, this file is looked for inside the user's home directory at ~/.config/datarobot/drconfig.yaml, but the default location can be overridden by specifying the DATAROBOT_CONFIG_FILE environment variable, or within the code by setting the global client with dr.Client(config_path='/path/to/config.yaml').
Configurable Variables¶
These are the variables available for configuration for the DataRobot client:
- endpoint
- This parameter is required. It is the URL of the DataRobot endpoint. For example, the default endpoint on the cloud installation of DataRobot is https://app.datarobot.com/api/v2
- token
- This parameter is required. It is the API token of your DataRobot account. This can be found in the user settings page of DataRobot.
- connect_timeout
- This parameter is optional. It specifies the number of seconds that the client should be willing to wait to establish a connection to the remote server. Users with poor connections may need to increase this value. By default DataRobot uses the value 6.05.
- ssl_verify
- This parameter is optional. It controls the SSL certificate verification of the DataRobot client. DataRobot is built with the Python requests library, and this variable is used as the verify parameter in that library. More information can be found in their documentation. The default value is true, which means that requests will use your computer's set of trusted certificate chains by default.
- max_retries
- This parameter is optional. It controls the number of retries to attempt for each connection. More information can be found in the requests documentation. By default, the client will attempt 10 retries (the default provided by Retry) with an exponential backoff between attempts. It will retry after connection errors, read errors, and 413, 429, and 503 HTTP responses, and will respect the Retry-After header, as in:
Retry(backoff_factor=0.1, respect_retry_after_header=True)
More granular control may be acquired by passing a Retry object from urllib3 into a direct instantiation of dr.Client:
from urllib3.util.retry import Retry
import datarobot as dr
dr.Client(endpoint='https://app.datarobot.com/api/v2',
          token='this-is-a-fake-token',
          max_retries=Retry(connect=3, read=3))
Proxy support¶
The DataRobot API can work behind a non-transparent HTTP proxy server. Set the HTTP_PROXY environment variable to the proxy URL to route all DataRobot traffic through that proxy server, e.g. HTTP_PROXY="http://my-proxy.local:3128" python my_datarobot_script.py.
QuickStart¶
Note
You must set up credentials in order to access the DataRobot API. For more information, see Credentials.
All of the modeling within DataRobot happens within a project. Each project has one dataset that is used as the source from which to train models.
There are three steps required to begin modeling:
- Create an empty project.
- Upload a data file to model.
- Select parameters and start training models with the autopilot.
The following command includes these three steps. It is equivalent to choosing all of the default settings recommended by DataRobot.
import datarobot as dr
project = dr.Project.start(project_name='My new project',
sourcedata='/home/user/data/last_week_data.csv',
target='ItemsPurchased')
Where:
- project_name is the name of the new DataRobot project.
- sourcedata is the path to the dataset.
- target is the name of the target feature column in the dataset.
Projects¶
All of the modeling within DataRobot happens within a project. Each project has one dataset that is used as the source from which to train models.
Create a Project¶
You can use the following command to create a new project. When creating a new project you must specify a path to a data file, a file object, raw file contents, or a pandas.DataFrame object. The path can be either a path to a local file or a publicly accessible URL.
import datarobot as dr
project = dr.Project.create('/home/user/data/last_week_data.csv',
project_name='New Project')
You can use the following commands to view the project ID and name:
project.id
>>> u'5506fcd38bd88f5953219da0'
project.project_name
>>> u'New Project'
Select Modeling Parameters¶
The final information needed to begin modeling includes the target feature, the queue mode, the metric for comparing models, and the optional parameters such as weights, offset, exposure and downsampling.
Target¶
The target must be the name of one of the columns of data uploaded to the project.
Metric¶
The optimization metric used to compare models is an important factor in building accurate models. If a metric is not specified, the default metric recommended by DataRobot will be used. You can use the following code to view a list of valid metrics for a specified target:
target_name = 'ItemsPurchased'
project.get_metrics(target_name)
>>> {'available_metrics': [
'Gini Norm',
'Weighted Gini Norm',
'Weighted R Squared',
'Weighted RMSLE',
'Weighted MAPE',
'Weighted Gamma Deviance',
'Gamma Deviance',
'RMSE',
'Weighted MAD',
'Tweedie Deviance',
'MAD',
'RMSLE',
'Weighted Tweedie Deviance',
'Weighted RMSE',
'MAPE',
'Weighted Poisson Deviance',
'R Squared',
'Poisson Deviance'],
'feature_name': 'ItemsPurchased'}
Partitioning Method¶
DataRobot projects always have a holdout set used for final model validation. We use two different approaches for testing prior to the holdout set:
- split the remaining data into training and validation sets
- cross-validation, in which the remaining data is split into a number of folds; each fold serves as a validation set, with models trained on the other folds and evaluated on that fold.
There are several other options you can control. To specify a partition method, create an instance of one of the Partition Classes, and pass it as the partitioning_method argument in your call to project.set_target or project.start. See the Datetime Partitioned Projects section for more information on using datetime partitioning.
Several partitioning methods include parameters for validation_pct and holdout_pct, specifying desired percentages for the validation and holdout sets. Note that there may be constraints that prevent the actual percentages used from exactly (or in some cases, even closely) matching the requested percentages.
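A minimal sketch of passing a partition class, assuming dr.RandomCV is the desired class and the project from the QuickStart example:
import datarobot as dr
# 5-fold cross-validation with a 20% holdout set
partitioning = dr.RandomCV(holdout_pct=20, reps=5)
project.set_target(target='ItemsPurchased', partitioning_method=partitioning)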
Queue Mode¶
You can use the API to set the DataRobot modeling process to run in either automatic or manual mode.
Autopilot mode means that the modeling process will proceed completely automatically, including running recommended models, running at different sample sizes, and blending.
Manual mode means that DataRobot will populate a list of recommended models, but will not insert any of them into the queue. Manual mode lets you select which models to execute before starting the modeling process.
Quick mode means that a smaller set of Blueprints is used, so autopilot finishes faster.
Weights¶
DataRobot also supports using a weight parameter. A full discussion of the use of weights in data science is not within the scope of this document, but weights are often used to help compensate for rare events in data. You can specify a column name in the project dataset to be used as a weight column.
Offsets¶
Starting with version v2.6 DataRobot also supports using an offset parameter. Offsets are commonly used in insurance modeling to include effects that are outside of the training data due to regulatory compliance or constraints. You can specify the names of several columns in the project dataset to be used as the offset columns.
Exposure¶
Starting with version v2.6 DataRobot also supports using an exposure parameter. Exposure is often used to model insurance premiums where strict proportionality of premiums to duration is required. You can specify the name of the column in the project dataset to be used as an exposure column.
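Weight, offset, and exposure columns are passed through an AdvancedOptions object when setting the target. A minimal sketch, with hypothetical column names:
import datarobot as dr
# column names here are hypothetical; replace them with columns in your dataset
advanced = dr.AdvancedOptions(weights='RowWeight',
                              offset=['PriorPremium'],
                              exposure='PolicyDuration')
project.set_target(target='ItemsPurchased', advanced_options=advanced)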
Start Modeling¶
Once you have selected modeling parameters, you can use the following code structure to specify parameters and start the modeling process.
import datarobot as dr
project.set_target(target='ItemsPurchased',
metric='Tweedie Deviance',
mode=dr.AUTOPILOT_MODE.FULL_AUTO)
You can also pass additional optional parameters to project.set_target to change parameters of the modeling process. Currently supported parameters are:
- worker_count – int, sets the number of workers used for modeling.
- partitioning_method – PartitioningMethod object.
- positive_class – str, float, or int; specifies a level of the target column that should be treated as the positive class for binary classification. May only be specified for binary classification targets.
- advanced_options – AdvancedOptions object, used to set advanced options of the modeling process.
- target_type – str, override the automatically selected target_type. An example usage would be setting target_type=TARGET_TYPE.MULTICLASS when you want to perform a multiclass classification task on a numeric column that has a low cardinality.
You can run with different autopilot modes by changing the mode parameter. AUTOPILOT_MODE.FULL_AUTO is the default. Other accepted modes include AUTOPILOT_MODE.MANUAL for manual mode (choose your own models to run rather than using the DataRobot autopilot) and AUTOPILOT_MODE.QUICK for quick run (run on a more limited set of models to get insights more quickly).
Quickly Start a Project¶
Project creation, file upload and target selection are all combined in Project.start
method:
import datarobot as dr
project = dr.Project.start('/home/user/data/last_week_data.csv',
target='ItemsPurchased',
project_name='New Project')
You can also pass additional optional parameters to Project.start:
- worker_count – int, sets the number of workers used for modeling.
- metric – str, name of the metric to use.
- autopilot_on – boolean, defaults to True; set whether or not to begin modeling automatically.
- blueprint_threshold – int, number of hours a model is permitted to run. Minimum 1.
- response_cap – float, quantile of the response distribution to use for response capping. Must be in the range 0.5 to 1.0.
- partitioning_method – PartitioningMethod object.
- positive_class – str, float, or int; specifies a level of the target column that should be treated as the positive class for binary classification. May only be specified for binary classification targets.
- target_type – str, override the automatically selected target_type. An example usage would be setting target_type=TARGET_TYPE.MULTICLASS when you want to perform a multiclass classification task on a numeric column that has a low cardinality.
Interact with a Project¶
The following commands can be used to manage DataRobot projects.
List Projects¶
Returns a list of projects associated with the current API user.
import datarobot as dr
dr.Project.list()
>>> [Project(Project One), Project(Two)]
dr.Project.list(search_params={'project_name': 'One'})
>>> [Project(Project One)]
You can pass the following parameters to change the result:
- search_params – dict, used to filter returned projects. Currently you can query projects only by project_name.
Get an existing project¶
Rather than querying the full list of projects every time you need
to interact with a project, you can retrieve its id
value and use that to reference the project.
import datarobot as dr
project = dr.Project.get(project_id='5506fcd38bd88f5953219da0')
project.id
>>> '5506fcd38bd88f5953219da0'
project.project_name
>>> 'Churn Projection'
Update a project¶
You can update various attributes of a project.
To update the name of the project:
project.rename(new_name)
To update the number of workers used by your project (this will fail if you request more workers than you have available; the special value -1 will request your maximum number):
project.set_worker_count(num_workers)
To unlock the holdout set, allowing holdout scores to be shown and models to be trained on more data:
project.unlock_holdout()
Wait for Autopilot to Finish¶
Once the modeling autopilot is started, in some cases you will want to wait for autopilot to finish:
project.wait_for_autopilot()
Play/Pause the autopilot¶
If your project is running in autopilot mode, it will continually use available workers, subject to the number of workers allocated to the project and the total number of simultaneous workers allowed according to the user permissions.
To pause a project running in autopilot mode:
project.pause_autopilot()
To resume running a paused project:
project.unpause_autopilot()
Start autopilot on another Featurelist¶
You can start autopilot on an existing featurelist.
import datarobot as dr
featurelist = project.create_featurelist('test', ['feature 1', 'feature 2'])
project.start_autopilot(featurelist.id)
>>> True
# Starting autopilot that is already running on the provided featurelist
project.start_autopilot(featurelist.id)
>>> dr.errors.AppPlatformError
Note
This method should be used on a project where the target has already been set. An error will be raised if autopilot is currently running on or has already finished running on the provided featurelist.
Further reading¶
The Blueprints and Models sections of this document will describe how to create new models based on the Blueprints recommended by DataRobot.
Datetime Partitioned Projects¶
If your dataset is modeling events taking place over time, datetime partitioning may be appropriate. Datetime partitioning ensures that when partitioning the dataset for training and validation, rows are ordered according to the value of the date partition feature.
Setting Up a Datetime Partitioned Project¶
After creating a project and before setting the target, create a
DatetimePartitioningSpecification to define how the project should
be partitioned. By passing the specification into DatetimePartitioning.generate
, the full
partitioning can be previewed before finalizing the partitioning. After verifying that the
partitioning is correct for the project dataset, pass the specification into Project.set_target
via the partitioning_method
argument. Once modeling begins, the project can be used as normal.
The following code block shows the basic workflow for creating datetime partitioned projects.
import datarobot as dr
project = dr.Project.create('some_data.csv')
spec = dr.DatetimePartitioningSpecification('my_date_column')
# can customize the spec as needed
partitioning_preview = dr.DatetimePartitioning.generate(project.id, spec)
# the preview generated is based on the project's data
print(partitioning_preview.to_dataframe())
# hmm ... I want more backtests
spec.number_of_backtests = 5
partitioning_preview = dr.DatetimePartitioning.generate(project.id, spec)
print(partitioning_preview.to_dataframe())
# looks good
project.set_target('target_column', partitioning_method=spec)
Modeling with a Datetime Partitioned Project¶
While Model
objects can still be used to interact with the project,
DatetimeModel objects, which are only retrievable from datetime partitioned
projects, provide more information including which date ranges and how many rows are used in
training and scoring the model as well as scores and statuses for individual backtests.
The autopilot workflow is the same as for other projects, but to manually train a model, Project.train_datetime and Model.train_datetime should be used in place of Project.train and Model.train. To create frozen models, DatetimeModel.request_frozen_datetime_model should be used in place of Model.request_frozen_model. Unlike other projects, to trigger computation of scores for all backtests use DatetimeModel.score_backtests instead of using the scoring_type argument in the train methods.
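A minimal sketch of the manual workflow, assuming the project is already datetime partitioned and that train_datetime returns a ModelJob:
import datarobot as dr
# choose a blueprint and train it with the datetime variants of the train methods
blueprint = project.get_blueprints()[0]
model_job = project.train_datetime(blueprint.id)
model = model_job.get_result_when_complete()  # a DatetimeModel
# trigger computation of scores for all backtests, then wait for the job to finish
backtest_job = model.score_backtests()
backtest_job.wait_for_completion()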
Dates, Datetimes, and Durations¶
When specifying a date or datetime for datetime partitioning, the client expects to receive and
will return a datetime
. Timezones may be specified, and will be assumed to be UTC if left
unspecified. All dates returned from DataRobot are in UTC with a timezone specified.
Datetimes may include a time, or specify only a date; however, they may have a non-zero time component only if the partition column included a time component in its date format. If the partition column included only dates like “24/03/2015”, then the time component of any datetimes, if present, must be zero.
When date ranges are specified with a start and an end date, the end date is exclusive, so only dates earlier than the end date are included, but the start date is inclusive, so dates equal to or later than the start date are included. If the start and end date are the same, then no dates are included in the range.
Durations are specified using a subset of ISO8601. Durations will be of the form PnYnMnDTnHnMnS where each “n” may be replaced with an integer value. Within the duration string,
- nY represents the number of years
- the nM following the “P” represents the number of months
- nD represents the number of days
- nH represents the number of hours
- the nM following the “T” represents the number of minutes
- nS represents the number of seconds
and “P” is used to indicate that the string represents a period and “T” indicates the beginning of the time component of the string. Any section with a value of 0 may be excluded. As with datetimes, if the partition column did not include a time component in its date format, the time component of any duration must be either unspecified or consist only of zeros.
Example Durations:
- “P3Y6M” (three years, six months)
- “P1Y0M0DT0H0M0S” (one year)
- “P1Y5DT10H” (one year, 5 days, 10 hours)
datarobot.helpers.partitioning_methods.construct_duration_string is a helper method that can be used to construct appropriate duration strings.
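A minimal sketch of using that helper; the keyword arguments shown (years, days, hours) are assumed from the duration components described above:
from datarobot.helpers.partitioning_methods import construct_duration_string
# build a duration string for one year, 5 days, 10 hours
duration = construct_duration_string(years=1, days=5, hours=10)
print(duration)  # a valid ISO8601 duration string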
Time Series Projects¶
Time series projects, like OTV projects, use datetime partitioning, and all the workflow changes that apply to other datetime partitioned projects also apply to them. Unlike other projects, time series projects produce different types of models which forecast multiple future predictions instead of an individual prediction for each row.
DataRobot uses a general time series framework to configure how time series features are created and what future values the models will output. This framework consists of a Forecast Point (defining a time a prediction is being made), a Feature Derivation Window (a rolling window used to create features), and a Forecast Window (a rolling window of future values to predict). These components are described in more detail below.
Time series projects will automatically transform the dataset provided in order to apply this framework. During the transformation, DataRobot uses the Feature Derivation Window to derive time series features (such as lags and rolling statistics), and uses the Forecast Window to provide examples of forecasting different distances in the future (such as time shifts). After project creation, a new dataset and a new feature list are generated and used to train the models. This process is reapplied automatically at prediction time as well in order to generate future predictions based on the original data features.
The time_unit
and time_step
used to define the Feature Derivation and Forecast Windows are
taken from the datetime partition column, and can be retrieved for a given column in the input data
by looking at the corresponding attributes on the datarobot.models.Feature
object.
If windows_basis_unit is set to ROW, then the Feature Derivation and Forecast Windows will be defined in terms of the number of rows.
Setting Up A Time Series Project¶
To set up a time series project, follow the standard datetime partitioning workflow and use the time series specific parameters on the datarobot.DatetimePartitioningSpecification object (a sketch follows the list below):
- use_time_series
- bool, set this to True to enable time series for the project.
- default_to_known_in_advance
- bool, set this to True to default to treating all features as known in advance, or a priori, features. Otherwise, they will not be handled as known in advance features. Individual features can be set to a value different from the default by using the feature_settings parameter. See the prediction documentation for more information.
- feature_derivation_window_start
- int, the offset into the past to the start of the feature derivation window.
- feature_derivation_window_end
- int, the offset into the past to the end of the feature derivation window.
- forecast_window_start
- int, the offset into the future to the start of the forecast window.
- forecast_window_end
- int, the offset into the future to the end of the forecast window.
- windows_basis_unit
- string, set this to ROW to define the feature derivation and forecast windows in terms of rows, rather than time units. If omitted, the value defaults to the detected time unit, one of datarobot.enums.TIME_UNITS.
- feature_settings
- list of FeatureSettings specifying per-feature settings; can be left unspecified.
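A minimal sketch tying these parameters together; the column name and window offsets are hypothetical:
import datarobot as dr
# derive features from the prior 24 time units and forecast 1 through 12 units ahead
spec = dr.DatetimePartitioningSpecification(
    'my_date_column',
    use_time_series=True,
    feature_derivation_window_start=-24,
    feature_derivation_window_end=0,
    forecast_window_start=1,
    forecast_window_end=12,
)
project.set_target('target_column', partitioning_method=spec)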
Feature Derivation Window¶
The Feature Derivation window represents the rolling window that is used to derive
time series features and lags, relative to the Forecast Point. It is defined in terms of
feature_derivation_window_start
and feature_derivation_window_end
which are integer values
representing datetime offsets in terms of the time_unit
(e.g. hours or days).
The Feature Derivation Window start and end must be less than or equal to zero, indicating they are
positioned before the forecast point. Additionally, the window must be specified as an integer
multiple of the time_step
which defines the expected difference in time units between rows in
the data.
The window is closed, meaning the edges are considered to be inside the window.
Forecast Window¶
The Forecast Window represents the rolling window of future values to predict, relative to the
Forecast Point. It is defined in terms of the forecast_window_start
and forecast_window_end
,
which are positive integer values indicating datetime offsets in terms of the time_unit
(e.g.
hours or days).
The Forecast Window start and end must be positive integers, indicating they are
positioned after the forecast point. Additionally, the window must be specified as an integer
multiple of the time_step
which defines the expected difference in time units between rows in
the data.
The window is closed, meaning the edges are considered to be inside the window.
Multiseries Projects¶
Certain time series problems represent multiple separate series of data, e.g. “I have five different stores that all have different customer bases. I want to predict how many units of a particular item will sell, and account for the different behavior of each store”. When setting up the project, a column specifying series ids must be identified, so that each row from the same series has the same value in the multiseries id column.
Using a multiseries id column changes which partition columns are eligible for time series, as
each series is required to be unique and regular, instead of the entire partition column being
required to have those properties. In order to use a multiseries id column for partitioning,
a detection job must first be run to analyze the relationship between the partition and multiseries
id columns. If needed, it will be automatically triggered by calling
datarobot.models.Feature.get_multiseries_properties()
on the desired partition column. The
previously computed multiseries properties for a particular partition column can then be accessed
via that method. The computation will also be automatically triggered when calling
datarobot.DatetimePartitioning.generate()
or datarobot.models.Project.set_target()
with a multiseries id column specified.
Note that currently only one multiseries id column is supported, but all interfaces accept lists of id columns to ensure multiple id columns will be able to be supported in the future.
In order to create a multiseries project:
- Set up a datetime partitioning specification with the desired partition column and multiseries id columns.
- (Optionally) Use datarobot.models.Feature.get_multiseries_properties() to confirm the inferred time step and time unit of the partition column when used with the specified multiseries id column.
- (Optionally) Specify the multiseries id column in order to preview the full datetime partitioning settings using datarobot.DatetimePartitioning.generate().
- Specify the multiseries id column when sending the target and partitioning settings via datarobot.models.Project.set_target().
project = dr.Project.create('path/to/multiseries.csv', project_name='my multiseries project')
partitioning_spec = dr.DatetimePartitioningSpecification(
'timestamp', use_time_series=True, multiseries_id_columns=['multiseries_id']
)
# manually confirm time step and time unit are as expected
datetime_feature = dr.Feature.get(project.id, 'timestamp')
multiseries_props = datetime_feature.get_multiseries_properties(['multiseries_id'])
print(multiseries_props)
# manually check out the partitioning settings like feature derivation window and backtests
# to make sure they make sense before moving on
full_part = dr.DatetimePartitioning.generate(project.id, partitioning_spec)
print(full_part.feature_derivation_window_start, full_part.feature_derivation_window_end)
print(full_part.to_dataframe())
# finalize the project and start the autopilot
project.set_target('target', partitioning_method=partitioning_spec)
Feature Settings¶
The datarobot.FeatureSettings constructor receives a feature name and settings. For now, only the known_in_advance setting is supported.
# I have 10 features, 8 of them are known in advance and two are not
not_known_in_advance_features = ['previous_day_sales', 'amount_in_stock']
feature_settings = [dr.FeatureSettings(feat_name, known_in_advance=False) for feat_name in not_known_in_advance_features]
spec = dr.DatetimePartitioningSpecification(
# ...
default_to_known_in_advance=True,
feature_settings=feature_settings
)
Modeling Data and Time Series Features¶
In time series projects, a new set of modeling features is created after setting the partitioning options. If a featurelist is specified with the partitioning options, it will be used to select which features should be used to derive modeling features; if a featurelist is not specified, the default featurelist will be used.
These features are automatically derived from those in the project’s
dataset and are the features used for modeling - note that the Project methods
get_featurelists
and get_modeling_featurelists
will return different data in time series
projects. Modeling featurelists are the ones that can be used for modeling and will be accepted by
the backend, while regular featurelists will continue to exist but cannot be used. Modeling
features are only accessible once the target and partitioning options have been
set. In projects that don’t use time series modeling, once the target has been set,
modeling and regular features and featurelists will behave the same.
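A minimal sketch of the difference, assuming a time series project whose target has been set:
# regular featurelists describe the uploaded dataset; modeling featurelists
# describe the derived modeling dataset and are the ones accepted for training
regular_featurelists = project.get_featurelists()
modeling_featurelists = project.get_modeling_featurelists()
print([flist.name for flist in modeling_featurelists])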
Making Predictions¶
Prediction datasets are uploaded as normal. However, when uploading a
prediction dataset, a new parameter forecast_point
can be specified. The forecast point of a
prediction dataset identifies the point in time relative to which predictions should be generated, and
if one is not specified when uploading a dataset, the server will choose the most recent possible
forecast point. The forecast window specified when setting the partitioning options for the project
determines how far into the future from the forecast point predictions should be calculated.
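A minimal sketch of uploading a prediction dataset with an explicit forecast point (the file name is hypothetical):
from datetime import datetime
# if forecast_point is omitted, the server chooses the most recent possible one
dataset = project.upload_dataset('./data_to_predict.csv',
                                 forecast_point=datetime(2017, 1, 8))
predict_job = model.request_predictions(dataset.id)
predictions = predict_job.get_result_when_complete()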
When setting up a time series project, input features could be identified as known-in-advance features. These features are not used to generate lags, and are expected to be known for the rows in the forecast window at predict time (e.g. “how much money will have been spent on marketing”, “is this a holiday”).
Enough rows of historical data must be provided to cover the span of the effective Feature
Derivation Window (which may be longer than the project’s Feature Derivation Window depending
on the differencing settings chosen). The effective Feature Derivation Window of any model
can be checked via the effective_feature_derivation_window_start
and
effective_feature_derivation_window_end
attributes of a
DatetimeModel
.
When uploading datasets to a time series project, the dataset might look something like the following, where “Time” is the datetime partition column, “Target” is the target column, and “Temp.” is an input feature. If the dataset was uploaded with a forecast point of “2017-01-08” and the effective feature derivation window start and end for the model are -5 and -3 and the forecast window start and end were set to 1 and 3, then rows 1 through 3 are historical data, row 6 is the forecast point, and rows 7 though 9 are forecast rows that will have predictions when predictions are computed.
Row, Time, Target, Temp.
1, 2017-01-03, 16443, 72
2, 2017-01-04, 3013, 72
3, 2017-01-05, 1643, 68
4, 2017-01-06, ,
5, 2017-01-07, ,
6, 2017-01-08, ,
7, 2017-01-09, ,
8, 2017-01-10, ,
9, 2017-01-11, ,
On the other hand, if the project instead used “Holiday” as an a priori input feature, the uploaded dataset might look like the following:
Row, Time, Target, Holiday
1, 2017-01-03, 16443, TRUE
2, 2017-01-04, 3013, FALSE
3, 2017-01-05, 1643, FALSE
4, 2017-01-06, , FALSE
5, 2017-01-07, , FALSE
6, 2017-01-08, , FALSE
7, 2017-01-09, , TRUE
8, 2017-01-10, , FALSE
9, 2017-01-11, , FALSE
Calendars¶
You can upload a calendar file containing a list of events relevant to your dataset. When provided, DataRobot automatically derives special time series features from the calendar events (e.g., time until the next event, labeling the most recent event).
The calendar file:
- Should span the entire training data date range, as well as all future dates for which the model will be forecasting.
- Must be in CSV format.
- Must have at least one date column. You can optionally include a second column, Label, that provides the event name or type.
- Must be in date-only format YYYY-MM-DD (i.e., no hour, minute, or second) with no duplicates.
- Cannot be updated in an active project. You must specify all future calendar events at project start; if you did not, train a new project.
An example of a valid calendar file:
Date, Name
2019-01-01, New Year's Day
2019-02-14, Valentine's Day
2019-04-01, April Fools
2019-05-05, Cinco de Mayo
2019-07-04, July 4th
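A hedged sketch of uploading such a file and referencing it when partitioning; the file name, calendar name, and column name are hypothetical:
import datarobot as dr
# upload the calendar, then point the partitioning specification at it
calendar = dr.CalendarFile.create('./events.csv', calendar_name='Holidays')
spec = dr.DatetimePartitioningSpecification(
    'timestamp', use_time_series=True, calendar_id=calendar.id
)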
Prediction Intervals¶
For each model, prediction intervals estimate the range of values DataRobot expects actual values of the target to fall within. They are similar to a confidence interval of a prediction, but are based on the residual errors measured during the backtesting for the selected model.
Note that because calculation depends on the backtesting values, prediction intervals are not available for predictions on models that have not had all backtests completed. Additionally, prediction intervals are not available when the number of points per forecast distance is less than 10, due to insufficient data.
In a prediction request, users can specify a prediction intervals size, which specifies the desired probability of actual values falling within the interval range. Larger values are less precise, but more conservative. For example, specifying a size of 80 will result in a lower bound of 10% and an upper bound of 90%. More generally, for a specific prediction_intervals_size, the upper and lower bounds will be calculated as follows:
- prediction_interval_upper_bound = 50% + (prediction_intervals_size / 2)
- prediction_interval_lower_bound = 50% - (prediction_intervals_size / 2)
To view prediction intervals data for a prediction, the prediction needs to have been created using the datarobot.models.DatetimeModel.request_predictions() method and specifying include_prediction_intervals = True.
The size for the prediction interval can be specified with the prediction_intervals_size parameter for the same function, and will default to 80 if left unspecified. Specifying either of these fields will result in prediction interval bounds being included in the retrieved prediction data for that request (see the Predictions class for retrieval methods).
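A minimal sketch, assuming model is a DatetimeModel with all backtests completed and a prediction dataset already uploaded as described above:
# request predictions with 90% prediction intervals (bounds at 5% and 95%)
predict_job = model.request_predictions(
    dataset.id,
    include_prediction_intervals=True,
    prediction_intervals_size=90,
)
predictions = predict_job.get_result_when_complete()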
Blueprints¶
The set of computation paths that a dataset passes through before producing predictions from data is called a blueprint. A blueprint can be trained on a dataset to generate a model.
Quick Reference¶
The following code block summarizes the interactions available for blueprints.
# Get the set of blueprints recommended by datarobot
import datarobot as dr
my_projects = dr.Project.list()
project = my_projects[0]
menu = project.get_blueprints()
first_blueprint = menu[0]
project.train(first_blueprint)
List Blueprints¶
When a file is uploaded to a project and the target is set, DataRobot
recommends a set of blueprints that are appropriate for the task at hand.
You can use the get_blueprints
method to get the list of blueprints recommended for a project:
project = dr.Project.get('5506fcd38bd88f5953219da0')
menu = project.get_blueprints()
blueprint = menu[0]
Get a blueprint¶
If you already have a blueprint_id
from a model you can retrieve the blueprint directly.
project_id = '5506fcd38bd88f5953219da0'
project = dr.Project.get(project_id)
models = project.get_models()
model = models[0]
blueprint = dr.Blueprint.get(project_id, model.blueprint_id)
Get a blueprint chart¶
For any blueprint - either from the blueprint menu or one already used in a model - you can retrieve its chart. You can also get its representation in graphviz DOT format to render it into whatever format you need.
project_id = '5506fcd38bd88f5953219da0'
blueprint_id = '4321fcd38bd88f595321554223'
bp_chart = dr.BlueprintChart.get(project_id, blueprint_id)
print(bp_chart.to_graphviz())
Get a blueprint documentation¶
You can retrieve documentation on the tasks used in a blueprint. It will contain information about the task, its parameters and (when available) links and references to additional sources.
All documents are instances of BlueprintTaskDocument
class.
project_id = '5506fcd38bd88f5953219da0'
blueprint_id = '4321fcd38bd88f595321554223'
bp = dr.Blueprint.get(project_id, blueprint_id)
docs = bp.get_documents()
print(docs[0].task)
>>> Average Blend
print(docs[0].links[0]['url'])
>>> https://en.wikipedia.org/wiki/Ensemble_learning
Blueprint Attributes¶
The Blueprint
class holds the data required to use the blueprint
for modeling. This includes the blueprint_id
and project_id
.
There are also two attributes that help distinguish blueprints: model_type
and processes
.
print(blueprint.id)
>>> u'8956e1aeecffa0fa6db2b84640fb3848'
print(blueprint.project_id)
>>> u'5506fcd38bd88f5953219da0'
print(blueprint.model_type)
>>> Logistic Regression
print(blueprint.processes)
>>> [u'One-Hot Encoding',
u'Missing Values Imputed',
u'Standardize',
u'Logistic Regression']
Create a Model from a Blueprint¶
You can use a blueprint instance to train a model. The default dataset for the project is used.
model_job_id = project.train(blueprint, sample_pct=25)
This method will put a new modeling job into the queue and return the id of the created ModelJob. You can pass the ModelJob id to the wait_for_async_model_creation function, which polls the async model creation status and returns the newly created model when it's finished.
Models¶
When a blueprint has been trained on a specific dataset at a specified sample size, the result is a model. Models can be inspected to analyze their accuracy.
Quick Reference¶
# Get all models of an existing project
import datarobot as dr
my_projects = dr.Project.list()
project = my_projects[0]
models = project.get_models()
List Finished Models¶
You can use the get_models
method to return a list of the project models
that have finished training:
import datarobot as dr
project = dr.Project.get('5506fcd38bd88f5953219da0')
models = project.get_models()
print(models[:5])
>>> [Model(Decision Tree Classifier (Gini)),
Model(Auto-tuned K-Nearest Neighbors Classifier (Minkowski Distance)),
Model(Gradient Boosted Trees Classifier (R)),
Model(Gradient Boosted Trees Classifier),
Model(Logistic Regression)]
model = models[0]
project.id
>>> u'5506fcd38bd88f5953219da0'
model.id
>>> u'5506fcd98bd88f1641a720a3'
You can pass the following parameters to change the result:
- search_params – dict, used to filter returned models. Currently you can query models by name, sample_pct, and is_starred.
- order_by – str or list; if passed, returned models are ordered by this attribute or attributes.
- with_metric – str; if not None, the returned models will only have scores for this metric. Otherwise all the metrics are returned.
List Models Example:
Project('pid').get_models(order_by=['-created_time', 'sample_pct', 'metric'])
# Getting models that contain "Ridge" in name
# and with sample_pct more than 64
Project('pid').get_models(
search_params={
'sample_pct__gt': 64,
'name': "Ridge"
})
# Getting models marked as starred
Project('pid').get_models(
search_params={
'is_starred': True
})
Retrieve a Known Model¶
If you know the model_id
and project_id
values of a model, you can
retrieve it directly:
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
model_id=model_id)
You can also use an instance of Project as the parameter for get:
model = dr.Model.get(project=project,
model_id=model_id)
Train a Model on a Different Sample Size¶
One of the key insights into a model and the data behind it is how its
performance varies with more training data.
In Autopilot mode, DataRobot will run at several sample sizes by default,
but you can also create a job that will run at a specific sample size.
You can also specify the featurelist that should be used for training the new model, as well as the scoring type.
The train method of a Model instance will put a new modeling job into the queue and return the id of the created ModelJob.
You can pass the ModelJob id to the wait_for_async_model_creation function, which polls the async model creation status and returns the newly created model when it's finished.
model_job_id = model.train(sample_pct=33)
# retraining model on custom featurelist using cross validation
import datarobot as dr
model_job_id = model.train(
sample_pct=55,
featurelist_id=custom_featurelist.id,
scoring_type=dr.SCORING_TYPE.cross_validation,
)
Find the Features Used¶
Because each project can have many associated featurelists, it is important to know which features a model requires in order to run. This helps ensure that the necessary features are provided when generating predictions.
feature_names = model.get_features_used()
print(feature_names)
>>> ['MonthlyIncome',
'VisitsLast8Weeks',
'Age']
Feature Impact¶
Feature Impact measures how much worse a model’s error score would be if DataRobot made predictions after randomly shuffling a particular column (a technique sometimes called Permutation Importance).
The following example code snippet shows how a featurelist with just the features with the highest feature impact could be created.
import datarobot as dr
max_num_features = 10
time_to_wait_for_impact = 4 * 60 # seconds
feature_impacts = model.get_or_request_feature_impact(time_to_wait_for_impact)
feature_impacts.sort(key=lambda x: x['impactNormalized'], reverse=True)
final_names = [f['featureName'] for f in feature_impacts[:max_num_features]]
project.create_featurelist('highest_impact', final_names)
Predict new data¶
After creating models you can use them to generate predictions on new data. See PredictJob for further information on how to request predictions from a model.
Model IDs Vs. Blueprint IDs¶
Each model has both a model_id
and a blueprint_id
. What is the difference between these two IDs?
A model is the result of training a blueprint on a dataset at a specified
sample percentage. The blueprint_id
is used to keep track of which
blueprint was used to train the model, while the model_id
is used to
locate the trained model in the system.
Model parameters¶
Some models can have parameters that provide data needed to reproduce their predictions.
For additional usage information, see the DataRobot documentation, section “Coefficients tab and pre-processing details”.
import datarobot as dr
model = dr.Model.get(project=project, model_id=model_id)
mp = model.get_parameters()
print(mp.derived_features)
>>> [{
'coefficient': -0.015,
'originalFeature': u'A1Cresult',
'derivedFeature': u'A1Cresult->7',
'type': u'CAT',
'transformations': [{'name': u'One-hot', 'value': u"'>7'"}]
}]
Create a Blender¶
You can blend multiple models; in many cases, the resulting blender model is more accurate
than the parent models. To do so you need to select parent models and a blender method from
datarobot.enums.BLENDER_METHOD
.
Be aware that the tradeoff for better prediction accuracy is greater resource consumption and slower predictions.
import datarobot as dr
pr = dr.Project.get(pid)
models = pr.get_models()
parent_models = [model.id for model in models[:2]]
pr.blend(parent_models, dr.enums.BLENDER_METHOD.AVERAGE)
Lift chart retrieval¶
You can use the Model methods get_lift_chart and get_all_lift_charts to retrieve lift chart data. The first gets it from a specific source (validation data, cross-validation, or holdout, if the holdout is unlocked) and the second lists all available data. Please refer to the Advanced model information notebook for additional information about lift charts and how they can be visualised.
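A minimal sketch of retrieving lift chart data ('validation' is one of the sources described above):
# lift chart computed on the validation partition
lift_chart = model.get_lift_chart('validation')
print(lift_chart.bins[:3])
# or every available lift chart at once
all_lift_charts = model.get_all_lift_charts()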
ROC curve retrieval¶
As with the lift chart, you can use the Model methods get_roc_curve and get_all_roc_curves to retrieve ROC curve data. Please refer to the Advanced model information notebook for additional information about ROC curves and how they can be visualised. More information about working with ROC curves can be found in the DataRobot web application documentation, section “ROC Curve tab details”.
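A minimal sketch, assuming a binary classification project:
# ROC curve computed on the validation partition
roc = model.get_roc_curve('validation')
print(roc.roc_points[:3])
all_roc_curves = model.get_all_roc_curves()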
Word Cloud¶
If your dataset contains text columns, DataRobot can create text processing models that will contain word cloud insight data. An example of such a model is any “Auto-Tuned Word N-Gram Text Modeler” model. You can use the Model.get_word_cloud method to retrieve those insights - it will provide up to the 200 most important ngrams in the model and data about their influence.
The Advanced model information notebook contains examples of how you can use that data and build a visualization similar to the one in the DataRobot webapp.
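A minimal sketch of retrieving the insight from a text-processing model:
# each entry describes an ngram and its influence on the model
word_cloud = model.get_word_cloud(exclude_stop_words=True)
print(word_cloud.ngrams[:5])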
Scoring Code¶
A subset of models in DataRobot supports code generation. For each of those models you can download a JAR file with scoring code to make predictions locally using the method Model.download_scoring_code. For details on how to do that, see the “Code Generation” section in the DataRobot web application documentation. Optionally, you can download the source code in Java to see what calculations those models do internally.
Be aware that the source code JAR isn't compiled, so it cannot be used for making predictions.
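A minimal sketch of downloading both artifacts (the file names are arbitrary):
# compiled JAR, usable for local predictions
model.download_scoring_code('scoring_code.jar')
# Java source JAR, for inspection only
model.download_scoring_code('scoring_code_source.jar', source_code=True)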
Get a model blueprint chart¶
For any model you can retrieve its blueprint chart. You can also get its representation in graphviz DOT format to render it into whatever format you need.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
model_id=model_id)
bp_chart = model.get_model_blueprint_chart()
print(bp_chart.to_graphviz())
Get a model missing values report¶
For the majority of models you can retrieve a missing values report on training data for each numeric and categorical feature. The model needs to have at least one of the supported tasks in its blueprint in order to have a missing values report (blenders are not supported). The report is gathered for Numerical Imputation tasks and categorical converters like Ordinal Encoding, One-Hot Encoding, etc. The missing values report is available to users with access to full blueprint docs.
The report is collected for those features considered eligible by the given blueprint task. For instance, a categorical feature with many unique values may not be considered eligible by the One-Hot Encoding task.
Please refer to Missing report attributes description for report interpretation.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id, model_id=model_id)
missing_reports_per_feature = model.get_missing_report_info()
for report_per_feature in missing_reports_per_feature:
print(report_per_feature)
Consider the following example. Given this Decision Tree Classifier (Gini) blueprint chart representation:
print(blueprint_chart.to_graphviz())
>>> digraph "Blueprint Chart" {
graph [rankdir=LR]
0 [label="Data"]
-2 [label="Numeric Variables"]
2 [label="Missing Values Imputed"]
3 [label="Decision Tree Classifier (Gini)"]
4 [label="Prediction"]
-1 [label="Categorical Variables"]
1 [label="Ordinal encoding of categorical variables"]
0 -> -2
-2 -> 2
2 -> 3
3 -> 4
0 -> -1
-1 -> 1
1 -> 3
}
and missing report:
print(report_per_feature1)
>>> {'feature': 'Veh Year',
'type': 'Numeric',
'missing_count': 150,
'missing_percentage': 50.00,
'tasks': [
{'id': u'2',
'name': u'Missing Values Imputed',
'descriptions': [u'Imputed value: 2006']
}
]
}
print(report_per_feature2)
>>> {'feature': 'Model',
'type': 'Categorical',
'missing_count': 100,
'missing_percentage': 33.33,
'tasks': [
{'id': u'1',
'name': u'Ordinal encoding of categorical variables',
'descriptions': [u'Imputed value: -2']
}
]
}
The results can be interpreted in the following way:
Numeric feature “Veh Year” has 150 missing values, which is 50% of the training data. It was transformed by the “Missing Values Imputed” task with imputed value 2006. The task has id 2, and its output goes into the Decision Tree Classifier (Gini), as can be inferred from the chart.
Categorical feature “Model” was transformed by the “Ordinal encoding of categorical variables” task with imputed value -2.
Get a blueprint documentation¶
You can retrieve documentation on the tasks used to build a model. It will contain information about the task, its parameters and (when available) links and references to additional sources.
All documents are instances of BlueprintTaskDocument
class.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
model_id=model_id)
docs = model.get_model_blueprint_documents()
print(docs[0].task)
>>> Average Blend
print(docs[0].links[0]['url'])
>>> https://en.wikipedia.org/wiki/Ensemble_learning
Request training predictions¶
You can request a model’s predictions for a particular subset of its training data.
See datarobot.models.Model.request_training_predictions()
reference for all the valid subsets.
See training predictions reference for more details.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
model_id=model_id)
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.HOLDOUT)
training_predictions = training_predictions_job.get_result_when_complete()
for row in training_predictions.iterate_rows():
print(row.row_id, row.prediction)
Advanced Tuning¶
You can perform advanced tuning on a model – generate a new model by taking an existing model and rerunning it with modified tuning parameters.
The AdvancedTuningSession class exists to track the creation of an Advanced Tuning model on the client. It enables browsing and setting advanced-tuning parameters one at a time, and using human-readable parameter names rather than requiring opaque parameter IDs in all cases. No information is sent to the server until the run() method is called on the AdvancedTuningSession.
See datarobot.models.Model.get_advanced_tuning_parameters()
reference for a description
of the types of parameters that can be passed in.
As of v2.15, all models support Advanced Tuning other than Blenders, OSS, and user-created.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
model_id=model_id)
tune = model.start_advanced_tuning_session()
# Get available task names,
# and available parameter names for a task name that exists on this model
tune.get_task_names()
tune.get_parameter_names('Eureqa Generalized Additive Model Classifier (3000 Generations)')
tune.set_parameter(
task_name='Eureqa Generalized Additive Model Classifier (3000 Generations)',
parameter_name='EUREQA_building_block__sine',
value=1)
job = tune.run()
Jobs¶
The Job class is a generic representation of jobs running through a project’s queue. Many tasks involved in modeling, such as creating a new model or computing feature impact for a model, will use a job to track the worker usage and progress of the associated task.
Checking the Contents of the Queue¶
To see what jobs are running or waiting in the queue for a project, use the Project.get_all_jobs
method.
from datarobot.enums import QUEUE_STATUS
jobs_list = project.get_all_jobs() # gives all jobs queued or inprogress
jobs_by_type = {}
for job in jobs_list:
if job.job_type not in jobs_by_type:
jobs_by_type[job.job_type] = [0, 0]
if job.status == QUEUE_STATUS.QUEUE:
jobs_by_type[job.job_type][0] += 1
else:
jobs_by_type[job.job_type][1] += 1
for job_type in jobs_by_type:
    (num_queued, num_inprogress) = jobs_by_type[job_type]
    print('{} jobs: {} queued, {} inprogress'.format(job_type, num_queued, num_inprogress))
Cancelling a Job¶
If a job is taking too long to run or no longer necessary, it can be cancelled easily from the
Job
object.
from datarobot.enums import QUEUE_STATUS
project.pause_autopilot()
bad_jobs = project.get_all_jobs(status=QUEUE_STATUS.QUEUE)
for job in bad_jobs:
job.cancel()
project.unpause_autopilot()
Retrieving Results From a Job¶
Once you’ve found a particular job of interest, you can retrieve the results once it is complete.
Note that the type of the returned object will vary depending on the job_type
. All return types
are documented in Job.get_result
.
from datarobot.enums import JOB_TYPE
time_to_wait = 60 * 60 # how long to wait for the job to finish (in seconds) - i.e. an hour
assert my_job.job_type == JOB_TYPE.MODEL
my_model = my_job.get_result_when_complete(max_wait=time_to_wait)
ModelJobs¶
Model creation is an asynchronous process. This means that when explicitly invoking new model creation (with project.train or model.train, for example) all you get is the id of the process responsible for model creation. With this id you can get info about the model being created, or the model itself once the creation process is finished. For this you should use the ModelJob class.
Get an existing ModelJob¶
To retrieve an existing ModelJob, use the ModelJob.get method. For this you need the id of the Project used for model creation and the id of the ModelJob. Having the ModelJob might be useful if you want to know the parameters of model creation, automatically chosen by the API backend, before the actual model is created.
If the model has already been created, ModelJob.get will raise a PendingJobFinished exception.
import time
import datarobot as dr
blueprint_id = '5506fcd38bd88f5953219da0'
model_job_id = project.train(blueprint_id)
model_job = dr.ModelJob.get(project=project.id,
model_job_id=model_job_id)
model_job.sample_pct
>>> 64.0
# wait for model to be created (in a very inefficient way)
time.sleep(10 * 60)
model_job = dr.ModelJob.get(project=project.id,
model_job_id=model_job_id)
>>> datarobot.errors.PendingJobFinished
Get created model¶
After the model is created, you can use ModelJob.get_model to get the newly created model.
import datarobot as dr
model = dr.ModelJob.get_model(project=project.id,
model_job_id=model_job_id)
wait_for_async_model_creation function¶
If you just want to get the created model after getting the ModelJob id, you can use the wait_for_async_model_creation function. It polls the status of the model creation process until it is finished, and then returns the newly created model.
from datarobot.models.modeljob import wait_for_async_model_creation
# used during training based on blueprint
model_job_id = project.train(blueprint, sample_pct=33)
new_model = wait_for_async_model_creation(
project_id=project.id,
model_job_id=model_job_id,
)
# used during training based on existing model
model_job_id = existing_model.train(sample_pct=33)
new_model = wait_for_async_model_creation(
project_id=existing_model.project_id,
model_job_id=model_job_id,
)
Predictions¶
Predictions generation is an asynchronous process. This means that when starting
predictions with Model.request_predictions
you will receive back a PredictJob for tracking
the process responsible for fulfilling your request.
With this object you can get info about the prediction generation process before it has finished, and retrieve the predictions themselves once the process is complete. For this you should use the PredictJob class.
Starting predictions generation¶
Before actually requesting predictions, you should upload the dataset you wish to predict via
Project.upload_dataset
. Previously uploaded datasets can be seen under Project.get_datasets
.
When uploading the dataset you can provide the path to a local file, a file object, raw file content,
a pandas.DataFrame
object, or the url to a publicly available dataset.
To start predicting on new data using a finished model use Model.request_predictions
.
It will create a new predictions generation process and return a PredictJob object tracking this process.
With it, you can monitor an existing PredictJob and retrieve generated predictions when the corresponding
PredictJob is finished.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
project = dr.Project.get(project_id)
model = dr.Model.get(project=project_id,
model_id=model_id)
# Using path to local file to generate predictions
dataset_from_path = project.upload_dataset('./data_to_predict.csv')
# Using file object to generate predictions
with open('./data_to_predict.csv') as data_to_predict:
dataset_from_file = project.upload_dataset(data_to_predict)
predict_job_1 = model.request_predictions(dataset_from_path.id)
predict_job_2 = model.request_predictions(dataset_from_file.id)
Listing Predictions¶
You can use the Predictions.list()
method to return a list of predictions generated on a project.
import datarobot as dr
predictions = dr.Predictions.list('58591727100d2b57196701b3')
print(predictions)
>>>[Predictions(prediction_id='5b6b163eca36c0108fc5d411',
project_id='5b61bd68ca36c04aed8aab7f',
model_id='5b61bd7aca36c05744846630',
dataset_id='5b6b1632ca36c03b5875e6a0'),
Predictions(prediction_id='5b6b2315ca36c0108fc5d41b',
project_id='5b61bd68ca36c04aed8aab7f',
model_id='5b61bd7aca36c0574484662e',
dataset_id='5b6b1632ca36c03b5875e6a0'),
Predictions(prediction_id='5b6b23b7ca36c0108fc5d422',
project_id='5b61bd68ca36c04aed8aab7f',
model_id='5b61bd7aca36c0574484662e',
dataset_id='5b6b1632ca36c03b5875e6a0')
]
You can pass the following parameters to filter the result, as shown in the example below:
- model_id – str, used to filter returned predictions by model_id.
- dataset_id – str, used to filter returned predictions by dataset_id.
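For example, a minimal sketch of filtering the listing by model and dataset; the ids below are hypothetical:
import datarobot as dr
project_id = '5b61bd68ca36c04aed8aab7f'
model_id = '5b61bd7aca36c0574484662e'
dataset_id = '5b6b1632ca36c03b5875e6a0'
# only predictions generated by this model on this dataset are returned
filtered = dr.Predictions.list(project_id, model_id=model_id, dataset_id=dataset_id)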
Get an existing PredictJob¶
To retrieve an existing PredictJob use the PredictJob.get
method. This will give you
a PredictJob matching the latest status of the job if it has not completed.
If predictions have finished building, PredictJob.get
will raise a PendingJobFinished
exception.
import time
import datarobot as dr
predict_job = dr.PredictJob.get(project_id=project_id,
predict_job_id=predict_job_id)
predict_job.status
>>> 'queue'
# wait for generation of predictions (in a very inefficient way)
time.sleep(10 * 60)
predict_job = dr.PredictJob.get(project_id=project_id,
predict_job_id=predict_job_id)
>>> dr.errors.PendingJobFinished
# now the predictions are finished
predictions = dr.PredictJob.get_predictions(project_id=project.id,
predict_job_id=predict_job_id)
Get generated predictions¶
After predictions are generated, you can use PredictJob.get_predictions
to get newly generated predictions.
If predictions have not yet been finished, it will raise a JobNotFinished
exception.
import datarobot as dr
predictions = dr.PredictJob.get_predictions(project_id=project.id,
predict_job_id=predict_job_id)
Wait for and Retrieve results¶
If you just want to get generated predictions from a PredictJob, you
can use the PredictJob.get_result_when_complete
function.
It will poll the status of predictions generation process until it has finished, and
then will return predictions.
dataset = project.get_datasets()[0]
predict_job = model.request_predictions(dataset.id)
predictions = predict_job.get_result_when_complete()
Get previously generated predictions¶
If you don’t have a PredictJob
on hand, there are two more ways to retrieve predictions from the
Predictions
interface:
- Get all prediction rows as a
pandas.DataFrame
object:
import datarobot as dr
preds = dr.Predictions.get("5b61bd68ca36c04aed8aab7f", prediction_id="5b6b163eca36c0108fc5d411")
df = preds.get_all_as_dataframe()
- Download all prediction rows to a file as a CSV document:
import datarobot as dr
preds = dr.Predictions.get("5b61bd68ca36c04aed8aab7f", prediction_id="5b6b163eca36c0108fc5d411")
preds.download_to_csv('predictions.csv')
Prediction Explanations¶
To compute prediction explanations you need to have feature impact computed for a model, and predictions for an uploaded dataset computed with a selected model.
Computing prediction explanations is a resource-intensive task, but you can configure it with maximum explanations per row and prediction value thresholds to speed up the process.
Quick Reference¶
import datarobot as dr
# Get project
my_projects = dr.Project.list()
project = my_projects[0]
# Get model
models = project.get_models()
model = models[0]
# Compute feature impact
feature_impacts = model.get_or_request_feature_impact()
# Upload dataset
dataset = project.upload_dataset('./data_to_predict.csv')
# Compute predictions
predict_job = model.request_predictions(dataset.id)
predict_job.wait_for_completion()
# Initialize prediction explanations
pei_job = dr.PredictionExplanationsInitialization.create(project.id, model.id)
pei_job.wait_for_completion()
# Compute prediction explanations with default parameters
pe_job = dr.PredictionExplanations.create(project.id, model.id, dataset.id)
pe = pe_job.get_result_when_complete()
# Iterate through predictions with prediction explanations
for row in pe.get_rows():
print(row.prediction)
print(row.prediction_explanations)
# download to a CSV file
pe.download_to_csv('prediction_explanations.csv')
List Prediction Explanations¶
You can use the PredictionExplanations.list()
method to return a list of prediction
explanations computed for a project’s models:
import datarobot as dr
prediction_explanations = dr.PredictionExplanations.list('58591727100d2b57196701b3')
print(prediction_explanations)
>>> [PredictionExplanations(id=585967e7100d2b6afc93b13b,
project_id=58591727100d2b57196701b3,
model_id=585932c5100d2b7c298b8acf),
PredictionExplanations(id=58596bc2100d2b639329eae4,
project_id=58591727100d2b57196701b3,
model_id=585932c5100d2b7c298b8ac5),
PredictionExplanations(id=58763db4100d2b66759cc187,
project_id=58591727100d2b57196701b3,
model_id=585932c5100d2b7c298b8ac5),
...]
pe = prediction_explanations[0]
pe.project_id
>>> u'58591727100d2b57196701b3'
pe.model_id
>>> u'585932c5100d2b7c298b8acf'
You can pass the following parameters to filter the result:
- model_id – str, used to filter returned prediction explanations by model_id.
- limit – int, limit for the number of items returned, default: no limit.
- offset – int, number of items to skip, default: 0.
List Prediction Explanations Example:
project_id = '58591727100d2b57196701b3'
model_id = '585932c5100d2b7c298b8acf'
dr.PredictionExplanations.list(project_id, model_id=model_id, limit=20, offset=100)
Initialize Prediction Explanations¶
In order to compute prediction explanations you have to initialize it for a particular model.
dr.PredictionExplanationsInitialization.create(project_id, model_id)
Compute Prediction Explanations¶
If all prerequisites are in place, you can compute prediction explanations in the following way:
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
dataset_id = '5506fcd98bd88a8142b725c8'
pe_job = dr.PredictionExplanations.create(project_id, model_id, dataset_id,
max_explanations=2, threshold_low=0.2, threshold_high=0.8)
pe = pe_job.get_result_when_complete()
Where:
- max_explanations is the maximum number of prediction explanations to compute for each row.
- threshold_low and threshold_high are thresholds for the prediction value of the row. Prediction explanations will be computed for a row if the row’s prediction value is higher than threshold_high or lower than threshold_low. If no thresholds are specified, prediction explanations will be computed for all rows.
Retrieving Prediction Explanations¶
You have three options for retrieving prediction explanations.
Note
PredictionExplanations.get_all_as_dataframe()
and
PredictionExplanations.download_to_csv()
reformat
prediction explanations to match the schema of the CSV file downloaded from the UI (RowId,
Prediction, Explanation 1 Strength, Explanation 1 Feature, Explanation 1 Value, …,
Explanation N Strength, Explanation N Feature, Explanation N Value)
Get prediction explanations rows one by one as
PredictionExplanationsRow
objects:
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
prediction_explanations_id = '5506fcd98bd88f1641a720a3'
pe = dr.PredictionExplanations.get(project_id, prediction_explanations_id)
for row in pe.get_rows():
print(row.prediction_explanations)
Get all rows as pandas.DataFrame
:
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
prediction_explanations_id = '5506fcd98bd88f1641a720a3'
pe = dr.PredictionExplanations.get(project_id, prediction_explanations_id)
prediction_explanations_df = pe.get_all_as_dataframe()
Download all rows to a file as CSV document:
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
prediction_explanations_id = '5506fcd98bd88f1641a720a3'
pe = dr.PredictionExplanations.get(project_id, prediction_explanations_id)
pe.download_to_csv('prediction_explanations.csv')
Adjusted Predictions In Prediction Explanations¶
In some projects, such as insurance projects, the prediction adjusted by exposure is more useful than the raw prediction. For example, in a project with an exposure column, the raw prediction (e.g. claim counts) is divided by the exposure (e.g. time), so the adjusted prediction provides insight into the predicted claim counts per unit of time. To include that information, set exclude_adjusted_predictions to False in the corresponding method calls.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
prediction_explanations_id = '5506fcd98bd88f1641a720a3'
pe = dr.PredictionExplanations.get(project_id, prediction_explanations_id)
pe.download_to_csv('prediction_explanations.csv', exclude_adjusted_predictions=False)
prediction_explanations_df = pe.get_all_as_dataframe(exclude_adjusted_predictions=False)
Deprecated Reason Codes Interface¶
This feature was previously referred to using the Reason Codes API. This interface is now deprecated and should be replaced with the Prediction Explanations interface.
DataRobot Prime¶
DataRobot Prime is a premium add-on product intended to allow downloading executable code approximating models. If the feature is unavailable to you, please contact your Account Representative. For more information about this feature, see the documentation within the DataRobot webapp.
Approximate a Model¶
Given a Model you wish to approximate, Model.request_approximation
will start a job creating
several Ruleset
objects approximating the parent model. Each of those rulesets will identify
how many rules were used to approximate the model, as well as the validation score
the approximation achieved.
rulesets_job = model.request_approximation()
rulesets = rulesets_job.get_result_when_complete()
for ruleset in rulesets:
info = (ruleset.id, ruleset.rule_count, ruleset.score)
print('id: {}, rule_count: {}, score: {}'.format(*info))
Prime Models vs. Models¶
Given a ruleset, you can create a model based on that ruleset. We consider such models to be Prime
models. The PrimeModel
class inherits from the Model
class, so anything a Model can do,
a PrimeModel can do as well.
The PrimeModel
objects available within a Project
can be listed by
project.get_prime_models
, or a particular one can be retrieved via PrimeModel.get
. If a
ruleset has not yet had a model built for it, ruleset.request_model
can be used to start
a job to make a PrimeModel using a particular ruleset.
rulesets = parent_model.get_rulesets()
selected_ruleset = sorted(rulesets, key=lambda x: x.score)[-1]
if selected_ruleset.model_id:
prime_model = PrimeModel.get(selected_ruleset.project_id, selected_ruleset.model_id)
else:
prime_job = selected_ruleset.request_model()
prime_model = prime_job.get_result_when_complete()
The PrimeModel
class has two additional attributes and one additional method. The attributes
are ruleset
, which is the Ruleset used in the PrimeModel, and parent_model_id, which is
the id of the model it approximates.
Finally, the new method defined is request_download_validation, which is used to prepare the code
download for the model and is discussed later in this document.
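For example, a minimal sketch using the attributes described above; prime_model is assumed to have been retrieved as in the earlier snippet:
import datarobot as dr
# number of rules in the ruleset used by this PrimeModel
print(prime_model.ruleset.rule_count)
# retrieve the parent model this PrimeModel approximates
parent_model = dr.Model.get(project=prime_model.project_id, model_id=prime_model.parent_model_id)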
Retrieving Code from a PrimeModel¶
Given a PrimeModel, you can download the code used to approximate the parent model, and view and execute it locally.
The first step is to validate the PrimeModel, which runs some basic validation of the generated
code, as well as preparing it for download. We use the PrimeFile
object to represent code
that is ready to download. PrimeFiles
can be prepared by the request_download_validation
method on PrimeModel
objects, and listed from a project with the get_prime_files
method.
Once you have a PrimeFile
you can check the is_valid
attribute to verify the code passed
basic validation, and then download it to a local file with download
.
validation_job = prime_model.request_download_validation(enums.PRIME_LANGUAGE.PYTHON)
prime_file = validation_job.get_result_when_complete()
if not prime_file.is_valid:
raise ValueError('File was not valid')
prime_file.download('/home/myuser/drCode/primeModelCode.py')
Rating Table¶
A rating table is an exportable CSV representation of a Generalized Additive Model. It contains information about the features and coefficients used to make predictions. Users can influence predictions by downloading and editing values in a rating table, then re-uploading the table and using it to create a new model.
See the page about interpreting Generalized Additive Models’ output in the DataRobot user guide for more details on how to interpret and edit rating tables.
Download A Rating Table¶
You can retrieve a rating table from the list of rating tables in a project:
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
project = dr.Project.get(project_id)
rating_tables = project.get_rating_tables()
rating_table = rating_tables[0]
Or you can retrieve a rating table from a specific model. The model must already exist:
import datarobot as dr
from datarobot.models import RatingTableModel, RatingTable
project_id = '5506fcd38bd88f5953219da0'
project = dr.Project.get(project_id)
# Get model from list of models with a rating table
rating_table_models = project.get_rating_table_models()
rating_table_model = rating_table_models[0]
# Or retrieve model by id. The model must have a rating table.
model_id = '5506fcd98bd88f1641a720a3'
rating_table_model = dr.RatingTableModel.get(project=project_id, model_id=model_id)
# Then retrieve the rating table from the model
rating_table_id = rating_table_model.rating_table_id
rating_table = dr.RatingTable.get(project_id, rating_table_id)
Then you can download the contents of the rating table:
rating_table.download('./my_rating_table.csv')
Uploading A Rating Table¶
After you’ve retrieved the rating table CSV and made the necessary edits, you can re-upload the CSV so you can create a new model from it:
job = dr.RatingTable.create(project_id, model_id, './my_rating_table.csv')
new_rating_table = job.get_result_when_complete()
job = new_rating_table.create_model()
model = job.get_result_when_complete()
Training Predictions¶
The training predictions interface allows computing and retrieving out-of-sample predictions for a model using the original project dataset. The predictions can be computed for all the rows, or restricted to validation or holdout data. As the predictions generated will be out-of-sample, they can be expected to have different results than if the project dataset were reuploaded as a prediction dataset.
Quick reference¶
Training predictions generation is an asynchronous process. This means that when starting
predictions with datarobot.models.Model.request_training_predictions()
you will receive back a
datarobot.models.TrainingPredictionsJob
for tracking the process responsible for fulfilling your request.
Actual predictions may be obtained with the help of a
datarobot.models.training_predictions.TrainingPredictions
object returned as the result of
the training predictions job.
There are three ways to retrieve them:
- Iterate prediction rows one by one as named tuples:
import datarobot as dr
# Calculate new training predictions on all dataset
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.ALL)
training_predictions = training_predictions_job.get_result_when_complete()
# Fetch rows from API and print them
for prediction in training_predictions.iterate_rows(batch_size=250):
print(prediction.row_id, prediction.prediction)
- Get all prediction rows as a
pandas.DataFrame
object:
import datarobot as dr
# Calculate new training predictions on holdout partition of dataset
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.HOLDOUT)
training_predictions = training_predictions_job.get_result_when_complete()
# Fetch training predictions as data frame
dataframe = training_predictions.get_all_as_dataframe()
- Download all prediction rows to a file as a CSV document:
import datarobot as dr
# Calculate new training predictions on all dataset
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.ALL)
training_predictions = training_predictions_job.get_result_when_complete()
# Fetch training predictions and save them to file
training_predictions.download_to_csv('my-training-predictions.csv')
Monotonic Constraints¶
Training with monotonic constraints allows users to force models to learn monotonic relationships with respect to some features and the target. This helps users create accurate models that comply with regulations (e.g. insurance, banking). Currently, only certain blueprints (e.g. xgboost) support this feature, and it is only supported for regression and binary classification projects. Working with monotonic constraints typically follows one of the following two workflows:
Workflow one - Running a project with default monotonic constraints
- set the target and specify default constraint lists for the project
- when running autopilot or manually training models without overriding constraint settings, all blueprints that support monotonic constraints will use the specified default constraint featurelists
Workflow two - Running a model with specific monotonic constraints
- create featurelists for monotonic constraints
- train a blueprint that supports monotonic constraints while specifying monotonic constraint featurelists
- the specified constraints will be used, regardless of the defaults on the blueprint
Creating featurelists¶
When specifying monotonic constraints, users must pass a reference to a featurelist containing only the features to be constrained, one for features that should monotonically increase with the target and another for those that should monotonically decrease with the target.
import datarobot as dr
project = dr.Project.get(project_id)
features_mono_up = ['feature_0', 'feature_1'] # features that have monotonically increasing relationship with target
features_mono_down = ['feature_2', 'feature_3'] # features that have monotonically decreasing relationship with target
flist_mono_up = project.create_featurelist(name='mono_up',
features=features_mono_up)
flist_mono_down = project.create_featurelist(name='mono_down',
features=features_mono_down)
Specify default monotonic constraints for a project¶
When setting the target, the user can specify default monotonic constraints for the project, to ensure that autopilot models use the desired settings, and optionally to ensure that only blueprints supporting monotonic constraints appear in the project. Regardless of the defaults specified during target selection, the user can override them when manually training a particular model.
import datarobot as dr
from datarobot.enums import AUTOPILOT_MODE
advanced_options = dr.AdvancedOptions(
monotonic_increasing_featurelist_id=flist_mono_up.id,
monotonic_decreasing_featurelist_id=flist_mono_down.id,
only_include_monotonic_blueprints=True)
project = dr.Project.get(project_id)
project.set_target(target='target', mode=AUTOPILOT_MODE.FULL_AUTO, advanced_options=advanced_options)
Retrieve models and blueprints using monotonic constraints¶
When retrieving models, users can inspect which models support monotonic constraints, and which actually enforce them. Some models will not support monotonic constraints at all, and some may support constraints but not have any constrained features specified.
import datarobot as dr
project = dr.Project.get(project_id)
models = project.get_models()
# retrieve models that support monotonic constraints
models_support_mono = [model for model in models if model.supports_monotonic_constraints]
# retrieve models that support and enforce monotonic constraints
models_enforce_mono = [model for model in models
if (model.monotonic_increasing_featurelist_id or
model.monotonic_decreasing_featurelist_id)]
When retrieving blueprints, users can check if they support monotonic constraints and see what default constraint lists are associated with them. The monotonic featurelist ids associated with a blueprint will be used every time it is trained, unless the user specifically overrides them at model submission time.
import datarobot as dr
project = dr.Project.get(project_id)
blueprints = project.get_blueprints()
# retrieve blueprints that support monotonic constraints
blueprints_support_mono = [blueprint for blueprint in blueprints if blueprint.supports_monotonic_constraints]
# retrieve blueprints that support and enforce monotonic constraints
blueprints_enforce_mono = [blueprint for blueprint in blueprints
if (blueprint.monotonic_increasing_featurelist_id or
blueprint.monotonic_decreasing_featurelist_id)]
Train a model with specific monotonic constraints¶
Even after specifying default settings for the project, users can override them to train a new model with different constraints, if desired.
import datarobot as dr
features_mono_up = ['feature_2', 'feature_3'] # features that have monotonically increasing relationship with target
features_mono_down = ['feature_0', 'feature_1'] # features that have monotonically decreasing relationship with target
project = dr.Project.get(project_id)
flist_mono_up = project.create_featurelist(name='mono_up',
features=features_mono_up)
flist_mono_down = project.create_featurelist(name='mono_down',
features=features_mono_down)
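# blueprint and featurelist below are assumed to have been obtained earlier,
# e.g. the blueprint from project.get_blueprints()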
model_job_id = project.train(
blueprint,
sample_pct=55,
featurelist_id=featurelist.id,
monotonic_increasing_featurelist_id=flist_mono_up.id,
monotonic_decreasing_featurelist_id=flist_mono_down.id
)
Database Connectivity¶
Databases are a widely used tool for carrying valuable business data. To enable integration with a variety of enterprise databases, DataRobot provides a “self-service” JDBC platform for database connectivity setup. Once configured, you can read data from production databases for model building and predictions. This allows you to quickly train and retrain models on that data, and avoids the unnecessary step of exporting data from your enterprise database to a CSV for ingest to DataRobot. It allows access to more diverse data, which results in more accurate models.
The steps describing how to set up your database connections use the following terminology:
- DataStore : A configured connection to a database. It has a name, a specified driver, and a JDBC URL. You can register data stores with DataRobot for ease of re-use. A data store has one connector but can have many data sources.
- DataSource : A configured connection to the backing data store (the location of data within a given endpoint). A data source specifies, via SQL query or selected table and schema data, which data to extract from the data store to use for modeling or predictions. A data source has one data store and one connector but can have many datasets.
- DataDriver : The software that allows the DataRobot application to interact with a database; each data store is associated with one driver (created by the admin). The driver configuration saves the storage location in DataRobot of the JAR file and any additional dependency files associated with the driver.
- Dataset : Data, a file or the content of a data source, at a particular point in time. A data source can produce multiple datasets; a dataset has exactly one data source.
The expected workflow when setting up projects or prediction datasets is:
- The administrator sets up a datarobot.DataDriver for accessing a particular database. For any particular driver, this setup is done once for the entire system and then the resulting driver is used by all users.
- Users create a datarobot.DataStore which represents an interface to a particular database, using that driver.
- Users create a datarobot.DataSource representing a particular set of data to be extracted from the DataStore.
- Users create projects and prediction datasets from a DataSource.
Besides the described workflow for creating projects and prediction datasets, users can manage their DataStores and DataSources and admins can manage Drivers by listing, retrieving, updating and deleting existing instances.
Cloud users: This feature is turned off by default. To enable the feature, contact your CFDS or DataRobot Support.
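As an example of managing existing instances, data stores can be listed, renamed, and removed; a minimal sketch using the DataStore methods documented in the API reference below:
import datarobot as dr
data_stores = dr.DataStore.list()                 # data stores available to the user
data_store = data_stores[0]
data_store.update(canonical_name='Renamed DB')    # rename the data store
data_store.delete()                               # remove it entirely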
Creating Drivers¶
The admin should specify class_name
, the name of the Java class in the Java archive
which implements the java.sql.Driver
interface; canonical_name
, a user-friendly name
for the resulting driver to display in the API and the GUI; and files
, a list of local files which
contain the driver.
>>> import datarobot as dr
>>> driver = dr.DataDriver.create(
... class_name='org.postgresql.Driver',
... canonical_name='PostgreSQL',
... files=['/tmp/postgresql-42.2.2.jar']
... )
>>> driver
DataDriver('PostgreSQL')
Creating DataStores¶
After the admin has created drivers, any user can use them for DataStore
creation.
A DataStore represents a JDBC database. When creating them, users should specify type
,
which currently must be jdbc
; canonical_name
, a user-friendly name to display
in the API and GUI for the DataStore; driver_id
, the id of the driver to use to connect
to the database; and jdbc_url
, the full URL specifying the database connection settings
like database type, server address, port, and database name.
>>> import datarobot as dr
>>> data_store = dr.DataStore.create(
... data_store_type='jdbc',
... canonical_name='Demo DB',
... driver_id='5a6af02eb15372000117c040',
... jdbc_url='jdbc:postgresql://my.db.address.org:5432/perftest'
... )
>>> data_store
DataStore('Demo DB')
>>> data_store.test(username='username', password='password')
{'message': 'Connection successful'}
Creating DataSources¶
Once users have a DataStore, they can query datasets via the DataSource entity,
which represents a query. When creating a DataSource, users first create a
datarobot.DataSourceParameters
object from a DataStore’s id and a query,
and then create the DataSource with a type
, currently always jdbc
; a canonical_name
,
the user-friendly name to display in the API and GUI, and params
, the DataSourceParameters
object.
>>> import datarobot as dr
>>> params = dr.DataSourceParameters(
... data_store_id='5a8ac90b07a57a0001be501e',
... query='SELECT * FROM airlines10mb WHERE "Year" >= 1995;'
... )
>>> data_source = dr.DataSource.create(
... data_source_type='jdbc',
... canonical_name='airlines stats after 1995',
... params=params
... )
>>> data_source
DataSource('airlines stats after 1995')
Creating Projects¶
Given a DataSource, users can create new projects from it.
>>> import datarobot as dr
>>> project = dr.Project.create_from_data_source(
... data_source_id='5ae6eee9962d740dd7b86886',
... username='username',
... password='password'
... )
Creating Predictions¶
Given a DataSource, new prediction datasets can be created for any project.
>>> import datarobot as dr
>>> project = dr.Project.get('5ae6f296962d740dd7b86887')
>>> prediction_dataset = project.upload_dataset_from_data_source(
... data_source_id='5ae6eee9962d740dd7b86886',
... username='username',
... password='password'
... )
Model Recommendation¶
During the Autopilot modeling process, DataRobot will recommend up to three well-performing models.
Warning
Model recommendations are only generated when you run full Autopilot.
One of them (the most accurate individual, non-blender model) will be prepared for deployment. In the preparation process, DataRobot:
- Calculates feature impact for the selected model and uses it to generate a reduced feature list.
- Retrains the selected model on the reduced feature list. If the new model performs better than the original model, DataRobot uses the new model for the next stage. Otherwise, the original model is used.
- Retrains the selected model on a higher sample size. If the new model performs better than the original model, DataRobot selects it as Recommended for Deployment. Otherwise, the original model is selected.
Note
The higher sample size DataRobot uses in Step 3 is either:
- Up to holdout if the training sample size does not exceed the maximum Autopilot size threshold: sample size is the training set plus the validation set (for TVH) or 5-folds (for CV). In this case, DataRobot compares retrained and original models on the holdout score.
- Up to validation if the training sample size does exceed the maximum Autopilot size threshold: sample size is the training set (for TVH) or 4-folds (for CV). In this case, DataRobot compares retrained and original models on the validation score.
The three types of recommendations are the following:
- Recommended for Deployment. This is the most accurate individual, non-blender model on the Leaderboard. This model is ready for deployment.
- Most Accurate. Based on the validation or cross-validation results, this model is the most accurate model overall on the Leaderboard (in most cases, a blender).
- Fast & Accurate. This is the most accurate individual model on the Leaderboard that meets a set of prediction speed guidelines. If no model meets the guidelines, the badge is not applied.
Retrieve all recommendations¶
The following code will return all models recommended for the project.
import datarobot as dr
recommendations = dr.ModelRecommendation.get_all(project_id)
Retrieve a default recommendation¶
If you are unsure about the tradeoffs between the various types of recommendations, DataRobot can make this choice for you. The following route will return the Recommended for Deployment model to use for predictions for the project.
import datarobot as dr
recommendation = dr.ModelRecommendation.get(project_id)
Retrieve a specific recommendation¶
If you know which recommendation you want to use, you can select a specific recommendation using the following code.
import datarobot as dr
recommendation_type = dr.enums.RECOMMENDED_MODEL_TYPE.FAST_ACCURATE
recommendations = dr.ModelRecommendation.get(project_id, recommendation_type)
Get recommended model¶
You can use method get_model() of a recommendation object to retrieve a recommended model.
import datarobot as dr
recommendation = dr.ModelRecommendation.get(project_id)
recommended_model = recommendation.get_model()
Sharing¶
Once you have created data stores or data sources, you may want to share them with collaborators. DataRobot provides an API for sharing the following entities:
- Data Sources and Data Stores ( see Database Connectivity for more info on connecting to JDBC databases)
- Projects
- Calendar Files
- Model Deployments (Only in the REST API, not yet in this Python client)
Access Levels¶
Entities can be shared at varying access levels. For example, you can allow someone to create projects from a data source you have built without letting them delete it.
Each entity type uses slightly different permission names intended to convey more specifically what kind of actions are available, and these roles fall into three categories. These generic role names can be used in the sharing API for any entity.
For the complete set of actions granted by each role on a given entity, please see the user documentation in the web application.
- OWNER
- used for all entities
- allows any action including deletion
- READ_WRITE
- known as EDITOR on data sources and data stores
- allows modifications to the state, e.g. renaming and creating data sources from a data store, but not deleting the entity
- READ_ONLY
- known as CONSUMER on data sources and data stores
- for data sources, enables creating projects and predictions; for data stores, allows viewing them only.
Finally, when a user’s new role is specified as None
, their access will be revoked.
In addition to the role, some entities (currently only data sources and data stores) allow separate control over whether a new user should be able to share that entity further. When granting access to a user, the can_share parameter determines whether that user can, in turn, share this entity with another user. When this parameter is specified as false, the user in question will have all the access to the entity granted by their role and be able to remove themselves if desired, but be unable to change the role of any other user.
Examples¶
Transfer access to the data source from old_user@datarobot.com to new_user@datarobot.com
import datarobot as dr
new_access = dr.SharingAccess('new_user@datarobot.com',
                              dr.enums.SHARING_ROLE.OWNER, can_share=True)
access_list = [dr.SharingAccess('old_user@datarobot.com', None), new_access]
dr.DataSource.get('my-data-source-id').share(access_list)
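Grant read-only access to a data store without allowing the recipient to share it further; a sketch assuming data stores expose the same share interface as data sources, with a hypothetical email and id:
import datarobot as dr
access = dr.SharingAccess('collaborator@datarobot.com',
                          dr.enums.SHARING_ROLE.READ_ONLY, can_share=False)
dr.DataStore.get('my-data-store-id').share([access])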
API Reference¶
Advanced Options¶
-
class
datarobot.helpers.
AdvancedOptions
(weights=None, response_cap=None, blueprint_threshold=None, seed=None, smart_downsampled=False, majority_downsampling_rate=None, offset=None, exposure=None, accuracy_optimized_mb=None, scaleout_modeling_mode=None, events_count=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, only_include_monotonic_blueprints=None)¶ Used when setting the target of a project to set advanced options of modeling process.
Parameters: - weights : string, optional
The name of a column indicating the weight of each row
- response_cap : float in [0.5, 1), optional
Quantile of the response distribution to use for response capping.
- blueprint_threshold : int, optional
Number of hours models are permitted to run before being excluded from later autopilot stages. Minimum 1.
- seed : int
a seed to use for randomization
- smart_downsampled : bool
whether to use smart downsampling to throw away excess rows of the majority class. Only applicable to classification and zero-boosted regression projects.
- majority_downsampling_rate : float
the percentage between 0 and 100 of the majority rows that should be kept. Specify only if using smart downsampling. May not cause the majority class to become smaller than the minority class.
- offset : list of str, optional
(New in version v2.6) the list of the names of the columns containing the offset of each row
- exposure : string, optional
(New in version v2.6) the name of a column containing the exposure of each row
- accuracy_optimized_mb : bool, optional
(New in version v2.6) Include additional, longer-running models that will be run by the autopilot and available to run manually.
- scaleout_modeling_mode : string, optional
(New in version v2.8) Specifies the behavior of Scaleout models for the project. This is one of
datarobot.enums.SCALEOUT_MODELING_MODE
. If datarobot.enums.SCALEOUT_MODELING_MODE.DISABLED, no models will run during autopilot or show in the list of available blueprints. Scaleout models must be disabled for some partitioning settings including projects using datetime partitioning or projects using offset or exposure columns. If datarobot.enums.SCALEOUT_MODELING_MODE.REPOSITORY_ONLY, scaleout models will be in the list of available blueprints but not run during autopilot. If datarobot.enums.SCALEOUT_MODELING_MODE.AUTOPILOT, scaleout models will run during autopilot and be in the list of available blueprints. Scaleout models are only supported in the Hadoop environment with the corresponding user permission set.
- events_count : string, optional
(New in version v2.8) the name of a column specifying events count.
- monotonic_increasing_featurelist_id : string, optional
(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced. When specified, this will set a default for the project that can be overridden at model submission time if desired.
- monotonic_decreasing_featurelist_id : string, optional
(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced. When specified, this will set a default for the project that can be overridden at model submission time if desired.
- only_include_monotonic_blueprints : bool, optional
(new in version 2.11) when true, only blueprints that support enforcing monotonic constraints will be available in the project or selected for the autopilot.
Examples
import datarobot as dr advanced_options = dr.AdvancedOptions( weights='weights_column', offset=['offset_column'], exposure='exposure_column', response_cap=0.7, blueprint_threshold=2, smart_downsampled=True, majority_downsampling_rate=75.0)
Blueprint¶
-
class
datarobot.models.
Blueprint
(id=None, processes=None, model_type=None, project_id=None, blueprint_category=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None)¶ A Blueprint which can be used to fit models
Attributes: - id : str
the id of the blueprint
- processes : list of str
the processes used by the blueprint
- model_type : str
the model produced by the blueprint
- project_id : str
the project the blueprint belongs to
- blueprint_category : str
(New in version v2.6) Describes the category of the blueprint and the kind of model it produces.
-
classmethod
get
(project_id, blueprint_id)¶ Retrieve a blueprint.
Parameters: - project_id : str
The project’s id.
- blueprint_id : str
Id of blueprint to retrieve.
Returns: - blueprint : Blueprint
The queried blueprint.
-
get_chart
()¶ Retrieve a chart.
Returns: - BlueprintChart
The current blueprint chart.
-
get_documents
()¶ Get documentation for tasks used in the blueprint.
Returns: - list of BlueprintTaskDocument
All documents available for blueprint.
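Examples
A minimal usage sketch based on the methods documented above; the project and blueprint ids are hypothetical.
import datarobot as dr
blueprint = dr.models.Blueprint.get('5506fcd38bd88f5953219da0', '5506fcd98bd88f1641a720a4')
chart = blueprint.get_chart()           # BlueprintChart for this blueprint
print(chart.to_graphviz())              # DOT representation of the chart
docs = blueprint.get_documents()        # documentation for the blueprint's tasks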
-
class
datarobot.models.
BlueprintTaskDocument
(title=None, task=None, description=None, parameters=None, links=None, references=None)¶ Document describing a task from a blueprint.
Attributes: - title : str
Title of document.
- task : str
Name of the task described in document.
- description : str
Task description.
- parameters : list of dict(name, type, description)
Parameters that task can receive in human-readable format.
- links : list of dict(name, url)
External links used in document
- references : list of dict(name, url)
References used in document. When no link available url equals None.
-
class
datarobot.models.
BlueprintChart
(nodes, edges)¶ A Blueprint chart that can be used to understand data flow in blueprint.
Attributes: - nodes : list of dict (id, label)
Chart nodes, id unique in chart.
- edges : list of tuple (id1, id2)
Directions of data flow between blueprint chart nodes.
-
classmethod
get
(project_id, blueprint_id)¶ Retrieve a blueprint chart.
Parameters: - project_id : str
The project’s id.
- blueprint_id : str
Id of blueprint to retrieve chart.
Returns: - BlueprintChart
The queried blueprint chart.
-
to_graphviz
()¶ Get blueprint chart in graphviz DOT format.
Returns: - unicode
String representation of chart in graphviz DOT language.
-
class
datarobot.models.
ModelBlueprintChart
(nodes, edges)¶ A Blueprint chart that can be used to understand data flow in a model. A model blueprint chart represents the reduced repository blueprint chart, containing only the elements used to build this particular model.
Attributes: - nodes : list of dict (id, label)
Chart nodes, id unique in chart.
- edges : list of tuple (id1, id2)
Directions of data flow between blueprint chart nodes.
-
classmethod
get
(project_id, model_id)¶ Retrieve a model blueprint chart.
Parameters: - project_id : str
The project’s id.
- model_id : str
Id of model to retrieve model blueprint chart.
Returns: - ModelBlueprintChart
The queried model blueprint chart.
-
to_graphviz
()¶ Get blueprint chart in graphviz DOT format.
Returns: - unicode
String representation of chart in graphviz DOT language.
Calendar File¶
-
class
datarobot.
CalendarFile
(calendar_end_date=None, calendar_start_date=None, created=None, id=None, name=None, num_event_types=None, num_events=None, project_ids=None, role=None)¶ Represents the data for a calendar file
Attributes: - id : str
The id of the calendar file.
- calendar_start_date : str
The earliest date in the calendar.
- calendar_end_date : str
The last date in the calendar.
- created : str
The date this calendar was created, i.e. uploaded to DR.
- name : str
The name of the calendar.
- num_event_types : int
The number of different event types.
- num_events : int
The number of events this calendar has.
- project_ids : list of strings
A list containing the projectIds of the projects using this calendar.
- role : str
The access role the user has for this calendar.
-
classmethod
create
(file_path, calendar_name=None)¶ Creates a calendar using the given file. The provided file must be a CSV in the format:
Date, Event
<date>, <event_type>
<date>, <event_type>
A header row is required.
Parameters: - file_path : string
A string representing a path to a local csv file.
- calendar_name : string, optional
A name to assign to the calendar. Defaults to the name of the file if not provided.
Returns: - calendar_file : CalendarFile
Instance with initialized data.
Raises: - AsyncProcessUnsuccessfulError
Raised if there was an error processing the provided calendar file.
Examples
# Creating a calendar with a specified name cal = dr.CalendarFile.create('/home/calendars/somecalendar.csv', calendar_name='Some Calendar Name') cal.id >>> 5c1d4904211c0a061bc93013 cal.name >>> Some Calendar Name # Creating a calendar without specifying a name cal = dr.CalendarFile.create('/home/calendars/somecalendar.csv') cal.id >>> 5c1d4904211c0a061bc93012 cal.name >>> somecalendar.csv
-
classmethod
get
(calendar_id)¶ Gets the details of a calendar, given the id.
Parameters: - calendar_id : str
The identifier of the calendar.
Returns: - calendar_file : CalendarFile
The requested calendar.
Raises: - DataError
Raised if the calendar_id is invalid, i.e. the specified CalendarFile does not exist.
Examples
cal = dr.CalendarFile.get(some_calendar_id) cal.id >>> some_calendar_id
-
classmethod
list
(project_id=None, batch_size=None)¶ Gets the details of all calendars this user has view access for.
Parameters: - project_id : str, optional
If provided, will filter for calendars associated only with the specified project.
- batch_size : int, optional
The number of calendars to retrieve in a single API call. If specified, the client may make multiple calls to retrieve the full list of calendars. If not specified, an appropriate default will be chosen by the server.
Returns: - calendar_list : list of
CalendarFile
A list of CalendarFile objects.
Examples
calendars = dr.CalendarFile.list() len(calendars) >>> 10
-
classmethod
delete
(calendar_id)¶ Deletes the calendar specified by calendar_id.
Parameters: - calendar_id : str
The id of the calendar to delete. The requester must have OWNER access for this calendar.
Raises: - ClientError
Raised if an invalid calendar_id is provided.
Examples
# Deleting with a valid calendar_id status_code = dr.CalendarFile.delete(some_calendar_id) status_code >>> 204 dr.CalendarFile.get(some_calendar_id) >>> ClientError: Item not found
-
classmethod
update_name
(calendar_id, new_calendar_name)¶ Changes the name of the specified calendar to the specified name. The requester must have at least READ_WRITE permissions on the calendar.
Parameters: - calendar_id : str
The id of the calendar to update.
- new_calendar_name : str
The new name to set for the specified calendar.
Returns: - status_code : int
200 for success
Raises: - ClientError
Raised if an invalid calendar_id is provided.
Examples
response = dr.CalendarFile.update_name(some_calendar_id, some_new_name) response >>> 200 cal = dr.CalendarFile.get(some_calendar_id) cal.name >>> some_new_name
-
classmethod
share
(calendar_id, access_list)¶ Shares the calendar with the specified users, assigning the specified roles.
Parameters: - calendar_id : str
The id of the calendar to update
- access_list:
A list of dr.SharingAccess objects. Specify None for the role to delete a user’s access from the specified CalendarFile. For more information on specific access levels, see the sharing documentation.
Returns: - status_code : int
200 for success
Raises: - ClientError
Raised if unable to update permissions for a user.
- AssertionError
Raised if access_list is invalid.
Examples
# assuming some_user is a valid user, share this calendar with some_user sharing_list = [dr.SharingAccess(some_user_username, dr.enums.SHARING_ROLE.READ_WRITE)] response = dr.CalendarFile.share(some_calendar_id, sharing_list) response.status_code >>> 200 # delete some_user from this calendar, assuming they have access of some kind already delete_sharing_list = [dr.SharingAccess(some_user_username, None)] response = dr.CalendarFile.share(some_calendar_id, delete_sharing_list) response.status_code >>> 200 # Attempt to add an invalid user to a calendar invalid_sharing_list = [dr.SharingAccess(invalid_username, dr.enums.SHARING_ROLE.READ_WRITE)] dr.CalendarFile.share(some_calendar_id, invalid_sharing_list) >>> ClientError: Unable to update access for this calendar
-
classmethod
get_access_list
(calendar_id, batch_size=None)¶ Retrieve a list of users that have access to this calendar.
Parameters: - calendar_id : str
The id of the calendar to retrieve the access list for.
- batch_size : int, optional
The number of access records to retrieve in a single API call. If specified, the client may make multiple calls to retrieve the full list of calendars. If not specified, an appropriate default will be chosen by the server.
Returns: - access_control_list : list of
SharingAccess
A list of
SharingAccess
objects.
Raises: - ClientError
Raised if user does not have access to calendar or calendar does not exist.
Compliance Documentation Templates¶
-
class
datarobot.models.compliance_doc_template.
ComplianceDocTemplate
(id, creator_id, creator_username, name, org_id=None, sections=None)¶ A compliance documentation template. Templates are used to customize contents of
ComplianceDocumentation
.New in version v2.14.
Notes
Each section dictionary has the following schema:
- title : title of the section
- type : type of section. Must be one of “datarobot”, “user” or “table_of_contents”.
Each type of section has a different set of attributes described below.
Sections of type "datarobot" represent sections owned by DataRobot. DataRobot sections have the following additional attributes:
- content_id : The identifier of the content in this section. You can get the default template with get_default for a complete list of possible DataRobot section content ids.
- sections : list of sub-section dicts nested under the parent section.
Sections of type "user" represent sections with user-defined content. Those sections may contain text generated by the user and have the following additional fields:
- regularText : regular text of the section, optionally separated by \n to split paragraphs.
- highlightedText : highlighted text of the section, optionally separated by \n to split paragraphs.
- sections : list of sub-section dicts nested under the parent section.
Sections of type "table_of_contents" represent a table of contents and have no additional attributes.
Attributes: - id : str
the id of the template
- name : str
the name of the template.
- creator_id : str
the id of the user who created the template
- creator_username : str
username of the user who created the template
- org_id : str
the id of the organization the template belongs to
- sections : list of dicts
the sections of the template describing the structure of the document. Section schema is described in Notes section above.
-
classmethod
get_default
(template_type=None)¶ Get a default DataRobot template. This template is used for generating compliance documentation when no template is specified.
Parameters: - template_type : str or None
Type of the template. Currently supported values are “normal” and “time_series”
Returns: - template : ComplianceDocTemplate
the default template object with
sections
attribute populated with default sections.
-
classmethod
create_from_json_file
(name, path)¶ Create a template with the specified name and sections in a JSON file.
This is useful when working with sections in a JSON file. Example:
default_template = ComplianceDocTemplate.get_default() default_template.sections_to_json_file('path/to/example.json') # ... edit example.json in your editor my_template = ComplianceDocTemplate.create_from_json_file( name='my template', path='path/to/example.json' )
Parameters: - name : str
the name of the template. Must be unique for your user.
- path : str
the path to find the JSON file at
Returns: - template : ComplianceDocTemplate
the created template
-
classmethod
create
(name, sections)¶ Create a template with the specified name and sections.
Parameters: - name : str
the name of the template. Must be unique for your user.
- sections : list
list of section objects
Returns: - template : ComplianceDocTemplate
the created template
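Examples
A minimal sketch of creating a template from user-defined sections; the section contents shown are hypothetical.
from datarobot.models.compliance_doc_template import ComplianceDocTemplate
sections = [{'title': 'Overview',
             'type': 'user',
             'regularText': 'Plain text paragraph for the overview.',
             'highlightedText': 'A highlighted note.'}]
template = ComplianceDocTemplate.create(name='my custom template', sections=sections)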
-
classmethod
get
(template_id)¶ Retrieve a specific template.
Parameters: - template_id : str
the id of the template to retrieve
Returns: - template : ComplianceDocTemplate
the retrieved template
-
classmethod
list
(name_part=None, limit=None, offset=None)¶ Get a paginated list of compliance documentation template objects.
Parameters: - name_part : str or None
Return only the templates with names matching specified string. The matching is case-insensitive.
- limit : int
The number of records to return. The server will use a (possibly finite) default if not specified.
- offset : int
The number of records to skip.
Returns: - templates : list of ComplianceDocTemplate
the list of template objects
-
sections_to_json_file
(path, indent=2)¶ Save sections of the template to a json file at the specified path
Parameters: - path : str
the path to save the file to
- indent : int
indentation to use in the json file.
-
update
(name=None, sections=None)¶ Update the name or sections of an existing doc template.
Note that default or non-existent templates can not be updated.
Parameters: - name : str, optional
the new name for the template
- sections : list of dicts
list of sections
-
delete
()¶ Delete the compliance documentation template.
Compliance Documentation¶
-
class
datarobot.models.compliance_documentation.
ComplianceDocumentation
(project_id, model_id, template_id=None)¶ A compliance documentation object.
New in version v2.14.
Examples
doc = ComplianceDocumentation('project-id', 'model-id') job = doc.generate() job.wait_for_completion() doc.download('example.docx')
Attributes: - project_id : str
the id of the project
- model_id : str
the id of the model
- template_id : str or None
optional id of the template for the generated doc. See documentation for
ComplianceDocTemplate
for more info.
-
generate
()¶ Start a job generating model compliance documentation.
Returns: - Job
an instance of an async job
-
download
(filepath)¶ Download the generated compliance documentation file and save it to the specified path. The generated file has a DOCX format.
Parameters: - filepath : str
A file path, e.g. “/path/to/save/compliance_documentation.docx”
Confusion Chart¶
-
class
datarobot.models.confusion_chart.
ConfusionChart
(source, data, source_model_id)¶ Confusion Chart data for model.
Notes
ClassMetrics is a dict containing the following:
- class_name (string) name of the class
- actual_count (int) number of times this class is seen in the validation data
- predicted_count (int) number of times this class has been predicted for the validation data
- f1 (float) F1 score
- recall (float) recall score
- precision (float) precision score
- was_actual_percentages (list of dict) one vs all actual percentages in the format specified below.
  - other_class_name (string) the name of the other class
  - percentage (float) the percentage of the times the other class was predicted when this class was the actual class (from 0 to 1)
- was_predicted_percentages (list of dict) one vs all predicted percentages in the format specified below.
  - other_class_name (string) the name of the other class
  - percentage (float) the percentage of the times the other class was the actual class when this class was predicted (from 0 to 1)
- confusion_matrix_one_vs_all (list of list) 2d list representing the 2x2 one vs all matrix. This represents the True/False Negative/Positive rates as integers for each class. The data structure looks like:
[ [ True Negative, False Positive ], [ False Negative, True Positive ] ]
Attributes: - source : str
Confusion Chart data source. Can be ‘validation’, ‘crossValidation’ or ‘holdout’.
- raw_data : dict
All of the raw data for the Confusion Chart
- confusion_matrix : list of list
The NxN confusion matrix
- classes : list
The names of each of the classes
- class_metrics : list of dicts
List of dicts with schema described as
ClassMetrics
above.- source_model_id : str
ID of the model this Confusion chart represents; in some cases, insights from the parent of a frozen model may be used
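Examples
A sketch of retrieving a confusion chart for a multiclass model; it assumes Model.get_confusion_chart(source) is available in your client version (that method is not documented in this section) and uses hypothetical ids.
import datarobot as dr
model = dr.Model.get(project='5506fcd38bd88f5953219da0', model_id='5506fcd98bd88f1641a720a3')
confusion_chart = model.get_confusion_chart('validation')   # assumed retrieval method
print(confusion_chart.confusion_matrix)                      # NxN confusion matrix
for class_metric in confusion_chart.class_metrics:
    print(class_metric['class_name'], class_metric['f1'])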
Database Connectivity¶
-
class
datarobot.
DataDriver
(id=None, creator=None, base_names=None, class_name=None, canonical_name=None)¶ A data driver
Attributes: - id : str
the id of the driver.
- class_name : str
the Java class name for the driver.
- canonical_name : str
the user-friendly name of the driver.
- creator : str
the id of the user who created the driver.
- base_names : list of str
a list of the file name(s) of the jar files.
-
classmethod
list
()¶ Returns list of available drivers.
Returns: - drivers : list of DataDriver instances
contains a list of available drivers.
Examples
>>> import datarobot as dr >>> drivers = dr.DataDriver.list() >>> drivers [DataDriver('mysql'), DataDriver('RedShift'), DataDriver('PostgreSQL')]
-
classmethod
get
(driver_id)¶ Gets the driver.
Parameters: - driver_id : str
the identifier of the driver.
Returns: - driver : DataDriver
the required driver.
Examples
>>> import datarobot as dr >>> driver = dr.DataDriver.get('5ad08a1889453d0001ea7c5c') >>> driver DataDriver('PostgreSQL')
-
classmethod
create
(class_name, canonical_name, files)¶ Creates the driver. Only available to admin users.
Parameters: - class_name : str
the Java class name for the driver.
- canonical_name : str
the user-friendly name of the driver.
- files : list of str
a list of the file paths on the file system for the driver.
Returns: - driver : DataDriver
the created driver.
Raises: - ClientError
raised if the user is not granted the Can manage JDBC database drivers feature
Examples
>>> import datarobot as dr >>> driver = dr.DataDriver.create( ... class_name='org.postgresql.Driver', ... canonical_name='PostgreSQL', ... files=['/tmp/postgresql-42.2.2.jar'] ... ) >>> driver DataDriver('PostgreSQL')
-
update
(class_name=None, canonical_name=None)¶ Updates the driver. Only available to admin users.
Parameters: - class_name : str
the Java class name for the driver.
- canonical_name : str
the user-friendly name of the driver.
Raises: - ClientError
raised if user is not granted for Can manage JDBC database drivers feature
Examples
>>> import datarobot as dr >>> driver = dr.DataDriver.get('5ad08a1889453d0001ea7c5c') >>> driver.canonical_name 'PostgreSQL' >>> driver.update(canonical_name='postgres') >>> driver.canonical_name 'postgres'
-
delete
()¶ Removes the driver. Only available to admin users.
Raises: - ClientError
raised if the user is not granted the Can manage JDBC database drivers feature
-
class
datarobot.
DataStore
(data_store_id=None, data_store_type=None, canonical_name=None, creator=None, updated=None, params=None, role=None)¶ A data store. Represents a database.
Attributes: - id : str
the id of the data store.
- data_store_type : str
the type of data store.
- canonical_name : str
the user-friendly name of the data store.
- creator : str
the id of the user who created the data store.
- updated : datetime.datetime
the time of the last update
- params : DataStoreParameters
the data store parameters.
-
classmethod
list
()¶ Returns list of available data stores.
Returns: - data_stores : list of DataStore instances
contains a list of available data stores.
Examples
>>> import datarobot as dr
>>> data_stores = dr.DataStore.list()
>>> data_stores
[DataStore('Demo'), DataStore('Airlines')]
-
classmethod
get
(data_store_id)¶ Gets the data store.
Parameters: - data_store_id : str
the identifier of the data store.
Returns: - data_store : DataStore
the required data store.
Examples
>>> import datarobot as dr
>>> data_store = dr.DataStore.get('5a8ac90b07a57a0001be501e')
>>> data_store
DataStore('Demo')
-
classmethod
create
(data_store_type, canonical_name, driver_id, jdbc_url)¶ Creates the data store.
Parameters: - data_store_type : str
the type of data store.
- canonical_name : str
the user-friendly name of the data store.
- driver_id : str
the identifier of the DataDriver.
- jdbc_url : str
the full JDBC url, for example jdbc:postgresql://my.dbaddress.org:5432/my_db.
Returns: - data_store : DataStore
the created data store.
Examples
>>> import datarobot as dr
>>> data_store = dr.DataStore.create(
...     data_store_type='jdbc',
...     canonical_name='Demo DB',
...     driver_id='5a6af02eb15372000117c040',
...     jdbc_url='jdbc:postgresql://my.db.address.org:5432/perftest'
... )
>>> data_store
DataStore('Demo DB')
-
update
(canonical_name=None, driver_id=None, jdbc_url=None)¶ Updates the data store.
Parameters: - canonical_name : str
optional, the user-friendly name of the data store.
- driver_id : str
optional, the identifier of the DataDriver.
- jdbc_url : str
optional, the full JDBC url, for example jdbc:postgresql://my.dbaddress.org:5432/my_db.
Examples
>>> import datarobot as dr
>>> data_store = dr.DataStore.get('5ad5d2afef5cd700014d3cae')
>>> data_store
DataStore('Demo DB')
>>> data_store.update(canonical_name='Demo DB updated')
>>> data_store
DataStore('Demo DB updated')
-
delete
()¶ Removes the DataStore
-
test
(username, password)¶ Tests database connection.
Parameters: - username : str
the username for database authentication.
- password : str
the password for database authentication. The password is encrypted on the server side and is never saved or stored.
Returns: - message : dict
message with status.
Examples
>>> import datarobot as dr
>>> data_store = dr.DataStore.get('5ad5d2afef5cd700014d3cae')
>>> data_store.test(username='db_username', password='db_password')
{'message': 'Connection successful'}
-
schemas
(username, password)¶ Returns list of available schemas.
Parameters: - username : str
the username for database authentication.
- password : str
the password for database authentication. The password is encrypted on the server side and is never saved or stored.
Returns: - response : dict
dict with database name and list of str - available schemas
Examples
>>> import datarobot as dr
>>> data_store = dr.DataStore.get('5ad5d2afef5cd700014d3cae')
>>> data_store.schemas(username='db_username', password='db_password')
{'catalog': 'perftest', 'schemas': ['demo', 'information_schema', 'public']}
-
tables
(username, password, schema=None)¶ Returns list of available tables in schema.
Parameters: - username : str
the username for database authentication.
- password : str
the password for database authentication. The password is encrypted on the server side and is never saved or stored.
- schema : str
optional, the schema name.
Returns: - response : dict
dict with catalog name and tables info
Examples
>>> import datarobot as dr
>>> data_store = dr.DataStore.get('5ad5d2afef5cd700014d3cae')
>>> data_store.tables(username='db_username', password='db_password', schema='demo')
{'tables': [{'type': 'TABLE', 'name': 'diagnosis', 'schema': 'demo'},
            {'type': 'TABLE', 'name': 'kickcars', 'schema': 'demo'},
            {'type': 'TABLE', 'name': 'patient', 'schema': 'demo'},
            {'type': 'TABLE', 'name': 'transcript', 'schema': 'demo'}],
 'catalog': 'perftest'}
-
classmethod
from_server_data
(data, keep_attrs=None)¶ Instantiate an object of this class using the data directly from the server, meaning that the keys may have the wrong camel casing
Parameters: - data : dict
The directly translated dict of JSON from the server. No casing fixes have taken place
- keep_attrs : list
List of the dotted namespace notations for attributes to keep within the object structure even if their values are None
-
get_access_list
()¶ Retrieve what users have access to this data store
New in version v2.14.
Returns: - list of SharingAccess (datarobot.SharingAccess)
-
share
(access_list)¶ Modify the ability of users to access this data store
New in version v2.14.
Parameters: - access_list : list of
SharingAccess
the modifications to make.
Raises: - datarobot.ClientError :
if you do not have permission to share this data store, if the user you’re sharing with doesn’t exist, if the same user appears multiple times in the access_list, or if these changes would leave the data store without an owner.
Examples
Transfer access to the data store from old_user@datarobot.com to new_user@datarobot.com
import datarobot as dr

new_access = dr.SharingAccess('new_user@datarobot.com',
                              dr.enums.SHARING_ROLE.OWNER, can_share=True)
access_list = [dr.SharingAccess('old_user@datarobot.com', None), new_access]

dr.DataStore.get('my-data-store-id').share(access_list)
-
class
datarobot.
DataSource
(data_source_id=None, data_source_type=None, canonical_name=None, creator=None, updated=None, params=None, role=None)¶ A data source. Represents a data request.
Attributes: - data_source_id : str
the id of the data source.
- data_source_type : str
the type of data source.
- canonical_name : str
the user-friendly name of the data source.
- creator : str
the id of the user who created the data source.
- updated : datetime.datetime
the time of the last update.
- params : DataSourceParameters
the data source parameters.
-
classmethod
list
()¶ Returns list of available data sources.
Returns: - data_sources : list of DataSource instances
contains a list of available data sources.
Examples
>>> import datarobot as dr
>>> data_sources = dr.DataSource.list()
>>> data_sources
[DataSource('Diagnostics'), DataSource('Airlines 100mb'), DataSource('Airlines 10mb')]
-
classmethod
get
(data_source_id)¶ Gets the data source.
Parameters: - data_source_id : str
the identifier of the data source.
Returns: - data_source : DataSource
the requested data source.
Examples
>>> import datarobot as dr
>>> data_source = dr.DataSource.get('5a8ac9ab07a57a0001be501f')
>>> data_source
DataSource('Diagnostics')
-
classmethod
create
(data_source_type, canonical_name, params)¶ Creates the data source.
Parameters: - data_source_type : str
the type of data source.
- canonical_name : str
the user-friendly name of the data source.
- params : DataSourceParameters
a list specifying data source parameters.
Returns: - data_source : DataSource
the created data source.
Examples
>>> import datarobot as dr
>>> params = dr.DataSourceParameters(
...     data_store_id='5a8ac90b07a57a0001be501e',
...     query='SELECT * FROM airlines10mb WHERE "Year" >= 1995;'
... )
>>> data_source = dr.DataSource.create(
...     data_source_type='jdbc',
...     canonical_name='airlines stats after 1995',
...     params=params
... )
>>> data_source
DataSource('airlines stats after 1995')
-
update
(canonical_name=None, params=None)¶ Updates the data source.
Parameters: - canonical_name : str
optional, the user-friendly name of the data source.
- params : DataSourceParameters
optional, the updated data source parameters.
Examples
>>> import datarobot as dr
>>> data_source = dr.DataSource.get('5ad840cc613b480001570953')
>>> data_source
DataSource('airlines stats after 1995')
>>> params = dr.DataSourceParameters(
...     query='SELECT * FROM airlines10mb WHERE "Year" >= 1990;'
... )
>>> data_source.update(
...     canonical_name='airlines stats after 1990',
...     params=params
... )
>>> data_source
DataSource('airlines stats after 1990')
-
delete
()¶ Removes the DataSource
-
classmethod
from_server_data
(data, keep_attrs=None)¶ Instantiate an object of this class using the data directly from the server, meaning that the keys may have the wrong camel casing
Parameters: - data : dict
The directly translated dict of JSON from the server. No casing fixes have taken place
- keep_attrs : list
List of the dotted namespace notations for attributes to keep within the object structure even if their values are None
-
get_access_list
()¶ Retrieve what users have access to this data source
New in version v2.14.
Returns: - list of SharingAccess (datarobot.SharingAccess)
-
share
(access_list)¶ Modify the ability of users to access this data source
New in version v2.14.
Parameters: - access_list : list of
SharingAccess
the modifications to make.
Raises: - datarobot.ClientError :
if you do not have permission to share this data source, if the user you’re sharing with doesn’t exist, if the same user appears multiple times in the access_list, or if these changes would leave the data source without an owner
Examples
Transfer access to the data source from old_user@datarobot.com to new_user@datarobot.com
import datarobot as dr

new_access = dr.SharingAccess('new_user@datarobot.com',
                              dr.enums.SHARING_ROLE.OWNER, can_share=True)
access_list = [dr.SharingAccess('old_user@datarobot.com', None), new_access]

dr.DataSource.get('my-data-source-id').share(access_list)
-
class
datarobot.
DataSourceParameters
(data_store_id=None, table=None, schema=None, partition_column=None, query=None, fetch_size=None)¶ Data request configuration
Attributes: - data_store_id : str
the id of the DataStore.
- table : str
optional, the name of specified database table.
- schema : str
optional, the name of the schema associated with the table.
- partition_column : str
optional, the name of the partition column.
- query : str
optional, the user specified SQL query.
- fetch_size : int
optional, a user specified fetch size in the range [1, 20000]. By default a fetchSize will be assigned to balance throughput and memory usage
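Taken together, the classes in this section describe an end-to-end JDBC setup. The following is a minimal, illustrative sketch; the driver choice, connection URL, credentials, and query are placeholder values:
import datarobot as dr

# Pick a driver previously uploaded by an admin (see DataDriver above)
driver = dr.DataDriver.list()[0]

# Register the database as a data store and verify the connection
data_store = dr.DataStore.create(
    data_store_type='jdbc',
    canonical_name='Demo DB',
    driver_id=driver.id,
    jdbc_url='jdbc:postgresql://my.db.address.org:5432/perftest',  # placeholder URL
)
print(data_store.test(username='db_username', password='db_password'))

# Describe the query that defines the data request
params = dr.DataSourceParameters(
    data_store_id=data_store.id,
    query='SELECT * FROM airlines10mb WHERE "Year" >= 1995;',  # placeholder query
)
data_source = dr.DataSource.create(
    data_source_type='jdbc',
    canonical_name='airlines stats after 1995',
    params=params,
)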
Feature¶
-
class
datarobot.models.
Feature
(id, project_id=None, name=None, feature_type=None, importance=None, low_information=None, unique_count=None, na_count=None, date_format=None, min=None, max=None, mean=None, median=None, std_dev=None, time_series_eligible=None, time_series_eligibility_reason=None, time_step=None, time_unit=None, target_leakage=None)¶ A feature from a project’s dataset
These are features either included in the originally uploaded dataset or added to it via feature transformations. In time series projects, these will be distinct from the
ModelingFeature
s created during partitioning; otherwise, they will correspond to the same features. For more information about input and modeling features, see the time series documentation.
The min, max, mean, median, and std_dev attributes provide information about the distribution of the feature in the EDA sample data. For non-numeric features or features created prior to these summary statistics becoming available, they will be None. For features where the summary statistics are available, they will be in a format compatible with the data type, i.e. date type features will have their summary statistics expressed as ISO-8601 formatted date strings.
Attributes: - id : int
the id for the feature - note that name is used to reference the feature instead of id
- project_id : str
the id of the project the feature belongs to
- name : str
the name of the feature
- feature_type : str
the type of the feature, e.g. ‘Categorical’, ‘Text’
- importance : float or None
numeric measure of the strength of relationship between the feature and target (independent of any model or other features); may be None for non-modeling features such as partition columns
- low_information : bool
whether a feature is considered too uninformative for modeling (e.g. because it has too few values)
- unique_count : int
number of unique values
- na_count : int or None
number of missing values
- date_format : str or None
For Date features, the date format string for how this feature was interpreted, compatible with https://docs.python.org/2/library/time.html#time.strftime . For other feature types, None.
- min : str, int, float, or None
The minimum value of the source data in the EDA sample
- max : str, int, float, or None
The maximum value of the source data in the EDA sample
- mean : str, int, float, or None
The arithmetic mean of the source data in the EDA sample
- median : str, int, float, or None
The median of the source data in the EDA sample
- std_dev : str, int, float, or None
The standard deviation of the source data in the EDA sample
- time_series_eligible : bool
Whether this feature can be used as the datetime partition column in a time series project.
- time_series_eligibility_reason : str
Why the feature is ineligible for the datetime partition column in a time series project, or ‘suitable’ when it is eligible.
- time_step : int or None
For time series eligible features, a positive integer determining the interval at which windows can be specified. If used as the datetime partition column on a time series project, the feature derivation and forecast windows must start and end at an integer multiple of this value. None for features that are not time series eligible.
- time_unit : str or None
For time series eligible features, the time unit covered by a single time step, e.g. ‘HOUR’, or None for features that are not time series eligible.
- target_leakage : str
Whether a feature is considered to have target leakage or not. A value of ‘SKIPPED_DETECTION’ indicates that target leakage detection was not run on the feature. ‘FALSE’ indicates no leakage, ‘MODERATE’ indicates a moderate risk of target leakage, and ‘HIGH_RISK’ indicates a high risk of target leakage
-
classmethod
get
(project_id, feature_name)¶ Retrieve a single feature
Parameters: - project_id : str
The ID of the project the feature is associated with.
- feature_name : str
The name of the feature to retrieve
Returns: - feature : Feature
The queried instance
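A short sketch of retrieving a feature and inspecting its EDA summary; the project id and feature name are placeholders:
import datarobot as dr

feature = dr.models.Feature.get('5a8ac90b07a57a0001be501e', 'tenure')  # placeholder ids
print(feature.feature_type, feature.importance, feature.target_leakage)
print(feature.min, feature.median, feature.max)  # None for non-numeric features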
-
get_multiseries_properties
(multiseries_id_columns, max_wait=600)¶ Retrieve time series properties for a potential multiseries datetime partition column
Multiseries time series projects use multiseries id columns to model multiple distinct series within a single project. This function returns the time series properties (time step and time unit) of this column if it were used as a datetime partition column with the specified multiseries id columns, running multiseries detection automatically if it had not previously been successfully run.
Parameters: - multiseries_id_columns : list of str
the name(s) of the multiseries id columns to use with this datetime partition column. Currently only one multiseries id column is supported.
- max_wait : int, optional
if a multiseries detection task is run, the maximum amount of time to wait for it to complete before giving up
Returns: - properties : dict
A dict with three keys:
- time_series_eligible : bool, whether the column can be used as a partition column
- time_unit : str or null, the inferred time unit if used as a partition column
- time_step : int or null, the inferred time step if used as a partition column
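For example, the eligibility of a date column as a multiseries datetime partition column could be checked as follows; the project id and column names are hypothetical:
import datarobot as dr

project_id = '5a8ac90b07a57a0001be501e'  # placeholder
feature = dr.models.Feature.get(project_id, 'timestamp')    # hypothetical datetime column
props = feature.get_multiseries_properties(['series_id'])   # hypothetical multiseries id column
if props['time_series_eligible']:
    print(props['time_unit'], props['time_step'])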
-
get_cross_series_properties
(datetime_partition_column, cross_series_group_by_columns, max_wait=600)¶ Retrieve cross-series properties for multiseries ID column.
This function returns the cross-series properties (eligibility as group-by column) of this column if it were used with the specified datetime partition column and with the current multiseries id column, running cross-series group-by validation automatically if it had not previously been successfully run.
Parameters: - datetime_partition_column : str
the name of the datetime partition column to use with this multiseries ID column
- cross_series_group_by_columns : list of str
the name(s) of the columns to use with this multiseries ID column. Currently only one cross-series group-by column is supported.
- max_wait : int, optional
if a multiseries detection task is run, the maximum amount of time to wait for it to complete before giving up
Returns: - properties : dict
A dict with three keys:
- name : str, column name
- eligibility : str, reason for column eligibility
- isEligible : bool, is column eligible as cross-series group-by
-
class
datarobot.models.
ModelingFeature
(project_id=None, name=None, feature_type=None, importance=None, low_information=None, unique_count=None, na_count=None, date_format=None, min=None, max=None, mean=None, median=None, std_dev=None, parent_feature_names=None)¶ A feature used for modeling
In time series projects, a new set of modeling features is created after setting the partitioning options. These features are automatically derived from those in the project’s dataset and are the features used for modeling. Modeling features are only accessible once the target and partitioning options have been set. In projects that don’t use time series modeling, once the target has been set, ModelingFeatures and Features will behave the same.
For more information about input and modeling features, see the time series documentation.
As with the
Feature
object, the min, max, mean, median, and std_dev attributes provide information about the distribution of the feature in the EDA sample data. For non-numeric features, they will be None. For features where the summary statistics are available, they will be in a format compatible with the data type, i.e. date type features will have their summary statistics expressed as ISO-8601 formatted date strings.
Attributes: - project_id : str
the id of the project the feature belongs to
- name : str
the name of the feature
- feature_type : str
the type of the feature, e.g. ‘Categorical’, ‘Text’
- importance : float or None
numeric measure of the strength of relationship between the feature and target (independent of any model or other features); may be None for non-modeling features such as partition columns
- low_information : bool
whether a feature is considered too uninformative for modeling (e.g. because it has too few values)
- unique_count : int
number of unique values
- na_count : int or None
number of missing values
- date_format : str or None
For Date features, the date format string for how this feature was interpreted, compatible with https://docs.python.org/2/library/time.html#time.strftime . For other feature types, None.
- min : str, int, float, or None
The minimum value of the source data in the EDA sample
- max : str, int, float, or None
The maximum value of the source data in the EDA sample
- mean : str, int, float, or None
The arithmetic mean of the source data in the EDA sample
- median : str, int, float, or None
The median of the source data in the EDA sample
- std_dev : str, int, float, or None
The standard deviation of the source data in the EDA sample
- parent_feature_names : list of str
A list of the names of input features used to derive this modeling feature. In cases where the input features and modeling features are the same, this will simply contain the feature’s name. Note that if a derived feature was used to create this modeling feature, the values here will not necessarily correspond to the features that must be supplied at prediction time.
-
classmethod
get
(project_id, feature_name)¶ Retrieve a single modeling feature
Parameters: - project_id : str
The ID of the project the feature is associated with.
- feature_name : str
The name of the feature to retrieve
Returns: - feature : ModelingFeature
The requested feature
-
class
datarobot.models.
FeatureHistogram
(plot)¶ Histogram plot data for a specific feature
New in version v2.14.
A histogram is a popular way to visualize the distribution of feature values. Here the histogram is represented as an ordered collection of bins. For categorical features, every bin represents exactly one feature value and the count in that bin is the number of occurrences of that value. For numeric features, every bin represents a range of values (low end inclusive, high end exclusive) and the count in the bin is the total number of occurrences of all values in this range. In addition, each bin may contain a target feature average for values in that bin (see the
target
description below).
Notes
HistogramBin
contains:
label : (str) for categorical features, the value of the feature; for numeric features, the low end of the bin range, so that the difference between two consecutive bin labels is the length of the bin
count : (int or float) the number of values in this bin's range. If the project uses weights, the value is equal to the sum of the weights of all feature values in the bin's range
target : (float or None) the average of the target feature values for the bin. Present only for informative features if the project target has already been selected and AIM processing has finished. For multiclass projects the value is always null.
Attributes: - plot : list
a list of dictionaries with a schema described as
HistogramBin
-
classmethod
get
(project_id, feature_name, bin_limit=None)¶ Retrieve a single feature histogram
Parameters: - project_id : str
The ID of the project the feature is associated with.
- feature_name : str
The name of the feature to retrieve
- bin_limit : int or None
Desired maximum number of histogram bins. If omitted, the endpoint will use 60 by default.
Returns: - featureHistogram : FeatureHistogram
The queried instance with plot attribute in it.
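A brief sketch of fetching histogram bins for a feature; the project id and feature name are placeholders:
import datarobot as dr

histogram = dr.models.FeatureHistogram.get(
    '5a8ac90b07a57a0001be501e', 'tenure', bin_limit=10)  # placeholder ids
for bin_ in histogram.plot:
    # each bin follows the HistogramBin schema described above
    print(bin_['label'], bin_['count'], bin_['target'])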
Feature List¶
-
class
datarobot.models.
Featurelist
(id=None, name=None, features=None, project_id=None, created=None, is_user_created=None, num_models=None, description=None)¶ A set of features used in modeling
Attributes: - id : str
the id of the featurelist
- name : str
the name of the featurelist
- features : list of str
the names of all the Features in the featurelist
- project_id : str
the project the featurelist belongs to
- created : datetime.datetime
(New in version v2.13) when the featurelist was created
- is_user_created : bool
(New in version v2.13) whether the featurelist was created by a user or by DataRobot automation
- num_models : int
(New in version v2.13) the number of models currently using this featurelist. A model is considered to use a featurelist if it is used to train the model or as a monotonic constraint featurelist, or if the model is a blender with at least one component model using the featurelist.
- description : basestring
(New in version v2.13) the description of the featurelist. Can be updated by the user and may be supplied by default for DataRobot-created featurelists.
-
classmethod
get
(project_id, featurelist_id)¶ Retrieve a known feature list
Parameters: - project_id : str
The id of the project the featurelist is associated with
- featurelist_id : str
The ID of the featurelist to retrieve
Returns: - featurelist : Featurelist
The queried instance
-
delete
(dry_run=False, delete_dependencies=False)¶ Delete a featurelist, and any models and jobs using it
All models using a featurelist, whether as the training featurelist or as a monotonic constraint featurelist, will also be deleted when the deletion is executed and any queued or running jobs using it will be cancelled. Similarly, predictions made on these models will also be deleted. All the entities that are to be deleted with a featurelist are described as “dependencies” of it. To preview the results of deleting a featurelist, call delete with dry_run=True
When deleting a featurelist with dependencies, users must specify delete_dependencies=True to confirm they want to delete the featurelist and all its dependencies. Without that option, only featurelists with no dependencies may be successfully deleted and others will error.
Featurelists configured into the project as a default featurelist or as a default monotonic constraint featurelist cannot be deleted.
Featurelists used in a model deployment cannot be deleted until the model deployment is deleted.
Parameters: - dry_run : bool, optional
specify True to preview the result of deleting the featurelist, instead of actually deleting it.
- delete_dependencies : bool, optional
specify True to successfully delete featurelists with dependencies; if left False by default, featurelists without dependencies can be successfully deleted and those with dependencies will error upon attempting to delete them.
Returns: - result : dict
- A dictionary describing the result of deleting the featurelist, with the following keys
- dry_run : bool, whether the deletion was a dry run or an actual deletion
- can_delete : bool, whether the featurelist can actually be deleted
- deletion_blocked_reason : str, why the featurelist can’t be deleted (if it can’t)
- num_affected_models : int, the number of models using this featurelist
- num_affected_jobs : int, the number of jobs using this featurelist
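Since deletion cascades to models and jobs, a dry run is a prudent first step. A sketch with placeholder ids:
import datarobot as dr

flist = dr.models.Featurelist.get(
    '5a8ac90b07a57a0001be501e', '5a8ac90b07a57a0001be5021')  # placeholder ids
preview = flist.delete(dry_run=True)
if preview['can_delete']:
    flist.delete(delete_dependencies=True)
else:
    print(preview['deletion_blocked_reason'])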
-
update
(name=None, description=None)¶ Update the name or description of an existing featurelist
Note that only user-created featurelists can be renamed, and that names must not conflict with names used by other featurelists.
Parameters: - name : str, optional
the new name for the featurelist
- description : str, optional
the new description for the featurelist
-
class
datarobot.models.
ModelingFeaturelist
(id=None, name=None, features=None, project_id=None, created=None, is_user_created=None, num_models=None, description=None)¶ A set of features that can be used to build a model
In time series projects, a new set of modeling features is created after setting the partitioning options. These features are automatically derived from those in the project’s dataset and are the features used for modeling. Modeling features are only accessible once the target and partitioning options have been set. In projects that don’t use time series modeling, once the target has been set, ModelingFeaturelists and Featurelists will behave the same.
For more information about input and modeling features, see the time series documentation.
Attributes: - id : str
the id of the modeling featurelist
- project_id : str
the id of the project the modeling featurelist belongs to
- name : str
the name of the modeling featurelist
- features : list of str
a list of the names of features included in this modeling featurelist
- created : datetime.datetime
(New in version v2.13) when the featurelist was created
- is_user_created : bool
(New in version v2.13) whether the featurelist was created by a user or by DataRobot automation
- num_models : int
(New in version v2.13) the number of models currently using this featurelist. A model is considered to use a featurelist if it is used to train the model or as a monotonic constraint featurelist, or if the model is a blender with at least one component model using the featurelist.
- description : basestring
(New in version v2.13) the description of the featurelist. Can be updated by the user and may be supplied by default for DataRobot-created featurelists.
-
classmethod
get
(project_id, featurelist_id)¶ Retrieve a modeling featurelist
Modeling featurelists can only be retrieved once the target and partitioning options have been set.
Parameters: - project_id : str
the id of the project the modeling featurelist belongs to
- featurelist_id : str
the id of the modeling featurelist to retrieve
Returns: - featurelist : ModelingFeaturelist
the specified featurelist
-
delete
(dry_run=False, delete_dependencies=False)¶ Delete a featurelist, and any models and jobs using it
All models using a featurelist, whether as the training featurelist or as a monotonic constraint featurelist, will also be deleted when the deletion is executed and any queued or running jobs using it will be cancelled. Similarly, predictions made on these models will also be deleted. All the entities that are to be deleted with a featurelist are described as “dependencies” of it. To preview the results of deleting a featurelist, call delete with dry_run=True
When deleting a featurelist with dependencies, users must specify delete_dependencies=True to confirm they want to delete the featurelist and all its dependencies. Without that option, only featurelists with no dependencies may be successfully deleted and others will error.
Featurelists configured into the project as a default featurelist or as a default monotonic constraint featurelist cannot be deleted.
Featurelists used in a model deployment cannot be deleted until the model deployment is deleted.
Parameters: - dry_run : bool, optional
specify True to preview the result of deleting the featurelist, instead of actually deleting it.
- delete_dependencies : bool, optional
specify True to successfully delete featurelists with dependencies; if left False by default, featurelists without dependencies can be successfully deleted and those with dependencies will error upon attempting to delete them.
Returns: - result : dict
- A dictionary describing the result of deleting the featurelist, with the following keys
- dry_run : bool, whether the deletion was a dry run or an actual deletion
- can_delete : bool, whether the featurelist can actually be deleted
- deletion_blocked_reason : str, why the featurelist can’t be deleted (if it can’t)
- num_affected_models : int, the number of models using this featurelist
- num_affected_jobs : int, the number of jobs using this featurelist
-
update
(name=None, description=None)¶ Update the name or description of an existing featurelist
Note that only user-created featurelists can be renamed, and that names must not conflict with names used by other featurelists.
Parameters: - name : str, optional
the new name for the featurelist
- description : str, optional
the new description for the featurelist
Job¶
-
class
datarobot.models.
Job
(data, completed_resource_url=None)¶ Tracks asynchronous work being done within a project
Attributes: - id : int
the id of the job
- project_id : str
the id of the project the job belongs to
- status : str
the status of the job - will be one of
datarobot.enums.QUEUE_STATUS
- job_type : str
what kind of work the job is doing - will be one of
datarobot.enums.JOB_TYPE
- is_blocked : bool
if true, the job is blocked (cannot be executed) until its dependencies are resolved
-
classmethod
get
(project_id, job_id)¶ Fetches one job.
Parameters: - project_id : str
The identifier of the project in which the job resides
- job_id : str
The job id
Returns: - job : Job
The job
Raises: - AsyncFailureError
Querying this resource gave a status code other than 200 or 303
-
cancel
()¶ Cancel this job. If this job has not finished running, it will be removed and canceled.
-
get_result
()¶ Returns: - result : object
- Return type depends on the job type:
- for model jobs, a Model is returned
- for predict jobs, a pandas.DataFrame (with predictions) is returned
- for featureImpact jobs, a list of dicts (see
Model.get_feature_impact
for more detail)
- for primeRulesets jobs, a list of Rulesets
- for primeModel jobs, a PrimeModel
- for primeDownloadValidation jobs, a PrimeFile
- for reasonCodesInitialization jobs, a ReasonCodesInitialization
- for reasonCodes jobs, a ReasonCodes
- for predictionExplanationInitialization jobs, a PredictionExplanationsInitialization
- for predictionExplanations jobs, a PredictionExplanations
Raises: - JobNotFinished
If the job is not finished, the result is not available.
- AsyncProcessUnsuccessfulError
If the job errored or was aborted
-
get_result_when_complete
(max_wait=600)¶ Parameters: - max_wait : int, optional
How long to wait for the job to finish.
Returns: - result: object
Return type is the same as would be returned by Job.get_result.
Raises: - AsyncTimeoutError
If the job does not finish in time
- AsyncProcessUnsuccessfulError
If the job errored or was aborted
-
refresh
()¶ Update this object with the latest job data from the server.
-
wait_for_completion
(max_wait=600)¶ Waits for job to complete.
Parameters: - max_wait : int, optional
How long to wait for the job to finish.
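A typical polling pattern with Job looks like the following sketch; the project and job ids are placeholders:
import datarobot as dr

job = dr.models.Job.get('5a8ac90b07a57a0001be501e', '42')  # placeholder project and job ids
print(job.status, job.job_type, job.is_blocked)
# Either wait and fetch in one call ...
result = job.get_result_when_complete(max_wait=600)
# ... or wait explicitly and then fetch:
# job.wait_for_completion(max_wait=600)
# result = job.get_result()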
-
class
datarobot.models.
TrainingPredictionsJob
(data, model_id, data_subset, **kwargs)¶ -
classmethod
get
(project_id, job_id, model_id=None, data_subset=None)¶ Fetches one training predictions job.
The resulting
TrainingPredictions
object will be annotated with model_id and data_subset.
Parameters: - project_id : str
The identifier of the project in which the job resides
- job_id : str
The job id
- model_id : str
The identifier of the model used for computing training predictions
- data_subset : dr.enums.DATA_SUBSET, optional
Data subset used for computing training predictions
Returns: - job : TrainingPredictionsJob
The job
-
refresh
()¶ Update this object with the latest job data from the server.
-
cancel
()¶ Cancel this job. If this job has not finished running, it will be removed and canceled.
-
get_result
()¶ Returns: - result : object
- Return type depends on the job type:
- for model jobs, a Model is returned
- for predict jobs, a pandas.DataFrame (with predictions) is returned
- for featureImpact jobs, a list of dicts (see
Model.get_feature_impact
for more detail)
- for primeRulesets jobs, a list of Rulesets
- for primeModel jobs, a PrimeModel
- for primeDownloadValidation jobs, a PrimeFile
- for reasonCodesInitialization jobs, a ReasonCodesInitialization
- for reasonCodes jobs, a ReasonCodes
- for predictionExplanationInitialization jobs, a PredictionExplanationsInitialization
- for predictionExplanations jobs, a PredictionExplanations
Raises: - JobNotFinished
If the job is not finished, the result is not available.
- AsyncProcessUnsuccessfulError
If the job errored or was aborted
-
get_result_when_complete
(max_wait=600)¶ Parameters: - max_wait : int, optional
How long to wait for the job to finish.
Returns: - result: object
Return type is the same as would be returned by Job.get_result.
Raises: - AsyncTimeoutError
If the job does not finish in time
- AsyncProcessUnsuccessfulError
If the job errored or was aborted
-
wait_for_completion
(max_wait=600)¶ Waits for job to complete.
Parameters: - max_wait : int, optional
How long to wait for the job to finish.
Lift Chart¶
-
class
datarobot.models.lift_chart.
LiftChart
(source, bins, source_model_id)¶ Lift chart data for model.
Notes
LiftChartBin
is a dict containing the following:
actual : (float) sum of actual target values in the bin
predicted : (float) sum of predicted target values in the bin
bin_weight : (float) the weight of the bin. For weighted projects, it is the sum of the weights of the rows in the bin. For unweighted projects, it is the number of rows in the bin.
Attributes: - source : str
Lift chart data source. Can be ‘validation’, ‘crossValidation’ or ‘holdout’.
- bins : list of dict
List of dicts with schema described as
LiftChartBin
above.- source_model_id : str
ID of the model this lift chart represents; in some cases, insights from the parent of a frozen model may be used
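A sketch of retrieving lift chart bins for inspection or plotting; the ids are placeholders, and get_lift_chart is documented under Model below:
import datarobot as dr

model = dr.models.Model.get(
    '5a8ac90b07a57a0001be501e', '5a8ac90b07a57a0001be5020')  # placeholder ids
lift = model.get_lift_chart('validation')
for bin_ in lift.bins:
    # each bin follows the LiftChartBin schema described above
    print(bin_['actual'], bin_['predicted'], bin_['bin_weight'])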
Missing Values Report¶
-
class
datarobot.models.missing_report.
MissingValuesReport
(missing_values_report)¶ Missing values report for a model; contains a list of per-feature reports sorted by missing count in descending order.
Notes
Report per feature
contains:
feature : feature name
type : feature type – ‘Numeric’ or ‘Categorical’
missing_count : missing values count in training data
missing_percentage : missing values percentage in training data
tasks : list of information for each task that was applied to the feature

task information
contains:
id : the number of the task in the blueprint diagram
name : task name
descriptions : human readable aggregated information about how the task handles missing values. The following descriptions may be present: what value is imputed for missing values, whether the feature being missing is treated as a feature by the task, whether missing values are treated as infrequent values, whether infrequent values are treated as missing values, and whether missing values are ignored.
-
classmethod
get
(project_id, model_id)¶ Retrieve a missing report.
Parameters: - project_id : str
The project’s id.
- model_id : str
The model’s id.
Returns: - MissingValuesReport
The queried missing report.
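A sketch of fetching the report; the ids are placeholders, and iterating over the report to reach the per-feature entries is an assumption rather than documented behavior:
import datarobot as dr

report = dr.models.missing_report.MissingValuesReport.get(
    '5a8ac90b07a57a0001be501e', '5a8ac90b07a57a0001be5020')  # placeholder ids
for per_feature in report:  # assumed iterable of per-feature reports
    print(per_feature)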
Models¶
Model¶
-
class
datarobot.models.
Model
(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, project=None, data=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None, is_starred=None, prediction_threshold=None, prediction_threshold_read_only=None)¶ A model trained on a project’s dataset capable of making predictions
Attributes: - id : str
the id of the model
- project_id : str
the id of the project the model belongs to
- processes : list of str
the processes used by the model
- featurelist_name : str
the name of the featurelist used by the model
- featurelist_id : str
the id of the featurelist used by the model
- sample_pct : float or None
the percentage of the project dataset used in training the model. If the project uses datetime partitioning, the sample_pct will be None. See training_row_count, training_duration, and training_start_date and training_end_date instead.
- training_row_count : int or None
the number of rows of the project dataset used in training the model. In a datetime partitioned project, if specified, defines the number of rows used to train the model and evaluate backtest scores; if unspecified, either training_duration or training_start_date and training_end_date was used to determine that instead.
- training_duration : str or None
only present for models in datetime partitioned projects. If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores.
- training_start_date : datetime or None
only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
- training_end_date : datetime or None
only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
- model_type : str
what model this is, e.g. ‘Nystroem Kernel SVM Regressor’
- model_category : str
what kind of model this is - ‘prime’ for DataRobot Prime models, ‘blend’ for blender models, and ‘model’ for other models
- is_frozen : bool
whether this model is a frozen model
- blueprint_id : str
the id of the blueprint used in this model
- metrics : dict
a mapping from each metric to the model’s scores for that metric
- monotonic_increasing_featurelist_id : str
optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
- monotonic_decreasing_featurelist_id : str
optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
- supports_monotonic_constraints : bool
optional, whether this model supports enforcing monotonic constraints
- is_starred : bool
whether this model is marked as starred
- prediction_threshold : float
for binary classification projects, the threshold used for predictions
- prediction_threshold_read_only : bool
indicates whether modification of the prediction threshold is forbidden. Threshold modification is forbidden once a model has had a deployment created or predictions made via the dedicated prediction API.
-
classmethod
get
(project, model_id)¶ Retrieve a specific model.
Parameters: - project : str
The project’s id.
- model_id : str
The
model_id
of the leaderboard item to retrieve.
Returns: - model : Model
The queried instance.
Raises: - ValueError
passed
project
parameter value is of an unsupported type
-
classmethod
fetch_resource_data
(url, join_endpoint=True)¶ (Deprecated.) Used to acquire model data directly from its url.
Consider using get instead, as this is a convenience function used for development of datarobot
Parameters: - url : string
The resource we are acquiring
- join_endpoint : boolean, optional
Whether the client’s endpoint should be joined to the URL before sending the request. Location headers are returned as absolute locations, so will _not_ need the endpoint
Returns: - model_data : dict
The queried model’s data
-
get_features_used
()¶ Query the server to determine which features were used.
Note that the data returned by this method is possibly different than the names of the features in the featurelist used by this model. This method will return the raw features that must be supplied in order for predictions to be generated on a new set of data. The featurelist, in contrast, would also include the names of derived features.
Returns: - features : list of str
The names of the features used in the model.
-
get_supported_capabilities
()¶ Retrieves a summary of the capabilities supported by a model.
New in version v2.14.
Returns: - supportsBlending: bool
whether the model supports blending
- supportsMonotonicConstraints: bool
whether the model supports monotonic constraints
- hasWordCloud: bool
whether the model has word cloud data available
- eligibleForPrime: bool
whether the model is eligible for Prime
- hasParameters: bool
whether the model has parameters that can be retrieved
-
delete
()¶ Delete a model from the project’s leaderboard.
-
get_leaderboard_ui_permalink
()¶ Returns: - url : str
Permanent static hyperlink to this model at leaderboard.
-
open_model_browser
()¶ Opens model at project leaderboard in web browser.
Note: If text-mode browsers are used, the calling process will block until the user exits the browser.
-
train
(sample_pct=None, featurelist_id=None, scoring_type=None, training_row_count=None, monotonic_increasing_featurelist_id=<object object>, monotonic_decreasing_featurelist_id=<object object>)¶ Train the blueprint used in model on a particular featurelist or amount of data.
This method creates a new training job for worker and appends it to the end of the queue for this project. After the job has finished you can get the newly trained model by retrieving it from the project leaderboard, or by retrieving the result of the job.
Either sample_pct or training_row_count can be used to specify the amount of data to use, but not both. If neither are specified, a default of the maximum amount of data that can safely be used to train any blueprint without going into the validation data will be selected.
In smart-sampled projects, sample_pct and training_row_count are assumed to be in terms of rows of the minority class.
Note
For datetime partitioned projects, use
train_datetime
instead.
Parameters: - sample_pct : float, optional
The amount of data to use for training, as a percentage of the project dataset from 0 to 100.
- featurelist_id : str, optional
The identifier of the featurelist to use. If not defined, the featurelist of this model is used.
- scoring_type : str, optional
Either SCORING_TYPE.validation or SCORING_TYPE.cross_validation. SCORING_TYPE.validation is available for every partitioning type, and indicates that the default model validation should be used for the project. If the project uses a form of cross-validation partitioning, SCORING_TYPE.cross_validation can also be used to indicate that all of the available training/validation combinations should be used to evaluate the model.
- training_row_count : int, optional
The number of rows to use to train the requested model.
- monotonic_increasing_featurelist_id : str
(new in version 2.11) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. Passing
None
disables increasing monotonicity constraint. Default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT
) is the one specified by the blueprint.
- monotonic_decreasing_featurelist_id : str
(new in version 2.11) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. Passing
None
disables decreasing monotonicity constraint. Default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT
) is the one specified by the blueprint.
Returns: - model_job_id : str
id of created job, can be used as parameter to
ModelJob.get
method orwait_for_async_model_creation
function
Examples
project = Project.get('p-id')
model = Model.get('p-id', 'l-id')
model_job_id = model.train(training_row_count=project.max_train_rows)
-
train_datetime
(featurelist_id=None, training_row_count=None, training_duration=None, time_window_sample_pct=None)¶ Train this model on a different featurelist or amount of data
Requires that this model is part of a datetime partitioned project; otherwise, an error will occur.
Parameters: - featurelist_id : str, optional
the featurelist to use to train the model. If not specified, the featurelist of this model is used.
- training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
- training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
- time_window_sample_pct : int, optional
may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.
Returns: - job : ModelJob
the created job to build the model
-
request_predictions
(dataset_id, include_prediction_intervals=None, prediction_intervals_size=None)¶ Request predictions against a previously uploaded dataset
Parameters: - dataset_id : string
The dataset to make predictions against (as uploaded from Project.upload_dataset)
- include_prediction_intervals : bool, optional
(New in v2.16) For time series projects only. Specifies whether prediction intervals should be calculated for this request. Defaults to True if prediction_intervals_size is specified, otherwise defaults to False.
- prediction_intervals_size : int, optional
(New in v2.16) For time series projects only. Represents the percentile to use for the size of the prediction intervals. Defaults to 80 if include_prediction_intervals is True.
Returns: - job : PredictJob
The job computing the predictions
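A sketch of the full prediction workflow against a newly uploaded dataset; the ids and file path are placeholders, and Project.upload_dataset is documented elsewhere in this reference:
import datarobot as dr

project = dr.Project.get('5a8ac90b07a57a0001be501e')                 # placeholder id
model = dr.models.Model.get(project.id, '5a8ac90b07a57a0001be5020')  # placeholder id
dataset = project.upload_dataset('./new_data.csv')                   # placeholder path
predict_job = model.request_predictions(dataset.id)
predictions = predict_job.get_result_when_complete()                 # pandas.DataFrame of predictions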
-
get_feature_impact
()¶ Retrieve the computed Feature Impact results, a measure of the relevance of each feature in the model.
Feature Impact is computed for each column by creating new data with that column randomly permuted (but the others left unchanged), and seeing how the error metric score for the predictions is affected. The ‘impactUnnormalized’ is how much worse the error metric score is when making predictions on this modified data. The ‘impactNormalized’ is normalized so that the largest value is 1. In both cases, larger values indicate more important features.
If a feature is a redundant feature, i.e. once other features are considered it doesn’t contribute much in addition, the ‘redundantWith’ value is the name of the feature that has the highest correlation with this feature. Note that redundancy detection is only available for jobs run after the addition of this feature. When retrieving data that predates this functionality, a NoRedundancyImpactAvailable warning will be issued.
Elsewhere this technique is sometimes called ‘Permutation Importance’.
Requires that Feature Impact has already been computed with
request_feature_impact
.
Returns: - feature_impacts : list of dict
The feature impact data. Each item is a dict with the keys ‘featureName’, ‘impactNormalized’, and ‘impactUnnormalized’, and ‘redundantWith’.
Raises: - ClientError (404)
If the feature impacts have not been computed.
-
request_feature_impact
()¶ Request feature impacts to be computed for the model.
See
get_feature_impact
for more information on the result of the job.
Returns: - job : Job
A Job representing the feature impact computation. To get the completed feature impact data, use job.get_result or job.get_result_when_complete.
Raises: - JobAlreadyRequested (422)
If the feature impacts have already been requested.
-
get_or_request_feature_impact
(max_wait=600)¶ Retrieve feature impact for the model, requesting a job if it hasn’t been run previously
Parameters: - max_wait : int, optional
The maximum time to wait for a requested feature impact job to complete before erroring
Returns: - feature_impacts : list of dict
The feature impact data. See
get_feature_impact
for the exact schema.
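get_or_request_feature_impact wraps the request/retrieve pair in one call; the same flow written out by hand might look like this sketch (ids are placeholders):
import datarobot as dr

model = dr.models.Model.get(
    '5a8ac90b07a57a0001be501e', '5a8ac90b07a57a0001be5020')  # placeholder ids
try:
    impacts = model.get_feature_impact()
except dr.errors.ClientError:
    # not yet computed, so request it and wait for the job
    job = model.request_feature_impact()
    impacts = job.get_result_when_complete(max_wait=600)
for row in sorted(impacts, key=lambda r: r['impactNormalized'], reverse=True):
    print(row['featureName'], round(row['impactNormalized'], 3))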
-
get_prime_eligibility
()¶ Check if this model can be approximated with DataRobot Prime
Returns: - prime_eligibility : dict
a dict indicating whether a model can be approximated with DataRobot Prime (key can_make_prime) and why it may be ineligible (key message)
-
request_approximation
()¶ Request an approximation of this model using DataRobot Prime
This will create several rulesets that could be used to approximate this model. After comparing their scores and rule counts, the code used in the approximation can be downloaded and run locally.
Returns: - job : Job
the job generating the rulesets
-
get_rulesets
()¶ List the rulesets approximating this model generated by DataRobot Prime
If this model hasn’t been approximated yet, will return an empty list. Note that these are rulesets approximating this model, not rulesets used to construct this model.
Returns: - rulesets : list of Ruleset
-
download_export
(filepath)¶ Download an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
Parameters: - filepath : str
The path at which to save the exported model file.
-
request_transferable_export
()¶ Request generation of an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
This function does not download the exported file. Use download_export for that.
Examples
model = datarobot.Model.get('p-id', 'l-id')
job = model.request_transferable_export()
job.wait_for_completion()
model.download_export('my_exported_model.drmodel')

# Client must be configured to use standalone prediction server for import:
datarobot.Client(token='my-token-at-standalone-server',
                 endpoint='standalone-server-url/api/v2')

imported_model = datarobot.ImportedModel.create('my_exported_model.drmodel')
-
request_frozen_model
(sample_pct=None, training_row_count=None)¶ Train a new frozen model with parameters from this model
Note
This method only works if the project the model belongs to is not datetime partitioned. If it is, use
request_frozen_datetime_model
instead.
Frozen models use the same tuning parameters as their parent model instead of independently optimizing them to allow efficiently retraining models on larger amounts of the training data.
Parameters: - sample_pct : float
optional, the percentage of the dataset to use with the model. If not provided, will use the value from this model.
- training_row_count : int
(New in version v2.9) optional, the integer number of rows of the dataset to use with the model. Only one of sample_pct and training_row_count should be specified.
Returns: - model_job : ModelJob
the modeling job training a frozen model
-
request_frozen_datetime_model
(training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, time_window_sample_pct=None)¶ Train a new frozen model with parameters from this model
Requires that this model belongs to a datetime partitioned project. If it does not, an error will occur when submitting the job.
Frozen models use the same tuning parameters as their parent model instead of independently optimizing them to allow efficiently retraining models on larger amounts of the training data.
In addition to training_row_count and training_duration, frozen datetime models may be trained on an exact date range. Only one of training_row_count, training_duration, or training_start_date and training_end_date should be specified.
Models specified using training_start_date and training_end_date are the only ones that can be trained into the holdout data (once the holdout is unlocked).
Parameters: - training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
- training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
- training_start_date : datetime.datetime, optional
the start date of the data to train to model on. Only rows occurring at or after this datetime will be used. If training_start_date is specified, training_end_date must also be specified.
- training_end_date : datetime.datetime, optional
the end date of the data to train the model on. Only rows occurring strictly before this datetime will be used. If training_end_date is specified, training_start_date must also be specified.
- time_window_sample_pct : int, optional
may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.
Returns: - model_job : ModelJob
the modeling job training a frozen model
-
get_parameters
()¶ Retrieve model parameters.
Returns: - ModelParameters
Model parameters for this model.
-
get_lift_chart
(source, fallback_to_parent_insights=False)¶ Retrieve model lift chart for the specified source.
Parameters: - source : str
Lift chart data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
- fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return lift chart data for this model’s parent if the lift chart is not available for this model and the model has a defined parent model. If omitted or False, or there is no parent model, will not attempt to return insight data from this model’s parent.
Returns: - LiftChart
Model lift chart data
Raises: - ClientError
If the insight is not available for this model
-
get_all_lift_charts
(fallback_to_parent_insights=False)¶ Retrieve a list of all lift charts available for the model.
Parameters: - fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return lift chart data for this model’s parent for any source that is not available for this model and if this model has a defined parent model. If omitted or False, or this model has no parent, this will not attempt to retrieve any data from this model’s parent.
Returns: - list of LiftChart
Data for all available model lift charts.
-
get_pareto_front
()¶ Retrieve the Pareto Front for a Eureqa model.
This method is only supported for Eureqa models.
Returns: - ParetoFront
Model ParetoFront data
-
get_confusion_chart
(source, fallback_to_parent_insights=False)¶ Retrieve model’s confusion chart for the specified source.
Parameters: - source : str
Confusion chart source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
- fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return confusion chart data for this model’s parent if the confusion chart is not available for this model and the model has a defined parent model. If omitted or False, or there is no parent model, will not attempt to return insight data from this model’s parent.
Returns: - ConfusionChart
Model ConfusionChart data
Raises: - ClientError
If the insight is not available for this model
-
get_all_confusion_charts
(fallback_to_parent_insights=False)¶ Retrieve a list of all confusion charts available for the model.
Parameters: - fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return confusion chart data for this model’s parent for any source that is not available for this model and if this model has a defined parent model. If omitted or False, or this model has no parent, this will not attempt to retrieve any data from this model’s parent.
Returns: - list of ConfusionChart
Data for all available confusion charts for model.
-
get_roc_curve
(source, fallback_to_parent_insights=False)¶ Retrieve model ROC curve for the specified source.
Parameters: - source : str
ROC curve data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
- fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return ROC curve data for this model’s parent if the ROC curve is not available for this model and the model has a defined parent model. If omitted or False, or there is no parent model, will not attempt to return data from this model’s parent.
Returns: - RocCurve
Model ROC curve data
Raises: - ClientError
If the insight is not available for this model
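Example (a minimal sketch; the ids are hypothetical, and ClientError is caught because the insight may not exist for the requested source):
import datarobot as dr
from datarobot.errors import ClientError
model = dr.Model.get('project-id', 'model-id')  # hypothetical ids
try:
    roc = model.get_roc_curve(dr.enums.CHART_DATA_SOURCE.HOLDOUT)
except ClientError:
    roc = None  # ROC data is not available for this model and source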
-
get_all_roc_curves
(fallback_to_parent_insights=False)¶ Retrieve a list of all ROC curves available for the model.
Parameters: - fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return ROC curve data for this model’s parent for any source that is not available for this model and if this model has a defined parent model. If omitted or False, or this model has no parent, this will not attempt to retrieve any data from this model’s parent.
Returns: - list of RocCurve
Data for all available model ROC curves.
-
get_word_cloud
(exclude_stop_words=False)¶ Retrieve word cloud data for the model.
Parameters: - exclude_stop_words : bool, optional
Set to True if you want stop words filtered out of the response.
Returns: - WordCloud
Word cloud data for the model.
-
download_scoring_code
(file_name, source_code=False)¶ Download scoring code JAR.
Parameters: - file_name : str
File path where scoring code will be saved.
- source_code : bool, optional
Set to True to download source code archive. It will not be executable.
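Example (a minimal sketch; the ids and file names are hypothetical):
import datarobot as dr
model = dr.Model.get('project-id', 'model-id')  # hypothetical ids
model.download_scoring_code('scoring_code.jar')  # executable scoring JAR
model.download_scoring_code('scoring_code_src.jar', source_code=True)  # source archive, not executable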
-
get_model_blueprint_documents
()¶ Get documentation for tasks used in this model.
Returns: - list of BlueprintTaskDocument
All documents available for the model.
-
get_model_blueprint_chart
()¶ Retrieve a model blueprint chart that can be used to understand data flow in blueprint.
Returns: - ModelBlueprintChart
The queried model blueprint chart.
-
get_missing_report_info
()¶ Retrieve a missing data report on the training data, which can be used to understand how missing values were treated in the model. The report consists of missing value reports for the numeric and categorical features that took part in modeling.
Returns: - An iterable of MissingReportPerFeature
The queried model missing report, sorted by missing count (DESCENDING order).
-
request_training_predictions
(data_subset)¶ Start a job to build training predictions
Parameters: - data_subset : str
data set definition to build predictions on. Choices are:
- dr.enums.DATA_SUBSET.ALL for all data available. Not valid for models in datetime partitioned projects
- dr.enums.DATA_SUBSET.VALIDATION_AND_HOLDOUT for all data except the training set. Not valid for models in datetime partitioned projects
- dr.enums.DATA_SUBSET.HOLDOUT for the holdout data set only
- dr.enums.DATA_SUBSET.ALL_BACKTESTS for downloading the predictions for all backtest validation folds. Requires the model to have successfully scored all backtests. Datetime partitioned projects only
Returns: - Job
an instance of created async job
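Example (a minimal sketch; the ids are hypothetical, and the holdout subset is used for illustration):
import datarobot as dr
model = dr.Model.get('project-id', 'model-id')  # hypothetical ids
job = model.request_training_predictions(dr.enums.DATA_SUBSET.HOLDOUT)
training_predictions = job.get_result_when_complete()  # wait for and fetch the computed predictions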
-
cross_validate
()¶ Run Cross Validation on this model.
Note
To perform Cross Validation on a new model with new parameters, use
train
instead.Returns: - ModelJob
The created job to build the model
-
get_cross_validation_scores
(partition=None, metric=None)¶ Returns a dictionary keyed by metric showing cross validation scores per partition.
Cross Validation should already have been performed using
cross_validate
ortrain
.Note
Models that computed cross validation before this feature was added will need to be deleted and retrained before this method can be used.
Parameters: - partition : float
optional, the id of the partition (1, 2, 3.0, 4.0, etc.) to filter results by. Can be a whole number positive integer or float value.
- metric: unicode
optional, the name of the metric to filter the resulting cross validation scores by
Returns: - cross_validation_scores: dict
A dictionary keyed by metric showing cross validation scores per partition.
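Example (a minimal sketch of the two calls above; the ids and the metric name are hypothetical):
import datarobot as dr
model = dr.Model.get('project-id', 'model-id')  # hypothetical ids
cv_job = model.cross_validate()  # queue cross validation for this model
cv_job.wait_for_completion()
scores = model.get_cross_validation_scores(metric='RMSE')  # dict keyed by metric, then by partition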
-
advanced_tune
(params, description=None)¶ Generate a new model with the specified advanced-tuning parameters
As of v2.15, all models support Advanced Tuning other than Blenders, OSS, and user-created.
Parameters: - params : dict
Mapping of parameter ID to parameter value. The list of valid parameter IDs for a model can be found by calling get_advanced_tuning_parameters(). This endpoint does not need to include values for all parameters. If a parameter is omitted, its current_value will be used.
- description : unicode
Human-readable string describing the newly advanced-tuned model
Returns: - ModelJob
The created job to build the model
-
get_advanced_tuning_parameters
()¶ Get the advanced-tuning parameters available for this model.
As of v2.15, all models support Advanced Tuning other than Blenders, OSS, and user-created.
Returns: - dict
A dictionary describing the advanced-tuning parameters for the current model. There are two top-level keys, tuningDescription and tuningParameters.
tuningDescription is an optional value. If not None, it is the user-specified description of this set of tuning parameters.
tuningParameters is a list of dicts, each of which has the following keys
- parameterName : (unicode) name of the parameter (unique per task, see below)
- parameterId : (unicode) opaque ID string uniquely identifying parameter
- defaultValue : (*) default value of the parameter for the blueprint
- currentValue : (*) value of the parameter that was used for this model
- taskName : (unicode) name of the task that this parameter belongs to
- constraints: (dict) see the notes below
Notes
The type of defaultValue and currentValue is defined by the constraints structure. It will be a string or numeric Python type.
constraints is a dict with at least one, possibly more, of the following keys. The presence of a key indicates that the parameter may take on the specified type. (If a key is absent, this means that the parameter may not take on the specified type.) If a key on constraints is present, its value will be a dict containing all of the fields described below for that key.
"constraints": { "select": { "values": [<list(basestring or number) : possible values>] }, "ascii": {}, "unicode": {}, "int": { "min": <int : minimum valid value>, "max": <int : maximum valid value>, "supports_grid_search": <bool : True if Grid Search may be requested for this param> }, "float": { "min": <float : minimum valid value>, "max": <float : maximum valid value>, "supports_grid_search": <bool : True if Grid Search may be requested for this param> }, "intList": { "length": { "min_length": <int : minimum valid length>, "max_length": <int : maximum valid length> "min_val": <int : minimum valid value>, "max_val": <int : maximum valid value> "supports_grid_search": <bool : True if Grid Search may be requested for this param> }, "floatList": { "min_length": <int : minimum valid length>, "max_length": <int : maximum valid length> "min_val": <float : minimum valid value>, "max_val": <float : maximum valid value> "supports_grid_search": <bool : True if Grid Search may be requested for this param> } }
The keys have meaning as follows:
- select: Rather than specifying a specific data type, if present, it indicates that the parameter is permitted to take on any of the specified values. Listed values may be of any string or real (non-complex) numeric type.
- ascii: The parameter may be a unicode object that encodes simple ASCII characters. (A-Z, a-z, 0-9, whitespace, and certain common symbols.) In addition to listed constraints, ASCII keys currently may not contain either newlines or semicolons.
- unicode: The parameter may be any Python unicode object.
- int: The value may be an object of type int within the specified range (inclusive). Please note that the value will be passed around using the JSON format, and some JSON parsers have undefined behavior with integers outside of the range [-(2**53)+1, (2**53)-1].
- float: The value may be an object of type float within the specified range (inclusive).
- intList, floatList: The value may be a list of int or float objects, respectively, following constraints as specified respectively by the int and float types (above).
Many parameters only specify one key under constraints. If a parameter specifies multiple keys, the parameter may take on any value permitted by any key.
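Example (a minimal sketch of the two-step flow described above; the ids, parameter id, and value are hypothetical):
import datarobot as dr
model = dr.Model.get('project-id', 'model-id')  # hypothetical ids
tuning = model.get_advanced_tuning_parameters()
for param in tuning['tuningParameters']:
    print(param['taskName'], param['parameterName'], param['currentValue'], param['constraints'])
# pick a parameterId from the listing above and request a tuned copy of the model
job = model.advanced_tune({'hypothetical-parameter-id': 0.5}, description='manually tuned')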
-
start_advanced_tuning_session
()¶ Start an Advanced Tuning session. Returns an object that helps set up arguments for an Advanced Tuning model execution.
As of v2.15, all models support Advanced Tuning other than Blenders, OSS, and user-created.
Returns: - AdvancedTuningSession
Session for setting up and running Advanced Tuning on a model
-
star_model
()¶ Mark the model as starred
Model stars propagate to the web application and the API, and can be used to filter when listing models.
-
unstar_model
()¶ Unmark the model as starred
Model stars propagate to the web application and the API, and can be used to filter when listing models.
-
set_prediction_threshold
(threshold)¶ Set a custom prediction threshold for the model
May not be used once
prediction_threshold_read_only
is True for this model.Parameters: - threshold : float
only used for binary classification projects. The threshold to use when deciding between the positive and negative classes when making predictions. Should be between 0.0 and 1.0 (inclusive).
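Example (a minimal sketch; the ids and the threshold value are hypothetical):
import datarobot as dr
model = dr.Model.get('project-id', 'model-id')  # hypothetical ids
if not model.prediction_threshold_read_only:
    model.set_prediction_threshold(0.42)  # only meaningful for binary classification projects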
PrimeModel¶
-
class
datarobot.models.
PrimeModel
(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, parent_model_id=None, ruleset_id=None, rule_count=None, score=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None, is_starred=None, prediction_threshold=None, prediction_threshold_read_only=None)¶ A DataRobot Prime model approximating a parent model with downloadable code
Attributes: - id : str
the id of the model
- project_id : str
the id of the project the model belongs to
- processes : list of str
the processes used by the model
- featurelist_name : str
the name of the featurelist used by the model
- featurelist_id : str
the id of the featurelist used by the model
- sample_pct : float
the percentage of the project dataset used in training the model
- training_row_count : int or None
the number of rows of the project dataset used in training the model. In a datetime partitioned project, if specified, defines the number of rows used to train the model and evaluate backtest scores; if unspecified, either training_duration or training_start_date and training_end_date was used to determine that instead.
- training_duration : str or None
only present for models in datetime partitioned projects. If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores.
- training_start_date : datetime or None
only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
- training_end_date : datetime or None
only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
- model_type : str
what model this is, e.g. ‘DataRobot Prime’
- model_category : str
what kind of model this is - always ‘prime’ for DataRobot Prime models
- is_frozen : bool
whether this model is a frozen model
- blueprint_id : str
the id of the blueprint used in this model
- metrics : dict
a mapping from each metric to the model’s scores for that metric
- ruleset : Ruleset
the ruleset used in the Prime model
- parent_model_id : str
the id of the model that this Prime model approximates
- monotonic_increasing_featurelist_id : str
optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
- monotonic_decreasing_featurelist_id : str
optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
- supports_monotonic_constraints : bool
optional, whether this model supports enforcing monotonic constraints
- is_starred : bool
whether this model is marked as starred
- prediction_threshold : float
for binary classification projects, the threshold used for predictions
- prediction_threshold_read_only : bool
indicates whether modification of the prediction threshold is forbidden. Threshold modification is forbidden once a model has had a deployment created or predictions made via the dedicated prediction API.
-
classmethod
get
(project_id, model_id)¶ Retrieve a specific prime model.
Parameters: - project_id : str
The id of the project the prime model belongs to
- model_id : str
The
model_id
of the prime model to retrieve.
Returns: - model : PrimeModel
The queried instance.
-
request_download_validation
(language)¶ Prep and validate the downloadable code for the ruleset associated with this model
Parameters: - language : str
the language the code should be downloaded in - see
datarobot.enums.PRIME_LANGUAGE
for available languages
Returns: - job : Job
A job tracking the code preparation and validation
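Example (a minimal sketch; the ids are hypothetical, and Python is just one of the PRIME_LANGUAGE options):
import datarobot as dr
prime_model = dr.PrimeModel.get('project-id', 'prime-model-id')  # hypothetical ids
job = prime_model.request_download_validation(dr.enums.PRIME_LANGUAGE.PYTHON)
job.wait_for_completion()  # once validated, the code can be downloaded from the project's Prime files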
-
advanced_tune
(params, description=None)¶ Generate a new model with the specified advanced-tuning parameters
As of v2.15, all models support Advanced Tuning other than Blenders, OSS, and user-created.
Parameters: - params : dict
Mapping of parameter ID to parameter value. The list of valid parameter IDs for a model can be found by calling get_advanced_tuning_parameters(). This endpoint does not need to include values for all parameters. If a parameter is omitted, its current_value will be used.
- description : unicode
Human-readable string describing the newly advanced-tuned model
Returns: - ModelJob
The created job to build the model
-
cross_validate
()¶ Run Cross Validation on this model.
Note
To perform Cross Validation on a new model with new parameters, use
train
instead.Returns: - ModelJob
The created job to build the model
-
delete
()¶ Delete a model from the project’s leaderboard.
-
download_export
(filepath)¶ Download an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
Parameters: - filepath : str
The path at which to save the exported model file.
-
download_scoring_code
(file_name, source_code=False)¶ Download scoring code JAR.
Parameters: - file_name : str
File path where scoring code will be saved.
- source_code : bool, optional
Set to True to download source code archive. It will not be executable.
-
classmethod
fetch_resource_data
(url, join_endpoint=True)¶ (Deprecated.) Used to acquire model data directly from its url.
Consider using get instead, as this is a convenience function used for development of datarobot
Parameters: - url : string
The resource we are acquiring
- join_endpoint : boolean, optional
Whether the client’s endpoint should be joined to the URL before sending the request. Location headers are returned as absolute locations, so will _not_ need the endpoint
Returns: - model_data : dict
The queried model’s data
-
get_advanced_tuning_parameters
()¶ Get the advanced-tuning parameters available for this model.
As of v2.15, all models support Advanced Tuning other than Blenders, OSS, and user-created.
Returns: - dict
A dictionary describing the advanced-tuning parameters for the current model. There are two top-level keys, tuningDescription and tuningParameters.
tuningDescription is an optional value. If not None, it is the user-specified description of this set of tuning parameters.
tuningParameters is a list of dicts, each of which has the following keys
- parameterName : (unicode) name of the parameter (unique per task, see below)
- parameterId : (unicode) opaque ID string uniquely identifying parameter
- defaultValue : (*) default value of the parameter for the blueprint
- currentValue : (*) value of the parameter that was used for this model
- taskName : (unicode) name of the task that this parameter belongs to
- constraints: (dict) see the notes below
Notes
The type of defaultValue and currentValue is defined by the constraints structure. It will be a string or numeric Python type.
constraints is a dict with at least one, possibly more, of the following keys. The presence of a key indicates that the parameter may take on the specified type. (If a key is absent, this means that the parameter may not take on the specified type.) If a key on constraints is present, its value will be a dict containing all of the fields described below for that key.
"constraints": { "select": { "values": [<list(basestring or number) : possible values>] }, "ascii": {}, "unicode": {}, "int": { "min": <int : minimum valid value>, "max": <int : maximum valid value>, "supports_grid_search": <bool : True if Grid Search may be requested for this param> }, "float": { "min": <float : minimum valid value>, "max": <float : maximum valid value>, "supports_grid_search": <bool : True if Grid Search may be requested for this param> }, "intList": { "length": { "min_length": <int : minimum valid length>, "max_length": <int : maximum valid length> "min_val": <int : minimum valid value>, "max_val": <int : maximum valid value> "supports_grid_search": <bool : True if Grid Search may be requested for this param> }, "floatList": { "min_length": <int : minimum valid length>, "max_length": <int : maximum valid length> "min_val": <float : minimum valid value>, "max_val": <float : maximum valid value> "supports_grid_search": <bool : True if Grid Search may be requested for this param> } }
The keys have meaning as follows:
- select: Rather than specifying a specific data type, if present, it indicates that the parameter is permitted to take on any of the specified values. Listed values may be of any string or real (non-complex) numeric type.
- ascii: The parameter may be a unicode object that encodes simple ASCII characters. (A-Z, a-z, 0-9, whitespace, and certain common symbols.) In addition to listed constraints, ASCII keys currently may not contain either newlines or semicolons.
- unicode: The parameter may be any Python unicode object.
- int: The value may be an object of type int within the specified range (inclusive). Please note that the value will be passed around using the JSON format, and some JSON parsers have undefined behavior with integers outside of the range [-(2**53)+1, (2**53)-1].
- float: The value may be an object of type float within the specified range (inclusive).
- intList, floatList: The value may be a list of int or float objects, respectively, following constraints as specified respectively by the int and float types (above).
Many parameters only specify one key under constraints. If a parameter specifies multiple keys, the parameter may take on any value permitted by any key.
-
get_all_confusion_charts
(fallback_to_parent_insights=False)¶ Retrieve a list of all confusion charts available for the model.
Parameters: - fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return confusion chart data for this model’s parent for any source that is not available for this model and if this model has a defined parent model. If omitted or False, or this model has no parent, this will not attempt to retrieve any data from this model’s parent.
Returns: - list of ConfusionChart
Data for all available confusion charts for model.
-
get_all_lift_charts
(fallback_to_parent_insights=False)¶ Retrieve a list of all lift charts available for the model.
Parameters: - fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return lift chart data for this model’s parent for any source that is not available for this model and if this model has a defined parent model. If omitted or False, or this model has no parent, this will not attempt to retrieve any data from this model’s parent.
Returns: - list of LiftChart
Data for all available model lift charts.
-
get_all_roc_curves
(fallback_to_parent_insights=False)¶ Retrieve a list of all ROC curves available for the model.
Parameters: - fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return ROC curve data for this model’s parent for any source that is not available for this model and if this model has a defined parent model. If omitted or False, or this model has no parent, this will not attempt to retrieve any data from this model’s parent.
Returns: - list of RocCurve
Data for all available model ROC curves.
-
get_confusion_chart
(source, fallback_to_parent_insights=False)¶ Retrieve model’s confusion chart for the specified source.
Parameters: - source : str
Confusion chart source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
- fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return confusion chart data for this model’s parent if the confusion chart is not available for this model and the model has a defined parent model. If omitted or False, or there is no parent model, will not attempt to return insight data from this model’s parent.
Returns: - ConfusionChart
Model ConfusionChart data
Raises: - ClientError
If the insight is not available for this model
-
get_cross_validation_scores
(partition=None, metric=None)¶ Returns a dictionary keyed by metric showing cross validation scores per partition.
Cross Validation should already have been performed using
cross_validate
ortrain
.Note
Models that computed cross validation before this feature was added will need to be deleted and retrained before this method can be used.
Parameters: - partition : float
optional, the id of the partition (1, 2, 3.0, 4.0, etc.) to filter results by. Can be a whole number positive integer or float value.
- metric: unicode
optional, the name of the metric to filter the resulting cross validation scores by
Returns: - cross_validation_scores: dict
A dictionary keyed by metric showing cross validation scores per partition.
-
get_feature_impact
()¶ Retrieve the computed Feature Impact results, a measure of the relevance of each feature in the model.
Feature Impact is computed for each column by creating new data with that column randomly permuted (but the others left unchanged), and seeing how the error metric score for the predictions is affected. The ‘impactUnnormalized’ is how much worse the error metric score is when making predictions on this modified data. The ‘impactNormalized’ is normalized so that the largest value is 1. In both cases, larger values indicate more important features.
If a feature is redundant, i.e. it does not contribute much additional information once other features are considered, the ‘redundantWith’ value is the name of the feature that has the highest correlation with this feature. Note that redundancy detection is only available for jobs run after the addition of this feature. When retrieving data that predates this functionality, a NoRedundancyImpactAvailable warning will be issued.
Elsewhere this technique is sometimes called ‘Permutation Importance’.
Requires that Feature Impact has already been computed with
request_feature_impact
.Returns: - feature_impacts : list of dict
The feature impact data. Each item is a dict with the keys ‘featureName’, ‘impactNormalized’, ‘impactUnnormalized’, and ‘redundantWith’.
Raises: - ClientError (404)
If the feature impacts have not been computed.
-
get_features_used
()¶ Query the server to determine which features were used.
Note that the data returned by this method is possibly different than the names of the features in the featurelist used by this model. This method will return the raw features that must be supplied in order for predictions to be generated on a new set of data. The featurelist, in contrast, would also include the names of derived features.
Returns: - features : list of str
The names of the features used in the model.
-
get_leaderboard_ui_permalink
()¶ Returns: - url : str
Permanent static hyperlink to this model at leaderboard.
-
get_lift_chart
(source, fallback_to_parent_insights=False)¶ Retrieve model lift chart for the specified source.
Parameters: - source : str
Lift chart data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
- fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return lift chart data for this model’s parent if the lift chart is not available for this model and the model has a defined parent model. If omitted or False, or there is no parent model, will not attempt to return insight data from this model’s parent.
Returns: - LiftChart
Model lift chart data
Raises: - ClientError
If the insight is not available for this model
-
get_missing_report_info
()¶ Retrieve a missing data report on the training data, which can be used to understand how missing values were treated in the model. The report consists of missing value reports for the numeric and categorical features that took part in modeling.
Returns: - An iterable of MissingReportPerFeature
The queried model missing report, sorted by missing count (DESCENDING order).
-
get_model_blueprint_chart
()¶ Retrieve a model blueprint chart that can be used to understand data flow in blueprint.
Returns: - ModelBlueprintChart
The queried model blueprint chart.
-
get_model_blueprint_documents
()¶ Get documentation for tasks used in this model.
Returns: - list of BlueprintTaskDocument
All documents available for the model.
-
get_or_request_feature_impact
(max_wait=600)¶ Retrieve feature impact for the model, requesting a job if it hasn’t been run previously
Parameters: - max_wait : int, optional
The maximum time to wait for a requested feature impact job to complete before erroring
Returns: - feature_impacts : list of dict
The feature impact data. See
get_feature_impact
for the exact schema.
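Example (a minimal sketch; the ids are hypothetical):
import datarobot as dr
model = dr.PrimeModel.get('project-id', 'model-id')  # hypothetical ids
feature_impacts = model.get_or_request_feature_impact(max_wait=600)
# sort by normalized impact to see the most influential features first
top_features = sorted(feature_impacts, key=lambda fi: fi['impactNormalized'], reverse=True)[:5]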
-
get_parameters
()¶ Retrieve model parameters.
Returns: - ModelParameters
Model parameters for this model.
-
get_pareto_front
()¶ Retrieve the Pareto Front for a Eureqa model.
This method is only supported for Eureqa models.
Returns: - ParetoFront
Model ParetoFront data
-
get_prime_eligibility
()¶ Check if this model can be approximated with DataRobot Prime
Returns: - prime_eligibility : dict
a dict indicating whether a model can be approximated with DataRobot Prime (key can_make_prime) and why it may be ineligible (key message)
-
get_roc_curve
(source, fallback_to_parent_insights=False)¶ Retrieve model ROC curve for the specified source.
Parameters: - source : str
ROC curve data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
- fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return ROC curve data for this model’s parent if the ROC curve is not available for this model and the model has a defined parent model. If omitted or False, or there is no parent model, will not attempt to return data from this model’s parent.
Returns: - RocCurve
Model ROC curve data
Raises: - ClientError
If the insight is not available for this model
-
get_rulesets
()¶ List the rulesets approximating this model generated by DataRobot Prime
If this model hasn’t been approximated yet, will return an empty list. Note that these are rulesets approximating this model, not rulesets used to construct this model.
Returns: - rulesets : list of Ruleset
-
get_supported_capabilities
()¶ Retrieves a summary of the capabilities supported by a model.
New in version v2.14.
Returns: - supportsBlending: bool
whether the model supports blending
- supportsMonotonicConstraints: bool
whether the model supports monotonic constraints
- hasWordCloud: bool
whether the model has word cloud data available
- eligibleForPrime: bool
whether the model is eligible for Prime
- hasParameters: bool
whether the model has parameters that can be retrieved
-
get_word_cloud
(exclude_stop_words=False)¶ Retrieve word cloud data for the model.
Parameters: - exclude_stop_words : bool, optional
Set to True if you want stop words filtered out of the response.
Returns: - WordCloud
Word cloud data for the model.
-
open_model_browser
()¶ Opens model at project leaderboard in web browser.
Note: If text-mode browsers are used, the calling process will block until the user exits the browser.
-
request_feature_impact
()¶ Request feature impacts to be computed for the model.
See
get_feature_impact
for more information on the result of the job.Returns: - job : Job
A Job representing the feature impact computation. To get the completed feature impact data, use job.get_result or job.get_result_when_complete.
Raises: - JobAlreadyRequested (422)
If the feature impacts have already been requested.
-
request_predictions
(dataset_id, include_prediction_intervals=None, prediction_intervals_size=None)¶ Request predictions against a previously uploaded dataset
Parameters: - dataset_id : string
The dataset to make predictions against (as uploaded from Project.upload_dataset)
- include_prediction_intervals : bool, optional
(New in v2.16) For time series projects only. Specifies whether prediction intervals should be calculated for this request. Defaults to True if prediction_intervals_size is specified, otherwise defaults to False.
- prediction_intervals_size : int, optional
(New in v2.16) For time series projects only. Represents the percentile to use for the size of the prediction intervals. Defaults to 80 if include_prediction_intervals is True.
Returns: - job : PredictJob
The job computing the predictions
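Example (a minimal sketch; the project id, model id, and file name are hypothetical):
import datarobot as dr
project = dr.Project.get('project-id')  # hypothetical id
dataset = project.upload_dataset('to_predict.csv')  # hypothetical file
model = dr.PrimeModel.get(project.id, 'model-id')  # hypothetical id
predict_job = model.request_predictions(dataset.id)
predictions = predict_job.get_result_when_complete()  # fetch the predictions when the job finishes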
-
request_training_predictions
(data_subset)¶ Start a job to build training predictions
Parameters: - data_subset : str
data set definition to build predictions on. Choices are:
- dr.enums.DATA_SUBSET.ALL for all data available. Not valid for models in datetime partitioned projects
- dr.enums.DATA_SUBSET.VALIDATION_AND_HOLDOUT for all data except the training set. Not valid for models in datetime partitioned projects
- dr.enums.DATA_SUBSET.HOLDOUT for the holdout data set only
- dr.enums.DATA_SUBSET.ALL_BACKTESTS for downloading the predictions for all backtest validation folds. Requires the model to have successfully scored all backtests. Datetime partitioned projects only
Returns: - Job
an instance of created async job
-
request_transferable_export
()¶ Request generation of an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
This function does not download the exported file. Use download_export for that.
Examples
model = datarobot.Model.get('p-id', 'l-id')
job = model.request_transferable_export()
job.wait_for_completion()
model.download_export('my_exported_model.drmodel')
# Client must be configured to use standalone prediction server for import:
datarobot.Client(token='my-token-at-standalone-server', endpoint='standalone-server-url/api/v2')
imported_model = datarobot.ImportedModel.create('my_exported_model.drmodel')
-
set_prediction_threshold
(threshold)¶ Set a custom prediction threshold for the model
May not be used once
prediction_threshold_read_only
is True for this model.Parameters: - threshold : float
only used for binary classification projects. The threshold to use when deciding between the positive and negative classes when making predictions. Should be between 0.0 and 1.0 (inclusive).
-
star_model
()¶ Mark the model as starred
Model stars propagate to the web application and the API, and can be used to filter when listing models.
-
start_advanced_tuning_session
()¶ Start an Advanced Tuning session. Returns an object that helps set up arguments for an Advanced Tuning model execution.
As of v2.15, all models support Advanced Tuning other than Blenders, OSS, and user-created.
Returns: - AdvancedTuningSession
Session for setting up and running Advanced Tuning on a model
-
unstar_model
()¶ Unmark the model as starred
Model stars propagate to the web application and the API, and can be used to filter when listing models.
BlenderModel¶
-
class
datarobot.models.
BlenderModel
(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, model_ids=None, blender_method=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None, is_starred=None, prediction_threshold=None, prediction_threshold_read_only=None)¶ Blender model that combines prediction results from other models.
Attributes: - id : str
the id of the model
- project_id : str
the id of the project the model belongs to
- processes : list of str
the processes used by the model
- featurelist_name : str
the name of the featurelist used by the model
- featurelist_id : str
the id of the featurelist used by the model
- sample_pct : float
the percentage of the project dataset used in training the model
- training_row_count : int or None
the number of rows of the project dataset used in training the model. In a datetime partitioned project, if specified, defines the number of rows used to train the model and evaluate backtest scores; if unspecified, either training_duration or training_start_date and training_end_date was used to determine that instead.
- training_duration : str or None
only present for models in datetime partitioned projects. If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores.
- training_start_date : datetime or None
only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
- training_end_date : datetime or None
only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
- model_type : str
what model this is, e.g. ‘DataRobot Prime’
- model_category : str
what kind of model this is - always ‘blend’ for blender models
- is_frozen : bool
whether this model is a frozen model
- blueprint_id : str
the id of the blueprint used in this model
- metrics : dict
a mapping from each metric to the model’s scores for that metric
- model_ids : list of str
List of model ids used in blender
- blender_method : str
Method used to blend results from underlying models
- monotonic_increasing_featurelist_id : str
optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
- monotonic_decreasing_featurelist_id : str
optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
- supports_monotonic_constraints : bool
optional, whether this model supports enforcing monotonic constraints
- is_starred : bool
whether this model is marked as starred
- prediction_threshold : float
for binary classification projects, the threshold used for predictions
- prediction_threshold_read_only : bool
indicates whether modification of the prediction threshold is forbidden. Threshold modification is forbidden once a model has had a deployment created or predictions made via the dedicated prediction API.
-
classmethod
get
(project_id, model_id)¶ Retrieve a specific blender.
Parameters: - project_id : str
The project’s id.
- model_id : str
The
model_id
of the leaderboard item to retrieve.
Returns: - model : BlenderModel
The queried instance.
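Example (a minimal sketch; the ids are hypothetical):
import datarobot as dr
blender = dr.BlenderModel.get('project-id', 'blender-model-id')  # hypothetical ids
print(blender.blender_method, blender.model_ids)  # how the underlying models were combined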
-
advanced_tune
(params, description=None)¶ Generate a new model with the specified advanced-tuning parameters
As of v2.15, all models support Advanced Tuning other than Blenders, OSS, and user-created.
Parameters: - params : dict
Mapping of parameter ID to parameter value. The list of valid parameter IDs for a model can be found by calling get_advanced_tuning_parameters(). This endpoint does not need to include values for all parameters. If a parameter is omitted, its current_value will be used.
- description : unicode
Human-readable string describing the newly advanced-tuned model
Returns: - ModelJob
The created job to build the model
-
cross_validate
()¶ Run Cross Validation on this model.
Note
To perform Cross Validation on a new model with new parameters, use
train
instead.Returns: - ModelJob
The created job to build the model
-
delete
()¶ Delete a model from the project’s leaderboard.
-
download_export
(filepath)¶ Download an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
Parameters: - filepath : str
The path at which to save the exported model file.
-
download_scoring_code
(file_name, source_code=False)¶ Download scoring code JAR.
Parameters: - file_name : str
File path where scoring code will be saved.
- source_code : bool, optional
Set to True to download source code archive. It will not be executable.
-
classmethod
fetch_resource_data
(url, join_endpoint=True)¶ (Deprecated.) Used to acquire model data directly from its url.
Consider using get instead, as this is a convenience function used for development of datarobot
Parameters: - url : string
The resource we are acquiring
- join_endpoint : boolean, optional
Whether the client’s endpoint should be joined to the URL before sending the request. Location headers are returned as absolute locations, so will _not_ need the endpoint
Returns: - model_data : dict
The queried model’s data
-
get_advanced_tuning_parameters
()¶ Get the advanced-tuning parameters available for this model.
As of v2.15, all models support Advanced Tuning other than Blenders, OSS, and user-created.
Returns: - dict
A dictionary describing the advanced-tuning parameters for the current model. There are two top-level keys, tuningDescription and tuningParameters.
tuningDescription is an optional value. If not None, it is the user-specified description of this set of tuning parameters.
tuningParameters is a list of dicts, each of which has the following keys
- parameterName : (unicode) name of the parameter (unique per task, see below)
- parameterId : (unicode) opaque ID string uniquely identifying parameter
- defaultValue : (*) default value of the parameter for the blueprint
- currentValue : (*) value of the parameter that was used for this model
- taskName : (unicode) name of the task that this parameter belongs to
- constraints: (dict) see the notes below
Notes
The type of defaultValue and currentValue is defined by the constraints structure. It will be a string or numeric Python type.
constraints is a dict with at least one, possibly more, of the following keys. The presence of a key indicates that the parameter may take on the specified type. (If a key is absent, this means that the parameter may not take on the specified type.) If a key on constraints is present, its value will be a dict containing all of the fields described below for that key.
"constraints": { "select": { "values": [<list(basestring or number) : possible values>] }, "ascii": {}, "unicode": {}, "int": { "min": <int : minimum valid value>, "max": <int : maximum valid value>, "supports_grid_search": <bool : True if Grid Search may be requested for this param> }, "float": { "min": <float : minimum valid value>, "max": <float : maximum valid value>, "supports_grid_search": <bool : True if Grid Search may be requested for this param> }, "intList": { "length": { "min_length": <int : minimum valid length>, "max_length": <int : maximum valid length> "min_val": <int : minimum valid value>, "max_val": <int : maximum valid value> "supports_grid_search": <bool : True if Grid Search may be requested for this param> }, "floatList": { "min_length": <int : minimum valid length>, "max_length": <int : maximum valid length> "min_val": <float : minimum valid value>, "max_val": <float : maximum valid value> "supports_grid_search": <bool : True if Grid Search may be requested for this param> } }
The keys have meaning as follows:
- select: Rather than specifying a specific data type, if present, it indicates that the parameter is permitted to take on any of the specified values. Listed values may be of any string or real (non-complex) numeric type.
- ascii: The parameter may be a unicode object that encodes simple ASCII characters. (A-Z, a-z, 0-9, whitespace, and certain common symbols.) In addition to listed constraints, ASCII keys currently may not contain either newlines or semicolons.
- unicode: The parameter may be any Python unicode object.
- int: The value may be an object of type int within the specified range (inclusive). Please note that the value will be passed around using the JSON format, and some JSON parsers have undefined behavior with integers outside of the range [-(2**53)+1, (2**53)-1].
- float: The value may be an object of type float within the specified range (inclusive).
- intList, floatList: The value may be a list of int or float objects, respectively, following constraints as specified respectively by the int and float types (above).
Many parameters only specify one key under constraints. If a parameter specifies multiple keys, the parameter may take on any value permitted by any key.
-
get_all_confusion_charts
(fallback_to_parent_insights=False)¶ Retrieve a list of all confusion charts available for the model.
Parameters: - fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return confusion chart data for this model’s parent for any source that is not available for this model and if this model has a defined parent model. If omitted or False, or this model has no parent, this will not attempt to retrieve any data from this model’s parent.
Returns: - list of ConfusionChart
Data for all available confusion charts for model.
-
get_all_lift_charts
(fallback_to_parent_insights=False)¶ Retrieve a list of all lift charts available for the model.
Parameters: - fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return lift chart data for this model’s parent for any source that is not available for this model and if this model has a defined parent model. If omitted or False, or this model has no parent, this will not attempt to retrieve any data from this model’s parent.
Returns: - list of LiftChart
Data for all available model lift charts.
-
get_all_roc_curves
(fallback_to_parent_insights=False)¶ Retrieve a list of all ROC curves available for the model.
Parameters: - fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return ROC curve data for this model’s parent for any source that is not available for this model and if this model has a defined parent model. If omitted or False, or this model has no parent, this will not attempt to retrieve any data from this model’s parent.
Returns: - list of RocCurve
Data for all available model ROC curves.
-
get_confusion_chart
(source, fallback_to_parent_insights=False)¶ Retrieve model’s confusion chart for the specified source.
Parameters: - source : str
Confusion chart source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
- fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return confusion chart data for this model’s parent if the confusion chart is not available for this model and the model has a defined parent model. If omitted or False, or there is no parent model, will not attempt to return insight data from this model’s parent.
Returns: - ConfusionChart
Model ConfusionChart data
Raises: - ClientError
If the insight is not available for this model
-
get_cross_validation_scores
(partition=None, metric=None)¶ Returns a dictionary keyed by metric showing cross validation scores per partition.
Cross Validation should already have been performed using
cross_validate
ortrain
.Note
Models that computed cross validation before this feature was added will need to be deleted and retrained before this method can be used.
Parameters: - partition : float
optional, the id of the partition (1, 2, 3.0, 4.0, etc.) to filter results by. Can be a whole number positive integer or float value.
- metric: unicode
optional, the name of the metric to filter the resulting cross validation scores by
Returns: - cross_validation_scores: dict
A dictionary keyed by metric showing cross validation scores per partition.
-
get_feature_impact
()¶ Retrieve the computed Feature Impact results, a measure of the relevance of each feature in the model.
Feature Impact is computed for each column by creating new data with that column randomly permuted (but the others left unchanged), and seeing how the error metric score for the predictions is affected. The ‘impactUnnormalized’ is how much worse the error metric score is when making predictions on this modified data. The ‘impactNormalized’ is normalized so that the largest value is 1. In both cases, larger values indicate more important features.
If a feature is redundant, i.e. it does not contribute much additional information once other features are considered, the ‘redundantWith’ value is the name of the feature that has the highest correlation with this feature. Note that redundancy detection is only available for jobs run after the addition of this feature. When retrieving data that predates this functionality, a NoRedundancyImpactAvailable warning will be issued.
Elsewhere this technique is sometimes called ‘Permutation Importance’.
Requires that Feature Impact has already been computed with
request_feature_impact
.Returns: - feature_impacts : list of dict
The feature impact data. Each item is a dict with the keys ‘featureName’, ‘impactNormalized’, ‘impactUnnormalized’, and ‘redundantWith’.
Raises: - ClientError (404)
If the feature impacts have not been computed.
-
get_features_used
()¶ Query the server to determine which features were used.
Note that the data returned by this method is possibly different than the names of the features in the featurelist used by this model. This method will return the raw features that must be supplied in order for predictions to be generated on a new set of data. The featurelist, in contrast, would also include the names of derived features.
Returns: - features : list of str
The names of the features used in the model.
-
get_leaderboard_ui_permalink
()¶ Returns: - url : str
Permanent static hyperlink to this model at leaderboard.
-
get_lift_chart
(source, fallback_to_parent_insights=False)¶ Retrieve model lift chart for the specified source.
Parameters: - source : str
Lift chart data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
- fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return lift chart data for this model’s parent if the lift chart is not available for this model and the model has a defined parent model. If omitted or False, or there is no parent model, will not attempt to return insight data from this model’s parent.
Returns: - LiftChart
Model lift chart data
Raises: - ClientError
If the insight is not available for this model
-
get_missing_report_info
()¶ Retrieve a missing data report on the training data, which can be used to understand how missing values were treated in the model. The report consists of missing value reports for the numeric and categorical features that took part in modeling.
Returns: - An iterable of MissingReportPerFeature
The queried model missing report, sorted by missing count (DESCENDING order).
-
get_model_blueprint_chart
()¶ Retrieve a model blueprint chart that can be used to understand data flow in blueprint.
Returns: - ModelBlueprintChart
The queried model blueprint chart.
-
get_model_blueprint_documents
()¶ Get documentation for tasks used in this model.
Returns: - list of BlueprintTaskDocument
All documents available for the model.
-
get_or_request_feature_impact
(max_wait=600)¶ Retrieve feature impact for the model, requesting a job if it hasn’t been run previously
Parameters: - max_wait : int, optional
The maximum time to wait for a requested feature impact job to complete before erroring
Returns: - feature_impacts : list of dict
The feature impact data. See
get_feature_impact
for the exact schema.
-
get_parameters
()¶ Retrieve model parameters.
Returns: - ModelParameters
Model parameters for this model.
-
get_pareto_front
()¶ Retrieve the Pareto Front for a Eureqa model.
This method is only supported for Eureqa models.
Returns: - ParetoFront
Model ParetoFront data
-
get_prime_eligibility
()¶ Check if this model can be approximated with DataRobot Prime
Returns: - prime_eligibility : dict
a dict indicating whether a model can be approximated with DataRobot Prime (key can_make_prime) and why it may be ineligible (key message)
-
get_roc_curve
(source, fallback_to_parent_insights=False)¶ Retrieve model ROC curve for the specified source.
Parameters: - source : str
ROC curve data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
- fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return ROC curve data for this model’s parent if the ROC curve is not available for this model and the model has a defined parent model. If omitted or False, or there is no parent model, will not attempt to return data from this model’s parent.
Returns: - RocCurve
Model ROC curve data
Raises: - ClientError
If the insight is not available for this model
-
get_rulesets
()¶ List the rulesets approximating this model generated by DataRobot Prime
If this model hasn’t been approximated yet, will return an empty list. Note that these are rulesets approximating this model, not rulesets used to construct this model.
Returns: - rulesets : list of Ruleset
-
get_supported_capabilities
()¶ Retrieves a summary of the capabilities supported by a model.
New in version v2.14.
Returns: - supportsBlending: bool
whether the model supports blending
- supportsMonotonicConstraints: bool
whether the model supports monotonic constraints
- hasWordCloud: bool
whether the model has word cloud data available
- eligibleForPrime: bool
whether the model is eligible for Prime
- hasParameters: bool
whether the model has parameters that can be retrieved
-
get_word_cloud
(exclude_stop_words=False)¶ Retrieve word cloud data for the model.
Parameters: - exclude_stop_words : bool, optional
Set to True if you want stop words filtered out of the response.
Returns: - WordCloud
Word cloud data for the model.
-
open_model_browser
()¶ Opens model at project leaderboard in web browser.
Note: If text-mode browsers are used, the calling process will block until the user exits the browser.
-
request_approximation
()¶ Request an approximation of this model using DataRobot Prime
This will create several rulesets that could be used to approximate this model. After comparing their scores and rule counts, the code used in the approximation can be downloaded and run locally.
Returns: - job : Job
the job generating the rulesets
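Example (a minimal sketch; the ids are hypothetical):
import datarobot as dr
blender = dr.BlenderModel.get('project-id', 'blender-model-id')  # hypothetical ids
job = blender.request_approximation()  # generate candidate rulesets with DataRobot Prime
job.wait_for_completion()
rulesets = blender.get_rulesets()  # compare the rulesets' scores and rule counts before choosing one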
-
request_feature_impact
()¶ Request feature impacts to be computed for the model.
See
get_feature_impact
for more information on the result of the job.Returns: - job : Job
A Job representing the feature impact computation. To get the completed feature impact data, use job.get_result or job.get_result_when_complete.
Raises: - JobAlreadyRequested (422)
If the feature impacts have already been requested.
-
request_frozen_datetime_model
(training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, time_window_sample_pct=None)¶ Train a new frozen model with parameters from this model
Requires that this model belongs to a datetime partitioned project. If it does not, an error will occur when submitting the job.
Frozen models use the same tuning parameters as their parent model instead of independently optimizing them to allow efficiently retraining models on larger amounts of the training data.
In addition to training_row_count and training_duration, frozen datetime models may be trained on an exact date range. Only one of training_row_count, training_duration, or training_start_date and training_end_date should be specified.
Models specified using training_start_date and training_end_date are the only ones that can be trained into the holdout data (once the holdout is unlocked).
Parameters: - training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
- training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
- training_start_date : datetime.datetime, optional
the start date of the data to train the model on. Only rows occurring at or after this datetime will be used. If training_start_date is specified, training_end_date must also be specified.
- training_end_date : datetime.datetime, optional
the end date of the data to train the model on. Only rows occurring strictly before this datetime will be used. If training_end_date is specified, training_start_date must also be specified.
- time_window_sample_pct : int, optional
may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.
Returns: - model_job : ModelJob
the modeling job training a frozen model
-
request_frozen_model
(sample_pct=None, training_row_count=None)¶ Train a new frozen model with parameters from this model
Note
This method only works if the project the model belongs to is not datetime partitioned. If it is, use
request_frozen_datetime_model
instead.Frozen models use the same tuning parameters as their parent model instead of independently optimizing them to allow efficiently retraining models on larger amounts of the training data.
Parameters: - sample_pct : float
optional, the percentage of the dataset to use with the model. If not provided, will use the value from this model.
- training_row_count : int
(New in version v2.9) optional, the integer number of rows of the dataset to use with the model. Only one of sample_pct and training_row_count should be specified.
Returns: - model_job : ModelJob
the modeling job training a frozen model
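Example (a minimal sketch; the ids and row count are hypothetical):
import datarobot as dr
model = dr.BlenderModel.get('project-id', 'blender-model-id')  # hypothetical ids
model_job = model.request_frozen_model(training_row_count=5000)  # or sample_pct, but not both
frozen_model = model_job.get_result_when_complete()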
-
request_predictions
(dataset_id, include_prediction_intervals=None, prediction_intervals_size=None)¶ Request predictions against a previously uploaded dataset
Parameters: - dataset_id : string
The dataset to make predictions against (as uploaded from Project.upload_dataset)
- include_prediction_intervals : bool, optional
(New in v2.16) For time series projects only. Specifies whether prediction intervals should be calculated for this request. Defaults to True if prediction_intervals_size is specified, otherwise defaults to False.
- prediction_intervals_size : int, optional
(New in v2.16) For time series projects only. Represents the percentile to use for the size of the prediction intervals. Defaults to 80 if include_prediction_intervals is True.
Returns: - job : PredictJob
The job computing the predictions
-
request_training_predictions
(data_subset)¶ Start a job to build training predictions
Parameters: - data_subset : str
data set definition to build predictions on. Choices are:
- dr.enums.DATA_SUBSET.ALL for all data available. Not valid for models in datetime partitioned projects
- dr.enums.DATA_SUBSET.VALIDATION_AND_HOLDOUT for all data except the training set. Not valid for models in datetime partitioned projects
- dr.enums.DATA_SUBSET.HOLDOUT for the holdout data set only
- dr.enums.DATA_SUBSET.ALL_BACKTESTS for downloading the predictions for all backtest validation folds. Requires the model to have successfully scored all backtests. Datetime partitioned projects only
Returns: - Job
an instance of created async job
-
request_transferable_export
()¶ Request generation of an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
This function does not download the exported file. Use download_export for that.
Examples
model = datarobot.Model.get('p-id', 'l-id')
job = model.request_transferable_export()
job.wait_for_completion()
model.download_export('my_exported_model.drmodel')

# Client must be configured to use standalone prediction server for import:
datarobot.Client(token='my-token-at-standalone-server', endpoint='standalone-server-url/api/v2')
imported_model = datarobot.ImportedModel.create('my_exported_model.drmodel')
-
set_prediction_threshold
(threshold)¶ Set a custom prediction threshold for the model
May not be used once
prediction_threshold_read_only
is True for this model.
Parameters: - threshold : float
only used for binary classification projects. The threshold to use when deciding between the positive and negative classes when making predictions. Should be between 0.0 and 1.0 (inclusive).
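A short sketch (ids are placeholders) that checks the read-only flag before changing the threshold:
import datarobot as dr

model = dr.Model.get('p-id', 'l-id')
if not model.prediction_threshold_read_only:
    model.set_prediction_threshold(0.6)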
-
star_model
()¶ Mark the model as starred
Model stars propagate to the web application and the API, and can be used to filter when listing models.
-
start_advanced_tuning_session
()¶ Start an Advanced Tuning session. Returns an object that helps set up arguments for an Advanced Tuning model execution.
As of v2.15, all models support Advanced Tuning other than Blenders, OSS, and user-created.
Returns: - AdvancedTuningSession
Session for setting up and running Advanced Tuning on a model
-
train
(sample_pct=None, featurelist_id=None, scoring_type=None, training_row_count=None, monotonic_increasing_featurelist_id=<object object>, monotonic_decreasing_featurelist_id=<object object>)¶ Train the blueprint used in model on a particular featurelist or amount of data.
This method creates a new training job for worker and appends it to the end of the queue for this project. After the job has finished you can get the newly trained model by retrieving it from the project leaderboard, or by retrieving the result of the job.
Either sample_pct or training_row_count can be used to specify the amount of data to use, but not both. If neither are specified, a default of the maximum amount of data that can safely be used to train any blueprint without going into the validation data will be selected.
In smart-sampled projects, sample_pct and training_row_count are assumed to be in terms of rows of the minority class.
Note
For datetime partitioned projects, use
train_datetime
instead.
Parameters: - sample_pct : float, optional
The amount of data to use for training, as a percentage of the project dataset from 0 to 100.
- featurelist_id : str, optional
The identifier of the featurelist to use. If not defined, the featurelist of this model is used.
- scoring_type : str, optional
Either SCORING_TYPE.validation or SCORING_TYPE.cross_validation. SCORING_TYPE.validation is available for every partitioning type and indicates that the default model validation should be used for the project. If the project uses a form of cross-validation partitioning, SCORING_TYPE.cross_validation can also be used to indicate that all of the available training/validation combinations should be used to evaluate the model.
- training_row_count : int, optional
The number of rows to use to train the requested model.
- monotonic_increasing_featurelist_id : str
(new in version 2.11) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. Passing None disables the increasing monotonicity constraint. The default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
- monotonic_decreasing_featurelist_id : str
(new in version 2.11) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. Passing None disables the decreasing monotonicity constraint. The default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
Returns: - model_job_id : str
id of the created job; can be used as a parameter to the ModelJob.get method or the wait_for_async_model_creation function
Examples
project = Project.get('p-id')
model = Model.get('p-id', 'l-id')
model_job_id = model.train(training_row_count=project.max_train_rows)
-
train_datetime
(featurelist_id=None, training_row_count=None, training_duration=None, time_window_sample_pct=None)¶ Train this model on a different featurelist or amount of data
Requires that this model is part of a datetime partitioned project; otherwise, an error will occur.
Parameters: - featurelist_id : str, optional
the featurelist to use to train the model. If not specified, the featurelist of this model is used.
- training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
- training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
- time_window_sample_pct : int, optional
may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.
Returns: - job : ModelJob
the created job to build the model
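As an illustrative sketch (ids are placeholders, and 'P3M' is only an example duration string), the model could be retrained on a three-month window of data:
import datarobot as dr

model = dr.Model.get('p-id', 'l-id')
job = model.train_datetime(training_duration='P3M')
new_model = job.get_result_when_complete()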
-
unstar_model
()¶ Unmark the model as starred
Model stars propagate to the web application and the API, and can be used to filter when listing models.
DatetimeModel¶
-
class
datarobot.models.
DatetimeModel
(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, time_window_sample_pct=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, training_info=None, holdout_score=None, holdout_status=None, data_selection_method=None, backtests=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None, is_starred=None, prediction_threshold=None, prediction_threshold_read_only=None, effective_feature_derivation_window_start=None, effective_feature_derivation_window_end=None, forecast_window_start=None, forecast_window_end=None, windows_basis_unit=None)¶ A model from a datetime partitioned project
Only one of training_row_count, training_duration, or training_start_date and training_end_date will be specified, depending on the data_selection_method of the model. Whichever method was selected determines the amount of data used to train on when making predictions and scoring the backtests and the holdout.
Attributes: - id : str
the id of the model
- project_id : str
the id of the project the model belongs to
- processes : list of str
the processes used by the model
- featurelist_name : str
the name of the featurelist used by the model
- featurelist_id : str
the id of the featurelist used by the model
- sample_pct : float
the percentage of the project dataset used in training the model
- training_row_count : int or None
If specified, an int specifying the number of rows used to train the model and evaluate backtest scores.
- training_duration : str or None
If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores.
- training_start_date : datetime or None
only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
- training_end_date : datetime or None
only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
- time_window_sample_pct : int or None
An integer between 1 and 99 indicating the percentage of sampling within the training window. The points kept are determined by a random uniform sample. If not specified, no sampling was done.
- model_type : str
what model this is, e.g. ‘Nystroem Kernel SVM Regressor’
- model_category : str
what kind of model this is - ‘prime’ for DataRobot Prime models, ‘blend’ for blender models, and ‘model’ for other models
- is_frozen : bool
whether this model is a frozen model
- blueprint_id : str
the id of the blueprint used in this model
- metrics : dict
a mapping from each metric to the model’s scores for that metric. The keys in metrics are the different metrics used to evaluate the model, and the values are the results. The dictionaries inside of metrics will be as described here: ‘validation’, the score for a single backtest; ‘crossValidation’, always None; ‘backtesting’, the average score for all backtests if all are available and computed, or None otherwise; ‘backtestingScores’, a list of scores for all backtests where the score is None if that backtest does not have a score available; and ‘holdout’, the score for the holdout or None if the holdout is locked or the score is unavailable.
- backtests : list of dict
describes what data was used to fit each backtest, the score for the project metric, and why the backtest score is unavailable if it is not provided.
- data_selection_method : str
which of training_row_count, training_duration, or training_start_date and training_end_date were used to determine the data used to fit the model. One of ‘rowCount’, ‘duration’, or ‘selectedDateRange’.
- training_info : dict
describes which data was used to train on when scoring the holdout and making predictions. training_info will have the following keys: holdout_training_start_date, holdout_training_duration, holdout_training_row_count, holdout_training_end_date, prediction_training_start_date, prediction_training_duration, prediction_training_row_count, prediction_training_end_date. Start and end dates will be datetimes, durations will be duration strings, and rows will be integers.
- holdout_score : float or None
the score against the holdout, if available and the holdout is unlocked, according to the project metric.
- holdout_status : string or None
the status of the holdout score, e.g. “COMPLETED”, “HOLDOUT_BOUNDARIES_EXCEEDED”. Unavailable if the holdout fold was disabled in the partitioning configuration.
- monotonic_increasing_featurelist_id : str
optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
- monotonic_decreasing_featurelist_id : str
optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
- supports_monotonic_constraints : bool
optional, whether this model supports enforcing monotonic constraints
- is_starred : bool
whether this model is marked as starred
- prediction_threshold : float
for binary classification projects, the threshold used for predictions
- prediction_threshold_read_only : bool
indicates whether modification of the prediction threshold is forbidden. Threshold modification is forbidden once a model has had a deployment created or predictions made via the dedicated prediction API.
- effective_feature_derivation_window_start : int or None
(New in v2.16) For time series projects only. How many timeUnits into the past relative to the forecast point the user needs to provide history for at prediction time. This can differ from the feature_derivation_window_start set on the project due to the differencing method and period selected, or if the model is a time series native model such as ARIMA. Will be a negative integer in time series projects and None otherwise.
- effective_feature_derivation_window_end : int or None
(New in v2.16) For time series projects only. How many timeUnits into the past relative to the forecast point the feature derivation window should end. Will be a non-positive integer in time series projects and None otherwise.
- forecast_window_start : int or None
(New in v2.16) For time series projects only. How many timeUnits into the future relative to the forecast point the forecast window should start. Note that this field will be the same as what is shown in the project settings. Will be a non-negative integer in time series projects and None otherwise.
- forecast_window_end : int or None
(New in v2.16) For time series projects only. How many timeUnits into the future relative to the forecast point the forecast window should end. Note that this field will be the same as what is shown in the project settings. Will be a non-negative integer in time series projects and None otherwise.
- windows_basis_unit : str or None
(New in v2.16) For time series projects only. Indicates which unit is the basis for the feature derivation window and the forecast window. Note that this field will be the same as what is shown in the project settings. In time series projects, will be either the detected time unit or “ROW”, and None otherwise.
-
classmethod
get
(project, model_id)¶ Retrieve a specific datetime model
If the project does not use datetime partitioning, a ClientError will occur.
Parameters: - project : str
the id of the project the model belongs to
- model_id : str
the id of the model to retrieve
Returns: - model : DatetimeModel
the model
-
score_backtests
()¶ Compute the scores for all available backtests
Some backtests may be unavailable if the model is trained into their validation data.
Returns: - job : Job
a job tracking the backtest computation. When it is complete, all available backtests will have scores computed.
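A sketch of scoring backtests and then reading the results from the refreshed model (ids and the metric name are placeholders):
import datarobot as dr

model = dr.DatetimeModel.get('p-id', 'l-id')
job = model.score_backtests()
job.wait_for_completion()

model = dr.DatetimeModel.get('p-id', 'l-id')  # re-fetch to pick up the new backtest scores
print(model.metrics['RMSE']['backtesting'])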
-
cross_validate
()¶ Inherited from Model - DatetimeModels cannot request Cross Validation; use score_backtests instead.
-
get_cross_validation_scores
(partition=None, metric=None)¶ Inherited from Model - DatetimeModels cannot request Cross Validation scores; use backtests instead.
-
request_training_predictions
(data_subset)¶ Start a job to build training predictions
Parameters: - data_subset : str
data set definition to build predictions on. Choices are:
- dr.enums.DATA_SUBSET.HOLDOUT for the holdout data set only.
- dr.enums.DATA_SUBSET.ALL_BACKTESTS for downloading the predictions for all backtest validation folds. Requires the model to have successfully scored all backtests.
Returns: - Job
an instance of created async job
-
get_series_accuracy_as_dataframe
()¶ Retrieve the Series Accuracy for the specified model as a pandas.DataFrame.
New in version v2.16.
Returns: - data
A pandas.DataFrame with the Series Accuracy for the specified model.
-
download_series_accuracy_as_csv
(filename, encoding='utf-8')¶ Save the Series Accuracy for the specified model into a csv file.
New in version v2.16.
Parameters: - filename : str or file object
The path or file object to save the data to.
- encoding : str, optional
A string representing the encoding to use in the output csv file. Defaults to ‘utf-8’.
-
compute_series_accuracy
()¶ Compute the Series Accuracy for this model
New in version v2.16.
Returns: - Job
an instance of the created async job
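Taken together, the three Series Accuracy methods above support a simple compute-then-retrieve workflow; a sketch (ids and the output path are placeholders):
import datarobot as dr

model = dr.DatetimeModel.get('p-id', 'l-id')
job = model.compute_series_accuracy()
job.wait_for_completion()

series_accuracy = model.get_series_accuracy_as_dataframe()
model.download_series_accuracy_as_csv('series_accuracy.csv')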
-
advanced_tune
(params, description=None)¶ Generate a new model with the specified advanced-tuning parameters
As of v2.15, all models support Advanced Tuning other than Blenders, OSS, and user-created.
Parameters: - params : dict
Mapping of parameter ID to parameter value. The list of valid parameter IDs for a model can be found by calling get_advanced_tuning_parameters(). This endpoint does not need to include values for all parameters. If a parameter is omitted, its current_value will be used.
- description : unicode
Human-readable string describing the newly advanced-tuned model
Returns: - ModelJob
The created job to build the model
-
delete
()¶ Delete a model from the project’s leaderboard.
-
download_export
(filepath)¶ Download an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
Parameters: - filepath : str
The path at which to save the exported model file.
-
download_scoring_code
(file_name, source_code=False)¶ Download scoring code JAR.
Parameters: - file_name : str
File path where scoring code will be saved.
- source_code : bool, optional
Set to True to download source code archive. It will not be executable.
-
classmethod
fetch_resource_data
(url, join_endpoint=True)¶ (Deprecated.) Used to acquire model data directly from its url.
Consider using get instead, as this is a convenience function used for development of datarobot
Parameters: - url : string
The resource we are acquiring
- join_endpoint : boolean, optional
Whether the client’s endpoint should be joined to the URL before sending the request. Location headers are returned as absolute locations, so will _not_ need the endpoint
Returns: - model_data : dict
The queried model’s data
-
get_advanced_tuning_parameters
()¶ Get the advanced-tuning parameters available for this model.
As of v2.15, all models support Advanced Tuning other than Blenders, OSS, and user-created.
Returns: - dict
A dictionary describing the advanced-tuning parameters for the current model. There are two top-level keys, tuningDescription and tuningParameters.
tuningDescription is an optional value. If not None, it indicates the user-specified description of this set of tuning parameters.
tuningParameters is a list of dicts, each of which has the following keys:
- parameterName : (unicode) name of the parameter (unique per task, see below)
- parameterId : (unicode) opaque ID string uniquely identifying parameter
- defaultValue : (*) default value of the parameter for the blueprint
- currentValue : (*) value of the parameter that was used for this model
- taskName : (unicode) name of the task that this parameter belongs to
- constraints: (dict) see the notes below
Notes
The type of defaultValue and currentValue is defined by the constraints structure. It will be a string or numeric Python type.
constraints is a dict with at least one, possibly more, of the following keys. The presence of a key indicates that the parameter may take on the specified type. (If a key is absent, this means that the parameter may not take on the specified type.) If a key on constraints is present, its value will be a dict containing all of the fields described below for that key.
"constraints": { "select": { "values": [<list(basestring or number) : possible values>] }, "ascii": {}, "unicode": {}, "int": { "min": <int : minimum valid value>, "max": <int : maximum valid value>, "supports_grid_search": <bool : True if Grid Search may be requested for this param> }, "float": { "min": <float : minimum valid value>, "max": <float : maximum valid value>, "supports_grid_search": <bool : True if Grid Search may be requested for this param> }, "intList": { "length": { "min_length": <int : minimum valid length>, "max_length": <int : maximum valid length> "min_val": <int : minimum valid value>, "max_val": <int : maximum valid value> "supports_grid_search": <bool : True if Grid Search may be requested for this param> }, "floatList": { "min_length": <int : minimum valid length>, "max_length": <int : maximum valid length> "min_val": <float : minimum valid value>, "max_val": <float : maximum valid value> "supports_grid_search": <bool : True if Grid Search may be requested for this param> } }
The keys have meaning as follows:
- select: Rather than specifying a specific data type, if present, it indicates that the parameter is permitted to take on any of the specified values. Listed values may be of any string or real (non-complex) numeric type.
- ascii: The parameter may be a unicode object that encodes simple ASCII characters. (A-Z, a-z, 0-9, whitespace, and certain common symbols.) In addition to listed constraints, ASCII keys currently may not contain either newlines or semicolons.
- unicode: The parameter may be any Python unicode object.
- int: The value may be an object of type int within the specified range (inclusive). Please note that the value will be passed around using the JSON format, and some JSON parsers have undefined behavior with integers outside of the range [-(2**53)+1, (2**53)-1].
- float: The value may be an object of type float within the specified range (inclusive).
- intList, floatList: The value may be a list of int or float objects, respectively, following the constraints specified by the int and float types (above).
Many parameters only specify one key under constraints. If a parameter specifies multiple keys, the parameter may take on any value permitted by any key.
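Putting the pieces together, one possible flow (a sketch, using the key names documented above; the ids and values are placeholders) is to inspect the available parameters and then retrain with a changed value via advanced_tune:
import datarobot as dr

model = dr.Model.get('p-id', 'l-id')
tuning = model.get_advanced_tuning_parameters()
for param in tuning['tuningParameters']:
    print(param['taskName'], param['parameterName'], param['currentValue'])

# 'some-parameter-id' is a placeholder for a parameterId returned above
model_job = model.advanced_tune({'some-parameter-id': 0.05}, description='lower learning rate')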
-
get_all_confusion_charts
(fallback_to_parent_insights=False)¶ Retrieve a list of all confusion charts available for the model.
Parameters: - fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return confusion chart data for this model’s parent for any source that is not available for this model, provided this model has a defined parent model. If omitted or False, or this model has no parent, this will not attempt to retrieve any data from this model’s parent.
Returns: - list of ConfusionChart
Data for all available confusion charts for model.
-
get_all_lift_charts
(fallback_to_parent_insights=False)¶ Retrieve a list of all lift charts available for the model.
Parameters: - fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return lift chart data for this model’s parent for any source that is not available for this model and if this model has a defined parent model. If omitted or False, or this model has no parent, this will not attempt to retrieve any data from this model’s parent.
Returns: - list of LiftChart
Data for all available model lift charts.
-
get_all_roc_curves
(fallback_to_parent_insights=False)¶ Retrieve a list of all ROC curves available for the model.
Parameters: - fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return ROC curve data for this model’s parent for any source that is not available for this model and if this model has a defined parent model. If omitted or False, or this model has no parent, this will not attempt to retrieve any data from this model’s parent.
Returns: - list of RocCurve
Data for all available model ROC curves.
-
get_confusion_chart
(source, fallback_to_parent_insights=False)¶ Retrieve model’s confusion chart for the specified source.
Parameters: - source : str
Confusion chart source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
- fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return confusion chart data for this model’s parent if the confusion chart is not available for this model and this model has a defined parent model. If omitted or False, or there is no parent model, will not attempt to return insight data from this model’s parent.
Returns: - ConfusionChart
Model ConfusionChart data
Raises: - ClientError
If the insight is not available for this model
-
get_feature_impact
()¶ Retrieve the computed Feature Impact results, a measure of the relevance of each feature in the model.
Feature Impact is computed for each column by creating new data with that column randomly permuted (but the others left unchanged), and seeing how the error metric score for the predictions is affected. The ‘impactUnnormalized’ is how much worse the error metric score is when making predictions on this modified data. The ‘impactNormalized’ is normalized so that the largest value is 1. In both cases, larger values indicate more important features.
If a feature is a redundant feature, i.e. once other features are considered it doesn’t contribute much additional information, the ‘redundantWith’ value is the name of the feature that has the highest correlation with this feature. Note that redundancy detection is only available for jobs run after the addition of this feature. When retrieving data that predates this functionality, a NoRedundancyImpactAvailable warning will be used.
Elsewhere this technique is sometimes called ‘Permutation Importance’.
Requires that Feature Impact has already been computed with request_feature_impact.
Returns: - feature_impacts : list of dict
The feature impact data. Each item is a dict with the keys ‘featureName’, ‘impactNormalized’, ‘impactUnnormalized’, and ‘redundantWith’.
Raises: - ClientError (404)
If the feature impacts have not been computed.
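A sketch combining this with get_or_request_feature_impact (documented below), which requests the computation if needed and returns the results (ids are placeholders):
import datarobot as dr

model = dr.Model.get('p-id', 'l-id')
feature_impacts = model.get_or_request_feature_impact()
for fi in sorted(feature_impacts, key=lambda f: f['impactNormalized'], reverse=True):
    print(fi['featureName'], fi['impactNormalized'], fi.get('redundantWith'))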
-
get_features_used
()¶ Query the server to determine which features were used.
Note that the data returned by this method is possibly different than the names of the features in the featurelist used by this model. This method will return the raw features that must be supplied in order for predictions to be generated on a new set of data. The featurelist, in contrast, would also include the names of derived features.
Returns: - features : list of str
The names of the features used in the model.
-
get_leaderboard_ui_permalink
()¶ Returns: - url : str
Permanent static hyperlink to this model at leaderboard.
-
get_lift_chart
(source, fallback_to_parent_insights=False)¶ Retrieve model lift chart for the specified source.
Parameters: - source : str
Lift chart data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
- fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return lift chart data for this model’s parent if the lift chart is not available for this model and the model has a defined parent model. If omitted or False, or there is no parent model, will not attempt to return insight data from this model’s parent.
Returns: - LiftChart
Model lift chart data
Raises: - ClientError
If the insight is not available for this model
-
get_missing_report_info
()¶ Retrieve a model missing data report on training data that can be used to understand missing values treatment in a model. The report consists of missing value reports for numeric and categorical features that took part in modeling.
Returns: - An iterable of MissingReportPerFeature
The queried model missing report, sorted by missing count (DESCENDING order).
-
get_model_blueprint_chart
()¶ Retrieve a model blueprint chart that can be used to understand data flow in blueprint.
Returns: - ModelBlueprintChart
The queried model blueprint chart.
-
get_model_blueprint_documents
()¶ Get documentation for tasks used in this model.
Returns: - list of BlueprintTaskDocument
All documents available for the model.
-
get_or_request_feature_impact
(max_wait=600)¶ Retrieve feature impact for the model, requesting a job if it hasn’t been run previously
Parameters: - max_wait : int, optional
The maximum time to wait for a requested feature impact job to complete before erroring
Returns: - feature_impacts : list of dict
The feature impact data. See
get_feature_impact
for the exact schema.
-
get_parameters
()¶ Retrieve model parameters.
Returns: - ModelParameters
Model parameters for this model.
-
get_pareto_front
()¶ Retrieve the Pareto Front for a Eureqa model.
This method is only supported for Eureqa models.
Returns: - ParetoFront
Model ParetoFront data
-
get_prime_eligibility
()¶ Check if this model can be approximated with DataRobot Prime
Returns: - prime_eligibility : dict
a dict indicating whether a model can be approximated with DataRobot Prime (key can_make_prime) and why it may be ineligible (key message)
-
get_roc_curve
(source, fallback_to_parent_insights=False)¶ Retrieve model ROC curve for the specified source.
Parameters: - source : str
ROC curve data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
- fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return ROC curve data for this model’s parent if the ROC curve is not available for this model and the model has a defined parent model. If omitted or False, or there is no parent model, will not attempt to return data from this model’s parent.
Returns: - RocCurve
Model ROC curve data
Raises: - ClientError
If the insight is not available for this model
-
get_rulesets
()¶ List the rulesets approximating this model generated by DataRobot Prime
If this model hasn’t been approximated yet, this will return an empty list. Note that these are rulesets approximating this model, not rulesets used to construct this model.
Returns: - rulesets : list of Ruleset
-
get_supported_capabilities
()¶ Retrieves a summary of the capabilities supported by a model.
New in version v2.14.
Returns: - supportsBlending: bool
whether the model supports blending
- supportsMonotonicConstraints: bool
whether the model supports monotonic constraints
- hasWordCloud: bool
whether the model has word cloud data available
- eligibleForPrime: bool
whether the model is eligible for Prime
- hasParameters: bool
whether the model has parameters that can be retrieved
-
get_word_cloud
(exclude_stop_words=False)¶ Retrieve word cloud data for the model.
Parameters: - exclude_stop_words : bool, optional
Set to True if you want stopwords filtered out of the response.
Returns: - WordCloud
Word cloud data for the model.
-
open_model_browser
()¶ Opens model at project leaderboard in web browser.
Note: If text-mode browsers are used, the calling process will block until the user exits the browser.
-
request_approximation
()¶ Request an approximation of this model using DataRobot Prime
This will create several rulesets that could be used to approximate this model. After comparing their scores and rule counts, the code used in the approximation can be downloaded and run locally.
Returns: - job : Job
the job generating the rulesets
-
request_feature_impact
()¶ Request feature impacts to be computed for the model.
See get_feature_impact for more information on the result of the job.
Returns: - job : Job
A Job representing the feature impact computation. To get the completed feature impact data, use job.get_result or job.get_result_when_complete.
Raises: - JobAlreadyRequested (422)
If the feature impacts have already been requested.
-
request_frozen_datetime_model
(training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, time_window_sample_pct=None)¶ Train a new frozen model with parameters from this model
Requires that this model belongs to a datetime partitioned project. If it does not, an error will occur when submitting the job.
Frozen models use the same tuning parameters as their parent model instead of independently optimizing them, which allows models to be retrained efficiently on larger amounts of the training data.
In addition to training_row_count and training_duration, frozen datetime models may be trained on an exact date range. Only one of training_row_count, training_duration, or training_start_date and training_end_date should be specified.
Models specified using training_start_date and training_end_date are the only ones that can be trained into the holdout data (once the holdout is unlocked).
Parameters: - training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
- training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
- training_start_date : datetime.datetime, optional
the start date of the data to train the model on. Only rows occurring at or after this datetime will be used. If training_start_date is specified, training_end_date must also be specified.
- training_end_date : datetime.datetime, optional
the end date of the data to train the model on. Only rows occurring strictly before this datetime will be used. If training_end_date is specified, training_start_date must also be specified.
- time_window_sample_pct : int, optional
may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.
Returns: - model_job : ModelJob
the modeling job training a frozen model
-
request_predictions
(dataset_id, include_prediction_intervals=None, prediction_intervals_size=None)¶ Request predictions against a previously uploaded dataset
Parameters: - dataset_id : string
The dataset to make predictions against (as uploaded from Project.upload_dataset)
- include_prediction_intervals : bool, optional
(New in v2.16) For time series projects only. Specifies whether prediction intervals should be calculated for this request. Defaults to True if prediction_intervals_size is specified, otherwise defaults to False.
- prediction_intervals_size : int, optional
(New in v2.16) For time series projects only. Represents the percentile to use for the size of the prediction intervals. Defaults to 80 if include_prediction_intervals is True.
Returns: - job : PredictJob
The job computing the predictions
-
request_transferable_export
()¶ Request generation of an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
This function does not download the exported file. Use download_export for that.
Examples
model = datarobot.Model.get('p-id', 'l-id')
job = model.request_transferable_export()
job.wait_for_completion()
model.download_export('my_exported_model.drmodel')

# Client must be configured to use standalone prediction server for import:
datarobot.Client(token='my-token-at-standalone-server', endpoint='standalone-server-url/api/v2')
imported_model = datarobot.ImportedModel.create('my_exported_model.drmodel')
-
set_prediction_threshold
(threshold)¶ Set a custom prediction threshold for the model
May not be used once
prediction_threshold_read_only
is True for this model.
Parameters: - threshold : float
only used for binary classification projects. The threshold to use when deciding between the positive and negative classes when making predictions. Should be between 0.0 and 1.0 (inclusive).
-
star_model
()¶ Mark the model as starred
Model stars propagate to the web application and the API, and can be used to filter when listing models.
-
start_advanced_tuning_session
()¶ Start an Advanced Tuning session. Returns an object that helps set up arguments for an Advanced Tuning model execution.
As of v2.15, all models support Advanced Tuning other than Blenders, OSS, and user-created.
Returns: - AdvancedTuningSession
Session for setting up and running Advanced Tuning on a model
-
train_datetime
(featurelist_id=None, training_row_count=None, training_duration=None, time_window_sample_pct=None)¶ Train this model on a different featurelist or amount of data
Requires that this model is part of a datetime partitioned project; otherwise, an error will occur.
Parameters: - featurelist_id : str, optional
the featurelist to use to train the model. If not specified, the featurelist of this model is used.
- training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
- training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
- time_window_sample_pct : int, optional
may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.
Returns: - job : ModelJob
the created job to build the model
-
unstar_model
()¶ Unmark the model as starred
Model stars propagate to the web application and the API, and can be used to filter when listing models.
Frozen Model¶
-
class
datarobot.models.
FrozenModel
(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, parent_model_id=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None, is_starred=None, prediction_threshold=None, prediction_threshold_read_only=None)¶ A model tuned with parameters which are derived from another model
Attributes: - id : str
the id of the model
- project_id : str
the id of the project the model belongs to
- processes : list of str
the processes used by the model
- featurelist_name : str
the name of the featurelist used by the model
- featurelist_id : str
the id of the featurelist used by the model
- sample_pct : float
the percentage of the project dataset used in training the model
- training_row_count : int or None
the number of rows of the project dataset used in training the model. In a datetime partitioned project, if specified, defines the number of rows used to train the model and evaluate backtest scores; if unspecified, either training_duration or training_start_date and training_end_date was used to determine that instead.
- training_duration : str or None
only present for models in datetime partitioned projects. If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores.
- training_start_date : datetime or None
only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
- training_end_date : datetime or None
only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
- model_type : str
what model this is, e.g. ‘Nystroem Kernel SVM Regressor’
- model_category : str
what kind of model this is - ‘prime’ for DataRobot Prime models, ‘blend’ for blender models, and ‘model’ for other models
- is_frozen : bool
whether this model is a frozen model
- parent_model_id : str
the id of the model that tuning parameters are derived from
- blueprint_id : str
the id of the blueprint used in this model
- metrics : dict
a mapping from each metric to the model’s scores for that metric
- monotonic_increasing_featurelist_id : str
optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
- monotonic_decreasing_featurelist_id : str
optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
- supports_monotonic_constraints : bool
optional, whether this model supports enforcing monotonic constraints
- is_starred : bool
whether this model is marked as starred
- prediction_threshold : float
for binary classification projects, the threshold used for predictions
- prediction_threshold_read_only : bool
indicates whether modification of the prediction threshold is forbidden. Threshold modification is forbidden once a model has had a deployment created or predictions made via the dedicated prediction API.
-
classmethod
get
(project_id, model_id)¶ Retrieve a specific frozen model.
Parameters: - project_id : str
The project’s id.
- model_id : str
The
model_id
of the leaderboard item to retrieve.
Returns: - model : FrozenModel
The queried instance.
Imported Model¶
Note
Imported Models are used in Stand Alone Scoring Engines. If you are not an administrator of such an engine, they are not relevant to you.
-
class
datarobot.models.
ImportedModel
(id, imported_at=None, model_id=None, target=None, featurelist_name=None, dataset_name=None, model_name=None, project_id=None, version=None, note=None, origin_url=None, imported_by_username=None, project_name=None, created_by_username=None, created_by_id=None, imported_by_id=None, display_name=None)¶ Represents an imported model available for making predictions. These are only relevant for administrators of on-premise Stand Alone Scoring Engines.
ImportedModels are trained in one DataRobot application, exported as a .drmodel file, and then imported for use in a Stand Alone Scoring Engine.
Attributes: - id : str
id of the import
- model_name : str
model type describing the model generated by DataRobot
- display_name : str
manually specified human-readable name of the imported model
- note : str
manually added note about this imported model
- imported_at : datetime
the time the model was imported
- imported_by_username : str
username of the user who imported the model
- imported_by_id : str
id of the user who imported the model
- origin_url : str
URL of the application the model was exported from
- model_id : str
original id of the model prior to export
- featurelist_name : str
name of the featurelist used to train the model
- project_id : str
id of the project the model belonged to prior to export
- project_name : str
name of the project the model belonged to prior to export
- target : str
the target of the project the model belonged to prior to export
- version : float
project version of the project the model belonged to
- dataset_name : str
filename of the dataset used to create the project the model belonged to
- created_by_username : str
username of the user who created the model prior to export
- created_by_id : str
id of the user who created the model prior to export
-
classmethod
create
(path)¶ Import a previously exported model for predictions.
Parameters: - path : str
The path to the exported model file
-
classmethod
get
(import_id)¶ Retrieve imported model info
Parameters: - import_id : str
The ID of the imported model.
Returns: - imported_model : ImportedModel
The ImportedModel instance
-
classmethod
list
(limit=None, offset=None)¶ List the imported models.
Parameters: - limit : int
The number of records to return. The server will use a (possibly finite) default if not specified.
- offset : int
The number of records to skip.
Returns: - imported_models : list[ImportedModel]
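For example (a sketch; assumes the client is configured against the Stand Alone Scoring Engine):
import datarobot as dr

for imported in dr.ImportedModel.list(limit=20):
    print(imported.id, imported.display_name, imported.imported_at)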
-
update
(display_name=None, note=None)¶ Update the display name or note for an imported model. The ImportedModel object is updated in place.
Parameters: - display_name : str
The new display name.
- note : str
The new note.
-
delete
()¶ Delete this imported model.
RatingTableModel¶
-
class
datarobot.models.
RatingTableModel
(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, model_type=None, model_category=None, is_frozen=None, blueprint_id=None, metrics=None, rating_table_id=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, supports_monotonic_constraints=None, is_starred=None, prediction_threshold=None, prediction_threshold_read_only=None)¶ A model that has a rating table.
Attributes: - id : str
the id of the model
- project_id : str
the id of the project the model belongs to
- processes : list of str
the processes used by the model
- featurelist_name : str
the name of the featurelist used by the model
- featurelist_id : str
the id of the featurelist used by the model
- sample_pct : float or None
the percentage of the project dataset used in training the model. If the project uses datetime partitioning, the sample_pct will be None. See training_row_count, training_duration, and training_start_date and training_end_date instead.
- training_row_count : int or None
the number of rows of the project dataset used in training the model. In a datetime partitioned project, if specified, defines the number of rows used to train the model and evaluate backtest scores; if unspecified, either training_duration or training_start_date and training_end_date was used to determine that instead.
- training_duration : str or None
only present for models in datetime partitioned projects. If specified, a duration string specifying the duration spanned by the data used to train the model and evaluate backtest scores.
- training_start_date : datetime or None
only present for frozen models in datetime partitioned projects. If specified, the start date of the data used to train the model.
- training_end_date : datetime or None
only present for frozen models in datetime partitioned projects. If specified, the end date of the data used to train the model.
- model_type : str
what model this is, e.g. ‘Nystroem Kernel SVM Regressor’
- model_category : str
what kind of model this is - ‘prime’ for DataRobot Prime models, ‘blend’ for blender models, and ‘model’ for other models
- is_frozen : bool
whether this model is a frozen model
- blueprint_id : str
the id of the blueprint used in this model
- metrics : dict
a mapping from each metric to the model’s scores for that metric
- rating_table_id : str
the id of the rating table that belongs to this model
- monotonic_increasing_featurelist_id : str
optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced.
- monotonic_decreasing_featurelist_id : str
optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced.
- supports_monotonic_constraints : bool
optional, whether this model supports enforcing monotonic constraints
- is_starred : bool
whether this model is marked as starred
- prediction_threshold : float
for binary classification projects, the threshold used for predictions
- prediction_threshold_read_only : bool
indicates whether modification of the prediction threshold is forbidden. Threshold modification is forbidden once a model has had a deployment created or predictions made via the dedicated prediction API.
-
classmethod
get
(project_id, model_id)¶ Retrieve a specific rating table model
If the project does not have a rating table, a ClientError will occur.
Parameters: - project_id : str
the id of the project the model belongs to
- model_id : str
the id of the model to retrieve
Returns: - model : RatingTableModel
the model
-
classmethod
create_from_rating_table
(project_id, rating_table_id)¶ Creates a new model from a validated rating table record. The RatingTable must not be associated with an existing model.
Parameters: - project_id : str
the id of the project the rating table belongs to
- rating_table_id : str
the id of the rating table to create this model from
Returns: - job: Job
an instance of created async job
Raises: - ClientError (422)
Raised if creating model from a RatingTable that failed validation
- JobAlreadyRequested
Raised if creating model from a RatingTable that is already associated with a RatingTableModel
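A minimal sketch (the project and rating table ids are placeholders):
import datarobot as dr

job = dr.models.RatingTableModel.create_from_rating_table('p-id', 'rt-id')
rating_table_model = job.get_result_when_complete()  # wait for the async job and fetch the new model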
-
advanced_tune
(params, description=None)¶ Generate a new model with the specified advanced-tuning parameters
As of v2.15, all models support Advanced Tuning other than Blenders, OSS, and user-created.
Parameters: - params : dict
Mapping of parameter ID to parameter value. The list of valid parameter IDs for a model can be found by calling get_advanced_tuning_parameters(). This endpoint does not need to include values for all parameters. If a parameter is omitted, its current_value will be used.
- description : unicode
Human-readable string describing the newly advanced-tuned model
Returns: - ModelJob
The created job to build the model
-
cross_validate
()¶ Run Cross Validation on this model.
Note
To perform Cross Validation on a new model with new parameters, use train instead.
Returns: - ModelJob
The created job to build the model
-
delete
()¶ Delete a model from the project’s leaderboard.
-
download_export
(filepath)¶ Download an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
Parameters: - filepath : str
The path at which to save the exported model file.
-
download_scoring_code
(file_name, source_code=False)¶ Download scoring code JAR.
Parameters: - file_name : str
File path where scoring code will be saved.
- source_code : bool, optional
Set to True to download source code archive. It will not be executable.
-
classmethod
fetch_resource_data
(url, join_endpoint=True)¶ (Deprecated.) Used to acquire model data directly from its url.
Consider using get instead, as this is a convenience function used for development of datarobot
Parameters: - url : string
The resource we are acquiring
- join_endpoint : boolean, optional
Whether the client’s endpoint should be joined to the URL before sending the request. Location headers are returned as absolute locations, so will _not_ need the endpoint
Returns: - model_data : dict
The queried model’s data
-
get_advanced_tuning_parameters
()¶ Get the advanced-tuning parameters available for this model.
As of v2.15, all models support Advanced Tuning other than Blenders, OSS, and user-created.
Returns: - dict
A dictionary describing the advanced-tuning parameters for the current model. There are two top-level keys, tuningDescription and tuningParameters.
tuningDescription is an optional value. If not None, it indicates the user-specified description of this set of tuning parameters.
tuningParameters is a list of dicts, each of which has the following keys:
- parameterName : (unicode) name of the parameter (unique per task, see below)
- parameterId : (unicode) opaque ID string uniquely identifying parameter
- defaultValue : (*) default value of the parameter for the blueprint
- currentValue : (*) value of the parameter that was used for this model
- taskName : (unicode) name of the task that this parameter belongs to
- constraints: (dict) see the notes below
Notes
The type of defaultValue and currentValue is defined by the constraints structure. It will be a string or numeric Python type.
constraints is a dict with at least one, possibly more, of the following keys. The presence of a key indicates that the parameter may take on the specified type. (If a key is absent, this means that the parameter may not take on the specified type.) If a key on constraints is present, its value will be a dict containing all of the fields described below for that key.
"constraints": { "select": { "values": [<list(basestring or number) : possible values>] }, "ascii": {}, "unicode": {}, "int": { "min": <int : minimum valid value>, "max": <int : maximum valid value>, "supports_grid_search": <bool : True if Grid Search may be requested for this param> }, "float": { "min": <float : minimum valid value>, "max": <float : maximum valid value>, "supports_grid_search": <bool : True if Grid Search may be requested for this param> }, "intList": { "length": { "min_length": <int : minimum valid length>, "max_length": <int : maximum valid length> "min_val": <int : minimum valid value>, "max_val": <int : maximum valid value> "supports_grid_search": <bool : True if Grid Search may be requested for this param> }, "floatList": { "min_length": <int : minimum valid length>, "max_length": <int : maximum valid length> "min_val": <float : minimum valid value>, "max_val": <float : maximum valid value> "supports_grid_search": <bool : True if Grid Search may be requested for this param> } }
The keys have meaning as follows:
- select: Rather than specifying a specific data type, if present, it indicates that the parameter is permitted to take on any of the specified values. Listed values may be of any string or real (non-complex) numeric type.
- ascii: The parameter may be a unicode object that encodes simple ASCII characters. (A-Z, a-z, 0-9, whitespace, and certain common symbols.) In addition to listed constraints, ASCII keys currently may not contain either newlines or semicolons.
- unicode: The parameter may be any Python unicode object.
- int: The value may be an object of type int within the specified range (inclusive). Please note that the value will be passed around using the JSON format, and some JSON parsers have undefined behavior with integers outside of the range [-(2**53)+1, (2**53)-1].
- float: The value may be an object of type float within the specified range (inclusive).
- intList, floatList: The value may be a list of int or float objects, respectively, following the constraints specified by the int and float types (above).
Many parameters only specify one key under constraints. If a parameter specifies multiple keys, the parameter may take on any value permitted by any key.
-
get_all_confusion_charts
(fallback_to_parent_insights=False)¶ Retrieve a list of all confusion charts available for the model.
Parameters: - fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return confusion chart data for this model’s parent for any source that is not available for this model, provided this model has a defined parent model. If omitted or False, or this model has no parent, this will not attempt to retrieve any data from this model’s parent.
Returns: - list of ConfusionChart
Data for all available confusion charts for model.
-
get_all_lift_charts
(fallback_to_parent_insights=False)¶ Retrieve a list of all lift charts available for the model.
Parameters: - fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return lift chart data for this model’s parent for any source that is not available for this model and if this model has a defined parent model. If omitted or False, or this model has no parent, this will not attempt to retrieve any data from this model’s parent.
Returns: - list of LiftChart
Data for all available model lift charts.
-
get_all_roc_curves
(fallback_to_parent_insights=False)¶ Retrieve a list of all ROC curves available for the model.
Parameters: - fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return ROC curve data for this model’s parent for any source that is not available for this model and if this model has a defined parent model. If omitted or False, or this model has no parent, this will not attempt to retrieve any data from this model’s parent.
Returns: - list of RocCurve
Data for all available model ROC curves.
-
get_confusion_chart
(source, fallback_to_parent_insights=False)¶ Retrieve model’s confusion chart for the specified source.
Parameters: - source : str
Confusion chart source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
- fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return confusion chart data for this model’s parent if the confusion chart is not available for this model and this model has a defined parent model. If omitted or False, or there is no parent model, will not attempt to return insight data from this model’s parent.
Returns: - ConfusionChart
Model ConfusionChart data
Raises: - ClientError
If the insight is not available for this model
-
get_cross_validation_scores
(partition=None, metric=None)¶ Returns a dictionary keyed by metric showing cross validation scores per partition.
Cross Validation should already have been performed using cross_validate or train.
Note
Models that computed cross validation before this feature was added will need to be deleted and retrained before this method can be used.
Parameters: - partition : float
optional, the id of the partition (1, 2, 3.0, 4.0, etc.) to filter results by. Can be a positive whole-number integer or float value.
- metric: unicode
optional, the name of the metric to filter the resulting cross validation scores by
Returns: - cross_validation_scores: dict
A dictionary keyed by metric showing cross validation scores per partition.
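A sketch showing cross validation followed by score retrieval (ids and the metric name are placeholders):
import datarobot as dr

model = dr.Model.get('p-id', 'l-id')
cv_job = model.cross_validate()
cv_job.wait_for_completion()

scores = model.get_cross_validation_scores(metric='RMSE')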
-
get_feature_impact
()¶ Retrieve the computed Feature Impact results, a measure of the relevance of each feature in the model.
Feature Impact is computed for each column by creating new data with that column randomly permuted (but the others left unchanged), and seeing how the error metric score for the predictions is affected. The ‘impactUnnormalized’ is how much worse the error metric score is when making predictions on this modified data. The ‘impactNormalized’ is normalized so that the largest value is 1. In both cases, larger values indicate more important features.
If a feature is a redundant feature, i.e. once other features are considered it doesn’t contribute much in additional, the ‘redundantWith’ value is the name of feature that has the highest correlation with this feature. Note that redundancy detection is only available for jobs run after the addition of this feature. When retrieving data that predates this functionality, a NoRedundancyImpactAvailable warning will be used.
Elsewhere this technique is sometimes called ‘Permutation Importance’.
Requires that Feature Impact has already been computed with
request_feature_impact
.Returns: - feature_impacts : list of dict
The feature impact data. Each item is a dict with the keys ‘featureName’, ‘impactNormalized’, ‘impactUnnormalized’, and ‘redundantWith’.
Raises: - ClientError (404)
If the feature impacts have not been computed.
-
get_features_used
()¶ Query the server to determine which features were used.
Note that the data returned by this method is possibly different than the names of the features in the featurelist used by this model. This method will return the raw features that must be supplied in order for predictions to be generated on a new set of data. The featurelist, in contrast, would also include the names of derived features.
Returns: - features : list of str
The names of the features used in the model.
-
get_leaderboard_ui_permalink
()¶ Returns: - url : str
Permanent static hyperlink to this model on the project leaderboard.
-
get_lift_chart
(source, fallback_to_parent_insights=False)¶ Retrieve model lift chart for the specified source.
Parameters: - source : str
Lift chart data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
- fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return lift chart data for this model’s parent if the lift chart is not available for this model and the model has a defined parent model. If omitted or False, or there is no parent model, will not attempt to return insight data from this model’s parent.
Returns: - LiftChart
Model lift chart data
Raises: - ClientError
If the insight is not available for this model
-
get_missing_report_info
()¶ Retrieve the model’s missing data report on training data, which can be used to understand how missing values were treated in the model. The report consists of missing value reports for the numeric and categorical features that took part in modeling.
Returns: - An iterable of MissingReportPerFeature
The queried model missing report, sorted by missing count (DESCENDING order).
-
get_model_blueprint_chart
()¶ Retrieve a model blueprint chart that can be used to understand data flow in blueprint.
Returns: - ModelBlueprintChart
The queried model blueprint chart.
-
get_model_blueprint_documents
()¶ Get documentation for tasks used in this model.
Returns: - list of BlueprintTaskDocument
All documents available for the model.
-
get_or_request_feature_impact
(max_wait=600)¶ Retrieve feature impact for the model, requesting a job if it hasn’t been run previously
Parameters: - max_wait : int, optional
The maximum time to wait for a requested feature impact job to complete before erroring
Returns: - feature_impacts : list of dict
The feature impact data. See
get_feature_impact
for the exact schema.
-
get_parameters
()¶ Retrieve model parameters.
Returns: - ModelParameters
Model parameters for this model.
-
get_pareto_front
()¶ Retrieve the Pareto Front for a Eureqa model.
This method is only supported for Eureqa models.
Returns: - ParetoFront
Model ParetoFront data
-
get_prime_eligibility
()¶ Check if this model can be approximated with DataRobot Prime
Returns: - prime_eligibility : dict
a dict indicating whether a model can be approximated with DataRobot Prime (key can_make_prime) and why it may be ineligible (key message)
-
get_roc_curve
(source, fallback_to_parent_insights=False)¶ Retrieve model ROC curve for the specified source.
Parameters: - source : str
ROC curve data source. Check datarobot.enums.CHART_DATA_SOURCE for possible values.
- fallback_to_parent_insights : bool
(New in version v2.14) Optional, if True, this will return ROC curve data for this model’s parent if the ROC curve is not available for this model and the model has a defined parent model. If omitted or False, or there is no parent model, will not attempt to return data from this model’s parent.
Returns: - RocCurve
Model ROC curve data
Raises: - ClientError
If the insight is not available for this model
-
get_rulesets
()¶ List the rulesets approximating this model generated by DataRobot Prime
If this model hasn’t been approximated yet, this will return an empty list. Note that these are rulesets approximating this model, not rulesets used to construct this model.
Returns: - rulesets : list of Ruleset
-
get_supported_capabilities
()¶ Retrieves a summary of the capabilities supported by a model.
New in version v2.14.
Returns: - supportsBlending: bool
whether the model supports blending
- supportsMonotonicConstraints: bool
whether the model supports monotonic constraints
- hasWordCloud: bool
whether the model has word cloud data available
- eligibleForPrime: bool
whether the model is eligible for Prime
- hasParameters: bool
whether the model has parameters that can be retrieved
-
get_word_cloud
(exclude_stop_words=False)¶ Retrieve word cloud data for the model.
Parameters: - exclude_stop_words : bool, optional
Set to True if you want stopwords filtered out of the response.
Returns: - WordCloud
Word cloud data for the model.
-
open_model_browser
()¶ Opens model at project leaderboard in web browser.
Note: If text-mode browsers are used, the calling process will block until the user exits the browser.
-
request_approximation
()¶ Request an approximation of this model using DataRobot Prime
This will create several rulesets that could be used to approximate this model. After comparing their scores and rule counts, the code used in the approximation can be downloaded and run locally.
Returns: - job : Job
the job generating the rulesets
-
request_feature_impact
()¶ Request feature impacts to be computed for the model.
See get_feature_impact for more information on the result of the job.
Returns: - job : Job
A Job representing the feature impact computation. To get the completed feature impact data, use job.get_result or job.get_result_when_complete.
Raises: - JobAlreadyRequested (422)
If the feature impacts have already been requested.
-
request_frozen_datetime_model
(training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, time_window_sample_pct=None)¶ Train a new frozen model with parameters from this model
Requires that this model belongs to a datetime partitioned project. If it does not, an error will occur when submitting the job.
Frozen models use the same tuning parameters as their parent model instead of independently optimizing them, which allows models to be efficiently retrained on larger amounts of the training data.
In addition to training_row_count and training_duration, frozen datetime models may be trained on an exact date range. Only one of training_row_count, training_duration, or training_start_date and training_end_date should be specified.
Models specified using training_start_date and training_end_date are the only ones that can be trained into the holdout data (once the holdout is unlocked).
Parameters: - training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
- training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
- training_start_date : datetime.datetime, optional
the start date of the data to train the model on. Only rows occurring at or after this datetime will be used. If training_start_date is specified, training_end_date must also be specified.
- training_end_date : datetime.datetime, optional
the end date of the data to train the model on. Only rows occurring strictly before this datetime will be used. If training_end_date is specified, training_start_date must also be specified.
- time_window_sample_pct : int, optional
may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.
Returns: - model_job : ModelJob
the modeling job training a frozen model
-
request_frozen_model
(sample_pct=None, training_row_count=None)¶ Train a new frozen model with parameters from this model
Note
This method only works if the project the model belongs to is not datetime partitioned. If it is, use request_frozen_datetime_model instead.
Frozen models use the same tuning parameters as their parent model instead of independently optimizing them, which allows models to be efficiently retrained on larger amounts of the training data.
Parameters: - sample_pct : float
optional, the percentage of the dataset to use with the model. If not provided, will use the value from this model.
- training_row_count : int
(New in version v2.9) optional, the integer number of rows of the dataset to use with the model. Only one of sample_pct and training_row_count should be specified.
Returns: - model_job : ModelJob
the modeling job training a frozen model
-
request_predictions
(dataset_id, include_prediction_intervals=None, prediction_intervals_size=None)¶ Request predictions against a previously uploaded dataset
Parameters: - dataset_id : string
The dataset to make predictions against (as uploaded from Project.upload_dataset)
- include_prediction_intervals : bool, optional
(New in v2.16) For time series projects only. Specifies whether prediction intervals should be calculated for this request. Defaults to True if prediction_intervals_size is specified, otherwise defaults to False.
- prediction_intervals_size : int, optional
(New in v2.16) For time series projects only. Represents the percentile to use for the size of the prediction intervals. Defaults to 80 if include_prediction_intervals is True.
Returns: - job : PredictJob
The job computing the predictions
-
request_training_predictions
(data_subset)¶ Start a job to build training predictions
Parameters: - data_subset : str
data set definition to build predictions on. Choices are:
- dr.enums.DATA_SUBSET.ALL for all data available. Not valid for models in datetime partitioned projects.
- dr.enums.DATA_SUBSET.VALIDATION_AND_HOLDOUT for all data except the training set. Not valid for models in datetime partitioned projects.
- dr.enums.DATA_SUBSET.HOLDOUT for the holdout data set only.
- dr.enums.DATA_SUBSET.ALL_BACKTESTS for downloading the predictions for all backtest validation folds. Requires the model to have successfully scored all backtests. Datetime partitioned projects only.
Returns: - Job
an instance of created async job
-
request_transferable_export
()¶ Request generation of an exportable model file for use in an on-premise DataRobot standalone prediction environment.
This function can only be used if model export is enabled, and will only be useful if you have an on-premise environment in which to import it.
This function does not download the exported file. Use download_export for that.
Examples
model = datarobot.Model.get('p-id', 'l-id')
job = model.request_transferable_export()
job.wait_for_completion()
model.download_export('my_exported_model.drmodel')

# Client must be configured to use standalone prediction server for import:
datarobot.Client(token='my-token-at-standalone-server',
                 endpoint='standalone-server-url/api/v2')

imported_model = datarobot.ImportedModel.create('my_exported_model.drmodel')
-
set_prediction_threshold
(threshold)¶ Set a custom prediction threshold for the model
May not be used once prediction_threshold_read_only is True for this model.
Parameters: - threshold : float
only used for binary classification projects. The threshold to use when deciding between the positive and negative classes when making predictions. Should be between 0.0 and 1.0 (inclusive).
-
star_model
()¶ Mark the model as starred
Model stars propagate to the web application and the API, and can be used to filter when listing models.
-
start_advanced_tuning_session
()¶ Start an Advanced Tuning session. Returns an object that helps set up arguments for an Advanced Tuning model execution.
As of v2.15, all models support Advanced Tuning other than blender, open-source, and user-created models.
Returns: - AdvancedTuningSession
Session for setting up and running Advanced Tuning on a model
-
train
(sample_pct=None, featurelist_id=None, scoring_type=None, training_row_count=None, monotonic_increasing_featurelist_id=<object object>, monotonic_decreasing_featurelist_id=<object object>)¶ Train the blueprint used in model on a particular featurelist or amount of data.
This method creates a new training job for worker and appends it to the end of the queue for this project. After the job has finished you can get the newly trained model by retrieving it from the project leaderboard, or by retrieving the result of the job.
Either sample_pct or training_row_count can be used to specify the amount of data to use, but not both. If neither are specified, a default of the maximum amount of data that can safely be used to train any blueprint without going into the validation data will be selected.
In smart-sampled projects, sample_pct and training_row_count are assumed to be in terms of rows of the minority class.
Note
For datetime partitioned projects, use train_datetime instead.
Parameters: - sample_pct : float, optional
The amount of data to use for training, as a percentage of the project dataset from 0 to 100.
- featurelist_id : str, optional
The identifier of the featurelist to use. If not defined, the featurelist of this model is used.
- scoring_type : str, optional
Either SCORING_TYPE.validation or SCORING_TYPE.cross_validation. SCORING_TYPE.validation is available for every partitioning type, and indicates that the default model validation should be used for the project. If the project uses a form of cross-validation partitioning, SCORING_TYPE.cross_validation can also be used to indicate that all of the available training/validation combinations should be used to evaluate the model.
- training_row_count : int, optional
The number of rows to use to train the requested model.
- monotonic_increasing_featurelist_id : str
(new in version 2.11) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. Passing None disables the increasing monotonicity constraint. The default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
- monotonic_decreasing_featurelist_id : str
(new in version 2.11) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. Passing None disables the decreasing monotonicity constraint. The default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
Returns: - model_job_id : str
id of created job, can be used as parameter to
ModelJob.get
method orwait_for_async_model_creation
function
Examples
project = Project.get('p-id')
model = Model.get('p-id', 'l-id')
model_job_id = model.train(training_row_count=project.max_train_rows)
-
train_datetime
(featurelist_id=None, training_row_count=None, training_duration=None, time_window_sample_pct=None)¶ Train this model on a different featurelist or amount of data
Requires that this model is part of a datetime partitioned project; otherwise, an error will occur.
Parameters: - featurelist_id : str, optional
the featurelist to use to train the model. If not specified, the featurelist of this model is used.
- training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
- training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
- time_window_sample_pct : int, optional
may only be specified when the requested model is a time window (e.g. duration or start and end dates). An integer between 1 and 99 indicating the percentage to sample by within the window. The points kept are determined by a random uniform sample.
Returns: - job : ModelJob
the created job to build the model
-
unstar_model
()¶ Unmark the model as starred
Model stars propagate to the web application and the API, and can be used to filter when listing models.
Advanced Tuning¶
-
class
datarobot.models.advanced_tuning.
AdvancedTuningSession
(model)¶ A session enabling users to configure and run advanced tuning for a model.
Every model contains a set of one or more tasks. Every task contains a set of zero or more parameters. This class allows tuning the values of each parameter on each task of a model, before running that model.
This session is client-side only and is not persistent. Only the final model, constructed when run is called, is persisted on the DataRobot server.
Attributes: - description : basestring
Description for the new advanced-tuned model. Defaults to the same description as the base model.
-
get_task_names
()¶ Get the list of task names that are available for this model
Returns: - list(basestring)
List of task names
-
get_parameter_names
(task_name)¶ Get the list of parameter names available for a specific task
Returns: - list(basestring)
List of parameter names
-
set_parameter
(value, task_name=None, parameter_name=None, parameter_id=None)¶ Set the value of a parameter to be used
The caller must supply enough of the optional arguments to this function to uniquely identify the parameter that is being set. For example, a less common parameter name such as ‘building_block__complementary_error_function’ might only be used once (if at all) by a single task in a model, in which case it may be sufficient to simply specify ‘parameter_name’. But a more common name such as ‘random_seed’ might be used by several of the model’s tasks, and it may be necessary to also specify ‘task_name’ to clarify which task’s random seed is to be set. This function only affects client-side state. It will not check that the new parameter value(s) are valid.
Parameters: - task_name : basestring
Name of the task whose parameter needs to be set
- parameter_name : basestring
Name of the parameter to set
- parameter_id : basestring
ID of the parameter to set
- value : int, float, list, or basestring
New value for the parameter, with legal values determined by the parameter being set
Raises: - NoParametersFoundException
if no matching parameters are found.
- NonUniqueParametersException
if multiple parameters matched the specified filtering criteria
-
get_parameters
()¶ Returns the set of parameters available to this model
The returned parameters have one additional key, “value”, reflecting any new values that have been set in this AdvancedTuningSession. When the session is run, “value” will be used, or if it is unset, “current_value”.
Returns: - parameters : dict
“Parameters” dictionary, same as specified on Model.get_advanced_tuning_params.
- An additional field is added per parameter to the ‘tuningParameters’ list in the dictionary:
- value : int, float, list, or basestring
The current value of the parameter. None if none has been specified.
-
run
()¶ Submit this model for Advanced Tuning.
Returns: - datarobot.models.modeljob.ModelJob
The created job to build the model
ModelJob¶
-
datarobot.models.modeljob.
wait_for_async_model_creation
(project_id, model_job_id, max_wait=600)¶ Given a project id and ModelJob id, poll the status of the process responsible for model creation until the model is created.
Parameters: - project_id : str
The identifier of the project
- model_job_id : str
The identifier of the ModelJob
- max_wait : int, optional
Time in seconds after which model creation is considered unsuccessful
Returns: - model : Model
Newly created model
Raises: - AsyncModelCreationError
Raised if the status of the fetched ModelJob object is error
- AsyncTimeoutError
Model wasn’t created in the time specified by the max_wait parameter
-
class
datarobot.models.
ModelJob
(data, completed_resource_url=None)¶ Tracks asynchronous work being done within a project
Attributes: - id : int
the id of the job
- project_id : str
the id of the project the job belongs to
- status : str
the status of the job - will be one of
datarobot.enums.QUEUE_STATUS
- job_type : str
what kind of work the job is doing - will be ‘model’ for modeling jobs
- is_blocked : bool
if true, the job is blocked (cannot be executed) until its dependencies are resolved
- sample_pct : float
the percentage of the project’s dataset used in this modeling job
- model_type : str
the model this job builds (e.g. ‘Nystroem Kernel SVM Regressor’)
- processes : list of str
the processes used by the model
- featurelist_id : str
the id of the featurelist used in this modeling job
- blueprint : Blueprint
the blueprint used in this modeling job
-
classmethod
from_job
(job)¶ Transforms a generic Job into a ModelJob
Parameters: - job: Job
A generic job representing a ModelJob
Returns: - model_job: ModelJob
A fully populated ModelJob with all the details of the job
Raises: - ValueError:
If the generic Job was not a model job, e.g. job_type != JOB_TYPE.MODEL
-
classmethod
get
(project_id, model_job_id)¶ Fetches one ModelJob. If the job finished, raises PendingJobFinished exception.
Parameters: - project_id : str
The identifier of the project the model belongs to
- model_job_id : str
The identifier of the model_job
Returns: - model_job : ModelJob
The pending ModelJob
Raises: - PendingJobFinished
If the job being queried already finished, and the server is re-routing to the finished model.
- AsyncFailureError
Querying this resource gave a status code other than 200 or 303
-
classmethod
get_model
(project_id, model_job_id)¶ Fetches a finished model from the job used to create it.
Parameters: - project_id : str
The identifier of the project the model belongs to
- model_job_id : str
The identifier of the model_job
Returns: - model : Model
The finished model
Raises: - JobNotFinished
If the job has not finished yet
- AsyncFailureError
Querying the model_job in question gave a status code other than 200 or 303
-
cancel
()¶ Cancel this job. If this job has not finished running, it will be removed and canceled.
-
get_result
()¶ Returns: - result : object
- Return type depends on the job type:
- for model jobs, a Model is returned
- for predict jobs, a pandas.DataFrame (with predictions) is returned
- for featureImpact jobs, a list of dicts (see
Model.get_feature_impact
for more detail)
- for primeRulesets jobs, a list of Rulesets
- for primeModel jobs, a PrimeModel
- for primeDownloadValidation jobs, a PrimeFile
- for reasonCodesInitialization jobs, a ReasonCodesInitialization
- for reasonCodes jobs, a ReasonCodes
- for predictionExplanationInitialization jobs, a PredictionExplanationsInitialization
- for predictionExplanations jobs, a PredictionExplanations
Raises: - JobNotFinished
If the job is not finished, the result is not available.
- AsyncProcessUnsuccessfulError
If the job errored or was aborted
-
get_result_when_complete
(max_wait=600)¶ Parameters: - max_wait : int, optional
How long to wait for the job to finish.
Returns: - result: object
Return type is the same as would be returned by Job.get_result.
Raises: - AsyncTimeoutError
If the job does not finish in time
- AsyncProcessUnsuccessfulError
If the job errored or was aborted
-
refresh
()¶ Update this object with the latest job data from the server.
-
wait_for_completion
(max_wait=600)¶ Waits for job to complete.
Parameters: - max_wait : int, optional
How long to wait for the job to finish.
Pareto Front¶
-
class
datarobot.models.pareto_front.
ParetoFront
(project_id, error_metric, hyperparameters, target_type, solutions)¶ Pareto front data for a Eureqa model.
The pareto front reflects the tradeoffs between error and complexity for a particular model. The solutions reflect possible Eureqa models at different levels of complexity. By default, only one solution will have a corresponding model, but models can be created for each solution.
Attributes: - project_id : str
the ID of the project the model belongs to
- error_metric : str
Eureqa error-metric identifier used to compute error metrics for this search. Note that Eureqa error metrics do NOT correspond 1:1 with DataRobot error metrics – the available metrics are not the same, and are computed from a subset of the training data rather than from the validation data.
- hyperparameters : dict
Hyperparameters used by this run of the Eureqa blueprint
- target_type : str
Indicating what kind of modeling is being done in this project, either ‘Regression’, ‘Binary’ (Binary classification), or ‘Multiclass’ (Multiclass classification).
- solutions : list(Solution)
Solutions that Eureqa has found to model this data. Some solutions will have greater accuracy. Others will have slightly less accuracy but will use simpler expressions.
-
class
datarobot.models.pareto_front.
Solution
(eureqa_solution_id, complexity, error, expression, expression_annotated, best_model, project_id)¶ Eureqa Solution.
A solution represents a possible Eureqa model; however not all solutions have models associated with them. It must have a model created before it can be used to make predictions, etc.
Attributes: - eureqa_solution_id: str
ID of this Solution
- complexity: int
Complexity score for this solution. Complexity score is a function of the mathematical operators used in the current solution. The Complexity calculation can be tuned via model hyperparameters.
- error: float
Error for the current solution, as computed by Eureqa using the ‘error_metric’ error metric.
- expression: str
Eureqa model equation string.
- expression_annotated: str
Eureqa model equation string with variable names tagged for easy identification.
- best_model: bool
True, if the model is determined to be the best
-
create_model
()¶ Add this solution to the leaderboard, if it is not already present.
Partitioning¶
-
class
datarobot.
RandomCV
(holdout_pct, reps, seed=0)¶ A partition in which observations are randomly assigned to cross-validation groups and the holdout set.
Parameters: - holdout_pct : int
the desired percentage of dataset to assign to holdout set
- reps : int
number of cross validation folds to use
- seed : int
a seed to use for randomization
-
class
datarobot.
StratifiedCV
(holdout_pct, reps, seed=0)¶ A partition in which observations are randomly assigned to cross-validation groups and the holdout set, preserving in each group the same ratio of positive to negative cases as in the original data.
Parameters: - holdout_pct : int
the desired percentage of dataset to assign to holdout set
- reps : int
number of cross validation folds to use
- seed : int
a seed to use for randomization
-
class
datarobot.
GroupCV
(holdout_pct, reps, partition_key_cols, seed=0)¶ A partition in which one column is specified, and rows sharing a common value for that column are guaranteed to stay together in the partitioning into cross-validation groups and the holdout set.
Parameters: - holdout_pct : int
the desired percentage of dataset to assign to holdout set
- reps : int
number of cross validation folds to use
- partition_key_cols : list
a list containing a single string, where the string is the name of the column whose values should remain together in partitioning
- seed : int
a seed to use for randomization
-
class
datarobot.
UserCV
(user_partition_col, cv_holdout_level, seed=0)¶ A partition where the cross-validation folds and the holdout set are specified by the user.
Parameters: - user_partition_col : string
the name of the column containing the partition assignments
- cv_holdout_level
the value of the partition column indicating a row is part of the holdout set
- seed : int
a seed to use for randomization
-
class
datarobot.
RandomTVH
(holdout_pct, validation_pct, seed=0)¶ Specifies a partitioning method in which rows are randomly assigned to training, validation, and holdout.
Parameters: - holdout_pct : int
the desired percentage of dataset to assign to holdout set
- validation_pct : int
the desired percentage of dataset to assign to validation set
- seed : int
a seed to use for randomization
-
class
datarobot.
UserTVH
(user_partition_col, training_level, validation_level, holdout_level, seed=0)¶ Specifies a partitioning method in which rows are assigned by the user to training, validation, and holdout sets.
Parameters: - user_partition_col : string
the name of the column containing the partition assignments
- training_level
the value of the partition column indicating a row is part of the training set
- validation_level
the value of the partition column indicating a row is part of the validation set
- holdout_level
the value of the partition column indicating a row is part of the holdout set (use None if you want no holdout set)
- seed : int
a seed to use for randomization
-
class
datarobot.
StratifiedTVH
(holdout_pct, validation_pct, seed=0)¶ A partition in which observations are randomly assigned to train, validation, and holdout sets, preserving in each group the same ratio of positive to negative cases as in the original data.
Parameters: - holdout_pct : int
the desired percentage of dataset to assign to holdout set
- validation_pct : int
the desired percentage of dataset to assign to validation set
- seed : int
a seed to use for randomization
-
class
datarobot.
GroupTVH
(holdout_pct, validation_pct, partition_key_cols, seed=0)¶ A partition in which one column is specified, and rows sharing a common value for that column are guaranteed to stay together in the partitioning into the training, validation, and holdout sets.
Parameters: - holdout_pct : int
the desired percentage of dataset to assign to holdout set
- validation_pct : int
the desired percentage of dataset to assign to validation set
- partition_key_cols : list
a list containing a single string, where the string is the name of the column whose values should remain together in partitioning
- seed : int
a seed to use for randomization
-
class
datarobot.
DatetimePartitioningSpecification
(datetime_partition_column, autopilot_data_selection_method=None, validation_duration=None, holdout_start_date=None, holdout_duration=None, disable_holdout=None, gap_duration=None, number_of_backtests=None, backtests=None, use_time_series=False, default_to_known_in_advance=False, default_to_do_not_derive=False, feature_derivation_window_start=None, feature_derivation_window_end=None, feature_settings=None, forecast_window_start=None, forecast_window_end=None, windows_basis_unit=None, treat_as_exponential=None, differencing_method=None, periodicities=None, multiseries_id_columns=None, target=None, use_cross_series_features=None, aggregation_type=None, cross_series_group_by_columns=None, calendar_id=None)¶ Uniquely defines a DatetimePartitioning for some project
Includes only the attributes of DatetimePartitioning that are directly controllable by users, not those determined by the DataRobot application based on the project dataset and the user-controlled settings.
This is the specification that should be passed to Project.set_target via the partitioning_method parameter. To see the full partitioning based on the project dataset, use DatetimePartitioning.generate.
All durations should be specified with a duration string such as those returned by the partitioning_methods.construct_duration_string helper method.
Attributes: - datetime_partition_column : str
the name of the column whose values as dates are used to assign a row to a particular partition
- autopilot_data_selection_method : str
one of datarobot.enums.DATETIME_AUTOPILOT_DATA_SELECTION_METHOD. Whether models created by the autopilot should use “rowCount” or “duration” as their data_selection_method.
- validation_duration : str or None
the default validation_duration for the backtests
- holdout_start_date : datetime.datetime or None
The start date of holdout scoring data. If holdout_start_date is specified, holdout_duration must also be specified. If disable_holdout is set to True, neither holdout_duration nor holdout_start_date may be specified.
- holdout_duration : str or None
The duration of the holdout scoring data. If holdout_duration is specified, holdout_start_date must also be specified. If disable_holdout is set to True, neither holdout_duration nor holdout_start_date may be specified.
- disable_holdout : bool or None
(New in version v2.8) Whether to suppress allocating a holdout fold. If set to True, holdout_start_date and holdout_duration must not be specified.
- gap_duration : str or None
The duration of the gap between training and holdout scoring data
- number_of_backtests : int or None
the number of backtests to use
- backtests : list of BacktestSpecification
the exact specification of backtests to use. The indexes of the specified backtests should range from 0 to number_of_backtests - 1. If any backtest is left unspecified, a default configuration will be chosen.
- use_time_series : bool
(New in version v2.8) Whether to create a time series project (if True) or an OTV project which uses datetime partitioning (if False). The default behaviour is to create an OTV project.
- default_to_known_in_advance : bool
(New in version v2.11) Optional, for time series projects only. Sets whether all features default to being treated as known in advance. Known in advance features are expected to be known for dates in the future when making predictions, e.g., “is this a holiday?”. The default is False, meaning no features are treated as known in advance unless set otherwise. Individual features can be set to a value different than the default using the featureSettings parameter.
- feature_derivation_window_start : int or None
(New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should start. Expressed in terms of the time_unit of the datetime_partition_column and should be negative or zero.
- feature_derivation_window_end : int or None
(New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should end. Expressed in terms of the time_unit of the datetime_partition_column, and should be a positive value.
- feature_settings : list of
FeatureSettings
objects (New in version v2.9) Optional, a list specifying per feature settings, can be left unspecified.
- forecast_window_start : int or None
(New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should start. Expressed in terms of the time_unit of the datetime_partition_column.
- forecast_window_end : int or None
(New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should end. Expressed in terms of the time_unit of the datetime_partition_column.
- windows_basis_unit : string, optional
(New in version v2.14) Only used for time series projects. Indicates which unit is a basis for the feature derivation window and forecast window. Valid options are the detected time unit (one of datarobot.enums.TIME_UNITS) or “ROW”. If omitted, the default value is the detected time unit.
- treat_as_exponential : string, optional
(New in version v2.9) defaults to “auto”. Used to specify whether to treat data as an exponential trend and apply transformations like log-transform. Use values from the datarobot.enums.TREAT_AS_EXPONENTIAL enum.
- differencing_method : string, optional
(New in version v2.9) defaults to “auto”. Used to specify which differencing method to apply if the data is not stationary. Use values from the datarobot.enums.DIFFERENCING_METHOD enum.
- periodicities : list of Periodicity, optional
(New in version v2.9) a list of datarobot.Periodicity. Periodicity units should be ‘ROW’ if windows_basis_unit is ‘ROW’.
- multiseries_id_columns : list of str or null
(New in version v2.11) a list of the names of multiseries id columns to define series within the training data. Currently only one multiseries id column is supported.
- use_cross_series_features : bool
(New in version v2.14) Whether to use cross series features.
- aggregation_type : str, optional
(New in version v2.14) The aggregation type to apply when creating cross series features. Optional, must be one of “total” or “average”.
- cross_series_group_by_columns : list of str, optional
(New in version v2.15) List of columns (currently of length 1). Optional setting that indicates how to further split series into related groups. For example, if every series is sales of an individual product, the series group-by could be the product category with values like “men’s clothing”, “sports equipment”, etc.. Must be used with multiseries and useCrossSeriesFeatures enabled.
- calendar_id : str, optional
(New in version v2.15) The id of the
CalendarFile
to use with this project.
-
collect_payload
()¶ Set up the dict that should be sent to the server when setting the target
Returns: - partitioning_spec : dict
-
prep_payload
(project_id, max_wait=600)¶ Run any necessary validation and prep of the payload, including async operations
Mainly used for the datetime partitioning spec but implemented in general for consistency
-
class
datarobot.
BacktestSpecification
(index, gap_duration, validation_start_date, validation_duration)¶ Uniquely defines a Backtest used in a DatetimePartitioning
Includes only the attributes of a backtest directly controllable by users. The other attributes are assigned by the DataRobot application based on the project dataset and the user-controlled settings.
All durations should be specified with a duration string such as those returned by the partitioning_methods.construct_duration_string helper method.
Attributes: - index : int
the index of the backtest to update
- gap_duration : str
the desired duration of the gap between training and validation scoring data for the backtest
- validation_start_date : datetime.datetime
the desired start date of the validation scoring data for this backtest
- validation_duration : str
the desired duration of the validation scoring data for this backtest
-
class
datarobot.
FeatureSettings
(feature_name, known_in_advance=None, do_not_derive=None)¶ Per feature settings
Attributes: - feature_name : string
name of the feature
- known_in_advance : bool
(New in version v2.11) Optional, for time series projects only. Sets whether the feature is known in advance, i.e., values for future dates are known at prediction time. If not specified, the feature uses the value from the default_to_known_in_advance flag.
-
class
datarobot.
Periodicity
(time_steps, time_unit)¶ Periodicity configuration
Parameters: - time_steps : int
Time step value
- time_unit : string
Time step unit, valid options are values from datarobot.enums.TIME_UNITS
Examples
import datarobot as dr

periodicities = [
    dr.Periodicity(time_steps=10, time_unit=dr.enums.TIME_UNITS.HOUR),
    dr.Periodicity(time_steps=600, time_unit=dr.enums.TIME_UNITS.MINUTE),
]
spec = dr.DatetimePartitioningSpecification(
    # ...
    periodicities=periodicities,
)
-
class
datarobot.
DatetimePartitioning
(project_id=None, datetime_partition_column=None, date_format=None, autopilot_data_selection_method=None, validation_duration=None, available_training_start_date=None, available_training_duration=None, available_training_row_count=None, available_training_end_date=None, primary_training_start_date=None, primary_training_duration=None, primary_training_row_count=None, primary_training_end_date=None, gap_start_date=None, gap_duration=None, gap_row_count=None, gap_end_date=None, holdout_start_date=None, holdout_duration=None, holdout_row_count=None, holdout_end_date=None, number_of_backtests=None, backtests=None, total_row_count=None, use_time_series=False, default_to_known_in_advance=False, default_to_do_not_derive=False, feature_derivation_window_start=None, feature_derivation_window_end=None, feature_settings=None, forecast_window_start=None, forecast_window_end=None, windows_basis_unit=None, treat_as_exponential=None, differencing_method=None, periodicities=None, multiseries_id_columns=None, number_of_known_in_advance_features=0, number_of_do_not_derive_features=0, use_cross_series_features=None, aggregation_type=None, cross_series_group_by_columns=None, calendar_id=None)¶ Full partitioning of a project for datetime partitioning
Includes both the attributes specified by the user, as well as those determined by the DataRobot application based on the project dataset. In order to use a partitioning to set the target, call to_specification and pass the resulting DatetimePartitioningSpecification to Project.set_target.
The available training data corresponds to all the data available for training, while the primary training data corresponds to the data that can be used to train while ensuring that all backtests are available. If a model is trained with more data than is available in the primary training data, then all backtests may not have scores available.
Attributes: - project_id : str
the id of the project this partitioning applies to
- datetime_partition_column : str
the name of the column whose values as dates are used to assign a row to a particular partition
- date_format : str
the format (e.g. “%Y-%m-%d %H:%M:%S”) by which the partition column was interpreted (compatible with strftime [https://docs.python.org/2/library/time.html#time.strftime] )
- autopilot_data_selection_method : str
one of datarobot.enums.DATETIME_AUTOPILOT_DATA_SELECTION_METHOD. Whether models created by the autopilot use “rowCount” or “duration” as their data_selection_method.
- validation_duration : str
the validation duration specified when initializing the partitioning - not directly significant if the backtests have been modified, but used as the default validation_duration for the backtests
- available_training_start_date : datetime.datetime
The start date of the available training data for scoring the holdout
- available_training_duration : str
The duration of the available training data for scoring the holdout
- available_training_row_count : int or None
The number of rows in the available training data for scoring the holdout. Only available when retrieving the partitioning after setting the target.
- available_training_end_date : datetime.datetime
The end date of the available training data for scoring the holdout
- primary_training_start_date : datetime.datetime or None
The start date of primary training data for scoring the holdout. Unavailable when the holdout fold is disabled.
- primary_training_duration : str
The duration of the primary training data for scoring the holdout
- primary_training_row_count : int or None
The number of rows in the primary training data for scoring the holdout. Only available when retrieving the partitioning after setting the target.
- primary_training_end_date : datetime.datetime or None
The end date of the primary training data for scoring the holdout. Unavailable when the holdout fold is disabled.
- gap_start_date : datetime.datetime or None
The start date of the gap between training and holdout scoring data. Unavailable when the holdout fold is disabled.
- gap_duration : str
The duration of the gap between training and holdout scoring data
- gap_row_count : int or None
The number of rows in the gap between training and holdout scoring data. Only available when retrieving the partitioning after setting the target.
- gap_end_date : datetime.datetime or None
The end date of the gap between training and holdout scoring data. Unavailable when the holdout fold is disabled.
- holdout_start_date : datetime.datetime or None
The start date of holdout scoring data. Unavailable when the holdout fold is disabled.
- holdout_duration : str
The duration of the holdout scoring data
- holdout_row_count : int or None
The number of rows in the holdout scoring data. Only available when retrieving the partitioning after setting the target.
- holdout_end_date : datetime.datetime or None
The end date of the holdout scoring data. Unavailable when the holdout fold is disabled.
- number_of_backtests : int
the number of backtests used
- backtests : list of partitioning_methods.Backtest
the configured Backtests
- total_row_count : int
the number of rows in the project dataset. Only available when retrieving the partitioning after setting the target.
- use_time_series : bool
(New in version v2.8) Whether to create a time series project (if True) or an OTV project which uses datetime partitioning (if False). The default behaviour is to create an OTV project.
- default_to_known_in_advance : bool
(New in version v2.11) Optional, for time series projects only. Sets whether all features default to being treated as known in advance. Known in advance features are expected to be known for dates in the future when making predictions, e.g., “is this a holiday?”. The default is False, meaning no features are treated as known in advance unless set otherwise. Individual features can be set to a value different than the default using the featureSettings parameter.
- feature_derivation_window_start : int or None
(New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should start. Expressed in terms of the time_unit of the datetime_partition_column.
- feature_derivation_window_end : int or None
(New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should end. Expressed in terms of the time_unit of the datetime_partition_column.
- feature_settings : list of FeatureSettings
(New in version v2.9) Optional, a list specifying per feature settings, can be left unspecified.
- forecast_window_start : int or None
(New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should start. Expressed in terms of the time_unit of the datetime_partition_column.
- forecast_window_end : int or None
(New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should end. Expressed in terms of the time_unit of the datetime_partition_column.
- windows_basis_unit : string, optional
(New in version v2.14) Only used for time series projects. Indicates which unit is a basis for the feature derivation window and forecast window. Valid options are the detected time unit (one of datarobot.enums.TIME_UNITS) or “ROW”. If omitted, the default value is the detected time unit.
- treat_as_exponential : string, optional
(New in version v2.9) defaults to “auto”. Used to specify whether to treat data as an exponential trend and apply transformations like log-transform. Use values from the datarobot.enums.TREAT_AS_EXPONENTIAL enum.
- differencing_method : string, optional
(New in version v2.9) defaults to “auto”. Used to specify which differencing method to apply if the data is not stationary. Use values from the datarobot.enums.DIFFERENCING_METHOD enum.
- periodicities : list of Periodicity, optional
(New in version v2.9) a list of datarobot.Periodicity. Periodicity units should be ‘ROW’ if windows_basis_unit is ‘ROW’.
- multiseries_id_columns : list of str or null
(New in version v2.11) a list of the names of multiseries id columns to define series within the training data. Currently only one multiseries id column is supported.
- number_of_known_in_advance_features : int
(New in version v2.14) Number of features that are marked as known in advance.
- use_cross_series_features : bool
(New in version v2.14) Whether to use cross series features.
- aggregation_type : str, optional
(New in version v2.14) The aggregation type to apply when creating cross series features. Optional, must be one of “total” or “average”.
- cross_series_group_by_columns : list of str, optional
(New in version v2.15) List of columns (currently of length 1). Optional setting that indicates how to further split series into related groups. For example, if every series is sales of an individual product, the series group-by could be the product category with values like “men’s clothing”, “sports equipment”, etc.. Must be used with multiseries and useCrossSeriesFeatures enabled.
- calendar_id : str, optional
(New in version v2.15) The id of the
CalendarFile
to use with this project.
-
classmethod
generate
(project_id, spec, max_wait=600)¶ Preview the full partitioning determined by a DatetimePartitioningSpecification
Based on the project dataset and the partitioning specification, inspect the full partitioning that would be used if the same specification were passed into Project.set_target.
Parameters: - project_id : str
the id of the project
- spec : DatetimePartitioningSpec
the desired partitioning
- max_wait : int, optional
For some settings (e.g. generating a partitioning preview for a multiseries project for the first time), an asynchronous task must be run to analyze the dataset. max_wait governs the maximum time (in seconds) to wait before giving up. In all non-multiseries projects, this is unused.
Returns: - DatetimePartitioning :
the full generated partitioning
-
classmethod
get
(project_id)¶ Retrieve the DatetimePartitioning from a project
Only available if the project has already set the target as a datetime project.
Parameters: - project_id : str
the id of the project to retrieve partitioning for
Returns: - DatetimePartitioning : the full partitioning for the project
-
classmethod
feature_log_list
(project_id, offset=None, limit=None)¶ Retrieve the feature derivation log content and log length for a time series project.
The Time Series Feature Log provides details about the feature generation process for a time series project. It includes information about which features are generated and their priority, as well as the detected properties of the time series data such as whether the series is stationary, and periodicities detected.
This route is only supported for time series projects that have finished partitioning.
The feature derivation log will include information about:
- Detected stationarity of the series, e.g. ‘Series detected as non-stationary’
- Detected presence of multiplicative trend in the series, e.g. ‘Multiplicative trend detected’
- Detected periodicities in the series, e.g. ‘Detected periodicities: 7 day’
- Maximum number of features to be generated, e.g. ‘Maximum number of feature to be generated is 1440’
- Window sizes used in rolling statistics / lag extractors, e.g. ‘The window sizes chosen to be: 2 months (because the time step is 1 month and Feature Derivation Window is 2 months)’
- Features that are specified as known-in-advance, e.g. ‘Variables treated as apriori: holiday’
- Details about why certain variables are transformed in the input data, e.g. ‘Generating variable “y (log)” from “y” because multiplicative trend is detected’
- Details about features generated as time series features, and their priority, e.g. ‘Generating feature “date (actual)” from “date” (priority: 1)’
Parameters: - project_id : str
project id to retrieve a feature derivation log for.
- offset : int
optional, defaults to 0; this many results will be skipped.
- limit : int
optional, defaults to 100; at most this many results are returned. To specify no limit, use 0. The default may change without notice.
-
classmethod
feature_log_retrieve
(project_id)¶ Retrieve the feature derivation log content and log length for a time series project.
The Time Series Feature Log provides details about the feature generation process for a time series project. It includes information about which features are generated and their priority, as well as the detected properties of the time series data such as whether the series is stationary, and periodicities detected.
This route is only supported for time series projects that have finished partitioning.
The feature derivation log will include information about:
- Detected stationarity of the series, e.g. ‘Series detected as non-stationary’
- Detected presence of multiplicative trend in the series, e.g. ‘Multiplicative trend detected’
- Detected periodicities in the series, e.g. ‘Detected periodicities: 7 day’
- Maximum number of features to be generated, e.g. ‘Maximum number of feature to be generated is 1440’
- Window sizes used in rolling statistics / lag extractors, e.g. ‘The window sizes chosen to be: 2 months (because the time step is 1 month and Feature Derivation Window is 2 months)’
- Features that are specified as known-in-advance, e.g. ‘Variables treated as apriori: holiday’
- Details about why certain variables are transformed in the input data, e.g. ‘Generating variable “y (log)” from “y” because multiplicative trend is detected’
- Details about features generated as time series features, and their priority, e.g. ‘Generating feature “date (actual)” from “date” (priority: 1)’
Parameters: - project_id : str
project id to retrieve a feature derivation log for.
-
to_specification
()¶ Render the DatetimePartitioning as a DatetimePartitioningSpecification
The resulting specification can be used when setting the target, and contains only the attributes directly controllable by users.
Returns: - DatetimePartitioningSpecification:
the specification for this partitioning
-
to_dataframe
()¶ Render the partitioning settings as a dataframe for convenience of display
Excludes project_id, datetime_partition_column, date_format, autopilot_data_selection_method, validation_duration, and number_of_backtests, as well as the row count information, if present.
Also excludes the time series specific parameters for use_time_series, default_to_known_in_advance and defining the feature derivation and forecast windows.
-
class
datarobot.helpers.partitioning_methods.
Backtest
(index=None, available_training_start_date=None, available_training_duration=None, available_training_row_count=None, available_training_end_date=None, primary_training_start_date=None, primary_training_duration=None, primary_training_row_count=None, primary_training_end_date=None, gap_start_date=None, gap_duration=None, gap_row_count=None, gap_end_date=None, validation_start_date=None, validation_duration=None, validation_row_count=None, validation_end_date=None, total_row_count=None)¶ A backtest used to evaluate models trained in a datetime partitioned project
When setting up a datetime partitioning project, backtests are specified by a BacktestSpecification.
The available training data corresponds to all the data available for training, while the primary training data corresponds to the data that can be used to train while ensuring that all backtests are available. If a model is trained with more data than is available in the primary training data, then all backtests may not have scores available.
Attributes: - index : int
the index of the backtest
- available_training_start_date : datetime.datetime
the start date of the available training data for this backtest
- available_training_duration : str
the duration of available training data for this backtest
- available_training_row_count : int or None
the number of rows of available training data for this backtest. Only available when retrieving from a project where the target is set.
- available_training_end_date : datetime.datetime
the end date of the available training data for this backtest
- primary_training_start_date : datetime.datetime
the start date of the primary training data for this backtest
- primary_training_duration : str
the duration of the primary training data for this backtest
- primary_training_row_count : int or None
the number of rows of primary training data for this backtest. Only available when retrieving from a project where the target is set.
- primary_training_end_date : datetime.datetime
the end date of the primary training data for this backtest
- gap_start_date : datetime.datetime
the start date of the gap between training and validation scoring data for this backtest
- gap_duration : str
the duration of the gap between training and validation scoring data for this backtest
- gap_row_count : int or None
the number of rows in the gap between training and validation scoring data for this backtest. Only available when retrieving from a project where the target is set.
- gap_end_date : datetime.datetime
the end date of the gap between training and validation scoring data for this backtest
- validation_start_date : datetime.datetime
the start date of the validation scoring data for this backtest
- validation_duration : str
the duration of the validation scoring data for this backtest
- validation_row_count : int or None
the number of rows of validation scoring data for this backtest. Only available when retrieving from a project where the target is set.
- validation_end_date : datetime.datetime
the end date of the validation scoring data for this backtest
- total_row_count : int or None
the number of rows in this backtest. Only available when retrieving from a project where the target is set.
-
to_specification
()¶ Render this backtest as a BacktestSpecification
A BacktestSpecification includes only the attributes users can directly control, not those indirectly determined by the project dataset.
Returns: - BacktestSpecification
the specification for this backtest
-
to_dataframe
()¶ Render this backtest as a dataframe for convenience of display
Returns: - backtest_partitioning : pandas.Dataframe
the backtest attributes, formatted into a dataframe
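As a hedged sketch of how these objects might be inspected, assuming a datetime partitioned project with the target set, a hypothetical project id, and the DatetimePartitioning.get accessor documented with the datetime partitioning helpers:
import datarobot as dr

# Hypothetical project id; replace with your own datetime partitioned project.
project_id = '5ad08a1889453d0001ea7c5c'

# DatetimePartitioning.get is assumed here; it exposes the generated backtests.
partitioning = dr.DatetimePartitioning.get(project_id)
for backtest in partitioning.backtests:
    print(backtest.to_dataframe())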
-
datarobot.helpers.partitioning_methods.
construct_duration_string
(years=0, months=0, days=0, hours=0, minutes=0, seconds=0)¶ Construct a valid string representing a duration in accordance with ISO8601
A duration of six months, 3 days, and 12 hours could be represented as P6M3DT12H.
Parameters: - years : int
the number of years in the duration
- months : int
the number of months in the duration
- days : int
the number of days in the duration
- hours : int
the number of hours in the duration
- minutes : int
the number of minutes in the duration
- seconds : int
the number of seconds in the duration
Returns: - duration_string: str
The duration string, specified compatibly with ISO8601
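As a quick sketch, the six months, 3 days, and 12 hours example above could be built as follows; note the helper may spell out zero-valued components, which is an equivalent ISO8601 rendering:
from datarobot.helpers.partitioning_methods import construct_duration_string

# Six months, 3 days, and 12 hours; zero components may appear explicitly,
# e.g. 'P0Y6M3DT12H0M0S', which is equivalent to 'P6M3DT12H'.
duration = construct_duration_string(months=6, days=3, hours=12)
print(duration)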
PredictJob¶
-
datarobot.models.predict_job.
wait_for_async_predictions
(project_id, predict_job_id, max_wait=600)¶ Given a project id and a PredictJob id, poll the status of the process responsible for predictions generation until it finishes
Parameters: - project_id : str
The identifier of the project
- predict_job_id : str
The identifier of the PredictJob
- max_wait : int, optional
Time in seconds after which predictions creation is considered unsuccessful
Returns: - predictions : pandas.DataFrame
Generated predictions.
Raises: - AsyncPredictionsGenerationError
Raised if status of fetched PredictJob object is
error
- AsyncTimeoutError
Predictions weren’t generated in time, specified by
max_wait
parameter
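A minimal usage sketch, assuming a project and a previously created predict job; the ids below are hypothetical placeholders:
from datarobot.models.predict_job import wait_for_async_predictions

project_id = '5ad08a1889453d0001ea7c5c'   # hypothetical project id
predict_job_id = '42'                     # hypothetical predict job id

# Blocks until the predictions are ready (or max_wait elapses) and returns a DataFrame.
predictions = wait_for_async_predictions(project_id, predict_job_id, max_wait=600)
print(predictions.head())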
-
class
datarobot.models.
PredictJob
(data, completed_resource_url=None)¶ Tracks asynchronous work being done within a project
Attributes: - id : int
the id of the job
- project_id : str
the id of the project the job belongs to
- status : str
the status of the job - will be one of
datarobot.enums.QUEUE_STATUS
- job_type : str
what kind of work the job is doing - will be ‘predict’ for predict jobs
- is_blocked : bool
if true, the job is blocked (cannot be executed) until its dependencies are resolved
- message : str
a message about the state of the job, typically explaining why an error occurred
-
classmethod
from_job
(job)¶ Transforms a generic Job into a PredictJob
Parameters: - job: Job
A generic job representing a PredictJob
Returns: - predict_job: PredictJob
A fully populated PredictJob with all the details of the job
Raises: - ValueError:
If the generic Job was not a predict job, e.g. job_type != JOB_TYPE.PREDICT
-
classmethod
create
(model, sourcedata)¶ Note
Deprecated in v2.3 in favor of Project.upload_dataset and Model.request_predictions. That workflow allows you to reuse the same dataset for predictions from multiple models within one project.
Starts predictions generation for the provided data using a previously created model.
Parameters: - model : Model
Model to use for predictions generation
- sourcedata : str, file or pandas.DataFrame
Data to be used for predictions. If this parameter is a str, it can be either a path to a local file or raw file content. If using a file on disk, the filename must consist of ASCII characters only. The file must be a CSV, and cannot be compressed
Returns: - predict_job_id : str
id of created job, can be used as parameter to
PredictJob.get
orPredictJob.get_predictions
methods orwait_for_async_predictions
function
Raises: - InputNotUnderstoodError
If the parameter for sourcedata didn’t resolve into known data types
Examples
model = Model.get('p-id', 'l-id')
predict_job = PredictJob.create(model, './data_to_predict.csv')
-
classmethod
get
(project_id, predict_job_id)¶ Fetches one PredictJob. If the job finished, raises PendingJobFinished exception.
Parameters: - project_id : str
The identifier of the project the model on which prediction was started belongs to
- predict_job_id : str
The identifier of the predict_job
Returns: - predict_job : PredictJob
The pending PredictJob
Raises: - PendingJobFinished
If the job being queried already finished, and the server is re-routing to the finished predictions.
- AsyncFailureError
Querying this resource gave a status code other than 200 or 303
-
classmethod
get_predictions
(project_id, predict_job_id, class_prefix='class_')¶ Fetches finished predictions from the job used to generate them.
Note
The prediction API for classifications now returns an additional prediction_values dictionary that is converted into a series of class_prefixed columns in the final dataframe. For example, <label> = 1.0 is converted to ‘class_1.0’. If you are on an older version of the client (prior to v2.8), you must update to v2.8 to correctly pivot this data.
Parameters: - project_id : str
The identifier of the project containing the model used for predictions generation
- predict_job_id : str
The identifier of the predict_job
- class_prefix : str
The prefix to append to labels in the final dataframe (e.g., apple -> class_apple)
Returns: - predictions : pandas.DataFrame
Generated predictions
Raises: - JobNotFinished
If the job has not finished yet
- AsyncFailureError
Querying the predict_job in question gave a status code other than 200 or 303
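A hedged sketch of the polling pattern these two methods support, assuming the PendingJobFinished exception is importable from datarobot.errors and using hypothetical ids:
import datarobot as dr
from datarobot.errors import PendingJobFinished

project_id = '5ad08a1889453d0001ea7c5c'   # hypothetical project id
predict_job_id = '42'                     # hypothetical predict job id

try:
    predict_job = dr.PredictJob.get(project_id, predict_job_id)
    print(predict_job.status)  # the job is still queued or running
except PendingJobFinished:
    # The job already finished, so the predictions can be fetched directly.
    predictions = dr.PredictJob.get_predictions(project_id, predict_job_id)
    print(predictions.head())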
-
cancel
()¶ Cancel this job. If this job has not finished running, it will be removed and canceled.
-
get_result
()¶ Returns: - result : object
- Return type depends on the job type:
- for model jobs, a Model is returned
- for predict jobs, a pandas.DataFrame (with predictions) is returned
- for featureImpact jobs, a list of dicts (see Model.get_feature_impact for more detail)
- for primeRulesets jobs, a list of Rulesets
- for primeModel jobs, a PrimeModel
- for primeDownloadValidation jobs, a PrimeFile
- for reasonCodesInitialization jobs, a ReasonCodesInitialization
- for reasonCodes jobs, a ReasonCodes
- for predictionExplanationInitialization jobs, a PredictionExplanationsInitialization
- for predictionExplanations jobs, a PredictionExplanations
Raises: - JobNotFinished
If the job is not finished, the result is not available.
- AsyncProcessUnsuccessfulError
If the job errored or was aborted
-
get_result_when_complete
(max_wait=600)¶ Parameters: - max_wait : int, optional
How long to wait for the job to finish.
Returns: - result: object
Return type is the same as would be returned by Job.get_result.
Raises: - AsyncTimeoutError
If the job does not finish in time
- AsyncProcessUnsuccessfulError
If the job errored or was aborted
-
refresh
()¶ Update this object with the latest job data from the server.
-
wait_for_completion
(max_wait=600)¶ Waits for job to complete.
Parameters: - max_wait : int, optional
How long to wait for the job to finish.
Prediction Dataset¶
-
class
datarobot.models.
PredictionDataset
(project_id, id, name, created, num_rows, num_columns, forecast_point=None, predictions_start_date=None, predictions_end_date=None, relax_known_in_advance_features_check=None, data_quality_warnings=None)¶ A dataset uploaded to make predictions
Typically created via project.upload_dataset
Attributes: - id : str
the id of the dataset
- project_id : str
the id of the project the dataset belongs to
- created : str
the time the dataset was created
- name : str
the name of the dataset
- num_rows : int
the number of rows in the dataset
- num_columns : int
the number of columns in the dataset
- forecast_point : datetime.datetime or None
Only specified in time series projects. The point relative to which predictions will be generated, based on the forecast window of the project. See the time series documentation for more information.
- predictions_start_date : datetime.datetime or None, optional
Only specified in time series projects. The start date for bulk predictions. This parameter should be provided in conjunction with predictions_end_date. Cannot be provided with the forecast_point parameter.
- predictions_end_date : datetime.datetime or None, optional
Only specified in time series projects. The end date for bulk predictions. This parameter should be provided in conjunction with predictions_start_date. Cannot be provided with the forecast_point parameter.
- relax_known_in_advance_features_check : bool, optional
(New in version v2.15) For Time Series projects only. If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.
- data_quality_warnings : dict, optional
(New in version 2.15) A dictionary that contains available warnings about potential problems in this prediction dataset. Empty if no warnings.
-
classmethod
get
(project_id, dataset_id)¶ Retrieve information about a dataset uploaded for predictions
Parameters: - project_id:
the id of the project to query
- dataset_id:
the id of the dataset to retrieve
Returns: - dataset: PredictionDataset
A dataset uploaded to make predictions
-
delete
()¶ Delete a dataset uploaded for predictions
Will also delete predictions made using this dataset and cancel any predict jobs using this dataset.
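A small sketch of retrieving and cleaning up a prediction dataset, using hypothetical project and dataset ids:
from datarobot.models import PredictionDataset

project_id = '5ad08a1889453d0001ea7c5c'   # hypothetical project id
dataset_id = '5ad08a1889453d0001ea7c60'   # hypothetical prediction dataset id

dataset = PredictionDataset.get(project_id, dataset_id)
print(dataset.name, dataset.num_rows, dataset.num_columns)

# Deleting the dataset also removes predictions made with it and cancels
# any predict jobs that are still using it.
dataset.delete()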
Prediction Explanations¶
-
class
datarobot.
PredictionExplanationsInitialization
(project_id, model_id, prediction_explanations_sample=None)¶ Represents a prediction explanations initialization of a model.
Attributes: - project_id : str
id of the project the model belongs to
- model_id : str
id of the model the prediction explanations initialization is for
- prediction_explanations_sample : list of dict
a small sample of prediction explanations that could be generated for the model
-
classmethod
get
(project_id, model_id)¶ Retrieve the prediction explanations initialization for a model.
Prediction explanations initializations are a prerequisite for computing prediction explanations, and include a sample of what the computed prediction explanations for a prediction dataset would look like.
Parameters: - project_id : str
id of the project the model belongs to
- model_id : str
id of the model the prediction explanations initialization is for
Returns: - prediction_explanations_initialization : PredictionExplanationsInitialization
The queried instance.
Raises: - ClientError (404)
If the project or model does not exist or the initialization has not been computed.
-
classmethod
create
(project_id, model_id)¶ Create a prediction explanations initialization for the specified model.
Parameters: - project_id : str
id of the project the model belongs to
- model_id : str
id of the model for which initialization is requested
Returns: - job : Job
an instance of created async job
-
delete
()¶ Delete this prediction explanations initialization.
-
class
datarobot.
PredictionExplanations
(id, project_id, model_id, dataset_id, max_explanations, num_columns, finish_time, prediction_explanations_location, threshold_low=None, threshold_high=None)¶ Represents prediction explanations metadata and provides access to computation results.
Examples
prediction_explanations = dr.PredictionExplanations.get(project_id, explanations_id)
for row in prediction_explanations.get_rows():
    print(row)  # row is an instance of PredictionExplanationsRow
Attributes: - id : str
id of the record and prediction explanations computation result
- project_id : str
id of the project the model belongs to
- model_id : str
id of the model the prediction explanations are for
- dataset_id : str
id of the prediction dataset prediction explanations were computed for
- max_explanations : int
maximum number of prediction explanations to supply per row of the dataset
- threshold_low : float
the lower threshold, below which a prediction must score in order for prediction explanations to be computed for a row in the dataset
- threshold_high : float
the high threshold, above which a prediction must score in order for prediction explanations to be computed for a row in the dataset
- num_columns : int
the number of columns prediction explanations were computed for
- finish_time : float
timestamp referencing when computation for these prediction explanations finished
- prediction_explanations_location : str
where to retrieve the prediction explanations
-
classmethod
get
(project_id, prediction_explanations_id)¶ Retrieve a specific prediction explanations.
Parameters: - project_id : str
id of the project the explanations belong to
- prediction_explanations_id : str
id of the prediction explanations
Returns: - prediction_explanations : PredictionExplanations
The queried instance.
-
classmethod
create
(project_id, model_id, dataset_id, max_explanations=None, threshold_low=None, threshold_high=None)¶ Create prediction explanations for the specified dataset.
In order to create PredictionExplanations for a particular model and dataset, you must first:
- Compute feature impact for the model via
datarobot.Model.get_feature_impact()
- Compute a PredictionExplanationsInitialization for the model via
datarobot.PredictionExplanationsInitialization.create(project_id, model_id)
- Compute predictions for the model and dataset via
datarobot.Model.request_predictions(dataset_id)
threshold_high and threshold_low are optional filters applied to speed up computation. When at least one is specified, only the selected outlier rows will have prediction explanations computed. Rows are considered to be outliers if their predicted value (in case of regression projects) or probability of being the positive class (in case of classification projects) is less than threshold_low or greater than threshold_high. If neither is specified, prediction explanations will be computed for all rows.
Parameters: - project_id : str
id of the project the model belongs to
- model_id : str
id of the model for which prediction explanations are requested
- dataset_id : str
id of the prediction dataset for which prediction explanations are requested
- threshold_low : float, optional
the lower threshold, below which a prediction must score in order for prediction explanations to be computed for a row in the dataset. If neither
threshold_high
northreshold_low
is specified, prediction explanations will be computed for all rows.- threshold_high : float, optional
the high threshold, above which a prediction must score in order for prediction explanations to be computed. If neither
threshold_high
northreshold_low
is specified, prediction explanations will be computed for all rows.- max_explanations : int, optional
the maximum number of prediction explanations to supply per row of the dataset, default: 3.
Returns: - job: Job
an instance of created async job
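A hedged sketch of the prerequisite chain described above, using hypothetical ids; the job-handling calls (wait_for_completion, get_result_when_complete) are the generic Job methods documented in this reference:
import datarobot as dr

project_id = '5ad08a1889453d0001ea7c5c'   # hypothetical project id
model_id = '5ad08a1889453d0001ea7c70'     # hypothetical model id
dataset_id = '5ad08a1889453d0001ea7c60'   # hypothetical prediction dataset id

model = dr.Model.get(project_id, model_id)

# Prerequisite 1: feature impact for the model (per the list above).
model.get_feature_impact()

# Prerequisite 2: a prediction explanations initialization for the model.
init_job = dr.PredictionExplanationsInitialization.create(project_id, model_id)
init_job.wait_for_completion()

# Prerequisite 3: predictions for the model and dataset.
predict_job = model.request_predictions(dataset_id)
predict_job.wait_for_completion()

# With the prerequisites in place, request the explanations themselves.
pe_job = dr.PredictionExplanations.create(project_id, model_id, dataset_id,
                                          max_explanations=3)
explanations = pe_job.get_result_when_complete()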
-
classmethod
list
(project_id, model_id=None, limit=None, offset=None)¶ List of prediction explanations for a specified project.
Parameters: - project_id : str
id of the project to list prediction explanations for
- model_id : str, optional
if specified, only prediction explanations computed for this model will be returned
- limit : int or None
at most this many results are returned, default: no limit
- offset : int or None
this many results will be skipped, default: 0
Returns: - prediction_explanations : list[PredictionExplanations]
-
get_rows
(batch_size=None, exclude_adjusted_predictions=True)¶ Retrieve prediction explanations rows.
Parameters: - batch_size : int or None, optional
maximum number of prediction explanations rows to retrieve per request
- exclude_adjusted_predictions : bool
Optional, defaults to True. Set to False to include adjusted predictions, which will differ from the predictions on some projects, e.g. those with an exposure column specified.
Yields: - prediction_explanations_row : PredictionExplanationsRow
Represents prediction explanations computed for a prediction row.
-
get_all_as_dataframe
(exclude_adjusted_predictions=True)¶ Retrieve all prediction explanations rows and return them as a pandas.DataFrame.
Returned dataframe has the following structure:
- row_id : row id from prediction dataset
- prediction : the output of the model for this row
- adjusted_prediction : adjusted prediction values (only appears for projects that utilize prediction adjustments, e.g. projects with an exposure column)
- class_0_label : a class level from the target (only appears for classification projects)
- class_0_probability : the probability that the target is this class (only appears for classification projects)
- class_1_label : a class level from the target (only appears for classification projects)
- class_1_probability : the probability that the target is this class (only appears for classification projects)
- explanation_0_feature : the name of the feature contributing to the prediction for this explanation
- explanation_0_feature_value : the value the feature took on
- explanation_0_label : the output being driven by this explanation. For regression projects, this is the name of the target feature. For classification projects, this is the class label whose probability increasing would correspond to a positive strength.
- explanation_0_qualitative_strength : a human-readable description of how strongly the feature affected the prediction (e.g. '+++', '--', '+') for this explanation
- explanation_0_strength : the amount this feature’s value affected the prediction
- …
- explanation_N_feature : the name of the feature contributing to the prediction for this explanation
- explanation_N_feature_value : the value the feature took on
- explanation_N_label : the output being driven by this explanation. For regression projects, this is the name of the target feature. For classification projects, this is the class label whose probability increasing would correspond to a positive strength.
- explanation_N_qualitative_strength : a human-readable description of how strongly the feature affected the prediction (e.g. '+++', '--', '+') for this explanation
- explanation_N_strength : the amount this feature’s value affected the prediction
For classification projects, the server does not guarantee any ordering on the prediction values; however, within this function we sort the values so that class_X corresponds to the same class from row to row.
Parameters: - exclude_adjusted_predictions : bool
Optional, defaults to True. Set this to False to include adjusted prediction values in the returned dataframe.
Returns: - dataframe: pandas.DataFrame
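For example, a short sketch of pulling all explanation rows into memory, or streaming them to disk instead, with hypothetical ids:
import datarobot as dr

project_id = '5ad08a1889453d0001ea7c5c'        # hypothetical project id
explanations_id = '5ad08a1889453d0001ea7c90'   # hypothetical explanations id

explanations = dr.PredictionExplanations.get(project_id, explanations_id)

# All rows as a single dataframe with the column layout described above.
df = explanations.get_all_as_dataframe()
print(df.columns.tolist())

# Alternatively, write the rows straight to a CSV file on disk.
explanations.download_to_csv('explanations.csv')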
-
download_to_csv
(filename, encoding='utf-8', exclude_adjusted_predictions=True)¶ Save prediction explanations rows into CSV file.
Parameters: - filename : str or file object
path or file object to save prediction explanations rows
- encoding : string, optional
A string representing the encoding to use in the output file, defaults to ‘utf-8’
- exclude_adjusted_predictions : bool
Optional, defaults to True. Set to False to include adjusted predictions, which will differ from the predictions on some projects, e.g. those with an exposure column specified.
-
get_prediction_explanations_page
(limit=None, offset=None, exclude_adjusted_predictions=True)¶ Get prediction explanations.
If you don't want to use the generator interface, you can access paginated prediction explanations directly.
Parameters: - limit : int or None
the number of records to return, the server will use a (possibly finite) default if not specified
- offset : int or None
the number of records to skip, default 0
- exclude_adjusted_predictions : bool
Optional, defaults to True. Set to False to include adjusted predictions, which will differ from the predictions on some projects, e.g. those with an exposure column specified.
Returns: - prediction_explanations : PredictionExplanationsPage
-
delete
()¶ Delete these prediction explanations.
-
class
datarobot.models.prediction_explanations.
PredictionExplanationsRow
(row_id, prediction, prediction_values, prediction_explanations=None, adjusted_prediction=None, adjusted_prediction_values=None)¶ Represents prediction explanations computed for a prediction row.
Notes
PredictionValue contains:
- label : describes what this model output corresponds to. For regression projects, it is the name of the target feature. For classification projects, it is a level from the target feature.
- value : the output of the prediction. For regression projects, it is the predicted value of the target. For classification projects, it is the predicted probability the row belongs to the class identified by the label.
PredictionExplanation contains:
- label : describes what output was driven by this explanation. For regression projects, it is the name of the target feature. For classification projects, it is the class whose probability increasing would correspond to a positive strength of this prediction explanation.
- feature : the name of the feature contributing to the prediction
- feature_value : the value the feature took on for this row
- strength : the amount this feature's value affected the prediction
- qualitative_strength : a human-readable description of how strongly the feature affected the prediction (e.g. '+++', '--', '+')
Attributes: - row_id : int
which row this
PredictionExplanationsRow
describes- prediction : float
the output of the model for this row
- adjusted_prediction : float or None
adjusted prediction value for projects that provide this information, None otherwise
- prediction_values : list
an array of dictionaries with a schema described as
PredictionValue
- adjusted_prediction_values : list
same as prediction_values but for adjusted predictions
- prediction_explanations : list
an array of dictionaries with a schema described as
PredictionExplanation
-
class
datarobot.models.prediction_explanations.
PredictionExplanationsPage
(id, count=None, previous=None, next=None, data=None, prediction_explanations_record_location=None, adjustment_method=None)¶ Represents a batch of prediction explanations received by one request.
Attributes: - id : str
id of the prediction explanations computation result
- data : list[dict]
list of raw prediction explanations; each row corresponds to a row of the prediction dataset
- count : int
total number of rows computed
- previous_page : str
where to retrieve previous page of prediction explanations, None if current page is the first
- next_page : str
where to retrieve next page of prediction explanations, None if current page is the last
- prediction_explanations_record_location : str
where to retrieve the prediction explanations metadata
- adjustment_method : str
Adjustment method that was applied to predictions, or ‘N/A’ if no adjustments were done.
-
classmethod
get
(project_id, prediction_explanations_id, limit=None, offset=0, exclude_adjusted_predictions=True)¶ Retrieve prediction explanations.
Parameters: - project_id : str
id of the project the model belongs to
- prediction_explanations_id : str
id of the prediction explanations
- limit : int or None
the number of records to return; the server will use a (possibly finite) default if not specified
- offset : int or None
the number of records to skip, default 0
- exclude_adjusted_predictions : bool
Optional, defaults to True. Set to False to include adjusted predictions, which will differ from the predictions on some projects, e.g. those with an exposure column specified.
Returns: - prediction_explanations : PredictionExplanationsPage
The queried instance.
Predictions¶
-
class
datarobot.models.
Predictions
(project_id, prediction_id, model_id=None, dataset_id=None, includes_prediction_intervals=None, prediction_intervals_size=None)¶ Represents predictions metadata and provides access to prediction results.
Examples
List all predictions for a project
import datarobot as dr

# Fetch all predictions for a project
all_predictions = dr.Predictions.list(project_id)

# Inspect all calculated predictions
for predictions in all_predictions:
    print(predictions)  # repr includes project_id, model_id, and dataset_id
Retrieve predictions by id
import datarobot as dr

# Getting predictions by id
predictions = dr.Predictions.get(project_id, prediction_id)

# Dump actual predictions
df = predictions.get_all_as_dataframe()
print(df)
Attributes: - project_id : str
id of the project the model belongs to
- model_id : str
id of the model
- prediction_id : str
id of generated predictions
- includes_prediction_intervals : bool, optional
(New in v2.16) For time series projects only. Indicates if prediction intervals will be part of the response. Defaults to False.
- prediction_intervals_size : int, optional
(New in v2.16) For time series projects only. Indicates the percentile used for prediction intervals calculation. Will be present only if includes_prediction_intervals is True.
-
classmethod
list
(project_id, model_id=None, dataset_id=None)¶ Fetch all the computed predictions metadata for a project.
Parameters: - project_id : str
id of the project
- model_id : str, optional
if specified, only predictions metadata for this model will be retrieved
- dataset_id : str, optional
if specified, only predictions metadata for this dataset will be retrieved
Returns: - A list of Predictions objects
-
classmethod
get
(project_id, prediction_id)¶ Retrieve the specific predictions metadata
Parameters: - project_id : str
id of the project the model belongs to
- prediction_id : str
id of the prediction set
Returns: - Predictions object representing the specified predictions
-
get_all_as_dataframe
(class_prefix='class_')¶ Retrieve all prediction rows and return them as a pandas.DataFrame.
Parameters: - class_prefix : str, optional
The prefix to append to labels in the final dataframe. Default is
class_
(e.g., apple -> class_apple)
Returns: - dataframe: pandas.DataFrame
-
download_to_csv
(filename, encoding='utf-8')¶ Save prediction rows into CSV file.
Parameters: - filename : str or file object
path or file object to save prediction rows
- encoding : string, optional
A string representing the encoding to use in the output file, defaults to ‘utf-8’
Ruleset¶
-
class
datarobot.models.
Ruleset
(project_id=None, parent_model_id=None, model_id=None, ruleset_id=None, rule_count=None, score=None)¶ Represents an approximation of a model with DataRobot Prime
Attributes: - id : str
the id of the ruleset
- rule_count : int
the number of rules used to approximate the model
- score : float
the validation score of the approximation
- project_id : str
the project the approximation belongs to
- parent_model_id : str
the model being approximated
- model_id : str or None
the model using this ruleset (if it exists). Will be None if no such model has been trained.
-
request_model
()¶ Request training for a model using this ruleset
Training a model using a ruleset is a necessary prerequisite for being able to download the code for a ruleset.
Returns: - job: Job
the job fitting the new Prime model
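A hedged sketch of the Prime workflow around rulesets, assuming Model.request_approximation (documented with the Model class) and hypothetical ids; which ruleset is "best" depends on the project metric, so the selection below is illustrative only:
import datarobot as dr

project_id = '5ad08a1889453d0001ea7c5c'   # hypothetical project id
model_id = '5ad08a1889453d0001ea7c70'     # hypothetical parent model id

parent_model = dr.Model.get(project_id, model_id)

# request_approximation returns a job whose result is a list of Ruleset objects.
rulesets = parent_model.request_approximation().get_result_when_complete()
for ruleset in rulesets:
    print(ruleset.rule_count, ruleset.score)

# Train a Prime model from one ruleset; this is required before the ruleset's
# code can be downloaded as a PrimeFile.
prime_job = rulesets[0].request_model()
prime_model = prime_job.get_result_when_complete()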
PrimeFile¶
-
class
datarobot.models.
PrimeFile
(id=None, project_id=None, parent_model_id=None, model_id=None, ruleset_id=None, language=None, is_valid=None)¶ Represents an executable file available for download of the code for a DataRobot Prime model
Attributes: - id : str
the id of the PrimeFile
- project_id : str
the id of the project this PrimeFile belongs to
- parent_model_id : str
the model being approximated by this PrimeFile
- model_id : str
the prime model this file represents
- ruleset_id : int
the ruleset being used in this PrimeFile
- language : str
the language of the code in this file - see enums.LANGUAGE for possibilities
- is_valid : bool
whether the code passed basic validation
-
download
(filepath)¶ Download the code and save it to a file
Parameters: - filepath: string
the location to save the file to
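A short sketch of locating and downloading Prime code files via the project, with a hypothetical project id and output filename; check each file's language attribute before choosing an extension:
import datarobot as dr

project = dr.Project.get('5ad08a1889453d0001ea7c5c')   # hypothetical project id

# List the downloadable code files and keep only those that passed validation.
prime_files = project.get_prime_files()
valid_files = [f for f in prime_files if f.is_valid]

if valid_files:
    prime_file = valid_files[0]
    print(prime_file.language)
    prime_file.download('prime_model_code.py')   # hypothetical output path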
Project¶
-
class
datarobot.models.
Project
(id=None, project_name=None, mode=None, target=None, target_type=None, holdout_unlocked=None, metric=None, stage=None, partition=None, positive_class=None, created=None, advanced_options=None, recommender=None, max_train_pct=None, max_train_rows=None, scaleout_max_train_pct=None, scaleout_max_train_rows=None, file_name=None)¶ A project built from a particular training dataset
Attributes: - id : str
the id of the project
- project_name : str
the name of the project
- mode : int
the autopilot mode currently selected for the project - 0 for full autopilot, 1 for semi-automatic, and 2 for manual
- target : str
the name of the selected target feature
- target_type : str
Indicates what kind of modeling is being done in this project. Options are: 'Regression', 'Binary' (binary classification), 'Multiclass' (multiclass classification)
- holdout_unlocked : bool
whether the holdout has been unlocked
- metric : str
the selected project metric (e.g. LogLoss)
- stage : str
the stage the project has reached - one of
datarobot.enums.PROJECT_STAGE
- partition : dict
information about the selected partitioning options
- positive_class : str
for binary classification projects, the selected positive class; otherwise, None
- created : datetime
the time the project was created
- advanced_options : dict
information on the advanced options that were selected for the project settings, e.g. a weights column or a cap of the runtime of models that can advance autopilot stages
- recommender : dict
information on the recommender settings of the project (i.e. whether it is a recommender project, or the id columns)
- max_train_pct : float
the maximum percentage of the project dataset that can be used without going into the validation data or being too large to submit any blueprint for training
- max_train_rows : int
the maximum number of rows that can be trained on without going into the validation data or being too large to submit any blueprint for training
- scaleout_max_train_pct : float
the maximum percentage of the project dataset that can be used to successfully train a scaleout model without going into the validation data. May exceed max_train_pct, in which case only scaleout models can be trained up to this point.
- scaleout_max_train_rows : int
the maximum number of rows that can be used to successfully train a scaleout model without going into the validation data. May exceed max_train_rows, in which case only scaleout models can be trained up to this point.
- file_name : str
the name of the file uploaded for the project dataset
-
classmethod
get
(project_id)¶ Gets information about a project.
Parameters: - project_id : str
The identifier of the project you want to load.
Returns: - project : Project
The queried project
Examples
import datarobot as dr

p = dr.Project.get(project_id='54e639a18bd88f08078ca831')
p.id
>>> '54e639a18bd88f08078ca831'
p.project_name
>>> 'Some project name'
-
classmethod
create
(sourcedata, project_name='Untitled Project', max_wait=600, read_timeout=600, dataset_filename=None)¶ Creates a project with provided data.
Project creation is an asynchronous process, which means that after the initial request we will keep polling the status of the async process responsible for project creation until it finishes. For SDK users this only means that this method might raise exceptions related to its asynchronous nature.
Parameters: - sourcedata : basestring, file or pandas.DataFrame
Dataset to use for the project. If a string, it can be either a path to a local file, a URL to a publicly available file, or raw file content. If using a file, the filename must consist of ASCII characters only.
- project_name : str, unicode, optional
The name to assign to the empty project.
- max_wait : int, optional
Time in seconds after which project creation is considered unsuccessful
- read_timeout: int
The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete
- dataset_filename : string or None, optional
(New in version v2.14) File name to use for dataset. Ignored for url and file path sources.
Returns: - project : Project
Instance with initialized data.
Raises: - InputNotUnderstoodError
Raised if sourcedata isn’t one of supported types.
- AsyncFailureError
Polling for status of async process resulted in response with unsupported status code. Beginning in version 2.1, this will be ProjectAsyncFailureError, a subclass of AsyncFailureError
- AsyncProcessUnsuccessfulError
Raised if project creation was unsuccessful
- AsyncTimeoutError
Raised if project creation took more time than specified by the
max_wait
parameter
Examples
p = Project.create('/home/datasets/somedataset.csv', project_name="New API project")
p.id
>>> '5921731dkqshda8yd28h'
p.project_name
>>> 'New API project'
-
classmethod
encrypted_string
(plaintext)¶ Sends a string to DataRobot to be encrypted
This is used for passwords that DataRobot uses to access external data sources
Parameters: - plaintext : str
The string to encrypt
Returns: - ciphertext : str
The encrypted string
-
classmethod
create_from_hdfs
(url, port=None, project_name=None, max_wait=600)¶ Create a project from a datasource on a WebHDFS server.
Parameters: - url : str
The location of the WebHDFS file, both server and full path. Per the DataRobot specification, must begin with hdfs://, e.g. hdfs:///tmp/10kDiabetes.csv
- port : int, optional
The port to use. If not specified, will default to the server default (50070)
- project_name : str, optional
A name to give to the project
- max_wait : int
The maximum number of seconds to wait before giving up.
Returns: - Project
Examples
p = Project.create_from_hdfs('hdfs:///tmp/somedataset.csv', project_name="New API project")
p.id
>>> '5921731dkqshda8yd28h'
p.project_name
>>> 'New API project'
-
classmethod
create_from_data_source
(data_source_id, username, password, project_name=None, max_wait=600)¶ Create a project from a data source. Either data_source or data_source_id should be specified.
Parameters: - data_source_id : str
the identifier of the data source.
- username : str
the username for database authentication.
- password : str
the password for database authentication. The password is encrypted at server side and never saved / stored.
- project_name : str, optional
optional, a name to give to the project.
- max_wait : int
optional, the maximum number of seconds to wait before giving up.
Returns: - Project
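A minimal sketch with a hypothetical data source id and placeholder database credentials:
import datarobot as dr

project = dr.Project.create_from_data_source(
    data_source_id='5ad08a1889453d0001ea7d00',   # hypothetical data source id
    username='db_user',                          # placeholder credentials
    password='db_password',
    project_name='Project from data source',
)
print(project.id)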
-
classmethod
from_async
(async_location, max_wait=600)¶ Given a temporary async status location, poll for no more than max_wait seconds until the async process (project creation or setting the target, for example) finishes successfully, then return the ready project
Parameters: - async_location : str
The URL for the temporary async status resource. This is returned as a header in the response to a request that initiates an async process
- max_wait : int
The maximum number of seconds to wait before giving up.
Returns: - project : Project
The project, now ready
Raises: - ProjectAsyncFailureError
If the server returned an unexpected response while polling for the asynchronous operation to resolve
- AsyncProcessUnsuccessfulError
If the final result of the asynchronous operation was a failure
- AsyncTimeoutError
If the asynchronous operation did not resolve within the time specified
-
classmethod
start
(sourcedata, target, project_name='Untitled Project', worker_count=None, metric=None, autopilot_on=True, blueprint_threshold=None, response_cap=None, partitioning_method=None, positive_class=None, target_type=None)¶ Chain together project creation, file upload, and target selection.
Parameters: - sourcedata : str or pandas.DataFrame
The path to the file to upload. Can be either a path to a local file or a publicly accessible URL. If the source is a DataFrame, it will be serialized to a temporary buffer. If using a file, the filename must consist of ASCII characters only.
- target : str
The name of the target column in the uploaded file.
- project_name : str
The project name.
Returns: - project : Project
The newly created and initialized project.
Other Parameters: - worker_count : int, optional
The number of workers that you want to allocate to this project.
- metric : str, optional
The name of metric to use.
- autopilot_on : boolean, default
True
Whether or not to begin modeling automatically.
- blueprint_threshold : int, optional
Number of hours the model is permitted to run. Minimum 1
- response_cap : float, optional
Quantile of the response distribution to use for response capping Must be in range 0.5 .. 1.0
- partitioning_method : PartitioningMethod object, optional
It should be an instance of one of the PartitioningMethod objects.
- positive_class : str, float, or int; optional
Specifies a level of the target column that should be treated as the positive class for binary classification. May only be specified for binary classification targets.
- target_type : str, optional
Override the automatically selected target_type. An example usage would be setting target_type='Multiclass' when you want to perform a multiclass classification task on a numeric column that has a low cardinality. You can use the
TARGET_TYPE
enum.
Raises: - AsyncFailureError
Polling for status of async process resulted in response with unsupported status code
- AsyncProcessUnsuccessfulError
Raised if project creation or target setting was unsuccessful
- AsyncTimeoutError
Raised if project creation or target setting timed out
Examples
Project.start("./tests/fixtures/file.csv", "a_target", project_name="test_name", worker_count=4, metric="a_metric")
-
classmethod
list
(search_params=None)¶ Returns the projects associated with this account.
Parameters: - search_params : dict, optional.
If not None, the returned projects are filtered by lookup. Currently you can query projects by:
project_name
Returns: - projects : list of Project instances
Contains a list of projects associated with this user account.
Raises: - TypeError
Raised if
search_params
parameter is provided, but is not of supported type.
Examples
List all projects:

p_list = Project.list()
p_list
>>> [Project('Project One'), Project('Two')]

Search for projects by name:

Project.list(search_params={'project_name': 'red'})
>>> [Project('Predtime'), Project('Fred Project')]
-
refresh
()¶ Fetches the latest state of the project, and updates this object with that information. This is an inplace update, not a new object.
Returns: - self : Project
the now-updated project
-
delete
()¶ Removes this project from your account.
-
set_target
(target, mode='auto', metric=None, quickrun=None, worker_count=None, positive_class=None, partitioning_method=None, featurelist_id=None, advanced_options=None, max_wait=600, target_type=None)¶ Set target variable of an existing project that has a file uploaded to it.
Target setting is an asynchronous process, which means that after the initial request we will keep polling the status of the async process responsible for target setting until it finishes. For SDK users this only means that this method might raise exceptions related to its asynchronous nature.
Parameters: - target : str
Name of variable.
- mode : str, optional
You can use
AUTOPILOT_MODE
enum to choose betweenAUTOPILOT_MODE.FULL_AUTO
AUTOPILOT_MODE.MANUAL
AUTOPILOT_MODE.QUICK
If unspecified,
FULL_AUTO
is used- metric : str, optional
Name of the metric to use for evaluating models. You can query the metrics available for the target by way of
Project.get_metrics
. If none is specified, then the default recommended by DataRobot is used.- quickrun : bool, optional
Deprecated - pass
AUTOPILOT_MODE.QUICK
as mode instead. Sets whether project should be run inquick run
mode. This setting causes DataRobot to recommend a more limited set of models in order to get a base set of models and insights more quickly.- worker_count : int, optional
The number of concurrent workers to request for this project. If None, then the default is used. (New in version v2.14) Setting this to -1 will request the maximum number available to your account.
- partitioning_method : PartitioningMethod object, optional
It should be an instance of one of the PartitioningMethod objects.
- positive_class : str, float, or int; optional
Specifies a level of the target column that should be treated as the positive class for binary classification. May only be specified for binary classification targets.
- featurelist_id : str, optional
Specifies which feature list to use.
- advanced_options : AdvancedOptions, optional
Used to set advanced options of project creation.
- max_wait : int, optional
Time in seconds after which target setting is considered unsuccessful.
- target_type : str, optional
Override the automatically selected target_type. An example usage would be setting target_type='Multiclass' when you want to perform a multiclass classification task on a numeric column that has a low cardinality. You can use the
TARGET_TYPE
enum.
Returns: - project : Project
The instance with updated attributes.
Raises: - AsyncFailureError
Polling for status of async process resulted in response with unsupported status code
- AsyncProcessUnsuccessfulError
Raised if target setting was unsuccessful
- AsyncTimeoutError
Raised if target setting took more time than specified by the
max_wait
parameter
- TypeError
Raised if
advanced_options
,partitioning_method
ortarget_type
is provided, but is not of supported type
See also
datarobot.models.Project.start
- combines project creation, file upload, and target selection
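A hedged sketch of the two-step workflow (create the project, then set the target), using a hypothetical dataset path and target column name:
import datarobot as dr
from datarobot.enums import AUTOPILOT_MODE

# Hypothetical dataset path and target column name.
project = dr.Project.create('/home/datasets/somedataset.csv',
                            project_name='Manual target example')

# Optionally inspect the metrics DataRobot recommends for the target.
print(project.get_metrics('a_target'))

# Set the target and start quick autopilot with four workers.
project.set_target('a_target', mode=AUTOPILOT_MODE.QUICK, worker_count=4)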
-
get_models
(order_by=None, search_params=None, with_metric=None)¶ List all completed, successful models in the leaderboard for the given project.
Parameters: - order_by : str or list of strings, optional
If not None, the returned models are ordered by this attribute. If None, the default return is the order of default project metric.
Allowed attributes to sort by are:
metric
sample_pct
If the sort attribute is preceded by a hyphen, models will be sorted in descending order, otherwise in ascending order.
Multiple sort attributes can be included as a comma-delimited string or in a list e.g. order_by=`sample_pct,-metric` or order_by=[sample_pct, -metric]
Sorting by metric will order models by their validation score on the project metric.
- search_params : dict, optional.
If not None, the returned models are filtered by lookup. Currently you can query models by:
name
sample_pct
is_starred
- with_metric : str, optional.
If not None, the returned models will only have scores for this metric. Otherwise all the metrics are returned.
Returns: - models : a list of Model instances.
All of the models that have been trained in this project.
Raises: - TypeError
Raised if
order_by
orsearch_params
parameter is provided, but is not of supported type.
Examples
Project.get('pid').get_models(order_by=['-sample_pct', 'metric'])

# Getting models that contain "Ridge" in name
# and with sample_pct more than 64
Project.get('pid').get_models(
    search_params={
        'sample_pct__gt': 64,
        'name': "Ridge"
    })

# Filtering models based on 'starred' flag:
Project.get('pid').get_models(search_params={'is_starred': True})
-
get_datetime_models
()¶ List all models in the project as DatetimeModels
Requires the project to be datetime partitioned. If it is not, a ClientError will occur.
Returns: - models : list of DatetimeModel
the datetime models
-
get_prime_models
()¶ List all DataRobot Prime models for the project Prime models were created to approximate a parent model, and have downloadable code.
Returns: - models : list of PrimeModel
-
get_prime_files
(parent_model_id=None, model_id=None)¶ List all downloadable code files from DataRobot Prime for the project
Parameters: - parent_model_id : str, optional
Filter for only those prime files approximating this parent model
- model_id : str, optional
Filter for only those prime files with code for this prime model
Returns: - files: list of PrimeFile
-
get_datasets
()¶ List all the datasets that have been uploaded for predictions
Returns: - datasets : list of PredictionDataset instances
-
upload_dataset
(sourcedata, max_wait=600, read_timeout=600, forecast_point=None, predictions_start_date=None, predictions_end_date=None, dataset_filename=None, relax_known_in_advance_features_check=None)¶ Upload a new dataset to make predictions against
Parameters: - sourcedata : str, file or pandas.DataFrame
Data to be used for predictions. If a string, it can be either a path to a local file, a URL to a publicly available file, or raw file content. If using a file on disk, the filename must consist of ASCII characters only.
- max_wait : int, optional
The maximum number of seconds to wait for the uploaded dataset to be processed before raising an error.
- read_timeout : int, optional
The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete
- forecast_point : datetime.datetime or None, optional
(New in version v2.8) May only be specified for Time Series projects, otherwise the upload will be rejected. The time in the dataset relative to which predictions should be generated in a Time Series project. See the Time Series documentation for more information. If not provided, will default to using the latest forecast point in the dataset.
- predictions_start_date : datetime.datetime or None, optional
(New in version v2.11) May only be specified for time series projects. The start date for bulk predictions. This parameter should be provided in conjunction with
predictions_end_date
. Cannot be provided with theforecast_point
parameter.- predictions_end_date : datetime.datetime or None, optional
(New in version v2.11) May only be specified for time series projects. The end date for bulk predictions. This parameter should be provided in conjunction with
predictions_start_date
. Cannot be provided with theforecast_point
parameter.- dataset_filename : string or None, optional
(New in version v2.14) File name to use for the dataset. Ignored for url and file path sources.
- relax_known_in_advance_features_check : bool, optional
(New in version v2.15) For Time Series projects only. If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.
Returns: - dataset : PredictionDataset
The newly uploaded dataset.
Raises: - InputNotUnderstoodError
Raised if
sourcedata
isn’t one of supported types.- AsyncFailureError
Raised if polling for the status of an async process resulted in a response with an unsupported status code.
- AsyncProcessUnsuccessfulError
Raised if project creation was unsuccessful (i.e. the server reported an error in uploading the dataset).
- AsyncTimeoutError
Raised if processing the uploaded dataset took more time than specified by the
max_wait
parameter.- ValueError
Raised if
forecast_point
orpredictions_start_date
andpredictions_end_date
are provided, but are not of the supported type.
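A sketch of the upload-then-predict workflow this method enables, with hypothetical ids and a hypothetical local file path:
import datarobot as dr

project = dr.Project.get('5ad08a1889453d0001ea7c5c')          # hypothetical project id
model = dr.Model.get(project.id, '5ad08a1889453d0001ea7c70')  # hypothetical model id

# Upload the scoring data once; the same dataset can then be reused for
# predictions from any model in the project.
dataset = project.upload_dataset('./data_to_predict.csv')     # hypothetical path

predict_job = model.request_predictions(dataset.id)
predictions = predict_job.get_result_when_complete()
print(predictions.head())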
-
upload_dataset_from_data_source
(data_source_id, username, password, max_wait=600, forecast_point=None, relax_known_in_advance_features_check=None)¶ Upload a new dataset from a data source to make predictions against
Parameters: - data_source_id : str
The identifier of the data source.
- username : str
The username for database authentication.
- password : str
The password for database authentication. The password is encrypted at server side and never saved / stored.
- max_wait : int, optional
Optional, the maximum number of seconds to wait before giving up.
- forecast_point : datetime.datetime or None, optional
(New in version v2.8) May only be specified for Time Series projects, otherwise the upload will be rejected. The time in the dataset relative to which predictions should be generated in a Time Series project. See the Time Series documentation for more information. If not provided, will default to using the latest forecast point in the dataset.
- relax_known_in_advance_features_check : bool, optional
(New in version v2.15) For Time Series projects only. If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.
Returns: - dataset : PredictionDataset
the newly uploaded dataset
-
get_blueprints
()¶ List all blueprints recommended for a project.
Returns: - menu : list of Blueprint instances
All the blueprints recommended by DataRobot for a project
-
get_features
()¶ List all features for this project
Returns: - list of Feature
all features for this project
-
get_modeling_features
(batch_size=None)¶ List all modeling features for this project
Only available once the target and partitioning settings have been set. For more information on the distinction between input and modeling features, see the time series documentation.
Parameters: - batch_size : int, optional
The number of features to retrieve in a single API call. If specified, the client may make multiple calls to retrieve the full list of features. If not specified, an appropriate default will be chosen by the server.
Returns: - list of ModelingFeature
All modeling features in this project
-
get_featurelists
()¶ List all featurelists created for this project
Returns: - list of Featurelist
all featurelists created for this project
-
get_modeling_featurelists
(batch_size=None)¶ List all modeling featurelists created for this project
Modeling featurelists can only be created after the target and partitioning options have been set for a project. In time series projects, these are the featurelists that can be used for modeling; in other projects, they behave the same as regular featurelists.
See the time series documentation for more information.
Parameters: - batch_size : int, optional
The number of featurelists to retrieve in a single API call. If specified, the client may make multiple calls to retrieve the full list of features. If not specified, an appropriate default will be chosen by the server.
Returns: - list of ModelingFeaturelist
all modeling featurelists in this project
-
create_type_transform_feature
(name, parent_name, variable_type, replacement=None, date_extraction=None, max_wait=600)¶ Create a new feature by transforming the type of an existing feature in the project
Note that only the following transformations are supported:
- Text to categorical or numeric
- Categorical to text or numeric
- Numeric to categorical
- Date to categorical or numeric
Note
Special considerations when casting numeric to categorical
There are two parameters which can be used for variableType to convert numeric data to categorical levels. These differ in the assumptions they make about the input data, and are very important when considering the data that will be used to make predictions. The assumptions that each makes are:
- categorical : The data in the column is all integral, and there are no missing values. If either of these conditions do not hold in the training set, the transformation will be rejected. During predictions, if any of the values in the parent column are missing, the predictions will error.
- categoricalInt : New in v2.6. All of the data in the column should be considered categorical in its string form when cast to an int by truncation. For example the value 3 will be cast as the string '3' and the value 3.14 will also be cast as the string '3'. Further, the value -3.6 will become the string '-3'. Missing values will still be recognized as missing.
For convenience these are represented in the enum VARIABLE_TYPE_TRANSFORM with the names CATEGORICAL and CATEGORICAL_INT
Parameters: - name : str
The name to give to the new feature
- parent_name : str
The name of the feature to transform
- variable_type : str
The type the new column should have. See the values within
datarobot.enums.VARIABLE_TYPE_TRANSFORM
- replacement : str or float, optional
The value that missing or unconvertible data should have
- date_extraction : str, optional
Must be specified when parent_name is a date column (and left None otherwise). Specifies which value from a date should be extracted. See the list of values in
datarobot.enums.DATE_EXTRACTION
- max_wait : int, optional
The maximum amount of time to wait for DataRobot to finish processing the new column. This process can take more time with more data to process. If this operation times out, an AsyncTimeoutError will occur. DataRobot continues the processing and the new column may successfully be constructed.
Returns: - Feature
The data of the new Feature
Raises: - AsyncFailureError
If any of the responses from the server are unexpected
- AsyncProcessUnsuccessfulError
If the job being waited for has failed or has been cancelled
- AsyncTimeoutError
If the resource did not resolve in time
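A hedged sketch of deriving a categorical feature from a date column, with a hypothetical project id and column names; the MONTH member of DATE_EXTRACTION is assumed here:
import datarobot as dr
from datarobot.enums import VARIABLE_TYPE_TRANSFORM, DATE_EXTRACTION

project = dr.Project.get('5ad08a1889453d0001ea7c5c')   # hypothetical project id

# 'purchase_date' and 'purchase_month' are hypothetical column names.
month_feature = project.create_type_transform_feature(
    name='purchase_month',
    parent_name='purchase_date',
    variable_type=VARIABLE_TYPE_TRANSFORM.CATEGORICAL,
    date_extraction=DATE_EXTRACTION.MONTH,
)
print(month_feature.name)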
-
create_featurelist
(name, features)¶ Creates a new featurelist
Parameters: - name : str
The name to give to this new featurelist. Names must be unique, so an error will be returned from the server if this name has already been used in this project.
- features : list of str
The names of the features. Each feature must exist in the project already.
Returns: - Featurelist
newly created featurelist
Raises: - DuplicateFeaturesError
Raised if features variable contains duplicate features
Examples
project = Project.get('5223deadbeefdeadbeef0101')
flists = project.get_featurelists()

# Create a new featurelist using a subset of features from an
# existing featurelist
flist = flists[0]
features = flist.features[::2]  # Half of the features
new_flist = project.create_featurelist(name='Feature Subset', features=features)
-
create_modeling_featurelist
(name, features)¶ Create a new modeling featurelist
Modeling featurelists can only be created after the target and partitioning options have been set for a project. In time series projects, these are the featurelists that can be used for modeling; in other projects, they behave the same as regular featurelists.
See the time series documentation for more information.
Parameters: - name : str
the name of the modeling featurelist to create. Names must be unique within the project, or the server will return an error.
- features : list of str
the names of the features to include in the modeling featurelist. Each feature must be a modeling feature.
Returns: - featurelist : ModelingFeaturelist
the newly created featurelist
Examples
project = Project.get('1234deadbeeffeeddead4321')
modeling_features = project.get_modeling_features()
selected_features = [feat.name for feat in modeling_features][:5]  # select first five
new_flist = project.create_modeling_featurelist('Model This', selected_features)
-
get_metrics
(feature_name)¶ Get the metrics recommended for modeling on the given feature.
Parameters: - feature_name : str
The name of the feature to query regarding which metrics are recommended for modeling.
Returns: - names : list of str
The names of the recommended metrics.
-
get_status
()¶ Query the server for project status.
Returns: - status : dict
Contains:
- autopilot_done : a boolean.
- stage : a short string indicating which stage the project is in.
- stage_description : a description of what stage means.
Examples
{"autopilot_done": False, "stage": "modeling", "stage_description": "Ready for modeling"}
-
pause_autopilot
()¶ Pause autopilot, which stops processing the next jobs in the queue.
Returns: - paused : boolean
Whether the command was acknowledged
-
unpause_autopilot
()¶ Unpause autopilot, which restarts processing the next jobs in the queue.
Returns: - unpaused : boolean
Whether the command was acknowledged.
-
start_autopilot
(featurelist_id)¶ Starts autopilot on provided featurelist.
Only one autopilot can be running at a time, so any ongoing autopilot on a different featurelist will be halted - modeling jobs already in the queue would not be affected, but the halted autopilot would not add new jobs to the queue.
Parameters: - featurelist_id : str
Identifier of featurelist that should be used for autopilot
Raises: - AppPlatformError
Raised if autopilot is currently running on or has already finished running on the provided featurelist. Also raised if project’s target was not selected.
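A short sketch of pointing autopilot at a custom featurelist, with a hypothetical project id; in a real project you would pick features deliberately rather than taking the first ten:
import datarobot as dr

project = dr.Project.get('5ad08a1889453d0001ea7c5c')   # hypothetical project id

# Build a smaller featurelist and start autopilot on it.
feature_names = [f.name for f in project.get_features()][:10]
flist = project.create_featurelist('First ten features', feature_names)
project.start_autopilot(flist.id)

# Autopilot can be paused and resumed while the queue is being processed.
project.pause_autopilot()
project.unpause_autopilot()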
-
train
(trainable, sample_pct=None, featurelist_id=None, source_project_id=None, scoring_type=None, training_row_count=None, monotonic_increasing_featurelist_id=<object object>, monotonic_decreasing_featurelist_id=<object object>)¶ Submit a job to the queue to train a model.
Either sample_pct or training_row_count can be used to specify the amount of data to use, but not both. If neither are specified, a default of the maximum amount of data that can safely be used to train any blueprint without going into the validation data will be selected.
In smart-sampled projects, sample_pct and training_row_count are assumed to be in terms of rows of the minority class.
Note
If the project uses datetime partitioning, use
train_datetime
insteadParameters: - trainable : str or Blueprint
For str, this is assumed to be a blueprint_id. If no source_project_id is provided, the project_id will be assumed to be the project that this instance represents.
Otherwise, for a Blueprint, it contains the blueprint_id and source_project_id that we want to use. featurelist_id will assume the default for this project if not provided, and sample_pct will default to using the maximum training value allowed for this project's partition setup. source_project_id will be ignored if a Blueprint instance is used for this parameter
The amount of data to use for training, as a percentage of the project dataset from 0 to 100.
- featurelist_id : str, optional
The identifier of the featurelist to use. If not defined, the default for this project is used.
- source_project_id : str, optional
Which project created this blueprint_id. If None, it defaults to looking in this project. Note that you must have read permissions in this project.
- scoring_type : str, optional
Either SCORING_TYPE.validation or SCORING_TYPE.cross_validation. SCORING_TYPE.validation is available for every partitioning type, and indicates that the default model validation should be used for the project. If the project uses a form of cross-validation partitioning, SCORING_TYPE.cross_validation can also be used to indicate that all of the available training/validation combinations should be used to evaluate the model.
- training_row_count : int, optional
The number of rows to use to train the requested model.
- monotonic_increasing_featurelist_id : str, optional
(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. Passing None disables the increasing monotonicity constraint. The default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
- monotonic_decreasing_featurelist_id : str, optional
(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. Passing None disables the decreasing monotonicity constraint. The default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
Returns: - model_job_id : str
id of the created job, which can be used as a parameter to the ModelJob.get method or the wait_for_async_model_creation function
Examples
Use a Blueprint instance:
blueprint = project.get_blueprints()[0]
model_job_id = project.train(blueprint, training_row_count=project.max_train_rows)
Use a blueprint_id, which is a string. In the first case, it is assumed that the blueprint was created by this project. If you are using a blueprint used by another project, you will need to pass the id of that other project as well.
blueprint_id = 'e1c7fc29ba2e612a72272324b8a842af'
project.train(blueprint_id, training_row_count=project.max_train_rows)
another_project.train(blueprint_id, source_project_id=project.id)
You can also easily use this interface to train a new model using the data from an existing model:
model = project.get_models()[0]
model_job_id = project.train(model.blueprint.id, sample_pct=100)
-
train_datetime
(blueprint_id, featurelist_id=None, training_row_count=None, training_duration=None, source_project_id=None)¶ Create a new model in a datetime partitioned project
If the project is not datetime partitioned, an error will occur.
Parameters: - blueprint_id : str
the blueprint to use to train the model
- featurelist_id : str, optional
the featurelist to use to train the model. If not specified, the project default will be used.
- training_row_count : int, optional
the number of rows of data that should be used to train the model. If specified, training_duration may not be specified.
- training_duration : str, optional
a duration string specifying what time range the data used to train the model should span. If specified, training_row_count may not be specified.
- source_project_id : str, optional
the id of the project this blueprint comes from, if not this project. If left unspecified, the blueprint must belong to this project.
Returns: - job : ModelJob
the created job to build the model
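A minimal sketch, assuming project is a datetime partitioned project and the row count shown is purely illustrative:
blueprint = project.get_blueprints()[0]
job = project.train_datetime(blueprint.id, training_row_count=5000)
model = job.get_result_when_complete()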
-
blend
(model_ids, blender_method)¶ Submit a job to create a blender model. Upon success, the new job will be added to the end of the queue.
Parameters: - model_ids : list of str
List of model ids that will be used to create the blender. These models should have completed the validation stage without errors, and cannot be blenders, DataRobot Prime or scaleout models.
- blender_method : str
Chosen blend method, one from
datarobot.enums.BLENDER_METHOD
Returns: - model_job : ModelJob
New
ModelJob
instance for the blender creation job in queue.
See also
datarobot.models.Project.check_blendable
- to confirm if models can be blended
-
check_blendable
(model_ids, blender_method)¶ Check if the specified models can be successfully blended
Parameters: - model_ids : list of str
List of model ids that will be used to create the blender. These models should have completed the validation stage without errors, and cannot be blenders, DataRobot Prime or scaleout models.
- blender_method : str
Chosen blend method, one from
datarobot.enums.BLENDER_METHOD
Returns: - EligibilityResult (datarobot.helpers.eligibility_result.EligibilityResult)
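To tie blend and check_blendable together, a hedged sketch (assuming the top two leaderboard models have completed validation):
import datarobot as dr

model_ids = [model.id for model in project.get_models()[:2]]
eligibility = project.check_blendable(model_ids, dr.enums.BLENDER_METHOD.AVERAGE)
if eligibility.supported:
    blend_job = project.blend(model_ids, dr.enums.BLENDER_METHOD.AVERAGE)
    blender = blend_job.get_result_when_complete()
else:
    print(eligibility.reason)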
-
get_all_jobs
(status=None)¶ Get a list of jobs
This will give Jobs representing any type of job, including modeling or predict jobs.
Parameters: - status : QUEUE_STATUS enum, optional
If called with QUEUE_STATUS.INPROGRESS, will return the jobs that are currently running.
If called with QUEUE_STATUS.QUEUE, will return the jobs that are waiting to be run.
If called with QUEUE_STATUS.ERROR, will return the jobs that have errored.
If no value is provided, will return all jobs currently running or waiting to be run.
Returns: - jobs : list
Each is an instance of Job
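For instance, a brief sketch of listing only the jobs that are currently running:
import datarobot as dr

running_jobs = project.get_all_jobs(status=dr.enums.QUEUE_STATUS.INPROGRESS)
for job in running_jobs:
    print(job.id, job.job_type, job.status)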
-
get_blenders
()¶ Get a list of blender models.
Returns: - list of BlenderModel
list of all blender models in project.
-
get_frozen_models
()¶ Get a list of frozen models
Returns: - list of FrozenModel
list of all frozen models in project.
-
get_model_jobs
(status=None)¶ Get a list of modeling jobs
Parameters: - status : QUEUE_STATUS enum, optional
If called with QUEUE_STATUS.INPROGRESS, will return the modeling jobs that are currently running.
If called with QUEUE_STATUS.QUEUE, will return the modeling jobs that are waiting to be run.
If called with QUEUE_STATUS.ERROR, will return the modeling jobs that have errored.
If no value is provided, will return all modeling jobs currently running or waiting to be run.
Returns: - jobs : list
Each is an instance of ModelJob
-
get_predict_jobs
(status=None)¶ Get a list of prediction jobs
Parameters: - status : QUEUE_STATUS enum, optional
If called with QUEUE_STATUS.INPROGRESS, will return the prediction jobs that are currently running.
If called with QUEUE_STATUS.QUEUE, will return the prediction jobs that are waiting to be run.
If called with QUEUE_STATUS.ERROR, will return the prediction jobs that have errored.
If called without a status, will return all prediction jobs currently running or waiting to be run.
Returns: - jobs : list
Each is an instance of PredictJob
-
wait_for_autopilot
(check_interval=20.0, timeout=86400, verbosity=1)¶ Blocks until autopilot is finished. This will raise an exception if the autopilot mode is changed from AUTOPILOT_MODE.FULL_AUTO.
It makes API calls to sync the project state with the server and to look at which jobs are enqueued.
Parameters: - check_interval : float or int
The maximum time (in seconds) to wait between checks for whether autopilot is finished
- timeout : float or int or None
After this long (in seconds), we give up. If None, never timeout.
- verbosity:
This should be VERBOSITY_LEVEL.SILENT or VERBOSITY_LEVEL.VERBOSE. For VERBOSITY_LEVEL.SILENT, nothing will be displayed about progress. For VERBOSITY_LEVEL.VERBOSE, the number of jobs in progress or queued is shown. Note that new jobs are added to the queue along the way.
Raises: - AsyncTimeoutError
If autopilot does not finish in the amount of time specified
- RuntimeError
If a condition is detected that indicates that autopilot will not complete on its own
-
rename
(project_name)¶ Update the name of the project.
Parameters: - project_name : str
The new name
-
unlock_holdout
()¶ Unlock the holdout for this project.
This will cause subsequent queries of the models of this project to contain the metric values for the holdout set, if it exists.
Take care, as this cannot be undone. Remember that best practice is to select a model before analyzing model performance on the holdout set.
-
set_worker_count
(worker_count)¶ Sets the number of workers allocated to this project.
Note that this value is limited to the number allowed by your account. Lowering the number will not stop currently running jobs, but will cause the queue to wait for the appropriate number of jobs to finish before attempting to run more jobs.
Parameters: - worker_count : int
The number of concurrent workers to request from the pool of workers. (New in version v2.14) Setting this to -1 will update the number of workers to the maximum available to your account.
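A brief sketch (the worker counts shown are illustrative and subject to your account limits):
# Request four concurrent workers for this project
project.set_worker_count(4)
# Or request the maximum number available to your account (v2.14+)
project.set_worker_count(-1)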
-
get_leaderboard_ui_permalink
()¶ Returns: - url : str
Permanent static hyperlink to a project leaderboard.
-
open_leaderboard_browser
()¶ Opens project leaderboard in web browser.
Note: If text-mode browsers are used, the calling process will block until the user exits the browser.
-
get_rating_table_models
()¶ Get a list of models with a rating table
Returns: - list of RatingTableModel
list of all models with a rating table in project.
-
get_rating_tables
()¶ Get a list of rating tables
Returns: - list of RatingTable
list of rating tables in project.
-
get_access_list
()¶ Retrieve users who have access to this project and their access levels
New in version v2.15.
Returns: - list of SharingAccess (datarobot.SharingAccess)
-
share
(access_list)¶ Modify the ability of users to access this project
New in version v2.15.
Parameters: - access_list : list of
SharingAccess
the modifications to make.
Raises: - datarobot.ClientError :
if you do not have permission to share this project, if the user you’re sharing with doesn’t exist, if the same user appears multiple times in the access_list, or if these changes would leave the project without an owner
Examples
Transfer access to the project from old_user@datarobot.com to new_user@datarobot.com
import datarobot as dr
new_access = dr.SharingAccess('new_user@datarobot.com', dr.enums.SHARING_ROLE.OWNER, can_share=True)
access_list = [dr.SharingAccess('old_user@datarobot.com', None), new_access]
dr.Project.get('my-project-id').share(access_list)
-
class
datarobot.helpers.eligibility_result.
EligibilityResult
(supported, reason='', context='')¶ Represents whether a particular operation is supported
For instance, a function to check whether a set of models can be blended can return an EligibilityResult specifying whether or not blending is supported and why it may not be supported.
Attributes: - supported : bool
whether the operation this result represents is supported
- reason : str
why the operation is or is not supported
- context : str
what operation isn’t supported
Rating Table¶
-
class
datarobot.models.
RatingTable
(id, rating_table_name, original_filename, project_id, parent_model_id, model_id=None, model_job_id=None, validation_job_id=None, validation_error=None)¶ Interface to modify and download rating tables.
Attributes: - id : str
The id of the rating table.
- project_id : str
The id of the project this rating table belongs to.
- rating_table_name : str
The name of the rating table.
- original_filename : str
The name of the file used to create the rating table.
- parent_model_id : str
The model id of the model the rating table was validated against.
- model_id : str
The model id of the model that was created from the rating table. Can be None if a model has not been created from the rating table.
- model_job_id : str
The id of the job to create a model from this rating table. Can be None if a model has not been created from the rating table.
- validation_job_id : str
The id of the created job to validate the rating table. Can be None if the rating table has not been validated.
- validation_error : str
Contains a description of any errors caused during validation.
-
classmethod
get
(project_id, rating_table_id)¶ Retrieve a single rating table
Parameters: - project_id : str
The ID of the project the rating table is associated with.
- rating_table_id : str
The ID of the rating table
Returns: - rating_table : RatingTable
The queried instance
-
classmethod
create
(project_id, parent_model_id, filename, rating_table_name='Uploaded Rating Table')¶ Uploads and validates a new rating table CSV
Parameters: - project_id : str
id of the project the rating table belongs to
- parent_model_id : str
id of the model against which this rating table should be validated
- filename : str
The path of the CSV file containing the modified rating table.
- rating_table_name : str, optional
A human friendly name for the new rating table. The string may be truncated and a suffix may be added to maintain unique names of all rating tables.
Returns: - job: Job
an instance of created async job
Raises: - InputNotUnderstoodError
Raised if filename isn’t one of the supported types.
- ClientError (400)
Raised if parent_model_id is invalid.
-
download
(filepath)¶ Download a csv file containing the contents of this rating table
Parameters: - filepath : str
The path at which to save the rating table file.
-
rename
(rating_table_name)¶ Renames a rating table to a different name.
Parameters: - rating_table_name : str
The new name to rename the rating table to.
-
create_model
()¶ Creates a new model from this rating table record. This rating table must not already be associated with a model and must be valid.
Returns: - job: Job
an instance of created async job
Raises: - ClientError (422)
Raised if creating model from a RatingTable that failed validation
- JobAlreadyRequested
Raised if creating model from a RatingTable that is already associated with a RatingTableModel
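To tie these methods together, a hedged end-to-end sketch of uploading a modified rating table and building a model from it (the file path, table name, and ids are placeholders):
import datarobot as dr

# Upload and validate a modified rating table CSV
upload_job = dr.RatingTable.create(project_id, parent_model_id, 'modified_rating_table.csv',
                                   rating_table_name='My Adjusted Table')
rating_table = upload_job.get_result_when_complete()

# Build a new model from the validated rating table
model_job = rating_table.create_model()
rating_table_model = model_job.get_result_when_complete()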
Reason Codes (Deprecated)¶
This interface is considered deprecated. Please use PredictionExplanations instead.
-
class
datarobot.
ReasonCodesInitialization
(project_id, model_id, reason_codes_sample=None)¶ Represents a reason codes initialization of a model.
Attributes: - project_id : str
id of the project the model belongs to
- model_id : str
id of the model reason codes initialization is for
- reason_codes_sample : list of dict
a small sample of reason codes that could be generated for the model
-
classmethod
get
(project_id, model_id)¶ Retrieve the reason codes initialization for a model.
Reason codes initializations are a prerequisite for computing reason codes, and include a sample of what the computed reason codes for a prediction dataset would look like.
Parameters: - project_id : str
id of the project the model belongs to
- model_id : str
id of the model reason codes initialization is for
Returns: - reason_codes_initialization : ReasonCodesInitialization
The queried instance.
Raises: - ClientError (404)
If the project or model does not exist or the initialization has not been computed.
-
classmethod
create
(project_id, model_id)¶ Create a reason codes initialization for the specified model.
Parameters: - project_id : str
id of the project the model belongs to
- model_id : str
id of the model for which initialization is requested
Returns: - job : Job
an instance of created async job
-
delete
()¶ Delete this reason codes initialization.
-
class
datarobot.
ReasonCodes
(id, project_id, model_id, dataset_id, max_codes, num_columns, finish_time, reason_codes_location, threshold_low=None, threshold_high=None)¶ Represents reason codes metadata and provides access to computation results.
Examples
reason_codes = dr.ReasonCodes.get(project_id, reason_codes_id)
for row in reason_codes.get_rows():
    print(row)  # row is an instance of ReasonCodesRow
Attributes: - id : str
id of the record and reason codes computation result
- project_id : str
id of the project the model belongs to
- model_id : str
id of the model reason codes initialization is for
- dataset_id : str
id of the prediction dataset reason codes were computed for
- max_codes : int
maximum number of reason codes to supply per row of the dataset
- threshold_low : float
the lower threshold, below which a prediction must score in order for reason codes to be computed for a row in the dataset
- threshold_high : float
the high threshold, above which a prediction must score in order for reason codes to be computed for a row in the dataset
- num_columns : int
the number of columns reason codes were computed for
- finish_time : float
timestamp referencing when computation for these reason codes finished
- reason_codes_location : str
where to retrieve the reason codes
-
classmethod
get
(project_id, reason_codes_id)¶ Retrieve a specific reason codes.
Parameters: - project_id : str
id of the project the model belongs to
- reason_codes_id : str
id of the reason codes
Returns: - reason_codes : ReasonCodes
The queried instance.
-
classmethod
create
(project_id, model_id, dataset_id, max_codes=None, threshold_low=None, threshold_high=None)¶ Create a reason codes for the specified dataset.
In order to create ReasonCodesPage for a particular model and dataset, you must first:
- Compute feature impact for the model via
datarobot.Model.get_feature_impact()
- Compute a ReasonCodesInitialization for the model via
datarobot.ReasonCodesInitialization.create(project_id, model_id)
- Compute predictions for the model and dataset via
datarobot.Model.request_predictions(dataset_id)
threshold_high and threshold_low are optional filters applied to speed up computation. When at least one is specified, only the selected outlier rows will have reason codes computed. Rows are considered to be outliers if their predicted value (in case of regression projects) or probability of being the positive class (in case of classification projects) is less than threshold_low or greater than threshold_high. If neither is specified, reason codes will be computed for all rows.
Parameters: - project_id : str
id of the project the model belongs to
- model_id : str
id of the model for which reason codes are requested
- dataset_id : str
id of the prediction dataset for which reason codes are requested
- threshold_low : float, optional
the lower threshold, below which a prediction must score in order for reason codes to be computed for a row in the dataset. If neither
threshold_high
northreshold_low
is specified, reason codes will be computed for all rows.- threshold_high : float, optional
the high threshold, above which a prediction must score in order for reason codes to be computed. If neither
threshold_high
northreshold_low
is specified, reason codes will be computed for all rows.- max_codes : int, optional
the maximum number of reason codes to supply per row of the dataset, default: 3.
Returns: - job: Job
an instance of created async job
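A hedged sketch of the full prerequisite sequence described above (the project, model, and dataset objects are assumed to already exist):
import datarobot as dr

# 1. Feature impact for the model
model.request_feature_impact().get_result_when_complete()

# 2. Reason codes initialization for the model
dr.ReasonCodesInitialization.create(project.id, model.id).get_result_when_complete()

# 3. Predictions for the uploaded prediction dataset
model.request_predictions(dataset.id).get_result_when_complete()

# Now the reason codes themselves can be requested
rc_job = dr.ReasonCodes.create(project.id, model.id, dataset.id, max_codes=5)
reason_codes = rc_job.get_result_when_complete()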
-
classmethod
list
(project_id, model_id=None, limit=None, offset=None)¶ List of reason codes for a specified project.
Parameters: - project_id : str
id of the project to list reason codes for
- model_id : str, optional
if specified, only reason codes computed for this model will be returned
- limit : int or None
at most this many results are returned, default: no limit
- offset : int or None
this many results will be skipped, default: 0
Returns: - reason_codes : list[ReasonCodes]
-
get_rows
(batch_size=None, exclude_adjusted_predictions=True)¶ Retrieve reason codes rows.
Parameters: - batch_size : int
maximum number of reason codes rows to retrieve per request
- exclude_adjusted_predictions : bool
Optional, defaults to True. Set to False to include adjusted predictions, which will differ from the predictions on some projects, e.g. those with an exposure column specified.
Yields: - reason_codes_row : ReasonCodesRow
Represents reason codes computed for a prediction row.
-
get_all_as_dataframe
(exclude_adjusted_predictions=True)¶ Retrieve all reason codes rows and return them as a pandas.DataFrame.
Returned dataframe has the following structure:
- row_id : row id from prediction dataset
- prediction : the output of the model for this row
- adjusted_prediction : adjusted prediction values (only appears for projects that utilize prediction adjustments, e.g. projects with an exposure column)
- class_0_label : a class level from the target (only appears for classification projects)
- class_0_probability : the probability that the target is this class (only appears for classification projects)
- class_1_label : a class level from the target (only appears for classification projects)
- class_1_probability : the probability that the target is this class (only appears for classification projects)
- reason_0_feature : the name of the feature contributing to the prediction for this reason
- reason_0_feature_value : the value the feature took on
- reason_0_label : the output being driven by this reason. For regression projects, this is the name of the target feature. For classification projects, this is the class label whose probability increasing would correspond to a positive strength.
- reason_0_qualitative_strength : a human-readable description of how strongly the feature affected the prediction (e.g. ‘+++’, ‘–’, ‘+’) for this reason
- reason_0_strength : the amount this feature’s value affected the prediction
- …
- reason_N_feature : the name of the feature contributing to the prediction for this reason
- reason_N_feature_value : the value the feature took on
- reason_N_label : the output being driven by this reason. For regression projects, this is the name of the target feature. For classification projects, this is the class label whose probability increasing would correspond to a positive strength.
- reason_N_qualitative_strength : a human-readable description of how strongly the feature affected the prediction (e.g. ‘+++’, ‘–’, ‘+’) for this reason
- reason_N_strength : the amount this feature’s value affected the prediction
Parameters: - exclude_adjusted_predictions : bool
Optional, defaults to True. Set this to False to include adjusted prediction values in the returned dataframe.
Returns: - dataframe: pandas.DataFrame
-
download_to_csv
(filename, encoding='utf-8', exclude_adjusted_predictions=True)¶ Save reason codes rows into CSV file.
Parameters: - filename : str or file object
path or file object to save reason codes rows
- encoding : string, optional
A string representing the encoding to use in the output file, defaults to ‘utf-8’
- exclude_adjusted_predictions : bool
Optional, defaults to True. Set to False to include adjusted predictions, which will differ from the predictions on some projects, e.g. those with an exposure column specified.
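For example, a minimal sketch of exporting computed reason codes (the filename is a placeholder):
# As a pandas DataFrame for in-memory analysis
df = reason_codes.get_all_as_dataframe()
print(df.head())

# Or written straight to disk as a CSV file
reason_codes.download_to_csv('reason_codes.csv')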
-
get_reason_codes_page
(limit=None, offset=None, exclude_adjusted_predictions=True)¶ Get reason codes.
If you don’t want to use the generator interface, you can access paginated reason codes directly.
Parameters: - limit : int or None
the number of records to return, the server will use a (possibly finite) default if not specified
- offset : int or None
the number of records to skip, default 0
- exclude_adjusted_predictions : bool
Optional, defaults to True. Set to False to include adjusted predictions, which will differ from the predictions on some projects, e.g. those with an exposure column specified.
Returns: - reason_codes : ReasonCodesPage
-
delete
()¶ Delete this reason codes.
-
class
datarobot.models.reason_codes.
ReasonCodesRow
(row_id, prediction, prediction_values, reason_codes=None, adjusted_prediction=None, adjusted_prediction_values=None)¶ Represents reason codes computed for a prediction row.
Notes
PredictionValue contains:
- label : describes what this model output corresponds to. For regression projects, it is the name of the target feature. For classification projects, it is a level from the target feature.
- value : the output of the prediction. For regression projects, it is the predicted value of the target. For classification projects, it is the predicted probability the row belongs to the class identified by the label.
ReasonCode contains:
- label : describes what output was driven by this reason code. For regression projects, it is the name of the target feature. For classification projects, it is the class whose probability increasing would correspond to a positive strength of this reason code.
- feature : the name of the feature contributing to the prediction
- feature_value : the value the feature took on for this row
- strength : the amount this feature’s value affected the prediction
- qualitative_strength : a human-readable description of how strongly the feature affected the prediction (e.g. ‘+++’, ‘--’, ‘+’)
Attributes: - row_id : int
which row this
ReasonCodeRow
describes- prediction : float
the output of the model for this row
- adjusted_prediction : float or None
adjusted prediction value for projects that provide this information, None otherwise
- prediction_values : list
an array of dictionaries with a schema described as
PredictionValue
- adjusted_prediction_values : list
same as prediction_values but for adjusted predictions
- reason_codes : list
an array of dictionaries with a schema described as
ReasonCode
-
class
datarobot.models.reason_codes.
ReasonCodesPage
(id, count=None, previous=None, next=None, data=None, reason_codes_record_location=None, adjustment_method=None)¶ Represents batch of reason codes received by one request.
Attributes: - id : str
id of the reason codes computation result
- data : list[dict]
list of raw reason codes, each row corresponds to a row of the prediction dataset
- count : int
total number of rows computed
- previous_page : str
where to retrieve previous page of reason codes, None if current page is the first
- next_page : str
where to retrieve next page of reason codes, None if current page is the last
- reason_codes_record_location : str
where to retrieve the reason codes metadata
- adjustment_method : str
Adjustment method that was applied to predictions, or ‘N/A’ if no adjustments were done.
-
classmethod
get
(project_id, reason_codes_id, limit=None, offset=0, exclude_adjusted_predictions=True)¶ Retrieve reason codes.
Parameters: - project_id : str
id of the project the model belongs to
- reason_codes_id : str
id of the reason codes
- limit : int or None
the number of records to return, the server will use a (possibly finite) default if not specified
- offset : int or None
the number of records to skip, default 0
- exclude_adjusted_predictions : bool
Optional, defaults to True. Set to False to include adjusted predictions, which will differ from the predictions on some projects, e.g. those with an exposure column specified.
Returns: - reason_codes : ReasonCodesPage
The queried instance.
Recommended Models¶
-
class
datarobot.models.
ModelRecommendation
(project_id, model_id, recommendation_type)¶ A collection of information about a recommended model for a project.
Attributes: - project_id : str
the id of the project the model belongs to
- model_id : str
the id of the recommended model
- recommendation_type : str
the type of model recommendation
-
classmethod
get
(project_id, recommendation_type=None)¶ Retrieves the default recommendation, or the recommendation specified by recommendation_type.
Parameters: - project_id : str
The project’s id.
- recommendation_type : enums.RECOMMENDED_MODEL_TYPE
The type of recommendation to get. If None, returns the default recommendation.
Returns: - recommended_model : ModelRecommendation
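A short sketch of fetching the model recommended for deployment (the same pattern appears in the notebook example later in this document):
import datarobot as dr

recommendation = dr.ModelRecommendation.get(
    project.id,
    dr.enums.RECOMMENDED_MODEL_TYPE.RECOMMENDED_FOR_DEPLOYMENT,
)
recommended_model = recommendation.get_model()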
-
classmethod
get_all
(project_id)¶ Retrieves all of the current recommended models for the project.
Parameters: - project_id : str
The project’s id.
Returns: - recommended_models : list of ModelRecommendation
-
classmethod
get_recommendation
(recommended_models, recommendation_type)¶ Returns the model in the given list with the requested type.
Parameters: - recommended_models : list of ModelRecommendation
- recommendation_type : enums.RECOMMENDED_MODEL_TYPE
the type of model to extract from the recommended_models list
Returns: - recommended_model : ModelRecommendation or None if no model with the requested type exists
-
get_model
()¶ Returns the Model associated with this ModelRecommendation.
Returns: - recommended_model : Model
ROC Curve¶
-
class
datarobot.models.roc_curve.
RocCurve
(source, roc_points, negative_class_predictions, positive_class_predictions, source_model_id)¶ ROC curve data for model.
Attributes: - source : str
ROC curve data source. Can be ‘validation’, ‘crossValidation’ or ‘holdout’.
- roc_points : list of dict
List of precalculated metrics associated with thresholds for ROC curve.
- negative_class_predictions : list of float
List of predictions from example for negative class
- positive_class_predictions : list of float
List of predictions from example for positive class
- source_model_id : str
ID of the model this ROC curve represents; in some cases, insights from the parent of a frozen model may be used
-
estimate_threshold
(threshold)¶ Return metrics estimation for given threshold.
Parameters: - threshold : float from [0, 1] interval
Threshold we want estimation for
Returns: - dict
Dictionary of estimated metrics in form of {metric_name: metric_value}. Metrics are ‘accuracy’, ‘f1_score’, ‘false_negative_score’, ‘true_negative_score’, ‘true_negative_rate’, ‘matthews_correlation_coefficient’, ‘true_positive_score’, ‘positive_predictive_value’, ‘false_positive_score’, ‘false_positive_rate’, ‘negative_predictive_value’, ‘true_positive_rate’.
Raises: - ValueError
Given threshold isn’t from [0, 1] interval
-
get_best_f1_threshold
()¶ Return the value of the threshold that corresponds to the maximum F1 score. This is the threshold that will be preselected in DataRobot when you open the “ROC curve” tab.
Returns: - float
Threshold with the best F1 score.
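A hedged sketch of working with a ROC curve retrieved through Model.get_roc_curve (assuming model is a binary classification model with a validation partition):
roc = model.get_roc_curve('validation')
best_threshold = roc.get_best_f1_threshold()
metrics_at_best = roc.estimate_threshold(best_threshold)
print(best_threshold, metrics_at_best['f1_score'])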
SharingAccess¶
-
class
datarobot.
SharingAccess
(username, role, can_share=None, user_id=None)¶ Represents metadata about whom an entity (e.g. a data store) has been shared with
New in version v2.14.
Currently DataStores, DataSources, Projects (new in version v2.15) and CalendarFiles (new in version v2.15) can be shared.
This class can represent either access that has already been granted, or be used to grant access to additional users.
Attributes: - username : str
a particular user
- role : str or None
if a string, represents a particular level of access and should be one of
datarobot.enums.SHARING_ROLE
. For more information on the specific access levels, see the sharing documentation. If None, can be passed to a share function to revoke access for a specific user.- can_share : bool or None
if a bool, indicates whether this user is permitted to further share. When False, the user has access to the entity and can revoke their own access, but cannot modify any other user’s access role. When True, the user can share with any other user at an access role up to their own. May be None if the SharingAccess was not retrieved from the DataRobot server but is intended to be passed into a share function; this is equivalent to passing True.
- user_id : str
the id of the user
Training Predictions¶
-
class
datarobot.models.training_predictions.
TrainingPredictionsIterator
(client, path, limit=None)¶ Lazily fetches training predictions from DataRobot API in chunks of specified size and then iterates rows from responses as named tuples. Each row represents a training prediction computed for a dataset’s row. Each named tuple has the following structure:
Notes
Each
PredictionValue
dict contains these keys:- label
- describes what this model output corresponds to. For regression projects, it is the name of the target feature. For classification and multiclass projects, it is a label from the target feature.
- value
- the output of the prediction. For regression projects, it is the predicted value of the target. For classification and multiclass projects, it is the predicted probability that the row belongs to the class identified by the label.
Examples
import datarobot as dr

# Fetch existing training predictions by their id
training_predictions = dr.TrainingPredictions.get(project_id, prediction_id)

# Iterate over predictions
for row in training_predictions.iterate_rows():
    print(row.row_id, row.prediction)
Attributes: - row_id : int
id of the record in original dataset for which training prediction is calculated
- partition_id : str or float
id of the data partition that the row belongs to
- prediction : float
the model’s prediction for this data row
- prediction_values : list of dictionaries
an array of dictionaries with a schema described as
PredictionValue
- timestamp : str or None
(New in version v2.11) an ISO string representing the time of the prediction in time series project; may be None for non-time series projects
- forecast_point : str or None
(New in version v2.11) an ISO string representing the point in time used as a basis to generate the predictions in time series project; may be None for non-time series projects
- forecast_distance : str or None
(New in version v2.11) how many time steps are between the forecast point and the timestamp in time series project; None for non-time series projects
- series_id : str or None
(New in version v2.11) the id of the series in a multiseries project; may be NaN for single series projects; None for non-time series projects
-
class
datarobot.models.training_predictions.
TrainingPredictions
(project_id, prediction_id, model_id=None, data_subset=None)¶ Represents training predictions metadata and provides access to prediction results.
Examples
Compute training predictions for a model on the whole dataset
import datarobot as dr

# Request calculation of training predictions
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.ALL)
training_predictions = training_predictions_job.get_result_when_complete()
print('Training predictions {} are ready'.format(training_predictions.prediction_id))

# Iterate over actual predictions
for row in training_predictions.iterate_rows():
    print(row.row_id, row.partition_id, row.prediction)
List all training predictions for a project
import datarobot as dr

# Fetch all training predictions for a project
all_training_predictions = dr.TrainingPredictions.list(project_id)

# Inspect all calculated training predictions
for training_predictions in all_training_predictions:
    print(
        'Prediction {} is made for data subset "{}"'.format(
            training_predictions.prediction_id,
            training_predictions.data_subset,
        )
    )
Retrieve training predictions by id
import datarobot as dr

# Getting training predictions by id
training_predictions = dr.TrainingPredictions.get(project_id, prediction_id)

# Iterate over actual predictions
for row in training_predictions.iterate_rows():
    print(row.row_id, row.partition_id, row.prediction)
Attributes: - project_id : str
id of the project the model belongs to
- model_id : str
id of the model
- prediction_id : str
id of generated predictions
-
classmethod
list
(project_id)¶ Fetch all the computed training predictions for a project.
Parameters: - project_id : str
id of the project
Returns: - A list of TrainingPredictions objects
-
classmethod
get
(project_id, prediction_id)¶ Retrieve training predictions on a specified data set.
Parameters: - project_id : str
id of the project the model belongs to
- prediction_id : str
id of the prediction set
Returns: - TrainingPredictions object which is ready to operate with specified predictions
-
iterate_rows
(batch_size=None)¶ Retrieve training prediction rows as an iterator.
Parameters: - batch_size : int, optional
maximum number of training prediction rows to fetch per request
Returns: - iterator :
TrainingPredictionsIterator
an iterator which yields named tuples representing training prediction rows
-
get_all_as_dataframe
(class_prefix='class_')¶ Retrieve all training prediction rows and return them as a pandas.DataFrame.
Returned dataframe has the following structure:
- row_id : row id from the original dataset
- prediction : the model’s prediction for this row
- class_<label> : the probability that the target is this class (only appears for classification and multiclass projects)
- timestamp : the time of the prediction (only appears for out of time validation or time series projects)
- forecast_point : the point in time used as a basis to generate the predictions (only appears for time series projects)
- forecast_distance : how many time steps are between timestamp and forecast_point (only appears for time series projects)
- series_id : the id of the series in a multiseries project or None for a single series project (only appears for time series projects)
Parameters: - class_prefix : str, optional
The prefix to append to labels in the final dataframe. Default is
class_
(e.g., apple -> class_apple)
Returns: - dataframe: pandas.DataFrame
-
download_to_csv
(filename, encoding='utf-8')¶ Save training prediction rows into CSV file.
Parameters: - filename : str or file object
path or file object to save training prediction rows
- encoding : string, optional
A string representing the encoding to use in the output file, defaults to ‘utf-8’
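For completeness, a small sketch of exporting computed training predictions (the filename is a placeholder):
# As a pandas DataFrame
df = training_predictions.get_all_as_dataframe(class_prefix='class_')
print(df.head())

# Or saved directly to disk
training_predictions.download_to_csv('training_predictions.csv')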
Word Cloud¶
-
class
datarobot.models.word_cloud.
WordCloud
(ngrams)¶ Word cloud data for the model.
Notes
WordCloudNgram is a dict containing the following:
- ngram (str) : Word or ngram value.
- coefficient (float) : Value from the [-1.0, 1.0] range, describes the effect of this ngram on the target. A large negative value means a strong effect toward the negative class in classification and a smaller target value in regression models; a large positive value means a strong effect toward the positive class and a bigger target value, respectively.
- count (int) : Number of rows in the training sample where this ngram appears.
- frequency (float) : Value from the (0.0, 1.0] range, the relative frequency of the given ngram to the most frequent ngram.
- is_stopword (bool) : True for ngrams that DataRobot evaluates as stopwords.
Attributes: - ngrams : list of dicts
List of dicts with schema described as
WordCloudNgram
above.
-
most_frequent
(top_n=5)¶ Return most frequent ngrams in the word cloud.
Parameters: - top_n : int
Number of ngrams to return
Returns: - list of dict
Up to top_n most frequent ngrams in the word cloud. If top_n is bigger than the total number of ngrams in the word cloud, all ngrams are returned, sorted by frequency in descending order.
-
most_important
(top_n=5)¶ Return most important ngrams in the word cloud.
Parameters: - top_n : int
Number of ngrams to return
Returns: - list of dict
Up to top_n most important ngrams in the word cloud. If top_n is bigger than the total number of ngrams in the word cloud, all ngrams are returned, sorted by absolute coefficient value in descending order.
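A hedged sketch of fetching a word cloud through Model.get_word_cloud (assuming model was trained on data containing text features) and inspecting its ngrams:
word_cloud = model.get_word_cloud(exclude_stop_words=True)
for ngram in word_cloud.most_important(5):
    print(ngram['ngram'], ngram['coefficient'])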
Examples¶
Note
You can install all of the Python library requirements needed to run the example notebooks with: pip install datarobot[examples].
Downloads¶
Download all the notebooks and the supporting scripts and data files
Download an open source font that supports the Japanese text example (only required in the Advanced Model Insights notebook).
Example Jupyter Notebooks¶
Predicting Bad Loans¶
Overview¶
In this example we will build a binary classification model using the Lending Club dataset. Here is a list of things we will touch on during this notebook:
- Installing the
datarobot
package - Configuring the client
- Creating a project
- Changing the datatype of some of the source columns
- Selecting the source columns used in the modeling process
- Running the automated modeling process
- Generating predictions
Prerequisites¶
In order to run this notebook yourself, you will need the following:
- This notebook. If you are viewing this in the HTML documentation bundle, you can download all of the example notebooks and supporting materials from Downloads.
- The required dataset, which is included in the same directory as this notebook.
- A DataRobot API token. You can find your API token by logging into the DataRobot Web User Interface and looking in your
Profile
.
Installing the datarobot
package¶
The datarobot
package is hosted on PyPI. You can install it via:
pip install datarobot
from the command line. Its main dependencies are numpy
and pandas
, which could take some time to install on a new system. We highly recommend use of virtualenvs to avoid conflicts with other dependencies in your system-wide python installation.
Getting Started¶
This line imports the datarobot
package. By convention, we always import it with the alias dr
.
[1]:
import datarobot as dr
Other Important Imports¶
We’ll use these in this notebook as well. If the previous cell and the following cell both run without issue, you’re in good shape.
[2]:
import datetime
import pandas as pd
Configure the Python Client¶
Configuring the client requires the following two things:
- A DataRobot endpoint - where the API server can be found
- A DataRobot API token - a token the server uses to identify and validate the user making API requests
The endpoint is usually the URL you would use to log into the DataRobot Web User Interface (e.g., https://app.datarobot.com) with “/api/v2/” appended, e.g., (https://app.datarobot.com/api/v2/).
You can find your API token by logging into the DataRobot Web User Interface and looking in your Profile.
The Python client can be configured in several ways. The example we’ll use in this notebook is to point to a yaml
file that has the information. This is a text file containing two lines like this:
endpoint: https://app.datarobot.com/api/v2/
token: not-my-real-token
If you want to run this notebook without changes, please save your configuration in a file located under your home directory called ~/.config/datarobot/drconfig.yaml
.
[3]:
# Initialization with arguments
# dr.Client(token='<API TOKEN>', endpoint='https://<YOUR ENDPOINT>/api/v2/')
# Initialization with a config file in the same directory as this notebook
# dr.Client(config_path='drconfig.yaml')
# Initialization with a config file located at
# ~/.config/datarobot/drconfig.yaml
dr.Client()
[3]:
<datarobot.rest.RESTClientObject at 0x11043b210>
Create the Project¶
Here, we use the datarobot
package to upload a new file and create a project. The name of the project is optional, but can be helpful when trying to sort among many projects on DataRobot.
[4]:
filename = '10K_Lending_Club_Loans.csv'
now = datetime.datetime.now().strftime('%Y-%m-%dT%H:%M')
project_name = '10K_Lending_Club_Loans_{}'.format(now)
proj = dr.Project.create(sourcedata=filename,
project_name=project_name)
print('Project ID: {}'.format(proj.id))
Project ID: 5c007ffa784cc602016a9f06
Select Features for Modeling¶
First, retrieve the raw feature list. This corresponds to the columns in the input spreadsheet.
[5]:
raw = [feat_list for feat_list in proj.get_featurelists()
if feat_list.name == 'Raw Features'][0]
raw_features = [
{
"name": feat,
"type": dr.Feature.get(proj.id, feat).feature_type
}
for feat in raw.features
]
pd.DataFrame.from_dict(raw_features)
[5]:
name | type | |
---|---|---|
0 | loan_amnt | Numeric |
1 | funded_amnt | Numeric |
2 | term | Categorical |
3 | int_rate | Percentage |
4 | installment | Numeric |
5 | grade | Categorical |
6 | sub_grade | Categorical |
7 | emp_title | Text |
8 | emp_length | Categorical |
9 | home_ownership | Categorical |
10 | annual_inc | Numeric |
11 | verification_status | Categorical |
12 | pymnt_plan | Categorical |
13 | url | Text |
14 | desc | Text |
15 | purpose | Categorical |
16 | title | Text |
17 | zip_code | Categorical |
18 | addr_state | Categorical |
19 | dti | Numeric |
20 | delinq_2yrs | Numeric |
21 | earliest_cr_line | Date |
22 | inq_last_6mths | Numeric |
23 | mths_since_last_delinq | Numeric |
24 | mths_since_last_record | Numeric |
25 | open_acc | Numeric |
26 | pub_rec | Numeric |
27 | revol_bal | Numeric |
28 | revol_util | Numeric |
29 | total_acc | Numeric |
30 | initial_list_status | Categorical |
31 | mths_since_last_major_derog | None |
32 | policy_code | Categorical |
33 | is_bad | Numeric |
Modify Feature Types¶
We can tweak features to improve the modeling. For example, we might change delinq_2yrs
from an integer into a categorical.
[6]:
proj.create_type_transform_feature(
"delinq_2yrs(Cat)", # new feature name
"delinq_2yrs", # parent name
dr.enums.VARIABLE_TYPE_TRANSFORM.CATEGORICAL_INT
)
[6]:
Feature(delinq_2yrs(Cat))
Then, we can change type of addr_state
from categorical into text.
[7]:
proj.create_type_transform_feature(
"addr_state(Text)", # new feature name
"addr_state", # parent name
dr.enums.VARIABLE_TYPE_TRANSFORM.TEXT
)
[7]:
Feature(addr_state(Text))
Select Features for Modeling¶
Next, we create a new feature list where we remove the features delinq_2yrs
and addr_state
and add the modified features we just created.
[8]:
feature_list_name = "new_feature_list"
new_feature_list = proj.create_featurelist(
feature_list_name,
list((set(raw.features) - {"addr_state", "delinq_2yrs"}) |
{"addr_state(Text)", "delinq_2yrs(Cat)"})
)
Run the Automated Modeling Process¶
Now we can start the modeling process. The target for this problem is called is_bad
- a binary variable indicating whether or not the customer defaults on a particular loan.
We specify that the metric that should be used is LogLoss
. Without a specification DataRobot would automatically select an appropriate default metric.
The featurelist_id
parameter tells DataRobot to model on that specific featurelist, rather than the default Informative Features
.
Finally, the worker_count
parameter specifies how many workers should be used for this project. Passing a value of -1
tells DataRobot to set the worker count to the maximum available to you. You can also specify the exact number of workers to use, but this command will fail if you request more workers than your account allows. If you need more resources than what has been allocated to you, you should think about upgrading your license.
The last command in this cell is just a blocking loop that periodically checks on the project to see if it is done, printing out the number of jobs in progress and in the queue along the way so you can see progress. The automated model exploration process will occasionally add more jobs to the queue, so don’t be alarmed if the number of jobs does not strictly decrease over time.
[9]:
proj.set_target(
"is_bad",
mode=dr.enums.AUTOPILOT_MODE.FULL_AUTO,
metric="LogLoss",
featurelist_id=new_feature_list.id,
worker_count=-1
)
proj.wait_for_autopilot()
In progress: 17, queued: 21 (waited: 0s)
In progress: 20, queued: 18 (waited: 1s)
In progress: 20, queued: 18 (waited: 2s)
In progress: 20, queued: 18 (waited: 3s)
In progress: 19, queued: 18 (waited: 5s)
In progress: 20, queued: 17 (waited: 7s)
In progress: 20, queued: 16 (waited: 12s)
In progress: 20, queued: 12 (waited: 19s)
In progress: 19, queued: 8 (waited: 32s)
In progress: 20, queued: 2 (waited: 53s)
In progress: 16, queued: 0 (waited: 74s)
In progress: 16, queued: 0 (waited: 95s)
In progress: 16, queued: 0 (waited: 115s)
In progress: 16, queued: 0 (waited: 136s)
In progress: 15, queued: 0 (waited: 156s)
In progress: 13, queued: 0 (waited: 177s)
In progress: 8, queued: 0 (waited: 198s)
In progress: 1, queued: 0 (waited: 218s)
In progress: 19, queued: 0 (waited: 238s)
In progress: 13, queued: 0 (waited: 259s)
In progress: 6, queued: 0 (waited: 280s)
In progress: 2, queued: 0 (waited: 300s)
In progress: 13, queued: 0 (waited: 321s)
In progress: 9, queued: 0 (waited: 341s)
In progress: 6, queued: 0 (waited: 362s)
In progress: 2, queued: 0 (waited: 382s)
In progress: 2, queued: 0 (waited: 403s)
In progress: 1, queued: 0 (waited: 423s)
In progress: 1, queued: 0 (waited: 444s)
In progress: 1, queued: 0 (waited: 464s)
In progress: 20, queued: 12 (waited: 485s)
In progress: 20, queued: 12 (waited: 505s)
In progress: 20, queued: 6 (waited: 526s)
In progress: 19, queued: 3 (waited: 547s)
In progress: 19, queued: 0 (waited: 567s)
In progress: 18, queued: 0 (waited: 588s)
In progress: 16, queued: 0 (waited: 609s)
In progress: 13, queued: 0 (waited: 629s)
In progress: 11, queued: 0 (waited: 650s)
In progress: 7, queued: 0 (waited: 670s)
In progress: 3, queued: 0 (waited: 691s)
In progress: 3, queued: 0 (waited: 711s)
In progress: 3, queued: 0 (waited: 732s)
In progress: 1, queued: 0 (waited: 752s)
In progress: 0, queued: 0 (waited: 773s)
In progress: 1, queued: 0 (waited: 793s)
In progress: 0, queued: 0 (waited: 814s)
In progress: 4, queued: 0 (waited: 834s)
In progress: 2, queued: 0 (waited: 855s)
In progress: 4, queued: 0 (waited: 875s)
In progress: 4, queued: 0 (waited: 895s)
In progress: 2, queued: 0 (waited: 916s)
In progress: 2, queued: 0 (waited: 936s)
In progress: 0, queued: 0 (waited: 957s)
In progress: 0, queued: 0 (waited: 977s)
Exploring Trained Models¶
We can see how many models DataRobot built for this project by querying the project for its models. Each of them has been tuned individually. Models that appear to have the same name differ either in the amount of data used in training or in the preprocessing steps used (or both).
[10]:
models = proj.get_models()
for idx, model in enumerate(models):
print('[{}]: {} - {}'.
format(idx, model.metrics['LogLoss']['validation'],
model.model_type))
[0]: 0.36614 - ENET Blender
[1]: 0.36661 - Advanced AVG Blender
[2]: 0.36684 - ENET Blender
[3]: 0.36686 - AVG Blender
[4]: 0.36712 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[5]: 0.36787 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[6]: 0.36791 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[7]: 0.36839 - Light Gradient Boosted Trees Classifier with Early Stopping
[8]: 0.3684 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[9]: 0.36872 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[10]: 0.36873 - Generalized Additive2 Model
[11]: 0.36938 - Generalized Additive2 Model
[12]: 0.36952 - RandomForest Classifier (Gini)
[13]: 0.36971 - Light Gradient Boosted Trees Classifier with Early Stopping
[14]: 0.36978 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[15]: 0.37004 - RandomForest Classifier (Entropy)
[16]: 0.37073 - RandomForest Classifier (Gini)
[17]: 0.37121 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)
[18]: 0.37235 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)
[19]: 0.37274 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[20]: 0.37275 - Vowpal Wabbit Classifier
[21]: 0.37283 - RandomForest Classifier (Entropy)
[22]: 0.37302 - ExtraTrees Classifier (Gini)
[23]: 0.37335 - Vowpal Wabbit Classifier
[24]: 0.37345 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[25]: 0.37357 - Nystroem Kernel SVM Classifier
[26]: 0.37362 - Nystroem Kernel SVM Classifier
[27]: 0.37368 - ExtraTrees Classifier (Gini)
[28]: 0.37417 - Gradient Boosted Trees Classifier with Early Stopping
[29]: 0.37495 - Gradient Boosted Trees Classifier with Early Stopping
[30]: 0.37548 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[31]: 0.37574 - Regularized Logistic Regression (L2)
[32]: 0.37607 - RandomForest Classifier (Gini)
[33]: 0.37631 - Vowpal Wabbit Classifier
[34]: 0.37667 - Light Gradient Boosted Trees Classifier with Early Stopping
[35]: 0.37767 - Generalized Additive2 Model
[36]: 0.37773 - Regularized Logistic Regression (L2)
[37]: 0.37814 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[38]: 0.37816 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)
[39]: 0.37862 - RandomForest Classifier (Entropy)
[40]: 0.37921 - Elastic-Net Classifier (L2 / Binomial Deviance) with Binned numeric features
[41]: 0.37929 - Regularized Logistic Regression (L2)
[42]: 0.37953 - Auto-tuned K-Nearest Neighbors Classifier (Euclidean Distance)
[43]: 0.38011 - Regularized Logistic Regression (L2)
[44]: 0.38013 - Elastic-Net Classifier (L2 / Binomial Deviance)
[45]: 0.38024 - Eureqa Generalized Additive Model Classifier (3000 Generations)
[46]: 0.38026 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - desc
[47]: 0.38037 - Gradient Boosted Trees Classifier
[48]: 0.38127 - Gradient Boosted Trees Classifier
[49]: 0.3813 - Light Gradient Boosting on ElasticNet Predictions
[50]: 0.38136 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - desc
[51]: 0.38176 - Elastic-Net Classifier (L2 / Binomial Deviance) with Binned numeric features
[52]: 0.38236 - eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features
[53]: 0.38237 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[54]: 0.3833 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - title
[55]: 0.38354 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) with Unsupervised Learning Features
[56]: 0.38373 - Elastic-Net Classifier (L2 / Binomial Deviance)
[57]: 0.38387 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)
[58]: 0.38401 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - emp_title
[59]: 0.38428 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - emp_title
[60]: 0.38435 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - emp_title
[61]: 0.38481 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - title
[62]: 0.38497 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - title
[63]: 0.38505 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - addr_state(Text)
[64]: 0.38524 - RandomForest Classifier (Gini)
[65]: 0.38532 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - title
[66]: 0.38572 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - addr_state(Text)
[67]: 0.38606 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - desc
[68]: 0.38639 - Majority Class Classifier
[69]: 0.38642 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - url
[70]: 0.38662 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - url
[71]: 0.387 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - addr_state(Text)
[72]: 0.38711 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - desc
[73]: 0.38726 - Regularized Logistic Regression (L2)
[74]: 0.38738 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - url
[75]: 0.38802 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - addr_state(Text)
[76]: 0.39071 - Gradient Boosted Greedy Trees Classifier with Early Stopping
[77]: 0.40035 - Auto-tuned K-Nearest Neighbors Classifier (Euclidean Distance)
[78]: 0.40057 - Breiman and Cutler Random Forest Classifier
[79]: 0.41186 - RuleFit Classifier
[80]: 0.43793 - Naive Bayes combiner classifier
[81]: 0.44045 - Auto-tuned K-Nearest Neighbors Classifier (Euclidean Distance)
[82]: 0.44713 - Logistic Regression
[83]: 0.48423 - Decision Tree Classifier (Gini)
[84]: 0.60431 - TensorFlow Neural Network Classifier
Generating Predictions¶
Predictions: modeling workers vs. dedicated servers¶
There are two ways to generate predictions in DataRobot: using modeling workers and dedicated prediction servers. In this notebook we will use the former, which is slower, occupies one of your modeling worker slots, and has no strong latency guarantees because the jobs go through the project queue. This method can be useful for developing and evaluating models. However, in a production environment, a faster, dedicated prediction server configuration may be more appropriate.
Three step process¶
As just mentioned, these predictions go through the modeling queue, so there is a three-step process. The first step is to upload your dataset; the second is to generate prediction jobs. Finally, you need to retrieve your predictions when the job is done.
To simplify this example we will make predictions for the same data used to train the models. We could use any of the models DataRobot generated, but will select the model that DataRobot recommends for deployment. DataRobot weighs both model accuracy and runtime to develop this recommendation.
[11]:
dataset = proj.upload_dataset(filename)
model = dr.ModelRecommendation.get(
proj.id,
dr.enums.RECOMMENDED_MODEL_TYPE.RECOMMENDED_FOR_DEPLOYMENT
).get_model()
pred_job = model.request_predictions(dataset.id)
preds = pred_job.get_result_when_complete()
Results¶
This example is a binary, or two-class, classification problem, so DataRobot estimates the probability that each row is in the positive class (a bad loan) and the negative class (not a bad loan). positive_probability and class_1.0 represent the former, and class_0.0 the latter. Given a configurable prediction_threshold, DataRobot creates a prediction whose value is the predicted class for each row. The predictions can be matched to the uploaded prediction data set through the row_id predictions field.
[12]:
preds.head()
[12]:
positive_probability | prediction | prediction_threshold | row_id | class_0.0 | class_1.0 | |
---|---|---|---|---|---|---|
0 | 0.092677 | 0.0 | 0.5 | 0 | 0.907323 | 0.092677 |
1 | 0.261903 | 0.0 | 0.5 | 1 | 0.738097 | 0.261903 |
2 | 0.095587 | 0.0 | 0.5 | 2 | 0.904413 | 0.095587 |
3 | 0.121502 | 0.0 | 0.5 | 3 | 0.878498 | 0.121502 |
4 | 0.065982 | 0.0 | 0.5 | 4 | 0.934018 | 0.065982 |
Modeling Airline Delay¶
Overview¶
Statistics on whether a flight was delayed and for how long are available from government databases for all the major carriers. It would be useful to be able to predict before scheduling a flight whether or not it was likely to be delayed. In this example, DataRobot will try to model whether a flight will be delayed, based on information such as the scheduled departure time and whether it rained on the day of the flight.
Prerequisites¶
In order to run this notebook yourself, you will need the following:
- This notebook. If you are viewing this in the HTML documentation bundle, you can download all of the example notebooks and supporting materials from Downloads.
- The datasets required for this notebook. These are in the same directory as this notebook.
- A DataRobot API token. You can find your API token by logging into the DataRobot Web User Interface and looking in your Profile.
Set Up¶
This example assumes that the DataRobot Python client package has been installed and configured with the credentials of a DataRobot user with API access permissions.
Data Sources¶
Information on flights and flight delays is made available by the Bureau of Transportation Statistics at https://www.transtats.bts.gov/ONTIME/Departures.aspx. To narrow down the amount of data involved, the datasets assembled for this use case are limited to US Airways flights out of Boston Logan in 2013 and 2014, although the script for interacting with DataRobot is sufficiently general that any dataset with the correct format could be used. A flight was declared to be delayed if it ultimately took off at least fifteen minutes after its scheduled departure time.
In addition to flight information, each record in the prepared dataset notes the amount of rain and whether it rained on the day of the flight. This information came from the National Oceanic and Atmospheric Administration’s Quality Controlled Local Climatological Data, available at http://www.ncdc.noaa.gov/qclcd/QCLCD. The daily rainfall for each day in 2013 and 2014 was taken from the recorded daily summaries of water-equivalent precipitation at the Boston Logan station. For some days, the QCLCD reports trace amounts of rainfall; these were recorded as 0 inches of rain.
Dataset Structure¶
Each row in the assembled dataset contains the following columns:
- was_delayed
- boolean
- whether the flight was delayed
- daily_rainfall
- float
- the amount of rain, in inches, on the day of the flight
- did_rain
- bool
- whether it rained on the day of the flight
- Carrier Code
- str
- the carrier code of the airline - US for all entries in the assembled dataset
- Date
- str (MM/DD/YYYY format)
- the date of the flight
- Flight Number
- str
- the flight number for the flight
- Tail Number
- str
- the tail number of the aircraft
- Destination Airport
- str
- the three-letter airport code of the destination airport
- Scheduled Departure Time
- str
- the 24-hour scheduled departure time of the flight, in the origin airport’s timezone
[1]:
import pandas as pd
import datarobot as dr
[2]:
data_path = "logan-US-2013.csv"
logan_2013 = pd.read_csv(data_path)
logan_2013.head()
[2]:
was_delayed | daily_rainfall | did_rain | Carrier Code | Date (MM/DD/YYYY) | Flight Number | Tail Number | Destination Airport | Scheduled Departure Time | |
---|---|---|---|---|---|---|---|---|---|
0 | False | 0.0 | False | US | 02/01/2013 | 225 | N662AW | PHX | 16:20 |
1 | False | 0.0 | False | US | 02/01/2013 | 280 | N822AW | PHX | 06:00 |
2 | False | 0.0 | False | US | 02/01/2013 | 303 | N653AW | CLT | 09:35 |
3 | True | 0.0 | False | US | 02/01/2013 | 604 | N640AW | PHX | 09:55 |
4 | False | 0.0 | False | US | 02/01/2013 | 722 | N715UW | PHL | 18:30 |
We want to be able to make predictions for future data, so the “date” column should be transformed in a way that avoids values that won’t be populated for future data:
[3]:
def prepare_modeling_dataset(df):
date_column_name = 'Date (MM/DD/YYYY)'
date = pd.to_datetime(df[date_column_name])
modeling_df = df.drop(date_column_name, axis=1)
days = {0: 'Mon', 1: 'Tues', 2: 'Weds', 3: 'Thurs', 4: 'Fri', 5: 'Sat',
6: 'Sun'}
modeling_df['day_of_week'] = date.apply(lambda x: days[x.dayofweek])
modeling_df['month'] = date.dt.month
return modeling_df
[4]:
logan_2013_modeling = prepare_modeling_dataset(logan_2013)
logan_2013_modeling.head()
[4]:
was_delayed | daily_rainfall | did_rain | Carrier Code | Flight Number | Tail Number | Destination Airport | Scheduled Departure Time | day_of_week | month | |
---|---|---|---|---|---|---|---|---|---|---|
0 | False | 0.0 | False | US | 225 | N662AW | PHX | 16:20 | Fri | 2 |
1 | False | 0.0 | False | US | 280 | N822AW | PHX | 06:00 | Fri | 2 |
2 | False | 0.0 | False | US | 303 | N653AW | CLT | 09:35 | Fri | 2 |
3 | True | 0.0 | False | US | 604 | N640AW | PHX | 09:55 | Fri | 2 |
4 | False | 0.0 | False | US | 722 | N715UW | PHL | 18:30 | Fri | 2 |
DataRobot Modeling¶
As part of this use case, in model_flight_ontime.py, a DataRobot project will be created and used to run a variety of models against the assembled datasets. By default, DataRobot runs autopilot on the automatically generated Informative Features list, which excludes certain pathological features (like Carrier Code in this example, which always has the same value); we will also create a custom feature list that excludes the amount of rainfall on the day of the flight.
This notebook shows how to use the Python API client to create a project, create feature lists, train models with different sample percents and feature lists, and view the models that have been run. It will:
- create a project
- create a new feature list (no foreknowledge) excluding the rainfall features
- set the target to was_delayed, and run DataRobot autopilot on the Informative Features list
- rerun autopilot on a new feature list
- make predictions on a new data set
Configure the Python Client¶
Configuring the client requires the following two things:
- A DataRobot endpoint - where the API server can be found
- A DataRobot API token - a token the server uses to identify and validate the user making API requests
The endpoint is usually the URL you would use to log into the DataRobot Web User Interface (e.g., https://app.datarobot.com) with “/api/v2/” appended (e.g., https://app.datarobot.com/api/v2/).
You can find your API token by logging into the DataRobot Web User Interface and looking in your Profile.
The Python client can be configured in several ways. The example we’ll use in this notebook is to point to a yaml file that has the information. This is a text file containing two lines like this:
endpoint: https://app.datarobot.com/api/v2/
token: not-my-real-token
If you want to run this notebook without changes, please save your configuration in a file located under your home directory called ~/.config/datarobot/drconfig.yaml.
[5]:
# Initialization with arguments
# dr.Client(token='<API TOKEN>', endpoint='https://<YOUR ENDPOINT>/api/v2/')
# Initialization with a config file in the same directory as this notebook
# dr.Client(config_path='drconfig.yaml')
# Initialization with the default config file located at ~/.config/datarobot/drconfig.yaml
dr.Client()
[5]:
<datarobot.rest.RESTClientObject at 0x114014510>
Starting a Project¶
[6]:
project = dr.Project.start(logan_2013_modeling,
project_name='Airline Delays - was_delayed',
target="was_delayed")
print('Project ID: {}'.format(project.id))
Project ID: 5c0012ca6523cd0200c4a017
Jobs and the Project Queue¶
You can view the project in your browser:
[7]:
# If running notebook remotely
project.open_leaderboard_browser()
[7]:
True
[8]:
# Set worker count higher.
# Passing -1 sets it to the maximum available to your account.
project.set_worker_count(-1)
[8]:
Project(Airline Delays - was_delayed)
[9]:
project.pause_autopilot()
[9]:
True
[10]:
# More jobs will go in the queue in each stage of autopilot.
# This gets the currently in-progress and queued jobs.
project.get_model_jobs()
[10]:
[ModelJob(Logistic Regression, status=inprogress),
ModelJob(Regularized Logistic Regression (L2), status=inprogress),
ModelJob(Auto-tuned K-Nearest Neighbors Classifier (Euclidean Distance), status=inprogress),
ModelJob(Majority Class Classifier, status=inprogress),
ModelJob(Gradient Boosted Trees Classifier, status=inprogress),
ModelJob(Breiman and Cutler Random Forest Classifier, status=inprogress),
ModelJob(RuleFit Classifier, status=inprogress),
ModelJob(Regularized Logistic Regression (L2), status=inprogress),
ModelJob(TensorFlow Neural Network Classifier, status=inprogress),
ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping, status=inprogress),
ModelJob(Elastic-Net Classifier (L2 / Binomial Deviance), status=inprogress),
ModelJob(Nystroem Kernel SVM Classifier, status=inprogress),
ModelJob(RandomForest Classifier (Gini), status=inprogress),
ModelJob(Vowpal Wabbit Classifier, status=inprogress),
ModelJob(Generalized Additive2 Model, status=inprogress),
ModelJob(Light Gradient Boosted Trees Classifier with Early Stopping, status=queue),
ModelJob(Light Gradient Boosting on ElasticNet Predictions , status=queue),
ModelJob(Regularized Logistic Regression (L2), status=queue),
ModelJob(Elastic-Net Classifier (L2 / Binomial Deviance) with Binned numeric features, status=queue),
ModelJob(Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance), status=queue),
ModelJob(RandomForest Classifier (Entropy), status=queue),
ModelJob(ExtraTrees Classifier (Gini), status=queue),
ModelJob(Gradient Boosted Trees Classifier with Early Stopping, status=queue),
ModelJob(Gradient Boosted Greedy Trees Classifier with Early Stopping, status=queue),
ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping, status=queue),
ModelJob(eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features, status=queue),
ModelJob(Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) with Unsupervised Learning Features, status=queue),
ModelJob(Auto-tuned K-Nearest Neighbors Classifier (Euclidean Distance), status=queue),
ModelJob(Eureqa Generalized Additive Model Classifier (3645 Generations), status=inprogress),
ModelJob(Naive Bayes combiner classifier, status=inprogress),
ModelJob(RandomForest Classifier (Gini), status=inprogress),
ModelJob(Gradient Boosted Trees Classifier, status=inprogress),
ModelJob(Decision Tree Classifier (Gini), status=inprogress)]
[11]:
project.unpause_autopilot()
[11]:
True
Features¶
[12]:
features = project.get_features()
features
[12]:
[Feature(did_rain),
Feature(Destination Airport),
Feature(Carrier Code),
Feature(Flight Number),
Feature(Tail Number),
Feature(day_of_week),
Feature(month),
Feature(Scheduled Departure Time),
Feature(daily_rainfall),
Feature(was_delayed),
Feature(Scheduled Departure Time (Hour of Day))]
[13]:
pd.DataFrame([f.__dict__ for f in features])
[13]:
date_format | feature_type | id | importance | low_information | max | mean | median | min | na_count | name | project_id | std_dev | target_leakage | time_series_eligibility_reason | time_series_eligible | time_step | time_unit | unique_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | Boolean | 2 | 0.029045 | False | 1 | 0.36 | 0 | 0 | 0 | did_rain | 5c0012ca6523cd0200c4a017 | 0.48 | FALSE | notADate | False | None | None | 2 |
1 | None | Categorical | 6 | 0.003714 | True | None | None | None | None | 0 | Destination Airport | 5c0012ca6523cd0200c4a017 | None | FALSE | notADate | False | None | None | 5 |
2 | None | Categorical | 3 | NaN | True | None | None | None | None | 0 | Carrier Code | 5c0012ca6523cd0200c4a017 | None | SKIPPED_DETECTION | notADate | False | None | None | 1 |
3 | None | Numeric | 4 | 0.005900 | False | 2165 | 1705.63 | 2021 | 67 | 0 | Flight Number | 5c0012ca6523cd0200c4a017 | 566.67 | FALSE | notADate | False | None | None | 329 |
4 | None | Categorical | 5 | -0.004512 | True | None | None | None | None | 0 | Tail Number | 5c0012ca6523cd0200c4a017 | None | FALSE | notADate | False | None | None | 296 |
5 | None | Categorical | 8 | 0.003452 | True | None | None | None | None | 0 | day_of_week | 5c0012ca6523cd0200c4a017 | None | FALSE | notADate | False | None | None | 7 |
6 | None | Numeric | 9 | 0.003043 | True | 12 | 6.47 | 6 | 1 | 0 | month | 5c0012ca6523cd0200c4a017 | 3.38 | FALSE | notADate | False | None | None | 12 |
7 | %H:%M | Time | 7 | 0.058245 | False | 21:30 | 12:26 | 12:00 | 05:00 | 0 | Scheduled Departure Time | 5c0012ca6523cd0200c4a017 | 0.19 days | FALSE | notADate | False | None | None | 77 |
8 | None | Numeric | 1 | 0.034295 | False | 3.07 | 0.12 | 0 | 0 | 0 | daily_rainfall | 5c0012ca6523cd0200c4a017 | 0.33 | FALSE | notADate | False | None | None | 58 |
9 | None | Boolean | 0 | 1.000000 | False | 1 | 0.098 | 0 | 0 | 0 | was_delayed | 5c0012ca6523cd0200c4a017 | 0.3 | SKIPPED_DETECTION | notADate | False | None | None | 2 |
10 | None | Categorical | 10 | 0.053047 | False | None | None | None | None | 0 | Scheduled Departure Time (Hour of Day) | 5c0012ca6523cd0200c4a017 | None | FALSE | notADate | False | None | None | 17 |
Three feature lists are automatically created:
- Raw Features: one for all features
- Informative Features: one based on features with any information (columns with no variation are excluded)
- Univariate Selections: one based on univariate importance (this is only created after the target is set)
Informative Features is the one used by default in autopilot.
[14]:
feature_lists = project.get_featurelists()
feature_lists
[14]:
[Featurelist(Raw Features),
Featurelist(Informative Features),
Featurelist(Univariate Selections)]
[15]:
# create a featurelist without the rain features
# (since they leak future information)
informative_feats = [lst for lst in feature_lists if
lst.name == 'Informative Features'][0]
no_foreknowledge_features = list(
set(informative_feats.features) - {'daily_rainfall', 'did_rain'})
[16]:
no_foreknowledge = project.create_featurelist('no foreknowledge',
no_foreknowledge_features)
no_foreknowledge
[16]:
Featurelist(no foreknowledge)
[17]:
project.get_status()
[17]:
{u'autopilot_done': False,
u'stage': u'modeling',
u'stage_description': u'Ready for modeling'}
[18]:
# This waits until autopilot is complete:
project.wait_for_autopilot(check_interval=90)
In progress: 20, queued: 13 (waited: 0s)
In progress: 20, queued: 13 (waited: 1s)
In progress: 19, queued: 13 (waited: 1s)
In progress: 20, queued: 12 (waited: 2s)
In progress: 20, queued: 12 (waited: 4s)
In progress: 20, queued: 12 (waited: 6s)
In progress: 20, queued: 12 (waited: 10s)
In progress: 19, queued: 2 (waited: 17s)
In progress: 10, queued: 0 (waited: 30s)
In progress: 2, queued: 0 (waited: 56s)
In progress: 4, queued: 0 (waited: 108s)
In progress: 1, queued: 0 (waited: 198s)
In progress: 13, queued: 0 (waited: 289s)
In progress: 0, queued: 0 (waited: 379s)
In progress: 5, queued: 0 (waited: 470s)
In progress: 4, queued: 0 (waited: 560s)
In progress: 0, queued: 0 (waited: 651s)
[19]:
project.start_autopilot(no_foreknowledge.id)
[20]:
project.wait_for_autopilot(check_interval=90)
In progress: 0, queued: 0 (waited: 0s)
In progress: 0, queued: 0 (waited: 0s)
In progress: 0, queued: 0 (waited: 1s)
In progress: 0, queued: 0 (waited: 1s)
In progress: 0, queued: 0 (waited: 3s)
In progress: 0, queued: 0 (waited: 4s)
In progress: 0, queued: 1 (waited: 8s)
In progress: 20, queued: 13 (waited: 15s)
In progress: 20, queued: 1 (waited: 28s)
In progress: 3, queued: 0 (waited: 54s)
In progress: 16, queued: 0 (waited: 106s)
In progress: 20, queued: 12 (waited: 196s)
In progress: 0, queued: 0 (waited: 287s)
Models¶
[21]:
models = project.get_models()
example_model = models[0]
example_model
[21]:
Model(u'eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features')
Model objects represent fitted models and carry various data about them, including metrics:
[22]:
example_model.metrics
[22]:
{u'AUC': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.755494,
u'holdout': 0.76509,
u'validation': 0.75702},
u'FVE Binomial': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.14855,
u'holdout': 0.14992,
u'validation': 0.15364},
u'Gini Norm': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.510988,
u'holdout': 0.53018,
u'validation': 0.51404},
u'Kolmogorov-Smirnov': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.398738,
u'holdout': 0.42279,
u'validation': 0.40472},
u'LogLoss': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.272296,
u'holdout': 0.27178,
u'validation': 0.27079},
u'RMSE': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.27529400000000004,
u'holdout': 0.27627,
u'validation': 0.27448},
u'Rate@Top10%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.379522,
u'holdout': 0.35792,
u'validation': 0.38908},
u'Rate@Top5%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.489794,
u'holdout': 0.45902,
u'validation': 0.5034},
u'Rate@TopTenth%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.8000019999999999,
u'holdout': 0.75,
u'validation': 0.66667}}
[23]:
def sorted_by_log_loss(models, test_set):
models_with_score = [model for model in models if
model.metrics['LogLoss'][test_set] is not None]
return sorted(models_with_score,
key=lambda model: model.metrics['LogLoss'][test_set])
Let’s choose, from each feature list, the model with the best LogLoss score, so we can compare the models trained with the rain features against those trained without them:
[24]:
models = project.get_models()
fair_models = [mod for mod in models if
mod.featurelist_id == no_foreknowledge.id]
rain_cheat_models = [mod for mod in models if
mod.featurelist_id == informative_feats.id]
[25]:
models[0].metrics['LogLoss']
[25]:
{u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.272296,
u'holdout': 0.27178,
u'validation': 0.27079}
[26]:
best_fair_model = sorted_by_log_loss(fair_models, 'crossValidation')[0]
best_cheat_model = sorted_by_log_loss(rain_cheat_models, 'crossValidation')[0]
best_fair_model.metrics, best_cheat_model.metrics
[26]:
({u'AUC': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.7132720000000001,
u'holdout': None,
u'validation': 0.71811},
u'FVE Binomial': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.089814,
u'holdout': None,
u'validation': 0.09341},
u'Gini Norm': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.426544,
u'holdout': None,
u'validation': 0.43622},
u'Kolmogorov-Smirnov': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.322424,
u'holdout': None,
u'validation': 0.31053},
u'LogLoss': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.291076,
u'holdout': None,
u'validation': 0.29006},
u'RMSE': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.285848,
u'holdout': None,
u'validation': 0.28579},
u'Rate@Top10%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.294882,
u'holdout': None,
u'validation': 0.29352},
u'Rate@Top5%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.36734799999999995,
u'holdout': None,
u'validation': 0.39456},
u'Rate@TopTenth%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.600002,
u'holdout': None,
u'validation': 0.66667}},
{u'AUC': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.7604420000000001,
u'holdout': None,
u'validation': 0.75549},
u'FVE Binomial': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.15306999999999998,
u'holdout': None,
u'validation': 0.15124},
u'Gini Norm': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.520884,
u'holdout': None,
u'validation': 0.51098},
u'Kolmogorov-Smirnov': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.406068,
u'holdout': None,
u'validation': 0.39472},
u'LogLoss': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.270848,
u'holdout': None,
u'validation': 0.27156},
u'RMSE': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.274772,
u'holdout': None,
u'validation': 0.27497},
u'Rate@Top10%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.38498399999999994,
u'holdout': None,
u'validation': 0.38908},
u'Rate@Top5%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.504762,
u'holdout': None,
u'validation': 0.5034},
u'Rate@TopTenth%': {u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.933334,
u'holdout': None,
u'validation': 1.0}})
Visualizing Models¶
This is a good time to use Feature Fit and Feature Effects (not yet available via the API) to visualize the models:
[27]:
best_fair_model.open_model_browser()
[27]:
True
[28]:
best_cheat_model.open_model_browser()
[28]:
True
Unlocking the Holdout¶
To maintain holdout scores as a valid estimate of out-of-sample error, we recommend not looking at them until late in the project. For this reason, holdout scores are locked until you unlock them.
[29]:
project.unlock_holdout()
[29]:
Project(Airline Delays - was_delayed)
[30]:
best_fair_model = dr.Model.get(project.id, best_fair_model.id)
best_cheat_model = dr.Model.get(project.id, best_cheat_model.id)
[31]:
best_fair_model.metrics['LogLoss'], best_cheat_model.metrics['LogLoss']
[31]:
({u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.291076,
u'holdout': 0.29408,
u'validation': 0.29006},
{u'backtesting': None,
u'backtestingScores': None,
u'crossValidation': 0.270848,
u'holdout': 0.27193,
u'validation': 0.27156})
Retrain on 100%¶
When ready to use the final model, you will probably get the best performance by retraining on 100% of the data.
[32]:
model_job_fair_100pct_id = best_fair_model.train(sample_pct=100)
model_job_fair_100pct_id
[32]:
'211'
Wait for the model to complete:
[33]:
model_fair_100pct = dr.models.modeljob.wait_for_async_model_creation(
project.id, model_job_fair_100pct_id)
model_fair_100pct.id
[33]:
u'5c0016b76523cd026cc49f99'
Predictions¶
Let’s make predictions for some new data. This new data will need to have the same transformations applied as we applied to the training data.
[34]:
logan_2014 = pd.read_csv("logan-US-2014.csv")
logan_2014_modeling = prepare_modeling_dataset(logan_2014)
logan_2014_modeling.head()
[34]:
was_delayed | daily_rainfall | did_rain | Carrier Code | Flight Number | Tail Number | Destination Airport | Scheduled Departure Time | day_of_week | month | |
---|---|---|---|---|---|---|---|---|---|---|
0 | False | 0.0 | False | US | 450 | N809AW | PHX | 10:00 | Sat | 2 |
1 | False | 0.0 | False | US | 553 | N814AW | PHL | 07:00 | Sat | 2 |
2 | False | 0.0 | False | US | 582 | N820AW | PHX | 06:10 | Sat | 2 |
3 | False | 0.0 | False | US | 601 | N678AW | PHX | 16:20 | Sat | 2 |
4 | False | 0.0 | False | US | 657 | N662AW | CLT | 09:45 | Sat | 2 |
[35]:
prediction_dataset = project.upload_dataset(logan_2014_modeling)
predict_job = model_fair_100pct.request_predictions(prediction_dataset.id)
prediction_dataset.id
[35]:
u'5c0016cf6523cd0018c4a0d3'
[36]:
predictions = predict_job.get_result_when_complete()
[37]:
pd.concat([logan_2014_modeling, predictions], axis=1).head()
[37]:
was_delayed | daily_rainfall | did_rain | Carrier Code | Flight Number | Tail Number | Destination Airport | Scheduled Departure Time | day_of_week | month | positive_probability | prediction | prediction_threshold | row_id | class_0.0 | class_1.0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | 0.0 | False | US | 450 | N809AW | PHX | 10:00 | Sat | 2 | 0.055054 | 0.0 | 0.5 | 0 | 0.944946 | 0.055054 |
1 | False | 0.0 | False | US | 553 | N814AW | PHL | 07:00 | Sat | 2 | 0.045004 | 0.0 | 0.5 | 1 | 0.954996 | 0.045004 |
2 | False | 0.0 | False | US | 582 | N820AW | PHX | 06:10 | Sat | 2 | 0.030196 | 0.0 | 0.5 | 2 | 0.969804 | 0.030196 |
3 | False | 0.0 | False | US | 601 | N678AW | PHX | 16:20 | Sat | 2 | 0.201461 | 0.0 | 0.5 | 3 | 0.798539 | 0.201461 |
4 | False | 0.0 | False | US | 657 | N662AW | CLT | 09:45 | Sat | 2 | 0.072447 | 0.0 | 0.5 | 4 | 0.927553 | 0.072447 |
Let’s have a look at our results. Since this is a binary classification problem, as the positive_probability approaches zero, a row is a stronger candidate for the negative class (the flight will leave on time), while as it approaches one, the outcome is more likely to be the positive class (the flight will be delayed). From the KDE (Kernel Density Estimate) plot below, we can see that this sample of the data is weighted more strongly toward the negative class.
[38]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
[39]:
matplotlib.rcParams['figure.figsize'] = (15, 10) # make charts bigger
[40]:
sns.set(color_codes=True)
sns.kdeplot(predictions.positive_probability, shade=True, cut=0,
label='Positive Probability')
plt.xlim((0, 1))
plt.ylim((0, None))
plt.xlabel('Probability of Event')
plt.ylabel('Probability Density')
plt.title('Prediction Distribution')
[40]:
Text(0.5,1,'Prediction Distribution')

Exploring Prediction Explanations¶
Computing prediction explanations is a resource-intensive task, but you can help reduce the runtime for computing them by setting prediction value thresholds. You can learn more about prediction explanations by searching the online documentation available in the DataRobot web interface.
A common question when evaluating data is “why is a certain data point considered high-risk (or low-risk) for a certain event?”
A sample case for prediction explanations:
Clark is a business analyst at a large manufacturing firm. She does not have a lot of data science expertise, but has been using DataRobot with great success to predict likely product failures at her manufacturing plant. Her manager is now asking for recommendations for reducing the defect rate, based on these predictions. Clark would like DataRobot to produce prediction explanations for the expected product failures so that she can identify the key drivers of product failures based on a higher-level aggregation of explanations. Her business team can then use this report to address the causes of failure.
Other common use cases and possible explanations include:
- What are indicators that a transaction could be at high risk for fraud? Possible explanations include transactions out of a cardholder’s home area, transactions out of their “normal usage” time range, and transactions that are too large or small.
- What are some explanations for setting a higher auto insurance price? The applicant is single, male, age under 30 years, and has received traffic tickets. A married homeowner may receive a lower rate.
We are almost ready to compute prediction explanations. Two prerequisites must be satisfied first; however, these commands only need to be run once per model.
The first prerequisite is to compute the feature impact for your model:
[41]:
%%time
feature_impacts = model_fair_100pct.get_or_request_feature_impact()
CPU times: user 25.4 ms, sys: 5.09 ms, total: 30.5 ms
Wall time: 11.3 s
After Feature Impact has been computed, you also must create a Prediction Explanations Initialization for your model:
[42]:
%%time
try:
# Test to see if they are already computed
dr.PredictionExplanationsInitialization.get(project.id,
model_fair_100pct.id)
except dr.errors.ClientError as e:
assert e.status_code == 404 # haven't been computed
init_job = dr.PredictionExplanationsInitialization.create(
project.id,
model_fair_100pct.id
)
init_job.wait_for_completion()
CPU times: user 24.9 ms, sys: 5.16 ms, total: 30 ms
Wall time: 11 s
Now that we have computed the feature impact and initialized the prediction explanations, and also uploaded a dataset and computed predictions on it, we are ready to make a request to compute the prediction explanations for every row of the dataset. Computing prediction explanations supports a couple of parameters:
- max_explanations is the maximum number of prediction explanations to compute for each row.
- threshold_low and threshold_high are thresholds for the value of the prediction of the row. Prediction explanations will be computed for a row if the row’s prediction value is higher than threshold_high or lower than threshold_low. If no thresholds are specified, prediction explanations will be computed for all rows.
Note: for binary classification projects (like this one), the thresholds correspond to the positive_probability prediction value, whereas for regression problems they correspond to the actual predicted value.
Since we’ve already examined our prediction distribution above, let’s use it to guide our thresholds. It looks like most flights depart on time, so let’s examine the explanations only for flights that have an above-normal probability of being delayed. We will use a threshold_high of 0.456, which means prediction explanations will be computed for every row whose predicted positive_probability is at least 0.456. To keep this tutorial simple, we will also limit DataRobot to computing only 5 explanations per row.
[43]:
%%time
number_of_explanations = 5
pe_job = dr.PredictionExplanations.create(
project.id,
model_fair_100pct.id,
prediction_dataset.id,
max_explanations=number_of_explanations,
threshold_low=None,
threshold_high=0.456
)
pe = pe_job.get_result_when_complete()
all_rows = pe.get_all_as_dataframe()
CPU times: user 4.1 s, sys: 131 ms, total: 4.23 s
Wall time: 22.4 s
Let’s clean up the DataFrame we got back by trimming it down to just the interesting columns. Also, since most rows will have prediction values outside our thresholds, let’s drop all the uninteresting rows (i.e. ones with null values).
[44]:
import pandas as pd
pd.options.display.max_rows = 10 # default display is too verbose
# These rows are all redundant or of little value for this example
redundant_cols = ['row_id', 'class_0_label', 'class_1_probability',
'class_1_label']
explanations = all_rows.drop(redundant_cols, axis=1)
explanations.drop(['explanation_{}_label'.format(i)
for i in range(number_of_explanations)],
axis=1, inplace=True)
# These are rows that didn't meet our thresholds
explanations.dropna(inplace=True)
# Rename columns to be more consistent with the terms we have been using
explanations.rename(index=str,
columns={'class_0_probability': 'positive_probability'},
inplace=True)
explanations
[44]:
prediction | positive_probability | explanation_0_feature | explanation_0_feature_value | explanation_0_qualitative_strength | explanation_0_strength | explanation_1_feature | explanation_1_feature_value | explanation_1_qualitative_strength | explanation_1_strength | ... | explanation_2_qualitative_strength | explanation_2_strength | explanation_3_feature | explanation_3_feature_value | explanation_3_qualitative_strength | explanation_3_strength | explanation_4_feature | explanation_4_feature_value | explanation_4_qualitative_strength | explanation_4_strength | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39 | 0.0 | 0.471055 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.072288 | day_of_week | Sun | ++ | 0.455652 | ... | ++ | 0.362867 | Destination Airport | CLT | ++ | 0.345914 | Tail Number | N537UW | ++ | 0.242375 |
392 | 0.0 | 0.478501 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.072288 | day_of_week | Sun | ++ | 0.455652 | ... | ++ | 0.362867 | Destination Airport | CLT | ++ | 0.345914 | Tail Number | N536UW | ++ | 0.272234 |
13043 | 0.0 | 0.465055 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.202299 | Tail Number | N194UW | ++ | 0.416944 | ... | ++ | 0.391831 | day_of_week | Sun | ++ | 0.286239 | month | 12 | ++ | 0.273073 |
13259 | 0.0 | 0.463182 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.141272 | Destination Airport | CLT | ++ | 0.391831 | ... | ++ | 0.373726 | Tail Number | N563UW | ++ | 0.321922 | month | 12 | ++ | 0.256552 |
13843 | 0.0 | 0.498733 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.270218 | Flight Number | 586 | ++ | 0.440506 | ... | ++ | 0.355779 | Tail Number | N647AW | ++ | 0.241246 | day_of_week | Thurs | ++ | 0.224909 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18015 | 0.0 | 0.497778 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.565999 | month | 7 | ++ | 0.809545 | ... | ++ | 0.347827 | Tail Number | N534UW | ++ | 0.247029 | day_of_week | Thurs | + | 0.224909 |
18165 | 0.0 | 0.466710 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.368628 | month | 7 | ++ | 0.368182 | ... | ++ | 0.347827 | Tail Number | N173US | ++ | 0.314294 | Flight Number | 800 | + | 0.093169 |
18382 | 0.0 | 0.481914 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.281047 | Flight Number | 586 | ++ | 0.440506 | ... | ++ | 0.396207 | day_of_week | Thurs | ++ | 0.224909 | Tail Number | N660AW | + | 0.164530 |
18392 | 1.0 | 0.506051 | Scheduled Departure Time | -2.208920e+09 | +++ | 1.334738 | month | 7 | ++ | 0.424888 | ... | ++ | 0.347827 | Tail Number | N170US | ++ | 0.280126 | day_of_week | Thurs | ++ | 0.224909 |
18406 | 1.0 | 0.511845 | Scheduled Departure Time | -2.208927e+09 | +++ | 1.357411 | month | 7 | ++ | 0.855629 | ... | ++ | 0.676216 | Scheduled Departure Time (Hour of Day) | 17 | ++ | 0.455910 | Destination Airport | CLT | ++ | 0.344885 |
24 rows × 22 columns
Now let’s see how often various features show up among the top explanations affecting the probability of a flight being delayed.
[45]:
from functools import reduce
# Create a combined histogram of all our explanations
explanations_hist = reduce(
lambda x, y: x.add(y, fill_value=0),
(explanations['explanation_{}_feature'.format(i)].value_counts()
for i in range(number_of_explanations)))
[46]:
explanations_hist.plot.bar()
plt.xticks(rotation=45, ha='right')
[46]:
(array([0, 1, 2, 3, 4, 5, 6]), <a list of 7 Text xticklabel objects>)

Knowing the feature impact for this model from the Diving Deeper notebook, the high occurrence of daily_rainfall and Scheduled Departure Time as prediction explanations is not entirely surprising, because these were some of the top-ranked features in the impact chart. Therefore, let’s take a small detour and investigate some of the rows that had less expected explanations.
Below is some helper code. It can largely be ignored, as it is specific to this exercise and not needed for a general understanding of the DataRobot APIs.
[47]:
from operator import or_
from functools import reduce
from itertools import chain
def find_rows_with_explanation(df, feature_name, nexpls):
"""
Given a prediction explanations DataFrame, return a slice
of that data where the top N explanations match the given feature
"""
all_expl_columns = (df['explanation_{}_feature'.format(i)] == feature_name
for i in range(nexpls))
df_filter = reduce(or_, all_expl_columns)
return favorite_expl_columns(df[df_filter], nexpls)
def favorite_expl_columns(df, nexpls):
"""
Only display the most useful rows of a prediction explanations DataFrame.
"""
# Use chain to flatten our list of tuples
columns = list(chain.from_iterable((
'explanation_{}_feature'.format(i),
'explanation_{}_feature_value'.format(i),
'explanation_{}_strength'.format(i))
for i in range(nexpls)))
return df[columns]
def find_feature_in_row(feature, row, nexpls):
"""
Return the value of a given feature
"""
for i in range(nexpls):
if row['explanation_{}_feature'.format(i)] == feature:
return row['explanation_{}_feature_value'.format(i)]
def collect_feature_values(df, feature, nexpls):
"""
Return a list of all values of a given prediction explanation
from a DataFrame
"""
return [find_feature_in_row(feature, row, nexpls)
for index, row in df.iterrows()]
It looks like there was a small number of rows where the Destination Airport was one of the top N explanations for a given prediction.
[48]:
feature_name = 'Destination Airport'
flight_nums = find_rows_with_explanation(explanations,
feature_name,
number_of_explanations)
flight_nums
[48]:
explanation_0_feature | explanation_0_feature_value | explanation_0_strength | explanation_1_feature | explanation_1_feature_value | explanation_1_strength | explanation_2_feature | explanation_2_feature_value | explanation_2_strength | explanation_3_feature | explanation_3_feature_value | explanation_3_strength | explanation_4_feature | explanation_4_feature_value | explanation_4_strength | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39 | Scheduled Departure Time | -2.208920e+09 | 1.072288 | day_of_week | Sun | 0.455652 | month | 2 | 0.362867 | Destination Airport | CLT | 0.345914 | Tail Number | N537UW | 0.242375 |
392 | Scheduled Departure Time | -2.208920e+09 | 1.072288 | day_of_week | Sun | 0.455652 | month | 2 | 0.362867 | Destination Airport | CLT | 0.345914 | Tail Number | N536UW | 0.272234 |
13043 | Scheduled Departure Time | -2.208920e+09 | 1.202299 | Tail Number | N194UW | 0.416944 | Destination Airport | CLT | 0.391831 | day_of_week | Sun | 0.286239 | month | 12 | 0.273073 |
13259 | Scheduled Departure Time | -2.208920e+09 | 1.141272 | Destination Airport | CLT | 0.391831 | day_of_week | Thurs | 0.373726 | Tail Number | N563UW | 0.321922 | month | 12 | 0.256552 |
14226 | Scheduled Departure Time | -2.208920e+09 | 1.339540 | month | 6 | 0.401657 | Destination Airport | CLT | 0.347827 | day_of_week | Thurs | 0.224909 | Tail Number | N190UW | 0.147016 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
17638 | Scheduled Departure Time | -2.208920e+09 | 1.340564 | month | 7 | 0.411066 | Destination Airport | CLT | 0.347827 | day_of_week | Thurs | 0.224909 | Flight Number | 800 | 0.120877 |
18015 | Scheduled Departure Time | -2.208920e+09 | 1.565999 | month | 7 | 0.809545 | Destination Airport | CLT | 0.347827 | Tail Number | N534UW | 0.247029 | day_of_week | Thurs | 0.224909 |
18165 | Scheduled Departure Time | -2.208920e+09 | 1.368628 | month | 7 | 0.368182 | Destination Airport | CLT | 0.347827 | Tail Number | N173US | 0.314294 | Flight Number | 800 | 0.093169 |
18392 | Scheduled Departure Time | -2.208920e+09 | 1.334738 | month | 7 | 0.424888 | Destination Airport | CLT | 0.347827 | Tail Number | N170US | 0.280126 | day_of_week | Thurs | 0.224909 |
18406 | Scheduled Departure Time | -2.208927e+09 | 1.357411 | month | 7 | 0.855629 | Tail Number | N818AW | 0.676216 | Scheduled Departure Time (Hour of Day) | 17 | 0.455910 | Destination Airport | CLT | 0.344885 |
14 rows × 15 columns
[49]:
all_flights = collect_feature_values(flight_nums,
feature_name,
number_of_explanations)
pd.DataFrame(all_flights)[0].value_counts().plot.bar()
plt.xticks(rotation=0)
[49]:
(array([0]), <a list of 1 Text xticklabel objects>)

Many a frequent flier will tell you horror stories about flying in and out of certain airports. While any given prediction explanation can have a positive or a negative impact on a prediction (this is indicated by both the strength and qualitative_strength columns), due to the thresholds we configured earlier for this tutorial it is likely that the above airports are contributing to the predicted flight delays.
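To sanity-check that reading, you can look directly at the sign of the strength values for the airport explanations; positive strengths push the prediction toward the positive (delayed) class. A small sketch using the explanations DataFrame built above:
# Gather the strength of every 'Destination Airport' explanation, wherever it appears
airport_strengths = pd.concat([
    explanations.loc[
        explanations['explanation_{}_feature'.format(i)] == 'Destination Airport',
        'explanation_{}_strength'.format(i)]
    for i in range(number_of_explanations)])
# True means the airport pushed the prediction toward a delay
(airport_strengths > 0).value_counts()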
DataRobot correctly identified the Scheduled Departure Time input as a timestamp, but in the prediction explanation output we see the internal representation of that value as a Unix epoch, so let’s put it back into a format that humans can read more easily:
[50]:
# For simplicity, let's just look at rows where `Scheduled Departure Time`
# was the first/top explanation.
feature_name = 'Scheduled Departure Time'
bad_times = explanations[explanations.explanation_0_feature == feature_name]
# Now let's convert the epoch to a datetime
pd.to_datetime(bad_times.explanation_0_feature_value, unit='s')
[50]:
39 1900-01-01 19:15:00
392 1900-01-01 19:15:00
13043 1900-01-01 19:10:00
13259 1900-01-01 19:10:00
13843 1900-01-01 19:15:00
...
18015 1900-01-01 19:10:00
18165 1900-01-01 19:10:00
18382 1900-01-01 19:15:00
18392 1900-01-01 19:10:00
18406 1900-01-01 17:05:00
Name: explanation_0_feature_value, Length: 24, dtype: datetime64[ns]
We can see that it appears as though all departures occurred on Jan. 1st, 1900. This is because the original value was only a time of day, so only the time portion of the converted value is meaningful. We will clean this up in our graph below:
[51]:
from matplotlib.ticker import FuncFormatter
from time import gmtime, strftime
scale_factor = 9 # make the difference in strengths more visible
depart = explanations[explanations.explanation_0_feature == feature_name]
true_only = depart[depart.prediction == 1]
false_only = depart[depart.prediction == 0]
plt.scatter(x=true_only.explanation_0_feature_value,
y=true_only.positive_probability,
c='green',
s=true_only.explanation_0_strength ** scale_factor,
label='Will be delayed')
plt.scatter(x=false_only.explanation_0_feature_value,
y=false_only.positive_probability,
c='purple',
s=false_only.explanation_0_strength ** scale_factor,
label='Will not')
# Convert the Epoch values into human time stamps
formatter = FuncFormatter(lambda x, pos: strftime('%H:%M', gmtime(x)))
plt.gca().xaxis.set_major_formatter(formatter)
plt.xlabel('Scheduled Departure Time')
plt.ylabel('Positive Probability')
plt.legend(markerscale=.5, frameon=True, facecolor="white")
plt.title("Relationship of Depart Time and being delayed")
[51]:
Text(0.5,1,'Relationship of Depart Time and being delayed')

The above plot shows each prediction where the top influencer of the prediction was the Scheduled Departure Time. It is plotted against the positive_probability on the Y-axis, and the size of each point represents the strength that departure time had on the prediction (relative to the other features of that data point). Finally, to aid the eye, positive vs. negative outcomes are colored differently.
As we can see from the time scale on the X-axis, it doesn’t cover the full 24 hours; this is telling. Since we filtered our data earlier to show only predictions leaning towards a delay, and this chart leans towards times in the afternoon and evening, there may be a correlation between a later scheduled departure time and a higher probability of being delayed. With a little bit of domain knowledge, one can see why: an airplane and its crew make many flights in a day (typically hopping between cities), so small delays in the morning compound into the evening hours.
Advanced Model Insights¶
This notebook explores additional options for model insights added in the v2.7 release of the DataRobot API.
Prerequisites¶
In order to run this notebook yourself, you will need the following:
- This notebook. If you are viewing this in the HTML documentation bundle, you can download all of the example notebooks and supporting materials from Downloads.
- The dataset required for this notebook. This is in the same directory as this notebook.
- A DataRobot API token. You can find your API token by logging into the DataRobot Web User Interface and looking in your Profile.
Preparation¶
Let’s start with importing some packages that will help us with presentation (if you don’t have them installed already, you’ll have to install them to run this notebook).
[1]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import pandas as pd
import datarobot as dr
import numpy as np
from datarobot.enums import AUTOPILOT_MODE
from datarobot.errors import ClientError
Configure the Python Client¶
Configuring the client requires the following two things:
- A DataRobot endpoint - where the API server can be found
- A DataRobot API token - a token the server uses to identify and validate the user making API requests
The endpoint is usually the URL you would use to log into the DataRobot Web User Interface (e.g., https://app.datarobot.com) with “/api/v2/” appended (e.g., https://app.datarobot.com/api/v2/).
You can find your API token by logging into the DataRobot Web User Interface and looking in your Profile.
The Python client can be configured in several ways. The example we’ll use in this notebook is to point to a yaml file that has the information. This is a text file containing two lines like this:
endpoint: https://app.datarobot.com/api/v2/
token: not-my-real-token
If you want to run this notebook without changes, please save your configuration in a file located under your home directory called ~/.config/datarobot/drconfig.yaml.
[2]:
# Initialization with arguments
# dr.Client(token='<API TOKEN>', endpoint='https://<YOUR ENDPOINT>/api/v2/')
# Initialization with a config file in the same directory as this notebook
# dr.Client(config_path='drconfig.yaml')
# Initialization with a config file located at
# ~/.config/datarobot/drconfig.yaml
dr.Client()
[2]:
<datarobot.rest.RESTClientObject at 0x1119c0d90>
Create Project with features¶
Create a new project using the 10K_diabetes dataset. This dataset poses a binary classification problem with the target readmitted. This project is an excellent example of the advanced model insights available from DataRobot models.
[3]:
url = 'https://s3.amazonaws.com/datarobot_public_datasets/10k_diabetes.xlsx'
project = dr.Project.create(url, project_name='10K Advanced Modeling')
print('Project ID: {}'.format(project.id))
Project ID: 5c0008e06523cd0233c49fe4
[4]:
# Increase the worker count to your maximum available so the project runs faster.
project.set_worker_count(-1)
[4]:
Project(10K Advanced Modeling)
[5]:
target_feature_name = 'readmitted'
project.set_target(target_feature_name, mode=AUTOPILOT_MODE.QUICK)
[5]:
Project(10K Advanced Modeling)
[6]:
project.wait_for_autopilot()
In progress: 14, queued: 0 (waited: 0s)
In progress: 14, queued: 0 (waited: 1s)
In progress: 14, queued: 0 (waited: 1s)
In progress: 14, queued: 0 (waited: 2s)
In progress: 14, queued: 0 (waited: 3s)
In progress: 14, queued: 0 (waited: 5s)
In progress: 11, queued: 0 (waited: 9s)
In progress: 10, queued: 0 (waited: 16s)
In progress: 6, queued: 0 (waited: 29s)
In progress: 1, queued: 0 (waited: 49s)
In progress: 7, queued: 0 (waited: 70s)
In progress: 1, queued: 0 (waited: 90s)
In progress: 16, queued: 0 (waited: 111s)
In progress: 10, queued: 0 (waited: 131s)
In progress: 6, queued: 0 (waited: 151s)
In progress: 2, queued: 0 (waited: 172s)
In progress: 0, queued: 0 (waited: 192s)
In progress: 5, queued: 0 (waited: 213s)
In progress: 1, queued: 0 (waited: 233s)
In progress: 4, queued: 0 (waited: 253s)
In progress: 1, queued: 0 (waited: 274s)
In progress: 1, queued: 0 (waited: 294s)
In progress: 0, queued: 0 (waited: 315s)
In progress: 0, queued: 0 (waited: 335s)
[7]:
models = project.get_models()
model = models[0]
model
[7]:
Model(u'AVG Blender')
Let’s set some color constants to replicate the visual style of the DataRobot lift chart.
[8]:
dr_dark_blue = '#08233F'
dr_blue = '#1F77B4'
dr_orange = '#FF7F0E'
dr_red = '#BE3C28'
Feature Impact¶
Feature Impact is available for all model types and works by altering input data and observing the effect on a model’s score. It is an on-demand feature, meaning that you must initiate a calculation to see the results. Once you have had DataRobot compute the feature impact for a model, that information is saved with the project.
Feature Impact measures how important a feature is in the context of a model. That is, it measures how much the accuracy of a model would decrease if that feature were removed.
[9]:
feature_impacts = model.get_or_request_feature_impact()
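get_or_request_feature_impact is a convenience wrapper: it submits the Feature Impact calculation if it has not been requested yet and waits for the result. If you prefer to manage the job yourself, a minimal sketch looks like this (the max_wait value is an arbitrary choice, and requesting impact for a model that already has it raises a client error):
# Submit the calculation explicitly and block until the job finishes
impact_job = model.request_feature_impact()
feature_impacts = impact_job.get_result_when_complete(max_wait=600)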
[10]:
# Formats the ticks from a float into a percent
percent_tick_fmt = mtick.PercentFormatter(xmax=1.0)
impact_df = pd.DataFrame(feature_impacts)
impact_df.sort_values(by='impactNormalized', ascending=True, inplace=True)
# Positive values are blue, negative are red
bar_colors = impact_df.impactNormalized.apply(lambda x: dr_red if x < 0
else dr_blue)
ax = impact_df.plot.barh(x='featureName', y='impactNormalized',
legend=False,
color=bar_colors,
figsize=(10, 14))
ax.xaxis.set_major_formatter(percent_tick_fmt)
ax.xaxis.set_tick_params(labeltop=True)
ax.xaxis.grid(True, alpha=0.2)
ax.set_facecolor(dr_dark_blue)
plt.ylabel('')
plt.xlabel('Effect')
plt.xlim((None, 1)) # Allow for negative impact
plt.title('Feature Impact', y=1.04)
[10]:
Text(0.5,1.04,'Feature Impact')

Feature Histogram¶
Feature histograms are a popular EDA tool for visualizing features, and the DataRobot feature histogram API makes them easy to draw.
For starters, let us set up two convenience functions.
The first helper function below, matplotlib_pair_histogram, will be used to draw histograms paired with the project target feature. We also attach an orange mark to every histogram bin, showing the average target value for the rows in that bin.
[11]:
def matplotlib_pair_histogram(labels, counts, target_avgs,
bin_count, ax1, feature):
# Rotate categorical labels
if feature.feature_type in ['Categorical', 'Text']:
ax1.tick_params(axis='x', rotation=45)
ax1.set_ylabel(feature.name, color=dr_blue)
ax1.bar(labels, counts, color=dr_blue)
# Instantiate a second axes that shares the same x-axis
ax2 = ax1.twinx()
ax2.set_ylabel(target_feature_name, color=dr_orange)
ax2.plot(labels, target_avgs, marker='o', lw=1, color=dr_orange)
ax1.set_facecolor(dr_dark_blue)
title = 'Histogram for {} ({} bins)'.format(feature.name, bin_count)
ax1.set_title(title)
Let us also create a high-level function, draw_feature_histogram, which will fetch the histogram data and draw it using the helper function we have just created. But first, let’s retrieve some downsampled histogram data and have a look at it:
[12]:
feature = dr.Feature.get(project.id, 'num_lab_procedures')
feature.get_histogram(bin_limit=6).plot
[12]:
[{'count': 755, 'label': u'1.0', 'target': 0.36026490066225164},
{'count': 895, 'label': u'14.5', 'target': 0.3240223463687151},
{'count': 1875, 'label': u'28.0', 'target': 0.3744},
{'count': 2159, 'label': u'41.5', 'target': 0.38490041685965726},
{'count': 1603, 'label': u'55.0', 'target': 0.45414847161572053},
{'count': 557, 'label': u'68.5', 'target': 0.5080789946140036}]
For best accuracy it is recommended to use divisors of 60 for bin_limit, but actually any value <= 60 can be used as well.
The target values are the average target values for the rows in each bin. Please refer to FeatureHistogram for documentation details.
So, our high-level function draw_feature_histogram looks like this:
[14]:
def draw_feature_histogram(feature_name, bin_count):
feature = dr.Feature.get(project.id, feature_name)
# Retrieve downsampled histogram data from server
# based on desired bin count
data = feature.get_histogram(bin_count).plot
labels = [row['label'] for row in data]
counts = [row['count'] for row in data]
target_averages = [row['target'] for row in data]
f, axarr = plt.subplots()
f.set_size_inches((10, 4))
matplotlib_pair_histogram(labels, counts, target_averages,
bin_count, axarr, feature)
Done! Now we can just specify a feature name and the desired bin count to get feature histograms. Here is an example for a numerical feature:
[15]:
draw_feature_histogram('num_lab_procedures', 12)

Categorical and other feature types are supported as well:
[16]:
draw_feature_histogram('medical_specialty', 10)

Lift Chart¶
A lift chart shows you how close, in general, the model’s predictions are to the actual target values in the training data.
The lift chart data we retrieve from the server includes the average model prediction and the average actual target value, sorted by the prediction values in ascending order and split into up to 60 bins.
The bin_weight parameter shows how much weight is in each bin (the number of rows for unweighted projects).
[17]:
lc = model.get_lift_chart('validation')
lc
[17]:
LiftChart(validation)
[18]:
bins_df = pd.DataFrame(lc.bins)
bins_df.head()
[18]:
actual | bin_weight | predicted | |
---|---|---|---|
0 | 0.037037 | 27.0 | 0.097886 |
1 | 0.037037 | 27.0 | 0.137739 |
2 | 0.076923 | 26.0 | 0.162243 |
3 | 0.185185 | 27.0 | 0.173459 |
4 | 0.333333 | 27.0 | 0.188488 |
Let’s define our rebinning and plotting functions.
[19]:
def rebin_df(raw_df, number_of_bins):
cols = ['bin', 'actual_mean', 'predicted_mean', 'bin_weight']
new_df = pd.DataFrame(columns=cols)
current_prediction_total = 0
current_actual_total = 0
current_row_total = 0
x_index = 1
bin_size = 60 / number_of_bins
for rowId, data in raw_df.iterrows():
current_prediction_total += data['predicted'] * data['bin_weight']
current_actual_total += data['actual'] * data['bin_weight']
current_row_total += data['bin_weight']
if ((rowId + 1) % bin_size == 0):
x_index += 1
bin_properties = {
'bin': ((round(rowId + 1) / 60) * number_of_bins),
'actual_mean': current_actual_total / current_row_total,
'predicted_mean': current_prediction_total / current_row_total,
'bin_weight': current_row_total
}
new_df = new_df.append(bin_properties, ignore_index=True)
current_prediction_total = 0
current_actual_total = 0
current_row_total = 0
return new_df
def matplotlib_lift(bins_df, bin_count, ax):
grouped = rebin_df(bins_df, bin_count)
ax.plot(range(1, len(grouped) + 1), grouped['predicted_mean'],
marker='+', lw=1, color=dr_blue)
ax.plot(range(1, len(grouped) + 1), grouped['actual_mean'],
marker='*', lw=1, color=dr_orange)
ax.set_xlim([0, len(grouped) + 1])
ax.set_facecolor(dr_dark_blue)
ax.legend(loc='best')
ax.set_title('Lift chart {} bins'.format(bin_count))
ax.set_xlabel('Sorted Prediction')
ax.set_ylabel('Value')
return grouped
Now we can show all of the lift charts offered in the DataRobot web application.
Note 1: While this method will work for any bin count less than 60, the most reliable results are achieved when the number of bins is a divisor of 60.
Note 2: This visualization method will NOT work for bin counts > 60, because DataRobot does not provide enough information for a higher resolution.
[20]:
bin_counts = [10, 12, 15, 20, 30, 60]
f, axarr = plt.subplots(len(bin_counts))
f.set_size_inches((8, 4 * len(bin_counts)))
rebinned_dfs = []
for i in range(len(bin_counts)):
rebinned_dfs.append(matplotlib_lift(bins_df, bin_counts[i], axarr[i]))
plt.tight_layout()

Rebinned Data¶
You may want to interact with the raw re-binned data for use in third party tools, or for additional evaluation.
[21]:
for rebinned in rebinned_dfs:
print('Number of bins: {}'.format(len(rebinned.index)))
print(rebinned)
Number of bins: 10
bin actual_mean predicted_mean bin_weight
0 1.0 0.13750 0.159916 160.0
1 2.0 0.17500 0.233332 160.0
2 3.0 0.27500 0.276564 160.0
3 4.0 0.28750 0.317841 160.0
4 5.0 0.41250 0.355449 160.0
5 6.0 0.33750 0.394435 160.0
6 7.0 0.49375 0.436481 160.0
7 8.0 0.54375 0.490176 160.0
8 9.0 0.62500 0.559797 160.0
9 10.0 0.68125 0.697142 160.0
Number of bins: 12
bin actual_mean predicted_mean bin_weight
0 1.0 0.134328 0.151886 134.0
1 2.0 0.180451 0.220872 133.0
2 3.0 0.210526 0.259316 133.0
3 4.0 0.313433 0.294237 134.0
4 5.0 0.293233 0.327699 133.0
5 6.0 0.413534 0.358398 133.0
6 7.0 0.353383 0.390993 133.0
7 8.0 0.440299 0.425269 134.0
8 9.0 0.556391 0.465567 133.0
9 10.0 0.556391 0.515761 133.0
10 11.0 0.609023 0.583067 133.0
11 12.0 0.701493 0.712181 134.0
Number of bins: 15
bin actual_mean predicted_mean bin_weight
0 1.0 0.084112 0.142650 107.0
1 2.0 0.177570 0.206029 107.0
2 3.0 0.207547 0.241613 106.0
3 4.0 0.271028 0.269917 107.0
4 5.0 0.308411 0.297614 107.0
5 6.0 0.264151 0.324330 106.0
6 7.0 0.420561 0.349149 107.0
7 8.0 0.367925 0.374717 106.0
8 9.0 0.336449 0.400959 107.0
9 10.0 0.485981 0.428771 107.0
10 11.0 0.518868 0.460771 106.0
11 12.0 0.551402 0.500419 107.0
12 13.0 0.603774 0.543591 106.0
13 14.0 0.635514 0.610431 107.0
14 15.0 0.719626 0.730594 107.0
Number of bins: 20
bin actual_mean predicted_mean bin_weight
0 1.0 0.0500 0.132253 80.0
1 2.0 0.2250 0.187579 80.0
2 3.0 0.1750 0.221244 80.0
3 4.0 0.1750 0.245419 80.0
4 5.0 0.2500 0.266226 80.0
5 6.0 0.3000 0.286902 80.0
6 7.0 0.3375 0.308215 80.0
7 8.0 0.2375 0.327466 80.0
8 9.0 0.4250 0.346325 80.0
9 10.0 0.4000 0.364573 80.0
10 11.0 0.3625 0.384512 80.0
11 12.0 0.3125 0.404358 80.0
12 13.0 0.4875 0.425218 80.0
13 14.0 0.5000 0.447743 80.0
14 15.0 0.5875 0.474525 80.0
15 16.0 0.5000 0.505826 80.0
16 17.0 0.6250 0.536862 80.0
17 18.0 0.6250 0.582731 80.0
18 19.0 0.6250 0.640753 80.0
19 20.0 0.7375 0.753532 80.0
Number of bins: 30
bin actual_mean predicted_mean bin_weight
0 1.0 0.037037 0.117812 54.0
1 2.0 0.132075 0.167957 53.0
2 3.0 0.245283 0.194772 53.0
3 4.0 0.111111 0.217077 54.0
4 5.0 0.264151 0.234340 53.0
5 6.0 0.150943 0.248885 53.0
6 7.0 0.259259 0.262677 54.0
7 8.0 0.283019 0.277293 53.0
8 9.0 0.283019 0.289984 53.0
9 10.0 0.333333 0.305103 54.0
10 11.0 0.226415 0.317688 53.0
11 12.0 0.301887 0.330972 53.0
12 13.0 0.415094 0.343545 53.0
13 14.0 0.425926 0.354649 54.0
14 15.0 0.396226 0.368169 53.0
15 16.0 0.339623 0.381265 53.0
16 17.0 0.314815 0.394318 54.0
17 18.0 0.358491 0.407725 53.0
18 19.0 0.452830 0.422268 53.0
19 20.0 0.518519 0.435153 54.0
20 21.0 0.509434 0.452046 53.0
21 22.0 0.528302 0.469495 53.0
22 23.0 0.641509 0.489711 53.0
23 24.0 0.462963 0.510929 54.0
24 25.0 0.641509 0.530756 53.0
25 26.0 0.566038 0.556426 53.0
26 27.0 0.666667 0.591609 54.0
27 28.0 0.603774 0.629608 53.0
28 29.0 0.698113 0.676879 53.0
29 30.0 0.740741 0.783314 54.0
Number of bins: 60
bin actual_mean predicted_mean bin_weight
0 1.0 0.037037 0.097886 27.0
1 2.0 0.037037 0.137739 27.0
2 3.0 0.076923 0.162243 26.0
3 4.0 0.185185 0.173459 27.0
4 5.0 0.333333 0.188488 27.0
5 6.0 0.153846 0.201298 26.0
6 7.0 0.148148 0.213213 27.0
7 8.0 0.074074 0.220940 27.0
8 9.0 0.307692 0.229899 26.0
9 10.0 0.222222 0.238617 27.0
10 11.0 0.111111 0.245402 27.0
11 12.0 0.192308 0.252501 26.0
12 13.0 0.259259 0.258865 27.0
13 14.0 0.259259 0.266489 27.0
14 15.0 0.230769 0.273597 26.0
15 16.0 0.333333 0.280852 27.0
16 17.0 0.333333 0.286678 27.0
17 18.0 0.230769 0.293418 26.0
18 19.0 0.259259 0.301547 27.0
19 20.0 0.407407 0.308660 27.0
20 21.0 0.346154 0.314679 26.0
21 22.0 0.111111 0.320585 27.0
22 23.0 0.307692 0.327277 26.0
23 24.0 0.296296 0.334530 27.0
24 25.0 0.407407 0.340926 27.0
25 26.0 0.423077 0.346264 26.0
26 27.0 0.444444 0.351782 27.0
27 28.0 0.407407 0.357515 27.0
28 29.0 0.461538 0.364479 26.0
29 30.0 0.333333 0.371723 27.0
30 31.0 0.407407 0.378530 27.0
31 32.0 0.269231 0.384105 26.0
32 33.0 0.407407 0.390886 27.0
33 34.0 0.222222 0.397751 27.0
34 35.0 0.461538 0.403918 26.0
35 36.0 0.259259 0.411391 27.0
36 37.0 0.481481 0.419135 27.0
37 38.0 0.423077 0.425521 26.0
38 39.0 0.555556 0.431010 27.0
39 40.0 0.481481 0.439296 27.0
40 41.0 0.538462 0.448068 26.0
41 42.0 0.481481 0.455876 27.0
42 43.0 0.576923 0.464854 26.0
43 44.0 0.481481 0.473965 27.0
44 45.0 0.703704 0.484397 27.0
45 46.0 0.576923 0.495230 26.0
46 47.0 0.444444 0.505163 27.0
47 48.0 0.481481 0.516694 27.0
48 49.0 0.615385 0.526190 26.0
49 50.0 0.666667 0.535152 27.0
50 51.0 0.592593 0.548849 27.0
51 52.0 0.538462 0.564293 26.0
52 53.0 0.555556 0.581138 27.0
53 54.0 0.777778 0.602079 27.0
54 55.0 0.576923 0.619633 26.0
55 56.0 0.629630 0.639213 27.0
56 57.0 0.666667 0.662629 27.0
57 58.0 0.730769 0.691678 26.0
58 59.0 0.666667 0.740971 27.0
59 60.0 0.814815 0.825658 27.0
ROC curve¶
The receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
To retrieve ROC curve information, use the Model.get_roc_curve method.
[22]:
roc = model.get_roc_curve('validation')
roc
[22]:
RocCurve(validation)
[23]:
df = pd.DataFrame(roc.roc_points)
df.head()
[23]:
accuracy | f1_score | false_negative_score | false_positive_rate | false_positive_score | matthews_correlation_coefficient | negative_predictive_value | positive_predictive_value | threshold | true_negative_rate | true_negative_score | true_positive_rate | true_positive_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.603125 | 0.000000 | 635 | 0.000000 | 0 | 0.000000 | 0.603125 | 0.0000 | 1.000000 | 1.000000 | 965 | 0.000000 | 0 |
1 | 0.604375 | 0.006279 | 633 | 0.000000 | 0 | 0.043612 | 0.603880 | 1.0000 | 0.919849 | 1.000000 | 965 | 0.003150 | 2 |
2 | 0.606875 | 0.018721 | 629 | 0.000000 | 0 | 0.075632 | 0.605395 | 1.0000 | 0.881041 | 1.000000 | 965 | 0.009449 | 6 |
3 | 0.609375 | 0.031008 | 625 | 0.000000 | 0 | 0.097764 | 0.606918 | 1.0000 | 0.839455 | 1.000000 | 965 | 0.015748 | 10 |
4 | 0.611875 | 0.046083 | 620 | 0.001036 | 1 | 0.111058 | 0.608586 | 0.9375 | 0.798130 | 0.998964 | 964 | 0.023622 | 15 |
Threshold operations¶
You can get the recommended threshold value with the maximal F1 score using the RocCurve.get_best_f1_threshold method. This is the same threshold that is preselected in DataRobot when you open the “ROC curve” tab.
[24]:
threshold = roc.get_best_f1_threshold()
threshold
[24]:
0.3410205659739286
To estimate metrics for a different threshold value, just pass it to the RocCurve.estimate_threshold method. This will produce the same results as updating the threshold on the DataRobot “ROC curve” tab.
[25]:
metrics = roc.estimate_threshold(threshold)
metrics
[25]:
{'accuracy': 0.62625,
'f1_score': 0.6215189873417721,
'false_negative_score': 144,
'false_positive_rate': 0.47046632124352333,
'false_positive_score': 454,
'matthews_correlation_coefficient': 0.30124189206636187,
'negative_predictive_value': 0.7801526717557252,
'positive_predictive_value': 0.5195767195767196,
'threshold': 0.3410205659739286,
'true_negative_rate': 0.5295336787564767,
'true_negative_score': 511,
'true_positive_rate': 0.7732283464566929,
'true_positive_score': 491}
Confusion matrix¶
Using a few keys from the retrieved metrics, we can now build a confusion matrix for the selected threshold.
[26]:
roc_df = pd.DataFrame({
'Predicted Negative': [metrics['true_negative_score'],
metrics['false_negative_score'],
metrics['true_negative_score'] + metrics[
'false_negative_score']],
'Predicted Positive': [metrics['false_positive_score'],
metrics['true_positive_score'],
metrics['true_positive_score'] + metrics[
'false_positive_score']],
'Total': [len(roc.negative_class_predictions),
len(roc.positive_class_predictions),
len(roc.negative_class_predictions) + len(
roc.positive_class_predictions)]})
roc_df.index = pd.MultiIndex.from_tuples([
('Actual', '-'), ('Actual', '+'), ('Total', '')])
roc_df.columns = pd.MultiIndex.from_tuples([
('Predicted', '-'), ('Predicted', '+'), ('Total', '')])
roc_df.style.set_properties(**{'text-align': 'right'})
roc_df
[26]:
 | Predicted - | Predicted + | Total |
---|---|---|---|
Actual - | 511 | 454 | 962 |
Actual + | 144 | 491 | 638 |
Total | 655 | 945 | 1600 |
ROC curve plot¶
[27]:
dr_roc_green = '#03c75f'
white = '#ffffff'
dr_purple = '#65147D'
dr_dense_green = '#018f4f'
fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)
plt.scatter(df.false_positive_rate, df.true_positive_rate, color=dr_roc_green)
plt.plot(df.false_positive_rate, df.true_positive_rate, color=dr_roc_green)
plt.plot([0, 1], [0, 1], color=white, alpha=0.25)
plt.title('ROC curve')
plt.xlabel('False Positive Rate (Fallout)')
plt.xlim([0, 1])
plt.ylabel('True Positive Rate (Sensitivity)')
plt.ylim([0, 1])
[27]:
(0, 1)

Prediction distribution plot¶
There are a few different methods for visualizing the prediction distribution; which one to use depends on which packages you have installed. Below you will find three different examples.
Using seaborn
[28]:
import seaborn as sns
sns.set_style("whitegrid", {'axes.grid': False})
fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)
shared_params = {'shade': True, 'clip': (0, 1), 'bw': 0.2}
sns.kdeplot(np.array(roc.negative_class_predictions),
color=dr_purple, **shared_params)
sns.kdeplot(np.array(roc.positive_class_predictions),
color=dr_dense_green, **shared_params)
plt.title('Prediction Distribution')
plt.xlabel('Probability of Event')
plt.xlim([0, 1])
plt.ylabel('Probability Density')
[28]:
Text(0,0.5,'Probability Density')

Using SciPy
[29]:
from scipy.stats import gaussian_kde
fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)
xs = np.linspace(0, 1, 100)
density_neg = gaussian_kde(roc.negative_class_predictions, bw_method=0.2)
plt.plot(xs, density_neg(xs), color=dr_purple)
plt.fill_between(xs, 0, density_neg(xs), color=dr_purple, alpha=0.3)
density_pos = gaussian_kde(roc.positive_class_predictions, bw_method=0.2)
plt.plot(xs, density_pos(xs), color=dr_dense_green)
plt.fill_between(xs, 0, density_pos(xs), color=dr_dense_green, alpha=0.3)
plt.title('Prediction Distribution')
plt.xlabel('Probability of Event')
plt.xlim([0, 1])
plt.ylabel('Probability Density')
[29]:
Text(0,0.5,'Probability Density')

Using scikit-learn
This approach will be most consistent with how we display this plot in DataRobot, because scikit-learn supports additional kernel options, so we can configure the same kernel as we use in the web application (an Epanechnikov kernel with bandwidth 0.05).
The other examples above use a Gaussian kernel, so they may differ slightly from the plot in the DataRobot interface.
[30]:
from sklearn.neighbors import KernelDensity
fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)
xs = np.linspace(0, 1, 100)
X_neg = np.asarray(roc.negative_class_predictions)[:, np.newaxis]
density_neg = KernelDensity(bandwidth=0.05, kernel='epanechnikov').fit(X_neg)
plt.plot(xs, np.exp(density_neg.score_samples(xs[:, np.newaxis])),
color=dr_purple)
plt.fill_between(xs, 0, np.exp(density_neg.score_samples(xs[:, np.newaxis])),
color=dr_purple, alpha=0.3)
X_pos = np.asarray(roc.positive_class_predictions)[:, np.newaxis]
density_pos = KernelDensity(bandwidth=0.05, kernel='epanechnikov').fit(X_pos)
plt.plot(xs, np.exp(density_pos.score_samples(xs[:, np.newaxis])),
color=dr_dense_green)
plt.fill_between(xs, 0, np.exp(density_pos.score_samples(xs[:, np.newaxis])),
color=dr_dense_green, alpha=0.3)
plt.title('Prediction Distribution')
plt.xlabel('Probability of Event')
plt.xlim([0, 1])
plt.ylabel('Probability Density')
[30]:
Text(0,0.5,'Probability Density')

Word Cloud¶
A word cloud is a type of insight available for some text-processing models built on datasets containing text columns. It gives you information about how the appearance of each ngram (word or sequence of words) in the text field affects the predicted target value.
This example will show you how to obtain word cloud data and visualize it in a way similar to the DataRobot web application.
The visualization example here uses the colour and wordcloud packages, so if you don’t have them, you will need to install them.
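For example, assuming you install packages with pip, both are available from PyPI and can typically be installed with:
pip install colour wordcloud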
First, we will create a color palette similar to what we use in DataRobot.
[31]:
from colour import Color
import wordcloud
[32]:
colors = [Color('#2458EB')]
colors.extend(list(Color('#2458EB').range_to(Color('#31E7FE'), 81))[1:])
colors.extend(list(Color('#31E7FE').range_to(Color('#8da0a2'), 21))[1:])
colors.extend(list(Color('#a18f8c').range_to(Color('#ffad9e'), 21))[1:])
colors.extend(list(Color('#ffad9e').range_to(Color('#d80909'), 81))[1:])
webcolors = [c.get_web() for c in colors]
The webcolors variable now contains 201 colors (covering the [-1, 1] interval with a step of 0.01) that will be used in the word cloud. Let’s look at our palette.
[33]:
from matplotlib.colors import LinearSegmentedColormap
dr_cmap = LinearSegmentedColormap.from_list('DataRobot',
webcolors,
N=len(colors))
x = np.arange(-1, 1.01, 0.01)
y = np.arange(0, 40, 1)
X = np.meshgrid(x, y)[0]
plt.xticks([0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200],
['-1', '-0.8', '-0.6', '-0.4', '-0.2', '0',
'0.2', '0.4', '0.6', '0.8', '1'])
plt.yticks([], [])
im = plt.imshow(X, interpolation='nearest', origin='lower', cmap=dr_cmap)

Now we will pick a model that provides a word cloud in DataRobot. Any “Auto-Tuned Word N-Gram Text Modeler” should work.
[34]:
models = project.get_models()
[35]:
model_with_word_cloud = None
for model in models:
try:
model.get_word_cloud()
model_with_word_cloud = model
break
except ClientError as e:
if e.json['message'] and 'No word cloud data' in e.json['message']:
pass
else:
raise
model_with_word_cloud
[35]:
Model(u'Auto-Tuned Word N-Gram Text Modeler using token occurrences - diag_1_desc')
[36]:
wc = model_with_word_cloud.get_word_cloud(exclude_stop_words=True)
[37]:
def word_cloud_plot(wc, font_path=None):
# Stopwords usually dominate any word cloud, so we will filter them out
dict_freq = {wc_word['ngram']: wc_word['frequency']
for wc_word in wc.ngrams
if not wc_word['is_stopword']}
dict_coef = {wc_word['ngram']: wc_word['coefficient']
for wc_word in wc.ngrams}
def color_func(*args, **kwargs):
word = args[0]
palette_index = int(round(dict_coef[word] * 100)) + 100
r, g, b = colors[palette_index].get_rgb()
return 'rgb({:.0f}, {:.0f}, {:.0f})'.format(int(r * 255),
int(g * 255),
int(b * 255))
wc_image = wordcloud.WordCloud(stopwords=set(),
width=1024, height=1024,
relative_scaling=0.5,
prefer_horizontal=1,
color_func=color_func,
background_color=(0, 10, 29),
font_path=font_path).fit_words(dict_freq)
plt.imshow(wc_image, interpolation='bilinear')
plt.axis('off')
[38]:
word_cloud_plot(wc)

You can use the word cloud to get information about the most frequent and most important (highest absolute coefficient value) ngrams in your text.
[39]:
wc.most_frequent(5)
[39]:
[{'coefficient': 0.6229774184805059,
'count': 534,
'frequency': 0.21876280213027446,
'is_stopword': False,
'ngram': u'failure'},
{'coefficient': 0.5680375262833832,
'count': 524,
'frequency': 0.21466612044244163,
'is_stopword': False,
'ngram': u'atherosclerosis'},
{'coefficient': 0.37932405511744804,
'count': 505,
'frequency': 0.2068824252355592,
'is_stopword': False,
'ngram': u'infarction'},
{'coefficient': 0.4689734305695615,
'count': 453,
'frequency': 0.18557968045882836,
'is_stopword': False,
'ngram': u'heart'},
{'coefficient': 0.7444542252245913,
'count': 452,
'frequency': 0.18517001229004507,
'is_stopword': False,
'ngram': u'heart failure'}]
[40]:
wc.most_important(5)
[40]:
[{'coefficient': -0.875917913896919,
'count': 38,
'frequency': 0.015567390413764851,
'is_stopword': False,
'ngram': u'obesity unspecified'},
{'coefficient': -0.8655105382141891,
'count': 38,
'frequency': 0.015567390413764851,
'is_stopword': False,
'ngram': u'obesity'},
{'coefficient': 0.8329465952065771,
'count': 9,
'frequency': 0.0036870135190495697,
'is_stopword': False,
'ngram': u'nephroptosis'},
{'coefficient': 0.7444542252245913,
'count': 452,
'frequency': 0.18517001229004507,
'is_stopword': False,
'ngram': u'heart failure'},
{'coefficient': 0.7029270716899754,
'count': 76,
'frequency': 0.031134780827529702,
'is_stopword': False,
'ngram': u'disorders'}]
Non-ASCII Texts
The word cloud has full Unicode support, but if you want to visualize it using the recipe from this notebook, you should pass a font_path parameter pointing to a font that supports the symbols used in your text. For example, for the Japanese text in the model below, you should use one of the CJK fonts. If you do not have a compatible font, you can download an open-source font from Google’s Noto project.
[41]:
jp_project = dr.Project.create('jp_10k.csv', project_name='Japanese 10K')
print('Project ID: {}'.format(jp_project.id))
Project ID: 5c0008e06523cd0233c49fe4
[42]:
jp_project.set_target('readmitted_再入院', mode=AUTOPILOT_MODE.QUICK)
jp_project.wait_for_autopilot()
In progress: 2, queued: 12 (waited: 0s)
In progress: 2, queued: 12 (waited: 1s)
In progress: 2, queued: 12 (waited: 1s)
In progress: 2, queued: 12 (waited: 2s)
In progress: 2, queued: 12 (waited: 4s)
In progress: 2, queued: 12 (waited: 6s)
In progress: 2, queued: 11 (waited: 9s)
In progress: 1, queued: 11 (waited: 16s)
In progress: 2, queued: 9 (waited: 30s)
In progress: 2, queued: 7 (waited: 50s)
In progress: 2, queued: 5 (waited: 70s)
In progress: 2, queued: 3 (waited: 91s)
In progress: 2, queued: 1 (waited: 111s)
In progress: 1, queued: 0 (waited: 132s)
In progress: 2, queued: 5 (waited: 152s)
In progress: 2, queued: 3 (waited: 172s)
In progress: 2, queued: 2 (waited: 193s)
In progress: 2, queued: 1 (waited: 213s)
In progress: 1, queued: 0 (waited: 234s)
In progress: 2, queued: 14 (waited: 254s)
In progress: 2, queued: 14 (waited: 274s)
In progress: 2, queued: 12 (waited: 295s)
In progress: 1, queued: 12 (waited: 316s)
In progress: 2, queued: 10 (waited: 336s)
In progress: 2, queued: 9 (waited: 356s)
In progress: 2, queued: 7 (waited: 377s)
In progress: 2, queued: 6 (waited: 397s)
In progress: 2, queued: 4 (waited: 418s)
In progress: 2, queued: 3 (waited: 438s)
In progress: 2, queued: 1 (waited: 459s)
In progress: 1, queued: 0 (waited: 479s)
In progress: 1, queued: 0 (waited: 499s)
In progress: 0, queued: 0 (waited: 520s)
In progress: 2, queued: 3 (waited: 540s)
In progress: 2, queued: 1 (waited: 560s)
In progress: 1, queued: 0 (waited: 581s)
In progress: 1, queued: 0 (waited: 601s)
In progress: 2, queued: 2 (waited: 621s)
In progress: 2, queued: 0 (waited: 642s)
In progress: 0, queued: 0 (waited: 662s)
In progress: 1, queued: 0 (waited: 682s)
In progress: 0, queued: 0 (waited: 703s)
In progress: 0, queued: 0 (waited: 723s)
[43]:
jp_models = jp_project.get_models()
jp_model_with_word_cloud = None
for model in jp_models:
try:
model.get_word_cloud()
jp_model_with_word_cloud = model
break
except ClientError as e:
if e.json['message'] and 'No word cloud data' in e.json['message']:
pass
else:
raise
jp_model_with_word_cloud
[43]:
Model(u'Auto-Tuned Word N-Gram Text Modeler using token occurrences and tfidf - diag_1_desc_\u8a3a\u65ad1\u8aac\u660e')
[44]:
jp_wc = jp_model_with_word_cloud.get_word_cloud(exclude_stop_words=True)
[45]:
word_cloud_plot(jp_wc, font_path='NotoSansCJKjp-Regular.otf')

Cumulative gains and lift
ROC curve data now also contains the information necessary for creating cumulative gains and lift charts. Use the new fields fraction_predicted_as_positive and fraction_predicted_as_negative for the X axis, and:
- For cumulative gains, use true_positive_rate / true_negative_rate as the Y axis
- For lift, use the new fields lift_positive / lift_negative as the Y axis.
You can check the visualization code below, which also plots a baseline/random model (in gray) and an ideal model (in orange).
[46]:
fig, ((ax_gains_pos, ax_gains_neg), (ax_lift_pos, ax_lift_neg)) = plt.subplots(
nrows=2, ncols=2, figsize=(8, 8))
total_rows = (df.true_positive_score[0] +
df.false_negative_score[0] +
df.true_negative_score[0] +
df.false_positive_score[0])
fraction_of_positives = float(df.true_positive_score[0] +
df.false_negative_score[0]) / total_rows
fraction_of_negatives = 1 - fraction_of_positives
# Cumulative gains (positive class)
ax_gains_pos.set_facecolor(dr_dark_blue)
ax_gains_pos.scatter(df.fraction_predicted_as_positive, df.true_positive_rate,
color=dr_roc_green)
ax_gains_pos.plot(df.fraction_predicted_as_positive, df.true_positive_rate,
color=dr_roc_green)
ax_gains_pos.plot([0, 1], [0, 1], color=white, alpha=0.25)
ax_gains_pos.plot([0, fraction_of_positives, 1], [0, 1, 1], color=dr_orange)
ax_gains_pos.set_title('Cumulative gains (positive class)')
ax_gains_pos.set_xlabel('Fraction predicted as positive')
ax_gains_pos.set_xlim([0, 1])
ax_gains_pos.set_ylabel('True Positive Rate (Sensitivity)')
# Cumulative gains (negative class)
ax_gains_neg.set_facecolor(dr_dark_blue)
ax_gains_neg.scatter(df.fraction_predicted_as_negative, df.true_negative_rate,
color=dr_roc_green)
ax_gains_neg.plot(df.fraction_predicted_as_negative, df.true_negative_rate,
color=dr_roc_green)
ax_gains_neg.plot([0, 1], [0, 1], color=white, alpha=0.25)
ax_gains_neg.plot([0, fraction_of_negatives, 1], [0, 1, 1], color=dr_orange)
ax_gains_neg.set_title('Cumulative gains (negative class)')
ax_gains_neg.set_xlabel('Fraction predicted as negative')
ax_gains_neg.set_xlim([0, 1])
ax_gains_neg.set_ylabel('True Negative Rate (Specificity)')
# Lift (positive class)
ax_lift_pos.set_facecolor(dr_dark_blue)
ax_lift_pos.scatter(df.fraction_predicted_as_positive, df.lift_positive,
color=dr_roc_green)
ax_lift_pos.plot(df.fraction_predicted_as_positive, df.lift_positive,
color=dr_roc_green)
ax_lift_pos.plot([0, 1], [1, 1], color=white, alpha=0.25)
ax_lift_pos.set_title('Lift (positive class)')
ax_lift_pos.set_xlabel('Fraction predicted as positive')
ax_lift_pos.set_xlim([0, 1])
ax_lift_pos.set_ylabel('Lift')
ideal_lift_pos_x = np.arange(0.01, 1.01, 0.01)
ideal_lift_pos_y = np.minimum(1 / fraction_of_positives, 1 / ideal_lift_pos_x)
ax_lift_pos.plot(ideal_lift_pos_x, ideal_lift_pos_y, color=dr_orange)
# Lift (negative class)
ax_lift_neg.set_facecolor(dr_dark_blue)
ax_lift_neg.scatter(df.fraction_predicted_as_negative, df.lift_negative,
color=dr_roc_green)
ax_lift_neg.plot(df.fraction_predicted_as_negative, df.lift_negative,
color=dr_roc_green)
ax_lift_neg.plot([0, 1], [1, 1], color=white, alpha=0.25)
# ax_lift_neg.plot([0, fraction_of_positives, 1], [0, 1, 1], color=dr_orange)
ax_lift_neg.set_title('Lift (negative class)')
ax_lift_neg.set_xlabel('Fraction predicted as negative')
ax_lift_neg.set_xlim([0, 1])
ax_lift_neg.set_ylabel('Lift')
ideal_lift_neg_x = np.arange(0.01, 1.01, 0.01)
ideal_lift_neg_y = np.minimum(1 / fraction_of_negatives, 1 / ideal_lift_neg_x)
ax_lift_neg.plot(ideal_lift_neg_x, ideal_lift_neg_y, color=dr_orange)
# Adjust spacing for notebook
plt.tight_layout()

Advanced Model Tuning¶
This notebook explores additional model tuning capabilities added in the 2.15 release of the DataRobot API (available for Eureqa models only since the 2.13 release).
Prerequisites¶
In order to run this notebook yourself, you will need the following:
- This notebook. If you are viewing this in the HTML documentation bundle, you can download all of the example notebooks and supporting materials from Downloads.
- A DataRobot API token. You can find your API token by logging into the DataRobot Web User Interface and looking in your Profile.
Preparation¶
Let’s start by importing the DataRobot API. (If you don’t have it installed already, you will need to install it in order to run this notebook.)
[1]:
import datarobot as dr
from datarobot.enums import AUTOPILOT_MODE
Configure the Python Client¶
Configuring the client requires the following two things:
- A DataRobot endpoint - where the API server can be found
- A DataRobot API token - a token the server uses to identify and validate the user making API requests
The endpoint is usually the URL you would use to log into the DataRobot Web User Interface (e.g., https://app.datarobot.com) with “/api/v2/” appended, e.g., (https://app.datarobot.com/api/v2/).
You can find your API token by logging into the DataRobot Web User Interface and looking in your Profile.
The Python client can be configured in several ways. The example we’ll use in this notebook is to point to a yaml file that has the information. This is a text file containing two lines like this:
endpoint: https://app.datarobot.com/api/v2/
token: not-my-real-token
If you want to run this notebook without changes, please save your configuration in a file located under your home directory called ~/.config/datarobot/drconfig.yaml.
[2]:
# Initialization with arguments
# dr.Client(token='<API TOKEN>', endpoint='https://<YOUR ENDPOINT>/api/v2/')
# Initialization with a config file in the same directory as this notebook
# dr.Client(config_path='drconfig.yaml')
# Initialization with a config file located at ~/.config/datarobot/drconfig.yaml
dr.Client()
[2]:
<datarobot.rest.RESTClientObject at 0x103acb610>
Create Project with features¶
Create a new project using the 10K_diabetes dataset. This dataset contains a binary classification target, readmitted.
[3]:
url = 'https://s3.amazonaws.com/datarobot_public_datasets/10k_diabetes.xlsx'
project = dr.Project.create(url, project_name='10K Advanced Modeling')
print('Project ID: {}'.format(project.id))
Project ID: 5c001c2c6523cd0200c4a035
Now, let’s set up the project and run Autopilot to get some models.
[4]:
# Increase the worker count to make the project go faster.
project.set_worker_count(-1)
[4]:
Project(10K Advanced Modeling)
[5]:
project.set_target('readmitted', mode=AUTOPILOT_MODE.FULL_AUTO)
[5]:
Project(10K Advanced Modeling)
[6]:
project.wait_for_autopilot()
In progress: 20, queued: 20 (waited: 0s)
In progress: 20, queued: 20 (waited: 1s)
In progress: 20, queued: 20 (waited: 2s)
In progress: 20, queued: 20 (waited: 3s)
In progress: 20, queued: 20 (waited: 4s)
In progress: 18, queued: 20 (waited: 6s)
In progress: 19, queued: 16 (waited: 10s)
In progress: 20, queued: 13 (waited: 17s)
In progress: 20, queued: 13 (waited: 31s)
In progress: 20, queued: 13 (waited: 51s)
In progress: 20, queued: 13 (waited: 72s)
In progress: 20, queued: 11 (waited: 92s)
In progress: 20, queued: 3 (waited: 113s)
In progress: 18, queued: 0 (waited: 134s)
In progress: 10, queued: 0 (waited: 154s)
In progress: 6, queued: 0 (waited: 175s)
In progress: 1, queued: 0 (waited: 195s)
In progress: 19, queued: 0 (waited: 215s)
In progress: 12, queued: 0 (waited: 236s)
In progress: 3, queued: 0 (waited: 256s)
In progress: 2, queued: 0 (waited: 277s)
In progress: 1, queued: 0 (waited: 297s)
In progress: 0, queued: 0 (waited: 317s)
In progress: 10, queued: 0 (waited: 337s)
In progress: 3, queued: 0 (waited: 358s)
In progress: 1, queued: 0 (waited: 378s)
In progress: 1, queued: 0 (waited: 398s)
In progress: 20, queued: 12 (waited: 419s)
In progress: 20, queued: 11 (waited: 439s)
In progress: 20, queued: 7 (waited: 460s)
In progress: 20, queued: 1 (waited: 480s)
In progress: 15, queued: 0 (waited: 501s)
In progress: 9, queued: 0 (waited: 521s)
In progress: 5, queued: 0 (waited: 542s)
In progress: 3, queued: 0 (waited: 562s)
In progress: 1, queued: 0 (waited: 582s)
In progress: 0, queued: 0 (waited: 603s)
In progress: 1, queued: 0 (waited: 623s)
In progress: 0, queued: 0 (waited: 643s)
In progress: 3, queued: 0 (waited: 664s)
In progress: 3, queued: 1 (waited: 684s)
In progress: 4, queued: 0 (waited: 704s)
In progress: 2, queued: 0 (waited: 725s)
In progress: 1, queued: 0 (waited: 745s)
In progress: 0, queued: 0 (waited: 765s)
In progress: 0, queued: 0 (waited: 786s)
For the purposes of this example, let’s look at a Eureqa model.
[7]:
models = project.get_models()
model = [
m for m in models
if m.model_type.startswith('Eureqa Generalized Additive Model')
][0]
model
[7]:
Model(u'Eureqa Generalized Additive Model Classifier (3000 Generations)')
Now that we have a model, we can start an advanced-tuning session based on that model.
[8]:
tune = model.start_advanced_tuning_session()
Each model’s blueprint consists of a series of tasks. Each task contains some number of tunable parameters. Let’s take a look at the available (tunable) tasks.
[9]:
tune.get_task_names()
[9]:
[u'Eureqa Generalized Additive Model Classifier (3000 Generations)']
Let’s drill down into the main Eureqa task, to see what parameters it has available.
[10]:
task_name = 'Eureqa Generalized Additive Model Classifier (3000 Generations)'
tune.get_parameter_names(task_name)
[10]:
[u'EUREQA_building_block__absolute_value',
u'EUREQA_building_block__addition',
u'EUREQA_building_block__arccosine',
u'EUREQA_building_block__arcsine',
u'EUREQA_building_block__arctangent',
u'EUREQA_building_block__ceiling',
u'EUREQA_building_block__complementary_error_function',
u'EUREQA_building_block__constant',
u'EUREQA_building_block__cosine',
u'EUREQA_building_block__division',
u'EUREQA_building_block__equal-to',
u'EUREQA_building_block__error_function',
u'EUREQA_building_block__exponential',
u'EUREQA_building_block__factorial',
u'EUREQA_building_block__floor',
u'EUREQA_building_block__gaussian_function',
u'EUREQA_building_block__greater-than',
u'EUREQA_building_block__greater-than-or-equal',
u'EUREQA_building_block__hyperbolic_cosine',
u'EUREQA_building_block__hyperbolic_sine',
u'EUREQA_building_block__hyperbolic_tangent',
u'EUREQA_building_block__if-then-else',
u'EUREQA_building_block__input_variable',
u'EUREQA_building_block__integer_constant',
u'EUREQA_building_block__inverse_hyperbolic_cosine',
u'EUREQA_building_block__inverse_hyperbolic_sine',
u'EUREQA_building_block__inverse_hyperbolic_tangent',
u'EUREQA_building_block__less-than',
u'EUREQA_building_block__less-than-or-equal',
u'EUREQA_building_block__logical_and',
u'EUREQA_building_block__logical_not',
u'EUREQA_building_block__logical_or',
u'EUREQA_building_block__logical_xor',
u'EUREQA_building_block__logistic_function',
u'EUREQA_building_block__maximum',
u'EUREQA_building_block__minimum',
u'EUREQA_building_block__modulo',
u'EUREQA_building_block__multiplication',
u'EUREQA_building_block__natural_logarithm',
u'EUREQA_building_block__negation',
u'EUREQA_building_block__power',
u'EUREQA_building_block__round',
u'EUREQA_building_block__sign_function',
u'EUREQA_building_block__sine',
u'EUREQA_building_block__square_root',
u'EUREQA_building_block__step_function',
u'EUREQA_building_block__subtraction',
u'EUREQA_building_block__tangent',
u'EUREQA_building_block__two-argument_arctangent',
u'EUREQA_experimental__max_expression_ops',
u'EUREQA_max_generations',
u'EUREQA_num_threads',
u'EUREQA_prior_solutions',
u'EUREQA_random_seed',
u'EUREQA_split_mode',
u'EUREQA_sync_migrations',
u'EUREQA_target_expression_format',
u'EUREQA_target_expression_string',
u'EUREQA_training_fraction',
u'EUREQA_training_split_expr',
u'EUREQA_validation_fraction',
u'EUREQA_validation_split_expr',
u'EUREQA_weight_expr',
u'XGB_base_margin_initialize',
u'XGB_colsample_bylevel',
u'XGB_colsample_bytree',
u'XGB_interval',
u'XGB_learning_rate',
u'XGB_max_bin',
u'XGB_max_delta_step',
u'XGB_max_depth',
u'XGB_min_child_weight',
u'XGB_min_split_loss',
u'XGB_missing_value',
u'XGB_n_estimators',
u'XGB_num_parallel_tree',
u'XGB_random_state',
u'XGB_reg_alpha',
u'XGB_reg_lambda',
u'XGB_scale_pos_weight',
u'XGB_smooth_interval',
u'XGB_subsample',
u'XGB_tree_method',
u'feature_interaction_max_features',
u'feature_interaction_sampling',
u'feature_interaction_threshold',
u'feature_selection_max_features',
u'feature_selection_method',
u'feature_selection_min_features',
u'feature_selection_threshold',
u'highdim_modeling',
u'subsample']
Eureqa does not search for periodic relationships in the data by default. Doing so would take time away from other types of modeling, and so could reduce model quality if no periodic relationships are present. But let’s say we want to check whether Eureqa can find any strong periodic relationships in the data, by allowing it to consider models that use the mathematical sine() function.
[11]:
tune.set_parameter(
task_name=task_name,
parameter_name='EUREQA_building_block__sine',
value=1)
More values could be set if desired, using the same approach.
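For instance, a hypothetical additional call might look like the following (the parameter name is taken from the list above; the value shown is purely illustrative):
# Hypothetical example -- parameter name taken from the list above; the value is illustrative only
tune.set_parameter(
    task_name=task_name,
    parameter_name='EUREQA_max_generations',
    value=5000)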
Now that some parameters have been set, the tuned model can be run:
[12]:
job = tune.run()
new_model = job.get_result_when_complete()
new_model
[12]:
Model(u'Eureqa Generalized Additive Model Classifier (3000 Generations)')
You now have a new model that was run using your specified Advanced Tuning parameters.
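As a quick check (a minimal sketch, not part of the original notebook), you could compare the tuned model against the original using the project's selected metric on the validation partition:
# Minimal sketch: compare validation scores of the original and tuned models
metric = project.metric
print('Original: {}'.format(model.metrics[metric]['validation']))
print('Tuned:    {}'.format(new_model.metrics[metric]['validation']))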
Time Series Modeling¶
Overview¶
This example provides an introduction to a few of DataRobot’s time series modeling capabilities with a sales dataset. Here is a list of things we will touch on during this notebook:
- Installing the datarobot package
- Configuring the client
- Creating a project
- Denoting known-in-advance features
- Specifying a partitioning scheme
- Running the automated modeling process
- Generating predictions
Prerequisites¶
In order to run this notebook yourself, you will need the following:
- This notebook. If you are viewing this in the HTML documentation bundle, you can download all of the example notebooks and supporting materials from Downloads.
- The required datasets, which are included in the same directory as this notebook.
- A DataRobot API token. You can find your API token by logging into the DataRobot Web User Interface and looking in your Profile.
- The xlrd Python package, which is needed for the pandas read_excel function. You can install this with pip install xlrd.
Installing the datarobot package¶
The datarobot package is hosted on PyPI. You can install it via:
pip install datarobot
from the command line. Its main dependencies are numpy and pandas, which could take some time to install on a new system. We highly recommend the use of virtualenvs to avoid conflicts with other dependencies in your system-wide Python installation.
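For example, one possible way to set this up from the command line (assuming the virtualenv tool is installed) is:
virtualenv datarobot-env
source datarobot-env/bin/activate
pip install datarobot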
Getting Started¶
This line imports the datarobot package. By convention, we always import it with the alias dr.
[1]:
import datarobot as dr
Other Important Imports¶
We’ll use these in this notebook as well. If the previous cell and the following cell both run without issue, you’re in good shape.
[2]:
import datetime
import pandas as pd
Configure the Python Client¶
Configuring the client requires the following two things:
- A DataRobot endpoint - where the API server can be found
- A DataRobot API token - a token the server uses to identify and validate the user making API requests
The endpoint is usually the URL you would use to log into the DataRobot Web User Interface (e.g., https://app.datarobot.com) with “/api/v2/” appended, e.g., (https://app.datarobot.com/api/v2/).
You can find your API token by logging into the DataRobot Web User Interface and looking in your Profile.
The Python client can be configured in several ways. The example we’ll use in this notebook is to point to a yaml file that has the information. This is a text file containing two lines like this:
endpoint: https://app.datarobot.com/api/v2/
token: not-my-real-token
If you want to run this notebook without changes, please save your configuration in a file located under your home directory called ~/.config/datarobot/drconfig.yaml.
[3]:
# Initialization with arguments
# dr.Client(token='<API TOKEN>', endpoint='https://<YOUR ENDPOINT>/api/v2/')
# Initialization with a config file in the same directory as this notebook
# dr.Client(config_path='drconfig.yaml')
# Initialization with a config file located at
# ~/.config/datarobot/drconfig.yaml
dr.Client()
[3]:
<datarobot.rest.RESTClientObject at 0x115b3f850>
Create the Project¶
Here, we use the datarobot package to upload a new file and create a project. The name of the project is optional, but can be helpful when trying to sort among many projects on DataRobot.
[4]:
filename = 'DR_Demo_Sales_Multiseries_training.xlsx'
now = datetime.datetime.now().strftime('%Y-%m-%dT%H:%M')
project_name = 'DR_Demo_Sales_Multiseries_{}'.format(now)
proj = dr.Project.create(sourcedata=filename,
project_name=project_name,
max_wait=3600)
print('Project ID: {}'.format(proj.id))
Project ID: 5c0086ba784cc602226a9e3f
Identify Known-In-Advance Features¶
This dataset has five columns that will always be known-in-advance and available for prediction.
[5]:
known_in_advance = ['Marketing', 'Near_Xmas', 'Near_BlackFriday',
'Holiday', 'DestinationEvent']
feature_settings = [dr.FeatureSettings(feat_name,
known_in_advance=True)
for feat_name in known_in_advance]
Create a Partition Specification¶
This problem has a time component to it, and it would be bad practice to train on data from the present and predict on the past. We could manually add a column to the dataset to indicate which rows should be used for training, testing, and validation, but it is straightforward to allow DataRobot to do it automatically. This dataset contains sales data from multiple individual stores, so we use multiseries_id_columns to tell DataRobot that there are actually multiple time series in this file and to indicate the column that identifies the series each row belongs to.
[6]:
time_partition = dr.DatetimePartitioningSpecification(
datetime_partition_column='Date',
multiseries_id_columns=['Store'],
use_time_series=True,
feature_settings=feature_settings,
)
Run the Automated Modeling Process¶
Now we can start the modeling process. The target for this problem is called Sales, and we let DataRobot automatically select the metric for scoring and comparing models.
The partitioning_method is used to specify that we would like DataRobot to use the partitioning scheme we specified previously.
Finally, the worker_count parameter specifies how many workers should be used for this project. Passing a value of -1 tells DataRobot to set the worker count to the maximum available to you. You can also specify the exact number of workers to use, but this command will fail if you request more workers than your account allows. If you need more resources than what has been allocated to you, you should think about upgrading your license.
The second command provides a URL that can be used to see the project execute on the DataRobot UI.
The last command in this cell is just a blocking loop that periodically checks on the project to see if it is done, printing out the number of jobs in progress and in the queue along the way so you can see progress. The automated model exploration process will occasionally add more jobs to the queue, so don’t be alarmed if the number of jobs does not strictly decrease over time.
[7]:
proj.set_target(
target='Sales',
partitioning_method=time_partition,
max_wait=3600,
worker_count=-1
)
print(proj.get_leaderboard_ui_permalink())
proj.wait_for_autopilot()
https://staging.datarobot.com/projects/5c0086ba784cc602226a9e3f/models
In progress: 20, queued: 1 (waited: 0s)
In progress: 20, queued: 1 (waited: 1s)
In progress: 20, queued: 1 (waited: 2s)
In progress: 20, queued: 1 (waited: 3s)
In progress: 20, queued: 1 (waited: 4s)
In progress: 20, queued: 1 (waited: 7s)
In progress: 20, queued: 1 (waited: 11s)
In progress: 20, queued: 1 (waited: 18s)
In progress: 19, queued: 0 (waited: 31s)
In progress: 19, queued: 0 (waited: 52s)
In progress: 17, queued: 0 (waited: 72s)
In progress: 16, queued: 0 (waited: 93s)
In progress: 15, queued: 0 (waited: 114s)
In progress: 13, queued: 0 (waited: 134s)
In progress: 12, queued: 0 (waited: 155s)
In progress: 12, queued: 0 (waited: 175s)
In progress: 10, queued: 0 (waited: 196s)
In progress: 9, queued: 0 (waited: 217s)
In progress: 7, queued: 0 (waited: 238s)
In progress: 6, queued: 0 (waited: 258s)
In progress: 6, queued: 0 (waited: 278s)
In progress: 2, queued: 0 (waited: 299s)
In progress: 1, queued: 0 (waited: 320s)
In progress: 8, queued: 0 (waited: 340s)
In progress: 8, queued: 0 (waited: 360s)
In progress: 8, queued: 0 (waited: 381s)
In progress: 6, queued: 0 (waited: 402s)
In progress: 5, queued: 0 (waited: 422s)
In progress: 5, queued: 0 (waited: 442s)
In progress: 3, queued: 0 (waited: 463s)
In progress: 3, queued: 0 (waited: 483s)
In progress: 3, queued: 0 (waited: 504s)
In progress: 1, queued: 0 (waited: 524s)
In progress: 0, queued: 0 (waited: 545s)
In progress: 1, queued: 0 (waited: 565s)
In progress: 1, queued: 0 (waited: 586s)
In progress: 1, queued: 0 (waited: 606s)
In progress: 1, queued: 0 (waited: 626s)
In progress: 1, queued: 0 (waited: 647s)
In progress: 1, queued: 0 (waited: 667s)
In progress: 0, queued: 0 (waited: 688s)
In progress: 1, queued: 0 (waited: 708s)
In progress: 1, queued: 0 (waited: 728s)
In progress: 1, queued: 0 (waited: 749s)
In progress: 1, queued: 0 (waited: 769s)
In progress: 1, queued: 0 (waited: 790s)
In progress: 1, queued: 0 (waited: 810s)
In progress: 1, queued: 0 (waited: 830s)
In progress: 1, queued: 0 (waited: 851s)
In progress: 1, queued: 0 (waited: 871s)
In progress: 1, queued: 0 (waited: 892s)
In progress: 1, queued: 0 (waited: 912s)
In progress: 0, queued: 0 (waited: 932s)
Choose the Best Model¶
First, we take a look at the top of the leaderboard. In this example, we choose the model that has the lowest backtesting error.
[8]:
proj.get_models()[:10]
[8]:
[Model(u'AVG Blender'),
Model(u'eXtreme Gradient Boosted Trees Regressor with Early Stopping'),
Model(u'eXtreme Gradient Boosted Trees Regressor with Early Stopping'),
Model(u'eXtreme Gradient Boosted Trees Regressor with Early Stopping'),
Model(u'eXtreme Gradient Boosted Trees Regressor with Early Stopping'),
Model(u'Light Gradient Boosting on ElasticNet Predictions '),
Model(u'eXtreme Gradient Boosting on ElasticNet Predictions'),
Model(u'Light Gradient Boosting on ElasticNet Predictions '),
Model(u'Ridge Regressor with Forecast Distance Modeling'),
Model(u'eXtreme Gradient Boosting on ElasticNet Predictions')]
[9]:
lb = proj.get_models()
valid_models = [m for m in lb if
m.metrics[proj.metric]['crossValidation']]
best_model = min(valid_models,
key=lambda m: m.metrics[proj.metric]['crossValidation'])
print(best_model.model_type)
print(best_model.get_leaderboard_ui_permalink())
AVG Blender
https://staging.datarobot.com/projects/5c0086ba784cc602226a9e3f/models/5c008a2ce23dec598947eb1d
Generate Predictions¶
This example notebook uses the modeling API to make predictions, which uses modeling servers to score the predictions. If you have dedicated prediction servers, you should use that API for faster performance.
Finish training¶
First, we unlock the holdout data to fully train the best model. The last command in the next cell prints the URL to examine the fully-trained model in the DataRobot UI.
[10]:
proj.unlock_holdout()
job = best_model.request_frozen_datetime_model()
retrained_model = job.get_result_when_complete()
print(retrained_model.get_leaderboard_ui_permalink())
https://staging.datarobot.com/projects/5c0086ba784cc602226a9e3f/models/5c008b29784cc6020c6a9e8c
Execute a prediction job¶
First, we find the latest date in the training data. Then, we upload a dataset to predict from, setting the starting forecast_point to be the end of the training data. Finally, we execute the prediction request.
[11]:
d = pd.read_excel('DR_Demo_Sales_Multiseries_training.xlsx')
last_train_date = pd.to_datetime(d['Date']).max()
dataset = proj.upload_dataset(
'DR_Demo_Sales_Multiseries_prediction.xlsx',
forecast_point=last_train_date
)
pred_job = retrained_model.request_predictions(dataset_id=dataset.id)
preds = pred_job.get_result_when_complete()
Each row of the resulting predictions has a prediction of sales at a timestamp for a particular series_id, and can be matched to the uploaded prediction data set through the row_id field. The forecast_distance is the number of time units after the forecast point for a given row.
[12]:
preds.head()
# we could also write predictions out to a file for subsequent analysis
# preds.to_csv('DR_Demo_Sales_Multiseries_prediction_output.csv', index=False)
[12]:
forecast_distance | forecast_point | prediction | row_id | series_id | timestamp | |
---|---|---|---|---|---|---|
0 | 1 | 2014-06-14T00:00:00.000000Z | 148181.314360 | 714 | Louisville | 2014-06-15T00:00:00.000000Z |
1 | 2 | 2014-06-14T00:00:00.000000Z | 139278.257114 | 715 | Louisville | 2014-06-16T00:00:00.000000Z |
2 | 3 | 2014-06-14T00:00:00.000000Z | 139419.155936 | 716 | Louisville | 2014-06-17T00:00:00.000000Z |
3 | 4 | 2014-06-14T00:00:00.000000Z | 135730.704195 | 717 | Louisville | 2014-06-18T00:00:00.000000Z |
4 | 5 | 2014-06-14T00:00:00.000000Z | 140947.763900 | 718 | Louisville | 2014-06-19T00:00:00.000000Z |
Changelog¶
2.16.0¶
New Features¶
Three new methods for Series Accuracy have been added to the
DatetimeModel
class.- Start a request to calculate Series Accuracy with
DatetimeModel.compute_series_accuracy
- Once computed, Series Accuracy can be retrieved as a pandas.DataFrame using
DatetimeModel.get_series_accuracy_as_dataframe
- Or saved as a CSV using
DatetimeModel.download_series_accuracy_as_csv
- Start a request to calculate Series Accuracy with
Users can now access prediction intervals data for each prediction with a
DatetimeModel
. For each model, prediction intervals estimate the range of values DataRobot expects actual values of the target to fall within. They are similar to a confidence interval of a prediction, but are based on the residual errors measured during the backtesting for the selected model.
Enhancements¶
Information on the effective feature derivation window is now available for Time Series projects to specify the full span of historical data required at prediction time. It may be longer than the feature derivation window of the project depending on the differencing settings used.
Additionally, more of the project partitioning settings are also available on the
DatetimeModel
class. The new attributes are:effective_feature_derivation_window_start
effective_feature_derivation_window_end
forecast_window_start
forecast_window_end
windows_basis_unit
Prediction metadata is now included in the return of
Predictions.get
2.15.1¶
Enhancements¶
CalendarFile.get_access_list
has been added to theCalendarFile
class to return a list of users with access to a calendar file.- A
role
attribute has been added to theCalendarFile
class to indicate the access level a current user has to a calendar file. For more information on the specific access levels, see the sharing documentation.
Bugfixes¶
- Previously, attempting to retrieve the
calendar_id
of a project without a set target would result in an error. This has been fixed to returnNone
instead.
2.15.0¶
New Features¶
- Previously available for only Eureqa models, Advanced Tuning methods and objects, including
Model.start_advanced_tuning_session
,Model.get_advanced_tuning_parameters
,Model.advanced_tune
, andAdvancedTuningSession
, now support all models other than blender, OSS, and user-created. - Calendar Files for Time Series projects can now be created and managed through the
CalendarFile
class.
Enhancements¶
- The dataframe returned from
datarobot.PredictionExplanations.get_all_as_dataframe()
will now have each class label class_X be the same from row to row. - The client is now more robust to networking issues by default. It will retry on more errors and respects Retry-After headers in HTTP 413, 429, and 503 responses.
- Added Forecast Distance blender for Time-Series projects configured with more than one Forecast Distance. It blends the selected models creating separate linear models for each Forecast Distance.
Project
can now be shared with other users.Project.upload_dataset
andProject.upload_dataset_from_data_source
will return aPredictionDataset
withdata_quality_warnings
if potential problems exist around the uploaded dataset.relax_known_in_advance_features_check
has been added toProject.upload_dataset
andProject.upload_dataset_from_data_source
to allow missing values from the known in advance features in the forecast window at prediction time.cross_series_group_by_columns
has been added todatarobot.DatetimePartitioning
to allow users the ability to indicate how to further split series into related groups.- Information retrieval for
ROC Curve
has been extended to includefraction_predicted_as_positive
,fraction_predicted_as_negative
,lift_positive
andlift_negative
Bugfixes¶
- Fixes an issue where the client would not be usable if it could not be sure it was compatible with the configured server
API Changes¶
- Methods for creating
datarobot.models.Project
: create_from_mysql, create_from_oracle, and create_from_postgresql, deprecated in 2.11, have now been removed. Usedatarobot.models.Project.create_from_data_source()
instead. datarobot.FeatureSettings
attribute apriori, deprecated in 2.11, has been removed. Usedatarobot.FeatureSettings.known_in_advance
instead.datarobot.DatetimePartitioning
attribute default_to_a_priori, deprecated in 2.11, has been removed. Usedatarobot.DatetimePartitioning.known_in_advance
instead.datarobot.DatetimePartitioningSpecification
attribute default_to_a_priori, deprecated in 2.11, has been removed. Usedatarobot.DatetimePartitioningSpecification.known_in_advance
instead.
Deprecation Summary¶
Configuration Changes¶
Documentation Changes¶
- Advanced model insights notebook extended to contain information on visualisation of cumulative gains and lift charts.
2.14.2¶
Bugfixes¶
- Fixed an issue where searches of the HTML documentation would sometimes hang indefinitely
Documentation Changes¶
- Python3 is now the primary interpreter used to build the docs (this does not affect the ability to use the package with Python2)
2.14.1¶
Documentation Changes¶
- Documentation for the Model Deployment interface has been removed after the corresponding interface was removed in 2.13.0.
2.14.0¶
New Features¶
- The new method
Model.get_supported_capabilities
retrieves a summary of the capabilities supported by a particular model, such as whether it is eligible for Prime and whether it has word cloud data available. - New class for working with model compliance documentation feature of DataRobot:
ComplianceDocumentation
- New class for working with compliance documentation templates:
ComplianceDocTemplate
- New class
FeatureHistogram
has been added to retrieve feature histograms for a requested maximum bin count - Time series projects now support binary classification targets.
- Cross series features can now be created within time series multiseries projects using the
use_cross_series_features
andaggregation_type
attributes of thedatarobot.DatetimePartitioningSpecification
. See the Time Series documentation for more info.
Enhancements¶
- Client instantiation now checks the endpoint configuration and provides more informative error messages. It also automatically corrects HTTP to HTTPS if the server responds with a redirect to HTTPS.
Project.upload_dataset
andProject.create
now accept an optional parameter ofdataset_filename
to specify a file name for the dataset. This is ignored for url and file path sources.- New optional parameter fallback_to_parent_insights has been added to
Model.get_lift_chart
,Model.get_all_lift_charts
,Model.get_confusion_chart
,Model.get_all_confusion_charts
,Model.get_roc_curve
, andModel.get_all_roc_curves
. When True, a frozen model with missing insights will attempt to retrieve the missing insight data from its parent model. - New
number_of_known_in_advance_features
attribute has been added to thedatarobot.DatetimePartitioning
class. The attribute specifies number of features that are marked as known in advance. Project.set_worker_count
can now update the worker count on a project to the maximum number available to the user.- Recommended Models API can now be used to retrieve model recommendations for datetime partitioned projects
- Timeseries projects can now accept feature derivation and forecast windows intervals in terms of
number of the rows rather than a fixed time unit.
DatetimePartitioningSpecification
andProject.set_target
support new optional parameter windowsBasisUnit, either ‘ROW’ or detected time unit. - Timeseries projects can now accept feature derivation intervals, forecast windows, forecast points and prediction start/end dates in milliseconds.
DataSources
andDataStores
can now be shared with other users.- Training predictions for datetime partitioned projects now support the new data subset dr.enums.DATA_SUBSET.ALL_BACKTESTS for requesting the predictions for all backtest validation folds.
API Changes¶
- The model recommendation type “Recommended” (deprecated in version 2.13.0) has been removed.
Documentation Changes¶
- Example notebooks have been updated:
- Notebooks now work in Python 2 and Python 3
- A notebook illustrating time series capability has been added
- The financial data example has been replaced with an updated introductory example.
- To supplement the embedded Python notebooks in both the PDF and HTML docs bundles, the notebook files and supporting data can now be downloaded from the HTML docs bundle.
- Fixed a minor typo in the code sample for
get_or_request_feature_impact
2.13.0¶
New Features¶
- The new method
Model.get_or_request_feature_impact
functionality will attempt to request feature impact and return the newly created feature impact object or the existing object so two calls are no longer required. - New methods and objects, including
Model.start_advanced_tuning_session
,Model.get_advanced_tuning_parameters
,Model.advanced_tune
, andAdvancedTuningSession
, were added to support the setting of Advanced Tuning parameters. This is currently supported for Eureqa models only. - New
is_starred
attribute has been added to theModel
class. The attribute specifies whether a model has been marked as starred by user or not. - Model can be marked as starred or being unstarred with
Model.star_model
andModel.unstar_model
. - When listing models with
Project.get_models
, the model list can now be filtered by theis_starred
value. - A custom prediction threshold may now be configured for each model via
Model.set_prediction_threshold
. When making predictions in binary classification projects, this value will be used when deciding between the positive and negative classes. Project.check_blendable
can be used to confirm if a particular group of models are eligible for blending as some are not, e.g. scaleout models and datetime models with different training lengths.- Individual cross validation scores can be retrieved for new models using
Model.get_cross_validation_scores
.
Enhancements¶
- Python 3.7 is now supported.
- Feature impact now returns not only the impact score for the features but also whether they were detected to be redundant with other high-impact features.
- A new
is_blocked
attribute has been added to theJob
class, specifying whether a job is blocked from execution because one or more dependencies are not yet met. - The
Featurelist
object now has new attributes reporting its creation time, whether it was created by a user or by DataRobot, and the number of models using the featurelist, as well as a new description field. - Featurelists can now be renamed and have their descriptions updated with
Featurelist.update
andModelingFeaturelist.update
. - Featurelists can now be deleted with
Featurelist.delete
andModelingFeaturelist.delete
. ModelRecommendation.get
now accepts an optional parameter of typedatarobot.enums.RECOMMENDED_MODEL_TYPE
which can be used to get a specific kind of recommendation.- Previously computed predictions can now be listed and retrieved with the
Predictions
class, without requiring a reference to the originalPredictJob
.
Bugfixes¶
- The Model Deployment interface which was previously visible in the client has been removed to allow the interface to mature, although the raw API is available as a “beta” API without full backwards compatibility support.
API Changes¶
- Added support for retrieving the Pareto Front of a Eureqa model. See
ParetoFront
. - A new recommendation type “Recommended for Deployment” has been added to
ModelRecommendation
which is now returned as the default recommended model when available. See Model Recommendation.
Deprecation Summary¶
- The feature previously referred to as “Reason Codes” has been renamed to “Prediction
Explanations”, to provide increased clarity and accessibility. The old
ReasonCodes
interface has been deprecated and replaced withPredictionExplanations
. - The recommendation type “Recommended” is deprecated and will no longer be returned in v2.14 of the API.
Documentation Changes¶
- Added a new documentation section Model Recommendation.
- Time series projects support multiseries as well as single series data. They are now documented in the Time Series Projects documentation.
2.12.0¶
New Features¶
- Some models now have Missing Value reports allowing users with access to uncensored blueprints to retrieve a detailed breakdown of how numeric imputation and categorical converter tasks handled missing values. See the documentation for more information on the report.
2.11.0¶
New Features¶
- The new
ModelRecommendation
class can be used to retrieve the recommended models for a project. - A new helper method cross_validate was added to class Model. This method can be used to request Model’s Cross Validation score.
- Training a model with monotonic constraints is now supported. Training with monotonic constraints allows users to force models to learn monotonic relationships with respect to some features and the target. This helps users create accurate models that comply with regulations (e.g. insurance, banking). Currently, only certain blueprints (e.g. xgboost) support this feature, and it is only supported for regression and binary classification projects.
- DataRobot now supports “Database Connectivity”, allowing databases to be used as the source of data for projects and prediction datasets. The feature works on top of the JDBC standard, so a variety of databases conforming to that standard are available; a list of databases with tested support for DataRobot is available in the user guide in the web application. See Database Connectivity for details.
- Added a new feature to retrieve feature logs for time series projects. Check
datarobot.DatetimePartitioning.feature_log_list()
anddatarobot.DatetimePartitioning.feature_log_retrieve()
for details.
API Changes¶
- New attributes supporting monotonic constraints have been added to the
AdvancedOptions
,Project
,Model
, andBlueprint
classes. See monotonic constraints for more information on how to configure monotonic constraints. - New parameters predictions_start_date and predictions_end_date added to
Project.upload_dataset
to support bulk predictions upload for time series projects.
Deprecation Summary¶
- The datarobot.models.Project creation methods create_from_mysql, create_from_oracle, and create_from_postgresql have been deprecated and will be removed in 2.14. Use datarobot.models.Project.create_from_data_source() instead.
- The datarobot.FeatureSettings attribute apriori has been deprecated and will be removed in 2.14. Use datarobot.FeatureSettings.known_in_advance instead.
- The datarobot.DatetimePartitioning attribute default_to_a_priori has been deprecated and will be removed in 2.14. Use datarobot.DatetimePartitioning.known_in_advance instead.
- The datarobot.DatetimePartitioningSpecification attribute default_to_a_priori has been deprecated and will be removed in 2.14. Use datarobot.DatetimePartitioningSpecification.known_in_advance instead.
Configuration Changes¶
- Retry settings compatible with those offered by urllib3’s Retry interface can now be configured. By default, we will now retry connection errors that prevented requests from arriving at the server.
Documentation Changes¶
- “Advanced Model Insights” example has been updated to properly handle bin weights when rebinning.
2.9.0¶
New Features¶
- The new ModelDeployment class can be used to track the status and health of models deployed for predictions.
Enhancements¶
- The DataRobot API now supports creating 3 new blender types: Random Forest, TensorFlow, and LightGBM.
- Multiclass projects now support blender creation for the 3 new blender types, as well as Average and ENET blenders.
- Models can be trained by requesting a particular row count using the new training_row_count argument with Project.train, Model.train, and Model.request_frozen_model in non-datetime-partitioned projects, as an alternative to the previous option of specifying a desired percentage of the project dataset. Specifying model size by row count is recommended when the float precision of sample_pct could be problematic, e.g. when training on a small percentage of the dataset or when training up to partition boundaries (see the sketch after this list).
- New attributes max_train_rows, scaleout_max_train_pct, and scaleout_max_train_rows have been added to Project. max_train_rows specifies the equivalent value to the existing max_train_pct as a row count. The scaleout fields can be used to see how far scaleout models can be trained on projects, which for projects taking advantage of scalable ingest may exceed the limits on the data available to non-scaleout blueprints.
- Individual features can now be marked as a priori or not a priori using the new feature_settings attribute when setting the target or specifying datetime partitioning settings on time series projects. Any features not specified in the feature_settings parameter will be assigned according to the default_to_a_priori value.
- Three new options have been made available in the datarobot.DatetimePartitioningSpecification class to fine-tune how time series projects derive modeling features. treat_as_exponential can control whether data is analyzed as an exponential trend and transformations like log-transform are applied. differencing_method can control which differencing method to use for stationary data. periodicities can be used to specify periodicities occurring within the data. All are optional and defaults will be chosen automatically if they are unspecified.
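A short sketch of requesting training by row count instead of sample percentage, as described in the first bullet above, assuming a non-datetime-partitioned project; the ids and row count are placeholders:

import datarobot as dr

project = dr.Project.get('your-project-id')          # placeholder project id
model = dr.Model.get(project.id, 'your-model-id')    # placeholder model id

# Retrain the same blueprint on an explicit number of rows instead of a percentage.
model_job_id = project.train(model.blueprint_id, training_row_count=5000)

# Or request a frozen version of the existing model at a specific row count.
frozen_job = model.request_frozen_model(training_row_count=5000)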
API Changes¶
- training_row_count is now available on non-datetime models as well as "rowCount"-based datetime models. It reports the number of rows used to train the model (equivalent to sample_pct).
- Features retrieved from Feature.get now include target_leakage.
2.8.1¶
Bugfixes¶
- The documented default connect_timeout will now be correctly set for all configuration mechanisms, so that requests that fail to reach the DataRobot server in a reasonable amount of time will now error instead of hanging indefinitely. If you start seeing ConnectTimeout errors, please configure your connect_timeout to a larger value.
- The version of the trafaret library this package depends on is now pinned to trafaret>=0.7,<1.1, since versions outside that range are known to be incompatible.
2.8.0¶
New Features¶
- The DataRobot API supports creating, training, and predicting with multiclass classification projects. By default, DataRobot handles a dataset with a numeric target column as regression. If your numeric target has fewer than 11 unique values, you can override this behavior to instead create a multiclass classification project from the data. To do so, use the set_target function with target_type='Multiclass'. If DataRobot recognizes your data as categorical and it has fewer than 11 classes, using Multiclass will create a project that classifies which label the data belongs to (see the sketch after this list).
- The DataRobot API now includes Rating Tables. A rating table is an exportable csv representation of a model. Users can influence predictions by modifying them and creating a new model with the modified table. See the documentation for more information on how to use rating tables.
- scaleout_modeling_mode has been added to the AdvancedOptions class used when setting a project target. It can be used to control whether scaleout models appear in the autopilot and/or available blueprints. Scaleout models are only supported in the Hadoop environment with the corresponding user permission set.
- A new premium add-on product, Time Series, is now available. New projects can be created as time series projects which automatically derive features from past data and forecast the future. See the time series documentation for more information.
- The Feature object now returns the EDA summary statistics (i.e., mean, median, minimum, maximum, and standard deviation) for features where this is available (e.g., numeric, date, time, currency, and length features). These summary statistics are presented in the same format as the data they summarize.
- The DataRobot API now supports the Training Predictions workflow. Training predictions are made by a model for a subset of data from the original dataset. Users can start a job which will make those predictions and then retrieve them. See the documentation for more information on how to use training predictions.
- DataRobot now supports retrieving a model blueprint chart and model blueprint documentation.
- With the introduction of Multiclass Classification projects, DataRobot needed a better way to explain the performance of a multiclass model, so we created a new Confusion Chart. The API now supports retrieving and interacting with confusion charts.
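A minimal sketch of overriding the target type to create a multiclass project, as described above, assuming a dataset whose numeric target has fewer than 11 unique values; the file name and target column are placeholders:

import datarobot as dr

# The file and target column are placeholders for a dataset with a low-cardinality numeric target.
project = dr.Project.create('iris.csv', project_name='multiclass example')
project.set_target(target='species', target_type='Multiclass')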
Enhancements¶
- DatetimePartitioningSpecification now includes the optional disable_holdout flag that can be used to disable the holdout fold when creating a project with datetime partitioning.
- When retrieving reason codes on a project using an exposure column, predictions that are adjusted for exposure can be retrieved.
- File URIs can now be used as sourcedata when creating a project or uploading a prediction dataset. The file URI must refer to an allowed location on the server, which is configured as described in the user guide documentation.
- The advanced options available when setting the target have been extended to include the new parameter ‘events_count’ as a part of the AdvancedOptions object to allow specifying the events count column. See the user guide documentation in the webapp for more information on events count.
- PredictJob.get_predictions now returns the predicted probability for each class in the dataframe.
- PredictJob.get_predictions now accepts a prefix parameter to prefix the class names returned in the predictions dataframe.
API Changes¶
- Added a target_type parameter to set_target() and start(), which can be used to override the project default.
2.7.1¶
Documentation Changes¶
- Online documentation hosting has migrated from PythonHosted to Read The Docs. Minor code changes have been made to support this.
2.7.0¶
New Features¶
- Lift chart data for models can be retrieved using the Model.get_lift_chart and Model.get_all_lift_charts methods (see the sketch after this list).
- ROC curve data for models in classification projects can be retrieved using the Model.get_roc_curve and Model.get_all_roc_curves methods.
- Semi-automatic autopilot mode has been removed.
- Word cloud data for text processing models can be retrieved using the Model.get_word_cloud method.
- A scoring code JAR file can be downloaded for models supporting code generation.
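A brief sketch of retrieving these new model insights; the project and model ids and the 'validation' source argument are placeholders, and the word cloud call only applies to models with text processing:

import datarobot as dr

model = dr.Model.get('your-project-id', 'your-model-id')  # placeholder ids

lift_chart = model.get_lift_chart('validation')  # lift chart for one source partition
roc_curve = model.get_roc_curve('validation')    # classification projects only
word_cloud = model.get_word_cloud()              # models with text processing only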
Enhancements¶
- A __repr__ method has been added to the PredictionDataset class to improve readability when using the client interactively.
- Model.get_parameters now includes an additional key on the derived features it returns, showing the coefficients for individual stages of multistage models (e.g. Frequency-Severity models).
- When training a DatetimeModel on a window of data, a time_window_sample_pct can be specified to take a uniform random sample of the training data instead of using all data within the window.
- Installation of the DataRobot package now has an "Extra Requirements" section that will install all of the dependencies needed to run the example notebooks.
Documentation Changes¶
- A new example notebook describing how to visualize some of the newly available model insights including lift charts, ROC curves, and word clouds has been added to the examples section.
- A new section for Common Issues has been added to Getting Started to help debug issues related to client installation and usage.
2.6.1¶
Bugfixes¶
- Fixed a bug with Model.get_parameters raising an exception on some valid parameter values.
Documentation Changes¶
- Fixed sorting order in Feature Impact example code snippet.
2.6.0¶
New Features¶
- A new partitioning method, datetime partitioning, has been added. The recommended workflow is to preview the partitioning by creating a DatetimePartitioningSpecification and passing it into DatetimePartitioning.generate, inspect the results, adjust the DatetimePartitioningSpecification and re-generate as needed for the specific project dataset, and then set the target by passing the final DatetimePartitioningSpecification object to the partitioning_method parameter of Project.set_target (see the sketch after this list).
- When interacting with datetime partitioned projects, DatetimeModel can be used to access more information specific to models in datetime partitioned projects. See the documentation for more information on differences in the modeling workflow for datetime partitioned projects.
- The advanced options available when setting the target have been extended to include the new parameters ‘offset’ and ‘exposure’ (part of the AdvancedOptions object) to allow specifying offset and exposure columns to apply to predictions generated by models within the project. See the user guide documentation in the webapp for more information on offset and exposure columns.
- Blueprints can now be retrieved directly by project_id and blueprint_id via Blueprint.get.
- Blueprint charts can now be retrieved directly by project_id and blueprint_id via BlueprintChart.get. If you already have an instance of Blueprint you can retrieve its chart using Blueprint.get_chart.
- Model parameters can now be retrieved using ModelParameters.get. If you already have an instance of Model you can retrieve its parameters using Model.get_parameters.
- Blueprint documentation can now be retrieved using Blueprint.get_documents. It will contain information about the task, its parameters and (when available) links and references to additional sources.
- The DataRobot API now includes Reason Codes. You can now compute reason codes for prediction datasets. You are able to specify thresholds on which rows to compute reason codes for to speed up computation by skipping rows based on the predictions they generate. See the reason codes documentation for more information.
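A compact sketch of the preview-then-set workflow for datetime partitioning described in the first bullet above; the project id, date column, and target name are placeholders, and the to_dataframe call is an assumption used here only for inspection:

import datarobot as dr

project = dr.Project.get('your-project-id')  # placeholder project id

# Preview the partitioning before committing to it.
spec = dr.DatetimePartitioningSpecification('date')  # placeholder datetime column name
preview = dr.DatetimePartitioning.generate(project.id, spec)
print(preview.to_dataframe())  # inspect the backtests, adjust the spec, and re-generate if needed

# Once satisfied, set the target using the final specification.
project.set_target(target='sales', partitioning_method=spec)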
Enhancements¶
- A new parameter has been added to the AdvancedOptions used with Project.set_target. By specifying accuracyOptimizedMb=True when creating AdvancedOptions, longer-running models that may have a high accuracy will be included in the autopilot and made available to run manually.
- A new option for Project.create_type_transform_feature has been added which explicitly truncates data when casting numerical data as categorical data.
- Added 2 new blenders for projects that use MAD or Weighted MAD as a metric. The MAE blender uses BFGS optimization to find linear weights for the blender that minimize mean absolute error (compared to the GLM blender, which finds linear weights that minimize RMSE), and the MAEL1 blender uses BFGS optimization to find linear weights that minimize MAE + a L1 penalty on the coefficients (compared to the ENET blender, which minimizes RMSE + a combination of the L1 and L2 penalty on the coefficients).
Bugfixes¶
- Fixed a bug (affecting Python 2 only) with printing any model (including frozen and prime models) whose model_type is not ascii.
- FrozenModels were unable to correctly use methods inherited from Model. This has been fixed.
- When calling get_result for a Job, ModelJob, or PredictJob that has errored, AsyncProcessUnsuccessfulError will now be raised instead of JobNotFinished, consistent with the behaviour of get_result_when_complete.
Deprecation Summary¶
- Support for the experimental Recommender Problems projects has been removed. Any code relying on RecommenderSettings or the recommender_settings argument of Project.set_target and Project.start will error.
- Project.update, deprecated in v2.2.32, has been removed in favor of specific updates: rename, unlock_holdout, and set_worker_count.
Documentation Changes¶
- The link to Configuration from the Quickstart page has been fixed.
2.5.1¶
Bugfixes¶
- Fixed a bug (affecting Python 2 only) with printing blueprints whose names are not ascii.
- Fixed an issue where the weights column (for weighted projects) did not appear in the advanced_options of a Project.
2.5.0¶
New Features¶
- Methods to work with blender models have been added. Use the Project.blend method to create new blenders, Project.get_blenders to get the list of existing blenders, and BlenderModel.get to retrieve a model with blender-specific information (see the sketch after this list).
- Projects created via the API can now use smart downsampling when setting the target by passing smart_downsampled and majority_downsampling_rate into the AdvancedOptions object used with Project.set_target. The smart sampling options used with an existing project will be available as part of Project.advanced_options.
- Support for frozen models, which use tuning parameters from a parent model for more efficient training, has been added. Use Model.request_frozen_model to create a new frozen model, Project.get_frozen_models to get the list of existing frozen models, and FrozenModel.get to retrieve a particular frozen model.
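A small sketch of the blender and frozen model workflows above; the project id, the choice of models, and the BLENDER_METHOD constant are illustrative assumptions:

import datarobot as dr

project = dr.Project.get('your-project-id')  # placeholder project id
models = project.get_models()

# Blend the top two leaderboard models (assumes at least two models exist).
blend_job = project.blend([models[0].id, models[1].id], dr.enums.BLENDER_METHOD.AVERAGE)

# Train a frozen model that reuses the parent's tuning parameters on a larger sample.
frozen_job = models[0].request_frozen_model(sample_pct=80)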
Enhancements¶
- The inferred date format (e.g. “%Y-%m-%d %H:%M:%S”) is now included in the Feature object. For non-date features, it will be None.
- When specifying the API endpoint in the configuration, the client will now behave correctly for endpoints with and without trailing slashes.
2.4.0¶
New Features¶
- The premium add-on product DataRobot Prime has been added. You can now approximate a model on the leaderboard and download executable code for it. See documentation for further details, or talk to your account representative if the feature is not available on your account.
- (Only relevant for on-premise users with a Standalone Scoring cluster.) Methods (request_transferable_export and download_export) have been added to the Model class for exporting models (which will only work if model export is turned on). There is a new class ImportedModel for managing imported models on a Standalone Scoring cluster.
- It is now possible to create projects from a WebHDFS, PostgreSQL, Oracle or MySQL data source. For more information see the documentation for the relevant Project classmethods: create_from_hdfs, create_from_postgresql, create_from_oracle and create_from_mysql.
- Job.wait_for_completion, which waits for a job to complete without returning anything, has been added.
Enhancements¶
- The client will now check the API version offered by the server specified in configuration, and give a warning if the client version is newer than the server version. The DataRobot server is always backwards compatible with old clients, but new clients may have functionality that is not implemented on older server versions. This issue mainly affects users with on-premise deployments of DataRobot.
Bugfixes¶
- Fixed an issue where Model.request_predictions might raise an error when predictions finished very quickly instead of returning the job.
API Changes¶
- To set the target with quickrun autopilot, call Project.set_target with mode=AUTOPILOT_MODE.QUICK instead of specifying quickrun=True.
Deprecation Summary¶
- Semi-automatic mode for autopilot has been deprecated and will be removed in 3.0. Use manual or fully automatic instead.
- Use of the quickrun argument in Project.set_target has been deprecated and will be removed in 3.0. Use mode=AUTOPILOT_MODE.QUICK instead.
Configuration Changes¶
- It is now possible to control the SSL certificate verification by setting the parameter ssl_verify in the config file.
Documentation Changes¶
- The “Modeling Airline Delay” example notebook has been updated to work with the new 2.3 enhancements.
- Documentation for the generic Job class has been added.
- Class attributes are now documented in the API Reference section of the documentation.
- The changelog now appears in the documentation.
- There is a new section dedicated to configuration, which lists all of the configuration options and their meanings.
2.3.0¶
New Features¶
- The DataRobot API now includes Feature Impact, an approach to measuring the relevance of each feature that can be applied to any model. The Model class now includes methods request_feature_impact (which creates and returns a feature impact job) and get_feature_impact (which can retrieve completed feature impact results).
- A new improved workflow for predictions now supports first uploading a dataset via Project.upload_dataset, then requesting predictions via Model.request_predictions (see the sketch after this list). This allows us to better support predictions on larger datasets and non-ascii files.
- Datasets previously uploaded for predictions (represented by the PredictionDataset class) can be listed from Project.get_datasets, and retrieved and deleted via PredictionDataset.get and PredictionDataset.delete.
- You can now create a new feature by re-interpreting the type of an existing feature in a project by using the Project.create_type_transform_feature method.
- The Job class now includes a get method for retrieving a job and a cancel method for canceling a job.
- All of the jobs classes (Job, ModelJob, PredictJob) now include the following new methods: refresh (for refreshing the data in the job object), get_result (for getting the completed resource resulting from the job), and get_result_when_complete (which waits until the job is complete and returns the results, or times out).
- A new method Project.refresh can be used to update Project objects with the latest state from the server.
- A new function datarobot.async.wait_for_async_resolution can be used to poll for the resolution of any generic asynchronous operation on the server.
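A condensed sketch of the new prediction and Feature Impact workflows listed above; the file name and ids are placeholders, and the get_result_when_complete calls assume you want to block until the jobs finish:

import datarobot as dr

project = dr.Project.get('your-project-id')          # placeholder project id
model = dr.Model.get(project.id, 'your-model-id')    # placeholder model id

# Upload a prediction dataset, then request predictions from a model against it.
dataset = project.upload_dataset('to_predict.csv')   # placeholder file
predict_job = model.request_predictions(dataset.id)
predictions = predict_job.get_result_when_complete()

# Feature Impact: create the job, then wait for and retrieve the results.
impact_job = model.request_feature_impact()
feature_impact = impact_job.get_result_when_complete()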
Enhancements¶
- The JOB_TYPE enum now includes FEATURE_IMPACT.
- The QUEUE_STATUS enum now includes ABORTED and COMPLETED.
- The Project.create method now has a read_timeout parameter which can be used to keep open the connection to DataRobot while an uploaded file is being processed. For very large files this time can be substantial. Appropriately raising this value can help avoid timeouts when uploading large files.
- The method Project.wait_for_autopilot has been enhanced to error if the project enters a state where autopilot may not finish. This avoids a situation that existed previously where users could wait indefinitely on a project that was not going to finish. However, users are still responsible for making sure a project has more than zero workers and that the queue is not paused.
- Feature.get now supports retrieving features by feature name. (For backwards compatibility, feature IDs are still supported until 3.0.)
- File paths that have unicode directory names can now be used for creating projects and PredictJobs. The filename itself must still be ascii, but containing directory names can have other encodings.
- A more specific JobAlreadyRequested exception is now raised when we refuse a model fitting request as a duplicate. Users can explicitly catch this exception if they want it to be ignored.
- A file_name attribute has been added to the Project class, identifying the file name associated with the original project dataset. Note that if the project was created from a data frame, the file name may not be helpful.
- The connect timeout for establishing a connection to the server can now be set directly. This can be done in the yaml configuration of the client, or directly in the code. The default timeout has been lowered from 60 seconds to 6 seconds, which will make detecting a bad connection happen much quicker.
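The connect timeout mentioned above can be set when configuring the client in code. A sketch is below, assuming dr.Client accepts a connect_timeout keyword alongside the usual token and endpoint (which are placeholders here); it can equivalently be set in the YAML configuration file:

import datarobot as dr

dr.Client(
    token='your_token',
    endpoint='https://app.datarobot.com/api/v2',
    connect_timeout=30,  # seconds allowed for establishing the connection to the server
)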
Bugfixes¶
- Fixed a bug (affecting Python 2 only) with printing features and featurelists whose names are not ascii.
API Changes¶
- The Job class hierarchy has been rearranged to better express the relationship between these objects. See the documentation for datarobot.models.job for details.
- Featurelist objects now have a project_id attribute to indicate which project they belong to. Directly accessing the project attribute of a Featurelist object is now deprecated.
- Support for INI-style configuration, which was deprecated in v2.1, has been removed. YAML is the only supported configuration format.
- The Project.get_jobs method, which was deprecated in v2.1, has been removed. Users should use the Project.get_model_jobs method instead to get the list of model jobs.
Deprecation Summary¶
- PredictJob.create has been deprecated in favor of the alternate workflow using Model.request_predictions.
- Feature.converter (used internally for object construction) has been made private.
- Model.fetch_resource_data has been deprecated and will be removed in 3.0. To fetch a model from its ID, use Model.get.
- The ability to use Feature.get with feature IDs (rather than names) is deprecated and will be removed in 3.0.
- Instantiating a Project, Model, Blueprint, Featurelist, or Feature instance from a dict of data is now deprecated. Please use the from_data classmethod of these classes instead. Additionally, instantiating a Model from a tuple or by using the keyword argument data is also deprecated.
- Use of the attribute Featurelist.project is now deprecated. You can use the project_id attribute of a Featurelist to instantiate a Project instance using Project.get.
- Use of the attributes Model.project, Model.blueprint, and Model.featurelist are all deprecated now to avoid use of partially instantiated objects. Please use the ids of these objects instead.
- Using a Project instance as an argument in Featurelist.get is now deprecated. Please use a project_id instead. Similarly, using a Project instance in Model.get is also deprecated, and a project_id should be used in its place.
Configuration Changes¶
- Previously it was possible (though unintended) that the client configuration could be mixed through environment variables, configuration files, and arguments to datarobot.Client. This logic is now simpler - please see the Getting Started section of the documentation for more information.
2.2.33¶
Bugfixes¶
- Fixed a bug with non-ascii project names using the package with Python 2.
- Fixed an error that occurred when printing projects that had been constructed from an ID only, or printing models that had been constructed from a tuple (which impacted printing PredictJobs).
- Fixed a bug with project creation from non-ascii file names. Project creation from non-ascii file names is not supported, so this now raises a more informative exception. The project name is no longer used as the file name in cases where we do not have a file name, which prevents non-ascii project names from causing problems in those circumstances.
- Fixed a bug (affecting Python 2 only) with printing projects, features, and featurelists whose names are not ascii.
2.2.32¶
New Features¶
- Project.get_features and Feature.get methods have been added for feature retrieval.
- A generic Job entity has been added for use in retrieving the entire queue at once. Calling Project.get_all_jobs will retrieve all (appropriately filtered) jobs from the queue. Those can be cancelled directly as generic jobs, or transformed into instances of the specific job class using ModelJob.from_job and PredictJob.from_job, which allow all functionality previously available via the ModelJob and PredictJob interfaces.
- Model.train now supports featurelist_id and scoring_type parameters, similar to Project.train.
Enhancements¶
- Deprecation warning filters have been updated. By default, a filter will be added ensuring that usage of deprecated features will display a warning once per new usage location. In order to hide deprecation warnings, a filter like warnings.filterwarnings('ignore', category=DataRobotDeprecationWarning) can be added to a script so no such warnings are shown (see the sketch after this list). Watching for deprecation warnings to avoid reliance on deprecated features is recommended.
- If your client is misconfigured and does not specify an endpoint, the cloud production server is no longer used as the default as in many cases this is not the correct default.
- This changelog is now included in the distributable of the client.
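A sketch of silencing the deprecation warnings described in the first bullet above; the import location of DataRobotDeprecationWarning (datarobot.errors) is an assumption:

import warnings

from datarobot.errors import DataRobotDeprecationWarning  # assumed import location

# Hide DataRobot deprecation warnings in this script.
warnings.filterwarnings('ignore', category=DataRobotDeprecationWarning)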
Bugfixes¶
- Fixed an issue where updating the global client would not affect existing objects with cached clients. Now the global client is used for every API call.
- An issue caused by mistyping a filepath for use in a file upload has been resolved. Now an error will be raised if it looks like the raw string content for modeling or predictions is just one single line.
API Changes¶
- Use of username and password to authenticate is no longer supported - use an API token instead.
- Usage of the start_time and finish_time parameters in Project.get_models is not supported, either for filtering or for ordering of models.
- The default value of the sample_pct parameter of the Model.train method is now None instead of 100. If the default value is used, models will be trained with all of the available training data based on project configuration, rather than with the entire dataset including holdout as with the previous default value of 100.
- The order_by parameter of Project.list, which was deprecated in v2.0, has been removed.
- The recommendation_settings parameter of Project.start, which was deprecated in v0.2, has been removed.
- The Project.status method, which was deprecated in v0.2, has been removed.
- The Project.wait_for_aim_stage method, which was deprecated in v0.2, has been removed.
- The Delay, ConstantDelay, NoDelay, ExponentialBackoffDelay, and RetryManager classes from the retry module, which were deprecated in v2.1, have been removed.
- The package has been renamed to datarobot.
Deprecation Summary¶
- Project.update has been deprecated in favor of specific updates: rename, unlock_holdout, and set_worker_count.
Documentation Changes¶
- A new use case involving financial data has been added to the examples directory.
- Added documentation for the partition methods.
2.1.31¶
Bugfixes¶
- In Python 2, using a unicode token to instantiate the client will now work correctly.
2.1.30¶
Bugfixes¶
- The minimum required version of trafaret has been upgraded to 0.7.1 to get around an incompatibility between it and setuptools.
2.1.28¶
New Features¶
- Default to reading the YAML config file from ~/.config/datarobot/drconfig.yaml
- Allow a config_path argument to the client
- A wait_for_autopilot method has been added to Project. This method can be used to block execution until autopilot has finished running on the project.
- Support for specifying which featurelist to use with initial autopilot in Project.set_target
- A Project.get_predict_jobs method has been added, which looks up all prediction jobs for a project
- A Project.start_autopilot method has been added, which starts autopilot on a specified featurelist
- The schema for PredictJob in DataRobot API v2.1 now includes a message. This attribute has been added to the PredictJob class.
- PredictJob.cancel now exists to cancel prediction jobs, mirroring ModelJob.cancel
- Project.from_async is a new classmethod that can be used to wait for an async resolution in project creation. Most users will not need to know about it, as it is used behind the scenes in Project.create and Project.set_target, but power users who run into periodic connection errors will be able to catch the new ProjectAsyncFailureError and decide if they would like to resume waiting for the async process to resolve
Enhancements¶
- The AUTOPILOT_MODE enum now uses string names for autopilot modes instead of numbers
Deprecation Summary¶
- The ConstantDelay, NoDelay, ExponentialBackoffDelay, and RetryManager utils are now deprecated
- INI-style config files are now deprecated (in favor of YAML config files)
- Several functions in the utils submodule are now deprecated (they are being moved elsewhere and are not considered part of the public interface)
- Project.get_jobs has been renamed Project.get_model_jobs for clarity, and the old name is deprecated
- Support for experimental date partitioning has been removed from the DataRobot API, so it is being removed from the client immediately.
API Changes¶
- In several places where AppPlatformError was previously raised, TypeError, ValueError, or InputNotUnderstoodError are now used. With this change, one can now safely assume that when catching an AppPlatformError it is because of an unexpected response from the server.
- AppPlatformError has gained two new attributes: status_code, which is the HTTP status code of the unexpected response from the server, and error_code, which is a DataRobot-defined error code. error_code is not used by any routes in DataRobot API 2.1, but will be in the future. In cases where it is not provided, the instance of AppPlatformError will have the attribute error_code set to None.
- Two new subclasses of AppPlatformError have been introduced, ClientError (for 400-level response status codes) and ServerError (for 500-level response status codes). These will make it easier to build automated tooling that can recover from periodic connection issues while polling.
- If a ClientError or ServerError occurs during a call to Project.from_async, then a ProjectAsyncFailureError (a subclass of AsyncFailureError) will be raised. That exception will have the status_code of the unexpected response from the server, and the location that was being polled to wait for the asynchronous process to resolve.
2.0.27¶
New Features¶
- A PredictJob class was added to work with prediction jobs
- A wait_for_async_predictions function was added to the predict_job module
Deprecation Summary¶
- The order_by parameter of Project.list is now deprecated.
0.2.26¶
Enhancements¶
- Project.set_target will re-fetch the project data after it succeeds, keeping the client side in sync with the state of the project on the server
- Project.create_featurelist now throws a DuplicateFeaturesError exception if the passed list of features contains duplicates
- Project.get_models now supports snake_case arguments to its order_by keyword
Deprecation Summary¶
- Project.wait_for_aim_stage is now deprecated, as the REST Async flow is a more reliable method of determining that project creation has completed successfully
- Project.status is deprecated in favor of Project.get_status
- The recommendation_settings parameter of Project.start is deprecated in favor of recommender_settings
Bugfixes¶
- Project.wait_for_aim_stage changed to support Python 3
- Fixed incorrect value of SCORING_TYPE.cross_validation
- Models returned by Project.get_models will now be correctly ordered when the order_by keyword is used
0.2.25¶
- Pinned versions of required libraries
0.2.24¶
Official release of v0.2
0.1.24¶
- Updated documentation
- Renamed parameter name of Project.create and Project.start to project_name
- Removed Model.predict method
- wait_for_async_model_creation function added to modeljob module
- wait_for_async_status_service of Project class renamed to _wait_for_async_status_service
- Can now use auth_token in config file to configure SDK
0.1.23¶
- Fixes a method that pointed to a removed route
0.1.22¶
- Added featurelist_id attribute to ModelJob class
0.1.21¶
- Removes model attribute from ModelJob class
0.1.20¶
- Project creation raises AsyncProjectCreationError if it was unsuccessful
- Removed Model.list_prime_rulesets and Model.get_prime_ruleset methods
- Removed Model.predict_batch method
- Removed Project.create_prime_model method
- Removed PrimeRuleSet model
- Adds backwards compatibility bridge for ModelJob async
- Adds ModelJob.get and ModelJob.get_model
0.1.19¶
- Minor bugfixes in wait_for_async_status_service
0.1.18¶
- Removes submit_model from Project until serverside implementation is improved
- Switches training URLs for new resource-based route at /projects/<project_id>/models/
- Job renamed to ModelJob, and using modelJobs route
- Fixes an inconsistency in argument order for train methods
0.1.17¶
- wait_for_async_status_service timeout increased from 60s to 600s
0.1.16¶
- Project.create will now handle both async/sync project creation
0.1.15¶
- All routes pluralized to sync with changes in API
- Project.get_jobs will request all jobs when no param specified
- dataframes from predict method will have pythonic names
- Project.get_status created, Project.status now deprecated
- Project.unlock_holdout created.
- Added quickrun parameter to Project.set_target
- Added modelCategory to Model schema
- Added permalinks feature to Project and Model objects.
- Project.create_prime_model created
0.1.14¶
- Project.set_worker_count fix for compatibility with API change in project update.
0.1.13¶
- Add positive class to set_target.
- Changed attribute names of Project, Model, Job and Blueprint
- features in Model, Job and Blueprint are now processes
- dataset_id and dataset_name migrated to featurelist_id and featurelist_name.
- samplepct -> sample_pct
- Model now has blueprint, project, and featurelist attributes.
- Minor bugfixes.
0.1.12¶
- Minor fixes for renamed Job attributes: the features attribute is now named processes, and samplepct is now sample_pct.
0.1.10¶
(May 20, 2015)
- Removed the Project.upload_file, Project.upload_file_from_url, and Project.attach_file methods. All file upload logic has been moved to the Project.create method.