Projects

All of the modeling within DataRobot happens within a project. Each project has one dataset that is used as the source from which to train models.

Create a Project

You can use the following command to create a new project. You must specify a path to data file, file object, raw file contents, or a pandas.DataFrame object when creating a new project. Path to file can be either a path to a local file or a publicly accessible URL.

import datarobot as dr
project = dr.Project.create('/home/user/data/last_week_data.csv',
                            project_name='New Project')

You can use the following commands to view the project ID and name:

project.id
>>> u'5506fcd38bd88f5953219da0'
project.project_name
>>> u'New Project'

Select Modeling Parameters

The final information needed to begin modeling includes the target feature, the queue mode, the metric for comparing models, and the optional parameters such as weights, offset, exposure and downsampling.

Target

The target must be the name of one of the columns of data uploaded to the project.

Metric

The optimization metric used to compare models is an important factor in building accurate models. If a metric is not specified, the default metric recommended by DataRobot will be used. You can use the following code to view a list of valid metrics for a specified target:

target_name = 'ItemsPurchased'
project.get_metrics(target_name)
>>> {'available_metrics': [
         'Gini Norm',
         'Weighted Gini Norm',
         'Weighted R Squared',
         'Weighted RMSLE',
         'Weighted MAPE',
         'Weighted Gamma Deviance',
         'Gamma Deviance',
         'RMSE',
         'Weighted MAD',
         'Tweedie Deviance',
         'MAD',
         'RMSLE',
         'Weighted Tweedie Deviance',
         'Weighted RMSE',
         'MAPE',
         'Weighted Poisson Deviance',
         'R Squared',
         'Poisson Deviance'],
     'feature_name': 'SalePrice'}

Partitioning Method

DataRobot projects always have a holdout set used for final model validation. We use two different approaches for testing prior to the holdout set:

  • split the remaining data into training and validation sets
  • cross-validation, in which the remaining data is split into a number of folds; each fold serves as a validation set, with models trained on the other folds and evaluated on that fold.

There are several other options you can control. To specify a partition method, create an instance of one of the Partition Classes, and pass it as the partitioning_method argument in your call to project.set_target or project.start. See here for more information on using datetime partitioning.

Several partitioning methods include parameters for validation_pct and holdout_pct, specifying desired percentages for the validation and holdout sets. Note that there may be constraints that prevent the actual percentages used from exactly (or some cases, even closely) matching the requested percentages.

Queue Mode

You can use the API to set the DataRobot modeling process to run in either automatic or manual mode.

Autopilot mode means that the modeling process will proceed completely automatically, including running recommended models, running at different sample sizes, and blending.

Manual mode means that DataRobot will populate a list of recommended models, but will not insert any of them into the queue. Manual mode lets you select which models to execute before starting the modeling process.

Quick mode means that a smaller set of Blueprints is used, so autopilot finishes faster.

Weights

DataRobot also supports using a weight parameter. A full discussion of the use of weights in data science is not within the scope of this document, but weights are often used to help compensate for rare events in data. You can specify a column name in the project dataset to be used as a weight column.

Offsets

Starting with version v2.6 DataRobot also supports using an offset parameter. Offsets are commonly used in insurance modeling to include effects that are outside of the training data due to regulatory compliance or constraints. You can specify the names of several columns in the project dataset to be used as the offset columns.

Exposure

Starting with version v2.6 DataRobot also supports using an exposure parameter. Exposure is often used to model insurance premiums where strict proportionality of premiums to duration is required. You can specify the name of the column in the project dataset to be used as an exposure column.

Start Modeling

Once you have selected modeling parameters, you can use the following code structure to specify parameters and start the modeling process.

import datarobot as dr
project.set_target(target='ItemsPurchased',
                   metric='Tweedie Deviance',
                   mode=dr.AUTOPILOT_MODE.FULL_AUTO)

You can also pass additional optional parameters to project.set_target to change parameters of the modeling process. Currently supported parameters are:

  • worker_count – int, sets number of workers used for modeling.
  • partitioning_methodPartitioningMethod object.
  • positive_class – str, float, or int; Specifies a level of the target column that should treated as the positive class for binary classification. May only be specified for binary classification targets.
  • advanced_optionsAdvancedOptions object, used to set advanced options of modeling process.
  • target_type – str, override the automaticially selected target_type. An example usage would be setting the target_type=TARGET_TYPE.MULTICLASS when you want to perform a multiclass classification task on a numeric column that has a low cardinality.

You can run with different autopilot modes by changing the parameter to mode. AUTOPILOT_MODE.FULL_AUTO is the default. Other accepted modes include AUTOPILOT_MODE.MANUAL for manual mode (choose your own models to run rather than use the DataRobot autopilot) and AUTOPILOT_MODE.QUICK for quickrun (run on a more limited set of models to get insights more quickly).

Quickly Start a Project

Project creation, file upload and target selection are all combined in Project.start method:

import datarobot as dr
project = dr.Project.start('/home/user/data/last_week_data.csv',
                        target='ItemsPurchased',
                        project_name='New Project')

You can also pass additional optional parameters to Project.start:

  • worker_count – int, sets number of workers used for modeling.
  • metric - str, name of metric to use.
  • autopilot_on - boolean, defaults to True; set whether or not to begin modeling automatically.
  • blueprint_threshold – int, number of hours the model is permitted to run. Minimum 1.
  • response_cap – float, Quantile of the response distribution to use for response capping. Must be in range 0.5..1.0
  • partitioning_methodPartitioningMethod object.
  • positive_class – str, float, or int; Specifies a level of the target column that should treated as the positive class for binary classification. May only be specified for binary classification targets.
  • target_type – str, override the automaticially selected target_type. An example usage would be setting the target_type=TARGET_TYPE.MULTICLASS when you want to perform a multiclass classification task on a numeric column that has a low cardinality.

Interact with a Project

The following commands can be used to manage DataRobot projects.

List Projects

Returns a list of projects associated with current API user.

import datarobot as dr
dr.Project.list()
>>> [Project(Project One), Project(Two)]

dr.Project.list(search_params={'project_name': 'One'})
>>> [Project(One)]

You can pass following parameters to change result:

  • search_params – dict, used to filter returned projects. Currently you can query projects only by project_name

Get an existing project

Rather than querying the full list of projects every time you need to interact with a project, you can retrieve its id value and use that to reference the project.

import datarobot as dr
project = dr.Project.get(project_id='5506fcd38bd88f5953219da0')
project.id
>>> '5506fcd38bd88f5953219da0'
project.project_name
>>> 'Churn Projection'

Update a project

You can update various attributes of a project.

To update the name of the project:

project.rename(new_name)

To update the number of workers used by your project (this will fail if you request more workers than you have available):

project.set_worker_count(num_workers)

To unlock the holdout set, allowing holdout scores to be shown and models to be trained on more data:

project.unlock_holdout()

Delete a project

Use the following command to delete a project:

project.delete()

Wait for Autopilot to Finish

Once the modeling autopilot is started, in some cases you will want to wait for autopilot to finish:

project.wait_for_autopilot()

Play/Pause the autopilot

If your project is running in autopilot mode, it will continually use available workers, subject to the number of workers allocated to the project and the total number of simultaneous workers allowed according to the user permissions.

To pause a project running in autopilot mode:

project.pause_autopilot()

To resume running a paused project:

project.unpause_autopilot()

Start autopilot on another Featurelist

You can start autopilot on an existing featurelist.

import datarobot as dr

featurelist = project.create_featurelist('test', ['feature 1', 'feature 2'])
project.start_autopilot(featurelist.id)
>>> True

# Starting autopilot that is already running on the provided featurelist
project.start_autopilot(featurelist.id)
>>> dr.errors.AppPlatformError

Note

This method should be used on a project where the target has already been set. An error will be raised if autopilot is currently running on or has already finished running on the provided featurelist.

Further reading

The Blueprints and Models sections of this document will describe how to create new models based on the Blueprints recommended by DataRobot.