Projects

All of the modeling within DataRobot happens within a project. Each project has one dataset that is used as the source from which to train models.

Create a project

You can create a project from previously-created Datasets or directly from a data source.

import datarobot as dr
dataset = dr.Dataset.create_from_file(file_path='/home/user/data/last_week_data.csv')
project = dr.Project.create_from_dataset(dataset.id, project_name='New Project')

The following command creates a new project directly from a data source. When creating a new project, you must specify one of the following: a path to a data file, a file object, a URL (starting with http://, https://, file://, or s3://), raw file contents, or a pandas.DataFrame object. The path can point to a local file or to a publicly accessible URL.

import datarobot as dr
project = dr.Project.create('/home/user/data/last_week_data.csv',
                            project_name='New Project')

You can use the following commands to view the project ID and name:

project.id
>>> '5506fcd38bd88f5953219da0'
project.project_name
>>> 'New Project'

Select modeling parameters

The final information needed to begin modeling includes the target feature, queue mode, metric for comparing models, and optional parameters such as weights, offset, exposure, and downsampling.

Target

The target must be the name of one of the columns of data uploaded to the project.

Metric

The optimization metric used to compare models is an important factor in building accurate models. If a metric is not specified, the default metric recommended by DataRobot will be used. You can use the following code to view a list of valid metrics for a specified target:

target_name = 'ItemsPurchased'
project.get_metrics(target_name)
>>> {'available_metrics': [
         'Gini Norm',
         'Weighted Gini Norm',
         'Weighted R Squared',
         'Weighted RMSLE',
         'Weighted MAPE',
         'Weighted Gamma Deviance',
         'Gamma Deviance',
         'RMSE',
         'Weighted MAD',
         'Tweedie Deviance',
         'MAD',
         'RMSLE',
         'Weighted Tweedie Deviance',
         'Weighted RMSE',
         'MAPE',
         'Weighted Poisson Deviance',
         'R Squared',
         'Poisson Deviance'],
     'feature_name': 'ItemsPurchased'}
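When a metric name comes from user input or a configuration file, it can help to check it against this list before starting modeling. The helper below is a minimal sketch and is not part of the DataRobot client: the name `ensure_valid_metric` is hypothetical, and the code assumes only that `project.get_metrics(target_name)` returns a dict with an `available_metrics` list, as in the output above.

```python
def ensure_valid_metric(project, target_name, metric):
    """Return `metric` if the project lists it as valid for `target_name`.

    `project` is any object exposing get_metrics(target_name), such as a
    datarobot Project. This helper is illustrative, not a client API.
    """
    available = project.get_metrics(target_name)["available_metrics"]
    if metric not in available:
        raise ValueError(
            f"{metric!r} is not valid for target {target_name!r}; "
            f"choose one of: {sorted(available)}"
        )
    return metric
```

For example, `ensure_valid_metric(project, 'ItemsPurchased', 'Tweedie Deviance')` returns the metric name unchanged when it appears in the list, and raises `ValueError` otherwise.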

Partitioning method

DataRobot projects always have a Holdout set used for final model validation. You can use two different approaches for testing prior to the Holdout set:

Split the remaining data into training and validation sets.
Cross-validation, in which the remaining data is split into a number of folds (partitions); each fold serves as a validation set, with models trained on the other folds and evaluated on that fold.
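
The two approaches can be illustrated with plain row indices. The sketch below is a generic illustration of a holdout split followed by k-fold cross-validation; it is not DataRobot's partitioning algorithm, and the function name and the round-robin fold assignment are assumptions made for the example.

```python
import random


def split_holdout_and_folds(n_rows, holdout_pct, n_folds, seed=0):
    """Split row indices into a holdout set and n_folds validation folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Reserve the holdout rows first; they are never used for training here.
    n_holdout = round(n_rows * holdout_pct / 100)
    holdout, remaining = indices[:n_holdout], indices[n_holdout:]

    # Deal the remaining rows round-robin so fold sizes differ by at most one.
    folds = [remaining[i::n_folds] for i in range(n_folds)]
    return holdout, folds


holdout, folds = split_holdout_and_folds(n_rows=100, holdout_pct=20, n_folds=5)
for k, validation in enumerate(folds):
    # Each fold takes a turn as the validation set; the rest form the training set.
    training = [i for j, fold in enumerate(folds) if j != k for i in fold]
    print(f"fold {k}: train on {len(training)} rows, validate on {len(validation)}")
```

Because row counts are integers, the holdout size is rounded, which is one simple reason a requested percentage may not be matched exactly.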

There are several other options you can control. To specify a partitioning method, create an instance of one of the Partition Classes and pass it as the partitioning_method argument in your call to project.analyze_and_model or project.start. As of v3.0 of the Python client, you can alternatively use project.set_partitioning_method. See the datetime partitioning documentation for more information on using datetime partitioning.

Several partitioning methods include the validation_pct and holdout_pct parameters, which specify the desired percentages for the validation and holdout sets. Note that there may be constraints that prevent the actual percentages used from exactly (or, in some cases, even closely) matching the requested percentages.

Queue mode

You can use the API to set the DataRobot modeling process to run in full Autopilot, manual, quick, or comprehensive mode.

Autopilot mode means that the modeling process will proceed completely automatically, including running recommended models, running at different sample sizes, and blending.

Manual mode means that DataRobot will populate a list of recommended models, but will not insert any of them into the queue. This mode lets you specify which models to execute before starting the modeling process.

Quick mode means that a smaller set of blueprints is used, so Autopilot finishes faster.

Weights

DataRobot also supports a weight parameter. Weights are often used to help compensate for rare events in data. You can specify a column name in the project dataset to be used as the weight column.
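
To see why weights help with rare events, consider a weighted error metric. The sketch below uses the textbook definition of weighted RMSE; it is an illustration only, and DataRobot's own implementation of its weighted metrics may differ.

```python
import math


def weighted_rmse(actual, predicted, weights):
    """Root mean squared error where each row counts in proportion to its weight."""
    total_weight = sum(weights)
    squared_error = sum(
        w * (a - p) ** 2 for a, p, w in zip(actual, predicted, weights)
    )
    return math.sqrt(squared_error / total_weight)


actual = [0, 0, 0, 1]     # the last row is the rare event
predicted = [0, 0, 0, 0]  # a model that always predicts the common case

# With equal weights, the single miss is diluted by the three correct rows.
print(weighted_rmse(actual, predicted, [1, 1, 1, 1]))
# With a larger weight on the rare row, the same miss costs much more.
print(weighted_rmse(actual, predicted, [1, 1, 1, 9]))
```

A model scored this way is penalized more heavily for missing the up-weighted rows, so it has less incentive to ignore them.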

Offsets

Starting with Python client v2.6, DataRobot also supports using an offset parameter. Offsets are commonly used in insurance modeling to include effects that are outside of the training data due to regulatory compliance or constraints. You can specify the names of several columns in the project dataset to be used as the offset columns.

Exposure

Starting with Python client v2.6, DataRobot also supports using an exposure parameter. Exposure is often used to model insurance premiums where strict proportionality of premiums to duration is required. You can specify the name of the column in the project dataset to be used as the exposure column.
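
The roles of offset and exposure are easiest to see in a model with a log link. The sketch below follows the usual generalized linear model convention, in which an offset is a fixed term added to the linear predictor and exposure enters as log(exposure). It is an assumption-based illustration (the function name is hypothetical), not a description of how DataRobot computes predictions internally.

```python
import math


def predict_log_link(linear_predictor, offset=0.0, exposure=1.0):
    """Expected value under a log link with an optional offset and exposure.

    The offset is added to the linear predictor as a fixed term, and
    exposure enters as log(exposure), so it multiplies the prediction.
    """
    return math.exp(linear_predictor + offset + math.log(exposure))


# A policy with an expected claim rate of 0.1 per year of cover.
full_year = predict_log_link(linear_predictor=math.log(0.1))
half_year = predict_log_link(linear_predictor=math.log(0.1), exposure=0.5)

# Halving the exposure halves the expected value: strict proportionality.
print(full_year, half_year)
```

This is the proportionality to duration mentioned above: the model learns a rate, and exposure scales that rate without being fitted as an ordinary feature.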

Start modeling

Once you have selected modeling parameters, you can use the following code structure to specify parameters and start the modeling process.

import datarobot as dr
project.analyze_and_model(target='ItemsPurchased',
                          metric='Tweedie Deviance',
                          mode=dr.AUTOPILOT_MODE.FULL_AUTO)

You can also pass additional parameters to project.analyze_and_model to change parts of the modeling process. Some of those parameters include:

worker_count - int, sets number of workers used for modeling.
partitioning_method - PartitioningMethod object.
positive_class - str, float, or int; Specifies a level of the target column that should be treated as the positive class for binary classification. May only be specified for binary classification targets.
advanced_options - AdvancedOptions object; Used to set advanced options of modeling process. Can alternatively call set_options on a project instance which will be used automatically if nothing is passed here.
target_type - str; Overrides the automatically selected target_type. An example usage would be setting the target_type=TARGET_TYPE.MULTICLASS when you want to perform a multiclass classification task on a numeric column that has a low cardinality.

You can run different Autopilot modes with the mode parameter. AUTOPILOT_MODE.FULL_AUTO is the default, which will trigger modeling with no further actions necessary. Other accepted modes include AUTOPILOT_MODE.MANUAL for manual mode (choose your own models to run rather than use the DataRobot autopilot), AUTOPILOT_MODE.QUICK (run on a more limited set of models to get insights more quickly), and AUTOPILOT_MODE.COMPREHENSIVE (used to invest more time to find the most accurate model to serve your use case).

For a full reference of available parameters, see Project.analyze_and_model.

Clone a project

Once a project has been successfully created, you may clone it using the following code structure:

new_project = project.clone_project(new_project_name='This is my new project')
new_project.project_name
>>> 'This is my new project'
new_project.id != project.id
>>> True

The new_project_name parameter is optional. If it is omitted, the default new project name will be ‘Copy of <project.name>’.

Interact with a project

The following commands can be used to manage DataRobot projects.

List projects

Returns a list of projects associated with the current API user.

import datarobot as dr
dr.Project.list()
>>> [Project(One), Project(Two)]

dr.Project.list(search_params={'project_name': 'One'})
>>> [Project(One)]

You can pass the following parameter to change the result:

search_params – dict; Used to filter returned projects. You can only query projects by project_name.
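
If you need exactly one project with a given name, you can filter the search results further on the client side. The helper below is a minimal sketch, not part of the DataRobot client: `find_project_by_name` is a hypothetical name, and the code assumes that the name filter may return more than one project (for example, if names are not unique or the filter matches partial names). The project class is passed in as an argument, so the helper works with `dr.Project` or with any class that has a compatible `list` method.

```python
def find_project_by_name(project_cls, name):
    """Return the single project whose name is exactly `name`, or None.

    `project_cls` is expected to be datarobot.Project; it is passed in so
    that this helper has no import-time dependency on the client.
    """
    candidates = project_cls.list(search_params={"project_name": name})
    # Keep only exact matches in case the server-side filter is broader.
    exact = [p for p in candidates if p.project_name == name]
    if len(exact) > 1:
        raise LookupError(f"{len(exact)} projects are named {name!r}")
    return exact[0] if exact else None
```

With the client installed and configured, a call such as `find_project_by_name(dr.Project, 'One')` would return the matching project or `None`.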

Get an existing project

Rather than querying the full list of projects every time you need to interact with a project, you can retrieve its ID value and use that to reference the project.

import datarobot as dr
project = dr.Project.get(project_id='5506fcd38bd88f5953219da0')
project.id
>>> '5506fcd38bd88f5953219da0'
project.project_name
>>> 'Churn Projection'

Get feature association statistics for an existing project

You can retrieve either feature association or correlation statistics and metadata on informative features for a given project.

import datarobot as dr
project = dr.Project.get(project_id='5506fcd38bd88f5953219da0')
association_data = project.get_associations(assoc_type='association', metric='mutualInfo')
association_data.keys()
>>> ['strengths', 'features']

Get whether your featurelists have association statistics

Check whether an association matrix job has been run on each of your feature lists.

import datarobot as dr
project = dr.Project.get(project_id='5506fcd38bd88f5953219da0')
featurelists = project.get_association_featurelists()
featurelists['featurelists'][0]
>>> {"featurelistId": "54e510ef8bd88f5aeb02a3ed", "hasFam": True, "title": "Informative Features"}

Create association statistics for a featurelist

Generate the feature association statistics for all features in a feature list.

import datarobot as dr
from datarobot.models.feature_association_matrix import FeatureAssociationMatrix
project = dr.Project.get(project_id='5506fcd38bd88f5953219da0')
featurelist = project.get_featurelist_by_name("Raw Features")
status = FeatureAssociationMatrix.create(project.id, featurelist.id)
# two ways to wait for completion
# option 1
status.wait_for_completion()
fam = FeatureAssociationMatrix.get(project_id=project.id, featurelist_id=featurelist.id)
# or option 2
# fam = status.get_result_when_complete()

Get a project’s feature list by name

Get a feature list by name.

import datarobot as dr
project = dr.Project.get(project_id='5506fcd38bd88f5953219da0')
featurelist = project.get_featurelist_by_name("Raw Features")
featurelist
>>> Featurelist(Raw Features)

# Trying to get feature list that does not exist
featurelist = project.get_featurelist_by_name("Flying Circus")
featurelist is None
>>> True

Create project feature lists

Using a project’s create_featurelist() method, you can create feature lists in multiple ways:

import datarobot as dr
project = dr.Project.get(project_id='5506fcd38bd88f5953219da0')

featurelist_one = project.create_featurelist(
    name="Testing featurelist creation",
    features=["age", "weight", "number_diagnoses"],
)
featurelist_one
>>> Featurelist(Testing featurelist creation)
featurelist_one.features
>>> ['age', 'weight', 'number_diagnoses']

# Create a feature list using another feature list as a starting point (`starting_featurelist`)
# Note: this example passes the `featurelist` object but you can also pass the
# id (`starting_featurelist_id`) or the name (`starting_featurelist_name`)
featurelist_two = project.create_featurelist(
    starting_featurelist=featurelist_one,
    features_to_exclude=["number_diagnoses"],  # Please see docs for use of `features_to_include`
)
featurelist_two  # Note below we have an auto-generated name because we did not pass `name`
>>> Featurelist(Testing featurelist creation - 2022-07-12)
# Note below we have a new feature list which has `"number_diagnoses"` excluded
featurelist_two.features
>>> ['age', 'weight']
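
Because get_featurelist_by_name returns None when no feature list matches, the two methods combine naturally into a "get or create" pattern. The helper below is a minimal sketch under that assumption; `get_or_create_featurelist` is a hypothetical name, not a client method, and `project` can be any object with these two methods.

```python
def get_or_create_featurelist(project, name, features):
    """Return the feature list called `name`, creating it if it does not exist."""
    existing = project.get_featurelist_by_name(name)
    if existing is not None:
        # Reuse the existing list; its features are not compared with `features`.
        return existing
    return project.create_featurelist(name=name, features=features)
```

Note that the sketch does not verify that an existing list contains the requested features; add that check if your scripts rely on it.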

Get values for a pair of features in an existing project

Get a sample of the exact values used in the feature association matrix plotting.

import datarobot as dr
project = dr.Project.get(project_id='5506fcd38bd88f5953219da0')
feature_values = project.get_association_matrix_details(feature1='foo', feature2='bar')
feature_values.keys()
>>> ['features', 'types', 'values']

Update a project

You can update various attributes of a project.

To update the name of the project:

project.rename(new_name)

To update the number of workers used by your project (this will fail if you request more workers than you have available; the special value -1 will request your maximum number):

project.set_worker_count(num_workers)

To unlock the Holdout set, allowing holdout scores to be shown and models to be trained on more data:

project.unlock_holdout()

To add or change the project description:

project.set_project_description(project_description)

To add or change the project’s advanced_options:

# Using kwargs
project.set_options(blend_best_models=False)

# Using an AdvancedOptions instance
project.set_options(dr.AdvancedOptions(blend_best_models=False))

Delete a project

Use the following command to delete a project:

project.delete()

Wait for Autopilot to finish

Once the modeling Autopilot is started, in some cases you will want to wait for Autopilot to finish:

project.wait_for_autopilot()

Play/Pause Autopilot

If your project is running in Autopilot, it will continually use available workers, subject to the number of workers allocated to the project and the total number of simultaneous workers allowed according to the user permissions.

To pause a project running in Autopilot:

project.pause_autopilot()

To resume running a paused project:

project.unpause_autopilot()

Start Autopilot on another feature list

You can start Autopilot on an existing feature list.

import datarobot as dr

featurelist = project.create_featurelist('test', ['feature 1', 'feature 2'])
project.start_autopilot(featurelist.id)
>>> True

# Starting autopilot that is already running on the provided featurelist
project.start_autopilot(featurelist.id)
>>> dr.errors.AppPlatformError

Note

This method should be used on a project where the target has already been set. An error will be raised if autopilot is currently running on or has already finished running on the provided feature list.

Start preparing a specific model for deployment

You can start preparing a specific model for deployment. The model then goes through the various recommendation stages, including retraining on a reduced feature list and retraining on a larger sample size (or on recent data, for datetime-partitioned projects).

# prepare a specific model for deployment and wait for the process to complete
project.start_prepare_model_for_deployment(model_id=model.id)
project.wait_for_autopilot(check_interval=5, timeout=600)
# get the prepared model
prepared_for_deployment_model = dr.models.ModelRecommendation.get(
    project.id, recommendation_type=dr.enums.RECOMMENDED_MODEL_TYPE.PREPARED_FOR_DEPLOYMENT
)
prepared_for_deployment_model_id = prepared_for_deployment_model.model_id

Note

This method should be used on a project where the target has already been set. An error will be raised if autopilot is currently running on the project or another model in the project is being prepared for deployment.

Using credential data

For methods that accept credential data instead of a user/password pair or a credential ID, see the Credential Data documentation.