Models

When a blueprint has been trained on a specific dataset at a specified sample size, the result is a model. Models can be inspected to analyze their accuracy.

Start training a model

To start training a model, use the Project.train method with a blueprint object or a blueprint ID:

import datarobot as dr
project = dr.Project.get('5506fcd38bd88f5953219da0')
blueprints = project.get_blueprints()
model_job_id = project.train(blueprints[0].id)

For a datetime partitioned project (see the specialized workflows section), use Project.train_datetime:

import datarobot as dr
project = dr.Project.get('5506fcd38bd88f5953219da0')
blueprints = project.get_blueprints()
# Unlike Project.train, train_datetime returns a ModelJob object rather than a job ID.
model_job = project.train_datetime(blueprints[0].id)

List finished models

You can use the Project.get_models method to return a list of the project models that have finished training:

import datarobot as dr
project = dr.Project.get('5506fcd38bd88f5953219da0')
models = project.get_models()
print(models[:5])
>>> [Model(Decision Tree Classifier (Gini)),
     Model(Auto-tuned K-Nearest Neighbors Classifier (Minkowski Distance)),
     Model(Gradient Boosted Trees Classifier (R)),
     Model(Gradient Boosted Trees Classifier),
     Model(Logistic Regression)]
model = models[0]

project.id
>>> u'5506fcd38bd88f5953219da0'
model.id
>>> u'5506fcd98bd88f1641a720a3'

You can pass the following parameters to change the result:

search_params - A dict. Used to filter the returned models. Currently, you can query models by name, sample_pct, and is_starred.
order_by - A str or list. If passed, the returned models are ordered by the given attribute or attributes. You can sort by the metric and sample_pct attributes.

If the sort attribute is preceded by a hyphen, models are sorted in descending order; otherwise, they are sorted in ascending order. Multiple sort attributes can be passed as a comma-delimited string or as a list, e.g., order_by='sample_pct,-metric' or order_by=['sample_pct', '-metric']. Sorting by metric ranks models by how well their validation score performs on the project metric.

with_metric - A str. If not None, the returned models only have scores for this metric. Otherwise, all metrics are returned.

Review an example of listing models below.

import datarobot as dr

dr.Project('5506fcd38bd88f5953219da0').get_models(order_by=['sample_pct', '-metric'])

# Getting models that contain "Ridge" in name
# and with sample_pct more than 64
dr.Project('5506fcd38bd88f5953219da0').get_models(
    search_params={
        'sample_pct__gt': 64,
        'name': "Ridge"
    })

# Getting models marked as starred
dr.Project('5506fcd38bd88f5953219da0').get_models(
    search_params={
        'is_starred': True
    })
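
The hyphen-prefix ordering rule described above can be illustrated locally on plain dictionaries. This is only a sketch: parse_order_by and sort_locally are hypothetical helpers and not part of the datarobot client, and the score key is a made-up numeric attribute. The server-side meaning of sorting by metric is not modeled here.

```python
def parse_order_by(order_by):
    """Turn 'a,-b' or ['a', '-b'] into [(attribute, descending), ...]."""
    if isinstance(order_by, str):
        order_by = order_by.split(",")
    parsed = []
    for item in order_by:
        item = item.strip()
        descending = item.startswith("-")
        parsed.append((item[1:] if descending else item, descending))
    return parsed


def sort_locally(records, order_by):
    """Sort dicts by several attributes, honoring the hyphen prefix."""
    result = list(records)
    # Python's sort is stable, so apply the least significant key first.
    for attribute, descending in reversed(parse_order_by(order_by)):
        result.sort(key=lambda record: record[attribute], reverse=descending)
    return result


# Made-up records standing in for models.
records = [
    {"name": "A", "sample_pct": 64, "score": 0.71},
    {"name": "B", "sample_pct": 32, "score": 0.65},
    {"name": "C", "sample_pct": 64, "score": 0.80},
]
ordered = sort_locally(records, "sample_pct,-score")
print([record["name"] for record in ordered])  # ['B', 'C', 'A']
```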

Retrieve a known model

If you know the model_id and project_id values of a model, you can retrieve it directly:

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
                     model_id=model_id)

You can also use an instance of Project as the parameter for Model.get.

model = dr.Model.get(project=project,
                     model_id=model_id)

Retrieve the highest scoring model for a given metric

You can retrieve the highest scoring model for a project based on a metric of your choice.

If you do not pass a metric to this method, or if you pass the default project metric (the value of the metric attribute of your project instance), the result of Project.recommended_model is returned.

import datarobot as dr
project = dr.Project.get('5506fcd38bd88f5953219da0')
top_model_r_squared = project.get_top_model(metric="R Squared")

Train a model on a different sample size

One of the key insights into a model and the data behind it is how its performance varies with more training data. In Autopilot, DataRobot runs at several sample sizes by default, but you can also create a job that will run at a specific sample size, or specify a feature list that should be used for training the new model. The Model.train method of a Model instance will put a new modeling job into the queue and return the ID of the created ModelJob. You can pass the model job ID to the wait_for_async_model_creation function, which polls the async model creation status and returns the newly-created model when it’s finished.

import datarobot as dr

model_job_id = model.train(sample_pct=33)

# Retrain a model on a custom featurelist using cross validation.
# Note that you can specify a custom value for `sample_pct`.
model_job_id = model.train(
    sample_pct=55,
    featurelist_id=custom_featurelist.id,
    scoring_type=dr.SCORING_TYPE.cross_validation,
)
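
The polling behavior mentioned above follows a common pattern: check a job's status repeatedly until it finishes or a time limit is reached. The following is a generic sketch of that pattern, not the client's implementation; poll_until_done and check_status are hypothetical names, and the simulated job exists only for illustration.

```python
import time


def poll_until_done(check_status, max_wait=600, poll_interval=5,
                    sleep=time.sleep, clock=time.monotonic):
    """Call check_status() until it reports completion or max_wait elapses.

    check_status must return a (finished, result) tuple.
    """
    deadline = clock() + max_wait
    while True:
        finished, result = check_status()
        if finished:
            return result
        if clock() >= deadline:
            raise TimeoutError("job did not finish within max_wait seconds")
        sleep(poll_interval)


# Simulated job that finishes on the third status check.
responses = iter([(False, None), (False, None), (True, "new-model")])
result = poll_until_done(lambda: next(responses), sleep=lambda seconds: None)
print(result)  # new-model
```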

Cross-validate a model

By default, models are evaluated on the first validation partition. To start cross-validation, use Model.cross_validate:

import datarobot as dr

# cross_validate returns a ModelJob object that tracks the computation.
model_job = model.cross_validate()

For a datetime partitioned project, backtesting is the only cross-validation method supported. To run backtesting for a datetime model, use the DatetimeModel.score_backtests method:

import datarobot as dr

# `model` here must be an instance of `dr.DatetimeModel`.
# score_backtests returns a job object that tracks the backtest computation.
job = model.score_backtests()

Find the features used

Because each project can have many associated feature lists, it is important to know which features a model requires in order to run. This helps ensure that the necessary features are provided when generating predictions.

feature_names = model.get_features_used()
print(feature_names)
>>> ['MonthlyIncome',
     'VisitsLast8Weeks',
     'Age']

Feature Impact

Feature Impact measures how much worse a model’s error score would be if DataRobot made predictions after randomly shuffling a particular column (a technique sometimes called Permutation Importance).
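
To make the idea concrete, the following is a minimal, self-contained sketch of permutation importance on toy data. It is not how DataRobot computes Feature Impact internally; the toy predict function, the mean squared error metric, and the helper names are assumptions chosen for illustration.

```python
import random


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def permutation_importance(predict, rows, target, column, seed=0):
    """Increase in error after shuffling one column of the data."""
    baseline = mean_squared_error(target, [predict(row) for row in rows])
    shuffled_values = [row[column] for row in rows]
    random.Random(seed).shuffle(shuffled_values)
    shuffled_rows = [
        row[:column] + [value] + row[column + 1:]
        for row, value in zip(rows, shuffled_values)
    ]
    shuffled_error = mean_squared_error(
        target, [predict(row) for row in shuffled_rows]
    )
    return shuffled_error - baseline


# Toy model that only uses column 0; column 1 is ignored.
def predict(row):
    return 2 * row[0]


rows = [[float(i), float(i % 3)] for i in range(10)]
target = [2 * row[0] for row in rows]

# Shuffling the column the model relies on increases the error.
print(permutation_importance(predict, rows, target, column=0))
# Shuffling the ignored column leaves the error unchanged.
print(permutation_importance(predict, rows, target, column=1))
```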

The following example shows how to create a feature list that contains only the features with the highest Feature Impact.

import datarobot as dr

max_num_features = 10
time_to_wait_for_impact = 4 * 60  # seconds

feature_impacts = model.get_or_request_feature_impact(time_to_wait_for_impact)

feature_impacts.sort(key=lambda x: x['impactNormalized'], reverse=True)
final_names = [f['featureName'] for f in feature_impacts[:max_num_features]]

project.create_featurelist('highest_impact', final_names)

For datetime-aware models, Feature Impact can be calculated for any backtest and holdout.

import datarobot as dr

# The backtest argument is specific to datetime models, so retrieve a DatetimeModel.
datetime_model = dr.DatetimeModel.get(project=project_id, model_id=model_id)
feature_impacts = datetime_model.get_or_request_feature_impact(backtest=1, with_metadata=True)

Feature Effects

Feature Effects helps you understand how changing a single feature affects the target while holding all other features constant. It provides partial dependence plot data and predicted vs. actual plot data.

import datarobot as dr

feature_effects = model.get_or_request_feature_effect(source='validation')
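
As an illustration of what partial dependence data represents, the sketch below forces one feature to each value on a grid for every row and averages the resulting predictions. It is a simplified sketch on toy data, not DataRobot's implementation; the predict function and the grid values are made up.

```python
def partial_dependence(predict, rows, column, grid):
    """Average prediction when `column` is forced to each grid value."""
    curve = []
    for value in grid:
        modified = [row[:column] + [value] + row[column + 1:] for row in rows]
        predictions = [predict(row) for row in modified]
        curve.append((value, sum(predictions) / len(predictions)))
    return curve


# Toy model: the prediction rises by 3 for each unit of column 0.
def predict(row):
    return 3 * row[0] + row[1]


rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
curve = partial_dependence(predict, rows, column=0, grid=[0.0, 1.0, 2.0])
print(curve)  # [(0.0, 20.0), (1.0, 23.0), (2.0, 26.0)]
```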

For multiclass models, use the request_feature_effects_multiclass and get_feature_effects_multiclass methods, or the get_or_request_feature_effects_multiclass method.

import datarobot as dr

feature_effects = model.get_feature_effects_multiclass(source='validation')

Predict new data

After creating models, you can use them to generate predictions on new data. See the predictions documentation for further information on how to request predictions from a model.

Model IDs vs. blueprint IDs

Each model has both a model_id and a blueprint_id.

A model is the result of training a blueprint on a dataset at a specified sample percentage. The blueprint_id is used to keep track of which blueprint was used to train the model, while the model_id is used to locate the trained model in the system.
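
Because one blueprint can be trained several times (for example, at different sample sizes), several models can share a blueprint_id. The sketch below groups models by blueprint using stand-in objects with made-up IDs; group_by_blueprint is a hypothetical helper, and it relies only on the id and blueprint_id attributes.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_blueprint(models):
    """Map each blueprint_id to the ids of the models trained from it."""
    grouped = defaultdict(list)
    for model in models:
        grouped[model.blueprint_id].append(model.id)
    return dict(grouped)


# Stand-in objects with made-up ids, used here instead of real Model instances.
models = [
    SimpleNamespace(id="model-1", blueprint_id="bp-a", sample_pct=16),
    SimpleNamespace(id="model-2", blueprint_id="bp-a", sample_pct=64),
    SimpleNamespace(id="model-3", blueprint_id="bp-b", sample_pct=64),
]
print(group_by_blueprint(models))
# {'bp-a': ['model-1', 'model-2'], 'bp-b': ['model-3']}
```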

Model parameters

Some models can have parameters that provide data needed to reproduce their predictions.

For additional usage information, see Coefficients.

import datarobot as dr

model = dr.Model.get(project=project, model_id=model_id)
mp = model.get_parameters()
print(mp.derived_features)
>>> [{
         'coefficient': -0.015,
         'originalFeature': u'A1Cresult',
         'derivedFeature': u'A1Cresult->7',
         'type': u'CAT',
         'transformations': [{'name': u'One-hot', 'value': u"'>7'"}]
    }]

Create a blender model

You can blend multiple models; in many cases, the resulting blender model is more accurate than the parent models. To do so, you need to select parent models and a blender method from datarobot.enums.BLENDER_METHOD. If this is a time series project, only methods in datarobot.enums.TS_BLENDER_METHOD are allowed.

Be aware that the tradeoff for better prediction accuracy is higher resource consumption and slower predictions.

import datarobot as dr

pr = dr.Project.get('5506fcd38bd88f5953219da0')
models = pr.get_models()
parent_models = [model.id for model in models[:2]]
pr.blend(parent_models, dr.enums.BLENDER_METHOD.AVERAGE)
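
Conceptually, an average blender combines its parents by averaging their predictions row by row. The sketch below shows only that idea on made-up prediction lists; average_blend is a hypothetical helper, and the way DataRobot actually builds and scores blenders may differ.

```python
def average_blend(parent_predictions):
    """Average row-wise predictions from several parent models."""
    if not parent_predictions:
        raise ValueError("at least one parent model is required")
    row_count = len(parent_predictions[0])
    if any(len(p) != row_count for p in parent_predictions):
        raise ValueError("all parents must predict the same rows")
    return [sum(row) / len(row) for row in zip(*parent_predictions)]


# Made-up predictions from two parent models on the same three rows.
model_a = [0.25, 0.5, 1.0]
model_b = [0.75, 0.5, 0.0]
print(average_blend([model_a, model_b]))  # [0.5, 0.5, 0.5]
```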

Lift chart retrieval

You can use the Model methods get_lift_chart and get_all_lift_charts to retrieve lift chart data. The first retrieves data from a specific source (validation data, cross-validation data, or unlocked Holdout), and the second lists all available data.

For multiclass models, you can get a list of per-class lift charts using the Model method get_multiclass_lift_chart.
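
For readers unfamiliar with lift charts, the sketch below shows one common way such data is derived: rows are sorted by predicted value, split into equal-sized bins, and the mean predicted and mean actual values are computed per bin. The function name and dictionary keys are hypothetical and do not reflect the structure returned by the client.

```python
def lift_chart_bins(actual, predicted, bin_count):
    """Sort rows by prediction, split into bins, and average each bin."""
    pairs = sorted(zip(predicted, actual))
    bins = []
    for index in range(bin_count):
        start = index * len(pairs) // bin_count
        end = (index + 1) * len(pairs) // bin_count
        chunk = pairs[start:end]
        if not chunk:
            continue  # more bins than rows
        bins.append({
            "mean_predicted": sum(p for p, _ in chunk) / len(chunk),
            "mean_actual": sum(a for _, a in chunk) / len(chunk),
            "row_count": len(chunk),
        })
    return bins


# Made-up predictions and binary outcomes.
predicted = [0.9, 0.1, 0.4, 0.8, 0.2, 0.6]
actual = [1, 0, 0, 1, 0, 1]
for row in lift_chart_bins(actual, predicted, bin_count=3):
    print(row)
```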

ROC curve retrieval

As with the lift chart, you can use the Model methods get_roc_curve and get_all_roc_curves to retrieve ROC curve data. More information about working with ROC curves can be found in ROC curve.
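
As background, each point on a ROC curve pairs a false positive rate with a true positive rate at a given classification threshold. The sketch below computes such points from made-up labels and predictions; roc_points is a hypothetical helper and does not mirror the client's return format.

```python
def roc_points(actual, predicted, thresholds):
    """Return (threshold, false positive rate, true positive rate) tuples.

    Assumes `actual` contains at least one 1 and at least one 0.
    """
    positives = sum(1 for label in actual if label == 1)
    negatives = len(actual) - positives
    points = []
    for threshold in thresholds:
        true_pos = sum(
            1 for a, p in zip(actual, predicted) if p >= threshold and a == 1
        )
        false_pos = sum(
            1 for a, p in zip(actual, predicted) if p >= threshold and a == 0
        )
        points.append((threshold, false_pos / negatives, true_pos / positives))
    return points


actual = [1, 0, 1, 0]
predicted = [0.9, 0.6, 0.4, 0.1]
print(roc_points(actual, predicted, thresholds=[0.0, 0.5, 1.0]))
# [(0.0, 1.0, 1.0), (0.5, 0.5, 0.5), (1.0, 0.0, 0.0)]
```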

Residuals chart retrieval

Just as with the lift and ROC charts, you can use Model methods get_residuals_chart and get_all_residuals_charts to retrieve residuals chart data. The first will get it from a specific source (validation data, cross-validation data, or unlocked Holdout). The second retrieves all available data.

Word cloud

If your dataset contains text columns, DataRobot can create text processing models that will contain word cloud insight data. An example of such a model is any “Auto-Tuned Word N-Gram Text Modeler” model. You can use the Model.get_word_cloud method to retrieve those insights; it provides up to 200 of the most important ngrams in the model, along with coefficients corresponding to their influence.
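
One way to work with such data is to rank ngrams by the absolute value of their coefficient, since a large negative coefficient is as influential as a large positive one. The sketch below does this on made-up dictionaries; most_influential and the dictionary keys are assumptions for illustration, not the client's data structure.

```python
def most_influential(ngrams, top_n):
    """Return the top_n entries ranked by absolute coefficient."""
    ranked = sorted(
        ngrams, key=lambda item: abs(item["coefficient"]), reverse=True
    )
    return ranked[:top_n]


# Made-up ngram data.
ngrams = [
    {"ngram": "refund", "coefficient": -0.8},
    {"ngram": "great", "coefficient": 0.6},
    {"ngram": "the", "coefficient": 0.05},
]
print([item["ngram"] for item in most_influential(ngrams, top_n=2)])
# ['refund', 'great']
```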

Scoring Code

A subset of models support code generation. For each of those models, you can download a JAR file with Scoring Code to make predictions locally using model.download_scoring_code. For details on how to do so, see Scoring Code. Optionally, you can download source code in Java to see what calculations those models do internally.

Be aware that the source code JAR isn’t compiled, so it cannot be used for making predictions.

Get a model blueprint chart

For any model, you can retrieve its blueprint chart. You can also get its representation in graphviz DOT format to render it into the format you need.

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
                     model_id=model_id)
bp_chart = model.get_model_blueprint_chart()
print(bp_chart.to_graphviz())

Get a model missing values report

For the majority of models, you can retrieve a missing values report on the training data for each numeric and categorical feature. A model needs at least one of the supported tasks in its blueprint to have a missing values report (blenders are not supported). The report is gathered for Numerical Imputation tasks and categorical converters such as Ordinal Encoding and One-Hot Encoding. The missing values report is available to users with access to the full blueprint docs.

A report is collected for the features that a given blueprint task considers eligible. For instance, a categorical feature with many unique values may not be considered eligible by the One-Hot Encoding task.

Refer to the Missing report attributes description for help interpreting the report.

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id, model_id=model_id)
missing_reports_per_feature = model.get_missing_report_info()
for report_per_feature in missing_reports_per_feature:
    print(report_per_feature)

Consider the following example of a Decision Tree Classifier (Gini) blueprint chart representation. A summary of the results is outlined below.

print(blueprint_chart.to_graphviz())
>>> digraph "Blueprint Chart" {
        graph [rankdir=LR]
        0 [label="Data"]
        -2 [label="Numeric Variables"]
        2 [label="Missing Values Imputed"]
        3 [label="Decision Tree Classifier (Gini)"]
        4 [label="Prediction"]
        -1 [label="Categorical Variables"]
        1 [label="Ordinal encoding of categorical variables"]
        0 -> -2
        -2 -> 2
        2 -> 3
        3 -> 4
        0 -> -1
        -1 -> 1
        1 -> 3
    }

And a missing report:

print(report_per_feature1)
>>> {'feature': 'Veh Year',
     'type': 'Numeric',
     'missing_count': 150,
     'missing_percentage': 50.00,
     'tasks': [
                {'id': u'2',
                'name': u'Missing Values Imputed',
                'descriptions': [u'Imputed value: 2006']
                }
        ]
      }
print(report_per_feature2)
>>> {'feature': 'Model',
     'type': 'Categorical',
     'missing_count': 100,
     'missing_percentage': 33.33,
     'tasks': [
                {'id': u'1',
                'name': u'Ordinal encoding of categorical variables',
                'descriptions': [u'Imputed value: -2']
                }
          ]
        }

The numeric feature “Veh Year” has 150 missing values, which is 50% of the training data. It was transformed by the “Missing Values Imputed” task with an imputed value of 2006. The task has ID 2, and the chart shows that its output goes into the Decision Tree Classifier (Gini).

The “Model” categorical feature was transformed by the “Ordinal encoding of categorical variables” task with an imputed value of -2.
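
The step from a task ID in the missing report to the task that consumes its output can also be done programmatically. The sketch below parses the DOT text shown earlier with regular expressions; downstream_tasks is a hypothetical helper that handles only this simple layout, not the DOT language in general.

```python
import re

DOT_TEXT = '''digraph "Blueprint Chart" {
    graph [rankdir=LR]
    0 [label="Data"]
    -2 [label="Numeric Variables"]
    2 [label="Missing Values Imputed"]
    3 [label="Decision Tree Classifier (Gini)"]
    4 [label="Prediction"]
    -1 [label="Categorical Variables"]
    1 [label="Ordinal encoding of categorical variables"]
    0 -> -2
    -2 -> 2
    2 -> 3
    3 -> 4
    0 -> -1
    -1 -> 1
    1 -> 3
}'''

NODE_PATTERN = re.compile(r'^\s*(-?\d+) \[label="(.*)"\]\s*$')
EDGE_PATTERN = re.compile(r'^\s*(-?\d+) -> (-?\d+)\s*$')


def downstream_tasks(dot_text, task_id):
    """Return the labels of the nodes that receive output from task_id."""
    labels, edges = {}, []
    for line in dot_text.splitlines():
        node = NODE_PATTERN.match(line)
        edge = EDGE_PATTERN.match(line)
        if node:
            labels[node.group(1)] = node.group(2)
        elif edge:
            edges.append((edge.group(1), edge.group(2)))
    return [labels[target] for source, target in edges if source == task_id]


# Task IDs in the missing report are strings, e.g. '2' and '1'.
print(downstream_tasks(DOT_TEXT, "2"))  # ['Decision Tree Classifier (Gini)']
print(downstream_tasks(DOT_TEXT, "1"))  # ['Decision Tree Classifier (Gini)']
```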

Get a blueprint’s documentation

You can retrieve documentation on the tasks used to build a model. It contains information about each task, its parameters, and (when available) links and references to additional sources. All documents are instances of the BlueprintTaskDocument class.

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
                     model_id=model_id)
docs = model.get_model_blueprint_documents()
print(docs[0].task)
>>> Average Blend
print(docs[0].links[0]['url'])
>>> https://en.wikipedia.org/wiki/Ensemble_learning

Request training predictions

You can request a model’s predictions for a particular subset of its training data. See the datarobot.models.Model.request_training_predictions() reference for all valid subsets.

See training predictions reference for more details.

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
                     model_id=model_id)
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.HOLDOUT)
training_predictions = training_predictions_job.get_result_when_complete()
for row in training_predictions.iterate_rows():
    print(row.row_id, row.prediction)

Advanced tuning

You can perform advanced tuning on a model, which generates a new model by taking an existing model and rerunning it with modified tuning parameters.

The AdvancedTuningSession class exists to track the creation of an advanced tuning model on the client. It enables browsing and setting advanced tuning parameters one at a time, and using human-readable parameter names rather than requiring opaque parameter IDs in all cases. No information is sent to the server until the run() method is called on the AdvancedTuningSession.

See the datarobot.models.Model.get_advanced_tuning_parameters() reference for a description of the types of parameters that can be passed in.

As of v2.17 of the Python client, all models other than blenders, open source, and user-created models support Advanced Tuning. The use of Advanced Tuning via the API for non-Eureqa models is in beta, but is enabled by default for all users.

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
                     model_id=model_id)
tune = model.start_advanced_tuning_session()

# Get available task names,
# and available parameter names for a task name that exists on this model
tune.get_task_names()
tune.get_parameter_names('Eureqa Generalized Additive Model Classifier (3000 Generations)')

tune.set_parameter(
    task_name='Eureqa Generalized Additive Model Classifier (3000 Generations)',
    parameter_name='EUREQA_building_block__sine',
    value=1)

job = tune.run()

SHAP Feature Impact

SHAP Feature Impact is computed by calculating the SHAP values on a sample of training data and then taking the mean absolute value for each column. A larger value of impact indicates a more important feature.

See the datarobot.models.ShapImpact.create() reference for a description of the types of parameters that can be passed in.

import datarobot as dr

project_id = '5ec3d6884cfad17cd8c0ed62'
model_id = '5ec3d6f44cfad17cd8c0ed78'
shap_impact_job = dr.ShapImpact.create(project_id=project_id, model_id=model_id)
shap_impact = shap_impact_job.get_result_when_complete()
print(shap_impact)
>>> [ShapImpact(count=36)]
print(shap_impact.shap_impacts[:1])
>>> [{'feature_name': 'number_inpatient', 'impact_normalized': 1.0, 'impact_unnormalized': 0.07670175497683789}]

shap_impact = dr.ShapImpact.get(project_id=project_id, model_id=model_id)
print(shap_impact.shap_impacts[:1])
>>> [{'feature_name': 'number_inpatient', 'impact_normalized': 1.0, 'impact_unnormalized': 0.07670175497683789}]
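
The aggregation described at the start of this section can be sketched on a small matrix of SHAP values. This is an illustration only: the values and the second feature name are made up, and dividing by the largest impact to obtain the normalized value is an assumption, not a documented detail of the client.

```python
def shap_impact(shap_rows, feature_names):
    """Mean absolute SHAP value per column, plus a max-normalized copy."""
    row_count = len(shap_rows)
    unnormalized = [
        sum(abs(row[index]) for row in shap_rows) / row_count
        for index in range(len(feature_names))
    ]
    largest = max(unnormalized)
    impacts = [
        {
            "feature_name": name,
            "impact_unnormalized": value,
            # Assumption: normalize so the most important feature is 1.0.
            "impact_normalized": value / largest if largest else 0.0,
        }
        for name, value in zip(feature_names, unnormalized)
    ]
    return sorted(
        impacts, key=lambda item: item["impact_unnormalized"], reverse=True
    )


# One row of SHAP values per data row, one column per feature.
shap_rows = [[0.5, -0.25], [-1.5, 0.25]]
for item in shap_impact(shap_rows, ["number_inpatient", "age"]):
    print(item)
```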

Number of iterations trained

Early-stopping models may train fewer than the maximum number of estimators/iterations defined in advanced tuning. The get_num_iterations_trained method retrieves the actual number of estimators trained by an early-stopping tree-based model (currently the only model type supported). It returns the projectId, modelId, and a list of dictionaries containing the number of iterations trained for each model stage. For single-stage models, this list contains only one entry.

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
                     model_id=model_id)
num_iterations = model.get_num_iterations_trained()
print(num_iterations)
>>> {"projectId": "5506fcd38bd88f5953219da0", "modelId": "5506fcd98bd88f1641a720a3", "data": [{"stage": "FREQ", "numIterations": 250}, {"stage": "SEV", "numIterations": 50}]}
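
If you need the per-stage counts in a more convenient form, a small helper can reshape the result. The sketch below assumes the return value is a dictionary shaped like the output shown above; iterations_by_stage is a hypothetical helper.

```python
def iterations_by_stage(response):
    """Map each stage name to the number of iterations trained."""
    return {entry["stage"]: entry["numIterations"] for entry in response["data"]}


# Same shape as the example output above.
response = {
    "projectId": "5506fcd38bd88f5953219da0",
    "modelId": "5506fcd98bd88f1641a720a3",
    "data": [
        {"stage": "FREQ", "numIterations": 250},
        {"stage": "SEV", "numIterations": 50},
    ],
}
per_stage = iterations_by_stage(response)
print(per_stage)                # {'FREQ': 250, 'SEV': 50}
print(sum(per_stage.values()))  # 300
```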