Models

When a blueprint has been trained on a specific dataset at a specified sample size, the result is a model. Models can be inspected to analyze their accuracy.

Quick Reference

# Get all models of an existing project

import datarobot as dr
my_projects = dr.Project.list()
project = my_projects[0]
models = project.get_models()

List Finished Models

You can use the get_models method to return a list of the project models that have finished training:

import datarobot as dr
project = dr.Project.get('5506fcd38bd88f5953219da0')
models = project.get_models()
print(models[:5])
>>> [Model(Decision Tree Classifier (Gini)),
     Model(Auto-tuned K-Nearest Neighbors Classifier (Minkowski Distance)),
     Model(Gradient Boosted Trees Classifier (R)),
     Model(Gradient Boosted Trees Classifier),
     Model(Logistic Regression)]
model = models[0]

project.id
>>> u'5506fcd38bd88f5953219da0'
model.id
>>> u'5506fcd98bd88f1641a720a3'

You can pass following parameters to change result:

  • search_params – dict, used to filter returned projects. Currently you can query models by

    • name
    • sample_pct
    • is_starred
  • order_by – str or list, if passed returned models are ordered by this attribute or attributes.

  • with_metric – str, If not None, the returned models will only have scores for this metric. Otherwise all the metrics are returned.

List Models Example:

Project('pid').get_models(order_by=['-created_time', 'sample_pct', 'metric'])

# Getting models that contain "Ridge" in name
# and with sample_pct more than 64
Project('pid').get_models(
    search_params={
        'sample_pct__gt': 64,
        'name': "Ridge"
    })

# Getting models marked as starred
Project('pid').get_models(
    search_params={
        'is_starred': True
    })

Retrieve a Known Model

If you know the model_id and project_id values of a model, you can retrieve it directly:

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
                     model_id=model_id)

You can also use an instance of Project as the parameter for get

model = dr.Model.get(project=project,
                     model_id=model_id)

Train a Model on a Different Sample Size

One of the key insights into a model and the data behind it is how its performance varies with more training data. In Autopilot mode, DataRobot will run at several sample sizes by default, but you can also create a job that will run at a specific sample size. You can also specify featurelist that should be used for training of new model and scoring type. train method of Model instance will put a new modeling job into the queue and return id of created ModelJob. You can pass ModelJob id to wait_for_async_model_creation function, that polls async model creation status and returns newly created model when it’s finished.

model_job_id = model.train(sample_pct=33)

# retraining model on custom featurelist using cross validation
import datarobot as dr
model_job_id = model.train(
    sample_pct=55,
    featurelist_id=custom_featurelist.id,
    scoring_type=dr.SCORING_TYPE.cross_validation,
)

Find the Features Used

Because each project can have many associated featurelists, it is important to know which features a model requires in order to run. This helps ensure that the the necessary features are provided when generating predictions.

feature_names = model.get_features_used()
print(feature_names)
>>> ['MonthlyIncome',
     'VisitsLast8Weeks',
     'Age']

Feature Impact

Feature Impact measures how much worse a model’s error score would be if DataRobot made predictions after randomly shuffling a particular column (a technique sometimes called Permutation Importance).

The following example code snippet shows how a featurelist with just the features with the highest feature impact could be created.

import datarobot as dr

max_num_features = 10
time_to_wait_for_impact = 4 * 60  # seconds

feature_impacts = model.get_or_reqeust_feature_impact(time_to_wait_for_impact)

feature_impacts.sort(key=lambda x: x['impactNormalized'], reverse=True)
final_names = [f['featureName'] for f in feature_impacts[:max_num_features]]

project.create_featurelist('highest_impact', final_names)

Predict new data

After creating models you can use them to generate predictions on new data. See PredictJob for further information on how to request predictions from a model.

Model IDs Vs. Blueprint IDs

Each model has both an model_id and a blueprint_id. What is the difference between these two IDs?

A model is the result of training a blueprint on a dataset at a specified sample percentage. The blueprint_id is used to keep track of which blueprint was used to train the model, while the model_id is used to locate the trained model in the system.

Model parameters

Some models can have parameters that provide data needed to reproduce its predictions.

For additional usage information see DataRobot documentation, section “Coefficients tab and pre-processing details”

import datarobot as dr

model = dr.Model.get(project=project, model_id=model_id)
mp = model.get_parameters()
print mp.derived_features
>>> [{
         'coefficient': -0.015,
         'originalFeature': u'A1Cresult',
         'derivedFeature': u'A1Cresult->7',
         'type': u'CAT',
         'transformations': [{'name': u'One-hot', 'value': u"'>7'"}]
    }]

Create a Blender

You can blend multiple models; in many cases, the resulting blender model is more accurate than the parent models. To do so you need to select parent models and a blender method from datarobot.enums.BLENDER_METHOD.

Be aware that the tradeoff for better prediction accuracy is bigger resource consumption and slower predictions.

import datarobot as dr

pr = dr.Project.get(pid)
models = pr.get_models()
parent_models = [model.id for model in models[:2]]
pr.blend(parent_models, dr.enums.BLENDER_METHOD.AVERAGE)

Lift chart retrieval

You can use Model methods get_lift_chart and get_all_lift_charts to retrieve lift chart data. First will get it from specific source (validation data, cross validation or holdout, if holdout unlocked) and second will list all available data. Please refer to Advanced model information notebook for additional information about lift charts and how they can be visualised.

ROC curve retrieval

Same as with the lift chart you can use Model methods get_roc_curve and get_all_roc_curves to retrieve ROC curve data. Please refer to Advanced model information notebook for additional information about ROC curves and how they can be visualised. More information about working with ROC curves can be found in DataRobot web application documentation section “ROC Curve tab details”.

Word Cloud

If your dataset contains text columns, DataRobot can create text processing models that will contain word cloud insight data. An example of such model is any “Auto-Tuned Word N-Gram Text Modeler” model. You can use Model.get_word_cloud method to retrieve those insights - it will provide up to 200 most important ngrams in the model and data about their influence. The Advanced model information notebook contains examples of how you can use that data and build a visualization in a way similar to how the DataRobot webapp does.

Scoring Code

Subset of models in DataRobot supports code generation. For each of those models you can download a JAR file with scoring code to make predictions locally using method Model.download_scoring_code. For details on how to do that see “Code Generation” section in DataRobot web application documentation. Optionally you can download source code in Java to see what calculations those models do internally.

Be aware that source code JAR isn’t compiled so it cannot be used for making predictions.

Get a model blueprint chart

For all models you can retrieve its blueprint chart. You can also get its representation in graphviz DOT format to render it into format you need.

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
                     model_id=model_id)
bp_chart = model.get_model_blueprint_chart()
print(bp_chart.to_graphviz())

Get a model missing values report

For the majority of models you can retrieve their missing values reports on training data per each numeric and categorical feature. Model needs to have at least one of the supported tasks in the blueprint in order to have a missing values report (blenders are not supported). Report is gathered for Numerical Imputation tasks and Categorical converters like Ordinal Encoding, One-Hot Encoding etc. Missing values report is available to users with access to full blueprint docs.

Report is collected for those features which are considered eligible by given blueprint task. For instance, categorical feature with a lot of unique values may not be considered as eligible in the One-Hot encoding task.

Please refer to Missing report attributes description for report interpretation.

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id, model_id=model_id)
missing_reports_per_feature = model.get_missing_report_info()
for report_per_feature in missing_reports_per_feature:
    print(report_per_feature)

Consider following example. Given Decision Tree Classifier (Gini) blueprint chart representation:

print(blueprint_chart.to_graphviz())
>>> digraph "Blueprint Chart" {
        graph [rankdir=LR]
        0 [label="Data"]
        -2 [label="Numeric Variables"]
        2 [label="Missing Values Imputed"]
        3 [label="Decision Tree Classifier (Gini)"]
        4 [label="Prediction"]
        -1 [label="Categorical Variables"]
        1 [label="Ordinal encoding of categorical variables"]
        0 -> -2
        -2 -> 2
        2 -> 3
        3 -> 4
        0 -> -1
        -1 -> 1
        1 -> 3
    }

and missing report:

print(report_per_feature1)
>>> {'feature': 'Veh Year',
     'type': 'Numeric',
     'missing_count': 150,
     'missing_percentage': 50.00,
     'tasks': [
                {'id': u'2',
                'name': u'Missing Values Imputed',
                'descriptions': [u'Imputed value: 2006']
                }
        ]
      }
print(report_per_feature2)
>>> {'feature': 'Model',
     'type': 'Categorical',
     'missing_count': 100,
     'missing_percentage': 33.33,
     'tasks': [
                {'id': u'1',
                'name': u'Ordinal encoding of categorical variables',
                'descriptions': [u'Imputed value: -2']
                }
          ]
        }

results can be interpreted in the following way:

Numeric feature “Veh Year” has 150 missing values and respectively 50% in training data. It was transformed by “Missing Values Imputed” task with imputed value 2006. Task has id 2, and its output goes into Decision Tree Classifier (Gini) - it can be inferred from the chart.

Categorical feature “Model” was transformed by “Ordinal encoding of categorical variables” task with imputed value -2.

Get a blueprint documentation

You can retrieve documentation on tasks used to build a model. It will contain information about task, its parameters and (when available) links and references to additional sources. All documents are instances of BlueprintTaskDocument class.

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
                     model_id=model_id)
docs = model.get_model_blueprint_documents()
print(docs[0].task)
>>> Average Blend
print(docs[0].links[0]['url'])
>>> https://en.wikipedia.org/wiki/Ensemble_learning

Request training predictions

You can request a model’s predictions for a particular subset of its training data. See datarobot.models.Model.request_training_predictions() reference for all the valid subsets.

See training predictions reference for more details.

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
                     model_id=model_id)
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.HOLDOUT)
training_predictions = training_predictions_job.get_result_when_complete()
for row in training_predictions.iterate_rows():
    print(row.row_id, row.prediction)

Advanced Tuning

You can perform advanced tuning on a model – generate a new model by taking an existing model and rerunning it with modified tuning parameters.

The AdvancedTuningSession class exists to track the creation of an Advanced Tuning model on the client. It enables browsing and setting advanced-tuning parameters one at a time, and using human-readable parameter names rather than requiring opaque parameter IDs in all cases. No information is sent to the server until the run() method is called on the AdvancedTuningSession.

See datarobot.models.Model.get_advanced_tuning_parameters() reference for a description of the types of parameters that can be passed in.

As of v2.13, only Eureqa blueprints (blueprints whose title starts with ‘Eureqa’) support Advanced Tuning.

import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
                     model_id=model_id)
tune = model.start_advanced_tuning_session()

# Get available task names,
# and available parameter names for a task name that exists on this model
tune.get_task_names()
tune.get_parameter_names('Eureqa Generalized Additive Model Classifier (3000 Generations)')

tune.set_parameter(
    task_name='Eureqa Generalized Additive Model Classifier (3000 Generations)',
    parameter_name='EUREQA_building_block__sine',
    value=1)

job = tune.run()