Models¶
When a blueprint has been trained on a specific dataset at a specified sample size, the result is a model. Models can be inspected to analyze their accuracy.
Start Training a Model¶
To start training a model, use the Project.train
method with
a blueprint object:
import datarobot as dr
project = dr.Project.get('5506fcd38bd88f5953219da0')
blueprints = project.get_blueprints()
model_job_id = project.train(blueprints[0])
For a Datetime Partitioned Project, use
Project.train_datetime
:
import datarobot as dr
project = dr.Project.get('5506fcd38bd88f5953219da0')
blueprints = project.get_blueprints()
model_job_id = project.train_datetime(blueprints[0])
List Finished Models¶
You can use the Project.get_models
method to
return a list of the project models
that have finished training:
import datarobot as dr
project = dr.Project.get('5506fcd38bd88f5953219da0')
models = project.get_models()
print(models[:5])
>>> [Model(Decision Tree Classifier (Gini)),
Model(Auto-tuned K-Nearest Neighbors Classifier (Minkowski Distance)),
Model(Gradient Boosted Trees Classifier (R)),
Model(Gradient Boosted Trees Classifier),
Model(Logistic Regression)]
model = models[0]
project.id
>>> u'5506fcd38bd88f5953219da0'
model.id
>>> u'5506fcd98bd88f1641a720a3'
You can pass following parameters to change result:
search_params
– dict, used to filter returned projects. Currently you can query models byname
sample_pct
is_starred
order_by
– str or list, if passed returned models are ordered by this attribute(s). Allowed attributes to sort by are:metric
sample_pct
If the sort attribute is preceded by a hyphen, models will be sorted in descending order, otherwise in ascending order. Multiple sort attributes can be included as a comma-delimited string or in a list e.g.
order_by='sample_pct,-metric'
ororder_by=['sample_pct', '-metric']
. Using metric to sort by will result in models being sorted according to their validation score by how well they did according to the project metric.with_metric
– str, If not None, the returned models will only have scores for this metric. Otherwise all the metrics are returned.
List Models Example:
import datarobot as dr
dr.Project('5506fcd38bd88f5953219da0').get_models(order_by=['sample_pct', '-metric'])
# Getting models that contain "Ridge" in name
# and with sample_pct more than 64
dr.Project('5506fcd38bd88f5953219da0').get_models(
search_params={
'sample_pct__gt': 64,
'name': "Ridge"
})
# Getting models marked as starred
dr.Project('5506fcd38bd88f5953219da0').get_models(
search_params={
'is_starred': True
})
Retrieve a Known Model¶
If you know the model_id
and project_id
values of a model, you can
retrieve it directly:
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
model_id=model_id)
You can also use an instance of Project
as the parameter for
Model.get
model = dr.Model.get(project=project,
model_id=model_id)
Train a Model on a Different Sample Size¶
One of the key insights into a model and the data behind it is how its
performance varies with more training data.
In Autopilot mode, DataRobot will run at several sample sizes by default,
but you can also create a job that will run at a specific sample size.
You can also specify featurelist that should be used for training of new model
and scoring type.
Model.train
method of Model
instance will
put a new modeling job into the queue and return id of created
ModelJob.
You can pass ModelJob id to the wait_for_async_model_creation function,
which polls async model creation status and returns newly created model when it’s finished.
import datarobot as dr
model_job_id = model.train(sample_pct=33)
# Retrain a model on a custom featurelist using cross validation.
# Note that you can specify a custom value for `sample_pct`.
model_job_id = model.train(
sample_pct=55,
featurelist_id=custom_featurelist.id,
scoring_type=dr.SCORING_TYPE.cross_validation,
)
Cross-Validating a Model¶
By default, models are trained using only the first validation partition. To start
cross-validation, use the Model.cross_validate
method:
import datarobot as dr
model_job_id = model.cross_validate()
For a Datetime Partitioned Project, backtesting is
the only cross-validation method supported. To run backtesting for a datetime model, use the
DatetimeModel.score_backtests
method:
import datarobot as dr
# `model` here must be an instance of `dr.DatetimeModel`.
model_job_id = model.score_backtests()
Find the Features Used¶
Because each project can have many associated featurelists, it is important to know which features a model requires in order to run. This helps ensure that the the necessary features are provided when generating predictions.
feature_names = model.get_features_used()
print(feature_names)
>>> ['MonthlyIncome',
'VisitsLast8Weeks',
'Age']
Feature Impact¶
Feature Impact measures how much worse a model’s error score would be if DataRobot made predictions after randomly shuffling a particular column (a technique sometimes called Permutation Importance).
The following example code snippet shows how a featurelist with just the features with the highest feature impact could be created.
import datarobot as dr
max_num_features = 10
time_to_wait_for_impact = 4 * 60 # seconds
feature_impacts = model.get_or_request_feature_impact(time_to_wait_for_impact)
feature_impacts.sort(key=lambda x: x['impactNormalized'], reverse=True)
final_names = [f['featureName'] for f in feature_impacts[:max_num_features]]
project.create_featurelist('highest_impact', final_names)
Feature Effects¶
Feature Effects helps to understand how keeping all other feature data constant, how change in selected feature data impacts the target. Feature Effects provides partial dependence plot and prediction vs accuracy plot data.
import datarobot as dr
feature_effects = model.get_or_request_feature_effect(source='validation')
Predict new data¶
After creating models you can use them to generate predictions on new data. See PredictJob for further information on how to request predictions from a model.
Model IDs Vs. Blueprint IDs¶
Each model has both an model_id
and a blueprint_id
. What is the difference between these two IDs?
A model is the result of training a blueprint on a dataset at a specified
sample percentage. The blueprint_id
is used to keep track of which
blueprint was used to train the model, while the model_id
is used to
locate the trained model in the system.
Model parameters¶
Some models can have parameters that provide data needed to reproduce its predictions.
For additional usage information see DataRobot documentation, section “Coefficients tab and pre-processing details”
import datarobot as dr
model = dr.Model.get(project=project, model_id=model_id)
mp = model.get_parameters()
print(mp.derived_features)
>>> [{
'coefficient': -0.015,
'originalFeature': u'A1Cresult',
'derivedFeature': u'A1Cresult->7',
'type': u'CAT',
'transformations': [{'name': u'One-hot', 'value': u"'>7'"}]
}]
Create a Blender¶
You can blend multiple models; in many cases, the resulting blender model is more accurate
than the parent models. To do so you need to select parent models and a blender method from
datarobot.enums.BLENDER_METHOD
. If this is a time series project, only methods in
datarobot.enums.TS_BLENDER_METHOD
are allowed.
Be aware that the tradeoff for better prediction accuracy is bigger resource consumption and slower predictions.
import datarobot as dr
pr = dr.Project.get(pid)
models = pr.get_models()
parent_models = [model.id for model in models[:2]]
pr.blend(parent_models, dr.enums.BLENDER_METHOD.AVERAGE)
Lift chart retrieval¶
You can use Model
methods get_lift_chart
and get_all_lift_charts
to retrieve
lift chart data. First will get it from specific source (validation data, cross validation or
holdout, if holdout unlocked) and second will list all available data. Please refer to
Advanced model information notebook for additional
information about lift charts and how they can be visualised.
For multiclass models you can get list of per-class lift charts using Model
method get_multiclass_lift_chart
.
ROC curve retrieval¶
Same as with the lift chart you can use Model
methods get_roc_curve
and
get_all_roc_curves
to retrieve ROC curve data. Please refer to
Advanced model information notebook for additional
information about ROC curves and how they can be visualised. More information about working with ROC
curves can be found in DataRobot web application documentation section “ROC Curve tab details”.
Residuals chart retrieval¶
Just as with the lift and ROC charts, you can use Model
methods get_residuals_chart
and
get_all_residuals_charts
to retrieve residuals chart data. The first will get it from a
specific source (validation data, cross-validation data, or holdout, if unlocked). The second
will retrieve all available data. Please refer to the
Advanced model information
notebook for more information about residuals charts and how they can be visualised.
Word Cloud¶
If your dataset contains text columns, DataRobot can create text processing models that will
contain word cloud insight data. An example of such model is any “Auto-Tuned Word N-Gram Text
Modeler” model. You can use Model.get_word_cloud
method to retrieve those insights - it will
provide up to 200 most important ngrams in the model and data about their influence.
The Advanced model information notebook contains
examples of how you can use that data and build a visualization in a way similar to how the
DataRobot webapp does.
Scoring Code¶
Subset of models in DataRobot supports code generation. For each of those models you can download
a JAR file with scoring code to make predictions locally using method
Model.download_scoring_code
. For details on how to do that see “Code Generation” section in
DataRobot web application documentation. Optionally you can download source code in Java to see
what calculations those models do internally.
Be aware that source code JAR isn’t compiled so it cannot be used for making predictions.
Get a model blueprint chart¶
For all models you can retrieve its blueprint chart. You can also get its representation in graphviz DOT format to render it into format you need.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
model_id=model_id)
bp_chart = model.get_model_blueprint_chart()
print(bp_chart.to_graphviz())
Get a model missing values report¶
For the majority of models you can retrieve their missing values reports on training data per each numeric and categorical feature. Model needs to have at least one of the supported tasks in the blueprint in order to have a missing values report (blenders are not supported). Report is gathered for Numerical Imputation tasks and Categorical converters like Ordinal Encoding, One-Hot Encoding etc. Missing values report is available to users with access to full blueprint docs.
Report is collected for those features which are considered eligible by given blueprint task. For instance, categorical feature with a lot of unique values may not be considered as eligible in the One-Hot encoding task.
Please refer to Missing report attributes description for report interpretation.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id, model_id=model_id)
missing_reports_per_feature = model.get_missing_report_info()
for report_per_feature in missing_reports_per_feature:
print(report_per_feature)
Consider following example. Given Decision Tree Classifier (Gini) blueprint chart representation:
print(blueprint_chart.to_graphviz())
>>> digraph "Blueprint Chart" {
graph [rankdir=LR]
0 [label="Data"]
-2 [label="Numeric Variables"]
2 [label="Missing Values Imputed"]
3 [label="Decision Tree Classifier (Gini)"]
4 [label="Prediction"]
-1 [label="Categorical Variables"]
1 [label="Ordinal encoding of categorical variables"]
0 -> -2
-2 -> 2
2 -> 3
3 -> 4
0 -> -1
-1 -> 1
1 -> 3
}
and missing report:
print(report_per_feature1)
>>> {'feature': 'Veh Year',
'type': 'Numeric',
'missing_count': 150,
'missing_percentage': 50.00,
'tasks': [
{'id': u'2',
'name': u'Missing Values Imputed',
'descriptions': [u'Imputed value: 2006']
}
]
}
print(report_per_feature2)
>>> {'feature': 'Model',
'type': 'Categorical',
'missing_count': 100,
'missing_percentage': 33.33,
'tasks': [
{'id': u'1',
'name': u'Ordinal encoding of categorical variables',
'descriptions': [u'Imputed value: -2']
}
]
}
results can be interpreted in the following way:
Numeric feature “Veh Year” has 150 missing values and respectively 50% in training data. It was transformed by “Missing Values Imputed” task with imputed value 2006. Task has id 2, and its output goes into Decision Tree Classifier (Gini) - it can be inferred from the chart.
Categorical feature “Model” was transformed by “Ordinal encoding of categorical variables” task with imputed value -2.
Get a blueprint documentation¶
You can retrieve documentation on tasks used to build a model. It will contain information about task, its parameters and (when available) links and references to additional sources.
All documents are instances of BlueprintTaskDocument
class.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
model_id=model_id)
docs = model.get_model_blueprint_documents()
print(docs[0].task)
>>> Average Blend
print(docs[0].links[0]['url'])
>>> https://en.wikipedia.org/wiki/Ensemble_learning
Request training predictions¶
You can request a model’s predictions for a particular subset of its training data.
See datarobot.models.Model.request_training_predictions()
reference for all the valid subsets.
See training predictions reference for more details.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
model_id=model_id)
training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.HOLDOUT)
training_predictions = training_predictions_job.get_result_when_complete()
for row in training_predictions.iterate_rows():
print(row.row_id, row.prediction)
Advanced Tuning¶
You can perform advanced tuning on a model – generate a new model by taking an existing model and rerunning it with modified tuning parameters.
The AdvancedTuningSession class exists to track the creation of an Advanced Tuning model on the client. It enables browsing and setting advanced-tuning parameters one at a time, and using human-readable parameter names rather than requiring opaque parameter IDs in all cases. No information is sent to the server until the run() method is called on the AdvancedTuningSession.
See datarobot.models.Model.get_advanced_tuning_parameters()
reference for a description
of the types of parameters that can be passed in.
As of v2.17, all models other than blenders, open source, and user-created models support Advanced Tuning. The use of Advanced Tuning via API for non-Eureqa models is in beta, but is enabled by default for all users.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
model_id=model_id)
tune = model.start_advanced_tuning_session()
# Get available task names,
# and available parameter names for a task name that exists on this model
tune.get_task_names()
tune.get_parameter_names('Eureqa Generalized Additive Model Classifier (3000 Generations)')
tune.set_parameter(
task_name='Eureqa Generalized Additive Model Classifier (3000 Generations)',
parameter_name='EUREQA_building_block__sine',
value=1)
job = tune.run()
SHAP Impact¶
You can retrieve SHAP impact scores for features in a model. SHAP impact is computed by calculating the shap values on a sample of training data and then taking the mean absolute value for each column. The larger value of impact indicate more important feature.
See datarobot.models.ShapImpact.create()
reference for a description of the types of parameters
that can be passed in.
import datarobot as dr
project_id = '5ec3d6884cfad17cd8c0ed62'
model_id = '5ec3d6f44cfad17cd8c0ed78'
shap_impact_job = dr.ShapImpact.create(project_id=project_id, model_id=model_id)
shap_impact = shap_impact_job.get_result_when_complete()
print(shap_impact)
>>> [ShapImpact(count=36)]
print(shap_impact.shap_impacts[:1])
>>> [{'feature_name': 'number_inpatient', 'impact_normalized': 1.0, 'impact_unnormalized': 0.07670175497683789}]
shap_impact = dr.ShapImpact.get(project_id=project_id, model_id=model_id)
print(shap_impact.shap_impacts[:1])
>>> [{'feature_name': 'number_inpatient', 'impact_normalized': 1.0, 'impact_unnormalized': 0.07670175497683789}]
Number of Iterations Trained¶
Early-stopping models will train a subset of max estimators/iterations that are defined in advanced tuning. This method allows the user to retrieve the actual number of estimators that were trained by an early-stopping tree-based model (currently the only model type supported). The method returns the projectId, modelId, and a list of dictionaries containing the number of iterations trained for each model stage. In the case of single-stage models, this dictionary will contain only one entry.
import datarobot as dr
project_id = '5506fcd38bd88f5953219da0'
model_id = '5506fcd98bd88f1641a720a3'
model = dr.Model.get(project=project_id,
model_id=model_id)
num_iterations = model.get_num_iterations_trained()
print(num_iterations)
>>> {"projectId": "5506fcd38bd88f5953219da0", "modelId": "5506fcd98bd88f1641a720a3", "data" [{"stage": "FREQ", "numIterations":250}, {"stage":"SEV", "numIterations":50}]}