Project
- class datarobot.models.Project(id=None, project_name=None, mode=None, target=None, target_type=None, holdout_unlocked=None, metric=None, stage=None, partition=None, positive_class=None, created=None, advanced_options=None, max_train_pct=None, max_train_rows=None, file_name=None, credentials=None, feature_engineering_prediction_point=None, unsupervised_mode=None, use_feature_discovery=None, relationships_configuration_id=None, project_description=None, query_generator_id=None, segmentation=None, partitioning_method=None, catalog_id=None, catalog_version_id=None, use_gpu=None)
A project built from a particular training dataset
- Attributes:
- id: str
the id of the project
- project_name: str
the name of the project
- project_description: str
an optional description for the project
- mode: int
The current autopilot mode. 0: Full Autopilot. 2: Manual Mode. 4: Comprehensive Autopilot. null: Mode not set.
- target: str
the name of the selected target feature
- target_type: str
Indicates what kind of modeling is being done in this project. Options are: ‘Regression’, ‘Binary’ (Binary classification), ‘Multiclass’ (Multiclass classification), ‘Multilabel’ (Multilabel classification).
- holdout_unlocked: bool
whether the holdout has been unlocked
- metric: str
the selected project metric (e.g. LogLoss)
- stage: str
the stage the project has reached - one of datarobot.enums.PROJECT_STAGE
- partition: dict
information about the selected partitioning options
- positive_class: str
for binary classification projects, the selected positive class; otherwise, None
- created: datetime
the time the project was created
- advanced_options: AdvancedOptions
information on the advanced options that were selected for the project settings, e.g. a weights column or a cap of the runtime of models that can advance autopilot stages
- max_train_pct: float
The maximum percentage of the project dataset that can be used without going into the validation data or being too large to submit any blueprint for training
- max_train_rows: int
the maximum number of rows that can be trained on without going into the validation data or being too large to submit any blueprint for training
- file_name: str
The name of the file uploaded for the project dataset
- credentials: list, optional
A list of credentials for the datasets used in relationship configuration (previously graphs). For Feature Discovery projects, the list must be formatted in dictionary record format. Provide the catalogVersionId and credentialId for each dataset that is to be used in the project that requires authentication.
- feature_engineering_prediction_point: str, optional
For time-aware Feature Engineering, this parameter specifies the column from the primary dataset to use as the prediction point.
- unsupervised_mode: bool, optional
(New in version v2.20) defaults to False, indicates whether this is an unsupervised project.
- relationships_configuration_id: str, optional
(New in version v2.21) id of the relationships configuration to use
- query_generator_id: str, optional
(New in version v2.27) id of the query generator applied for time series data prep
- segmentation: dict, optional
information on the segmentation options for a segmented project
- partitioning_method: PartitioningMethod, optional
(New in version v3.0) The partitioning class for this project. This attribute should only be used with newly-created projects and before calling Project.analyze_and_model(). After the project has been aimed, see Project.partition for actual partitioning options.
- catalog_id: str
(New in version v3.0) ID of the dataset used during creation of the project.
- catalog_version_id: str
(New in version v3.0) The object ID of the catalog_version which the project’s dataset belongs to.
- use_gpu: bool
(New in version v3.2) Whether the project allows usage of GPUs.
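As a rough sketch of the dictionary record format described for the credentials attribute, the helper below builds one record per dataset. The helper name build_credentials and the ID strings are made up for illustration and are not part of the datarobot package; only the catalogVersionId and credentialId keys come from the description above.

```python
from typing import Dict, Iterable, List, Tuple


def build_credentials(pairs: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Build one credential record per dataset that requires authentication.

    Hypothetical helper, not part of the datarobot package.
    """
    return [
        {"catalogVersionId": catalog_version_id, "credentialId": credential_id}
        for catalog_version_id, credential_id in pairs
    ]


# Placeholder IDs, for illustration only.
credentials = build_credentials([("catalog-version-1", "credential-1")])
```

The resulting list is the kind of value that could be passed as the credentials argument where one is accepted.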
- set_options(options=None, **kwargs)
Update the advanced options of this project.
Accepts either an AdvancedOptions object or individual keyword arguments. This is an in-place update.
- Raises:
- ValueError
Raised if an object passed to the options parameter is not an AdvancedOptions instance, a valid keyword argument from the AdvancedOptions class, or a combination of an AdvancedOptions instance AND keyword arguments.
- Return type:
None
- get_options()
Return the stored advanced options for this project.
- Returns:
- AdvancedOptions
- Return type:
AdvancedOptions
- classmethod get(project_id)
Gets information about a project.
- Parameters:
- project_idstr
The identifier of the project you want to load.
- Returns:
- projectProject
The queried project
- Return type:
TypeVar(TProject, bound=Project)
Examples
import datarobot as dr
p = dr.Project.get(project_id='54e639a18bd88f08078ca831')
p.id
>>>'54e639a18bd88f08078ca831'
p.project_name
>>>'Some project name'
- classmethod create(cls, sourcedata, project_name='Untitled Project', max_wait=600, read_timeout=600, dataset_filename=None, *, use_case=None)
Creates a project with provided data.
Project creation is an asynchronous process: after the initial request, the client keeps polling the status of the async process responsible for project creation until it finishes. For SDK users, this only means that this method might raise exceptions related to its async nature.
- Parameters:
- sourcedata: basestring, file, pathlib.Path or pandas.DataFrame
Dataset to use for the project. If a string, it can be either a path to a local file, a URL to a publicly available file, or raw file content. If using a file, the filename must consist of ASCII characters only.
- project_name: str, unicode, optional
The name to assign to the empty project.
- max_wait: int, optional
Time in seconds after which project creation is considered unsuccessful.
- read_timeout: int
The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete.
- dataset_filename: string or None, optional
(New in version v2.14) File name to use for dataset. Ignored for url and file path sources.
- use_case: UseCase | string, optional
A single UseCase object or ID to add this new Project to. Must be a kwarg.
- Returns:
- projectProject
Instance with initialized data.
- Raises:
- InputNotUnderstoodError
Raised if sourcedata isn’t one of supported types.
- AsyncFailureError
Polling for status of async process resulted in response with unsupported status code. Beginning in version 2.1, this will be ProjectAsyncFailureError, a subclass of AsyncFailureError
- AsyncProcessUnsuccessfulError
Raised if project creation was unsuccessful
- AsyncTimeoutError
Raised if project creation took more time than specified by the max_wait parameter.
- Return type:
TypeVar(TProject, bound=Project)
Examples
p = Project.create('/home/datasets/somedataset.csv', project_name="New API project")
p.id
>>> '5921731dkqshda8yd28h'
p.project_name
>>> 'New API project'
- classmethod encrypted_string(plaintext)
Sends a string to DataRobot to be encrypted
This is used for passwords that DataRobot uses to access external data sources
- Parameters:
- plaintextstr
The string to encrypt
- Returns:
- ciphertextstr
The encrypted string
- Return type:
str
- classmethod create_from_hdfs(cls, url, port=None, project_name=None, max_wait=600)
Create a project from a datasource on a WebHDFS server.
- Parameters:
- urlstr
The location of the WebHDFS file, both server and full path. Per the DataRobot specification, must begin with hdfs://, e.g. hdfs:///tmp/10kDiabetes.csv
- portint, optional
The port to use. If not specified, will default to the server default (50070)
- project_namestr, optional
A name to give to the project
- max_waitint
The maximum number of seconds to wait before giving up.
- Returns:
- Project
Examples
p = Project.create_from_hdfs('hdfs:///tmp/somedataset.csv', project_name="New API project")
p.id
>>> '5921731dkqshda8yd28h'
p.project_name
>>> 'New API project'
- classmethod create_from_data_source(cls, data_source_id, username=None, password=None, credential_id=None, use_kerberos=None, credential_data=None, project_name=None, max_wait=600, *, use_case=None)
Create a project from a data source. Either data_source or data_source_id should be specified.
- Parameters:
- data_source_id: str
the identifier of the data source.
- username: str, optional
The username for database authentication. If supplied, password must also be supplied.
- password: str, optional
The password for database authentication. The password is encrypted on the server side and never saved or stored. If supplied, username must also be supplied.
- credential_id: str, optional
The ID of the set of credentials to use instead of user and password. Note that with this change, username and password will become optional.
- use_kerberos: bool, optional
Server default is False. If true, use kerberos authentication for database authentication.
- credential_data: dict, optional
The credentials to authenticate with the database, to use instead of user/password or credential ID.
- project_namestr, optional
optional, a name to give to the project.
- max_waitint
optional, the maximum number of seconds to wait before giving up.
- use_case: UseCase | string, optional
A single UseCase object or ID to add this new Project to. Must be a kwarg.
- Returns:
- Project
- Raises:
- InvalidUsageError
Raised if either username or password is passed without the other.
- Return type:
TypeVar(TProject, bound=Project)
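The rule that username and password must be supplied together can be checked before making a request. The sketch below is a hypothetical client-side check, not the library's own code; it raises a plain ValueError so that it stays dependency-free, whereas the method documented above raises InvalidUsageError.

```python
from typing import Optional


def check_login_pair(username: Optional[str], password: Optional[str]) -> None:
    """Reject a username without a password, or a password without a username.

    Hypothetical helper; ValueError stands in for the library's InvalidUsageError.
    """
    # Exactly one of the two being None means the pair is incomplete.
    if (username is None) != (password is None):
        raise ValueError("username and password must be supplied together")
```

Passing neither value is accepted by this check, which matches the case where credential_id or credential_data is used instead.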
- classmethod create_from_dataset(cls, dataset_id, dataset_version_id=None, project_name=None, user=None, password=None, credential_id=None, use_kerberos=None, use_sample_from_dataset=None, credential_data=None, max_wait=600, *, use_case=None)
Create a Project from a datarobot.models.Dataset
- Parameters:
- dataset_id: string
The ID of the dataset entry to use for the project’s Dataset
- dataset_version_id: string, optional
The ID of the dataset version to use for the project dataset. If not specified, the latest version associated with dataset_id is used.
- project_name: string, optional
The name of the project to be created. If not specified, will be “Untitled Project” for database connections, otherwise the project name will be based on the file used.
- user: string, optional
The username for database authentication.
- password: string, optional
The password (in cleartext) for database authentication. The password will be encrypted on the server side in scope of HTTP request and never saved or stored
- credential_id: string, optional
The ID of the set of credentials to use instead of user and password.
- use_kerberos: bool, optional
Server default is False. If true, use kerberos authentication for database authentication.
- use_sample_from_dataset: bool, optional
Server default is False. If true, use the EDA sample for the project instead of the full data. It is optional for datasets between 500 MB and 10 GB. For datasets over 10 GB, this is always set to True on the server side.
- credential_data: dict, optional
The credentials to authenticate with the database, to use instead of user/password or credential ID.
- max_wait: int
optional, the maximum number of seconds to wait before giving up.
- use_case: UseCase | string, optional
A single UseCase object or ID to add this new Project to. Must be a kwarg.
- Returns:
- Project
- Return type:
TypeVar(TProject, bound=Project)
- classmethod create_segmented_project_from_clustering_model(cls, clustering_project_id, clustering_model_id, target, max_wait=600, *, use_case=None)
Create a new segmented project from a clustering model
- Parameters:
- clustering_project_idstr
The identifier of the clustering project you want to use as the base.
- clustering_model_idstr
The identifier of the clustering model you want to use as the segmentation method.
- targetstr
The name of the target column that will be used from the clustering project.
- max_wait: int
optional, the maximum number of seconds to wait before giving up.
- use_case: UseCase | string, optional
A single UseCase object or ID to add this new Project to. Must be a kwarg.
- Returns:
- projectProject
The created project
- Return type:
TypeVar(TProject, bound=Project)
- classmethod from_async(async_location, max_wait=600)
Given a temporary async status location, poll for no more than max_wait seconds until the async process (project creation or setting the target, for example) finishes successfully, then return the ready project.
- Parameters:
- async_locationstr
The URL for the temporary async status resource. This is returned as a header in the response to a request that initiates an async process
- max_waitint
The maximum number of seconds to wait before giving up.
- Returns:
- projectProject
The project, now ready
- Raises:
- ProjectAsyncFailureError
If the server returned an unexpected response while polling for the asynchronous operation to resolve
- AsyncProcessUnsuccessfulError
If the final result of the asynchronous operation was a failure
- AsyncTimeoutError
If the asynchronous operation did not resolve within the time specified
- Return type:
TypeVar(TProject, bound=Project)
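The poll-until-ready behaviour described for from_async can be pictured with a generic loop. This is only an illustrative sketch, not the library's implementation: the function name poll_until_ready, the check callable, and the interval parameter are assumptions, and the built-in TimeoutError stands in for AsyncTimeoutError.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until_ready(
    check: Callable[[], Optional[T]],
    max_wait: float = 600,
    interval: float = 5.0,
) -> T:
    """Call `check` until it returns a non-None result or `max_wait` seconds pass."""
    deadline = time.monotonic() + max_wait
    while True:
        result = check()
        if result is not None:
            return result
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Stand-in for the AsyncTimeoutError raised by the real method.
            raise TimeoutError(f"not ready after {max_wait} seconds")
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))
```

Here check would be a function that inspects the async status resource and returns the finished object once it is available, or None while the process is still running.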
- classmethod start(cls, sourcedata, target=None, project_name='Untitled Project', worker_count=None, metric=None, autopilot_on=True, blueprint_threshold=None, response_cap=None, partitioning_method=None, positive_class=None, target_type=None, unsupervised_mode=False, blend_best_models=None, prepare_model_for_deployment=None, consider_blenders_in_recommendation=None, scoring_code_only=None, min_secondary_validation_model_count=None, shap_only_mode=None, relationships_configuration_id=None, autopilot_with_feature_discovery=None, feature_discovery_supervised_feature_reduction=None, unsupervised_type=None, autopilot_cluster_list=None, bias_mitigation_feature_name=None, bias_mitigation_technique=None, include_bias_mitigation_feature_as_predictor_variable=None, incremental_learning_only_mode=None, incremental_learning_on_best_model=None, number_of_incremental_learning_iterations_before_best_model_selection=None, *, use_case=None)
Chain together project creation, file upload, and target selection.
- Return type:
TypeVar(TProject, bound=Project)
Note
While this function provides a simple means to get started, it does not expose all possible parameters. For advanced usage, using create, set_advanced_options and analyze_and_model directly is recommended.
- Parameters:
- sourcedata: str or pandas.DataFrame
The path to the file to upload. Can be either a path to a local file or a publicly accessible URL (starting with http://, https://, file://, or s3://). If the source is a DataFrame, it will be serialized to a temporary buffer. If using a file, the filename must consist of ASCII characters only.
- target: str, optional
The name of the target column in the uploaded file. Should not be provided if unsupervised_mode is True.
- project_name: str
The project name.
- Returns:
- projectProject
The newly created and initialized project.
- Other Parameters:
- worker_countint, optional
The number of workers that you want to allocate to this project.
- metricstr, optional
The name of metric to use.
- autopilot_on: boolean, default True
Whether or not to begin modeling automatically.
- blueprint_thresholdint, optional
Number of hours the model is permitted to run. Minimum 1
- response_capfloat, optional
Quantile of the response distribution to use for response capping. Must be in range 0.5 .. 1.0.
- partitioning_method: PartitioningMethod object, optional
Instance of one of the Partition Classes defined in datarobot.helpers.partitioning_methods. As an alternative, use Project.set_partitioning_method or Project.set_datetime_partitioning to set the partitioning for the project.
- positive_class: str, float, or int; optional
Specifies a level of the target column that should be treated as the positive class for binary classification. May only be specified for binary classification targets.
- target_type: str, optional
Override the automatically selected target_type. An example usage would be setting target_type=’Multiclass’ when you want to perform a multiclass classification task on a numeric column that has a low cardinality. You can use the TARGET_TYPE enum.
- unsupervised_mode: boolean, default False
Specifies whether to create an unsupervised project.
- blend_best_models: bool, optional
blend best models during Autopilot run
- scoring_code_only: bool, optional
Keep only models that can be converted to scorable java code during Autopilot run.
- shap_only_mode: bool, optional
Keep only models that support SHAP values during Autopilot run. Use SHAP-based insights wherever possible. Defaults to False.
- prepare_model_for_deployment: bool, optional
Prepare model for deployment during Autopilot run. The preparation includes creating reduced feature list models, retraining best model on higher sample size, computing insights and assigning “RECOMMENDED FOR DEPLOYMENT” label.
- consider_blenders_in_recommendation: bool, optional
Include blenders when selecting a model to prepare for deployment in an Autopilot Run. Defaults to False.
- min_secondary_validation_model_count: int, optional
Compute “All backtest” scores (datetime models) or cross validation scores for the specified number of highest ranking models on the Leaderboard, if over the Autopilot default.
- relationships_configuration_idstr, optional
(New in version v2.23) id of the relationships configuration to use
- autopilot_with_feature_discovery: bool, optional.
(New in version v2.23) If true, autopilot will run on a feature list that includes features found via search for interactions.
- feature_discovery_supervised_feature_reduction: bool, optional
(New in version v2.23) Run supervised feature reduction for feature discovery projects.
- unsupervised_typeUnsupervisedTypeEnum, optional
(New in version v2.27) Specifies whether an unsupervised project is anomaly detection or clustering.
- autopilot_cluster_listlist(int), optional
(New in version v2.27) Specifies the list of clusters to build for each model during Autopilot. Specifying multiple values in a list will build models with each number of clusters for the Leaderboard.
- bias_mitigation_feature_namestr, optional
The feature from protected features that will be used in a bias mitigation task to mitigate bias
- bias_mitigation_techniquestr, optional
One of datarobot.enums.BiasMitigationTechnique. Options:
  - ‘preprocessingReweighing’
  - ‘postProcessingRejectionOptionBasedClassification’
The technique by which we’ll mitigate bias, which will inform which bias mitigation task we insert into blueprints.
- include_bias_mitigation_feature_as_predictor_variablebool, optional
Whether we should also use the mitigation feature as an input to the modeler just like any other categorical used for training, i.e. do we want the model to “train on” this feature in addition to using it for bias mitigation.
- use_case: UseCase | string, optional
A single UseCase object or ID to add this new Project to. Must be a kwarg.
- Raises:
- AsyncFailureError
Polling for status of async process resulted in response with unsupported status code
- AsyncProcessUnsuccessfulError
Raised if project creation or target setting was unsuccessful
- AsyncTimeoutError
Raised if project creation or target setting timed out
Examples
Project.start("./tests/fixtures/file.csv", "a_target", project_name="test_name", worker_count=4, metric="a_metric")
This is an example of using a URL to specify the datasource:
Project.start("https://example.com/data/file.csv", "a_target", project_name="test_name", worker_count=4, metric="a_metric")
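Several of the constraints listed for start can be checked locally before uploading data. The sketch below is a hypothetical pre-flight check, not part of the datarobot package; it assumes that both ends of the response_cap range are inclusive, which the description above does not state explicitly.

```python
from typing import Optional


def check_start_arguments(
    target: Optional[str] = None,
    unsupervised_mode: bool = False,
    blueprint_threshold: Optional[int] = None,
    response_cap: Optional[float] = None,
) -> None:
    """Hypothetical check mirroring the documented constraints on Project.start."""
    if unsupervised_mode and target is not None:
        raise ValueError("target should not be provided when unsupervised_mode is True")
    if blueprint_threshold is not None and blueprint_threshold < 1:
        raise ValueError("blueprint_threshold must be at least 1 (hours)")
    # Assumption: the bounds 0.5 and 1.0 are themselves allowed.
    if response_cap is not None and not 0.5 <= response_cap <= 1.0:
        raise ValueError("response_cap must be in the range 0.5 .. 1.0")
```

The server remains the authority on what is valid; a check like this only catches obvious mistakes earlier.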
- classmethod list(search_params=None, use_cases=None, offset=None, limit=None)
Returns the projects associated with this account.
- Parameters:
- search_paramsdict, optional.
If not None, the returned projects are filtered by lookup. Currently you can query projects by: project_name
- use_casesUnion[UseCase, List[UseCase], str, List[str]], optional.
If not None, the returned projects are filtered to those associated with a specific Use Case or Use Cases. Accepts either the entity or the ID.
- offsetint, optional
If provided, specifies the number of results to skip.
- limitint, optional
If provided, specifies the maximum number of results to return. If not provided, returns a maximum of 1000 results.
- Returns:
- projectslist of Project instances
Contains a list of projects associated with this user account.
- Raises:
- TypeError
Raised if the search_params parameter is provided, but is not of a supported type.
- Return type:
List[Project]
Examples
List all projects
p_list = Project.list()
p_list
>>> [Project('Project One'), Project('Two')]
Search for projects by name
Project.list(search_params={'project_name': 'red'})
>>> [Project('Prediction Time'), Project('Fred Project')]
List 2nd and 3rd projects
Project.list(offset=1, limit=2)
>>> [Project('Project 2'), Project('Project 3')]
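Because list returns at most 1000 results when limit is not provided, larger accounts need to page through results with offset and limit. The generator below is a minimal sketch of that pattern; iter_all and page_size are hypothetical names, and the function works with any callable that accepts offset and limit keyword arguments.

```python
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_all(fetch_page: Callable[..., List[T]], page_size: int = 100) -> Iterator[T]:
    """Yield every item by requesting consecutive pages with offset and limit."""
    offset = 0
    while True:
        page = fetch_page(offset=offset, limit=page_size)
        yield from page
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size:
            break
        offset += len(page)
```

With the client installed and configured, a call such as iter_all(Project.list) would walk through all projects one page at a time.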
- refresh()
Fetches the latest state of the project, and updates this object with that information. This is an in place update, not a new object.
- Return type:
None
- delete()
Removes this project from your account.
- Return type:
None
- analyze_and_model(target=None, mode='quick', metric=None, worker_count=None, positive_class=None, partitioning_method=None, featurelist_id=None, advanced_options=None, max_wait=600, target_type=None, credentials=None, feature_engineering_prediction_point=None, unsupervised_mode=False, relationships_configuration_id=None, class_mapping_aggregation_settings=None, segmentation_task_id=None, unsupervised_type=None, autopilot_cluster_list=None, use_gpu=None)
Set target variable of an existing project and begin the autopilot process or send data to DataRobot for feature analysis only if manual mode is specified.
Any options saved using set_options will be used if nothing is passed to advanced_options. However, saved options will be ignored if advanced_options are passed.
Target setting is an asynchronous process: after the initial request, the client keeps polling the status of the async process responsible for target setting until it finishes. For SDK users, this only means that this method might raise exceptions related to its async nature.
When execution returns to the caller, the autopilot process will already have commenced (again, unless manual mode is specified).
- Parameters:
- targetstr, optional
The name of the target column in the uploaded file. Should not be provided if unsupervised_mode is True.
- mode: str, optional
You can use the AUTOPILOT_MODE enum to choose between:
AUTOPILOT_MODE.FULL_AUTO
AUTOPILOT_MODE.MANUAL
AUTOPILOT_MODE.QUICK
AUTOPILOT_MODE.COMPREHENSIVE: Runs all blueprints in the repository (warning: this may be extremely slow).
If unspecified, QUICK is used. If the MANUAL value is used, the model creation process will need to be started by executing the start_autopilot function with the desired featurelist. It will start immediately otherwise.
- metric: str, optional
Name of the metric to use for evaluating models. You can query the metrics available for the target by way of Project.get_metrics. If none is specified, then the default recommended by DataRobot is used.
- worker_count: int, optional
The number of concurrent workers to request for this project. If None, then the default is used. (New in version v2.14) Setting this to -1 will request the maximum number available to your account.
- partitioning_method: PartitioningMethod object, optional
Instance of one of the Partition Classes defined in datarobot.helpers.partitioning_methods. As an alternative, use Project.set_partitioning_method or Project.set_datetime_partitioning to set the partitioning for the project.
- positive_class: str, float, or int; optional
Specifies a level of the target column that should be treated as the positive class for binary classification. May only be specified for binary classification targets.
- featurelist_idstr, optional
Specifies which feature list to use.
- advanced_optionsAdvancedOptions, optional
Used to set advanced options of project creation. Will override any options saved using set_options.
- max_wait: int, optional
Time in seconds after which target setting is considered unsuccessful.
- target_typestr, optional
Override the automatically selected target_type. An example usage would be setting target_type=’Multiclass’ when you want to perform a multiclass classification task on a numeric column that has a low cardinality. You can use the TARGET_TYPE enum.
- credentials: list, optional
A list of credentials for the datasets used in relationship configuration (previously graphs).
- feature_engineering_prediction_point: str, optional
For time-aware Feature Engineering, this parameter specifies the column from the primary dataset to use as the prediction point.
- unsupervised_mode: boolean, default False
(New in version v2.20) Specifies whether to create an unsupervised project. If True, target may not be provided.
- relationships_configuration_id: str, optional
(New in version v2.21) ID of the relationships configuration to use.
- segmentation_task_idstr or SegmentationTask, optional
(New in version v2.28) The segmentation task that should be used to split the project for segmented modeling.
- unsupervised_typeUnsupervisedTypeEnum, optional
(New in version v2.27) Specifies whether an unsupervised project is anomaly detection or clustering.
- autopilot_cluster_listlist(int), optional
(New in version v2.27) Specifies the list of clusters to build for each model during Autopilot. Specifying multiple values in a list will build models with each number of clusters for the Leaderboard.
- use_gpubool, optional
(New in version v3.2) Specifies whether project should use GPUs
- Returns:
- projectProject
The instance with updated attributes.
- Raises:
- AsyncFailureError
Polling for status of async process resulted in response with unsupported status code
- AsyncProcessUnsuccessfulError
Raised if target setting was unsuccessful
- AsyncTimeoutError
Raised if target setting took more time than specified by the max_wait parameter.
- TypeError
Raised if advanced_options, partitioning_method or target_type is provided, but is not of a supported type.
See also
datarobot.models.Project.start
combines project creation, file upload, and target selection. Provides fewer options, but is useful for getting started quickly.
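The precedence between options saved with set_options and options passed to analyze_and_model can be summarised in a few lines. This is a sketch of the documented rule only, not the library's code; effective_advanced_options is a hypothetical name, and plain dictionaries stand in for AdvancedOptions objects.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def effective_advanced_options(saved: Optional[T], passed: Optional[T]) -> Optional[T]:
    """Return the options that apply: `passed` wins outright, `saved` is the fallback."""
    # No merging takes place: when `passed` is given, `saved` is ignored entirely.
    return passed if passed is not None else saved
```

In other words, a value stored earlier through set_options is used only when the advanced_options argument is left as None.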
- set_target(target=None, mode='quick', metric=None, worker_count=None, positive_class=None, partitioning_method=None, featurelist_id=None, advanced_options=None, max_wait=600, target_type=None, credentials=None, feature_engineering_prediction_point=None, unsupervised_mode=False, relationships_configuration_id=None, class_mapping_aggregation_settings=None, segmentation_task_id=None, unsupervised_type=None, autopilot_cluster_list=None)
Set target variable of an existing project and begin the Autopilot process (unless manual mode is specified).
Target setting is an asynchronous process, which means that after initial request DataRobot keeps polling status of an async process that is responsible for target setting until it’s finished. For SDK users, this method might raise exceptions related to its async nature.
When execution returns to the caller, the Autopilot process will already have commenced (again, unless manual mode is specified).
- Parameters:
- targetstr, optional
The name of the target column in the uploaded file. Should not be provided if unsupervised_mode is True.
- mode: str, optional
You can use the AUTOPILOT_MODE enum to choose between:
AUTOPILOT_MODE.FULL_AUTO
AUTOPILOT_MODE.MANUAL
AUTOPILOT_MODE.QUICK
AUTOPILOT_MODE.COMPREHENSIVE: Runs all blueprints in the repository (warning: this may be extremely slow).
If unspecified, QUICK mode is used. If the MANUAL value is used, the model creation process needs to be started by executing the start_autopilot function with the desired feature list. It will start immediately otherwise.
- metric: str, optional
Name of the metric to use for evaluating models. You can query the metrics available for the target by way of Project.get_metrics. If none is specified, then the default recommended by DataRobot is used.
- worker_count: int, optional
The number of concurrent workers to request for this project. If None, then the default is used. (New in version v2.14) Setting this to -1 will request the maximum number available to your account.
- positive_classstr, float, or int; optional
Specifies a level of the target column that should be treated as the positive class for binary classification. May only be specified for binary classification targets.
- partitioning_method: PartitioningMethod object, optional
Instance of one of the Partition Classes defined in datarobot.helpers.partitioning_methods. As an alternative, use Project.set_partitioning_method or Project.set_datetime_partitioning to set the partitioning for the project.
- featurelist_id: str, optional
Specifies which feature list to use.
- advanced_optionsAdvancedOptions, optional
Used to set advanced options of project creation.
- max_waitint, optional
Time in seconds after which target setting is considered unsuccessful.
- target_typestr, optional
Override the automatically selected target_type. An example usage would be setting target_type=’Multiclass’ when you want to perform a multiclass classification task on a numeric column that has a low cardinality. You can use the TARGET_TYPE enum.
- credentials: list, optional
A list of credentials for the datasets used in relationship configuration (previously graphs).
- feature_engineering_prediction_pointstr, optional
For time-aware Feature Engineering, this parameter specifies the column from the primary dataset to use as the prediction point.
- unsupervised_mode: boolean, default False
(New in version v2.20) Specifies whether to create an unsupervised project. If True, target may not be provided.
- relationships_configuration_id: str, optional
(New in version v2.21) ID of the relationships configuration to use.
- class_mapping_aggregation_settingsClassMappingAggregationSettings, optional
Instance of datarobot.helpers.ClassMappingAggregationSettings
- segmentation_task_idstr or SegmentationTask, optional
(New in version v2.28) The segmentation task that should be used to split the project for segmented modeling.
- unsupervised_typeUnsupervisedTypeEnum, optional
(New in version v2.27) Specifies whether an unsupervised project is anomaly detection or clustering.
- autopilot_cluster_listlist(int), optional
(New in version v2.27) Specifies the list of clusters to build for each model during Autopilot. Specifying multiple values in a list will build models with each number of clusters for the Leaderboard.
- Returns:
- projectProject
The instance with updated attributes.
- Raises:
- AsyncFailureError
Polling for status of async process resulted in response with unsupported status code.
- AsyncProcessUnsuccessfulError
Raised if target setting was unsuccessful.
- AsyncTimeoutError
Raised if target setting took more time than specified by the max_wait parameter.
- TypeError
Raised if advanced_options, partitioning_method or target_type is provided, but is not of a supported type.
See also
datarobot.models.Project.start
Combines project creation, file upload, and target selection. Provides fewer options, but is useful for getting started quickly.
datarobot.models.Project.analyze_and_model
The method replacing set_target after it is removed.
- get_model_records(sort_by_partition='validation', sort_by_metric=None, with_metric=None, search_term=None, featurelists=None, families=None, blueprints=None, labels=None, characteristics=None, training_filters=None, number_of_clusters=None, limit=100, offset=0)
Retrieve paginated model records, sorted by scores, with optional filtering.
- Parameters:
- sort_by_partition: str, one of `validation`, `backtesting`, `crossValidation` or `holdout`
Set the partition to use for sorted (by score) list of models. validation is the default.
- sort_by_metric: str
Set the project metric to use for model sorting. The DataRobot-selected project optimization metric is the default.
- with_metric: str
For a single-metric list of results, specify that project metric.
- search_term: str
If specified, only models containing the term in their name or processes are returned.
- featurelists: list of str
If specified, only models trained on selected featurelists are returned.
- families: list of str
If specified, only models belonging to selected families are returned.
- blueprints: list of str
If specified, only models trained on specified blueprint IDs are returned.
- labels: list of str, `starred` or `prepared for deployment`
If specified, only models tagged with all listed labels are returned.
- characteristics: list of str
If specified, only models matching all listed characteristics are returned. Possible values: “frozen”, “trained on gpu”, “with exportable coefficients”, “with mono constraints”, “with rating table”, “with scoring code”, “new series optimized”.
- training_filters: list of str
If specified, only models matching at least one of the listed training conditions are returned. The following formats are supported.
For AutoML and datetime partitioned projects:
- number of rows in the training subset
For datetime partitioned projects:
- <training duration>, for example P6Y0M0D
- <training_duration>-<time_window_sample_percent>-<sampling_method>, for example P6Y0M0D-78-Random (returns models trained on 6 years of data, sampling rate 78%, random sampling)
- Start/end date
- Project settings
- number_of_clusters: list of int
Filter models by number of clusters. Applicable only in unsupervised clustering projects.
- limit: int
- offset: int
- Returns:
- generic_models: list of GenericModel
- Return type:
List[GenericModel]
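The duration-based `training_filters` strings described above have a regular shape, so they can be taken apart locally before being passed to `get_model_records`. The following is a minimal sketch, not part of the DataRobot client: `TrainingFilter` and `parse_duration_filter` are hypothetical names, and the regular expression only covers the `P<years>Y<months>M<days>D` shape shown in the example, which is an assumption about the accepted durations.

```python
import re
from typing import NamedTuple, Optional


class TrainingFilter(NamedTuple):
    """Hypothetical container for one duration-based training filter."""
    duration: str
    sample_percent: Optional[int]
    sampling_method: Optional[str]


# Assumption: durations look like the documented example, e.g. P6Y0M0D.
_DURATION = re.compile(r"^P\d+Y\d+M\d+D$")


def parse_duration_filter(value: str) -> TrainingFilter:
    """Split '<duration>' or '<duration>-<percent>-<method>' into its parts."""
    parts = value.split("-")
    if not _DURATION.match(parts[0]):
        raise ValueError(f"not a duration-based filter: {value!r}")
    if len(parts) == 1:
        return TrainingFilter(parts[0], None, None)
    if len(parts) == 3:
        return TrainingFilter(parts[0], int(parts[1]), parts[2])
    raise ValueError(f"unexpected number of components in {value!r}")


parsed = parse_duration_filter("P6Y0M0D-78-Random")
print(parsed.duration, parsed.sample_percent, parsed.sampling_method)
```

Row-count filters and the other formats listed above are not handled by this sketch.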
- get_models(order_by=None, search_params=None, with_metric=None, use_new_models_retrieval=False)
List all completed, successful models in the leaderboard for the given project.
- Parameters:
- order_bystr or list of strings, optional
If not None, the returned models are ordered by this attribute. If None, the default return is the order of default project metric.
Allowed attributes to sort by are:
metric
sample_pct
If the sort attribute is preceded by a hyphen, models will be sorted in descending order, otherwise in ascending order.
Multiple sort attributes can be included as a comma-delimited string or in a list, e.g. order_by='sample_pct,-metric' or order_by=['sample_pct', '-metric'].
Sorting by metric orders models by their validation score on the project metric.
- search_paramsdict, optional.
If not None, the returned models are filtered by lookup. Currently you can query models by:
name
sample_pct
is_starred
- with_metricstr, optional.
If not None, the returned models will only have scores for this metric. Otherwise all the metrics are returned.
- use_new_models_retrieval: bool, False by default
If true, new retrieval route is used, which supports filtering and returns fewer attributes per individual model. Following attributes are absent and could be retrieved from the blueprint level: monotonic_increasing_featurelist_id, monotonic_decreasing_featurelist_id, supports_composable_ml and supports_monotonic_constraints. Following attributes are absent and could be retrieved from the individual model level: has_empty_clusters, is_n_clusters_dynamically_determined, prediction_threshold and prediction_threshold_read_only. Attribute n_clusters in Model is renamed to number_of_clusters in GenericModel and is returned for unsupervised clustering models.
- Returns:
- modelsa list of Model or a list of GenericModel if use_new_models_retrieval is True.
All models trained in the project.
- Raises:
- TypeError
Raised if the order_by or search_params parameter is provided, but is not of a supported type.
- Return type:
Union[List[Model], List[GenericModel]]
Examples
Project.get('pid').get_models(order_by=['-sample_pct', 'metric'])

# Getting models that contain "Ridge" in name
Project.get('pid').get_models(
    search_params={
        'name': "Ridge"
    })

# Filtering models based on 'starred' flag:
Project.get('pid').get_models(search_params={'is_starred': True})

# retrieve additional attributes for the model
model_records = project.get_models(use_new_models_retrieval=True)
model_record = model_records[0]
blueprint_id = model_record.blueprint_id
blueprint = dr.Blueprint.get(project.id, blueprint_id)
model_record.number_of_clusters
blueprint.supports_composable_ml
blueprint.supports_monotonic_constraints
blueprint.monotonic_decreasing_featurelist_id
blueprint.monotonic_increasing_featurelist_id
model = dr.Model.get(project.id, model_record.id)
model.prediction_threshold
model.prediction_threshold_read_only
model.has_empty_clusters
model.is_n_clusters_dynamically_determined
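The `order_by` convention (a leading hyphen means descending, and several keys may be given as a comma-delimited string or a list) can be illustrated locally. This is only a sketch of the semantics on plain dictionaries, not how the client or server implements sorting. `parse_order_by`, `sort_records` and the sample records are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple, Union


def parse_order_by(order_by: Union[str, Iterable[str]]) -> List[Tuple[str, bool]]:
    """Return (attribute, descending) pairs for an order_by value."""
    if isinstance(order_by, str):
        keys = [key.strip() for key in order_by.split(",")]
    else:
        keys = list(order_by)
    parsed = []
    for key in keys:
        descending = key.startswith("-")
        parsed.append((key[1:] if descending else key, descending))
    return parsed


def sort_records(records: Iterable[Dict], order_by: Union[str, Iterable[str]]) -> List[Dict]:
    """Sort dict records by several keys, relying on Python's stable sort."""
    result = list(records)
    # Apply the least significant key first so earlier keys take precedence.
    for attribute, descending in reversed(parse_order_by(order_by)):
        result.sort(key=lambda record: record[attribute], reverse=descending)
    return result


# Hypothetical records standing in for leaderboard entries.
records = [
    {"sample_pct": 64, "metric": 0.3},
    {"sample_pct": 80, "metric": 0.2},
    {"sample_pct": 64, "metric": 0.5},
]
ordered = sort_records(records, "sample_pct,-metric")
print([(r["sample_pct"], r["metric"]) for r in ordered])
```

Both accepted spellings parse to the same pairs, so `"sample_pct,-metric"` and `["sample_pct", "-metric"]` give the same ordering in this sketch.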
- recommended_model()
Returns the default recommended model, or None if there is no default recommended model.
- Returns:
- recommended_modelModel or None
The default recommended model.
- Return type:
Optional[Model]
- get_top_model(metric=None)
Obtain the top ranked model for a given metric. If no metric is passed in, it uses the project’s default metric. Models that display a score of N/A in the UI are not included in the ranking (see https://docs.datarobot.com/en/docs/modeling/reference/model-detail/leaderboard-ref.html#na-scores).
- Parameters:
- metricstr, optional
Metric to sort models
- Returns:
- modelModel
The top model
- Raises:
- ValueError
Raised if the project is unsupervised. Raised if the project has no target set. Raised if no metric was passed or the project has no metric. Raised if the metric passed is not used by the models on the leaderboard.
- Return type:
Model
Examples
from datarobot.models.project import Project

project = Project.get("<MY_PROJECT_ID>")
top_model = project.get_top_model()
- get_datetime_models()
List all models in the project as DatetimeModels
Requires the project to be datetime partitioned. If it is not, a ClientError will occur.
- Returns:
- modelslist of DatetimeModel
the datetime models
- Return type:
List[DatetimeModel]
- get_prime_models()
List all DataRobot Prime models for the project. Prime models were created to approximate a parent model, and have downloadable code.
- Returns:
- modelslist of PrimeModel
- Return type:
List[PrimeModel]
- get_prime_files(parent_model_id=None, model_id=None)
List all downloadable code files from DataRobot Prime for the project
- Parameters:
- parent_model_idstr, optional
Filter for only those prime files approximating this parent model
- model_idstr, optional
Filter for only those prime files with code for this prime model
- Returns:
- files: list of PrimeFile
- get_dataset()
Retrieve the dataset used to create a project.
- Returns:
- Dataset
The dataset used to create the project, or None if no catalog_id is present.
- Return type:
Optional[Dataset]
Examples
from datarobot.models.project import Project

project = Project.get("<MY_PROJECT_ID>")
dataset = project.get_dataset()
- get_datasets()
List all the datasets that have been uploaded for predictions
- Returns:
- datasetslist of PredictionDataset instances
- Return type:
List[PredictionDataset]
- upload_dataset(sourcedata, max_wait=600, read_timeout=600, forecast_point=None, predictions_start_date=None, predictions_end_date=None, dataset_filename=None, relax_known_in_advance_features_check=None, credentials=None, actual_value_column=None, secondary_datasets_config_id=None)
Upload a new dataset to make predictions against
- Parameters:
- sourcedatastr, file or pandas.DataFrame
Data to be used for predictions. If string, can be either a path to a local file, a publicly accessible URL (starting with http://, https://, file://), or raw file content. If using a file on disk, the filename must consist of ASCII characters only.
- max_waitint, optional
The maximum number of seconds to wait for the uploaded dataset to be processed before raising an error.
- read_timeoutint, optional
The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete
- forecast_pointdatetime.datetime or None, optional
(New in version v2.8) May only be specified for time series projects, otherwise the upload will be rejected. The time in the dataset relative to which predictions should be generated in a time series project. See the Time Series documentation for more information. If not provided, will default to using the latest forecast point in the dataset.
- predictions_start_datedatetime.datetime or None, optional
(New in version v2.11) May only be specified for time series projects. The start date for bulk predictions. Note that this parameter is for generating historical predictions using the training data. This parameter should be provided in conjunction with predictions_end_date. Cannot be provided with the forecast_point parameter.
- predictions_end_datedatetime.datetime or None, optional
(New in version v2.11) May only be specified for time series projects. The end date for bulk predictions, exclusive. Note that this parameter is for generating historical predictions using the training data. This parameter should be provided in conjunction with predictions_start_date. Cannot be provided with the forecast_point parameter.
- actual_value_columnstring, optional
(New in version v2.21) Actual value column name, valid for the prediction files if the project is unsupervised and the dataset is considered as bulk predictions dataset. Cannot be provided with the forecast_point parameter.
- dataset_filenamestring or None, optional
(New in version v2.14) File name to use for the dataset. Ignored for url and file path sources.
- relax_known_in_advance_features_checkbool, optional
(New in version v2.15) For time series projects only. If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.
- credentials: list, optional
A list of credentials for the datasets used in a Feature Discovery project.
- secondary_datasets_config_id: string or None, optional
(New in version v2.23) The Id of the alternative secondary dataset config to use during prediction for Feature discovery project.
- Returns:
- datasetPredictionDataset
The newly uploaded dataset.
- Raises:
- InputNotUnderstoodError
Raised if sourcedata isn’t one of the supported types.
- AsyncFailureError
Raised if polling for the status of an async process resulted in a response with an unsupported status code.
- AsyncProcessUnsuccessfulError
Raised if project creation was unsuccessful (i.e. the server reported an error in uploading the dataset).
- AsyncTimeoutError
Raised if processing the uploaded dataset took more time than specified by the max_wait parameter.
- ValueError
Raised if forecast_point or predictions_start_date and predictions_end_date are provided, but are not of the supported type.
- Return type:
PredictionDataset
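The constraints among `forecast_point`, `predictions_start_date`, `predictions_end_date` and `actual_value_column` described for `upload_dataset` can be checked before making a request. The sketch below is a hypothetical pre-flight helper, not part of the client. It treats "should be provided in conjunction with" as a hard requirement, which is an assumption, and the client and server perform their own validation regardless.

```python
from datetime import datetime
from typing import List, Optional


def prediction_window_problems(
    forecast_point: Optional[datetime] = None,
    predictions_start_date: Optional[datetime] = None,
    predictions_end_date: Optional[datetime] = None,
    actual_value_column: Optional[str] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means no conflict found."""
    problems = []
    # Start and end dates are documented as a pair (assumed mandatory here).
    if (predictions_start_date is None) != (predictions_end_date is None):
        problems.append("predictions_start_date and predictions_end_date must be given together")
    # forecast_point excludes the bulk prediction window.
    has_window = predictions_start_date is not None or predictions_end_date is not None
    if forecast_point is not None and has_window:
        problems.append("forecast_point cannot be combined with predictions_start_date/end_date")
    # forecast_point also excludes actual_value_column.
    if forecast_point is not None and actual_value_column is not None:
        problems.append("forecast_point cannot be combined with actual_value_column")
    # The three date arguments are documented as datetime.datetime values.
    for name, value in (
        ("forecast_point", forecast_point),
        ("predictions_start_date", predictions_start_date),
        ("predictions_end_date", predictions_end_date),
    ):
        if value is not None and not isinstance(value, datetime):
            problems.append(f"{name} must be a datetime.datetime")
    return problems


print(prediction_window_problems(forecast_point=datetime(2024, 1, 1)))
```

Returning a list rather than raising lets a caller report every conflict at once before deciding whether to call `upload_dataset`.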
- upload_dataset_from_data_source(data_source_id, username, password, max_wait=600, forecast_point=None, relax_known_in_advance_features_check=None, credentials=None, predictions_start_date=None, predictions_end_date=None, actual_value_column=None, secondary_datasets_config_id=None)
Upload a new dataset from a data source to make predictions against
- Parameters:
- data_source_idstr
The identifier of the data source.
- usernamestr
The username for database authentication.
- passwordstr
The password for database authentication. The password is encrypted at server side and never saved / stored.
- max_waitint, optional
Optional, the maximum number of seconds to wait before giving up.
- forecast_pointdatetime.datetime or None, optional
(New in version v2.8) For time series projects only. This is the default point relative to which predictions will be generated, based on the forecast window of the project. See the time series prediction documentation for more information.
- relax_known_in_advance_features_checkbool, optional
(New in version v2.15) For time series projects only. If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.
- credentials: list, optional
A list of credentials for the datasets used in a Feature Discovery project.
- predictions_start_datedatetime.datetime or None, optional
(New in version v2.20) For time series projects only. The start date for bulk predictions. Note that this parameter is for generating historical predictions using the training data. This parameter should be provided in conjunction with predictions_end_date. Cannot be provided with the forecast_point parameter.
- predictions_end_datedatetime.datetime or None, optional
(New in version v2.20) For time series projects only. The end date for bulk predictions, exclusive. Note that this parameter is for generating historical predictions using the training data. This parameter should be provided in conjunction with predictions_start_date. Cannot be provided with the forecast_point parameter.
- actual_value_columnstring, optional
(New in version v2.21) Actual value column name, valid for the prediction files if the project is unsupervised and the dataset is considered as bulk predictions dataset. Cannot be provided with the forecast_point parameter.
- secondary_datasets_config_id: string or None, optional
(New in version v2.23) The Id of the alternative secondary dataset config to use during prediction for Feature discovery project.
- Returns:
- datasetPredictionDataset
the newly uploaded dataset
- Return type:
PredictionDataset
- upload_dataset_from_catalog(dataset_id, credential_id=None, credential_data=None, dataset_version_id=None, max_wait=600, forecast_point=None, relax_known_in_advance_features_check=None, credentials=None, predictions_start_date=None, predictions_end_date=None, actual_value_column=None, secondary_datasets_config_id=None)
Upload a new dataset from a catalog dataset to make predictions against
- Parameters:
- dataset_idstr
The identifier of the dataset.
- credential_idstr, optional
The credential ID of the AI Catalog dataset to upload.
- credential_dataBasicCredentialsDataDict | S3CredentialsDataDict | OAuthCredentialsDataDict, optional
Credential data of the catalog dataset to upload. credential_data can be in one of the following forms:
- Basic Credentials
- credentialTypestr
The credential type. For basic credentials, this value must be CredentialTypes.BASIC.
- userstr
The username for database authentication.
- passwordstr
The password for database authentication. The password is encrypted at rest and never saved or stored.
- S3 Credentials
- credentialTypestr
The credential type. For S3 credentials, this value must be CredentialTypes.S3.
- awsAccessKeyIdstr, optional
The S3 AWS access key ID.
- awsSecretAccessKeystr, optional
The S3 AWS secret access key.
- awsSessionTokenstr, optional
The S3 AWS session token.
- config_id: str, optional
The ID of the saved shared secure configuration. If specified, cannot include awsAccessKeyId, awsSecretAccessKey or awsSessionToken.
- OAuth Credentials
- credentialTypestr
The credential type. For OAuth credentials, this value must be CredentialTypes.OAUTH.
- oauthRefreshTokenstr
The oauth refresh token.
- oauthClientIdstr
The oauth client ID.
- oauthClientSecretstr
The oauth client secret.
- oauthAccessTokenstr
The oauth access token.
- Snowflake Key Pair Credentials
- credentialTypestr
The credential type. For Snowflake Key Pair, this value must be CredentialTypes.SNOWFLAKE_KEY_PAIR_AUTH.
- userstr, optional
The Snowflake login name.
- privateKeyStrstr, optional
The private key, copied exactly from the user’s private key file. Because it spans multiple lines, put the key string inside triple quotes when assigning it to a variable.
- passphrasestr, optional
The string used to encrypt the private key.
- configIdstr, optional
The ID of the saved shared secure configuration. If specified, cannot include user, privateKeyStr or passphrase.
- Databricks Access Token Credentials
- credentialTypestr
The credential type. For a Databricks access token, this value must be CredentialTypes.DATABRICKS_ACCESS_TOKEN.
- databricksAccessTokenstr
The Databricks personal access token.
- Databricks Service Principal Credentials
- credentialTypestr
The credential type. For Databricks service principal, this value must be CredentialTypes.DATABRICKS_SERVICE_PRINCIPAL.
- clientIdstr, optional
The client ID for Databricks service principal.
- clientSecretstr, optional
The client secret for Databricks service principal.
- configIdstr, optional
The ID of the saved shared secure configuration. If specified, cannot include clientId and clientSecret.
- Azure Service Principal Credentials
- credentialTypestr
The credential type. For Azure service principal, this value must be CredentialTypes.AZURE_SERVICE_PRINCIPAL.
- clientIdstr, optional
The client ID for Azure service principal.
- clientSecretstr, optional
The client secret for Azure service principal.
- azureTenantIdstr, optional
The azure tenant ID for Azure service principal.
- configIdstr, optional
The ID of the saved shared secure configuration. If specified, cannot include clientId and clientSecret.
- dataset_version_idstr, optional
The version id of the dataset to use.
- max_waitint, optional
Optional, the maximum number of seconds to wait before giving up.
- forecast_pointdatetime.datetime or None, optional
For time series projects only. This is the default point relative to which predictions will be generated, based on the forecast window of the project. See the time series prediction documentation for more information.
- relax_known_in_advance_features_checkbool, optional
For time series projects only. If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.
- credentials: list[BasicCredentialsDict | CredentialIdCredentialsDict], optional
A list of credentials for the datasets used in Feature discovery project.
Items in credentials can have the following forms:
- Basic Credentials
- userstr
The username for database authentication.
- passwordstr
The password (in cleartext) for database authentication. The password will be encrypted on the server side in scope of HTTP request and never saved or stored.
- Credential ID
- credentialIdstr
The ID of the set of credentials to use instead of user and password. Note that with this change, username and password will become optional.
- predictions_start_datedatetime.datetime or None, optional
For time series projects only. The start date for bulk predictions. Note that this parameter is for generating historical predictions using the training data. This parameter should be provided in conjunction with predictions_end_date. Cannot be provided with the forecast_point parameter.
- predictions_end_datedatetime.datetime or None, optional
For time series projects only. The end date for bulk predictions, exclusive. Note that this parameter is for generating historical predictions using the training data. This parameter should be provided in conjunction with predictions_start_date. Cannot be provided with the forecast_point parameter.
- actual_value_columnstring, optional
Actual value column name, valid for the prediction files if the project is unsupervised and the dataset is considered as bulk predictions dataset. Cannot be provided with the forecast_point parameter.
- secondary_datasets_config_id: string or None, optional
The Id of the alternative secondary dataset config to use during prediction for Feature discovery project.
- Returns:
- datasetPredictionDataset
the newly uploaded dataset
- Return type:
PredictionDataset
- get_blueprints()
List all blueprints recommended for a project.
- Returns:
- menulist of Blueprint instances
All blueprints in a project’s repository.
- get_features()
List all features for this project
- Returns:
- list of Feature
all features for this project
- Return type:
List[Feature]
- get_modeling_features(batch_size=None)
List all modeling features for this project
Only available once the target and partitioning settings have been set. For more information on the distinction between input and modeling features, see the time series documentation.
- Parameters:
- batch_sizeint, optional
The number of features to retrieve in a single API call. If specified, the client may make multiple calls to retrieve the full list of features. If not specified, an appropriate default will be chosen by the server.
- Returns:
- list of ModelingFeature
All modeling features in this project
- Return type:
List[ModelingFeature]
- get_featurelists()
List all featurelists created for this project
- Returns:
- list of Featurelist
All featurelists created for this project
- Return type:
List[Featurelist]
- get_associations(assoc_type, metric, featurelist_id=None)
Get the association statistics and metadata for a project’s informative features
Added in version v2.17.
- Parameters:
- assoc_typestring or None
The type of association, must be either ‘association’ or ‘correlation’
- metricstring or None
The specified association metric, belongs under either association or correlation umbrella
- featurelist_idstring or None
The desired featurelist for which to get association statistics (New in version v2.19)
- Returns:
- association_datadict
Pairwise metric strength data, feature clustering data, and ordering data for Feature Association Matrix visualization
- get_association_featurelists()
List featurelists and get feature association status for each
Added in version v2.19.
- Returns:
- feature_listsdict
Dict with ‘featurelists’ as key, with list of featurelists as values
- get_association_matrix_details(feature1, feature2)
Get a sample of the actual values used to measure the association between a pair of features
Added in version v2.17.
- Parameters:
- feature1str
Feature name for the first feature of interest
- feature2str
Feature name for the second feature of interest
- Returns:
- dict
This data has 4 keys: chart_type, features, values, and types
- chart_typestr
Type of plotting the pair of features gets in the UI. e.g. ‘HORIZONTAL_BOX’, ‘VERTICAL_BOX’, ‘SCATTER’ or ‘CONTINGENCY’
- valueslist
A list of triplet lists e.g. {“values”: [[460.0, 428.5, 0.001], [1679.3, 259.0, 0.001], …] The first entry of each list is a value of feature1, the second entry of each list is a value of feature2, and the third is the relative frequency of the pair of datapoints in the sample.
- featureslist of str
A list of the passed features, [feature1, feature2]
- typeslist of str
A list of the passed features’ types inferred by DataRobot. e.g. [‘NUMERIC’, ‘CATEGORICAL’]
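Because each entry of `values` is a triplet of (feature1 value, feature2 value, relative frequency), it is often convenient to split the triplets into three parallel columns before plotting. The sketch below works on a hand-written dictionary shaped like the documented return value. The two triplets are the ones quoted above, while the `chart_type`, `features` and `types` entries and the helper name `split_triplets` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hand-written stand-in for the documented return value (not real output).
details = {
    "chart_type": "SCATTER",
    "features": ["feature1", "feature2"],
    "values": [[460.0, 428.5, 0.001], [1679.3, 259.0, 0.001]],
    "types": ["NUMERIC", "NUMERIC"],
}


def split_triplets(values: Sequence[Sequence[float]]) -> Tuple[List[float], List[float], List[float]]:
    """Separate [x, y, frequency] triplets into three parallel lists."""
    xs, ys, freqs = [], [], []
    for x, y, freq in values:
        xs.append(x)
        ys.append(y)
        freqs.append(freq)
    return xs, ys, freqs


xs, ys, freqs = split_triplets(details["values"])
print(details["features"][0], xs)
print(details["features"][1], ys)
print("relative frequency", freqs)
```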
- get_modeling_featurelists(batch_size=None)
List all modeling featurelists created for this project
Modeling featurelists can only be created after the target and partitioning options have been set for a project. In time series projects, these are the featurelists that can be used for modeling; in other projects, they behave the same as regular featurelists.
See the time series documentation for more information.
- Parameters:
- batch_sizeint, optional
The number of featurelists to retrieve in a single API call. If specified, the client may make multiple calls to retrieve the full list of features. If not specified, an appropriate default will be chosen by the server.
- Returns:
- list of ModelingFeaturelist
all modeling featurelists in this project
- Return type:
List[ModelingFeaturelist]
- get_discarded_features()
Retrieve the features that were discarded during feature generation. Applicable for time series projects. Can be called at the modeling stage.
- Returns:
- discarded_features_info: DiscardedFeaturesInfo
- Return type:
DiscardedFeaturesInfo
- restore_discarded_features(features, max_wait=600)
Restore features that were discarded during feature generation. Applicable for time series projects. Can be called at the modeling stage.
- Returns:
- status: FeatureRestorationStatus
information about features requested to be restored.
- Return type:
FeatureRestorationStatus
- create_type_transform_feature(name, parent_name, variable_type, replacement=None, date_extraction=None, max_wait=600)
Create a new feature by transforming the type of an existing feature in the project
Note that only the following transformations are supported:
Text to categorical or numeric
Categorical to text or numeric
Numeric to categorical
Date to categorical or numeric
Note: Special considerations when casting numeric to categorical
There are two parameters which can be used for variableType to convert numeric data to categorical levels. These differ in the assumptions they make about the input data, and are very important when considering the data that will be used to make predictions. The assumptions that each makes are:
categorical: The data in the column is all integral, and there are no missing values. If either of these conditions does not hold in the training set, the transformation will be rejected. During predictions, if any of the values in the parent column are missing, the predictions will error.
categoricalInt: (New in v2.6) All of the data in the column should be considered categorical in its string form when cast to an int by truncation. For example, the value 3 will be cast as the string 3 and the value 3.14 will also be cast as the string 3. Further, the value -3.6 will become the string -3. Missing values will still be recognized as missing.
For convenience these are represented in the enum VARIABLE_TYPE_TRANSFORM with the names CATEGORICAL and CATEGORICAL_INT.
- Parameters:
- namestr
The name to give to the new feature
- parent_namestr
The name of the feature to transform
- variable_typestr
The type the new column should have. See the values within datarobot.enums.VARIABLE_TYPE_TRANSFORM.
- replacementstr or float, optional
The value that missing or unconvertible data should have
- date_extractionstr, optional
Must be specified when parent_name is a date column (and left None otherwise). Specifies which value from a date should be extracted. See the list of values in datarobot.enums.DATE_EXTRACTION.
- max_waitint, optional
The maximum amount of time to wait for DataRobot to finish processing the new column. This process can take more time with more data to process. If this operation times out, an AsyncTimeoutError will occur. DataRobot continues the processing and the new column may successfully be constructed.
- Returns:
- Feature
The data of the new Feature
- Raises:
- AsyncFailureError
If any of the responses from the server are unexpected
- AsyncProcessUnsuccessfulError
If the job being waited for has failed or has been cancelled
- AsyncTimeoutError
If the resource did not resolve in time
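The truncation rule described for categoricalInt (3 and 3.14 both become the string 3, and -3.6 becomes -3) matches what Python's `int()` does to a float, which truncates toward zero. The following sketch reproduces the documented examples locally. `categorical_int_label` is a hypothetical helper, and representing missing values as `None` or NaN is an assumption rather than a statement about how DataRobot stores them.

```python
import math
from typing import Optional, Union


def categorical_int_label(value: Optional[Union[int, float]]) -> Optional[str]:
    """Mimic the documented categoricalInt cast: truncate to int, then stringify."""
    # Assumption: missing values arrive as None or NaN and stay missing.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    # int() truncates toward zero, so -3.6 becomes -3 rather than -4.
    return str(int(value))


for raw in (3, 3.14, -3.6, None):
    print(repr(raw), "->", repr(categorical_int_label(raw)))
```

Note that truncation differs from rounding and from `math.floor` for negative inputs, which is why -3.6 maps to -3.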
- get_featurelist_by_name(name)
Retrieves a featurelist of this project by name.
- Parameters:
- namestr, optional
The name of the Project’s featurelist to get.
- Returns:
- Featurelist
The featurelist found by name, or None if no featurelist has that name.
- Return type:
Optional[Featurelist]
Examples
project = Project.get('5223deadbeefdeadbeef0101')
featurelist = project.get_featurelist_by_name("Raw Features")
- create_featurelist(name=None, features=None, starting_featurelist=None, starting_featurelist_id=None, starting_featurelist_name=None, features_to_include=None, features_to_exclude=None)
Creates a new featurelist
- Parameters:
- namestr, optional
The name to give to this new featurelist. Names must be unique, so an error will be returned from the server if this name has already been used in this project. We dynamically create a name if none is provided.
- featureslist of str, optional
The names of the features. Each feature must exist in the project already.
- starting_featurelistFeaturelist, optional
The featurelist to use as the basis when creating a new featurelist. starting_featurelist.features will be read to get the list of features that we will manipulate.
- starting_featurelist_idstr, optional
The featurelist ID used instead of passing an object instance.
- starting_featurelist_namestr, optional
The featurelist name like “Informative Features” to find a featurelist via the API, and use to fetch features.
- features_to_includelist of str, optional
The list of the feature names to include in new featurelist. Throws an error if an item in this list is not in the featurelist that was passed, or that was retrieved from the API. If nothing is passed, all features are included from the starting featurelist.
- features_to_excludelist of str, optional
The list of the feature names to exclude in the new featurelist. Throws an error if an item in this list is not in the featurelist that was passed, also throws an error if a feature is in this list as well as features_to_include. Method cannot use both at the same time.
- Returns:
- Featurelist
newly created featurelist
- Raises:
- DuplicateFeaturesError
Raised if features variable contains duplicate features
- InvalidUsageError
Raised if the method is called with incompatible arguments
- Return type:
Featurelist
Examples
project = Project.get('5223deadbeefdeadbeef0101')
flists = project.get_featurelists()

# Create a new featurelist using a subset of features from an
# existing featurelist
flist = flists[0]
features = flist.features[::2]  # Half of the features

new_flist = project.create_featurelist(
    name='Feature Subset',
    features=features,
)

project = Project.get('5223deadbeefdeadbeef0101')

# Create a new featurelist using a subset of features from an
# existing featurelist by using features_to_exclude param
new_flist = project.create_featurelist(
    name='Feature Subset of Existing Featurelist',
    starting_featurelist_name="Informative Features",
    features_to_exclude=["metformin", "weight", "age"],
)
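The interplay of `features_to_include` and `features_to_exclude` with a starting featurelist can be sketched in plain Python. This is an illustration of the documented rules only, not the client's implementation: `resolve_features` is a hypothetical name, `ValueError` stands in for the errors the real method raises, and the feature names besides those in the example above are made up.

```python
from typing import List, Optional, Sequence


def resolve_features(
    starting_features: Sequence[str],
    features_to_include: Optional[Sequence[str]] = None,
    features_to_exclude: Optional[Sequence[str]] = None,
) -> List[str]:
    """Apply include/exclude rules to the features of a starting featurelist."""
    # The two options are documented as mutually exclusive.
    if features_to_include and features_to_exclude:
        raise ValueError("use features_to_include or features_to_exclude, not both")
    known = set(starting_features)
    requested = list(features_to_include or features_to_exclude or [])
    unknown = [name for name in requested if name not in known]
    if unknown:
        raise ValueError(f"not in the starting featurelist: {unknown}")
    if features_to_include:
        wanted = set(features_to_include)
        return [name for name in starting_features if name in wanted]
    if features_to_exclude:
        dropped = set(features_to_exclude)
        return [name for name in starting_features if name not in dropped]
    # With neither option, every feature of the starting featurelist is kept.
    return list(starting_features)


starting = ["age", "weight", "metformin", "race"]  # hypothetical featurelist contents
print(resolve_features(starting, features_to_exclude=["metformin", "weight", "age"]))
```

The result keeps the order of the starting featurelist, which is a choice made for this sketch rather than documented behaviour.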
- create_modeling_featurelist(name, features, skip_datetime_partition_column=False)
Create a new modeling featurelist
Modeling featurelists can only be created after the target and partitioning options have been set for a project. In time series projects, these are the featurelists that can be used for modeling; in other projects, they behave the same as regular featurelists.
See the time series documentation for more information.
- Parameters:
- namestr
the name of the modeling featurelist to create. Names must be unique within the project, or the server will return an error.
- featureslist of str
the names of the features to include in the modeling featurelist. Each feature must be a modeling feature.
- skip_datetime_partition_column: boolean, optional
False by default. If True, the featurelist will not contain the datetime partition column. Use this to create monotonic featurelists in time series projects. The setting makes no difference for projects that are not time series projects. Monotonic featurelists cannot be used for modeling.
- Returns:
- featurelistModelingFeaturelist
the newly created featurelist
- Return type:
ModelingFeaturelist
Examples
project = Project.get('1234deadbeeffeeddead4321')
modeling_features = project.get_modeling_features()
selected_features = [feat.name for feat in modeling_features][:5]  # select first five
new_flist = project.create_modeling_featurelist('Model This', selected_features)
- get_metrics(feature_name)
Get the metrics recommended for modeling on the given feature.
- Parameters:
- feature_namestr
The name of the feature to query regarding which metrics are recommended for modeling.
- Returns:
- feature_name: str
The name of the feature that was looked up
- available_metrics: list of str
An array of strings representing the appropriate metrics. If the feature cannot be selected as the target, then this array will be empty.
- metric_details: list of dict
The list of metricDetails objects
- metric_name: str
Name of the metric
- supports_timeseries: boolean
This metric is valid for timeseries
- supports_multiclass: boolean
This metric is valid for multiclass classification
- supports_binary: boolean
This metric is valid for binary classification
- supports_regression: boolean
This metric is valid for regression
- ascending: boolean
Should the metric be sorted in ascending order
- get_status()
Query the server for project status.
- Returns:
- statusdict
Contains:
autopilot_done: a boolean.
stage: a short string indicating which stage the project is in.
stage_description: a description of what stage means.
Examples
{"autopilot_done": False, "stage": "modeling", "stage_description": "Ready for modeling"}
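A status dictionary of this shape lends itself to a simple polling loop keyed on `autopilot_done`. The sketch below is a hypothetical helper rather than part of the client. It accepts any zero-argument callable that returns such a dictionary (for example, a project's `get_status` bound method), and the interval and timeout defaults are arbitrary assumptions.

```python
import time
from typing import Callable, Dict


def poll_until_autopilot_done(
    get_status: Callable[[], Dict],
    poll_interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call get_status repeatedly until it reports autopilot_done or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status["autopilot_done"]:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("autopilot did not finish before the timeout")
        sleep(poll_interval)


# Stand-in status source for demonstration; a real caller would pass project.get_status.
fake_responses = iter([
    {"autopilot_done": False, "stage": "modeling", "stage_description": "Ready for modeling"},
    {"autopilot_done": True, "stage": "modeling", "stage_description": "Ready for modeling"},
])
final_status = poll_until_autopilot_done(
    lambda: next(fake_responses), poll_interval=0, sleep=lambda seconds: None
)
print(final_status["autopilot_done"])
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real waiting.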
- pause_autopilot()
Pause autopilot, which stops processing the next jobs in the queue.
- Returns:
- pausedboolean
Whether the command was acknowledged
- Return type:
bool
- unpause_autopilot()
Unpause autopilot, which restarts processing the next jobs in the queue.
- Returns:
- unpausedboolean
Whether the command was acknowledged.
- Return type:
bool
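When autopilot should be paused only for a limited piece of work, the pause and unpause calls can be paired so that the unpause is not forgotten. The following is a minimal sketch, assuming a project object that exposes the two methods described above; autopilot_paused is a hypothetical helper name and is not part of the client.

```python
from contextlib import contextmanager


@contextmanager
def autopilot_paused(project):
    """Pause autopilot for the duration of a with-block, then unpause it."""
    acknowledged = project.pause_autopilot()
    try:
        # Hand the acknowledgement flag to the caller.
        yield acknowledged
    finally:
        # Only unpause if the pause command was acknowledged.
        if acknowledged:
            project.unpause_autopilot()
```

Usage would look like `with autopilot_paused(project): ...`; the unpause call runs even if the body raises.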
- start_autopilot(featurelist_id, mode='quick', blend_best_models=False, scoring_code_only=False, prepare_model_for_deployment=True, consider_blenders_in_recommendation=False, run_leakage_removed_feature_list=True, autopilot_cluster_list=None)
Start Autopilot on the provided featurelist with the specified Autopilot settings, halting the current Autopilot run.
Only one Autopilot run can be active at a time, so any ongoing Autopilot on a different featurelist is halted. Modeling jobs already in the queue are not affected, but the halted Autopilot adds no new jobs to the queue.
- Parameters:
- featurelist_idstr
Identifier of featurelist that should be used for autopilot
- modestr, optional
The Autopilot mode to run. You can use the AUTOPILOT_MODE enum to choose between:
AUTOPILOT_MODE.FULL_AUTO
AUTOPILOT_MODE.QUICK
AUTOPILOT_MODE.COMPREHENSIVE
If unspecified, AUTOPILOT_MODE.QUICK is used.
- blend_best_modelsbool, optional
Blend best models during Autopilot run. This option is not supported in SHAP-only mode.
- scoring_code_onlybool, optional
Keep only models that can be converted to scorable java code during Autopilot run.
- prepare_model_for_deploymentbool, optional
Prepare model for deployment during Autopilot run. The preparation includes creating reduced feature list models, retraining best model on higher sample size, computing insights and assigning “RECOMMENDED FOR DEPLOYMENT” label.
- consider_blenders_in_recommendationbool, optional
Include blenders when selecting a model to prepare for deployment in an Autopilot Run. This option is not supported in SHAP-only mode or for multilabel projects.
- run_leakage_removed_feature_listbool, optional
Run Autopilot on Leakage Removed feature list (if exists).
- autopilot_cluster_listlist of int, optional
(New in v2.27) A list of integers, where each value will be used as the number of clusters in Autopilot model(s) for unsupervised clustering projects. Cannot be specified unless project unsupervisedMode is true and unsupervisedType is set to ‘clustering’.
- Raises:
- AppPlatformError
Raised if the project’s target was not selected or the Autopilot settings are invalid for the project.
- Return type:
None
- train(trainable, sample_pct=None, featurelist_id=None, source_project_id=None, scoring_type=None, training_row_count=None, monotonic_increasing_featurelist_id=<object object>, monotonic_decreasing_featurelist_id=<object object>, n_clusters=None)
Submit a job to the queue to train a model.
Either sample_pct or training_row_count can be used to specify the amount of data to use, but not both. If neither is specified, the default is the maximum amount of data that can safely be used to train any blueprint without going into the validation data.
In smart-sampled projects, sample_pct and training_row_count are assumed to be in terms of rows of the minority class.
Note
If the project uses datetime partitioning, use Project.train_datetime instead.
- Parameters:
- trainablestr or Blueprint
For str, this is assumed to be a blueprint_id. If no source_project_id is provided, the project_id will be assumed to be the project that this instance represents.
Otherwise, for a Blueprint, it contains the blueprint_id and source_project_id that we want to use. featurelist_id will assume the default for this project if not provided, and sample_pct will default to using the maximum training value allowed for this project’s partition setup. source_project_id will be ignored if a Blueprint instance is used for this parameter.
- sample_pctfloat, optional
The amount of data to use for training, as a percentage of the project dataset from 0 to 100.
- featurelist_idstr, optional
The identifier of the featurelist to use. If not defined, the default for this project is used.
- source_project_idstr, optional
Which project created this blueprint_id. If None, it defaults to looking in this project. Note that you must have read permissions in this project.
- scoring_typestr, optional
Either validation or crossValidation (also dr.SCORING_TYPE.validation or dr.SCORING_TYPE.cross_validation). validation is available for every partitioning type, and indicates that the default model validation should be used for the project. If the project uses a form of cross-validation partitioning, crossValidation can also be used to indicate that all of the available training/validation combinations should be used to evaluate the model.
- training_row_countint, optional
The number of rows to use to train the requested model.
- monotonic_increasing_featurelist_idstr, optional
(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. Passing None disables increasing monotonicity constraint. Default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
- monotonic_decreasing_featurelist_idstr, optional
(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. Passing None disables decreasing monotonicity constraint. Default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
- n_clusters: int, optional
(new in version 2.27) Number of clusters to use in an unsupervised clustering model. This parameter is used only for unsupervised clustering models that don’t automatically determine the number of clusters.
- Returns:
- model_job_idstr
id of created job, can be used as parameter to ModelJob.get method or wait_for_async_model_creation function
Examples
Use a Blueprint instance:
blueprint = project.get_blueprints()[0]
model_job_id = project.train(blueprint, training_row_count=project.max_train_rows)
Use a blueprint_id, which is a string. In the first case, it is assumed that the blueprint was created by this project. If you are using a blueprint used by another project, you will need to pass the id of that other project as well.
blueprint_id = 'e1c7fc29ba2e612a72272324b8a842af'
project.train(blueprint_id, training_row_count=project.max_train_rows)
another_project.train(blueprint_id, source_project_id=project.id)
You can also easily use this interface to train a new model using the blueprint of an existing model:
model = project.get_models()[0]
model_job_id = project.train(model.blueprint_id, sample_pct=100)
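Because sample_pct and training_row_count are mutually exclusive, it can help to validate the choice before submitting the job. The helper below is a sketch under that documented rule only; training_size_kwargs is a hypothetical name, not part of the client, and the range check simply mirrors the "0 to 100" wording above.

```python
def training_size_kwargs(sample_pct=None, training_row_count=None):
    """Build the size-related keyword arguments for Project.train.

    Raises ValueError if both sizes are given or sample_pct is out of range.
    Returns an empty dict when neither is given, so the default applies.
    """
    if sample_pct is not None and training_row_count is not None:
        raise ValueError("Specify sample_pct or training_row_count, not both.")
    if sample_pct is not None:
        if not 0 <= sample_pct <= 100:
            raise ValueError("sample_pct must be between 0 and 100.")
        return {"sample_pct": sample_pct}
    if training_row_count is not None:
        return {"training_row_count": training_row_count}
    return {}
```

A call could then be written as `project.train(blueprint_id, **training_size_kwargs(training_row_count=5000))`.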
- train_datetime(blueprint_id, featurelist_id=None, training_row_count=None, training_duration=None, source_project_id=None, monotonic_increasing_featurelist_id=<object object>, monotonic_decreasing_featurelist_id=<object object>, use_project_settings=False, sampling_method=None, n_clusters=None)
Create a new model in a datetime partitioned project.
If the project is not datetime partitioned, an error will occur.
All durations should be specified with a duration string such as those returned by the partitioning_methods.construct_duration_string helper method. Please see datetime partitioned project documentation for more information on duration strings.
- Parameters:
- blueprint_idstr
the blueprint to use to train the model
- featurelist_idstr, optional
the featurelist to use to train the model. If not specified, the project default will be used.
- training_row_countint, optional
the number of rows of data that should be used to train the model. If specified, neither training_duration nor use_project_settings may be specified.
- training_durationstr, optional
a duration string specifying what time range the data used to train the model should span. If specified, neither training_row_count nor use_project_settings may be specified.
- sampling_methodstr, optional
(New in version v2.23) defines the way training data is selected. Can be either random or latest. In combination with training_row_count defines how rows are selected from backtest (latest by default). When training data is defined using time range (training_duration or use_project_settings) this setting changes the way time_window_sample_pct is applied (random by default). Applicable to OTV projects only.
- use_project_settingsbool, optional
(New in version v2.20) defaults to False. If True, indicates that the custom backtest partitioning settings specified by the user will be used to train the model and evaluate backtest scores. If specified, neither training_row_count nor training_duration may be specified.
- source_project_idstr, optional
the id of the project this blueprint comes from, if not this project. If left unspecified, the blueprint must belong to this project.
- monotonic_increasing_featurelist_idstr, optional
(New in version v2.18) optional, the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. Passing None disables increasing monotonicity constraint. Default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
- monotonic_decreasing_featurelist_idstr, optional
(New in version v2.18) optional, the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. Passing None disables decreasing monotonicity constraint. Default (dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
- n_clustersint, optional
The number of clusters to use in the specified unsupervised clustering model. Only valid in unsupervised clustering projects.
- Returns:
- jobModelJob
the created job to build the model
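The parameter descriptions above state that training_row_count, training_duration, and use_project_settings exclude one another. A minimal sketch of that rule as a pre-flight check follows; datetime_training_kwargs is a hypothetical helper and is not part of the client.

```python
def datetime_training_kwargs(training_row_count=None, training_duration=None,
                             use_project_settings=False):
    """Build the training-size keyword arguments for Project.train_datetime.

    At most one of the three options may be chosen; otherwise ValueError.
    """
    given = {
        "training_row_count": training_row_count is not None,
        "training_duration": training_duration is not None,
        "use_project_settings": bool(use_project_settings),
    }
    chosen = [name for name, is_set in given.items() if is_set]
    if len(chosen) > 1:
        raise ValueError("Specify only one of: " + ", ".join(chosen))
    values = {
        "training_row_count": training_row_count,
        "training_duration": training_duration,
        "use_project_settings": True,
    }
    # Return only the option that was actually chosen (or nothing at all).
    return {name: values[name] for name in chosen}
```

A call could then be written as `project.train_datetime(blueprint_id, **datetime_training_kwargs(training_duration=duration))`, where duration is a duration string built as described above.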
- blend(model_ids, blender_method)
Submit a job for creating blender model. Upon success, the new job will be added to the end of the queue.
- Parameters:
- model_idslist of str
List of model ids that will be used to create blender. These models should have completed validation stage without errors, and can’t be blenders or DataRobot Prime
- blender_methodstr
Chosen blend method, one from datarobot.enums.BLENDER_METHOD. If this is a time series project, only methods in datarobot.enums.TS_BLENDER_METHOD are allowed.
- Returns:
- model_jobModelJob
New ModelJob instance for the blender creation job in queue.
- Return type:
ModelJob
See also
datarobot.models.Project.check_blendable to confirm if models can be blended
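The two calls are naturally combined: check eligibility first, then submit the blend job. The sketch below assumes that the object returned by check_blendable exposes a boolean supported attribute and a reason string; that shape is an assumption here and should be confirmed against the client documentation. blend_if_eligible is a hypothetical helper name.

```python
def blend_if_eligible(project, model_ids, blender_method):
    """Submit a blend job only if the eligibility check passes."""
    result = project.check_blendable(model_ids, blender_method)
    # Assumption: the result has `supported` (bool) and `reason` (str).
    if not result.supported:
        raise ValueError("Models cannot be blended: " + str(result.reason))
    return project.blend(model_ids, blender_method)
```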
- check_blendable(model_ids, blender_method)
Check if the specified models can be successfully blended
- Parameters:
- model_idslist of str
List of model ids that will be used to create blender. These models should have completed validation stage without errors, and can’t be blenders or DataRobot Prime
- blender_methodstr
Chosen blend method, one from datarobot.enums.BLENDER_METHOD. If this is a time series project, only methods in datarobot.enums.TS_BLENDER_METHOD are allowed.
- Returns:
- Return type:
- start_prepare_model_for_deployment(model_id)
Prepare a specific model for deployment.
The requested model will be trained on the maximum autopilot size then go through the recommendation stages. For datetime partitioned projects, this includes the feature impact stage, retraining on a reduced feature list, and retraining the best of the reduced feature list model and the max autopilot original model on recent data. For non-datetime partitioned projects, this includes the feature impact stage, retraining on a reduced feature list, retraining the best of the reduced feature list model and the max autopilot original model up to the holdout size, then retraining the up-to-the holdout model on the full dataset.
- Parameters:
- model_idstr
The model to prepare for deployment.
- Return type:
None
- get_all_jobs(status=None)
Get a list of jobs
This will give Jobs representing any type of job, including modeling or predict jobs.
- Parameters:
- statusQUEUE_STATUS enum, optional
If called with QUEUE_STATUS.INPROGRESS, will return the jobs that are currently running.
If called with QUEUE_STATUS.QUEUE, will return the jobs that are waiting to be run.
If called with QUEUE_STATUS.ERROR, will return the jobs that have errored.
If no value is provided, will return all jobs currently running or waiting to be run.
- Returns:
- jobslist
Each is an instance of Job
- Return type:
List[Job]
- get_blenders()
Get a list of blender models.
- Returns:
- list of BlenderModel
list of all blender models in project.
- Return type:
List[BlenderModel]
- get_frozen_models()
Get a list of frozen models
- Returns:
- list of FrozenModel
list of all frozen models in project.
- Return type:
List[FrozenModel]
- get_combined_models()
Get a list of models in segmented project.
- Returns:
- list of CombinedModel
list of all combined models in segmented project.
- Return type:
List[CombinedModel]
- get_active_combined_model()
Retrieve currently active combined model in segmented project.
- Returns:
- CombinedModel
currently active combined model in segmented project.
- Return type:
CombinedModel
- get_segments_models(combined_model_id=None)
Retrieve a list of all models belonging to the segments/child projects of the segmented project.
- Parameters:
- combined_model_idstr, optional
Id of the combined model to get segments for. If there is only a single combined model it can be retrieved automatically, but this must be specified when there is more than one combined model.
- Returns:
- segments_modelslist(dict)
A list of dictionaries containing all of the segments/child projects, each with a list of their models ordered by metric from best to worst.
- Return type:
List[Dict[str, Any]]
- get_model_jobs(status=None)
Get a list of modeling jobs
- Parameters:
- statusQUEUE_STATUS enum, optional
If called with QUEUE_STATUS.INPROGRESS, will return the modeling jobs that are currently running.
If called with QUEUE_STATUS.QUEUE, will return the modeling jobs that are waiting to be run.
If called with QUEUE_STATUS.ERROR, will return the modeling jobs that have errored.
If no value is provided, will return all modeling jobs currently running or waiting to be run.
- Returns:
- jobslist
Each is an instance of ModelJob
- Return type:
List[ModelJob]
- get_predict_jobs(status=None)
Get a list of prediction jobs
- Parameters:
- statusQUEUE_STATUS enum, optional
If called with QUEUE_STATUS.INPROGRESS, will return the prediction jobs that are currently running.
If called with QUEUE_STATUS.QUEUE, will return the prediction jobs that are waiting to be run.
If called with QUEUE_STATUS.ERROR, will return the prediction jobs that have errored.
If called without a status, will return all prediction jobs currently running or waiting to be run.
- Returns:
- jobslist
Each is an instance of PredictJob
- Return type:
List[PredictJob]
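The status filter shared by get_all_jobs, get_model_jobs, and get_predict_jobs makes it easy to summarize the queue. The following is a minimal sketch for modeling jobs; count_model_jobs is a hypothetical helper, and the statuses are passed in by the caller rather than imported here.

```python
def count_model_jobs(project, statuses):
    """Return a dict mapping each status to the number of modeling jobs in it."""
    return {
        status: len(project.get_model_jobs(status=status))
        for status in statuses
    }
```

With the client installed, statuses would typically be the QUEUE_STATUS.INPROGRESS, QUEUE_STATUS.QUEUE, and QUEUE_STATUS.ERROR values described above.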
- wait_for_autopilot(check_interval=20.0, timeout=86400, verbosity=1)
Blocks until autopilot is finished. This will raise an exception if the autopilot mode is changed from AUTOPILOT_MODE.FULL_AUTO.
It makes API calls to sync the project state with the server and to look at which jobs are enqueued.
- Parameters:
- check_intervalfloat or int
The maximum time (in seconds) to wait between checks for whether autopilot is finished
- timeoutfloat or int or None
After this long (in seconds), we give up. If None, never timeout.
- verbosity:
This should be VERBOSITY_LEVEL.SILENT or VERBOSITY_LEVEL.VERBOSE. For VERBOSITY_LEVEL.SILENT, nothing will be displayed about progress. For VERBOSITY_LEVEL.VERBOSE, the number of jobs in progress or queued is shown. Note that new jobs are added to the queue along the way.
- Raises:
- AsyncTimeoutError
If autopilot does not finish in the amount of time specified
- RuntimeError
If a condition is detected that indicates that autopilot will not complete on its own
- Return type:
None
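The blocking behaviour can be pictured as a simple polling loop. The sketch below is not the client's implementation: it only polls get_status() for the autopilot_done flag, raises the built-in TimeoutError rather than AsyncTimeoutError, and does not inspect the job queue or detect stalled runs. wait_until_autopilot_done is a hypothetical name.

```python
import time


def wait_until_autopilot_done(project, check_interval=20.0, timeout=86400):
    """Poll project.get_status() until autopilot_done is true.

    Returns the final status dict. Raises TimeoutError once `timeout`
    seconds have elapsed; a timeout of None waits indefinitely.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        status = project.get_status()
        if status["autopilot_done"]:
            return status
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("Autopilot did not finish in time.")
        time.sleep(check_interval)
```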
- rename(project_name)
Update the name of the project.
- Parameters:
- project_namestr
The new name
- Return type:
None
- set_project_description(project_description)
Set or update the project description.
- Parameters:
- project_descriptionstr
The new description for this project.
- Return type:
None
- unlock_holdout()
Unlock the holdout for this project.
This will cause subsequent queries of the models of this project to contain the metric values for the holdout set, if it exists.
Take care, as this cannot be undone. Remember that best practice is to select a model before analyzing the model performance on the holdout set.
- Return type:
None
- set_worker_count(worker_count)
Sets the number of workers allocated to this project.
Note that this value is limited to the number allowed by your account. Lowering the number will not stop currently running jobs, but will cause the queue to wait for the appropriate number of jobs to finish before attempting to run more jobs.
- Parameters:
- worker_countint
The number of concurrent workers to request from the pool of workers. (New in version v2.14) Setting this to -1 will update the number of workers to the maximum available to your account.
- Return type:
None
- set_advanced_options(advanced_options=None, **kwargs)
Update the advanced options of this project.
- Return type:
None
Note
Project options are not stored at the database level, so options set via this method remain attached to a project instance only for the lifetime of a client session (if you quit your session and open a new one before running autopilot, the advanced options are lost).
Accepts either an AdvancedOptions object that replaces all advanced options, or individual keyword arguments. This is an in-place update, not a new object. The options set will only remain for the life of this project instance within a given session.
- Parameters:
- advanced_optionsAdvancedOptions, optional
AdvancedOptions instance as an alternative to passing individual parameters.
- weightsstring, optional
The name of a column indicating the weight of each row
- response_capfloat in [0.5, 1), optional
Quantile of the response distribution to use for response capping.
- blueprint_thresholdint, optional
Number of hours models are permitted to run before being excluded from later autopilot stages. Minimum 1.
- seedint, optional
a seed to use for randomization
- smart_downsampledbool, optional
whether to use smart downsampling to throw away excess rows of the majority class. Only applicable to classification and zero-boosted regression projects.
- majority_downsampling_ratefloat, optional
The percentage between 0 and 100 of the majority rows that should be kept. Specify only if using smart downsampling. May not cause the majority class to become smaller than the minority class.
- offsetlist of str, optional
(New in version v2.6) the list of the names of the columns containing the offset of each row
- exposurestring, optional
(New in version v2.6) the name of a column containing the exposure of each row
- accuracy_optimized_mbbool, optional
(New in version v2.6) Include additional, longer-running models that will be run by the autopilot and available to run manually.
- events_countstring, optional
(New in version v2.8) the name of a column specifying events count.
- monotonic_increasing_featurelist_idstring, optional
(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically increasing relationship to the target. If None, no such constraints are enforced. When specified, this will set a default for the project that can be overridden at model submission time if desired.
- monotonic_decreasing_featurelist_idstring, optional
(new in version 2.11) the id of the featurelist that defines the set of features with a monotonically decreasing relationship to the target. If None, no such constraints are enforced. When specified, this will set a default for the project that can be overridden at model submission time if desired.
- only_include_monotonic_blueprintsbool, optional
(new in version 2.11) when true, only blueprints that support enforcing monotonic constraints will be available in the project or selected for the autopilot.
- allowed_pairwise_interaction_groupslist of tuple, optional
(New in version v2.19) For GA2M models - specify groups of columns for which pairwise interactions will be allowed. E.g. if set to [(A, B, C), (C, D)] then GA2M models will allow interactions between columns A x B, B x C, A x C, C x D. All others (A x D, B x D) will not be considered.
- blend_best_models: bool, optional
(New in version v2.19) blend best models during Autopilot run
- scoring_code_only: bool, optional
(New in version v2.19) Keep only models that can be converted to scorable java code during Autopilot run
- shap_only_mode: bool, optional
(New in version v2.21) Keep only models that support SHAP values during Autopilot run. Use SHAP-based insights wherever possible. Defaults to False.
- prepare_model_for_deployment: bool, optional
(New in version v2.19) Prepare model for deployment during Autopilot run. The preparation includes creating reduced feature list models, retraining best model on higher sample size, computing insights and assigning “RECOMMENDED FOR DEPLOYMENT” label.
- consider_blenders_in_recommendation: bool, optional
(New in version 2.22.0) Include blenders when selecting a model to prepare for deployment in an Autopilot Run. Defaults to False.
- min_secondary_validation_model_count: int, optional
(New in version v2.19) Compute “All backtest” scores (datetime models) or cross validation scores for the specified number of highest ranking models on the Leaderboard, if over the Autopilot default.
- autopilot_data_sampling_method: str, optional
(New in version v2.23) one of datarobot.enums.DATETIME_AUTOPILOT_DATA_SAMPLING_METHOD. Applicable for OTV projects only, defines if autopilot uses “random” or “latest” sampling when iteratively building models on various training samples. Defaults to “random” for duration-based projects and to “latest” for row-based projects.
- run_leakage_removed_feature_list: bool, optional
(New in version v2.23) Run Autopilot on Leakage Removed feature list (if exists).
- autopilot_with_feature_discovery: bool, optional.
(New in version v2.23) If true, autopilot will run on a feature list that includes features found via search for interactions.
- feature_discovery_supervised_feature_reduction: bool, optional
(New in version v2.23) Run supervised feature reduction for feature discovery projects.
- exponentially_weighted_moving_alpha: float, optional
(New in version v2.26) defaults to None, value between 0 and 1 (inclusive), indicates alpha parameter used in exponentially weighted moving average within feature derivation window.
- external_time_series_baseline_dataset_id: str, optional.
(New in version v2.26) If provided, will generate metrics scaled by external model predictions metric for time series projects. The external predictions catalog must be validated before autopilot starts, see Project.validate_external_time_series_baseline and external baseline predictions documentation for further explanation.
- use_supervised_feature_reduction: bool, default True, optional
Time Series only. When true, during feature generation DataRobot runs a supervised algorithm to retain only qualifying features. Setting to false can severely impact autopilot duration, especially for datasets with many features.
- primary_location_column: str, optional.
The name of primary location column.
- protected_features: list of str, optional.
(New in version v2.24) A list of project features to mark as protected for Bias and Fairness testing calculations. Max number of protected features allowed is 10.
- preferable_target_value: str, optional.
(New in version v2.24) A target value that should be treated as a favorable outcome for the prediction. For example, if we want to check gender discrimination for giving a loan and our target is named is_bad, then the positive outcome for the prediction would be No, which means that the loan is good and that’s what we treat as a favorable result for the loaner.
- fairness_metrics_set: str, optional.
(New in version v2.24) Metric to use for calculating fairness. Can be one of proportionalParity, equalParity, predictionBalance, trueFavorableAndUnfavorableRateParity or favorableAndUnfavorablePredictiveValueParity. Used and required only if Bias & Fairness in AutoML feature is enabled.
- fairness_threshold: str, optional.
(New in version v2.24) Threshold value for the fairness metric. Can be in a range of [0.0, 1.0]. If the relative (i.e. normalized) fairness score is below the threshold, then the user will see a visual indication on the
- bias_mitigation_feature_namestr, optional
The feature from protected features that will be used in a bias mitigation task to mitigate bias
- bias_mitigation_techniquestr, optional
One of datarobot.enums.BiasMitigationTechnique. Options:
- ‘preprocessingReweighing’
- ‘postProcessingRejectionOptionBasedClassification’
The technique by which we’ll mitigate bias, which will inform which bias mitigation task we insert into blueprints.
- include_bias_mitigation_feature_as_predictor_variablebool, optional
Whether we should also use the mitigation feature as an input to the modeler just like any other categorical used for training, i.e. do we want the model to “train on” this feature in addition to using it for bias mitigation.
- series_idstring, optional
(New in version v3.6) The name of a column containing the series ID for each row.
- forecast_distancestring, optional
(New in version v3.6) The name of a column containing the forecast distance for each row.
- forecast_offsetslist of str, optional
(New in version v3.6) The list of the names of the columns containing the forecast offsets for each row.
- incremental_learning_only_modebool, optional
(New in version v3.4) Keep only models that support incremental learning during Autopilot run.
- incremental_learning_on_best_modelbool, optional
(New in version v3.4) Run incremental learning on the best model during Autopilot run.
- chunk_definition_idstring, optional
(New in version v3.4) Unique definition for chunks needed to run automated incremental learning.
- incremental_learning_early_stopping_rounds: int, optional
(New in version v3.4) Early stopping rounds used in the automated incremental learning service.
- number_of_incremental_learning_iterations_before_best_model_selection: int, optional
Number of iterations top 5 models complete prior to best model selection.
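The allowed_pairwise_interaction_groups option above is easiest to understand by expanding the groups into the pairs they permit. The following sketch reproduces the documented example; allowed_pairs is a hypothetical helper for illustration, not part of the client.

```python
from itertools import combinations


def allowed_pairs(groups):
    """Expand interaction groups into the sorted list of permitted column pairs."""
    pairs = set()
    for group in groups:
        # Every pair of distinct columns within one group is allowed.
        for first, second in combinations(sorted(set(group)), 2):
            pairs.add((first, second))
    return sorted(pairs)


pairs = allowed_pairs([("A", "B", "C"), ("C", "D")])
```

For the documented example, pairs contains A x B, A x C, B x C, and C x D, while A x D and B x D are absent because those columns never share a group.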
- list_advanced_options()
View the advanced options that have been set on a project instance. Includes those that haven’t been set (with value of None).
- Returns:
- dict of advanced options and their values
- Return type:
Dict[str, Any]
- set_partitioning_method(cv_method=None, validation_type=None, seed=0, reps=None, user_partition_col=None, training_level=None, validation_level=None, holdout_level=None, cv_holdout_level=None, validation_pct=None, holdout_pct=None, partition_key_cols=None, partitioning_method=None)
Configures the partitioning method for this project.
If this project does not already have a partitioning method set, creates a new configuration based on provided args.
If the partitioning_method arg is set, that configuration will instead be used.
Note
This is an in-place update, not a new object. The options set will only remain for the life of this project instance within a given session. You must still call set_target to make this change permanent for the project. Calling refresh without first calling set_target will invalidate this configuration. Similarly, calling get to retrieve a second copy of the project will not include this configuration.
Added in version v3.0.
- Parameters:
- cv_method: str
The partitioning method used. Supported values can be found in datarobot.enums.CV_METHOD.
- validation_type: str
May be “CV” (K-fold cross-validation) or “TVH” (Training, validation, and holdout).
- seedint
A seed to use for randomization.
- repsint
Number of cross validation folds to use.
- user_partition_colstr
The name of the column containing the partition assignments.
- training_levelUnion[str,int]
The value of the partition column indicating a row is part of the training set.
- validation_levelUnion[str,int]
The value of the partition column indicating a row is part of the validation set.
- holdout_levelUnion[str,int]
The value of the partition column indicating a row is part of the holdout set (use None if you want no holdout set).
- cv_holdout_level: Union[str,int]
The value of the partition column indicating a row is part of the holdout set.
- validation_pctint
The desired percentage of dataset to assign to validation set.
- holdout_pctint
The desired percentage of dataset to assign to holdout set.
- partition_key_colslist
A list containing a single string, where the string is the name of the column whose values should remain together in partitioning.
- partitioning_methodPartitioningMethod, optional
An instance of datarobot.helpers.partitioning_methods.PartitioningMethod that will be used instead of creating a new instance from the other args.
- Returns:
- projectProject
The instance with updated attributes.
- Raises:
- TypeError
If cv_method or validation_type are not set and partitioning_method is not set.
- InvalidUsageError
If invoked after project.set_target or project.start, or if invoked with the wrong combination of args for a given partitioning method.
- get_uri()
- Returns:
- urlstr
Permanent static hyperlink to a project leaderboard.
- Return type:
str
- get_rating_table_models()
Get a list of models with a rating table
- Returns:
- list of RatingTableModel
list of all models with a rating table in project.
- Return type:
List[RatingTableModel]
- get_rating_tables()
Get a list of rating tables
- Returns:
- list of RatingTable
list of rating tables in project.
- Return type:
List[RatingTable]
- get_access_list()
Retrieve users who have access to this project and their access levels.
Added in version v2.15.
- Returns:
- list of SharingAccess
- Return type:
List[SharingAccess]
- share(access_list, send_notification=None, include_feature_discovery_entities=None)
Modify the ability of users to access this project.
Added in version v2.15.
- Parameters:
- access_listlist of SharingAccess
the modifications to make.
- send_notificationboolean, default None
(New in version v2.21) optional, whether or not an email notification should be sent, default to None
- include_feature_discovery_entitiesboolean, default None
(New in version v2.21) optional (default: None), whether or not to share all the related entities i.e., datasets for a project with Feature Discovery enabled
- Raises:
- datarobot.ClientError
if you do not have permission to share this project, if the user you’re sharing with doesn’t exist, if the same user appears multiple times in the access_list, or if these changes would leave the project without an owner
- Return type:
None
Examples
Transfer access to the project from old_user@datarobot.com to new_user@datarobot.com
import datarobot as dr
new_access = dr.SharingAccess('new_user@datarobot.com', dr.enums.SHARING_ROLE.OWNER, can_share=True)
access_list = [dr.SharingAccess('old_user@datarobot.com', None), new_access]
dr.Project.get('my-project-id').share(access_list)
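Since a ClientError is raised when the same user appears more than once in the access list, a quick local check can catch that case before the request is sent. The sketch below assumes each entry exposes a username attribute, as SharingAccess entries are expected to; duplicate_usernames is a hypothetical helper name.

```python
from collections import Counter


def duplicate_usernames(access_list):
    """Return the sorted usernames that occur more than once in access_list."""
    # Assumption: every entry has a `username` attribute.
    counts = Counter(entry.username for entry in access_list)
    return sorted(name for name, count in counts.items() if count > 1)
```

An empty result means the list passes this particular check; the other error conditions listed above can only be detected by the server.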
- batch_features_type_transform(parent_names, variable_type, prefix=None, suffix=None, max_wait=600)
Create new features by transforming the type of existing ones.
Added in version v2.17.
Note
The following transformations are only supported in batch mode:
Text to categorical or numeric
Categorical to text or numeric
Numeric to categorical
See the type transform considerations documentation for special considerations when casting numeric to categorical. Date to categorical or numeric transformations are not currently supported for batch mode but can be performed individually using create_type_transform_feature.
- Parameters:
- parent_nameslist[str]
The list of variable names to be transformed.
- variable_typestr
The type new columns should have. Can be one of ‘categorical’, ‘categoricalInt’, ‘numeric’, and ‘text’ - supported values can be found in datarobot.enums.VARIABLE_TYPE_TRANSFORM.
- prefixstr, optional
The string that will preface all feature names. At least one of prefix and suffix must be specified.
- suffixstr, optional
The string that will be appended at the end to all feature names. At least one of prefix and suffix must be specified.
- max_waitint, optional
The maximum amount of time, in seconds, to wait for DataRobot to finish processing the new column. This process can take more time with more data to process. If this operation times out, an AsyncTimeoutError will occur. DataRobot continues the processing in that case, and the new column may still be constructed successfully.
- Returns:
- list of Features
all features for this project after transformation.
- Raises:
- TypeError
If parent_names is not a list.
- ValueError
If the value of variable_type is not from datarobot.enums.VARIABLE_TYPE_TRANSFORM.
- AsyncFailureError
If any of the responses from the server are unexpected.
- AsyncProcessUnsuccessfulError
If the job being waited for has failed or has been cancelled.
- AsyncTimeoutError
If the resource did not resolve in time.
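The argument rules above (parent_names must be a list, and at least one of prefix and suffix is required) can be checked locally before submitting a batch transform. The sketch below is illustrative only: preview_transformed_names is a hypothetical helper, the prefix + name + suffix naming scheme is an assumption read from the parameter descriptions, and raising ValueError for a missing prefix and suffix is a choice made for this example rather than documented client behavior.

```python
def preview_transformed_names(parent_names, prefix=None, suffix=None):
    """Preview the feature names a batch type transform is expected to create.

    Assumption: each new name is prefix + parent name + suffix.
    """
    if not isinstance(parent_names, list):
        # Mirrors the documented TypeError for non-list parent_names.
        raise TypeError("parent_names must be a list")
    if not prefix and not suffix:
        # Assumed error type; the docs only state that one of them is required.
        raise ValueError("provide prefix, suffix, or both")
    return [f"{prefix or ''}{name}{suffix or ''}" for name in parent_names]


if __name__ == "__main__":
    print(preview_transformed_names(["age", "zip_code"], suffix="_cat"))
```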
- clone_project(new_project_name=None, max_wait=600)
Create a fresh (post-EDA1) copy of this project that is ready for setting targets and modeling options.
- Parameters:
- new_project_namestr, optional
The desired name of the new project. If omitted, the API will default to ‘Copy of <original project>’
- max_waitint, optional
Time in seconds after which project creation is considered unsuccessful
- Returns:
- datarobot.models.Project
- Return type:
Project
- create_interaction_feature(name, features, separator, max_wait=600)
Create a new interaction feature by combining two categorical ones.
Added in version v2.21.
- Return type:
InteractionFeature
- Parameters:
- namestr
The name of the final interaction feature
- featureslist(str)
List of two categorical feature names
- separatorstr
The character used to join the two data values, one of these ` + - / | & . _ , `
- max_waitint, optional
Time in seconds after which feature creation is considered unsuccessful.
- Returns:
- datarobot.models.InteractionFeature
The data of the new Interaction feature
- Raises:
- ClientError
If the requested interaction feature cannot be created. Possible reasons include:
one of the features either does not exist or is of an unsupported type;
a feature with the requested name already exists;
an invalid separator character was submitted.
- AsyncFailureError
If any of the responses from the server are unexpected
- AsyncProcessUnsuccessfulError
If the job being waited for has failed or has been cancelled
- AsyncTimeoutError
If the resource did not resolve in time
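Two of the ClientError causes listed above can be caught before the request is sent: a feature list that does not contain exactly two names, and a separator outside the documented set. A minimal sketch follows; check_interaction_args and ALLOWED_SEPARATORS are hypothetical names for this example, and the check cannot tell whether the features exist or are categorical.

```python
# The separators listed in the parameter description above.
ALLOWED_SEPARATORS = set("+-/|&._,")


def check_interaction_args(features, separator):
    """Return a list of problems found in the arguments (empty if none)."""
    problems = []
    if len(features) != 2:
        problems.append("features must contain exactly two feature names")
    if separator not in ALLOWED_SEPARATORS:
        problems.append(f"separator {separator!r} is not one of the allowed characters")
    return problems


if __name__ == "__main__":
    print(check_interaction_args(["city", "product"], "_"))
    print(check_interaction_args(["city"], "#"))
```

When the returned list is empty, the same arguments can be passed on to create_interaction_feature(name, features, separator).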
- get_relationships_configuration()
Get the relationships configuration for a given project.
Added in version v2.21.
- Return type:
RelationshipsConfiguration
- Returns:
- relationships_configuration: RelationshipsConfiguration
relationships configuration applied to project
- download_feature_discovery_dataset(file_name, pred_dataset_id=None)
Download Feature discovery training or prediction dataset
- Parameters:
- file_namestr
File path where dataset will be saved.
- pred_dataset_idstr, optional
ID of the prediction dataset
- Return type:
None
- download_feature_discovery_recipe_sqls(file_name, model_id=None, max_wait=600)
Export and download Feature discovery recipe SQL statements.
Added in version v2.25.
- Parameters:
- file_namestr
File path where dataset will be saved.
- model_idstr, optional
ID of the model to export SQL for. If specified, SQL to generate only features used by the model will be exported. If not specified, SQL to generate all features will be exported.
- max_waitint, optional
Time in seconds after which export is considered unsuccessful.
- Raises:
- ClientError
If the requested SQL cannot be exported. A possible reason is that the feature is not available to the user.
- AsyncFailureError
If any of the responses from the server are unexpected.
- AsyncProcessUnsuccessfulError
If the job being waited for has failed or has been cancelled.
- AsyncTimeoutError
If the resource did not resolve in time.
- Return type:
None
- validate_external_time_series_baseline(catalog_version_id, target, datetime_partitioning, max_wait=600)
Validate external baseline prediction catalog.
The forecast window settings and the validation and holdout durations specified in the datetime specification must be consistent with the project settings, because these parameters are used to check whether the specified catalog version id has been validated or not. See the external baseline predictions documentation for example usage.
- Parameters:
- catalog_version_id: str
Id of the catalog version for validating external baseline predictions.
- target: str
The name of the target column.
- datetime_partitioning: DatetimePartitioning object
Instance of the DatetimePartitioning defined in datarobot.helpers.partitioning_methods.
Attributes of the object used to check the validation are:
datetime_partition_column
forecast_window_start
forecast_window_end
holdout_start_date
holdout_end_date
backtests
multiseries_id_columns
If the above attributes are different from the project settings, the catalog version will not pass the validation check in the autopilot.
- max_wait: int, optional
The maximum number of seconds to wait for the catalog version to be validated before raising an error.
- Returns:
- external_baseline_validation_info: ExternalBaselineValidationInfo
Validation result of the specified catalog version.
- Raises:
- AsyncTimeoutError
Raised if the catalog version validation took more time than specified by the
max_wait
parameter.
- Return type:
ExternalBaselineValidationInfo
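Since the catalog version fails the autopilot check when any of the attributes listed above differs from the project settings, a local comparison can point out mismatches early. The sketch below is an illustration under assumptions: mismatched_attributes is a hypothetical helper, it compares plain attribute values on any two objects, and it does not reproduce the server-side validation.

```python
from types import SimpleNamespace

# Attribute names taken from the list in the parameter description above.
CHECKED_ATTRIBUTES = (
    "datetime_partition_column",
    "forecast_window_start",
    "forecast_window_end",
    "holdout_start_date",
    "holdout_end_date",
    "backtests",
    "multiseries_id_columns",
)


def mismatched_attributes(candidate, project_settings):
    """Return the names of checked attributes whose values differ."""
    return [
        name
        for name in CHECKED_ATTRIBUTES
        if getattr(candidate, name, None) != getattr(project_settings, name, None)
    ]


if __name__ == "__main__":
    # Stand-in objects with made-up values; real code would pass partitioning objects.
    project_side = SimpleNamespace(datetime_partition_column="date", forecast_window_start=1, forecast_window_end=7)
    candidate = SimpleNamespace(datetime_partition_column="date", forecast_window_start=1, forecast_window_end=14)
    print(mismatched_attributes(candidate, project_side))
```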
- download_multicategorical_data_format_errors(file_name)
Download multicategorical data format errors to a CSV file. If any format errors were detected in potentially multicategorical features, the resulting file will contain at most 10 entries. The CSV file contains the feature name, the dataset index at which the error was detected, the row value, and the type of error detected. If there were no errors, or none of the features were potentially multicategorical, the CSV file will be empty, containing only the header.
- Parameters:
- file_namestr
File path where CSV file will be saved.
- Return type:
None
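Because a file with no errors still contains a header row, checking for errors means counting the rows after the header. The following sketch shows one way to do that with the standard csv module; count_format_errors is a hypothetical helper, and it makes no assumption about the column names in the downloaded file.

```python
import csv


def count_format_errors(csv_path):
    """Count data rows in the downloaded CSV, ignoring the header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    # A header-only (or completely empty) file means no errors were reported.
    return max(len(rows) - 1, 0)
```

A typical use would be to call download_multicategorical_data_format_errors(file_name) first and then pass the same path to this helper.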
- get_multiseries_names()
For a multiseries timeseries project it returns all distinct entries in the multiseries column. For a non timeseries project it will just return an empty list.
- Returns:
- multiseries_names: List[str]
List of all distinct entries in the multiseries column
- Return type:
List[Optional[str]]
- restart_segment(segment)
Restart single segment in a segmented project.
Added in version v2.28.
Segment restart is allowed only for segments that have not reached the modeling phase. A restart permanently removes the previous project and triggers the setup of a new one for that segment.
- Parameters:
- segmentstr
Segment to restart
- get_bias_mitigated_models(parent_model_id=None, offset=0, limit=100)
List the child models with bias mitigation applied.
Added in version v2.29.
- Return type:
List[Dict[str, Any]]
- Parameters:
- parent_model_idstr, optional
Filter by parent models
- offsetint, optional
Number of items to skip.
- limitint, optional
Number of items to return.
- Returns:
- modelslist of dict
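The offset and limit parameters allow the results to be read page by page. The sketch below shows one possible paging loop; iter_bias_mitigated_models is a hypothetical helper, project is assumed to be any object exposing get_bias_mitigated_models with the signature above, and stopping on a short or empty page is an assumption about how the end of the list shows up.

```python
def iter_bias_mitigated_models(project, parent_model_id=None, page_size=100):
    """Yield bias-mitigated model records one page at a time."""
    offset = 0
    while True:
        page = project.get_bias_mitigated_models(
            parent_model_id=parent_model_id, offset=offset, limit=page_size
        )
        if not page:
            break
        yield from page
        if len(page) < page_size:
            # Assumed end-of-results signal: fewer items than requested.
            break
        offset += len(page)
```

With a real project object, list(iter_bias_mitigated_models(project)) would collect every record into one list.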
- apply_bias_mitigation(bias_mitigation_parent_leaderboard_id, bias_mitigation_feature_name, bias_mitigation_technique, include_bias_mitigation_feature_as_predictor_variable)
Apply bias mitigation to an existing model by training a version of that model but with bias mitigation applied. An error will be returned if the model does not support bias mitigation with the technique requested.
Added in version v2.29.
- Return type:
ModelJob
- Parameters:
- bias_mitigation_parent_leaderboard_idstr
The leaderboard id of the model to apply bias mitigation to
- bias_mitigation_feature_namestr
The feature name of the protected features that will be used in a bias mitigation task to attempt to mitigate bias
- bias_mitigation_techniquestr, optional
The technique by which we’ll mitigate bias, which will inform which bias mitigation task we insert into blueprints. One of datarobot.enums.BiasMitigationTechnique; the options are ‘preprocessingReweighing’ and ‘postProcessingRejectionOptionBasedClassification’.
- include_bias_mitigation_feature_as_predictor_variablebool
Whether we should also use the mitigation feature as an input to the modeler just like any other categorical used for training, i.e. do we want the model to “train on” this feature in addition to using it for bias mitigation
- Returns:
- ModelJob
the job of the model with bias mitigation applied that was just submitted for training
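A thin wrapper can reject an unknown technique name before the job is submitted. The following is a sketch under assumptions: submit_bias_mitigation and KNOWN_TECHNIQUES are hypothetical names, the two technique strings are copied from the parameter description above, and project is any object exposing apply_bias_mitigation with the documented keyword arguments.

```python
# Technique names as listed in the parameter description above.
KNOWN_TECHNIQUES = (
    "preprocessingReweighing",
    "postProcessingRejectionOptionBasedClassification",
)


def submit_bias_mitigation(project, leaderboard_id, feature_name, technique,
                           include_as_predictor=False):
    """Validate the technique locally, then submit the bias mitigation job."""
    if technique not in KNOWN_TECHNIQUES:
        raise ValueError(f"unknown bias mitigation technique: {technique!r}")
    return project.apply_bias_mitigation(
        bias_mitigation_parent_leaderboard_id=leaderboard_id,
        bias_mitigation_feature_name=feature_name,
        bias_mitigation_technique=technique,
        include_bias_mitigation_feature_as_predictor_variable=include_as_predictor,
    )
```

The local check only guards against typos; whether a given model supports the chosen technique is still decided by the server.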
- request_bias_mitigation_feature_info(bias_mitigation_feature_name)
Request a compute job for bias mitigation feature info for a given feature, which will report whether there are any rare classes, whether there are any combinations of the target values and the feature values that never occur in the same row, and whether the feature has a high number of missing values. Note that this feature check is dependent on the current target selected for the project.
Added in version v2.29.
- Return type:
BiasMitigationFeatureInfo
- Parameters:
- bias_mitigation_feature_namestr
The feature name of the protected features that will be used in a bias mitigation task to attempt to mitigate bias
- Returns:
- BiasMitigationFeatureInfo
Bias mitigation feature info model for the requested feature
- get_bias_mitigation_feature_info(bias_mitigation_feature_name)
Get the computed bias mitigation feature info for a given feature, which will report whether there are any rare classes, whether there are any combinations of the target values and the feature values that never occur in the same row, and whether the feature has a high number of missing values. Note that this feature check is dependent on the current target selected for the project. If this info has not already been computed, this will raise a 404 error.
Added in version v2.29.
- Return type:
BiasMitigationFeatureInfo
- Parameters:
- bias_mitigation_feature_namestr
The feature name of the protected features that will be used in a bias mitigation task to attempt to mitigate bias
- Returns:
- BiasMitigationFeatureInfo
Bias mitigation feature info model for the requested feature
- classmethod from_data(data)
Instantiate an object of this class using a dict.
- Parameters:
- datadict
Correctly snake_cased keys and their values.
- Return type:
TypeVar(T, bound=APIObject)
- classmethod from_server_data(data, keep_attrs=None)
Instantiate an object of this class using the data directly from the server, meaning that the keys may have the wrong camel casing
- Parameters:
- datadict
The directly translated dict of JSON from the server. No casing fixes have taken place
- keep_attrsiterable
List, set or tuple of the dotted namespace notations for attributes to keep within the object structure even if their values are None
- Return type:
TypeVar(T, bound=APIObject)
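The difference between from_data and from_server_data comes down to key casing: server responses use camelCase keys, while from_data expects snake_case keys. The sketch below illustrates that kind of conversion; camel_to_snake and snake_case_keys are hypothetical helpers written for this example, not the client's own implementation, and the sample keys are made up.

```python
import re


def camel_to_snake(name):
    """Convert a camelCase key such as 'holdoutUnlocked' to snake_case."""
    # Insert an underscore before every capital letter except a leading one.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def snake_case_keys(data):
    """Recursively convert the keys of nested dicts and lists of dicts."""
    if isinstance(data, dict):
        return {camel_to_snake(key): snake_case_keys(value) for key, value in data.items()}
    if isinstance(data, list):
        return [snake_case_keys(item) for item in data]
    return data


if __name__ == "__main__":
    server_payload = {"projectName": "demo", "holdoutUnlocked": False}
    print(snake_case_keys(server_payload))
```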
- open_in_browser()
Opens the class’s relevant web browser location. If the default browser is not available, the URL is logged.
Note: If text-mode browsers are used, the calling process will block until the user exits the browser.
- Return type:
None
- set_datetime_partitioning(datetime_partition_spec=None, **kwargs)
Set the datetime partitioning method for a time series project by either passing in a DatetimePartitioningSpecification instance or any individual attributes of that class. Updates self.partitioning_method if already set previously (does not replace it).
This is an alternative to passing a specification to Project.analyze_and_model via the partitioning_method parameter. To see the full partitioning based on the project dataset, use DatetimePartitioning.generate.
Added in version v3.0.
- Return type:
DatetimePartitioning
- Parameters:
- datetime_partition_specDatetimePartitioningSpecification, optional
The customizable aspects of datetime partitioning for a time series project. An alternative to passing individual settings (attributes of the DatetimePartitioningSpecification class).
- Returns:
- DatetimePartitioning
Full partitioning including user-specified attributes as well as those determined by DR based on the dataset.
- list_datetime_partition_spec()
List datetime partitioning settings.
This method makes an API call to retrieve settings from the DB if project is in the modeling stage, i.e. if analyze_and_model (autopilot) has already been called.
If analyze_and_model has not yet been called, this method will instead simply print settings from project.partitioning_method.
Added in version v3.0.
- Return type:
Optional[DatetimePartitioningSpecification]
- Returns:
- DatetimePartitioningSpecification or None
- class datarobot.helpers.eligibility_result.EligibilityResult(supported, reason='', context='')
Represents whether a particular operation is supported
For instance, a function to check whether a set of models can be blended can return an EligibilityResult specifying whether or not blending is supported and why it may not be supported.
- Attributes:
- supportedbool
whether the operation this result represents is supported
- reasonstr
why the operation is or is not supported
- contextstr
what operation isn’t supported
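The three attributes combine naturally into a short status message. The sketch below shows one way to format them; describe_eligibility is a hypothetical helper, it accepts any object with supported, reason, and context attributes, and the wording of the message is an arbitrary choice for this example.

```python
def describe_eligibility(result):
    """Format an eligibility result as a one-line message."""
    operation = result.context or "operation"
    if result.supported:
        return f"{operation}: supported"
    reason = result.reason or "no reason given"
    return f"{operation}: not supported ({reason})"
```

Since reason and context default to empty strings in the constructor shown above, the helper falls back to generic text when they are not set.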