Batch predictions

class datarobot.models.BatchPredictionJob

A Batch Prediction Job is used to score large data sets on prediction servers using the Batch Prediction API.

Variables:: id (str) – the id of the job

classmethod score(deployment, intake_settings=None, output_settings=None, csv_settings=None, timeseries_settings=None, num_concurrent=None, chunk_size=None, passthrough_columns=None, passthrough_columns_set=None, max_explanations=None, max_ngram_explanations=None, explanation_algorithm=None, threshold_high=None, threshold_low=None, prediction_threshold=None, prediction_warning_enabled=None, include_prediction_status=False, skip_drift_tracking=False, prediction_instance=None, abort_on_error=True, column_names_remapping=None, include_probabilities=True, include_probabilities_classes=None, download_timeout=120, download_read_timeout=660, upload_read_timeout=600, explanations_mode=None)

Create new batch prediction job, upload the scoring dataset and return a batch prediction job.

The default intake and output options are both localFile. With these defaults, the caller must pass the file parameter in intake_settings and then either download the results with the download() method or pass a path in output_settings where the scored data will be saved.

Variables:

deployment (Deployment or string ID) – Deployment which will be used for scoring.
intake_settings (Optional[IntakeSettings]) –
A dict configuring where data is coming from. Supported options:
- type : str, either localFile, s3, azure, gcp, dataset, jdbc, snowflake, synapse, bigquery, or datasphere
Note that to pass a dataset, you not only need to specify the type parameter as dataset, but you must also set the dataset parameter as a dr.Dataset object.

To score from a local file, add this parameter to the settings:
- file : file-like object, string path to file or a pandas.DataFrame of scoring data
To score from S3, add the next parameters to the settings:
- url : str, the URL to score (e.g.: s3://bucket/key)
- credential_id : Optional[str]
- endpoint_url : Optional[str], any non-default endpoint URL for S3 access (omit to use the default)
To score from JDBC, add the next parameters to the settings:
- data_store_id : str, the ID of the external data store connected to the JDBC data source (see Database Connectivity).
- query : str (optional if table, schema and/or catalog is specified), a self-supplied SELECT statement of the data set you wish to predict.
- table : str (optional if query is specified), the name of specified database table.
- schema : str (optional if query is specified), the name of specified database schema.
- catalog : str (optional if query is specified), (new in v2.22) the name of specified database catalog.
- fetch_size : Optional[int], Changing the fetchSize can be used to balance throughput and memory usage.
- credential_id : Optional[str] the ID of the credentials holding information about a user with read-access to the JDBC data source (see Credentials).
To score from Datasphere, add the next parameters to the settings:
- data_store_id : str, the ID of the external data store connected to the Datasphere data source (see Database Connectivity).
- table : str, the name of specified database table.
- schema : str, the name of specified database schema.
- credential_id : str, the ID of the credentials holding information about a user with read-access to the Datasphere data source (see Credentials).
output_settings (Optional[OutputSettings]) –
A dict configuring how scored data is to be saved. Supported options:
- type : str, either localFile, s3, azure, gcp, jdbc, snowflake, synapse, bigquery, or datasphere
To save scored data to a local file, add this parameter to the settings:
- path : Optional[str], path to save the scored data as CSV. If a path is not specified, you must download the scored data yourself with job.download(). If a path is specified, the call will block until the job is done. If there are no other jobs currently processing for the targeted prediction instance, uploading, scoring, and downloading will happen in parallel without waiting for a full job to complete. Otherwise, it will still block, but start downloading the scored data as soon as it starts generating data. This is the fastest method to get predictions.
To save scored data to S3, add the next parameters to the settings:
- url : str, the URL for storing the results (e.g.: s3://bucket/key)
- credential_id : Optional[str]
- endpoint_url : Optional[str], any non-default endpoint URL for S3 access (omit to use the default)
To save scored data to JDBC, add the next parameters to the settings:
- data_store_id : str, the ID of the external data store connected to the JDBC data source (see Database Connectivity).
- table : str, the name of specified database table.
- schema : Optional[str], the name of specified database schema.
- catalog : Optional[str], (new in v2.22) the name of specified database catalog.
- statement_type : str, the type of insertion statement to create, one of datarobot.enums.AVAILABLE_STATEMENT_TYPES.
- update_columns : list(string) (optional), a list of strings containing those column names to be updated in case statement_type is set to a value related to update or upsert.
- where_columns : list(string) (optional), a list of strings containing those column names to be selected in case statement_type is set to a value related to insert or update.
- credential_id : str, the ID of the credentials holding information about a user with write-access to the JDBC data source (see Credentials).
To save scored data to Datasphere, add the following parameters to the settings:
- data_store_id : str, the ID of the external data store connected to the Datasphere data source (see Database Connectivity).
- table : str, the name of specified database table.
- schema : str, the name of specified database schema.
- credential_id : str, the ID of the credentials holding information about a user with write-access to the Datasphere data source (see Credentials).
csv_settings (Optional[CsvSettings]) –
CSV intake and output settings. Supported options:
- delimiter : str (optional, default ,), fields are delimited by this character. Use the string tab to denote TSV (TAB separated values). Must be either a one-character string or the string tab.
- quotechar : str (optional, default "), fields containing the delimiter must be quoted using this character.
- encoding : str (optional, default utf-8), encoding for the CSV files. For example (but not limited to): shift_jis, latin_1 or mskanji.
timeseries_settings (Optional[TimeSeriesSettings]) –
Configuration for time-series scoring. Supported options:
- type : str, must be forecast or historical (default if not passed is forecast). forecast mode makes predictions using forecast_point or rows in the dataset without target. historical enables bulk prediction mode which calculates predictions for all possible forecast points and forecast distances in the dataset within predictions_start_date/predictions_end_date range.
- forecast_point : Optional[datetime.datetime], forecast point for the dataset, used for the forecast predictions, by default value will be inferred from the dataset. May be passed if timeseries_settings.type=forecast.
- predictions_start_date : Optional[datetime.datetime], used for historical predictions in order to override date from which predictions should be calculated. By default value will be inferred automatically from the dataset. May be passed if timeseries_settings.type=historical.
- predictions_end_date : Optional[datetime.datetime], used for historical predictions in order to override the date up to which predictions should be calculated. By default value will be inferred automatically from the dataset. May be passed if timeseries_settings.type=historical.
- relax_known_in_advance_features_check : bool, (default False). If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.
num_concurrent (Optional[int]) – Number of concurrent chunks to score simultaneously. Defaults to the available number of cores of the deployment. Lower it to leave resources for real-time scoring.
chunk_size (str or Optional[int]) – Which strategy should be used to determine the chunk size. Can be either a named strategy or a fixed size in bytes.
- auto: use fixed or dynamic based on flipper
- fixed: use 1MB for explanations, 5MB for regular requests
- dynamic: use dynamic chunk sizes
- int: use this many bytes per chunk
passthrough_columns (list[string] (optional)) – Keep these columns from the scoring dataset in the scored dataset. This is useful for correlating predictions with source data.
passthrough_columns_set (Optional[str]) – To pass through every column from the scoring dataset, set this to all. Takes precedence over passthrough_columns if set.
max_explanations (Optional[int]) – Compute prediction explanations for this number of features.
max_ngram_explanations (int or str (optional)) – Compute text explanations for this number of ngrams. Set to all to return all ngram explanations, or set to a positive integer value to limit the number of ngram explanations returned. By default no ngram explanations will be computed and returned.
threshold_high (Optional[float]) – Only compute prediction explanations for predictions above this threshold. Can be combined with threshold_low.
threshold_low (Optional[float]) – Only compute prediction explanations for predictions below this threshold. Can be combined with threshold_high.
explanations_mode (PredictionExplanationsMode, optional) – Mode of prediction explanations calculation for multiclass and clustering models. If not specified, the server default is to explain only the predicted class, identical to passing TopPredictionsMode(1).
prediction_warning_enabled (Optional[bool]) – Add prediction warnings to the scored data. Currently only supported for regression models.
include_prediction_status (Optional[bool]) – Include the prediction_status column in the output, defaults to False.
skip_drift_tracking (Optional[bool]) – Skips drift tracking on any predictions made from this job. This is useful when running non-production workloads to not affect drift tracking and cause unnecessary alerts. Defaults to False.
prediction_instance (Optional[PredictionInstance]) –
Defaults to instance specified by deployment or system configuration. Supported options:
- hostName : str
- sslEnabled : boolean (optional, default true). Set to false to run prediction requests from the batch prediction job without SSL.
- datarobotKey : Optional[str], if running a job against a prediction instance in the Managed AI Cloud, you must provide the organization level DataRobot-Key
- apiKey : Optional[str], by default, prediction requests will use the API key of the user that created the job. This allows you to make requests on behalf of other users.
abort_on_error (Optional[bool]) – Default behavior is to abort the job if too many rows fail scoring. This will free up resources for other jobs that may score successfully. Set to false to unconditionally score every row no matter how many errors are encountered. Defaults to True.
column_names_remapping (Optional[Dict[str, str]]) – Mapping with column renaming for output table. Defaults to {}.
include_probabilities (Optional[bool]) – Flag that enables returning of all probability columns. Defaults to True.
include_probabilities_classes (list (optional)) – List the subset of classes if a user doesn’t want all the classes. Defaults to [].
download_timeout (Optional[int]) –

Added in version 2.22.

If using localFile output, wait this many seconds for the download to become available. See download().
download_read_timeout (Optional[int], default 660) –

Added in version 2.22.

If using localFile output, wait this many seconds for the server to respond between chunks.
upload_read_timeout (Optional[int], default 600) –

Added in version 2.28.

If using localFile intake, wait this many seconds for the server to respond after whole dataset upload.
prediction_threshold (Optional[float]) –

Added in version 3.4.0.

Threshold is the point that sets the class boundary for a predicted value. The model classifies an observation below the threshold as FALSE, and an observation above the threshold as TRUE. In other words, DataRobot automatically assigns the positive class label to any prediction exceeding the threshold. This value can be set between 0.0 and 1.0.

Returns:

Instance of BatchPredictionJob

Return type:

BatchPredictionJob
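
The localFile defaults described above can be sketched as follows. This is a minimal illustration, not the library's own example: local_file_settings and score_local_csv are hypothetical helper names, and the score() call assumes the datarobot package is installed and a client is already configured.

```python
from typing import Any, Dict, Optional, Tuple


def local_file_settings(
    input_path: str, output_path: Optional[str] = None
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Hypothetical helper: build localFile intake and output settings."""
    intake: Dict[str, Any] = {"type": "localFile", "file": input_path}
    output: Dict[str, Any] = {"type": "localFile"}
    if output_path is not None:
        # With a path set, score() blocks until the scored CSV is saved there.
        output["path"] = output_path
    return intake, output


def score_local_csv(deployment_id: str, input_path: str, output_path: str):
    """Submit the job. Requires the datarobot package and a configured client."""
    import datarobot as dr  # lazy import so the helper above has no dependencies

    intake, output = local_file_settings(input_path, output_path)
    return dr.BatchPredictionJob.score(
        deployment_id,
        intake_settings=intake,
        output_settings=output,
        passthrough_columns_set="all",  # keep every input column in the output
    )
```

Omitting output_path leaves the output settings without a path, in which case the results have to be fetched afterwards with download().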

classmethod apply_time_series_data_prep_and_score(deployment, intake_settings, timeseries_settings, **kwargs)

Prepare the dataset with time series data prep, create new batch prediction job, upload the scoring dataset, and return a batch prediction job.

The supported intake_settings are of type localFile or dataset.

For timeseries_settings of type forecast the forecast_point must be specified.

Refer to the datarobot.models.BatchPredictionJob.score() method for details on the other kwargs parameters.

Added in version v3.1.

Variables:

deployment (Deployment) – Deployment which will be used for scoring.
intake_settings (dict) –
A dict configuring where data is coming from. Supported options:
- type : str, either localFile, dataset
Note that to pass a dataset, you not only need to specify the type parameter as dataset, but you must also set the dataset parameter as a Dataset object.

To score from a local file, add this parameter to the settings:
- file : file-like object, string path to file or a pandas.DataFrame of scoring data.
timeseries_settings (dict) –
Configuration for time-series scoring. Supported options:
- type : str, must be forecast or historical (default if not passed is forecast). forecast mode makes predictions using forecast_point. historical enables bulk prediction mode which calculates predictions for all possible forecast points and forecast distances in the dataset within predictions_start_date/predictions_end_date range.
- forecast_point : Optional[datetime.datetime], forecast point for the dataset, used for the forecast predictions. Must be passed if timeseries_settings.type=forecast.
- predictions_start_date : Optional[datetime.datetime], used for historical predictions in order to override date from which predictions should be calculated. By default value will be inferred automatically from the dataset. May be passed if timeseries_settings.type=historical.
- predictions_end_date : Optional[datetime.datetime], used for historical predictions in order to override the date up to which predictions should be calculated. By default value will be inferred automatically from the dataset. May be passed if timeseries_settings.type=historical.
- relax_known_in_advance_features_check : bool, (default False). If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.

Returns:

Instance of BatchPredictionJob

Return type:

BatchPredictionJob

Raises:

InvalidUsageError – If the deployment does not support time series data prep. If the intake type is not supported for time series data prep.
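
The two timeseries_settings modes can be sketched as small builder functions. This is only an illustration under stated assumptions: forecast_settings, historical_settings and prep_and_score are hypothetical names, the start-before-end check is an extra client-side sanity check rather than documented behavior, and the final call assumes an installed and configured datarobot client.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def forecast_settings(forecast_point: datetime) -> Dict[str, Any]:
    """Hypothetical helper: forecast mode, where forecast_point must be given."""
    if not isinstance(forecast_point, datetime):
        raise TypeError("forecast_point must be a datetime.datetime")
    return {"type": "forecast", "forecast_point": forecast_point}


def historical_settings(
    start: Optional[datetime] = None, end: Optional[datetime] = None
) -> Dict[str, Any]:
    """Hypothetical helper: historical mode with an optional date range."""
    if start is not None and end is not None and start >= end:
        # Assumption: a range that ends before it starts is a caller error.
        raise ValueError("predictions_start_date must precede predictions_end_date")
    settings: Dict[str, Any] = {"type": "historical"}
    if start is not None:
        settings["predictions_start_date"] = start
    if end is not None:
        settings["predictions_end_date"] = end
    return settings


def prep_and_score(deployment, csv_path: str, forecast_point: datetime):
    """Run data prep and scoring. Requires a configured datarobot client."""
    import datarobot as dr  # lazy import keeps the builders dependency-free

    return dr.BatchPredictionJob.apply_time_series_data_prep_and_score(
        deployment,
        intake_settings={"type": "localFile", "file": csv_path},
        timeseries_settings=forecast_settings(forecast_point),
    )
```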

classmethod score_to_file(deployment, intake_path, output_path, **kwargs)

Create new batch prediction job, upload the scoring dataset and download the scored CSV file concurrently.

Will block until the entire file is scored.

Refer to the datarobot.models.BatchPredictionJob.score() method for details on the other kwargs parameters.

Variables:

deployment (Deployment or string ID) – Deployment which will be used for scoring.
intake_path (file-like object/string path to file/pandas.DataFrame) – Scoring data
output_path (str) – Filename to save the result under

Returns:

Instance of BatchPredictionJob

Return type:

BatchPredictionJob
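
A short sketch of calling score_to_file follows. The naming rule in scored_path_for (appending _scored to the file stem) and the wrapper score_csv are purely illustrative assumptions; only the score_to_file call itself comes from this reference, and it needs a configured datarobot client.

```python
from pathlib import Path


def scored_path_for(intake_path: str, suffix: str = "_scored") -> str:
    """Hypothetical naming rule: scores.csv becomes scores_scored.csv."""
    path = Path(intake_path)
    return str(path.with_name(path.stem + suffix + path.suffix))


def score_csv(deployment, intake_path: str):
    """Score a CSV next to its source file. Requires a configured client."""
    import datarobot as dr  # lazy import so scored_path_for works standalone

    output_path = scored_path_for(intake_path)
    # Blocks until the entire file is scored and saved to output_path.
    job = dr.BatchPredictionJob.score_to_file(deployment, intake_path, output_path)
    return job, output_path
```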

classmethod apply_time_series_data_prep_and_score_to_file(deployment, intake_path, output_path, timeseries_settings, **kwargs)

Prepare the input dataset with time series data prep. Then, create a new batch prediction job using the prepared AI catalog item as input and concurrently download the scored CSV file.

The function call will return when the entire file is scored.

For timeseries_settings of type forecast the forecast_point must be specified.

Refer to the datarobot.models.BatchPredictionJob.score() method for details on the other kwargs parameters.

Added in version v3.1.

Variables:

deployment (Deployment) – The deployment which will be used for scoring.
intake_path (file-like object/string path to file/pandas.DataFrame) – The scoring data.
output_path (str) – The filename under which you save the result.
timeseries_settings (dict) –
Configuration for time-series scoring. Supported options:
- type : str, must be forecast or historical (default if not passed is forecast). forecast mode makes predictions using forecast_point. historical enables bulk prediction mode which calculates predictions for all possible forecast points and forecast distances in the dataset within predictions_start_date/predictions_end_date range.
- forecast_point : Optional[datetime.datetime], forecast point for the dataset, used for the forecast predictions. Must be passed if timeseries_settings.type=forecast.
- predictions_start_date : Optional[datetime.datetime], used for historical predictions in order to override date from which predictions should be calculated. By default value will be inferred automatically from the dataset. May be passed if timeseries_settings.type=historical.
- predictions_end_date : Optional[datetime.datetime], used for historical predictions in order to override the date up to which predictions should be calculated. By default value will be inferred automatically from the dataset. May be passed if timeseries_settings.type=historical.
- relax_known_in_advance_features_check : bool, (default False). If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.

Returns:

Instance of BatchPredictionJob.

Return type:

BatchPredictionJob

Raises:

InvalidUsageError – If the deployment does not support time series data prep.

classmethod score_s3(deployment, source_url, destination_url, credential=None, endpoint_url=None, **kwargs)

Create new batch prediction job, with a scoring dataset from S3 and writing the result back to S3.

This returns immediately after the job has been created. You must poll for job completion using get_status() or wait_for_completion() (see datarobot.models.Job).

Refer to the datarobot.models.BatchPredictionJob.score() method for details on the other kwargs parameters.

Variables:

deployment (Deployment or string ID) – Deployment which will be used for scoring.
source_url (str) – The URL for the prediction dataset (e.g.: s3://bucket/key)
destination_url (str) – The URL for the scored dataset (e.g.: s3://bucket/key)
credential (str or Credential (optional)) – The AWS Credential object or credential id
endpoint_url (Optional[str]) – Any non-default endpoint URL for S3 access (omit to use the default)

Returns:

Instance of BatchPredictionJob

Return type:

BatchPredictionJob
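
Because score_s3 returns as soon as the job is created, a caller usually validates the URLs up front and then waits explicitly. The sketch below assumes this flow: check_s3_url and score_bucket are hypothetical helpers, the URL check is a client-side convenience rather than something the library is documented to perform, and the call requires a configured datarobot client.

```python
from typing import Optional
from urllib.parse import urlparse


def check_s3_url(url: str) -> str:
    """Hypothetical check that a URL looks like s3://bucket/key."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"expected s3://bucket/key, got {url!r}")
    return url


def score_bucket(
    deployment, source_url: str, destination_url: str,
    credential_id: Optional[str] = None,
):
    """Score S3 to S3 and wait for the result. Requires a configured client."""
    import datarobot as dr  # lazy import so check_s3_url works standalone

    job = dr.BatchPredictionJob.score_s3(
        deployment,
        source_url=check_s3_url(source_url),
        destination_url=check_s3_url(destination_url),
        credential=credential_id,
    )
    # score_s3 does not block, so wait for completion before using the output.
    job.wait_for_completion()
    return job
```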

classmethod score_azure(deployment, source_url, destination_url, credential=None, **kwargs)

Create new batch prediction job, with a scoring dataset from Azure blob storage and writing the result back to Azure blob storage.

This returns immediately after the job has been created. You must poll for job completion using get_status() or wait_for_completion() (see datarobot.models.Job).

Refer to the datarobot.models.BatchPredictionJob.score() method for details on the other kwargs parameters.

Variables:

deployment (Deployment or string ID) – Deployment which will be used for scoring.
source_url (str) – The URL for the prediction dataset (e.g.: https://storage_account.blob.endpoint/container/blob_name)
destination_url (str) – The URL for the scored dataset (e.g.: https://storage_account.blob.endpoint/container/blob_name)
credential (str or Credential (optional)) – The Azure Credential object or credential id

Returns:

Instance of BatchPredictionJob

Return type:

BatchPredictionJob

classmethod score_gcp(deployment, source_url, destination_url, credential=None, **kwargs)

Create new batch prediction job, with a scoring dataset from Google Cloud Storage and writing the result back to Google Cloud Storage.

This returns immediately after the job has been created. You must poll for job completion using get_status() or wait_for_completion() (see datarobot.models.Job).

Refer to the datarobot.models.BatchPredictionJob.score() method for details on the other kwargs parameters.

Variables:

deployment (Deployment or string ID) – Deployment which will be used for scoring.
source_url (str) – The URL for the prediction dataset (e.g.: http(s)://storage.googleapis.com/[bucket]/[object])
destination_url (str) – The URL for the scored dataset (e.g.: http(s)://storage.googleapis.com/[bucket]/[object])
credential (str or Credential (optional)) – The GCP Credential object or credential id

Returns:

Instance of BatchPredictionJob

Return type:

BatchPredictionJob

classmethod score_from_existing(batch_prediction_job_id)

Create a new batch prediction job based on the settings from a previously created one.

Variables:: batch_prediction_job_id (str) – ID of the previous batch prediction job
Returns:: Instance of BatchPredictionJob
Return type:: BatchPredictionJob

classmethod score_pandas(deployment, df, read_timeout=660, **kwargs)

Run a batch prediction job, with a scoring dataset from a pandas dataframe. The output from the prediction will be joined to the passed DataFrame and returned.

Use column_names_remapping to drop or rename columns in the output.

This method blocks until the job has completed or raises an exception on errors.

Refer to the datarobot.models.BatchPredictionJob.score() method for details on the other kwargs parameters.

Variables:

deployment (Deployment or string ID) – Deployment which will be used for scoring.
df (pandas.DataFrame) – The dataframe to score

Return type:

Tuple[BatchPredictionJob, DataFrame]

Returns:

BatchPredictionJob – Instance of BatchPredictionJob
pandas.DataFrame – The original dataframe merged with the predictions
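
Since score_pandas returns a tuple, the result is typically unpacked into the job and the merged DataFrame. The sketch below also shows one way to find which columns the scoring step added. added_columns and score_frame are hypothetical helpers, and the call assumes a configured datarobot client.

```python
from typing import Iterable, List


def added_columns(original: Iterable[str], scored: Iterable[str]) -> List[str]:
    """Hypothetical helper: columns present after scoring but not before."""
    seen = set(original)
    return [name for name in scored if name not in seen]


def score_frame(deployment, df):
    """Score a DataFrame in memory. Requires a configured datarobot client."""
    import datarobot as dr  # lazy import so added_columns works standalone

    # Blocks until the job has completed, then returns (job, merged DataFrame).
    job, scored = dr.BatchPredictionJob.score_pandas(deployment, df)
    return job, scored, added_columns(df.columns, scored.columns)
```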

classmethod score_with_leaderboard_model(model, intake_settings=None, output_settings=None, csv_settings=None, timeseries_settings=None, passthrough_columns=None, passthrough_columns_set=None, max_explanations=None, max_ngram_explanations=None, explanation_algorithm=None, threshold_high=None, threshold_low=None, prediction_threshold=None, prediction_warning_enabled=None, include_prediction_status=False, abort_on_error=True, column_names_remapping=None, include_probabilities=True, include_probabilities_classes=None, download_timeout=120, download_read_timeout=660, upload_read_timeout=600, explanations_mode=None)

Creates a new batch prediction job for a Leaderboard model by uploading the scoring dataset. Returns a batch prediction job.

The default intake and output options are both localFile, which requires the caller to pass the file parameter and either download the results using the download() method afterwards or pass a path to a file where the scored data will be downloaded to.

Variables:

model (Model or DatetimeModel or string ID) – Model which will be used for scoring.
intake_settings (Optional[IntakeSettings]) –
A dict configuring where data is coming from. Supported options:
- type : str, either localFile, dataset, or dss.
Note that to pass a dataset, you not only need to specify the type parameter as dataset, but you must also set the dataset parameter as a dr.Dataset object.

To score from a local file, add this parameter to the settings:
- file : file-like object, string path to file or a pandas.DataFrame of scoring data.
To score a subset of training data, use the dss intake type and specify the following parameters:
- project_id : project to fetch training data from. Access to project is required.
- partition : subset of training data to score, one of datarobot.enums.TrainingDataSubsets.
output_settings (Optional[OutputSettings]) –
A dict configuring how scored data is to be saved. Supported options:
- type : str, localFile
To save scored data to a local file, add this parameter to the settings:
- path : Optional[str] The path to save the scored data as a CSV file. If a path is not specified, you must download the scored data yourself with job.download(). If a path is specified, the call is blocked until the job is done. If there are no other jobs currently processing for the targeted prediction instance, uploading, scoring, and downloading will happen in parallel without waiting for a full job to complete. Otherwise, it will still block, but start downloading the scored data as soon as it starts generating data. This is the fastest method to get predictions.
csv_settings (Optional[CsvSettings]) –
CSV intake and output settings. Supported options:
- delimiter : str (optional, default ,), fields are delimited by this character. Use the string tab to denote TSV (TAB separated values). Must be either a one-character string or the string tab.
- quotechar : str (optional, default "), fields containing the delimiter must be quoted using this character.
- encoding : str (optional, default utf-8), encoding for the CSV files. For example (but not limited to): shift_jis, latin_1 or mskanji.
timeseries_settings (Optional[TimeSeriesSettings]) –
Configuration for time-series scoring. Supported options:
- type : str, must be forecast, historical (default if not passed is forecast), or training. forecast mode makes predictions using forecast_point or rows in the dataset without target. historical enables bulk prediction mode which calculates predictions for all possible forecast points and forecast distances in the dataset within predictions_start_date/predictions_end_date range. training mode is a special case for predictions on subsets of training data. Note, that it must be used in conjunction with dss intake type only.
- forecast_point : Optional[datetime.datetime], forecast point for the dataset, used for the forecast predictions, by default value will be inferred from the dataset. May be passed if timeseries_settings.type=forecast.
- predictions_start_date : Optional[datetime.datetime], used for historical predictions in order to override date from which predictions should be calculated. By default value will be inferred automatically from the dataset. May be passed if timeseries_settings.type=historical.
- predictions_end_date : Optional[datetime.datetime], used for historical predictions in order to override the date up to which predictions should be calculated. By default value will be inferred automatically from the dataset. May be passed if timeseries_settings.type=historical.
- relax_known_in_advance_features_check : bool, (default False). If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.
passthrough_columns (list[string] (optional)) – Keep these columns from the scoring dataset in the scored dataset. This is useful for correlating predictions with source data.
passthrough_columns_set (Optional[str]) – To pass through every column from the scoring dataset, set this to all. Takes precedence over passthrough_columns if set.
max_explanations (Optional[int]) – Compute prediction explanations for this number of features.
max_ngram_explanations (int or str (optional)) – Compute text explanations for this number of ngrams. Set to all to return all ngram explanations, or set to a positive integer value to limit the number of ngram explanations returned. By default no ngram explanations will be computed and returned.
threshold_high (Optional[float]) – Only compute prediction explanations for predictions above this threshold. Can be combined with threshold_low.
threshold_low (Optional[float]) – Only compute prediction explanations for predictions below this threshold. Can be combined with threshold_high.
explanations_mode (PredictionExplanationsMode, optional) – Mode of prediction explanations calculation for multiclass and clustering models. If not specified, the server default is to explain only the predicted class, identical to passing TopPredictionsMode(1).
prediction_warning_enabled (Optional[bool]) – Add prediction warnings to the scored data. Currently only supported for regression models.
include_prediction_status (Optional[bool]) – Include the prediction_status column in the output, defaults to False.
abort_on_error (Optional[bool]) – Default behavior is to abort the job if too many rows fail scoring. This will free up resources for other jobs that may score successfully. Set to false to unconditionally score every row no matter how many errors are encountered. Defaults to True.
column_names_remapping (Optional[Dict]) – Mapping with column renaming for output table. Defaults to {}.
include_probabilities (Optional[bool]) – Flag that enables returning of all probability columns. Defaults to True.
include_probabilities_classes (list (optional)) – List the subset of classes if you do not want all the classes. Defaults to [].
download_timeout (Optional[int]) –

Added in version 2.22.

If using localFile output, wait this many seconds for the download to become available. See download().
download_read_timeout (int (optional, default 660)) –

Added in version 2.22.

If using localFile output, wait this many seconds for the server to respond between chunks.
upload_read_timeout (int (optional, default 600)) –

Added in version 2.28.

If using localFile intake, wait this many seconds for the server to respond after whole dataset upload.
prediction_threshold (Optional[float]) –

Added in version 3.4.0.

Threshold is the point that sets the class boundary for a predicted value. The model classifies an observation below the threshold as FALSE, and an observation above the threshold as TRUE. In other words, DataRobot automatically assigns the positive class label to any prediction exceeding the threshold. This value can be set between 0.0 and 1.0.

Returns:

Instance of BatchPredictionJob

Return type:

BatchPredictionJob
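
Scoring a subset of training data with the dss intake type could look like the sketch below. dss_intake and score_training_subset are hypothetical helpers. The partition value is passed through unchanged and is expected to be one of datarobot.enums.TrainingDataSubsets. The call itself assumes a configured datarobot client.

```python
from typing import Dict


def dss_intake(project_id: str, partition: str) -> Dict[str, str]:
    """Hypothetical helper: intake settings for scoring training data."""
    if not project_id:
        raise ValueError("project_id is required for dss intake")
    # partition should be a member of datarobot.enums.TrainingDataSubsets.
    return {"type": "dss", "project_id": project_id, "partition": partition}


def score_training_subset(model, project_id: str, partition: str, output_path: str):
    """Score a training partition to a local CSV. Requires a configured client."""
    import datarobot as dr  # lazy import so dss_intake works standalone

    return dr.BatchPredictionJob.score_with_leaderboard_model(
        model,
        intake_settings=dss_intake(project_id, partition),
        output_settings={"type": "localFile", "path": output_path},
    )
```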

classmethod get(batch_prediction_job_id)

Get batch prediction job

Variables:: batch_prediction_job_id (str) – ID of batch prediction job
Returns:: Instance of BatchPredictionJob
Return type:: BatchPredictionJob

download(fileobj, timeout=120, read_timeout=660)

Downloads the CSV result of a prediction job

Variables:

fileobj (file-like object) – A file-like object where the CSV prediction results will be written to. Examples include an in-memory buffer (e.g., io.BytesIO) or a file on disk (opened for binary writing).
timeout (int (optional, default 120)) –

Added in version 2.22.

Seconds to wait for the download to become available.

The download will not be available before the job has started processing. In case other jobs are occupying the queue, processing may not start immediately.

If the timeout is reached, the job will be aborted and RuntimeError is raised.

Set to -1 to wait infinitely.
read_timeout (int (optional, default 660)) –

Added in version 2.22.

Seconds to wait for the server to respond between chunks.

Return type:

None
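
As a sketch of the in-memory option mentioned for fileobj, the helper below downloads into an io.BytesIO buffer and parses the CSV with the standard library. parse_scored_csv and download_rows are hypothetical names, and the UTF-8 default assumes the job used the default csv_settings encoding.

```python
import csv
import io
from typing import Dict, List


def parse_scored_csv(raw: bytes, encoding: str = "utf-8") -> List[Dict[str, str]]:
    """Decode downloaded CSV bytes into one dict per scored row."""
    text = io.StringIO(raw.decode(encoding), newline="")
    return list(csv.DictReader(text))


def download_rows(job, timeout: int = 120) -> List[Dict[str, str]]:
    """Download a job's results into memory; job is a BatchPredictionJob."""
    buffer = io.BytesIO()
    # download() writes the CSV result into the supplied file-like object.
    job.download(buffer, timeout=timeout)
    return parse_scored_csv(buffer.getvalue())
```

Holding the whole result in memory only suits small outputs. For large jobs, passing a file opened for binary writing avoids buffering everything at once.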

delete(ignore_404_errors=False)

Cancel this job. If this job has not finished running, it will be removed and canceled.

Return type:: None

get_status()

Get status of batch prediction job

Returns:: Dict with job status
Return type:: BatchPredictionJob status data
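
A polling loop over get_status() might look like the sketch below. It rests on two assumptions that this reference does not spell out: that the returned dict exposes the job state under a "status" key, and that COMPLETED and ABORTED are the states to stop on (extend the terminal tuple if your jobs can end in other states). wait_for_terminal_status is a hypothetical helper.

```python
import time
from typing import Any, Callable, Dict, Iterable


def wait_for_terminal_status(
    job,
    terminal: Iterable[str] = ("COMPLETED", "ABORTED"),
    interval: float = 5.0,
    max_wait: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll job.get_status() until a terminal state or until max_wait elapses."""
    terminal_states = set(terminal)
    deadline = time.monotonic() + max_wait
    while True:
        status = job.get_status()
        # Assumption: the job state is stored under the "status" key.
        if status.get("status") in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job not finished after {max_wait} seconds")
        sleep(interval)
```

The sleep argument is injectable so the loop can be exercised in tests without real delays.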

classmethod list_by_status(statuses=None)

Get jobs collection for specific set of statuses

Variables:: statuses – List of statuses to filter jobs by ([ABORTED|COMPLETED…]). If statuses is not provided, returns all jobs for the user.
Returns:: List of job statuses dicts with specific statuses
Return type:: BatchPredictionJob statuses

class datarobot.models.BatchPredictionJobDefinition

classmethod get(batch_prediction_job_definition_id)

Get batch prediction job definition

Variables:: batch_prediction_job_definition_id (str) – ID of batch prediction job definition
Returns:: Instance of BatchPredictionJobDefinition
Return type:: BatchPredictionJobDefinition

Examples

>>> import datarobot as dr
>>> definition = dr.BatchPredictionJobDefinition.get('5a8ac9ab07a57a0001be501f')
>>> definition
BatchPredictionJobDefinition(5a8ac9ab07a57a0001be501f)

classmethod list(search_name=None, deployment_id=None, limit=<datarobot.models.batch_prediction_job.MissingType object>, offset=0)

Get all job definitions

Parameters:

search_name (Optional[str]) – String for filtering job definitions. Job definitions whose name contains the string will be returned. If not specified, all available job definitions will be returned.
deployment_id (str) – The ID of the deployment the record belongs to.
limit (Optional[int]) – 0 by default. At most this many results are returned.
offset (Optional[int]) – This many results will be skipped.

Returns:

List of job definitions the user has access to see

Return type:

List[BatchPredictionJobDefinition]

Examples

>>> import datarobot as dr
>>> definitions = dr.BatchPredictionJobDefinition.list()
>>> definitions
[
    BatchPredictionJobDefinition(60912e09fd1f04e832a575c1),
    BatchPredictionJobDefinition(6086ba053f3ef731e81af3ca)
]
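
The limit and offset parameters can be combined to page through a long list of definitions. The generator below is a minimal sketch under stated assumptions: iter_all and its page_size default are hypothetical, and list_page stands for any callable that accepts limit and offset keyword arguments and returns a list, as list() does according to the signature above.

```python
def iter_all(list_page, page_size=100):
    """Yield every item by requesting consecutive pages of `page_size` items."""
    offset = 0
    while True:
        page = list_page(limit=page_size, offset=offset)
        if not page:
            break
        yield from page
        if len(page) < page_size:
            # A short page means there is nothing left to fetch.
            break
        offset += len(page)


# With a configured client, usage might look like:
#   for definition in iter_all(dr.BatchPredictionJobDefinition.list, page_size=50):
#       print(definition)
```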

classmethod create(enabled, batch_prediction_job, name=None, schedule=None)

Creates a new batch prediction job definition to be run either at a scheduled interval or as a manual run.

Variables:

enabled (bool (default False)) – Whether or not the definition should be active on a scheduled basis. If True, schedule is required.
batch_prediction_job (dict) – The job specifications for your batch prediction job. It requires the same job input parameters as score(), but it does not start scoring; it only stores the specification as a definition for later use.
name (Optional[str]) – The name you want your job to be identified with. Must be unique across the organization’s existing jobs. If you don’t supply a name, a random one will be generated for you.
schedule (Optional[Dict]) –
The schedule payload defines at what intervals the job should run. Its elements can be combined in various ways to construct complex schedules if needed. For each element of the payload, you can supply either an asterisk ["*"], denoting “every” unit of that time denomination, or an array of integers (e.g. [1, 2, 3]) to select specific values.

The schedule payload is split up in the following items:

Minute: The minute(s) of the day that the job will run. Allowed values are either ["*"] meaning every minute of the day or [0 ... 59].

Hour: The hour(s) of the day that the job will run. Allowed values are either ["*"] meaning every hour of the day or [0 ... 23].

Day of Month: The date(s) of the month that the job will run. Allowed values are either [1 ... 31] or ["*"] for all days of the month. This field is additive with dayOfWeek, meaning the job will run both on the date(s) defined in this field and the day specified by dayOfWeek (for example, dates 1st, 2nd, 3rd, plus every Tuesday). If dayOfMonth is set to ["*"] and dayOfWeek is defined, the scheduler will trigger on every day of the month that matches dayOfWeek (for example, Tuesday the 2nd, 9th, 16th, 23rd, 30th). Invalid dates such as February 31st are ignored.

Month: The month(s) of the year that the job will run. Allowed values are either [1 ... 12] or ["*"] for all months of the year. Strings, either 3-letter abbreviations or the full name of the month, can be used interchangeably (e.g., “jan” or “october”). Months that are not compatible with dayOfMonth are ignored, for example {"dayOfMonth": [31], "month": ["feb"]}.

Day of Week: The day(s) of the week that the job will run. Allowed values are [0 ... 6], where Sunday=0, or ["*"] for all days of the week. Strings, either 3-letter abbreviations or the full name of the day, can be used interchangeably (e.g., “sunday”, “Sunday”, “sun”, or “Sun” all map to [0]). This field is additive with dayOfMonth, meaning the job will run both on the date specified by dayOfMonth and the day defined in this field.

Returns:

Instance of BatchPredictionJobDefinition

Return type:

BatchPredictionJobDefinition

Examples

>>> import datarobot as dr
>>> job_spec = {
...    "num_concurrent": 4,
...    "deployment_id": "foobar",
...    "intake_settings": {
...        "url": "s3://foobar/123",
...        "type": "s3",
...        "format": "csv"
...    },
...    "output_settings": {
...        "url": "s3://foobar/123",
...        "type": "s3",
...        "format": "csv"
...    },
...}
>>> schedule = {
...    "day_of_week": [
...        1
...    ],
...    "month": [
...        "*"
...    ],
...    "hour": [
...        16
...    ],
...    "minute": [
...        0
...    ],
...    "day_of_month": [
...        1
...    ]
...}
>>> definition = dr.BatchPredictionJobDefinition.create(
...    enabled=False,
...    batch_prediction_job=job_spec,
...    name="some_definition_name",
...    schedule=schedule
... )
>>> definition
BatchPredictionJobDefinition(60912e09fd1f04e832a575c1)
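
The additive relationship between dayOfMonth and dayOfWeek can be checked with a small plain-Python sketch. It illustrates the semantics described for the schedule payload and is not DataRobot's scheduler. The names schedule_matches and _field_matches are hypothetical, the snake_case keys follow the example above, only integer values and ["*"] are handled (no month or day names), and the branch where only day_of_week is ["*"] is an assumption.

```python
from datetime import datetime


def _field_matches(field, value):
    """Return True if `value` is selected by a field such as ["*"] or [1, 2, 3]."""
    return field == ["*"] or value in field


def schedule_matches(schedule, moment):
    """Return True if the schedule would trigger at `moment` (minute resolution)."""
    # Python uses Monday=0; the schedule payload uses Sunday=0.
    day_of_week = (moment.weekday() + 1) % 7
    dom, dow = schedule["day_of_month"], schedule["day_of_week"]
    if dom == ["*"] and dow == ["*"]:
        day_ok = True
    elif dom == ["*"]:
        # Only the weekday restricts the run.
        day_ok = day_of_week in dow
    elif dow == ["*"]:
        # Assumption: only the date restricts the run.
        day_ok = moment.day in dom
    else:
        # Additive: listed dates plus listed weekdays.
        day_ok = moment.day in dom or day_of_week in dow
    return (
        day_ok
        and _field_matches(schedule["month"], moment.month)
        and _field_matches(schedule["hour"], moment.hour)
        and _field_matches(schedule["minute"], moment.minute)
    )


schedule = {
    "day_of_week": [1],
    "month": ["*"],
    "hour": [16],
    "minute": [0],
    "day_of_month": [1],
}
print(schedule_matches(schedule, datetime(2024, 1, 8, 16, 0)))  # Monday the 8th: True
print(schedule_matches(schedule, datetime(2024, 2, 1, 16, 0)))  # Thursday the 1st: True
print(schedule_matches(schedule, datetime(2024, 1, 3, 16, 0)))  # Wednesday the 3rd: False
```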

update(enabled, batch_prediction_job=None, name=None, schedule=None)

Updates a job definition with the changed specs.

Takes the same input as create()

Variables:

enabled (bool (default False)) – Same as enabled in create().
batch_prediction_job (dict) – Same as batch_prediction_job in create().
name (Optional[str]) – Same as name in create().
schedule (dict) – Same as schedule in create().

Returns:

Instance of the updated BatchPredictionJobDefinition

Return type:

BatchPredictionJobDefinition

Examples

>>> import datarobot as dr
>>> job_spec = {
...    "num_concurrent": 5,
...    "deployment_id": "foobar_new",
...    "intake_settings": {
...        "url": "s3://foobar/123",
...        "type": "s3",
...        "format": "csv"
...    },
...    "output_settings": {
...        "url": "s3://foobar/123",
...        "type": "s3",
...        "format": "csv"
...    },
...}
>>> schedule = {
...    "day_of_week": [
...        1
...    ],
...    "month": [
...        "*"
...    ],
...    "hour": [
...        "*"
...    ],
...    "minute": [
...        30, 59
...    ],
...    "day_of_month": [
...        1, 2, 6
...    ]
...}
>>> definition = dr.BatchPredictionJobDefinition.get('60912e09fd1f04e832a575c1')
>>> definition = definition.update(
...    enabled=False,
...    batch_prediction_job=job_spec,
...    name="updated_definition_name",
...    schedule=schedule
... )
>>> definition
BatchPredictionJobDefinition(60912e09fd1f04e832a575c1)

run_on_schedule(schedule)

Sets the run schedule of an already created job definition.

If the job was previously not enabled, this will also set the job to enabled.

Variables:: schedule (dict) – Same as schedule in create().
Returns:: Instance of the updated BatchPredictionJobDefinition with the new / updated schedule.
Return type:: BatchPredictionJobDefinition

Examples

>>> import datarobot as dr
>>> definition = dr.BatchPredictionJobDefinition.create('...')
>>> schedule = {
...    "day_of_week": [
...        1
...    ],
...    "month": [
...        "*"
...    ],
...    "hour": [
...        "*"
...    ],
...    "minute": [
...        30, 59
...    ],
...    "day_of_month": [
...        1, 2, 6
...    ]
...}
>>> definition.run_on_schedule(schedule)
BatchPredictionJobDefinition(60912e09fd1f04e832a575c1)

run_once()

Manually submits a batch prediction job to the queue, based off of an already created job definition.

Returns:: Instance of BatchPredictionJob
Return type:: BatchPredictionJob

Examples

>>> import datarobot as dr
>>> definition = dr.BatchPredictionJobDefinition.create('...')
>>> job = definition.run_once()
>>> job.wait_for_completion()

delete()

Deletes the job definition and disables any future schedules of this job, if any. If a scheduled job is currently running, it will not be canceled.

Return type:: None

Examples

>>> import datarobot as dr
>>> definition = dr.BatchPredictionJobDefinition.get('5a8ac9ab07a57a0001be501f')
>>> definition.delete()

Batch job

class datarobot.models.batch_job.IntakeSettings: Intake settings typed dict

class datarobot.models.batch_job.OutputSettings: Output settings typed dict

Predict job

datarobot.models.predict_job.wait_for_async_predictions(project_id, predict_job_id, max_wait=600)

Given a Project ID and a PredictJob ID, polls the status of the process that generates the predictions until it has finished.

Parameters:

project_id (str) – The identifier of the project
predict_job_id (str) – The identifier of the PredictJob
max_wait (Optional[int]) – Time in seconds after which predictions creation is considered unsuccessful

Returns:

predictions – Generated predictions.

Return type:

pandas.DataFrame

Raises:

AsyncPredictionsGenerationError – Raised if status of fetched PredictJob object is error
AsyncTimeoutError – Predictions weren’t generated in time, specified by max_wait parameter
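
The polling behavior described above (check the status repeatedly, fail on an error status, give up after max_wait seconds) follows a common pattern that can be sketched generically. This is not the client's implementation: poll_until_done, the interval parameter, the status strings "completed" and "error", and the use of built-in exceptions in place of AsyncPredictionsGenerationError and AsyncTimeoutError are all assumptions made for illustration.

```python
import time


def poll_until_done(fetch_status, max_wait=600, interval=1.0):
    """Call `fetch_status` until it reports completion, an error, or a timeout."""
    deadline = time.monotonic() + max_wait
    while True:
        status = fetch_status()
        if status == "error":
            raise RuntimeError("job reported an error status")
        if status == "completed":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job did not finish within {max_wait} seconds")
        time.sleep(interval)


# Simulated job that finishes on the third status check.
statuses = iter(["queue", "inprogress", "completed"])
print(poll_until_done(lambda: next(statuses), max_wait=5, interval=0))  # completed
```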

class datarobot.models.PredictJob

Tracks asynchronous work being done within a project

Variables:

id (int) – the id of the job
project_id (str) – the id of the project the job belongs to
status (str) – the status of the job - will be one of datarobot.enums.QUEUE_STATUS
job_type (str) – what kind of work the job is doing - will be ‘predict’ for predict jobs
is_blocked (bool) – if true, the job is blocked (cannot be executed) until its dependencies are resolved
message (str) – a message about the state of the job, typically explaining why an error occurred

classmethod from_job(job)

Transforms a generic Job into a PredictJob

Parameters:: job (Job) – A generic job representing a PredictJob
Returns:: predict_job – A fully populated PredictJob with all the details of the job
Return type:: PredictJob
Raises:: ValueError – If the generic Job was not a predict job, e.g. job_type != JOB_TYPE.PREDICT

classmethod get(project_id, predict_job_id)

Fetches one PredictJob. If the job finished, raises PendingJobFinished exception.

Parameters:

project_id (str) – The identifier of the project that contains the model on which the prediction was started
predict_job_id (str) – The identifier of the predict_job

Returns:

predict_job – The pending PredictJob

Return type:

PredictJob

Raises:

PendingJobFinished – If the job being queried already finished, and the server is re-routing to the finished predictions.
AsyncFailureError – Querying this resource gave a status code other than 200 or 303

classmethod get_predictions(project_id, predict_job_id, class_prefix='class_')

Fetches finished predictions from the job used to generate them.

Notes

The prediction API for classifications now returns an additional prediction_values dictionary that is converted into a series of class_prefixed columns in the final dataframe. For example, <label> = 1.0 is converted to ‘class_1.0’. If you are on an older version of the client (prior to v2.8), you must update to v2.8 to correctly pivot this data.

Parameters:

project_id (str) – The identifier of the project that contains the model used to generate the predictions
predict_job_id (str) – The identifier of the predict_job
class_prefix (str) – The prefix to prepend to labels in the final dataframe (e.g., apple -> class_apple)

Returns:

predictions – Generated predictions

Return type:

pandas.DataFrame

Raises:

JobNotFinished – If the job has not finished yet
AsyncFailureError – Querying the predict_job in question gave a status code other than 200 or 303
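
The effect of class_prefix on column names can be illustrated without calling the API. In this sketch the input layout (a list of dicts with "label" and "value" keys) is an assumption about what prediction_values looks like, and prefixed_columns is a hypothetical helper, not part of the client.

```python
def prefixed_columns(prediction_values, class_prefix="class_"):
    """Map each class label to a prefixed column name holding its value."""
    return {
        f"{class_prefix}{item['label']}": item["value"]
        for item in prediction_values
    }


row = prefixed_columns([{"label": 1.0, "value": 0.8}, {"label": 0.0, "value": 0.2}])
print(row)  # {'class_1.0': 0.8, 'class_0.0': 0.2}
```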

cancel(): Cancel this job. If this job has not finished running, it will be removed and canceled.

get_result(params=None)

Parameters:: params (dict or None) – Query parameters to be added to request to get results.

Notes

For featureEffects, source param is required to define source, otherwise the default is training.

Returns:

result – Return type depends on the job type:

- for model jobs, a Model is returned
- for predict jobs, a pandas.DataFrame (with predictions) is returned
- for featureImpact jobs, a list of dicts by default (see with_metadata parameter of the FeatureImpactJob class and its get() method)
- for primeRulesets jobs, a list of Rulesets
- for primeModel jobs, a PrimeModel
- for primeDownloadValidation jobs, a PrimeFile
- for predictionExplanationInitialization jobs, a PredictionExplanationsInitialization
- for predictionExplanations jobs, a PredictionExplanations
- for featureEffects, a FeatureEffects

Return type:

object

Raises:

JobNotFinished – If the job is not finished, the result is not available.
AsyncProcessUnsuccessfulError – If the job errored or was aborted

get_result_when_complete(max_wait=600, params=None)

Parameters:

max_wait (Optional[int]) – How long to wait for the job to finish.
params (dict, optional) – Query parameters to be added to request.

Returns:

result – Return type is the same as would be returned by Job.get_result.

Return type:

object

Raises:

AsyncTimeoutError – If the job does not finish in time
AsyncProcessUnsuccessfulError – If the job errored or was aborted

refresh(): Update this object with the latest job data from the server.

wait_for_completion(max_wait=600)

Waits for job to complete.

Parameters:: max_wait (Optional[int]) – How long to wait for the job to finish.
Return type:: None

Prediction dataset

class datarobot.models.PredictionDataset

A dataset uploaded to make predictions

Typically created via project.upload_dataset

Variables:

id (str) – the id of the dataset
project_id (str) – the id of the project the dataset belongs to
created (str) – the time the dataset was created
name (str) – the name of the dataset
num_rows (int) – the number of rows in the dataset
num_columns (int) – the number of columns in the dataset
forecast_point (datetime.datetime or None) – For time series projects only. This is the default point relative to which predictions will be generated, based on the forecast window of the project. See the time series predictions documentation for more information.
predictions_start_date (datetime.datetime or None, optional) – For time series projects only. The start date for bulk predictions. Note that this parameter is for generating historical predictions using the training data. This parameter should be provided in conjunction with predictions_end_date. Can’t be provided with the forecast_point parameter.
predictions_end_date (datetime.datetime or None, optional) – For time series projects only. The end date for bulk predictions, exclusive. Note that this parameter is for generating historical predictions using the training data. This parameter should be provided in conjunction with predictions_start_date. Can’t be provided with the forecast_point parameter.
relax_known_in_advance_features_check (Optional[bool]) – (New in version v2.15) For time series projects only. If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.
data_quality_warnings (dict, optional) –
(New in version v2.15) A dictionary that contains available warnings about potential problems in this prediction dataset. Available warnings include:
- has_kia_missing_values_in_forecast_window (bool)
  Applicable for time series projects. If True, known in advance features have missing values in forecast window which may decrease prediction accuracy.
- insufficient_rows_for_evaluating_models (bool)
  Applicable for datasets which are used as external test sets. If True, there are not enough rows in the dataset to calculate insights.
- single_class_actual_value_column (bool)
  Applicable for datasets which are used as external test sets. If True, the actual value column has only one class, and insights such as the ROC curve cannot be calculated. Only applies to binary classification projects or unsupervised projects.
forecast_point_range (list[datetime.datetime] or None, optional) – (New in version v2.20) For time series projects only. Specifies the range of dates available for use as a forecast point.
data_start_date (datetime.datetime or None, optional) – (New in version v2.20) For time series projects only. The minimum primary date of this prediction dataset.
data_end_date (datetime.datetime or None, optional) – (New in version v2.20) For time series projects only. The maximum primary date of this prediction dataset.
max_forecast_date (datetime.datetime or None, optional) – (New in version v2.20) For time series projects only. The maximum forecast date of this prediction dataset.
actual_value_column (string, optional) – (New in version v2.21) Optional, only available for unsupervised projects, in case dataset was uploaded with actual value column specified. Name of the column which will be used to calculate the classification metrics and insights.
detected_actual_value_columns (list of dict, optional) – (New in version v2.21) For unsupervised projects only, list of detected actual value columns information containing missing count and name for each column.
contains_target_values (Optional[bool]) – (New in version v2.21) Only for supervised projects. If True, dataset contains target values and can be used to calculate the classification metrics and insights.
secondary_datasets_config_id (string or None, optional) – (New in version v2.23) The ID of the alternative secondary dataset configuration to use during prediction for Feature Discovery projects.
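
As a minimal sketch of how the data_quality_warnings flags listed above might be inspected, the helper below turns the flags that are set into readable messages. The helper name, the message wording, and the sample dictionary are hypothetical; only the three keys come from the attribute description.

```python
WARNING_MESSAGES = {
    "has_kia_missing_values_in_forecast_window": (
        "known in advance features have missing values in the forecast window"
    ),
    "insufficient_rows_for_evaluating_models": (
        "not enough rows to calculate insights"
    ),
    "single_class_actual_value_column": (
        "actual value column has only one class"
    ),
}


def describe_warnings(data_quality_warnings):
    """Return a message for every documented warning flag that is True."""
    warnings = data_quality_warnings or {}  # the attribute is optional
    return [
        message
        for key, message in WARNING_MESSAGES.items()
        if warnings.get(key)
    ]


sample = {
    "has_kia_missing_values_in_forecast_window": False,
    "insufficient_rows_for_evaluating_models": True,
}
print(describe_warnings(sample))  # ['not enough rows to calculate insights']
```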

classmethod get(project_id, dataset_id)

Retrieve information about a dataset uploaded for predictions

Parameters:

project_id (str) – the id of the project to query
dataset_id (str) – the id of the dataset to retrieve

Returns:

dataset – A dataset uploaded to make predictions

Return type:

PredictionDataset

delete()

Delete a dataset uploaded for predictions

Will also delete predictions made using this dataset and cancel any predict jobs using this dataset.

Return type:: None