Batch predictions
- class datarobot.models.BatchPredictionJob
A Batch Prediction Job is used to score large data sets on prediction servers using the Batch Prediction API.
- Variables:
id (
str
) – the id of the job
- classmethod score(deployment, intake_settings=None, output_settings=None, csv_settings=None, timeseries_settings=None, num_concurrent=None, chunk_size=None, passthrough_columns=None, passthrough_columns_set=None, max_explanations=None, max_ngram_explanations=None, explanation_algorithm=None, threshold_high=None, threshold_low=None, prediction_threshold=None, prediction_warning_enabled=None, include_prediction_status=False, skip_drift_tracking=False, prediction_instance=None, abort_on_error=True, column_names_remapping=None, include_probabilities=True, include_probabilities_classes=None, download_timeout=120, download_read_timeout=660, upload_read_timeout=600, explanations_mode=None)
Create a new batch prediction job, upload the scoring dataset, and return the job.
The default intake and output options are both localFile, which requires the caller to pass the file parameter and then either download the results with the download() method or pass a path to a file where the scored data will be saved.
- Variables:
deployment (
Deployment
or string ID
) – Deployment which will be used for scoring.
intake_settings (
Optional[IntakeSettings]
) – A dict configuring where data is coming from. Supported options:
type : str, either localFile, s3, azure, gcp, dataset, jdbc, snowflake, synapse, bigquery, or datasphere
Note that to pass a dataset, you not only need to specify the type parameter as dataset, but you must also set the dataset parameter as a dr.Dataset object.
To score from a local file, add this parameter to the settings:
file : file-like object, string path to file or a pandas.DataFrame of scoring data
To score from S3, add the following parameters to the settings:
url : str, the URL to score (e.g.: s3://bucket/key)
credential_id : Optional[str]
endpoint_url : Optional[str], any non-default endpoint URL for S3 access (omit to use the default)
To score from JDBC, add the following parameters to the settings:
data_store_id : str, the ID of the external data store connected to the JDBC data source (see Database Connectivity).
query : str (optional if table, schema and/or catalog is specified), a self-supplied SELECT statement of the data set you wish to predict.
table : str (optional if query is specified), the name of specified database table.
schema : str (optional if query is specified), the name of specified database schema.
catalog : str (optional if query is specified), (new in v2.22) the name of specified database catalog.
fetch_size : Optional[int], Changing the fetchSize can be used to balance throughput and memory usage.
credential_id : Optional[str] the ID of the credentials holding information about a user with read-access to the JDBC data source (see Credentials).
To score from Datasphere, add the following parameters to the settings:
data_store_id : str, the ID of the external data store connected to the Datasphere data source (see Database Connectivity).
table : str, the name of specified database table.
schema : str, the name of specified database schema.
credential_id : str, the ID of the credentials holding information about a user with read-access to the Datasphere data source (see Credentials).
output_settings (
Optional[OutputSettings]
) – A dict configuring how scored data is to be saved. Supported options:
type : str, either localFile, s3, azure, gcp, jdbc, snowflake, synapse, bigquery, or datasphere
To save scored data to a local file, add this parameter to the settings:
path : Optional[str], path to save the scored data as CSV. If a path is not specified, you must download the scored data yourself with job.download(). If a path is specified, the call will block until the job is done. If there are no other jobs currently processing for the targeted prediction instance, uploading, scoring, and downloading will happen in parallel without waiting for a full job to complete. Otherwise, it will still block, but start downloading the scored data as soon as it starts generating data. This is the fastest method to get predictions.
To save scored data to S3, add the following parameters to the settings:
url : str, the URL for storing the results (e.g.: s3://bucket/key)
credential_id : Optional[str]
endpoint_url : Optional[str], any non-default endpoint URL for S3 access (omit to use the default)
To save scored data to JDBC, add the following parameters to the settings:
data_store_id : str, the ID of the external data store connected to the JDBC data source (see Database Connectivity).
table : str, the name of specified database table.
schema : Optional[str], the name of specified database schema.
catalog : Optional[str], (new in v2.22) the name of specified database catalog.
statement_type : str, the type of insertion statement to create, one of datarobot.enums.AVAILABLE_STATEMENT_TYPES.
update_columns : list(string) (optional), a list of strings containing those column names to be updated in case statement_type is set to a value related to update or upsert.
where_columns : list(string) (optional), a list of strings containing those column names to be selected in case statement_type is set to a value related to insert or update.
credential_id : str, the ID of the credentials holding information about a user with write-access to the JDBC data source (see Credentials).
To save scored data to Datasphere, add the following parameters to the settings:
data_store_id : str, the ID of the external data store connected to the Datasphere data source (see Database Connectivity).
table : str, the name of specified database table.
schema : str, the name of specified database schema.
credential_id : str, the ID of the credentials holding information about a user with write-access to the Datasphere data source (see Credentials).
csv_settings (
Optional[CsvSettings]
) –CSV intake and output settings. Supported options:
delimiter : str (optional, default ,), fields are delimited by this character. Use the string tab to denote TSV (TAB separated values). Must be either a one-character string or the string tab.
quotechar : str (optional, default "), fields containing the delimiter must be quoted using this character.
encoding : str (optional, default utf-8), encoding for the CSV files. For example (but not limited to): shift_jis, latin_1 or mskanji.
timeseries_settings (
Optional[TimeSeriesSettings]
) –Configuration for time-series scoring. Supported options:
type : str, must be forecast or historical (default if not passed is forecast). forecast mode makes predictions using forecast_point or rows in the dataset without target. historical enables bulk prediction mode which calculates predictions for all possible forecast points and forecast distances in the dataset within predictions_start_date/predictions_end_date range.
forecast_point : Optional[datetime.datetime], forecast point for the dataset, used for the forecast predictions. By default, the value is inferred from the dataset. May be passed if timeseries_settings.type=forecast.
predictions_start_date : Optional[datetime.datetime], used for historical predictions to override the date from which predictions should be calculated. By default, the value is inferred automatically from the dataset. May be passed if timeseries_settings.type=historical.
predictions_end_date : Optional[datetime.datetime], used for historical predictions to override the date up to which predictions should be calculated. By default, the value is inferred automatically from the dataset. May be passed if timeseries_settings.type=historical.
relax_known_in_advance_features_check : bool (default False). If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.
num_concurrent (
Optional[int]
) – Number of concurrent chunks to score simultaneously. Defaults to the available number of cores of the deployment. Lower it to leave resources for real-time scoring.
chunk_size (
str
or Optional[int]
) – Which strategy should be used to determine the chunk size. Can be either a named strategy or a fixed size in bytes. Supported options:
auto : use fixed or dynamic based on flipper
fixed : use 1MB for explanations, 5MB for regular requests
dynamic : use dynamic chunk sizes
int : use this many bytes per chunk
passthrough_columns (
list[string] (optional)
) – Keep these columns from the scoring dataset in the scored dataset. This is useful for correlating predictions with source data.
passthrough_columns_set (
Optional[str]
) – To pass through every column from the scoring dataset, set this to all. Takes precedence over passthrough_columns if set.
max_explanations (
Optional[int]
) – Compute prediction explanations for this number of features.
max_ngram_explanations (
int
or str (optional)
) – Compute text explanations for this number of ngrams. Set to all to return all ngram explanations, or set to a positive integer value to limit the number of ngram explanations returned. By default, no ngram explanations are computed or returned.
threshold_high (
Optional[float]
) – Only compute prediction explanations for predictions above this threshold. Can be combined with threshold_low.
threshold_low (
Optional[float]
) – Only compute prediction explanations for predictions below this threshold. Can be combined with threshold_high.
explanations_mode (
PredictionExplanationsMode
, optional) – Mode of prediction explanations calculation for multiclass and clustering models. If not specified, the server default is to explain only the predicted class, identical to passing TopPredictionsMode(1).
prediction_warning_enabled (
Optional[bool]
) – Add prediction warnings to the scored data. Currently only supported for regression models.
include_prediction_status (
Optional[bool]
) – Include the prediction_status column in the output. Defaults to False.
skip_drift_tracking (
Optional[bool]
) – Skips drift tracking on any predictions made from this job. This is useful when running non-production workloads to not affect drift tracking and cause unnecessary alerts. Defaults to False.
prediction_instance (
Optional[PredictionInstance]
) –Defaults to instance specified by deployment or system configuration. Supported options:
hostName : str
sslEnabled : boolean (optional, default true). Set to false to run prediction requests from the batch prediction job without SSL.
datarobotKey : Optional[str], if running a job against a prediction instance in the Managed AI Cloud, you must provide the organization level DataRobot-Key
apiKey : Optional[str], by default, prediction requests will use the API key of the user that created the job. This allows you to make requests on behalf of other users.
abort_on_error (
Optional[bool]
) – Default behavior is to abort the job if too many rows fail scoring. This will free up resources for other jobs that may score successfully. Set to False to unconditionally score every row no matter how many errors are encountered. Defaults to True.
column_names_remapping (
Optional[Dict[str, str]]
) – Mapping with column renaming for output table. Defaults to {}.
include_probabilities (
Optional[bool]
) – Flag that enables returning of all probability columns. Defaults to True.
include_probabilities_classes (
list (optional)
) – List the subset of classes if a user doesn’t want all the classes. Defaults to [].
download_timeout (
Optional[int]
) –Added in version 2.22.
If using localFile output, wait this many seconds for the download to become available. See download().
download_read_timeout (
Optional[int]
, default 660
) –Added in version 2.22.
If using localFile output, wait this many seconds for the server to respond between chunks.
upload_read_timeout (
Optional[int]
, default 600
) –Added in version 2.28.
If using localFile intake, wait this many seconds for the server to respond after whole dataset upload.
prediction_threshold (
Optional[float]
) –Added in version 3.4.0.
Threshold is the point that sets the class boundary for a predicted value. The model classifies an observation below the threshold as FALSE, and an observation above the threshold as TRUE. In other words, DataRobot automatically assigns the positive class label to any prediction exceeding the threshold. This value can be set between 0.0 and 1.0.
- Returns:
Instance of BatchPredictionJob
- Return type:
BatchPredictionJob
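As an illustration of the settings described for score(), the following sketch builds a JDBC intake_settings dict and a localFile output_settings dict using only the keys documented above. The helper names (jdbc_intake, local_file_output, submit) and every ID are hypothetical, and the "query or table" check is a simplification of the documented rule. submit() needs an installed and configured DataRobot client and is only defined here, not called.

```python
from typing import Any, Dict, Optional


def jdbc_intake(
    data_store_id: str,
    credential_id: Optional[str] = None,
    query: Optional[str] = None,
    table: Optional[str] = None,
    schema: Optional[str] = None,
) -> Dict[str, Any]:
    """Build intake_settings for JDBC scoring (hypothetical helper)."""
    # Simplified check: the docs make `query` optional only when a
    # table, schema and/or catalog is given; here we require query or table.
    if query is None and table is None:
        raise ValueError("Pass either a query or a table")
    settings: Dict[str, Any] = {"type": "jdbc", "data_store_id": data_store_id}
    optional = {
        "credential_id": credential_id,
        "query": query,
        "table": table,
        "schema": schema,
    }
    # Leave out keys that were not supplied.
    settings.update({k: v for k, v in optional.items() if v is not None})
    return settings


def local_file_output(path: Optional[str] = None) -> Dict[str, Any]:
    """Build output_settings for localFile; `path` makes the call block."""
    settings: Dict[str, Any] = {"type": "localFile"}
    if path is not None:
        settings["path"] = path
    return settings


def submit(deployment_id: str, intake: Dict[str, Any], output: Dict[str, Any]):
    """Start the job. Requires a configured DataRobot client (not run here)."""
    import datarobot as dr

    return dr.BatchPredictionJob.score(
        deployment_id, intake_settings=intake, output_settings=output
    )
```

Keeping the dict construction separate from the API call makes the settings easy to inspect or log before a job is submitted.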
- classmethod apply_time_series_data_prep_and_score(deployment, intake_settings, timeseries_settings, **kwargs)
Prepare the dataset with time series data prep, create new batch prediction job, upload the scoring dataset, and return a batch prediction job.
The supported intake_settings are of type localFile or dataset.
For timeseries_settings of type forecast the forecast_point must be specified.
Refer to the datarobot.models.BatchPredictionJob.score() method for details on the other kwargs parameters.
Added in version v3.1.
- Variables:
deployment (
Deployment
) – Deployment which will be used for scoring.
intake_settings (
dict
) – A dict configuring where data is coming from. Supported options:
type : str, either localFile or dataset
Note that to pass a dataset, you not only need to specify the type parameter as dataset, but you must also set the dataset parameter as a Dataset object.
To score from a local file, add this parameter to the settings:
file : file-like object, string path to file or a pandas.DataFrame of scoring data.
timeseries_settings (
dict
) –Configuration for time-series scoring. Supported options:
type : str, must be forecast or historical (default if not passed is forecast). forecast mode makes predictions using forecast_point. historical enables bulk prediction mode which calculates predictions for all possible forecast points and forecast distances in the dataset within predictions_start_date/predictions_end_date range.
forecast_point : Optional[datetime.datetime], forecast point for the dataset, used for the forecast predictions. Must be passed if timeseries_settings.type=forecast.
predictions_start_date : Optional[datetime.datetime], used for historical predictions to override the date from which predictions should be calculated. By default, the value is inferred automatically from the dataset. May be passed if timeseries_settings.type=historical.
predictions_end_date : Optional[datetime.datetime], used for historical predictions to override the date up to which predictions should be calculated. By default, the value is inferred automatically from the dataset. May be passed if timeseries_settings.type=historical.
relax_known_in_advance_features_check : bool (default False). If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.
- Returns:
Instance of BatchPredictionJob
- Return type:
BatchPredictionJob
- Raises:
InvalidUsageError – If the deployment does not support time series data prep, or if the intake type is not supported for time series data prep.
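The rules for timeseries_settings with time series data prep (forecast mode needs a forecast_point, historical mode may carry a date range) can be sketched as a small builder. The function name and the local validation are assumptions made for illustration; the client and server perform their own checks.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def data_prep_timeseries_settings(
    mode: str = "forecast",
    forecast_point: Optional[datetime] = None,
    predictions_start_date: Optional[datetime] = None,
    predictions_end_date: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build a timeseries_settings dict (hypothetical helper)."""
    if mode not in ("forecast", "historical"):
        raise ValueError("type must be 'forecast' or 'historical'")
    settings: Dict[str, Any] = {"type": mode}
    if mode == "forecast":
        # With time series data prep, forecast_point is mandatory.
        if forecast_point is None:
            raise ValueError("forecast_point is required for type 'forecast'")
        settings["forecast_point"] = forecast_point
    else:
        # Both range bounds are optional overrides in historical mode.
        if predictions_start_date is not None:
            settings["predictions_start_date"] = predictions_start_date
        if predictions_end_date is not None:
            settings["predictions_end_date"] = predictions_end_date
    return settings
```

The resulting dict would be passed as the timeseries_settings argument of apply_time_series_data_prep_and_score().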
- classmethod score_to_file(deployment, intake_path, output_path, **kwargs)
Create a new batch prediction job, upload the scoring dataset, and download the scored CSV file concurrently.
Will block until the entire file is scored.
Refer to the datarobot.models.BatchPredictionJob.score() method for details on the other kwargs parameters.
- Variables:
deployment (
Deployment
or string ID
) – Deployment which will be used for scoring.
intake_path (
file-like object/string path to file/pandas.DataFrame
) – Scoring data
output_path (
str
) – Filename to save the result under
- Returns:
Instance of BatchPredictionJob
- Return type:
BatchPredictionJob
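Because score_to_file() forwards its extra keyword arguments to score(), the csv_settings options documented there also apply. The sketch below validates the delimiter rule locally (one character or the string tab) before scoring a TSV file. make_csv_settings and score_tsv are hypothetical names; the codecs.lookup() call only confirms that Python knows the encoding and says nothing about what the server accepts.

```python
import codecs
from typing import Dict


def make_csv_settings(
    delimiter: str = ",", quotechar: str = '"', encoding: str = "utf-8"
) -> Dict[str, str]:
    """Build a csv_settings dict after a local sanity check."""
    # Documented rule: a one-character string or the literal string "tab".
    if delimiter != "tab" and len(delimiter) != 1:
        raise ValueError("delimiter must be one character or 'tab'")
    # Raises LookupError if Python does not recognise the encoding name.
    codecs.lookup(encoding)
    return {"delimiter": delimiter, "quotechar": quotechar, "encoding": encoding}


def score_tsv(deployment_id: str, intake_path: str, output_path: str):
    """Score a TSV file; blocks until done. Needs a configured client."""
    import datarobot as dr

    return dr.BatchPredictionJob.score_to_file(
        deployment_id,
        intake_path,
        output_path,
        csv_settings=make_csv_settings(delimiter="tab"),
    )
```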
- classmethod apply_time_series_data_prep_and_score_to_file(deployment, intake_path, output_path, timeseries_settings, **kwargs)
Prepare the input dataset with time series data prep. Then, create a new batch prediction job using the prepared AI catalog item as input and concurrently download the scored CSV file.
The function call will return when the entire file is scored.
For timeseries_settings of type forecast the forecast_point must be specified.
Refer to the datarobot.models.BatchPredictionJob.score() method for details on the other kwargs parameters.
Added in version v3.1.
- Variables:
deployment (
Deployment
) – The deployment which will be used for scoring.
intake_path (
file-like object/string path to file/pandas.DataFrame
) – The scoring data.
output_path (
str
) – The filename under which you save the result.
timeseries_settings (
dict
) –Configuration for time-series scoring. Supported options:
type : str, must be forecast or historical (default if not passed is forecast). forecast mode makes predictions using forecast_point. historical enables bulk prediction mode which calculates predictions for all possible forecast points and forecast distances in the dataset within predictions_start_date/predictions_end_date range.
forecast_point : Optional[datetime.datetime], forecast point for the dataset, used for the forecast predictions. Must be passed if timeseries_settings.type=forecast.
predictions_start_date : Optional[datetime.datetime], used for historical predictions to override the date from which predictions should be calculated. By default, the value is inferred automatically from the dataset. May be passed if timeseries_settings.type=historical.
predictions_end_date : Optional[datetime.datetime], used for historical predictions to override the date up to which predictions should be calculated. By default, the value is inferred automatically from the dataset. May be passed if timeseries_settings.type=historical.
relax_known_in_advance_features_check : bool (default False). If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.
- Returns:
Instance of BatchPredictionJob.
- Return type:
BatchPredictionJob
- Raises:
InvalidUsageError – If the deployment does not support time series data prep.
- classmethod score_s3(deployment, source_url, destination_url, credential=None, endpoint_url=None, **kwargs)
Create a new batch prediction job that reads the scoring dataset from S3 and writes the result back to S3.
This returns immediately after the job has been created. You must poll for job completion using get_status() or wait_for_completion() (see datarobot.models.Job).
Refer to the datarobot.models.BatchPredictionJob.score() method for details on the other kwargs parameters.
- Variables:
deployment (
Deployment
or string ID
) – Deployment which will be used for scoring.
source_url (
str
) – The URL for the prediction dataset (e.g.: s3://bucket/key)
destination_url (
str
) – The URL for the scored dataset (e.g.: s3://bucket/key)
credential (
str
or Credential (optional)
) – The AWS Credential object or credential ID
endpoint_url (
Optional[str]
) – Any non-default endpoint URL for S3 access (omit to use the default)
- Returns:
Instance of BatchPredictionJob
- Return type:
BatchPredictionJob
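Since score_s3() returns as soon as the job is created, a caller usually waits for completion explicitly. The sketch below shows one way to do that, together with a local check that both URLs use the s3:// form given in the examples above. The helper names are hypothetical, the URL check is an assumption rather than a documented client rule, and score_bucket_and_wait() requires a configured DataRobot client, so it is only defined here.

```python
from typing import Optional
from urllib.parse import urlparse


def require_s3_url(url: str) -> str:
    """Return the URL unchanged if it looks like s3://bucket/key."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"Expected s3://bucket/key, got {url!r}")
    return url


def score_bucket_and_wait(
    deployment_id: str,
    source_url: str,
    destination_url: str,
    credential_id: Optional[str] = None,
):
    """Create the S3-to-S3 job, then block until it finishes."""
    import datarobot as dr

    job = dr.BatchPredictionJob.score_s3(
        deployment_id,
        require_s3_url(source_url),
        require_s3_url(destination_url),
        credential=credential_id,
    )
    # score_s3() does not block, so wait for the job explicitly.
    job.wait_for_completion()
    return job
```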
- classmethod score_azure(deployment, source_url, destination_url, credential=None, **kwargs)
Create a new batch prediction job that reads the scoring dataset from Azure blob storage and writes the result back to Azure blob storage.
This returns immediately after the job has been created. You must poll for job completion using get_status() or wait_for_completion() (see datarobot.models.Job).
Refer to the datarobot.models.BatchPredictionJob.score() method for details on the other kwargs parameters.
- Variables:
deployment (
Deployment
or string ID
) – Deployment which will be used for scoring.
source_url (
str
) – The URL for the prediction dataset (e.g.: https://storage_account.blob.endpoint/container/blob_name)
destination_url (
str
) – The URL for the scored dataset (e.g.: https://storage_account.blob.endpoint/container/blob_name)
credential (
str
or Credential (optional)
) – The Azure Credential object or credential ID
- Returns:
Instance of BatchPredictionJob
- Return type:
BatchPredictionJob
- classmethod score_gcp(deployment, source_url, destination_url, credential=None, **kwargs)
Create a new batch prediction job that reads the scoring dataset from Google Cloud Storage and writes the result back to Google Cloud Storage.
This returns immediately after the job has been created. You must poll for job completion using get_status() or wait_for_completion() (see datarobot.models.Job).
Refer to the datarobot.models.BatchPredictionJob.score() method for details on the other kwargs parameters.
- Variables:
deployment (
Deployment
or string ID
) – Deployment which will be used for scoring.
source_url (
str
) – The URL for the prediction dataset (e.g.: http(s)://storage.googleapis.com/[bucket]/[object])
destination_url (
str
) – The URL for the scored dataset (e.g.: http(s)://storage.googleapis.com/[bucket]/[object])
credential (
str
or Credential (optional)
) – The GCP Credential object or credential ID
- Returns:
Instance of BatchPredictionJob
- Return type:
BatchPredictionJob
- classmethod score_from_existing(batch_prediction_job_id)
Create a new batch prediction job based on the settings from a previously created one.
- Variables:
batch_prediction_job_id (
str
) – ID of the previous batch prediction job
- Returns:
Instance of BatchPredictionJob
- Return type:
BatchPredictionJob
- classmethod score_pandas(deployment, df, read_timeout=660, **kwargs)
Run a batch prediction job, with a scoring dataset from a pandas DataFrame. The output from the prediction will be joined to the passed DataFrame and returned.
Use column_names_remapping to drop or rename columns in the output.
This method blocks until the job has completed or raises an exception on errors.
Refer to the datarobot.models.BatchPredictionJob.score() method for details on the other kwargs parameters.
- Variables:
deployment (
Deployment
or string ID
) – Deployment which will be used for scoring.
df (
pandas.DataFrame
) – The dataframe to score
- Return type:
Tuple[BatchPredictionJob, DataFrame]
- Returns:
BatchPredictionJob – Instance of BatchPredictionJob
pandas.DataFrame – The original dataframe merged with the predictions
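To make the "drop or rename" behavior of column_names_remapping concrete, the sketch below previews its effect on a list of column names without contacting the server. It rests on an assumption that is not stated in this reference: a string value renames the column and a None value drops it. The column names are made up, both helpers are hypothetical, and score_frame() needs pandas plus a configured DataRobot client, so it is only defined here.

```python
from typing import Dict, List, Optional


def preview_output_columns(
    columns: List[str], remapping: Dict[str, Optional[str]]
) -> List[str]:
    """Preview which column names survive a remapping.

    Assumption: a None target drops the column, a string target renames it.
    """
    result = []
    for name in columns:
        if name not in remapping:
            result.append(name)  # untouched columns pass through
        elif remapping[name] is not None:
            result.append(remapping[name])  # renamed column
        # a None target means the column is left out
    return result


def score_frame(deployment_id: str, df, remapping: Dict[str, Optional[str]]):
    """Score a DataFrame and return (job, merged DataFrame). Blocks."""
    import datarobot as dr

    return dr.BatchPredictionJob.score_pandas(
        deployment_id, df, column_names_remapping=remapping
    )
```

Checking the preview against the columns of the returned DataFrame is a quick way to confirm whether the assumption holds for a given client version.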
- classmethod score_with_leaderboard_model(model, intake_settings=None, output_settings=None, csv_settings=None, timeseries_settings=None, passthrough_columns=None, passthrough_columns_set=None, max_explanations=None, max_ngram_explanations=None, explanation_algorithm=None, threshold_high=None, threshold_low=None, prediction_threshold=None, prediction_warning_enabled=None, include_prediction_status=False, abort_on_error=True, column_names_remapping=None, include_probabilities=True, include_probabilities_classes=None, download_timeout=120, download_read_timeout=660, upload_read_timeout=600, explanations_mode=None)
Creates a new batch prediction job for a Leaderboard model by uploading the scoring dataset. Returns a batch prediction job.
The default intake and output options are both localFile, which requires the caller to pass the file parameter and either download the results using the download() method afterwards or pass a path to a file where the scored data will be downloaded to.
- Variables:
model (
Model
or DatetimeModel
or string ID
) – Model which will be used for scoring.
intake_settings (
Optional[IntakeSettings]
) – A dict configuring where data is coming from. Supported options:
type : str, either localFile, dataset, or dss.
Note that to pass a dataset, you not only need to specify the type parameter as dataset, but you must also set the dataset parameter as a dr.Dataset object.
To score from a local file, add this parameter to the settings:
file : file-like object, string path to file or a pandas.DataFrame of scoring data.
To score a subset of training data, use the dss intake type and specify the following parameters:
project_id : project to fetch training data from. Access to project is required.
partition : subset of training data to score, one of datarobot.enums.TrainingDataSubsets.
output_settings (
Optional[OutputSettings]
) –A dict configuring how scored data is to be saved. Supported options:
type : str, localFile
To save scored data to a local file, add this parameter to the settings:
path : Optional[str] The path to save the scored data as a CSV file. If a path is not specified, you must download the scored data yourself with job.download(). If a path is specified, the call is blocked until the job is done. If there are no other jobs currently processing for the targeted prediction instance, uploading, scoring, and downloading will happen in parallel without waiting for a full job to complete. Otherwise, it will still block, but start downloading the scored data as soon as it starts generating data. This is the fastest method to get predictions.
csv_settings (
Optional[CsvSettings]
) –CSV intake and output settings. Supported options:
delimiter : str (optional, default ,), fields are delimited by this character. Use the string tab to denote TSV (TAB separated values). Must be either a one-character string or the string tab.
quotechar : str (optional, default "), fields containing the delimiter must be quoted using this character.
encoding : str (optional, default utf-8), encoding for the CSV files. For example (but not limited to): shift_jis, latin_1 or mskanji.
timeseries_settings (
Optional[TimeSeriesSettings]
) –Configuration for time-series scoring. Supported options:
type : str, must be forecast, historical (default if not passed is forecast), or training. forecast mode makes predictions using forecast_point or rows in the dataset without target. historical enables bulk prediction mode which calculates predictions for all possible forecast points and forecast distances in the dataset within predictions_start_date/predictions_end_date range. training mode is a special case for predictions on subsets of training data. Note, that it must be used in conjunction with dss intake type only.
forecast_point : Optional[datetime.datetime], forecast point for the dataset, used for the forecast predictions. By default, the value is inferred from the dataset. May be passed if timeseries_settings.type=forecast.
predictions_start_date : Optional[datetime.datetime], used for historical predictions to override the date from which predictions should be calculated. By default, the value is inferred automatically from the dataset. May be passed if timeseries_settings.type=historical.
predictions_end_date : Optional[datetime.datetime], used for historical predictions to override the date up to which predictions should be calculated. By default, the value is inferred automatically from the dataset. May be passed if timeseries_settings.type=historical.
relax_known_in_advance_features_check : bool (default False). If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.
passthrough_columns (
list[string] (optional)
) – Keep these columns from the scoring dataset in the scored dataset. This is useful for correlating predictions with source data.
passthrough_columns_set (
Optional[str]
) – To pass through every column from the scoring dataset, set this to all. Takes precedence over passthrough_columns if set.
max_explanations (
Optional[int]
) – Compute prediction explanations for this number of features.
max_ngram_explanations (
int
or str (optional)
) – Compute text explanations for this number of ngrams. Set to all to return all ngram explanations, or set to a positive integer value to limit the number of ngram explanations returned. By default, no ngram explanations are computed or returned.
threshold_high (
Optional[float]
) – Only compute prediction explanations for predictions above this threshold. Can be combined with threshold_low.
threshold_low (
Optional[float]
) – Only compute prediction explanations for predictions below this threshold. Can be combined with threshold_high.
explanations_mode (
PredictionExplanationsMode
, optional) – Mode of prediction explanations calculation for multiclass and clustering models. If not specified, the server default is to explain only the predicted class, identical to passing TopPredictionsMode(1).
prediction_warning_enabled (
Optional[bool]
) – Add prediction warnings to the scored data. Currently only supported for regression models.
include_prediction_status (
Optional[bool]
) – Include the prediction_status column in the output. Defaults to False.
abort_on_error (
Optional[bool]
) – Default behavior is to abort the job if too many rows fail scoring. This will free up resources for other jobs that may score successfully. Set to False to unconditionally score every row no matter how many errors are encountered. Defaults to True.
column_names_remapping (
Optional[Dict]
) – Mapping with column renaming for output table. Defaults to {}.
include_probabilities (
Optional[bool]
) – Flag that enables returning of all probability columns. Defaults to True.
include_probabilities_classes (
list (optional)
) – List the subset of classes if you do not want all the classes. Defaults to [].
download_timeout (
Optional[int]
) –Added in version 2.22.
If using localFile output, wait this many seconds for the download to become available. See download().
download_read_timeout (
int (optional, default 660)
) –Added in version 2.22.
If using localFile output, wait this many seconds for the server to respond between chunks.
upload_read_timeout (
int (optional, default 600)
) –Added in version 2.28.
If using localFile intake, wait this many seconds for the server to respond after whole dataset upload.
prediction_threshold (
Optional[float]
) –Added in version 3.4.0.
Threshold is the point that sets the class boundary for a predicted value. The model classifies an observation below the threshold as FALSE, and an observation above the threshold as TRUE. In other words, DataRobot automatically assigns the positive class label to any prediction exceeding the threshold. This value can be set between 0.0 and 1.0.
- Returns:
Instance of BatchPredictionJob
- Return type:
BatchPredictionJob
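Scoring a subset of training data with a Leaderboard model combines two of the rules above: the dss intake type with project_id and partition, and the requirement that the training time-series mode is used only with dss intake. A minimal sketch follows. The helper names are hypothetical, and the partition string must be a value taken from datarobot.enums.TrainingDataSubsets; the value used in any call is up to the caller and is not checked here.

```python
from typing import Any, Dict, Optional


def training_subset_intake(project_id: str, partition: str) -> Dict[str, Any]:
    """Build dss intake_settings for scoring a training-data subset."""
    return {"type": "dss", "project_id": project_id, "partition": partition}


def check_training_mode(
    intake_settings: Dict[str, Any],
    timeseries_settings: Optional[Dict[str, Any]] = None,
) -> None:
    """Raise if time-series `training` mode is paired with non-dss intake."""
    if timeseries_settings is None:
        return
    is_training = timeseries_settings.get("type") == "training"
    if is_training and intake_settings.get("type") != "dss":
        raise ValueError("timeseries type 'training' requires dss intake")


def score_training_subset(model_id: str, project_id: str, partition: str):
    """Submit the job. Requires a configured DataRobot client (not run here)."""
    import datarobot as dr

    intake = training_subset_intake(project_id, partition)
    return dr.BatchPredictionJob.score_with_leaderboard_model(
        model_id, intake_settings=intake
    )
```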
- classmethod get(batch_prediction_job_id)
Get batch prediction job
- Variables:
batch_prediction_job_id (
str
) – ID of batch prediction job
- Returns:
Instance of BatchPredictionJob
- Return type:
BatchPredictionJob
- download(fileobj, timeout=120, read_timeout=660)
Downloads the CSV result of a prediction job
- Variables:
fileobj (
file-like object
) – A file-like object where the CSV prediction results will be written to. Examples include an in-memory buffer (e.g., io.BytesIO) or a file on disk (opened for binary writing).
timeout (
int (optional, default 120)
) –Added in version 2.22.
Seconds to wait for the download to become available.
The download will not be available before the job has started processing. In case other jobs are occupying the queue, processing may not start immediately.
If the timeout is reached, the job will be aborted and RuntimeError is raised.
Set to -1 to wait infinitely.
read_timeout (
int (optional, default 660)
) –Added in version 2.22.
Seconds to wait for the server to respond between chunks.
- Return type:
None
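As a sketch of the in-memory option mentioned for fileobj, the helper below downloads the result into an io.BytesIO buffer and parses it with the standard csv module. download_rows is a hypothetical name. The code assumes the default CSV settings (comma delimiter, UTF-8 encoding) and accepts any object exposing the download(fileobj, timeout, read_timeout) method documented above.

```python
import csv
import io
from typing import Dict, List


def download_rows(
    job, timeout: int = 120, read_timeout: int = 660
) -> List[Dict[str, str]]:
    """Download a job's CSV result into memory and return it as dict rows."""
    buffer = io.BytesIO()
    # The job writes raw CSV bytes into the buffer.
    job.download(buffer, timeout=timeout, read_timeout=read_timeout)
    # Assumes UTF-8 output, the documented csv_settings default.
    text = buffer.getvalue().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text, newline="")))
```

For large results, writing to a file opened in binary mode avoids holding the whole CSV in memory.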
- delete(ignore_404_errors=False)
Cancel this job. If this job has not finished running, it will be removed and canceled.
- Return type:
None
- get_status()
Get status of batch prediction job
- Returns:
Dict with job status
- Return type:
BatchPredictionJob status data
- classmethod list_by_status(statuses=None)
Get a collection of jobs for a specific set of statuses.
- Variables:
statuses – List of statuses to filter jobs by ([ABORTED|COMPLETED…]). If statuses is not provided, returns all jobs for the user.
- Returns:
List of job status dicts matching the given statuses
- Return type:
BatchPredictionJob statuses
- class datarobot.models.BatchPredictionJobDefinition
- classmethod get(batch_prediction_job_definition_id)
Get batch prediction job definition
- Variables:
batch_prediction_job_definition_id (
str
) – ID of batch prediction job definition
- Returns:
Instance of BatchPredictionJobDefinition
- Return type:
BatchPredictionJobDefinition
Examples
>>> import datarobot as dr
>>> definition = dr.BatchPredictionJobDefinition.get('5a8ac9ab07a57a0001be501f')
>>> definition
BatchPredictionJobDefinition(60912e09fd1f04e832a575c1)
- classmethod list(search_name=None, deployment_id=None, limit=<datarobot.models.batch_prediction_job.MissingType object>, offset=0)
Get all job definitions
- Parameters:
search_name (
Optional[str]
) – String for filtering job definitions. Job definitions that contain the string in their name will be returned. If not specified, all available job definitions will be returned.
deployment_id (
str
) – The ID of the deployment the record belongs to.
limit (
Optional[int]
) – 0 by default. At most this many results are returned.
offset (
Optional[int]
) – This many results will be skipped.
- Returns:
List of job definitions the user has access to see
- Return type:
List[BatchPredictionJobDefinition]
Examples
>>> import datarobot as dr
>>> definition = dr.BatchPredictionJobDefinition.list()
>>> definition
[
    BatchPredictionJobDefinition(60912e09fd1f04e832a575c1),
    BatchPredictionJobDefinition(6086ba053f3ef731e81af3ca)
]
- classmethod create(enabled, batch_prediction_job, name=None, schedule=None)
Creates a new batch prediction job definition to be run either at scheduled interval or as a manual run.
- Variables:
enabled (
bool (default False)
) – Whether or not the definition should be active on a scheduled basis. If True, schedule is required.
batch_prediction_job (
dict
) – The job specifications for your batch prediction job. It requires the same job input parameters as used with score(), only it will not initialize a job scoring, only store it as a definition for later use.
name (
Optional[str]
) – The name you want your job to be identified with. Must be unique across the organization’s existing jobs. If you don’t supply a name, a random one will be generated for you.
schedule (
Optional[Dict]
) – The schedule payload defines at what intervals the job should run, which can be combined in various ways to construct complex scheduling terms if needed. In all of the elements in the objects, you can supply either an asterisk ["*"] denoting “every” time denomination or an array of integers (e.g. [1, 2, 3]) to define a specific interval.
The schedule payload is split up in the following items:
Minute: The minute(s) of the day that the job will run. Allowed values are either ["*"] meaning every minute of the day or [0 ... 59].
Hour: The hour(s) of the day that the job will run. Allowed values are either ["*"] meaning every hour of the day or [0 ... 23].
Day of Month: The date(s) of the month that the job will run. Allowed values are either [1 ... 31] or ["*"] for all days of the month. This field is additive with dayOfWeek, meaning the job will run both on the date(s) defined in this field and the day specified by dayOfWeek (for example, dates 1st, 2nd, 3rd, plus every Tuesday). If dayOfMonth is set to ["*"] and dayOfWeek is defined, the scheduler will trigger on every day of the month that matches dayOfWeek (for example, Tuesday the 2nd, 9th, 16th, 23rd, 30th). Invalid dates such as February 31st are ignored.
Month: The month(s) of the year that the job will run. Allowed values are either [1 ... 12] or ["*"] for all months of the year. Strings, either 3-letter abbreviations or the full name of the month, can be used interchangeably (e.g., “jan” or “october”). Months that are not compatible with dayOfMonth are ignored, for example {"dayOfMonth": [31], "month":["feb"]}.
Day of Week: The day(s) of the week that the job will run. Allowed values are [0 .. 6], where (Sunday=0), or ["*"], for all days of the week. Strings, either 3-letter abbreviations or the full name of the day, can be used interchangeably (e.g., “sunday”, “Sunday”, “sun”, or “Sun”, all map to [0]). This field is additive with dayOfMonth, meaning the job will run both on the date specified by dayOfMonth and the day defined in this field.
- Returns:
Instance of BatchPredictionJobDefinition
- Return type:
BatchPredictionJobDefinition
Examples
>>> import datarobot as dr
>>> job_spec = {
...     "num_concurrent": 4,
...     "deployment_id": "foobar",
...     "intake_settings": {
...         "url": "s3://foobar/123",
...         "type": "s3",
...         "format": "csv"
...     },
...     "output_settings": {
...         "url": "s3://foobar/123",
...         "type": "s3",
...         "format": "csv"
...     },
... }
>>> schedule = {
...     "day_of_week": [1],
...     "month": ["*"],
...     "hour": [16],
...     "minute": [0],
...     "day_of_month": [1]
... }
>>> definition = dr.BatchPredictionJobDefinition.create(
...     enabled=False,
...     batch_prediction_job=job_spec,
...     name="some_definition_name",
...     schedule=schedule
... )
>>> definition
BatchPredictionJobDefinition(60912e09fd1f04e832a575c1)
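The additive rule between day of month and day of week described for the schedule payload can be sketched as a small matcher. This is only an illustration of the documented semantics, not the scheduler's implementation: the function names are hypothetical, only integer values and ["*"] are handled (month and day names are not), and the case where only day_of_week is ["*"] is assumed to behave symmetrically to the documented dayOfMonth case.

```python
from datetime import datetime


def _matches(field, value):
    """True if a schedule field (["*"] or a list of ints) admits the value."""
    return field == ["*"] or value in field


def schedule_matches(schedule, moment):
    """Check whether `moment` falls on a run time of `schedule`.

    `schedule` uses the snake_case keys from the example above.
    """
    day_of_month = schedule["day_of_month"]
    day_of_week = schedule["day_of_week"]
    # isoweekday() is Monday=1..Sunday=7; the payload uses Sunday=0.
    weekday = moment.isoweekday() % 7

    if day_of_month == ["*"] and day_of_week == ["*"]:
        day_ok = True
    elif day_of_month == ["*"]:
        # Only the weekday restricts the day (documented behaviour).
        day_ok = weekday in day_of_week
    elif day_of_week == ["*"]:
        # Assumption: symmetric to the case above.
        day_ok = moment.day in day_of_month
    else:
        # Additive: listed dates plus listed weekdays.
        day_ok = moment.day in day_of_month or weekday in day_of_week

    return (
        day_ok
        and _matches(schedule["month"], moment.month)
        and _matches(schedule["hour"], moment.hour)
        and _matches(schedule["minute"], moment.minute)
    )
```

With the schedule from the example (day_of_week [1], day_of_month [1], hour [16], minute [0]), both the 1st of any month and every Monday at 16:00 match, while a Tuesday the 9th does not.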
- update(enabled, batch_prediction_job=None, name=None, schedule=None)
Updates a job definition with the changed specs.
Takes the same input as create().
- Returns:
Instance of the updated BatchPredictionJobDefinition
- Return type:
BatchPredictionJobDefinition
Examples
>>> import datarobot as dr
>>> job_spec = {
...     "num_concurrent": 5,
...     "deployment_id": "foobar_new",
...     "intake_settings": {
...         "url": "s3://foobar/123",
...         "type": "s3",
...         "format": "csv"
...     },
...     "output_settings": {
...         "url": "s3://foobar/123",
...         "type": "s3",
...         "format": "csv"
...     },
... }
>>> schedule = {
...     "day_of_week": [1],
...     "month": ["*"],
...     "hour": ["*"],
...     "minute": [30, 59],
...     "day_of_month": [1, 2, 6]
... }
>>> definition = dr.BatchPredictionJobDefinition.get('60912e09fd1f04e832a575c1')
>>> definition = definition.update(
...     enabled=False,
...     batch_prediction_job=job_spec,
...     name="updated_definition_name",
...     schedule=schedule
... )
>>> definition
BatchPredictionJobDefinition(60912e09fd1f04e832a575c1)
- run_on_schedule(schedule)
Sets the run schedule of an already created job definition.
If the job was previously not enabled, this will also set the job to enabled.
- Variables:
schedule (
dict
) – Same as schedule in create().
- Returns:
Instance of the updated BatchPredictionJobDefinition with the new / updated schedule.
- Return type:
BatchPredictionJobDefinition
Examples
>>> import datarobot as dr
>>> definition = dr.BatchPredictionJobDefinition.create('...')
>>> schedule = {
...     "day_of_week": [1],
...     "month": ["*"],
...     "hour": ["*"],
...     "minute": [30, 59],
...     "day_of_month": [1, 2, 6]
... }
>>> definition.run_on_schedule(schedule)
BatchPredictionJobDefinition(60912e09fd1f04e832a575c1)
- run_once()
Manually submits a batch prediction job to the queue, based on an already created job definition.
- Returns:
Instance of BatchPredictionJob
- Return type:
BatchPredictionJob
Examples
>>> import datarobot as dr
>>> definition = dr.BatchPredictionJobDefinition.create('...')
>>> job = definition.run_once()
>>> job.wait_for_completion()
- delete()
Deletes the job definition and disables any future schedules of this job, if any. If a scheduled job is currently running, it will not be cancelled.
- Return type:
None
Examples
>>> import datarobot as dr
>>> definition = dr.BatchPredictionJobDefinition.get('5a8ac9ab07a57a0001be501f')
>>> definition.delete()
Batch job
- class datarobot.models.batch_job.IntakeSettings
Intake settings typed dict
- class datarobot.models.batch_job.OutputSettings
Output settings typed dict
Predict job
- datarobot.models.predict_job.wait_for_async_predictions(project_id, predict_job_id, max_wait=600)
Given a Project id and a PredictJob id, poll the status of the process responsible for generating predictions until it has finished.
- Parameters:
project_id (
str
) – The identifier of the project
predict_job_id (
str
) – The identifier of the PredictJob
max_wait (
Optional[int]
) – Time in seconds after which predictions creation is considered unsuccessful
- Returns:
predictions – Generated predictions.
- Return type:
pandas.DataFrame
- Raises:
AsyncPredictionsGenerationError – Raised if status of fetched PredictJob object is error
AsyncTimeoutError – Predictions weren’t generated in time, specified by max_wait parameter
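The polling behaviour described above can be approximated by a loop that checks a status until it reaches a terminal state or max_wait elapses. A rough sketch follows; `poll_until_done`, the `fetch_status` callable, the status strings treated as terminal, and the use of the built-in `RuntimeError` and `TimeoutError` are stand-ins for illustration, not the client's implementation (which raises AsyncPredictionsGenerationError and AsyncTimeoutError).

```python
import time


def poll_until_done(fetch_status, max_wait=600, interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Call fetch_status() repeatedly until it reports a terminal state.

    fetch_status is a hypothetical callable returning a status string.
    "COMPLETED" and "error" are assumed to be the terminal values here.
    """
    deadline = clock() + max_wait
    while True:
        status = fetch_status()
        if status == "COMPLETED":
            return status
        if status == "error":
            raise RuntimeError("prediction job reported an error")
        if clock() >= deadline:
            raise TimeoutError(f"job did not finish within {max_wait} seconds")
        # Wait before asking again so the server is not flooded with requests.
        sleep(interval)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.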
- class datarobot.models.PredictJob
Tracks asynchronous work being done within a project
- Variables:
id (
int
) – the id of the job
project_id (
str
) – the id of the project the job belongs to
status (
str
) – the status of the job - will be one of datarobot.enums.QUEUE_STATUS
job_type (
str
) – what kind of work the job is doing - will be ‘predict’ for predict jobs
is_blocked (
bool
) – if true, the job is blocked (cannot be executed) until its dependencies are resolved
message (
str
) – a message about the state of the job, typically explaining why an error occurred
- classmethod from_job(job)
Transforms a generic Job into a PredictJob
- Parameters:
job (
Job
) – A generic job representing a PredictJob
- Returns:
predict_job – A fully populated PredictJob with all the details of the job
- Return type:
PredictJob
- Raises:
ValueError – If the generic Job was not a predict job, e.g. job_type != JOB_TYPE.PREDICT
- classmethod get(project_id, predict_job_id)
Fetches one PredictJob. If the job finished, raises PendingJobFinished exception.
- Parameters:
project_id (
str
) – The identifier of the project that contains the model on which prediction was started
predict_job_id (
str
) – The identifier of the predict_job
- Returns:
predict_job – The pending PredictJob
- Return type:
PredictJob
- Raises:
PendingJobFinished – If the job being queried already finished, and the server is re-routing to the finished predictions.
AsyncFailureError – Querying this resource gave a status code other than 200 or 303
- classmethod get_predictions(project_id, predict_job_id, class_prefix='class_')
Fetches finished predictions from the job used to generate them.
Notes
The prediction API for classifications now returns an additional prediction_values dictionary that is converted into a series of class_prefixed columns in the final dataframe. For example, <label> = 1.0 is converted to ‘class_1.0’. If you are on an older version of the client (prior to v2.8), you must update to v2.8 to correctly pivot this data.
- Parameters:
project_id (
str
) – The identifier of the project to which the model used for prediction generation belongs
predict_job_id (
str
) – The identifier of the predict_job
class_prefix (
str
) – The prefix to prepend to labels in the final dataframe (e.g., apple -> class_apple)
- Returns:
predictions – Generated predictions
- Return type:
pandas.DataFrame
- Raises:
JobNotFinished – If the job has not finished yet
AsyncFailureError – Querying the predict_job in question gave a status code other than 200 or 303
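To illustrate the class-prefixed columns mentioned in the notes, the sketch below flattens one prediction row into prefixed keys. The input shape (a `prediction_values` list of `label`/`value` pairs) and the function name `flatten_prediction_values` are assumptions made for illustration; the client performs its own conversion internally and returns a pandas.DataFrame.

```python
def flatten_prediction_values(row, class_prefix="class_"):
    """Turn per-class values into prefixed keys, e.g. label 1.0 -> 'class_1.0'."""
    # Keep every field except the nested per-class list.
    flat = {key: value for key, value in row.items() if key != "prediction_values"}
    # Add one prefixed key per class label.
    for entry in row.get("prediction_values", []):
        flat[f"{class_prefix}{entry['label']}"] = entry["value"]
    return flat
```

Passing a different `class_prefix` changes only the generated key names, which mirrors the role of the parameter documented above.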
- cancel()
Cancel this job. If this job has not finished running, it will be removed and canceled.
- get_result(params=None)
- Parameters:
params (
dict
or None
) – Query parameters to be added to request to get results.
Notes
For featureEffects, source param is required to define source, otherwise the default is training.
- Returns:
result – Return type depends on the job type:
for model jobs, a Model is returned
for predict jobs, a pandas.DataFrame (with predictions) is returned
for featureImpact jobs, a list of dicts by default (see the with_metadata parameter of the FeatureImpactJob class and its get() method)
for primeRulesets jobs, a list of Rulesets
for primeModel jobs, a PrimeModel
for primeDownloadValidation jobs, a PrimeFile
for predictionExplanationInitialization jobs, a PredictionExplanationsInitialization
for predictionExplanations jobs, a PredictionExplanations
for featureEffects, a FeatureEffects
- Return type:
object
- Raises:
JobNotFinished – If the job is not finished, the result is not available.
AsyncProcessUnsuccessfulError – If the job errored or was aborted
- get_result_when_complete(max_wait=600, params=None)
- Parameters:
max_wait (
Optional[int]
) – How long to wait for the job to finish.
params (
dict
, optional) – Query parameters to be added to request.
- Returns:
result – Return type is the same as would be returned by Job.get_result.
- Return type:
object
- Raises:
AsyncTimeoutError – If the job does not finish in time
AsyncProcessUnsuccessfulError – If the job errored or was aborted
- refresh()
Update this object with the latest job data from the server.
- wait_for_completion(max_wait=600)
Waits for job to complete.
- Parameters:
max_wait (
Optional[int]
) – How long to wait for the job to finish.
- Return type:
None
Prediction dataset
- class datarobot.models.PredictionDataset
A dataset uploaded to make predictions
Typically created via project.upload_dataset
- Variables:
id (
str
) – the id of the dataset
project_id (
str
) – the id of the project the dataset belongs to
created (
str
) – the time the dataset was created
name (
str
) – the name of the dataset
num_rows (
int
) – the number of rows in the dataset
num_columns (
int
) – the number of columns in the dataset
forecast_point (
datetime.datetime
or None
) – For time series projects only. This is the default point relative to which predictions will be generated, based on the forecast window of the project. See the time series predictions documentation for more information.
predictions_start_date (
datetime.datetime
or None
, optional) – For time series projects only. The start date for bulk predictions. Note that this parameter is for generating historical predictions using the training data. This parameter should be provided in conjunction with predictions_end_date. Can’t be provided with the forecast_point parameter.
predictions_end_date (
datetime.datetime
or None
, optional) – For time series projects only. The end date for bulk predictions, exclusive. Note that this parameter is for generating historical predictions using the training data. This parameter should be provided in conjunction with predictions_start_date. Can’t be provided with the forecast_point parameter.
relax_known_in_advance_features_check (
Optional[bool]
) – (New in version v2.15) For time series projects only. If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.
data_quality_warnings (
dict
, optional) – (New in version v2.15) A dictionary that contains available warnings about potential problems in this prediction dataset. Available warnings include:
- has_kia_missing_values_in_forecast_window (bool)
Applicable for time series projects. If True, known in advance features have missing values in the forecast window, which may decrease prediction accuracy.
- insufficient_rows_for_evaluating_models (bool)
Applicable for datasets which are used as external test sets. If True, there are not enough rows in the dataset to calculate insights.
- single_class_actual_value_column (bool)
Applicable for datasets which are used as external test sets. If True, the actual value column has only one class and insights such as the ROC curve cannot be calculated. Only applies to binary classification projects or unsupervised projects.
forecast_point_range (
list[datetime.datetime]
or None
, optional) – (New in version v2.20) For time series projects only. Specifies the range of dates available for use as a forecast point.
data_start_date (
datetime.datetime
or None
, optional) – (New in version v2.20) For time series projects only. The minimum primary date of this prediction dataset.
data_end_date (
datetime.datetime
or None
, optional) – (New in version v2.20) For time series projects only. The maximum primary date of this prediction dataset.
max_forecast_date (
datetime.datetime
or None
, optional) – (New in version v2.20) For time series projects only. The maximum forecast date of this prediction dataset.
actual_value_column (
string
, optional) – (New in version v2.21) Optional, only available for unsupervised projects, in case the dataset was uploaded with an actual value column specified. Name of the column which will be used to calculate the classification metrics and insights.
detected_actual_value_columns (
list
of dict
, optional) – (New in version v2.21) For unsupervised projects only, list of detected actual value columns information containing missing count and name for each column.
contains_target_values (
Optional[bool]
) – (New in version v2.21) Only for supervised projects. If True, the dataset contains target values and can be used to calculate the classification metrics and insights.
secondary_datasets_config_id (
string
or None
, optional) – (New in version v2.23) The ID of the alternative secondary dataset config to use during prediction for a Feature Discovery project.
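As a small illustration of how the data_quality_warnings dictionary might be consumed, the sketch below lists the warnings that are switched on. The helper name `active_warnings` is hypothetical; the only assumption about the data is that it is a dict of warning names to booleans (or None when absent), as described above.

```python
def active_warnings(data_quality_warnings):
    """Return the names of the warnings that are set to True, sorted."""
    if not data_quality_warnings:
        # The attribute is optional, so None or an empty dict means no warnings.
        return []
    return sorted(
        name for name, flag in data_quality_warnings.items() if flag is True
    )
```

With a retrieved dataset this would be called on its data_quality_warnings attribute before deciding whether to request predictions or insights.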
- classmethod get(project_id, dataset_id)
Retrieve information about a dataset uploaded for predictions
- Parameters:
project_id (
str
) – the id of the project to query
dataset_id (
str
) – the id of the dataset to retrieve
- Returns:
dataset – A dataset uploaded to make predictions
- Return type:
PredictionDataset
- delete()
Delete a dataset uploaded for predictions
Will also delete predictions made using this dataset and cancel any predict jobs using this dataset.
- Return type:
None