Batch Predictions

The Batch Prediction API provides a way to score large datasets using flexible options for intake and output on the Prediction Servers you have already deployed.

The main features are:

  • Flexible options for intake and output.
  • Stream local files and start scoring while the upload is still in progress, while simultaneously downloading the results.
  • Score large datasets from and to S3.
  • Connect to your database using JDBC with bidirectional streaming of scoring data and results.
  • Intake and output options can be mixed and do not need to match, so scoring from a JDBC source to an S3 target is also an option.
  • Protection against overloading your prediction servers with the option to control the concurrency level for scoring.
  • Prediction Explanations can be included (with the option to add thresholds).
  • Passthrough Columns are supported to correlate scored data with source data.
  • Prediction Warnings can be included in the output.

To interact with Batch Predictions, you should use the BatchPredictionJob class.

Scoring local CSV files

We provide a small utility function for scoring from/to local CSV files: BatchPredictionJob.score_to_file. The input parameter (the second argument in the example below) can be any of:

  • Path to a CSV dataset
  • File-like object
  • Pandas DataFrame

For larger datasets, you should avoid using a DataFrame, as that loads the entire dataset into memory; the other options do not.

import datarobot as dr

deployment_id = '5dc5b1015e6e762a6241f9aa'

dr.BatchPredictionJob.score_to_file(
    deployment_id,
    './data_to_predict.csv',
    './predicted.csv',
)

The input file will be streamed to our API and scoring will start immediately. As soon as results start coming in, we will initiate the download concurrently. The entire call will block until the file has been scored.

Scoring from and to S3

We provide a small utility function for scoring from/to CSV files hosted on S3: BatchPredictionJob.score_s3. This requires that the intake and output buckets share the same credentials (see Credentials and Credential.create_s3) or that their access policy is set to public:

import datarobot as dr

deployment_id = '5dc5b1015e6e762a6241f9aa'

cred = dr.Credential.get('5a8ac9ab07a57a0001be501f')

job = dr.BatchPredictionJob.score_s3(
    deployment=deployment_id,
    source_url='s3://mybucket/data_to_predict.csv',
    destination_url='s3://mybucket/predicted.csv',
    credential=cred,
)

Note

The S3 output functionality has a limit of 100 GB.

Scoring from and to Azure Cloud Storage

As with S3, we provide the same support for Azure Blob Storage through the utility function BatchPredictionJob.score_azure. This requires that an Azure connection string has been added to the DataRobot credentials store (see Credentials and Credential.create_azure):

import datarobot as dr

deployment_id = '5dc5b1015e6e762a6241f9aa'

cred = dr.Credential.get('5a8ac9ab07a57a0001be501f')

job = dr.BatchPredictionJob.score_azure(
    deployment=deployment_id,
    source_url='https://mybucket.blob.core.windows.net/bucket/data_to_predict.csv',
    destination_url='https://mybucket.blob.core.windows.net/results/predicted.csv',
    credential=cred,
)

Scoring from and to Google Cloud Platform

As with Azure, we provide the same support for Google Cloud Storage through the utility function BatchPredictionJob.score_gcp. This requires that a GCP key has been added to the DataRobot credentials store (see Credentials and Credential.create_gcp):

import datarobot as dr

deployment_id = '5dc5b1015e6e762a6241f9aa'

cred = dr.Credential.get('5a8ac9ab07a57a0001be501f')

job = dr.BatchPredictionJob.score_gcp(
    deployment=deployment_id,
    source_url='gs://bucket/data_to_predict.csv',
    destination_url='gs://results/predicted.csv',
    credential=cred,
)

Wiring a Batch Prediction Job manually

If you can’t use any of the utilities above, you are also free to configure your job manually. This requires configuring an intake option and an output option:

import datarobot as dr

deployment_id = '5dc5b1015e6e762a6241f9aa'

dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        'type': 's3',
        'url': 's3://public-bucket/data_to_predict.csv',
        'credential_id': '5a8ac9ab07a57a0001be501f',
    },
    output_settings={
        'type': 'localFile',
        'path': './predicted.csv',
    },
)

Credentials may be created with the Credentials API.
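Because intake and output types do not need to match, you could, for example, wire a JDBC intake to an S3 output. A sketch of such a configuration, using placeholder IDs, table name, and bucket URL:

```python
# Placeholder IDs and names -- replace with your own.
intake_settings = {
    'type': 'jdbc',
    'table': 'data_to_predict',
    'data_store_id': '5a8ac9ab07a57a0001be5010',
    'credential_id': '5a8ac9ab07a57a0001be501f',
}

output_settings = {
    'type': 's3',
    'url': 's3://mybucket/predicted.csv',
    'credential_id': '5a8ac9ab07a57a0001be501f',
}

# These would then be passed to dr.BatchPredictionJob.score(deployment_id, ...)
```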

Supported intake types

These are the supported intake types and descriptions of their configuration parameters:

Local file intake

This requires you to pass a path to a CSV dataset, a file-like object, or a Pandas DataFrame as the file parameter:

intake_settings={
    'type': 'localFile',
    'file': './data_to_predict.csv',
}
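A file-like object can be used anywhere a path can. For illustration, here an in-memory buffer stands in for any binary stream:

```python
import io

# An in-memory CSV buffer standing in for any file-like object
csv_buffer = io.BytesIO(b'feature_1,feature_2\n1.0,2.0\n')

intake_settings = {
    'type': 'localFile',
    'file': csv_buffer,  # file-like object instead of a path
}
```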

S3 CSV intake

This requires you to pass an S3 URL to the CSV file you are scoring in the url parameter:

intake_settings={
    'type': 's3',
    'url': 's3://public-bucket/data_to_predict.csv',
}

If the bucket is not publicly accessible, you can supply AWS credentials using the three parameters:

  • aws_access_key_id
  • aws_secret_access_key
  • aws_session_token

Then save them via the Credentials API. Here is an example:

import datarobot as dr

# get to make sure it exists
credential_id = '5a8ac9ab07a57a0001be501f'
cred = dr.Credential.get(credential_id)

intake_settings={
    'type': 's3',
    'url': 's3://private-bucket/data_to_predict.csv',
    'credential_id': cred.credential_id,
}

JDBC intake

This requires you to create a DataStore and Credential for your database:

# get to make sure it exists
datastore_id = '5a8ac9ab07a57a0001be5010'
data_store = dr.DataStore.get(datastore_id)

credential_id = '5a8ac9ab07a57a0001be501f'
cred = dr.Credential.get(credential_id)

intake_settings = {
    'type': 'jdbc',
    'table': 'table_name',
    'schema': 'public', # optional, if supported by database
    'catalog': 'master', # optional, if supported by database
    'data_store_id': data_store.id,
    'credential_id': cred.credential_id,
}

AI Catalog intake

This requires you to create a Dataset and use its dataset_id as the input:

# get to make sure it exists
dataset_id = '5a8ac9ab07a57a0001be501f'
dataset = dr.Dataset.get(dataset_id)

intake_settings={
    'type': 'dataset',
    'dataset': dataset
}

Or, if you want a version other than the latest, supply the dataset_version_id yourself:

# get to make sure it exists
dataset_id = '5a8ac9ab07a57a0001be501f'
dataset = dr.Dataset.get(dataset_id)

intake_settings={
    'type': 'dataset',
    'dataset': dataset,
    'dataset_version_id': 'another_version_id'
}

Supported output types

These are the supported output types and descriptions of their configuration parameters:

Local file output

For local file output, you have two options. The first is to pass a path parameter and have the client block while downloading the scored data concurrently. This is the fastest way to get predictions, as it uploads, scores, and downloads concurrently:

output_settings={
    'type': 'localFile',
    'path': './predicted.csv',
}

Another option is to leave out the parameter and subsequently call BatchPredictionJob.download at your own convenience. The BatchPredictionJob.score call will then return as soon as the upload is complete.

If the job is not finished scoring, the call to BatchPredictionJob.download will start streaming the data that has been scored so far and block until more data is available.

You can poll for job completion using BatchPredictionJob.get_status or use BatchPredictionJob.wait_for_completion to wait.

import datarobot as dr

deployment_id = '5dc5b1015e6e762a6241f9aa'

job = dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        'type': 'localFile',
        'file': './data_to_predict.csv',
    },
    output_settings={
        'type': 'localFile',
    },
)

job.wait_for_completion()

with open('./predicted.csv', 'wb') as f:
    job.download(f)
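The polling pattern can be sketched with a stand-in job object. The status strings and return shape below are assumptions (BatchPredictionJob.get_status is assumed here to return a dict with a 'status' key; check your client version):

```python
import time

class FakeJob:
    """Stand-in for a BatchPredictionJob, for illustration only."""
    def __init__(self):
        self._statuses = ['INITIALIZING', 'RUNNING', 'COMPLETED']

    def get_status(self):
        # The real client returns a dict describing the job;
        # here we just step through a fixed sequence of states.
        return {'status': self._statuses.pop(0)}

def poll_until_done(job, interval=0.0):
    """Poll a job until it reaches a terminal state."""
    while True:
        status = job.get_status()['status']
        if status in ('COMPLETED', 'ABORTED'):
            return status
        time.sleep(interval)

# poll_until_done(FakeJob()) -> 'COMPLETED'
```

In practice, BatchPredictionJob.wait_for_completion does this waiting for you.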

S3 CSV output

This requires you to pass, in the url parameter, an S3 URL for the CSV file where the scored data should be saved:

output_settings={
    'type': 's3',
    'url': 's3://public-bucket/predicted.csv',
}

Most likely, the bucket is not publicly accessible for writes, but you can supply AWS credentials using the three parameters:

  • aws_access_key_id
  • aws_secret_access_key
  • aws_session_token

Then save them via the Credentials API. Here is an example:

# get to make sure it exists
credential_id = '5a8ac9ab07a57a0001be501f'
cred = dr.Credential.get(credential_id)

output_settings={
    'type': 's3',
    'url': 's3://private-bucket/predicted.csv',
    'credential_id': cred.credential_id,
}

JDBC output

Same as for the input, this requires you to create a DataStore and Credential for your database, but for output_settings you also need to specify statementType, which should be one of datarobot.enums.AVAILABLE_STATEMENT_TYPES:

# get to make sure it exists
datastore_id = '5a8ac9ab07a57a0001be5010'
data_store = dr.DataStore.get(datastore_id)

credential_id = '5a8ac9ab07a57a0001be501f'
cred = dr.Credential.get(credential_id)

output_settings = {
    'type': 'jdbc',
    'table': 'table_name',
    'schema': 'public', # optional, if supported by database
    'catalog': 'master', # optional, if supported by database
    'statementType': 'insert',
    'data_store_id': data_store.id,
    'credential_id': cred.credential_id,
}

Copying a previously submitted job

We provide a small utility function for submitting a job using the parameters from a previously submitted job: BatchPredictionJob.score_from_existing. The first parameter is the job ID of the previous job.

import datarobot as dr

previously_submitted_job_id = '5dc5b1015e6e762a6241f9aa'

dr.BatchPredictionJob.score_from_existing(
    previously_submitted_job_id,
)

Scoring an in-memory Pandas DataFrame

When working with DataFrames, we provide a method for scoring the data without first writing it to a CSV file and subsequently reading the data back from a CSV file.

This will also take care of joining the computed predictions into the existing DataFrame.

Use the method BatchPredictionJob.score_pandas. The first parameter is the deployment ID; the second is the DataFrame to score.

import datarobot as dr
import pandas as pd

deployment_id = '5dc5b1015e6e762a6241f9aa'

df = pd.read_csv('testdata/titanic_predict.csv')

job, df = dr.BatchPredictionJob.score_pandas(deployment_id, df)

The method returns a copy of the job status and the updated DataFrame with the predictions added. So your DataFrame will now contain the following extra columns:

  • Survived_1_PREDICTION
  • Survived_0_PREDICTION
  • Survived_PREDICTION
  • THRESHOLD
  • POSITIVE_CLASS
  • prediction_status

print(df)
     PassengerId  Pclass                                          Name  ... Survived_PREDICTION  THRESHOLD  POSITIVE_CLASS
0            892       3                              Kelly, Mr. James  ...                   0        0.5               1
1            893       3              Wilkes, Mrs. James (Ellen Needs)  ...                   1        0.5               1
2            894       2                     Myles, Mr. Thomas Francis  ...                   0        0.5               1
3            895       3                              Wirz, Mr. Albert  ...                   0        0.5               1
4            896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  ...                   1        0.5               1
..           ...     ...                                           ...  ...                 ...        ...             ...
413         1305       3                            Spector, Mr. Woolf  ...                   0        0.5               1
414         1306       1                  Oliva y Ocana, Dona. Fermina  ...                   0        0.5               1
415         1307       3                  Saether, Mr. Simon Sivertsen  ...                   0        0.5               1
416         1308       3                           Ware, Mr. Frederick  ...                   0        0.5               1
417         1309       3                      Peter, Master. Michael J  ...                   1        0.5               1

[418 rows x 16 columns]
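The naming pattern of the added columns can be expressed as a small helper. This is a sketch inferred from the binary-classification output above, not an official API:

```python
def prediction_columns(target, classes):
    """Column names score_pandas appears to add for a binary target
    (inferred from the example output; not an official API)."""
    per_class = [f'{target}_{cls}_PREDICTION' for cls in classes]
    return per_class + [f'{target}_PREDICTION', 'THRESHOLD',
                        'POSITIVE_CLASS', 'prediction_status']

# prediction_columns('Survived', [1, 0]) yields the six columns listed above
```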

If you don’t want all of them or if you’re not happy with the names of the added columns, they can be modified using column remapping:

import datarobot as dr
import pandas as pd

deployment_id = '5dc5b1015e6e762a6241f9aa'

df = pd.read_csv('testdata/titanic_predict.csv')

job, df = dr.BatchPredictionJob.score_pandas(
    deployment_id,
    df,
    column_names_remapping={
        'Survived_1_PREDICTION': None,       # discard column
        'Survived_0_PREDICTION': None,       # discard column
        'Survived_PREDICTION': 'predicted',  # rename column
        'THRESHOLD': None,                   # discard column
        'POSITIVE_CLASS': None,              # discard column
    },
)

Any column mapped to None will be discarded. Any column mapped to a string will be renamed. Any column not mentioned will be kept in the output untouched. So your DataFrame will now contain the following extra columns:
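These remapping semantics can be reproduced with a few lines of plain Python (illustrative only; the client applies the remapping for you):

```python
def apply_remapping(columns, remapping):
    """Drop columns mapped to None, rename columns mapped to a string,
    and keep unmentioned columns untouched -- the same semantics as
    column_names_remapping."""
    result = []
    for col in columns:
        if col in remapping:
            new_name = remapping[col]
            if new_name is not None:
                result.append(new_name)  # rename
            # mapped to None: discard
        else:
            result.append(col)  # not mentioned: keep as-is
    return result

columns = ['PassengerId', 'Survived_PREDICTION', 'THRESHOLD']
remapped = apply_remapping(columns, {
    'Survived_PREDICTION': 'predicted',  # rename column
    'THRESHOLD': None,                   # discard column
})
# remapped -> ['PassengerId', 'predicted']
```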

  • predicted
  • prediction_status

Refer to the documentation for BatchPredictionJob.score for the full range of available options.