Batch Predictions

The Batch Prediction API provides a way to score large datasets using flexible options for intake and output on the Prediction Servers you have already deployed.

The main features are:

  • Flexible options for intake and output.
  • Stream local files and start scoring while the upload is still in progress, downloading results at the same time.
  • Score large datasets from and to S3.
  • Connect to your database using JDBC with bidirectional streaming of scoring data and results.
  • Intake and output options can be mixed and don't need to match, so scoring from a JDBC source to an S3 target is also an option.
  • Protection against overloading your prediction servers with the option to control the concurrency level for scoring.
  • Prediction Explanations can be included (with the option to add thresholds).
  • Passthrough Columns are supported to correlate scored data with source data.
  • Prediction Warnings can be included in the output.

To interact with Batch Predictions, use the BatchPredictionJob class.
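
Several of the features above map to optional keyword arguments on BatchPredictionJob.score(). The sketch below combines them; the customer_id passthrough column is a placeholder, and the argument names (num_concurrent, max_explanations, threshold_high, threshold_low, passthrough_columns, prediction_warning_enabled) are assumed to match your client version:

import datarobot as dr

deployment_id = '5dc5b1015e6e762a6241f9aa'

dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        'type': 'localFile',
        'file': './data_to_predict.csv',
    },
    output_settings={
        'type': 'localFile',
        'path': './predicted.csv',
    },
    num_concurrent=4,                     # cap concurrent requests to protect the prediction server
    max_explanations=5,                   # include up to 5 Prediction Explanations per row
    threshold_high=0.8,                   # only explain predictions above this threshold...
    threshold_low=0.2,                    # ...or below this one
    passthrough_columns=['customer_id'],  # copy this source column into the output (placeholder name)
    prediction_warning_enabled=True,      # include Prediction Warnings in the output
)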

Scoring local CSV files

We provide a small utility function for scoring from/to local CSV files: BatchPredictionJob.score_to_file(). The source (the second parameter, after the deployment) can be one of:

  • Path to a CSV dataset
  • File-like object
  • Pandas DataFrame

For larger datasets, you should avoid using a DataFrame, as it loads the entire dataset into memory; the other options do not.

import datarobot as dr

deployment_id = '5dc5b1015e6e762a6241f9aa'

dr.BatchPredictionJob.score_to_file(
    deployment_id,
    './data_to_predict.csv',
    './predicted.csv',
)

The input file will be streamed to our API and scoring will start immediately. As soon as results start coming in, we will initiate the download concurrently. The entire call will block until the file has been scored.

Scoring from and to S3

We provide a small utility function for scoring from/to CSV files hosted on S3: BatchPredictionJob.score_s3(). This requires that the intake and output buckets share the same credentials (see Credentials) or are public:

import datarobot as dr

deployment_id = '5dc5b1015e6e762a6241f9aa'

cred = dr.Credential.get('5a8ac9ab07a57a0001be501f')

dr.BatchPredictionJob.score_s3(
    deployment_id,
    's3://mybucket/data_to_predict.csv',
    's3://mybucket/predicted.csv',
    credential=cred,
)

Note

The S3 output functionality has a limit of 100 GB.

Wiring a Batch Prediction Job manually

If you can't use any of the utilities above, you are also free to configure your job manually. This requires configuring both an intake and an output option:

import datarobot as dr

deployment_id = '5dc5b1015e6e762a6241f9aa'

dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        'type': 's3',
        'url': 's3://public-bucket/data_to_predict.csv',
        'credential_id': '5a8ac9ab07a57a0001be501f',
    },
    output_settings={
        'type': 'localFile',
        'path': './predicted.csv',
    },
)

Credentials may be created with the Credentials API.
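
As a sketch of that flow, S3 keys can be stored once and the resulting credential ID reused across jobs. The name and key values below are placeholders, and Credential.create_s3() is assumed to be available in your client version:

import datarobot as dr

# store AWS keys once; reuse the returned ID as credential_id in intake/output settings
cred = dr.Credential.create_s3(
    name='my-s3-credential',          # placeholder name
    aws_access_key_id='AKIA...',      # placeholder access key
    aws_secret_access_key='...',      # placeholder secret key
)

print(cred.credential_id)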

Supported intake types

These are the supported intake types and descriptions of their configuration parameters:

Local file intake

This requires you to pass a path to a CSV dataset, a file-like object, or a Pandas DataFrame as the file parameter:

intake_settings={
    'type': 'localFile',
    'file': './data_to_predict.csv',
}
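
The path form is shown above; for reference, the other two accepted forms look like this (a sketch assuming pandas is installed):

import pandas as pd

# file-like object: streams the data without loading it all into memory
intake_settings={
    'type': 'localFile',
    'file': open('./data_to_predict.csv', 'rb'),
}

# Pandas DataFrame: convenient for smaller datasets already in memory
intake_settings={
    'type': 'localFile',
    'file': pd.read_csv('./data_to_predict.csv'),
}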

S3 CSV intake

This requires you to pass, in the url parameter, an S3 URL to the CSV file you are scoring:

intake_settings={
    'type': 's3',
    'url': 's3://public-bucket/data_to_predict.csv',
}

If the bucket is not publicly accessible, you can supply AWS credentials using the three parameters:

  • aws_access_key_id
  • aws_secret_access_key
  • aws_session_token

Save these with the Credentials API and reference the stored credential by its ID. Here is an example:

import datarobot as dr

# fetch the stored credential to make sure it exists
cred = dr.Credential.get(credential_id)

intake_settings={
    'type': 's3',
    'url': 's3://private-bucket/data_to_predict.csv',
    'credential_id': cred.credential_id,
}

JDBC intake

This requires you to create a DataStore and Credential for your database:

# fetch the stored data store and credential to make sure they exist
data_store = dr.DataStore.get(datastore_id)
cred = dr.Credential.get(credential_id)

intake_settings = {
    'type': 'jdbc',
    'table': 'table_name',
    'schema': 'public',
    'dataStoreId': data_store.id,
    'credentialId': cred.credential_id,
}
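
If you do not have these yet, they could be created along these lines. The driver ID, JDBC URL, and login details are placeholders, and DataStore.create() / Credential.create_basic() are assumed to match your client version:

import datarobot as dr

# register the database connection (placeholder driver ID and URL)
data_store = dr.DataStore.create(
    data_store_type='jdbc',
    canonical_name='my-postgres',
    driver_id='<jdbc-driver-id>',
    jdbc_url='jdbc:postgresql://db.example.com:5432/mydb',
)

# store the database login as a reusable credential (placeholder user/password)
cred = dr.Credential.create_basic(
    name='my-db-credential',
    user='db_user',
    password='db_password',
)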

Supported output types

These are the supported output types and descriptions of their configuration parameters:

Local file output

For local file output you have two options. The first is to pass a path parameter and have the client block while it downloads the scored data concurrently. This is the fastest way to get predictions, as it will upload, score, and download at the same time:

output_settings={
    'type': 'localFile',
    'path': './predicted.csv',
}

The second option is to leave out the path parameter and call BatchPredictionJob.download() at your own convenience. The score() call will then return as soon as the upload is complete.

If the job is not finished scoring, the call to BatchPredictionJob.download() will start streaming the data that has been scored so far and block until more data is available.

You can poll for job completion using BatchPredictionJob.get_status() or use BatchPredictionJob.wait_for_completion() to wait.

import datarobot as dr

deployment_id = '5dc5b1015e6e762a6241f9aa'

job = dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        'type': 'localFile',
        'file': './data_to_predict.csv',
    },
    output_settings={
        'type': 'localFile',
    },
)

job.wait_for_completion()

with open('./predicted.csv', 'wb') as f:
    job.download(f)
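
If you prefer polling over blocking, a minimal sketch could look like the following; the exact status strings ('INITIALIZING', 'RUNNING') returned by get_status() are assumptions:

import time

# poll until the job leaves its active states, then download the results
while job.get_status()['status'] in ('INITIALIZING', 'RUNNING'):
    time.sleep(5)

with open('./predicted.csv', 'wb') as f:
    job.download(f)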

S3 CSV output

This requires you to pass, in the url parameter, an S3 URL for the CSV file where the scored data should be saved:

output_settings={
    'type': 's3',
    'url': 's3://public-bucket/predicted.csv',
}

Most likely, the bucket is not publicly accessible for writes, but you can supply AWS credentials using the three parameters:

  • aws_access_key_id
  • aws_secret_access_key
  • aws_session_token

Save these with the Credentials API and reference the stored credential by its ID. Here is an example:

# fetch the stored credential to make sure it exists
cred = dr.Credential.get(credential_id)

output_settings={
    'type': 's3',
    'url': 's3://private-bucket/predicted.csv',
    'credential_id': cred.credential_id,
}

JDBC output

As with intake, this requires you to create a DataStore and Credential for your database, but for output_settings you also need to specify statementType, which must be one of datarobot.enums.AVAILABLE_STATEMENT_TYPES:

# fetch the stored data store and credential to make sure they exist
data_store = dr.DataStore.get(datastore_id)
cred = dr.Credential.get(credential_id)

output_settings = {
    'type': 'jdbc',
    'table': 'table_name',
    'schema': 'public',
    'statementType': 'insert',
    'dataStoreId': data_store.id,
    'credentialId': cred.credential_id,
}
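
Rather than hard-coding the string, the statement type can be taken from the enum; a sketch assuming AVAILABLE_STATEMENT_TYPES exposes an INSERT member equal to 'insert':

from datarobot.enums import AVAILABLE_STATEMENT_TYPES

output_settings = {
    'type': 'jdbc',
    'table': 'table_name',
    'schema': 'public',
    'statementType': AVAILABLE_STATEMENT_TYPES.INSERT,  # assumed to equal 'insert'
    'dataStoreId': data_store.id,
    'credentialId': cred.credential_id,
}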

Copying a previously submitted job

We provide a small utility function for submitting a job using parameters from a job previously submitted: BatchPredictionJob.score_from_existing(). The first parameter is the job id of another job.

import datarobot as dr

previously_submitted_job_id = '5dc5b1015e6e762a6241f9aa'

dr.BatchPredictionJob.score_from_existing(
    previously_submitted_job_id,
)