Batch Monitoring

class datarobot.models.BatchMonitoringJob(data, completed_resource_url=None)

A Batch Monitoring Job is used to monitor data sets outside the DataRobot app.

Attributes:

id: str: the ID of the job

classmethod get(project_id, job_id)

Get batch monitoring job

Returns:

BatchMonitoringJob: Instance of BatchMonitoringJob

Attributes:

job_id: str: ID of batch job

Return type:

BatchMonitoringJob

download(fileobj, timeout=120, read_timeout=660)

Downloads the results of a monitoring job as a CSV.

Attributes:

fileobj: A file-like object where the CSV monitoring results will be written to. Examples include an in-memory buffer (e.g., io.BytesIO) or a file on disk (opened for binary writing).

timeout: int (optional, default 120)

Seconds to wait for the download to become available.

The download will not be available before the job has started processing. In case other jobs are occupying the queue, processing may not start immediately.

If the timeout is reached, the job will be aborted and RuntimeError is raised.

Set to -1 to wait infinitely.

read_timeout: int (optional, default 660)

Seconds to wait for the server to respond between chunks.

Return type:

None
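As a minimal sketch of the two kinds of fileobj mentioned above, the helpers below write results either to an in-memory buffer or to a file opened for binary writing. FakeJob is a hypothetical stand-in, introduced here only so the snippet runs without a DataRobot connection; with a real job you would pass a BatchMonitoringJob instance instead.

```python
import io
import os
import tempfile


class FakeJob:
    """Hypothetical stand-in for BatchMonitoringJob (illustration only)."""

    def download(self, fileobj, timeout=120, read_timeout=660):
        # A real job streams CSV bytes from the server; here we write a fixed payload.
        fileobj.write(b"rowID,ACTUALS\n1,True\n")


def download_to_buffer(job):
    """Collect the CSV results in memory and return them as bytes."""
    buffer = io.BytesIO()
    job.download(buffer)
    return buffer.getvalue()


def download_to_path(job, path):
    """Write the CSV results to a file on disk, opened for binary writing."""
    with open(path, "wb") as fileobj:
        job.download(fileobj)
    return path


job = FakeJob()
in_memory = download_to_buffer(job)

with tempfile.TemporaryDirectory() as tmp_dir:
    saved = download_to_path(job, os.path.join(tmp_dir, "results.csv"))
    with open(saved, "rb") as handle:
        on_disk = handle.read()
```

Both helpers rely only on the fileobj accepting bytes through write(), which is why the file on disk must be opened in binary mode.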

classmethod run(deployment, intake_settings=None, output_settings=None, csv_settings=None, num_concurrent=None, chunk_size=None, abort_on_error=True, monitoring_aggregation=None, monitoring_columns=None, monitoring_output_settings=None, download_timeout=120, download_read_timeout=660, upload_read_timeout=600)

Create a new batch monitoring job, upload the dataset, and return the job.

Returns:

BatchMonitoringJob: Instance of BatchMonitoringJob

Return type:

BatchMonitoringJob

Examples

>>> import datarobot as dr
>>> job_spec = {
...     "intake_settings": {
...         "type": "jdbc",
...         "data_store_id": "645043933d4fbc3215f17e34",
...         "catalog": "SANDBOX",
...         "table": "10kDiabetes_output_actuals",
...         "schema": "SCORING_CODE_UDF_SCHEMA",
...         "credential_id": "645043b61a158045f66fb329"
...     },
...     "monitoring_columns": {
...         "predictions_columns": [
...             {
...                 "class_name": "True",
...                 "column_name": "readmitted_True_PREDICTION"
...             },
...             {
...                 "class_name": "False",
...                 "column_name": "readmitted_False_PREDICTION"
...             }
...         ],
...         "association_id_column": "rowID",
...         "actuals_value_column": "ACTUALS"
...     }
... }
>>> deployment_id = "foobar"
>>> job = dr.BatchMonitoringJob.run(deployment_id, **job_spec)
>>> job.wait_for_completion()

Attributes:

deployment: Deployment or string ID

Deployment which will be used for monitoring.

intake_settings: dict

A dict configuring where the data comes from. Supported options:

type : string, either localFile, s3, azure, gcp, dataset, jdbc, snowflake, synapse or bigquery

Note that to pass a dataset, you not only need to specify the type parameter as dataset, but you must also set the dataset parameter as a dr.Dataset object.

To monitor from a local file, add this parameter to the settings:

file : A file-like object, string path to a file or a pandas.DataFrame of scoring data.

To monitor from S3, add the following parameters to the settings:

url : string, the URL to score (e.g.: s3://bucket/key).

credential_id : string (optional).

endpoint_url : string (optional), any non-default endpoint URL for S3 access (omit to use the default).

To monitor from JDBC, add the following parameters to the settings:

data_store_id : string, the ID of the external data store connected to the JDBC data source (see Database Connectivity).

query : string (optional if table, schema and/or catalog is specified), a self-supplied SELECT statement that returns the data set you wish to monitor.

table : string (optional if query is specified), the name of specified database table.

schema : string (optional if query is specified), the name of specified database schema.

catalog : string (optional if query is specified), (new in v2.22) the name of specified database catalog.

fetch_size : int (optional), Changing the fetchSize can be used to balance throughput and memory usage.

credential_id : string (optional), the ID of the credentials holding information about a user with read-access to the JDBC data source (see Credentials).
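The JDBC intake parameters above can be assembled with a small helper. This is a sketch, not part of the DataRobot client: jdbc_intake_settings is a hypothetical function, and its rule of requiring at least one of query or table is this sketch's reading of the optionality notes above.

```python
def jdbc_intake_settings(data_store_id, credential_id=None, query=None,
                         table=None, schema=None, catalog=None, fetch_size=None):
    """Build an intake_settings dict for JDBC (hypothetical helper)."""
    # Assumption: the source of the data must be identified by a query or a table.
    if query is None and table is None:
        raise ValueError("Provide either 'query' or 'table'.")

    settings = {"type": "jdbc", "data_store_id": data_store_id}
    optional = {
        "credential_id": credential_id,
        "query": query,
        "table": table,
        "schema": schema,
        "catalog": catalog,
        "fetch_size": fetch_size,
    }
    # Only include the optional keys that were actually supplied.
    settings.update({key: value for key, value in optional.items() if value is not None})
    return settings


settings = jdbc_intake_settings(
    data_store_id="645043933d4fbc3215f17e34",
    credential_id="645043b61a158045f66fb329",
    catalog="SANDBOX",
    schema="SCORING_CODE_UDF_SCHEMA",
    table="10kDiabetes_output_actuals",
)
```

The resulting dict has the same shape as the intake_settings entry in the run() example and could be passed as intake_settings=settings.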

output_settings: dict (optional)

A dict configuring how monitored data is to be saved. Supported options:

type : string, either localFile, s3, azure, gcp, jdbc, snowflake, synapse or bigquery

To save monitored data to a local file, add parameters to the settings:

path : string (optional), path to save the scored data as CSV. If a path is not specified, you must download the scored data yourself with job.download(). If a path is specified, the call will block until the job is done. If there are no other jobs currently processing for the targeted prediction instance, uploading, scoring, and downloading will happen in parallel without waiting for a full job to complete. Otherwise, it will still block, but start downloading the scored data as soon as it starts generating data. This is the fastest method to get predictions.

To save monitored data to S3, add the following parameters to the settings:

url : string, the URL for storing the results (e.g.: s3://bucket/key).

credential_id : string (optional).

endpoint_url : string (optional), any non-default endpoint URL for S3 access (omit to use the default).

To save monitored data to JDBC, add the following parameters to the settings:

data_store_id : string, the ID of the external data store connected to the JDBC data source (see Database Connectivity).

table : string, the name of specified database table.

schema : string (optional), the name of specified database schema.

catalog : string (optional), (new in v2.22) the name of specified database catalog.

statement_type : string, the type of insertion statement to create, one of datarobot.enums.AVAILABLE_STATEMENT_TYPES.

update_columns : list(string) (optional), a list of strings containing those column names to be updated in case statement_type is set to a value related to update or upsert.

where_columns : list(string) (optional), a list of strings containing those column names to be selected in case statement_type is set to a value related to insert or update.

credential_id : string, the ID of the credentials holding information about a user with write-access to the JDBC data source (see Credentials).

create_table_if_not_exists : bool (optional), if no existing table is detected, attempt to create it before writing data with the strategy defined in the statement_type parameter.

csv_settings: dict (optional)

CSV intake and output settings. Supported options:

delimiter : string (optional, default ,), fields are delimited by this character. Use the string tab to denote TSV (TAB separated values). Must be either a one-character string or the string tab.
quotechar : string (optional, default "), fields containing the delimiter must be quoted using this character.
encoding : string (optional, default utf-8), encoding for the CSV files. For example (but not limited to): shift_jis, latin_1 or mskanji.
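To illustrate how these options relate to the data itself, the sketch below builds a small tab-separated file in memory with the standard csv module and pairs it with matching csv_settings. The column names are borrowed from the run() example; the rows are made-up sample values.

```python
import csv
import io

# Made-up sample rows, for illustration only.
rows = [["rowID", "ACTUALS"], ["1", "True"], ["2", "False"]]

# Write the rows as TSV text, then encode to bytes for use as a file-like object.
text_buffer = io.StringIO()
writer = csv.writer(text_buffer, delimiter="\t", quotechar='"', lineterminator="\n")
writer.writerows(rows)
tsv_bytes = io.BytesIO(text_buffer.getvalue().encode("utf-8"))

# The string "tab" (not "\t") denotes TSV in csv_settings, as described above.
csv_settings = {"delimiter": "tab", "quotechar": '"', "encoding": "utf-8"}

# A file-like object is one of the accepted values for the localFile "file" parameter.
intake_settings = {"type": "localFile", "file": tsv_bytes}
```

These two dicts could then be passed to run() as intake_settings and csv_settings.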

num_concurrent: int (optional)

Number of concurrent chunks to score simultaneously. Defaults to the available number of cores of the deployment. Lower it to leave resources for real-time scoring.

chunk_size: string or int (optional)

Which strategy should be used to determine the chunk size. Can be either a named strategy or a fixed size in bytes.

- auto: use fixed or dynamic based on flipper.
- fixed: use 1MB for explanations, 5MB for regular requests.
- dynamic: use dynamic chunk sizes.
- int: use this many bytes per chunk.
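The accepted chunk_size values can be checked client-side before submitting a job. The following is a sketch only: check_chunk_size is a hypothetical helper, and rejecting non-positive byte counts is an assumption of this sketch rather than documented behavior.

```python
NAMED_STRATEGIES = {"auto", "fixed", "dynamic"}


def check_chunk_size(chunk_size):
    """Return chunk_size unchanged if it is a named strategy or a byte count."""
    if chunk_size is None:
        return None  # let the server pick its default
    if isinstance(chunk_size, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        raise TypeError("chunk_size must be a strategy name or an int")
    if isinstance(chunk_size, int):
        if chunk_size <= 0:
            # Assumption: a fixed size in bytes must be positive.
            raise ValueError("chunk_size in bytes must be positive")
        return chunk_size
    if chunk_size in NAMED_STRATEGIES:
        return chunk_size
    raise ValueError(f"Unsupported chunk_size: {chunk_size!r}")


five_megabytes = check_chunk_size(5 * 1024 * 1024)
strategy = check_chunk_size("dynamic")
```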

abort_on_error: boolean (optional)

Default behavior is to abort the job if too many rows fail scoring. This will free up resources for other jobs that may score successfully. Set to False to unconditionally score every row no matter how many errors are encountered. Defaults to True.

download_timeout: int (optional, default 120)

Added in version 2.22.

If using localFile output, wait this many seconds for the download to become available. See download().

download_read_timeout: int (optional, default 660)

Added in version 2.22.

If using localFile output, wait this many seconds for the server to respond between chunks.

upload_read_timeout: int (optional, default 600)

Added in version 2.28.

If using localFile intake, wait this many seconds for the server to respond after the whole dataset has been uploaded.

cancel(ignore_404_errors=False)

Cancel this job. If this job has not finished running, it will be removed and canceled.

Return type:

None

get_status()

Get status of batch monitoring job

Returns:

BatchMonitoringJob status data: Dict with job status

Return type:

Any

class datarobot.models.BatchMonitoringJobDefinition(id=None, name=None, enabled=None, schedule=None, batch_monitoring_job=None, created=None, updated=None, created_by=None, updated_by=None, last_failed_run_time=None, last_successful_run_time=None, last_started_job_status=None, last_scheduled_run_time=None)

classmethod get(batch_monitoring_job_definition_id)

Get batch monitoring job definition

Returns:

BatchMonitoringJobDefinition: Instance of BatchMonitoringJobDefinition

Return type:

BatchMonitoringJobDefinition

Examples

>>> import datarobot as dr
>>> definition = dr.BatchMonitoringJobDefinition.get('5a8ac9ab07a57a0001be501f')
>>> definition
BatchMonitoringJobDefinition(60912e09fd1f04e832a575c1)

Attributes:

batch_monitoring_job_definition_id: str: ID of batch monitoring job definition

classmethod list()

Get all batch monitoring job definitions

Returns:

List[BatchMonitoringJobDefinition]: List of job definitions the user has access to

Return type:

List[BatchMonitoringJobDefinition]

Examples

>>> import datarobot as dr
>>> definition = dr.BatchMonitoringJobDefinition.list()
>>> definition
[
    BatchMonitoringJobDefinition(60912e09fd1f04e832a575c1),
    BatchMonitoringJobDefinition(6086ba053f3ef731e81af3ca)
]

classmethod create(enabled, batch_monitoring_job, name=None, schedule=None)

Creates a new batch monitoring job definition to be run either at a scheduled interval or as a manual run.

Returns:

BatchMonitoringJobDefinition: Instance of BatchMonitoringJobDefinition

Return type:

BatchMonitoringJobDefinition

Examples

>>> import datarobot as dr
>>> job_spec = {
...    "num_concurrent": 4,
...    "deployment_id": "foobar",
...    "intake_settings": {
...        "url": "s3://foobar/123",
...        "type": "s3",
...        "format": "csv"
...    },
...    "output_settings": {
...        "url": "s3://foobar/123",
...        "type": "s3",
...        "format": "csv"
...    },
... }
>>> schedule = {
...    "day_of_week": [
...        1
...    ],
...    "month": [
...        "*"
...    ],
...    "hour": [
...        16
...    ],
...    "minute": [
...        0
...    ],
...    "day_of_month": [
...        1
...    ]
... }
>>> definition = dr.BatchMonitoringJobDefinition.create(
...    enabled=False,
...    batch_monitoring_job=job_spec,
...    name="some_definition_name",
...    schedule=schedule
... )
>>> definition
BatchMonitoringJobDefinition(60912e09fd1f04e832a575c1)

Attributes:

enabled: bool (default False)

Whether the definition should be active on a scheduled basis. If True, schedule is required.

batch_monitoring_job: dict

The job specifications for your batch monitoring job. It requires the same job input parameters as used with BatchMonitoringJob.

name: string (optional)

The name you want your job to be identified with. Must be unique across the organization’s existing jobs. If you don’t supply a name, a random one will be generated for you.

schedule: dict (optional)

The schedule payload defines at what intervals the job should run, which can be combined in various ways to construct complex scheduling terms if needed. In all the elements in the objects, you can supply either an asterisk ["*"] denoting “every” time denomination or an array of integers (e.g. [1, 2, 3]) to define a specific interval.

The schedule payload is split up in the following items:

Minute: The minute(s) of the day that the job will run. Allowed values are either ["*"] meaning every minute of the day or [0 ... 59].

Hour: The hour(s) of the day that the job will run. Allowed values are either ["*"] meaning every hour of the day or [0 ... 23].

Day of Month: The date(s) of the month that the job will run. Allowed values are either [1 ... 31] or ["*"] for all days of the month. This field is additive with dayOfWeek, meaning the job will run both on the date(s) defined in this field and the day specified by dayOfWeek (for example, dates 1st, 2nd, 3rd, plus every Tuesday). If dayOfMonth is set to ["*"] and dayOfWeek is defined, the scheduler will trigger on every day of the month that matches dayOfWeek (for example, Tuesday the 2nd, 9th, 16th, 23rd, 30th). Invalid dates such as February 31st are ignored.

Month: The month(s) of the year that the job will run. Allowed values are either [1 ... 12] or ["*"] for all months of the year. Strings, either 3-letter abbreviations or the full name of the month, can be used interchangeably (e.g., “jan” or “october”). Months that are not compatible with dayOfMonth are ignored, for example {"dayOfMonth": [31], "month":["feb"]}.

Day of Week: The day(s) of the week that the job will run. Allowed values are [0 ... 6], where Sunday=0, or ["*"] for all days of the week. Strings, either 3-letter abbreviations or the full name of the day, can be used interchangeably (e.g., “sunday”, “Sunday”, “sun”, or “Sun” all map to [0]). This field is additive with dayOfMonth, meaning the job will run both on the date specified by dayOfMonth and the day defined in this field.
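The additive relationship between day of month and day of week can be sketched as a small matcher that answers whether a schedule would fire at a given moment. This is an illustration of the rules described above, not the scheduler's actual implementation: schedule_matches is a hypothetical helper, it handles only integers and ["*"] (not month or day names), it uses the snake_case keys from the examples, and its handling of a ["*"] day of week with specific dates follows the usual cron convention, which is an assumption.

```python
from datetime import datetime


def _field_matches(allowed, value):
    """True if the field is ["*"] or lists the given integer value."""
    return allowed == ["*"] or value in allowed


def schedule_matches(schedule, moment):
    """Return True if the schedule would fire at the given datetime (sketch)."""
    if not (_field_matches(schedule["minute"], moment.minute)
            and _field_matches(schedule["hour"], moment.hour)
            and _field_matches(schedule["month"], moment.month)):
        return False

    weekday = moment.isoweekday() % 7  # Sunday=0 ... Saturday=6
    days_of_month = schedule["day_of_month"]
    days_of_week = schedule["day_of_week"]
    any_date = days_of_month == ["*"]
    any_weekday = days_of_week == ["*"]

    if any_date and any_weekday:
        return True
    if any_date:
        # Only the weekday restricts the run days.
        return weekday in days_of_week
    if any_weekday:
        # Assumption (cron convention): only the dates restrict the run days.
        return moment.day in days_of_month
    # Both are specific: the fields are additive, so either one may match.
    return moment.day in days_of_month or weekday in days_of_week


schedule = {
    "day_of_week": [1],
    "month": ["*"],
    "hour": [16],
    "minute": [0],
    "day_of_month": [1],
}
monday_the_8th = schedule_matches(schedule, datetime(2024, 1, 8, 16, 0))
thursday_the_1st = schedule_matches(schedule, datetime(2024, 2, 1, 16, 0))
tuesday_the_9th = schedule_matches(schedule, datetime(2024, 1, 9, 16, 0))
```

With this schedule, the job would run at 16:00 on every Monday and also on the 1st of each month, so the first two checks are true and the third is false.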

update(enabled, batch_monitoring_job=None, name=None, schedule=None)

Updates a job definition with the changed specs.

Takes the same input as create()

Returns:

BatchMonitoringJobDefinition: Instance of the updated BatchMonitoringJobDefinition

Return type:

BatchMonitoringJobDefinition

Examples

>>> import datarobot as dr
>>> job_spec = {
...    "num_concurrent": 5,
...    "deployment_id": "foobar_new",
...    "intake_settings": {
...        "url": "s3://foobar/123",
...        "type": "s3",
...        "format": "csv"
...    },
...    "output_settings": {
...        "url": "s3://foobar/123",
...        "type": "s3",
...        "format": "csv"
...    },
... }
>>> schedule = {
...    "day_of_week": [
...        1
...    ],
...    "month": [
...        "*"
...    ],
...    "hour": [
...        "*"
...    ],
...    "minute": [
...        30, 59
...    ],
...    "day_of_month": [
...        1, 2, 6
...    ]
... }
>>> definition = dr.BatchMonitoringJobDefinition.get('60912e09fd1f04e832a575c1')
>>> definition = definition.update(
...    enabled=False,
...    batch_monitoring_job=job_spec,
...    name="updated_definition_name",
...    schedule=schedule
... )
>>> definition
BatchMonitoringJobDefinition(60912e09fd1f04e832a575c1)

Attributes:

enabled: bool (default False): Same as enabled in create().
batch_monitoring_job: dict: Same as batch_monitoring_job in create().
name: string (optional): Same as name in create().
schedule: dict: Same as schedule in create().

run_on_schedule(schedule)

Sets the run schedule of an already created job definition.

If the job was previously not enabled, this will also set the job to enabled.

Returns:

BatchMonitoringJobDefinition: Instance of the updated BatchMonitoringJobDefinition with the new / updated schedule.

Return type:

BatchMonitoringJobDefinition

Examples

>>> import datarobot as dr
>>> definition = dr.BatchMonitoringJobDefinition.create('...')
>>> schedule = {
...    "day_of_week": [
...        1
...    ],
...    "month": [
...        "*"
...    ],
...    "hour": [
...        "*"
...    ],
...    "minute": [
...        30, 59
...    ],
...    "day_of_month": [
...        1, 2, 6
...    ]
... }
>>> definition.run_on_schedule(schedule)
BatchMonitoringJobDefinition(60912e09fd1f04e832a575c1)

Attributes:

schedule: dict: Same as schedule in create().

run_once()

Manually submits a batch monitoring job to the queue, based on an already created job definition.

Returns:

BatchMonitoringJob: Instance of BatchMonitoringJob

Return type:

BatchMonitoringJob

Examples

>>> import datarobot as dr
>>> definition = dr.BatchMonitoringJobDefinition.create('...')
>>> job = definition.run_once()
>>> job.wait_for_completion()

delete()

Deletes the job definition and disables any future schedules of this job. If a scheduled job is currently running, it will not be cancelled.

Return type:

None

Examples

>>> import datarobot as dr
>>> definition = dr.BatchMonitoringJobDefinition.get('5a8ac9ab07a57a0001be501f')
>>> definition.delete()