Data engine query generator

class datarobot.DataEngineQueryGenerator

DataEngineQueryGenerator is used to set up time series data prep.

Added in version v2.27.

Variables:

id (str) – id of the query generator
query (str) – text of the generated Spark SQL query
datasets (list(QueryGeneratorDataset)) – datasets associated with the query generator
generator_settings (QueryGeneratorSettings) – the settings used to define the query
generator_type (str) – “TimeSeries” is the only supported type

classmethod create(generator_type, datasets, generator_settings)

Creates a query generator entity.

Added in version v2.27.

Parameters:

generator_type (str) – Type of data engine query generator
datasets (List[QueryGeneratorDataset]) – Source datasets in the Data Engine workspace.
generator_settings (dict) – Data engine generator settings of the given generator_type.

Returns:

query_generator – The created generator

Return type:

DataEngineQueryGenerator

Examples

import datarobot as dr
from datarobot.models.data_engine_query_generator import (
   QueryGeneratorDataset,
   QueryGeneratorSettings,
)
dataset = QueryGeneratorDataset(
   alias='My_Awesome_Dataset_csv',
   dataset_id='61093144cabd630828bca321',
   dataset_version_id=1,
)
settings = QueryGeneratorSettings(
   datetime_partition_column='date',
   time_unit='DAY',
   time_step=1,
   default_numeric_aggregation_method='sum',
   default_categorical_aggregation_method='mostFrequent',
)
g = dr.DataEngineQueryGenerator.create(
   generator_type='TimeSeries',
   datasets=[dataset],
   generator_settings=settings,
)
g.id  # '54e639a18bd88f08078ca831'
g.generator_type  # 'TimeSeries'

classmethod get(generator_id)

Gets information about a query generator.

Parameters:

generator_id (str) – The identifier of the query generator you want to load.

Returns:

query_generator – The queried generator

Return type:

DataEngineQueryGenerator

Examples

import datarobot as dr
g = dr.DataEngineQueryGenerator.get(generator_id='54e639a18bd88f08078ca831')
g.id  # '54e639a18bd88f08078ca831'
g.generator_type  # 'TimeSeries'

create_dataset(dataset_id=None, dataset_version_id=None, max_wait=600)

A blocking call that creates a new Dataset from the query generator and returns once the dataset has been successfully processed. If the optional parameters are not specified, the query is applied to the dataset_id and dataset_version_id stored in the query generator. If specified, they override the stored dataset_id/dataset_version_id, e.g. to prep a prediction dataset.

Parameters:

dataset_id (Optional[str]) – The id of the unprepped dataset to apply the query to
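dataset_version_id (Optional[str]) – The version_id of the unprepped dataset to apply the query to
max_wait (Optional[int]) – The maximum number of seconds to wait for the dataset to be processed before giving up.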

Returns:

response – The Dataset created from the query generator

Return type:

Dataset
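
Examples

A minimal sketch of the two calling modes described above: no overrides to prep the dataset stored in the query generator, and explicit overrides to prep a different dataset, such as one intended for predictions. The helper name prep_datasets and its scoring_* parameters are illustrative and not part of the client; generator is assumed to be a DataEngineQueryGenerator obtained from create or get.

```python
def prep_datasets(generator, scoring_dataset_id,
                  scoring_dataset_version_id=None, max_wait=600):
    """Prep the stored dataset and a separate scoring dataset.

    Hypothetical helper: `generator` is assumed to be a
    DataEngineQueryGenerator returned by `create` or `get`.
    """
    # No overrides: the query is applied to the dataset_id and
    # dataset_version_id stored in the query generator.
    training = generator.create_dataset(max_wait=max_wait)

    # Overrides: the same query is applied to a different dataset.
    scoring = generator.create_dataset(
        dataset_id=scoring_dataset_id,
        dataset_version_id=scoring_dataset_version_id,
        max_wait=max_wait,
    )
    return training, scoring
```

Both calls block, so the helper returns only after both datasets have been processed or an error has been raised.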

prepare_prediction_dataset_from_catalog(project_id, dataset_id, dataset_version_id=None, max_wait=600, relax_known_in_advance_features_check=None)

Apply time series data prep to a catalog dataset and upload it to the project as a PredictionDataset.

Added in version v3.1.

Parameters:

project_id (str) – The id of the project to which you upload the prediction dataset.
dataset_id (str) – The identifier of the dataset.
dataset_version_id (Optional[str]) – The version id of the dataset to use.
max_wait (Optional[int]) – The maximum number of seconds to wait before giving up.
relax_known_in_advance_features_check (Optional[bool]) – For time series projects only. If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.

Returns:

dataset – The newly uploaded dataset.

Return type:

PredictionDataset
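
Examples

A minimal sketch, assuming several catalog datasets need to be prepped and uploaded to the same project. The helper upload_catalog_datasets is hypothetical; it only forwards the documented parameters to prepare_prediction_dataset_from_catalog and collects the results by dataset id.

```python
def upload_catalog_datasets(generator, project_id, dataset_ids, max_wait=600,
                            relax_known_in_advance_features_check=None):
    """Prep and upload each catalog dataset; return {dataset_id: result}.

    Hypothetical helper: `generator` is assumed to be a
    DataEngineQueryGenerator returned by `create` or `get`.
    """
    uploaded = {}
    for dataset_id in dataset_ids:
        # dataset_version_id is left at its default (None) here.
        uploaded[dataset_id] = generator.prepare_prediction_dataset_from_catalog(
            project_id=project_id,
            dataset_id=dataset_id,
            max_wait=max_wait,
            relax_known_in_advance_features_check=relax_known_in_advance_features_check,
        )
    return uploaded
```

The uploads run one after another, so the total wait can approach max_wait multiplied by the number of datasets.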

prepare_prediction_dataset(sourcedata, project_id, max_wait=600, relax_known_in_advance_features_check=None)

Apply time series data prep and upload the PredictionDataset to the project.

Added in version v3.1.

Parameters:

sourcedata (str, file or pandas.DataFrame) – Data to be used for predictions. If it is a string, it can be either a path to a local file, or raw file content. If using a file on disk, the filename must consist of ASCII characters only.
project_id (str) – The id of the project to which you upload the prediction dataset.
max_wait (Optional[int]) – The maximum number of seconds to wait for the uploaded dataset to be processed before raising an error.
relax_known_in_advance_features_check (Optional[bool]) – For time series projects only. If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.

Returns:

dataset – The newly uploaded dataset.

Return type:

PredictionDataset

Raises:

InputNotUnderstoodError – Raised if sourcedata isn’t one of the supported types.
AsyncFailureError – Raised if polling for the status of an async process resulted in a response with an unsupported status code.
AsyncProcessUnsuccessfulError – Raised if the dataset upload was unsuccessful (i.e. the server reported an error in uploading the dataset).
AsyncTimeoutError – Raised if processing the uploaded dataset took more time than specified by the max_wait parameter.
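
Examples

A minimal sketch of a pre-flight check for the ASCII-only filename requirement on sourcedata, followed by the upload call. The helpers has_ascii_filename and upload_predictions are hypothetical, and interpreting "filename" as the final path component is an assumption. Strings that do not point to an existing file are passed through unchanged, since they may be raw file content.

```python
import os


def has_ascii_filename(path):
    """Return True if the final path component contains only ASCII characters."""
    return os.path.basename(path).isascii()


def upload_predictions(generator, sourcedata, project_id, max_wait=600):
    """Check a local file's name, then prep and upload the prediction data.

    Hypothetical helper: `generator` is assumed to be a
    DataEngineQueryGenerator returned by `create` or `get`.
    """
    # Only strings that name an existing file are treated as paths;
    # other strings may be raw file content and are not checked.
    if isinstance(sourcedata, str) and os.path.isfile(sourcedata):
        if not has_ascii_filename(sourcedata):
            raise ValueError("filename must consist of ASCII characters only")
    return generator.prepare_prediction_dataset(
        sourcedata, project_id, max_wait=max_wait
    )
```

Failing early with a local ValueError avoids starting an upload that the filename restriction would reject.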