Experimental API

These features all require special permissions to be activated on your DataRobot account, and will not work otherwise. If you want to test a feature, please ask your DataRobot CFDS or account manager about enrolling in our preview program.

Classes in this list should be considered “experimental”, not fully released, and likely to change in future releases. Do not use them for production systems or other mission-critical uses.

class datarobot._experimental.models.model.Model
get_feature_effect(source)

Retrieve Feature Effects for the model.

Feature Effects provides partial dependence and predicted vs. actual values for the top 500 features, ordered by feature impact score.

The partial dependence shows the marginal effect of a feature on the target variable after accounting for the average effects of all other predictive features. It indicates how, with all other features held as they were, the value of this feature affects your prediction.

Requires that Feature Effects has already been computed with request_feature_effect.

See get_feature_effect_metadata for retrieving information about the available sources.

Parameters:

source (str) – The source Feature Effects are retrieved for.

Returns:

feature_effects – The feature effects data.

Return type:

FeatureEffects

Raises:

ClientError – If the Feature Effects have not been computed, or the source is not a valid value.
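
Examples

A minimal sketch of computing and retrieving Feature Effects. The project and model IDs are placeholders; obtaining the model via Model.get, waiting on the returned job, and reading metadata.sources follow the public client API and are assumptions here.

from datarobot._experimental.models.model import Model

model = Model.get('<project_id>', '<model_id>')

# Feature Effects must be computed before they can be retrieved.
job = model.request_feature_effect()
job.wait_for_completion()

# Discover the available sources, then fetch Feature Effects for one of them.
fe_metadata = model.get_feature_effect_metadata()
feature_effects = model.get_feature_effect(source=fe_metadata.sources[0])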

get_incremental_learning_metadata()

Retrieve incremental learning metadata for this model.

Added in version v3.4.0.

This functionality requires the INCREMENTAL_LEARNING feature flag to be enabled.

Returns:

metadata – An IncrementalLearningMetadata object representing incremental learning metadata.

Return type:

IncrementalLearningMetadata

start_incremental_learning(early_stopping_rounds=None)

Start incremental learning for this model.

Added in version v3.4.0.

This functionality requires the INCREMENTAL_LEARNING feature flag to be enabled.

Parameters:

early_stopping_rounds (Optional[int]) – The number of chunks with no observed improvement that triggers the early stopping mechanism.

Return type:

None

Raises:

ClientError – If the server responded with a 4xx status.
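
Examples

A minimal sketch of starting incremental learning and polling its metadata. The project and model IDs are placeholders, and obtaining the model via Model.get is an assumption.

from datarobot._experimental.models.model import Model

model = Model.get('<project_id>', '<model_id>')

# Start incremental learning; stop early after 3 chunks without improvement.
model.start_incremental_learning(early_stopping_rounds=3)

# Check progress: status, total number of chunks, and the current validation score.
metadata = model.get_incremental_learning_metadata()
print(metadata.status, metadata.total_number_of_chunks, metadata.score)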

start_incremental_learning_from_sample(early_stopping_rounds=None, first_iteration_only=None)

Submit a job to the queue to perform the first incremental learning iteration training on an existing sample model. This functionality requires the SAMPLE_DATA_TO_START_PROJECT feature flag to be enabled.

Parameters:
  • early_stopping_rounds (Optional[int]) – The number of chunks with no observed improvement that triggers the early stopping mechanism.

  • first_iteration_only (Optional[bool]) – Specifies whether incremental learning training should be limited to the first iteration. If set to True, the training process will be performed only for the first iteration. If set to False, training will continue until early stopping conditions are met or the maximum number of iterations is reached. The default value is False.

Returns:

job – The created job that is retraining the model

Return type:

ModelJob
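
Examples

A minimal sketch, assuming the sample model is obtained via Model.get and that the returned ModelJob supports wait_for_completion (both follow the public client API and are assumptions here).

from datarobot._experimental.models.model import Model

sample_model = Model.get('<project_id>', '<sample_model_id>')

# Run only the first incremental learning iteration on top of the sample model.
job = sample_model.start_incremental_learning_from_sample(first_iteration_only=True)
job.wait_for_completion()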

class datarobot._experimental.models.model.DatetimeModel
get_feature_effect(source, backtest_index)

Retrieve Feature Effects for the model.

Feature Effects provides partial dependence and predicted vs. actual values for the top 500 features, ordered by feature impact score.

The partial dependence shows the marginal effect of a feature on the target variable after accounting for the average effects of all other predictive features. It indicates how, with all other features held as they were, the value of this feature affects your prediction.

Requires that Feature Effects has already been computed with request_feature_effect.

See get_feature_effect_metadata for retrieving information about the available sources and backtest indices.

Parameters:
  • source (string) – The source Feature Effects are retrieved for. One of the values in FeatureEffectMetadataDatetime.sources; use get_feature_effect_metadata to retrieve the available sources.

  • backtest_index (string) – The backtest index to retrieve Feature Effects for. One of the values in FeatureEffectMetadataDatetime.backtest_index.

Returns:

feature_effects – The feature effects data.

Return type:

FeatureEffects

Raises:

ClientError – If the Feature Effects have not been computed, or the source is not a valid value.
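
Examples

A minimal sketch for a datetime-partitioned model. The IDs are placeholders; obtaining the model via DatetimeModel.get is an assumption, and the source 'training' and backtest index '0' are assumed values — use get_feature_effect_metadata to look up the values available for your model.

from datarobot._experimental.models.model import DatetimeModel

model = DatetimeModel.get('<project_id>', '<model_id>')

# Retrieve Feature Effects for the training source of the first backtest.
feature_effects = model.get_feature_effect(source='training', backtest_index='0')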

datarobot._experimental.models.data_store.get_spark_session(self, db_token)

Returns a Spark session

Parameters:

db_token (str) – A personal access token.

Returns:

A Spark session initialized with connection parameters taken from the DataStore and the provided db_token.

Return type:

SparkSession

Examples

>>> from datarobot._experimental.models.data_store import DataStore
>>> data_stores = DataStore.list(typ=DataStoreListTypes.DR_DATABASE_V1)
>>> data_stores
[DataStore('my_databricks_store_1')]
>>> db_connection = data_stores[0].get_spark_session('<token>')
>>> db_connection
<pyspark.sql.connect.session.SparkSession at 0x7f386068fbb0>
>>> df = db_connection.read.table("samples.nyctaxi.trips")
>>> df.show()
class datarobot._experimental.models.data_store.DataStore

A data store. Represents a database.

Variables:
  • id (str) – The ID of the data store.

  • data_store_type (str) – The type of data store.

  • canonical_name (str) – The user-friendly name of the data store.

  • creator (str) – The ID of the user who created the data store.

  • updated (datetime.datetime) – The time of the last update.

  • params (DataStoreParameters) – A list specifying data store parameters.

  • role (str) – Your access role for this data store.

  • driver_class_type (str) – The driver class type of the data store.

class datarobot._experimental.models.retraining.RetrainingUseCase

Retraining use case.

Variables:
  • id (str) – The ID of the use case.

  • name (str) – The name of the use case.

class datarobot._experimental.models.retraining.RetrainingPolicy

Retraining Policy.

Variables:
  • policy_id (str) – ID of the retraining policy

  • name (str) – Name of the retraining policy

  • description (str) – Description of the retraining policy

  • use_case (Optional[dict]) – Use case the retraining policy is associated with

classmethod list(deployment_id)

Lists all retraining policies associated with a deployment

Parameters:

deployment_id (str) – Id of the deployment

Returns:

policies – List of retraining policies associated with a deployment

Return type:

list

Examples

from datarobot import Deployment
from datarobot._experimental.models.retraining import RetrainingPolicy
deployment = Deployment.get(deployment_id='620ed0e37b6ce03244f19631')
RetrainingPolicy.list(deployment.id)
>>> [RetrainingPolicy('620ed248bb0a1f5889eb6aa7'), RetrainingPolicy('624f68be8828ed81bf487d8d')]
classmethod get(deployment_id, retraining_policy_id)

Retrieves a retraining policy associated with a deployment

Parameters:
  • deployment_id (str) – Id of the deployment

  • retraining_policy_id (str) – Id of the policy

Returns:

retraining_policy – Retraining policy

Return type:

RetrainingPolicy

Examples

from datarobot._experimental.models.retraining import RetrainingPolicy
policy = RetrainingPolicy.get(
    deployment_id='620ed0e37b6ce03244f19631',
    retraining_policy_id='624f68be8828ed81bf487d8d'
)
policy.id
>>>'624f68be8828ed81bf487d8d'
policy.name
>>>'PolicyA'
classmethod create(deployment_id, name, description=None, use_case_id=None)

Create a retraining policy associated with a deployment

Parameters:
  • deployment_id (str) – The ID of the deployment.

  • name (str) – The retraining policy name.

  • description (str) – The retraining policy description.

  • use_case_id (Optional[str]) – The ID of the Use Case that the retraining policy is associated with.

Returns:

retraining_policy – Retraining policy

Return type:

RetrainingPolicy

Examples

from datarobot._experimental.models.retraining import RetrainingPolicy
policy = RetrainingPolicy.create(
    deployment_id='620ed0e37b6ce03244f19631',
    name='Retraining Policy A',
    use_case_id='678114c41e9114cabca27044',
)
policy.id
>>>'624f68be8828ed81bf487d8d'
classmethod delete(deployment_id, retraining_policy_id)

Deletes a retraining policy associated with a deployment

Parameters:
  • deployment_id (str) – Id of the deployment

  • retraining_policy_id (str) – Id of the policy

Return type:

None

Examples

from datarobot._experimental.models.retraining import RetrainingPolicy
RetrainingPolicy.delete(
    deployment_id='620ed0e37b6ce03244f19631',
    retraining_policy_id='624f68be8828ed81bf487d8d'
)
update_use_case(use_case_id)

Update the use case associated with this retraining policy

Parameters:

use_case_id (str) – Id of the use case the retraining policy is associated with

Return type:

RetrainingPolicy

Examples

from datarobot._experimental.models.retraining import RetrainingPolicy
policy = RetrainingPolicy.get(
    deployment_id='620ed0e37b6ce03244f19631',
    retraining_policy_id='624f68be8828ed81bf487d8d'
)
updated = policy.update_use_case(use_case_id='620ed0e37b6ce03244f19633')
updated.use_case.id
>>>'620ed0e37b6ce03244f19633'
class datarobot._experimental.models.retraining.RetrainingPolicyRun

Retraining policy run.

Variables:
  • policy_run_id (str) – ID of the retraining policy run

  • status (str) – Status of the retraining policy run

  • challenger_id (str) – ID of the challenger model retrieved after running the policy

  • error_message (str) – The error message if an error occurs during the policy run

  • model_package_id (str) – ID of the model package (version) retrieved after the policy is run

  • project_id (str) – ID of the project the deployment is associated with

  • start_time (datetime.datetime) – Timestamp of when the policy run starts

  • finish_time (datetime.datetime) – Timestamp of when the policy run finishes

classmethod list(deployment_id, retraining_policy_id)

Lists all the retraining policy runs of a retraining policy that is associated with a deployment.

Parameters:
  • deployment_id (str) – ID of the deployment

  • retraining_policy_id (str) – ID of the policy

Returns:

policy runs – List of retraining policy runs

Return type:

list

Examples

from datarobot._experimental.models.retraining import RetrainingPolicyRun
RetrainingPolicyRun.list(
    deployment_id='620ed0e37b6ce03244f19631',
    retraining_policy_id='62f4448f0dfd5699feae3e6e'
)
>>> [RetrainingPolicyRun('620ed248bb0a1f5889eb6aa7'), RetrainingPolicyRun('624f68be8828ed81bf487d8d')]
class datarobot._experimental.models.data_matching.DataMatching

Retrieves the closest data points for the input data.

This functionality is more than a simple lookup. To retrieve the closest data points, the data matching functionality first applies the DataRobot preprocessing pipeline and then searches for the closest data points. The returned values are the closest data points at the point of entry to the model.

There are three sets of methods supported:
  1. Methods to build the index (for project, model, featurelist). The index needs to be built first in order to search for the closest data points. Once the index is built it will be reused.

  2. Methods to search for the closest data points (for project, model, featurelist). These methods initialize the query, await its completion, and then save the result as a CSV file in the specified location.

  3. Additional methods to manually list history of queries and retrieve results for them.
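
Examples

A minimal sketch of the index-then-query workflow. Constructing DataMatching from a project ID is an assumption (the constructor is not documented here), and query.csv is a placeholder file containing the data points to match.

from datarobot._experimental.models.data_matching import DataMatching

dm = DataMatching(project_id='<project_id>')  # assumed constructor

# Build the index once; it is reused for subsequent queries.
dm.build_index(max_wait=600)

# Search for the 5 closest data points to each row in query.csv.
df = dm.get_closest_data('query.csv', number_of_data=5)
print(df.head())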

get_query_url(url, number_of_data=None)

Returns the formatted data matching query URL.

Return type:

str

get_closest_data(query_file_path, number_of_data=None, max_wait=600, build_index_if_missing=True)

Retrieves the closest data points to the data points in the input file. If the index is missing, by default the method will try to build it.

Parameters:
  • query_file_path (str) – Path to the file containing the data points to search the closest data points for.

  • number_of_data (int or None) – Number of results to search for. If no value specified, the default is 10.

  • max_wait (int) – Number of seconds to wait for the result. Default is 600.

  • build_index_if_missing (Optional[bool]) – Should the index be created if it is missing. If False is specified and the index is missing, an exception is thrown. Default True.

Returns:

df – Dataframe with query result

Return type:

pd.DataFrame

get_closest_data_for_model(model_id, query_file_path, number_of_data=None, max_wait=600, build_index_if_missing=True)

Retrieves the closest data points to the data points in the input file. If the index is missing, by default the method will try to build it.

Parameters:
  • model_id (str) – Id of the model to search for the closest data points

  • query_file_path (str) – Path to the file containing the data points to search the closest data points for.

  • number_of_data (int or None) – Number of results to search for. If no value specified, the default is 10.

  • max_wait (int) – Number of seconds to wait for the result. Default is 600.

  • build_index_if_missing (Optional[bool]) – Should the index be created if it is missing. If False is specified and the index is missing, an exception is thrown. Default True.

Returns:

df – Dataframe with query result

Return type:

pd.DataFrame

get_closest_data_for_featurelist(featurelist_id, query_file_path, number_of_data=None, max_wait=600, build_index_if_missing=True)

Retrieves the closest data points to the data points in the input file. If the index is missing, by default the method will try to build it.

Parameters:
  • featurelist_id (str) – Id of the featurelist to search for the closest data points

  • query_file_path (str) – Path to the file containing the data points to search the closest data points for.

  • number_of_data (int or None) – Number of results to search for. If no value specified, the default is 10.

  • max_wait (int) – Number of seconds to wait for the result. Default is 600.

  • build_index_if_missing (bool) – Should the index be created if it is missing. If False is specified and the index is missing, an exception is thrown. Default True.

Returns:

df – Dataframe with query result

Return type:

pd.DataFrame

build_index(max_wait=600)

Builds data matching index and waits for its completion.

Parameters:

max_wait (int or None) – Seconds to wait for the build index operation to complete. Default is 600. When 0 or None is passed, the method returns without waiting for the build index operation to complete.

Return type:

None

build_index_for_featurelist(featurelist_id, max_wait=600)

Builds data matching index for featurelist and waits for its completion.

Parameters:
  • featurelist_id (str) – Id of the featurelist to build the index for

  • max_wait (int or None) – Seconds to wait for the build index operation to complete. Default is 600. When 0 or None is passed, the method returns without waiting for the build index operation to complete.

Return type:

None

build_index_for_model(model_id, max_wait=600)

Builds the data matching index for the model and waits for its completion.

Parameters:
  • model_id (str) – Id of the model to build index for

  • max_wait (int or None) – Seconds to wait for the build index operation to complete. Default is 600. When 0 or None is passed, the method returns without waiting for the build index operation to complete.

Return type:

None

list()

Lists all data matching queries for the project. Results are sorted in descending order, from latest to oldest.

Return type:

List[DataMatchingQuery]

class datarobot._experimental.models.data_matching.DataMatchingQuery

Data Matching Query object.

Represents a single query for the closest data points. Once the related query job is completed, its result can be retrieved and saved as a CSV file in the specified location.

classmethod list(project_id)

Retrieves the list of queries.

Parameters:

project_id (str) – Project ID to retrieve data matching queries for

Return type:

List[DataMatchingQuery]

save_result(file_path)

Downloads the query result and saves it in file_path location.

Parameters:

file_path (str) – Path location where to save the query result

Return type:

None

get_result()

Returns the query result as dataframe.

Returns:

df – Dataframe with the query result.

Return type:

DataFrame
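
Examples

A minimal sketch of listing past queries and retrieving the most recent result; the project ID and output path are placeholders.

from datarobot._experimental.models.data_matching import DataMatchingQuery

queries = DataMatchingQuery.list('<project_id>')
if queries:
    # Queries are sorted from latest to oldest, so index 0 is the most recent.
    queries[0].save_result('closest_data.csv')
    df = queries[0].get_result()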

class datarobot._experimental.models.model_lineage.FeatureCountByType

Contains information about a feature type and how many features in the dataset are of this type.

Variables:
  • feature_type (str) – The feature type grouped in this count.

  • count (int) – The number of features of this type.

class datarobot._experimental.models.model_lineage.User

Contains information about a user.

Variables:
  • Id (str) – The ID of the user.

  • full_name (Optional[str]) – Full name of the user.

  • email (Optional[str]) – Email address of the user.

  • user_hash (Optional[str]) – User’s gravatar hash.

  • user_name (Optional[str]) – Username of the user.

class datarobot._experimental.models.model_lineage.ReferencedInUseCase

Contains information about the reference of a dataset in a Use Case.

Variables:
  • added_to_use_case_by (User) – User who added the dataset to the Use Case.

  • added_to_use_case_at (datetime.datetime) – Time when the dataset was added to the Use Case.

class datarobot._experimental.models.model_lineage.DatasetInfo

Contains information about the dataset.

Variables:
  • dataset_name (str) – Dataset name.

  • dataset_version_id (str) – Dataset version Id.

  • dataset_id (str) – Dataset Id.

  • number_of_rows (int) – Number of rows in the dataset.

  • file_size (int) – Size of the dataset as a CSV file, in bytes.

  • number_of_features (int) – Number of features in the dataset.

  • number_of_feature_by_type (List[FeatureCountByType]) – Number of features in the dataset, grouped by feature type.

  • referenced_in_use_case (Optional[ReferencedInUseCase]) – Information about the reference of this dataset in the Use Case. This information will only be present if the use_case_id was passed to ModelLineage.get.

class datarobot._experimental.models.model_lineage.FeatureWithMissingValues

Contains information about the number of missing values for one feature.

Variables:
  • feature_name (str) – Name of the feature.

  • number_of_missing_values (int) – Number of missing values for this feature.

class datarobot._experimental.models.model_lineage.FeaturelistInfo

Contains information about the featurelist.

Variables:
  • featurelist_name (str) – Featurelist name.

  • featurelist_id (str) – Featurelist Id.

  • number_of_features (int) – Number of features in the featurelist.

  • number_of_feature_by_type (List[FeatureCountByType]) – Number of features in the featurelist, grouped by feature type.

  • number_of_features_with_missing_values (int) – Number of features in the featurelist with at least one missing value.

  • number_of_missing_values (int) – Number of missing values across all features of the featurelist.

  • features_with_most_missing_values (List[FeatureWithMissingValues]) – List of features with the most missing values.

  • description (str) – Description of the featurelist.

class datarobot._experimental.models.model_lineage.TargetInfo

Contains information about the target.

Variables:
  • name (str) – Name of the target feature.

  • target_type (str) – Project type resulting from selected target.

  • positive_class_label (Optional[Union[str, int, float]]) – Positive class label. For every project type except Binary Classification, this value will be null.

  • mean (Optional[float]) – Mean of the target. This field will only be available for Binary Classification, Regression, and Min Inflated projects.

class datarobot._experimental.models.model_lineage.PartitionInfo

Contains information about project partitioning.

Variables:
  • validation_type (str) – Either CV for cross-validation or TVH for train-validation-holdout split.

  • cv_method (str) – Partitioning method used.

  • holdout_pct (float) – Percentage of the dataset reserved for the holdout set.

  • datetime_col (Optional[str]) – If a date partition column was used, the name of the column. Note that datetime_col applies to an old partitioning method no longer supported for new projects, as of API version v2.0.

  • datetime_partition_column (Optional[str]) – If a datetime partition column was used, the name of the column.

  • validation_pct (Optional[float]) – If train-validation-holdout split was used, the percentage of the dataset used for the validation set.

  • reps (Optional[float]) – If cross validation was used, the number of folds to use.

  • cv_holdout_level (Optional[Union[str, float, int]]) – If a user partition column was used with cross validation, the value assigned to the holdout set.

  • holdout_level (Optional[Union[str, float, int]]) – If a user partition column was used with train-validation-holdout split, the value assigned to the holdout set.

  • user_partition_col (Optional[str]) – If a user partition column was used, the name of the column.

  • training_level (Optional[Union[str, float, int]]) – If a user partition column was used with train-validation-holdout split, the value assigned to the training set.

  • partition_key_cols (Optional[List[str]]) – A list containing a single string - the name of the group partition column.

  • validation_level (Optional[Union[str, float, int]]) – If a user partition column was used with train-validation-holdout split, the value assigned to the validation set.

  • use_time_series (Optional[bool]) – A boolean value indicating whether a time series project was created by using datetime partitioning. Otherwise, datetime partitioning created an OTV project.

class datarobot._experimental.models.model_lineage.ProjectInfo

Contains information about the project.

Variables:
  • project_name (str) – Name of the project.

  • project_id (str) – Project Id.

  • partition (PartitionInfo) – Partitioning settings of the project.

  • metric (str) – Project metric used to select the best-performing models.

  • created_by (User) – User who created the project.

  • created_at (Optional[datetime.datetime]) – Time when the project was created.

  • target (Optional[TargetInfo]) – Information about the target.

class datarobot._experimental.models.model_lineage.ModelInfo

Contains information about the model.

Variables:
  • blueprint_tasks (List[str]) – Tasks that make up the blueprint.

  • blueprint_id (str) – Blueprint Id.

  • model_type (str) – Model type.

  • sample_size (Optional[int]) – Number of rows this model was trained on.

  • sample_percentage (Optional[float]) – Percentage of the dataset the model was trained on.

  • milliseconds_to_predict_1000_rows (Optional[float]) – Estimate of how many milliseconds it takes to predict 1,000 rows. The estimate is based on the time it took to predict the holdout set.

  • serialized_blueprint_file_size (Optional[int]) – Size of the serialized blueprint, in bytes.

class datarobot._experimental.models.model_lineage.ModelLineage

Contains information about the lineage of a model.

Variables:
  • dataset (DatasetInfo) – Information about the dataset this model was created with.

  • featurelist (FeaturelistInfo) – Information about the featurelist used to train this model.

  • project (ProjectInfo) – Information about the project this model was created in.

  • model (ModelInfo) – Information about the model itself.

classmethod get(model_id, use_case_id=None)

Retrieve lineage information about a trained model. If you pass the optional use_case_id parameter, this class will contain additional information.

Parameters:
  • model_id (str) – Model Id.

  • use_case_id (Optional[str]) – Use Case Id.

Return type:

ModelLineage
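
Examples

A minimal sketch; the model and Use Case IDs are placeholders.

from datarobot._experimental.models.model_lineage import ModelLineage

lineage = ModelLineage.get('<model_id>', use_case_id='<use_case_id>')
print(lineage.project.project_name)
print(lineage.featurelist.number_of_features)
# referenced_in_use_case is only populated because use_case_id was passed above.
print(lineage.dataset.referenced_in_use_case.added_to_use_case_at)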

class datarobot._experimental.models.incremental_learning.IncrementalLearningItem
class datarobot._experimental.models.incremental_learning.IncrementalLearningMetadata

Incremental learning metadata for an incremental model.

Added in version v3.4.0.

Variables:
  • project_id (str) – The project ID.

  • model_id (str) – The model ID.

  • user_id (str) – The ID of the user who started incremental learning.

  • featurelist_id (str) – The ID of the featurelist the model is using.

  • status (str) – The status of incremental training. One of datarobot._experimental.models.enums.IncrementalLearningStatus.

  • items (List[IncrementalLearningItemDoc]) – A list of incremental learning items in the sequential order of chunks. See the incremental item details in Notes.

  • sample_pct (float) – The sample size, as a percentage (1 to 100) of the dataset, to use in training.

  • training_row_count (int) – The number of rows used to train a model.

  • score (float) – The validation score of the model.

  • metric (str) – The name of the scoring metric.

  • early_stopping_rounds (int) – The number of chunks with no observed improvement that triggers the early stopping mechanism.

  • total_number_of_chunks (int) – The total number of chunks.

  • model_number (int) – The number of the model in the project.

Notes

Incremental item is a dict containing the following:

  • chunk_index: int

    The incremental learning order in which chunks are trained.

  • status: str

The status of training the current chunk. One of datarobot._experimental.models.enums.IncrementalLearningItemStatus.

  • model_id: str

    The ID of the model associated with the current item (chunk).

  • parent_model_id: str

    The ID of the model based on which the current item (chunk) is trained.

  • data_stage_id: str

    The ID of the data stage.

  • sample_pct: float

    The cumulative percentage of the base dataset size used for training the model.

  • training_row_count: int

    The number of rows used to train a model.

  • score: float

    The validation score of the current model

class datarobot._experimental.models.chunking_service.ChunkStorage

The chunk storage location for the data chunks.

Variables:
  • storage_reference_id (str) – The ID of the storage entity.

  • chunk_storage_type (str) – The type of the chunk storage.

  • version_id (str) – The catalog version ID. This will only be used if the storage type is “AI Catalog”.

class datarobot._experimental.models.chunking_service.Chunk

Data chunk object that holds metadata about a chunk.

Variables:
  • id (str) – The ID of the chunk entity.

  • chunk_definition_id (str) – The ID of the dataset chunk definition the chunk belongs to.

  • limit (int) – The number of rows in the chunk.

  • offset (int) – The offset in the dataset to create the chunk.

  • chunk_index (str) – The index of the chunk if chunks are divided uniformly. Otherwise, it is None.

  • data_source_id (str) – The ID of the data request used to create the chunk.

  • chunk_storage (ChunkStorage) – A list of storage locations where the chunk is stored.

get_chunk_storage_id(storage_type)

Get storage location ID for the chunk.

Parameters:

storage_type (ChunkStorageType) – The storage type where the chunk is stored.

Returns:

storage_reference_id – An ID that references the storage location for the chunk.

Return type:

str

get_chunk_storage_version_id(storage_type)

Get storage version ID for the chunk.

Parameters:

storage_type (ChunkStorageType) – The storage type where the chunk is stored.

Returns:

storage_reference_id – A catalog version ID associated with the AI Catalog dataset ID.

Return type:

str
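
Examples

A minimal sketch of looking up where a chunk is stored; the IDs are placeholders and the import path for ChunkStorageType is an assumption.

from datarobot._experimental.models.chunking_service import DatasetChunkDefinition
from datarobot._experimental.models.enums import ChunkStorageType  # assumed location

chunk = DatasetChunkDefinition.get_chunk('<dataset_chunk_definition_id>', '<chunk_id>')

# Data stage storage has a reference ID; AI Catalog storage also has a version ID.
storage_id = chunk.get_chunk_storage_id(ChunkStorageType.DATASTAGE)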

class datarobot._experimental.models.chunking_service.DatasourceDefinition

Data source definition that holds information about the data source for API responses. Do not create DatasourceDefinition objects directly; use DatasourceAICatalogInfo or DatasourceDataWarehouseInfo instead.

Variables:
  • id (str) – The ID of the data source definition.

  • data_store_id (str) – The ID of the data store.

  • credentials_id (str) – The ID of the credentials.

  • table (str) – The data source table name.

  • schema (str) – The data source schema name.

  • catalog (str) – The database or catalog name.

  • storage_origin (str) – The origin data source or data warehouse (e.g., Snowflake, BigQuery).

  • data_source_id (str) – The ID of the data request used to generate sampling and metadata.

  • total_rows (str) – The total number of rows in the dataset.

  • source_size (str) – The size of the dataset.

  • estimated_size_per_row (str) – The estimated size per row.

  • columns (str) – The list of column names in the dataset.

  • order_by_columns (List[str]) – A list of columns used to sort the dataset.

  • is_descending_order (bool) – The sort order of the data. Defaults to False, ordering from smallest to largest.

  • select_columns (List[str]) – A list of columns to select from the dataset.

  • datetime_partition_column (str) – The datetime partition column name used in OTV projects.

  • validation_pct (float) – The percentage threshold between 0.1 and 1.0 for the first chunk validation.

  • validation_limit_pct (float) – The percentage threshold between 0.1 and 1.0 for the validation kept.

  • validation_start_date (datetime.datetime) – The start date for validation.

  • validation_end_date (datetime.datetime) – The end date for validation.

  • training_end_date (datetime.datetime) – The end date for training.

  • latest_timestamp (datetime.datetime) – The latest timestamp.

  • earliest_timestamp (datetime.datetime) – The earliest timestamp.

class datarobot._experimental.models.chunking_service.DatasourceDataWarehouseInfo

Data source information used at creation time with dataset chunk definition. Data warehouses supported: Snowflake, BigQuery, Databricks

Variables:
  • name (str) – The optional custom name of the data source.

  • table (str) – The data source table name or AI Catalog dataset name.

  • storage_origin (str) – The origin data source or data warehouse (e.g., Snowflake, BigQuery).

  • data_store_id (str) – The ID of the data store.

  • credentials_id (str) – The ID of the credentials.

  • schema (str) – The data source schema name.

  • catalog (str) – The database or catalog name.

  • data_source_id (str) – The ID of the data request used to generate sampling and metadata.

  • order_by_columns (List[str]) – A list of columns used to sort the dataset.

  • is_descending_order (bool) – The sort order of the data. Defaults to False, ordering from smallest to largest.

  • select_columns (List[str]) – A list of columns to select from the dataset.

  • datetime_partition_column (str) – The datetime partition column name used in OTV projects.

  • validation_pct (float) – The percentage threshold between 0.1 and 1.0 for the first chunk validation.

  • validation_limit_pct (float) – The percentage threshold between 0.1 and 1.0 for the validation kept.

  • validation_start_date (datetime.datetime) – The start date for validation.

  • validation_end_date (datetime.datetime) – The end date for validation.

  • training_end_date (datetime.datetime) – The end date for training.

  • latest_timestamp (datetime.datetime) – The latest timestamp.

  • earliest_timestamp (datetime.datetime) – The earliest timestamp.

class datarobot._experimental.models.chunking_service.DatasourceAICatalogInfo

AI Catalog data source information used at creation time with dataset chunk definition.

Variables:
  • name (str) – The optional custom name of the data source.

  • table (str) – The data source table name or AI Catalog dataset name.

  • storage_origin (str) – The origin data source, always AI Catalog type.

  • catalog_id (str) – The ID of the AI Catalog dataset.

  • catalog_version_id (str) – The ID of the AI Catalog dataset version.

  • order_by_columns (List[str]) – A list of columns used to sort the dataset.

  • is_descending_order (bool) – The sort order of the data. Defaults to False, ordering from smallest to largest.

  • select_columns (List[str]) – A list of columns to select from the dataset.

  • datetime_partition_column (str) – The datetime partition column name used in OTV projects.

  • validation_pct (float) – The percentage threshold between 0.1 and 1.0 for the first chunk validation.

  • validation_limit_pct (float) – The percentage threshold between 0.1 and 1.0 for the validation kept.

  • validation_start_date (datetime.datetime) – The start date for validation.

  • validation_end_date (datetime.datetime) – The end date for validation.

  • training_end_date (datetime.datetime) – The end date for training.

  • latest_timestamp (datetime.datetime) – The latest timestamp.

  • earliest_timestamp (datetime.datetime) – The earliest timestamp.

class datarobot._experimental.models.chunking_service.DatasetChunkDefinition

Dataset chunking definition that holds information about how to chunk the dataset.

Variables:
  • id (str) – The ID of the dataset chunk definition.

  • user_id (str) – The ID of the user who created the definition.

  • name (str) – The name of the dataset chunk definition.

  • project_starter_chunk_size (int) – The size, in bytes, of the project starter chunk.

  • user_chunk_size (int) – Chunk size in bytes.

  • datasource_definition_id (str) – The data source definition ID associated with the dataset chunk definition.

  • chunking_type (ChunkingType) –

    The type of chunk creation from the dataset. All possible chunking types can be found in the ChunkingType enum, which can be imported from datarobot._experimental.models.enums. Types include:

    • INCREMENTAL_LEARNING for non-time aware projects that use a chunk index to create chunks.

    • INCREMENTAL_LEARNING_OTV for OTV projects that use a chunk index to create chunks.

    • SLICED_OFFSET_LIMIT for any dataset in which user provides offset and limit to create chunks.

    SLICED_OFFSET_LIMIT has no index-based chunks, so the create_by_index() method is not supported.

classmethod get(dataset_chunk_definition_id)

Retrieve a specific dataset chunk definition metadata.

Parameters:

dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.

Returns:

dataset_chunk_definition – The queried instance.

Return type:

DatasetChunkDefinition

classmethod list(limit=50, offset=0)

Retrieves a list of dataset chunk definitions

Parameters:
  • limit (int) – The maximum number of objects to return. Default is 50.

  • offset (int) – The starting offset of the results. Default is 0.

Returns:

dataset_chunk_definitions – The list of dataset chunk definitions.

Return type:

List[DatasetChunkDefinition]

classmethod create(name, project_starter_chunk_size, user_chunk_size, datasource_info, chunking_type=ChunkingType.INCREMENTAL_LEARNING)

Create a dataset chunk definition. Required for both index-based and custom chunks.

In order to create a dataset chunk definition, you must first:

  • Create a data connection to the target data source via dr.DataStore.create()

  • Create credentials that must be attached to the data connection via dr.Credential.create()

If you have existing data connections and credentials:

  • Retrieve the data store ID by the canonical name via:

    • [ds for ds in dr.DataStore.list() if ds.canonical_name == <name>][0].id

  • Retrieve the credential ID by the name via:

    • [cr for cr in dr.Credential.list() if cr.name == <name>][0].id

You must create the required ‘datasource_info’ object with the datasource information that corresponds to your use case:

  • DatasourceAICatalogInfo for AI catalog datasets.

  • DatasourceDataWarehouseInfo for Snowflake, BigQuery, or other data warehouse.

Parameters:
  • name (str) – The name of the dataset chunk definition.

  • project_starter_chunk_size (int) – The size, in bytes, of the first chunk. Used to start a DataRobot project.

  • user_chunk_size (int) – The size, in bytes, of the user-defined incremental chunk.

  • datasource_info (Union[DatasourceDataWarehouseInfo, DatasourceAICatalogInfo]) – The object that contains the information of the data source.

  • chunking_type (ChunkingType) –

    The type of chunk creation from the dataset. All possible chunking types can be found in the ChunkingType enum, which can be imported from datarobot._experimental.models.enums. Types include:

    • INCREMENTAL_LEARNING for non-time aware projects that use a chunk index to create chunks.

    • INCREMENTAL_LEARNING_OTV for OTV projects that use a chunk index to create chunks.

    • SLICED_OFFSET_LIMIT for any dataset in which user provides offset and limit to create chunks.

    SLICED_OFFSET_LIMIT has no index-based chunks, so the create_by_index() method is not supported. The default type is ChunkingType.INCREMENTAL_LEARNING.

Returns:

dataset_chunk_definition – An instance of a created dataset chunk definition.

Return type:

DatasetChunkDefinition
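
Examples

A minimal sketch of creating a dataset chunk definition against a data warehouse, following the lookup pattern described above. The connection and credential names, table, schema, catalog, and chunk sizes are placeholders, and passing the DatasourceDataWarehouseInfo attributes as constructor keyword arguments is an assumption.

import datarobot as dr
from datarobot._experimental.models.chunking_service import (
    DatasetChunkDefinition,
    DatasourceDataWarehouseInfo,
)

data_store_id = [ds for ds in dr.DataStore.list() if ds.canonical_name == '<name>'][0].id
credentials_id = [cr for cr in dr.Credential.list() if cr.name == '<name>'][0].id

datasource_info = DatasourceDataWarehouseInfo(  # assumed keyword arguments
    table='<table>',
    storage_origin='Snowflake',
    data_store_id=data_store_id,
    credentials_id=credentials_id,
    schema='<schema>',
    catalog='<catalog>',
)

chunk_definition = DatasetChunkDefinition.create(
    name='my chunk definition',
    project_starter_chunk_size=100 * 1024 * 1024,  # 100 MB starter chunk
    user_chunk_size=10 * 1024 * 1024,              # 10 MB incremental chunks
    datasource_info=datasource_info,
)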

classmethod get_datasource_definition(dataset_chunk_definition_id)

Retrieves the data source definition associated with a dataset chunk definition.

Parameters:

dataset_chunk_definition_id (str) – id of the dataset chunk definition

Returns:

datasource_definition – an instance of created datasource definition

Return type:

DatasourceDefinition

classmethod get_chunk(dataset_chunk_definition_id, chunk_id)

Retrieves a specific data chunk associated with a dataset chunk definition

Parameters:
  • dataset_chunk_definition_id (str) – id of the dataset chunk definition

  • chunk_id (str) – id of the chunk

Returns:

chunk – an instance of created chunk

Return type:

Chunk

classmethod list_chunks(dataset_chunk_definition_id)

Retrieves all data chunks associated with a dataset chunk definition

Parameters:

dataset_chunk_definition_id (str) – id of the dataset chunk definition

Returns:

chunks – a list of chunks

Return type:

List[Chunk]

analyze_dataset(max_wait_time=600)

Analyzes the data source to retrieve and compute metadata about the dataset.

Depending on the size of the dataset, adding order_by_columns to the dataset chunking definition will increase the execution time to create the data chunk. Set max_wait_time to an appropriate wait time.

Parameters:

max_wait_time (int) – maximum time to wait for completion

Returns:

datasource_definition – an instance of created datasource definition

Return type:

DatasourceDefinition

create_chunk(limit, offset=0, storage_type=ChunkStorageType.DATASTAGE, max_wait_time=600)

Creates a data chunk using the limit and offset. By default, the data chunk is stored in data stages.

Depending on the size of the dataset, adding order_by_columns to the dataset chunking definition will increase the execution time to retrieve or create the data chunk. Set max_wait_time to an appropriate wait time.

Parameters:
  • limit (int) – The maximum number of rows.

  • offset (int) – The offset into the dataset (where reading begins).

  • storage_type (ChunkStorageType) – The storage location of the chunk.

  • max_wait_time (int) – maximum time to wait for completion

Returns:

chunk – An instance of a created or updated chunk.

Return type:

Chunk

create_chunk_by_index(index, storage_type=ChunkStorageType.DATASTAGE, max_wait_time=600)

Creates a data chunk using the chunk index. By default, the data chunk is stored in data stages.

Depending on the size of the dataset, adding order_by_columns to the dataset chunking definition will increase the execution time to retrieve or create the data chunk. Set max_wait_time to an appropriate wait time.

Parameters:
  • index (int) – The index of the chunk.

  • storage_type (ChunkStorageType) – The storage location of the chunk.

  • max_wait_time (int) – maximum time to wait for completion

Returns:

chunk – An instance of a created or updated chunk.

Return type:

Chunk
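
Examples

A minimal sketch of analyzing the data source and then creating index-based chunks; the definition ID is a placeholder.

from datarobot._experimental.models.chunking_service import DatasetChunkDefinition

chunk_definition = DatasetChunkDefinition.get('<dataset_chunk_definition_id>')

# Compute dataset metadata before creating chunks.
datasource_definition = chunk_definition.analyze_dataset(max_wait_time=600)

# Create the first two index-based chunks (stored in data stages by default).
first_chunk = chunk_definition.create_chunk_by_index(0)
second_chunk = chunk_definition.create_chunk_by_index(1)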

classmethod patch_validation_dates(dataset_chunk_definition_id, validation_start_date, validation_end_date)

Updates the data source definition validation dates associated with a dataset chunk definition. In order to set the validation dates appropriately, both start and end dates should be specified. This method can only be used for INCREMENTAL_LEARNING_OTV dataset chunk definitions and its associated datasource definition.

Parameters:
  • dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.

  • validation_start_date (datetime.datetime) – The start date of validation scoring data. Internally converted to format ‘%Y-%m-%d %H:%M:%S’, the timezone defaults to UTC.

  • validation_end_date (datetime.datetime) – The end date of validation scoring data. Internally converted to format ‘%Y-%m-%d %H:%M:%S’, the timezone defaults to UTC.

Returns:

datasource_definition – An instance of created datasource definition.

Return type:

DatasourceDefinition

class datarobot._experimental.models.chunking_service_v2.DatasetProps

The dataset props for a catalog dataset.

Variables:
  • dataset_id (str) – The ID of the AI Catalog dataset.

  • dataset_version_id (str) – The ID of the AI Catalog dataset version.

class datarobot._experimental.models.chunking_service_v2.DatasetInfo

The dataset information.

Variables:
  • total_rows (str) – The total number of rows in the dataset.

  • source_size (str) – The size of the dataset.

  • estimated_size_per_row (str) – The estimated size per row.

  • columns (str) – The list of column names in the dataset.

  • dialect (str) – The sql dialect associated with the dataset (e.g., Snowflake, BigQuery, Spark).

  • data_store_id (str) – The ID of the data store.

  • data_source_id (str) – The ID of the data request used to generate sampling and metadata.

class datarobot._experimental.models.chunking_service_v2.DynamicDatasetProps

The dataset props for a dynamic dataset.

Variables:

credentials_id (str) – The ID of the credentials.

class datarobot._experimental.models.chunking_service_v2.DatasetDefinition

Dataset definition that holds information of dataset for API responses.

Variables:
  • id (str) – The ID of the data source definition.

  • creator_user_id (str) – The ID of the user.

  • dataset_props (DatasetProps) – The properties of the dataset in catalog.

  • dynamic_dataset_props (DynamicDatasetProps) – The properties of the dynamic dataset.

  • dataset_info (DatasetInfo) – The information about the dataset.

  • name (str) – The optional custom name of the dataset definition.

classmethod from_data(data)

Properly convert composition classes.

Return type:

DatasetDefinition

classmethod create(dataset_id, dataset_version_id=None, name=None, credentials_id=None)

Create a dataset definition.

To create a dataset definition, you must first have an existing dataset in the Data Registry. For example, a dataset can be uploaded from a file using dr.Dataset.create_from_file.

If you have an existing dataset in the Data Registry:

  • Retrieve the dataset ID by the canonical name via:

    • [cr for cr in dr.Dataset.list() if cr.name == <name>][0].id

  • Retrieve the dataset version ID by the name via:

    • [cr for cr in dr.Dataset.list() if cr.name == <name>][0].version_id

Parameters:
  • dataset_id (str) – The ID of the AI Catalog dataset.

  • dataset_version_id (str) – The optional ID of the AI Catalog dataset version.

  • name (str) – The optional custom name of the dataset definition.

  • credentials_id (str) – The optional ID of the credentials to access the data store.

Returns:

dataset_definition – An instance of a created dataset definition.

Return type:

DatasetDefinition
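
Examples

A minimal sketch following the Data Registry lookup described above; the dataset name is a placeholder.

import datarobot as dr
from datarobot._experimental.models.chunking_service_v2 import DatasetDefinition

# Look up an existing Data Registry dataset by name.
dataset = [ds for ds in dr.Dataset.list() if ds.name == '<name>'][0]

dataset_definition = DatasetDefinition.create(
    dataset_id=dataset.id,
    dataset_version_id=dataset.version_id,
    name='my dataset definition',
)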

classmethod get(dataset_definition_id)

Retrieve a specific dataset definition metadata.

Parameters:

dataset_definition_id (str) – The ID of the dataset definition.

Returns:

dataset_definition – The queried instance.

Return type:

DatasetDefinition

classmethod delete(dataset_definition_id)

Delete a specific dataset definition

Parameters:

dataset_definition_id (str) – The ID of the dataset definition.

Return type:

None

classmethod list()

List all dataset definitions

Return type:

A list of DatasetDefinition

classmethod analyze(dataset_definition_id, max_wait=600)

Analyze a specific dataset definition

Parameters:
  • dataset_definition_id (str) – The ID of the dataset definition.

  • max_wait (Optional[int]) – Time in seconds after which analyze is considered unsuccessful

Return type:

None

class datarobot._experimental.models.chunking_service_v2.RowsChunkDefinition

The rows chunk information.

Variables:
  • order_by_columns (List[str]) – List of the sorting column names.

  • is_descending_order (bool) – The sorting order. Defaults to False, ordering from smallest to largest.

  • target_column (str) – The target column.

  • target_class (str) – For a binary target, one of the possible values. For a zero-inflated target, this will be '0'.

  • user_group_column (str) – The user group column.

  • datetime_partition_column (str) – The datetime partition column name used in OTV projects.

  • otv_validation_start_date (datetime.datetime) – The start date for the validation set.

  • otv_validation_end_date (datetime.datetime) – The end date for the validation set.

  • otv_training_end_date (datetime.datetime) – The end date for the training set.

  • otv_latest_timestamp (datetime.datetime) – The latest timestamp; this field is auto-generated.

  • otv_earliest_timestamp (datetime.datetime) – The earliest timestamp; this field is auto-generated.

  • otv_validation_downsampling_pct (float) – The percentage of the validation set to downsample; this field is auto-generated.

class datarobot._experimental.models.chunking_service_v2.FeaturesChunkDefinition

The features chunk information.

class datarobot._experimental.models.chunking_service_v2.ChunkDefinitionStats

The chunk stats information.

Variables:
  • expected_chunk_size (int) – The expected chunk size; this field is auto-generated.

  • number_of_rows_per_chunk (int) – The number of rows per chunk; this field is auto-generated.

  • total_number_of_chunks (int) – The total number of chunks; this field is auto-generated.

class datarobot._experimental.models.chunking_service_v2.ChunkDefinition

The chunk information.

Variables:
  • id (str) – The ID of the chunk entity.

  • dataset_definition_id (str) – The ID of the dataset definition.

  • name (str) – The name of the chunk entity.

  • is_readonly (bool) – The read only flag.

  • partition_method (str) – The partition method used to create chunks, either ‘random’, ‘stratified’, or ‘date’.

  • chunking_strategy_type (str) – The chunking strategy type, either ‘features’ or ‘rows’.

  • chunk_definition_stats (ChunkDefinitionStats) – The chunk stats information.

  • rows_chunk_definition (RowsChunkDefinition) – The rows chunk information.

  • features_chunk_definition (FeaturesChunkDefinition) – The features chunk information.

classmethod from_data(data)

Properly convert composition classes.

Return type:

ChunkDefinition

classmethod create(dataset_definition_id, name=None, partition_method=ChunkingPartitionMethod.RANDOM, chunking_strategy_type=ChunkingStrategy.ROWS, order_by_columns=None, is_descending_order=False, target_column=None, target_class=None, user_group_column=None, datetime_partition_column=None, otv_validation_start_date=None, otv_validation_end_date=None, otv_training_end_date=None)

Create a chunk definition.

Parameters:
  • dataset_definition_id (str) – The ID of the dataset definition.

  • name (str) – The optional custom name of the chunk definition.

  • partition_method (str) – The partition method used to create chunks, either ‘random’, ‘stratified’, or ‘date’.

  • chunking_strategy_type (str) – The chunking strategy type, either ‘features’ or ‘rows’.

  • order_by_columns (List[str]) – List of the sorting column names.

  • is_descending_order (bool) – The sorting order. Defaults to False, ordering from smallest to largest.

  • target_column (str) – The target column.

  • target_class (str) – For a binary target, one of the possible values. For a zero-inflated target, this will be '0'.

  • user_group_column (str) – The user group column.

  • datetime_partition_column (str) – The datetime partition column name used in OTV projects.

  • otv_validation_start_date (datetime.datetime) – The start date for the validation set.

  • otv_validation_end_date (datetime.datetime) – The end date for the validation set.

  • otv_training_end_date (datetime.datetime) – The end date for the training set.

Returns:

chunk_definition – An instance of a created chunk definition.

Return type:

ChunkDefinition
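
Examples

A minimal sketch of a rows-based, random-partition chunk definition (the defaults); the dataset definition ID and column names are placeholders.

from datarobot._experimental.models.chunking_service_v2 import ChunkDefinition

chunk_definition = ChunkDefinition.create(
    dataset_definition_id='<dataset_definition_id>',
    name='my chunk definition',
    order_by_columns=['<order_column>'],
    target_column='<target_column>',
)

# Compute chunk statistics for the new definition.
ChunkDefinition.analyze('<dataset_definition_id>', chunk_definition.id, max_wait=600)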

classmethod get(dataset_definition_id, chunk_definition_id)

Retrieve a specific chunk definition metadata.

Parameters:
  • dataset_definition_id (str) – The ID of the dataset definition.

  • chunk_definition_id (str) – The ID of the chunk definition.

Returns:

chunk_definition – The queried instance.

Return type:

ChunkDefinition

classmethod delete(dataset_definition_id, chunk_definition_id)

Delete a specific chunk definition

Parameters:
  • dataset_definition_id (str) – The ID of the dataset definition.

  • chunk_definition_id (str) – The ID of the chunk definition.

Return type:

None

classmethod list(dataset_definition_id)

List all chunk definitions

Parameters:

dataset_definition_id (str) – The ID of the dataset definition.

Return type:

A list of ChunkDefinition

classmethod analyze(dataset_definition_id, chunk_definition_id, max_wait=600)

Analyze a specific chunk definition

Parameters:
  • dataset_definition_id (str) – The ID of the dataset definition.

  • chunk_definition_id (str) – The ID of the chunk definition

  • max_wait (Optional[int]) – Time in seconds after which analyze is considered unsuccessful

Return type:

None

classmethod update(chunk_definition_id, dataset_definition_id, name=None, order_by_columns=None, is_descending_order=None, target_column=None, target_class=None, user_group_column=None, datetime_partition_column=None, otv_validation_start_date=None, otv_validation_end_date=None, otv_training_end_date=None, force_update=False)

Update a chunk definition.

Parameters:
  • chunk_definition_id (str) – The ID of the chunk definition.

  • dataset_definition_id (str) – The ID of the dataset definition.

  • name (str) – The optional custom name of the chunk definition.

  • order_by_columns (List[str]) – List of the sorting column names.

  • is_descending_order (bool) – The sorting order. Defaults to False, ordering from smallest to largest.

  • target_column (str) – The target column.

  • target_class (str) – For a binary target, one of the possible values. For a zero-inflated target, this will be '0'.

  • user_group_column (str) – The user group column.

  • datetime_partition_column (str) – The datetime partition column name used in OTV projects.

  • otv_validation_start_date (datetime.datetime) – The start date for the validation set.

  • otv_validation_end_date (datetime.datetime) – The end date for the validation set.

  • otv_training_end_date (datetime.datetime) – The end date for the training set.

  • force_update (bool) – If True, the update will be forced in some cases. For example, update after analysis is done.

Returns:

chunk_definition – The updated instance of the chunk definition.

Return type:

ChunkDefinition