Experimental APIs

These features all require special permissions to be activated on your DataRobot account, and will not work otherwise. If you want to test a feature, please ask your DataRobot CFDS or account manager about enrolling in our preview program.

Classes in this list should be considered “experimental”, not fully released, and likely to change in future releases. Do not use them for production systems or other mission-critical uses.

datarobot._experimental.models.data_store.DataStore.get_spark_session(db_token)

Returns a Spark session

Parameters:

db_token (str) – A personal access token.

Returns:

A Spark session initialized with connection parameters taken from the DataStore and the provided db_token.

Return type:

SparkSession

Examples

>>> from datarobot._experimental.models.data_store import DataStore
>>> from datarobot.enums import DataStoreListTypes
>>> data_stores = DataStore.list(typ=DataStoreListTypes.DR_DATABASE_V1)
>>> data_stores
[DataStore('my_databricks_store_1')]
>>> db_connection = data_stores[0].get_spark_session('<token>')
>>> db_connection
<pyspark.sql.connect.session.SparkSession at 0x7f386068fbb0>
>>> df = db_connection.read.table("samples.nyctaxi.trips")
>>> df.show()
class datarobot._experimental.models.data_store.DataStore

A data store. Represents a database.

Variables:
  • id (str) – The ID of the data store.

  • data_store_type (str) – The type of data store.

  • canonical_name (str) – The user-friendly name of the data store.

  • creator (str) – The ID of the user who created the data store.

  • updated (datetime.datetime) – The time of the last update.

  • params (DataStoreParameters) – A list specifying data store parameters.

  • role (str) – Your access role for this data store.

  • driver_class_type (str) – The type of driver class for this data store.

class datarobot._experimental.models.retraining.RetrainingUseCase

Retraining use case.

Variables:
  • id (str) – The ID of the use case.

  • name (str) – The name of the use case.

class datarobot._experimental.models.retraining.RetrainingPolicy

Retraining Policy.

Variables:
  • policy_id (str) – ID of the retraining policy

  • name (str) – Name of the retraining policy

  • description (str) – Description of the retraining policy

  • use_case (Optional[dict]) – Use case the retraining policy is associated with

classmethod list(deployment_id)

Lists all retraining policies associated with a deployment

Parameters:

deployment_id (str) – Id of the deployment

Returns:

policies – List of retraining policies associated with a deployment

Return type:

list

Examples

from datarobot import Deployment
from datarobot._experimental.models.retraining import RetrainingPolicy
deployment = Deployment.get(deployment_id='620ed0e37b6ce03244f19631')
RetrainingPolicy.list(deployment.id)
>>> [RetrainingPolicy('620ed248bb0a1f5889eb6aa7'), RetrainingPolicy('624f68be8828ed81bf487d8d')]
classmethod get(deployment_id, retraining_policy_id)

Retrieves a retraining policy associated with a deployment

Parameters:
  • deployment_id (str) – Id of the deployment

  • retraining_policy_id (str) – Id of the policy

Returns:

retraining_policy – Retraining policy

Return type:

Retraining Policy

Examples

from datarobot._experimental.models.retraining import RetrainingPolicy
policy = RetrainingPolicy.get(
    deployment_id='620ed0e37b6ce03244f19631',
    retraining_policy_id='624f68be8828ed81bf487d8d'
)
policy.id
>>> '624f68be8828ed81bf487d8d'
policy.name
>>> 'PolicyA'
classmethod create(deployment_id, name, description=None, use_case_id=None)

Create a retraining policy associated with a deployment

Parameters:
  • deployment_id (str) – The ID of the deployment.

  • name (str) – The retraining policy name.

  • description (str) – The retraining policy description.

  • use_case_id (Optional[str]) – The ID of the Use Case that the retraining policy is associated with.

Returns:

retraining_policy – Retraining policy

Return type:

Retraining Policy

Examples

from datarobot._experimental.models.retraining import RetrainingPolicy
policy = RetrainingPolicy.create(
    deployment_id='620ed0e37b6ce03244f19631',
    name='Retraining Policy A',
    use_case_id='678114c41e9114cabca27044',
)
policy.id
>>> '624f68be8828ed81bf487d8d'
classmethod delete(deployment_id, retraining_policy_id)

Deletes a retraining policy associated with a deployment

Parameters:
  • deployment_id (str) – Id of the deployment

  • retraining_policy_id (str) – Id of the policy

Return type:

None

Examples

from datarobot._experimental.models.retraining import RetrainingPolicy
RetrainingPolicy.delete(
    deployment_id='620ed0e37b6ce03244f19631',
    retraining_policy_id='624f68be8828ed81bf487d8d'
)
update_use_case(use_case_id)

Update the use case associated with this retraining policy

Parameters:

use_case_id (str) – Id of the use case the retraining policy is associated with

Return type:

RetrainingPolicy

Examples

from datarobot._experimental.models.retraining import RetrainingPolicy
policy = RetrainingPolicy.get(
    deployment_id='620ed0e37b6ce03244f19631',
    retraining_policy_id='624f68be8828ed81bf487d8d'
)
updated = policy.update_use_case(use_case_id='620ed0e37b6ce03244f19633')
updated.use_case.id
>>> '620ed0e37b6ce03244f19633'
class datarobot._experimental.models.retraining.RetrainingPolicyRun

Retraining policy run.

Variables:
  • policy_run_id (str) – ID of the retraining policy run

  • status (str) – Status of the retraining policy run

  • challenger_id (str) – ID of the challenger model retrieved after running the policy

  • error_message (str) – The error message if an error occurs during the policy run

  • model_package_id (str) – ID of the model package (version) retrieved after the policy is run

  • project_id (str) – ID of the project the deployment is associated with

  • start_time (datetime.datetime) – Timestamp of when the policy run starts

  • finish_time (datetime.datetime) – Timestamp of when the policy run finishes

classmethod list(deployment_id, retraining_policy_id)

Lists all the retraining policy runs of a retraining policy that is associated with a deployment.

Parameters:
  • deployment_id (str) – ID of the deployment

  • retraining_policy_id (str) – ID of the policy

Returns:

policy runs – List of retraining policy runs

Return type:

list

Examples

from datarobot._experimental.models.retraining import RetrainingPolicyRun
RetrainingPolicyRun.list(
    deployment_id='620ed0e37b6ce03244f19631',
    retraining_policy_id='62f4448f0dfd5699feae3e6e'
)
>>> [RetrainingPolicyRun('620ed248bb0a1f5889eb6aa7'), RetrainingPolicyRun('624f68be8828ed81bf487d8d')]
class datarobot._experimental.models.data_matching.DataMatching

Retrieves the closest data points for the input data.

This functionality is more than a simple lookup. To retrieve the closest data points, data matching first applies the DataRobot preprocessing pipeline and then searches for the closest data points. The returned values are the closest data points at the point of entry to the model.

There are three sets of methods supported (a usage sketch follows this list):
  1. Methods to build the index (for project, model, featurelist). The index needs to be built first in order to search for the closest data points. Once the index is built, it is reused.

  2. Methods to search for the closest data points (for project, model, featurelist). These methods initialize the query, await its completion, and then save the result as a CSV file in the specified location.

  3. Additional methods to manually list the history of queries and retrieve their results.
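
A minimal usage sketch, assuming DataMatching is constructed with a project ID (the ID and file name below are placeholders):

from datarobot._experimental.models.data_matching import DataMatching

dm = DataMatching(project_id='<project_id>')

# Build the index once; it is reused by later queries.
dm.build_index(max_wait=600)

# Search for the 5 closest data points to the row(s) in the query file.
df = dm.get_closest_data('query_point.csv', number_of_data=5)
df.head()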

get_query_url(url, number_of_data=None)

Returns the formatted data matching query URL.

Return type:

str

get_closest_data(query_file_path, number_of_data=None, max_wait=600, build_index_if_missing=True)

Retrieves the closest data points to the data point in the input file. If the index is missing, the method will try to build it by default.

Parameters:
  • query_file_path (str) – Path to the file containing the data point for which to search for the closest data points.

  • number_of_data (int or None) – Number of results to search for. If no value is specified, the default is 10.

  • max_wait (int) – Number of seconds to wait for the result. Default is 600.

  • build_index_if_missing (Optional[bool]) – Whether to build the index if it is missing. If False and the index is missing, an exception is raised. Default is True.

Returns:

df – Dataframe with query result

Return type:

pd.DataFrame

get_closest_data_for_model(model_id, query_file_path, number_of_data=None, max_wait=600, build_index_if_missing=True)

Retrieves the closest data points to the data point in the input file. If the index is missing, the method will try to build it by default.

Parameters:
  • model_id (str) – The ID of the model to search for the closest data points.

  • query_file_path (str) – Path to the file containing the data point for which to search for the closest data points.

  • number_of_data (int or None) – Number of results to search for. If no value is specified, the default is 10.

  • max_wait (int) – Number of seconds to wait for the result. Default is 600.

  • build_index_if_missing (Optional[bool]) – Whether to build the index if it is missing. If False and the index is missing, an exception is raised. Default is True.

Returns:

df – Dataframe with query result

Return type:

pd.DataFrame

get_closest_data_for_featurelist(featurelist_id, query_file_path, number_of_data=None, max_wait=600, build_index_if_missing=True)

Retrieves the closest data points to the data point in the input file. If the index is missing, the method will try to build it by default.

Parameters:
  • featurelist_id (str) – The ID of the featurelist to search for the closest data points.

  • query_file_path (str) – Path to the file containing the data point for which to search for the closest data points.

  • number_of_data (int or None) – Number of results to search for. If no value is specified, the default is 10.

  • max_wait (int) – Number of seconds to wait for the result. Default is 600.

  • build_index_if_missing (bool) – Whether to build the index if it is missing. If False and the index is missing, an exception is raised. Default is True.

Returns:

df – Dataframe with query result

Return type:

pd.DataFrame

build_index(max_wait=600)

Builds data matching index and waits for its completion.

Parameters:

max_wait (int or None) – Seconds to wait for the index build to complete. Default is 600. If 0 or None is passed, the method returns without waiting for the build to complete.

Return type:

None

build_index_for_featurelist(featurelist_id, max_wait=600)

Builds data matching index for featurelist and waits for its completion.

Parameters:
  • featurelist_id (str) – Id of the featurelist to build the index for

  • max_wait (int or None) – Seconds to wait for the index build to complete. Default is 600. If 0 or None is passed, the method returns without waiting for the build to complete.

Return type:

None

build_index_for_model(model_id, max_wait=600)

Builds data matching index for the model and waits for its completion.

Parameters:
  • model_id (str) – Id of the model to build index for

  • max_wait (int or None) – Seconds to wait for the index build to complete. Default is 600. If 0 or None is passed, the method returns without waiting for the build to complete.

Return type:

None

list()

Lists all data matching queries for the project. Results are sorted in descending order, from latest to oldest.

Return type:

List[DataMatchingQuery]

class datarobot._experimental.models.data_matching.DataMatchingQuery

Data Matching Query object.

Represents a single query for the closest data points. Once the related query job is completed, its result can be retrieved and saved as a CSV file in a specified location.

classmethod list(project_id)

Retrieves the list of queries.

Parameters:

project_id (str) – Project ID to retrieve data matching queries for

Return type:

List[DataMatchingQuery]

save_result(file_path)

Downloads the query result and saves it in file_path location.

Parameters:

file_path (str) – Path where the query result is saved

Return type:

None

get_result()

Returns the query result as a DataFrame.

Returns:

df – DataFrame with the query result

Return type:

DataFrame
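
A minimal sketch of listing past queries and retrieving a result (the project ID and file name are placeholders):

from datarobot._experimental.models.data_matching import DataMatchingQuery

queries = DataMatchingQuery.list(project_id='<project_id>')
if queries:
    latest = queries[0]  # assumes the same latest-first ordering as DataMatching.list()
    latest.save_result('closest_data_points.csv')  # download the result to a CSV file
    df = latest.get_result()                       # or load it as a DataFrame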

class datarobot._experimental.models.model_lineage.FeatureCountByType

Contains information about a feature type and how many features in the dataset are of this type.

Variables:
  • feature_type (str) – The feature type grouped in this count.

  • count (int) – The number of features of this type.

class datarobot._experimental.models.model_lineage.User

Contains information about a user.

Variables:
  • id (str) – The ID of the user.

  • full_name (Optional[str]) – Full name of the user.

  • email (Optional[str]) – Email address of the user.

  • user_hash (Optional[str]) – User’s gravatar hash.

  • user_name (Optional[str]) – Username of the user.

class datarobot._experimental.models.model_lineage.ReferencedInUseCase

Contains information about the reference of a dataset in a Use Case.

Variables:
  • added_to_use_case_by (User) – User who added the dataset to the Use Case.

  • added_to_use_case_at (datetime.datetime) – Time when the dataset was added to the Use Case.

class datarobot._experimental.models.model_lineage.DatasetInfo

Contains information about the dataset.

Variables:
  • dataset_name (str) – Dataset name.

  • dataset_version_id (str) – Dataset version Id.

  • dataset_id (str) – Dataset Id.

  • number_of_rows (int) – Number of rows in the dataset.

  • file_size (int) – Size of the dataset as a CSV file, in bytes.

  • number_of_features (int) – Number of features in the dataset.

  • number_of_feature_by_type (List[FeatureCountByType]) – Number of features in the dataset, grouped by feature type.

  • referenced_in_use_case (Optional[ReferencedInUseCase]) – Information about the reference of this dataset in the Use Case. This information will only be present if the use_case_id was passed to ModelLineage.get.

class datarobot._experimental.models.model_lineage.FeatureWithMissingValues

Contains information about the number of missing values for one feature.

Variables:
  • feature_name (str) – Name of the feature.

  • number_of_missing_values (int) – Number of missing values for this feature.

class datarobot._experimental.models.model_lineage.FeaturelistInfo

Contains information about the featurelist.

Variables:
  • featurelist_name (str) – Featurelist name.

  • featurelist_id (str) – Featurelist Id.

  • number_of_features (int) – Number of features in the featurelist.

  • number_of_feature_by_type (List[FeatureCountByType]) – Number of features in the featurelist, grouped by feature type.

  • number_of_features_with_missing_values (int) – Number of features in the featurelist with at least one missing value.

  • number_of_missing_values (int) – Number of missing values across all features of the featurelist.

  • features_with_most_missing_values (List[FeatureWithMissingValues]) – List of features with the most missing values.

  • description (str) – Description of the featurelist.

class datarobot._experimental.models.model_lineage.TargetInfo

Contains information about the target.

Variables:
  • name (str) – Name of the target feature.

  • target_type (str) – Project type resulting from selected target.

  • positive_class_label (Optional[Union[str, int, float]]) – Positive class label. For every project type except Binary Classification, this value will be null.

  • mean (Optional[float]) – Mean of the target. This field will only be available for Binary Classification, Regression, and Min Inflated projects.

class datarobot._experimental.models.model_lineage.PartitionInfo

Contains information about project partitioning.

Variables:
  • validation_type (str) – Either CV for cross-validation or TVH for train-validation-holdout split.

  • cv_method (str) – Partitioning method used.

  • holdout_pct (float) – Percentage of the dataset reserved for the holdout set.

  • datetime_col (Optional[str]) – If a date partition column was used, the name of the column. Note that datetime_col applies to an old partitioning method no longer supported for new projects, as of API version v2.0.

  • datetime_partition_column (Optional[str]) – If a datetime partition column was used, the name of the column.

  • validation_pct (Optional[float]) – If train-validation-holdout split was used, the percentage of the dataset used for the validation set.

  • reps (Optional[float]) – If cross validation was used, the number of folds to use.

  • cv_holdout_level (Optional[Union[str, float, int]]) – If a user partition column was used with cross validation, the value assigned to the holdout set.

  • holdout_level (Optional[Union[str, float, int]]) – If a user partition column was used with train-validation-holdout split, the value assigned to the holdout set.

  • user_partition_col (Optional[str]) – If a user partition column was used, the name of the column.

  • training_level (Optional[Union[str, float, int]]) – If a user partition column was used with train-validation-holdout split, the value assigned to the training set.

  • partition_key_cols (Optional[List[str]]) – A list containing a single string - the name of the group partition column.

  • validation_level (Optional[Union[str, float, int]]) – If a user partition column was used with train-validation-holdout split, the value assigned to the validation set.

  • use_time_series (Optional[bool]) – Whether datetime partitioning was used to create a time series project. If False, datetime partitioning created an OTV project.

class datarobot._experimental.models.model_lineage.ProjectInfo

Contains information about the project.

Variables:
  • project_name (str) – Name of the project.

  • project_id (str) – Project Id.

  • partition (PartitionInfo) – Partitioning settings of the project.

  • metric (str) – Project metric used to select the best-performing models.

  • created_by (User) – User who created the project.

  • created_at (Optional[datetime.datetime]) – Time when the project was created.

  • target (Optional[TargetInfo]) – Information about the target.

class datarobot._experimental.models.model_lineage.ModelInfo

Contains information about the model.

Variables:
  • blueprint_tasks (List[str]) – Tasks that make up the blueprint.

  • blueprint_id (str) – Blueprint Id.

  • model_type (str) – Model type.

  • sample_size (Optional[int]) – Number of rows this model was trained on.

  • sample_percentage (Optional[float]) – Percentage of the dataset the model was trained on.

  • milliseconds_to_predict_1000_rows (Optional[float]) – Estimate of how many milliseconds it takes to predict 1000 rows. The estimate is based on the time it took to predict the holdout set.

  • serialized_blueprint_file_size (Optional[int]) – Size of the serialized blueprint, in bytes.

class datarobot._experimental.models.model_lineage.ModelLineage

Contains information about the lineage of a model.

Variables:
  • dataset (DatasetInfo) – Information about the dataset this model was created with.

  • featurelist (FeaturelistInfo) – Information about the featurelist used to train this model.

  • project (ProjectInfo) – Information about the project this model was created in.

  • model (ModelInfo) – Information about the model itself.

classmethod get(model_id, use_case_id=None)

Retrieve lineage information about a trained model. If you pass the optional use_case_id parameter, the returned object contains additional information.

Parameters:
  • model_id (str) – Model Id.

  • use_case_id (Optional[str]) – Use Case Id.

Return type:

ModelLineage
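
A minimal usage sketch (the IDs below are placeholders):

from datarobot._experimental.models.model_lineage import ModelLineage

lineage = ModelLineage.get(model_id='<model_id>', use_case_id='<use_case_id>')
lineage.project.project_name
lineage.dataset.number_of_rows
lineage.featurelist.number_of_features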

class datarobot._experimental.models.chunking_service.ChunkStorage

The chunk storage location for the data chunks.

Variables:
  • storage_reference_id (str) – The ID of the storage entity.

  • chunk_storage_type (str) – The type of the chunk storage.

  • version_id (str) – The catalog version ID. This will only be used if the storage type is “AI Catalog”.

class datarobot._experimental.models.chunking_service.Chunk

Data chunk object that holds metadata about a chunk.

Variables:
  • id (str) – The ID of the chunk entity.

  • chunk_definition_id (str) – The ID of the dataset chunk definition the chunk belongs to.

  • limit (int) – The number of rows in the chunk.

  • offset (int) – The offset in the dataset to create the chunk.

  • chunk_index (str) – The index of the chunk if chunks are divided uniformly. Otherwise, it is None.

  • data_source_id (str) – The ID of the data request used to create the chunk.

  • chunk_storage (ChunkStorage) – A list of storage locations where the chunk is stored.

get_chunk_storage_id(storage_type)

Get storage location ID for the chunk.

Parameters:

storage_type (ChunkStorageType) – The storage type where the chunk is stored.

Returns:

storage_reference_id – An ID that references the storage location for the chunk.

Return type:

str

get_chunk_storage_version_id(storage_type)

Get storage version ID for the chunk.

Parameters:

storage_type (ChunkStorageType) – The storage type where the chunk is stored.

Returns:

version_id – A catalog version ID associated with the AI Catalog dataset ID.

Return type:

str
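
A minimal sketch of reading a chunk's storage location, assuming ChunkStorageType can be imported from datarobot._experimental.models.enums (the IDs below are placeholders):

from datarobot._experimental.models.chunking_service import DatasetChunkDefinition
# Assumption: ChunkStorageType lives in the same experimental enums module as ChunkingType.
from datarobot._experimental.models.enums import ChunkStorageType

chunk = DatasetChunkDefinition.get_chunk(
    dataset_chunk_definition_id='<dataset_chunk_definition_id>',
    chunk_id='<chunk_id>',
)
storage_id = chunk.get_chunk_storage_id(ChunkStorageType.DATASTAGE)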

class datarobot._experimental.models.chunking_service.DatasourceDefinition

Data source definition that holds data source information for API responses. Do not use this to create DatasourceDefinition objects directly; instead, use DatasourceAICatalogInfo or DatasourceDataWarehouseInfo.

Variables:
  • id (str) – The ID of the data source definition.

  • data_store_id (str) – The ID of the data store.

  • credentials_id (str) – The ID of the credentials.

  • table (str) – The data source table name.

  • schema (str) – The name of the schema.

  • catalog (str) – The database or catalog name.

  • storage_origin (str) – The origin data source or data warehouse (e.g., Snowflake, BigQuery).

  • data_source_id (str) – The ID of the data request used to generate sampling and metadata.

  • total_rows (str) – The total number of rows in the dataset.

  • source_size (str) – The size of the dataset.

  • estimated_size_per_row (str) – The estimated size per row.

  • columns (str) – The list of column names in the dataset.

  • order_by_columns (List[str]) – A list of columns used to sort the dataset.

  • is_descending_order (bool) – The sort direction of the data. Defaults to False, ordering from smallest to largest.

  • select_columns (List[str]) – A list of columns to select from the dataset.

  • datetime_partition_column (str) – The datetime partition column name used in OTV projects.

  • validation_pct (float) – The percentage threshold between 0.1 and 1.0 for the first chunk validation.

  • validation_limit_pct (float) – The percentage threshold between 0.1 and 1.0 for the validation kept.

  • validation_start_date (datetime.datetime) – The start date for validation.

  • validation_end_date (datetime.datetime) – The end date for validation.

  • training_end_date (datetime.datetime) – The end date for training.

  • latest_timestamp (datetime.datetime) – The latest timestamp.

  • earliest_timestamp (datetime.datetime) – The earliest timestamp.

class datarobot._experimental.models.chunking_service.DatasourceDataWarehouseInfo

Data source information used at creation time with a dataset chunk definition. Supported data warehouses: Snowflake, BigQuery, Databricks.

Variables:
  • name (str) – The optional custom name of the data source.

  • table (str) – The data source table name or AI Catalog dataset name.

  • storage_origin (str) – The origin data source or data warehouse (e.g., Snowflake, BigQuery).

  • data_store_id (str) – The ID of the data store.

  • credentials_id (str) – The ID of the credentials.

  • schema (str) – The name of the schema.

  • catalog (str) – The database or catalog name.

  • data_source_id (str) – The ID of the data request used to generate sampling and metadata.

  • order_by_columns (List[str]) – A list of columns used to sort the dataset.

  • is_descending_order (bool) – The sort direction of the data. Defaults to False, ordering from smallest to largest.

  • select_columns (List[str]) – A list of columns to select from the dataset.

  • datetime_partition_column (str) – The datetime partition column name used in OTV projects.

  • validation_pct (float) – The percentage threshold between 0.1 and 1.0 for the first chunk validation.

  • validation_limit_pct (float) – The percentage threshold between 0.1 and 1.0 for the validation kept.

  • validation_start_date (datetime.datetime) – The start date for validation.

  • validation_end_date (datetime.datetime) – The end date for validation.

  • training_end_date (datetime.datetime) – The end date for training.

  • latest_timestamp (datetime.datetime) – The latest timestamp.

  • earliest_timestamp (datetime.datetime) – The earliest timestamp.
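
A minimal construction sketch, assuming the attributes above are accepted as keyword arguments (all IDs and names below are placeholders):

from datarobot._experimental.models.chunking_service import DatasourceDataWarehouseInfo

datasource_info = DatasourceDataWarehouseInfo(
    name='my snowflake source',          # placeholder name
    table='MY_TABLE',
    schema='MY_SCHEMA',
    catalog='MY_DATABASE',
    storage_origin='Snowflake',
    data_store_id='<data_store_id>',      # from dr.DataStore.list()
    credentials_id='<credentials_id>',    # from dr.Credential.list()
    order_by_columns=['date'],
)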

class datarobot._experimental.models.chunking_service.DatasourceAICatalogInfo

AI Catalog data source information used at creation time with dataset chunk definition.

Variables:
  • name (str) – The optional custom name of the data source.

  • table (str) – The data source table name or AI Catalog dataset name.

  • storage_origin (str) – The origin data source, always AI Catalog type.

  • catalog_id (str) – The ID of the AI Catalog dataset.

  • catalog_version_id (str) – The ID of the AI Catalog dataset version.

  • order_by_columns (List[str]) – A list of columns used to sort the dataset.

  • is_descending_order (bool) – The sort direction of the data. Defaults to False, ordering from smallest to largest.

  • select_columns (List[str]) – A list of columns to select from the dataset.

  • datetime_partition_column (str) – The datetime partition column name used in OTV projects.

  • validation_pct (float) – The percentage threshold between 0.1 and 1.0 for the first chunk validation.

  • validation_limit_pct (float) – The percentage threshold between 0.1 and 1.0 for the validation kept.

  • validation_start_date (datetime.datetime) – The start date for validation.

  • validation_end_date (datetime.datetime) – The end date for validation.

  • training_end_date (datetime.datetime) – The end date for training.

  • latest_timestamp (datetime.datetime) – The latest timestamp.

  • earliest_timestamp (datetime.datetime) – The earliest timestamp.

class datarobot._experimental.models.chunking_service.DatasetChunkDefinition

Dataset chunking definition that holds information about how to chunk the dataset.

Variables:
  • id (str) – The ID of the dataset chunk definition.

  • user_id (str) – The ID of the user who created the definition.

  • name (str) – The name of the dataset chunk definition.

  • project_starter_chunk_size (int) – The size, in bytes, of the project starter chunk.

  • user_chunk_size (int) – Chunk size in bytes.

  • datasource_definition_id (str) – The data source definition ID associated with the dataset chunk definition.

  • chunking_type (ChunkingType) –

    The type of chunk creation from the dataset. All possible chunking types are defined in the ChunkingType enum, which can be imported from datarobot._experimental.models.enums. Types include:

    • INCREMENTAL_LEARNING for non-time aware projects that use a chunk index to create chunks.

    • INCREMENTAL_LEARNING_OTV for OTV projects that use a chunk index to create chunks.

    • SLICED_OFFSET_LIMIT for any dataset in which user provides offset and limit to create chunks.

    SLICED_OFFSET_LIMIT has no index-based chunks, so the create_chunk_by_index() method is not supported.

classmethod get(dataset_chunk_definition_id)

Retrieve a specific dataset chunk definition metadata.

Parameters:

dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.

Returns:

dataset_chunk_definition – The queried instance.

Return type:

DatasetChunkDefinition

classmethod list(limit=50, offset=0)

Retrieves a list of dataset chunk definitions

Parameters:
  • limit (int) – The maximum number of objects to return. Default is 50.

  • offset (int) – The starting offset of the results. Default is 0.

Returns:

dataset_chunk_definitions – The list of dataset chunk definitions.

Return type:

List[DatasetChunkDefinition]
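
A minimal retrieval sketch (the definition ID is a placeholder):

from datarobot._experimental.models.chunking_service import DatasetChunkDefinition

definitions = DatasetChunkDefinition.list(limit=10)
definition = DatasetChunkDefinition.get('<dataset_chunk_definition_id>')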

classmethod create(name, project_starter_chunk_size, user_chunk_size, datasource_info, chunking_type=ChunkingType.INCREMENTAL_LEARNING)

Create a dataset chunk definition. Required for both index-based and custom chunks.

In order to create a dataset chunk definition, you must first:

  • Create a data connection to the target data source via dr.DataStore.create()

  • Create credentials that must be attached to the data connection via dr.Credential.create()

If you have existing data connections and credentials:

  • Retrieve the data store ID by the canonical name via:

    • [ds for ds in dr.DataStore.list() if ds.canonical_name == <name>][0].id

  • Retrieve the credential ID by the name via:

    • [cr for cr in dr.Credential.list() if cr.name == <name>][0].id

You must create the required ‘datasource_info’ object with the datasource information that corresponds to your use case:

  • DatasourceAICatalogInfo for AI catalog datasets.

  • DatasourceDataWarehouseInfo for Snowflake, BigQuery, or other data warehouse.

Parameters:
  • name (str) – The name of the dataset chunk definition.

  • project_starter_chunk_size (int) – The size, in bytes, of the first chunk. Used to start a DataRobot project.

  • user_chunk_size (int) – The size, in bytes, of the user-defined incremental chunk.

  • datasource_info (Union[DatasourceDataWarehouseInfo, DatasourceAICatalogInfo]) – The object that contains the information of the data source.

  • chunking_type (ChunkingType) –

    The type of chunk creation from the dataset. All possible chunking types are defined in the ChunkingType enum, which can be imported from datarobot._experimental.models.enums. Types include:

    • INCREMENTAL_LEARNING for non-time aware projects that use a chunk index to create chunks.

    • INCREMENTAL_LEARNING_OTV for OTV projects that use a chunk index to create chunks.

    • SLICED_OFFSET_LIMIT for any dataset in which user provides offset and limit to create chunks.

    SLICED_OFFSET_LIMIT has no index-based chunks, so the create_chunk_by_index() method is not supported. The default type is ChunkingType.INCREMENTAL_LEARNING.

Returns:

dataset_chunk_definition – An instance of a created dataset chunk definition.

Return type:

DatasetChunkDefinition
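
A minimal creation sketch using an AI Catalog dataset (IDs and sizes are placeholders; the DatasourceAICatalogInfo keyword arguments are assumed from the attributes listed above):

from datarobot._experimental.models.chunking_service import (
    DatasetChunkDefinition,
    DatasourceAICatalogInfo,
)

catalog_info = DatasourceAICatalogInfo(
    catalog_id='<dataset_id>',
    catalog_version_id='<dataset_version_id>',
)
chunk_definition = DatasetChunkDefinition.create(
    name='my chunk definition',
    project_starter_chunk_size=100 * 1024 * 1024,  # 100 MB starter chunk
    user_chunk_size=10 * 1024 * 1024,              # 10 MB incremental chunks
    datasource_info=catalog_info,
)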

classmethod get_datasource_definition(dataset_chunk_definition_id)

Retrieves the data source definition associated with a dataset chunk definition.

Parameters:

dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.

Returns:

datasource_definition – An instance of the associated data source definition.

Return type:

DatasourceDefinition

classmethod get_chunk(dataset_chunk_definition_id, chunk_id)

Retrieves a specific data chunk associated with a dataset chunk definition

Parameters:
  • dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.

  • chunk_id (str) – The ID of the chunk.

Returns:

chunk – An instance of the retrieved chunk.

Return type:

Chunk

classmethod list_chunks(dataset_chunk_definition_id)

Retrieves all data chunks associated with a dataset chunk definition

Parameters:

dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.

Returns:

chunks – A list of chunks.

Return type:

List[Chunk]

analyze_dataset(max_wait_time=600)

Analyzes the data source to retrieve and compute metadata about the dataset.

Depending on the size of the data set, adding order_by_columns to the dataset chunking definition will increase the execution time to create the data chunk. Set the max_wait_time for the appropriate wait time.

Parameters:

max_wait_time (int) – maximum time to wait for completion

Returns:

datasource_definition – An instance of the data source definition with computed metadata.

Return type:

DatasourceDefinition

create_chunk(limit, offset=0, storage_type=ChunkStorageType.DATASTAGE, max_wait_time=600)

Creates a data chunk using the limit and offset. By default, the data chunk is stored in data stages.

Depending on the size of the data set, adding order_by_columns to the dataset chunking definition will increase the execution time to retrieve or create the data chunk. Set the max_wait_time for the appropriate wait time.

Parameters:
  • limit (int) – The maximum number of rows.

  • offset (int) – The offset into the dataset (where reading begins).

  • storage_type (ChunkStorageType) – The storage location of the chunk.

  • max_wait_time (int) – maximum time to wait for completion

Returns:

chunk – An instance of a created or updated chunk.

Return type:

Chunk

create_chunk_by_index(index, storage_type=ChunkStorageType.DATASTAGE, max_wait_time=600)

Creates a data chunk using the chunk index. By default, the data chunk is stored in data stages.

Depending on the size of the data set, adding order_by_columns to the dataset chunking definition will increase the execution time to retrieve or create the data chunk. Set the max_wait_time for the appropriate wait time.

Parameters:
  • index (int) – The index of the chunk.

  • storage_type (ChunkStorageType) – The storage location of the chunk.

  • max_wait_time (int) – maximum time to wait for completion

Returns:

chunk – An instance of a created or updated chunk.

Return type:

Chunk
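
A minimal sketch of the chunk-creation flow, assuming the dataset is analyzed before chunks are created (the definition ID is a placeholder):

from datarobot._experimental.models.chunking_service import DatasetChunkDefinition

definition = DatasetChunkDefinition.get('<dataset_chunk_definition_id>')

# Compute dataset metadata before creating chunks.
datasource_definition = definition.analyze_dataset(max_wait_time=600)

# Create the first index-based chunk and inspect its size and position.
chunk = definition.create_chunk_by_index(index=0)
chunk.limit, chunk.offset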

classmethod patch_validation_dates(dataset_chunk_definition_id, validation_start_date, validation_end_date)

Updates the data source definition validation dates associated with a dataset chunk definition. To set the validation dates appropriately, specify both the start and end dates. This method can only be used for INCREMENTAL_LEARNING_OTV dataset chunk definitions and their associated data source definitions.

Parameters:
  • dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.

  • validation_start_date (datetime.datetime) – The start date of validation scoring data. Internally converted to format ‘%Y-%m-%d %H:%M:%S’, the timezone defaults to UTC.

  • validation_end_date (datetime.datetime) – The end date of validation scoring data. Internally converted to format ‘%Y-%m-%d %H:%M:%S’, the timezone defaults to UTC.

Returns:

datasource_definition – An instance of the updated data source definition.

Return type:

DatasourceDefinition
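
A minimal sketch (the definition ID and dates are placeholders):

from datetime import datetime

from datarobot._experimental.models.chunking_service import DatasetChunkDefinition

datasource_definition = DatasetChunkDefinition.patch_validation_dates(
    dataset_chunk_definition_id='<dataset_chunk_definition_id>',
    validation_start_date=datetime(2023, 1, 1),
    validation_end_date=datetime(2023, 3, 1),
)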