Experimental APIs
These features all require special permissions to be activated on your DataRobot account, and will not work otherwise. If you want to test a feature, please ask your DataRobot CFDS or account manager about enrolling in our preview program.
Classes in this list should be considered “experimental”, not fully released, and likely to change in future releases. Do not use them for production systems or other mission-critical uses.
- datarobot._experimental.models.data_store.get_spark_session(self, db_token)
Returns a Spark session.
- Parameters:
db_token (str) – A personal access token.
- Returns:
A Spark session initialized with connection parameters taken from the DataStore and the provided db_token.
- Return type:
SparkSession
Examples
>>> from datarobot._experimental.models.data_store import DataStore
>>> from datarobot.enums import DataStoreListTypes
>>> data_stores = DataStore.list(typ=DataStoreListTypes.DR_DATABASE_V1)
>>> data_stores
[DataStore('my_databricks_store_1')]
>>> db_connection = data_stores[0].get_spark_session('<token>')
>>> db_connection
<pyspark.sql.connect.session.SparkSession at 0x7f386068fbb0>
>>> df = db_connection.read.table("samples.nyctaxi.trips")
>>> df.show()
- class datarobot._experimental.models.data_store.DataStore
A data store. Represents a database connection.
- Variables:
id (str) – The ID of the data store.
data_store_type (str) – The type of data store.
canonical_name (str) – The user-friendly name of the data store.
creator (str) – The ID of the user who created the data store.
updated (datetime.datetime) – The time of the last update.
params (DataStoreParameters) – A list specifying data store parameters.
role (str) – Your access role for this data store.
driver_class_type (str) – The type of driver class used to connect to this data store.
- class datarobot._experimental.models.retraining.RetrainingUseCase
Retraining use case.
- Variables:
id (str) – The ID of the use case.
name (str) – The name of the use case.
- class datarobot._experimental.models.retraining.RetrainingPolicy
Retraining Policy.
- Variables:
policy_id (str) – ID of the retraining policy.
name (str) – Name of the retraining policy.
description (str) – Description of the retraining policy.
use_case (Optional[dict]) – Use case the retraining policy is associated with.
- classmethod list(deployment_id)
Lists all retraining policies associated with a deployment
- Parameters:
deployment_id (str) – ID of the deployment.
- Returns:
policies – List of retraining policies associated with a deployment.
- Return type:
List[RetrainingPolicy]
Examples
from datarobot import Deployment
from datarobot._experimental.models.retraining import RetrainingPolicy

deployment = Deployment.get(deployment_id='620ed0e37b6ce03244f19631')
RetrainingPolicy.list(deployment.id)
>>> [RetrainingPolicy('620ed248bb0a1f5889eb6aa7'), RetrainingPolicy('624f68be8828ed81bf487d8d')]
- classmethod get(deployment_id, retraining_policy_id)
Retrieves a retraining policy associated with a deployment
- Parameters:
deployment_id (str) – ID of the deployment.
retraining_policy_id (str) – ID of the policy.
- Returns:
retraining_policy – Retraining policy
- Return type:
RetrainingPolicy
Examples
from datarobot._experimental.models.retraining import RetrainingPolicy

policy = RetrainingPolicy.get(
    deployment_id='620ed0e37b6ce03244f19631',
    retraining_policy_id='624f68be8828ed81bf487d8d',
)
policy.id
>>> '624f68be8828ed81bf487d8d'
policy.name
>>> 'PolicyA'
- classmethod create(deployment_id, name, description=None, use_case_id=None)
Create a retraining policy associated with a deployment
- Parameters:
deployment_id (str) – The ID of the deployment.
name (str) – The retraining policy name.
description (Optional[str]) – The retraining policy description.
use_case_id (Optional[str]) – The ID of the Use Case that the retraining policy is associated with.
- Returns:
retraining_policy – Retraining policy
- Return type:
RetrainingPolicy
Examples
from datarobot._experimental.models.retraining import RetrainingPolicy

policy = RetrainingPolicy.create(
    deployment_id='620ed0e37b6ce03244f19631',
    name='Retraining Policy A',
    use_case_id='678114c41e9114cabca27044',
)
policy.id
>>> '624f68be8828ed81bf487d8d'
- classmethod delete(deployment_id, retraining_policy_id)
Deletes a retraining policy associated with a deployment
- Parameters:
deployment_id (str) – ID of the deployment.
retraining_policy_id (str) – ID of the policy.
- Return type:
None
Examples
from datarobot._experimental.models.retraining import RetrainingPolicy

RetrainingPolicy.delete(
    deployment_id='620ed0e37b6ce03244f19631',
    retraining_policy_id='624f68be8828ed81bf487d8d',
)
- update_use_case(use_case_id)
Update the use case associated with this retraining policy
- Parameters:
use_case_id (str) – ID of the use case the retraining policy is associated with.
- Return type:
RetrainingPolicy
Examples
from datarobot._experimental.models.retraining import RetrainingPolicy

policy = RetrainingPolicy.get(
    deployment_id='620ed0e37b6ce03244f19631',
    retraining_policy_id='624f68be8828ed81bf487d8d',
)
updated = policy.update_use_case(use_case_id='620ed0e37b6ce03244f19633')
updated.use_case.id
>>> '620ed0e37b6ce03244f19633'
- class datarobot._experimental.models.retraining.RetrainingPolicyRun
Retraining policy run.
- Variables:
policy_run_id (str) – ID of the retraining policy run.
status (str) – Status of the retraining policy run.
challenger_id (str) – ID of the challenger model retrieved after running the policy.
error_message (str) – The error message if an error occurs during the policy run.
model_package_id (str) – ID of the model package (version) retrieved after the policy is run.
project_id (str) – ID of the project the deployment is associated with.
start_time (datetime.datetime) – Timestamp of when the policy run starts.
finish_time (datetime.datetime) – Timestamp of when the policy run finishes.
- classmethod list(deployment_id, retraining_policy_id)
Lists all the retraining policy runs of a retraining policy that is associated with a deployment.
- Parameters:
deployment_id (str) – ID of the deployment.
retraining_policy_id (str) – ID of the policy.
- Returns:
policy_runs – List of retraining policy runs.
- Return type:
List[RetrainingPolicyRun]
Examples
from datarobot._experimental.models.retraining import RetrainingPolicyRun

RetrainingPolicyRun.list(
    deployment_id='620ed0e37b6ce03244f19631',
    retraining_policy_id='62f4448f0dfd5699feae3e6e',
)
>>> [RetrainingPolicyRun('620ed248bb0a1f5889eb6aa7'), RetrainingPolicyRun('624f68be8828ed81bf487d8d')]
- class datarobot._experimental.models.data_matching.DataMatching
Retrieves the closest data points for the input data.
This functionality is more than a simple lookup. To retrieve the closest data points, data matching first applies the DataRobot preprocessing pipeline and then searches for the closest data points. The returned values are the closest data points as they appear at the point of entry to the model.
- There are three sets of methods supported (see the workflow sketch after this list):
Methods to build the index (for project, model, featurelist). The index must be built before searching for the closest data points. Once built, the index is reused.
Methods to search for the closest data points (for project, model, featurelist). These methods initialize the query, await its completion, and then save the result as a CSV file in the specified location.
Additional methods to manually list the history of queries and retrieve their results.
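A minimal, hedged sketch of the workflow described above. The DataMatching constructor is not documented here, so the project_id keyword is an assumption, as are the file name and ID.

from datarobot._experimental.models.data_matching import DataMatching

data_matching = DataMatching(project_id='620ed0e37b6ce03244f19631')  # hypothetical constructor args

# Build the index once; later searches reuse it.
data_matching.build_index(max_wait=600)

# Search for the 5 closest data points to the row in the query file.
df = data_matching.get_closest_data(
    query_file_path='query_point.csv',
    number_of_data=5,
)
print(df.head())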
- get_query_url(url, number_of_data=None)
Returns the formatted data matching query URL.
- Return type:
str
- get_closest_data(query_file_path, number_of_data=None, max_wait=600, build_index_if_missing=True)
Retrieves the closest data points to the data point in the input file. If the index is missing, by default the method will try to build it.
- Parameters:
query_file_path (str) – Path to the file containing the data point for which to search for the closest data points.
number_of_data (int or None) – Number of results to search for. If no value is specified, the default is 10.
max_wait (int) – Number of seconds to wait for the result. Default is 600.
build_index_if_missing (Optional[bool]) – Whether the index should be created if it is missing. If False is specified and the index is missing, an exception is thrown. Default is True.
- Returns:
df – Dataframe with query result
- Return type:
pd.DataFrame
- get_closest_data_for_model(model_id, query_file_path, number_of_data=None, max_wait=600, build_index_if_missing=True)
Retrieves the closest data points to the data point in the input file. If the index is missing, by default the method will try to build it.
- Parameters:
model_id (str) – ID of the model for which to search for the closest data points.
query_file_path (str) – Path to the file containing the data point for which to search for the closest data points.
number_of_data (int or None) – Number of results to search for. If no value is specified, the default is 10.
max_wait (int) – Number of seconds to wait for the result. Default is 600.
build_index_if_missing (Optional[bool]) – Whether the index should be created if it is missing. If False is specified and the index is missing, an exception is thrown. Default is True.
- Returns:
df – Dataframe with query result
- Return type:
pd.DataFrame
- get_closest_data_for_featurelist(featurelist_id, query_file_path, number_of_data=None, max_wait=600, build_index_if_missing=True)
Retrieves the closest data points to the data point in the input file. If the index is missing, by default the method will try to build it.
- Parameters:
featurelist_id (str) – ID of the featurelist for which to search for the closest data points.
query_file_path (str) – Path to the file containing the data point for which to search for the closest data points.
number_of_data (int or None) – Number of results to search for. If no value is specified, the default is 10.
max_wait (int) – Number of seconds to wait for the result. Default is 600.
build_index_if_missing (bool) – Whether the index should be created if it is missing. If False is specified and the index is missing, an exception is thrown. Default is True.
- Returns:
df – Dataframe with query result
- Return type:
pd.DataFrame
- build_index(max_wait=600)
Builds data matching index and waits for its completion.
- Parameters:
max_wait (int or None) – Seconds to wait for the build index operation to complete. Default is 600. If 0 or None is passed, the method returns without waiting for the build index operation to complete.
- Return type:
None
- build_index_for_featurelist(featurelist_id, max_wait=600)
Builds data matching index for featurelist and waits for its completion.
- Parameters:
featurelist_id (str) – ID of the featurelist to build the index for.
max_wait (int or None) – Seconds to wait for the build index operation to complete. Default is 600. If 0 or None is passed, the method returns without waiting for the build index operation to complete.
- Return type:
None
- build_index_for_model(model_id, max_wait=600)
Builds data matching index for a model and waits for its completion.
- Parameters:
model_id (str) – ID of the model to build the index for.
max_wait (int or None) – Seconds to wait for the build index operation to complete. Default is 600. If 0 or None is passed, the method returns without waiting for the build index operation to complete.
- Return type:
None
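Examples
A hedged sketch of model-scoped matching. The DataMatching constructor keyword and both IDs are assumptions for illustration.

from datarobot._experimental.models.data_matching import DataMatching

data_matching = DataMatching(project_id='620ed0e37b6ce03244f19631')  # hypothetical constructor args
data_matching.build_index_for_model(model_id='5f3f1abc9a1d2e0012345678', max_wait=600)

# Reuses the model-scoped index built above.
df = data_matching.get_closest_data_for_model(
    model_id='5f3f1abc9a1d2e0012345678',
    query_file_path='query_point.csv',
)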
- list()
Lists all data matching queries for the project. Results are sorted in descending order, from the latest to the oldest.
- Return type:
List[DataMatchingQuery]
- class datarobot._experimental.models.data_matching.DataMatchingQuery
Data Matching Query object.
Represents a single query for the closest data points. Once the related query job completes, its result can be retrieved and saved as a CSV file in a specified location.
- classmethod list(project_id)
Retrieves the list of queries.
- Parameters:
project_id (str) – Project ID to retrieve data matching queries for.
- Return type:
List[DataMatchingQuery]
- save_result(file_path)
Downloads the query result and saves it to the file_path location.
- Parameters:
file_path (str) – Path where the query result will be saved.
- Return type:
None
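Examples
A short, hedged example: list past data matching queries for a project and save the newest result. The project ID is illustrative, and newest-first ordering is assumed, matching the behavior documented for DataMatching.list().

from datarobot._experimental.models.data_matching import DataMatchingQuery

queries = DataMatchingQuery.list(project_id='620ed0e37b6ce03244f19631')  # illustrative ID
if queries:
    queries[0].save_result('closest_points.csv')  # assumes newest query first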
- get_result()
Returns the query result as a dataframe.
- Returns:
df – Dataframe with the query result.
- Return type:
DataFrame
- class datarobot._experimental.models.model_lineage.FeatureCountByType
Contains information about a feature type and how many features in the dataset are of this type.
- Variables:
feature_type (str) – The feature type grouped in this count.
count (int) – The number of features of this type.
- class datarobot._experimental.models.model_lineage.User
Contains information about a user.
- Variables:
id (str) – ID of the user.
full_name (Optional[str]) – Full name of the user.
email (Optional[str]) – Email address of the user.
user_hash (Optional[str]) – User’s gravatar hash.
user_name (Optional[str]) – Username of the user.
- class datarobot._experimental.models.model_lineage.ReferencedInUseCase
Contains information about the reference of a dataset in a Use Case.
- Variables:
added_to_use_case_by (User) – User who added the dataset to the Use Case.
added_to_use_case_at (datetime.datetime) – Time when the dataset was added to the Use Case.
- class datarobot._experimental.models.model_lineage.DatasetInfo
Contains information about the dataset.
- Variables:
dataset_name (str) – Dataset name.
dataset_version_id (str) – Dataset version ID.
dataset_id (str) – Dataset ID.
number_of_rows (int) – Number of rows in the dataset.
file_size (int) – Size of the dataset as a CSV file, in bytes.
number_of_features (int) – Number of features in the dataset.
number_of_feature_by_type (List[FeatureCountByType]) – Number of features in the dataset, grouped by feature type.
referenced_in_use_case (Optional[ReferencedInUseCase]) – Information about the reference of this dataset in the Use Case. This information is only present if use_case_id was passed to ModelLineage.get.
- class datarobot._experimental.models.model_lineage.FeatureWithMissingValues
Contains information about the number of missing values for one feature.
- Variables:
feature_name (str) – Name of the feature.
number_of_missing_values (int) – Number of missing values for this feature.
- class datarobot._experimental.models.model_lineage.FeaturelistInfo
Contains information about the featurelist.
- Variables:
featurelist_name (str) – Featurelist name.
featurelist_id (str) – Featurelist ID.
number_of_features (int) – Number of features in the featurelist.
number_of_feature_by_type (List[FeatureCountByType]) – Number of features in the featurelist, grouped by feature type.
number_of_features_with_missing_values (int) – Number of features in the featurelist with at least one missing value.
number_of_missing_values (int) – Number of missing values across all features of the featurelist.
features_with_most_missing_values (List[FeatureWithMissingValues]) – List of features with the most missing values.
description (str) – Description of the featurelist.
- class datarobot._experimental.models.model_lineage.TargetInfo
Contains information about the target.
- Variables:
name (str) – Name of the target feature.
target_type (str) – Project type resulting from the selected target.
positive_class_label (Optional[Union[str, int, float]]) – Positive class label. For every project type except Binary Classification, this value is null.
mean (Optional[float]) – Mean of the target. This field is only available for Binary Classification, Regression, and Min Inflated projects.
- class datarobot._experimental.models.model_lineage.PartitionInfo
Contains information about project partitioning.
- Variables:
validation_type (str) – Either CV for cross-validation or TVH for train-validation-holdout split.
cv_method (str) – Partitioning method used.
holdout_pct (float) – Percentage of the dataset reserved for the holdout set.
datetime_col (Optional[str]) – If a date partition column was used, the name of the column. Note that datetime_col applies to an old partitioning method no longer supported for new projects, as of API version v2.0.
datetime_partition_column (Optional[str]) – If a datetime partition column was used, the name of the column.
validation_pct (Optional[float]) – If train-validation-holdout split was used, the percentage of the dataset used for the validation set.
reps (Optional[float]) – If cross-validation was used, the number of folds to use.
cv_holdout_level (Optional[Union[str, float, int]]) – If a user partition column was used with cross-validation, the value assigned to the holdout set.
holdout_level (Optional[Union[str, float, int]]) – If a user partition column was used with train-validation-holdout split, the value assigned to the holdout set.
user_partition_col (Optional[str]) – If a user partition column was used, the name of the column.
training_level (Optional[Union[str, float, int]]) – If a user partition column was used with train-validation-holdout split, the value assigned to the training set.
partition_key_cols (Optional[List[str]]) – A list containing a single string: the name of the group partition column.
validation_level (Optional[Union[str, float, int]]) – If a user partition column was used with train-validation-holdout split, the value assigned to the validation set.
use_time_series (Optional[bool]) – A boolean value indicating whether a time series project was created by using datetime partitioning. Otherwise, datetime partitioning created an OTV project.
- class datarobot._experimental.models.model_lineage.ProjectInfo
Contains information about the project.
- Variables:
project_name (str) – Name of the project.
project_id (str) – Project ID.
partition (PartitionInfo) – Partitioning settings of the project.
metric (str) – Project metric used to select the best-performing models.
created_by (User) – User who created the project.
created_at (Optional[datetime.datetime]) – Time when the project was created.
target (Optional[TargetInfo]) – Information about the target.
- class datarobot._experimental.models.model_lineage.ModelInfo
Contains information about the model.
- Variables:
blueprint_tasks (List[str]) – Tasks that make up the blueprint.
blueprint_id (str) – Blueprint ID.
model_type (str) – Model type.
sample_size (Optional[int]) – Number of rows this model was trained on.
sample_percentage (Optional[float]) – Percentage of the dataset the model was trained on.
milliseconds_to_predict_1000_rows (Optional[float]) – Estimate of how many milliseconds it takes to predict 1,000 rows. The estimate is based on the time it took to predict the holdout set.
serialized_blueprint_file_size (Optional[int]) – Size of the serialized blueprint, in bytes.
- class datarobot._experimental.models.model_lineage.ModelLineage
Contains information about the lineage of a model.
- Variables:
dataset (DatasetInfo) – Information about the dataset this model was created with.
featurelist (FeaturelistInfo) – Information about the featurelist used to train this model.
project (ProjectInfo) – Information about the project this model was created in.
model (ModelInfo) – Information about the model itself.
- classmethod get(model_id, use_case_id=None)
Retrieve lineage information about a trained model. If you pass the optional use_case_id parameter, this class will contain additional information.
- Parameters:
model_id (str) – Model ID.
use_case_id (Optional[str]) – Use Case ID.
- Return type:
ModelLineage
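Examples
A brief, hedged example of retrieving lineage for a trained model; both IDs are illustrative.

from datarobot._experimental.models.model_lineage import ModelLineage

lineage = ModelLineage.get(
    model_id='5f3f1abc9a1d2e0012345678',     # illustrative model ID
    use_case_id='678114c41e9114cabca27044',  # optional; enables referenced_in_use_case info
)
print(lineage.dataset.dataset_name)
print(lineage.project.metric)
print(lineage.model.model_type)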
- class datarobot._experimental.models.chunking_service.ChunkStorage
The chunk storage location for the data chunks.
- Variables:
storage_reference_id (str) – The ID of the storage entity.
chunk_storage_type (str) – The type of the chunk storage.
version_id (str) – The catalog version ID. This is only used if the storage type is “AI Catalog”.
- class datarobot._experimental.models.chunking_service.Chunk
Data chunk object that holds metadata about a chunk.
- Variables:
id (str) – The ID of the chunk entity.
chunk_definition_id (str) – The ID of the dataset chunk definition the chunk belongs to.
limit (int) – The number of rows in the chunk.
offset (int) – The offset in the dataset to create the chunk.
chunk_index (str) – The index of the chunk if chunks are divided uniformly. Otherwise, it is None.
data_source_id (str) – The ID of the data request used to create the chunk.
chunk_storage (ChunkStorage) – A list of storage locations where the chunk is stored.
- get_chunk_storage_id(storage_type)
Get storage location ID for the chunk.
- Parameters:
storage_type (ChunkStorageType) – The storage type where the chunk is stored.
- Returns:
storage_reference_id – An ID that references the storage location for the chunk.
- Return type:
str
- get_chunk_storage_version_id(storage_type)
Get storage version ID for the chunk.
- Parameters:
storage_type (ChunkStorageType) – The storage type where the chunk is stored.
- Returns:
storage_reference_id – A catalog version ID associated with the AI Catalog dataset ID.
- Return type:
str
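Examples
A hedged sketch of reading a chunk’s storage references. The import path for ChunkStorageType, the AI_CATALOG member name, and all IDs are assumptions for illustration.

from datarobot._experimental.models.chunking_service import (
    ChunkStorageType,
    DatasetChunkDefinition,
)

chunk = DatasetChunkDefinition.get_chunk(
    dataset_chunk_definition_id='65a0d4b1c2d3e4f5a6b7c8d9',  # illustrative
    chunk_id='65a0d4b1c2d3e4f5a6b7c8da',                     # illustrative
)
# DATASTAGE is referenced elsewhere in this document; AI_CATALOG is an assumed member name.
storage_id = chunk.get_chunk_storage_id(ChunkStorageType.DATASTAGE)
version_id = chunk.get_chunk_storage_version_id(ChunkStorageType.AI_CATALOG)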
- class datarobot._experimental.models.chunking_service.DatasourceDefinition
Data source definition that holds data source information for API responses. Do not create DatasourceDefinition objects directly; use DatasourceAICatalogInfo or DatasourceDataWarehouseInfo instead.
- Variables:
id (str) – The ID of the data source definition.
data_store_id (str) – The ID of the data store.
credentials_id (str) – The ID of the credentials.
table (str) – The data source table name.
schema (str) – The data source schema name.
catalog (str) – The database or catalog name.
storage_origin (str) – The origin data source or data warehouse (e.g., Snowflake, BigQuery).
data_source_id (str) – The ID of the data request used to generate sampling and metadata.
total_rows (str) – The total number of rows in the dataset.
source_size (str) – The size of the dataset.
estimated_size_per_row (str) – The estimated size per row.
columns (str) – The list of column names in the dataset.
order_by_columns (List[str]) – A list of columns used to sort the dataset.
is_descending_order (bool) – Orders the direction of the data. Defaults to False, ordering from smallest to largest.
select_columns (List[str]) – A list of columns to select from the dataset.
datetime_partition_column (str) – The datetime partition column name used in OTV projects.
validation_pct (float) – The percentage threshold between 0.1 and 1.0 for the first chunk validation.
validation_limit_pct (float) – The percentage threshold between 0.1 and 1.0 for the validation kept.
validation_start_date (datetime.datetime) – The start date for validation.
validation_end_date (datetime.datetime) – The end date for validation.
training_end_date (datetime.datetime) – The end date for training.
latest_timestamp (datetime.datetime) – The latest timestamp.
earliest_timestamp (datetime.datetime) – The earliest timestamp.
- class datarobot._experimental.models.chunking_service.DatasourceDataWarehouseInfo
Data source information used at creation time with a dataset chunk definition. Supported data warehouses: Snowflake, BigQuery, Databricks.
- Variables:
name (str) – The optional custom name of the data source.
table (str) – The data source table name or AI Catalog dataset name.
storage_origin (str) – The origin data source or data warehouse (e.g., Snowflake, BigQuery).
data_store_id (str) – The ID of the data store.
credentials_id (str) – The ID of the credentials.
schema (str) – The data source schema name.
catalog (str) – The database or catalog name.
data_source_id (str) – The ID of the data request used to generate sampling and metadata.
order_by_columns (List[str]) – A list of columns used to sort the dataset.
is_descending_order (bool) – Orders the direction of the data. Defaults to False, ordering from smallest to largest.
select_columns (List[str]) – A list of columns to select from the dataset.
datetime_partition_column (str) – The datetime partition column name used in OTV projects.
validation_pct (float) – The percentage threshold between 0.1 and 1.0 for the first chunk validation.
validation_limit_pct (float) – The percentage threshold between 0.1 and 1.0 for the validation kept.
validation_start_date (datetime.datetime) – The start date for validation.
validation_end_date (datetime.datetime) – The end date for validation.
training_end_date (datetime.datetime) – The end date for training.
latest_timestamp (datetime.datetime) – The latest timestamp.
earliest_timestamp (datetime.datetime) – The earliest timestamp.
- class datarobot._experimental.models.chunking_service.DatasourceAICatalogInfo
AI Catalog data source information used at creation time with dataset chunk definition.
- Variables:
name (str) – The optional custom name of the data source.
table (str) – The data source table name or AI Catalog dataset name.
storage_origin (str) – The origin data source, always the AI Catalog type.
catalog_id (str) – The ID of the AI Catalog dataset.
catalog_version_id (str) – The ID of the AI Catalog dataset version.
order_by_columns (List[str]) – A list of columns used to sort the dataset.
is_descending_order (bool) – Orders the direction of the data. Defaults to False, ordering from smallest to largest.
select_columns (List[str]) – A list of columns to select from the dataset.
datetime_partition_column (str) – The datetime partition column name used in OTV projects.
validation_pct (float) – The percentage threshold between 0.1 and 1.0 for the first chunk validation.
validation_limit_pct (float) – The percentage threshold between 0.1 and 1.0 for the validation kept.
validation_start_date (datetime.datetime) – The start date for validation.
validation_end_date (datetime.datetime) – The end date for validation.
training_end_date (datetime.datetime) – The end date for training.
latest_timestamp (datetime.datetime) – The latest timestamp.
earliest_timestamp (datetime.datetime) – The earliest timestamp.
- class datarobot._experimental.models.chunking_service.DatasetChunkDefinition
Dataset chunking definition that holds information about how to chunk the dataset.
- Variables:
id (str) – The ID of the dataset chunk definition.
user_id (str) – The ID of the user who created the definition.
name (str) – The name of the dataset chunk definition.
project_starter_chunk_size (int) – The size, in bytes, of the project starter chunk.
user_chunk_size (int) – Chunk size in bytes.
datasource_definition_id (str) – The data source definition ID associated with the dataset chunk definition.
chunking_type (ChunkingType) – The type of chunk creation from the dataset. All possible chunking types can be found under the ChunkingType enum, which can be imported from datarobot._experimental.models.enums. Types include:
INCREMENTAL_LEARNING for non-time-aware projects that use a chunk index to create chunks.
INCREMENTAL_LEARNING_OTV for OTV projects that use a chunk index to create chunks.
SLICED_OFFSET_LIMIT for any dataset for which the user provides an offset and limit to create chunks. SLICED_OFFSET_LIMIT has no index-based chunks, so the create_chunk_by_index() method is not supported.
- classmethod get(dataset_chunk_definition_id)
Retrieve a specific dataset chunk definition metadata.
- Parameters:
dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.
- Returns:
dataset_chunk_definition – The queried instance.
- Return type:
DatasetChunkDefinition
- classmethod list(limit=50, offset=0)
Retrieves a list of dataset chunk definitions
- Parameters:
limit (int) – The maximum number of objects to return. Default is 50.
offset (int) – The starting offset of the results. Default is 0.
- Returns:
dataset_chunk_definitions – The list of dataset chunk definitions.
- Return type:
List[DatasetChunkDefinition]
- classmethod create(name, project_starter_chunk_size, user_chunk_size, datasource_info, chunking_type=ChunkingType.INCREMENTAL_LEARNING)
Create a dataset chunk definition. Required for both index-based and custom chunks.
In order to create a dataset chunk definition, you must first:
Create a data connection to the target data source via dr.DataStore.create()
Create credentials that must be attached to the data connection via dr.Credential.create()
If you have existing data connections and credentials:
Retrieve the data store ID by its canonical name via: [ds for ds in dr.DataStore.list() if ds.canonical_name == <name>][0].id
Retrieve the credential ID by its name via: [cr for cr in dr.Credential.list() if cr.name == <name>][0].id
You must create the required ‘datasource_info’ object with the data source information that corresponds to your use case:
DatasourceAICatalogInfo for AI Catalog datasets.
DatasourceDataWarehouseInfo for Snowflake, BigQuery, or other data warehouses.
A hedged end-to-end sketch appears after the return type below.
- Parameters:
name (str) – The name of the dataset chunk definition.
project_starter_chunk_size (int) – The size, in bytes, of the first chunk. Used to start a DataRobot project.
user_chunk_size (int) – The size, in bytes, of the user-defined incremental chunk.
datasource_info (Union[DatasourceDataWarehouseInfo, DatasourceAICatalogInfo]) – The object that contains the information of the data source.
chunking_type (ChunkingType) – The type of chunk creation from the dataset. All possible chunking types can be found under the ChunkingType enum, which can be imported from datarobot._experimental.models.enums. Types include:
INCREMENTAL_LEARNING for non-time-aware projects that use a chunk index to create chunks.
INCREMENTAL_LEARNING_OTV for OTV projects that use a chunk index to create chunks.
SLICED_OFFSET_LIMIT for any dataset for which the user provides an offset and limit to create chunks. SLICED_OFFSET_LIMIT has no index-based chunks, so the create_chunk_by_index() method is not supported.
The default type is ChunkingType.INCREMENTAL_LEARNING.
- Returns:
dataset_chunk_definition – An instance of a created dataset chunk definition.
- Return type:
DatasetChunkDefinition
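Examples
A hedged end-to-end sketch: create a dataset chunk definition for an AI Catalog dataset, analyze the source, and create the first chunk. The DatasourceAICatalogInfo keyword names mirror the variables documented above but are assumptions, as are all IDs and byte sizes.

from datarobot._experimental.models.chunking_service import (
    DatasetChunkDefinition,
    DatasourceAICatalogInfo,
)
from datarobot._experimental.models.enums import ChunkingType

datasource_info = DatasourceAICatalogInfo(
    table='my_catalog_dataset',
    catalog_id='65a0d4b1c2d3e4f5a6b7c8d9',          # illustrative
    catalog_version_id='65a0d4b1c2d3e4f5a6b7c8da',  # illustrative
)

chunk_definition = DatasetChunkDefinition.create(
    name='my chunk definition',
    project_starter_chunk_size=100 * 1024 * 1024,  # 100 MB starter chunk
    user_chunk_size=10 * 1024 * 1024,              # 10 MB incremental chunks
    datasource_info=datasource_info,
    chunking_type=ChunkingType.INCREMENTAL_LEARNING,
)

# Compute dataset metadata before creating chunks.
chunk_definition.analyze_dataset(max_wait_time=600)
first_chunk = chunk_definition.create_chunk_by_index(index=0)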
- classmethod get_datasource_definition(dataset_chunk_definition_id)
Retrieves the data source definition associated with a dataset chunk definition.
- Parameters:
dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.
- Returns:
datasource_definition – An instance of the created data source definition.
- Return type:
DatasourceDefinition
- classmethod get_chunk(dataset_chunk_definition_id, chunk_id)
Retrieves a specific data chunk associated with a dataset chunk definition
- Parameters:
dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.
chunk_id (str) – The ID of the chunk.
- Returns:
chunk – An instance of the created chunk.
- Return type:
Chunk
- classmethod list_chunks(dataset_chunk_definition_id)
Retrieves all data chunks associated with a dataset chunk definition
- Parameters:
dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.
- Returns:
chunks – A list of chunks.
- Return type:
List[Chunk]
- analyze_dataset(max_wait_time=600)
Analyzes the data source to retrieve and compute metadata about the dataset.
Depending on the size of the dataset, adding order_by_columns to the dataset chunk definition will increase the execution time needed to create the data chunk. Set max_wait_time to an appropriate wait time.
- Parameters:
max_wait_time (int) – Maximum time to wait for completion.
- Returns:
datasource_definition – An instance of the created data source definition.
- Return type:
DatasourceDefinition
- create_chunk(limit, offset=0, storage_type=ChunkStorageType.DATASTAGE, max_wait_time=600)
Creates a data chunk using the limit and offset. By default, the data chunk is stored in data stages.
Depending on the size of the dataset, adding order_by_columns to the dataset chunk definition will increase the execution time needed to retrieve or create the data chunk. Set max_wait_time to an appropriate wait time.
- Parameters:
limit (int) – The maximum number of rows.
offset (int) – The offset into the dataset (where reading begins).
storage_type (ChunkStorageType) – The storage location of the chunk.
max_wait_time (int) – Maximum time to wait for completion.
- Returns:
chunk – An instance of a created or updated chunk.
- Return type:
Chunk
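Examples
A short, hedged example of creating an explicit offset/limit chunk. It assumes chunk_definition is a DatasetChunkDefinition created with ChunkingType.SLICED_OFFSET_LIMIT, as in the sketch above; the row counts are illustrative.

chunk = chunk_definition.create_chunk(
    limit=10000,    # number of rows to read
    offset=50000,   # row at which reading begins
    max_wait_time=600,
)
print(chunk.id, chunk.limit, chunk.offset)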
- create_chunk_by_index(index, storage_type=ChunkStorageType.DATASTAGE, max_wait_time=600)
Creates a data chunk using the chunk index. By default, the data chunk is stored in data stages.
Depending on the size of the dataset, adding order_by_columns to the dataset chunk definition will increase the execution time needed to retrieve or create the data chunk. Set max_wait_time to an appropriate wait time.
- Parameters:
index (int) – The index of the chunk.
storage_type (ChunkStorageType) – The storage location of the chunk.
max_wait_time (int) – Maximum time to wait for completion.
- Returns:
chunk – An instance of a created or updated chunk.
- Return type:
Chunk
- classmethod patch_validation_dates(dataset_chunk_definition_id, validation_start_date, validation_end_date)
Updates the data source definition validation dates associated with a dataset chunk definition. To set the validation dates appropriately, both start and end dates should be specified. This method can only be used for INCREMENTAL_LEARNING_OTV dataset chunk definitions and their associated data source definitions.
- Parameters:
dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.
validation_start_date (datetime.datetime) – The start date of validation scoring data. Internally converted to the format ‘%Y-%m-%d %H:%M:%S’; the timezone defaults to UTC.
validation_end_date (datetime.datetime) – The end date of validation scoring data. Internally converted to the format ‘%Y-%m-%d %H:%M:%S’; the timezone defaults to UTC.
- Returns:
datasource_definition – An instance of created datasource definition.
- Return type:
DatasourceDefinition
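Examples
A hedged usage example; the ID and dates are illustrative.

from datetime import datetime

from datarobot._experimental.models.chunking_service import DatasetChunkDefinition

datasource_definition = DatasetChunkDefinition.patch_validation_dates(
    dataset_chunk_definition_id='65a0d4b1c2d3e4f5a6b7c8d9',  # illustrative
    validation_start_date=datetime(2023, 1, 1),
    validation_end_date=datetime(2023, 3, 31),
)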