Experimental API
These features all require special permissions to be activated on your DataRobot account, and will not work otherwise. If you want to test a feature, please ask your DataRobot CFDS or account manager about enrolling in our preview program.
Classes in this list should be considered “experimental”, not fully released, and likely to change in future releases. Do not use them for production systems or other mission-critical uses.
- class datarobot._experimental.models.model.Model
- get_feature_effect(source)
Retrieve Feature Effects for the model.
Feature Effects provides partial dependence and predicted vs. actual values for the top 500 features, ordered by feature impact score.
The partial dependence shows the marginal effect of a feature on the target variable after accounting for the average effects of all other predictive features. It indicates how, holding all other features at their observed values, the value of this feature affects your prediction.
Requires that Feature Effects has already been computed with request_feature_effect. See get_feature_effect_metadata for retrieving information about the available sources.
- Parameters:
  source (str) – The source Feature Effects are retrieved for.
- Returns:
  feature_effects – The feature effects data.
- Return type:
  FeatureEffects
- Raises:
  ClientError – If the feature effects have not been computed or the source is not a valid value.
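A minimal sketch of this retrieval flow, assuming your DataRobot client is already configured; the project and model IDs are placeholders, and the metadata attribute access is an assumption based on the public client.
import datarobot as dr
from datarobot._experimental.models.model import Model

dr.Client()  # credentials read from your DataRobot configuration

model = Model.get(project="<project_id>", model_id="<model_id>")

# Compute Feature Effects if they have not been computed yet.
job = model.request_feature_effect()
job.wait_for_completion()

# Check which sources (e.g. training or validation) are available,
# then retrieve Feature Effects for one of them.
metadata = model.get_feature_effect_metadata()
feature_effects = model.get_feature_effect(source=metadata.sources[0])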
- get_incremental_learning_metadata()
Retrieve incremental learning metadata for this model.
Added in version v3.4.0.
This functionality requires the INCREMENTAL_LEARNING feature flag to be enabled.
- Returns:
  metadata – An IncrementalLearningMetadata object representing the incremental learning metadata.
- Return type:
  IncrementalLearningMetadata
- start_incremental_learning(early_stopping_rounds=None)
Start incremental learning for this model.
Added in version v3.4.0.
This functionality requires the INCREMENTAL_LEARNING feature flag to be enabled.
- Parameters:
  early_stopping_rounds (Optional[int]) – The number of chunks in which no improvement is observed before the early stopping mechanism triggers.
- Return type:
  None
- Raises:
ClientError – If the server responded with a 4xx status.
- start_incremental_learning_from_sample(early_stopping_rounds=None, first_iteration_only=None)
Submit a job to the queue to perform the first incremental learning iteration training on an existing sample model. This functionality requires the SAMPLE_DATA_TO_START_PROJECT feature flag to be enabled.
- Parameters:
  early_stopping_rounds (Optional[int]) – The number of chunks in which no improvement is observed before the early stopping mechanism triggers.
  first_iteration_only (Optional[bool]) – Whether incremental learning training should be limited to the first iteration. If True, training is performed only for the first iteration. If False, training continues until early stopping conditions are met or the maximum number of iterations is reached. Defaults to False.
- Returns:
job – The created job that is retraining the model.
- Return type:
ModelJob
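A minimal sketch of the incremental learning flow described above, assuming the INCREMENTAL_LEARNING feature flag is enabled; the IDs are placeholders and the per-item fields follow the Notes on IncrementalLearningMetadata later in this section.
from datarobot._experimental.models.model import Model

model = Model.get(project="<project_id>", model_id="<model_id>")

# Start incremental learning; stop if three consecutive chunks show no improvement.
model.start_incremental_learning(early_stopping_rounds=3)

# Inspect progress chunk by chunk.
metadata = model.get_incremental_learning_metadata()
print(metadata.status, metadata.total_number_of_chunks)
for item in metadata.items:
    print(item["chunk_index"], item["status"], item.get("score"))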
- class datarobot._experimental.models.model.DatetimeModel
- get_feature_effect(source, backtest_index)
Retrieve Feature Effects for the model.
Feature Effects provides partial dependence and predicted vs. actual values for the top 500 features, ordered by feature impact score.
The partial dependence shows the marginal effect of a feature on the target variable after accounting for the average effects of all other predictive features. It indicates how, holding all other features at their observed values, the value of this feature affects your prediction.
Requires that Feature Effects has already been computed with request_feature_effect. See get_feature_effect_metadata for retrieving the available values of source and backtest_index.
- Parameters:
  source (str) – The source Feature Effects are retrieved for. One of the values in FeatureEffectMetadataDatetime.sources.
  backtest_index (str) – The backtest index to retrieve Feature Effects for. One of the values in FeatureEffectMetadataDatetime.backtest_index.
- Returns:
  feature_effects – The feature effects data.
- Return type:
  FeatureEffects
- Raises:
  ClientError – If the feature effects have not been computed or the source is not a valid value.
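A hedged sketch of the same retrieval for a datetime-partitioned model; the source and backtest_index values must come from get_feature_effect_metadata, and the literals shown here are illustrative only.
from datarobot._experimental.models.model import DatetimeModel

model = DatetimeModel.get(project="<project_id>", model_id="<model_id>")

fe_metadata = model.get_feature_effect_metadata()
source = "training"      # must be one of FeatureEffectMetadataDatetime.sources
backtest_index = "0"     # must be one of the available backtest indices

feature_effects = model.get_feature_effect(source=source, backtest_index=backtest_index)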
- datarobot._experimental.models.data_store.get_spark_session(self, db_token)
Returns a Spark session.
- Parameters:
  db_token (str) – A personal access token.
- Returns:
  A Spark session initialized with connection parameters taken from the DataStore and the provided db_token.
- Return type:
  SparkSession
Examples
>>> from datarobot._experimental.models.data_store import DataStore
>>> data_stores = DataStore.list(typ=DataStoreListTypes.DR_DATABASE_V1)
>>> data_stores
[DataStore('my_databricks_store_1')]
>>> db_connection = data_stores[0].get_spark_session('<token>')
>>> db_connection
<pyspark.sql.connect.session.SparkSession at 0x7f386068fbb0>
>>> df = db_connection.read.table("samples.nyctaxi.trips")
>>> df.show()
- class datarobot._experimental.models.data_store.DataStore
A data store, representing a database connection.
- Variables:
  id (str) – The ID of the data store.
  data_store_type (str) – The type of data store.
  canonical_name (str) – The user-friendly name of the data store.
  creator (str) – The ID of the user who created the data store.
  updated (datetime.datetime) – The time of the last update.
  params (DataStoreParameters) – A list specifying data store parameters.
  role (str) – Your access role for this data store.
  driver_class_type (str) – The driver class type of the data store.
- class datarobot._experimental.models.retraining.RetrainingUseCase
Retraining use case.
- Variables:
  id (str) – The ID of the use case.
  name (str) – The name of the use case.
- class datarobot._experimental.models.retraining.RetrainingPolicy
Retraining Policy.
- Variables:
  policy_id (str) – ID of the retraining policy.
  name (str) – Name of the retraining policy.
  description (str) – Description of the retraining policy.
  use_case (Optional[dict]) – Use case the retraining policy is associated with.
- classmethod list(deployment_id)
Lists all retraining policies associated with a deployment.
- Parameters:
  deployment_id (str) – ID of the deployment.
- Returns:
  policies – List of retraining policies associated with the deployment.
- Return type:
  List[RetrainingPolicy]
Examples
from datarobot import Deployment
from datarobot._experimental.models.retraining import RetrainingPolicy
deployment = Deployment.get(deployment_id='620ed0e37b6ce03244f19631')
RetrainingPolicy.list(deployment.id)
>>> [RetrainingPolicy('620ed248bb0a1f5889eb6aa7'), RetrainingPolicy('624f68be8828ed81bf487d8d')]
- classmethod get(deployment_id, retraining_policy_id)
Retrieves a retraining policy associated with a deployment.
- Parameters:
  deployment_id (str) – ID of the deployment.
  retraining_policy_id (str) – ID of the policy.
- Returns:
  retraining_policy – Retraining policy.
- Return type:
  RetrainingPolicy
Examples
from datarobot._experimental.models.retraining import RetrainingPolicy
policy = RetrainingPolicy.get(
    deployment_id='620ed0e37b6ce03244f19631',
    retraining_policy_id='624f68be8828ed81bf487d8d'
)
policy.id
>>> '624f68be8828ed81bf487d8d'
policy.name
>>> 'PolicyA'
- classmethod create(deployment_id, name, description=None, use_case_id=None)
Create a retraining policy associated with a deployment.
- Parameters:
  deployment_id (str) – The ID of the deployment.
  name (str) – The retraining policy name.
  description (str) – The retraining policy description.
  use_case_id (Optional[str]) – The ID of the Use Case that the retraining policy is associated with.
- Returns:
  retraining_policy – Retraining policy.
- Return type:
  RetrainingPolicy
Examples
from datarobot._experimental.models.retraining import RetrainingPolicy
policy = RetrainingPolicy.create(
    deployment_id='620ed0e37b6ce03244f19631',
    name='Retraining Policy A',
    use_case_id='678114c41e9114cabca27044',
)
policy.id
>>> '624f68be8828ed81bf487d8d'
- classmethod delete(deployment_id, retraining_policy_id)
Deletes a retraining policy associated with a deployment.
- Parameters:
  deployment_id (str) – ID of the deployment.
  retraining_policy_id (str) – ID of the policy.
- Return type:
  None
Examples
from datarobot._experimental.models.retraining import RetrainingPolicy
RetrainingPolicy.delete(
    deployment_id='620ed0e37b6ce03244f19631',
    retraining_policy_id='624f68be8828ed81bf487d8d'
)
- update_use_case(use_case_id)
Update the use case associated with this retraining policy.
- Parameters:
  use_case_id (str) – ID of the use case the retraining policy is associated with.
- Return type:
  RetrainingPolicy
Examples
from datarobot._experimental.models.retraining import RetrainingPolicy
policy = RetrainingPolicy.get(
    deployment_id='620ed0e37b6ce03244f19631',
    retraining_policy_id='624f68be8828ed81bf487d8d'
)
updated = policy.update_use_case(use_case_id='620ed0e37b6ce03244f19633')
updated.use_case.id
>>> '620ed0e37b6ce03244f19633'
- class datarobot._experimental.models.retraining.RetrainingPolicyRun
Retraining policy run.
- Variables:
  policy_run_id (str) – ID of the retraining policy run.
  status (str) – Status of the retraining policy run.
  challenger_id (str) – ID of the challenger model retrieved after running the policy.
  error_message (str) – The error message if an error occurs during the policy run.
  model_package_id (str) – ID of the model package (version) retrieved after the policy is run.
  project_id (str) – ID of the project the deployment is associated with.
  start_time (datetime.datetime) – Timestamp of when the policy run starts.
  finish_time (datetime.datetime) – Timestamp of when the policy run finishes.
- classmethod list(deployment_id, retraining_policy_id)
Lists all the retraining policy runs of a retraining policy that is associated with a deployment.
- Parameters:
  deployment_id (str) – ID of the deployment.
  retraining_policy_id (str) – ID of the policy.
- Returns:
  policy_runs – List of retraining policy runs.
- Return type:
  List[RetrainingPolicyRun]
Examples
from datarobot._experimental.models.retraining import RetrainingPolicyRun
RetrainingPolicyRun.list(
    deployment_id='620ed0e37b6ce03244f19631',
    retraining_policy_id='62f4448f0dfd5699feae3e6e'
)
>>> [RetrainingPolicyRun('620ed248bb0a1f5889eb6aa7'), RetrainingPolicyRun('624f68be8828ed81bf487d8d')]
- class datarobot._experimental.models.data_matching.DataMatching
Retrieves the closest data points for the input data.
This functionality is more than a simple lookup. To retrieve the closest data points, data matching first applies the DataRobot preprocessing pipeline and then searches for the closest data points. The returned values are the closest data points at the point of entry to the model.
- Three sets of methods are supported:
  Methods to build the index (for a project, model, or featurelist). The index must be built before searching for the closest data points; once built, it is reused.
  Methods to search for the closest data points (for a project, model, or featurelist). These methods initialize the query, await its completion, and then save the result as a CSV file in the specified location.
  Additional methods to manually list the history of queries and retrieve their results.
- get_query_url(url, number_of_data=None)
Returns the formatted data matching query URL.
- Return type:
str
- get_closest_data(query_file_path, number_of_data=None, max_wait=600, build_index_if_missing=True)
Retrieves the closest data points to the data point in the input file. If the index is missing, by default the method will try to build it.
- Parameters:
  query_file_path (str) – Path to the file with the data point to search the closest data points for.
  number_of_data (int or None) – Number of results to search for. If no value is specified, the default is 10.
  max_wait (int) – Number of seconds to wait for the result. Default is 600.
  build_index_if_missing (Optional[bool]) – Whether the index should be created if it is missing. If False and the index is missing, an exception is raised. Default is True.
- Returns:
  df – Dataframe with the query result.
- Return type:
  pd.DataFrame
- get_closest_data_for_model(model_id, query_file_path, number_of_data=None, max_wait=600, build_index_if_missing=True)
Retrieves the closest data points to the data point in the input file. If the index is missing, by default the method will try to build it.
- Parameters:
  model_id (str) – ID of the model to search the closest data points for.
  query_file_path (str) – Path to the file with the data point to search the closest data points for.
  number_of_data (int or None) – Number of results to search for. If no value is specified, the default is 10.
  max_wait (int) – Number of seconds to wait for the result. Default is 600.
  build_index_if_missing (Optional[bool]) – Whether the index should be created if it is missing. If False and the index is missing, an exception is raised. Default is True.
- Returns:
  df – Dataframe with the query result.
- Return type:
  pd.DataFrame
- get_closest_data_for_featurelist(featurelist_id, query_file_path, number_of_data=None, max_wait=600, build_index_if_missing=True)
Retrieves the closest data points to the data point in the input file. If the index is missing, by default the method will try to build it.
- Parameters:
  featurelist_id (str) – ID of the featurelist to search the closest data points for.
  query_file_path (str) – Path to the file with the data point to search the closest data points for.
  number_of_data (int or None) – Number of results to search for. If no value is specified, the default is 10.
  max_wait (int) – Number of seconds to wait for the result. Default is 600.
  build_index_if_missing (bool) – Whether the index should be created if it is missing. If False and the index is missing, an exception is raised. Default is True.
- Returns:
  df – Dataframe with the query result.
- Return type:
  pd.DataFrame
- build_index(max_wait=600)
Builds the data matching index and waits for its completion.
- Parameters:
  max_wait (int or None) – Seconds to wait for the completion of the build index operation. Default is 600. If 0 or None is passed, the method exits without awaiting completion of the build index operation.
- Return type:
  None
- build_index_for_featurelist(featurelist_id, max_wait=600)
Builds the data matching index for a featurelist and waits for its completion.
- Parameters:
  featurelist_id (str) – ID of the featurelist to build the index for.
  max_wait (int or None) – Seconds to wait for the completion of the build index operation. Default is 600. If 0 or None is passed, the method exits without awaiting completion of the build index operation.
- Return type:
  None
- build_index_for_model(model_id, max_wait=600)
Builds the data matching index for a model and waits for its completion.
- Parameters:
  model_id (str) – ID of the model to build the index for.
  max_wait (int or None) – Seconds to wait for the completion of the build index operation. Default is 600. If 0 or None is passed, the method exits without awaiting completion of the build index operation.
- Return type:
  None
- list()
Lists all data matching queries for the project. Results are sorted in descending order, from newest to oldest.
- Return type:
List[DataMatchingQuery]
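A minimal sketch of the index-then-query flow described above. The constructor arguments are an assumption (this excerpt does not show how DataMatching is instantiated), and the file path and IDs are placeholders.
from datarobot._experimental.models.data_matching import DataMatching

data_matching = DataMatching(project_id="<project_id>")  # assumed constructor

# Build the project-level index once; it is reused by later queries.
data_matching.build_index(max_wait=600)

# Search for the 10 closest data points to the record(s) in the query file.
df = data_matching.get_closest_data(query_file_path="query_point.csv", number_of_data=10)
print(df.head())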
- class datarobot._experimental.models.data_matching.DataMatchingQuery
Data Matching Query object.
Represents a single query for the closest data points. Once the related query job is completed, its result can be retrieved and saved as a CSV file in a specified location.
- classmethod list(project_id)
Retrieves the list of queries.
- Parameters:
  project_id (str) – Project ID to retrieve data matching queries for.
- Return type:
  List[DataMatchingQuery]
- save_result(file_path)
Downloads the query result and saves it in file_path location.
- Parameters:
  file_path (str) – Path location where to save the query result.
- Return type:
  None
- get_result()
Returns the query result as a dataframe.
- Returns:
  df (pd.DataFrame) – Dataframe with the query result.
- Return type:
  pd.DataFrame
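A short sketch of listing past data matching queries for a project and retrieving the newest result; the project ID is a placeholder.
from datarobot._experimental.models.data_matching import DataMatchingQuery

queries = DataMatchingQuery.list(project_id="<project_id>")
if queries:
    latest = queries[0]                       # queries are listed newest first
    latest.save_result("closest_points.csv")  # save the result as a CSV file
    df = latest.get_result()                  # or load it directly as a dataframe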
- class datarobot._experimental.models.model_lineage.FeatureCountByType
Contains information about a feature type and how many features in the dataset are of this type.
- Variables:
  feature_type (str) – The feature type grouped in this count.
  count (int) – The number of features of this type.
- class datarobot._experimental.models.model_lineage.User
Contains information about a user.
- Variables:
  id (str) – ID of the user.
  full_name (Optional[str]) – Full name of the user.
  email (Optional[str]) – Email address of the user.
  user_hash (Optional[str]) – User’s gravatar hash.
  user_name (Optional[str]) – Username of the user.
- class datarobot._experimental.models.model_lineage.ReferencedInUseCase
Contains information about the reference of a dataset in a Use Case.
- Variables:
  added_to_use_case_by (User) – The user who added the dataset to the Use Case.
  added_to_use_case_at (datetime.datetime) – The time when the dataset was added to the Use Case.
- class datarobot._experimental.models.model_lineage.DatasetInfo
Contains information about the dataset.
- Variables:
  dataset_name (str) – Dataset name.
  dataset_version_id (str) – Dataset version ID.
  dataset_id (str) – Dataset ID.
  number_of_rows (int) – Number of rows in the dataset.
  file_size (int) – Size of the dataset as a CSV file, in bytes.
  number_of_features (int) – Number of features in the dataset.
  number_of_feature_by_type (List[FeatureCountByType]) – Number of features in the dataset, grouped by feature type.
  referenced_in_use_case (Optional[ReferencedInUseCase]) – Information about the reference of this dataset in the Use Case. This information will only be present if the use_case_id was passed to ModelLineage.get.
- class datarobot._experimental.models.model_lineage.FeatureWithMissingValues
Contains information about the number of missing values for one feature.
- Variables:
  feature_name (str) – Name of the feature.
  number_of_missing_values (int) – Number of missing values for this feature.
- class datarobot._experimental.models.model_lineage.FeaturelistInfo
Contains information about the featurelist.
- Variables:
  featurelist_name (str) – Featurelist name.
  featurelist_id (str) – Featurelist ID.
  number_of_features (int) – Number of features in the featurelist.
  number_of_feature_by_type (List[FeatureCountByType]) – Number of features in the featurelist, grouped by feature type.
  number_of_features_with_missing_values (int) – Number of features in the featurelist with at least one missing value.
  number_of_missing_values (int) – Number of missing values across all features of the featurelist.
  features_with_most_missing_values (List[FeatureWithMissingValues]) – List of features with the most missing values.
  description (str) – Description of the featurelist.
- class datarobot._experimental.models.model_lineage.TargetInfo
Contains information about the target.
- Variables:
  name (str) – Name of the target feature.
  target_type (str) – Project type resulting from the selected target.
  positive_class_label (Optional[Union[str, int, float]]) – Positive class label. For every project type except Binary Classification, this value will be null.
  mean (Optional[float]) – Mean of the target. This field will only be available for Binary Classification, Regression, and Min Inflated projects.
- class datarobot._experimental.models.model_lineage.PartitionInfo
Contains information about project partitioning.
- Variables:
  validation_type (str) – Either CV for cross-validation or TVH for train-validation-holdout split.
  cv_method (str) – Partitioning method used.
  holdout_pct (float) – Percentage of the dataset reserved for the holdout set.
  datetime_col (Optional[str]) – If a date partition column was used, the name of the column. Note that datetime_col applies to an old partitioning method no longer supported for new projects, as of API version v2.0.
  datetime_partition_column (Optional[str]) – If a datetime partition column was used, the name of the column.
  validation_pct (Optional[float]) – If train-validation-holdout split was used, the percentage of the dataset used for the validation set.
  reps (Optional[float]) – If cross-validation was used, the number of folds to use.
  cv_holdout_level (Optional[Union[str, float, int]]) – If a user partition column was used with cross-validation, the value assigned to the holdout set.
  holdout_level (Optional[Union[str, float, int]]) – If a user partition column was used with train-validation-holdout split, the value assigned to the holdout set.
  user_partition_col (Optional[str]) – If a user partition column was used, the name of the column.
  training_level (Optional[Union[str, float, int]]) – If a user partition column was used with train-validation-holdout split, the value assigned to the training set.
  partition_key_cols (Optional[List[str]]) – A list containing a single string, the name of the group partition column.
  validation_level (Optional[Union[str, float, int]]) – If a user partition column was used with train-validation-holdout split, the value assigned to the validation set.
  use_time_series (Optional[bool]) – A boolean value indicating whether a time series project was created by using datetime partitioning. Otherwise, datetime partitioning created an OTV project.
- class datarobot._experimental.models.model_lineage.ProjectInfo
Contains information about the project.
- Variables:
  project_name (str) – Name of the project.
  project_id (str) – Project ID.
  partition (PartitionInfo) – Partitioning settings of the project.
  metric (str) – Project metric used to select the best-performing models.
  created_by (User) – User who created the project.
  created_at (Optional[datetime.datetime]) – Time when the project was created.
  target (Optional[TargetInfo]) – Information about the target.
- class datarobot._experimental.models.model_lineage.ModelInfo
Contains information about the model.
- Variables:
  blueprint_tasks (List[str]) – Tasks that make up the blueprint.
  blueprint_id (str) – Blueprint ID.
  model_type (str) – Model type.
  sample_size (Optional[int]) – Number of rows this model was trained on.
  sample_percentage (Optional[float]) – Percentage of the dataset the model was trained on.
  milliseconds_to_predict_1000_rows (Optional[float]) – Estimate of how many milliseconds it takes to predict 1000 rows. The estimate is based on the time it took to predict the holdout set.
  serialized_blueprint_file_size (Optional[int]) – Size of the serialized blueprint, in bytes.
- class datarobot._experimental.models.model_lineage.ModelLineage
Contains information about the lineage of a model.
- Variables:
  dataset (DatasetInfo) – Information about the dataset this model was created with.
  featurelist (FeaturelistInfo) – Information about the featurelist used to train this model.
  project (ProjectInfo) – Information about the project this model was created in.
  model (ModelInfo) – Information about the model itself.
- classmethod get(model_id, use_case_id=None)
Retrieve lineage information about a trained model. If you pass the optional use_case_id parameter, this class will contain additional information.
- Parameters:
  model_id (str) – Model ID.
  use_case_id (Optional[str]) – Use Case ID.
- Return type:
  ModelLineage
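A minimal sketch of retrieving model lineage; the IDs are placeholders, and passing use_case_id adds the referenced_in_use_case details to the dataset information.
from datarobot._experimental.models.model_lineage import ModelLineage

lineage = ModelLineage.get(model_id="<model_id>", use_case_id="<use_case_id>")

# Each attribute groups one part of the lineage: project, dataset, featurelist, model.
print(lineage.project.project_name, lineage.project.metric)
print(lineage.dataset.dataset_name, lineage.dataset.number_of_rows)
print(lineage.featurelist.featurelist_name, lineage.model.model_type)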
- class datarobot._experimental.models.incremental_learning.IncrementalLearningItem
- class datarobot._experimental.models.incremental_learning.IncrementalLearningMetadata
Incremental learning metadata for an incremental model.
Added in version v3.4.0.
- Variables:
  project_id (str) – The project ID.
  model_id (str) – The model ID.
  user_id (str) – The ID of the user who started incremental learning.
  featurelist_id (str) – The ID of the featurelist the model is using.
  status (str) – The status of incremental training. One of datarobot._experimental.models.enums.IncrementalLearningStatus.
  items (List[IncrementalLearningItemDoc]) – An array of incremental learning items associated with the sequential order of chunks. See the incremental item info in Notes for more details.
  sample_pct (float) – The sample size in percent (1 to 100) to use in training.
  training_row_count (int) – The number of rows used to train a model.
  score (float) – The validation score of the model.
  metric (str) – The name of the scoring metric.
  early_stopping_rounds (int) – The number of chunks in which no improvement is observed that triggers the early stopping mechanism.
  total_number_of_chunks (int) – The total number of chunks.
  model_number (int) – The number of the model in the project.
Notes
An incremental item is a dict containing the following:
- chunk_index: int
The incremental learning order in which chunks are trained.
- status: str
The status of training the current chunk. One of datarobot._experimental.models.enums.IncrementalLearningItemStatus.
- model_id: str
The ID of the model associated with the current item (chunk).
- parent_model_id: str
The ID of the model based on which the current item (chunk) is trained.
- data_stage_id: str
The ID of the data stage.
- sample_pct: float
The cumulative percentage of the base dataset size used for training the model.
- training_row_count: int
The number of rows used to train a model.
- score: float
The validation score of the current model.
- class datarobot._experimental.models.chunking_service.ChunkStorage
The chunk storage location for the data chunks.
- Variables:
  storage_reference_id (str) – The ID of the storage entity.
  chunk_storage_type (str) – The type of the chunk storage.
  version_id (str) – The catalog version ID. This will only be used if the storage type is “AI Catalog”.
- class datarobot._experimental.models.chunking_service.Chunk
Data chunk object that holds metadata about a chunk.
- Variables:
  id (str) – The ID of the chunk entity.
  chunk_definition_id (str) – The ID of the dataset chunk definition the chunk belongs to.
  limit (int) – The number of rows in the chunk.
  offset (int) – The offset in the dataset to create the chunk.
  chunk_index (str) – The index of the chunk if chunks are divided uniformly. Otherwise, it is None.
  data_source_id (str) – The ID of the data request used to create the chunk.
  chunk_storage (ChunkStorage) – A list of storage locations where the chunk is stored.
- get_chunk_storage_id(storage_type)
Get storage location ID for the chunk.
- Parameters:
  storage_type (ChunkStorageType) – The storage type where the chunk is stored.
- Returns:
  storage_reference_id – An ID that references the storage location for the chunk.
- Return type:
  str
- get_chunk_storage_version_id(storage_type)
Get storage version ID for the chunk.
- Parameters:
  storage_type (ChunkStorageType) – The storage type where the chunk is stored.
- Returns:
  storage_reference_id – A catalog version ID associated with the AI Catalog dataset ID.
- Return type:
  str
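A hedged sketch of resolving a chunk’s storage location; the import path for ChunkStorageType is an assumption, and the IDs are placeholders.
from datarobot._experimental.models.chunking_service import DatasetChunkDefinition
from datarobot._experimental.models.enums import ChunkStorageType  # assumed import path

chunk = DatasetChunkDefinition.get_chunk(
    dataset_chunk_definition_id="<dataset_chunk_definition_id>",
    chunk_id="<chunk_id>",
)

# Resolve the storage reference (for example, a data stage ID) for the chunk.
storage_id = chunk.get_chunk_storage_id(ChunkStorageType.DATASTAGE)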
- class datarobot._experimental.models.chunking_service.DatasourceDefinition
Data source definition that holds data source information for API responses. Do not use this to create DatasourceDefinition objects directly; use DatasourceAICatalogInfo or DatasourceDataWarehouseInfo instead.
- Variables:
  id (str) – The ID of the data source definition.
  data_store_id (str) – The ID of the data store.
  credentials_id (str) – The ID of the credentials.
  table (str) – The data source table name.
  schema (str) – The schema name of the data source.
  catalog (str) – The database or catalog name.
  storage_origin (str) – The origin data source or data warehouse (e.g., Snowflake, BigQuery).
  data_source_id (str) – The ID of the data request used to generate sampling and metadata.
  total_rows (str) – The total number of rows in the dataset.
  source_size (str) – The size of the dataset.
  estimated_size_per_row (str) – The estimated size per row.
  columns (str) – The list of column names in the dataset.
  order_by_columns (List[str]) – A list of columns used to sort the dataset.
  is_descending_order (bool) – Orders the direction of the data. Defaults to False, ordering from smallest to largest.
  select_columns (List[str]) – A list of columns to select from the dataset.
  datetime_partition_column (str) – The datetime partition column name used in OTV projects.
  validation_pct (float) – The percentage threshold between 0.1 and 1.0 for the first chunk validation.
  validation_limit_pct (float) – The percentage threshold between 0.1 and 1.0 for the validation kept.
  validation_start_date (datetime.datetime) – The start date for validation.
  validation_end_date (datetime.datetime) – The end date for validation.
  training_end_date (datetime.datetime) – The end date for training.
  latest_timestamp (datetime.datetime) – The latest timestamp.
  earliest_timestamp (datetime.datetime) – The earliest timestamp.
- class datarobot._experimental.models.chunking_service.DatasourceDataWarehouseInfo
Data source information used at creation time with a dataset chunk definition. Supported data warehouses: Snowflake, BigQuery, Databricks.
- Variables:
  name (str) – The optional custom name of the data source.
  table (str) – The data source table name or AI Catalog dataset name.
  storage_origin (str) – The origin data source or data warehouse (e.g., Snowflake, BigQuery).
  data_store_id (str) – The ID of the data store.
  credentials_id (str) – The ID of the credentials.
  schema (str) – The schema name of the data source.
  catalog (str) – The database or catalog name.
  data_source_id (str) – The ID of the data request used to generate sampling and metadata.
  order_by_columns (List[str]) – A list of columns used to sort the dataset.
  is_descending_order (bool) – Orders the direction of the data. Defaults to False, ordering from smallest to largest.
  select_columns (List[str]) – A list of columns to select from the dataset.
  datetime_partition_column (str) – The datetime partition column name used in OTV projects.
  validation_pct (float) – The percentage threshold between 0.1 and 1.0 for the first chunk validation.
  validation_limit_pct (float) – The percentage threshold between 0.1 and 1.0 for the validation kept.
  validation_start_date (datetime.datetime) – The start date for validation.
  validation_end_date (datetime.datetime) – The end date for validation.
  training_end_date (datetime.datetime) – The end date for training.
  latest_timestamp (datetime.datetime) – The latest timestamp.
  earliest_timestamp (datetime.datetime) – The earliest timestamp.
- class datarobot._experimental.models.chunking_service.DatasourceAICatalogInfo
AI Catalog data source information used at creation time with dataset chunk definition.
- Variables:
  name (str) – The optional custom name of the data source.
  table (str) – The data source table name or AI Catalog dataset name.
  storage_origin (str) – The origin data source, always the AI Catalog type.
  catalog_id (str) – The ID of the AI Catalog dataset.
  catalog_version_id (str) – The ID of the AI Catalog dataset version.
  order_by_columns (List[str]) – A list of columns used to sort the dataset.
  is_descending_order (bool) – Orders the direction of the data. Defaults to False, ordering from smallest to largest.
  select_columns (List[str]) – A list of columns to select from the dataset.
  datetime_partition_column (str) – The datetime partition column name used in OTV projects.
  validation_pct (float) – The percentage threshold between 0.1 and 1.0 for the first chunk validation.
  validation_limit_pct (float) – The percentage threshold between 0.1 and 1.0 for the validation kept.
  validation_start_date (datetime.datetime) – The start date for validation.
  validation_end_date (datetime.datetime) – The end date for validation.
  training_end_date (datetime.datetime) – The end date for training.
  latest_timestamp (datetime.datetime) – The latest timestamp.
  earliest_timestamp (datetime.datetime) – The earliest timestamp.
- class datarobot._experimental.models.chunking_service.DatasetChunkDefinition
Dataset chunking definition that holds information about how to chunk the dataset.
- Variables:
  id (str) – The ID of the dataset chunk definition.
  user_id (str) – The ID of the user who created the definition.
  name (str) – The name of the dataset chunk definition.
  project_starter_chunk_size (int) – The size, in bytes, of the project starter chunk.
  user_chunk_size (int) – Chunk size in bytes.
  datasource_definition_id (str) – The data source definition ID associated with the dataset chunk definition.
  chunking_type (ChunkingType) – The type of chunk creation from the dataset. All possible chunking types can be found under the ChunkingType enum, which can be imported from datarobot._experimental.models.enums. Types include:
    INCREMENTAL_LEARNING for non-time-aware projects that use a chunk index to create chunks.
    INCREMENTAL_LEARNING_OTV for OTV projects that use a chunk index to create chunks.
    SLICED_OFFSET_LIMIT for any dataset in which the user provides an offset and limit to create chunks.
    SLICED_OFFSET_LIMIT has no index-based chunks, i.e. the create_chunk_by_index() method is not supported.
- classmethod get(dataset_chunk_definition_id)
Retrieve the metadata of a specific dataset chunk definition.
- Parameters:
  dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.
- Returns:
  dataset_chunk_definition – The queried instance.
- Return type:
  DatasetChunkDefinition
- classmethod list(limit=50, offset=0)
Retrieves a list of dataset chunk definitions.
- Parameters:
  limit (int) – The maximum number of objects to return. Default is 50.
  offset (int) – The starting offset of the results. Default is 0.
- Returns:
  dataset_chunk_definitions – The list of dataset chunk definitions.
- Return type:
  List[DatasetChunkDefinition]
- classmethod create(name, project_starter_chunk_size, user_chunk_size, datasource_info, chunking_type=ChunkingType.INCREMENTAL_LEARNING)
Create a dataset chunk definition. Required for both index-based and custom chunks.
In order to create a dataset chunk definition, you must first:
  Create a data connection to the target data source via dr.DataStore.create()
  Create credentials attached to that data connection via dr.Credential.create()
If you have an existing data connection and credentials:
  Retrieve the data store ID by its canonical name via: [ds for ds in dr.DataStore.list() if ds.canonical_name == <name>][0].id
  Retrieve the credential ID by its name via: [cr for cr in dr.Credential.list() if cr.name == <name>][0].id
You must create the required ‘datasource_info’ object with the data source information that corresponds to your use case (see the sketch after this method):
  DatasourceAICatalogInfo for AI Catalog datasets.
  DatasourceDataWarehouseInfo for Snowflake, BigQuery, or other data warehouses.
- Parameters:
  name (str) – The name of the dataset chunk definition.
  project_starter_chunk_size (int) – The size, in bytes, of the first chunk. Used to start a DataRobot project.
  user_chunk_size (int) – The size, in bytes, of the user-defined incremental chunk.
  datasource_info (Union[DatasourceDataWarehouseInfo, DatasourceAICatalogInfo]) – The object that contains the information of the data source.
  chunking_type (ChunkingType) – The type of chunk creation from the dataset. All possible chunking types can be found under the ChunkingType enum, which can be imported from datarobot._experimental.models.enums. Types include:
    INCREMENTAL_LEARNING for non-time-aware projects that use a chunk index to create chunks.
    INCREMENTAL_LEARNING_OTV for OTV projects that use a chunk index to create chunks.
    SLICED_OFFSET_LIMIT for any dataset in which the user provides an offset and limit to create chunks.
    SLICED_OFFSET_LIMIT has no index-based chunks, i.e. the create_chunk_by_index() method is not supported. The default type is ChunkingType.INCREMENTAL_LEARNING.
- Returns:
  dataset_chunk_definition – An instance of a created dataset chunk definition.
- Return type:
  DatasetChunkDefinition
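A minimal sketch of the creation steps above for an AI Catalog dataset. It assumes DatasourceAICatalogInfo accepts keyword arguments matching its documented attributes; the IDs and sizes are placeholders.
from datarobot._experimental.models.chunking_service import (
    DatasetChunkDefinition,
    DatasourceAICatalogInfo,
)

# Describe the AI Catalog dataset to chunk (assumed keyword-argument constructor).
datasource_info = DatasourceAICatalogInfo(
    table="my_dataset",
    catalog_id="<catalog_id>",
    catalog_version_id="<catalog_version_id>",
)

chunk_definition = DatasetChunkDefinition.create(
    name="my chunk definition",
    project_starter_chunk_size=100 * 1024 * 1024,  # 100 MB starter chunk
    user_chunk_size=20 * 1024 * 1024,              # 20 MB incremental chunks
    datasource_info=datasource_info,
)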
- classmethod get_datasource_definition(dataset_chunk_definition_id)
Retrieves the data source definition associated with a dataset chunk definition.
- Parameters:
  dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.
- Returns:
  datasource_definition – An instance of the created data source definition.
- Return type:
  DatasourceDefinition
- classmethod get_chunk(dataset_chunk_definition_id, chunk_id)
Retrieves a specific data chunk associated with a dataset chunk definition.
- Parameters:
  dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.
  chunk_id (str) – The ID of the chunk.
- Returns:
  chunk – An instance of the created chunk.
- Return type:
  Chunk
- classmethod list_chunks(dataset_chunk_definition_id)
Retrieves all data chunks associated with a dataset chunk definition.
- Parameters:
  dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.
- Returns:
  chunks – A list of chunks.
- Return type:
  List[Chunk]
- analyze_dataset(max_wait_time=600)
Analyzes the data source to retrieve and compute metadata about the dataset.
Depending on the size of the dataset, adding order_by_columns to the dataset chunking definition will increase the execution time to create the data chunk. Set max_wait_time to an appropriate wait time.
- Parameters:
  max_wait_time (int) – Maximum time to wait for completion.
- Returns:
  datasource_definition – An instance of the created data source definition.
- Return type:
  DatasourceDefinition
- create_chunk(limit, offset=0, storage_type=ChunkStorageType.DATASTAGE, max_wait_time=600)
Creates a data chunk using the limit and offset. By default, the data chunk is stored in data stages.
Depending on the size of the dataset, adding order_by_columns to the dataset chunking definition will increase the execution time to retrieve or create the data chunk. Set max_wait_time to an appropriate wait time.
- Parameters:
  limit (int) – The maximum number of rows.
  offset (int) – The offset into the dataset (where reading begins).
  storage_type (ChunkStorageType) – The storage location of the chunk.
  max_wait_time (int) – Maximum time to wait for completion.
- Returns:
  chunk – An instance of a created or updated chunk.
- Return type:
  Chunk
- create_chunk_by_index(index, storage_type=ChunkStorageType.DATASTAGE, max_wait_time=600)
Creates a data chunk using the chunk index. By default, the data chunk is stored in data stages.
Depending on the size of the dataset, adding order_by_columns to the dataset chunking definition will increase the execution time to retrieve or create the data chunk. Set max_wait_time to an appropriate wait time.
- Parameters:
  index (int) – The index of the chunk.
  storage_type (ChunkStorageType) – The storage location of the chunk.
  max_wait_time (int) – Maximum time to wait for completion.
- Returns:
  chunk – An instance of a created or updated chunk.
- Return type:
  Chunk
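A hedged sketch of analyzing a dataset chunk definition and then creating chunks, either by index or by an explicit offset/limit window; the definition ID is a placeholder.
from datarobot._experimental.models.chunking_service import DatasetChunkDefinition

chunk_definition = DatasetChunkDefinition.get("<dataset_chunk_definition_id>")

# Analyze the data source first so row counts and other metadata are available.
datasource_definition = chunk_definition.analyze_dataset(max_wait_time=600)

# Index-based chunks (INCREMENTAL_LEARNING / INCREMENTAL_LEARNING_OTV definitions)...
first_chunk = chunk_definition.create_chunk_by_index(index=0)
second_chunk = chunk_definition.create_chunk_by_index(index=1)

# ...or an explicit window of rows; chunks are stored in data stages by default.
custom_chunk = chunk_definition.create_chunk(limit=10000, offset=0)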
- classmethod patch_validation_dates(dataset_chunk_definition_id, validation_start_date, validation_end_date)
Updates the data source definition validation dates associated with a dataset chunk definition. In order to set the validation dates appropriately, both start and end dates should be specified. This method can only be used for INCREMENTAL_LEARNING_OTV dataset chunk definitions and their associated data source definitions.
- Parameters:
  dataset_chunk_definition_id (str) – The ID of the dataset chunk definition.
  validation_start_date (datetime.datetime) – The start date of validation scoring data. Internally converted to the format ‘%Y-%m-%d %H:%M:%S’; the timezone defaults to UTC.
  validation_end_date (datetime.datetime) – The end date of validation scoring data. Internally converted to the format ‘%Y-%m-%d %H:%M:%S’; the timezone defaults to UTC.
- Returns:
  datasource_definition – An instance of the created data source definition.
- Return type:
  DatasourceDefinition
- class datarobot._experimental.models.chunking_service_v2.DatasetProps
The dataset props for a catalog dataset.
- Variables:
  dataset_id (str) – The ID of the AI Catalog dataset.
  dataset_version_id (str) – The ID of the AI Catalog dataset version.
- class datarobot._experimental.models.chunking_service_v2.DatasetInfo
The dataset information.
- Variables:
  total_rows (str) – The total number of rows in the dataset.
  source_size (str) – The size of the dataset.
  estimated_size_per_row (str) – The estimated size per row.
  columns (str) – The list of column names in the dataset.
  dialect (str) – The SQL dialect associated with the dataset (e.g., Snowflake, BigQuery, Spark).
  data_store_id (str) – The ID of the data store.
  data_source_id (str) – The ID of the data request used to generate sampling and metadata.
- class datarobot._experimental.models.chunking_service_v2.DynamicDatasetProps
The dataset props for a dynamic dataset.
- Variables:
  credentials_id (str) – The ID of the credentials.
- class datarobot._experimental.models.chunking_service_v2.DatasetDefinition
Dataset definition that holds dataset information for API responses.
- Variables:
  id (str) – The ID of the dataset definition.
  creator_user_id (str) – The ID of the user.
  dataset_props (DatasetProps) – The properties of the dataset in the catalog.
  dynamic_dataset_props (DynamicDatasetProps) – The properties of the dynamic dataset.
  dataset_info (DatasetInfo) – The information about the dataset.
  name (str) – The optional custom name of the dataset definition.
- classmethod from_data(data)
Properly convert composition classes.
- Return type:
  DatasetDefinition
- classmethod create(dataset_id, dataset_version_id=None, name=None, credentials_id=None)
Create a dataset definition.
In order to create a dataset definition, you must first have an existing dataset in the Data Registry. A dataset can be uploaded using dr.Dataset.create_from_file, for example, if you have a file.
If you have an existing dataset in the Data Registry (see the sketch after this method):
  Retrieve the dataset ID by its name via: [d for d in dr.Dataset.list() if d.name == <name>][0].id
  Retrieve the dataset version ID by its name via: [d for d in dr.Dataset.list() if d.name == <name>][0].version_id
- Parameters:
  dataset_id (str) – The ID of the AI Catalog dataset.
  dataset_version_id (str) – The optional ID of the AI Catalog dataset version.
  name (str) – The optional custom name of the dataset definition.
  credentials_id (str) – The optional ID of the credentials to access the data store.
- Returns:
  dataset_definition – An instance of a created dataset definition.
- Return type:
  DatasetDefinition
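A minimal sketch of the lookup-then-create flow described above; the dataset name is a placeholder.
import datarobot as dr
from datarobot._experimental.models.chunking_service_v2 import DatasetDefinition

# Look up an existing Data Registry dataset by name.
dataset = [d for d in dr.Dataset.list() if d.name == "my_dataset"][0]

dataset_definition = DatasetDefinition.create(
    dataset_id=dataset.id,
    dataset_version_id=dataset.version_id,
    name="my dataset definition",
)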
- classmethod get(dataset_definition_id)
Retrieve the metadata of a specific dataset definition.
- Parameters:
  dataset_definition_id (str) – The ID of the dataset definition.
- Returns:
  dataset_definition – The queried instance.
- Return type:
  DatasetDefinition
- classmethod delete(dataset_definition_id)
Delete a specific dataset definition.
- Parameters:
  dataset_definition_id (str) – The ID of the dataset definition.
- Return type:
  None
- classmethod list()
List all dataset definitions.
- Return type:
  List[DatasetDefinition]
- classmethod analyze(dataset_definition_id, max_wait=600)
Analyze a specific dataset definition.
- Parameters:
  dataset_definition_id (str) – The ID of the dataset definition.
  max_wait (Optional[int]) – Time in seconds after which analysis is considered unsuccessful.
- Return type:
  None
- class datarobot._experimental.models.chunking_service_v2.RowsChunkDefinition
The rows chunk information.
- Variables:
  order_by_columns (List[str]) – List of the sorting column names.
  is_descending_order (bool) – The sorting order. Defaults to False, ordering from smallest to largest.
  target_column (str) – The target column.
  target_class (str) – For a binary target, one of the possible values. For zero inflated, the value will be ‘0’.
  user_group_column (str) – The user group column.
  datetime_partition_column (str) – The datetime partition column name used in OTV projects.
  otv_validation_start_date (datetime.datetime) – The start date for the validation set.
  otv_validation_end_date (datetime.datetime) – The end date for the validation set.
  otv_training_end_date (datetime.datetime) – The end date for the training set.
  otv_latest_timestamp (datetime.datetime) – The latest timestamp; this field is auto-generated.
  otv_earliest_timestamp (datetime.datetime) – The earliest timestamp; this field is auto-generated.
  otv_validation_downsampling_pct (float) – The percentage of the validation set to downsample; this field is auto-generated.
- class datarobot._experimental.models.chunking_service_v2.FeaturesChunkDefinition
The features chunk information.
- class datarobot._experimental.models.chunking_service_v2.ChunkDefinitionStats
The chunk stats information.
- Variables:
  expected_chunk_size (int) – The expected chunk size; this field is auto-generated.
  number_of_rows_per_chunk (int) – The number of rows per chunk; this field is auto-generated.
  total_number_of_chunks (int) – The total number of chunks; this field is auto-generated.
- class datarobot._experimental.models.chunking_service_v2.ChunkDefinition
The chunk information.
- Variables:
  id (str) – The ID of the chunk entity.
  dataset_definition_id (str) – The ID of the dataset definition.
  name (str) – The name of the chunk entity.
  is_readonly (bool) – The read-only flag.
  partition_method (str) – The partition method used to create chunks: ‘random’, ‘stratified’, or ‘date’.
  chunking_strategy_type (str) – The chunking strategy type, either ‘features’ or ‘rows’.
  chunk_definition_stats (ChunkDefinitionStats) – The chunk stats information.
  rows_chunk_definition (RowsChunkDefinition) – The rows chunk information.
  features_chunk_definition (FeaturesChunkDefinition) – The features chunk information.
- classmethod from_data(data)
Properly convert composition classes.
- Return type:
  ChunkDefinition
- classmethod create(dataset_definition_id, name=None, partition_method=ChunkingPartitionMethod.RANDOM, chunking_strategy_type=ChunkingStrategy.ROWS, order_by_columns=None, is_descending_order=False, target_column=None, target_class=None, user_group_column=None, datetime_partition_column=None, otv_validation_start_date=None, otv_validation_end_date=None, otv_training_end_date=None)
Create a chunk definition.
- Parameters:
  dataset_definition_id (str) – The ID of the dataset definition.
  name (str) – The optional custom name of the chunk definition.
  partition_method (str) – The partition method used to create chunks: ‘random’, ‘stratified’, or ‘date’.
  chunking_strategy_type (str) – The chunking strategy type, either ‘features’ or ‘rows’.
  order_by_columns (List[str]) – List of the sorting column names.
  is_descending_order (bool) – The sorting order. Defaults to False, ordering from smallest to largest.
  target_column (str) – The target column.
  target_class (str) – For a binary target, one of the possible values. For zero inflated, the value will be ‘0’.
  user_group_column (str) – The user group column.
  datetime_partition_column (str) – The datetime partition column name used in OTV projects.
  otv_validation_start_date (datetime.datetime) – The start date for the validation set.
  otv_validation_end_date (datetime.datetime) – The end date for the validation set.
  otv_training_end_date (datetime.datetime) – The end date for the training set.
- Returns:
  chunk_definition – An instance of a created chunk definition.
- Return type:
  ChunkDefinition
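A hedged sketch of creating and then analyzing a rows-based chunk definition; the IDs and column names are placeholders.
from datarobot._experimental.models.chunking_service_v2 import ChunkDefinition

chunk_definition = ChunkDefinition.create(
    dataset_definition_id="<dataset_definition_id>",
    name="my chunk definition",
    order_by_columns=["date"],
    target_column="target",
)

# Analysis populates the auto-generated statistics (expected chunk size, row counts).
ChunkDefinition.analyze(
    dataset_definition_id="<dataset_definition_id>",
    chunk_definition_id=chunk_definition.id,
    max_wait=600,
)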
- classmethod get(dataset_definition_id, chunk_definition_id)
Retrieve the metadata of a specific chunk definition.
- Parameters:
  dataset_definition_id (str) – The ID of the dataset definition.
  chunk_definition_id (str) – The ID of the chunk definition.
- Returns:
  chunk_definition – The queried instance.
- Return type:
  ChunkDefinition
- classmethod delete(dataset_definition_id, chunk_definition_id)
Delete a specific chunk definition.
- Parameters:
  dataset_definition_id (str) – The ID of the dataset definition.
  chunk_definition_id (str) – The ID of the chunk definition.
- Return type:
  None
- classmethod list(dataset_definition_id)
List all chunk definitions.
- Parameters:
  dataset_definition_id (str) – The ID of the dataset definition.
- Return type:
  List[ChunkDefinition]
- classmethod analyze(dataset_definition_id, chunk_definition_id, max_wait=600)
Analyze a specific chunk definition.
- Parameters:
  dataset_definition_id (str) – The ID of the dataset definition.
  chunk_definition_id (str) – The ID of the chunk definition.
  max_wait (Optional[int]) – Time in seconds after which analysis is considered unsuccessful.
- Return type:
  None
- classmethod update(chunk_definition_id, dataset_definition_id, name=None, order_by_columns=None, is_descending_order=None, target_column=None, target_class=None, user_group_column=None, datetime_partition_column=None, otv_validation_start_date=None, otv_validation_end_date=None, otv_training_end_date=None, force_update=False)
Update a chunk definition.
- Parameters:
  chunk_definition_id (str) – The ID of the chunk definition.
  dataset_definition_id (str) – The ID of the dataset definition.
  name (str) – The optional custom name of the chunk definition.
  order_by_columns (List[str]) – List of the sorting column names.
  is_descending_order (bool) – The sorting order. Defaults to False, ordering from smallest to largest.
  target_column (str) – The target column.
  target_class (str) – For a binary target, one of the possible values. For zero inflated, the value will be ‘0’.
  user_group_column (str) – The user group column.
  datetime_partition_column (str) – The datetime partition column name used in OTV projects.
  otv_validation_start_date (datetime.datetime) – The start date for the validation set.
  otv_validation_end_date (datetime.datetime) – The end date for the validation set.
  otv_training_end_date (datetime.datetime) – The end date for the training set.
  force_update (bool) – If True, the update is forced in some cases, for example, updating after analysis is done.
- Returns:
  chunk_definition – An updated instance of the chunk definition.
- Return type:
  ChunkDefinition