Features
- class datarobot.models.Feature
A feature from a project’s dataset
These are features either included in the originally uploaded dataset or added to it via feature transformations. In time series projects, these will be distinct from the ModelingFeatures created during partitioning; otherwise, they will correspond to the same features. For more information about input and modeling features, see the time series documentation.
The min, max, mean, median, and std_dev attributes provide information about the distribution of the feature in the EDA sample data. For non-numeric features or features created prior to these summary statistics becoming available, they will be None. Where the summary statistics are available, they will be in a format compatible with the data type, i.e. date features will have their summary statistics expressed as ISO-8601 formatted date strings.
- Variables:
id (int) – the id for the feature - note that name is used to reference the feature instead of id
project_id (str) – the id of the project the feature belongs to
name (str) – the name of the feature
feature_type (str) – the type of the feature, e.g. 'Categorical', 'Text'
importance (float or None) – numeric measure of the strength of relationship between the feature and target (independent of any model or other features); may be None for non-modeling features such as partition columns
low_information (bool) – whether a feature is considered too uninformative for modeling (e.g. because it has too few values)
unique_count (int) – number of unique values
na_count (int or None) – number of missing values
date_format (str or None) – For Date features, the date format string for how this feature was interpreted, compatible with https://docs.python.org/2/library/time.html#time.strftime . For other feature types, None.
min (str, int, float, or None) – The minimum value of the source data in the EDA sample
max (str, int, float, or None) – The maximum value of the source data in the EDA sample
mean (str, int, float, or None) – The arithmetic mean of the source data in the EDA sample
median (str, int, float, or None) – The median of the source data in the EDA sample
std_dev (str, int, float, or None) – The standard deviation of the source data in the EDA sample
time_series_eligible (bool) – Whether this feature can be used as the datetime partition column in a time series project.
time_series_eligibility_reason (str) – Why the feature is ineligible for the datetime partition column in a time series project, or 'suitable' when it is eligible.
time_step (int or None) – For time series eligible features, a positive integer determining the interval at which windows can be specified. If used as the datetime partition column on a time series project, the feature derivation and forecast windows must start and end at an integer multiple of this value. None for features that are not time series eligible.
time_unit (str or None) – For time series eligible features, the time unit covered by a single time step, e.g. 'HOUR', or None for features that are not time series eligible.
target_leakage (str) – Whether a feature is considered to have target leakage or not. A value of 'SKIPPED_DETECTION' indicates that target leakage detection was not run on the feature, 'FALSE' indicates no leakage, 'MODERATE' indicates a moderate risk of target leakage, and 'HIGH_RISK' indicates a high risk of target leakage.
feature_lineage_id (str) – id of a lineage for automatically discovered features or derived time series features.
key_summary (list of dict) – statistics for the top 50 keys (truncated to 103 characters) of a Summarized Categorical column, for example:
{'key': 'DataRobot', 'summary': {'min': 0, 'max': 29815.0, 'stdDev': 6498.029, 'mean': 1490.75, 'median': 0.0, 'pctRows': 5.0}}
where:
- key: string or None – name of the key
- summary: dict – statistics of the key: min (minimum value of the key), max (maximum value), mean (mean value), median (median value), stdDev (standard deviation), pctRows (percentage occurrence of the key in the EDA sample of the feature)
multilabel_insights_key (str or None) – For multicategorical columns this will contain a key for multilabel insights. The key is unique for a project, feature and EDA stage combination. This will be the key for the most recent, finished EDA stage.
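Because the summary statistics are reported in a type-dependent format, date features need an extra parsing step. A minimal sketch (the feature records and values here are illustrative, not from a real project):

```python
from datetime import datetime

# Feature-like records: numeric and date features expose the same summary
# fields, but date features report them as ISO-8601 strings.
numeric_feature = {"name": "loan_amount", "feature_type": "Numeric",
                   "min": 500, "max": 35000}
date_feature = {"name": "purchase_date", "feature_type": "Date",
                "min": "2014-01-01", "max": "2014-12-31"}

def summary_range(feature):
    """Return (min, max), parsing date statistics into datetime objects."""
    lo, hi = feature["min"], feature["max"]
    if feature["feature_type"] == "Date":
        lo, hi = (datetime.fromisoformat(v) for v in (lo, hi))
    return lo, hi

lo, hi = summary_range(date_feature)
print(hi - lo)  # span of the EDA sample dates
```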
- classmethod get(project_id, feature_name)
Retrieve a single feature
- Parameters:
project_id (str) – The ID of the project the feature is associated with.
feature_name (str) – The name of the feature to retrieve
- Returns:
feature – The queried instance
- Return type:
Feature
- get_multiseries_properties(multiseries_id_columns, max_wait=600)
Retrieve time series properties for a potential multiseries datetime partition column
Multiseries time series projects use multiseries id columns to model multiple distinct series within a single project. This function returns the time series properties (time step and time unit) this column would have if used as a datetime partition column with the specified multiseries id columns, running multiseries detection automatically if it has not previously been run successfully.
- Parameters:
multiseries_id_columns (List[str]) – the name(s) of the multiseries id columns to use with this datetime partition column. Currently only one multiseries id column is supported.
max_wait (Optional[int]) – if a multiseries detection task is run, the maximum amount of time to wait for it to complete before giving up
- Returns:
properties – A dict with three keys:
time_series_eligible : bool, whether the column can be used as a partition column
time_unit : str or null, the inferred time unit if used as a partition column
time_step : int or null, the inferred time step if used as a partition column
- Return type:
dict
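The time_step constraint on window boundaries can be checked locally. A sketch using the documented dict shape (the values below are illustrative):

```python
# Stand-in for the dict get_multiseries_properties would return.
properties = {"time_series_eligible": True, "time_unit": "DAY", "time_step": 7}

def windows_align(props, fdw_start, fdw_end, forecast_start, forecast_end):
    """Check that window boundaries are integer multiples of the time step,
    as required when the column is used as the datetime partition column."""
    if not props["time_series_eligible"]:
        return False
    step = props["time_step"]
    return all(b % step == 0 for b in (fdw_start, fdw_end,
                                       forecast_start, forecast_end))

print(windows_align(properties, -28, 0, 7, 14))   # all multiples of 7
print(windows_align(properties, -30, 0, 7, 14))   # -30 is not
```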
- get_cross_series_properties(datetime_partition_column, cross_series_group_by_columns, max_wait=600)
Retrieve cross-series properties for multiseries ID column.
This function returns the cross-series properties (eligibility as a group-by column) this column would have if used with the specified datetime partition column and the current multiseries id column, running cross-series group-by validation automatically if it has not previously been run successfully.
- Parameters:
datetime_partition_column (datetime partition column)
cross_series_group_by_columns (List[str]) – the name(s) of the columns to use with this multiseries ID column. Currently only one cross-series group-by column is supported.
max_wait (Optional[int]) – if a multiseries detection task is run, the maximum amount of time to wait for it to complete before giving up
- Returns:
properties – A dict with three keys:
name : str, column name
eligibility : str, reason for column eligibility
isEligible : bool, whether the column is eligible as a cross-series group-by
- Return type:
dict
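A sketch of acting on the documented return value, e.g. failing fast with the server-supplied reason (the dict below is illustrative, not from a real project):

```python
# Stand-in for the dict get_cross_series_properties would return.
properties = {"name": "store_id", "eligibility": "suitable", "isEligible": True}

def ensure_eligible(props):
    """Raise with the server-supplied reason when the column is not usable."""
    if not props["isEligible"]:
        raise ValueError(f"{props['name']} is not eligible: {props['eligibility']}")
    return props["name"]

print(ensure_eligible(properties))
```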
- get_multicategorical_histogram()
Retrieve multicategorical histogram for this feature
Added in version v2.24.
- Return type:
MulticategoricalHistogram
- Raises:
datarobot.errors.InvalidUsageError – if this method is called on an unsuited feature
ValueError – if no multilabel_insights_key is present for this feature
- get_pairwise_correlations()
Retrieve pairwise label correlation for multicategorical features
Added in version v2.24.
- Return type:
PairwiseCorrelations
- Raises:
datarobot.errors.InvalidUsageError – if this method is called on an unsuited feature
ValueError – if no multilabel_insights_key is present for this feature
- get_pairwise_joint_probabilities()
Retrieve pairwise label joint probabilities for multicategorical features
Added in version v2.24.
- Return type:
PairwiseJointProbabilities
- Raises:
datarobot.errors.InvalidUsageError – if this method is called on an unsuited feature
ValueError – if no multilabel_insights_key is present for this feature
- get_pairwise_conditional_probabilities()
Retrieve pairwise label conditional probabilities for multicategorical features
Added in version v2.24.
- Return type:
PairwiseConditionalProbabilities
- Raises:
datarobot.errors.InvalidUsageError – if this method is called on an unsuited feature
ValueError – if no multilabel_insights_key is present for this feature
- classmethod from_data(data)
Instantiate an object of this class using a dict.
- Parameters:
data (dict) – Correctly snake_cased keys and their values.
- Return type:
TypeVar(T, bound=APIObject)
- classmethod from_server_data(data, keep_attrs=None)
Instantiate an object of this class using the data directly from the server, meaning that the keys may have the wrong camel casing
- Parameters:
data (dict) – The directly translated dict of JSON from the server. No casing fixes have taken place.
keep_attrs (iterable) – List, set or tuple of the dotted namespace notations for attributes to keep within the object structure even if their values are None
- Return type:
TypeVar(T, bound=APIObject)
- get_histogram(bin_limit=None)
Retrieve a feature histogram
- Parameters:
bin_limit (int or None) – Desired max number of histogram bins. If omitted, the endpoint will use 60 by default.
- Returns:
featureHistogram – The requested histogram with the desired number of bins
- Return type:
FeatureHistogram
- class datarobot.models.ModelingFeature
A feature used for modeling
In time series projects, a new set of modeling features is created after setting the partitioning options. These features are automatically derived from those in the project’s dataset and are the features used for modeling. Modeling features are only accessible once the target and partitioning options have been set. In projects that don’t use time series modeling, once the target has been set, ModelingFeatures and Features will behave the same.
For more information about input and modeling features, see the time series documentation.
As with the Feature object, the min, max, mean, median, and std_dev attributes provide information about the distribution of the feature in the EDA sample data. For non-numeric features, they will be None. Where the summary statistics are available, they will be in a format compatible with the data type, i.e. date features will have their summary statistics expressed as ISO-8601 formatted date strings.
- Variables:
project_id (str) – the id of the project the feature belongs to
name (str) – the name of the feature
feature_type (str) – the type of the feature, e.g. 'Categorical', 'Text'
importance (float or None) – numeric measure of the strength of relationship between the feature and target (independent of any model or other features); may be None for non-modeling features such as partition columns
low_information (bool) – whether a feature is considered too uninformative for modeling (e.g. because it has too few values)
unique_count (int) – number of unique values
na_count (int or None) – number of missing values
date_format (str or None) – For Date features, the date format string for how this feature was interpreted, compatible with https://docs.python.org/2/library/time.html#time.strftime . For other feature types, None.
min (str, int, float, or None) – The minimum value of the source data in the EDA sample
max (str, int, float, or None) – The maximum value of the source data in the EDA sample
mean (str, int, float, or None) – The arithmetic mean of the source data in the EDA sample
median (str, int, float, or None) – The median of the source data in the EDA sample
std_dev (str, int, float, or None) – The standard deviation of the source data in the EDA sample
parent_feature_names (List[str]) – A list of the names of input features used to derive this modeling feature. In cases where the input features and modeling features are the same, this will simply contain the feature's name. Note that if a derived feature was used to create this modeling feature, the values here will not necessarily correspond to the features that must be supplied at prediction time.
key_summary (list of dict) – statistics for the top 50 keys (truncated to 103 characters) of a Summarized Categorical column, for example:
{'key': 'DataRobot', 'summary': {'min': 0, 'max': 29815.0, 'stdDev': 6498.029, 'mean': 1490.75, 'median': 0.0, 'pctRows': 5.0}}
where:
- key: string or None – name of the key
- summary: dict – statistics of the key: min (minimum value of the key), max (maximum value), mean (mean value), median (median value), stdDev (standard deviation), pctRows (percentage occurrence of the key in the EDA sample of the feature)
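The key_summary entries are plain dicts and can be inspected directly, e.g. to find the key most frequently present in the EDA sample. A sketch with illustrative values (not from a real project):

```python
# Stand-in for a feature's key_summary attribute.
key_summary = [
    {"key": "DataRobot",
     "summary": {"min": 0, "max": 29815.0, "stdDev": 6498.029,
                 "mean": 1490.75, "median": 0.0, "pctRows": 5.0}},
    {"key": "OpenSource",
     "summary": {"min": 0, "max": 12.0, "stdDev": 1.2,
                 "mean": 0.4, "median": 0.0, "pctRows": 62.5}},
]

# Key occurring in the largest percentage of rows of the EDA sample.
top = max(key_summary, key=lambda entry: entry["summary"]["pctRows"])
print(top["key"])
```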
- classmethod get(project_id, feature_name)
Retrieve a single modeling feature
- Parameters:
project_id (str) – The ID of the project the feature is associated with.
feature_name (str) – The name of the feature to retrieve
- Returns:
feature – The requested feature
- Return type:
ModelingFeature
- class datarobot.models.DatasetFeature
A feature from a project’s dataset
These are features either included in the originally uploaded dataset or added to it via feature transformations.
The min, max, mean, median, and std_dev attributes provide information about the distribution of the feature in the EDA sample data. For non-numeric features or features created prior to these summary statistics becoming available, they will be None. Where the summary statistics are available, they will be in a format compatible with the data type, i.e. date features will have their summary statistics expressed as ISO-8601 formatted date strings.
- Variables:
id (int) – the id for the feature - note that name is used to reference the feature instead of id
dataset_id (str) – the id of the dataset the feature belongs to
dataset_version_id (str) – the id of the dataset version the feature belongs to
name (str) – the name of the feature
feature_type (Optional[str]) – the type of the feature, e.g. 'Categorical', 'Text'
low_information (Optional[bool]) – whether a feature is considered too uninformative for modeling (e.g. because it has too few values)
unique_count (Optional[int]) – number of unique values
na_count (Optional[int]) – number of missing values
date_format (Optional[str]) – For Date features, the date format string for how this feature was interpreted, compatible with https://docs.python.org/2/library/time.html#time.strftime . For other feature types, None.
min (str, int, Optional[float]) – The minimum value of the source data in the EDA sample
max (str, int, Optional[float]) – The maximum value of the source data in the EDA sample
mean (str, int, Optional[float]) – The arithmetic mean of the source data in the EDA sample
median (str, int, Optional[float]) – The median of the source data in the EDA sample
std_dev (str, int, Optional[float]) – The standard deviation of the source data in the EDA sample
time_series_eligible (Optional[bool]) – Whether this feature can be used as the datetime partition column in a time series project.
time_series_eligibility_reason (Optional[str]) – Why the feature is ineligible for the datetime partition column in a time series project, or 'suitable' when it is eligible.
time_step (Optional[int]) – For time series eligible features, a positive integer determining the interval at which windows can be specified. If used as the datetime partition column on a time series project, the feature derivation and forecast windows must start and end at an integer multiple of this value. None for features that are not time series eligible.
time_unit (Optional[str]) – For time series eligible features, the time unit covered by a single time step, e.g. 'HOUR', or None for features that are not time series eligible.
target_leakage (Optional[str]) – Whether a feature is considered to have target leakage or not. A value of 'SKIPPED_DETECTION' indicates that target leakage detection was not run on the feature, 'FALSE' indicates no leakage, 'MODERATE' indicates a moderate risk of target leakage, and 'HIGH_RISK' indicates a high risk of target leakage.
target_leakage_reason (string, optional) – The descriptive text explaining the reason for target leakage, if any.
- get_histogram(bin_limit=None)
Retrieve a feature histogram
- Parameters:
bin_limit (int or None) – Desired max number of histogram bins. If omitted, the endpoint will use 60 by default.
- Returns:
featureHistogram – The requested histogram with the desired number of bins
- Return type:
DatasetFeatureHistogram
- class datarobot.models.DatasetFeatureHistogram
- classmethod get(dataset_id, feature_name, bin_limit=None, key_name=None)
Retrieve a single feature histogram
- Parameters:
dataset_id (str) – The ID of the Dataset the feature is associated with.
feature_name (str) – The name of the feature to retrieve
bin_limit (int or None) – Desired max number of histogram bins. If omitted, the endpoint will use 60 by default.
key_name (string or None) – (Only required for a summarized categorical feature) Name of the key, from the top 50 keys, for which the plot is to be retrieved
- Returns:
featureHistogram – The queried instance with the plot attribute in it.
- Return type:
DatasetFeatureHistogram
- class datarobot.models.FeatureHistogram
- classmethod get(project_id, feature_name, bin_limit=None, key_name=None)
Retrieve a single feature histogram
- Parameters:
project_id (str) – The ID of the project the feature is associated with.
feature_name (str) – The name of the feature to retrieve
bin_limit (int or None) – Desired max number of histogram bins. If omitted, the endpoint will use 60 by default.
key_name (string or None) – (Only required for a summarized categorical feature) Name of the key, from the top 50 keys, for which the plot is to be retrieved
- Returns:
featureHistogram – The queried instance with the plot attribute in it.
- Return type:
FeatureHistogram
- class datarobot.models.InteractionFeature
Interaction feature data
Added in version v2.21.
- Variables:
rows (int) – Total number of rows
source_columns (list(str)) – names of the two categorical features which were combined into this one
bars (list(dict)) – dictionaries representing frequencies of each independent value from the source columns
bubbles (list(dict)) – dictionaries representing frequencies of each combined value in the interaction feature.
- classmethod get(project_id, feature_name)
Retrieve a single Interaction feature
- Parameters:
project_id (str) – The id of the project the feature belongs to
feature_name (str) – The name of the Interaction feature to retrieve
- Returns:
feature – The queried instance
- Return type:
InteractionFeature
- class datarobot.models.MulticategoricalHistogram
Histogram for Multicategorical feature.
Added in version v2.24.
Notes
HistogramValues contains:
values.[].label : string - Label name
values.[].plot : list - Histogram for the label
values.[].plot.[].label_relevance : int - Label relevance value
values.[].plot.[].row_count : int - Row count where the label has the given relevance
values.[].plot.[].row_pct : float - Percentage of rows where the label has the given relevance
- Variables:
feature_name (str) – Name of the feature
values (list(dict)) – List of histogram values with a schema described as HistogramValues
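The nested HistogramValues schema can be flattened into one row per (label, relevance) bin, which is essentially what to_dataframe produces. A sketch over an illustrative (made-up) histogram:

```python
# Stand-in for a MulticategoricalHistogram's data, following the documented
# HistogramValues schema.
histogram = {
    "feature_name": "genres",
    "values": [
        {"label": "drama",
         "plot": [{"label_relevance": 0, "row_count": 80, "row_pct": 80.0},
                  {"label_relevance": 1, "row_count": 20, "row_pct": 20.0}]},
    ],
}

# Flatten: one dict per histogram bin, carrying the feature and label names.
rows = [
    {"feature_name": histogram["feature_name"], "label": v["label"], **bin_}
    for v in histogram["values"]
    for bin_ in v["plot"]
]
print(rows[0])
```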
- classmethod get(multilabel_insights_key)
Retrieves multicategorical histogram
You might find it more convenient to use Feature.get_multicategorical_histogram instead.
- Parameters:
multilabel_insights_key (string) – Key for multilabel insights, unique for a project, feature and EDA stage combination. The multilabel_insights_key can be retrieved via Feature.multilabel_insights_key.
- Returns:
The multicategorical histogram for multilabel_insights_key
- Return type:
MulticategoricalHistogram
- to_dataframe()
Convenience method to get all the information from this multicategorical histogram instance in the form of a pandas.DataFrame.
- Returns:
The histogram information as a pandas.DataFrame with these columns: feature_name, label, label_relevance, row_count and row_pct
- Return type:
pandas.DataFrame
- class datarobot.models.PairwiseCorrelations
Correlation of label pairs for multicategorical feature.
Added in version v2.24.
Notes
CorrelationValues contain:
values.[].label_configuration : list of length 2 - Configuration of the label pair
values.[].label_configuration.[].label : str – Label name
values.[].statistic_value : float – Statistic value
- Variables:
feature_name (str) – Name of the feature
values (list(dict)) – List of correlation values with a schema described as CorrelationValues
statistic_dataframe (pandas.DataFrame) – Correlation values for all label pairs as a DataFrame
- classmethod get(multilabel_insights_key)
Retrieves pairwise correlations
You might find it more convenient to use Feature.get_pairwise_correlations instead.
- Parameters:
multilabel_insights_key (string) – Key for multilabel insights, unique for a project, feature and EDA stage combination. The multilabel_insights_key can be retrieved via Feature.multilabel_insights_key.
- Returns:
The pairwise label correlations
- Return type:
PairwiseCorrelations
- as_dataframe()
The pairwise label correlations as a (num_labels x num_labels) DataFrame.
- Returns:
The pairwise label correlations. Index and column names allow the interpretation of the values.
- Return type:
pandas.DataFrame
- class datarobot.models.PairwiseJointProbabilities
Joint probabilities of label pairs for multicategorical feature.
Added in version v2.24.
Notes
ProbabilityValues contain:
values.[].label_configuration : list of length 2 - Configuration of the label pair
values.[].label_configuration.[].relevance : int – 0 for absence of the label, 1 for presence of the label
values.[].label_configuration.[].label : str – Label name
values.[].statistic_value : float – Statistic value
- Variables:
feature_name (str) – Name of the feature
values (list(dict)) – List of joint probability values with a schema described as ProbabilityValues
statistic_dataframes (dict(pandas.DataFrame)) – Joint probability values as DataFrames for the different relevance combinations. E.g. the probability P(A=0,B=1) can be retrieved via:
pairwise_joint_probabilities.statistic_dataframes[(0,1)].loc['A', 'B']
- classmethod get(multilabel_insights_key)
Retrieves pairwise joint probabilities
You might find it more convenient to use Feature.get_pairwise_joint_probabilities instead.
- Parameters:
multilabel_insights_key (string) – Key for multilabel insights, unique for a project, feature and EDA stage combination. The multilabel_insights_key can be retrieved via Feature.multilabel_insights_key.
- Returns:
The pairwise joint probabilities
- Return type:
PairwiseJointProbabilities
- as_dataframe(relevance_configuration)
Joint probabilities of label pairs as a (num_labels x num_labels) DataFrame.
- Parameters:
relevance_configuration (tuple of length 2) – Valid options are (0, 0), (0, 1), (1, 0) and (1, 1). A value of 0 indicates absence of the label and 1 indicates presence of the label. The first value describes the presence of the labels in axis=0 and the second value describes the presence of the labels in axis=1.
For example, the matrix values for a relevance configuration of (0, 1) describe the probabilities of absent labels in the index axis and present labels in the column axis. E.g. the probability P(A=0,B=1) can be retrieved via:
pairwise_joint_probabilities.as_dataframe((0,1)).loc['A', 'B']
- Returns:
The joint probabilities for the requested relevance_configuration. Index and column names allow the interpretation of the values.
- Return type:
pandas.DataFrame
- class datarobot.models.PairwiseConditionalProbabilities
Conditional probabilities of label pairs for multicategorical feature.
Added in version v2.24.
Notes
ProbabilityValues contain:
values.[].label_configuration : list of length 2 - Configuration of the label pair
values.[].label_configuration.[].relevance : int – 0 for absence of the label, 1 for presence of the label
values.[].label_configuration.[].label : str – Label name
values.[].statistic_value : float – Statistic value
- Variables:
feature_name (str) – Name of the feature
values (list(dict)) – List of conditional probability values with a schema described as ProbabilityValues
statistic_dataframes (dict(pandas.DataFrame)) – Conditional probability values as DataFrames for the different relevance combinations. The label names in the columns are the events on which we condition. The label names in the index are the events whose conditional probability given the columns is in the dataframe. E.g. the probability P(A=0|B=1) can be retrieved via:
pairwise_conditional_probabilities.statistic_dataframes[(0,1)].loc['A', 'B']
- classmethod get(multilabel_insights_key)
Retrieves pairwise conditional probabilities
You might find it more convenient to use Feature.get_pairwise_conditional_probabilities instead.
- Parameters:
multilabel_insights_key (string) – Key for multilabel insights, unique for a project, feature and EDA stage combination. The multilabel_insights_key can be retrieved via Feature.multilabel_insights_key.
- Returns:
The pairwise conditional probabilities
- Return type:
PairwiseConditionalProbabilities
- as_dataframe(relevance_configuration)
Conditional probabilities of label pairs as a (num_labels x num_labels) DataFrame. The label names in the columns are the events on which we condition. The label names in the index are the events whose conditional probability given the columns is in the dataframe.
E.g. The probability P(A=0|B=1) can be retrieved via:
pairwise_conditional_probabilities.as_dataframe((0, 1)).loc['A', 'B']
- Parameters:
relevance_configuration (tuple of length 2) – Valid options are (0, 0), (0, 1), (1, 0) and (1, 1). A value of 0 indicates absence of the label and 1 indicates presence of the label. The first value describes the presence of the labels in axis=0 and the second value describes the presence of the labels in axis=1.
For example, the matrix values for a relevance configuration of (0, 1) describe the probabilities of absent labels in the index axis given the presence of labels in the column axis.
- Returns:
The conditional probabilities for the requested relevance_configuration. Index and column names allow the interpretation of the values.
- Return type:
pandas.DataFrame
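The conditional statistics relate to the joint ones in the usual way: P(A=0 | B=1) = P(A=0, B=1) / P(B=1). A sketch with made-up probabilities, not values from a real project:

```python
# Joint probabilities P(A=a, B=1) for a label pair (A, B), keyed by the
# relevance configuration of (A, B) as in statistic_dataframes.
p_joint = {(0, 1): 0.10, (1, 1): 0.15}

# Marginal P(B=1) is the sum over A's relevance values.
p_b1 = p_joint[(0, 1)] + p_joint[(1, 1)]

# Conditional probability P(A=0 | B=1).
p_a0_given_b1 = p_joint[(0, 1)] / p_b1
print(round(p_a0_given_b1, 2))  # 0.4
```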
Restoring Discarded Features
- class datarobot.models.restore_discarded_features.DiscardedFeaturesInfo
An object containing information about time series features which were removed during the time series feature generation process. These features can be restored to the project; once restored, they will be included in All Time Series Features and can be used to create new feature lists.
Added in version v2.27.
- Variables:
total_restore_limit (int) – The total limit indicating how many features can be restored in this project.
remaining_restore_limit (int) – The remaining number of features which can be restored in this project.
features (list of strings) – Discarded features which can be restored.
count (int) – Discarded features count.
- classmethod restore(project_id, features_to_restore, max_wait=600)
Restore features that were discarded during the time series feature generation process. After restoration the features will be included in All Time Series Features.
Added in version v2.27.
- Parameters:
project_id (string)
features_to_restore (list of strings) – List of the feature names to restore
max_wait (Optional[int]) – max time to wait for features to be restored. Defaults to 10 min
- Returns:
status – information about which features were restored and which were not.
- Return type:
FeatureRestorationStatus
- classmethod retrieve(project_id)
Retrieve the discarded features information for a given project.
Added in version v2.27.
- Parameters:
project_id (string)
- Returns:
info – information about features which were discarded during the feature generation process and limits on how many features can be restored.
- Return type:
DiscardedFeaturesInfo
- class datarobot.models.restore_discarded_features.FeatureRestorationStatus
Status of the feature restoration process.
Added in version v2.27.
- Variables:
warnings (list of strings) – Warnings generated for those features which failed to restore
remaining_restore_limit (int) – The remaining number of features which can be restored in this project.
restored_features (list of strings) – Features which were restored
Feature lists
- class datarobot.DatasetFeaturelist
A set of features attached to a dataset in the AI Catalog
- Variables:
id (str) – the id of the dataset featurelist
dataset_id (str) – the id of the dataset the featurelist belongs to
dataset_version_id (Optional[str]) – the version id of the dataset this featurelist belongs to
name (str) – the name of the dataset featurelist
features (List[str]) – a list of the names of features included in this dataset featurelist
creation_date (datetime.datetime) – when the featurelist was created
created_by (str) – the user name of the user who created this featurelist
user_created (bool) – whether the featurelist was created by a user or by DataRobot automation
description (Optional[str]) – the description of the featurelist. Only present on DataRobot-created featurelists.
- classmethod get(dataset_id, featurelist_id)
Retrieve a dataset featurelist
- Parameters:
dataset_id (str) – the id of the dataset the featurelist belongs to
featurelist_id (str) – the id of the dataset featurelist to retrieve
- Returns:
featurelist – the specified featurelist
- Return type:
DatasetFeaturelist
- delete()
Delete a dataset featurelist
Featurelists configured into the dataset as a default featurelist cannot be deleted.
- Return type:
None
- update(name=None)
Update the name of an existing featurelist
Note that only user-created featurelists can be renamed, and that names must not conflict with names used by other featurelists.
- Parameters:
name (Optional[str]) – the new name for the featurelist
- Return type:
None
- class datarobot.models.Featurelist
A set of features used in modeling
- Variables:
id (str) – the id of the featurelist
name (str) – the name of the featurelist
features (List[str]) – the names of all the Features in the featurelist
project_id (str) – the project the featurelist belongs to
created (datetime.datetime) – (New in version v2.13) when the featurelist was created
is_user_created (bool) – (New in version v2.13) whether the featurelist was created by a user or by DataRobot automation
num_models (int) – (New in version v2.13) the number of models currently using this featurelist. A model is considered to use a featurelist if it is used to train the model or as a monotonic constraint featurelist, or if the model is a blender with at least one component model using the featurelist.
description (str) – (New in version v2.13) the description of the featurelist. Can be updated by the user and may be supplied by default for DataRobot-created featurelists.
- classmethod from_data(data)
Overrides the parent method to ensure description is always populated
- Parameters:
data (dict) – the data from the server, having gone through processing
- Return type:
TypeVar(TFeaturelist, bound=Featurelist)
- classmethod get(project_id, featurelist_id)
Retrieve a known feature list
- Parameters:
project_id (str) – The id of the project the featurelist is associated with
featurelist_id (str) – The ID of the featurelist to retrieve
- Returns:
featurelist – The queried instance
- Return type:
Featurelist
- Raises:
ValueError – the passed project_id parameter value is of an unsupported type
- delete(dry_run=False, delete_dependencies=False)
Delete a featurelist, and any models and jobs using it
All models using a featurelist, whether as the training featurelist or as a monotonic constraint featurelist, will also be deleted when the deletion is executed and any queued or running jobs using it will be cancelled. Similarly, predictions made on these models will also be deleted. All the entities that are to be deleted with a featurelist are described as “dependencies” of it. To preview the results of deleting a featurelist, call delete with dry_run=True
When deleting a featurelist with dependencies, users must specify delete_dependencies=True to confirm they want to delete the featurelist and all its dependencies. Without that option, only featurelists with no dependencies may be successfully deleted and others will error.
Featurelists configured into the project as a default featurelist or as a default monotonic constraint featurelist cannot be deleted.
Featurelists used in a model deployment cannot be deleted until the model deployment is deleted.
- Parameters:
dry_run (
Optional[bool]
) – specify True to preview the result of deleting the featurelist, instead of actually deleting it.delete_dependencies (
Optional[bool]
) – specify True to delete featurelists with dependencies; if left as the default False, only featurelists without dependencies can be deleted, and attempting to delete one with dependencies will error.
- Returns:
result –
- A dictionary describing the result of deleting the featurelist, with the following keys
dry_run : bool, whether the deletion was a dry run or an actual deletion
can_delete : bool, whether the featurelist can actually be deleted
deletion_blocked_reason : str, why the featurelist can’t be deleted (if it can’t)
num_affected_models : int, the number of models using this featurelist
num_affected_jobs : int, the number of jobs using this featurelist
- Return type:
dict
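To make the preview mode concrete, here is a minimal pure-Python sketch of inspecting the result dictionary described above; the helper name and the sample values are illustrative, only the keys come from the documented schema.

```python
def summarize_delete_preview(result):
    """Summarize a featurelist deletion preview dict (keys as documented above)."""
    if not result["can_delete"]:
        return "Cannot delete: {}".format(result["deletion_blocked_reason"])
    return "Deleting would affect {} models and {} jobs".format(
        result["num_affected_models"], result["num_affected_jobs"]
    )

# Example preview payload, shaped like featurelist.delete(dry_run=True) output
preview = {
    "dry_run": True,
    "can_delete": True,
    "deletion_blocked_reason": "",
    "num_affected_models": 3,
    "num_affected_jobs": 1,
}
print(summarize_delete_preview(preview))
```

A real call would pass the dict returned by `delete(dry_run=True)` instead of the hand-built sample.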
- classmethod from_server_data(data, keep_attrs=None)
Instantiate an object of this class using the data directly from the server, meaning that the keys may have the wrong camel casing
- Parameters:
data (
dict
) – The directly translated dict of JSON from the server. No casing fixes have taken placekeep_attrs (
iterable
) – List, set or tuple of the dotted namespace notations for attributes to keep within the object structure even if their values are None
- Return type:
TypeVar
(T
, bound= APIObject)
- update(name=None, description=None)
Update the name or description of an existing featurelist
Note that only user-created featurelists can be renamed, and that names must not conflict with names used by other featurelists.
- Parameters:
name (
Optional[str]
) – the new name for the featurelistdescription (
Optional[str]
) – the new description for the featurelist
- Return type:
None
- class datarobot.models.ModelingFeaturelist
A set of features that can be used to build a model
In time series projects, a new set of modeling features is created after setting the partitioning options. These features are automatically derived from those in the project’s dataset and are the features used for modeling. Modeling features are only accessible once the target and partitioning options have been set. In projects that don’t use time series modeling, once the target has been set, ModelingFeaturelists and Featurelists will behave the same.
For more information about input and modeling features, see the time series documentation.
- Variables:
id (
str
) – the id of the modeling featurelistproject_id (
str
) – the id of the project the modeling featurelist belongs toname (
str
) – the name of the modeling featurelistfeatures (
List[str]
) – a list of the names of features included in this modeling featurelistcreated (
datetime.datetime
) – (New in version v2.13) when the featurelist was createdis_user_created (
bool
) – (New in version v2.13) whether the featurelist was created by a user or by DataRobot automationnum_models (
int
) – (New in version v2.13) the number of models currently using this featurelist. A model is considered to use a featurelist if it is used to train the model or as a monotonic constraint featurelist, or if the model is a blender with at least one component model using the featurelist.description (
str
) – (New in version v2.13) the description of the featurelist. Can be updated by the user and may be supplied by default for DataRobot-created featurelists.
- classmethod get(project_id, featurelist_id)
Retrieve a modeling featurelist
Modeling featurelists can only be retrieved once the target and partitioning options have been set.
- Parameters:
project_id (
str
) – the id of the project the modeling featurelist belongs tofeaturelist_id (
str
) – the id of the modeling featurelist to retrieve
- Returns:
featurelist – the specified featurelist
- Return type:
- update(name=None, description=None)
Update the name or description of an existing featurelist
Note that only user-created featurelists can be renamed, and that names must not conflict with names used by other featurelists.
- Parameters:
name (
Optional[str]
) – the new name for the featurelistdescription (
Optional[str]
) – the new description for the featurelist
- Return type:
None
- delete(dry_run=False, delete_dependencies=False)
Delete a featurelist, and any models and jobs using it
All models using a featurelist, whether as the training featurelist or as a monotonic constraint featurelist, will also be deleted when the deletion is executed, and any queued or running jobs using it will be cancelled. Similarly, predictions made on these models will also be deleted. All the entities that are to be deleted with a featurelist are described as “dependencies” of it. To preview the results of deleting a featurelist, call delete with dry_run=True.
When deleting a featurelist with dependencies, users must specify delete_dependencies=True to confirm they want to delete the featurelist and all its dependencies. Without that option, only featurelists with no dependencies may be successfully deleted and others will error.
Featurelists configured into the project as a default featurelist or as a default monotonic constraint featurelist cannot be deleted.
Featurelists used in a model deployment cannot be deleted until the model deployment is deleted.
- Parameters:
dry_run (
Optional[bool]
) – specify True to preview the result of deleting the featurelist, instead of actually deleting it.delete_dependencies (
Optional[bool]
) – specify True to delete featurelists with dependencies; if left as the default False, only featurelists without dependencies can be deleted, and attempting to delete one with dependencies will error.
- Returns:
result –
- A dictionary describing the result of deleting the featurelist, with the following keys
dry_run : bool, whether the deletion was a dry run or an actual deletion
can_delete : bool, whether the featurelist can actually be deleted
deletion_blocked_reason : str, why the featurelist can’t be deleted (if it can’t)
num_affected_models : int, the number of models using this featurelist
num_affected_jobs : int, the number of jobs using this featurelist
- Return type:
dict
- class datarobot.models.featurelist.DeleteFeatureListResult
Dataset definition
- class datarobot.helpers.feature_discovery.DatasetDefinition
Dataset definition for the Feature Discovery
Added in version v2.25.
- Variables:
identifier (
str
) – Alias of the dataset (used directly as part of the generated feature names)catalog_id (
Optional[str]
) – Identifier of the catalog itemcatalog_version_id (
str
) – Identifier of the catalog item versionprimary_temporal_key (
Optional[str]
) – Name of the column indicating time of record creationfeature_list_id (
Optional[str]
) – Identifier of the feature list. This decides which columns in the dataset are used for feature generationsnapshot_policy (
Optional[str]
) – Policy to use when creating a project or making predictions. If omitted, the endpoint defaults to ‘latest’. Must be one of the following values:
‘specified’: Use the specific snapshot specified by catalogVersionId
‘latest’: Use the latest snapshot from the same catalog item
‘dynamic’: Get data from the source (only applicable for JDBC datasets)
Examples
import datarobot as dr
dataset_definition = dr.DatasetDefinition(
    identifier='profile',
    catalog_id='5ec4aec1f072bc028e3471ae',
    catalog_version_id='5ec4aec2f072bc028e3471b1',
)
dataset_definition = dr.DatasetDefinition(
    identifier='transaction',
    catalog_id='5ec4aec1f072bc028e3471ae',
    catalog_version_id='5ec4aec2f072bc028e3471b1',
    primary_temporal_key='Date'
)
Relationships
- class datarobot.helpers.feature_discovery.Relationship
Relationship between dataset defined in DatasetDefinition
Added in version v2.25.
- Variables:
dataset1_identifier (
Optional[str]
) – Identifier of the first dataset in this relationship. This is specified in the identifier field of dataset_definition structure. If None, then the relationship is with the primary dataset.dataset2_identifier (
str
) – Identifier of the second dataset in this relationship. This is specified in the identifier field of dataset_definition schema.dataset1_keys (
List[str]
) – (max length: 10 min length: 1) Column(s) from the first dataset which are used to join to the second datasetdataset2_keys (
List[str]
) – (max length: 10 min length: 1) Column(s) from the second dataset that are used to join to the first datasetfeature_derivation_window_start (
int
, orNone
) – How many time_units of each dataset’s primary temporal key into the past relative to the datetimePartitionColumn the feature derivation window should begin. Will be a negative integer, if present. If present, the feature engineering Graph will perform time-aware joins.feature_derivation_window_end (
Optional[int]
) – How many timeUnits of each dataset’s record primary temporal key into the past relative to the datetimePartitionColumn the feature derivation window should end. Will be a non-positive integer, if present. If present, the feature engineering Graph will perform time-aware joins.feature_derivation_window_time_unit (
Optional[str]
) – Time unit of the feature derivation window. One ofdatarobot.enums.AllowedTimeUnitsSAFER
If present, time-aware joins will be used. Only applicable when dataset1_identifier is not provided.feature_derivation_windows (
list
ofdict
, orNone
) – List of feature derivation windows settings. If present, time-aware joins will be used. Only allowed when feature_derivation_window_start, feature_derivation_window_end and feature_derivation_window_time_unit are not provided.prediction_point_rounding (
Optional[int]
) – Closest value of prediction_point_rounding_time_unit to round the prediction point into the past when applying the feature derivation window. Will be a positive integer, if present.Only applicable when dataset1_identifier is not provided.prediction_point_rounding_time_unit (
Optional[str]
) – Time unit of the prediction point rounding. One ofdatarobot.enums.AllowedTimeUnitsSAFER
Only applicable when dataset1_identifier is not provided.
The feature_derivation_windows is a list of dictionaries with the following schema:
- start: int
How many time_units of each dataset’s primary temporal key into the past relative to the datetimePartitionColumn the feature derivation window should begin.
- end: int
How many timeUnits of each dataset’s record primary temporal key into the past relative to the datetimePartitionColumn the feature derivation window should end.
- unit: str
Time unit of the feature derivation window. One of
datarobot.enums.AllowedTimeUnitsSAFER
.
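As an illustration of the schema above, a feature_derivation_windows value might look like the following sketch; the specific offsets are invented, only the start/end/unit keys come from the documented structure.

```python
# Two derivation windows following the documented schema: each dict has
# 'start' and 'end' (offsets into the past) and a 'unit' time unit.
feature_derivation_windows = [
    {"start": -14, "end": -7, "unit": "DAY"},
    {"start": -7, "end": 0, "unit": "DAY"},
]

# Basic sanity checks implied by the documentation
for window in feature_derivation_windows:
    assert window["start"] < window["end"]  # the window must begin before it ends
    assert window["end"] <= 0               # end is a non-positive offset into the past
```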
Examples
import datarobot as dr
relationship = dr.Relationship(
    dataset1_identifier='profile',
    dataset2_identifier='transaction',
    dataset1_keys=['CustomerID'],
    dataset2_keys=['CustomerID']
)
relationship = dr.Relationship(
    dataset2_identifier='profile',
    dataset1_keys=['CustomerID'],
    dataset2_keys=['CustomerID'],
    feature_derivation_window_start=-14,
    feature_derivation_window_end=-1,
    feature_derivation_window_time_unit='DAY',
    prediction_point_rounding=1,
    prediction_point_rounding_time_unit='DAY'
)
Relationships configuration
- class datarobot.models.RelationshipsConfiguration
A Relationships configuration specifies a set of secondary datasets as well as the relationships among them. It is used to configure Feature Discovery for a project to generate features automatically from these datasets.
- Variables:
id (
str
) – Id of the created relationships configurationdataset_definitions (
list
) – Each element is a dataset definition for a dataset.relationships (
list
) – Each element is a relationship between two datasetsfeature_discovery_mode (
str
) – Mode of feature discovery. Supported values are ‘default’ and ‘manual’feature_discovery_settings (
list
) – List of feature discovery settings used to customize the feature discovery process
The dataset_definitions structure is:
identifier (
str
) – Alias of the dataset (used directly as part of the generated feature names)catalog_id (
str
, orNone
) – Identifier of the catalog itemcatalog_version_id (
str
) – Identifier of the catalog item versionprimary_temporal_key (
Optional[str]
) – Name of the column indicating time of record creationfeature_list_id (
Optional[str]
) – Identifier of the feature list. This decides which columns in the dataset are used for feature generationsnapshot_policy (
str
) – Policy to use when creating a project or making predictions. Must be one of the following values:
‘specified’: Use the specific snapshot specified by catalogVersionId
‘latest’: Use the latest snapshot from the same catalog item
‘dynamic’: Get data from the source (only applicable for JDBC datasets)feature_lists (
list
) – List of feature list infodata_source (
dict
) – Data source info if the dataset is from data sourcedata_sources (
list
) – List of Data source details for a JDBC datasetsis_deleted (
Optional[bool]
) – Whether the dataset is deleted or not
The data_source structure is:
data_store_id (
str
) – Id of the data store.data_store_name (
str
) – User-friendly name of the data store.url (
str
) – Url used to connect to the data store.dbtable (
str
) – Name of table from the data store.schema (The feature_derivation_windows is a list of dictionary with) – Schema definition of the table from the data store
catalog (
str
) – Catalog name of the data source.
The feature_lists structure is:
id – Id of the featurelist
name (
str
) – Name of the featurelistfeatures (
List[str]
) – Names of all the Features in the featurelistdataset_id (
str
) – Project the featurelist belongs tocreation_date (
datetime.datetime
) – When the featurelist was createduser_created (
bool
) – Whether the featurelist was created by a user or by DataRobot automationcreated_by (
str
) – Name of user who created itdescription (
str
) – Description of the featurelist. Can be updated by the user and may be supplied by default for DataRobot-created featurelists.dataset_id – Dataset which is associated with the feature list
dataset_version_id (
str
orNone
) – Version of the dataset which is associated with the feature list. Only relevant for Informative features.
The relationships structure is:
dataset1_identifier (
str
orNone
) – Identifier of the first dataset in this relationship. This is specified in the identifier field of dataset_definition structure. If None, then the relationship is with the primary dataset.dataset2_identifier (
str
) – Identifier of the second dataset in this relationship. This is specified in the identifier field of dataset_definition schema.dataset1_keys (
List[str] (max length
:10 min length
:1)
) – Column(s) from the first dataset which are used to join to the second datasetdataset2_keys (
List[str]
) – (max length: 10 min length: 1) Column(s) from the second dataset that are used to join to the first datasettime_unit (
str
, orNone
) – Time unit of the feature derivation window. Supported values are MILLISECOND, SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER, YEAR. If present, the feature engineering Graph will perform time-aware joins.feature_derivation_window_start (
int
, orNone
) – How many time_units of each dataset’s primary temporal key into the past relative to the datetimePartitionColumn the feature derivation window should begin. Will be a negative integer, If present, the feature engineering Graph will perform time-aware joins.feature_derivation_window_end (
int
orNone
) – How many timeUnits of each dataset’s record primary temporal key into the past relative to the datetimePartitionColumn the feature derivation window should end. Will be a non-positive integer, if present. If present, the feature engineering Graph will perform time-aware joins.feature_derivation_window_time_unit (
str
orNone
) – Time unit of the feature derivation window. Supported values are MILLISECOND, SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER, YEAR If present, time-aware joins will be used. Only applicable when dataset1Identifier is not provided.feature_derivation_windows (
list
ofdict
, orNone
) – List of feature derivation windows settings. If present, time-aware joins will be used. Only allowed when feature_derivation_window_start, feature_derivation_window_end and feature_derivation_window_time_unit are not provided.prediction_point_rounding (
int
, orNone
) – Closest value of prediction_point_rounding_time_unit to round the prediction point into the past when applying the feature derivation window. Will be a positive integer, if present.Only applicable when dataset1_identifier is not provided.prediction_point_rounding_time_unit (
str
, orNone
) – Time unit of the prediction point rounding. Supported values are MILLISECOND, SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER, YEAR. Only applicable when dataset1_identifier is not provided.
The feature_derivation_windows is a list of dictionaries with the following schema:
- start: int
How many time_units of each dataset’s primary temporal key into the past relative to the datetimePartitionColumn the feature derivation window should begin.
- end: int
How many timeUnits of each dataset’s record primary temporal key into the past relative to the datetimePartitionColumn the feature derivation window should end.
- unit: str
Time unit of the feature derivation window. One of
datarobot.enums.AllowedTimeUnitsSAFER
.
The feature_discovery_settings structure is:
name – Name of the feature discovery setting
value (
bool
) – Value of the feature discovery setting
To see the list of possible settings, create a RelationshipsConfiguration without specifying settings and check its feature_discovery_settings attribute, which is a list of possible settings with their default values.
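A sketch of the structure described above, with a small lookup built on top; the setting names are taken from the create() example on this page and are otherwise illustrative.

```python
# Each feature discovery setting is a dict with a 'name' and a boolean 'value'.
feature_discovery_settings = [
    {"name": "enable_categorical_statistics", "value": True},
    {"name": "enable_numeric_skewness", "value": False},
]

# Turn the list into a simple set of enabled setting names
enabled = {s["name"] for s in feature_discovery_settings if s["value"]}
print(sorted(enabled))
```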
- classmethod create(dataset_definitions, relationships, feature_discovery_settings=None)
Create a Relationships Configuration
- Parameters:
dataset_definitions (
list
ofDatasetDefinition
) – Each element is adatarobot.helpers.feature_discovery.DatasetDefinition
relationships (
list
ofRelationship
) – Each element is adatarobot.helpers.feature_discovery.Relationship
feature_discovery_settings (
Optional[List[FeatureDiscoverySetting]]
) – Each element is a dictionary or adatarobot.helpers.feature_discovery.FeatureDiscoverySetting
. If not provided, default settings will be used.
- Returns:
relationships_configuration – Created relationships configuration
- Return type:
Examples
dataset_definition = dr.DatasetDefinition(
    identifier='profile',
    catalog_id='5fd06b4af24c641b68e4d88f',
    catalog_version_id='5fd06b4af24c641b68e4d88f'
)
relationship = dr.Relationship(
    dataset2_identifier='profile',
    dataset1_keys=['CustomerID'],
    dataset2_keys=['CustomerID'],
    feature_derivation_window_start=-14,
    feature_derivation_window_end=-1,
    feature_derivation_window_time_unit='DAY',
    prediction_point_rounding=1,
    prediction_point_rounding_time_unit='DAY'
)
dataset_definitions = [dataset_definition]
relationships = [relationship]
relationship_config = dr.RelationshipsConfiguration.create(
    dataset_definitions=dataset_definitions,
    relationships=relationships,
    feature_discovery_settings=[
        {'name': 'enable_categorical_statistics', 'value': True},
        {'name': 'enable_numeric_skewness', 'value': True},
    ]
)
>>> relationship_config.id
'5c88a37770fc42a2fcc62759'
- get()
Retrieve the Relationships configuration for a given id
- Returns:
relationships_configuration – The requested relationships configuration
- Return type:
- Raises:
ClientError – Raised if an invalid relationships config id is provided.
Examples
relationships_config = dr.RelationshipsConfiguration(valid_config_id)
result = relationships_config.get()
>>> result.id
'5c88a37770fc42a2fcc62759'
- replace(dataset_definitions, relationships, feature_discovery_settings=None)
Update a Relationships Configuration that is not used in a feature discovery project
- Parameters:
dataset_definitions (
List[DatasetDefinition]
) – Each element is adatarobot.helpers.feature_discovery.DatasetDefinition
relationships (
List[Relationship]
) – Each element is adatarobot.helpers.feature_discovery.Relationship
feature_discovery_settings (
Optional[List[FeatureDiscoverySetting]]
) – Each element is a dictionary or adatarobot.helpers.feature_discovery.FeatureDiscoverySetting
. If not provided, default settings will be used.
- Returns:
relationships_configuration – the updated relationships configuration
- Return type:
- delete()
Delete the Relationships configuration
- Raises:
ClientError – Raised if an invalid relationships config id is provided.
Examples
# Deleting with a valid id
relationships_config = dr.RelationshipsConfiguration(valid_config_id)
status_code = relationships_config.delete()
status_code
>>> 204
relationships_config.get()
>>> ClientError: Relationships Configuration not found
Feature lineage
- class datarobot.models.FeatureLineage
Lineage of an automatically engineered feature.
- Variables:
steps (
list
) –list of steps which were applied to build the feature.
steps
structure is:- id - (int)
step id starting with 0.
- step_type: (str)
one of: data, action, json, generatedData.
- name: (str)
name of the step.
- description: (str)
description of the step.
- parents: (list[int])
references to other steps id.
- is_time_aware: (bool)
indicator of step being time aware. Mandatory only for action and join steps. action step provides additional information about feature derivation window in the timeInfo field.
- catalog_id: (str)
id of the catalog for a data step.
- catalog_version_id: (str)
id of the catalog version for a data step.
- group_by: (list[str])
list of columns which this action step aggregated by.
- columns: (list)
names of columns involved in the feature generation. Available only for data steps.
- time_info: (dict)
description of the feature derivation window which was applied to this action step.
- join_info: (list[dict])
join step details.
columns
structure is- data_type: (str)
the type of the feature, e.g. ‘Categorical’, ‘Text’
- is_input: (bool)
indicates features which provided data to transform in this lineage.
- name: (str)
feature name.
- is_cutoff: (bool)
indicates a cutoff column.
time_info
structure is:- latest: (dict)
end of the feature derivation window applied.
- duration: (dict)
size of the feature derivation window applied.
latest
and duration structure is:- time_unit: (str)
time unit name like ‘MINUTE’, ‘DAY’, ‘MONTH’ etc.
- duration: (int)
value/size of this duration object.
join_info
structure is:- join_type - (str)
kind of join, left/right.
- left_table - (dict)
information about a dataset which was considered as left.
- right_table - (dict)
information about a dataset which was considered as right.
left_table
andright_table
structure is:- columns - (list[str])
list of columns which datasets were joined by.
- datasteps - (list[int])
list of data steps id which brought the columns into the current step dataset.
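The steps and their parents references form a small dependency graph. A pure-Python sketch of walking it on a hand-built sample follows; the field values are invented, while the field names follow the structure above.

```python
def describe_lineage(steps):
    """Return one human-readable line per lineage step, resolving parents by id."""
    by_id = {step["id"]: step for step in steps}
    lines = []
    for step in steps:
        parent_names = [by_id[p]["name"] for p in step.get("parents", [])]
        origin = " <- " + ", ".join(parent_names) if parent_names else ""
        lines.append("{} [{}]{}".format(step["name"], step["step_type"], origin))
    return lines

# A hand-built sample matching the documented steps structure
sample_steps = [
    {"id": 0, "step_type": "data", "name": "transactions", "parents": []},
    {"id": 1, "step_type": "action", "name": "sum(Amount) (14 days)", "parents": [0]},
    {"id": 2, "step_type": "generatedData", "name": "profile[sum(Amount)]", "parents": [1]},
]
for line in describe_lineage(sample_steps):
    print(line)
```

A real lineage retrieved via FeatureLineage.get would expose the same steps list through its steps attribute.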
- classmethod get(project_id, id)
Retrieve a single FeatureLineage.
- Parameters:
project_id (
str
) – The id of the project the feature belongs toid (
str
) – id of a feature lineage to retrieve
- Returns:
lineage – The queried instance
- Return type:
OCR job resources
- class datarobot.models.ocr_job_resource.OCRJobResource
An OCR job resource container. It is used to:
- Get an existing OCR job resource.
- List available OCR job resources.
- Start an OCR job.
- Check the status of a started OCR job.
- Download the error report of a started OCR job.
Added in version v3.6.0b0.
- Variables:
id (
str
) – The identifier of an OCR job resource.input_catalog_id (
str
) – The identifier of an AI catalog item used as the OCR job input.output_catalog_id (
str
) – The identifier of an AI catalog item used as the OCR job output.user_id (
str
) – The identifier of a user.job_started (
bool
) – Determines if a job associated with the OCRJobResource has started.language (
str
) – String representation of OCRJobDatasetLanguage.
- classmethod get(job_resource_id)
Get an OCR job resource.
- Parameters:
job_resource_id (
str
) – identifier of OCR job resource- Returns:
returned OCR job resource
- Return type:
- classmethod list(offset=0, limit=10)
Get a list of OCR job resources.
- Parameters:
offset (
int
) – The offset of the query.limit (
int
) – The limit of returned OCR job resources.
- Returns:
A list of OCR job resources.
- Return type:
List[OCRJobResource]
- classmethod create(input_catalog_id, language)
Create a new OCR job resource and return it.
- Parameters:
input_catalog_id (
str
) – The identifier of an AI catalog item used as the OCR job input.language (
OCRJobDatasetLanguage
) – The OCR job dataset language.
- Returns:
The created OCR job resource.
- Return type:
- start_job()
Start an OCR job with this OCR job resource.
- Returns:
The response of starting an OCR job.
- Return type:
- get_job_status()
Get status of the OCR job associated with this OCR job resource.
- Returns:
OCR job status enum
- Return type:
- download_error_report(download_file_path)
Download the error report of the OCR job associated with this OCR job resource.
- Parameters:
download_file_path (
Path
) – path to download error report- Return type:
None
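A typical workflow starts a job and then polls get_job_status() until it reaches a terminal state. The loop can be sketched generically as below; the terminal status strings are assumptions rather than the actual OCRJobStatusEnum values, and the status callable stands in for a bound get_job_status method.

```python
import time

def wait_for_job(get_status, terminal_statuses=("COMPLETE", "ERROR"), timeout=600, poll=5):
    """Poll a status callable until it returns a terminal status or the timeout elapses.

    `get_status` stands in for OCRJobResource.get_job_status; the terminal status
    names here are illustrative, not the actual OCRJobStatusEnum values.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in terminal_statuses:
            return status
        time.sleep(poll)
    raise TimeoutError("OCR job did not finish within {} seconds".format(timeout))

# Usage with a fake status source that finishes on the third poll
statuses = iter(["RUNNING", "RUNNING", "COMPLETE"])
print(wait_for_job(lambda: next(statuses), poll=0))
```

With a real resource, `wait_for_job(job_resource.get_job_status)` would poll the server instead of the fake iterator.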
- class datarobot.models.ocr_job_resource.OCRJobDatasetLanguage
Supported OCR language
- class datarobot.models.ocr_job_resource.OCRJobStatusEnum
OCR Job status enum
- class datarobot.models.ocr_job_resource.StartOCRJobResponse
Container of Start OCR Job API response
Document text extraction
- class datarobot.models.documentai.document.FeaturesWithSamples
FeaturesWithSamples(model_id, feature_name, document_task)
- document_task
Alias for field number 2
- feature_name
Alias for field number 1
- model_id
Alias for field number 0
- class datarobot.models.documentai.document.DocumentPageFile
Page of a document as an image file.
- Variables:
project_id (
str
) – The identifier of the project which the document page belongs to.document_page_id (
str
) – The unique identifier for the document page.height (
int
) – The height of the document thumbnail in pixels.width (
int
) – The width of the document thumbnail in pixels.thumbnail_bytes (
bytes
) – The number of bytes of the document thumbnail image. Accessing this may require a server request and an associated delay in fetching the resource.mime_type (
str
) – The mime image type of the document thumbnail. Example: ‘image/png’
- property thumbnail_bytes: bytes
Document thumbnail as bytes.
- Returns:
Document thumbnail.
- Return type:
bytes
- property mime_type: str
Mime image type of the document thumbnail. Example: ‘image/png’
- Returns:
Mime image type of the document thumbnail.
- Return type:
str
- class datarobot.models.documentai.document.DocumentThumbnail
Thumbnail of document from the project’s dataset.
If
Project.stage
isdatarobot.enums.PROJECT_STAGE.EDA2
and it is a supervised project then thetarget_*
attributes of this class will have values, otherwise the values will all be None.- Variables:
document (
Document
) – The document object.project_id (
str
) – The identifier of the project which the document thumbnail belongs to.target_value (
str
) – The target value used for filtering thumbnails.
- classmethod list(project_id, feature_name, target_value=None, offset=None, limit=None)
Get document thumbnails from a project.
- Parameters:
project_id (
str
) – The identifier of the project which the document thumbnail belongs to.feature_name (
str
) – The name of feature that specifies the document type.target_value (
Optional[str]
, defaultNone
) – The target value to filter thumbnails.offset (
Optional[int]
, defaultNone
) – The number of documents to be skipped.limit (
Optional[int]
, defaultNone
) – The number of document thumbnails to return.
- Returns:
documents – A list of
DocumentThumbnail
objects, each representing a single document.- Return type:
List[DocumentThumbnail]
Notes
Actual document thumbnails are not fetched from the server by this method. Instead the data gets loaded lazily when
DocumentPageFile
object attributes are accessed.Examples
Fetch document thumbnails for the given
project_id
andfeature_name
.
from datarobot._experimental.models.documentai.document import DocumentThumbnail

# Fetch five documents from the EDA SAMPLE for the specified project and specific feature
document_thumbs = DocumentThumbnail.list(project_id, feature_name, limit=5)

# Fetch five documents for the specified project with target value filtering
# This option is only available after selecting the project target and starting modeling
target1_thumbs = DocumentThumbnail.list(project_id, feature_name, target_value='target1', limit=5)
Preview the document thumbnail.
from datarobot._experimental.models.documentai.document import DocumentThumbnail
from datarobot.helpers.image_utils import get_image_from_bytes

# Fetch 3 documents
document_thumbs = DocumentThumbnail.list(project_id, feature_name, limit=3)

for doc_thumb in document_thumbs:
    thumbnail = get_image_from_bytes(doc_thumb.document.thumbnail_bytes)
    thumbnail.show()
- class datarobot.models.documentai.document.DocumentTextExtractionSample
Stateless class for computing and retrieving Document Text Extraction Samples.
Notes
Actual document text extraction samples are not fetched from the server at the moment of the function call. Detailed information on the documents, their pages, and their rendered images is fetched on demand (lazy loading).
Examples
1) Compute text extraction samples for a specific model, and fetch all existing document text extraction samples for a specific project.
from datarobot._experimental.models.documentai.document import DocumentTextExtractionSample

SPECIFIC_MODEL_ID1 = "model_id1"
SPECIFIC_MODEL_ID2 = "model_id2"
SPECIFIC_PROJECT_ID = "project_id"

# Order computation of document text extraction sample for specific model.
# By default `compute` method will await for computation to end before returning
DocumentTextExtractionSample.compute(SPECIFIC_MODEL_ID1, await_completion=False)
DocumentTextExtractionSample.compute(SPECIFIC_MODEL_ID2)

samples = DocumentTextExtractionSample.list_features_with_samples(SPECIFIC_PROJECT_ID)
2) Fetch document text extraction samples for a specific model_id and feature_name, and display all document sample pages.
from datarobot._experimental.models.documentai.document import DocumentTextExtractionSample
from datarobot.helpers.image_utils import get_image_from_bytes

SPECIFIC_MODEL_ID = "model_id"
SPECIFIC_FEATURE_NAME = "feature_name"

samples = DocumentTextExtractionSample.list_pages(
    model_id=SPECIFIC_MODEL_ID,
    feature_name=SPECIFIC_FEATURE_NAME
)

for sample in samples:
    thumbnail = sample.document_page.thumbnail
    image = get_image_from_bytes(thumbnail.thumbnail_bytes)
    image.show()
3) Fetch document text extraction samples for specific model_id and feature_name and display text extraction details for the first page. This example displays the image of the document with bounding boxes of detected text lines. It also returns a list of all text lines extracted from page along with their coordinates.
from datarobot._experimental.models.documentai.document import DocumentTextExtractionSample

SPECIFIC_MODEL_ID = "model_id"
SPECIFIC_FEATURE_NAME = "feature_name"

samples = DocumentTextExtractionSample.list_pages(SPECIFIC_MODEL_ID, SPECIFIC_FEATURE_NAME)

# Draw bounding boxes for first document page sample and display related text data.
image = samples[0].get_document_page_with_text_locations()
image.show()

# For each text block represented as bounding box object drawn on original image
# display its coordinates (top, left, bottom, right) and extracted text value
for text_line in samples[0].text_lines:
    print(text_line)
- classmethod compute(model_id, await_completion=True, max_wait=600)
Starts computation of document text extraction samples for the model. This method allows the calculation to continue for a specified time and, if not complete by then, cancels the request.
- Parameters:
model_id (
str
) – The identifier of the model for which to start computing document text extraction samples.await_completion (
bool
) – Determines whether the method should wait for completion before exiting or not.max_wait (
int (default=600)
) – The maximum number of seconds to wait for the request to finish before raising an AsyncTimeoutError.
- Raises:
ClientError – The server rejected the request due to a client error; a bad model_id is a common cause.
AsyncFailureError – Raised if any response from the server is unexpected.
AsyncProcessUnsuccessfulError – Raised if the document text extraction sample computation failed or was cancelled.
AsyncTimeoutError – Raised if the document text extraction sample computation did not resolve within the specified time limit (max_wait).
- Return type:
None
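The await_completion/max_wait behavior follows the usual asynchronous-job pattern: poll until the job reports done, or give up after the time limit. A minimal, library-agnostic sketch of that pattern, where `check_status` is a stand-in callable and not part of the SDK:

```python
import time

def wait_for_job(check_status, max_wait=600, poll_interval=1.0):
    """Poll check_status() until it returns True, or raise after max_wait seconds.

    check_status is a stand-in for an API status call; in the SDK, this
    polling is handled internally by compute(await_completion=True).
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if check_status():
            return
        time.sleep(poll_interval)
    # Mirrors the SDK raising AsyncTimeoutError once max_wait is exceeded.
    raise TimeoutError(f"Job did not finish within {max_wait} seconds")

# Example: a fake job that reports done on the third poll.
calls = {"n": 0}
def fake_status():
    calls["n"] += 1
    return calls["n"] >= 3

wait_for_job(fake_status, max_wait=5, poll_interval=0.01)
```

The SDK variant additionally cancels the server-side request on timeout; this sketch only raises.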
- classmethod list_features_with_samples(project_id)
Returns a list of (feature, model_id) pairs for which document text extraction samples have been computed.
- Parameters:
project_id (
str
) – The project ID to retrieve the list of computed samples for.
- Return type:
List[FeaturesWithSamples]
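The exact shape of FeaturesWithSamples is defined by the SDK; assuming it behaves like a named tuple of (model_id, feature_name), the returned pairs can be grouped per model as below. The sample data here is invented for illustration; in practice it would come from DocumentTextExtractionSample.list_features_with_samples(project_id).

```python
from collections import namedtuple, defaultdict

# Hypothetical stand-in for the SDK's FeaturesWithSamples result type.
FeaturesWithSamples = namedtuple("FeaturesWithSamples", ["model_id", "feature_name"])

# Invented sample data for illustration only.
pairs = [
    FeaturesWithSamples("model-a", "contract_text"),
    FeaturesWithSamples("model-a", "invoice_scan"),
    FeaturesWithSamples("model-b", "contract_text"),
]

# Group feature names by the model they were computed for.
features_by_model = defaultdict(list)
for pair in pairs:
    features_by_model[pair.model_id].append(pair.feature_name)

print(dict(features_by_model))
```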
- classmethod list_pages(model_id, feature_name, document_index=None, document_task=None)
Returns a list of document text extraction sample pages.
- Parameters:
model_id (
str
) – The model identifier.feature_name (
str
) – The specific feature name to retrieve.document_index (
Optional[int]
) – The specific document index to retrieve. Defaults to None.document_task (
Optional[str]
) – The document blueprint task.
- Return type:
List[DocumentTextExtractionSamplePage]
- classmethod list_documents(model_id, feature_name)
Returns a list of documents used for text extraction.
- Parameters:
model_id (
str
) – The model identifier.feature_name (
str
) – The feature name.
- Return type:
List[DocumentTextExtractionSampleDocument]
- class datarobot.models.documentai.document.DocumentTextExtractionSampleDocument
Document text extraction source.
Holds data that contains feature and model prediction values, as well as the thumbnail of the document.
- Variables:
document_index (
int
) – The index of the document page sample.feature_name (
str
) – The name of the feature that the document text extraction sample is related to.thumbnail_id (
str
) – The document page ID.thumbnail_width (
int
) – The thumbnail image width.thumbnail_height (
int
) – The thumbnail image height.thumbnail_link (
str
) – The thumbnail image download link.document_task (
str
) – The document blueprint task that the document belongs to.actual_target_value (
Optional[Union[str
,int
,List[str]]]
) – The actual target value.prediction (
Optional[PredictionType]
) – Prediction values and labels.
- classmethod list(model_id, feature_name, document_task=None)
List available documents with document text extraction samples.
- Parameters:
model_id (
str
) – The identifier for the model.feature_name (
str
) – The name of the feature.document_task (
Optional[str]
) – The document blueprint task.
- Return type:
List[DocumentTextExtractionSampleDocument]
- class datarobot.models.documentai.document.DocumentTextExtractionSamplePage
Document text extraction sample covering one document page.
Holds data about the document page, the recognized text, and the location of the text in the document page.
- Variables:
page_index (
int
) – The index of the page inside the document.document_index (
int
) – The index of the document inside the dataset.feature_name (
str
) – The name of the feature that the document text extraction sample belongs to.document_page_id (
str
) – The document page ID.document_page_width (
int
) – Document page width.document_page_height (
int
) – Document page height.document_page_link (
str
) – Document page link to download the document page image.text_lines (
List[Dict[str
,Union[int
,str]]]
) – A list of text lines and their coordinates.document_task (
str
) – The document blueprint task that the page belongs to.actual_target_value (
Optional[Union[str
,int
,List[str]]]
) – Actual target value.prediction (
Optional[PredictionType]
) – Prediction values and labels.
- classmethod list(model_id, feature_name, document_index=None, document_task=None)
Returns a list of document text extraction sample pages.
- Parameters:
model_id (
str
) – The model identifier, used to retrieve document text extraction page samples.feature_name (
str
) – The feature name, used to retrieve document text extraction page samples.document_index (
Optional[int]
) – The specific document index to retrieve. Defaults to None.document_task (
Optional[str]
) – Document blueprint task.
- Return type:
List[DocumentTextExtractionSamplePage]
- get_document_page_with_text_locations(line_color='blue', line_width=3, padding=3)
Returns the document page with bounding boxes drawn around the text lines as a PIL.Image.
- Parameters:
line_color (
str
) – The color used to draw a bounding box on the image page. Defaults to blue.line_width (
int
) – The line width of the bounding boxes that will be drawn. Defaults to 3.padding (
int
) – The additional space left between the text and the bounding box, measured in pixels. Defaults to 3.
- Returns:
Returns a PIL.Image with drawn text-bounding boxes.
- Return type:
Image
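For reference, the drawing that get_document_page_with_text_locations performs can be approximated directly with Pillow. This sketch applies the same line_color, line_width, and padding parameters to a blank page and an assumed box layout; it is an illustration of the rendering, not the SDK's implementation.

```python
from PIL import Image, ImageDraw

def draw_text_boxes(image, boxes, line_color="blue", line_width=3, padding=3):
    """Draw a padded rectangle around each (left, top, right, bottom) box."""
    draw = ImageDraw.Draw(image)
    for left, top, right, bottom in boxes:
        draw.rectangle(
            (left - padding, top - padding, right + padding, bottom + padding),
            outline=line_color,
            width=line_width,
        )
    return image

# A blank white "page" with one assumed text-line box.
page = Image.new("RGB", (200, 100), "white")
draw_text_boxes(page, [(20, 20, 180, 40)])
```

The padding expands each box outward before drawing, so the outline does not touch the text itself.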