Feature
- class datarobot.models.Feature(id, project_id=None, name=None, feature_type=None, importance=None, low_information=None, unique_count=None, na_count=None, date_format=None, min=None, max=None, mean=None, median=None, std_dev=None, time_series_eligible=None, time_series_eligibility_reason=None, time_step=None, time_unit=None, target_leakage=None, feature_lineage_id=None, key_summary=None, multilabel_insights=None)
A feature from a project’s dataset
These are features either included in the originally uploaded dataset or added to it via feature transformations. In time series projects, these will be distinct from the ModelingFeatures created during partitioning; otherwise, they will correspond to the same features. For more information about input and modeling features, see the time series documentation.
The min, max, mean, median, and std_dev attributes provide information about the distribution of the feature in the EDA sample data. For non-numeric features or features created prior to these summary statistics becoming available, they will be None. For features where the summary statistics are available, they will be in a format compatible with the data type, i.e. date type features will have their summary statistics expressed as ISO-8601 formatted date strings.
- Attributes:
- idint
the id for the feature - note that name is used to reference the feature instead of id
- project_idstr
the id of the project the feature belongs to
- namestr
the name of the feature
- feature_typestr
the type of the feature, e.g. ‘Categorical’, ‘Text’
- importancefloat or None
numeric measure of the strength of relationship between the feature and target (independent of any model or other features); may be None for non-modeling features such as partition columns
- low_informationbool
whether a feature is considered too uninformative for modeling (e.g. because it has too few values)
- unique_countint
number of unique values
- na_countint or None
number of missing values
- date_formatstr or None
For Date features, the date format string for how this feature was interpreted, compatible with https://docs.python.org/2/library/time.html#time.strftime . For other feature types, None.
- minstr, int, float, or None
The minimum value of the source data in the EDA sample
- maxstr, int, float, or None
The maximum value of the source data in the EDA sample
- meanstr, int, float, or None
The arithmetic mean of the source data in the EDA sample
- medianstr, int, float, or None
The median of the source data in the EDA sample
- std_devstr, int, float, or None
The standard deviation of the source data in the EDA sample
- time_series_eligiblebool
Whether this feature can be used as the datetime partition column in a time series project.
- time_series_eligibility_reasonstr
Why the feature is ineligible for the datetime partition column in a time series project, or ‘suitable’ when it is eligible.
- time_stepint or None
For time series eligible features, a positive integer determining the interval at which windows can be specified. If used as the datetime partition column on a time series project, the feature derivation and forecast windows must start and end at an integer multiple of this value. None for features that are not time series eligible.
- time_unitstr or None
For time series eligible features, the time unit covered by a single time step, e.g. ‘HOUR’, or None for features that are not time series eligible.
- target_leakagestr
Whether a feature is considered to have target leakage or not. A value of ‘SKIPPED_DETECTION’ indicates that target leakage detection was not run on the feature. ‘FALSE’ indicates no leakage, ‘MODERATE’ indicates a moderate risk of target leakage, and ‘HIGH_RISK’ indicates a high risk of target leakage
- feature_lineage_idstr
id of a lineage for automatically discovered features or derived time series features.
- key_summary: list of dict
Statistics for the top 50 keys (truncated to 103 characters) of a Summarized Categorical column. Example:
{'key': 'DataRobot', 'summary': {'min': 0, 'max': 29815.0, 'stdDev': 6498.029, 'mean': 1490.75, 'median': 0.0, 'pctRows': 5.0}}
- where,
- key: string or None
name of the key
- summary: dict
statistics of the key
- max: maximum value of the key.
- min: minimum value of the key.
- mean: mean value of the key.
- median: median value of the key.
- stdDev: standard deviation of the key.
- pctRows: percentage occurrence of key in the EDA sample of the feature.
- multilabel_insights_keystr or None
For multicategorical columns this will contain a key for multilabel insights. The key is unique for a project, feature and EDA stage combination. This will be the key for the most recent, finished EDA stage.
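The key_summary entries described above are plain dicts, so they can be inspected with ordinary Python. The following is a minimal sketch that ranks keys by pctRows. The helper name top_keys_by_coverage is hypothetical, and only the first sample entry comes from the documented example; the other two are made-up data.

```python
def top_keys_by_coverage(key_summary, n=3):
    """Return up to n (key, pctRows) pairs, highest pctRows first."""
    ranked = sorted(
        key_summary,
        key=lambda entry: entry["summary"]["pctRows"],
        reverse=True,
    )
    return [(entry["key"], entry["summary"]["pctRows"]) for entry in ranked[:n]]


# First entry is the documented example; the other two are made-up sample data.
sample_key_summary = [
    {"key": "DataRobot", "summary": {"min": 0, "max": 29815.0, "stdDev": 6498.029,
                                     "mean": 1490.75, "median": 0.0, "pctRows": 5.0}},
    {"key": "alpha", "summary": {"min": 0, "max": 10.0, "stdDev": 2.5,
                                 "mean": 3.0, "median": 2.0, "pctRows": 40.0}},
    {"key": "beta", "summary": {"min": 1, "max": 7.0, "stdDev": 1.5,
                                "mean": 4.0, "median": 4.0, "pctRows": 12.5}},
]

print(top_keys_by_coverage(sample_key_summary, n=2))
```

In real use, the list would presumably come from the key_summary attribute of a Summarized Categorical feature rather than from literal data.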
- classmethod get(project_id, feature_name)
Retrieve a single feature
- Parameters:
- project_idstr
The ID of the project the feature is associated with.
- feature_namestr
The name of the feature to retrieve
- Returns:
- featureFeature
The queried instance
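Because min, max, mean, median, and std_dev may each be None, code that reads them usually has to filter. The sketch below shows one way to do that. It uses types.SimpleNamespace objects as stand-ins for the instance returned by Feature.get(project_id, feature_name), so it runs without a DataRobot connection; the helper name summary_stats and the sample values are hypothetical.

```python
from types import SimpleNamespace

STAT_NAMES = ("min", "max", "mean", "median", "std_dev")


def summary_stats(feature):
    """Return only the EDA summary statistics that are not None."""
    stats = {name: getattr(feature, name, None) for name in STAT_NAMES}
    return {name: value for name, value in stats.items() if value is not None}


# Stand-ins for objects returned by Feature.get(project_id, feature_name).
numeric_feature = SimpleNamespace(
    name="age", feature_type="Numeric",
    min=18, max=90, mean=41.5, median=40, std_dev=12.3,
)
text_feature = SimpleNamespace(
    name="notes", feature_type="Text",
    min=None, max=None, mean=None, median=None, std_dev=None,
)

print(summary_stats(numeric_feature))
print(summary_stats(text_feature))  # empty dict: no statistics for non-numeric features
```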
- get_multiseries_properties(multiseries_id_columns, max_wait=600)
Retrieve time series properties for a potential multiseries datetime partition column
Multiseries time series projects use multiseries id columns to model multiple distinct series within a single project. This function returns the time series properties (time step and time unit) of this column if it were used as a datetime partition column with the specified multiseries id columns, running multiseries detection automatically if it has not previously run successfully.
- Parameters:
- multiseries_id_columnslist of str
the name(s) of the multiseries id columns to use with this datetime partition column. Currently only one multiseries id column is supported.
- max_waitint, optional
if a multiseries detection task is run, the maximum amount of time to wait for it to complete before giving up
- Returns:
- propertiesdict
A dict with three keys:
time_series_eligible : bool, whether the column can be used as a partition column
time_unit : str or null, the inferred time unit if used as a partition column
time_step : int or null, the inferred time step if used as a partition column
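The time_step value matters because, as described for the time_step attribute, feature derivation and forecast windows must start and end at an integer multiple of it. A minimal sketch of that check follows, assuming a properties dict shaped like the return value above. The helper name window_is_aligned and the sample values are hypothetical.

```python
def window_is_aligned(properties, window_start, window_end):
    """Check that both window bounds are integer multiples of time_step."""
    if not properties["time_series_eligible"]:
        return False
    step = properties["time_step"]
    return window_start % step == 0 and window_end % step == 0


# Hypothetical properties for a column with one observation every 7 days.
props = {"time_series_eligible": True, "time_unit": "DAY", "time_step": 7}
ineligible = {"time_series_eligible": False, "time_unit": None, "time_step": None}

print(window_is_aligned(props, -28, 0))       # True
print(window_is_aligned(props, -10, 0))       # False
print(window_is_aligned(ineligible, -28, 0))  # False
```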
- get_cross_series_properties(datetime_partition_column, cross_series_group_by_columns, max_wait=600)
Retrieve cross-series properties for multiseries ID column.
This function returns the cross-series properties (eligibility as a group-by column) of this column if it were used with the specified datetime partition column and the current multiseries id column, running cross-series group-by validation automatically if it has not previously run successfully.
- Parameters:
- datetime_partition_columndatetime partition column
- cross_series_group_by_columnslist of str
the name(s) of the columns to use with this multiseries ID column. Currently only one cross-series group-by column is supported.
- max_waitint, optional
if a multiseries detection task is run, the maximum amount of time to wait for it to complete before giving up
- Returns:
- propertiesdict
A dict with three keys:
name : str, column name
eligibility : str, reason for column eligibility
isEligible : bool, is column eligible as cross-series group-by
- get_multicategorical_histogram()
Retrieve multicategorical histogram for this feature
Added in version v2.24.
- Returns:
- datarobot.models.MulticategoricalHistogram
- Raises:
- datarobot.errors.InvalidUsageError
if this method is called on an unsuitable feature
- ValueError
if no multilabel_insights_key is present for this feature
- get_pairwise_correlations()
Retrieve pairwise label correlation for multicategorical features
Added in version v2.24.
- Returns:
- datarobot.models.PairwiseCorrelations
- Raises:
- datarobot.errors.InvalidUsageError
if this method is called on an unsuitable feature
- ValueError
if no multilabel_insights_key is present for this feature
- get_pairwise_joint_probabilities()
Retrieve pairwise label joint probabilities for multicategorical features
Added in version v2.24.
- Returns:
- datarobot.models.PairwiseJointProbabilities
- Raises:
- datarobot.errors.InvalidUsageError
if this method is called on an unsuitable feature
- ValueError
if no multilabel_insights_key is present for this feature
- get_pairwise_conditional_probabilities()
Retrieve pairwise label conditional probabilities for multicategorical features
Added in version v2.24.
- Returns:
- datarobot.models.PairwiseConditionalProbabilities
- Raises:
- datarobot.errors.InvalidUsageError
if this method is called on an unsuitable feature
- ValueError
if no multilabel_insights_key is present for this feature
- classmethod from_data(data)
Instantiate an object of this class using a dict.
- Parameters:
- datadict
Correctly snake_cased keys and their values.
- Return type:
TypeVar(T, bound=APIObject)
- classmethod from_server_data(data, keep_attrs=None)
Instantiate an object of this class using the data directly from the server, meaning that the keys may have the wrong camel casing
- Parameters:
- datadict
The directly translated dict of JSON from the server. No casing fixes have taken place
- keep_attrsiterable
List, set or tuple of the dotted namespace notations for attributes to keep within the object structure even if their values are None
- Return type:
TypeVar(T, bound=APIObject)
- get_histogram(bin_limit=None)
Retrieve a feature histogram
- Parameters:
- bin_limitint or None
Desired max number of histogram bins. If omitted, the endpoint uses 60 by default.
- Returns:
- featureHistogramFeatureHistogram
The requested histogram with the desired number of bins
- class datarobot.models.ModelingFeature(project_id=None, name=None, feature_type=None, importance=None, low_information=None, unique_count=None, na_count=None, date_format=None, min=None, max=None, mean=None, median=None, std_dev=None, parent_feature_names=None, key_summary=None, is_restored_after_reduction=None)
A feature used for modeling
In time series projects, a new set of modeling features is created after setting the partitioning options. These features are automatically derived from those in the project’s dataset and are the features used for modeling. Modeling features are only accessible once the target and partitioning options have been set. In projects that don’t use time series modeling, once the target has been set, ModelingFeatures and Features will behave the same.
For more information about input and modeling features, see the time series documentation.
As with the Feature object, the min, max, mean, median, and std_dev attributes provide information about the distribution of the feature in the EDA sample data. For non-numeric features, they will be None. For features where the summary statistics are available, they will be in a format compatible with the data type, i.e. date type features will have their summary statistics expressed as ISO-8601 formatted date strings.
- Attributes:
- project_idstr
the id of the project the feature belongs to
- namestr
the name of the feature
- feature_typestr
the type of the feature, e.g. ‘Categorical’, ‘Text’
- importancefloat or None
numeric measure of the strength of relationship between the feature and target (independent of any model or other features); may be None for non-modeling features such as partition columns
- low_informationbool
whether a feature is considered too uninformative for modeling (e.g. because it has too few values)
- unique_countint
number of unique values
- na_countint or None
number of missing values
- date_formatstr or None
For Date features, the date format string for how this feature was interpreted, compatible with https://docs.python.org/2/library/time.html#time.strftime . For other feature types, None.
- minstr, int, float, or None
The minimum value of the source data in the EDA sample
- maxstr, int, float, or None
The maximum value of the source data in the EDA sample
- meanstr, int, float, or None
The arithmetic mean of the source data in the EDA sample
- medianstr, int, float, or None
The median of the source data in the EDA sample
- std_devstr, int, float, or None
The standard deviation of the source data in the EDA sample
- parent_feature_nameslist of str
A list of the names of input features used to derive this modeling feature. In cases where the input features and modeling features are the same, this will simply contain the feature’s name. Note that if a derived feature was used to create this modeling feature, the values here will not necessarily correspond to the features that must be supplied at prediction time.
- key_summary: list of dict
Statistics for the top 50 keys (truncated to 103 characters) of a Summarized Categorical column. Example:
{'key': 'DataRobot', 'summary': {'min': 0, 'max': 29815.0, 'stdDev': 6498.029, 'mean': 1490.75, 'median': 0.0, 'pctRows': 5.0}}
- where,
- key: string or None
name of the key
- summary: dict
statistics of the key
- max: maximum value of the key.
- min: minimum value of the key.
- mean: mean value of the key.
- median: median value of the key.
- stdDev: standard deviation of the key.
- pctRows: percentage occurrence of key in the EDA sample of the feature.
- classmethod get(project_id, feature_name)
Retrieve a single modeling feature
- Parameters:
- project_idstr
The ID of the project the feature is associated with.
- feature_namestr
The name of the feature to retrieve
- Returns:
- featureModelingFeature
The requested feature
- class datarobot.models.DatasetFeature(id_, dataset_id=None, dataset_version_id=None, name=None, feature_type=None, low_information=None, unique_count=None, na_count=None, date_format=None, min_=None, max_=None, mean=None, median=None, std_dev=None, time_series_eligible=None, time_series_eligibility_reason=None, time_step=None, time_unit=None, target_leakage=None, target_leakage_reason=None)
A feature from a project’s dataset
These are features either included in the originally uploaded dataset or added to it via feature transformations.
The min, max, mean, median, and std_dev attributes provide information about the distribution of the feature in the EDA sample data. For non-numeric features or features created prior to these summary statistics becoming available, they will be None. For features where the summary statistics are available, they will be in a format compatible with the data type, i.e. date type features will have their summary statistics expressed as ISO-8601 formatted date strings.
- Attributes:
- idint
the id for the feature - note that name is used to reference the feature instead of id
- dataset_idstr
the id of the dataset the feature belongs to
- dataset_version_idstr
the id of the dataset version the feature belongs to
- namestr
the name of the feature
- feature_typestr, optional
the type of the feature, e.g. ‘Categorical’, ‘Text’
- low_informationbool, optional
whether a feature is considered too uninformative for modeling (e.g. because it has too few values)
- unique_countint, optional
number of unique values
- na_countint, optional
number of missing values
- date_formatstr, optional
For Date features, the date format string for how this feature was interpreted, compatible with https://docs.python.org/2/library/time.html#time.strftime . For other feature types, None.
- minstr, int, float, optional
The minimum value of the source data in the EDA sample
- maxstr, int, float, optional
The maximum value of the source data in the EDA sample
- meanstr, int, float, optional
The arithmetic mean of the source data in the EDA sample
- medianstr, int, float, optional
The median of the source data in the EDA sample
- std_devstr, int, float, optional
The standard deviation of the source data in the EDA sample
- time_series_eligiblebool, optional
Whether this feature can be used as the datetime partition column in a time series project.
- time_series_eligibility_reasonstr, optional
Why the feature is ineligible for the datetime partition column in a time series project, or ‘suitable’ when it is eligible.
- time_stepint, optional
For time series eligible features, a positive integer determining the interval at which windows can be specified. If used as the datetime partition column on a time series project, the feature derivation and forecast windows must start and end at an integer multiple of this value. None for features that are not time series eligible.
- time_unitstr, optional
For time series eligible features, the time unit covered by a single time step, e.g. ‘HOUR’, or None for features that are not time series eligible.
- target_leakagestr, optional
Whether a feature is considered to have target leakage or not. A value of ‘SKIPPED_DETECTION’ indicates that target leakage detection was not run on the feature. ‘FALSE’ indicates no leakage, ‘MODERATE’ indicates a moderate risk of target leakage, and ‘HIGH_RISK’ indicates a high risk of target leakage
- target_leakage_reason: string, optional
The descriptive text explaining the reason for target leakage, if any.
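The target_leakage values listed above ('SKIPPED_DETECTION', 'FALSE', 'MODERATE', 'HIGH_RISK') make it straightforward to screen features before modeling. The sketch below is one possible filter, not part of the library: the helper name features_with_leakage is hypothetical, and SimpleNamespace objects stand in for real feature instances.

```python
from types import SimpleNamespace

# Values documented as indicating some risk of target leakage.
RISKY_LEVELS = {"MODERATE", "HIGH_RISK"}


def features_with_leakage(features):
    """Return names of features flagged with moderate or high leakage risk."""
    return [f.name for f in features if f.target_leakage in RISKY_LEVELS]


# Made-up stand-ins for feature objects exposing name and target_leakage.
sample_features = [
    SimpleNamespace(name="price", target_leakage="FALSE"),
    SimpleNamespace(name="refund_issued", target_leakage="HIGH_RISK"),
    SimpleNamespace(name="row_id", target_leakage="SKIPPED_DETECTION"),
    SimpleNamespace(name="discount", target_leakage="MODERATE"),
]

print(features_with_leakage(sample_features))
```

Note that 'SKIPPED_DETECTION' is treated here as not risky only because no detection result exists; whether to review such features manually is a judgment call.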
- get_histogram(bin_limit=None)
Retrieve a feature histogram
- Parameters:
- bin_limitint or None
Desired max number of histogram bins. If omitted, the endpoint uses 60 by default.
- Returns:
- featureHistogramDatasetFeatureHistogram
The requested histogram with the desired number of bins
- class datarobot.models.DatasetFeatureHistogram(plot)
- classmethod get(dataset_id, feature_name, bin_limit=None, key_name=None)
Retrieve a single feature histogram
- Parameters:
- dataset_idstr
The ID of the Dataset the feature is associated with.
- feature_namestr
The name of the feature to retrieve
- bin_limitint or None
Desired max number of histogram bins. If omitted, by default the endpoint will use 60.
- key_name: string or None
(Only required for summarized categorical features) Name of the key, among the top 50 keys, for which the plot is retrieved
- Returns:
- featureHistogramFeatureHistogram
The queried instance with plot attribute in it.
- class datarobot.models.FeatureHistogram(plot)
- classmethod get(project_id, feature_name, bin_limit=None, key_name=None)
Retrieve a single feature histogram
- Parameters:
- project_idstr
The ID of the project the feature is associated with.
- feature_namestr
The name of the feature to retrieve
- bin_limitint or None
Desired max number of histogram bins. If omitted, the endpoint uses 60 by default.
- key_name: string or None
(Only required for summarized categorical features) Name of the key, among the top 50 keys, for which the plot is retrieved
- Returns:
- featureHistogramFeatureHistogram
The queried instance with plot attribute in it.
- class datarobot.models.InteractionFeature(rows, source_columns, bars, bubbles)
Interaction feature data
Added in version v2.21.
- Attributes:
- rows: int
Total number of rows
- source_columns: list(str)
names of two categorical features which were combined into this one
- bars: list(dict)
dictionaries representing frequencies of each independent value from the source columns
- bubbles: list(dict)
dictionaries representing frequencies of each combined value in the interaction feature.
- classmethod get(project_id, feature_name)
Retrieve a single Interaction feature
- Parameters:
- project_idstr
The id of the project the feature belongs to
- feature_namestr
The name of the Interaction feature to retrieve
- Returns:
- featureInteractionFeature
The queried instance
- class datarobot.models.MulticategoricalHistogram(feature_name, histogram)
Histogram for Multicategorical feature.
Added in version v2.24.
Notes
HistogramValues contains:
- values.[].label : string - Label name
- values.[].plot : list - Histogram for label
- values.[].plot.[].label_relevance : int - Label relevance value
- values.[].plot.[].row_count : int - Row count where label has given relevance
- values.[].plot.[].row_pct : float - Percentage of rows where label has given relevance
- Attributes:
- feature_namestr
Name of the feature
- valueslist(dict)
List of Histogram values with a schema described as HistogramValues
- classmethod get(multilabel_insights_key)
Retrieves multicategorical histogram
You might find it more convenient to use Feature.get_multicategorical_histogram instead.
- Parameters:
- multilabel_insights_key: string
Key for multilabel insights, unique for a project, feature and EDA stage combination. The multilabel_insights_key can be retrieved via Feature.multilabel_insights_key.
- Returns:
- MulticategoricalHistogram
The multicategorical histogram for multilabel_insights_key
- to_dataframe()
Convenience method to get all the information from this multicategorical_histogram instance in the form of a pandas.DataFrame.
- Returns:
- pandas.DataFrame
Histogram information as a DataFrame. The dataframe will contain these columns: feature_name, label, label_relevance, row_count and row_pct
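To make the relationship between the nested HistogramValues schema and the flat column layout concrete, here is a small standard-library sketch that flattens values into one row per label and relevance. It is an illustration only, not the library's implementation of to_dataframe; the helper name histogram_rows and the sample data are hypothetical.

```python
def histogram_rows(feature_name, values):
    """Flatten HistogramValues into one dict per (label, label_relevance)."""
    rows = []
    for label_entry in values:
        for point in label_entry["plot"]:
            rows.append({
                "feature_name": feature_name,
                "label": label_entry["label"],
                "label_relevance": point["label_relevance"],
                "row_count": point["row_count"],
                "row_pct": point["row_pct"],
            })
    return rows


# Made-up values following the documented schema.
sample_values = [
    {"label": "sports", "plot": [
        {"label_relevance": 0, "row_count": 80, "row_pct": 80.0},
        {"label_relevance": 1, "row_count": 20, "row_pct": 20.0},
    ]},
    {"label": "news", "plot": [
        {"label_relevance": 0, "row_count": 35, "row_pct": 35.0},
        {"label_relevance": 1, "row_count": 65, "row_pct": 65.0},
    ]},
]

for row in histogram_rows("topics", sample_values):
    print(row)
```

A list of dicts in this shape can be passed to pandas.DataFrame if a tabular view is wanted.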
- class datarobot.models.PairwiseCorrelations(*args, **kwargs)
Correlation of label pairs for multicategorical feature.
Added in version v2.24.
Notes
CorrelationValues contain:
- values.[].label_configuration : list of length 2 - Configuration of the label pair
- values.[].label_configuration.[].label : str – Label name
- values.[].statistic_value : float – Statistic value
- Attributes:
- feature_namestr
Name of the feature
- valueslist(dict)
List of correlation values with a schema described as CorrelationValues
- statistic_dataframepandas.DataFrame
Correlation values for all label pairs as a DataFrame
- classmethod get(multilabel_insights_key)
Retrieves pairwise correlations
You might find it more convenient to use Feature.get_pairwise_correlations instead.
- Parameters:
- multilabel_insights_key: string
Key for multilabel insights, unique for a project, feature and EDA stage combination. The multilabel_insights_key can be retrieved via Feature.multilabel_insights_key.
- Returns:
- PairwiseCorrelations
The pairwise label correlations
- as_dataframe()
The pairwise label correlations as a (num_labels x num_labels) DataFrame.
- Returns:
- pandas.DataFrame
The pairwise label correlations. Index and column names allow the interpretation of the values.
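The (num_labels x num_labels) layout can be pictured as a lookup from a pair of labels to a statistic. The sketch below builds such a lookup from CorrelationValues-style entries using nested dicts. It is not the library's as_dataframe implementation; the helper name correlation_matrix and the sample values are hypothetical, and filling both orientations relies on the assumption that the statistic is symmetric.

```python
def correlation_matrix(values):
    """Build matrix[label_a][label_b] from CorrelationValues-style entries."""
    matrix = {}
    for entry in values:
        first, second = (cfg["label"] for cfg in entry["label_configuration"])
        stat = entry["statistic_value"]
        # Assumption: correlation is symmetric, so fill both orientations.
        matrix.setdefault(first, {})[second] = stat
        matrix.setdefault(second, {})[first] = stat
    return matrix


# Made-up values following the documented schema.
sample_correlations = [
    {"label_configuration": [{"label": "A"}, {"label": "B"}], "statistic_value": 0.8},
    {"label_configuration": [{"label": "A"}, {"label": "C"}], "statistic_value": -0.1},
    {"label_configuration": [{"label": "B"}, {"label": "C"}], "statistic_value": 0.25},
]

matrix = correlation_matrix(sample_correlations)
print(matrix["B"]["A"])  # 0.8
print(matrix["C"]["B"])  # 0.25
```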
- class datarobot.models.PairwiseJointProbabilities(*args, **kwargs)
Joint probabilities of label pairs for multicategorical feature.
Added in version v2.24.
Notes
ProbabilityValues contain:
- values.[].label_configuration : list of length 2 - Configuration of the label pair
- values.[].label_configuration.[].relevance : int – 0 for absence of the labels, 1 for the presence of labels
- values.[].label_configuration.[].label : str – Label name
- values.[].statistic_value : float – Statistic value
- Attributes:
- feature_namestr
Name of the feature
- valueslist(dict)
List of joint probability values with a schema described as ProbabilityValues
- statistic_dataframesdict(pandas.DataFrame)
Joint Probability values as DataFrames for different relevance combinations.
E.g. The probability P(A=0,B=1) can be retrieved via:
pairwise_joint_probabilities.statistic_dataframes[(0,1)].loc['A', 'B']
- classmethod get(multilabel_insights_key)
Retrieves pairwise joint probabilities
You might find it more convenient to use Feature.get_pairwise_joint_probabilities instead.
- Parameters:
- multilabel_insights_key: string
Key for multilabel insights, unique for a project, feature and EDA stage combination. The multilabel_insights_key can be retrieved via Feature.multilabel_insights_key.
- Returns:
- PairwiseJointProbabilities
The pairwise joint probabilities
- as_dataframe(relevance_configuration)
Joint probabilities of label pairs as a (num_labels x num_labels) DataFrame.
- Parameters:
- relevance_configuration: tuple of length 2
Valid options are (0, 0), (0, 1), (1, 0) and (1, 1). Values of 0 indicate absence of labels and 1 indicates presence of labels. The first value describes the presence for the labels in axis=0 and the second value describes the presence for the labels in axis=1.
For example the matrix values for a relevance configuration of (0, 1) describe the probabilities of absent labels in the index axis and present labels in the column axis.
E.g. The probability P(A=0,B=1) can be retrieved via:
pairwise_joint_probabilities.as_dataframe((0,1)).loc['A', 'B']
- Returns:
- pandas.DataFrame
The joint probabilities for the requested relevance_configuration. Index and column names allow the interpretation of the values.
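The relevance_configuration tuple selects one of four tables. As a rough illustration of how ProbabilityValues entries map onto those tables, the sketch below groups entries by their (relevance, relevance) pair using plain dicts. The helper name joint_tables and the sample probabilities are hypothetical, and this is not the library's implementation.

```python
def joint_tables(values):
    """Group ProbabilityValues-style entries by relevance configuration."""
    tables = {}
    for entry in values:
        first, second = entry["label_configuration"]
        config = (first["relevance"], second["relevance"])
        pair = (first["label"], second["label"])
        tables.setdefault(config, {})[pair] = entry["statistic_value"]
    return tables


def make_entry(label_a, rel_a, label_b, rel_b, value):
    """Build one made-up entry in the documented schema."""
    return {
        "label_configuration": [
            {"label": label_a, "relevance": rel_a},
            {"label": label_b, "relevance": rel_b},
        ],
        "statistic_value": value,
    }


# Made-up joint probabilities for the label pair (A, B).
sample_joint = [
    make_entry("A", 0, "B", 0, 0.4),
    make_entry("A", 0, "B", 1, 0.1),
    make_entry("A", 1, "B", 0, 0.2),
    make_entry("A", 1, "B", 1, 0.3),
]

tables = joint_tables(sample_joint)
print(tables[(0, 1)][("A", "B")])  # P(A=0, B=1) in the sample data: 0.1
```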
- class datarobot.models.PairwiseConditionalProbabilities(*args, **kwargs)
Conditional probabilities of label pairs for multicategorical feature.
Added in version v2.24.
Notes
ProbabilityValues contain:
- values.[].label_configuration : list of length 2 - Configuration of the label pair
- values.[].label_configuration.[].relevance : int – 0 for absence of the labels, 1 for the presence of labels
- values.[].label_configuration.[].label : str – Label name
- values.[].statistic_value : float – Statistic value
- Attributes:
- feature_namestr
Name of the feature
- valueslist(dict)
List of conditional probability values with a schema described as ProbabilityValues
- statistic_dataframesdict(pandas.DataFrame)
Conditional Probability values as DataFrames for different relevance combinations. The label names in the columns are the events on which we condition. The label names in the index are the events whose conditional probability, given the column event, is stored in the dataframe.
E.g. The probability P(A=0|B=1) can be retrieved via:
pairwise_conditional_probabilities.statistic_dataframes[(0,1)].loc['A', 'B']
- classmethod get(multilabel_insights_key)
Retrieves pairwise conditional probabilities
You might find it more convenient to use Feature.get_pairwise_conditional_probabilities instead.
- Parameters:
- multilabel_insights_key: string
Key for multilabel insights, unique for a project, feature and EDA stage combination. The multilabel_insights_key can be retrieved via Feature.multilabel_insights_key.
- Returns:
- PairwiseConditionalProbabilities
The pairwise conditional probabilities
- as_dataframe(relevance_configuration)
Conditional probabilities of label pairs as a (num_labels x num_labels) DataFrame. The label names in the columns are the events on which we condition. The label names in the index are the events whose conditional probability, given the column event, is stored in the dataframe.
E.g. The probability P(A=0|B=1) can be retrieved via:
pairwise_conditional_probabilities.as_dataframe((0, 1)).loc['A', 'B']
- Parameters:
- relevance_configuration: tuple of length 2
Valid options are (0, 0), (0, 1), (1, 0) and (1, 1). Values of 0 indicate absence of labels and 1 indicates presence of labels. The first value describes the presence for the labels in axis=0 and the second value describes the presence for the labels in axis=1.
For example the matrix values for a relevance configuration of (0, 1) describe the probabilities of absent labels in the index axis given the presence of labels in the column axis.
- Returns:
- pandas.DataFrame
The conditional probabilities for the requested relevance_configuration. Index and column names allow the interpretation of the values.
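The index-versus-column orientation is easier to follow with numbers. The sketch below applies the standard identity P(A=a | B=b) = P(A=a, B=b) / P(B=b), where P(B=b) is the sum of the joint probabilities over both values of A. This is textbook probability, not a statement about how DataRobot computes its values; the helper name conditional_given_b and the sample joint probabilities are hypothetical.

```python
def conditional_given_b(joint, a, b):
    """P(A=a | B=b) from the four joint probabilities of one label pair.

    joint maps (relevance_of_A, relevance_of_B) to P(A=..., B=...).
    """
    marginal_b = joint[(0, b)] + joint[(1, b)]  # P(B=b)
    return joint[(a, b)] / marginal_b


# Made-up joint probabilities for labels A (index axis) and B (column axis).
joint_ab = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Probability that A is absent given that B is present.
print(conditional_given_b(joint_ab, 0, 1))
```

With this orientation, the first element of the tuple refers to the label in the index (the event of interest) and the second to the label in the columns (the event conditioned on).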
Restoring Discarded Features
- class datarobot.models.restore_discarded_features.DiscardedFeaturesInfo(total_restore_limit, remaining_restore_limit, count, features)
An object containing information about time series features which were discarded during the time series feature generation process. These features can be restored to the project. Once restored, they are included in All Time Series Features and can be used to create new feature lists.
Added in version v2.27.
- Attributes:
- total_restore_limitint
The total limit indicating how many features can be restored in this project.
- remaining_restore_limitint
The remaining available number of the features which can be restored in this project.
- featureslist of strings
Discarded features which can be restored.
- countint
Discarded features count.
- classmethod restore(project_id, features_to_restore, max_wait=600)
Restore features that were discarded during the time series feature generation process back to the project. After restoration, the features will be included in All Time Series Features.
Added in version v2.27.
- Parameters:
- project_id: string
- features_to_restore: list of strings
List of the feature names to restore
- max_wait: int, optional
max time to wait for features to be restored. Defaults to 10 min
- Returns:
- status: FeatureRestorationStatus
information about features which were restored and which were not.
- classmethod retrieve(project_id)
Retrieve the discarded features information for a given project.
Added in version v2.27.
- Parameters:
- project_id: string
- Returns:
- info: DiscardedFeaturesInfo
information about features which were discarded during the feature generation process, and the limits on how many features can be restored.
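Since restoration is capped by remaining_restore_limit, it can help to trim a wish list before calling restore. The sketch below shows one possible selection step. It uses a SimpleNamespace stand-in for the object returned by DiscardedFeaturesInfo.retrieve(project_id) so that it runs offline; the helper name choose_features_to_restore and the feature names are hypothetical.

```python
from types import SimpleNamespace


def choose_features_to_restore(info, wanted):
    """Keep wanted names that are restorable, capped at the remaining limit."""
    restorable = set(info.features)
    selected = [name for name in wanted if name in restorable]
    return selected[: info.remaining_restore_limit]


# Stand-in for DiscardedFeaturesInfo.retrieve(project_id); values are made up.
info = SimpleNamespace(
    total_restore_limit=5,
    remaining_restore_limit=2,
    count=3,
    features=["sales (7 day mean)", "sales (1st lag)", "price (naive latest value)"],
)
wanted = [
    "sales (1st lag)",
    "unknown feature",
    "price (naive latest value)",
    "sales (7 day mean)",
]

to_restore = choose_features_to_restore(info, wanted)
print(to_restore)
# The result could then be passed as features_to_restore to
# DiscardedFeaturesInfo.restore(project_id, features_to_restore).
```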
- class datarobot.models.restore_discarded_features.FeatureRestorationStatus(warnings, features_to_restore)
Status of the feature restoration process.
Added in version v2.27.
- Attributes:
- warningslist of strings
Warnings generated for those features which failed to restore
- remaining_restore_limitint
The remaining available number of the features which can be restored in this project.
- restored_featureslist of strings
Features which were restored