Partitioning
- class datarobot.RandomCV(holdout_pct, reps, seed=0)
A partition in which observations are randomly assigned to cross-validation groups and the holdout set.
- Parameters:
- holdout_pct: int
the desired percentage of the dataset to assign to the holdout set
- reps: int
the number of cross-validation folds to use
- seed: int
a seed to use for randomization
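For example, a RandomCV partition can be supplied when setting the target. The sketch below assumes an existing project and that Project.analyze_and_model accepts the target name alongside the partitioning_method parameter referenced later on this page; the project id and target name are illustrative.
```python
import datarobot as dr

project = dr.Project.get("5ae1a6f9962d7410f1234567")  # hypothetical project id

# 20% holdout, 5-fold cross-validation, fixed seed for reproducibility.
partitioning = dr.RandomCV(holdout_pct=20, reps=5, seed=12345)

# Pass the partitioning method when setting the target.
project.analyze_and_model(target="is_churn", partitioning_method=partitioning)
```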
- class datarobot.StratifiedCV(holdout_pct, reps, seed=0)
A partition in which observations are randomly assigned to cross-validation groups and the holdout set, preserving in each group the same ratio of positive to negative cases as in the original data.
- Parameters:
- holdout_pct: int
the desired percentage of the dataset to assign to the holdout set
- reps: int
the number of cross-validation folds to use
- seed: int
a seed to use for randomization
- class datarobot.GroupCV(holdout_pct, reps, partition_key_cols, seed=0)
A partition in which one column is specified, and rows sharing a common value for that column are guaranteed to stay together in the partitioning into cross-validation groups and the holdout set.
- Parameters:
- holdout_pct: int
the desired percentage of the dataset to assign to the holdout set
- reps: int
the number of cross-validation folds to use
- partition_key_cols: list
a list containing a single string, where the string is the name of the column whose values should remain together in partitioning
- seed: int
a seed to use for randomization
- class datarobot.UserCV(user_partition_col, cv_holdout_level, seed=0)
A partition where the cross-validation folds and the holdout set are specified by the user.
- Parameters:
- user_partition_col: string
the name of the column containing the partition assignments
- cv_holdout_level
the value of the partition column indicating a row is part of the holdout set
- seed: int
a seed to use for randomization
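A minimal UserCV sketch, assuming the training data contains a column (here called "partition_group", an illustrative name) whose values label each row's fold, with the value "Holdout" marking holdout rows:
```python
import datarobot as dr

# Rows whose "partition_group" value equals "Holdout" form the holdout set;
# each other distinct value in the column defines one cross-validation fold.
user_cv = dr.UserCV(
    user_partition_col="partition_group",  # hypothetical column name
    cv_holdout_level="Holdout",            # hypothetical holdout label
    seed=0,
)
```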
- class datarobot.RandomTVH(holdout_pct, validation_pct, seed=0)
Specifies a partitioning method in which rows are randomly assigned to training, validation, and holdout.
- Parameters:
- holdout_pct: int
the desired percentage of the dataset to assign to the holdout set
- validation_pct: int
the desired percentage of the dataset to assign to the validation set
- seed: int
a seed to use for randomization
- class datarobot.UserTVH(user_partition_col, training_level, validation_level, holdout_level, seed=0)
Specifies a partitioning method in which rows are assigned by the user to training, validation, and holdout sets.
- Parameters:
- user_partition_col: string
the name of the column containing the partition assignments
- training_level
the value of the partition column indicating a row is part of the training set
- validation_level
the value of the partition column indicating a row is part of the validation set
- holdout_level
the value of the partition column indicating a row is part of the holdout set (use None if you want no holdout set)
- seed: int
a seed to use for randomization
- class datarobot.StratifiedTVH(holdout_pct, validation_pct, seed=0)
A partition in which observations are randomly assigned to train, validation, and holdout sets, preserving in each group the same ratio of positive to negative cases as in the original data.
- Parameters:
- holdout_pct: int
the desired percentage of the dataset to assign to the holdout set
- validation_pct: int
the desired percentage of the dataset to assign to the validation set
- seed: int
a seed to use for randomization
- class datarobot.GroupTVH(holdout_pct, validation_pct, partition_key_cols, seed=0)
A partition in which one column is specified, and rows sharing a common value for that column are guaranteed to stay together in the partitioning into the training, validation, and holdout sets.
- Parameters:
- holdout_pct: int
the desired percentage of the dataset to assign to the holdout set
- validation_pct: int
the desired percentage of the dataset to assign to the validation set
- partition_key_cols: list
a list containing a single string, where the string is the name of the column whose values should remain together in partitioning
- seed: int
a seed to use for randomization
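For instance, a GroupTVH that keeps all rows for a given customer in the same partition might be built as follows (the "customer_id" column name is illustrative):
```python
import datarobot as dr

# 10% holdout, 20% validation; rows sharing a customer_id never straddle partitions.
group_tvh = dr.GroupTVH(
    holdout_pct=10,
    validation_pct=20,
    partition_key_cols=["customer_id"],  # a list containing a single column name
    seed=0,
)
```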
- class datarobot.DatetimePartitioningSpecification(datetime_partition_column, autopilot_data_selection_method=None, validation_duration=None, holdout_start_date=None, holdout_duration=None, disable_holdout=None, gap_duration=None, number_of_backtests=None, backtests=None, use_time_series=False, default_to_known_in_advance=False, default_to_do_not_derive=False, feature_derivation_window_start=None, feature_derivation_window_end=None, feature_settings=None, forecast_window_start=None, forecast_window_end=None, windows_basis_unit=None, treat_as_exponential=None, differencing_method=None, periodicities=None, multiseries_id_columns=None, use_cross_series_features=None, aggregation_type=None, cross_series_group_by_columns=None, calendar_id=None, holdout_end_date=None, unsupervised_mode=False, model_splits=None, allow_partial_history_time_series_predictions=False, unsupervised_type=None)
Uniquely defines a DatetimePartitioning for some project.
Includes only the attributes of DatetimePartitioning that are directly controllable by users, not those determined by the DataRobot application based on the project dataset and the user-controlled settings.
This is the specification that should be passed to Project.analyze_and_model via the partitioning_method parameter. To see the full partitioning based on the project dataset, use DatetimePartitioning.generate. A usage sketch appears at the end of this class entry.
All durations should be specified with a duration string such as those returned by the partitioning_methods.construct_duration_string helper method. Please see the datetime partitioned project documentation for more information on duration strings.
Note that either (holdout_start_date, holdout_duration) or (holdout_start_date, holdout_end_date) can be used to specify the holdout partitioning settings.
- Attributes:
- datetime_partition_column: str
the name of the column whose values as dates are used to assign a row to a particular partition
- autopilot_data_selection_method: str
one of datarobot.enums.DATETIME_AUTOPILOT_DATA_SELECTION_METHOD. Whether models created by the autopilot should use "rowCount" or "duration" as their data_selection_method.
- validation_duration: str or None
the default validation_duration for the backtests
- holdout_start_date: datetime.datetime or None
The start date of the holdout scoring data. If holdout_start_date is specified, either holdout_duration or holdout_end_date must also be specified. If disable_holdout is set to True, holdout_start_date, holdout_duration, and holdout_end_date may not be specified.
- holdout_duration: str or None
The duration of the holdout scoring data. If holdout_duration is specified, holdout_start_date must also be specified. If disable_holdout is set to True, holdout_duration, holdout_start_date, and holdout_end_date may not be specified.
- holdout_end_date: datetime.datetime or None
The end date of the holdout scoring data. If holdout_end_date is specified, holdout_start_date must also be specified. If disable_holdout is set to True, holdout_end_date, holdout_start_date, and holdout_duration may not be specified.
- disable_holdout: bool or None
(New in version v2.8) Whether to suppress allocating a holdout fold. If set to True, holdout_start_date, holdout_duration, and holdout_end_date may not be specified.
- gap_duration: str or None
The duration of the gap between training and holdout scoring data
- number_of_backtests: int or None
the number of backtests to use
- backtests: list of BacktestSpecification
the exact specification of backtests to use. The indices of the specified backtests should range from 0 to number_of_backtests - 1. If any backtest is left unspecified, a default configuration will be chosen.
- use_time_series: bool
(New in version v2.8) Whether to create a time series project (if True) or an OTV project which uses datetime partitioning (if False). The default behavior is to create an OTV project.
- default_to_known_in_advance: bool
(New in version v2.11) Optional, default False. Used for time series projects only. Sets whether all features default to being treated as known in advance. Known in advance features are expected to be known for dates in the future when making predictions, e.g., "is this a holiday?". Individual features can be set to a value different from the default using the feature_settings parameter.
- default_to_do_not_derive: bool
(New in version v2.17) Optional, default False. Used for time series projects only. Sets whether all features default to being treated as do-not-derive features, excluding them from feature derivation. Individual features can be set to a value different from the default using the feature_settings parameter.
- feature_derivation_window_start: int or None
(New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should start. Expressed in terms of the windows_basis_unit and should be a negative value or zero.
- feature_derivation_window_end: int or None
(New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should end. Expressed in terms of the windows_basis_unit and should be a negative value or zero.
- feature_settings: list of FeatureSettings
(New in version v2.9) Optional, a list specifying per-feature settings; can be left unspecified.
- forecast_window_start: int or None
(New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should start. Expressed in terms of the windows_basis_unit.
- forecast_window_end: int or None
(New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should end. Expressed in terms of the windows_basis_unit.
- windows_basis_unit: string, optional
(New in version v2.14) Only used for time series projects. Indicates which unit is the basis for the feature derivation window and the forecast window. Valid options are the detected time unit (one of datarobot.enums.TIME_UNITS) or "ROW". If omitted, the default value is the detected time unit.
- treat_as_exponential: string, optional
(New in version v2.9) Defaults to "auto". Used to specify whether to treat the data as an exponential trend and apply transformations such as a log transform. Use values from the datarobot.enums.TREAT_AS_EXPONENTIAL enum.
- differencing_method: string, optional
(New in version v2.9) Defaults to "auto". Used to specify which differencing method to apply in case the data is stationary. Use values from the datarobot.enums.DIFFERENCING_METHOD enum.
- periodicities: list of Periodicity, optional
(New in version v2.9) A list of datarobot.Periodicity. Periodicity units should be "ROW" if the windows_basis_unit is "ROW".
- multiseries_id_columns: list of str or null
(New in version v2.11) A list of the names of multiseries id columns to define series within the training data. Currently only one multiseries id column is supported.
- use_cross_series_features: bool
(New in version v2.14) Whether to use cross series features.
- aggregation_type: str, optional
(New in version v2.14) The aggregation type to apply when creating cross series features. Optional, must be one of "total" or "average".
- cross_series_group_by_columns: list of str, optional
(New in version v2.15) List of columns (currently of length 1). Optional setting that indicates how to further split series into related groups. For example, if every series is sales of an individual product, the series group-by could be the product category, with values like "men's clothing", "sports equipment", etc. Can only be used in a multiseries project with use_cross_series_features set to True.
- calendar_id: str, optional
(New in version v2.15) The id of the CalendarFile to use with this project.
- unsupervised_mode: bool, optional
(New in version v2.20) Defaults to False. Indicates whether partitioning should be constructed for an unsupervised project.
- model_splits: int, optional
(New in version v2.21) Sets the cap on the number of jobs per model used when building models, to control the number of jobs in the queue. A higher number of model splits allows for less downsampling, leading to the use of more post-processed data.
- allow_partial_history_time_series_predictions: bool, optional
(New in version v2.24) Whether to allow time series models to make predictions using partial historical data.
- unsupervised_type: str, optional
(New in version v3.2) The unsupervised project type, only valid if unsupervised_mode is True. Use values from the datarobot.enums.UnsupervisedTypeEnum enum. If not specified, the project defaults to 'anomaly' when unsupervised_mode is True.
- collect_payload()
Set up the dict that should be sent to the server when setting the target.
- Returns:
- partitioning_spec: dict
- Return type:
Dict[str, Any]
- prep_payload(project_id, max_wait=600)
Run any necessary validation and prep of the payload, including async operations
Mainly used for the datetime partitioning spec but implemented in general for consistency
- Return type:
None
- update(**kwargs)
Update this instance, matching attributes to kwargs
Mainly used for the datetime partitioning spec but implemented in general for consistency
- Return type:
None
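As an illustrative sketch, a minimal time series specification might look like the following; the column names and window offsets are assumptions, not values prescribed by this page:
```python
import datarobot as dr

spec = dr.DatetimePartitioningSpecification(
    datetime_partition_column="timestamp",   # hypothetical date column
    use_time_series=True,
    multiseries_id_columns=["store_id"],     # hypothetical multiseries id column
    feature_derivation_window_start=-28,     # look back 28 units before the forecast point
    feature_derivation_window_end=0,
    forecast_window_start=1,                 # forecast 1 to 7 units ahead
    forecast_window_end=7,
    number_of_backtests=3,
)

# The spec is then passed to Project.analyze_and_model via the
# partitioning_method parameter, as described above.
```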
- class datarobot.BacktestSpecification(index, gap_duration=None, validation_start_date=None, validation_duration=None, validation_end_date=None, primary_training_start_date=None, primary_training_end_date=None)
Uniquely defines a Backtest used in a DatetimePartitioning
Includes only the attributes of a backtest directly controllable by users. The other attributes are assigned by the DataRobot application based on the project dataset and the user-controlled settings.
There are two ways to specify an individual backtest:
Option 1: Use index, gap_duration, validation_start_date, and validation_duration. All durations should be specified with a duration string such as those returned by the partitioning_methods.construct_duration_string helper method.
```python
import datarobot as dr
from datetime import datetime

partitioning_spec = dr.DatetimePartitioningSpecification(
    backtests=[
        # modify the first backtest using option 1
        dr.BacktestSpecification(
            index=0,
            gap_duration=dr.partitioning_methods.construct_duration_string(),
            validation_start_date=datetime(year=2010, month=1, day=1),
            validation_duration=dr.partitioning_methods.construct_duration_string(years=1),
        )
    ],
    # other partitioning settings...
)
```
Option 2 (New in version v2.20): Use index, primary_training_start_date, primary_training_end_date, validation_start_date, and validation_end_date. In this case, note that setting primary_training_end_date and validation_start_date to the same timestamp will result in no gap being created.
```python
import datarobot as dr
from datetime import datetime

partitioning_spec = dr.DatetimePartitioningSpecification(
    backtests=[
        # modify the first backtest using option 2
        dr.BacktestSpecification(
            index=0,
            primary_training_start_date=datetime(year=2005, month=1, day=1),
            primary_training_end_date=datetime(year=2010, month=1, day=1),
            validation_start_date=datetime(year=2010, month=1, day=1),
            validation_end_date=datetime(year=2011, month=1, day=1),
        )
    ],
    # other partitioning settings...
)
```
All durations should be specified with a duration string such as those returned by the partitioning_methods.construct_duration_string helper method. Please see the datetime partitioned project documentation for more information on duration strings.
- Attributes:
- index: int
the index of the backtest to update
- gap_duration: str
a duration string specifying the desired duration of the gap between training and validation scoring data for the backtest
- validation_start_date: datetime.datetime
the desired start date of the validation scoring data for this backtest
- validation_duration: str
a duration string specifying the desired duration of the validation scoring data for this backtest
- validation_end_date: datetime.datetime
the desired end date of the validation scoring data for this backtest
- primary_training_start_date: datetime.datetime
the desired start date of the training partition for this backtest
- primary_training_end_date: datetime.datetime
the desired end date of the training partition for this backtest
- class datarobot.FeatureSettings(feature_name, known_in_advance=None, do_not_derive=None)
Per-feature settings.
- Attributes:
- feature_name: string
name of the feature
- known_in_advance: bool
(New in version v2.11) Optional, for time series projects only. Sets whether the feature is known in advance, i.e., values for future dates are known at prediction time. If not specified, the feature uses the value from the default_to_known_in_advance flag.
- do_not_derive: bool
(New in version v2.17) Optional, for time series projects only. Sets whether the feature is excluded from feature derivation. If not specified, the feature uses the value from the default_to_do_not_derive flag.
- collect_payload(use_a_priori=False)
- Parameters:
- use_a_priori: bool
Switch to using the older a_priori key name instead of known_in_advance. Default: False
- Returns:
- FeatureSettings dictionary representation
- Return type:
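A sketch combining per-feature settings with the defaults described above (the feature names are illustrative):
```python
import datarobot as dr

feature_settings = [
    # Known at prediction time for future dates, so mark it known in advance.
    dr.FeatureSettings("is_holiday", known_in_advance=True),
    # Exclude this raw identifier from time series feature derivation.
    dr.FeatureSettings("record_id", do_not_derive=True),
]

spec = dr.DatetimePartitioningSpecification(
    datetime_partition_column="timestamp",  # hypothetical date column
    use_time_series=True,
    feature_settings=feature_settings,
)
```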
- class datarobot.Periodicity(time_steps, time_unit)
Periodicity configuration
- Parameters:
- time_steps: int
Time step value
- time_unit: string
Time step unit, valid options are values from datarobot.enums.TIME_UNITS
Examples
```python
import datarobot as dr

periodicities = [
    dr.Periodicity(time_steps=10, time_unit=dr.enums.TIME_UNITS.HOUR),
    dr.Periodicity(time_steps=600, time_unit=dr.enums.TIME_UNITS.MINUTE),
]
spec = dr.DatetimePartitioningSpecification(
    # ...
    periodicities=periodicities,
)
```
- class datarobot.DatetimePartitioning(project_id=None, datetime_partitioning_id=None, datetime_partition_column=None, date_format=None, autopilot_data_selection_method=None, validation_duration=None, available_training_start_date=None, available_training_duration=None, available_training_row_count=None, available_training_end_date=None, primary_training_start_date=None, primary_training_duration=None, primary_training_row_count=None, primary_training_end_date=None, gap_start_date=None, gap_duration=None, gap_row_count=None, gap_end_date=None, disable_holdout=None, holdout_start_date=None, holdout_duration=None, holdout_row_count=None, holdout_end_date=None, number_of_backtests=None, backtests=None, total_row_count=None, use_time_series=False, default_to_known_in_advance=False, default_to_do_not_derive=False, feature_derivation_window_start=None, feature_derivation_window_end=None, feature_settings=None, forecast_window_start=None, forecast_window_end=None, windows_basis_unit=None, treat_as_exponential=None, differencing_method=None, periodicities=None, multiseries_id_columns=None, number_of_known_in_advance_features=0, number_of_do_not_derive_features=0, use_cross_series_features=None, aggregation_type=None, cross_series_group_by_columns=None, calendar_id=None, calendar_name=None, model_splits=None, allow_partial_history_time_series_predictions=False, unsupervised_mode=False, unsupervised_type=None)
Full partitioning of a project for datetime partitioning.
To instantiate, use DatetimePartitioning.get(project_id).
Includes both the attributes specified by the user and those determined by the DataRobot application based on the project dataset. In order to use a partitioning to set the target, call to_specification and pass the resulting DatetimePartitioningSpecification to Project.analyze_and_model via the partitioning_method parameter.
The available training data corresponds to all the data available for training, while the primary training data corresponds to the data that can be used to train while ensuring that all backtests are available. If a model is trained with more data than is available in the primary training data, then not all backtests may have scores available.
All durations are specified with a duration string such as those returned by the partitioning_methods.construct_duration_string helper method. Please see the datetime partitioned project documentation for more information on duration strings.
- Attributes:
- project_id: str
the id of the project this partitioning applies to
- datetime_partitioning_id: str or None
the id of the datetime partitioning, if it is an optimized partitioning
- datetime_partition_column: str
the name of the column whose values as dates are used to assign a row to a particular partition
- date_format: str
the format (e.g. "%Y-%m-%d %H:%M:%S") by which the partition column was interpreted (compatible with strftime)
- autopilot_data_selection_method: str
one of datarobot.enums.DATETIME_AUTOPILOT_DATA_SELECTION_METHOD. Whether models created by the autopilot use "rowCount" or "duration" as their data_selection_method.
- validation_duration: str or None
the validation duration specified when initializing the partitioning - not directly significant if the backtests have been modified, but used as the default validation_duration for the backtests. Can be absent if this is a time series project with an irregular primary date/time feature.
- available_training_start_date: datetime.datetime
The start date of the available training data for scoring the holdout
- available_training_duration: str
The duration of the available training data for scoring the holdout
- available_training_row_count: int or None
The number of rows in the available training data for scoring the holdout. Only available when retrieving the partitioning after setting the target.
- available_training_end_date: datetime.datetime
The end date of the available training data for scoring the holdout
- primary_training_start_date: datetime.datetime or None
The start date of the primary training data for scoring the holdout. Unavailable when the holdout fold is disabled.
- primary_training_duration: str
The duration of the primary training data for scoring the holdout
- primary_training_row_count: int or None
The number of rows in the primary training data for scoring the holdout. Only available when retrieving the partitioning after setting the target.
- primary_training_end_date: datetime.datetime or None
The end date of the primary training data for scoring the holdout. Unavailable when the holdout fold is disabled.
- gap_start_date: datetime.datetime or None
The start date of the gap between training and holdout scoring data. Unavailable when the holdout fold is disabled.
- gap_duration: str
The duration of the gap between training and holdout scoring data
- gap_row_count: int or None
The number of rows in the gap between training and holdout scoring data. Only available when retrieving the partitioning after setting the target.
- gap_end_date: datetime.datetime or None
The end date of the gap between training and holdout scoring data. Unavailable when the holdout fold is disabled.
- disable_holdout: bool or None
Whether to suppress allocating a holdout fold. If set to True, holdout_start_date, holdout_duration, and holdout_end_date may not be specified.
- holdout_start_date: datetime.datetime or None
The start date of the holdout scoring data. Unavailable when the holdout fold is disabled.
- holdout_duration: str
The duration of the holdout scoring data
- holdout_row_count: int or None
The number of rows in the holdout scoring data. Only available when retrieving the partitioning after setting the target.
- holdout_end_date: datetime.datetime or None
The end date of the holdout scoring data. Unavailable when the holdout fold is disabled.
- number_of_backtests: int
the number of backtests used.
- backtests: list of Backtest
the configured backtests.
- total_row_count: int
the number of rows in the project dataset. Only available when retrieving the partitioning after setting the target.
- use_time_series: bool
(New in version v2.8) Whether to create a time series project (if True) or an OTV project which uses datetime partitioning (if False). The default behavior is to create an OTV project.
- default_to_known_in_advance: bool
(New in version v2.11) Optional, default False. Used for time series projects only. Sets whether all features default to being treated as known in advance. Known in advance features are expected to be known for dates in the future when making predictions, e.g., "is this a holiday?". Individual features can be set to a value different from the default using the feature_settings parameter.
- default_to_do_not_derive: bool
(New in version v2.17) Optional, default False. Used for time series projects only. Sets whether all features default to being treated as do-not-derive features, excluding them from feature derivation. Individual features can be set to a value different from the default using the feature_settings parameter.
- feature_derivation_window_start: int or None
(New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should start. Expressed in terms of the windows_basis_unit.
- feature_derivation_window_end: int or None
(New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should end. Expressed in terms of the windows_basis_unit.
- feature_settings: list of FeatureSettings
(New in version v2.9) Optional, a list specifying per-feature settings; can be left unspecified.
- forecast_window_start: int or None
(New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should start. Expressed in terms of the windows_basis_unit.
- forecast_window_end: int or None
(New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should end. Expressed in terms of the windows_basis_unit.
- windows_basis_unit: string, optional
(New in version v2.14) Only used for time series projects. Indicates which unit is the basis for the feature derivation window and the forecast window. Valid options are the detected time unit (one of datarobot.enums.TIME_UNITS) or "ROW". If omitted, the default value is the detected time unit.
- treat_as_exponential: string, optional
(New in version v2.9) Defaults to "auto". Used to specify whether to treat the data as an exponential trend and apply transformations such as a log transform. Use values from the datarobot.enums.TREAT_AS_EXPONENTIAL enum.
- differencing_method: string, optional
(New in version v2.9) Defaults to "auto". Used to specify which differencing method to apply in case the data is stationary. Use values from the datarobot.enums.DIFFERENCING_METHOD enum.
- periodicities: list of Periodicity, optional
(New in version v2.9) A list of datarobot.Periodicity. Periodicity units should be "ROW" if the windows_basis_unit is "ROW".
- multiseries_id_columns: list of str or null
(New in version v2.11) A list of the names of multiseries id columns to define series within the training data. Currently only one multiseries id column is supported.
- number_of_known_in_advance_features: int
(New in version v2.14) Number of features that are marked as known in advance.
- number_of_do_not_derive_features: int
(New in version v2.17) Number of features that are excluded from derivation.
- use_cross_series_features: bool
(New in version v2.14) Whether to use cross series features.
- aggregation_type: str, optional
(New in version v2.14) The aggregation type to apply when creating cross series features. Optional, must be one of "total" or "average".
- cross_series_group_by_columns: list of str, optional
(New in version v2.15) List of columns (currently of length 1). Optional setting that indicates how to further split series into related groups. For example, if every series is sales of an individual product, the series group-by could be the product category, with values like "men's clothing", "sports equipment", etc. Can only be used in a multiseries project with use_cross_series_features set to True.
- calendar_id: str, optional
(New in version v2.15) Only available for time series projects. The id of the CalendarFile to use with this project.
- calendar_name: str, optional
(New in version v2.17) Only available for time series projects. The name of the CalendarFile used with this project.
- model_splits: int, optional
(New in version v2.21) Sets the cap on the number of jobs per model used when building models, to control the number of jobs in the queue. A higher number of model splits allows for less downsampling, leading to the use of more post-processed data.
- allow_partial_history_time_series_predictions: bool, optional
(New in version v2.24) Whether to allow time series models to make predictions using partial historical data.
- unsupervised_mode: bool, optional
(New in version v3.1) Whether the date/time partitioning is for an unsupervised project
- unsupervised_type: str, optional
(New in version v3.2) The unsupervised project type, only valid if unsupervised_mode is True. Use values from the datarobot.enums.UnsupervisedTypeEnum enum. If not specified, the project defaults to 'anomaly' when unsupervised_mode is True.
- classmethod generate(project_id, spec, max_wait=600, target=None)
Preview the full partitioning determined by a DatetimePartitioningSpecification
Based on the project dataset and the partitioning specification, inspect the full partitioning that would be used if the same specification were passed into Project.analyze_and_model.
- Parameters:
- project_id: str
the id of the project
- spec: DatetimePartitioningSpecification
the desired partitioning
- max_wait: int, optional
For some settings (e.g. generating a partitioning preview for a multiseries project for the first time), an asynchronous task must be run to analyze the dataset. max_wait governs the maximum time (in seconds) to wait before giving up. In all non-multiseries projects, this is unused.
- target: str, optional
the name of the target column. For unsupervised projects, target may be None. Providing a target will ensure that partitions are correctly optimized for your dataset.
- Returns:
- DatetimePartitioning
the full generated partitioning
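A usage sketch, assuming a project id and a target column name (both illustrative) and a specification built as shown earlier:
```python
import datarobot as dr

spec = dr.DatetimePartitioningSpecification(
    datetime_partition_column="timestamp",  # hypothetical date column
)

# Preview the partitioning DataRobot would derive from this spec without setting the target.
preview = dr.DatetimePartitioning.generate(
    project_id="5ae1a6f9962d7410f1234567",  # hypothetical project id
    spec=spec,
    target="sales",                          # optional, helps optimize partitions
)
print(preview.to_dataframe())
```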
- classmethod get(project_id)
Retrieve the DatetimePartitioning from a project
Only available if the target has already been set and the project uses datetime partitioning.
- Parameters:
- project_id: str
the id of the project to retrieve partitioning for
- Returns:
- DatetimePartitioning
the full partitioning for the project
- Return type:
- classmethod generate_optimized(project_id, spec, target, max_wait=600)
Preview the full partitioning determined by a DatetimePartitioningSpecification
Based on the project dataset and the partitioning specification, inspect the full partitioning that would be used if the same specification were passed into Project.analyze_and_model.
- Parameters:
- project_id: str
the id of the project
- spec: DatetimePartitioningSpecification
the desired partitioning
- target: str
the name of the target column. For unsupervised projects, target may be None.
- max_wait: int, optional
Governs the maximum time (in seconds) to wait before giving up.
- Returns:
- DatetimePartitioning
the full generated partitioning
- Return type:
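A sketch of generating an optimized partitioning and retrieving it again later (ids and names are illustrative):
```python
import datarobot as dr

spec = dr.DatetimePartitioningSpecification(
    datetime_partition_column="timestamp",  # hypothetical date column
)

optimized = dr.DatetimePartitioning.generate_optimized(
    project_id="5ae1a6f9962d7410f1234567",  # hypothetical project id
    spec=spec,
    target="sales",
)

# The returned partitioning carries a datetime_partitioning_id that can be
# used to fetch it again later with get_optimized.
same_partitioning = dr.DatetimePartitioning.get_optimized(
    "5ae1a6f9962d7410f1234567",
    optimized.datetime_partitioning_id,
)
```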
- classmethod get_optimized(project_id, datetime_partitioning_id)
Retrieve an optimized DatetimePartitioning from a project for the specified datetime_partitioning_id. A datetime_partitioning_id is created by using the generate_optimized function.
- Parameters:
- project_id: str
the id of the project to retrieve partitioning for
- datetime_partitioning_id: ObjectId
the ObjectId associated with the project to retrieve from Mongo
- Returns:
- DatetimePartitioning
the full partitioning for the project
- Return type:
- classmethod feature_log_list(project_id, offset=None, limit=None)
Retrieve the feature derivation log content and log length for a time series project.
The Time Series Feature Log provides details about the feature generation process for a time series project. It includes information about which features are generated and their priority, as well as detected properties of the time series data, such as whether the series is stationary and which periodicities were detected.
This route is only supported for time series projects that have finished partitioning.
The feature derivation log will include information about:
- Detected stationarity of the series, e.g. 'Series detected as non-stationary'
- Detected presence of multiplicative trend in the series, e.g. 'Multiplicative trend detected'
- Detected periodicities in the series, e.g. 'Detected periodicities: 7 day'
- Maximum number of features to be generated, e.g. 'Maximum number of feature to be generated is 1440'
- Window sizes used in rolling statistics / lag extractors, e.g. 'The window sizes chosen to be: 2 months (because the time step is 1 month and Feature Derivation Window is 2 months)'
- Features that are specified as known in advance, e.g. 'Variables treated as apriori: holiday'
- Details about why certain variables are transformed in the input data, e.g. 'Generating variable "y (log)" from "y" because multiplicative trend is detected'
- Details about features generated as time series features, and their priority, e.g. 'Generating feature "date (actual)" from "date" (priority: 1)'
- Parameters:
- project_id: str
project id to retrieve a feature derivation log for.
- offset: int
optional, default is 0; this many results will be skipped.
- limit: int
optional, defaults to 100; at most this many results are returned. To specify no limit, use 0. The default may change without notice.
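A sketch of fetching a page of the feature derivation log; since this page does not document the return structure, the example simply prints whatever is returned:
```python
import datarobot as dr

# Fetch the first 50 entries of the feature derivation log for a time series project.
log_page = dr.DatetimePartitioning.feature_log_list(
    "5ae1a6f9962d7410f1234567",  # hypothetical project id
    offset=0,
    limit=50,
)
print(log_page)
```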
- classmethod feature_log_retrieve(project_id)
Retrieve the feature derivation log content and log length for a time series project.
The Time Series Feature Log provides details about the feature generation process for a time series project. It includes information about which features are generated and their priority, as well as detected properties of the time series data, such as whether the series is stationary and which periodicities were detected.
This route is only supported for time series projects that have finished partitioning.
The feature derivation log will include information about:
- Detected stationarity of the series, e.g. 'Series detected as non-stationary'
- Detected presence of multiplicative trend in the series, e.g. 'Multiplicative trend detected'
- Detected periodicities in the series, e.g. 'Detected periodicities: 7 day'
- Maximum number of features to be generated, e.g. 'Maximum number of feature to be generated is 1440'
- Window sizes used in rolling statistics / lag extractors, e.g. 'The window sizes chosen to be: 2 months (because the time step is 1 month and Feature Derivation Window is 2 months)'
- Features that are specified as known in advance, e.g. 'Variables treated as apriori: holiday'
- Details about why certain variables are transformed in the input data, e.g. 'Generating variable "y (log)" from "y" because multiplicative trend is detected'
- Details about features generated as time series features, and their priority, e.g. 'Generating feature "date (actual)" from "date" (priority: 1)'
- Parameters:
- project_id: str
project id to retrieve a feature derivation log for.
- Return type:
str
- to_specification(use_holdout_start_end_format=False, use_backtest_start_end_format=False)
Render the DatetimePartitioning as a DatetimePartitioningSpecification.
The resulting specification can be used when setting the target, and contains only the attributes directly controllable by users.
- Parameters:
- use_holdout_start_end_format: bool, optional
Defaults to False. If True, will use holdout_end_date when configuring the holdout partition. If False, will use holdout_duration instead.
- use_backtest_start_end_format: bool, optional
Defaults to False. If False, will use a duration-based approach for specifying backtests (gap_duration, validation_start_date, and validation_duration). If True, will use a start/end date approach for specifying backtests (primary_training_start_date, primary_training_end_date, validation_start_date, validation_end_date). Projects created in the Web UI use the start/end date approach for specifying backtests, so set this parameter to True to mirror the behavior in the Web UI.
- Returns:
- DatetimePartitioningSpecification
the specification for this partitioning
- Return type:
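A round-trip sketch: retrieve the full partitioning for a project (id illustrative) and convert it back into a specification that mirrors the Web UI's start/end date format:
```python
import datarobot as dr

partitioning = dr.DatetimePartitioning.get("5ae1a6f9962d7410f1234567")  # hypothetical project id

spec = partitioning.to_specification(
    use_holdout_start_end_format=True,
    use_backtest_start_end_format=True,
)
# The resulting spec can be passed back to Project.analyze_and_model
# via the partitioning_method parameter.
```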
- to_dataframe()
Render the partitioning settings as a dataframe for convenience of display
Excludes project_id, datetime_partition_column, date_format, autopilot_data_selection_method, validation_duration, and number_of_backtests, as well as the row count information, if present.
Also excludes the time series specific parameters for use_time_series, default_to_known_in_advance, default_to_do_not_derive, and defining the feature derivation and forecast windows.
- Return type:
DataFrame
- classmethod datetime_partitioning_log_retrieve(project_id, datetime_partitioning_id)
Retrieve the datetime partitioning log content for an optimized datetime partitioning.
The datetime partitioning log provides details about the partitioning process for an OTV or time series project.
- Parameters:
- project_id: str
The project ID of the project associated with the datetime partitioning.
- datetime_partitioning_id: str
id of the optimized datetime partitioning
- Return type:
Any
- classmethod datetime_partitioning_log_list(project_id, datetime_partitioning_id, offset=None, limit=None)
Retrieve the datetime partitioning log content and log length for an optimized datetime partitioning.
The Datetime Partitioning Log provides details about the partitioning process for an OTV or Time Series project.
- Parameters:
- project_id: str
project id of the project associated with the datetime partitioning.
- datetime_partitioning_id: str
id of the optimized datetime partitioning
- offset: int or None
optional, default is 0; this many results will be skipped.
- limit: int or None
optional, defaults to 100; at most this many results are returned. To specify no limit, use 0. The default may change without notice.
- Return type:
Any
- classmethod get_input_data(project_id, datetime_partitioning_id)
Retrieve the input used to create an optimized DatetimePartitioning from a project for the specified datetime_partitioning_id. A datetime_partitioning_id is created by using the generate_optimized function.
- Parameters:
- project_id: str
The ID of the project to retrieve partitioning for.
- datetime_partitioning_id: ObjectId
The ObjectId associated with the project to retrieve from Mongo.
- Returns:
- DatetimePartitioningInput
The input to optimized datetime partitioning.
- Return type:
- class datarobot.helpers.partitioning_methods.DatetimePartitioningId(datetime_partitioning_id, project_id)
Defines a DatetimePartitioningId used for datetime partitioning.
This class only includes the datetime_partitioning_id that identifies a previously optimized datetime partitioning and the project_id for the associated project.
This is the specification that should be passed to Project.analyze_and_model via the partitioning_method parameter. To see the full partitioning, use DatetimePartitioning.get_optimized. A usage sketch appears at the end of this class entry.
- Attributes:
- datetime_partitioning_id: str
The ID of the datetime partitioning to use.
- project_id: str
The ID of the project that the datetime partitioning is associated with.
- collect_payload()
Set up the dict that should be sent to the server when setting the target.
- Returns:
- partitioning_spec: dict
- Return type:
Dict[str, Any]
- prep_payload(project_id, max_wait=600)
Run any necessary validation and prep of the payload, including async operations
Mainly used for the datetime partitioning spec but implemented in general for consistency
- Return type:
None
- update(**kwargs)
Update this instance, matching attributes to kwargs
Mainly used for the datetime partitioning spec but implemented in general for consistency
- Return type:
NoReturn
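A sketch of reusing a previously optimized partitioning by id (ids and the target name are illustrative):
```python
import datarobot as dr
from datarobot.helpers.partitioning_methods import DatetimePartitioningId

project = dr.Project.get("5ae1a6f9962d7410f1234567")        # hypothetical project id
partitioning_id = DatetimePartitioningId(
    datetime_partitioning_id="64a0c3d2e5f6a7b8c9d0e1f2",     # hypothetical partitioning id
    project_id=project.id,
)

# Pass it in place of a DatetimePartitioningSpecification.
project.analyze_and_model(target="sales", partitioning_method=partitioning_id)
```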
- class datarobot.helpers.partitioning_methods.Backtest(index=None, available_training_start_date=None, available_training_duration=None, available_training_row_count=None, available_training_end_date=None, primary_training_start_date=None, primary_training_duration=None, primary_training_row_count=None, primary_training_end_date=None, gap_start_date=None, gap_duration=None, gap_row_count=None, gap_end_date=None, validation_start_date=None, validation_duration=None, validation_row_count=None, validation_end_date=None, total_row_count=None)
A backtest used to evaluate models trained in a datetime partitioned project
When setting up a datetime partitioning project, backtests are specified by a BacktestSpecification.
The available training data corresponds to all the data available for training, while the primary training data corresponds to the data that can be used to train while ensuring that all backtests are available. If a model is trained with more data than is available in the primary training data, then not all backtests may have scores available.
All durations are specified with a duration string such as those returned by the partitioning_methods.construct_duration_string helper method. Please see the datetime partitioned project documentation for more information on duration strings.
- Attributes:
- index: int
the index of the backtest
- available_training_start_date: datetime.datetime
the start date of the available training data for this backtest
- available_training_duration: str
the duration of the available training data for this backtest
- available_training_row_count: int or None
the number of rows of available training data for this backtest. Only available when retrieving from a project where the target is set.
- available_training_end_date: datetime.datetime
the end date of the available training data for this backtest
- primary_training_start_date: datetime.datetime
the start date of the primary training data for this backtest
- primary_training_duration: str
the duration of the primary training data for this backtest
- primary_training_row_count: int or None
the number of rows of primary training data for this backtest. Only available when retrieving from a project where the target is set.
- primary_training_end_date: datetime.datetime
the end date of the primary training data for this backtest
- gap_start_date: datetime.datetime
the start date of the gap between training and validation scoring data for this backtest
- gap_duration: str
the duration of the gap between training and validation scoring data for this backtest
- gap_row_count: int or None
the number of rows in the gap between training and validation scoring data for this backtest. Only available when retrieving from a project where the target is set.
- gap_end_date: datetime.datetime
the end date of the gap between training and validation scoring data for this backtest
- validation_start_date: datetime.datetime
the start date of the validation scoring data for this backtest
- validation_duration: str
the duration of the validation scoring data for this backtest
- validation_row_count: int or None
the number of rows of validation scoring data for this backtest. Only available when retrieving from a project where the target is set.
- validation_end_date: datetime.datetime
the end date of the validation scoring data for this backtest
- total_row_count: int or None
the number of rows in this backtest. Only available when retrieving from a project where the target is set.
- to_specification(use_start_end_format=False)
Render this backtest as a BacktestSpecification.
The resulting specification includes only the attributes users can directly control, not those indirectly determined by the project dataset.
- Parameters:
- use_start_end_format: bool
Defaults to False. If False, will use a duration-based approach for specifying backtests (gap_duration, validation_start_date, and validation_duration). If True, will use a start/end date approach for specifying backtests (primary_training_start_date, primary_training_end_date, validation_start_date, validation_end_date). Projects created in the Web UI use the start/end date approach for specifying backtests, so set this parameter to True to mirror the behavior in the Web UI.
- Returns:
- BacktestSpecification
the specification for this backtest
- Return type:
- to_dataframe()
Render this backtest as a dataframe for convenience of display
- Returns:
- backtest_partitioning: pandas.DataFrame
the backtest attributes, formatted into a dataframe
- Return type:
DataFrame
- class datarobot.helpers.partitioning_methods.FeatureSettingsPayload(*args, **kwargs)
- datarobot.helpers.partitioning_methods.construct_duration_string(years=0, months=0, days=0, hours=0, minutes=0, seconds=0)
Construct a valid string representing a duration in accordance with ISO8601
A duration of six months, 3 days, and 12 hours could be represented as P6M3DT12H.
- Parameters:
- years: int
the number of years in the duration
- months: int
the number of months in the duration
- days: int
the number of days in the duration
- hours: int
the number of hours in the duration
- minutes: int
the number of minutes in the duration
- seconds: int
the number of seconds in the duration
- Returns:
- duration_string: str
The duration string, specified compatibly with ISO8601
- Return type:
str
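A short usage sketch of the helper:
```python
import datarobot as dr

# An ISO8601 duration string for six months, three days, and twelve hours.
duration = dr.partitioning_methods.construct_duration_string(months=6, days=3, hours=12)

# With no arguments the helper builds a zero-length duration, which is handy
# for specifying an empty gap in a BacktestSpecification.
no_gap = dr.partitioning_methods.construct_duration_string()
print(duration, no_gap)
```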