Partitioning API

class datarobot.RandomCV(holdout_pct, reps, seed=0)

A partition in which observations are randomly assigned to cross-validation groups and the holdout set.

Parameters:

holdout_pct : int

the desired percentage of dataset to assign to holdout set

reps : int

number of cross validation folds to use

seed : int

a seed to use for randomization
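To make the roles of holdout_pct, reps, and seed concrete, the following is a minimal pure-Python sketch of random cross-validation assignment. It is an illustration only and not DataRobot's implementation; the function name random_cv_assign and the round-robin fold dealing are assumptions made for this example.

```python
import random


def random_cv_assign(n_rows, holdout_pct, reps, seed=0):
    """Hypothetical helper: map each row index to 'holdout' or a fold number."""
    rng = random.Random(seed)          # a fixed seed makes the shuffle repeatable
    indices = list(range(n_rows))
    rng.shuffle(indices)

    n_holdout = n_rows * holdout_pct // 100
    assignment = {}
    for i in indices[:n_holdout]:
        assignment[i] = "holdout"
    # Deal the remaining shuffled rows round-robin into `reps` folds.
    for position, i in enumerate(indices[n_holdout:]):
        assignment[i] = position % reps
    return assignment


assignment = random_cv_assign(n_rows=100, holdout_pct=20, reps=5, seed=0)
```

With 100 rows, holdout_pct=20 and reps=5, this sketch places 20 rows in the holdout set and 16 rows in each of the five folds; calling it again with the same seed reproduces the same assignment.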

class datarobot.StratifiedCV(holdout_pct, reps, seed=0)

A partition in which observations are randomly assigned to cross-validation groups and the holdout set, preserving in each group the same ratio of positive to negative cases as in the original data.

Parameters:

holdout_pct : int

the desired percentage of dataset to assign to holdout set

reps : int

number of cross validation folds to use

seed : int

a seed to use for randomization
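The idea of preserving the class ratio in every group can be sketched by shuffling and splitting each class separately. This is an illustrative sketch under that assumption, not DataRobot's algorithm; stratified_cv_assign is a hypothetical name.

```python
import random


def stratified_cv_assign(labels, holdout_pct, reps, seed=0):
    """Hypothetical helper: per-row partition that keeps class ratios per group."""
    rng = random.Random(seed)
    assignment = [None] * len(labels)

    # Collect the row indices belonging to each class.
    rows_by_class = {}
    for i, label in enumerate(labels):
        rows_by_class.setdefault(label, []).append(i)

    # Split every class on its own so each group gets its share of that class.
    for label in sorted(rows_by_class):
        rows = rows_by_class[label]
        rng.shuffle(rows)
        n_holdout = len(rows) * holdout_pct // 100
        for i in rows[:n_holdout]:
            assignment[i] = "holdout"
        for position, i in enumerate(rows[n_holdout:]):
            assignment[i] = position % reps
    return assignment


labels = [1] * 20 + [0] * 80           # 1 positive for every 4 negatives
assignment = stratified_cv_assign(labels, holdout_pct=20, reps=4, seed=0)
```

In this example every fold and the holdout set end up with 4 positive and 16 negative rows, the same 1:4 ratio as the full data.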

class datarobot.GroupCV(holdout_pct, reps, partition_key_cols, seed=0)

A partition in which one column is specified, and rows sharing a common value for that column are guaranteed to stay together in the partitioning into cross-validation groups and the holdout set.

Parameters:

holdout_pct : int

the desired percentage of dataset to assign to holdout set

reps : int

number of cross validation folds to use

partition_key_cols : list

a list containing a single string, where the string is the name of the column whose values should remain together in partitioning

seed : int

a seed to use for randomization
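The guarantee that rows sharing a key value stay together can be illustrated by assigning partitions to the distinct key values rather than to individual rows. The sketch below is a simplification, not DataRobot's implementation: group_cv_assign is a hypothetical name, and holdout_pct is applied to the number of groups rather than to the number of rows.

```python
import random


def group_cv_assign(group_values, holdout_pct, reps, seed=0):
    """Hypothetical helper: per-row partition where equal keys share a partition."""
    rng = random.Random(seed)
    groups = sorted(set(group_values))   # sort first so the shuffle is repeatable
    rng.shuffle(groups)

    # Assumption for this sketch: the percentage is taken over groups, not rows.
    n_holdout = len(groups) * holdout_pct // 100
    partition_of_group = {g: "holdout" for g in groups[:n_holdout]}
    for position, g in enumerate(groups[n_holdout:]):
        partition_of_group[g] = position % reps

    # Every row inherits the partition of its group.
    return [partition_of_group[g] for g in group_values]


customer_ids = ["c%d" % (i % 10) for i in range(50)]   # 10 groups of 5 rows
assignment = group_cv_assign(customer_ids, holdout_pct=20, reps=4, seed=0)
```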

class datarobot.UserCV(user_partition_col, cv_holdout_level, seed=0)

A partition where the cross-validation folds and the holdout set are specified by the user.

Parameters:

user_partition_col : string

the name of the column containing the partition assignments

cv_holdout_level

the value of the partition column indicating a row is part of the holdout set

seed : int

a seed to use for randomization

class datarobot.RandomTVH(holdout_pct, validation_pct, seed=0)

Specifies a partitioning method in which rows are randomly assigned to training, validation, and holdout.

Parameters:

holdout_pct : int

the desired percentage of dataset to assign to holdout set

validation_pct : int

the desired percentage of dataset to assign to validation set

seed : int

a seed to use for randomization
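As a rough illustration of how holdout_pct and validation_pct divide a dataset, here is a small pure-Python sketch. It is not DataRobot's implementation; random_tvh_assign is a hypothetical name, and the rounding of percentages to row counts is an assumption.

```python
import random


def random_tvh_assign(n_rows, holdout_pct, validation_pct, seed=0):
    """Hypothetical helper: label each row training, validation, or holdout."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)

    n_holdout = n_rows * holdout_pct // 100
    n_validation = n_rows * validation_pct // 100

    # Rows not picked for holdout or validation remain in training.
    assignment = ["training"] * n_rows
    for i in indices[:n_holdout]:
        assignment[i] = "holdout"
    for i in indices[n_holdout:n_holdout + n_validation]:
        assignment[i] = "validation"
    return assignment


assignment = random_tvh_assign(n_rows=200, holdout_pct=20, validation_pct=16)
```

For 200 rows this yields 40 holdout rows, 32 validation rows, and 128 training rows.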

class datarobot.UserTVH(user_partition_col, training_level, validation_level, holdout_level, seed=0)

Specifies a partitioning method in which rows are assigned by the user to training, validation, and holdout sets.

Parameters:

user_partition_col : string

the name of the column containing the partition assignments

training_level

the value of the partition column indicating a row is part of the training set

validation_level

the value of the partition column indicating a row is part of the validation set

holdout_level

the value of the partition column indicating a row is part of the holdout set (use None if you want no holdout set)

seed : int

a seed to use for randomization
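The mapping from values of the partition column to the three sets can be sketched as a simple lookup. This is an illustration of the parameter semantics only; user_tvh_assign is a hypothetical name, and raising an error for unexpected values is an assumption of the sketch rather than documented behaviour.

```python
def user_tvh_assign(column_values, training_level, validation_level,
                    holdout_level=None):
    """Hypothetical helper: translate partition-column values into set names."""
    mapping = {training_level: "training", validation_level: "validation"}
    if holdout_level is not None:        # None means no holdout set is wanted
        mapping[holdout_level] = "holdout"

    result = []
    for value in column_values:
        if value not in mapping:
            # Assumption: values matching no level are treated as an error.
            raise ValueError("unexpected partition value: %r" % (value,))
        result.append(mapping[value])
    return result


column = ["T", "T", "V", "H", "T"]
print(user_tvh_assign(column, training_level="T", validation_level="V",
                      holdout_level="H"))
# ['training', 'training', 'validation', 'holdout', 'training']
```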

class datarobot.StratifiedTVH(holdout_pct, validation_pct, seed=0)

A partition in which observations are randomly assigned to train, validation, and holdout sets, preserving in each group the same ratio of positive to negative cases as in the original data.

Parameters:

holdout_pct : int

the desired percentage of dataset to assign to holdout set

validation_pct : int

the desired percentage of dataset to assign to validation set

seed : int

a seed to use for randomization

class datarobot.GroupTVH(holdout_pct, validation_pct, partition_key_cols, seed=0)

A partition in which one column is specified, and rows sharing a common value for that column are guaranteed to stay together in the partitioning into the training, validation, and holdout sets.

Parameters:

holdout_pct : int

the desired percentage of dataset to assign to holdout set

validation_pct : int

the desired percentage of dataset to assign to validation set

partition_key_cols : list

a list containing a single string, where the string is the name of the column whose values should remain together in partitioning

seed : int

a seed to use for randomization

class datarobot.DatetimePartitioningSpecification(datetime_partition_column, autopilot_data_selection_method=None, validation_duration=None, holdout_start_date=None, holdout_duration=None, disable_holdout=None, gap_duration=None, number_of_backtests=None, backtests=None, use_time_series=False, default_to_a_priori=False, default_to_known_in_advance=False, feature_derivation_window_start=None, feature_derivation_window_end=None, feature_settings=None, forecast_window_start=None, forecast_window_end=None, treat_as_exponential=None, differencing_method=None, periodicities=None, multiseries_id_columns=None)

Uniquely defines a DatetimePartitioning for some project

Includes only the attributes of DatetimePartitioning that are directly controllable by users, not those determined by the DataRobot application based on the project dataset and the user-controlled settings.

This is the specification that should be passed to Project.set_target via the partitioning_method parameter. To see the full partitioning based on the project dataset, use DatetimePartitioning.generate.

All durations should be specified with a duration string such as those returned by the partitioning_methods.construct_duration_string helper method.

Attributes

datetime_partition_column (str) the name of the column whose values as dates are used to assign a row to a particular partition
autopilot_data_selection_method (str) one of datarobot.enums.DATETIME_AUTOPILOT_DATA_SELECTION_METHOD. Whether models created by the autopilot should use “rowCount” or “duration” as their data_selection_method.
validation_duration (str or None) the default validation_duration for the backtests
holdout_start_date (datetime.datetime or None) The start date of holdout scoring data. If holdout_start_date is specified, holdout_duration must also be specified. If disable_holdout is set to True, neither holdout_duration nor holdout_start_date may be specified.
holdout_duration (str or None) The duration of the holdout scoring data. If holdout_duration is specified, holdout_start_date must also be specified. If disable_holdout is set to True, neither holdout_duration nor holdout_start_date may be specified.
disable_holdout (bool or None) (New in version v2.8) Whether to suppress allocating a holdout fold. If set to True, holdout_start_date and holdout_duration must not be specified.
gap_duration (str or None) The duration of the gap between training and holdout scoring data
number_of_backtests (int or None) the number of backtests to use
backtests (list of BacktestSpecification) the exact specification of backtests to use. The indexes of the specified backtests should range from 0 to number_of_backtests - 1. If any backtest is left unspecified, a default configuration will be chosen.
use_time_series (bool) (New in version v2.8) Whether to create a time series project (if True) or an OTV project which uses datetime partitioning (if False). The default behaviour is to create an OTV project.
default_to_a_priori (bool) (Deprecated in version v2.11) Optional, renamed to default_to_known_in_advance, see below for more detail.
default_to_known_in_advance (bool) (New in version v2.11) Optional, only used for time series projects. Whether to default to treating features as known in advance. If not specified, defaults to False. Known in advance features are expected to be known for dates in the future when making predictions, e.g. “is this a holiday”.
feature_derivation_window_start (int or None) (New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should start. Expressed in terms of the time_unit of the datetime_partition_column and should be negative or zero.
feature_derivation_window_end (int or None) (New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should end. Expressed in terms of the time_unit of the datetime_partition_column and should be negative or zero.
feature_settings (list of FeatureSettings objects) (New in version v2.9) Optional, a list specifying per feature settings, can be left unspecified.
forecast_window_start (int or None) (New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should start. Expressed in terms of the time_unit of the datetime_partition_column.
forecast_window_end (int or None) (New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should end. Expressed in terms of the time_unit of the datetime_partition_column.
treat_as_exponential (string, optional) (New in version v2.9) defaults to “auto”. Used to specify whether to treat data as exponential trend and apply transformations like log-transform. Use values from datarobot.enums.TREAT_AS_EXPONENTIAL enum.
differencing_method (string, optional) (New in version v2.9) defaults to “auto”. Used to specify which differencing method to apply in case the data is stationary. Use values from datarobot.enums.DIFFERENCING_METHOD enum.
periodicities (list of Periodicity, optional) (New in version v2.9) a list of datarobot.Periodicity
multiseries_id_columns (list of str or None) (New in version v2.11) a list of the names of multiseries id columns to define series within the training data. Currently only one multiseries id column is supported.
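The constraints linking holdout_start_date, holdout_duration, and disable_holdout can be summarised in a small check. The sketch below only restates the rules given in the attribute descriptions; holdout_setting_errors is a hypothetical helper and not part of the datarobot package.

```python
from datetime import datetime


def holdout_setting_errors(holdout_start_date=None, holdout_duration=None,
                           disable_holdout=None):
    """Hypothetical helper: list the documented holdout rules that are violated."""
    errors = []
    has_start = holdout_start_date is not None
    has_duration = holdout_duration is not None

    if disable_holdout and (has_start or has_duration):
        errors.append("holdout settings given although disable_holdout is True")
    elif has_start != has_duration:
        errors.append("holdout_start_date and holdout_duration must be "
                      "specified together")
    return errors


# Both values given together: no rule is violated.
print(holdout_setting_errors(datetime(2018, 1, 1), "P3M"))        # []
# Start date without a duration violates the pairing rule.
print(len(holdout_setting_errors(holdout_start_date=datetime(2018, 1, 1))))  # 1
```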
class datarobot.BacktestSpecification(index, gap_duration, validation_start_date, validation_duration)

Uniquely defines a Backtest used in a DatetimePartitioning

Includes only the attributes of a backtest directly controllable by users. The other attributes are assigned by the DataRobot application based on the project dataset and the user-controlled settings.

All durations should be specified with a duration string such as those returned by the partitioning_methods.construct_duration_string helper method.

Attributes

index (int) the index of the backtest to update
gap_duration (str) the desired duration of the gap between training and validation scoring data for the backtest
validation_start_date (datetime.datetime) the desired start date of the validation scoring data for this backtest
validation_duration (str) the desired duration of the validation scoring data for this backtest
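Backtest indexes are expected to range from 0 to number_of_backtests - 1, and any backtest left unspecified receives a default configuration. The following sketch shows one way to check a set of indexes against that rule; check_backtest_indexes is a hypothetical helper and not part of the datarobot package.

```python
def check_backtest_indexes(number_of_backtests, specified_indexes):
    """Hypothetical helper: return (out_of_range, unspecified) index lists."""
    valid = range(number_of_backtests)
    specified = set(specified_indexes)
    # Indexes outside 0 .. number_of_backtests - 1 do not name a backtest.
    out_of_range = sorted(i for i in specified if i not in valid)
    # Valid indexes with no specification would fall back to defaults.
    unspecified = [i for i in valid if i not in specified]
    return out_of_range, unspecified


print(check_backtest_indexes(4, [0, 2]))   # ([], [1, 3])
print(check_backtest_indexes(2, [0, 5]))   # ([5], [1])
```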
class datarobot.FeatureSettings(feature_name, known_in_advance=False, a_priori=None)

Per feature settings

Attributes

feature_name (string) name of the feature
a_priori (bool) (Deprecated in v2.11) Optional, renamed to known_in_advance, see below for more detail.
known_in_advance (bool) (New in version v2.11) Optional, whether the feature is known in advance, i.e. expected to be known for dates in the future at prediction time. Features that don’t have a feature setting specifying whether they are known in advance use the value from the default_to_known_in_advance flag.
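The fallback described above, where features without an explicit setting take the value of default_to_known_in_advance, can be sketched as follows. The FeatureSetting namedtuple is a stand-in defined for this example, and resolve_known_in_advance is a hypothetical helper; neither is part of the datarobot package.

```python
from collections import namedtuple

# Stand-in for a per-feature setting, used only in this sketch.
FeatureSetting = namedtuple("FeatureSetting", ["feature_name", "known_in_advance"])


def resolve_known_in_advance(feature_names, feature_settings,
                             default_to_known_in_advance=False):
    """Hypothetical helper: effective known-in-advance flag for each feature."""
    explicit = {s.feature_name: s.known_in_advance for s in feature_settings}
    # Features without an explicit setting fall back to the project default.
    return {name: explicit.get(name, default_to_known_in_advance)
            for name in feature_names}


settings = [FeatureSetting("holiday", True)]
print(resolve_known_in_advance(["holiday", "sales"], settings))
# {'holiday': True, 'sales': False}
```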
class datarobot.Periodicity(time_steps, time_unit)

Periodicity configuration

Parameters:

time_steps : int

Time step value

time_unit : string

Time step unit, valid options are values from datarobot.enums.PERIODICITY_TIME_UNITS

Examples

import datarobot as dr
periodicities = [
    dr.Periodicity(time_steps=10, time_unit=dr.enums.PERIODICITY_TIME_UNITS.HOUR),
    dr.Periodicity(time_steps=600, time_unit=dr.enums.PERIODICITY_TIME_UNITS.MINUTE)]
spec = dr.DatetimePartitioningSpecification(
    # ...
    periodicities=periodicities
)
class datarobot.DatetimePartitioning(project_id=None, datetime_partition_column=None, date_format=None, autopilot_data_selection_method=None, validation_duration=None, available_training_start_date=None, available_training_duration=None, available_training_row_count=None, available_training_end_date=None, primary_training_start_date=None, primary_training_duration=None, primary_training_row_count=None, primary_training_end_date=None, gap_start_date=None, gap_duration=None, gap_row_count=None, gap_end_date=None, holdout_start_date=None, holdout_duration=None, holdout_row_count=None, holdout_end_date=None, number_of_backtests=None, backtests=None, total_row_count=None, use_time_series=False, default_to_a_priori=False, default_to_known_in_advance=False, feature_derivation_window_start=None, feature_derivation_window_end=None, feature_settings=None, forecast_window_start=None, forecast_window_end=None, treat_as_exponential=None, differencing_method=None, periodicities=None, multiseries_id_columns=None)

Full partitioning of a project for datetime partitioning

Includes both the attributes specified by the user, as well as those determined by the DataRobot application based on the project dataset. In order to use a partitioning to set the target, call to_specification and pass the resulting DatetimePartitioningSpecification to Project.set_target.

The available training data corresponds to all the data available for training, while the primary training data corresponds to the data that can be used to train while ensuring that all backtests are available. If a model is trained with more data than is available in the primary training data, then some backtests may not have scores available.

Attributes

project_id (str) the id of the project this partitioning applies to
datetime_partition_column (str) the name of the column whose values as dates are used to assign a row to a particular partition
date_format (str) the format (e.g. “%Y-%m-%d %H:%M:%S”) by which the partition column was interpreted (compatible with strftime [https://docs.python.org/2/library/time.html#time.strftime] )
autopilot_data_selection_method (str) one of datarobot.enums.DATETIME_AUTOPILOT_DATA_SELECTION_METHOD. Whether models created by the autopilot use “rowCount” or “duration” as their data_selection_method.
validation_duration (str) the validation duration specified when initializing the partitioning - not directly significant if the backtests have been modified, but used as the default validation_duration for the backtests
available_training_start_date (datetime.datetime) The start date of the available training data for scoring the holdout
available_training_duration (str) The duration of the available training data for scoring the holdout
available_training_row_count (int or None) The number of rows in the available training data for scoring the holdout. Only available when retrieving the partitioning after setting the target.
available_training_end_date (datetime.datetime) The end date of the available training data for scoring the holdout
primary_training_start_date (datetime.datetime or None) The start date of primary training data for scoring the holdout. Unavailable when the holdout fold is disabled.
primary_training_duration (str) The duration of the primary training data for scoring the holdout
primary_training_row_count (int or None) The number of rows in the primary training data for scoring the holdout. Only available when retrieving the partitioning after setting the target.
primary_training_end_date (datetime.datetime or None) The end date of the primary training data for scoring the holdout. Unavailable when the holdout fold is disabled.
gap_start_date (datetime.datetime or None) The start date of the gap between training and holdout scoring data. Unavailable when the holdout fold is disabled.
gap_duration (str) The duration of the gap between training and holdout scoring data
gap_row_count (int or None) The number of rows in the gap between training and holdout scoring data. Only available when retrieving the partitioning after setting the target.
gap_end_date (datetime.datetime or None) The end date of the gap between training and holdout scoring data. Unavailable when the holdout fold is disabled.
holdout_start_date (datetime.datetime or None) The start date of holdout scoring data. Unavailable when the holdout fold is disabled.
holdout_duration (str) The duration of the holdout scoring data
holdout_row_count (int or None) The number of rows in the holdout scoring data. Only available when retrieving the partitioning after setting the target.
holdout_end_date (datetime.datetime or None) The end date of the holdout scoring data. Unavailable when the holdout fold is disabled.
number_of_backtests (int) the number of backtests used
backtests (list of partitioning_methods.Backtest) the configured Backtests
total_row_count (int) the number of rows in the project dataset. Only available when retrieving the partitioning after setting the target.
use_time_series (bool) (New in version v2.8) Whether to create a time series project (if True) or an OTV project which uses datetime partitioning (if False). The default behaviour is to create an OTV project.
default_to_a_priori (bool) (Deprecated in version v2.11) Optional, renamed to default_to_known_in_advance, see below for more detail.
default_to_known_in_advance (bool) (New in version v2.11) Optional, only used for time series projects. Whether to default to treating features as known in advance. If not specified, defaults to False. Known in advance features are expected to be known for dates in the future when making predictions, e.g. “is this a holiday”.
feature_derivation_window_start (int or None) (New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should start. Expressed in terms of the time_unit of the datetime_partition_column.
feature_derivation_window_end (int or None) (New in version v2.8) Only used for time series projects. Offset into the past to define how far back relative to the forecast point the feature derivation window should end. Expressed in terms of the time_unit of the datetime_partition_column.
feature_settings (list of FeatureSettings) (New in version v2.9) Optional, a list specifying per feature settings, can be left unspecified.
forecast_window_start (int or None) (New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should start. Expressed in terms of the time_unit of the datetime_partition_column.
forecast_window_end (int or None) (New in version v2.8) Only used for time series projects. Offset into the future to define how far forward relative to the forecast point the forecast window should end. Expressed in terms of the time_unit of the datetime_partition_column.
treat_as_exponential (string, optional) (New in version v2.9) defaults to “auto”. Used to specify whether to treat data as exponential trend and apply transformations like log-transform. Use values from datarobot.enums.TREAT_AS_EXPONENTIAL enum.
differencing_method (string, optional) (New in version v2.9) defaults to “auto”. Used to specify which differencing method to apply in case the data is stationary. Use values from datarobot.enums.DIFFERENCING_METHOD enum.
periodicities (list of Periodicity, optional) (New in version v2.9) a list of datarobot.Periodicity
multiseries_id_columns (list of str or None) (New in version v2.11) a list of the names of multiseries id columns to define series within the training data. Currently only one multiseries id column is supported.

classmethod generate(project_id, spec, max_wait=600)

Preview the full partitioning determined by a DatetimePartitioningSpecification

Based on the project dataset and the partitioning specification, inspect the full partitioning that would be used if the same specification were passed into Project.set_target.

Parameters:

project_id : str

the id of the project

spec : DatetimePartitioningSpecification

the desired partitioning

max_wait : int, optional

For some settings (e.g. generating a partitioning preview for a multiseries project for the first time), an asynchronous task must be run to analyze the dataset. max_wait governs the maximum time (in seconds) to wait before giving up. In all non-multiseries projects, this is unused.

Returns:

DatetimePartitioning :

the full generated partitioning

classmethod get(project_id)

Retrieve the DatetimePartitioning from a project

Only available if the project has already set the target as a datetime project.

Parameters:

project_id : str

the id of the project to retrieve partitioning for

Returns:

DatetimePartitioning : the full partitioning for the project

classmethod feature_log_list(project_id, offset=None, limit=None)

Retrieve the feature derivation log content and log length for a time series project.

The Time Series Feature Log provides details about the feature generation process for a time series project. It includes information about which features are generated and their priority, as well as the detected properties of the time series data such as whether the series is stationary, and periodicities detected.

This route is only supported for time series projects that have finished partitioning.

The feature derivation log will include information about:

  • Detected stationarity of the series:
    e.g. ‘Series detected as non-stationary’
  • Detected presence of multiplicative trend in the series:
    e.g. ‘Multiplicative trend detected’
  • Detected periodicities in the series:
    e.g. ‘Detected periodicities: 7 day’
  • Maximum number of features to be generated:
    e.g. ‘Maximum number of feature to be generated is 1440’
  • Window sizes used in rolling statistics / lag extractors
    e.g. ‘The window sizes chosen to be: 2 months
    (because the time step is 1 month and Feature Derivation Window is 2 months)’
  • Features that are specified as known-in-advance
    e.g. ‘Variables treated as apriori: holiday’
  • Details about why certain variables are transformed in the input data
    e.g. ‘Generating variable “y (log)” from “y” because multiplicative trend
    is detected’
  • Details about features generated as timeseries features, and their priority
    e.g. ‘Generating feature “date (actual)” from “date” (priority: 1)’

Parameters:

project_id : str

project id to retrieve a feature derivation log for.

offset : int

optional, defaults to 0, this many results will be skipped.

limit : int

optional, defaults to 100, at most this many results are returned. To specify no limit, use 0. The default may change without notice.
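The offset and limit semantics described above can be illustrated on a plain list of log lines. This sketch only mirrors the documented parameter behaviour (skip offset results, return at most limit results, with 0 meaning no limit); page_log_lines is a hypothetical helper and does not call the DataRobot API.

```python
def page_log_lines(lines, offset=0, limit=100):
    """Hypothetical helper: apply offset/limit paging to a list of log lines."""
    remaining = lines[offset:]          # skip the first `offset` results
    if limit == 0:                      # 0 means "no limit"
        return remaining
    return remaining[:limit]            # otherwise return at most `limit` results


log = ["line %d" % i for i in range(250)]
print(len(page_log_lines(log)))                       # 100
print(len(page_log_lines(log, offset=200)))           # 50
print(len(page_log_lines(log, offset=10, limit=0)))   # 240
```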

classmethod feature_log_retrieve(project_id)

Retrieve the feature derivation log content and log length for a time series project.

The Time Series Feature Log provides details about the feature generation process for a time series project. It includes information about which features are generated and their priority, as well as the detected properties of the time series data such as whether the series is stationary, and periodicities detected.

This route is only supported for time series projects that have finished partitioning.

The feature derivation log will include information about:

  • Detected stationarity of the series:
    e.g. ‘Series detected as non-stationary’
  • Detected presence of multiplicative trend in the series:
    e.g. ‘Multiplicative trend detected’
  • Detected periodicities in the series:
    e.g. ‘Detected periodicities: 7 day’
  • Maximum number of features to be generated:
    e.g. ‘Maximum number of feature to be generated is 1440’
  • Window sizes used in rolling statistics / lag extractors
    e.g. ‘The window sizes chosen to be: 2 months
    (because the time step is 1 month and Feature Derivation Window is 2 months)’
  • Features that are specified as known-in-advance
    e.g. ‘Variables treated as apriori: holiday’
  • Details about why certain variables are transformed in the input data
    e.g. ‘Generating variable “y (log)” from “y” because multiplicative trend
    is detected’
  • Details about features generated as timeseries features, and their priority
    e.g. ‘Generating feature “date (actual)” from “date” (priority: 1)’

Parameters:

project_id : str

project id to retrieve a feature derivation log for.

to_specification()

Render the DatetimePartitioning as a DatetimePartitioningSpecification

The resulting specification can be used when setting the target, and contains only the attributes directly controllable by users.

Returns:

DatetimePartitioningSpecification:

the specification for this partitioning

to_dataframe()

Render the partitioning settings as a dataframe for convenience of display

Excludes project_id, datetime_partition_column, date_format, autopilot_data_selection_method, validation_duration, and number_of_backtests, as well as the row count information, if present.

Also excludes the time series specific parameters for use_time_series, default_to_known_in_advance, and defining the feature derivation and forecast windows.

class datarobot.helpers.partitioning_methods.Backtest(index=None, available_training_start_date=None, available_training_duration=None, available_training_row_count=None, available_training_end_date=None, primary_training_start_date=None, primary_training_duration=None, primary_training_row_count=None, primary_training_end_date=None, gap_start_date=None, gap_duration=None, gap_row_count=None, gap_end_date=None, validation_start_date=None, validation_duration=None, validation_row_count=None, validation_end_date=None, total_row_count=None)

A backtest used to evaluate models trained in a datetime partitioned project

When setting up a datetime partitioning project, backtests are specified by a BacktestSpecification.

The available training data corresponds to all the data available for training, while the primary training data corresponds to the data that can be used to train while ensuring that all backtests are available. If a model is trained with more data than is available in the primary training data, then some backtests may not have scores available.

Attributes

index (int) the index of the backtest
available_training_start_date (datetime.datetime) the start date of the available training data for this backtest
available_training_duration (str) the duration of available training data for this backtest
available_training_row_count (int or None) the number of rows of available training data for this backtest. Only available when retrieving from a project where the target is set.
available_training_end_date (datetime.datetime) the end date of the available training data for this backtest
primary_training_start_date (datetime.datetime) the start date of the primary training data for this backtest
primary_training_duration (str) the duration of the primary training data for this backtest
primary_training_row_count (int or None) the number of rows of primary training data for this backtest. Only available when retrieving from a project where the target is set.
primary_training_end_date (datetime.datetime) the end date of the primary training data for this backtest
gap_start_date (datetime.datetime) the start date of the gap between training and validation scoring data for this backtest
gap_duration (str) the duration of the gap between training and validation scoring data for this backtest
gap_row_count (int or None) the number of rows in the gap between training and validation scoring data for this backtest. Only available when retrieving from a project where the target is set.
gap_end_date (datetime.datetime) the end date of the gap between training and validation scoring data for this backtest
validation_start_date (datetime.datetime) the start date of the validation scoring data for this backtest
validation_duration (str) the duration of the validation scoring data for this backtest
validation_row_count (int or None) the number of rows of validation scoring data for this backtest. Only available when retrieving from a project where the target is set.
validation_end_date (datetime.datetime) the end date of the validation scoring data for this backtest
total_row_count (int or None) the number of rows in this backtest. Only available when retrieving from a project where the target is set.

to_specification()

Render this backtest as a BacktestSpecification

A BacktestSpecification includes only the attributes users can directly control, not those indirectly determined by the project dataset.

Returns:

BacktestSpecification

the specification for this backtest

to_dataframe()

Render this backtest as a dataframe for convenience of display

Returns:

backtest_partitioning : pandas.DataFrame

the backtest attributes, formatted into a dataframe

datarobot.helpers.partitioning_methods.construct_duration_string(years=0, months=0, days=0, hours=0, minutes=0, seconds=0)

Construct a valid string representing a duration in accordance with ISO8601

A duration of six months, 3 days, and 12 hours could be represented as P6M3DT12H.

Parameters:

years : int

the number of years in the duration

months : int

the number of months in the duration

days : int

the number of days in the duration

hours : int

the number of hours in the duration

minutes : int

the number of minutes in the duration

seconds : int

the number of seconds in the duration

Returns:

duration_string: str

The duration string, specified compatibly with ISO8601
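For readers unfamiliar with ISO 8601 durations, the following standalone sketch builds such a string from the same six components. It is not the library's implementation, and the exact string returned by construct_duration_string may differ (for example, in whether zero-valued components are included); iso8601_duration is a hypothetical name.

```python
def iso8601_duration(years=0, months=0, days=0, hours=0, minutes=0, seconds=0):
    """Hypothetical helper: compact ISO 8601 duration, omitting zero fields."""
    date_fields = ((years, "Y"), (months, "M"), (days, "D"))
    time_fields = ((hours, "H"), (minutes, "M"), (seconds, "S"))
    date_part = "".join("%d%s" % (value, unit) for value, unit in date_fields if value)
    time_part = "".join("%d%s" % (value, unit) for value, unit in time_fields if value)

    if not date_part and not time_part:
        return "PT0S"                   # a zero-length duration
    # The "T" separator is only needed when a time component is present.
    return "P" + date_part + ("T" + time_part if time_part else "")


print(iso8601_duration(months=6, days=3, hours=12))   # P6M3DT12H
```

The "T" separator matters because "M" means months before it and minutes after it.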