Datasets

class datarobot.models.Dataset

Represents a Dataset returned from the api/v2/datasets/ endpoints.

Variables:
  • id (string) – The ID of this dataset

  • name (string) – The name of this dataset in the catalog

  • is_latest_version (bool) – Whether this dataset version is the latest version of this dataset

  • version_id (string) – The object ID of the catalog_version the dataset belongs to

  • categories (list(string)) – An array of strings describing the intended use of the dataset. The supported options are “TRAINING” and “PREDICTION”.

  • created_at (string) – The date when the dataset was created

  • created_by (string, optional) – Username of the user who created the dataset

  • is_snapshot (bool) – Whether the dataset version is an immutable snapshot of data which has previously been retrieved and saved to DataRobot

  • data_persisted (Optional[bool]) – If true, user is allowed to view extended data profile (which includes data statistics like min/max/median/mean, histogram, etc.) and download data. If false, download is not allowed and only the data schema (feature names and types) will be available.

  • is_data_engine_eligible (bool) – Whether this dataset can be a data source of a data engine query.

  • processing_state (string) – Current ingestion process state of the dataset

  • row_count (Optional[int]) – The number of rows in the dataset.

  • size (Optional[int]) – The size of the dataset as a CSV in bytes.

  • sample_size (dict, optional) – The size of data fetched during dataset registration. For example, to fetch the first 95 rows, the sample_size value is {‘type’: ‘rows’, ‘value’: 95}. Currently only ‘rows’ type is supported.

get_uri()
Returns:

url – Permanent static hyperlink to this dataset in AI Catalog.

Return type:

str
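
A minimal sketch of retrieving the link (the dataset ID is hypothetical):

dataset = Dataset.get('5e31cdcdbd66d80ab4b64387')
print(dataset.get_uri())  # permanent AI Catalog link for this dataset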

classmethod upload(source)

This method covers Dataset creation from local sources (a file, file object, or DataFrame) and from a URL.

Parameters:

source (str, pd.DataFrame or file object) – Pass a URL, filepath, file or DataFrame to create and return a Dataset.

Returns:

response – The Dataset created from the uploaded data source.

Return type:

Dataset

Raises:

InvalidUsageError – If the source parameter cannot be determined to be a URL, filepath, file or DataFrame.

Examples

# Upload a local file
dataset_one = Dataset.upload("./data/examples.csv")

# Create a dataset via URL
dataset_two = Dataset.upload(
    "https://raw.githubusercontent.com/curran/data/gh-pages/dbpedia/cities/data.csv"
)

# Create dataset with a pandas Dataframe
dataset_three = Dataset.upload(my_df)

# Create a dataset from an open file object
with open("./data/examples.csv", "rb") as file_pointer:
    dataset_four = Dataset.create_from_file(filelike=file_pointer)

classmethod create_from_file(cls, file_path=None, filelike=None, categories=None, read_timeout=600, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from a file. Returns when the dataset has been successfully uploaded and processed.

Warning: This function does not clean up its open files. If you pass a filelike, you are responsible for closing it. If you pass a file_path, this will create a file object from the file_path but will not close it.

Parameters:
  • file_path (string, optional) – The path to the file. This will create a file object pointing to that file but will not close it.

  • filelike (file, optional) – An open and readable file object.

  • categories (list[string], optional) – An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

  • read_timeout (Optional[int]) – The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete

  • max_wait (Optional[int]) – Time in seconds after which dataset creation is considered unsuccessful

  • use_cases (list[UseCase] | UseCase | list[string] | string, optional) – A list of UseCase objects, UseCase object, list of Use Case ids or a single Use Case id to add this new Dataset to. Must be a kwarg.

Returns:

response – A fully armed and operational Dataset

Return type:

Dataset
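
A short sketch of both input forms, assuming a local CSV at a hypothetical path; a filelike passed in must be closed by the caller:

# From a file path (the file object created internally is not closed)
dataset = Dataset.create_from_file(file_path="./data/examples.csv")

# From an open file object; the caller is responsible for closing it
with open("./data/examples.csv", "rb") as f:
    dataset = Dataset.create_from_file(filelike=f, categories=["TRAINING"])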

classmethod create_from_in_memory_data(cls, data_frame=None, records=None, categories=None, read_timeout=600, max_wait=600, fname=None, *, use_cases=None)

A blocking call that creates a new Dataset from in-memory data. Returns when the dataset has been successfully uploaded and processed.

The data can be either a pandas DataFrame or a list of dictionaries with identical keys.

Parameters:
  • data_frame (DataFrame, optional) – The data frame to upload

  • records (list[dict], optional) – A list of dictionaries with identical keys to upload

  • categories (list[string], optional) – An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

  • read_timeout (Optional[int]) – The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete

  • max_wait (Optional[int]) – Time in seconds after which dataset creation is considered unsuccessful

  • fname (string, optional) – The file name, “data.csv” by default

  • use_cases (list[UseCase] | UseCase | list[string] | string, optional) – A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.

Returns:

response – The Dataset created from the uploaded data.

Return type:

Dataset

Raises:

InvalidUsageError – If neither a DataFrame nor a list of records is passed.
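
A sketch of both accepted input forms (the column names are illustrative):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
dataset_from_df = Dataset.create_from_in_memory_data(data_frame=df)

records = [{"a": 1, "b": 3}, {"a": 2, "b": 4}]  # identical keys in each dict
dataset_from_records = Dataset.create_from_in_memory_data(
    records=records, fname="my_records.csv"
)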

classmethod create_from_url(cls, url, do_snapshot=None, persist_data_after_ingestion=None, categories=None, sample_size=None, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from data stored at a url. Returns when the dataset has been successfully uploaded and processed.

Parameters:
  • url (string) – The URL to use as the source of data for the dataset being created.

  • do_snapshot (Optional[bool]) – If unset, uses the server default: True. If true, creates a snapshot dataset; if false, creates a remote dataset. Creating snapshots from non-file sources may be disabled by the permission, Disable AI Catalog Snapshots.

  • persist_data_after_ingestion (Optional[bool]) – If unset, uses the server default: True. If true, enforces saving all data (for download and sampling) and allows a user to view the extended data profile (which includes data statistics like min/max/median/mean, histogram, etc.). If false, does not enforce saving data; the data schema (feature names and types) will still be available. Setting this parameter to false and do_snapshot to true will result in an error.

  • categories (list[string], optional) – An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

  • sample_size (dict, optional) – The size of data fetched during dataset registration. For example, to fetch the first 95 rows, the sample_size value would be: {‘type’: ‘rows’, ‘value’: 95}. Currently only ‘rows’ type is supported.

  • max_wait (Optional[int]) – Time in seconds after which dataset creation is considered unsuccessful.

  • use_cases (list[UseCase] | UseCase | list[string] | string, optional) – A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.

Returns:

response – The Dataset created from the uploaded data

Return type:

Dataset
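
A sketch with a hypothetical URL, creating a snapshot dataset tagged for training:

dataset = Dataset.create_from_url(
    "https://example.com/data/my_data.csv",  # hypothetical source URL
    do_snapshot=True,
    categories=["TRAINING"],
)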

classmethod create_from_project(cls, project_id, categories=None, max_wait=600, *, use_cases=None)

A blocking call that creates a new dataset from project data. Returns when the dataset has been successfully created.

Parameters:
  • project_id (string) – The project to create the dataset from.

  • categories (list[string], optional) – An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

  • max_wait (Optional[int]) – Time in seconds after which dataset creation is considered unsuccessful.

  • use_cases (list[UseCase] | UseCase | list[string] | string, optional) – A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.

Returns:

response – The dataset created from the project dataset.

Return type:

Dataset
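
A minimal sketch using a hypothetical project ID:

dataset = Dataset.create_from_project(project_id="5fd06afce2456ec1e9d20457")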

classmethod create_from_datastage(cls, datastage_id, categories=None, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from data stored as a DataStage. Returns when the dataset has been successfully uploaded and processed.

Parameters:
  • datastage_id (string) – The ID of the DataStage to use as the source of data for the dataset being created.

  • categories (list[string], optional) – An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

  • max_wait (Optional[int]) – Time in seconds after which dataset creation is considered unsuccessful.

Returns:

response – The Dataset created from the uploaded data

Return type:

Dataset

classmethod create_from_data_source(cls, data_source_id, username=None, password=None, do_snapshot=None, persist_data_after_ingestion=None, categories=None, credential_id=None, use_kerberos=None, credential_data=None, sample_size=None, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from data stored at a DataSource. Returns when the dataset has been successfully uploaded and processed.

Added in version v2.22.

Parameters:
  • data_source_id (string) – The ID of the DataSource to use as the source of data.

  • username (string, optional) – The username for database authentication.

  • password (string, optional) – The password (in cleartext) for database authentication. The password will be encrypted on the server side in scope of HTTP request and never saved or stored.

  • do_snapshot (Optional[bool]) – If unset, uses the server default: True. If true, creates a snapshot dataset; if false, creates a remote dataset. Creating snapshots from non-file sources may be disabled by the permission, Disable AI Catalog Snapshots.

  • persist_data_after_ingestion (Optional[bool]) – If unset, uses the server default: True. If true, enforces saving all data (for download and sampling) and allows a user to view the extended data profile (which includes data statistics like min/max/median/mean, histogram, etc.). If false, does not enforce saving data; the data schema (feature names and types) will still be available. Setting this parameter to false and do_snapshot to true will result in an error.

  • categories (list[string], optional) – An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

  • credential_id (string, optional) – The ID of the set of credentials to use instead of user and password. Note that with this change, username and password will become optional.

  • use_kerberos (Optional[bool]) – If unset, uses the server default: False. If true, use kerberos authentication for database authentication.

  • credential_data (dict, optional) – The credentials to authenticate with the database, to use instead of user/password or credential ID.

  • sample_size (dict, optional) – The size of data fetched during dataset registration. For example, to fetch the first 95 rows, the sample_size value would be: {‘type’: ‘rows’, ‘value’: 95}. Currently only ‘rows’ type is supported.

  • max_wait (Optional[int]) – Time in seconds after which dataset creation is considered unsuccessful.

  • use_cases (list[UseCase] | UseCase | list[string] | string, optional) – A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.

Returns:

response – The Dataset created from the uploaded data

Return type:

Dataset
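
A sketch using stored credentials rather than a username/password pair (both IDs are hypothetical):

dataset = Dataset.create_from_data_source(
    data_source_id="5ec4aec1f072bc028e3471ae",
    credential_id="5ec4aec2f072bc028e3471b1",  # used instead of username/password
    do_snapshot=True,
    sample_size={"type": "rows", "value": 95},  # fetch only the first 95 rows
)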

classmethod create_from_query_generator(cls, generator_id, dataset_id=None, dataset_version_id=None, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from the query generator. Returns when the dataset has been successfully processed. If the optional parameters are not specified, the query is applied to the dataset_id and dataset_version_id stored in the query generator. If specified, they override the stored dataset_id/dataset_version_id, e.g. to prep a prediction dataset.

Parameters:
  • generator_id (str) – The id of the query generator to use.

  • dataset_id (Optional[str]) – The id of the dataset to apply the query to.

  • dataset_version_id (Optional[str]) – The id of the dataset version to apply the query to. If not specified the latest version associated with dataset_id (if specified) is used.

  • max_wait (Optional[int]) – The maximum number of seconds to wait before giving up.

  • use_cases (list[UseCase] | UseCase | list[string] | string, optional) – A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.

Returns:

response – The Dataset created from the query generator

Return type:

Dataset

classmethod create_from_recipe(cls, recipe, name=None, do_snapshot=None, persist_data_after_ingestion=None, categories=None, credential=None, use_kerberos=None, materialization_destination=None, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from the recipe. Returns when the dataset has been successfully uploaded and processed.

Added in version 3.6.

Returns:

response – The Dataset created from the uploaded data

Return type:

Dataset

classmethod get(dataset_id)

Get information about a dataset.

Parameters:

dataset_id (string) – the id of the dataset

Returns:

dataset – the queried dataset

Return type:

Dataset

classmethod delete(dataset_id)

Soft deletes a dataset. A deleted dataset cannot be retrieved, listed, or acted upon in any way other than being un-deleted.

Parameters:

dataset_id (string) – The id of the dataset to mark for deletion

Return type:

None

classmethod un_delete(dataset_id)

Un-deletes a previously deleted dataset. If the dataset was not deleted, nothing happens.

Parameters:

dataset_id (string) – The id of the dataset to un-delete

Return type:

None
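
A sketch of the soft-delete round trip (the ID is hypothetical):

dataset_id = "5e31cdcdbd66d80ab4b64387"
Dataset.delete(dataset_id)     # dataset no longer retrievable or listable
Dataset.un_delete(dataset_id)  # dataset is available again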

classmethod list(category=None, filter_failed=None, order_by=None, use_cases=None)

List all datasets a user can view.

Parameters:
  • category (string, optional) – If specified, only dataset versions that have the specified category will be included in the results. Categories identify the intended use of the dataset; supported categories are “TRAINING” and “PREDICTION”.

  • filter_failed (Optional[bool]) – If unset, uses the server default: False. Whether datasets that failed during import should be excluded from the results. If True invalid datasets will be excluded.

  • order_by (string, optional) – If unset, uses the server default: “-created”. The sorting order applied to the catalog list. Valid options are “created” (ascending order by creation datetime) and “-created” (descending order by creation datetime).

  • use_cases (Union[UseCase, List[UseCase], str, List[str]], optional) – Filter available datasets by a specific Use Case or Cases. Accepts either the entity or the ID. If set to [None], only datasets not linked to any Use Case are returned.

Returns:

a list of datasets the user can view

Return type:

list[Dataset]
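
For example, to list only successfully ingested training datasets:

datasets = Dataset.list(category="TRAINING", filter_failed=True)
for dataset in datasets:
    print(dataset.id, dataset.name)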

classmethod iterate(offset=None, limit=None, category=None, order_by=None, filter_failed=None, use_cases=None)

Get an iterator for the requested datasets a user can view. This lazily retrieves results. It does not get the next page from the server until the current page is exhausted.

Parameters:
  • offset (Optional[int]) – If set, this many results will be skipped

  • limit (Optional[int]) – Specifies the size of each page retrieved from the server. If unset, uses the server default.

  • category (string, optional) – If specified, only dataset versions that have the specified category will be included in the results. Categories identify the intended use of the dataset; supported categories are “TRAINING” and “PREDICTION”.

  • filter_failed (Optional[bool]) – If unset, uses the server default: False. Whether datasets that failed during import should be excluded from the results. If True invalid datasets will be excluded.

  • order_by (string, optional) – If unset, uses the server default: “-created”. The sorting order applied to the catalog list. Valid options are “created” (ascending order by creation datetime) and “-created” (descending order by creation datetime).

  • use_cases (Union[UseCase, List[UseCase], str, List[str]], optional) – Filter available datasets by a specific Use Case or Cases. Accepts either the entity or the ID. If set to [None], only datasets not linked to any Use Case are returned.

Yields:

Dataset – An iterator of the datasets the user can view.

Return type:

Generator[TypeVar(TDataset, bound= Dataset), None, None]
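
A sketch that pages through datasets lazily, fetching 50 per server request:

for dataset in Dataset.iterate(limit=50, order_by="-created"):
    print(dataset.name)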

update()

Updates the Dataset attributes in place with the latest information from the server.

Return type:

None

modify(name=None, categories=None)

Modifies the Dataset name and/or categories. Updates the object in place.

Parameters:
  • name (string, optional) – The new name of the dataset

  • categories (list[string], optional) – A list of strings describing the intended use of the dataset. The supported options are “TRAINING” and “PREDICTION”. If any categories were previously specified for the dataset, they will be overwritten. If omitted or None, keep previous categories. To clear them specify []

Return type:

None
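
For example, to rename a dataset and clear its categories (the ID is hypothetical):

dataset = Dataset.get('5e31cdcdbd66d80ab4b64387')
dataset.modify(name="Renamed Dataset", categories=[])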

share(access_list, apply_grant_to_linked_objects=False)

Modify the ability of users to access this dataset

Parameters:
  • access_list (list of SharingAccess) – The modifications to make.

  • apply_grant_to_linked_objects (bool) – If true, for any users being granted access to the dataset, also grant them read access to any linked objects such as DataSources and DataStores that may be used by this dataset. Ignored if no such objects are relevant for the dataset. Defaults to False.

Return type:

None

Raises:

datarobot.ClientError – If you do not have permission to share this dataset, if the user you’re sharing with doesn’t exist, if the same user appears multiple times in the access_list, or if these changes would leave the dataset without an owner.

Examples

Transfer access to the dataset from old_user@datarobot.com to new_user@datarobot.com

from datarobot.enums import SHARING_ROLE
from datarobot.models.dataset import Dataset
from datarobot.models.sharing import SharingAccess

new_access = SharingAccess(
    "[email protected]",
    SHARING_ROLE.OWNER,
    can_share=True,
)
access_list = [
    SharingAccess(
        "[email protected]",
        SHARING_ROLE.OWNER,
        can_share=True,
        can_use_data=True,
    ),
    new_access,
]

Dataset.get('my-dataset-id').share(access_list)

get_details()

Gets the details for this Dataset

Return type:

DatasetDetails

get_all_features(order_by=None)

Get a list of all the features for this dataset.

Parameters:

order_by (string, optional) – If unset, uses the server default: ‘name’. How the features should be ordered. Can be ‘name’ or ‘featureType’.

Return type:

list[DatasetFeature]

iterate_all_features(offset=None, limit=None, order_by=None)

Get an iterator for the requested features of a dataset. This lazily retrieves results. It does not get the next page from the server until the current page is exhausted.

Parameters:
  • offset (Optional[int]) – If set, this many results will be skipped.

  • limit (Optional[int]) – Specifies the size of each page retrieved from the server. If unset, uses the server default.

  • order_by (string, optional) – If unset, uses the server default: ‘name’. How the features should be ordered. Can be ‘name’ or ‘featureType’.

Yields:

DatasetFeature

Return type:

Generator[DatasetFeature, None, None]

get_featurelists()

Get DatasetFeaturelists created on this Dataset

Returns:

feature_lists

Return type:

list[DatasetFeaturelist]

create_featurelist(name, features)

Create a new dataset featurelist

Parameters:
  • name (str) – the name of the dataset featurelist to create. Names must be unique within the dataset, or the server will return an error.

  • features (List[str]) – the names of the features to include in the dataset featurelist. Each feature must be a dataset feature.

Returns:

featurelist – the newly created featurelist

Return type:

DatasetFeaturelist

Examples

dataset = Dataset.get('1234deadbeeffeeddead4321')
dataset_features = dataset.get_all_features()
selected_features = [feat.name for feat in dataset_features][:5]  # select first five
new_flist = dataset.create_featurelist('Simple Features', selected_features)

get_file(file_path=None, filelike=None)

Retrieves all the originally uploaded data in CSV form. Writes it to either the file or a filelike object that can write bytes.

Only one of file_path or filelike can be provided and it must be provided as a keyword argument (i.e. file_path=’path-to-write-to’). If a file-like object is provided, the user is responsible for closing it when they are done.

The user must also have permission to download data.

Parameters:
  • file_path (string, optional) – The destination to write the file to.

  • filelike (file, optional) – A file-like object to write to. The object must be able to write bytes. The user is responsible for closing the object

Return type:

None
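
A sketch of both ways to write the CSV (the ID and paths are illustrative):

dataset = Dataset.get('5e31cdcdbd66d80ab4b64387')
dataset.get_file(file_path="./downloads/dataset.csv")

# Or write into an object that accepts bytes; the caller closes it
with open("./downloads/dataset.csv", "wb") as f:
    dataset.get_file(filelike=f)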

get_as_dataframe(low_memory=False)

Retrieves all the originally uploaded data in a pandas DataFrame.

Added in version v3.0.

Parameters:

low_memory (Optional[bool]) – If True, use local files to reduce memory usage which will be slower.

Return type:

pd.DataFrame
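
For example (the ID is hypothetical):

df = Dataset.get('5e31cdcdbd66d80ab4b64387').get_as_dataframe()
df.head()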

get_projects()

Retrieves the Dataset’s projects as ProjectLocation named tuples.

Returns:

locations

Return type:

list[ProjectLocation]

create_project(project_name=None, user=None, password=None, credential_id=None, use_kerberos=None, credential_data=None, *, use_cases=None)

Create a datarobot.models.Project from this dataset

Parameters:
  • project_name (string, optional) – The name of the project to be created. If not specified, will be “Untitled Project” for database connections, otherwise the project name will be based on the file used.

  • user (string, optional) – The username for database authentication.

  • password (string, optional) – The password (in cleartext) for database authentication. The password will be encrypted on the server side in scope of HTTP request and never saved or stored

  • credential_id (string, optional) – The ID of the set of credentials to use instead of user and password.

  • use_kerberos (Optional[bool]) – Server default is False. If true, use kerberos authentication for database authentication.

  • credential_data (dict, optional) – The credentials to authenticate with the database, to use instead of user/password or credential ID.

  • use_cases (list[UseCase] | UseCase | list[string] | string, optional) – A list of UseCase objects, UseCase object, list of Use Case ids or a single Use Case id to add this new Dataset to. Must be a kwarg.

Return type:

Project
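
A minimal sketch creating a project from an existing dataset (the ID is hypothetical):

dataset = Dataset.get('5e31cdcdbd66d80ab4b64387')
project = dataset.create_project(project_name="Project from my dataset")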

classmethod create_version_from_file(dataset_id, file_path=None, filelike=None, categories=None, read_timeout=600, max_wait=600)

A blocking call that creates a new Dataset version from a file. Returns when the new dataset version has been successfully uploaded and processed.

Warning: This function does not clean up its open files. If you pass a filelike, you are responsible for closing it. If you pass a file_path, this will create a file object from the file_path but will not close it.

Added in version v2.23.

Parameters:
  • dataset_id (string) – The ID of the dataset for which a new version will be created

  • file_path (string, optional) – The path to the file. This will create a file object pointing to that file but will not close it.

  • filelike (file, optional) – An open and readable file object.

  • categories (list[string], optional) – An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

  • read_timeout (Optional[int]) – The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete

  • max_wait (Optional[int]) – Time in seconds after which dataset version creation is considered unsuccessful

Returns:

response – A fully armed and operational Dataset version

Return type:

Dataset
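
A sketch that uploads a new version from a local file (path and ID are hypothetical):

with open("./data/examples_v2.csv", "rb") as f:
    new_version = Dataset.create_version_from_file(
        dataset_id="5e31cdcdbd66d80ab4b64387",
        filelike=f,  # caller is responsible for closing the file
    )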

classmethod create_version_from_in_memory_data(dataset_id, data_frame=None, records=None, categories=None, read_timeout=600, max_wait=600)

A blocking call that creates a new Dataset version for a dataset from in-memory data. Returns when the dataset has been successfully uploaded and processed.

The data can be either a pandas DataFrame or a list of dictionaries with identical keys.

Added in version v2.23.

Parameters:
  • dataset_id (string) – The ID of the dataset for which a new version will be created

  • data_frame (DataFrame, optional) – The data frame to upload

  • records (list[dict], optional) – A list of dictionaries with identical keys to upload

  • categories (list[string], optional) – An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

  • read_timeout (Optional[int]) – The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete

  • max_wait (Optional[int]) – Time in seconds after which dataset version creation is considered unsuccessful

Returns:

response – The Dataset version created from the uploaded data

Return type:

Dataset

Raises:

InvalidUsageError – If neither a DataFrame nor a list of records is passed.

classmethod create_version_from_url(dataset_id, url, categories=None, max_wait=600)

A blocking call that creates a new Dataset version from data stored at a URL for a given dataset. Returns when the dataset has been successfully uploaded and processed.

Added in version v2.23.

Parameters:
  • dataset_id (string) – The ID of the dataset for which a new version will be created

  • url (string) – The URL to use as the source of data for the dataset being created.

  • categories (list[string], optional) – An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

  • max_wait (Optional[int]) – Time in seconds after which dataset version creation is considered unsuccessful

Returns:

response – The Dataset version created from the uploaded data

Return type:

Dataset

classmethod create_version_from_datastage(dataset_id, datastage_id, categories=None, max_wait=600)

A blocking call that creates a new Dataset version from data stored in a DataStage for a given dataset. Returns when the dataset has been successfully uploaded and processed.

Parameters:
  • dataset_id (string) – The ID of the dataset for which a new version will be created

  • datastage_id (string) – The ID of the DataStage to use as the source of data for the dataset being created.

  • categories (list[string], optional) – An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

  • max_wait (Optional[int]) – Time in seconds after which dataset version creation is considered unsuccessful

Returns:

response – The Dataset version created from the uploaded data

Return type:

Dataset

classmethod create_version_from_data_source(dataset_id, data_source_id, username=None, password=None, categories=None, credential_id=None, use_kerberos=None, credential_data=None, max_wait=600)

A blocking call that creates a new Dataset version from data stored at a DataSource. Returns when the dataset has been successfully uploaded and processed.

Added in version v2.23.

Parameters:
  • dataset_id (string) – The ID of the dataset for which a new version will be created

  • data_source_id (string) – The ID of the DataSource to use as the source of data.

  • username (string, optional) – The username for database authentication.

  • password (string, optional) – The password (in cleartext) for database authentication. The password will be encrypted on the server side in scope of HTTP request and never saved or stored.

  • categories (list[string], optional) – An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

  • credential_id (string, optional) – The ID of the set of credentials to use instead of user and password. Note that with this change, username and password will become optional.

  • use_kerberos (Optional[bool]) – If unset, uses the server default: False. If true, use kerberos authentication for database authentication.

  • credential_data (dict, optional) – The credentials to authenticate with the database, to use instead of user/password or credential ID.

  • max_wait (Optional[int]) – Time in seconds after which dataset version creation is considered unsuccessful

Returns:

response – The Dataset version created from the uploaded data

Return type:

Dataset

classmethod from_data(data)

Instantiate an object of this class using a dict.

Parameters:

data (dict) – Correctly snake_cased keys and their values.

Return type:

TypeVar(T, bound= APIObject)

classmethod from_server_data(data, keep_attrs=None)

Instantiate an object of this class using the data directly from the server, meaning that the keys may have the wrong camel casing

Parameters:
  • data (dict) – The directly translated dict of JSON from the server. No casing fixes have taken place

  • keep_attrs (iterable) – List, set or tuple of the dotted namespace notations for attributes to keep within the object structure even if their values are None

Return type:

TypeVar(T, bound= APIObject)

open_in_browser()

Opens the class’ relevant web browser location. If a default browser is not available, the URL is logged.

Note: If text-mode browsers are used, the calling process will block until the user exits the browser.

Return type:

None

class datarobot.DatasetDetails

Represents a detailed view of a Dataset. The to_dataset method creates a Dataset from this details view.

Variables:
  • dataset_id (string) – The ID of this dataset

  • name (string) – The name of this dataset in the catalog

  • is_latest_version (bool) – Whether this dataset version is the latest version of this dataset

  • version_id (string) – The object ID of the catalog_version the dataset belongs to

  • categories (list(string)) – An array of strings describing the intended use of the dataset. The supported options are “TRAINING” and “PREDICTION”.

  • created_at (string) – The date when the dataset was created

  • created_by (string) – Username of the user who created the dataset

  • is_snapshot (bool) – Whether the dataset version is an immutable snapshot of data which has previously been retrieved and saved to DataRobot

  • data_persisted (Optional[bool]) – If true, user is allowed to view extended data profile (which includes data statistics like min/max/median/mean, histogram, etc.) and download data. If false, download is not allowed and only the data schema (feature names and types) will be available.

  • is_data_engine_eligible (bool) – Whether this dataset can be a data source of a data engine query.

  • processing_state (string) – Current ingestion process state of the dataset

  • row_count (Optional[int]) – The number of rows in the dataset.

  • size (Optional[int]) – The size of the dataset as a CSV in bytes.

  • data_engine_query_id (string, optional) – ID of the source data engine query

  • data_source_id (string, optional) – ID of the datasource used as the source of the dataset

  • data_source_type (string) – the type of the datasource that was used as the source of the dataset

  • description (string, optional) – the description of the dataset

  • eda1_modification_date (string, optional) – the ISO 8601 formatted date and time when the EDA1 for the dataset was updated

  • eda1_modifier_full_name (string, optional) – the user who was the last to update EDA1 for the dataset

  • error (string) – details of exception raised during ingestion process, if any

  • feature_count (Optional[int]) – total number of features in the dataset

  • feature_count_by_type (list[FeatureTypeCount]) – number of features in the dataset grouped by feature type

  • last_modification_date (string) – the ISO 8601 formatted date and time when the dataset was last modified

  • last_modifier_full_name (string) – full name of user who was the last to modify the dataset

  • tags (list[string]) – list of tags attached to the item

  • uri (string) – the URI of the datasource, for example: ‘file_name.csv’; ‘jdbc:DATA_SOURCE_GIVEN_NAME/SCHEMA.TABLE_NAME’; ‘jdbc:DATA_SOURCE_GIVEN_NAME/<query>’ for query-based datasources; ‘https://s3.amazonaws.com/my_data/my_dataset.csv’; etc.

  • sample_size (dict, optional) – The size of data fetched during dataset registration. For example, to fetch the first 95 rows, the sample_size value is {‘type’: ‘rows’, ‘value’: 95}. Currently only ‘rows’ type is supported.

classmethod get(dataset_id)

Get details for a Dataset from the server

Parameters:

dataset_id (str) – The id for the Dataset from which to get details

Return type:

DatasetDetails

to_dataset()

Build a Dataset object from the information in this object

Return type:

Dataset
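
For example, to fetch details and then work with the corresponding Dataset (the ID is hypothetical):

details = DatasetDetails.get('5e31cdcdbd66d80ab4b64387')
print(details.feature_count, details.row_count)
dataset = details.to_dataset()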

class datarobot.models.dataset.ProjectLocation

ProjectLocation(url, id)

id

Alias for field number 1

url

Alias for field number 0

Secondary datasets

class datarobot.helpers.feature_discovery.SecondaryDataset

A secondary dataset to be used for feature discovery

Added in version v2.25.

Variables:
  • identifier (str) – Alias of the dataset (used directly as part of the generated feature names)

  • catalog_id (str) – Identifier of the catalog item

  • catalog_version_id (str) – Identifier of the catalog item version

  • snapshot_policy (Optional[str]) – Policy to use while creating a project or making predictions. If omitted, the endpoint defaults to ‘latest’. Must be one of the following values: ‘specified’ (use the specific snapshot specified by catalogVersionId), ‘latest’ (use the latest snapshot from the same catalog item), or ‘dynamic’ (get data from the source; only applicable for JDBC datasets).

Examples

import datarobot as dr
dataset_definition = dr.SecondaryDataset(
    identifier='profile',
    catalog_id='5ec4aec1f072bc028e3471ae',
    catalog_version_id='5ec4aec2f072bc028e3471b1',
)

Secondary dataset configurations

class datarobot.models.SecondaryDatasetConfigurations

Create secondary dataset configurations for a given project

Added in version v2.20.

Variables:
  • id (str) – Id of this secondary dataset configuration

  • project_id (str) – Id of the associated project.

  • config (list of DatasetConfiguration (Deprecated in version v2.23)) – List of secondary dataset configurations

  • secondary_datasets (list of SecondaryDataset (new in v2.23)) – List of secondary datasets (secondaryDataset)

  • name (str) – Verbose name of the SecondaryDatasetConfig. null if it wasn’t specified.

  • created (datetime.datetime) – DR-formatted datetime. null for legacy (before DR 6.0) db records.

  • creator_user_id (str) – Id of the user who created this config.

  • creator_full_name (str) – Full name or email of the user who created this config.

  • featurelist_id (Optional[str]) – Id of the feature list. null if it wasn’t specified.

  • credential_ids (Optional[list of DatasetsCredentials]) – Credentials used by the secondary datasets if the datasets used in the configuration come from a data source.

  • is_default (Optional[bool]) – Whether this is the default config created during the feature discovery aim.

  • project_version (Optional[str]) – Version of the project when it was created (release version).

classmethod create(project_id, secondary_datasets, name, featurelist_id=None)

Create secondary dataset configurations.

Added in version v2.20.

Parameters:
  • project_id (str) – id of the associated project.

  • secondary_datasets (list of SecondaryDataset (New in version v2.23)) – List of secondary datasets used by the configuration; each element is a datarobot.helpers.feature_discovery.SecondaryDataset

  • name (str (New in version v2.23)) – Name of the secondary datasets configuration

  • featurelist_id (str, or None (New in version v2.23)) – Id of the featurelist

Return type:

SecondaryDatasetConfigurations

Raises:

ClientError – raised if incorrect configuration parameters are provided

Examples

   profile_secondary_dataset = dr.SecondaryDataset(
       identifier='profile',
       catalog_id='5ec4aec1f072bc028e3471ae',
       catalog_version_id='5ec4aec2f072bc028e3471b1',
       snapshot_policy='latest'
   )

   transaction_secondary_dataset = dr.SecondaryDataset(
       identifier='transaction',
       catalog_id='5ec4aec268f0f30289a03901',
       catalog_version_id='5ec4aec268f0f30289a03900',
       snapshot_policy='latest'
   )

   secondary_datasets = [profile_secondary_dataset, transaction_secondary_dataset]
   new_secondary_dataset_config = dr.SecondaryDatasetConfigurations.create(
       project_id=project.id,
       name='My config',
       secondary_datasets=secondary_datasets
   )

>>> new_secondary_dataset_config.id
'5fd1e86c589238a4e635e93d'

delete()

Removes the Secondary datasets configuration.

Added in version v2.21.

Return type:

None

Raises:

ClientError – Raised if an invalid or already deleted secondary dataset config id is provided

Examples

# Delete a secondary dataset configuration with a valid ID
config_id = '5fd1e86c589238a4e635e93d'
dr.SecondaryDatasetConfigurations(id=config_id).delete()

get()

Retrieve a single secondary dataset configuration for a given id

Added in version v2.21.

Returns:

secondary_dataset_configurations – The requested secondary dataset configurations

Return type:

SecondaryDatasetConfigurations

Examples

config_id = '5fd1e86c589238a4e635e93d'
secondary_dataset_config = dr.SecondaryDatasetConfigurations(id=config_id).get()
>>> secondary_dataset_config
{
     'created': datetime.datetime(2020, 12, 9, 6, 16, 22, tzinfo=tzutc()),
     'creator_full_name': u'[email protected]',
     'creator_user_id': u'asdf4af1gf4bdsd2fba1de0a',
     'credential_ids': None,
     'featurelist_id': None,
     'id': u'5fd1e86c589238a4e635e93d',
     'is_default': True,
     'name': u'My config',
     'project_id': u'5fd06afce2456ec1e9d20457',
     'project_version': None,
     'secondary_datasets': [
            {
                'snapshot_policy': u'latest',
                'identifier': u'profile',
                'catalog_version_id': u'5fd06b4af24c641b68e4d88f',
                'catalog_id': u'5fd06b4af24c641b68e4d88e'
            },
            {
                'snapshot_policy': u'dynamic',
                'identifier': u'transaction',
                'catalog_version_id': u'5fd1e86c589238a4e635e98e',
                'catalog_id': u'5fd1e86c589238a4e635e98d'
            }
     ]
}

classmethod list(project_id, featurelist_id=None, limit=None, offset=None)

Returns a list of secondary dataset configurations.

Added in version v2.23.

Parameters:
  • project_id (str) – The ID of the project

  • featurelist_id (Optional[str]) – ID of the featurelist by which to filter the secondary dataset configurations

  • limit (Optional[int]) – The maximum number of configurations to return. If unset, uses the server default.

  • offset (Optional[int]) – If set, this many results will be skipped.

Returns:

secondary_dataset_configurations – The requested list of secondary dataset configurations for a given project

Return type:

list of SecondaryDatasetConfigurations

Examples

pid = '5fd06afce2456ec1e9d20457'
secondary_dataset_configs = dr.SecondaryDatasetConfigurations.list(pid)
>>> secondary_dataset_configs[0]
    {
         'created': datetime.datetime(2020, 12, 9, 6, 16, 22, tzinfo=tzutc()),
         'creator_full_name': u'[email protected]',
         'creator_user_id': u'asdf4af1gf4bdsd2fba1de0a',
         'credential_ids': None,
         'featurelist_id': None,
         'id': u'5fd1e86c589238a4e635e93d',
         'is_default': True,
         'name': u'My config',
         'project_id': u'5fd06afce2456ec1e9d20457',
         'project_version': None,
         'secondary_datasets': [
                {
                    'snapshot_policy': u'latest',
                    'identifier': u'profile',
                    'catalog_version_id': u'5fd06b4af24c641b68e4d88f',
                    'catalog_id': u'5fd06b4af24c641b68e4d88e'
                },
                {
                    'snapshot_policy': u'dynamic',
                    'identifier': u'transaction',
                    'catalog_version_id': u'5fd1e86c589238a4e635e98e',
                    'catalog_id': u'5fd1e86c589238a4e635e98d'
                }
         ]
    }

Data engine query generator

class datarobot.DataEngineQueryGenerator

DataEngineQueryGenerator is used to set up time series data prep.

Added in version v2.27.

Variables:
  • id (str) – id of the query generator

  • query (str) – text of the generated Spark SQL query

  • datasets (list(QueryGeneratorDataset)) – datasets associated with the query generator

  • generator_settings (QueryGeneratorSettings) – the settings used to define the query

  • generator_type (str) – “TimeSeries” is the only supported type

classmethod create(generator_type, datasets, generator_settings)

Creates a query generator entity.

Added in version v2.27.

Parameters:
  • generator_type (str) – Type of data engine query generator

  • datasets (List[QueryGeneratorDataset]) – Source datasets in the Data Engine workspace.

  • generator_settings (dict) – Data engine generator settings of the given generator_type.

Returns:

query_generator – The created generator

Return type:

DataEngineQueryGenerator

Examples

import datarobot as dr
from datarobot.models.data_engine_query_generator import (
   QueryGeneratorDataset,
   QueryGeneratorSettings,
)
dataset = QueryGeneratorDataset(
   alias='My_Awesome_Dataset_csv',
   dataset_id='61093144cabd630828bca321',
   dataset_version_id=1,
)
settings = QueryGeneratorSettings(
   datetime_partition_column='date',
   time_unit='DAY',
   time_step=1,
   default_numeric_aggregation_method='sum',
   default_categorical_aggregation_method='mostFrequent',
)
g = dr.DataEngineQueryGenerator.create(
   generator_type='TimeSeries',
   datasets=[dataset],
   generator_settings=settings,
)
>>> g.id
'54e639a18bd88f08078ca831'
>>> g.generator_type
'TimeSeries'

classmethod get(generator_id)

Gets information about a query generator.

Parameters:

generator_id (str) – The identifier of the query generator you want to load.

Returns:

query_generator – The queried generator

Return type:

DataEngineQueryGenerator

Examples

import datarobot as dr
g = dr.DataEngineQueryGenerator.get(generator_id='54e639a18bd88f08078ca831')
>>> g.id
'54e639a18bd88f08078ca831'
>>> g.generator_type
'TimeSeries'

create_dataset(dataset_id=None, dataset_version_id=None, max_wait=600)

A blocking call that creates a new Dataset from the query generator. Returns when the dataset has been successfully processed. If the optional parameters are not specified, the query is applied to the dataset_id and dataset_version_id stored in the query generator. If specified, they override the stored dataset_id/dataset_version_id, e.g. to prep a prediction dataset.

Parameters:
  • dataset_id (Optional[str]) – The id of the unprepped dataset to apply the query to

  • dataset_version_id (Optional[str]) – The version_id of the unprepped dataset to apply the query to

  • max_wait (Optional[int]) – The maximum number of seconds to wait before giving up

Returns:

response – The Dataset created from the query generator

Return type:

Dataset

prepare_prediction_dataset_from_catalog(project_id, dataset_id, dataset_version_id=None, max_wait=600, relax_known_in_advance_features_check=None)

Apply time series data prep to a catalog dataset and upload it to the project as a PredictionDataset.

Added in version v3.1.

Parameters:
  • project_id (str) – The id of the project to which you upload the prediction dataset.

  • dataset_id (str) – The identifier of the dataset.

  • dataset_version_id (Optional[str]) – The version id of the dataset to use.

  • max_wait (Optional[int]) – The maximum number of seconds to wait before giving up.

  • relax_known_in_advance_features_check (Optional[bool]) – For time series projects only. If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.

Returns:

dataset – The newly uploaded dataset.

Return type:

PredictionDataset

prepare_prediction_dataset(sourcedata, project_id, max_wait=600, relax_known_in_advance_features_check=None)

Apply time series data prep and upload the PredictionDataset to the project.

Added in version v3.1.

Parameters:
  • sourcedata (str, file or pandas.DataFrame) – Data to be used for predictions. If it is a string, it can be either a path to a local file, or raw file content. If using a file on disk, the filename must consist of ASCII characters only.

  • project_id (str) – The id of the project to which you upload the prediction dataset.

  • max_wait (Optional[int]) – The maximum number of seconds to wait for the uploaded dataset to be processed before raising an error.

  • relax_known_in_advance_features_check (Optional[bool]) – For time series projects only. If True, missing values in the known in advance features are allowed in the forecast window at the prediction time. If omitted or False, missing values are not allowed.

Returns:

dataset – The newly uploaded dataset.

Return type:

PredictionDataset

Raises:
  • InputNotUnderstoodError – Raised if sourcedata isn’t one of supported types.

  • AsyncFailureError – Raised if polling for the status of an async process resulted in a response with an unsupported status code.

  • AsyncProcessUnsuccessfulError – Raised if project creation was unsuccessful (i.e. the server reported an error in uploading the dataset).

  • AsyncTimeoutError – Raised if processing the uploaded dataset took more time than specified by the max_wait parameter.
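
A sketch of the time series data prep flow for predictions (the IDs and path are hypothetical):

import datarobot as dr

g = dr.DataEngineQueryGenerator.get(generator_id='54e639a18bd88f08078ca831')
prediction_dataset = g.prepare_prediction_dataset(
    sourcedata='./data/future_rows.csv',
    project_id='5fd06afce2456ec1e9d20457',
)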

Sharing access

class datarobot.SharingAccess

Represents metadata about whom an entity (e.g. a data store) has been shared with

Added in version v2.14.

Currently DataStores, DataSources, Datasets, Projects (new in version v2.15), and CalendarFiles (new in version v2.15) can be shared.

This class can represent either access that has already been granted, or be used to grant access to additional users.

Variables:
  • username (str) – a particular user

  • role (str or None) – if a string, represents a particular level of access and should be one of datarobot.enums.SHARING_ROLE. For more information on the specific access levels, see the sharing documentation. If None, can be passed to a share function to revoke access for a specific user.

  • can_share (bool or None) – if a bool, indicates whether this user is permitted to share further. When False, the user has access to the entity but can only revoke their own access; they cannot modify any user’s access role. When True, the user can share with any other user at an access role up to their own. May be None if the SharingAccess was not retrieved from the DataRobot server but is intended to be passed into a share function; this is equivalent to passing True.

  • can_use_data (bool or None) – if a bool, indicates whether this user should be able to view, download, and process data (use it to create projects, predictions, etc.). For OWNER, can_use_data is always True. If role is empty, can_use_data is ignored.

  • user_id (str or None) – the id of the user

Sharing role

class datarobot.models.sharing.SharingRole

Represents metadata about a user who has been granted access to an entity. At least one of id or username must be set.

Variables:
  • id (str or None) – The ID of the user.

  • role (str) – Represents a particular level of access. Should be one of datarobot.enums.SHARING_ROLE.

  • share_recipient_type (SHARING_RECIPIENT_TYPE) – The type of user for the object of the method. Can be user or organization.

  • user_full_name (str or None) – The full name of the user.

  • username (str or None) – The username (usually the email) of the user.

  • can_share (bool or None) – Indicates whether this user is permitted to share with other users. When False, the user has access to the entity, but can only revoke their own access. They cannot modify any user’s access role. When True, the user can share with any other user at an access role up to their own.