Datasets

class datarobot.models.Dataset(dataset_id, version_id, name, categories, created_at, is_data_engine_eligible, is_latest_version, is_snapshot, processing_state, created_by=None, data_persisted=None, size=None, row_count=None, recipe_id=None, sample_size=None)

Represents a Dataset returned from the api/v2/datasets/ endpoints.

Attributes:
id: string

The ID of this dataset

name: string

The name of this dataset in the catalog

is_latest_version: bool

Whether this dataset version is the latest version of this dataset

version_id: string

The object ID of the catalog_version the dataset belongs to

categories: list(string)

An array of strings describing the intended use of the dataset. The supported options are “TRAINING” and “PREDICTION”.

created_at: string

The date when the dataset was created

created_by: string, optional

Username of the user who created the dataset

is_snapshot: bool

Whether the dataset version is an immutable snapshot of data which has previously been retrieved and saved to DataRobot.

data_persisted: bool, optional

If true, the user is allowed to view the extended data profile (which includes data statistics like min/max/median/mean, histogram, etc.) and download data. If false, download is not allowed and only the data schema (feature names and types) will be available.

is_data_engine_eligible: bool

Whether this dataset can be a data source of a data engine query.

processing_state: string

Current ingestion process state of the dataset

row_count: int, optional

The number of rows in the dataset.

size: int, optional

The size of the dataset as a CSV in bytes.

sample_size: dict, optional

The size of data fetched during dataset registration. For example, to fetch the first 95 rows, the sample_size value is {'type': 'rows', 'value': 95}. Currently only the 'rows' type is supported.

get_uri()
Returns:
url: str

Permanent static hyperlink to this dataset in AI Catalog.

Return type:

str

classmethod upload(source)

This method covers Dataset creation from local materials (file & DataFrame) and a URL.

Parameters:
source: str, pd.DataFrame or file object

Pass a URL, filepath, file or DataFrame to create and return a Dataset.

Returns:
response: Dataset

The Dataset created from the uploaded data source.

Raises:
InvalidUsageError

If the source parameter cannot be determined to be a URL, filepath, file or DataFrame.

Return type:

TypeVar(TDataset, bound=Dataset)

Examples

# Upload a local file
dataset_one = Dataset.upload("./data/examples.csv")

# Create a dataset via URL
dataset_two = Dataset.upload(
    "https://raw.githubusercontent.com/curran/data/gh-pages/dbpedia/cities/data.csv"
)

# Create dataset with a pandas Dataframe
dataset_three = Dataset.upload(my_df)

# Create a dataset from an open file object
with open("./data/examples.csv", "rb") as file_pointer:
    dataset_four = Dataset.create_from_file(filelike=file_pointer)

classmethod create_from_file(cls, file_path=None, filelike=None, categories=None, read_timeout=600, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from a file. Returns when the dataset has been successfully uploaded and processed.

Warning: This function does not clean up its open files. If you pass a filelike, you are responsible for closing it. If you pass a file_path, this will create a file object from the file_path but will not close it.

Parameters:
file_path: string, optional

The path to the file. This will create a file object pointing to that file but will not close it.

filelike: file, optional

An open and readable file object.

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

read_timeout: int, optional

The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete

max_wait: int, optional

Time in seconds after which dataset creation is considered unsuccessful

use_cases: list[UseCase] | UseCase | list[string] | string, optional

A list of UseCase objects, a UseCase object, a list of Use Case IDs or a single Use Case ID to add this new Dataset to. Must be a kwarg.

Returns:
response: Dataset

A fully armed and operational Dataset

Return type:

TypeVar(TDataset, bound=Dataset)
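
Examples

A minimal sketch registering a local file and marking it for training; the path is illustrative.

dataset = Dataset.create_from_file(
    file_path="./data/examples.csv",
    categories=["TRAINING"],
)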

classmethod create_from_in_memory_data(cls, data_frame=None, records=None, categories=None, read_timeout=600, max_wait=600, fname=None, *, use_cases=None)

A blocking call that creates a new Dataset from in-memory data. Returns when the dataset has been successfully uploaded and processed.

The data can be either a pandas DataFrame or a list of dictionaries with identical keys.

Parameters:
data_frame: DataFrame, optional

The data frame to upload

records: list[dict], optional

A list of dictionaries with identical keys to upload

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

read_timeout: int, optional

The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete

max_wait: int, optional

Time in seconds after which dataset creation is considered unsuccessful

fname: string, optional

The file name, “data.csv” by default

use_cases: list[UseCase] | UseCase | list[string] | string, optional

A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.

Returns:
response: Dataset

The Dataset created from the uploaded data.

Raises:
InvalidUsageError

If neither a DataFrame nor a list of records is passed.

Return type:

TypeVar(TDataset, bound=Dataset)
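
Examples

A minimal sketch using a list of records; the field names are illustrative.

# Each record must have identical keys
records = [
    {"height": 1.8, "weight": 80},
    {"height": 1.6, "weight": 62},
]
dataset = Dataset.create_from_in_memory_data(records=records, fname="people.csv")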

classmethod create_from_url(cls, url, do_snapshot=None, persist_data_after_ingestion=None, categories=None, sample_size=None, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from data stored at a url. Returns when the dataset has been successfully uploaded and processed.

Parameters:
url: string

The URL to use as the source of data for the dataset being created.

do_snapshot: bool, optional

If unset, uses the server default: True. If true, creates a snapshot dataset; if false, creates a remote dataset. Creating snapshots from non-file sources may be disabled by the Disable AI Catalog Snapshots permission.

persist_data_after_ingestion: bool, optional

If unset, uses the server default: True. If true, will enforce saving all data (for download and sampling) and will allow a user to view the extended data profile (which includes data statistics like min/max/median/mean, histogram, etc.). If false, will not enforce saving data. The data schema (feature names and types) will still be available. Setting this parameter to false while do_snapshot is true will result in an error.

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

sample_size: dict, optional

The size of data fetched during dataset registration. For example, to fetch the first 95 rows, the sample_size value would be: {'type': 'rows', 'value': 95}. Currently only the 'rows' type is supported.

max_wait: int, optional

Time in seconds after which dataset creation is considered unsuccessful.

use_cases: list[UseCase] | UseCase | list[string] | string, optional

A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.

Returns:
response: Dataset

The Dataset created from the uploaded data

Return type:

TypeVar(TDataset, bound=Dataset)
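
Examples

A minimal sketch creating a remote (non-snapshot) dataset; the URL is illustrative.

dataset = Dataset.create_from_url(
    "https://example.com/data/my_dataset.csv",
    do_snapshot=False,
)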

classmethod create_from_datastage(cls, datastage_id, categories=None, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from data stored as a DataStage. Returns when the dataset has been successfully uploaded and processed.

Parameters:
datastage_id: string

The ID of the DataStage to use as the source of data for the dataset being created.

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

max_wait: int, optional

Time in seconds after which dataset creation is considered unsuccessful.

Returns:
response: Dataset

The Dataset created from the uploaded data

Return type:

TypeVar(TDataset, bound=Dataset)

classmethod create_from_data_source(cls, data_source_id, username=None, password=None, do_snapshot=None, persist_data_after_ingestion=None, categories=None, credential_id=None, use_kerberos=None, credential_data=None, sample_size=None, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from data stored at a DataSource. Returns when the dataset has been successfully uploaded and processed.

Added in version v2.22.

Parameters:
data_source_id: string

The ID of the DataSource to use as the source of data.

username: string, optional

The username for database authentication.

password: string, optional

The password (in cleartext) for database authentication. The password will be encrypted on the server side within the scope of the HTTP request and never saved or stored.

do_snapshot: bool, optional

If unset, uses the server default: True. If true, creates a snapshot dataset; if false, creates a remote dataset. Creating snapshots from non-file sources may be disabled by the Disable AI Catalog Snapshots permission.

persist_data_after_ingestion: bool, optional

If unset, uses the server default: True. If true, will enforce saving all data (for download and sampling) and will allow a user to view the extended data profile (which includes data statistics like min/max/median/mean, histogram, etc.). If false, will not enforce saving data. The data schema (feature names and types) will still be available. Setting this parameter to false while do_snapshot is true will result in an error.

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

credential_id: string, optional

The ID of the set of credentials to use instead of username and password. If specified, username and password become optional.

use_kerberos: bool, optional

If unset, uses the server default: False. If true, use Kerberos authentication for database authentication.

credential_data: dict, optional

The credentials to authenticate with the database, to use instead of user/password or credential ID.

sample_size: dict, optional

The size of data fetched during dataset registration. For example, to fetch the first 95 rows, the sample_size value would be: {'type': 'rows', 'value': 95}. Currently only the 'rows' type is supported.

max_wait: int, optional

Time in seconds after which dataset creation is considered unsuccessful.

use_cases: list[UseCase] | UseCase | list[string] | string, optional

A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.

Returns:
response: Dataset

The Dataset created from the uploaded data.

Return type:

TypeVar(TDataset, bound=Dataset)
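
Examples

A minimal sketch using stored credentials; the IDs are illustrative.

dataset = Dataset.create_from_data_source(
    data_source_id="5e30cc74b33f2a0a26306438",
    credential_id="5e30cc74b33f2a0a26306439",
)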

classmethod create_from_query_generator(cls, generator_id, dataset_id=None, dataset_version_id=None, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from the query generator. Returns when the dataset has been successfully processed. If the optional parameters are not specified, the query is applied to the dataset_id and dataset_version_id stored in the query generator. If specified, they override the stored dataset_id/dataset_version_id, e.g. to prep a prediction dataset.

Parameters:
generator_id: str

The id of the query generator to use.

dataset_id: str, optional

The id of the dataset to apply the query to.

dataset_version_id: str, optional

The id of the dataset version to apply the query to. If not specified the latest version associated with dataset_id (if specified) is used.

max_wait: int, optional

The maximum number of seconds to wait before giving up.

use_cases: list[UseCase] | UseCase | list[string] | string, optional

A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.

Returns:
response: Dataset

The Dataset created from the query generator

Return type:

TypeVar(TDataset, bound=Dataset)
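
Examples

A minimal sketch; the generator ID is illustrative.

dataset = Dataset.create_from_query_generator(generator_id="5e30cc74b33f2a0a2630643a")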

classmethod create_from_recipe(cls, recipe, name=None, do_snapshot=None, persist_data_after_ingestion=None, categories=None, credential=None, use_kerberos=None, materialization_destination=None, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from the recipe. Returns when the dataset has been successfully uploaded and processed.

Added in version 3.6.

Returns:
response: Dataset

The Dataset created from the uploaded data.

Return type:

TypeVar(TDataset, bound=Dataset)

classmethod get(dataset_id)

Get information about a dataset.

Parameters:
dataset_id: string

The ID of the dataset.

Returns:
dataset: Dataset

The queried dataset.

Return type:

TypeVar(TDataset, bound=Dataset)

classmethod delete(dataset_id)

Soft-deletes a dataset. You cannot get, list, or act on a deleted dataset, except to un-delete it.

Parameters:
dataset_id: string

The ID of the dataset to mark for deletion.

Returns:
None
Return type:

None

classmethod un_delete(dataset_id)

Un-deletes a previously deleted dataset. If the dataset was not deleted, nothing happens.

Parameters:
dataset_id: string

The ID of the dataset to un-delete.

Returns:
None
Return type:

None
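
Examples

A minimal sketch of the delete/un-delete round trip; the ID is illustrative.

Dataset.delete("5e30cc74b33f2a0a26306438")
# The dataset can no longer be retrieved or listed
Dataset.un_delete("5e30cc74b33f2a0a26306438")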

classmethod list(category=None, filter_failed=None, order_by=None, use_cases=None)

List all datasets a user can view.

Parameters:
category: string, optional

If specified, only dataset versions that have the specified category will be included in the results. Categories identify the intended use of the dataset; supported categories are “TRAINING” and “PREDICTION”.

filter_failed: bool, optional

If unset, uses the server default: False. Whether datasets that failed during import should be excluded from the results. If True, invalid datasets are excluded.

order_by: string, optional

If unset, uses the server default: “-created”. The sorting order applied to the catalog list. Valid options are “created” (ascending order by creation datetime) and “-created” (descending order by creation datetime).

use_cases: Union[UseCase, List[UseCase], str, List[str]], optional

Filter available datasets by a specific Use Case or Cases. Accepts either the entity or the ID. If set to [None], returns only datasets that are not linked to any Use Case.

Returns:
list[Dataset]

a list of datasets the user can view

Return type:

List[TypeVar(TDataset, bound=Dataset)]
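
Examples

A minimal sketch listing datasets marked for training.

training_datasets = Dataset.list(category="TRAINING")
for dataset in training_datasets:
    print(dataset.name)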

classmethod iterate(offset=None, limit=None, category=None, order_by=None, filter_failed=None, use_cases=None)

Get an iterator for the requested datasets a user can view. This lazily retrieves results. It does not get the next page from the server until the current page is exhausted.

Parameters:
offset: int, optional

If set, this many results will be skipped

limit: int, optional

Specifies the size of each page retrieved from the server. If unset, uses the server default.

category: string, optional

If specified, only dataset versions that have the specified category will be included in the results. Categories identify the intended use of the dataset; supported categories are “TRAINING” and “PREDICTION”.

filter_failed: bool, optional

If unset, uses the server default: False. Whether datasets that failed during import should be excluded from the results. If True, invalid datasets are excluded.

order_by: string, optional

If unset, uses the server default: “-created”. The sorting order applied to the catalog list. Valid options are “created” (ascending order by creation datetime) and “-created” (descending order by creation datetime).

use_cases: Union[UseCase, List[UseCase], str, List[str]], optional

Filter available datasets by a specific Use Case or Cases. Accepts either the entity or the ID. If set to [None], returns only datasets that are not linked to any Use Case.

Yields:
Dataset

An iterator of the datasets the user can view.

Return type:

Generator[TypeVar(TDataset, bound=Dataset), None, None]
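
Examples

A minimal sketch; pages of 100 results are fetched lazily as the iterator is consumed.

for dataset in Dataset.iterate(limit=100):
    print(dataset.name)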

update()

Updates the Dataset attributes in place with the latest information from the server.

Returns:
None
Return type:

None

modify(name=None, categories=None)

Modifies the Dataset name and/or categories. Updates the object in place.

Parameters:
name: string, optional

The new name of the dataset

categories: list[string], optional

A list of strings describing the intended use of the dataset. The supported options are “TRAINING” and “PREDICTION”. If any categories were previously specified for the dataset, they will be overwritten. If omitted or None, the previous categories are kept. To clear them, specify [].

Returns:
None
Return type:

None
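
Examples

A minimal sketch renaming a dataset and clearing its categories; the ID is illustrative.

dataset = Dataset.get("5e30cc74b33f2a0a26306438")
dataset.modify(name="Renamed Dataset", categories=[])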

share(access_list, apply_grant_to_linked_objects=False)

Modify the ability of users to access this dataset

Parameters:
access_list: list of SharingAccess

The modifications to make.

apply_grant_to_linked_objects: bool

If true for any users being granted access to the dataset, grant the user read access to any linked objects such as DataSources and DataStores that may be used by this dataset. Ignored if no such objects are relevant for the dataset. Defaults to False.

Raises:
datarobot.ClientError:

If you do not have permission to share this dataset, if the user you’re sharing with doesn’t exist, if the same user appears multiple times in the access_list, or if these changes would leave the dataset without an owner.

Return type:

None

Examples

Transfer access to the dataset from old_user@datarobot.com to new_user@datarobot.com

from datarobot.enums import SHARING_ROLE
from datarobot.models.dataset import Dataset
from datarobot.models.sharing import SharingAccess

new_access = SharingAccess(
    "[email protected]",
    SHARING_ROLE.OWNER,
    can_share=True,
)
access_list = [
    SharingAccess(
        "[email protected]",
        SHARING_ROLE.OWNER,
        can_share=True,
        can_use_data=True,
    ),
    new_access,
]

Dataset.get('my-dataset-id').share(access_list)

get_details()

Gets the details for this Dataset

Returns:
DatasetDetails
Return type:

DatasetDetails

get_all_features(order_by=None)

Get a list of all the features for this dataset.

Parameters:
order_by: string, optional

If unset, uses the server default: ‘name’. How the features should be ordered. Can be ‘name’ or ‘featureType’.

Returns:
list[DatasetFeature]
Return type:

List[DatasetFeature]

iterate_all_features(offset=None, limit=None, order_by=None)

Get an iterator for the requested features of a dataset. This lazily retrieves results. It does not get the next page from the server until the current page is exhausted.

Parameters:
offset: int, optional

If set, this many results will be skipped.

limit: int, optional

Specifies the size of each page retrieved from the server. If unset, uses the server default.

order_by: string, optional

If unset, uses the server default: ‘name’. How the features should be ordered. Can be ‘name’ or ‘featureType’.

Yields:
DatasetFeature
Return type:

Generator[DatasetFeature, None, None]

get_featurelists()

Get DatasetFeaturelists created on this Dataset

Returns:
feature_lists: list[DatasetFeaturelist]
Return type:

List[DatasetFeaturelist]

create_featurelist(name, features)

Create a new dataset featurelist

Parameters:
name: str

The name of the dataset featurelist to create. Names must be unique within the dataset, or the server will return an error.

features: list of str

the names of the features to include in the dataset featurelist. Each feature must be a dataset feature.

Returns:
featurelist: DatasetFeaturelist

The newly created featurelist.

Return type:

DatasetFeaturelist

Examples

dataset = Dataset.get('1234deadbeeffeeddead4321')
dataset_features = dataset.get_all_features()
selected_features = [feat.name for feat in dataset_features][:5]  # select first five
new_flist = dataset.create_featurelist('Simple Features', selected_features)

get_file(file_path=None, filelike=None)

Retrieves all the originally uploaded data in CSV form. Writes it to either the file or a filelike object that can write bytes.

Only one of file_path or filelike can be provided, and it must be provided as a keyword argument (i.e. file_path='path-to-write-to'). If a file-like object is provided, the user is responsible for closing it when they are done.

The user must also have permission to download data.

Parameters:
file_path: string, optional

The destination to write the file to.

filelike: file, optional

A file-like object to write to. The object must be able to write bytes. The user is responsible for closing the object

Returns:
None
Return type:

None
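
Examples

A minimal sketch writing the original CSV to a local path; the ID and path are illustrative.

dataset = Dataset.get("5e30cc74b33f2a0a26306438")
dataset.get_file(file_path="./examples.csv")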

get_as_dataframe(low_memory=False)

Retrieves all the originally uploaded data in a pandas DataFrame.

Added in version v3.0.

Parameters:
low_memory: bool, optional

If True, use local files to reduce memory usage, which will be slower.

Returns:
pd.DataFrame

Return type:

DataFrame
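
Examples

A minimal sketch loading the dataset into pandas; the ID is illustrative.

dataset = Dataset.get("5e30cc74b33f2a0a26306438")
df = dataset.get_as_dataframe()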

get_projects()

Retrieves the Dataset’s projects as ProjectLocation named tuples.

Returns:
locations: list[ProjectLocation]
Return type:

List[ProjectLocation]

create_project(project_name=None, user=None, password=None, credential_id=None, use_kerberos=None, credential_data=None, *, use_cases=None)

Create a datarobot.models.Project from this dataset

Parameters:
project_name: string, optional

The name of the project to be created. If not specified, the name will be “Untitled Project” for database connections; otherwise the project name will be based on the file used.

user: string, optional

The username for database authentication.

password: string, optional

The password (in cleartext) for database authentication. The password will be encrypted on the server side within the scope of the HTTP request and never saved or stored.

credential_id: string, optional

The ID of the set of credentials to use instead of user and password.

use_kerberos: bool, optional

If unset, uses the server default: False. If true, use Kerberos authentication for database authentication.

credential_data: dict, optional

The credentials to authenticate with the database, to use instead of user/password or credential ID.

use_cases: list[UseCase] | UseCase | list[string] | string, optional

A list of UseCase objects, a UseCase object, a list of Use Case IDs or a single Use Case ID to add this new Dataset to. Must be a kwarg.

Returns:
Project
Return type:

Project
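
Examples

A minimal sketch creating a project from a dataset; the ID and name are illustrative.

dataset = Dataset.get("5e30cc74b33f2a0a26306438")
project = dataset.create_project(project_name="New Project From Dataset")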

classmethod create_version_from_file(dataset_id, file_path=None, filelike=None, categories=None, read_timeout=600, max_wait=600)

A blocking call that creates a new Dataset version from a file. Returns when the new dataset version has been successfully uploaded and processed.

Warning: This function does not clean up its open files. If you pass a filelike, you are responsible for closing it. If you pass a file_path, this will create a file object from the file_path but will not close it.

Added in version v2.23.

Parameters:
dataset_id: string

The ID of the dataset for which the new version will be created.

file_path: string, optional

The path to the file. This will create a file object pointing to that file but will not close it.

filelike: file, optional

An open and readable file object.

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

read_timeout: int, optional

The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete

max_wait: int, optional

Time in seconds after which dataset version creation is considered unsuccessful.

Returns:
response: Dataset

A fully armed and operational Dataset version.

Return type:

TypeVar(TDataset, bound=Dataset)
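
Examples

A minimal sketch uploading a new file as a new version of an existing dataset; the ID and path are illustrative.

dataset = Dataset.create_version_from_file(
    dataset_id="5e30cc74b33f2a0a26306438",
    file_path="./data/examples_v2.csv",
)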

classmethod create_version_from_in_memory_data(dataset_id, data_frame=None, records=None, categories=None, read_timeout=600, max_wait=600)

A blocking call that creates a new Dataset version for a dataset from in-memory data. Returns when the dataset has been successfully uploaded and processed.

The data can be either a pandas DataFrame or a list of dictionaries with identical keys.

Added in version v2.23.

Parameters:
dataset_id: string

The ID of the dataset for which the new version will be created.

data_frame: DataFrame, optional

The data frame to upload

records: list[dict], optional

A list of dictionaries with identical keys to upload

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

read_timeout: int, optional

The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete

max_wait: int, optional

Time in seconds after which dataset version creation is considered unsuccessful.

Returns:
response: Dataset

The Dataset version created from the uploaded data

Raises:
InvalidUsageError

If neither a DataFrame nor a list of records is passed.

Return type:

TypeVar(TDataset, bound=Dataset)

classmethod create_version_from_url(dataset_id, url, categories=None, max_wait=600)

A blocking call that creates a new Dataset version from data stored at a URL for a given dataset. Returns when the dataset has been successfully uploaded and processed.

Added in version v2.23.

Parameters:
dataset_id: string

The ID of the dataset for which the new version will be created.

url: string

The URL to use as the source of data for the dataset being created.

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

max_wait: int, optional

Time in seconds after which dataset version creation is considered unsuccessful.

Returns:
response: Dataset

The Dataset version created from the uploaded data.

Return type:

TypeVar(TDataset, bound=Dataset)

classmethod create_version_from_datastage(dataset_id, datastage_id, categories=None, max_wait=600)

A blocking call that creates a new Dataset version from data stored as a DataStage for a given dataset. Returns when the dataset has been successfully uploaded and processed.

Parameters:
dataset_id: string

The ID of the dataset for which the new version will be created.

datastage_id: string

The ID of the DataStage to use as the source of data for the dataset being created.

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

max_wait: int, optional

Time in seconds after which dataset version creation is considered unsuccessful.

Returns:
response: Dataset

The Dataset version created from the uploaded data

Return type:

TypeVar(TDataset, bound=Dataset)

classmethod create_version_from_data_source(dataset_id, data_source_id, username=None, password=None, categories=None, credential_id=None, use_kerberos=None, credential_data=None, max_wait=600)

A blocking call that creates a new Dataset version from data stored at a DataSource. Returns when the dataset has been successfully uploaded and processed.

Added in version v2.23.

Parameters:
dataset_id: string

The ID of the dataset for which the new version will be created.

data_source_id: string

The ID of the DataSource to use as the source of data.

username: string, optional

The username for database authentication.

password: string, optional

The password (in cleartext) for database authentication. The password will be encrypted on the server side within the scope of the HTTP request and never saved or stored.

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

credential_id: string, optional

The ID of the set of credentials to use instead of username and password. If specified, username and password become optional.

use_kerberos: bool, optional

If unset, uses the server default: False. If true, use Kerberos authentication for database authentication.

credential_data: dict, optional

The credentials to authenticate with the database, to use instead of user/password or credential ID.

max_wait: int, optional

Time in seconds after which dataset version creation is considered unsuccessful.

Returns:
response: Dataset

The Dataset version created from the uploaded data.

Return type:

TypeVar(TDataset, bound=Dataset)

classmethod from_data(data)

Instantiate an object of this class using a dict.

Parameters:
data: dict

Correctly snake_cased keys and their values.

Return type:

TypeVar(T, bound=APIObject)

classmethod from_server_data(data, keep_attrs=None)

Instantiate an object of this class using the data directly from the server, meaning that the keys may have the wrong camel casing.

Parameters:
data: dict

The directly translated dict of JSON from the server. No casing fixes have taken place.

keep_attrs: iterable

List, set or tuple of the dotted namespace notations for attributes to keep within the object structure even if their values are None.

Return type:

TypeVar(T, bound=APIObject)

open_in_browser()

Opens the class’ relevant web browser location. If the default browser is not available, the URL is logged.

Note: If text-mode browsers are used, the calling process will block until the user exits the browser.

Return type:

None

class datarobot.DatasetDetails(dataset_id, version_id, categories, created_by, created_at, data_source_type, error, is_latest_version, is_snapshot, is_data_engine_eligible, last_modification_date, last_modifier_full_name, name, uri, processing_state, data_persisted=None, data_engine_query_id=None, data_source_id=None, description=None, eda1_modification_date=None, eda1_modifier_full_name=None, feature_count=None, feature_count_by_type=None, row_count=None, size=None, tags=None, recipe_id=None, is_wrangling_eligible=None, sample_size=None)

Represents a detailed view of a Dataset. The to_dataset method creates a Dataset from this details view.

Attributes:
dataset_id: string

The ID of this dataset

name: string

The name of this dataset in the catalog

is_latest_version: bool

Whether this dataset version is the latest version of this dataset

version_id: string

The object ID of the catalog_version the dataset belongs to

categories: list(string)

An array of strings describing the intended use of the dataset. The supported options are “TRAINING” and “PREDICTION”.

created_at: string

The date when the dataset was created

created_by: string

Username of the user who created the dataset

is_snapshot: bool

Whether the dataset version is an immutable snapshot of data which has previously been retrieved and saved to DataRobot.

data_persisted: bool, optional

If true, the user is allowed to view the extended data profile (which includes data statistics like min/max/median/mean, histogram, etc.) and download data. If false, download is not allowed and only the data schema (feature names and types) will be available.

is_data_engine_eligible: bool

Whether this dataset can be a data source of a data engine query.

processing_state: string

Current ingestion process state of the dataset

row_count: int, optional

The number of rows in the dataset.

size: int, optional

The size of the dataset as a CSV in bytes.

data_engine_query_id: string, optional

ID of the source data engine query

data_source_id: string, optional

ID of the datasource used as the source of the dataset

data_source_type: string

the type of the datasource that was used as the source of the dataset

description: string, optional

the description of the dataset

eda1_modification_date: string, optional

the ISO 8601 formatted date and time when the EDA1 for the dataset was updated

eda1_modifier_full_name: string, optional

the user who was the last to update EDA1 for the dataset

error: string

details of exception raised during ingestion process, if any

feature_count: int, optional

total number of features in the dataset

feature_count_by_type: list[FeatureTypeCount]

number of features in the dataset grouped by feature type

last_modification_date: string

the ISO 8601 formatted date and time when the dataset was last modified

last_modifier_full_name: string

full name of user who was the last to modify the dataset

tags: list[string]

list of tags attached to the item

uri: string

the URI to the data source, for example:
- 'file_name.csv'
- 'jdbc:DATA_SOURCE_GIVEN_NAME/SCHEMA.TABLE_NAME'
- 'jdbc:DATA_SOURCE_GIVEN_NAME/<query>' for query-based data sources
- 'https://s3.amazonaws.com/my_data/my_dataset.csv'

sample_size: dict, optional

The size of data fetched during dataset registration. For example, to fetch the first 95 rows, the sample_size value is {'type': 'rows', 'value': 95}. Currently only the 'rows' type is supported.

classmethod get(dataset_id)

Get details for a Dataset from the server

Parameters:
dataset_id: str

The ID of the Dataset from which to get details.

Returns:
DatasetDetails
Return type:

TypeVar(TDatasetDetails, bound=DatasetDetails)

to_dataset()

Build a Dataset object from the information in this object

Returns:
Dataset
Return type:

Dataset
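
Examples

A minimal sketch fetching details and converting them back to a Dataset; the ID is illustrative.

details = DatasetDetails.get("5e30cc74b33f2a0a26306438")
dataset = details.to_dataset()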

class datarobot.models.dataset.ProjectLocation(url, id)
id

Alias for field number 1

url

Alias for field number 0