Datasets

class datarobot.models.Dataset(dataset_id, version_id, name, categories, created_at, is_data_engine_eligible, is_latest_version, is_snapshot, processing_state, created_by=None, data_persisted=None, size=None, row_count=None, recipe_id=None, sample_size=None)

Represents a Dataset returned from the api/v2/datasets/ endpoints.

Attributes:
id: string

The ID of this dataset

name: string

The name of this dataset in the catalog

is_latest_version: bool

Whether this dataset version is the latest version of this dataset

version_id: string

The object ID of the catalog_version the dataset belongs to

categories: list(string)

An array of strings describing the intended use of the dataset. The supported options are “TRAINING” and “PREDICTION”.

created_at: string

The date when the dataset was created

created_by: string, optional

Username of the user who created the dataset

is_snapshot: bool

Whether the dataset version is an immutable snapshot of data which has previously been retrieved and saved to DataRobot.

data_persisted: bool, optional

If true, the user is allowed to view the extended data profile (which includes data statistics like min/max/median/mean, histogram, etc.) and download data. If false, download is not allowed and only the data schema (feature names and types) will be available.

is_data_engine_eligible: bool

Whether this dataset can be a data source of a data engine query.

processing_state: string

Current ingestion process state of the dataset

row_count: int, optional

The number of rows in the dataset.

size: int, optional

The size of the dataset as a CSV in bytes.

sample_size: dict, optional

The size of data fetched during dataset registration. For example, to fetch the first 95 rows, the sample_size value is {'type': 'rows', 'value': 95}. Currently only the 'rows' type is supported.

get_uri()
Returns:
url: str

Permanent static hyperlink to this dataset in AI Catalog.

Return type:

str

classmethod upload(source)

This method covers Dataset creation from local materials (file & DataFrame) and a URL.

Parameters:
source: str, pd.DataFrame or file object

Pass a URL, filepath, file or DataFrame to create and return a Dataset.

Returns:
response: Dataset

The Dataset created from the uploaded data source.

Raises:
InvalidUsageError

If the source parameter cannot be determined to be a URL, filepath, file or DataFrame.

Return type:

TypeVar(TDataset, bound=Dataset)

Examples

# Upload a local file
dataset_one = Dataset.upload("./data/examples.csv")

# Create a dataset via URL
dataset_two = Dataset.upload(
    "https://raw.githubusercontent.com/curran/data/gh-pages/dbpedia/cities/data.csv"
)

# Create dataset with a pandas Dataframe
dataset_three = Dataset.upload(my_df)

# Create a dataset from an open file object
with open("./data/examples.csv", "rb") as file_pointer:
    dataset_four = Dataset.create_from_file(filelike=file_pointer)

classmethod create_from_file(cls, file_path=None, filelike=None, categories=None, read_timeout=600, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from a file. Returns when the dataset has been successfully uploaded and processed.

Warning: This function does not clean up its open files. If you pass a filelike, you are responsible for closing it. If you pass a file_path, this will create a file object from the file_path but will not close it.

Parameters:
file_path: string, optional

The path to the file. This will create a file object pointing to that file but will not close it.

filelike: file, optional

An open and readable file object.

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

read_timeout: int, optional

The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete

max_wait: int, optional

Time in seconds after which dataset creation is considered unsuccessful

use_cases: list[UseCase] | UseCase | list[string] | string, optional

A list of UseCase objects, a UseCase object, a list of Use Case IDs or a single Use Case ID to add this new Dataset to. Must be a kwarg.

Returns:
response: Dataset

A fully armed and operational Dataset

Return type:

TypeVar(TDataset, bound=Dataset)
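
Examples

A minimal sketch registering a local file and marking it for training; the path is illustrative.

dataset = Dataset.create_from_file(
    file_path="./data/examples.csv",
    categories=["TRAINING"],
)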

classmethod create_from_in_memory_data(cls, data_frame=None, records=None, categories=None, read_timeout=600, max_wait=600, fname=None, *, use_cases=None)

A blocking call that creates a new Dataset from in-memory data. Returns when the dataset has been successfully uploaded and processed.

The data can be either a pandas DataFrame or a list of dictionaries with identical keys.

Parameters:
data_frame: DataFrame, optional

The data frame to upload

records: list[dict], optional

A list of dictionaries with identical keys to upload

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

read_timeout: int, optional

The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete

max_wait: int, optional

Time in seconds after which dataset creation is considered unsuccessful

fname: string, optional

The file name, “data.csv” by default

use_cases: list[UseCase] | UseCase | list[string] | string, optional

A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.

Returns:
response: Dataset

The Dataset created from the uploaded data.

Raises:
InvalidUsageError

If neither a DataFrame nor a list of records is passed.

Return type:

TypeVar(TDataset, bound=Dataset)
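
Examples

A minimal sketch using a list of records; the field names are illustrative.

# Each record must have identical keys
records = [
    {"height": 1.8, "weight": 80},
    {"height": 1.6, "weight": 62},
]
dataset = Dataset.create_from_in_memory_data(records=records, fname="people.csv")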

classmethod create_from_url(cls, url, do_snapshot=None, persist_data_after_ingestion=None, categories=None, sample_size=None, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from data stored at a url. Returns when the dataset has been successfully uploaded and processed.

Parameters:
url: string

The URL to use as the source of data for the dataset being created.

do_snapshot: bool, optional

If unset, uses the server default: True. If true, creates a snapshot dataset; if false, creates a remote dataset. Creating snapshots from non-file sources may be disabled by the Disable AI Catalog Snapshots permission.

persist_data_after_ingestion: bool, optional

If unset, uses the server default: True. If true, will enforce saving all data (for download and sampling) and will allow a user to view the extended data profile (which includes data statistics like min/max/median/mean, histogram, etc.). If false, will not enforce saving data. The data schema (feature names and types) will still be available. Setting this parameter to false while do_snapshot is true will result in an error.

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

sample_size: dict, optional

The size of data fetched during dataset registration. For example, to fetch the first 95 rows, the sample_size value would be: {'type': 'rows', 'value': 95}. Currently only the 'rows' type is supported.

max_wait: int, optional

Time in seconds after which dataset creation is considered unsuccessful.

use_cases: list[UseCase] | UseCase | list[string] | string, optional

A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.

Returns:
response: Dataset

The Dataset created from the uploaded data

Return type:

TypeVar(TDataset, bound=Dataset)
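
Examples

A minimal sketch creating a remote (non-snapshot) dataset; the URL is illustrative.

dataset = Dataset.create_from_url(
    "https://example.com/data/my_dataset.csv",
    do_snapshot=False,
)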

classmethod create_from_datastage(cls, datastage_id, categories=None, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from data stored as a DataStage. Returns when the dataset has been successfully uploaded and processed.

Parameters:
datastage_id: string

The ID of the DataStage to use as the source of data for the dataset being created.

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

max_wait: int, optional

Time in seconds after which dataset creation is considered unsuccessful.

Returns:
response: Dataset

The Dataset created from the uploaded data

Return type:

TypeVar(TDataset, bound=Dataset)

classmethod create_from_data_source(cls, data_source_id, username=None, password=None, do_snapshot=None, persist_data_after_ingestion=None, categories=None, credential_id=None, use_kerberos=None, credential_data=None, sample_size=None, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from data stored at a DataSource. Returns when the dataset has been successfully uploaded and processed.

Added in version v2.22.

Parameters:
data_source_id: string

The ID of the DataSource to use as the source of data.

username: string, optional

The username for database authentication.

password: string, optional

The password (in cleartext) for database authentication. The password will be encrypted on the server side within the scope of the HTTP request and never saved or stored.

do_snapshot: bool, optional

If unset, uses the server default: True. If true, creates a snapshot dataset; if false, creates a remote dataset. Creating snapshots from non-file sources may be disabled by the Disable AI Catalog Snapshots permission.

persist_data_after_ingestion: bool, optional

If unset, uses the server default: True. If true, will enforce saving all data (for download and sampling) and will allow a user to view the extended data profile (which includes data statistics like min/max/median/mean, histogram, etc.). If false, will not enforce saving data. The data schema (feature names and types) will still be available. Setting this parameter to false while do_snapshot is true will result in an error.

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

credential_id: string, optional

The ID of the set of credentials to use instead of username and password. If specified, username and password become optional.

use_kerberos: bool, optional

If unset, uses the server default: False. If true, use Kerberos authentication for database authentication.

credential_data: dict, optional

The credentials to authenticate with the database, to use instead of user/password or credential ID.

sample_size: dict, optional

The size of data fetched during dataset registration. For example, to fetch the first 95 rows, the sample_size value would be: {'type': 'rows', 'value': 95}. Currently only the 'rows' type is supported.

max_wait: int, optional

Time in seconds after which dataset creation is considered unsuccessful.

use_cases: list[UseCase] | UseCase | list[string] | string, optional

A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.

Returns:
response: Dataset

The Dataset created from the uploaded data.

Return type:

TypeVar(TDataset, bound=Dataset)
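
Examples

A minimal sketch using stored credentials; the IDs are illustrative.

dataset = Dataset.create_from_data_source(
    data_source_id="5e30cc74b33f2a0a26306438",
    credential_id="5e30cc74b33f2a0a26306439",
)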

classmethod create_from_query_generator(cls, generator_id, dataset_id=None, dataset_version_id=None, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from the query generator. Returns when the dataset has been successfully processed. If the optional parameters are not specified, the query is applied to the dataset_id and dataset_version_id stored in the query generator. If specified, they override the stored dataset_id/dataset_version_id, e.g. to prep a prediction dataset.

Parameters:
generator_id: str

The id of the query generator to use.

dataset_id: str, optional

The id of the dataset to apply the query to.

dataset_version_id: str, optional

The id of the dataset version to apply the query to. If not specified the latest version associated with dataset_id (if specified) is used.

max_wait: int, optional

The maximum number of seconds to wait before giving up.

use_cases: list[UseCase] | UseCase | list[string] | string, optional

A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.

Returns:
response: Dataset

The Dataset created from the query generator

Return type:

TypeVar(TDataset, bound=Dataset)
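
Examples

A minimal sketch; the generator ID is illustrative.

dataset = Dataset.create_from_query_generator(generator_id="5e30cc74b33f2a0a2630643a")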

classmethod create_from_recipe(cls, recipe, name=None, do_snapshot=None, persist_data_after_ingestion=None, categories=None, credential=None, use_kerberos=None, materialization_destination=None, max_wait=600, *, use_cases=None)

A blocking call that creates a new Dataset from the recipe. Returns when the dataset has been successfully uploaded and processed.

Added in version 3.6.

Returns:
response: Dataset

The Dataset created from the uploaded data.

Return type:

TypeVar(TDataset, bound=Dataset)

classmethod get(dataset_id)

Get information about a dataset.

Parameters:
dataset_id: string

The ID of the dataset.

Returns:
dataset: Dataset

The queried dataset.

Return type:

TypeVar(TDataset, bound=Dataset)

classmethod delete(dataset_id)

Soft-deletes a dataset. You cannot get, list, or act on a deleted dataset, except to un-delete it.

Parameters:
dataset_id: string

The ID of the dataset to mark for deletion.

Returns:
None
Return type:

None

classmethod un_delete(dataset_id)

Un-deletes a previously deleted dataset. If the dataset was not deleted, nothing happens.

Parameters:
dataset_id: string

The ID of the dataset to un-delete.

Returns:
None
Return type:

None
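
Examples

A minimal sketch of the delete/un-delete round trip; the ID is illustrative.

Dataset.delete("5e30cc74b33f2a0a26306438")
# The dataset can no longer be retrieved or listed
Dataset.un_delete("5e30cc74b33f2a0a26306438")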

classmethod list(category=None, filter_failed=None, order_by=None, use_cases=None)

List all datasets a user can view.

Parameters:
category: string, optional

If specified, only dataset versions that have the specified category will be included in the results. Categories identify the intended use of the dataset; supported categories are “TRAINING” and “PREDICTION”.

filter_failed: bool, optional

If unset, uses the server default: False. Whether datasets that failed during import should be excluded from the results. If True, invalid datasets are excluded.

order_by: string, optional

If unset, uses the server default: “-created”. The sorting order applied to the catalog list. Valid options are “created” (ascending order by creation datetime) and “-created” (descending order by creation datetime).

use_cases: Union[UseCase, List[UseCase], str, List[str]], optional

Filter available datasets by a specific Use Case or Cases. Accepts either the entity or the ID. If set to [None], returns only datasets that are not linked to any Use Case.

Returns:
list[Dataset]

a list of datasets the user can view

Return type:

List[TypeVar(TDataset, bound=Dataset)]
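
Examples

A minimal sketch listing datasets marked for training.

training_datasets = Dataset.list(category="TRAINING")
for dataset in training_datasets:
    print(dataset.name)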

classmethod iterate(offset=None, limit=None, category=None, order_by=None, filter_failed=None, use_cases=None)

Get an iterator for the requested datasets a user can view. This lazily retrieves results. It does not get the next page from the server until the current page is exhausted.

Parameters:
offset: int, optional

If set, this many results will be skipped

limit: int, optional

Specifies the size of each page retrieved from the server. If unset, uses the server default.

category: string, optional

If specified, only dataset versions that have the specified category will be included in the results. Categories identify the intended use of the dataset; supported categories are “TRAINING” and “PREDICTION”.

filter_failed: bool, optional

If unset, uses the server default: False. Whether datasets that failed during import should be excluded from the results. If True, invalid datasets are excluded.

order_by: string, optional

If unset, uses the server default: “-created”. The sorting order applied to the catalog list. Valid options are “created” (ascending order by creation datetime) and “-created” (descending order by creation datetime).

use_cases: Union[UseCase, List[UseCase], str, List[str]], optional

Filter available datasets by a specific Use Case or Cases. Accepts either the entity or the ID. If set to [None], returns only datasets that are not linked to any Use Case.

Yields:
Dataset

An iterator of the datasets the user can view.

Return type:

Generator[TypeVar(TDataset, bound=Dataset), None, None]
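
Examples

A minimal sketch; pages of 100 results are fetched lazily as the iterator is consumed.

for dataset in Dataset.iterate(limit=100):
    print(dataset.name)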

update()

Updates the Dataset attributes in place with the latest information from the server.

Returns:
None
Return type:

None

modify(name=None, categories=None)

Modifies the Dataset name and/or categories. Updates the object in place.

Parameters:
name: string, optional

The new name of the dataset

categories: list[string], optional

A list of strings describing the intended use of the dataset. The supported options are “TRAINING” and “PREDICTION”. If any categories were previously specified for the dataset, they will be overwritten. If omitted or None, the previous categories are kept. To clear them, specify [].

Returns:
None
Return type:

None
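
Examples

A minimal sketch renaming a dataset and clearing its categories; the ID is illustrative.

dataset = Dataset.get("5e30cc74b33f2a0a26306438")
dataset.modify(name="Renamed Dataset", categories=[])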

share(access_list, apply_grant_to_linked_objects=False)

Modify the ability of users to access this dataset

Parameters:
access_list: list of SharingAccess

The modifications to make.

apply_grant_to_linked_objects: bool

If true for any users being granted access to the dataset, grant the user read access to any linked objects such as DataSources and DataStores that may be used by this dataset. Ignored if no such objects are relevant for the dataset. Defaults to False.

Raises:
datarobot.ClientError:

If you do not have permission to share this dataset, if the user you’re sharing with doesn’t exist, if the same user appears multiple times in the access_list, or if these changes would leave the dataset without an owner.

Return type:

None

Examples

Transfer access to the dataset from old_user@datarobot.com to new_user@datarobot.com

from datarobot.enums import SHARING_ROLE
from datarobot.models.dataset import Dataset
from datarobot.models.sharing import SharingAccess

new_access = SharingAccess(
    "[email protected]",
    SHARING_ROLE.OWNER,
    can_share=True,
)
access_list = [
    SharingAccess(
        "[email protected]",
        SHARING_ROLE.OWNER,
        can_share=True,
        can_use_data=True,
    ),
    new_access,
]

Dataset.get('my-dataset-id').share(access_list)

get_details()

Gets the details for this Dataset

Returns:
DatasetDetails
Return type:

DatasetDetails

get_all_features(order_by=None)

Get a list of all the features for this dataset.

Parameters:
order_by: string, optional

If unset, uses the server default: ‘name’. How the features should be ordered. Can be ‘name’ or ‘featureType’.

Returns:
list[DatasetFeature]
Return type:

List[DatasetFeature]

iterate_all_features(offset=None, limit=None, order_by=None)

Get an iterator for the requested features of a dataset. This lazily retrieves results. It does not get the next page from the server until the current page is exhausted.

Parameters:
offset: int, optional

If set, this many results will be skipped.

limit: int, optional

Specifies the size of each page retrieved from the server. If unset, uses the server default.

order_by: string, optional

If unset, uses the server default: ‘name’. How the features should be ordered. Can be ‘name’ or ‘featureType’.

Yields:
DatasetFeature
Return type:

Generator[DatasetFeature, None, None]

get_featurelists()

Get DatasetFeaturelists created on this Dataset

Returns:
feature_lists: list[DatasetFeaturelist]
Return type:

List[DatasetFeaturelist]

create_featurelist(name, features)

Create a new dataset featurelist

Parameters:
name: str

The name of the dataset featurelist to create. Names must be unique within the dataset, or the server will return an error.

features: list of str

the names of the features to include in the dataset featurelist. Each feature must be a dataset feature.

Returns:
featurelist: DatasetFeaturelist

The newly created featurelist.

Return type:

DatasetFeaturelist

Examples

dataset = Dataset.get('1234deadbeeffeeddead4321')
dataset_features = dataset.get_all_features()
selected_features = [feat.name for feat in dataset_features][:5]  # select first five
new_flist = dataset.create_featurelist('Simple Features', selected_features)

get_file(file_path=None, filelike=None)

Retrieves all the originally uploaded data in CSV form. Writes it to either the file or a filelike object that can write bytes.

Only one of file_path or filelike can be provided, and it must be provided as a keyword argument (i.e. file_path='path-to-write-to'). If a file-like object is provided, the user is responsible for closing it when they are done.

The user must also have permission to download data.

Parameters:
file_path: string, optional

The destination to write the file to.

filelike: file, optional

A file-like object to write to. The object must be able to write bytes. The user is responsible for closing the object

Returns:
None
Return type:

None
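
Examples

A minimal sketch writing the original CSV to a local path; the ID and path are illustrative.

dataset = Dataset.get("5e30cc74b33f2a0a26306438")
dataset.get_file(file_path="./examples.csv")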

get_as_dataframe(low_memory=False)

Retrieves all the originally uploaded data in a pandas DataFrame.

Added in version v3.0.

Parameters:
low_memory: bool, optional

If True, use local files to reduce memory usage, which will be slower.

Returns:
pd.DataFrame

Return type:

DataFrame
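
Examples

A minimal sketch loading the dataset into pandas; the ID is illustrative.

dataset = Dataset.get("5e30cc74b33f2a0a26306438")
df = dataset.get_as_dataframe()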

get_projects()

Retrieves the Dataset’s projects as ProjectLocation named tuples.

Returns:
locations: list[ProjectLocation]
Return type:

List[ProjectLocation]

create_project(project_name=None, user=None, password=None, credential_id=None, use_kerberos=None, credential_data=None, *, use_cases=None)

Create a datarobot.models.Project from this dataset

Parameters:
project_name: string, optional

The name of the project to be created. If not specified, the name will be “Untitled Project” for database connections; otherwise the project name will be based on the file used.

user: string, optional

The username for database authentication.

password: string, optional

The password (in cleartext) for database authentication. The password will be encrypted on the server side within the scope of the HTTP request and never saved or stored.

credential_id: string, optional

The ID of the set of credentials to use instead of user and password.

use_kerberos: bool, optional

If unset, uses the server default: False. If true, use Kerberos authentication for database authentication.

credential_data: dict, optional

The credentials to authenticate with the database, to use instead of user/password or credential ID.

use_cases: list[UseCase] | UseCase | list[string] | string, optional

A list of UseCase objects, a UseCase object, a list of Use Case IDs or a single Use Case ID to add this new Dataset to. Must be a kwarg.

Returns:
Project
Return type:

Project
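
Examples

A minimal sketch creating a project from a dataset; the ID and name are illustrative.

dataset = Dataset.get("5e30cc74b33f2a0a26306438")
project = dataset.create_project(project_name="New Project From Dataset")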

classmethod create_version_from_file(dataset_id, file_path=None, filelike=None, categories=None, read_timeout=600, max_wait=600)

A blocking call that creates a new Dataset version from a file. Returns when the new dataset version has been successfully uploaded and processed.

Warning: This function does not clean up its open files. If you pass a filelike, you are responsible for closing it. If you pass a file_path, this will create a file object from the file_path but will not close it.

Added in version v2.23.

Parameters:
dataset_id: string

The ID of the dataset for which the new version will be created.

file_path: string, optional

The path to the file. This will create a file object pointing to that file but will not close it.

filelike: file, optional

An open and readable file object.

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

read_timeout: int, optional

The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete

max_wait: int, optional

Time in seconds after which dataset version creation is considered unsuccessful.

Returns:
response: Dataset

A fully armed and operational Dataset version.

Return type:

TypeVar(TDataset, bound=Dataset)
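
Examples

A minimal sketch uploading a new file as a new version of an existing dataset; the ID and path are illustrative.

dataset = Dataset.create_version_from_file(
    dataset_id="5e30cc74b33f2a0a26306438",
    file_path="./data/examples_v2.csv",
)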

classmethod create_version_from_in_memory_data(dataset_id, data_frame=None, records=None, categories=None, read_timeout=600, max_wait=600)

A blocking call that creates a new Dataset version for a dataset from in-memory data. Returns when the dataset has been successfully uploaded and processed.

The data can be either a pandas DataFrame or a list of dictionaries with identical keys.

Added in version v2.23.

Parameters:
dataset_id: string

The ID of the dataset for which the new version will be created.

data_frame: DataFrame, optional

The data frame to upload

records: list[dict], optional

A list of dictionaries with identical keys to upload

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

read_timeout: int, optional

The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete

max_wait: int, optional

Time in seconds after which dataset version creation is considered unsuccessful.

Returns:
response: Dataset

The Dataset version created from the uploaded data

Raises:
InvalidUsageError

If neither a DataFrame nor a list of records is passed.

Return type:

TypeVar(TDataset, bound=Dataset)

classmethod create_version_from_url(dataset_id, url, categories=None, max_wait=600)

A blocking call that creates a new Dataset version from data stored at a URL for a given dataset. Returns when the dataset has been successfully uploaded and processed.

Added in version v2.23.

Parameters:
dataset_id: string

The ID of the dataset for which the new version will be created.

url: string

The URL to use as the source of data for the dataset being created.

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

max_wait: int, optional

Time in seconds after which dataset version creation is considered unsuccessful.

Returns:
response: Dataset

The Dataset version created from the uploaded data.

Return type:

TypeVar(TDataset, bound=Dataset)

classmethod create_version_from_datastage(dataset_id, datastage_id, categories=None, max_wait=600)

A blocking call that creates a new Dataset version from data stored as a DataStage for a given dataset. Returns when the dataset has been successfully uploaded and processed.

Parameters:
dataset_id: string

The ID of the dataset for which the new version will be created.

datastage_id: string

The ID of the DataStage to use as the source of data for the dataset being created.

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

max_wait: int, optional

Time in seconds after which dataset version creation is considered unsuccessful.

Returns:
response: Dataset

The Dataset version created from the uploaded data

Return type:

TypeVar(TDataset, bound=Dataset)

classmethod create_version_from_data_source(dataset_id, data_source_id, username=None, password=None, categories=None, credential_id=None, use_kerberos=None, credential_data=None, max_wait=600)

A blocking call that creates a new Dataset version from data stored at a DataSource. Returns when the dataset has been successfully uploaded and processed.

Added in version v2.23.

Parameters:
dataset_id: string

The ID of the dataset for which the new version will be created.

data_source_id: string

The ID of the DataSource to use as the source of data.

username: string, optional

The username for database authentication.

password: string, optional

The password (in cleartext) for database authentication. The password will be encrypted on the server side within the scope of the HTTP request and never saved or stored.

categories: list[string], optional

An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.

credential_id: string, optional

The ID of the set of credentials to use instead of username and password. If specified, username and password become optional.

use_kerberos: bool, optional

If unset, uses the server default: False. If true, use Kerberos authentication for database authentication.

credential_data: dict, optional

The credentials to authenticate with the database, to use instead of user/password or credential ID.

max_wait: int, optional

Time in seconds after which dataset version creation is considered unsuccessful.

Returns:
response: Dataset

The Dataset version created from the uploaded data.

Return type:

TypeVar(TDataset, bound=Dataset)

classmethod from_data(data)

Instantiate an object of this class using a dict.

Parameters:
data: dict

Correctly snake_cased keys and their values.

Return type:

TypeVar(T, bound=APIObject)

classmethod from_server_data(data, keep_attrs=None)

Instantiate an object of this class using the data directly from the server, meaning that the keys may have the wrong camel casing.

Parameters:
data: dict

The directly translated dict of JSON from the server. No casing fixes have taken place.

keep_attrs: iterable

List, set or tuple of the dotted namespace notations for attributes to keep within the object structure even if their values are None.

Return type:

TypeVar(T, bound=APIObject)

open_in_browser()

Opens the class’ relevant web browser location. If the default browser is not available, the URL is logged.

Note: If text-mode browsers are used, the calling process will block until the user exits the browser.

Return type:

None

class datarobot.DatasetDetails(dataset_id, version_id, categories, created_by, created_at, data_source_type, error, is_latest_version, is_snapshot, is_data_engine_eligible, last_modification_date, last_modifier_full_name, name, uri, processing_state, data_persisted=None, data_engine_query_id=None, data_source_id=None, description=None, eda1_modification_date=None, eda1_modifier_full_name=None, feature_count=None, feature_count_by_type=None, row_count=None, size=None, tags=None, recipe_id=None, is_wrangling_eligible=None, sample_size=None)

Represents a detailed view of a Dataset. The to_dataset method creates a Dataset from this details view.

Attributes:
dataset_id: string

The ID of this dataset

name: string

The name of this dataset in the catalog

is_latest_version: bool

Whether this dataset version is the latest version of this dataset

version_id: string

The object ID of the catalog_version the dataset belongs to

categories: list(string)

An array of strings describing the intended use of the dataset. The supported options are “TRAINING” and “PREDICTION”.

created_at: string

The date when the dataset was created

created_by: string

Username of the user who created the dataset

is_snapshot: bool

Whether the dataset version is an immutable snapshot of data which has previously been retrieved and saved to DataRobot.

data_persisted: bool, optional

If true, the user is allowed to view the extended data profile (which includes data statistics like min/max/median/mean, histogram, etc.) and download data. If false, download is not allowed and only the data schema (feature names and types) will be available.

is_data_engine_eligible: bool

Whether this dataset can be a data source of a data engine query.

processing_state: string

Current ingestion process state of the dataset

row_count: int, optional

The number of rows in the dataset.

size: int, optional

The size of the dataset as a CSV in bytes.

data_engine_query_id: string, optional

ID of the source data engine query

data_source_id: string, optional

ID of the datasource used as the source of the dataset

data_source_type: string

the type of the datasource that was used as the source of the dataset

description: string, optional

the description of the dataset

eda1_modification_date: string, optional

the ISO 8601 formatted date and time when the EDA1 for the dataset was updated

eda1_modifier_full_name: string, optional

the user who was the last to update EDA1 for the dataset

error: string

details of exception raised during ingestion process, if any

feature_count: int, optional

total number of features in the dataset

feature_count_by_type: list[FeatureTypeCount]

number of features in the dataset grouped by feature type

last_modification_date: string

the ISO 8601 formatted date and time when the dataset was last modified

last_modifier_full_name: string

full name of user who was the last to modify the dataset

tags: list[string]

list of tags attached to the item

uri: string

the URI to the data source, for example:
- 'file_name.csv'
- 'jdbc:DATA_SOURCE_GIVEN_NAME/SCHEMA.TABLE_NAME'
- 'jdbc:DATA_SOURCE_GIVEN_NAME/<query>' for query-based data sources
- 'https://s3.amazonaws.com/my_data/my_dataset.csv'

sample_size: dict, optional

The size of data fetched during dataset registration. For example, to fetch the first 95 rows, the sample_size value is {'type': 'rows', 'value': 95}. Currently only the 'rows' type is supported.

classmethod get(dataset_id)

Get details for a Dataset from the server

Parameters:
dataset_id: str

The ID of the Dataset from which to get details.

Returns:
DatasetDetails
Return type:

TypeVar(TDatasetDetails, bound=DatasetDetails)

to_dataset()

Build a Dataset object from the information in this object

Returns:
Dataset
Return type:

Dataset
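
Examples

A minimal sketch fetching details and converting them back to a Dataset; the ID is illustrative.

details = DatasetDetails.get("5e30cc74b33f2a0a26306438")
dataset = details.to_dataset()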

class datarobot.models.dataset.ProjectLocation(url, id)
id

Alias for field number 1

url

Alias for field number 0