Datasets
- class datarobot.models.Dataset(dataset_id, version_id, name, categories, created_at, is_data_engine_eligible, is_latest_version, is_snapshot, processing_state, created_by=None, data_persisted=None, size=None, row_count=None, recipe_id=None, sample_size=None)
Represents a Dataset returned from the api/v2/datasets/ endpoints.
- Attributes:
- id: string
The ID of this dataset
- name: string
The name of this dataset in the catalog
- is_latest_version: bool
Whether this dataset version is the latest version of this dataset
- version_id: string
The object ID of the catalog_version the dataset belongs to
- categories: list(string)
An array of strings describing the intended use of the dataset. The supported options are “TRAINING” and “PREDICTION”.
- created_at: string
The date when the dataset was created
- created_by: string, optional
Username of the user who created the dataset
- is_snapshot: bool
Whether the dataset version is an immutable snapshot of data which has previously been retrieved and saved to DataRobot
- data_persisted: bool, optional
If true, user is allowed to view extended data profile (which includes data statistics like min/max/median/mean, histogram, etc.) and download data. If false, download is not allowed and only the data schema (feature names and types) will be available.
- is_data_engine_eligible: bool
Whether this dataset can be a data source of a data engine query.
- processing_state: string
Current ingestion process state of the dataset
- row_count: int, optional
The number of rows in the dataset.
- size: int, optional
The size of the dataset as a CSV in bytes.
- sample_size: dict, optional
The size of data fetched during dataset registration. For example, to fetch the first 95 rows, the sample_size value is {‘type’: ‘rows’, ‘value’: 95}. Currently only ‘rows’ type is supported.
- get_uri()
- Returns:
- url: str
Permanent static hyperlink to this dataset in AI Catalog.
- Return type:
str
- classmethod upload(source)
This method covers Dataset creation from local sources (a file or pandas DataFrame) and from a URL.
- Parameters:
- source: str, pd.DataFrame or file object
Pass a URL, filepath, file or DataFrame to create and return a Dataset.
- Returns:
- response: Dataset
The Dataset created from the uploaded data source.
- Raises:
- InvalidUsageError
If the source parameter cannot be determined to be a URL, filepath, file or DataFrame.
- Return type:
TypeVar(TDataset, bound=Dataset)
Examples
    # Upload a local file
    dataset_one = Dataset.upload("./data/examples.csv")

    # Create a dataset via URL
    dataset_two = Dataset.upload(
        "https://raw.githubusercontent.com/curran/data/gh-pages/dbpedia/cities/data.csv"
    )

    # Create a dataset with a pandas DataFrame
    dataset_three = Dataset.upload(my_df)

    # Create a dataset using a local file object
    with open("./data/examples.csv", "rb") as file_pointer:
        dataset_four = Dataset.create_from_file(filelike=file_pointer)
- classmethod create_from_file(cls, file_path=None, filelike=None, categories=None, read_timeout=600, max_wait=600, *, use_cases=None)
A blocking call that creates a new Dataset from a file. Returns when the dataset has been successfully uploaded and processed.
Warning: This function does not clean up its open files. If you pass a filelike, you are responsible for closing it. If you pass a file_path, this will create a file object from the file_path but will not close it.
- Parameters:
- file_path: string, optional
The path to the file. This will create a file object pointing to that file but will not close it.
- filelike: file, optional
An open and readable file object.
- categories: list[string], optional
An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.
- read_timeout: int, optional
The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete
- max_wait: int, optional
Time in seconds after which dataset creation is considered unsuccessful
- use_cases: list[UseCase] | UseCase | list[string] | string, optional
A list of UseCase objects, UseCase object, list of Use Case ids or a single Use Case id to add this new Dataset to. Must be a kwarg.
- Returns:
- response: Dataset
A fully armed and operational Dataset
- Return type:
TypeVar(TDataset, bound=Dataset)
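Examples
A minimal usage sketch; the file path and category below are illustrative:

    dataset = Dataset.create_from_file(
        file_path="./data/examples.csv",
        categories=["TRAINING"],
    )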
- classmethod create_from_in_memory_data(cls, data_frame=None, records=None, categories=None, read_timeout=600, max_wait=600, fname=None, *, use_cases=None)
A blocking call that creates a new Dataset from in-memory data. Returns when the dataset has been successfully uploaded and processed.
The data can be either a pandas DataFrame or a list of dictionaries with identical keys.
- Parameters:
- data_frame: DataFrame, optional
The data frame to upload
- records: list[dict], optional
A list of dictionaries with identical keys to upload
- categories: list[string], optional
An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.
- read_timeout: int, optional
The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete
- max_wait: int, optional
Time in seconds after which dataset creation is considered unsuccessful
- fname: string, optional
The file name, “data.csv” by default
- use_cases: list[UseCase] | UseCase | list[string] | string, optional
A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.
- Returns:
- response: Dataset
The Dataset created from the uploaded data.
- Raises:
- InvalidUsageError
If neither a DataFrame nor a list of records is passed.
- Return type:
TypeVar(TDataset, bound=Dataset)
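Examples
A short sketch using hypothetical in-memory data; the two calls below supply the same rows in different forms:

    import pandas as pd

    records = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
    dataset_from_records = Dataset.create_from_in_memory_data(records=records)
    dataset_from_df = Dataset.create_from_in_memory_data(data_frame=pd.DataFrame(records))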
- classmethod create_from_url(cls, url, do_snapshot=None, persist_data_after_ingestion=None, categories=None, sample_size=None, max_wait=600, *, use_cases=None)
A blocking call that creates a new Dataset from data stored at a url. Returns when the dataset has been successfully uploaded and processed.
- Parameters:
- url: string
The URL to use as the source of data for the dataset being created.
- do_snapshot: bool, optional
If unset, uses the server default: True. If true, creates a snapshot dataset; if false, creates a remote dataset. Creating snapshots from non-file sources may be disabled by the permission, Disable AI Catalog Snapshots.
- persist_data_after_ingestion: bool, optional
If unset, uses the server default: True. If true, will enforce saving all data (for download and sampling) and will allow a user to view extended data profile (which includes data statistics like min/max/median/mean, histogram, etc.). If false, will not enforce saving data. The data schema (feature names and types) will still be available. Setting this parameter to false while do_snapshot is true will result in an error.
- categories: list[string], optional
An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.
- sample_size: dict, optional
The size of data fetched during dataset registration. For example, to fetch the first 95 rows, the sample_size value would be: {‘type’: ‘rows’, ‘value’: 95}. Currently only ‘rows’ type is supported.
- max_wait: int, optional
Time in seconds after which dataset creation is considered unsuccessful.
- use_cases: list[UseCase] | UseCase | list[string] | string, optional
A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.
- Returns:
- response: Dataset
The Dataset created from the uploaded data
- Return type:
TypeVar(TDataset, bound=Dataset)
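Examples
A sketch using a public CSV URL; the URL and sample_size values are illustrative:

    dataset = Dataset.create_from_url(
        "https://raw.githubusercontent.com/curran/data/gh-pages/dbpedia/cities/data.csv",
        do_snapshot=True,
        sample_size={"type": "rows", "value": 95},
    )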
- classmethod create_from_datastage(cls, datastage_id, categories=None, max_wait=600, *, use_cases=None)
A blocking call that creates a new Dataset from data stored as a DataStage. Returns when the dataset has been successfully uploaded and processed.
- Parameters:
- datastage_id: string
The ID of the DataStage to use as the source of data for the dataset being created.
- categories: list[string], optional
An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.
- max_wait: int, optional
Time in seconds after which dataset creation is considered unsuccessful.
- Returns:
- response: Dataset
The Dataset created from the uploaded data
- Return type:
TypeVar(TDataset, bound=Dataset)
- classmethod create_from_data_source(cls, data_source_id, username=None, password=None, do_snapshot=None, persist_data_after_ingestion=None, categories=None, credential_id=None, use_kerberos=None, credential_data=None, sample_size=None, max_wait=600, *, use_cases=None)
A blocking call that creates a new Dataset from data stored at a DataSource. Returns when the dataset has been successfully uploaded and processed.
Added in version v2.22.
- Return type:
TypeVar(TDataset, bound=Dataset)
- Parameters:
- data_source_id: string
The ID of the DataSource to use as the source of data.
- username: string, optional
The username for database authentication.
- password: string, optional
The password (in cleartext) for database authentication. The password will be encrypted on the server side in scope of HTTP request and never saved or stored.
- do_snapshot: bool, optional
If unset, uses the server default: True. If true, creates a snapshot dataset; if false, creates a remote dataset. Creating snapshots from non-file sources may be disabled by the permission, Disable AI Catalog Snapshots.
- persist_data_after_ingestion: bool, optional
If unset, uses the server default: True. If true, will enforce saving all data (for download and sampling) and will allow a user to view extended data profile (which includes data statistics like min/max/median/mean, histogram, etc.). If false, will not enforce saving data. The data schema (feature names and types) will still be available. Setting this parameter to false while do_snapshot is true will result in an error.
- categories: list[string], optional
An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.
- credential_id: string, optional
The ID of the set of credentials to use instead of user and password. Note that with this change, username and password will become optional.
- use_kerberos: bool, optional
If unset, uses the server default: False. If true, use kerberos authentication for database authentication.
- credential_data: dict, optional
The credentials to authenticate with the database, to use instead of user/password or credential ID.
- sample_size: dict, optional
The size of data fetched during dataset registration. For example, to fetch the first 95 rows, the sample_size value would be: {‘type’: ‘rows’, ‘value’: 95}. Currently only ‘rows’ type is supported.
- max_wait: int, optional
Time in seconds after which dataset creation is considered unsuccessful.
- use_cases: list[UseCase] | UseCase | list[string] | string, optional
A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.
- Returns:
- response: Dataset
The Dataset created from the uploaded data
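Examples
A sketch assuming a pre-registered DataSource and stored credentials; both IDs are hypothetical:

    dataset = Dataset.create_from_data_source(
        data_source_id="5e9bbdfcdeadbeef01234567",
        credential_id="5e9bbe0adeadbeef01234568",
        do_snapshot=True,
    )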
- classmethod create_from_query_generator(cls, generator_id, dataset_id=None, dataset_version_id=None, max_wait=600, *, use_cases=None)
A blocking call that creates a new Dataset from the query generator. Returns when the dataset has been successfully processed. If optional parameters are not specified the query is applied to the dataset_id and dataset_version_id stored in the query generator. If specified they will override the stored dataset_id/dataset_version_id, e.g. to prep a prediction dataset.
- Parameters:
- generator_id: str
The id of the query generator to use.
- dataset_id: str, optional
The id of the dataset to apply the query to.
- dataset_version_id: str, optional
The id of the dataset version to apply the query to. If not specified the latest version associated with dataset_id (if specified) is used.
- max_wait: int, optional
The maximum number of seconds to wait before giving up.
- use_cases: list[UseCase] | UseCase | list[string] | string, optional
A list of UseCase objects, UseCase object, list of Use Case IDs or a single Use Case ID to add this new dataset to. Must be a kwarg.
- Returns:
- response: Dataset
The Dataset created from the query generator
- Return type:
TypeVar(TDataset, bound=Dataset)
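Examples
A sketch applying a stored query generator to a different dataset, e.g. to prep a prediction dataset; both IDs are hypothetical:

    prediction_data = Dataset.create_from_query_generator(
        generator_id="5e9bbdfcdeadbeef01234567",
        dataset_id="5e9bbe0adeadbeef01234568",
    )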
- classmethod create_from_recipe(cls, recipe, name=None, do_snapshot=None, persist_data_after_ingestion=None, categories=None, credential=None, use_kerberos=None, materialization_destination=None, max_wait=600, *, use_cases=None)
A blocking call that creates a new Dataset from the recipe. Returns when the dataset has been successfully uploaded and processed.
Added in version 3.6.
- Return type:
TypeVar(TDataset, bound=Dataset)
- Returns:
- response: Dataset
The Dataset created from the uploaded data
- classmethod get(dataset_id)
Get information about a dataset.
- Parameters:
- dataset_id: string
The ID of the dataset
- Returns:
- dataset: Dataset
The queried dataset
- Return type:
TypeVar(TDataset, bound=Dataset)
- classmethod delete(dataset_id)
Soft deletes a dataset. Once deleted, the dataset cannot be retrieved, listed, or acted on, except to un-delete it.
- Parameters:
- dataset_id: string
The id of the dataset to mark for deletion
- Returns:
- None
- Return type:
None
- classmethod un_delete(dataset_id)
Un-deletes a previously deleted dataset. If the dataset was not deleted, nothing happens.
- Parameters:
- dataset_id: string
The id of the dataset to un-delete
- Returns:
- None
- Return type:
None
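Examples
A short sketch of the soft-delete round trip; the ID is hypothetical:

    Dataset.delete("5e9bbdfcdeadbeef01234567")
    # The dataset no longer appears in listings; restore it:
    Dataset.un_delete("5e9bbdfcdeadbeef01234567")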
- classmethod list(category=None, filter_failed=None, order_by=None, use_cases=None)
List all datasets a user can view.
- Parameters:
- category: string, optional
If specified, only dataset versions that have the specified category will be included in the results. Categories identify the intended use of the dataset; supported categories are “TRAINING” and “PREDICTION”.
- filter_failed: bool, optional
If unset, uses the server default: False. Whether datasets that failed during import should be excluded from the results. If True, invalid datasets will be excluded.
- order_by: string, optional
If unset, uses the server default: “-created”. Sorting order applied to the catalog list. Valid options: “created” (ascending order by creation datetime) and “-created” (descending order by creation datetime).
- use_cases: Union[UseCase, List[UseCase], str, List[str]], optional
Filter available datasets by a specific Use Case or Cases. Accepts either the entity or the ID. If set to [None], only datasets not linked to any Use Case are returned.
- Returns:
- list[Dataset]
a list of datasets the user can view
- Return type:
List[TypeVar(TDataset, bound=Dataset)]
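Examples
A sketch of common filters: only successfully ingested TRAINING datasets, oldest first:

    datasets = Dataset.list(
        category="TRAINING",
        filter_failed=True,
        order_by="created",
    )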
- classmethod iterate(offset=None, limit=None, category=None, order_by=None, filter_failed=None, use_cases=None)
Get an iterator for the requested datasets a user can view. This lazily retrieves results. It does not get the next page from the server until the current page is exhausted.
- Parameters:
- offset: int, optional
If set, this many results will be skipped
- limit: int, optional
Specifies the size of each page retrieved from the server. If unset, uses the server default.
- category: string, optional
If specified, only dataset versions that have the specified category will be included in the results. Categories identify the intended use of the dataset; supported categories are “TRAINING” and “PREDICTION”.
- filter_failed: bool, optional
If unset, uses the server default: False. Whether datasets that failed during import should be excluded from the results. If True, invalid datasets will be excluded.
- order_by: string, optional
If unset, uses the server default: “-created”. Sorting order applied to the catalog list. Valid options: “created” (ascending order by creation datetime) and “-created” (descending order by creation datetime).
- use_cases: Union[UseCase, List[UseCase], str, List[str]], optional
Filter available datasets by a specific Use Case or Cases. Accepts either the entity or the ID. If set to [None], only datasets not linked to any Use Case are returned.
- Yields:
- Dataset
An iterator of the datasets the user can view.
- Return type:
Generator[TypeVar(TDataset, bound=Dataset), None, None]
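Examples
A sketch of lazy pagination; each page of 100 results is fetched only as the loop consumes it:

    for dataset in Dataset.iterate(limit=100):
        print(dataset.name)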
- update()
Updates the Dataset attributes in place with the latest information from the server.
- Returns:
- None
- Return type:
None
- modify(name=None, categories=None)
Modifies the Dataset name and/or categories. Updates the object in place.
- Parameters:
- name: string, optional
The new name of the dataset
- categories: list[string], optional
A list of strings describing the intended use of the dataset. The supported options are “TRAINING” and “PREDICTION”. If any categories were previously specified for the dataset, they will be overwritten. If omitted or None, the previous categories are kept. To clear them, specify [].
- Returns:
- None
- Return type:
None
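Examples
A sketch renaming a dataset and clearing its categories; the ID is hypothetical:

    dataset = Dataset.get("5e9bbdfcdeadbeef01234567")
    dataset.modify(name="Renamed Dataset", categories=[])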
- share(access_list, *, apply_grant_to_linked_objects=False)
Modify the ability of users to access this dataset.
- Parameters:
- access_list: list of SharingAccess
The modifications to make.
- apply_grant_to_linked_objects: bool
If true for any users being granted access to the dataset, grant the user read access to any linked objects such as DataSources and DataStores that may be used by this dataset. Ignored if no such objects are relevant for the dataset. Defaults to False.
- Raises:
- datarobot.ClientError:
If you do not have permission to share this dataset, if the user you’re sharing with doesn’t exist, if the same user appears multiple times in the access_list, or if these changes would leave the dataset without an owner.
- Return type:
None
Examples
Transfer access to the dataset from old_user@datarobot.com to new_user@datarobot.com
    from datarobot.enums import SHARING_ROLE
    from datarobot.models.dataset import Dataset
    from datarobot.models.sharing import SharingAccess

    new_access = SharingAccess(
        "new_user@datarobot.com",
        SHARING_ROLE.OWNER,
        can_share=True,
    )
    access_list = [
        SharingAccess(
            "old_user@datarobot.com",
            SHARING_ROLE.OWNER,
            can_share=True,
            can_use_data=True,
        ),
        new_access,
    ]

    Dataset.get('my-dataset-id').share(access_list)
- get_details()
Gets the details for this Dataset
- Returns:
- DatasetDetails
- Return type:
DatasetDetails
- get_all_features(order_by=None)
Get a list of all the features for this dataset.
- Parameters:
- order_by: string, optional
If unset, uses the server default: ‘name’. How the features should be ordered. Can be ‘name’ or ‘featureType’.
- Returns:
- list[DatasetFeature]
- Return type:
List[DatasetFeature]
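Examples
A sketch listing this dataset's features ordered by type; dataset is assumed to be a previously retrieved Dataset:

    features = dataset.get_all_features(order_by="featureType")
    for feature in features:
        print(feature.name)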
- iterate_all_features(offset=None, limit=None, order_by=None)
Get an iterator for the requested features of a dataset. This lazily retrieves results. It does not get the next page from the server until the current page is exhausted.
- Parameters:
- offset: int, optional
If set, this many results will be skipped.
- limit: int, optional
Specifies the size of each page retrieved from the server. If unset, uses the server default.
- order_by: string, optional
If unset, uses the server default: ‘name’. How the features should be ordered. Can be ‘name’ or ‘featureType’.
- Yields:
- DatasetFeature
- Return type:
Generator[DatasetFeature, None, None]
- get_featurelists()
Get DatasetFeaturelists created on this Dataset
- Returns:
- feature_lists: list[DatasetFeaturelist]
- Return type:
List[DatasetFeaturelist]
- create_featurelist(name, features)
Create a new dataset featurelist
- Parameters:
- name: str
The name of the dataset featurelist to create. Names must be unique within the dataset, or the server will return an error.
- features: list of str
The names of the features to include in the dataset featurelist. Each feature must be a dataset feature.
- Returns:
- featurelist: DatasetFeaturelist
The newly created featurelist.
- Return type:
DatasetFeaturelist
Examples
    dataset = Dataset.get('1234deadbeeffeeddead4321')
    dataset_features = dataset.get_all_features()
    selected_features = [feat.name for feat in dataset_features][:5]  # select first five
    new_flist = dataset.create_featurelist('Simple Features', selected_features)
- get_file(file_path=None, filelike=None)
Retrieves all the originally uploaded data in CSV form. Writes it to either the file or a filelike object that can write bytes.
Only one of file_path or filelike can be provided and it must be provided as a keyword argument (i.e. file_path=’path-to-write-to’). If a file-like object is provided, the user is responsible for closing it when they are done.
The user must also have permission to download data.
- Parameters:
- file_path: string, optional
The destination to write the file to.
- filelike: file, optional
A file-like object to write to. The object must be able to write bytes. The user is responsible for closing the object
- Returns:
- None
- Return type:
None
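Examples
A sketch of both output modes; exactly one destination keyword must be supplied per call:

    dataset.get_file(file_path="./examples_copy.csv")

    with open("./examples_copy.csv", "wb") as file_pointer:
        dataset.get_file(filelike=file_pointer)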
- get_as_dataframe(low_memory=False)
Retrieves all the originally uploaded data in a pandas DataFrame.
Added in version v3.0.
- Return type:
DataFrame
- Parameters:
- low_memory: bool, optional
If True, use local files to reduce memory usage which will be slower.
- Returns:
- pd.DataFrame
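Examples
A short sketch; low_memory=True trades speed for a smaller memory footprint:

    df = dataset.get_as_dataframe(low_memory=True)
    print(df.shape)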
- get_projects()
Retrieves the Dataset’s projects as ProjectLocation named tuples.
- Returns:
- locations: list[ProjectLocation]
- Return type:
List[ProjectLocation]
- create_project(project_name=None, user=None, password=None, credential_id=None, use_kerberos=None, credential_data=None, *, use_cases=None)
Create a datarobot.models.Project from this dataset.
- Parameters:
- project_name: string, optional
The name of the project to be created. If not specified, will be “Untitled Project” for database connections, otherwise the project name will be based on the file used.
- user: string, optional
The username for database authentication.
- password: string, optional
The password (in cleartext) for database authentication. The password will be encrypted on the server side in scope of HTTP request and never saved or stored.
- credential_id: string, optional
The ID of the set of credentials to use instead of user and password.
- use_kerberos: bool, optional
Server default is False. If true, use kerberos authentication for database authentication.
- credential_data: dict, optional
The credentials to authenticate with the database, to use instead of user/password or credential ID.
- use_cases: list[UseCase] | UseCase | list[string] | string, optional
A list of UseCase objects, UseCase object, list of Use Case ids or a single Use Case id to add this new Dataset to. Must be a kwarg.
- Returns:
- Project
- Return type:
Project
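Examples
A sketch creating a project from an existing dataset; the project name is illustrative:

    project = dataset.create_project(project_name="Churn Model")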
- classmethod create_version_from_file(dataset_id, file_path=None, filelike=None, categories=None, read_timeout=600, max_wait=600)
A blocking call that creates a new Dataset version from a file. Returns when the new dataset version has been successfully uploaded and processed.
Warning: This function does not clean up its open files. If you pass a filelike, you are responsible for closing it. If you pass a file_path, this will create a file object from the file_path but will not close it.
Added in version v2.23.
- Return type:
TypeVar(TDataset, bound=Dataset)
- Parameters:
- dataset_id: string
The ID of the dataset for which a new version will be created
- file_path: string, optional
The path to the file. This will create a file object pointing to that file but will not close it.
- filelike: file, optional
An open and readable file object.
- categories: list[string], optional
An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.
- read_timeout: int, optional
The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete
- max_wait: int, optional
Time in seconds after which dataset version creation is considered unsuccessful
- Returns:
- response: Dataset
A fully armed and operational Dataset version
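Examples
A sketch uploading a refreshed file as a new version of an existing dataset; the ID and path are hypothetical:

    new_version = Dataset.create_version_from_file(
        dataset_id="5e9bbdfcdeadbeef01234567",
        file_path="./data/examples_v2.csv",
    )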
- classmethod create_version_from_in_memory_data(dataset_id, data_frame=None, records=None, categories=None, read_timeout=600, max_wait=600)
A blocking call that creates a new Dataset version for a dataset from in-memory data. Returns when the dataset has been successfully uploaded and processed.
The data can be either a pandas DataFrame or a list of dictionaries with identical keys.
Added in version v2.23.
- Return type:
TypeVar(TDataset, bound=Dataset)
- Parameters:
- dataset_id: string
The ID of the dataset for which a new version will be created
- data_frame: DataFrame, optional
The data frame to upload
- records: list[dict], optional
A list of dictionaries with identical keys to upload
- categories: list[string], optional
An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.
- read_timeout: int, optional
The maximum number of seconds to wait for the server to respond indicating that the initial upload is complete
- max_wait: int, optional
Time in seconds after which dataset version creation is considered unsuccessful
- Returns:
- response: Dataset
The Dataset version created from the uploaded data
- Raises:
- InvalidUsageError
If neither a DataFrame nor a list of records is passed.
- classmethod create_version_from_url(dataset_id, url, categories=None, max_wait=600)
A blocking call that creates a new Dataset from data stored at a url for a given dataset. Returns when the dataset has been successfully uploaded and processed.
Added in version v2.23.
- Return type:
TypeVar(TDataset, bound=Dataset)
- Parameters:
- dataset_id: string
The ID of the dataset for which a new version will be created
- url: string
The URL to use as the source of data for the dataset being created.
- categories: list[string], optional
An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.
- max_wait: int, optional
Time in seconds after which dataset version creation is considered unsuccessful
- Returns:
- response: Dataset
The Dataset version created from the uploaded data
- classmethod create_version_from_datastage(dataset_id, datastage_id, categories=None, max_wait=600)
A blocking call that creates a new Dataset from data stored as a DataStage for a given dataset. Returns when the dataset has been successfully uploaded and processed.
- Parameters:
- dataset_id: string
The ID of the dataset for which a new version will be created
- datastage_id: string
The ID of the DataStage to use as the source of data for the dataset being created.
- categories: list[string], optional
An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.
- max_wait: int, optional
Time in seconds after which dataset version creation is considered unsuccessful
- Returns:
- response: Dataset
The Dataset version created from the uploaded data
- Return type:
TypeVar(TDataset, bound=Dataset)
- classmethod create_version_from_data_source(dataset_id, data_source_id, username=None, password=None, categories=None, credential_id=None, use_kerberos=None, credential_data=None, max_wait=600)
A blocking call that creates a new Dataset from data stored at a DataSource. Returns when the dataset has been successfully uploaded and processed.
Added in version v2.23.
- Return type:
TypeVar(TDataset, bound=Dataset)
- Parameters:
- dataset_id: string
The ID of the dataset for which a new version will be created
- data_source_id: string
The ID of the DataSource to use as the source of data.
- username: string, optional
The username for database authentication.
- password: string, optional
The password (in cleartext) for database authentication. The password will be encrypted on the server side in scope of HTTP request and never saved or stored.
- categories: list[string], optional
An array of strings describing the intended use of the dataset. The current supported options are “TRAINING” and “PREDICTION”.
- credential_id: string, optional
The ID of the set of credentials to use instead of user and password. Note that with this change, username and password will become optional.
- use_kerberos: bool, optional
If unset, uses the server default: False. If true, use kerberos authentication for database authentication.
- credential_data: dict, optional
The credentials to authenticate with the database, to use instead of user/password or credential ID.
- max_wait: int, optional
Time in seconds after which dataset version creation is considered unsuccessful
- Returns:
- response: Dataset
The Dataset version created from the uploaded data
- classmethod from_data(data)
Instantiate an object of this class using a dict.
- Parameters:
- data: dict
Correctly snake_cased keys and their values.
- Return type:
TypeVar(T, bound=APIObject)
- classmethod from_server_data(data, keep_attrs=None)
Instantiate an object of this class using the data directly from the server, meaning that the keys may have the wrong camel casing
- Parameters:
- data: dict
The directly translated dict of JSON from the server. No casing fixes have taken place.
- keep_attrs: iterable
List, set or tuple of the dotted namespace notations for attributes to keep within the object structure even if their values are None.
- Return type:
TypeVar(T, bound=APIObject)
- open_in_browser()
Opens the class’ relevant web browser location. If a default browser is not available, the URL is logged.
Note: If text-mode browsers are used, the calling process will block until the user exits the browser.
- Return type:
None
- class datarobot.DatasetDetails(dataset_id, version_id, categories, created_by, created_at, data_source_type, error, is_latest_version, is_snapshot, is_data_engine_eligible, last_modification_date, last_modifier_full_name, name, uri, processing_state, data_persisted=None, data_engine_query_id=None, data_source_id=None, description=None, eda1_modification_date=None, eda1_modifier_full_name=None, feature_count=None, feature_count_by_type=None, row_count=None, size=None, tags=None, recipe_id=None, is_wrangling_eligible=None, sample_size=None)
Represents a detailed view of a Dataset. The to_dataset method creates a Dataset from this details view.
- Attributes:
- dataset_id: string
The ID of this dataset
- name: string
The name of this dataset in the catalog
- is_latest_version: bool
Whether this dataset version is the latest version of this dataset
- version_id: string
The object ID of the catalog_version the dataset belongs to
- categories: list(string)
An array of strings describing the intended use of the dataset. The supported options are “TRAINING” and “PREDICTION”.
- created_at: string
The date when the dataset was created
- created_by: string
Username of the user who created the dataset
- is_snapshot: bool
Whether the dataset version is an immutable snapshot of data which has previously been retrieved and saved to DataRobot
- data_persisted: bool, optional
If true, user is allowed to view extended data profile (which includes data statistics like min/max/median/mean, histogram, etc.) and download data. If false, download is not allowed and only the data schema (feature names and types) will be available.
- is_data_engine_eligible: bool
Whether this dataset can be a data source of a data engine query.
- processing_state: string
Current ingestion process state of the dataset
- row_count: int, optional
The number of rows in the dataset.
- size: int, optional
The size of the dataset as a CSV in bytes.
- data_engine_query_id: string, optional
ID of the source data engine query
- data_source_id: string, optional
ID of the datasource used as the source of the dataset
- data_source_type: string
the type of the datasource that was used as the source of the dataset
- description: string, optional
the description of the dataset
- eda1_modification_date: string, optional
the ISO 8601 formatted date and time when the EDA1 for the dataset was updated
- eda1_modifier_full_name: string, optional
the user who was the last to update EDA1 for the dataset
- error: string
details of exception raised during ingestion process, if any
- feature_count: int, optional
total number of features in the dataset
- feature_count_by_type: list[FeatureTypeCount]
number of features in the dataset grouped by feature type
- last_modification_date: string
the ISO 8601 formatted date and time when the dataset was last modified
- last_modifier_full_name: string
full name of user who was the last to modify the dataset
- tags: list[string]
list of tags attached to the item
- uri: string
The URI of the data source, for example:
- ‘file_name.csv’
- ‘jdbc:DATA_SOURCE_GIVEN_NAME/SCHEMA.TABLE_NAME’
- ‘jdbc:DATA_SOURCE_GIVEN_NAME/<query>’ (for query-based data sources)
- ‘https://s3.amazonaws.com/my_data/my_dataset.csv’
- sample_size: dict, optional
The size of data fetched during dataset registration. For example, to fetch the first 95 rows, the sample_size value is {‘type’: ‘rows’, ‘value’: 95}. Currently only ‘rows’ type is supported.
- classmethod get(dataset_id)
Get details for a Dataset from the server
- Parameters:
- dataset_id: str
The id for the Dataset from which to get details
- Returns:
- DatasetDetails
- Return type:
TypeVar(TDatasetDetails, bound=DatasetDetails)
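Examples
A sketch retrieving details and converting them back to a Dataset via to_dataset; the ID is hypothetical:

    details = DatasetDetails.get("5e9bbdfcdeadbeef01234567")
    print(details.feature_count, details.row_count)
    dataset = details.to_dataset()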