Database connectivity

Databases are a widely used tool for carrying valuable business data. To enable integration with a variety of enterprise databases, DataRobot provides a “self-service” JDBC product for database connectivity setup. Once configured, you can read data from production databases for model building and predictions. This allows you to quickly train and retrain models on that data, and avoids the unnecessary step of exporting data from your enterprise database to a CSV for ingest to DataRobot. With access to more diverse data, you can build more accurate models.

Database connection configuration uses the following terminology:

Data store: A configured connection to a database. It has a name, a specified driver, and a JDBC URL. You can register data stores with DataRobot for ease of re-use. A data store has one connector but can have many data sources.
Data source: A configured connection to the backing data store (the location of data within a given endpoint). A data source specifies, via a SQL query or a selected table and schema data, which data to extract from the data store to use for modeling or predictions. A data source has one data store and one connector but can have many datasets.
Data driver: The software that allows the application to interact with a database; each data store is associated with either a driver or a connector (created by the administrator). The driver configuration saves the storage location in the application of the JAR file and any additional dependency files associated with the driver.
Connector: Similarly to data drivers, a connector allows the application to interact with a database; each data store is associated with either a driver or a connector (created by the administrator). The connector configuration saves the storage location in the application of the JAR file and any additional dependency files associated with the connector.
Dataset: Data, a file or the content of a data source, at a particular point in time. A data source can produce multiple datasets; a dataset has exactly one data source.

Review the workflow to set up projects or prediction datasets below.

The administrator sets up a datarobot.DataDriver for accessing a particular database. For any particular driver, this setup is performed once for the entire system and the resulting driver is used by all users.
Users create a datarobot.DataStore which represents an interface to a particular database, using that driver.
Users create a datarobot.DataSource representing a particular set of data to be extracted from the data store.
Users create projects and prediction datasets from a data source.

Users can also manage their data stores and data sources and administrators can manage drivers by listing, retrieving, updating, and deleting existing instances.

Create drivers

Administrators should specify the class_name, the name of the Java class in the Java archive which implements the java.sql.Driver interface; canonical_name, a user-friendly name for resulting driver to display in the API and the GUI; and files, a list of local files which contain the driver.

>>> import datarobot as dr
>>> driver = dr.DataDriver.create(
...     class_name='org.postgresql.Driver',
...     canonical_name='PostgreSQL',
...     files=['/tmp/postgresql-42.2.2.jar']
... )
>>> driver
DataDriver('PostgreSQL')

To retrieve information about existing drivers, such as the driver ID for data store creation, you can use dr.DataDriver.list().

Create data stores

After an administrator has created drivers, any user can use them for DataStore creation. A data store represents a JDBC database. When creating them, you should specify the type, which must be jdbc; canonical_name, a user-friendly name to display in the API and GUI for the data store; driver_id, the ID of the driver to use to connect to the database; and jdbc_url, the full URL specifying the database connection settings such as the database type, server address, port, and database name.

Note that you can only create data stores with drivers when using the Python client. Drivers and connectors are not interchangeable for this method. To create a data store with a connector, instead use the REST API.

>>> import datarobot as dr
>>> data_store = dr.DataStore.create(
...     data_store_type='jdbc',
...     canonical_name='Demo DB',
...     driver_id='5a6af02eb15372000117c040',
...     jdbc_url='jdbc:postgresql://my.db.address.org:5432/perftest'
... )
>>> data_store
DataStore('Demo DB')
>>> data_store.test(username='username', password='password')
{'message': 'Connection successful'}

Create data sources

Once you have a data store, you can can query datasets via the data source. When creating a data source, first create a datarobot.DataSourceParameters object from a data store’s ID and a query, and then create the data source with a type, currently always jdbc; a canonical_name, the user-friendly name to display in the API and GUI, and params, the DataSourceParameters object.

>>> import datarobot as dr
>>> params = dr.DataSourceParameters(
...     data_store_id='5a8ac90b07a57a0001be501e',
...     query='SELECT * FROM airlines10mb WHERE "Year" >= 1995;'
... )
>>> data_source = dr.DataSource.create(
...     data_source_type='jdbc',
...     canonical_name='airlines stats after 1995',
...     params=params
... )
>>> data_source
DataSource('airlines stats after 1995')

Create projects

You can create new projects from a data source, demonstrated below.

>>> import datarobot as dr
>>> project = dr.Project.create_from_data_source(
...     data_source_id='5ae6eee9962d740dd7b86886',
...     username='username',
...     password='password'
... )

As of v3.0 of the Python API client, you can alternatively pass in the credential_id of an existing Dataset.Credential object.

>>> import datarobot as dr
>>> project = dr.Project.create_from_data_source(
...     data_source_id='5ae6eee9962d740dd7b86886',
...     credential_id='9963d544d5ce3se783r12190'
... )

Or, pass in credential_data which conforms to CredentialDataSchema.

>>> import datarobot as dr
>>> s3_credential_data = {"credentialType": "s3", "awsAccessKeyId": "key123", "awsSecretAccessKey": "secret123"}
>>> project = dr.Project.create_from_data_source(
...     data_source_id='5ae6eee9962d740dd7b86886',
...     credential_data=s3_credential_data
... )

Create prediction datasets

Given a data source, new prediction datasets can be created for any project.

>>> import datarobot as dr
>>> project = dr.Project.get('5ae6f296962d740dd7b86887')
>>> prediction_dataset = project.upload_dataset_from_data_source(
...     data_source_id='5ae6eee9962d740dd7b86886',
...     username='username',
...     password='password'
... )