Predicting Bad Loans

Overview

In this example we will build a binary classification model using the Lending Club dataset. Here is a list of things we will touch on in this notebook:

  • Installing the datarobot package
  • Configuring the client
  • Creating a project
  • Changing the datatype of some of the source columns
  • Selecting the source columns used in the modeling process
  • Running the automated modeling process
  • Generating predictions

Prerequisites

In order to run this notebook yourself, you will need the following:

  • This notebook. If you are viewing this in the HTML documentation bundle, you can download all of the example notebooks and supporting materials from Downloads.
  • The required dataset, which is included in the same directory as this notebook.
  • A DataRobot API token. You can find your API token by logging into the DataRobot Web User Interface and looking in your Profile.

Installing the datarobot package

The datarobot package is hosted on PyPI. You can install it via:

pip install datarobot

from the command line. Its main dependencies are numpy and pandas, which can take some time to install on a new system. We highly recommend using a virtual environment to avoid conflicts with other packages in your system-wide Python installation; a minimal setup is sketched below.
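
For example (the environment name datarobot-env is only a placeholder):

python -m venv datarobot-env
source datarobot-env/bin/activate   # on Windows: datarobot-env\Scripts\activate
pip install datarobot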

Getting Started

This line imports the datarobot package. By convention, we always import it with the alias dr.

[1]:
import datarobot as dr

Other Important Imports

We’ll use these in this notebook as well. If the previous cell and the following cell both run without issue, you’re in good shape.

[2]:
import datetime
import pandas as pd

Configure the Python Client

DataRobot recommends connecting with a configuration file that contains your credentials (endpoint and API key). For more information about authentication, see the API quickstart guide.

Save your configuration in a file under your home directory, at ~/.config/datarobot/drconfig.yaml.
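
The file itself is plain YAML. A minimal example with placeholder values (the endpoint shown assumes the managed DataRobot cloud; self-hosted installations use their own host):

token: <API TOKEN>
endpoint: https://app.datarobot.com/api/v2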

[3]:
# Initialization with arguments
# dr.Client(token='<API TOKEN>', endpoint='https://<YOUR ENDPOINT>/api/v2/')

# Initialization with a config file in the same directory as this notebook
# dr.Client(config_path='drconfig.yaml')

# Initialization with a config file located at
# ~/.config/datarobot/drconfig.yaml
dr.Client()
[3]:
<datarobot.rest.RESTClientObject at 0x11043b210>
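
The client can also pick up credentials from the environment, so that a bare dr.Client() call works without a config file. A sketch, assuming your client version reads the DATAROBOT_ENDPOINT and DATAROBOT_API_TOKEN environment variables:

export DATAROBOT_ENDPOINT='https://<YOUR ENDPOINT>/api/v2/'
export DATAROBOT_API_TOKEN='<API TOKEN>'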

Create the Project

Here, we use the datarobot package to upload a new file and create a project. The name of the project is optional, but it helps when sorting through many projects on DataRobot.

[4]:
filename = '10K_Lending_Club_Loans.csv'
now = datetime.datetime.now().strftime('%Y-%m-%dT%H:%M')
project_name = '10K_Lending_Club_Loans_{}'.format(now)
proj = dr.Project.create(sourcedata=filename,
                         project_name=project_name)
print('Project ID: {}'.format(proj.id))
Project ID: 5c007ffa784cc602016a9f06
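
To confirm the project exists (or to find it again later), you can list your projects and filter by name; this sketch simply reuses the project_name prefix from above:

for p in dr.Project.list():
    if p.project_name.startswith('10K_Lending_Club_Loans'):
        print(p.id, p.project_name)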

Examine the Raw Features

First, retrieve the raw feature list, which corresponds to the columns of the input file.

[5]:
raw = [feat_list for feat_list in proj.get_featurelists()
       if feat_list.name == 'Raw Features'][0]
raw_features = [
    {
        "name": feat,
        "type": dr.Feature.get(proj.id, feat).feature_type
    }
    for feat in raw.features
]
pd.DataFrame.from_dict(raw_features)
[5]:
    name                         type
0   loan_amnt                    Numeric
1   funded_amnt                  Numeric
2   term                         Categorical
3   int_rate                     Percentage
4   installment                  Numeric
5   grade                        Categorical
6   sub_grade                    Categorical
7   emp_title                    Text
8   emp_length                   Categorical
9   home_ownership               Categorical
10  annual_inc                   Numeric
11  verification_status          Categorical
12  pymnt_plan                   Categorical
13  url                          Text
14  desc                         Text
15  purpose                      Categorical
16  title                        Text
17  zip_code                     Categorical
18  addr_state                   Categorical
19  dti                          Numeric
20  delinq_2yrs                  Numeric
21  earliest_cr_line             Date
22  inq_last_6mths               Numeric
23  mths_since_last_delinq       Numeric
24  mths_since_last_record       Numeric
25  open_acc                     Numeric
26  pub_rec                      Numeric
27  revol_bal                    Numeric
28  revol_util                   Numeric
29  total_acc                    Numeric
30  initial_list_status          Categorical
31  mths_since_last_major_derog  None
32  policy_code                  Categorical
33  is_bad                       Numeric

Modify Feature Types

We can tweak features to improve the modeling. For example, we might change delinq_2yrs from an integer into a categorical.

[6]:
proj.create_type_transform_feature(
    "delinq_2yrs(Cat)",  # new feature name
    "delinq_2yrs",       # parent name
    dr.enums.VARIABLE_TYPE_TRANSFORM.CATEGORICAL_INT
)
[6]:
Feature(delinq_2yrs(Cat))

Then, we can change the type of addr_state from categorical to text.

[7]:
proj.create_type_transform_feature(
    "addr_state(Text)",  # new feature name
    "addr_state",        # parent name
    dr.enums.VARIABLE_TYPE_TRANSFORM.TEXT
)
[7]:
Feature(addr_state(Text))
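
To confirm that both transforms took effect, we can look the new features up the same way we queried the raw features earlier:

for feat in ['delinq_2yrs(Cat)', 'addr_state(Text)']:
    print(feat, dr.Feature.get(proj.id, feat).feature_type)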

Select Features for Modeling

Next, we create a new feature list in which the original delinq_2yrs and addr_state features are replaced by the modified features we just created.

[8]:
feature_list_name = "new_feature_list"

new_feature_list = proj.create_featurelist(
    feature_list_name,
    list((set(raw.features) - {"addr_state", "delinq_2yrs"}) |
         {"addr_state(Text)", "delinq_2yrs(Cat)"})
)
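
A quick sanity check that the swap happened as intended, using only the features attribute of the returned object:

assert 'delinq_2yrs(Cat)' in new_feature_list.features
assert 'addr_state' not in new_feature_list.features
print(len(new_feature_list.features), 'features in', new_feature_list.name)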

Run the Automated Modeling Process

Now we can start the modeling process. The target for this problem is called is_bad, a binary variable indicating whether the customer defaulted on a particular loan.

We specify LogLoss as the metric to optimize. Without a specification, DataRobot would automatically select an appropriate default metric.

The featurelist_id parameter tells DataRobot to model on that specific featurelist, rather than the default Informative Features.

Finally, the worker_count parameter specifies how many workers the project should use. Passing -1 tells DataRobot to set the worker count to the maximum available to you. You can also specify an exact number of workers, but the command will fail if you request more workers than your account allows. If you need more resources than have been allocated to you, consider upgrading your license.

The last command in this cell, wait_for_autopilot, blocks until the project is done, periodically printing the number of jobs in progress and in the queue so you can follow along. The automated model exploration process occasionally adds more jobs to the queue, so don't be alarmed if the number of jobs does not strictly decrease over time.

[9]:
proj.set_target(
    "is_bad",
    mode=dr.enums.AUTOPILOT_MODE.FULL_AUTO,
    metric="LogLoss",
    featurelist_id=new_feature_list.id,
    worker_count=-1
)

proj.wait_for_autopilot()
In progress: 17, queued: 21 (waited: 0s)
In progress: 20, queued: 18 (waited: 1s)
In progress: 20, queued: 18 (waited: 2s)
In progress: 20, queued: 18 (waited: 3s)
In progress: 19, queued: 18 (waited: 5s)
In progress: 20, queued: 17 (waited: 7s)
In progress: 20, queued: 16 (waited: 12s)
In progress: 20, queued: 12 (waited: 19s)
In progress: 19, queued: 8 (waited: 32s)
In progress: 20, queued: 2 (waited: 53s)
In progress: 16, queued: 0 (waited: 74s)
In progress: 16, queued: 0 (waited: 95s)
In progress: 16, queued: 0 (waited: 115s)
In progress: 16, queued: 0 (waited: 136s)
In progress: 15, queued: 0 (waited: 156s)
In progress: 13, queued: 0 (waited: 177s)
In progress: 8, queued: 0 (waited: 198s)
In progress: 1, queued: 0 (waited: 218s)
In progress: 19, queued: 0 (waited: 238s)
In progress: 13, queued: 0 (waited: 259s)
In progress: 6, queued: 0 (waited: 280s)
In progress: 2, queued: 0 (waited: 300s)
In progress: 13, queued: 0 (waited: 321s)
In progress: 9, queued: 0 (waited: 341s)
In progress: 6, queued: 0 (waited: 362s)
In progress: 2, queued: 0 (waited: 382s)
In progress: 2, queued: 0 (waited: 403s)
In progress: 1, queued: 0 (waited: 423s)
In progress: 1, queued: 0 (waited: 444s)
In progress: 1, queued: 0 (waited: 464s)
In progress: 20, queued: 12 (waited: 485s)
In progress: 20, queued: 12 (waited: 505s)
In progress: 20, queued: 6 (waited: 526s)
In progress: 19, queued: 3 (waited: 547s)
In progress: 19, queued: 0 (waited: 567s)
In progress: 18, queued: 0 (waited: 588s)
In progress: 16, queued: 0 (waited: 609s)
In progress: 13, queued: 0 (waited: 629s)
In progress: 11, queued: 0 (waited: 650s)
In progress: 7, queued: 0 (waited: 670s)
In progress: 3, queued: 0 (waited: 691s)
In progress: 3, queued: 0 (waited: 711s)
In progress: 3, queued: 0 (waited: 732s)
In progress: 1, queued: 0 (waited: 752s)
In progress: 0, queued: 0 (waited: 773s)
In progress: 1, queued: 0 (waited: 793s)
In progress: 0, queued: 0 (waited: 814s)
In progress: 4, queued: 0 (waited: 834s)
In progress: 2, queued: 0 (waited: 855s)
In progress: 4, queued: 0 (waited: 875s)
In progress: 4, queued: 0 (waited: 895s)
In progress: 2, queued: 0 (waited: 916s)
In progress: 2, queued: 0 (waited: 936s)
In progress: 0, queued: 0 (waited: 957s)
In progress: 0, queued: 0 (waited: 977s)

Exploring Trained Models

We can see how many models DataRobot built for this project by querying the project for its models. Each model has been tuned individually. Models that appear to share a name differ in the amount of data used for training, in their preprocessing steps, or both.

[10]:
models = proj.get_models()
for idx, model in enumerate(models):
    print('[{}]: {} - {}'.format(
        idx, model.metrics['LogLoss']['validation'], model.model_type))

[0]: 0.36614 - ENET Blender
[1]: 0.36661 - Advanced AVG Blender
[2]: 0.36684 - ENET Blender
[3]: 0.36686 - AVG Blender
[4]: 0.36712 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[5]: 0.36787 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[6]: 0.36791 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[7]: 0.36839 - Light Gradient Boosted Trees Classifier with Early Stopping
[8]: 0.3684 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[9]: 0.36872 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[10]: 0.36873 - Generalized Additive2 Model
[11]: 0.36938 - Generalized Additive2 Model
[12]: 0.36952 - RandomForest Classifier (Gini)
[13]: 0.36971 - Light Gradient Boosted Trees Classifier with Early Stopping
[14]: 0.36978 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[15]: 0.37004 - RandomForest Classifier (Entropy)
[16]: 0.37073 - RandomForest Classifier (Gini)
[17]: 0.37121 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)
[18]: 0.37235 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)
[19]: 0.37274 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[20]: 0.37275 - Vowpal Wabbit Classifier
[21]: 0.37283 - RandomForest Classifier (Entropy)
[22]: 0.37302 - ExtraTrees Classifier (Gini)
[23]: 0.37335 - Vowpal Wabbit Classifier
[24]: 0.37345 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[25]: 0.37357 - Nystroem Kernel SVM Classifier
[26]: 0.37362 - Nystroem Kernel SVM Classifier
[27]: 0.37368 - ExtraTrees Classifier (Gini)
[28]: 0.37417 - Gradient Boosted Trees Classifier with Early Stopping
[29]: 0.37495 - Gradient Boosted Trees Classifier with Early Stopping
[30]: 0.37548 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[31]: 0.37574 - Regularized Logistic Regression (L2)
[32]: 0.37607 - RandomForest Classifier (Gini)
[33]: 0.37631 - Vowpal Wabbit Classifier
[34]: 0.37667 - Light Gradient Boosted Trees Classifier with Early Stopping
[35]: 0.37767 - Generalized Additive2 Model
[36]: 0.37773 - Regularized Logistic Regression (L2)
[37]: 0.37814 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[38]: 0.37816 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)
[39]: 0.37862 - RandomForest Classifier (Entropy)
[40]: 0.37921 - Elastic-Net Classifier (L2 / Binomial Deviance) with Binned numeric features
[41]: 0.37929 - Regularized Logistic Regression (L2)
[42]: 0.37953 - Auto-tuned K-Nearest Neighbors Classifier (Euclidean Distance)
[43]: 0.38011 - Regularized Logistic Regression (L2)
[44]: 0.38013 - Elastic-Net Classifier (L2 / Binomial Deviance)
[45]: 0.38024 - Eureqa Generalized Additive Model Classifier (3000 Generations)
[46]: 0.38026 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - desc
[47]: 0.38037 - Gradient Boosted Trees Classifier
[48]: 0.38127 - Gradient Boosted Trees Classifier
[49]: 0.3813 - Light Gradient Boosting on ElasticNet Predictions
[50]: 0.38136 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - desc
[51]: 0.38176 - Elastic-Net Classifier (L2 / Binomial Deviance) with Binned numeric features
[52]: 0.38236 - eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features
[53]: 0.38237 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[54]: 0.3833 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - title
[55]: 0.38354 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) with Unsupervised Learning Features
[56]: 0.38373 - Elastic-Net Classifier (L2 / Binomial Deviance)
[57]: 0.38387 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)
[58]: 0.38401 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - emp_title
[59]: 0.38428 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - emp_title
[60]: 0.38435 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - emp_title
[61]: 0.38481 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - title
[62]: 0.38497 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - title
[63]: 0.38505 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - addr_state(Text)
[64]: 0.38524 - RandomForest Classifier (Gini)
[65]: 0.38532 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - title
[66]: 0.38572 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - addr_state(Text)
[67]: 0.38606 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - desc
[68]: 0.38639 - Majority Class Classifier
[69]: 0.38642 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - url
[70]: 0.38662 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - url
[71]: 0.387 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - addr_state(Text)
[72]: 0.38711 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - desc
[73]: 0.38726 - Regularized Logistic Regression (L2)
[74]: 0.38738 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - url
[75]: 0.38802 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - addr_state(Text)
[76]: 0.39071 - Gradient Boosted Greedy Trees Classifier with Early Stopping
[77]: 0.40035 - Auto-tuned K-Nearest Neighbors Classifier (Euclidean Distance)
[78]: 0.40057 - Breiman and Cutler Random Forest Classifier
[79]: 0.41186 - RuleFit Classifier
[80]: 0.43793 - Naive Bayes combiner classifier
[81]: 0.44045 - Auto-tuned K-Nearest Neighbors Classifier (Euclidean Distance)
[82]: 0.44713 - Logistic Regression
[83]: 0.48423 - Decision Tree Classifier (Gini)
[84]: 0.60431 - TensorFlow Neural Network Classifier
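
Each model object on the leaderboard exposes more than its score. A short sketch of digging into the top model; get_features_used and sample_pct are assumed to be available in your version of the client:

best = models[0]
print(best.model_type)
print(best.sample_pct)           # percentage of the data used to train this model
print(best.get_features_used())  # names of the features this model actually used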

Generating Predictions

Predictions: modeling workers vs. dedicated servers

There are two ways to generate predictions in DataRobot: using modeling workers or using dedicated prediction servers. In this notebook we will use the former, which is slower, occupies one of your modeling worker slots, and has no strong latency guarantees because the jobs go through the project queue. This method can be useful for developing and evaluating models. In a production environment, however, a faster, dedicated prediction server configuration may be more appropriate.

Three step process

As just mentioned, these predictions go through the modeling queue, so there is a three-step process: first, upload the dataset you want scored; second, request a prediction job; finally, retrieve your predictions when the job is done.

To simplify this example, we will make predictions on the same data used to train the models. We could use any of the models DataRobot generated, but we will select the model DataRobot recommends for deployment. DataRobot weighs both model accuracy and runtime when making this recommendation.

[11]:
dataset = proj.upload_dataset(filename)

model = dr.ModelRecommendation.get(
    proj.id,
    dr.enums.RECOMMENDED_MODEL_TYPE.RECOMMENDED_FOR_DEPLOYMENT
).get_model()

pred_job = model.request_predictions(dataset.id)
preds = pred_job.get_result_when_complete()

Results

This example is a binary (two-class) classification problem, so DataRobot estimates the probability that each row belongs to the positive class (a bad loan) and to the negative class (not a bad loan). positive_probability and class_1.0 both represent the former, and class_0.0 the latter. Given a configurable prediction_threshold, DataRobot also produces a prediction whose value is the predicted class for each row. The predictions can be matched to the uploaded prediction dataset through the row_id field.

[12]:
preds.head()
[12]:
   positive_probability  prediction  prediction_threshold  row_id  class_0.0  class_1.0
0              0.092677         0.0                   0.5       0   0.907323   0.092677
1              0.261903         0.0                   0.5       1   0.738097   0.261903
2              0.095587         0.0                   0.5       2   0.904413   0.095587
3              0.121502         0.0                   0.5       3   0.878498   0.121502
4              0.065982         0.0                   0.5       4   0.934018   0.065982
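
Because row_id is the zero-based index of each row in the uploaded file, the predictions can be joined back onto the source data with pandas. A minimal sketch, assuming the row order of the CSV is unchanged:

source = pd.read_csv(filename)
scored = source.join(preds.set_index('row_id'))
scored[['loan_amnt', 'is_bad', 'positive_probability', 'prediction']].head()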