Predicting Bad Loans

Overview

In this example we will build a binary classification model using the Lending Club dataset. Here is a list of things we will touch on in this notebook:

  • Installing the datarobot package
  • Configuring the client
  • Creating a project
  • Changing the datatype of some of the source columns
  • Selecting the source columns used in the modeling process
  • Running the automated modeling process
  • Generating predictions

Prerequisites

In order to run this notebook yourself, you will need the following:

  • This notebook. If you are viewing this in the HTML documentation bundle, you can download all of the example notebooks and supporting materials from Downloads.
  • The required dataset, which is included in the same directory as this notebook.
  • A DataRobot API token. You can find your API token by logging into the DataRobot Web User Interface and looking in your Profile.

Installing the datarobot package

The datarobot package is hosted on PyPI. You can install it via:

pip install datarobot

from the command line. Its main dependencies are numpy and pandas, which can take some time to install on a new system. We highly recommend using a virtual environment to avoid conflicts with other packages in your system-wide Python installation; a minimal setup is sketched below.
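
For example (the environment name datarobot-env is only a placeholder):

python -m venv datarobot-env
source datarobot-env/bin/activate   # on Windows: datarobot-env\Scripts\activate
pip install datarobot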

Getting Started

This line imports the datarobot package. By convention, we always import it with the alias dr.

[1]:
import datarobot as dr

Other Important Imports

We’ll use these in this notebook as well. If the previous cell and the following cell both run without issue, you’re in good shape.

[2]:
import datetime
import pandas as pd

Configure the Python Client

DataRobot recommends connecting with a configuration file that contains your credentials (endpoint and API key). For more information about authentication, see the API quickstart guide.

Save your configuration in a file under your home directory, at ~/.config/datarobot/drconfig.yaml.
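
The file itself is plain YAML. A minimal example with placeholder values (the endpoint shown assumes the managed DataRobot cloud; self-hosted installations use their own host):

token: <API TOKEN>
endpoint: https://app.datarobot.com/api/v2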

[3]:
# Initialization with arguments
# dr.Client(token='<API TOKEN>', endpoint='https://<YOUR ENDPOINT>/api/v2/')

# Initialization with a config file in the same directory as this notebook
# dr.Client(config_path='drconfig.yaml')

# Initialization with a config file located at
# ~/.config/datarobot/drconfig.yaml
dr.Client()
[3]:
<datarobot.rest.RESTClientObject at 0x11043b210>
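
The client can also pick up credentials from the environment, so that a bare dr.Client() call works without a config file. A sketch, assuming your client version reads the DATAROBOT_ENDPOINT and DATAROBOT_API_TOKEN environment variables:

export DATAROBOT_ENDPOINT='https://<YOUR ENDPOINT>/api/v2/'
export DATAROBOT_API_TOKEN='<API TOKEN>'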

Create the Project

Here, we use the datarobot package to upload a new file and create a project. The name of the project is optional, but it helps when sorting through many projects on DataRobot.

[4]:
filename = '10K_Lending_Club_Loans.csv'
now = datetime.datetime.now().strftime('%Y-%m-%dT%H:%M')
project_name = '10K_Lending_Club_Loans_{}'.format(now)
proj = dr.Project.create(sourcedata=filename,
                         project_name=project_name)
print('Project ID: {}'.format(proj.id))
Project ID: 5c007ffa784cc602016a9f06
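
To confirm the project exists (or to find it again later), you can list your projects and filter by name; this sketch simply reuses the project_name prefix from above:

for p in dr.Project.list():
    if p.project_name.startswith('10K_Lending_Club_Loans'):
        print(p.id, p.project_name)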

Examine the Raw Features

First, retrieve the raw feature list, which corresponds to the columns of the input file.

[5]:
raw = [feat_list for feat_list in proj.get_featurelists()
       if feat_list.name == 'Raw Features'][0]
raw_features = [
    {
        "name": feat,
        "type": dr.Feature.get(proj.id, feat).feature_type
    }
    for feat in raw.features
]
pd.DataFrame.from_dict(raw_features)
[5]:
    name                         type
0   loan_amnt                    Numeric
1   funded_amnt                  Numeric
2   term                         Categorical
3   int_rate                     Percentage
4   installment                  Numeric
5   grade                        Categorical
6   sub_grade                    Categorical
7   emp_title                    Text
8   emp_length                   Categorical
9   home_ownership               Categorical
10  annual_inc                   Numeric
11  verification_status          Categorical
12  pymnt_plan                   Categorical
13  url                          Text
14  desc                         Text
15  purpose                      Categorical
16  title                        Text
17  zip_code                     Categorical
18  addr_state                   Categorical
19  dti                          Numeric
20  delinq_2yrs                  Numeric
21  earliest_cr_line             Date
22  inq_last_6mths               Numeric
23  mths_since_last_delinq       Numeric
24  mths_since_last_record       Numeric
25  open_acc                     Numeric
26  pub_rec                      Numeric
27  revol_bal                    Numeric
28  revol_util                   Numeric
29  total_acc                    Numeric
30  initial_list_status          Categorical
31  mths_since_last_major_derog  None
32  policy_code                  Categorical
33  is_bad                       Numeric

Modify Feature Types

We can tweak features to improve the modeling. For example, we might change delinq_2yrs from an integer into a categorical.

[6]:
proj.create_type_transform_feature(
    "delinq_2yrs(Cat)",  # new feature name
    "delinq_2yrs",       # parent name
    dr.enums.VARIABLE_TYPE_TRANSFORM.CATEGORICAL_INT
)
[6]:
Feature(delinq_2yrs(Cat))

Then, we can change the type of addr_state from categorical to text.

[7]:
proj.create_type_transform_feature(
    "addr_state(Text)",  # new feature name
    "addr_state",        # parent name
    dr.enums.VARIABLE_TYPE_TRANSFORM.TEXT
)
[7]:
Feature(addr_state(Text))
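
To confirm that both transforms took effect, we can look the new features up the same way we queried the raw features earlier:

for feat in ['delinq_2yrs(Cat)', 'addr_state(Text)']:
    print(feat, dr.Feature.get(proj.id, feat).feature_type)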

Select Features for Modeling

Next, we create a new feature list in which the original delinq_2yrs and addr_state features are replaced by the modified features we just created.

[8]:
feature_list_name = "new_feature_list"

new_feature_list = proj.create_featurelist(
    feature_list_name,
    list((set(raw.features) - {"addr_state", "delinq_2yrs"}) |
         {"addr_state(Text)", "delinq_2yrs(Cat)"})
)
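
A quick sanity check that the swap happened as intended, using only the features attribute of the returned object:

assert 'delinq_2yrs(Cat)' in new_feature_list.features
assert 'addr_state' not in new_feature_list.features
print(len(new_feature_list.features), 'features in', new_feature_list.name)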

Run the Automated Modeling Process

Now we can start the modeling process. The target for this problem is called is_bad, a binary variable indicating whether the customer defaulted on a particular loan.

We specify LogLoss as the metric to optimize. Without a specification, DataRobot would automatically select an appropriate default metric.

The featurelist_id parameter tells DataRobot to model on that specific featurelist, rather than the default Informative Features.

Finally, the worker_count parameter specifies how many workers the project should use. Passing -1 tells DataRobot to set the worker count to the maximum available to you. You can also specify an exact number of workers, but the command will fail if you request more workers than your account allows. If you need more resources than have been allocated to you, consider upgrading your license.

The last command in this cell, wait_for_autopilot, blocks until the project is done, periodically printing the number of jobs in progress and in the queue so you can follow along. The automated model exploration process occasionally adds more jobs to the queue, so don't be alarmed if the number of jobs does not strictly decrease over time.

[9]:
proj.set_target(
    "is_bad",
    mode=dr.enums.AUTOPILOT_MODE.FULL_AUTO,
    metric="LogLoss",
    featurelist_id=new_feature_list.id,
    worker_count=-1
)

proj.wait_for_autopilot()
In progress: 17, queued: 21 (waited: 0s)
In progress: 20, queued: 18 (waited: 1s)
In progress: 20, queued: 18 (waited: 2s)
In progress: 20, queued: 18 (waited: 3s)
In progress: 19, queued: 18 (waited: 5s)
In progress: 20, queued: 17 (waited: 7s)
In progress: 20, queued: 16 (waited: 12s)
In progress: 20, queued: 12 (waited: 19s)
In progress: 19, queued: 8 (waited: 32s)
In progress: 20, queued: 2 (waited: 53s)
In progress: 16, queued: 0 (waited: 74s)
In progress: 16, queued: 0 (waited: 95s)
In progress: 16, queued: 0 (waited: 115s)
In progress: 16, queued: 0 (waited: 136s)
In progress: 15, queued: 0 (waited: 156s)
In progress: 13, queued: 0 (waited: 177s)
In progress: 8, queued: 0 (waited: 198s)
In progress: 1, queued: 0 (waited: 218s)
In progress: 19, queued: 0 (waited: 238s)
In progress: 13, queued: 0 (waited: 259s)
In progress: 6, queued: 0 (waited: 280s)
In progress: 2, queued: 0 (waited: 300s)
In progress: 13, queued: 0 (waited: 321s)
In progress: 9, queued: 0 (waited: 341s)
In progress: 6, queued: 0 (waited: 362s)
In progress: 2, queued: 0 (waited: 382s)
In progress: 2, queued: 0 (waited: 403s)
In progress: 1, queued: 0 (waited: 423s)
In progress: 1, queued: 0 (waited: 444s)
In progress: 1, queued: 0 (waited: 464s)
In progress: 20, queued: 12 (waited: 485s)
In progress: 20, queued: 12 (waited: 505s)
In progress: 20, queued: 6 (waited: 526s)
In progress: 19, queued: 3 (waited: 547s)
In progress: 19, queued: 0 (waited: 567s)
In progress: 18, queued: 0 (waited: 588s)
In progress: 16, queued: 0 (waited: 609s)
In progress: 13, queued: 0 (waited: 629s)
In progress: 11, queued: 0 (waited: 650s)
In progress: 7, queued: 0 (waited: 670s)
In progress: 3, queued: 0 (waited: 691s)
In progress: 3, queued: 0 (waited: 711s)
In progress: 3, queued: 0 (waited: 732s)
In progress: 1, queued: 0 (waited: 752s)
In progress: 0, queued: 0 (waited: 773s)
In progress: 1, queued: 0 (waited: 793s)
In progress: 0, queued: 0 (waited: 814s)
In progress: 4, queued: 0 (waited: 834s)
In progress: 2, queued: 0 (waited: 855s)
In progress: 4, queued: 0 (waited: 875s)
In progress: 4, queued: 0 (waited: 895s)
In progress: 2, queued: 0 (waited: 916s)
In progress: 2, queued: 0 (waited: 936s)
In progress: 0, queued: 0 (waited: 957s)
In progress: 0, queued: 0 (waited: 977s)

Exploring Trained Models

We can see how many models DataRobot built for this project by querying the project for its models. Each model has been tuned individually. Models that appear to share a name differ in the amount of data used for training, in their preprocessing steps, or both.

[10]:
models = proj.get_models()
for idx, model in enumerate(models):
    print('[{}]: {} - {}'.format(
        idx, model.metrics['LogLoss']['validation'], model.model_type))

[0]: 0.36614 - ENET Blender
[1]: 0.36661 - Advanced AVG Blender
[2]: 0.36684 - ENET Blender
[3]: 0.36686 - AVG Blender
[4]: 0.36712 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[5]: 0.36787 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[6]: 0.36791 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[7]: 0.36839 - Light Gradient Boosted Trees Classifier with Early Stopping
[8]: 0.3684 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[9]: 0.36872 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[10]: 0.36873 - Generalized Additive2 Model
[11]: 0.36938 - Generalized Additive2 Model
[12]: 0.36952 - RandomForest Classifier (Gini)
[13]: 0.36971 - Light Gradient Boosted Trees Classifier with Early Stopping
[14]: 0.36978 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[15]: 0.37004 - RandomForest Classifier (Entropy)
[16]: 0.37073 - RandomForest Classifier (Gini)
[17]: 0.37121 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)
[18]: 0.37235 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)
[19]: 0.37274 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[20]: 0.37275 - Vowpal Wabbit Classifier
[21]: 0.37283 - RandomForest Classifier (Entropy)
[22]: 0.37302 - ExtraTrees Classifier (Gini)
[23]: 0.37335 - Vowpal Wabbit Classifier
[24]: 0.37345 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[25]: 0.37357 - Nystroem Kernel SVM Classifier
[26]: 0.37362 - Nystroem Kernel SVM Classifier
[27]: 0.37368 - ExtraTrees Classifier (Gini)
[28]: 0.37417 - Gradient Boosted Trees Classifier with Early Stopping
[29]: 0.37495 - Gradient Boosted Trees Classifier with Early Stopping
[30]: 0.37548 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[31]: 0.37574 - Regularized Logistic Regression (L2)
[32]: 0.37607 - RandomForest Classifier (Gini)
[33]: 0.37631 - Vowpal Wabbit Classifier
[34]: 0.37667 - Light Gradient Boosted Trees Classifier with Early Stopping
[35]: 0.37767 - Generalized Additive2 Model
[36]: 0.37773 - Regularized Logistic Regression (L2)
[37]: 0.37814 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[38]: 0.37816 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)
[39]: 0.37862 - RandomForest Classifier (Entropy)
[40]: 0.37921 - Elastic-Net Classifier (L2 / Binomial Deviance) with Binned numeric features
[41]: 0.37929 - Regularized Logistic Regression (L2)
[42]: 0.37953 - Auto-tuned K-Nearest Neighbors Classifier (Euclidean Distance)
[43]: 0.38011 - Regularized Logistic Regression (L2)
[44]: 0.38013 - Elastic-Net Classifier (L2 / Binomial Deviance)
[45]: 0.38024 - Eureqa Generalized Additive Model Classifier (3000 Generations)
[46]: 0.38026 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - desc
[47]: 0.38037 - Gradient Boosted Trees Classifier
[48]: 0.38127 - Gradient Boosted Trees Classifier
[49]: 0.3813 - Light Gradient Boosting on ElasticNet Predictions
[50]: 0.38136 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - desc
[51]: 0.38176 - Elastic-Net Classifier (L2 / Binomial Deviance) with Binned numeric features
[52]: 0.38236 - eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features
[53]: 0.38237 - eXtreme Gradient Boosted Trees Classifier with Early Stopping
[54]: 0.3833 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - title
[55]: 0.38354 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) with Unsupervised Learning Features
[56]: 0.38373 - Elastic-Net Classifier (L2 / Binomial Deviance)
[57]: 0.38387 - Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)
[58]: 0.38401 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - emp_title
[59]: 0.38428 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - emp_title
[60]: 0.38435 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - emp_title
[61]: 0.38481 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - title
[62]: 0.38497 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - title
[63]: 0.38505 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - addr_state(Text)
[64]: 0.38524 - RandomForest Classifier (Gini)
[65]: 0.38532 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - title
[66]: 0.38572 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - addr_state(Text)
[67]: 0.38606 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - desc
[68]: 0.38639 - Majority Class Classifier
[69]: 0.38642 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - url
[70]: 0.38662 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - url
[71]: 0.387 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - addr_state(Text)
[72]: 0.38711 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - desc
[73]: 0.38726 - Regularized Logistic Regression (L2)
[74]: 0.38738 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - url
[75]: 0.38802 - Auto-Tuned Word N-Gram Text Modeler using token occurrences - addr_state(Text)
[76]: 0.39071 - Gradient Boosted Greedy Trees Classifier with Early Stopping
[77]: 0.40035 - Auto-tuned K-Nearest Neighbors Classifier (Euclidean Distance)
[78]: 0.40057 - Breiman and Cutler Random Forest Classifier
[79]: 0.41186 - RuleFit Classifier
[80]: 0.43793 - Naive Bayes combiner classifier
[81]: 0.44045 - Auto-tuned K-Nearest Neighbors Classifier (Euclidean Distance)
[82]: 0.44713 - Logistic Regression
[83]: 0.48423 - Decision Tree Classifier (Gini)
[84]: 0.60431 - TensorFlow Neural Network Classifier
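
Each model object on the leaderboard exposes more than its score. A short sketch of digging into the top model; get_features_used and sample_pct are assumed to be available in your version of the client:

best = models[0]
print(best.model_type)
print(best.sample_pct)           # percentage of the data used to train this model
print(best.get_features_used())  # names of the features this model actually used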

Generating Predictions

Predictions: modeling workers vs. dedicated servers

There are two ways to generate predictions in DataRobot: using modeling workers or using dedicated prediction servers. In this notebook we will use the former, which is slower, occupies one of your modeling worker slots, and has no strong latency guarantees because the jobs go through the project queue. This method can be useful for developing and evaluating models. In a production environment, however, a faster, dedicated prediction server configuration may be more appropriate.

Three step process

As just mentioned, these predictions go through the modeling queue, so there is a three-step process: first, upload the dataset you want scored; second, request a prediction job; finally, retrieve your predictions when the job is done.

To simplify this example, we will make predictions on the same data used to train the models. We could use any of the models DataRobot generated, but we will select the model DataRobot recommends for deployment. DataRobot weighs both model accuracy and runtime when making this recommendation.

[11]:
dataset = proj.upload_dataset(filename)

model = dr.ModelRecommendation.get(
    proj.id,
    dr.enums.RECOMMENDED_MODEL_TYPE.RECOMMENDED_FOR_DEPLOYMENT
).get_model()

pred_job = model.request_predictions(dataset.id)
preds = pred_job.get_result_when_complete()

Results

This example is a binary (two-class) classification problem, so DataRobot estimates the probability that each row belongs to the positive class (a bad loan) and to the negative class (not a bad loan). positive_probability and class_1.0 both represent the former, and class_0.0 the latter. Given a configurable prediction_threshold, DataRobot also produces a prediction whose value is the predicted class for each row. The predictions can be matched to the uploaded prediction dataset through the row_id field.

[12]:
preds.head()
[12]:
   positive_probability  prediction  prediction_threshold  row_id  class_0.0  class_1.0
0              0.092677         0.0                   0.5       0   0.907323   0.092677
1              0.261903         0.0                   0.5       1   0.738097   0.261903
2              0.095587         0.0                   0.5       2   0.904413   0.095587
3              0.121502         0.0                   0.5       3   0.878498   0.121502
4              0.065982         0.0                   0.5       4   0.934018   0.065982
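
Because row_id is the zero-based index of each row in the uploaded file, the predictions can be joined back onto the source data with pandas. A minimal sketch, assuming the row order of the CSV is unchanged:

source = pd.read_csv(filename)
scored = source.join(preds.set_index('row_id'))
scored[['loan_amnt', 'is_bad', 'positive_probability', 'prediction']].head()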