Advanced Model Insights

Preparation

This notebook explores additional options for model insights added in the v2.7 release of the DataRobot API.

Let’s start with importing some packages that will help us with presentation (if you don’t have them installed already, you’ll have to install them to run this notebook).

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import pandas as pd
import datarobot as dr
import numpy as np
from datarobot.enums import AUTOPILOT_MODE
from datarobot.errors import ClientError

Set Up

Now configure your DataRobot client (unless you’re using a configuration file)...

In [2]:
dr.Client(token='<API TOKEN>', endpoint='https://<YOUR ENDPOINT>/api/v2/')
Out[2]:
<datarobot.rest.RESTClientObject at 0x10bc01e50>

Create Project with features

Create a new project using the 10K_diabetes dataset. This dataset poses a binary classification problem on the target readmitted, and the resulting project is an excellent example of the advanced model insights available from DataRobot models.

In [3]:
url = 'https://s3.amazonaws.com/datarobot_public_datasets/10k_diabetes.xlsx'
project = dr.Project.create(url, project_name='10K Advanced Modeling')
print('Project ID: {}'.format(project.id))
Project ID: 598dec4bc8089177139da4ad
In [4]:
# Increase the worker count to make the project go faster.
project.set_worker_count(8)
Out[4]:
Project(10K Advanced Modeling)
In [5]:
project.set_target('readmitted', mode=AUTOPILOT_MODE.QUICK)
Out[5]:
Project(10K Advanced Modeling)
In [6]:
project.wait_for_autopilot()
In progress: 2, queued: 0 (waited: 0s)
In progress: 2, queued: 0 (waited: 1s)
In progress: 2, queued: 0 (waited: 2s)
In progress: 2, queued: 0 (waited: 2s)
In progress: 2, queued: 0 (waited: 4s)
In progress: 2, queued: 0 (waited: 6s)
In progress: 2, queued: 0 (waited: 9s)
In progress: 2, queued: 0 (waited: 16s)
In progress: 2, queued: 0 (waited: 29s)
In progress: 1, queued: 0 (waited: 50s)
In progress: 1, queued: 0 (waited: 71s)
In progress: 1, queued: 0 (waited: 91s)
In progress: 1, queued: 0 (waited: 111s)
In progress: 1, queued: 0 (waited: 132s)
In progress: 1, queued: 0 (waited: 152s)
In progress: 1, queued: 0 (waited: 172s)
In progress: 1, queued: 0 (waited: 193s)
In progress: 1, queued: 0 (waited: 213s)
In progress: 1, queued: 0 (waited: 233s)
In progress: 1, queued: 0 (waited: 254s)
In progress: 1, queued: 0 (waited: 274s)
In progress: 0, queued: 1 (waited: 295s)
In progress: 1, queued: 0 (waited: 315s)
In progress: 0, queued: 0 (waited: 335s)
In progress: 0, queued: 0 (waited: 356s)
In [7]:
models = project.get_models()
model = models[0]
model
Out[7]:
Model(u'AVG Blender')

Let’s define some color constants to replicate the visual style of the DataRobot lift chart.

In [8]:
dr_dark_blue = '#08233F'
dr_blue = '#1F77B4'
dr_orange = '#FF7F0E'
dr_red = '#BE3C28'

Feature Impact

Feature Impact is available for all model types and works by altering input data and observing the effect on a model’s score. It is an on-demand feature, meaning that you must initiate a calculation to see the results. Once you have had DataRobot compute the feature impact for a model, that information is saved with the project.

Feature Impact measures how important a feature is in the context of a model. That is, it measures how much the accuracy of a model would decrease if that feature were removed.
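Conceptually, this is similar to permutation importance: shuffle one column at a time and see how much the model's score degrades. A minimal standalone sketch of the idea on synthetic scikit-learn data (an illustration of the concept only, not DataRobot's exact algorithm):

```python
# Permutation-style importance sketch: shuffle one column at a time
# and measure the drop in accuracy. Toy data and model -- this is
# NOT DataRobot's exact Feature Impact algorithm.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           random_state=0)
clf = LogisticRegression().fit(X, y)
baseline = clf.score(X, y)

rng = np.random.RandomState(0)
impacts = []
for col in range(X.shape[1]):
    X_shuffled = X.copy()
    rng.shuffle(X_shuffled[:, col])  # destroy this feature's signal
    impacts.append(baseline - clf.score(X_shuffled, y))
# Features the model relies on show the largest accuracy drop
```

Features with near-zero (or slightly negative) impact contribute little to the model, which is why the bar chart below colors negative values differently.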

In [9]:
try:
    # Check first if they've already been computed
    feature_impacts = model.get_feature_impact()
except dr.errors.ClientError as e:
    # Status code of 404 means the feature impact hasn't been computed yet
    assert e.status_code == 404
    impact_job = model.request_feature_impact()
    # We must wait for the async job to finish; 4 minutes should be plenty
    feature_impacts = impact_job.get_result_when_complete(4 * 60)
In [10]:
# Formats the ticks from a float into a percent
percent_tick_fmt = mtick.PercentFormatter(xmax=1.0)

impact_df = pd.DataFrame(feature_impacts)
impact_df.sort_values(by='impactNormalized', ascending=True, inplace=True)

# Positive values are blue, negative are red
bar_colors = impact_df.impactNormalized.apply(lambda x: dr_red if x < 0
                                              else dr_blue)

ax = impact_df.plot.barh(x='featureName', y='impactNormalized',
                         legend=False,
                         color=bar_colors,
                         figsize=(10, 14))
ax.xaxis.set_major_formatter(percent_tick_fmt)
ax.xaxis.set_tick_params(labeltop=True)
ax.xaxis.grid(True, alpha=0.2)
ax.set_facecolor(dr_dark_blue)

plt.ylabel('')
plt.xlabel('Effect')
plt.xlim((None, 1))  # Allow for negative impact
plt.title('Feature Impact', y=1.04)
Out[10]:
Text(0.5,1.04,u'Feature Impact')
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_14_1.png

Lift Chart

A lift chart shows how close, in aggregate, a model’s predictions are to the actual target values in the training data.

The lift chart data we retrieve from the server includes the average model prediction and the average actual target values, sorted by the prediction values in ascending order and split into up to 60 bins.

The bin_weight value shows how much weight is in each bin (the number of rows, for unweighted projects).

In [11]:
lc = model.get_lift_chart('validation')
lc
Out[11]:
LiftChart(validation)
In [12]:
bins_df = pd.DataFrame(lc.bins)
bins_df.head()
Out[12]:
actual bin_weight predicted
0 0.037037 27.0 0.088575
1 0.111111 27.0 0.131661
2 0.192308 26.0 0.153389
3 0.222222 27.0 0.167035
4 0.111111 27.0 0.179245

Let’s define our rebinning and plotting functions.

In [13]:
def rebin_df(raw_df, number_of_bins):
    # number_of_bins must divide 60 evenly for exact rebinning
    cols = ['bin', 'actual_mean', 'predicted_mean', 'bin_weight']
    current_prediction_total = 0
    current_actual_total = 0
    current_row_total = 0
    rows = []
    bin_size = 60 // number_of_bins
    for row_id, data in raw_df.iterrows():
        current_prediction_total += data['predicted'] * data['bin_weight']
        current_actual_total += data['actual'] * data['bin_weight']
        current_row_total += data['bin_weight']

        if (row_id + 1) % bin_size == 0:
            rows.append({
                'bin': (row_id + 1) / bin_size,
                'actual_mean': current_actual_total / current_row_total,
                'predicted_mean': current_prediction_total / current_row_total,
                'bin_weight': current_row_total
            })
            current_prediction_total = 0
            current_actual_total = 0
            current_row_total = 0
    return pd.DataFrame(rows, columns=cols)


def matplotlib_lift(bins_df, bin_count, ax):
    grouped = rebin_df(bins_df, bin_count)
    ax.plot(range(1, len(grouped) + 1), grouped['predicted_mean'],
            marker='+', lw=1, color=dr_blue, label='Predicted')
    ax.plot(range(1, len(grouped) + 1), grouped['actual_mean'],
            marker='*', lw=1, color=dr_orange, label='Actual')
    ax.set_xlim([0, len(grouped) + 1])
    ax.set_facecolor(dr_dark_blue)
    ax.legend(loc='best')
    ax.set_title('Lift chart {} bins'.format(bin_count))
    ax.set_xlabel('Sorted Prediction')
    ax.set_ylabel('Value')
    return grouped

Now we can reproduce all of the lift charts offered in the DataRobot web application.

Note 1: While this method works for any bin count up to 60, the most reliable results are achieved when the number of bins is a divisor of 60.

Note 2: This visualization method will NOT work for bin counts greater than 60, because DataRobot does not provide data at a finer resolution.
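Since the server returns up to 60 bins, the bin counts that rebin exactly are the divisors of 60, which you can list directly:

```python
# Bin counts that divide 60 evenly rebin the 60 source bins exactly
divisors_of_60 = [b for b in range(1, 61) if 60 % b == 0]
```

The cell below uses a subset of these counts.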

In [14]:
bin_counts = [10, 12, 15, 20, 30, 60]
f, axarr = plt.subplots(len(bin_counts))
f.set_size_inches((8, 4 * len(bin_counts)))

rebinned_dfs = []
for bin_count, ax in zip(bin_counts, axarr):
    rebinned_dfs.append(matplotlib_lift(bins_df, bin_count, ax))
plt.tight_layout()
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_21_0.png

Rebinned Data

You may want to interact with the raw re-binned data for use in third party tools, or for additional evaluation.

In [15]:
for rebinned in rebinned_dfs:
    print('Number of bins: {}'.format(len(rebinned.index)))
    print(rebinned)
Number of bins: 10
    bin  actual_mean  predicted_mean  bin_weight
0   1.0      0.13125        0.151517       160.0
1   2.0      0.20000        0.225520       160.0
2   3.0      0.23125        0.272101       160.0
3   4.0      0.31250        0.310227       160.0
4   5.0      0.40000        0.350982       160.0
5   6.0      0.40000        0.395550       160.0
6   7.0      0.43750        0.441662       160.0
7   8.0      0.55625        0.494121       160.0
8   9.0      0.60625        0.561798       160.0
9  10.0      0.69375        0.710759       160.0
Number of bins: 12
     bin  actual_mean  predicted_mean  bin_weight
0    1.0     0.134328        0.143911       134.0
1    2.0     0.180451        0.211710       133.0
2    3.0     0.225564        0.253760       133.0
3    4.0     0.276119        0.289034       134.0
4    5.0     0.308271        0.320351       133.0
5    6.0     0.406015        0.354336       133.0
6    7.0     0.406015        0.391651       133.0
7    8.0     0.395522        0.430018       134.0
8    9.0     0.518797        0.470626       133.0
9   10.0     0.639098        0.519144       133.0
10  11.0     0.586466        0.583965       133.0
11  12.0     0.686567        0.728384       134.0
Number of bins: 15
     bin  actual_mean  predicted_mean  bin_weight
0    1.0     0.140187        0.134995       107.0
1    2.0     0.149533        0.195819       107.0
2    3.0     0.207547        0.235178       106.0
3    4.0     0.242991        0.264718       107.0
4    5.0     0.280374        0.292256       107.0
5    6.0     0.292453        0.316757       106.0
6    7.0     0.373832        0.344156       107.0
7    8.0     0.452830        0.372372       106.0
8    9.0     0.373832        0.403261       107.0
9   10.0     0.401869        0.433869       107.0
10  11.0     0.528302        0.465610       106.0
11  12.0     0.560748        0.504174       107.0
12  13.0     0.603774        0.547079       106.0
13  14.0     0.635514        0.612989       107.0
14  15.0     0.710280        0.747934       107.0
Number of bins: 20
     bin  actual_mean  predicted_mean  bin_weight
0    1.0       0.1125        0.124181        80.0
1    2.0       0.1500        0.178852        80.0
2    3.0       0.1875        0.211547        80.0
3    4.0       0.2125        0.239493        80.0
4    5.0       0.2375        0.260820        80.0
5    6.0       0.2250        0.283381        80.0
6    7.0       0.3375        0.300590        80.0
7    8.0       0.2875        0.319864        80.0
8    9.0       0.3750        0.340949        80.0
9   10.0       0.4250        0.361015        80.0
10  11.0       0.4000        0.383998        80.0
11  12.0       0.4000        0.407102        80.0
12  13.0       0.4125        0.429924        80.0
13  14.0       0.4625        0.453401        80.0
14  15.0       0.5250        0.479391        80.0
15  16.0       0.5875        0.508850        80.0
16  17.0       0.6125        0.541193        80.0
17  18.0       0.6000        0.582403        80.0
18  19.0       0.6750        0.649406        80.0
19  20.0       0.7125        0.772112        80.0
Number of bins: 30
     bin  actual_mean  predicted_mean  bin_weight
0    1.0     0.074074        0.110118        54.0
1    2.0     0.207547        0.160341        53.0
2    3.0     0.113208        0.184872        53.0
3    4.0     0.185185        0.206563        54.0
4    5.0     0.207547        0.227254        53.0
5    6.0     0.207547        0.243102        53.0
6    7.0     0.240741        0.257413        54.0
7    8.0     0.245283        0.272161        53.0
8    9.0     0.207547        0.287006        53.0
9   10.0     0.351852        0.297408        54.0
10  11.0     0.301887        0.310547        53.0
11  12.0     0.283019        0.322968        53.0
12  13.0     0.396226        0.337750        53.0
13  14.0     0.351852        0.350444        54.0
14  15.0     0.452830        0.364761        53.0
15  16.0     0.452830        0.379984        53.0
16  17.0     0.351852        0.395395        54.0
17  18.0     0.396226        0.411274        53.0
18  19.0     0.358491        0.425801        53.0
19  20.0     0.444444        0.441788        54.0
20  21.0     0.509434        0.457396        53.0
21  22.0     0.547170        0.473825        53.0
22  23.0     0.490566        0.494573        53.0
23  24.0     0.629630        0.513596        54.0
24  25.0     0.716981        0.534683        53.0
25  26.0     0.490566        0.559476        53.0
26  27.0     0.611111        0.590690        54.0
27  28.0     0.660377        0.635708        53.0
28  29.0     0.622642        0.695099        53.0
29  30.0     0.796296        0.799789        54.0
Number of bins: 60
     bin  actual_mean  predicted_mean  bin_weight
0    1.0     0.037037        0.088575        27.0
1    2.0     0.111111        0.131661        27.0
2    3.0     0.192308        0.153389        26.0
3    4.0     0.222222        0.167035        27.0
4    5.0     0.111111        0.179245        27.0
5    6.0     0.115385        0.190716        26.0
6    7.0     0.185185        0.201566        27.0
7    8.0     0.185185        0.211559        27.0
8    9.0     0.192308        0.221900        26.0
9   10.0     0.222222        0.232409        27.0
10  11.0     0.074074        0.239081        27.0
11  12.0     0.346154        0.247278        26.0
12  13.0     0.222222        0.253636        27.0
13  14.0     0.259259        0.261190        27.0
14  15.0     0.230769        0.267897        26.0
15  16.0     0.259259        0.276266        27.0
16  17.0     0.185185        0.283961        27.0
17  18.0     0.230769        0.290167        26.0
18  19.0     0.296296        0.294495        27.0
19  20.0     0.407407        0.300322        27.0
20  21.0     0.307692        0.307198        26.0
21  22.0     0.296296        0.313772        27.0
22  23.0     0.269231        0.319444        26.0
23  24.0     0.296296        0.326361        27.0
24  25.0     0.370370        0.334460        27.0
25  26.0     0.423077        0.341167        26.0
26  27.0     0.333333        0.347227        27.0
27  28.0     0.370370        0.353661        27.0
28  29.0     0.423077        0.361275        26.0
29  30.0     0.481481        0.368118        27.0
30  31.0     0.481481        0.376098        27.0
31  32.0     0.423077        0.384019        26.0
32  33.0     0.296296        0.391877        27.0
33  34.0     0.407407        0.398914        27.0
34  35.0     0.423077        0.407656        26.0
35  36.0     0.370370        0.414758        27.0
36  37.0     0.259259        0.421825        27.0
37  38.0     0.461538        0.429930        26.0
38  39.0     0.518519        0.438017        27.0
39  40.0     0.370370        0.445558        27.0
40  41.0     0.423077        0.453398        26.0
41  42.0     0.592593        0.461246        27.0
42  43.0     0.500000        0.468806        26.0
43  44.0     0.592593        0.478657        27.0
44  45.0     0.481481        0.490318        27.0
45  46.0     0.500000        0.498991        26.0
46  47.0     0.592593        0.507938        27.0
47  48.0     0.666667        0.519255        27.0
48  49.0     0.692308        0.528170        26.0
49  50.0     0.740741        0.540955        27.0
50  51.0     0.407407        0.553971        27.0
51  52.0     0.576923        0.565192        26.0
52  53.0     0.666667        0.582203        27.0
53  54.0     0.555556        0.599178        27.0
54  55.0     0.730769        0.619919        26.0
55  56.0     0.592593        0.650911        27.0
56  57.0     0.703704        0.676295        27.0
57  58.0     0.538462        0.714627        26.0
58  59.0     0.814815        0.763131        27.0
59  60.0     0.777778        0.836447        27.0

ROC curve

The receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
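As a quick refresher, both rates are simple ratios of confusion-matrix counts. A standalone sketch (using, for concreteness, the counts this notebook arrives at below):

```python
def tpr_fpr(tp, fp, tn, fn):
    """Return (true positive rate, false positive rate) from counts."""
    return tp / (tp + fn), fp / (fp + tn)

# Counts taken from the threshold metrics computed later in this notebook
tpr, fpr = tpr_fpr(tp=491, fp=453, tn=512, fn=144)
```

Each point on the ROC curve is one such (fpr, tpr) pair at a particular threshold.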

To retrieve ROC curve information, use the Model.get_roc_curve method.

In [16]:
roc = model.get_roc_curve('validation')
roc
Out[16]:
RocCurve(validation)
In [17]:
df = pd.DataFrame(roc.roc_points)
df.head()
Out[17]:
accuracy f1_score false_negative_score false_positive_rate false_positive_score matthews_correlation_coefficient negative_predictive_value positive_predictive_value threshold true_negative_rate true_negative_score true_positive_rate true_positive_score
0 0.603125 0.000000 635 0.000000 0 0.000000 0.603125 0.000000 1.000000 1.000000 965 0.000000 0
1 0.605000 0.009404 632 0.000000 0 0.053430 0.604258 1.000000 0.925734 1.000000 965 0.004724 3
2 0.605625 0.012520 631 0.000000 0 0.061715 0.604637 1.000000 0.897726 1.000000 965 0.006299 4
3 0.609375 0.031008 625 0.000000 0 0.097764 0.606918 1.000000 0.843124 1.000000 965 0.015748 10
4 0.610000 0.037037 623 0.001036 1 0.097343 0.607435 0.923077 0.812854 0.998964 964 0.018898 12

Threshold operations

You can get the recommended threshold value, which maximizes the F1 score, using the RocCurve.get_best_f1_threshold method. This is the same threshold that is preselected in DataRobot when you open the “ROC curve” tab.

In [18]:
threshold = roc.get_best_f1_threshold()
threshold
Out[18]:
0.3359943414397026

To estimate metrics for a different threshold value, just pass it to the RocCurve.estimate_threshold method. This will produce the same results as updating the threshold on the DataRobot “ROC curve” tab.

In [19]:
metrics = roc.estimate_threshold(threshold)
metrics
Out[19]:
{'accuracy': 0.626875,
 'f1_score': 0.6219126029132362,
 'false_negative_score': 144,
 'false_positive_rate': 0.4694300518134715,
 'false_positive_score': 453,
 'matthews_correlation_coefficient': 0.30220241744619025,
 'negative_predictive_value': 0.7804878048780488,
 'positive_predictive_value': 0.5201271186440678,
 'threshold': 0.3359943414397026,
 'true_negative_rate': 0.5305699481865285,
 'true_negative_score': 512,
 'true_positive_rate': 0.7732283464566929,
 'true_positive_score': 491}
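As a sanity check (a standalone computation, not part of the DataRobot API), the reported f1_score is consistent with the true/false positive and negative counts above:

```python
def f1_from_counts(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall
    return 2 * tp / (2 * tp + fp + fn)

# Counts from the metrics dict above
score = f1_from_counts(tp=491, fp=453, fn=144)
```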

Confusion matrix

Using a few keys from the retrieved metrics, we can now build a confusion matrix for the selected threshold.

In [20]:
roc_df = pd.DataFrame({
    'Predicted Negative': [metrics['true_negative_score'],
                           metrics['false_negative_score'],
                           metrics['true_negative_score'] + metrics[
                               'false_negative_score']],
    'Predicted Positive': [metrics['false_positive_score'],
                           metrics['true_positive_score'],
                           metrics['true_positive_score'] + metrics[
                               'false_positive_score']],
    'Total': [len(roc.negative_class_predictions),
              len(roc.positive_class_predictions),
              len(roc.negative_class_predictions) + len(
                  roc.positive_class_predictions)]})
roc_df.index = pd.MultiIndex.from_tuples([
    ('Actual', '-'), ('Actual', '+'), ('Total', '')])
roc_df.columns = pd.MultiIndex.from_tuples([
    ('Predicted', '-'), ('Predicted', '+'), ('Total', '')])
roc_df
Out[20]:
Predicted Total
- +
Actual - 512 453 962
+ 144 491 638
Total 656 944 1600

ROC curve plot

In [21]:
dr_roc_green = '#03c75f'
white = '#ffffff'
dr_purple = '#65147D'
dr_dense_green = '#018f4f'

fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)

plt.scatter(df.false_positive_rate, df.true_positive_rate, color=dr_roc_green)
plt.plot(df.false_positive_rate, df.true_positive_rate, color=dr_roc_green)
plt.plot([0, 1], [0, 1], color=white, alpha=0.25)
plt.title('ROC curve')
plt.xlabel('False Positive Rate (Fallout)')
plt.xlim([0, 1])
plt.ylabel('True Positive Rate (Sensitivity)')
plt.ylim([0, 1])
Out[21]:
(0, 1)
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_34_1.png

Prediction distribution plot

There are a few different ways to visualize the prediction distribution; which one to use depends on what packages you have installed. Below you will find three different examples.

Using seaborn

In [22]:
import seaborn as sns
sns.set_style("whitegrid", {'axes.grid': False})

fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)

shared_params = {'shade': True, 'clip': (0, 1), 'bw': 0.2}
sns.kdeplot(np.array(roc.negative_class_predictions),
            color=dr_purple, **shared_params)
sns.kdeplot(np.array(roc.positive_class_predictions),
            color=dr_dense_green, **shared_params)

plt.title('Prediction Distribution')
plt.xlabel('Probability of Event')
plt.xlim([0, 1])
plt.ylabel('Probability Density')
Out[22]:
Text(0,0.5,'Probability Density')
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_36_1.png

Using SciPy

In [23]:
from scipy.stats import gaussian_kde

fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)
xs = np.linspace(0, 1, 100)

density_neg = gaussian_kde(roc.negative_class_predictions, bw_method=0.2)
plt.plot(xs, density_neg(xs), color=dr_purple)
plt.fill_between(xs, 0, density_neg(xs), color=dr_purple, alpha=0.3)

density_pos = gaussian_kde(roc.positive_class_predictions, bw_method=0.2)
plt.plot(xs, density_pos(xs), color=dr_dense_green)
plt.fill_between(xs, 0, density_pos(xs), color=dr_dense_green, alpha=0.3)

plt.title('Prediction Distribution')
plt.xlabel('Probability of Event')
plt.xlim([0, 1])
plt.ylabel('Probability Density')
Out[23]:
Text(0,0.5,'Probability Density')
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_38_1.png

Using scikit-learn

This approach is the most consistent with how DataRobot displays this plot, because scikit-learn supports additional kernel options, letting us configure the same kernel used in the web application (the Epanechnikov kernel with bandwidth 0.05).

The other examples above use a Gaussian kernel, so they may differ slightly from the plot in the DataRobot interface.
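For reference, the two kernel functions can be written out directly (a standalone sketch; the key difference is that the Epanechnikov kernel is exactly zero outside |u| <= 1, so the density estimate cannot leak far outside [0, 1]):

```python
import numpy as np

def gaussian_kernel(u):
    # Unbounded support: positive density for every u
    return np.exp(-0.5 * np.asarray(u) ** 2) / np.sqrt(2 * np.pi)

def epanechnikov_kernel(u):
    # Parabolic kernel, exactly zero outside |u| <= 1
    u = np.asarray(u)
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)
```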

In [24]:
from sklearn.neighbors import KernelDensity

fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)
xs = np.linspace(0, 1, 100)

X_neg = np.asarray(roc.negative_class_predictions)[:, np.newaxis]
density_neg = KernelDensity(bandwidth=0.05, kernel='epanechnikov').fit(X_neg)
plt.plot(xs, np.exp(density_neg.score_samples(xs[:, np.newaxis])),
         color=dr_purple)
plt.fill_between(xs, 0, np.exp(density_neg.score_samples(xs[:, np.newaxis])),
                 color=dr_purple, alpha=0.3)

X_pos = np.asarray(roc.positive_class_predictions)[:, np.newaxis]
density_pos = KernelDensity(bandwidth=0.05, kernel='epanechnikov').fit(X_pos)
plt.plot(xs, np.exp(density_pos.score_samples(xs[:, np.newaxis])),
         color=dr_dense_green)
plt.fill_between(xs, 0, np.exp(density_pos.score_samples(xs[:, np.newaxis])),
                 color=dr_dense_green, alpha=0.3)

plt.title('Prediction Distribution')
plt.xlabel('Probability of Event')
plt.xlim([0, 1])
plt.ylabel('Probability Density')
Out[24]:
Text(0,0.5,'Probability Density')
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_40_1.png

Word Cloud

A word cloud is a type of insight available for some text-processing models trained on datasets containing text columns. It shows how the appearance of each ngram (a word or sequence of words) in the text field affects the predicted target value.

This example will show you how to obtain word cloud data and visualize it in a style similar to the DataRobot web application.

The visualization example here uses the colour and wordcloud packages, so if you don’t have them, you will need to install them.

First, we will create a color palette similar to what we use in DataRobot.

In [25]:
from colour import Color
import wordcloud
In [26]:
colors = [Color('#2458EB')]
colors.extend(list(Color('#2458EB').range_to(Color('#31E7FE'), 81))[1:])
colors.extend(list(Color('#31E7FE').range_to(Color('#8da0a2'), 21))[1:])
colors.extend(list(Color('#a18f8c').range_to(Color('#ffad9e'), 21))[1:])
colors.extend(list(Color('#ffad9e').range_to(Color('#d80909'), 81))[1:])
webcolors = [c.get_web() for c in colors]

The webcolors variable now contains 201 colors (covering the [-1, 1] interval with step 0.01) that will be used in the word cloud. Let’s look at our palette.
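The word cloud plotting code later in this notebook maps each ngram coefficient in [-1, 1] onto one of these 201 colors. The index arithmetic can be checked in isolation (a small helper mirroring that mapping):

```python
def palette_index(coefficient):
    # Map a coefficient in [-1, 1] to an index in [0, 200]
    return int(round(coefficient * 100)) + 100

# palette_index(-1.0) -> 0   (strong negative, blue end)
# palette_index(0.0)  -> 100 (neutral)
# palette_index(1.0)  -> 200 (strong positive, red end)
```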

In [27]:
from matplotlib.colors import LinearSegmentedColormap
dr_cmap = LinearSegmentedColormap.from_list('DataRobot',
                                            webcolors,
                                            N=len(colors))
x = np.arange(-1, 1.01, 0.01)
y = np.arange(0, 40, 1)
X = np.meshgrid(x, y)[0]
plt.xticks([0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200],
           ['-1', '-0.8', '-0.6', '-0.4', '-0.2', '0',
            '0.2', '0.4', '0.6', '0.8', '1'])
plt.yticks([], [])
im = plt.imshow(X, interpolation='nearest', origin='lower', cmap=dr_cmap)
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_45_0.png

Now we will pick a model that provides a word cloud in DataRobot. Any “Auto-Tuned Word N-Gram Text Modeler” should work.

In [28]:
models = project.get_models()
In [29]:
model_with_word_cloud = None
for model in models:
    try:
        model.get_word_cloud()
        model_with_word_cloud = model
        break
    except ClientError as e:
        # This message just means the model has no word cloud;
        # re-raise anything else
        if 'No word cloud data found for model' not in str(e):
            raise

model_with_word_cloud
Out[29]:
Model(u'Auto-Tuned Word N-Gram Text Modeler using token occurrences - diag_1_desc')
In [30]:
wc = model_with_word_cloud.get_word_cloud(exclude_stop_words=True)
In [31]:
def word_cloud_plot(wc, font_path=None):
    # Stopwords usually dominate any word cloud, so we will filter them out
    dict_freq = {wc_word['ngram']: wc_word['frequency']
                 for wc_word in wc.ngrams
                 if not wc_word['is_stopword']}
    dict_coef = {wc_word['ngram']: wc_word['coefficient']
                 for wc_word in wc.ngrams}

    def color_func(*args, **kwargs):
        word = args[0]
        palette_index = int(round(dict_coef[word] * 100)) + 100
        r, g, b = colors[palette_index].get_rgb()
        return 'rgb({:.0f}, {:.0f}, {:.0f})'.format(int(r * 255),
                                                    int(g * 255),
                                                    int(b * 255))

    wc_image = wordcloud.WordCloud(stopwords=set(),
                                   width=1024, height=1024,
                                   relative_scaling=0.5,
                                   prefer_horizontal=1,
                                   color_func=color_func,
                                   background_color=(0, 10, 29),
                                   font_path=font_path).fit_words(dict_freq)
    plt.imshow(wc_image, interpolation='bilinear')
    plt.axis('off')
In [32]:
word_cloud_plot(wc)
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_51_0.png

You can use the word cloud to get information about most frequent and most important (highest absolute coefficient value) ngrams in your text.

In [33]:
wc.most_frequent(5)
Out[33]:
[{'coefficient': 0.622977418480506,
  'count': 534,
  'frequency': 0.21876280213027446,
  'is_stopword': False,
  'ngram': u'failure'},
 {'coefficient': 0.5680375262833832,
  'count': 524,
  'frequency': 0.21466612044244163,
  'is_stopword': False,
  'ngram': u'atherosclerosis'},
 {'coefficient': 0.5163937133054939,
  'count': 520,
  'frequency': 0.21302744776730848,
  'is_stopword': False,
  'ngram': u'atherosclerosis of'},
 {'coefficient': 0.3793240551174481,
  'count': 505,
  'frequency': 0.2068824252355592,
  'is_stopword': False,
  'ngram': u'infarction'},
 {'coefficient': 0.46897343056956153,
  'count': 453,
  'frequency': 0.18557968045882836,
  'is_stopword': False,
  'ngram': u'heart'}]
In [34]:
wc.most_important(5)
Out[34]:
[{'coefficient': -0.8759179138969192,
  'count': 38,
  'frequency': 0.015567390413764851,
  'is_stopword': False,
  'ngram': u'obesity unspecified'},
 {'coefficient': -0.8655105382141891,
  'count': 38,
  'frequency': 0.015567390413764851,
  'is_stopword': False,
  'ngram': u'obesity'},
 {'coefficient': 0.8329465952065772,
  'count': 9,
  'frequency': 0.0036870135190495697,
  'is_stopword': False,
  'ngram': u'nephroptosis'},
 {'coefficient': -0.8198621557218905,
  'count': 45,
  'frequency': 0.01843506759524785,
  'is_stopword': False,
  'ngram': u'of kidney'},
 {'coefficient': 0.7444542252245915,
  'count': 452,
  'frequency': 0.18517001229004507,
  'is_stopword': False,
  'ngram': u'heart failure'}]

Non-ASCII Texts

The word cloud has full Unicode support, but to visualize it using the recipe from this notebook you should pass a font_path parameter pointing to a font that supports the symbols used in your text. For example, for the Japanese text in the model below you should use one of the CJK fonts.

In [35]:
jp_project = dr.Project.create('jp_10k.csv', project_name='Japanese 10K')

print('Project ID: {}'.format(jp_project.id))
In [36]:
jp_project.set_target('readmitted_再入院', mode=AUTOPILOT_MODE.QUICK)
jp_project.wait_for_autopilot()
In progress: 10, queued: 3 (waited: 0s)
In progress: 10, queued: 3 (waited: 0s)
In progress: 10, queued: 3 (waited: 1s)
In progress: 10, queued: 3 (waited: 2s)
In progress: 10, queued: 3 (waited: 3s)
In progress: 10, queued: 3 (waited: 5s)
In progress: 10, queued: 3 (waited: 8s)
In progress: 10, queued: 1 (waited: 15s)
In progress: 6, queued: 0 (waited: 28s)
In progress: 1, queued: 0 (waited: 49s)
In progress: 0, queued: 0 (waited: 69s)
In progress: 8, queued: 0 (waited: 90s)
In progress: 5, queued: 0 (waited: 110s)
In progress: 1, queued: 0 (waited: 130s)
In progress: 0, queued: 14 (waited: 151s)
In progress: 10, queued: 6 (waited: 171s)
In progress: 10, queued: 2 (waited: 191s)
In progress: 8, queued: 0 (waited: 212s)
In progress: 2, queued: 0 (waited: 232s)
In progress: 2, queued: 0 (waited: 253s)
In progress: 1, queued: 0 (waited: 273s)
In progress: 1, queued: 0 (waited: 293s)
In progress: 0, queued: 0 (waited: 314s)
In [37]:
jp_models = jp_project.get_models()
jp_model_with_word_cloud = None

for model in jp_models:
    try:
        model.get_word_cloud()
        jp_model_with_word_cloud = model
        break
    except ClientError as e:
        # This message just means the model has no word cloud;
        # re-raise anything else
        if 'No word cloud data found for model' not in str(e):
            raise

jp_model_with_word_cloud
Out[37]:
Model(u'Auto-Tuned Word N-Gram Text Modeler using token occurrences and tfidf - diag_1_desc_\u8a3a\u65ad1\u8aac\u660e')
In [38]:
jp_wc = jp_model_with_word_cloud.get_word_cloud(exclude_stop_words=True)
In [39]:
word_cloud_plot(jp_wc, font_path='CJK.ttf')
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_60_0.png