Advanced Model Insights

Preparation

This notebook explores additional options for model insights added in the v2.7 release of the DataRobot API.

Let’s start by importing some packages that will help us with presentation (if you don’t have them installed already, you’ll have to install them to run this notebook).

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import datarobot as dr
import numpy as np
from datarobot.enums import AUTOPILOT_MODE
from datarobot.errors import ClientError

Set Up

Now configure your DataRobot client (unless you’re using a configuration file)...

In [3]:
dr.Client(token='<API TOKEN>', endpoint='https://<YOUR ENDPOINT>/api/v2/')
Out[3]:
<datarobot.rest.RESTClientObject at 0x10bc01e50>
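
If you prefer to keep credentials out of the notebook, the client can also be pointed at a configuration file. A minimal sketch, assuming a YAML config file at the default location the datarobot package looks for (check your installed package version’s documentation for the exact keys and path):

# drconfig.yaml is assumed to contain two lines:
#   token: <API TOKEN>
#   endpoint: https://<YOUR ENDPOINT>/api/v2/
dr.Client(config_path='~/.config/datarobot/drconfig.yaml')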

Create Project with features

Create a new project using the 10K_diabetes dataset. This dataset poses a binary classification problem with the target readmitted, and the resulting project makes an excellent example of the advanced model insights available from DataRobot models.

In [4]:
url = 'https://s3.amazonaws.com/datarobot_public_datasets/10k_diabetes.xlsx'
project = dr.Project.create(url, project_name='10K Advanced Modeling')
print('Project ID: {}'.format(project.id))
Project ID: 598dec4bc8089177139da4ad
In [25]:
# Increase the worker count to make the project go faster.
project.set_worker_count(8)
Out[25]:
Project(10K Advanced Modeling)
In [5]:
project.set_target('readmitted', mode=AUTOPILOT_MODE.QUICK)
Out[5]:
Project(10K Advanced Modeling)
In [6]:
project.wait_for_autopilot()
In progress: 2, queued: 0 (waited: 0s)
In progress: 2, queued: 0 (waited: 1s)
In progress: 2, queued: 0 (waited: 2s)
In progress: 2, queued: 0 (waited: 2s)
In progress: 2, queued: 0 (waited: 4s)
In progress: 2, queued: 0 (waited: 6s)
In progress: 2, queued: 0 (waited: 9s)
In progress: 2, queued: 0 (waited: 16s)
In progress: 2, queued: 0 (waited: 29s)
In progress: 1, queued: 0 (waited: 50s)
In progress: 1, queued: 0 (waited: 71s)
In progress: 1, queued: 0 (waited: 91s)
In progress: 1, queued: 0 (waited: 111s)
In progress: 1, queued: 0 (waited: 132s)
In progress: 1, queued: 0 (waited: 152s)
In progress: 1, queued: 0 (waited: 172s)
In progress: 1, queued: 0 (waited: 193s)
In progress: 1, queued: 0 (waited: 213s)
In progress: 1, queued: 0 (waited: 233s)
In progress: 1, queued: 0 (waited: 254s)
In progress: 1, queued: 0 (waited: 274s)
In progress: 0, queued: 1 (waited: 295s)
In progress: 1, queued: 0 (waited: 315s)
In progress: 0, queued: 0 (waited: 335s)
In progress: 0, queued: 0 (waited: 356s)
In [7]:
models = project.get_models()
model = models[0]
model
Out[7]:
Model(u'AVG Blender')

Lift Chart

A lift chart shows how close, in aggregate, model predictions are to the actual target values in the training data.

The lift chart data we retrieve from the server includes the average model prediction and the average actual target values, sorted by the prediction values in ascending order and split into up to 60 bins.

The bin_weight parameter shows how much weight is in each bin (the number of rows for unweighted projects).

In [10]:
lc = model.get_lift_chart('validation')
lc
Out[10]:
LiftChart(validation)
In [11]:
bins_df = pd.DataFrame(lc.bins)
bins_df.head()
Out[11]:
actual bin_weight predicted
0 0.037037 27 0.101428
1 0.074074 27 0.138344
2 0.192308 26 0.159232
3 0.185185 27 0.174883
4 0.185185 27 0.188870
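
Because each bin carries a weight, the weighted average of the actual column reproduces the overall event rate of the partition, which makes for a quick sanity check:

overall_rate = (bins_df['actual'] * bins_df['bin_weight']).sum() / bins_df['bin_weight'].sum()
overall_rate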

Let’s set some color constants to replicate the visual style of the DataRobot lift chart.

In [12]:
dr_dark_blue = '#08233F'
dr_blue = '#1F77B4'
dr_orange = '#FF7F0E'

Let’s define our rebinning and plotting functions.

In [13]:
def rebin_df(raw_df, number_of_bins):
    # Merge the (up to) 60 bins returned by DataRobot into a smaller
    # number of equally sized bins, keeping weighted averages.
    cols = ['bin', 'actual_mean', 'predicted_mean', 'bin_weight']
    new_df = pd.DataFrame(columns=cols)
    current_prediction_total = 0
    current_actual_total = 0
    current_row_total = 0
    bin_size = 60 / number_of_bins
    for rowId, data in raw_df.iterrows():
        current_prediction_total += data['predicted']
        current_actual_total += data['actual']
        current_row_total += data['bin_weight']

        # Once a full output bin has been accumulated, emit its averages.
        if (rowId + 1) % bin_size == 0:
            bin_properties = {
                'bin': ((round(rowId + 1) / 60) * number_of_bins),
                'actual_mean': current_actual_total / current_row_total,
                'predicted_mean': current_prediction_total / current_row_total,
                'bin_weight': current_row_total
            }

            new_df = new_df.append(bin_properties, ignore_index=True)
            current_prediction_total = 0
            current_actual_total = 0
            current_row_total = 0
    return new_df


def matplotlib_lift(bins_df, bin_count, ax):
    grouped = rebin_df(bins_df, bin_count)
    ax.plot(range(1, len(grouped) + 1), grouped['predicted_mean'],
            marker='+', lw=1, color=dr_blue, label='Predicted')
    ax.plot(range(1, len(grouped) + 1), grouped['actual_mean'],
            marker='*', lw=1, color=dr_orange, label='Actual')
    ax.set_xlim([0, len(grouped) + 1])
    ax.set_facecolor(dr_dark_blue)
    ax.legend(loc='best')
    ax.set_title('Lift chart {} bins'.format(bin_count))
    ax.set_xlabel('Sorted Prediction')
    ax.set_ylabel('Value')
    return grouped

Now we can reproduce all of the lift chart bin counts offered in the DataRobot web application.

Note 1: While this method will work for any bin count less than 60, the most reliable results are achieved when the bin count divides 60 evenly.

Note 2: This visualization method will NOT work for bin counts greater than 60 because DataRobot does not provide enough information for a higher resolution.
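
Since the 60 source bins are merged into larger groups, a bin count is most reliable when it divides 60 evenly. A one-liner to list the usable counts:

valid_bin_counts = [b for b in range(1, 61) if 60 % b == 0]
print(valid_bin_counts)  # 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60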

In [14]:
bin_counts = [10, 12, 15, 20, 30, 60]
f, axarr = plt.subplots(len(bin_counts))
f.set_size_inches((8, 4 * len(bin_counts)))

rebinned_dfs = []
for i in range(len(bin_counts)):
    rebinned_dfs.append(matplotlib_lift(bins_df, bin_counts[i], axarr[i]))
plt.tight_layout()
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_18_0.png

Rebinned Data

You may want to interact with the raw re-binned data, either for use in third-party tools or for additional evaluation (a CSV export sketch follows the printed output below).

In [15]:
for rebinned in rebinned_dfs:
    print('Number of bins: {}'.format(len(rebinned.index)))
    print(rebinned)
Number of bins: 10
   bin  actual_mean  predicted_mean  bin_weight
0    1     0.005173        0.006018         160
1    2     0.006330        0.008653         160
2    3     0.011031        0.010219         160
3    4     0.010808        0.011721         160
4    5     0.013177        0.013245         160
5    6     0.014281        0.014824         160
6    7     0.019436        0.016481         160
7    8     0.021145        0.018407         160
8    9     0.020575        0.021133         160
9   10     0.026950        0.026278         160
Number of bins: 12
    bin  actual_mean  predicted_mean  bin_weight
0     1     0.005028        0.005692         134
1     2     0.006212        0.008198         133
2     3     0.008226        0.009638         133
3     4     0.011600        0.010858         133
4     5     0.011428        0.012048         134
5     6     0.013345        0.013410         133
6     7     0.013838        0.014727         133
7     8     0.018348        0.015950         134
8     9     0.018444        0.017625         133
9    10     0.023167        0.019384         133
10   11     0.022310        0.022115         133
11   12     0.026725        0.026719         134
Number of bins: 15
    bin  actual_mean  predicted_mean  bin_weight
0     1     0.004566        0.005363         107
1     2     0.005938        0.007619         107
2     3     0.006760        0.009039         106
3     4     0.011196        0.009940         107
4     5     0.010361        0.011019         106
5     6     0.011196        0.011951         107
6     7     0.011929        0.012943         107
7     8     0.016341        0.014100         106
8     9     0.012940        0.015061         107
9    10     0.018825        0.016115         107
10   11     0.018949        0.017540         106
11   12     0.023085        0.018679         107
12   13     0.018828        0.020709         106
13   14     0.026213        0.022978         107
14   15     0.026200        0.027401         107
Number of bins: 20
    bin  actual_mean  predicted_mean  bin_weight
0     1     0.003793        0.004988          80
1     2     0.006553        0.007048          80
2     3     0.005627        0.008176          80
3     4     0.007033        0.009131          80
4     5     0.009420        0.009844          80
5     6     0.012642        0.010593          80
6     7     0.010345        0.011322          80
7     8     0.011271        0.012120          80
8     9     0.013177        0.012850          80
9    10     0.013177        0.013641          80
10   11     0.015420        0.014415          80
11   12     0.013141        0.015233          80
12   13     0.019160        0.016009          80
13   14     0.019712        0.016954          80
14   15     0.016969        0.017891          80
15   16     0.025321        0.018923          80
16   17     0.020139        0.020248          80
17   18     0.021011        0.022019          80
18   19     0.026264        0.024319          80
19   20     0.027635        0.028236          80
Number of bins: 30
    bin  actual_mean  predicted_mean  bin_weight
0     1     0.002058        0.004440          54
1     2     0.007123        0.006304          53
2     3     0.006397        0.007339          53
3     4     0.005487        0.007895          54
4     5     0.007096        0.008754          53
5     6     0.006424        0.009325          53
6     7     0.007545        0.009598          54
7     8     0.014917        0.010290          53
8     9     0.010697        0.010780          53
9    10     0.010025        0.011257          53
10   11     0.009602        0.011583          54
11   12     0.012821        0.012326          53
12   13     0.014998        0.012809          53
13   14     0.008916        0.013075          54
14   15     0.015696        0.013856          53
15   16     0.016987        0.014345          53
16   17     0.010288        0.014668          54
17   18     0.015643        0.015462          53
18   19     0.017040        0.015956          53
19   20     0.020576        0.016270          54
20   21     0.020669        0.017222          53
21   22     0.017228        0.017858          53
22   23     0.021448        0.018475          53
23   24     0.024691        0.018880          54
24   25     0.019916        0.020080          53
25   26     0.017739        0.021339          53
26   27     0.024005        0.021966          54
27   28     0.028463        0.024009          53
28   29     0.024243        0.026005          53
29   30     0.028121        0.028772          54
Number of bins: 60
    bin  actual_mean  predicted_mean  bin_weight
0     1     0.001372        0.003757          27
1     2     0.002743        0.005124          27
2     3     0.007396        0.006124          26
3     4     0.006859        0.006477          27
4     5     0.006859        0.006995          27
5     6     0.005917        0.007696          26
6     7     0.006859        0.007732          27
7     8     0.004115        0.008057          27
8     9     0.005917        0.008759          26
9    10     0.008230        0.008748          27
10   11     0.005487        0.009041          27
11   12     0.007396        0.009621          26
12   13     0.005487        0.009473          27
13   14     0.009602        0.009723          27
14   15     0.013314        0.010357          26
15   16     0.016461        0.010225          27
16   17     0.009602        0.010477          27
17   18     0.011834        0.011095          26
18   19     0.006859        0.010917          27
19   20     0.013314        0.011610          26
20   21     0.010974        0.011450          27
21   22     0.008230        0.011716          27
22   23     0.013314        0.012471          26
23   24     0.012346        0.012187          27
24   25     0.012346        0.012433          27
25   26     0.017751        0.013199          26
26   27     0.009602        0.012931          27
27   28     0.008230        0.013219          27
28   29     0.017751        0.013977          26
29   30     0.013717        0.013738          27
30   31     0.021948        0.013934          27
31   32     0.011834        0.014773          26
32   33     0.012346        0.014552          27
33   34     0.008230        0.014783          27
34   35     0.014793        0.015630          26
35   36     0.016461        0.015299          27
36   37     0.019204        0.015522          27
37   38     0.014793        0.016407          26
38   39     0.023320        0.016112          27
39   40     0.017833        0.016427          27
40   41     0.022189        0.017373          26
41   42     0.019204        0.017076          27
42   43     0.025148        0.018082          26
43   44     0.009602        0.017641          27
44   45     0.016461        0.017958          27
45   46     0.026627        0.019012          26
46   47     0.024691        0.018699          27
47   48     0.024691        0.019061          27
48   49     0.019231        0.020192          26
49   50     0.020576        0.019971          27
50   51     0.020576        0.020577          27
51   52     0.014793        0.022130          26
52   53     0.026063        0.021734          27
53   54     0.021948        0.022198          27
54   55     0.028107        0.024006          26
55   56     0.028807        0.024012          27
56   57     0.021948        0.024929          27
57   58     0.026627        0.027123          26
58   59     0.026063        0.027594          27
59   60     0.030178        0.029950          27
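
If you want to hand this re-binned data to a third-party tool, pandas can write each DataFrame straight to disk. A minimal sketch (the file names are made up for illustration):

for rebinned in rebinned_dfs:
    rebinned.to_csv('lift_chart_{}_bins.csv'.format(len(rebinned.index)), index=False)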

ROC curve

The receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.

To retrieve ROC curve information, use the Model.get_roc_curve method.

In [16]:
roc = model.get_roc_curve('validation')
roc
Out[16]:
RocCurve(validation)
In [17]:
df = pd.DataFrame(roc.roc_points)
df.head()
Out[17]:
accuracy f1_score false_negative_score false_positive_rate false_positive_score matthews_correlation_coefficient negative_predictive_value positive_predictive_value threshold true_negative_rate true_negative_score true_positive_rate true_positive_score
0 0.603125 0.000000 635 0.000000 0 0.000000 0.603125 0.000000 1.000000 1.000000 965 0.000000 0
1 0.604375 0.006279 633 0.000000 0 0.043612 0.603880 1.000000 0.879926 1.000000 965 0.003150 2
2 0.608125 0.024883 627 0.000000 0 0.087388 0.606156 1.000000 0.852429 1.000000 965 0.012598 8
3 0.610000 0.037037 623 0.001036 1 0.097343 0.607435 0.923077 0.799478 0.998964 964 0.018898 12
4 0.610625 0.043011 621 0.002073 2 0.098219 0.607955 0.875000 0.782085 0.997927 963 0.022047 14
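
As a rough cross-check of the curve itself, you can approximate the area under it from these points with the trapezoidal rule. A minimal sketch using numpy (DataRobot also reports AUC among the model’s metrics, which is the authoritative number):

pts = df.sort_values('false_positive_rate')
auc_estimate = np.trapz(pts['true_positive_rate'], pts['false_positive_rate'])
auc_estimate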

Threshold operations

You can get the recommended threshold value with the maximal F1 score using the RocCurve.get_best_f1_threshold method. This is the same threshold that is preselected in DataRobot when you open the “ROC curve” tab.

In [18]:
threshold = roc.get_best_f1_threshold()
threshold
Out[18]:
0.3187469581268751

To estimate metrics for a different threshold value, just pass it to the RocCurve.estimate_threshold method. This will produce the same results as updating the threshold on the DataRobot “ROC curve” tab.

In [19]:
metrics = roc.estimate_threshold(threshold)
metrics
Out[19]:
{'accuracy': 0.608125,
 'f1_score': 0.6206896551724138,
 'false_negative_score': 122,
 'false_positive_rate': 0.5233160621761658,
 'false_positive_score': 505,
 'matthews_correlation_coefficient': 0.28939156398547705,
 'negative_predictive_value': 0.7903780068728522,
 'positive_predictive_value': 0.5039292730844793,
 'threshold': 0.3187469581268751,
 'true_negative_rate': 0.47668393782383417,
 'true_negative_score': 460,
 'true_positive_rate': 0.8078740157480315,
 'true_positive_score': 513}
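
To compare several candidate thresholds side by side, you can collect the estimates into a DataFrame. A minimal sketch (the candidate values other than the best-F1 threshold are arbitrary):

candidates = [0.2, threshold, 0.5]
comparison = pd.DataFrame([roc.estimate_threshold(t) for t in candidates])
comparison[['threshold', 'f1_score', 'true_positive_rate', 'false_positive_rate']]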

Confusion matrix

Using a few keys from the retrieved metrics, we can now build a confusion matrix for the selected threshold.

In [20]:
roc_df = pd.DataFrame({
    'Predicted Negative': [metrics['true_negative_score'],
                           metrics['false_negative_score'],
                           metrics['true_negative_score'] + metrics[
                               'false_negative_score']],
    'Predicted Positive': [metrics['false_positive_score'],
                           metrics['true_positive_score'],
                           metrics['true_positive_score'] + metrics[
                               'false_positive_score']],
    'Total': [len(roc.negative_class_predictions),
              len(roc.positive_class_predictions),
              len(roc.negative_class_predictions) + len(
                  roc.positive_class_predictions)]})
roc_df.index = pd.MultiIndex.from_tuples([
    ('Actual', '-'), ('Actual', '+'), ('Total', '')])
roc_df.columns = pd.MultiIndex.from_tuples([
    ('Predicted', '-'), ('Predicted', '+'), ('Total', '')])
roc_df.style.set_properties(**{'text-align': 'right'})
roc_df
Out[20]:
                  Predicted             Total
                  -          +
Actual   -        460        505         962
         +        122        513         638
Total             582        1018        1600
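
The cells of the matrix also reproduce the precision and recall already present in metrics, which makes a handy sanity check:

tp = metrics['true_positive_score']
fp = metrics['false_positive_score']
fn = metrics['false_negative_score']
precision = tp / float(tp + fp)  # matches metrics['positive_predictive_value']
recall = tp / float(tp + fn)     # matches metrics['true_positive_rate']
print('precision={:.3f}, recall={:.3f}'.format(precision, recall))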

ROC curve plot

In [21]:
dr_roc_green = '#03c75f'
white = '#ffffff'
dr_purple = '#65147D'
dr_dense_green = '#018f4f'

fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)

plt.scatter(df.false_positive_rate, df.true_positive_rate, color=dr_roc_green)
plt.plot(df.false_positive_rate, df.true_positive_rate, color=dr_roc_green)
plt.plot([0, 1], [0, 1], color=white, alpha=0.25)
plt.title('ROC curve')
plt.xlabel('False Positive Rate (Fallout)')
plt.xlim([0, 1])
plt.ylabel('True Positive Rate (Sensitivity)')
plt.ylim([0, 1])
Out[21]:
(0, 1)
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_31_1.png

Prediction distribution plot

There are a few different methods for visualizing the prediction distributions; which one to use depends on what packages you have installed. Below you will find three different examples.

Using seaborn

In [22]:
import seaborn as sns
sns.set_style("whitegrid", {'axes.grid': False})

fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)

shared_params = {'shade': True, 'clip': (0, 1), 'bw': 0.2}
sns.kdeplot(np.array(roc.negative_class_predictions),
            color=dr_purple, **shared_params)
sns.kdeplot(np.array(roc.positive_class_predictions),
            color=dr_dense_green, **shared_params)

plt.title('Prediction Distribution')
plt.xlabel('Probability of Event')
plt.xlim([0, 1])
plt.ylabel('Probability Density')
Out[22]:
<matplotlib.text.Text at 0x10f44b0d0>
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_33_1.png

Using SciPy

In [23]:
from scipy.stats import gaussian_kde

fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)
xs = np.linspace(0, 1, 100)

density_neg = gaussian_kde(roc.negative_class_predictions, bw_method=0.2)
plt.plot(xs, density_neg(xs), color=dr_purple)
plt.fill_between(xs, 0, density_neg(xs), color=dr_purple, alpha=0.3)

density_pos = gaussian_kde(roc.positive_class_predictions, bw_method=0.2)
plt.plot(xs, density_pos(xs), color=dr_dense_green)
plt.fill_between(xs, 0, density_pos(xs), color=dr_dense_green, alpha=0.3)

plt.title('Prediction Distribution')
plt.xlabel('Probability of Event')
plt.xlim([0, 1])
plt.ylabel('Probability Density')
Out[23]:
<matplotlib.text.Text at 0x10f50fad0>
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_35_1.png

Using scikit-learn

This approach will be the most consistent with how we display this plot in DataRobot, because scikit-learn supports additional kernel options, so we can configure the same kernel the web application uses (an Epanechnikov kernel with bandwidth 0.05).

The other examples above use a Gaussian kernel, so they may differ slightly from the plot in the DataRobot interface.

In [24]:
from sklearn.neighbors import KernelDensity

fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)
xs = np.linspace(0, 1, 100)

X_neg = np.asarray(roc.negative_class_predictions)[:, np.newaxis]
density_neg = KernelDensity(bandwidth=0.05, kernel='epanechnikov').fit(X_neg)
plt.plot(xs, np.exp(density_neg.score_samples(xs[:, np.newaxis])),
         color=dr_purple)
plt.fill_between(xs, 0, np.exp(density_neg.score_samples(xs[:, np.newaxis])),
                 color=dr_purple, alpha=0.3)

X_pos = np.asarray(roc.positive_class_predictions)[:, np.newaxis]
density_pos = KernelDensity(bandwidth=0.05, kernel='epanechnikov').fit(X_pos)
plt.plot(xs, np.exp(density_pos.score_samples(xs[:, np.newaxis])),
         color=dr_dense_green)
plt.fill_between(xs, 0, np.exp(density_pos.score_samples(xs[:, np.newaxis])),
                 color=dr_dense_green, alpha=0.3)

plt.title('Prediction Distribution')
plt.xlabel('Probability of Event')
plt.xlim([0, 1])
plt.ylabel('Probability Density')
Out[24]:
<matplotlib.text.Text at 0x10fb9f490>
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_37_1.png

Word Cloud

A word cloud is a type of insight available for some text-processing models trained on datasets containing text columns. It shows how the appearance of each ngram (word or sequence of words) in the text field affects the predicted target value.

This example will show you how to obtain word cloud data and visualize it in a way similar to the DataRobot web application.

The visualization example here uses the colour and wordcloud packages, so if you don’t have them installed, you will need to install them.

First, we will create a color palette similar to what we use in DataRobot.

In [25]:
from colour import Color
import wordcloud
In [26]:
colors = [Color('#2458EB')]
colors.extend(list(Color('#2458EB').range_to(Color('#31E7FE'), 81))[1:])
colors.extend(list(Color('#31E7FE').range_to(Color('#8da0a2'), 21))[1:])
colors.extend(list(Color('#a18f8c').range_to(Color('#ffad9e'), 21))[1:])
colors.extend(list(Color('#ffad9e').range_to(Color('#d80909'), 81))[1:])
webcolors = [c.get_web() for c in colors]

The variable webcolors now contains 201 colors (the [-1, 1] interval with step 0.01) that will be used in the word cloud. Let’s look at our palette.

In [27]:
from matplotlib.colors import LinearSegmentedColormap
dr_cmap = LinearSegmentedColormap.from_list('DataRobot',
                                            webcolors,
                                            N=len(colors))
x = np.arange(-1, 1.01, 0.01)
y = np.arange(0, 40, 1)
X = np.meshgrid(x, y)[0]
plt.xticks([0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200],
           ['-1', '-0.8', '-0.6', '-0.4', '-0.2', '0',
            '0.2', '0.4', '0.6', '0.8', '1'])
plt.yticks([], [])
im = plt.imshow(X, interpolation='nearest', origin='lower', cmap=dr_cmap)
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_42_0.png

Now we will pick a model that provides a word cloud in DataRobot. Any “Auto-Tuned Word N-Gram Text Modeler” should work.

In [28]:
models = project.get_models()
In [29]:
model_with_word_cloud = None
for model in models:
    try:
        model.get_word_cloud()
        model_with_word_cloud = model
        break
    except ClientError as e:
        # This model does not provide a word cloud; try the next one.
        if 'No word cloud data found for model' in str(e):
            pass

model_with_word_cloud
Out[29]:
Model(u'Auto-Tuned Word N-Gram Text Modeler using token occurrences - diag_1_desc')
In [30]:
wc = model_with_word_cloud.get_word_cloud(exclude_stop_words=True)
In [31]:
def word_cloud_plot(wc, font_path=None):
    # Stopwords usually dominate any word cloud, so we will filter them out
    dict_freq = {wc_word['ngram']: wc_word['frequency']
                 for wc_word in wc.ngrams
                 if not wc_word['is_stopword']}
    dict_coef = {wc_word['ngram']: wc_word['coefficient']
                 for wc_word in wc.ngrams}

    def color_func(*args, **kwargs):
        word = args[0]
        palette_index = int(round(dict_coef[word] * 100)) + 100
        r, g, b = colors[palette_index].get_rgb()
        return 'rgb({:.0f}, {:.0f}, {:.0f})'.format(int(r * 255),
                                                    int(g * 255),
                                                    int(b * 255))

    wc_image = wordcloud.WordCloud(stopwords=set(),
                                   width=1024, height=1024,
                                   relative_scaling=0.5,
                                   prefer_horizontal=1,
                                   color_func=color_func,
                                   background_color=(0, 10, 29),
                                   font_path=font_path).fit_words(dict_freq)
    plt.imshow(wc_image, interpolation='bilinear')
    plt.axis('off')
In [32]:
word_cloud_plot(wc)
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_48_0.png

You can use the word cloud to get information about the most frequent and the most important (highest absolute coefficient value) ngrams in your text.

In [33]:
wc.most_frequent(5)
Out[33]:
[{'coefficient': 0.622977418480506,
  'count': 534,
  'frequency': 0.21876280213027446,
  'is_stopword': False,
  'ngram': u'failure'},
 {'coefficient': 0.5680375262833832,
  'count': 524,
  'frequency': 0.21466612044244163,
  'is_stopword': False,
  'ngram': u'atherosclerosis'},
 {'coefficient': 0.5163937133054939,
  'count': 520,
  'frequency': 0.21302744776730848,
  'is_stopword': False,
  'ngram': u'atherosclerosis of'},
 {'coefficient': 0.3793240551174481,
  'count': 505,
  'frequency': 0.2068824252355592,
  'is_stopword': False,
  'ngram': u'infarction'},
 {'coefficient': 0.46897343056956153,
  'count': 453,
  'frequency': 0.18557968045882836,
  'is_stopword': False,
  'ngram': u'heart'}]
In [34]:
wc.most_important(5)
Out[34]:
[{'coefficient': -0.8759179138969192,
  'count': 38,
  'frequency': 0.015567390413764851,
  'is_stopword': False,
  'ngram': u'obesity unspecified'},
 {'coefficient': -0.8655105382141891,
  'count': 38,
  'frequency': 0.015567390413764851,
  'is_stopword': False,
  'ngram': u'obesity'},
 {'coefficient': 0.8329465952065772,
  'count': 9,
  'frequency': 0.0036870135190495697,
  'is_stopword': False,
  'ngram': u'nephroptosis'},
 {'coefficient': -0.8198621557218905,
  'count': 45,
  'frequency': 0.01843506759524785,
  'is_stopword': False,
  'ngram': u'of kidney'},
 {'coefficient': 0.7444542252245915,
  'count': 452,
  'frequency': 0.18517001229004507,
  'is_stopword': False,
  'ngram': u'heart failure'}]
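
You can also load all of the ngram records into a DataFrame for ad-hoc filtering and sorting. A minimal sketch:

ngrams_df = pd.DataFrame(wc.ngrams)
ngrams_df['abs_coefficient'] = ngrams_df['coefficient'].abs()
ngrams_df.sort_values('abs_coefficient', ascending=False).head()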

Non-ASCII Texts

The word cloud has full Unicode support, but to visualize it using the recipe from this notebook you should pass a font_path parameter pointing to a font that supports the symbols used in your text. For example, for the Japanese text in the model below you should use one of the CJK fonts.
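
One way to locate a suitable font already installed on your system is through matplotlib’s font manager. This is only a sketch; whether a CJK-capable font is present depends on your system:

import matplotlib.font_manager as fm
cjk_fonts = [f.fname for f in fm.fontManager.ttflist if 'CJK' in f.name]
cjk_fonts[:3]  # pass one of these paths as font_path, if any were found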

In [35]:
jp_project = dr.Project.create('jp_10k.csv', project_name='Japanese 10K')

print('Project ID: {}'.format(jp_project.id))
In [36]:
jp_project.set_target('readmitted_再入院', mode=AUTOPILOT_MODE.QUICK)
jp_project.wait_for_autopilot()
In progress: 10, queued: 3 (waited: 0s)
In progress: 10, queued: 3 (waited: 0s)
In progress: 10, queued: 3 (waited: 1s)
In progress: 10, queued: 3 (waited: 2s)
In progress: 10, queued: 3 (waited: 3s)
In progress: 10, queued: 3 (waited: 5s)
In progress: 10, queued: 3 (waited: 8s)
In progress: 10, queued: 1 (waited: 15s)
In progress: 6, queued: 0 (waited: 28s)
In progress: 1, queued: 0 (waited: 49s)
In progress: 0, queued: 0 (waited: 69s)
In progress: 8, queued: 0 (waited: 90s)
In progress: 5, queued: 0 (waited: 110s)
In progress: 1, queued: 0 (waited: 130s)
In progress: 0, queued: 14 (waited: 151s)
In progress: 10, queued: 6 (waited: 171s)
In progress: 10, queued: 2 (waited: 191s)
In progress: 8, queued: 0 (waited: 212s)
In progress: 2, queued: 0 (waited: 232s)
In progress: 2, queued: 0 (waited: 253s)
In progress: 1, queued: 0 (waited: 273s)
In progress: 1, queued: 0 (waited: 293s)
In progress: 0, queued: 0 (waited: 314s)
In [37]:
jp_models = jp_project.get_models()
jp_model_with_word_cloud = None

for model in jp_models:
    try:
        model.get_word_cloud()
        jp_model_with_word_cloud = model
        break
    except ClientError as e:
        # This model does not provide a word cloud; try the next one.
        if 'No word cloud data found for model' in str(e):
            pass

jp_model_with_word_cloud
Out[37]:
Model(u'Auto-Tuned Word N-Gram Text Modeler using token occurrences and tfidf - diag_1_desc_\u8a3a\u65ad1\u8aac\u660e')
In [38]:
jp_wc = jp_model_with_word_cloud.get_word_cloud(exclude_stop_words=True)
In [39]:
word_cloud_plot(jp_wc, font_path='CJK.ttf')
../../_images/examples_advanced_model_insights_Advanced_Model_Insights_57_0.png