Helping detect a rare disease

Below I construct a decision-tree model and a random-forest model to help detect a condition with an incidence of roughly 0.173% among patients (about 173 in 100,000). The models are built using the anonymized records of more than 280,000 patients, with 29 anonymized variables per patient plus a classification label: each patient is labeled as either having the condition ('Class'=1) or not having it ('Class'=0).

Author: Leonardo Espin

Date: August 8, 2019

  • First I will load all the libraries I will use for the exploratory data analysis
In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

pd.set_option('display.max_columns', 35)
#the lines below get rid of the plotting library warnings
import warnings
warnings.filterwarnings("ignore")
  • Next I will read the data into a dataframe, print the size of the dataset, the first few lines of data and the types of variables
In [2]:
raw=pd.read_csv('data_challenge_dataset.csv')
print(raw.shape)
print(raw.dtypes)
raw.head()
(284807, 30)
V1       float64
V2       float64
V3       float64
V4       float64
V5       float64
V6       float64
V7       float64
V8       float64
V9       float64
V10      float64
V11      float64
V12      float64
V13      float64
V14      float64
V15      float64
V16      float64
V17      float64
V18      float64
V19      float64
V20      float64
V21      float64
V22      float64
V23      float64
V24      float64
V25      float64
V26      float64
V27      float64
V28      float64
V29      float64
Class      int64
dtype: object
Out[2]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 Class
0 NaN -0.072781 2.536347 NaN -0.338321 0.462388 NaN 0.098698 0.363787 0.090794 -0.551600 -0.617801 -0.991390 -0.311169 1.468177 -0.470401 NaN 0.025791 0.403993 0.251412 -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 NaN -0.021053 149.62 0
1 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 NaN -0.166974 NaN 1.065235 NaN -0.143772 0.635558 0.463917 -0.114805 -0.183361 -0.145783 NaN -0.225775 -0.638672 NaN -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.207643 0.624501 0.066084 0.717293 -0.165946 2.345865 -2.890083 1.109969 -0.121359 -2.261857 0.524980 0.247998 0.771679 0.909412 NaN -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 -0.054952 -0.226487 0.178228 0.507757 -0.287924 -0.631418 -1.059647 -0.684093 1.965775 -1.232622 -0.208038 -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 NaN -0.270533 0.817739 0.753074 -0.822843 0.538196 1.345852 -1.119670 NaN -0.451449 NaN -0.038195 0.803487 0.408542 -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0

Exploratory data analysis (EDA)

  • First I print summary statistics of the dataset
In [3]:
raw.describe()
Out[3]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 Class
count 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 284807.000000
mean -0.001855 0.001016 0.000476 0.000029 0.000192 0.000221 -0.000832 0.001599 -0.001070 -0.001081 0.000246 -0.000274 0.000800 0.000019 -0.000632 -0.000546 -0.000080 0.000083 -0.000345 -0.000258 0.000373 -0.000521 0.000435 0.000242 -0.000116 -0.000088 0.000086 0.000153 88.463396 0.001727
std 1.957926 1.651584 1.519391 1.417511 1.378669 1.332159 1.241105 1.187431 1.099997 1.085192 1.019877 1.000322 0.995882 0.955731 0.915411 0.876783 0.850534 0.838474 0.814060 0.772639 0.733762 0.725003 0.618081 0.605657 0.521329 0.482607 0.397504 0.330702 248.040629 0.041527
min -56.407510 -72.715728 -48.325589 -5.683171 -113.743307 -21.929312 -43.557242 -73.216718 -13.434066 -24.588262 -4.797473 -18.683715 -5.791881 -18.493773 -4.498945 -14.129855 -25.162799 -9.498746 -7.213527 -54.497720 -34.830382 -10.933144 -44.807735 -2.836627 -10.295397 -2.604551 -22.565679 -15.430084 0.000000 0.000000
25% -0.920793 -0.598073 -0.888388 -0.848423 -0.692812 -0.768247 -0.553502 -0.208516 -0.644112 -0.535903 -0.761755 -0.406160 -0.647872 -0.426090 -0.583406 -0.469167 -0.483949 -0.498586 -0.456986 -0.211756 -0.228512 -0.542946 -0.161762 -0.354556 -0.317167 -0.327224 -0.070878 -0.052945 5.640000 0.000000
50% 0.015389 0.065034 0.179871 -0.018836 -0.055993 -0.273935 0.039835 0.022608 -0.052607 -0.092313 -0.031499 0.139847 -0.013511 0.050509 0.047068 0.066032 -0.065844 -0.003976 0.003502 -0.062298 -0.029669 0.006540 -0.011011 0.041439 0.015954 -0.052294 0.001296 0.011181 22.000000 0.000000
75% 1.315404 0.804324 1.028192 0.742473 0.611817 0.399117 0.569714 0.327949 0.595965 0.453653 0.739153 0.618166 0.663910 0.492511 0.648903 0.523268 0.399840 0.500827 0.458520 0.132837 0.186117 0.527516 0.147746 0.439527 0.350374 0.240773 0.090973 0.078202 77.000000 0.000000
max 2.454930 22.057729 9.382558 16.875344 34.801666 73.301626 120.589494 20.007208 15.594995 23.745136 12.018913 7.848392 7.126883 10.526766 8.877742 17.315112 9.253526 5.041069 5.591971 39.420904 27.202839 8.361985 22.528412 4.584549 7.519589 3.517346 12.152401 33.847808 19656.530000 1.000000
  • The data seems to have been centered (the means are close to zero), but it has not been scaled to unit variance; V29 is the exception, being strictly non-negative and on a much larger scale
  • Below I check for missing data, first per column (feature), then per row
In [4]:
print(raw.isna().sum())
print('Percentage of missing data in each column: {:.1f}%'.format(100*raw.V1.isna().sum()/len(raw)))
V1       28480
V2       28480
V3       28480
V4       28480
V5       28480
V6       28480
V7       28480
V8       28480
V9       28480
V10      28480
V11      28480
V12      28480
V13      28480
V14      28480
V15      28480
V16      28480
V17      28480
V18      28480
V19      28480
V20      28480
V21      28480
V22      28480
V23      28480
V24      28480
V25      28480
V26      28480
V27      28480
V28      28480
V29      28480
Class        0
dtype: int64
Percentage of missing data in each column: 10.0%

Therefore, all the columns have exactly the same amount of missing data. Then, if we look across rows:

In [5]:
naInRows=raw.isna().sum(axis=1)
print('Percentage of rows with missing data: {:.1f}%'.format(100*(naInRows>0).sum()/len(raw)))
plt.figure(figsize=(16,4))
naInRows.plot()
plt.title('Missing data for each row', fontsize=14)
plt.xlabel('Row number', fontsize=14)
plt.ylabel('Number of missing', fontsize=14);
Percentage of rows with missing data: 95.3%
  • All the columns have the same number of missing values (10% of total)
  • Most rows have missing data, and the plot above shows that the missing data is more or less uniformly distributed across rows
  • Importantly, there are no missing values in the target column
  • I will use simple imputation before modeling the data, which I will explain in more detail in the modeling portion of the document
  • Below I inspect the distribution of the target variable:
In [6]:
raw.Class.value_counts()
Out[6]:
0    284315
1       492
Name: Class, dtype: int64
  • So, this is a heavily imbalanced dataset
  • Now, I will inspect the distribution of individual variables, but before doing that I will partition the dataset into train/test portions and will keep the test portion untouched until I use it for testing the model I will create.
  • Notice that I use stratified partitioning, in order to maintain the proportions of the classification variable in the test and train portions of the dataset
In [7]:
#Here I'm creating a list of features, and I separate the target variable
features=list(raw.columns)
target=features.pop()
In [8]:
from sklearn.model_selection import train_test_split
seed=491
train,test = train_test_split(raw,test_size=0.3,random_state=seed,stratify=raw[target])
In [9]:
#checking that the proportions have been maintained
print(100*train[target].value_counts()/len(train))
print(100*test[target].value_counts()/len(test))
0    99.827451
1     0.172549
Name: Class, dtype: float64
0    99.826785
1     0.173215
Name: Class, dtype: float64
  • Below I plot the histograms of all the individual features to have a sense of the distribution of each variable
  • Note that from this point on, I will only use the train portion of the dataset for analysis and modeling. The test portion will only be used for testing a model
In [10]:
plt.figure(figsize=(16,25))
for i in range(len(features)):
    plt.subplot(8, 4, i+1)
    ax=sns.distplot(train[features[i]].dropna(),kde=False,rug=False)#NaN causes problems
    ax.set_yscale('log')
  • All the variables have outliers (a few isolated values in the right or left tail of the count plots). Unfortunately, since the features are anonymized, it is difficult to guess the meaning of outlier values for any particular variable
  • Below I plot the distribution of features, segregated by the value of the target variable:
In [11]:
def segregatedPlot(df,col,grouping='Class'):
    '''conveniently making segregated plots'''
    #plt.figure(figsize=(9,6))
    sns.distplot(df.groupby(grouping)[col].get_group(0).dropna(),kde=False,rug=False, label=grouping.capitalize()+'=0');
    ax=sns.distplot(df.groupby(grouping)[col].get_group(1).dropna(),kde=False,rug=False, label=grouping.capitalize()+'=1');
    ax.set_yscale('log')
    plt.legend();
In [12]:
plt.figure(figsize=(16,30))
for i in range(len(features)):
    plt.subplot(10, 3, i+1)
    segregatedPlot(train,features[i])
  • The plots above are very illuminating with respect to which variables might have more predictive power. Some of the variables show different distributions depending on the class that a given row belongs to. I can see six different patterns in the plots above:

    1. For some variables the target classification makes little difference in the distribution of the variable
    2. The variable distribution shows a lower peak count for class 1 (V3)
    3. The variable distribution shows a larger peak count for class 1 (V4)
    4. Lower peak count and lower tail values (V9, V10, V12, V14, V16, V17)
    5. Larger peak count and larger tail values (V11)
    6. Lower tail values (V18)
  • Because of the apparent importance of tail values for the class distribution of the target variable, I am not inclined to use classification methods that are sensitive to feature scaling (like logistic regression). So I will first try decision trees and random forests, and see how well they perform on the test data.

  • Before starting to construct a model, I inspect variable correlations below:
In [13]:
corr = train[features].corr()
plt.figure(figsize=(10,10))
#diverging colormap
cmap = sns.diverging_palette(220,10,as_cmap=True)

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
#triu_indices_from() returns the indices for the upper-triangle of arr
mask[np.triu_indices_from(mask)] = True 

sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5});
  • There is little correlation between the features, with the exception of V29 (which, from the plots above, seems to have very little predictive power). This is likely a consequence of the transformation used to anonymize the variables, which, as pointed out earlier in the document, also appear to have been centered
  • Note that V29 is the only variable on a scale significantly different from the others: it takes only non-negative values with a long right tail (it could be the length of stay at a hospital, for example).

Modeling the data

Given the seemingly uniform distribution of the missing data, and the variable distribution plots shown above, I use simple mean imputation to fill the missing data.

Baseline decision-tree model

In [14]:
from sklearn import tree
bench = tree.DecisionTreeClassifier(random_state=seed)
#simple imputation:
train_f=train.fillna(train.mean())
bench.fit(train_f[features],train_f[target])
Out[14]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=491,
            splitter='best')
In [15]:
print('Maximum depth = {}'.format(bench.tree_.max_depth))
print('Default score = {:.3f}'.format(bench.score(train_f[features],train_f[target])))
print('Default feature importances (sorted by importance):')
tups=[(features[_],bench.feature_importances_[_]) for _ in range(len(features))]
tups.sort(key=lambda x:x[1], reverse=True)
for tup in tups:
    print('{}, importance = {:.4f}'.format(tup[0],tup[1]))
Maximum depth = 30
Default score = 1.000
Default feature importances (sorted by importance):
V17, importance = 0.4973
V14, importance = 0.1187
V10, importance = 0.0461
V27, importance = 0.0322
V9, importance = 0.0282
V26, importance = 0.0229
V1, importance = 0.0218
V16, importance = 0.0181
V28, importance = 0.0180
V25, importance = 0.0180
V12, importance = 0.0165
V20, importance = 0.0156
V4, importance = 0.0152
V6, importance = 0.0148
V21, importance = 0.0147
V29, importance = 0.0130
V19, importance = 0.0117
V15, importance = 0.0115
V13, importance = 0.0097
V11, importance = 0.0097
V7, importance = 0.0088
V23, importance = 0.0082
V8, importance = 0.0061
V24, importance = 0.0057
V22, importance = 0.0056
V18, importance = 0.0047
V5, importance = 0.0044
V3, importance = 0.0029
V2, importance = 0.0000
  • Unsurprisingly, the benchmark decision tree overfits the data (it reaches a perfect training score with a maximum depth of 30). An interesting observation is that the top three features for the tree are in the list of variables I expected to have more predictive power.
  • Below I use cross validation with stratified partitioning of the train dataset to find an optimal depth for a tree
In [16]:
from sklearn.model_selection import StratifiedKFold, GridSearchCV
benchCV = StratifiedKFold(n_splits=6,random_state=seed)
param_test = {
    'max_depth':[2,5,20,30]}#[_ for _ in range(2,30)] produces 5
dptSearch = GridSearchCV(
    estimator=tree.DecisionTreeClassifier(random_state=seed),                     
    param_grid = param_test, scoring='roc_auc',n_jobs=4,
    iid=False, cv=benchCV,return_train_score=True)

dptSearch.fit(train_f[features],train_f[target])
dptSearch.best_params_, dptSearch.best_score_
Out[16]:
({'max_depth': 5}, 0.8805669521429355)
  • The best cross-validation AUC score obtained with this tree is 88%, which is good. Below I check the ROC-AUC score on the test set, after fitting a decision tree with a maximum depth of 5:
In [17]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
bench = tree.DecisionTreeClassifier(random_state=seed,max_depth=5)
bench.fit(train_f[features],train_f[target])
#simple imputation of test dataset:
test_f=test.fillna(test.mean())
print('Decision-tree ROC-AUC score = {:.3f}'.format(roc_auc_score(test_f[target],bench.predict_proba(test_f[features])[:,1])))
print('Feature importances (sorted by importance):')
tups=[(features[_],bench.feature_importances_[_]) for _ in range(len(features))]
tups.sort(key=lambda x:x[1], reverse=True)
for tup in tups:
    print('{}, importance = {:.4f}'.format(tup[0],tup[1]))
Decision-tree ROC-AUC score = 0.841
Feature importances (sorted by importance):
V17, importance = 0.6438
V14, importance = 0.1367
V10, importance = 0.0514
V27, importance = 0.0390
V9, importance = 0.0326
V26, importance = 0.0237
V16, importance = 0.0212
V20, importance = 0.0130
V1, importance = 0.0123
V7, importance = 0.0110
V11, importance = 0.0065
V12, importance = 0.0052
V22, importance = 0.0036
V2, importance = 0.0000
V3, importance = 0.0000
V4, importance = 0.0000
V5, importance = 0.0000
V6, importance = 0.0000
V8, importance = 0.0000
V13, importance = 0.0000
V15, importance = 0.0000
V18, importance = 0.0000
V19, importance = 0.0000
V21, importance = 0.0000
V23, importance = 0.0000
V24, importance = 0.0000
V25, importance = 0.0000
V28, importance = 0.0000
V29, importance = 0.0000
  • As should be expected, the ROC-AUC score on the test data is lower than in cross-validation, but not by much (84% vs. 88%)
  • Note that the feature importances agree pretty well, for the most part, with the variables I highlighted in the EDA portion of this analysis
  • Below I calculate the confusion matrix for the predictions of the decision-tree model
In [18]:
#the confusion matrix for the prediction is below
print('\tTn\tFp\n\tFn\tTp')
confusion_matrix(test_f[target],bench.predict(test_f[features]))
	Tn	Fp
	Fn	Tp
Out[18]:
array([[85275,    20],
       [   40,   108]])
  • The confusion matrix above shows reasonable results as well (sensitivity: 73%, specificity: 99.9%)
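  • As a sanity check, the sensitivity and specificity quoted above can be computed directly from the confusion matrix. The helper below is a small sketch of mine (not part of the original notebook), assuming sklearn's [[tn, fp], [fn, tp]] layout:

def sens_spec(cm):
    '''sketch: sensitivity and specificity from a 2x2 confusion matrix
       laid out as [[tn, fp], [fn, tp]], which is sklearn's convention'''
    (tn, fp), (fn, tp) = cm
    return tp / (tp + fn), tn / (tn + fp)

#for the decision-tree matrix above:
#sens_spec([[85275, 20], [40, 108]]) returns approximately (0.730, 0.9998)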

Constructing a random-forest model

  • First I use SelectFromModel, below, to select features for the model based on importance weights. The threshold for selection computed automatically by SelectFromModel is 0.034483, and the features selected automatically are V10, V11, V12, V14, V16 and V17 (a sketch of how this automatic threshold can be inspected follows this list).
  • Note that I listed all these variables as likely to have more predictive power in the EDA portion of the analysis.
  • Since a few of the variables I listed were not selected automatically (e.g. V3, V4, V9 and V18), I performed a brief sensitivity analysis, slightly reducing the automatic threshold to see if some of these variables would be picked up:
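  • A sketch of how the automatic threshold mentioned above can be inspected: when no threshold is given, SelectFromModel defaults to the mean feature importance and exposes the value it used as threshold_ after fitting (imports included so the snippet is self-contained):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

sel_auto = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=seed, n_jobs=4))
sel_auto.fit(train_f[features], train_f[target])
print('Automatic threshold: {:.6f}'.format(sel_auto.threshold_))#should match the 0.034483 quoted above
print('Automatically selected:', [f for f, keep in zip(features, sel_auto.get_support()) if keep])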
In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel #for selecting features for the model
sel = SelectFromModel(RandomForestClassifier(n_estimators = 100,random_state=seed,n_jobs=4),
                     threshold=0.03)
sel.fit(train_f[features], train_f[target])
Out[19]:
SelectFromModel(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=4,
            oob_score=False, random_state=491, verbose=0, warm_start=False),
        norm_order=1, prefit=False, threshold=0.03)
In [20]:
selected=[features[i] for i in range(len(features)) if sel.get_support()[i]]
print('The selected variables for my random forest model are: ')
for _ in selected:
    print(_)
The selected variables for my random forest model are: 
V9
V10
V11
V12
V14
V16
V17
V18
  • As noted above, I listed all these variables as likely to have more predictive power in the EDA portion of the analysis
  • Below I use cross validation to find the optimal number of trees for the random forest model.
In [21]:
param_test2 = {
    'n_estimators':[2,50,76,100]}#[_ for _ in range(2,100)] selects 76
forestSearch = GridSearchCV(
    estimator=RandomForestClassifier(random_state=seed),                     
    param_grid = param_test2, scoring='roc_auc',n_jobs=4,
    iid=False, cv=benchCV,return_train_score=True)

forestSearch.fit(train_f[selected], train_f[target])
forestSearch.best_params_, forestSearch.best_score_
Out[21]:
({'n_estimators': 76}, 0.9287616474384496)
  • The best cross-validation AUC score obtained with this random-forest model is 92.9%.
  • Below I calculate the confusion matrix for the predictions of the random-forest model, on the test set:
In [22]:
#confusion matrix for the best forest model
print('\tTn\tFp\n\tFn\tTp')
confusion_matrix(test_f[target],forestSearch.predict(test_f[selected]))
	Tn	Fp
	Fn	Tp
Out[22]:
array([[85282,    13],
       [   30,   118]])
  • The random-forest model sensitivity and specificity are 79.7% and 99.9%, respectively

Comparing the models by constructing ROC curves

In [23]:
from sklearn.metrics import auc #will use below, when plotting the curves
from roc_curve import get_roc_curve #I put this code in a separate file for clarity
#making an array of thresholds
thresholds=np.linspace(0,1,15)
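  • The file roc_curve.py is not reproduced in this document; below is a minimal sketch of what get_roc_curve might look like, assuming it sweeps the supplied thresholds over the class-1 probabilities and returns the true and false positive rates at each threshold (the actual implementation may differ):

import numpy as np

def get_roc_curve(y_true, probs, thresholds):
    '''sketch: true and false positive rates at each threshold
       y_true: array of 0/1 labels, probs: predicted probabilities of class 1'''
    y_true = np.asarray(y_true)
    positives = (y_true == 1).sum()
    negatives = (y_true == 0).sum()
    tpr, fpr = [], []
    for t in thresholds:
        preds = (probs >= t).astype(int)
        tpr.append(((preds == 1) & (y_true == 1)).sum() / positives)
        fpr.append(((preds == 1) & (y_true == 0)).sum() / negatives)
    return np.array(tpr), np.array(fpr)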
  • Below I construct and plot an ROC curve for my decision-tree model
In [24]:
plt.figure(figsize=(10,7))
tprate=[]
fprate=[]
aucs=[]
counter=0
for tr_fold, test_fold in benchCV.split(train_f[selected],train_f[target]):
    bench.fit(train_f[selected].iloc[tr_fold],train_f[target].iloc[tr_fold])
    probs=bench.predict_proba(train_f[selected].iloc[test_fold])[:,1]#prob of class 1
    tpr,fpr=get_roc_curve(train_f[target].iloc[test_fold].values,probs,thresholds)
    tprate.append(tpr)
    fprate.append(fpr)
    auc_fold=auc(fpr, tpr)
    aucs.append(auc_fold)
    plt.plot(fpr,tpr,lw=1,alpha=0.3,
             label='Fold {}, AUC={:.3f}'.format(counter,auc_fold))
    #plt.plot(fpr[index],tpr[index],'r+');
    counter+=1

#chance curve
plt.plot([0, 1], [0, 1],linestyle='--',lw=1, alpha=.8,
         color='r',label='Chance')

mean_tpr = np.mean(tprate, axis=0)
mean_fpr = np.mean(fprate, axis=0)
mean_auc = np.mean(aucs)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr,lw=2, color='b',
         label='Mean ROC (AUC={:.3f} $\pm$ {:.3f})'.format(mean_auc, std_auc))

#the cross shows the value corresponding to a standard threshold of 0.5
index=np.where(thresholds==0.5)[0][0]#assuming that 0.5 is present in thresholds
plt.plot(mean_fpr[index],mean_tpr[index],'r+')

plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC decision-tree model', fontsize=18)
plt.legend(loc="lower right", prop={'size': 12});
  • The red cross in this plot shows the true and false positive rates for the standard classification threshold of 0.5. Similar crosses are also shown in the plots below
In [25]:
forest = RandomForestClassifier(random_state=seed,n_estimators=76)
In [26]:
plt.figure(figsize=(10,7))
tpr_forest=[]
fpr_forest=[]
aucs_forest=[]
counter=0
for tr_fold, test_fold in benchCV.split(train_f[selected],train_f[target]):
    forest.fit(train_f[selected].iloc[tr_fold],train_f[target].iloc[tr_fold])
    probs=forest.predict_proba(train_f[selected].iloc[test_fold])[:,1]#prob of class 1
    tpr,fpr=get_roc_curve(train_f[target].iloc[test_fold].values,probs,thresholds)
    tpr_forest.append(tpr)
    fpr_forest.append(fpr)
    auc_fold=auc(fpr, tpr)
    aucs_forest.append(auc_fold)
    plt.plot(fpr,tpr,lw=1,alpha=0.3,
             label='Fold {}, AUC={:.3f}'.format(counter,auc_fold))
    #plt.plot(fpr[index],tpr[index],'r+');
    counter+=1

#chance curve
plt.plot([0, 1], [0, 1],linestyle='--',lw=1, alpha=.8,
         color='r',label='Chance')

mean_tpr_f = np.mean(tpr_forest, axis=0)
mean_fpr_f = np.mean(fpr_forest, axis=0)
mean_auc_f = np.mean(aucs_forest)
std_auc_f = np.std(aucs_forest)
plt.plot(mean_fpr_f, mean_tpr_f,lw=2, color='b',
         label='Mean ROC (AUC={:.3f} $\pm$ {:.3f})'.format(mean_auc_f, std_auc_f))

#the cross shows the value corresponding to a standard threshold of 0.5
index=np.where(thresholds ==0.5)[0][0]#assuming that 0.5 is present in thresholds
plt.plot(mean_fpr_f[index],mean_tpr_f[index],'r+')

plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC random-forest model', fontsize=18)
plt.legend(loc="lower right", prop={'size': 12});
  • Below I compare the mean ROC curves for both models:
In [27]:
#comparing the two models
plt.figure(figsize=(10,7))
#decision tree ROC
plt.plot(mean_fpr, mean_tpr,lw=2,
         label='Dtree Mean ROC (AUC={:.2f} $\pm$ {:.2f})'.format(mean_auc, std_auc))

plt.plot(mean_fpr_f, mean_tpr_f,lw=2,
         label='Rforest Mean ROC (AUC={:.2f} $\pm$ {:.2f})'.format(mean_auc_f, std_auc_f))
#chance curve
plt.plot([0, 1], [0, 1],linestyle='--',lw=1, alpha=.8,
         color='r',label='Chance')

plt.plot(mean_fpr[index],mean_tpr[index],'r+')
plt.plot(mean_fpr_f[index],mean_tpr_f[index],'g+')

plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC for the two models', fontsize=18)
plt.legend(loc="lower right", prop={'size': 12});
  • Since the crosses show the standard classification threshold of 0.5, it is clear that we could reduce this threshold in order to improve the true positive rate without significantly increasing the false positive rate.
  • To investigate this in more detail, I show a table with the mean true and false positive rates for both the decision tree and random forest models, together with the corresponding thresholds below:
In [28]:
rates=pd.concat([pd.Series(thresholds,name='Thresholds'),
             pd.Series(mean_tpr,name='tree_tpr'),
             pd.Series(mean_fpr,name='tree_fpr'),
             pd.Series(mean_tpr_f,name='forest_tpr'),
             pd.Series(mean_fpr_f,name='forest_fpr')
              ],axis=1)
rates
Out[28]:
Thresholds tree_tpr tree_fpr forest_tpr forest_fpr
0 0.000000 1.000000 1.000000 1.000000 1.000000
1 0.071429 0.746824 0.000367 0.831317 0.000839
2 0.142857 0.743900 0.000276 0.813874 0.000432
3 0.214286 0.740976 0.000266 0.808076 0.000301
4 0.285714 0.712039 0.000246 0.796431 0.000251
5 0.357143 0.697570 0.000231 0.784735 0.000196
6 0.428571 0.697570 0.000231 0.776013 0.000171
7 0.500000 0.691722 0.000201 0.764317 0.000136
8 0.571429 0.648165 0.000141 0.738153 0.000111
9 0.642857 0.645241 0.000121 0.700393 0.000085
10 0.714286 0.630672 0.000116 0.653962 0.000075
11 0.785714 0.618976 0.000111 0.604557 0.000070
12 0.857143 0.613128 0.000106 0.479835 0.000045
13 0.928571 0.560496 0.000070 0.291036 0.000025
14 1.000000 0.244051 0.000045 0.069873 0.000005
  • As can be seen in the table above, we can indeed reduce the classification threshold without increasing the false positive rate considerably. This is equivalent to increasing the number of true positives, i.e., reducing the number of false negatives.
  • However, because actual negatives vastly outnumber actual positives, even a small increase in the false positive rate translates into many additional false positives in absolute terms, which might not be desirable.
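  • As a sketch of how a lowered threshold would be applied in practice (the value 0.3 below is purely illustrative, not a recommended setting):

#sketch: classify with a custom threshold instead of the 0.5 implied by predict()
illustrative_threshold = 0.3
probs_test = forestSearch.predict_proba(test_f[selected])[:, 1]#probability of class 1
custom_preds = (probs_test >= illustrative_threshold).astype(int)
confusion_matrix(test_f[target], custom_preds)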

Conclusions

  1. The exploratory data analysis reveals several interesting trends in the data. In particular, the distributions corresponding to each class of the target variable (0 or 1) are different for a subset of features (variables V3, V4, V9 to V12, V14, V16, V17 and V18).
  2. For a few variables, outliers in the distribution seem to correlate with belonging to class 1
  3. Missing data seems to be distributed randomly and uniformly across the data, totaling around 10% for each column.
  4. Feature importances for my decision tree model, and feature selection with a random forest model correspond well with the variables highlighted in points 1 and 2
  5. A random-forest model with 76 trees improves on the decision-tree model, decreasing both the number of false positives and the number of false negatives (sensitivity: 79.7%, specificity: 99.9%)
  6. As shown in the ROC-curve analysis section, the threshold for classification can be changed to optimize the number of false positives or false negatives. Selecting this threshold would probably require external input