Helping detect a rare disease

Below I construct a decision-tree model and a random-forest model to help detect a condition with an incidence of roughly 0.173% among patients (about 173 in 100,000). The models are built using the anonymized records of more than 280,000 patients, with 29 anonymized variables per patient plus a classification label: each patient is labeled as either having the condition ('Class'=1) or not having it ('Class'=0).

Author: Leonardo Espin

Date: August 8, 2019

  • First I will load all the libraries I will use for the exploratory data analysis
In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

pd.set_option('display.max_columns', 35)
#the lines below get rid of the plotting library warnings
import warnings
warnings.filterwarnings("ignore")
  • Next I will read the data into a dataframe, print the size of the dataset, the first few lines of data and the types of variables
In [2]:
raw=pd.read_csv('data_challenge_dataset.csv')
print(raw.shape)
print(raw.dtypes)
raw.head()
(284807, 30)
V1       float64
V2       float64
V3       float64
V4       float64
V5       float64
V6       float64
V7       float64
V8       float64
V9       float64
V10      float64
V11      float64
V12      float64
V13      float64
V14      float64
V15      float64
V16      float64
V17      float64
V18      float64
V19      float64
V20      float64
V21      float64
V22      float64
V23      float64
V24      float64
V25      float64
V26      float64
V27      float64
V28      float64
V29      float64
Class      int64
dtype: object
Out[2]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 Class
0 NaN -0.072781 2.536347 NaN -0.338321 0.462388 NaN 0.098698 0.363787 0.090794 -0.551600 -0.617801 -0.991390 -0.311169 1.468177 -0.470401 NaN 0.025791 0.403993 0.251412 -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 NaN -0.021053 149.62 0
1 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 NaN -0.166974 NaN 1.065235 NaN -0.143772 0.635558 0.463917 -0.114805 -0.183361 -0.145783 NaN -0.225775 -0.638672 NaN -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.207643 0.624501 0.066084 0.717293 -0.165946 2.345865 -2.890083 1.109969 -0.121359 -2.261857 0.524980 0.247998 0.771679 0.909412 NaN -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 -0.054952 -0.226487 0.178228 0.507757 -0.287924 -0.631418 -1.059647 -0.684093 1.965775 -1.232622 -0.208038 -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 NaN -0.270533 0.817739 0.753074 -0.822843 0.538196 1.345852 -1.119670 NaN -0.451449 NaN -0.038195 0.803487 0.408542 -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0

Exploratory data analysis (EDA)

  • First I print summary statistics of the dataset
In [3]:
raw.describe()
Out[3]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 Class
count 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 256327.000000 284807.000000
mean -0.001855 0.001016 0.000476 0.000029 0.000192 0.000221 -0.000832 0.001599 -0.001070 -0.001081 0.000246 -0.000274 0.000800 0.000019 -0.000632 -0.000546 -0.000080 0.000083 -0.000345 -0.000258 0.000373 -0.000521 0.000435 0.000242 -0.000116 -0.000088 0.000086 0.000153 88.463396 0.001727
std 1.957926 1.651584 1.519391 1.417511 1.378669 1.332159 1.241105 1.187431 1.099997 1.085192 1.019877 1.000322 0.995882 0.955731 0.915411 0.876783 0.850534 0.838474 0.814060 0.772639 0.733762 0.725003 0.618081 0.605657 0.521329 0.482607 0.397504 0.330702 248.040629 0.041527
min -56.407510 -72.715728 -48.325589 -5.683171 -113.743307 -21.929312 -43.557242 -73.216718 -13.434066 -24.588262 -4.797473 -18.683715 -5.791881 -18.493773 -4.498945 -14.129855 -25.162799 -9.498746 -7.213527 -54.497720 -34.830382 -10.933144 -44.807735 -2.836627 -10.295397 -2.604551 -22.565679 -15.430084 0.000000 0.000000
25% -0.920793 -0.598073 -0.888388 -0.848423 -0.692812 -0.768247 -0.553502 -0.208516 -0.644112 -0.535903 -0.761755 -0.406160 -0.647872 -0.426090 -0.583406 -0.469167 -0.483949 -0.498586 -0.456986 -0.211756 -0.228512 -0.542946 -0.161762 -0.354556 -0.317167 -0.327224 -0.070878 -0.052945 5.640000 0.000000
50% 0.015389 0.065034 0.179871 -0.018836 -0.055993 -0.273935 0.039835 0.022608 -0.052607 -0.092313 -0.031499 0.139847 -0.013511 0.050509 0.047068 0.066032 -0.065844 -0.003976 0.003502 -0.062298 -0.029669 0.006540 -0.011011 0.041439 0.015954 -0.052294 0.001296 0.011181 22.000000 0.000000
75% 1.315404 0.804324 1.028192 0.742473 0.611817 0.399117 0.569714 0.327949 0.595965 0.453653 0.739153 0.618166 0.663910 0.492511 0.648903 0.523268 0.399840 0.500827 0.458520 0.132837 0.186117 0.527516 0.147746 0.439527 0.350374 0.240773 0.090973 0.078202 77.000000 0.000000
max 2.454930 22.057729 9.382558 16.875344 34.801666 73.301626 120.589494 20.007208 15.594995 23.745136 12.018913 7.848392 7.126883 10.526766 8.877742 17.315112 9.253526 5.041069 5.591971 39.420904 27.202839 8.361985 22.528412 4.584549 7.519589 3.517346 12.152401 33.847808 19656.530000 1.000000
  • The data seems to have been centered (the means are close to zero), but it has not been scaled to unit variance; V29 is the exception, being strictly non-negative and on a much larger scale
  • Below I check for missing data, first per column (feature), then per row
In [4]:
print(raw.isna().sum())
print('Percentage of missing data in each column: {:.1f}%'.format(100*raw.V1.isna().sum()/len(raw)))
V1       28480
V2       28480
V3       28480
V4       28480
V5       28480
V6       28480
V7       28480
V8       28480
V9       28480
V10      28480
V11      28480
V12      28480
V13      28480
V14      28480
V15      28480
V16      28480
V17      28480
V18      28480
V19      28480
V20      28480
V21      28480
V22      28480
V23      28480
V24      28480
V25      28480
V26      28480
V27      28480
V28      28480
V29      28480
Class        0
dtype: int64
Percentage of missing data in each column: 10.0%

Therefore, all the columns have exactly the same amount of missing data. Then, if we look across rows:

In [5]:
naInRows=raw.isna().sum(axis=1)
print('Percentage of rows with missing data: {:.1f}%'.format(100*(naInRows>0).sum()/len(raw)))
plt.figure(figsize=(16,4))
naInRows.plot()
plt.title('Missing data for each row', fontsize=14)
plt.xlabel('Row number', fontsize=14)
plt.ylabel('Number of missing', fontsize=14);
Percentage of rows with missing data: 95.3%
  • All the columns have the same number of missing values (10% of total)
  • Most rows have missing data, and the plot above shows that the missing data is more or less uniformly distributed across rows
  • Importantly, there are no missing values in the target column
  • I will use simple imputation before modeling the data, which I will explain in more detail in the modeling portion of the document
  • Below I inspect the distribution of the target variable:
In [6]:
raw.Class.value_counts()
Out[6]:
0    284315
1       492
Name: Class, dtype: int64
  • So, this is a heavily imbalanced dataset
  • Now, I will inspect the distribution of individual variables, but before doing that I will partition the dataset into train/test portions and will keep the test portion untouched until I use it for testing the model I will create.
  • Notice that I use stratified partitioning, in order to maintain the proportions of the classification variable in the test and train portions of the dataset
In [7]:
#Here I'm creating a list of features, and I separate the target variable
features=list(raw.columns)
target=features.pop()
In [8]:
from sklearn.model_selection import train_test_split
seed=491
train,test = train_test_split(raw,test_size=0.3,random_state=seed,stratify=raw[target])
In [9]:
#checking that the proportions have been maintained
print(100*train[target].value_counts()/len(train))
print(100*test[target].value_counts()/len(test))
0    99.827451
1     0.172549
Name: Class, dtype: float64
0    99.826785
1     0.173215
Name: Class, dtype: float64
  • Below I plot the histograms of all the individual features to have a sense of the distribution of each variable
  • Note that from this point on, I will only use the train portion of the dataset for analysis and modeling. The test portion will only be used for testing a model
In [10]:
plt.figure(figsize=(16,25))
for i in range(len(features)):
    plt.subplot(8, 4, i+1)
    ax=sns.distplot(train[features[i]].dropna(),kde=False,rug=False)#NaN causes problems
    ax.set_yscale('log')
  • All the variables have outliers (a few isolated values in the right or left tail of the count plots). Unfortunately, since the features are anonymized, it is difficult to guess the meaning of outlier values for any particular variable
  • Below I plot the distribution of features, segregated by the value of the target variable:
In [11]:
def segregatedPlot(df,col,grouping='Class'):
    '''conveniently making segregated plots'''
    #plt.figure(figsize=(9,6))
    sns.distplot(df.groupby(grouping)[col].get_group(0).dropna(),kde=False,rug=False, label=grouping.capitalize()+'=0');
    ax=sns.distplot(df.groupby(grouping)[col].get_group(1).dropna(),kde=False,rug=False, label=grouping.capitalize()+'=1');
    ax.set_yscale('log')
    plt.legend();
In [12]:
plt.figure(figsize=(16,30))
for i in range(len(features)):
    plt.subplot(10, 3, i+1)
    segregatedPlot(train,features[i])
  • The plots above are very illuminating with respect to which variables might have more predictive power. Some of the variables show different distributions depending on the class that a given row belongs to. I can see six different patterns in the plots above:

    1. For some variables the target classification makes little difference in the distribution of the variable
    2. The variable distribution shows a lower peak count for class 1 (V3)
    3. The variable distribution shows a larger peak count for class 1 (V4)
    4. Lower peak count and lower tail values (V9, V10, V12, V14, V16, V17)
    5. Larger peak count and larger tail values (V11)
    6. Lower tail values (V18)
  • Because of the apparent importance of tail values for the class distribution of the target variable, I am not inclined to use classification methods that are sensitive to feature scaling (like logistic regression). So I will first try decision trees and random forests, and see how well they perform on the test data.

  • Before starting to construct a model, I inspect variable correlations below:
In [13]:
corr = train[features].corr()
plt.figure(figsize=(10,10))
#diverging colormap
cmap = sns.diverging_palette(220,10,as_cmap=True)

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
#triu_indices_from() returns the indices for the upper-triangle of arr
mask[np.triu_indices_from(mask)] = True 

sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5});
  • There is little correlation between the features, with the exception of V29 (which, from the plots above, seems to have very little predictive power). This is likely a consequence of the transformation used to anonymize the variables, which, as pointed out earlier in the document, also appear to have been centered
  • Note that V29 is the only variable on a scale significantly different from the others: it takes only non-negative values with a long right tail (it could be the length of stay at a hospital, for example).

Modeling the data

Given the seemingly uniform distribution of the missing data, and the variable distribution plots shown above, I use simple mean imputation to fill the missing data.

Baseline decision-tree model

In [14]:
from sklearn import tree
bench = tree.DecisionTreeClassifier(random_state=seed)
#simple imputation:
train_f=train.fillna(train.mean())
bench.fit(train_f[features],train_f[target])
Out[14]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=491,
            splitter='best')
In [15]:
print('Maximum depth = {}'.format(bench.tree_.max_depth))
print('Default score = {:.3f}'.format(bench.score(train_f[features],train_f[target])))
print('Default feature importances (sorted by importance):')
tups=[(features[_],bench.feature_importances_[_]) for _ in range(len(features))]
tups.sort(key=lambda x:x[1], reverse=True)
for tup in tups:
    print('{}, importance = {:.4f}'.format(tup[0],tup[1]))
Maximum depth = 30
Default score = 1.000
Default feature importances (sorted by importance):
V17, importance = 0.4973
V14, importance = 0.1187
V10, importance = 0.0461
V27, importance = 0.0322
V9, importance = 0.0282
V26, importance = 0.0229
V1, importance = 0.0218
V16, importance = 0.0181
V28, importance = 0.0180
V25, importance = 0.0180
V12, importance = 0.0165
V20, importance = 0.0156
V4, importance = 0.0152
V6, importance = 0.0148
V21, importance = 0.0147
V29, importance = 0.0130
V19, importance = 0.0117
V15, importance = 0.0115
V13, importance = 0.0097
V11, importance = 0.0097
V7, importance = 0.0088
V23, importance = 0.0082
V8, importance = 0.0061
V24, importance = 0.0057
V22, importance = 0.0056
V18, importance = 0.0047
V5, importance = 0.0044
V3, importance = 0.0029
V2, importance = 0.0000
  • Unsurprisingly, the benchmark decision tree overfits the data (it reaches a perfect training score with a maximum depth of 30). An interesting observation is that the top three features for the tree are in the list of variables I expected to have more predictive power.
  • Below I use cross validation with stratified partitioning of the train dataset to find an optimal depth for a tree
In [16]:
from sklearn.model_selection import StratifiedKFold, GridSearchCV
benchCV = StratifiedKFold(n_splits=6,random_state=seed)
param_test = {
    'max_depth':[2,5,20,30]}#[_ for _ in range(2,30)] produces 5
dptSearch = GridSearchCV(
    estimator=tree.DecisionTreeClassifier(random_state=seed),                     
    param_grid = param_test, scoring='roc_auc',n_jobs=4,
    iid=False, cv=benchCV,return_train_score=True)

dptSearch.fit(train_f[features],train_f[target])
dptSearch.best_params_, dptSearch.best_score_
Out[16]:
({'max_depth': 5}, 0.8805669521429355)
  • The best cross-validation AUC score obtained with this tree is 88%, which is good. Below I check the ROC-AUC score on the test set, after fitting a decision tree with a maximum depth of 5:
In [17]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
bench = tree.DecisionTreeClassifier(random_state=seed,max_depth=5)
bench.fit(train_f[features],train_f[target])
#simple imputation of test dataset:
test_f=test.fillna(test.mean())
print('Decision-tree ROC-AUC score = {:.3f}'.format(roc_auc_score(test_f[target],bench.predict_proba(test_f[features])[:,1])))
print('Feature importances (sorted by importance):')
tups=[(features[_],bench.feature_importances_[_]) for _ in range(len(features))]
tups.sort(key=lambda x:x[1], reverse=True)
for tup in tups:
    print('{}, importance = {:.4f}'.format(tup[0],tup[1]))
Decision-tree ROC-AUC score = 0.841
Feature importances (sorted by importance):
V17, importance = 0.6438
V14, importance = 0.1367
V10, importance = 0.0514
V27, importance = 0.0390
V9, importance = 0.0326
V26, importance = 0.0237
V16, importance = 0.0212
V20, importance = 0.0130
V1, importance = 0.0123
V7, importance = 0.0110
V11, importance = 0.0065
V12, importance = 0.0052
V22, importance = 0.0036
V2, importance = 0.0000
V3, importance = 0.0000
V4, importance = 0.0000
V5, importance = 0.0000
V6, importance = 0.0000
V8, importance = 0.0000
V13, importance = 0.0000
V15, importance = 0.0000
V18, importance = 0.0000
V19, importance = 0.0000
V21, importance = 0.0000
V23, importance = 0.0000
V24, importance = 0.0000
V25, importance = 0.0000
V28, importance = 0.0000
V29, importance = 0.0000
  • As should be expected, the ROC-AUC score on the test data is lower than in cross-validation, but not by much (84% vs. 88%)
  • Note that the feature importances agree pretty well, for the most part, with the variables I highlighted in the EDA portion of this analysis
  • Below I calculate the confusion matrix for the predictions of the decision-tree model
In [18]:
#the confusion matrix for the prediction is below
print('\tTn\tFp\n\tFn\tTp')
confusion_matrix(test_f[target],bench.predict(test_f[features]))
	Tn	Fp
	Fn	Tp
Out[18]:
array([[85275,    20],
       [   40,   108]])
  • The confusion matrix above shows reasonable results as well (sensitivity: 73%, specificity: 99.9%)
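  • As a sanity check, the sensitivity and specificity quoted above can be computed directly from the confusion matrix. The helper below is a small sketch of mine (not part of the original notebook), assuming sklearn's [[tn, fp], [fn, tp]] layout:

def sens_spec(cm):
    '''sketch: sensitivity and specificity from a 2x2 confusion matrix
       laid out as [[tn, fp], [fn, tp]], which is sklearn's convention'''
    (tn, fp), (fn, tp) = cm
    return tp / (tp + fn), tn / (tn + fp)

#for the decision-tree matrix above:
#sens_spec([[85275, 20], [40, 108]]) returns approximately (0.730, 0.9998)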

Constructing a random-forest model

  • First I use SelectFromModel, below, to select features for the model based on importance weights. The threshold for selection computed automatically by SelectFromModel is 0.034483, and the features selected automatically are V10, V11, V12, V14, V16 and V17 (a sketch of how this automatic threshold can be inspected follows this list).
  • Note that I listed all these variables as likely to have more predictive power in the EDA portion of the analysis.
  • Since a few of the variables I listed were not selected automatically (e.g. V3, V4, V9 and V18), I performed a brief sensitivity analysis, slightly reducing the automatic threshold to see if some of these variables would be picked up:
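  • A sketch of how the automatic threshold mentioned above can be inspected: when no threshold is given, SelectFromModel defaults to the mean feature importance and exposes the value it used as threshold_ after fitting (imports included so the snippet is self-contained):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

sel_auto = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=seed, n_jobs=4))
sel_auto.fit(train_f[features], train_f[target])
print('Automatic threshold: {:.6f}'.format(sel_auto.threshold_))#should match the 0.034483 quoted above
print('Automatically selected:', [f for f, keep in zip(features, sel_auto.get_support()) if keep])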
In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel #for selecting features for the model
sel = SelectFromModel(RandomForestClassifier(n_estimators = 100,random_state=seed,n_jobs=4),
                     threshold=0.03)
sel.fit(train_f[features], train_f[target])
Out[19]:
SelectFromModel(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=4,
            oob_score=False, random_state=491, verbose=0, warm_start=False),
        norm_order=1, prefit=False, threshold=0.03)
In [20]:
selected=[features[i] for i in range(len(features)) if sel.get_support()[i]]
print('The selected variables for my random forest model are: ')
for _ in selected:
    print(_)
The selected variables for my random forest model are: 
V9
V10
V11
V12
V14
V16
V17
V18
  • As noted above, I listed all these variables as likely to have more predictive power in the EDA portion of the analysis
  • Below I use cross validation to find the optimal number of trees for the random forest model.
In [21]:
param_test2 = {
    'n_estimators':[2,50,76,100]}#[_ for _ in range(2,100)] selects 76
forestSearch = GridSearchCV(
    estimator=RandomForestClassifier(random_state=seed),                     
    param_grid = param_test2, scoring='roc_auc',n_jobs=4,
    iid=False, cv=benchCV,return_train_score=True)

forestSearch.fit(train_f[selected], train_f[target])
forestSearch.best_params_, forestSearch.best_score_
Out[21]:
({'n_estimators': 76}, 0.9287616474384496)
  • The best cross-validation AUC score obtained with this random-forest model is 92.9%.
  • Below I calculate the confusion matrix for the predictions of the random-forest model, on the test set:
In [22]:
#confusion matrix for the best forest model
print('\tTn\tFp\n\tFn\tTp')
confusion_matrix(test_f[target],forestSearch.predict(test_f[selected]))
	Tn	Fp
	Fn	Tp
Out[22]:
array([[85282,    13],
       [   30,   118]])
  • The random-forest model sensitivity and specificity are 79.7% and 99.9%, respectively

Comparing the models by constructing ROC curves

In [23]:
from sklearn.metrics import auc #will use below, when plotting the curves
from roc_curve import get_roc_curve #I put this code in a separate file for clarity
#making an array of thresholds
thresholds=np.linspace(0,1,15)
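  • The file roc_curve.py is not reproduced in this document; below is a minimal sketch of what get_roc_curve might look like, assuming it sweeps the supplied thresholds over the class-1 probabilities and returns the true and false positive rates at each threshold (the actual implementation may differ):

import numpy as np

def get_roc_curve(y_true, probs, thresholds):
    '''sketch: true and false positive rates at each threshold
       y_true: array of 0/1 labels, probs: predicted probabilities of class 1'''
    y_true = np.asarray(y_true)
    positives = (y_true == 1).sum()
    negatives = (y_true == 0).sum()
    tpr, fpr = [], []
    for t in thresholds:
        preds = (probs >= t).astype(int)
        tpr.append(((preds == 1) & (y_true == 1)).sum() / positives)
        fpr.append(((preds == 1) & (y_true == 0)).sum() / negatives)
    return np.array(tpr), np.array(fpr)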
  • Below I construct and plot an ROC curve for my decision-tree model
In [24]:
plt.figure(figsize=(10,7))
tprate=[]
fprate=[]
aucs=[]
counter=0
for tr_fold, test_fold in benchCV.split(train_f[selected],train_f[target]):
    bench.fit(train_f[selected].iloc[tr_fold],train_f[target].iloc[tr_fold])
    probs=bench.predict_proba(train_f[selected].iloc[test_fold])[:,1]#prob of class 1
    tpr,fpr=get_roc_curve(train_f[target].iloc[test_fold].values,probs,thresholds)
    tprate.append(tpr)
    fprate.append(fpr)
    auc_fold=auc(fpr, tpr)
    aucs.append(auc_fold)
    plt.plot(fpr,tpr,lw=1,alpha=0.3,
             label='Fold {}, AUC={:.3f}'.format(counter,auc_fold))
    #plt.plot(fpr[index],tpr[index],'r+');
    counter+=1

#chance curve
plt.plot([0, 1], [0, 1],linestyle='--',lw=1, alpha=.8,
         color='r',label='Chance')

mean_tpr = np.mean(tprate, axis=0)
mean_fpr = np.mean(fprate, axis=0)
mean_auc = np.mean(aucs)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr,lw=2, color='b',
         label='Mean ROC (AUC={:.3f} $\pm$ {:.3f})'.format(mean_auc, std_auc))

#the cross shows the value corresponding to a standard threshold of 0.5
index=np.where(thresholds==0.5)[0][0]#assuming that 0.5 is present in thresholds
plt.plot(mean_fpr[index],mean_tpr[index],'r+')

plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC decision-tree model', fontsize=18)
plt.legend(loc="lower right", prop={'size': 12});
  • The red cross in this plot shows the true and false positive rates for the standard classification threshold of 0.5. Similar crosses are also shown in the plots below
In [25]:
forest = RandomForestClassifier(random_state=seed,n_estimators=76)
In [26]:
plt.figure(figsize=(10,7))
tpr_forest=[]
fpr_forest=[]
aucs_forest=[]
counter=0
for tr_fold, test_fold in benchCV.split(train_f[selected],train_f[target]):
    forest.fit(train_f[selected].iloc[tr_fold],train_f[target].iloc[tr_fold])
    probs=forest.predict_proba(train_f[selected].iloc[test_fold])[:,1]#prob of class 1
    tpr,fpr=get_roc_curve(train_f[target].iloc[test_fold].values,probs,thresholds)
    tpr_forest.append(tpr)
    fpr_forest.append(fpr)
    auc_fold=auc(fpr, tpr)
    aucs_forest.append(auc_fold)
    plt.plot(fpr,tpr,lw=1,alpha=0.3,
             label='Fold {}, AUC={:.3f}'.format(counter,auc_fold))
    #plt.plot(fpr[index],tpr[index],'r+');
    counter+=1

#chance curve
plt.plot([0, 1], [0, 1],linestyle='--',lw=1, alpha=.8,
         color='r',label='Chance')

mean_tpr_f = np.mean(tpr_forest, axis=0)
mean_fpr_f = np.mean(fpr_forest, axis=0)
mean_auc_f = np.mean(aucs_forest)
std_auc_f = np.std(aucs_forest)
plt.plot(mean_fpr_f, mean_tpr_f,lw=2, color='b',
         label='Mean ROC (AUC={:.3f} $\pm$ {:.3f})'.format(mean_auc_f, std_auc_f))

#the cross shows the value corresponding to a standard threshold of 0.5
index=np.where(thresholds ==0.5)[0][0]#assuming that 0.5 is present in thresholds
plt.plot(mean_fpr_f[index],mean_tpr_f[index],'r+')

plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC random-forest model', fontsize=18)
plt.legend(loc="lower right", prop={'size': 12});
  • Below I compare the mean ROC curves for both models:
In [27]:
#comparing the two models
plt.figure(figsize=(10,7))
#decision tree ROC
plt.plot(mean_fpr, mean_tpr,lw=2,
         label='Dtree Mean ROC (AUC={:.2f} $\pm$ {:.2f})'.format(mean_auc, std_auc))

plt.plot(mean_fpr_f, mean_tpr_f,lw=2,
         label='Rforest Mean ROC (AUC={:.2f} $\pm$ {:.2f})'.format(mean_auc_f, std_auc_f))
#chance curve
plt.plot([0, 1], [0, 1],linestyle='--',lw=1, alpha=.8,
         color='r',label='Chance')

plt.plot(mean_fpr[index],mean_tpr[index],'r+')
plt.plot(mean_fpr_f[index],mean_tpr_f[index],'g+')

plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC for the two models', fontsize=18)
plt.legend(loc="lower right", prop={'size': 12});
  • Since the crosses show the standard classification threshold of 0.5, it is clear that we could reduce this threshold in order to improve the true positive rate without significantly increasing the false positive rate.
  • To investigate this in more detail, I show a table with the mean true and false positive rates for both the decision tree and random forest models, together with the corresponding thresholds below:
In [28]:
rates=pd.concat([pd.Series(thresholds,name='Thresholds'),
             pd.Series(mean_tpr,name='tree_tpr'),
             pd.Series(mean_fpr,name='tree_fpr'),
             pd.Series(mean_tpr_f,name='forest_tpr'),
             pd.Series(mean_fpr_f,name='forest_fpr')
              ],axis=1)
rates
Out[28]:
Thresholds tree_tpr tree_fpr forest_tpr forest_fpr
0 0.000000 1.000000 1.000000 1.000000 1.000000
1 0.071429 0.746824 0.000367 0.831317 0.000839
2 0.142857 0.743900 0.000276 0.813874 0.000432
3 0.214286 0.740976 0.000266 0.808076 0.000301
4 0.285714 0.712039 0.000246 0.796431 0.000251
5 0.357143 0.697570 0.000231 0.784735 0.000196
6 0.428571 0.697570 0.000231 0.776013 0.000171
7 0.500000 0.691722 0.000201 0.764317 0.000136
8 0.571429 0.648165 0.000141 0.738153 0.000111
9 0.642857 0.645241 0.000121 0.700393 0.000085
10 0.714286 0.630672 0.000116 0.653962 0.000075
11 0.785714 0.618976 0.000111 0.604557 0.000070
12 0.857143 0.613128 0.000106 0.479835 0.000045
13 0.928571 0.560496 0.000070 0.291036 0.000025
14 1.000000 0.244051 0.000045 0.069873 0.000005
  • As can be seen in the table above, we can indeed reduce the classification threshold without increasing the false positive rate considerably. This is equivalent to increasing the number of true positives, i.e., reducing the number of false negatives.
  • However, because actual negatives vastly outnumber actual positives, even a small increase in the false positive rate translates into many additional false positives in absolute terms, which might not be desirable.
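  • As a sketch of how a lowered threshold would be applied in practice (the value 0.3 below is purely illustrative, not a recommended setting):

#sketch: classify with a custom threshold instead of the 0.5 implied by predict()
illustrative_threshold = 0.3
probs_test = forestSearch.predict_proba(test_f[selected])[:, 1]#probability of class 1
custom_preds = (probs_test >= illustrative_threshold).astype(int)
confusion_matrix(test_f[target], custom_preds)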

Conclusions

  1. The exploratory data analysis reveals several interesting trends in the data. In particular, the distributions corresponding to each class of the target variable (0 or 1) are different for a subset of features (variables V3, V4, V9 to V12, V14, V16, V17 and V18).
  2. For a few variables, outliers in the distribution seem to correlate with belonging to class 1
  3. Missing data seems to be distributed randomly and uniformly across the data, totaling around 10% for each column.
  4. Feature importances for my decision tree model, and feature selection with a random forest model correspond well with the variables highlighted in points 1 and 2
  5. A random-forest model with 76 trees improves on the decision-tree model, decreasing both the number of false positives and the number of false negatives (sensitivity: 79.7%, specificity: 99.9%)
  6. As shown in the ROC-curve analysis section, the threshold for classification can be changed to optimize the number of false positives or false negatives. Selecting this threshold would probably require external input