Concrete slump measurements are used to determine concrete properties before the mixture sets. In this analysis, I use measurements of the properties of a concrete mixture, such as its water and cement content, to predict the concrete's compressive strength. I test two types of models for making predictions: a decision-tree regression model and a random-forest model.
There are 7 input variables and 3 output variables in the data set:
Input variables (kg per m^3) | Output variables |
---|---|
Cement | SLUMP (cm) |
Slag | FLOW (cm) |
Fly ash | 28-day Compressive Strength (MPa) |
Water | |
SP | |
Coarse Aggr. | |
Fine Aggr. | |
Author: Leonardo Espin
Date: July 3, 2019
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_columns', 20)
#the lines below get rid of the plotting library warnings
import warnings
warnings.filterwarnings("ignore")
#get data from online repository
url ="https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/slump/slump_test.data"
raw = pd.read_csv(url)
print(raw.shape)
raw.head(10)
#checking the data types of the variables
raw.dtypes
#checking missing data for each column
raw.isnull().sum()
From the data description, the first 7 variables are inputs and the last 3 are outputs, which means the variable 'No' is just a row index, so I will drop it. Afterwards I print the first rows of the data to confirm the deletion.
#checking that this is indeed the case before dropping
print('if this value: {} equals 103, the variable \'No\' will be dropped'
      .format(sum(raw.No.apply(lambda x: x - 1) == raw.No.index)))
raw.drop(columns='No',inplace=True)
raw.head()
#descriptive statistics for the data
raw.describe()
From the basic statistics computed above, it does not seem that there are salient problems with the data. Importantly, all the measured input and output quantities seem reasonable:
plt.figure(figsize=(16,20))
nbins=10 #play with this to improve visualizations
for i in range(len(raw.columns)):
    plt.subplot(4, 3, i+1)
    sns.distplot(raw[raw.columns[i]],bins=nbins,kde=True,rug=True)
    #since the data set is small, I plot a density estimate and data ticks as well
A few observations: the histograms show that the 'fly ash' and 'slag' variables have a group of measurements at exactly zero.
Because of this, below I inspect more closely the zero values of the 'fly ash' and 'slag' variables.
#check how many 'fly ash' and 'slag' values are zero:
print('Fly ash = 0: {}'.format(sum(raw['Fly ash']==0)))
print('Slag = 0: {}'.format(sum(raw['Slag']==0)))
It almost seems that these zero 'fly ash' and 'slag' measurements are missing values that were filled with zeroes. It could also be that these measurements were below the sensitivity threshold of the measurement method used. Either way, this is something to keep in mind.
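If one wanted to treat those zeroes as genuinely missing values, a minimal optional sketch is to replace them with NaN in a copy of the data so that pandas reports them as missing; I do not apply this in the rest of the analysis:
#optional: treat the zero 'Slag' and 'Fly ash' entries as missing values in a copy
flagged = raw.copy()
flagged[['Slag', 'Fly ash']] = flagged[['Slag', 'Fly ash']].replace(0, np.nan)
print(flagged.isnull().sum())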
Next I plot each input variable against the 28-day compressive strength (CS): scatter plots with a linear fit on the left and, on the right, the same data with the axes inverted and stratified into bins using the x_bins option of seaborn's regplot.
#creating lists with input and output variables
variables=list(raw.columns)
inputV=variables[0:7]
outputV=variables[7:]
plt.figure(figsize=(16,30))
nbins=5 #play with this to improve visualizations
total=len(inputV)
j=0
for i in range(total):
    tmp_df=raw[[inputV[i],outputV[-1]]]
    plt.subplot(total, 2, (2*i)+1)
    sns.regplot(x=tmp_df[inputV[i]],y=tmp_df[outputV[-1]],data=tmp_df)
    if j==0:
        plt.title('Scatter plots', fontsize=18);
    plt.subplot(total, 2, 2*(i+1))
    sns.regplot(x=tmp_df[outputV[-1]],y=tmp_df[inputV[i]],data=tmp_df, x_bins=nbins)
    if j==0:
        plt.title('Inverted scatter plots, with bin stratification', fontsize=18);
    j+=1
This set of plots is very interesting because it lets us see how each input variable affects the measured compressive strength.
Take for instance the water content (4th input variable): there is an inverse relationship between the water content of the mixture and the compressive strength of the concrete, i.e. the more water, the lower the compressive strength, which seems reasonable (note that the cement content has the opposite effect: the more cement, the higher the compressive strength).
Although none of the input variables shows a large correlation with CS, the scatter plots suggest that the most predictive variables for CS are the cement and water content (1st and 4th input variables), since they show more or less clear trends.
To find the input variables that correlate most strongly with CS, I calculate the $R^2$ coefficient for the regressions in all the scatter plots above:
from sklearn.linear_model import LinearRegression
linear = LinearRegression()
for i in range(total):
    x=raw[inputV[i]].values.reshape(len(raw),1) #the linear regression model expects a 2D matrix
    y=raw[outputV[-1]].values.reshape(len(raw),1)
    linear.fit(x,y)
    print('Variable \'' +inputV[i]+ '\' R^2={:.2f}'.format(linear.score(x, y)))
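As a quick optional cross-check (for a single-predictor linear regression, $R^2$ is just the square of the Pearson correlation coefficient), the same numbers can be read off the correlation matrix:
#optional cross-check: for one predictor, R^2 equals the squared Pearson correlation
corr_with_CS = raw.corr()[outputV[-1]].loc[inputV]
print((corr_with_CS**2).round(2))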
As I noted above when I plotted the histograms, the values of zero for 'fly ash' and 'slag' seem suspicious. The scatter plots for those two variables show that, although there is a group of zero measurements for 'slag', there is no jump or gap between the zeroes and the larger values. This is not the case for 'fly ash', where there is a clear gap, with no measurements between 0 and 100 kg, which is abnormal.
Below I repeat the calculation of the $R^2$ coefficient for 'fly ash' and 'slag', this time ignoring the zero values.
noZeroes=raw.loc[raw[inputV[2]]>0,[inputV[2],outputV[-1]]]
x=noZeroes[inputV[2]].values.reshape(len(noZeroes),1)
y=noZeroes[outputV[-1]].values.reshape(len(noZeroes),1)
linear.fit(x,y)
print('Variable \'' +inputV[2]+ '\' R^2={:.2f}'.format(linear.score(x, y)))
noZeroes=raw.loc[raw[inputV[1]]>0,[inputV[1],outputV[-1]]]
x=noZeroes[inputV[1]].values.reshape(len(noZeroes),1)
y=noZeroes[outputV[-1]].values.reshape(len(noZeroes),1)
linear.fit(x,y)
print('Variable \'' +inputV[1]+ '\' R^2={:.2f}'.format(linear.score(x, y)))
Therefore there is no observable trend between 'fly ash' and CS when the zero values are ignored.
We conclude that the most predictive variables for CS are the cement, water, and slag content (the last one because of the uncertainty with respect to the zero measurements).
Next I construct a decision tree model for the variable CS, using cement and water content as predictors:
from sklearn.model_selection import train_test_split
seed=491
X=raw[[inputV[0],inputV[3]]]
Y=raw[outputV[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=seed)
#checking the distribution of the CS variable in train and test sets:
plt.figure(figsize=(14,5))
sns.distplot(y_train,bins=nbins,kde=True,rug=True,label='Train')
sns.distplot(y_test,bins=nbins,kde=True,rug=True,label='Test')
plt.title('Distribution of target variable in test and train datasets', fontsize=18);
plt.legend();
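Beyond the visual check, a two-sample Kolmogorov-Smirnov test gives a rough quantitative comparison of the two distributions (an optional sketch using scipy, which scikit-learn already depends on):
from scipy import stats
#two-sample KS test: a large p-value means there is no evidence that the
#train and test CS values come from different distributions
ks_stat, p_value = stats.ks_2samp(y_train, y_test)
print('KS statistic = {:.3f}, p-value = {:.3f}'.format(ks_stat, p_value))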
from sklearn import tree
CStree = tree.DecisionTreeRegressor(random_state=seed)
CStree.fit(X_train,y_train)
print('Maximum depth = {}'.format(CStree.tree_.max_depth))
print('Default R^2 score = {:.3f}'.format(CStree.score(X_train,y_train)))
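An unrestricted regression tree can grow until it fits the training data almost perfectly, so the training score above is not very informative. A small sketch comparing the train and test $R^2$ scores makes the overfitting explicit, which is what motivates the depth search below:
#compare train and test R^2 of the unpruned tree: a large gap indicates overfitting
print('Train R^2 = {:.3f}'.format(CStree.score(X_train, y_train)))
print('Test R^2 = {:.3f}'.format(CStree.score(X_test, y_test)))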
from sklearn import model_selection #for KFold, GridSearchCV
myCV = model_selection.KFold(n_splits=3, shuffle=True, random_state=seed) #shuffle so that random_state has an effect
param_test = {
    'max_depth':[_ for _ in range(2,13)]}
dptSearch = model_selection.GridSearchCV(
    estimator=tree.DecisionTreeRegressor(random_state=seed),
    param_grid=param_test, scoring='neg_mean_squared_error', n_jobs=4,
    cv=myCV, return_train_score=True)
dptSearch.fit(X_train,y_train)
dptSearch.best_params_, dptSearch.best_score_
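Since the scoring used above is negative mean squared error, the grid-search results are easier to interpret after converting them to RMSE (a small sketch using the cv_results_ attribute):
#convert the (negative) mean CV scores into RMSE, one value per candidate depth
cv_rmse = np.sqrt(-dptSearch.cv_results_['mean_test_score'])
for depth, rmse in zip(param_test['max_depth'], cv_rmse):
    print('max_depth = {:2d}: CV RMSE = {:.2f}'.format(depth, rmse))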
from sklearn.metrics import mean_squared_error
#the decision tree with optimal depth:
optTree = tree.DecisionTreeRegressor(max_depth=2,random_state=seed)
optTree.fit(X_train,y_train)
y_pred = optTree.predict(X_test)
print('RMSE for the test dataset is {:.2f}'.format(np.sqrt(mean_squared_error(y_test, y_pred))))
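To put this error in context, it helps to compare it with a trivial baseline that always predicts the mean CS of the training set (an optional sketch using sklearn's DummyRegressor):
from sklearn.dummy import DummyRegressor
#baseline model: always predict the mean CS of the training set
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train, y_train)
print('Baseline RMSE is {:.2f}'.format(
    np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))))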
import graphviz
dot_data =tree.export_graphviz(optTree, out_file=None)
graph = graphviz.Source(dot_data)
graph.format='png'
graph.render('output');
The tree visualization above shows the decisions that the model makes according to the values of the input variables. For example, suppose we have measured the two variables that are the inputs of this model, x[0] and x[1] (the cement and water content), and we pass these values to the model. At the top of the tree, if x[0] is less than or equal to 165.65 kg, the tree takes one path; if it is larger than that value, it takes the other path, and so on, according to the criteria shown in the boxes of the visualization.
When the model reaches the bottom of the tree through this decision process, we say that it has reached a 'leaf', and the model returns a value based on the training samples that were classified into that leaf (this could be their average, as in scikit-learn, or the result of fitting a local regression, depending on the software implementation). This is how the model predicts an output value for CS.
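The same decision rules can also be printed in plain text, which is sometimes easier to read than the graph (a small sketch using sklearn's export_text; the feature names are passed so the rules show 'Cement' and 'Water' instead of x[0] and x[1]):
from sklearn.tree import export_text
#print the decision rules of the pruned tree with readable feature names
print(export_text(optTree, feature_names=[inputV[0], inputV[3]]))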
To choose the depth of this tree, I did a cross-validated grid search over the max_depth parameter and kept the value that gave the best score.
Next I construct a random-forest model. This time I start from all seven input variables and use a feature-selection step to decide which ones to keep.
#repeat the train/test split while using all seven variables this time
X=raw[inputV]
Y=raw[outputV[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=seed)
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel #for selecting features for the model
sel = SelectFromModel(RandomForestRegressor(n_estimators = 100))
sel.fit(X_train, y_train)
selected=[inputV[i] for i in range(7) if sel.get_support()[i]]
print('The selected variables for my random forest model are: ')
for _ in selected:
    print(_)
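To see why these variables were kept, it helps to look at the feature importances of the forest that SelectFromModel fitted internally; by default it keeps the features whose importance is above the mean importance (a short sketch):
#feature importances of the forest used for selection; the default threshold
#of SelectFromModel is the mean importance
importances = sel.estimator_.feature_importances_
for name, imp in sorted(zip(inputV, importances), key=lambda t: -t[1]):
    print('{:<15s} {:.3f}'.format(name, imp))
print('Selection threshold (mean importance): {:.3f}'.format(importances.mean()))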
#repeat the train/test split while using the selected variables
X=raw[[inputV[0],inputV[2],inputV[3]]] #Cement, Fly ash, Water: the variables selected above
Y=raw[outputV[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=seed)
param_test2 = {
    'n_estimators':[_ for _ in range(2,30)]}
forestSearch = model_selection.GridSearchCV(
    estimator=RandomForestRegressor(random_state=seed),
    param_grid=param_test2, scoring='neg_mean_squared_error', n_jobs=4,
    cv=myCV, return_train_score=True)
forestSearch.fit(X_train,y_train)
forestSearch.best_params_, forestSearch.best_score_
#forest with the optimal number of trees
forest_model=RandomForestRegressor(n_estimators=23,random_state=seed)
forest_model.fit(X_train,y_train)
y_pred = forest_model.predict(X_test)
print('RMSE for the test dataset is {:.2f}'.format(np.sqrt(mean_squared_error(y_test, y_pred))))
Note that the variables chosen with the feature_selection module were the same ones I highlighted during the EDA portion of this document.
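As a final optional check on that statement, the relative importances of the three selected variables in the fitted forest can be printed (a small sketch):
#relative importance of the three selected variables in the final forest model
for name, imp in zip(X_train.columns, forest_model.feature_importances_):
    print('{:<10s} {:.3f}'.format(name, imp))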