Regression Analysis (Part Two)

The last post explained how to prepare, cleanse, and build a multiple linear regression model that can predict the price of a diamond. This post expands on that work to explore more advanced regression techniques, namely ridge regression and lasso regression. Finally, it makes the case for perhaps reformulating the original question to give a more accurate and helpful result.

Importing relevant libraries

In [14]:
# Ignore warnings :
import warnings
warnings.filterwarnings('ignore')


# Handle table-like data and matrices :
import numpy as np
import pandas as pd
import math 

# reshape 1D array
from numpy import array
from numpy import reshape




# Modelling Algorithms :

# Classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier , GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis , QuadraticDiscriminantAnalysis

# Regression
from sklearn.linear_model import LinearRegression,Ridge,Lasso,LassoCV, RidgeCV, ElasticNet
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor




# Modelling Helpers :
from sklearn.preprocessing import Imputer , Normalizer , scale
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV , KFold , cross_val_score



#preprocessing :
from sklearn.preprocessing import MinMaxScaler , StandardScaler, Imputer, LabelEncoder



#evaluation metrics :

# Regression
from sklearn.metrics import mean_squared_log_error,mean_squared_error, r2_score,mean_absolute_error 

# Classification
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score  



# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
import missingno as msno

from yellowbrick.model_selection import LearningCurve
from yellowbrick.regressor.alphas import AlphaSelection

# Configure visualisations
%matplotlib inline
mpl.style.use( 'ggplot' )
plt.style.use('fivethirtyeight')
sns.set(context="notebook", palette="dark", style = 'whitegrid' , color_codes=True)
params = { 
    'axes.labelsize': "large",
    'xtick.labelsize': 'x-large',
    'legend.fontsize': 20,
    'figure.dpi': 150,
    'figure.figsize': [10, 7]
}
plt.rcParams.update(params)
In [3]:
# Center all plots
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""");

Load the cleansed dataset from part one

In [4]:
#load the dataset
#https://www.kaggle.com/shivam2503/diamonds/data
df = pd.read_csv("df.csv")
df.head()
Out[4]:
carat cut color clarity depth table price volume
0 0.23 5 6 2 61.5 55.0 326 38.202030
1 0.21 4 6 3 59.8 61.0 326 34.505856
2 0.23 2 6 5 56.9 65.0 327 38.076885
3 0.29 4 2 4 62.4 58.0 334 46.724580
4 0.31 2 1 2 63.3 58.0 335 51.917250

Split the data into training and test sets

In [5]:
#Split the data into training and testing using the same seed as in part one for consistency
X = df.drop(['price'], axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=66)
In [6]:
# Applying Feature Scaling ( StandardScaler )
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Ridge Regression

Ridge regression is an extension of the linear model. It is still a linear model; however, the estimated coefficients of the predictors are shrunken towards zero relative to the least squares estimates. Least squares estimation minimises the RSS (residual sum of squares), whereas the ridge regression coefficients minimise the following quantity: $$\sum_{i=1}^{n} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij} \Bigg)^2 + \alpha \sum_{j=1}^{p}\beta_j^2 = \textrm{RSS} + \alpha \sum_{j=1}^{p}\beta_j^2,$$ where α ≥ 0 is a tuning parameter that needs to be carefully chosen (James et al., 2013). It is a penalty term that puts a constraint on the coefficients: if the coefficients take large values then the optimisation function is penalised, resulting in a less complicated model. Ridge regression is particularly useful for resolving the issue of collinearity (Gruber, 2017). The reason ridge regression often performs better than traditional least squares is that as α increases, the model's flexibility decreases, leading to a decrease in variance but an increase in bias. In other words, it helps prevent overfitting.
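
To see this shrinkage effect in isolation, the minimal sketch below (assuming the scaled X_train and y_train from the cells above, and the column names from df) fits a ridge model at a few increasing values of alpha and prints the coefficients; the exact numbers will vary, but the magnitudes should shrink as the penalty grows.

# Sketch: ridge coefficients shrink towards zero as alpha increases
# (assumes X_train / y_train from the cells above and the column names from df)
feature_names = df.drop(['price'], axis=1).columns
for alpha in [0.01, 1, 100, 10000]:
    ridge_demo = Ridge(alpha=alpha, random_state=5)
    ridge_demo.fit(X_train, y_train)
    coefs = ", ".join("%s: %.1f" % (name, coef)
                      for name, coef in zip(feature_names, ridge_demo.coef_))
    print("alpha = %-8s -> %s" % (alpha, coefs))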

Learning Curve

The difficulty with the test set is that it must not be used until the final model has been decided upon. Using the test data to build a model risks data 'leakage', after which the test scores can no longer be relied upon because they no longer represent truly unseen data. Cross-validation attempts to recreate this unseen data without having to touch the test set. It is therefore important to explore whether the data set is large enough to provide a reliable validation score, or whether more data needs to be added. For this a learning curve is particularly useful.
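
As a quick illustration of what cross-validation produces on its own, here is a minimal sketch (an added example, not part of the original analysis) that scores a ridge model with 5-fold cross-validation on the training data only, keeping the test set untouched.

# Sketch: 5-fold cross-validation on the training set only
# (assumes X_train / y_train from above; the test set is never touched)
cv_model = Ridge(alpha=1.0, random_state=5)
cv_scores = cross_val_score(cv_model, X_train, y_train,
                            scoring='neg_mean_squared_error', cv=5)
print("Fold MSEs:", -cv_scores)
print("Mean CV MSE: %.2f" % -cv_scores.mean())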

In [10]:
#Learning Curve (does more data help?)
# Create a list of alphas to cross-validate against
alphas = np.logspace(-10, 1, 400)
# Instantiate the regression model and visualizer
model = RidgeCV(alphas=alphas, store_cv_values=True)
visualizer = LearningCurve(model, scoring='neg_mean_squared_error')
# Fit the data to the visualizer
visualizer.fit(X_train, y_train)  
# Finalize and render the figure
visualizer.show() 
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff92d08a1d0>

The learning curve above shows that once the number of training instances reaches a critical point (at approximately 22,000) the shaded band comes predominantly from the training score, suggesting that over-fitting is occurring.

Finding the best alpha value

In [15]:
#Using the yellowbrick to find best alpha value
# Create a list of alphas to cross-validate against
alphas = np.logspace(-10, 1, 400)
# Instantiate the linear model and visualizer
model = RidgeCV(alphas=alphas,store_cv_values=True) 
visualizer = AlphaSelection(model) 
visualizer.fit(X_train, y_train)
visualizer.show()
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff9055fff50>

The output above illustrates that the ridge regression model is not particularly effective here. The lowest MSE occurs when alpha is essentially zero; however, this means that no penalty is applied and the model reduces to ordinary least squares.

Implementing Ridge Regression

In [18]:
#Use the Ridge method
clf_rr = Ridge(alpha=1, normalize=True, random_state=5) #change the value of alpha manually
clf_rr.fit(X_train, y_train)
#accuracies = cross_val_score(estimator = clf_rr, X = X_train, y = y_train, cv = 5, verbose = 1)
y_pred = clf_rr.predict(X_train)
print('')
print('###### Ridge Regression ######')
#print('Score : %.4f' % clf_rr.score(X_train, y_train)) #print(accuracies)
mse = mean_squared_error(y_train, y_pred)
mae = mean_absolute_error(y_train, y_pred)
rmse = mean_squared_error(y_train, y_pred)**0.5 
r2 = r2_score(y_train, y_pred)

print('')
print('MSE: %0.2f ' % mse)
print('MAE: %0.2f ' % mae)
print('RMSE: %0.2f ' % rmse)
print('R2: %0.5f ' % r2)
###### Ridge Regression ######

MSE: 3885008.69 
MAE: 1323.17 
RMSE: 1971.04 
R2: 0.75477 

The output above provides the metrics for the ridge regression model. Notably, the values are considerably worse than those of the multiple linear regression model in part one. To improve upon these results and fully utilise ridge regression, a grid search mechanism can be used. The GridSearchCV function in the Python library scikit-learn implements a fit-and-score mechanism whereby each candidate value is tried in turn and scored (Pedregosa et al., 2011). The best-scoring value can then be used in the final model going forward. It is important to note that this method can very quickly become computationally expensive, so it is still worth combining the visualisation methods with the grid search to find the optimal hyper-parameters.

In [19]:
 #Use the GridSearchCV method
# prepare a range of alpha values to test
alphas = np.logspace(-10, 1, 400)
# create and fit a Ridge regression model, testing each alpha
model = Ridge(random_state=5)
grid = GridSearchCV(estimator=model,scoring='neg_mean_squared_error',param_grid=dict(alpha=alphas),cv=5)
grid.fit(X_train, y_train)
print(grid)
# summarize the results of the grid search 
print(grid.best_score_) 
print(grid.best_estimator_.alpha)
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                             max_iter=None, normalize=False, random_state=5,
                             solver='auto', tol=0.001),
             iid='warn', n_jobs=None,
             param_grid={'alpha': array([1.00000000e-10, 1.06553795e-10, 1.13537112e-10, 1.20978102e-10,
       1.28906759e-10, 1.37355044e-10, 1.46357012e-10, 1.55948950e-10,
       1.6...
       2.99357729e+00, 3.18977022e+00, 3.39882122e+00, 3.62157300e+00,
       3.85892347e+00, 4.11182940e+00, 4.38131027e+00, 4.66845237e+00,
       4.97441317e+00, 5.30042602e+00, 5.64780507e+00, 6.01795064e+00,
       6.41235480e+00, 6.83260739e+00, 7.28040247e+00, 7.75754513e+00,
       8.26595874e+00, 8.80769273e+00, 9.38493086e+00, 1.00000000e+01])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='neg_mean_squared_error', verbose=0)
-1539187.200939145
1.0655379505623064e-10
In [20]:
#Output the values from the GridSearchCV method
y_pred = grid.predict(X_train)
mse = mean_squared_error(y_train, y_pred)
mae = mean_absolute_error(y_train, y_pred)
rmse = mean_squared_error(y_train, y_pred)**0.5 
r2 = r2_score(y_train, y_pred)
print('')
print('MSE: %0.2f ' % mse)
print('MAE: %0.2f ' % mae)
print('RMSE: %0.2f ' % rmse)
print('R2: %0.5f ' % r2)
MSE: 1518664.84 
MAE: 852.05 
RMSE: 1232.34 
R2: 0.90414 

Residual Analysis

A key assumption of the linear regression model is that the residuals (error terms) are evenly scattered above and below the regression line. A residual plot should show no pattern and no heteroscedasticity, with points evenly scattered above and below zero. A non-constant variance undermines a key assumption of the statistical approach underlying the regression model.
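
For comparison with the yellowbrick visualiser used in the next cell, a residual plot can also be drawn directly with matplotlib. The minimal sketch below assumes the ridge model clf_rr fitted earlier and simply scatters the residuals against the predicted values around a horizontal zero line.

# Sketch: a basic residual plot drawn directly with matplotlib
# (assumes clf_rr has already been fitted on X_train / y_train)
preds = clf_rr.predict(X_train)
residuals = y_train - preds
plt.figure(figsize=(8, 5))
plt.scatter(preds, residuals, s=5, alpha=0.3)
plt.axhline(0, color='black', linewidth=1)
plt.xlabel('Predicted price')
plt.ylabel('Residual (actual - predicted)')
plt.title('Residuals vs predicted values')
plt.show()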

In [21]:
#Creating a residual plot of the residual error against the predicted value
from yellowbrick.regressor import ResidualsPlot 
visualizer = ResidualsPlot(clf_rr)
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer 
#visualizer.score(X_test, y_test) # Evaluate the model on the test data 
visualizer.show()
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff9074e2a50>

The output above is a residual plot for the ridge regression model with α = 1; the outliers in this plot cause the overall scale to shrink. There is also an element of heteroscedasticity, further indicating that this is not a good fit. While the residuals are still not evenly distributed above and below zero (there appears to be a greater number below zero), the funnel shape has improved over the multiple linear regression model, suggesting that the variance is closer to constant.

Lasso Regression

In ridge regression the penalty $$\alpha\sum_{j=1}^{p}\beta_j^2$$ shrinks all of the coefficients towards zero but does not set any of them precisely to zero. In cases where the number of variables is particularly large this can pose a problem and make interpreting the model difficult. The lasso method preserves only the most important variables while the rest are shrunk to exactly zero. The lasso model minimises the quantity: $$\sum_{i=1}^{n} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij} \Bigg)^2 + \alpha \sum_{j=1}^{p}|\beta_j| = \textrm{RSS} + \alpha \sum_{j=1}^{p}|\beta_j|,$$ where $\alpha$ is the tuning parameter. The equation for ridge regression earlier and the equation for lasso regression above are very similar, differing only in the penalty term: in ridge regression it is $$\beta_j^2$$ whilst in lasso regression it is $$|\beta_j|$$. Consequently, penalising the magnitude rather than the square of the coefficients allows some coefficients to be driven exactly to zero.

The advantage of the lasso method is that it can act as a feature selection process and assist in choosing which predictors should be included and which can essentially be ignored. This has clear computational advantages and allows for a simpler model; however, it runs the risk of oversimplifying and losing key information that can be gleaned from some variables, especially those that work in tandem with each other (Tibshirani, 2011).
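
To make the feature-selection behaviour concrete, the minimal sketch below (assuming X_train, y_train and the column names from df; the alpha of 500 is purely illustrative and not a value used in the original analysis) fits a lasso with a deliberately large penalty and flags the coefficients that have been driven exactly to zero.

# Sketch: a large enough lasso penalty drives some coefficients exactly to zero
# (assumes X_train / y_train from above; alpha=500 is purely illustrative)
feature_names = df.drop(['price'], axis=1).columns
lasso_demo = Lasso(alpha=500, random_state=5)
lasso_demo.fit(X_train, y_train)
for name, coef in zip(feature_names, lasso_demo.coef_):
    flag = ' (dropped)' if coef == 0 else ''
    print('%s: %.1f%s' % (name, coef, flag))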

Compute the learning curve using cross validation

In [22]:
#Learning Curve (does more values help)
# Instantiate the regression model and visualizer
model = LassoCV(random_state=5)
visualizer = LearningCurve(model, scoring='neg_mean_squared_error')
visualizer.fit(X_train, y_train) # Fit the data to the visualizer 
visualizer.show() # Finalize and render the figure
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff90648e910>

The output above shows that for the lasso method the error margins for both the training score and the cross-validation score gradually decrease as the number of training instances increases. Furthermore, since the training and validation scores converge at close to 10,000 instances, the training set for the full diamond data is large enough for the cross-validation results to be relied upon.

Finding the best penalty (alpha) value for the lasso method

In [23]:
#Using the yellowbrick to find best alpha value
# Create a list of alphas to cross-validate against
alphas = np.logspace(-10, 1, 400)
# Instantiate the linear model and visualizer
model = LassoCV(alphas=alphas, random_state=5, cv=5)
visualizer = AlphaSelection(model) 
visualizer.fit(X_train, y_train)
visualizer.show()
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff905c40290>

The output above shows that initially, as alpha increases, there is a sharp rise in the MSE and the error in the model increases. As alpha increases further the error gradually decreases, until at some point (around α = 7) the penalty becomes too large and the error starts to rise again. Overall the output illustrates that an optimal value for alpha lies between 3 and 7.
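
Given that the curve points to an optimum somewhere between 3 and 7, one option (a sketch based on reading the plot, not a step from the original analysis) is to narrow the LassoCV search to that range and let cross-validation pick the value.

# Sketch: refine the alpha search to the 3-7 range suggested by the plot above
narrow_alphas = np.linspace(3, 7, 50)
lasso_narrow = LassoCV(alphas=narrow_alphas, random_state=5, cv=5)
lasso_narrow.fit(X_train, y_train)
print('Best alpha in the narrowed range: %.3f' % lasso_narrow.alpha_)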

Implementing Lasso Regression

In [27]:
#Use the Lasso method
clf_la = Lasso(alpha=10,random_state=5) #change the value of alpha manually 
clf_la.fit(X_train , y_train)
accuracies = cross_val_score(estimator = clf_la, X = X_train, y = y_train, cv =5,verbose = 1)
y_pred = clf_la.predict(X_train)
print('')
print('###### Lasso Regression ######')
#print('Score : %.4f' % clf_la.score(X_train, y_train)) 
#print(accuracies)

mse = mean_squared_error(y_train, y_pred)
mae = mean_absolute_error(y_train, y_pred)
rmse = mean_squared_error(y_train, y_pred)**0.5 
r2 = r2_score(y_train, y_pred)
print('')
print('MSE : %0.2f ' % mse)
print('MAE: %0.2f ' % mae)
print('RMSE: %0.2f ' % rmse)
print('R2: %0.7f ' % r2)
###### Lasso Regression ######

MSE : 1519509.25 
MAE: 848.20 
RMSE: 1232.68 
R2: 0.9040845 
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.2s finished

Apply Grid Search to Optimise the results

In [24]:
#Use the GridSearchCV method
# prepare a range of alpha values to test
alphas = np.logspace(-10, 1, 400)
# create and fit a Lasso regression model, testing each alpha
model = Lasso(random_state=5)
grid = GridSearchCV(estimator=model,scoring='neg_mean_squared_error',param_grid=dict(alpha=alphas),cv=5) 
grid.fit(X_train, y_train)
print(grid)
# summarize the results of the grid search 
print(grid.best_score_)
print(grid.best_estimator_.alpha)
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True,
                             max_iter=1000, normalize=False, positive=False,
                             precompute=False, random_state=5,
                             selection='cyclic', tol=0.0001, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'alpha': array([1.00000000e-10, 1.06553795e-10, 1.13537112e-10, 1.20978102e-10,
       1.289067...
       2.99357729e+00, 3.18977022e+00, 3.39882122e+00, 3.62157300e+00,
       3.85892347e+00, 4.11182940e+00, 4.38131027e+00, 4.66845237e+00,
       4.97441317e+00, 5.30042602e+00, 5.64780507e+00, 6.01795064e+00,
       6.41235480e+00, 6.83260739e+00, 7.28040247e+00, 7.75754513e+00,
       8.26595874e+00, 8.80769273e+00, 9.38493086e+00, 1.00000000e+01])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='neg_mean_squared_error', verbose=0)
-1537748.75410423
3.858923467029899
In [25]:
#Obtain the scores from the GridSearchCV method
y_pred = grid.predict(X_train)
mse = mean_squared_error(y_train, y_pred)
mae = mean_absolute_error(y_train, y_pred)
rmse = mean_squared_error(y_train, y_pred)**0.5 
r2 = r2_score(y_train, y_pred)
print('')
print('MSE: %0.2f ' % mse)
print('MAE: %0.2f ' % mae)
print('RMSE: %0.2f ' % rmse)
print('R2: %0.5f ' % r2)
MSE: 1518791.01 
MAE: 850.51 
RMSE: 1232.39 
R2: 0.90413 

Conducting a grid search over 400 values between $10^{-10}$ and 10 and performing a fit-and-score process yields an MSE of 1518791.01 with α ≈ 3.86. The scoring here was the mean squared error; if R2 alone were used it would be difficult to detect over-fitting.

Residual Analysis

In [28]:
#Creating a residual plot of the residual error against the predicted value
#Using a plain Lasso model here rather than the LassoCV version
from yellowbrick.regressor import ResidualsPlot
Final_model = Lasso(alpha=6, random_state=5)
visualizer = ResidualsPlot(Final_model,train_alpha=0.2, test_alpha=1)
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer 
#visualizer.score(X_test, y_test) # Evaluate the model on the test data 
visualizer.show()
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff905aa9b50>

A point to note with regard to the lasso model on the diamond data set is that its residual plot above is similar in shape, scale and spread to the residual plot for the least squares model in the multiple linear regression from part one: it shows that the residuals are again only evenly scattered above and below the zero axis for lower-priced diamonds. This means that the accuracy of predictions for higher-priced diamonds can be called into question. There is some slight reduction in the heteroscedasticity in the figure above, illustrating a small improvement with the lasso model. This shift is reflected in the corresponding improvements in the metrics for the lasso model.

Applying the test data

In [29]:
#Use the test data
y_pred = clf_la.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred)**0.5 
r2 = r2_score(y_test, y_pred)
print('')
print('MSE: %0.2f ' % mse)
print('MAE: %0.2f ' % mae)
print('RMSE: %0.2f ' % rmse)
print('R2: %0.5f ' % r2)
MSE: 1526554.04 
MAE: 852.46 
RMSE: 1235.54 
R2: 0.90530 

Notably, this model performs considerably better than the ridge regression model fitted earlier with α = 1. The difference between the test and training scores is also relatively small, indicating that the model is a good fit. An α value of zero would mean that there is no penalty, which is the same as the traditional least squares linear regression model. Here, unlike with the ridge regression, the grid search yields a non-zero penalty term, which allows the model to improve upon traditional linear regression.
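
As a quick sanity check of that last point, the minimal sketch below (an added comparison, not part of the original notebook) fits an unpenalised LinearRegression on the same split and scores it on the same test set, so the lasso figures above can be compared against a least squares baseline.

# Sketch: an unpenalised least-squares baseline on the same test split
baseline = LinearRegression()
baseline.fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)
print('Baseline MSE: %.2f' % mean_squared_error(y_test, baseline_pred))
print('Baseline R2:  %.5f' % r2_score(y_test, baseline_pred))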

Conclusion

This project aimed to explore prediction; the goal was to establish a mechanism by which a learning algorithm would succeed in achieving a prediction. The task can essentially be considered a predictive modelling task, and any predictive capability greater than mere 'chance' would be enough to satisfy its most basic requirements. Because prediction is a difficult and different process for each new problem, it is also difficult to evaluate truly effectively how thoroughly the goal has been achieved.

The task set initially was a regression one: the project aims to provide a prediction where the response variable is a numerical value. While even within regression there are many different types of problem, this project attempted to provide a general approach for tackling the most common regression tasks. It was for this reason that the example of predicting the price of a diamond was not introduced until after the background research and discussion of the different methods had been explained; this prevents the discussion from becoming too specific to the subtle requirements of predicting the price of a diamond.

The example was to predict the price of a diamond, with a focus on the prediction of the price rather than on understanding the factors that contribute towards it. Now that the model has been built and its best parameters decided upon, it does allow for some retrospective understanding of the main contributing features. The largest coefficient by far was carat (weight), with a coefficient of 776. The volume of a diamond had a coefficient of 660; since these two factors can be expected to exhibit some collinearity, it makes sense for them to have similar coefficients. Clarity and colour had coefficients of 275 and 189 respectively, illustrating that they have less influence on the overall price.
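
The coefficient values quoted above can be read straight off a fitted lasso model. The minimal sketch below (assuming the fitted clf_la and the column names from df; the exact figures depend on which fitted model is inspected) pairs each coefficient with its feature name and sorts by absolute size.

# Sketch: pair each lasso coefficient with its feature name, largest first
# (assumes a fitted lasso model such as clf_la from the cells above)
feature_names = df.drop(['price'], axis=1).columns
coef_table = pd.Series(clf_la.coef_, index=feature_names)
print(coef_table.reindex(coef_table.abs().sort_values(ascending=False).index))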

Since the lasso model was shown to produce better predictions than traditional multiple linear regression, the coefficients of the features are no longer exactly the amount by which a one-unit change increases the price: the penalty α reduces the impact of the larger coefficients such as carat and volume. Consequently, the lasso model proved to provide a more balanced and nuanced prediction. Overall, an R2 value of around 0.9 means that approximately 90% of the variance in price is explained by the model, leaving roughly 10% unexplained. It is important to note that no model would ever be able to provide 100% accuracy due to natural variations and fluctuations in price.