Regression Analysis (Part One)

Predicting the price of a diamond

This project deals with solving a quantitative problem: it illustrates how regression can be utilised to predict a numerical output, in this case the price of a diamond. We begin by importing the relevant libraries that will be used in this project (note: many of them will only be needed for Part Two, advanced regression).

Importing relevant libraries

In [11]:
# Ignore warnings :
import warnings
warnings.filterwarnings('ignore')


# Handle table-like data and matrices :
import numpy as np
import pandas as pd
import math 

# reshape 1D array
from numpy import array
from numpy import reshape




# Modelling Algorithms :

# Classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier , GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis , QuadraticDiscriminantAnalysis

# Regression
from sklearn.linear_model import LinearRegression,Ridge,Lasso,LassoCV, RidgeCV, ElasticNet
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor




# Modelling Helpers :
from sklearn.preprocessing import Imputer , Normalizer , scale
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV , KFold , cross_val_score



#preprocessing :
from sklearn.preprocessing import MinMaxScaler , StandardScaler, Imputer, LabelEncoder



#evaluation metrics :

# Regression
from sklearn.metrics import mean_squared_log_error,mean_squared_error, r2_score,mean_absolute_error 

# Classification
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score  



# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
import missingno as msno



# Configure visualisations
%matplotlib inline
mpl.style.use( 'ggplot' )
plt.style.use('fivethirtyeight')
sns.set(context="notebook", palette="dark", style = 'whitegrid' , color_codes=True)
params = { 
    'axes.labelsize': "large",
    'xtick.labelsize': 'x-large',
    'legend.fontsize': 20,
    'figure.dpi': 150,
    'figure.figsize': [10, 7]
}
plt.rcParams.update(params)
In [12]:
# Center all plots
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""");

Load the Dataset

In [17]:
#load the dataset
#https://www.kaggle.com/shivam2503/diamonds/data
df = pd.read_csv("diamonds.csv")
df.head()
Out[17]:
Unnamed: 0 carat cut color clarity depth table price x y z
0 1 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 2 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 3 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 4 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 5 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75

Data Cleansing

The first thing that must be done is to cleanse and prepare the data to ensure that it is suitable for the regression analysis that we wish to perform.

In [18]:
#Drop the unnamed index column
df.drop(['Unnamed: 0'] , axis=1 , inplace=True)
df.head()
Out[18]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
In [23]:
#What is contained within the dataset
print(df.keys())
Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
       'z'],
      dtype='object')

The different features in the dataset are as follows.

price: price in US dollars ($326--$18,823). This is the output variable we will be predicting.

carat: weight of the diamond (0.2--5.01)

cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)

color: diamond colour, from J (worst) to D (best)

clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

x: length in mm (0--10.74)

y: width in mm (0--58.9)

z: depth in mm (0--31.8)

depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79); a quick check of this relationship is sketched after this list

table: width of the top of the diamond relative to its widest point (43--95)
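As a quick sanity check, the tabulated depth column can be compared against the formula above. A minimal sketch, assuming depth is stored as a percentage (the ratio scaled by 100):

# Sketch: check that 'depth' roughly equals 100 * 2*z / (x + y)
# (rows with zero dimensions, removed later in the cleansing step, are excluded here;
# most rows agree to within rounding, large differences would point to data-entry errors)
valid = (df['x'] + df['y']) > 0
computed_depth = 100 * 2 * df.loc[valid, 'z'] / (df.loc[valid, 'x'] + df.loc[valid, 'y'])
print((df.loc[valid, 'depth'] - computed_depth).abs().describe())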

In [19]:
#look for null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
carat      53940 non-null float64
cut        53940 non-null object
color      53940 non-null object
clarity    53940 non-null object
depth      53940 non-null float64
table      53940 non-null float64
price      53940 non-null int64
x          53940 non-null float64
y          53940 non-null float64
z          53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB
In [20]:
#Confirm no null values
df.isnull().sum()
Out[20]:
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64

Although there are no null values, we must still be wary of zeros in certain columns, for example the dimension columns (x, y and z), since a diamond cannot have a zero dimension.

In [21]:
#Count the number of rows with zero in x,y or z
len(df[(df['x']==0) | (df['y']==0) | (df['z']==0)])
Out[21]:
20
In [25]:
#Since there are only 20 such rows (out of ~54,000) we can drop them without any significant impact on the dataset
df = df[(df[['x','y','z']] != 0).all(axis=1)]
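A quick check (sketch) that only the expected rows were removed:

# Sketch: confirm the row count dropped from 53,940 to 53,920 (20 rows removed)
print(len(df))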

Feature Encoding

Feature encoding is where we turn the categorical variables into numerical values so that regression analysis can be carried out. There are a number of different approaches to feature encoding, but since all the categorical features in this dataset are ordered from worst to best, we can replace each category with a numerical rank (from worst to best).

In [26]:
#Turn the Cut column from a word ranking to a number ranking 
df['cut'] = df['cut'].replace('Fair',1)
df['cut'] = df['cut'].replace('Good',2)
df['cut'] = df['cut'].replace('Very Good',3)
df['cut'] = df['cut'].replace('Premium',4)
df['cut'] = df['cut'].replace('Ideal',5)
In [27]:
#Convert color column into number ranking 
df['color'] = df['color'].replace('J',1)
df['color'] = df['color'].replace('I',2)
df['color'] = df['color'].replace('H',3)
df['color'] = df['color'].replace('G',4)
df['color'] = df['color'].replace('F',5)
df['color'] = df['color'].replace('E',6)
df['color'] = df['color'].replace('D',7)
In [28]:
#Convert the clarity into a number ranking
df['clarity'] = df['clarity'].replace('I1',1)
df['clarity'] = df['clarity'].replace('SI2',2)
df['clarity'] = df['clarity'].replace('SI1',3)
df['clarity'] = df['clarity'].replace('VS2',4)
df['clarity'] = df['clarity'].replace('VS1',5)
df['clarity'] = df['clarity'].replace('VVS2',6)
df['clarity'] = df['clarity'].replace('VVS1',7)
df['clarity'] = df['clarity'].replace('IF',8)
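The chained replace calls above work, but the same ordinal encoding can be written more compactly with mapping dictionaries. A minimal equivalent sketch (an alternative to the three cells above, not an additional step):

# Sketch: the same ordinal encoding expressed with mapping dictionaries.
# Running these maps *after* the replace cells above would turn the already-numeric
# columns into NaN, so the application lines are left commented out.
cut_rank = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
color_rank = {'J': 1, 'I': 2, 'H': 3, 'G': 4, 'F': 5, 'E': 6, 'D': 7}
clarity_rank = {'I1': 1, 'SI2': 2, 'SI1': 3, 'VS2': 4, 'VS1': 5,
                'VVS2': 6, 'VVS1': 7, 'IF': 8}
# df['cut'] = df['cut'].map(cut_rank)
# df['color'] = df['color'].map(color_rank)
# df['clarity'] = df['clarity'].map(clarity_rank)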

Data Scaling

Many datasets contain features that vary widely in magnitude, units and range. The box plot below visualises these differences across the features; to bring them onto a comparable scale, feature scaling (StandardScaler) is applied later, just before the model is fitted.

In [29]:
sns.factorplot(data=df , kind='box' , size=7, aspect=2.5)
Out[29]:
<seaborn.axisgrid.FacetGrid at 0x7faf3057a350>

Exploratory Data Analysis

In any project it is helpful to visualise the dataset and its key features before carrying out the full analysis.

In [30]:
#Plot the distribution of the output variable price
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.distplot(df['price'], bins=80)
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faf31b6ad50>

The distribution above shows that most diamonds are priced at the lower end of the range (below $5,000). The plot has a long tail, so it may be useful to limit the analysis to a certain price range to obtain better results.
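As a hypothetical follow-up (not carried out in this notebook), the analysis could be restricted to the bulk of the market; the $5,000 threshold below is purely illustrative, taken from the plot above:

# Sketch: hypothetical filter restricting the analysis to lower-priced diamonds
df_low = df[df['price'] < 5000]
print(len(df_low), 'of', len(df), 'diamonds are priced below $5,000')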

In [31]:
#correlation matrix
correlation_matrix = df.corr().round(2)
#annot = True to print the values inside the square
sns.heatmap(data=correlation_matrix, annot=True)
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faf309185d0>

The correlation matrix lets us see which features are highly correlated with one another. The dimensions of the diamond (x, y and z) are all highly correlated with the price. Conversely, colour ranks lower and appears to have a smaller impact on the price.

Feature Engineering

To reduce the number of variables we create a volume feature (x × y × z) and can then remove the individual length, width and depth columns. This reduces the number of variables in the dataset, making predictions quicker.

In [32]:
df['volume'] = df['x']*df['y']*df['z']
df.head()
Out[32]:
carat cut color clarity depth table price x y z volume
0 0.23 5 6 2 61.5 55.0 326 3.95 3.98 2.43 38.202030
1 0.21 4 6 3 59.8 61.0 326 3.89 3.84 2.31 34.505856
2 0.23 2 6 5 56.9 65.0 327 4.05 4.07 2.31 38.076885
3 0.29 4 2 4 62.4 58.0 334 4.20 4.23 2.63 46.724580
4 0.31 2 1 2 63.3 58.0 335 4.34 4.35 2.75 51.917250
In [33]:
df.drop(['x','y','z'], axis=1, inplace= True)

Multiple Linear Regression

Now that the data has been pre-processed and prepared, we can carry out multiple linear regression to predict the price of a diamond. The first stage is to split the dataset into training and testing sets: we fit the model on the training set and then validate its accuracy on the unseen test set.

In [34]:
#Split the data into training and testing
X = df.drop(['price'], axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=66)
In [35]:
# Applying Feature Scaling ( StandardScaler )
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
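StandardScaler standardises each feature to zero mean and unit variance using statistics computed from the training set only; re-using those statistics on the test set avoids leaking information from the test data. A quick sanity check (sketch):

# Sketch: after StandardScaler, each training column should have mean ~0 and std ~1
print(X_train.mean(axis=0).round(3))
print(X_train.std(axis=0).round(3))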

Perform linear regression and output key accuracy metrics

In [36]:
clf_lr = LinearRegression()
clf_lr.fit(X_train , y_train)
accuracies = cross_val_score(estimator = clf_lr, X = X_train, y = y_train, cv = 5,verbose = 1)
y_pred = clf_lr.predict(X_train)
print('')
print('####### Linear Regression #######')
print('Score : %.4f' % clf_lr.score(X_train, y_train))
print(accuracies) #Gives the R2 value for each fold in Cross Validation.

mse = mean_squared_error(y_train, y_pred)
mae = mean_absolute_error(y_train, y_pred)
rmse = mean_squared_error(y_train, y_pred)**0.5
r2 = r2_score(y_train, y_pred)

print('')
print('MSE    : %0.2f ' % mse)
print('MAE    : %0.2f ' % mae)
print('RMSE   : %0.2f ' % rmse)
print('R2     : %0.5f ' % r2)
####### Linear Regression #######
Score : 0.9041
[0.89570448 0.90544528 0.90272054 0.90162343 0.90874481]

MSE    : 1518664.84 
MAE    : 852.05 
RMSE   : 1232.34 
R2     : 0.90414 
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s finished

While the R2 and the other statistics for this linear regression model are relatively high, the residual plot (below) suggests that the model is a good fit only within a certain category of diamonds (i.e. lower-priced diamonds). For lower-priced diamonds the residuals appear evenly distributed above and below zero, but for diamonds above roughly $2,000 the model does not perform as well. An R2 score of 0.904 looks particularly desirable, yet the residual plot uncovers some heteroscedasticity in the model, and it does not perform well on higher-priced diamonds.

In [37]:
#Creating a residual plot of the residual error against the predicted value
from yellowbrick.regressor import ResidualsPlot
visualizer = ResidualsPlot(clf_lr)

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
#visualizer.score(X_test, y_test)  # Evaluate the model on the test data
visualizer.show()  
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faf31393590>
In [38]:
#Coefficients
print(clf_lr.intercept_)
print(clf_lr.coef_)
3938.709546550446
[4089.44648967  138.25610023  540.24497119  860.90347832  -58.3206942
  -56.07983346   73.04341101]

These are the coefficients of the linear regression model. Each value is multiplied by the respective (scaled) feature and the products are added to the intercept to give the final predicted price.
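To make that concrete, a prediction can be reproduced by hand from the intercept and the coefficients. A minimal sketch using the first (scaled) training row:

# Sketch: reconstruct the model's prediction for the first training row from the coefficients
manual_pred = clf_lr.intercept_ + X_train[0] @ clf_lr.coef_
print(manual_pred, clf_lr.predict(X_train[:1])[0])  # the two values should match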

Testing the model

In [39]:
#Using the Test data
y_pred = clf_lr.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred)**0.5
r2 = r2_score(y_test, y_pred)

print('')
print('MSE    : %0.2f ' % mse)
print('MAE    : %0.2f ' % mae)
print('RMSE   : %0.2f ' % rmse)
print('R2     : %0.5f ' % r2)
MSE    : 1525343.06 
MAE    : 856.34 
RMSE   : 1235.05 
R2     : 0.90538 

The test scores show that on unseen data the model achieves an R2 of roughly 0.905, meaning it explains about 90% of the variance in diamond prices, with a mean absolute error of around $856 per prediction.