This project deals with solving a quantitative problem. It illustrates how regression can be utilised to predict a numerical output, in this case the price of a diamond. We begin by importing the relevant libraries that will be used in this project (note: many of them will only be needed for part two, advanced regression).
# Ignore warnings :
import warnings
warnings.filterwarnings('ignore')
# Handle table-like data and matrices :
import numpy as np
import pandas as pd
import math
# reshape 1D array
from numpy import array
from numpy import reshape
# Modelling Algorithms :
# Classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier , GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis , QuadraticDiscriminantAnalysis
# Regression
from sklearn.linear_model import LinearRegression,Ridge,Lasso,LassoCV, RidgeCV, ElasticNet
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
# Modelling Helpers :
from sklearn.preprocessing import Imputer , Normalizer , scale
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV , KFold , cross_val_score
#preprocessing :
from sklearn.preprocessing import MinMaxScaler , StandardScaler, Imputer, LabelEncoder
#evaluation metrics :
# Regression
from sklearn.metrics import mean_squared_log_error,mean_squared_error, r2_score,mean_absolute_error
# Classification
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
import missingno as msno
# Configure visualisations
%matplotlib inline
mpl.style.use( 'ggplot' )
plt.style.use('fivethirtyeight')
sns.set(context="notebook", palette="dark", style = 'whitegrid' , color_codes=True)
params = {
'axes.labelsize': "large",
'xtick.labelsize': 'x-large',
'legend.fontsize': 20,
'figure.dpi': 150,
'figure.figsize': [10, 7]
}
plt.rcParams.update(params)
# Center all plots
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style>
""");
#load the dataset
#https://www.kaggle.com/shivam2503/diamonds/data
df = pd.read_csv("diamonds.csv")
df.head()
The first thing that must be done is to cleanse and prepare the data to ensure that it is suitable for the regression analysis that we wish to perform.
#Drop the unnamed index column
df.drop(['Unnamed: 0'] , axis=1 , inplace=True)
df.head()
#What is contained within the dataset
print(df.keys())
The different features in the dataset are as follows.
price: price in US dollars ($326--$18,823). This is the output variable we will be predicting.
carat: weight of the diamond (0.2--5.01)
cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color: diamond colour, from J (worst) to D (best)
clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x: length in mm (0--10.74)
y: width in mm (0--58.9)
z: depth in mm (0--31.8)
depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
table: width of top of diamond relative to widest point (43--95)
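As a quick sanity check on the quoted ranges (a verification step added here, not part of the original workflow), pandas' describe gives the min and max of each numeric column:
#Summary statistics to verify the ranges quoted above
df.describe()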
#look for null values
df.info()
#Confirm no null values
df.isnull().sum()
Although there are no null values, we must still be wary of zeros in certain columns, for example the dimension columns (x, y or z), since a diamond cannot have a zero dimension.
#Count the number of rows with zero in x,y or z
len(df[(df['x']==0) | (df['y']==0) | (df['z']==0)])
#Since there are only 20 such rows (out of over 50,000) we can drop them without any significant impact on the dataset
df = df[(df[['x','y','z']] != 0).all(axis=1)]
Feature encoding is where we convert the categorical variables into numerical values so that regression analysis can be carried out. There are a number of different approaches to feature encoding, but since all the categories in this dataset lie on an ordinal scale from worst to best, we can replace each category with a numerical rank (from worst to best).
#Turn the Cut column from a word ranking to a number ranking
df['cut'] = df['cut'].replace('Fair',1)
df['cut'] = df['cut'].replace('Good',2)
df['cut'] = df['cut'].replace('Very Good',3)
df['cut'] = df['cut'].replace('Premium',4)
df['cut'] = df['cut'].replace('Ideal',5)
#Convert color column into number ranking
df['color'] = df['color'].replace('J',1)
df['color'] = df['color'].replace('I',2)
df['color'] = df['color'].replace('H',3)
df['color'] = df['color'].replace('G',4)
df['color'] = df['color'].replace('F',5)
df['color'] = df['color'].replace('E',6)
df['color'] = df['color'].replace('D',7)
#Convert the clarity into a number ranking
df['clarity'] = df['clarity'].replace('I1',1)
df['clarity'] = df['clarity'].replace('SI2',2)
df['clarity'] = df['clarity'].replace('SI1',3)
df['clarity'] = df['clarity'].replace('VS2',4)
df['clarity'] = df['clarity'].replace('VS1',5)
df['clarity'] = df['clarity'].replace('VVS2',6)
df['clarity'] = df['clarity'].replace('VVS1',7)
df['clarity'] = df['clarity'].replace('IF',8)
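The same ordinal encoding can be written more compactly by passing a dictionary to replace. This is an equivalent sketch of the three steps above (the dictionary names cut_rank, color_rank and clarity_rank are just illustrative; running it after the cells above is harmless since the columns are already numeric):
#Equivalent, more compact encoding using dictionaries (same rankings as above)
cut_rank = {'Fair':1, 'Good':2, 'Very Good':3, 'Premium':4, 'Ideal':5}
color_rank = {'J':1, 'I':2, 'H':3, 'G':4, 'F':5, 'E':6, 'D':7}
clarity_rank = {'I1':1, 'SI2':2, 'SI1':3, 'VS2':4, 'VS1':5, 'VVS2':6, 'VVS1':7, 'IF':8}
df['cut'] = df['cut'].replace(cut_rank)
df['color'] = df['color'].replace(color_rank)
df['clarity'] = df['clarity'].replace(clarity_rank)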
Many datasets contain features that vary widely in magnitude, units and range. The box plot below illustrates this for the diamond features; to bring them onto a comparable scale we will apply feature scaling before fitting the model.
sns.factorplot(data=df , kind='box' , size=7, aspect=2.5)
With every project it is often very helpful to visualise the dataset and key features within it prior to carrying out the full analysis.
#Plot the distribution of the output variable price
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.distplot(df['price'], bins=80)
In the distribution above we see that many diamonds are priced at the lower end of the range (below $5,000). The plot shows a long tail, and it may be useful to limit the analysis to a certain price range to obtain better results.
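For illustration, restricting the dataset to a price band takes a single filter. The $5,000 threshold below is only an example and is not applied in the rest of the notebook:
#Example only: count diamonds below an illustrative $5,000 threshold
print(len(df[df['price'] < 5000]), 'of', len(df), 'diamonds are priced below $5,000')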
#correlation matrix
correlation_matrix = df.corr().round(2)
#annot = True to print the values inside the square
sns.heatmap(data=correlation_matrix, annot=True)
This correlation matrix allows us to see which features are highly correlated with one another. The dimensions of the diamond (x, y and z) are all highly correlated with the price. Conversely, colour ranks lower and has a smaller impact on the price.
In order to reduce the number of variables we create a volume feature, which lets us remove the length, width and depth columns. This reduces the number of variables in the dataset, making predictions quicker.
df['volume'] = df['x']*df['y']*df['z']
df.head()
df.drop(['x','y','z'], axis=1, inplace= True)
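As a quick check (added here for verification, not part of the original workflow), the new volume feature's correlation with price can be computed directly to confirm it retains the signal of the dropped dimensions:
#Verify that volume remains strongly correlated with price
print('Correlation between volume and price : %0.2f' % df['volume'].corr(df['price']))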
Now that the data has been pre-processed and prepared, we can carry out multiple linear regression to predict the price of a diamond. The first stage is to split the dataset into training and testing sets: we build the model on the training set and then validate its accuracy on the unseen test set.
#Split the data into training and testing
X = df.drop(['price'], axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=66)
# Applying Feature Scaling ( StandardScaler )
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
clf_lr = LinearRegression()
clf_lr.fit(X_train , y_train)
accuracies = cross_val_score(estimator = clf_lr, X = X_train, y = y_train, cv = 5,verbose = 1)
y_pred = clf_lr.predict(X_train)
print('')
print('####### Linear Regression #######')
print('Score : %.4f' % clf_lr.score(X_train, y_train))
print(accuracies) #Gives the R2 value for each fold in Cross Validation.
mse = mean_squared_error(y_train, y_pred)
mae = mean_absolute_error(y_train, y_pred)
rmse = mean_squared_error(y_train, y_pred)**0.5
r2 = r2_score(y_train, y_pred)
print('')
print('MSE : %0.2f ' % mse)
print('MAE : %0.2f ' % mae)
print('RMSE : %0.2f ' % rmse)
print('R2 : %0.5f ' % r2)
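For a single cross-validated figure, the mean and spread of the fold scores printed above can be summarised as follows:
#Summarise the cross-validation R2 scores
print('CV R2 : %0.4f (+/- %0.4f)' % (accuracies.mean(), accuracies.std()))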
While the R2 and other statistics for this linear regression model are relatively high, the residual plot (below) illustrates that the model is only a good fit within a certain category of diamonds (i.e. lower-priced diamonds). For lower-priced diamonds there appears to be an even distribution of residuals above and below zero; however, for diamonds above roughly $2,000 the model does not perform as well. An R2 score of 0.904 appears particularly desirable, but the residual plot reveals some heteroscedasticity in the model, and it does not perform well on higher-priced diamonds.
#Creating a residual plot of the residual error against the predicted value
from yellowbrick.regressor import ResidualsPlot
from yellowbrick.regressor import ResidualsPlot
visualizer = ResidualsPlot(clf_lr)
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
#visualizer.score(X_test, y_test) # Evaluate the model on the test data
visualizer.show()
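If yellowbrick is not installed, a minimal residual plot can be drawn directly with matplotlib using the fitted model from above; this sketch reproduces the same idea:
#Minimal residual plot without yellowbrick: residuals vs predicted values
train_pred = clf_lr.predict(X_train)
plt.scatter(train_pred, y_train - train_pred, s=5, alpha=0.3)
plt.axhline(0, color='black', linewidth=1)
plt.xlabel('Predicted price')
plt.ylabel('Residual')
plt.show()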
#Coefficients
print(clf_lr.intercept_)
print(clf_lr.coef_)
These are the coefficients of the linear regression model. Each value is multiplied by its respective (scaled) feature and the results are added to the intercept to give the final predicted price.
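To make this concrete, the prediction for the first (scaled) test row can be reproduced by hand from the intercept and coefficients; this is a small illustrative check:
#Reproduce the model's prediction for the first test row by hand
manual_pred = clf_lr.intercept_ + np.dot(X_test[0], clf_lr.coef_)
print('Manual prediction :', manual_pred)
print('Model prediction  :', clf_lr.predict(X_test[:1])[0])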
#Using the Test data
y_pred = clf_lr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred)**0.5
r2 = r2_score(y_test, y_pred)
print('')
print('MSE : %0.2f ' % mse)
print('MAE : %0.2f ' % mae)
print('RMSE : %0.2f ' % rmse)
print('R2 : %0.5f ' % r2)
The test scores illustrate that on an unseen set of data the model achieves an R2 score of around 0.9, meaning it explains roughly 90% of the variance in diamond prices.