
Understanding the Linear Regression Model

Linear regression is a linear model: it assumes a linear relationship between the input variables (x) and a single output variable (y). In other words, y, which must be numerical, is calculated as a linear combination of the input variables (x). Linear regression is therefore a linear approach to modeling the relationship between a scalar response (the dependent variable) and one or more explanatory variables (the independent variables).

When there is a single input variable (x), the method is called simple linear regression. When there are multiple input variables, it is called multiple linear regression.
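To make the "linear combination" idea concrete, here is a minimal sketch. The intercept and coefficients are hypothetical numbers chosen only for illustration, and `predict` is a helper written for this example, not a library function.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
intercept = 25000.0
coef_simple = np.array([9000.0])           # one input: simple linear regression
coef_multiple = np.array([9000.0, 500.0])  # two inputs: multiple linear regression


def predict(x, coef, intercept):
    """Return intercept + x @ coef for each row of the 2D array x."""
    return intercept + np.asarray(x, dtype=float) @ coef


simple_pred = predict([[5.0]], coef_simple, intercept)
multiple_pred = predict([[5.0, 10.0]], coef_multiple, intercept)
print(simple_pred)    # 25000 + 9000*5
print(multiple_pred)  # 25000 + 9000*5 + 500*10
```

In both cases the prediction is the same kind of weighted sum; the only difference is how many inputs each row has.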

In general, a model fits the data well if the differences between the observed values and the model’s predicted values are small and unbiased.
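The differences mentioned above are usually called residuals. As a rough sketch, the snippet below fits a least-squares line with `np.polyfit` to the first five rows of the salary data used later in this post and checks that the residuals average out to roughly zero. Using only five rows is my simplification for the example.

```python
import numpy as np

# First five rows of the salary data used later in this post.
years = np.array([1.1, 1.3, 1.5, 2.0, 2.2])
salary = np.array([39343.0, 46205.0, 37731.0, 43525.0, 39891.0])

# For degree 1, np.polyfit returns [slope, intercept] of the least-squares line.
slope, intercept = np.polyfit(years, salary, 1)

# Residual = observed value minus predicted value.
residuals = salary - (slope * years + intercept)

print(residuals)
print(residuals.mean())  # close to 0 for a least-squares line with an intercept
```

A mean residual near zero is what "unbiased" refers to here: the line does not sit systematically above or below the data it was fitted to.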

I am supremely intrigued by the power of a prediction model built from a statistical approach and combined with the power of Python. Here I share a simple activity I carried out to understand the linear regression model, evaluated with the R-squared metric.

R-squared (r2_score)

It is a statistical measure of how close the data are to the fitted regression line. It is the percentage of the response variable variation that is explained by a linear model.

R-squared = Explained variation / Total variation

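For a least-squares line evaluated on the data it was fitted to, R-squared lies between 0 and 100% (on new data it can even drop below 0 when the model predicts worse than the mean):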

  • 0% indicates that the model explains none of the variability of the response data around its mean.
  • 100% indicates that the model explains all the variability of the response data around its mean.

In general, the higher the R-squared, the better the model fits your data.
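Here is a minimal sketch of the calculation, assuming the common form R-squared = 1 - (unexplained variation / total variation), which is what scikit-learn's `r2_score` computes. The arrays are made-up numbers for illustration only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)                # 1 - 0.75/20
print(r2_score(y_true, y_hat))  # same value from scikit-learn

# Predictions that are worse than always guessing the mean score below 0.
r2_bad = r2_score(y_true, y_true[::-1])
print(r2_bad)
```

Note that `r2_score` returns a fraction rather than a percentage, so 0.96 corresponds to 96%.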

#LINEAR REGRESSION

Problem Statement: You work in XYZ Corporation as a Machine Learning Engineer. The corporation wants you to build a system that can predict the salary of an employee based on the employee's experience in number of years. You will train the model and then use the testing data, or your own data passed as a 2D array, to check the efficacy of the prediction model.

The tasks to be performed are listed below.

In [5]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

Task 1: Load the dataset using Python Pandas.

Solution 1

In [3]:

df = pd.read_csv(r"/Users/harmeetkaur/Desktop/Aug2/data (1).csv")  # read_csv already returns a DataFrame
df

Out[3]:

Historical Employee data
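If you do not have the CSV file at hand, the sketch below shows the same loading step on an in-memory stand-in. I am assuming the file has the two columns named in this post, YearsExperience and Salary, and I use only the first three rows that appear in the arrays further down.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file: first three rows of the dataset.
csv_text = """YearsExperience,Salary
1.1,39343.0
1.3,46205.0
1.5,37731.0
"""

df_sample = pd.read_csv(io.StringIO(csv_text))

print(df_sample.shape)   # (rows, columns)
print(df_sample.dtypes)  # both columns should be numeric
```

Checking the shape and the column types right after loading is a cheap way to confirm that the target column is numeric before training.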

Task 2: Separate Dataset into training and testing sets.

Solution 2

YearsExperience is the independent variable and Salary is the dependent variable in the dataset. In the prediction model, x (years of experience) is used to predict y (salary).

Linear regression requires the target variable to be numeric.

In [7]:

x = df.iloc[:, :-1].values         # all rows, every column except the last (stays 2D)
x

Out[7]:

array([[ 1.1],
[ 1.3],
[ 1.5],
[ 2. ],
[ 2.2],
[ 2.9],
[ 3. ],
[ 3.2],
[ 3.2],
[ 3.7],
[ 3.9],
[ 4. ],
[ 4. ],
[ 4.1],
[ 4.5],
[ 4.9],
[ 5.1],
[ 5.3],
[ 5.9],
[ 6. ],
[ 6.8],
[ 7.1],
[ 7.9],
[ 8.2],
[ 8.7],
[ 9. ],
[ 9.5],
[ 9.6],
[10.3],
[10.5]])

In [9]:

y = df.iloc[:, -1].values  # all rows, last column only (1D)
y

Out[9]:

array([ 39343.,  46205.,  37731.,  43525.,  39891.,  56642.,  60150.,
54445., 64445., 57189., 63218., 55794., 56957., 57081.,
61111., 67938., 66029., 83088., 81363., 93940., 91738.,
98273., 101302., 113812., 109431., 105582., 116969., 112635.,
122391., 121872.])

In [32]:

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=0)
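The sketch below illustrates what the two arguments do, using a made-up array of 30 rows (the same number of rows as the salary data): `train_size=0.8` gives a 24/6 split, and fixing `random_state` makes the shuffle reproducible. The names ending in `_demo` are my own and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data with 30 rows and one feature, for illustration only.
x_demo = np.arange(30, dtype=float).reshape(-1, 1)
y_demo = 2.0 * x_demo.ravel() + 1.0

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, train_size=0.8, random_state=0
)
# Same arguments again: the split should be identical.
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    x_demo, y_demo, train_size=0.8, random_state=0
)

print(x_tr.shape, x_te.shape)      # 24 training rows, 6 testing rows
print(np.array_equal(x_te, x_te2)) # same random_state, same split
```

Without `random_state`, each run would pick different rows for testing, so scores would vary slightly from run to run.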

In [33]:

x_train   #80% of years of experience

Out[33]:

array([[ 9.6],
[ 4. ],
[ 5.3],
[ 7.9],
[ 2.9],
[ 5.1],
[ 3.2],
[ 4.5],
[ 8.2],
[ 6.8],
[ 1.3],
[10.5],
[ 3. ],
[ 2.2],
[ 5.9],
[ 6. ],
[ 3.7],
[ 3.2],
[ 9. ],
[ 2. ],
[ 1.1],
[ 7.1],
[ 4.9],
[ 4. ]])

In [34]:

x_test    #20% of years of experience

Out[34]:

array([[ 1.5],
[10.3],
[ 4.1],
[ 3.9],
[ 9.5],
[ 8.7]])

In [35]:

y_train   #80% of salary

Out[35]:

array([112635.,  55794.,  83088., 101302.,  56642.,  66029.,  64445.,
61111., 113812., 91738., 46205., 121872., 60150., 39891.,
81363., 93940., 57189., 54445., 105582., 43525., 39343.,
98273., 67938., 56957.])

In [36]:

y_test    #20% of salary

Out[36]:

array([ 37731., 122391.,  57081.,  63218., 116969., 109431.])

Task 3: Train a model to make predictions based on the number of years of experience.

Solution 3

In [19]:

from sklearn.linear_model import LinearRegression

In [20]:

# Create the model object by calling the class constructor with no arguments
obj = LinearRegression()

fit() is a method of the LinearRegression class that builds the linear regression model; we pass only the training data to fit().

In [21]:

obj.fit(x_train,y_train)

Out[21]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
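After fitting, the learned line can be inspected through the `coef_` and `intercept_` attributes. This is a small sketch on synthetic, noise-free data that I made up (salary = 25000 + 9000 * years), so the fitted values should come back as roughly those two numbers; `demo_model` is a name used only here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (an assumption for illustration):
# salary = 25000 + 9000 * years
years = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
salary = 25000.0 + 9000.0 * years.ravel()

demo_model = LinearRegression().fit(years, salary)

print(demo_model.coef_)       # slope: one value per input column
print(demo_model.intercept_)  # predicted salary at 0 years of experience
```

On the real salary data the same two attributes of obj tell you how much the predicted salary changes per additional year of experience.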

predict() is also a method of the LinearRegression class; we call it on the object obj.

In [24]:

y_pred = obj.predict(x_test)  # y_test is not passed; the model sees only the years of experience and returns predicted salaries

In [25]:

y_pred

Out[25]:

array([ 56414.13629032,  63030.78467742, 116909.20725806, 115963.97177419,
68702.19758065, 108402.08790323])

In [41]:

y_new = obj.predict([[5]])  # we can pass our own 2D array (here, 5 years of experience); a new name keeps y_pred intact for scoring
y_new

Out[41]:

array([73428.375])
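The input has to be 2D because scikit-learn expects one row per sample and one column per feature. Below is a short sketch, again on made-up noise-free data, of turning a plain list of years into that shape with `reshape(-1, 1)`; the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up noise-free training data: salary = 25000 + 9000 * years
years = np.array([[1.0], [2.0], [3.0], [4.0]])
salary = 25000.0 + 9000.0 * years.ravel()
demo_model = LinearRegression().fit(years, salary)

new_years = np.array([5.0, 7.5])         # 1D, shape (2,)
new_years_2d = new_years.reshape(-1, 1)  # 2D, shape (2, 1): two samples, one feature
demo_pred = demo_model.predict(new_years_2d)
print(demo_pred)

# A 1D array is rejected, which is why [[5]] is written with double brackets.
try:
    demo_model.predict(new_years)
except ValueError as err:
    print("1D input rejected:", type(err).__name__)
```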

Task 4: Plot and visualize the training data, testing data, and the regression line.

Solution 4

In [28]:

#training data : scatterplot
plt.scatter(x_train, y_train, color="blue")
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Years of Experience vs Salary')
plt.plot(x_train, obj.predict(x_train),color="red")

Out[28]:

[<matplotlib.lines.Line2D at 0x7ffb575a7b90>]
training data: scatterplot

In [30]:

#testing data : scatterplot
plt.scatter(x_test, y_test, color="blue")
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Years of Experience vs Salary')
plt.plot(x_test, obj.predict(x_test),color="green")

Out[30]:

[<matplotlib.lines.Line2D at 0x7ffb57792c10>]
testing data: scatterplot

Task 5: Check the model accuracy using the R2 score of the model.

Solution 5

In [31]:

r2_score(y_test, y_pred)

Out[31]:

0.9811828892966187

The score is close to 100%, which indicates that the model fits well: it predicts nearly the correct salary for a given number of years of experience. The R2 score of the model is about 98%.
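One way to sanity-check a score like this is to compare R2 on the training rows with R2 on the held-out rows. The sketch below does that on synthetic noisy data that I generated for the example; it also uses the `score` method of LinearRegression, which returns the same R2 as `r2_score` applied to the predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic noisy data (an assumption for illustration only).
rng = np.random.default_rng(0)
x_demo = np.linspace(1, 10, 30).reshape(-1, 1)
y_demo = 25000 + 9000 * x_demo.ravel() + rng.normal(0, 5000, size=30)

# Hold out every fifth row for testing; train on the rest.
hold_idx = np.arange(0, 30, 5)
fit_mask = np.ones(30, dtype=bool)
fit_mask[hold_idx] = False

demo_model = LinearRegression().fit(x_demo[fit_mask], y_demo[fit_mask])

r2_train = demo_model.score(x_demo[fit_mask], y_demo[fit_mask])
r2_hold = demo_model.score(x_demo[hold_idx], y_demo[hold_idx])
r2_hold_metric = r2_score(y_demo[hold_idx], demo_model.predict(x_demo[hold_idx]))

print(r2_train, r2_hold, r2_hold_metric)
```

If the held-out score were much lower than the training score, that would suggest the model does not generalize as well as the training fit implies.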

Before You Go

Thanks for reading the article! You can reach out to me through my LinkedIn profile. You can also view the code and data I used on my GitHub.

Harmeet Kaur

Senior Computer Science Teacher at Heritage Xperiential Learning School, Gurugram. AI book author, GAFE and Common Sense education certified, Teacher Mentor