Linear regression is a linear model. Generally defined, it is a model that assumes a linear relationship between the input variables (x) and the single output variable (y). Alternatively, y (always numerical) can be calculated from a linear combination of the input variables (x). Hence, it can be easily said that linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).
The case of one explanatory variable is called a simple linear regression. When there is a single input variable (x), the method is referred to as simple linear regression. When there are multiple input variables, the method is referred to as multiple linear regression.
In general, a model fits the data well if the differences between the observed values and the model’s predicted values are small and unbiased.
I am supremely intrigued by the power of a prediction model designed from a statistical approach combined with the power of python. I am just sharing a simple activity I conducted to understand the Linear Regression model based on the r2squared metric.
R-squared; r2score
It is a statistical measure of how close the data are to the fitted regression line. It is the percentage of the response variable variation that is explained by a linear model.
R-squared = Explained variation / Total variation
R-squared is always between 0 and 100%:
- 0% indicates that the model explains none of the variability of the response data around its mean.
- 100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data.
#LINEAR REGRESSION
Problem Statement: You work in XYZ Corporation as a Machine Learning Engineer. The corporation wants you to build a system that can predict the salary of an employee based on the experience of the employee in number of years. You will be training the model and using the testing data/ random data (2D array) to check the efficacy of the prediction model.
Tasks to be performed have been mentioned below.
In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
Task 1: Load the dataset using Python Pandas.
Solution 1
In [3]:
data = pd.read_csv(r"/Users/harmeetkaur/Desktop/Aug2/data (1).csv")
df = pd.DataFrame(data)
df
Out[3]:
Task 2: Separate Dataset into training and testing sets.
Solution 2
YearsExperience is an independent variable and Salary is the dependent variable in the dataset. In the prediction model, x (Year of experience) is the independent variable that helps predict the dependent variable y (salary).
Linear Regression requires that the target variable should be always “numeric”
In [7]:
x = df.iloc[:,:-1].values # :(all) [row,column] :, :-1
x
Out[7]:
array([[ 1.1],
[ 1.3],
[ 1.5],
[ 2. ],
[ 2.2],
[ 2.9],
[ 3. ],
[ 3.2],
[ 3.2],
[ 3.7],
[ 3.9],
[ 4. ],
[ 4. ],
[ 4.1],
[ 4.5],
[ 4.9],
[ 5.1],
[ 5.3],
[ 5.9],
[ 6. ],
[ 6.8],
[ 7.1],
[ 7.9],
[ 8.2],
[ 8.7],
[ 9. ],
[ 9.5],
[ 9.6],
[10.3],
[10.5]])
In [9]:
y = df.iloc[:,-1].values #:(all) [row,column] .
y
Out[9]:
array([ 39343., 46205., 37731., 43525., 39891., 56642., 60150.,
54445., 64445., 57189., 63218., 55794., 56957., 57081.,
61111., 67938., 66029., 83088., 81363., 93940., 91738.,
98273., 101302., 113812., 109431., 105582., 116969., 112635.,
122391., 121872.])
In [32]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=0)
In [33]:
x_train #80% of years of experience
Out[33]:
array([[ 9.6],
[ 4. ],
[ 5.3],
[ 7.9],
[ 2.9],
[ 5.1],
[ 3.2],
[ 4.5],
[ 8.2],
[ 6.8],
[ 1.3],
[10.5],
[ 3. ],
[ 2.2],
[ 5.9],
[ 6. ],
[ 3.7],
[ 3.2],
[ 9. ],
[ 2. ],
[ 1.1],
[ 7.1],
[ 4.9],
[ 4. ]])
In [34]:
x_test #20% of years of experience
Out[34]:
array([[ 1.5],
[10.3],
[ 4.1],
[ 3.9],
[ 9.5],
[ 8.7]])
In [35]:
y_train #80% of salary
Out[35]:
array([112635., 55794., 83088., 101302., 56642., 66029., 64445.,
61111., 113812., 91738., 46205., 121872., 60150., 39891.,
81363., 93940., 57189., 54445., 105582., 43525., 39343.,
98273., 67938., 56957.])
In [36]:
y_test #20% of years of experience
Out[36]:
array([ 37731., 122391., 57081., 63218., 116969., 109431.])
Task 3: Train a model to make predictions based on the number of years as experience.
Solution 3
In [19]:
from sklearn.linear_model import LinearRegression
In [20]:
#to create object, give the name of the object(obj) and constructor (Classname, empty())obj = LinearRegression()
fit() is a function available in the LinearrRegression class fit helps build a linear regression model; we pass only the train dataset to fit () method
In [21]:
obj.fit(x_train,y_train)
Out[21]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
predict() is the function which is present in the LinearRegression class, we are using the object named to call predict()
In [24]:
y_pred = obj.predict(x_test) #y_test is not passed, it is hidden; shows salary for YearOfExp
In [25]:
y_pred
Out[25]:
array([ 56414.13629032, 63030.78467742, 116909.20725806, 115963.97177419,
68702.19758065, 108402.08790323])
In [41]:
y_pred = obj.predict([[5]])
y_pred# we can pass our own 2D array to see how correct it is
Out[41]:
array([73428.375])
Task 4: Plot and visualize the training data, testing data, and the regression line.
Solution 4
In [28]:
#training data : scatterplot
plt.scatter(x_train, y_train, color="blue")
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Years of Experience vs Salary')
plt.plot(x_train, obj.predict(x_train),color="red")
Out[28]:
[<matplotlib.lines.Line2D at 0x7ffb575a7b90>]
In [30]:
#testing data : scatterplot
plt.scatter(x_test, y_test, color="blue")
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Years of Experience vs Salary')
plt.plot(x_test, obj.predict(x_test),color="green")
Out[30]:
[<matplotlib.lines.Line2D at 0x7ffb57792c10>]
Task 5: Check the model accuracy using the R2 score of the model.
Solution 5
In [31]:
r2_score(y_test, y_pred)
Out[31]:
0.9811828892966187
It is near to 100%; which depicts that the model is good. It is predicting nearly the correct salary for the year of experience mentioned. The r2score given by the model is 98%
Before You Go
Thanks for reading the article! You can reach out to me through my LinkedIn Profile. You can also view the code and data I have used here in my Github
Harmeet Kaur
Senior Computer Science Teacher at Heritage Xperiential Learning School, Gurugram. AI book author, GAFE and Common Sense education certified, Teacher Mentor