Linear regression is a **linear model**: it assumes a linear relationship between the input variables (x) and a single output variable (y). In other words, y, which is always numerical, can be calculated from a linear combination of the input variables (x). More formally, **linear regression** is a **linear** approach to modeling the relationship between a scalar response (the dependent variable) and one or more explanatory variables (the independent variables).

When there is a single input variable (x), the method is called **simple linear regression**. When there are **multiple input variables**, it is called **multiple linear regression**.
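To make the definition concrete, here is a minimal sketch of a simple linear model of the form y = b0 + b1 * x. The coefficient values and the function name `predict_salary` are made up for illustration; they are not the values fitted later in this article.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
intercept = 25000.0   # b0: predicted salary at 0 years of experience
slope = 9500.0        # b1: change in predicted salary per extra year

def predict_salary(years):
    """Return b0 + b1 * x for every value in `years`."""
    return intercept + slope * np.asarray(years, dtype=float)

print(predict_salary([1, 5, 10]))
```

With several input variables, the same idea extends to one coefficient per input plus the intercept.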

In general, a model fits the data well if the differences between the observed values and the model’s predicted values are small and unbiased.
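The differences between observed and predicted values are called residuals. The sketch below, on a small invented dataset, fits a straight line with `np.polyfit` (used here only as a convenient least-squares fit, not the method used later in the article) and checks that the residuals are small and average out to roughly zero.

```python
import numpy as np

# Toy data, invented for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares straight line; for deg=1 polyfit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

# Residual = observed value minus predicted value
residuals = y - (intercept + slope * x)

print("residuals:", residuals)
print("mean residual:", residuals.mean())
```

A mean residual near zero means the line does not systematically over- or under-predict on this data.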

I am supremely intrigued by the power of a prediction model built on a statistical approach and combined with the power of Python. Here I share a simple activity I carried out to understand the linear regression model using the R-squared (`r2_score`) metric.

*R-squared (r2_score)*

It is a statistical measure of how close the data are to the fitted regression line. It is the percentage of the response variable variation that is explained by a linear model.

*R-squared = Explained variation / Total variation*

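For a least-squares fit scored on the data it was trained on, R-squared is always between 0 and 100% (on held-out data it can fall below 0 if the model predicts worse than simply using the mean):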

- 0% indicates that the model explains none of the variability of the response data around its mean.
- 100% indicates that the model explains all the variability of the response data around its mean.

In general, the higher the R-squared, the better the model fits your data.
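As a rough illustration, R-squared can be computed by hand and compared with scikit-learn's `r2_score`. The sketch uses the equivalent form 1 - (unexplained variation / total variation), which is how `r2_score` is defined; the numbers are invented for the example.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented observed and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.3, 6.9, 9.2])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained (residual) variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean

r2_manual = 1 - ss_res / ss_tot

print("manual R-squared: ", r2_manual)
print("sklearn r2_score: ", r2_score(y_true, y_hat))
```

Both values should agree, which confirms that the metric is nothing more than a ratio of two sums of squares.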

# Linear Regression

**Problem Statement:** You work at XYZ Corporation as a Machine Learning Engineer. The corporation wants you to build a system that can predict an employee's salary from the employee's years of experience. You will train the model and then use the testing data or your own data (a 2D array) to check how well the model predicts.

The tasks to be performed are listed below.

In [5]:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.metrics import r2_score

**Task 1: Load the dataset using Python Pandas.**

**Solution 1**

In [3]:

data = pd.read_csv(r"/Users/harmeetkaur/Desktop/Aug2/data (1).csv")

df = pd.DataFrame(data)

df

Out[3]:

**Task 2: Separate Dataset into training and testing sets.**

**Solution 2**

**YearsExperience** is the independent variable and **Salary** is the dependent variable in the dataset. In the prediction model, x (years of experience) is the independent variable used to predict the dependent variable y (salary).

Linear regression requires the target variable to be numeric.

In [7]:

x = df.iloc[:, :-1].values  # [rows, columns]: all rows, every column except the last; the result stays 2D

x

Out[7]:

array([[ 1.1],

[ 1.3],

[ 1.5],

[ 2. ],

[ 2.2],

[ 2.9],

[ 3. ],

[ 3.2],

[ 3.2],

[ 3.7],

[ 3.9],

[ 4. ],

[ 4. ],

[ 4.1],

[ 4.5],

[ 4.9],

[ 5.1],

[ 5.3],

[ 5.9],

[ 6. ],

[ 6.8],

[ 7.1],

[ 7.9],

[ 8.2],

[ 8.7],

[ 9. ],

[ 9.5],

[ 9.6],

[10.3],

[10.5]])

In [9]:

y = df.iloc[:, -1].values  # [rows, columns]: all rows, last column only (Salary); the result is 1D

y

Out[9]:

array([ 39343., 46205., 37731., 43525., 39891., 56642., 60150.,

54445., 64445., 57189., 63218., 55794., 56957., 57081.,

61111., 67938., 66029., 83088., 81363., 93940., 91738.,

98273., 101302., 113812., 109431., 105582., 116969., 112635.,

122391., 121872.])
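The difference between the two slices matters because scikit-learn expects the features as a 2D array and the target as a 1D array. The sketch below rebuilds a three-row frame from the first rows shown above (the name `df_demo` is only for this example) and prints the resulting shapes.

```python
import pandas as pd

# First three rows of the dataset shown above, rebuilt inline for the demo
df_demo = pd.DataFrame({
    "YearsExperience": [1.1, 1.3, 1.5],
    "Salary": [39343.0, 46205.0, 37731.0],
})

x_demo = df_demo.iloc[:, :-1].values  # slice of columns -> 2D array
y_demo = df_demo.iloc[:, -1].values   # single column -> 1D array

print("x shape:", x_demo.shape)
print("y shape:", y_demo.shape)
```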

In [32]:

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=0)

In [33]:

x_train#80% of years of experience

Out[33]:

array([[ 9.6],

[ 4. ],

[ 5.3],

[ 7.9],

[ 2.9],

[ 5.1],

[ 3.2],

[ 4.5],

[ 8.2],

[ 6.8],

[ 1.3],

[10.5],

[ 3. ],

[ 2.2],

[ 5.9],

[ 6. ],

[ 3.7],

[ 3.2],

[ 9. ],

[ 2. ],

[ 1.1],

[ 7.1],

[ 4.9],

[ 4. ]])

In [34]:

x_test#20% of years of experience

Out[34]:

array([[ 1.5],

[10.3],

[ 4.1],

[ 3.9],

[ 9.5],

[ 8.7]])

In [35]:

y_train#80% of salary

Out[35]:

array([112635., 55794., 83088., 101302., 56642., 66029., 64445.,

61111., 113812., 91738., 46205., 121872., 60150., 39891.,

81363., 93940., 57189., 54445., 105582., 43525., 39343.,

98273., 67938., 56957.])

In [36]:

y_test  # 20% of salary

Out[36]:

array([ 37731., 122391., 57081., 63218., 116969., 109431.])
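A quick sanity check on the split is to confirm the sizes and that each x row is still paired with its own y value. This sketch uses a synthetic 30-row dataset (the names `x_demo`, `y_demo` and the relation y = 2x + 1 are assumptions for the demo) with the same `train_size` and `random_state` as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data with 30 rows, the same row count as the salary dataset
x_demo = np.arange(30, dtype=float).reshape(-1, 1)
y_demo = 2.0 * x_demo.ravel() + 1.0

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, train_size=0.8, random_state=0
)

print("train rows:", x_tr.shape[0])
print("test rows: ", x_te.shape[0])

# The shuffle keeps each x paired with its own y
print("pairs intact:", np.allclose(y_tr, 2.0 * x_tr.ravel() + 1.0))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.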

**Task 3: Train a model to make predictions based on the number of years of experience.**

**Solution 3**

In [19]:

from sklearn.linear_model import LinearRegression

In [20]:

# To create an object, give it a name (obj) and call the class constructor with no arguments

obj = LinearRegression()

fit() is a method of the LinearRegression class that builds the linear regression model. We pass only the training data to fit().

In [21]:

obj.fit(x_train,y_train)

Out[21]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

predict() is also a method of the LinearRegression class; we call it on the fitted object, obj.

In [24]:

y_pred = obj.predict(x_test)  # y_test is not passed, it stays hidden; returns the predicted salary for each years-of-experience value

In [25]:

y_pred

Out[25]:

array([ 56414.13629032, 63030.78467742, 116909.20725806, 115963.97177419,

68702.19758065, 108402.08790323])

In [41]:

salary_pred = obj.predict([[5]])  # we can pass our own 2D array; a separate name keeps y_pred intact for the R2 score in Task 5

salary_pred

Out[41]:

array([73428.375])
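A prediction like this is just the fitted line evaluated at x. As a sketch, the slope and intercept can be read from the `coef_` and `intercept_` attributes of a fitted `LinearRegression`. The example uses synthetic data lying exactly on y = 3x + 2 (an assumption for the demo, not the salary data), so the recovered values are easy to verify.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic points lying exactly on y = 3x + 2
x_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = 3.0 * x_demo.ravel() + 2.0

model = LinearRegression().fit(x_demo, y_demo)

slope = model.coef_[0]        # one coefficient per input column
intercept = model.intercept_

# predict() is equivalent to intercept + slope * x
manual = intercept + slope * 5.0
print("slope:", slope, "intercept:", intercept)
print("manual:", manual, "predict:", model.predict([[5.0]])[0])
```

Applied to `obj` from this article, the same two attributes would give the estimated salary increase per year and the estimated starting salary.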

**Task 4: Plot and visualize the training data, testing data, and the regression line.**

**Solution 4**

In [28]:

#training data : scatterplot

plt.scatter(x_train, y_train, color="blue")

plt.grid(axis='y', alpha=0.75)

plt.xlabel('Years of Experience')

plt.ylabel('Salary')

plt.title('Years of Experience vs Salary')

plt.plot(x_train, obj.predict(x_train),color="red")

Out[28]:

[<matplotlib.lines.Line2D at 0x7ffb575a7b90>]

In [30]:

#testing data : scatterplot

plt.scatter(x_test, y_test, color="blue")

plt.grid(axis='y', alpha=0.75)

plt.xlabel('Years of Experience')

plt.ylabel('Salary')

plt.title('Years of Experience vs Salary')

plt.plot(x_test, obj.predict(x_test),color="green")

Out[30]:

[<matplotlib.lines.Line2D at 0x7ffb57792c10>]

**Task 5: Check the model accuracy using the R2 score of the model.**

**Solution 5**

In [31]:

r2_score(y_test, y_pred)

Out[31]:

0.9811828892966187

**The score is close to 100%, which indicates that the model fits well: it predicts nearly the correct salary for a given number of years of experience. The R-squared score of the model on the test set is about 98%.**
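As an aside, a fitted `LinearRegression` also exposes a `score()` method that returns the same R-squared value without calling `predict()` first. The sketch below checks this on synthetic data; the slope, intercept, noise level and seed are assumptions chosen only to resemble a salary-like trend.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic, salary-like data with a fixed seed for reproducibility
rng = np.random.default_rng(0)
x_demo = np.linspace(1, 10, 30).reshape(-1, 1)
y_demo = 9000.0 * x_demo.ravel() + 26000.0 + rng.normal(0, 5000, size=30)

model = LinearRegression().fit(x_demo, y_demo)

r2_from_metric = r2_score(y_demo, model.predict(x_demo))
r2_from_score = model.score(x_demo, y_demo)

print("r2_score:     ", r2_from_metric)
print("model.score():", r2_from_score)
```

Note that `r2_score` takes the true values first and the predictions second, and both arrays must have the same length.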

# Before You Go

*Thanks for reading the article! You can reach out to me through my **LinkedIn profile**. You can also view the code and data I used on my GitHub.*

**Harmeet Kaur**

Senior Computer Science Teacher at Heritage Xperiential Learning School, Gurugram. AI book author, GAFE and Common Sense education certified, Teacher Mentor