Welcome to this article on simple linear regression. Today we will look at how to build a simple linear regression model given a dataset. You can go through our article detailing the concept of simple linear regression prior to the coding example in this article.

## 6 Steps to build a Linear Regression model

Step 1: Importing the dataset

Step 2: Data pre-processing

Step 3: Splitting the test and train sets

Step 4: Fitting the linear regression model to the training set

Step 5: Predicting test results

Step 6: Visualizing the test results

Now that we have seen the steps, let us begin with coding the same

## Implementing a Linear Regression Model in Python

In this article, we will be using salary dataset. Our dataset will have 2 columns namely – Years of Experience and Salary.

The link to the dataset is – https://github.com/content-anu/dataset-simple-linear

### 1. Importing the dataset

We will begin with importing the dataset using pandas and also import other libraries such as numpy and matplotlib.

import numpy as np import pandas as pd import matplotlib.pyplot as plt dataset = pd.read_csv('Salary_Data.csv') dataset.head()

The `dataset.head()`

shows the first few columns of our dataset. The output of the above snippet is as follows:

### 2. Data Preprocessing

Now that we have imported the dataset, we will perform data preprocessing.

X = dataset.iloc[:,:-1].values #independent variable array y = dataset.iloc[:,1].values #dependent variable vector

The `X`

is independent variable array and `y`

is the dependent variable vector. Note the difference between the array and vector. The dependent variable must be in vector and independent variable must be an array itself.

### 3. Splitting the dataset

We need to split our dataset into the test and train set. Generally, we follow the 20-80 policy or the 30-70 policy respectively.

**Why is it necessary to perform splitting? **This is because we wish to train our model according to the years and salary. We then test our model on the test set.

We check whether the predictions made by the model on the test set data matches what was given in the dataset.

If it matches, it implies that our model is accurate and is making the right predictions.

from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=1/3,random_state=0)

We don’t need to apply feature scaling for linear regression as libraries take care of it.

### 4. Fitting linear regression model into the training set

From sklearn’s linear model library, import linear regression class. Create an object for a linear regression class called regressor.

To fit the regressor into the training set, we will call the fit method – function to fit the regressor into the training set.

We need to fit X_train (training data of matrix of features) into the target values y_train. Thus the model learns the correlation and learns how to predict the dependent variables based on the independent variable.

from sklearn.linear_model import LinearRegression regressor = LinearRegression() regressor.fit(X_train,y_train) #actually produces the linear eqn for the data

### 5. Predicting the test set results

We create a vector containing all the predictions of the test set salaries. The predicted salaries are then put into the vector called `y_pred`

.(contains prediction for all observations in the test set)

`predict`

method makes the predictions for the test set. Hence, the input is the test set. The parameter for predict must be an array or sparse matrix, hence input is X_test.

y_pred = regressor.predict(X_test) y_pred

y_test

`y_test`

is the real salary of the test set.`y_pred`

are the predicted salaries.

### Visualizing the results

Let’s see what the results of our code will look like when we visualize it.

#### 1. Plotting the points (observations)

To visualize the data, we plot graphs using matplotlib. To plot real observation points ie plotting the real given values.

The X-axis will have years of experience and the Y-axis will have the predicted salaries.

`plt.scatter`

plots a scatter plot of the data. Parameters include :

- X – coordinate (X_train: number of years)
- Y – coordinate (y_train: real salaries of the employees)
- Color ( Regression line in red and observation line in blue)

#### 2. Plotting the regression line

plt.plot have the following parameters :

- X coordinates (X_train) – number of years
- Y coordinates (predict on X_train) – prediction of X-train (based on a number of years).

Note : The y-coordinate is not y_pred because y_pred is predicted salaries of the test set observations.

#plot for the TRAIN plt.scatter(X_train, y_train, color='red') # plotting the observation line plt.plot(X_train, regressor.predict(X_train), color='blue') # plotting the regression line plt.title("Salary vs Experience (Training set)") # stating the title of the graph plt.xlabel("Years of experience") # adding the name of x-axis plt.ylabel("Salaries") # adding the name of y-axis plt.show() # specifies end of graph

**The above code generates a plot for the train set shown below:**

#plot for the TEST plt.scatter(X_test, y_test, color='red') plt.plot(X_train, regressor.predict(X_train), color='blue') # plotting the regression line plt.title("Salary vs Experience (Testing set)") plt.xlabel("Years of experience") plt.ylabel("Salaries") plt.show()

**The above code snippet generates a plot as shown below:**

## Complete Python Code for Implementing Linear Regression

# importing the dataset import numpy as np import pandas as pd import matplotlib.pyplot as plt dataset = pd.read_csv('Salary_Data.csv') dataset.head() # data preprocessing X = dataset.iloc[:, :-1].values #independent variable array y = dataset.iloc[:,1].values #dependent variable vector # splitting the dataset from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=1/3,random_state=0) # fitting the regression model from sklearn.linear_model import LinearRegression regressor = LinearRegression() regressor.fit(X_train,y_train) #actually produces the linear eqn for the data # predicting the test set results y_pred = regressor.predict(X_test) y_pred y_test # visualizing the results #plot for the TRAIN plt.scatter(X_train, y_train, color='red') # plotting the observation line plt.plot(X_train, regressor.predict(X_train), color='blue') # plotting the regression line plt.title("Salary vs Experience (Training set)") # stating the title of the graph plt.xlabel("Years of experience") # adding the name of x-axis plt.ylabel("Salaries") # adding the name of y-axis plt.show() # specifies end of graph #plot for the TEST plt.scatter(X_test, y_test, color='red') plt.plot(X_train, regressor.predict(X_train), color='blue') # plotting the regression line plt.title("Salary vs Experience (Testing set)") plt.xlabel("Years of experience") plt.ylabel("Salaries") plt.show()

The output of the above code snippet is as shown below:

## Conclusion

We have come to the end of this article on Simple Linear Regression. Hope you liked our example and have tried coding the model as well. Do let us know your feedback in the comment section below.

If you’re interested in more regression models, do read through multiple linear regression model.