Regression Error Metrics – A Simple Guide


Machine Learning is one of the fastest-growing technologies in this tech-driven era, with a host of real-world applications. Treated as a sub-branch of Artificial Intelligence (AI), Machine Learning is used to mimic human thinking in machines. How can machines think like humans? Well, more or less, we humans are responsible for making machines think like us by developing various algorithms and models.

Machine Learning enthusiasts and students strive to have the skills to become good machine learning engineers or scientists. Luckily, the Internet has many resources and free courses to help such individuals master machine learning.

With that being said, do visit our take on Machine Learning for Beginners here

The way a machine learns is divided into two techniques: Supervised learning, where we help the model learn by providing labeled data (labels and their descriptions), and Unsupervised learning, where the model learns by itself with the help of unlabeled data.

Supervised learning problems are further classified into two categories: Regression and Classification.

Regression is used to work with continuous target data. Contrary to that, Classification is used to work with categorical data.

If you are even a tad bit accustomed to working on Machine Learning models, you understand how important their performance metrics or error metrics are. The performance metrics (error metrics in the case of regression) are used to help us understand how well the model is doing (both learning and performing on test data).

We can also understand if the model is overfitting or underfitting with the help of these error metrics. And just like that, Regression and Classification also have their own set of performance or error metrics.

Also Read: Regression vs. Classification

In this tutorial, our goal is to understand the error metrics of a simple regression model.

What Is Regression?

Regression in machine learning is used to best describe the relationship between a single dependent variable and one or more independent variables. The idea is that the target (dependent) variable varies or changes as the independent variables change.

You moved to a new city recently and want to rent a studio room. The rents of the studios on your list depend on various factors like the area, the number of bathrooms, whether it is a 3 BHK or a 2 BHK, and whether a parking slot is allocated to each apartment separately. The more of these features an apartment offers, the higher the rent tends to be. Can you relate this example to the definition of regression?

Based on all the parameters discussed, the price of the apartment varies. Here, attributes like the area, the number of bathrooms, and the number of rooms are the independent variables, and the price of the apartment is the target or dependent variable.

There are various types of regression models used.

Types of Regression

Now, let us take a look at the error metrics of the regression model.

Error Metrics of Regression

As discussed above, error metrics are used to check how well our model is doing. It is important to note that the error or performance metrics for classification and regression are entirely different, and we CANNOT use the classification metrics for regression and vice versa.

As their names suggest, these metrics are based on something called the error. When the model is trained, it produces a certain output (prediction) for each record, let us call it ypred. The actual result, however, might be something else (yactual).

Precisely, the error is the difference between the actual output and the predicted output. For example, if the actual value is 10 and the model predicts 8, the error is 2.

Error = yactual – ypred

The four main error metrics of Regression are:

  • Mean Squared Error(MSE)
  • Mean Absolute Error(MAE)
  • Root Mean Squared Error(RMSE)
  • Mean Absolute Percentage Error(MAPE)

Let us understand each of these metrics one by one.

Mean Squared Error(MSE)

The mean squared error (MSE) calculates the average squared difference between the actual and predicted values. If there are n data points or records, the MSE sums the squared errors and divides the result by n.

MSE = (1/n) * Σ (yactual – ypred)²
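
To make the formula concrete, here is a minimal NumPy sketch (the values of y_actual and y_pred are made up purely for illustration) that computes the MSE by hand:

import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_actual = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Mean of the squared differences: (0.25 + 0 + 2.25 + 1.0) / 4 = 0.875
mse = np.mean((y_actual - y_pred) ** 2)
print(mse)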

Mean Absolute Error(MAE)

The mean absolute error calculates the absolute difference between the predicted and actual values for each record, sums these differences, and divides the total by the number of data points present.

MAE = (1/n) * Σ |yactual – ypred|

Root Mean Squared Error(RMSE)

The RMSE can be considered an extension of the MSE. It is just the square root of the result of MSE.

RMSE = √MSE = √[ (1/n) * Σ (yactual – ypred)² ]

Mean Absolute Percentage Error(MAPE)

MAPE is similar to MAE, except that each absolute error is additionally divided by the actual output value, and the average is expressed as a percentage.

MAPE = (100/n) * Σ |(yactual – ypred) / yactual|
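
Using the same made-up numbers as in the MSE sketch above, the remaining three metrics can also be checked by hand with NumPy:

import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_actual = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(y_actual - y_pred))                      # (0.5 + 0 + 1.5 + 1.0) / 4 = 0.75
rmse = np.sqrt(np.mean((y_actual - y_pred) ** 2))             # sqrt(0.875) ≈ 0.935
mape = np.mean(np.abs((y_actual - y_pred) / y_actual)) * 100  # ≈ 22.74
print(mae, rmse, mape)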

Regression Error Metrics: An Example

Let us now take an example model to calculate and visualize the error metrics. We are going to use the California Housing dataset available in the Scikit Learn library.

First, we import the necessary libraries and the dataset.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
import matplotlib.pyplot as plt

The Scikit-Learn library is used to import the housing data, the data splitting module, the model we are going to use (linear regression), and the error metrics.

The NumPy library is used to perform calculations, and the well-known Matplotlib library is used for visualization.

The imported dataset has to be loaded and the target and independent variables have to be specified.

data = fetch_california_housing()
X = data.data
y = data.target

The variable X stores the independent variables, while y holds the target variable. Before splitting the data, it can help to take a quick look at what we have loaded, as shown in the sketch below.
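
This optional sanity check simply prints the feature names and the shapes of X and y; nothing here is required for the model itself:

print(data.feature_names)   # names of the independent variables
print(X.shape, y.shape)     # number of records and number of features

Next, we split the entire dataset into training and testing parts.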

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

The dataset is split in the ratio of 70:30 for training and testing, respectively. In the second line, an instance of LinearRegression called model is created and then fitted on the training data. The model is then used to predict the targets for the test set.

The next step is to calculate and print the error metrics.

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAPE:{mape:.2f}")

We can directly use the mean absolute error and mean squared error functions of the sklearn library; the other metrics can be derived from these two as per the formulas discussed above.

All the metrics are printed in the last four lines.
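
As a side note, recent versions of scikit-learn (this depends on the version you have installed) also provide ready-made helpers for two of the derived metrics, so they do not have to be computed manually:

from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

# RMSE can be obtained directly by passing squared=False
# (very recent releases also expose a root_mean_squared_error function)
rmse_direct = mean_squared_error(y_test, y_pred, squared=False)

# MAPE is returned as a fraction, so multiply by 100 to match the percentage above
mape_direct = mean_absolute_percentage_error(y_test, y_pred) * 100
print(rmse_direct, mape_direct)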


We can also visualize all the metrics in a bar plot together with the help of Matplotlib.

metrics = ['MAE', 'MSE', 'RMSE', 'MAPE']
values = [mae, mse, rmse, mape]
plt.bar(metrics, values, color=['blue', 'green', 'orange', 'red'])
plt.xlabel('Error Metrics')
plt.ylabel('Values')
plt.title('Regression Error Metrics')
plt.show()

We have used the bar function of the library to display the values of the error metrics calculated above. The plt.show method is used to display the graph on the screen.

Regression Error Metrics (bar plot)

What do you observe from the visualization? We can derive the following insights:

  • The model is pretty good in terms of MAE, MSE, and RMSE (low error values)
  • The Mean Absolute Percentage Error is slightly high, which means our model's percentage (relative) error is a little high

Conclusion

To conclude, we have seen the definition of Regression, understood it with the help of an example, and also looked at the different types of regression models used.

Next, we have discussed the different error metrics followed for regression models and understood their formulae. We used these formulae to build a regression model, calculate the metrics, and visualize the metric values.

References

Find more about regression metrics here