Hey, readers! In this article, we will focus on what the Correlation Matrix is and how it works in Python. So, let us get started now!
What is Correlation Regression Analysis?
In the domain of Data Science and Machine Learning, we often come across situations where we need to analyze the variables of a data set and perform feature selection. This is where Correlation Regression Analysis comes into the picture.
Correlation Regression Analysis enables programmers to analyze the relationship between the continuous independent variables and the continuous dependent variable.
That is, it evaluates how strongly the independent variables of the data set are related to one another, as well as how they relate to the response (dependent) variable.
Correlation Regression Analysis makes use of the Correlation matrix to represent the relationship between the variables of the data set.
The correlation matrix is a matrix structure that helps the programmer analyze the relationship between the data variables. Each entry is a correlation value in the range of -1 to 1.
A positive value means the two variables tend to increase together, a negative value means one tends to decrease as the other increases, and a value close to zero (0) means there is no linear dependency between that pair of variables. The closer the value is to -1 or 1, the stronger the relationship.
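To make the value range concrete, here is a minimal sketch that uses NumPy's np.corrcoef on small made-up arrays. The data and the helper name pearson are illustrative assumptions and are not part of the bike data set used later in this article.

```python
import numpy as np

# Illustrative, made-up data (not taken from the article's data set)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2 * x + 1                                # rises as x rises
y_neg = -3 * x + 10                              # falls as x rises
y_none = np.array([1.0, -1.0, 0.0, -1.0, 1.0])   # no linear trend in x

def pearson(a, b):
    """Return the Pearson correlation coefficient of two 1-D arrays."""
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r(a, b)
    return float(np.corrcoef(a, b)[0, 1])

r_pos = pearson(x, y_pos)
r_neg = pearson(x, y_neg)
r_none = pearson(x, y_none)

print("positive relationship:", round(r_pos, 3))
print("negative relationship:", round(r_neg, 3))
print("no linear relationship:", round(r_none, 3))
```

The first value sits at the top of the range, the second at the bottom, and the third is approximately zero.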
One can draw the following observations from the Regression Analysis and Correlation Matrix:
- It helps us understand the dependence between the independent variables of the data set.
- It helps us choose the important and non-redundant variables of the data set.
- Applicable only to numeric/continuous variables.
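Since the analysis applies only to numeric variables, it can help to filter the columns before computing correlations. The sketch below is one possible way to do that with pandas' select_dtypes. The data frame and its column values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical mixed-type data; the values are made up for illustration
df = pd.DataFrame({
    "temp": [0.2, 0.4, 0.6, 0.8],
    "hum": [0.9, 0.7, 0.5, 0.2],
    "season": ["winter", "spring", "summer", "autumn"],  # non-numeric column
})

# Keep only the numeric columns, then compute the correlation matrix
numeric_df = df.select_dtypes(include="number")
corr_matrix = numeric_df.corr()

print(corr_matrix)
```

Here the text column season is left out, so the resulting matrix covers only temp and hum.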
Let us now focus on the implementation of a Correlation Matrix in Python.
Creating a Correlation Matrix in Python
Let us first begin by exploring the data set being used in this example. The data set contains 4 independent continuous variables: temp, atemp, hum, and windspeed.
Here, cnt is the response variable.
Now, we create a correlation matrix for the numeric columns using the corr() function, as shown below:
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

# Loading the dataset
BIKE = pd.read_csv("day.csv")

# Numeric columns of the dataset
numeric_col = ['temp', 'atemp', 'hum', 'windspeed']

# Correlation matrix formation
corr_matrix = BIKE.loc[:, numeric_col].corr()
print(corr_matrix)

# Using a heatmap to visualize the correlation matrix
sn.heatmap(corr_matrix, annot=True)
plt.show()  # display the figure when run as a script
Further, we have used a Seaborn heatmap to visualize the matrix, with annot=True printing each correlation value inside its cell.
So, from the above matrix, the following observations can be drawn:
- The variables ‘temp’ and ‘atemp’ are highly correlated, with a correlation value of 0.99.
- Thus, we can drop either one of the two variables, since they carry nearly the same information.
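The idea of dropping one variable from a highly correlated pair can be automated. The following is a minimal sketch, not a definitive method: the helper name drop_highly_correlated, the 0.9 threshold, and the synthetic stand-in data are all assumptions made for illustration, and the real day.csv values are not used.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop one column from each pair whose absolute correlation exceeds threshold.

    Hypothetical helper for illustration; the threshold is an assumed value.
    """
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it is highly correlated with an earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Synthetic stand-in data (assumption): atemp is almost a copy of temp
rng = np.random.default_rng(0)
temp = rng.uniform(0, 1, 200)
demo = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 0.01, 200),
    "hum": rng.uniform(0, 1, 200),
    "windspeed": rng.uniform(0, 1, 200),
})

reduced, dropped = drop_highly_correlated(demo, threshold=0.9)
print("Dropped columns:", dropped)
print("Remaining columns:", list(reduced.columns))
```

Because only the upper triangle is inspected, the first column of each correlated pair is kept and the later one is dropped. Which of the two to keep is a judgment call that depends on the data and the model.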
With this, we have come to the end of this topic. Feel free to comment below if you have any questions.
Till then, Happy Learning!!