Hello, readers! In this article, we will be focusing on the Chi-square test in Python. So, let us get started!
Understanding Statistical Tests for Data Science and Machine Learning
Statistical tests play an important role in the domain of Data Science and Machine Learning. With statistical tests, one can draw conclusions about the data, such as its statistical distribution, with a stated level of confidence.
Different tests apply depending on the type of the variables, i.e. continuous or categorical. When categorical variables are involved, the popular statistical tests are:
- ANOVA test
- Chi-square test
Today, let us have a look at the Chi-square test in Python.
What is a Chi-square Test?
The Chi-square test is a non-parametric statistical test that helps us understand the relationship between the categorical variables of a dataset. That is, it tests whether two categorical variables are associated with each other.
Using the Chi-square test, we can check whether the data provide evidence of an association between the categorical variables, i.e. whether the distribution of one variable depends on the category of the other. Note that the test indicates whether an association exists, not how strong it is.
Let us now understand the Chi-square test in terms of hypotheses.
Hypothesis setup for Chi-square test
- The null hypothesis can be framed as follows: the grouping variables are independent, i.e. there is no association between them.
- The alternate hypothesis can be framed as follows: the variables are associated with each other, i.e. they are not independent.
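To make the null hypothesis concrete, here is a minimal sketch of what "no association" implies for a table of counts. The 2x2 table below is hypothetical and only serves as an illustration: under independence, each expected count is the row total times the column total divided by the grand total, and the Chi-square statistic sums the squared deviations of the observed counts from these expected counts.

```python
import numpy as np

# Hypothetical 2x2 table of observed counts (illustration only)
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
grand_total = observed.sum()

# Expected counts if the row and column variables were independent
expected = row_totals * col_totals / grand_total

# Pearson Chi-square statistic (no continuity correction)
chi2_stat = ((observed - expected) ** 2 / expected).sum()

print(expected)   # [[20. 20.] [30. 30.]]
print(chi2_stat)  # about 16.67
```

The larger the statistic, the further the observed counts are from what independence would predict.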
Using the scipy.stats library to implement the Chi-square test
In this example, we have created a table named ‘info’, as shown below. Further, we have made use of the scipy.stats library, which provides us with the chi2_contingency() function to implement the Chi-square test.
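    from scipy.stats import chi2_contingency

    # Observed counts arranged as a 2x3 contingency table
    info = [[100, 200, 300], [50, 60, 70]]
    print(info)

    # chi2_contingency() returns four values: the test statistic, the p-value,
    # the degrees of freedom and the table of expected frequencies
    stat, p, dof, expected = chi2_contingency(info)
    print(dof)

    significance_level = 0.05
    print("p value: " + str(p))
    if p <= significance_level:
        print('Reject NULL HYPOTHESIS')
    else:
        print('Fail to reject NULL HYPOTHESIS')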
As an output, the function returns the test statistic (which can be compared against a critical value to decide upon the hypothesis), the p-value, the degrees of freedom (the number of cell counts that are free to vary once the row and column totals are fixed) and the table of expected frequencies.
We make use of p-value to interpret the Chi-square test.
    [[100, 200, 300], [50, 60, 70]]
    2
    p value: 0.001937714203415323
    Reject NULL HYPOTHESIS

If the p-value is less than or equal to the chosen significance level (0.05), we reject the null hypothesis that there is no association between the variables, in favor of the alternate hypothesis.
Thus, in this case, we reject the null hypothesis and conclude that the rows and columns of the passed table are associated.
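The same decision can also be reached by comparing the test statistic with a critical value instead of reading the p-value. The sketch below assumes the same ‘info’ table and the 0.05 significance level used above; the variable names `critical_value`, `reject_by_statistic` and `reject_by_p_value` are illustrative.

```python
from scipy.stats import chi2, chi2_contingency

info = [[100, 200, 300], [50, 60, 70]]
stat, p, dof, expected = chi2_contingency(info)

significance_level = 0.05

# Critical value of the Chi-square distribution for the given degrees of freedom
critical_value = chi2.ppf(1 - significance_level, dof)  # about 5.99 for dof = 2

# Both decision rules lead to the same conclusion
reject_by_statistic = stat > critical_value
reject_by_p_value = p <= significance_level

print(stat, critical_value)
print(reject_by_statistic, reject_by_p_value)
```

Because the p-value is derived from the statistic and the degrees of freedom, the two rules always agree; the p-value is simply more convenient to report.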
Using Chi-square test on a dataset
In this example, we will use the Bike rental count dataset. You can find the dataset here!
Now, we will apply the Chi-square test to analyze the relationship between the categorical variables of the dataset.
Initially, we load the dataset into the environment and then print the names of the categorical data variables as shown:
    import os
    import pandas

    # Changing the current working directory
    os.chdir("D:/Ediwsor_Project - Bike_Rental_Count")
    BIKE = pandas.read_csv("day.csv")

    categorical_col = ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']
    print(categorical_col)
['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']
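Once the categorical columns are known, the test can be run for every pair of them in a loop. The following is only a sketch: the small DataFrame `df` is a hypothetical stand-in for the bike rental data, so the resulting numbers carry no meaning beyond showing the mechanics.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical stand-in data (not the real bike rental dataset)
df = pd.DataFrame({
    "season":     [1, 1, 2, 2, 3, 3, 4, 4, 1, 2, 3, 4],
    "holiday":    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
    "weathersit": [1, 2, 1, 2, 1, 3, 2, 1, 1, 2, 3, 1],
})
categorical_col = ["season", "holiday", "weathersit"]

results = {}
for a, b in combinations(categorical_col, 2):
    # Contingency table of observed counts, without margins
    table = pd.crosstab(df[a], df[b])
    stat, p, dof, _ = chi2_contingency(table)
    results[(a, b)] = (stat, p, dof)

for pair, (stat, p, dof) in results.items():
    print(pair, "statistic:", stat, "p-value:", p, "dof:", dof)
```

With many pairs tested at once, some small p-values are expected by chance, so the results of such a loop are best treated as a screening step.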
Further, we use the crosstab() function to create a contingency table of the two selected variables, ‘holiday’ and ‘weathersit’.

    chisqt = pandas.crosstab(BIKE.holiday, BIKE.weathersit, margins=True)
    print(chisqt)

    weathersit    1    2   3  All
    holiday
    0           438  238  20  696
    1            15    6   0   21
    All         453  244  20  717

Finally, we apply the chi2_contingency() function on the table and get the test statistic, the p-value and the degrees of freedom.
    from scipy.stats import chi2_contingency
    import numpy as np

    chisqt = pandas.crosstab(BIKE.holiday, BIKE.weathersit, margins=True)

    # Keep the two 'holiday' rows (0 and 1) and drop the 'All' totals row;
    # the 'All' totals column is still part of these rows
    value = np.array(chisqt.iloc[0:2].values)
    print(chi2_contingency(value)[0:3])
(1.0258904805937215, 0.794987564022437, 3)
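From above, 1.02 is the test statistic, 0.79 is the p-value and 3 is the degrees of freedom. As the p-value is greater than 0.05, we fail to reject the null hypothesis: the data show no evidence that the variables ‘holiday’ and ‘weathersit’ are associated. Note that the table passed to the test still contains the ‘All’ totals column, which is why 3 degrees of freedom are reported. For the 2 x 3 table without margins, the degrees of freedom are (2 - 1) x (3 - 1) = 2; the statistic is unchanged, the p-value becomes about 0.60, and the conclusion stays the same.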
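As a cross-check, the test can be run on the table without the ‘All’ margins. The sketch below rebuilds the 2 x 3 table from the counts printed earlier rather than loading day.csv; with the real data, calling pandas.crosstab(BIKE.holiday, BIKE.weathersit) with `margins` left at its default of False should give the same table.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts copied from the contingency table above, without the 'All' margins
table = pd.DataFrame(
    [[438, 238, 20],
     [15, 6, 0]],
    index=pd.Index([0, 1], name="holiday"),
    columns=pd.Index([1, 2, 3], name="weathersit"),
)

stat, p, dof, expected = chi2_contingency(table)
print(stat, p, dof)

significance_level = 0.05
if p <= significance_level:
    print("Reject NULL HYPOTHESIS")
else:
    print("Fail to reject NULL HYPOTHESIS")
```

Here the degrees of freedom come out as 2, and the p-value stays well above 0.05, so the conclusion about ‘holiday’ and ‘weathersit’ does not change.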
With this, we have come to the end of this topic. Feel free to comment below in case you have any questions.
Till then, Happy Analyzing!! 🙂