Chi-square test in Python - All you need to know!!

Hello, readers! In this article, we will be focusing on Chi-square Test in Python. So, let us get started!!

Understanding Statistical Tests for Data Science and Machine Learning

Statistical tests play an important role in Data Science and Machine Learning. They let us draw conclusions about the data, for example about its statistical distribution or about the relationships between its variables.

Different statistical tests apply depending on the type of the variables, i.e. continuous or categorical. For continuous data values, the following are the most used tests:

T-test
Correlation and regression tests

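On the other hand, when categorical variables are involved, the following statistical tests are popular:

ANOVA test (a categorical grouping variable against a continuous outcome)
Chi-square test (one categorical variable against another)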

Today, let us have a look at the Chi-square test in Python.

What is a Chi-square Test?

The Chi-square test is a non-parametric statistical test that enables us to understand the relationship between the categorical variables of the dataset. That is, it checks whether two categorical variables are associated with each other.

Using the Chi-square test, we can assess whether the observed association between the categorical variables is stronger than what chance alone would produce. This helps us check whether the distribution of one categorical variable depends on the category of the other.

Let us now understand the Chi-square test in terms of hypotheses.

Hypothesis setup for Chi-square test

The null hypothesis can be framed as follows: the categorical variables are independent, i.e. there is no association between them.
The alternate hypothesis can be framed as follows: the categorical variables are associated with each other.
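
To make the null hypothesis concrete, here is a minimal sketch of how the test statistic can be computed by hand with NumPy. It is an illustration rather than a library API: the variable names are assumptions, and the counts are the same 2x3 table that the scipy example in the next section uses.

```python
import numpy as np

# Observed counts: 2 rows x 3 columns.
observed = np.array([[100, 200, 300],
                     [50, 60, 70]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
grand_total = observed.sum()

# Counts we would expect if the row and column variables were independent.
expected = row_totals * col_totals / grand_total

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for a table with r rows and c columns: (r - 1) * (c - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(round(float(chi2_stat), 4), dof)
```

For this table the statistic comes out at roughly 12.49 with 2 degrees of freedom. The larger the statistic, the further the observed counts are from what independence would predict.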

Using scipy.stats library to implement Chi-square test

In this example, we create a contingency table named ‘info’ as a nested list. We then use the scipy.stats library, which provides the chi2_contingency() function to run the Chi-square test of independence.

Example:

from scipy.stats import chi2_contingency

# Observed counts: 2 rows x 3 columns
info = [[100, 200, 300], [50, 60, 70]]
print(info)
# chi2_contingency() returns four values: statistic, p-value,
# degrees of freedom and the table of expected frequencies
stat, p, dof, expected = chi2_contingency(info)

print(dof)

significance_level = 0.05
print("p value: " + str(p))
if p <= significance_level:
    print('Reject NULL HYPOTHESIS')
else:
    print('Fail to reject NULL HYPOTHESIS')

As an output, we get four values from the test: the statistic (which can be compared to a critical value to decide on the hypothesis), the p-value, the degrees of freedom (for a table with r rows and c columns, (r - 1) * (c - 1)) and the table of expected frequencies under the null hypothesis.

We make use of the p-value to interpret the Chi-square test.

Output:

[[100, 200, 300], [50, 60, 70]]
2
p value: 0.001937714203415323
Reject NULL HYPOTHESIS

If the p-value is less than or equal to the chosen significance level (0.05), we reject the NULL hypothesis that there is no association between the variables, in favor of the alternate hypothesis.

Thus, in this case, we reject the NULL hypothesis and conclude that the row and column variables of the table are associated.
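
The same decision can also be reached by comparing the statistic with a critical value instead of using the p-value. The following is a minimal sketch of that approach, assuming the same ‘info’ table and significance level as above; chi2.ppf() from scipy.stats gives the critical value for the chosen level and degrees of freedom.

```python
from scipy.stats import chi2, chi2_contingency

info = [[100, 200, 300], [50, 60, 70]]
stat, p, dof, expected = chi2_contingency(info)

significance_level = 0.05
# Critical value: the threshold the statistic must reach for the null
# hypothesis to be rejected at the chosen significance level.
critical_value = chi2.ppf(1 - significance_level, dof)

print("statistic:", round(float(stat), 4))
print("critical value:", round(float(critical_value), 4))
if stat >= critical_value:
    print('Reject NULL HYPOTHESIS')
else:
    print('Fail to reject NULL HYPOTHESIS')
```

Both approaches always agree: the statistic reaches the critical value exactly when the p-value is at or below the significance level.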

Using Chi-square test on a dataset

In this example, we will use the Bike rental count dataset, loaded from the file day.csv.

Now, we will use the Chi-square test to check whether two of the categorical variables in the dataset are related.

First, we load the dataset into the environment and then print the names of the categorical variables as shown:

import os
import pandas
#Changing the current working directory
os.chdir("D:/Ediwsor_Project - Bike_Rental_Count")
BIKE = pandas.read_csv("day.csv")
categorical_col = ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday',
       'weathersit']
print(categorical_col)

['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']

Further, we use the crosstab() function to create a contingency table of the two variables we want to test, ‘holiday’ and ‘weathersit’. With margins=True, the table also shows the row and column totals under ‘All’.

chisqt = pandas.crosstab(BIKE.holiday, BIKE.weathersit, margins=True)
print(chisqt)

weathersit    1    2   3  All
holiday                      
0           438  238  20  696
1            15    6   0   21
All         453  244  20  717
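
If day.csv is not at hand, the behavior of crosstab() can be tried on a small table. The sketch below uses hypothetical toy data (not taken from the bike rental dataset; only the column names mirror it) to show the difference between a table with and without the ‘All’ margins.

```python
import pandas

# Hypothetical toy data, NOT taken from day.csv.
toy = pandas.DataFrame({
    "holiday":    [0, 0, 0, 1, 1, 0, 0, 1],
    "weathersit": [1, 2, 1, 1, 2, 3, 1, 1],
})

# With margins=True an extra "All" row and column hold the totals.
with_totals = pandas.crosstab(toy.holiday, toy.weathersit, margins=True)

# Without margins the table contains only the observed counts,
# which is the form a Chi-square test works on.
counts_only = pandas.crosstab(toy.holiday, toy.weathersit)

print(with_totals)
print(counts_only)
```

The totals are handy for reading the table, but they are sums of the other cells rather than observations of their own.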

Finally, we apply the chi2_contingency() function to the table and get the statistic, p-value and degrees of freedom. The table passed to the function must contain only the observed counts, so we build it without the ‘All’ margins; including the totals would add a spurious degree of freedom.

from scipy.stats import chi2_contingency
# Contingency table of observed counts only (no margins)
table = pandas.crosstab(BIKE.holiday, BIKE.weathersit)
# Returns the statistic, p-value, degrees of freedom and expected frequencies
stat, p, dof, expected = chi2_contingency(table)
print(f"statistic={stat:.4f}, p-value={p:.4f}, dof={dof}")

Output:

statistic=1.0259, p-value=0.5987, dof=2

From above, 1.0259 is the test statistic, 0.5987 is the p-value and 2 is the degrees of freedom, i.e. (2 - 1) * (3 - 1) for a table with 2 rows and 3 columns. As the p-value is greater than 0.05, we fail to reject the NULL hypothesis: the data give no evidence of an association between ‘holiday’ and ‘weathersit’.
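
The result can be checked without the CSV file by typing in the counts from the crosstab printed earlier. The sketch below does that and also compares the outcome with and without the ‘All’ totals column. The variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the printed crosstab (rows: holiday, columns: weathersit).
counts = np.array([[438, 238, 20],
                   [15, 6, 0]])

stat, p, dof, expected = chi2_contingency(counts)
print(f"counts only: statistic={stat:.4f}, p-value={p:.4f}, dof={dof}")

# Append the row totals as a fourth column, mimicking a table that
# still carries the 'All' margin.
with_margin = np.column_stack([counts, counts.sum(axis=1)])
stat_m, p_margin, dof_margin, _ = chi2_contingency(with_margin)
print(f"with 'All' column: statistic={stat_m:.4f}, p-value={p_margin:.4f}, dof={dof_margin}")

# Very small expected counts make the Chi-square approximation less reliable.
print(f"smallest expected count: {expected.min():.2f}")
```

The totals column leaves the statistic unchanged but raises the degrees of freedom from 2 to 3, which inflates the p-value. The conclusion is the same here either way, since both p-values are well above 0.05. One expected count is below 1, so the result for this table should be read with some caution.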