ANOVA test in Python

Hello readers! Today we will be focusing on an important statistical test in Data science — ANOVA test in Python programming, in detail.

So, let us get started!!

Emergence of ANOVA test

In the domain of data science and machine learning, the data needs to be understood and processed prior to modelling. That is, we need to analyze every variable of the dataset and its credibility in terms of its contribution to the target value.

Usually there are two kinds of variables–

Continuous variables
Categorical variables

Below are the mostly used statistical tests to analyze the numeric variables:

T-test
Correlation regression analysis, etc.

ANOVA test is a categorical statistical tests i.e. it works on the categorical variables to analyze them.

What is ANOVA test all about?

ANOVA test is a statistical test to analyze and work with the understanding of the categorical data variables. It estimates the extent to which a dependent variable is affected by one or more independent categorical data elements.

With ANOVA test, we estimate and analyze the difference in the statistical mean of every group of the independent categorical variable.

Hypothesis for ANOVA testing

As well all know, the Hypothesis claims are represented using two categories: Null Hypothesis and Alternate Hypothesis, respectively.

In the case of the ANOVA test, our Null hypothesis would claim the following: “The statistical mean of all the groups/categories of the variables is the same.”
On the other hand, the Alternate Hypothesis would claim as follows: “The statistical mean of all the groups/categories of the variables is not the same.”

Having said this, let us now focus on the Assumptions or considerations for ANOVA testing.

Assumptions of ANOVA testing

The data elements of the columns follow a normal distribution.
The variables share a common variance.

ANOVA test in Python – Simple Practical Approach!

In this example, we will be making use of the Bike Rental Count Prediction dataset wherein we are required to predict the number of customers who would opt for a rented bike based on different conditions provided.

You can find the dataset here!

So, initially, we load the dataset into the Python environment using read_csv() function. Further, we change the data type of the variables upon (EDA) to a defined data type. We also use the os module and the Pandas library to work with system variables and parse CSV data respectively

import os
import pandas 
#Changing the current working directory
os.chdir("D:/Ediwsor_Project - Bike_Rental_Count")
BIKE = pandas.read_csv("day.csv")
BIKE['holiday']=BIKE['holiday'].astype(str)
BIKE['weekday']=BIKE['weekday'].astype(str)
BIKE['workingday']=BIKE['workingday'].astype(str)
BIKE['weathersit']=BIKE['weathersit'].astype(str)
BIKE['dteday']=pandas.to_datetime(BIKE['dteday'])
BIKE['season']=BIKE['season'].astype(str)
BIKE['yr']=BIKE['yr'].astype(str)
BIKE['mnth']=BIKE['mnth'].astype(str)
print(BIKE.dtypes)

Output:

instant                int64
dteday        datetime64[ns]
season                object
yr                    object
mnth                  object
holiday               object
weekday               object
workingday            object
weathersit            object
temp                 float64
atemp                float64
hum                  float64
windspeed            float64
casual                 int64
registered             int64
cnt                    int64
dtype: object

Now, is the time to apply ANOVA test. Python provides us with anova_lm() function from the statsmodels library to implement the same.

Initially, we perform Ordinary Least Square test on the data, further to which the ANOVA test is applied on the above resultant.

import statsmodels.api as sm
from statsmodels.formula.api import ols

for x in categorical_col:
    model = ols('cnt' + '~' + x, data = BIKE).fit() #Oridnary least square method
    result_anova = sm.stats.anova_lm(model) # ANOVA Test
    print(result_anova)

Output:

             df        sum_sq       mean_sq           F        PR(>F)
season      3.0  9.218466e+08  3.072822e+08  124.840203  5.433284e-65
Residual  713.0  1.754981e+09  2.461404e+06         NaN           NaN
             df        sum_sq       mean_sq           F        PR(>F)
yr          1.0  8.813271e+08  8.813271e+08  350.959951  5.148657e-64
Residual  715.0  1.795501e+09  2.511190e+06         NaN           NaN
             df        sum_sq       mean_sq          F        PR(>F)
mnth       11.0  1.042307e+09  9.475520e+07  40.869727  2.557743e-68
Residual  705.0  1.634521e+09  2.318469e+06        NaN           NaN
             df        sum_sq       mean_sq        F    PR(>F)
holiday     1.0  1.377098e+07  1.377098e+07  3.69735  0.054896
Residual  715.0  2.663057e+09  3.724555e+06      NaN       NaN
             df        sum_sq       mean_sq         F    PR(>F)
weekday     6.0  1.757122e+07  2.928537e+06  0.781896  0.584261
Residual  710.0  2.659257e+09  3.745432e+06       NaN       NaN
               df        sum_sq       mean_sq         F    PR(>F)
workingday    1.0  8.494340e+06  8.494340e+06  2.276122  0.131822
Residual    715.0  2.668333e+09  3.731935e+06       NaN       NaN
               df        sum_sq       mean_sq          F        PR(>F)
weathersit    2.0  2.679982e+08  1.339991e+08  39.718604  4.408358e-17
Residual    714.0  2.408830e+09  3.373711e+06        NaN           NaN

Considering significance value as 0.05. we say that if the p value is less than 0.05, we assume and claim that there is considerable differences in the mean of the groups formed by each level of the categorical data. That is, we reject the NULL hypothesis.