Hello, readers! Today we will look in detail at an important statistical test in data science: the ANOVA test, and how to run it in Python.
So, let us get started!
Emergence of ANOVA test
In the domain of data science and machine learning, the data needs to be understood and processed prior to modelling. That is, we need to analyze every variable of the dataset and its credibility in terms of its contribution to the target value.
Usually, there are two kinds of variables:
- Continuous variables
- Categorical variables
Below are the most commonly used statistical tests for analyzing numeric variables:
- T-test
- Correlation and regression analysis, etc.
The ANOVA test, in contrast, is a statistical test for categorical variables, i.e. it analyzes how a categorical variable relates to a numeric one.
What is ANOVA test all about?
ANOVA (Analysis of Variance) is a statistical test used to understand how categorical variables relate to a numeric outcome. It estimates the extent to which a dependent variable is affected by one or more independent categorical variables.
With the ANOVA test, we estimate and analyze the difference between the statistical means of the groups formed by each level of the independent categorical variable.
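To make the comparison of group means concrete, here is a minimal sketch that computes the one-way ANOVA F statistic by hand using only the standard library. The group labels and values are made up for illustration, and one_way_f is a hypothetical helper name, not part of any library.

```python
from statistics import mean

# Hypothetical toy data: a numeric response recorded for three groups
groups = {
    "A": [4.0, 5.0, 6.0],
    "B": [6.0, 7.0, 8.0],
    "C": [8.0, 9.0, 10.0],
}

def one_way_f(groups):
    """Return the one-way ANOVA F statistic and its two degrees of freedom."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)

    # Between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Within-group sum of squares: spread of the values around their own group mean
    ss_within = sum(
        (v - mean(values)) ** 2
        for values in groups.values()
        for v in values
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)

    # F is the ratio of the between-group and within-group mean squares
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

f_stat, df_between, df_within = one_way_f(groups)
print(f_stat, df_between, df_within)
```

A large F value means the group means differ by more than the spread inside the groups would suggest.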
Hypothesis for ANOVA testing
As we all know, hypothesis claims fall into two categories: the Null Hypothesis and the Alternate Hypothesis.
- In the case of the ANOVA test, our Null Hypothesis claims the following: “The statistical mean of all the groups/categories of the variable is the same.”
- On the other hand, the Alternate Hypothesis claims the following: “The statistical means of the groups/categories of the variable are not all the same, i.e. at least one group mean differs.”
Having said this, let us now focus on the Assumptions or considerations for ANOVA testing.
Assumptions of ANOVA testing
- The values of the dependent variable within each group follow an approximately normal distribution.
- The groups share a common variance.
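One common way to check these two assumptions is the Shapiro-Wilk test for normality and Levene's test for equal variances, both available in scipy.stats. The article does not prescribe these particular tests, so treat the following as one possible sketch. The three groups are synthetic data generated only for illustration.

```python
import random
from scipy import stats

rng = random.Random(0)  # fixed seed so the synthetic data is reproducible

# Hypothetical groups drawn from normal distributions with the same spread
groups = {
    "A": [rng.gauss(10, 2) for _ in range(30)],
    "B": [rng.gauss(12, 2) for _ in range(30)],
    "C": [rng.gauss(15, 2) for _ in range(30)],
}

# Shapiro-Wilk test per group: a small p-value suggests non-normal data
normality_p = {}
for name, values in groups.items():
    _, p_value = stats.shapiro(values)
    normality_p[name] = float(p_value)

# Levene's test across groups: a small p-value suggests unequal variances
_, levene_p = stats.levene(*groups.values())
equal_variance_p = float(levene_p)

print(normality_p)
print(equal_variance_p)
```

If either check fails clearly, the ANOVA p-values should be read with caution.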
ANOVA test in Python – Simple Practical Approach!
In this example, we will be making use of the Bike Rental Count Prediction dataset wherein we are required to predict the number of customers who would opt for a rented bike based on different conditions provided.
You can find the dataset here!
So, initially, we load the dataset into the Python environment using the read_csv() function. Then, based on exploratory data analysis (EDA), we convert the categorical variables to the str data type and parse the date column. We use the os module to set the working directory and the Pandas library to parse the CSV data.
import os
import pandas

# Change the current working directory to the folder that holds day.csv
os.chdir("D:/Ediwsor_Project - Bike_Rental_Count")
BIKE = pandas.read_csv("day.csv")

# Convert the coded columns to strings so they are treated as categorical,
# and parse the date column as a datetime
BIKE['holiday'] = BIKE['holiday'].astype(str)
BIKE['weekday'] = BIKE['weekday'].astype(str)
BIKE['workingday'] = BIKE['workingday'].astype(str)
BIKE['weathersit'] = BIKE['weathersit'].astype(str)
BIKE['dteday'] = pandas.to_datetime(BIKE['dteday'])
BIKE['season'] = BIKE['season'].astype(str)
BIKE['yr'] = BIKE['yr'].astype(str)
BIKE['mnth'] = BIKE['mnth'].astype(str)

# Confirm the resulting data types
print(BIKE.dtypes)
Output:
instant int64
dteday datetime64[ns]
season object
yr object
mnth object
holiday object
weekday object
workingday object
weathersit object
temp float64
atemp float64
hum float64
windspeed float64
casual int64
registered int64
cnt int64
dtype: object
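Why convert the coded columns to str? The statsmodels formula API, used in the next step, treats a numeric column as a single continuous predictor and a string column as a categorical factor with one level per distinct value. The sketch below illustrates the difference on a small made-up frame (the values in toy are hypothetical, not taken from the bike dataset). Wrapping a column in C() inside the formula is an alternative that leaves the dtype unchanged.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical toy data: a season code (1-4) and a made-up rental count
toy = pd.DataFrame({
    "season": [1, 1, 2, 2, 3, 3, 4, 4],
    "cnt": [10, 12, 20, 23, 31, 29, 15, 18],
})

# Numeric codes: season enters the model as one continuous predictor
df_numeric = ols("cnt ~ season", data=toy).fit().df_model

# String labels: season becomes a categorical factor with four levels
toy_str = toy.assign(season=toy["season"].astype(str))
df_string = ols("cnt ~ season", data=toy_str).fit().df_model

# C() marks the column as categorical without changing its dtype
df_wrapped = ols("cnt ~ C(season)", data=toy).fit().df_model

print(df_numeric, df_string, df_wrapped)
```

The numeric version uses one model degree of freedom, while the two categorical versions use three (number of levels minus one), which is what a group-wise ANOVA needs.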
Now it is time to apply the ANOVA test. Python provides the anova_lm() function from the statsmodels library for this.
First, we fit an Ordinary Least Squares (OLS) model of the target cnt on each categorical variable, and then apply the ANOVA test to the fitted model.
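import statsmodels.api as sm
from statsmodels.formula.api import ols

# Categorical predictors to test against the target 'cnt'
categorical_col = ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']

for x in categorical_col:
    model = ols('cnt' + '~' + x, data = BIKE).fit()  # Ordinary least squares fit
    result_anova = sm.stats.anova_lm(model)  # ANOVA test on the fitted model
    print(result_anova)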
Output:
               df        sum_sq       mean_sq           F        PR(>F)
season        3.0  9.218466e+08  3.072822e+08  124.840203  5.433284e-65
Residual    713.0  1.754981e+09  2.461404e+06         NaN           NaN
               df        sum_sq       mean_sq           F        PR(>F)
yr            1.0  8.813271e+08  8.813271e+08  350.959951  5.148657e-64
Residual    715.0  1.795501e+09  2.511190e+06         NaN           NaN
               df        sum_sq       mean_sq           F        PR(>F)
mnth         11.0  1.042307e+09  9.475520e+07   40.869727  2.557743e-68
Residual    705.0  1.634521e+09  2.318469e+06         NaN           NaN
               df        sum_sq       mean_sq           F        PR(>F)
holiday       1.0  1.377098e+07  1.377098e+07     3.69735      0.054896
Residual    715.0  2.663057e+09  3.724555e+06         NaN           NaN
               df        sum_sq       mean_sq           F        PR(>F)
weekday       6.0  1.757122e+07  2.928537e+06    0.781896      0.584261
Residual    710.0  2.659257e+09  3.745432e+06         NaN           NaN
               df        sum_sq       mean_sq           F        PR(>F)
workingday    1.0  8.494340e+06  8.494340e+06    2.276122      0.131822
Residual    715.0  2.668333e+09  3.731935e+06         NaN           NaN
               df        sum_sq       mean_sq           F        PR(>F)
weathersit    2.0  2.679982e+08  1.339991e+08   39.718604  4.408358e-17
Residual    714.0  2.408830e+09  3.373711e+06         NaN           NaN
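We take 0.05 as the significance level. If the p-value (the PR(>F) column) is less than 0.05, we conclude that the means of the groups formed by the levels of that categorical variable differ significantly. That is, we reject the null hypothesis. In the output above, season, yr, mnth and weathersit have p-values below 0.05, so we reject the null hypothesis for them. holiday (0.054896), weekday and workingday do not, so we fail to reject it for those variables.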
Conclusion
With this, we have reached the end of this topic. Feel free to comment below if you have any questions.
Recommended read: Chi-square test in Python
Happy Analyzing!! 🙂