Hello, readers! Today we will look in detail at an important statistical test in data science: the ANOVA test, implemented in Python.
So, let us get started!!
Emergence of ANOVA test
In the domain of data science and machine learning, the data needs to be understood and processed prior to modelling. That is, we need to analyze every variable of the dataset and its credibility in terms of its contribution to the target value.
Usually there are two kinds of variables:
- Continuous variables
- Categorical variables
Below are the most commonly used statistical techniques for analyzing numeric variables:
- Correlation analysis, regression analysis, etc.
The ANOVA test, by contrast, is a statistical test for categorical variables, i.e. it is used to analyze how categorical variables relate to the target.
What is ANOVA test all about?
ANOVA (Analysis of Variance) is a statistical test for understanding categorical variables. It estimates the extent to which a continuous dependent variable is affected by one or more independent categorical variables.
With the ANOVA test, we compare the mean of the dependent variable across the groups (levels) of the independent categorical variable.
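To make this concrete, here is a minimal sketch of the F statistic that a one-way ANOVA is built on: the between-group mean square divided by the within-group mean square. The function name `one_way_anova` and the toy numbers are illustrative assumptions, not part of the bike rental dataset, and the sketch stops at the F value rather than computing a p-value.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (df_between, df_within, F) for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: distance of each group mean from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, f_stat

# Hypothetical toy data: rental counts under three weather conditions
clear = [10, 12, 11, 13]
misty = [8, 9, 7, 8]
rainy = [3, 4, 2, 3]

df_b, df_w, f_stat = one_way_anova([clear, misty, rainy])
print(df_b, df_w, round(f_stat, 2))
```

A large F value means the group means differ by much more than the spread inside each group would explain.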
Hypothesis for ANOVA testing
As we all know, hypothesis claims come in two categories: the Null Hypothesis and the Alternate Hypothesis.
- In the case of the ANOVA test, our Null Hypothesis would claim the following: “The statistical mean of all the groups/categories of the variable is the same.”
- On the other hand, the Alternate Hypothesis would claim as follows: “The statistical mean of all the groups/categories of the variable is not the same”, that is, at least one group mean differs from the others.
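In symbols, writing \mu_1, ..., \mu_k for the group means of a variable with k levels (this notation is an assumption introduced here for illustration), the two hypotheses can be stated as:

```latex
\begin{align*}
H_0 &: \mu_1 = \mu_2 = \dots = \mu_k \\
H_1 &: \mu_i \neq \mu_j \quad \text{for at least one pair } (i, j)
\end{align*}
```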
Having said this, let us now focus on the Assumptions or considerations for ANOVA testing.
Assumptions of ANOVA testing
- Within each group, the values of the dependent variable follow a normal distribution.
- The groups share a common variance.
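A quick, informal way to eyeball the common-variance assumption is to compare the sample variances of the groups. The sketch below uses only the standard library; the helper name `variance_ratio` and the toy groups are hypothetical. A ratio close to 1 is consistent with the assumption. For a more rigorous check, formal tests such as Levene's test (`scipy.stats.levene`) for equal variances or the Shapiro-Wilk test (`scipy.stats.shapiro`) for normality are the usual route.

```python
from statistics import variance

def variance_ratio(groups):
    """Ratio of the largest to the smallest sample variance across groups."""
    variances = [variance(g) for g in groups]
    return max(variances) / min(variances)

# Hypothetical toy groups (not taken from the bike rental dataset)
groups = [[10, 12, 11, 13], [8, 9, 7, 8], [3, 4, 2, 3]]

ratio = variance_ratio(groups)
print(f"largest/smallest variance ratio: {ratio:.2f}")
```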
ANOVA test in Python – Simple Practical Approach!
In this example, we will be making use of the Bike Rental Count Prediction dataset wherein we are required to predict the number of customers who would opt for a rented bike based on different conditions provided.
You can find the dataset here!
So, initially, we load the dataset into the Python environment using the read_csv() function. Then, based on exploratory data analysis (EDA), we convert the variables to appropriate data types. We use the os module to set the working directory and the Pandas library to parse the CSV data.
    import os
    import pandas

    # Changing the current working directory
    os.chdir("D:/Ediwsor_Project - Bike_Rental_Count")
    BIKE = pandas.read_csv("day.csv")

    # Treat the coded columns as categorical (string) variables
    BIKE['holiday'] = BIKE['holiday'].astype(str)
    BIKE['weekday'] = BIKE['weekday'].astype(str)
    BIKE['workingday'] = BIKE['workingday'].astype(str)
    BIKE['weathersit'] = BIKE['weathersit'].astype(str)
    BIKE['dteday'] = pandas.to_datetime(BIKE['dteday'])
    BIKE['season'] = BIKE['season'].astype(str)
    BIKE['yr'] = BIKE['yr'].astype(str)
    BIKE['mnth'] = BIKE['mnth'].astype(str)
    print(BIKE.dtypes)
    instant                int64
    dteday        datetime64[ns]
    season                object
    yr                    object
    mnth                  object
    holiday               object
    weekday               object
    workingday            object
    weathersit            object
    temp                 float64
    atemp                float64
    hum                  float64
    windspeed            float64
    casual                 int64
    registered             int64
    cnt                    int64
    dtype: object
Now it is time to apply the ANOVA test. Python provides the anova_lm() function from the statsmodels library for this purpose.
First, we fit an Ordinary Least Squares (OLS) model of cnt on each categorical variable, and then apply the ANOVA test to the fitted model.
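    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # Categorical columns to test (the same ones that appear in the output below)
    categorical_col = ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']

    for x in categorical_col:
        model = ols('cnt' + '~' + x, data = BIKE).fit()  # Ordinary Least Squares fit
        result_anova = sm.stats.anova_lm(model)  # ANOVA test
        print(result_anova)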
                 df        sum_sq       mean_sq           F        PR(>F)
    season      3.0  9.218466e+08  3.072822e+08  124.840203  5.433284e-65
    Residual  713.0  1.754981e+09  2.461404e+06         NaN           NaN

                 df        sum_sq       mean_sq           F        PR(>F)
    yr          1.0  8.813271e+08  8.813271e+08  350.959951  5.148657e-64
    Residual  715.0  1.795501e+09  2.511190e+06         NaN           NaN

                 df        sum_sq       mean_sq          F        PR(>F)
    mnth       11.0  1.042307e+09  9.475520e+07  40.869727  2.557743e-68
    Residual  705.0  1.634521e+09  2.318469e+06        NaN           NaN

                 df        sum_sq       mean_sq        F    PR(>F)
    holiday     1.0  1.377098e+07  1.377098e+07  3.69735  0.054896
    Residual  715.0  2.663057e+09  3.724555e+06      NaN       NaN

                 df        sum_sq       mean_sq         F    PR(>F)
    weekday     6.0  1.757122e+07  2.928537e+06  0.781896  0.584261
    Residual  710.0  2.659257e+09  3.745432e+06       NaN       NaN

                   df        sum_sq       mean_sq         F    PR(>F)
    workingday    1.0  8.494340e+06  8.494340e+06  2.276122  0.131822
    Residual    715.0  2.668333e+09  3.731935e+06       NaN       NaN

                   df        sum_sq       mean_sq          F        PR(>F)
    weathersit    2.0  2.679982e+08  1.339991e+08  39.718604  4.408358e-17
    Residual    714.0  2.408830e+09  3.373711e+06        NaN           NaN
We take the significance level as 0.05. If the p-value (the PR(>F) column) is less than 0.05, we conclude that there is a significant difference in the mean of the groups formed by the levels of that categorical variable. That is, we reject the null hypothesis.
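Applying this rule to the results above can be sketched as follows. The p-values are copied from the PR(>F) column of the printed output; the variable names `ALPHA`, `p_values`, `significant` and `not_significant` are chosen here for illustration.

```python
ALPHA = 0.05  # significance level used in this article

# p-values copied from the PR(>F) column of the ANOVA output above
p_values = {
    "season": 5.433284e-65,
    "yr": 5.148657e-64,
    "mnth": 2.557743e-68,
    "holiday": 0.054896,
    "weekday": 0.584261,
    "workingday": 0.131822,
    "weathersit": 4.408358e-17,
}

# Reject the null hypothesis only where the p-value falls below ALPHA
significant = [name for name, p in p_values.items() if p < ALPHA]
not_significant = [name for name, p in p_values.items() if p >= ALPHA]

print("Reject the null hypothesis for:", significant)
print("Fail to reject the null hypothesis for:", not_significant)
```

Under this threshold, season, yr, mnth and weathersit show a significant difference in mean rental count across their levels, while holiday, weekday and workingday do not. Note that holiday, at roughly 0.055, sits just above the cutoff.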
With this, we have reached the end of this topic. Feel free to comment below if you have any questions.
Recommended read: Chi-square test in Python
Happy Analyzing!! 🙂