Using Groupby to Group a Data Frame by Month

How To Group A Data Frame By Month

The groupby method of the Pandas library splits a data frame into groups based on the values of one or more keys, so that each group can be processed separately.

It is similar to the GROUP BY clause in SQL, where rows that share a key are combined into one summary row per group.

Before we move to the groupby function, read this tutorial on Pandas.

What Is the Groupby Function?

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
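
The three steps can be written out by hand to see what groupby does internally. The following is a minimal sketch on made-up data (the city and sales columns are assumptions for illustration, not part of the diabetes data set used later).

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Lima", "Oslo"],
    "sales": [10, 20, 30, 40, 50],
})

# Split: one sub-frame per distinct city.
pieces = {city: part for city, part in df.groupby("city")}

# Apply: reduce each piece to a single number.
totals = {city: int(part["sales"].sum()) for city, part in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(totals, name="sales").sort_index()

# groupby performs the same three steps in one call.
direct = df.groupby("city")["sales"].sum()

print(manual)
print(direct)
```

Both results hold the same totals per city; the one-line form is what the rest of this article uses.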

The groupby function of the Pandas library has the following syntax.

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)

| Argument | Description | Necessity |
|---|---|---|
| by | Determines the groups. Accepts a column label or list of labels, a function, a dictionary, or a Series. A function is called on each value of the index; for a dictionary or Series, the values determine the groups. | Required unless level is given |
| axis | The axis along which to group: 0 groups the rows, 1 groups the columns. The default is 0. This parameter is deprecated since pandas 2.1. | Optional |
| level | Used when the axis has a MultiIndex, to name the level or levels on which the grouping should occur. Takes an integer, a sequence of integers, or a level name. The default is None. Do not pass both by and level. | Optional |
| as_index | Decides whether the group keys become the index of the aggregated result. The default is True. | Optional |
| sort | Decides whether the group keys are sorted. The default is True; with False, groups appear in order of first occurrence. | Optional |
| group_keys | When apply is called on the grouped object, decides whether the group keys are added to the index of the result. The default is True. | Optional |
| observed | Applies only when a grouping key is categorical. True shows only the categories that occur in the data; False shows all categories. The default shown here is False; pandas 2.1 and later warn that it will change to True. | Optional |
| dropna | If True (the default), rows whose group key is missing (NA) are dropped. If False, NA is treated as a group key. | Optional |

Arguments of groupby

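Returns: A DataFrameGroupBy object that holds the grouping information. Nothing is computed until an aggregation such as sum or mean is applied to it.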
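
To make a few of these arguments concrete, here is a small sketch on made-up data (the team and points columns are assumptions for illustration). It compares the defaults with as_index=False, sort=False, and dropna=False.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing key.
df = pd.DataFrame({
    "team": ["b", "a", np.nan, "a", "b"],
    "points": [1, 2, 3, 4, 6],
})

# Defaults: keys become the index, are sorted, and the NaN key is dropped.
default = df.groupby("team").sum()

# as_index=False keeps the key as an ordinary column.
flat = df.groupby("team", as_index=False).sum()

# sort=False keeps the groups in order of first appearance.
unsorted = df.groupby("team", sort=False).sum()

# dropna=False keeps the rows with a missing key as their own group.
with_na = df.groupby("team", dropna=False).sum()

for name, result in [("default", default), ("flat", flat),
                     ("unsorted", unsorted), ("with_na", with_na)]:
    print(name, result, sep="\n", end="\n\n")
```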

Visit this post to know more about the groupby function.

Groupby Example

Let us take a diabetes dataset and perform some operations using groupby.

import pandas as pd
df = pd.read_csv('diabetes.csv')  # read_csv already returns a DataFrame
df

In the first line of the code, we import the pandas library.

Next, read_csv reads the downloaded CSV file and returns it as a data frame, which we store in df. No separate conversion with pd.DataFrame is needed.

In the last line, we display the data frame.

Data Frame

First, let us try to group the data frame by Age.

df1=df.groupby(['Age']).sum()
df1.head()

We create a variable called df1 that stores, for each distinct Age, the column-wise sum of all rows with that age. Then the first five rows of the grouped data frame are displayed.

Groupby Age

We can even group the data frame based on two or more columns.

df2=df.groupby(['Insulin','BloodPressure']).sum()
df2.tail()

Another variable called df2 stores the sums for each distinct combination of Insulin and BloodPressure. The last five rows of this data frame are displayed.

Groupby Insulin And BloodPressure
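
Summing every column is not always what we want. As a hedged sketch, the same two-column grouping can be combined with agg to compute a named statistic per group. The rows below are made up, and the Glucose column is an assumption about the data set rather than something shown above.

```python
import pandas as pd

# Made-up rows; 'Glucose' is an assumed column name for illustration.
df = pd.DataFrame({
    "Insulin": [0, 0, 94, 94, 94],
    "BloodPressure": [72, 72, 66, 66, 40],
    "Glucose": [148, 85, 89, 137, 78],
})

summary = (
    df.groupby(["Insulin", "BloodPressure"])
      .agg(patients=("Glucose", "count"),      # rows per group
           mean_glucose=("Glucose", "mean"))   # average per group
      .reset_index()                           # keys back to columns
)
print(summary)
```

Each row of summary describes one (Insulin, BloodPressure) pair, and reset_index turns the two keys back into ordinary columns.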

How to Group the Data Frame by Month

Let us take data frames that have a date column and group them by month.

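Depending on the pandas version, the examples below may emit FutureWarnings. For example, pandas 2.2 and later flag the 'M' frequency alias used with resample and Grouper as deprecated in favor of 'ME'. These warnings do not change the results. To hide them, we can run the following code.

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)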
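
A global filter hides every such warning for the rest of the session. A narrower alternative, sketched below with the standard library only, is to silence warnings just around one call. The legacy_mean function is a hypothetical stand-in for any call that emits a FutureWarning.

```python
import warnings

def legacy_mean(values):
    """Hypothetical stand-in for a call that emits a FutureWarning."""
    warnings.warn("this spelling is deprecated", FutureWarning)
    return sum(values) / len(values)

# Suppress FutureWarning only inside this block; filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    quiet_result = legacy_mean([1, 2, 3])

# Outside that block the warning is emitted again; record it to show this.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud_result = legacy_mean([1, 2, 3])

print(quiet_result, loud_result, len(caught))
```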

Read this post to know more about warnings in Python.

Groupby Using Dt Accessor

The dt accessor exposes the datetime properties of a column of datetime values, such as year, month, or day. We are going to use this dt accessor to derive the month of each date in the data frame.

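import pandas as pd
df = pd.DataFrame({'date': ['3/10/2000', '3/11/2000', '3/12/2000'],
                   'value': [11, 21, 9]})
# Parse the strings (month/day/year) into datetime values
df['date'] = pd.to_datetime(df['date'])
print("The original data frame is:\n", df)
# Group by calendar month; select 'value' so the datetime column is not summed,
# which raises a TypeError in pandas 2.x
df1 = df.groupby(df['date'].dt.to_period('M'))[['value']].sum()
print("Grouped Data Frame by dt accessor:")
print(df1)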

We create a dictionary that contains three date strings and their corresponding values. This dictionary is converted to a data frame with pd.DataFrame.

The dt accessor works only on datetime values, so we convert the date column from strings to datetimes with pd.to_datetime.

Next, we print the data frame.

We create another variable called df1 to store the grouped data frame. The expression dt.to_period('M') maps every date to its calendar month, and groupby uses these months as the group keys. The values in each group are then summed. All three dates fall in March 2000, so the result has a single row.

The grouped data frame is then printed in the next line.

Groupby Using Dt Accessor
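
The dt accessor also offers dt.month, which behaves differently from dt.to_period('M') once the data spans more than one year. The sketch below uses made-up dates to show the difference.

```python
import pandas as pd

# Hypothetical data covering two different years.
df = pd.DataFrame({
    "date": pd.to_datetime(["2000-03-10", "2000-03-25",
                            "2001-03-05", "2001-04-01"]),
    "value": [11, 21, 9, 4],
})

# dt.month keeps only the month number, so March 2000 and March 2001 merge.
by_month_number = df.groupby(df["date"].dt.month)["value"].sum()

# to_period("M") keeps the year, so each calendar month is its own group.
by_calendar_month = df.groupby(df["date"].dt.to_period("M"))["value"].sum()

print(by_month_number)
print(by_calendar_month)
```

Use dt.month when seasonal totals across years are wanted, and dt.to_period('M') when each calendar month should stay separate.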

Groupby Using Resample

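The resample method of the pandas library groups time series data into fixed time intervals, such as months, so that each interval can be aggregated. By default it requires the data frame to have a datetime index.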

import pandas as pd
sd = '2022-01-01'  # start date
ed = '2023-04-30'  # end date
# Number of days in the range, counting both endpoints
ndays = (pd.to_datetime(ed) - pd.to_datetime(sd)).days + 1
data = {'date': pd.date_range(start=sd, end=ed, freq='D'),
        'value': [i + 1 for i in range(ndays)]}
df = pd.DataFrame(data)
print("The original data frame is:\n", df)
df.set_index('date', inplace=True)
# 'M' is the month-end frequency; pandas 2.2+ prefers the spelling 'ME'
df2 = df.resample('M').mean()
print("Using resample:")
print(df2.head(10))
print("\n")

We initialize a start date and an end date to create a data frame that has a date column. ndays is the number of days between the two dates; 1 is added so that the end date itself is counted.

data is a dictionary with date and value as keys. pd.date_range generates one timestamp per day (freq='D') from the start date to the end date, and value holds the integers from 1 to ndays.

Next, this data is converted to a data frame called df and printed. The date column is then set as the index using the set_index method, because resample works on the index by default.

We create a variable called df2 to store the monthly mean of the values after resampling with the month frequency.

Then the first 10 rows of the grouped data frame are printed.

Groupby Using Resample
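
As a variation, resample can work on a date column directly through its on parameter, which avoids set_index. The sketch below also uses the 'MS' (month start) alias instead of the month-end alias, so each row is labelled with the first day of its month. The shorter date range is an assumption chosen to keep the output small.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range(start="2022-01-01", end="2022-03-31", freq="D"),
})
df["value"] = range(1, len(df) + 1)

# on="date" resamples by a column, so the index is left untouched.
# "MS" labels each monthly bin with the first day of the month.
monthly = df.resample("MS", on="date")["value"].mean()
print(monthly)
```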

Groupby Using Grouper

Grouper is a class that lets us specify how groupby should form the groups, for example at a given time frequency. We are going to take the same kind of data as above and use the Grouper class.

import pandas as pd
sd = '2021-12-01'  # start date
ed = '2022-04-18'  # end date
# Number of days in the range, counting both endpoints
ndays = (pd.to_datetime(ed) - pd.to_datetime(sd)).days + 1
data = {'date': pd.date_range(start=sd, end=ed, freq='D'),
        'value': [i + 1 for i in range(ndays)]}
df = pd.DataFrame(data)
df.set_index('date', inplace=True)
print("Original Data Frame:")
print(df.head())
print("\n")
# 'M' is the month-end frequency; pandas 2.2+ prefers the spelling 'ME'
df3 = df.groupby(pd.Grouper(freq='M')).mean()
print("Grouped Data Frame using Grouper:")
print(df3.head())

In this example, we change the start date (sd) and end date (ed), but the data frame is built in the same way as before.

After creating the data frame, the date column is set as the index.

The first five rows of the data frame are printed using the print function.

Another variable called df3 is created to store the grouped data frame. We pass a Grouper object to groupby with the frequency set to M, which stands for month. Because no key is given, Grouper groups on the index. The mean of each monthly group is calculated using the mean function.

Lastly, the grouped data frame is printed.

Groupby Using The Grouper Class
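
One thing Grouper allows that resample does not do as directly is grouping by month and by another column at the same time. The following sketch uses made-up data (the store column is an assumption for illustration), names the date column through key, and uses the 'MS' (month start) alias.

```python
import pandas as pd

# Hypothetical sales records for two stores.
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-05", "2022-01-20", "2022-01-25",
                            "2022-02-03", "2022-02-14"]),
    "store": ["north", "south", "north", "north", "south"],
    "value": [10, 20, 30, 40, 50],
})

# key="date" points Grouper at a column, so set_index is not needed.
# The second key splits each month further by store.
monthly_by_store = (
    df.groupby([pd.Grouper(key="date", freq="MS"), "store"])["value"].sum()
)
print(monthly_by_store)
```

The result has a two-level index, with the month first and the store second.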

Conclusion

To conclude, we have seen how the groupby function is used to split a data frame into groups based on some condition.

We discussed the groupby function of the Pandas library along with its arguments. We have also seen how to group a data frame by one column and by more than one column with the help of a diabetes data set.

Coming to grouping a data frame by month, we have seen three approaches.

In the first example, we took a simple data frame of three dates and corresponding values, used the dt accessor to convert each date to its calendar month, and then grouped the values by that month.

The second example takes a start date and an end date and builds a data frame with one row per day in that range, along with a value column. The resample method of the pandas library is used to group the data frame by month.

In the third example, we built the same kind of data frame with a different start and end date and used the Grouper class to group the data frame by month.
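
As a quick check that the three approaches agree, the sketch below computes monthly means in all three ways on one made-up data frame. It uses the 'MS' (month start) alias for resample and Grouper, which is an assumption that differs from the 'M' alias used in the examples above; only the row labels differ, not the values.

```python
import pandas as pd

dates = pd.date_range(start="2022-01-01", end="2022-04-30", freq="D")
df = pd.DataFrame({"value": range(1, len(dates) + 1)}, index=dates)

# 1. Convert the index to monthly periods and group by them.
via_period = df.groupby(df.index.to_period("M"))["value"].mean()

# 2. Resample the datetime index by month.
via_resample = df.resample("MS")["value"].mean()

# 3. Group with a Grouper at monthly frequency.
via_grouper = df.groupby(pd.Grouper(freq="MS"))["value"].mean()

same_values = (
    via_period.tolist() == via_resample.tolist() == via_grouper.tolist()
)
print(same_values)
```

The first result is indexed by periods such as 2022-01, while the other two are indexed by timestamps.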

References

You can find more about the groupby function here.

You can find more about the resample method here.

Find more about the Grouper class here.

Datasets

The diabetes data set can be downloaded from here.