Using Groupby to Group a Data Frame by Month

The groupby function of the Pandas library splits a data frame into groups based on one or more criteria, so that an operation can be computed for each group.

It is similar to the GROUP BY clause in SQL, where rows that share a value are collected into groups before an aggregate is calculated.

Before we move to the groupby function, read this tutorial on Pandas.

What Is Groupby Function?

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
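To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand and then compares the result with a single groupby call. The team and score columns are made-up illustration data, not part of the data sets used later.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three steps.
df = pd.DataFrame({"team": ["a", "b", "a", "b"], "score": [1, 2, 3, 4]})

# Split: iterate over one sub-frame per distinct team.
# Apply: sum the scores inside each sub-frame.
pieces = {team: part["score"].sum() for team, part in df.groupby("team")}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(pieces, name="score")

# groupby performs all three steps in one expression.
grouped = df.groupby("team")["score"].sum()

print(manual)
print(grouped)
```

Both Series hold the same totals per team; the groupby version is simply the shorter way to write it.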

The groupby function of the Pandas library has the following syntax.

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)

Argument	Description	Necessity
by	Determines the groups. Accepts a column label, a list of labels, a function, a dictionary, or a Series. A function is called on each value of the index; for a dictionary or Series, the values determine the groups.	Required unless level is given
axis	The axis to split along. 0 (the default) groups rows; 1 groups columns. Recent pandas releases deprecate this argument.	Optional
level	Used when the axis has a MultiIndex, to name the level or levels to group by. Takes an integer, a level name, or a sequence of either. Do not specify both by and level. The default is None.	Optional
as_index	Decides whether the group labels become the index of the result. The default is True; with False the labels stay as regular columns.	Optional
sort	Decides whether the group keys are sorted. The default is True, which means the keys are sorted.	Optional
group_keys	When apply is called, decides whether the group keys are added to the index of the result to identify the pieces. The default is True.	Optional
observed	Applies only when grouping by categorical data. If True, only the categories that actually occur are shown; if False, all categories are shown. The signature above uses False; check the default for your pandas version, since newer releases change it.	Optional
dropna	If True (the default), rows whose group key is missing (NA) are dropped. If False, NA is treated as a group key.	Optional

Arguments of groupby

Returns: A grouped object.
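The effect of a few of these arguments is easier to see on a small frame. The sketch below uses a made-up key column that contains a missing value, and compares the defaults with sort=False, dropna=False, and as_index=False.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing key.
df = pd.DataFrame({"key": ["b", "a", np.nan, "b"], "value": [1, 2, 3, 4]})

# Defaults: keys are sorted and the row with the missing key is dropped.
default = df.groupby("key")["value"].sum()

# sort=False keeps the keys in order of first appearance.
unsorted = df.groupby("key", sort=False)["value"].sum()

# dropna=False keeps the missing key as a group of its own.
with_na = df.groupby("key", dropna=False)["value"].sum()

# as_index=False leaves the key as a regular column instead of the index.
flat = df.groupby("key", as_index=False)["value"].sum()

print(default, unsorted, with_na, flat, sep="\n\n")
```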

Visit this post to know more about the groupby function.

Groupby Example

Let us take a diabetes dataset and perform some operations using groupby.

import pandas as pd
data=pd.read_csv('diabetes.csv')
df=pd.DataFrame(data)
df

In the first line of the code, we are importing the pandas library.

Next, we are creating a variable called data to hold the downloaded data set, which is read from the CSV file.

read_csv already returns a data frame, so passing data to pd.DataFrame is optional; it simply gives us a data frame called df.

In the last line we are displaying the data frame.

First, let us try to group the data frame by Age.

df1=df.groupby(['Age']).sum()
df1.head()

We are creating a variable called df1, which stores the sum of every other column for each distinct Age. Then the first five rows of the grouped data frame are printed.

We can even group the data frame based on two or more columns.

df2=df.groupby(['Insulin','BloodPressure']).sum()
df2.tail()

Another variable called df2 is used to store the grouped data frame by Insulin and BloodPressure. The last five rows of this data frame are printed.
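sum is not the only option after grouping. As a sketch, the agg method can compute a different statistic per column. The rows below are made up so that the example runs without the CSV file; only the column names mirror the ones used above, and the output names mean_bp, max_insulin, and rows are chosen here for illustration.

```python
import pandas as pd

# Made-up rows; only the column names follow the diabetes example.
toy = pd.DataFrame({
    "Age": [21, 21, 30, 30, 30],
    "Insulin": [0, 94, 0, 168, 88],
    "BloodPressure": [66, 64, 72, 40, 74],
})

# Named aggregation: each keyword becomes a column of the result,
# built from a (source column, function) pair.
summary = toy.groupby("Age").agg(
    mean_bp=("BloodPressure", "mean"),
    max_insulin=("Insulin", "max"),
    rows=("Insulin", "size"),
)

print(summary)
```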

How to Groupby the Data Frame by Month

Let us take data frames that contain a date column and group them by month.

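Depending on the pandas version, the examples below may emit FutureWarnings (for example, newer releases prefer the 'ME' alias over 'M' for month-end frequencies). The warnings do not change the results, so we can optionally silence them with the following code.

import warnings
# Hide only FutureWarnings so that other warnings remain visible
warnings.filterwarnings("ignore", category=FutureWarning)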

Read this post to know more about warnings in python.
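If we would rather not change the warning filter for the whole session, a narrower option is to silence warnings only around one call. The sketch below uses the standard library's catch_warnings context manager; noisy_step is a hypothetical stand-in for any pandas call that emits a FutureWarning.

```python
import warnings


def noisy_step():
    # Hypothetical stand-in for a pandas call that emits a FutureWarning.
    warnings.warn("behaviour will change in a future release", FutureWarning)
    return 42


# The filter applies only inside the with block; afterwards the
# previous warning settings are restored.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy_step()

print(result)
```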

Groupby Using Dt Accessor

The dt accessor exposes the components of a datetime column, such as the year, month, or day. We are going to use this dt accessor to extract the month from the dates of the data frame.

import pandas as pd
df = pd.DataFrame({'date': ['3/10/2000', '3/11/2000', '3/12/2000'],
                   'value': [11,21,9]})
df['date'] = pd.to_datetime(df['date'])
print("The original data frame is:\n",df)
# numeric_only=True keeps sum() from trying to add up the datetime column
df1 = df.groupby(df['date'].dt.to_period('M')).sum(numeric_only=True)
print("Grouped Data Frame by dt accessor:")
print(df1)

We are creating a dictionary that contains three date strings and their corresponding values. This dictionary is converted to a data frame with pd.DataFrame.

The dt accessor works only on datetime-like columns, so we convert the date column from strings to datetimes with pd.to_datetime. The strings are read month first, so all three dates fall in March 2000.

Next, we print the data frame.

We are creating another variable called df1 to store the grouped data frame. The dt accessor converts every date to its monthly period with to_period('M'), groupby uses these periods as the group keys, and the values in each group are summed. Because all three dates belong to the same month, the result has a single row for 2000-03 with the value 41.

The grouped data frame is then printed in the next line.
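The dt accessor also offers a month attribute, and it is worth seeing how it differs from to_period('M') once the data spans more than one year. The following sketch uses made-up dates from two different years to show the difference.

```python
import pandas as pd

# Made-up dates: two in March (different years) and one in April.
df = pd.DataFrame({
    "date": pd.to_datetime(["2000-03-10", "2000-04-02", "2001-03-15"]),
    "value": [11, 21, 9],
})

# dt.month keeps only the month number, so March 2000 and March 2001
# end up in the same group.
by_month_number = df.groupby(df["date"].dt.month)["value"].sum()

# to_period('M') keeps the year as well, so every calendar month
# forms its own group.
by_calendar_month = df.groupby(df["date"].dt.to_period("M"))["value"].sum()

print(by_month_number)
print(by_calendar_month)
```

Grouping by dt.month suits seasonal questions such as "how does March usually look", while to_period('M') suits a month-by-month timeline.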

Groupby Using Resample

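The resample method of the pandas library changes the frequency of time series data, for example by collapsing daily rows into one row per month. An aggregation such as mean or sum is then applied to each time bin.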

import pandas as pd
sd = '2022-01-01'
ed = '2023-04-30'
ndays = (pd.to_datetime(ed) - pd.to_datetime(sd)).days + 1  
data = {'date': pd.date_range(start=sd, end=ed, freq='D'),
        'value': [i + 1 for i in range(ndays)]}
df = pd.DataFrame(data)
print("The original data frame is:\n",df)
df.set_index('date', inplace=True)
df2 = df.resample('M').mean()
print("Using resample:")
print(df2.head(10))
print("\n")

We are initializing a start date and an end date to create a data frame that has a date column. ndays is the number of days between the two dates; 1 is added so that the end date itself is counted.

data is a dictionary with date and value as keys. pd.date_range generates one date per day (freq='D') from the start date to the end date, and value numbers the days from 1 to ndays.

Next, this data is converted to a data frame called df. It is printed in the next line. The date column is set as the index of the data frame using the set_index method, because resample works on a datetime-like index by default.

We are creating a variable called df2 to store the result of resampling by month ('M') and taking the mean of each month.

Then the first 10 values of the grouped data frame are printed.
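As a variation, resample can also work on a date column directly through its on parameter, which avoids changing the index. The sketch below makes two choices that differ from the example above and are assumptions of this illustration: it uses the 'MS' alias, which labels each month by its first day, and it computes two statistics at once with agg.

```python
import pandas as pd

# Three months of daily data, numbered 1, 2, 3, ...
df = pd.DataFrame({
    "date": pd.date_range(start="2022-01-01", end="2022-03-31", freq="D"),
})
df["value"] = range(1, len(df) + 1)

# on='date' resamples by the column, so set_index is not needed.
# 'MS' labels each bin with the first day of the month.
monthly = df.resample("MS", on="date")["value"].agg(["mean", "count"])

print(monthly)
```

The count column is a quick check that every month received the expected number of daily rows.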

Groupby Using Grouper

Grouper is a pandas class that lets us describe a grouping instruction, such as a time frequency, and pass it to groupby. We are going to take the same example as above and use the Grouper class.

import pandas as pd
sd = '2021-12-01'
ed = '2022-04-18'
ndays = (pd.to_datetime(ed) - pd.to_datetime(sd)).days + 1  
data = {'date': pd.date_range(start=sd, end=ed, freq='D'),
        'value': [i + 1 for i in range(ndays)]}
df = pd.DataFrame(data)
df.set_index('date', inplace=True)
print("Original Data Frame:")
print(df.head())
print("\n") 
df3 = df.groupby(pd.Grouper(freq='M')).mean()
print("Grouped Data Frame using Grouper:")
print(df3.head())

In this example, we are changing the start date(sd) and end date(ed) but the rest of the data frame remains the same.

After creating the data frame, the date column of the data frame is set as index.

The data frame is printed using the print function.

Another variable called df3 is created to store the grouped data frame. We are passing a Grouper object to groupby with the frequency set to M, which stands for month end, so each group is labelled with the last day of its month. The mean of each group is calculated using the mean function.

Lastly, the grouped data frame is printed.
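One advantage of Grouper over resample is that it can be combined with other grouping keys. The sketch below groups by month and by a second column at the same time. The store column and its values are made up for illustration, key='date' tells Grouper to read the dates from a column rather than the index, and 'MS' (month start) is used as the frequency alias.

```python
import pandas as pd

# Made-up sales rows with a date column and a second grouping column.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2022-01-05", "2022-01-20", "2022-02-03", "2022-02-17"]
    ),
    "store": ["north", "south", "north", "north"],
    "value": [10, 20, 30, 40],
})

# Group by calendar month (labelled by its first day) and by store.
monthly_by_store = (
    df.groupby([pd.Grouper(key="date", freq="MS"), "store"])["value"].sum()
)

print(monthly_by_store)
```

The result has a two-level index, with the month as the first level and the store as the second.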

Conclusion

To conclude, we have seen how the groupby function is used to split the data frame based on some conditions.

The groupby function of the Pandas library is discussed along with its arguments. We have also seen how to group a data frame by one column and more than one column with the help of a diabetes data set.

Coming to grouping a data frame by month, we have seen three approaches.

In the first example, we have taken a simple data frame of three dates and corresponding values, used the dt accessor to convert the date column to monthly periods, and then grouped the data frame by those periods.

The second example takes a start date and end date and the data frame consists of a range of dates in between the start time and end time. There is a values column associated with the date column. The resample method of the pandas library is used to group the data frame by month.

We took the same example but changed the start and end date and used the Grouper class to group the data frame by month.


Vignya Durvasula
