In this article, we’ll be conditionally grouping values with Pandas. We’ve already covered the Python Pandas groupby in detail. So you can take a look through the article if you’re unsure about how the function works.
What is Grouping?
Grouping a database/data frame is a common practice in every day data-analysis and data-cleaning. Grouping refers to combining identical data (or data having the same properties) into different groups.
For example: Imagine a school database where there are students of all classes. Now if the principal wishes to compare results/attendance between the classes, he needs to compare the average data of each class. But how can he do that? He groups the student data based on which class they belong to (students of the same class go into the same group) and then he averages the data over each student in the group.
Our example covers a very ideal situation but it is the most basic application of grouping. Grouping can be based on multiple properties. This is sometimes called hierarchical grouping where a group is further subdivided into smaller groups based on some other property of the data. This allows our queries to be as complex as we require.
There is also a very basic problem that we ignored in our example, all data in the database need not be averaged. For example, if we need to compare only the average attendance and percentage of each class, we can ignore other values like mobile number or roll number whose average really does not make sense. In this article, we will learn how to make such complex grouping commands in pandas.
Grouping in Pandas using df.groupby()
df.groupby() provides a function to split the dataframe, apply a function such as
sum() to form the grouped dataset. This seems a scary operation for the dataframe to undergo, so let us first split the work into 2 sets: splitting the data and applying and combing the data. For this example, we use the supermarket dataset from Kaggle.
# Importing the data import pandas as pd # Our dataframe. The csv file can be downloaded fro above hyperlink. df = pd.read_csv('supermarket_sales - Sheet1.csv') # We drop some redundant columns df.drop(['Date', 'Invoice ID', 'Tax 5%'], axis=1, inplace=True) # Display the dataset df.head()
df.groupby() function will take in labels or a list of labels. Here we want to group according to the column Branch, so we specify only ‘Branch’ in the function definition. We also need to specify which along which axis the grouping will be done.
axis=1 represents ‘columns’ and
axis=0 indicates ‘index’.
# We split the dataset by column 'Branch'. # Rows having the same Branch will be in the same group. groupby = df.groupby('Branch', axis=0) # We apply the accumulator function that we want. Here we use the mean function here but we can also other functions. groupby.mean()
Now that we have learnt how to create grouped dataframes, we will be looking at applying conditions to the data for grouping.
Discrete and Continuous Data
It is a common practice to use discrete(tabular) data for grouping. Continuous data are not suitable for grouping. But will this not limit our data analysis capability? Yes, Obviously. So we need a workaround. We will perform binning of the continuous data to make the data tabular.
For example : Percentage is a continuous data, to convert it in to labelled data we take four predefined groups – Excellent(75-100), Good(50-75), Poor(25-50), Very-Poor(0-25). Each data however varied it might be, will fall into these 4 groups.
Another way can be using true and false for different values.
For example, The supermarket manager wants to find out how many customers bought more than 3 articles at once. One way to do is to approach this is to replace the number of articles by 1/True if the number is greater than or equal to 3 else 0/False.
# Binning of the data based on a condition df.loc[df.Quantity < 3, 'Quantity'] = 0 df.loc[df.Quantity >= 3, 'Quantity'] = 1 # Grouping and couting df.groupby('Quantity').count()
Conditionally grouping values based other columns
For our final query, we need to group the dataframe into groups based on whether more than 3 items were sold. We need to find the average unit price of the articles bought more than 3 articles at once.
- We need to filter out the columns of our interest.
- If the grouping is done on continuous data, we need to convert the continuous data into tabular data.
df.groupby()to split the data.
- Apply the aggregation function.
# Filter out columns of our interest df_1 = df.loc[:, ["Quantity", "Unit price"]] # We have already binned the quantity data into 0 and 1's for counting. # So we don't need any pre-processing # Group the data groupby = df_1.groupby("Quantity", axis=0) # Apply the function(here mean) groupby.mean()
The Unit price of articles which were bought more than 3 at once, is 55.5846 as can be seen from the above figure.
Pandas make querying easier with inbuilt functions such as
df.query(). This allows the user to make more advanced and complicated queries to the database. These are higher-level abstractions to df.loc that we have seen in the previous example
Pandas filter method allows you to filter the labels of the dataframe. It does not act on the contents of the dataframe. Here is an example to filter out the City and Gender label in our dataset.
df = pd.read_csv('supermarket_sales - Sheet1.csv') # We need to mention the labels to be filterd in items df.filter(items=["City","Gender"]).head()
We can also use Regex for filtering out labels. We try out to filter labels starting with letter C.
# We can specify the regex literal under regex in the function df.filter(regex="^C").head()
The query method allows querying the contents of the column of the dataframe to arbitrary complexity. Here is an example to find out the cases where customers bought more than 3 articles at once.
df.query('Quantity > 3').head()
We can also combine many conditions together using ‘&’ and ‘|’. For example, we want to find out the cases where customers bought more than 3 articles at once and paid using Cash.
df.query('Quantity > 3 & Payment=="Cash"').head()
Combining df.query() and df.filter() and df.groupby()
We want to solve the problem of grouping the dataframe into groups based on whether more than 3 items were sold. We need to find the average unit price of the articles bought more than 3 articles at once for each city.
We proceed in this three steps:
df.query()to filter out the data having more than 3 articles.
df.filter()to keep only the labels of interest( here City and Unit Price)
df.groupby()to group the data
# Query the database for Quantity greater than 3 df_g = df.query('Quantity > 3') # Filter out labels of interest df_g = df_g.filter(['City', 'Unit price']) # Group the value according to a condition df_g.groupby('City').mean()