Pandas: Conditionally Grouping Values

In this article, we’ll be conditionally grouping values with Pandas. We’ve already covered the Python Pandas groupby in detail. So you can take a look through the article if you’re unsure about how the function works.

What is Grouping?

Grouping a database/data frame is a common practice in every day data-analysis and data-cleaning. Grouping refers to combining identical data (or data having the same properties) into different groups.

For example: Imagine a school database where there are students of all classes. Now if the principal wishes to compare results/attendance between the classes, he needs to compare the average data of each class. But how can he do that? He groups the student data based on which class they belong to (students of the same class go into the same group) and then he averages the data over each student in the group.

Our example covers a very ideal situation but it is the most basic application of grouping. Grouping can be based on multiple properties. This is sometimes called hierarchical grouping where a group is further subdivided into smaller groups based on some other property of the data. This allows our queries to be as complex as we require.

There is also a very basic problem that we ignored in our example, all data in the database need not be averaged. For example, if we need to compare only the average attendance and percentage of each class, we can ignore other values like mobile number or roll number whose average really does not make sense. In this article, we will learn how to make such complex grouping commands in pandas.

Grouping in Pandas using df.groupby()

Pandas df.groupby() provides a function to split the dataframe, apply a function such as mean() and sum() to form the grouped dataset. This seems a scary operation for the dataframe to undergo, so let us first split the work into 2 sets: splitting the data and applying and combing the data. For this example, we use the supermarket dataset from Kaggle.

Pandas groupby method — An overview of pandas Groupby method

# Importing the data
import pandas as pd

# Our dataframe. The csv file can be downloaded fro above hyperlink.
df = pd.read_csv('supermarket_sales - Sheet1.csv')

# We drop some redundant columns
df.drop(['Date', 'Invoice ID', 'Tax 5%'], axis=1, inplace=True)
# Display the dataset
df.head()

Output:

The df.groupby() function will take in labels or a list of labels. Here we want to group according to the column Branch, so we specify only ‘Branch’ in the function definition. We also need to specify which along which axis the grouping will be done. axis=1 represents ‘columns’ and axis=0 indicates ‘index’.

# We split the dataset by column 'Branch'.
# Rows having the same Branch will be in the same group.
groupby = df.groupby('Branch', axis=0)

# We apply the accumulator function that we want. Here we use the mean function here but we can also other functions. 
groupby.mean()

Output:

Now that we have learnt how to create grouped dataframes, we will be looking at applying conditions to the data for grouping.

Discrete and Continuous Data

Data Hierarchy — A hierarchical map showing the difference between discrete and continuous data. Discrete data are counted whereas continuous data are measured.

It is a common practice to use discrete(tabular) data for grouping. Continuous data are not suitable for grouping. But will this not limit our data analysis capability? Yes, Obviously. So we need a workaround. We will perform binning of the continuous data to make the data tabular.

For example : Percentage is a continuous data, to convert it in to labelled data we take four predefined groups – Excellent(75-100), Good(50-75), Poor(25-50), Very-Poor(0-25). Each data however varied it might be, will fall into these 4 groups.

Continous To Dis 1 — Conversion of data from continuous to discrete from.

Another way can be using true and false for different values.

For example, The supermarket manager wants to find out how many customers bought more than 3 articles at once. One way to do is to approach this is to replace the number of articles by 1/True if the number is greater than or equal to 3 else 0/False.

# Binning of the data based on a condition
df.loc[df.Quantity < 3, 'Quantity'] = 0
df.loc[df.Quantity >= 3, 'Quantity'] = 1

# Grouping and couting
df.groupby('Quantity').count()

Output:

Conditionally grouping values based other columns

For our final query, we need to group the dataframe into groups based on whether more than 3 items were sold. We need to find the average unit price of the articles bought more than 3 articles at once.

We need to filter out the columns of our interest.
If the grouping is done on continuous data, we need to convert the continuous data into tabular data.
Use df.groupby() to split the data.
Apply the aggregation function.

# Filter out columns of our interest
df_1 = df.loc[:, ["Quantity", "Unit price"]]

# We have already binned the quantity data into 0 and 1's for counting.
# So we don't need any pre-processing

# Group the data
groupby = df_1.groupby("Quantity", axis=0)

# Apply the function(here mean)
groupby.mean()

The Unit price of articles which were bought more than 3 at once, is 55.5846 as can be seen from the above figure.

Pandas make querying easier with inbuilt functions such as df.filter() and df.query(). This allows the user to make more advanced and complicated queries to the database. These are higher-level abstractions to df.loc that we have seen in the previous example

df.filter() method

Pandas filter method allows you to filter the labels of the dataframe. It does not act on the contents of the dataframe. Here is an example to filter out the City and Gender label in our dataset.

df = pd.read_csv('supermarket_sales - Sheet1.csv')

# We need to mention the labels to be filterd in items
df.filter(items=["City","Gender"]).head()

We can also use Regex for filtering out labels. We try out to filter labels starting with letter C.

# We can specify the regex literal under regex in the function
df.filter(regex="^C").head()

df.query() method

The query method allows querying the contents of the column of the dataframe to arbitrary complexity. Here is an example to find out the cases where customers bought more than 3 articles at once.

df.query('Quantity > 3').head()

We can also combine many conditions together using ‘&’ and ‘|’. For example, we want to find out the cases where customers bought more than 3 articles at once and paid using Cash.

df.query('Quantity > 3 & Payment=="Cash"').head()

Combining df.query() and df.filter() and df.groupby()

We want to solve the problem of grouping the dataframe into groups based on whether more than 3 items were sold. We need to find the average unit price of the articles bought more than 3 articles at once for each city.

We proceed in this three steps:

Use df.query() to filter out the data having more than 3 articles.
Use df.filter() to keep only the labels of interest( here City and Unit Price)
Use df.groupby() to group the data

# Query the database for Quantity greater than 3
df_g = df.query('Quantity > 3')

# Filter out labels of interest
df_g = df_g.filter(['City', 'Unit price'])

# Group the value according to a condition
df_g.groupby('City').mean()

References

Understanding Pandas groupby() function
Pandas groupby function – JournalDev