Pandas groupby: Split, aggregate, and transform data with Python

import pandas as pd

df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Laptop', 'Keyboard', 'Mouse'],
    'region': ['North', 'North', 'South', 'North', 'South'],
    'sales': [1200, 150, 1400, 220, 180],
    'units': [3, 15, 4, 8, 12]
})

grouped = df.groupby('region')['sales'].sum()
print(grouped)
# Output:
# region
# North    1570
# South    1580
# Name: sales, dtype: int64

The pandas groupby method implements the split-apply-combine pattern, a fundamental data analysis technique that divides your dataset into groups, applies functions to each group independently, and merges the results into a unified output. This approach mirrors SQL’s GROUP BY functionality but extends beyond simple aggregation to support complex transformations, custom computations, and multi-level grouping operations. Working with the groupby method becomes essential when you need to calculate statistics across categories, compare performance between segments, or reshape data based on categorical variables.

Understanding the groupby object and split-apply-combine workflow

When you call the groupby method, pandas creates a DataFrameGroupBy object rather than immediately computing results. This object stores grouping instructions and defers actual computation until you apply an operation. The delayed execution optimizes memory usage because pandas builds only an internal mapping of group labels to row indices without duplicating your data.

import pandas as pd

sales_data = pd.DataFrame({
    'store': ['Store A', 'Store B', 'Store A', 'Store C', 'Store B', 'Store C'],
    'category': ['Electronics', 'Clothing', 'Clothing', 'Electronics', 'Electronics', 'Clothing'],
    'revenue': [45000, 23000, 12000, 38000, 51000, 19000],
    'transactions': [120, 340, 180, 95, 145, 210]
})

grouped_data = sales_data.groupby('store')
print(type(grouped_data))
# Output: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>

print(grouped_data.groups)
# Output: {'Store A': [0, 2], 'Store B': [1, 4], 'Store C': [3, 5]}

The split phase divides your DataFrame based on unique values in the grouping column. Each distinct value becomes a group key, and pandas tracks which rows belong to each group through index positions. The apply phase executes your specified function on each group separately, treating each subset as an independent DataFrame. The combine phase merges all group results into a new data structure, typically a Series or DataFrame depending on your operation.

Grouping data by single columns for categorical analysis

Single-column grouping forms the foundation of most groupby operations. You pass a column name to the groupby method, and pandas creates groups based on that column’s unique values. The resulting GroupBy object supports method chaining, allowing you to select specific columns and apply aggregation functions in a single statement.

import pandas as pd

employee_data = pd.DataFrame({
    'department': ['Engineering', 'Sales', 'Engineering', 'Sales', 'Marketing', 'Engineering', 'Marketing'],
    'employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
    'salary': [95000, 72000, 88000, 68000, 71000, 92000, 69000],
    'performance': [4.2, 3.8, 4.5, 4.0, 3.9, 4.3, 4.1]
})

avg_salary = employee_data.groupby('department')['salary'].mean()
print(avg_salary)
# Output:
# department
# Engineering    91666.666667
# Marketing      70000.000000
# Sales          70000.000000
# Name: salary, dtype: float64

dept_stats = employee_data.groupby('department').agg({
    'salary': ['mean', 'min', 'max'],
    'performance': 'mean'
})
print(dept_stats)

Selecting specific columns after grouping narrows the scope of your analysis. When you write groupby('department')['salary'], you create a SeriesGroupBy object that operates only on the salary column. This approach runs faster than computing statistics across all columns because pandas processes less data.

Grouping by multiple columns for hierarchical segmentation

Multiple-column grouping creates nested segments within your data. You pass a list of column names to groupby, and pandas forms groups based on unique combinations of values across those columns. This technique proves valuable when analyzing data across multiple dimensions, such as examining sales performance by both region and product category.

import pandas as pd

order_data = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West', 'East', 'West'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget', 'Widget'],
    'quarter': ['Q1', 'Q1', 'Q1', 'Q1', 'Q2', 'Q2'],
    'revenue': [15000, 23000, 18000, 21000, 17000, 19000]
})

multi_group = order_data.groupby(['region', 'product'])['revenue'].sum()
print(multi_group)
# Output:
# region  product
# East    Gadget     23000
#         Widget     32000
# West    Gadget     21000
#         Widget     37000
# Name: revenue, dtype: int64

quarterly_analysis = order_data.groupby(['region', 'quarter']).agg({
    'revenue': 'sum',
    'product': 'count'
})
print(quarterly_analysis)

The resulting structure uses a MultiIndex, where each group appears as a hierarchical index level. You can reset the index to convert these group keys back into regular columns, making the output easier to work with for further analysis or visualization.

flat_result = multi_group.reset_index()
print(flat_result)
# Output:
#   region product  revenue
# 0   East  Gadget    23000
# 1   East  Widget    32000
# 2   West  Gadget    21000
# 3   West  Widget    37000

Applying aggregation functions to compute group statistics

Aggregation functions reduce each group to a single value. The most common aggregations include sum, mean, count, min, max, and std. You can apply these functions directly to a GroupBy object, and pandas computes the statistic for each group independently.

import pandas as pd

campaign_data = pd.DataFrame({
    'channel': ['Email', 'Social', 'Email', 'Social', 'Search', 'Email', 'Search'],
    'clicks': [1200, 3400, 980, 4100, 2300, 1100, 2800],
    'conversions': [45, 89, 38, 102, 76, 41, 88],
    'spend': [500, 1200, 480, 1350, 890, 510, 920]
})

channel_metrics = campaign_data.groupby('channel').agg({
    'clicks': 'sum',
    'conversions': 'sum',
    'spend': 'sum'
})

channel_metrics['conversion_rate'] = (channel_metrics['conversions'] / channel_metrics['clicks'] * 100)
channel_metrics['cost_per_conversion'] = (channel_metrics['spend'] / channel_metrics['conversions'])

print(channel_metrics)
# Output shows aggregated metrics with calculated rates

The agg method accepts a dictionary mapping column names to aggregation functions. This approach lets you apply different aggregations to different columns in a single operation. You can also pass a list of functions to compute multiple statistics for the same column.

detailed_stats = campaign_data.groupby('channel')['clicks'].agg(['sum', 'mean', 'count', 'std'])
print(detailed_stats)
# Output:
#          sum        mean  count         std
# channel                                     
# Email   3280  1093.333333      3  115.901699
# Search  5100  2550.000000      2  353.553391
# Social  7500  3750.000000      2  494.974747

Accessing and iterating through grouped data

The GroupBy object provides methods to inspect and access individual groups. The groups attribute returns a dictionary mapping group keys to row indices, while the get_group method retrieves a specific group as a DataFrame.

import pandas as pd

inventory_data = pd.DataFrame({
    'warehouse': ['A', 'B', 'A', 'C', 'B', 'C', 'A'],
    'product': ['Laptop', 'Monitor', 'Mouse', 'Laptop', 'Laptop', 'Mouse', 'Monitor'],
    'quantity': [45, 78, 120, 38, 52, 95, 63],
    'value': [67500, 31200, 3600, 57000, 78000, 2850, 25200]
})

grouped = inventory_data.groupby('warehouse')

warehouse_a = grouped.get_group('A')
print(warehouse_a)
# Output shows all rows where warehouse is 'A'

for warehouse, group_df in grouped:
    print(f"\n{warehouse}:")
    print(group_df[['product', 'quantity']].to_string(index=False))

Iterating through groups returns tuples containing the group key and the corresponding DataFrame subset. This pattern works well when you need to perform custom operations on each group or generate separate outputs per group.

Creating custom aggregations with apply and lambda functions

Beyond built-in aggregation methods, you can define custom functions to compute specialized statistics. The apply method executes your function on each group and combines the results.

import pandas as pd
import numpy as np

performance_data = pd.DataFrame({
    'team': ['Alpha', 'Beta', 'Alpha', 'Beta', 'Alpha', 'Beta'],
    'metric': [0.85, 0.92, 0.78, 0.88, 0.91, 0.95],
    'baseline': [0.75, 0.80, 0.75, 0.80, 0.75, 0.80]
})

def performance_improvement(group):
    improvement = ((group['metric'] - group['baseline']) / group['baseline'] * 100).mean()
    return improvement

team_improvement = performance_data.groupby('team').apply(performance_improvement)
print(team_improvement)
# Output:
# team
# Alpha    15.555556
# Beta     15.000000
# dtype: float64

weighted_score = performance_data.groupby('team').apply(
    lambda x: (x['metric'] * np.arange(1, len(x) + 1)).sum() / np.arange(1, len(x) + 1).sum()
)
print(weighted_score)

Custom functions receive each group as a DataFrame, allowing you to perform complex calculations that reference multiple columns or require conditional logic. The function’s return value can be a scalar, Series, or DataFrame depending on your analysis needs.

Filtering groups based on group-level conditions

The filter method removes entire groups based on criteria you define. Unlike row-level filtering with boolean indexing, group filtering evaluates conditions at the group level and either keeps or discards all rows in each group.

import pandas as pd

transaction_data = pd.DataFrame({
    'customer': ['C001', 'C002', 'C001', 'C003', 'C002', 'C003', 'C001'],
    'amount': [150, 2200, 340, 180, 3100, 220, 280],
    'date': pd.date_range('2024-01-01', periods=7)
})

high_value_customers = transaction_data.groupby('customer').filter(
    lambda x: x['amount'].sum() > 1000
)
print(high_value_customers)
# Output shows only customers with total transactions over 1000

frequent_customers = transaction_data.groupby('customer').filter(
    lambda x: len(x) >= 3
)
print(frequent_customers)

The filter function receives each group as a DataFrame and must return a boolean value. When the function returns True, pandas includes all rows from that group in the output. When it returns False, pandas excludes the entire group.

Transforming data within groups while preserving DataFrame shape

Transformation operations apply functions to groups and return results with the same shape as the original DataFrame. This differs from aggregation, which reduces groups to single values. Transformations prove useful for tasks like calculating group-relative metrics, filling missing values with group statistics, or standardizing values within categories.

import pandas as pd

student_data = pd.DataFrame({
    'class': ['A', 'A', 'B', 'B', 'A', 'B'],
    'student': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'score': [85, 92, 78, 88, 90, 82]
})

student_data['class_mean'] = student_data.groupby('class')['score'].transform('mean')
student_data['deviation_from_mean'] = student_data['score'] - student_data['class_mean']
student_data['percentile_in_class'] = student_data.groupby('class')['score'].transform(
    lambda x: x.rank(pct=True) * 100
)

print(student_data)
# Output shows original rows with added transformation columns

The transform method broadcasts group-level statistics back to individual rows. When you calculate the class mean and assign it to a new column, each student receives their class’s mean value. This enables row-level comparisons against group benchmarks without changing the DataFrame’s structure.

def standardize(x):
    return (x - x.mean()) / x.std()

student_data['standardized_score'] = student_data.groupby('class')['score'].transform(standardize)
print(student_data[['student', 'class', 'score', 'standardized_score']])

Optimizing groupby performance for large datasets

Performance considerations become critical when working with large datasets. Several strategies can significantly improve groupby operation speed. Converting string columns to categorical dtype reduces memory usage and accelerates grouping because pandas works with integer category codes rather than string comparisons.

import pandas as pd
import numpy as np

# Create larger dataset for performance testing
large_data = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C', 'D'], size=100000),
    'subcategory': np.random.choice(['X', 'Y', 'Z'], size=100000),
    'value': np.random.randn(100000)
})

# Convert to categorical for better performance
large_data['category'] = large_data['category'].astype('category')
large_data['subcategory'] = large_data['subcategory'].astype('category')

result = large_data.groupby(['category', 'subcategory'])['value'].mean()

Setting sort=False in the groupby call skips the automatic sorting of group keys. Sorting adds overhead that you can eliminate when the order of results doesn’t matter for your analysis. For operations that benefit from sorted groups, pandas still maintains row order within each group regardless of the sort parameter.

# Skip sorting for better performance when order doesn't matter
result = large_data.groupby('category', sort=False)['value'].agg(['mean', 'std', 'count'])

Selecting specific columns before grouping reduces the amount of data pandas processes. Instead of grouping the entire DataFrame and then selecting columns during aggregation, filter columns first to minimize memory overhead and computation time.

# More efficient: select columns first
result = large_data[['category', 'value']].groupby('category')['value'].mean()

# Less efficient: group entire DataFrame
result = large_data.groupby('category')['value'].mean()

The pandas groupby method transforms complex data analysis tasks into concise, readable operations. By mastering split-apply-combine workflows, custom aggregations, transformations, and performance optimizations, you gain precise control over categorical data analysis. These techniques extend beyond simple statistics to support sophisticated analytical workflows that would require significantly more code using manual iteration approaches.