Pandas crosstab – Simple cross-tabulation for multiple factors

Pandas Crosstab

In this post, let’s try to learn how to use the Pandas crosstab() function, one of the Pandas package’s general functions. To modify and understand data, the Python computer language’s Pandas software package is employed.

The terms “Pandas” are intended to refer to both “Panel Data” and “Python Data Analysis.” It offers specific approaches and data structures for working with time series and mathematical tables.

Functions of the Pandas package are useful to work o data and data frames. One such function that helps to make customized data frames is the crosstab() function. It is quite similar to the Pandas pivot_table() function. The use of crosstab() is preferred when the input data is not in data frame format, rather it’s in the form of an array, list, or series.

Why is Pandas crosstab() used?

This function is used to return the data that are cross-tabulated in data frame format. A frequency table of the variables is generated by default unless an array of values and an aggregation function are provided.

Even if a specific category isn’t present in the data itself, any input that contains categorical data will have all of its categories included in the cross-tabulation. In the absence of specified row or column names for the cross-tabulation, any Series passed will have its name attributes used. And an empty DataFrame will be returned if there are no overlapping indexes.

Syntax of Pandas crosstab

pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
  • index: array-like, Series, or list of arrays or Series
    • values for the rows to be grouped by.
  • columns: array-like, Series, or list of arrays or Series
    • values in the columns to group by.
  • values: array-like, optional
    • An array of numbers that will be combined based on the factors asks that aggfunc (aggregation function) be given.
  • rownames: sequence, default None, optional
    • If passed, must equal the quantity of passed row arrays.
  • colnames: sequence, default None, optional
    • If passed, the number of column arrays must match the number that was provided.
  • aggfunc: function, optional
    • (aggregation function) Requires values to be given if specified.
  • margins: bool, default False, optional
    • Add columns and row margins (subtotals)
  • margins_name: str, default ‘All’, optional
    • When the margin is True, the name of the row or column that will hold the totals.
  • dropna: bool, default True, optional
    • Columns with only NaN values in them shouldn’t be included.
  • normalize: bool, {‘all’, ‘index’, ‘columns’}, or {0,1}, default False, optional
    • reduce by dividing each value by the total number of values.
      • Will normalize across all values if ‘all’ or ‘True’ are passed.
      • ‘index’ will normalize over each row if it is passed.
      • ‘columns’ will normalize over each column if it is passed.

Implementing Pandas crosstab() with Examples

Make sure to import the Pandas package in your IDE before implementing the function. To do so, run the following code line first.

import pandas as pd

Example 1: Passing only required arguments

car_names = ['Suzuki','Honda','Toyota','Jeep','Tata']
fuel_type = ['diesel','petrol','diesel','petrol','diesel']
price = [10, 18, 27, 35, 15]

pd.crosstab(index=[car_names], columns=[fuel_type])
Example 1: Passing only required arguments
Example 1: Passing only required arguments

Example 2: Passing other arguments

car_names = ['Suzuki','Honda','Toyota','Jeep','Tata', 'Ford']
fuel_type = ['diesel','petrol','diesel','petrol','diesel', '']
price = [10, 18, 27, 35, 15, pd.NA]

pd.crosstab(index=[car_names], columns=[fuel_type], values='price', aggfunc=sum)
Example 2: Passing other arguments
Example 2: Passing other arguments

Note: in the above example we haven’t mentioned the ‘dropna’ parameter, which by default is set to ‘True’. It makes sure that columns/rows with only NaN values in them shouldn’t be included. Hence, ‘Ford’ is omitted from the output table.

pd.crosstab(index=[car_names],
            columns=[fuel_type],
            values=price,
            aggfunc=lambda x: x.sum()*2, 
            rownames=["Car Names"], # title of rows
            colnames=['Fuel type'], # title of columns
            dropna=False, #false will display all NaN columns
            margins=True, # Subtotals at the end
            margins_name="Totals") # subtotal's title
Example 3: Passing other arguments
Example 3: Passing other arguments

Note: In the above example ‘dropna’ is set to ‘False’, hence the ‘Ford’ column is included in the output table.

Also check: Pandas fillna() Method – A Complete Guide

Summary

Pandas general functions are great to help when dealing with data and data frames. crosstab() is one such function that helps you build your own customized data frames or tables from list-like or array input. crosstab() will by default count the frequency of intersections. It also has many other parameters which help to create detailed data frames with specifications. For more such detailed articles on the Pandas package and Python Programming language click here!

Reference

https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html