How to Use Pandas Cut in Python?


When analysing a large dataset, it often helps to split the values into categories defined by intervals of a desired length. The pandas library of Python has a function that does exactly this.

Enter the cut() function! It segments a given dataset and sorts the values into different bins. Moreover, it offers the flexibility to either split the data into a chosen number of equal-width bins or to use a pre-defined array of bin edges.

Let us get started by importing the pandas library using the code below.

import pandas as pd

We shall then explore the function through each of the following sections.

  • Syntax of the cut() function
  • Use cases for the cut() function



Syntax of the cut() function

Like any other function within the pandas library, cut() takes a mix of mandatory and optional parameters. Given below is its syntax with each of those parameters.

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)

where,

  • x – A one-dimensional input array that is to be binned
  • bins – Either an integer giving the number of equal-width bins, a sequence of bin edges (which allows non-uniform widths), or an IntervalIndex defining the exact bins
  • right – Set to True by default, this optional parameter specifies whether each bin includes its right edge
  • labels – Set to None by default, this optional parameter specifies the labels for the bins
  • retbins – Set to False by default, this optional parameter specifies whether the bin edges are returned along with the result
  • precision – Set to 3 by default, this optional parameter is the precision at which the bin labels are stored and displayed
  • include_lowest – Set to False by default, this optional parameter specifies whether the first bin should also include its left edge
  • duplicates – Set to 'raise' by default, this optional parameter raises a ValueError when the bin edges given are not unique. It can be set to 'drop' to drop the non-unique edges
  • ordered – Set to True by default, this optional parameter specifies whether the labels given are ordered or not
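A few of the optional parameters are easier to grasp when seen in action. The following is a minimal sketch, not taken from the original article: the sample values and variable names are hypothetical and chosen only to make the effect of include_lowest, retbins and duplicates visible.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration
values = pd.Series([0, 5, 10])

# By default the first bin is (0, 5], so the value 0 falls outside every bin
default = pd.cut(values, bins=[0, 5, 10])
print(default.isna().tolist())   # [True, False, False]

# include_lowest=True makes the first bin include its left edge as well
closed = pd.cut(values, bins=[0, 5, 10], include_lowest=True)
print(closed.isna().tolist())    # [False, False, False]

# retbins=True returns the bin edges alongside the binned result
binned, edges = pd.cut(values, bins=2, retbins=True)
print(edges)

# duplicates='drop' removes the repeated edge 5 instead of raising a ValueError
deduped = pd.cut(values, bins=[0, 5, 5, 10], duplicates='drop', include_lowest=True)
print(deduped.cat.categories)
```

Values that fall outside every bin come back as NaN, which is why isna() is a convenient way to spot them here.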


Use cases for the cut() function

The following dataset shall be split using the cut() function.

df = pd.DataFrame({'score':[60, 87, 49, 51, 69, 74, 92, 55, 63, 78, 47, 86]})

Let us first split the above dataset into 4 bins of equal width. This can be done with the code below.

Catg = pd.cut(df['score'], 4)

Once done, let us have a look at the results.

print(Catg)

Output:
0       (58.25, 69.5]
1       (80.75, 92.0]
2     (46.955, 58.25]
3     (46.955, 58.25]
4       (58.25, 69.5]
5       (69.5, 80.75]
6       (80.75, 92.0]
7     (46.955, 58.25]
8       (58.25, 69.5]
9       (69.5, 80.75]
10    (46.955, 58.25]
11      (80.75, 92.0]
Name: score, dtype: category
Categories (4, interval[float64, right]): [(46.955, 58.25] < (58.25, 69.5] < (69.5, 80.75] < (80.75, 92.0]]

It can be observed that pandas has split the dataset into 4 bins of width 11.25 each, which is the range of the data (92 − 47 = 45) divided by 4. The lowest edge is 46.955 rather than 47 because pandas extends the range slightly on that side so that the minimum value is included in the first bin. Also, each bin starts with a round bracket ‘(‘ and ends with a square bracket ‘]’. This is the mathematical notation for an interval that excludes the value after the round bracket and includes the value before the square bracket.

But what if one wants to customize the bins and put a name on each of them? This can be done using the bins and labels parameters of the cut() function.
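The bin width and edges can be checked directly. The sketch below assumes the same df as above and uses the retbins parameter from the syntax section to get the edges back; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'score': [60, 87, 49, 51, 69, 74, 92, 55, 63, 78, 47, 86]})

# retbins=True returns the computed bin edges as a second value
catg, edges = pd.cut(df['score'], 4, retbins=True)

# Width of each bin is (max - min) / number of bins = (92 - 47) / 4
width = (df['score'].max() - df['score'].min()) / 4
print(width)   # 11.25

# The edges that pandas actually used for the 4 bins
print(edges)

# How many scores landed in each bin, listed in bin order
counts = catg.value_counts(sort=False)
print(counts)
```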

Catg = pd.cut(df['score'], bins=[0, 40, 60, 100], labels=['Fail', 'Pass', 'Distinction'])
[Figure: Labeled Bins]
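To inspect the labels as text, a sketch along the following lines could be used. It repeats the call above on the same data; the column name grade is hypothetical and used only to place each label next to its score.

```python
import pandas as pd

df = pd.DataFrame({'score': [60, 87, 49, 51, 69, 74, 92, 55, 63, 78, 47, 86]})

catg = pd.cut(df['score'], bins=[0, 40, 60, 100],
              labels=['Fail', 'Pass', 'Distinction'])

# Place each score next to its label for easier reading
# ('grade' is a hypothetical column name used only here)
result = df.assign(grade=catg)
print(result)

# Count of scores per label; 'Fail' stays at 0 because no score is 40 or below
counts = catg.value_counts(sort=False)
print(counts)
```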

With a closer look, one can see that each value, in the order of the input dataset, has been compared against the new bins (0, 40], (40, 60], and (60, 100], and the corresponding label has been returned at the same position. But let’s say there is a requirement not to include the rightmost value in each bin. Then, the code is to be changed as shown below.

Catg = pd.cut(df['score'], bins=[0, 40, 60, 100], labels=['Fail', 'Pass', 'Distinction'], right=False)
[Figure: Change In Results]

The first entry in the dataset is ‘60’. Once the right option is set to ‘False’, the bins become [0, 40), [40, 60), and [60, 100), so this entry now falls under the ‘Distinction’ category. Earlier, with the default setting, it fell under the ‘Pass’ category because the bin (40, 60] included its right edge.
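The effect of the right option can be isolated by running both variants side by side and keeping only the rows whose label changes. This is a sketch on the same data; the column names right_true and right_false are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'score': [60, 87, 49, 51, 69, 74, 92, 55, 63, 78, 47, 86]})
bins = [0, 40, 60, 100]
labels = ['Fail', 'Pass', 'Distinction']

# right=True (default): bins are (0, 40], (40, 60], (60, 100]
right_closed = pd.cut(df['score'], bins=bins, labels=labels)

# right=False: bins are [0, 40), [40, 60), [60, 100)
left_closed = pd.cut(df['score'], bins=bins, labels=labels, right=False)

# Put both results next to the scores as plain strings
comparison = pd.DataFrame({
    'score': df['score'],
    'right_true': right_closed.astype(str),
    'right_false': left_closed.astype(str),
})

# Keep only the rows whose label differs between the two settings
changed = comparison[comparison['right_true'] != comparison['right_false']]
print(changed)
```

Only scores that sit exactly on a bin edge can change category, which in this dataset is the single entry 60.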


Conclusion

We have reached the end of this article, and hopefully it has shown how to use the cut() function from the pandas library. There are numerous other enjoyable and equally informative articles in AskPython that might be of great help to those who are looking to level up in Python. Carpe diem!