How to Use Pandas Cut in Python?

When analysing a large dataset, it often helps to split the values into categories defined by intervals of a chosen width. The pandas library in Python provides a function for exactly this task.

Enter the cut( ) function! It segments a given set of values and sorts them into different bins. Moreover, it offers the flexibility to either split the data into a given number of equal-width bins or to use a pre-defined array of bin edges.

Let us get started by importing the pandas library using the code below.

import pandas as pd

We shall then explore the function through the following sections.

Syntax of the cut( ) function
Use cases for the cut( ) function

**Syntax of the cut( ) function**

Like any other function within the pandas library, the cut( ) function has both mandatory and optional parameters. Its syntax is given below, followed by a description of each parameter.

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)

where,

x – The one-dimensional input array (or Series) that is to be binned
bins – Either an integer giving the number of equal-width bins, a sequence of bin edges (which allows non-uniform widths), or an IntervalIndex defining the exact bins
right – True by default; specifies whether each bin includes its right edge. When set to False, each bin includes its left edge instead
labels – None by default; specifies the labels for the resulting bins, one per bin. When set to False, integer bin indicators are returned instead of labels
retbins – False by default; when set to True, the bin edges are returned along with the binned result
precision – 3 by default; the precision at which the bin labels are stored and displayed
include_lowest – False by default; specifies whether the first bin should include its left edge
duplicates – 'raise' by default, which raises a ValueError when the bin edges given are not unique. It can be set to 'drop' to drop the non-unique edges
ordered – True by default; specifies whether the labels are ordered or not
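
To make a few of these optional parameters concrete, here is a minimal sketch. The series `values` and the bin edges are made up for illustration and are not part of the dataset used later in this article.

```python
import pandas as pd

# Hypothetical values chosen so that some sit exactly on a bin edge
values = pd.Series([0, 5, 10, 15, 20])

# Default behaviour: bins are (0, 10] and (10, 20], so 0 lies outside the
# first bin and comes back as NaN
default = pd.cut(values, bins=[0, 10, 20])

# include_lowest=True makes the first bin include its left edge, so 0 is kept
with_lowest = pd.cut(values, bins=[0, 10, 20], include_lowest=True)

# labels=False returns integer bin indicators instead of interval labels
codes = pd.cut(values, bins=[0, 10, 20], labels=False, include_lowest=True)

# duplicates='drop' removes the repeated edge (10) instead of raising ValueError
deduped = pd.cut(values, bins=[0, 10, 10, 20], duplicates='drop', include_lowest=True)

print(default)
print(with_lowest)
print(codes)
print(deduped)
```

Comparing the first two results shows why include_lowest matters whenever the smallest value sits exactly on the first edge.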

**Use cases for the cut( ) function**

The following dataset shall be split using the cut( ) function.

df = pd.DataFrame({'score':[60, 87, 49, 51, 69, 74, 92, 55, 63, 78, 47, 86]})

Let us first try to split the above dataset into 4 bins of equal width. This can be done using the code below.

Catg = pd.cut(df['score'], 4)

Once done, let us have a look at the results.

print(Catg)

Output:
0       (58.25, 69.5]
1       (80.75, 92.0]
2     (46.955, 58.25]
3     (46.955, 58.25]
4       (58.25, 69.5]
5       (69.5, 80.75]
6       (80.75, 92.0]
7     (46.955, 58.25]
8       (58.25, 69.5]
9       (69.5, 80.75]
10    (46.955, 58.25]
11      (80.75, 92.0]
Name: score, dtype: category
Categories (4, interval[float64, right]): [(46.955, 58.25] < (58.25, 69.5] < (69.5, 80.75] < (80.75, 92.0]]

It can be observed that pandas has split the dataset into 4 bins of width 11.25 each, which is the range of the data (92 - 47 = 45) divided by 4. The lowest edge is 46.955 rather than 47 because pandas extends the range slightly (by 0.1% of the range) so that the minimum value falls inside the first bin, which makes that bin marginally wider. Also, each bin starts with a round bracket '(' and ends with a square bracket ']'. This is the mathematical notation for an interval that excludes its left endpoint and includes its right endpoint.
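
The bin width can be checked directly by asking cut( ) to return the edges it computed through the retbins option. The following is a small sketch; the names `scores`, `edges` and `width` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.Series([60, 87, 49, 51, 69, 74, 92, 55, 63, 78, 47, 86], name='score')

# retbins=True returns the computed bin edges as a second value
catg, edges = pd.cut(scores, 4, retbins=True)

# Width of an equal-width bin: the range of the data divided by the bin count
width = (scores.max() - scores.min()) / 4

print(edges)           # the five edges that delimit the four bins
print(width)           # expected bin width
print(np.diff(edges))  # actual widths; the first is slightly larger
```

The first edge comes out just below the minimum score, which is why the first interval in the output above starts at 46.955.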

But what if one wants to customise the bins and give each bin a name? This can be done using the bins and labels options of the cut( ) function.

Catg = pd.cut(df['score'], bins=[0, 40, 60, 100], labels=['Fail', 'Pass', 'Distinction'])
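
To inspect the result of this call, the labels can be placed next to the scores and counted per category. This is a minimal sketch that rebuilds the same dataset so it runs on its own; the names `catg`, `result` and `counts` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'score': [60, 87, 49, 51, 69, 74, 92, 55, 63, 78, 47, 86]})

# Same bins and labels as in the article
catg = pd.cut(df['score'], bins=[0, 40, 60, 100],
              labels=['Fail', 'Pass', 'Distinction'])

# Show each score next to the label it received
result = df.assign(category=catg)
print(result)

# Count the scores per label; categories with no scores are reported as 0
counts = catg.value_counts(sort=False)
print(counts)
```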

Each value, in the order of the input dataset, is compared against the new bins (0, 40], (40, 60], and (60, 100], and the corresponding label is returned at its position. But let's say there is a requirement not to include the right edge in each bin. Then, the code is to be changed as shown below.

Catg = pd.cut(df['score'], bins=[0, 40, 60, 100], labels=['Fail', 'Pass', 'Distinction'], right=False)

Since the first entry in the dataset is 60, it now falls under the 'Distinction' category, whose bin becomes [60, 100) once the right option is set to False. Earlier it fell under the 'Pass' category, whose bin was (40, 60].
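
The effect of the right option is easiest to see on values that sit exactly on a bin edge. The sketch below uses three made-up scores (60, 40 and 100), which are not taken from the dataset above, and compares both settings. One thing to watch for: with right set to False, the last bin is [60, 100), so a score of exactly 100 falls outside every bin and is returned as NaN.

```python
import pandas as pd

# Hypothetical scores that sit exactly on the bin edges
edge_scores = pd.Series([60, 40, 100])
bins = [0, 40, 60, 100]
labels = ['Fail', 'Pass', 'Distinction']

# right=True (default): bins are (0, 40], (40, 60], (60, 100]
closed_right = pd.cut(edge_scores, bins=bins, labels=labels)

# right=False: bins are [0, 40), [40, 60), [60, 100)
closed_left = pd.cut(edge_scores, bins=bins, labels=labels, right=False)

comparison = pd.DataFrame({
    'score': edge_scores,
    'right=True': closed_right,
    'right=False': closed_left,
})
print(comparison)
```

If the top score must stay in the last category under right=False, one option is to set the final edge slightly above the maximum possible score.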

Conclusion

Now that we have reached the end of this article, we hope it has shown how to use the cut( ) function from the pandas library. Here's another article on replacing multiple values using the pandas library in Python. There are numerous other enjoyable and equally informative articles on AskPython that might be of great help to those who are looking to level up in Python. Carpe diem!

Official documentation