When analysing a large dataset, it is often easier to split the values into categories, each covering an interval of the desired length. The pandas library in Python provides a function for exactly this task.
Enter the cut( ) function! It segments a given dataset and sorts the values into bins. It also offers the flexibility to either split the data into a chosen number of equal-width bins or to use a pre-defined array of bin edges.
Let us get started by importing the pandas library using the code below.
import pandas as pd
Thereafter, we shall explore the function in each of the following sections.
- Syntax of the cut( ) function
- Use cases for the cut( ) function
Syntax of the cut( ) function
Like other functions in the pandas library, the cut( ) function takes a mix of mandatory and optional parameters. Its syntax is given below, followed by a description of each parameter.
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
where,
- x: The one-dimensional input array that is to be split
- bins: Either an integer giving the number of equal-width bins, a sequence of bin edges (which allows non-uniform widths), or an IntervalIndex defining the exact bins to use
- right: Set to True by default, this optional parameter specifies whether each bin includes its rightmost edge
- labels: Set to None by default, this optional parameter specifies the labels for the bins. Passing False returns integer bin indicators instead
- retbins: Set to False by default, this optional parameter specifies whether the bin edges are returned along with the result
- precision: An optional parameter set to 3 by default, giving the precision at which the bin labels are stored and displayed
- include_lowest: An optional parameter set to False by default, specifying whether the first bin should include its leftmost edge
- duplicates: Set to 'raise' by default, this optional parameter raises a ValueError when the bin edges given are not unique. It can be set to 'drop' to drop the non-unique edges
- ordered: An optional parameter set to True by default, specifying whether the labels given are ordered or not
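To make a few of these optional parameters concrete, here is a minimal sketch of retbins, include_lowest and duplicates in action. The sample values and variable names are assumptions made up for illustration and are not part of the dataset used later in this article.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration
values = pd.Series([0, 5, 10, 15, 20])

# retbins=True returns a tuple: the binned data and the bin edges used
binned, edges = pd.cut(values, bins=[0, 10, 20], retbins=True)
print(edges)
print(binned)  # the value 0 is NaN because the first bin (0, 10] excludes 0

# include_lowest=True makes the first bin include its left edge,
# so the value 0 is now assigned to the first bin
binned_low = pd.cut(values, bins=[0, 10, 20], include_lowest=True)
print(binned_low)

# duplicates='raise' (the default) rejects repeated bin edges
try:
    pd.cut(values, bins=[0, 10, 10, 20])
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)

# duplicates='drop' removes the repeated edge instead of raising
deduped = pd.cut(values, bins=[0, 10, 10, 20], duplicates='drop')
print(deduped.cat.categories)
```

Note that right, include_lowest and duplicates only affect how the edges are treated; the values in x are never modified.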
Use cases for the cut( ) function
The following dataset shall be split using the cut( ) function.
df = pd.DataFrame({'score':[60, 87, 49, 51, 69, 74, 92, 55, 63, 78, 47, 86]})
Let us first split the above dataset into 4 bins of equal width. This can be done with the code below.
Catg = pd.cut(df['score'], 4)
Once done, let us have a look at the results.
print(Catg)
Output:
0 (58.25, 69.5]
1 (80.75, 92.0]
2 (46.955, 58.25]
3 (46.955, 58.25]
4 (58.25, 69.5]
5 (69.5, 80.75]
6 (80.75, 92.0]
7 (46.955, 58.25]
8 (58.25, 69.5]
9 (69.5, 80.75]
10 (46.955, 58.25]
11 (80.75, 92.0]
Name: score, dtype: category
Categories (4, interval[float64, right]): [(46.955, 58.25] < (58.25, 69.5] < (69.5, 80.75] < (80.75, 92.0]]
It can be observed that pandas has split the dataset into 4 bins of width 11.25 each, with the lowest edge moved down slightly (from 47 to 46.955, which is 0.1% of the data range) so that the minimum score falls inside the first bin. Also, each bin starts with a round bracket '(' and ends with a square bracket ']'. This is the mathematical notation for an interval that excludes the value following the round bracket and includes the value before the square bracket.
But what if one wants to customize the bins and put a name on each of them? This can be done using the bins and labels parameters of the cut( ) function.
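The bin edges and the number of scores in each bin can also be checked directly. The following is a minimal sketch that reuses the dataset from above; the names catg, edges, widths and counts are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'score': [60, 87, 49, 51, 69, 74, 92, 55, 63, 78, 47, 86]})

# retbins=True also returns the five edges that delimit the four bins
catg, edges = pd.cut(df['score'], 4, retbins=True)
print(edges)

# Width of each bin: the difference between consecutive edges
widths = edges[1:] - edges[:-1]
print(widths.round(3))

# Number of scores per bin, ordered from the lowest bin to the highest
counts = catg.value_counts().sort_index()
print(counts)  # 4, 3, 2 and 3 scores respectively
```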
Catg = pd.cut(df['score'], bins=[0, 40, 60, 100], labels=['Fail', 'Pass', 'Distinction'])
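To inspect the outcome, the labels can be placed next to the original scores. This is a minimal sketch that rebuilds the dataset so it runs on its own; the column name grade is an assumption used only for display.

```python
import pandas as pd

df = pd.DataFrame({'score': [60, 87, 49, 51, 69, 74, 92, 55, 63, 78, 47, 86]})

# Custom bin edges with one label per bin: (0, 40], (40, 60], (60, 100]
catg = pd.cut(df['score'], bins=[0, 40, 60, 100],
              labels=['Fail', 'Pass', 'Distinction'])

# Show each score beside its label ('grade' is an illustrative column name)
result = df.assign(grade=catg)
print(result)
```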

On closer inspection, one can see that each value, in the order of the input dataset, has been compared against the new bins (0, 40], (40, 60], and (60, 100], and the corresponding label has been returned at its position. But let's say there is a requirement not to include the rightmost value in each bin. Then the code is to be changed as shown below.
Catg = pd.cut(df['score'], bins=[0, 40, 60, 100], labels=['Fail', 'Pass', 'Distinction'], right=False)

Since the first entry in the dataset is 60, it now falls under the 'Distinction' category, which covers the interval [60, 100) once the right option is set to False. Earlier it fell under the 'Pass' category, which covered the interval (40, 60].
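The effect of the right option can be confirmed by running both variants and keeping only the rows whose label differs. This is a minimal sketch; the names closed_right, closed_left and changed are assumptions made for illustration.

```python
import pandas as pd

scores = pd.Series([60, 87, 49, 51, 69, 74, 92, 55, 63, 78, 47, 86], name='score')
bins = [0, 40, 60, 100]
labels = ['Fail', 'Pass', 'Distinction']

# right=True (default): bins are (0, 40], (40, 60], (60, 100]
closed_right = pd.cut(scores, bins=bins, labels=labels)
# right=False: bins are [0, 40), [40, 60), [60, 100)
closed_left = pd.cut(scores, bins=bins, labels=labels, right=False)

comparison = pd.DataFrame({'score': scores,
                           'right=True': closed_right,
                           'right=False': closed_left})

# Keep only the rows where the two settings disagree
changed = comparison[closed_right.astype(str) != closed_left.astype(str)]
print(changed)  # only the score of 60, which sits exactly on a bin edge
```

Only values that sit exactly on a bin edge are affected by this option. With right=False, a score of exactly 100 would fall outside the last bin [60, 100) and come back as NaN.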
Conclusion
Now that we have reached the end of this article, we hope it has clarified how to use the cut( ) function from the pandas library. Here's another article on replacing multiple values using the pandas library in Python. There are numerous other enjoyable and equally informative articles on AskPython that might be of great help to those who are looking to level up in Python. Carpe diem!



