Pandas Factorize: Introduction (With Examples)

If you need to encode an object as an enumerated type or a categorical variable in pandas, factorize() is the function to reach for. It identifies the distinct values in the input array and maps each one to an integer code.

It is available both as a top-level function, pandas.factorize(), and as a method, Series.factorize() or Index.factorize().

This article explores the factorize() function and demonstrates it with suitable examples. So, let us start things off by importing the pandas library using the code below.

import pandas as pd

Thereafter, we shall explore the factorize() function through each of the following sections.

Syntax of the factorize() function
Use cases for the factorize() function
- Deploying Using Default Setting
- Extracting the Code & Unique Part of the Result
- Sorting the Result
- Factorizing ‘None’ values

Syntax of the factorize() function

Note that the input values passed to the factorize() function must be one-dimensional. Each input value is then mapped to its position in a set of unique values. Following is the syntax, listing the mandatory and optional parameters of the function.

pandas.factorize(values, sort=False, na_sentinel=_NoDefault.no_default, use_na_sentinel=_NoDefault.no_default, size_hint=None)

where,

values – a one-dimensional sequence containing the input values that are to be enumerated
sort – set to ‘False’ by default; when set to ‘True’, the unique values are sorted and the codes are rearranged to match
na_sentinel – the value used to mark missing values in the codes (‘-1’ by default); when set to ‘None’, NaN values are not dropped and stay among the unique values. This parameter was deprecated in pandas 1.5 and removed in pandas 2.0
use_na_sentinel – when ‘True’, missing values are marked with the sentinel ‘-1’; when ‘False’, NaN values are encoded as non-negative integers and are not dropped from the set of unique values
size_hint – an optional hint to the hashtable sizer
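As a minimal sketch of the missing-value behaviour described above, the snippet below factorizes the same input with and without the sentinel. It assumes pandas 1.5 or later, where use_na_sentinel is available; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative input with one missing entry (assumption: pandas >= 1.5).
values = np.array(['Q', None, 'W', 'Q'], dtype=object)

# Default behaviour: the missing entry is coded as -1 and is
# left out of the unique values.
codes_default, uniques_default = pd.factorize(values)

# use_na_sentinel=False: the missing entry gets its own non-negative
# code and is kept among the unique values.
codes_keep, uniques_keep = pd.factorize(values, use_na_sentinel=False)

print("With sentinel:   ", codes_default, uniques_default)
print("Without sentinel:", codes_keep, uniques_keep)
```

With the sentinel, the unique values contain only 'Q' and 'W'; without it, the set of unique values has one extra entry for the missing value.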

Use Cases For the Pandas factorize() function

Let’s now look at the major use cases of the pandas factorize() function in Python.

1. Deploying using default setting

Let us construct an input array with a set of values and run it through the factorize() function using the code given below.

import pandas as pd
import numpy as np
ar1 = np.array(['Q', 'W', 'E', 'W', 'Q', 'Y'])
R = pd.factorize(ar1)
print(R)

Output:
(array([0, 1, 2, 1, 0, 3]), array(['Q', 'W', 'E', 'Y'], dtype=object))
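The method forms mentioned in the introduction behave the same way. The sketch below, using hypothetical variable names, calls factorize() on a Series and on an Index built from the same values. One difference worth knowing: when the input is a pandas object, the unique values come back as a pandas Index rather than a NumPy array.

```python
import pandas as pd

labels = ['Q', 'W', 'E', 'W', 'Q', 'Y']

# Method form on a Series.
s = pd.Series(labels)
s_codes, s_uniques = s.factorize()

# Method form on an Index.
idx = pd.Index(labels)
i_codes, i_uniques = idx.factorize()

print("Series codes:", s_codes)
print("Series uniques:", list(s_uniques))
print("Index codes:", i_codes)
print("Index uniques:", list(i_uniques))
```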

2. Extracting the Code & Unique part of the Result

The above result is a tuple of two parts, namely the codes and the uniques. The uniques hold each distinct value of the input exactly once, in order of first appearance, whereas the codes hold one integer per input element, repeated values included, giving the position of that element in the uniques. The following demonstration might put things in better perspective!

ar1 = np.array(['Q', 'W', 'E', 'W', 'Q', 'Y'])
codes, uniques = pd.factorize(ar1)
print("Code part:", codes)
print("Unique part:", uniques)

Output:
Code part: [0 1 2 1 0 3]
Unique part: ['Q' 'W' 'E' 'Y']
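Because each code is a position in the uniques, the original values can be recovered by indexing the uniques with the codes. The following is a small sketch of that round trip; the names restored and label_to_code are illustrative, not part of pandas.

```python
import numpy as np
import pandas as pd

ar1 = np.array(['Q', 'W', 'E', 'W', 'Q', 'Y'])
codes, uniques = pd.factorize(ar1)

# Index the uniques with the codes to decode back to the input values.
restored = uniques[codes]
print("Restored:", list(restored))

# A plain dictionary from label to code, handy for encoding new data
# consistently with the same mapping.
label_to_code = {label: code for code, label in enumerate(uniques)}
print("Mapping:", label_to_code)
```

This round trip only holds when the input has no missing values, since a code of -1 does not point at a valid position in the uniques.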

3. Sorting the Result

In this section, we shall make use of the sort option of the factorize() function to return the unique values in sorted order. The code below shows how.

ar1 = np.array(['Q', 'W', 'E', 'W', 'Q', 'Y'])
codes, uniques = pd.factorize(ar1, sort=True)
print("Code part:", codes)
print("Unique part:", uniques)

Following is the result when the above code is run.

Code part: [1 2 0 2 1 3]
Unique part: ['E' 'Q' 'W' 'Y']

Comparing this result with that of the previous section, one can observe that both parts of the result differ though the input is the same in both cases.

The unique part is now arranged in alphabetical order, and the code part changes to match. ‘E’, which comes first in alphabetical order, is assigned the number zero, then ‘Q’ is assigned ‘1’, and so on.
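For readers familiar with NumPy, the sorted result can be compared with numpy.unique() called with return_inverse=True, which also returns sorted unique values plus the position of each input element among them. The sketch below checks that the two agree for this input; it assumes the data has no missing values, which the two functions handle differently.

```python
import numpy as np
import pandas as pd

ar1 = np.array(['Q', 'W', 'E', 'W', 'Q', 'Y'])

# pandas: sorted unique values and matching codes.
codes, uniques = pd.factorize(ar1, sort=True)

# NumPy: sorted unique values and the inverse indices.
np_uniques, inverse = np.unique(ar1, return_inverse=True)

print("Same codes:", list(codes) == list(inverse))
print("Same uniques:", list(uniques) == list(np_uniques))
```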

4. Factorizing ‘None’ values

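The places where there are no values are returned as ‘-1’ in the codes by default and are left out of the unique part. In pandas versions before 2.0, one can change this value by using the na_sentinel option as shown below. Note that na_sentinel was deprecated in pandas 1.5 and removed in pandas 2.0, so this call fails on pandas 2.0 and later.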

ar2 = np.array(['Q', 'W', 'E', None, 'Q', 'Y'])
codes, uniques = pd.factorize(ar2, na_sentinel=77)
print("Code part:", codes)
print("Unique part:", uniques)

Output:
Code part: [ 0  1  2 77  0  3]
Unique part: ['Q' 'W' 'E' 'Y']
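On pandas versions where na_sentinel is not available, a similar result can be obtained by factorizing with the default sentinel and then replacing ‘-1’ afterwards. The following is one possible sketch of that approach, not an official replacement; the name custom_codes is illustrative.

```python
import numpy as np
import pandas as pd

ar2 = np.array(['Q', 'W', 'E', None, 'Q', 'Y'], dtype=object)

# Default behaviour: the missing entry is coded as -1.
codes, uniques = pd.factorize(ar2)

# Swap the default sentinel for a custom marker value.
custom_codes = np.where(codes == -1, 77, codes)

print("Code part:", custom_codes)
print("Unique part:", uniques)
```

The marker should be a value that cannot collide with a real code, that is, a negative number or one at least as large as the number of unique values.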

Summary

We have reached the end of this article, and hopefully it has shown how to use the factorize() function from the pandas library. Here’s another article that details the usage of the get_dummies() function from the pandas library in Python. There are numerous other enjoyable and equally informative articles on AskPython that might be of great help to those who are looking to level up in Python. Audere est facere!