Pandas unique – Return unique values based on a hash table

Pandas Unique

Pandas is a popular open-source data manipulation library for Python. It provides a number of useful functions for working with data, including the ability to return the unique values of a Series or a DataFrame column. The pandas unique function finds these values with a hash table, a data structure that supports efficient membership tests and insertions. This article explains how to use the pandas unique function to return unique values based on a hash table.
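The idea behind hash-based deduplication can be illustrated in plain Python. This is only a minimal sketch and not how pandas implements it internally; the function name unique_in_order is made up for this illustration, and a built-in Python set stands in for the hash table.

```python
def unique_in_order(values):
    """Return the distinct items of `values` in order of first appearance."""
    seen = set()   # a set is backed by a hash table (hypothetical stand-in)
    result = []
    for value in values:
        if value not in seen:    # membership test against the hash table
            seen.add(value)      # insertion into the hash table
            result.append(value)
    return result


print(unique_in_order([10, 20, 30, 40, 40, 10, 50]))  # [10, 20, 30, 40, 50]
```

Each element is looked up and inserted once, so the input is processed in a single pass and no sorting is needed.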

Why is Pandas unique() used?

The pandas unique() function returns the unique values of a one-dimensional input, using a hash table internally. The result lists the values in order of first appearance in the input; it is not sorted. This makes the function helpful for understanding cardinality (the number of distinct values in the data). For sufficiently long sequences, it is significantly faster than numpy.unique(). The output also includes NA values.
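The difference in ordering can be seen in a short comparison. This is a small sketch with made-up sample values; it assumes NumPy is installed alongside pandas.

```python
import numpy as np
import pandas as pd

values = np.array([30, 10, 20, 10, 30])

by_appearance = pd.unique(values)   # keeps the order of first appearance
sorted_values = np.unique(values)   # returns the unique values sorted

print(by_appearance)  # [30 10 20]
print(sorted_values)  # [10 20 30]
```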

Syntax of Pandas unique()

pandas.unique(values)
  • Input Parameter: values
    • one-dimensional array-like
  • Return type:
    • Index for Index input
    • Categorical for Categorical input
    • NumPy ndarray for Series or ndarray input
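The return types listed above can be checked with a short script. This is a sketch with made-up inputs; the variable names are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

stamp = pd.Timestamp("20230101", tz="US/Eastern")

# Index input: the result is again an Index
from_index = pd.unique(pd.Index([stamp, stamp]))

# Categorical input (wrapped in a Series): the result is a Categorical
from_categorical = pd.unique(pd.Series(pd.Categorical(list("aab"))))

# Series of plain integers: the result is a NumPy ndarray
from_series = pd.unique(pd.Series([1, 2, 2]))

print(type(from_index).__name__)
print(type(from_categorical).__name__)
print(type(from_series).__name__)
```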

Implementing the Pandas Unique() function

Make sure the pandas package is installed, then import it before running the examples. The following line imports it under the conventional alias pd.

import pandas as pd

Example 1: Index as input

pd.unique(pd.Index([pd.Timedelta(days=1), pd.Timedelta(days=1), pd.NA]))

Output

[Image: output of Example 1: Index as input]

Note that in the above example, even the NA value is included in the output.
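The handling of missing values matters when counting distinct values. The sketch below uses a made-up Series and compares pd.unique with Series.nunique, a related pandas method not covered in this article, which skips missing values by default.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 1.0, np.nan])

uniques = pd.unique(s)

print(uniques)                  # [ 1. nan]
print(len(uniques))             # 2 (NaN is counted once)
print(s.nunique())              # 1 (NaN is dropped by default)
print(s.nunique(dropna=False))  # 2
```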


Example 2: Array and Series as input

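# A NumPy array is provided as input
# (recent pandas versions deprecate passing a plain Python list)
import numpy as np
x = np.array([10, 20, 30, 40, 40, 10, 50])
y = pd.unique(x)
print(x, "\n", y)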

#Series input is provided
x = pd.Series([pd.Timestamp("20230101", tz="US/Eastern"), 
               pd.Timestamp("20230101", tz="US/Eastern"),
               pd.Timestamp("20230101", tz="US/Central")])
y = pd.unique(x)
print(x, "\n\n", y)

Note that in the second example above, pd.Series is used to create the Series that is passed to the function. The two US/Eastern timestamps are identical, so the output contains two unique values.

Output

[Image: output of Example 2: Array and Series as input]

Example 3: Categorical input

pd.unique(pd.Series(pd.Categorical(list("PandasPythonPackage"))))

Output

[Image: output of Example 3: Categorical input]

Note that in the above example, the input string has a total of nineteen characters, but the output contains only thirteen values, because repeated characters such as ‘P’ appear only once in the output.
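The two counts can be verified directly. This short sketch reuses the string from the example; the variable names are chosen only for this illustration.

```python
import pandas as pd

letters = list("PandasPythonPackage")
uniques = pd.unique(pd.Series(pd.Categorical(letters)))

print(len(letters))   # 19
print(len(uniques))   # 13
print(list(uniques))  # distinct letters in order of first appearance
```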

Example 4: DataFrame column as input

In this example, a pandas DataFrame is created using a list of tuples as input. The list represents rows in the DataFrame, and each tuple represents a single row. The columns parameter specifies the column names for the DataFrame.

# creating the DataFrame
df1 = pd.DataFrame(
    [
        ('Ford', 'USA', 1903),
        ('Mercedes', 'Germany', 1926),
        ('Tesla', 'USA', 2003),
        ('Bentley', 'UK', 1919),
    ],
    columns=('Name', 'Country', 'Founded in'),
)
df1

# finding unique countries
df1['Country'].unique()

The resulting DataFrame df1 has three columns: ‘Name’, ‘Country’, and ‘Founded in’. The unique function is then used to return a numpy array of unique values for the ‘Country’ column, which is specified by df1['Country']. The output of the unique function will contain only one occurrence of each unique country, i.e. if ‘USA’ appears multiple times in the ‘Country’ column, it will only be included once in the output.

Note that the syntax to access the unique values is dataframe_name['column_name'].unique(). The method is called on a single column (a Series), not on the DataFrame itself, which has no unique() method. In this case, df1['Country'].unique() returns the unique values in the ‘Country’ column of the DataFrame df1.
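The column-level call can be combined with a per-column count of distinct values. The sketch below rebuilds the same data under the name df and uses DataFrame.nunique, a related pandas method that this article does not otherwise cover.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ("Ford", "USA", 1903),
        ("Mercedes", "Germany", 1926),
        ("Tesla", "USA", 2003),
        ("Bentley", "UK", 1919),
    ],
    columns=["Name", "Country", "Founded in"],
)

# Unique values of one column, in order of first appearance
countries = df["Country"].unique()
print(list(countries))  # ['USA', 'Germany', 'UK']

# Number of distinct values in every column
per_column = df.nunique()
print(per_column)
```

Here ‘USA’ appears twice in the input but only once in the result, so the ‘Country’ column has three distinct values while ‘Name’ has four.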

Output

[Image: output of Example 4: DataFrame column as input]

Summary

Pandas is a great help when working with large datasets. Its general-purpose functions make creating and manipulating datasets much more convenient. The unique() function discussed in this article finds the unique values of a one-dimensional input, such as a DataFrame column, and so helps in understanding the cardinality of that data.


Reference

https://pandas.pydata.org/docs/reference/api/pandas.unique.html