Return a Data Hash of Pandas Series, Data Frames, and Index

Return A Hash Value Of The Pandas Objects

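A hash is a fixed-size integer computed from a piece of data. The same data always produces the same hash, and different data almost always produces a different one, although collisions are possible. When you have large data and want a compact, fixed-size value for quick lookup or comparison, the hash is your best friend.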

Python's built-in hash() function expects the hash of an object to stay fixed for the object's lifetime, so it only accepts hashable objects, which in practice means immutable ones. Mutable built-in objects such as lists, dictionaries, and sets cannot be hashed this way. The Pandas function covered in this tutorial works differently: it computes hash values from the data currently stored in the object.
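To make the difference concrete, here is a minimal sketch using Python's built-in hash() function. The variable names are only for illustration: the tuple is hashed without trouble, while the list is rejected.

```python
# Tuples are immutable, so the built-in hash() accepts them.
point = (1, 2, 3)
tuple_hash = hash(point)
same_hash = hash((1, 2, 3))   # an equal tuple gives an equal hash

# Lists are mutable, so hash() raises a TypeError.
try:
    hash([1, 2, 3])
    list_error = None
except TypeError as exc:
    list_error = str(exc)

print("Equal tuples, equal hashes:", tuple_hash == same_hash)
print("Error for the list:", list_error)
```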

Read this article to know more about immutable objects.

In this tutorial, we are going to learn about a function of the Pandas library called pandas.util.hash_pandas_object and see how to apply it to the data structures of Pandas: Series, Data Frames, and Index. I hope you have fun!

But before that, let us look at the Pandas’ data structures.

What Is Index?

The index is the data structure of the Pandas library used mainly for indexing the other data structures like Series and Data frames. We can say that the pandas’ Index is the basic object storing axis labels for all pandas objects. The index object is immutable which means we cannot change its values or characteristics once it is created.

Read this article on timedeltaIndex.

Let us take a look at an example.

#index 
import pandas as pd
ind=pd.Index([12,13,14])
print(ind)

Let us go through the code quickly. The first and very important step is to import the Pandas library as we are working with its very own data structures. We are importing the pandas library with an alias name pd.

In the next line, we are creating a new variable called ind to create a basic index object with the help of the method pd.Index. The elements inside this Index object are 12,13, and 14.

Lastly, we are printing this index object using the print function.

Index
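The immutability mentioned above can be checked with a small sketch. The names error_message and new_ind are chosen here for illustration; the idea is that assigning to an element fails, and "changing" an Index really means building a new one.

```python
import pandas as pd

ind = pd.Index([12, 13, 14])

# Item assignment is not allowed on an Index.
try:
    ind[0] = 99
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print("Error on assignment:", error_message)

# insert() does not modify ind; it returns a new Index.
new_ind = ind.insert(0, 99)
print("Original:", list(ind))
print("New index:", list(new_ind))
```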

What Is Pandas Series?

A series is a one-dimensional labeled array. It can hold values of any data type, including heterogeneous data, which is stored with the object dtype.

Any data structure such as a list, tuple, and even dictionary can be converted into a series object with the Series() method.
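As a quick sketch of these conversions (the variable names are only illustrative), the same values can be passed in as a list, a tuple, or a dictionary. With a dictionary, the keys become the index labels.

```python
import pandas as pd

from_list = pd.Series([10, 20, 30])
from_tuple = pd.Series((10, 20, 30))
from_dict = pd.Series({"a": 10, "b": 20, "c": 30})  # keys become the index

# Mixed value types are allowed; pandas stores them with the object dtype.
mixed = pd.Series([1, "two", 3.0])

print(from_dict)
print("dtype of the mixed series:", mixed.dtype)
```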

Read this article to know more about the Series data structure.

Let us look at a simple example.

#creating a series object
import pandas as pd
ser=pd.Series([1,2,3,4,5])
print(ser)
print(type(ser))

We are importing the pandas library as pd and creating a new variable to store the series object that we create by using the method pd.Series().

We are printing this series object and also its type.

Series

What is a Data Frame?

A Pandas data frame is similar to a table; that is, a data frame stores data in the form of rows and columns.

Visit this post to learn more about Data Frames.

A data frame can be created from a list, from a dictionary, or by passing the data directly to pd.DataFrame.

Let us see an example of creating a data frame from a dictionary.

#creating a data frame
import pandas as pd
dictn={'Food':[ 'Banana','Fries' ,'Milkshake','Lasagna'],'Calories':[23,154,100,280]}
df=pd.DataFrame(dictn)
print("The data frame is:\n",df)

We are importing the pandas library to be able to use it in our code. Then, we are creating a variable called dictn to store a dictionary of food items and their respective calories.

Next, we are using another variable called df to create a data frame out of this dictionary by passing it as an argument to the pd.DataFrame method.

In the last line, we are printing the data frame.

Data Frame
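For comparison, here is a small sketch that builds the same data frame from a list of rows instead of a dictionary. The names rows and df_from_rows are assumptions made for this illustration.

```python
import pandas as pd

# Each inner list is one row; column names are supplied separately.
rows = [["Banana", 23], ["Fries", 154], ["Milkshake", 100], ["Lasagna", 280]]
df_from_rows = pd.DataFrame(rows, columns=["Food", "Calories"])

# The dictionary version from the example above.
dictn = {"Food": ["Banana", "Fries", "Milkshake", "Lasagna"],
         "Calories": [23, 154, 100, 280]}
df_from_dict = pd.DataFrame(dictn)

print(df_from_rows)
print("Same content:", df_from_rows.equals(df_from_dict))
```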

Syntax of pandas.util.hash_pandas_object Explained

This function returns a hash value for each element or row of a Pandas data structure: Index, Series, or Data Frame. It is designed to work with only these three Pandas objects, so it cannot be applied to other types, such as the Panel object found in older Pandas releases.

Let us look at the syntax of this function.

pandas.util.hash_pandas_object(obj, index=True, encoding='utf8', hash_key='0123456789123456', categorize=True)

As you can observe from the syntax, the function has the following parameters.

obj: This parameter takes the object we wish to create hash values for. It can be an Index, a Series, or a Data Frame.

index: This parameter decides whether the index of obj is included in the hash computation. It applies only when obj is a Series or Data Frame. It takes a boolean value (True or False) and the default value is True. This means, by default, each hash depends on both the values and the index label.

encoding='utf8': This parameter is optional and is only used when the data or the hash key contains strings, which must be converted into bytes before computing the hash value. UTF-8 stands for Unicode Transformation Format, 8-bit.

hash_key='0123456789123456': When the obj we are trying to hash has strings in it, this key is used while hashing the strings. This field is also optional, and the default value shown in the syntax is stored in pandas as _default_hash_key. A custom key must be 16 bytes long once encoded.

categorize: This argument takes a boolean and matters when the obj we are trying to hash has duplicate values. When it is True, object arrays are first converted to categorical objects so that each unique value is hashed only once, which is more efficient. The default value is True.

Return type: The output is a Series with the same length as the input object and an unsigned 64-bit integer (uint64) dtype.
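The parameters above can be tried out with a short sketch. The Series contents and the alternative key "abcdefghijklmnop" are made up for this illustration; index=False is used so that only the values are hashed.

```python
import pandas as pd
from pandas.util import hash_pandas_object

ser = pd.Series(["apple", "banana", "apple"])

# One uint64 hash per element, same length as the input.
default_hashes = hash_pandas_object(ser, index=False)
print("dtype:", default_hashes.dtype)
print("same length:", len(default_hashes) == len(ser))

# Equal values receive equal hashes.
print("duplicates match:", default_hashes.iloc[0] == default_hashes.iloc[2])

# A different 16-character hash_key changes the hashes of string data.
other_key = hash_pandas_object(ser, index=False, hash_key="abcdefghijklmnop")
print("key changes hashes:", (default_hashes != other_key).any())

# categorize is an efficiency option; the hashes should be the same either way.
no_categorize = hash_pandas_object(ser, index=False, categorize=False)
print("categorize keeps hashes:", default_hashes.equals(no_categorize))
```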

Now that we have understood the syntax of this function, let us look at a few examples!

Return the Hash Value of an Index Object

We have seen how we can create an Index object in the previous examples. Let us take the same example to obtain the hash values for the index object.

#index 
import pandas as pd
from pandas.util import hash_pandas_object
ind=pd.Index([12,13,14])
print("The index object is :\n",ind)
print("-"*20)
hs=hash_pandas_object(ind)
print("The hash values of the index object are:\n",hs)

In the very first line, we are importing the pandas library as pd. Next, we are importing the hash_pandas_object function from the utility module of the pandas’ library.

The next line is used to create an index object and the result is stored in a variable called ind.

print("The index object is :\n",ind): This line is used to print the above-created index object.

print("-"*20): This line is just used as a separator because we are using two print statements in the code. What does this line do? It prints 20 hyphens(-) on the screen.

The next line is used to create the hash values of the object. We are calling the function and passing the ind variable as an argument. The resultant hash object is stored in a variable called hs.

Finally, we are printing the hash values obtained by the above line.

Hash Of The Index Object

Recall the parameters of the hash function. The index argument applies only to Series and Data Frames, so it has no effect on an Index object. Observe the output. We can see that the elements of the Index object are used as the index of the returned hash Series, which fits the main role of the Index object: labelling the other data structures.
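A small sketch can confirm both observations. The names with_flag and without_flag are only for illustration; the check is that the index argument makes no difference for an Index, and that the result is labelled by the Index itself.

```python
import pandas as pd
from pandas.util import hash_pandas_object

ind = pd.Index([12, 13, 14])

with_flag = hash_pandas_object(ind, index=True)
without_flag = hash_pandas_object(ind, index=False)

# The index argument is not used when obj is already an Index.
print("Same hashes:", with_flag.equals(without_flag))

# The returned Series is labelled by the original Index.
print("Labelled by ind:", with_flag.index.equals(ind))
```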

Return the Hash Value of a Series Object

Let us apply the same function to the Series object we created earlier.

#creating a series object
import pandas as pd
from pandas.util import hash_pandas_object
ser=pd.Series([1,2,3,4,5])
print("The series object is:\n",ser)
print(type(ser))
print("^"*25)
hs1=hash_pandas_object(ser,index=True)
print("The hash values of the Series object:\n",hs1)

We follow the same pattern as the above example. We import the Pandas library and also get the hash function from the same library.

We are creating a variable ser to store the Series object after it is created by using another function of the Pandas library, pd.Series.

Next, we are printing the series object. We are also checking the type of the object just to be sure.

print("^"*25): This line is used as a separator.

We are creating a new variable called hs1 to obtain the hash values of the Series object.

We are printing the hash values in the last line.

Hash Of The Series Object
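To see what the index argument does for a Series, here is a minimal sketch with two Series that hold the same values under different index labels. The names ser_a and ser_b are assumptions made for this illustration.

```python
import pandas as pd
from pandas.util import hash_pandas_object

ser_a = pd.Series([1, 2, 3], index=[0, 1, 2])
ser_b = pd.Series([1, 2, 3], index=[10, 11, 12])

# index=True: the index labels take part in the hash.
with_index_a = hash_pandas_object(ser_a, index=True).to_numpy()
with_index_b = hash_pandas_object(ser_b, index=True).to_numpy()

# index=False: only the values are hashed.
values_only_a = hash_pandas_object(ser_a, index=False).to_numpy()
values_only_b = hash_pandas_object(ser_b, index=False).to_numpy()

print("Equal with index=True: ", (with_index_a == with_index_b).all())
print("Equal with index=False:", (values_only_a == values_only_b).all())
```

With index=True the two results are expected to differ, because the labels differ; with index=False they match, because the values are identical.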

Return the Hash Value of a Data Frame

We have reached the last example of this tutorial. Let us try to obtain the hash values of a data frame.

#creating a data frame
import pandas as pd
from pandas.util import hash_pandas_object
dictn={'Food':[ 'Banana','Fries' ,'Milkshake','Lasagna'],'Calories':[23,154,100,280]}
df=pd.DataFrame(dictn)
print("The data frame is:\n",df)
print("^"*25)
hsdf=hash_pandas_object(df)
print("The hash values of the data frame are:\n",hsdf)

We import the Pandas library and also the hash function in the first two lines.

Next, we are creating a dictionary that is then converted into a data frame by the function pd.DataFrame.

We are printing the data frame in the next line.

The next print statement is used to print ^ 25 times as a separator.

We are creating a variable called hsdf to obtain hash values for the data frame.

In the following line, we are printing the hash object.

Hash Of The Data Frame

We can also find the hash value of the data frame as a single entity with the help of the sum() function.

Let us see how we can do that.

hsdf1=hash_pandas_object(df).sum()
print("The hash value of the entire data frame is :",hsdf1)

We have taken the same data frame and we are trying to find the hash value of the entire data frame as one.

The hash function returns one hash value per row of the data frame, and the sum function adds these row hashes together into a single number.

Hash Value Of The Entire Data Frame

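There is a reason why the hash value of the entire data frame ended up negative. Each row hash is already a 64-bit value, so adding several of them gives a total that is too large to be represented in 64 bits. The sum overflows and wraps around, which in this output shows up as a negative number. Depending on the Pandas version, the wrapped result may instead be displayed as a large positive integer.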
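One property of summing is that the result does not depend on the order of the rows. As an alternative sketch (not part of the example above), the row hashes can be passed to hashlib.sha256 from the Python standard library to get a single fingerprint that also reflects row order. The variable names here are assumptions made for illustration.

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object

dictn = {"Food": ["Banana", "Fries", "Milkshake", "Lasagna"],
         "Calories": [23, 154, 100, 280]}
df = pd.DataFrame(dictn)

# One uint64 hash per row, then a SHA-256 digest over their raw bytes.
row_hashes = hash_pandas_object(df, index=True).to_numpy()
digest = hashlib.sha256(row_hashes.tobytes()).hexdigest()
print("Digest of the data frame:", digest)

# Same rows in reverse order: the digest changes, the wrapped sum does not.
reversed_hashes = hash_pandas_object(df.iloc[::-1], index=True).to_numpy()
reversed_digest = hashlib.sha256(reversed_hashes.tobytes()).hexdigest()

print("Digests equal:", digest == reversed_digest)
print("Sums equal:", int(row_hashes.sum()) == int(reversed_hashes.sum()))
```

Which of the two to use depends on whether row order should count as part of the data.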

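Note: The hash_pandas_object function is deterministic. For the same data, encoding, and hash_key, the hash values of the above discussed objects always remain the same no matter how many times the code is executed. Unlike the built-in hash function discussed in the introduction, it does not require the object to be immutable; if the data changes, the hash values simply change with it.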
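This behaviour can be checked with a minimal sketch. The small data frame and the names first, second, and after_change are made up for this illustration.

```python
import pandas as pd
from pandas.util import hash_pandas_object

df = pd.DataFrame({"Food": ["Banana", "Fries"], "Calories": [23, 154]})

# Hashing the same data twice gives the same values.
first = hash_pandas_object(df)
second = hash_pandas_object(df)
print("Repeatable:", first.equals(second))

# Change one cell in a copy and hash again.
changed = df.copy()
changed.loc[0, "Calories"] = 24
after_change = hash_pandas_object(changed)

print("Row 0 hash changed:", first.iloc[0] != after_change.iloc[0])
print("Row 1 hash kept:   ", first.iloc[1] == after_change.iloc[1])
```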

Conclusion

To conclude, we have seen the data structures of the Pandas library- Index, Series, and Data Frames and a few examples of their creation using the respective functions pd.Index, pd.Series and pd.DataFrame.

Next, we have seen how a hash function maps data to a fixed-size value, and that Python's built-in hash function accepts only immutable objects.

We have discussed the syntax of pandas.util.hash_pandas_object in detail.

Lastly, we tried to obtain the hash values for the Pandas objects- Index, Series, and Data Frames.

We also tried to obtain the hash value of the data frame as a whole and observed the reason behind the negative value for this particular example.

References

Visit the Pandas’ official documentation to know more about the hash method.

Also, if you want more description about the Pandas Index, it is available in the official documentation.