Walk Through the HDF File Using Pandas

Imagine you are working on a big project that needs data from multiple sources. Collecting this data is cumbersome enough, but storing it together for easy access is an even harder task.

You would have to look for a tool that keeps all your data in one place so that it is easily accessible. Fortunately, you needn’t search anymore. The Hierarchical Data Format (HDF) is a file format that stores multiple datasets in a single file. It organizes the data inside containers called groups.

HDF files can also be used from Python. The Pandas library reads, organizes, and manages HDF files in a Python environment through the methods of a class called pandas.HDFStore.

In this tutorial, we are going to talk about one pandas method, HDFStore.walk, which retrieves information about the groups and subgroups of an HDF file.

Understanding the Structure of HDF Files

Let us first understand how HDF stores multiple datasets in a single file. Essentially, HDF files have two kinds of objects: groups and datasets. Groups are containers and can be nested, meaning a main group can hold one or more subgroups. A group can also hold the actual data, and these data items are called datasets.

Here is a diagrammatic representation of the HDF file structure.

Structure of HDF file

The HDFStore.walk method retrieves information about the groups, their children (subgroups), and the names of the pandas objects stored in each group.

Syntax of HDFStore.walk

Let us learn how this method works with the help of its syntax.

HDFStore.walk(where='/')

There is only one parameter in this method.

where: This parameter specifies the group at which to start walking. It defaults to '/', the root of the file. Starting from the given group, the method reports on that group, its subgroups, and the names of the pandas objects stored in them.

For each group it visits, this method yields a tuple of three values.

  • path: The complete path of the group, as a string
  • groups: The names of the groups contained directly in path
  • leaves: The names of the pandas objects contained directly in path
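To make these three values concrete, here is a minimal sketch of the same kind of traversal in plain Python. It is only a toy model and not how pandas implements walk: the nested dictionary tree, the LEAF marker, and the function toy_walk are all hypothetical names invented for this illustration.

```python
# Toy model of an HDF hierarchy: dicts stand in for groups,
# and the LEAF marker stands in for a stored pandas object.
LEAF = "leaf"

tree = {
    "group": {
        "subgroup1": {"data1": LEAF},
        "subgroup2": {"data2": LEAF},
    }
}


def toy_walk(node, path=""):
    """Yield (path, groups, leaves) for a node, then for each child group."""
    groups = [name for name, child in node.items() if isinstance(child, dict)]
    leaves = [name for name, child in node.items() if child == LEAF]
    yield path, groups, leaves
    for name in groups:
        # Descend into each child group, extending the path as we go.
        yield from toy_walk(node[name], f"{path}/{name}")


result = list(toy_walk(tree))
for item in result:
    print(item)
```

The root is reported with an empty string as its path, and every deeper group is reported with its full path from the root.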

Walking Through an HDF File With One Group

Let us use the walk method on an HDF file that has a single group.

Firstly, we import the pandas library and create a data frame called df1 that has 2 rows and 2 columns.

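import pandas as pd

# Build a small data frame with 2 rows and 2 columns
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
# Open store.h5 in write mode and save df1 under the key 'data'
store = pd.HDFStore("store.h5", 'w')
store.put('data', df1, format='table')
# Each item yielded by walk() is a (path, groups, leaves) tuple
for group in store.walk():
    print(group)
store.close()  # release the file handle when done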

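We open a new HDF file called store.h5 in write mode and put the data frame into it under the key data using the store.put method. Lastly, we print what store.walk yields: a tuple holding the path, the child groups, and the names of the pandas objects stored at that path. Here the file holds a single object at the root, so the tuple is ('', [], ['data']).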

Walking through the HDF file with one group
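Since each item from walk is a tuple, it can be unpacked into three named variables instead of being printed whole. The sketch below repeats the single-object example that way. Writing the file into a temporary directory is an assumption made here so the sketch does not overwrite an existing store.h5, and the names tmp_dir, file_path, and walked are illustrative only.

```python
import os
import tempfile

import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# A temporary directory keeps this sketch from touching an existing store.h5.
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "store.h5")
    # The with-block closes the store automatically.
    with pd.HDFStore(file_path, "w") as store:
        store.put("data", df1, format="table")
        # Collect the (path, groups, leaves) tuples while the file is open.
        walked = list(store.walk())

for path, groups, leaves in walked:
    print("path:", repr(path))
    print("groups:", groups)
    print("leaves:", leaves)
```

The path is an empty string because the object sits directly under the root, and groups is empty because the root has no child groups.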

Walking Through an HDF File With Two Groups

Now let us walk through the HDF file that has two groups.

import pandas as pd

# Open Testt.h5 in write mode
store = pd.HDFStore('Testt.h5', 'w')
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['C', 'D'])
# Save each data frame under its own key
store.put('data', df1, format='table')
store.put('data1', df2, format='table')

We have created a new HDF file called Testt.h5 in write mode and two data frames called df1 and df2. These two data frames are stored under two keys called data and data1.

for groups in store.walk():
    print(groups)
store.close()  # release the file handle when done

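The above code snippet prints the tuple yielded by walk. Both data and data1 sit directly under the root, so they appear together in the leaves list of a single tuple.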

Walking through the HDF file with two groups
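The names in leaves can be joined with path to rebuild the full key of each stored object, which can then be read back with store.get. The following is a minimal sketch of that idea under the same setup as above. The temporary directory and the names loaded, key, tmp_dir, and file_path are assumptions added for this illustration.

```python
import os
import tempfile

import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=["C", "D"])

loaded = {}  # maps each full key to the data frame read back from the file
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "Testt.h5")
    with pd.HDFStore(file_path, "w") as store:
        store.put("data", df1, format="table")
        store.put("data1", df2, format="table")
        for path, groups, leaves in store.walk():
            for leaf in leaves:
                # The root path is '', so '' + '/data' gives the key '/data'.
                key = f"{path}/{leaf}"
                loaded[key] = store.get(key)

for key in sorted(loaded):
    print(key)
    print(loaded[key])
```

This pattern is handy when the keys in a file are not known in advance, because walk discovers them and get retrieves the data.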

Walking Through an HDF File With Subgroups

Are you wondering what the output looks like when the main group has several subgroups? Here it is!

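import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]})
# Write each data frame into its own subgroup; the with-block closes the file
with pd.HDFStore("store.h5", 'w') as store:
    store.put('group/subgroup1/data1', df1)
    store.put('group/subgroup2/data2', df2)
# Reopen the file in read mode and walk through every group
with pd.HDFStore("store.h5", 'r') as store:
    for group in store.walk():
        print(group)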

The two data frames df1 and df2 are stored as data1 and data2 inside two different subgroups, subgroup1 and subgroup2, under the main group called group. These groups exist in an HDF file called store.h5. The file is then reopened in read mode to walk through it.

HDF file with subgroups
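The where parameter described in the syntax section can limit the walk to one branch of this hierarchy. Here is a hedged sketch that starts walking at /group instead of the root. The temporary directory and the names whole_file, branch, tmp_dir, and file_path are assumptions introduced for this illustration.

```python
import os
import tempfile

import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"C": [5, 6], "D": [7, 8]})

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "store.h5")
    with pd.HDFStore(file_path, "w") as store:
        store.put("group/subgroup1/data1", df1)
        store.put("group/subgroup2/data2", df2)
    with pd.HDFStore(file_path, "r") as store:
        # Default: start at the root and visit every group.
        whole_file = list(store.walk())
        # Start at '/group' so the root itself is skipped.
        branch = list(store.walk(where="/group"))

print("Groups visited from the root:", len(whole_file))
for path, groups, leaves in branch:
    print(path, groups, leaves)
```

Walking from the root visits one more group than walking from /group, because the root itself is no longer part of the traversal.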

Summary

We looked at the structure of an HDF file. It has groups and subgroups, which store the data in the form of datasets. Next, we saw how to use the HDFStore.walk method to get information about the groups present in an HDF file. We worked through examples where an HDF file has a single stored object, two stored objects, and two subgroups.

References

HDFStore walk documentation