Exploring HDFStore Groups: Efficient Data Organization in Pandas

The Pandas library is one of the most widely used Python tools for data analysis and storage. Besides methods for cleaning and exploring data, it provides methods that export data from the Python environment to other compatible software and import it back.

One such family of methods belongs to HDFStore, the pandas class used to read and manipulate HDF files in Python. HDF stands for Hierarchical Data Format, a file format that keeps all the related data in a single parent file.

This article discusses HDFStore.groups, which gives information about the different groups in an HDF file.

HDFStore.groups is a simple, standalone method that returns information about the groups present in an HDF file. To use it, we first need a file with some data in it, so it helps to know two other methods: HDFStore.put, which stores a pandas object in the file under a key, and HDFStore.append, which appends data to a table that already exists in the file.
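The difference between put and append can be sketched as follows. This is a minimal illustration, not a complete guide: the file name put_append_demo.h5, the key numbers, and the two small data frames are made up for the example, and the file is written to a temporary directory so that nothing is left behind.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data, invented for this illustration.
first = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
second = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "put_append_demo.h5")  # hypothetical file name
    with pd.HDFStore(path, mode="w") as store:
        # put writes the object under a key; format="table" keeps it appendable.
        store.put("numbers", first, format="table")
        # append adds the new rows to the table already stored under that key.
        store.append("numbers", second)
        combined = store["numbers"]

print(combined)
```

After both calls, the object read back from the key numbers holds the rows of both data frames. Appending only works for objects stored in the table format, which is why put is called with format="table" here.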

Understanding the Structure of HDF Files

Let us understand the structure of an HDF file in simple terms. Typically, an HDF file contains two kinds of objects: groups and datasets. Groups are containers that can hold datasets and other groups. Datasets are the actual data stored within the HDF file. These can be arrays, tables, strings, or any other type of data. Apart from these, the file also holds metadata (data about data) describing the groups, the datasets, and the file itself. When an HDF file is written from pandas, the stored objects are typically series and data frames.

Consider this analogy: the file manager (file explorer) on a laptop is the main HDF file. The various storage locations, such as Downloads, Desktop, and Documents, are the groups. The files contained in these locations are the datasets.

HDF File structure
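To make the folder analogy concrete, here is a small sketch that stores two objects under folder-like keys and then walks the resulting hierarchy. The key names documents/report and downloads/sales and the file name are assumptions made up for this example. HDFStore.walk yields, for each group, its path, its subgroups, and the pandas objects stored directly inside it.

```python
import os
import tempfile

import pandas as pd

layout = {}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "structure_demo.h5")  # hypothetical file name
    with pd.HDFStore(path, mode="w") as store:
        # Keys containing a slash behave like folder paths inside the file.
        store.put("documents/report", pd.Series([1, 2, 3]))
        store.put("downloads/sales", pd.DataFrame({"A": [1, 2]}))

        # Full paths of every stored pandas object.
        keys = sorted(store.keys())

        # walk() visits each group, much like os.walk visits directories.
        for parent, subgroups, leaves in store.walk():
            layout[parent or "/"] = (sorted(subgroups), sorted(leaves))

print(keys)
for parent, (subgroups, leaves) in layout.items():
    print(parent, "-> subgroups:", subgroups, "objects:", leaves)
```

The root of the file plays the role of the file manager, documents and downloads are the groups, and report and sales are the stored objects.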

Exploring the HDFStore.groups Method

The syntax of this method is given below.

HDFStore.groups()

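This method takes no arguments. According to the pandas documentation, it returns a list of the top-level nodes in the file; the items in the list are node objects, not pandas objects.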
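The return value can be inspected like any other list. The sketch below is an illustration under a few assumptions: the file name is made up, and _v_pathname is an attribute of the underlying PyTables node objects rather than part of the pandas API, so it is used here only to show where each node lives in the file.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "groups_demo.h5")  # hypothetical file name
    with pd.HDFStore(path, mode="w") as store:
        store.put("data", pd.Series([10, 20, 30]))

        nodes = store.groups()
        # Each item is a PyTables node; _v_pathname is its path in the file.
        # Read the paths while the store is still open.
        paths = [node._v_pathname for node in nodes]

print(type(nodes).__name__)  # list
print(paths)                 # ['/data']
```

The paths are collected inside the with block because the nodes belong to the open file and should not be used after the store is closed.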

Obtaining Group Information From an HDF File Containing Series

In this example, we store a series in a new HDF file and print the group information.

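import pandas as pd

# Series to store in the HDF file
my_series = pd.Series([10, 20, 30])
# Open (or create) data2.h5 in write mode
store = pd.HDFStore("data2.h5", 'w')
# Save the series under the key 'data'
store.put('data', my_series)
# Print the list of groups in the file
print(store.groups())
store.close()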

In the first line, we import the pandas library under its alias pd. Then we define a series called my_series to be stored in the HDF file. The HDF file data2.h5 is opened in write mode and assigned to the variable store. The store.put method stores the series in the HDF file under the key data.

The store.groups method returns the information about the groups, which we print. Lastly, the HDF file is closed.

Store.groups method

The HDF file has only one group, called data. This group holds two datasets: index and values.
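The contents of the data group can also be checked in code. The following is a sketch, assuming the series is stored with put in its default (fixed) format. HDFStore.get_node returns the underlying PyTables node, and _v_children is a PyTables attribute listing what the node contains, so this peeks at an implementation detail rather than a documented pandas feature. The file name is made up for the example.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "series_demo.h5")  # hypothetical file name
    with pd.HDFStore(path, mode="w") as store:
        store.put("data", pd.Series([10, 20, 30]))

        # Fetch the node behind the key and list the names of its children.
        node = store.get_node("data")
        children = sorted(node._v_children.keys())

print(children)  # ['index', 'values']
```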

Obtaining Group Information From an HDF File Containing Dataframe

In the same manner, let us see how the group structure differs if the HDF file contains a data frame.

import pandas as pd

df = pd.DataFrame([[1, 2, 5], [3, 4, 6]], columns=['A', 'B', 'C'])
# Open (or create) store.h5 in write mode
store = pd.HDFStore("store.h5", 'w')
# Save the data frame under the key 'data1'
store.put('data1', df)
print(store.groups())
store.close()

The import is repeated so that the snippet runs on its own. The data frame df is then initialized. The HDF file store.h5 is opened in write mode and assigned to the variable store. The data frame is stored in the file under the key data1 with the help of store.put. The information about the groups is printed in the next line. In the last line, we close the file.

Store.groups method 2
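One way to see the difference between the two cases side by side is to store a series and a data frame in the same file and compare their nodes. This is a sketch under some assumptions: the key names and the file name are invented, both objects use the default fixed format, and pandas_type is an attribute that pandas writes on each node it creates, read here through the PyTables _v_attrs interface.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "compare_demo.h5")  # hypothetical file name
    with pd.HDFStore(path, mode="w") as store:
        store.put("my_series", pd.Series([10, 20, 30]))
        store.put("my_frame", pd.DataFrame([[1, 2, 5], [3, 4, 6]],
                                           columns=["A", "B", "C"]))

        # pandas tags each node with the kind of object it holds.
        kinds = {node._v_name: str(node._v_attrs.pandas_type)
                 for node in store.groups()}

        # The datasets inside each group differ between the two kinds.
        series_children = sorted(store.get_node("my_series")._v_children.keys())
        frame_children = sorted(store.get_node("my_frame")._v_children.keys())

print(kinds)
print(series_children)
print(frame_children)
```

The series group holds an index and its values, while the data frame group holds one dataset per axis plus datasets for its blocks of column data.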

Storing Two Objects in the Same Group

Let us try to store two different data frames in the same group.

import pandas as pd

# Two data frames to store under the same parent group
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'X': ['a', 'b', 'c'], 'Y': ['d', 'e', 'f']})
store = pd.HDFStore("example.h5", 'w')
# A slash in the key places each frame in a subgroup of 'group'
store.put('group/df1', df1)
store.put('group/df2', df2)
print(store.groups())
store.close()

In this code, we have defined two data frames, df1 and df2. The HDF file example.h5 is opened in write mode, and both data frames are stored under the same parent group, called group. Each one lives in its own subgroup: group/df1 and group/df2.

Storing two data frames in the same group
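The nesting can be verified without reading the printed node objects by eye. The sketch below uses made-up numeric data frames and a made-up file name, and relies on the PyTables attributes _v_pathname and _v_children to show the paths. It lists the nodes reported by groups, the children of the parent group, and reads one data frame back through its full key.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data, invented for this illustration.
df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df2 = pd.DataFrame({"X": [7, 8, 9], "Y": [10, 11, 12]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "nested_demo.h5")  # hypothetical file name
    with pd.HDFStore(path, mode="w") as store:
        store.put("group/df1", df1)
        store.put("group/df2", df2)

        # Paths of the nodes that groups() reports.
        paths = sorted(node._v_pathname for node in store.groups())
        # Names of the subgroups directly under the parent group.
        parent_children = sorted(store.get_node("group")._v_children.keys())
        # The full key reads the stored data frame back.
        restored = store["group/df1"]

print(paths)            # ['/group/df1', '/group/df2']
print(parent_children)  # ['df1', 'df2']
print(restored)
```

In this sketch, groups reports the two subgroups that hold data frames; the parent group itself only acts as a container for them.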

Summary

We looked at the structure of HDF files. An HDF file contains two kinds of objects, groups and datasets, arranged as HDF file -> group -> dataset. Groups can be nested inside one another. The file also stores metadata that describes the groups and datasets.

We then covered the pandas method HDFStore.groups, which returns information about the groups present in an HDF file. We compared the group information for a file that contains a series with that for a file that contains a data frame. Finally, we saw how to store two or more objects under the same group as subgroups.
