How to Display Detailed Info About the HDF Store?

Hierarchical Data Format (HDF) is a file format that stores data together with its metadata, and can hold many datasets inside a single file. The National Center for Supercomputing Applications (NCSA) developed the format for storing and manipulating scientific data, along with tools for creating and reading HDF files.

Why are we talking about this now? Because the pandas library also provides several methods to read and append data to HDF files. We are going to talk about one such method of the HDFStore class: HDFStore.info.

This method is used to obtain information about the HDF file as well as the data contained inside it.

Although the info method works on its own, we first need a few other HDFStore methods, such as put, to get some data into the store before there is anything to report.

For starters, read this article to learn about the HDF file format

Advantages of HDF Compared to Traditional Formats

Let us first talk about the benefits of HDF over conventional storage formats like CSV and Excel.

Efficient Data Storage: The HDF5 format is designed for storing and accessing huge amounts of numerical data. It uses a compact binary storage layout and supports compression, enabling efficient I/O and storage utilization.
Hierarchical Storage Format: Data is stored hierarchically in HDF5 files, allowing related data elements to be grouped together in a meaningful way. This self-describing organization simplifies access and makes adding new data easier over time.
Ability to store complex data: HDF5 supports storing a variety of complex, multidimensional data structures without sacrificing performance. This makes it suitable for mathematical, scientific, and machine learning applications.
Cross-Platform: The HDF5 library implements the storage format in a platform and language neutral way. This allows easy sharing of data between systems and languages. HDF5 maintains compatibility while taking advantage of capabilities on new platforms.
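To make the compression point concrete, here is a minimal sketch, assuming PyTables is installed. It writes the same data frame twice, once without compression and once with zlib at level 9, and compares the file sizes. The file names plain.h5 and packed.h5 and the all-zero sample data are made up for illustration; real data will usually compress less dramatically.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical sample data: 100,000 zeros compress extremely well.
df = pd.DataFrame({"value": np.zeros(100_000)})

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.h5")    # illustrative file name
packed_path = os.path.join(tmp_dir, "packed.h5")  # illustrative file name

# Write once without compression.
with pd.HDFStore(plain_path, "w") as store:
    store.put("data", df)

# Write again with zlib compression at the highest level.
with pd.HDFStore(packed_path, "w", complevel=9, complib="zlib") as store:
    store.put("data", df)

plain_size = os.path.getsize(plain_path)
packed_size = os.path.getsize(packed_path)
print("uncompressed bytes:", plain_size)
print("compressed bytes:  ", packed_size)
```

The complevel and complib arguments apply to everything written through that store, so compression here is chosen once, when the file is opened.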

Now, let us understand how to use the HDFStore.info method.

HDFStore.info Syntax Explained

The syntax for the info method is as follows.

HDFStore.info()

This method takes no parameters and returns a string of information about the HDF file.
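As a quick sketch of the call itself, assuming PyTables is installed, the snippet below stores a small series and captures the return value of info() in a variable. The file name syntax_demo.h5 and the key numbers are placeholders chosen for this illustration.

```python
import os
import tempfile

import pandas as pd

# Placeholder file name, created in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "syntax_demo.h5")

store = pd.HDFStore(path, "w")
store.put("numbers", pd.Series([10, 20, 30]))

summary = store.info()  # takes no arguments and returns a str
store.close()

print(type(summary).__name__)
print(summary)
```

Because the result is an ordinary string, it can be printed, written to a log, or searched for a particular key.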

Let us see a few examples.

Display Information About an HDF File That Contains a Series Object

Let us take a pandas series and store it in an HDF file. Then, we are going to use the method to get the information about the file.

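import pandas as pd

# Create a small series and write it to a new HDF file under the key 'data'.
serie = pd.Series([1, 2, 3])
store = pd.HDFStore("teststore1.h5", 'w')  # 'w' creates or overwrites the file
store.put('data', serie)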

In the above snippet, we import the pandas library in the first line. Then, we create a series object using pd.Series. Next, we open a new HDF file called teststore1.h5 in write mode. The series is stored in the file under the key data with the help of the put method.

Now let us take a look at the information.

print(store.info())
store.close()

We print the information using the store.info method and then close the file we opened earlier. Always close the store when you are done, so that the data is flushed to disk and the file handle is released.
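One way to avoid forgetting the close call is to open the store in a with block, since HDFStore can be used as a context manager. The following is a minimal sketch, assuming PyTables is installed; the file name context_demo.h5 is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Made-up file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "context_demo.h5")

# The store is closed automatically when the with block ends,
# even if an exception is raised inside it.
with pd.HDFStore(path, "w") as store:
    store.put("data", pd.Series([1, 2, 3]))
    info_text = store.info()

print(info_text)
print(store.is_open)  # False once the with block has exited
```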

<class 'pandas.io.pytables.HDFStore'>
File path: teststore1.h5
/data            series       (shape->[3])

As you can see, the output shows the class of the store, the file path, and, for each stored object, its key (/data), its type, and its shape.

Display Information About an HDF File That Contains a Data Frame Object

In the same way, let us try using a data frame. A data frame is a bit different from a series as it has a two-dimensional table-like structure, which stores the data in the form of rows and columns.

df = pd.DataFrame([[12, 24], [36, 48]], columns=['A', 'B'])
store = pd.HDFStore("test.h5", 'w')
store.put('data', df)

We have created a data frame named df with two rows and two columns. This data frame is stored in the HDF file test.h5 using the put method.

print(store.info())
store.close()

Below is the output for this code snippet.

<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/data            frame        (shape->[2,2])

HDFStore uses the PyTables library under the hood to read and write pandas objects such as data frames.
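Because PyTables does the actual writing, put can also store a data frame as an appendable PyTables table by passing format="table" instead of relying on the default fixed format. The sketch below, which assumes PyTables is installed and uses the invented file name formats_demo.h5 and invented keys, stores the same frame both ways so the two entries can be compared in the info output.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame([[12, 24], [36, 48]], columns=["A", "B"])

# Invented file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "formats_demo.h5")

with pd.HDFStore(path, "w") as store:
    store.put("fixed_df", df)                  # default "fixed" format
    store.put("table_df", df, format="table")  # appendable PyTables table
    info_text = store.info()

# The fixed entry is reported as a frame, the table entry as a frame_table.
print(info_text)
```

The table format is the one that supports appending and querying, at the cost of somewhat slower reads and writes than the fixed format.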

Know more about how to append tables to HDF files here

Display Information About an HDF File That Contains a Series and a Data Frame

We can store both series and data frames together in a single HDF file. Let us see how.

df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
series = pd.Series([1, 2, 3])
store = pd.HDFStore("teststore3.h5", 'w')
store.put('data', df)
store.put('dataa',series)

We have created a data frame and a series in the first two lines. The new file is opened in write mode with the help of HDFStore. We are using two different keys (data and dataa) to store the data frame and the series. Think of this storage structure as the one you find in file systems. We have various paths like Downloads, Desktop, Documents, etc., in the file explorer. In the same way, we are creating two different paths to store the objects in the same file.
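The file system analogy extends to nested paths: a key may contain slashes, which places the object inside a group, much like a folder. Here is a small sketch of that idea, assuming PyTables is installed; the file name nested_demo.h5 and the keys raw/frame, raw/series, and summary/totals are hypothetical names chosen for the example.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
series = pd.Series([1, 2, 3])

# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "nested_demo.h5")

with pd.HDFStore(path, "w") as store:
    # Slashes in the key create nested groups, like folders on disk.
    store.put("raw/frame", df)
    store.put("raw/series", series)
    store.put("summary/totals", df.sum())  # column totals as a series

    keys = sorted(store.keys())
    info_text = store.info()

print(keys)
print(info_text)
```

The keys method returns every stored path with a leading slash, which is handy when only the names are needed rather than the full info report.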

print(store.info())
store.close()

This code gives the following output.

<class 'pandas.io.pytables.HDFStore'>
File path: teststore3.h5
/data             frame        (shape->[2,2])
/dataa            series       (shape->[3])

That is it! We can get the information about the HDF file and the contents stored in it using the info method.
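The info method is not limited to a store that was just written. As a minimal sketch, assuming PyTables is installed, the snippet below writes a file, closes it, reopens it in read-only mode, inspects it with info, and reads the object back by key. The file name reopen_demo.h5 is an assumption made for this example.

```python
import os
import tempfile

import pandas as pd

# Assumed file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "reopen_demo.h5")

original = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# First session: write the data frame and close the file.
with pd.HDFStore(path, "w") as store:
    store.put("data", original)

# Second session: open read-only, inspect the contents, and load the object.
with pd.HDFStore(path, "r") as store:
    info_text = store.info()
    restored = store["data"]

print(info_text)
print(restored)
```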

Also read: How to Retrieve a Pandas Object Stored in an HDF File?

Summary

To briefly summarize, we have discussed the benefits of using HDF over traditional file formats like CSV. Because the format is platform-independent, HDF files can be shared across operating systems, and libraries for reading and writing them exist for languages such as Python, R, and C++.

Following that, we looked at the syntax of the info method and worked through three examples: a file that contains a series, a file that contains a data frame, and a file that stores both a data frame and a series.
