How to Retrieve a Pandas Object Stored in an HDF File?

The HDF5 file format provides an efficient way to save and retrieve Pandas data objects like DataFrames. If you have already worked with Pandas and want to learn how to persist your data for later reuse, this article will show you how.

We will focus specifically on how to store DataFrames in HDF5 files using pandas' HDFStore, which is built on the PyTables library, and then how to load those DataFrames back into memory later on. The examples demonstrate the basic HDF5 operations you need to checkpoint your Pandas workflows and build high-performance data storage pipelines. Let's get right into the topic.

Introduction to Pandas Objects

Before we move to the problem at hand, we need to understand the basics. Let us look at the definitions and examples of the two main pandas objects: Series and DataFrame.

Pandas Series

A series is a one-dimensional, array-like structure that can hold data of any type, such as strings, floats, and integers. We can turn a list or a dictionary into a series with the pd.Series constructor. Let us see an example.

import pandas as pd
dictn = {'a': 1, 'b': 2, 'c': 3}
myseries = pd.Series(dictn)
print(myseries)

We have imported the Pandas library as pd. Then, we define a dictionary called dictn, which is converted into a series by passing it to pd.Series. The dictionary keys become the index labels of the series.

Figure: Pandas Object - Series
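
The paragraph above mentions that a list can also be turned into a series. The following is a minimal sketch of both cases; the variable names from_list and from_dict are chosen only for this illustration.

```python
import pandas as pd

# Build a Series from a plain list; pandas assigns a default integer index.
from_list = pd.Series([10, 20, 30])

# Build a Series from a dictionary; the keys become the index labels.
from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})

print(from_list.index.tolist())  # [0, 1, 2]
print(from_dict.index.tolist())  # ['a', 'b', 'c']
print(from_dict['b'])            # 2
```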

Pandas Data Frame

A data frame is another pandas object. It is commonly used to hold tabular data, such as the contents of a CSV file. Unlike a series, a data frame stores data in a two-dimensional table of rows and columns. We can also turn a list or a dictionary into a data frame.

import pandas as pd

dictn = {
  "Age": [22,18,76,12,35,43],
  "Gender": ["F","F","M","F","M","M"]
}
df = pd.DataFrame(dictn)

print(df)

print("The type of the object:",type(df))

Here, dictn is a dictionary that contains the age and gender of individuals. The data frame is created with the pd.DataFrame constructor and then printed to the screen. The type of the object is also displayed.

Figure: Pandas Object - Data Frame

Introduction to HDF: The Hierarchical Data Format

HDF stands for Hierarchical Data Format and stores data in a hierarchical, nested layout. HDF5 files typically use the .h5 or .hdf5 extension. A single file can hold multiple datasets, organized into groups much like files in folders. The format is platform-independent, which makes it a versatile tool for data analysis and management.
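
To make the hierarchical layout concrete, here is a small sketch that stores two objects under nested keys in one file. The file name hierarchy_demo.h5 and the keys experiment/raw and experiment/totals are hypothetical, chosen only for this illustration, and the file is written to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name, placed in a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), "hierarchy_demo.h5")

raw = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
totals = pd.Series([4, 6])

# The context manager closes the file when the block ends.
with pd.HDFStore(path, mode="w") as store:
    # Slashes in a key create nested groups, much like folders in a path.
    store.put("experiment/raw", raw)
    store.put("experiment/totals", totals)
    keys = sorted(store.keys())

print(keys)  # ['/experiment/raw', '/experiment/totals']
```

Both objects live in the same file, and the keys show where each one sits in the hierarchy.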

The HDFStore.get Method

This method retrieves a pandas object stored in an HDFStore. Let us look at the syntax.

HDFStore.get(key)

Here, key is the name (a string) under which the object was stored in the HDF file. Let us see two examples: retrieving a series and retrieving a data frame from the HDF file.
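
As a rough sketch of how get behaves, the example below stores one series and then looks it up in two equivalent ways. It also shows what happens when the key does not exist. The file name get_demo.h5 and the key "missing" are hypothetical and used only for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name, placed in a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), "get_demo.h5")

with pd.HDFStore(path, mode="w") as store:
    store.put("data", pd.Series([1, 2, 3]))

    via_get = store.get("data")    # explicit method call
    via_index = store["data"]      # dictionary-style access, same result

    has_data = "data" in store     # membership test before retrieving
    has_other = "missing" in store

    # Asking for a key that was never stored raises KeyError.
    try:
        store.get("missing")
        missing_raised = False
    except KeyError:
        missing_raised = True

print(via_get.equals(via_index))  # True
print(has_data, has_other)        # True False
print(missing_raised)             # True
```

Checking membership with the in operator first is one way to avoid the exception when you are not sure a key is present.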

Retrieving a Series From the Store

Let us learn how to get a series object from the HDF file. First, though, we need to put the series in the file.

myseries = pd.Series([1, 2, 3])
store = pd.HDFStore("store2.h5", 'w') 
store.put('data', myseries)  

The series we are using is called myseries. The HDF file store2.h5 is opened in write mode ('w'). This mode always creates a new file, so an existing file with the same name is overwritten. The series is then put in the HDF file under the key data.

With just one line, we can retrieve the series as shown below.

store.get('data') 
Figure: Getting a series from HDF

Retrieving a Data Frame From the Store

Let us take a look at how to retrieve the data frame from the HDF file.

df = pd.DataFrame([[12, 24], [36, 48]], columns=['A', 'B'])
store = pd.HDFStore("storee.h5", 'w')  
store.put('data', df)  

We have created a data frame called df with two columns, and a new HDF file is created by calling pd.HDFStore in write mode. The data frame is then put in the file under the key data.

We can get the data frame from the file as follows.

store.get('data') 
Figure: Getting a data frame from HDF
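
The snippets above read from the same store object that wrote the data. In practice you often retrieve the object in a later session. The following sketch, which uses a hypothetical file name in a temporary directory, closes the file after writing and reopens it in read-only mode ('r') with a context manager.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name, placed in a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), "reopen_demo.h5")

df = pd.DataFrame([[12, 24], [36, 48]], columns=["A", "B"])

# Write phase: the file is closed automatically when the block ends.
with pd.HDFStore(path, mode="w") as store:
    store.put("data", df)

# Read phase, for example in a later session: mode "r" opens the file read-only.
with pd.HDFStore(path, mode="r") as store:
    restored = store.get("data")

print(restored.equals(df))  # True
```

Using a with block is a design choice rather than a requirement; calling store.close() explicitly achieves the same thing.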

We can also retrieve both a series and a data frame when both are stored in the same HDF file.

df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
myseries = pd.Series([1, 2, 3])
store = pd.HDFStore("teststore1.h5", 'w')  
store.put('data', df)  
store.put('dataa',myseries)

In a new HDF file, we are putting both a series and a data frame using two different keys – data and dataa.

store.get('data')
Figure: Data Frame
store.get('dataa')
Figure: Series

In this way, we can retrieve a series, a data frame, or both using the get method.
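
As an aside, pandas also offers the top-level function pd.read_hdf, which opens the file, retrieves one object by key, and closes the file again. The sketch below is an alternative to HDFStore.get, not the method used in this article. The file name is hypothetical, and the keys mirror the ones used above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name, placed in a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), "read_hdf_demo.h5")

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
myseries = pd.Series([1, 2, 3])

# Store both objects under the same keys as in the example above.
with pd.HDFStore(path, mode="w") as store:
    store.put("data", df)
    store.put("dataa", myseries)

# pd.read_hdf handles opening and closing the file for a single lookup.
frame = pd.read_hdf(path, key="data")
series = pd.read_hdf(path, key="dataa")

print(type(frame).__name__)   # DataFrame
print(type(series).__name__)  # Series
```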

Summary

As we have seen, storing Pandas objects in HDF5 files is straightforward with pandas' HDFStore, which relies on the PyTables library. You can serialize both DataFrames and Series into compact persistent storage using just a few lines of code.

A major advantage of HDF5 is that you can checkpoint intermediate results instead of rerunning expensive preprocessing steps each time. This makes development more efficient.

Now that you have seen basic examples of saving Pandas objects to HDF5 files and reloading them later, you can start integrating this capability into your own scripts. HDF5 persistence helps future-proof your analysis code by providing efficient, reusable data storage.
