Storing Data Frames and Series in HDF5 Files: A Complete Guide

Almost every data science and machine learning task, such as regression, classification, or time series prediction, depends on data, so the need to store and manage that data is growing just as quickly.

Sometimes we need to store data obtained on one platform in a cross-platform format so that it can be used on, or moved to, another platform.

This need arises for several reasons: the data may come from many sources in various formats, the analysis may be carried out in different languages, or a large project may simply require several tools.

In this tutorial, we are going to learn how to store a data frame in an HDF file.

Introduction to HDF5 for Efficient Data Storage

HDF (Hierarchical Data Format) is a versatile data storage format designed for complex data. It supports various data types and is well suited to cross-platform data exchange. Much like the folder structure we see in a file explorer, HDF stores data in a nested, hierarchical manner.
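The folder-like layout can be seen directly from pandas. The following is a minimal sketch, assuming the PyTables package (which pandas relies on for HDF5 support) is installed; the file name, key names, and data are hypothetical and chosen only for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "nested_example.h5")

sales = pd.DataFrame({"units": [10, 20]})
costs = pd.DataFrame({"amount": [1.5, 2.5]})

with pd.HDFStore(path, mode="w") as store:
    # Slash-separated keys act like folder paths inside the file.
    store.put("reports/y2023/sales", sales)
    store.put("reports/y2023/costs", costs)
    # keys() lists every stored object by its full path.
    keys = sorted(store.keys())

print(keys)
```

Each key is reported with a leading slash, so the two objects appear as siblings under the same parent group.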

This storage format can hold images, strings, text, and many other kinds of data. It is also platform-independent, which makes it a good choice when data has to be moved between platforms. HDF5 files are usually saved with the extension .h5 or .hdf5.

If you have worked with neural networks, you might have seen this file format somewhere while saving the model you have built. For example:

model.save('model.h5')

Next, we look at the method used to store a data frame in an HDF file, followed by a few examples.

HDFStore.put Method

The method we are about to use has the following syntax:

HDFStore.put(key, value, format=None, index=True, append=False, complib=None, complevel=None, min_itemsize=None, nan_rep=None, data_columns=None, encoding=None, errors='strict', track_times=True, dropna=False)

The important parameters of the method are discussed below.

  • key: The identifier under which the value is stored in the file
  • value: The Series or DataFrame to store in the HDF file
  • format: The storage format. The default, fixed, is fast to write and read but cannot be appended to or queried; table supports both appending and queries
  • index: Whether to write the data frame's index as a column (default True)
  • complib: The compression library to use, for example zlib, lzo, bzip2, or blosc
  • complevel: The compression level, an integer from 0 to 9; 0 or None disables compression
  • nan_rep: The string used to represent NaN values
  • data_columns: A list of columns to store as indexed data columns so they can be used in on-disk queries, or True to use all columns; applies to the table format
  • encoding: The encoding to use for strings, the default being None
  • dropna: Whether to remove missing values before storing the object (default False)
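To see how several of these parameters work together, here is a small sketch that stores a data frame in table format with compression and then queries it on disk. It assumes PyTables is installed with zlib support; the file name, the key, and the compression level are arbitrary illustrative choices.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "params_example.h5")

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["foo", "bar", "baz", "qux"]})

with pd.HDFStore(path, mode="w") as store:
    store.put(
        "my_data",
        df,
        format="table",       # table format allows on-disk queries
        data_columns=["A"],   # make column A usable in a where clause
        complib="zlib",       # compression library
        complevel=5,          # compression level between 0 and 9
    )
    # Only the rows matching the condition are read from the file.
    subset = store.select("my_data", where="A > 2")

print(subset)
```

A query like this only works on columns listed in data_columns (or on the index), which is why the table format and data_columns are usually set together.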

Store a Data Frame in a New HDF File

In this section, let us create a new HDF file and store our data frame inside it.

Firstly, we need to import the pandas library. It can be done as shown below.

import pandas as pd

Next, we create a data frame and store it in the HDF file with the put method.

data = {'A': [1, 2, 3], 'B': ['foo', 'bar', 'baz']}
df = pd.DataFrame(data)
with pd.HDFStore('new_file.h5', mode='w') as store:
    store.put('my_data', df)

The data frame, df, is stored in the HDF file new_file.h5 under the key my_data.

Then, we read the file back using the read_hdf method. Because the file contains a single object, the key can be omitted.

dfread = pd.read_hdf('new_file.h5')
dfread

Figure: Store a Data Frame in an HDF File

In this way, we can store a data frame in the HDF file.
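When a file holds more than one object, each one is read back by its key. The sketch below, which assumes PyTables is installed, adds a second data frame to an existing file by reopening it in append mode ("a") and then reads both back; the file name, keys, and data are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "two_frames.h5")

first = pd.DataFrame({"A": [1, 2, 3], "B": ["foo", "bar", "baz"]})
second = pd.DataFrame({"C": [0.1, 0.2]})

# mode="w" creates the file, replacing any existing file of that name.
with pd.HDFStore(path, mode="w") as store:
    store.put("first", first)

# mode="a" reopens the file and keeps what is already stored in it.
with pd.HDFStore(path, mode="a") as store:
    store.put("second", second)

# With several objects in the file, pass the key explicitly.
first_read = pd.read_hdf(path, key="first")
second_read = pd.read_hdf(path, key="second")

print(first_read)
print(second_read)
```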

Store a Series in a New HDF File

In the same way, we can also store a series in an HDF file.

import pandas as pd
my_serie = pd.Series([1, 2, 3], name='my_serie')
with pd.HDFStore('my_file.h5', mode='w') as store:
    store.put('my_serie', my_serie)

We create a series with pd.Series, open the HDF file in write mode, and use the put method to store the series under the key my_serie. To check the result, we read the file back with read_hdf:

series_read = pd.read_hdf('my_file.h5')
series_read

Figure: Store a Series in the HDF File
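As a quick check that a series survives the round trip, the following sketch stores a series with a labelled index and compares what comes back. It assumes PyTables is installed; the file name, key, and values are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "series_example.h5")

prices = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"], name="prices")

with pd.HDFStore(path, mode="w") as store:
    store.put("prices", prices)

# read_hdf returns a Series here, because a Series was stored.
restored = pd.read_hdf(path, key="prices")

print(type(restored).__name__)
print(restored)
```

The values and the index labels come back as they were written, so the restored object can be used in place of the original.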

Summary

In this tutorial, we discussed the Hierarchical Data Format (HDF) and why it is popular for managing data: it is platform-independent and can be read from several languages.

Next, we looked at the put method, its syntax, and its most important parameters for storing an object in an HDF file.

Finally, we worked through two examples that use this method to store a data frame and a series in HDF files. How can HDF5 further evolve to meet our growing data storage needs?

References

HDFStore.put Documentation