How to Append a Table to an Existing HDF File?

HDF, an abbreviation for Hierarchical Data Format, is a storage format that can hold very large amounts of data, including many large datasets, inside a single file in a hierarchical layout. This layout is similar to the folder structure on our computers, for example C:\Users\Profile\Desktop.

Many times when we are dealing with huge volumes of interconnected data, we can use this storage format instead of traditional file storage methods like CSVs.

There are also a few other storage structures like feather, parquet, and pickle which might also be used based on the requirements. The objective of this post is to demonstrate how to append a dataframe to an existing HDF file.

Appending data to an existing HDF (Hierarchical Data Format) file in Python can be efficiently done using the HDFStore.append method in pandas. This method allows adding data frames or series to HDF files without creating new ones, although it requires careful handling of duplicates.

Introduction to Hierarchical Data Storage

To start with, HDF is used to store multiple large datasets inside one single parent file, which acts like a root folder. An HDF file is made up of groups, datasets, and metadata. HDF5 files are usually saved with an .hdf5 or .h5 extension.
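To make the idea of groups concrete, here is a minimal sketch using pandas. The file name and the keys sensors/temperature and sensors/humidity are made up for illustration, and the example assumes the PyTables package is installed, since pandas relies on it for HDF support.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file and key names, chosen only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "hierarchy_demo.h5")

temps = pd.DataFrame({"celsius": [21.5, 22.0]})
humidity = pd.DataFrame({"percent": [40, 42]})

with pd.HDFStore(path, mode="w") as store:
    # Slashes in a key create nested groups, much like folders in a path.
    store.put("sensors/temperature", temps, format="table")
    store.put("sensors/humidity", humidity, format="table")
    keys = sorted(store.keys())

print(keys)
```

Both tables end up under a common sensors group, and store.keys() reports each one with its full path inside the file.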

It can be used to store images, large text files, strings, numbers, and many more. HDF is compatible with many high-level languages like Python, R, C++, and so on.

A few Python libraries, such as PyTables and h5pyViewer, make reading, writing, and viewing HDF files a lot easier.


Exploring the HDFStore.append Method

This method takes a series or a dataframe and appends it to the existing HDF file. The syntax of the method is as follows:

HDFStore.append(key, value, format=None, axes=None, index=True, append=True, complib=None, complevel=None, columns=None, min_itemsize=None, nan_rep=None, chunksize=None, expectedrows=None, dropna=None, data_columns=None, encoding=None, errors='strict')

Let us discuss some important parameters of this method.

key: The identifier of the object (the table) inside the HDF file that we would like to append to
value: The series or data frame that has to be appended
format: The format in which the data is stored. The default is the table format, which is the format that supports appending
index: A boolean, True by default. The pandas documentation describes it as writing the data frame index as a column
append: The default of this parameter is True, which means the data is appended to the existing table under the key. If False, the existing data under that key is overwritten
nan_rep: The string used to represent NaN values in string columns of the data being appended
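The effect of the append parameter can be sketched as follows. The file name is hypothetical, and the example assumes PyTables is installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "params_demo.h5")

first = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
second = pd.DataFrame({"A": [5], "B": [6]})

with pd.HDFStore(path, mode="w") as store:
    # The first append creates the table under the key "data".
    store.append("data", first)
    # append=True (the default) adds rows to the existing table.
    store.append("data", second)
    rows_after_append = len(store["data"])
    # append=False replaces whatever is stored under the key.
    store.append("data", second, append=False)
    rows_after_overwrite = len(store["data"])

print(rows_after_append, rows_after_overwrite)
```

After the second call the table holds three rows. After the call with append=False it holds only the single row from second.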

Appending Data to HDF Files

In this section let us take a look at how to add a single new row to an existing HDF file. The row is held in a one-row data frame, which plays the role a series (a one-dimensional array of the pandas library) would otherwise play. First, let us create an HDF file.

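import pandas as pd
# Build a small DataFrame and write it to a new HDF file in table format,
# which is the format that supports appending later.
df = pd.DataFrame([[11, 12], [13, 14]], columns=['A', 'B'])
with pd.HDFStore("store1.h5", 'w') as store:
    store.put('data', df, format='table')
# Read the table stored under the key 'data' back to confirm the write.
read = pd.read_hdf('store1.h5', 'data')
print(read)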

We create a data frame and write it to an HDF file with the pandas HDFStore, opening a file called store1.h5 in write mode. The HDF file is then read back using another pandas function, read_hdf.

# Add one new row: read the stored table, concatenate, and write it back
with pd.HDFStore('store1.h5', mode='a') as store:
    existing_data = store['data']
    new_data = pd.DataFrame({'A': [10], 'B': [11]})
    updated_data = pd.concat([existing_data, new_data], ignore_index=True)
    # put replaces the table stored under 'data' with the combined rows
    store.put('data', updated_data, format='table', data_columns=True)
# Reopen the file read-only to check the result
with pd.HDFStore('store1.h5', mode='r') as store:
    updated_data = store['data']
    print(updated_data)

In this code, we open the HDF file in append mode (mode='a'). Note that new_data is a one-row data frame with the same columns, A and B, as the stored table, not a series. Rather than calling append, the code reads the existing table into memory, joins the new row to it with pd.concat, and writes the combined result back under the same key with put, which replaces the stored table.
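When the new row has the same columns and data types as the stored table, reading and rewriting the whole table is not strictly needed. As a hedged alternative, the sketch below appends the row in place with store.append. It uses a separate, hypothetical file name so that it does not touch store1.h5, and it assumes PyTables is installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "append_row_demo.h5")

df = pd.DataFrame([[11, 12], [13, 14]], columns=["A", "B"])
new_row = pd.DataFrame({"A": [10], "B": [11]})

# Create the file with an appendable (table format) dataset.
with pd.HDFStore(path, mode="w") as store:
    store.put("data", df, format="table")

with pd.HDFStore(path, mode="a") as store:
    # Same columns and dtypes, so the row can be appended in place.
    store.append("data", new_row)
    result = store["data"]

print(result)
```

Unlike the concat approach with ignore_index=True, this keeps the index labels of the appended frame as they are, so the index is not renumbered.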

How to Append DataFrames to HDF Files

In the same manner, we can also append a data frame to an HDF file.

with pd.HDFStore('store1.h5', mode='a') as store:
    df = pd.DataFrame([[5, 6], [7, 8]], columns=['A', 'B'])
    store.append('data', df)
read = pd.read_hdf('store1.h5')
print(read)

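This approach is more direct than the previous one: because df has the same columns as the stored table, store.append writes the new rows straight to the file without reading the existing data first. The updated file is then read with the help of read_hdf.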

But there is an issue with this method. While appending to the HDF file, it doesn’t check if the data being appended already exists in the file. It appends the data without checking for duplication.

with pd.HDFStore('store1.h5', mode='a') as store:
    df_new = pd.DataFrame([[5, 6], [7, 8]], columns=['A', 'B'])
    store.append('data', df_new)
read = pd.read_hdf('store1.h5')
print(read)

Handling Duplicates in HDF Appends

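Because append does not check for duplicates, running the same append twice stores the same rows twice. Hence, it is advised to be careful when appending to an HDF file so that it does not end up with duplicate rows.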
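One possible way to guard against this is to filter out rows that are already stored before calling append. The helper append_unique below is hypothetical, not part of pandas. It is a minimal sketch that loads the existing table into memory, so it only suits tables small enough for that, and it assumes PyTables is installed.

```python
import os
import tempfile

import pandas as pd


def append_unique(store, key, new_rows):
    """Append only rows of new_rows that are not already stored under key.

    Hypothetical helper: it reads the whole existing table into memory,
    so it is meant for small tables only. Returns the number of rows written.
    """
    new_rows = new_rows.drop_duplicates()
    if key in store:
        existing = store[key].drop_duplicates()
        # A left merge with an indicator marks rows that already exist.
        marked = new_rows.merge(existing, how="left", indicator=True)
        new_rows = new_rows[(marked["_merge"] == "left_only").to_numpy()]
    if not new_rows.empty:
        store.append(key, new_rows)
    return len(new_rows)


# Hypothetical file name, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "dedup_demo.h5")
batch = pd.DataFrame([[5, 6], [7, 8]], columns=["A", "B"])

with pd.HDFStore(path, mode="a") as store:
    first_written = append_unique(store, "data", batch)
    second_written = append_unique(store, "data", batch)
    total_rows = len(store["data"])

print(first_written, second_written, total_rows)
```

The first call writes both rows, and the second call writes nothing because the same rows are already in the file. Rows are compared on column values only, so two rows with equal values but different index labels count as duplicates here.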

Summary

To recapitulate, we have discussed the Hierarchical Data Format (HDF), its uses, and its compatibility with various languages. Sometimes we have to add more data to an existing HDF file. Instead of creating a new file with the additional data, we can append the data to the existing file using the HDFStore.append method.

We have discussed two examples, adding a single row by rewriting the stored table and appending a data frame directly with append, and also discussed a drawback of the append method: it does not check whether the data being appended already exists in the HDF file. Hence, we are required to be careful with what data we are appending to the file.

References

pandas HDFStore.append documentation