How to Select an Object in an HDF File With Certain Conditions?

The need for alternatives to conventional data storage formats grows as data becomes more complex. Such formats must store very large datasets efficiently and let us retrieve or select the required information seamlessly.

Gone are the days when data scientists believed CSV was their only savior. Now, we are constantly looking for better storage mechanisms that can accommodate our never-ending list of data requirements.

Previously, we have talked about efficient data storage formats like Feather, Parquet, and ORC, which take little effort to use yet offer better standardization and compression than conventional formats.

There are also dedicated software packages and libraries, such as Stata, for working with a particular storage format across languages, platforms, and operating systems.

In this post, we will focus on one such data storage format: the Hierarchical Data Format (HDF).

Understanding Hierarchical Data Format (HDF)

Imagine you have downloaded a file from a website. The file will be saved in the Downloads folder of your file explorer. The complete path of the file will be something along the lines of C:\Users\Downloads\YourFile (on Windows, of course). Similarly, there are folders for Documents, Photos, and Screenshots under the same root folder.

HDF files follow the same kind of structure. A single HDF file stores multiple datasets, organized into groups much like folders. This gives a consistent, hassle-free way to keep related data together. You might think that storing many datasets under a single roof will lead to confusion. Luckily, HDF files also store the metadata (data about data) for the datasets stored in each group. HDF also supports compression, which helps when the datasets are large and makes it easier to move the file across platforms, languages, and software.

What matters for this tutorial is that HDF files can store pandas objects, Series and DataFrame, through the pandas HDFStore class.
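The folder analogy can be sketched with pandas itself. The snippet below is only an illustration: the file name, the keys under sales/ and the compression settings are made up for the example, and it assumes the PyTables package (tables) is installed alongside pandas.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "groups_demo.h5")

sales_2023 = pd.DataFrame({"units": [5, 7]})
sales_2024 = pd.DataFrame({"units": [9, 11]})

# complevel/complib switch on compression for everything written to the store.
with pd.HDFStore(path, mode="w", complevel=9, complib="zlib") as store:
    # Keys behave like folder paths: both frames live under the /sales group.
    store.put("sales/y2023", sales_2023)
    store.put("sales/y2024", sales_2024)
    keys = sorted(store.keys())
    restored = store.get("sales/y2024")

print(keys)
print(restored)
```

Each key reads like a path, so related objects can be grouped under a common prefix and fetched individually later.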

You can read more about HDF here

Now that we have a basic understanding of the file structure, let us learn how to retrieve the objects stored in the file using a few conditions.

Exploring HDFStore’s Select Method

Let us take a look at the select method and its important parameters.

HDFStore.select(key, where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, auto_close=False)

Let us now discuss the parameters of the method.

  • key: Each object in an HDF file is stored under its own key. To select an object from the file, we specify the key it was stored under
  • where: Similar to a SQL WHERE clause, this specifies a condition that the selected rows or columns must satisfy. This parameter is optional and only works on objects stored in the table format
  • start: The row number at which the selection starts. This parameter takes an integer
  • stop: The row number at which the selection ends; the row at stop itself is not included. This parameter also takes an integer
  • columns: A list of the columns to return
  • iterator: If this parameter is set to True, an iterator is returned instead of the full object
  • chunksize: The number of rows to return per iteration when iterating
  • auto_close: If this parameter is set to True, the store is closed after the selection finishes
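As a rough sketch of how a couple of these parameters behave, the example below uses columns and chunksize on a small table-format object. The file name and the data are invented for illustration, and PyTables is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file and data, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "params_demo.h5")
df = pd.DataFrame({"A": range(6), "B": range(10, 16), "C": range(20, 26)})

with pd.HDFStore(path, mode="w") as store:
    # columns and chunksize need the table format, not the default fixed one.
    store.put("data", df, format="table")

    # columns: return only the listed columns.
    only_ab = store.select("data", columns=["A", "B"])

    # chunksize: iterate over the object in pieces of at most 4 rows.
    # The store must still be open while the chunks are consumed.
    chunk_lengths = [len(chunk) for chunk in store.select("data", chunksize=4)]

print(only_ab)
print(chunk_lengths)
```

Reading in chunks is mainly useful when the stored object is too large to load in one go.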

How to Retrieve Data Frames from HDF Files

Let us take a data frame, store it in an HDF file and then retrieve it using the select method.

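First, we import the pandas library, create a data frame, and write it to a new store.

import pandas as pd
# A small data frame with two columns, A and B
df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
# Open store.h5 in write mode (needs the PyTables package installed)
store = pd.HDFStore("store.h5", 'w')
store.put('data', df)     # write df under the key 'data'
print(store.get('data'))  # read it back to confirm it was stored
print(store.keys())       # list the keys available in the store

The data frame is stored in a variable called df and has two columns, A and B. Next, we open a new HDF file called store.h5 in write mode. The put method stores the data frame in the file under the key data.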

In the last line, we print the list of keys available in the store.

Keys

The data frame can be retrieved using the following code.

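print(store.select('/data'))  # the leading slash is optional: 'data' works too
store.close()                 # close the file once we are done with it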

As required in the syntax, we have specified the key to select the data frame.

Selecting a data frame from the store

Applying Conditions with the Where Clause

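We can use the where clause to retrieve specific columns from the data frame or to filter it by a condition, much like a SQL query. The where clause only works on objects stored in the table format, so we pass format='table' to the put method.

# Assumes pandas is imported as pd and df is the data frame created above
with pd.HDFStore("test1.h5", 'a') as store:
    # where only works on the table format, not the default fixed format
    store.put('data', df, format='table')
    print(store.get('data'))
    print(store.keys())
    # Select only column B; quoting 'B' keeps it from being read as a variable
    selected_data = store.select('/data', where="columns=['B']")
print(selected_data)

Here, we used the where clause to specify that we only want column B of the data frame.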
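The where clause can also filter rows by value. The sketch below assumes the object is written with format='table' and data_columns=True, which makes the columns queryable. The file name and the data are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file and data, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "where_demo.h5")
df = pd.DataFrame({"A": [1, 3, 5, 7], "B": [2, 4, 6, 8]})

with pd.HDFStore(path, mode="w") as store:
    # data_columns=True makes every column usable inside a where clause.
    store.put("data", df, format="table", data_columns=True)

    # Rows where column B is greater than 4.
    big_b = store.select("data", where="B > 4")

    # Conditions on the index and on a column can be combined with &.
    combined = store.select("data", where="(index >= 1) & (A < 7)")

print(big_b)
print(combined)
```

Without data_columns, only the index and the columns keyword can be used in a condition, so it is worth deciding at write time which columns will need filtering.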

Where Clause

Slicing Data with Start and Stop Parameters

Let us also use the start and stop parameters to retrieve the rows between the range [start, stop).

a = ['apple', 'banana', 'cherry', 'date', 'elderberry']
b = [10, 20, 30, 40, 50]
df = pd.DataFrame({'A': a, 'B': b})
with pd.HDFStore('xampl.h5', mode='w') as store:
    store.put('data', df)
with pd.HDFStore('xampl.h5', mode='r') as store:
    subset = store.select('data', start=1, stop=3)
    print(subset)

In this code snippet, we define two lists, a and b, which become the columns of the data frame df. As before, the data frame is written to the HDF file xampl.h5 using the put method. Then we open the same file in read mode and select the rows at positions 1 and 2 (the row at stop=3 is excluded), storing them in a new variable called subset. This subset of the data frame is then printed to the screen.
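The same two parameters can be used to page through a stored object. The following is a minimal sketch rather than a recommended pattern: the file name and the page size are hypothetical, and it assumes the object was written in the table format so that its row count can be read with get_storer.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file, data, and page size, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "paging_demo.h5")
df = pd.DataFrame({"B": [10, 20, 30, 40, 50]})
page_size = 2

with pd.HDFStore(path, mode="w") as store:
    store.put("data", df, format="table")

with pd.HDFStore(path, mode="r") as store:
    # Number of rows stored under the key 'data'.
    nrows = store.get_storer("data").nrows
    # Read the rows [start, stop) one page at a time.
    pages = [
        store.select("data", start=start, stop=min(start + page_size, nrows))
        for start in range(0, nrows, page_size)
    ]

for page in pages:
    print(page)
```

Each page keeps the original index labels, so the pieces can be concatenated back into the full data frame if needed.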

Start and Stop parameters

Wrapping Up

In this post, we briefly discussed the Hierarchical Data Format and how it lets us conveniently store multiple datasets in a single file. We then retrieved a data frame stored in an HDF file using the select method of the pandas HDFStore, and learnt how to use the where clause to apply conditions to the selection. Lastly, we used the start and stop parameters to obtain the rows of the data frame within a specified range.
