Imagine you are working on a project that requires you to store a lot of data, say a million rows and a hundred columns. Storing that much data in a comma-separated values (CSV) file, the format most familiar to data and machine learning enthusiasts, can be cumbersome.
There are several alternatives for storing large data that make it easier to manipulate and more portable, such as pickle, Feather, Parquet, and the Hierarchical Data Format (HDF).
HDF is a storage format that organizes multiple data objects hierarchically inside a single file. In this post, we will look at the format in detail and learn how to read an HDF file in Python as a dataframe.
Reading Hierarchical Data Format (HDF) files is streamlined by the pandas library’s read_hdf method. It allows efficient handling of large datasets of the kind often used in data-intensive fields like machine learning. HDF, particularly its latest version HDF5, supports complex data collections in a single file, making it well suited to projects involving massive datasets.
Introduction to Hierarchical Data Format (HDF)
HDF, short for Hierarchical Data Format, stores large amounts of data under a single parent file. You might think it is only suitable for storing text and numbers. However, HDF files can also store images, strings, and more. The format is also platform-independent: files created on one platform can be moved to another, which makes collaboration easier.
These files usually carry a .h5 or .hdf5 extension, and HDF5 is the most recent version of the format.
HDF can be used from many programming languages, including C++, Python, and R.
Imagine you are working on a project that receives data from several IoT devices. These devices generate signals during a certain period. You also get the metadata along with the signals. If we were to visualize this, it would look something like this.
Let us try to understand the structure using this illustration. IoT Devices.h5 is the HDF file. Device 1 and Device 2 are the groups. The Metadata entries are the datasets. We can also include metadata for the groups to make the structure self-describing.
A regular file system is a helpful analogy for the hierarchical format: groups behave like folders, and datasets behave like the files inside them.
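To make the file, group, and dataset idea concrete, here is a minimal sketch that builds a small hierarchical file with pandas. The file name, the device_1 and device_2 group names, and the sample values are all made up for illustration, and the example assumes the PyTables package (tables) is installed alongside pandas.

```python
import os
import tempfile

import numpy as np
import pandas as pd  # HDF support in pandas also needs PyTables ("tables")

# Hypothetical stand-in data: each device has a signal and some metadata.
signal = pd.DataFrame({"t": np.arange(5), "value": np.linspace(0.0, 1.0, 5)})
metadata = pd.DataFrame({"sampling_rate_hz": [50], "battery_pct": [87]})

path = os.path.join(tempfile.mkdtemp(), "iot_devices.h5")

# Slash-separated keys create nested groups inside the single file.
with pd.HDFStore(path, mode="w") as store:
    for device in ("device_1", "device_2"):
        store.put(f"{device}/signal", signal)
        store.put(f"{device}/metadata", metadata)
    keys = sorted(store.keys())

# Each key is a path through the hierarchy, much like a file path.
print(keys)
```

Any of the listed keys can then be passed to pd.read_hdf to load just that one object.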
Let us see how we can read HDF files using Python.
PyTables with HDF5 in Python
Before we use Python to work with the hierarchical data format, we need to understand what the pandas library uses under the hood. pandas relies on PyTables to read and write HDF files. PyTables is built on top of the HDF5 library and is designed for browsing and searching through large amounts of data.
There are also third-party packages like h5py that can be used to inspect how the data is stored inside an HDF file.
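As a rough illustration of that relationship, the sketch below writes a tiny dataframe with pandas and then opens the same file directly with PyTables to list its groups. The file name and the key signals are assumptions made for this example only.

```python
import os
import tempfile

import pandas as pd
import tables  # PyTables, the library pandas relies on for HDF files

path = os.path.join(tempfile.mkdtemp(), "example.h5")

# Write a small frame with pandas (the key name "signals" is made up).
pd.DataFrame({"value": [1.0, 2.0, 3.0]}).to_hdf(path, key="signals", mode="w")

# Open the same file directly with PyTables and collect its group paths.
with tables.open_file(path, mode="r") as h5:
    group_paths = [group._v_pathname for group in h5.walk_groups("/")]

print(group_paths)
```

The key passed to pandas shows up as a group in the file, which is why read_hdf later asks for a key to pick an object.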
Reading HDF Files Using Pandas
The pandas read_hdf function opens an HDF file and reads an object stored in it. Like the other pandas readers, it typically returns a dataframe. The syntax is given below.
pandas.read_hdf(path_or_buf, key=None, mode='r', errors='strict', where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, **kwargs)
Let us take a look at the important parameters.
- path_or_buf: The path to the HDF5 file, given as a string or path object. Only the local file system is supported; remote URLs are not
- key: The identifier of the group (the stored object) you want to read. If the file contains only one pandas object, this can be omitted
- mode: The mode in which the file is opened: read-only (r), read and write on an existing file (r+), or append (a), which creates the file if it does not exist. The default is r
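The behaviour of key can be sketched on a throwaway file. The station_a and station_b keys and the values below are hypothetical; the example also touches where and columns from the signature, which, as far as pandas documents them, only apply to objects written with format="table".

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "params_demo.h5")
df = pd.DataFrame({"no2": [40.0, 55.0, 61.0, 38.0], "o3": [12.0, 9.0, 7.0, 15.0]})

# One object in the file: key can be left out when reading.
df.to_hdf(path, key="station_a", mode="w", format="table")
only_object = pd.read_hdf(path)

# Two objects in the file: leaving out key now raises a ValueError.
df.to_hdf(path, key="station_b", mode="a", format="table")
try:
    pd.read_hdf(path)
    key_was_required = False
except ValueError:
    key_was_required = True

# where and columns select rows and columns while reading a table-format object.
subset = pd.read_hdf(path, key="station_b", where="index >= 2", columns=["no2"])
print(subset)
```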
Reading Madrid’s Air Quality HDF File
The dataset we are going to use for this implementation is Madrid’s air quality dataset, which has records from 2001 to 2018. It contains daily and hourly historical air quality data for Madrid.
This dataset has many objects, so we use the key parameter to select a single one to read.
Firstly, we need to import the pandas library.
import pandas as pd
Next, we use the method as described in the syntax.
hdffile = pd.read_hdf('/content/madrid.h5', mode='r', key='28079006')
hdffile is the dataframe that holds the contents read from the file. We use read mode since we are only concerned with reading the file. The key we pass is one of the groups in the file.
For reference, this is the structure of the directory.
Now, we have to print the data frame.
As we can see, there are around 75,000 rows and 15 columns. While this number is relatively small, the hierarchical data format can also store much larger files.
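For readers without the Madrid file at hand, the inspection step can be tried on a small stand-in. Everything in this sketch is made up for illustration (the file name, the station_demo key, the column names, and the random values); only the read_hdf call mirrors the one used above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in for madrid.h5: random daily values stored under a made-up key.
path = os.path.join(tempfile.mkdtemp(), "air_quality_demo.h5")
index = pd.date_range("2001-01-01", periods=48, freq="D")
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((48, 3)), index=index, columns=["NO_2", "O_3", "PM10"])
demo.to_hdf(path, key="station_demo", mode="w")

# Same call shape as in the article, pointed at the stand-in file.
hdffile = pd.read_hdf(path, key="station_demo", mode="r")

print(hdffile.shape)   # (number of rows, number of columns)
print(hdffile.head())  # first five rows
```

The shape attribute is a quick way to check the row and column counts without printing the whole dataframe.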
In this post, we discussed the hierarchical data format, why it is used, and how its storage is structured, with the help of a diagram. We also covered the components of that structure: groups, datasets, and metadata.
We then looked at the libraries that can be used to work with HDF files from Python: PyTables and h5py.
Finally, we went through the important parameters of the read_hdf method and used it in Python on a real-world dataset.
We used read mode in the implementation. However, you can extend this example by using the write and append modes too!
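As a starting point for that extension, here is a minimal sketch of writing and then appending with DataFrame.to_hdf. The file name, the readings key, and the values are hypothetical, and the example assumes the table format, which pandas requires for appending rows.

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "append_demo.h5")

first = pd.DataFrame({"no2": [40.0, 55.0]}, index=[0, 1])
second = pd.DataFrame({"no2": [61.0, 38.0]}, index=[2, 3])

# mode="w" creates (or overwrites) the file; format="table" makes it appendable.
first.to_hdf(path, key="readings", mode="w", format="table")

# mode="a" keeps the existing file; append=True adds rows under the same key.
second.to_hdf(path, key="readings", mode="a", format="table", append=True)

combined = pd.read_hdf(path, key="readings")
print(combined)
```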