How to Load a Feather Object Using read_feather?

Load A Feather Format Object From The File Path

read_feather is a Pandas function that reads a file stored in the feather format from a given path and returns its contents.

When a feather file is read with read_feather, the data it holds is loaded into a Pandas Data Frame.

Before reading a feather-format object, let us first understand what a feather is and its advantages.


What is a Feather?

Feather is a binary storage format for data frames that is fast, lightweight, and easy to use.

Feather uses the Arrow IPC format to store data frames.

By using the Arrow IPC format, Feather provides a common data exchange format that can be used to share data between different programming languages and tools. This mechanism might come in handy when working with multiple tools of different languages.

It also means that read_feather only accepts files that are actually in this format. Trying to read an incompatible file, such as a CSV, fails with a "not an Arrow file" style of error. Hence, we need to ensure the data is in the required format.


Characteristics of feather format

There are a few characteristics that make the feather format widely used.

  • The Feather format is fast to read and write
  • It is a lightweight data storage structure
  • It is language-agnostic, meaning the same file can be exchanged between different languages
  • It is used for temporary storage purposes
  • One thing to keep in mind while working with feather is that the original (version 1) format does not support nested datatype columns

Advantages

  • The feather format is portable, which means it can be used from many languages without any hassle
  • Reading and writing with the feather format is incredibly fast

Prerequisites

Feather Library

Before we do anything with the feather format, we need to make sure the tooling for it is installed on our system. The feather-format package, a thin wrapper around PyArrow, can be installed with one simple command.

pip install feather-format

Note that Pandas itself relies on PyArrow, described next, for read_feather and to_feather.

PyArrow

PyArrow is the essential library for working with the feather format: Pandas uses it under the hood to read and write feather files.

It is the Python library for the Apache Arrow data format.

Here is how you can install this library.

pip install pyarrow
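After installing, a quick check like the following can confirm that both packages are importable from the current environment. This is only a convenience sketch using the standard library; the status dictionary is a name introduced here for illustration.

```python
import importlib.util

# find_spec returns None when a package cannot be imported from this environment.
status = {}
for package in ("pandas", "pyarrow"):
    status[package] = importlib.util.find_spec(package) is not None
    print(package, "is installed" if status[package] else "is MISSING")
```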

Exploring the Syntax of Pandas.read_feather

The Pandas library provides functions for reading and writing many data storage formats.

The syntax for pandas.read_feather is as follows:

pandas.read_feather(path, columns=None, use_threads=True, storage_options=None)

Some important parameters are given below.

| Number | Parameter | Description | Type/Default Value | Requirability |
|---|---|---|---|---|
| 1 | path | The path to the file, given as a string. It can also be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a hostname must be specified, for example file://localhost/path/to/table.feather, where the host is localhost. | str | Required |
| 2 | columns | Reads only the selected columns. If not provided, all columns are read. | sequence, default=None | Optional |
| 3 | use_threads | Whether to parallelize reading the file using multiple threads. | bool, default=True | Optional |
| 4 | storage_options | Extra options that are passed to storage connections, such as host, port, username, and password. For http URLs, the key-value pairs are forwarded to urllib.request.Request as header options. | dict, default=None | Optional |

Parameters of read_feather

Return Type: Returns the object stored in the file, which for a feather file is a Pandas Data Frame.


Example 1: Writing a Data Frame and Passing it as input to read_feather

In this example, let us write a data frame to a feather file and read it back.
We will also use NumPy to generate an array of records.

Read this article on how to create arrays using the NumPy library

The code is given below.

#Example 1
import pandas as pd
import numpy as np
#generate one million random values using numpy
a = np.random.randn(int(1e6))
#build 10 columns that all reuse the same array
cols = {f'column_{i}': a for i in range(10)}
df = pd.DataFrame(cols)
#write the data frame to a feather file, then read it back
df.to_feather('test_df.feather')
df = pd.read_feather('test_df.feather')
df

Let us go through the code line by line.

In the first line, we import the Pandas library under the alias pd. This step is compulsory because the methods we use are part of this library.

Next, we import the NumPy library as np. This library is used to create the array we are going to use.

a=np.random.randn(int(1e6)): This line creates 1 million random numbers, drawn from the standard normal distribution, using the random module of the NumPy library.

cols = {f'column_{i}': a for i in range(10)}: This line creates a dictionary called cols. With the help of the range function, it builds 10 key-value pairs, column_0 through column_9, each mapping a column name to the same array of a million numbers.

In the next step, we convert these key-value pairs into a data frame called df.
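To see what the dictionary comprehension produces, here is a scaled-down sketch with 1,000 rows instead of a million. It also shows that every column holds the same values, since they all reuse one array. The second data frame, named distinct here, is an assumption about what one might want instead and is not part of the original example.

```python
import numpy as np
import pandas as pd

a = np.random.randn(1000)  # smaller than the article's one million, for a quick run
cols = {f"column_{i}": a for i in range(10)}
df = pd.DataFrame(cols)

print(df.shape)
# Every column was built from the same array, so they all hold the same values.
print(df["column_0"].equals(df["column_9"]))

# Assumption: if distinct values per column are wanted, draw a 2-D array instead.
distinct = pd.DataFrame(
    np.random.randn(1000, 10), columns=[f"column_{i}" for i in range(10)]
)
print(distinct.shape)
```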

We aim to read a feather file, but all we have is a data frame, so we first write the data frame out in the feather format using df.to_feather. The new file that stores it is test_df.feather.

Now all we are left to do is read this file using pd.read_feather.

Finally, the last line displays the data frame that was read back.

The output is given below.

Read_Feather Example 1

Example 2: Reading a Feather File from its Path

Next, let us read an existing feather file directly from its path, following the syntax above.

The dataset used in this example is AMEX-Default Prediction in Feather format.

This dataset is mainly used to predict whether a credit card customer will default in the future based on previous data.

Here is the code:

#Example2
import pandas as pd
df=pd.read_feather('/content/drive/MyDrive/train.feather')
df

Let us break down the code.

In the second line, we import the Pandas library, which is used to read the feather file.

In the next line, we create a new variable, df, which stores the data frame returned by read_feather.

Next, we display the data residing in df.

The output is a data frame shown below.

Read_Feather Example 2
Read_Feather Example 2

As the image shows, the data has a lot of NaN values.

Read this article to know how to replace NaN values with zero.
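For a quick look at how many values are missing, and at one simple way of handling them, the following sketch uses a tiny made-up data frame as a stand-in for the loaded dataset. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded dataset, which is too large to embed here.
df = pd.DataFrame({"col_a": [0.93, np.nan, 0.88], "col_b": [np.nan, np.nan, 0.0]})

# Count missing values per column.
nan_counts = df.isna().sum()
print(nan_counts)

# Replace every NaN with zero; the original frame is left unchanged.
filled = df.fillna(0)
print(int(filled.isna().sum().sum()))
```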

Let us see how much time it took to read. We can use %timeit to check the time taken.

The one-line code is as follows:

%timeit pd.read_feather('/content/drive/MyDrive/train.feather')
%Timeit

Example 3: Comparison of read_csv, read_feather, and read_parquet

Let us compare the read times of CSV, feather, and parquet.

We will take the data frame from Example 1, write it out as a CSV file and a parquet file, and compare read_csv and read_parquet against read_feather to see which is faster.

read_csv

df.to_csv('test_df.gzip.csv', compression='gzip')
%timeit df = pd.read_csv('test_df.gzip.csv', compression='gzip')

In the first line, we write the data frame obtained in the first example to a CSV file using to_csv. The gzip compression mode is used to reduce the size of large files on disk.
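The effect of the compression argument can be checked by writing the same data frame with and without it and comparing the file sizes. This sketch uses a smaller frame than the example and temporary files; the exact sizes vary with the random data.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({f"column_{i}": np.random.randn(10_000) for i in range(10)})

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "test_df.csv")
    gzip_path = os.path.join(tmp, "test_df.gzip.csv")

    df.to_csv(plain_path)
    df.to_csv(gzip_path, compression="gzip")

    plain_size = os.path.getsize(plain_path)
    gzip_size = os.path.getsize(gzip_path)

    # Compression changes bytes on disk only; the rows and columns come back the same.
    restored = pd.read_csv(gzip_path, compression="gzip", index_col=0)

print(f"plain CSV: {plain_size} bytes, gzip CSV: {gzip_size} bytes")
print(restored.shape)
```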

Check out this article on How to save a DataFrame as a CSV file in Python.

Next, we use the %timeit magic command to check the time taken to read the data in CSV format.

Here is the output.

%timeitcsv

As you can see, the time to read in CSV format is around 3 seconds.

read_feather

Let us check how long it takes to read the feather format.

Here is the code.

df.to_feather('test_df.feather')
df1=pd.read_feather('test_df.feather')
%timeit df1 = pd.read_feather('test_df.feather')
%timeitfeather

We take the data frame created in the first example and convert it to a feather format using to_feather in the first line.

Next, we create a new variable called df1 to hold the data frame read by pd.read_feather.

Lastly, we use the %timeit magic command to check the time taken to read the data frame in feather format.

As seen above, reading feather format took a few milliseconds.

read_parquet

Let us check how long it takes to read the data frame as a parquet file.

If you are unfamiliar with the Parquet format, refer to this article on Pandas read_parquet.

Here is the code.

df.to_parquet('test_df.parquet')
%timeit pd.read_parquet('test_df.parquet')

df.to_parquet is a method used to write the data frame in the parquet format. The name of this file is test_df.parquet.

In the next line, we pass the read_parquet call to the %timeit magic command to check the reading time.

%timeitparquet

Reading the data frame in parquet format took just 91.8 milliseconds.

Here is the overall comparison of CSV, feather, and parquet.

%timeit of CSV, feather, parquet

As seen from the above image, the feather format was the fastest of the three to read this data frame.

Conclusion

To summarize what we have learned in this post:

We have seen what the feather format is, along with its characteristics, its internal storage mechanism, and its advantages.
We have also seen how to install the two main libraries for working with feather files: PyArrow and the Feather library.
Next, we explored the syntax of read_feather and understood all of its parameters in detail.
Coming to the examples, we generated a data frame using the random functions of the NumPy library, wrote it to a feather file, and read it back with read_feather.
Next, we took a feather dataset and loaded it from its path.
We have also seen the usage of %timeit to check the reading time.
Finally, we compared the reading times of CSV, Feather, and Parquet, and it turns out the feather format wins!

References

You can find the feather dataset (Example 2) here.

Also, refer to the Pandas documentation for more clarity.