How to Write a Dataframe to the Binary Feather Format?

Feather is a lightweight, open-source, and portable storage format used for storing data frames that can be interchanged between languages like Python and R.

Feather is mainly used for sharing data between tools and for data analysis. It is typically faster to read and write than text-based formats such as CSV.

Feather (version 2) uses the Arrow IPC file format internally and stores Arrow tables.

Arrow IPC Format

Apache Arrow IPC (inter-process communication) is a standard format for serializing and exchanging Apache Arrow data structures.

This format is language agnostic, i.e., compatible with many different languages, and is used in big data applications.

Why Use Feather Format?

Reading and writing Feather files is usually faster than with text formats such as CSV, and the files often take less space on disk.

Feather files are portable between languages such as Python, R, and Julia.

Feather supports compression, which helps with large files.

Feather also stores metadata such as column data types, which text formats like CSV do not preserve.

Prerequisites of Feather

Before we work with Feather files, we need to install an essential library called PyArrow, which supports reading and writing the Feather format.

Through Command Line or Terminal

The simple command is pip install pyarrow.

Through Conda Prompt

conda install -c conda-forge pyarrow

In Google Colaboratory

!pip install pyarrow

All About the df.to_feather Method

The Pandas library provides a method called df.to_feather to write a data frame to a Feather file. Its syntax is given below.

DataFrame.to_feather(path, **kwargs)

The description for arguments is as follows.

Parameters	Description	Default/Type	Required/Optional
path	A string path, path object, or writable binary file-like object to which the Feather file is written	str, path object, or file-like object	Required
**kwargs	Additional keywords passed to `pyarrow.feather.write_feather()`, such as `compression`, `compression_level`, `chunksize`, and `version`	–	Optional

arguments of to_feather

Another Useful Method – pd.read_feather()

pd.read_feather() reads a Feather file back into a data frame. The examples below use it to check what was written.

Creating a Data Frame from a Dictionary and Writing the Data Frame to Feather

In this example, let us create a simple data frame and pass it to the to_feather() method.

The code snippet for creating a data frame is given below.

import pandas as pd
!pip install pyarrow
# dictionary of lists
data = {'name': ["Rahul", "Kiran", "Balaram", "Kavya", "Sirisha", "Srushti", "Jaya", "Anupam", "Akshaya", "Santosh"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA", "MBA", "B.Tech", "BCA", "PhD", "M.Tech", "PhD"],
        'Id': [315, 204, 132, 312, 313, 615, 212, 121, 133, 120],
        'score': [90, 40, 80, 98, 75, 76, 89, 45, 36, 92]}
# converting the dictionary to a data frame
df = pd.DataFrame(data)
print(df)
print("*"*30)
print(df.dtypes)

Let us decode the above snippet.

import pandas as pd: This line imports the Pandas library under its standard alias, pd.

!pip install pyarrow: This installs the PyArrow library from inside a notebook such as Colab or Jupyter; the leading ! is notebook syntax, not Python. If PyArrow is already installed, this step can be skipped.

Next, we create a dictionary and store it in a variable called data, which is later converted into a data frame.

The dictionary created above maps each key to a list. That is, the key 'name' holds a list of names, 'degree' holds a list of degrees, and likewise for 'Id' and 'score'.

We are converting the dictionary into a data frame in the coming line using the pd.DataFrame method. The new data frame is stored in an object called df.

We are printing the data frame in the next line using print().

print("*"*30): This line creates a separator by printing 30 asterisks (*) to the screen.

The last line prints the data type of each column in the data frame.

Here is the data frame.

Now let us see how to write this data frame to feather.

# writing the data frame to feather
df.to_feather('df.feather')
# reading the feather file
fthr = pd.read_feather('df.feather')
print(fthr.dtypes)

df.to_feather: The data frame obtained earlier is written to a file in the Feather format. The .feather extension is a convention rather than a requirement, but it makes the file type obvious.

In the following line, we are reading the feather file and storing it in an object called fthr.

We can check the data types of the elements present in this feather file using dtypes. They would be the same as the data types in the data frame.

The feather file will be stored on your local disk.

In Colab, the file df.feather also appears in the file browser on the left, which confirms that the data frame was written to feather successfully.

Converting an Excel File Into a Data Frame

Let us take Excel data, convert it into a data frame, and pass it to the method.

The Excel data considered here contains three entries for three attributes: Name, Age, and ID.

Consider the following code for reading an Excel file and converting it into a data frame.

import pandas as pd
# reading the Excel file (read_excel returns a data frame)
df = pd.read_excel('/content/Book1.xlsx')
# optional: df is already a data frame
df1 = pd.DataFrame(df)
print(df1)
print("*"*30)
print(df1.dtypes)

import pandas as pd is used to bring the Pandas library to our environment.

Since the data is in Excel format, we first read it using the read_excel method, which already returns a data frame. It is stored in an object called df.

Next, we pass df to pd.DataFrame. This step is optional, because df is already a data frame; the result is stored in a new object called df1.

We are printing the data frame in the next line using print().

print("*"*30): This line creates a separator by printing 30 asterisks (*) to the screen.

dtypes is used to check the data types present in the data frame.

The data frame and the data types are shown below.

The next step is to write this data frame to feather.

# writing the data frame to feather
df1.to_feather('df1.feather')
# reading the feather file
fthr = pd.read_feather('df1.feather')
print(fthr.dtypes)

The data frame is now written to a Feather file by to_feather. The name of the file is df1.feather.

In order to read the feather file, we use pd.read_feather. The result is stored in a variable called fthr.

The data types of the objects present in the file can be known using dtypes.

These data types would be the same as those in the data frame.

The output is shown below.

It can be observed that there is a file created for the new feather format.

Converting a CSV File to a Data Frame

Let us take a CSV file and convert it into a data frame.

Let us know about the dataset first.

The CSV file we are going to use is a movie review dataset that is mainly used for sentiment analysis.

The data set contains two columns: text, the review for the movie, and label, the sentiment of the review (0 or 1).

If the review is positive, the label is 1. If it is negative, the label is 0.

The code for reading a CSV file, and converting it into a data frame is given below.

import pandas as pd
!pip install pyarrow
# reading the csv file
df = pd.read_csv('/content/Test.csv')
df.head()

import pandas as pd: This line imports the Pandas library with pd as its alias.

!pip install pyarrow: This command installs the PyArrow library.

In the next line, we are reading the CSV file using pd.read_csv. The resulting data frame is stored in an object called df.

df.head(): This method returns the first five rows of the data set; in a notebook, they are displayed automatically.

Now let us see the creation of a data frame.

#converting into data frame
df3=pd.DataFrame(df)
print(df3)

The second line passes df to pd.DataFrame and stores the result in df3. Since pd.read_csv already returns a data frame, this step is optional.

In the next line, we are printing the data frame.

The data frame is obtained as shown below.
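To see what pd.read_csv actually returns, the sketch below reads a tiny in-memory CSV (two made-up reviews standing in for the real data set) and checks the type of the result before and after passing it through pd.DataFrame.

```python
import io

import pandas as pd

# Two invented rows standing in for the movie review data set.
csv_text = "text,label\nGreat movie,1\nTerrible plot,0\n"

df = pd.read_csv(io.StringIO(csv_text))
print(type(df))
print(isinstance(df, pd.DataFrame))

# Wrapping the result in pd.DataFrame gives a frame with the same content.
df3 = pd.DataFrame(df)
print(df3.equals(df))
```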

Let us write this data frame to the Feather format.

# writing the data frame to feather
df3.to_feather('df3.feather')
# reading the feather file
fthr = pd.read_feather('df3.feather')
print(fthr.dtypes)

The data frame obtained above is written to a Feather file named df3.feather.

The next line is used to read the feather file.

dtypes is used to inspect the data types of the columns in the file.

The new feather file is stored in df3.feather.

Reading a Parquet File as a Data Frame and Writing it to Feather

Let us see how to write a data frame to feather format by reading a parquet file.

The parquet file we are going to use is an Employee details dataset from Kaggle.

Its important columns include registration_dttm, id, first_name, last_name, email, gender, ip_address, title, and salary.

The code for reading a parquet file from a path is given below.

import pandas as pd
!pip install pyarrow
# reading a parquet file
pq_file = '/content/userdata3 (1).parquet'
df = pd.read_parquet(pq_file, engine='auto')

The first two lines import the Pandas library and install the PyArrow library, respectively.

The next line stores the path of the parquet file in a variable called pq_file.

This parquet file is read with the help of pd.read_parquet, which returns a data frame. It is stored in a new variable called df.

Now, let us pass this result to pd.DataFrame.

#data frame
df4=pd.DataFrame(df)
print(df4.head())

As in the earlier examples, the result is passed to pd.DataFrame and stored in df4. Because pd.read_parquet already returns a data frame, this step is optional.

df4.head() returns the first five rows of the data frame.

The data frame is shown below.

Finally, here is how to write the data frame to feather.

# writing the data frame to feather
df4.to_feather('df4.feather')
# reading the feather file
fthr = pd.read_feather('df4.feather')
print(fthr.dtypes)

Once we have a data frame, the next step is to write it to a Feather file, which is done by to_feather. The name of the feather file is df4.feather.

After writing the Feather file, we read it back using another helpful Pandas method, read_feather. The result is stored in fthr.

We can also print the data types of objects present in the file using dtypes.

Conclusion

To conclude, we have learned about the Feather file format, the Arrow IPC format it uses internally, the PyArrow library needed to work with Feather files and how to install it, and the syntax of the df.to_feather method.

In the first example, we have seen how we can create a data frame from a dictionary and then write the data frame to a feather.

We have also seen how the feather format preserves the data types of the columns in the data frame after conversion, which text formats such as CSV do not do.

In the next example, we have taken an Excel file, read it into a data frame, and written this data frame to a feather.

Next, we have also seen the conversion of a CSV file into feather format and also the usage of df.head() which returns the first five rows of the data frame.

Lastly, we have seen the reading of a parquet file and then writing it to a feather file.

Datasets

CSV Dataset

https://www.kaggle.com/datasets/columbine/imdb-dataset-sentiment-analysis-in-csv-format

Parquet file

https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata3.parquet

References

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_feather.html