Pandas read_parquet — How to load a parquet object and return a dataframe


Before loading a parquet object, let us first understand what a parquet file is and how it differs from a CSV file.

What is a parquet?

Parquet is a columnar storage format, meaning data is stored and compressed column by column rather than row by row. It is mainly used in big data processing because it can efficiently handle large, complex datasets.

Parquet files are self-describing, making them easy to work with using processing tools such as Apache Spark, Apache Impala, and so on.

Applications of Parquet

  • Parquet fits naturally into columnar ecosystems, such as the Apache Arrow in-memory format; it plays a role comparable to Apache ORC, another columnar on-disk format.
  • Parquet files are also highly optimized for reading and writing, and they can be split into smaller pieces that accommodate parallel processing. This makes it ideal for large data sets.
  • It can be used with distributed systems like Hadoop and Spark.
  • Parquet is a popular choice for storing and processing large, complex data sets, and is widely supported by big data processing tools and libraries.

How is parquet different from CSV?

Although both CSV and parquet are data storage formats, there are a few differences between them that make parquet stand out from CSV.

  • A CSV (comma-separated values) file is a row-oriented, table-like structure, with each line representing a record, whereas parquet is a columnar storage format that organizes data into columns rather than rows.
  • In CSV, the data is separated by commas, and the file can be opened and edited using a text editor or spreadsheet software. Parquet, by contrast, is a binary format that cannot be viewed meaningfully in a text editor.
Consider a table with three rows and three columns; the CSV and parquet representations of it are shown in the figure below.

Parquet vs CSV
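
To make the contrast concrete, here is a minimal sketch that writes the same small table in both formats using pandas' to_csv() and to_parquet() writers (the file names are arbitrary, and to_parquet() requires a parquet engine such as pyarrow or fastparquet to be installed):

import pandas as pd

# A small 3x3 table, as in the illustration above
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['a', 'b', 'c'],
    'score': [10, 20, 30],
})

# CSV: row-oriented, human-readable text
df.to_csv('table.csv', index=False)

# Parquet: column-oriented, compressed binary
df.to_parquet('table.parquet', index=False)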

Loading a parquet object into a DataFrame

There are two methods by which we can load a parquet file using pandas:

  • Using read_parquet()
  • Importing pyarrow.parquet

Using read_parquet()

Pandas follows a standard naming convention for its readers: read_<format>(), as in read_json(), read_excel(), and read_html(). Accordingly, read_parquet() reads the Parquet file at the specified path and returns a DataFrame containing the data from the file.

The syntax is as follows:

pandas.read_parquet(path, engine='auto', columns=None, storage_options=None, use_nullable_dtypes=False, **kwargs)

Some important parameters are:

  • path: str, path object, or file-like object

Any valid string path is acceptable, and the string could also be a URL (for file URLs, a host is expected). A local file could be: file://localhost/path/to/table.parquet.

  • engine: {‘auto’, ‘pyarrow’, ‘fastparquet’}, default ‘auto’

The parquet library to use. With ‘auto’, pandas tries pyarrow first and falls back to fastparquet if pyarrow is unavailable.

  • columns: list, default None

If not None, only these columns will be read from the file (see the example after this list).
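
For example, using the user-data file from Example 1 below, a column-pruned read might look like this (only id and email are loaded; the rest of the file is skipped):

import pandas as pd

# Read just two columns; with a columnar format,
# the unread columns are never deserialized
df = pd.read_parquet('/content/userdata3.parquet', columns=['id', 'email'])
print(df)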

Example 1: Exploring User Data

Let us look at the first example. The dataset is a user-data file with columns such as id, first_name, email, gender, salary, and so on.

import pandas as pd

# Read the parquet file at this path into a DataFrame
pq_file = '/content/userdata3.parquet'
df = pd.read_parquet(pq_file, engine='auto')
print(df)

The parquet file is now converted into a data frame.

Output of read_parquet()

Now, let us look at the second method: importing pyarrow.parquet directly.

pyarrow is the library pandas uses under the hood to read the parquet file.

The code is as follows:

import pyarrow.parquet as pq

# pq_file is the path to the parquet file, as before;
# read_table() returns a pyarrow Table, which to_pandas() converts
df = pq.read_table(pq_file).to_pandas()
df

The output for this method is the same as the first method.
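
This can be verified with DataFrame.equals(), a quick sanity check assuming both frames were read from the same file with the same (pyarrow) engine:

# Compare the results of the two loading methods
df_pandas = pd.read_parquet(pq_file)
df_pyarrow = pq.read_table(pq_file).to_pandas()
print(df_pandas.equals(df_pyarrow))  # expected: True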

Now that the parquet file is converted to a DataFrame, we can manipulate the data and make any necessary changes.
As seen in the output above, the column ‘comments’ contains many missing values, so it is not useful for any operation.
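
A quick check with isna() confirms this before the column is dropped:

# Count missing values in each column
print(df.isna().sum())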

Therefore, we can eliminate this column using drop(). Since drop() returns a new DataFrame rather than modifying df in place, the result is assigned back:

df = df.drop('comments', axis=1)

The modified data frame is given below.

Example 1 output
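
If the cleaned data should be kept, the modified DataFrame can be written back to disk with to_parquet() (the output file name here is just an illustration):

# Save the cleaned DataFrame as a new parquet file
df.to_parquet('/content/userdata3_clean.parquet', index=False)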

Example 2: Exploring an investment parquet file

In this example, we are going to look at an investment parquet file.

The code for the example is as follows:

import pandas as pd

pq_file = '/content/example_test.parquet'
df = pd.read_parquet(pq_file, engine='auto')
df

The data frame for this example is:

Example 2 output
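
Once loaded, the usual pandas inspection methods apply; for instance, a quick overview of the columns and summary statistics:

# Column names, dtypes, and non-null counts
df.info()

# Summary statistics for the numeric columns
print(df.describe())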

Example 3: Academic intrusion detection dataset

In this example, let us look at an academic intrusion-detection dataset (KDDTest).

The code is as follows:

import pandas as pd

pq_file = '/content/KDDTest.parquet'
df = pd.read_parquet(pq_file, engine='auto')
df

The parquet file is now converted into a DataFrame.

Example 3 output

Conclusion

In this article, we have seen how to convert a parquet file into a DataFrame, as well as the differences between the comma-separated values (CSV) and parquet storage formats.

Datasets used in the article