Read an ORC File Returning a Data Frame

If you are a data scientist, a machine learning engineer, or just a data enthusiast who enjoys experimenting with data, you are probably familiar with data storage formats such as text files (txt), Excel workbooks (xlsx), database tables, or comma-separated values (CSV) files.

ORC is a less commonly known data storage format. ORC stands for Optimized Row Columnar, and it stores data in columns rather than in rows, which is a different approach compared to the formats above.

Have you heard of Parquet?

In this article, we are going to learn what the ORC structure is and how to read an ORC file into a data frame.

What Is ORC?

The Optimized Row Columnar (ORC) storage format stores data in columns, which enables parallel processing of the data and also helps to store it efficiently. It was initially introduced by Hortonworks and is often seen as a competitor to Parquet, another columnar storage format that was introduced later. ORC works with tools such as Apache Arrow and Apache Hive, and it is now an open-source project that is continuously improved and maintained in the Apache Hadoop ecosystem.

If you are familiar with CSV, you know that the data is stored in the form of rows. Contrary to that, the ORC format stores the data in the form of columns. Below is a representation of row vs columnar storage.
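The difference between the two layouts can be sketched in a few lines of plain Python. This is only a simplified illustration: the records and their values are made up, and a real ORC file also groups rows into stripes and adds indexes and metadata.

```python
# Three made-up records, each with an id, a city, and a price.
records = [
    (1, "Melbourne", 850000),
    (2, "Sydney", 1200000),
    (3, "Perth", 640000),
]

# Row-oriented layout (CSV-like): all fields of one record sit together.
row_layout = [value for record in records for value in record]

# Column-oriented layout (ORC-like): all values of one column sit together.
column_layout = [value for column in zip(*records) for value in column]

print(row_layout)
print(column_layout)

# Reading a single column from the columnar layout is one contiguous slice.
prices = column_layout[2 * len(records):3 * len(records)]
print(prices)
```

In the row layout, the prices are scattered across the whole sequence, while in the columnar layout they sit next to each other, which is why reading a few columns out of many tends to be cheaper with a columnar format.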

Why ORC over CSV?

The storage footprint of a format is the amount of storage a file in that format occupies. ORC generally has a smaller footprint than CSV: the same data stored as CSV can take more than double the space it takes as ORC, although the exact ratio depends on the data and on the compression settings.

Converting data to and from ORC preserves the data types of the columns, which is not possible with CSV because CSV stores every value as plain text.

Refer to this article to learn how to create and save data with CSV.

When it comes to moving data between languages or file formats, the pandas library for Python has built-in methods to export a data frame to ORC and to read ORC files as a data frame.

We are going to study the method to read the ORC files and obtain a data frame.

Understanding read_orc

read_orc is a pandas function that takes the path to an ORC file and returns a data frame. The syntax of the function is given below.

pandas.read_orc(path, columns=None, dtype_backend=_NoDefault.no_default, **kwargs)

The path argument is the path to the ORC file.

The columns argument restricts the data frame to specific columns. The column names must be passed as a list.

The dtype_backend argument selects which data types the resulting data frame uses. The available options are numpy_nullable and pyarrow. This feature is still experimental.

kwargs are additional keyword arguments that are passed on to the pyarrow package.

Let us see a few examples.

Reading an ORC File to a Data Frame

We are going to use the sample ORC data sets from the Apache repository on Google Source and apply this method to them.

import pandas as pd

# Path to the sample ORC file
orc = '/content/orc_index_int_string.orc'
df = pd.read_orc(orc)
df

The variable orc holds the path to the ORC file.

df is the data frame returned by read_orc.

Lastly, the final line displays the data frame when the code is run in a notebook.

Reading an ORC File to a Data Frame With Specified Columns

In this example, we are going to do the same as above but include only a few columns. This can be done with the columns argument of the function.

orcf = '/content/over1k_bloom.orc'
# Load only the six listed columns
df = pd.read_orc(orcf, columns=['_col0', '_col1', '_col3', '_col5', '_col8', '_col10'])
df

The variable orcf holds the path to the ORC file. There are a total of 11 columns in this data file.

The data frame df contains only the six columns named in the list passed to columns (['_col0', '_col1', '_col3', '_col5', '_col8', '_col10']).

Lastly, the data frame is displayed.

ORC To Data Frame With Specified Columns

Converting a CSV File to ORC and Obtaining a Data Frame

In this example, we are going to take a CSV file, convert it to ORC, and then obtain a data frame from the ORC file. If you have worked with data sets, you may be familiar with the Melbourne Housing Data. We are using this data set for this example.

Before diving into the code, we need to install the pyarrow library. Run the command below to install it.

pip install pyarrow

import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc

csvf = '/content/Melbourne_housing_FULL.csv'  # input CSV file
orcf = '/content/Melbourne_housing_FULL.orc'  # output ORC file

# Read the CSV into a data frame and convert it to a pyarrow table
cs = pd.read_csv(csvf)
table = pa.Table.from_pandas(cs)

# Write the table to the ORC file
with pa.OSFile(orcf, 'wb') as orc_file:
    orc.write_table(table, orc_file)

In the first line, we are importing the pandas library. The next two lines import the pyarrow library and its orc module.

csvf holds the path to the Melbourne CSV file, and orcf is the path where the ORC file will be stored.

Before we do anything else, we need to read the CSV file, for which we use the pd.read_csv function.

The resulting data frame is converted to a pyarrow table, which is then written to the ORC file.

df = pd.read_orc(orcf, columns=['Suburb', 'Address', 'Rooms', 'Type', 'SellerG', 'Date', 'Distance', 'Regionname', 'Propertycount'])
df.head(10)

The df variable holds the data frame returned by read_orc. We are loading only a few columns of the data set.

The next line displays the first 10 rows of the data frame.

Conclusion

To conclude, we have learned what the ORC file format is and how it stores data differently from traditional data storage structures. ORC stands for Optimized Row Columnar, and as its name suggests, it stores data in the form of columns. We looked at the difference between row and columnar storage with the help of a visual representation.

ORC is recommended because it has a smaller storage footprint than popular formats like CSV, and it also preserves the data types of the columns during conversion.

In this article, we have looked at the syntax of the function used to read an ORC file and return a data frame, and we went through its arguments and what each of them does.

We have seen three examples that use real data sets. In the first example, we used the function with no arguments except the file path. We read the file from its path and obtained a data frame.

In the second example, along with the file path, we used the columns argument to include only specific columns in the data frame.

Lastly, we took a CSV file, converted it to an ORC file using the pyarrow library, and then read this ORC file into a data frame with specified columns.

Data Sets

You can find the sample ORC data sets here

Melbourne Housing Data

References

Learn more about ORC structure here

You can find more about the method here.