What Is ORC and How to Write a Data Frame to ORC Format?

ORC, which stands for Optimized Row Columnar, is a storage format that was introduced to store Hive workloads efficiently. As the name suggests, ORC stores data in columns, which enables parallel processing of the data and also helps to store it efficiently.

The ORC format was initially introduced by Hortonworks to work with big data tools like Apache Hive. It is now an open-source Apache project that is continuously improved and maintained within the Hadoop ecosystem.

Even though it was developed to work with tools like Apache Hive, ORC can also be used to store data from other sources, such as a Pandas data frame.

The Pandas library provides methods both for writing a data frame to the ORC format and for reading an ORC file back into a data frame.

The methods we are going to use are DataFrame.to_orc and pd.read_orc. We are going to revisit the basic concepts of data frames and ORC, and then take a look at a few examples of the conversion.

What Is a Data Frame?

A data frame is the most fundamental and popular storage structure of the Pandas library. It stores data in a way similar to a table, in the form of rows and columns.

A data frame can store heterogeneous items: each column has its own data type. For example, the column headers can be strings while the values in the rows are numeric.
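As a small illustration of mixed column types, the sketch below builds a data frame from made-up values (the column names and numbers are assumptions, not taken from the IRIS file) and collects the data type of each column.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different kind of value
df = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica"],
    "petal_length": [1.4, 4.7, 6.0],
    "count": [50, 50, 50],
})

# Map every column name to the name of its data type
dtype_names = {col: str(dtype) for col, dtype in df.dtypes.items()}
print(dtype_names)
```

The text column, the decimal column, and the integer column each report a different data type, even though they live in the same data frame.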

The Pandas library provides the pd.DataFrame constructor to convert other data structures to a data frame. Using it, we can create a data frame from a list, a dictionary, or a list of dictionaries. Data stored in a CSV file or an Excel file can be loaded into a data frame with pd.read_csv and pd.read_excel.
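The following sketch shows three of these input shapes producing the same data frame. The column names x and y and the values are made up for illustration.

```python
import pandas as pd

# From a dictionary of columns
from_dict = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# From a list of dictionaries (one dictionary per row)
from_records = pd.DataFrame([{"x": 1, "y": 3}, {"x": 2, "y": 4}])

# From a list of rows, with the column names given separately
from_rows = pd.DataFrame([[1, 3], [2, 4]], columns=["x", "y"])

all_equal = from_dict.equals(from_records) and from_dict.equals(from_rows)
print(all_equal)
```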

We can also export a data frame to file formats that other programming languages and tools can read, and load data from those formats back into a data frame.

Let us see an example of creating a data frame from a CSV file.

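import pandas as pd

# Read the CSV file; read_csv already returns a data frame
data = pd.read_csv('IRIS.csv')
# Wrapping the result in pd.DataFrame is optional and leaves the data unchanged
df = pd.DataFrame(data)
df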

In this example, we first import the Pandas library as pd, which is the standard alias for the library.

Next, we create a variable called data that stores the CSV data set we downloaded. The data set we are using is one of the most popular data sets for machine learning, the IRIS data set. It contains measurements of iris flowers (petal width, sepal width, petal length, and sepal length) along with the species each flower belongs to.

Another variable called df is used to store the data frame created by pd.DataFrame. In the last line, we display this newly created data frame.

The data frame also has methods that display only part of the data. The df.tail() method returns the last five rows of the data frame by default, and the number of rows can be customized. Likewise, the df.head() method returns the first five rows of the data frame.

Let us see how to print the last 10 rows of the data frame.

df.tail(10)

When we are analyzing the data frame, the df.info() method helps us get details such as the data types of the columns, the number of non-null elements, and so on.

df.info()
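If the IRIS file is not at hand, the same three methods can be tried on a small made-up data frame. In this sketch the column name n is an assumption, and the output of info() is captured in a text buffer instead of being printed directly.

```python
import io

import pandas as pd

# Made-up data frame with 20 rows
df = pd.DataFrame({"n": range(20)})

first_three = df.head(3)   # first 3 rows instead of the default 5
last_ten = df.tail(10)     # last 10 rows

# info() writes to standard output by default; here it goes to a buffer
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```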

What Is ORC Format?

As discussed above, ORC stands for Optimized Row Columnar. By row columnar, we mean that the rows of a data set are collected into groups (called stripes in ORC), and within each group the values are stored column by column. ORC is the successor of the Record Columnar File (RCFile) format and is mainly designed to store Apache Hive data efficiently.

ORC is mainly used to store very large data sets for big data analytics. Just like the Feather and Apache Parquet formats, ORC also supports compression of the data.

When we talk about the ORC format, we also need to talk about storage footprint. Storage footprint is the amount of storage occupied by data or files in a system. For big data, ORC typically gives a smaller storage footprint than keeping the same data in a data frame or in a plain-text format like CSV.

Also, when we convert a data frame to ORC, the data types of the columns in the data frame are preserved in the ORC file, which is not possible with text formats like CSV.
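The CSV side of this claim can be checked without any extra library. In the sketch below, a made-up data frame with a datetime column is written to CSV text held in memory and read back. The column names when and value are assumptions.

```python
import io

import pandas as pd

# Made-up data with a datetime column
df = pd.DataFrame({
    "when": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "value": [1.5, 2.5],
})

# Round trip through CSV text held in memory
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

original_kind = df["when"].dtype.kind   # 'M' stands for a datetime type
csv_kind = from_csv["when"].dtype.kind  # what the column became after the round trip
print(original_kind, csv_kind)
```

Because CSV stores everything as text, the datetime column comes back as plain text unless the reader is told to parse it again.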

Here is a flow chart that helps you understand how the ORC format stores data.

Data Frame to ORC Method Explained

The method has the following syntax:

DataFrame.to_orc(path=None, *, engine='pyarrow', index=None, engine_kwargs=None)

The parameters of the method follow the description given below.

Argument	Description	Necessity
path	Accepts a string, a file-like object, or None. If it is a string, it is used as the output path (the root directory path when writing a partitioned dataset). If it is a file-like object, it must have a write() method. If it is None, a bytes object is returned.	Optional (default: None)
engine	The library used to write the file. The default, 'pyarrow', is the only supported value. The installed pyarrow version must be 7.0.0 or later.	Optional (default: 'pyarrow')
index	Decides whether the index of the data frame is included in the output file. If True, the index is included. If False, the index is not included.	Optional (default: None)
engine_kwargs	A dictionary of additional keyword arguments passed to the underlying pyarrow library.	Optional (default: None)

Arguments of DataFrame.to_orc

Returns: A bytes object if the path is None. Otherwise, the data is written to the given path and nothing is returned.

This method can raise the following errors.

NotImplementedError: Raised if one or more columns of the data frame have a category, unsigned integer, interval, or sparse data type.

ValueError: Raised if the engine is anything other than pyarrow.
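The engine check is easy to try out. In this sketch, "not-a-real-engine" is a made-up engine name used only to trigger the error, and the message is captured so it can be inspected.

```python
import pandas as pd

# Made-up sample data
df = pd.DataFrame({"x": [1, 2, 3]})

# Any engine other than 'pyarrow' should be rejected with a ValueError
try:
    df.to_orc(engine="not-a-real-engine")
    engine_error = None
except ValueError as exc:
    engine_error = str(exc)

print(engine_error)
```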

Before we move on to the examples, there are some prerequisites to follow.

Prerequisites

As DataFrame.to_orc uses the pyarrow library under the hood, we need to make sure it is installed in our system or in the environment we are working in.

PyArrow is the Python library for Apache Arrow and is designed to work with large and complex data sets.

PyArrow provides fast, memory-efficient data structures and algorithms that can be used for various data processing tasks, such as reading and writing data to and from disk and performing data transformations.

! pip install pyarrow

Although this command works in most cases, installing the pyarrow library through Conda is recommended.

conda install -c conda-forge pyarrow

How to Write a Data Frame to ORC?

We are going to see a few examples of writing a data frame to an ORC file and checking if the data types are preserved.

Writing a Simple Data Frame to ORC

Let us take the IRIS data set, create a data frame from it, and then write this data frame in ORC format.

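import pandas as pd
import pyarrow as pa  # not called directly here; pandas uses pyarrow to write ORC

# Read the IRIS data set into a data frame
data = pd.read_csv('IRIS.csv')
df = pd.DataFrame(data)
df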

In this example, we first import the Pandas library as pd, which is the standard alias for the library, and also the pyarrow library as pa.

As before, the variable df stores the data frame created by pd.DataFrame, and the last line displays it.

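# Write the data frame to an ORC file
df.to_orc('df.orc')
# Time the write (IPython/Jupyter magic command)
%timeit df.to_orc('df.orc')
# Read the ORC file back into a data frame
pd.read_orc('df.orc')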

First, we use the df.to_orc method to write the data frame to an ORC file named df.orc.

The %timeit magic command, available in IPython and Jupyter notebooks, measures the time a single line of code takes to run. Here, we are checking the time taken to convert the data frame to ORC format.

Next, we use pd.read_orc to read the ORC file back.

In our run, the conversion took just 172 microseconds.

Writing a Data Frame to ORC With Index

We are going to use the index parameter of the method to include the index of the data frame in the ORC file.

import pandas as pd
import pyarrow as pa

# Food items with their calories and the quantity purchased
groc = {'Food': ['Tacos', 'Mac and Cheese', 'Carbonara', 'Lasagna', 'Croissant'],
        'Calories': [226, 164, 574, 135, 406],
        'Quantity': [3, 1, 2, 3, 1]}
df = pd.DataFrame(groc)
df

In this example, we are importing the pandas and pyarrow libraries in the first two lines.

Next, a dictionary of different food items, their calories, and the quantity purchased is stored in a variable called groc.

Next, a variable called df is created to store the data frame. This data frame is printed in the next line.

df.to_orc(path='groc.orc', engine='pyarrow', index=True)
pd.read_orc('groc.orc')

We are using df.to_orc with a path for the ORC file, and the engine is set to pyarrow, which is the default. We are also specifying that the index should be included in the output.

The pd.read_orc method is used to display the output.

A file named groc.orc is created in the working directory, and in the output we can see that the index is included.

Checking if the Data Types Are Preserved in the ORC Format

In this example, we are going to check if the data types of the elements in the data frame are preserved in the ORC file.

import pandas as pd
x = [2, 4, 6, 8, 10]
y = [1, 3, 5, 7, 11]
z = [2, 3, 5, 7, 11]
data = {'x': x, 'y': y, 'z': z}
df = pd.DataFrame(data)
print(df)
print("The data types are:")
df.dtypes

In the first line, we are importing the pandas library.

Next, we are creating three lists named x, y, and z, each holding a few integers.

Then a dictionary called data is created to store the three lists. This dictionary is used to create a data frame named df.

Next, we are printing the data frame and checking the data types of its columns using the dtypes property.

The next step is to convert this data frame into an ORC format.

df.to_orc('num.orc', index=True)
pd.read_orc('num.orc')

The data frame is written in ORC format with the help of the to_orc method and stored in a file called num.orc. We are also including the index.

Next, the pd.read_orc method is used to display the contents of the ORC file.

The file num.orc now appears in the working directory.

Now let us check if the data types of the elements in the ORC file are the same as the data frame.

import pyarrow.orc as orc

# Open the ORC file in binary mode and read its schema
with open('num.orc', 'rb') as file:
    reader = orc.ORCFile(file)
    schema = reader.schema
    # Pair each column name with its data type
    for field, typ in zip(schema.names, schema.types):
        print(f'Column: {field}, Data Type: {typ}')

In the first line, we are importing the orc module from the pyarrow library.

Next, we are opening the ORC file created earlier in binary read mode and initializing a reader for it.

Next, we are reading the schema of the file, which holds the column names and their data types.

We are using a for loop to go through each field and its data type in the file.

The print function is used to print the column name and the corresponding data type.

Data Types Are Preserved In The ORC File

Conclusion

To conclude, we have learned about the ORC format, how it stores data efficiently, and how it helps in parallel processing of the data.
ORC stands for Optimized Row Columnar and was initially introduced to store Hive data efficiently.
It is used in big data analytics to store data in a compact format. It can also be used to store data from other sources, such as a Pandas data frame.

The Pandas library has a method called DataFrame.to_orc to write a data frame in ORC format.
We started off with the basics of data frames: creating a data frame from a CSV file, printing the last ten rows of the data frame, and printing information about the data frame.
Next, we learned about the ORC format and how ORC stores data with the help of a flow chart.
In the next section, we explored the syntax of the method and its arguments.
We have also seen the errors this method can raise.

There are a few prerequisites before working with the ORC format. We have seen how to install the pyarrow library.
Next, we have seen how to write a data frame to an ORC file.
In the first example, we took the IRIS data set and created a data frame from it. This data frame was written to an ORC file using the method, and we also checked the time taken to convert the data frame to ORC.

In the next example, we followed the same process but also included the index in the ORC file.
Lastly, we took another data frame and checked the data types of its columns. This data frame was converted to an ORC file, and then we used a short piece of code to check if the data types of the columns in the ORC file are the same.
From this example, we can say that the ORC file preserves the data types of the data frame after conversion.

References

The IRIS data set can be downloaded from here.

You can learn more about the data frame to orc method from the official documentation.

Find the official pyarrow documentation here.