Let’s now learn how to print data using PySpark. Data is one of the most essential resources available today, and it can exist in encrypted or unencrypted form. We also create a huge amount of information daily, whether by tapping a button on a smartphone or surfing the web on a computer. But why are we talking so much about this?
The main problem researchers faced in previous years was how to manage such large amounts of information, and technology was the answer. Apache Spark came into existence, and PySpark was built on top of it to bring that power to Python.
If you’re new to PySpark, here’s a PySpark tutorial to get you started.
Intro to Spark Using PySpark
Apache Spark is a data processing engine that helps us build analytics solutions for huge software development projects.
It is also a tool of choice for Big Data Engineers and Data Scientists, and knowledge of Spark is an in-demand skill for placements at various tech companies.
It comes with many extensions and management options. One of them is PySpark, the Spark API from and for Python developers. It installs like any other Python library on each computer, which makes implementations easy to manage, and as we all know, installing libraries is quite easy in Python.
Before We Print Data Using PySpark
Before we get into learning the different ways you can print data using PySpark, there are some prerequisites that we need to consider:
- Core understanding of Python
- Core understanding of PySpark and its supporting packages.
- Python 3.6 and above
- Java 1.8 and above (mandatory)
- An IDE like Jupyter Notebook or VS Code.
To check the versions, open the command prompt and type the commands:
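On most systems, commands along these lines report the installed versions (the exact executable names, e.g. `python` vs `python3`, depend on your machine):

```shell
python --version   # should report Python 3.6 or newer
java -version      # should report a Java 1.8+ runtime
```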
You can print data using PySpark in the following ways:
- Print Raw data
- Format the printed data
- Show top 20-30 rows
- Show bottom 20 rows
- Sort data before display
Creating a session
A session in the Spark environment is a record holder for all the instances of our activities. To create it, we use the SQL module from the Spark library.
The builder attribute of the SparkSession class has an appName() function, which takes the name of the application as a string parameter.
Then we create the app using the getOrCreate() method, chained with the dot ‘.’ operator. Using these pieces of code, we create our app as ‘App‘.
We are free to give the application any name we like, but never forget to create a session, as we cannot proceed without one.
from pyspark.sql import SparkSession
session = SparkSession.builder.appName('App').getOrCreate() # creating an app
Different Methods To Print Data Using PySpark
Now that you’re all set, let’s get into the real deal and look at the different ways to print data using PySpark.
1. Print raw data
In this example, we’ll work with a raw dataset. In the AI (Artificial Intelligence) domain, we call a collection of data a dataset.
It comes in various forms, such as Excel spreadsheets, comma-separated value (CSV) files, text files, or server document models. So, keep track of which file format we are using to print the raw data.
Here, we are using a dataset with a .csv extension. The session’s read attribute has various functions for reading files.
These functions are named after the file types they handle, so we use the csv() function for our dataset and store the result in the data variable.
data = session.read.csv('Datasets/titanic.csv')
data # calling the variable
By default, PySpark reads all the data as strings. So, when we call our data variable, it returns every column labeled by number (_c0, _c1, and so on) with its values as strings.
To print the raw data, call the show() function on the data variable using the dot operator – ‘.’
2. Format the data
Formatting the data in PySpark means showing the appropriate data types of the columns present in the dataset. To display the headers, we use the option() function, which takes two arguments in the form of strings.
For the key parameter we pass ‘header’, and for the value we pass ‘true’. This tells Spark to display the header names at the top rather than column numbers.
Most important is scanning the data type of each column. For this, we activate the inferSchema parameter of the csv() function we used earlier to read the dataset. It is a boolean parameter, which means we need to set it to True to activate it. We chain each function with the dot operator.
data = session.read.option('header', 'true').csv('Datasets/titanic.csv', inferSchema=True)
As we can see, the headers are now visible along with the appropriate data types.
3. Show top 20-30 rows
Displaying the top 20-30 rows takes just one line of code. The show() function does this for us. If the dataset is too large, it shows the top 20 rows by default, but we can make it display as many rows as we want – just pass that number as a parameter to show().
data.show() # to display top 20 rows
data.show(30) # to display top 30 rows
We can do the same using the head() function, which specifically gives access to the rows at the topmost section of the dataset. It takes the number of rows as a parameter and displays that many. For example, to display the first 10 rows:
But the result comes back as an array or list of Row objects rather than a formatted table, which makes head() inconvenient for larger datasets with thousands of rows.
4. Showing bottom 20-30 rows
This is also an easy task. The tail() function helps us here: call it on the data frame variable and pass the number of rows to display as a parameter. For example, to display the last 20 rows, we write:
As with head(), the output is a list of Row objects, so it is hard to get a proper view when the dataset is too large to show that many rows.
5. Sorting the data before display
Sorting is the process of placing things in proper order, either ascending – smaller to greater – or descending – greater to smaller. It plays an important role in viewing data points in sequence. Columns in the data frame can be of various types, but the two main types are integer and string.
- For integers sorting is according to greater and smaller numbers.
- For strings sorting is according to alphabetical order.
PySpark’s sort() function serves exactly this purpose. It can take either a single column or multiple columns as parameters. Let us try it on our dataset by sorting the PassengerId column. For this, we have two functions.
Sorting in ascending order
data = data.sort('PassengerId')
The PassengerId column has been sorted, with all the elements placed in ascending order. Here we sorted only a single column; to sort multiple columns, pass them to the sort() function one by one, separated by commas.
data = data.sort('Name', 'Fare')
Sorting in descending order
Descending order is specifically the job of the orderBy() function, which provides a special option to sort our data that way.
The code stays almost the same; we just call the desc() function on each column inside orderBy(), joining it to the column with the dot operator.
desc() sorts all the elements of those particular columns in descending order.
First, let us take a look at all the columns in the dataset.
In the code below, we sort the Name and Fare columns. Name is of a string data type, so it is sorted alphabetically, while Fare is a number, so it follows a greater-to-smaller pattern.
data = data.orderBy(data.Name.desc(), data.Fare.desc())
So, this was all about how we can print data using PySpark. Every snippet is short and sweet to understand, and this is enough to gain a working knowledge of Spark functions. The Spark environment is very powerful for Big Data and other industry and tech domains.