Let’s now learn how to print data using PySpark. Data is one of the most essential resources available today, and it can exist in encrypted or unencrypted form. We also create a huge amount of information daily, whether by tapping a button on a smartphone or surfing the web on a computer. But why are we talking so much about this?
The main problem researchers faced in previous years was how to manage such large amounts of information, and technology was the answer. Apache Spark came into existence, and PySpark was built on top of it to bring that power to Python.
If you’re new to PySpark, here’s a PySpark tutorial to get you started.
Intro to Spark Using PySpark
Apache Spark is a data processing engine that helps us build analytics solutions for huge software development projects.
It is also a tool of choice for Big Data Engineers and Data Scientists, and knowledge of Spark is an in-demand skill for placements at various tech companies.
It comes with many extensions and management options. One of them is PySpark, the Spark API from and for Python developers. It installs like any other Python library on each computer, which makes implementations easy to manage, and as we all know, installing libraries is quite easy in Python.
Before We Print Data Using PySpark
Before we get into learning the different ways you can print data using PySpark, there are some prerequisites that we need to consider:
- Core understanding of Python
- Core understanding of PySpark and its supporting packages.
- Python 3.6 and above
- Java 1.8 and above (mandatory)
- An IDE like Jupyter Notebook or VS Code.
To check the versions, open the command prompt and type the commands:
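On most systems, commands along these lines report the installed versions (the exact executable names, e.g. `python` vs `python3`, depend on your machine):

```shell
python --version   # should report Python 3.6 or newer
java -version      # should report a Java 1.8+ runtime
```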
You can print data using PySpark in the following ways:
- Print Raw data
- Format the printed data
- Show top 20-30 rows
- Show bottom 20 rows
- Sort data before display
Creating a session
A session in the Spark environment is a record holder for all the instances of our activities. To create it, we use the SQL module from the Spark library.
The builder attribute of the SparkSession class has an appName() function, which takes the name of the application as a string parameter.
Then we create the app using the getOrCreate() method, chained with the dot ‘.’ operator. Using these pieces of code, we create our app as ‘App‘.
We are free to give the application any name we like, but never forget to create a session, as we cannot proceed without one.
from pyspark.sql import SparkSession
session = SparkSession.builder.appName('App').getOrCreate() # creating an app
Different Methods To Print Data Using PySpark
Now that you’re all set, let’s get into the real deal and look at the different ways to print data using PySpark.
1. Print raw data
In this example, we’ll work with a raw dataset. In the AI (Artificial Intelligence) domain, we call a collection of data a dataset.
It comes in various forms, such as Excel spreadsheets, comma-separated value (CSV) files, text files, or server document models. So, keep track of which file format we are using to print the raw data.
Here, we are using a dataset with a .csv extension. The session’s read attribute has various functions for reading files.
These functions are named after the file types they handle, so we use the csv() function for our dataset and store the result in the data variable.
data = session.read.csv('Datasets/titanic.csv')
data # calling the variable
By default, PySpark reads all the data as strings. So, when we call our data variable, it returns every column labeled by number (_c0, _c1, and so on) with its values as strings.
To print the raw data, call the show() function on the data variable using the dot operator – ‘.’
2. Format the data
Formatting the data in PySpark means showing the appropriate data types of the columns present in the dataset. To display the headers, we use the option() function, which takes two arguments in the form of strings.
For the key parameter we pass ‘header’, and for the value we pass ‘true’. This tells Spark to display the header names at the top rather than column numbers.
Most important is scanning the data type of each column. For this, we activate the inferSchema parameter of the csv() function we used earlier to read the dataset. It is a boolean parameter, which means we need to set it to True to activate it. We chain each function with the dot operator.
data = session.read.option('header', 'true').csv('Datasets/titanic.csv', inferSchema=True)
As we can see, the headers are now visible along with the appropriate data types.
3. Show top 20-30 rows
Displaying the top 20-30 rows takes just one line of code. The show() function does this for us. If the dataset is too large, it shows the top 20 rows by default, but we can make it display as many rows as we want – just pass that number as a parameter to show().
data.show() # to display top 20 rows
data.show(30) # to display top 30 rows
We can do the same using the head() function, which specifically gives access to the rows at the topmost section of the dataset. It takes the number of rows as a parameter and displays that many. For example, to display the first 10 rows:
But the result comes back as an array or list of Row objects rather than a formatted table, which makes head() inconvenient for larger datasets with thousands of rows.
4. Showing bottom 20-30 rows
This is also an easy task. The tail() function helps us here: call it on the data frame variable and pass the number of rows to display as a parameter. For example, to display the last 20 rows, we write:
As with head(), the output is a list of Row objects, so it is hard to get a proper view when the dataset is too large to show that many rows.
5. Sorting the data before display
Sorting is the process of placing things in proper order, either ascending – smaller to greater – or descending – greater to smaller. It plays an important role in viewing data points in sequence. Columns in the data frame can be of various types, but the two main types are integer and string.
- For integers sorting is according to greater and smaller numbers.
- For strings sorting is according to alphabetical order.
PySpark’s sort() function serves exactly this purpose. It can take either a single column or multiple columns as parameters. Let us try it on our dataset by sorting the PassengerId column. For this, we have two functions.
Sorting in ascending order
data = data.sort('PassengerId')
The PassengerId column has been sorted, with all the elements placed in ascending order. Here we sorted only a single column; to sort multiple columns, pass them to the sort() function one by one, separated by commas.
data = data.sort('Name', 'Fare')
Sorting in descending order
Descending order is specifically the job of the orderBy() function, which provides a special option to sort our data that way.
The code stays almost the same; we just call the desc() function on each column inside orderBy(), joining it to the column with the dot operator.
desc() sorts all the elements of those particular columns in descending order.
First, let us take a look at all the columns in the dataset.
In the code below, we sort the Name and Fare columns. Name is of a string data type, so it is sorted alphabetically, while Fare is a number, so it follows a greater-to-smaller pattern.
data = data.orderBy(data.Name.desc(), data.Fare.desc())
So, this was all about how we can print data using PySpark. Every snippet is short and sweet to understand, and this is enough to gain a working knowledge of Spark functions. The Spark environment is very powerful for Big Data and other industry and tech domains.