In this article, we’ll learn more about PySpark. Data fuels every aspect of the Information Technology and digital domains, so to use it properly we need to know a few essentials. Data is available in enormous quantities nowadays, and there are software toolkits for managing this Big Data. One of them is PySpark.
PySpark is the Python API for Apache Spark.
Working With Data Columns Using PySpark
This article is for people who already know something about Apache Spark and Python programming. Knowledge of Python and of data analysis with PySpark is a must for understanding this topic.
If you’re all set, let’s get started.
1. Installing Pyspark
This section discusses the installation of PySpark. Open the command prompt and make sure you have added Python to the PATH in the Environment Variables. Then run the following pip command:
pip install pyspark
Now that we have successfully installed the framework on our system, let us make our way to the main topic.
2. Setting Up The Environment
There are some prerequisites for a smooth workflow. They are as follows:
Tools and resources used:
- Environment: Anaconda
- Python version: 3.6 and above
- IDE: Jupyter Notebook
- Dataset: salary.csv

Steps we will cover:
- Creating a session
- Reading a dataset
- Displaying the dataset
3. Creating a session in Pyspark
A session in PySpark is one of the most important aspects of performing a Big Data analysis. A session creates an application for us that holds every record of our activity and each checkpoint. To create a session, use the following code:
import pyspark
import warnings
warnings.filterwarnings('ignore')
from pyspark.sql import SparkSession

session = SparkSession.builder.appName('App').getOrCreate()
The SQL module’s SparkSession class helps us create a session. We create a session variable as an instance of the class. The builder attribute’s appName() method gives the application its name, and the getOrCreate() method then creates the interactive app (or returns an already running one). Now that we have a strong base, let us make our way further and read a dataset.
4. Reading a dataset
When we read a dataset, the machine reads it in the form of an SQL table. By default, every column and cell in this table is read as a string. We will read salary.csv from the Datasets folder. If the file lives inside a folder, giving the full folder path is the best option.
Following is the code for that:
data = session.read.csv('salary.csv')
data
First, we create a variable, ‘data’, that holds our dataset. The session’s read function reads the datasets, and it has sub-functions that read files of various types. Among the formats PySpark can read are CSV, JSON, Parquet, ORC, and plain text.
5. Displaying the dataset
When we read the dataset, it exists only in memory. To view it, there is the show() method. If the dataset is too large, the method displays only the first twenty rows; but if it is small, say ten or fifteen rows, it displays the whole table.
Column Transformations Using PySpark
In the output above, each element of the table is read as a string, and the columns are named by their index: with four columns, they are labelled _c0 to _c3. We need to display the table with its proper column titles instead. This will be our core topic of discussion in this article, so let us get up to pace with it.
As basic operations, we can perform the following transformations on a dataset:
- Creating a new column
- Selecting one specific column
- Selecting multiple columns
- Adding columns
- Deleting columns
- Renaming columns
We do not need an external library for this, because PySpark has built-in features to do the same. The read method’s option() attribute lets us read the file with its headers. Following is the code for that.
data = session.read.option('header', 'true').csv('Datasets/salary.csv', inferSchema=True)
data
The option('header', 'true') setting makes the column headings visible by treating the file’s first row as the header. The inferSchema parameter, set to True, makes Spark read each column with its respective data type instead of plain strings.
Let us move our study towards the main techniques on the columns.
1. Selecting a column
Selecting a specific column of the dataset is quite easy in PySpark. The select() function takes a column as a parameter and returns that single column in the output.
Also, to list all the available columns we use the columns attribute, which returns them in the form of a list. In this example, we will select the ‘job’ column from the dataset.
2. Selecting multiple columns
We use the same select() function for selecting multiple columns; it can take several columns as parameters. Here we select the ‘company’ and ‘job’ columns from the dataset.
3. Adding columns
Adding a column takes just a single line of code. PySpark provides the withColumn() and lit() functions.
- The withColumn() function: this function takes two parameters
- The name to be given to the new column.
- A Column expression supplying its values, such as an existing column from the data frame or a literal.
- The lit() function integrates with the withColumn() function to add a new column. It takes a single parameter: a constant value to be given for each row.
We will add a new column, ‘Tax Cutting’, to our data frame using the withColumn() function. Let us say the tax cut is common to all employees, so it is a constant value.
from pyspark.sql.functions import lit

# adding a column to the dataframe
data = data.withColumn('Tax Cutting', lit(0.1))
4. Deleting columns
Deleting a column permanently removes all of that column’s contents. PySpark provides flexible functionality for this: as in Pandas, we have the drop() function, which takes the column to be dropped as a parameter. We will drop the degree column from the dataset. Be careful with the spelling: if the name does not match an existing column, Spark’s drop() silently returns the data frame unchanged rather than raising an error.
data = data.drop("degree")
data.show()
5. Renaming a column
Renaming a column changes the heading or title of the column. For this we use the withColumnRenamed() function, which takes two parameters.
- Existing column name
- New name to be given to that column.
To understand it practically, we will rename the ‘job’ column to ‘Designation’.
data = data.withColumnRenamed('job', 'Designation')
In the above code, job is the existing column name in the data frame and Designation is the new name we give to that column.
Here the article ends. We covered the basics of PySpark’s column transformations: creating a new column, selecting, deleting, and renaming columns, and making changes to their contents. This is the basic journey to getting started with this library. All the best for your future studies.