Working With Columns Using PySpark in Python

In this article, we’ll learn more about PySpark. Data fuels every aspect of the Information Technology and digital domains, so to use it properly we need to know a few essentials. Data is available in large quantities nowadays, and there are software toolkits for managing this Big Data. One of them is PySpark.

PySpark is the Python API for Apache Spark, distributed as a Python library.

Working With Data Columns Using PySpark

This article is for readers who already know some Apache Spark and Python programming. A working knowledge of Python and of data analysis with PySpark is needed to follow this topic.

If you’re all set, let’s get started.

1. Installing Pyspark

This section covers the installation of PySpark. Open the command prompt and make sure Python has been added to the PATH in the Environment Variables. Next, type in the following pip command:

pip install pyspark

Installing Pyspark Through Command Prompt

Now that the framework is installed on our system, let us move on to the main topic.

2. Setting Up The Environment

A few prerequisites make for a smooth workflow. They are listed below.

Tools and resources used

Environment: Anaconda
Python version: 3.6 and above
IDE: Jupyter Notebooks
Dataset: salary.csv

Steps to follow

Creating a session
Reading a dataset
Displaying the dataset

3. Creating a session in Pyspark

A session is one of the most important parts of a Big Data analysis in PySpark because it is the entry point to Spark's data frame functionality. A session creates an application for us that holds every record of our activity and each checkpoint. To create a session, use the following code:

Code:

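import pyspark
import warnings
warnings.filterwarnings('ignore')

from pyspark.sql import SparkSession

# 'App' is a placeholder application name; any name works
session = SparkSession.builder.appName('App').getOrCreate()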

The SQL module’s SparkSession class helps us to create a session. We create a session variable that holds an instance of the class. The builder attribute’s appName() method gives the application its name. Then the getOrCreate() method returns the session that is already running, or creates a new one if there is none. Now that we have a strong base, let us move on to reading a dataset.

4. Reading a dataset

When we read a dataset, Spark loads it as a data frame, a structure similar to an SQL table. For a CSV file, every column and cell in this table is read as a string by default. We will read salary.csv from the Datasets folder, which is where the dataset is located. If the file sits inside a folder, include that folder in the path.

Following is the code for that:

data = session.read.csv('Datasets/salary.csv')
data

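First, we create a variable, data, that holds our dataset. The session’s read attribute returns a reader object whose methods load data from different sources. Among them, csv, json, orc, parquet and text read files of the corresponding type, jdbc and table read from a database connection or a catalog table, and format and schema configure the reader rather than read anything themselves. The reader offers the following methods: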

csv
format
jdbc
json
orc
parquet
schema
table
text

5. Displaying the dataset

Reading the dataset only loads it into the session; nothing is displayed yet. To view it we use the show() method. By default, show() displays only the first twenty rows, so a small dataset of ten or fifteen rows is displayed in full.

data.show()

Column Transformations Using PySpark

In the output of data.show(), every element of the table is read as a string. Spark also names the columns by position. Thus, if we have four columns, they are named _c0 to _c3. We need to display the table with appropriate column titles. This is the core topic of this article, so let us get started.

For a basic operation we can perform the following transformations to a dataset:

Creating a new column
Selecting one specific column
Selecting multiple columns
Adding columns
Deleting columns
Renaming columns

We do not need an external library for any of this because PySpark has these features built in. First, though, we want the proper column headers to appear. The reader’s option() method lets us do that. Following is the code for that.

data = session.read.option('header', 'true').csv('Datasets/salary.csv', inferSchema = True)
data

Setting the header option to 'true' makes Spark treat the first row of the file as the column names, so the headings become visible. Setting the inferSchema parameter to True makes Spark infer each column's data type from its values instead of reading everything as a string.

Dataset Display With Appropriate Column Setup

Let us move our study towards the main techniques on the columns.

1. Selecting a column

Selecting a specific column in the dataset is quite easy in PySpark. The select() function takes a column as its parameter and returns a new data frame that contains only that column.

To list all the available columns, we use the columns attribute, which returns their names as a list. In this example, we will select the ‘job’ column from the dataset.

Code:

data.columns
data.select('job').show()

Output:

2. Selecting multiple columns

We use the same select() function for selecting multiple columns. It accepts any number of columns as parameters. Here we select the ‘company’ and ‘job’ columns from the dataset.

Code:

data.select('company', 'job').show()

Output:

Selecting Multiple Columns

3. Adding columns

Adding a column takes a single line of code. PySpark provides the withColumn() and lit() functions for this.

The withColumn() function: This function takes two parameters
1. Column name to be given.
2. A column expression that supplies the values, either built from existing columns of the data frame or created with lit().
The lit() function works together with the withColumn() function to add a constant column. It takes one parameter.
1. A constant value to be given for each row.

We will add a new column ‘Tax Cutting’ to our data frame using the withColumn() function. Let us say the tax cutting is the same for all employees, so it is a constant value.

Code:

from pyspark.sql.functions import lit
# adding columns in dataframe
data = data.withColumn('Tax Cutting', lit(0.1))
# display the data frame with the new column
data.show()

Output:

Adding a new column in the dataset

4. Deleting columns

Deleting a column removes it and all of its contents from the data frame. Like Pandas, PySpark has a drop() function for this, which takes the name of the column to be dropped as a parameter. It returns a new data frame without that column, so we assign the result back to data. We will drop the degree column from the dataset. Make sure you spell the name correctly: when the name does not match any column, drop() does not raise an error but silently returns the data frame unchanged.

Code:

data = data.drop("degree")
data.show()

Output:

Dropping the degree column

5. Renaming a column

Renaming a column is changing the main heading or title of the column. For this we use the withColumnRenamed() function. This function takes two parameters.

Existing column name
New name to be given to that column.

To understand it practically, we will rename the job column name to Designation.

data = data.withColumnRenamed('job', 'Designation')

In the above code, the job is the existing column name in the data frame and Designation is the new name that we will be giving to that particular column.

Renaming the column

Conclusion

This brings the article to an end. We covered the basics of PySpark’s column transformations: selecting columns, adding a new column, deleting a column, and renaming one. This is a starting point for working with this library. All the best for future studies.