Data Cleaning with Pyjanitor: A Beginner’s Guide to Preprocessing

PyJanitor: The Data Cleaning Best Friend

Imagine you are working on your dream machine learning project in Python and collecting a large amount of data from various sources.

How do we bring all this data together? Use data frames! A data frame is a table-like structure that stores data in rows and columns.

One problem is solved. However, since the data is collected from various sources, its quality cannot be guaranteed. It can have missing values, empty fields, inconsistent column names, and so on. Before we move on to building a model, we first need to clean the data.

Pandas is a great library that supports data cleaning and pre-processing, but there are also advanced libraries that make data cleaning even easier. One such library is Pyjanitor.

Pyjanitor is a Python library built on top of Pandas that simplifies data cleaning and preprocessing tasks. It provides a clean API for removing noise from datasets, including functions for removing columns, selecting columns and rows based on conditions, renaming columns, cleaning column names, and removing empty rows and columns.

In this tutorial, we will review the Pyjanitor library and the important methods it offers.

Meanwhile, do read this article on data cleaning using Pandas and NumPy.

What is Pyjanitor? 

PyJanitor is a Pythonic version of the R language's janitor package, built on top of the Pandas library to extend its functionality for cleaning and pre-processing datasets. It provides a clean API with several functions that make removing noise from datasets easy. In the coming sections, we will discuss a few of the most important ones. To use the library, we first need to install it in our environment.

# in a terminal
pip install pyjanitor
# in an interactive notebook (e.g., Google Colab)
!pip install pyjanitor

Getting Started with Pyjanitor: Installation and Version Check

First, we need to install the pyjanitor library.

!pip install pyjanitor

It can be imported using the line:

import janitor

The version of the library can be checked using the following snippet.

print(janitor.__version__)
Importing PyJanitor

Let us take a look at the functions supported by this API.

1. Removing Columns with Pyjanitor

The remove_columns function removes specific columns from the data frame. It has the following syntax.

remove_columns(df, column_names)

A point to remember is that this function is deprecated and will be removed in versions 1.x and later; it is suggested to use pandas' df.drop instead (a pandas-only equivalent is sketched after the example below).

Dropping multiple columns from the dataframe

Let us see an example.

import pandas as pd
import janitor
df = pd.DataFrame({"a": [2, 4, 6, 8], "b": [1, 3, 5, 7], "c": [7, 8, 9, 11], "d": [13, 15, 17, 19]})
df

In the first two lines, we import the pandas and janitor libraries. Next, we define the dataframe called df with four columns. Lastly, we print the data frame.

Data Frame 1
df1 = df.remove_columns(column_names=['a', 'c'])
df1

Since remove_columns does not modify the original data frame, we save its result in a new data frame called df1.

Remove Columns
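Since remove_columns is on its way out, here is a minimal pandas-only sketch of the same operation using df.drop; assuming the same df as above, it should return the same result as df1.

df1_pandas = df.drop(columns=['a', 'c'])  # pandas equivalent of remove_columns above
df1_pandas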

2. Selecting Columns Based on Conditions

The select_columns function selects certain columns based on a condition or criterion. The condition may be a string, a list of labels, a regular expression, or a glob pattern. Like remove_columns, this function does not alter the original data frame.

select_columns(df, *args, invert=False)

Here is an example of selecting columns by passing a list of column names.

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4.5, 5.6, 6.7],
    'C': ['foo', 'bar', 'baz'],
    'D': [True, False, True]
})
selected_columns = df.select_columns(['A', 'C', 'D'])
print("Selected columns by name:")
print(selected_columns)
print("*"*15)
print("Doesn't modify the data frame")
df

The data frame df has four columns: A, B, C, and D. We select only the columns A, C, and D, then print both the selected columns and the original data frame to confirm that it isn't mutated.

Select Columns
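The invert parameter from the signature above flips the selection. Here is a short sketch, assuming the same df with columns A, B, C, and D: passing ['B'] with invert=True should return every column except B.

without_b = df.select_columns(['B'], invert=True)  # keep everything except column B
print(without_b)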

3. Renaming Columns

As the name suggests, the rename_column function renames a column in the data frame. Like the other functions discussed above, it does not mutate the original data frame.

The syntax is given below.

rename_column(df, old_column_name, new_column_name)
df=pd.DataFrame({
            'AB':[1,2,3,4],
            'BC':[0.1,1.2,2.3,3.4],
            'CD':['Hi','Welcome','to','AskPython']})
df

We define a data frame df with AB, BC, and CD columns.

Data Frame 2
df2 = df.rename_column(old_column_name='AB', new_column_name='nums')
df2

Now we rename the column AB to the new name nums. The image below shows that the function does not modify the original data frame and instead returns a new data frame with the renamed column.

Original data frame vs data frame with the renamed column
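Because each call returns a new data frame, rename_column can be chained to rename more than one column in a single expression. A small sketch, assuming the same df with columns AB, BC, and CD:

df3 = (
    df
    .rename_column(old_column_name='AB', new_column_name='nums')   # AB -> nums
    .rename_column(old_column_name='CD', new_column_name='words')  # CD -> words
)
df3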

4. Filtering Rows Using Criteria

The select_rows function selects rows that satisfy the given conditions or criteria. Like its counterpart select_columns, it does not mutate the original data frame.

select_rows(df, *args, invert=False)

Let us take an example.

import pandas as pd
import janitor
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'gender': ['Female', 'Male', 'Female']
}
df = pd.DataFrame(data)
df

We create a dictionary and convert it into a data frame called df. Now we apply the select_rows function to select all the people who are more than 25 years old.

Data Frame 3
fdf = df.select_rows(lambda x: x['age'] > 25)
print(fdf)

We use a lambda function to express the condition on the age column. The result is stored in a data frame called fdf.

Select Rows
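For comparison, here is a minimal plain-pandas sketch of the same filter using boolean indexing; with the same df as above it should keep only Bob and Charlie.

fdf_pandas = df[df['age'] > 25]  # rows where age is greater than 25
print(fdf_pandas)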

5. Cleaning Messy Column Names

The clean_names function lives up to its name. It cleans messy column names, converting them to lowercase and replacing spaces with underscores. It can also remove special characters (#, $, etc.). However, the changes are not reflected in the original data frame.

clean_names(df, axis='columns', column_names=None, strip_underscores=None, case_type='lower', remove_special=False, strip_accents=True, preserve_original_labels=True, enforce_string=True, truncate_limit=None)
import pandas as pd
import janitor
data = {
    'First Name': ['Alice', 'Bob', 'Charlie'],
    'Age$$(years)': [25, 30, 35],
    'Gender@': ['Female', 'Male', 'Male']
}
df = pd.DataFrame(data)
df

The dictionary has three keys that contain spaces and a few special characters. It is converted into a data frame called df.

cleaned_df = df.clean_names(remove_special=True)
print(cleaned_df)

The clean_names function is used to clean the column names. The remove_special parameter is set to True so that the special characters will also be removed from the column names. This result is stored in a new data frame called cleaned_df.

Cleaned data frame
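A couple more options from the signature above, shown as a brief sketch on the same df: case_type='upper' is assumed to produce uppercase labels, and strip_underscores=True should trim any leading or trailing underscores left after cleaning.

upper_df = df.clean_names(remove_special=True, case_type='upper', strip_underscores=True)
print(upper_df.columns)  # uppercase labels with spaces and special characters cleaned away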

6. Removing Null Rows and Columns

The remove_empty function removes all the rows and columns that are completely null, without modifying the original data frame.

remove_empty(df, reset_index=True)
import numpy as np
import pandas as pd
import janitor
df = pd.DataFrame({
    "a": [1, np.nan, 2],
    "b": [3, np.nan, 4],
    "c": [np.nan, np.nan, np.nan],
})
df

Here, we create a data frame called df whose rows and columns contain NaN values, using the np.nan constant.

Data Frame with nan values
df2 = df.remove_empty()
df2

The result of applying the remove_empty function is stored in a new data frame called df2.

Data frame with no nan values
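To wrap up, here is a small sketch, using only the functions covered above and a made-up data frame, showing how pyjanitor methods chain naturally on a pandas data frame:

import numpy as np
import pandas as pd
import janitor

raw = pd.DataFrame({
    "First Name": ["Alice", "Bob", np.nan],
    "Total Score": [80, 90, np.nan],
    "Empty Col": [np.nan, np.nan, np.nan],
})
cleaned = (
    raw
    .clean_names()      # "First Name" -> "first_name", "Total Score" -> "total_score"
    .remove_empty()     # drops the all-NaN column and the all-NaN last row
)
print(cleaned)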

Summary

Pyjanitor is a great Python library that makes data cleaning and preprocessing tasks much easier. It works well with Pandas and gives you lots of handy functions to remove columns, select columns and rows based on conditions, rename columns, clean up messy column names, and get rid of empty rows and columns. With Pyjanitor, you can clean up your data quickly and focus on building awesome machine learning models. So why not give Pyjanitor a try and see how it can help you work with messy data and find cool patterns and insights?

References

Pyjanitor Documentation

Pyjanitor functions