Python Data Cleaning using NumPy and Pandas

Python Data cleaning is the process of replacing empty values with some default values, dropping unwanted columns, removing missing rows, etc. When working with a huge amount of raw data, it makes sense to clean the data before analysis so that we work on a complete dataset.

Data cleaning in Python

Python NumPy and Pandas modules provide some methods for data cleaning in Python. Data cleaning is a process where all of the data that needs to be passed into a database or used for data analysis is cleaned by either updating or removing missing, inaccurate, incorrectly formatted, duplicated, or irrelevant information. Regular data cleansing should be practiced regularly in order to avoid pilling up uncleaned data over the years.

Why do we need to clean data in Python?

If data is not cleaned properly it can result in a great loss including a reduction in marketing effectiveness. Hence cleaning of data becomes really important to avoid all the inaccuracy in major results.

Efficient data cleaning implies fewer errors which results in happier customers and fewer frustrated employees. It also leads to an increase in productivity and better decisions.

Steps to clean data in a Python dataset

1. Data Loading

Now let’s perform data cleaning on a random csv file that I have downloaded from the internet. The name of the dataset is ‘San Francisco Building Permits’. Before any processing of the data, it is first loaded from the file. The code for data loading is shown below:

import numpy as np
import pandas as pd
data = pd.read_csv('Building_Permits.csv',low_memory=False)

First, all the required modules are imported and then the CSV file is loaded. I have added an additional parameter named low_memory whose purpose is to make sure the program doesn’t run into any memory errors due to the huge dataset.

The dataset contains 198900 permit details and 43 columns. The columns in the dataset are as follows:

Permit Number
Permit Type
Permit Type Definition
Permit Creation Date
Block
Lot
Street Number
Street Number Suffix
Street Name
Street Suffix
Unit
Unit Suffix
Description
Current Status
Current Status Date
Filed Date
Issued Date
Completed Date
First Construction Document Date
Structural Notification
Number of Existing Stories
Number of Proposed Stories
Voluntary Soft-Story Retrofit
Fire Only Permit
Permit Expiration Date
Estimated Cost
Revised Cost
Existing Use
Existing Units
Proposed Use
Proposed Units
Plansets
TIDF Compliance
Existing Construction Type
Existing Construction Type Description
Proposed Construction Type
Proposed Construction Type Description
Site Permit
Supervisor District
Neighborhoods – Analysis Boundaries
Zipcode
Location
Record ID

2. Dropping Unnecessary columns

When we looked at the dataset, we saw that there were so many columns in the dataset. But for processing, we can skip some of the columns during processing.

For now let’s drop some random columns namely TIDF Compliance, Fire Only Permit, Unit Suffix, Block, and Lot.

columns_to_drop=['TIDF Compliance', 'Fire Only Permit', 'Unit Suffix', 'Block','Lot']
data_dropcol=data.drop(columns_to_drop,axis=1)

We will first create a list storing all the column names to drop from the dataset.

In the next line, we made use of the drop function and pass the list created into the function. We also pass the axis parameter whose value can be either 0 (row-wise drop) or 1 (column-wise drop).

After the execution of the code, the new data contains only 38 columns, not 43.

3. Remove Missing Value Rows

Before moving to directly removing the rows with missing values, let’s first analyze how many missing values are there in the dataset. For the same purpose, we use the code mentioned below.

no_missing = data_dropcol.isnull().sum()
total_missing=no_missing.sum()

On the code execution, we found out that there are 1670031 missing values in the dataset. Since there are so many missing values so instead of dropping the rows with missing data, we drop the columns with maximum missing values instead. The code for the same is shown below.

drop_miss_value=data_dropcol.dropna(axis=1)

The code resulted in the dropping of maximum columns and only 10 columns remained in the resulting dataset. Yes, most of the information is dropped from the dataset but at least now the dataset is adequately cleaned.

Summary

Data analysis is a resource-intensive operation. So it makes sense to clean the raw data before the analysis to save time and effort. Data cleaning also makes sure that our analysis is more accurate. Python pandas and NumPy modules are best suited for CSV data cleaning.