Python Data Cleaning using NumPy and Pandas


Data cleaning in Python is the process of preparing raw data for analysis: replacing empty values with sensible defaults, dropping unwanted columns, removing rows with missing values, and so on. When working with a large amount of raw data, it makes sense to clean it before analysis so that we work on a complete, consistent dataset.
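As a quick illustration, here is a minimal sketch of those three operations on a small, made-up DataFrame (the column names and values are hypothetical and not taken from the permits dataset used below):

import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with missing values scattered across columns
df = pd.DataFrame({
    'name': ['Alice', 'Bob', None],
    'age': [25, np.nan, 30],
    'notes': ['ok', None, 'check'],
})

filled = df.fillna({'age': df['age'].mean()})   # replace empty values with a default
trimmed = filled.drop(columns=['notes'])        # drop an unwanted column
complete = trimmed.dropna()                     # remove rows that still have missing values
print(complete)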

Data cleaning in Python

The Python NumPy and Pandas modules provide several methods for data cleaning. Data cleaning is the process in which data destined for a database or for analysis is corrected by updating or removing missing, inaccurate, incorrectly formatted, duplicated, or irrelevant information. Data cleansing should be practiced regularly to avoid piling up uncleaned data over the years.

Why do we need to clean data in Python?

If data is not cleaned properly, it can lead to significant losses, including a reduction in marketing effectiveness. Cleaning the data is therefore essential to keep inaccuracies out of the final results.

Efficient data cleaning means fewer errors, which results in happier customers and less frustrated employees. It also leads to higher productivity and better decisions.

Steps to clean data in a Python dataset

1. Data Loading

Now let’s perform data cleaning on a CSV file downloaded from the internet: the ‘San Francisco Building Permits’ dataset. Before any processing, the data is first loaded from the file. The code for data loading is shown below:

import numpy as np
import pandas as pd

# low_memory=False reads the file in one pass so column types are inferred consistently
data = pd.read_csv('Building_Permits.csv', low_memory=False)

First, the required modules are imported and then the CSV file is loaded. The additional low_memory=False parameter tells pandas to read the whole file at once instead of in chunks, so that column types are inferred consistently and the large file does not trigger mixed-type (DtypeWarning) warnings.
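A quick sanity check after loading can confirm the size and inferred column types; this is a minimal sketch, assuming the file loaded as above:

print(data.shape)          # (rows, columns) -- should report 198900 rows and 43 columns
print(data.dtypes.head())  # dtypes pandas inferred for the first few columns
print(data.head(3))        # a quick look at the first three permit records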

The dataset contains 198900 permit details and 43 columns. The columns in the dataset are as follows:

  1. Permit Number
  2. Permit Type
  3. Permit Type Definition
  4. Permit Creation Date
  5. Block
  6. Lot
  7. Street Number
  8. Street Number Suffix
  9. Street Name
  10. Street Suffix
  11. Unit
  12. Unit Suffix
  13. Description
  14. Current Status
  15. Current Status Date
  16. Filed Date
  17. Issued Date
  18. Completed Date
  19. First Construction Document Date
  20. Structural Notification
  21. Number of Existing Stories
  22. Number of Proposed Stories
  23. Voluntary Soft-Story Retrofit
  24. Fire Only Permit
  25. Permit Expiration Date
  26. Estimated Cost
  27. Revised Cost
  28. Existing Use
  29. Existing Units
  30. Proposed Use
  31. Proposed Units
  32. Plansets
  33. TIDF Compliance
  34. Existing Construction Type
  35. Existing Construction Type Description
  36. Proposed Construction Type
  37. Proposed Construction Type Description
  38. Site Permit
  39. Supervisor District
  40. Neighborhoods – Analysis Boundaries
  41. Zipcode
  42. Location
  43. Record ID

2. Dropping Unnecessary Columns

As we saw above, the dataset has a large number of columns, and not all of them are needed for analysis, so we can drop some of them before processing.

For now, let’s drop a few columns we won’t use, namely TIDF Compliance, Fire Only Permit, Unit Suffix, Block, and Lot.

# Columns we will not use in this analysis
columns_to_drop = ['TIDF Compliance', 'Fire Only Permit', 'Unit Suffix', 'Block', 'Lot']
# axis=1 drops along the column axis
data_dropcol = data.drop(columns_to_drop, axis=1)

We first create a list containing the names of all the columns to drop from the dataset.

In the next line, we pass that list to the drop function. We also pass the axis parameter, whose value can be either 0 (row-wise drop) or 1 (column-wise drop).

After executing the code, the new DataFrame data_dropcol contains only 38 columns instead of 43.
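The same drop can also be written with the columns keyword, which avoids having to remember the axis convention. The short sketch below assumes the data and columns_to_drop objects defined above and verifies the new shape:

data_dropcol = data.drop(columns=columns_to_drop)   # same result as drop(..., axis=1)
print(data.shape[1], data_dropcol.shape[1])         # 43 columns before, 38 after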

3. Remove Missing Value Rows

Before directly removing the rows with missing values, let’s first analyze how many missing values there are in the dataset, using the code below.

# Count missing values per column, then sum for a dataset-wide total
no_missing = data_dropcol.isnull().sum()
total_missing = no_missing.sum()

Running this code shows that there are 1,670,031 missing values in the dataset. Because so many values are missing, instead of dropping the rows with missing data we drop the columns that contain missing values. The code for this is shown below.

# axis=1 drops every column that contains at least one missing value
drop_miss_value = data_dropcol.dropna(axis=1)

This drops most of the columns: only 10 columns remain in the resulting dataset. Yes, a lot of information is lost this way, but at least the dataset that remains is completely free of missing values.
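Before committing to such an aggressive step, it can be worth comparing how much data each strategy keeps. The sketch below contrasts dropping rows against dropping columns, assuming the data_dropcol frame from the previous step:

drop_miss_rows = data_dropcol.dropna(axis=0)   # drop every row with at least one missing value
drop_miss_cols = data_dropcol.dropna(axis=1)   # drop every column with at least one missing value

print('Rows kept:   ', len(drop_miss_rows), 'of', len(data_dropcol))
print('Columns kept:', drop_miss_cols.shape[1], 'of', data_dropcol.shape[1])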

Summary

Data analysis is a resource-intensive operation, so it makes sense to clean the raw data beforehand to save time and effort. Data cleaning also helps make our analysis more accurate. The Python Pandas and NumPy modules are well suited for cleaning CSV data.
