Data cleaning in Python is the process of replacing empty values with sensible defaults, dropping unwanted columns, removing rows with missing values, and so on. When working with a large amount of raw data, it makes sense to clean the data before analysis so that we work on a complete dataset.
Data cleaning in Python
The Python NumPy and Pandas modules provide methods for data cleaning in Python. Data cleaning is the process of preparing data for storage in a database or for analysis by updating or removing information that is missing, inaccurate, incorrectly formatted, duplicated, or irrelevant. Data cleansing should be practiced regularly to avoid piling up uncleaned data over the years.
Why do we need to clean data in Python?
If data is not cleaned properly, it can lead to significant losses, including reduced marketing effectiveness. Cleaning the data is therefore important to avoid inaccuracies in the final results.
Efficient data cleaning means fewer errors, which results in happier customers and less frustrated employees. It also leads to higher productivity and better decisions.
Steps to clean data in a Python dataset
1. Data Loading
Now let's perform data cleaning on a CSV file downloaded from the internet: the 'San Francisco Building Permits' dataset. Before any processing, the data is first loaded from the file. The code for data loading is shown below:
import numpy as np
import pandas as pd
data = pd.read_csv('Building_Permits.csv', low_memory=False)
First, the required modules are imported and then the CSV file is loaded. The additional low_memory=False parameter tells pandas to read the whole file in one pass rather than in chunks, so each column gets a single consistent dtype and no mixed-type warnings are raised on this large dataset.
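A more explicit alternative to low_memory=False is to declare the expected dtypes up front, which removes the type guessing entirely. The sketch below uses a tiny in-memory CSV as a stand-in, since the permit file itself may not be at hand; the column names are taken from the dataset, but the sample values are made up.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for Building_Permits.csv (assumed file)
csv_text = "Permit Number,Zipcode\n201505065519,94102\n201604195146,\n"

# Declaring dtypes up front removes the mixed-type guessing that
# low_memory=False otherwise works around on large files
data = pd.read_csv(io.StringIO(csv_text),
                   dtype={'Permit Number': str, 'Zipcode': 'float64'})

print(data.dtypes)
```

Keeping Permit Number as a string also avoids losing leading zeros, a common pitfall with identifier columns.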
The dataset contains 198,900 permit records and 43 columns. The columns in the dataset are as follows:
- Permit Number
- Permit Type
- Permit Type Definition
- Permit Creation Date
- Block
- Lot
- Street Number
- Street Number Suffix
- Street Name
- Street Suffix
- Unit
- Unit Suffix
- Description
- Current Status
- Current Status Date
- Filed Date
- Issued Date
- Completed Date
- First Construction Document Date
- Structural Notification
- Number of Existing Stories
- Number of Proposed Stories
- Voluntary Soft-Story Retrofit
- Fire Only Permit
- Permit Expiration Date
- Estimated Cost
- Revised Cost
- Existing Use
- Existing Units
- Proposed Use
- Proposed Units
- Plansets
- TIDF Compliance
- Existing Construction Type
- Existing Construction Type Description
- Proposed Construction Type
- Proposed Construction Type Description
- Site Permit
- Supervisor District
- Neighborhoods – Analysis Boundaries
- Zipcode
- Location
- Record ID
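A quick way to confirm the row count and column names after loading is to inspect the DataFrame's shape and columns attributes. This sketch uses a small stand-in frame with a few of the permit columns, since the real file would produce the (198900, 43) shape quoted above.

```python
import pandas as pd

# Small stand-in frame; with the real file this would be
# pd.read_csv('Building_Permits.csv', low_memory=False)
data = pd.DataFrame({
    'Permit Number': ['201505065519', '201604195146'],
    'Block': ['0326', '0306'],
    'Lot': ['023', '007'],
})

print(data.shape)          # (number of rows, number of columns)
print(list(data.columns))  # column names as a plain list
```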
2. Dropping Unnecessary columns
Looking at the dataset, we see that it has many columns, not all of which are needed for analysis. For now, let's drop a few of them: TIDF Compliance, Fire Only Permit, Unit Suffix, Block, and Lot.
columns_to_drop = ['TIDF Compliance', 'Fire Only Permit', 'Unit Suffix', 'Block', 'Lot']
data_dropcol = data.drop(columns_to_drop, axis=1)
We first create a list of the column names to drop, then pass it to the drop() method. The axis parameter controls the direction of the drop: 0 drops rows, 1 drops columns.
After the code runs, the new DataFrame contains 38 columns instead of 43.
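For completeness, drop() also accepts a columns= keyword that makes the intent explicit without the axis argument. The snippet below shows both forms on a small stand-in frame (the real call would run against the full permit DataFrame):

```python
import pandas as pd

# Stand-in frame with a few of the permit columns
data = pd.DataFrame({
    'Permit Number': ['201505065519', '201604195146'],
    'Block': ['0326', '0306'],
    'Lot': ['023', '007'],
    'Unit Suffix': [None, 'A'],
})

columns_to_drop = ['Block', 'Lot', 'Unit Suffix']

# axis=1 form used in this article
data_dropcol = data.drop(columns_to_drop, axis=1)

# Equivalent, slightly more self-documenting keyword form
data_dropcol = data.drop(columns=columns_to_drop)

print(data_dropcol.columns.tolist())
```

Both calls return a new DataFrame and leave the original untouched unless inplace=True is passed.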
3. Remove Missing Value Rows
Before removing rows with missing values outright, let's first see how many missing values the dataset contains. We use the code below for this purpose.
no_missing = data_dropcol.isnull().sum()
total_missing = no_missing.sum()
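It can also help to see where the gaps are concentrated and what share of all cells they represent. The sketch below runs on a small stand-in frame with deliberate gaps, since the full permit file is assumed rather than bundled here:

```python
import numpy as np
import pandas as pd

# Small stand-in for the permit data with some missing cells
data_dropcol = pd.DataFrame({
    'Permit Number': ['a', 'b', 'c'],
    'Completed Date': [np.nan, '2016-01-01', np.nan],
    'Site Permit': [np.nan, np.nan, np.nan],
})

# Missing values per column, worst-affected first
no_missing = data_dropcol.isnull().sum()
print(no_missing.sort_values(ascending=False))

# Overall share of cells that are missing
total_missing = no_missing.sum()
percent_missing = total_missing / data_dropcol.size * 100
print(f"{percent_missing:.1f}% of all cells are missing")
```

Sorting the per-column counts makes it obvious which columns are nearly empty and therefore safe to sacrifice.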
Running this code shows that the dataset contains 1670031 missing values. With that many gaps, dropping every affected row would discard most of the data, so instead we drop the columns that contain missing values. The code for this is shown below.
drop_miss_value = data_dropcol.dropna(axis=1)
This drops every column containing at least one missing value, leaving only 10 columns in the resulting dataset. Much of the information is lost, but the dataset that remains is fully populated.
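Dropping is not the only option: as noted at the start, missing values can also be replaced with defaults, which preserves the columns. A minimal sketch with fillna(), using made-up stand-in columns inspired by the permit data:

```python
import numpy as np
import pandas as pd

# Stand-in columns loosely modeled on the permit data
df = pd.DataFrame({
    'Estimated Cost': [5000.0, np.nan, 12000.0],
    'Current Status': ['issued', None, 'complete'],
})

# Numeric gaps: fill with the column median; text gaps: a sentinel label
df['Estimated Cost'] = df['Estimated Cost'].fillna(df['Estimated Cost'].median())
df['Current Status'] = df['Current Status'].fillna('unknown')

print(df)
```

Whether to drop or fill depends on the analysis: filling keeps the columns available but introduces values that were never observed, so the choice of default should be stated explicitly.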
Summary
Data analysis is a resource-intensive operation, so it makes sense to clean raw data before analysis to save time and effort. Data cleaning also makes the analysis more accurate. The Python pandas and NumPy modules are well suited for cleaning CSV data.