Hey fellow learner! In this tutorial, we will learn how to clean data in Python and why it is important.
What is data cleaning?
Data cleaning is the process of preparing data before it is passed into a database or analysis by updating or removing values that are missing, inaccurate, incorrectly formatted, duplicated, or irrelevant. Cleaning data regularly prevents uncleaned data from piling up over the years.
Why do we need to clean data in Python?
If data is not cleaned properly, it can cause real damage, such as reduced marketing effectiveness, so cleaning it is essential to avoid inaccuracies in important results.
Effective data cleaning means fewer errors, which results in happier customers and fewer frustrated employees. It also leads to higher productivity and better decisions.
Steps to clean data in a Python dataset
1. Data Loading
Now let's perform data cleaning on a CSV file that I have downloaded from the internet. The dataset is called 'San Francisco Building Permits'. Before any processing, the data is first loaded from the file into a DataFrame. The code for data loading is shown below:
import numpy as np
import pandas as pd
data = pd.read_csv('Building_Permits.csv', low_memory=False)
First, all the required modules are imported and then the CSV file is loaded. I have added the extra parameter low_memory=False, which tells pandas to work out each column's type from the whole file at once instead of in chunks; without it, a dataset this large tends to trigger mixed-type (DtypeWarning) warnings.
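Before going any further, it is worth quickly sanity-checking the load by looking at the shape and the first few rows. A minimal sketch using standard pandas calls:

print(data.shape)   # (rows, columns) of the loaded DataFrame
print(data.head())  # preview the first five permit records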
The dataset contains 198,900 permit records and 43 columns. The columns in the dataset are as follows:
- Permit Number
- Permit Type
- Permit Type Definition
- Permit Creation Date
- Block
- Lot
- Street Number
- Street Number Suffix
- Street Name
- Street Suffix
- Unit
- Unit Suffix
- Description
- Current Status
- Current Status Date
- Filed Date
- Issued Date
- Completed Date
- First Construction Document Date
- Structural Notification
- Number of Existing Stories
- Number of Proposed Stories
- Voluntary Soft-Story Retrofit
- Fire Only Permit
- Permit Expiration Date
- Estimated Cost
- Revised Cost
- Existing Use
- Existing Units
- Proposed Use
- Proposed Units
- Plansets
- TIDF Compliance
- Existing Construction Type
- Existing Construction Type Description
- Proposed Construction Type
- Proposed Construction Type Description
- Site Permit
- Supervisor District
- Neighborhoods – Analysis Boundaries
- Zipcode
- Location
- Record ID
2. Dropping Unnecessary Columns
Looking at the dataset, we can see that it has a lot of columns, and we won't need all of them for processing.
For now, let's drop a few columns we won't need here, namely TIDF Compliance, Fire Only Permit, Unit Suffix, Block, and Lot.
columns_to_drop = ['TIDF Compliance', 'Fire Only Permit', 'Unit Suffix', 'Block', 'Lot']
data_dropcol = data.drop(columns_to_drop, axis=1)
We first create a list of the column names to drop from the dataset. In the next line we call the drop function and pass in that list, along with the axis parameter, whose value can be either 0 (drop rows) or 1 (drop columns).
After executing this code, the new DataFrame contains only 38 columns instead of the original 43.
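To double-check the drop, you can compare the shapes before and after. Here's a small sketch; note that newer pandas versions also let you write the same drop with a columns= keyword instead of axis=1:

print(data.shape)          # 43 columns before the drop
print(data_dropcol.shape)  # 38 columns after the drop
data_dropcol = data.drop(columns=columns_to_drop)  # equivalent to passing axis=1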
3. Removing Missing Values
Before jumping straight to removing rows with missing values, let us first check how many missing values there are in the dataset. For that, we use the code below.
no_missing = data_dropcol.isnull().sum()
total_missing = no_missing.sum()
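It can also help to see how those missing values are spread across the columns, for example as a percentage of all rows. A short optional sketch building on the no_missing Series from above:

percent_missing = (no_missing / len(data_dropcol) * 100).sort_values(ascending=False)
print(percent_missing.head(10))  # the ten columns with the most gaps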
Summing the counts, we find that there are 1,670,031 missing values in the dataset. Since there are so many, instead of dropping the rows with missing data we drop the columns that contain missing values. The code for that is shown below.
drop_miss_value = data_dropcol.dropna(axis=1)
This dropped most of the columns: only 10 remained in the resulting dataset. Yes, most of the information is gone, but at least the dataset that remains is properly cleaned.
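If losing that much information is a concern, two common alternatives are keeping only the columns that are mostly complete via the thresh parameter, or filling the gaps instead of dropping them. A rough sketch; the 90% cutoff and the 'Unknown' placeholder are just illustrative choices:

thresh = int(0.9 * len(data_dropcol))                           # require 90% non-missing values per column
drop_mostly_empty = data_dropcol.dropna(axis=1, thresh=thresh)  # keep mostly-complete columns
filled = data_dropcol.fillna('Unknown')                         # or fill gaps with a placeholder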
Conclusion
Awesome! Now we know what data cleaning is and how to perform it on a real-world dataset.
Hope you learned something! Thank you for reading!