How to clean data in Python?

Hey fellow learner! In this tutorial, we will learn how to clean data in Python and why it is important.

What is data cleaning?

Data cleaning is the process of preparing data for storage and analysis by updating or removing values that are missing, inaccurate, incorrectly formatted, duplicated, or irrelevant. Cleaning should be done regularly to avoid a pile-up of unclean data over the years.

Why do we need to clean data in Python?

If data is not cleaned properly, it can lead to significant losses, including reduced marketing effectiveness. Cleaning the data is therefore essential to avoid inaccuracies in the final results.

Efficient data cleaning means fewer errors, which results in happier customers and less frustrated employees. It also leads to higher productivity and better decisions.

Steps to clean data in a Python dataset

1. Data Loading

Now let’s perform data cleaning on a CSV file downloaded from the internet: the ‘San Francisco Building Permits’ dataset. Before any processing, the data is first loaded into a DataFrame. The code for loading it is shown below:

import numpy as np
import pandas as pd

# Load the CSV into a DataFrame; low_memory=False reads the whole file
# in one pass so that column dtypes are inferred consistently.
data = pd.read_csv('Building_Permits.csv', low_memory=False)

First, all the required modules are imported, and then the CSV file is loaded. The additional low_memory=False parameter tells pandas to read the whole file in a single pass so that column types are inferred consistently; on a dataset this large, the default chunked reading can otherwise produce mixed-type warnings.
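
Before going further, it is worth a quick sanity check that the data loaded as expected. A minimal sketch, reusing the data DataFrame from above:

# Confirm the dimensions and peek at the first few rows
print(data.shape)   # (198900, 43)
print(data.head())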

The dataset contains 198,900 permit records and 43 columns. The columns in the dataset are as follows (the short snippet after the list shows how to print them straight from the DataFrame):

  1. Permit Number
  2. Permit Type
  3. Permit Type Definition
  4. Permit Creation Date
  5. Block
  6. Lot
  7. Street Number
  8. Street Number Suffix
  9. Street Name
  10. Street Suffix
  11. Unit
  12. Unit Suffix
  13. Description
  14. Current Status
  15. Current Status Date
  16. Filed Date
  17. Issued Date
  18. Completed Date
  19. First Construction Document Date
  20. Structural Notification
  21. Number of Existing Stories
  22. Number of Proposed Stories
  23. Voluntary Soft-Story Retrofit
  24. Fire Only Permit
  25. Permit Expiration Date
  26. Estimated Cost
  27. Revised Cost
  28. Existing Use
  29. Existing Units
  30. Proposed Use
  31. Proposed Units
  32. Plansets
  33. TIDF Compliance
  34. Existing Construction Type
  35. Existing Construction Type Description
  36. Proposed Construction Type
  37. Proposed Construction Type Description
  38. Site Permit
  39. Supervisor District
  40. Neighborhoods – Analysis Boundaries
  41. Zipcode
  42. Location
  43. Record ID
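
This column list does not have to be copied by hand; it can be printed directly from the DataFrame with a small sketch like this:

# Print each column name with its 1-based position, mirroring the list above
for i, col in enumerate(data.columns, start=1):
    print(i, col)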

2. Dropping Unnecessary Columns

Looking at the dataset, we can see that it has a large number of columns, not all of which are needed for processing.

For now, let’s drop a few columns chosen for illustration, namely TIDF Compliance, Fire Only Permit, Unit Suffix, Block, and Lot.

columns_to_drop = ['TIDF Compliance', 'Fire Only Permit', 'Unit Suffix', 'Block', 'Lot']
# axis=1 drops columns; axis=0 would drop rows
data_dropcol = data.drop(columns_to_drop, axis=1)

We first create a list holding all the column names to drop from the dataset.

In the next line, we use the drop function and pass this list into it. We also pass the axis parameter, whose value can be either 0 (row-wise drop) or 1 (column-wise drop).

After executing this code, the new DataFrame contains only 38 columns instead of 43.
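
As a side note, newer pandas versions (0.21 and later) also accept a columns= keyword, which drops the same columns without needing the axis parameter. A quick equivalent sketch:

# Equivalent to drop(columns_to_drop, axis=1)
data_dropcol = data.drop(columns=columns_to_drop)
print(data_dropcol.shape)  # (198900, 38)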

3. Removing Missing Values

Before removing rows with missing values outright, let us first analyze how many missing values there are in the dataset, using the code below.

# Count missing values per column, then sum for the overall total
no_missing = data_dropcol.isnull().sum()
total_missing = no_missing.sum()
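
Since no_missing holds the count for each column, we can also rank the columns by how incomplete they are, and express the total as a share of all cells. A small sketch reusing the variables above:

# Show the ten columns with the most missing values
print(no_missing.sort_values(ascending=False).head(10))

# Total missing values as a percentage of all cells
percent_missing = total_missing / np.prod(data_dropcol.shape) * 100
print(f"{percent_missing:.1f}% of all cells are missing")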

On executing this code, we find that there are 1,670,031 missing values in the dataset. Since there are so many, instead of dropping the rows with missing data, we drop the columns that contain missing values. The code for this is shown below.

# Drop every column that contains at least one missing value
drop_miss_value = data_dropcol.dropna(axis=1)

This drops most of the columns: only 10 remain in the resulting dataset. Yes, most of the information is lost, but at least the dataset is now properly cleaned.
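
If dropping every column with any missing value feels too aggressive, dropna also takes a thresh parameter: a column is kept as long as it has at least that many non-missing values. A sketch, where the 90% cutoff is an illustrative choice and not part of the original walkthrough:

# Keep a column only if at least 90% of its values are present
# (the 0.9 cutoff is an arbitrary, illustrative choice)
min_non_null = int(0.9 * len(data_dropcol))
drop_sparse = data_dropcol.dropna(axis=1, thresh=min_non_null)
print(drop_sparse.shape)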

Conclusion

Awesome! We now know what data cleaning is and how to perform it on a real dataset.

Hope you learned something! Thank you for reading!