Generic selectors
Exact matches only
Search in title
Search in content
Search in posts
Search in pages
wb_sunny

How to clean data in Python?

Cleaning Dataset Using Python

Effective implementation of Machine learning algorithms or even when doing some data analysis on a dataset, we require clean data.

There’s a well-known saying about numerical modeling with data, “Trash in Trash out” we can’t expect decent results when our data isn’t clean.

in this article, we’ll explore common techniques we can use to clean data using the pandas library.

Common Data Cleaning Checks

We’ll clean data based on the following:

  • Missing Values
  • Outliers
  • Duplicate Values

1. Missing Values

In Pandas, a missing value is usually denoted by NaN , since it is based on the NumPy package it is the special floating-point NaN value particular to NumPy.

You can find the dataset used in this article here.

Finding Missing Values

Let’s first see how we can find if there’s a missing value in our data

#Approach 1: visually

import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt

#importing Dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

#Creating the heatmap
plt.figure(figsize = (8,6))
sb.heatmap(df.isnull(), cbar=False , cmap = 'magma')
Missing Values Using Heatmap
Missing Values Using Heatmap

.isnull() method returns boolean values indicating if there’s a missing value in the data.

However this process could be limited to only medium to small datasets.

#Approach 2

We can use .sum() method after applying .isnull() this will return sum of missing values within each columns in the dataframe.

import pandas as pd

#importing dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

df.isnull().sum()
Finding Sum Of Missing Values
Finding Sum Of Missing Values

Cleaning Missing Values from Data

We found that our dataset does have some missing values in it, what should we do next to get clean data?

We can either drop the rows and columns containing missing values in them or replace the missing values with appropriate quantity i.e. mean, median, or mode.

  1. Dropping Missing Values:
#imporing pandas
import pandas as pd

#Importing dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

#Size of original dataset
print(df.shape)

#Dropping the missing rows.
df_dropped = df.dropna(how = 'any')

The above code will drop the rows from the dataframe having missing values.

Let’s look at .dropna() method in detail:

  • df.dropna() – Drop all rows that have any NaN values
  • df.dropna(how=’all’) – Drop only if ALL columns are NaN
  • df.dropna(thresh=2) – Drop row if it does not have at least two values that are not NaN
  • df.dropna(subset=[1]) – Drop only if NaN in specific column

One must be careful when considering dropping the missing values as it might affect the quality of the dataset.

2. Replacing Missing values

import pandas as pd

#importing the dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

#Creating a copy of dataframe
df_new = df

df_new['Metascore'] = df_new['Metascore'].fillna((df_new['Metascore'].mean()))

#printing the dataframes after replacing null values
print(df_new.isna().sum())
print(df.isna().sum())
Before And After Filling Null Values
Before And After Filling Null Values

pandas has .fillna() method which accepts a value which we want to replace in place of NaN values. We just calculated mean of the column and passed it as an input argument to fillna() method.

2. Dealing with Outliers

Outliers can change the course of entire predictions therefore it is essential we detect and remove outliers.

Using Z-Score

Let’s detect outliers in the Votes column in our dataset and filter the outliers using a z-score.

The idea behind this method lies in the fact that values lying 3 standard deviations away from mean will be termed as an Outlier.

#importing required modules
import pandas as pd
import numpy as np
from scipy import stats

#importing dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

#filtering outliers
df_new = df[(np.abs(stats.zscore(df.Votes)) < 3)]

The column on which this method is applied should be a numerical variable and not categorical.

Using Quantiles

By this method values falling below 0.01 quantile and above 0.99 quantiles in the series will be filtered out.

#importing required modules
import pandas as pd
import numpy as np
from scipy import stats

#importing dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

#Selecting limits
q_low = df["Votes"].quantile(0.01)
q_hi  = df["Votes"].quantile(0.99)

#filtering outliers
df_filtered = df[(df["Votes"] < q_hi) & (df["Votes"] > q_low)]


3. Dealing with Duplicate entries

We can check for any duplicates in a DataFrame using .duplicated() method. This returns a Pandas Series and not a DataFrame.

To check duplicate values in a specific column we can provide the column name as an input argument into the .duplicated( ) method.

Let’s see this in action.

Luckily we have no duplicate values in our data frame, so we will append some values from the data frame itself to create duplicate values.

#Import the required modules
import pandas as pd
import numpy as np

#importing and appending similar rows to of the dataframe
df = pd.read_csv('IMDB-Movie-Data.csv')
df1 = df.append(df.iloc[20:30,:])

df1.duplicated().sum()
Output:
10

Now, .drop_duplicates() method is used to drop the duplicate values from the dataframe.

#Importing the required modules
#Import the required modules
import pandas as pd
import numpy as np

#importing and appending similar rows to of the dataframe
df = pd.read_csv('IMDB-Movie-Data.csv')
df1 = df.append(df.iloc[20:30,:])

#dropping the duplicates
df1 = df1.drop_duplicates()

#checking the duplicates 
df1.duplicated().sum()
output:
0

Conclusion

This article was all about performing and learning about the basic data cleaning operations we need to perform on a messy dataset using Pandas.

Happy Learning!