Whether we are implementing machine learning algorithms or simply analyzing a dataset, we need clean data.
There’s a well-known saying about numerical modeling with data, “Garbage in, garbage out”: we can’t expect decent results when our data isn’t clean.
In this article, we’ll explore common techniques for cleaning data using the pandas library.
Common Data Cleaning Checks
We’ll clean data based on the following:
- Missing Values
- Outliers
- Duplicate Values
1. Missing Values
In pandas, a missing value is usually denoted by NaN. Since pandas is built on the NumPy package, this is NumPy’s special floating-point NaN value.
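A quick sketch on a toy Series (not the article’s dataset) illustrates why this matters:

```python
import numpy as np
import pandas as pd

# NaN is a floating-point value, so an integer column becomes float
# as soon as it contains a missing entry.
s = pd.Series([1, 2, np.nan])
print(s.dtype)  # float64

# NaN is never equal to anything, including itself -- use pd.isna() to test.
print(np.nan == np.nan)  # False
print(pd.isna(np.nan))   # True
```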
You can find the dataset used in this article here.
Finding Missing Values
Let’s first see how we can find out whether there are missing values in our data.
```python
# Approach 1: visually
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt

# Importing the dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

# Creating the heatmap
plt.figure(figsize=(8, 6))
sb.heatmap(df.isnull(), cbar=False, cmap='magma')
```
The .isnull() method returns a boolean DataFrame indicating which entries in the data are missing.
However, this visual approach is only practical for small to medium datasets.
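As a quick illustration of the boolean mask on a toy DataFrame (column names here are made up, not from the article’s dataset):

```python
import numpy as np
import pandas as pd

# A small DataFrame with one missing value in each column.
df_toy = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                       "b": ["x", "y", None]})

# .isnull() returns a boolean DataFrame of the same shape,
# with True wherever a value is missing.
mask = df_toy.isnull()
print(mask)
```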
We can apply the .sum() method after .isnull(); this returns the number of missing values in each column of the dataframe.
```python
import pandas as pd

# Importing the dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

# Count the missing values in each column
df.isnull().sum()
```
Cleaning Missing Values from Data
We found that our dataset does have some missing values. What should we do next to get clean data?
We can either drop the rows or columns containing missing values, or replace the missing values with an appropriate statistic, e.g. the mean, median, or mode.
- Dropping Missing Values:
```python
# Importing pandas
import pandas as pd

# Importing the dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

# Size of the original dataset
print(df.shape)

# Dropping the rows with missing values
df_dropped = df.dropna(how='any')
```
The above code drops every row in the dataframe that contains a missing value.
Let’s look at the .dropna() method in detail:
- df.dropna() – drop all rows that contain any NaN values
- df.dropna(how='all') – drop a row only if ALL of its columns are NaN
- df.dropna(thresh=2) – drop a row if it does not have at least two non-NaN values
- df.dropna(subset=['col']) – drop a row only if NaN appears in the specified column(s)
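The variants above can be sketched on a toy DataFrame (column names are illustrative, not from the article’s dataset):

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [5.0, np.nan, 7.0, 8.0],
    "c": [9.0, np.nan, 11.0, 12.0],
})

print(len(df_toy.dropna()))              # rows with no NaN at all
print(len(df_toy.dropna(how="all")))     # drops only the all-NaN row
print(len(df_toy.dropna(thresh=2)))      # keep rows with >= 2 non-NaN values
print(len(df_toy.dropna(subset=["a"])))  # drop rows where column "a" is NaN
```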
One must be careful when dropping missing values, as doing so might affect the quality of the dataset.
- Replacing Missing Values:
```python
import pandas as pd

# Importing the dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

# Creating a copy of the dataframe (plain assignment would only
# create a second reference to the same object)
df_new = df.copy()
df_new['Metascore'] = df_new['Metascore'].fillna(df_new['Metascore'].mean())

# Printing the null counts of both dataframes after replacing
print(df_new.isna().sum())
print(df.isna().sum())
```
The .fillna() method accepts a value to use in place of NaN. Here we calculated the mean of the column and passed it as the input argument to .fillna().
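The same pattern works with the median or mode, which are often preferable when the column is skewed or categorical; a sketch on toy Series (not the article’s dataset):

```python
import numpy as np
import pandas as pd

scores = pd.Series([10.0, 20.0, np.nan, 30.0, 1000.0])  # skewed by 1000
genres = pd.Series(["Action", "Drama", None, "Drama"])

# The median is robust to the outlier; the mode suits categorical data.
scores_filled = scores.fillna(scores.median())
genres_filled = genres.fillna(genres.mode()[0])
print(scores_filled.tolist())
print(genres_filled.tolist())
```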
2. Dealing with Outliers
Outliers can skew predictions entirely, so it is essential that we detect and remove them.
Let’s detect outliers in the Votes column in our dataset and filter the outliers using a z-score.
The idea behind this method is that values lying more than 3 standard deviations away from the mean are treated as outliers.
```python
# Importing the required modules
import pandas as pd
import numpy as np
from scipy import stats

# Importing the dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

# Keep only the rows whose Votes z-score is within 3 standard deviations
df_new = df[np.abs(stats.zscore(df.Votes)) < 3]
```
This method should only be applied to numerical columns, not categorical ones.
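If you need to check which columns are numeric before applying such a filter, select_dtypes helps (a sketch on a toy frame with illustrative column names):

```python
import pandas as pd

df_toy = pd.DataFrame({"votes": [10, 20, 30],
                       "genre": ["Action", "Drama", "Comedy"]})

# List only the numeric columns, which are safe for z-score filtering.
numeric_cols = df_toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['votes']
```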
Another approach is quantile-based filtering: values falling below the 0.01 quantile or above the 0.99 quantile of the series are filtered out.
```python
# Importing the required modules
import pandas as pd

# Importing the dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

# Selecting the limits
q_low = df["Votes"].quantile(0.01)
q_hi = df["Votes"].quantile(0.99)

# Filtering the outliers
df_filtered = df[(df["Votes"] < q_hi) & (df["Votes"] > q_low)]
```
3. Dealing with Duplicate entries
We can check for duplicates in a DataFrame using the .duplicated() method, which returns a boolean pandas Series, not a DataFrame.
To check for duplicate values in specific columns only, we can pass the column names via the subset argument of .duplicated().
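For example, on a toy DataFrame (illustrative column names, not the article’s dataset), we can flag rows that repeat a value in a single column:

```python
import pandas as pd

df_toy = pd.DataFrame({"title": ["A", "B", "A", "C"],
                       "year":  [2001, 2002, 2005, 2001]})

# Full-row duplicates: none here, since the years differ.
print(df_toy.duplicated().sum())                  # 0

# Duplicates considering only the "title" column.
print(df_toy.duplicated(subset=["title"]).sum())  # 1
```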
Let’s see this in action.
Luckily, we have no duplicate values in our dataframe, so we will append some rows from the dataframe to itself to create duplicates.
```python
# Importing the required modules
import pandas as pd

# Importing the dataset and appending some of its own rows to it
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat)
df = pd.read_csv('IMDB-Movie-Data.csv')
df1 = pd.concat([df, df.iloc[20:30, :]])
df1.duplicated().sum()
```
The .drop_duplicates() method is used to drop the duplicate rows from the dataframe.
```python
# Importing the required modules
import pandas as pd

# Importing the dataset and appending some of its own rows to it
df = pd.read_csv('IMDB-Movie-Data.csv')
df1 = pd.concat([df, df.iloc[20:30, :]])

# Dropping the duplicates
df1 = df1.drop_duplicates()

# Checking for remaining duplicates
df1.duplicated().sum()
```
In this article, we walked through the basic data cleaning operations needed to turn a messy dataset into a usable one with pandas.