How to clean CSV data in Python?

Cleaning Dataset Using Python

Whether we are implementing machine learning algorithms or simply running data analysis on a dataset, we need clean data.

There’s a well-known saying in data modeling: “garbage in, garbage out.” We can’t expect decent results when our data isn’t clean.

In this article, we’ll explore common techniques for cleaning CSV data using the Python pandas library.

CSV Data Cleaning Checks

We’ll clean data based on the following:

  • Missing Values
  • Outliers
  • Duplicate Values

1. Cleaning Missing Values in CSV File

In pandas, a missing value is usually denoted by NaN. Since pandas is built on top of the NumPy package, this is NumPy’s special floating-point NaN value.

You can find the dataset used in this article here.

Finding Missing Values

Let’s first see how we can find if there’s a missing value in our data.

#Approach 1: visually

import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt

#importing Dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

#Creating the heatmap
plt.figure(figsize=(8, 6))
sb.heatmap(df.isnull(), cbar=False, cmap='magma')
plt.show()

Missing Values Using the Heatmap

The isnull() method returns boolean values indicating whether each entry in the data is missing.

However, this visual approach is only practical for small to medium-sized datasets.

#Approach 2

We can chain the .sum() method after .isnull(); this returns the number of missing values within each column of the DataFrame.

import pandas as pd

#importing dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

df.isnull().sum()
Finding the Sum of Missing Values
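Raw counts can be hard to judge on their own; a percentage view often makes the scale of the problem clearer. Here is a minimal sketch using a small hypothetical DataFrame standing in for the dataset (the values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the IMDB dataset (hypothetical values)
df = pd.DataFrame({
    "Metascore": [76.0, np.nan, 65.0, np.nan],
    "Votes": [757074, 485820, 157606, 60545],
})

# .isnull().mean() gives the fraction of missing entries per column
missing_pct = df.isnull().mean() * 100
print(missing_pct)
# Metascore    50.0
# Votes         0.0
```

A column that is mostly missing is often a candidate for dropping outright, while a column with only a few gaps is usually worth filling.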

Cleaning Missing Values from Data

We found that our dataset does have some missing values. What should we do next to get clean data?

We can either drop the rows and columns containing missing values, or replace the missing values with an appropriate statistic, i.e. the mean, median, or mode.

  1. Dropping Missing Values:
#Importing pandas
import pandas as pd

#Importing dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

#Size of original dataset
print(df.shape)

#Dropping the missing rows.
df_dropped = df.dropna(how = 'any')

The above code will drop the rows from the dataframe having missing values.

Let’s look at .dropna() method in detail:

  • df.dropna() – Drop all rows that have any NaN values
  • df.dropna(how='all') – Drop rows only if ALL columns are NaN
  • df.dropna(thresh=2) – Drop rows that do not have at least two non-NaN values
  • df.dropna(subset=['Metascore']) – Drop rows only if NaN in the specified column(s)

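The options above can be compared side by side on a small hypothetical DataFrame (the column names and values here are made up for illustration, not taken from the dataset):

```python
import pandas as pd
import numpy as np

# Toy frame with deliberate gaps (hypothetical data)
df = pd.DataFrame({
    "Title": ["A", "B", None, "D"],
    "Rating": [8.1, np.nan, np.nan, 7.0],
    "Votes": [100, 200, 300, 400],
})

print(len(df.dropna()))                   # rows with no NaN at all -> 2
print(len(df.dropna(how="all")))          # drops only fully-empty rows -> 4
print(len(df.dropna(thresh=2)))           # keep rows with >= 2 non-NaN values -> 3
print(len(df.dropna(subset=["Rating"])))  # drop rows where Rating is NaN -> 2
```

Note how the default is the most aggressive: any single NaN removes the whole row, which is why thresh and subset are often safer choices.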
One must be careful when considering dropping the missing values as it might affect the quality of the dataset.

2. Replacing Missing values

import pandas as pd

#importing the dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

#Creating a copy of the dataframe (.copy() gives an independent copy; plain assignment would only create a reference)
df_new = df.copy()

df_new['Metascore'] = df_new['Metascore'].fillna((df_new['Metascore'].mean()))

#printing the dataframes after replacing null values
print(df_new.isna().sum())
print(df.isna().sum())
Before And After Filling Null Values

The pandas .fillna() method accepts a value to use in place of NaN values. Here we calculated the mean of the column and passed it as the input argument to fillna().
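As mentioned earlier, the median and mode are also valid replacements. A short sketch on hypothetical data: the median is more robust to outliers than the mean, and the mode suits categorical columns where a mean makes no sense.

```python
import pandas as pd
import numpy as np

# Hypothetical data for illustration
df = pd.DataFrame({
    "Metascore": [60.0, np.nan, 80.0, 100.0],
    "Genre": ["Action", "Drama", None, "Drama"],
})

# Median fill for a numeric column (robust to extreme values)
df["Metascore"] = df["Metascore"].fillna(df["Metascore"].median())

# Mode fill for a categorical column; .mode() returns a Series, so take [0]
df["Genre"] = df["Genre"].fillna(df["Genre"].mode()[0])

print(df)
```
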

2. Dealing with Outliers

Outliers can skew an entire analysis or model, so it is essential that we detect and remove them.

Using Z-Score

Let’s detect outliers in the Votes column in our dataset and filter the outliers using a z-score.

The idea behind this method is that values lying more than 3 standard deviations from the mean are treated as outliers.

#importing required modules
import pandas as pd
import numpy as np
from scipy import stats

#importing dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

#filtering outliers
df_new = df[(np.abs(stats.zscore(df.Votes)) < 3)]

The column on which this method is applied should be a numerical variable and not categorical.

Using Quantiles

With this method, values falling below the 0.01 quantile or above the 0.99 quantile of the series are filtered out.

#importing required modules
import pandas as pd

#importing dataset
df = pd.read_csv('IMDB-Movie-Data.csv')

#Selecting limits
q_low = df["Votes"].quantile(0.01)
q_hi  = df["Votes"].quantile(0.99)

#filtering outliers
df_filtered = df[(df["Votes"] < q_hi) & (df["Votes"] > q_low)]


3. Dealing with Duplicate entries

We can check for duplicates in a DataFrame using the .duplicated() method. This returns a pandas Series, not a DataFrame.

To check for duplicate values in specific columns only, we can pass the column names to the subset parameter of the .duplicated() method.

Let’s see this in action.

Luckily, we have no duplicate values in our DataFrame, so we will append some rows from the DataFrame to itself to create duplicates.

#Import the required module
import pandas as pd

#Importing the dataset and appending some of its own rows to create duplicates
#(DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
df = pd.read_csv('IMDB-Movie-Data.csv')
df1 = pd.concat([df, df.iloc[20:30, :]])

df1.duplicated().sum()
Output:
10
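The column-level check mentioned above works through the subset parameter. A sketch on a small hypothetical DataFrame, where two rows share a title but differ in another column:

```python
import pandas as pd

# Hypothetical data: two rows share a Title but differ in Year
df = pd.DataFrame({
    "Title": ["Inception", "Avatar", "Inception"],
    "Year": [2010, 2009, 2012],
})

# Full-row check: no row is an exact duplicate of another
print(df.duplicated().sum())                  # 0

# Column-level check: the repeated Title is flagged
print(df.duplicated(subset=["Title"]).sum())  # 1
```

This distinction matters when cleaning real data: rows that are "duplicates" in a key column may still carry different values elsewhere, so inspect them before dropping.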

Now, the .drop_duplicates() method is used to drop the duplicate rows from the DataFrame.

#Importing the required module
import pandas as pd

#Importing the dataset and appending some of its own rows to create duplicates
df = pd.read_csv('IMDB-Movie-Data.csv')
df1 = pd.concat([df, df.iloc[20:30, :]])

#dropping the duplicates
df1 = df1.drop_duplicates()

#checking the duplicates 
df1.duplicated().sum()
Output:
0

Summary

Cleaning CSV data in Python is easy with pandas and NumPy. Always clean your data before running any analysis on it, to make sure the results are correct.
