Detection and Removal of Outliers in Python - An Easy to Understand Guide

Hello, readers! In this part of our series on data processing and analysis, we will look at the detection and removal of outliers in Python.

So, let us get started!

What are Outliers in Python?

Before diving deep into the concept of outliers, let us understand the origin of raw data.

Raw data that is fed to a system usually comes from surveys or is extracted from real-time actions on the web. This can introduce variation into the data, and there is always a chance of measurement error while the data is being recorded.

This is where outliers come into the picture.

An outlier is a data point, or a set of data points, that lies far away from the rest of the values in the dataset. That is, it appears away from the overall distribution of data values.

The approach described here applies to continuous (numeric) values. Thus, the detection and removal of outliers in this article is performed on the numeric variables only.

Basically, outliers diverge from the overall, well-structured distribution of the data elements. They can be considered abnormal observations that appear away from the rest of the class or population.

Having understood the concept of Outliers, let us now focus on the need to remove outliers in the upcoming section.

Why is it necessary to remove outliers from the data?

As discussed above, outliers are data points that lie away from the usual distribution of the data. They have the following effects on the dataset:

They inflate the standard deviation of the data.
They pull the mean towards the extreme values.
They skew the distribution of the data.
They bias the accuracy estimates of the machine learning model.
They distort other summary statistics of the dataset.

Because of the above reasons, it is necessary to detect and get rid of outliers before modelling a dataset.
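
To make these effects concrete, here is a minimal sketch that compares a few summary statistics with and without a single extreme value. The readings are made-up numbers chosen for illustration and are not taken from the dataset used later in this article.

```python
import numpy as np

# Hypothetical readings, made up for illustration only.
clean = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 12.0, 13.0])

# The same readings with one extreme value appended.
with_outlier = np.append(clean, 95.0)

# Mean and sample standard deviation react strongly to the extreme value.
mean_clean, mean_out = clean.mean(), with_outlier.mean()
std_clean, std_out = clean.std(ddof=1), with_outlier.std(ddof=1)

# The median is far less sensitive to a single extreme value.
median_clean, median_out = np.median(clean), np.median(with_outlier)

print(f"mean:   {mean_clean:.2f} -> {mean_out:.2f}")
print(f"std:    {std_clean:.2f} -> {std_out:.2f}")
print(f"median: {median_clean:.2f} -> {median_out:.2f}")
```

In this sketch, one extreme value is enough to shift the mean and widen the standard deviation noticeably, while the median stays where it was.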

Detection of Outliers – IQR approach

The outliers in the dataset can be detected by the below methods:

Z-score
Scatter Plots
Interquartile range(IQR)

In this article, we will implement the IQR method to detect and treat outliers.

IQR stands for Interquartile Range. It measures the statistical dispersion of the data values, that is, how spread out the middle 50% of the data is.

The IQR is the difference between the third quartile (Q3) and the first quartile (Q1): IQR = Q3 - Q1.

Here, Q1 refers to the first quartile, i.e. the 25th percentile, and Q3 refers to the third quartile, i.e. the 75th percentile.
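
As a quick illustration of the definition, the sketch below computes Q1, Q3 and the IQR with numpy.percentile() on a small made-up array. The values are hypothetical, and numpy's default (linear) interpolation between data points is assumed.

```python
import numpy as np

# Hypothetical values, made up for illustration only.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# 25th, 50th and 75th percentiles (Q1, median, Q3).
q1, q2, q3 = np.percentile(values, [25, 50, 75])

# Interquartile range: the spread of the middle 50% of the data.
iqr = q3 - q1

print("Q1:", q1, "Q2:", q2, "Q3:", q3, "IQR:", iqr)
```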

We will be using Boxplots to detect and visualize the outliers present in the dataset.

Boxplots depict the distribution of the data in terms of quartiles and consist of the following components:

Q1 (25th percentile)
Q2 (50th percentile, the median)
Q3 (75th percentile)
Lower bound/whisker
Upper bound/whisker

[Figure: BoxPlot]

Any data point that lies below the lower bound or above the upper bound is considered an outlier.
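
The rule above can be sketched in a few lines of pandas. The helper name iqr_bounds, the multiplier k = 1.5 and the sample values are assumptions made for this illustration; 1.5 is the conventional choice and matches the bounds used later in this article.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds placed k * IQR beyond Q1 and Q3.

    Hypothetical helper written for this illustration.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values with one obvious extreme point.
speeds = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

lower, upper = iqr_bounds(speeds)

# Boolean mask: True where a value falls outside the bounds.
is_outlier = (speeds < lower) | (speeds > upper)
outliers = speeds[is_outlier]

print("bounds:", lower, upper)
print("outliers:", outliers.tolist())
```

Keeping the mask separate from the data makes it easy to inspect the flagged points before deciding how to treat them.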

Let us now implement Boxplot to detect the outliers in the below example.

Example:

Initially, we have imported the dataset into the environment. You can find the dataset here.

import pandas
import numpy as np  # imported as np, the alias used in the code further below
BIKE = pandas.read_csv("Bike.csv")

Further, we have segregated the variables into numeric and categorical values.

numeric_col = ['temp','hum','windspeed']
categorical_col = ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']

We apply Boxplot using boxplot() function on the numeric variables as shown below:

BIKE.boxplot(numeric_col)

[Figure: Detection of Outlier - BoxPlot]

As seen above, the variable ‘windspeed’ contains outliers which lie above the upper bound.

Removal of Outliers

Now is the time to treat the outliers that we have detected using Boxplot in the previous section.

Using IQR, we can follow the below approach to replace the outliers with a NULL value:

Calculate the first and third quartiles (Q1 and Q3).
Further, evaluate the interquartile range, IQR = Q3 - Q1.
Estimate the lower bound, lower bound = Q1 - 1.5*IQR
Estimate the upper bound, upper bound = Q3 + 1.5*IQR
Replace the data points that lie outside the lower and upper bounds with a NULL value.

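# Assumes numpy has been imported as np
for x in ['windspeed']:
    # Q3 (75th percentile) and Q1 (25th percentile) of the column
    q75, q25 = np.percentile(BIKE.loc[:, x], [75, 25])
    intr_qr = q75 - q25

    # Bounds placed 1.5 * IQR beyond the quartiles
    # (named upper/lower to avoid shadowing the built-ins max and min)
    upper = q75 + (1.5 * intr_qr)
    lower = q25 - (1.5 * intr_qr)

    # Replace the values outside the bounds with NaN
    BIKE.loc[BIKE[x] < lower, x] = np.nan
    BIKE.loc[BIKE[x] > upper, x] = np.nan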

Thus, we have used numpy.percentile() method to calculate the values of Q1 and Q3. Further, we have replaced the outliers with numpy.nan as the NULL values.

Having replaced the outliers with nan, let us now check the sum of null values or missing values using the below code:

BIKE.isnull().sum()

Sum of count of NULL values/outliers in each column of the dataset:

season        0
yr            0
mnth          0
holiday       0
weathersit    0
temp          0
hum           0
windspeed     5
cnt           0
dtype: int64

Now, we can use any of the below techniques to treat the NULL values:

Impute the missing values with the mean, the median or KNN-imputed values.
Drop the null values (if their proportion is comparatively small).
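
For the imputation option, a minimal sketch with pandas could look as follows. The small DataFrame is made up for illustration and only mimics a column in which outliers have already been replaced with NaN; KNN imputation is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical column where outliers were already replaced with NaN.
df = pd.DataFrame({"windspeed": [0.10, 0.12, np.nan, 0.15, 0.11, np.nan]})

# median() and mean() skip NaN values by default.
median_value = df["windspeed"].median()
mean_value = df["windspeed"].mean()

# Fill the missing entries; the original column is left untouched.
median_imputed = df["windspeed"].fillna(median_value)
mean_imputed = df["windspeed"].fillna(mean_value)

print(median_imputed.tolist())
print(mean_imputed.tolist())
```

Imputation keeps every row of the dataset, which can matter when the other columns of those rows still carry useful information.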

Here, we will drop the null values using the pandas.DataFrame.dropna() function:

BIKE = BIKE.dropna(axis = 0)

Having treated the outliers, let us now check for the presence of missing or null values in the dataset:

BIKE.isnull().sum()

Output–

season        0
yr            0
mnth          0
holiday       0
weathersit    0
temp          0
hum           0
windspeed     0
cnt           0
dtype: int64

Thus, all the outliers present in the dataset have been detected and treated (removed).