Hello, readers! In our series on data processing and analysis, today we will look at the detection and removal of outliers in Python.
So, let us get started!
What are Outliers in Python?
Before diving deep into the concept of outliers, let us understand the origin of raw data.
Raw data that is fed to a system is usually generated from surveys or extracted from real-time actions on the web. This can introduce variation in the data, and there is always a chance of measurement error while recording it.
This is where outliers come into the picture.
An outlier is a data point, or a set of data points, that lies far away from the rest of the values in the dataset. That is, it appears well outside the overall distribution of the data values.
Outliers in this sense are defined for continuous (numeric) values. Thus, the detection and removal techniques discussed here apply to numeric variables only.
Basically, outliers diverge from the otherwise well-structured distribution of the data elements. They can be regarded as abnormal observations that lie away from the rest of the class or population.
Having understood the concept of Outliers, let us now focus on the need to remove outliers in the upcoming section.
Why is it necessary to remove outliers from the data?
As discussed above, outliers are data points that lie away from the usual distribution of the data. They have the following effects on the overall data distribution:
- They inflate the standard deviation of the data.
- They distort the mean of the data.
- They can skew the distribution of the data.
- They bias the accuracy estimates of a machine learning model.
- They affect the distribution and summary statistics of the dataset.
For the above reasons, it is necessary to detect and treat outliers before modelling a dataset.
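To make these effects concrete, here is a minimal sketch using made-up numbers (the arrays below are hypothetical and are not taken from the bike dataset). It compares the mean, standard deviation and median of a small sample before and after a single extreme value is added.

```python
import numpy as np

# Hypothetical readings; 60.0 is an artificial outlier added for illustration.
clean = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0])
with_outlier = np.append(clean, 60.0)

# The mean and standard deviation react strongly to the single extreme value.
print("mean:  ", clean.mean(), "->", with_outlier.mean())
print("std:   ", clean.std(), "->", with_outlier.std())

# The median barely moves, which is why quartile-based methods are robust.
print("median:", np.median(clean), "->", np.median(with_outlier))
```

The mean and standard deviation both jump, while the median shifts only slightly. This is one reason quartile-based rules such as the IQR are a popular choice for flagging outliers.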
Detection of Outliers – IQR approach
Outliers in a dataset can be detected by the following methods:
- Scatter Plots
- Interquartile range (IQR)
In this article, we will implement the IQR method to detect and treat outliers.
IQR stands for Interquartile Range. It measures the statistical dispersion of the data values, that is, how spread out the middle half of the data is.
The IQR is the difference between the third quartile (Q3) and the first quartile (Q1), i.e. IQR = Q3 - Q1.
Here, Q1 refers to the first quartile, i.e. the 25th percentile, and Q3 refers to the third quartile, i.e. the 75th percentile.
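As a quick illustration, the sketch below computes Q1, Q3 and the IQR for a small hypothetical array using numpy.percentile() with its default (linear) interpolation. The values are invented for the example.

```python
import numpy as np

# Hypothetical sample: the integers 1 to 9.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# 25th and 75th percentiles (Q1 and Q3).
q1, q3 = np.percentile(values, [25, 75])

# Interquartile range: spread of the middle 50% of the data.
iqr = q3 - q1

print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
```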
We will be using Boxplots to detect and visualize the outliers present in the dataset.
Boxplots depict the distribution of the data in terms of quartiles and consist of the following components:
- Lower bound/whisker
- Upper whisker/bound
Any data point that lies below the lower bound or above the upper bound is considered an outlier.
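The rule above can be sketched in a few lines. This example assumes the common convention of placing the bounds 1.5 × IQR beyond the quartiles; the helper name iqr_bounds and the sample values are hypothetical and only serve the illustration.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds placed k * IQR beyond Q1 and Q3."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample with one obviously extreme value (50).
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 50])

lower, upper = iqr_bounds(values)

# Keep only the points that fall outside the bounds.
outliers = values[(values < lower) | (values > upper)]

print("lower bound:", lower)
print("upper bound:", upper)
print("outliers:", outliers)
```

The multiplier k is exposed as a parameter because 1.5 is a convention rather than a fixed law; a larger value flags fewer points.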
Let us now use a boxplot to detect the outliers in the example below.
Initially, we have imported the dataset into the environment. You can find the dataset here.
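import pandas
import numpy as np  # aliased as np, which the later code relies on

# Load the bike dataset from the working directory
BIKE = pandas.read_csv("Bike.csv")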
Further, we have segregated the variables into numeric and categorical columns.
numeric_col = ['temp', 'hum', 'windspeed']
categorical_col = ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']
We apply the boxplot() function to the numeric variables as shown below:
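The plotting call itself is not reproduced in this text, so the following is only a minimal sketch of what it might look like. It assumes pandas' DataFrame.boxplot() with matplotlib installed, and it uses a small synthetic frame (demo, a hypothetical stand-in for BIKE) so that it runs without Bike.csv.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric columns (NOT the real bike data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "temp": rng.uniform(0.1, 0.9, 200),
    "hum": rng.uniform(0.3, 0.9, 200),
    # 199 ordinary values plus one injected extreme value
    "windspeed": np.append(rng.uniform(0.05, 0.3, 199), 0.9),
})
numeric_col = ["temp", "hum", "windspeed"]

# One box per numeric column; points beyond the whiskers are drawn individually.
ax = demo.boxplot(column=numeric_col)
ax.set_title("Boxplots of the numeric variables")
```

With the real data, the equivalent call would presumably be BIKE.boxplot(column=numeric_col), followed by matplotlib.pyplot.show() in an interactive session.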
As seen above, the variable ‘windspeed’ contains outliers that lie above the upper bound.
Removal of Outliers
Now is the time to treat the outliers that we have detected using Boxplot in the previous section.
Using the IQR, we can follow the approach below to replace the outliers with a NULL value:
- Calculate the first and third quartiles (Q1 and Q3).
- Evaluate the interquartile range, IQR = Q3 - Q1.
- Estimate the lower bound, lower bound = Q1 - 1.5 * IQR.
- Estimate the upper bound, upper bound = Q3 + 1.5 * IQR.
- Replace the data points that lie outside of the lower and the upper bound with a NULL value.
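for x in ['windspeed']:
    # Third and first quartiles of the column
    q75, q25 = np.percentile(BIKE.loc[:, x], [75, 25])
    intr_qr = q75 - q25

    # Bounds at 1.5 * IQR beyond the quartiles
    # (named upper/lower to avoid shadowing the built-in max/min)
    upper = q75 + (1.5 * intr_qr)
    lower = q25 - (1.5 * intr_qr)

    # Replace values outside the bounds with NaN
    BIKE.loc[BIKE[x] < lower, x] = np.nan
    BIKE.loc[BIKE[x] > upper, x] = np.nan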
Thus, we have used the numpy.percentile() function to calculate the values of Q1 and Q3. Further, we have replaced the outliers with numpy.nan as the NULL values.
Having replaced the outliers with nan, let us now check the sum of null values or missing values using the below code:
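The exact statement used for this check is not shown in this text. A common way to do it, offered here as an assumption rather than as the original code, is to chain isnull() and sum() on the DataFrame. The sketch uses a tiny hypothetical frame (demo) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in 'windspeed'.
demo = pd.DataFrame({
    "hum": [0.5, 0.6, 0.7],
    "windspeed": [0.1, np.nan, 0.2],
})

# isnull() marks missing cells as True; sum() counts them per column.
null_counts = demo.isnull().sum()
print(null_counts)
```

Applied to the article's data, the equivalent call would be BIKE.isnull().sum().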
Count of NULL values/outliers in each column of the dataset:
season        0
yr            0
mnth          0
holiday       0
weathersit    0
temp          0
hum           0
windspeed     5
cnt           0
dtype: int64
Now, we can use any of the below techniques to treat the NULL values:
- Impute the missing values with the mean, the median, or KNN-imputed values.
- Drop the null values (if their proportion is comparatively small).
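For completeness, the imputation option could look roughly like the sketch below, which fills missing values with the column median. The frame demo and its values are hypothetical; the median is used here only as one reasonable choice, since it is less sensitive to extreme values than the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical column where one outlier has already been replaced by NaN.
demo = pd.DataFrame({"windspeed": [0.10, 0.12, np.nan, 0.14, 0.11]})

# Series.median() skips NaN values by default.
median_value = demo["windspeed"].median()

# Fill the gaps with the median instead of dropping the rows.
demo["windspeed"] = demo["windspeed"].fillna(median_value)
print(demo)
```

Imputation keeps every row, which matters when the dataset is small; dropping is simpler when only a handful of rows are affected.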
Here, we drop the rows that contain null values using dropna():
BIKE = BIKE.dropna(axis = 0)
Having treated the outliers, let us now check for the presence of missing or null values in the dataset:
season        0
yr            0
mnth          0
holiday       0
weathersit    0
temp          0
hum           0
windspeed     0
cnt           0
dtype: int64
Thus, all the outliers present in the dataset have been detected and treated (removed).
By this, we have come to the end of this topic. Feel free to comment below in case you come across any questions.
For more such posts related to Python, stay tuned, and till then, happy learning!! 🙂