How to Determine Outliers in Python

In this article, we look at three methods for detecting outliers in Python: the Z-score method, the Interquartile Range (IQR) method, and Tukey’s fences. Libraries such as NumPy and SciPy provide the functions needed to detect the outliers of a given data set.

Understanding Outliers

An outlier is any value that is significantly different from the other values in a data set. In simpler terms, you may consider it the odd one out.

For example, consider a customer’s purchase data. His regular transactions are below 10,000, and then he suddenly makes a transaction of 100,000; this transaction is an outlier. Similarly, in a group of 5 people where 4 are under 21 years old and one is 35, the 35-year-old is an outlier.

Methods to Detect Outliers in Python

Outliers can be detected in Python using different methods, such as the Z-score, the Interquartile Range (IQR), and Tukey’s fences. Each method identifies data points that differ significantly from the rest of the dataset, so that they can be examined before further analysis. Let’s dive into the three methods.

Method 1: Z-score

import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]

# Mean and (population) standard deviation of the data
mean = np.mean(data)
std = np.std(data)

threshold = 3  # flag values more than 3 standard deviations from the mean
outliers = []
for x in data:
    z_score = (x - mean) / std
    if abs(z_score) > threshold:
        outliers.append(x)

print("Mean:", mean)
print("Standard deviation:", std)
print("Outliers:", outliers)

Here’s a quick explanation of the above code.

  • Import the numpy module under its alias np
  • The Z-score method measures how many standard deviations a value lies from the mean
  • Calculate the mean and standard deviation before the loop
  • Inside the for loop, compute each value’s z-score with the formula (x - mean) / std
  • If the absolute value of the z-score exceeds the threshold, append the value to outliers
  • The threshold is usually set between 2 and 3
  • The code also prints the mean and standard deviation

Output:

The script reports a mean of about 95.91, a standard deviation of about 285.91, and [1000] as the only outlier.
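The same check can also be written without an explicit loop. The following is a minimal sketch, assuming the population standard deviation (the default of np.std) is wanted; zscore_outliers is a hypothetical helper name used here for illustration, not a function that NumPy or SciPy provides.

```python
import numpy as np


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`.

    `zscore_outliers` is a hypothetical helper name, not a NumPy function.
    """
    arr = np.asarray(values, dtype=float)
    std = arr.std()  # population standard deviation (ddof=0), as in np.std
    if std == 0:
        # All values are identical, so no value can be an outlier.
        return arr[:0]
    z_scores = (arr - arr.mean()) / std
    return arr[np.abs(z_scores) > threshold]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]
print(zscore_outliers(data))                 # values with |z| > 3
print(zscore_outliers(data, threshold=2.0))  # a stricter cut-off
```

One thing to keep in mind on small samples: with n values and the population standard deviation, no z-score can exceed the square root of n - 1. For the 11 values above that limit is about 3.16, so a threshold of 3 leaves very little margin, and a lower threshold may be more practical for short lists.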

Method 2: Interquartile Range (IQR)

In this method, we first calculate the IQR of the given array by subtracting q1 from q3. A data point is considered an outlier if it lies more than 1.5 times the IQR below q1 or above q3.

import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# First and third quartiles (25th and 75th percentiles)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Values more than 1.5 * IQR outside the quartiles are outliers
threshold = 1.5 * iqr
outliers = np.where((data < q1 - threshold) | (data > q3 + threshold))

print("Outliers of array", data, "are:\n", data[outliers])

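We import the numpy module and use np.percentile to calculate the 25th and 75th percentiles, stored in the variables q1 and q3 respectively. The IQR is calculated as q3 - q1, and the threshold is set to 1.5 times the IQR. np.where then returns the indices of the values that fall below q1 - threshold or above q3 + threshold, and indexing data with those indices gives the outliers.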

Output:

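The script reports [100] as the only outlier, since 100 is the only value outside the range from -3.5 to 14.5.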
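To see where those bounds come from, the intermediate values can be printed. This is a small sketch on the same array, assuming the default linear interpolation of np.percentile; the names lower, upper, and mask are only illustrative. For this array the quartiles are 3.25 and 7.75, which gives an IQR of 4.5 and a threshold of 6.75.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Both quartiles in one call; the default method interpolates linearly.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

print(f"q1={q1}, q3={q3}, iqr={iqr}")
print(f"lower bound={lower}, upper bound={upper}")

# A boolean mask is an alternative to np.where for selecting the outliers.
mask = (data < lower) | (data > upper)
print(data[mask])
```

The boolean mask selects the same values as np.where and can be slightly easier to read when only the values, not their positions, are needed.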

Method 3: Tukey’s Fences

This method is closely related to the Interquartile Range (IQR) method used earlier. Instead of comparing values against a threshold, it explicitly calculates a lower fence and an upper fence from the quartiles, and any data point that lies outside these fences is considered an outlier. With the usual multiplier of 1.5, both methods flag the same values.

import numpy as np

data = np.array([1, 20, 20, 20, 21, 100])

# Quartiles and interquartile range
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Tukey's fences: 1.5 * IQR below q1 and above q3
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = np.where((data < lower_fence) | (data > upper_fence))

print("Outliers of array", data, "are:\n", data[outliers])

  • NumPy provides the percentile function used to calculate the fences
  • Define the input array in data
  • Calculate the 25th and 75th percentiles and store them in q1 and q3
  • Calculate the IQR as q3 - q1
  • Calculate the lower fence as q1 - 1.5 * IQR and the upper fence as q3 + 1.5 * IQR
  • Outliers are the values below the lower fence or above the upper fence

Output:

The script reports [1 100] as the outliers, since both values lie outside the fences at 18.875 and 21.875.
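The multiplier does not have to be 1.5. Tukey’s convention also uses a multiplier of 3 to mark values that are "far out". The sketch below wraps the fence calculation in a function so the multiplier can be changed; tukey_fences and its parameter k are hypothetical names chosen for this example, not part of NumPy or SciPy.

```python
import numpy as np


def tukey_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) for the multiplier `k`.

    `tukey_fences` is a hypothetical helper name, not a NumPy or SciPy function.
    """
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = np.array([1, 20, 20, 20, 21, 100])

# Compare the usual multiplier with the wider "far out" multiplier.
for k in (1.5, 3.0):
    lower, upper = tukey_fences(data, k)
    flagged = data[(data < lower) | (data > upper)]
    print(f"k={k}: fences=({lower}, {upper}), outliers={flagged}")
```

For this array both multipliers flag the same two values, because 1 and 100 lie far beyond even the wider fences. On data with milder extremes, the larger multiplier would flag fewer points.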

Conclusion

In this article, we implemented three methods for detecting outliers in a given set of data: the Z-score method, the Interquartile Range (IQR) method, and Tukey’s fences. An overview of what an outlier is has also been provided. Detecting outliers is useful for catching data errors and for understanding the data distribution. It also keeps a few extreme values from skewing the results, which leads to more accurate and more equitable analyses.
