In this article, we learn about different methods used to detect outliers in Python. We will implement the Z-score method, the Interquartile Range (IQR) method, and Tukey’s fences method. Python provides modules like numpy and scipy which assist us in detecting the outliers of a given data set.
An outlier is any value that is significantly different from the others; in simpler terms, you may consider it the odd one out.
For example, let’s consider a customer’s purchase data. The customer’s regular transactions are below 10,000, and suddenly they make a transaction of 1,00,000; this is an outlier. Another example would be a list of 5 people, 4 of whom are under 21 years old: the one person who is 35 years old would be an outlier.
Methods to Detect Outliers in Python
In Python, detecting outliers can be done using different methods such as the Z-score, Interquartile Range (IQR), and Tukey’s Fences. These methods help identify data points that significantly differ from others in the dataset, improving data analysis and accuracy. Let’s dive into three methods to detect outliers in Python.
Method 1: Z-score
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]

# np.std computes the population standard deviation by default (ddof=0)
mean = np.mean(data)
std = np.std(data)
threshold = 3

# Collect every value whose absolute z-score exceeds the threshold
outliers = []
for x in data:
    z_score = (x - mean) / std
    if abs(z_score) > threshold:
        outliers.append(x)

print("Mean: ", mean)
print("\nStandard deviation: ", std)
print("\nOutliers : ", outliers)
Here’s a quick explanation of the above code.
- Import the numpy module under its alias np
- The z-score method uses the mean and standard deviation to determine outliers
- Calculate the mean and standard deviation before the loop
- Inside the for loop, compute each value's z-score with the formula (x - mean) / std
- If the absolute value of the z-score is greater than the threshold, append the value to outliers
- The threshold is commonly set between 2 and 3
- The code also prints the mean and standard deviation
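The loop above can also be written without an explicit for loop. The following is a minimal sketch, assuming the same population standard deviation that np.std uses by default; the function name zscore_outliers and its threshold parameter are hypothetical names chosen for illustration, not part of NumPy.

```python
import numpy as np


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold.

    Hypothetical helper for illustration; not a NumPy function.
    """
    arr = np.asarray(values, dtype=float)
    std = arr.std()  # population standard deviation (ddof=0)
    if std == 0:
        # All values are identical, so no value stands out.
        return arr[:0]
    z_scores = (arr - arr.mean()) / std
    # A boolean mask selects every value beyond the threshold.
    return arr[np.abs(z_scores) > threshold]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]
outliers = zscore_outliers(data)
print("Outliers:", outliers)

# A ten-value sample: the extreme value 100 has a z-score just below 3.
small_sample = zscore_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print("Outliers in the small sample:", small_sample)
```

The second call illustrates why the threshold matters: in a sample of only ten values, the z-score of 100 stays just under 3, so a threshold of 3 flags nothing. For small data sets, a lower threshold or one of the quartile-based methods may be more suitable.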
Method 2: Interquartile Range (IQR)
In this method, we first calculate the IQR of the given array by subtracting q1 from q3. If a data point lies more than 1.5 times the iqr below q1 or above q3, it is considered an outlier.
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# First and third quartiles
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
threshold = 1.5 * iqr

# Indices of values lying more than the threshold below q1 or above q3
outliers = np.where((data < q1 - threshold) | (data > q3 + threshold))

print("Outliers of array ", data, "is : \n", data[outliers])
We use the numpy module to calculate the IQR. We first calculate the 25th and 75th percentiles and store them in the variables q1 and q3 respectively. iqr is calculated as q3 - q1. We then set the threshold to 1.5 times iqr.
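To make the arithmetic behind this method visible, here is a small sketch that returns the computed bounds along with the outliers. The function name iqr_outliers and its factor parameter are hypothetical names used only for this illustration.

```python
import numpy as np


def iqr_outliers(values, factor=1.5):
    """Return (outliers, lower_bound, upper_bound) using the IQR rule.

    Hypothetical helper for illustration; not a NumPy function.
    """
    arr = np.asarray(values, dtype=float)
    # Passing a list of percentiles returns both quartiles in one call.
    q1, q3 = np.percentile(arr, [25, 75])
    threshold = factor * (q3 - q1)
    lower, upper = q1 - threshold, q3 + threshold
    mask = (arr < lower) | (arr > upper)
    return arr[mask], lower, upper


iqr_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
found, lower, upper = iqr_outliers(iqr_data)
print("Lower bound:", lower)
print("Upper bound:", upper)
print("Outliers:", found)
```

For this array, q1 is 3.25 and q3 is 7.75, so the IQR is 4.5 and the threshold is 6.75. Any value below -3.5 or above 14.5 is flagged, which leaves 100 as the only outlier.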
Method 3: Tukey’s Fences
This method is closely related to the Interquartile Range (IQR) method used earlier. Instead of comparing values against a single threshold variable, it calculates an explicit lower fence and upper fence from the quartiles, and any data point that lies outside this range is considered an outlier. With the conventional multiplier of 1.5 used here, the fences flag the same points as the IQR method.
import numpy as np

data = np.array([1, 20, 20, 20, 21, 100])

# First and third quartiles
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Fences placed 1.5 * IQR beyond each quartile
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = np.where((data < lower_fence) | (data > upper_fence))

print("Outliers of array ", data, "is : \n", data[outliers])
- Numpy assists in calculating the fences
- Define input array in data
- Calculate 25th and 75th percentiles and store in q1 and q3
- Calculate IQR as q3-q1
- Calculate lower fence as q1 - 1.5 * IQR and upper fence as q3 + 1.5 * IQR
- Outliers are the values that lie below the lower fence or above the upper fence
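The multiplier 1.5 in the fences is a convention rather than a fixed rule; a larger multiplier such as 3 is sometimes used to mark only values that are far out. The sketch below, which assumes a hypothetical helper named tukey_fences with a multiplier parameter k, compares the two choices on the same array.

```python
import numpy as np


def tukey_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) for the multiplier k.

    Hypothetical helper for illustration; not a NumPy function.
    """
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


fence_data = np.array([1, 20, 20, 20, 21, 100])

# Store (lower_fence, upper_fence, flagged values) for each multiplier.
results = {}
for k in (1.5, 3.0):
    lower_fence, upper_fence = tukey_fences(fence_data, k)
    flagged = fence_data[(fence_data < lower_fence) | (fence_data > upper_fence)]
    results[k] = (lower_fence, upper_fence, flagged.tolist())
    print(f"k={k}: fences=({lower_fence}, {upper_fence}), outliers={flagged.tolist()}")
```

In this array most values sit at 20, so the IQR is only 0.75 and both multipliers produce narrow fences that flag 1 and 100. On data with a wider spread, the larger multiplier would flag fewer points.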
In this article, we implemented three methods to detect outliers in a given set of data: the z-score method, the Interquartile Range (IQR) method, and Tukey’s fences method. We also gave an overview of what an outlier is. Detecting outliers can help in finding data errors, improving accuracy, understanding the data distribution, and ensuring fairness. Overall, it benefits data analysis by making the results more accurate and more equitable.