Boxplots use quartiles to show how a dataset is distributed. Finding the spread of our data is an essential part of exploratory data analysis (EDA). For this, we have some existing tools like histograms and density plots.
But boxplots offer some major advantages over these two.
Boxplots pack more information into a very small space and can be useful when comparing two or more variables on a single plot.
In this article, we’ll get to know what boxplots are all about, their use, and how to implement a boxplot using Python.
Interpreting a Boxplot
Boxplots display the distribution of data based on five summary statistics, namely:
- minimum
- first quartile (Q1)
- median
- third quartile (Q3)
- maximum
A boxplot focuses on the range of values in the distribution.
1. Summary provided by Boxplots
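- Median: The median is the middle value of the sorted dataset. When the dataset has an even number of points, it is the mean of the middle two numbers.
- first quartile (Q1): The first quartile is the median of the data points to the left of the median. It is also known as the 25th percentile.
- third quartile (Q3): The third quartile is the median of the data points to the right of the median. It is also known as the 75th percentile.
- interquartile range (IQR): Q3 – Q1, the range covered by the data between the 25th and 75th percentiles.
- minimum: Q1 – 1.5 * IQR (excluding outliers). In practice, the lower whisker is usually drawn at the smallest data point that is not below this limit.
- maximum: Q3 + 1.5 * IQR (excluding outliers). In practice, the upper whisker is usually drawn at the largest data point that is not above this limit.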
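As a rough illustration of the quartile definitions above, here is a minimal sketch in plain Python. The helper name quartiles_by_halves is hypothetical, and it assumes one particular convention: when the number of points is odd, the median itself is left out of both halves. Libraries such as NumPy and pandas interpolate percentiles by default, so their Q1 and Q3 can differ slightly from these values.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves definition.

    Assumption: for an odd number of points, the median is excluded
    from both the lower and the upper half. Needs at least two values.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Skip the middle element when n is odd
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)


q1, q2, q3 = quartiles_by_halves([1, 2, 3, 4, 5, 6, 7, 8])
print("Q1:", q1, "median:", q2, "Q3:", q3, "IQR:", q3 - q1)
```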
Boxplots are really good at spotting outliers in the provided data. The most commonly implemented method to spot outliers with boxplots is the 1.5 x IQR rule.
Any data point smaller than Q1 – 1.5 x IQR or greater than Q3 + 1.5 x IQR is considered an outlier.
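The 1.5 x IQR rule can be sketched in a few lines using only the standard library. The function name iqr_outliers and the sample values are made up for illustration, and the sketch assumes the data has no missing values. It uses statistics.quantiles with method="inclusive", which interpolates linearly between data points; other quartile conventions can move the fences slightly.

```python
from statistics import quantiles


def iqr_outliers(values):
    """Return (outliers, lower_fence, upper_fence) using the 1.5 x IQR rule."""
    data = sorted(values)
    # n=4 splits the data into quartiles and returns (Q1, median, Q3)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return outliers, lower_fence, upper_fence


sample = [2, 4, 4, 5, 6, 7, 8, 9, 40]  # made-up values
outliers, low, high = iqr_outliers(sample)
print("fences:", low, high)
print("outliers:", outliers)
```

Here only the value 40 falls outside the fences, so it is the only point a boxplot would draw as an outlier.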
Implementing Boxplots with Python
Boxplots can be plotted using many plotting libraries. Let’s check how we can create boxplots using Python.
To demonstrate plotting, we’ll use the Titanic dataset; you can download the dataset here.
1. Using Pandas
# Import the required modules
import numpy as np
import pandas as pd

data = pd.read_csv('Titanic.csv')

# Plot a boxplot of the Age column
boxplot = data.boxplot(column=['Age'])
If our data has categorical values in it, e.g. gender, then pandas can create a separate boxplot for each category. Let’s see how to do this.
boxplot = data.boxplot(column=['Age'], by=['Sex'])
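To see the numbers behind each grouped box, one option is to combine groupby with describe. This is only a sketch: the small DataFrame below is made up and merely stands in for the Titanic data, reusing the Sex and Age column names from the article.

```python
import pandas as pd

# Made-up rows standing in for the Titanic data
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [20, 30, 40, 10, 20, 60],
})

# One row of summary statistics per category
summary = toy.groupby("Sex")["Age"].describe()

# The quartile columns correspond to the edges and center line of each box
print(summary[["25%", "50%", "75%"]])
```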
We can also pass a list of more than one column to group the data by those columns and then create the boxplots.
boxplot = data.boxplot(column=['Age'], by=['Sex', 'Survived'], notch=True)
When notch is set to True, we get notches on the boxplot that show a confidence interval around the median; by default this is approximately a 95% confidence interval.
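To give a feel for where the notch comes from, here is a minimal sketch of a commonly used approximation: median ± 1.57 x IQR / sqrt(n). To the best of our knowledge this matches the default Gaussian-based approximation in Matplotlib, but treat the constant and the quartile convention as assumptions. The helper name notch_interval is hypothetical.

```python
import math
from statistics import median, quantiles


def notch_interval(values):
    """Approximate 95% confidence interval around the median.

    Assumption: uses median +/- 1.57 * IQR / sqrt(n) with linearly
    interpolated quartiles; plotting libraries may differ in detail.
    """
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / math.sqrt(len(data))
    med = median(data)
    return med - half_width, med + half_width


low_notch, high_notch = notch_interval(range(1, 17))
print(low_notch, high_notch)
```

If the notches of two boxes do not overlap, that is usually read as a sign that the two medians differ.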
With the pandas boxplot method, it becomes really quick to visualize boxplots of a dataset.
2. Using Seaborn
With Seaborn, we can plot some really good publication-quality boxplots. Let’s get right into it.
# Import the required modules
import seaborn as sns
import pandas as pd

data = pd.read_csv('Titanic.csv')

sns.boxplot(data=data, x='Survived', y='Age', hue='Sex')
Suppose we now want boxplots of the ages of people who survived and of those who lost their lives, grouped by gender. The seaborn boxplot function has a hue argument, which is useful for this purpose.
sns.boxplot(data=data, x='Survived', y='Age', hue='Sex', notch=True)
3. Using Matplotlib
Even though we can create more aesthetic plots with fewer lines of code using the seaborn library, seaborn is actually built on top of Matplotlib. Matplotlib has a method to plot boxplots, conveniently called boxplot.
Let’s create boxplots using Matplotlib.
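# Import the required modules
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('Titanic.csv')

# Drop missing ages before plotting
# Note: in Matplotlib 3.9 and later, labels is renamed to tick_labels
plt.boxplot(data.Age.dropna(), labels=['Age'])
plt.show()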
We can create boxplots of multiple variables by passing them as a list.
plt.boxplot([data.Fare, data.Age.dropna()], labels=['Fare', 'Age'])
plt.show()
In this article, we explored boxplots, their importance, and what they are used for. We also wrote code to create our own boxplots using the pandas, seaborn, and Matplotlib libraries. Happy Learning!