Boxplots: Everything you need to know


Boxplots use quartiles to show how a dataset is distributed. Understanding the spread of our data is an essential part of exploratory data analysis (EDA). For this, we already have tools such as histograms and density plots.

But boxplots offer some major advantages over these two.

Boxplots pack more information into a very small space and are useful when comparing two or more variables on a single plot.

In this article, we’ll get to know what boxplots are all about, their use, and how to implement a boxplot using Python.

Interpreting a Boxplot

Boxplots display the distribution of data based on five summary statistics:

  • first quartile (Q1)
  • third quartile (Q3)
  • minimum
  • maximum
  • median

Together, these statistics summarize the range and spread of the values in the distribution.

Box Plot Components

1. Summary provided by Boxplots

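  • median: The median is the middle value of the sorted dataset. When the number of data points is even, it is the mean of the two middle values.
  • first quartile (Q1): The first quartile is the median of the data points to the left of the median. It is also known as the 25th percentile.
  • third quartile (Q3): The third quartile is the median of the data points to the right of the median. It is also known as the 75th percentile.
  • interquartile range (IQR): Q3 - Q1, i.e. the range covered by the data between the 25th and 75th percentiles.
  • minimum: the smallest data point that is not an outlier, i.e. the lowest value at or above Q1 - 1.5 × IQR. This is where the lower whisker ends.
  • maximum: the largest data point that is not an outlier, i.e. the highest value at or below Q3 + 1.5 × IQR. This is where the upper whisker ends.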
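The definitions above can be sketched in plain Python. This is a minimal illustration that follows the "median of each half" rule for the quartiles; the function name five_number_summary and the sample ages are made up for the example, and plotting libraries may compute quartiles with a slightly different interpolation rule.

```python
from statistics import median


def five_number_summary(values):
    """Summary statistics using the 'median of each half' quartile rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2

    lower = data[:half]           # points to the left of the median
    upper = data[half + n % 2:]   # points to the right (skips the middle point when n is odd)

    q1, med, q3 = median(lower), median(data), median(upper)
    iqr = q3 - q1

    # Fences from the 1.5 x IQR rule; points beyond them count as outliers
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr

    # Whiskers end at the most extreme points that are still inside the fences
    whisker_low = min(x for x in data if x >= low_fence)
    whisker_high = max(x for x in data if x <= high_fence)

    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": whisker_low,
        "whisker_high": whisker_high,
    }


# Hypothetical sample of ages, not taken from any real dataset
ages = [2, 19, 21, 22, 24, 26, 28, 30, 35, 71]
summary = five_number_summary(ages)
print(summary)
```

For this sample the whiskers end at 19 and 35, which leaves 2 and 71 outside the fences.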

2. Outliers

Boxplots are really good at spotting outliers in the provided data. The most commonly implemented method to spot outliers with boxplots is the 1.5 × IQR rule.

Any data point smaller than Q1 - 1.5 × IQR or greater than Q3 + 1.5 × IQR is considered an outlier.
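As a rough sketch of this rule with NumPy: the helper name iqr_outliers and the sample values are assumptions made for illustration. Note that np.percentile interpolates linearly by default, so its quartiles can differ slightly from the median-of-halves definition given earlier.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return the points outside [Q1 - k*IQR, Q3 + k*IQR] and the two fences."""
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]                 # ignore missing values

    q1, q3 = np.percentile(arr, [25, 75])     # linear interpolation by default
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    mask = (arr < lower) | (arr > upper)      # True for outliers
    return arr[mask], (lower, upper)


# Hypothetical sample of ages, not taken from any real dataset
ages = [2, 19, 21, 22, 24, 26, 28, 30, 35, 71]
outliers, (lower, upper) = iqr_outliers(ages)
print(outliers, lower, upper)
```

The multiplier k is exposed as a parameter because 1.5 is a convention rather than a fixed law; a larger value flags fewer points.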

Implementing Boxplots with Python

Boxplots can be plotted using many plotting libraries. Let’s see how we can create boxplots using Python.

To demonstrate plotting, we’ll use the Titanic dataset, which you can download here.

1. Using Pandas

Pandas DataFrames have a boxplot method, which simply takes the columns we want to plot as an input argument.

# Import the required modules
import pandas as pd

# Load the Titanic dataset (assumes Titanic.csv is in the working directory)
data = pd.read_csv('Titanic.csv')

# Plot a boxplot of the Age column
boxplot = data.boxplot(column=['Age'])
Pandas Boxplot Age Column

If our data contains categorical values, e.g. gender, pandas can create a boxplot for each category. Let’s see how to do this.

# One Age box per value of the Sex column
boxplot = data.boxplot(column=['Age'], by=['Sex'])
Pandas Boxplot Age With Gender Columns
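
To check the numbers behind a grouped boxplot, the per-category quartiles can be computed directly with groupby. The sketch below uses a small made-up DataFrame (the name toy and its values are hypothetical) so that it runs without the Titanic file; the same call should work on data with the Sex and Age columns.

```python
import pandas as pd

# Hypothetical toy data standing in for the Sex and Age columns
toy = pd.DataFrame({
    "Sex": ["female"] * 5 + ["male"] * 5,
    "Age": [10, 20, 30, 40, 50, 22, 24, 26, 28, 30],
})

# Q1, median and Q3 of Age for each category, one row per category
quartiles = toy.groupby("Sex")["Age"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

Each row of the result corresponds to one box in the grouped plot: the 0.25 and 0.75 columns are the box edges and the 0.5 column is the median line.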

We can also pass a list of more than one column to group the data by those columns before creating the boxplots.

# One Age box per (Sex, Survived) combination, drawn with notches
boxplot = data.boxplot(column=['Age'], by=['Sex', 'Survived'], notch=True)
Pandas Boxplot Grouped By Gender And Survived Columns

When notch is set to True, the boxes are drawn with notches that show a confidence interval around the median. By default this is an approximate 95% confidence interval.

Using the boxplot method on a DataFrame makes it really quick to visualize boxplots.
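
To give a feel for where the notch comes from: to our understanding, Matplotlib (which pandas uses for drawing) places the notch at median ± 1.57 × IQR / √n unless bootstrapping is requested. Treat that constant as an assumption here. The helper name notch_interval and the sample values are made up for this sketch.

```python
import numpy as np


def notch_interval(values):
    """Approximate 95% interval for the median: median +/- 1.57 * IQR / sqrt(n)."""
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]                        # ignore missing values

    q1, med, q3 = np.percentile(arr, [25, 50, 75])
    # 1.57 is the assumed default notch constant, not something stated in this article
    half_width = 1.57 * (q3 - q1) / np.sqrt(arr.size)
    return med - half_width, med + half_width


# Hypothetical sample of ages, not taken from any real dataset
ages = [2, 19, 21, 22, 24, 26, 28, 30, 35, 71]
low, high = notch_interval(ages)
print(low, high)
```

When the notches of two boxes do not overlap, that is commonly read as evidence that the two medians differ.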

2. Using Seaborn

With Seaborn we can plot publication-quality boxplots with very little code. Let’s get right into it.

# Import the required modules
import seaborn as sns
import pandas as pd

# Load the Titanic dataset (assumes Titanic.csv is in the working directory)
data = pd.read_csv('Titanic.csv')

# One Age box per gender
sns.boxplot(data=data, x='Sex', y='Age')
Seaborn Boxplot Grouped By Gender

Suppose we now want boxplots of the ages of passengers who survived and those who did not, grouped by gender. The seaborn boxplot function has a hue argument that is useful for this purpose.

# Age by survival status, split by gender with hue, drawn with notches
sns.boxplot(data=data, x='Survived', y='Age', hue='Sex', notch=True)
Seaborn Boxplot Grouped By Gender And Survived Columns

3. Using Matplotlib

Although seaborn lets us create more aesthetic plots with fewer lines of code, it is actually built on top of Matplotlib. Matplotlib has its own method for plotting boxplots, conveniently called boxplot.

Let’s create boxplots using Matplotlib.

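# Import the required modules
import pandas as pd
import matplotlib.pyplot as plt

# Load the Titanic dataset (assumes Titanic.csv is in the working directory)
data = pd.read_csv('Titanic.csv')

# Drop missing ages first, since Matplotlib does not skip NaN values
# Note: Matplotlib 3.9+ renames the labels parameter to tick_labels
plt.boxplot(data.Age.dropna(), labels=['Age'])
plt.show()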
Matplotlib Boxplot Age

We can create boxplots of multiple variables by passing them as a list.

# One box per variable, in the order given
plt.boxplot([data.Fare, data.Age.dropna()], labels=['Fare', 'Age'])
plt.show()

Matplotlib Boxplot On Fares And Age

Conclusion

In this article, we explored boxplots, why they matter, and what they are used for. We also wrote code to create our own boxplots using the pandas, seaborn, and Matplotlib libraries. Happy Learning!