Density Plots in Python - A Comprehensive Overview

A density plot is used to visualize the distribution of a continuous numerical variable in a dataset. It is also known as a kernel density plot.

It’s a good practice to know your data well before starting to apply any machine learning techniques to it.

As good ML practitioners, we should be asking questions like:

What does our data look like?
Is it normally distributed, or does it have a different shape?
Do the algorithms we intend to apply make any underlying assumptions about the distribution of the data?

Addressing such questions right after we acquire our data can drastically improve the results in later stages and save us a lot of time.

Plots like histograms and density plots give us ways to answer the questions above.

Why understand histograms before learning about density plots?

A density plot is closely analogous to a histogram. A histogram shows the shape of a distribution by binning the data and counting the number of observations in each bin. The y-axis usually denotes bin counts, but it can also show counts per unit of bin width, also called densities.
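The binning step described above can be sketched with NumPy. This is a minimal illustration rather than the article's own code: the sample, the seed, and the choice of 10 bins are assumptions made for the example. Note that `density=True` uses the normalized form of density, where each count is divided by the sample size times the bin width so that the bars enclose a total area of 1.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the run is repeatable (illustrative)
data = rng.normal(10, 3, 1000)      # made-up sample: mean 10, standard deviation 3

# Bin the data: counts[i] is the number of observations that fall in bin i.
counts, edges = np.histogram(data, bins=10)
widths = np.diff(edges)

# density=True divides each count by (n * bin width).
density, _ = np.histogram(data, bins=10, density=True)

print(counts.sum())                 # total number of observations
print((density * widths).sum())     # total area under the density bars
```

The counts add up to the number of observations, while the density values multiplied by the bin widths add up to an area of one.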

**A Histogram With Fewer Bins**

If we increase the number of bins in our histogram, the shape of the distribution appears smoother.

**A Histogram With More Bins**
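A comparison like the one in the two histograms above can be drawn with matplotlib. This is only a sketch: the data, the seed, and the bin counts (5 and 50) are assumptions chosen for illustration, not the values behind the original figures.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)      # fixed seed (illustrative)
data = rng.normal(10, 3, 1000)      # made-up sample

# Two panels sharing the x-axis so the shapes are directly comparable.
fig, (ax_few, ax_many) = plt.subplots(1, 2, figsize=(10, 4), sharex=True)

counts_few, edges_few, _ = ax_few.hist(data, bins=5)
ax_few.set_title("5 bins")

counts_many, edges_many, _ = ax_many.hist(data, bins=50)
ax_many.set_title("50 bins")

plt.show()
```

Both panels hold the same observations; only the bin width changes, and with it how much detail of the shape is visible.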

Now, imagine a smooth continuous line passing through the top of each bin, creating an outline of the shape of our distribution. The result is what we call a density plot.

Understanding The Density Plot

We can think of density plots as smoothed histograms, which is quite intuitive by now. Density plots mostly use a kernel density estimate (KDE), which places a small smooth curve (a kernel) on every observation and adds these curves up, so the noise in the data is smoothed out.
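To make the idea concrete, here is a minimal from-scratch sketch of a Gaussian kernel density estimate, assuming a fixed, hand-picked bandwidth. The function name `gaussian_kde_sketch` and all values are hypothetical and exist only for this illustration; libraries such as SciPy choose the bandwidth automatically and are what you would use in practice.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each point of `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance from every grid point to every sample,
    # shape (len(grid), len(samples)).
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # One standard normal bump per sample.
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the bumps and divide by the bandwidth so the curve integrates to 1.
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)            # fixed seed (illustrative)
data = rng.normal(10, 3, 200)             # made-up sample
x_vals = np.linspace(-10, 30, 801)        # grid wide enough to cover the tails
density = gaussian_kde_sketch(data, x_vals, bandwidth=1.0)

# Riemann-sum approximation of the area under the estimated curve.
area = np.sum(density) * (x_vals[1] - x_vals[0])
print(area)
```

The area under the curve comes out very close to 1, which is what makes the result a density rather than a count.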

Density plots are not affected by the number of bins, which is a major parameter for histograms, and hence allow us to better visualize the distribution of our data. Their smoothness is instead controlled by the bandwidth of the kernel.

So in summary, a density plot is just like a histogram, but with a smooth curve drawn through the top of each bin.

Distributions come in several shapes in the wild. Some of the most common shapes that we are very likely to encounter are symmetric (bell-shaped), left-skewed, right-skewed, bimodal, and uniform.

Density Plots with Python

We can draw a density plot in many ways using Python. Let’s look at a few commonly used methods.

1. Using Python scipy.stats module

The scipy.stats module provides the gaussian_kde class to estimate the density of a given dataset.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

data = np.random.normal(10, 3, 100)  # Generate 100 samples with mean 10 and standard deviation 3
density = gaussian_kde(data)

x_vals = np.linspace(0, 20, 200)  # Points at which the density is evaluated
density.covariance_factor = lambda: 0.5  # Smoothing parameter (bandwidth factor)

density._compute_covariance()  # Recompute the bandwidth after changing the factor
plt.plot(x_vals, density(x_vals))
plt.show()

We override the covariance_factor method of the gaussian_kde object so that it returns a different value; larger values give a smoother plot. Remember to call _compute_covariance after changing it.
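The same smoothing factor can also be set through the public API instead of the private _compute_covariance method. The sketch below uses the bw_method argument of gaussian_kde and its set_bandwidth method. The seed and the second factor of 0.25 are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)           # fixed seed (illustrative)
data = rng.normal(10, 3, 100)

# A scalar bw_method is used directly as the covariance factor.
density = gaussian_kde(data, bw_method=0.5)
x_vals = np.linspace(0, 20, 200)
y_vals = density(x_vals)

# The factor can be changed later on the same estimator; a smaller
# value gives a less smooth, more detailed curve.
density.set_bandwidth(bw_method=0.25)
y_narrow = density(x_vals)

print(density.factor)                    # the factor currently in use
```

Both y_vals and y_narrow can be passed to plt.plot(x_vals, ...) exactly as in the example above.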

2. Using Seaborn `kdeplot` function

The seaborn module provides an easier way to do the same task, with much more flexibility.

import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt

data = np.random.normal(10,3,300) #Generating data.
plt.figure(figsize = (5,5))
sb.kdeplot(data, bw_method=0.5, fill=True)
plt.show()

Seaborn kdeplot takes a univariate data array or a pandas Series object as input. The bw_method argument is equivalent to the covariance_factor of the gaussian_kde class demonstrated above (older seaborn releases called it bw, which is now deprecated). We can pass fill=False to plot just the curve without filling the area under it with color.

3. Using pandas plot function

The pandas plot method can also draw density plots when we pass kind = 'density' as an argument.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

x_values = np.random.normal(10, 3, 300)  # Generating data from a normal distribution
df = pd.DataFrame(x_values, columns=['var_name'])  # Converting array to pandas DataFrame
df.plot(kind='density')
plt.show()
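The pandas density plot can be tuned as well. As a hedged sketch, the plot.kde accessor accepts bw_method, which pandas forwards to SciPy's gaussian_kde (so SciPy must be installed), and ind, which sets the number of evaluation points. The seed and parameter values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)                        # fixed seed (illustrative)
df = pd.DataFrame({"var_name": rng.normal(10, 3, 300)})

# bw_method controls the smoothing; ind=200 evaluates the curve at
# 200 equally spaced points. The call returns a matplotlib Axes.
ax = df["var_name"].plot.kde(bw_method=0.5, ind=200)
ax.set_xlabel("var_name")
plt.show()
```

Because the call returns a matplotlib Axes, the usual methods such as set_xlabel or set_title can be used to style the plot.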

4. Using Seaborn `distplot`

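We can also use the seaborn distplot function to visualize the distribution of continuous numerical data. seaborn.distplot( ) requires a univariate data variable as input, which can be a pandas Series, a 1d-array, or a list. Note that distplot has been deprecated since seaborn 0.11 in favor of histplot and displot, so recent versions emit a warning when it is called.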

Some important arguments we can pass to seaborn.distplot( ) to tweak the plot according to our needs are:

hist : (Type – Bool) whether to plot a histogram or not.
kde : (Type – Bool) whether to plot a gaussian kernel density estimate.
bins : (Type – Number) specifying the number of bins in the histogram.
hist_kws : (Type – Dict) dict of Keyword arguments for matplotlib.axes.Axes.hist()
kde_kws : (Type – Dict) Keyword arguments for kdeplot() passed as a dictionary.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

data = np.random.normal(10, 3, 1000) #Generating data randomly from a normal distribution.

sb.set_style("whitegrid")  # Setting style(Optional)
plt.figure(figsize = (10,5)) #Specify the size of figure we want(Optional)
sb.distplot(data, bins=10, kde=True, color='teal',
            kde_kws=dict(linewidth=4, color='black'))
plt.show()

**Density Plot Using Seaborn `distplot`**

To learn more about seaborn distplot, you can refer to this article on seaborn distplots.

Conclusion

That brings us to the end of the article! We hope that you’ve learned a lot about different density plots today. You can read these articles to learn more about the Pandas and Matplotlib libraries that we’ve used in this article.