A density plot is used to visualize the distribution of a continuous numerical variable in a dataset. It is also known as a kernel density plot.
It’s a good practice to know your data well before starting to apply any machine learning techniques to it.
As good ML practitioners, we should be asking questions like:
- What does our data look like?
- Is it normally distributed, or does it have a different shape?
- Do the algorithms we intend to apply to our data make any underlying assumptions about its distribution?
Addressing such questions right after we acquire our data can improve the results in later stages and save us a lot of time.
Plots like histograms and density plots give us ways to answer the questions mentioned above.
Why understand histograms before learning about density plots?
A density plot is closely analogous to a histogram. We visualize the shape of the distribution using a histogram. Histograms are created by binning the data and counting the number of observations in each bin. In a histogram, the y-axis usually denotes bin counts, but it can also show counts per unit, also called densities, scaled so that the total area of the bars is 1.
If we increase the number of bins in our histogram, and there is enough data to fill them, the shape of the distribution appears smoother.
Now, imagine a smooth continuous line passing through the top of each bin, creating an outline of the shape of our distribution. The result is what we call a density plot.
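The difference between bin counts and densities described above can be checked numerically. The following is a minimal sketch using NumPy's histogram function; the sample data, the seed, and the choice of 20 bins are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)             # seeded so the run is repeatable
data = rng.normal(10, 3, 1000)             # illustrative sample data

# Raw bin counts: how many observations fall in each bin
counts, edges = np.histogram(data, bins=20)

# Densities: counts divided by (number of observations * bin width)
densities, _ = np.histogram(data, bins=20, density=True)

widths = np.diff(edges)
print(counts.sum())                        # total number of observations
print((densities * widths).sum())          # total bar area, 1 up to rounding
```

Because the densities are scaled by the bin width, a histogram drawn this way can be compared directly with a density curve on the same axes.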
Understanding The Density Plot
We can think of density plots as smoothed histograms, which should be quite intuitive by now. Density plots mostly use a kernel density estimate, which produces a smoother picture of the distribution by smoothing out the noise.
Density plots are not affected by the number of bins, which is a major parameter for histograms, so they often give a clearer view of the distribution of our data. They do depend on a smoothing parameter, the bandwidth, which plays a similar role.
So in summary, a density plot is just like a histogram, but with a smooth curve drawn through the top of each bin.
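To make the idea of a kernel density estimate concrete, here is a minimal sketch that builds one by hand with NumPy: a small Gaussian bump is placed on every observation and the bumps are averaged. The function name gaussian_kde_manual, the bandwidth of 1.0, and the evaluation grid are all hypothetical choices for illustration, not part of any library.

```python
import numpy as np

def gaussian_kde_manual(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Mean over samples, rescaled so the curve integrates to 1
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
data = rng.normal(10, 3, 200)              # illustrative sample data
x_vals = np.linspace(-5, 25, 601)          # grid wide enough to cover the data
density = gaussian_kde_manual(data, x_vals, bandwidth=1.0)

# Riemann-sum check: the area under the estimated curve should be close to 1
area = density.sum() * (x_vals[1] - x_vals[0])
print(area)
```

A larger bandwidth widens each bump and gives a smoother curve, while a smaller one follows the individual observations more closely.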
Distributions come in several shapes in the wild. Some of the most common shapes that we are very likely to encounter are:
Density Plots with Python
We can draw a density plot in many ways using Python. Let’s look at a few commonly used methods.
1. Using Python scipy.stats module
The scipy.stats module provides the gaussian_kde class, which estimates the density of a given dataset.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import gaussian_kde

    data = np.random.normal(10, 3, 100)      # Generate data
    density = gaussian_kde(data)
    x_vals = np.linspace(0, 20, 200)         # Specifying the limits of our data

    density.covariance_factor = lambda: .5   # Smoothing parameter
    density._compute_covariance()            # Recompute with the new factor

    plt.plot(x_vals, density(x_vals))
    plt.show()
We override the covariance_factor function of the gaussian_kde object and try different values to control how smooth the plot is. Remember to call _compute_covariance after changing the function, otherwise the new factor is not applied.
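Since _compute_covariance is a private method, the same smoothing can also be requested through gaussian_kde's public interface. The sketch below assumes a scalar bw_method, which SciPy uses directly as the smoothing factor; the values 0.5 and 0.25 and the seed are illustrative only.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(10, 3, 100)              # illustrative sample data

# A scalar bw_method is used directly as the smoothing factor
density = gaussian_kde(data, bw_method=0.5)
print(density.factor)

# The factor can be changed later; set_bandwidth recomputes the covariance
density.set_bandwidth(bw_method=0.25)
print(density.factor)

x_vals = np.linspace(0, 20, 200)
y_vals = density(x_vals)                   # density evaluated on the grid
```

The resulting y_vals can be passed to plt.plot(x_vals, y_vals) exactly as in the example above.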
2. Using Seaborn
The seaborn module gives us an easier way to do the same task, with more flexibility.
    import numpy as np
    import seaborn as sb
    import matplotlib.pyplot as plt

    data = np.random.normal(10, 3, 300)  # Generating data

    plt.figure(figsize=(5, 5))
    # bw_method replaces the older bw argument (seaborn 0.11 and later)
    sb.kdeplot(data, bw_method=0.5, fill=True)
    plt.show()
For a one-dimensional density, kdeplot takes a univariate data array or a pandas Series object as its input. The bandwidth argument (bw_method in seaborn 0.11 and later, bw in older releases) plays the same role as covariance_factor of the gaussian_kde class demonstrated above. We can pass fill=False to leave the area under the curve unfilled and plot only the curve.
3. Using pandas plot function
The plot method of a pandas DataFrame can also draw density plots when we pass kind='density' as an argument.
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    x_values = np.random.normal(10, 3, 300)  # Generating data
    df = pd.DataFrame(x_values, columns=['var_name'])  # Converting array to pandas DataFrame
    df.plot(kind='density')
    plt.show()
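The pandas density plot accepts its own smoothing controls. As a minimal sketch, assuming SciPy and Matplotlib are installed (pandas relies on both for this plot), the bw_method and ind arguments set the bandwidth and the points at which the curve is evaluated; the values used here are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
series = pd.Series(rng.normal(10, 3, 300), name="var_name")  # illustrative data

# bw_method is forwarded to scipy's gaussian_kde; ind sets the evaluation points
ax = series.plot.density(bw_method=0.5, ind=np.linspace(0, 20, 200))
ax.set_xlabel("var_name")
```

Calling matplotlib.pyplot.show() afterwards displays the figure when running interactively.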
4. Using Seaborn distplot
We can also use the seaborn distplot function to visualize the distribution of continuous numerical data. Note that distplot has been deprecated since seaborn 0.11.
seaborn.distplot() requires a univariate data variable as its input, which can be a pandas Series, a 1-D array, or a list.
Some important arguments we can pass to seaborn.distplot() to tweak the plot according to our needs are:
- hist: (Type – Bool) whether to plot a histogram or not.
- kde: (Type – Bool) whether to plot a Gaussian kernel density estimate.
- bins: (Type – Number) the number of bins in the histogram.
- hist_kws: (Type – Dict) keyword arguments for matplotlib.axes.Axes.hist(), passed as a dictionary.
- kde_kws: (Type – Dict) keyword arguments for kdeplot(), passed as a dictionary.
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sb

    data = np.random.normal(10, 3, 1000)  # Generating data randomly from a normal distribution

    sb.set_style("whitegrid")             # Setting style (optional)
    plt.figure(figsize=(10, 5))           # Specify the size of the figure (optional)
    # Data is passed positionally so the call does not depend on the parameter name
    sb.distplot(data, bins=10, kde=True, color='teal',
                kde_kws=dict(linewidth=4, color='black'))
    plt.show()
To know more about seaborn distplot, you can refer to this article on seaborn Distplots.
That brings us to the end of the article! We hope that you’ve learned a lot about different density plots today. You can read these articles to learn more about the Pandas and Matplotlib libraries that we’ve used in this article.