Matplotlib Histogram from Basic to Advanced

Histograms and pie charts appear in newspapers almost every day, explaining stock prices, financial figures, or COVID-19 data. They help us visualize data at a glance and get a quick understanding of it. In this article we are going to learn about histograms, from the basics to more advanced plots, to help you with your data analysis or machine learning projects.

What is a histogram?

A histogram is a type of bar plot that represents the distribution of numerical data. The X-axis shows the bin ranges and the Y-axis shows the frequency. To build a histogram, the entire range of values is divided into a series of intervals (bins), and the number of values (the frequency) that fall into each interval is counted. The matplotlib.pyplot.hist() function helps us plot a histogram.
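
The binning step can be seen without drawing anything. The sketch below uses NumPy's np.histogram function to do the counting. The sample values and the choice of four bins are made up purely for illustration.

```python
import numpy as np

# Hypothetical sample data, chosen only to illustrate binning
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 7, 8, 9])

# Split the range [1, 9] into 4 equal-width bins and count the values in each
counts, bin_edges = np.histogram(data, bins=4)

# Each bin includes its left edge; only the last bin also includes its right edge
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count} values")
```

The counts always add up to the number of input values, and there is always one more bin edge than there are bins.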

What is the Matplotlib library in Python?

Matplotlib is one of the most commonly used data visualization libraries in Python. It is a great tool for both simple and complex visualizations.

Let us quickly take a look at the syntax of the matplotlib histogram function:

matplotlib.pyplot.hist(x, bins=None, range=None, density=False, weights=None, cumulative=False, bottom=None, histtype='bar', align='mid', orientation='vertical', rwidth=None, log=False, color=None, label=None, stacked=False)

Parameter	Description
x	The input data, given as an array or a sequence of arrays.
bins	The number of equal-width bins (if an integer) or the bin edges (if a sequence).
range	The lower and upper range of the bins. Values outside this range are ignored.
density	A boolean. If True, the histogram is normalized to a probability density, density = counts / (sum(counts) * np.diff(bins)), so that the area under the histogram is 1.
weights	An array of weights with the same shape as x. Each value contributes its weight to the bin count instead of 1.
cumulative	If True, each bin gives the count in that bin plus the counts of all bins for smaller values.
bottom	The location of the baseline of each bin.
histtype	The type of histogram to draw: 'bar', 'barstacked', 'step', or 'stepfilled'. The default is 'bar'.
align	The horizontal alignment of the bars relative to the bins: 'left', 'mid', or 'right'. The default is 'mid'.
orientation	Whether the histogram is plotted horizontally or vertically. The default is 'vertical'.
rwidth	The relative width of the bars as a fraction of the bin width.
color	The color, or a sequence of colors with one color per dataset.
label	The label (or labels, for multiple datasets) shown in the legend.
stacked	A boolean. If True, multiple datasets are stacked on top of each other. If False, multiple datasets are arranged side by side when histtype is 'bar', or drawn on top of each other when histtype is 'step'. The default is False.
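
The density and cumulative parameters are easier to follow with numbers. plt.hist() uses np.histogram to bin the data, so the sketch below checks the density formula from the table with NumPy alone. The seed, sample size, and bin count are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
values = rng.normal(loc=0.0, scale=1.0, size=1000)

counts, bins = np.histogram(values, bins=20)

# density=True rescales the counts so that the bar areas sum to 1
density_manual = counts / (counts.sum() * np.diff(bins))
density_numpy, _ = np.histogram(values, bins=20, density=True)
print(np.allclose(density_manual, density_numpy))  # True
print(np.sum(density_numpy * np.diff(bins)))       # total area, close to 1.0

# cumulative=True corresponds to a running total of the per-bin counts
cumulative_counts = np.cumsum(counts)
print(cumulative_counts[-1] == len(values))        # True
```

Note that with density=True the bar heights themselves do not sum to 1. Only the heights multiplied by the bin widths do.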

Importing Matplotlib and Necessary Libraries

We will import all the necessary libraries before we begin our histogram plotting. Let’s see how to import matplotlib and the other libraries we need.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Now let’s start with the very basic one and then we will move on to the advanced histogram plots.

Histogram with Basic Distribution

To create a basic histogram, we generate random data with the NumPy random function. The values follow a normal distribution, which we define by passing a mean (mu) and a standard deviation (sigma).

The hist() function takes the data and the number of bins. It returns three values: the height of each bin (n), the bin edges (bins), and the drawn bars (patches).

We have also passed parameters like density, facecolor, and alpha to make the histogram more presentable. You can play around with the number of bins, which also changes the bin width. We have passed the histogram type here as 'bar'.

The xlim and ylim functions set the minimum and maximum values for the X and Y axes, respectively. If you do not wish to have grid lines, pass False to the plt.grid function.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Using numpy random function to generate random data
np.random.seed(19685689)

mu, sigma = 120, 30
x = mu + sigma * np.random.randn(10000)

# passing the histogram function
n, bins, patches = plt.hist(x, 70, histtype='bar', density=True, facecolor='yellow', alpha=0.80)


plt.xlabel('Values')
plt.ylabel('Probability Distribution')
plt.title('Histogram showing Data Distribution')
plt.xlim(50, 180)
plt.ylim(0, 0.04)
plt.grid(True)
plt.show()
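
The three return values of hist() can be inspected directly. This is a minimal sketch that reuses the same mu and sigma as the listing above but draws a fresh sample with NumPy's Generator API. The seed and the variable names are arbitrary choices, not part of the original example.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)                 # arbitrary seed
sample = 120 + 30 * rng.standard_normal(10000)  # mu = 120, sigma = 30

fig, ax = plt.subplots()
n, bins, patches = ax.hist(sample, bins=70, density=True,
                           facecolor='yellow', alpha=0.80)

# 70 bin heights, 71 bin edges, and 70 bar patches
print(len(n), len(bins), len(patches))  # 70 71 70

# With density=True the total area of the bars is 1
print(np.sum(n * np.diff(bins)))
```

Call plt.show() afterwards to display the figure. Because xlim and ylim are not set here, the axes cover the full range of the data.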

Histogram Plots with Color Distribution

Coloring the bars of a histogram by their height is an excellent way to visualize how the values are spread across the range of your data. We will use the plt.subplots function for this type of plot. We have removed the axes spines and the x and y ticks to make the plot look more presentable. We have also added padding and gridlines to it.

For the color representation, we compute a fraction from the height of each bin, normalize these fractions, and use them to pick a color from a colormap, so that bars of different heights get different colors.

#importing the packages for colors 
from matplotlib import colors 
from matplotlib.ticker import PercentFormatter 
  
# Forming the dataset with numpy random function
np.random.seed(190345678) 
N_points = 100000
n_bins = 40
  
# Creating distribution 
x = np.random.randn(N_points) 
y = .10 ** x + np.random.randn(100000) + 25
legend = ['distribution'] 
  
# Passing subplot function
fig, axs = plt.subplots(1, 1, figsize =(10, 7),  tight_layout = True) 
  
  
# Removing axes spines  
for s in ['top', 'bottom', 'left', 'right']:  
    axs.spines[s].set_visible(False)  
  
# Removing x, y ticks 
axs.xaxis.set_ticks_position('none')  
axs.yaxis.set_ticks_position('none')  
    
# Adding padding between axes and labels  
axs.xaxis.set_tick_params(pad = 7)  
axs.yaxis.set_tick_params(pad = 15)  
  
# Adding x, y gridlines  
axs.grid(True, color ='pink',  linestyle ='-.', linewidth = 0.6,  alpha = 0.6)  
  
# Passing histogram function
N, bins, patches = axs.hist(x, bins = n_bins) 
  
# Setting the color 
fracs = ((N**(1 / 5)) / N.max()) 
norm = colors.Normalize(fracs.min(), fracs.max()) 
  
for thisfrac, thispatch in zip(fracs, patches): 
    color = plt.cm.viridis_r(norm(thisfrac)) 
    thispatch.set_facecolor(color) 
  
# Adding extra features for making it more presentable    
plt.xlabel("X-axis") 
plt.ylabel("y-axis") 
plt.legend(legend) 
plt.title('Customizing your own histogram') 

plt.show()
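
The listing above imports PercentFormatter without using it. One possible use, sketched here as an assumption rather than as the original author's intent, is to show each bar as a percentage of all values. The seed, the weights array, and the axis label are illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import PercentFormatter

rng = np.random.default_rng(1)  # arbitrary seed
data = rng.standard_normal(100000)

fig, ax = plt.subplots(figsize=(10, 7))

# Weight every value by 1/len(data) so bar heights are fractions of the total
weights = np.ones_like(data) / len(data)
heights, bins, patches = ax.hist(data, bins=40, weights=weights)

# Show the fractions as percentages; xmax=1 means a height of 1.0 is 100%
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1))
ax.set_ylabel("Share of values")
```

Unlike density=True, this weighting makes the bar heights themselves sum to 1, which is what a percentage axis needs.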

Histogram Plotting with Bars

This is a rather easy one to do. We create random data with the NumPy random function, call the hist() function, and set the histtype parameter. The code below passes 'step', which draws only the outline of each histogram. You can change the parameter to 'bar', 'barstacked', or 'stepfilled'.

np.random.seed(9**7) 
n_bins = 15
x = np.random.randn(10000, 5) 
    
colors = ['blue', 'pink', 'orange','green','red'] 
  
plt.hist(x, n_bins, density = True,  histtype ='step', color = colors, label = colors) 
  
plt.legend(prop ={'size': 10}) 
  
plt.show()
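
To compare the four histtype values side by side, a small grid of subplots can help. This is only a sketch: the three-column random data set, the placeholder labels, and the 2 x 2 layout are assumptions made for the illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)          # arbitrary seed
data = rng.standard_normal((10000, 3))  # three data sets, one per column
labels = ['a', 'b', 'c']                # placeholder names

hist_types = ['bar', 'barstacked', 'step', 'stepfilled']
fig, axes = plt.subplots(2, 2, figsize=(10, 7), sharex=True)

# Draw the same data once per histtype, one subplot each
for ax, hist_type in zip(axes.flat, hist_types):
    ax.hist(data, bins=15, density=True, histtype=hist_type,
            label=labels, alpha=0.7)
    ax.set_title(f"histtype='{hist_type}'")

axes[0, 0].legend(prop={'size': 10})
fig.tight_layout()
```

Call plt.show() to display the grid.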

KDE Plot and Histogram

Another interesting option is to plot a histogram together with a KDE. In this example, we plot a KDE (Kernel Density Estimation) curve along with a histogram with the help of the subplots function. A KDE plot is a smoothed estimate of the probability density of the data, showing how likely values are to fall in a given region. Together, a KDE plot and a histogram represent the probability distribution of the data. For this, we first create a data frame of random values, passing the mean to the loc parameter and the standard deviation to the scale parameter.

np.random.seed(9**7) 
n_bins = 15
x = np.random.randn(10000, 5) 

colors = ['blue', 'pink', 'orange','green','red'] 
  
plt.hist(x, n_bins, density = True,  histtype ='bar', color = colors, label = colors) 
  
plt.legend(prop ={'size': 10}) 
  
plt.show()
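
The listing above draws only the histogram bars. A KDE curve can be overlaid roughly as follows. This is a hedged sketch, not the original author's code: gaussian_kde_1d is a hypothetical helper written for this example with a rule-of-thumb bandwidth, and the loc, scale, and seed values are arbitrary. In practice scipy.stats.gaussian_kde or seaborn's kdeplot would be the usual tools.

```python
import matplotlib.pyplot as plt
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Rule-of-thumb bandwidth, assuming roughly normal data
        bandwidth = 1.06 * samples.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per sample, averaged at every grid point
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))


rng = np.random.default_rng(3)  # arbitrary seed
# loc is the mean and scale is the standard deviation of the sample
values = rng.normal(loc=50, scale=10, size=1000)

grid = np.linspace(0, 100, 400)
kde = gaussian_kde_1d(values, grid)

fig, ax = plt.subplots()
# density=True puts the histogram on the same scale as the KDE curve
ax.hist(values, bins=30, density=True, alpha=0.5, label='histogram')
ax.plot(grid, kde, color='red', label='KDE')
ax.legend()
```

The histogram must be drawn with density=True here. Otherwise the bars show raw counts and the KDE curve, whose area is 1, would look flat next to them.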

Histogram with Multiple Variables

In this example, we are using the “ramen-ratings” dataset to plot a histogram with multiple variables. We have assigned the star ratings of three different styles of ramen (Bowl, Cup, and Pack) to different variables. We have used the hist() function three times, once per style, to plot the normalized distribution of star ratings for each style.

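import pandas as pd
import matplotlib.pyplot as plt

# Adjust this path to wherever ramen-ratings.csv is stored on your machine
df = pd.read_csv("C://Users//Intel//Documents//ramen-ratings.csv")
df.head()

# Assumption: the Stars column may contain non-numeric entries,
# so convert it to numbers and drop rows that cannot be converted
df['Stars'] = pd.to_numeric(df['Stars'], errors='coerce')
df = df.dropna(subset=['Stars'])

# Star ratings for each ramen style
x1 = df.loc[df.Style=='Bowl', 'Stars']
x2 = df.loc[df.Style=='Cup', 'Stars']
x3 = df.loc[df.Style=='Pack', 'Stars']

# Shared settings; density=True normalizes each histogram to unit area
kwargs = dict(alpha=0.5, bins=60, density=True, stacked=False)

# Plotting the histogram
plt.hist(x1, **kwargs, histtype='stepfilled', color='b', label='Bowl')
plt.hist(x2, **kwargs, histtype='stepfilled', color='r', label='Cup')
plt.hist(x3, **kwargs, histtype='stepfilled', color='y', label='Pack')
plt.gca().set(title='Histogram of Probability of Ratings by Style', ylabel='Probability')
plt.xlim(2, 5)
plt.legend()
plt.show()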
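
The same plot can be tried without the CSV file. The sketch below builds a synthetic stand-in: only the column names Style and Stars come from the example above, while every value is invented. It also uses groupby and shared bin edges, which is one possible design, so that the three histograms are directly comparable.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)  # arbitrary seed
# Synthetic data: invented ratings, clipped to the 0 to 5 star range
df = pd.DataFrame({
    'Style': rng.choice(['Bowl', 'Cup', 'Pack'], size=600),
    'Stars': np.clip(rng.normal(loc=3.7, scale=0.8, size=600), 0, 5),
})

# Shared bin edges keep the three histograms directly comparable
bin_edges = np.linspace(0, 5, 21)

fig, ax = plt.subplots()
# One histogram per style, all drawn on the same axes
for style, group in df.groupby('Style'):
    ax.hist(group['Stars'], bins=bin_edges, density=True,
            alpha=0.5, histtype='stepfilled', label=style)

ax.set(title='Star ratings by style (synthetic data)',
       xlabel='Stars', ylabel='Density')
ax.legend()
```

Passing bins=60 to each call separately, as in the listing above, gives every series its own bin edges. Explicit edges avoid that.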

Two-Dimensional Histogram

A 2D histogram is another interesting way to visualize your data. We can plot one with the plt.hist2d function, and we can customize the plot and the bin size just as in the previous examples. Let’s look at a very simple example of a 2D histogram below.

import numpy as np
import matplotlib.pyplot as plt

# Generating random data
n = 1000
x = np.random.standard_normal(n)
y = 5.0 * x + 3.0 * np.random.standard_normal(n)

# plt.subplots returns both the figure and the axes
fig, ax = plt.subplots(figsize=(10, 7))

# Plotting 2D Histogram
plt.hist2d(x, y, bins=100)
plt.title("2D Histogram")

plt.show()
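
hist2d also returns values that are worth keeping. The sketch below captures them and adds a colorbar so the colors can be read as counts. The seed, the coarser 30 x 30 grid, and the labels are choices made for this illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(11)  # arbitrary seed
n = 1000
x = rng.standard_normal(n)
y = 5.0 * x + 3.0 * rng.standard_normal(n)

fig, ax = plt.subplots(figsize=(10, 7))

# hist2d returns the count grid, the bin edges on each axis, and the image
counts, xedges, yedges, image = ax.hist2d(x, y, bins=30)

# The colorbar maps colors back to the number of points per cell
fig.colorbar(image, ax=ax, label='Points per bin')
ax.set(title='2D Histogram', xlabel='x', ylabel='y')

print(counts.shape)       # (30, 30)
print(int(counts.sum()))  # 1000, every point falls in exactly one cell
```

With 1000 points, a 100 x 100 grid leaves most cells empty, so a coarser grid such as this one often reads better.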

Conclusion

In summary, we learned five different ways to plot and customize a histogram, and also how to create a histogram with multiple variables in a dataset. These methods will help you visualize your data in any data science project.