What is Kullback-Leibler Divergence in Machine Learning

Information theory is the study of how information is collected, stored, and transmitted. It relies on concepts such as entropy to communicate information efficiently.

Kullback-Leibler Divergence (KL Divergence) is one such measure used in information theory to quantify the difference between two probability distributions.

KL Divergence, also known as relative entropy, also goes by the names information divergence and information for discrimination. Originating from probability theory and information theory, it is used in data mining, data science, and machine learning.

Kullback-Leibler Divergence, a cornerstone concept in information theory, measures the difference between two probability distributions. This asymmetric, non-negative measure, often called relative entropy, plays a crucial role in data science and machine learning, especially in creating loss functions. Understanding KL Divergence helps in comparing how one distribution diverges from another, providing insights into data efficiency and model performance.

The objective of this post is to understand KL Divergence in detail.

Read about KL Divergence’s relative – Entropy

Introduction to Kullback Leibler Divergence

The KL Divergence is named after Solomon Kullback and Richard Leibler, who introduced the concept in their 1951 paper “On Information and Sufficiency”.

When I was researching KL Divergence, I was confused as to how it quantifies the distance between two probability distributions. It turns out that, although KL Divergence is often loosely called a metric, it does not measure “distance” in the literal sense; instead, it measures the difference in information between the two distributions.

To put it in simpler words, it differentiates between the two distributions by the information they contain.

Given two probability distributions P and Q, the KL divergence of these distributions is denoted as:

KL(P||Q)

Key Properties of KL Divergence

To better understand the KL Divergence, we can take a look at its properties.

Asymmetric: KL Divergence is not symmetric, i.e., in general KL(P||Q) ≠ KL(Q||P).
Non-Negative: The divergence is always greater than or equal to zero, and KL(P||Q) = 0 if and only if P = Q.
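Both properties can be checked numerically. The snippet below is a minimal sketch rather than a library implementation: the helper kl_divergence and the example values of p and q are made up for illustration, and the helper assumes that every entry is positive and that each list sums to 1.

```python
import math

def kl_divergence(p, q):
    """Discrete KL(P||Q) in nats.

    Assumes all entries are positive and each list sums to 1.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Two illustrative distributions (values chosen for this example only)
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

print("KL(P||Q):", kl_pq)                  # positive
print("KL(Q||P):", kl_qp)                  # positive, but a different value
print("KL(P||P):", kl_divergence(p, p))    # zero for identical distributions
```

Swapping the arguments changes the result, which is why the order in KL(P||Q) matters.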

Types of Kullback-Leibler Divergence

Probability distributions fall into two types: discrete and continuous. Accordingly, KL Divergence has two forms: a sum over outcomes for discrete distributions and an integral over densities for continuous ones.
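The two forms are commonly written as shown below. The symbols are assumptions made for illustration, since the post does not define them: P(x) and Q(x) are probability mass functions in the discrete case, and p(x) and q(x) are probability densities in the continuous case.

```latex
% Discrete distributions: sum over all outcomes x
\mathrm{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

% Continuous distributions: integral over the densities
\mathrm{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx
```

With the natural logarithm the result is expressed in nats; with a base-2 logarithm it is expressed in bits.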

Application in Machine Learning

If you are familiar with building models using Keras and PyTorch, the name KL Divergence may ring a bell. KL Divergence is used as a loss (error) function and is available in both libraries (for example, as keras.losses.KLDivergence and torch.nn.KLDivLoss).

Beyond that, KL Divergence is also used in adversarial training and GANs to measure how far the distribution a model produces is from the target distribution.
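As a rough illustration of the loss-function use described in this section, the sketch below computes a mean KL loss over a small batch in plain Python. It is not the Keras or PyTorch implementation: the helpers softmax and kl_loss, the eps smoothing constant, and the batch values are all hypothetical and serve only to show the idea.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_loss(targets, logits_batch, eps=1e-12):
    """Mean KL(target || prediction) over a batch, in nats.

    eps is a hypothetical smoothing constant that guards against log(0).
    """
    losses = []
    for target, logits in zip(targets, logits_batch):
        pred = softmax(logits)
        # Terms with a zero target probability contribute nothing
        loss = sum(t * math.log(t / max(p, eps))
                   for t, p in zip(target, pred) if t > 0)
        losses.append(loss)
    return sum(losses) / len(losses)

# Hypothetical batch: two examples, three classes each
targets = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 1.5]]
print("mean KL loss:", kl_loss(targets, logits))
```

The loss shrinks toward zero as the predicted distribution approaches the target, which is what makes it usable as a training objective.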

Calculating KL Divergence in Python

You can either define a function that takes two distributions and computes the KL Divergence according to the formula, or import a function from the SciPy library that does the work for you.
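The first option, a hand-written function, could look like the sketch below. The function name kl_divergence, its base parameter, and the example values are assumptions for illustration. The handling of zeros follows the usual convention: a term with p_i = 0 contributes nothing, and a term with p_i > 0 but q_i = 0 makes the divergence infinite.

```python
import math

def kl_divergence(p, q, base=math.e):
    """KL(P||Q) for two discrete distributions given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    if abs(sum(p) - 1.0) > 1e-9 or abs(sum(q) - 1.0) > 1e-9:
        raise ValueError("p and q must each sum to 1")

    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # 0 * log(0 / q_i) is taken to be 0
        if qi == 0:
            return math.inf     # P puts mass where Q has none
        total += pi * math.log(pi / qi, base)
    return total

# Illustrative distributions (each sums to 1)
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
print("KL(P||Q) in nats:", kl_divergence(p, q))
print("KL(P||Q) in bits:", kl_divergence(p, q, base=2))
```

The default base gives the result in nats, which is the unit a natural-logarithm implementation produces; passing base=2 gives bits instead.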

In the example below, we use the special module of the SciPy library to compute the KL Divergence between two distributions.

The function is called rel_entr (short for relative entropy). It returns the element-wise terms p·log(p/q); summing them gives the KL Divergence of the distribution as a whole.

# import NumPy for array math and rel_entr from SciPy's special module
import numpy as np
from scipy.special import rel_entr

# define p and q; the raw values do not sum to 1, so normalize them
# to obtain valid probability distributions
p = np.array([0.23, 0.78, 0.91, 0.86])
q = np.array([0.12, 0.57, 0.45, 0.34])
p = p / p.sum()
q = q / q.sum()

# element-wise terms p_i * log(p_i / q_i)
kl_terms = rel_entr(p, q)
print("Element-wise terms:", kl_terms)

# KL divergence of the whole distribution: sum of the element-wise terms
kldiv = np.sum(kl_terms)
print("KL(P||Q):", kldiv)

We first import rel_entr from SciPy along with NumPy, which is used to calculate the sum.

The probability distributions are stored in two variables, p and q. For the result to be a true KL Divergence, the entries of p and of q must each sum to 1. Calling rel_entr on p and q returns one term per element, which is then printed.

To calculate the divergence of the whole distribution, we apply np.sum to the output of rel_entr.

What if the two distributions are the same?

# what if p and q are the same?
# (the result is zero for any identical inputs, normalized or not)
p = [0.23, 0.57, 0.91, 0.86]
q = [0.23, 0.57, 0.91, 0.86]
kldiv = np.sum(rel_entr(p, q))
print("KL(P||Q):", kldiv)

As we have seen in the properties of the KL Divergence, this should result in the divergence being equal to zero.

Wrapping Up: The Role of KL Divergence in Information Theory

Although KL Divergence is often loosely described as a metric, it should not be confused with true distance metrics such as the Euclidean or Manhattan distance: it is not symmetric, and it does not literally measure the distance between distributions but rather quantifies how they differ in terms of the information they contain. This concept is widely used in information theory.

It is used in Machine Learning in the form of an error metric or a loss function.

Also known as relative entropy, the KL Divergence can be calculated using the rel_entr function from the SciPy library or directly from the formula.

References

KL Divergence – Wikipedia

Scipy’s relative entropy