The Kullback-Leibler divergence is a measure of the dissimilarity between two probability distributions.
We are going to give two separate definitions of Kullback-Leibler (KL) divergence, one for discrete random variables and one for continuous variables.
Note that the summation is over the support of , so that we always have and , and, as a consequence, the natural logarithmis always well-defined.
The KL divergence measures how much the distribution defined by is dissimilar from the reference distribution defined by .
The definition for continuous random variables is analogous.
In order to be entirely rigorous the above definition should also specify that the sets must be measurable (see the lecture on probability for a definition of measurable set).
Property (1), which is called absolute continuity, requires that if the distribution associated to the density assigns a non-zero probability to a set , then also the distribution must assign a non-zero probability to that set. This requirement is analogous to that for discrete variables and ensures that is well-defined on all sets that have non-zero probability.
The next proposition states a fundamental property of the Kullback-Leibler divergence.
Proposition Let and be two probability mass functions and . If the two probability mass functions coincide, that is, if for all , then Otherwise, if they do not coincide, then
Let us first prove the equality part. If the two probability mass functions coincide, then for andWhen they do not coincide, then we havewhere: in step we have written the summation as an expected value with respect to the probability distribution of ; in step , we have used Jensen's inequality (the function is strictly convex in and the random variable is not constant because the two probability mass function do not coincide); in step we have used the fact that a sum of probabilities cannot be greater than 1.
A similar result holds for continuous variables.
Proposition Let and be two probability density functions such that their KL divergence is well-defined. If the two probability density function coincide almost surely, that is, if for all measurable sets , then Otherwise, if they do not coincide almost surely, then
The proof is analogous to that for discrete variables.
An often cited property of the KL divergence is that it is not symmetric, that is, in general there is no guarantee that
In fact, it is even possible that exists when is not well-defined: as you can check by looking at the definition of KL divergence, this happens when the support of is strictly included in the support of :
Since the Kullback-Leibler divergence is an information-theoretic concept and most of the students of probability and statistics are not familiar with information theory, they struggle to get an intuitive understanding of the reason why the KL divergence measures the dissimilarity of a probability distribution from a reference distribution. We provide an explanation that is entirely based on probabilistic concepts.
Suppose that and are two probability mass functions such that the KL divergence is well-defined.
Take a convex combination of the two distributionswhere .
By increasing we can make more and more similar to until, when , and coincide.
It is possible to prove that the KL divergence is convex (see Cover and Thomas 2006) and, as a consequence,
Thus, the higher is, the smaller becomes. In other words, the more is similar to , the smaller the Kullback-Leibler divergence becomes.
Cover, T. M., and J. A. Thomas (2006) " Elements of information theory", Wiley-Interscience.
Most of the learning materials found on this website are now available in a traditional textbook format.