Math for LLMs - Notes

KL Divergence

"The most important single quantity in information theory and in machine learning is the Kullback-Leibler divergence."

  • David MacKay, Information Theory, Inference, and Learning Algorithms, 2003

Overview

KL divergence - formally the Kullback-Leibler divergence, also called relative entropy or information gain - is the central tool for comparing probability distributions. Where entropy (Section 01) measures the intrinsic uncertainty of a single distribution, KL divergence measures how much one distribution diverges from another: how many extra bits you waste by coding data from $p$ using a code optimized for $q$.

This single quantity underlies an astonishing range of machine learning. Maximum likelihood estimation, the training objective of every language model ever trained, is equivalent to minimizing $D_{\mathrm{KL}}(p_{\mathrm{data}} \| p_{\boldsymbol{\theta}})$. The VAE training objective decomposes exactly as reconstruction loss plus a KL divergence term. RLHF's KL penalty, knowledge distillation's soft-target loss, and PPO's trust-region constraint are all KL divergences in disguise. Understanding KL divergence - its properties, its asymmetry, and the difference between its two directions - is prerequisite to understanding modern deep learning at a mechanistic level.

This section develops KL divergence from first principles. We prove non-negativity rigorously (Section 3.1), dissect the crucial asymmetry between forward and reverse KL (Section 4), derive closed-form expressions for Gaussians and exponential families (Section 5), explore generalizations to f-divergences and Renyi divergence (Section 7), and connect everything to the information-geometric structure of statistical manifolds (Section 8). Every result is grounded in concrete AI/ML applications with specific model and paper citations.

Scope note. This section is the canonical home for KL divergence. Closely related quantities - mutual information (Section 03), cross-entropy loss (Section 04), and Fisher information (Section 05) - each have their own dedicated sections and appear here only as brief previews with forward references.

Prerequisites

  • Shannon entropy - $H(X) = -\sum_x p(x)\log p(x)$, concavity, boundedness - Section 09-01 Entropy
  • Jensen's inequality - for convex $\phi$: $\phi(\mathbb{E}[X]) \le \mathbb{E}[\phi(X)]$ - Chapter 8 Section 01
  • Probability distributions - PMFs, PDFs, expectations, absolute continuity - Chapter 6
  • Logarithm rules - $\log(ab) = \log a + \log b$, change of base - Chapter 1
  • Convex functions - definition, second-derivative test, Bregman divergences (Section 8) - Chapter 8

Companion Notebooks

| Notebook | Description |
|---|---|
| theory.ipynb | Interactive derivations: Gibbs' inequality, forward vs reverse KL visualizations, Gaussian KL, f-divergences, VAE ELBO, RLHF policy geometry |
| exercises.ipynb | 10 graded exercises from non-negativity proofs to DPO implicit KL |

Learning Objectives

After completing this section, you will:

  1. Define $D_{\mathrm{KL}}(p \| q)$ for discrete and continuous distributions, state the support condition ($p \ll q$), and explain the coding-theoretic interpretation
  2. Prove non-negativity of KL divergence (Gibbs' inequality) via Jensen's inequality and state the equality condition
  3. Demonstrate the asymmetry $D_{\mathrm{KL}}(p\|q) \ne D_{\mathrm{KL}}(q\|p)$ with a concrete numerical example and explain why KL is not a metric
  4. Distinguish forward KL (mean-seeking, mass-covering) from reverse KL (mode-seeking, zero-forcing) and predict which behavior each direction produces
  5. Derive the closed-form KL between two Gaussians and apply it to compute the VAE encoder KL penalty
  6. State the chain rule for KL divergence and the data processing inequality and explain their ML implications
  7. Identify KL divergence as a Bregman divergence and explain the Pythagorean theorem for KL, I-projection, and M-projection
  8. Express $\alpha$-divergences, Jensen-Shannon divergence, Hellinger distance, and total variation as special cases of Csiszar f-divergences
  9. Prove that maximum likelihood estimation minimizes $D_{\mathrm{KL}}(p_{\mathrm{data}} \| p_{\boldsymbol{\theta}})$ and derive the RLHF optimal policy $\pi^* \propto \pi_{\mathrm{ref}}\, e^{r/\beta}$
  10. Explain the difference between forward and reverse KL in knowledge distillation and describe posterior collapse in VAEs and the beta-VAE fix
  11. Connect KL to cross-entropy via $H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q)$ and preview its relationship to mutual information and Fisher information


1. Intuition

1.1 What Is KL Divergence?

Suppose you are a weather forecaster. Every day, you predict a probability distribution over tomorrow's weather: 70% sunny, 20% cloudy, 10% rain. But the true distribution - what nature actually produces - is 50% sunny, 30% cloudy, 20% rain. How much has your forecast missed? Shannon entropy measures the uncertainty in a single distribution. KL divergence measures how much one distribution deviates from another.

Formally, $D_{\mathrm{KL}}(p \| q)$ is the expected extra number of bits (or nats, if using natural logarithms) you waste when you encode data actually generated from $p$ using a code optimized for $q$. If you design a Huffman code for $q$, the optimal code assigns short codewords to events that $q$ considers probable. But if the true distribution is $p$, your code will be suboptimal whenever $p(x) \ne q(x)$. The KL divergence precisely quantifies the average penalty:

$$D_{\mathrm{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

The ratio $p(x)/q(x)$ captures the mismatch event by event: events that $p$ considers much more likely than $q$ does (large $p/q$) waste the most bits. The expectation under $p$ reflects that the penalty is averaged over what actually happens.

Three equivalent ways to read $D_{\mathrm{KL}}(p \| q)$:

  1. Coding view: Expected excess code length when coding $p$-data with a $q$-optimal code.
  2. Surprise view: How much more surprised you are, on average over data from $p$, when you expect $q$ instead of $p$.
  3. Testing view: The expected log-likelihood ratio $\mathbb{E}_p[\log(p(X)/q(X))]$ - the average evidence per sample that data came from $p$ rather than $q$ in a hypothesis test.

All three views are the same formula. The richness of KL divergence comes from this triple identity: it simultaneously measures compression inefficiency, distributional mismatch, and statistical distinguishability.

For AI: When a language model $p_{\boldsymbol{\theta}}$ is trained on data drawn from $p_{\mathrm{data}}$, the negative log-likelihood loss $-\mathbb{E}_{x \sim p_{\mathrm{data}}}[\log p_{\boldsymbol{\theta}}(x)]$ equals $H(p_{\mathrm{data}}) + D_{\mathrm{KL}}(p_{\mathrm{data}} \| p_{\boldsymbol{\theta}})$. Since $H(p_{\mathrm{data}})$ is a constant w.r.t. $\boldsymbol{\theta}$, minimizing cross-entropy loss is identical to minimizing $D_{\mathrm{KL}}(p_{\mathrm{data}} \| p_{\boldsymbol{\theta}})$. Every gradient descent step that improves a language model is a step toward closing the KL gap between model and data distribution.
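
The decomposition is easy to confirm numerically. A minimal numpy sketch (ours, not from the companion notebooks), with arbitrary "data" and "model" distributions:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # data distribution
q = np.array([0.7, 0.2, 0.1])   # model distribution

cross_entropy = -np.sum(p * np.log(q))       # H(p, q): the NLL under the model
entropy       = -np.sum(p * np.log(p))       # H(p): irreducible part
kl            =  np.sum(p * np.log(p / q))   # D_KL(p || q): the reducible gap

assert np.isclose(cross_entropy, entropy + kl)
print(f"H(p,q) = {cross_entropy:.4f}, H(p) = {entropy:.4f}, KL = {kl:.4f}")
```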

1.2 The Asymmetry Insight

The defining feature that most surprises newcomers is asymmetry: $D_{\mathrm{KL}}(p \| q)$ and $D_{\mathrm{KL}}(q \| p)$ are generally different, and this is not a flaw - it is a deeply meaningful distinction that determines which algorithm you should use.

To see why asymmetry arises, consider what each direction penalizes:

  • $D_{\mathrm{KL}}(p \| q)$: the expectation is under $p$. Events where $p(x) > 0$ but $q(x) \approx 0$ contribute hugely (the ratio $p/q$ blows up). This is called forward KL or inclusive KL: if $p$ puts mass somewhere, $q$ must also put mass there. $q$ must cover all of $p$'s support. Minimizing this forces $q$ to be "mean-seeking" - to match the whole spread of $p$, even when $p$ is multimodal.

  • $D_{\mathrm{KL}}(q \| p)$: the expectation is now under $q$. Events where $q(x) > 0$ but $p(x) \approx 0$ contribute hugely. This is called reverse KL or exclusive KL: if $q$ puts mass somewhere, $p$ must also put mass there - so $q$ avoids regions where $p$ is small. Minimizing this forces $q$ to be "mode-seeking" - to concentrate on one mode of $p$ and ignore the others.

Concrete example: Let $p$ be a bimodal mixture $0.5\,\mathcal{N}(-3,1) + 0.5\,\mathcal{N}(3,1)$ and let $q = \mathcal{N}(\mu, \sigma^2)$ be a unimodal Gaussian we fit to $p$:

  • Minimizing $D_{\mathrm{KL}}(p \| q)$: $q$ must cover both modes, so $q^* \approx \mathcal{N}(0, 10)$ - wide, centered between the modes, placing mass in low-probability regions.
  • Minimizing $D_{\mathrm{KL}}(q \| p)$: $q$ collapses onto one mode, e.g., $q^* \approx \mathcal{N}(3, 1)$, ignoring the other mode entirely.

Neither approximation is "wrong" - they answer different questions. Which direction to use is one of the most important design decisions in probabilistic ML.

For AI: Variational inference for Bayesian neural networks minimizes reverse KL $D_{\mathrm{KL}}(q\|p)$ - the variational posterior $q$ collapses to one mode of the intractable true posterior $p$. This is why Bayesian deep learning with mean-field VI tends to underestimate uncertainty (it ignores alternative modes). Maximum likelihood training of generative models minimizes forward KL $D_{\mathrm{KL}}(p_{\mathrm{data}}\|p_{\boldsymbol{\theta}})$ - forcing the model to cover the entire data distribution, at the cost of sometimes generating blurry or average-looking samples.

1.3 Why KL Divergence Matters for AI

KL divergence is not one tool among many - it is the unifying framework behind most of the training objectives and alignment techniques in modern AI:

| Application | Which KL | How |
|---|---|---|
| Language model training (GPT, LLaMA, Claude) | $D_{\mathrm{KL}}(p_{\mathrm{data}} \Vert p_{\boldsymbol{\theta}})$ | Minimizing cross-entropy = minimizing forward KL |
| VAE regularizer (Kingma & Welling, 2014) | $D_{\mathrm{KL}}(q_\phi(\mathbf{z}\mid\mathbf{x}) \Vert p(\mathbf{z}))$ | Reverse KL keeps encoder close to Gaussian prior |
| RLHF KL penalty (Christiano et al., 2017) | $D_{\mathrm{KL}}(\pi_{\boldsymbol{\theta}} \Vert \pi_{\mathrm{ref}})$ | Prevents reward hacking; preserves language coherence |
| PPO trust region (Schulman et al., 2017) | Approx. $D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_{\mathrm{new}})$ | Clip objective approximates KL constraint |
| DPO (Rafailov et al., 2023) | Implicit KL | Closed-form solution has same fixed point as RLHF |
| Knowledge distillation (Hinton et al., 2015) | $D_{\mathrm{KL}}(p_{\mathrm{teacher}} \Vert p_{\mathrm{student}})$ | Forward KL to match teacher's full soft distribution |
| Variational inference (VI) | $D_{\mathrm{KL}}(q \Vert p)$ | Reverse KL for tractable posterior approximation |
| Normalizing flows (Rezende & Mohamed, 2015) | $D_{\mathrm{KL}}(p_{\mathrm{data}} \Vert p_{\boldsymbol{\theta}})$ | Forward KL via exact log-likelihood |
| Diffusion models (Ho et al., 2020) | KL at each reverse step | ELBO decomposes as sum of per-step KL terms |
| Differential privacy (Renyi DP) | Renyi divergence | Renyi divergence bounds privacy loss composition |

The breadth of this table reflects a deep fact: KL divergence is the natural measure of distributional distance when you care about expected log-probability ratios, which is exactly what gradient-based optimization of probabilistic models does.

1.4 Historical Timeline

KL DIVERGENCE - KEY MILESTONES


  1951  Kullback & Leibler, "On Information and Sufficiency"
        Introduce D_KL as "information for discrimination" I(1:2)
        Prove non-negativity; connect to sufficiency of statistics

  1959  Kullback, "Information Theory and Statistics"
        Systematic treatment; connection to likelihood ratio tests

  1967  Csiszar, "Information-Type Measures of Difference"
        Generalizes to f-divergences; unifies KL, Hellinger, chi-squared

  1952 / 1972  Chernoff; Blahut, Arimoto
        Exponential decay of error rates in hypothesis testing
        Connection to Renyi divergence

  1985  Amari, "Differential-Geometric Methods in Statistics"
        Information geometry: KL as non-symmetric Bregman divergence
        Exponential and mixture geodesics; natural gradient

  1993  Hinton & van Camp, "Keeping Neural Networks Simple"
        Minimum Description Length -> KL regularization

  2013  Kingma & Welling, "Auto-Encoding Variational Bayes"
        VAE: ELBO = reconstruction - D_KL(q_phi || p)
        Reparameterization trick for gradient through KL

  2015  Hinton, Vinyals & Dean, "Distilling the Knowledge"
        Forward KL with temperature-softened teacher distribution

  2017  Christiano et al., "Deep Reinforcement Learning from Human Preferences"
        KL penalty in RLHF; controls deviation from reference policy

  2017  Schulman et al., "Proximal Policy Optimization"
        PPO-clip approximates KL trust region constraint

  2023  Rafailov et al., "Direct Preference Optimization"
        DPO: closed-form KL-constrained policy optimization

  2024+  All frontier LLMs (GPT-4, Claude, Gemini, LLaMA-3)
         Trained with cross-entropy (= forward KL) + RLHF KL penalty


2. Formal Definitions

2.1 Discrete KL Divergence

Definition (Kullback-Leibler Divergence - discrete). Let $p$ and $q$ be probability mass functions on a countable alphabet $\mathcal{X}$. The KL divergence from $q$ to $p$ (also called the relative entropy of $p$ with respect to $q$) is:

$$D_{\mathrm{KL}}(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$$

where:

  • The logarithm is natural (nats) by convention in this curriculum, matching the NOTATION_GUIDE. To convert to bits, divide by $\ln 2$.
  • Boundary conventions: $0 \log(0/q) = 0$ for any $q \ge 0$ (by continuity: $\lim_{p \to 0^+} p \log p = 0$); $p \log(p/0) = +\infty$ for $p > 0$ (code length is infinite if you cannot encode a symbol at all).
  • $D_{\mathrm{KL}}(p \| q) = +\infty$ whenever there exists $x$ with $p(x) > 0$ and $q(x) = 0$.

The notation convention in this repository (following Cover & Thomas 2006 and NOTATION_GUIDE Section 6): $D_{\mathrm{KL}}(p \| q)$ reads as "KL divergence from $q$ to $p$." That is, $p$ is the reference (true) distribution and $q$ is the approximation. Some sources use the opposite convention; when reading papers, always check which argument is which.

Standard examples:

Example 2.1 (Binary distributions): Let $p = (\theta, 1-\theta)$ and $q = (q_0, 1-q_0)$ on $\{0,1\}$. Then:

$$D_{\mathrm{KL}}(p \| q) = \theta \ln\frac{\theta}{q_0} + (1-\theta)\ln\frac{1-\theta}{1-q_0}$$

This is the log-likelihood ratio test statistic for a Bernoulli parameter hypothesis test.

Example 2.2 (Weather forecast): True distribution $p = (0.5, 0.3, 0.2)$ over {sunny, cloudy, rain}; forecast $q = (0.7, 0.2, 0.1)$:

$$D_{\mathrm{KL}}(p \| q) = 0.5\ln\frac{0.5}{0.7} + 0.3\ln\frac{0.3}{0.2} + 0.2\ln\frac{0.2}{0.1} \approx -0.168 + 0.122 + 0.139 \approx 0.093 \text{ nats}$$

The model wastes approximately 0.093 nats per observation. In the reverse direction: $D_{\mathrm{KL}}(q \| p) \approx 0.085$ nats - different but close, because the two distributions are not very far apart.
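
A quick numpy check of Example 2.2 (a sketch of ours, not from the companion notebooks), computing both directions:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions with full support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.5, 0.3, 0.2]   # truth
q = [0.7, 0.2, 0.1]   # forecast
print(kl(p, q))       # ~0.0920 nats
print(kl(q, p))       # ~0.0851 nats: close to, but not equal to, the forward value
```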

Example 2.3 (Categorical logits): In a classification model with softmax output $q(k) = \operatorname{softmax}(\mathbf{z})_k$ and one-hot label $p(k) = \mathbb{1}[k = y]$:

$$D_{\mathrm{KL}}(p \| q) = -\log q(y) = -z_y + \log\sum_k e^{z_k}$$

This is exactly the cross-entropy loss. (The entropy $H(p) = 0$ for one-hot $p$, so $H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q) = D_{\mathrm{KL}}(p\|q)$.)
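
The identity is easy to confirm numerically; a minimal sketch (logits and label chosen arbitrarily):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()                        # stabilize against overflow
    return z - np.log(np.exp(z).sum())

z = np.array([2.0, -1.0, 0.5])             # logits
y = 0                                      # true class
p = np.eye(3)[y]                           # one-hot target

log_q = log_softmax(z)
# 0 * log 0 is taken as 0, implemented here by clipping before the log
kl = np.sum(p * (np.log(np.clip(p, 1e-12, None)) - log_q))
assert np.isclose(kl, -log_q[y])           # KL equals the cross-entropy loss
```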

2.2 Continuous KL Divergence

Definition (KL divergence - continuous). Let $p$ and $q$ be probability density functions on $\mathbb{R}^d$ with $p$ absolutely continuous with respect to $q$ (written $p \ll q$, meaning $q(\mathbf{x}) = 0 \Rightarrow p(\mathbf{x}) = 0$ a.e.). Then:

$$D_{\mathrm{KL}}(p \| q) = \int_{\mathbb{R}^d} p(\mathbf{x}) \log \frac{p(\mathbf{x})}{q(\mathbf{x})} \, d\mathbf{x} = \mathbb{E}_{\mathbf{x} \sim p}\!\left[\log \frac{p(\mathbf{x})}{q(\mathbf{x})}\right]$$

The absolute continuity condition $p \ll q$ is essential: it ensures the Radon-Nikodym derivative $dp/dq$ exists, making the ratio $p(\mathbf{x})/q(\mathbf{x})$ well-defined a.e. under $p$. When $p \not\ll q$, $D_{\mathrm{KL}}(p\|q) = +\infty$ by convention.

Measure-theoretic form. Using the Radon-Nikodym derivative $\frac{dp}{dq}$, the general definition that works for both discrete and continuous cases:

$$D_{\mathrm{KL}}(p \| q) = \int \log \frac{dp}{dq} \, dp = \mathbb{E}_p\!\left[\log \frac{dp}{dq}\right]$$

This unified form shows that discreteness vs continuity is just a choice of dominating measure.

2.3 Relative Entropy Interpretation

The phrase "relative entropy" is justified by a beautiful coding theorem. Suppose X1,X2,X_1, X_2, \ldots are i.i.d. from pp. You observe nn samples and design a code using distribution qq. The optimal code assigns length logq(x)-\log q(x) to symbol xx (by Shannon's source coding theorem). The expected code length for one symbol is Ep[logq(X)]=H(p,q)\mathbb{E}_p[-\log q(X)] = H(p, q) - the cross-entropy. But the optimal code for pp has expected length H(p)H(p). The excess code length per symbol is:

H(p,q)H(p)=xp(x)logq(x)+xp(x)logp(x)=xp(x)logp(x)q(x)=DKL(pq)H(p, q) - H(p) = -\sum_x p(x)\log q(x) + \sum_x p(x)\log p(x) = \sum_x p(x)\log\frac{p(x)}{q(x)} = D_{\mathrm{KL}}(p \| q)

This is exactly KL divergence. The relative entropy DKL(pq)D_{\mathrm{KL}}(p\|q) measures how many extra nats/bits per symbol you pay for using the wrong code qq when the truth is pp. It is relative because it measures information relative to the reference distribution pp.

Information gain interpretation. In Bayesian inference, if pp is the posterior and qq is the prior, then DKL(pq)D_{\mathrm{KL}}(p\|q) is the information gain from the prior to the posterior - the reduction in uncertainty after observing data. The Kullback-Leibler 1951 paper used the term "information for discrimination": how much information the data provides for discriminating between pp and qq.

2.4 Non-Examples and Edge Cases

Understanding when KL divergence is not well-behaved clarifies the definition.

Non-example 2.1 (Mismatched support, $D_{\mathrm{KL}} = +\infty$): Let $p = \mathcal{N}(0, 1)$ and $q = \mathcal{U}(-1, 1)$ (uniform on $[-1,1]$). Since $q(x) = 0$ for $|x| > 1$ but $p(x) > 0$ for all $x$, we have $p \not\ll q$ and $D_{\mathrm{KL}}(p\|q) = +\infty$. However $D_{\mathrm{KL}}(q\|p) < \infty$ since $p(x) > 0$ everywhere, so $q \ll p$. This illustrates the extreme asymmetry that can occur when supports differ.

Non-example 2.2 (Zero KL means equality only almost everywhere): $D_{\mathrm{KL}}(p\|q) = 0$ implies $p(x) = q(x)$ for $p$-almost all $x$; the two distributions can still differ on a set of $p$-measure zero. In practice, if $D_{\mathrm{KL}} = 0$ numerically, $p$ and $q$ are identical where it matters.

Non-example 2.3 (Equal KL values do not imply related distributions): $D_{\mathrm{KL}}(p\|q) = D_{\mathrm{KL}}(r\|s) = 0.5$ nats gives no useful information about the relationship between $(p,q)$ and $(r,s)$. KL is an absolute quantity for a specific pair, not a geometry with reference points.

Non-example 2.4 (Swapped pairs can hide the asymmetry): For $p = (0.9, 0.1)$ and $q = (0.1, 0.9)$:

$$D_{\mathrm{KL}}(p\|q) = 0.9\ln\frac{0.9}{0.1} + 0.1\ln\frac{0.1}{0.9} = 0.9\ln 9 + 0.1\ln(1/9) \approx 1.758 \text{ nats}$$

By symmetry of this example, $D_{\mathrm{KL}}(q\|p) = D_{\mathrm{KL}}(p\|q) \approx 1.758$ nats - equal only because $p$ and $q$ are exact reverses of each other. For generic asymmetric distributions, the two values differ.

3. Properties of KL Divergence

3.1 Non-Negativity: Gibbs' Inequality

Theorem (Gibbs' Inequality). For any two probability distributions $p$ and $q$ on the same alphabet:

$$D_{\mathrm{KL}}(p \| q) \ge 0$$

with equality if and only if $p(x) = q(x)$ for all $x$ (or $p$-almost all $x$ in the continuous case).

Proof via Jensen's inequality. The function $f(t) = -\ln t$ is strictly convex on $(0, \infty)$ (its second derivative $1/t^2 > 0$). Equivalently, $\ln t$ is strictly concave. By Jensen's inequality applied to the concave function $\ln$:

$$-D_{\mathrm{KL}}(p \| q) = \sum_x p(x)\ln\frac{q(x)}{p(x)} \le \ln\!\left(\sum_x p(x) \cdot \frac{q(x)}{p(x)}\right) = \ln\!\left(\sum_x q(x)\right) = \ln 1 = 0$$

Therefore $D_{\mathrm{KL}}(p\|q) \ge 0$. Equality holds in Jensen's inequality for a strictly concave function iff the argument is constant, i.e., $q(x)/p(x) = c$ for all $x$ with $p(x) > 0$. Since both $p$ and $q$ sum to 1, this forces $c = 1$ and hence $p = q$ everywhere. $\square$

Alternative proof via $\ln t \le t - 1$. The inequality $\ln t \le t - 1$ (with equality iff $t = 1$) implies:

$$-D_{\mathrm{KL}}(p \| q) = \sum_x p(x)\ln\frac{q(x)}{p(x)} \le \sum_x p(x)\left(\frac{q(x)}{p(x)} - 1\right) = \sum_x q(x) - \sum_x p(x) = 1 - 1 = 0$$

This proof is elementary (it needs no Jensen's inequality) and directly shows the equality condition.
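
A brute-force sanity check of Gibbs' inequality over random distribution pairs (our sketch; Dirichlet samples are strictly positive almost surely, so the ratios are safe):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    assert np.sum(p * np.log(p / q)) >= 0.0        # D_KL(p || q) >= 0
assert np.isclose(np.sum(p * np.log(p / p)), 0.0)  # equality when q = p
```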

Corollary (Entropy upper bound). Taking $q = u$ (uniform on $\mathcal{X}$ with $|\mathcal{X}| = n$):

$$0 \le D_{\mathrm{KL}}(p \| u) = \sum_x p(x)\ln\frac{p(x)}{1/n} = \ln n - H(p)$$

Therefore $H(p) \le \ln n = H(u)$. Gibbs' inequality is the reason entropy is maximized at the uniform distribution (proven rigorously in Section 09-01).

For AI: Non-negativity of KL underpins the ELBO. The VAE lower bound comes from:

$$\log p(\mathbf{x}) = \mathcal{L}(\boldsymbol{\phi}, \boldsymbol{\theta}; \mathbf{x}) + D_{\mathrm{KL}}(q_\phi(\mathbf{z}\mid\mathbf{x}) \| p_\theta(\mathbf{z}\mid\mathbf{x}))$$

Since $D_{\mathrm{KL}} \ge 0$, we have $\log p(\mathbf{x}) \ge \mathcal{L}$ - the ELBO is indeed a lower bound on the log-evidence. This is the fundamental inequality that makes variational inference tractable.

3.2 Asymmetry

$D_{\mathrm{KL}}(p \| q) \ne D_{\mathrm{KL}}(q \| p)$ in general. KL divergence is not a distance in the metric sense. We have already seen this intuitively; here we quantify it.

Example 3.2: $p = (0.8, 0.1, 0.1)$, $q = (0.1, 0.8, 0.1)$:

$$D_{\mathrm{KL}}(p \| q) = 0.8\ln 8 + 0.1\ln(1/8) + 0.1\ln 1 = 0.8(2.079) + 0.1(-2.079) + 0 = 1.455 \text{ nats}$$
$$D_{\mathrm{KL}}(q \| p) = 0.1\ln(0.1/0.8) + 0.8\ln 8 + 0.1\ln 1 = 0.1(-2.079) + 0.8(2.079) + 0 = 1.455 \text{ nats}$$

In this symmetric case they are equal - an artifact of $q$ being $p$ with two entries swapped. The same happens for $p = (0.9, 0.05, 0.05)$, $q = (0.05, 0.9, 0.05)$:

$$D_{\mathrm{KL}}(p \| q) = 0.9\ln 18 + 0.05\ln(0.05/0.9) + 0.05\ln 1 \approx 2.601 - 0.145 + 0 = 2.457 \text{ nats}$$
$$D_{\mathrm{KL}}(q \| p) = 0.05\ln(0.05/0.9) + 0.9\ln 18 + 0.05\ln 1 \approx -0.145 + 2.601 + 0 = 2.457 \text{ nats}$$

Both happen to be equal again due to the symmetric structure. The same holds even for the extreme swapped pair $p = (0.99, 0.01)$, $q = (0.01, 0.99)$:

$$D_{\mathrm{KL}}(p\|q) = 0.99\ln 99 + 0.01\ln(0.01/0.99) \approx 0.99(4.595) - 0.01(4.595) = 4.503 \text{ nats}$$

and $D_{\mathrm{KL}}(q\|p) = D_{\mathrm{KL}}(p\|q) \approx 4.503$ nats (again by the swap symmetry of this particular pair).

For a genuinely asymmetric example: $p = (0.9, 0.1)$, $q = (0.5, 0.5)$:

$$D_{\mathrm{KL}}(p\|q) = 0.9\ln(1.8) + 0.1\ln(0.2) = 0.9(0.588) + 0.1(-1.609) \approx 0.368 \text{ nats}$$
$$D_{\mathrm{KL}}(q\|p) = 0.5\ln(5/9) + 0.5\ln 5 = 0.5(-0.588) + 0.5(1.609) \approx 0.511 \text{ nats}$$

So $D_{\mathrm{KL}}(p\|q) = 0.368 \ne 0.511 = D_{\mathrm{KL}}(q\|p)$.

Jeffreys' symmetrized KL. Harold Jeffreys (1946) proposed the symmetric version:

$$J(p, q) = \frac{1}{2}\left[D_{\mathrm{KL}}(p \| q) + D_{\mathrm{KL}}(q \| p)\right]$$

This satisfies $J(p,q) = J(q,p) \ge 0$ and $J(p,q) = 0 \Leftrightarrow p = q$, but still fails the triangle inequality. Jeffreys divergence is used in some applications where symmetry is important. The Jensen-Shannon divergence (Section 7.4) is a better-behaved symmetric variant that is also bounded.

3.3 Failure of Triangle Inequality

KL divergence is not a metric. Specifically, the triangle inequality $D_{\mathrm{KL}}(p\|r) \le D_{\mathrm{KL}}(p\|q) + D_{\mathrm{KL}}(q\|r)$ does not hold in general.

Counterexample. Let $p = (1, 0)$, $q = (0.5, 0.5)$, $r = (0, 1)$ on $\{0, 1\}$:

$$D_{\mathrm{KL}}(p\|r) = 1 \cdot \ln(1/0) = +\infty$$
$$D_{\mathrm{KL}}(p\|q) = 1 \cdot \ln(1/0.5) = \ln 2 \approx 0.693 \text{ nats}$$
$$D_{\mathrm{KL}}(q\|r) = 0.5\ln(0.5/0) + 0.5\ln(0.5/1) = +\infty$$

In this degenerate case $D_{\mathrm{KL}}(p\|r) = +\infty$ and $D_{\mathrm{KL}}(p\|q) + D_{\mathrm{KL}}(q\|r) = +\infty$, so the triangle inequality "holds" trivially. For a cleaner counterexample with all finite values: $p = (0.9, 0.1)$, $q = (0.5, 0.5)$, $r = (0.1, 0.9)$:

$$D_{\mathrm{KL}}(p\|r) = 0.9\ln 9 + 0.1\ln(1/9) \approx 1.758 \text{ nats}$$
$$D_{\mathrm{KL}}(p\|q) + D_{\mathrm{KL}}(q\|r) \approx 0.368 + 0.511 = 0.879 \text{ nats}$$

Since $1.758 > 0.879$, the triangle inequality is violated. The reason: KL "skips" over intermediate distributions in a fundamentally non-Euclidean way.

3.4 Joint Convexity

Theorem. $D_{\mathrm{KL}}(p \| q)$ is jointly convex in the pair $(p, q)$: for any $\lambda \in [0,1]$,

$$D_{\mathrm{KL}}(\lambda p_1 + (1-\lambda)p_2 \| \lambda q_1 + (1-\lambda)q_2) \le \lambda D_{\mathrm{KL}}(p_1 \| q_1) + (1-\lambda) D_{\mathrm{KL}}(p_2 \| q_2)$$

Proof sketch. The perspective function of a convex function $f$ is jointly convex: if $f$ is convex, then $g(x, t) = tf(x/t)$ is jointly convex in $(x, t)$ for $t > 0$. Taking $f(u) = u\log u$ (which is convex since $(u\log u)'' = 1/u > 0$), the perspective is $t \cdot (x/t)\log(x/t) = x\log(x/t)$. Summing over $x$ gives $D_{\mathrm{KL}}(p\|q) = \sum_x p(x)\log(p(x)/q(x))$, which is jointly convex in $(p, q)$ as a sum of perspective functions.

Consequence for the EM algorithm. The EM algorithm alternates between computing the expected log-likelihood (E-step) and maximizing it (M-step). Each M-step minimizes a KL divergence. Joint convexity guarantees that this alternating minimization makes well-defined progress toward a stationary point.

Convexity in $p$ alone. $D_{\mathrm{KL}}(p\|q)$ is also convex in $p$ for fixed $q$ (since it equals a sum of convex functions $p(x)\log p(x) - p(x)\log q(x)$).

Convexity in $q$ alone. $D_{\mathrm{KL}}(p\|q) = -\sum_x p(x)\log q(x) + \text{const}$ for fixed $p$. This is convex in $q$ because $-\log q(x)$ is convex and $p(x) \ge 0$. This means: for fixed $p$, minimizing over $q$ is a convex optimization problem with a unique minimum at $q = p$.

3.5 Chain Rule for KL Divergence

Theorem (Chain Rule). For joint distributions $P(X,Y)$ and $Q(X,Y)$:

$$D_{\mathrm{KL}}(P(X,Y) \| Q(X,Y)) = D_{\mathrm{KL}}(P_X \| Q_X) + \mathbb{E}_{x \sim P_X}\!\left[D_{\mathrm{KL}}(P_{Y|X=x} \| Q_{Y|X=x})\right]$$

where $P_X$ is the marginal of $P$ over $X$, and $P_{Y|X}$, $Q_{Y|X}$ are the conditional distributions.

Proof.

$$D_{\mathrm{KL}}(P \| Q) = \sum_{x,y} P(x,y)\log\frac{P(x,y)}{Q(x,y)}$$

Using $P(x,y) = P(x)P(y|x)$ and $Q(x,y) = Q(x)Q(y|x)$:

$$= \sum_{x,y} P(x,y)\log\frac{P(x)P(y|x)}{Q(x)Q(y|x)} = \sum_{x,y} P(x,y)\log\frac{P(x)}{Q(x)} + \sum_{x,y} P(x,y)\log\frac{P(y|x)}{Q(y|x)}$$
$$= \sum_x P(x)\log\frac{P(x)}{Q(x)} + \sum_x P(x)\sum_y P(y|x)\log\frac{P(y|x)}{Q(y|x)}$$
$$= D_{\mathrm{KL}}(P_X \| Q_X) + \mathbb{E}_{x \sim P_X}\!\left[D_{\mathrm{KL}}(P_{Y|X=x} \| Q_{Y|X=x})\right] \quad \square$$
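
The decomposition is easy to verify numerically; a minimal sketch of ours with random joint tables:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(12)).reshape(3, 4)   # joint P(x, y)
Q = rng.dirichlet(np.ones(12)).reshape(3, 4)   # joint Q(x, y)

Px, Qx = P.sum(axis=1), Q.sum(axis=1)          # marginals over X
P_y_x = P / Px[:, None]                        # conditionals P(y | x)
Q_y_x = Q / Qx[:, None]

joint_kl    = np.sum(P * np.log(P / Q))
marginal_kl = np.sum(Px * np.log(Px / Qx))
cond_kl     = np.sum(Px * np.sum(P_y_x * np.log(P_y_x / Q_y_x), axis=1))

assert np.isclose(joint_kl, marginal_kl + cond_kl)   # chain rule holds
```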

For AI: In a hierarchical generative model $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{z})\, p(\mathbf{x}\mid\mathbf{z})$, the chain rule decomposes the KL between the variational distribution and true posterior into a prior KL plus an expected conditional KL. This is exactly the ELBO decomposition:

$$D_{\mathrm{KL}}(q(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}|\mathbf{x})) = D_{\mathrm{KL}}(q(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z})) - \mathbb{E}_{q}\left[\log p(\mathbf{x}|\mathbf{z})\right] + \log p(\mathbf{x})$$

Rearranging: $\log p(\mathbf{x}) = \underbrace{\mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})] - D_{\mathrm{KL}}(q\|p)}_{\text{ELBO}} + D_{\mathrm{KL}}(q(\mathbf{z}|\mathbf{x})\|p(\mathbf{z}|\mathbf{x})) \ge \text{ELBO}$.

3.6 Data Processing Inequality

Theorem (Data Processing Inequality for KL). Let $T: \mathcal{X} \to \mathcal{Y}$ be any (possibly randomized) function. Let $p_T$ and $q_T$ be the distributions of $T(X)$ under $p$ and $q$ respectively. Then:

$$D_{\mathrm{KL}}(p_T \| q_T) \le D_{\mathrm{KL}}(p \| q)$$

Proof sketch. By the chain rule applied to the joint $(X, T(X))$:

$$D_{\mathrm{KL}}(p(X, T(X)) \| q(X, T(X))) = D_{\mathrm{KL}}(p_X \| q_X) + \mathbb{E}\left[D_{\mathrm{KL}}(p_{T|X}\|q_{T|X})\right]$$

The same kernel $T$ is applied under both $p$ and $q$, so the conditional divergence $D_{\mathrm{KL}}(p_{T|X}\|q_{T|X}) = 0$ (whether $T$ is deterministic or randomized). Decomposing the joint via $T$ first instead, $D_{\mathrm{KL}}(p_T\|q_T) + \mathbb{E}[D_{\mathrm{KL}}(p_{X|T}\|q_{X|T})] = D_{\mathrm{KL}}(p_X\|q_X)$. Since the second term is $\ge 0$, $D_{\mathrm{KL}}(p_T\|q_T) \le D_{\mathrm{KL}}(p_X\|q_X)$. $\square$

Intuition: Processing data can only destroy information, not create it. Two distributions that are $D$ nats apart look at most $D$ nats apart after transformation. If you run a neural network encoder on samples from two distributions, the KL in embedding space is at most the KL in input space.

For AI: This is why representation learning cannot increase discriminability between classes beyond what is in the raw data. It also explains why the KL penalty in RLHF acts on the token distribution: any post-processing of the output tokens (sampling, filtering) can only reduce the KL from the reference policy, not increase it.
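
A small numerical illustration (our sketch): coarsening a four-outcome distribution with a deterministic map $T$ that merges outcomes can only shrink the divergence.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.40, 0.30, 0.20, 0.10])
q = np.array([0.10, 0.20, 0.30, 0.40])

# T merges outcomes {0, 1} -> A and {2, 3} -> B
pT = np.array([p[0] + p[1], p[2] + p[3]])
qT = np.array([q[0] + q[1], q[2] + q[3]])

assert kl(pT, qT) <= kl(p, q)
print(kl(p, q), kl(pT, qT))   # ~0.456 vs ~0.339 nats: processing lost information
```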

4. Forward KL vs Reverse KL

This section develops the most practically important distinction in applied KL theory: which direction to minimize, and what kind of approximation each direction produces.

4.1 Forward KL: Mean-Seeking Behavior

Forward KL is $D_{\mathrm{KL}}(p \| q)$ where $p$ is the target distribution (the truth) and $q$ is the approximating distribution we optimize. We minimize over $q$ given fixed $p$.

The zero-avoiding (mass-covering) property: Because the expectation is under $p$, regions where $p(x) > 0$ but $q(x) = 0$ contribute $+\infty$ to $D_{\mathrm{KL}}(p\|q)$. Therefore, any minimizer $q^*$ must satisfy $q^*(x) > 0$ wherever $p(x) > 0$ - the approximation must cover all the mass of the target. For this reason, forward KL is also called inclusive KL or zero-avoiding KL.

Fitting a Gaussian to a bimodal distribution. Let $p(x) = 0.5\,\mathcal{N}(x;-3,1) + 0.5\,\mathcal{N}(x;3,1)$ and $q_\mu(x) = \mathcal{N}(x;\mu,\sigma^2)$. Minimizing $D_{\mathrm{KL}}(p\|q_\mu)$ over $\mu$ and $\sigma^2$:

$$\frac{\partial}{\partial \mu} D_{\mathrm{KL}}(p \| q_\mu) = 0 \implies \mu^* = \mathbb{E}_p[X] = 0.5(-3) + 0.5(3) = 0$$
$$\sigma^{*2} = \mathbb{E}_p[(X - \mu^*)^2] = 0.5(9 + 1) + 0.5(9 + 1) = 10$$

So $q^* = \mathcal{N}(0, 10)$ - a wide Gaussian centered between the modes, placing substantial mass in the valley between them where $p$ is near zero. This is the mean-seeking behavior: $q$ computes the mean and variance of $p$ (moment matching).

General result. For any exponential family $q$ with sufficient statistics $\mathbf{t}(x)$, minimizing forward KL performs moment matching: $\mathbb{E}_q[\mathbf{t}(X)] = \mathbb{E}_p[\mathbf{t}(X)]$. The approximation matches the moments of $p$, which averages across modes.

For AI: Maximum likelihood estimation minimizes forward KL. This is why MLE on multimodal data can produce "blurry" generative models that average over modes - GANs were introduced partly to address this by using a minimax objective related to Jensen-Shannon divergence (Section 7.4) instead of direct MLE.
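
A Monte Carlo sketch of the moment-matching result for this mixture (ours; sample-based rather than analytic):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
comp = rng.integers(0, 2, size=n)                  # which mixture component
x = np.where(comp == 0, rng.normal(-3, 1, n), rng.normal(3, 1, n))

mu_star, var_star = x.mean(), x.var()
print(mu_star, var_star)   # ~0.0 and ~10.0: the wide, mean-seeking Gaussian
```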

4.2 Reverse KL: Mode-Seeking Behavior

Reverse KL is $D_{\mathrm{KL}}(q \| p)$ where $q$ is the approximation we optimize and $p$ is fixed. Now the expectation is under $q$ - we penalize regions where $q(x) > 0$ but $p(x) \approx 0$.

The zero-forcing (mass-concentrating) property: Regions where $p(x) = 0$ but $q(x) > 0$ contribute $+\infty$ to $D_{\mathrm{KL}}(q\|p)$. Therefore $q^*$ must avoid regions where $p$ is zero - it must concentrate its mass on high-probability regions of $p$. This is called exclusive KL or zero-forcing KL.

Fitting a Gaussian to a bimodal distribution (reverse direction). Now minimizing $D_{\mathrm{KL}}(q_\mu \| p)$:

$$D_{\mathrm{KL}}(q_\mu \| p) = \mathbb{E}_{q_\mu}\!\left[\log\frac{q_\mu(X)}{p(X)}\right]$$

This has two local minima: $q^* \approx \mathcal{N}(-3, 1)$ or $q^* \approx \mathcal{N}(3, 1)$ - the approximation collapses onto one mode of $p$. The wide, mean-covering solution $\mathcal{N}(0, 10)$ from forward KL is a poor candidate under reverse KL: it places mass where $p$ is near zero (in the valley), incurring large penalties.

General result. Reverse KL drives the approximation $q$ to concentrate on a single mode of $p$; minimization settles into whichever mode the optimization starts nearest.
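
A numerical sketch of mode-seeking (scipy and a simple quadrature grid assumed; the optimizer and grid are our choices, not from the notes):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def p(x):                                           # bimodal target density
    return 0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1)

xs = np.linspace(-12, 12, 4001)                     # quadrature grid
dx = xs[1] - xs[0]

def reverse_kl(params):
    mu, log_sigma = params
    q = norm.pdf(xs, mu, np.exp(log_sigma))
    return np.sum(q * np.log(q / p(xs) + 1e-300)) * dx   # E_q[log q/p]

print(minimize(reverse_kl, x0=[2.0, 0.0]).x)    # -> mu near +3, log_sigma near 0
print(minimize(reverse_kl, x0=[-2.0, 0.0]).x)   # -> mu near -3: the other mode
```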

4.3 Geometric Interpretation

FORWARD KL vs REVERSE KL: THE KEY DIFFERENCE


  True distribution p: bimodal, modes at -3 and +3

  Forward KL minimizer:   q* = N(0, 10)
      wide, mean-seeking; covers both modes but places
      mass in the low-probability valley between them

  Reverse KL minimizer:   q* = N(-3, 1)  OR  q* = N(3, 1)
      mode-seeking: picks one mode, ignores the other

  Intuition: forward KL sees regions with p > 0 as "must cover";
             reverse KL sees regions with p ~ 0 as "must avoid".

The information-geometric view formalizes this with the concept of projections:

  • I-projection (information projection): $q^* = \arg\min_q D_{\mathrm{KL}}(q \| p)$ - projects $p$ onto the constraint family $\mathcal{Q}$ using reverse KL. Selects the $q \in \mathcal{Q}$ that is closest to $p$ in the reverse KL sense.
  • M-projection (moment projection): $q^* = \arg\min_q D_{\mathrm{KL}}(p \| q)$ - projects $p$ onto $\mathcal{Q}$ using forward KL. Produces moment-matched approximations.

For exponential families: the M-projection always produces a unique solution (the moment-matched distribution). The I-projection may have multiple local minima (one per mode).

4.4 Consequences for Variational Inference

Variational Bayes minimizes $D_{\mathrm{KL}}(q(\mathbf{z}) \| p(\mathbf{z} \mid \mathbf{x}))$ - reverse KL. This is computationally convenient (the ELBO is a tractable lower bound) but has important consequences:

Posterior collapse in VAEs. In a Variational Autoencoder, each latent dimension $z_i$ has approximate posterior $q_\phi(z_i \mid \mathbf{x}) = \mathcal{N}(\mu_i, \sigma_i^2)$ and prior $p(z_i) = \mathcal{N}(0, 1)$. If the decoder is powerful enough to reconstruct $\mathbf{x}$ using only a subset of dimensions, the ELBO optimization drives the unused dimensions' KL term to zero - meaning $q_\phi(z_i\mid\mathbf{x}) \to p(z_i) = \mathcal{N}(0,1)$, a constant prior. This posterior collapse means the latent representation ignores the input for those dimensions.

Why reverse KL causes collapse: Reverse KL $D_{\mathrm{KL}}(q\|p_{\mathrm{prior}})$ is minimized ($= 0$) when $q = p_{\mathrm{prior}}$. Forward KL would penalize this collapse heavily because $D_{\mathrm{KL}}(p_{\mathrm{posterior}}\|q)$ would be large if $q$ ignores the data. The zero-forcing property of reverse KL allows collapse; the zero-avoiding property of forward KL would prevent it.

Beta-VAE fix (Higgins et al., 2017). Adding a hyperparameter $\beta > 1$ to weight the KL term:

$$\mathcal{L}_\beta = \mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})] - \beta\, D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$$

Higher $\beta$ strengthens the KL penalty, limiting the information each latent dimension can carry and encouraging disentangled representations. This is used in modern VAE-based image models and speech representations.

KL annealing. Starting with $\beta = 0$ and gradually increasing it during training prevents early collapse. Used in language model VAEs (Bowman et al., 2016) to avoid the degenerate solution where the encoder is ignored.

5. KL for Specific Distributions

5.1 KL Between Gaussians

The KL divergence between Gaussian distributions has a closed form used in dozens of ML algorithms.

Scalar Gaussians. For $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$:

$$D_{\mathrm{KL}}(p \| q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

Derivation. The log-ratio is:

$$\log\frac{p(x)}{q(x)} = \log\frac{\sigma_2}{\sigma_1} - \frac{(x-\mu_1)^2}{2\sigma_1^2} + \frac{(x-\mu_2)^2}{2\sigma_2^2}$$

Taking expectation under $p$, using $\mathbb{E}_p[(X-\mu_1)^2] = \sigma_1^2$ and $\mathbb{E}_p[(X-\mu_2)^2] = \sigma_1^2 + (\mu_1-\mu_2)^2$:

$$D_{\mathrm{KL}}(p\|q) = \log\frac{\sigma_2}{\sigma_1} - \frac{\sigma_1^2}{2\sigma_1^2} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

Intuition on terms:

  • $\log(\sigma_2/\sigma_1)$: penalty for scale mismatch
  • $\sigma_1^2/(2\sigma_2^2)$: penalty for $p$ being wider than $q$
  • $(\mu_1 - \mu_2)^2/(2\sigma_2^2)$: penalty for mean mismatch (like a Mahalanobis distance)
  • $-1/2$: the constant that makes the divergence vanish when $p = q$

Multivariate Gaussians. For $p = \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)$ and $q = \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)$:

$$D_{\mathrm{KL}}(p \| q) = \frac{1}{2}\left[\operatorname{tr}(\Sigma_2^{-1}\Sigma_1) + (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^\top \Sigma_2^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) - d + \log\frac{\det\Sigma_2}{\det\Sigma_1}\right]$$

where $d$ is the dimension. The four terms correspond to: trace ratio (variance mismatch), Mahalanobis distance (mean mismatch), dimension offset, log-determinant ratio (volume mismatch).

VAE application. In a standard VAE, the encoder outputs $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}), \operatorname{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x})))$ and the prior is $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, I)$. The KL term simplifies because $\Sigma_2 = I$:

$$D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z})) = \frac{1}{2}\sum_{j=1}^d \left(\mu_j^2 + \sigma_j^2 - \ln\sigma_j^2 - 1\right)$$

This is differentiable in $\boldsymbol{\mu}_\phi$ and $\boldsymbol{\sigma}_\phi$, enabling direct gradient descent. The reparameterization trick $\mathbf{z} = \boldsymbol{\mu}_\phi + \boldsymbol{\sigma}_\phi \odot \boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},I)$ makes the sampling step differentiable.
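
A sketch checking the scalar closed form against a Monte Carlo estimate of $\mathbb{E}_p[\log p - \log q]$ (parameter values are arbitrary):

```python
import numpy as np

def gauss_kl(mu1, var1, mu2, var2):
    """Closed-form D_KL(N(mu1, var1) || N(mu2, var2)) in nats."""
    return 0.5 * np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / (2 * var2) - 0.5

def logpdf(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

rng = np.random.default_rng(3)
mu1, var1, mu2, var2 = 1.0, 2.0, -0.5, 0.5
x = rng.normal(mu1, np.sqrt(var1), size=1_000_000)

mc = np.mean(logpdf(x, mu1, var1) - logpdf(x, mu2, var2))
print(gauss_kl(mu1, var1, mu2, var2), mc)   # agree to ~2-3 decimal places
```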

5.2 KL Between Categorical Distributions

For discrete distributions $p = (p_1, \ldots, p_K)$ and $q = (q_1, \ldots, q_K)$:

$$D_{\mathrm{KL}}(p \| q) = \sum_{k=1}^K p_k \ln\frac{p_k}{q_k}$$

This equals $H(p, q) - H(p)$ where $H(p,q) = -\sum_k p_k \ln q_k$ is the cross-entropy. In particular, when $p$ is a one-hot vector (empirical distribution on a single label $y$), $H(p) = 0$ and:

$$D_{\mathrm{KL}}(p \| q) = H(p, q) = -\ln q_y = \text{cross-entropy loss}$$

Temperature scaling. In knowledge distillation, the teacher distribution is softened using a temperature $\tau > 1$ applied to the teacher logits $z^{\mathrm{T}}$:

$$p_{\mathrm{soft}}(k) = \frac{e^{z_k^{\mathrm{T}}/\tau}}{\sum_j e^{z_j^{\mathrm{T}}/\tau}}$$

The student is trained to minimize $D_{\mathrm{KL}}(p_{\mathrm{soft}} \| q_{\boldsymbol{\theta}})$ rather than the one-hot cross-entropy. Soft targets carry more information (entropy scales up with $\tau$), allowing the student to learn the teacher's "dark knowledge" about relative class similarities.
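
A minimal distillation-loss sketch (logits and temperature are illustrative; in practice the soft loss is often scaled by $\tau^2$ and mixed with a hard-label term, noted in the comment):

```python
import numpy as np

def softmax(z, tau=1.0):
    z = z / tau
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([5.0, 2.0, 0.5, -1.0])
student_logits = np.array([4.0, 2.5, 0.0, -0.5])
tau = 4.0

p_t = softmax(teacher_logits, tau)                 # softened teacher targets
q_s = softmax(student_logits, tau)                 # softened student predictions
distill_loss = np.sum(p_t * np.log(p_t / q_s))     # forward KL(teacher || student)
# common practice: total_loss = tau**2 * distill_loss + hard-label cross-entropy
print(p_t.round(3), distill_loss)
```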

5.3 KL Within the Exponential Family

Members of the exponential family have a remarkably elegant KL formula. An exponential family distribution has density:

$$p_{\boldsymbol{\eta}}(\mathbf{x}) = h(\mathbf{x}) \exp\!\left(\boldsymbol{\eta}^\top \mathbf{t}(\mathbf{x}) - A(\boldsymbol{\eta})\right)$$

where $\boldsymbol{\eta}$ are natural parameters, $\mathbf{t}(\mathbf{x})$ are sufficient statistics, and $A(\boldsymbol{\eta}) = \log \int h(\mathbf{x})e^{\boldsymbol{\eta}^\top \mathbf{t}(\mathbf{x})}\, d\mathbf{x}$ is the log-partition function (which is convex in $\boldsymbol{\eta}$).

KL between exponential family members:

$$D_{\mathrm{KL}}(p_{\boldsymbol{\eta}_1} \| p_{\boldsymbol{\eta}_2}) = A(\boldsymbol{\eta}_2) - A(\boldsymbol{\eta}_1) - \nabla A(\boldsymbol{\eta}_1)^\top(\boldsymbol{\eta}_2 - \boldsymbol{\eta}_1)$$

This is the Bregman divergence generated by the convex function $A(\boldsymbol{\eta})$: $B_A(\boldsymbol{\eta}_2, \boldsymbol{\eta}_1) = A(\boldsymbol{\eta}_2) - A(\boldsymbol{\eta}_1) - \nabla A(\boldsymbol{\eta}_1)^\top(\boldsymbol{\eta}_2 - \boldsymbol{\eta}_1)$. This is the error of the first-order Taylor approximation of $A$ around $\boldsymbol{\eta}_1$ - which is always $\ge 0$ by convexity of $A$, recovering Gibbs' inequality. (A numerical check follows the examples below.)

Examples:

  • Gaussian: with natural parameters $\boldsymbol{\eta} = (\mu/\sigma^2,\, -1/(2\sigma^2))$, the log-partition function is $A = \mu^2/(2\sigma^2) + \frac{1}{2}\ln\sigma^2$
  • Bernoulli with natural parameter $\eta = \ln\frac{p}{1-p}$ (the logit): $A(\eta) = \ln(1 + e^\eta)$; the Bregman divergence is the binary KL
  • Poisson with natural parameter $\eta = \ln\lambda$: $A(\eta) = e^\eta$
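
A quick numerical check of the Bregman identity for the Bernoulli family (our sketch):

```python
import numpy as np

A = lambda eta: np.log1p(np.exp(eta))             # log-partition function
sigmoid = lambda eta: 1.0 / (1.0 + np.exp(-eta))  # A'(eta) = mean parameter

eta1, eta2 = 0.7, -1.2
mu1, mu2 = sigmoid(eta1), sigmoid(eta2)

direct  = mu1 * np.log(mu1 / mu2) + (1 - mu1) * np.log((1 - mu1) / (1 - mu2))
bregman = A(eta2) - A(eta1) - sigmoid(eta1) * (eta2 - eta1)

assert np.isclose(direct, bregman)   # D_KL(p_eta1 || p_eta2) = B_A(eta2, eta1)
```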

5.4 VAE Closed-Form KL and the Reparameterization Trick

The VAE's KL term per latent dimension is:

$$D_{\mathrm{KL}}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0, 1)) = \frac{1}{2}(\mu^2 + \sigma^2 - \ln\sigma^2 - 1)$$

Derivation. Using the scalar Gaussian formula with $\mu_2 = 0$, $\sigma_2^2 = 1$:

$$D_{\mathrm{KL}} = \ln\frac{1}{\sigma} + \frac{\sigma^2 + \mu^2}{2} - \frac{1}{2} = -\frac{1}{2}\ln\sigma^2 + \frac{\sigma^2 + \mu^2}{2} - \frac{1}{2} = \frac{1}{2}(\mu^2 + \sigma^2 - \ln\sigma^2 - 1)$$

Properties: This quantity is 0 when $\mu = 0, \sigma = 1$ (the posterior equals the prior); strictly positive otherwise. Gradient w.r.t. $\mu$: $\mu$. Gradient w.r.t. $\sigma^2$: $\frac{1}{2}(1 - 1/\sigma^2)$. Both are simple, enabling efficient backpropagation through the KL term.

Reparameterization trick. The challenge: we need $\mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)]$ but $z$ depends on $\phi$, so we cannot backpropagate through the sampling. The trick: write $z = \mu_\phi(x) + \sigma_\phi(x) \cdot \varepsilon$ where $\varepsilon \sim \mathcal{N}(0,1)$ is a separate random variable independent of $\phi$. Now $\frac{\partial z}{\partial \phi} = \frac{\partial \mu_\phi}{\partial \phi} + \varepsilon \frac{\partial \sigma_\phi}{\partial \phi}$ is well-defined, and gradients flow through. This is the key insight of Kingma & Welling (2014).
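
A toy sketch of the trick (PyTorch assumed; the "reconstruction" term is a stand-in, not a real decoder):

```python
import torch

mu = torch.tensor([0.5, -1.0], requires_grad=True)
log_var = torch.tensor([0.1, -0.3], requires_grad=True)

eps = torch.randn(2)                           # noise, independent of parameters
z = mu + torch.exp(0.5 * log_var) * eps        # reparameterized sample

kl = 0.5 * torch.sum(mu**2 + log_var.exp() - log_var - 1)
recon = torch.sum((z - 1.0) ** 2)              # stand-in for -log p(x|z)
(recon + kl).backward()

print(mu.grad, log_var.grad)                   # gradients flow through sampling
```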

6. Information-Theoretic Connections

6.1 KL and Cross-Entropy

The relationship $H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$ is fundamental. Expanding both sides:

$$-\sum_x p(x)\log q(x) = -\sum_x p(x)\log p(x) + \sum_x p(x)\log\frac{p(x)}{q(x)}$$

which is immediate from $\log(p/q) = \log p - \log q$. This decomposition has critical implications:

  • Training ML models: Minimizing cross-entropy loss $H(p_{\mathrm{data}}, p_{\boldsymbol{\theta}})$ over $\boldsymbol{\theta}$ is identical to minimizing $D_{\mathrm{KL}}(p_{\mathrm{data}} \| p_{\boldsymbol{\theta}})$, since $H(p_{\mathrm{data}})$ is constant. The irreducible entropy $H(p_{\mathrm{data}})$ is the minimum achievable cross-entropy.

  • Perplexity gap: A language model with per-token cross-entropy $H(p_{\mathrm{data}}, p_{\boldsymbol{\theta}})$ has perplexity $\operatorname{PPL} = e^{H(p_{\mathrm{data}}, p_{\boldsymbol{\theta}})}$; its cross-entropy sits exactly $D_{\mathrm{KL}}(p_{\mathrm{data}}\|p_{\boldsymbol{\theta}})$ nats above the theoretical minimum $H(p_{\mathrm{data}})$.

  • Calibration: A perfectly calibrated model satisfies $D_{\mathrm{KL}}(p_{\mathrm{true}}\|p_{\boldsymbol{\theta}}) = 0$, meaning the model's output distribution matches the true conditional distributions exactly.

Preview - Cross-Entropy (Section 09-04)

Cross-entropy $H(p, q) = H(p) + D_{\mathrm{KL}}(p\|q)$ is the canonical training loss for classification. It decomposes neatly into the intrinsic uncertainty $H(p)$ (irreducible) and the model quality gap $D_{\mathrm{KL}}(p\|q)$ (reducible by training).

-> Full treatment: Section 09-04 Cross-Entropy

6.2 KL and Mutual Information

Preview - Mutual Information (Section 09-03)

Mutual information is the KL divergence between the joint distribution and the product of marginals:

$$I(X; Y) = D_{\mathrm{KL}}(P(X, Y) \| P(X)P(Y)) = \mathbb{E}_{(x,y)\sim P}\!\left[\log\frac{P(x,y)}{P(x)P(y)}\right]$$

It measures statistical dependence: how much knowing $Y$ reduces uncertainty about $X$. This connection shows that $I(X;Y) \ge 0$ (Gibbs' inequality) and $I(X;Y) = 0 \Leftrightarrow X \perp\!\!\!\perp Y$.

-> Full treatment: Section 09-03 Mutual Information

The MI-as-KL connection is used directly in contrastive learning (InfoNCE loss, SimCLR, CLIP): maximizing a lower bound on $I(X; Z)$ between the input and its representation. It also appears in the information bottleneck (Tishby et al., 2000), which frames representation learning as minimizing $I(X; Z)$ subject to maximizing $I(Z; Y)$ - two KL divergences in tension.

6.3 KL and Fisher Information

Preview - Fisher Information (Section 09-05)

For a parametric family $\{p_{\boldsymbol{\theta}}\}$, the KL divergence between nearby members is approximately:

$$D_{\mathrm{KL}}(p_{\boldsymbol{\theta}} \| p_{\boldsymbol{\theta}+\boldsymbol{\delta}}) \approx \frac{1}{2}\boldsymbol{\delta}^\top F(\boldsymbol{\theta})\boldsymbol{\delta}$$

where $F(\boldsymbol{\theta}) = \mathbb{E}_{p_{\boldsymbol{\theta}}}[\nabla\log p_{\boldsymbol{\theta}}\, \nabla\log p_{\boldsymbol{\theta}}^\top]$ is the Fisher information matrix. This local quadratic approximation to KL is what makes natural gradient descent (which preconditions gradient steps by $F^{-1}$) the "geometrically correct" optimizer on the statistical manifold.

-> Full treatment: Section 09-05 Fisher Information

Connection to RLHF natural policy gradient: The KL trust region $D_{\mathrm{KL}}(\pi_{\boldsymbol{\theta}} \| \pi_{\mathrm{ref}}) \le \epsilon$ defines a neighborhood in policy space. Near the reference policy, this KL ball is approximated by $\frac{1}{2}\boldsymbol{\delta}^\top F \boldsymbol{\delta} \le \epsilon$ where $F$ is the Fisher information of $\pi_{\mathrm{ref}}$. Natural policy gradient algorithms (TRPO, PPO) use this approximation.

6.4 KL and Entropy

KL divergence provides an elegant proof that entropy is maximized at the uniform distribution. For any distribution $p$ on an alphabet of size $n$ and uniform distribution $u(x) = 1/n$:

$$D_{\mathrm{KL}}(p \| u) = \sum_x p(x)\ln\frac{p(x)}{1/n} = \sum_x p(x)\ln p(x) + \ln n = -H(p) + \ln n$$

Since $D_{\mathrm{KL}}(p\|u) \ge 0$: $H(p) \le \ln n$, with equality iff $p = u$.

More generally, for any constraint set $\mathcal{Q}$ of distributions satisfying given moment constraints, the maximum-entropy distribution $p^* = \arg\max_{p \in \mathcal{Q}} H(p)$ equals $\arg\min_{p \in \mathcal{Q}} D_{\mathrm{KL}}(p \| u)$ - it is the I-projection of the uniform distribution onto $\mathcal{Q}$ (the minimization runs over the first argument). This unifies the maximum entropy principle with KL geometry.

7. f-Divergences and Generalizations

7.1 Csiszar f-Divergences

KL divergence is one member of a rich family of divergences introduced by Csiszar (1967).

Definition. Let $f: (0,\infty) \to \mathbb{R}$ be a convex function with $f(1) = 0$. The f-divergence of $p$ from $q$ is:

$$D_f(p \| q) = \sum_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) = \mathbb{E}_{x\sim q}\!\left[f\!\left(\frac{p(x)}{q(x)}\right)\right]$$

The condition $f(1) = 0$ ensures $D_f(p\|p) = 0$. Convexity of $f$ and Jensen's inequality give $D_f(p\|q) \ge 0$.

KL divergence as f-divergence. Taking $f(t) = t\ln t$ (convex, $f(1) = 0$):

$$D_f(p\|q) = \sum_x q(x) \cdot \frac{p(x)}{q(x)}\ln\frac{p(x)}{q(x)} = \sum_x p(x)\ln\frac{p(x)}{q(x)} = D_{\mathrm{KL}}(p\|q)$$

Standard f-divergences:

| Name | $f(t)$ | Closed form | Properties |
|---|---|---|---|
| KL divergence | $t\ln t$ | $\sum p\ln(p/q)$ | Asymmetric; $[0, \infty)$ |
| Reverse KL | $-\ln t$ | $\sum q\ln(q/p)$ | Asymmetric; $[0, \infty)$ |
| Hellinger (squared, $\times 2$) | $(\sqrt{t}-1)^2$ | $\sum(\sqrt{p}-\sqrt{q})^2$ | Symmetric; $[0, 2]$ |
| Total variation | $\frac{1}{2}\lvert t-1\rvert$ | $\frac{1}{2}\sum\lvert p-q\rvert$ | Symmetric; $[0, 1]$; a metric |
| Chi-squared | $(t-1)^2$ | $\sum(p-q)^2/q$ | Asymmetric; $[0,\infty)$ |
| $\alpha$-divergence | $\frac{t^\alpha - \alpha t - (1-\alpha)}{\alpha(\alpha-1)}$ | General; limits give KL | Parameterizes all of the above |

For AI - GAN losses. Generative Adversarial Networks (Goodfellow et al., 2014) use different f-divergences as training objectives: the original GAN uses Jensen-Shannon; f-GAN (Nowozin et al., 2016) generalizes to arbitrary f-divergences. This provides a principled family of training objectives for generative models.

7.2 Properties of f-Divergences

All f-divergences share four key properties (when $f$ is convex with $f(1) = 0$):

  1. Non-negativity: $D_f(p\|q) \ge 0$, with equality iff $p = q$ (when $f$ is strictly convex).
  2. Data processing inequality: $D_f(p_T\|q_T) \le D_f(p\|q)$ for any stochastic kernel $T$.
  3. No triangle inequality in general.
  4. Convexity: $D_f$ is jointly convex in $(p,q)$ when $f$ is convex.

The data processing inequality is a theorem for all f-divergences: processing can only reduce distinguishability between distributions.

Total variation as special case. The total variation distance $\mathrm{TV}(p,q) = \frac{1}{2}\|p - q\|_1 = \frac{1}{2}\sum_x |p(x) - q(x)|$ satisfies the triangle inequality and is a genuine metric. It bounds other f-divergences; Pinsker's inequality relates KL and TV:

$$\mathrm{TV}(p,q) \le \sqrt{\frac{1}{2}D_{\mathrm{KL}}(p\|q)}$$

This is the most important inequality connecting KL divergence to total variation. Pinsker's inequality is used to convert KL bounds (from optimization) into total variation bounds (which control probabilities of events).
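
A numerical sweep confirming Pinsker's inequality on random pairs (our sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
for _ in range(1000):
    p = rng.dirichlet(np.ones(6))
    q = rng.dirichlet(np.ones(6))
    tv = 0.5 * np.abs(p - q).sum()
    kl = np.sum(p * np.log(p / q))
    assert tv <= np.sqrt(0.5 * kl) + 1e-12   # Pinsker's inequality
```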

7.3 Renyi Divergence

For $\alpha \in (0,1) \cup (1,\infty)$, the order-$\alpha$ Renyi divergence is:

$$D_\alpha(p \| q) = \frac{1}{\alpha - 1}\log\sum_x p(x)^\alpha\, q(x)^{1-\alpha}$$

Limiting cases:

  • $\alpha \to 1$: $D_\alpha(p\|q) \to D_{\mathrm{KL}}(p\|q)$ (L'Hopital's rule)
  • $\alpha \to 0$: $D_\alpha(p\|q) \to -\log\sum_x \mathbb{1}[p(x) > 0]\, q(x)$ (support-based)
  • $\alpha \to \infty$: $D_\alpha(p\|q) \to \log\max_x p(x)/q(x)$ (max-ratio)
  • $\alpha = 1/2$: related to the Bhattacharyya distance; $D_{1/2}(p\|q) = -2\log\sum_x\sqrt{p(x)q(x)}$

Properties: $D_\alpha(p\|q) \ge 0$; $D_\alpha(p\|p) = 0$; $D_\alpha$ is monotone increasing in $\alpha$; satisfies the data processing inequality; not symmetric.

For AI - Differential privacy. Renyi differential privacy (Mironov, 2017) quantifies privacy loss using Renyi divergence: a mechanism $M$ is $(\alpha, \varepsilon)$-RDP if $D_\alpha(M(S)\|M(S')) \le \varepsilon$ for any adjacent datasets $S, S'$. Renyi DP composes additively: $\varepsilon_{\text{total}} = \varepsilon_1 + \varepsilon_2$. This makes it the tool of choice for privacy accounting in large-scale training with DP-SGD (used in Google's federated learning and some LLM fine-tuning pipelines).
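
A short sketch computing $D_\alpha$ across orders, showing monotonicity in $\alpha$ and convergence to KL as $\alpha \to 1$:

```python
import numpy as np

def renyi(p, q, alpha):
    return np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0)

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
kl = np.sum(p * np.log(p / q))

alphas = [0.25, 0.5, 0.9, 0.999, 1.5, 2.0]
vals = [renyi(p, q, a) for a in alphas]
assert all(a <= b + 1e-12 for a, b in zip(vals, vals[1:]))  # monotone in alpha
assert np.isclose(renyi(p, q, 0.999), kl, atol=1e-2)        # alpha -> 1 gives KL
```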

7.4 Jensen-Shannon Divergence

The Jensen-Shannon Divergence (JSD) is the symmetrized, bounded variant of KL:

$$\mathrm{JSD}(p \| q) = \frac{1}{2}D_{\mathrm{KL}}\!\left(p \,\Big\|\, \frac{p+q}{2}\right) + \frac{1}{2}D_{\mathrm{KL}}\!\left(q \,\Big\|\, \frac{p+q}{2}\right)$$

where $m = (p+q)/2$ is the mixture distribution.

Properties:

  • $\mathrm{JSD}(p\|q) = \mathrm{JSD}(q\|p)$ - symmetric
  • $0 \le \mathrm{JSD}(p\|q) \le \ln 2$ - bounded (in nats)
  • $\sqrt{\mathrm{JSD}}$ is a metric (satisfies the triangle inequality)
  • $\mathrm{JSD}(p\|q) = H(m) - \frac{1}{2}H(p) - \frac{1}{2}H(q)$ - the entropy of the mixture minus the average entropy (checked numerically in the sketch below)
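
A minimal verification of the entropy identity and the $\ln 2$ bound (our sketch):

```python
import numpy as np

def H(p):
    return -np.sum(p * np.log(p))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.4, 0.5])
m = 0.5 * (p + q)

jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
assert np.isclose(jsd, H(m) - 0.5 * H(p) - 0.5 * H(q))   # entropy identity
assert 0.0 <= jsd <= np.log(2)                           # bounded in [0, ln 2]
```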

For AI - Original GAN objective. The original GAN (Goodfellow et al., 2014) with an optimal discriminator minimizes the Jensen-Shannon divergence between the generator distribution $p_G$ and the data distribution $p_{\mathrm{data}}$:

$$\min_G \max_D V(D, G) = 2\,\mathrm{JSD}(p_{\mathrm{data}} \| p_G) - 2\ln 2$$

The optimal discriminator is $D^*(x) = p_{\mathrm{data}}(x)/(p_{\mathrm{data}}(x) + p_G(x))$. However, when $p_{\mathrm{data}}$ and $p_G$ have disjoint supports, $\mathrm{JSD} = \ln 2$ (its maximum) regardless of how far apart the distributions are - the discriminator saturates and gradients vanish. This vanishing-gradient problem motivated WGAN (Arjovsky et al., 2017), which uses the Wasserstein distance instead (finite and informative even without overlapping support).

8. Information Geometry of KL

8.1 KL as a Bregman Divergence

A Bregman divergence generated by a strictly convex function ϕ\phi is:

Bϕ(p,q)=ϕ(p)ϕ(q)ϕ(q)(pq)B_\phi(p, q) = \phi(p) - \phi(q) - \nabla\phi(q)^\top(p - q)

This is the gap between ϕ\phi at pp and its first-order Taylor approximation at qq - always 0\ge 0 by convexity.

KL is a Bregman divergence. Taking ϕ(p)=xp(x)lnp(x)\phi(p) = \sum_x p(x)\ln p(x) (the negative entropy, which is strictly convex):

\nabla\phi(q)_x = \ln q(x) + 1

B_\phi(p, q) = \sum_x p(x)\ln p(x) - \sum_x q(x)\ln q(x) - \sum_x (\ln q(x) + 1)(p(x) - q(x))

= \sum_x p(x)\ln p(x) - \sum_x p(x)\ln q(x) + \sum_x (q(x) - p(x)) = \sum_x p(x)\ln\frac{p(x)}{q(x)} = D_{\mathrm{KL}}(p\|q)

(the sum $\sum_x (q(x) - p(x))$ vanishes because both distributions normalize to 1)

This Bregman representation reveals why KL is asymmetric: Bregman divergences are generally not symmetric because the Taylor approximation is computed at qq, not at pp.

Exponential family connection. For an exponential family with log-partition function A(η)A(\boldsymbol{\eta}), the Bregman divergence BA(η2,η1)B_A(\boldsymbol{\eta}_2, \boldsymbol{\eta}_1) equals DKL(pη1pη2)D_{\mathrm{KL}}(p_{\boldsymbol{\eta}_1}\|p_{\boldsymbol{\eta}_2}) (note reversed argument order). This is the source of the elegant formula in Section 5.3.
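The Bregman identity is easy to verify numerically; a minimal sketch with made-up distributions:

import numpy as np

def neg_entropy(p):
    return float(np.sum(p * np.log(p)))

def bregman_neg_entropy(p, q):
    # B_phi(p, q) with phi = negative entropy; grad phi(q)_x = log q(x) + 1
    grad_q = np.log(q) + 1.0
    return neg_entropy(p) - neg_entropy(q) - float(grad_q @ (p - q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(bregman_neg_entropy(p, q))          # Bregman divergence ...
print(float(np.sum(p * np.log(p / q))))   # ... equals KL(p || q) exactly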

8.2 I-Projection and M-Projection

Definition. Let M\mathcal{M} be a constraint set (e.g., a parametric family of distributions):

  • I-projection (information projection): q=argminqMDKL(qp)q^* = \arg\min_{q \in \mathcal{M}} D_{\mathrm{KL}}(q \| p) - the distribution in M\mathcal{M} closest to pp in reverse KL
  • M-projection (moment projection): q=argminqMDKL(pq)q^* = \arg\min_{q \in \mathcal{M}} D_{\mathrm{KL}}(p \| q) - the distribution in M\mathcal{M} closest to pp in forward KL

For the exponential family:

  • The M-projection always produces the moment-matched distribution: Eq[t(X)]=Ep[t(X)]\mathbb{E}_{q^*}[\mathbf{t}(X)] = \mathbb{E}_p[\mathbf{t}(X)]
  • The I-projection onto a linear family (a convex set of distributions defined by expectation constraints $\mathbb{E}_q[\mathbf{t}(X)] = \mathbf{c}$) has exponential-family form - the maximum-entropy principle expressed in projection language

EM algorithm as alternating projections. The EM algorithm alternates:

  • E-step: set $q^{(t)}(\mathbf{z}) = p(\mathbf{z}|\mathbf{x}; \boldsymbol{\theta}^{(t)})$, the exact posterior - the minimizer of $D_{\mathrm{KL}}(q \| p(\cdot|\mathbf{x};\boldsymbol{\theta}^{(t)}))$ over all $q$ (an I-projection)
  • M-step: $\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} \mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)})$ - an M-projection of $q^{(t)}$ back onto the parametric family

Each step increases the ELBO $\mathcal{L}(q, \boldsymbol{\theta})$ (equivalently, decreases the corresponding KL divergence), so the log-likelihood never decreases and EM converges to a stationary point.

8.3 The Pythagorean Theorem for KL

Theorem (Pythagorean theorem for I-projections). Let $\mathcal{E}$ be a linear family (a convex set of distributions defined by expectation constraints, e.g. prescribed moments) and let $q^* = \arg\min_{q \in \mathcal{E}} D_{\mathrm{KL}}(q \| p)$ be the I-projection of $p$ onto $\mathcal{E}$. Then for any $r \in \mathcal{E}$:

D_{\mathrm{KL}}(r \| p) = D_{\mathrm{KL}}(r \| q^*) + D_{\mathrm{KL}}(q^* \| p)

This is an exact equality, not merely an inequality - the I-projection $q^*$ behaves exactly like the foot of a perpendicular in Euclidean geometry. (For a general convex constraint set, only "$\ge$" holds.)

Interpretation. The KL distance from any rr in the constraint set to the target pp decomposes into: (1) the distance from rr to the projection qq^*, plus (2) the distance from the projection to pp. The "closest" point qq^* acts like the foot of a perpendicular. This is called "Pythagorean" because the geometry is orthogonal in the KL sense.

PYTHAGOREAN THEOREM FOR KL


          p  (target distribution, outside set E)
          |
          | D_KL(q* || p)
          |
          q*  r    (all in constraint set E)
            D_KL(r || q*)

    D_KL(r || p) = D_KL(r || q*) + D_KL(q* || p)

    (exact for I-projections onto linear families; only ">=" for general convex sets)


For AI: The variational inference ELBO maximization is equivalent to minimizing DKL(qpθ(zx))D_{\mathrm{KL}}(q\|p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})) - an I-projection. The Pythagorean theorem shows that the ELBO gap is exactly DKL(qp)D_{\mathrm{KL}}(q^*\|p) at the optimum - a clean statement of how much information is lost by restricting to the variational family.

8.4 Statistical Manifolds and Natural Gradient

The space of probability distributions over X\mathcal{X} forms a statistical manifold with local coordinates given by the natural parameters η\boldsymbol{\eta} (for exponential families). The KL divergence induces a Riemannian metric on this manifold called the Fisher information metric:

gij(η)=2DKL(pηpη)ηiηjη=η=Fij(η)g_{ij}(\boldsymbol{\eta}) = \frac{\partial^2 D_{\mathrm{KL}}(p_{\boldsymbol{\eta}} \| p_{\boldsymbol{\eta}'})}{\partial\eta_i' \partial\eta_j'}\bigg|_{\boldsymbol{\eta}'=\boldsymbol{\eta}} = F_{ij}(\boldsymbol{\eta})

where F(η)F(\boldsymbol{\eta}) is the Fisher information matrix. The Riemannian structure makes the "natural" step size in parameter space match the actual KL distance between distributions.

Natural gradient. In Euclidean parameter space, gradient descent steps θθηL\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\nabla\mathcal{L} make steps proportional to the Euclidean metric on θ\boldsymbol{\theta}-space, which does not correspond to any natural metric on distribution space. The natural gradient corrects for this:

~L=F(θ)1L\tilde{\nabla}\mathcal{L} = F(\boldsymbol{\theta})^{-1}\nabla\mathcal{L}

Natural gradient descent converges much faster in problems where the Fisher information is highly non-uniform (i.e., where the Euclidean metric on parameters is poorly calibrated to KL distance on distributions). This is the foundation of K-FAC (Martens & Grosse, 2015), used in second-order optimization of neural networks.
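The claim that Fisher information is the local KL curvature can be checked directly; a minimal Bernoulli sketch (the step size delta is chosen arbitrarily):

import numpy as np

def kl_bern(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

# D_KL(Bern(theta) || Bern(theta + delta)) ~ 0.5 * F(theta) * delta^2,
# with Fisher information F(theta) = 1 / (theta (1 - theta))
delta = 1e-3
for theta in (0.5, 0.9, 0.99):
    fisher = 1.0 / (theta * (1.0 - theta))
    print(theta, kl_bern(theta, theta + delta), 0.5 * fisher * delta**2)

# The same Euclidean step delta costs ~25x more KL at theta = 0.99 than
# at theta = 0.5 -- exactly the miscalibration that multiplying the
# gradient by F(theta)^{-1} corrects.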

-> Full treatment of Fisher information and natural gradient: Section 09-05 Fisher Information

9. Applications in Machine Learning

9.1 Maximum Likelihood = Minimizing KL

Theorem. For a parametric model pθp_{\boldsymbol{\theta}} and empirical data distribution p^n(x)=1ni=1n1[X(i)=x]\hat{p}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[X^{(i)} = x]:

argmaxθi=1nlogpθ(x(i))=argminθDKL(p^npθ)\arg\max_{\boldsymbol{\theta}} \sum_{i=1}^n \log p_{\boldsymbol{\theta}}(x^{(i)}) = \arg\min_{\boldsymbol{\theta}} D_{\mathrm{KL}}(\hat{p}_n \| p_{\boldsymbol{\theta}})

Proof. Using DKL(p^npθ)=Ep^[logp^(x)/pθ(x)]=H(p^n)Ep^[logpθ(x)]D_{\mathrm{KL}}(\hat{p}_n \| p_{\boldsymbol{\theta}}) = \mathbb{E}_{\hat{p}}[\log\hat{p}(x)/p_{\boldsymbol{\theta}}(x)] = H(\hat{p}_n) - \mathbb{E}_{\hat{p}}[\log p_{\boldsymbol{\theta}}(x)], minimizing over θ\boldsymbol{\theta} eliminates the constant H(p^n)H(\hat{p}_n):

argminθDKL(p^npθ)=argmaxθEp^[logpθ(X)]=argmaxθ1nilogpθ(x(i))\arg\min_{\boldsymbol{\theta}} D_{\mathrm{KL}}(\hat{p}_n\|p_{\boldsymbol{\theta}}) = \arg\max_{\boldsymbol{\theta}} \mathbb{E}_{\hat{p}}[\log p_{\boldsymbol{\theta}}(X)] = \arg\max_{\boldsymbol{\theta}} \frac{1}{n}\sum_i \log p_{\boldsymbol{\theta}}(x^{(i)}) \quad\square

Implications:

  • Every LLM trained by next-token prediction (GPT-4, LLaMA-3, Claude, Gemini) minimizes DKL(pdatapθ)D_{\mathrm{KL}}(p_{\mathrm{data}}\|p_{\boldsymbol{\theta}})
  • The MLE estimate is not "arbitrary" - it has a precise information-theoretic interpretation as the distribution in the model family closest to the data in forward KL
  • Coverage: MLE must cover the entire data distribution; if pdata(x)>0p_{\mathrm{data}}(x) > 0 but pθ(x)0p_{\boldsymbol{\theta}}(x) \approx 0, the loss is large
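Because the two objectives differ only by the constant $H(\hat{p}_n)$, their optimizers coincide, which is easy to confirm numerically; a sketch on a made-up one-parameter categorical family:

import numpy as np

counts = np.array([60, 30, 10])       # toy observed counts
p_hat = counts / counts.sum()         # empirical distribution

def model(theta):
    # softmax([theta, 0, -theta]): an arbitrary 1-parameter family
    z = np.array([theta, 0.0, -theta])
    e = np.exp(z - z.max())
    return e / e.sum()

thetas = np.linspace(-3, 3, 2001)
log_lik = [counts @ np.log(model(t)) for t in thetas]
kl = [np.sum(p_hat * np.log(p_hat / model(t))) for t in thetas]

print(thetas[np.argmax(log_lik)])     # MLE ...
print(thetas[np.argmin(kl)])          # ... = forward-KL minimizer, same theta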

9.2 Variational Autoencoders

The VAE (Kingma & Welling, 2014) introduces a latent variable z\mathbf{z} with prior p(z)p(\mathbf{z}) and likelihood pθ(xz)p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}). The true log-likelihood logpθ(x)\log p_{\boldsymbol{\theta}}(\mathbf{x}) is intractable (requires integrating over z\mathbf{z}). The VAE introduces an encoder qϕ(zx)q_\phi(\mathbf{z}|\mathbf{x}) and derives a tractable lower bound via the KL decomposition:

logpθ(x)=Eqϕ[logpθ(xz)]DKL(qϕ(zx)p(z))L(ϕ,θ;x)=ELBO+DKL(qϕ(zx)pθ(zx))0\log p_{\boldsymbol{\theta}}(\mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi}[\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})] - D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))}_{\mathcal{L}(\boldsymbol{\phi},\boldsymbol{\theta};\mathbf{x}) = \text{ELBO}} + \underbrace{D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x}))}_{\ge 0}

The ELBO has two terms:

  1. Reconstruction term Eqϕ[logpθ(xz)]\mathbb{E}_{q_\phi}[\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})]: how well the decoder reconstructs from the latent code. Analogous to negative distortion.
  2. KL regularizer DKL(qϕ(zx)p(z))D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x})\|p(\mathbf{z})): how much the approximate posterior deviates from the prior. For Gaussian encoder and standard Gaussian prior: 12j(μj2+σj2lnσj21)\frac{1}{2}\sum_j(\mu_j^2 + \sigma_j^2 - \ln\sigma_j^2 - 1).

Practical training: The ELBO is maximized by stochastic gradient ascent on both parameter sets (in practice jointly, in a single backward pass):

  • w.r.t. ϕ\boldsymbol{\phi}: encoder parameters, using the reparameterization trick
  • w.r.t. θ\boldsymbol{\theta}: decoder parameters, via direct backpropagation

VAE variants using KL: beta-VAE (β>1\beta > 1 weight on KL for disentanglement), WAE (Wasserstein autoencoder replaces KL with MMD), VQ-VAE (discrete latent, no KL), InfoVAE (adds MI term).

9.3 RLHF, PPO, and DPO

RLHF objective. Reinforcement Learning from Human Feedback (Christiano et al., 2017; InstructGPT, Ouyang et al., 2022) trains a policy πθ\pi_{\boldsymbol{\theta}} to maximize expected reward while staying close to the reference policy πref\pi_{\mathrm{ref}} (the pretrained LLM):

maxπθExD,yπθ(x)[r(x,y)βDKL(πθ(x)πref(x))]\max_{\pi_{\boldsymbol{\theta}}} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\boldsymbol{\theta}}(\cdot|x)}\left[r(x, y) - \beta\, D_{\mathrm{KL}}(\pi_{\boldsymbol{\theta}}(\cdot|x) \| \pi_{\mathrm{ref}}(\cdot|x))\right]

where r(x,y)r(x,y) is the reward model score and β>0\beta > 0 is the KL coefficient. Without the KL penalty, the policy collapses to reward hacking: generating short, repetitive text that maximizes reward scores but is incoherent or unnatural.

Optimal policy. The RLHF objective has a closed-form optimal solution (for fixed rr):

π(yx)=1Z(x)πref(yx)er(x,y)/β\pi^*(y|x) = \frac{1}{Z(x)}\pi_{\mathrm{ref}}(y|x)\, e^{r(x,y)/\beta}

where Z(x)=yπref(yx)er(x,y)/βZ(x) = \sum_y \pi_{\mathrm{ref}}(y|x)e^{r(x,y)/\beta} is the normalizing partition function. The optimal policy exponentially upweights sequences with high reward relative to the reference.
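The closed form is easy to explore on a toy action space (the responses, rewards, and reference probabilities below are made up):

import numpy as np

pi_ref = np.array([0.5, 0.5])   # reference policy over two responses
r = np.array([1.0, 0.0])        # reward model scores

for beta in (10.0, 1.0, 0.1, 0.01):
    unnorm = pi_ref * np.exp(r / beta)   # pi* propto pi_ref * exp(r/beta)
    pi_star = unnorm / unnorm.sum()
    kl = float(np.sum(pi_star * np.log(pi_star / pi_ref)))
    print(beta, pi_star.round(4), round(kl, 4))

# Large beta: pi* ~ pi_ref (tiny KL). Small beta: pi* collapses onto the
# highest-reward response and the KL from the reference blows up.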

DPO (Direct Preference Optimization, Rafailov et al., 2023). DPO solves the RLHF objective directly from preference data, without training an explicit reward model. The key insight: substituting the optimal policy formula into the reward-learning objective yields a loss that only involves the policy, not the reward model:

LDPO(θ)=E ⁣[logσ ⁣(βlnπθ(ywx)πref(ywx)βlnπθ(ylx)πref(ylx))]\mathcal{L}_{\mathrm{DPO}}(\boldsymbol{\theta}) = -\mathbb{E}\!\left[\log\sigma\!\left(\beta\ln\frac{\pi_{\boldsymbol{\theta}}(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta\ln\frac{\pi_{\boldsymbol{\theta}}(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}\right)\right]

where ywy_w is the preferred (winning) response and yly_l the dispreferred (losing) response. The log-ratio log(πθ/πref)\log(\pi_{\boldsymbol{\theta}}/\pi_{\mathrm{ref}}) is the implicit reward that the KL-constrained policy learns. DPO is used in practice for fine-tuning LLMs on human preferences (LLaMA-2-Chat, Mixtral-Instruct, and many others).
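A minimal PyTorch sketch of the loss (the sequence log-probabilities and beta below are toy values, not from the paper):

import torch
import torch.nn.functional as F

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss for a batch of (chosen, rejected) pairs; each argument is
    a total sequence log-probability of shape (batch,)."""
    reward_w = beta * (logp_w_policy - logp_w_ref)    # implicit reward, winner
    reward_l = beta * (logp_l_policy - logp_l_ref)    # implicit reward, loser
    return -F.logsigmoid(reward_w - reward_l).mean()  # push winner above loser

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-12.5]), torch.tensor([-14.0]))
print(loss)   # small positive margin -> loss slightly below ln 2 ~ 0.693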

PPO trust region. PPO (Schulman et al., 2017) approximates the KL constraint with a clipping objective:

LCLIP(θ)=Et ⁣[min ⁣(rt(θ)A^t,  clip(rt(θ),1ε,1+ε)A^t)]\mathcal{L}_{\mathrm{CLIP}}(\boldsymbol{\theta}) = \mathbb{E}_t\!\left[\min\!\left(r_t(\boldsymbol{\theta})\hat{A}_t,\; \mathrm{clip}(r_t(\boldsymbol{\theta}), 1-\varepsilon, 1+\varepsilon)\hat{A}_t\right)\right]

where rt(θ)=πθ(atst)/πold(atst)r_t(\boldsymbol{\theta}) = \pi_{\boldsymbol{\theta}}(a_t|s_t)/\pi_{\mathrm{old}}(a_t|s_t) is the probability ratio. Clipping at 1±ε1 \pm \varepsilon prevents large policy updates, approximating the KL trust region constraint DKL(πoldπθ)δD_{\mathrm{KL}}(\pi_{\mathrm{old}}\|\pi_{\boldsymbol{\theta}}) \le \delta.
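A minimal sketch of the clipped surrogate (function names and toy numbers are illustrative):

import torch

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Per-token PPO clipped surrogate (to be maximized)."""
    ratio = torch.exp(logp_new - logp_old)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return torch.min(unclipped, clipped).mean()

# Toy check: a ratio of 1.5 with positive advantage is clipped to 1.2,
# so pushing the ratio further earns no additional objective
logp_new = torch.log(torch.tensor([1.5, 0.9]))
logp_old = torch.zeros(2)            # toy: old probabilities taken as 1
adv = torch.tensor([1.0, -1.0])
print(ppo_clip_objective(logp_new, logp_old, adv))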

9.4 Knowledge Distillation

Knowledge distillation (Hinton, Vinyals & Dean, 2015) transfers knowledge from a large teacher model pTp_T to a smaller student model pSp_S. The distillation loss uses forward KL (from teacher to student):

Ldistill=DKL(pT(τ)pS)=kpT(τ)(k)lnpT(τ)(k)pS(k)\mathcal{L}_{\mathrm{distill}} = D_{\mathrm{KL}}(p_T^{(\tau)} \| p_S) = \sum_k p_T^{(\tau)}(k)\ln\frac{p_T^{(\tau)}(k)}{p_S(k)}

where $p_T^{(\tau)}(k) = \mathrm{softmax}(\mathbf{z}_T/\tau)_k$ is the teacher's softened distribution at temperature $\tau > 1$ ($\mathbf{z}_T$ = teacher logits).

Why forward KL (teacher -> student)? The student must cover all of the teacher's probability mass, including the "dark knowledge" - the non-zero probabilities assigned to incorrect classes that encode the teacher's generalization patterns. If a teacher assigns 60% to "cat," 30% to "tiger," and 10% to "dog," the soft labels teach the student that cats look more like tigers than dogs. One-hot labels destroy this information.

Forward vs reverse KL choice in distillation. Using reverse KL DKL(pSpT)D_{\mathrm{KL}}(p_S\|p_T) would allow the student to become mode-seeking - focusing only on the teacher's highest-probability output and ignoring the distributional shape. Forward KL forces the student to match the full distribution, preserving the teacher's calibration.

Modern LLM distillation. DistilBERT (Sanh et al., 2019) distills BERT using forward KL plus a cosine loss on hidden states. The Phi family (Gunasekar et al., 2023; Microsoft, 2023) trains small but capable models on synthetic data generated by larger models - distillation at the data level rather than the logit level. For logit-level distillation of large checkpoints into smaller ones, the forward KL direction remains the standard recipe.
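A minimal sketch of a temperature-softened distillation KL in PyTorch (the logits are toy values; the tau^2 rescaling is common practice, following Hinton et al. 2015, to keep gradient magnitudes comparable across temperatures):

import torch
import torch.nn.functional as F

def distillation_kl(logits_student, logits_teacher, tau=3.0):
    """Forward KL D_KL(teacher || student) on temperature-softened logits."""
    log_p_s = F.log_softmax(logits_student / tau, dim=-1)
    log_p_t = F.log_softmax(logits_teacher / tau, dim=-1)
    kl = (log_p_t.exp() * (log_p_t - log_p_s)).sum(dim=-1)
    return (tau ** 2) * kl.mean()

z_t = torch.tensor([[3.0, 1.0, -2.0]])   # toy teacher logits
z_s = torch.tensor([[2.0, 0.0, -1.0]])   # toy student logits
print(distillation_kl(z_s, z_t, tau=1.0))
print(distillation_kl(z_s, z_t, tau=3.0))  # softer targets, rescaled loss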

9.5 Normalizing Flows and Diffusion Models

Normalizing flows learn an invertible transformation fθ:RdRdf_{\boldsymbol{\theta}}: \mathbb{R}^d \to \mathbb{R}^d such that x=fθ(z)\mathbf{x} = f_{\boldsymbol{\theta}}(\mathbf{z}) where zN(0,I)\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I). The exact change-of-variables formula gives:

logpθ(x)=logpz(fθ1(x))+logdetJfθ1(x)\log p_{\boldsymbol{\theta}}(\mathbf{x}) = \log p_{\mathbf{z}}(f_{\boldsymbol{\theta}}^{-1}(\mathbf{x})) + \log|\det J_{f_{\boldsymbol{\theta}}^{-1}}(\mathbf{x})|

Training by maximum likelihood minimizes DKL(pdatapθ)D_{\mathrm{KL}}(p_{\mathrm{data}}\|p_{\boldsymbol{\theta}}) (forward KL) since the log-likelihood is tractable. This contrasts with GANs, which use a discriminator to avoid computing logpθ\log p_{\boldsymbol{\theta}} directly.

Diffusion models (DDPM, Ho et al., 2020). The DDPM ELBO is a sum of KL terms:

-\log p_{\boldsymbol{\theta}}(\mathbf{x}_0) \le \mathcal{L} = D_{\mathrm{KL}}(q(\mathbf{x}_T|\mathbf{x}_0)\|p(\mathbf{x}_T)) + \sum_{t=2}^T D_{\mathrm{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\|p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)) - \log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)

Each reverse diffusion step pθ(xt1xt)p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t) is trained to minimize the KL divergence from the (analytically tractable) true reverse q(xt1xt,x0)q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0). Because both distributions are Gaussian (by the Markov structure), each per-step KL has a closed-form expression in terms of the noise predictor ϵθ\boldsymbol{\epsilon}_{\boldsymbol{\theta}}. The simplified training objective is equivalent to predicting the noise ϵ\boldsymbol{\epsilon} - which is the standard diffusion training loss.
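To see why these per-step KLs reduce to a mean-squared error: for Gaussians with a shared isotropic variance, the KL is just a scaled squared distance between means. A one-line check (toy means and variance):

import torch

def gaussian_kl_shared_var(mu_q, mu_p, sigma2):
    # D_KL(N(mu_q, s^2 I) || N(mu_p, s^2 I)) = ||mu_q - mu_p||^2 / (2 s^2)
    return ((mu_q - mu_p) ** 2).sum(dim=-1) / (2 * sigma2)

mu_q = torch.tensor([[0.5, -0.2, 0.1]])  # true reverse-step mean (toy)
mu_p = torch.tensor([[0.4, -0.1, 0.0]])  # model's predicted mean (toy)
print(gaussian_kl_shared_var(mu_q, mu_p, sigma2=0.01))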

10. Common Mistakes

# | Mistake | Why It's Wrong | Fix
1 | Writing $D_{\mathrm{KL}}(q \| p)$ when you mean forward KL (truth -> approximation) | The first argument is the reference (truth); forward KL is $D_{\mathrm{KL}}(p_{\mathrm{truth}} \| p_{\mathrm{approx}})$ | Always identify which distribution is the truth and which is the approximation; check: is the expectation under truth or approximation?
2 | Assuming KL is symmetric: $D_{\mathrm{KL}}(p\|q) = D_{\mathrm{KL}}(q\|p)$ | KL is not symmetric; the two directions have fundamentally different behaviors | Never swap $p$ and $q$ without checking both directions; use Jeffreys divergence or JSD if you need symmetry
3 | Ignoring support mismatch: computing $D_{\mathrm{KL}}$ when $q(x) = 0$ for some $x$ with $p(x) > 0$ | $D_{\mathrm{KL}}(p\|q) = +\infty$ when $p \not\ll q$; any finite computed value is an artifact | Add Laplace smoothing or use $\mathrm{JSD}$ / $\mathrm{TV}$ for distributions with potentially different supports
4 | Treating $D_{\mathrm{KL}}(p\|q) = 0$ as equivalent to $p = q$ under numerical approximations | Finite-precision computation gives $D_{\mathrm{KL}} \approx 0$ even when $p \ne q$ due to rounding | Use $D_{\mathrm{KL}} \le \epsilon$ (small threshold) rather than exact zero; check sample moments separately
5 | Using KL as a distance metric satisfying the triangle inequality | $D_{\mathrm{KL}}(p\|r)$ can exceed $D_{\mathrm{KL}}(p\|q) + D_{\mathrm{KL}}(q\|r)$; KL is not a metric | Use total variation or $\sqrt{\mathrm{JSD}}$ if you need a proper metric
6 | Confusing minimizing $D_{\mathrm{KL}}(p\|q)$ over $q$ with minimizing over $p$ | MLE minimizes forward KL over $\boldsymbol{\theta}$ (model parameters); variational inference minimizes reverse KL over $q$ | Always identify which argument is the optimization variable
7 | Computing cross-entropy loss and calling it KL divergence | Cross-entropy $H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q)$; they differ by $H(p)$ | They're equal only when $p$ is one-hot ($H(p) = 0$); for soft labels, $H(p,q) > D_{\mathrm{KL}}(p\|q)$
8 | Expecting the VAE KL term to be zero at optimum | The KL term equals $D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x})\|p(\mathbf{z}))$; it is zero only if the posterior equals the prior (posterior collapse) | A healthy VAE keeps this term strictly positive; treat near-zero KL as a posterior-collapse warning
9 | Using forward KL for variational inference | Forward KL $\mathbb{E}_p[\log(p/q)]$ requires samples from the intractable $p(\mathbf{z}|\mathbf{x})$; the ELBO uses reverse KL, which only requires samples from $q$ | Minimize the reverse direction $D_{\mathrm{KL}}(q\|p)$ (the ELBO) when the target is intractable
10 | Claiming the RLHF KL penalty "forces" the model to stay close to the reference | The KL penalty softly penalizes deviation; with a large enough reward, the policy can still deviate significantly | The KL penalty is a soft regularizer, not a hard constraint; monitor $D_{\mathrm{KL}}(\pi\|\pi_{\mathrm{ref}})$ during training
11 | Conflating Renyi divergence with KL divergence | Renyi $D_\alpha(p\|q)$ only equals KL in the limit $\alpha \to 1$; for $\alpha \ne 1$ they have different properties | Use Renyi divergence when differential-privacy accounting is needed; use KL for standard information-theoretic arguments
12 | Forgetting the $\ln 2$ factor when converting between nats and bits | $D_{\mathrm{KL}}$ in nats $=$ $D_{\mathrm{KL}}$ in bits $\times \ln 2 \approx 0.693 \times$ bits | Always specify units; when comparing with cross-entropy in bits (binary classification), convert to nats or vice versa

11. Exercises

Exercise 1 (*) - Non-negativity and equality condition. Let p=(0.4,0.3,0.2,0.1)p = (0.4, 0.3, 0.2, 0.1) and q=(0.1,0.2,0.3,0.4)q = (0.1, 0.2, 0.3, 0.4).

(a) Compute DKL(pq)D_{\mathrm{KL}}(p\|q) and DKL(qp)D_{\mathrm{KL}}(q\|p) by hand (use ln\ln, give exact values and numerical approximations).

(b) Verify that both are non-negative. Which is larger?

(c) Find a distribution rr on four symbols such that DKL(pr)=0D_{\mathrm{KL}}(p\|r) = 0. Is rr unique?

(d) Using only the convexity of ln-\ln, write a self-contained proof that DKL(pq)0D_{\mathrm{KL}}(p\|q) \ge 0 for your specific pp and qq.


Exercise 2 (*) - KL between Gaussians. Let p=N(μ1,σ12)p = \mathcal{N}(\mu_1, \sigma_1^2) and q=N(μ2,σ22)q = \mathcal{N}(\mu_2, \sigma_2^2).

(a) Starting from the definition, derive the closed form:

DKL(pq)=lnσ2σ1+σ12+(μ1μ2)22σ2212D_{\mathrm{KL}}(p\|q) = \ln\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}

(b) Compute DKL(N(1,2)N(0,1))D_{\mathrm{KL}}(\mathcal{N}(1,2)\|\mathcal{N}(0,1)) and DKL(N(0,1)N(1,2))D_{\mathrm{KL}}(\mathcal{N}(0,1)\|\mathcal{N}(1,2)). Are they equal?

(c) Show that when σ1=σ2=σ\sigma_1 = \sigma_2 = \sigma, the formula reduces to DKL(pq)=(μ1μ2)2/(2σ2)D_{\mathrm{KL}}(p\|q) = (\mu_1-\mu_2)^2/(2\sigma^2) - the squared Mahalanobis distance.

(d) For the VAE case p=N(μ,σ2)p = \mathcal{N}(\mu, \sigma^2), q=N(0,1)q = \mathcal{N}(0,1): verify the simplified formula 12(μ2+σ2lnσ21)\frac{1}{2}(\mu^2 + \sigma^2 - \ln\sigma^2 - 1) and check that it is 0 when μ=0\mu = 0, σ=1\sigma = 1.


Exercise 3 (*) - Forward vs reverse KL. Let p(x)=0.5N(x;2,0.25)+0.5N(x;2,0.25)p(x) = 0.5\,\mathcal{N}(x;-2,0.25) + 0.5\,\mathcal{N}(x;2,0.25) (bimodal) and qμ(x)=N(x;μ,σ2)q_\mu(x) = \mathcal{N}(x;\mu,\sigma^2).

(a) Show that the minimizer of DKL(pqμ,σ2)D_{\mathrm{KL}}(p\|q_{\mu,\sigma^2}) over (μ,σ2)(\mu,\sigma^2) is μ=0\mu^* = 0, σ2=4.25\sigma^{*2} = 4.25. (Hint: moment matching.)

(b) Numerically estimate DKL(qμp)D_{\mathrm{KL}}(q_\mu\|p) on a grid of μ[3,3]\mu \in [-3, 3] using 1000 samples. Show the two local minima near μ=±2\mu = \pm 2.

(c) Explain in one paragraph: if you were training a VAE generative model and the true data distribution is pp, which direction would you minimize, and what would the generator look like?


Exercise 4 (**) - Chain rule decomposition. Let $P(X, Y)$ be a joint distribution on $\{0,1\}^2$ with $P(0,0) = 0.4$, $P(0,1) = 0.1$, $P(1,0) = 0.1$, $P(1,1) = 0.4$, and $Q(X,Y)$ uniform on $\{0,1\}^2$ ($= 0.25$ each).

(a) Compute DKL(P(X,Y)Q(X,Y))D_{\mathrm{KL}}(P(X,Y)\|Q(X,Y)) directly.

(b) Compute DKL(PXQX)D_{\mathrm{KL}}(P_X\|Q_X) and EPX[DKL(PYXQYX)]\mathbb{E}_{P_X}[D_{\mathrm{KL}}(P_{Y|X}\|Q_{Y|X})] separately.

(c) Verify that (a) == (b): the chain rule holds.

(d) Interpret: which direction (marginal or conditional) contributes more to the total KL? What does this mean for the structure of PP vs QQ?


Exercise 5 (**) - f-Divergence family. Consider four f-divergences on Bernoulli distributions $p = \text{Bern}(\theta)$ and $q = \text{Bern}(0.5)$ as functions of $\theta \in (0,1)$.

(a) Compute and plot (as functions of θ\theta): KL divergence DKL(pq)D_{\mathrm{KL}}(p\|q), reverse KL DKL(qp)D_{\mathrm{KL}}(q\|p), total variation TV(p,q)\mathrm{TV}(p,q), and Jensen-Shannon JSD(p,q)\mathrm{JSD}(p,q).

(b) Verify that Pinsker's inequality TV(p,q)212DKL(pq)\mathrm{TV}(p,q)^2 \le \frac{1}{2}D_{\mathrm{KL}}(p\|q) holds for all θ\theta.

(c) Which divergence is symmetric? Which is bounded? Which is asymmetric?

(d) For θ=0.01\theta = 0.01 (near-deterministic pp): compute all four divergences. Why does KL blow up while JSD stays bounded?


Exercise 6 (**) - ELBO and KL decomposition. A simple VAE has prior $p(z) = \mathcal{N}(0,1)$ and likelihood $p_\theta(x|z) = \mathcal{N}(z, 0.1)$. The encoder is $q_\phi(z|x) = \mathcal{N}(\phi \cdot x, 0.5)$ for a scalar parameter $\phi$.

(a) For x=2x = 2: compute the ELBO as a function of ϕ\phi. Use the Gaussian KL formula.

(b) Show that the ELBO decomposes as: reconstruction term (2ϕ2)2+0.52×0.1-\frac{(2 - \phi\cdot 2)^2 + 0.5}{2 \times 0.1} (plus constants) minus the KL term 12(ϕ24+0.5ln0.51)\frac{1}{2}(\phi^2 \cdot 4 + 0.5 - \ln 0.5 - 1).

(c) Find the ϕ\phi that maximizes the ELBO. What does this optimal encoder do geometrically?

(d) As the likelihood variance σ20\sigma^2 \to 0 (decoder becomes very powerful): what happens to the optimal ϕ\phi? Does posterior collapse occur?


Exercise 7 (***) - RLHF optimal policy. A language model has reference policy $\pi_{\mathrm{ref}}(y|x)$ and reward $r(x,y)$. The RLHF objective is $\max_\pi \mathbb{E}[r(x,y)] - \beta D_{\mathrm{KL}}(\pi\|\pi_{\mathrm{ref}})$.

(a) Show that the unconstrained optimizer is π(yx)=1Z(x)πref(yx)er(x,y)/β\pi^*(y|x) = \frac{1}{Z(x)}\pi_{\mathrm{ref}}(y|x)e^{r(x,y)/\beta} by taking the functional derivative of the objective with respect to π(yx)\pi(y|x) (subject to yπ(yx)=1\sum_y \pi(y|x) = 1).

(b) Show that at the optimum, the KL divergence is DKL(ππref)=1βEπ[r]logZ(x)D_{\mathrm{KL}}(\pi^*\|\pi_{\mathrm{ref}}) = \frac{1}{\beta}\mathbb{E}_{\pi^*}[r] - \log Z(x).

(c) Derive the DPO loss: given preference data (x,yw,yl)(x, y_w, y_l) where ywyly_w \succ y_l, show that the Bradley-Terry preference model P(ywylx)=σ(βlogπ(ywx)π(ylx)βlogπref(ywx)πref(ylx))P(y_w \succ y_l|x) = \sigma(\beta\log\frac{\pi^*(y_w|x)}{\pi^*(y_l|x)} - \beta\log\frac{\pi_{\mathrm{ref}}(y_w|x)}{\pi_{\mathrm{ref}}(y_l|x)}) leads to the DPO loss.

(d) Numerically: for β=0.1\beta = 0.1, πref(A)=0.6\pi_{\mathrm{ref}}(A) = 0.6, πref(B)=0.4\pi_{\mathrm{ref}}(B) = 0.4, r(A)=2r(A) = 2, r(B)=0r(B) = 0: compute π\pi^* and the resulting DKL(ππref)D_{\mathrm{KL}}(\pi^*\|\pi_{\mathrm{ref}}).


Exercise 8 (***) - Knowledge distillation analysis. A teacher model outputs logits $\mathbf{z}_T = (3, 1, -2)$ for three classes and a student outputs $\mathbf{z}_S = (2, 0, -1)$.

(a) Compute pT=softmax(zT)p_T = \text{softmax}(\mathbf{z}_T) and pS=softmax(zS)p_S = \text{softmax}(\mathbf{z}_S).

(b) Compute DKL(pTpS)D_{\mathrm{KL}}(p_T\|p_S) (forward KL, what distillation minimizes) and DKL(pSpT)D_{\mathrm{KL}}(p_S\|p_T) (reverse KL).

(c) Now compute soft labels at temperature τ=3\tau = 3: pT(τ)p_T^{(\tau)} and pS(τ)p_S^{(\tau)}. Recompute both KL divergences. How does temperature affect the magnitude of the KL and the information in the soft labels?

(d) What is the total distillation loss: L=λDKL(pT(τ)pS(τ))+(1λ)H(ytrue,pS)\mathcal{L} = \lambda D_{\mathrm{KL}}(p_T^{(\tau)}\|p_S^{(\tau)}) + (1-\lambda)H(\mathbf{y}_{\mathrm{true}}, p_S) for λ=0.7\lambda = 0.7 and ytrue=(1,0,0)\mathbf{y}_{\mathrm{true}} = (1, 0, 0)?

(e) Argue why forward KL (teacher to student) is used rather than reverse KL. What kind of "dark knowledge" is preserved?

12. Why This Matters for AI (2026 Perspective)

Concept | AI Impact
Forward KL = MLE | Every language model trained by next-token prediction (GPT-4, LLaMA-3, Claude, Gemini, Mistral) minimizes $D_{\mathrm{KL}}(p_{\mathrm{data}}\|p_{\boldsymbol{\theta}})$; cross-entropy loss IS KL divergence up to a constant
ELBO = KL decomposition | All VAE-based generative models (image, speech, latent diffusion) optimize reconstruction + KL; the KL term determines latent space structure
RLHF KL penalty | All instruction-tuned LLMs (InstructGPT, Claude, Gemini) use $\beta D_{\mathrm{KL}}(\pi\|\pi_{\mathrm{ref}})$ to prevent reward hacking; $\beta$ is one of the most important hyperparameters in alignment
DPO implicit KL | DPO (used in LLaMA-2-Chat, Mistral-Instruct) reframes RLHF as direct KL-constrained optimization; the log-ratio $\log(\pi/\pi_{\mathrm{ref}})$ IS the implicit reward
Forward vs reverse KL | Choosing forward KL (MLE, distillation) vs reverse KL (VI, mean-field) is the defining design decision of probabilistic ML; the wrong choice leads to blurry outputs or posterior collapse
Knowledge distillation | LLM compression (DistilGPT, Phi-2, Phi-3) uses forward KL from teacher to student; soft labels at temperature $\tau > 1$ carry far more information than one-hot labels
Posterior collapse | When the decoder is powerful, the VAE KL term collapses to zero - a major pathology in text VAEs (Bowman et al., 2016); fixed by KL annealing or $\beta$-VAE weighting
PPO trust region | PPO's clip objective approximates a KL trust region; the $\varepsilon \in \{0.1, 0.2\}$ clip threshold corresponds to a specific KL budget per update step
GAN = JSD minimization | Original GAN (Goodfellow, 2014) minimizes JSD; JSD's saturation when supports are disjoint causes vanishing gradients and mode collapse - motivating WGAN/Wasserstein distance
Renyi DP for private training | Differential privacy in LLM fine-tuning (DP-SGD, used at Google) uses Renyi divergence; Renyi DP composes additively, making privacy budgets tractable
Diffusion ELBO | The DDPM training objective is a sum of per-step KL divergences; the connection to score matching makes the KL formulation equivalent to noise prediction
Natural gradient | K-FAC (second-order optimizer for neural networks) approximates the Fisher matrix $F = \mathbb{E}[\nabla\log p\, \nabla\log p^\top]$, the local KL curvature; natural gradient descent is the geometrically correct first-order method on the statistical manifold
Calibration | $D_{\mathrm{KL}}(p_{\mathrm{true}}\|p_{\boldsymbol{\theta}}) = 0$ iff the model matches the true distribution exactly; calibration regularization (temperature scaling, label smoothing) implicitly minimizes KL between predicted and calibrated distributions

13. Conceptual Bridge

KL divergence sits at the center of the information theory chapter, connecting the individual uncertainty measured by entropy (Section 01) to the distributional relationships measured by mutual information (Section 03), the training objectives formalized as cross-entropy (Section 04), and the local geometry captured by Fisher information (Section 05). It is not an isolated concept - it is the engine that drives all four.

Backward: from Entropy. In Section 01, entropy H(X)=plogpH(X) = -\sum p\log p measured the intrinsic uncertainty of a single distribution. KL divergence extends this to pairs of distributions: DKL(pq)=H(p,q)H(p)D_{\mathrm{KL}}(p\|q) = H(p,q) - H(p) measures the extra uncertainty you incur by confusing qq for pp. The non-negativity of KL proved the fundamental bound H(X)logXH(X) \le \log|\mathcal{X}| that appeared in Section 01 - Gibbs' inequality was already implicitly used there. Every result about optimal codes (Huffman, arithmetic coding) is ultimately about minimizing KL between the empirical distribution and the code-induced distribution.

Forward: to Mutual Information. Mutual information $I(X;Y) = D_{\mathrm{KL}}(P(X,Y)\|P(X)P(Y))$ (Section 03) is a KL divergence measuring the dependence between two variables. The chain rule for KL (Section 3.5) becomes the chain rule for entropy when applied to the independence gap. The data processing inequality for KL becomes the DPI for mutual information: $I(X;Z) \le I(X;Y)$ whenever $X \to Y \to Z$ forms a Markov chain ($Z$ depends on $X$ only through $Y$). Understanding KL makes mutual information, and all the contrastive learning objectives built on it (InfoNCE, SimCLR, CLIP), immediately transparent.

Forward: to Cross-Entropy and Fisher Information. Cross-entropy H(p,q)H(p,q) is KL plus a constant (Section 04); Fisher information is KL's local curvature (Section 05). The three-way connection H(p,q)=H(p)+DKL(pq)H(p)+12δFδH(p,q) = H(p) + D_{\mathrm{KL}}(p\|q) \approx H(p) + \frac{1}{2}\delta^\top F\delta (for small perturbations) shows that these are all the same object at different scales: global (cross-entropy), local (Fisher).

KL DIVERGENCE IN THE INFORMATION THEORY CHAPTER


  Chapter 09 - Information Theory

    Section 01 ENTROPY H(X)
          |  "extend to pairs of distributions"
          v
    Section 02 KL DIVERGENCE D_KL(p || q)      <- YOU ARE HERE
          |
          +--> Section 03 MUTUAL INFORMATION   I(X;Y) = D_KL(P(X,Y) || P(X)P(Y))
          +--> Section 04 CROSS-ENTROPY        H(p,q) = H(p) + D_KL(p || q)
          +--> Section 05 FISHER INFORMATION   F(theta) = local KL curvature


  ML CONNECTIONS:

     D_KL(p_data || p_theta)       ->  Language model training (MLE)
     D_KL(q(z|x) || p(z))          ->  VAE regularizer (ELBO)
     D_KL(pi_theta || pi_ref)      ->  RLHF / PPO / DPO alignment
     D_KL(p_teacher || p_student)  ->  Knowledge distillation
     D_KL(q || p_posterior)        ->  Variational inference (VI)


  CHAPTER POSITION:

  09-01 Entropy
      |
  09-02 KL Divergence  <- current section
      |
  09-03 Mutual Information  (KL between joint and product of marginals)
      |
  09-04 Cross-Entropy       (H(p,q) = H(p) + D_KL(p || q))
      |
  09-05 Fisher Information  (local KL curvature = Riemannian metric)


Looking ahead beyond this chapter: KL divergence appears again in Chapter 12 (Functional Analysis), where it generalizes to the RKHS setting and connects to kernel methods and maximum mean discrepancy (MMD). It appears in Chapter 21 (Statistical Learning Theory) as a key quantity in PAC-Bayes bounds: the generalization gap of a stochastic predictor is bounded by DKL(QP)/nD_{\mathrm{KL}}(Q\|P)/n where QQ is the posterior and PP is the prior on parameters. And in Chapter 22 (Causal Inference), KL divergence under interventional vs observational distributions quantifies the "causal effect" of an intervention in terms of distributional shift.

The single formula xp(x)log(p(x)/q(x))\sum_x p(x)\log(p(x)/q(x)) - first written by Kullback and Leibler in 1951 - underlies more of modern AI than perhaps any other equation in this curriculum.


Appendix A: Detailed Proofs

A.1 Proof of Non-Negativity via Log-Sum Inequality

The log-sum inequality provides an alternative, direct proof of Gibbs' inequality without invoking Jensen's explicitly.

Log-Sum Inequality. For non-negative reals a1,,ana_1, \ldots, a_n and b1,,bnb_1, \ldots, b_n:

i=1nailogaibi(i=1nai)logi=1naii=1nbi\sum_{i=1}^n a_i \log\frac{a_i}{b_i} \ge \left(\sum_{i=1}^n a_i\right)\log\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}

with equality iff all ai/bia_i/b_i are equal.

Proof of log-sum inequality. Let $A = \sum_i a_i$ and $B = \sum_j b_j$. The function $f(t) = t\log t$ is convex (since $f''(t) = 1/t > 0$). By Jensen's inequality with weights $b_i/B$ applied to the points $t_i = a_i/b_i$:

\frac{1}{B}\sum_i b_i\, f\!\left(\frac{a_i}{b_i}\right) \ge f\!\left(\sum_i \frac{b_i}{B}\cdot\frac{a_i}{b_i}\right) = f\!\left(\frac{A}{B}\right) = \frac{A}{B}\log\frac{A}{B}

The left side equals $\frac{1}{B}\sum_i a_i\log\frac{a_i}{b_i}$; multiplying both sides by $B$ gives the result. $\square$

KL non-negativity from log-sum inequality. Apply with ai=p(xi)a_i = p(x_i) and bi=q(xi)b_i = q(x_i): since pi=1=qi\sum p_i = 1 = \sum q_i:

DKL(pq)=ipilogpiqi(ipi)logipiiqi=1log11=0D_{\mathrm{KL}}(p\|q) = \sum_i p_i\log\frac{p_i}{q_i} \ge \left(\sum_i p_i\right)\log\frac{\sum_i p_i}{\sum_i q_i} = 1 \cdot \log\frac{1}{1} = 0 \quad\square

A.2 Convexity of KL: Detailed Proof

We prove that (p,q)DKL(pq)(p,q) \mapsto D_{\mathrm{KL}}(p\|q) is jointly convex using the perspective function argument.

Lemma. The function g(s,t)=slog(s/t)g(s,t) = s\log(s/t) (with g(0,t)=0g(0,t) = 0, g(s,0)=+g(s,0) = +\infty for s>0s > 0) is jointly convex on {(s,t):s0,t>0}\{(s,t) : s \ge 0, t > 0\}.

Proof. g(s,t)=slogsslogtg(s,t) = s\log s - s\log t. We need Hg0H_g \succeq 0. The Hessian is:

Hg=(1/s1/t1/ts/t2)H_g = \begin{pmatrix} 1/s & -1/t \\ -1/t & s/t^2 \end{pmatrix}

Determinant =s/t21/s1/t2=1/t21/t2=0= s/t^2 \cdot 1/s - 1/t^2 = 1/t^2 - 1/t^2 = 0. Trace =1/s+s/t2>0= 1/s + s/t^2 > 0. PSD since det=00\det = 0 \ge 0 and trace >0> 0. \square

Since DKL(pq)=xg(p(x),q(x))D_{\mathrm{KL}}(p\|q) = \sum_x g(p(x), q(x)) and sums of jointly convex functions are jointly convex, DKLD_{\mathrm{KL}} is jointly convex. \square

A.3 Pinsker's Inequality

Theorem (Pinsker's Inequality). TV(p,q)12DKL(pq)\mathrm{TV}(p,q) \le \sqrt{\frac{1}{2}D_{\mathrm{KL}}(p\|q)}.

Proof. The key step is the binary (two-point) inequality: for $a, b \in [0,1]$, working in nats,

a\log\frac{a}{b} + (1-a)\log\frac{1-a}{1-b} \ge 2(a-b)^2

To verify it, fix $a$ and let $g(b)$ be the difference of the two sides; then $g(a) = 0$ and $g'(b) = (b-a)\left[\frac{1}{b(1-b)} - 4\right]$, which has the sign of $b - a$ because $b(1-b) \le 1/4$ - so $g \ge 0$ everywhere.

Now take the event $A = \{x : p(x) > q(x)\}$, which achieves $p(A) - q(A) = \mathrm{TV}(p,q)$. Coarsening to the binary partition $\{A, A^c\}$ can only decrease KL (data processing), so:

D_{\mathrm{KL}}(p\|q) \ge D_{\mathrm{KL}}(\text{Bern}(p(A))\|\text{Bern}(q(A))) \ge 2(p(A) - q(A))^2 = 2\,\mathrm{TV}(p,q)^2

Rearranging: $\mathrm{TV}(p,q) \le \sqrt{\frac{1}{2}D_{\mathrm{KL}}(p\|q)}$. $\square$

(The related Bretagnolle-Huber inequality $\mathrm{TV}(p,q) \le \sqrt{1 - e^{-D_{\mathrm{KL}}(p\|q)}}$ is weaker for small KL but stays informative when the KL is large, where Pinsker's bound becomes vacuous.)

Pinsker's inequality in AI: KL divergence is continuous and differentiable (gradients exist for optimization), while total variation is an event-level bound. Pinsker's lets us translate KL bounds from optimization theory into guarantees about individual event probabilities - useful in PAC-Bayes learning theory.

A.4 KL Divergence for the Gaussian: Step-by-Step

We derive the multivariate Gaussian KL formula in detail.

Let p=N(μ1,Σ1)p = \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1) and q=N(μ2,Σ2)q = \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2) on Rd\mathbb{R}^d. The log-ratio is:

logp(x)q(x)=12(xμ1)Σ11(xμ1)+12(xμ2)Σ21(xμ2)+12logdetΣ2detΣ1\log\frac{p(\mathbf{x})}{q(\mathbf{x})} = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^\top\Sigma_1^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^\top\Sigma_2^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) + \frac{1}{2}\log\frac{\det\Sigma_2}{\det\Sigma_1}

Taking expectation under pp, using:

  • Ep[(xμ1)Σ11(xμ1)]=tr(Σ11Σ1)=d\mathbb{E}_p[(\mathbf{x}-\boldsymbol{\mu}_1)^\top\Sigma_1^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)] = \operatorname{tr}(\Sigma_1^{-1}\Sigma_1) = d
  • Ep[(xμ2)Σ21(xμ2)]=tr(Σ21Σ1)+(μ1μ2)Σ21(μ1μ2)\mathbb{E}_p[(\mathbf{x}-\boldsymbol{\mu}_2)^\top\Sigma_2^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)] = \operatorname{tr}(\Sigma_2^{-1}\Sigma_1) + (\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^\top\Sigma_2^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)

gives:

DKL(pq)=12[tr(Σ21Σ1)+(μ2μ1)Σ21(μ2μ1)d+logdetΣ2detΣ1]D_{\mathrm{KL}}(p\|q) = \frac{1}{2}\left[\operatorname{tr}(\Sigma_2^{-1}\Sigma_1) + (\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^\top\Sigma_2^{-1}(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1) - d + \log\frac{\det\Sigma_2}{\det\Sigma_1}\right]

The bracketed terms group into three effects:

  1. $\operatorname{tr}(\Sigma_2^{-1}\Sigma_1) - d$: mismatch between the covariances (zero when $\Sigma_1 = \Sigma_2$)
  2. $(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^\top\Sigma_2^{-1}(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)$: squared Mahalanobis distance between the means
  3. $\log(\det\Sigma_2/\det\Sigma_1)$: log ratio of volumes (zero when $\det\Sigma_1 = \det\Sigma_2$)

Appendix B: KL Divergence Computations and Reference Formulas

B.1 Quick Reference: KL Formulas for Common Distributions

KL DIVERGENCE CLOSED FORMS - QUICK REFERENCE


  SCALAR GAUSSIAN:
  D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))
      = log(sigma2/sigma1) + (sigma1^2 + (mu1 - mu2)^2) / (2 sigma2^2) - 1/2

  VAE ENCODER -> STANDARD NORMAL:
  D_KL(N(mu, sigma^2) || N(0, 1)) = (1/2)(mu^2 + sigma^2 - log sigma^2 - 1)

  MULTIVARIATE GAUSSIAN:
  D_KL(N(mu1, Sigma1) || N(mu2, Sigma2))
      = (1/2)[tr(Sigma2^{-1} Sigma1) + (mu2 - mu1)^T Sigma2^{-1} (mu2 - mu1) - d
              + log(det Sigma2 / det Sigma1)]

  BERNOULLI:
  D_KL(Bern(p) || Bern(q)) = p log(p/q) + (1-p) log((1-p)/(1-q))

  CATEGORICAL (general discrete):
  D_KL(p || q) = sum_k p_k log(p_k / q_k)

  POISSON:
  D_KL(Poisson(lambda1) || Poisson(lambda2)) = lambda1 log(lambda1/lambda2) - lambda1 + lambda2

  EXPONENTIAL:
  D_KL(Exp(lambda1) || Exp(lambda2)) = log(lambda1/lambda2) + lambda2/lambda1 - 1

  BETA:
  D_KL(Beta(alpha1, beta1) || Beta(alpha2, beta2)) = log[B(alpha2, beta2) / B(alpha1, beta1)]
      + (alpha1 - alpha2) psi(alpha1) + (beta1 - beta2) psi(beta1)
      + (alpha2 - alpha1 + beta2 - beta1) psi(alpha1 + beta1)
      where psi = digamma function, B = beta function


B.2 Worked Example: KL Between Poisson Distributions

A Poisson distribution Pois(λ)\text{Pois}(\lambda) has PMF p(k)=eλλk/k!p(k) = e^{-\lambda}\lambda^k/k! for k=0,1,2,k = 0,1,2,\ldots

DKL(Pois(λ1)Pois(λ2))D_{\mathrm{KL}}(\text{Pois}(\lambda_1)\|\text{Pois}(\lambda_2)):

\log\frac{p_{\lambda_1}(k)}{p_{\lambda_2}(k)} = (\lambda_2 - \lambda_1) + k\log\frac{\lambda_1}{\lambda_2}

D_{\mathrm{KL}} = \mathbb{E}_{\lambda_1}\!\left[(\lambda_2-\lambda_1) + k\log\frac{\lambda_1}{\lambda_2}\right] = (\lambda_2 - \lambda_1) + \lambda_1\log\frac{\lambda_1}{\lambda_2}

For λ1=3,λ2=2\lambda_1 = 3, \lambda_2 = 2: DKL=(23)+3ln(3/2)=1+3(0.405)=0.216D_{\mathrm{KL}} = (2-3) + 3\ln(3/2) = -1 + 3(0.405) = 0.216 nats.

For AI: Poisson distributions model count data (number of events, word counts in LDA topic models). The KL between Poisson distributions appears in the ELBO for Poisson process models and in topic model variational inference.

B.3 Worked Example: KL for Categorical - Softmax Outputs

Let z1=(2.0,1.0,0.1)\mathbf{z}_1 = (2.0, 1.0, 0.1) and z2=(1.5,1.0,0.5)\mathbf{z}_2 = (1.5, 1.0, 0.5) be logit vectors.

Step 1: p1=softmax(z1)p_1 = \text{softmax}(\mathbf{z}_1):

p1=(e2,e1,e0.1)/(e2+e1+e0.1)=(7.389,2.718,1.105)/11.212=(0.659,0.242,0.099)p_1 = (e^2, e^1, e^{0.1})/(e^2 + e^1 + e^{0.1}) = (7.389, 2.718, 1.105)/11.212 = (0.659, 0.242, 0.099)

Step 2: p2=softmax(z2)p_2 = \text{softmax}(\mathbf{z}_2):

p2=(e1.5,e1,e0.5)/(e1.5+e1+e0.5)=(4.482,2.718,1.649)/8.849=(0.507,0.307,0.186)p_2 = (e^{1.5}, e^1, e^{0.5})/(e^{1.5} + e^1 + e^{0.5}) = (4.482, 2.718, 1.649)/8.849 = (0.507, 0.307, 0.186)

Step 3: DKL(p1p2)D_{\mathrm{KL}}(p_1\|p_2):

= 0.659\ln(0.659/0.507) + 0.242\ln(0.242/0.307) + 0.099\ln(0.099/0.186)

= 0.659(0.262) + 0.242(-0.238) + 0.099(-0.631) = 0.173 - 0.058 - 0.062 = 0.053 \text{ nats}

For AI: This is the KL component when computing the distillation loss between a teacher ($p_1$) and student ($p_2$). The small value (0.053 nats) indicates the student's distribution is reasonably close to the teacher's.
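The worked example is easy to reproduce (same logits as above):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z1 = np.array([2.0, 1.0, 0.1])
z2 = np.array([1.5, 1.0, 0.5])
p1, p2 = softmax(z1), softmax(z2)

print(p1.round(3))                            # [0.659 0.242 0.099]
print(p2.round(3))                            # [0.507 0.307 0.186]
print(np.sum(p1 * np.log(p1 / p2)).round(3))  # ~0.053 nats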

B.4 KL Divergence and Hypothesis Testing

The connection between KL divergence and hypothesis testing is both historical (Kullback 1959) and practically important.

Setting: $n$ i.i.d. observations from an unknown distribution. Hypothesis $H_0$: data from $p$; hypothesis $H_1$: data from $q$.

Log-likelihood ratio test statistic:

Λn=i=1nlogp(xi)q(xi)\Lambda_n = \sum_{i=1}^n \log\frac{p(x_i)}{q(x_i)}

By the law of large numbers, Λn/nEp[log(p/q)]=DKL(pq)\Lambda_n/n \to \mathbb{E}_p[\log(p/q)] = D_{\mathrm{KL}}(p\|q) as nn \to \infty when truth is pp.

Stein's lemma (type II error exponent): For fixed type I error probability $\alpha \in (0,1)$, the best achievable type II error probability $\beta_n^*$ (retaining $H_0$ when the data actually come from $q$) decays as:

limn1nlogβn=DKL(pq)\lim_{n\to\infty} \frac{1}{n}\log\beta_n^* = -D_{\mathrm{KL}}(p\|q)

Larger KL divergence -> faster exponential decay of type II error -> easier to distinguish pp from qq with more data.
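A quick simulation of the law-of-large-numbers statement above, with arbitrarily chosen Bernoulli distributions:

import numpy as np

rng = np.random.default_rng(0)
p, q, n = 0.7, 0.4, 200_000

x = rng.random(n) < p                 # n i.i.d. samples from Bern(p)
log_ratio = np.where(x, np.log(p / q), np.log((1 - p) / (1 - q)))
print(log_ratio.mean())               # Lambda_n / n, close to ...

kl = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
print(kl)                             # ... D_KL(p || q) ~ 0.1838 nats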

Chernoff exponent: The best achievable equal-error-rate exponent (minimizing over both type I and II simultaneously) is:

C(p,q) = -\min_{\alpha\in[0,1]}\log\sum_x p(x)^\alpha q(x)^{1-\alpha} = \max_{\alpha\in(0,1)} (1-\alpha)\, D_\alpha(p\|q)

the maximum of the weighted Renyi divergence $(1-\alpha)D_\alpha(p\|q)$ over orders $\alpha \in (0,1)$.

For AI: The hypothesis-testing perspective explains why $D_{\mathrm{KL}}(\pi\|\pi_{\mathrm{ref}})$ in RLHF bounds how detectable the policy shift is: by Stein's lemma, small KL means an observer needs many samples before the fine-tuned model's outputs can be reliably distinguished from the reference model's.


Appendix C: KL Divergence in Broader ML Contexts

C.1 PAC-Bayes Bounds

PAC-Bayes theory (McAllester 1999; Seeger 2002) gives generalization bounds for stochastic predictors. The key theorem:

PAC-Bayes Theorem. Let PP be a prior over hypotheses (fixed before seeing data) and QQ be any posterior (possibly data-dependent). For any δ>0\delta > 0, with probability 1δ\ge 1 - \delta over the training set:

EhQ[L(h)]EhQ[L^(h)]+DKL(QP)+log(2n/δ)2n\mathbb{E}_{h \sim Q}[\mathcal{L}(h)] \le \mathbb{E}_{h \sim Q}[\hat{\mathcal{L}}(h)] + \sqrt{\frac{D_{\mathrm{KL}}(Q\|P) + \log(2\sqrt{n}/\delta)}{2n}}

where L(h)\mathcal{L}(h) is the true loss and L^(h)\hat{\mathcal{L}}(h) is the empirical loss.

Interpretation: The generalization gap is bounded by DKL(QP)/n\sqrt{D_{\mathrm{KL}}(Q\|P)/n}. The posterior QQ can be anything - including a specific trained neural network - and the bound holds. This shows that models that stay "close" to a simple prior in KL generalize well. Modern applications: PAC-Bayes bounds for LoRA-fine-tuned LLMs (where P=P = pretrained weights, Q=Q = fine-tuned weights).

C.2 KL in Variational Inference for Bayesian Neural Networks

Bayesian neural networks (BNNs) maintain distributions over weights θ\boldsymbol{\theta} rather than point estimates. The posterior p(θD)p(\boldsymbol{\theta}|\mathcal{D}) is intractable for large networks. Variational inference minimizes:

DKL(q(θ)p(θD))=DKL(q(θ)p(θ))Eq[logp(Dθ)]+logp(D)D_{\mathrm{KL}}(q(\boldsymbol{\theta})\|p(\boldsymbol{\theta}|\mathcal{D})) = D_{\mathrm{KL}}(q(\boldsymbol{\theta})\|p(\boldsymbol{\theta})) - \mathbb{E}_q[\log p(\mathcal{D}|\boldsymbol{\theta})] + \log p(\mathcal{D})

The ELBO is: L=Eq[logp(Dθ)]DKL(q(θ)p(θ))\mathcal{L} = \mathbb{E}_q[\log p(\mathcal{D}|\boldsymbol{\theta})] - D_{\mathrm{KL}}(q(\boldsymbol{\theta})\|p(\boldsymbol{\theta})). The KL term DKL(qp)D_{\mathrm{KL}}(q\|p) acts as a regularizer: it penalizes variational distributions that deviate from the prior p(θ)p(\boldsymbol{\theta}).

Practical implementations: Bayes By Backprop (Blundell et al., 2015) parameterizes $q(\boldsymbol{\theta}) = \prod_i \mathcal{N}(\mu_i, \sigma_i^2)$ and trains with the reparameterization trick: each weight is sampled as $\theta_i = \mu_i + \sigma_i\varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0,1)$. The KL term per weight is $\frac{1}{2}(\mu_i^2/\sigma_p^2 + \sigma_i^2/\sigma_p^2 - 1 - \log(\sigma_i^2/\sigma_p^2))$ for a Gaussian prior $\mathcal{N}(0, \sigma_p^2)$.
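A minimal sketch of that per-weight KL term (the variational parameters below are toy values):

import torch

def bnn_kl_per_weight(mu, log_var, prior_var=1.0):
    """KL(N(mu, sigma^2) || N(0, prior_var)) for each variational weight."""
    var = log_var.exp()
    log_prior_var = torch.log(torch.tensor(prior_var))
    return 0.5 * (mu**2 / prior_var + var / prior_var - 1.0
                  - (log_var - log_prior_var))

mu = torch.tensor([0.0, 0.5, -1.0])
log_var = torch.log(torch.tensor([1.0, 0.25, 1.0]))
print(bnn_kl_per_weight(mu, log_var))  # first weight matches the prior: KL = 0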

C.3 KL in Contrastive Learning and Representation

Contrastive learning methods like SimCLR (Chen et al., 2020) and MoCo (He et al., 2020) maximize the lower bound on mutual information:

I(X;Z)E[InfoNCE(X,Z)]=E ⁣[logef(x,z)1Nj=1Nef(x,zj)]I(X; Z) \ge \mathbb{E}[\text{InfoNCE}(X, Z)] = \mathbb{E}\!\left[\log\frac{e^{f(x,z)}}{\frac{1}{N}\sum_{j=1}^N e^{f(x,z_j)}}\right]

The InfoNCE loss is a lower bound on I(X;Z)I(X;Z), which is a KL divergence DKL(P(X,Z)P(X)P(Z))D_{\mathrm{KL}}(P(X,Z)\|P(X)P(Z)) (see Section 09-03). CLIP (Radford et al., 2021) uses a symmetric InfoNCE to align image and text embeddings - which is equivalent to minimizing KL between the joint image-text distribution and the product of marginals.

CPC (Contrastive Predictive Coding, van den Oord et al., 2018): Used in speech (wav2vec 2.0, HuBERT) and images. The representation ztz_t at time tt predicts future zt+kz_{t+k} via a density ratio estimator. The InfoNCE objective is a lower bound on kI(zt;zt+k)\sum_k I(z_t; z_{t+k}). Training audio encoders this way produces features that capture the semantic structure of speech - because mutual information (= KL divergence) rewards capturing any statistical dependency.

C.4 KL in Score-Based Generative Models

Score-based models (Song & Ermon, 2020) and their diffusion counterparts learn the score function xlogp(x)\nabla_\mathbf{x}\log p(\mathbf{x}) to generate samples via Langevin dynamics. The connection to KL:

Score matching minimizes:

J(θ)=Ep ⁣[12sθ(x)logp(x)2]J(\boldsymbol{\theta}) = \mathbb{E}_p\!\left[\frac{1}{2}\|s_{\boldsymbol{\theta}}(\mathbf{x}) - \nabla\log p(\mathbf{x})\|^2\right]

where sθlogps_{\boldsymbol{\theta}} \approx \nabla\log p is the learned score. Fisher divergence equals:

J(θ)=Ep ⁣[12sθ(x)2+sθ(x)]+constJ(\boldsymbol{\theta}) = \mathbb{E}_p\!\left[\frac{1}{2}\|s_{\boldsymbol{\theta}}(\mathbf{x})\|^2 + \nabla\cdot s_{\boldsymbol{\theta}}(\mathbf{x})\right] + \text{const}

Score matching is equivalent to minimizing a generalization of KL (Fisher divergence: p(x)log(p/q)2dx\int p(\mathbf{x})\|\nabla\log(p/q)\|^2 d\mathbf{x}), which uses the gradient of the log-density ratio rather than the ratio itself. This is the information-geometric connection between score matching and KL divergence.

C.5 KL and Temperature in LLM Sampling

LLM inference involves sampling from pτ(yx)=softmax(z/τ)p_\tau(y|x) = \text{softmax}(\mathbf{z}/\tau) at temperature τ\tau. The KL divergence between the original distribution (τ=1\tau = 1) and the temperature-scaled distribution is:

D_{\mathrm{KL}}(p_1\|p_\tau) = \sum_k p_1(k)\log\frac{p_1(k)}{p_\tau(k)} = \sum_k p_1(k)\left[z_k - \log\sum_j e^{z_j} - \frac{z_k}{\tau} + \log\sum_j e^{z_j/\tau}\right]

As $\tau \to 0$: $p_\tau \to$ one-hot (deterministic, picks the argmax). $D_{\mathrm{KL}}(p_1\|p_\tau) \to +\infty$ whenever $p_1$ puts mass outside the argmax - the one-hot limit has zeros exactly where $p_1$ does not (support mismatch). The reverse direction stays finite: $D_{\mathrm{KL}}(p_\tau\|p_1) \to -\log p_1(\mathrm{argmax})$.

As τ\tau \to \infty: pτp_\tau \to uniform. DKL(p1pτ)logVH(p1)D_{\mathrm{KL}}(p_1\|p_\tau) \to \log|\mathcal{V}| - H(p_1) - the divergence equals the KL from uniform.

Practical implications: The "temperature" knob in LLM APIs (temperature=0.7, top-p=0.9) directly controls the KL divergence from the trained distribution. High temperature increases diversity (reduces KL to uniform); low temperature reduces diversity (pushes toward argmax). The KL from the training distribution is minimized at τ=1\tau = 1.
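A short check of both limits (toy logits over a 4-token vocabulary):

import torch
import torch.nn.functional as F

z = torch.tensor([2.0, 1.0, 0.0, -1.0])   # toy logits
p1 = F.softmax(z, dim=-1)
H1 = -(p1 * p1.log()).sum()               # entropy of the tau = 1 distribution

for tau in (0.1, 0.5, 1.0, 2.0, 100.0):
    p_tau = F.softmax(z / tau, dim=-1)
    print(tau, (p1 * (p1.log() - p_tau.log())).sum().item())

# tau -> 0 blows up (support mismatch); tau = 1 gives 0; tau -> infinity
# approaches log|V| - H(p1):
print((torch.log(torch.tensor(4.0)) - H1).item())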


Appendix D: Numerical Stability and Implementation

D.1 The log-sum-exp Trick for KL Computation

Computing DKL(pq)D_{\mathrm{KL}}(p\|q) with softmax distributions requires care to avoid numerical overflow.

Naive implementation (unstable):

# WRONG - log(softmax(z)) underflows to log(0) = -inf for extreme logits,
# and p * (log(p) - log(q)) then produces nan/inf
kl = torch.sum(p * (torch.log(p) - torch.log(q)))

Stable implementation using log-softmax:

# CORRECT - uses log-space throughout
import torch
import torch.nn.functional as F

def kl_divergence_stable(logits_p, logits_q):
    """
    KL(p || q) where p = softmax(logits_p), q = softmax(logits_q).
    Uses log_softmax for numerical stability.
    """
    log_p = F.log_softmax(logits_p, dim=-1)  # log(softmax) in one numerically safe step
    log_q = F.log_softmax(logits_q, dim=-1)
    p = torch.exp(log_p)
    return torch.sum(p * (log_p - log_q), dim=-1)

PyTorch also provides torch.nn.functional.kl_div(input, target), which computes $\sum \text{target}\,(\log\text{target} - \text{input})$: the first argument must already be log-probabilities ($\log q$), and the arguments are reversed relative to the $D_{\mathrm{KL}}(p\|q)$ notation - it is called as kl_div(log_q, p).

D.2 Estimating KL from Samples

When only samples are available (not closed-form distributions), KL must be estimated.

Monte Carlo estimator. Given samples x1,,xnpx_1, \ldots, x_n \sim p and access to logp(x)\log p(x) and logq(x)\log q(x):

D^KL(pq)=1ni=1nlogp(xi)q(xi)\hat{D}_{\mathrm{KL}}(p\|q) = \frac{1}{n}\sum_{i=1}^n \log\frac{p(x_i)}{q(x_i)}

This is unbiased: E[D^]=DKL(pq)\mathbb{E}[\hat{D}] = D_{\mathrm{KL}}(p\|q).
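A quick check of the estimator against the closed-form scalar Gaussian KL (Appendix A.4):

import numpy as np

rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 1.0, np.sqrt(2.0), 0.0, 1.0

x = rng.normal(mu1, s1, size=1_000_000)   # samples from p
log_p = -0.5 * ((x - mu1) / s1) ** 2 - np.log(s1 * np.sqrt(2 * np.pi))
log_q = -0.5 * ((x - mu2) / s2) ** 2 - np.log(s2 * np.sqrt(2 * np.pi))
print((log_p - log_q).mean())             # Monte Carlo estimate

closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
print(closed)                             # exact: ~0.6534 nats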

k-nearest neighbor estimator (Perez-Cruz 2008). When neither p(x)p(x) nor q(x)q(x) are known analytically - only samples from both distributions - the KNN estimator uses:

\hat{D}_{\mathrm{KL}}(p\|q) = \frac{d}{n}\sum_{i=1}^n \log\frac{\nu_k(x_i)}{\rho_k(x_i)} + \log\frac{m}{n-1}

where $\rho_k(x_i)$ is the distance from $x_i$ to its $k$-th nearest neighbor among the $p$-samples, $\nu_k(x_i)$ is the distance to its $k$-th nearest neighbor among the $q$-samples, $n$ = number of $p$-samples, $m$ = number of $q$-samples, and $d$ = dimension. (Note the ratio $\nu_k/\rho_k$: when the $q$-samples lie far from where $p$ concentrates, the estimate is large, as it should be.)

This estimator is used in representation learning evaluation, where you have samples from an encoder's output distribution qq and want to measure DKL(qptarget)D_{\mathrm{KL}}(q\|p_{\text{target}}) without a closed form.

D.3 Symmetric KL Implementation

For applications requiring symmetric divergence without the issues of JSD (which requires computing the mixture), Jeffrey's divergence is simpler:

import numpy as np

def jeffreys_divergence(p_probs, q_probs, eps=1e-10):
    """
    J(p, q) = 0.5 * (KL(p||q) + KL(q||p))
    Symmetric, and avoids forming the mixture that JSD requires.
    """
    # Smooth and renormalize so both logs stay finite
    p = p_probs + eps
    q = q_probs + eps
    p = p / p.sum()
    q = q / q.sum()
    kl_pq = (p * np.log(p / q)).sum()
    kl_qp = (q * np.log(q / p)).sum()
    return 0.5 * (kl_pq + kl_qp)

D.4 KL Divergence in PyTorch for VAE Training

The standard VAE KL term implementation:

def vae_kl_loss(mu, log_var):
    """
    D_KL(N(mu, sigma^2) || N(0, 1)) per-sample, summed over latent dimensions.
    
    Formula: 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)
           = 0.5 * sum(mu^2 + exp(log_var) - log_var - 1)
    
    Input: mu, log_var - shape (batch, latent_dim)
    Output: KL loss - shape (batch,), or scalar if mean-reduced
    """
    kl = 0.5 * torch.sum(mu**2 + torch.exp(log_var) - log_var - 1, dim=-1)
    return kl.mean()  # average over batch

Note: using log_var (log of variance) instead of sigma avoids numerical issues with very small variances and ensures σ2>0\sigma^2 > 0 by construction.

RLHF KL penalty implementation:

def rlhf_kl_penalty(logprobs_policy, logprobs_ref, beta=0.1):
    """
    D_KL(pi_policy || pi_ref) for a batch of token sequences.
    
    logprobs_policy, logprobs_ref: log probabilities under each model
                                   shape (batch, seq_len)
    Returns: mean KL per sequence
    """
    # KL = E_pi[log pi - log pi_ref]
    kl_per_token = logprobs_policy - logprobs_ref  # log ratio
    kl_per_seq = kl_per_token.sum(dim=-1)  # sum over sequence length
    return beta * kl_per_seq.mean()

Appendix E: Variational Inference - A Deep Dive

E.1 The Variational Inference Problem

Bayesian inference requires computing the posterior p(zx)=p(xz)p(z)/p(x)p(\mathbf{z}|\mathbf{x}) = p(\mathbf{x}|\mathbf{z})p(\mathbf{z})/p(\mathbf{x}). The denominator p(x)=p(xz)p(z)dzp(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z})\,d\mathbf{z} is typically intractable - the integral has no closed form for nonlinear likelihoods.

Variational inference approximates the posterior with a tractable family Q={qϕ:ϕΦ}\mathcal{Q} = \{q_\phi : \phi \in \Phi\} by solving:

q=argminqQDKL(q(z)p(zx))q^* = \arg\min_{q \in \mathcal{Q}} D_{\mathrm{KL}}(q(\mathbf{z}) \| p(\mathbf{z}|\mathbf{x}))

Why reverse KL? Because it is the only tractable direction. Computing DKL(pq)D_{\mathrm{KL}}(p\|q) requires evaluating p(zx)p(\mathbf{z}|\mathbf{x}), which is the intractable quantity we're trying to avoid. Computing DKL(qp)D_{\mathrm{KL}}(q\|p) requires only samples from qq and evaluations of p(z)p(xz)p(\mathbf{z})p(\mathbf{x}|\mathbf{z}) - both tractable.

ELBO derivation. Expanding the reverse KL:

DKL(q(z)p(zx))=Eq ⁣[logq(z)p(zx)]=Eq[logq(z)]Eq[logp(zx)]D_{\mathrm{KL}}(q(\mathbf{z})\|p(\mathbf{z}|\mathbf{x})) = \mathbb{E}_q\!\left[\log\frac{q(\mathbf{z})}{p(\mathbf{z}|\mathbf{x})}\right] = \mathbb{E}_q[\log q(\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{z}|\mathbf{x})]

Using logp(zx)=logp(xz)+logp(z)logp(x)\log p(\mathbf{z}|\mathbf{x}) = \log p(\mathbf{x}|\mathbf{z}) + \log p(\mathbf{z}) - \log p(\mathbf{x}):

=Eq[logq(z)]Eq[logp(xz)]Eq[logp(z)]+logp(x)= \mathbb{E}_q[\log q(\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{z})] + \log p(\mathbf{x})

Rearranging:

logp(x)=Eq[logp(xz)]DKL(q(z)p(z))L(q)=ELBO+DKL(q(z)p(zx))\log p(\mathbf{x}) = \underbrace{\mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})] - D_{\mathrm{KL}}(q(\mathbf{z})\|p(\mathbf{z}))}_{\mathcal{L}(q) = \text{ELBO}} + D_{\mathrm{KL}}(q(\mathbf{z})\|p(\mathbf{z}|\mathbf{x}))

Since DKL0D_{\mathrm{KL}} \ge 0: L(q)logp(x)\mathcal{L}(q) \le \log p(\mathbf{x}) - the ELBO is a lower bound. Maximizing ELBO over qq simultaneously: (a) tightens the bound (reduces the KL gap), and (b) improves the model fit (increases logp(x)\log p(\mathbf{x}) when pθp_{\boldsymbol{\theta}} is also optimized).

E.2 Mean-Field Variational Inference

The mean-field approximation restricts to factored distributions q(z)=j=1dqj(zj)q(\mathbf{z}) = \prod_{j=1}^d q_j(z_j). This is an enormous simplification - the full posterior may have complex correlations, but the approximation ignores all correlations between latent variables.

Coordinate ascent VI (CAVI). For mean-field, the optimal factor qjq_j^* satisfies:

logqj(zj)=Eqj[logp(x,z)]+const\log q_j^*(z_j) = \mathbb{E}_{q_{-j}}[\log p(\mathbf{x}, \mathbf{z})] + \text{const}

where qjq_{-j} denotes expectation over all factors except jj. This is a fixed-point equation: each factor depends on the others. CAVI iterates: update q1q_1 given q2,,qdq_2, \ldots, q_d; then update q2q_2 given new q1q_1 and old q3,,qdq_3, \ldots, q_d; etc.

Guarantee: Each update increases the ELBO (decreases the reverse KL). CAVI converges to a local optimum.

For AI: Mean-field VI was the basis for early Bayesian neural networks (BNNs) and topic models (LDA). Its mode-seeking bias (reverse KL) means it systematically underestimates posterior uncertainty - a known limitation. Modern BNNs often use Monte Carlo Dropout as a cheaper alternative.

E.3 ELBO as a Free Energy

The ELBO has a physical interpretation as a variational free energy (Helmholtz, Gibbs):

L(q)=Eq[logp(x,z)]Eq[logq(z)]=Eq[logp(x,z)]+H(q)\mathcal{L}(q) = \mathbb{E}_q[\log p(\mathbf{x},\mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})] = \mathbb{E}_q[\log p(\mathbf{x},\mathbf{z})] + H(q)

This is the expected log joint minus the entropy of qq. Maximizing the ELBO simultaneously:

  • Maximizes Eq[logp(x,z)]\mathbb{E}_q[\log p(\mathbf{x},\mathbf{z})]: encourages qq to put mass on high-probability regions of the joint
  • Maximizes H(q)H(q): encourages qq to spread out (high entropy = uncertainty), preventing mode collapse

The tension between these two terms is the fundamental trade-off in variational inference. In physics terms, maximizing the ELBO is minimizing the variational free energy $-\mathcal{L}(q) = \mathbb{E}_q[-\log p(\mathbf{x},\mathbf{z})] - H(q)$: equilibrium balances energy minimization (term 1) against entropy maximization (term 2), at temperature 1.

For AI: The free energy interpretation connects to energy-based models (LeCun et al., 2006) and to the "free energy principle" in computational neuroscience (Friston, 2010). The ELBO appears in all modern latent variable models: VAEs, VQ-VAE, neural process families, and latent diffusion models (Stable Diffusion uses a VAE for compression, then diffusion in the latent space).

E.4 Amortized Variational Inference

Standard VI optimizes a separate qϕ(z)q_\phi(\mathbf{z}) per data point - O(n)O(n) optimization problems. Amortized VI (Kingma & Welling 2014) instead trains a single encoder network qϕ(zx)q_\phi(\mathbf{z}|\mathbf{x}) that takes x\mathbf{x} as input:

$\phi^* = \arg\max_\phi \mathbb{E}_{p(\mathbf{x})}\!\left[\mathcal{L}(\phi, \boldsymbol{\theta}; \mathbf{x})\right]$

The encoder "amortizes" the per-sample optimization across the full dataset, generalizing to new samples at inference time. This is what makes VAEs practical for large datasets: a single forward pass through the encoder gives qϕ(zx)=N(μϕ(x),diag(σϕ2(x)))q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x}))) instantly.

Amortization gap: The amortized approximation $q_\phi(\mathbf{z}|\mathbf{x})$ is generally worse than the optimal per-sample $q^*(\mathbf{z}|\mathbf{x})$ (it must generalize, so it trades off per-sample accuracy). The difference $D_{\mathrm{KL}}(q_\phi\|p) - D_{\mathrm{KL}}(q^*\|p) \ge 0$ is called the amortization gap and is an active research topic.
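
A minimal sketch of an amortized Gaussian encoder and the closed-form KL term it contributes to the ELBO (architecture, sizes, and names are illustrative, not from any particular paper):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps x to the parameters of q_phi(z|x) = N(mu, diag(sigma^2))."""
    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)       # mean head
        self.log_var = nn.Linear(hidden, z_dim)  # log-variance head

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

def kl_to_standard_normal(mu, log_var):
    # Closed form KL( N(mu, diag(sigma^2)) || N(0, I) ):
    # 0.5 * (mu^2 + sigma^2 - log sigma^2 - 1), summed over z dims.
    return 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(dim=-1)

enc = Encoder()
x = torch.randn(32, 784)      # dummy batch
mu, log_var = enc(x)          # one forward pass amortizes inference
print(kl_to_standard_normal(mu, log_var).mean())
```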


Appendix F: The RLHF-KL Alignment Triangle

F.1 Three Ways KL Appears in Alignment

Modern LLM alignment uses KL divergence in three distinct but related ways, forming a triangle of constraints:

RLHF-KL ALIGNMENT TRIANGLE

                     PRETRAINED MODEL pi_ref
                            /  \
                   KL      /    \      KL
                 penalty  /      \  constraint
                         /        \
              RLHF pi_theta ------ DPO pi_DPO
                         equivalent
             (fixed point: same optimal policy)

  RLHF and DPO both solve the same KL-regularized objective:

    max_pi  E[r(x,y)] - beta * D_KL( pi(.|x) || pi_ref(.|x) )

  Optimal solution:  pi*(y|x) ∝ pi_ref(y|x) * exp( r(x,y) / beta )

F.2 Interpreting the KL Coefficient beta

The KL coefficient β\beta in RLHF controls the fundamental exploration-exploitation trade-off in alignment:

| $\beta$ value | Behavior | Risk |
| --- | --- | --- |
| $\beta \to 0$ | Pure reward maximization; ignores the KL constraint | Reward hacking; repetitive, incoherent outputs |
| $\beta \to \infty$ | Stays at the reference policy; no alignment | No improvement in helpfulness or safety |
| $\beta \in [0.01, 0.5]$ | Typical range used in practice | Balances helpfulness against coherence |

Typical values in production: InstructGPT (Ouyang et al., 2022) used β=0.02\beta = 0.02 for PPO-RLHF. Different applications need different β\beta: safety alignment uses larger β\beta (conservative); instruction following uses smaller β\beta (allows larger policy changes).

Adaptive β\beta (KL-controller): Some implementations adaptively adjust β\beta to keep DKL(πθπref)D_{\mathrm{KL}}(\pi_{\boldsymbol{\theta}}\|\pi_{\mathrm{ref}}) near a target value κ\kappa:

βt+1=βt+γ(DKLκ)\beta_{t+1} = \beta_t + \gamma(D_{\mathrm{KL}} - \kappa)

This is a simple integral (feedback) controller on the KL divergence: the error $D_{\mathrm{KL}} - \kappa$ is accumulated into $\beta$. If the policy drifts too far, $\beta$ increases; if it stays too close to the reference, $\beta$ decreases.
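
A sketch of one controller step, implementing the update rule above (the clipping bounds and $\gamma$ are hypothetical hyperparameters):

```python
def update_beta(beta, kl_measured, kl_target, gamma=0.01,
                beta_min=1e-3, beta_max=10.0):
    """One step of the adaptive KL controller: accumulate the error
    (measured KL minus target kappa) into beta, then clip for safety."""
    beta = beta + gamma * (kl_measured - kl_target)
    return min(max(beta, beta_min), beta_max)

# Example: the policy has drifted to 8 nats against a target of 6 nats,
# so the penalty strengthens to pull it back toward the reference.
print(update_beta(beta=0.02, kl_measured=8.0, kl_target=6.0))  # 0.04
```

Production implementations differ in detail (some use multiplicative rather than additive updates), but the feedback structure is the same.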

F.3 DPO's Implicit Reward

DPO (Rafailov et al., 2023) reveals that the log-ratio between policy and reference is the implicit reward:

rθ(x,y)=βlogπθ(yx)πref(yx)r_{\boldsymbol{\theta}}(x,y) = \beta\log\frac{\pi_{\boldsymbol{\theta}}(y|x)}{\pi_{\mathrm{ref}}(y|x)}

This has a clean interpretation: the reward a sequence $y$ receives is the (scaled) log-ratio of the probability the aligned policy assigns to it versus the reference. Sequences the aligned model prefers more than the reference model are rewarded; sequences it prefers less are penalized.

DPO gradient: The DPO loss gradient with respect to θ\boldsymbol{\theta}:

θLDPO=βE ⁣[σ(r^θ(yl)r^θ(yw))(logπθ(yw)logπθ(yl))]\nabla_{\boldsymbol{\theta}}\mathcal{L}_{\mathrm{DPO}} = -\beta\,\mathbb{E}\!\left[\sigma(\hat{r}_{\boldsymbol{\theta}}(y_l) - \hat{r}_{\boldsymbol{\theta}}(y_w))\left(\nabla\log\pi_{\boldsymbol{\theta}}(y_w) - \nabla\log\pi_{\boldsymbol{\theta}}(y_l)\right)\right]

where r^θ(y)=βlog(πθ(y)/πref(y))\hat{r}_{\boldsymbol{\theta}}(y) = \beta\log(\pi_{\boldsymbol{\theta}}(y)/\pi_{\mathrm{ref}}(y)). The weight σ(r^(yl)r^(yw))\sigma(\hat{r}(y_l) - \hat{r}(y_w)) is large when the model assigns higher reward to the loser than the winner - the loss upweights these hard examples. This has the same structure as hard example mining in contrastive learning.
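
A minimal DPO loss sketch in PyTorch, following the structure of Rafailov et al. (2023); tensor names and the dummy values are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """logp_* are per-sequence log-probs of the chosen (w) and rejected (l)
    responses under the policy; ref_logp_* under the frozen reference.
    All tensors have shape (batch,)."""
    # Implicit rewards: r_hat(y) = beta * log( pi_theta(y) / pi_ref(y) )
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    # -log sigma(r_w - r_l): logistic loss on the implicit reward margin
    return -F.logsigmoid(r_w - r_l).mean()

# Dummy batch of one preference pair (values illustrative)
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # backpropagating this yields the gradient shown above
```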


Appendix G: Connections to Other Chapters

G.1 KL and Exponential Families (Chapter 7: Statistics)

Exponential families are the natural domain for KL divergence because:

  1. KL between exponential family members has closed form (Section 5.3)
  2. The maximum entropy distribution under moment constraints is an exponential family member (Section 09-01 Section 5)
  3. The Bregman divergence generated by A(η)A(\boldsymbol{\eta}) equals the KL between members with those natural parameters

Sufficiency: Kullback and Leibler's original paper introduced KL as a measure of information for discrimination between hypotheses. A statistic T(x)T(\mathbf{x}) is sufficient for parameter θ\boldsymbol{\theta} iff DKL(pθ1(x)pθ2(x))=DKL(pθ1(T)pθ2(T))D_{\mathrm{KL}}(p_{\boldsymbol{\theta}_1}(\mathbf{x})\|p_{\boldsymbol{\theta}_2}(\mathbf{x})) = D_{\mathrm{KL}}(p_{\boldsymbol{\theta}_1}(T)\|p_{\boldsymbol{\theta}_2}(T)) - sufficient statistics preserve all the discriminatory information. This is the information-theoretic definition of sufficiency.

G.2 KL and Convex Optimization (Chapter 8)

The ELBO maximization is a convex optimization problem in qq (for fixed model parameters θ\boldsymbol{\theta}):

  • DKL(qp)D_{\mathrm{KL}}(q\|p) is convex in qq (Section 3.4)
  • The constraint q0q \ge 0, q=1\sum q = 1 is convex (simplex)
  • The I-projection q=argminqDKL(qp)q^* = \arg\min_q D_{\mathrm{KL}}(q\|p) is a convex program with a unique solution

The Lagrangian of the I-projection onto the exponential family gives the natural parameter update equations - connecting KL divergence to the theory of constrained convex optimization from Chapter 8.

G.3 KL and Functional Analysis (Chapter 12)

In infinite-dimensional settings, KL divergence extends via the Radon-Nikodym theorem. For Gaussian processes and Gaussian measures on Hilbert spaces:

DKL(GP(m1,K1)GP(m2,K2))=12[tr(K21K1)+(m2m1)K21(m2m1)d+logdetK2detK1]D_{\mathrm{KL}}(\mathcal{GP}(m_1, K_1)\|\mathcal{GP}(m_2, K_2)) = \frac{1}{2}\left[\operatorname{tr}(K_2^{-1}K_1) + (m_2-m_1)^\top K_2^{-1}(m_2-m_1) - d + \log\frac{\det K_2}{\det K_1}\right]

the same Gaussian KL formula, but now K1,K2K_1, K_2 are kernel matrices and dd \to \infty. The trace and log-determinant terms generalize as operator trace and Fredholm determinant.
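
On a finite grid of input locations, the formula reduces to the ordinary multivariate Gaussian KL with Gram matrices as covariances. A sketch (RBF kernels and all numbers illustrative; the jitter term keeps the Gram matrices well conditioned):

```python
import numpy as np

def gaussian_kl(m1, K1, m2, K2):
    """KL( N(m1, K1) || N(m2, K2) ) for finite-dimensional Gaussians."""
    d = len(m1)
    K2_inv = np.linalg.inv(K2)
    diff = m2 - m1
    _, logdet1 = np.linalg.slogdet(K1)
    _, logdet2 = np.linalg.slogdet(K2)
    return 0.5 * (np.trace(K2_inv @ K1) + diff @ K2_inv @ diff
                  - d + logdet2 - logdet1)

# Two zero-mean GPs restricted to 50 grid points, with RBF kernels of
# different lengthscales (0.2 vs 0.5).
x = np.linspace(0, 1, 50)[:, None]
sq = (x - x.T) ** 2
K1 = np.exp(-sq / (2 * 0.2**2)) + 1e-6 * np.eye(50)
K2 = np.exp(-sq / (2 * 0.5**2)) + 1e-6 * np.eye(50)
print(gaussian_kl(np.zeros(50), K1, np.zeros(50), K2))
```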

RKHS connection: The maximum mean discrepancy (MMD) is an alternative to KL for comparing distributions, based on RKHS distances. MMD has better convergence rates from finite samples than KL estimators, which is why it appears in some GANs (MMD-GAN) and kernel two-sample tests.

G.4 KL and Statistical Learning Theory (Chapter 21)

The PAC-Bayes bound (Appendix C.1) shows that the KL from posterior to prior controls generalization. This has a direct implication for LoRA fine-tuning: if the LoRA adapter is constrained to have small DKL(qadapterpinit)D_{\mathrm{KL}}(q_{\mathrm{adapter}}\|p_{\mathrm{init}}) - i.e., small deviation from initialization - then generalization bounds are tight. This is the information-theoretic justification for small-rank adaptations: they have small KL from the pretrained distribution.

Minimum Description Length (MDL): The MDL principle selects the model that minimizes the two-part code length $L(M) + L(D|M)$, where $L(M)$ is the code length of the model and $L(D|M)$ is the code length of the data given the model. For the right choice of $p$ and $q$ this equals $D_{\mathrm{KL}}(q\|p) + \text{const}$, so minimizing description length is the same as maximizing the ELBO. This is the Bayesian interpretation of Occam's razor.


Appendix H: Summary Tables

H.1 KL Divergence Properties at a Glance

| Property | Holds? | Counterexample if No |
| --- | --- | --- |
| Non-negative: $D_{\mathrm{KL}} \ge 0$ | YES | - |
| Zero iff $p = q$: $D_{\mathrm{KL}} = 0 \Leftrightarrow p = q$ | YES (a.e.) | - |
| Symmetric: $D_{\mathrm{KL}}(p \Vert q) = D_{\mathrm{KL}}(q \Vert p)$ | NO | $p=(0.9,0.1)$, $q=(0.5,0.5)$ |
| Triangle inequality: $D_{\mathrm{KL}}(p \Vert r) \le D_{\mathrm{KL}}(p \Vert q) + D_{\mathrm{KL}}(q \Vert r)$ | NO | $p=(0.9,0.1)$, $q=(0.5,0.5)$, $r=(0.1,0.9)$ |
| Data processing inequality: $D_{\mathrm{KL}}(p_T \Vert q_T) \le D_{\mathrm{KL}}(p \Vert q)$ | YES | - |
| Jointly convex in $(p,q)$ | YES | - |
| Convex in $p$ for fixed $q$ | YES | - |
| Convex in $q$ for fixed $p$ | YES | - |
| Finite when $p \not\ll q$ | NO | $p=(0.5,0.5,0)$, $q=(0.5,0,0.5)$: $D_{\mathrm{KL}} = \infty$ |
| Metric (distance function) | NO | Fails symmetry and triangle inequality |

H.2 Forward vs Reverse KL Summary

| | Forward KL $D_{\mathrm{KL}}(p \Vert q)$ | Reverse KL $D_{\mathrm{KL}}(q \Vert p)$ |
| --- | --- | --- |
| Expectation under | $p$ (truth) | $q$ (approximation) |
| Also called | Inclusive, zero-avoiding | Exclusive, zero-forcing |
| Behavior when fitting $q$ to $p$ | Mean-seeking, mass-covering | Mode-seeking, mass-concentrating |
| Handles multimodal $p$ | Averages over all modes | Collapses to one mode |
| Support requirement | $q$ must cover $\mathrm{supp}(p)$ | $q$ must avoid $\mathrm{supp}(p)^c$ |
| Used in | MLE, distillation, flows | Variational inference, ELBO |
| Risk | Blurry/averaged samples | Posterior collapse |
| Tractability | Needs samples from $p$ | Only needs samples from $q$ |

H.3 f-Divergence Family Summary

| Divergence | $f(t)$ | Symmetric? | Bounded? | Metric? |
| --- | --- | --- | --- | --- |
| KL $D_{\mathrm{KL}}(p \Vert q)$ | $t\ln t$ | No | No | No |
| Reverse KL $D_{\mathrm{KL}}(q \Vert p)$ | $-\ln t$ | No | No | No |
| Jeffreys $J$ | $t\ln t - \ln t$ | Yes | No | No |
| Hellinger $H^2$ | $(\sqrt{t}-1)^2$ | Yes | Yes ($[0,2]$) | $\sqrt{H^2}$ is |
| Total variation $\mathrm{TV}$ | $\frac{1}{2}\lvert t-1\rvert$ | Yes | Yes ($[0,1]$) | Yes |
| Chi-squared $\chi^2$ | $(t-1)^2$ | No | No | No |
| Jensen-Shannon $\mathrm{JSD}$ | (mixture form) | Yes | Yes ($[0,\ln 2]$) | $\sqrt{\mathrm{JSD}}$ is |
| Renyi $D_\alpha$ | $\frac{t^\alpha-1}{\alpha-1}$ (limit form) | No | No | No |

Appendix I: Extended Examples and Geometric Visualizations

I.1 KL Divergence Along a Path Between Distributions

Consider a one-parameter family $p_t = (1-t)p_0 + t p_1$ - a straight line between two distributions in the probability simplex. Both $t \mapsto D_{\mathrm{KL}}(p_t\|q)$ and $t \mapsto D_{\mathrm{KL}}(q\|p_t)$ are convex in $t$: KL is convex in each argument separately (Section H.1), and $p_t$ is affine in $t$. Convexity along the path does not mean the path is "straight" from KL's perspective, however - the mixture path is generally not a geodesic of the KL geometry (see the geometric implication below).

Numerical example. p0=(0.9,0.1)p_0 = (0.9, 0.1), p1=(0.1,0.9)p_1 = (0.1, 0.9), q=(0.5,0.5)q = (0.5, 0.5). For t[0,1]t \in [0,1]:

| $t$ | $p_t$ | $D_{\mathrm{KL}}(p_t \Vert q)$ | $D_{\mathrm{KL}}(q \Vert p_t)$ |
| --- | --- | --- | --- |
| 0 | $(0.9, 0.1)$ | $0.368$ | $0.511$ |
| 0.25 | $(0.7, 0.3)$ | $0.082$ | $0.087$ |
| 0.5 | $(0.5, 0.5)$ | $0.000$ | $0.000$ |
| 0.75 | $(0.3, 0.7)$ | $0.082$ | $0.087$ |
| 1.0 | $(0.1, 0.9)$ | $0.368$ | $0.511$ |

The values are symmetric about $t = 0.5$ because the path is symmetric around $q$. Both directions vanish at $t = 0.5$ (where $p_t = q$) and increase convexly toward the endpoints.
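
The table can be reproduced in a few lines (values in nats):

```python
import numpy as np
from scipy.special import rel_entr  # element-wise p * log(p/q)

p0, p1, q = np.array([0.9, 0.1]), np.array([0.1, 0.9]), np.array([0.5, 0.5])

for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    pt = (1 - t) * p0 + t * p1                  # mixture path
    fwd = rel_entr(pt, q).sum()                 # D_KL(p_t || q)
    rev = rel_entr(q, pt).sum()                 # D_KL(q || p_t)
    print(f"t={t:.2f}  D(p_t||q)={fwd:.3f}  D(q||p_t)={rev:.3f}")
```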

Geometric implication: The straight-line path pt=(1t)p0+tp1p_t = (1-t)p_0 + t p_1 is a mixture geodesic (M-geodesic) in information geometry - the natural path in the mixture parameterization. The dual path (exponential geodesic) mixes in the natural parameter space, giving different intermediate distributions.

I.2 KL vs Wasserstein Distance

KL divergence and the Wasserstein distance (from optimal transport theory) are complementary tools for comparing distributions:

| Property | KL divergence | Wasserstein distance |
| --- | --- | --- |
| Requires $p \ll q$ | Yes | No |
| Metrizes weak convergence | No | Yes |
| Captures geometry of the support | No | Yes |
| Tractable to compute | Yes (closed forms for common families) | Harder ($O(n^3)$ naive) |
| Differentiable | Yes | With entropic regularization |
| Used in | MLE, ELBO, RLHF | Generative models, image quality |

When Wasserstein beats KL: If pGp_G (generator) and pdatap_{\mathrm{data}} have disjoint supports, DKL(pdatapG)=+D_{\mathrm{KL}}(p_{\mathrm{data}}\|p_G) = +\infty and JSD=ln2\mathrm{JSD} = \ln 2 - gradients vanish. The Wasserstein distance is still finite and differentiable (reflecting actual geometric distance between supports). This is the motivation for WGAN (Arjovsky et al., 2017).

When KL beats Wasserstein: For language modeling, the vocabulary forms a discrete space with no natural geometry. The Wasserstein distance would require defining a metric on tokens (words), which is ambiguous. KL divergence makes no assumptions about token geometry and is directly connected to log-probability, making it the natural choice.

I.3 KL Divergence Under Transformations

Transformation by sufficient statistic. If T(x)T(\mathbf{x}) is a sufficient statistic for θ\boldsymbol{\theta}, then:

DKL(pθ1pθ2)=DKL(pθ1Tpθ2T)D_{\mathrm{KL}}(p_{\boldsymbol{\theta}_1}\|p_{\boldsymbol{\theta}_2}) = D_{\mathrm{KL}}(p_{\boldsymbol{\theta}_1}^T\|p_{\boldsymbol{\theta}_2}^T)

where pθTp_{\boldsymbol{\theta}}^T is the induced distribution on T(x)T(\mathbf{x}). Sufficient statistics preserve all the discriminatory information - no KL is lost when summarizing data by its sufficient statistics.

Example. For $p_\theta = \text{Bern}(\theta)$ with $n$ i.i.d. flips, the statistic $T = \sum_i x_i$ (the number of heads) is sufficient. The KL between $\text{Bern}(\theta_1)^{\otimes n}$ and $\text{Bern}(\theta_2)^{\otimes n}$ equals $n \cdot D_{\mathrm{KL}}(\text{Bern}(\theta_1)\|\text{Bern}(\theta_2))$, and the KL between the corresponding $\text{Binomial}(n,\theta_1)$ and $\text{Binomial}(n,\theta_2)$ distributions equals the same quantity - the sufficient statistic preserves KL exactly.
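
A direct numerical check of this equality (a sketch assuming scipy is available):

```python
import numpy as np
from scipy.stats import binom

n, th1, th2 = 20, 0.7, 0.4

# n * KL( Bern(th1) || Bern(th2) ), in nats
kl_bern = th1 * np.log(th1 / th2) + (1 - th1) * np.log((1 - th1) / (1 - th2))

# KL between the induced Binomial(n, th) laws of T = sum_i x_i
k = np.arange(n + 1)
p_T, q_T = binom.pmf(k, n, th1), binom.pmf(k, n, th2)
kl_binom = np.sum(p_T * np.log(p_T / q_T))

print(n * kl_bern, kl_binom)            # identical up to float error
assert np.isclose(n * kl_bern, kl_binom)
```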

KL under affine transformation. For continuous distributions, an affine map T(x)=Ax+bT(\mathbf{x}) = A\mathbf{x} + \mathbf{b} with detA0\det A \ne 0: DKL(pTqT)=DKL(pq)D_{\mathrm{KL}}(p_T\|q_T) = D_{\mathrm{KL}}(p\|q) - KL is invariant under invertible affine transformations. This is why KL between two Gaussians can be computed in any coordinate system.

I.4 The Sanov Theorem and Large Deviations

Sanov's theorem is the large deviations foundation for KL divergence.

Theorem (Sanov). Let X1,,XnX_1, \ldots, X_n be i.i.d. from qq. Let p^n\hat{p}_n be the empirical distribution. For any set EE of distributions (closed in total variation):

P(p^nE)eninfpEDKL(pq)P(\hat{p}_n \in E) \approx e^{-n \inf_{p \in E} D_{\mathrm{KL}}(p\|q)}

More precisely: limn1nlogP(p^nE)=infpEDKL(pq)\lim_{n\to\infty} -\frac{1}{n}\log P(\hat{p}_n \in E) = \inf_{p \in E} D_{\mathrm{KL}}(p\|q).

Interpretation: The probability of observing an empirical distribution in EE decays exponentially in nn, with exponent given by the minimum KL from any distribution in EE to the true distribution qq. The "most likely" atypical empirical distribution is the one in EE closest to qq in KL divergence - the I-projection of qq onto EE.
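
A Monte Carlo illustration for a fair coin and the event $E = \{\hat{p}_n(\text{heads}) \ge 0.7\}$. The I-projection of $\text{Bern}(0.5)$ onto $E$ is $\text{Bern}(0.7)$, so the exponent is $D_{\mathrm{KL}}(\text{Bern}(0.7)\|\text{Bern}(0.5)) \approx 0.082$ nats (the sample sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_bern(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

exponent = kl_bern(0.7, 0.5)  # Sanov exponent for the event above

for n in [20, 50, 100]:
    heads = rng.binomial(n, 0.5, size=2_000_000)
    p_emp = (heads / n >= 0.7).mean()   # Monte Carlo estimate of P(E)
    print(f"n={n:3d}  -log P / n = {-np.log(p_emp)/n:.4f}  "
          f"exponent = {exponent:.4f}")
# Convergence is slow: the polynomial prefactors vanish only as n grows.
```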

For AI: Sanov-style large-deviation reasoning is the engine behind PAC-Bayes bounds: the KL term $D_{\mathrm{KL}}(Q\|P)$ between the posterior $Q$ over hypotheses and the prior $P$ enters the generalization bound directly, divided by $n$. The larger this KL, the looser the guarantee - a posterior that moves far from the prior needs more data to certify that it is not fitting noise.

I.5 KL Divergence and the EM Algorithm: A Complete Example

We illustrate the EM algorithm as alternating KL minimizations with a Gaussian mixture model.

Setup: Data D={x(1),,x(n)}\mathcal{D} = \{x^{(1)}, \ldots, x^{(n)}\}; model pθ(x)=k=1KπkN(x;μk,σk2)p_{\boldsymbol{\theta}}(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x; \mu_k, \sigma_k^2).

E-step - compute responsibilities (I-projection):

q(t)(z=kx(i))=πk(t)N(x(i);μk(t),σk2(t))j=1Kπj(t)N(x(i);μj(t),σj2(t))q^{(t)}(z=k|x^{(i)}) = \frac{\pi_k^{(t)}\mathcal{N}(x^{(i)};\mu_k^{(t)},\sigma_k^{2(t)})}{\sum_{j=1}^K \pi_j^{(t)}\mathcal{N}(x^{(i)};\mu_j^{(t)},\sigma_j^{2(t)})}

This is an I-projection: minimizing $D_{\mathrm{KL}}(q\|p_{\boldsymbol{\theta}^{(t)}}(\cdot|x^{(i)}))$ over all distributions $q$ returns the exact posterior under the current parameters, zeroing that KL.

M-step - update parameters (M-projection):

$\pi_k^{(t+1)} = \frac{1}{n}\sum_i q^{(t)}(z=k|x^{(i)})$

$\mu_k^{(t+1)} = \frac{\sum_i q^{(t)}(z=k|x^{(i)})\,x^{(i)}}{\sum_i q^{(t)}(z=k|x^{(i)})}$

(the update for $\sigma_k^{2,(t+1)}$ is the analogous responsibility-weighted variance)

This maximizes the ELBO - equivalent to the M-projection that minimizes DKL(q(t)pθ)D_{\mathrm{KL}}(q^{(t)}\|p_{\boldsymbol{\theta}}) over θ\boldsymbol{\theta}.

KL decomposition at each step:

logpθ(D)=L(q,θ)+iDKL(q(t)(x(i))pθ(x(i)))\log p_{\boldsymbol{\theta}}(\mathcal{D}) = \mathcal{L}(q, \boldsymbol{\theta}) + \sum_i D_{\mathrm{KL}}(q^{(t)}(\cdot|x^{(i)})\|p_{\boldsymbol{\theta}}(\cdot|x^{(i)}))

The E-step zeroes the second term (for the current θ(t)\boldsymbol{\theta}^{(t)}); the M-step maximizes the first term. Together, each EM iteration increases logpθ(D)\log p_{\boldsymbol{\theta}}(\mathcal{D}).
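
A compact runnable version of this loop on synthetic 1-D data (component counts and parameters are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.8, 300), rng.normal(2, 1.2, 700)])

# Initial parameters theta^(0)
pi_k = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for _ in range(50):
    # E-step (I-projection): responsibilities q(z=k|x_i), shape (n, K)
    dens = pi_k * norm.pdf(x[:, None], mu, sigma)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step (M-projection): closed-form ELBO-maximizing updates
    Nk = resp.sum(axis=0)
    pi_k = Nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(pi_k, mu, sigma)  # approaches (0.3, 0.7), (-2, 2), (0.8, 1.2)
```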


Appendix J: Notation and Conventions

J.1 KL Divergence Notation Across Different Sources

Different textbooks and papers use different notation for KL divergence. This can cause significant confusion:

| Source | Notation | Direction |
| --- | --- | --- |
| This curriculum (following Cover & Thomas) | $D_{\mathrm{KL}}(p \Vert q)$ | $p$ = reference (truth); $q$ = approximation |
| Kullback (1959) | $I(1:2)$ | Between distributions indexed 1 and 2 |
| Bishop (2006) | $\mathrm{KL}(p \Vert q)$ | Same as Cover & Thomas |
| Murphy (2012) | $\mathrm{KL}(p \Vert q)$ | Same |
| Goodfellow et al. (2016) | $D_{\mathrm{KL}}(P \Vert Q)$ | Same |
| Some optimization papers | $\mathrm{KL}(q \Vert p)$ | Reversed - always check! |

Key check: In VAE papers, DKL(qϕ(zx)p(z))D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x})\|p(\mathbf{z})) has the encoder qq as the first argument (the reference for the expectation) and the prior pp as the second. This is reverse KL - expectation under the approximate posterior qq.

J.2 Relationship Between Information Quantities

VENN DIAGRAM OF INFORMATION QUANTITIES

  For a joint distribution P(X,Y):

    |<-------------------- H(X,Y) -------------------->|
    |<---------- H(X) ---------->|
                      |<---------- H(Y) -------------->|
    |     H(X|Y)      |  I(X;Y)  |       H(Y|X)        |

  Key relationships:
  H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)        [chain rule]
  I(X;Y) = H(X) + H(Y) - H(X,Y)                 [definition]
  I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)        [equivalences]
  I(X;Y) = D_KL(P(X,Y) || P(X)P(Y))             [KL form]
  H(p,q) = H(p) + D_KL(p || q)                  [cross-entropy decomposition]

J.3 Units and Conversions

| Base | Unit | Conversion |
| --- | --- | --- |
| $\ln$ (natural log) | nats | Default in this curriculum; $1 \text{ nat} = 1/\ln 2 \approx 1.443$ bits |
| $\log_2$ | bits | Used in data compression contexts; $1 \text{ bit} = \ln 2 \approx 0.693$ nats |
| $\log_{10}$ | hartleys (bans) | Rare in ML; $1 \text{ hartley} = \ln 10 \approx 2.303$ nats |

PyTorch convention: torch.nn.functional.kl_div expects inputs in log-space and computes nats by default (uses torch.log which is natural log). To get bits, divide by math.log(2).
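
A quick demonstration of the argument order and the nats-to-bits conversion (the distributions are arbitrary examples):

```python
import math
import torch
import torch.nn.functional as F

p = torch.tensor([0.92, 0.03, 0.02, 0.02, 0.01])
q = torch.tensor([0.70, 0.10, 0.10, 0.05, 0.05])

# F.kl_div(input, target) expects input = log q and target = p,
# and computes D_KL(p || q) -- note the reversed argument order.
kl_nats = F.kl_div(q.log(), p, reduction="sum")
kl_bits = kl_nats / math.log(2)
print(f"{kl_nats:.4f} nats = {kl_bits:.4f} bits")
```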




Appendix K: Worked Problems with Full Solutions

K.1 Computing KL Between Two Categorical Distributions

Problem. A language model assigns the following probabilities to the next token in "The capital of France is ___":

| Token | True distribution $p$ | Model $q$ |
| --- | --- | --- |
| "Paris" | 0.92 | 0.70 |
| "London" | 0.03 | 0.10 |
| "Rome" | 0.02 | 0.10 |
| "Berlin" | 0.02 | 0.05 |
| other | 0.01 | 0.05 |

Compute (a) DKL(pq)D_{\mathrm{KL}}(p\|q), (b) cross-entropy H(p,q)H(p,q), (c) entropy H(p)H(p).

Solution.

(a) DKL(pq)D_{\mathrm{KL}}(p\|q):

$D_{\mathrm{KL}}(p\|q) = 0.92\ln\frac{0.92}{0.70} + 0.03\ln\frac{0.03}{0.10} + 0.02\ln\frac{0.02}{0.10} + 0.02\ln\frac{0.02}{0.05} + 0.01\ln\frac{0.01}{0.05}$

$= 0.92(0.273) + 0.03(-1.204) + 0.02(-1.609) + 0.02(-0.916) + 0.01(-1.609)$

$= 0.251 - 0.036 - 0.032 - 0.018 - 0.016 = 0.149 \text{ nats}$

(b) Cross-entropy H(p,q)=plogqH(p,q) = -\sum p\log q:

$H(p,q) = -(0.92\ln 0.70 + 0.03\ln 0.10 + 0.02\ln 0.10 + 0.02\ln 0.05 + 0.01\ln 0.05)$

Grouping the two $\ln 0.10$ terms ($0.03 + 0.02 = 0.05$) and the two $\ln 0.05$ terms ($0.02 + 0.01 = 0.03$):

$= -(0.92(-0.357) + 0.05(-2.303) + 0.03(-2.996)) = 0.328 + 0.115 + 0.090 = 0.533 \text{ nats}$

(c) Entropy H(p)=plogpH(p) = -\sum p\log p:

$H(p) = -(0.92\ln 0.92 + 0.03\ln 0.03 + 0.02\ln 0.02 + 0.02\ln 0.02 + 0.01\ln 0.01)$

$= -(0.92(-0.0834) + 0.03(-3.507) + 0.04(-3.912) + 0.01(-4.605))$

$= 0.077 + 0.105 + 0.156 + 0.046 = 0.384 \text{ nats}$

Verification: $H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q) = 0.384 + 0.149 = 0.533$ nats, matching (b) exactly (up to rounding in the intermediate steps).

Interpretation: The model wastes 0.149 nats per prediction compared to an oracle. The perplexity of the oracle is $e^{0.384} \approx 1.47$; the model's perplexity is $e^{0.533} \approx 1.70$ - a factor of $e^{0.149} \approx 1.16$, i.e. about 16% higher, entirely due to the KL gap.
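
The arithmetic can be checked in a few lines:

```python
import numpy as np
from scipy.special import rel_entr

p = np.array([0.92, 0.03, 0.02, 0.02, 0.01])
q = np.array([0.70, 0.10, 0.10, 0.05, 0.05])

kl = rel_entr(p, q).sum()       # (a) D_KL(p||q)  -> 0.149
ce = -(p * np.log(q)).sum()     # (b) H(p, q)     -> 0.533
h = -(p * np.log(p)).sum()      # (c) H(p)        -> 0.384

assert np.isclose(ce, h + kl)   # H(p,q) = H(p) + D_KL(p||q)
print(f"perplexities: oracle {np.exp(h):.2f}, model {np.exp(ce):.2f}")
```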

K.2 The ELBO Calculation for a Toy VAE

Setup. One-dimensional VAE: prior p(z)=N(0,1)p(z) = \mathcal{N}(0,1), likelihood pθ(xz)=N(z,0.25)p_\theta(x|z) = \mathcal{N}(z, 0.25) (decoder with σ2=0.25\sigma^2 = 0.25), encoder qϕ(zx)=N(μϕ(x),σϕ2)q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2).

For x=1.5x = 1.5, μϕ=1.2\mu_\phi = 1.2, σϕ2=0.4\sigma_\phi^2 = 0.4:

KL term: DKL(N(1.2,0.4)N(0,1))D_{\mathrm{KL}}(\mathcal{N}(1.2, 0.4)\|\mathcal{N}(0,1)):

=12(1.22+0.4ln0.41)=12(1.44+0.4+0.9161)=12(1.756)=0.878 nats= \frac{1}{2}(1.2^2 + 0.4 - \ln 0.4 - 1) = \frac{1}{2}(1.44 + 0.4 + 0.916 - 1) = \frac{1}{2}(1.756) = 0.878 \text{ nats}

Reconstruction term: Eqϕ[logpθ(xz)]\mathbb{E}_{q_\phi}[\log p_\theta(x|z)] where x=1.5x = 1.5, zN(1.2,0.4)z \sim \mathcal{N}(1.2, 0.4):

$= \mathbb{E}_{z\sim\mathcal{N}(1.2,\,0.4)}\!\left[-\frac{1}{2}\ln(2\pi \cdot 0.25) - \frac{(1.5-z)^2}{2 \cdot 0.25}\right]$

$= -\frac{1}{2}\ln(0.5\pi) - 2\,\mathbb{E}[(1.5-z)^2]$

$= -0.226 - 2\left[(1.5-1.2)^2 + 0.4\right] = -0.226 - 0.980 = -1.206 \text{ nats}$

ELBO: L=1.2060.878=2.084\mathcal{L} = -1.206 - 0.878 = -2.084 nats.

Interpretation: The true log-evidence satisfies logpθ(1.5)2.084\log p_\theta(1.5) \ge -2.084. Better encoder parameters (closer μϕ\mu_\phi to posterior mean) and tighter σϕ2\sigma_\phi^2 would close the ELBO gap.

K.3 Verifying Data Processing Inequality Numerically

Consider input X{0,1,2}X \in \{0,1,2\} with p=(0.5,0.3,0.2)p = (0.5, 0.3, 0.2) and q=(0.2,0.3,0.5)q = (0.2, 0.3, 0.5).

A stochastic function T:{0,1,2}{A,B}T: \{0,1,2\} \to \{A, B\} with transition matrix:

T(A0)=0.8,T(B0)=0.2;T(A1)=0.5,T(B1)=0.5;T(A2)=0.1,T(B2)=0.9T(A|0) = 0.8, T(B|0) = 0.2; \quad T(A|1) = 0.5, T(B|1) = 0.5; \quad T(A|2) = 0.1, T(B|2) = 0.9

Original KL:

DKL(pq)=0.5ln(2.5)+0.3ln(1)+0.2ln(0.4)=0.5(0.916)+0+0.2(0.916)=0.4580.183=0.275 natsD_{\mathrm{KL}}(p\|q) = 0.5\ln(2.5) + 0.3\ln(1) + 0.2\ln(0.4) = 0.5(0.916) + 0 + 0.2(-0.916) = 0.458 - 0.183 = 0.275 \text{ nats}

Induced distributions on {A,B}\{A,B\}:

$p_T(A) = 0.5(0.8) + 0.3(0.5) + 0.2(0.1) = 0.4 + 0.15 + 0.02 = 0.57$

$q_T(A) = 0.2(0.8) + 0.3(0.5) + 0.5(0.1) = 0.16 + 0.15 + 0.05 = 0.36$

KL after processing:

$D_{\mathrm{KL}}(p_T\|q_T) = 0.57\ln\frac{0.57}{0.36} + 0.43\ln\frac{0.43}{0.64} = 0.57(0.460) + 0.43(-0.398) = 0.262 - 0.171 = 0.091 \text{ nats}$

Verification: $0.091 \le 0.275$ PASS - the data processing inequality holds. The stochastic transformation $T$ reduced the KL from 0.275 to 0.091 nats, losing information about whether the data came from $p$ or $q$.
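
The same check as code (reusing scipy.special.rel_entr):

```python
import numpy as np
from scipy.special import rel_entr

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
# Row i of T is the channel T(.|x=i) from {0,1,2} to {A,B}
T = np.array([[0.8, 0.2],
              [0.5, 0.5],
              [0.1, 0.9]])

kl_before = rel_entr(p, q).sum()
p_T, q_T = p @ T, q @ T                 # induced distributions on {A, B}
kl_after = rel_entr(p_T, q_T).sum()

print(f"before: {kl_before:.3f} nats, after: {kl_after:.3f} nats")
assert kl_after <= kl_before            # data processing inequality
```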


Appendix L: Connections to Optimization Theory

L.1 KL Divergence and Mirror Descent

Mirror descent is a generalization of gradient descent where the update uses a Bregman divergence instead of Euclidean distance:

x(t+1)=argminx{ηf(x(t)),x+Dϕ(x,x(t))}\mathbf{x}^{(t+1)} = \arg\min_{\mathbf{x}} \left\{\eta \langle \nabla f(\mathbf{x}^{(t)}), \mathbf{x} \rangle + D_\phi(\mathbf{x}, \mathbf{x}^{(t)})\right\}

When ϕ(p)=kpklnpk\phi(\mathbf{p}) = \sum_k p_k \ln p_k (negative entropy) and x=p\mathbf{x} = \mathbf{p} is a probability vector, the Bregman divergence Dϕ(p,q)=DKL(pq)D_\phi(\mathbf{p}, \mathbf{q}) = D_{\mathrm{KL}}(p\|q) and mirror descent becomes the multiplicative weights update or exponentiated gradient:

pk(t+1)pk(t)exp(ηgk(t))p_k^{(t+1)} \propto p_k^{(t)} \exp(-\eta g_k^{(t)})

where g(t)=f(p(t))\mathbf{g}^{(t)} = \nabla f(\mathbf{p}^{(t)}). This is exactly how softmax policies in RL are updated, and how online learning over the probability simplex works (regret analysis via Bregman projections).
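
A minimal exponentiated-gradient sketch on the simplex (the objective and step size are illustrative):

```python
import numpy as np

def exponentiated_gradient(grad_f, p0, eta=0.5, steps=100):
    """Mirror descent with the negative-entropy mirror map: multiply by
    exp(-eta * gradient) and renormalize (multiplicative weights)."""
    p = p0.copy()
    for _ in range(steps):
        p = p * np.exp(-eta * grad_f(p))
        p /= p.sum()
    return p

# Example: minimize f(p) = D_KL(p || q); the gradient is log(p/q) + 1.
q = np.array([0.6, 0.3, 0.1])
p_star = exponentiated_gradient(lambda p: np.log(p / q) + 1.0,
                                np.ones(3) / 3)
print(p_star)  # converges to q, the unique minimizer
```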

Natural gradient as mirror descent. The natural gradient descent update:

θ(t+1)=argminθ{ηθL,θ+DKL(pθ(t)pθ)}\boldsymbol{\theta}^{(t+1)} = \arg\min_{\boldsymbol{\theta}} \left\{\eta \langle \nabla_{\boldsymbol{\theta}}\mathcal{L}, \boldsymbol{\theta} \rangle + D_{\mathrm{KL}}(p_{\boldsymbol{\theta}^{(t)}}\|p_{\boldsymbol{\theta}})\right\}

is mirror descent using KL divergence as the Bregman proximity term. Expanding the KL term to second order around $\boldsymbol{\theta}^{(t)}$ gives $\frac{1}{2}(\boldsymbol{\theta}-\boldsymbol{\theta}^{(t)})^\top F(\boldsymbol{\theta}-\boldsymbol{\theta}^{(t)})$, and the solution becomes $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta F^{-1}\nabla\mathcal{L}$, where $F$ is the Fisher matrix - reproducing the natural gradient formula.

L.2 KL-Penalized Optimization and Soft Constraints

The RLHF objective maxπE[r]βDKL(ππref)\max_\pi \mathbb{E}[r] - \beta D_{\mathrm{KL}}(\pi\|\pi_{\mathrm{ref}}) is a classic penalized optimization problem. The KL term acts as a regularizer that keeps π\pi near πref\pi_{\mathrm{ref}}.

Lagrangian form. The hard-constrained version:

maxπE[r(x,y)]s.t.DKL(ππref)ϵ\max_\pi \mathbb{E}[r(x,y)] \quad \text{s.t.} \quad D_{\mathrm{KL}}(\pi\|\pi_{\mathrm{ref}}) \le \epsilon

is equivalent to the penalized form at the β\beta that is the Lagrange multiplier for the constraint. As ϵ\epsilon \to \infty, β0\beta \to 0 (no constraint); as ϵ0\epsilon \to 0, β\beta \to \infty (stay at reference). The optimal β\beta depends on the trade-off between reward and divergence.

TRPO (Schulman et al., 2015). Trust Region Policy Optimization solves:

maxπLπold(π)s.t.DˉKL(πold,π)δ\max_\pi \mathcal{L}_{\pi_{\mathrm{old}}}(\pi) \quad \text{s.t.} \quad \bar{D}_{\mathrm{KL}}(\pi_{\mathrm{old}}, \pi) \le \delta

where $\mathcal{L}$ is the surrogate advantage and $\bar{D}_{\mathrm{KL}}$ is the average KL over visited states (the theoretical analysis uses the maximum). Each TRPO step solves a constrained optimization problem with a KL trust region - computationally expensive. PPO approximates this with the clip objective.

L.3 The Frank-Wolfe Algorithm and KL

The Frank-Wolfe algorithm solves minpCf(p)\min_{p \in \mathcal{C}} f(p) (for a convex function ff on a convex constraint set C\mathcal{C}) by linearizing ff at each step. When f=DKL(q)f = D_{\mathrm{KL}}(\cdot\|q) and C\mathcal{C} is the probability simplex, Frank-Wolfe recovers the multiplicative weights algorithm. When f=DKL(p)f = D_{\mathrm{KL}}(p\|\cdot) and C\mathcal{C} is a parametric exponential family, Frank-Wolfe corresponds to the M-step of EM.

This connection shows that common ML algorithms - EM, multiplicative weights, mirror descent, natural gradient - are all special cases of convex optimization using KL divergence as the geometry.


Appendix M: Further Reading and References

M.1 Primary Sources

| Reference | What to Read | Why |
| --- | --- | --- |
| Kullback & Leibler (1951). "On Information and Sufficiency." Ann. Math. Stat. | Sections 1-3 | Original paper; clean introduction to I-divergence and sufficiency |
| Cover & Thomas (2006). Elements of Information Theory | Ch. 2, Section 2.3 | Best textbook treatment; all proofs with full details |
| MacKay (2003). Information Theory, Inference, and Learning Algorithms | Ch. 2 (free online) | Excellent intuition; coding-theoretic perspective |
| Bishop (2006). Pattern Recognition and Machine Learning | Section 10.1 | Variational inference from an ML perspective |
| Amari (2016). Information Geometry and Its Applications | Ch. 1-3 | Statistical manifolds, I/M-projections, natural gradient |
| Csiszar & Shields (2004). "Information Theory and Statistics" | Sections 1-3 | f-divergences, Sanov's theorem, method of types |

M.2 Key ML Papers Using KL Divergence

| Paper | KL Divergence Role | Year |
| --- | --- | --- |
| Kingma & Welling, "Auto-Encoding Variational Bayes" | ELBO = reconstruction minus KL; VAE training | 2014 |
| Goodfellow et al., "Generative Adversarial Networks" | GAN objective = JSD minimization (at the optimal discriminator) | 2014 |
| Hinton, Vinyals & Dean, "Distilling the Knowledge in a Neural Network" | Forward KL for knowledge transfer | 2015 |
| Schulman et al., "Trust Region Policy Optimization" (TRPO) | KL trust-region constraint | 2015 |
| Schulman et al., "Proximal Policy Optimization" (PPO) | KL approximated by clipping | 2017 |
| Higgins et al., "beta-VAE: Learning Basic Visual Concepts" | $\beta$-weighted KL for disentanglement | 2017 |
| Mironov, "Renyi Differential Privacy" | Renyi divergence for privacy accounting | 2017 |
| Ho et al., "Denoising Diffusion Probabilistic Models" | Per-step KL terms in the diffusion ELBO | 2020 |
| Rafailov et al., "Direct Preference Optimization" (DPO) | Implicit KL-constrained policy | 2023 |

M.3 Numerical Tools

| Tool | What It Provides |
| --- | --- |
| scipy.special.rel_entr(p, q) | Element-wise $p\log(p/q)$; sum for KL |
| scipy.stats.entropy(p, q) | $D_{\mathrm{KL}}(p \Vert q)$ for discrete distributions |
| torch.nn.functional.kl_div(log_q, p) | KL given $\log q$ and $p$; note the argument order |
| torch.distributions.kl_divergence(p, q) | For torch distribution objects; supports many families |
| sklearn.metrics.mutual_info_score | Mutual information (related to KL) |
| Custom NumPy | See Appendix D.1 for a numerically stable implementation |

M.4 Quick Diagnostic Checklist

Before using KL divergence in any ML project, verify:

  • Which direction? Identify which distribution is pp (truth/reference) and which is qq (approximation). The first argument pp is where the expectation is taken.
  • Support check. Does p(x)>0p(x) > 0 wherever q(x)=0q(x) = 0? If so, DKL(pq)=+D_{\mathrm{KL}}(p\|q) = +\infty. Add smoothing if needed.
  • Units. Is your implementation using natural log (nats) or log base 2 (bits)? Be consistent.
  • Forward or reverse? Is this an MLE/distillation problem (forward KL) or variational inference (reverse KL)? The choice determines whether the result is mode-seeking or mean-seeking.
  • Numerical stability. Are you computing log(p/q)\log(p/q) directly (unstable) or using log-softmax (stable)?
  • Sample size for estimation. KL estimators from finite samples have high variance for large dd - consider using MMD or Wasserstein for high-dimensional distributions.
  • KL coefficient. If using KL as a regularizer, is β\beta tuned? Too small -> reward hacking; too large -> no alignment gain.
  • ELBO monitoring. In VAE training, are both the reconstruction term and KL term logged separately? Monitoring their ratio over training diagnoses posterior collapse early.

M.5 Conceptual Summary: One Page

KL divergence DKL(pq)=Ep[log(p/q)]D_{\mathrm{KL}}(p\|q) = \mathbb{E}_p[\log(p/q)] measures:

  • The expected excess code length when coding pp-data with a qq-code
  • The average log-likelihood ratio evidence that data came from pp not qq
  • The information gain from using pp instead of qq as a model

Three things to always remember:

  1. Direction matters. DKL(pq)DKL(qp)D_{\mathrm{KL}}(p\|q) \ne D_{\mathrm{KL}}(q\|p). Forward KL -> mass-covering. Reverse KL -> mode-seeking.

  2. Non-negativity is everything. DKL0D_{\mathrm{KL}} \ge 0 with equality iff p=qp = q. This single fact proves: entropy is maximized at uniform, the ELBO is a lower bound, MLE is consistent.

  3. It's everywhere in AI. MLE, VAE, RLHF, distillation, diffusion, VI - all are special cases of optimizing a KL divergence in a specific direction. Understanding which direction, and what approximations are made, is the key to understanding modern deep learning at a principled level.


Section 09-02 KL Divergence - 13 main sections, 13 appendices (A-M). Prerequisites: Section 09-01 Entropy | Next: Section 09-03 Mutual Information