
KL Divergence - Appendices A (Detailed Proofs) through J (Notation and Conventions)

Appendix A: Detailed Proofs

A.1 Proof of Non-Negativity via Log-Sum Inequality

The log-sum inequality provides an alternative, direct proof of Gibbs' inequality, with Jensen's inequality invoked only once, inside the proof of the log-sum inequality itself.

Log-Sum Inequality. For non-negative reals $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$:

$$\sum_{i=1}^n a_i \log\frac{a_i}{b_i} \ge \left(\sum_{i=1}^n a_i\right)\log\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}$$

with equality iff all $a_i/b_i$ are equal.

Proof of log-sum inequality. The function $f(t) = t\log t$ is convex (since $f''(t) = 1/t > 0$). Write $A = \sum_j a_j$ and $B = \sum_j b_j$. By the weighted Jensen inequality with weights $b_i/B$ applied to the points $a_i/b_i$:

$$\frac{1}{B}\sum_i b_i \cdot \frac{a_i}{b_i}\log\frac{a_i}{b_i} \ge \left(\frac{1}{B}\sum_i b_i \cdot \frac{a_i}{b_i}\right)\log\left(\frac{1}{B}\sum_i b_i \cdot \frac{a_i}{b_i}\right) = \frac{A}{B}\log\frac{A}{B}$$

Multiplying by $B$ gives the result. $\square$

KL non-negativity from log-sum inequality. Apply the inequality with $a_i = p(x_i)$ and $b_i = q(x_i)$: since $\sum_i p_i = 1 = \sum_i q_i$,

$$D_{\mathrm{KL}}(p\|q) = \sum_i p_i\log\frac{p_i}{q_i} \ge \left(\sum_i p_i\right)\log\frac{\sum_i p_i}{\sum_i q_i} = 1 \cdot \log\frac{1}{1} = 0. \quad\square$$

A.2 Convexity of KL: Detailed Proof

We prove that $(p,q) \mapsto D_{\mathrm{KL}}(p\|q)$ is jointly convex. The key object is the perspective function $g(s,t) = s\log(s/t)$ of the convex function $s \mapsto s\log s$; we verify its joint convexity directly via the Hessian.

Lemma. The function $g(s,t) = s\log(s/t)$ (with $g(0,t) = 0$ and $g(s,0) = +\infty$ for $s > 0$) is jointly convex on $\{(s,t) : s \ge 0,\ t > 0\}$.

Proof. Write $g(s,t) = s\log s - s\log t$. We need $H_g \succeq 0$. For $s > 0$ the Hessian is:

$$H_g = \begin{pmatrix} 1/s & -1/t \\ -1/t & s/t^2 \end{pmatrix}$$

The determinant is $(1/s)(s/t^2) - 1/t^2 = 0$ and the trace is $1/s + s/t^2 > 0$; for a $2\times 2$ matrix, $\det \ge 0$ and trace $\ge 0$ imply PSD. $\square$

Since $D_{\mathrm{KL}}(p\|q) = \sum_x g(p(x), q(x))$ and sums of jointly convex functions are jointly convex, $D_{\mathrm{KL}}$ is jointly convex. $\square$

A.3 Pinsker's Inequality

Theorem (Pinsker's Inequality). $\mathrm{TV}(p,q) \le \sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(p\|q)}$.

Proof. The proof reduces to the binary case. For $a, b \in [0,1]$, the two-point inequality states, in nats:

$$a\log\frac{a}{b} + (1-a)\log\frac{1-a}{1-b} \ge 2(a-b)^2$$

Apply the data-processing inequality to the binary partition $\{A, A^c\}$ with $A = \{p > q\}$, the event that achieves the total variation ($p(A) - q(A) = \mathrm{TV}(p,q)$):

$$D_{\mathrm{KL}}(p\|q) \ge D_{\mathrm{KL}}(\text{Bern}(p(A))\|\text{Bern}(q(A))) \ge 2(p(A) - q(A))^2 = \frac{1}{2}(2\,\mathrm{TV})^2$$

Rearranging: $\mathrm{TV}(p,q) \le \sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(p\|q)}$. $\square$

A related bound is the Bretagnolle-Huber inequality, $\mathrm{TV}(p,q) \le \sqrt{1 - e^{-D_{\mathrm{KL}}(p\|q)}}$: weaker for small KL, but still informative when the KL is large (Pinsker's bound exceeds 1 once $D_{\mathrm{KL}} > 2$).

Pinsker's inequality in AI: KL divergence is smooth and differentiable (gradients exist for optimization), while total variation bounds event probabilities directly. Pinsker's inequality lets us translate KL bounds from optimization theory into guarantees about individual event probabilities - useful in PAC-Bayes learning theory.

A.4 KL Divergence for the Gaussian: Step-by-Step

We derive the multivariate Gaussian KL formula in detail.

Let $p = \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)$ and $q = \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)$ on $\mathbb{R}^d$. The log-ratio is:

$$\log\frac{p(\mathbf{x})}{q(\mathbf{x})} = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^\top\Sigma_1^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^\top\Sigma_2^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) + \frac{1}{2}\log\frac{\det\Sigma_2}{\det\Sigma_1}$$

Taking the expectation under $p$, using:

  • $\mathbb{E}_p[(\mathbf{x}-\boldsymbol{\mu}_1)^\top\Sigma_1^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)] = \operatorname{tr}(\Sigma_1^{-1}\Sigma_1) = d$
  • $\mathbb{E}_p[(\mathbf{x}-\boldsymbol{\mu}_2)^\top\Sigma_2^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)] = \operatorname{tr}(\Sigma_2^{-1}\Sigma_1) + (\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^\top\Sigma_2^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)$

gives:

$$D_{\mathrm{KL}}(p\|q) = \frac{1}{2}\left[\operatorname{tr}(\Sigma_2^{-1}\Sigma_1) + (\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^\top\Sigma_2^{-1}(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1) - d + \log\frac{\det\Sigma_2}{\det\Sigma_1}\right]$$

The terms correspond to:

  1. $\operatorname{tr}(\Sigma_2^{-1}\Sigma_1) - d$: how much the covariances differ (zero when $\Sigma_1 = \Sigma_2$)
  2. $(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^\top\Sigma_2^{-1}(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)$: squared Mahalanobis distance between the means
  3. $\log(\det\Sigma_2/\det\Sigma_1)$: log ratio of volumes (zero when $\det\Sigma_1 = \det\Sigma_2$)
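To make the formula concrete, here is a minimal NumPy sketch (the function name and test values are our own illustration, not from the lesson):

import numpy as np

def gaussian_kl(mu1, S1, mu2, S2):
    """D_KL(N(mu1, S1) || N(mu2, S2)) in nats, via the closed form above."""
    d = mu1.shape[0]
    S2_inv = np.linalg.inv(S2)
    diff = mu2 - mu1
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    return 0.5 * (np.trace(S2_inv @ S1)      # covariance mismatch
                  + diff @ S2_inv @ diff     # squared Mahalanobis distance
                  - d                        # dimension offset
                  + logdet2 - logdet1)       # log volume ratio

# Sanity check: the KL of a Gaussian against itself is zero.
mu = np.array([1.0, -2.0])
S = np.array([[2.0, 0.3], [0.3, 1.0]])
assert abs(gaussian_kl(mu, S, mu, S)) < 1e-12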

Appendix B: KL Divergence Computations and Reference Formulas

B.1 Quick Reference: KL Formulas for Common Distributions

KL DIVERGENCE CLOSED FORMS - QUICK REFERENCE


  SCALAR GAUSSIAN:
  D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))
      = log(sigma2/sigma1) + (sigma1^2 + (mu1 - mu2)^2)/(2 sigma2^2) - 1/2

  VAE ENCODER -> STANDARD NORMAL:
  D_KL(N(mu, sigma^2) || N(0, 1)) = (1/2)(mu^2 + sigma^2 - log sigma^2 - 1)

  MULTIVARIATE GAUSSIAN:
  D_KL(N(mu1, Sigma1) || N(mu2, Sigma2))
      = (1/2)[tr(Sigma2^-1 Sigma1) + (mu2 - mu1)^T Sigma2^-1 (mu2 - mu1) - d
              + log(det Sigma2 / det Sigma1)]

  BERNOULLI:
  D_KL(Bern(p) || Bern(q)) = p log(p/q) + (1-p) log((1-p)/(1-q))

  CATEGORICAL (general discrete):
  D_KL(p || q) = sum_k p_k log(p_k / q_k)

  POISSON:
  D_KL(Poisson(lambda1) || Poisson(lambda2))
      = lambda1 log(lambda1/lambda2) - lambda1 + lambda2

  EXPONENTIAL:
  D_KL(Exp(lambda1) || Exp(lambda2)) = log(lambda1/lambda2) + lambda2/lambda1 - 1

  BETA:
  D_KL(Beta(alpha1, beta1) || Beta(alpha2, beta2))
      = log[B(alpha2, beta2)/B(alpha1, beta1)]
        + (alpha1 - alpha2) psi(alpha1) + (beta1 - beta2) psi(beta1)
        + (alpha2 - alpha1 + beta2 - beta1) psi(alpha1 + beta1)
      where psi = digamma function, B = beta function


B.2 Worked Example: KL Between Poisson Distributions

A Poisson distribution $\text{Pois}(\lambda)$ has PMF $p(k) = e^{-\lambda}\lambda^k/k!$ for $k = 0,1,2,\ldots$

For $D_{\mathrm{KL}}(\text{Pois}(\lambda_1)\|\text{Pois}(\lambda_2))$, the log-ratio of PMFs is:

$$\log\frac{p_{\lambda_1}(k)}{p_{\lambda_2}(k)} = (\lambda_2 - \lambda_1) + k\log\frac{\lambda_1}{\lambda_2}$$

$$D_{\mathrm{KL}} = \mathbb{E}_{\lambda_1}\!\left[(\lambda_2-\lambda_1) + k\log\frac{\lambda_1}{\lambda_2}\right] = (\lambda_2 - \lambda_1) + \lambda_1\log\frac{\lambda_1}{\lambda_2}$$

For $\lambda_1 = 3, \lambda_2 = 2$: $D_{\mathrm{KL}} = (2-3) + 3\ln(3/2) = -1 + 3(0.405) = 0.216$ nats.
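A quick numeric check of this value (a minimal sketch; we truncate the infinite sum at k = 100, far beyond any appreciable Poisson(3) mass):

import math

lam1, lam2 = 3.0, 2.0

# Closed form derived above.
closed = (lam2 - lam1) + lam1 * math.log(lam1 / lam2)

# Direct (truncated) sum over the PMFs.
def pois_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

direct = sum(pois_pmf(k, lam1) * math.log(pois_pmf(k, lam1) / pois_pmf(k, lam2))
             for k in range(100))

print(closed, direct)  # both ~0.2164 nats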

For AI: Poisson distributions model count data (number of events, word counts in LDA topic models). The KL between Poisson distributions appears in the ELBO for Poisson process models and in topic model variational inference.

B.3 Worked Example: KL for Categorical - Softmax Outputs

Let $\mathbf{z}_1 = (2.0, 1.0, 0.1)$ and $\mathbf{z}_2 = (1.5, 1.0, 0.5)$ be logit vectors.

Step 1: $p_1 = \text{softmax}(\mathbf{z}_1)$:

$$p_1 = (e^2, e^1, e^{0.1})/(e^2 + e^1 + e^{0.1}) = (7.389, 2.718, 1.105)/11.212 = (0.659, 0.242, 0.099)$$

Step 2: $p_2 = \text{softmax}(\mathbf{z}_2)$:

$$p_2 = (e^{1.5}, e^1, e^{0.5})/(e^{1.5} + e^1 + e^{0.5}) = (4.482, 2.718, 1.649)/8.849 = (0.507, 0.307, 0.186)$$

Step 3: $D_{\mathrm{KL}}(p_1\|p_2)$:

$$= 0.659\ln(0.659/0.507) + 0.242\ln(0.242/0.307) + 0.099\ln(0.099/0.186)$$

$$= 0.659(0.261) + 0.242(-0.238) + 0.099(-0.635) = 0.172 - 0.058 - 0.063 = 0.051 \text{ nats}$$

For AI: This is the KL component when computing the distillation loss between a teacher ($p_1$) and a student ($p_2$). The small value (0.051 nats) indicates the student's distribution is reasonably close to the teacher's.
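The same computation in PyTorch, as a sanity check (a minimal sketch; variable names are ours):

import torch
import torch.nn.functional as F

z1 = torch.tensor([2.0, 1.0, 0.1])  # teacher logits
z2 = torch.tensor([1.5, 1.0, 0.5])  # student logits

log_p1 = F.log_softmax(z1, dim=-1)
log_p2 = F.log_softmax(z2, dim=-1)

# KL(p1 || p2) = sum p1 * (log p1 - log p2)
kl = torch.sum(log_p1.exp() * (log_p1 - log_p2))
print(kl.item())  # ~0.053 nats (the hand computation above rounds to 0.051)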

B.4 KL Divergence and Hypothesis Testing

The connection between KL divergence and hypothesis testing is both historical (Kullback 1959) and practically important.

Setting: $n$ i.i.d. observations from an unknown distribution. Null hypothesis $H_0$: data from $p$; alternative $H_1$: data from $q$.

Log-likelihood ratio test statistic:

$$\Lambda_n = \sum_{i=1}^n \log\frac{p(x_i)}{q(x_i)}$$

By the law of large numbers, $\Lambda_n/n \to \mathbb{E}_p[\log(p/q)] = D_{\mathrm{KL}}(p\|q)$ as $n \to \infty$ when the truth is $p$.

Stein's lemma (type II error rate): For fixed type I error probability $\alpha \in (0,1)$, the best achievable type II error probability $\beta_n^*$ (accepting $H_0$ when the data actually come from $q$) decays as:

$$\lim_{n\to\infty} \frac{1}{n}\log\beta_n^* = -D_{\mathrm{KL}}(p\|q)$$

Larger KL divergence -> faster exponential decay of type II error -> easier to distinguish $p$ from $q$ with more data.

Chernoff exponent: The best achievable equal-error-rate exponent (driving both error types to zero at the same exponential rate) is:

$$C(p,q) = -\min_{\alpha\in[0,1]}\log\sum_x p(x)^\alpha q(x)^{1-\alpha} = \max_{\alpha\in(0,1)} (1-\alpha)\,D_\alpha(p\|q)$$

a scaled Rényi divergence maximized over orders $\alpha \in (0,1)$.

For AI: The hypothesis testing perspective explains why DKL(ππref)D_{\mathrm{KL}}(\pi\|\pi_{\mathrm{ref}}) in RLHF bounds how detectable the policy shift is: small KL means the fine-tuned model's outputs are statistically indistinguishable from the reference model's outputs.


Appendix C: KL Divergence in Broader ML Contexts

C.1 PAC-Bayes Bounds

PAC-Bayes theory (McAllester 1999; Seeger 2002) gives generalization bounds for stochastic predictors. The key theorem:

PAC-Bayes Theorem. Let $P$ be a prior over hypotheses (fixed before seeing data) and $Q$ be any posterior (possibly data-dependent). For any $\delta > 0$, with probability $\ge 1 - \delta$ over the training set:

$$\mathbb{E}_{h \sim Q}[\mathcal{L}(h)] \le \mathbb{E}_{h \sim Q}[\hat{\mathcal{L}}(h)] + \sqrt{\frac{D_{\mathrm{KL}}(Q\|P) + \log(2\sqrt{n}/\delta)}{2n}}$$

where $\mathcal{L}(h)$ is the true loss and $\hat{\mathcal{L}}(h)$ is the empirical loss.

Interpretation: The generalization gap is bounded by roughly $\sqrt{D_{\mathrm{KL}}(Q\|P)/n}$. The posterior $Q$ can be any distribution over hypotheses - including one concentrated around a specific trained network - and the bound still holds. Models that stay "close" to a simple prior in KL generalize well. Modern applications: PAC-Bayes bounds for LoRA-fine-tuned LLMs (with $P$ centered at the pretrained weights and $Q$ at the fine-tuned weights).

C.2 KL in Variational Inference for Bayesian Neural Networks

Bayesian neural networks (BNNs) maintain distributions over weights $\boldsymbol{\theta}$ rather than point estimates. The posterior $p(\boldsymbol{\theta}|\mathcal{D})$ is intractable for large networks. Variational inference minimizes:

$$D_{\mathrm{KL}}(q(\boldsymbol{\theta})\|p(\boldsymbol{\theta}|\mathcal{D})) = D_{\mathrm{KL}}(q(\boldsymbol{\theta})\|p(\boldsymbol{\theta})) - \mathbb{E}_q[\log p(\mathcal{D}|\boldsymbol{\theta})] + \log p(\mathcal{D})$$

The ELBO is $\mathcal{L} = \mathbb{E}_q[\log p(\mathcal{D}|\boldsymbol{\theta})] - D_{\mathrm{KL}}(q(\boldsymbol{\theta})\|p(\boldsymbol{\theta}))$. The KL term $D_{\mathrm{KL}}(q\|p)$ acts as a regularizer: it penalizes variational distributions that deviate from the prior $p(\boldsymbol{\theta})$.

Practical implementations: Bayes By Backprop (Blundell et al., 2015) parameterizes $q(\boldsymbol{\theta}) = \prod_i \mathcal{N}(\mu_i, \sigma_i^2)$ and trains with the reparameterization trick: each weight is sampled as $\theta_i = \mu_i + \sigma_i\varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0,1)$. For a Gaussian prior $\mathcal{N}(0, \sigma_p^2)$, the KL term per weight is $\frac{1}{2}(\mu_i^2/\sigma_p^2 + \sigma_i^2/\sigma_p^2 - 1 - \log(\sigma_i^2/\sigma_p^2))$.
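A sketch of that per-weight KL term in PyTorch (names are ours; sigma is parameterized through its log so it stays positive by construction):

import torch

def bbb_kl(mu, log_sigma, sigma_prior=1.0):
    """KL( N(mu, sigma^2) || N(0, sigma_prior^2) ), summed over all weights."""
    var_ratio = torch.exp(2 * log_sigma) / sigma_prior**2
    return 0.5 * torch.sum(mu**2 / sigma_prior**2 + var_ratio - 1 - torch.log(var_ratio))

mu, log_sigma = torch.zeros(10), torch.zeros(10)  # q = N(0, 1) for every weight
print(bbb_kl(mu, log_sigma))  # tensor(0.) -- q coincides with the prior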

C.3 KL in Contrastive Learning and Representation

Contrastive learning methods like SimCLR (Chen et al., 2020) and MoCo (He et al., 2020) maximize a lower bound on mutual information:

$$I(X; Z) \ge \mathbb{E}[\text{InfoNCE}(X, Z)] = \mathbb{E}\!\left[\log\frac{e^{f(x,z)}}{\frac{1}{N}\sum_{j=1}^N e^{f(x,z_j)}}\right]$$

The InfoNCE loss lower-bounds $I(X;Z)$, which is itself a KL divergence, $D_{\mathrm{KL}}(P(X,Z)\|P(X)P(Z))$ (see Section 09-03). CLIP (Radford et al., 2021) uses a symmetric InfoNCE to align image and text embeddings - equivalent to maximizing the KL between the joint image-text distribution and the product of marginals.

CPC (Contrastive Predictive Coding, van den Oord et al., 2018): Used in speech (wav2vec 2.0, HuBERT) and images. The representation $z_t$ at time $t$ predicts future $z_{t+k}$ via a density ratio estimator. The InfoNCE objective is a lower bound on $\sum_k I(z_t; z_{t+k})$. Training audio encoders this way produces features that capture the semantic structure of speech - because mutual information (a KL divergence) rewards capturing any statistical dependency.

C.4 KL in Score-Based Generative Models

Score-based models (Song & Ermon, 2020) and their diffusion counterparts learn the score function $\nabla_\mathbf{x}\log p(\mathbf{x})$ to generate samples via Langevin dynamics. The connection to KL:

Score matching minimizes:

$$J(\boldsymbol{\theta}) = \mathbb{E}_p\!\left[\frac{1}{2}\|s_{\boldsymbol{\theta}}(\mathbf{x}) - \nabla\log p(\mathbf{x})\|^2\right]$$

where $s_{\boldsymbol{\theta}} \approx \nabla\log p$ is the learned score. Integration by parts (Hyvärinen, 2005) removes the unknown true score:

$$J(\boldsymbol{\theta}) = \mathbb{E}_p\!\left[\frac{1}{2}\|s_{\boldsymbol{\theta}}(\mathbf{x})\|^2 + \nabla\cdot s_{\boldsymbol{\theta}}(\mathbf{x})\right] + \text{const}$$

The objective being minimized is the Fisher divergence $\int p(\mathbf{x})\|\nabla\log(p(\mathbf{x})/q(\mathbf{x}))\|^2 d\mathbf{x}$, a gradient-based relative of KL that compares the gradient of the log-density ratio rather than the ratio itself. This is the information-geometric connection between score matching and KL divergence.

C.5 KL and Temperature in LLM Sampling

LLM inference involves sampling from $p_\tau(y|x) = \text{softmax}(\mathbf{z}/\tau)$ at temperature $\tau$. The KL divergence between the original distribution ($\tau = 1$) and the temperature-scaled distribution is:

$$D_{\mathrm{KL}}(p_1\|p_\tau) = \sum_k p_1(k)\log\frac{p_1(k)}{p_\tau(k)} = \sum_k p_1(k)\left[z_k - \log\sum_j e^{z_j} - \frac{z_k}{\tau} + \log\sum_j e^{z_j/\tau}\right]$$

As $\tau \to 0$: $p_\tau \to$ one-hot (deterministic, picks the argmax), and $D_{\mathrm{KL}}(p_1\|p_\tau) \to +\infty$ whenever $p_1$ puts any mass off the argmax, since $p_\tau$ assigns vanishing probability there. (The reverse direction stays finite: $D_{\mathrm{KL}}(p_\tau\|p_1) \to -\log p_1(\operatorname{argmax})$.)

As $\tau \to \infty$: $p_\tau \to$ uniform, and $D_{\mathrm{KL}}(p_1\|p_\tau) \to \log|\mathcal{V}| - H(p_1)$ - the divergence equals the KL from uniform.

Practical implications: The "temperature" knob in LLM APIs (temperature=0.7, top-p=0.9) directly controls the divergence from the trained distribution. High temperature increases diversity (pushes $p_\tau$ toward uniform); low temperature reduces diversity (pushes it toward the argmax). The KL from the trained distribution is zero at $\tau = 1$ and grows in both directions.
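A small PyTorch sketch of this KL-versus-temperature curve (example logits are ours):

import torch
import torch.nn.functional as F

z = torch.tensor([2.0, 1.0, 0.1])  # example logits

def kl_to_temperature(z, tau):
    """D_KL(p_1 || p_tau) for softmax distributions of the same logits."""
    log_p1 = F.log_softmax(z, dim=-1)
    log_pt = F.log_softmax(z / tau, dim=-1)
    return torch.sum(log_p1.exp() * (log_p1 - log_pt)).item()

for tau in [0.25, 0.5, 1.0, 2.0, 8.0]:
    print(f"tau={tau}: KL = {kl_to_temperature(z, tau):.4f} nats")
# Zero at tau = 1, growing in both directions (steeply as tau -> 0).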


Appendix D: Numerical Stability and Implementation

D.1 The log-sum-exp Trick for KL Computation

Computing DKL(pq)D_{\mathrm{KL}}(p\|q) with softmax distributions requires care to avoid numerical overflow.

Naive implementation (unstable):

# WRONG - can overflow for large logits (and log(0) = -inf for zero probabilities)
import torch

kl = torch.sum(p * (torch.log(p) - torch.log(q)))

Stable implementation using log-softmax:

# CORRECT - uses log-space throughout
import torch
import torch.nn.functional as F

def kl_divergence_stable(logits_p, logits_q):
    """
    KL(p || q) where p = softmax(logits_p), q = softmax(logits_q).
    Uses log_softmax for numerical stability.
    """
    log_p = F.log_softmax(logits_p, dim=-1)  # log(softmax) in one step
    log_q = F.log_softmax(logits_q, dim=-1)
    p = torch.exp(log_p)
    return torch.sum(p * (log_p - log_q), dim=-1)

PyTorch also provides torch.nn.functional.kl_div(log_q, p), which computes $\sum p\,(\log p - \log q)$ given a pre-computed log_q. Note the unusual argument order: the first argument is the log-probabilities of $q$ (the approximating distribution) and the second is $p$ (as probabilities).

D.2 Estimating KL from Samples

When only samples are available (not closed-form distributions), KL must be estimated.

Monte Carlo estimator. Given samples $x_1, \ldots, x_n \sim p$ and access to $\log p(x)$ and $\log q(x)$:

$$\hat{D}_{\mathrm{KL}}(p\|q) = \frac{1}{n}\sum_{i=1}^n \log\frac{p(x_i)}{q(x_i)}$$

This is unbiased: $\mathbb{E}[\hat{D}] = D_{\mathrm{KL}}(p\|q)$.

k-nearest-neighbor estimator (Perez-Cruz, 2008). When neither $p(x)$ nor $q(x)$ is known analytically - only samples from both distributions - the kNN estimator uses:

$$\hat{D}_{\mathrm{KL}}(p\|q) = \frac{d}{n}\sum_{i=1}^n \log\frac{\nu_k(x_i)}{\rho_k(x_i)} + \log\frac{m}{n-1}$$

where $\rho_k(x_i)$ is the distance from $x_i$ to its $k$-th nearest neighbor among the other $p$-samples, $\nu_k(x_i)$ is the distance to its $k$-th nearest neighbor among the $q$-samples, $n$ = number of $p$-samples, $m$ = number of $q$-samples, and $d$ = dimension. (Distant $q$-neighbors relative to $p$-neighbors indicate regions where $p$ has mass but $q$ does not, inflating the KL.)

This estimator is used in representation learning evaluation, where you have samples from an encoder's output distribution qq and want to measure DKL(qptarget)D_{\mathrm{KL}}(q\|p_{\text{target}}) without a closed form.
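A minimal O(n^2) NumPy sketch of the estimator (our own illustration; production code would use a KD-tree for the neighbor searches):

import numpy as np

def knn_kl_estimate(x_p, x_q, k=1):
    """kNN estimate of KL(p || q) from samples x_p ~ p (n, d) and x_q ~ q (m, d)."""
    n, d = x_p.shape
    m = x_q.shape[0]
    total = 0.0
    for x in x_p:
        rho = np.sort(np.linalg.norm(x_p - x, axis=1))[k]     # k-th NN in p (index 0 is x itself)
        nu = np.sort(np.linalg.norm(x_q - x, axis=1))[k - 1]  # k-th NN in q
        total += np.log(nu / rho)
    return (d / n) * total + np.log(m / (n - 1))

# Sanity check against the closed form for two 1-D Gaussians: KL = 0.5 nats.
rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=(2000, 1))
x_q = rng.normal(1.0, 1.0, size=(2000, 1))
print(knn_kl_estimate(x_p, x_q))  # close to 0.5, up to sampling noise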

D.3 Symmetric KL Implementation

For applications requiring symmetric divergence without the issues of JSD (which requires computing the mixture), Jeffrey's divergence is simpler:

import numpy as np

def jeffreys_divergence(p_probs, q_probs, eps=1e-10):
    """
    J(p, q) = 0.5 * (KL(p||q) + KL(q||p))
    Symmetric, and avoids forming the mixture distribution that JSD requires.
    """
    p = p_probs + eps   # smooth to avoid log(0) and division by zero
    q = q_probs + eps
    p = p / p.sum()     # renormalize after smoothing
    q = q / q.sum()
    kl_pq = (p * np.log(p / q)).sum()
    kl_qp = (q * np.log(q / p)).sum()
    return 0.5 * (kl_pq + kl_qp)

D.4 KL Divergence in PyTorch for VAE Training

The standard VAE KL term implementation:

import torch

def vae_kl_loss(mu, log_var):
    """
    D_KL(N(mu, sigma^2) || N(0, 1)) per-sample, summed over latent dimensions.

    Formula: 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)
           = 0.5 * sum(mu^2 + exp(log_var) - log_var - 1)

    Input: mu, log_var - shape (batch, latent_dim)
    Output: scalar loss (per-sample KL averaged over the batch)
    """
    kl = 0.5 * torch.sum(mu**2 + torch.exp(log_var) - log_var - 1, dim=-1)
    return kl.mean()  # average over batch

Note: using log_var (log of variance) instead of sigma directly avoids numerical issues with very small variances and ensures $\sigma^2 > 0$ by construction.

RLHF KL penalty implementation:

def rlhf_kl_penalty(logprobs_policy, logprobs_ref, beta=0.1):
    """
    Monte Carlo estimate of D_KL(pi_policy || pi_ref) on sampled sequences.

    logprobs_policy, logprobs_ref: log-probs of the *sampled* tokens under
                                   each model, shape (batch, seq_len)
    Returns: beta-weighted mean summed KL per sequence
    """
    # Tokens are sampled from pi_policy, so averaging log(pi/pi_ref) over
    # samples estimates E_pi[log pi - log pi_ref] = KL.
    kl_per_token = logprobs_policy - logprobs_ref  # log ratio per token
    kl_per_seq = kl_per_token.sum(dim=-1)          # sum over sequence length
    return beta * kl_per_seq.mean()

Appendix E: Variational Inference - A Deep Dive

E.1 The Variational Inference Problem

Bayesian inference requires computing the posterior $p(\mathbf{z}|\mathbf{x}) = p(\mathbf{x}|\mathbf{z})p(\mathbf{z})/p(\mathbf{x})$. The denominator $p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z})\,d\mathbf{z}$ is typically intractable - the integral has no closed form for nonlinear likelihoods.

Variational inference approximates the posterior with a tractable family $\mathcal{Q} = \{q_\phi : \phi \in \Phi\}$ by solving:

$$q^* = \arg\min_{q \in \mathcal{Q}} D_{\mathrm{KL}}(q(\mathbf{z}) \| p(\mathbf{z}|\mathbf{x}))$$

Why reverse KL? Because it is the only tractable direction. Computing $D_{\mathrm{KL}}(p\|q)$ requires expectations under $p(\mathbf{z}|\mathbf{x})$, which is exactly the intractable quantity we are trying to avoid. Computing $D_{\mathrm{KL}}(q\|p)$ requires only samples from $q$ and evaluations of $p(\mathbf{z})p(\mathbf{x}|\mathbf{z})$ - both tractable.

ELBO derivation. Expanding the reverse KL:

$$D_{\mathrm{KL}}(q(\mathbf{z})\|p(\mathbf{z}|\mathbf{x})) = \mathbb{E}_q\!\left[\log\frac{q(\mathbf{z})}{p(\mathbf{z}|\mathbf{x})}\right] = \mathbb{E}_q[\log q(\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{z}|\mathbf{x})]$$

Using $\log p(\mathbf{z}|\mathbf{x}) = \log p(\mathbf{x}|\mathbf{z}) + \log p(\mathbf{z}) - \log p(\mathbf{x})$:

$$= \mathbb{E}_q[\log q(\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{z})] + \log p(\mathbf{x})$$

Rearranging:

$$\log p(\mathbf{x}) = \underbrace{\mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})] - D_{\mathrm{KL}}(q(\mathbf{z})\|p(\mathbf{z}))}_{\mathcal{L}(q) = \text{ELBO}} + D_{\mathrm{KL}}(q(\mathbf{z})\|p(\mathbf{z}|\mathbf{x}))$$

Since $D_{\mathrm{KL}} \ge 0$: $\mathcal{L}(q) \le \log p(\mathbf{x})$ - the ELBO is a lower bound. Maximizing the ELBO over $q$ simultaneously (a) tightens the bound (reduces the KL gap) and (b) improves the model fit (increases $\log p(\mathbf{x})$ when $p_{\boldsymbol{\theta}}$ is also optimized).
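A tiny discrete sanity check of this decomposition (all numbers are made up; z takes two values):

import numpy as np

# Toy model: z in {0, 1}, a prior p(z), and a likelihood p(x|z) for one observed x.
p_z = np.array([0.7, 0.3])
p_x_given_z = np.array([0.2, 0.9])

p_x = np.sum(p_x_given_z * p_z)        # evidence p(x)
posterior = p_x_given_z * p_z / p_x    # exact posterior p(z|x)

q = np.array([0.5, 0.5])               # an arbitrary variational distribution
elbo = np.sum(q * np.log(p_x_given_z)) - np.sum(q * np.log(q / p_z))
kl_gap = np.sum(q * np.log(q / posterior))

print(np.log(p_x), elbo + kl_gap)  # identical: log p(x) = ELBO + KL(q || p(z|x))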

E.2 Mean-Field Variational Inference

The mean-field approximation restricts to factored distributions $q(\mathbf{z}) = \prod_{j=1}^d q_j(z_j)$. This is an enormous simplification - the full posterior may have complex correlations, but the approximation ignores all correlations between latent variables.

Coordinate ascent VI (CAVI). For mean-field, the optimal factor $q_j^*$ satisfies:

$$\log q_j^*(z_j) = \mathbb{E}_{q_{-j}}[\log p(\mathbf{x}, \mathbf{z})] + \text{const}$$

where $\mathbb{E}_{q_{-j}}$ denotes expectation over all factors except $q_j$. This is a fixed-point equation: each factor depends on the others. CAVI iterates: update $q_1$ given $q_2, \ldots, q_d$; then update $q_2$ given the new $q_1$ and the old $q_3, \ldots, q_d$; and so on.

Guarantee: Each update increases the ELBO (decreases the reverse KL). CAVI converges to a local optimum.

For AI: Mean-field VI was the basis for early Bayesian neural networks (BNNs) and topic models (LDA). Its mode-seeking bias (reverse KL) means it systematically underestimates posterior uncertainty - a known limitation. Modern BNNs often use Monte Carlo Dropout as a cheaper alternative.

E.3 ELBO as a Free Energy

The ELBO has a physical interpretation as a variational free energy (Helmholtz, Gibbs):

$$\mathcal{L}(q) = \mathbb{E}_q[\log p(\mathbf{x},\mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})] = \mathbb{E}_q[\log p(\mathbf{x},\mathbf{z})] + H(q)$$

This is the expected log joint plus the entropy of $q$. Maximizing the ELBO simultaneously:

  • Maximizes $\mathbb{E}_q[\log p(\mathbf{x},\mathbf{z})]$: encourages $q$ to put mass on high-probability regions of the joint
  • Maximizes $H(q)$: encourages $q$ to spread out (high entropy = uncertainty), preventing mode collapse

The tension between these two terms is the fundamental trade-off in variational inference. In physics, this is the Gibbs free energy minimization: equilibrium balances energy minimization (term 1) with entropy maximization (term 2), at temperature 1.

For AI: The free energy interpretation connects to energy-based models (LeCun et al., 2006) and to the "free energy principle" in computational neuroscience (Friston, 2010). The ELBO appears in all modern latent variable models: VAEs, VQ-VAE, neural process families, and latent diffusion models (Stable Diffusion uses a VAE for compression, then diffusion in the latent space).

E.4 Amortized Variational Inference

Standard VI optimizes a separate $q_\phi(\mathbf{z})$ per data point - $O(n)$ separate optimization problems. Amortized VI (Kingma & Welling, 2014) instead trains a single encoder network $q_\phi(\mathbf{z}|\mathbf{x})$ that takes $\mathbf{x}$ as input:

$$\phi^* = \arg\max_\phi \mathbb{E}_{p(\mathbf{x})}\!\left[\mathcal{L}(\phi, \boldsymbol{\theta}; \mathbf{x})\right]$$

The encoder "amortizes" the per-sample optimization across the full dataset, generalizing to new samples at inference time. This is what makes VAEs practical for large datasets: a single forward pass through the encoder yields $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x})))$ instantly.

Amortization gap: The amortized $q_\phi(\mathbf{z}|\mathbf{x})$ is generally worse than the optimal per-sample $q^*(\mathbf{z}|\mathbf{x})$ (it must generalize, so it trades off per-sample accuracy). The difference $D_{\mathrm{KL}}(q_\phi\|p) - D_{\mathrm{KL}}(q^*\|p) \ge 0$ is called the amortization gap and is an active research topic.


Appendix F: The RLHF-KL Alignment Triangle

F.1 Three Ways KL Appears in Alignment

Modern LLM alignment uses KL divergence in three distinct but related ways, forming a triangle of constraints:

RLHF-KL ALIGNMENT TRIANGLE

                 PRETRAINED MODEL pi_ref
                       /        \
             KL       /          \      KL
           penalty   /            \  constraint
                    /              \
          RLHF pi_theta  <------>  DPO pi_DPO
                     equivalent

  Fixed point: both objectives share the same optimal policy.

  Both methods maximize:
    E[r(x,y)] - beta * D_KL(pi_theta(.|x) || pi_ref(.|x))

  Optimal solution: pi*(y|x) proportional to pi_ref(y|x) * exp(r(x,y)/beta)


F.2 Interpreting the KL Coefficient beta

The KL coefficient $\beta$ in RLHF controls the fundamental exploration-exploitation trade-off in alignment:

| β value | Behavior | Risk |
| --- | --- | --- |
| β → 0 | Pure reward maximization; ignores the KL constraint | Reward hacking; repetitive, incoherent outputs |
| β → ∞ | Stays at the reference policy; no alignment | No improvement in helpfulness or safety |
| β ∈ [0.01, 0.5] | Typical range used in practice | Balances helpfulness and coherence |

Typical values in production: InstructGPT (Ouyang et al., 2022) used $\beta = 0.02$ for PPO-RLHF. Different applications need different $\beta$: safety alignment uses larger $\beta$ (conservative); instruction following uses smaller $\beta$ (allows larger policy changes).

Adaptive $\beta$ (KL controller): Some implementations adaptively adjust $\beta$ to keep $D_{\mathrm{KL}}(\pi_{\boldsymbol{\theta}}\|\pi_{\mathrm{ref}})$ near a target value $\kappa$:

$$\beta_{t+1} = \beta_t + \gamma(D_{\mathrm{KL}} - \kappa)$$

This is a simple feedback controller on the KL divergence. If the policy drifts too far, $\beta$ increases; if it stays too close to the reference, $\beta$ decreases.
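A sketch of one such controller update (all constants are illustrative, not from any particular implementation):

def update_beta(beta, kl_measured, kl_target=6.0, gamma=0.01, beta_min=1e-4):
    """One step of the feedback rule above."""
    beta = beta + gamma * (kl_measured - kl_target)
    return max(beta, beta_min)  # beta must stay positive

beta = 0.1
for kl in [2.0, 4.0, 9.0, 12.0, 7.0]:  # made-up per-step KL measurements
    beta = update_beta(beta, kl)
    print(f"measured KL {kl:5.1f} -> beta {beta:.4f}")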

F.3 DPO's Implicit Reward

DPO (Rafailov et al., 2023) reveals that the log-ratio between policy and reference is the implicit reward:

$$r_{\boldsymbol{\theta}}(x,y) = \beta\log\frac{\pi_{\boldsymbol{\theta}}(y|x)}{\pi_{\mathrm{ref}}(y|x)}$$

This has a clean interpretation: the reward a sequence $y$ receives is proportional to how much more probability the aligned policy assigns to it than the reference does. Sequences the aligned model prefers more than the reference are rewarded; sequences it prefers less are penalized.

DPO gradient: The DPO loss gradient with respect to $\boldsymbol{\theta}$ is:

$$\nabla_{\boldsymbol{\theta}}\mathcal{L}_{\mathrm{DPO}} = -\beta\,\mathbb{E}\!\left[\sigma(\hat{r}_{\boldsymbol{\theta}}(y_l) - \hat{r}_{\boldsymbol{\theta}}(y_w))\left(\nabla\log\pi_{\boldsymbol{\theta}}(y_w) - \nabla\log\pi_{\boldsymbol{\theta}}(y_l)\right)\right]$$

where $\hat{r}_{\boldsymbol{\theta}}(y) = \beta\log(\pi_{\boldsymbol{\theta}}(y)/\pi_{\mathrm{ref}}(y))$. The weight $\sigma(\hat{r}(y_l) - \hat{r}(y_w))$ is large when the model assigns higher reward to the loser than to the winner - the loss upweights these hard examples. This has the same structure as hard-example mining in contrastive learning.
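A minimal sketch of the DPO loss in PyTorch (function and variable names are ours; log-probs are assumed to be summed over each response):

import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """
    logp_w / logp_l: policy log-probs of the chosen / rejected responses, shape (batch,).
    ref_logp_*: the same quantities under the frozen reference model.
    """
    r_w = beta * (logp_w - ref_logp_w)  # implicit reward of the winner
    r_l = beta * (logp_l - ref_logp_l)  # implicit reward of the loser
    return -F.logsigmoid(r_w - r_l).mean()

# Toy call with made-up summed log-probabilities:
print(dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
               torch.tensor([-13.0]), torch.tensor([-14.5])))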


Appendix G: Connections to Other Chapters

G.1 KL and Exponential Families (Chapter 7: Statistics)

Exponential families are the natural domain for KL divergence because:

  1. KL between exponential family members has closed form (Section 5.3)
  2. The maximum entropy distribution under moment constraints is an exponential family member (Section 09-01 Section 5)
  3. The Bregman divergence generated by the log-partition function $A(\boldsymbol{\eta})$ equals the KL between members with those natural parameters

Sufficiency: Kullback and Leibler's original paper introduced KL as a measure of information for discriminating between hypotheses. A statistic $T(\mathbf{x})$ is sufficient for parameter $\boldsymbol{\theta}$ iff $D_{\mathrm{KL}}(p_{\boldsymbol{\theta}_1}(\mathbf{x})\|p_{\boldsymbol{\theta}_2}(\mathbf{x})) = D_{\mathrm{KL}}(p_{\boldsymbol{\theta}_1}(T)\|p_{\boldsymbol{\theta}_2}(T))$ - sufficient statistics preserve all the discriminatory information. This is the information-theoretic definition of sufficiency.

G.2 KL and Convex Optimization (Chapter 8)

The ELBO maximization is a convex optimization problem in $q$ (for fixed model parameters $\boldsymbol{\theta}$):

  • $D_{\mathrm{KL}}(q\|p)$ is convex in $q$ (Section 3.4)
  • The constraint set $q \ge 0$, $\sum q = 1$ is convex (the probability simplex)
  • The I-projection $q^* = \arg\min_{q \in E} D_{\mathrm{KL}}(q\|p)$ onto a convex set $E$ of distributions is a convex program with a unique solution

The Lagrangian of the I-projection onto the exponential family gives the natural parameter update equations - connecting KL divergence to the theory of constrained convex optimization from Chapter 8.

G.3 KL and Functional Analysis (Chapter 12)

In infinite-dimensional settings, KL divergence extends via the Radon-Nikodym theorem. For Gaussian processes and Gaussian measures on Hilbert spaces:

$$D_{\mathrm{KL}}(\mathcal{GP}(m_1, K_1)\|\mathcal{GP}(m_2, K_2)) = \frac{1}{2}\left[\operatorname{tr}(K_2^{-1}K_1) + (m_2-m_1)^\top K_2^{-1}(m_2-m_1) - d + \log\frac{\det K_2}{\det K_1}\right]$$

the same Gaussian KL formula, but now $K_1, K_2$ are kernel (covariance) operators and $d \to \infty$. The trace and log-determinant terms generalize to the operator trace and the Fredholm determinant.

RKHS connection: The maximum mean discrepancy (MMD) is an alternative to KL for comparing distributions, based on RKHS distances. MMD has better convergence rates from finite samples than KL estimators, which is why it appears in some GANs (MMD-GAN) and kernel two-sample tests.

G.4 KL and Statistical Learning Theory (Chapter 21)

The PAC-Bayes bound (Appendix C.1) shows that the KL from posterior to prior controls generalization. This has a direct implication for LoRA fine-tuning: if the LoRA adapter is constrained to have small $D_{\mathrm{KL}}(q_{\mathrm{adapter}}\|p_{\mathrm{init}})$ - i.e., small deviation from initialization - then the generalization bounds are tight. This is the information-theoretic justification for small-rank adaptations: they stay close to the pretrained distribution in KL.

Minimum Description Length (MDL): The MDL principle selects the model that minimizes the two-part code length $L(M) + L(D|M)$, where $L(M)$ is the code length of the model and $L(D|M)$ is the code length of the data given the model. For the right choice of $p$ and $q$, this equals $D_{\mathrm{KL}}(q\|p) + \text{const}$, and minimizing description length corresponds to maximizing the ELBO. This is the Bayesian interpretation of Occam's razor.


Appendix H: Summary Tables

H.1 KL Divergence Properties at a Glance

| Property | Holds? | Counterexample if no |
| --- | --- | --- |
| Non-negative: D_KL ≥ 0 | YES | - |
| Zero iff p = q | YES (a.e.) | - |
| Symmetric: D_KL(p‖q) = D_KL(q‖p) | NO | p = (0.9, 0.1), q = (0.5, 0.5) |
| Triangle inequality: D_KL(p‖r) ≤ D_KL(p‖q) + D_KL(q‖r) | NO | p = (0.9, 0.1), q = (0.5, 0.5), r = (0.1, 0.9) |
| Data processing inequality: D_KL(p_T‖q_T) ≤ D_KL(p‖q) | YES | - |
| Jointly convex in (p, q) | YES | - |
| Convex in p for fixed q | YES | - |
| Convex in q for fixed p | YES | - |
| Finite when p is not absolutely continuous w.r.t. q | NO | p = (0.5, 0.5, 0), q = (0.5, 0, 0.5): D_KL = ∞ |
| Metric (distance function) | NO | Fails symmetry and the triangle inequality |

H.2 Forward vs Reverse KL Summary

| | Forward KL D_KL(p‖q) | Reverse KL D_KL(q‖p) |
| --- | --- | --- |
| Expectation under | p (truth) | q (approximation) |
| Also called | Inclusive, zero-avoiding | Exclusive, zero-forcing |
| Behavior when fitting q to p | Mean-seeking, mass-covering | Mode-seeking, mass-concentrating |
| Handles multimodal p | Averages over all modes | Collapses to one mode |
| Support requirement | q must cover supp(p) | q must avoid supp(p)^c |
| Used in | MLE, distillation, flows | Variational inference, ELBO |
| Risk | Blurry/averaged samples | Posterior collapse |
| Tractability | Needs samples from p | Only needs samples from q |

H.3 f-Divergence Family Summary

| Divergence | f(t) | Symmetric? | Bounded? | Metric? |
| --- | --- | --- | --- | --- |
| KL D_KL(p‖q) | t ln t | No | No | No |
| Reverse KL D_KL(q‖p) | −ln t | No | No | No |
| Jeffreys J | (t − 1) ln t | Yes | No | No |
| Hellinger H² | (√t − 1)² | Yes | Yes ([0, 2]) | √H² is |
| Total variation TV | (1/2)·∣t − 1∣ | Yes | Yes ([0, 1]) | Yes |
| Chi-squared χ² | (t − 1)² | No | No | No |
| Jensen-Shannon JSD | (mixture form) | Yes | Yes ([0, ln 2]) | √JSD is |
| Rényi D_α | (t^α − 1)/(α − 1) (limit form) | No | No | No |

Appendix I: Extended Examples and Geometric Visualizations

I.1 KL Divergence Along a Path Between Distributions

Consider a one-parameter family $p_t = (1-t)p_0 + t p_1$ - a straight line between two distributions in the probability simplex. Since KL is jointly convex and $p_t$ is affine in $t$, both $t \mapsto D_{\mathrm{KL}}(p_t\|q)$ and $t \mapsto D_{\mathrm{KL}}(p_0\|p_t)$ are convex in $t$. Convexity does not mean the path is "flat" in a metric sense, though: KL does not grow like squared Euclidean distance along it, so the straight line in the simplex is curved from the KL perspective.

Numerical example. $p_0 = (0.9, 0.1)$, $p_1 = (0.1, 0.9)$, $q = (0.5, 0.5)$. For $t \in [0,1]$ (values in nats):

| t | p_t | D_KL(p_t ‖ q) | D_KL(q ‖ p_t) |
| --- | --- | --- | --- |
| 0 | (0.9, 0.1) | 0.368 | 0.511 |
| 0.25 | (0.7, 0.3) | 0.082 | 0.087 |
| 0.5 | (0.5, 0.5) | 0.000 | 0.000 |
| 0.75 | (0.3, 0.7) | 0.082 | 0.087 |
| 1.0 | (0.1, 0.9) | 0.368 | 0.511 |

The values are symmetric about $t = 0.5$ because the path and $q$ are symmetric. The KL is zero at $t = 0.5$ (where $p_t = q$) and increases convexly toward the endpoints.
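The table can be reproduced with a few lines of NumPy (a minimal sketch):

import numpy as np

p0, p1, q = np.array([0.9, 0.1]), np.array([0.1, 0.9]), np.array([0.5, 0.5])

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    pt = (1 - t) * p0 + t * p1
    print(f"t={t}: KL(p_t||q) = {kl(pt, q):.3f}, KL(q||p_t) = {kl(q, pt):.3f}")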

Geometric implication: The straight-line path $p_t = (1-t)p_0 + t p_1$ is a mixture geodesic (m-geodesic) in information geometry - the natural path in the mixture parameterization. The dual path (exponential geodesic) mixes in the natural-parameter space, giving different intermediate distributions.

I.2 KL vs Wasserstein Distance

KL divergence and the Wasserstein distance (from optimal transport theory) are complementary tools for comparing distributions:

| Property | KL divergence | Wasserstein distance |
| --- | --- | --- |
| Requires p ≪ q | YES | No |
| Metrizes weak convergence | No | YES |
| Captures geometry of the support | No | YES |
| Tractable to compute | YES (closed forms) | Harder (O(n³) naive) |
| Differentiable | YES | With entropic regularization |
| Used in | MLE, ELBO, RLHF | Generative models, image quality |

When Wasserstein beats KL: If $p_G$ (generator) and $p_{\mathrm{data}}$ have disjoint supports, $D_{\mathrm{KL}}(p_{\mathrm{data}}\|p_G) = +\infty$ and $\mathrm{JSD} = \ln 2$ - gradients vanish. The Wasserstein distance is still finite and differentiable (it reflects the actual geometric distance between the supports). This is the motivation for WGAN (Arjovsky et al., 2017).

When KL beats Wasserstein: For language modeling, the vocabulary forms a discrete space with no natural geometry. The Wasserstein distance would require defining a metric on tokens (words), which is ambiguous. KL divergence makes no assumptions about token geometry and is directly connected to log-probability, making it the natural choice.

I.3 KL Divergence Under Transformations

Transformation by sufficient statistic. If $T(\mathbf{x})$ is a sufficient statistic for $\boldsymbol{\theta}$, then:

$$D_{\mathrm{KL}}(p_{\boldsymbol{\theta}_1}\|p_{\boldsymbol{\theta}_2}) = D_{\mathrm{KL}}(p_{\boldsymbol{\theta}_1}^T\|p_{\boldsymbol{\theta}_2}^T)$$

where $p_{\boldsymbol{\theta}}^T$ is the induced distribution of $T(\mathbf{x})$. Sufficient statistics preserve all the discriminatory information - no KL is lost when summarizing data by a sufficient statistic.

Example. For $p_\theta = \text{Bern}(\theta)$, the count $T = \sum_i x_i$ (number of heads in $n$ flips) is sufficient. The KL between $\text{Bern}(\theta_1)^{\otimes n}$ and $\text{Bern}(\theta_2)^{\otimes n}$ equals $n \cdot D_{\mathrm{KL}}(\text{Bern}(\theta_1)\|\text{Bern}(\theta_2))$, and the KL between the corresponding $\text{Binomial}(n,\theta_1)$ and $\text{Binomial}(n,\theta_2)$ distributions equals the same value - the sufficient statistic preserves KL exactly.
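A numeric check of this equality, assuming SciPy is available (parameter values are our own illustration):

import numpy as np
from scipy.stats import binom

n, t1, t2 = 10, 0.3, 0.6

bern_kl = t1 * np.log(t1 / t2) + (1 - t1) * np.log((1 - t1) / (1 - t2))

k = np.arange(n + 1)
p, q = binom.pmf(k, n, t1), binom.pmf(k, n, t2)
binom_kl = np.sum(p * np.log(p / q))

print(n * bern_kl, binom_kl)  # identical: ~1.838 nats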

KL under affine transformation. For continuous distributions and an affine map $T(\mathbf{x}) = A\mathbf{x} + \mathbf{b}$ with $\det A \ne 0$: $D_{\mathrm{KL}}(p_T\|q_T) = D_{\mathrm{KL}}(p\|q)$ - KL is invariant under invertible affine transformations. This is why the KL between two Gaussians can be computed in any coordinate system.

I.4 The Sanov Theorem and Large Deviations

Sanov's theorem is the large deviations foundation for KL divergence.

Theorem (Sanov). Let $X_1, \ldots, X_n$ be i.i.d. from $q$, and let $\hat{p}_n$ be the empirical distribution. For any set $E$ of distributions (closed in total variation):

$$P(\hat{p}_n \in E) \approx e^{-n \inf_{p \in E} D_{\mathrm{KL}}(p\|q)}$$

More precisely: $\lim_{n\to\infty} -\frac{1}{n}\log P(\hat{p}_n \in E) = \inf_{p \in E} D_{\mathrm{KL}}(p\|q)$.

Interpretation: The probability of observing an empirical distribution in $E$ decays exponentially in $n$, with exponent given by the minimum KL from any distribution in $E$ to the true distribution $q$. The "most likely" atypical empirical distribution is the one in $E$ closest to $q$ in KL divergence - the I-projection of $q$ onto $E$.

For AI: Sanov's theorem is the large-deviations machinery behind bounds like PAC-Bayes. The probability that an empirical risk deviates from the true risk decays exponentially in $n$, with an exponent given by a KL divergence (for 0-1 losses, the binary KL between the empirical and true risk levels). The larger that KL, the more confident we can be that the training set is representative.

I.5 KL Divergence and the EM Algorithm: A Complete Example

We illustrate the EM algorithm as alternating KL minimizations with a Gaussian mixture model.

Setup: Data $\mathcal{D} = \{x^{(1)}, \ldots, x^{(n)}\}$; model $p_{\boldsymbol{\theta}}(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x; \mu_k, \sigma_k^2)$.

E-step - compute responsibilities (I-projection):

$$q^{(t)}(z=k|x^{(i)}) = \frac{\pi_k^{(t)}\mathcal{N}(x^{(i)};\mu_k^{(t)},\sigma_k^{2(t)})}{\sum_{j=1}^K \pi_j^{(t)}\mathcal{N}(x^{(i)};\mu_j^{(t)},\sigma_j^{2(t)})}$$

This is the exact posterior under the current parameters - the distribution over component labels that minimizes $D_{\mathrm{KL}}(q\|p_{\boldsymbol{\theta}^{(t)}}(\cdot|x^{(i)}))$, driving that KL to zero.

M-step - update parameters (M-projection):

$$\pi_k^{(t+1)} = \frac{1}{n}\sum_i q^{(t)}(z=k|x^{(i)}) \qquad \mu_k^{(t+1)} = \frac{\sum_i q^{(t)}(z=k|x^{(i)})\,x^{(i)}}{\sum_i q^{(t)}(z=k|x^{(i)})}$$

This maximizes the ELBO for fixed $q^{(t)}$ - the M-projection that minimizes $D_{\mathrm{KL}}(q^{(t)}\|p_{\boldsymbol{\theta}})$ over $\boldsymbol{\theta}$.

KL decomposition at each step:

$$\log p_{\boldsymbol{\theta}}(\mathcal{D}) = \mathcal{L}(q, \boldsymbol{\theta}) + \sum_i D_{\mathrm{KL}}(q^{(t)}(\cdot|x^{(i)})\,\|\,p_{\boldsymbol{\theta}}(\cdot|x^{(i)}))$$

The E-step zeroes the second term (for the current $\boldsymbol{\theta}^{(t)}$); the M-step maximizes the first term. Together, each EM iteration increases $\log p_{\boldsymbol{\theta}}(\mathcal{D})$.
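A compact NumPy sketch of these two steps for a 1-D, two-component mixture (all data and initial values are made up):

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])

pi = np.array([0.5, 0.5])       # mixing weights
mu = np.array([-1.0, 0.5])      # component means
var = np.array([1.0, 1.0])      # component variances

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibilities q(z=k|x_i) -- the exact posterior (I-projection).
    w = pi * normal_pdf(x[:, None], mu, var)   # (n, K), unnormalized
    r = w / w.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters -- maximizes the ELBO for fixed r.
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(pi.round(2), mu.round(2), var.round(2))
# Recovers roughly (0.3, 0.7), (-2.0, 1.0), (0.25, 1.0).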


Appendix J: Notation and Conventions

J.1 KL Divergence Notation Across Different Sources

Different textbooks and papers use different notation for KL divergence. This can cause significant confusion:

| Source | Notation | Direction |
| --- | --- | --- |
| This curriculum (following Cover & Thomas) | D_KL(p‖q) | p = reference (truth); q = approximation |
| Kullback (1959) | I(1:2) | between distributions indexed 1 and 2 |
| Bishop (2006) | KL(p‖q) | same as Cover & Thomas |
| Murphy (2012) | KL(p‖q) | same |
| Goodfellow et al. (2016) | D_KL(P‖Q) | same |
| Some optimization papers | KL(q‖p) | reversed - always check! |

Key check: In VAE papers, $D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x})\|p(\mathbf{z}))$ has the encoder $q$ as the first argument (the distribution the expectation is taken under) and the prior $p$ as the second. This is reverse KL - the expectation is under the approximate posterior $q$.

J.2 Relationship Between Information Quantities

VENN DIAGRAM OF INFORMATION QUANTITIES

  For a joint distribution P(X,Y):

    +---------------------------------------------+
    |                   H(X,Y)                    |
    |   +------------+----------+------------+    |
    |   |   H(X|Y)   |  I(X;Y)  |   H(Y|X)   |    |
    |   +------------+----------+------------+    |
    |   |<-------- H(X) ------->|            |    |
    |                |<-------- H(Y) ------->|    |
    +---------------------------------------------+

  Key relationships:
  H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)        [chain rule]
  I(X;Y) = H(X) + H(Y) - H(X,Y)                 [definition]
  I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)        [equivalences]
  I(X;Y) = D_KL(P(X,Y) || P(X)P(Y))             [KL form]
  H(p,q) = H(p) + D_KL(p || q)                  [cross-entropy decomposition]


J.3 Units and Conversions

| Base | Unit | Conversion |
| --- | --- | --- |
| ln (natural log) | nats | Default in this curriculum; 1 nat = 1/ln 2 ≈ 1.443 bits |
| log₂ | bits | Used in data compression contexts; 1 bit = ln 2 ≈ 0.693 nats |
| log₁₀ | hartleys (bans) | Rare in ML; 1 hartley = ln 10 ≈ 2.303 nats |

PyTorch convention: torch.nn.functional.kl_div expects inputs already in log-space and returns values in nats (all logs are natural logs). To convert to bits, divide by math.log(2).


End of Section 09-02 KL Divergence


