KL Divergence, Part 4: Appendix K (Worked Problems with Full Solutions) through Appendix M (Further Reading and References)

Appendix K: Worked Problems with Full Solutions

K.1 Computing KL Between Two Categorical Distributions

Problem. A language model assigns the following probabilities to the next token in "The capital of France is ___":

Token       True distribution p    Model q
"Paris"     0.92                   0.70
"London"    0.03                   0.10
"Rome"      0.02                   0.10
"Berlin"    0.02                   0.05
other       0.01                   0.05

Compute (a) $D_{\mathrm{KL}}(p\|q)$, (b) the cross-entropy $H(p,q)$, and (c) the entropy $H(p)$.

Solution.

(a) KL divergence $D_{\mathrm{KL}}(p\|q)$:

$$D_{\mathrm{KL}}(p\|q) = 0.92\ln\frac{0.92}{0.70} + 0.03\ln\frac{0.03}{0.10} + 0.02\ln\frac{0.02}{0.10} + 0.02\ln\frac{0.02}{0.05} + 0.01\ln\frac{0.01}{0.05}$$
$$= 0.92(0.274) + 0.03(-1.204) + 0.02(-1.609) + 0.02(-0.916) + 0.01(-1.609)$$
$$= 0.252 - 0.036 - 0.032 - 0.018 - 0.016 = 0.150 \text{ nats}$$

(b) Cross-entropy $H(p,q) = -\sum_x p(x)\log q(x)$:

$$= -(0.92\ln 0.70 + 0.03\ln 0.10 + 0.02\ln 0.10 + 0.02\ln 0.05 + 0.01\ln 0.05)$$
$$= -(0.92(-0.357) + 0.05(-2.303) + 0.03(-2.996))$$
$$= 0.328 + 0.115 + 0.090 = 0.533 \text{ nats}$$

(c) Entropy $H(p) = -\sum_x p(x)\log p(x)$:

$$= -(0.92\ln 0.92 + 0.03\ln 0.03 + 0.02\ln 0.02 + 0.02\ln 0.02 + 0.01\ln 0.01)$$
$$= -(0.92(-0.083) + 0.03(-3.507) + 0.04(-3.912) + 0.01(-4.605))$$
$$= 0.076 + 0.105 + 0.156 + 0.046 = 0.383 \text{ nats}$$

Verification: $H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q) \stackrel{?}{=} 0.383 + 0.150 = 0.533$ nats, which agrees with (b). (Any small discrepancy comes from rounding the intermediate values; the exact values match.)

Interpretation: the model wastes 0.150 nats per prediction compared to an oracle that knows $p$. The oracle's perplexity is $e^{0.383} \approx 1.47$; the model's perplexity is $e^{0.533} \approx 1.70$, about 16% higher ($e^{0.150} \approx 1.16$), and that gap is entirely due to the KL divergence.
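These numbers are easy to double-check in a few lines. The following is a minimal NumPy/SciPy sketch with the distributions hard-coded from the table above; scipy.special.rel_entr is the same routine listed in Appendix M.3, and the printed values match the worked solution up to rounding.

```python
import numpy as np
from scipy.special import rel_entr

# Distributions from the K.1 table (order: Paris, London, Rome, Berlin, other)
p = np.array([0.92, 0.03, 0.02, 0.02, 0.01])   # "true" next-token distribution
q = np.array([0.70, 0.10, 0.10, 0.05, 0.05])   # model distribution

kl = rel_entr(p, q).sum()                # D_KL(p||q) = sum_x p(x) ln(p(x)/q(x))
cross_entropy = -(p * np.log(q)).sum()   # H(p, q)
entropy = -(p * np.log(p)).sum()         # H(p)

print(f"D_KL(p||q) = {kl:.3f} nats")              # ~0.149
print(f"H(p, q)    = {cross_entropy:.3f} nats")   # ~0.533
print(f"H(p)       = {entropy:.3f} nats")         # ~0.384
print(f"H(p) + KL  = {entropy + kl:.3f} nats")    # equals H(p, q)
```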

K.2 The ELBO Calculation for a Toy VAE

Setup. One-dimensional VAE: prior $p(z) = \mathcal{N}(0,1)$, likelihood $p_\theta(x|z) = \mathcal{N}(z, 0.25)$ (a decoder with fixed variance $\sigma^2 = 0.25$), encoder $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2)$.

For $x = 1.5$, $\mu_\phi = 1.2$, $\sigma_\phi^2 = 0.4$:

KL term: $D_{\mathrm{KL}}(\mathcal{N}(1.2, 0.4)\,\|\,\mathcal{N}(0,1))$:

$$= \tfrac{1}{2}\left(1.2^2 + 0.4 - \ln 0.4 - 1\right) = \tfrac{1}{2}(1.44 + 0.4 + 0.916 - 1) = \tfrac{1}{2}(1.756) = 0.878 \text{ nats}$$

Reconstruction term: $\mathbb{E}_{q_\phi}[\log p_\theta(x|z)]$ with $x = 1.5$ and $z \sim \mathcal{N}(1.2, 0.4)$:

$$= \mathbb{E}_{z\sim\mathcal{N}(1.2,0.4)}\left[-\tfrac{1}{2}\ln(2\pi \cdot 0.25) - \frac{(1.5-z)^2}{2 \cdot 0.25}\right]$$
$$= -\tfrac{1}{2}\ln(0.5\pi) - \frac{1}{0.5}\,\mathbb{E}\!\left[(1.5-z)^2\right]$$
$$= -\tfrac{1}{2}(0.452) - 2\left[(1.5-1.2)^2 + 0.4\right]$$
$$= -0.226 - 2(0.49) = -0.226 - 0.980 = -1.206 \text{ nats}$$

ELBO: $\mathcal{L} = -1.206 - 0.878 = -2.084$ nats.

Interpretation: the true log-evidence satisfies $\log p_\theta(1.5) \ge -2.084$. In this toy model the exact posterior is available in closed form: it has mean $1.2$ (so $\mu_\phi$ is already optimal here) and variance $0.2$, so the remaining ELBO gap comes from the encoder variance $\sigma_\phi^2 = 0.4$ being too wide; tightening it toward $0.2$ would close the gap.
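Both terms have closed forms for Gaussians, so the worked numbers can be reproduced directly. Below is a minimal Python sketch using only the values from this example: it evaluates the Gaussian KL formula and the expected reconstruction log-likelihood, then combines them into the ELBO.

```python
import math

# Numbers from the K.2 worked example
x, mu, var_q = 1.5, 1.2, 0.4    # data point, encoder mean, encoder variance
var_dec = 0.25                  # decoder variance sigma^2

# Closed-form KL( N(mu, var_q) || N(0, 1) )
kl = 0.5 * (mu**2 + var_q - math.log(var_q) - 1.0)

# E_{z ~ N(mu, var_q)}[ log N(x | z, var_dec) ], using E[(x - z)^2] = (x - mu)^2 + var_q
recon = -0.5 * math.log(2 * math.pi * var_dec) - ((x - mu) ** 2 + var_q) / (2 * var_dec)

elbo = recon - kl
print(f"KL term        = {kl:.3f} nats")     # ~0.878
print(f"Reconstruction = {recon:.3f} nats")  # ~-1.206
print(f"ELBO           = {elbo:.3f} nats")   # ~-2.084
```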

K.3 Verifying Data Processing Inequality Numerically

Consider an input $X \in \{0,1,2\}$ with $p = (0.5, 0.3, 0.2)$ and $q = (0.2, 0.3, 0.5)$.

Apply a stochastic map $T: \{0,1,2\} \to \{A, B\}$ with transition probabilities:

$$T(A|0) = 0.8,\; T(B|0) = 0.2; \quad T(A|1) = 0.5,\; T(B|1) = 0.5; \quad T(A|2) = 0.1,\; T(B|2) = 0.9$$

Original KL:

$$D_{\mathrm{KL}}(p\|q) = 0.5\ln(2.5) + 0.3\ln(1) + 0.2\ln(0.4) = 0.5(0.916) + 0 + 0.2(-0.916) = 0.458 - 0.183 = 0.275 \text{ nats}$$

Induced distributions on $\{A,B\}$:

$$p_T(A) = 0.5(0.8) + 0.3(0.5) + 0.2(0.1) = 0.40 + 0.15 + 0.02 = 0.57$$
$$q_T(A) = 0.2(0.8) + 0.3(0.5) + 0.5(0.1) = 0.16 + 0.15 + 0.05 = 0.36$$

KL after processing:

$$D_{\mathrm{KL}}(p_T\|q_T) = 0.57\ln\frac{0.57}{0.36} + 0.43\ln\frac{0.43}{0.64} = 0.57(0.460) + 0.43(-0.398) = 0.262 - 0.171 = 0.091 \text{ nats}$$

Verification: $0.091 \le 0.275$, so the data processing inequality holds. The stochastic transformation $T$ reduced the KL from 0.275 to 0.091 nats, discarding some of the information about whether the data came from $p$ or from $q$.
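The same check takes only a few lines of NumPy. The sketch below hard-codes the distributions and transition matrix from this example and asserts the inequality.

```python
import numpy as np
from scipy.special import rel_entr

# Distributions over X in {0, 1, 2} and the stochastic map T from K.3
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
T = np.array([[0.8, 0.2],    # row i gives [T(A | i), T(B | i)]
              [0.5, 0.5],
              [0.1, 0.9]])

p_T = p @ T   # induced distribution over {A, B} under p
q_T = q @ T   # induced distribution over {A, B} under q

kl_before = rel_entr(p, q).sum()
kl_after = rel_entr(p_T, q_T).sum()
print(f"KL before processing: {kl_before:.3f} nats")  # ~0.275
print(f"KL after processing:  {kl_after:.3f} nats")   # ~0.091
assert kl_after <= kl_before   # the data processing inequality
```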


Appendix L: Connections to Optimization Theory

L.1 KL Divergence and Mirror Descent

Mirror descent is a generalization of gradient descent where the update uses a Bregman divergence instead of Euclidean distance:

$$\mathbf{x}^{(t+1)} = \arg\min_{\mathbf{x}} \left\{\eta \langle \nabla f(\mathbf{x}^{(t)}), \mathbf{x} \rangle + D_\phi(\mathbf{x}, \mathbf{x}^{(t)})\right\}$$

When $\phi(\mathbf{p}) = \sum_k p_k \ln p_k$ (negative entropy) and $\mathbf{x} = \mathbf{p}$ is a probability vector, the Bregman divergence is $D_\phi(\mathbf{p}, \mathbf{q}) = D_{\mathrm{KL}}(p\|q)$, and mirror descent becomes the multiplicative weights (exponentiated gradient) update:

$$p_k^{(t+1)} \propto p_k^{(t)} \exp\!\left(-\eta\, g_k^{(t)}\right)$$

where $\mathbf{g}^{(t)} = \nabla f(\mathbf{p}^{(t)})$. This is exactly how softmax policies in RL are updated, and how online learning over the probability simplex works (regret analysis via Bregman projections).
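The update is two lines of code. Below is a minimal sketch of exponentiated gradient on the simplex; the objective and its target vector are made up for illustration (the constrained minimizer is the target itself, so the iterates should approach it).

```python
import numpy as np

def exponentiated_gradient_step(p, grad, eta):
    """One mirror-descent step on the simplex with the negative-entropy mirror map:
    multiply each coordinate by exp(-eta * gradient) and renormalize."""
    w = p * np.exp(-eta * grad)
    return w / w.sum()

# Illustrative objective (not from the text): f(p) = 0.5 * ||p - target||^2,
# whose minimizer over the simplex is `target` itself.
target = np.array([0.7, 0.2, 0.1])
p = np.full(3, 1.0 / 3.0)                    # start at the uniform distribution
for _ in range(2000):
    grad = p - target                        # gradient of f at the current p
    p = exponentiated_gradient_step(p, grad, eta=0.2)

print(np.round(p, 4))   # approaches [0.7, 0.2, 0.1]
```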

Natural gradient as mirror descent. The natural gradient descent update:

$$\boldsymbol{\theta}^{(t+1)} = \arg\min_{\boldsymbol{\theta}} \left\{\eta \langle \nabla_{\boldsymbol{\theta}}\mathcal{L}, \boldsymbol{\theta} \rangle + D_{\mathrm{KL}}\!\left(p_{\boldsymbol{\theta}^{(t)}} \,\|\, p_{\boldsymbol{\theta}}\right)\right\}$$

is mirror descent using the KL divergence as the Bregman proximity term. Approximating that KL by its second-order expansion $\tfrac{1}{2}\Delta\boldsymbol{\theta}^\top F \Delta\boldsymbol{\theta}$ gives the solution $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta F^{-1}\nabla\mathcal{L}$, where $F$ is the Fisher information matrix: the natural gradient formula.
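The link between local KL and the Fisher matrix used in that expansion can be checked numerically. The sketch below (an illustrative softmax-parameterized categorical with random logits, not anything from the text) compares the exact KL between $p_\theta$ and $p_{\theta+d}$ with the quadratic form $\tfrac{1}{2}d^\top F d$.

```python
import numpy as np
from scipy.special import softmax, rel_entr

# For a categorical with logits theta, D_KL(p_theta || p_{theta + d}) is approximately
# 0.5 * d^T F d for a small perturbation d, where F = diag(p) - p p^T is the Fisher matrix.
rng = np.random.default_rng(0)
theta = rng.normal(size=5)          # logits (arbitrary example values)
d = 1e-3 * rng.normal(size=5)       # small parameter perturbation

p = softmax(theta)
p_new = softmax(theta + d)

F = np.diag(p) - np.outer(p, p)     # Fisher matrix of the categorical in logit coordinates
kl_exact = rel_entr(p, p_new).sum()
kl_quadratic = 0.5 * d @ F @ d

print(f"exact KL:      {kl_exact:.3e}")
print(f"0.5 * d^T F d: {kl_quadratic:.3e}")   # agrees with the exact KL to leading order
```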

L.2 KL-Penalized Optimization and Soft Constraints

The RLHF objective $\max_\pi \mathbb{E}[r] - \beta\, D_{\mathrm{KL}}(\pi\|\pi_{\mathrm{ref}})$ is a classic penalized optimization problem. The KL term acts as a regularizer that keeps $\pi$ near $\pi_{\mathrm{ref}}$.

Lagrangian form. The hard-constrained version:

$$\max_\pi \mathbb{E}[r(x,y)] \quad \text{s.t.} \quad D_{\mathrm{KL}}(\pi\|\pi_{\mathrm{ref}}) \le \epsilon$$

is equivalent to the penalized form when $\beta$ is chosen as the Lagrange multiplier for the constraint. As $\epsilon \to \infty$, $\beta \to 0$ (no constraint); as $\epsilon \to 0$, $\beta \to \infty$ (stay at the reference). The optimal $\beta$ depends on the desired trade-off between reward and divergence.
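For discrete outputs the penalized problem even has a closed-form solution, $\pi^*(y) \propto \pi_{\mathrm{ref}}(y)\exp(r(y)/\beta)$ (the same fact DPO builds on). The sketch below is a small illustrative example with made-up rewards and reference probabilities; it shows how the solution moves between greedy reward maximization and the reference policy as $\beta$ varies.

```python
import numpy as np
from scipy.special import rel_entr

# Illustrative discrete example (made-up rewards and reference policy):
# the penalized problem  max_pi E_pi[r] - beta * KL(pi || pi_ref)
# is solved in closed form by  pi*(y) ∝ pi_ref(y) * exp(r(y) / beta).
r = np.array([1.0, 0.5, 0.0, -0.5])        # rewards for four candidate responses
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])    # reference (SFT) policy

def optimal_policy(beta):
    w = pi_ref * np.exp(r / beta)
    return w / w.sum()

for beta in [0.1, 1.0, 10.0]:
    pi = optimal_policy(beta)
    objective = pi @ r - beta * rel_entr(pi, pi_ref).sum()
    print(f"beta={beta:5.1f}  pi*={np.round(pi, 3)}  objective={objective:.3f}")

# Small beta: pi* piles mass on the highest-reward response (weak KL penalty).
# Large beta: pi* stays close to pi_ref (strong KL penalty).
```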

TRPO (Schulman et al., 2015). Trust Region Policy Optimization solves:

$$\max_\pi \mathcal{L}_{\pi_{\mathrm{old}}}(\pi) \quad \text{s.t.} \quad \bar{D}_{\mathrm{KL}}(\pi_{\mathrm{old}}, \pi) \le \delta$$

where $\mathcal{L}$ is the surrogate advantage and $\bar{D}_{\mathrm{KL}}$ is the average KL over states visited by the old policy (the theoretical bound uses the maximum KL over states; the practical algorithm replaces it with the average). Each TRPO step solves a constrained optimization problem with a KL trust region, which is computationally expensive; PPO approximates this with the clip objective.

L.3 The Frank-Wolfe Algorithm and KL

The Frank-Wolfe algorithm solves $\min_{p \in \mathcal{C}} f(p)$ (for a convex function $f$ on a convex constraint set $\mathcal{C}$) by linearizing $f$ at each step. When $f = D_{\mathrm{KL}}(\cdot\,\|\,q)$ and $\mathcal{C}$ is the probability simplex, Frank-Wolfe recovers the multiplicative weights algorithm. When $f = D_{\mathrm{KL}}(p\,\|\,\cdot)$ and $\mathcal{C}$ is a parametric exponential family, Frank-Wolfe corresponds to the M-step of EM.

This connection shows that common ML algorithms - EM, multiplicative weights, mirror descent, natural gradient - are all special cases of convex optimization using KL divergence as the geometry.


Appendix M: Further Reading and References

M.1 Primary Sources

  • Kullback & Leibler (1951), "On Information and Sufficiency," Annals of Mathematical Statistics. Read: Sections 1-3. Why: the original paper; a clean introduction to I-divergence and sufficiency.
  • Cover & Thomas (2006), Elements of Information Theory. Read: Ch. 2, especially Section 2.3. Why: the best textbook treatment; all proofs with full details.
  • MacKay (2003), Information Theory, Inference, and Learning Algorithms (free online). Read: Ch. 2. Why: excellent intuition; coding-theoretic perspective.
  • Bishop (2006), Pattern Recognition and Machine Learning. Read: Section 10.1. Why: variational inference from an ML perspective.
  • Amari (2016), Information Geometry and Its Applications. Read: Ch. 1-3. Why: statistical manifolds, I/M-projections, natural gradient.
  • Csiszar & Shields (2004), "Information Theory and Statistics." Read: Sections 1-3. Why: f-divergences, Sanov's theorem, method of types.

M.2 Key ML Papers Using KL Divergence

  • Kingma & Welling (2014), "Auto-Encoding Variational Bayes": ELBO = reconstruction minus KL; VAE training.
  • Goodfellow et al. (2014), "Generative Adversarial Networks": GAN training as JSD minimization.
  • Hinton, Vinyals & Dean (2015), "Distilling the Knowledge in a Neural Network": forward KL for knowledge transfer.
  • Schulman et al. (2015), "Trust Region Policy Optimization" (TRPO): KL trust-region constraint.
  • Schulman et al. (2017), "Proximal Policy Optimization" (PPO): KL constraint approximated by the clip objective.
  • Higgins et al. (2017), "beta-VAE: Learning Basic Visual Concepts": $\beta$-weighted KL for disentanglement.
  • Mironov (2017), "Renyi Differential Privacy": Rényi divergence (a generalization of KL) for privacy accounting.
  • Ho et al. (2020), "Denoising Diffusion Probabilistic Models": per-step KL terms in the diffusion ELBO.
  • Rafailov et al. (2023), "Direct Preference Optimization" (DPO): implicit KL-constrained policy optimization.

M.3 Numerical Tools

  • scipy.special.rel_entr(p, q): element-wise $p\log(p/q)$; sum the result to get the KL.
  • scipy.stats.entropy(p, q): $D_{\mathrm{KL}}(p\|q)$ for discrete distributions.
  • torch.nn.functional.kl_div(log_q, p): KL given $\log q$ and $p$; note the argument order.
  • torch.distributions.kl_divergence(p, q): for torch distribution objects; supports many families.
  • sklearn.metrics.mutual_info_score: mutual information (itself a KL divergence).
  • Custom NumPy: see Appendix D.1 for a numerically stable implementation.
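As a quick sanity check that these routines agree, and to make the argument-order caveat for torch.nn.functional.kl_div concrete, here is a small sketch reusing the K.1 distributions; the printed numbers are only illustrative.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.special import rel_entr
from scipy.stats import entropy

# Same distributions as Appendix K.1
p = np.array([0.92, 0.03, 0.02, 0.02, 0.01])
q = np.array([0.70, 0.10, 0.10, 0.05, 0.05])

kl_rel_entr = rel_entr(p, q).sum()    # sum of element-wise p * ln(p/q)
kl_entropy = entropy(p, q)            # scipy.stats.entropy with two args = KL
kl_torch = F.kl_div(torch.log(torch.from_numpy(q)),   # first argument is log q
                    torch.from_numpy(p),
                    reduction="sum").item()
kl_dist = torch.distributions.kl_divergence(
    torch.distributions.Categorical(probs=torch.from_numpy(p)),
    torch.distributions.Categorical(probs=torch.from_numpy(q)),
).item()

print(kl_rel_entr, kl_entropy, kl_torch, kl_dist)   # all ≈ 0.149 nats
```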

M.4 Quick Diagnostic Checklist

Before using KL divergence in any ML project, verify:

  • Which direction? Identify which distribution is $p$ (truth/reference) and which is $q$ (approximation). The first argument $p$ is the one the expectation is taken under.
  • Support check. Is $p(x) > 0$ anywhere that $q(x) = 0$? If so, $D_{\mathrm{KL}}(p\|q) = +\infty$. Add smoothing if needed.
  • Units. Is your implementation using natural log (nats) or log base 2 (bits)? Be consistent.
  • Forward or reverse? Is this an MLE/distillation problem (forward KL) or variational inference (reverse KL)? The choice determines whether the result is mass-covering or mode-seeking.
  • Numerical stability. Are you computing $\log(p/q)$ directly (unstable) or working from log-softmax outputs (stable)? See the sketch after this list.
  • Sample size for estimation. KL estimators from finite samples have high variance in high dimensions; consider MMD or Wasserstein distances for high-dimensional distributions.
  • KL coefficient. If using KL as a regularizer, is $\beta$ tuned? Too small invites reward hacking; too large prevents any alignment gain.
  • ELBO monitoring. In VAE training, are the reconstruction term and the KL term logged separately? Monitoring their ratio over training catches posterior collapse early.
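On the numerical-stability point, a common pattern is to stay in log space end to end. The sketch below uses hypothetical teacher/student logits in PyTorch; torch.nn.functional.kl_div with log_target=True never materializes the ratio $p/q$, while the naive version typically breaks at extreme logit scales.

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher (p) and student (q) logits: batch of 8, vocabulary of 50.
# The large scale factor is only there to stress numerical stability.
logits_p = torch.randn(8, 50) * 100.0
logits_q = torch.randn(8, 50) * 100.0

# Stable: stay in log space end to end
log_p = F.log_softmax(logits_p, dim=-1)
log_q = F.log_softmax(logits_q, dim=-1)
kl_stable = F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)

# Less stable: exponentiating first lets small probabilities underflow to exact
# zeros, so p * log(p/q) typically produces nan at this logit scale.
p = logits_p.softmax(dim=-1)
q = logits_q.softmax(dim=-1)
kl_naive = (p * (p / q).log()).sum(dim=-1).mean()

print(kl_stable.item(), kl_naive.item())   # finite value vs. (typically) nan
```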

M.5 Conceptual Summary: One Page

KL divergence $D_{\mathrm{KL}}(p\|q) = \mathbb{E}_p[\log(p/q)]$ measures:

  • The expected excess code length when coding $p$-data with a $q$-code
  • The average log-likelihood-ratio evidence that data came from $p$ rather than $q$
  • The information gain from using $p$ instead of $q$ as a model

Three things to always remember:

  1. Direction matters. $D_{\mathrm{KL}}(p\|q) \ne D_{\mathrm{KL}}(q\|p)$. Forward KL is mass-covering; reverse KL is mode-seeking.

  2. Non-negativity is everything. $D_{\mathrm{KL}} \ge 0$ with equality iff $p = q$. This single fact proves that entropy is maximized by the uniform distribution, that the ELBO is a lower bound, and that MLE is consistent.

  3. It's everywhere in AI. MLE, VAE, RLHF, distillation, diffusion, VI - all are special cases of optimizing a KL divergence in a specific direction. Understanding which direction, and what approximations are made, is the key to understanding modern deep learning at a principled level.


Section 09-02: KL Divergence. 13 main sections, 13 appendices (A-M). Prerequisites: Section 09-01 (Entropy). Next: Section 09-03 (Mutual Information).
