Appendix K: Worked Problems with Full Solutions
K.1 Computing KL Between Two Categorical Distributions
Problem. A language model assigns the following probabilities to the next token in "The capital of France is ___":
| Token | True distribution | Model |
|---|---|---|
| "Paris" | 0.92 | 0.70 |
| "London" | 0.03 | 0.10 |
| "Rome" | 0.02 | 0.10 |
| "Berlin" | 0.02 | 0.05 |
| other | 0.01 | 0.05 |
Compute (a) $D_{\mathrm{KL}}(P \,\|\, Q)$, (b) the cross-entropy $H(P, Q)$, (c) the entropy $H(P)$, all in nats.

Solution.

(a) $D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)}$:

$$
D_{\mathrm{KL}}(P \,\|\, Q) = 0.92 \ln\tfrac{0.92}{0.70} + 0.03 \ln\tfrac{0.03}{0.10} + 0.02 \ln\tfrac{0.02}{0.10} + 0.02 \ln\tfrac{0.02}{0.05} + 0.01 \ln\tfrac{0.01}{0.05} \approx 0.251 - 0.036 - 0.032 - 0.018 - 0.016 = 0.149 \text{ nats}.
$$

(b) Cross-entropy $H(P, Q) = -\sum_x P(x) \ln Q(x)$:

$$
H(P, Q) = -\big(0.92 \ln 0.70 + 0.03 \ln 0.10 + 0.02 \ln 0.10 + 0.02 \ln 0.05 + 0.01 \ln 0.05\big) \approx 0.533 \text{ nats}.
$$

(c) Entropy $H(P) = -\sum_x P(x) \ln P(x)$:

$$
H(P) = -\big(0.92 \ln 0.92 + 0.03 \ln 0.03 + 0.02 \ln 0.02 + 0.02 \ln 0.02 + 0.01 \ln 0.01\big) \approx 0.384 \text{ nats}.
$$

Verification: $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)$: $0.384 + 0.149 = 0.533$ nats. (Small discrepancies come from rounding; the exact values match.)

Interpretation: The model wastes about 0.149 nats per prediction compared to an oracle that knows $P$. The oracle's perplexity is $e^{H(P)} \approx e^{0.384} \approx 1.47$; the model's perplexity is $e^{H(P,Q)} \approx e^{0.533} \approx 1.70$ - about 16% higher ($e^{0.149} \approx 1.16$), entirely due to the KL gap.
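As a quick numerical check, this sketch recomputes all three quantities with `scipy.special.rel_entr` (listed in Appendix M.3); the printed values should match the hand computation up to rounding:

```python
import numpy as np
from scipy.special import rel_entr

# True distribution P and model Q from the table above.
p = np.array([0.92, 0.03, 0.02, 0.02, 0.01])
q = np.array([0.70, 0.10, 0.10, 0.05, 0.05])

kl = rel_entr(p, q).sum()              # D_KL(P || Q) = sum p * ln(p/q)
cross_entropy = -(p * np.log(q)).sum() # H(P, Q)
entropy = -(p * np.log(p)).sum()       # H(P)

print(f"KL(P||Q)  = {kl:.4f} nats")            # ~0.1487
print(f"H(P, Q)   = {cross_entropy:.4f} nats") # ~0.5331
print(f"H(P)      = {entropy:.4f} nats")       # ~0.3844
print(f"H(P) + KL = {entropy + kl:.4f} nats")  # equals H(P, Q)
print(f"perplexity ratio = {np.exp(kl):.3f}")  # ~1.16, i.e. ~16% higher
```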
K.2 The ELBO Calculation for a Toy VAE
Setup. One-dimensional VAE: prior $p(z) = \mathcal{N}(0, 1)$; likelihood $p_\theta(x \mid z) = \mathcal{N}(x;\, z,\, \sigma_x^2)$ (an identity-mean decoder with fixed variance $\sigma_x^2$); encoder $q_\phi(z \mid x) = \mathcal{N}(\mu, \sigma^2)$.

For a datapoint $x$ with encoder outputs $\mu$ and $\sigma$, the ELBO is

$$
\mathrm{ELBO}(x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\ln p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big).
$$

KL term: between two one-dimensional Gaussians this has a closed form:

$$
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\big) = \tfrac{1}{2}\big(\mu^2 + \sigma^2 - 1 - \ln \sigma^2\big).
$$

Reconstruction term: $\mathbb{E}_{q}\big[\ln p_\theta(x \mid z)\big]$, where for the Gaussian likelihood

$$
\mathbb{E}_q\big[\ln p_\theta(x \mid z)\big] = -\tfrac{1}{2}\ln\big(2\pi \sigma_x^2\big) - \frac{\mathbb{E}_q\big[(x - z)^2\big]}{2\sigma_x^2}, \qquad \mathbb{E}_q\big[(x - z)^2\big] = (x - \mu)^2 + \sigma^2.
$$

ELBO: subtracting the KL term from the reconstruction term gives, in nats,

$$
\mathrm{ELBO}(x) = -\tfrac{1}{2}\ln\big(2\pi \sigma_x^2\big) - \frac{(x - \mu)^2 + \sigma^2}{2\sigma_x^2} - \tfrac{1}{2}\big(\mu^2 + \sigma^2 - 1 - \ln \sigma^2\big).
$$

Interpretation: The true log-evidence satisfies $\ln p(x) \geq \mathrm{ELBO}(x)$, with the gap equal to $D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z \mid x)\big)$. Better encoder parameters ($\mu$ closer to the posterior mean) and a tighter $\sigma$ would close the ELBO gap.
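A minimal sketch of this calculation, assuming $\sigma_x = 1$ and illustrative values for $x$, $\mu$, $\sigma$ (the numbers below are assumptions chosen for the example), with a Monte Carlo check of the reconstruction term:

```python
import numpy as np

# Toy 1-D VAE ELBO under the setup above: prior p(z) = N(0, 1),
# decoder p(x|z) = N(x; z, 1), encoder q(z|x) = N(mu, sigma^2).
# x, mu, sigma are illustrative assumptions, not fixed by the text.
x, mu, sigma = 1.0, 0.5, 0.8

# Closed-form KL( N(mu, sigma^2) || N(0, 1) )
kl = 0.5 * (mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

# E_q[ln p(x|z)] = -0.5*ln(2*pi) - 0.5*E_q[(x - z)^2],
# with E_q[(x - z)^2] = (x - mu)^2 + sigma^2
recon = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - mu)**2 + sigma**2)

elbo = recon - kl
print(f"KL term = {kl:.4f}")     # 0.1681
print(f"recon   = {recon:.4f}")  # -1.3639
print(f"ELBO    = {elbo:.4f}")   # -1.5321

# Monte Carlo check of the analytic reconstruction term
z = np.random.default_rng(0).normal(mu, sigma, 100_000)
mc = (-0.5 * np.log(2 * np.pi) - 0.5 * (x - z)**2).mean()
print(f"MC recon = {mc:.4f}")    # close to the analytic value
```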
K.3 Verifying Data Processing Inequality Numerically
Consider a binary input $X$ with two candidate distributions $P$ and $Q$ over $\{0, 1\}$, pushed through a stochastic function (a noisy channel) $Y = f(X)$ with transition matrix $T(y \mid x)$.

Original KL: $D_{\mathrm{KL}}(P \,\|\, Q) \approx 0.275$ nats.

Induced distributions on $Y$: $P_Y(y) = \sum_x T(y \mid x)\, P(x)$ and $Q_Y(y) = \sum_x T(y \mid x)\, Q(x)$.

KL after processing: $D_{\mathrm{KL}}(P_Y \,\|\, Q_Y) \approx 0.089$ nats.

Verification: $0.089 \leq 0.275$ - PASS - the data processing inequality $D_{\mathrm{KL}}(P_Y \,\|\, Q_Y) \leq D_{\mathrm{KL}}(P \,\|\, Q)$ holds. The stochastic transformation reduced the KL from 0.275 to 0.089 nats, losing information about whether the data came from $P$ or $Q$.
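A short numerical check of the inequality. The distributions and channel below are illustrative assumptions (any valid $P$, $Q$, $T$ works); the assertion verifies $D_{\mathrm{KL}}(P_Y \,\|\, Q_Y) \leq D_{\mathrm{KL}}(P \,\|\, Q)$:

```python
import numpy as np
from scipy.special import rel_entr

# Illustrative binary-input example: X ~ P or Q, pushed through a
# noisy channel T, where T[y, x] = P(Y = y | X = x) (columns sum to 1).
p = np.array([0.9, 0.1])
q = np.array([0.6, 0.4])
T = np.array([[0.8, 0.3],
              [0.2, 0.7]])

kl_x = rel_entr(p, q).sum()           # KL on the inputs
kl_y = rel_entr(T @ p, T @ q).sum()   # KL on the induced outputs
print(f"KL before channel: {kl_x:.4f} nats")  # ~0.2263
print(f"KL after channel:  {kl_y:.4f} nats")  # ~0.0499
assert kl_y <= kl_x + 1e-12  # data processing inequality holds
```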
Appendix L: Connections to Optimization Theory
L.1 KL Divergence and Mirror Descent
Mirror descent is a generalization of gradient descent in which the update uses a Bregman divergence instead of squared Euclidean distance:

$$
x_{t+1} = \arg\min_{x \in \mathcal{X}} \Big\{ \eta\, \langle \nabla f(x_t),\, x \rangle + D_{\psi}(x, x_t) \Big\}, \qquad D_{\psi}(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y),\, x - y \rangle.
$$

When $\psi(x) = \sum_i x_i \ln x_i$ (negative entropy) and $x$ is a probability vector, the Bregman divergence is exactly the KL divergence, $D_{\psi}(x, y) = D_{\mathrm{KL}}(x \,\|\, y)$, and mirror descent becomes the multiplicative weights update, also known as exponentiated gradient:

$$
x_{t+1, i} = \frac{x_{t, i}\, \exp\big(-\eta\, [\nabla f(x_t)]_i\big)}{Z_t},
$$

where $Z_t = \sum_j x_{t, j} \exp\big(-\eta\, [\nabla f(x_t)]_j\big)$ normalizes the result back onto the simplex. This is exactly how softmax policies in RL are updated, and how online learning over the probability simplex works (regret analysis via Bregman projections).
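A minimal sketch of one such update on the simplex; the linear loss and step size below are illustrative assumptions:

```python
import numpy as np

def exp_gradient_step(x, grad, eta):
    """One mirror descent step with the negative-entropy mirror map:
    multiplicative weights / exponentiated gradient on the simplex."""
    w = x * np.exp(-eta * grad)
    return w / w.sum()   # normalization = KL projection back to the simplex

# Hypothetical example: linear loss f(x) = <c, x> on the 3-simplex,
# so the gradient is the constant cost vector c.
c = np.array([0.5, 0.1, 0.9])
x = np.full(3, 1 / 3)          # start at the uniform point
for _ in range(50):
    x = exp_gradient_step(x, c, eta=0.5)
print(x)  # mass concentrates on the coordinate with the smallest cost
```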
Natural gradient as mirror descent. The natural gradient descent update can be written as:

$$
\theta_{t+1} = \arg\min_{\theta} \Big\{ \eta\, \langle \nabla_\theta L(\theta_t),\, \theta \rangle + D_{\mathrm{KL}}\big(p_{\theta_t} \,\|\, p_{\theta}\big) \Big\},
$$

i.e. mirror descent using KL divergence as the Bregman proximity term. Expanding the KL to second order, $D_{\mathrm{KL}}\big(p_{\theta_t} \,\|\, p_{\theta}\big) \approx \tfrac{1}{2} (\theta - \theta_t)^\top F(\theta_t)\, (\theta - \theta_t)$, the solution is $\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1} \nabla_\theta L(\theta_t)$, where $F$ is the Fisher information matrix - reproducing the natural gradient formula.
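A quick numerical check of that second-order expansion for a Bernoulli family, where the Fisher information is $F(\theta) = 1/(\theta(1-\theta))$; the base point and step are illustrative:

```python
import numpy as np

# Verify KL(p_theta0 || p_theta) ~ 0.5 * F(theta0) * dtheta^2
# for Bernoulli(theta), where F(theta) = 1 / (theta * (1 - theta)).
def kl_bern(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

theta0, dtheta = 0.3, 0.01
F = 1.0 / (theta0 * (1 - theta0))
print(kl_bern(theta0, theta0 + dtheta))  # ~2.35e-4 (exact KL)
print(0.5 * F * dtheta**2)               # ~2.38e-4 (quadratic approximation)
```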
L.2 KL-Penalized Optimization and Soft Constraints
The RLHF objective

$$
\max_{\pi} \; \mathbb{E}_{y \sim \pi}\big[r(y)\big] - \beta\, D_{\mathrm{KL}}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)
$$

is a classic penalized optimization problem. The KL term acts as a regularizer that keeps $\pi$ near $\pi_{\mathrm{ref}}$.

Lagrangian form. The hard-constrained version:

$$
\max_{\pi} \; \mathbb{E}_{y \sim \pi}\big[r(y)\big] \quad \text{subject to} \quad D_{\mathrm{KL}}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big) \leq \epsilon
$$

is equivalent to the penalized form at the $\beta$ that is the Lagrange multiplier for the constraint. As $\beta \to 0$, the policy maximizes reward with no constraint; as $\beta \to \infty$, $\pi \to \pi_{\mathrm{ref}}$ (stay at the reference). The optimal $\beta$ depends on the trade-off between reward and divergence.
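Over a finite outcome space the penalized objective has a well-known closed-form maximizer, $\pi^*(y) \propto \pi_{\mathrm{ref}}(y)\, \exp(r(y)/\beta)$ (the same form that underlies the DPO derivation). The sketch below, with an illustrative $\pi_{\mathrm{ref}}$ and reward vector, sweeps $\beta$ to show both limits:

```python
import numpy as np
from scipy.special import rel_entr

# Closed-form maximizer of E_pi[r] - beta * KL(pi || pi_ref) over a
# finite outcome space: pi*(y) ∝ pi_ref(y) * exp(r(y) / beta).
# pi_ref and r below are illustrative assumptions.
pi_ref = np.array([0.5, 0.3, 0.2])
r = np.array([1.0, 2.0, 0.5])

for beta in [0.1, 1.0, 10.0]:
    w = pi_ref * np.exp(r / beta)
    pi = w / w.sum()
    obj = pi @ r - beta * rel_entr(pi, pi_ref).sum()
    print(f"beta={beta:5.1f}  pi*={np.round(pi, 3)}  objective={obj:.3f}")
# beta -> 0: mass moves to argmax r; beta -> inf: pi* -> pi_ref.
```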
TRPO (Schulman et al., 2015). Trust Region Policy Optimization solves:

$$
\max_{\theta} \; L_{\theta_{\mathrm{old}}}(\theta) \quad \text{subject to} \quad \max_{s}\, D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\big) \leq \delta,
$$

where $L_{\theta_{\mathrm{old}}}(\theta)$ is the surrogate advantage objective and the constraint bounds the maximum KL over states. Each TRPO step solves a constrained optimization problem with a KL trust region - computationally expensive. PPO approximates this with the clip objective.
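For contrast, PPO's clipped surrogate is a few lines of code; the probability ratios and advantages below are illustrative assumptions:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate: E[min(r*A, clip(r, 1-eps, 1+eps)*A)].
    Replaces TRPO's explicit KL trust region with a cheap clamp on
    the probability ratio r = pi_new / pi_old."""
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return np.minimum(ratio * advantage, clipped * advantage).mean()

ratio = np.array([0.8, 1.0, 1.5, 2.5])   # pi_new / pi_old per sample
adv = np.array([1.0, -0.5, 2.0, 1.0])    # advantage estimates
print(ppo_clip_objective(ratio, adv))     # 0.975; large ratios are clipped
```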
L.3 The Frank-Wolfe Algorithm and KL
The Frank-Wolfe algorithm solves $\min_{x \in \mathcal{C}} f(x)$ (for a convex function $f$ on a convex constraint set $\mathcal{C}$) by linearizing $f$ at each step:

$$
s_t = \arg\min_{s \in \mathcal{C}} \langle \nabla f(x_t),\, s \rangle, \qquad x_{t+1} = (1 - \gamma_t)\, x_t + \gamma_t\, s_t.
$$

When the objective is a KL divergence and $\mathcal{C}$ is the probability simplex, Frank-Wolfe recovers the multiplicative weights algorithm. When $\mathcal{C}$ is a parametric exponential family, Frank-Wolfe corresponds to the M-step of EM.
This connection shows that common ML algorithms - EM, multiplicative weights, mirror descent, natural gradient - are all special cases of convex optimization using KL divergence as the geometry.
Appendix M: Further Reading and References
M.1 Primary Sources
| Reference | What to Read | Why |
|---|---|---|
| Kullback & Leibler (1951). "On Information and Sufficiency." Ann. Math. Stat. | Sections 1-3 | Original paper; clean introduction to I-divergence and sufficiency |
| Cover & Thomas (2006). Elements of Information Theory, Ch. 2 | Ch. 2, Section 2.3 | Best textbook treatment; all proofs with full details |
| MacKay (2003). Information Theory, Inference, and Learning Algorithms, Ch. 2 | Free online; Ch. 2 | Excellent intuition; coding-theoretic perspective |
| Bishop (2006). Pattern Recognition and Machine Learning, Ch. 10 | Ch. 10.1 | Variational inference from an ML perspective |
| Amari (2016). Information Geometry and Its Applications | Ch. 1-3 | Statistical manifolds, I/M-projections, natural gradient |
| Csiszár & Shields (2004). "Information Theory and Statistics" | Sections 1-3 | f-divergences, Sanov's theorem, method of types |
M.2 Key ML Papers Using KL Divergence
| Paper | KL Divergence Role | Year |
|---|---|---|
| Kingma & Welling, "Auto-Encoding Variational Bayes" | ELBO = reconstruction - KL; VAE training | 2014 |
| Goodfellow et al., "Generative Adversarial Networks" | GAN = JSD minimization | 2014 |
| Hinton, Vinyals & Dean, "Distilling the Knowledge in a Neural Network" | Forward KL for knowledge transfer | 2015 |
| Schulman et al., "Trust Region Policy Optimization" (TRPO) | KL trust region constraint | 2015 |
| Schulman et al., "Proximal Policy Optimization" (PPO) | KL approximated by clip | 2017 |
| Higgins et al., "beta-VAE: Learning Basic Visual Concepts" | $\beta$-weighted KL for disentanglement | 2017 |
| Mironov, "Rényi Differential Privacy" | Rényi divergence for privacy accounting | 2017 |
| Ho et al., "Denoising Diffusion Probabilistic Models" | Per-step KL for diffusion ELBO | 2020 |
| Rafailov et al., "Direct Preference Optimization" (DPO) | Implicit KL-constrained policy optimization | 2023 |
M.3 Numerical Tools
| Tool | What It Provides |
|---|---|
| `scipy.special.rel_entr(p, q)` | Element-wise $p \ln(p/q)$; sum for KL |
| `scipy.stats.entropy(p, q)` | $D_{\mathrm{KL}}(p \,\Vert\, q)$ for discrete distributions |
| `torch.nn.functional.kl_div(log_q, p)` | KL given $\log q$ and $p$; note the reversed argument order |
| `torch.distributions.kl_divergence(p, q)` | For torch distribution objects; supports many families |
| `sklearn.metrics.mutual_info_score` | Mutual information $I(X; Y) = D_{\mathrm{KL}}(p_{XY} \,\Vert\, p_X p_Y)$ |
| numpy (custom) | See Appendix D.1 for a stable implementation |
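Example usage of the scipy entries (the torch call is shown in a comment, with its reversed argument order):

```python
import numpy as np
from scipy.special import rel_entr
from scipy.stats import entropy

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

# Two equivalent ways to get D_KL(p || q) in nats:
print(rel_entr(p, q).sum())   # element-wise p * ln(p/q), summed
print(entropy(p, q))          # entropy(p, q) with two args = KL(p || q)

# PyTorch equivalent (note: first argument is log q, second is p):
#   import torch, torch.nn.functional as F
#   F.kl_div(torch.log(torch.tensor(q)), torch.tensor(p),
#            reduction="sum")  # also D_KL(p || q)
```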
M.4 Quick Diagnostic Checklist
Before using KL divergence in any ML project, verify:
- Which direction? Identify which distribution is $P$ (truth/reference) and which is $Q$ (approximation). The first argument is the one the expectation is taken under.
- Support check. Is $Q(x) > 0$ wherever $P(x) > 0$? If not, $D_{\mathrm{KL}}(P \,\|\, Q) = \infty$. Add smoothing if needed.
- Units. Is your implementation using natural log (nats) or log base 2 (bits)? Be consistent.
- Forward or reverse? Is this an MLE/distillation problem (forward KL) or variational inference (reverse KL)? The choice determines whether the fit is mass-covering (forward) or mode-seeking (reverse).
- Numerical stability. Are you computing $\sum_x p(x) \ln\frac{p(x)}{q(x)}$ directly from probabilities (unstable) or working in log space via log-softmax (stable)? See the sketch after this list.
- Sample size for estimation. KL estimators from finite samples have high variance in high dimensions - consider MMD or Wasserstein distances for high-dimensional distributions.
- KL coefficient. If using KL as a regularizer, is $\beta$ tuned? Too small -> reward hacking; too large -> no alignment gain.
- ELBO monitoring. In VAE training, are the reconstruction term and KL term logged separately? Monitoring their ratio over training diagnoses posterior collapse early.
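A minimal sketch of the stable log-space computation mentioned in the checklist, using `scipy.special.log_softmax` on raw logits (the logit values below are illustrative):

```python
import numpy as np
from scipy.special import log_softmax

def kl_from_logits(logits_p, logits_q):
    """Stable D_KL(P || Q) computed entirely in log space from raw
    logits, avoiding explicit division and logs of tiny probabilities."""
    log_p = log_softmax(logits_p)
    log_q = log_softmax(logits_q)
    return np.sum(np.exp(log_p) * (log_p - log_q))

# Large logits overflow a naive softmax-then-divide implementation
# (np.exp(1000) is inf); the log-space version stays finite.
logits_p = np.array([1000.0, 999.0, 0.0])
logits_q = np.array([999.0, 1000.0, 0.0])
print(kl_from_logits(logits_p, logits_q))  # ~0.462 nats, well-defined
```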
M.5 Conceptual Summary: One Page
KL divergence measures:
- The expected excess code length when coding $P$-data with a $Q$-code
- The average log-likelihood ratio $\mathbb{E}_P\big[\ln \frac{P(x)}{Q(x)}\big]$ - the evidence that data came from $P$, not $Q$
- The information gain from using $P$ instead of $Q$ as a model
Three things to always remember:

- Direction matters. $D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P)$. Forward KL -> mass-covering. Reverse KL -> mode-seeking.
- Non-negativity is everything. $D_{\mathrm{KL}}(P \,\|\, Q) \geq 0$ with equality iff $P = Q$. This single fact proves: entropy is maximized at the uniform distribution, the ELBO is a lower bound, MLE is consistent.
- It's everywhere in AI. MLE, VAE, RLHF, distillation, diffusion, VI - all are special cases of optimizing a KL divergence in a specific direction. Understanding which direction, and what approximations are made, is the key to understanding modern deep learning at a principled level.
Section 09-02 KL Divergence - 13 main sections, 13 appendices (A-M). Prerequisites: Section 09-01 Entropy | Next: Section 09-03 Mutual Information