KL Divergence — Sections 6 (Information-Theoretic Connections) through 13 (Conceptual Bridge)
6. Information-Theoretic Connections
6.1 KL and Cross-Entropy
The relationship is fundamental:

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q).$$

Expanding both sides:

$$-\sum_x p(x)\log q(x) \;=\; -\sum_x p(x)\log p(x) \;+\; \sum_x p(x)\log\frac{p(x)}{q(x)},$$

which is immediate from $\log\frac{p(x)}{q(x)} = \log p(x) - \log q(x)$. This decomposition has critical implications:
- Training ML models: Minimizing cross-entropy loss $H(p, q_\theta)$ over $\theta$ is identical to minimizing $D_{\mathrm{KL}}(p \,\|\, q_\theta)$, since $H(p)$ is constant. The irreducible entropy $H(p)$ is the minimum achievable cross-entropy.
- Perplexity gap: A language model with perplexity $\mathrm{PPL}$ has per-token cross-entropy $\log \mathrm{PPL}$ nats, i.e. a gap of $\log\mathrm{PPL} - H(p) = D_{\mathrm{KL}}(p \,\|\, q_\theta)$ nats above the theoretical minimum entropy.
- Calibration: A perfectly calibrated model satisfies $D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, q_\theta(y \mid x)\big) = 0$, meaning the model's output distribution matches the true conditional distributions exactly.
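A minimal NumPy sketch (not part of the original lesson) that checks the decomposition $H(p,q) = H(p) + D_{\mathrm{KL}}(p\,\|\,q)$ numerically; the distributions are made-up illustrations.

```python
import numpy as np

# Verify H(p, q) = H(p) + D_KL(p || q) for a small discrete example (nats).
p = np.array([0.5, 0.3, 0.2])   # "true" distribution (illustrative)
q = np.array([0.4, 0.4, 0.2])   # model distribution (illustrative)

entropy_p     = -np.sum(p * np.log(p))        # H(p)
cross_entropy = -np.sum(p * np.log(q))        # H(p, q)
kl_pq         =  np.sum(p * np.log(p / q))    # D_KL(p || q)

print(cross_entropy, entropy_p + kl_pq)       # identical up to float error
assert np.isclose(cross_entropy, entropy_p + kl_pq)
```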
Preview - Cross-Entropy (Section 09-04)
Cross-entropy $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$ is the canonical training loss for classification. It decomposes neatly into the intrinsic uncertainty $H(p)$ (irreducible) and the model quality gap $D_{\mathrm{KL}}(p \,\|\, q)$ (reducible by training).
-> Full treatment: Section 09-04 Cross-Entropy
6.2 KL and Mutual Information
Preview - Mutual Information (Section 09-03)
Mutual information is the KL divergence between the joint distribution and the product of marginals:

$$I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\big\|\, p(x)\,p(y)\big).$$

It measures statistical dependence: how much knowing $Y$ reduces uncertainty about $X$. This connection shows that $I(X; Y) \ge 0$ (Gibbs' inequality) and $I(X; Y) = 0$ iff $X$ and $Y$ are independent.
-> Full treatment: Section 09-03 Mutual Information
The MI-as-KL connection is used directly in contrastive learning (InfoNCE loss, SimCLR, CLIP): maximizing a lower bound on $I(X; Z)$ between the input $X$ and its representation $Z$. It also appears in the information bottleneck (Tishby et al., 2000), which frames representation learning as minimizing $I(X; Z)$ subject to maximizing $I(Z; Y)$ - two KL divergences in tension.
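A small NumPy sketch (not from the lesson) that computes mutual information as a KL divergence on a hypothetical 2x2 joint table and cross-checks it against the entropy identity.

```python
import numpy as np

# Hypothetical joint distribution p(x, y); the numbers are illustrative only.
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)

# I(X;Y) = D_KL( p(x,y) || p(x) p(y) )
mi_as_kl = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))

# Same quantity via I(X;Y) = H(X) + H(Y) - H(X,Y)
H = lambda dist: -np.sum(dist * np.log(dist))
mi_via_entropies = H(p_x) + H(p_y) - H(p_xy)

assert np.isclose(mi_as_kl, mi_via_entropies)
print(mi_as_kl)   # > 0 because X and Y are dependent in this example
```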
6.3 KL and Fisher Information
Preview - Fisher Information (Section 09-05)
For a parametric family $\{p_\theta\}$, the KL divergence between nearby members is approximately:

$$D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta + d\theta}) \;\approx\; \tfrac{1}{2}\, d\theta^\top F(\theta)\, d\theta,$$

where $F(\theta)$ is the Fisher information matrix. This local quadratic approximation to KL is what makes natural gradient descent (which preconditions gradient steps by $F(\theta)^{-1}$) the "geometrically correct" optimizer on the statistical manifold.
-> Full treatment: Section 09-05 Fisher Information
Connection to RLHF natural policy gradient: The KL trust region $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}) \le \epsilon$ defines a neighborhood in policy space. Near the reference policy, this KL ball is approximated by $\tfrac{1}{2}(\theta - \theta_{\mathrm{ref}})^\top F(\theta_{\mathrm{ref}})\,(\theta - \theta_{\mathrm{ref}}) \le \epsilon$, where $F(\theta_{\mathrm{ref}})$ is the Fisher information of $\pi_{\mathrm{ref}}$. Natural policy gradient algorithms such as TRPO use this approximation directly; PPO approximates the trust region more cheaply via clipping (Section 9.3).
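A quick NumPy sketch (illustrative, not from the lesson) of the local quadratic approximation: for a Bernoulli with mean parameter $\theta$, the Fisher information is $F(\theta) = 1/(\theta(1-\theta))$, and the exact KL to a slightly perturbed Bernoulli matches $\tfrac12 d^2 F(\theta)$.

```python
import numpy as np

def kl_bernoulli(a, b):
    """Exact KL divergence D_KL(Bern(a) || Bern(b)) in nats."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

theta, d = 0.3, 0.01                       # illustrative parameter and perturbation
fisher = 1.0 / (theta * (1 - theta))       # Fisher information of Bernoulli(theta)

exact  = kl_bernoulli(theta, theta + d)
approx = 0.5 * d**2 * fisher               # local quadratic approximation
print(exact, approx)                       # agree to O(d^3)
```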
6.4 KL and Entropy
KL divergence provides an elegant proof that entropy is maximized at the uniform distribution. For any distribution $p$ on an alphabet of size $n$ and uniform distribution $u(x) = 1/n$:

$$D_{\mathrm{KL}}(p \,\|\, u) = \sum_x p(x)\log\frac{p(x)}{1/n} = \log n - H(p).$$

Since $D_{\mathrm{KL}}(p \,\|\, u) \ge 0$: $H(p) \le \log n$, with equality iff $p = u$.
More generally, for any constraint set $\mathcal{C}$ of distributions satisfying given moment constraints, the maximum-entropy distribution equals $\arg\min_{p \in \mathcal{C}} D_{\mathrm{KL}}(p \,\|\, u)$ - it is the I-projection of the uniform distribution onto $\mathcal{C}$ (in the sense of Section 8.2). This unifies the maximum entropy principle with KL geometry.
7. f-Divergences and Generalizations
7.1 Csiszar f-Divergences
KL divergence is one member of a rich family of divergences introduced by Csiszar (1967).
Definition. Let $f : (0, \infty) \to \mathbb{R}$ be a convex function with $f(1) = 0$. The f-divergence of $p$ from $q$ is:

$$D_f(p \,\|\, q) = \sum_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right).$$

The condition $f(1) = 0$ ensures $D_f(p \,\|\, p) = 0$. Convexity of $f$ and Jensen's inequality give $D_f(p \,\|\, q) \ge 0$.

KL divergence as f-divergence. Taking $f(t) = t\log t$ (convex, $f(1) = 0$):

$$D_f(p \,\|\, q) = \sum_x q(x)\,\frac{p(x)}{q(x)}\log\frac{p(x)}{q(x)} = D_{\mathrm{KL}}(p \,\|\, q).$$
Standard f-divergences:
| Name | Generator $f(t)$ | Closed form | Properties |
|---|---|---|---|
| KL divergence | $t\log t$ | $D_{\mathrm{KL}}(p \| q)$ | Asymmetric; unbounded |
| Reverse KL | $-\log t$ | $D_{\mathrm{KL}}(q \| p)$ | Asymmetric; unbounded |
| Hellinger distance (squared) | $(\sqrt{t} - 1)^2$ | $\sum_x (\sqrt{p(x)} - \sqrt{q(x)})^2$ | Symmetric; bounded by 2 |
| Total variation | $\frac{1}{2}\lvert t - 1\rvert$ | $\frac{1}{2}\sum_x \lvert p(x) - q(x)\rvert$ | Symmetric; a true metric; bounded by 1 |
| Chi-squared | $(t - 1)^2$ | $\sum_x \frac{(p(x) - q(x))^2}{q(x)}$ | Asymmetric; upper-bounds KL |
| $\alpha$-divergence | $\frac{t^\alpha - 1}{\alpha(\alpha - 1)}$ | General family; limits give KL | Parameterizes all of the above |
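Below is a minimal NumPy sketch (not from the lesson) that instantiates the Csiszar construction with a few generators from the table and checks them against the direct formulas; the example distributions are arbitrary.

```python
import numpy as np

def f_divergence(p, q, f):
    """Csiszar f-divergence D_f(p || q) = sum_x q(x) f(p(x)/q(x))."""
    t = p / q
    return np.sum(q * f(t))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

kl     = f_divergence(p, q, lambda t: t * np.log(t))        # forward KL
rev_kl = f_divergence(p, q, lambda t: -np.log(t))           # reverse KL
tv     = f_divergence(p, q, lambda t: 0.5 * np.abs(t - 1))  # total variation
chi_sq = f_divergence(p, q, lambda t: (t - 1) ** 2)         # chi-squared

# Sanity checks against the direct formulas
assert np.isclose(kl,     np.sum(p * np.log(p / q)))
assert np.isclose(rev_kl, np.sum(q * np.log(q / p)))
assert np.isclose(tv,     0.5 * np.sum(np.abs(p - q)))
print(kl, rev_kl, tv, chi_sq)
```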
For AI - GAN losses. Generative Adversarial Networks (Goodfellow et al., 2014) use different f-divergences as training objectives: the original GAN uses Jensen-Shannon; f-GAN (Nowozin et al., 2016) generalizes to arbitrary f-divergences. This provides a principled family of training objectives for generative models.
7.2 Properties of f-Divergences
All f-divergences share four key properties (when $f$ is convex with $f(1) = 0$):
- Non-negativity: $D_f(p \,\|\, q) \ge 0$, with equality iff $p = q$ (when $f$ is strictly convex at $t = 1$).
- Data processing inequality: $D_f(pK \,\|\, qK) \le D_f(p \,\|\, q)$ for any stochastic kernel $K$.
- No triangle inequality in general.
- Convexity: $(p, q) \mapsto D_f(p \,\|\, q)$ is jointly convex when $f$ is convex.
The data processing inequality is a theorem for all f-divergences: processing can only reduce distinguishability between distributions.
Total variation as special case. The total variation distance $\mathrm{TV}(p, q) = \tfrac{1}{2}\sum_x \lvert p(x) - q(x)\rvert$ satisfies the triangle inequality and is a genuine metric. It bounds other f-divergences: Pinsker's inequality relates KL and TV:

$$\mathrm{TV}(p, q) \;\le\; \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(p \,\|\, q)}.$$
This is the most important inequality connecting KL divergence to total variation. Pinsker's inequality is used to convert KL bounds (from optimization) into total variation bounds (which control probabilities of events).
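A short NumPy sketch (illustrative; the distributions are random draws, not lesson data) that checks Pinsker's inequality empirically.

```python
import numpy as np

# Empirically check TV(p, q) <= sqrt(D_KL(p || q) / 2) on random distributions.
rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    kl = np.sum(p * np.log(p / q))
    tv = 0.5 * np.sum(np.abs(p - q))
    assert tv <= np.sqrt(kl / 2) + 1e-12
print("Pinsker's inequality held on all random trials")
```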
7.3 Renyi Divergence
For $\alpha > 0$, $\alpha \ne 1$, the order-$\alpha$ Renyi divergence is:

$$D_\alpha(p \,\|\, q) = \frac{1}{\alpha - 1}\log\sum_x p(x)^\alpha\, q(x)^{1 - \alpha}.$$

Limiting cases:
- $\alpha \to 1$: $D_\alpha \to D_{\mathrm{KL}}(p \,\|\, q)$ (L'Hopital's rule)
- $\alpha \to 0$: $D_0 = -\log q(\{x : p(x) > 0\})$ (support-based)
- $\alpha \to \infty$: $D_\infty = \log\max_x \frac{p(x)}{q(x)}$ (max-ratio)
- $\alpha = \tfrac{1}{2}$: $D_{1/2} = -2\log\sum_x\sqrt{p(x)\,q(x)}$, related to the Bhattacharyya distance
Properties: $D_\alpha(p \,\|\, q) \ge 0$; $D_\alpha(p \,\|\, p) = 0$; $D_\alpha$ is monotone increasing in $\alpha$; satisfies the data processing inequality; not symmetric.
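A brief NumPy sketch (not from the lesson; example distributions are arbitrary) showing the Renyi divergence, its monotonicity in $\alpha$, and its limits toward KL and the max-ratio.

```python
import numpy as np

def renyi(p, q, alpha):
    """Order-alpha Renyi divergence for discrete p, q (alpha > 0, alpha != 1)."""
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
kl = np.sum(p * np.log(p / q))

for a in [0.5, 0.9, 0.999, 2.0, 10.0]:
    print(f"alpha={a:6.3f}  D_alpha={renyi(p, q, a):.4f}")
print(f"KL (alpha -> 1 limit):       {kl:.4f}")
print(f"max-ratio (alpha -> inf):    {np.log(np.max(p / q)):.4f}")
# D_alpha increases with alpha and approaches KL as alpha -> 1.
```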
For AI - Differential privacy. Renyi differential privacy (Mironov, 2017) quantifies privacy loss using Renyi divergence: a mechanism $M$ is $(\alpha, \epsilon)$-RDP if $D_\alpha\big(M(D) \,\|\, M(D')\big) \le \epsilon$ for any adjacent datasets $D, D'$. Renyi DP composes additively: composing an $(\alpha, \epsilon_1)$-RDP mechanism with an $(\alpha, \epsilon_2)$-RDP mechanism gives $(\alpha, \epsilon_1 + \epsilon_2)$-RDP. This makes it the tool of choice for privacy accounting in large-scale training with DP-SGD (used in Google's federated learning and some LLM fine-tuning pipelines).
7.4 Jensen-Shannon Divergence
The Jensen-Shannon Divergence (JSD) is the symmetrized, bounded variant of KL:

$$\mathrm{JSD}(p \,\|\, q) = \tfrac{1}{2} D_{\mathrm{KL}}(p \,\|\, m) + \tfrac{1}{2} D_{\mathrm{KL}}(q \,\|\, m),$$

where $m = \tfrac{1}{2}(p + q)$ is the mixture distribution.
Properties:
- $\mathrm{JSD}(p \,\|\, q) = \mathrm{JSD}(q \,\|\, p)$ - symmetric
- $0 \le \mathrm{JSD}(p \,\|\, q) \le \log 2$ - bounded (in nats)
- $\sqrt{\mathrm{JSD}}$ is a metric (satisfies the triangle inequality)
- $\mathrm{JSD}(p \,\|\, q) = H(m) - \tfrac{1}{2}H(p) - \tfrac{1}{2}H(q)$ - entropy of mixture minus average entropy
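A minimal NumPy sketch (illustrative distributions, not lesson data) computing JSD from its definition and checking symmetry and the $\log 2$ bound.

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    m = 0.5 * (p + q)                       # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

print(jsd(p, q), jsd(q, p))                 # symmetric
print(jsd(p, q) <= np.log(2))               # bounded by log 2 nats
print(kl(p, q), kl(q, p))                   # the raw KLs are larger and asymmetric
```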
For AI - Original GAN objective. The original GAN (Goodfellow et al., 2014) with an optimal discriminator minimizes the Jensen-Shannon divergence between the generator distribution $p_g$ and the data distribution $p_{\mathrm{data}}$:

$$\max_D V(D, G) = -\log 4 + 2\,\mathrm{JSD}(p_{\mathrm{data}} \,\|\, p_g).$$

The optimal discriminator is $D^*(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}$. However, when $p_{\mathrm{data}}$ and $p_g$ have disjoint supports, $\mathrm{JSD} = \log 2$ (its maximum) everywhere - the discriminator saturates and gradients vanish. This vanishing-gradient problem (and the related mode-collapse pathology) motivated WGAN (Arjovsky et al., 2017), which uses the Wasserstein distance (which doesn't require absolute continuity).
8. Information Geometry of KL
8.1 KL as a Bregman Divergence
A Bregman divergence generated by a strictly convex function $\phi$ is:

$$B_\phi(x \,\|\, y) = \phi(x) - \phi(y) - \langle\nabla\phi(y),\, x - y\rangle.$$

This is the gap between $\phi$ at $x$ and its first-order Taylor approximation at $y$ - always $\ge 0$ by convexity.

KL is a Bregman divergence. Taking $\phi(p) = \sum_x p(x)\log p(x)$ (the negative entropy, which is strictly convex):

$$B_\phi(p \,\|\, q) = D_{\mathrm{KL}}(p \,\|\, q).$$

This Bregman representation reveals why KL is asymmetric: Bregman divergences are generally not symmetric because the Taylor approximation is computed at $y$, not at $x$.

Exponential family connection. For an exponential family with log-partition function $A(\theta)$, the Bregman divergence of $A$ satisfies $D_{\mathrm{KL}}(p_{\theta_1} \,\|\, p_{\theta_2}) = B_A(\theta_2 \,\|\, \theta_1)$ (note the reversed argument order). This is the source of the elegant formula in Section 5.3.
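A small NumPy sketch (not from the lesson) checking the exponential-family identity for the Bernoulli family: with natural parameter $\theta = \mathrm{logit}(p)$ and log-partition $A(\theta) = \log(1 + e^\theta)$, the KL between two Bernoullis equals the Bregman divergence of $A$ with reversed arguments. The specific parameter values are arbitrary.

```python
import numpy as np

def A(theta):        return np.log1p(np.exp(theta))          # log-partition
def dA(theta):       return 1.0 / (1.0 + np.exp(-theta))     # gradient = mean parameter p
def bregman_A(x, y): return A(x) - A(y) - dA(y) * (x - y)    # B_A(x || y)

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta1, theta2 = 0.4, -1.1                  # arbitrary natural parameters
p1, p2 = dA(theta1), dA(theta2)             # corresponding success probabilities

# D_KL(p_theta1 || p_theta2) = B_A(theta2 || theta1)
print(kl_bernoulli(p1, p2), bregman_A(theta2, theta1))   # equal up to float error
```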
8.2 I-Projection and M-Projection
Definition. Let $\mathcal{E}$ be a constraint set (e.g., a parametric family of distributions):
- I-projection (information projection): $q^* = \arg\min_{q \in \mathcal{E}} D_{\mathrm{KL}}(q \,\|\, p)$ - the distribution in $\mathcal{E}$ closest to $p$ in reverse KL
- M-projection (moment projection): $q^* = \arg\min_{q \in \mathcal{E}} D_{\mathrm{KL}}(p \,\|\, q)$ - the distribution in $\mathcal{E}$ closest to $p$ in forward KL
For the exponential family:
- The M-projection onto an exponential family always produces the moment-matched distribution: $\mathbb{E}_{q^*}[T(x)] = \mathbb{E}_p[T(x)]$ for the sufficient statistics $T$
- The I-projection onto a constraint set defined by mean parameters (a convex set of moment constraints) produces the exponential-family distribution whose natural parameters realize those means
EM algorithm as alternating projections. The EM algorithm alternates:
- E-step: Compute $q^{(t+1)}(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{x}; \theta^{(t)})$, the minimizer of $D_{\mathrm{KL}}\big(q \,\|\, p(\mathbf{z} \mid \mathbf{x}; \theta^{(t)})\big)$ over $q$ - an I-projection onto the exact posterior of the previous model
- M-step: $\theta^{(t+1)} = \arg\max_\theta\, \mathbb{E}_{q^{(t+1)}}\big[\log p(\mathbf{x}, \mathbf{z}; \theta)\big]$ - an M-projection of $q^{(t+1)}$ back onto the parametric family
Each step decreases the corresponding KL divergence (equivalently, increases the ELBO), guaranteeing convergence to a stationary point.
8.3 The Pythagorean Theorem for KL
Theorem (Pythagorean theorem for I-projections). Let $\mathcal{E}$ be an exponential family (a convex set in mean parameter space) and let $q^*$ be the I-projection of $p$ onto $\mathcal{E}$. Then for any $r \in \mathcal{E}$:

$$D_{\mathrm{KL}}(r \,\|\, p) = D_{\mathrm{KL}}(r \,\|\, q^*) + D_{\mathrm{KL}}(q^* \,\|\, p).$$
This is an exact equality - there is no approximation, unlike the geometric Pythagorean theorem.
Interpretation. The KL distance from any $r$ in the constraint set to the target $p$ decomposes into: (1) the distance from $r$ to the projection $q^*$, plus (2) the distance from the projection $q^*$ to $p$. The "closest" point $q^*$ acts like the foot of a perpendicular. This is called "Pythagorean" because the geometry is orthogonal in the KL sense.
```
PYTHAGOREAN THEOREM FOR KL

          p   (target distribution, outside set E)
          |
          |   D_KL(q* || p)
          |
    q* ------- r          (q*, r in constraint set E)
       D_KL(r || q*)

    D_KL(r || p) = D_KL(r || q*) + D_KL(q* || p)
    (exact for I-projections onto exponential families)
```
For AI: The variational inference ELBO maximization is equivalent to minimizing $D_{\mathrm{KL}}\big(q_\phi(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\big)$ - an I-projection onto the variational family. The Pythagorean theorem shows that the ELBO gap is exactly $D_{\mathrm{KL}}\big(q^* \,\|\, p(\mathbf{z} \mid \mathbf{x})\big)$ at the optimum - a clean statement of how much information is lost by restricting $q$ to the variational family.
8.4 Statistical Manifolds and Natural Gradient
The space of probability distributions over $\mathcal{X}$ forms a statistical manifold with local coordinates given by the natural parameters $\theta$ (for exponential families). The KL divergence induces a Riemannian metric on this manifold called the Fisher information metric:

$$D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta + d\theta}) \;\approx\; \tfrac{1}{2}\, d\theta^\top F(\theta)\, d\theta,$$

where $F(\theta)$ is the Fisher information matrix. The Riemannian structure makes the "natural" step size in parameter space match the actual KL distance between distributions.
Natural gradient. In Euclidean parameter space, plain gradient descent takes steps measured by the Euclidean metric on $\theta$-space, which does not correspond to any natural metric on distribution space. The natural gradient corrects for this:

$$\tilde\nabla_\theta L = F(\theta)^{-1}\,\nabla_\theta L.$$
Natural gradient descent converges much faster in problems where the Fisher information is highly non-uniform (i.e., where the Euclidean metric on parameters is poorly calibrated to KL distance on distributions). This is the foundation of K-FAC (Martens & Grosse, 2015), used in second-order optimization of neural networks.
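A toy NumPy sketch (not from the lesson; data and step size are illustrative) of natural-gradient ascent on a Bernoulli log-likelihood. In the mean parameterization $F(\theta) = 1/(\theta(1-\theta))$, so the natural gradient of the average log-likelihood simplifies to (sample mean $-$ $\theta$).

```python
import numpy as np

data = np.array([1, 1, 0, 1, 1, 1, 0, 1])       # made-up observations, mean = 0.75
theta, lr = 0.5, 0.5

for step in range(6):
    grad = data.mean() / theta - (1 - data.mean()) / (1 - theta)  # d/dtheta avg log-lik
    fisher = 1.0 / (theta * (1 - theta))                          # Fisher information
    natural_grad = grad / fisher                                  # = data.mean() - theta
    theta += lr * natural_grad
    print(f"step {step}: theta = {theta:.4f}")
# theta converges geometrically to the MLE 0.75; with lr = 1.0 a single natural
# step lands on it exactly, because the step is measured in KL geometry.
```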
-> Full treatment of Fisher information and natural gradient: Section 09-05 Fisher Information
9. Applications in Machine Learning
9.1 Maximum Likelihood = Minimizing KL
Theorem. For a parametric model $p_\theta$ and empirical data distribution $\hat p_{\mathrm{data}}$:

$$\hat\theta_{\mathrm{MLE}} = \arg\max_\theta \frac{1}{N}\sum_{i=1}^N \log p_\theta(x_i) = \arg\min_\theta D_{\mathrm{KL}}(\hat p_{\mathrm{data}} \,\|\, p_\theta).$$

Proof. Using $D_{\mathrm{KL}}(\hat p_{\mathrm{data}} \,\|\, p_\theta) = -H(\hat p_{\mathrm{data}}) - \mathbb{E}_{\hat p_{\mathrm{data}}}[\log p_\theta(x)]$, minimizing over $\theta$ eliminates the constant $H(\hat p_{\mathrm{data}})$:

$$\arg\min_\theta D_{\mathrm{KL}}(\hat p_{\mathrm{data}} \,\|\, p_\theta) = \arg\max_\theta \mathbb{E}_{\hat p_{\mathrm{data}}}[\log p_\theta(x)] = \arg\max_\theta \frac{1}{N}\sum_{i=1}^N \log p_\theta(x_i).$$
Implications:
- Every LLM trained by next-token prediction (GPT-4, LLaMA-3, Claude, Gemini) minimizes $D_{\mathrm{KL}}(\hat p_{\mathrm{data}} \,\|\, p_\theta)$ over the next-token distribution
- The MLE estimate is not "arbitrary" - it has a precise information-theoretic interpretation as the distribution in the model family closest to the data in forward KL
- Coverage: MLE must cover the entire data distribution; if $\hat p_{\mathrm{data}}(x) > 0$ but $p_\theta(x) \approx 0$, the loss is large
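A compact NumPy sketch (not from the lesson; counts and candidate models are illustrative) showing that average NLL and $D_{\mathrm{KL}}(\hat p_{\mathrm{data}} \,\|\, p_\theta)$ differ only by the constant $H(\hat p_{\mathrm{data}})$, so both are minimized by the same model.

```python
import numpy as np

counts = np.array([50, 30, 20])                   # hypothetical observed counts
p_hat = counts / counts.sum()                     # empirical distribution

def avg_nll(p_model):     return -np.sum(p_hat * np.log(p_model))
def kl_to_model(p_model): return  np.sum(p_hat * np.log(p_hat / p_model))

H_p_hat = -np.sum(p_hat * np.log(p_hat))
for p_model in [np.array([1/3, 1/3, 1/3]), np.array([0.5, 0.3, 0.2]), p_hat]:
    print(avg_nll(p_model), H_p_hat + kl_to_model(p_model))   # identical
# Both criteria are minimized at p_model = p_hat (the MLE), where the KL term is 0.
```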
9.2 Variational Autoencoders
The VAE (Kingma & Welling, 2014) introduces a latent variable $\mathbf{z}$ with prior $p(\mathbf{z})$ and likelihood $p_\theta(\mathbf{x} \mid \mathbf{z})$. The true log-likelihood $\log p_\theta(\mathbf{x})$ is intractable (it requires integrating over $\mathbf{z}$). The VAE introduces an encoder $q_\phi(\mathbf{z} \mid \mathbf{x})$ and derives a tractable lower bound via the KL decomposition:

$$\log p_\theta(\mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x})}\big[\log p_\theta(\mathbf{x} \mid \mathbf{z})\big] - D_{\mathrm{KL}}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big)}_{\mathrm{ELBO}(\theta, \phi; \mathbf{x})} \;+\; D_{\mathrm{KL}}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x})\big).$$
The ELBO has two terms:
- Reconstruction term $\mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x})}[\log p_\theta(\mathbf{x} \mid \mathbf{z})]$: how well the decoder reconstructs $\mathbf{x}$ from the latent code. Analogous to negative distortion.
- KL regularizer $D_{\mathrm{KL}}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big)$: how much the approximate posterior deviates from the prior. For a Gaussian encoder $\mathcal{N}(\boldsymbol\mu, \mathrm{diag}(\boldsymbol\sigma^2))$ and standard Gaussian prior: $D_{\mathrm{KL}} = \tfrac{1}{2}\sum_j\big(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1\big)$.
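A short NumPy sketch (not from the lesson; $\boldsymbol\mu$ and $\boldsymbol\sigma$ are illustrative) comparing the closed-form Gaussian KL regularizer against a Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
mu    = np.array([0.5, -1.0, 0.0])
sigma = np.array([0.8,  1.2, 1.0])

# Closed form: D_KL( N(mu, diag(sigma^2)) || N(0, I) )
closed_form = 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

# Monte Carlo: E_q[ log q(z) - log p(z) ] with z ~ q
z = mu + sigma * rng.standard_normal((200_000, 3))
log_q = np.sum(-0.5 * ((z - mu) / sigma)**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi), axis=1)
log_p = np.sum(-0.5 * z**2 - 0.5 * np.log(2 * np.pi), axis=1)
monte_carlo = np.mean(log_q - log_p)

print(closed_form, monte_carlo)    # agree to within Monte Carlo error
```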
Practical training: The ELBO is maximized by alternating gradient steps:
- w.r.t. $\phi$ (encoder parameters), using the reparameterization trick
- w.r.t. $\theta$ (decoder parameters), via direct backpropagation
VAE variants using KL: beta-VAE ($\beta > 1$ weight on the KL term for disentanglement), WAE (Wasserstein autoencoder replaces KL with MMD), VQ-VAE (discrete latent, no KL term), InfoVAE (adds a mutual-information term).
9.3 RLHF, PPO, and DPO
RLHF objective. Reinforcement Learning from Human Feedback (Christiano et al., 2017; InstructGPT, Ouyang et al., 2022) trains a policy $\pi_\theta$ to maximize expected reward while staying close to the reference policy $\pi_{\mathrm{ref}}$ (the pretrained LLM):

$$\max_{\pi_\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[r(x, y)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big),$$

where $r(x, y)$ is the reward model score and $\beta$ is the KL coefficient. Without the KL penalty, the policy collapses to reward hacking: generating short, repetitive text that maximizes reward scores but is incoherent or unnatural.
Optimal policy. The RLHF objective has a closed-form optimal solution (for each prompt $x$):

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(r(x, y)/\beta\big),$$

where $Z(x) = \sum_y \pi_{\mathrm{ref}}(y \mid x)\exp\big(r(x, y)/\beta\big)$ is the normalizing partition function. The optimal policy exponentially upweights sequences with high reward relative to the reference.
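A toy NumPy sketch (not from the lesson; the reference probabilities and rewards are made up) of the closed-form optimum on a four-response "vocabulary", showing how the KL coefficient $\beta$ trades off reward against closeness to $\pi_{\mathrm{ref}}$.

```python
import numpy as np

pi_ref  = np.array([0.40, 0.30, 0.20, 0.10])   # hypothetical reference policy
rewards = np.array([0.0,  1.0,  2.0, -1.0])    # hypothetical reward-model scores

def optimal_policy(beta):
    unnorm = pi_ref * np.exp(rewards / beta)
    return unnorm / unnorm.sum()               # divide by the partition function Z

for beta in [10.0, 1.0, 0.1]:
    pi_star = optimal_policy(beta)
    kl = np.sum(pi_star * np.log(pi_star / pi_ref))
    print(f"beta={beta:5.1f}  pi*={np.round(pi_star, 3)}  KL(pi*||pi_ref)={kl:.3f}")
# Large beta keeps pi* near pi_ref (small KL); small beta concentrates on the
# highest-reward response and pays a large KL penalty.
```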
DPO (Direct Preference Optimization, Rafailov et al., 2023). DPO solves the RLHF objective directly from preference data, without training an explicit reward model. The key insight: substituting the optimal policy formula into the reward-learning objective yields a loss that only involves the policy, not the reward model:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],$$

where $\sigma$ is the logistic function, $y_w$ is the preferred (winning) response, and $y_l$ the dispreferred (losing) response. The log-ratio $\beta\log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ is the implicit reward that the KL-constrained policy learns. DPO is used in practice for fine-tuning LLMs on human preferences (Zephyr, Mixtral-Instruct, and many other open models).
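A minimal NumPy sketch of the DPO loss for a single preference pair, written in terms of sequence log-probabilities; the four log-prob values below are hypothetical placeholders, not outputs of any real model.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid( beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)] )."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)

# Policy slightly prefers the winning response relative to the reference:
print(dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0))
# The loss decreases as the implicit reward margin between y_w and y_l grows.
```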
PPO trust region. PPO (Schulman et al., 2017) approximates the KL constraint with a clipping objective:

$$\mathcal{L}^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat A_t,\;\mathrm{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\,\hat A_t\big)\Big],$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$ is the probability ratio. Clipping at $1 \pm \epsilon$ prevents large policy updates, approximating the KL trust-region constraint $D_{\mathrm{KL}}(\pi_{\theta_{\mathrm{old}}} \,\|\, \pi_\theta) \le \delta$.
9.4 Knowledge Distillation
Knowledge distillation (Hinton, Vinyals & Dean, 2015) transfers knowledge from a large teacher model $p_T$ to a smaller student model $p_S$. The distillation loss uses forward KL (from teacher to student):

$$\mathcal{L}_{\mathrm{distill}} = D_{\mathrm{KL}}\big(p_T^{(\tau)} \,\|\, p_S^{(\tau)}\big),$$

where $p_T^{(\tau)} = \mathrm{softmax}(\mathbf{z}_T / \tau)$ is the teacher's softened distribution at temperature $\tau$ (and $p_S^{(\tau)}$ is the student's).
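A short NumPy sketch (not from the lesson) of the temperature-softened forward KL between teacher and student; the logits are illustrative, not from any specific model.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_kl(teacher_logits, student_logits, tau=2.0):
    p_t = softmax(teacher_logits / tau)        # softened teacher distribution
    p_s = softmax(student_logits / tau)        # softened student distribution
    return np.sum(p_t * np.log(p_t / p_s))     # forward KL: teacher -> student

teacher = np.array([4.0, 2.5, 0.5])            # e.g. "cat", "tiger", "dog"
student = np.array([3.0, 1.0, 1.0])

for tau in [1.0, 2.0, 5.0]:
    print(f"tau={tau:3.1f}  KL(teacher || student) = {distill_kl(teacher, student, tau):.4f}")
# Higher temperature flattens both distributions, exposing the teacher's relative
# preferences among incorrect classes ("dark knowledge") to the student.
```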
Why forward KL (teacher -> student)? The student must cover all of the teacher's probability mass, including the "dark knowledge" - the non-zero probabilities assigned to incorrect classes that encode the teacher's generalization patterns. If a teacher assigns 60% to "cat," 30% to "tiger," and 10% to "dog," the soft labels teach the student that cats look more like tigers than dogs. One-hot labels destroy this information.
Forward vs reverse KL choice in distillation. Using reverse KL would allow the student to become mode-seeking - focusing only on the teacher's highest-probability output and ignoring the distributional shape. Forward KL forces the student to match the full distribution, preserving the teacher's calibration.
Modern LLM distillation. DistilBERT (Sanh et al., 2019) distills BERT using forward KL plus cosine similarity on hidden states. Phi-1 (Gunasekar et al., 2023) and Phi-2 (Microsoft, 2023) are trained largely on synthetic data generated by GPT-4-class models - a data-level form of distillation - to build small but capable models, and several recent small open models are distilled from larger checkpoints. The forward KL direction is standard.
9.5 Normalizing Flows and Diffusion Models
Normalizing flows learn an invertible transformation $f_\theta$ such that $\mathbf{x} = f_\theta(\mathbf{z})$ where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$. The exact change-of-variables formula gives:

$$\log p_\theta(\mathbf{x}) = \log p_z\big(f_\theta^{-1}(\mathbf{x})\big) + \log\left|\det\frac{\partial f_\theta^{-1}(\mathbf{x})}{\partial\mathbf{x}}\right|.$$

Training by maximum likelihood minimizes $D_{\mathrm{KL}}(p_{\mathrm{data}} \,\|\, p_\theta)$ (forward KL), since the log-likelihood is tractable. This contrasts with GANs, which use a discriminator to avoid computing $p_\theta(\mathbf{x})$ directly.
Diffusion models (DDPM, Ho et al., 2020). The DDPM ELBO is a sum of KL terms:

$$\mathcal{L} = \mathbb{E}_q\Big[D_{\mathrm{KL}}\big(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\big) + \sum_{t > 1} D_{\mathrm{KL}}\big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big) - \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)\Big].$$

Each reverse diffusion step $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ is trained to minimize the KL divergence from the (analytically tractable) true reverse posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$. Because both distributions are Gaussian (by the Markov structure), each per-step KL has a closed-form expression in terms of the noise predictor $\boldsymbol\epsilon_\theta(\mathbf{x}_t, t)$. The simplified training objective $\|\boldsymbol\epsilon - \boldsymbol\epsilon_\theta(\mathbf{x}_t, t)\|^2$ is equivalent to predicting the noise - which is the standard diffusion training loss.
10. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Writing $D_{\mathrm{KL}}(q \| p)$ when you mean forward KL (truth -> approximation) | The first argument is the reference (truth); forward KL is $D_{\mathrm{KL}}(p \| q)$, with the expectation taken under $p$ | Always identify which distribution is the truth and which is the approximation; check: is the expectation under truth or approx? |
| 2 | Assuming KL is symmetric: $D_{\mathrm{KL}}(p \| q) = D_{\mathrm{KL}}(q \| p)$ | KL is not symmetric; the two directions have fundamentally different behaviors | Never swap $p$ and $q$ without checking both directions; use the symmetrized (Jeffreys) divergence or JSD if you need symmetry |
| 3 | Ignoring support mismatch: computing $D_{\mathrm{KL}}(p \| q)$ when $q(x) = 0$ for some $x$ with $p(x) > 0$ | $D_{\mathrm{KL}} = \infty$ when $\mathrm{supp}(p) \not\subseteq \mathrm{supp}(q)$; finite values reported by numerical code are meaningless | Add Laplace smoothing or use JSD / TV for distributions with potentially different supports |
| 4 | Treating $D_{\mathrm{KL}} = 0$ as equivalent to $p = q$ under numerical approximations | Finite-precision computation gives $D_{\mathrm{KL}} \ne 0$ even when $p = q$ due to rounding | Use $D_{\mathrm{KL}} < \epsilon$ (small threshold) rather than exact zero; check sample moments separately |
| 5 | Using KL as a distance metric satisfying the triangle inequality | $D_{\mathrm{KL}}(p \| r)$ can exceed $D_{\mathrm{KL}}(p \| q) + D_{\mathrm{KL}}(q \| r)$; KL is not a metric | Use total variation or $\sqrt{\mathrm{JSD}}$ if you need a proper metric |
| 6 | Confusing minimizing $D_{\mathrm{KL}}(p_{\mathrm{data}} \| p_\theta)$ over $\theta$ with minimizing $D_{\mathrm{KL}}(q \| p)$ over $q$ | MLE minimizes forward KL over $\theta$ (model parameters); variational inference minimizes reverse KL over $q$ | Always identify which argument is the optimization variable |
| 7 | Computing cross-entropy loss and calling it KL divergence | Cross-entropy $H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$; they differ by $H(p)$ | They're equal only when $p$ is one-hot ($H(p) = 0$); for soft labels, $H(p, q) \ne D_{\mathrm{KL}}(p \| q)$ |
| 8 | Expecting the VAE KL term to be zero at optimum | The KL term equals $D_{\mathrm{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))$; it is zero only if posterior = prior (posterior collapse) | A healthy VAE keeps this term positive; if it collapses to zero, use KL annealing or $\beta$-weighting |
| 9 | Using forward KL for variational inference | Forward KL $D_{\mathrm{KL}}(p(\mathbf{z} \mid \mathbf{x}) \| q)$ requires expectations under the intractable posterior $p(\mathbf{z} \mid \mathbf{x})$ | Use reverse KL $D_{\mathrm{KL}}(q \| p(\mathbf{z} \mid \mathbf{x}))$, whose expectation is under the tractable $q$ (this is the ELBO) |
| 10 | Claiming the RLHF KL penalty "forces" the model to stay close to reference | The KL penalty softly penalizes deviation; with large enough reward, the policy can still deviate significantly | The KL penalty is a soft regularizer, not a hard constraint; monitor $D_{\mathrm{KL}}(\pi_\theta \| \pi_{\mathrm{ref}})$ during training |
| 11 | Conflating Renyi divergence with KL divergence | Renyi only equals KL in the limit $\alpha \to 1$; for $\alpha \ne 1$ they have different properties | Use Renyi divergence when differential privacy accounting is needed; use KL for standard information-theoretic arguments |
| 12 | Forgetting the $\ln 2$ factor when converting between nats and bits | $D_{\mathrm{KL}}$ in bits $= D_{\mathrm{KL}}$ in nats $/ \ln 2$; 1 nat $\approx 1.44$ bits | Always specify units; when comparing with cross-entropy in bits (binary classification), convert to nats or vice versa |
11. Exercises
Exercise 1 (*) - Non-negativity and equality condition. Let and .
(a) Compute and by hand (use , give exact values and numerical approximations).
(b) Verify that both are non-negative. Which is larger?
(c) Find a distribution on four symbols such that . Is unique?
(d) Using only the convexity of , write a self-contained proof that for your specific and .
Exercise 2 (*) - KL between Gaussians. Let and .
(a) Starting from the definition, derive the closed form:

$$D_{\mathrm{KL}}\big(\mathcal{N}(\boldsymbol\mu_0, \Sigma_0) \,\|\, \mathcal{N}(\boldsymbol\mu_1, \Sigma_1)\big) = \tfrac{1}{2}\Big[\mathrm{tr}(\Sigma_1^{-1}\Sigma_0) + (\boldsymbol\mu_1 - \boldsymbol\mu_0)^\top\Sigma_1^{-1}(\boldsymbol\mu_1 - \boldsymbol\mu_0) - k + \ln\frac{\det\Sigma_1}{\det\Sigma_0}\Big].$$

(b) Compute $D_{\mathrm{KL}}(p \,\|\, q)$ and $D_{\mathrm{KL}}(q \,\|\, p)$. Are they equal?
(c) Show that when $\Sigma_0 = \Sigma_1 = \Sigma$, the formula reduces to $\tfrac{1}{2}(\boldsymbol\mu_1 - \boldsymbol\mu_0)^\top\Sigma^{-1}(\boldsymbol\mu_1 - \boldsymbol\mu_0)$ - half the squared Mahalanobis distance.
(d) For the VAE case $p = \mathcal{N}(\mu, \sigma^2)$, $q = \mathcal{N}(0, 1)$: verify the simplified formula $\tfrac{1}{2}(\mu^2 + \sigma^2 - \log\sigma^2 - 1)$ and check that it is 0 when $\mu = 0$, $\sigma = 1$.
Exercise 3 (*) - Forward vs reverse KL. Let (bimodal) and .
(a) Show that the minimizer of over is , . (Hint: moment matching.)
(b) Numerically estimate on a grid of using 1000 samples. Show the two local minima near .
(c) Explain in one paragraph: if you were training a VAE generative model and the true data distribution is , which direction would you minimize, and what would the generator look like?
Exercise 4 (**) - Chain rule decomposition. Let be a joint distribution on with , , , , and uniform on ( each).
(a) Compute the joint divergence $D_{\mathrm{KL}}\big(p(x, y) \,\|\, q(x, y)\big)$ directly.
(b) Compute the marginal term $D_{\mathrm{KL}}\big(p(x) \,\|\, q(x)\big)$ and the expected conditional term $\mathbb{E}_{p(x)}\big[D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, q(y \mid x)\big)\big]$ separately.
(c) Verify that (a) equals the sum of the two terms in (b): the chain rule holds.
(d) Interpret: which term (marginal or conditional) contributes more to the total KL? What does this mean for the structure of $p$ vs $q$?
Exercise 5 (**) - f-Divergence family. Consider f-divergences on Bernoulli distributions and as a function of .
(a) Compute and plot (as functions of the Bernoulli parameter): the KL divergence $D_{\mathrm{KL}}(p \,\|\, q)$, the reverse KL $D_{\mathrm{KL}}(q \,\|\, p)$, the total variation $\mathrm{TV}(p, q)$, and the Jensen-Shannon divergence $\mathrm{JSD}(p \,\|\, q)$.
(b) Verify that Pinsker's inequality $\mathrm{TV}(p, q) \le \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(p \,\|\, q)}$ holds for all parameter values.
(c) Which divergence is symmetric? Which is bounded? Which is asymmetric?
(d) For (near-deterministic ): compute all four divergences. Why does KL blow up while JSD stays bounded?
Exercise 6 (**) - ELBO and KL decomposition. A simple VAE has prior and likelihood . The encoder is for a scalar parameter .
(a) For : compute the ELBO as a function of . Use the Gaussian KL formula.
(b) Show that the ELBO decomposes as: reconstruction term (plus constants) minus the KL term .
(c) Find the that maximizes the ELBO. What does this optimal encoder do geometrically?
(d) As the likelihood variance (decoder becomes very powerful): what happens to the optimal ? Does posterior collapse occur?
Exercise 7 (*) - RLHF optimal policy. A language model has reference policy $\pi_{\mathrm{ref}}(y)$ and reward $r(y)$. The RLHF objective is $J(\pi) = \mathbb{E}_{y \sim \pi}[r(y)] - \beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{ref}})$.
(a) Show that the unconstrained optimizer is $\pi^*(y) \propto \pi_{\mathrm{ref}}(y)\exp\!\big(r(y)/\beta\big)$ by taking the functional derivative of the objective with respect to $\pi$ (subject to $\sum_y \pi(y) = 1$).
(b) Show that at the optimum, the KL divergence is .
(c) Derive the DPO loss: given preference data $(x, y_w, y_l)$ where $y_w \succ y_l$, show that the Bradley-Terry preference model leads to the DPO loss.
(d) Numerically: for , , , , : compute and the resulting .
Exercise 8 (*) - Knowledge distillation analysis. A teacher model outputs logits for three classes and a student outputs .
(a) Compute the teacher's softmax distribution $p_T = \mathrm{softmax}(\mathbf{z}_T)$ and the student's $p_S = \mathrm{softmax}(\mathbf{z}_S)$.
(b) Compute $D_{\mathrm{KL}}(p_T \,\|\, p_S)$ (forward KL, what distillation minimizes) and $D_{\mathrm{KL}}(p_S \,\|\, p_T)$ (reverse KL).
(c) Now compute soft labels at temperature $\tau$: $p_T^{(\tau)} = \mathrm{softmax}(\mathbf{z}_T/\tau)$ and $p_S^{(\tau)} = \mathrm{softmax}(\mathbf{z}_S/\tau)$. Recompute both KL divergences. How does temperature affect the magnitude of the KL and the information in the soft labels?
(d) What is the total distillation loss: for and ?
(e) Argue why forward KL (teacher to student) is used rather than reverse KL. What kind of "dark knowledge" is preserved?
12. Why This Matters for AI (2026 Perspective)
| Concept | AI Impact |
|---|---|
| Forward KL = MLE | Every language model trained by next-token prediction (GPT-4, LLaMA-3, Claude, Gemini, Mistral) minimizes $D_{\mathrm{KL}}(\hat p_{\mathrm{data}} \| p_\theta)$; cross-entropy loss IS KL divergence up to the constant $H(\hat p_{\mathrm{data}})$ |
| ELBO = KL decomposition | All VAE-based generative models (image, speech, latent diffusion) optimize reconstruction + KL; the KL term determines latent space structure |
| RLHF KL penalty | All instruction-tuned LLMs (InstructGPT, Claude, Gemini) use a $\beta\, D_{\mathrm{KL}}(\pi_\theta \| \pi_{\mathrm{ref}})$ penalty to prevent reward hacking; $\beta$ is one of the most important hyperparameters in alignment |
| DPO implicit KL | DPO (used in Zephyr, Mixtral-Instruct, and many other open models) reframes RLHF as direct KL-constrained optimization; the log-ratio $\beta\log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ IS the implicit reward |
| Forward vs reverse KL | Choosing forward KL (MLE, distillation) vs reverse KL (VI, mean-field) is the defining design decision of probabilistic ML; wrong choice -> blurry outputs or posterior collapse |
| Knowledge distillation | LLM compression (DistilGPT, Phi-2, Phi-3) uses forward KL from teacher to student; soft labels at temperature $\tau$ carry 10-100x more information than one-hot labels |
| Posterior collapse | When the decoder is powerful, the VAE KL term collapses to zero - a major pathology in text VAEs (Bowman et al., 2016); fixed by KL annealing or $\beta$-VAE weighting |
| PPO trust region | PPO's clip objective approximates a KL trust region; the clip threshold $\epsilon$ corresponds to a specific KL budget per update step |
| GAN = JSD minimization | Original GAN (Goodfellow, 2014) minimizes JSD; JSD's saturation when supports are disjoint causes vanishing gradients and mode collapse - motivating WGAN/Wasserstein distance |
| Renyi DP for private training | Differential privacy in LLM fine-tuning (DP-SGD, used at Google) uses Renyi divergence; Renyi DP composes additively, making privacy budgets tractable |
| Diffusion ELBO | DDPM training objective is a sum of per-step KL divergences; the connection to score matching makes the KL formulation equivalent to noise prediction |
| Natural gradient | K-FAC (second-order optimizer for neural networks) approximates the Fisher matrix which is the local KL curvature; natural gradient descent is the geometrically correct first-order method on the statistical manifold |
| Calibration | $D_{\mathrm{KL}}\big(p(y \mid x) \| q_\theta(y \mid x)\big) = 0$ iff the model is perfectly calibrated in this strong sense; calibration regularization (temperature scaling, label smoothing) implicitly minimizes KL between predicted and calibrated distributions |
13. Conceptual Bridge
KL divergence sits at the center of the information theory chapter, connecting the individual uncertainty measured by entropy (Section 01) to the distributional relationships measured by mutual information (Section 03), the training objectives formalized as cross-entropy (Section 04), and the local geometry captured by Fisher information (Section 05). It is not an isolated concept - it is the engine that drives all four.
Backward: from Entropy. In Section 01, entropy measured the intrinsic uncertainty of a single distribution. KL divergence extends this to pairs of distributions: $D_{\mathrm{KL}}(p \,\|\, q)$ measures the extra uncertainty you incur by using $q$ when the truth is $p$. The non-negativity of KL proved the fundamental bound $H(p) \le \log n$ that appeared in Section 01 - Gibbs' inequality was already implicitly used there. Every result about optimal codes (Huffman, arithmetic coding) is ultimately about minimizing KL between the empirical distribution and the code-induced distribution.
Forward: to Mutual Information. Mutual information (Section 03) is a KL divergence measuring the dependence between two variables. The chain rule for KL (Section 3.5) becomes the chain rule for entropy when applied to the independence gap. The data processing inequality for KL becomes the DPI for mutual information: $I(X; Z) \le I(X; Y)$ whenever $Z$ is determined by $Y$ (i.e., $X \to Y \to Z$ forms a Markov chain). Understanding KL makes mutual information, and all the contrastive learning objectives built on it (InfoNCE, SimCLR, CLIP), immediately transparent.
Forward: to Cross-Entropy and Fisher Information. Cross-entropy is KL plus a constant (Section 04); Fisher information is KL's local curvature (Section 05). The three-way connection $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$ and $D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta + d\theta}) \approx \tfrac{1}{2} d\theta^\top F(\theta)\, d\theta$ (for small perturbations) shows that these are all the same object at different scales: global (cross-entropy) and local (Fisher).
```
KL DIVERGENCE IN THE INFORMATION THEORY CHAPTER

Chapter 09 - Information Theory

  Section 01  ENTROPY  H(X)
        |
        |  "extend to pairs of distributions"
        v
  Section 02  KL DIVERGENCE  D_KL(p || q)          <- YOU ARE HERE
        |
        +--> Section 03  MUTUAL INFORMATION   I(X;Y) = D_KL( p(x,y) || p(x)p(y) )
        +--> Section 04  CROSS-ENTROPY        H(p,q) = H(p) + D_KL(p || q)
        +--> Section 05  FISHER INFORMATION   F(theta) = local KL curvature

ML CONNECTIONS:
  D_KL(p_data || p_theta)       -> Language model training (MLE)
  D_KL(q_phi(z|x) || p(z))      -> VAE regularizer (ELBO)
  D_KL(pi_theta || pi_ref)      -> RLHF / PPO / DPO alignment
  D_KL(p_teacher || p_student)  -> Knowledge distillation
  D_KL(q || p_posterior)        -> Variational inference (VI)

CHAPTER POSITION:
  09-01 Entropy
  09-02 KL Divergence            <- current section
  09-03 Mutual Information       (KL between joint and product of marginals)
  09-04 Cross-Entropy            (H(p,q) = H(p) + D_KL(p || q))
  09-05 Fisher Information       (local KL curvature = Riemannian metric)
```
Looking ahead beyond this chapter: KL divergence appears again in Chapter 12 (Functional Analysis), where it generalizes to the RKHS setting and connects to kernel methods and maximum mean discrepancy (MMD). It appears in Chapter 21 (Statistical Learning Theory) as a key quantity in PAC-Bayes bounds: the generalization gap of a stochastic predictor is bounded by a term involving $D_{\mathrm{KL}}(Q \,\|\, P)$, where $Q$ is the posterior and $P$ is the prior on parameters. And in Chapter 22 (Causal Inference), KL divergence under interventional vs observational distributions quantifies the "causal effect" of an intervention in terms of distributional shift.
The single formula $D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$ - first written by Kullback and Leibler in 1951 - underlies more of modern AI than perhaps any other equation in this curriculum.