"The most important single quantity in information theory and in machine learning is the Kullback-Leibler divergence."
- David MacKay, Information Theory, Inference, and Learning Algorithms, 2003
Overview
KL divergence - formally the Kullback-Leibler divergence, also called relative entropy or information gain - is the central tool for comparing probability distributions. Where entropy (Section 01) measures the intrinsic uncertainty of a single distribution, KL divergence measures how much one distribution diverges from another: how many extra bits you waste by coding data from $p$ using a code optimized for $q$.
This single quantity underlies an astonishing range of machine learning. Maximum likelihood estimation, the training objective of every language model ever trained, is equivalent to minimizing $D_{\mathrm{KL}}(p_{\text{data}} \| p_\theta)$. The VAE training objective decomposes exactly as reconstruction loss plus a KL divergence term. RLHF's KL penalty, knowledge distillation's soft-target loss, and PPO's trust-region constraint are all KL divergences in disguise. Understanding KL divergence - its properties, its asymmetry, and the difference between its two directions - is prerequisite to understanding modern deep learning at a mechanistic level.
This section develops KL divergence from first principles. We prove non-negativity rigorously (Section 3.1), dissect the crucial asymmetry between forward and reverse KL (Section 4), derive closed-form expressions for Gaussians and exponential families (Section 5), explore generalizations to f-divergences and Renyi divergence (Section 7), and connect everything to the information-geometric structure of statistical manifolds (Section 8). Every result is grounded in concrete AI/ML applications with specific model and paper citations.
Scope note. This section is the canonical home for KL divergence. Closely related quantities - mutual information (Section 03), cross-entropy loss (Section 04), and Fisher information (Section 05) - each have their own dedicated sections and appear here only as brief previews with forward references.
Prerequisites
- Shannon entropy - $H(X) = -\sum_x p(x)\log p(x)$, concavity, boundedness - Section 09-01 Entropy
- Jensen's inequality - for convex $f$: $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$ - Chapter 8 Section 01
- Probability distributions - PMFs, PDFs, expectations, absolute continuity - Chapter 6
- Logarithm rules - $\log(ab) = \log a + \log b$, change of base - Chapter 1
- Convex functions - definition, second-derivative test, Bregman divergences (Section 8) - Chapter 8
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive derivations: Gibbs' inequality, forward vs reverse KL visualizations, Gaussian KL, f-divergences, VAE ELBO, RLHF policy geometry |
| exercises.ipynb | 10 graded exercises from non-negativity proofs to DPO implicit KL |
Learning Objectives
After completing this section, you will:
- Define $D_{\mathrm{KL}}(p \| q)$ for discrete and continuous distributions, state the support condition ($p \ll q$: $q(x) = 0 \Rightarrow p(x) = 0$), and explain the coding-theoretic interpretation
- Prove non-negativity of KL divergence (Gibbs' inequality) via Jensen's inequality and state the equality condition
- Demonstrate the asymmetry $D_{\mathrm{KL}}(p \| q) \ne D_{\mathrm{KL}}(q \| p)$ with a concrete numerical example and explain why KL is not a metric
- Distinguish forward KL (mean-seeking, mass-covering) from reverse KL (mode-seeking, zero-forcing) and predict which behavior each direction produces
- Derive the closed-form KL between two Gaussians and apply it to compute the VAE encoder KL penalty
- State the chain rule for KL divergence and the data processing inequality and explain their ML implications
- Identify KL divergence as a Bregman divergence and explain the Pythagorean theorem for KL, I-projection, and M-projection
- Express $\alpha$-divergences, Jensen-Shannon divergence, Hellinger distance, and total variation as special cases of Csiszar f-divergences
- Prove that maximum likelihood estimation minimizes $D_{\mathrm{KL}}(\hat p_{\text{data}} \| p_\theta)$ and derive the RLHF optimal policy
- Explain the difference between forward and reverse KL in knowledge distillation and describe posterior collapse in VAEs and the beta-VAE fix
- Connect KL to cross-entropy via $H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$ and preview its relationship to mutual information and Fisher information
Table of Contents
- 1. Intuition
- 2. Formal Definitions
- 3. Properties of KL Divergence
- 4. Forward KL vs Reverse KL
- 5. KL for Specific Distributions
- 6. Information-Theoretic Connections
- 7. f-Divergences and Generalizations
- 8. Information Geometry of KL
- 9. Applications in Machine Learning
- 10. Common Mistakes
- 11. Exercises
- 12. Why This Matters for AI (2026 Perspective)
- 13. Conceptual Bridge
1. Intuition
1.1 What Is KL Divergence?
Suppose you are a weather forecaster. Every day, you predict a probability distribution over tomorrow's weather: 70% sunny, 20% cloudy, 10% rain. But the true distribution - what nature actually produces - is 50% sunny, 30% cloudy, 20% rain. How much has your forecast missed? Shannon entropy measures the uncertainty in a single distribution. KL divergence measures how much one distribution deviates from another.
Formally, $D_{\mathrm{KL}}(p \| q)$ is the expected extra number of bits (or nats, if using natural logarithms) you waste when you encode data actually generated from $p$ using a code optimized for $q$. If you design a Huffman code for $q$, the optimal code assigns short codewords to events that $q$ considers probable. But if the true distribution is $p$, your code will be suboptimal whenever $p \ne q$. The KL divergence precisely quantifies the average penalty:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]$$

The ratio $p(x)/q(x)$ captures the mismatch event by event: events that $p$ considers much more likely than $q$ does (large $p(x)/q(x)$) waste the most bits. The expectation under $p$ reflects that the penalty is averaged over what actually happens.
Three equivalent ways to read $D_{\mathrm{KL}}(p \| q)$:
- Coding view: Expected excess code length when coding $p$-data with a $q$-optimal code.
- Surprise view: How much more surprised you are on average when you expect $q$ but nature draws from $p$.
- Testing view: The expected log-likelihood ratio $\mathbb{E}_p[\log(p/q)]$ - the average evidence per sample that data came from $p$ rather than $q$ in a hypothesis test.
All three views are the same formula. The richness of KL divergence comes from this triple identity: it simultaneously measures compression inefficiency, distributional mismatch, and statistical distinguishability.
For AI: When a language model $p_\theta$ is trained on data drawn from $p_{\text{data}}$, the negative log-likelihood loss equals $H(p_{\text{data}}, p_\theta) = H(p_{\text{data}}) + D_{\mathrm{KL}}(p_{\text{data}} \| p_\theta)$. Since $H(p_{\text{data}})$ is a constant w.r.t. $\theta$, minimizing cross-entropy loss is identical to minimizing $D_{\mathrm{KL}}(p_{\text{data}} \| p_\theta)$. Every gradient descent step that improves a language model is a step toward closing the KL gap between model and data distribution.
1.2 The Asymmetry Insight
The defining feature that most surprises newcomers is asymmetry: $D_{\mathrm{KL}}(p \| q)$ and $D_{\mathrm{KL}}(q \| p)$ are generally different, and this is not a flaw - it is a deeply meaningful distinction that determines which algorithm you should use.
To see why asymmetry arises, consider what each direction penalizes:
- Forward KL $D_{\mathrm{KL}}(p \| q) = \mathbb{E}_p[\log(p/q)]$: the expectation is under $p$. Events where $p(x) > 0$ but $q(x) \approx 0$ contribute hugely (the ratio blows up). This is also called inclusive KL: if $p$ puts mass somewhere, $q$ must also put mass there - $q$ must cover all of $p$'s support. Minimizing this forces $q$ to be "mean-seeking" - to match the whole spread of $p$, even when $p$ is multimodal.
- Reverse KL $D_{\mathrm{KL}}(q \| p) = \mathbb{E}_q[\log(q/p)]$: the expectation is now under $q$. Events where $q(x) > 0$ but $p(x) \approx 0$ contribute hugely. This is also called exclusive KL: wherever $q$ puts mass, $p$ must also have mass - so $q$ avoids regions where $p$ is small. Minimizing this forces $q$ to be "mode-seeking" - to concentrate on one mode of $p$ and ignore the others.
Concrete example: Let $p = \frac12\mathcal{N}(-3, 1) + \frac12\mathcal{N}(3, 1)$ be a bimodal mixture and let $q = \mathcal{N}(\mu, \sigma^2)$ be a unimodal Gaussian we fit to $p$:
- Minimizing $D_{\mathrm{KL}}(p \| q)$: $q$ must cover both modes, so $q^* = \mathcal{N}(0, 10)$ - wide, centered between the modes, placing mass in low-probability regions.
- Minimizing $D_{\mathrm{KL}}(q \| p)$: $q$ collapses onto one mode, e.g., $q^* \approx \mathcal{N}(3, 1)$, ignoring the other mode entirely.
Neither approximation is "wrong" - they answer different questions. Which direction to use is one of the most important design decisions in probabilistic ML.
For AI: Variational inference for Bayesian neural networks minimizes reverse KL - the variational posterior $q_\phi$ collapses to one mode of the intractable true posterior $p(\mathbf{w} \mid \mathcal{D})$. This is why Bayesian deep learning with mean-field VI tends to underestimate uncertainty (it ignores alternative modes). Maximum likelihood training of generative models minimizes forward KL - forcing the model to cover the entire data distribution, at the cost of sometimes generating blurry or average-looking samples.
1.3 Why KL Divergence Matters for AI
KL divergence is not one tool among many - it is the unifying framework behind most of the training objectives and alignment techniques in modern AI:
| Application | Which KL | How |
|---|---|---|
| Language model training (GPT, LLaMA, Claude) | Forward $D_{\mathrm{KL}}(p_{\text{data}} \| p_\theta)$ | Minimizing cross-entropy = minimizing forward KL |
| VAE regularizer (Kingma & Welling, 2014) | Reverse $D_{\mathrm{KL}}(q_\phi(\mathbf{z}\mid\mathbf{x}) \| p(\mathbf{z}))$ | Keeps encoder close to Gaussian prior |
| RLHF KL penalty (Christiano et al., 2017) | $D_{\mathrm{KL}}(\pi_\theta \| \pi_{\text{ref}})$ | Prevents reward hacking; preserves language coherence |
| PPO trust region (Schulman et al., 2017) | Approx. $D_{\mathrm{KL}}(\pi_{\theta_{\text{old}}} \| \pi_\theta)$ | Clip objective approximates KL constraint |
| DPO (Rafailov et al., 2023) | Implicit KL | Closed-form solution has same fixed point as RLHF |
| Knowledge distillation (Hinton et al., 2015) | Forward $D_{\mathrm{KL}}(p_{\text{teacher}} \| p_{\text{student}})$ | Student matches teacher's full soft distribution |
| Variational inference (VI) | Reverse $D_{\mathrm{KL}}(q \| p)$ | Tractable posterior approximation |
| Normalizing flows (Rezende & Mohamed, 2015) | Forward KL | Exact log-likelihood via change of variables |
| Diffusion models (Ho et al., 2020) | KL at each reverse step | ELBO decomposes as sum of per-step KL terms |
| Differential privacy (Renyi DP) | Renyi divergence | Renyi divergence bounds privacy loss composition |
The breadth of this table reflects a deep fact: KL divergence is the natural measure of distributional distance when you care about expected log-probability ratios, which is exactly what gradient-based optimization of probabilistic models does.
1.4 Historical Timeline
KL DIVERGENCE - KEY MILESTONES
1951 Kullback & Leibler, "On Information and Sufficiency"
Introduce D_KL as "information for discrimination" I(1:2)
Prove non-negativity; connect to sufficiency of statistics
1959 Kullback, "Information Theory and Statistics"
Systematic treatment; connection to likelihood ratio tests
1967 Csiszar, "Information-Type Measures of Difference"
Generalizes to f-divergences; unifies KL, Hellinger, chi-squared
1972 Chernoff; Blahut, Arimoto
Exponential decay of error rates in hypothesis testing
Connection to Renyi divergence
1985 Amari, "Differential-Geometric Methods in Statistics"
Information geometry: KL as non-symmetric Bregman divergence
Exponential and mixture geodesics; natural gradient
1993 Hinton & van Camp, "Keeping the Neural Networks Simple"
Minimum Description Length KL regularization
2013 Kingma & Welling, "Auto-Encoding Variational Bayes"
VAE: ELBO = reconstruction - D_KL(q_phi(z|x) || p(z))
Reparameterization trick for gradient through KL
2015 Hinton, Vinyals & Dean, "Distilling the Knowledge"
Forward KL with temperature-softened teacher distribution
2017 Christiano et al., "Deep Reinforcement Learning from Human Preferences"
KL penalty in RLHF; controls deviation from reference policy
2017 Schulman et al., "Proximal Policy Optimization"
PPO-clip approximates KL trust region constraint
2023 Rafailov et al., "Direct Preference Optimization"
DPO: closed-form KL-constrained policy optimization
2024+ All frontier LLMs (GPT-4, Claude, Gemini, LLaMA-3)
Trained with cross-entropy (= forward KL) + RLHF KL penalty
2. Formal Definitions
2.1 Discrete KL Divergence
Definition (Kullback-Leibler Divergence - discrete). Let $p$ and $q$ be probability mass functions on a countable alphabet $\mathcal{X}$. The KL divergence of $p$ from $q$ (also called the relative entropy of $p$ with respect to $q$) is:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$$

where:
- The logarithm is natural (nats) by convention in this curriculum, matching the NOTATION_GUIDE. To convert to bits, divide by $\ln 2$.
- Boundary conventions: $0 \log\frac{0}{q(x)} = 0$ for any $q(x) \ge 0$ (by continuity: $\lim_{t \to 0^+} t\log t = 0$); $p(x)\log\frac{p(x)}{0} = +\infty$ for $p(x) > 0$ (code length is infinite if you cannot encode a symbol at all).
- $D_{\mathrm{KL}}(p \| q) = +\infty$ whenever there exists $x$ with $p(x) > 0$ and $q(x) = 0$.
The notation convention in this repository (following Cover & Thomas 2006 and NOTATION_GUIDE Section 6): $D_{\mathrm{KL}}(p \| q)$ reads as "the KL divergence of $p$ from $q$." That is, $p$ is the reference (true) distribution and $q$ is the approximation. Some sources use the opposite convention; when reading papers, always check which argument is which.
Standard examples:
Example 2.1 (Binary distributions): Let $p = \mathrm{Bern}(p_1)$ and $q = \mathrm{Bern}(q_1)$ on $\{0, 1\}$. Then:

$$D_{\mathrm{KL}}(p \,\|\, q) = p_1\log\frac{p_1}{q_1} + (1 - p_1)\log\frac{1 - p_1}{1 - q_1}$$

This is the log-likelihood ratio test statistic for a Bernoulli parameter hypothesis test.
Example 2.2 (Weather forecast): True distribution $p = (0.5, 0.3, 0.2)$ over {sunny, cloudy, rain}; forecast $q = (0.7, 0.2, 0.1)$:

$$D_{\mathrm{KL}}(p \,\|\, q) = 0.5\log\frac{0.5}{0.7} + 0.3\log\frac{0.3}{0.2} + 0.2\log\frac{0.2}{0.1} \approx 0.092 \text{ nats}$$

The model wastes approximately 0.092 nats per observation. In the reverse direction: $D_{\mathrm{KL}}(q \| p) \approx 0.085$ nats - different but close, because the two distributions are not very far apart.
Example 2.3 (Categorical logits): In a classification model with softmax output $q = \mathrm{softmax}(\mathbf{z})$ and one-hot label $p = \mathbf{e}_y$:

$$D_{\mathrm{KL}}(p \,\|\, q) = -\log q_y$$

This is exactly the cross-entropy loss. (The entropy $H(p) = 0$ for one-hot $p$, so $H(p, q) = D_{\mathrm{KL}}(p \| q)$.)
2.2 Continuous KL Divergence
Definition (KL divergence - continuous). Let $p$ and $q$ be probability density functions on $\mathbb{R}^d$ with $p$ absolutely continuous with respect to $q$ (written $p \ll q$, meaning $q(x) = 0 \Rightarrow p(x) = 0$ a.e.). Then:

$$D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$$

The absolute continuity condition is essential: it ensures the density ratio is well-defined a.e. under $p$. When $p \not\ll q$, $D_{\mathrm{KL}}(p \| q) = +\infty$ by convention.
Measure-theoretic form. Using the Radon-Nikodym derivative $\frac{dP}{dQ}$, the general definition that works for both discrete and continuous cases:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \int \log \frac{dP}{dQ}\, dP$$
This unified form shows that discreteness vs continuity is just a choice of dominating measure.
2.3 Relative Entropy Interpretation
The phrase "relative entropy" is justified by a beautiful coding theorem. Suppose are i.i.d. from . You observe samples and design a code using distribution . The optimal code assigns length to symbol (by Shannon's source coding theorem). The expected code length for one symbol is - the cross-entropy. But the optimal code for has expected length . The excess code length per symbol is:
This is exactly KL divergence. The relative entropy measures how many extra nats/bits per symbol you pay for using the wrong code when the truth is . It is relative because it measures information relative to the reference distribution .
Information gain interpretation. In Bayesian inference, if is the posterior and is the prior, then is the information gain from the prior to the posterior - the reduction in uncertainty after observing data. The Kullback-Leibler 1951 paper used the term "information for discrimination": how much information the data provides for discriminating between and .
2.4 Non-Examples and Edge Cases
Understanding when KL divergence is not well-behaved clarifies the definition.
Non-example 2.1 (Mismatched support, $D_{\mathrm{KL}} = \infty$): Let $p = \mathcal{N}(0, 1)$ (support all of $\mathbb{R}$) and $q = \mathrm{Unif}[0, 1]$ (uniform on $[0, 1]$). Since $q(x) = 0$ for $x \notin [0, 1]$ but $p(x) > 0$ for all $x$, we have $p \not\ll q$ and $D_{\mathrm{KL}}(p \| q) = \infty$. However $q \ll p$ since $p(x) > 0$ everywhere, so $D_{\mathrm{KL}}(q \| p) < \infty$. This illustrates the extreme asymmetry that can occur when supports differ.
Non-example 2.2 (Zero KL does not mean identical distributions numerically): $D_{\mathrm{KL}}(p \| q) = 0$ implies $p(x) = q(x)$ for $p$-almost all $x$, but the distributions can still differ on a set of $p$-measure zero. In practice, if $D_{\mathrm{KL}}(p \| q) \approx 0$ numerically, $p$ and $q$ are identical where it matters.
Non-example 2.3 (Equal KL values do not imply related distributions): $D_{\mathrm{KL}}(p_1 \| q_1) = D_{\mathrm{KL}}(p_2 \| q_2)$ gives no useful information about the relationship between the two pairs. KL is an absolute quantity for a specific pair, not a geometry with reference points.
Non-example 2.4 (KL is not symmetric by default): For $p = \mathrm{Bern}(0.8)$ and $q = \mathrm{Bern}(0.2)$:

$$D_{\mathrm{KL}}(p \,\|\, q) = 0.8\log\frac{0.8}{0.2} + 0.2\log\frac{0.2}{0.8} = 0.6\log 4 \approx 0.832 \text{ nats}$$

By symmetry of this example, $D_{\mathrm{KL}}(q \| p) \approx 0.832$ nats as well - equal only because $p$ and $q$ are exact reverses of each other. For generic pairs, the two values differ (see Section 3.2).
3. Properties of KL Divergence
3.1 Non-Negativity: Gibbs' Inequality
Theorem (Gibbs' Inequality). For any two probability distributions $p$ and $q$ on the same alphabet:

$$D_{\mathrm{KL}}(p \,\|\, q) \ge 0$$

with equality if and only if $p(x) = q(x)$ for all $x$ (or $p$-almost all $x$ in the continuous case).
Proof via Jensen's inequality. The function $f(t) = t\log t$ is strictly convex on $(0, \infty)$ (its second derivative is $1/t > 0$). Equivalently, $\log t$ is strictly concave. By Jensen's inequality applied to the concave function $\log$:

$$-D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x)\log\frac{q(x)}{p(x)} \;\le\; \log\sum_x p(x)\frac{q(x)}{p(x)} = \log\sum_x q(x) \le \log 1 = 0$$

Therefore $D_{\mathrm{KL}}(p \| q) \ge 0$. Equality holds in Jensen's inequality for a strictly concave function iff the argument is constant, i.e., $q(x)/p(x) = c$ for all $x$ with $p(x) > 0$. Since both $p$ and $q$ sum to 1, this forces $c = 1$ and hence $p(x) = q(x)$ everywhere.
Alternative proof via $\log t \le t - 1$. The inequality $\log t \le t - 1$ (with equality iff $t = 1$), applied to $t = q(x)/p(x)$, implies:

$$-D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x)\log\frac{q(x)}{p(x)} \;\le\; \sum_x p(x)\left(\frac{q(x)}{p(x)} - 1\right) = \sum_x q(x) - \sum_x p(x) \le 0$$

This proof is elementary (needs no Jensen's inequality) and directly shows the equality condition.
Corollary (Entropy upper bound). Taking $q = u$ (uniform on $\mathcal{X}$ with $|\mathcal{X}| = n$):

$$0 \le D_{\mathrm{KL}}(p \,\|\, u) = \sum_x p(x)\log\frac{p(x)}{1/n} = \log n - H(p)$$

Therefore $H(p) \le \log n$. Gibbs' inequality is the reason entropy is maximized at the uniform distribution (proven rigorously in Section 09-01).
For AI: Non-negativity of KL underpins the ELBO. The VAE lower bound comes from:

$$\log p(\mathbf{x}) = \mathrm{ELBO}(\mathbf{x}) + D_{\mathrm{KL}}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\big)$$

Since $D_{\mathrm{KL}} \ge 0$, we have $\mathrm{ELBO}(\mathbf{x}) \le \log p(\mathbf{x})$ - the ELBO is indeed a lower bound on log-evidence. This is the fundamental inequality that makes variational inference tractable.
3.2 Asymmetry
$D_{\mathrm{KL}}(p \| q) \ne D_{\mathrm{KL}}(q \| p)$ in general. KL divergence is not a distance in the metric sense. We have already seen this intuitively; here we quantify it.
Example 3.2: $p = \mathrm{Bern}(0.5)$, $q = \mathrm{Bern}(0.9)$:

$$D_{\mathrm{KL}}(p \,\|\, q) = 0.5\log\frac{0.5}{0.9} + 0.5\log\frac{0.5}{0.1} \approx 0.511 \text{ nats}$$

$$D_{\mathrm{KL}}(q \,\|\, p) = 0.9\log\frac{0.9}{0.5} + 0.1\log\frac{0.1}{0.5} \approx 0.368 \text{ nats}$$

So $D_{\mathrm{KL}}(p \| q) \approx 0.511 \ne 0.368 \approx D_{\mathrm{KL}}(q \| p)$: the direction matters. Intuitively, coding a fair coin with a code built for a heavily biased coin wastes more than the other way around.
Jeffreys' symmetrized KL. Harold Jeffreys (1946) proposed the symmetric version:

$$D_J(p, q) = D_{\mathrm{KL}}(p \,\|\, q) + D_{\mathrm{KL}}(q \,\|\, p)$$

This satisfies $D_J(p, q) = D_J(q, p)$ and $D_J(p, q) \ge 0$, but still fails the triangle inequality. Jeffreys divergence is used in some applications where symmetry is important. The Jensen-Shannon divergence (Section 7.4) is a better-behaved symmetric variant that is also bounded.
3.3 Failure of Triangle Inequality
KL divergence is not a metric. Specifically, the triangle inequality does not hold in general.
Counterexample. Let $p = \mathrm{Bern}(0.1)$, $q = \mathrm{Bern}(0.5)$, $r = \mathrm{Bern}(0.9)$ on $\{0, 1\}$:

$$D_{\mathrm{KL}}(p \,\|\, q) \approx 0.368, \qquad D_{\mathrm{KL}}(q \,\|\, r) \approx 0.511, \qquad D_{\mathrm{KL}}(p \,\|\, r) = 0.8\log 9 \approx 1.758$$

Since $D_{\mathrm{KL}}(p \| r) \approx 1.758 > 0.879 \approx D_{\mathrm{KL}}(p \| q) + D_{\mathrm{KL}}(q \| r)$, the triangle inequality is violated. The reason: KL "skips" over intermediate distributions in a fundamentally non-Euclidean way.
3.4 Joint Convexity
Theorem. $D_{\mathrm{KL}}$ is jointly convex in the pair $(p, q)$: for any $\lambda \in [0, 1]$,

$$D_{\mathrm{KL}}\big(\lambda p_1 + (1-\lambda)p_2 \,\|\, \lambda q_1 + (1-\lambda)q_2\big) \;\le\; \lambda\, D_{\mathrm{KL}}(p_1 \,\|\, q_1) + (1-\lambda)\, D_{\mathrm{KL}}(p_2 \,\|\, q_2)$$

Proof sketch. The perspective function of a convex function is jointly convex: if $f$ is convex, then $g(x, t) = t f(x/t)$ is jointly convex in $(x, t)$ for $t > 0$. Taking $f(u) = u\log u$ (which is convex since $f''(u) = 1/u > 0$), the perspective is $g(p, q) = q\,\frac{p}{q}\log\frac{p}{q} = p\log\frac{p}{q}$. Summing over $x$ gives $D_{\mathrm{KL}}(p \| q) = \sum_x g(p(x), q(x))$, which is jointly convex in $(p, q)$ as a sum of perspective functions.
Consequence for the EM algorithm. The EM algorithm alternates between computing the expected log-likelihood (E-step) and maximizing it (M-step). Each M-step minimizes a KL divergence. Joint convexity guarantees that this alternating minimization makes well-defined progress toward a stationary point.
Convexity in $q$ alone. $D_{\mathrm{KL}}(p \| q)$ is also convex in $q$ for fixed $p$ (since it equals a sum of convex functions $-p(x)\log q(x)$ plus a constant).
Convexity in $p$ alone. $D_{\mathrm{KL}}(p \| q)$ is convex in $p$ for fixed $q$, because $p \mapsto p\log p$ is convex and $-p\log q$ is linear in $p$. This means: for fixed $p$, minimizing $D_{\mathrm{KL}}(p \| q)$ over $q$ is a convex optimization problem with a unique minimum at $q = p$.
3.5 Chain Rule for KL Divergence
Theorem (Chain Rule). For joint distributions $p(x, y)$ and $q(x, y)$:

$$D_{\mathrm{KL}}\big(p(x, y) \,\|\, q(x, y)\big) = D_{\mathrm{KL}}\big(p(x) \,\|\, q(x)\big) + \mathbb{E}_{p(x)}\Big[D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, q(y \mid x)\big)\Big]$$

where $p(x)$ is the marginal of $p$ over $x$, and $p(y \mid x)$, $q(y \mid x)$ are the conditional distributions.
Proof.

$$D_{\mathrm{KL}}\big(p(x, y) \,\|\, q(x, y)\big) = \sum_{x, y} p(x, y)\log\frac{p(x, y)}{q(x, y)}$$

Using $p(x, y) = p(x)\,p(y \mid x)$ and $q(x, y) = q(x)\,q(y \mid x)$:

$$= \sum_{x, y} p(x, y)\left[\log\frac{p(x)}{q(x)} + \log\frac{p(y \mid x)}{q(y \mid x)}\right] = D_{\mathrm{KL}}\big(p(x) \,\|\, q(x)\big) + \sum_x p(x)\sum_y p(y \mid x)\log\frac{p(y \mid x)}{q(y \mid x)}$$
For AI: In a hierarchical generative model $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{z})\,p(\mathbf{x} \mid \mathbf{z})$, the chain rule decomposes the KL between the variational distribution and true posterior into a prior KL plus an expected conditional KL. This is exactly the ELBO decomposition:

$$\log p(\mathbf{x}) - \mathrm{ELBO}(\mathbf{x}) = D_{\mathrm{KL}}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\big)$$

Rearranging: $\mathrm{ELBO}(\mathbf{x}) = \log p(\mathbf{x}) - D_{\mathrm{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z} \mid \mathbf{x}))$.
3.6 Data Processing Inequality
Theorem (Data Processing Inequality for KL). Let $T = f(X)$ for any (possibly randomized) function $f$. Let $p_T$ and $q_T$ be the distributions of $T$ under $p$ and $q$ respectively. Then:

$$D_{\mathrm{KL}}(p_T \,\|\, q_T) \;\le\; D_{\mathrm{KL}}(p_X \,\|\, q_X)$$

Proof sketch. Apply the chain rule to the joint $(X, T)$ in two ways. Decomposing via $X$ first: the same kernel generates $T$ from $X$ under both $p$ and $q$, so the conditional KL term vanishes and $D_{\mathrm{KL}}(p_{X,T} \| q_{X,T}) = D_{\mathrm{KL}}(p_X \| q_X)$. Decomposing via $T$ first: $D_{\mathrm{KL}}(p_{X,T} \| q_{X,T}) = D_{\mathrm{KL}}(p_T \| q_T) + \mathbb{E}_{p_T}[D_{\mathrm{KL}}(p_{X \mid T} \| q_{X \mid T})]$. Since the second term is $\ge 0$, $D_{\mathrm{KL}}(p_T \| q_T) \le D_{\mathrm{KL}}(p_X \| q_X)$.
Intuition: Processing data can only destroy information, not create it. Two distributions that are $\varepsilon$ nats apart look at most $\varepsilon$ nats apart after transformation. If you run a neural network encoder on samples from two distributions, the KL in embedding space is at most the KL in input space.
For AI: This is why representation learning cannot increase discriminability between classes beyond what is in the raw data. It also explains why the KL penalty in RLHF acts on the token distribution: any post-processing of the output tokens (sampling, filtering) can only reduce the KL from the reference policy, not increase it.
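A minimal numerical illustration of the data processing inequality, assuming a deterministic coarse-graining that merges four symbols into two (the distributions are chosen arbitrarily):

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Distributions over 4 symbols.
p = np.array([0.5, 0.2, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

# Deterministic processing T(x): merge symbols {0,1} -> a, {2,3} -> b.
# Each column of A sums to 1, so A maps distributions to distributions.
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
pT, qT = A @ p, A @ q

print(kl(p, q))    # KL before processing, ~0.166
print(kl(pT, qT))  # KL after processing,  ~0.082
assert kl(pT, qT) <= kl(p, q)
```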
4. Forward KL vs Reverse KL
This section develops the most practically important distinction in applied KL theory: which direction to minimize, and what kind of approximation each direction produces.
4.1 Forward KL: Mean-Seeking Behavior
Forward KL is $D_{\mathrm{KL}}(p \| q)$ where $p$ is the target distribution (the truth) and $q$ is the approximating distribution we optimize. We minimize over $q$ given fixed $p$.
The zero-avoiding (mass-covering) property: Because the expectation is under $p$, regions where $p(x) > 0$ but $q(x) = 0$ contribute $+\infty$ to $D_{\mathrm{KL}}(p \| q)$. Therefore, any minimizer must satisfy $q(x) > 0$ wherever $p(x) > 0$ - the approximation must cover all the mass of the target. For this reason, forward KL is also called inclusive KL or zero-avoiding KL.
Fitting a Gaussian to a bimodal distribution. Let $p = \frac12\mathcal{N}(-3, 1) + \frac12\mathcal{N}(3, 1)$ and $q = \mathcal{N}(\mu, \sigma^2)$. Minimizing $D_{\mathrm{KL}}(p \| q)$ over $\mu$ and $\sigma^2$ yields moment matching:

$$\mu^* = \mathbb{E}_p[X] = 0, \qquad \sigma^{2*} = \mathrm{Var}_p(X) = 1 + 9 = 10$$

So $q^* = \mathcal{N}(0, 10)$ - a wide Gaussian centered between the modes, placing substantial mass in the valley between them where $p$ is near zero. This is the mean-seeking behavior: $q^*$ matches the mean and variance of $p$ (moment matching).
General result. For any exponential family $q$ with sufficient statistics $T(x)$, minimizing forward KL performs moment matching: $\mathbb{E}_{q^*}[T(x)] = \mathbb{E}_p[T(x)]$. The approximation matches the moments of $p$, which averages across modes.
For AI: Maximum likelihood estimation minimizes forward KL. This is why MLE on multimodal data can produce "blurry" generative models that average over modes - GANs were introduced partly to address this by using a minimax objective related to Jensen-Shannon divergence (Section 7.4) instead of direct MLE.
4.2 Reverse KL: Mode-Seeking Behavior
Reverse KL is $D_{\mathrm{KL}}(q \| p)$ where $q$ is the approximation we optimize and $p$ is fixed. Now the expectation is under $q$ - we penalize regions where $q(x) > 0$ but $p(x) \approx 0$.
The zero-forcing (mass-concentrating) property: Regions where $q(x) > 0$ but $p(x) = 0$ contribute $+\infty$ to $D_{\mathrm{KL}}(q \| p)$. Therefore $q$ must avoid regions where $p$ is zero - it must concentrate its mass on high-probability regions of $p$. This is called exclusive KL or zero-forcing KL.
Fitting a Gaussian to a bimodal distribution (reverse direction). Now minimizing $D_{\mathrm{KL}}(q \| p)$ over $\mu$ and $\sigma^2$:
This has two local minima: $q^* \approx \mathcal{N}(-3, 1)$ or $q^* \approx \mathcal{N}(3, 1)$ - the approximation collapses onto one mode of $p$. The wide, mean-covering solution from forward KL is a poor solution for reverse KL: it places mass where $p$ is near zero (in the valley), incurring large penalties.
General result. Reverse KL drives the approximation onto a mode of $p$. The gradient of reverse KL points toward locally high-probability regions of $p$.
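The sketch below contrasts the two fits on this bimodal example. It assumes SciPy is available and estimates reverse KL by Monte Carlo (the helper names and grid are our choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def p_pdf(x):
    """Bimodal target: 0.5*N(-3,1) + 0.5*N(3,1)."""
    return 0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1)

# Forward KL minimizer = moment matching: mu = E_p[X] = 0, var = 1 + 9 = 10.
print("forward-KL fit: mu = 0, sigma^2 = 10")

# Reverse KL D(q||p) = E_q[log q - log p], estimated from samples of q.
def reverse_kl(mu, sigma=1.0, n=100_000):
    z = rng.normal(mu, sigma, size=n)
    return np.mean(norm.logpdf(z, mu, sigma) - np.log(p_pdf(z)))

mus = np.linspace(-6, 6, 25)
vals = np.array([reverse_kl(m) for m in mus])
best_neg = mus[mus < 0][np.argmin(vals[mus < 0])]
best_pos = mus[mus > 0][np.argmin(vals[mus > 0])]
print("reverse-KL local minima near mu =", best_neg, "and", best_pos)  # ~ -3 and +3
```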
4.3 Geometric Interpretation
FORWARD KL vs REVERSE KL: THE KEY DIFFERENCE

True distribution p: bimodal, modes at -3 and +3.

FORWARD KL minimizer: q* = N(0, 10)
  (wide, mean-seeking; must cover both modes -> places mass in the valley)

REVERSE KL minimizer: q* = N(-3, 1) OR N(3, 1)
  (mode-seeking: picks one mode, ignores the other)

Intuition: forward KL sees regions where p > 0 as "must cover";
           reverse KL sees regions where p ~ 0 as "must avoid".
The information-geometric view formalizes this with the concept of projections:
- I-projection (information projection): $q^* = \arg\min_{q \in \mathcal{Q}} D_{\mathrm{KL}}(q \,\|\, p)$ - projects $p$ onto the constraint family $\mathcal{Q}$ using reverse KL. Selects the $q$ that is closest to $p$ in the reverse KL sense.
- M-projection (moment projection): $q^* = \arg\min_{q \in \mathcal{Q}} D_{\mathrm{KL}}(p \,\|\, q)$ - projects $p$ onto $\mathcal{Q}$ using forward KL. Produces moment-matched approximations.
For exponential families: the M-projection always produces a unique solution (the moment-matched distribution). The I-projection may have multiple local minima (one per mode).
4.4 Consequences for Variational Inference
Variational Bayes minimizes $D_{\mathrm{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z} \mid \mathbf{x}))$ - reverse KL. This is computationally convenient (the ELBO is a tractable lower bound) but has important consequences:
Posterior collapse in VAEs. In a Variational Autoencoder, each latent dimension $j$ has approximate posterior $q_\phi(z_j \mid \mathbf{x}) = \mathcal{N}(\mu_j(\mathbf{x}), \sigma_j^2(\mathbf{x}))$ and prior $p(z_j) = \mathcal{N}(0, 1)$. If the decoder is powerful enough to reconstruct $\mathbf{x}$ using only a subset of dimensions, the ELBO optimization drives the unused dimensions' KL term to zero - meaning $q_\phi(z_j \mid \mathbf{x}) = p(z_j)$ for every $\mathbf{x}$ - a constant prior. This posterior collapse means the latent representation ignores the input for those dimensions.
Why reverse KL causes collapse: Reverse KL is minimized (= 0) when $q = p(z)$ exactly. Forward KL would penalize this collapse heavily because $D_{\mathrm{KL}}(p(z \mid \mathbf{x}) \| q)$ would be large if $q$ ignores the data. The zero-forcing property of reverse KL allows collapse; the zero-avoiding property of forward KL would prevent it.
Beta-VAE fix (Higgins et al., 2017). Adding a hyperparameter $\beta$ to weight the KL term:

$$\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - \beta\, D_{\mathrm{KL}}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big)$$

Higher $\beta$ strengthens the KL penalty, encouraging disentangled representations. This is used in modern VAE-based image models and speech representations.
KL annealing. Starting with a KL weight of 0 and gradually increasing it during training prevents early collapse. Used in language model VAEs (Bowman et al., 2016) to avoid the degenerate solution where the encoder is ignored.
5. KL for Specific Distributions
5.1 KL Between Gaussians
The KL divergence between Gaussian distributions has a closed form used in dozens of ML algorithms.
Scalar Gaussians. For $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$:

$$D_{\mathrm{KL}}(p \,\|\, q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

Derivation. The log-ratio is:

$$\log\frac{p(x)}{q(x)} = \log\frac{\sigma_2}{\sigma_1} + \frac{(x - \mu_2)^2}{2\sigma_2^2} - \frac{(x - \mu_1)^2}{2\sigma_1^2}$$

Taking expectation under $p$, using $\mathbb{E}_p[(x - \mu_1)^2] = \sigma_1^2$ and $\mathbb{E}_p[(x - \mu_2)^2] = \sigma_1^2 + (\mu_1 - \mu_2)^2$, gives the closed form above.
Intuition on terms:
- $\log(\sigma_2/\sigma_1)$: penalty for scale mismatch
- $\sigma_1^2 / (2\sigma_2^2)$: penalty for $p$ being wider than $q$
- $(\mu_1 - \mu_2)^2 / (2\sigma_2^2)$: penalty for mean mismatch (a squared Mahalanobis-type distance)
- $-\frac12$: normalization constant (makes the divergence zero when $p = q$)
Multivariate Gaussians. For $p = \mathcal{N}(\boldsymbol\mu_1, \Sigma_1)$ and $q = \mathcal{N}(\boldsymbol\mu_2, \Sigma_2)$:

$$D_{\mathrm{KL}}(p \,\|\, q) = \frac12\left[\mathrm{tr}(\Sigma_2^{-1}\Sigma_1) + (\boldsymbol\mu_2 - \boldsymbol\mu_1)^\top\Sigma_2^{-1}(\boldsymbol\mu_2 - \boldsymbol\mu_1) - d + \log\frac{\det\Sigma_2}{\det\Sigma_1}\right]$$

where $d$ is the dimension. The four terms correspond to: trace ratio (variance mismatch), Mahalanobis distance (mean mismatch), dimension offset, log-determinant ratio (volume mismatch).
VAE application. In a standard VAE, the encoder outputs a diagonal Gaussian $q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\boldsymbol\mu, \mathrm{diag}(\boldsymbol\sigma^2))$ and the prior is $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, I)$. The KL term simplifies because $\Sigma_2 = I$:

$$D_{\mathrm{KL}}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big) = \frac12\sum_{j=1}^{d}\big(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1\big)$$

This is differentiable in $\boldsymbol\mu$ and $\boldsymbol\sigma$, enabling direct gradient descent. The reparameterization trick $\mathbf{z} = \boldsymbol\mu + \boldsymbol\sigma \odot \boldsymbol\epsilon$, $\boldsymbol\epsilon \sim \mathcal{N}(\mathbf{0}, I)$, makes the sampling step differentiable.
5.2 KL Between Categorical Distributions
For discrete distributions $p = (p_1, \dots, p_K)$ and $q = (q_1, \dots, q_K)$:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{k=1}^K p_k \log\frac{p_k}{q_k}$$

This equals $H(p, q) - H(p)$ where $H(p, q) = -\sum_k p_k\log q_k$ is the cross-entropy. In particular, when $p$ is a one-hot vector (empirical distribution on a single label $y$), $H(p) = 0$ and:

$$D_{\mathrm{KL}}(p \,\|\, q) = -\log q_y$$

Temperature scaling. In knowledge distillation, the teacher distribution is softened using temperature $T$:

$$p_k^{(T)} = \frac{\exp(z_k / T)}{\sum_j \exp(z_j / T)}$$

The student is trained to minimize $D_{\mathrm{KL}}(p_{\text{teacher}}^{(T)} \| p_{\text{student}}^{(T)})$ rather than the one-hot cross-entropy. Soft targets carry more information (entropy increases with $T$), allowing the student to learn the teacher's "dark knowledge" about relative class similarities.
5.3 KL Within the Exponential Family
Members of the exponential family have a remarkably elegant KL formula. An exponential family distribution has density:

$$p_\eta(x) = h(x)\exp\big(\eta^\top T(x) - A(\eta)\big)$$

where $\eta$ are natural parameters, $T(x)$ are sufficient statistics, and $A(\eta)$ is the log-partition function (which is convex in $\eta$).
KL between exponential family members:

$$D_{\mathrm{KL}}(p_{\eta_1} \,\|\, p_{\eta_2}) = A(\eta_2) - A(\eta_1) - \nabla A(\eta_1)^\top(\eta_2 - \eta_1)$$

This is the Bregman divergence generated by the convex function $A$: $D_{\mathrm{KL}}(p_{\eta_1} \| p_{\eta_2}) = B_A(\eta_2, \eta_1)$. It is the error of the first-order Taylor approximation of $A$ around $\eta_1$, evaluated at $\eta_2$ - which is always $\ge 0$ by convexity of $A$, recovering Gibbs' inequality.
Examples:
- Gaussian (fixed variance $\sigma^2$, $\eta = \mu/\sigma^2$): KL reduces to the quadratic $\frac{(\mu_1 - \mu_2)^2}{2\sigma^2}$
- Bernoulli ($\eta = \log\frac{p}{1-p}$, $A(\eta) = \log(1 + e^\eta)$, the logistic link): Bregman divergence = binary KL
- Poisson ($\eta = \log\lambda$, $A(\eta) = e^\eta = \lambda$): $D_{\mathrm{KL}} = \lambda_1\log\frac{\lambda_1}{\lambda_2} - \lambda_1 + \lambda_2$
5.4 VAE Closed-Form KL and the Reparameterization Trick
The VAE's KL term per latent dimension is:

$$D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\big) = \frac12\big(\mu^2 + \sigma^2 - \log\sigma^2 - 1\big)$$

Derivation. Using the scalar Gaussian formula with $\mu_2 = 0$, $\sigma_2 = 1$:

$$\log\frac{1}{\sigma} + \frac{\sigma^2 + \mu^2}{2} - \frac12 = \frac12\big(\mu^2 + \sigma^2 - \log\sigma^2 - 1\big)$$

Properties: This quantity is 0 when $\mu = 0, \sigma = 1$ (the posterior equals the prior); strictly positive otherwise. Gradient w.r.t. $\mu$: $\mu$. Gradient w.r.t. $\sigma^2$: $\frac12(1 - 1/\sigma^2)$. Both are simple, enabling efficient backpropagation through the KL term.
Reparameterization trick. The challenge: we need $\nabla_\phi\,\mathbb{E}_{q_\phi}[\cdot]$, but the sampling distribution depends on $\phi$, so we cannot backpropagate through the sampling. The trick: write $\mathbf{z} = \mu_\phi(\mathbf{x}) + \sigma_\phi(\mathbf{x}) \odot \boldsymbol\epsilon$ where $\boldsymbol\epsilon \sim \mathcal{N}(\mathbf{0}, I)$ is a separate random variable independent of $\phi$. Now $\nabla_\phi \mathbf{z}$ is well-defined, and gradients flow through. This is the key insight of Kingma & Welling (2014).
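A minimal NumPy sketch of the pathwise (reparameterization) gradient estimator, using a test function $f(z) = z^2$ whose gradient we know analytically (the setup is our own, not from the VAE paper):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 2.0, 1_000_000

# Objective: E_{z ~ N(mu, sigma^2)}[ f(z) ] with f(z) = z^2.
# Analytically E[z^2] = mu^2 + sigma^2, so d/dmu = 2*mu.
eps = rng.standard_normal(n)
z = mu + sigma * eps              # reparameterized sample: z depends on mu explicitly
grad_estimate = np.mean(2 * z)    # pathwise gradient: d f(mu + sigma*eps)/d mu = f'(z)
print(grad_estimate, "vs analytic", 2 * mu)  # both ~2.0
```

Because the randomness lives in `eps`, not in the parameters, the derivative passes through the sample - exactly what an autodiff framework exploits when training a VAE.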
6. Information-Theoretic Connections
6.1 KL and Cross-Entropy
The relationship $H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$ is fundamental. Expanding both sides:

$$-\sum_x p(x)\log q(x) = -\sum_x p(x)\log p(x) + \sum_x p(x)\log\frac{p(x)}{q(x)}$$

which is immediate from $\log\frac{p}{q} = \log p - \log q$. This decomposition has critical implications:
- Training ML models: Minimizing cross-entropy loss over $\theta$ is identical to minimizing $D_{\mathrm{KL}}(p \| q_\theta)$, since $H(p)$ is constant. The irreducible entropy $H(p)$ is the minimum achievable cross-entropy.
- Perplexity gap: A language model with perplexity $\mathrm{PPL} = e^{H(p, q_\theta)}$ sits $D_{\mathrm{KL}}(p \| q_\theta)$ nats above the theoretical minimum entropy $H(p)$.
- Calibration: A perfectly calibrated model satisfies $D_{\mathrm{KL}}(p \| q_\theta) = 0$, meaning the model's output distribution matches the true conditional distributions exactly.
Preview - Cross-Entropy (Section 09-04)
Cross-entropy is the canonical training loss for classification. It decomposes neatly into the intrinsic uncertainty (irreducible) and the model quality gap (reducible by training).
-> Full treatment: Section 09-04 Cross-Entropy
6.2 KL and Mutual Information
Preview - Mutual Information (Section 09-03)
Mutual information is the KL divergence between the joint distribution and the product of marginals:

$$I(X; Y) = D_{\mathrm{KL}}\big(p_{XY} \,\|\, p_X \otimes p_Y\big)$$

It measures statistical dependence: how much knowing $X$ reduces uncertainty about $Y$. This connection shows that $I(X; Y) \ge 0$ (Gibbs' inequality) and $I(X; Y) = 0$ iff $X$ and $Y$ are independent.
-> Full treatment: Section 09-03 Mutual Information
The MI-as-KL connection is used directly in contrastive learning (InfoNCE loss, SimCLR, CLIP): maximizing a lower bound on $I(X; Z)$ between the input and its representation. It also appears in the information bottleneck (Tishby et al., 2000), which frames representation learning as minimizing $I(X; Z)$ subject to maximizing $I(Z; Y)$ - two KL divergences in tension.
6.3 KL and Fisher Information
Preview - Fisher Information (Section 09-05)
For a parametric family $\{p_\theta\}$, the KL divergence between nearby members is approximately:

$$D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta + d\theta}) \approx \frac12\, d\theta^\top F(\theta)\, d\theta$$

where $F(\theta)$ is the Fisher information matrix. This local quadratic approximation to KL is what makes natural gradient descent (which preconditions gradient steps by $F^{-1}$) the "geometrically correct" optimizer on the statistical manifold.
-> Full treatment: Section 09-05 Fisher Information
Connection to RLHF natural policy gradient: The KL trust region $D_{\mathrm{KL}}(\pi_\theta \| \pi_{\theta_{\text{old}}}) \le \delta$ defines a neighborhood in policy space. Near the reference policy, this KL ball is approximated by $\frac12(\theta - \theta_{\text{old}})^\top F(\theta_{\text{old}})(\theta - \theta_{\text{old}}) \le \delta$ where $F$ is the Fisher information of $\pi_{\theta_{\text{old}}}$. Natural policy gradient algorithms (TRPO, PPO) use this approximation.
6.4 KL and Entropy
KL divergence provides an elegant proof that entropy is maximized at the uniform distribution. For any distribution $p$ on an alphabet of size $n$ and uniform distribution $u$:

$$D_{\mathrm{KL}}(p \,\|\, u) = \log n - H(p)$$

Since $D_{\mathrm{KL}}(p \| u) \ge 0$: $H(p) \le \log n$, with equality iff $p = u$.
More generally, for any constraint set $\mathcal{C}$ of distributions satisfying given moment constraints, the maximum-entropy distribution equals $\arg\min_{p \in \mathcal{C}} D_{\mathrm{KL}}(p \| u)$ - the KL projection of the uniform distribution onto $\mathcal{C}$. This unifies the maximum entropy principle with KL geometry.
7. f-Divergences and Generalizations
7.1 Csiszar f-Divergences
KL divergence is one member of a rich family of divergences introduced by Csiszar (1967).
Definition. Let $f : (0, \infty) \to \mathbb{R}$ be a convex function with $f(1) = 0$. The f-divergence of $p$ from $q$ is:

$$D_f(p \,\|\, q) = \sum_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)$$

The condition $f(1) = 0$ ensures $D_f(p \| p) = 0$. Convexity of $f$ and Jensen's inequality give $D_f(p \| q) \ge f\big(\sum_x q(x)\frac{p(x)}{q(x)}\big) = f(1) = 0$.
KL divergence as f-divergence. Taking $f(t) = t\log t$ (convex, $f(1) = 0$):

$$D_f(p \,\|\, q) = \sum_x q(x)\,\frac{p(x)}{q(x)}\log\frac{p(x)}{q(x)} = D_{\mathrm{KL}}(p \,\|\, q)$$
Standard f-divergences:
| Name | $f(t)$ | Closed form | Properties |
|---|---|---|---|
| KL divergence | $t\log t$ | $\sum_x p(x)\log\frac{p(x)}{q(x)}$ | Asymmetric; unbounded |
| Reverse KL | $-\log t$ | $\sum_x q(x)\log\frac{q(x)}{p(x)}$ | Asymmetric; unbounded |
| Squared Hellinger | $(\sqrt{t} - 1)^2$ | $\sum_x (\sqrt{p(x)} - \sqrt{q(x)})^2$ | Symmetric; bounded |
| Total variation | $\frac12\lvert t - 1\rvert$ | $\frac12\sum_x \lvert p(x) - q(x)\rvert$ | Symmetric; a genuine metric; bounded by 1 |
| Chi-squared | $(t - 1)^2$ | $\sum_x \frac{(p(x) - q(x))^2}{q(x)}$ | Asymmetric; upper-bounds KL |
| $\alpha$-divergence | one-parameter family | limits give KL and reverse KL | Parameterizes all of the above |
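A small NumPy sketch computing the divergences in the table (plus JSD from Section 7.4) on one pair of distributions, with a check of Pinsker's inequality from Section 7.2 (all helper names are ours):

```python
import numpy as np

def kl(p, q):            return float(np.sum(p * np.log(p / q)))
def reverse_kl(p, q):    return kl(q, p)
def tv(p, q):            return 0.5 * float(np.abs(p - q).sum())
def hellinger_sq(p, q):  return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
def chi2(p, q):          return float(np.sum((p - q) ** 2 / q))
def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
for name, d in [("KL", kl), ("reverse KL", reverse_kl), ("TV", tv),
                ("Hellinger^2", hellinger_sq), ("chi^2", chi2), ("JSD", jsd)]:
    print(f"{name:12s} {d(p, q):.4f}")

# Pinsker's inequality: TV <= sqrt(KL / 2).
assert tv(p, q) <= np.sqrt(kl(p, q) / 2)
```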
For AI - GAN losses. Generative Adversarial Networks (Goodfellow et al., 2014) use different f-divergences as training objectives: the original GAN uses Jensen-Shannon; f-GAN (Nowozin et al., 2016) generalizes to arbitrary f-divergences. This provides a principled family of training objectives for generative models.
7.2 Properties of f-Divergences
All f-divergences share four key properties (when $f$ is convex with $f(1) = 0$):
- Non-negativity: $D_f(p \| q) \ge 0$, with equality iff $p = q$ (when $f$ is strictly convex at 1).
- Data processing inequality: $D_f(p_T \| q_T) \le D_f(p \| q)$ for any stochastic kernel $T$.
- No triangle inequality in general.
- Convexity: $D_f(p \| q)$ is jointly convex in $(p, q)$ when $f$ is convex.
The data processing inequality is a theorem for all f-divergences: processing can only reduce distinguishability between distributions.
Total variation as special case. The total variation distance $\mathrm{TV}(p, q) = \frac12\sum_x |p(x) - q(x)|$ satisfies the triangle inequality and is a genuine metric. It bounds other f-divergences: Pinsker's inequality relates KL and TV:

$$\mathrm{TV}(p, q) \;\le\; \sqrt{\tfrac12 D_{\mathrm{KL}}(p \,\|\, q)}$$

This is the most important inequality connecting KL divergence to total variation. Pinsker's inequality is used to convert KL bounds (from optimization) into total variation bounds (which control probabilities of events).
7.3 Renyi Divergence
For $\alpha \in (0, 1) \cup (1, \infty)$, the order-$\alpha$ Renyi divergence is:

$$D_\alpha(p \,\|\, q) = \frac{1}{\alpha - 1}\log\sum_x p(x)^\alpha\, q(x)^{1 - \alpha}$$

Limiting cases:
- $\alpha \to 1$: $D_\alpha \to D_{\mathrm{KL}}(p \| q)$ (L'Hopital's rule)
- $\alpha \to 0$: $D_0 = -\log q\big(\{x : p(x) > 0\}\big)$ (support-based)
- $\alpha \to \infty$: $D_\infty = \log\max_x \frac{p(x)}{q(x)}$ (max-ratio)
- $\alpha = \frac12$: $D_{1/2} = -2\log\sum_x \sqrt{p(x)q(x)}$, related to the Bhattacharyya distance
Properties: $D_\alpha \ge 0$; $D_\alpha(p \| p) = 0$; $D_\alpha$ is monotone non-decreasing in $\alpha$; satisfies the data processing inequality; not symmetric.
For AI - Differential privacy. Renyi differential privacy (Mironov, 2017) quantifies privacy loss using Renyi divergence: a mechanism $\mathcal{M}$ is $(\alpha, \varepsilon)$-RDP if $D_\alpha(\mathcal{M}(D) \| \mathcal{M}(D')) \le \varepsilon$ for any adjacent datasets $D, D'$. Renyi DP composes additively: an $(\alpha, \varepsilon_1)$-RDP mechanism followed by an $(\alpha, \varepsilon_2)$-RDP mechanism is $(\alpha, \varepsilon_1 + \varepsilon_2)$-RDP. This makes it the tool of choice for privacy accounting in large-scale training with DP-SGD (used in Google's federated learning and some LLM fine-tuning pipelines).
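A short sketch of the Renyi divergence and its monotonicity in $\alpha$ (distributions chosen arbitrarily; the `renyi` helper is ours):

```python
import numpy as np

def renyi(p, q, alpha):
    """Order-alpha Renyi divergence (alpha != 1), in nats."""
    return float(np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1))

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

for a in [0.25, 0.5, 0.9, 0.999, 2.0, 10.0]:
    print(f"alpha={a:6.3f}  D_alpha = {renyi(p, q, a):.4f}")   # non-decreasing in alpha
print(f"KL (alpha -> 1 limit)  = {kl(p, q):.4f}")              # D_0.999 is close
print(f"log max p/q (alpha -> inf limit) = {np.log((p / q).max()):.4f}")
```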
7.4 Jensen-Shannon Divergence
The Jensen-Shannon Divergence (JSD) is the symmetrized, bounded variant of KL:

$$\mathrm{JSD}(p, q) = \tfrac12 D_{\mathrm{KL}}(p \,\|\, m) + \tfrac12 D_{\mathrm{KL}}(q \,\|\, m), \qquad m = \tfrac12(p + q)$$

where $m$ is the mixture distribution.
Properties:
- $\mathrm{JSD}(p, q) = \mathrm{JSD}(q, p)$ - symmetric
- $0 \le \mathrm{JSD}(p, q) \le \log 2$ - bounded (in nats)
- $\sqrt{\mathrm{JSD}}$ is a metric (satisfies triangle inequality)
- $\mathrm{JSD}(p, q) = H(m) - \tfrac12 H(p) - \tfrac12 H(q)$ - entropy of mixture minus average entropy
For AI - Original GAN objective. The original GAN (Goodfellow et al., 2014) with an optimal discriminator minimizes the Jensen-Shannon divergence between the generator distribution $p_G$ and the data distribution $p_{\text{data}}$:

$$\min_G V(D^*, G) = 2\,\mathrm{JSD}(p_{\text{data}}, p_G) - \log 4$$

The optimal discriminator is $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}$. However, when $p_{\text{data}}$ and $p_G$ have disjoint supports, $\mathrm{JSD}(p_{\text{data}}, p_G) = \log 2$ (maximum) everywhere - the discriminator saturates and gradients vanish. This vanishing-gradient problem (together with mode collapse) motivated WGAN (Arjovsky et al., 2017), which uses the Wasserstein distance (which doesn't require absolute continuity).
8. Information Geometry of KL
8.1 KL as a Bregman Divergence
A Bregman divergence generated by a strictly convex function $\phi$ is:

$$B_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y),\, x - y \rangle$$

This is the gap between $\phi$ at $x$ and its first-order Taylor approximation around $y$ - always $\ge 0$ by convexity.
KL is a Bregman divergence. Taking $\phi(p) = \sum_x p(x)\log p(x)$ (the negative entropy, which is strictly convex on the simplex):

$$B_\phi(p, q) = D_{\mathrm{KL}}(p \,\|\, q)$$

This Bregman representation reveals why KL is asymmetric: Bregman divergences are generally not symmetric because the Taylor approximation is computed at $y$, not at $x$.
Exponential family connection. For an exponential family with log-partition function $A(\eta)$, the Bregman divergence $B_A(\eta_2, \eta_1)$ equals $D_{\mathrm{KL}}(p_{\eta_1} \| p_{\eta_2})$ (note the reversed argument order). This is the source of the elegant formula in Section 5.3.
8.2 I-Projection and M-Projection
Definition. Let $E$ be a constraint set (e.g., a parametric family of distributions):
- I-projection (information projection): $q^* = \arg\min_{q \in E} D_{\mathrm{KL}}(q \,\|\, p)$ - the distribution in $E$ closest to $p$ in reverse KL
- M-projection (moment projection): $q^* = \arg\min_{q \in E} D_{\mathrm{KL}}(p \,\|\, q)$ - the distribution in $E$ closest to $p$ in forward KL
For the exponential family:
- The M-projection always produces the moment-matched distribution: $\mathbb{E}_{q^*}[T(x)] = \mathbb{E}_p[T(x)]$
- The I-projection onto an exponential family (a convex set in mean-parameter space) produces the distribution whose natural parameters are consistent with those mean constraints
EM algorithm as alternating projections. The EM algorithm alternates:
- E-step: Set $q^{(t)}(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{x}; \theta^{(t-1)})$ - project onto the exact posterior, driving $D_{\mathrm{KL}}(q \| p(\mathbf{z} \mid \mathbf{x}; \theta^{(t-1)}))$ to zero
- M-step: $\theta^{(t)} = \arg\max_\theta \mathbb{E}_{q^{(t)}}[\log p(\mathbf{x}, \mathbf{z}; \theta)]$ - the M-projection of $q^{(t)}$ back onto the parametric family
Each step decreases the relevant KL divergence (or equivalently increases a lower bound on the log-likelihood), guaranteeing convergence to a stationary point.
8.3 The Pythagorean Theorem for KL
Theorem (Pythagorean theorem for I-projections). Let $E$ be an exponential family (a convex set in mean parameter space) and let $q^* = \arg\min_{q \in E} D_{\mathrm{KL}}(q \| p)$ be the I-projection of $p$ onto $E$. Then for any $r \in E$:

$$D_{\mathrm{KL}}(r \,\|\, p) = D_{\mathrm{KL}}(r \,\|\, q^*) + D_{\mathrm{KL}}(q^* \,\|\, p)$$

This is an exact equality - there is no approximation, unlike the geometric Pythagorean theorem.
Interpretation. The KL distance from any $r$ in the constraint set to the target $p$ decomposes into: (1) the distance from $r$ to the projection $q^*$, plus (2) the distance from the projection $q^*$ to $p$. The "closest" point $q^*$ acts like the foot of a perpendicular. This is called "Pythagorean" because the geometry is orthogonal in the KL sense.
PYTHAGOREAN THEOREM FOR KL

        p   (target distribution, outside set E)
        |
        |   D_KL(q* || p)
        |
   q* ----- r        (q*, r in constraint set E)
    D_KL(r || q*)

D_KL(r || p) = D_KL(r || q*) + D_KL(q* || p)
(exact for I-projections onto exponential families)
For AI: The variational inference ELBO maximization is equivalent to minimizing $D_{\mathrm{KL}}(q \,\|\, p(\mathbf{z} \mid \mathbf{x}))$ over the variational family - an I-projection. The Pythagorean theorem shows that the ELBO gap is exactly $D_{\mathrm{KL}}(q^* \,\|\, p(\mathbf{z} \mid \mathbf{x}))$ at the optimum - a clean statement of how much information is lost by restricting to the variational family.
8.4 Statistical Manifolds and Natural Gradient
The space of probability distributions over $\mathcal{X}$ forms a statistical manifold with local coordinates given by the natural parameters $\eta$ (for exponential families). The KL divergence induces a Riemannian metric on this manifold called the Fisher information metric:

$$D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta + d\theta}) \approx \tfrac12\, d\theta^\top F(\theta)\, d\theta$$

where $F(\theta)$ is the Fisher information matrix. The Riemannian structure makes the "natural" step size in parameter space match the actual KL distance between distributions.
Natural gradient. In Euclidean parameter space, gradient descent makes steps proportional to the Euclidean metric on $\theta$-space, which does not correspond to any natural metric on distribution space. The natural gradient corrects for this:

$$\tilde\nabla_\theta L = F(\theta)^{-1}\,\nabla_\theta L$$

Natural gradient descent converges much faster in problems where the Fisher information is highly non-uniform (i.e., where the Euclidean metric on parameters is poorly calibrated to KL distance on distributions). This is the foundation of K-FAC (Martens & Grosse, 2015), used in second-order optimization of neural networks.
-> Full treatment of Fisher information and natural gradient: Section 09-05 Fisher Information
9. Applications in Machine Learning
9.1 Maximum Likelihood = Minimizing KL
Theorem. For a parametric model $\{p_\theta\}$ and empirical data distribution $\hat p_{\text{data}}$:

$$\hat\theta_{\mathrm{MLE}} = \arg\max_\theta \frac{1}{N}\sum_{i=1}^N \log p_\theta(x_i) = \arg\min_\theta D_{\mathrm{KL}}(\hat p_{\text{data}} \,\|\, p_\theta)$$

Proof. Using $D_{\mathrm{KL}}(\hat p_{\text{data}} \| p_\theta) = -H(\hat p_{\text{data}}) - \mathbb{E}_{\hat p_{\text{data}}}[\log p_\theta(x)]$, minimizing over $\theta$ eliminates the constant $H(\hat p_{\text{data}})$:

$$\arg\min_\theta D_{\mathrm{KL}}(\hat p_{\text{data}} \,\|\, p_\theta) = \arg\max_\theta \mathbb{E}_{\hat p_{\text{data}}}[\log p_\theta(x)] = \arg\max_\theta \frac{1}{N}\sum_{i=1}^N \log p_\theta(x_i)$$

Implications:
- Every LLM trained by next-token prediction (GPT-4, LLaMA-3, Claude, Gemini) minimizes $D_{\mathrm{KL}}(\hat p_{\text{data}} \| p_\theta)$
- The MLE estimate is not "arbitrary" - it has a precise information-theoretic interpretation as the distribution in the model family closest to the data in forward KL
- Coverage: MLE must cover the entire data distribution; if $\hat p_{\text{data}}(x) > 0$ but $p_\theta(x) \approx 0$, the loss is large
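A numerical sketch of the theorem for a categorical model: the KL minimizer over the simplex coincides with the empirical distribution, which is the closed-form MLE (grid search and all names are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.5, 0.3, 0.2])
data = rng.choice(3, size=10_000, p=true_p)

# Empirical data distribution = the closed-form categorical MLE.
p_hat = np.bincount(data, minlength=3) / len(data)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Grid-search the 2-parameter simplex: the minimizer of D_KL(p_hat || q) is p_hat.
best, best_q = np.inf, None
for q1 in np.linspace(0.01, 0.98, 98):
    for q2 in np.linspace(0.01, 0.99 - q1, 50):
        q = np.array([q1, q2, 1 - q1 - q2])
        if kl(p_hat, q) < best:
            best, best_q = kl(p_hat, q), q

print("empirical p_hat:", p_hat)
print("KL-minimizing q:", np.round(best_q, 3))  # approximately equal to p_hat
```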
9.2 Variational Autoencoders
The VAE (Kingma & Welling, 2014) introduces a latent variable $\mathbf{z}$ with prior $p(\mathbf{z})$ and likelihood $p_\theta(\mathbf{x} \mid \mathbf{z})$. The true log-likelihood $\log p_\theta(\mathbf{x})$ is intractable (requires integrating over $\mathbf{z}$). The VAE introduces an encoder $q_\phi(\mathbf{z} \mid \mathbf{x})$ and derives a tractable lower bound via the KL decomposition:

$$\log p_\theta(\mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\big[\log p_\theta(\mathbf{x} \mid \mathbf{z})\big] - D_{\mathrm{KL}}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big)}_{\mathrm{ELBO}} + D_{\mathrm{KL}}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x})\big)$$

The ELBO has two terms:
- Reconstruction term $\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x} \mid \mathbf{z})]$: how well the decoder reconstructs $\mathbf{x}$ from the latent code. Analogous to negative distortion.
- KL regularizer $D_{\mathrm{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))$: how much the approximate posterior deviates from the prior. For Gaussian encoder and standard Gaussian prior: $\frac12\sum_j(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1)$.
Practical training: The ELBO is maximized by alternating gradient steps:
- w.r.t. $\phi$: encoder parameters, using the reparameterization trick
- w.r.t. $\theta$: decoder parameters, via direct backpropagation
VAE variants using KL: beta-VAE ($\beta$ weight on KL for disentanglement), WAE (Wasserstein autoencoder replaces KL with MMD), VQ-VAE (discrete latent, no KL), InfoVAE (adds MI term).
9.3 RLHF, PPO, and DPO
RLHF objective. Reinforcement Learning from Human Feedback (Christiano et al., 2017; InstructGPT, Ouyang et al., 2022) trains a policy $\pi_\theta$ to maximize expected reward while staying close to the reference policy $\pi_{\text{ref}}$ (the pretrained LLM):

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big)$$

where $r(x, y)$ is the reward model score and $\beta$ is the KL coefficient. Without the KL penalty, the policy collapses to reward hacking: generating short, repetitive text that maximizes reward scores but is incoherent or unnatural.
Optimal policy. The RLHF objective has a closed-form optimal solution (for fixed $r$):

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\big(r(x, y)/\beta\big)$$

where $Z(x)$ is the normalizing partition function. The optimal policy exponentially upweights sequences with high reward relative to the reference.
DPO (Direct Preference Optimization, Rafailov et al., 2023). DPO solves the RLHF objective directly from preference data, without training an explicit reward model. The key insight: substituting the optimal policy formula into the reward-learning objective yields a loss that only involves the policy, not the reward model:

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

where $y_w$ is the preferred (winning) response and $y_l$ the dispreferred (losing) response. The log-ratio $\beta\log\frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ is the implicit reward that the KL-constrained policy learns. DPO is used in practice for fine-tuning LLMs on human preferences (Zephyr, Mixtral-Instruct, and many others).
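A minimal sketch of the DPO loss for a single preference pair, assuming the summed token log-probabilities of each response are already available (the function and values are hypothetical, for illustration only):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, given summed token log-probs of the
    winning (w) and losing (l) responses under the policy and the reference."""
    implicit_reward_w = beta * (logp_w - ref_logp_w)
    implicit_reward_l = beta * (logp_l - ref_logp_l)
    margin = implicit_reward_w - implicit_reward_l
    return -np.log(1 / (1 + np.exp(-margin)))   # -log sigmoid(margin)

# Hypothetical log-probabilities: the policy already prefers y_w slightly
# more than the reference does, so the loss falls below log(2) ~ 0.693.
print(dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0))
```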
PPO trust region. PPO (Schulman et al., 2017) approximates the KL constraint with a clipping objective:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat A_t,\; \mathrm{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\,\hat A_t\big)\Big]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio. Clipping at $1 \pm \epsilon$ prevents large policy updates, approximating the KL trust region constraint $D_{\mathrm{KL}}(\pi_{\theta_{\text{old}}} \| \pi_\theta) \le \delta$.
9.4 Knowledge Distillation
Knowledge distillation (Hinton, Vinyals & Dean, 2015) transfers knowledge from a large teacher model $p_{\text{teacher}}$ to a smaller student model $p_{\text{student}}$. The distillation loss uses forward KL (from teacher to student):

$$\mathcal{L}_{\mathrm{KD}} = T^2\, D_{\mathrm{KL}}\big(p_{\text{teacher}}^{(T)} \,\|\, p_{\text{student}}^{(T)}\big)$$

where $p_{\text{teacher}}^{(T)} = \mathrm{softmax}(\mathbf{z}_{\text{teacher}}/T)$ is the teacher's softened distribution at temperature $T$ (the $T^2$ factor keeps gradient magnitudes comparable across temperatures).
Why forward KL (teacher -> student)? The student must cover all of the teacher's probability mass, including the "dark knowledge" - the non-zero probabilities assigned to incorrect classes that encode the teacher's generalization patterns. If a teacher assigns 60% to "cat," 30% to "tiger," and 10% to "dog," the soft labels teach the student that cats look more like tigers than dogs. One-hot labels destroy this information.
Forward vs reverse KL choice in distillation. Using reverse KL would allow the student to become mode-seeking - focusing only on the teacher's highest-probability output and ignoring the distributional shape. Forward KL forces the student to match the full distribution, preserving the teacher's calibration.
Modern LLM distillation. DistilBERT (Sanh et al., 2019) distills BERT using forward KL plus cosine similarity on hidden states. Phi-1 (Gunasekar et al., 2023) and Phi-2 (Microsoft, 2023) train small but capable models on synthetic data generated by larger GPT models - a form of teacher supervision. Smaller members of frontier model families are commonly trained with distillation from larger checkpoints. The forward KL direction is standard.
9.5 Normalizing Flows and Diffusion Models
Normalizing flows learn an invertible transformation $f_\theta$ such that $\mathbf{x} = f_\theta(\mathbf{z})$ where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$. The exact change-of-variables formula gives:

$$\log p_\theta(\mathbf{x}) = \log p_z\big(f_\theta^{-1}(\mathbf{x})\big) + \log\left|\det\frac{\partial f_\theta^{-1}(\mathbf{x})}{\partial \mathbf{x}}\right|$$

Training by maximum likelihood minimizes $D_{\mathrm{KL}}(p_{\text{data}} \| p_\theta)$ (forward KL) since the log-likelihood is tractable. This contrasts with GANs, which use a discriminator to avoid computing $p_\theta(\mathbf{x})$ directly.
Diffusion models (DDPM, Ho et al., 2020). The DDPM ELBO is a sum of KL terms:

$$\mathcal{L} = \mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\big(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\big)}_{L_T} + \sum_{t > 1}\underbrace{D_{\mathrm{KL}}\big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big)}_{L_{t-1}} - \underbrace{\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}_{L_0}\Big]$$

Each reverse diffusion step $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ is trained to minimize the KL divergence from the (analytically tractable) true reverse $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$. Because both distributions are Gaussian (by the Markov structure), each per-step KL has a closed-form expression in terms of the noise predictor $\epsilon_\theta$. The simplified training objective $\|\epsilon - \epsilon_\theta(\mathbf{x}_t, t)\|^2$ is equivalent to predicting the noise - which is the standard diffusion training loss.
10. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Writing $D_{\mathrm{KL}}(q \| p)$ when you mean forward KL (truth -> approximation) | The first argument is the reference (truth); forward KL is $D_{\mathrm{KL}}(p \| q)$ | Always identify which distribution is the truth and which is the approximation; check: is the expectation under truth or approx? |
| 2 | Assuming KL is symmetric: $D_{\mathrm{KL}}(p \| q) = D_{\mathrm{KL}}(q \| p)$ | KL is not symmetric; the two directions have fundamentally different behaviors | Never swap $p$ and $q$ without checking both directions; use Jeffreys or JSD if you need symmetry |
| 3 | Ignoring support mismatch: computing $D_{\mathrm{KL}}(p \| q)$ when $q(x) = 0$ for some $x$ with $p(x) > 0$ | $D_{\mathrm{KL}} = +\infty$ when $p \not\ll q$; finite values are meaningless | Add Laplace smoothing or use JSD/TV for distributions with potentially different supports |
| 4 | Treating $D_{\mathrm{KL}} \approx 0$ as equivalent to $p = q$ under numerical approximations | Finite-precision computation gives $D_{\mathrm{KL}} \approx 0$ even when $p \ne q$ due to rounding | Use a small threshold rather than exact zero; check sample moments separately |
| 5 | Using KL as a distance metric satisfying the triangle inequality | $D_{\mathrm{KL}}(p \| r)$ can exceed $D_{\mathrm{KL}}(p \| q) + D_{\mathrm{KL}}(q \| r)$; KL is not a metric | Use total variation or $\sqrt{\mathrm{JSD}}$ if you need a proper metric |
| 6 | Confusing minimizing over $q$ with minimizing over $p$ | MLE minimizes forward KL over the second argument (model parameters); variational inference minimizes reverse KL over the first argument (approximation) | Always identify which argument is the optimization variable |
| 7 | Computing cross-entropy loss and calling it KL divergence | Cross-entropy $H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$; they differ by $H(p)$ | They're equal only when $p$ is one-hot ($H(p) = 0$); for soft labels, $H(p, q) \ne D_{\mathrm{KL}}(p \| q)$ |
| 8 | Expecting the VAE KL term to be zero at optimum | The KL term equals $D_{\mathrm{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))$; zero only if posterior = prior (posterior collapse) | A healthy VAE has a positive KL term; monitor the per-dimension KL for collapse |
| 9 | Using forward KL for variational inference | Forward KL requires samples from the intractable posterior $p(\mathbf{z} \mid \mathbf{x})$; reverse KL only needs samples from $q$ | Use reverse KL (standard VI), or methods such as expectation propagation that approximate the forward direction |
| 10 | Claiming the RLHF KL penalty "forces" the model to stay close to reference | The KL penalty softly penalizes deviation; with large enough reward, the policy can still deviate significantly | The KL penalty is a soft regularizer, not a hard constraint; monitor $D_{\mathrm{KL}}(\pi_\theta \| \pi_{\text{ref}})$ during training |
| 11 | Conflating Renyi divergence with KL divergence | Renyi only equals KL in the limit $\alpha \to 1$; for $\alpha \ne 1$ they have different properties | Use Renyi divergence when differential privacy accounting is needed; use KL for standard information-theoretic arguments |
| 12 | Forgetting the $\ln 2$ factor when converting between nats and bits | $D_{\mathrm{KL}}$ in bits $= D_{\mathrm{KL}}$ in nats $/ \ln 2$ | Always specify units; when comparing with cross-entropy in bits (binary classification), convert to nats or vice versa |
11. Exercises
Exercise 1 (*) - Non-negativity and equality condition. Let $p$ and $q$ be two fully supported distributions on a four-symbol alphabet of your choice.
(a) Compute $D_{\mathrm{KL}}(p \| q)$ and $D_{\mathrm{KL}}(q \| p)$ by hand (use natural logarithms; give exact values and numerical approximations).
(b) Verify that both are non-negative. Which is larger?
(c) Find a distribution $q$ on four symbols such that $D_{\mathrm{KL}}(p \| q) = 0$. Is $q$ unique?
(d) Using only the convexity of $-\log$, write a self-contained proof that $D_{\mathrm{KL}}(p \| q) \ge 0$ for your specific $p$ and $q$.
Exercise 2 (*) - KL between Gaussians. Let $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$.
(a) Starting from the definition, derive the closed form: $D_{\mathrm{KL}}(p \| q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac12$
(b) Compute $D_{\mathrm{KL}}(p \| q)$ and $D_{\mathrm{KL}}(q \| p)$ for a parameter setting of your choice (e.g., $\mu_1 = 0, \sigma_1 = 1, \mu_2 = 1, \sigma_2 = 2$). Are they equal?
(c) Show that when $\sigma_1 = \sigma_2 = \sigma$, the formula reduces to $\frac{(\mu_1 - \mu_2)^2}{2\sigma^2}$ - the squared Mahalanobis distance.
(d) For the VAE case $\mu_2 = 0$, $\sigma_2 = 1$: verify the simplified formula $\frac12(\mu_1^2 + \sigma_1^2 - \log\sigma_1^2 - 1)$ and check that it is 0 when $\mu_1 = 0$, $\sigma_1 = 1$.
Exercise 3 (*) - Forward vs reverse KL. Let $p = \frac12\mathcal{N}(-3, 1) + \frac12\mathcal{N}(3, 1)$ (bimodal) and $q = \mathcal{N}(\mu, \sigma^2)$.
(a) Show that the minimizer of $D_{\mathrm{KL}}(p \| q)$ over $(\mu, \sigma^2)$ is $\mu^* = 0$, $\sigma^{2*} = 10$. (Hint: moment matching.)
(b) Numerically estimate $D_{\mathrm{KL}}(q \| p)$ on a grid of $\mu$ values using 1000 samples. Show the two local minima near $\mu = \pm 3$.
(c) Explain in one paragraph: if you were training a VAE generative model and the true data distribution is $p$, which direction would you minimize, and what would the generator look like?
Exercise 4 (**) - Chain rule decomposition. Let $p(x, y)$ be a joint distribution on $\{0, 1\} \times \{0, 1\}$ of your choice with dependent $X$ and $Y$, and let $q(x, y)$ be uniform on $\{0, 1\} \times \{0, 1\}$ ($\frac14$ each).
(a) Compute $D_{\mathrm{KL}}(p(x, y) \| q(x, y))$ directly.
(b) Compute $D_{\mathrm{KL}}(p(x) \| q(x))$ and $\mathbb{E}_{p(x)}[D_{\mathrm{KL}}(p(y \mid x) \| q(y \mid x))]$ separately.
(c) Verify that (a) equals the sum of the two terms in (b): the chain rule holds.
(d) Interpret: which contributes more to the total KL, the marginal or the conditional term? What does this mean for the structure of $p$ vs $q$?
Exercise 5 (**) - f-Divergence family. Consider f-divergences on Bernoulli distributions $p = \mathrm{Bern}(\frac12)$ and $q = \mathrm{Bern}(t)$ as functions of $t \in (0, 1)$.
(a) Compute and plot (as functions of $t$): KL divergence $D_{\mathrm{KL}}(p \| q)$, reverse KL $D_{\mathrm{KL}}(q \| p)$, total variation $\mathrm{TV}(p, q)$, and Jensen-Shannon $\mathrm{JSD}(p, q)$.
(b) Verify that Pinsker's inequality $\mathrm{TV} \le \sqrt{D_{\mathrm{KL}}/2}$ holds for all $t$.
(c) Which divergence is symmetric? Which is bounded? Which is asymmetric?
(d) For $t \to 0$ (near-deterministic $q$): compute all four divergences. Why does KL blow up while JSD stays bounded?
Exercise 6 (**) - ELBO and KL decomposition. A simple VAE has prior $p(z) = \mathcal{N}(0, 1)$ and likelihood $p(x \mid z) = \mathcal{N}(z, \sigma_x^2)$. The encoder is $q(z \mid x) = \mathcal{N}(\mu(x), s^2)$ for a scalar parameter $s$.
(a) For a fixed observation $x$: compute the ELBO as a function of $\mu(x)$ and $s$. Use the Gaussian KL formula.
(b) Show that the ELBO decomposes as: reconstruction term (plus constants) minus the KL term $D_{\mathrm{KL}}(q(z \mid x) \| p(z))$.
(c) Find the $\mu(x)$ that maximizes the ELBO. What does this optimal encoder do geometrically?
(d) As the likelihood variance $\sigma_x^2 \to \infty$ (the likelihood becomes insensitive to $z$): what happens to the optimal $q$? Does posterior collapse occur?
Exercise 7 (**) - RLHF optimal policy. A language model has reference policy $\pi_{\text{ref}}(y)$ and reward $r(y)$. The RLHF objective is $\max_\pi \mathbb{E}_{y \sim \pi}[r(y)] - \beta\, D_{\mathrm{KL}}(\pi \| \pi_{\text{ref}})$.
(a) Show that the unconstrained optimizer is $\pi^*(y) = \frac{1}{Z}\pi_{\text{ref}}(y)\exp(r(y)/\beta)$ by taking the functional derivative of the objective with respect to $\pi$ (subject to $\sum_y \pi(y) = 1$).
(b) Show that at the optimum, the KL divergence is $D_{\mathrm{KL}}(\pi^* \| \pi_{\text{ref}}) = \mathbb{E}_{\pi^*}[r(y)]/\beta - \log Z$.
(c) Derive the DPO loss: given preference data where $y_w \succ y_l$, show that the Bradley-Terry preference model leads to the DPO loss.
(d) Numerically: for a two-token vocabulary with $\pi_{\text{ref}} = (0.5, 0.5)$, $r = (1, 0)$, and $\beta = 1$: compute $Z$, $\pi^*$, and the resulting $D_{\mathrm{KL}}(\pi^* \| \pi_{\text{ref}})$.
Exercise 8 (**) - Knowledge distillation analysis. A teacher model outputs logits $\mathbf{z}^{(t)}$ for three classes and a student outputs logits $\mathbf{z}^{(s)}$ (pick concrete values, e.g., $\mathbf{z}^{(t)} = (4, 2, 0)$ and $\mathbf{z}^{(s)} = (3, 2.5, 0)$).
(a) Compute $p = \mathrm{softmax}(\mathbf{z}^{(t)})$ and $q = \mathrm{softmax}(\mathbf{z}^{(s)})$.
(b) Compute $D_{\mathrm{KL}}(p \| q)$ (forward KL, what distillation minimizes) and $D_{\mathrm{KL}}(q \| p)$ (reverse KL).
(c) Now compute soft labels at a temperature $T > 1$ of your choice: $p_T = \mathrm{softmax}(\mathbf{z}^{(t)}/T)$ and $q_T = \mathrm{softmax}(\mathbf{z}^{(s)}/T)$. Recompute both KL divergences. How does temperature affect the magnitude of the KL and the information in the soft labels?
(d) What is the total distillation loss $\lambda\, T^2\, D_{\mathrm{KL}}(p_T \| q_T) + (1 - \lambda)\,\mathrm{CE}(y, q)$ for a mixing weight $\lambda$ and true label $y$ of your choice?
(e) Argue why forward KL (teacher to student) is used rather than reverse KL. What kind of "dark knowledge" is preserved?
12. Why This Matters for AI (2026 Perspective)
| Concept | AI Impact |
|---|---|
| Forward KL = MLE | Every language model trained by next-token prediction (GPT-4, LLaMA-3, Claude, Gemini, Mistral) minimizes $D_{\mathrm{KL}}(p_{\text{data}} \| p_\theta)$; cross-entropy loss IS KL divergence up to the constant $H(p_{\text{data}})$ |
| ELBO = KL decomposition | All VAE-based generative models (image, speech, latent diffusion) optimize reconstruction + KL; the KL term determines latent space structure |
| RLHF KL penalty | All instruction-tuned LLMs (InstructGPT, Claude, Gemini) use $\beta\, D_{\mathrm{KL}}(\pi_\theta \| \pi_{\text{ref}})$ to prevent reward hacking; $\beta$ is one of the most important hyperparameters in alignment |
| DPO implicit KL | DPO (used in Zephyr, Mixtral-Instruct, and other chat models) reframes RLHF as direct KL-constrained optimization; the log-ratio $\beta\log(\pi_\theta/\pi_{\text{ref}})$ IS the implicit reward |
| Forward vs reverse KL | Choosing forward KL (MLE, distillation) vs reverse KL (VI, mean-field) is the defining design decision of probabilistic ML; wrong choice -> blurry outputs or posterior collapse |
| Knowledge distillation | LLM compression (DistilGPT, Phi-2, Phi-3) uses forward KL from teacher to student; soft labels at temperature $T$ carry 10-100x more information than one-hot labels |
| Posterior collapse | When the decoder is powerful, the VAE KL term collapses to zero - a major pathology in text VAEs (Bowman et al., 2016); fixed by KL annealing or $\beta$-VAE weighting |
| PPO trust region | PPO's clip objective approximates a KL trust region; the clip threshold $\epsilon$ corresponds to a specific KL budget per update step |
| GAN = JSD minimization | Original GAN (Goodfellow, 2014) minimizes JSD; JSD's saturation when supports are disjoint causes vanishing gradients and mode collapse - motivating WGAN/Wasserstein distance |
| Renyi DP for private training | Differential privacy in LLM fine-tuning (DP-SGD, used at Google) uses Renyi divergence; Renyi DP composes additively, making privacy budgets tractable |
| Diffusion ELBO | DDPM training objective is a sum of per-step KL divergences; the connection to score matching makes the KL formulation equivalent to noise prediction |
| Natural gradient | K-FAC (second-order optimizer for neural networks) approximates the Fisher matrix, which is the local KL curvature; natural gradient descent is the geometrically correct first-order method on the statistical manifold |
| Calibration | $D_{\mathrm{KL}}(p \| q_\theta) = 0$ iff the model's output distribution matches the true conditionals; calibration regularization (temperature scaling, label smoothing) implicitly minimizes KL between predicted and calibrated distributions |
13. Conceptual Bridge
KL divergence sits at the center of the information theory chapter, connecting the individual uncertainty measured by entropy (Section 01) to the distributional relationships measured by mutual information (Section 03), the training objectives formalized as cross-entropy (Section 04), and the local geometry captured by Fisher information (Section 05). It is not an isolated concept - it is the engine that drives all four.
Backward: from Entropy. In Section 01, entropy measured the intrinsic uncertainty of a single distribution. KL divergence extends this to pairs of distributions: $D_{\mathrm{KL}}(p \| q)$ measures the extra uncertainty you incur by confusing $q$ for $p$. The non-negativity of KL proved the fundamental bound $H(p) \le \log n$ that appeared in Section 01 - Gibbs' inequality was already implicitly used there. Every result about optimal codes (Huffman, arithmetic coding) is ultimately about minimizing KL between the empirical distribution and the code-induced distribution.
Forward: to Mutual Information. Mutual information (Section 03) is a KL divergence: $I(X; Y) = D_{\mathrm{KL}}(p_{XY} \| p_X \otimes p_Y)$ measures the dependence between two variables. The chain rule for KL (Section 3.5) underlies the chain rules for entropy and mutual information. The data processing inequality for KL becomes the DPI for mutual information: $I(X; Z) \le I(X; Y)$ whenever $Z$ is determined by $Y$. Understanding KL makes mutual information, and all the contrastive learning objectives built on it (InfoNCE, SimCLR, CLIP), immediately transparent.
Forward: to Cross-Entropy and Fisher Information. Cross-entropy is KL plus a constant: $H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$ (Section 04); Fisher information is KL's local curvature: $D_{\mathrm{KL}}(p_\theta \| p_{\theta + d\theta}) \approx \frac12 d\theta^\top F(\theta)\, d\theta$ (for small perturbations, Section 05). These connections show that all three are the same object at different scales: global (cross-entropy and KL), local (Fisher information).
KL DIVERGENCE IN THE INFORMATION THEORY CHAPTER

Chapter 09 - Information Theory

  Section 01  ENTROPY  H(X)
       |
       |  "extend to pairs of distributions"
       v
  Section 02  KL DIVERGENCE  D_KL(p || q)      <- YOU ARE HERE
       |
       +--> Section 03  MUTUAL INFORMATION   I(X;Y) = D_KL(P_XY || P_X P_Y)
       +--> Section 04  CROSS-ENTROPY        H(p,q) = H(p) + D_KL(p || q)
       +--> Section 05  FISHER INFORMATION   F(theta) = local KL curvature

ML CONNECTIONS:
  D_KL(p_data || p_theta)       -> Language model training (MLE)
  D_KL(q_phi(z|x) || p(z))      -> VAE regularizer (ELBO)
  D_KL(pi_theta || pi_ref)      -> RLHF / PPO / DPO alignment
  D_KL(p_teacher || p_student)  -> Knowledge distillation
  D_KL(q || p_posterior)        -> Variational inference (VI)

CHAPTER POSITION:
  09-01 Entropy
  09-02 KL Divergence           <- current section
  09-03 Mutual Information      (KL between joint and product of marginals)
  09-04 Cross-Entropy           (H(p,q) = H(p) + D_KL(p || q))
  09-05 Fisher Information      (local KL curvature = Riemannian metric)
Looking ahead beyond this chapter: KL divergence appears again in Chapter 12 (Functional Analysis), where it generalizes to the RKHS setting and connects to kernel methods and maximum mean discrepancy (MMD). It appears in Chapter 21 (Statistical Learning Theory) as a key quantity in PAC-Bayes bounds: the generalization gap of a stochastic predictor is bounded by a term scaling as $\sqrt{D_{\mathrm{KL}}(Q \| P)/N}$, where $Q$ is the posterior and $P$ is the prior on parameters. And in Chapter 22 (Causal Inference), KL divergence under interventional vs observational distributions quantifies the "causal effect" of an intervention in terms of distributional shift.
The single formula - first written by Kullback and Leibler in 1951 - underlies more of modern AI than perhaps any other equation in this curriculum.
Appendix A: Detailed Proofs
A.1 Proof of Non-Negativity via Log-Sum Inequality
The log-sum inequality provides an alternative, direct proof of Gibbs' inequality without invoking Jensen's explicitly.
Log-Sum Inequality. For non-negative reals $a_1, \dots, a_n$ and $b_1, \dots, b_n$:

$$\sum_{i=1}^n a_i \log\frac{a_i}{b_i} \;\ge\; \left(\sum_{i=1}^n a_i\right)\log\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}$$

with equality iff all ratios $a_i/b_i$ are equal.
Proof of log-sum inequality. The function $f(t) = t\log t$ is convex (since $f''(t) = 1/t > 0$). By the weighted Jensen inequality with weights $w_i = b_i/B$ where $B = \sum_j b_j$, applied at the points $t_i = a_i/b_i$:

$$\sum_i w_i f(t_i) \ge f\left(\sum_i w_i t_i\right) \quad\Longleftrightarrow\quad \sum_i \frac{a_i}{B}\log\frac{a_i}{b_i} \;\ge\; \frac{\sum_i a_i}{B}\log\frac{\sum_i a_i}{B}$$

Multiplying by $B$ gives the result.
KL non-negativity from log-sum inequality. Apply with $a_i = p(x_i)$ and $b_i = q(x_i)$: since $\sum_i p(x_i) = \sum_i q(x_i) = 1$:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p(x_i)\log\frac{p(x_i)}{q(x_i)} \;\ge\; 1 \cdot \log\frac{1}{1} = 0$$
A.2 Convexity of KL: Detailed Proof
We prove that $D_{\mathrm{KL}}(p \| q)$ is jointly convex using the perspective function argument.
Lemma. The function $g(p, q) = p\log(p/q)$ (with $g(0, q) = 0$ for $q > 0$) is jointly convex on $[0, \infty) \times (0, \infty)$.
Proof. $g(p, q) = p\log p - p\log q$. We need $\nabla^2 g \succeq 0$. The Hessian is:

$$\nabla^2 g = \begin{pmatrix} 1/p & -1/q \\ -1/q & p/q^2 \end{pmatrix}$$

Determinant $= \frac{1}{p}\cdot\frac{p}{q^2} - \frac{1}{q^2} = 0$. Trace $= \frac{1}{p} + \frac{p}{q^2} > 0$. The Hessian is PSD since det $\ge 0$ and trace $> 0$.
Since $D_{\mathrm{KL}}(p \| q) = \sum_x g(p(x), q(x))$ and sums of jointly convex functions are jointly convex, $D_{\mathrm{KL}}$ is jointly convex.
A.3 Pinsker's Inequality
Theorem (Pinsker's Inequality). $\mathrm{TV}(p, q) \le \sqrt{\tfrac12 D_{\mathrm{KL}}(p \,\|\, q)}$.
Proof. We first note the Bretagnolle-Huber inequality, which is weaker for small KL but has a clean proof: $\mathrm{TV}(p, q) \le \sqrt{1 - e^{-D_{\mathrm{KL}}(p \| q)}}$.
For the full Pinsker proof, one approach uses the binary (2-point) inequality: for $a, b \in [0, 1]$,

$$d(a \,\|\, b) := a\log\frac{a}{b} + (1 - a)\log\frac{1 - a}{1 - b} \;\ge\; 2(a - b)^2$$

Applying this to the binary partition $\{A, A^c\}$ with $A = \{x : p(x) \ge q(x)\}$ (which achieves $p(A) - q(A) = \mathrm{TV}(p, q)$) and using the data processing inequality:

$$D_{\mathrm{KL}}(p \,\|\, q) \;\ge\; d\big(p(A) \,\|\, q(A)\big) \;\ge\; 2\big(p(A) - q(A)\big)^2 = 2\,\mathrm{TV}(p, q)^2$$

Rearranging: $\mathrm{TV}(p, q) \le \sqrt{\tfrac12 D_{\mathrm{KL}}(p \| q)}$.
Pinsker's inequality in AI: KL divergence is continuous and differentiable (gradients exist for optimization), while total variation is an event-level bound. Pinsker's lets us translate KL bounds from optimization theory into guarantees about individual event probabilities - useful in PAC-Bayes learning theory.
A.4 KL Divergence for the Gaussian: Step-by-Step
We derive the multivariate Gaussian KL formula in detail.
Let and on . The log-ratio is:
Taking expectation under , using:
gives:
The four terms correspond to:
- : how much the covariances differ (= 0 when )
- : squared Mahalanobis distance between means
- : log ratio of volumes (= 0 when )
Appendix B: KL Divergence Computations and Reference Formulas
B.1 Quick Reference: KL Formulas for Common Distributions
KL DIVERGENCE CLOSED FORMS - QUICK REFERENCE

SCALAR GAUSSIAN:
D_KL(N(mu1, s1^2) || N(mu2, s2^2)) = log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2

VAE ENCODER -> STANDARD NORMAL:
D_KL(N(mu, s^2) || N(0, 1)) = (1/2)(mu^2 + s^2 - log s^2 - 1)

MULTIVARIATE GAUSSIAN:
D_KL(N(mu1, S1) || N(mu2, S2)) = (1/2)[tr(S2^-1 S1) + (mu2 - mu1)^T S2^-1 (mu2 - mu1) - d + log(det S2 / det S1)]

BERNOULLI:
D_KL(Bern(p) || Bern(q)) = p log(p/q) + (1 - p) log((1 - p)/(1 - q))

CATEGORICAL (general discrete):
D_KL(p || q) = sum_k p_k log(p_k / q_k)

POISSON:
D_KL(Poisson(l1) || Poisson(l2)) = l1 log(l1/l2) - l1 + l2

EXPONENTIAL:
D_KL(Exp(l1) || Exp(l2)) = log(l1/l2) + l2/l1 - 1

BETA:
D_KL(Beta(a1, b1) || Beta(a2, b2)) = log[B(a2, b2)/B(a1, b1)]
  + (a1 - a2) psi(a1) + (b1 - b2) psi(b1) + (a2 - a1 + b2 - b1) psi(a1 + b1)
where psi = digamma function, B = beta function
B.2 Worked Example: KL Between Poisson Distributions
A Poisson distribution with rate $\lambda$ has PMF $p(k) = e^{-\lambda}\lambda^k/k!$ for $k = 0, 1, 2, \dots$
Using $\mathbb{E}_{\lambda_1}[k] = \lambda_1$:
$$D_{\mathrm{KL}}\big(\mathrm{Poisson}(\lambda_1)\,\|\,\mathrm{Poisson}(\lambda_2)\big) = \mathbb{E}_{\lambda_1}\Big[k\log\frac{\lambda_1}{\lambda_2} - \lambda_1 + \lambda_2\Big] = \lambda_1\log\frac{\lambda_1}{\lambda_2} - \lambda_1 + \lambda_2.$$
For example, with $\lambda_1 = 3$ and $\lambda_2 = 5$: $3\log(3/5) - 3 + 5 \approx 0.468$ nats.
For AI: Poisson distributions model count data (number of events, word counts in LDA topic models). The KL between Poisson distributions appears in the ELBO for Poisson process models and in topic model variational inference.
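A quick numerical check of the closed form (the rates $\lambda_1 = 3$, $\lambda_2 = 5$ are illustrative), truncating the direct sum where both PMFs have become negligible:

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import rel_entr   # rel_entr(p, q) = p * log(p/q) elementwise

lam1, lam2 = 3.0, 5.0
closed = lam1 * np.log(lam1 / lam2) - lam1 + lam2     # closed form
k = np.arange(0, 100)                                 # tail beyond 100 is negligible
direct = rel_entr(poisson.pmf(k, lam1), poisson.pmf(k, lam2)).sum()
print(closed, direct)                                 # both ~0.4676 nats
```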
B.3 Worked Example: KL for Categorical - Softmax Outputs
Let $z_p$ and $z_q$ be the teacher and student logit vectors.
Step 1: convert teacher logits to probabilities, $p = \operatorname{softmax}(z_p)$.
Step 2: convert student logits to probabilities, $q = \operatorname{softmax}(z_q)$.
Step 3: sum over the vocabulary, $D_{\mathrm{KL}}(p\,\|\,q) = \sum_k p_k\log(p_k/q_k)$.
For AI: This is the KL component when computing the distillation loss between a teacher ($p$) and a student ($q$). The small value (0.051 nats) indicates the student's distribution is reasonably close to the teacher's.
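A minimal sketch of the computation (the teacher and student logits below are hypothetical, not the values behind the 0.051-nat figure above):

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher/student logits over a 5-token vocabulary.
teacher_logits = torch.tensor([2.0, 0.5, 0.1, -0.3, -1.0])
student_logits = torch.tensor([1.6, 0.8, 0.2, -0.1, -0.9])

log_p = F.log_softmax(teacher_logits, dim=-1)    # teacher = p
log_q = F.log_softmax(student_logits, dim=-1)    # student = q
kl = torch.sum(log_p.exp() * (log_p - log_q))    # D_KL(p || q), forward KL
print(kl.item())                                 # small value -> distributions close
```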
B.4 KL Divergence and Hypothesis Testing
The connection between KL divergence and hypothesis testing is both historical (Kullback 1959) and practically important.
Setting: $n$ i.i.d. observations $x_1, \dots, x_n$ from an unknown distribution. Hypothesis $H_0$: data from $p$; hypothesis $H_1$: data from $q$.
Log-likelihood ratio test statistic:
$$T_n = \frac{1}{n}\sum_{i=1}^n \log\frac{p(x_i)}{q(x_i)}.$$
By the law of large numbers, $T_n \to D_{\mathrm{KL}}(p\,\|\,q)$ as $n \to \infty$ when the truth is $p$.
Stein's lemma (type II error rate): For fixed type I error probability $\alpha$, the best achievable type II error probability $\beta_n$ decays as:
$$\lim_{n\to\infty}\frac{1}{n}\log\beta_n = -D_{\mathrm{KL}}(p\,\|\,q).$$
Larger KL divergence -> faster exponential decay of type II error -> easier to distinguish $p$ from $q$ with more data.
Chernoff exponent: The best achievable equal-error-rate exponent (minimizing over both error types simultaneously) is the Chernoff information:
$$C(p, q) = -\min_{0\le\lambda\le1}\log\Big(\sum_x p(x)^\lambda q(x)^{1-\lambda}\Big) = \max_{\alpha\in(0,1)}(1-\alpha)\,D_\alpha(p\,\|\,q),$$
the maximum of $(1-\alpha)$ times the Renyi divergence of order $\alpha$, over orders $\alpha \in (0, 1)$.
For AI: The hypothesis testing perspective explains why $D_{\mathrm{KL}}(\pi_\theta\,\|\,\pi_{\mathrm{ref}})$ in RLHF bounds how detectable the policy shift is: small KL means the fine-tuned model's outputs are statistically hard to distinguish from the reference model's outputs.
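A small simulation makes the law-of-large-numbers statement concrete (the two categorical distributions are arbitrary): the average per-sample log-likelihood ratio under $p$ converges to $D_{\mathrm{KL}}(p\,\|\,q)$, the exponent in Stein's lemma.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = np.array([0.7, 0.2, 0.1]), np.array([0.4, 0.4, 0.2])
d_kl = np.sum(p * np.log(p / q))           # ~0.184 nats
x = rng.choice(3, size=100_000, p=p)       # i.i.d. draws from p
llr = np.log(p[x] / q[x]).mean()           # (1/n) sum_i log p(x_i)/q(x_i)
print(d_kl, llr)                           # nearly equal by the LLN
```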
Appendix C: KL Divergence in Broader ML Contexts
C.1 PAC-Bayes Bounds
PAC-Bayes theory (McAllester 1999; Seeger 2002) gives generalization bounds for stochastic predictors. The key theorem:
PAC-Bayes Theorem (one standard form, following McAllester). Let $P$ be a prior over hypotheses (fixed before seeing data) and $Q$ be any posterior (possibly data-dependent). For any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over a training set of size $n$:
$$L(Q) \;\le\; \hat{L}(Q) + \sqrt{\frac{D_{\mathrm{KL}}(Q\,\|\,P) + \log(n/\delta)}{2(n-1)}},$$
where $L(Q)$ is the true loss and $\hat{L}(Q)$ is the empirical loss.
Interpretation: The generalization gap is bounded by a term of order $\sqrt{D_{\mathrm{KL}}(Q\,\|\,P)/n}$. The posterior $Q$ can be anything - including a distribution concentrated near a specific trained neural network - and the bound holds. This shows that models that stay "close" to a simple prior in KL generalize well. Modern applications: PAC-Bayes bounds for LoRA-fine-tuned LLMs (where $P$ is centered at the pretrained weights and $Q$ at the fine-tuned weights).
C.2 KL in Variational Inference for Bayesian Neural Networks
Bayesian neural networks (BNNs) maintain distributions over weights rather than point estimates. The posterior $p(w \mid \mathcal{D})$ is intractable for large networks. Variational inference minimizes:
$$D_{\mathrm{KL}}\big(q_\phi(w)\,\|\,p(w \mid \mathcal{D})\big).$$
The ELBO is: $\mathcal{L}(\phi) = \mathbb{E}_{q_\phi}[\log p(\mathcal{D} \mid w)] - D_{\mathrm{KL}}(q_\phi(w)\,\|\,p(w))$. The KL term acts as a regularizer: it penalizes variational distributions that deviate from the prior $p(w)$.
Practical implementations: Bayes By Backprop (Blundell et al., 2015) parameterizes $q_\phi$ as a diagonal Gaussian and trains it with the reparameterization trick. Each weight is $w_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma_{ij}^2)$ where $\sigma_{ij} = \log(1 + e^{\rho_{ij}})$. The KL term per weight is the scalar Gaussian formula $\frac{1}{2}\big(\mu_{ij}^2 + \sigma_{ij}^2 - \log\sigma_{ij}^2 - 1\big)$ for a standard normal prior $\mathcal{N}(0, 1)$.
C.3 KL in Contrastive Learning and Representation
Contrastive learning methods like SimCLR (Chen et al., 2020) and MoCo (He et al., 2020) maximize a lower bound on mutual information via the InfoNCE objective:
$$I(x; y) \;\ge\; \log N - \mathcal{L}_{\mathrm{InfoNCE}}.$$
The InfoNCE loss is a lower bound on $I(X; Y)$, which is itself a KL divergence (see Section 09-03). CLIP (Radford et al., 2021) uses a symmetric InfoNCE to align image and text embeddings - which amounts to maximizing the KL divergence between the joint image-text distribution and the product of marginals, i.e., their mutual information.
CPC (Contrastive Predictive Coding, van den Oord et al., 2018): Used in speech (wav2vec 2.0, HuBERT) and images. The representation $c_t$ at time $t$ predicts future observations $x_{t+k}$ via a density ratio estimator. The InfoNCE objective is a lower bound on $I(c_t; x_{t+k})$. Training audio encoders this way produces features that capture the semantic structure of speech - because mutual information (= KL divergence) rewards capturing any statistical dependency.
C.4 KL in Score-Based Generative Models
Score-based models (Song & Ermon, 2019) and their diffusion counterparts learn the score function $\nabla_x \log p(x)$ to generate samples via Langevin dynamics. The connection to KL:
Score matching minimizes:
$$J(\theta) = \frac{1}{2}\,\mathbb{E}_{p}\Big[\big\|\nabla_x \log p(x) - s_\theta(x)\big\|^2\Big],$$
where $s_\theta$ is the learned score. The Fisher divergence between $p$ and $q$ equals:
$$F(p\,\|\,q) = \mathbb{E}_p\Big[\big\|\nabla_x \log p(x) - \nabla_x \log q(x)\big\|^2\Big].$$
Score matching is therefore equivalent to minimizing a gradient-based relative of KL: the Fisher divergence uses the gradient of the log-density ratio, $\nabla_x \log(p/q)$, rather than the ratio itself. This is the information-geometric connection between score matching and KL divergence.
C.5 KL and Temperature in LLM Sampling
LLM inference involves sampling from the softmax distribution at temperature $T$: $p_T(i) \propto \exp(z_i/T)$, with $p = p_1$ the trained distribution. The KL divergence $D_{\mathrm{KL}}(p\,\|\,p_T)$ measures how far temperature scaling moves the sampling distribution:
As $T \to 0$: $p_T \to$ one-hot (deterministic, picks the argmax), and $D_{\mathrm{KL}}(p\,\|\,p_T) \to \infty$, because $p_T$ assigns vanishing mass to every non-argmax token that $p$ supports.
As $T \to \infty$: $p_T \to$ uniform, and $D_{\mathrm{KL}}(p\,\|\,p_T) \to \log|V| - H(p)$ - the KL from uniform.
Practical implications: The "temperature" knob in LLM APIs (temperature=0.7, top-p=0.9) directly controls the KL divergence from the trained distribution. High temperature increases diversity (reduces KL to uniform); low temperature reduces diversity (pushes toward the argmax). The KL from the training distribution is minimized (zero) at $T = 1$.
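A short sketch of this sweep (the four logits are invented): $D_{\mathrm{KL}}(p\,\|\,p_T)$ is zero at $T = 1$ and grows in both directions.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([3.0, 1.5, 0.5, -1.0])         # hypothetical next-token logits
p = F.softmax(logits, dim=-1)                        # trained distribution (T = 1)
for T in [0.1, 0.5, 1.0, 2.0, 10.0]:
    log_pT = F.log_softmax(logits / T, dim=-1)
    kl = torch.sum(p * (torch.log(p) - log_pT))      # D_KL(p || p_T), zero at T = 1
    print(f"T={T:5.1f}  KL={kl.item():.4f} nats")
```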
Appendix D: Numerical Stability and Implementation
D.1 The log-sum-exp Trick for KL Computation
Computing $D_{\mathrm{KL}}(p\,\|\,q)$ with softmax distributions requires care to avoid numerical overflow and $\log 0$.
Naive implementation (unstable):
# WRONG - can overflow for large logits
kl = torch.sum(p * (torch.log(p) - torch.log(q)))
Stable implementation using log-softmax:
# CORRECT - uses log-space throughout
import torch.nn.functional as F
def kl_divergence_stable(logits_p, logits_q):
"""
KL(p || q) where p = softmax(logits_p), q = softmax(logits_q).
Uses log_softmax for numerical stability.
"""
log_p = F.log_softmax(logits_p, dim=-1) # log(softmax) in one step
log_q = F.log_softmax(logits_q, dim=-1)
p = torch.exp(log_p)
return torch.sum(p * (log_p - log_q), dim=-1)
PyTorch also provides torch.nn.functional.kl_div(input, target), which computes $D_{\mathrm{KL}}(p\,\|\,q)$ when input is the pre-computed $\log q$ and target is $p$. Note the unusual argument order: the log of the second distribution comes first.
D.2 Estimating KL from Samples
When only samples are available (not closed-form distributions), KL must be estimated.
Monte Carlo estimator. Given samples $x_1, \dots, x_N \sim p$ and the ability to evaluate both densities $p$ and $q$:
$$\hat{D} = \frac{1}{N}\sum_{i=1}^N \log\frac{p(x_i)}{q(x_i)}.$$
This is unbiased: $\mathbb{E}[\hat{D}] = D_{\mathrm{KL}}(p\,\|\,q)$.
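A minimal sketch for two scalar Gaussians where both densities can be evaluated (the parameters are arbitrary; the true value is $\log 2 + 2/8 - 1/2 \approx 0.443$ nats):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=50_000)                # samples from p = N(0, 1)
log_ratio = norm.logpdf(x, 0, 1) - norm.logpdf(x, 1, 2)   # q = N(1, 2^2)
print(log_ratio.mean())                              # estimates D_KL(p || q) ~ 0.443
```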
k-nearest neighbor estimator (Perez-Cruz 2008). When neither $p$ nor $q$ is known analytically - only samples from both distributions - the kNN estimator uses:
$$\hat{D} = \frac{d}{n}\sum_{i=1}^n \log\frac{s_k(x_i)}{r_k(x_i)} + \log\frac{m}{n-1},$$
where $r_k(x_i)$ is the distance from $x_i$ to its $k$-th nearest neighbor among the other $p$-samples, $s_k(x_i)$ is the distance to its $k$-th nearest neighbor among the $q$-samples, $n$ = number of $p$-samples, $m$ = number of $q$-samples, and $d$ = dimension.
This estimator is used in representation learning evaluation, where you have samples from an encoder's output distribution and want to measure $D_{\mathrm{KL}}$ without a closed form.
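A sketch of this estimator using a KD-tree (following the formula above; the variable names and the default $k = 1$ are our choices):

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_kl_estimate(X, Y, k=1):
    """k-NN KL estimate: X ~ p (n x d), Y ~ q (m x d)."""
    n, d = X.shape
    m = Y.shape[0]
    # r_k: distance from each x_i to its k-th neighbor among the other X's
    # (query k+1 because the nearest point in X to x_i is x_i itself).
    r = cKDTree(X).query(X, k=k + 1)[0][:, -1]
    # s_k: distance from each x_i to its k-th neighbor among the Y's.
    s = cKDTree(Y).query(X, k=k)[0]
    s = s if k == 1 else s[:, -1]
    return (d / n) * np.sum(np.log(s / r)) + np.log(m / (n - 1))

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(5000, 1))     # p = N(0, 1)
Y = rng.normal(1, 2, size=(5000, 1))     # q = N(1, 2^2)
print(knn_kl_estimate(X, Y))             # true value ~0.443 nats
```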
D.3 Symmetric KL Implementation
For applications requiring symmetric divergence without the issues of JSD (which requires computing the mixture), Jeffrey's divergence is simpler:
import numpy as np

def jeffreys_divergence(p_probs, q_probs, eps=1e-10):
    """
    J(p, q) = 0.5 * (KL(p||q) + KL(q||p))
    Symmetric; avoids forming the JSD mixture distribution.
    """
    p = p_probs + eps          # smooth away zeros before taking logs
    q = q_probs + eps
    p = p / p.sum()            # renormalize after smoothing
    q = q / q.sum()
    kl_pq = (p * np.log(p / q)).sum()
    kl_qp = (q * np.log(q / p)).sum()
    return 0.5 * (kl_pq + kl_qp)
D.4 KL Divergence in PyTorch for VAE Training
The standard VAE KL term implementation:
def vae_kl_loss(mu, log_var):
"""
D_KL(N(mu, sigma^2) || N(0, 1)) per-sample, summed over latent dimensions.
Formula: 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)
= 0.5 * sum(mu^2 + exp(log_var) - log_var - 1)
Input: mu, log_var - shape (batch, latent_dim)
Output: KL loss - shape (batch,), or scalar if mean-reduced
"""
kl = 0.5 * torch.sum(mu**2 + torch.exp(log_var) - log_var - 1, dim=-1)
return kl.mean() # average over batch
Note: using log_var (log of the variance) instead of sigma avoids numerical issues with very small variances and ensures $\sigma^2 > 0$ by construction.
RLHF KL penalty implementation:
def rlhf_kl_penalty(logprobs_policy, logprobs_ref, beta=0.1):
"""
D_KL(pi_policy || pi_ref) for a batch of token sequences.
logprobs_policy, logprobs_ref: log probabilities under each model
shape (batch, seq_len)
Returns: mean KL per sequence
"""
# KL = E_pi[log pi - log pi_ref]
kl_per_token = logprobs_policy - logprobs_ref # log ratio
kl_per_seq = kl_per_token.sum(dim=-1) # sum over sequence length
return beta * kl_per_seq.mean()
Appendix E: Variational Inference - A Deep Dive
E.1 The Variational Inference Problem
Bayesian inference requires computing the posterior
$$p(z \mid x) = \frac{p(x \mid z)\,p(z)}{p(x)}, \qquad p(x) = \int p(x \mid z)\,p(z)\,dz.$$
The denominator $p(x)$ is typically intractable - the integral has no closed form for nonlinear likelihoods.
Variational inference approximates the posterior with a tractable family $\mathcal{Q}$ by solving:
$$q^* = \arg\min_{q \in \mathcal{Q}}\ D_{\mathrm{KL}}\big(q(z)\,\|\,p(z \mid x)\big).$$
Why reverse KL? Because it is the only tractable direction. Computing $D_{\mathrm{KL}}(p(z \mid x)\,\|\,q(z))$ requires expectations under $p(z \mid x)$, which is the intractable quantity we are trying to avoid. Computing $D_{\mathrm{KL}}(q(z)\,\|\,p(z \mid x))$ requires only samples from $q$ and evaluations of the joint $p(x, z)$ (up to the constant $\log p(x)$) - both tractable.
ELBO derivation. Expanding the reverse KL:
$$D_{\mathrm{KL}}\big(q(z)\,\|\,p(z \mid x)\big) = \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z \mid x)].$$
Using $p(z \mid x) = p(x, z)/p(x)$:
$$D_{\mathrm{KL}}\big(q(z)\,\|\,p(z \mid x)\big) = \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(x, z)] + \log p(x).$$
Rearranging:
$$\log p(x) = \underbrace{\mathbb{E}_q\big[\log p(x, z) - \log q(z)\big]}_{\mathrm{ELBO}(q)} + D_{\mathrm{KL}}\big(q(z)\,\|\,p(z \mid x)\big).$$
Since $D_{\mathrm{KL}} \ge 0$: $\log p(x) \ge \mathrm{ELBO}(q)$ - the ELBO is a lower bound. Maximizing the ELBO over $q$ simultaneously: (a) tightens the bound (reduces the KL gap), and (b) improves the model fit (increases $\log p(x)$ when the model parameters are also optimized).
E.2 Mean-Field Variational Inference
The mean-field approximation restricts $q$ to factored distributions $q(z) = \prod_{j=1}^m q_j(z_j)$. This is an enormous simplification - the full posterior may have complex correlations, but the approximation ignores all correlations between latent variables.
Coordinate ascent VI (CAVI). For mean-field, the optimal factor satisfies:
$$q_j^*(z_j) \;\propto\; \exp\Big(\mathbb{E}_{-j}\big[\log p(x, z)\big]\Big),$$
where $\mathbb{E}_{-j}$ denotes expectation over all factors except $q_j$. This is a fixed-point equation: each factor depends on the others. CAVI iterates: update $q_1$ given $q_2, \dots, q_m$; then update $q_2$ given the new $q_1$ and the old $q_3, \dots, q_m$; and so on.
Guarantee: Each update increases the ELBO (decreases the reverse KL). CAVI converges to a local optimum.
For AI: Mean-field VI was the basis for early Bayesian neural networks (BNNs) and topic models (LDA). Its mode-seeking bias (reverse KL) means it systematically underestimates posterior uncertainty - a known limitation. Modern BNNs often use Monte Carlo Dropout as a cheaper alternative.
E.3 ELBO as a Free Energy
The ELBO has a physical interpretation as a (negative) variational free energy (Helmholtz, Gibbs):
$$\mathrm{ELBO}(q) = \mathbb{E}_q[\log p(x, z)] + H(q).$$
This is the expected log joint plus the entropy of $q$. Maximizing the ELBO simultaneously:
- Maximizes $\mathbb{E}_q[\log p(x, z)]$: encourages $q$ to put mass on high-probability regions of the joint
- Maximizes $H(q)$: encourages $q$ to spread out (high entropy = retained uncertainty), preventing collapse to a point mass
The tension between these two terms is the fundamental trade-off in variational inference. In physics, this is the Gibbs free energy minimization: equilibrium balances energy minimization (term 1) with entropy maximization (term 2), at temperature 1.
For AI: The free energy interpretation connects to energy-based models (LeCun et al., 2006) and to the "free energy principle" in computational neuroscience (Friston, 2010). The ELBO appears in all modern latent variable models: VAEs, VQ-VAE, neural process families, and latent diffusion models (Stable Diffusion uses a VAE for compression, then diffusion in the latent space).
E.4 Amortized Variational Inference
Standard VI optimizes a separate $q_i(z)$ for each data point $x_i$ - that is $n$ separate optimization problems. Amortized VI (Kingma & Welling 2014) instead trains a single encoder network $q_\phi(z \mid x)$ that takes $x$ as input:
$$q_i(z) \approx q_\phi(z \mid x_i).$$
The encoder "amortizes" the per-sample optimization across the full dataset, generalizing to new samples at inference time. This is what makes VAEs practical for large datasets: a single forward pass through the encoder gives $q_\phi(z \mid x)$ instantly.
Amortization gap: The amortized approximation is generally worse than the optimal per-sample $q_i^*$ (it must generalize across the dataset, so it trades off per-sample accuracy). The difference is called the amortization gap and is an active research topic.
Appendix F: The RLHF-KL Alignment Triangle
F.1 Three Ways KL Appears in Alignment
Modern LLM alignment uses KL divergence in three distinct but related ways, forming a triangle of constraints:
RLHF-KL ALIGNMENT TRIANGLE
PRETRAINED MODEL pi_ref
/ \
KL / \ KL
penalty / \ constraint
/ \
RLHF pi_theta DPO pi_DPO
\ /
\ equiv /
\ /
Fixed point: same optimal policy
All three policies maximize the same KL-regularized objective:
    E_pi[r(x,y)] - beta * D_KL(pi(.|x) || pi_ref(.|x))
Optimal solution: pi*(y|x) ∝ pi_ref(y|x) * exp(r(x,y)/beta)
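The fixed point is easy to verify on a toy discrete response space (the reference policy, rewards, and $\beta$ below are invented): tilting $\pi_{\mathrm{ref}}$ by $e^{r/\beta}$ and renormalizing maximizes the KL-regularized objective.

```python
import numpy as np

pi_ref = np.array([0.5, 0.3, 0.2])      # reference policy over 3 responses
r = np.array([1.0, 2.0, 0.0])           # hypothetical rewards
beta = 0.5
pi_star = pi_ref * np.exp(r / beta)     # Gibbs tilt of the reference policy
pi_star /= pi_star.sum()                # normalize back to the simplex

def objective(pi):
    """E_pi[r] - beta * D_KL(pi || pi_ref)."""
    return pi @ r - beta * np.sum(pi * np.log(pi / pi_ref))

print(pi_star, objective(pi_star))      # compare: objective(pi_ref) is lower
```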
F.2 Interpreting the KL Coefficient beta
The KL coefficient $\beta$ in RLHF controls the fundamental exploration-exploitation trade-off in alignment:

| $\beta$ value | Behavior | Risk |
|---|---|---|
| $\beta \to 0$ | Pure reward maximization; ignores the KL constraint | Reward hacking; repetitive, incoherent outputs |
| $\beta \to \infty$ | Stays at the reference policy; no alignment | No improvement in helpfulness or safety |
| Intermediate $\beta$ | Typical regime used in practice | Balances helpfulness against coherence |

Typical values in production: InstructGPT (Ouyang et al., 2022) used a fixed, small KL coefficient for PPO-RLHF. Different applications need different $\beta$: safety alignment uses larger $\beta$ (conservative); instruction following uses smaller $\beta$ (allows larger policy changes).
Adaptive $\beta$ (KL controller): Some implementations adaptively adjust $\beta$ to keep $D_{\mathrm{KL}}(\pi_\theta\,\|\,\pi_{\mathrm{ref}})$ near a target value $d_{\mathrm{target}}$:
$$e_t = \operatorname{clip}\!\Big(\frac{D_{\mathrm{KL}}^{(t)} - d_{\mathrm{target}}}{d_{\mathrm{target}}},\ -0.2,\ 0.2\Big), \qquad \beta_{t+1} = \beta_t\,(1 + K_\beta\,e_t).$$
This is a proportional controller on the KL divergence (in the style of Ziegler et al., 2019). If the policy drifts too far, $\beta$ increases; if it stays too close to the reference, $\beta$ decreases.
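A minimal sketch of such a controller (the gain $K_\beta$ is illustrative; the clip range mirrors the update above):

```python
def update_beta(beta, kl_measured, kl_target, k_beta=0.1):
    """Nudge the KL coefficient toward keeping KL near its target."""
    err = (kl_measured - kl_target) / kl_target
    err = max(-0.2, min(0.2, err))        # clip the proportional error
    return beta * (1.0 + k_beta * err)    # KL too high -> beta rises, and vice versa
```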
F.3 DPO's Implicit Reward
DPO (Rafailov et al., 2023) reveals that the log-ratio between policy and reference is the implicit reward:
$$r_\theta(x, y) = \beta\log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}.$$
This has a clean interpretation: the reward a sequence receives is proportional to how much more probability the aligned policy assigns to it than the reference does. Sequences the aligned model prefers more than the reference model are rewarded; sequences it prefers less are penalized.
DPO gradient: The DPO loss gradient with respect to $\theta$:
$$\nabla_\theta\mathcal{L}_{\mathrm{DPO}} = -\beta\,\mathbb{E}\Big[\underbrace{\sigma\big(r_\theta(x, y_l) - r_\theta(x, y_w)\big)}_{w}\,\big(\nabla_\theta\log\pi_\theta(y_w \mid x) - \nabla_\theta\log\pi_\theta(y_l \mid x)\big)\Big],$$
where $\sigma$ is the logistic sigmoid and $y_w, y_l$ are the preferred and dispreferred responses. The weight $w$ is large when the model's implicit reward ranks the loser above the winner - the loss upweights these hard examples. This has the same structure as hard-example mining in contrastive learning.
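A minimal DPO loss sketch following this structure (tensor names and shapes are ours; inputs are summed per-response log-probabilities under the policy and the frozen reference):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for batches of (winner, loser) response log-probs."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    # -log sigmoid(r_w - r_l): push the winner's implicit reward above the loser's.
    return -F.logsigmoid(r_w - r_l).mean()
```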
Appendix G: Connections to Other Chapters
G.1 KL and Exponential Families (Chapter 7: Statistics)
Exponential families are the natural domain for KL divergence because:
- KL between exponential family members has closed form (Section 5.3)
- The maximum entropy distribution under moment constraints is an exponential family member (Section 09-01 Section 5)
- The Bregman divergence generated by the log-partition function $A(\eta)$ equals the KL between family members with those natural parameters (with the arguments swapped: $B_A(\eta_1, \eta_2) = D_{\mathrm{KL}}(p_{\eta_2}\,\|\,p_{\eta_1})$)
Sufficiency: Kullback and Leibler's original paper introduced KL as a measure of information for discriminating between hypotheses. A statistic $T$ is sufficient for the family iff $D_{\mathrm{KL}}(p_{\theta_1}\,\|\,p_{\theta_2}) = D_{\mathrm{KL}}\big(p_{\theta_1}^{T}\,\|\,p_{\theta_2}^{T}\big)$ for all parameter pairs - sufficient statistics preserve all the discriminatory information. This is the information-theoretic characterization of sufficiency.
G.2 KL and Convex Optimization (Chapter 8)
The ELBO maximization is a convex optimization problem in $q$ (for fixed model parameters $\theta$):
- $D_{\mathrm{KL}}(q\,\|\,p)$ is jointly convex, hence convex in $q$ (Section 3.4)
- The constraint set $\sum_i q_i = 1$, $q_i \ge 0$ is convex (the simplex)
- The I-projection $\min_{q \in \mathcal{Q}} D_{\mathrm{KL}}(q\,\|\,p)$ is a convex program with a unique solution
The Lagrangian of the I-projection onto the exponential family gives the natural parameter update equations - connecting KL divergence to the theory of constrained convex optimization from Chapter 8.
G.3 KL and Functional Analysis (Chapter 12)
In infinite-dimensional settings, KL divergence extends via the Radon-Nikodym theorem. For Gaussian processes and Gaussian measures on Hilbert spaces:
$$D_{\mathrm{KL}}\big(\mathcal{N}(\mu_1, K_1)\,\|\,\mathcal{N}(\mu_2, K_2)\big) = \frac{1}{2}\Big[\operatorname{tr}(K_2^{-1}K_1) + (\mu_2-\mu_1)^\top K_2^{-1}(\mu_2-\mu_1) - d + \log\frac{\det K_2}{\det K_1}\Big],$$
the same Gaussian KL formula, but now $K_1, K_2$ are kernel (covariance) matrices or operators. In the operator setting the trace and log-determinant terms generalize to the operator trace and the Fredholm determinant.
RKHS connection: The maximum mean discrepancy (MMD) is an alternative to KL for comparing distributions, based on RKHS distances. MMD has better convergence rates from finite samples than KL estimators, which is why it appears in some GANs (MMD-GAN) and kernel two-sample tests.
G.4 KL and Statistical Learning Theory (Chapter 21)
The PAC-Bayes bound (Appendix C.1) shows that the KL from posterior to prior controls generalization. This has a direct implication for LoRA fine-tuning: if the LoRA adapter is constrained to have small $D_{\mathrm{KL}}(Q\,\|\,P)$ - i.e., small deviation from initialization - then the generalization bounds are tight. This is the information-theoretic justification for small-rank adaptations: they stay close to the pretrained distribution in KL.
Minimum Description Length (MDL): The MDL principle selects the model that minimizes the two-part code length $L(M) + L(\mathcal{D} \mid M)$, where $L(M)$ is the code length of the model and $L(\mathcal{D} \mid M)$ is the code length of the data given the model. With the right choice of coding scheme (the bits-back argument), this code length equals the negative ELBO - so minimizing description length equals maximizing the ELBO. This is the Bayesian interpretation of Occam's razor.
Appendix H: Summary Tables
H.1 KL Divergence Properties at a Glance
| Property | Holds? | Counterexample if No |
|---|---|---|
| Non-negative: $D_{\mathrm{KL}}(p\,\|\,q) \ge 0$ | YES | - |
| Zero iff equal: $D_{\mathrm{KL}}(p\,\|\,q) = 0 \iff p = q$ | YES (a.e.) | - |
| Symmetric: $D_{\mathrm{KL}}(p\,\|\,q) = D_{\mathrm{KL}}(q\,\|\,p)$ | NO | e.g. $\mathrm{Bern}(0.9)$ vs $\mathrm{Bern}(0.5)$: the two directions differ |
| Triangle inequality: $D(p\,\|\,r) \le D(p\,\|\,q) + D(q\,\|\,r)$ | NO | Fails already for suitable triples on a binary alphabet |
| Data processing inequality: $D(p_Y\,\|\,q_Y) \le D(p_X\,\|\,q_X)$ | YES | - |
| Jointly convex in $(p, q)$ | YES | - |
| Convex in $p$ for fixed $q$ | YES | - |
| Convex in $q$ for fixed $p$ | YES | - |
| Always finite | NO | $q(x) = 0 < p(x)$ for some $x$: $D_{\mathrm{KL}}(p\,\|\,q) = \infty$ |
| Metric (distance function) | NO | Fails symmetry and the triangle inequality |
H.2 Forward vs Reverse KL Summary
| | Forward KL $D_{\mathrm{KL}}(p\,\|\,q)$ | Reverse KL $D_{\mathrm{KL}}(q\,\|\,p)$ |
|---|---|---|
| Expectation under | $p$ (truth) | $q$ (approximation) |
| Also called | Inclusive, zero-avoiding | Exclusive, zero-forcing |
| Behavior when fitting $q$ to $p$ | Mean-seeking, mass-covering | Mode-seeking, mass-concentrating |
| Handles multimodal $p$ | Averages over all modes | Collapses to one mode |
| Support requirement | $q > 0$ wherever $p > 0$ | $q$ must avoid regions where $p = 0$ |
| Used in | MLE, distillation, flows | Variational inference, ELBO |
| Risk | Blurry/averaged samples | Posterior collapse, mode dropping |
| Tractability | Needs samples from $p$ | Only needs samples from $q$ |
H.3 f-Divergence Family Summary
| Divergence | $f(t)$ | Symmetric? | Bounded? | Metric? |
|---|---|---|---|---|
| KL | $t\log t$ | No | No | No |
| Reverse KL | $-\log t$ | No | No | No |
| Jeffrey's | $(t-1)\log t$ | Yes | No | No |
| Squared Hellinger $H^2$ | $(\sqrt{t}-1)^2$ | Yes | Yes ($\le 2$) | $H$ is |
| Total variation TV | $\frac{1}{2}\lvert t-1\rvert$ | Yes | Yes ($\le 1$) | Yes |
| Chi-squared $\chi^2$ | $(t-1)^2$ | No | No | No |
| Jensen-Shannon JSD | (mixture form) | Yes | Yes ($\le \log 2$) | $\sqrt{\mathrm{JSD}}$ is |
| Renyi $D_\alpha$ | (limit form; not itself an f-divergence) | No | No | No |
Appendix I: Extended Examples and Geometric Visualizations
I.1 KL Divergence Along a Path Between Distributions
Consider the one-parameter family $p_t = (1-t)\,p + t\,q$ for $t \in [0, 1]$ - a straight line between two distributions in the probability simplex. For a fixed second argument, $t \mapsto D_{\mathrm{KL}}(p_t\,\|\,r)$ is convex (because KL is jointly convex and $p_t$ is linear in $t$). But KL along the path does not behave like a squared Euclidean distance - the "straight" line in the simplex is curved from the perspective of KL divergence.
Numerical example (a concrete instance; the endpoint distributions are chosen for illustration). Let $p = (0.9, 0.1)$ and $q = (0.1, 0.9)$, and compare mirrored points on the path via $D_{\mathrm{KL}}(p_t\,\|\,p_{1-t})$:

| $t$ | $p_t$ | $D_{\mathrm{KL}}(p_t\,\|\,p_{1-t})$ | $D_{\mathrm{KL}}(p_{1-t}\,\|\,p_t)$ |
|---|---|---|---|
| 0 | (0.9, 0.1) | 1.758 | 1.758 |
| 0.25 | (0.7, 0.3) | 0.339 | 0.339 |
| 0.5 | (0.5, 0.5) | 0 | 0 |
| 0.75 | (0.3, 0.7) | 0.339 | 0.339 |
| 1.0 | (0.1, 0.9) | 1.758 | 1.758 |

Both directions agree here because the path is symmetric (swapping $t \leftrightarrow 1-t$ exchanges the two arguments). The KL is zero at $t = 1/2$ (when $p_t = p_{1-t}$) and increases convexly toward the endpoints.
Geometric implication: The straight-line path is a mixture geodesic (M-geodesic) in information geometry - the natural path in the mixture parameterization. The dual path (exponential geodesic) mixes in the natural parameter space, giving different intermediate distributions.
I.2 KL vs Wasserstein Distance
KL divergence and the Wasserstein distance (from optimal transport theory) are complementary tools for comparing distributions:
| Property | KL Divergence | Wasserstein Distance |
|---|---|---|
| Requires absolute continuity ($p \ll q$) | YES | No |
| Metrizes weak convergence | No | YES |
| Captures geometry of the support | No | YES |
| Tractable to compute | YES (closed forms, plug-in) | Harder ($O(n^3)$ for the naive LP) |
| Differentiable | YES | With entropic regularization (Sinkhorn) |
| Used in | MLE, ELBO, RLHF | Generative models, image quality |
When Wasserstein beats KL: If $p_g$ (generator) and $p_{\mathrm{data}}$ have disjoint supports, $D_{\mathrm{KL}} = \infty$ and $\mathrm{JSD} = \log 2$ - gradients vanish. The Wasserstein distance is still finite and differentiable (reflecting the actual geometric distance between the supports). This is the motivation for WGAN (Arjovsky et al., 2017).
When KL beats Wasserstein: For language modeling, the vocabulary forms a discrete space with no natural geometry. The Wasserstein distance would require defining a metric on tokens (words), which is ambiguous. KL divergence makes no assumptions about token geometry and is directly connected to log-probability, making it the natural choice.
I.3 KL Divergence Under Transformations
Transformation by sufficient statistic. If $T(X)$ is a sufficient statistic for distinguishing $p$ from $q$, then:
$$D_{\mathrm{KL}}(p_X\,\|\,q_X) = D_{\mathrm{KL}}\big(p_{T(X)}\,\|\,q_{T(X)}\big),$$
where $p_{T(X)}$ is the induced distribution on the range of $T$. Sufficient statistics preserve all the discriminatory information - no KL is lost when summarizing data by a sufficient statistic.
Example. For $X_1, \dots, X_n \overset{\text{iid}}{\sim} \mathrm{Bern}(\cdot)$, the statistic $T = \sum_i X_i$ (number of heads in $n$ flips) is sufficient. The KL between $\mathrm{Bern}(p)^{\otimes n}$ and $\mathrm{Bern}(q)^{\otimes n}$ equals $n\,d(p\,\|\,q)$, and the KL between the corresponding $\mathrm{Bin}(n, p)$ and $\mathrm{Bin}(n, q)$ distributions equals the same thing - the sufficient statistic preserves KL exactly.
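This equality is easy to confirm numerically (the values $n = 20$, $p = 0.7$, $q = 0.4$ are arbitrary):

```python
import numpy as np
from scipy.stats import binom
from scipy.special import rel_entr

n, p, q = 20, 0.7, 0.4
bern_kl = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
k = np.arange(n + 1)
binom_kl = rel_entr(binom.pmf(k, n, p), binom.pmf(k, n, q)).sum()
assert np.isclose(n * bern_kl, binom_kl)   # n * d(p||q) = D_KL(Bin(n,p) || Bin(n,q))
```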
KL under affine transformation. For continuous distributions and an invertible affine map $y = Ax + b$ with $\det A \ne 0$: $D_{\mathrm{KL}}(p_Y\,\|\,q_Y) = D_{\mathrm{KL}}(p_X\,\|\,q_X)$ - KL is invariant under invertible affine transformations (the Jacobian factors cancel in the density ratio). This is why the KL between two Gaussians can be computed in any coordinate system.
I.4 The Sanov Theorem and Large Deviations
Sanov's theorem is the large deviations foundation for KL divergence.
Theorem (Sanov). Let $X_1, \dots, X_n$ be i.i.d. from $q$. Let $\hat{p}_n$ be the empirical distribution. For any set $E$ of distributions (closed in total variation):
$$\limsup_{n\to\infty}\frac{1}{n}\log P(\hat{p}_n \in E) \;\le\; -\inf_{\nu \in E} D_{\mathrm{KL}}(\nu\,\|\,q),$$
with a matching lower bound when $E$ is the closure of its interior.
More precisely, for a finite alphabet $\mathcal{X}$: $P(\hat{p}_n \in E) \le (n+1)^{|\mathcal{X}|}\exp\big(-n\inf_{\nu\in E}D_{\mathrm{KL}}(\nu\,\|\,q)\big)$.
Interpretation: The probability of observing an empirical distribution in $E$ decays exponentially in $n$, with exponent given by the minimum KL from any distribution in $E$ to the true distribution $q$. The "most likely" atypical empirical distribution is the one in $E$ closest to $q$ in KL divergence - the I-projection of $q$ onto $E$.
For AI: Sanov's theorem explains why KL divergence appears in PAC-Bayes bounds: probabilities of atypical empirical behavior decay exponentially with a KL exponent, and in the PAC-Bayes bound this shows up as the $D_{\mathrm{KL}}(Q\,\|\,P)$ term, where $Q$ is the posterior over hypotheses and $P$ is the prior. The larger that KL, the weaker the guarantee - staying close to the prior is what buys a tight generalization bound.
I.5 KL Divergence and the EM Algorithm: A Complete Example
We illustrate the EM algorithm as alternating KL minimizations with a Gaussian mixture model.
Setup: Data $x_1, \dots, x_n$; model $p_\theta(x) = \sum_{k=1}^K \pi_k\,\mathcal{N}(x; \mu_k, \Sigma_k)$ with $\theta = (\pi, \mu, \Sigma)$.
E-step - compute responsibilities (I-projection):
$$r_{ik} = p_\theta(z_i = k \mid x_i) = \frac{\pi_k\,\mathcal{N}(x_i; \mu_k, \Sigma_k)}{\sum_j \pi_j\,\mathcal{N}(x_i; \mu_j, \Sigma_j)}.$$
This sets $q(z) = p_\theta(z \mid x)$ - the I-projection onto the space of conditional distributions given the current parameters.
M-step - update parameters (M-projection):
$$\theta^{\mathrm{new}} = \arg\max_\theta \sum_i \sum_k r_{ik}\,\log\big(\pi_k\,\mathcal{N}(x_i; \mu_k, \Sigma_k)\big).$$
This maximizes the ELBO in $\theta$ - equivalent to the M-projection that minimizes $D_{\mathrm{KL}}(q\,\|\,p_\theta)$ over $\theta$.
KL decomposition at each step:
$$\log p_\theta(x) = \mathrm{ELBO}(q, \theta) + D_{\mathrm{KL}}\big(q(z)\,\|\,p_\theta(z \mid x)\big).$$
The E-step zeroes the second term (for the current $\theta$); the M-step increases the first term. Together, each EM iteration monotonically increases $\log p_\theta(x)$.
Appendix J: Notation and Conventions
J.1 KL Divergence Notation Across Different Sources
Different textbooks and papers use different notation for KL divergence. This can cause significant confusion:
| Source | Notation | Direction |
|---|---|---|
| This curriculum (following Cover & Thomas) | $D_{\mathrm{KL}}(p\,\|\,q)$ | $p$ = reference (truth); $q$ = approximation |
| Kullback (1959) | $I(1:2)$ | Between distributions indexed 1 and 2 |
| Bishop (2006) | $\mathrm{KL}(p\,\|\,q)$ | Same as Cover & Thomas |
| Murphy (2012) | $\mathbb{KL}(p\,\|\,q)$ | Same |
| Goodfellow et al. (2016) | $D_{\mathrm{KL}}(P\,\|\,Q)$ | Same |
| Some optimization papers | varies | Reversed - always check! |
Key check: In VAE papers, $D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$ has the encoder as the first argument (the distribution the expectation is taken under) and the prior as the second. This is reverse KL - expectation under the approximate posterior $q_\phi$.
J.2 Relationship Between Information Quantities
VENN DIAGRAM OF INFORMATION QUANTITIES
For joint distribution P(X,Y):
H(X,Y)
H(X|Y) I(X;Y) H(Y|X)
H(X)
H(Y)
Key relationships:
H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) [chain rule]
I(X;Y) = H(X) + H(Y) - H(X,Y) [definition]
I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) [equivalences]
I(X;Y) = D_KL(P(X,Y) || P(X)P(Y))                [KL form]
H(p,q) = H(p) + D_KL(p || q)                     [cross-entropy decomp]
J.3 Units and Conversions
| Base | Unit | Conversion |
|---|---|---|
| $e$ (natural log) | nats | Default in this curriculum; 1 nat $= 1/\ln 2 \approx 1.443$ bits |
| 2 | bits | Used in data compression contexts; 1 bit $= \ln 2 \approx 0.693$ nats |
| 10 | hartleys (bans) | Rare in ML; 1 hartley $= \ln 10 \approx 2.303$ nats |
PyTorch convention: torch.nn.functional.kl_div expects inputs in log-space and computes nats by default (uses torch.log which is natural log). To get bits, divide by math.log(2).
Appendix K: Worked Problems with Full Solutions
K.1 Computing KL Between Two Categorical Distributions
Problem. A language model assigns the following probabilities to the next token in "The capital of France is ___":
| Token | True distribution | Model |
|---|---|---|
| "Paris" | 0.92 | 0.70 |
| "London" | 0.03 | 0.10 |
| "Rome" | 0.02 | 0.10 |
| "Berlin" | 0.02 | 0.05 |
| other | 0.01 | 0.05 |
Compute (a) $D_{\mathrm{KL}}(p\,\|\,q)$, (b) the cross-entropy $H(p, q)$, (c) the entropy $H(p)$.
Solution.
(a) $D_{\mathrm{KL}}(p\,\|\,q)$:
$$D_{\mathrm{KL}}(p\,\|\,q) = 0.92\log\frac{0.92}{0.70} + 0.03\log\frac{0.03}{0.10} + 0.02\log\frac{0.02}{0.10} + 0.02\log\frac{0.02}{0.05} + 0.01\log\frac{0.01}{0.05} \approx 0.149 \text{ nats}.$$
(b) Cross-entropy: $H(p, q) = -\sum_k p_k \log q_k \approx 0.533$ nats.
(c) Entropy: $H(p) = -\sum_k p_k \log p_k \approx 0.384$ nats.
Verification: $H(p) + D_{\mathrm{KL}}(p\,\|\,q) \approx 0.384 + 0.149 = 0.533 \approx H(p, q)$ nats. (Any small discrepancy is rounding; the exact values match.)
Interpretation: The model wastes about 0.149 nats per prediction compared to an oracle that knows $p$. The oracle's perplexity is $e^{0.384} \approx 1.47$; the model's is $e^{0.533} \approx 1.70$ - about 16% higher, entirely due to the KL gap.
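The arithmetic can be reproduced in a few lines (a sketch; the values match the table above):

```python
import numpy as np

p = np.array([0.92, 0.03, 0.02, 0.02, 0.01])   # true distribution
q = np.array([0.70, 0.10, 0.10, 0.05, 0.05])   # model
kl = np.sum(p * np.log(p / q))                 # ~0.149 nats
h_p = -np.sum(p * np.log(p))                   # ~0.384 nats
cross = h_p + kl                               # ~0.533 nats = H(p, q)
print(kl, h_p, cross, np.exp(h_p), np.exp(cross))   # perplexities ~1.47 vs ~1.70
```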
K.2 The ELBO Calculation for a Toy VAE
Setup. One-dimensional VAE: prior $p(z) = \mathcal{N}(0, 1)$, likelihood $p(x \mid z) = \mathcal{N}(x; z, 1)$ (decoder with fixed unit variance), encoder $q(z \mid x) = \mathcal{N}(\mu, \sigma^2)$.
For concreteness, take $x = 1.0$, $\mu = 0.5$, $\sigma^2 = 0.25$:
KL term: $D_{\mathrm{KL}}\big(\mathcal{N}(0.5,\, 0.25)\,\|\,\mathcal{N}(0, 1)\big) = \frac{1}{2}\big(\mu^2 + \sigma^2 - \log\sigma^2 - 1\big) = \frac{1}{2}\big(0.25 + 0.25 - \log 0.25 - 1\big) \approx 0.443$ nats.
Reconstruction term: $\mathbb{E}_q[\log p(x \mid z)] = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\mathbb{E}_q[(x - z)^2] = -0.919 - \frac{1}{2}\big[(x-\mu)^2 + \sigma^2\big] = -0.919 - \frac{1}{2}(0.25 + 0.25) \approx -1.169$ nats.
ELBO: $-1.169 - 0.443 \approx -1.612$ nats.
Interpretation: The true log-evidence satisfies $\log p(x) \ge \mathrm{ELBO} = -1.612$. Better encoder parameters would close the ELBO gap: for this conjugate model the exact posterior is $\mathcal{N}(0.5, 0.5)$, so the mean is already optimal, and widening the encoder variance to $\sigma^2 = 0.5$ would make the bound tight.
K.3 Verifying Data Processing Inequality Numerically
Consider a binary input $X$ with two candidate distributions $p_X$ and $q_X$, pushed through a stochastic map given by a row-stochastic transition matrix $W$ (rows = $P(Y \mid X = x)$), so the induced output distributions are $p_Y = W^\top p_X$ and $q_Y = W^\top q_X$.
With the numbers used in this example:
Original KL: $D_{\mathrm{KL}}(p_X\,\|\,q_X) \approx 0.275$ nats.
KL after processing: $D_{\mathrm{KL}}(p_Y\,\|\,q_Y) \approx 0.089$ nats.
Verification: $0.089 \le 0.275$ - PASS - the data processing inequality holds. The stochastic transformation reduced the KL from 0.275 to 0.089 nats, destroying some of the information that distinguishes whether the data came from $p$ or $q$.
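The same check works for any pair of distributions and any row-stochastic channel - a sketch with random draws (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4))                  # input distribution p_X
q = rng.dirichlet(np.ones(4))                  # input distribution q_X
W = rng.dirichlet(np.ones(3), size=4)          # 4x3 row-stochastic channel

kl_before = np.sum(p * np.log(p / q))
p_y, q_y = p @ W, q @ W                        # induced output distributions
kl_after = np.sum(p_y * np.log(p_y / q_y))
assert kl_after <= kl_before + 1e-12           # data processing inequality
```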
Appendix L: Connections to Optimization Theory
L.1 KL Divergence and Mirror Descent
Mirror descent is a generalization of gradient descent where the update measures proximity with a Bregman divergence instead of squared Euclidean distance:
$$x_{t+1} = \arg\min_{x \in \mathcal{C}}\ \Big\{\langle \nabla f(x_t),\, x\rangle + \frac{1}{\eta}\,B_\psi(x, x_t)\Big\}.$$
When $\psi(x) = \sum_i x_i\log x_i$ (negative entropy) and $x$ is a probability vector, the Bregman divergence is $B_\psi(x, y) = D_{\mathrm{KL}}(x\,\|\,y)$ and mirror descent becomes the multiplicative weights update or exponentiated gradient:
$$x_{t+1, i} \;\propto\; x_{t, i}\,\exp(-\eta\, g_{t, i}),$$
where $g_t = \nabla f(x_t)$. This is exactly how softmax policies in RL are updated, and how online learning over the probability simplex works (regret analysis via Bregman projections).
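A minimal sketch of exponentiated gradient on the simplex for a linear loss (the loss vector and step size are illustrative); the iterates concentrate on the coordinate with the smallest loss:

```python
import numpy as np

c = np.array([0.3, 0.1, 0.6])     # linear losses (hypothetical)
x = np.ones(3) / 3                # start at the uniform distribution
eta = 0.5
for _ in range(100):
    x = x * np.exp(-eta * c)      # multiplicative weights update
    x /= x.sum()                  # KL (Bregman) projection back to the simplex
print(x)                          # mass concentrates on argmin c (index 1)
```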
Natural gradient as mirror descent. The natural gradient descent update:
$$\theta_{t+1} = \arg\min_\theta\ \Big\{\langle \nabla L(\theta_t),\, \theta\rangle + \frac{1}{2\eta}\,D_{\mathrm{KL}}\big(p_{\theta_t}\,\|\,p_\theta\big)\Big\}$$
is mirror descent using KL divergence as the Bregman proximity term. Expanding the KL to second order around $\theta_t$, the solution is $\theta_{t+1} = \theta_t - \eta\,F^{-1}\nabla L(\theta_t)$, where $F$ is the Fisher information matrix - reproducing the natural gradient formula.
L.2 KL-Penalized Optimization and Soft Constraints
The RLHF objective $\max_\pi\ \mathbb{E}_\pi[r] - \beta\,D_{\mathrm{KL}}(\pi\,\|\,\pi_{\mathrm{ref}})$ is a classic penalized optimization problem. The KL term acts as a regularizer that keeps $\pi_\theta$ near $\pi_{\mathrm{ref}}$.
Lagrangian form. The hard-constrained version:
$$\max_\pi\ \mathbb{E}_\pi[r(x, y)] \quad \text{s.t.} \quad D_{\mathrm{KL}}(\pi\,\|\,\pi_{\mathrm{ref}}) \le \epsilon$$
is equivalent to the penalized form for the $\beta$ that is the Lagrange multiplier of the constraint. As $\beta \to 0$, $\epsilon \to \infty$ (no constraint); as $\beta \to \infty$, $\epsilon \to 0$ (stay at the reference). The optimal $\beta$ depends on the trade-off between reward and divergence.
TRPO (Schulman et al., 2015). Trust Region Policy Optimization solves:
$$\max_\theta\ L(\theta) \quad \text{s.t.} \quad \max_s\, D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big) \le \delta,$$
where $L$ is the surrogate advantage objective and the constraint caps the maximum KL over states (approximated by the mean KL in practice). Each TRPO step solves a constrained optimization problem with a KL trust region - computationally expensive. PPO approximates this with the clip objective.
L.3 The Frank-Wolfe Algorithm and KL
The Frank-Wolfe algorithm solves $\min_{x \in \mathcal{C}} f(x)$ (for a convex function $f$ on a convex constraint set $\mathcal{C}$) by linearizing $f$ at each step. When the objective involves KL and $\mathcal{C}$ is the probability simplex, Frank-Wolfe-style updates recover multiplicative-weights-like algorithms; when the objective is a KL onto a parametric exponential family, the linearized subproblem corresponds to the M-step of EM.
This connection shows that common ML algorithms - EM, multiplicative weights, mirror descent, natural gradient - are all special cases of convex optimization using KL divergence as the geometry.
Appendix M: Further Reading and References
M.1 Primary Sources
| Reference | What to Read | Why |
|---|---|---|
| Kullback & Leibler (1951). "On Information and Sufficiency." Ann. Math. Stat. | Section 1-Section 3 | Original paper; clean introduction to I-divergence and sufficiency |
| Cover & Thomas (2006). Elements of Information Theory, Ch. 2 | Ch. 2, Section 2.3 | Best textbook treatment; all proofs with full details |
| MacKay (2003). Information Theory, Inference, and Learning Algorithms, Ch. 2 | Free online; Ch. 2 | Excellent intuition; coding-theoretic perspective |
| Bishop (2006). Pattern Recognition and Machine Learning, Ch. 10 | Ch. 10.1 | Variational inference from an ML perspective |
| Amari (2016). Information Geometry and Its Applications | Ch. 1-3 | Statistical manifolds, I/M-projections, natural gradient |
| Csiszar & Shields (2004). "Information Theory and Statistics" | Section 1-Section 3 | f-divergences, Sanov's theorem, method of types |
M.2 Key ML Papers Using KL Divergence
| Paper | KL Divergence Role | Year |
|---|---|---|
| Kingma & Welling, "Auto-Encoding Variational Bayes" | ELBO = reconstruction - KL; VAE training | 2014 |
| Goodfellow et al., "Generative Adversarial Networks" | GAN = JSD minimization | 2014 |
| Hinton, Vinyals & Dean, "Distilling the Knowledge in a Neural Network" | Forward KL for knowledge transfer | 2015 |
| Schulman et al., "Trust Region Policy Optimization" (TRPO) | KL trust region constraint | 2015 |
| Schulman et al., "Proximal Policy Optimization" (PPO) | KL approximated by clip | 2017 |
| Higgins et al., "beta-VAE: Learning Basic Visual Concepts" | -weighted KL for disentanglement | 2017 |
| Mironov, "Renyi Differential Privacy" | Renyi KL for privacy accounting | 2017 |
| Ho et al., "Denoising Diffusion Probabilistic Models" | Per-step KL for diffusion ELBO | 2020 |
| Rafailov et al., "Direct Preference Optimization" (DPO) | Implicit KL-constrained policy | 2023 |
M.3 Numerical Tools
| Tool | What It Provides |
|---|---|
scipy.special.rel_entr(p, q) | Element-wise ; sum for KL |
scipy.stats.entropy(p, q) | for discrete distributions |
torch.nn.functional.kl_div(log_q, p) | $D_{\mathrm{KL}}(p\,\|\,q)$ given log q and p; note the argument order |
torch.distributions.kl_divergence(p, q) | For torch distribution objects; supports many families |
sklearn.metrics.mutual_info_score | Mutual information (related to KL) |
numpy custom | See Appendix D.1 for stable implementation |
M.4 Quick Diagnostic Checklist
Before using KL divergence in any ML project, verify:
- Which direction? Identify which distribution is (truth/reference) and which is (approximation). The first argument is where the expectation is taken.
- Support check. Is $q(x) = 0$ anywhere that $p(x) > 0$? If so, $D_{\mathrm{KL}}(p\,\|\,q) = \infty$. Add smoothing if needed.
- Units. Is your implementation using natural log (nats) or log base 2 (bits)? Be consistent.
- Forward or reverse? Is this an MLE/distillation problem (forward KL) or variational inference (reverse KL)? The choice determines whether the result is mode-seeking or mean-seeking.
- Numerical stability. Are you computing directly (unstable) or using log-softmax (stable)?
- Sample size for estimation. KL estimators from finite samples have high variance in large dimension $d$ - consider using MMD or Wasserstein for high-dimensional distributions.
- KL coefficient. If using KL as a regularizer, is $\beta$ tuned? Too small -> reward hacking; too large -> no alignment gain.
- ELBO monitoring. In VAE training, are both the reconstruction term and KL term logged separately? Monitoring their ratio over training diagnoses posterior collapse early.
M.5 Conceptual Summary: One Page
KL divergence $D_{\mathrm{KL}}(p\,\|\,q) = \mathbb{E}_p[\log(p/q)]$ measures:
- The expected excess code length when coding $p$-data with a $q$-code
- The average log-likelihood-ratio evidence that data came from $p$ rather than $q$
- The information gained when updating a model from $q$ to $p$
Three things to always remember:
1. Direction matters. $D_{\mathrm{KL}}(p\,\|\,q) \ne D_{\mathrm{KL}}(q\,\|\,p)$. Forward KL -> mass-covering. Reverse KL -> mode-seeking.
2. Non-negativity is everything. $D_{\mathrm{KL}}(p\,\|\,q) \ge 0$ with equality iff $p = q$. This single fact proves: entropy is maximized at uniform, the ELBO is a lower bound, MLE is consistent.
3. It's everywhere in AI. MLE, VAE, RLHF, distillation, diffusion, VI - all are special cases of optimizing a KL divergence in a specific direction. Understanding which direction, and which approximations are made, is the key to understanding modern deep learning at a principled level.
Section 09-02 KL Divergence - 13 main sections, 13 appendices (A-M). Prerequisites: Section 09-01 Entropy | Next: Section 09-03 Mutual Information