KL Divergence, Part 2: Section 6 (Information-Theoretic Connections) to Section 13 (Conceptual Bridge)

6. Information-Theoretic Connections

6.1 KL and Cross-Entropy

The relationship $H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$ is fundamental. Expanding both sides:

$$-\sum_x p(x)\log q(x) = -\sum_x p(x)\log p(x) + \sum_x p(x)\log\frac{p(x)}{q(x)}$$

which is immediate from $\log(p/q) = \log p - \log q$. This decomposition has critical implications:

  • Training ML models: Minimizing the cross-entropy loss $H(p_{\mathrm{data}}, p_{\boldsymbol{\theta}})$ over $\boldsymbol{\theta}$ is identical to minimizing $D_{\mathrm{KL}}(p_{\mathrm{data}} \| p_{\boldsymbol{\theta}})$, since $H(p_{\mathrm{data}})$ is constant. The irreducible entropy $H(p_{\mathrm{data}})$ is the minimum achievable cross-entropy.

  • Perplexity gap: A language model with perplexity $\operatorname{PPL} = e^{H(p_{\mathrm{data}}, p_{\boldsymbol{\theta}})}$ has a cross-entropy that sits $D_{\mathrm{KL}}(p_{\mathrm{data}}\|p_{\boldsymbol{\theta}})$ nats above the theoretical minimum $H(p_{\mathrm{data}})$.

  • Calibration: A perfectly calibrated model satisfies $D_{\mathrm{KL}}(p_{\mathrm{true}}\|p_{\boldsymbol{\theta}}) = 0$, meaning the model's output distribution matches the true conditional distributions exactly.
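A quick numerical check of the decomposition (a minimal NumPy sketch; the two distributions are arbitrary illustrative values):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "data" distribution (illustrative)
q = np.array([0.5, 0.3, 0.2])   # model distribution (illustrative)

entropy_p     = -np.sum(p * np.log(p))        # H(p), in nats
cross_entropy = -np.sum(p * np.log(q))        # H(p, q)
kl_pq         =  np.sum(p * np.log(p / q))    # D_KL(p || q)

# H(p, q) = H(p) + D_KL(p || q), up to floating-point error
assert np.isclose(cross_entropy, entropy_p + kl_pq)
print(cross_entropy, entropy_p, kl_pq)
```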

Preview - Cross-Entropy (Section 09-04)

Cross-entropy $H(p, q) = H(p) + D_{\mathrm{KL}}(p\|q)$ is the canonical training loss for classification. It decomposes neatly into the intrinsic uncertainty $H(p)$ (irreducible) and the model quality gap $D_{\mathrm{KL}}(p\|q)$ (reducible by training).

-> Full treatment: Section 09-04 Cross-Entropy

6.2 KL and Mutual Information

Preview - Mutual Information (Section 09-03)

Mutual information is the KL divergence between the joint distribution and the product of marginals:

$$I(X; Y) = D_{\mathrm{KL}}(P(X, Y) \| P(X)P(Y)) = \mathbb{E}_{(x,y)\sim P}\!\left[\log\frac{P(x,y)}{P(x)P(y)}\right]$$

It measures statistical dependence: how much knowing $Y$ reduces uncertainty about $X$. This connection shows that $I(X;Y) \ge 0$ (Gibbs' inequality) and $I(X;Y) = 0 \Leftrightarrow X \perp\!\!\!\perp Y$.

-> Full treatment: Section 09-03 Mutual Information

The MI-as-KL connection is used directly in contrastive learning (InfoNCE loss, SimCLR, CLIP): maximizing a lower bound on $I(X; Z)$ between the input and its representation. It also appears in the information bottleneck (Tishby et al., 2000), which frames representation learning as minimizing $I(X; Z)$ subject to maximizing $I(Z; Y)$ - two KL divergences in tension.
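Computing MI as a KL divergence is a one-liner once the joint is tabulated (a minimal sketch; the joint distribution is illustrative):

```python
import numpy as np

# Joint distribution P(X, Y) on {0,1}^2 (illustrative values)
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])

Px = P.sum(axis=1, keepdims=True)   # marginal P(X)
Py = P.sum(axis=0, keepdims=True)   # marginal P(Y)

# I(X;Y) = D_KL( P(X,Y) || P(X)P(Y) )
mi = np.sum(P * np.log(P / (Px * Py)))
print(mi)   # positive because X and Y are dependent
```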

6.3 KL and Fisher Information

Preview - Fisher Information (Section 09-05)

For a parametric family $\{p_{\boldsymbol{\theta}}\}$, the KL divergence between nearby members is approximately:

$$D_{\mathrm{KL}}(p_{\boldsymbol{\theta}} \| p_{\boldsymbol{\theta}+\boldsymbol{\delta}}) \approx \frac{1}{2}\boldsymbol{\delta}^\top F(\boldsymbol{\theta})\boldsymbol{\delta}$$

where $F(\boldsymbol{\theta}) = \mathbb{E}_{p_{\boldsymbol{\theta}}}[\nabla\log p_{\boldsymbol{\theta}}\, \nabla\log p_{\boldsymbol{\theta}}^\top]$ is the Fisher information matrix. This local quadratic approximation to KL is what makes natural gradient descent (which preconditions gradient steps by $F^{-1}$) the "geometrically correct" optimizer on the statistical manifold.

-> Full treatment: Section 09-05 Fisher Information

Connection to RLHF natural policy gradient: The KL trust region $D_{\mathrm{KL}}(\pi_{\boldsymbol{\theta}} \| \pi_{\mathrm{ref}}) \le \epsilon$ defines a neighborhood in policy space. Near the reference policy, this KL ball is approximated by $\frac{1}{2}\boldsymbol{\delta}^\top F \boldsymbol{\delta} \le \epsilon$, where $F$ is the Fisher information of $\pi_{\mathrm{ref}}$. Natural policy gradient methods such as TRPO use this approximation directly; PPO enforces a similar trust region more cheaply via clipping (Section 9.3).
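The quadratic approximation is easy to check for a one-parameter family such as the Bernoulli, whose Fisher information is $1/(\theta(1-\theta))$ (a minimal sketch):

```python
import numpy as np

def kl_bern(a, b):
    """D_KL(Bern(a) || Bern(b)) in nats."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

theta, delta = 0.3, 0.01
fisher = 1.0 / (theta * (1 - theta))    # Fisher information of Bernoulli(theta)

exact  = kl_bern(theta, theta + delta)
approx = 0.5 * fisher * delta**2        # (1/2) * delta^T F delta
print(exact, approx)                    # nearly identical for small delta
```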

6.4 KL and Entropy

KL divergence provides an elegant proof that entropy is maximized at the uniform distribution. For any distribution $p$ on an alphabet of size $n$ and the uniform distribution $u(x) = 1/n$:

$$D_{\mathrm{KL}}(p \| u) = \sum_x p(x)\ln\frac{p(x)}{1/n} = \sum_x p(x)\ln p(x) + \ln n = -H(p) + \ln n$$

Since $D_{\mathrm{KL}}(p\|u) \ge 0$, we get $H(p) \le \ln n$, with equality iff $p = u$.

More generally, for any constraint set $\mathcal{Q}$ of distributions satisfying given moment constraints, the maximum-entropy distribution $p^* = \arg\max_{p \in \mathcal{Q}} H(p)$ equals $\arg\min_{p \in \mathcal{Q}} D_{\mathrm{KL}}(p \| u)$ - it is the I-projection of the uniform distribution onto $\mathcal{Q}$ (the optimization variable sits in the first KL slot; see Section 8.2). This unifies the maximum entropy principle with KL geometry.

7. f-Divergences and Generalizations

7.1 Csiszar f-Divergences

KL divergence is one member of a rich family of divergences introduced by Csiszar (1967).

Definition. Let $f: (0,\infty) \to \mathbb{R}$ be a convex function with $f(1) = 0$. The f-divergence of $p$ from $q$ is:

$$D_f(p \| q) = \sum_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) = \mathbb{E}_{x\sim q}\!\left[f\!\left(\frac{p(x)}{q(x)}\right)\right]$$

The condition $f(1) = 0$ ensures $D_f(p\|p) = 0$. Convexity of $f$ and Jensen's inequality give $D_f(p\|q) \ge 0$.

KL divergence as f-divergence. Taking $f(t) = t\ln t$ (convex, $f(1) = 0$):

$$D_f(p\|q) = \sum_x q(x) \cdot \frac{p(x)}{q(x)}\ln\frac{p(x)}{q(x)} = \sum_x p(x)\ln\frac{p(x)}{q(x)} = D_{\mathrm{KL}}(p\|q)$$

Standard f-divergences:

| Name | $f(t)$ | Closed form | Properties |
| --- | --- | --- | --- |
| KL divergence | $t\ln t$ | $\sum_x p\ln(p/q)$ | Asymmetric; range $[0, \infty)$ |
| Reverse KL | $-\ln t$ | $\sum_x q\ln(q/p)$ | Asymmetric; range $[0, \infty)$ |
| Hellinger distance | $(\sqrt{t}-1)^2$ | $\sum_x(\sqrt{p}-\sqrt{q})^2$ | Symmetric; range $[0, 2]$ |
| Total variation | $\frac{1}{2}\lvert t-1\rvert$ | $\frac{1}{2}\sum_x\lvert p-q\rvert$ | Symmetric; a true metric; range $[0, 1]$ |
| Chi-squared | $(t-1)^2$ | $\sum_x(p-q)^2/q$ | Asymmetric; range $[0,\infty)$ |
| $\alpha$-divergence | $\frac{t^\alpha - \alpha t - (1-\alpha)}{\alpha(\alpha-1)}$ | General; limits give KL and reverse KL | Parameterizes all of the above |
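Every row of the table follows from the single definition $D_f(p\|q) = \sum_x q(x)\, f(p(x)/q(x))$, so one helper covers them all (a minimal sketch; the distributions are illustrative):

```python
import numpy as np

def f_divergence(p, q, f):
    """Generic Csiszar f-divergence: D_f(p || q) = sum_x q(x) * f(p(x) / q(x))."""
    return np.sum(q * f(p / q))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

kl        = f_divergence(p, q, lambda t: t * np.log(t))          # KL divergence
rev_kl    = f_divergence(p, q, lambda t: -np.log(t))             # reverse KL
hellinger = f_divergence(p, q, lambda t: (np.sqrt(t) - 1) ** 2)  # Hellinger row; range [0, 2]
tv        = f_divergence(p, q, lambda t: 0.5 * np.abs(t - 1))    # total variation
chi2      = f_divergence(p, q, lambda t: (t - 1) ** 2)           # chi-squared

print(kl, rev_kl, hellinger, tv, chi2)
```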

For AI - GAN losses. Generative Adversarial Networks (Goodfellow et al., 2014) use different f-divergences as training objectives: the original GAN uses Jensen-Shannon; f-GAN (Nowozin et al., 2016) generalizes to arbitrary f-divergences. This provides a principled family of training objectives for generative models.

7.2 Properties of f-Divergences

All f-divergences share four key properties (when $f$ is convex with $f(1) = 0$):

  1. Non-negativity: $D_f(p\|q) \ge 0$, with equality iff $p = q$ (when $f$ is strictly convex).
  2. Data processing inequality: $D_f(p_T\|q_T) \le D_f(p\|q)$ for any stochastic kernel $T$.
  3. No triangle inequality in general.
  4. Convexity: $D_f$ is jointly convex in $(p,q)$ when $f$ is convex.

The data processing inequality is a theorem for all f-divergences: processing can only reduce distinguishability between distributions.

Total variation as special case. The total variation distance $\mathrm{TV}(p,q) = \frac{1}{2}\|p - q\|_1 = \frac{1}{2}\sum_x |p(x) - q(x)|$ satisfies the triangle inequality and is a genuine metric. It bounds other f-divergences: Pinsker's inequality relates KL and TV:

$$\mathrm{TV}(p,q) \le \sqrt{\frac{1}{2}D_{\mathrm{KL}}(p\|q)}$$

This is the most important inequality connecting KL divergence to total variation. Pinsker's inequality is used to convert KL bounds (from optimization) into total variation bounds (which control probabilities of events).
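Pinsker's inequality is easy to sanity-check numerically over random distribution pairs (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

for _ in range(1000):
    p = rng.dirichlet(np.ones(5))        # random distribution on 5 symbols
    q = rng.dirichlet(np.ones(5))
    kl = np.sum(p * np.log(p / q))
    tv = 0.5 * np.sum(np.abs(p - q))
    assert tv <= np.sqrt(0.5 * kl) + 1e-12   # Pinsker's inequality
print("Pinsker holds on all sampled pairs")
```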

7.3 Renyi Divergence

For $\alpha \in (0,1) \cup (1,\infty)$, the order-$\alpha$ Renyi divergence is:

$$D_\alpha(p \| q) = \frac{1}{\alpha - 1}\log\sum_x p(x)^\alpha\, q(x)^{1-\alpha}$$

Limiting cases:

  • $\alpha \to 1$: $D_\alpha(p\|q) \to D_{\mathrm{KL}}(p\|q)$ (L'Hopital's rule)
  • $\alpha \to 0$: $D_\alpha(p\|q) \to -\log\sum_x \mathbb{1}[p(x) > 0]\, q(x)$ (support-based)
  • $\alpha \to \infty$: $D_\alpha(p\|q) \to \log\max_x p(x)/q(x)$ (max-ratio)
  • $\alpha = 1/2$: related to the Bhattacharyya distance; $D_{1/2}(p\|q) = -2\log\sum_x\sqrt{p(x)q(x)}$

Properties: $D_\alpha(p\|q) \ge 0$; $D_\alpha(p\|p) = 0$; $D_\alpha$ is monotone increasing in $\alpha$; satisfies the data processing inequality; not symmetric.

For AI - Differential privacy. Renyi differential privacy (Mironov, 2017) quantifies privacy loss using Renyi divergence: a mechanism $M$ is $(\alpha, \varepsilon)$-RDP if $D_\alpha(M(S)\|M(S')) \le \varepsilon$ for any adjacent datasets $S, S'$. Renyi DP composes additively: $\varepsilon_{\mathrm{total}} = \varepsilon_1 + \varepsilon_2$. This makes it the tool of choice for privacy accounting in large-scale training with DP-SGD (used in Google's federated learning and some LLM fine-tuning pipelines).
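A minimal sketch of the Renyi divergence itself, showing monotonicity in $\alpha$ and the $\alpha \to 1$ limit (the distributions are illustrative; this is not a DP accountant):

```python
import numpy as np

def renyi(p, q, alpha):
    """Order-alpha Renyi divergence D_alpha(p || q) in nats (alpha != 1)."""
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
kl = np.sum(p * np.log(p / q))

for a in [0.5, 0.9, 0.999, 1.001, 2.0, 10.0]:
    print(a, renyi(p, q, a))        # increasing in alpha, approaching KL near alpha = 1
print("KL (alpha -> 1 limit):", kl)
```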

7.4 Jensen-Shannon Divergence

The Jensen-Shannon Divergence (JSD) is the symmetrized, bounded variant of KL:

$$\mathrm{JSD}(p \| q) = \frac{1}{2}D_{\mathrm{KL}}\!\left(p \,\Big\|\, \frac{p+q}{2}\right) + \frac{1}{2}D_{\mathrm{KL}}\!\left(q \,\Big\|\, \frac{p+q}{2}\right)$$

where $m = (p+q)/2$ is the mixture distribution.

Properties:

  • $\mathrm{JSD}(p\|q) = \mathrm{JSD}(q\|p)$ - symmetric
  • $0 \le \mathrm{JSD}(p\|q) \le \ln 2$ - bounded (in nats)
  • $\sqrt{\mathrm{JSD}}$ is a metric (satisfies the triangle inequality)
  • $\mathrm{JSD}(p\|q) = H(m) - \frac{1}{2}H(p) - \frac{1}{2}H(q)$ - entropy of the mixture minus the average entropy
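Both the two-KL definition and the entropy identity in the last bullet are easy to verify numerically (a minimal sketch with illustrative distributions):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def entropy(p):
    return -np.sum(p * np.log(p))

p = np.array([0.9, 0.1])
q = np.array([0.2, 0.8])
m = 0.5 * (p + q)

jsd_via_kl      = 0.5 * kl(p, m) + 0.5 * kl(q, m)
jsd_via_entropy = entropy(m) - 0.5 * entropy(p) - 0.5 * entropy(q)

assert np.isclose(jsd_via_kl, jsd_via_entropy)
print(jsd_via_kl, "<= ln 2 =", np.log(2))   # bounded above by ln 2
```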

For AI - Original GAN objective. The original GAN (Goodfellow et al., 2014) with an optimal discriminator minimizes the Jensen-Shannon divergence between the generator distribution $p_G$ and the data distribution $p_{\mathrm{data}}$:

$$\min_G \max_D V(D, G) = 2\,\mathrm{JSD}(p_{\mathrm{data}} \| p_G) - 2\ln 2$$

The optimal discriminator is $D^*(x) = p_{\mathrm{data}}(x)/(p_{\mathrm{data}}(x) + p_G(x))$. However, when $p_{\mathrm{data}}$ and $p_G$ have disjoint supports, $\mathrm{JSD} = \ln 2$ (its maximum) regardless of how far apart the distributions are - the discriminator saturates and gradients vanish. This vanishing-gradient problem, together with the mode collapse it encourages, motivated WGAN (Arjovsky et al., 2017), which uses the Wasserstein distance and does not require absolute continuity.

8. Information Geometry of KL

8.1 KL as a Bregman Divergence

A Bregman divergence generated by a strictly convex function $\phi$ is:

$$B_\phi(p, q) = \phi(p) - \phi(q) - \nabla\phi(q)^\top(p - q)$$

This is the gap between $\phi$ at $p$ and its first-order Taylor approximation at $q$ - always $\ge 0$ by convexity.

KL is a Bregman divergence. Taking $\phi(p) = \sum_x p(x)\ln p(x)$ (the negative entropy, which is strictly convex), we have $\nabla\phi(q)_x = \ln q(x) + 1$ and

$$B_\phi(p, q) = \sum_x p(x)\ln p(x) - \sum_x q(x)\ln q(x) - \sum_x (\ln q(x) + 1)(p(x) - q(x))$$

$$= \sum_x p(x)\ln p(x) - \sum_x p(x)\ln q(x) + \sum_x (q(x) - p(x)) = \sum_x p(x)\ln\frac{p(x)}{q(x)} = D_{\mathrm{KL}}(p\|q)$$

where the final step uses $\sum_x (q(x) - p(x)) = 0$, since both distributions are normalized.

This Bregman representation reveals why KL is asymmetric: Bregman divergences are generally not symmetric because the Taylor approximation is computed at $q$, not at $p$.

Exponential family connection. For an exponential family with log-partition function $A(\boldsymbol{\eta})$, the Bregman divergence $B_A(\boldsymbol{\eta}_2, \boldsymbol{\eta}_1)$ equals $D_{\mathrm{KL}}(p_{\boldsymbol{\eta}_1}\|p_{\boldsymbol{\eta}_2})$ (note the reversed argument order). This is the source of the elegant formula in Section 5.3.
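The identity $B_\phi(p, q) = D_{\mathrm{KL}}(p\|q)$ for $\phi$ equal to negative entropy can be checked directly (a minimal sketch with illustrative distributions):

```python
import numpy as np

def neg_entropy(p):
    return np.sum(p * np.log(p))

def bregman_neg_entropy(p, q):
    """B_phi(p, q) with phi = negative entropy; its gradient is log q + 1."""
    grad_q = np.log(q) + 1
    return neg_entropy(p) - neg_entropy(q) - np.dot(grad_q, p - q)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
kl = np.sum(p * np.log(p / q))

assert np.isclose(bregman_neg_entropy(p, q), kl)
print(kl)
```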

8.2 I-Projection and M-Projection

Definition. Let $\mathcal{M}$ be a constraint set (e.g., a parametric family of distributions):

  • I-projection (information projection): $q^* = \arg\min_{q \in \mathcal{M}} D_{\mathrm{KL}}(q \| p)$ - the distribution in $\mathcal{M}$ closest to $p$ in reverse KL
  • M-projection (moment projection): $q^* = \arg\min_{q \in \mathcal{M}} D_{\mathrm{KL}}(p \| q)$ - the distribution in $\mathcal{M}$ closest to $p$ in forward KL

For the exponential family:

  • The M-projection always produces the moment-matched distribution: $\mathbb{E}_{q^*}[\mathbf{t}(X)] = \mathbb{E}_p[\mathbf{t}(X)]$
  • The I-projection onto an exponential family (a convex set of mean parameters) produces the distribution with the correct natural parameters for those means

EM algorithm as alternating projections. The EM algorithm alternates:

  • E-step: Compute $q^{(t)}(\mathbf{z}) = p(\mathbf{z}|\mathbf{x}; \boldsymbol{\theta}^{(t)})$ - an I-projection onto the exact posterior of the current model
  • M-step: $\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} \mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)})$ - an M-projection of $q^{(t)}$ back onto the parametric family

Each step decreases $D_{\mathrm{KL}}(q\|p)$ or $D_{\mathrm{KL}}(p_{\text{data}}\|p_{\boldsymbol{\theta}})$ respectively, guaranteeing convergence to a stationary point.

8.3 The Pythagorean Theorem for KL

Theorem (Pythagorean theorem for I-projections). Let $\mathcal{E}$ be an exponential family (a convex set in mean parameter space) and let $q^* = \arg\min_{q \in \mathcal{E}} D_{\mathrm{KL}}(q \| p)$ be the I-projection of $p$ onto $\mathcal{E}$. Then for any $r \in \mathcal{E}$:

$$D_{\mathrm{KL}}(r \| p) = D_{\mathrm{KL}}(r \| q^*) + D_{\mathrm{KL}}(q^* \| p)$$

This is an exact equality - there is no approximation; for general constraint sets one only gets the inequality $D_{\mathrm{KL}}(r \| p) \ge D_{\mathrm{KL}}(r \| q^*) + D_{\mathrm{KL}}(q^* \| p)$.

Interpretation. The KL distance from any $r$ in the constraint set to the target $p$ decomposes into: (1) the distance from $r$ to the projection $q^*$, plus (2) the distance from the projection to $p$. The "closest" point $q^*$ acts like the foot of a perpendicular. This is called "Pythagorean" because the geometry is orthogonal in the KL sense.

PYTHAGOREAN THEOREM FOR KL


          p   (target distribution, outside set E)
          |
          |   D_KL(q* || p)
          |
          q* ------ r        (both in constraint set E)
              D_KL(r || q*)

    D_KL(r || p) = D_KL(r || q*) + D_KL(q* || p)

    (exact for I-projections onto exponential families)


For AI: Variational inference's ELBO maximization is equivalent to minimizing $D_{\mathrm{KL}}(q\|p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x}))$ - an I-projection. The Pythagorean theorem shows that the ELBO gap at the optimum is exactly $D_{\mathrm{KL}}(q^*\|p)$ - a clean statement of how much information is lost by restricting to the variational family.

8.4 Statistical Manifolds and Natural Gradient

The space of probability distributions over $\mathcal{X}$ forms a statistical manifold, with local coordinates given by the natural parameters $\boldsymbol{\eta}$ (for exponential families). The KL divergence induces a Riemannian metric on this manifold, the Fisher information metric:

$$g_{ij}(\boldsymbol{\eta}) = \frac{\partial^2 D_{\mathrm{KL}}(p_{\boldsymbol{\eta}} \| p_{\boldsymbol{\eta}'})}{\partial\eta_i'\, \partial\eta_j'}\bigg|_{\boldsymbol{\eta}'=\boldsymbol{\eta}} = F_{ij}(\boldsymbol{\eta})$$

where $F(\boldsymbol{\eta})$ is the Fisher information matrix. The Riemannian structure makes the "natural" step size in parameter space match the actual KL distance between distributions.

Natural gradient. In Euclidean parameter space, gradient descent steps $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\nabla\mathcal{L}$ are sized by the Euclidean metric on $\boldsymbol{\theta}$-space, which does not correspond to any natural metric on distribution space. The natural gradient corrects for this:

$$\tilde{\nabla}\mathcal{L} = F(\boldsymbol{\theta})^{-1}\nabla\mathcal{L}$$

Natural gradient descent converges much faster on problems where the Fisher information is highly non-uniform (i.e., where the Euclidean metric on parameters is poorly calibrated to KL distance on distributions). This is the foundation of K-FAC (Martens & Grosse, 2015), used for second-order optimization of neural networks.

-> Full treatment of Fisher information and natural gradient: Section 09-05 Fisher Information

9. Applications in Machine Learning

9.1 Maximum Likelihood = Minimizing KL

Theorem. For a parametric model $p_{\boldsymbol{\theta}}$ and empirical data distribution $\hat{p}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[X^{(i)} = x]$:

$$\arg\max_{\boldsymbol{\theta}} \sum_{i=1}^n \log p_{\boldsymbol{\theta}}(x^{(i)}) = \arg\min_{\boldsymbol{\theta}} D_{\mathrm{KL}}(\hat{p}_n \| p_{\boldsymbol{\theta}})$$

Proof. Using $D_{\mathrm{KL}}(\hat{p}_n \| p_{\boldsymbol{\theta}}) = \mathbb{E}_{\hat{p}}[\log(\hat{p}(x)/p_{\boldsymbol{\theta}}(x))] = -H(\hat{p}_n) - \mathbb{E}_{\hat{p}}[\log p_{\boldsymbol{\theta}}(x)]$, minimizing over $\boldsymbol{\theta}$ drops the constant $-H(\hat{p}_n)$:

$$\arg\min_{\boldsymbol{\theta}} D_{\mathrm{KL}}(\hat{p}_n\|p_{\boldsymbol{\theta}}) = \arg\max_{\boldsymbol{\theta}} \mathbb{E}_{\hat{p}}[\log p_{\boldsymbol{\theta}}(X)] = \arg\max_{\boldsymbol{\theta}} \frac{1}{n}\sum_i \log p_{\boldsymbol{\theta}}(x^{(i)}) \quad\square$$

Implications:

  • Every LLM trained by next-token prediction (GPT-4, LLaMA-3, Claude, Gemini) minimizes $D_{\mathrm{KL}}(p_{\mathrm{data}}\|p_{\boldsymbol{\theta}})$
  • The MLE estimate is not "arbitrary" - it has a precise information-theoretic interpretation as the distribution in the model family closest to the data in forward KL
  • Coverage: MLE must cover the entire data distribution; if $p_{\mathrm{data}}(x) > 0$ but $p_{\boldsymbol{\theta}}(x) \approx 0$, the loss is large

9.2 Variational Autoencoders

The VAE (Kingma & Welling, 2014) introduces a latent variable $\mathbf{z}$ with prior $p(\mathbf{z})$ and likelihood $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$. The true log-likelihood $\log p_{\boldsymbol{\theta}}(\mathbf{x})$ is intractable (it requires integrating over $\mathbf{z}$). The VAE introduces an encoder $q_\phi(\mathbf{z}|\mathbf{x})$ and derives a tractable lower bound via the KL decomposition:

$$\log p_{\boldsymbol{\theta}}(\mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi}[\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})] - D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))}_{\mathcal{L}(\boldsymbol{\phi},\boldsymbol{\theta};\mathbf{x}) \,=\, \text{ELBO}} + \underbrace{D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x}))}_{\ge 0}$$

The ELBO has two terms:

  1. Reconstruction term $\mathbb{E}_{q_\phi}[\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})]$: how well the decoder reconstructs from the latent code. Analogous to negative distortion.
  2. KL regularizer $D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x})\|p(\mathbf{z}))$: how much the approximate posterior deviates from the prior. For a Gaussian encoder and standard Gaussian prior it has the closed form $\frac{1}{2}\sum_j(\mu_j^2 + \sigma_j^2 - \ln\sigma_j^2 - 1)$; see the sketch after this list.
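For the diagonal-Gaussian case, the KL regularizer has exactly the closed form quoted in item 2 (a minimal sketch; `mu` and `log_var` stand in for an encoder's outputs and are illustrative values):

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1)

mu      = np.array([0.3, -1.2, 0.0])   # encoder means (illustrative)
log_var = np.array([-0.5, 0.1, 0.0])   # encoder log-variances (illustrative)

print(gaussian_kl_to_standard_normal(mu, log_var))
# Equals zero exactly when mu = 0 and log_var = 0, i.e. posterior = prior.
```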

Practical training: The ELBO is maximized by gradient ascent on both parameter sets jointly:

  • w.r.t. $\boldsymbol{\phi}$: encoder parameters, using the reparameterization trick
  • w.r.t. $\boldsymbol{\theta}$: decoder parameters, via direct backpropagation

VAE variants using KL: beta-VAE ($\beta > 1$ weight on the KL term for disentanglement), WAE (Wasserstein autoencoder, replaces KL with MMD), VQ-VAE (discrete latent, no KL), InfoVAE (adds an MI term).

9.3 RLHF, PPO, and DPO

RLHF objective. Reinforcement Learning from Human Feedback (Christiano et al., 2017; InstructGPT, Ouyang et al., 2022) trains a policy $\pi_{\boldsymbol{\theta}}$ to maximize expected reward while staying close to the reference policy $\pi_{\mathrm{ref}}$ (the pretrained LLM):

$$\max_{\pi_{\boldsymbol{\theta}}} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\boldsymbol{\theta}}(\cdot|x)}\left[r(x, y) - \beta\, D_{\mathrm{KL}}(\pi_{\boldsymbol{\theta}}(\cdot|x) \| \pi_{\mathrm{ref}}(\cdot|x))\right]$$

where $r(x,y)$ is the reward model score and $\beta > 0$ is the KL coefficient. Without the KL penalty, the policy collapses into reward hacking: generating short, repetitive text that maximizes reward scores but is incoherent or unnatural.

Optimal policy. The RLHF objective has a closed-form optimal solution (for fixed $r$):

$$\pi^*(y|x) = \frac{1}{Z(x)}\pi_{\mathrm{ref}}(y|x)\, e^{r(x,y)/\beta}$$

where $Z(x) = \sum_y \pi_{\mathrm{ref}}(y|x)\,e^{r(x,y)/\beta}$ is the normalizing partition function. The optimal policy exponentially upweights sequences with high reward relative to the reference.
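The closed form is easy to evaluate on a toy two-response example (a minimal sketch; the numbers are the ones reused in Exercise 7(d) below):

```python
import numpy as np

beta   = 0.1
pi_ref = np.array([0.6, 0.4])     # reference policy over two responses A, B
reward = np.array([2.0, 0.0])     # reward model scores (illustrative)

# Closed-form optimum: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)
unnormalized = pi_ref * np.exp(reward / beta)
pi_star = unnormalized / unnormalized.sum()

kl = np.sum(pi_star * np.log(pi_star / pi_ref))
print(pi_star, kl)
# Larger beta keeps pi* close to pi_ref (small KL); smaller beta chases the reward.
```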

DPO (Direct Preference Optimization, Rafailov et al., 2023). DPO solves the RLHF objective directly from preference data, without training an explicit reward model. The key insight: substituting the optimal-policy formula into the reward-learning objective yields a loss that involves only the policy, not the reward model:

$$\mathcal{L}_{\mathrm{DPO}}(\boldsymbol{\theta}) = -\mathbb{E}\!\left[\log\sigma\!\left(\beta\ln\frac{\pi_{\boldsymbol{\theta}}(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta\ln\frac{\pi_{\boldsymbol{\theta}}(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}\right)\right]$$

where $y_w$ is the preferred (winning) response and $y_l$ the dispreferred (losing) response. The log-ratio $\log(\pi_{\boldsymbol{\theta}}/\pi_{\mathrm{ref}})$ is the implicit reward that the KL-constrained policy learns. DPO is widely used in practice for preference fine-tuning of LLMs (Mixtral-Instruct, LLaMA-3, and many others).

PPO trust region. PPO (Schulman et al., 2017) approximates the KL constraint with a clipping objective:

$$\mathcal{L}_{\mathrm{CLIP}}(\boldsymbol{\theta}) = \mathbb{E}_t\!\left[\min\!\left(r_t(\boldsymbol{\theta})\hat{A}_t,\; \mathrm{clip}(r_t(\boldsymbol{\theta}),\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_t\right)\right]$$

where $r_t(\boldsymbol{\theta}) = \pi_{\boldsymbol{\theta}}(a_t|s_t)/\pi_{\mathrm{old}}(a_t|s_t)$ is the probability ratio. Clipping at $1 \pm \varepsilon$ prevents large policy updates, approximating the KL trust-region constraint $D_{\mathrm{KL}}(\pi_{\mathrm{old}}\|\pi_{\boldsymbol{\theta}}) \le \delta$.

9.4 Knowledge Distillation

Knowledge distillation (Hinton, Vinyals & Dean, 2015) transfers knowledge from a large teacher model $p_T$ to a smaller student model $p_S$. The distillation loss uses forward KL (from teacher to student):

$$\mathcal{L}_{\mathrm{distill}} = D_{\mathrm{KL}}(p_T^{(\tau)} \| p_S) = \sum_k p_T^{(\tau)}(k)\ln\frac{p_T^{(\tau)}(k)}{p_S(k)}$$

where $p_T^{(\tau)}(k) = \mathrm{softmax}(\mathbf{z}_T/\tau)_k$ is the teacher's softened distribution at temperature $\tau > 1$.
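A minimal sketch of the softened distributions and the forward-KL distillation loss (the logits are the illustrative values reused in Exercise 8 below):

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

z_teacher = np.array([3.0, 1.0, -2.0])   # teacher logits (illustrative)
z_student = np.array([2.0, 0.0, -1.0])   # student logits (illustrative)
tau = 3.0                                 # temperature > 1 softens both distributions

p_T = softmax(z_teacher / tau)
p_S = softmax(z_student / tau)

print(p_T, p_S, kl(p_T, p_S))   # forward KL: the expectation is under the teacher
```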

Why forward KL (teacher -> student)? The student must cover all of the teacher's probability mass, including the "dark knowledge" - the non-zero probabilities assigned to incorrect classes that encode the teacher's generalization patterns. If a teacher assigns 60% to "cat," 30% to "tiger," and 10% to "dog," the soft labels teach the student that cats look more like tigers than dogs. One-hot labels destroy this information.

Forward vs reverse KL choice in distillation. Using reverse KL $D_{\mathrm{KL}}(p_S\|p_T)$ would allow the student to become mode-seeking - focusing only on the teacher's highest-probability output and ignoring the distributional shape. Forward KL forces the student to match the full distribution, preserving the teacher's calibration.

Modern LLM distillation. DistilBERT (Sanh et al., 2019) distills BERT using forward KL plus a cosine-similarity loss on hidden states. The Phi models (Phi-1, Gunasekar et al., 2023; Phi-2, Microsoft, 2023) are trained largely on synthetic data generated by stronger models - a sequence-level form of distillation. The small LLaMA-3.2 models were pruned and distilled from larger LLaMA checkpoints. The forward KL direction is standard.

9.5 Normalizing Flows and Diffusion Models

Normalizing flows learn an invertible transformation $f_{\boldsymbol{\theta}}: \mathbb{R}^d \to \mathbb{R}^d$ such that $\mathbf{x} = f_{\boldsymbol{\theta}}(\mathbf{z})$ where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$. The exact change-of-variables formula gives:

$$\log p_{\boldsymbol{\theta}}(\mathbf{x}) = \log p_{\mathbf{z}}(f_{\boldsymbol{\theta}}^{-1}(\mathbf{x})) + \log\left|\det J_{f_{\boldsymbol{\theta}}^{-1}}(\mathbf{x})\right|$$

Training by maximum likelihood minimizes $D_{\mathrm{KL}}(p_{\mathrm{data}}\|p_{\boldsymbol{\theta}})$ (forward KL), since the log-likelihood is tractable. This contrasts with GANs, which use a discriminator to avoid computing $\log p_{\boldsymbol{\theta}}$ directly.

Diffusion models (DDPM, Ho et al., 2020). The DDPM ELBO is a sum of KL terms:

$$-\log p_{\boldsymbol{\theta}}(\mathbf{x}_0) \le \mathcal{L} = D_{\mathrm{KL}}(q(\mathbf{x}_T|\mathbf{x}_0)\|p(\mathbf{x}_T)) + \sum_{t=2}^T D_{\mathrm{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\|p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)) - \log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)$$

Each reverse diffusion step $p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is trained to minimize the KL divergence from the (analytically tractable) true reverse step $q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$. Because both distributions are Gaussian (by the Markov structure), each per-step KL has a closed-form expression in terms of the noise predictor $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}$. The simplified training objective is equivalent to predicting the noise $\boldsymbol{\epsilon}$ - the standard diffusion training loss.
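Since each per-step KL is between Gaussians, it has the closed form derived in Exercise 2 below (a minimal sketch for the scalar case; all numbers are illustrative):

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """D_KL( N(mu1, var1) || N(mu2, var2) ) for scalar Gaussians, in nats."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1)

# Illustrative per-step means/variances for q(x_{t-1}|x_t, x_0) and p_theta(x_{t-1}|x_t)
mu_q, var_q         = 0.80, 0.04
mu_theta, var_theta = 0.75, 0.04   # DDPM fixes the reverse variance; only the mean is learned

print(kl_gauss(mu_q, var_q, mu_theta, var_theta))
# With equal variances this reduces to (mu_q - mu_theta)^2 / (2 * var),
# which is why DDPM training reduces to a weighted mean / noise-prediction loss.
```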

10. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
| --- | --- | --- | --- |
| 1 | Writing $D_{\mathrm{KL}}(q \Vert p)$ when you mean forward KL (truth -> approximation) | The first argument is the reference (truth); forward KL is $D_{\mathrm{KL}}(p_{\mathrm{truth}} \Vert p_{\mathrm{approx}})$ | Always identify which distribution is the truth and which is the approximation; check: is the expectation under the truth or the approximation? |
| 2 | Assuming KL is symmetric: $D_{\mathrm{KL}}(p\Vert q) = D_{\mathrm{KL}}(q\Vert p)$ | KL is not symmetric; the two directions have fundamentally different behaviors | Never swap $p$ and $q$ without checking both directions; use the symmetrized (Jeffreys) divergence or JSD if you need symmetry |
| 3 | Ignoring support mismatch: computing $D_{\mathrm{KL}}$ when $q(x) = 0$ for some $x$ with $p(x) > 0$ | $D_{\mathrm{KL}}(p\Vert q) = +\infty$ when $p \not\ll q$; finite computed values are meaningless | Add Laplace smoothing or use $\mathrm{JSD}$ / $\mathrm{TV}$ for distributions with potentially different supports |
| 4 | Treating $D_{\mathrm{KL}}(p\Vert q) = 0$ as equivalent to $p = q$ under numerical approximations | Finite-precision computation can give $D_{\mathrm{KL}} \approx 0$ even when $p \ne q$, due to rounding | Use $D_{\mathrm{KL}} \le \epsilon$ (small threshold) rather than exact zero; check sample moments separately |
| 5 | Using KL as a distance metric satisfying the triangle inequality | $D_{\mathrm{KL}}(p\Vert r)$ can exceed $D_{\mathrm{KL}}(p\Vert q) + D_{\mathrm{KL}}(q\Vert r)$; KL is not a metric | Use total variation or $\sqrt{\mathrm{JSD}}$ if you need a proper metric |
| 6 | Confusing minimizing $D_{\mathrm{KL}}(p\Vert q)$ over $q$ with minimizing over $p$ | MLE minimizes forward KL over $\boldsymbol{\theta}$ (model parameters); variational inference minimizes reverse KL over $q$ | Always identify which argument is the optimization variable |
| 7 | Computing cross-entropy loss and calling it KL divergence | Cross-entropy $H(p,q) = H(p) + D_{\mathrm{KL}}(p\Vert q)$; they differ by $H(p)$ | They're equal only when $p$ is one-hot ($H(p) = 0$); for soft labels, $H(p,q) > D_{\mathrm{KL}}(p\Vert q)$ |
| 8 | Expecting the VAE KL term to be zero at the optimum | The KL term equals $D_{\mathrm{KL}}(q_\phi(\mathbf{z}\vert\mathbf{x})\Vert p(\mathbf{z}))$; it is zero only if the posterior equals the prior (posterior collapse) | Expect a positive KL term in a healthy VAE; if it collapses to zero, use KL annealing or $\beta$-weighting |
| 9 | Using forward KL for variational inference | Forward KL $\mathbb{E}_p[\log(p/q)]$ requires samples from the intractable $p(\mathbf{z}\vert\mathbf{x})$; the ELBO uses reverse KL, which only requires samples from $q$ | Use the reverse-KL ELBO, which needs samples from $q$ alone |
| 10 | Claiming the RLHF KL penalty "forces" the model to stay close to the reference | The KL penalty softly penalizes deviation; with a large enough reward, the policy can still deviate significantly | The KL penalty is a soft regularizer, not a hard constraint; monitor $D_{\mathrm{KL}}(\pi\Vert\pi_{\mathrm{ref}})$ during training |
| 11 | Conflating Renyi divergence with KL divergence | Renyi $D_\alpha(p\Vert q)$ only equals KL in the limit $\alpha \to 1$; for $\alpha \ne 1$ they have different properties | Use Renyi divergence when differential privacy accounting is needed; use KL for standard information-theoretic arguments |
| 12 | Forgetting the $\ln 2$ factor when converting between nats and bits | $D_{\mathrm{KL}}$ in nats $= D_{\mathrm{KL}}$ in bits $\times \ln 2 \approx 0.693 \times$ bits | Always specify units; when comparing with cross-entropy in bits (binary classification), convert to nats or vice versa |

11. Exercises

Exercise 1 (*) - Non-negativity and equality condition. Let $p = (0.4, 0.3, 0.2, 0.1)$ and $q = (0.1, 0.2, 0.3, 0.4)$.

(a) Compute $D_{\mathrm{KL}}(p\|q)$ and $D_{\mathrm{KL}}(q\|p)$ by hand (use $\ln$; give exact values and numerical approximations).

(b) Verify that both are non-negative. Which is larger?

(c) Find a distribution $r$ on four symbols such that $D_{\mathrm{KL}}(p\|r) = 0$. Is $r$ unique?

(d) Using only the convexity of $-\ln$, write a self-contained proof that $D_{\mathrm{KL}}(p\|q) \ge 0$ for your specific $p$ and $q$.


Exercise 2 (*) - KL between Gaussians. Let $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$.

(a) Starting from the definition, derive the closed form:

$$D_{\mathrm{KL}}(p\|q) = \ln\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

(b) Compute $D_{\mathrm{KL}}(\mathcal{N}(1,2)\|\mathcal{N}(0,1))$ and $D_{\mathrm{KL}}(\mathcal{N}(0,1)\|\mathcal{N}(1,2))$. Are they equal?

(c) Show that when $\sigma_1 = \sigma_2 = \sigma$, the formula reduces to $D_{\mathrm{KL}}(p\|q) = (\mu_1-\mu_2)^2/(2\sigma^2)$ - the squared Mahalanobis distance.

(d) For the VAE case $p = \mathcal{N}(\mu, \sigma^2)$, $q = \mathcal{N}(0,1)$: verify the simplified formula $\frac{1}{2}(\mu^2 + \sigma^2 - \ln\sigma^2 - 1)$ and check that it is 0 when $\mu = 0$, $\sigma = 1$.


Exercise 3 (*) - Forward vs reverse KL. Let $p(x) = 0.5\,\mathcal{N}(x;-2,0.25) + 0.5\,\mathcal{N}(x;2,0.25)$ (bimodal) and $q_{\mu,\sigma^2}(x) = \mathcal{N}(x;\mu,\sigma^2)$.

(a) Show that the minimizer of $D_{\mathrm{KL}}(p\|q_{\mu,\sigma^2})$ over $(\mu,\sigma^2)$ is $\mu^* = 0$, $\sigma^{*2} = 4.25$. (Hint: moment matching.)

(b) Numerically estimate $D_{\mathrm{KL}}(q_\mu\|p)$ on a grid of $\mu \in [-3, 3]$ using 1000 samples. Show the two local minima near $\mu = \pm 2$.

(c) Explain in one paragraph: if you were training a VAE-style generative model and the true data distribution is $p$, which direction would you minimize, and what would the generator look like?


Exercise 4 (**) - Chain rule decomposition. Let $P(X, Y)$ be a joint distribution on $\{0,1\}^2$ with $P(0,0) = 0.4$, $P(0,1) = 0.1$, $P(1,0) = 0.1$, $P(1,1) = 0.4$, and let $Q(X,Y)$ be uniform on $\{0,1\}^2$ ($= 0.25$ each).

(a) Compute $D_{\mathrm{KL}}(P(X,Y)\|Q(X,Y))$ directly.

(b) Compute $D_{\mathrm{KL}}(P_X\|Q_X)$ and $\mathbb{E}_{P_X}[D_{\mathrm{KL}}(P_{Y|X}\|Q_{Y|X})]$ separately.

(c) Verify that (a) $=$ (b): the chain rule holds.

(d) Interpret: which part (marginal or conditional) contributes more to the total KL? What does this mean about the structure of $P$ vs $Q$?


Exercise 5 (**) - f-Divergence family. Consider divergences between the Bernoulli distributions $p = \mathrm{Bern}(\theta)$ and $q = \mathrm{Bern}(0.5)$ as a function of $\theta \in (0,1)$.

(a) Compute and plot (as functions of $\theta$): KL divergence $D_{\mathrm{KL}}(p\|q)$, reverse KL $D_{\mathrm{KL}}(q\|p)$, total variation $\mathrm{TV}(p,q)$, and Jensen-Shannon $\mathrm{JSD}(p,q)$.

(b) Verify that Pinsker's inequality $\mathrm{TV}(p,q)^2 \le \frac{1}{2}D_{\mathrm{KL}}(p\|q)$ holds for all $\theta$.

(c) Which divergences are symmetric? Which are bounded? Which are asymmetric?

(d) For $\theta = 0.01$ (near-deterministic $p$): compute all four divergences. Why does the reverse KL $D_{\mathrm{KL}}(q\|p)$ blow up as $\theta \to 0$ while JSD stays bounded?


Exercise 6 (**) - ELBO and KL decomposition. A simple VAE has prior $p(z) = \mathcal{N}(0,1)$ and likelihood $p_\theta(x|z) = \mathcal{N}(z, 0.1)$. The encoder is $q_\phi(z|x) = \mathcal{N}(\phi \cdot x, 0.5)$ for a scalar parameter $\phi$.

(a) For $x = 2$: compute the ELBO as a function of $\phi$. Use the Gaussian KL formula.

(b) Show that the ELBO decomposes as a reconstruction term $-\frac{(2 - 2\phi)^2 + 0.5}{2 \times 0.1}$ (plus constants) minus the KL term $\frac{1}{2}(4\phi^2 + 0.5 - \ln 0.5 - 1)$.

(c) Find the $\phi$ that maximizes the ELBO. What does this optimal encoder do geometrically?

(d) As the likelihood variance $\sigma^2 \to 0$ (the decoder becomes very powerful): what happens to the optimal $\phi$? Does posterior collapse occur?


Exercise 7 (**) - RLHF optimal policy. A language model has reference policy $\pi_{\mathrm{ref}}(y|x)$ and reward $r(x,y)$. The RLHF objective is $\max_\pi \mathbb{E}[r(x,y)] - \beta D_{\mathrm{KL}}(\pi\|\pi_{\mathrm{ref}})$.

(a) Show that the unconstrained optimizer is $\pi^*(y|x) = \frac{1}{Z(x)}\pi_{\mathrm{ref}}(y|x)e^{r(x,y)/\beta}$ by taking the functional derivative of the objective with respect to $\pi(y|x)$ (subject to $\sum_y \pi(y|x) = 1$).

(b) Show that at the optimum, the KL divergence is $D_{\mathrm{KL}}(\pi^*\|\pi_{\mathrm{ref}}) = \frac{1}{\beta}\mathbb{E}_{\pi^*}[r] - \log Z(x)$.

(c) Derive the DPO loss: given preference data $(x, y_w, y_l)$ where $y_w \succ y_l$, show that the Bradley-Terry preference model $P(y_w \succ y_l|x) = \sigma\!\left(\beta\log\frac{\pi^*(y_w|x)}{\pi^*(y_l|x)} - \beta\log\frac{\pi_{\mathrm{ref}}(y_w|x)}{\pi_{\mathrm{ref}}(y_l|x)}\right)$ leads to the DPO loss.

(d) Numerically: for $\beta = 0.1$, $\pi_{\mathrm{ref}}(A) = 0.6$, $\pi_{\mathrm{ref}}(B) = 0.4$, $r(A) = 2$, $r(B) = 0$: compute $\pi^*$ and the resulting $D_{\mathrm{KL}}(\pi^*\|\pi_{\mathrm{ref}})$.


Exercise 8 (**) - Knowledge distillation analysis. A teacher model outputs logits $\mathbf{z}_T = (3, 1, -2)$ for three classes and a student outputs $\mathbf{z}_S = (2, 0, -1)$.

(a) Compute $p_T = \mathrm{softmax}(\mathbf{z}_T)$ and $p_S = \mathrm{softmax}(\mathbf{z}_S)$.

(b) Compute $D_{\mathrm{KL}}(p_T\|p_S)$ (forward KL, what distillation minimizes) and $D_{\mathrm{KL}}(p_S\|p_T)$ (reverse KL).

(c) Now compute soft labels at temperature $\tau = 3$: $p_T^{(\tau)}$ and $p_S^{(\tau)}$. Recompute both KL divergences. How does temperature affect the magnitude of the KL and the information in the soft labels?

(d) What is the total distillation loss $\mathcal{L} = \lambda D_{\mathrm{KL}}(p_T^{(\tau)}\|p_S^{(\tau)}) + (1-\lambda)H(\mathbf{y}_{\mathrm{true}}, p_S)$ for $\lambda = 0.7$ and $\mathbf{y}_{\mathrm{true}} = (1, 0, 0)$?

(e) Argue why forward KL (teacher to student) is used rather than reverse KL. What kind of "dark knowledge" is preserved?

12. Why This Matters for AI (2026 Perspective)

| Concept | AI Impact |
| --- | --- |
| Forward KL = MLE | Every language model trained by next-token prediction (GPT-4, LLaMA-3, Claude, Gemini, Mistral) minimizes $D_{\mathrm{KL}}(p_{\mathrm{data}}\Vert p_{\boldsymbol{\theta}})$; the cross-entropy loss IS KL divergence up to the constant $H(p_{\mathrm{data}})$ |
| ELBO = KL decomposition | All VAE-based generative models (image, speech, latent diffusion) optimize reconstruction + KL; the KL term determines latent space structure |
| RLHF KL penalty | Instruction-tuned LLMs (InstructGPT, Claude, Gemini) use $\beta D_{\mathrm{KL}}(\pi\Vert\pi_{\mathrm{ref}})$ to prevent reward hacking; $\beta$ is one of the most important hyperparameters in alignment |
| DPO implicit KL | DPO (used in Mixtral-Instruct, LLaMA-3, and many other chat models) reframes RLHF as direct KL-constrained optimization; the log-ratio $\log(\pi/\pi_{\mathrm{ref}})$ IS the implicit reward |
| Forward vs reverse KL | Choosing forward KL (MLE, distillation) vs reverse KL (VI, mean-field) is a defining design decision of probabilistic ML; the wrong choice -> blurry outputs or posterior collapse |
| Knowledge distillation | LLM compression (DistilGPT-2, Phi-2, Phi-3) uses forward KL from teacher to student; soft labels at temperature $\tau > 1$ carry 10-100x more information than one-hot labels |
| Posterior collapse | When the decoder is powerful, the VAE KL term collapses to zero - a major pathology in text VAEs (Bowman et al., 2016); fixed by KL annealing or $\beta$-VAE weighting |
| PPO trust region | PPO's clip objective approximates a KL trust region; the $\varepsilon \in \{0.1, 0.2\}$ clip threshold corresponds to a specific KL budget per update step |
| GAN = JSD minimization | The original GAN (Goodfellow et al., 2014) minimizes JSD; JSD's saturation when supports are disjoint causes vanishing gradients and mode collapse - motivating WGAN / Wasserstein distance |
| Renyi DP for private training | Differential privacy in LLM fine-tuning (DP-SGD, used at Google) uses Renyi divergence; Renyi DP composes additively, making privacy budgets tractable |
| Diffusion ELBO | The DDPM training objective is a sum of per-step KL divergences; the connection to score matching makes the KL formulation equivalent to noise prediction |
| Natural gradient | K-FAC (a second-order optimizer for neural networks) approximates the Fisher matrix $F = \mathbb{E}[\nabla\log p\, \nabla\log p^\top]$, which is the local KL curvature; natural gradient descent is the geometrically correct first-order method on the statistical manifold |
| Calibration | $D_{\mathrm{KL}}(p_{\mathrm{true}}\Vert p_{\boldsymbol{\theta}}) = 0$ iff the model is perfectly calibrated; calibration techniques (temperature scaling, label smoothing) implicitly minimize KL between predicted and calibrated distributions |

13. Conceptual Bridge

KL divergence sits at the center of the information theory chapter, connecting the individual uncertainty measured by entropy (Section 01) to the distributional relationships measured by mutual information (Section 03), the training objectives formalized as cross-entropy (Section 04), and the local geometry captured by Fisher information (Section 05). It is not an isolated concept - it is the engine that drives all four.

Backward: from Entropy. In Section 01, entropy $H(X) = -\sum_x p\log p$ measured the intrinsic uncertainty of a single distribution. KL divergence extends this to pairs of distributions: $D_{\mathrm{KL}}(p\|q) = H(p,q) - H(p)$ measures the extra uncertainty you incur by confusing $q$ for $p$. The non-negativity of KL proved the fundamental bound $H(X) \le \log|\mathcal{X}|$ that appeared in Section 01 - Gibbs' inequality was already implicitly used there. Every result about optimal codes (Huffman, arithmetic coding) is ultimately about minimizing KL between the empirical distribution and the code-induced distribution.

Forward: to Mutual Information. Mutual information $I(X;Y) = D_{\mathrm{KL}}(P(X,Y)\|P(X)P(Y))$ (Section 03) is a KL divergence measuring the dependence between two variables. The chain rule for KL (Section 3.5) becomes the chain rule for entropy when applied to the independence gap. The data processing inequality for KL becomes the DPI for mutual information: $I(X;Z) \le I(X;Y)$ whenever $X \to Y \to Z$ forms a Markov chain, i.e., $Z$ depends on $X$ only through $Y$. Understanding KL makes mutual information, and all the contrastive learning objectives built on it (InfoNCE, SimCLR, CLIP), immediately transparent.

Forward: to Cross-Entropy and Fisher Information. Cross-entropy $H(p,q)$ is KL plus a constant (Section 04); Fisher information is KL's local curvature (Section 05). The three-way connection $H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q) \approx H(p) + \frac{1}{2}\boldsymbol{\delta}^\top F\boldsymbol{\delta}$ (for small perturbations) shows that these are all the same object at different scales: global (cross-entropy), local (Fisher).

KL DIVERGENCE IN THE INFORMATION THEORY CHAPTER

  Chapter 09 - Information Theory
    Section 01  ENTROPY  H(X)                  -- "extend to pairs of distributions"
    Section 02  KL DIVERGENCE  D_KL(p || q)    <- YOU ARE HERE
    Section 03  MUTUAL INFORMATION   I(X;Y) = D_KL(P(X,Y) || P(X)P(Y))
    Section 04  CROSS-ENTROPY        H(p,q) = H(p) + D_KL(p || q)
    Section 05  FISHER INFORMATION   F(theta) = local KL curvature

  ML CONNECTIONS:

    D_KL(p_data || p_theta)       ->  Language model training (MLE)
    D_KL(q_phi(z|x) || p(z))      ->  VAE regularizer (ELBO)
    D_KL(pi_theta || pi_ref)      ->  RLHF / PPO / DPO alignment
    D_KL(p_teacher || p_student)  ->  Knowledge distillation
    D_KL(q || p_posterior)        ->  Variational inference (VI)

  CHAPTER POSITION:

  09-01 Entropy
  09-02 KL Divergence  <- current section
  09-03 Mutual Information  (KL between joint and product of marginals)
  09-04 Cross-Entropy       (H(p,q) = H(p) + D_KL(p || q))
  09-05 Fisher Information  (local KL curvature = Riemannian metric)

Looking ahead beyond this chapter: KL divergence appears again in Chapter 12 (Functional Analysis), where it generalizes to the RKHS setting and connects to kernel methods and maximum mean discrepancy (MMD). It appears in Chapter 21 (Statistical Learning Theory) as a key quantity in PAC-Bayes bounds: the generalization gap of a stochastic predictor is bounded by $D_{\mathrm{KL}}(Q\|P)/n$, where $Q$ is the posterior and $P$ is the prior on parameters. And in Chapter 22 (Causal Inference), KL divergence between interventional and observational distributions quantifies the "causal effect" of an intervention in terms of distributional shift.

The single formula $\sum_x p(x)\log(p(x)/q(x))$ - first written down by Kullback and Leibler in 1951 - underlies more of modern AI than perhaps any other equation in this curriculum.

