KL Divergence, Part 2: Section 6 (Information-Theoretic Connections) to Section 13 (Conceptual Bridge)

6. Information-Theoretic Connections

6.1 KL and Cross-Entropy

The relationship $H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$ is fundamental. Expanding both sides:

$$-\sum_x p(x)\log q(x) = -\sum_x p(x)\log p(x) + \sum_x p(x)\log\frac{p(x)}{q(x)}$$

which is immediate from $\log(p/q) = \log p - \log q$. This decomposition has critical implications:

  • Training ML models: Minimizing the cross-entropy loss $H(p_{\mathrm{data}}, p_{\boldsymbol{\theta}})$ over $\boldsymbol{\theta}$ is identical to minimizing $D_{\mathrm{KL}}(p_{\mathrm{data}} \| p_{\boldsymbol{\theta}})$, since $H(p_{\mathrm{data}})$ is constant. The irreducible entropy $H(p_{\mathrm{data}})$ is the minimum achievable cross-entropy.

  • Perplexity gap: A language model with perplexity $\operatorname{PPL} = e^{H(p_{\mathrm{data}}, p_{\boldsymbol{\theta}})}$ has a cross-entropy that sits $D_{\mathrm{KL}}(p_{\mathrm{data}}\|p_{\boldsymbol{\theta}})$ nats above the theoretical minimum $H(p_{\mathrm{data}})$.

  • Calibration: A perfectly calibrated model satisfies $D_{\mathrm{KL}}(p_{\mathrm{true}}\|p_{\boldsymbol{\theta}}) = 0$, meaning the model's output distribution matches the true conditional distributions exactly.
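A quick numerical check of the decomposition (a minimal NumPy sketch; the two distributions are arbitrary illustrative values):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "data" distribution (illustrative)
q = np.array([0.5, 0.3, 0.2])   # model distribution (illustrative)

entropy_p     = -np.sum(p * np.log(p))        # H(p), in nats
cross_entropy = -np.sum(p * np.log(q))        # H(p, q)
kl_pq         =  np.sum(p * np.log(p / q))    # D_KL(p || q)

# H(p, q) = H(p) + D_KL(p || q), up to floating-point error
assert np.isclose(cross_entropy, entropy_p + kl_pq)
print(cross_entropy, entropy_p, kl_pq)
```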

Preview - Cross-Entropy (Section 09-04)

Cross-entropy $H(p, q) = H(p) + D_{\mathrm{KL}}(p\|q)$ is the canonical training loss for classification. It decomposes neatly into the intrinsic uncertainty $H(p)$ (irreducible) and the model quality gap $D_{\mathrm{KL}}(p\|q)$ (reducible by training).

-> Full treatment: Section 09-04 Cross-Entropy

6.2 KL and Mutual Information

Preview - Mutual Information (Section 09-03)

Mutual information is the KL divergence between the joint distribution and the product of marginals:

$$I(X; Y) = D_{\mathrm{KL}}(P(X, Y) \| P(X)P(Y)) = \mathbb{E}_{(x,y)\sim P}\!\left[\log\frac{P(x,y)}{P(x)P(y)}\right]$$

It measures statistical dependence: how much knowing $Y$ reduces uncertainty about $X$. This connection shows that $I(X;Y) \ge 0$ (Gibbs' inequality) and $I(X;Y) = 0 \Leftrightarrow X \perp\!\!\!\perp Y$.

-> Full treatment: Section 09-03 Mutual Information

The MI-as-KL connection is used directly in contrastive learning (InfoNCE loss, SimCLR, CLIP): maximizing a lower bound on $I(X; Z)$ between the input and its representation. It also appears in the information bottleneck (Tishby et al., 2000), which frames representation learning as minimizing $I(X; Z)$ subject to maximizing $I(Z; Y)$ - two KL divergences in tension.
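Computing MI as a KL divergence is a one-liner once the joint is tabulated (a minimal sketch; the joint distribution is illustrative):

```python
import numpy as np

# Joint distribution P(X, Y) on {0,1}^2 (illustrative values)
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])

Px = P.sum(axis=1, keepdims=True)   # marginal P(X)
Py = P.sum(axis=0, keepdims=True)   # marginal P(Y)

# I(X;Y) = D_KL( P(X,Y) || P(X)P(Y) )
mi = np.sum(P * np.log(P / (Px * Py)))
print(mi)   # positive because X and Y are dependent
```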

6.3 KL and Fisher Information

Preview - Fisher Information (Section 09-05)

For a parametric family $\{p_{\boldsymbol{\theta}}\}$, the KL divergence between nearby members is approximately:

$$D_{\mathrm{KL}}(p_{\boldsymbol{\theta}} \| p_{\boldsymbol{\theta}+\boldsymbol{\delta}}) \approx \frac{1}{2}\boldsymbol{\delta}^\top F(\boldsymbol{\theta})\boldsymbol{\delta}$$

where $F(\boldsymbol{\theta}) = \mathbb{E}_{p_{\boldsymbol{\theta}}}[\nabla\log p_{\boldsymbol{\theta}}\, \nabla\log p_{\boldsymbol{\theta}}^\top]$ is the Fisher information matrix. This local quadratic approximation to KL is what makes natural gradient descent (which preconditions gradient steps by $F^{-1}$) the "geometrically correct" optimizer on the statistical manifold.

-> Full treatment: Section 09-05 Fisher Information

Connection to RLHF natural policy gradient: The KL trust region $D_{\mathrm{KL}}(\pi_{\boldsymbol{\theta}} \| \pi_{\mathrm{ref}}) \le \epsilon$ defines a neighborhood in policy space. Near the reference policy, this KL ball is approximated by $\frac{1}{2}\boldsymbol{\delta}^\top F \boldsymbol{\delta} \le \epsilon$, where $F$ is the Fisher information of $\pi_{\mathrm{ref}}$. Natural policy gradient methods such as TRPO use this approximation directly; PPO enforces a similar trust region more cheaply via clipping (Section 9.3).
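The quadratic approximation is easy to check for a one-parameter family such as the Bernoulli, whose Fisher information is $1/(\theta(1-\theta))$ (a minimal sketch):

```python
import numpy as np

def kl_bern(a, b):
    """D_KL(Bern(a) || Bern(b)) in nats."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

theta, delta = 0.3, 0.01
fisher = 1.0 / (theta * (1 - theta))    # Fisher information of Bernoulli(theta)

exact  = kl_bern(theta, theta + delta)
approx = 0.5 * fisher * delta**2        # (1/2) * delta^T F delta
print(exact, approx)                    # nearly identical for small delta
```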

6.4 KL and Entropy

KL divergence provides an elegant proof that entropy is maximized at the uniform distribution. For any distribution $p$ on an alphabet of size $n$ and the uniform distribution $u(x) = 1/n$:

$$D_{\mathrm{KL}}(p \| u) = \sum_x p(x)\ln\frac{p(x)}{1/n} = \sum_x p(x)\ln p(x) + \ln n = -H(p) + \ln n$$

Since $D_{\mathrm{KL}}(p\|u) \ge 0$, we get $H(p) \le \ln n$, with equality iff $p = u$.

More generally, for any constraint set $\mathcal{Q}$ of distributions satisfying given moment constraints, the maximum-entropy distribution $p^* = \arg\max_{p \in \mathcal{Q}} H(p)$ equals $\arg\min_{p \in \mathcal{Q}} D_{\mathrm{KL}}(p \| u)$ - it is the I-projection of the uniform distribution onto $\mathcal{Q}$ (the optimization variable sits in the first KL slot; see Section 8.2). This unifies the maximum entropy principle with KL geometry.

7. f-Divergences and Generalizations

7.1 Csiszar f-Divergences

KL divergence is one member of a rich family of divergences introduced by Csiszar (1967).

Definition. Let $f: (0,\infty) \to \mathbb{R}$ be a convex function with $f(1) = 0$. The f-divergence of $p$ from $q$ is:

$$D_f(p \| q) = \sum_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) = \mathbb{E}_{x\sim q}\!\left[f\!\left(\frac{p(x)}{q(x)}\right)\right]$$

The condition $f(1) = 0$ ensures $D_f(p\|p) = 0$. Convexity of $f$ and Jensen's inequality give $D_f(p\|q) \ge 0$.

KL divergence as f-divergence. Taking $f(t) = t\ln t$ (convex, $f(1) = 0$):

$$D_f(p\|q) = \sum_x q(x) \cdot \frac{p(x)}{q(x)}\ln\frac{p(x)}{q(x)} = \sum_x p(x)\ln\frac{p(x)}{q(x)} = D_{\mathrm{KL}}(p\|q)$$

Standard f-divergences:

| Name | $f(t)$ | Closed form | Properties |
| --- | --- | --- | --- |
| KL divergence | $t\ln t$ | $\sum_x p\ln(p/q)$ | Asymmetric; range $[0, \infty)$ |
| Reverse KL | $-\ln t$ | $\sum_x q\ln(q/p)$ | Asymmetric; range $[0, \infty)$ |
| Hellinger distance | $(\sqrt{t}-1)^2$ | $\sum_x(\sqrt{p}-\sqrt{q})^2$ | Symmetric; range $[0, 2]$ |
| Total variation | $\frac{1}{2}\lvert t-1\rvert$ | $\frac{1}{2}\sum_x\lvert p-q\rvert$ | Symmetric; a true metric; range $[0, 1]$ |
| Chi-squared | $(t-1)^2$ | $\sum_x(p-q)^2/q$ | Asymmetric; range $[0,\infty)$ |
| $\alpha$-divergence | $\frac{t^\alpha - \alpha t - (1-\alpha)}{\alpha(\alpha-1)}$ | General; limits give KL and reverse KL | Parameterizes all of the above |
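Every row of the table follows from the single definition $D_f(p\|q) = \sum_x q(x)\, f(p(x)/q(x))$, so one helper covers them all (a minimal sketch; the distributions are illustrative):

```python
import numpy as np

def f_divergence(p, q, f):
    """Generic Csiszar f-divergence: D_f(p || q) = sum_x q(x) * f(p(x) / q(x))."""
    return np.sum(q * f(p / q))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

kl        = f_divergence(p, q, lambda t: t * np.log(t))          # KL divergence
rev_kl    = f_divergence(p, q, lambda t: -np.log(t))             # reverse KL
hellinger = f_divergence(p, q, lambda t: (np.sqrt(t) - 1) ** 2)  # Hellinger row; range [0, 2]
tv        = f_divergence(p, q, lambda t: 0.5 * np.abs(t - 1))    # total variation
chi2      = f_divergence(p, q, lambda t: (t - 1) ** 2)           # chi-squared

print(kl, rev_kl, hellinger, tv, chi2)
```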

For AI - GAN losses. Generative Adversarial Networks (Goodfellow et al., 2014) use different f-divergences as training objectives: the original GAN uses Jensen-Shannon; f-GAN (Nowozin et al., 2016) generalizes to arbitrary f-divergences. This provides a principled family of training objectives for generative models.

7.2 Properties of f-Divergences

All f-divergences share four key properties (when $f$ is convex with $f(1) = 0$):

  1. Non-negativity: $D_f(p\|q) \ge 0$, with equality iff $p = q$ (when $f$ is strictly convex).
  2. Data processing inequality: $D_f(p_T\|q_T) \le D_f(p\|q)$ for any stochastic kernel $T$.
  3. No triangle inequality in general.
  4. Convexity: $D_f$ is jointly convex in $(p,q)$ when $f$ is convex.

The data processing inequality is a theorem for all f-divergences: processing can only reduce distinguishability between distributions.

Total variation as special case. The total variation distance $\mathrm{TV}(p,q) = \frac{1}{2}\|p - q\|_1 = \frac{1}{2}\sum_x |p(x) - q(x)|$ satisfies the triangle inequality and is a genuine metric. It bounds other f-divergences: Pinsker's inequality relates KL and TV:

$$\mathrm{TV}(p,q) \le \sqrt{\frac{1}{2}D_{\mathrm{KL}}(p\|q)}$$

This is the most important inequality connecting KL divergence to total variation. Pinsker's inequality is used to convert KL bounds (from optimization) into total variation bounds (which control probabilities of events).
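Pinsker's inequality is easy to sanity-check numerically over random distribution pairs (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

for _ in range(1000):
    p = rng.dirichlet(np.ones(5))        # random distribution on 5 symbols
    q = rng.dirichlet(np.ones(5))
    kl = np.sum(p * np.log(p / q))
    tv = 0.5 * np.sum(np.abs(p - q))
    assert tv <= np.sqrt(0.5 * kl) + 1e-12   # Pinsker's inequality
print("Pinsker holds on all sampled pairs")
```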

7.3 Renyi Divergence

For $\alpha \in (0,1) \cup (1,\infty)$, the order-$\alpha$ Renyi divergence is:

$$D_\alpha(p \| q) = \frac{1}{\alpha - 1}\log\sum_x p(x)^\alpha\, q(x)^{1-\alpha}$$

Limiting cases:

  • $\alpha \to 1$: $D_\alpha(p\|q) \to D_{\mathrm{KL}}(p\|q)$ (L'Hopital's rule)
  • $\alpha \to 0$: $D_\alpha(p\|q) \to -\log\sum_x \mathbb{1}[p(x) > 0]\, q(x)$ (support-based)
  • $\alpha \to \infty$: $D_\alpha(p\|q) \to \log\max_x p(x)/q(x)$ (max-ratio)
  • $\alpha = 1/2$: related to the Bhattacharyya distance; $D_{1/2}(p\|q) = -2\log\sum_x\sqrt{p(x)q(x)}$

Properties: $D_\alpha(p\|q) \ge 0$; $D_\alpha(p\|p) = 0$; $D_\alpha$ is monotone increasing in $\alpha$; satisfies the data processing inequality; not symmetric.

For AI - Differential privacy. Renyi differential privacy (Mironov, 2017) quantifies privacy loss using Renyi divergence: a mechanism $M$ is $(\alpha, \varepsilon)$-RDP if $D_\alpha(M(S)\|M(S')) \le \varepsilon$ for any adjacent datasets $S, S'$. Renyi DP composes additively: $\varepsilon_{\mathrm{total}} = \varepsilon_1 + \varepsilon_2$. This makes it the tool of choice for privacy accounting in large-scale training with DP-SGD (used in Google's federated learning and some LLM fine-tuning pipelines).
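A minimal sketch of the Renyi divergence itself, showing monotonicity in $\alpha$ and the $\alpha \to 1$ limit (the distributions are illustrative; this is not a DP accountant):

```python
import numpy as np

def renyi(p, q, alpha):
    """Order-alpha Renyi divergence D_alpha(p || q) in nats (alpha != 1)."""
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
kl = np.sum(p * np.log(p / q))

for a in [0.5, 0.9, 0.999, 1.001, 2.0, 10.0]:
    print(a, renyi(p, q, a))        # increasing in alpha, approaching KL near alpha = 1
print("KL (alpha -> 1 limit):", kl)
```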

7.4 Jensen-Shannon Divergence

The Jensen-Shannon Divergence (JSD) is the symmetrized, bounded variant of KL:

$$\mathrm{JSD}(p \| q) = \frac{1}{2}D_{\mathrm{KL}}\!\left(p \,\Big\|\, \frac{p+q}{2}\right) + \frac{1}{2}D_{\mathrm{KL}}\!\left(q \,\Big\|\, \frac{p+q}{2}\right)$$

where $m = (p+q)/2$ is the mixture distribution.

Properties:

  • $\mathrm{JSD}(p\|q) = \mathrm{JSD}(q\|p)$ - symmetric
  • $0 \le \mathrm{JSD}(p\|q) \le \ln 2$ - bounded (in nats)
  • $\sqrt{\mathrm{JSD}}$ is a metric (satisfies the triangle inequality)
  • $\mathrm{JSD}(p\|q) = H(m) - \frac{1}{2}H(p) - \frac{1}{2}H(q)$ - entropy of the mixture minus the average entropy
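Both the two-KL definition and the entropy identity in the last bullet are easy to verify numerically (a minimal sketch with illustrative distributions):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def entropy(p):
    return -np.sum(p * np.log(p))

p = np.array([0.9, 0.1])
q = np.array([0.2, 0.8])
m = 0.5 * (p + q)

jsd_via_kl      = 0.5 * kl(p, m) + 0.5 * kl(q, m)
jsd_via_entropy = entropy(m) - 0.5 * entropy(p) - 0.5 * entropy(q)

assert np.isclose(jsd_via_kl, jsd_via_entropy)
print(jsd_via_kl, "<= ln 2 =", np.log(2))   # bounded above by ln 2
```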

For AI - Original GAN objective. The original GAN (Goodfellow et al., 2014) with an optimal discriminator minimizes the Jensen-Shannon divergence between the generator distribution $p_G$ and the data distribution $p_{\mathrm{data}}$:

$$\min_G \max_D V(D, G) = 2\,\mathrm{JSD}(p_{\mathrm{data}} \| p_G) - 2\ln 2$$

The optimal discriminator is $D^*(x) = p_{\mathrm{data}}(x)/(p_{\mathrm{data}}(x) + p_G(x))$. However, when $p_{\mathrm{data}}$ and $p_G$ have disjoint supports, $\mathrm{JSD} = \ln 2$ (its maximum) regardless of how far apart the distributions are - the discriminator saturates and gradients vanish. This vanishing-gradient problem, together with the mode collapse it encourages, motivated WGAN (Arjovsky et al., 2017), which uses the Wasserstein distance and does not require absolute continuity.

8. Information Geometry of KL

8.1 KL as a Bregman Divergence

A Bregman divergence generated by a strictly convex function $\phi$ is:

$$B_\phi(p, q) = \phi(p) - \phi(q) - \nabla\phi(q)^\top(p - q)$$

This is the gap between $\phi$ at $p$ and its first-order Taylor approximation at $q$ - always $\ge 0$ by convexity.

KL is a Bregman divergence. Taking $\phi(p) = \sum_x p(x)\ln p(x)$ (the negative entropy, which is strictly convex), we have $\nabla\phi(q)_x = \ln q(x) + 1$ and

$$B_\phi(p, q) = \sum_x p(x)\ln p(x) - \sum_x q(x)\ln q(x) - \sum_x (\ln q(x) + 1)(p(x) - q(x))$$

$$= \sum_x p(x)\ln p(x) - \sum_x p(x)\ln q(x) + \sum_x (q(x) - p(x)) = \sum_x p(x)\ln\frac{p(x)}{q(x)} = D_{\mathrm{KL}}(p\|q)$$

where the final step uses $\sum_x (q(x) - p(x)) = 0$, since both distributions are normalized.

This Bregman representation reveals why KL is asymmetric: Bregman divergences are generally not symmetric because the Taylor approximation is computed at $q$, not at $p$.

Exponential family connection. For an exponential family with log-partition function $A(\boldsymbol{\eta})$, the Bregman divergence $B_A(\boldsymbol{\eta}_2, \boldsymbol{\eta}_1)$ equals $D_{\mathrm{KL}}(p_{\boldsymbol{\eta}_1}\|p_{\boldsymbol{\eta}_2})$ (note the reversed argument order). This is the source of the elegant formula in Section 5.3.
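The identity $B_\phi(p, q) = D_{\mathrm{KL}}(p\|q)$ for $\phi$ equal to negative entropy can be checked directly (a minimal sketch with illustrative distributions):

```python
import numpy as np

def neg_entropy(p):
    return np.sum(p * np.log(p))

def bregman_neg_entropy(p, q):
    """B_phi(p, q) with phi = negative entropy; its gradient is log q + 1."""
    grad_q = np.log(q) + 1
    return neg_entropy(p) - neg_entropy(q) - np.dot(grad_q, p - q)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
kl = np.sum(p * np.log(p / q))

assert np.isclose(bregman_neg_entropy(p, q), kl)
print(kl)
```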

8.2 I-Projection and M-Projection

Definition. Let $\mathcal{M}$ be a constraint set (e.g., a parametric family of distributions):

  • I-projection (information projection): $q^* = \arg\min_{q \in \mathcal{M}} D_{\mathrm{KL}}(q \| p)$ - the distribution in $\mathcal{M}$ closest to $p$ in reverse KL
  • M-projection (moment projection): $q^* = \arg\min_{q \in \mathcal{M}} D_{\mathrm{KL}}(p \| q)$ - the distribution in $\mathcal{M}$ closest to $p$ in forward KL

For the exponential family:

  • The M-projection always produces the moment-matched distribution: $\mathbb{E}_{q^*}[\mathbf{t}(X)] = \mathbb{E}_p[\mathbf{t}(X)]$
  • The I-projection onto an exponential family (a convex set of mean parameters) produces the distribution with the correct natural parameters for those means

EM algorithm as alternating projections. The EM algorithm alternates:

  • E-step: Compute $q^{(t)}(\mathbf{z}) = p(\mathbf{z}|\mathbf{x}; \boldsymbol{\theta}^{(t)})$ - an I-projection onto the exact posterior of the current model
  • M-step: $\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} \mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)})$ - an M-projection of $q^{(t)}$ back onto the parametric family

Each step decreases $D_{\mathrm{KL}}(q\|p)$ or $D_{\mathrm{KL}}(p_{\text{data}}\|p_{\boldsymbol{\theta}})$ respectively, guaranteeing convergence to a stationary point.

8.3 The Pythagorean Theorem for KL

Theorem (Pythagorean theorem for I-projections). Let $\mathcal{E}$ be an exponential family (a convex set in mean parameter space) and let $q^* = \arg\min_{q \in \mathcal{E}} D_{\mathrm{KL}}(q \| p)$ be the I-projection of $p$ onto $\mathcal{E}$. Then for any $r \in \mathcal{E}$:

$$D_{\mathrm{KL}}(r \| p) = D_{\mathrm{KL}}(r \| q^*) + D_{\mathrm{KL}}(q^* \| p)$$

This is an exact equality - there is no approximation; for general constraint sets one only gets the inequality $D_{\mathrm{KL}}(r \| p) \ge D_{\mathrm{KL}}(r \| q^*) + D_{\mathrm{KL}}(q^* \| p)$.

Interpretation. The KL distance from any $r$ in the constraint set to the target $p$ decomposes into: (1) the distance from $r$ to the projection $q^*$, plus (2) the distance from the projection to $p$. The "closest" point $q^*$ acts like the foot of a perpendicular. This is called "Pythagorean" because the geometry is orthogonal in the KL sense.

PYTHAGOREAN THEOREM FOR KL


          p   (target distribution, outside set E)
          |
          |   D_KL(q* || p)
          |
          q* ------ r        (both in constraint set E)
              D_KL(r || q*)

    D_KL(r || p) = D_KL(r || q*) + D_KL(q* || p)

    (exact for I-projections onto exponential families)


For AI: Variational inference's ELBO maximization is equivalent to minimizing $D_{\mathrm{KL}}(q\|p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x}))$ - an I-projection. The Pythagorean theorem shows that the ELBO gap at the optimum is exactly $D_{\mathrm{KL}}(q^*\|p)$ - a clean statement of how much information is lost by restricting to the variational family.

8.4 Statistical Manifolds and Natural Gradient

The space of probability distributions over $\mathcal{X}$ forms a statistical manifold, with local coordinates given by the natural parameters $\boldsymbol{\eta}$ (for exponential families). The KL divergence induces a Riemannian metric on this manifold, the Fisher information metric:

$$g_{ij}(\boldsymbol{\eta}) = \frac{\partial^2 D_{\mathrm{KL}}(p_{\boldsymbol{\eta}} \| p_{\boldsymbol{\eta}'})}{\partial\eta_i'\, \partial\eta_j'}\bigg|_{\boldsymbol{\eta}'=\boldsymbol{\eta}} = F_{ij}(\boldsymbol{\eta})$$

where $F(\boldsymbol{\eta})$ is the Fisher information matrix. The Riemannian structure makes the "natural" step size in parameter space match the actual KL distance between distributions.

Natural gradient. In Euclidean parameter space, gradient descent steps $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\nabla\mathcal{L}$ are sized by the Euclidean metric on $\boldsymbol{\theta}$-space, which does not correspond to any natural metric on distribution space. The natural gradient corrects for this:

$$\tilde{\nabla}\mathcal{L} = F(\boldsymbol{\theta})^{-1}\nabla\mathcal{L}$$

Natural gradient descent converges much faster on problems where the Fisher information is highly non-uniform (i.e., where the Euclidean metric on parameters is poorly calibrated to KL distance on distributions). This is the foundation of K-FAC (Martens & Grosse, 2015), used for second-order optimization of neural networks.

-> Full treatment of Fisher information and natural gradient: Section 09-05 Fisher Information

9. Applications in Machine Learning

9.1 Maximum Likelihood = Minimizing KL

Theorem. For a parametric model $p_{\boldsymbol{\theta}}$ and empirical data distribution $\hat{p}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[X^{(i)} = x]$:

$$\arg\max_{\boldsymbol{\theta}} \sum_{i=1}^n \log p_{\boldsymbol{\theta}}(x^{(i)}) = \arg\min_{\boldsymbol{\theta}} D_{\mathrm{KL}}(\hat{p}_n \| p_{\boldsymbol{\theta}})$$

Proof. Using $D_{\mathrm{KL}}(\hat{p}_n \| p_{\boldsymbol{\theta}}) = \mathbb{E}_{\hat{p}}[\log(\hat{p}(x)/p_{\boldsymbol{\theta}}(x))] = -H(\hat{p}_n) - \mathbb{E}_{\hat{p}}[\log p_{\boldsymbol{\theta}}(x)]$, minimizing over $\boldsymbol{\theta}$ drops the constant $-H(\hat{p}_n)$:

$$\arg\min_{\boldsymbol{\theta}} D_{\mathrm{KL}}(\hat{p}_n\|p_{\boldsymbol{\theta}}) = \arg\max_{\boldsymbol{\theta}} \mathbb{E}_{\hat{p}}[\log p_{\boldsymbol{\theta}}(X)] = \arg\max_{\boldsymbol{\theta}} \frac{1}{n}\sum_i \log p_{\boldsymbol{\theta}}(x^{(i)}) \quad\square$$

Implications:

  • Every LLM trained by next-token prediction (GPT-4, LLaMA-3, Claude, Gemini) minimizes $D_{\mathrm{KL}}(p_{\mathrm{data}}\|p_{\boldsymbol{\theta}})$
  • The MLE estimate is not "arbitrary" - it has a precise information-theoretic interpretation as the distribution in the model family closest to the data in forward KL
  • Coverage: MLE must cover the entire data distribution; if $p_{\mathrm{data}}(x) > 0$ but $p_{\boldsymbol{\theta}}(x) \approx 0$, the loss is large

9.2 Variational Autoencoders

The VAE (Kingma & Welling, 2014) introduces a latent variable $\mathbf{z}$ with prior $p(\mathbf{z})$ and likelihood $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$. The true log-likelihood $\log p_{\boldsymbol{\theta}}(\mathbf{x})$ is intractable (it requires integrating over $\mathbf{z}$). The VAE introduces an encoder $q_\phi(\mathbf{z}|\mathbf{x})$ and derives a tractable lower bound via the KL decomposition:

$$\log p_{\boldsymbol{\theta}}(\mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi}[\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})] - D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))}_{\mathcal{L}(\boldsymbol{\phi},\boldsymbol{\theta};\mathbf{x}) \,=\, \text{ELBO}} + \underbrace{D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x}))}_{\ge 0}$$

The ELBO has two terms:

  1. Reconstruction term $\mathbb{E}_{q_\phi}[\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})]$: how well the decoder reconstructs from the latent code. Analogous to negative distortion.
  2. KL regularizer $D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x})\|p(\mathbf{z}))$: how much the approximate posterior deviates from the prior. For a Gaussian encoder and standard Gaussian prior it has the closed form $\frac{1}{2}\sum_j(\mu_j^2 + \sigma_j^2 - \ln\sigma_j^2 - 1)$; see the sketch after this list.
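For the diagonal-Gaussian case, the KL regularizer has exactly the closed form quoted in item 2 (a minimal sketch; `mu` and `log_var` stand in for an encoder's outputs and are illustrative values):

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1)

mu      = np.array([0.3, -1.2, 0.0])   # encoder means (illustrative)
log_var = np.array([-0.5, 0.1, 0.0])   # encoder log-variances (illustrative)

print(gaussian_kl_to_standard_normal(mu, log_var))
# Equals zero exactly when mu = 0 and log_var = 0, i.e. posterior = prior.
```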

Practical training: The ELBO is maximized by gradient ascent on both parameter sets jointly:

  • w.r.t. $\boldsymbol{\phi}$: encoder parameters, using the reparameterization trick
  • w.r.t. $\boldsymbol{\theta}$: decoder parameters, via direct backpropagation

VAE variants using KL: beta-VAE ($\beta > 1$ weight on the KL term for disentanglement), WAE (Wasserstein autoencoder, replaces KL with MMD), VQ-VAE (discrete latent, no KL), InfoVAE (adds an MI term).

9.3 RLHF, PPO, and DPO

RLHF objective. Reinforcement Learning from Human Feedback (Christiano et al., 2017; InstructGPT, Ouyang et al., 2022) trains a policy $\pi_{\boldsymbol{\theta}}$ to maximize expected reward while staying close to the reference policy $\pi_{\mathrm{ref}}$ (the pretrained LLM):

$$\max_{\pi_{\boldsymbol{\theta}}} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\boldsymbol{\theta}}(\cdot|x)}\left[r(x, y) - \beta\, D_{\mathrm{KL}}(\pi_{\boldsymbol{\theta}}(\cdot|x) \| \pi_{\mathrm{ref}}(\cdot|x))\right]$$

where $r(x,y)$ is the reward model score and $\beta > 0$ is the KL coefficient. Without the KL penalty, the policy collapses into reward hacking: generating short, repetitive text that maximizes reward scores but is incoherent or unnatural.

Optimal policy. The RLHF objective has a closed-form optimal solution (for fixed $r$):

$$\pi^*(y|x) = \frac{1}{Z(x)}\pi_{\mathrm{ref}}(y|x)\, e^{r(x,y)/\beta}$$

where $Z(x) = \sum_y \pi_{\mathrm{ref}}(y|x)\,e^{r(x,y)/\beta}$ is the normalizing partition function. The optimal policy exponentially upweights sequences with high reward relative to the reference.
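The closed form is easy to evaluate on a toy two-response example (a minimal sketch; the numbers are the ones reused in Exercise 7(d) below):

```python
import numpy as np

beta   = 0.1
pi_ref = np.array([0.6, 0.4])     # reference policy over two responses A, B
reward = np.array([2.0, 0.0])     # reward model scores (illustrative)

# Closed-form optimum: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)
unnormalized = pi_ref * np.exp(reward / beta)
pi_star = unnormalized / unnormalized.sum()

kl = np.sum(pi_star * np.log(pi_star / pi_ref))
print(pi_star, kl)
# Larger beta keeps pi* close to pi_ref (small KL); smaller beta chases the reward.
```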

DPO (Direct Preference Optimization, Rafailov et al., 2023). DPO solves the RLHF objective directly from preference data, without training an explicit reward model. The key insight: substituting the optimal-policy formula into the reward-learning objective yields a loss that involves only the policy, not the reward model:

$$\mathcal{L}_{\mathrm{DPO}}(\boldsymbol{\theta}) = -\mathbb{E}\!\left[\log\sigma\!\left(\beta\ln\frac{\pi_{\boldsymbol{\theta}}(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta\ln\frac{\pi_{\boldsymbol{\theta}}(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}\right)\right]$$

where $y_w$ is the preferred (winning) response and $y_l$ the dispreferred (losing) response. The log-ratio $\log(\pi_{\boldsymbol{\theta}}/\pi_{\mathrm{ref}})$ is the implicit reward that the KL-constrained policy learns. DPO is widely used in practice for preference fine-tuning of LLMs (Mixtral-Instruct, LLaMA-3, and many others).

PPO trust region. PPO (Schulman et al., 2017) approximates the KL constraint with a clipping objective:

$$\mathcal{L}_{\mathrm{CLIP}}(\boldsymbol{\theta}) = \mathbb{E}_t\!\left[\min\!\left(r_t(\boldsymbol{\theta})\hat{A}_t,\; \mathrm{clip}(r_t(\boldsymbol{\theta}),\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_t\right)\right]$$

where $r_t(\boldsymbol{\theta}) = \pi_{\boldsymbol{\theta}}(a_t|s_t)/\pi_{\mathrm{old}}(a_t|s_t)$ is the probability ratio. Clipping at $1 \pm \varepsilon$ prevents large policy updates, approximating the KL trust-region constraint $D_{\mathrm{KL}}(\pi_{\mathrm{old}}\|\pi_{\boldsymbol{\theta}}) \le \delta$.

9.4 Knowledge Distillation

Knowledge distillation (Hinton, Vinyals & Dean, 2015) transfers knowledge from a large teacher model $p_T$ to a smaller student model $p_S$. The distillation loss uses forward KL (from teacher to student):

$$\mathcal{L}_{\mathrm{distill}} = D_{\mathrm{KL}}(p_T^{(\tau)} \| p_S) = \sum_k p_T^{(\tau)}(k)\ln\frac{p_T^{(\tau)}(k)}{p_S(k)}$$

where $p_T^{(\tau)}(k) = \mathrm{softmax}(\mathbf{z}_T/\tau)_k$ is the teacher's softened distribution at temperature $\tau > 1$.
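A minimal sketch of the softened distributions and the forward-KL distillation loss (the logits are the illustrative values reused in Exercise 8 below):

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

z_teacher = np.array([3.0, 1.0, -2.0])   # teacher logits (illustrative)
z_student = np.array([2.0, 0.0, -1.0])   # student logits (illustrative)
tau = 3.0                                 # temperature > 1 softens both distributions

p_T = softmax(z_teacher / tau)
p_S = softmax(z_student / tau)

print(p_T, p_S, kl(p_T, p_S))   # forward KL: the expectation is under the teacher
```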

Why forward KL (teacher -> student)? The student must cover all of the teacher's probability mass, including the "dark knowledge" - the non-zero probabilities assigned to incorrect classes that encode the teacher's generalization patterns. If a teacher assigns 60% to "cat," 30% to "tiger," and 10% to "dog," the soft labels teach the student that cats look more like tigers than dogs. One-hot labels destroy this information.

Forward vs reverse KL choice in distillation. Using reverse KL $D_{\mathrm{KL}}(p_S\|p_T)$ would allow the student to become mode-seeking - focusing only on the teacher's highest-probability output and ignoring the distributional shape. Forward KL forces the student to match the full distribution, preserving the teacher's calibration.

Modern LLM distillation. DistilBERT (Sanh et al., 2019) distills BERT using forward KL plus a cosine-similarity loss on hidden states. The Phi models (Phi-1, Gunasekar et al., 2023; Phi-2, Microsoft, 2023) are trained largely on synthetic data generated by stronger models - a sequence-level form of distillation. The small LLaMA-3.2 models were pruned and distilled from larger LLaMA checkpoints. The forward KL direction is standard.

9.5 Normalizing Flows and Diffusion Models

Normalizing flows learn an invertible transformation $f_{\boldsymbol{\theta}}: \mathbb{R}^d \to \mathbb{R}^d$ such that $\mathbf{x} = f_{\boldsymbol{\theta}}(\mathbf{z})$ where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$. The exact change-of-variables formula gives:

$$\log p_{\boldsymbol{\theta}}(\mathbf{x}) = \log p_{\mathbf{z}}(f_{\boldsymbol{\theta}}^{-1}(\mathbf{x})) + \log\left|\det J_{f_{\boldsymbol{\theta}}^{-1}}(\mathbf{x})\right|$$

Training by maximum likelihood minimizes $D_{\mathrm{KL}}(p_{\mathrm{data}}\|p_{\boldsymbol{\theta}})$ (forward KL), since the log-likelihood is tractable. This contrasts with GANs, which use a discriminator to avoid computing $\log p_{\boldsymbol{\theta}}$ directly.

Diffusion models (DDPM, Ho et al., 2020). The DDPM ELBO is a sum of KL terms:

$$-\log p_{\boldsymbol{\theta}}(\mathbf{x}_0) \le \mathcal{L} = D_{\mathrm{KL}}(q(\mathbf{x}_T|\mathbf{x}_0)\|p(\mathbf{x}_T)) + \sum_{t=2}^T D_{\mathrm{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\|p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)) - \log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)$$

Each reverse diffusion step $p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is trained to minimize the KL divergence from the (analytically tractable) true reverse step $q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$. Because both distributions are Gaussian (by the Markov structure), each per-step KL has a closed-form expression in terms of the noise predictor $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}$. The simplified training objective is equivalent to predicting the noise $\boldsymbol{\epsilon}$ - the standard diffusion training loss.
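Since each per-step KL is between Gaussians, it has the closed form derived in Exercise 2 below (a minimal sketch for the scalar case; all numbers are illustrative):

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """D_KL( N(mu1, var1) || N(mu2, var2) ) for scalar Gaussians, in nats."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1)

# Illustrative per-step means/variances for q(x_{t-1}|x_t, x_0) and p_theta(x_{t-1}|x_t)
mu_q, var_q         = 0.80, 0.04
mu_theta, var_theta = 0.75, 0.04   # DDPM fixes the reverse variance; only the mean is learned

print(kl_gauss(mu_q, var_q, mu_theta, var_theta))
# With equal variances this reduces to (mu_q - mu_theta)^2 / (2 * var),
# which is why DDPM training reduces to a weighted mean / noise-prediction loss.
```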

10. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
| --- | --- | --- | --- |
| 1 | Writing $D_{\mathrm{KL}}(q \Vert p)$ when you mean forward KL (truth -> approximation) | The first argument is the reference (truth); forward KL is $D_{\mathrm{KL}}(p_{\mathrm{truth}} \Vert p_{\mathrm{approx}})$ | Always identify which distribution is the truth and which is the approximation; check: is the expectation under the truth or the approximation? |
| 2 | Assuming KL is symmetric: $D_{\mathrm{KL}}(p\Vert q) = D_{\mathrm{KL}}(q\Vert p)$ | KL is not symmetric; the two directions have fundamentally different behaviors | Never swap $p$ and $q$ without checking both directions; use the symmetrized (Jeffreys) divergence or JSD if you need symmetry |
| 3 | Ignoring support mismatch: computing $D_{\mathrm{KL}}$ when $q(x) = 0$ for some $x$ with $p(x) > 0$ | $D_{\mathrm{KL}}(p\Vert q) = +\infty$ when $p \not\ll q$; finite computed values are meaningless | Add Laplace smoothing or use $\mathrm{JSD}$ / $\mathrm{TV}$ for distributions with potentially different supports |
| 4 | Treating $D_{\mathrm{KL}}(p\Vert q) = 0$ as equivalent to $p = q$ under numerical approximations | Finite-precision computation can give $D_{\mathrm{KL}} \approx 0$ even when $p \ne q$, due to rounding | Use $D_{\mathrm{KL}} \le \epsilon$ (small threshold) rather than exact zero; check sample moments separately |
| 5 | Using KL as a distance metric satisfying the triangle inequality | $D_{\mathrm{KL}}(p\Vert r)$ can exceed $D_{\mathrm{KL}}(p\Vert q) + D_{\mathrm{KL}}(q\Vert r)$; KL is not a metric | Use total variation or $\sqrt{\mathrm{JSD}}$ if you need a proper metric |
| 6 | Confusing minimizing $D_{\mathrm{KL}}(p\Vert q)$ over $q$ with minimizing over $p$ | MLE minimizes forward KL over $\boldsymbol{\theta}$ (model parameters); variational inference minimizes reverse KL over $q$ | Always identify which argument is the optimization variable |
| 7 | Computing cross-entropy loss and calling it KL divergence | Cross-entropy $H(p,q) = H(p) + D_{\mathrm{KL}}(p\Vert q)$; they differ by $H(p)$ | They're equal only when $p$ is one-hot ($H(p) = 0$); for soft labels, $H(p,q) > D_{\mathrm{KL}}(p\Vert q)$ |
| 8 | Expecting the VAE KL term to be zero at the optimum | The KL term equals $D_{\mathrm{KL}}(q_\phi(\mathbf{z}\vert\mathbf{x})\Vert p(\mathbf{z}))$; it is zero only if the posterior equals the prior (posterior collapse) | Expect a positive KL term in a healthy VAE; if it collapses to zero, use KL annealing or $\beta$-weighting |
| 9 | Using forward KL for variational inference | Forward KL $\mathbb{E}_p[\log(p/q)]$ requires samples from the intractable $p(\mathbf{z}\vert\mathbf{x})$; the ELBO uses reverse KL, which only requires samples from $q$ | Use the reverse-KL ELBO, which needs samples from $q$ alone |
| 10 | Claiming the RLHF KL penalty "forces" the model to stay close to the reference | The KL penalty softly penalizes deviation; with a large enough reward, the policy can still deviate significantly | The KL penalty is a soft regularizer, not a hard constraint; monitor $D_{\mathrm{KL}}(\pi\Vert\pi_{\mathrm{ref}})$ during training |
| 11 | Conflating Renyi divergence with KL divergence | Renyi $D_\alpha(p\Vert q)$ only equals KL in the limit $\alpha \to 1$; for $\alpha \ne 1$ they have different properties | Use Renyi divergence when differential privacy accounting is needed; use KL for standard information-theoretic arguments |
| 12 | Forgetting the $\ln 2$ factor when converting between nats and bits | $D_{\mathrm{KL}}$ in nats $= D_{\mathrm{KL}}$ in bits $\times \ln 2 \approx 0.693 \times$ bits | Always specify units; when comparing with cross-entropy in bits (binary classification), convert to nats or vice versa |

11. Exercises

Exercise 1 (*) - Non-negativity and equality condition. Let $p = (0.4, 0.3, 0.2, 0.1)$ and $q = (0.1, 0.2, 0.3, 0.4)$.

(a) Compute $D_{\mathrm{KL}}(p\|q)$ and $D_{\mathrm{KL}}(q\|p)$ by hand (use $\ln$; give exact values and numerical approximations).

(b) Verify that both are non-negative. Which is larger?

(c) Find a distribution $r$ on four symbols such that $D_{\mathrm{KL}}(p\|r) = 0$. Is $r$ unique?

(d) Using only the convexity of $-\ln$, write a self-contained proof that $D_{\mathrm{KL}}(p\|q) \ge 0$ for your specific $p$ and $q$.


Exercise 2 (*) - KL between Gaussians. Let $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$.

(a) Starting from the definition, derive the closed form:

$$D_{\mathrm{KL}}(p\|q) = \ln\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

(b) Compute $D_{\mathrm{KL}}(\mathcal{N}(1,2)\|\mathcal{N}(0,1))$ and $D_{\mathrm{KL}}(\mathcal{N}(0,1)\|\mathcal{N}(1,2))$. Are they equal?

(c) Show that when $\sigma_1 = \sigma_2 = \sigma$, the formula reduces to $D_{\mathrm{KL}}(p\|q) = (\mu_1-\mu_2)^2/(2\sigma^2)$ - the squared Mahalanobis distance.

(d) For the VAE case $p = \mathcal{N}(\mu, \sigma^2)$, $q = \mathcal{N}(0,1)$: verify the simplified formula $\frac{1}{2}(\mu^2 + \sigma^2 - \ln\sigma^2 - 1)$ and check that it is 0 when $\mu = 0$, $\sigma = 1$.


Exercise 3 (*) - Forward vs reverse KL. Let $p(x) = 0.5\,\mathcal{N}(x;-2,0.25) + 0.5\,\mathcal{N}(x;2,0.25)$ (bimodal) and $q_{\mu,\sigma^2}(x) = \mathcal{N}(x;\mu,\sigma^2)$.

(a) Show that the minimizer of $D_{\mathrm{KL}}(p\|q_{\mu,\sigma^2})$ over $(\mu,\sigma^2)$ is $\mu^* = 0$, $\sigma^{*2} = 4.25$. (Hint: moment matching.)

(b) Numerically estimate $D_{\mathrm{KL}}(q_\mu\|p)$ on a grid of $\mu \in [-3, 3]$ using 1000 samples. Show the two local minima near $\mu = \pm 2$.

(c) Explain in one paragraph: if you were training a VAE-style generative model and the true data distribution is $p$, which direction would you minimize, and what would the generator look like?


Exercise 4 (**) - Chain rule decomposition. Let $P(X, Y)$ be a joint distribution on $\{0,1\}^2$ with $P(0,0) = 0.4$, $P(0,1) = 0.1$, $P(1,0) = 0.1$, $P(1,1) = 0.4$, and let $Q(X,Y)$ be uniform on $\{0,1\}^2$ ($= 0.25$ each).

(a) Compute $D_{\mathrm{KL}}(P(X,Y)\|Q(X,Y))$ directly.

(b) Compute $D_{\mathrm{KL}}(P_X\|Q_X)$ and $\mathbb{E}_{P_X}[D_{\mathrm{KL}}(P_{Y|X}\|Q_{Y|X})]$ separately.

(c) Verify that (a) $=$ (b): the chain rule holds.

(d) Interpret: which part (marginal or conditional) contributes more to the total KL? What does this mean about the structure of $P$ vs $Q$?


Exercise 5 (**) - f-Divergence family. Consider divergences between the Bernoulli distributions $p = \mathrm{Bern}(\theta)$ and $q = \mathrm{Bern}(0.5)$ as a function of $\theta \in (0,1)$.

(a) Compute and plot (as functions of $\theta$): KL divergence $D_{\mathrm{KL}}(p\|q)$, reverse KL $D_{\mathrm{KL}}(q\|p)$, total variation $\mathrm{TV}(p,q)$, and Jensen-Shannon $\mathrm{JSD}(p,q)$.

(b) Verify that Pinsker's inequality $\mathrm{TV}(p,q)^2 \le \frac{1}{2}D_{\mathrm{KL}}(p\|q)$ holds for all $\theta$.

(c) Which divergences are symmetric? Which are bounded? Which are asymmetric?

(d) For $\theta = 0.01$ (near-deterministic $p$): compute all four divergences. Why does the reverse KL $D_{\mathrm{KL}}(q\|p)$ blow up as $\theta \to 0$ while JSD stays bounded?


Exercise 6 (**) - ELBO and KL decomposition. A simple VAE has prior $p(z) = \mathcal{N}(0,1)$ and likelihood $p_\theta(x|z) = \mathcal{N}(z, 0.1)$. The encoder is $q_\phi(z|x) = \mathcal{N}(\phi \cdot x, 0.5)$ for a scalar parameter $\phi$.

(a) For $x = 2$: compute the ELBO as a function of $\phi$. Use the Gaussian KL formula.

(b) Show that the ELBO decomposes as a reconstruction term $-\frac{(2 - 2\phi)^2 + 0.5}{2 \times 0.1}$ (plus constants) minus the KL term $\frac{1}{2}(4\phi^2 + 0.5 - \ln 0.5 - 1)$.

(c) Find the $\phi$ that maximizes the ELBO. What does this optimal encoder do geometrically?

(d) As the likelihood variance $\sigma^2 \to 0$ (the decoder becomes very powerful): what happens to the optimal $\phi$? Does posterior collapse occur?


Exercise 7 (**) - RLHF optimal policy. A language model has reference policy $\pi_{\mathrm{ref}}(y|x)$ and reward $r(x,y)$. The RLHF objective is $\max_\pi \mathbb{E}[r(x,y)] - \beta D_{\mathrm{KL}}(\pi\|\pi_{\mathrm{ref}})$.

(a) Show that the unconstrained optimizer is $\pi^*(y|x) = \frac{1}{Z(x)}\pi_{\mathrm{ref}}(y|x)e^{r(x,y)/\beta}$ by taking the functional derivative of the objective with respect to $\pi(y|x)$ (subject to $\sum_y \pi(y|x) = 1$).

(b) Show that at the optimum, the KL divergence is $D_{\mathrm{KL}}(\pi^*\|\pi_{\mathrm{ref}}) = \frac{1}{\beta}\mathbb{E}_{\pi^*}[r] - \log Z(x)$.

(c) Derive the DPO loss: given preference data $(x, y_w, y_l)$ where $y_w \succ y_l$, show that the Bradley-Terry preference model $P(y_w \succ y_l|x) = \sigma\!\left(\beta\log\frac{\pi^*(y_w|x)}{\pi^*(y_l|x)} - \beta\log\frac{\pi_{\mathrm{ref}}(y_w|x)}{\pi_{\mathrm{ref}}(y_l|x)}\right)$ leads to the DPO loss.

(d) Numerically: for $\beta = 0.1$, $\pi_{\mathrm{ref}}(A) = 0.6$, $\pi_{\mathrm{ref}}(B) = 0.4$, $r(A) = 2$, $r(B) = 0$: compute $\pi^*$ and the resulting $D_{\mathrm{KL}}(\pi^*\|\pi_{\mathrm{ref}})$.


Exercise 8 (**) - Knowledge distillation analysis. A teacher model outputs logits $\mathbf{z}_T = (3, 1, -2)$ for three classes and a student outputs $\mathbf{z}_S = (2, 0, -1)$.

(a) Compute $p_T = \mathrm{softmax}(\mathbf{z}_T)$ and $p_S = \mathrm{softmax}(\mathbf{z}_S)$.

(b) Compute $D_{\mathrm{KL}}(p_T\|p_S)$ (forward KL, what distillation minimizes) and $D_{\mathrm{KL}}(p_S\|p_T)$ (reverse KL).

(c) Now compute soft labels at temperature $\tau = 3$: $p_T^{(\tau)}$ and $p_S^{(\tau)}$. Recompute both KL divergences. How does temperature affect the magnitude of the KL and the information in the soft labels?

(d) What is the total distillation loss $\mathcal{L} = \lambda D_{\mathrm{KL}}(p_T^{(\tau)}\|p_S^{(\tau)}) + (1-\lambda)H(\mathbf{y}_{\mathrm{true}}, p_S)$ for $\lambda = 0.7$ and $\mathbf{y}_{\mathrm{true}} = (1, 0, 0)$?

(e) Argue why forward KL (teacher to student) is used rather than reverse KL. What kind of "dark knowledge" is preserved?

12. Why This Matters for AI (2026 Perspective)

| Concept | AI Impact |
| --- | --- |
| Forward KL = MLE | Every language model trained by next-token prediction (GPT-4, LLaMA-3, Claude, Gemini, Mistral) minimizes $D_{\mathrm{KL}}(p_{\mathrm{data}}\Vert p_{\boldsymbol{\theta}})$; the cross-entropy loss IS KL divergence up to the constant $H(p_{\mathrm{data}})$ |
| ELBO = KL decomposition | All VAE-based generative models (image, speech, latent diffusion) optimize reconstruction + KL; the KL term determines latent space structure |
| RLHF KL penalty | Instruction-tuned LLMs (InstructGPT, Claude, Gemini) use $\beta D_{\mathrm{KL}}(\pi\Vert\pi_{\mathrm{ref}})$ to prevent reward hacking; $\beta$ is one of the most important hyperparameters in alignment |
| DPO implicit KL | DPO (used in Mixtral-Instruct, LLaMA-3, and many other chat models) reframes RLHF as direct KL-constrained optimization; the log-ratio $\log(\pi/\pi_{\mathrm{ref}})$ IS the implicit reward |
| Forward vs reverse KL | Choosing forward KL (MLE, distillation) vs reverse KL (VI, mean-field) is a defining design decision of probabilistic ML; the wrong choice -> blurry outputs or posterior collapse |
| Knowledge distillation | LLM compression (DistilGPT-2, Phi-2, Phi-3) uses forward KL from teacher to student; soft labels at temperature $\tau > 1$ carry 10-100x more information than one-hot labels |
| Posterior collapse | When the decoder is powerful, the VAE KL term collapses to zero - a major pathology in text VAEs (Bowman et al., 2016); fixed by KL annealing or $\beta$-VAE weighting |
| PPO trust region | PPO's clip objective approximates a KL trust region; the $\varepsilon \in \{0.1, 0.2\}$ clip threshold corresponds to a specific KL budget per update step |
| GAN = JSD minimization | The original GAN (Goodfellow et al., 2014) minimizes JSD; JSD's saturation when supports are disjoint causes vanishing gradients and mode collapse - motivating WGAN / Wasserstein distance |
| Renyi DP for private training | Differential privacy in LLM fine-tuning (DP-SGD, used at Google) uses Renyi divergence; Renyi DP composes additively, making privacy budgets tractable |
| Diffusion ELBO | The DDPM training objective is a sum of per-step KL divergences; the connection to score matching makes the KL formulation equivalent to noise prediction |
| Natural gradient | K-FAC (a second-order optimizer for neural networks) approximates the Fisher matrix $F = \mathbb{E}[\nabla\log p\, \nabla\log p^\top]$, which is the local KL curvature; natural gradient descent is the geometrically correct first-order method on the statistical manifold |
| Calibration | $D_{\mathrm{KL}}(p_{\mathrm{true}}\Vert p_{\boldsymbol{\theta}}) = 0$ iff the model is perfectly calibrated; calibration techniques (temperature scaling, label smoothing) implicitly minimize KL between predicted and calibrated distributions |

13. Conceptual Bridge

KL divergence sits at the center of the information theory chapter, connecting the individual uncertainty measured by entropy (Section 01) to the distributional relationships measured by mutual information (Section 03), the training objectives formalized as cross-entropy (Section 04), and the local geometry captured by Fisher information (Section 05). It is not an isolated concept - it is the engine that drives all four.

Backward: from Entropy. In Section 01, entropy $H(X) = -\sum_x p\log p$ measured the intrinsic uncertainty of a single distribution. KL divergence extends this to pairs of distributions: $D_{\mathrm{KL}}(p\|q) = H(p,q) - H(p)$ measures the extra uncertainty you incur by confusing $q$ for $p$. The non-negativity of KL proved the fundamental bound $H(X) \le \log|\mathcal{X}|$ that appeared in Section 01 - Gibbs' inequality was already implicitly used there. Every result about optimal codes (Huffman, arithmetic coding) is ultimately about minimizing KL between the empirical distribution and the code-induced distribution.

Forward: to Mutual Information. Mutual information $I(X;Y) = D_{\mathrm{KL}}(P(X,Y)\|P(X)P(Y))$ (Section 03) is a KL divergence measuring the dependence between two variables. The chain rule for KL (Section 3.5) becomes the chain rule for entropy when applied to the independence gap. The data processing inequality for KL becomes the DPI for mutual information: $I(X;Z) \le I(X;Y)$ whenever $X \to Y \to Z$ forms a Markov chain, i.e., $Z$ depends on $X$ only through $Y$. Understanding KL makes mutual information, and all the contrastive learning objectives built on it (InfoNCE, SimCLR, CLIP), immediately transparent.

Forward: to Cross-Entropy and Fisher Information. Cross-entropy $H(p,q)$ is KL plus a constant (Section 04); Fisher information is KL's local curvature (Section 05). The three-way connection $H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q) \approx H(p) + \frac{1}{2}\boldsymbol{\delta}^\top F\boldsymbol{\delta}$ (for small perturbations) shows that these are all the same object at different scales: global (cross-entropy), local (Fisher).

KL DIVERGENCE IN THE INFORMATION THEORY CHAPTER

  Chapter 09 - Information Theory
    Section 01  ENTROPY  H(X)                  -- "extend to pairs of distributions"
    Section 02  KL DIVERGENCE  D_KL(p || q)    <- YOU ARE HERE
    Section 03  MUTUAL INFORMATION   I(X;Y) = D_KL(P(X,Y) || P(X)P(Y))
    Section 04  CROSS-ENTROPY        H(p,q) = H(p) + D_KL(p || q)
    Section 05  FISHER INFORMATION   F(theta) = local KL curvature

  ML CONNECTIONS:

    D_KL(p_data || p_theta)       ->  Language model training (MLE)
    D_KL(q_phi(z|x) || p(z))      ->  VAE regularizer (ELBO)
    D_KL(pi_theta || pi_ref)      ->  RLHF / PPO / DPO alignment
    D_KL(p_teacher || p_student)  ->  Knowledge distillation
    D_KL(q || p_posterior)        ->  Variational inference (VI)

  CHAPTER POSITION:

  09-01 Entropy
  09-02 KL Divergence  <- current section
  09-03 Mutual Information  (KL between joint and product of marginals)
  09-04 Cross-Entropy       (H(p,q) = H(p) + D_KL(p || q))
  09-05 Fisher Information  (local KL curvature = Riemannian metric)

Looking ahead beyond this chapter: KL divergence appears again in Chapter 12 (Functional Analysis), where it generalizes to the RKHS setting and connects to kernel methods and maximum mean discrepancy (MMD). It appears in Chapter 21 (Statistical Learning Theory) as a key quantity in PAC-Bayes bounds: the generalization gap of a stochastic predictor is bounded by $D_{\mathrm{KL}}(Q\|P)/n$, where $Q$ is the posterior and $P$ is the prior on parameters. And in Chapter 22 (Causal Inference), KL divergence between interventional and observational distributions quantifies the "causal effect" of an intervention in terms of distributional shift.

The single formula $\sum_x p(x)\log(p(x)/q(x))$ - first written down by Kullback and Leibler in 1951 - underlies more of modern AI than perhaps any other equation in this curriculum.

