
KL Divergence - Appendices A (Detailed Proofs) through J (Notation and Conventions)

Appendix A: Detailed Proofs

A.1 Proof of Non-Negativity via Log-Sum Inequality

The log-sum inequality provides an alternative, direct proof of Gibbs' inequality, with Jensen's inequality invoked only once, inside the proof of the log-sum inequality itself.

Log-Sum Inequality. For non-negative reals $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$:

$$\sum_{i=1}^n a_i \log\frac{a_i}{b_i} \ge \left(\sum_{i=1}^n a_i\right)\log\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}$$

with equality iff all $a_i/b_i$ are equal.

Proof of log-sum inequality. The function $f(t) = t\log t$ is convex (since $f''(t) = 1/t > 0$). Write $A = \sum_j a_j$ and $B = \sum_j b_j$. By the weighted Jensen inequality with weights $b_i/B$ applied to the points $a_i/b_i$:

$$\frac{1}{B}\sum_i b_i \cdot \frac{a_i}{b_i}\log\frac{a_i}{b_i} \ge \left(\frac{1}{B}\sum_i b_i \cdot \frac{a_i}{b_i}\right)\log\left(\frac{1}{B}\sum_i b_i \cdot \frac{a_i}{b_i}\right) = \frac{A}{B}\log\frac{A}{B}$$

Multiplying by $B$ gives the result. $\square$

KL non-negativity from log-sum inequality. Apply the inequality with $a_i = p(x_i)$ and $b_i = q(x_i)$: since $\sum_i p_i = 1 = \sum_i q_i$,

$$D_{\mathrm{KL}}(p\|q) = \sum_i p_i\log\frac{p_i}{q_i} \ge \left(\sum_i p_i\right)\log\frac{\sum_i p_i}{\sum_i q_i} = 1 \cdot \log\frac{1}{1} = 0. \quad\square$$

A.2 Convexity of KL: Detailed Proof

We prove that $(p,q) \mapsto D_{\mathrm{KL}}(p\|q)$ is jointly convex. The key object is the perspective function $g(s,t) = s\log(s/t)$ of the convex function $s \mapsto s\log s$; we verify its joint convexity directly via the Hessian.

Lemma. The function $g(s,t) = s\log(s/t)$ (with $g(0,t) = 0$ and $g(s,0) = +\infty$ for $s > 0$) is jointly convex on $\{(s,t) : s \ge 0,\ t > 0\}$.

Proof. Write $g(s,t) = s\log s - s\log t$. We need $H_g \succeq 0$. For $s > 0$ the Hessian is:

$$H_g = \begin{pmatrix} 1/s & -1/t \\ -1/t & s/t^2 \end{pmatrix}$$

The determinant is $(1/s)(s/t^2) - 1/t^2 = 0$ and the trace is $1/s + s/t^2 > 0$; for a $2\times 2$ matrix, $\det \ge 0$ and trace $\ge 0$ imply PSD. $\square$

Since $D_{\mathrm{KL}}(p\|q) = \sum_x g(p(x), q(x))$ and sums of jointly convex functions are jointly convex, $D_{\mathrm{KL}}$ is jointly convex. $\square$

A.3 Pinsker's Inequality

Theorem (Pinsker's Inequality). $\mathrm{TV}(p,q) \le \sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(p\|q)}$.

Proof. The proof reduces to the binary case. For $a, b \in [0,1]$, the two-point inequality states, in nats:

$$a\log\frac{a}{b} + (1-a)\log\frac{1-a}{1-b} \ge 2(a-b)^2$$

Apply the data-processing inequality to the binary partition $\{A, A^c\}$ with $A = \{p > q\}$, the event that achieves the total variation ($p(A) - q(A) = \mathrm{TV}(p,q)$):

$$D_{\mathrm{KL}}(p\|q) \ge D_{\mathrm{KL}}(\text{Bern}(p(A))\|\text{Bern}(q(A))) \ge 2(p(A) - q(A))^2 = \frac{1}{2}(2\,\mathrm{TV})^2$$

Rearranging: $\mathrm{TV}(p,q) \le \sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(p\|q)}$. $\square$

A related bound is the Bretagnolle-Huber inequality, $\mathrm{TV}(p,q) \le \sqrt{1 - e^{-D_{\mathrm{KL}}(p\|q)}}$: weaker for small KL, but still informative when the KL is large (Pinsker's bound exceeds 1 once $D_{\mathrm{KL}} > 2$).

Pinsker's inequality in AI: KL divergence is smooth and differentiable (gradients exist for optimization), while total variation bounds event probabilities directly. Pinsker's inequality lets us translate KL bounds from optimization theory into guarantees about individual event probabilities - useful in PAC-Bayes learning theory.

A.4 KL Divergence for the Gaussian: Step-by-Step

We derive the multivariate Gaussian KL formula in detail.

Let $p = \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)$ and $q = \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)$ on $\mathbb{R}^d$. The log-ratio is:

$$\log\frac{p(\mathbf{x})}{q(\mathbf{x})} = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^\top\Sigma_1^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^\top\Sigma_2^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) + \frac{1}{2}\log\frac{\det\Sigma_2}{\det\Sigma_1}$$

Taking the expectation under $p$, using:

  • $\mathbb{E}_p[(\mathbf{x}-\boldsymbol{\mu}_1)^\top\Sigma_1^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)] = \operatorname{tr}(\Sigma_1^{-1}\Sigma_1) = d$
  • $\mathbb{E}_p[(\mathbf{x}-\boldsymbol{\mu}_2)^\top\Sigma_2^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)] = \operatorname{tr}(\Sigma_2^{-1}\Sigma_1) + (\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^\top\Sigma_2^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)$

gives:

$$D_{\mathrm{KL}}(p\|q) = \frac{1}{2}\left[\operatorname{tr}(\Sigma_2^{-1}\Sigma_1) + (\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^\top\Sigma_2^{-1}(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1) - d + \log\frac{\det\Sigma_2}{\det\Sigma_1}\right]$$

The terms correspond to:

  1. $\operatorname{tr}(\Sigma_2^{-1}\Sigma_1) - d$: how much the covariances differ (zero when $\Sigma_1 = \Sigma_2$)
  2. $(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^\top\Sigma_2^{-1}(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)$: squared Mahalanobis distance between the means
  3. $\log(\det\Sigma_2/\det\Sigma_1)$: log ratio of volumes (zero when $\det\Sigma_1 = \det\Sigma_2$)
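To make the formula concrete, here is a minimal NumPy sketch (the function name and test values are our own illustration, not from the lesson):

import numpy as np

def gaussian_kl(mu1, S1, mu2, S2):
    """D_KL(N(mu1, S1) || N(mu2, S2)) in nats, via the closed form above."""
    d = mu1.shape[0]
    S2_inv = np.linalg.inv(S2)
    diff = mu2 - mu1
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    return 0.5 * (np.trace(S2_inv @ S1)      # covariance mismatch
                  + diff @ S2_inv @ diff     # squared Mahalanobis distance
                  - d                        # dimension offset
                  + logdet2 - logdet1)       # log volume ratio

# Sanity check: the KL of a Gaussian against itself is zero.
mu = np.array([1.0, -2.0])
S = np.array([[2.0, 0.3], [0.3, 1.0]])
assert abs(gaussian_kl(mu, S, mu, S)) < 1e-12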

Appendix B: KL Divergence Computations and Reference Formulas

B.1 Quick Reference: KL Formulas for Common Distributions

KL DIVERGENCE CLOSED FORMS - QUICK REFERENCE


  SCALAR GAUSSIAN:
  D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))
      = log(sigma2/sigma1) + (sigma1^2 + (mu1 - mu2)^2)/(2 sigma2^2) - 1/2

  VAE ENCODER -> STANDARD NORMAL:
  D_KL(N(mu, sigma^2) || N(0, 1)) = (1/2)(mu^2 + sigma^2 - log sigma^2 - 1)

  MULTIVARIATE GAUSSIAN:
  D_KL(N(mu1, Sigma1) || N(mu2, Sigma2))
      = (1/2)[tr(Sigma2^-1 Sigma1) + (mu2 - mu1)^T Sigma2^-1 (mu2 - mu1) - d
              + log(det Sigma2 / det Sigma1)]

  BERNOULLI:
  D_KL(Bern(p) || Bern(q)) = p log(p/q) + (1-p) log((1-p)/(1-q))

  CATEGORICAL (general discrete):
  D_KL(p || q) = sum_k p_k log(p_k / q_k)

  POISSON:
  D_KL(Poisson(lambda1) || Poisson(lambda2))
      = lambda1 log(lambda1/lambda2) - lambda1 + lambda2

  EXPONENTIAL:
  D_KL(Exp(lambda1) || Exp(lambda2)) = log(lambda1/lambda2) + lambda2/lambda1 - 1

  BETA:
  D_KL(Beta(alpha1, beta1) || Beta(alpha2, beta2))
      = log[B(alpha2, beta2)/B(alpha1, beta1)]
        + (alpha1 - alpha2) psi(alpha1) + (beta1 - beta2) psi(beta1)
        + (alpha2 - alpha1 + beta2 - beta1) psi(alpha1 + beta1)
      where psi = digamma function, B = beta function


B.2 Worked Example: KL Between Poisson Distributions

A Poisson distribution $\text{Pois}(\lambda)$ has PMF $p(k) = e^{-\lambda}\lambda^k/k!$ for $k = 0,1,2,\ldots$

For $D_{\mathrm{KL}}(\text{Pois}(\lambda_1)\|\text{Pois}(\lambda_2))$, the log-ratio of PMFs is:

$$\log\frac{p_{\lambda_1}(k)}{p_{\lambda_2}(k)} = (\lambda_2 - \lambda_1) + k\log\frac{\lambda_1}{\lambda_2}$$

$$D_{\mathrm{KL}} = \mathbb{E}_{\lambda_1}\!\left[(\lambda_2-\lambda_1) + k\log\frac{\lambda_1}{\lambda_2}\right] = (\lambda_2 - \lambda_1) + \lambda_1\log\frac{\lambda_1}{\lambda_2}$$

For $\lambda_1 = 3, \lambda_2 = 2$: $D_{\mathrm{KL}} = (2-3) + 3\ln(3/2) = -1 + 3(0.405) = 0.216$ nats.
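A quick numeric check of this value (a minimal sketch; we truncate the infinite sum at k = 100, far beyond any appreciable Poisson(3) mass):

import math

lam1, lam2 = 3.0, 2.0

# Closed form derived above.
closed = (lam2 - lam1) + lam1 * math.log(lam1 / lam2)

# Direct (truncated) sum over the PMFs.
def pois_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

direct = sum(pois_pmf(k, lam1) * math.log(pois_pmf(k, lam1) / pois_pmf(k, lam2))
             for k in range(100))

print(closed, direct)  # both ~0.2164 nats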

For AI: Poisson distributions model count data (number of events, word counts in LDA topic models). The KL between Poisson distributions appears in the ELBO for Poisson process models and in topic model variational inference.

B.3 Worked Example: KL for Categorical - Softmax Outputs

Let $\mathbf{z}_1 = (2.0, 1.0, 0.1)$ and $\mathbf{z}_2 = (1.5, 1.0, 0.5)$ be logit vectors.

Step 1: $p_1 = \text{softmax}(\mathbf{z}_1)$:

$$p_1 = (e^2, e^1, e^{0.1})/(e^2 + e^1 + e^{0.1}) = (7.389, 2.718, 1.105)/11.212 = (0.659, 0.242, 0.099)$$

Step 2: $p_2 = \text{softmax}(\mathbf{z}_2)$:

$$p_2 = (e^{1.5}, e^1, e^{0.5})/(e^{1.5} + e^1 + e^{0.5}) = (4.482, 2.718, 1.649)/8.849 = (0.507, 0.307, 0.186)$$

Step 3: $D_{\mathrm{KL}}(p_1\|p_2)$:

$$= 0.659\ln(0.659/0.507) + 0.242\ln(0.242/0.307) + 0.099\ln(0.099/0.186)$$

$$= 0.659(0.261) + 0.242(-0.238) + 0.099(-0.635) = 0.172 - 0.058 - 0.063 = 0.051 \text{ nats}$$

For AI: This is the KL component when computing the distillation loss between a teacher ($p_1$) and a student ($p_2$). The small value (0.051 nats) indicates the student's distribution is reasonably close to the teacher's.
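The same computation in PyTorch, as a sanity check (a minimal sketch; variable names are ours):

import torch
import torch.nn.functional as F

z1 = torch.tensor([2.0, 1.0, 0.1])  # teacher logits
z2 = torch.tensor([1.5, 1.0, 0.5])  # student logits

log_p1 = F.log_softmax(z1, dim=-1)
log_p2 = F.log_softmax(z2, dim=-1)

# KL(p1 || p2) = sum p1 * (log p1 - log p2)
kl = torch.sum(log_p1.exp() * (log_p1 - log_p2))
print(kl.item())  # ~0.053 nats (the hand computation above rounds to 0.051)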

B.4 KL Divergence and Hypothesis Testing

The connection between KL divergence and hypothesis testing is both historical (Kullback 1959) and practically important.

Setting: $n$ i.i.d. observations from an unknown distribution. Null hypothesis $H_0$: data from $p$; alternative $H_1$: data from $q$.

Log-likelihood ratio test statistic:

$$\Lambda_n = \sum_{i=1}^n \log\frac{p(x_i)}{q(x_i)}$$

By the law of large numbers, $\Lambda_n/n \to \mathbb{E}_p[\log(p/q)] = D_{\mathrm{KL}}(p\|q)$ as $n \to \infty$ when the truth is $p$.

Stein's lemma (type II error rate): For fixed type I error probability $\alpha \in (0,1)$, the best achievable type II error probability $\beta_n^*$ (accepting $H_0$ when the data actually come from $q$) decays as:

$$\lim_{n\to\infty} \frac{1}{n}\log\beta_n^* = -D_{\mathrm{KL}}(p\|q)$$

Larger KL divergence -> faster exponential decay of type II error -> easier to distinguish $p$ from $q$ with more data.

Chernoff exponent: The best achievable equal-error-rate exponent (driving both error types to zero at the same exponential rate) is:

$$C(p,q) = -\min_{\alpha\in[0,1]}\log\sum_x p(x)^\alpha q(x)^{1-\alpha} = \max_{\alpha\in(0,1)} (1-\alpha)\,D_\alpha(p\|q)$$

a scaled Rényi divergence maximized over orders $\alpha \in (0,1)$.

For AI: The hypothesis testing perspective explains why DKL(ππref)D_{\mathrm{KL}}(\pi\|\pi_{\mathrm{ref}}) in RLHF bounds how detectable the policy shift is: small KL means the fine-tuned model's outputs are statistically indistinguishable from the reference model's outputs.


Appendix C: KL Divergence in Broader ML Contexts

C.1 PAC-Bayes Bounds

PAC-Bayes theory (McAllester 1999; Seeger 2002) gives generalization bounds for stochastic predictors. The key theorem:

PAC-Bayes Theorem. Let $P$ be a prior over hypotheses (fixed before seeing data) and $Q$ be any posterior (possibly data-dependent). For any $\delta > 0$, with probability $\ge 1 - \delta$ over the training set:

$$\mathbb{E}_{h \sim Q}[\mathcal{L}(h)] \le \mathbb{E}_{h \sim Q}[\hat{\mathcal{L}}(h)] + \sqrt{\frac{D_{\mathrm{KL}}(Q\|P) + \log(2\sqrt{n}/\delta)}{2n}}$$

where $\mathcal{L}(h)$ is the true loss and $\hat{\mathcal{L}}(h)$ is the empirical loss.

Interpretation: The generalization gap is bounded by roughly $\sqrt{D_{\mathrm{KL}}(Q\|P)/n}$. The posterior $Q$ can be any distribution over hypotheses - including one concentrated around a specific trained network - and the bound still holds. Models that stay "close" to a simple prior in KL generalize well. Modern applications: PAC-Bayes bounds for LoRA-fine-tuned LLMs (with $P$ centered at the pretrained weights and $Q$ at the fine-tuned weights).

C.2 KL in Variational Inference for Bayesian Neural Networks

Bayesian neural networks (BNNs) maintain distributions over weights $\boldsymbol{\theta}$ rather than point estimates. The posterior $p(\boldsymbol{\theta}|\mathcal{D})$ is intractable for large networks. Variational inference minimizes:

$$D_{\mathrm{KL}}(q(\boldsymbol{\theta})\|p(\boldsymbol{\theta}|\mathcal{D})) = D_{\mathrm{KL}}(q(\boldsymbol{\theta})\|p(\boldsymbol{\theta})) - \mathbb{E}_q[\log p(\mathcal{D}|\boldsymbol{\theta})] + \log p(\mathcal{D})$$

The ELBO is $\mathcal{L} = \mathbb{E}_q[\log p(\mathcal{D}|\boldsymbol{\theta})] - D_{\mathrm{KL}}(q(\boldsymbol{\theta})\|p(\boldsymbol{\theta}))$. The KL term $D_{\mathrm{KL}}(q\|p)$ acts as a regularizer: it penalizes variational distributions that deviate from the prior $p(\boldsymbol{\theta})$.

Practical implementations: Bayes By Backprop (Blundell et al., 2015) parameterizes $q(\boldsymbol{\theta}) = \prod_i \mathcal{N}(\mu_i, \sigma_i^2)$ and trains with the reparameterization trick: each weight is sampled as $\theta_i = \mu_i + \sigma_i\varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0,1)$. For a Gaussian prior $\mathcal{N}(0, \sigma_p^2)$, the KL term per weight is $\frac{1}{2}(\mu_i^2/\sigma_p^2 + \sigma_i^2/\sigma_p^2 - 1 - \log(\sigma_i^2/\sigma_p^2))$.
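A sketch of that per-weight KL term in PyTorch (names are ours; sigma is parameterized through its log so it stays positive by construction):

import torch

def bbb_kl(mu, log_sigma, sigma_prior=1.0):
    """KL( N(mu, sigma^2) || N(0, sigma_prior^2) ), summed over all weights."""
    var_ratio = torch.exp(2 * log_sigma) / sigma_prior**2
    return 0.5 * torch.sum(mu**2 / sigma_prior**2 + var_ratio - 1 - torch.log(var_ratio))

mu, log_sigma = torch.zeros(10), torch.zeros(10)  # q = N(0, 1) for every weight
print(bbb_kl(mu, log_sigma))  # tensor(0.) -- q coincides with the prior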

C.3 KL in Contrastive Learning and Representation

Contrastive learning methods like SimCLR (Chen et al., 2020) and MoCo (He et al., 2020) maximize a lower bound on mutual information:

$$I(X; Z) \ge \mathbb{E}[\text{InfoNCE}(X, Z)] = \mathbb{E}\!\left[\log\frac{e^{f(x,z)}}{\frac{1}{N}\sum_{j=1}^N e^{f(x,z_j)}}\right]$$

The InfoNCE loss lower-bounds $I(X;Z)$, which is itself a KL divergence, $D_{\mathrm{KL}}(P(X,Z)\|P(X)P(Z))$ (see Section 09-03). CLIP (Radford et al., 2021) uses a symmetric InfoNCE to align image and text embeddings - equivalent to maximizing the KL between the joint image-text distribution and the product of marginals.

CPC (Contrastive Predictive Coding, van den Oord et al., 2018): Used in speech (wav2vec 2.0, HuBERT) and images. The representation $z_t$ at time $t$ predicts future $z_{t+k}$ via a density ratio estimator. The InfoNCE objective is a lower bound on $\sum_k I(z_t; z_{t+k})$. Training audio encoders this way produces features that capture the semantic structure of speech - because mutual information (a KL divergence) rewards capturing any statistical dependency.

C.4 KL in Score-Based Generative Models

Score-based models (Song & Ermon, 2020) and their diffusion counterparts learn the score function $\nabla_\mathbf{x}\log p(\mathbf{x})$ to generate samples via Langevin dynamics. The connection to KL:

Score matching minimizes:

$$J(\boldsymbol{\theta}) = \mathbb{E}_p\!\left[\frac{1}{2}\|s_{\boldsymbol{\theta}}(\mathbf{x}) - \nabla\log p(\mathbf{x})\|^2\right]$$

where $s_{\boldsymbol{\theta}} \approx \nabla\log p$ is the learned score. Integration by parts (Hyvärinen, 2005) removes the unknown true score:

$$J(\boldsymbol{\theta}) = \mathbb{E}_p\!\left[\frac{1}{2}\|s_{\boldsymbol{\theta}}(\mathbf{x})\|^2 + \nabla\cdot s_{\boldsymbol{\theta}}(\mathbf{x})\right] + \text{const}$$

The objective being minimized is the Fisher divergence $\int p(\mathbf{x})\|\nabla\log(p(\mathbf{x})/q(\mathbf{x}))\|^2 d\mathbf{x}$, a gradient-based relative of KL that compares the gradient of the log-density ratio rather than the ratio itself. This is the information-geometric connection between score matching and KL divergence.

C.5 KL and Temperature in LLM Sampling

LLM inference involves sampling from $p_\tau(y|x) = \text{softmax}(\mathbf{z}/\tau)$ at temperature $\tau$. The KL divergence between the original distribution ($\tau = 1$) and the temperature-scaled distribution is:

$$D_{\mathrm{KL}}(p_1\|p_\tau) = \sum_k p_1(k)\log\frac{p_1(k)}{p_\tau(k)} = \sum_k p_1(k)\left[z_k - \log\sum_j e^{z_j} - \frac{z_k}{\tau} + \log\sum_j e^{z_j/\tau}\right]$$

As $\tau \to 0$: $p_\tau \to$ one-hot (deterministic, picks the argmax), and $D_{\mathrm{KL}}(p_1\|p_\tau) \to +\infty$ whenever $p_1$ puts any mass off the argmax, since $p_\tau$ assigns vanishing probability there. (The reverse direction stays finite: $D_{\mathrm{KL}}(p_\tau\|p_1) \to -\log p_1(\operatorname{argmax})$.)

As $\tau \to \infty$: $p_\tau \to$ uniform, and $D_{\mathrm{KL}}(p_1\|p_\tau) \to \log|\mathcal{V}| - H(p_1)$ - the divergence equals the KL from uniform.

Practical implications: The "temperature" knob in LLM APIs (temperature=0.7, top-p=0.9) directly controls the divergence from the trained distribution. High temperature increases diversity (pushes $p_\tau$ toward uniform); low temperature reduces diversity (pushes it toward the argmax). The KL from the trained distribution is zero at $\tau = 1$ and grows in both directions.
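A small PyTorch sketch of this KL-versus-temperature curve (example logits are ours):

import torch
import torch.nn.functional as F

z = torch.tensor([2.0, 1.0, 0.1])  # example logits

def kl_to_temperature(z, tau):
    """D_KL(p_1 || p_tau) for softmax distributions of the same logits."""
    log_p1 = F.log_softmax(z, dim=-1)
    log_pt = F.log_softmax(z / tau, dim=-1)
    return torch.sum(log_p1.exp() * (log_p1 - log_pt)).item()

for tau in [0.25, 0.5, 1.0, 2.0, 8.0]:
    print(f"tau={tau}: KL = {kl_to_temperature(z, tau):.4f} nats")
# Zero at tau = 1, growing in both directions (steeply as tau -> 0).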


Appendix D: Numerical Stability and Implementation

D.1 The log-sum-exp Trick for KL Computation

Computing DKL(pq)D_{\mathrm{KL}}(p\|q) with softmax distributions requires care to avoid numerical overflow.

Naive implementation (unstable):

# WRONG - can overflow for large logits (and log(0) = -inf for zero probabilities)
import torch

kl = torch.sum(p * (torch.log(p) - torch.log(q)))

Stable implementation using log-softmax:

# CORRECT - uses log-space throughout
import torch
import torch.nn.functional as F

def kl_divergence_stable(logits_p, logits_q):
    """
    KL(p || q) where p = softmax(logits_p), q = softmax(logits_q).
    Uses log_softmax for numerical stability.
    """
    log_p = F.log_softmax(logits_p, dim=-1)  # log(softmax) in one step
    log_q = F.log_softmax(logits_q, dim=-1)
    p = torch.exp(log_p)
    return torch.sum(p * (log_p - log_q), dim=-1)

PyTorch also provides torch.nn.functional.kl_div(log_q, p), which computes $\sum p\,(\log p - \log q)$ given a pre-computed log_q. Note the unusual argument order: the first argument is the log-probabilities of $q$ (the approximating distribution) and the second is $p$ (as probabilities).

D.2 Estimating KL from Samples

When only samples are available (not closed-form distributions), KL must be estimated.

Monte Carlo estimator. Given samples $x_1, \ldots, x_n \sim p$ and access to $\log p(x)$ and $\log q(x)$:

$$\hat{D}_{\mathrm{KL}}(p\|q) = \frac{1}{n}\sum_{i=1}^n \log\frac{p(x_i)}{q(x_i)}$$

This is unbiased: $\mathbb{E}[\hat{D}] = D_{\mathrm{KL}}(p\|q)$.

k-nearest-neighbor estimator (Perez-Cruz, 2008). When neither $p(x)$ nor $q(x)$ is known analytically - only samples from both distributions - the kNN estimator uses:

$$\hat{D}_{\mathrm{KL}}(p\|q) = \frac{d}{n}\sum_{i=1}^n \log\frac{\nu_k(x_i)}{\rho_k(x_i)} + \log\frac{m}{n-1}$$

where $\rho_k(x_i)$ is the distance from $x_i$ to its $k$-th nearest neighbor among the other $p$-samples, $\nu_k(x_i)$ is the distance to its $k$-th nearest neighbor among the $q$-samples, $n$ = number of $p$-samples, $m$ = number of $q$-samples, and $d$ = dimension. (Distant $q$-neighbors relative to $p$-neighbors indicate regions where $p$ has mass but $q$ does not, inflating the KL.)

This estimator is used in representation learning evaluation, where you have samples from an encoder's output distribution qq and want to measure DKL(qptarget)D_{\mathrm{KL}}(q\|p_{\text{target}}) without a closed form.
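A minimal O(n^2) NumPy sketch of the estimator (our own illustration; production code would use a KD-tree for the neighbor searches):

import numpy as np

def knn_kl_estimate(x_p, x_q, k=1):
    """kNN estimate of KL(p || q) from samples x_p ~ p (n, d) and x_q ~ q (m, d)."""
    n, d = x_p.shape
    m = x_q.shape[0]
    total = 0.0
    for x in x_p:
        rho = np.sort(np.linalg.norm(x_p - x, axis=1))[k]     # k-th NN in p (index 0 is x itself)
        nu = np.sort(np.linalg.norm(x_q - x, axis=1))[k - 1]  # k-th NN in q
        total += np.log(nu / rho)
    return (d / n) * total + np.log(m / (n - 1))

# Sanity check against the closed form for two 1-D Gaussians: KL = 0.5 nats.
rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=(2000, 1))
x_q = rng.normal(1.0, 1.0, size=(2000, 1))
print(knn_kl_estimate(x_p, x_q))  # close to 0.5, up to sampling noise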

D.3 Symmetric KL Implementation

For applications requiring symmetric divergence without the issues of JSD (which requires computing the mixture), Jeffrey's divergence is simpler:

import numpy as np

def jeffreys_divergence(p_probs, q_probs, eps=1e-10):
    """
    J(p, q) = 0.5 * (KL(p||q) + KL(q||p))
    Symmetric, and avoids forming the mixture distribution that JSD requires.
    """
    p = p_probs + eps   # smooth to avoid log(0) and division by zero
    q = q_probs + eps
    p = p / p.sum()     # renormalize after smoothing
    q = q / q.sum()
    kl_pq = (p * np.log(p / q)).sum()
    kl_qp = (q * np.log(q / p)).sum()
    return 0.5 * (kl_pq + kl_qp)

D.4 KL Divergence in PyTorch for VAE Training

The standard VAE KL term implementation:

import torch

def vae_kl_loss(mu, log_var):
    """
    D_KL(N(mu, sigma^2) || N(0, 1)) per-sample, summed over latent dimensions.

    Formula: 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)
           = 0.5 * sum(mu^2 + exp(log_var) - log_var - 1)

    Input: mu, log_var - shape (batch, latent_dim)
    Output: scalar loss (per-sample KL averaged over the batch)
    """
    kl = 0.5 * torch.sum(mu**2 + torch.exp(log_var) - log_var - 1, dim=-1)
    return kl.mean()  # average over batch

Note: using log_var (log of variance) instead of sigma directly avoids numerical issues with very small variances and ensures $\sigma^2 > 0$ by construction.

RLHF KL penalty implementation:

def rlhf_kl_penalty(logprobs_policy, logprobs_ref, beta=0.1):
    """
    Monte Carlo estimate of D_KL(pi_policy || pi_ref) on sampled sequences.

    logprobs_policy, logprobs_ref: log-probs of the *sampled* tokens under
                                   each model, shape (batch, seq_len)
    Returns: beta-weighted mean summed KL per sequence
    """
    # Tokens are sampled from pi_policy, so averaging log(pi/pi_ref) over
    # samples estimates E_pi[log pi - log pi_ref] = KL.
    kl_per_token = logprobs_policy - logprobs_ref  # log ratio per token
    kl_per_seq = kl_per_token.sum(dim=-1)          # sum over sequence length
    return beta * kl_per_seq.mean()

Appendix E: Variational Inference - A Deep Dive

E.1 The Variational Inference Problem

Bayesian inference requires computing the posterior $p(\mathbf{z}|\mathbf{x}) = p(\mathbf{x}|\mathbf{z})p(\mathbf{z})/p(\mathbf{x})$. The denominator $p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z})\,d\mathbf{z}$ is typically intractable - the integral has no closed form for nonlinear likelihoods.

Variational inference approximates the posterior with a tractable family $\mathcal{Q} = \{q_\phi : \phi \in \Phi\}$ by solving:

$$q^* = \arg\min_{q \in \mathcal{Q}} D_{\mathrm{KL}}(q(\mathbf{z}) \| p(\mathbf{z}|\mathbf{x}))$$

Why reverse KL? Because it is the only tractable direction. Computing $D_{\mathrm{KL}}(p\|q)$ requires expectations under $p(\mathbf{z}|\mathbf{x})$, which is exactly the intractable quantity we are trying to avoid. Computing $D_{\mathrm{KL}}(q\|p)$ requires only samples from $q$ and evaluations of $p(\mathbf{z})p(\mathbf{x}|\mathbf{z})$ - both tractable.

ELBO derivation. Expanding the reverse KL:

$$D_{\mathrm{KL}}(q(\mathbf{z})\|p(\mathbf{z}|\mathbf{x})) = \mathbb{E}_q\!\left[\log\frac{q(\mathbf{z})}{p(\mathbf{z}|\mathbf{x})}\right] = \mathbb{E}_q[\log q(\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{z}|\mathbf{x})]$$

Using $\log p(\mathbf{z}|\mathbf{x}) = \log p(\mathbf{x}|\mathbf{z}) + \log p(\mathbf{z}) - \log p(\mathbf{x})$:

$$= \mathbb{E}_q[\log q(\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{z})] + \log p(\mathbf{x})$$

Rearranging:

$$\log p(\mathbf{x}) = \underbrace{\mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})] - D_{\mathrm{KL}}(q(\mathbf{z})\|p(\mathbf{z}))}_{\mathcal{L}(q) = \text{ELBO}} + D_{\mathrm{KL}}(q(\mathbf{z})\|p(\mathbf{z}|\mathbf{x}))$$

Since $D_{\mathrm{KL}} \ge 0$: $\mathcal{L}(q) \le \log p(\mathbf{x})$ - the ELBO is a lower bound. Maximizing the ELBO over $q$ simultaneously (a) tightens the bound (reduces the KL gap) and (b) improves the model fit (increases $\log p(\mathbf{x})$ when $p_{\boldsymbol{\theta}}$ is also optimized).
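A tiny discrete sanity check of this decomposition (all numbers are made up; z takes two values):

import numpy as np

# Toy model: z in {0, 1}, a prior p(z), and a likelihood p(x|z) for one observed x.
p_z = np.array([0.7, 0.3])
p_x_given_z = np.array([0.2, 0.9])

p_x = np.sum(p_x_given_z * p_z)        # evidence p(x)
posterior = p_x_given_z * p_z / p_x    # exact posterior p(z|x)

q = np.array([0.5, 0.5])               # an arbitrary variational distribution
elbo = np.sum(q * np.log(p_x_given_z)) - np.sum(q * np.log(q / p_z))
kl_gap = np.sum(q * np.log(q / posterior))

print(np.log(p_x), elbo + kl_gap)  # identical: log p(x) = ELBO + KL(q || p(z|x))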

E.2 Mean-Field Variational Inference

The mean-field approximation restricts to factored distributions $q(\mathbf{z}) = \prod_{j=1}^d q_j(z_j)$. This is an enormous simplification - the full posterior may have complex correlations, but the approximation ignores all correlations between latent variables.

Coordinate ascent VI (CAVI). For mean-field, the optimal factor $q_j^*$ satisfies:

$$\log q_j^*(z_j) = \mathbb{E}_{q_{-j}}[\log p(\mathbf{x}, \mathbf{z})] + \text{const}$$

where $\mathbb{E}_{q_{-j}}$ denotes expectation over all factors except $q_j$. This is a fixed-point equation: each factor depends on the others. CAVI iterates: update $q_1$ given $q_2, \ldots, q_d$; then update $q_2$ given the new $q_1$ and the old $q_3, \ldots, q_d$; and so on.

Guarantee: Each update increases the ELBO (decreases the reverse KL). CAVI converges to a local optimum.

For AI: Mean-field VI was the basis for early Bayesian neural networks (BNNs) and topic models (LDA). Its mode-seeking bias (reverse KL) means it systematically underestimates posterior uncertainty - a known limitation. Modern BNNs often use Monte Carlo Dropout as a cheaper alternative.

E.3 ELBO as a Free Energy

The ELBO has a physical interpretation as a variational free energy (Helmholtz, Gibbs):

$$\mathcal{L}(q) = \mathbb{E}_q[\log p(\mathbf{x},\mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})] = \mathbb{E}_q[\log p(\mathbf{x},\mathbf{z})] + H(q)$$

This is the expected log joint plus the entropy of $q$. Maximizing the ELBO simultaneously:

  • Maximizes $\mathbb{E}_q[\log p(\mathbf{x},\mathbf{z})]$: encourages $q$ to put mass on high-probability regions of the joint
  • Maximizes $H(q)$: encourages $q$ to spread out (high entropy = uncertainty), preventing mode collapse

The tension between these two terms is the fundamental trade-off in variational inference. In physics, this is the Gibbs free energy minimization: equilibrium balances energy minimization (term 1) with entropy maximization (term 2), at temperature 1.

For AI: The free energy interpretation connects to energy-based models (LeCun et al., 2006) and to the "free energy principle" in computational neuroscience (Friston, 2010). The ELBO appears in all modern latent variable models: VAEs, VQ-VAE, neural process families, and latent diffusion models (Stable Diffusion uses a VAE for compression, then diffusion in the latent space).

E.4 Amortized Variational Inference

Standard VI optimizes a separate $q_\phi(\mathbf{z})$ per data point - $O(n)$ separate optimization problems. Amortized VI (Kingma & Welling, 2014) instead trains a single encoder network $q_\phi(\mathbf{z}|\mathbf{x})$ that takes $\mathbf{x}$ as input:

$$\phi^* = \arg\max_\phi \mathbb{E}_{p(\mathbf{x})}\!\left[\mathcal{L}(\phi, \boldsymbol{\theta}; \mathbf{x})\right]$$

The encoder "amortizes" the per-sample optimization across the full dataset, generalizing to new samples at inference time. This is what makes VAEs practical for large datasets: a single forward pass through the encoder yields $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x})))$ instantly.

Amortization gap: The amortized $q_\phi(\mathbf{z}|\mathbf{x})$ is generally worse than the optimal per-sample $q^*(\mathbf{z}|\mathbf{x})$ (it must generalize, so it trades off per-sample accuracy). The difference $D_{\mathrm{KL}}(q_\phi\|p) - D_{\mathrm{KL}}(q^*\|p) \ge 0$ is called the amortization gap and is an active research topic.


Appendix F: The RLHF-KL Alignment Triangle

F.1 Three Ways KL Appears in Alignment

Modern LLM alignment uses KL divergence in three distinct but related ways, forming a triangle of constraints:

RLHF-KL ALIGNMENT TRIANGLE

                 PRETRAINED MODEL pi_ref
                       /        \
             KL       /          \      KL
           penalty   /            \  constraint
                    /              \
          RLHF pi_theta  <------>  DPO pi_DPO
                     equivalent

  Fixed point: both objectives share the same optimal policy.

  Both methods maximize:
    E[r(x,y)] - beta * D_KL(pi_theta(.|x) || pi_ref(.|x))

  Optimal solution: pi*(y|x) proportional to pi_ref(y|x) * exp(r(x,y)/beta)


F.2 Interpreting the KL Coefficient beta

The KL coefficient $\beta$ in RLHF controls the fundamental exploration-exploitation trade-off in alignment:

| β value | Behavior | Risk |
| --- | --- | --- |
| β → 0 | Pure reward maximization; ignores the KL constraint | Reward hacking; repetitive, incoherent outputs |
| β → ∞ | Stays at the reference policy; no alignment | No improvement in helpfulness or safety |
| β ∈ [0.01, 0.5] | Typical range used in practice | Balances helpfulness and coherence |

Typical values in production: InstructGPT (Ouyang et al., 2022) used $\beta = 0.02$ for PPO-RLHF. Different applications need different $\beta$: safety alignment uses larger $\beta$ (conservative); instruction following uses smaller $\beta$ (allows larger policy changes).

Adaptive $\beta$ (KL controller): Some implementations adaptively adjust $\beta$ to keep $D_{\mathrm{KL}}(\pi_{\boldsymbol{\theta}}\|\pi_{\mathrm{ref}})$ near a target value $\kappa$:

$$\beta_{t+1} = \beta_t + \gamma(D_{\mathrm{KL}} - \kappa)$$

This is a simple feedback controller on the KL divergence. If the policy drifts too far, $\beta$ increases; if it stays too close to the reference, $\beta$ decreases.
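A sketch of one such controller update (all constants are illustrative, not from any particular implementation):

def update_beta(beta, kl_measured, kl_target=6.0, gamma=0.01, beta_min=1e-4):
    """One step of the feedback rule above."""
    beta = beta + gamma * (kl_measured - kl_target)
    return max(beta, beta_min)  # beta must stay positive

beta = 0.1
for kl in [2.0, 4.0, 9.0, 12.0, 7.0]:  # made-up per-step KL measurements
    beta = update_beta(beta, kl)
    print(f"measured KL {kl:5.1f} -> beta {beta:.4f}")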

F.3 DPO's Implicit Reward

DPO (Rafailov et al., 2023) reveals that the log-ratio between policy and reference is the implicit reward:

$$r_{\boldsymbol{\theta}}(x,y) = \beta\log\frac{\pi_{\boldsymbol{\theta}}(y|x)}{\pi_{\mathrm{ref}}(y|x)}$$

This has a clean interpretation: the reward a sequence $y$ receives is proportional to how much more probability the aligned policy assigns to it than the reference does. Sequences the aligned model prefers more than the reference are rewarded; sequences it prefers less are penalized.

DPO gradient: The DPO loss gradient with respect to $\boldsymbol{\theta}$ is:

$$\nabla_{\boldsymbol{\theta}}\mathcal{L}_{\mathrm{DPO}} = -\beta\,\mathbb{E}\!\left[\sigma(\hat{r}_{\boldsymbol{\theta}}(y_l) - \hat{r}_{\boldsymbol{\theta}}(y_w))\left(\nabla\log\pi_{\boldsymbol{\theta}}(y_w) - \nabla\log\pi_{\boldsymbol{\theta}}(y_l)\right)\right]$$

where $\hat{r}_{\boldsymbol{\theta}}(y) = \beta\log(\pi_{\boldsymbol{\theta}}(y)/\pi_{\mathrm{ref}}(y))$. The weight $\sigma(\hat{r}(y_l) - \hat{r}(y_w))$ is large when the model assigns higher reward to the loser than to the winner - the loss upweights these hard examples. This has the same structure as hard-example mining in contrastive learning.
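A minimal sketch of the DPO loss in PyTorch (function and variable names are ours; log-probs are assumed to be summed over each response):

import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """
    logp_w / logp_l: policy log-probs of the chosen / rejected responses, shape (batch,).
    ref_logp_*: the same quantities under the frozen reference model.
    """
    r_w = beta * (logp_w - ref_logp_w)  # implicit reward of the winner
    r_l = beta * (logp_l - ref_logp_l)  # implicit reward of the loser
    return -F.logsigmoid(r_w - r_l).mean()

# Toy call with made-up summed log-probabilities:
print(dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
               torch.tensor([-13.0]), torch.tensor([-14.5])))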


Appendix G: Connections to Other Chapters

G.1 KL and Exponential Families (Chapter 7: Statistics)

Exponential families are the natural domain for KL divergence because:

  1. KL between exponential family members has closed form (Section 5.3)
  2. The maximum entropy distribution under moment constraints is an exponential family member (Section 09-01 Section 5)
  3. The Bregman divergence generated by the log-partition function $A(\boldsymbol{\eta})$ equals the KL between members with those natural parameters

Sufficiency: Kullback and Leibler's original paper introduced KL as a measure of information for discriminating between hypotheses. A statistic $T(\mathbf{x})$ is sufficient for parameter $\boldsymbol{\theta}$ iff $D_{\mathrm{KL}}(p_{\boldsymbol{\theta}_1}(\mathbf{x})\|p_{\boldsymbol{\theta}_2}(\mathbf{x})) = D_{\mathrm{KL}}(p_{\boldsymbol{\theta}_1}(T)\|p_{\boldsymbol{\theta}_2}(T))$ - sufficient statistics preserve all the discriminatory information. This is the information-theoretic definition of sufficiency.

G.2 KL and Convex Optimization (Chapter 8)

The ELBO maximization is a convex optimization problem in $q$ (for fixed model parameters $\boldsymbol{\theta}$):

  • $D_{\mathrm{KL}}(q\|p)$ is convex in $q$ (Section 3.4)
  • The constraint set $q \ge 0$, $\sum q = 1$ is convex (the probability simplex)
  • The I-projection $q^* = \arg\min_{q \in E} D_{\mathrm{KL}}(q\|p)$ onto a convex set $E$ of distributions is a convex program with a unique solution

The Lagrangian of the I-projection onto the exponential family gives the natural parameter update equations - connecting KL divergence to the theory of constrained convex optimization from Chapter 8.

G.3 KL and Functional Analysis (Chapter 12)

In infinite-dimensional settings, KL divergence extends via the Radon-Nikodym theorem. For Gaussian processes and Gaussian measures on Hilbert spaces:

$$D_{\mathrm{KL}}(\mathcal{GP}(m_1, K_1)\|\mathcal{GP}(m_2, K_2)) = \frac{1}{2}\left[\operatorname{tr}(K_2^{-1}K_1) + (m_2-m_1)^\top K_2^{-1}(m_2-m_1) - d + \log\frac{\det K_2}{\det K_1}\right]$$

the same Gaussian KL formula, but now $K_1, K_2$ are kernel (covariance) operators and $d \to \infty$. The trace and log-determinant terms generalize to the operator trace and the Fredholm determinant.

RKHS connection: The maximum mean discrepancy (MMD) is an alternative to KL for comparing distributions, based on RKHS distances. MMD has better convergence rates from finite samples than KL estimators, which is why it appears in some GANs (MMD-GAN) and kernel two-sample tests.

G.4 KL and Statistical Learning Theory (Chapter 21)

The PAC-Bayes bound (Appendix C.1) shows that the KL from posterior to prior controls generalization. This has a direct implication for LoRA fine-tuning: if the LoRA adapter is constrained to have small $D_{\mathrm{KL}}(q_{\mathrm{adapter}}\|p_{\mathrm{init}})$ - i.e., small deviation from initialization - then the generalization bounds are tight. This is the information-theoretic justification for small-rank adaptations: they stay close to the pretrained distribution in KL.

Minimum Description Length (MDL): The MDL principle selects the model that minimizes the two-part code length $L(M) + L(D|M)$, where $L(M)$ is the code length of the model and $L(D|M)$ is the code length of the data given the model. For the right choice of $p$ and $q$, this equals $D_{\mathrm{KL}}(q\|p) + \text{const}$, and minimizing description length corresponds to maximizing the ELBO. This is the Bayesian interpretation of Occam's razor.


Appendix H: Summary Tables

H.1 KL Divergence Properties at a Glance

| Property | Holds? | Counterexample if no |
| --- | --- | --- |
| Non-negative: D_KL ≥ 0 | YES | - |
| Zero iff p = q | YES (a.e.) | - |
| Symmetric: D_KL(p‖q) = D_KL(q‖p) | NO | p = (0.9, 0.1), q = (0.5, 0.5) |
| Triangle inequality: D_KL(p‖r) ≤ D_KL(p‖q) + D_KL(q‖r) | NO | p = (0.9, 0.1), q = (0.5, 0.5), r = (0.1, 0.9) |
| Data processing inequality: D_KL(p_T‖q_T) ≤ D_KL(p‖q) | YES | - |
| Jointly convex in (p, q) | YES | - |
| Convex in p for fixed q | YES | - |
| Convex in q for fixed p | YES | - |
| Finite when p is not absolutely continuous w.r.t. q | NO | p = (0.5, 0.5, 0), q = (0.5, 0, 0.5): D_KL = ∞ |
| Metric (distance function) | NO | Fails symmetry and the triangle inequality |

H.2 Forward vs Reverse KL Summary

| | Forward KL D_KL(p‖q) | Reverse KL D_KL(q‖p) |
| --- | --- | --- |
| Expectation under | p (truth) | q (approximation) |
| Also called | Inclusive, zero-avoiding | Exclusive, zero-forcing |
| Behavior when fitting q to p | Mean-seeking, mass-covering | Mode-seeking, mass-concentrating |
| Handles multimodal p | Averages over all modes | Collapses to one mode |
| Support requirement | q must cover supp(p) | q must avoid supp(p)^c |
| Used in | MLE, distillation, flows | Variational inference, ELBO |
| Risk | Blurry/averaged samples | Posterior collapse |
| Tractability | Needs samples from p | Only needs samples from q |

H.3 f-Divergence Family Summary

| Divergence | f(t) | Symmetric? | Bounded? | Metric? |
| --- | --- | --- | --- | --- |
| KL D_KL(p‖q) | t ln t | No | No | No |
| Reverse KL D_KL(q‖p) | −ln t | No | No | No |
| Jeffreys J | (t − 1) ln t | Yes | No | No |
| Hellinger H² | (√t − 1)² | Yes | Yes ([0, 2]) | √H² is |
| Total variation TV | (1/2)·∣t − 1∣ | Yes | Yes ([0, 1]) | Yes |
| Chi-squared χ² | (t − 1)² | No | No | No |
| Jensen-Shannon JSD | (mixture form) | Yes | Yes ([0, ln 2]) | √JSD is |
| Rényi D_α | (t^α − 1)/(α − 1) (limit form) | No | No | No |

Appendix I: Extended Examples and Geometric Visualizations

I.1 KL Divergence Along a Path Between Distributions

Consider a one-parameter family $p_t = (1-t)p_0 + t p_1$ - a straight line between two distributions in the probability simplex. Since KL is jointly convex and $p_t$ is affine in $t$, both $t \mapsto D_{\mathrm{KL}}(p_t\|q)$ and $t \mapsto D_{\mathrm{KL}}(p_0\|p_t)$ are convex in $t$. Convexity does not mean the path is "flat" in a metric sense, though: KL does not grow like squared Euclidean distance along it, so the straight line in the simplex is curved from the KL perspective.

Numerical example. $p_0 = (0.9, 0.1)$, $p_1 = (0.1, 0.9)$, $q = (0.5, 0.5)$. For $t \in [0,1]$ (values in nats):

| t | p_t | D_KL(p_t ‖ q) | D_KL(q ‖ p_t) |
| --- | --- | --- | --- |
| 0 | (0.9, 0.1) | 0.368 | 0.511 |
| 0.25 | (0.7, 0.3) | 0.082 | 0.087 |
| 0.5 | (0.5, 0.5) | 0.000 | 0.000 |
| 0.75 | (0.3, 0.7) | 0.082 | 0.087 |
| 1.0 | (0.1, 0.9) | 0.368 | 0.511 |

The values are symmetric about $t = 0.5$ because the path and $q$ are symmetric. The KL is zero at $t = 0.5$ (where $p_t = q$) and increases convexly toward the endpoints.
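The table can be reproduced with a few lines of NumPy (a minimal sketch):

import numpy as np

p0, p1, q = np.array([0.9, 0.1]), np.array([0.1, 0.9]), np.array([0.5, 0.5])

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    pt = (1 - t) * p0 + t * p1
    print(f"t={t}: KL(p_t||q) = {kl(pt, q):.3f}, KL(q||p_t) = {kl(q, pt):.3f}")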

Geometric implication: The straight-line path $p_t = (1-t)p_0 + t p_1$ is a mixture geodesic (m-geodesic) in information geometry - the natural path in the mixture parameterization. The dual path (exponential geodesic) mixes in the natural-parameter space, giving different intermediate distributions.

I.2 KL vs Wasserstein Distance

KL divergence and the Wasserstein distance (from optimal transport theory) are complementary tools for comparing distributions:

| Property | KL divergence | Wasserstein distance |
| --- | --- | --- |
| Requires p ≪ q | YES | No |
| Metrizes weak convergence | No | YES |
| Captures geometry of the support | No | YES |
| Tractable to compute | YES (closed forms) | Harder (O(n³) naive) |
| Differentiable | YES | With entropic regularization |
| Used in | MLE, ELBO, RLHF | Generative models, image quality |

When Wasserstein beats KL: If $p_G$ (generator) and $p_{\mathrm{data}}$ have disjoint supports, $D_{\mathrm{KL}}(p_{\mathrm{data}}\|p_G) = +\infty$ and $\mathrm{JSD} = \ln 2$ - gradients vanish. The Wasserstein distance is still finite and differentiable (it reflects the actual geometric distance between the supports). This is the motivation for WGAN (Arjovsky et al., 2017).

When KL beats Wasserstein: For language modeling, the vocabulary forms a discrete space with no natural geometry. The Wasserstein distance would require defining a metric on tokens (words), which is ambiguous. KL divergence makes no assumptions about token geometry and is directly connected to log-probability, making it the natural choice.

I.3 KL Divergence Under Transformations

Transformation by sufficient statistic. If $T(\mathbf{x})$ is a sufficient statistic for $\boldsymbol{\theta}$, then:

$$D_{\mathrm{KL}}(p_{\boldsymbol{\theta}_1}\|p_{\boldsymbol{\theta}_2}) = D_{\mathrm{KL}}(p_{\boldsymbol{\theta}_1}^T\|p_{\boldsymbol{\theta}_2}^T)$$

where $p_{\boldsymbol{\theta}}^T$ is the induced distribution of $T(\mathbf{x})$. Sufficient statistics preserve all the discriminatory information - no KL is lost when summarizing data by a sufficient statistic.

Example. For $p_\theta = \text{Bern}(\theta)$, the count $T = \sum_i x_i$ (number of heads in $n$ flips) is sufficient. The KL between $\text{Bern}(\theta_1)^{\otimes n}$ and $\text{Bern}(\theta_2)^{\otimes n}$ equals $n \cdot D_{\mathrm{KL}}(\text{Bern}(\theta_1)\|\text{Bern}(\theta_2))$, and the KL between the corresponding $\text{Binomial}(n,\theta_1)$ and $\text{Binomial}(n,\theta_2)$ distributions equals the same value - the sufficient statistic preserves KL exactly.
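A numeric check of this equality, assuming SciPy is available (parameter values are our own illustration):

import numpy as np
from scipy.stats import binom

n, t1, t2 = 10, 0.3, 0.6

bern_kl = t1 * np.log(t1 / t2) + (1 - t1) * np.log((1 - t1) / (1 - t2))

k = np.arange(n + 1)
p, q = binom.pmf(k, n, t1), binom.pmf(k, n, t2)
binom_kl = np.sum(p * np.log(p / q))

print(n * bern_kl, binom_kl)  # identical: ~1.838 nats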

KL under affine transformation. For continuous distributions and an affine map $T(\mathbf{x}) = A\mathbf{x} + \mathbf{b}$ with $\det A \ne 0$: $D_{\mathrm{KL}}(p_T\|q_T) = D_{\mathrm{KL}}(p\|q)$ - KL is invariant under invertible affine transformations. This is why the KL between two Gaussians can be computed in any coordinate system.

I.4 The Sanov Theorem and Large Deviations

Sanov's theorem is the large deviations foundation for KL divergence.

Theorem (Sanov). Let $X_1, \ldots, X_n$ be i.i.d. from $q$, and let $\hat{p}_n$ be the empirical distribution. For any set $E$ of distributions (closed in total variation):

$$P(\hat{p}_n \in E) \approx e^{-n \inf_{p \in E} D_{\mathrm{KL}}(p\|q)}$$

More precisely: $\lim_{n\to\infty} -\frac{1}{n}\log P(\hat{p}_n \in E) = \inf_{p \in E} D_{\mathrm{KL}}(p\|q)$.

Interpretation: The probability of observing an empirical distribution in $E$ decays exponentially in $n$, with exponent given by the minimum KL from any distribution in $E$ to the true distribution $q$. The "most likely" atypical empirical distribution is the one in $E$ closest to $q$ in KL divergence - the I-projection of $q$ onto $E$.

For AI: Sanov's theorem is the large-deviations machinery behind bounds like PAC-Bayes. The probability that an empirical risk deviates from the true risk decays exponentially in $n$, with an exponent given by a KL divergence (for 0-1 losses, the binary KL between the empirical and true risk levels). The larger that KL, the more confident we can be that the training set is representative.

I.5 KL Divergence and the EM Algorithm: A Complete Example

We illustrate the EM algorithm as alternating KL minimizations with a Gaussian mixture model.

Setup: Data $\mathcal{D} = \{x^{(1)}, \ldots, x^{(n)}\}$; model $p_{\boldsymbol{\theta}}(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x; \mu_k, \sigma_k^2)$.

E-step - compute responsibilities (I-projection):

$$q^{(t)}(z=k|x^{(i)}) = \frac{\pi_k^{(t)}\mathcal{N}(x^{(i)};\mu_k^{(t)},\sigma_k^{2(t)})}{\sum_{j=1}^K \pi_j^{(t)}\mathcal{N}(x^{(i)};\mu_j^{(t)},\sigma_j^{2(t)})}$$

This is the exact posterior under the current parameters - the distribution over component labels that minimizes $D_{\mathrm{KL}}(q\|p_{\boldsymbol{\theta}^{(t)}}(\cdot|x^{(i)}))$, driving that KL to zero.

M-step - update parameters (M-projection):

$$\pi_k^{(t+1)} = \frac{1}{n}\sum_i q^{(t)}(z=k|x^{(i)}) \qquad \mu_k^{(t+1)} = \frac{\sum_i q^{(t)}(z=k|x^{(i)})\,x^{(i)}}{\sum_i q^{(t)}(z=k|x^{(i)})}$$

This maximizes the ELBO for fixed $q^{(t)}$ - the M-projection that minimizes $D_{\mathrm{KL}}(q^{(t)}\|p_{\boldsymbol{\theta}})$ over $\boldsymbol{\theta}$.

KL decomposition at each step:

$$\log p_{\boldsymbol{\theta}}(\mathcal{D}) = \mathcal{L}(q, \boldsymbol{\theta}) + \sum_i D_{\mathrm{KL}}(q^{(t)}(\cdot|x^{(i)})\,\|\,p_{\boldsymbol{\theta}}(\cdot|x^{(i)}))$$

The E-step zeroes the second term (for the current $\boldsymbol{\theta}^{(t)}$); the M-step maximizes the first term. Together, each EM iteration increases $\log p_{\boldsymbol{\theta}}(\mathcal{D})$.
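A compact NumPy sketch of these two steps for a 1-D, two-component mixture (all data and initial values are made up):

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])

pi = np.array([0.5, 0.5])       # mixing weights
mu = np.array([-1.0, 0.5])      # component means
var = np.array([1.0, 1.0])      # component variances

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibilities q(z=k|x_i) -- the exact posterior (I-projection).
    w = pi * normal_pdf(x[:, None], mu, var)   # (n, K), unnormalized
    r = w / w.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters -- maximizes the ELBO for fixed r.
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(pi.round(2), mu.round(2), var.round(2))
# Recovers roughly (0.3, 0.7), (-2.0, 1.0), (0.25, 1.0).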


Appendix J: Notation and Conventions

J.1 KL Divergence Notation Across Different Sources

Different textbooks and papers use different notation for KL divergence. This can cause significant confusion:

| Source | Notation | Direction |
| --- | --- | --- |
| This curriculum (following Cover & Thomas) | D_KL(p‖q) | p = reference (truth); q = approximation |
| Kullback (1959) | I(1:2) | between distributions indexed 1 and 2 |
| Bishop (2006) | KL(p‖q) | same as Cover & Thomas |
| Murphy (2012) | KL(p‖q) | same |
| Goodfellow et al. (2016) | D_KL(P‖Q) | same |
| Some optimization papers | KL(q‖p) | reversed - always check! |

Key check: In VAE papers, $D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x})\|p(\mathbf{z}))$ has the encoder $q$ as the first argument (the distribution the expectation is taken under) and the prior $p$ as the second. This is reverse KL - the expectation is under the approximate posterior $q$.

J.2 Relationship Between Information Quantities

VENN DIAGRAM OF INFORMATION QUANTITIES

  For a joint distribution P(X,Y):

    +---------------------------------------------+
    |                   H(X,Y)                    |
    |   +------------+----------+------------+    |
    |   |   H(X|Y)   |  I(X;Y)  |   H(Y|X)   |    |
    |   +------------+----------+------------+    |
    |   |<-------- H(X) ------->|            |    |
    |                |<-------- H(Y) ------->|    |
    +---------------------------------------------+

  Key relationships:
  H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)        [chain rule]
  I(X;Y) = H(X) + H(Y) - H(X,Y)                 [definition]
  I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)        [equivalences]
  I(X;Y) = D_KL(P(X,Y) || P(X)P(Y))             [KL form]
  H(p,q) = H(p) + D_KL(p || q)                  [cross-entropy decomposition]


J.3 Units and Conversions

| Base | Unit | Conversion |
| --- | --- | --- |
| ln (natural log) | nats | Default in this curriculum; 1 nat = 1/ln 2 ≈ 1.443 bits |
| log₂ | bits | Used in data compression contexts; 1 bit = ln 2 ≈ 0.693 nats |
| log₁₀ | hartleys (bans) | Rare in ML; 1 hartley = ln 10 ≈ 2.303 nats |

PyTorch convention: torch.nn.functional.kl_div expects inputs already in log-space and returns values in nats (all logs are natural logs). To convert to bits, divide by math.log(2).


End of Section 09-02 KL Divergence


