KL Divergence: Appendices A (Detailed Proofs) through J (Notation and Conventions)
Appendix A: Detailed Proofs
A.1 Proof of Non-Negativity via Log-Sum Inequality
The log-sum inequality provides an alternative, direct proof of Gibbs' inequality without invoking Jensen's inequality explicitly.
Log-Sum Inequality. For non-negative reals $a_1, \dots, a_n$ and $b_1, \dots, b_n$:

$$\sum_{i=1}^n a_i \log \frac{a_i}{b_i} \;\ge\; \Big(\sum_{i=1}^n a_i\Big) \log \frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i},$$

with equality iff all ratios $a_i / b_i$ are equal.

Proof of log-sum inequality. The function $f(t) = t \log t$ is convex (since $f''(t) = 1/t > 0$). By the weighted Jensen inequality with weights $b_i / B$ where $B = \sum_i b_i$:

$$\sum_i \frac{b_i}{B} f\Big(\frac{a_i}{b_i}\Big) \;\ge\; f\Big(\sum_i \frac{b_i}{B} \cdot \frac{a_i}{b_i}\Big) = f\Big(\frac{A}{B}\Big), \qquad A = \sum_i a_i.$$

Multiplying both sides by $B$ gives the result.

KL non-negativity from log-sum inequality. Apply with $a_i = p_i$ and $b_i = q_i$: since $\sum_i p_i = \sum_i q_i = 1$:

$$D_{KL}(p \| q) = \sum_i p_i \log \frac{p_i}{q_i} \;\ge\; \Big(\sum_i p_i\Big) \log \frac{\sum_i p_i}{\sum_i q_i} = 1 \cdot \log \frac{1}{1} = 0.$$
A.2 Convexity of KL: Detailed Proof
We prove that $D_{KL}(p \| q)$ is jointly convex in the pair $(p, q)$ using the perspective function argument.

Lemma. The function $f(x, y) = x \log(x/y)$ (with $f(0, y) = 0$, and $f(x, 0) = +\infty$ for $x > 0$) is jointly convex on $[0, \infty) \times (0, \infty)$.

Proof. Write $f(x, y) = x \log x - x \log y$. We need $\nabla^2 f \succeq 0$. The Hessian is:

$$\nabla^2 f = \begin{pmatrix} 1/x & -1/y \\ -1/y & x/y^2 \end{pmatrix}.$$

Determinant $= \frac{1}{x} \cdot \frac{x}{y^2} - \frac{1}{y^2} = 0$. Trace $= \frac{1}{x} + \frac{x}{y^2} > 0$. The Hessian is PSD since its determinant is $\ge 0$ and its trace is $> 0$.

Since $D_{KL}(p \| q) = \sum_i f(p_i, q_i)$ and sums of jointly convex functions are jointly convex, $D_{KL}$ is jointly convex.
A.3 Pinsker's Inequality
Theorem (Pinsker's Inequality). $\mathrm{TV}(p, q) \le \sqrt{\tfrac{1}{2} D_{KL}(p \| q)}$.

Proof. We use the Bretagnolle-Huber inequality, which is slightly weaker but has a clean proof: for any event $A$,

$$|p(A) - q(A)| \;\le\; \sqrt{1 - e^{-D_{KL}(p \| q)}}.$$

For the full Pinsker proof, one approach uses the 2-point inequality: for $p, q \in (0, 1)$,

$$p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q} \;\ge\; 2 (p - q)^2.$$

Applying this to the binary partition $\{A, A^c\}$ with $A = \{x : p(x) > q(x)\}$ (which gives $p(A) - q(A) = \mathrm{TV}(p, q)$), together with the data processing inequality:

$$D_{KL}(p \| q) \;\ge\; D_{KL}\big(\mathrm{Bern}(p(A)) \,\|\, \mathrm{Bern}(q(A))\big) \;\ge\; 2\, \mathrm{TV}(p, q)^2.$$

Rearranging: $\mathrm{TV}(p, q) \le \sqrt{\tfrac{1}{2} D_{KL}(p \| q)}$.
Pinsker's inequality in AI: KL divergence is smooth and differentiable (gradients exist for optimization), while total variation directly bounds differences in event probabilities. Pinsker's inequality lets us translate KL bounds from optimization theory into guarantees about individual event probabilities - useful in PAC-Bayes learning theory.
A.4 KL Divergence for the Gaussian: Step-by-Step
We derive the multivariate Gaussian KL formula in detail.
Let $p = \mathcal{N}(\mu_1, \Sigma_1)$ and $q = \mathcal{N}(\mu_2, \Sigma_2)$ on $\mathbb{R}^d$. The log-ratio is:

$$\log \frac{p(x)}{q(x)} = \frac{1}{2} \log \frac{\det \Sigma_2}{\det \Sigma_1} - \frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_2)^\top \Sigma_2^{-1} (x - \mu_2).$$

Taking expectation under $p$, using:

$$\mathbb{E}_p\big[(x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1)\big] = d, \qquad \mathbb{E}_p\big[(x - \mu_2)^\top \Sigma_2^{-1} (x - \mu_2)\big] = \operatorname{tr}(\Sigma_2^{-1} \Sigma_1) + (\mu_1 - \mu_2)^\top \Sigma_2^{-1} (\mu_1 - \mu_2),$$

gives:

$$D_{KL}(p \| q) = \frac{1}{2} \Big[ \operatorname{tr}(\Sigma_2^{-1} \Sigma_1) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) - d + \log \frac{\det \Sigma_2}{\det \Sigma_1} \Big].$$

The terms correspond to:
- $\operatorname{tr}(\Sigma_2^{-1} \Sigma_1) - d$: how much the covariances differ (= 0 when $\Sigma_1 = \Sigma_2$)
- $(\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1)$: squared Mahalanobis distance between means
- $\log(\det \Sigma_2 / \det \Sigma_1)$: log ratio of volumes (= 0 when $\det \Sigma_1 = \det \Sigma_2$)
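A quick numeric sanity check of the formula - a minimal sketch with illustrative parameters, assuming NumPy and SciPy are available. The closed form is compared against the Monte Carlo estimate $\frac{1}{n} \sum_i \log \frac{p(x_i)}{q(x_i)}$ with $x_i \sim p$:

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Illustrative parameters for p = N(mu1, S1) and q = N(mu2, S2), d = 2.
mu1, mu2 = np.array([0.0, 0.0]), np.array([1.0, -1.0])
S1 = np.array([[1.0, 0.3], [0.3, 1.0]])
S2 = np.array([[2.0, 0.0], [0.0, 0.5]])
d = 2
S2_inv = np.linalg.inv(S2)
diff = mu2 - mu1
kl_closed = 0.5 * (np.trace(S2_inv @ S1) + diff @ S2_inv @ diff - d
                   + np.log(np.linalg.det(S2) / np.linalg.det(S1)))
# Monte Carlo check: average log-ratio over samples drawn from p.
x = rng.multivariate_normal(mu1, S1, size=200_000)
kl_mc = np.mean(multivariate_normal(mu1, S1).logpdf(x)
                - multivariate_normal(mu2, S2).logpdf(x))
print(kl_closed, kl_mc)  # agree to roughly two decimal places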
Appendix B: KL Divergence Computations and Reference Formulas
B.1 Quick Reference: KL Formulas for Common Distributions
KL DIVERGENCE CLOSED FORMS - QUICK REFERENCE

SCALAR GAUSSIAN:
D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) = log(sigma2/sigma1) + (sigma1^2 + (mu1 - mu2)^2) / (2 sigma2^2) - 1/2

VAE ENCODER -> STANDARD NORMAL:
D_KL(N(mu, sigma^2) || N(0, 1)) = (1/2)(mu^2 + sigma^2 - log sigma^2 - 1)

MULTIVARIATE GAUSSIAN:
D_KL(N(mu1, Sigma1) || N(mu2, Sigma2)) = (1/2)[tr(Sigma2^-1 Sigma1) + (mu2 - mu1)^T Sigma2^-1 (mu2 - mu1) - d + log(det Sigma2 / det Sigma1)]

BERNOULLI:
D_KL(Bern(p) || Bern(q)) = p log(p/q) + (1 - p) log((1 - p)/(1 - q))

CATEGORICAL (general discrete):
D_KL(p || q) = sum_k p_k log(p_k / q_k)

POISSON (rate lambda):
D_KL(Poisson(lambda1) || Poisson(lambda2)) = lambda1 log(lambda1/lambda2) - lambda1 + lambda2

EXPONENTIAL (rate lambda):
D_KL(Exp(lambda1) || Exp(lambda2)) = log(lambda1/lambda2) + lambda2/lambda1 - 1

BETA:
D_KL(Beta(alpha1, beta1) || Beta(alpha2, beta2)) = log[B(alpha2, beta2) / B(alpha1, beta1)]
  + (alpha1 - alpha2) psi(alpha1) + (beta1 - beta2) psi(beta1)
  + (alpha2 - alpha1 + beta2 - beta1) psi(alpha1 + beta1)
where psi = digamma function, B = beta function
B.2 Worked Example: KL Between Poisson Distributions
A Poisson distribution with rate $\lambda$ has PMF $P(k) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k = 0, 1, 2, \dots$

The KL divergence is:

$$D_{KL}\big(\mathrm{Poisson}(\lambda_1) \,\|\, \mathrm{Poisson}(\lambda_2)\big) = \sum_{k=0}^\infty P_1(k) \Big[ k \log \frac{\lambda_1}{\lambda_2} - \lambda_1 + \lambda_2 \Big] = \lambda_1 \log \frac{\lambda_1}{\lambda_2} - \lambda_1 + \lambda_2,$$

using $\mathbb{E}_{P_1}[k] = \lambda_1$. Plugging in specific rates gives the divergence in nats; the sketch below evaluates one instance.
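A minimal numeric check of the formula, with illustrative rates (not the lesson's original worked values); SciPy is assumed available:

import numpy as np
from scipy.stats import poisson

lam1, lam2 = 3.0, 5.0  # illustrative rates
kl_formula = lam1 * np.log(lam1 / lam2) - lam1 + lam2
# Direct check: sum the PMF-weighted log-ratio over a wide range of k.
k = np.arange(0, 200)
kl_direct = np.sum(poisson.pmf(k, lam1)
                   * (poisson.logpmf(k, lam1) - poisson.logpmf(k, lam2)))
print(kl_formula, kl_direct)  # both approximately 0.4675 nats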
For AI: Poisson distributions model count data (number of events, word counts in LDA topic models). The KL between Poisson distributions appears in the ELBO for Poisson process models and in topic model variational inference.
B.3 Worked Example: KL for Categorical - Softmax Outputs
Let $z_p$ and $z_q$ be the teacher and student logit vectors.

Step 1: convert logits to log-probabilities, $\log p = \mathrm{log\_softmax}(z_p)$ and $\log q = \mathrm{log\_softmax}(z_q)$.

Step 2: form the elementwise terms $p_k (\log p_k - \log q_k)$.

Step 3: sum over classes: $D_{KL}(p \| q) = \sum_k p_k (\log p_k - \log q_k)$.
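A short sketch of these three steps with hypothetical teacher/student logits (the values are illustrative, not the original worked numbers):

import numpy as np
from scipy.special import log_softmax

z_p = np.array([2.0, 1.0, 0.5, -1.0])  # hypothetical teacher logits
z_q = np.array([1.5, 1.2, 0.3, -0.5])  # hypothetical student logits
log_p, log_q = log_softmax(z_p), log_softmax(z_q)  # Step 1
terms = np.exp(log_p) * (log_p - log_q)            # Step 2
print(f"D_KL(p||q) = {terms.sum():.4f} nats")      # Step 3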
For AI: This is the KL component when computing the distillation loss between a teacher ($p$) and a student ($q$). A small value (such as 0.051 nats) indicates the student's distribution is reasonably close to the teacher's.
B.4 KL Divergence and Hypothesis Testing
The connection between KL divergence and hypothesis testing is both historical (Kullback 1959) and practically important.
Setting: $n$ i.i.d. observations $x_1, \dots, x_n$ from an unknown distribution. Hypothesis $H_1$: data from $p$; hypothesis $H_2$: data from $q$.

Log-likelihood ratio test statistic:

$$T_n = \frac{1}{n} \sum_{i=1}^n \log \frac{p(x_i)}{q(x_i)}.$$

By the law of large numbers, $T_n \to D_{KL}(p \| q)$ as $n \to \infty$ when the truth is $p$ (and $T_n \to -D_{KL}(q \| p)$ when the truth is $q$).

Stein's lemma (type II error rate): For fixed type I error probability $\alpha \in (0, 1)$, the best achievable type II error probability decays as:

$$\beta_n \doteq e^{-n\, D_{KL}(p \| q)}.$$

Larger KL divergence -> faster exponential decay of type II error -> easier to distinguish $p$ from $q$ with more data.

Chernoff exponent: The best achievable equal-error-rate exponent (driving both type I and type II errors to zero simultaneously) is:

$$C(p, q) = -\min_{0 \le \alpha \le 1} \log \sum_x p(x)^\alpha\, q(x)^{1 - \alpha},$$

which equals the maximum of $(1 - \alpha)\, D_\alpha(p \| q)$ - the Renyi divergence of order $\alpha$ - over $\alpha \in (0, 1)$.

For AI: The hypothesis testing perspective explains why $D_{KL}(\pi_\theta \| \pi_{\mathrm{ref}})$ in RLHF bounds how detectable the policy shift is: small KL means the fine-tuned model's outputs are statistically indistinguishable from the reference model's outputs.
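The law-of-large-numbers statement above is easy to see empirically. A minimal simulation (distributions chosen for illustration): the normalized log-likelihood ratio $T_n$ computed on samples from $p$ converges to $D_{KL}(p \| q)$ as $n$ grows.

import numpy as np

rng = np.random.default_rng(0)
# Illustrative categorical hypotheses p and q on a 3-symbol alphabet.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
d_kl = np.sum(p * np.log(p / q))
for n in [10, 100, 1000, 10000]:
    x = rng.choice(3, size=n, p=p)           # data drawn from p
    t_n = np.mean(np.log(p[x] / q[x]))       # normalized LLR statistic
    print(f"n={n:>6}  T_n={t_n:.4f}  (D_KL={d_kl:.4f})")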
Appendix C: KL Divergence in Broader ML Contexts
C.1 PAC-Bayes Bounds
PAC-Bayes theory (McAllester 1999; Seeger 2002) gives generalization bounds for stochastic predictors. The key theorem:
PAC-Bayes Theorem (McAllester's bound). Let $P$ be a prior over hypotheses (fixed before seeing data) and $Q$ be any posterior (possibly data-dependent). For any $\delta > 0$, with probability $\ge 1 - \delta$ over a training set of size $n$:

$$\mathbb{E}_{h \sim Q}[L(h)] \;\le\; \mathbb{E}_{h \sim Q}[\hat{L}(h)] + \sqrt{\frac{D_{KL}(Q \| P) + \log \frac{2\sqrt{n}}{\delta}}{2n}},$$

where $L(h)$ is the true loss and $\hat{L}(h)$ is the empirical loss.

Interpretation: The generalization gap is bounded by a term growing with $\sqrt{D_{KL}(Q \| P) / n}$. The posterior $Q$ can be anything - including a distribution concentrated near a specific trained neural network - and the bound holds. This shows that models that stay "close" to a simple prior in KL generalize well. Modern applications: PAC-Bayes bounds for LoRA-fine-tuned LLMs (where the prior $P$ is centered at the pretrained weights and the posterior $Q$ at the fine-tuned weights).
C.2 KL in Variational Inference for Bayesian Neural Networks
Bayesian neural networks (BNNs) maintain distributions over weights rather than point estimates. The posterior $p(w \mid \mathcal{D})$ is intractable for large networks. Variational inference minimizes:

$$D_{KL}\big(q_\phi(w) \,\|\, p(w \mid \mathcal{D})\big).$$

The ELBO is: $\mathcal{L}(\phi) = \mathbb{E}_{q_\phi}[\log p(\mathcal{D} \mid w)] - D_{KL}(q_\phi(w) \| p(w))$. The KL term acts as a regularizer: it penalizes variational distributions that deviate from the prior $p(w)$.

Practical implementations: Bayes By Backprop (Blundell et al., 2015) parameterizes $q_\phi(w) = \prod_j \mathcal{N}(w_j \mid \mu_j, \sigma_j^2)$ and trains with the reparameterization trick: each weight is sampled as $w_j = \mu_j + \sigma_j \epsilon_j$ where $\epsilon_j \sim \mathcal{N}(0, 1)$. The KL term per weight is $\frac{1}{2}(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1)$ for the Gaussian prior $\mathcal{N}(0, 1)$.
C.3 KL in Contrastive Learning and Representation
Contrastive learning methods like SimCLR (Chen et al., 2020) and MoCo (He et al., 2020) maximize a lower bound on the mutual information between representations:

$$I(x; z) \;\ge\; \log N - \mathcal{L}_{\mathrm{InfoNCE}}.$$

The InfoNCE loss is a lower bound on $I(X; Z) = D_{KL}\big(P(X, Z) \,\|\, P(X) P(Z)\big)$, which is a KL divergence (see Section 09-03). CLIP (Radford et al., 2021) uses a symmetric InfoNCE to align image and text embeddings - which is equivalent to minimizing KL between the joint image-text distribution and the product of marginals.

CPC (Contrastive Predictive Coding, van den Oord et al., 2018): Used in speech (wav2vec 2.0, HuBERT) and images. The context representation $c_t$ at time $t$ predicts future observations $x_{t+k}$ via a density ratio estimator. The InfoNCE objective is a lower bound on $I(x_{t+k}; c_t)$. Training audio encoders this way produces features that capture the semantic structure of speech - because mutual information (= KL divergence) rewards capturing any statistical dependency.
C.4 KL in Score-Based Generative Models
Score-based models (Song & Ermon, 2020) and their diffusion counterparts learn the score function $\nabla_x \log p(x)$ to generate samples via Langevin dynamics. The connection to KL:

Score matching minimizes:

$$J(\theta) = \frac{1}{2}\, \mathbb{E}_{p}\big[\, \| s_\theta(x) - \nabla_x \log p(x) \|^2 \,\big],$$

where $s_\theta$ is the learned score. The Fisher divergence equals:

$$F(p \| q) = \mathbb{E}_p\big[\, \| \nabla_x \log p(x) - \nabla_x \log q(x) \|^2 \,\big].$$

Score matching is equivalent to minimizing a generalization of KL (the Fisher divergence $\mathbb{E}_p[\|\nabla_x \log (p/q)\|^2]$), which uses the gradient of the log-density ratio rather than the ratio itself. This is the information-geometric connection between score matching and KL divergence.
C.5 KL and Temperature in LLM Sampling
LLM inference involves sampling from $p_T = \mathrm{softmax}(z / T)$ at temperature $T$, where $z$ is the logit vector. The KL divergence $D_{KL}(p_T \| p)$ between the temperature-scaled distribution and the original distribution ($T = 1$) behaves as follows:

As $T \to 0$: $p_T \to$ one-hot (deterministic, picks the argmax). $D_{KL}(p_T \| p) \to \log(1 / p(y^*))$ - the surprisal of the argmax token $y^*$ under the original distribution.

As $T \to \infty$: $p_T \to$ uniform. $D_{KL}(p_T \| p) \to D_{KL}(\mathrm{uniform} \| p)$ - the divergence equals the KL from uniform.

Practical implications: The "temperature" knob in LLM APIs (temperature=0.7, top-p=0.9) directly controls the divergence from the trained distribution. High temperature increases diversity (pushes toward uniform); low temperature reduces diversity (pushes toward argmax). The KL from the training distribution is minimized - equal to zero - at $T = 1$.
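A small sketch of this behavior (hypothetical logits, chosen for illustration): $D_{KL}(p_T \| p)$ is evaluated on a grid of temperatures and is smallest at $T = 1$.

import numpy as np
from scipy.special import log_softmax

z = np.array([3.0, 1.0, 0.5, -2.0])  # hypothetical next-token logits
log_p1 = log_softmax(z)              # the T = 1 distribution
for T in [0.1, 0.5, 1.0, 2.0, 10.0]:
    log_pT = log_softmax(z / T)
    kl = np.sum(np.exp(log_pT) * (log_pT - log_p1))  # D_KL(p_T || p)
    print(f"T={T:>4}: D_KL(p_T || p) = {kl:.4f} nats")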
Appendix D: Numerical Stability and Implementation
D.1 The log-sum-exp Trick for KL Computation
Computing with softmax distributions requires care to avoid numerical overflow.
Naive implementation (unstable):

# WRONG - log(p), log(q) underflow to -inf when softmax probabilities are tiny
kl = torch.sum(p * (torch.log(p) - torch.log(q)))
Stable implementation using log-softmax:

# CORRECT - uses log-space throughout
import torch
import torch.nn.functional as F

def kl_divergence_stable(logits_p, logits_q):
    """
    KL(p || q) where p = softmax(logits_p), q = softmax(logits_q).
    Uses log_softmax for numerical stability.
    """
    log_p = F.log_softmax(logits_p, dim=-1)  # log(softmax) in one step
    log_q = F.log_softmax(logits_q, dim=-1)
    p = torch.exp(log_p)
    return torch.sum(p * (log_p - log_q), dim=-1)
PyTorch also provides torch.nn.functional.kl_div(log_q, p), which computes $D_{KL}(p \| q)$ given pre-computed log_q. Note the unusual argument order: the first argument is the log-probabilities of $q$ (the second distribution in $D_{KL}(p \| q)$), and the second is $p$ (probabilities by default; pass log_target=True to supply log-probabilities instead).
D.2 Estimating KL from Samples
When only samples are available (not closed-form distributions), KL must be estimated.
Monte Carlo estimator. Given samples $x_1, \dots, x_n \sim p$ and the ability to evaluate both densities $p(x)$ and $q(x)$:

$$\hat{D}_n = \frac{1}{n} \sum_{i=1}^n \log \frac{p(x_i)}{q(x_i)}.$$

This is unbiased: $\mathbb{E}[\hat{D}_n] = D_{KL}(p \| q)$.

k-nearest neighbor estimator (Perez-Cruz 2008). When neither $p$ nor $q$ is known analytically - only samples from both distributions - the kNN estimator uses:

$$\hat{D} = \frac{d}{n} \sum_{i=1}^n \log \frac{\nu_k(x_i)}{\rho_k(x_i)} + \log \frac{m}{n - 1},$$

where $\rho_k(x_i)$ is the distance from $x_i$ to its $k$-th nearest neighbor among the $p$-samples (excluding itself), $\nu_k(x_i)$ is the distance to its $k$-th nearest neighbor among the $q$-samples, $n$ = number of $p$-samples, $m$ = number of $q$-samples, and $d$ = dimension.

This estimator is used in representation learning evaluation, where you have samples from an encoder's output distribution and want to measure KL without a closed form.
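A minimal sketch of the kNN estimator following the formula above (assuming SciPy's cKDTree; the 1-D Gaussian test case is illustrative, with true KL $= 0.5$):

import numpy as np
from scipy.spatial import cKDTree

def knn_kl(x, y, k=1):
    """kNN estimate of D_KL(p||q) from samples x ~ p and y ~ q, shape (n, d)."""
    n, d = x.shape
    m = y.shape[0]
    rho = cKDTree(x).query(x, k=[k + 1])[0][:, 0]  # k-th NN in x, excluding self
    nu = cKDTree(y).query(x, k=[k])[0][:, 0]       # k-th NN in y
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(5000, 1))  # p = N(0, 1)
y = rng.normal(1.0, 1.0, size=(5000, 1))  # q = N(1, 1); true KL = 0.5
print(knn_kl(x, y))                       # approximately 0.5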
D.3 Symmetric KL Implementation
For applications requiring symmetric divergence without the issues of JSD (which requires computing the mixture), Jeffrey's divergence is simpler:
import numpy as np

def jeffreys_divergence(p_probs, q_probs, eps=1e-10):
    """
    J(p, q) = 0.5 * (KL(p||q) + KL(q||p))
    More numerically stable than computing JSD.
    """
    p = p_probs + eps   # avoid log(0) / division by zero
    q = q_probs + eps
    p = p / p.sum()     # renormalize after smoothing
    q = q / q.sum()
    kl_pq = (p * np.log(p / q)).sum()
    kl_qp = (q * np.log(q / p)).sum()
    return 0.5 * (kl_pq + kl_qp)
D.4 KL Divergence in PyTorch for VAE Training
The standard VAE KL term implementation:
import torch

def vae_kl_loss(mu, log_var):
    """
    D_KL(N(mu, sigma^2) || N(0, 1)) per-sample, summed over latent dimensions.
    Formula: 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)
           = 0.5 * sum(mu^2 + exp(log_var) - log_var - 1)
    Input: mu, log_var - shape (batch, latent_dim)
    Output: KL loss - shape (batch,), or scalar if mean-reduced
    """
    kl = 0.5 * torch.sum(mu**2 + torch.exp(log_var) - log_var - 1, dim=-1)
    return kl.mean()  # average over batch
Note: using log_var (the log of the variance) instead of sigma avoids numerical issues with very small variances and ensures $\sigma^2 = e^{\log \mathrm{var}} > 0$ by construction.
RLHF KL penalty implementation:
def rlhf_kl_penalty(logprobs_policy, logprobs_ref, beta=0.1):
    """
    Monte Carlo estimate of D_KL(pi_policy || pi_ref) over a batch of
    token sequences sampled from the policy.
    logprobs_policy, logprobs_ref: log probabilities of the sampled tokens
        under each model, shape (batch, seq_len)
    Returns: beta-weighted mean KL per sequence
    """
    # KL = E_pi[log pi - log pi_ref], estimated on sequences sampled from pi
    kl_per_token = logprobs_policy - logprobs_ref  # log ratio
    kl_per_seq = kl_per_token.sum(dim=-1)          # sum over sequence length
    return beta * kl_per_seq.mean()
Appendix E: Variational Inference - A Deep Dive
E.1 The Variational Inference Problem
Bayesian inference requires computing the posterior $p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}$. The denominator $p(x) = \int p(x \mid z)\, p(z)\, dz$ is typically intractable - the integral has no closed form for nonlinear likelihoods.

Variational inference approximates the posterior with a tractable family $q_\phi(z)$ by solving:

$$\phi^* = \arg\min_\phi\; D_{KL}\big(q_\phi(z) \,\|\, p(z \mid x)\big).$$

Why reverse KL? Because it is the only tractable direction. Computing $D_{KL}(p(z \mid x) \,\|\, q_\phi(z))$ requires expectations under $p(z \mid x)$, which is the intractable quantity we're trying to avoid. Computing $D_{KL}(q_\phi(z) \,\|\, p(z \mid x))$ requires only samples from $q_\phi$ and evaluations of the joint $p(x, z)$ - both tractable.

ELBO derivation. Expanding the reverse KL:

$$D_{KL}\big(q_\phi \,\|\, p(\cdot \mid x)\big) = \mathbb{E}_{q_\phi}\big[\log q_\phi(z) - \log p(z \mid x)\big].$$

Using $\log p(z \mid x) = \log p(x, z) - \log p(x)$:

$$D_{KL}\big(q_\phi \,\|\, p(\cdot \mid x)\big) = \mathbb{E}_{q_\phi}\big[\log q_\phi(z) - \log p(x, z)\big] + \log p(x).$$

Rearranging:

$$\log p(x) = \underbrace{\mathbb{E}_{q_\phi}\big[\log p(x, z) - \log q_\phi(z)\big]}_{\mathrm{ELBO}(\phi)} + D_{KL}\big(q_\phi(z) \,\|\, p(z \mid x)\big).$$

Since $D_{KL} \ge 0$: $\mathrm{ELBO}(\phi) \le \log p(x)$ - the ELBO is a lower bound. Maximizing the ELBO simultaneously: (a) tightens the bound (reduces the KL gap), and (b) improves the model fit (increases $\log p_\theta(x)$ when the model parameters $\theta$ are also optimized).
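The identity $\log p(x) = \mathrm{ELBO} + D_{KL}$ can be verified numerically on a toy discrete model (all values illustrative):

import numpy as np

# Toy model: binary latent z, one observed x; the joint p(x, z) is two numbers.
p_xz = np.array([0.1, 0.3])    # p(x, z=0), p(x, z=1)
p_x = p_xz.sum()               # evidence p(x)
post = p_xz / p_x              # exact posterior p(z|x)
q = np.array([0.7, 0.3])       # an arbitrary variational distribution
elbo = np.sum(q * (np.log(p_xz) - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(post)))
print(np.log(p_x), elbo + kl)  # identical: log p(x) = ELBO + KL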
E.2 Mean-Field Variational Inference
The mean-field approximation restricts $q$ to factored distributions $q(z) = \prod_{j=1}^m q_j(z_j)$. This is an enormous simplification - the full posterior may have complex correlations, but the approximation ignores all correlations between latent variables.

Coordinate ascent VI (CAVI). For mean-field, the optimal factor $q_j^*$ satisfies:

$$q_j^*(z_j) \;\propto\; \exp\Big( \mathbb{E}_{-j}\big[\log p(x, z)\big] \Big),$$

where $\mathbb{E}_{-j}$ denotes expectation over all factors except $q_j$. This is a fixed-point equation: each factor depends on the others. CAVI iterates: update $q_1$ given $q_2, \dots, q_m$; then update $q_2$ given the new $q_1$ and the old $q_3, \dots, q_m$; etc.
Guarantee: Each update increases the ELBO (decreases the reverse KL). CAVI converges to a local optimum.
For AI: Mean-field VI was the basis for early Bayesian neural networks (BNNs) and topic models (LDA). Its mode-seeking bias (reverse KL) means it systematically underestimates posterior uncertainty - a known limitation. Modern BNNs often use Monte Carlo Dropout as a cheaper alternative.
E.3 ELBO as a Free Energy
The ELBO has a physical interpretation as a (negative) variational free energy (Helmholtz, Gibbs):

$$\mathrm{ELBO}(q) = \mathbb{E}_q\big[\log p(x, z)\big] + H(q).$$

This is the expected log joint plus the entropy of $q$. Maximizing the ELBO simultaneously:

- Maximizes $\mathbb{E}_q[\log p(x, z)]$: encourages $q$ to put mass on high-probability regions of the joint
- Maximizes $H(q)$: encourages $q$ to spread out (high entropy = uncertainty), preventing mode collapse

The tension between these two terms is the fundamental trade-off in variational inference. In physics, this is free energy minimization: equilibrium balances energy minimization (term 1) with entropy maximization (term 2), at temperature 1.
For AI: The free energy interpretation connects to energy-based models (LeCun et al., 2006) and to the "free energy principle" in computational neuroscience (Friston, 2010). The ELBO appears in all modern latent variable models: VAEs, VQ-VAE, neural process families, and latent diffusion models (Stable Diffusion uses a VAE for compression, then diffusion in the latent space).
E.4 Amortized Variational Inference
Standard VI optimizes a separate $q_i(z)$ per data point - $n$ separate optimization problems. Amortized VI (Kingma & Welling 2014) instead trains a single encoder network $q_\phi(z \mid x)$ that takes the data point $x$ as input.

The encoder "amortizes" the per-sample optimization across the full dataset, generalizing to new samples at inference time. This is what makes VAEs practical for large datasets: a single forward pass through the encoder gives $q_\phi(z \mid x)$ instantly.

Amortization gap: The amortized approximation $q_\phi(z \mid x_i)$ is generally worse than the optimal per-sample $q_i^*(z)$ (it must generalize, so it trades off per-sample accuracy). The gap is called the amortization gap and is an active research topic.
Appendix F: The RLHF-KL Alignment Triangle
F.1 Three Ways KL Appears in Alignment
Modern LLM alignment uses KL divergence in three distinct but related ways, forming a triangle of constraints:
RLHF-KL ALIGNMENT TRIANGLE

              PRETRAINED MODEL pi_ref
                     /        \
             KL     /          \     KL
           penalty /            \ constraint
                  /              \
         RLHF pi_theta        DPO pi_DPO
                  \              /
                   \   equiv    /
                    \          /
            Fixed point: same optimal policy
All three approaches optimize the same objective:

maximize E[r(x,y)] - beta * D_KL(pi_theta(.|x) || pi_ref(.|x))

Optimal solution: pi*(y|x) propto pi_ref(y|x) * exp(r(x,y)/beta)
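A toy demonstration of the optimal policy formula (rewards and reference probabilities are illustrative): small $\beta$ concentrates the policy on high-reward responses at the cost of large KL; large $\beta$ keeps it near the reference.

import numpy as np

pi_ref = np.array([0.4, 0.3, 0.2, 0.1])  # reference probs for 4 responses
r = np.array([1.0, 0.0, 2.0, 0.5])       # rewards (illustrative)
for beta in [0.1, 1.0, 10.0]:
    pi_star = pi_ref * np.exp(r / beta)  # pi* propto pi_ref * exp(r / beta)
    pi_star /= pi_star.sum()
    kl = np.sum(pi_star * np.log(pi_star / pi_ref))
    print(f"beta={beta:>5}: pi* = {np.round(pi_star, 3)}, KL = {kl:.3f}")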
F.2 Interpreting the KL Coefficient beta
The KL coefficient $\beta$ in RLHF controls the fundamental exploration-exploitation trade-off in alignment:

| $\beta$ value | Behavior | Risk |
|---|---|---|
| $\beta \to 0$ | Pure reward maximization; ignores KL constraint | Reward hacking; generates repetitive, incoherent outputs |
| $\beta \to \infty$ | Stay at reference policy; no alignment | No improvement in helpfulness, safety |
| Intermediate $\beta$ (typical range used in practice) | Balance between helpfulness and coherence | - |

Typical values in production: InstructGPT (Ouyang et al., 2022) used a small fixed KL coefficient for PPO-RLHF. Different applications need different $\beta$: safety alignment uses larger $\beta$ (conservative); instruction following uses smaller $\beta$ (allows larger policy changes).
Adaptive $\beta$ (KL-controller): Some implementations adaptively adjust $\beta$ to keep $D_{KL}(\pi_\theta \| \pi_{\mathrm{ref}})$ near a target value $d_{\mathrm{target}}$: when the measured KL overshoots the target, $\beta$ is raised; when it undershoots, $\beta$ is lowered.

This is a proportional controller on the KL divergence. If the policy drifts too far, $\beta$ increases; if it stays too close to the reference, $\beta$ decreases.
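A minimal sketch of such a controller, in the style of the proportional update from Ziegler et al. (2019); the gain and clipping values here are illustrative:

import numpy as np

def update_beta(beta, observed_kl, target_kl, k_beta=0.1, clip=0.2):
    """Proportional KL controller: raise beta when the measured KL
    overshoots the target, lower it when the policy hugs the reference."""
    error = np.clip(observed_kl / target_kl - 1.0, -clip, clip)
    return beta * (1.0 + k_beta * error)

# Example: measured KL is double the target, so beta grows by 2% this step.
print(update_beta(beta=0.1, observed_kl=0.12, target_kl=0.06))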
F.3 DPO's Implicit Reward
DPO (Rafailov et al., 2023) reveals that the log-ratio between policy and reference is the implicit reward:

$$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$

(up to a term that depends only on $x$).
This has a clean interpretation: the reward a sequence receives is proportional to how much more probability the aligned policy assigns to it than the reference does. Sequences the aligned model prefers more than the reference model are rewarded; sequences it prefers less are penalized.
DPO gradient: The DPO loss gradient with respect to $\theta$ is:

$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l)}\Big[ \sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big) \big( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big) \Big],$$

where $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$. The weight $\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)$ is large when the model assigns higher reward to the loser than the winner - the loss upweights these hard examples. This has the same structure as hard example mining in contrastive learning.
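A minimal DPO loss sketch under the implicit-reward formulation above, assuming per-sequence log-probabilities of the chosen (winner) and rejected (loser) responses are precomputed; this is a sketch, not the authors' reference implementation:

import torch
import torch.nn.functional as F

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss from per-sequence log-probs of winner (w) and loser (l)."""
    r_w = beta * (logp_w_policy - logp_w_ref)  # implicit reward of winner
    r_l = beta * (logp_l_policy - logp_l_ref)  # implicit reward of loser
    return -F.logsigmoid(r_w - r_l).mean()     # logistic loss on reward margin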
Appendix G: Connections to Other Chapters
G.1 KL and Exponential Families (Chapter 7: Statistics)
Exponential families are the natural domain for KL divergence because:
- KL between exponential family members has closed form (Section 5.3)
- The maximum entropy distribution under moment constraints is an exponential family member (Section 09-01 Section 5)
- The Bregman divergence generated by the log-partition function $A(\theta)$ equals the KL between members with those natural parameters (with the argument order swapped)

Sufficiency: Kullback and Leibler's original paper introduced KL as a measure of information for discrimination between hypotheses. A statistic $T$ is sufficient for the parameter iff $D_{KL}\big(p_{\theta_1}(x) \,\|\, p_{\theta_2}(x)\big) = D_{KL}\big(p_{\theta_1}(T) \,\|\, p_{\theta_2}(T)\big)$ for all $\theta_1, \theta_2$ - sufficient statistics preserve all the discriminatory information. This is the information-theoretic definition of sufficiency.
G.2 KL and Convex Optimization (Chapter 8)
The ELBO maximization is a convex optimization problem in $q$ (for fixed model parameters $\theta$):

- $D_{KL}(q \| p)$ is convex in $q$ (Section 3.4)
- The constraint $\sum_i q_i = 1$, $q_i \ge 0$ is convex (the probability simplex)
- The I-projection $\min_{q \in \mathcal{C}} D_{KL}(q \| p)$ onto a convex set $\mathcal{C}$ is a convex program with a unique solution
The Lagrangian of the I-projection onto the exponential family gives the natural parameter update equations - connecting KL divergence to the theory of constrained convex optimization from Chapter 8.
G.3 KL and Functional Analysis (Chapter 12)
In infinite-dimensional settings, KL divergence extends via the Radon-Nikodym theorem. For Gaussian processes and Gaussian measures on Hilbert spaces:

$$D_{KL}\big(\mathcal{N}(\mu_1, \Sigma_1) \,\|\, \mathcal{N}(\mu_2, \Sigma_2)\big) = \frac{1}{2}\Big[ \operatorname{tr}(\Sigma_2^{-1} \Sigma_1) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) - d + \log \frac{\det \Sigma_2}{\det \Sigma_1} \Big]$$

the same Gaussian KL formula, but now $\Sigma_1, \Sigma_2$ are covariance operators (kernel matrices in the finite case) and $\mu_1, \mu_2$ are mean functions. The trace and log-determinant terms generalize as the operator trace and the Fredholm determinant.
RKHS connection: The maximum mean discrepancy (MMD) is an alternative to KL for comparing distributions, based on RKHS distances. MMD has better convergence rates from finite samples than KL estimators, which is why it appears in some GANs (MMD-GAN) and kernel two-sample tests.
G.4 KL and Statistical Learning Theory (Chapter 21)
The PAC-Bayes bound (Appendix C.1) shows that the KL from posterior to prior controls generalization. This has a direct implication for LoRA fine-tuning: if the LoRA adapter is constrained to have small $D_{KL}(Q \| P)$ - i.e., small deviation from initialization - then the generalization bounds are tight. This is the information-theoretic justification for small-rank adaptations: they have small KL from the pretrained distribution.
Minimum Description Length (MDL): The MDL principle selects the model that minimizes the two-part code length $L(M) + L(D \mid M)$, where $L(M)$ is the code length of the model and $L(D \mid M)$ is the code length of the data given the model. Via the bits-back argument, this equals $-\mathrm{ELBO}$ for the right choice of model code and posterior. MDL and ELBO are the same objective - minimizing description length equals maximizing the ELBO. This is the Bayesian interpretation of Occam's razor.
Appendix H: Summary Tables
H.1 KL Divergence Properties at a Glance
| Property | Holds? | Counterexample if No |
|---|---|---|
| Non-negative: $D_{KL}(p \| q) \ge 0$ | YES | - |
| Zero iff equal: $D_{KL}(p \| q) = 0 \iff p = q$ | YES (a.e.) | - |
| Symmetric: $D_{KL}(p \| q) = D_{KL}(q \| p)$ | NO | e.g. $\mathrm{Bern}(0.5)$ vs $\mathrm{Bern}(0.9)$: the two directions differ |
| Triangle inequality | NO | Fails in general |
| Data processing inequality: $D_{KL}(T(p) \| T(q)) \le D_{KL}(p \| q)$ | YES | - |
| Jointly convex in $(p, q)$ | YES | - |
| Convex in $p$ for fixed $q$ | YES | - |
| Convex in $q$ for fixed $p$ | YES | - |
| Finite for all $p, q$ | NO | $q(x) = 0$ with $p(x) > 0$: $D_{KL}(p \| q) = \infty$ |
| Metric (distance function) | NO | Fails symmetry and triangle inequality |
H.2 Forward vs Reverse KL Summary
| | Forward KL $D_{KL}(p \| q)$ | Reverse KL $D_{KL}(q \| p)$ |
|---|---|---|
| Expectation under | $p$ (truth) | $q$ (approximation) |
| Also called | Inclusive, zero-avoiding | Exclusive, zero-forcing |
| Behavior when fitting $q$ to $p$ | Mean-seeking, mass-covering | Mode-seeking, mass-concentrating |
| Handles multimodal $p$ | Averages over all modes | Collapses to one mode |
| Support requirement | $q$ must cover $\mathrm{supp}(p)$ | $q$ must avoid regions where $p = 0$ |
| Used in | MLE, distillation, flows | Variational inference, ELBO |
| Risk | Blurry/average samples | Posterior collapse |
| Tractability | Needs samples from $p$ | Only needs samples from $q$ |
H.3 f-Divergence Family Summary
| Divergence | $f(t)$ | Symmetric? | Bounded? | Metric? |
|---|---|---|---|---|
| KL | $t \log t$ | No | No | No |
| Reverse KL | $-\log t$ | No | No | No |
| Jeffrey's | $(t - 1) \log t$ | Yes | No | No |
| Hellinger $H^2$ | $(\sqrt{t} - 1)^2$ | Yes | Yes ($\le 2$) | $H$ is |
| Total Variation TV | $\frac{1}{2} |t - 1|$ | Yes | Yes ($\le 1$) | Yes |
| Chi-squared $\chi^2$ | $(t - 1)^2$ | No | No | No |
| Jensen-Shannon JSD | (mixture form) | Yes | Yes ($\le \log 2$) | $\sqrt{\mathrm{JSD}}$ is |
| Renyi $D_\alpha$ | (limit form) | No | No | No |
Appendix I: Extended Examples and Geometric Visualizations
I.1 KL Divergence Along a Path Between Distributions
Consider a one-parameter family $p_t = (1 - t)\, p_0 + t\, p_1$, $t \in [0, 1]$ - a straight line between two distributions in the probability simplex. The KL divergence $D_{KL}(p_t \| q)$ is convex in $t$ for any fixed $q$ (because KL is jointly convex in its arguments and $p_t$ is linear in $t$). However, KL along other parameterizations of the path - such as the exponential geodesic discussed below - need not be convex in $t$: the "path" in distribution space is curved from the perspective of KL divergence.
Numerical example. Fix two endpoint distributions $p_0$ and $p_1$, let $r = p_{1/2}$ be the midpoint of the path, and evaluate $D_{KL}(p_t \| r)$ and $D_{KL}(r \| p_t)$ on the grid $t \in \{0, 0.25, 0.5, 0.75, 1.0\}$; the sketch below computes such a table.
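A minimal sketch, with endpoints chosen for illustration:

import numpy as np

p0 = np.array([0.8, 0.2])  # illustrative endpoints
p1 = np.array([0.2, 0.8])
r = 0.5 * (p0 + p1)        # midpoint of the mixture path

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    pt = (1 - t) * p0 + t * p1
    print(f"t={t:.2f}  KL(p_t||r)={kl(pt, r):.4f}  KL(r||p_t)={kl(r, pt):.4f}")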
Both directions are symmetric here due to the symmetric path. The KL is zero at $t = 0.5$ (when $p_t = r$) and increases convexly toward the endpoints.
Geometric implication: The straight-line path is a mixture geodesic (M-geodesic) in information geometry - the natural path in the mixture parameterization. The dual path (exponential geodesic) mixes in the natural parameter space, giving different intermediate distributions.
I.2 KL vs Wasserstein Distance
KL divergence and the Wasserstein distance (from optimal transport theory) are complementary tools for comparing distributions:
| Property | KL Divergence | Wasserstein Distance |
|---|---|---|
| Requires absolute continuity ($p \ll q$) | YES | No |
| Metrizes weak convergence | No | YES |
| Captures geometry of support | No | YES |
| Tractable to compute | YES (closed form for many families) | Harder ($O(n^3)$ naive) |
| Differentiable | YES | With entropic regularization |
| Used in | MLE, ELBO, RLHF | Generative models, image quality |
When Wasserstein beats KL: If $p_\theta$ (generator) and $p_{\mathrm{data}}$ have disjoint supports, $D_{KL}(p_{\mathrm{data}} \| p_\theta) = \infty$ and JSD saturates at $\log 2$ - gradients vanish. The Wasserstein distance is still finite and differentiable (reflecting the actual geometric distance between the supports). This is the motivation for WGAN (Arjovsky et al., 2017).
When KL beats Wasserstein: For language modeling, the vocabulary forms a discrete space with no natural geometry. The Wasserstein distance would require defining a metric on tokens (words), which is ambiguous. KL divergence makes no assumptions about token geometry and is directly connected to log-probability, making it the natural choice.
I.3 KL Divergence Under Transformations
Transformation by sufficient statistic. If $T$ is a sufficient statistic for discriminating $p$ from $q$, then:

$$D_{KL}(p \| q) = D_{KL}(p_T \| q_T),$$

where $p_T$ is the induced distribution of $T(X)$ under $p$. Sufficient statistics preserve all the discriminatory information - no KL is lost when summarizing data by its sufficient statistics.

Example. For $n$ i.i.d. coin flips, the statistic $T = \sum_{i=1}^n X_i$ (number of heads in $n$ flips) is sufficient. The KL between $\mathrm{Bern}(p)^{\otimes n}$ and $\mathrm{Bern}(q)^{\otimes n}$ equals $n\, D_{KL}(\mathrm{Bern}(p) \| \mathrm{Bern}(q))$, and the KL between the corresponding $\mathrm{Binomial}(n, p)$ and $\mathrm{Binomial}(n, q)$ distributions equals the same thing - the sufficient statistic preserves KL exactly.
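A quick numeric check of the binomial example (illustrative $n$, $p$, $q$; SciPy assumed):

import numpy as np
from scipy.stats import binom

n, p, q = 10, 0.7, 0.4  # illustrative values
k = np.arange(n + 1)
P, Q = binom.pmf(k, n, p), binom.pmf(k, n, q)
kl_binom = np.sum(P * np.log(P / Q))
kl_bern = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
print(kl_binom, n * kl_bern)  # equal: the sufficient statistic preserves KL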
KL under affine transformation. For continuous distributions, an invertible affine map $y = A x + b$ with $\det A \ne 0$ gives $D_{KL}(p_Y \| q_Y) = D_{KL}(p_X \| q_X)$ - KL is invariant under invertible affine transformations. This is why the KL between two Gaussians can be computed in any coordinate system.
I.4 The Sanov Theorem and Large Deviations
Sanov's theorem is the large deviations foundation for KL divergence.
Theorem (Sanov). Let $X_1, \dots, X_n$ be i.i.d. from $q$. Let $\hat{P}_n$ be the empirical distribution. For any set $E$ of distributions (closed in total variation):

$$P(\hat{P}_n \in E) \;\doteq\; e^{-n \inf_{p \in E} D_{KL}(p \| q)}.$$

More precisely: $\lim_{n \to \infty} \frac{1}{n} \log P(\hat{P}_n \in E) = -\inf_{p \in E} D_{KL}(p \| q)$ (for sufficiently regular $E$).

Interpretation: The probability of observing an empirical distribution in $E$ decays exponentially in $n$, with exponent given by the minimum KL from any distribution in $E$ to the true distribution $q$. The "most likely" atypical empirical distribution is the one in $E$ closest to $q$ in KL divergence - the I-projection of $q$ onto $E$.

For AI: Sanov's theorem explains why KL divergence appears in PAC-Bayes bounds. The probability that the empirical risk deviates far from the true risk decays exponentially in $n$, with an exponent involving $D_{KL}(Q \| P)$, where $Q$ is the posterior over hypotheses and $P$ is the prior. This exponential decay is what makes PAC-Bayes bounds high-probability statements about the training set being representative.
I.5 KL Divergence and the EM Algorithm: A Complete Example
We illustrate the EM algorithm as alternating KL minimizations with a Gaussian mixture model.
Setup: Data $x_1, \dots, x_n$; model $p_\theta(x) = \sum_{k=1}^K \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$ with latent component assignments $z_i$.

E-step - compute responsibilities (I-projection):

$$\gamma_{ik} = p_{\theta^{(t)}}(z_i = k \mid x_i) = \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}.$$

This sets $q(z) = p_{\theta^{(t)}}(z \mid x)$ - the projection onto the space of conditional distributions given the current parameters.

M-step - update parameters (M-projection):

$$\theta^{(t+1)} = \arg\max_\theta\; \sum_i \mathbb{E}_{q_i}\big[\log p_\theta(x_i, z_i)\big].$$

This maximizes the ELBO - equivalent to the M-projection that minimizes $D_{KL}(q \| p_\theta)$ over $\theta$.

KL decomposition at each step:

$$\log p_\theta(x) = \mathrm{ELBO}(q, \theta) + D_{KL}\big(q(z) \,\|\, p_\theta(z \mid x)\big).$$

The E-step zeroes the second term (for the current $\theta$); the M-step maximizes the first term. Together, each EM iteration increases $\log p_\theta(x)$.
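A compact EM sketch for a 1-D two-component mixture (synthetic data and initialization chosen for illustration), showing the alternating E/M structure:

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussians (illustrative ground truth).
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: responsibilities = posterior over components (zeroes the KL term).
    w = pi * normal_pdf(x[:, None], mu, sigma)   # shape (n, 2)
    gamma = w / w.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood (the ELBO).
    nk = gamma.sum(axis=0)
    pi = nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
print(pi, mu, sigma)  # approaches pi ~ (0.3, 0.7), mu ~ (-2, 3), sigma ~ (1, 1)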
Appendix J: Notation and Conventions
J.1 KL Divergence Notation Across Different Sources
Different textbooks and papers use different notation for KL divergence. This can cause significant confusion:
| Source | Notation | Direction |
|---|---|---|
| This curriculum (following Cover & Thomas) | $D(p \| q)$ | $p$ = reference (truth); $q$ = approximation |
| Kullback (1959) | $I(1{:}2)$ | Between distributions indexed 1 and 2 |
| Bishop (2006) | $\mathrm{KL}(p \| q)$ | Same as Cover & Thomas |
| Murphy (2012) | $\mathbb{KL}(p \| q)$ | Same |
| Goodfellow et al. (2016) | $D_{\mathrm{KL}}(P \| Q)$ | Same |
| Some optimization papers | $D(q, p)$ with arguments swapped | Reversed - always check! |
Key check: In VAE papers, $D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$ has the encoder as the first argument (the reference for the expectation) and the prior as the second. This is reverse KL - the expectation is under the approximate posterior $q_\phi(z \mid x)$.
J.2 Relationship Between Information Quantities
VENN DIAGRAM OF INFORMATION QUANTITIES

For a joint distribution P(X,Y): picture two overlapping circles, H(X) and H(Y).
Their union is H(X,Y); the overlap is I(X;Y); the non-overlapping parts are
H(X|Y) (in H(X) only) and H(Y|X) (in H(Y) only).
Key relationships:
H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) [chain rule]
I(X;Y) = H(X) + H(Y) - H(X,Y) [definition]
I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) [equivalences]
I(X;Y) = D_KL(P(X,Y) || P(X)P(Y)) [KL form]
H(p,q) = H(p) + D_KL(p || q) [cross-entropy decomp]
J.3 Units and Conversions
| Base | Unit | Conversion |
|---|---|---|
| $e$ (natural log) | nats | Default in this curriculum; 1 nat $= 1/\ln 2 \approx 1.443$ bits |
| 2 | bits | Used in data compression contexts; 1 bit $= \ln 2 \approx 0.693$ nats |
| 10 | hartleys (bans) | Rare in ML; 1 hartley $= \ln 10 \approx 2.303$ nats |
PyTorch convention: torch.nn.functional.kl_div expects its first argument in log-space (the target is in probability space by default) and computes nats (it uses torch.log, which is the natural log). To get bits, divide by math.log(2).
End of Section 09-02 KL Divergence