Math for LLMs

Expectation and Moments

"The expectation of the product of two independent random variables equals the product of their expectations - a theorem so simple it conceals its own depth."

  - Mark Kac

Overview

Expectation is the single most important operation in probability theory. It compresses an entire distribution into a number - the weighted average of all possible values, where weights are probabilities. Every loss function in machine learning is an expectation. Every gradient update in stochastic training is an estimated expectation. Every generative model is ultimately defined by the distributions whose expectations it learns to match.

This section builds the theory of the expectation operator from first principles: the formal definition via the law of the unconscious statistician (LOTUS), linearity (which holds without independence), the tower property (iterated conditioning), moment generating functions (which encode the entire distribution in an analytic function), and Jensen's inequality (which underlies the derivation of the ELBO, the KL divergence bound, and every variational argument in modern deep learning).

The central thread connecting all these tools is moments - the sequence \mathbb{E}[X], \mathbb{E}[X^2], \mathbb{E}[X^3], \ldots that describes the shape of a distribution. The first moment locates the center. The second central moment (variance) measures spread. The third captures asymmetry. The fourth captures tail heaviness. In modern ML, the Adam optimizer literally tracks first and second moments of gradients at every parameter, every step. Understanding moments is understanding adaptive optimisation.

Prerequisites

Companion Notebooks

  Notebook          Description
  theory.ipynb      Interactive derivations: LOTUS, Jensen, MGFs, bias-variance, Adam moment tracking
  exercises.ipynb   10 graded exercises from LOTUS computation to policy gradient estimation

Learning Objectives

After completing this section, you will be able to:

  1. Compute expectations for discrete, continuous, and mixed distributions using LOTUS
  2. Apply linearity of expectation without assuming independence
  3. Derive variance using the computational formula and prove properties (Var(aX+b), Var(X+Y))
  4. Interpret and compute skewness and kurtosis from their moment definitions
  5. State and apply the tower property (iterated expectation) and law of total variance
  6. Derive moment generating functions and extract moments via differentiation
  7. Apply Jensen's inequality to prove KL divergence non-negativity and derive the ELBO
  8. Prove and apply the Cauchy-Schwarz inequality for the |\rho| \leq 1 bound
  9. Decompose expected squared error into bias^2, variance, and irreducible noise
  10. Connect the Adam optimizer to first and second moment estimation with bias correction
  11. Preview Markov's and Chebyshev's inequalities as moment-based tail bounds (-> Section 05)

1. Intuition and Historical Context

1.1 The Moment Analogy

The word "moment" comes from classical mechanics. In physics, the first moment of a mass distribution is the center of mass; the second moment is the moment of inertia. Probability theory borrows this language directly: the k-th moment of a random variable X is \mathbb{E}[X^k], the expected value of X raised to the k-th power.

Imagine balancing a plank on which weights are placed at positions x_1, x_2, \ldots, x_n with weights p_1, p_2, \ldots, p_n (probabilities). The balance point is \sum_i x_i p_i = \mathbb{E}[X] - the center of mass. The spread around that center is captured by \sum_i (x_i - \mu)^2 p_i = \text{Var}(X), the second central moment. Higher moments capture the shape: whether the distribution leans left or right (skewness), whether it has heavy tails (kurtosis).

MOMENTS AS SHAPE DESCRIPTORS
========================================================================

  First moment (mean \mu):                Location of center
  Second central moment (\sigma^2):       Spread about center
  Third central moment (\gamma_1\sigma^3):   Asymmetry (skewness)
  Fourth central moment (\gamma_2\sigma^4):  Tail weight (kurtosis)

  Standard normal N(0,1):  \mu=0, \sigma^2=1, \gamma_1=0, \gamma_2=0 (baseline)

  Right-skewed (\gamma_1 > 0):         Heavy right tail: income, LLM token probs
        #
       ###
      #####=====........
  ----+--------------------

  Heavy-tailed (\gamma_2 > 0):         Leptokurtic: gradient norms, loss landscapes
        #
        #
       ###
      #####.............
  ----+--------------------

========================================================================

Every major operation in statistics reduces to computing moments. The mean is the first moment. The variance is the second central moment. The sample statistics we compute to estimate distributions are moment estimates. The method of moments estimation technique literally sets sample moments equal to theoretical moments and solves for parameters.

For AI: Every loss function is an expectation over the data distribution:

\mathcal{L}(\theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(f_\theta(x), y)]

We cannot compute this exactly (the true distribution \mathcal{D} is unknown), so we estimate it with the sample mean over a minibatch. The entire machinery of stochastic gradient descent rests on this approximation.

1.2 Why Moments Matter for AI

Moments appear in nearly every component of modern deep learning:

Loss functions. The mean squared error \frac{1}{N}\sum_i (y_i - \hat{y}_i)^2 estimates \mathbb{E}[(Y - f(X))^2], the expected squared loss. Cross-entropy loss estimates -\mathbb{E}_{p}[\log q]. Every loss is a first moment of some function of the predictions.

Optimisation. The Adam optimizer maintains running estimates of the first moment m_t = \mathbb{E}[g_t] and second moment v_t = \mathbb{E}[g_t^2] of the gradient g_t. The adaptive learning rate \alpha / \sqrt{v_t + \epsilon} normalises by the square root of the second moment. Without understanding moments, Adam is a black box; with it, Adam is moment matching with bias correction.
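A minimal sketch of this moment tracking, assuming a single parameter array w and an externally computed gradient g (hyperparameter defaults follow Kingma & Ba, 2014):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: EMA estimates of the first and second gradient
    moments, with bias correction (step counter t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g       # running estimate of E[g_t]
    v = beta2 * v + (1 - beta2) * g**2    # running estimate of E[g_t^2]
    m_hat = m / (1 - beta1**t)            # corrects the zero-initialisation bias
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```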

Batch Normalisation. BatchNorm computes the sample mean and variance of activations across a batch, then normalises. This is moment normalisation: forcing the first moment to zero and the second to one, stabilising training by controlling the distribution of layer activations.

KL Divergence. The KL divergence \text{KL}(p \| q) = \mathbb{E}_p[\log(p/q)] is an expectation under p. Its non-negativity (\text{KL} \geq 0) follows directly from Jensen's inequality applied to the convex function -\log. The ELBO in variational autoencoders is derived by applying Jensen's inequality to decompose \log p(x).

Reparameterisation. The reparameterisation trick in VAEs (z = \mu_\phi + \sigma_\phi \odot \varepsilon) enables computing gradients of expectations. The gradient \nabla_\phi \mathbb{E}_{q_\phi}[f(z)] is intractable to compute directly, but after reparameterisation it becomes \mathbb{E}_{\varepsilon}[\nabla_\phi f(\mu_\phi + \sigma_\phi \odot \varepsilon)], estimable via sampling.
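A toy sanity check of the trick, assuming f(z) = z^2 so the exact gradient is known: \mathbb{E}[z^2] = \mu^2 + \sigma^2, hence \nabla_\mu \mathbb{E}[z^2] = 2\mu.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
eps = rng.standard_normal(100_000)

# Reparameterise z = mu + sigma * eps and push the gradient inside the
# expectation: d/dmu E[z^2] = E[ 2 * (mu + sigma * eps) ]
grad_est = np.mean(2 * (mu + sigma * eps))
print(grad_est, 2 * mu)  # ~3.0 vs exact 3.0
```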

Score functions. The score function \nabla_\theta \log p_\theta(x) satisfies \mathbb{E}_{p_\theta}[\nabla_\theta \log p_\theta(X)] = \mathbf{0} - its expectation is zero. This identity (proved by differentiating the normalisation condition \int p_\theta = 1) underlies Fisher information, the REINFORCE policy gradient estimator, and Stein's identity used in score-based diffusion models.

1.3 Historical Timeline

HISTORY OF MOMENTS AND EXPECTATION
========================================================================

  1654  Pascal-Fermat correspondence: first rigorous treatment of
        expected value ("valeur" of a game) in probability

  1657  Huygens: De Ratiociniis in Ludo Aleae - published the first
        probability textbook with formal expectation calculations

  1713  Bernoulli (Jakob): Ars Conjectandi - law of large numbers
        (published posthumously); shows sample mean -> E[X]

  1733  De Moivre: normal approximation to binomial via moments

  1814  Laplace: Theorie Analytique des Probabilites - systematic
        use of characteristic functions (Fourier transforms of densities)

  1853  Bienayme: first statement of the variance-based tail
        inequality; laid groundwork for moment inequalities

  1867  Chebyshev: proof and popularisation of the inequality
        P(|X-\mu| \geq k\sigma) \leq 1/k^2; rigorous variance-based proof of the LLN

  1894  Pearson: coined "moment" in statistics; introduced skewness
        and kurtosis; method of moments estimation

  1922  Fisher: likelihood, sufficient statistics, and moment
        generating functions as tools of parametric inference

  1945  Cramer: large deviation theory; MGF-based tail bounds
        (foundations for the later work of Hoeffding and Chernoff)

  1992  Williams: REINFORCE algorithm (score function gradient
        estimator = gradient of log-expectation)

  2014  Kingma & Ba: Adam optimizer - formal moment-tracking
        adaptive gradient method

  2014  Kingma & Welling: VAE - reparameterisation trick;
        expectations with differentiable sampling

  2020  Song et al.: score-based diffusion models - reverse SDE
        guidance via score function = \nabla_x \log p(x)

========================================================================

2. Formal Definition of Expectation

2.1 Expectation for Discrete, Continuous, and Mixed Distributions

The expected value (or expectation or mean) of a random variable X is defined as the probability-weighted average of all possible values.

Discrete random variable. If X takes values in a countable set with PMF p(x) = P(X=x):

\mathbb{E}[X] = \sum_{x \in \text{support}(X)} x \cdot p(x)

provided this sum converges absolutely: \sum_x |x| \cdot p(x) < \infty. If it does not converge absolutely, \mathbb{E}[X] does not exist.

Example. For a fair die X \in \{1,2,3,4,5,6\} with p(x) = 1/6:

\mathbb{E}[X] = \frac{1}{6}(1+2+3+4+5+6) = \frac{21}{6} = 3.5

Continuous random variable. If X has PDF f(x):

\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx

provided \int_{-\infty}^{\infty} |x| f(x) \, dx < \infty.

Example. For X \sim \text{Exp}(\lambda) with PDF f(x) = \lambda e^{-\lambda x} on [0,\infty):

\mathbb{E}[X] = \int_0^\infty x \lambda e^{-\lambda x} \, dx = \frac{1}{\lambda}

Mixed distribution. If X has a mixed distribution (point mass at x_0 plus a continuous density):

\mathbb{E}[X] = p_0 \cdot x_0 + \int x \cdot f_{\text{cont}}(x) \, dx

where p_0 = P(X = x_0) is the point mass probability.

Notation. We write \mathbb{E}[X], \mathbb{E}_X[X], \mu_X, or \langle X \rangle (physics notation) interchangeably. When the distribution is parameterised by \theta: \mathbb{E}_\theta[X] or \mathbb{E}_{X \sim p_\theta}[X].

For AI: The empirical mean \bar{X}_N = \frac{1}{N}\sum_{i=1}^N X_i is the canonical estimator of \mathbb{E}[X]. Every forward pass computes a function of the inputs; averaging that function over a batch estimates its expectation. The loss function IS an expectation - we just estimate it from data.
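A quick Monte Carlo illustration of this estimator, assuming X \sim \text{Exp}(\lambda = 2) so the exact mean 1/\lambda = 0.5 is available for comparison:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
x = rng.exponential(scale=1/lam, size=100_000)  # samples of X ~ Exp(lambda)

print(x.mean())  # ~0.5: the empirical mean estimates E[X] = 1/lambda
```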

2.2 Existence and Integrability

The expectation \mathbb{E}[X] exists (is finite) when \mathbb{E}[|X|] < \infty, i.e., when X is integrable. This fails in important cases:

Cauchy distribution. f(x) = \frac{1}{\pi(1+x^2)} has \int_{-\infty}^\infty |x| \cdot \frac{1}{\pi(1+x^2)} \, dx = +\infty. The Cauchy distribution has no mean (and no finite moment of any integer order k \geq 1). A ratio of two independent standard normals follows a Cauchy distribution.

Heavy-tailed distributions. The Pareto distribution f(x) = \alpha x_m^\alpha / x^{\alpha+1} for x > x_m has:

  • \mathbb{E}[X] exists iff \alpha > 1
  • \text{Var}(X) exists iff \alpha > 2
  • the k-th moment exists iff \alpha > k

This matters in practice: internet traffic volumes, earthquake magnitudes, word frequencies in language (Zipf's law), and wealth distributions all exhibit Pareto-like tails. Models that assume finite variance (e.g., CLT-based confidence intervals) can fail badly when applied to such data.

Convergence condition. \mathbb{E}[X] is well-defined (possibly \pm\infty) for non-negative X (since the integral may diverge to +\infty but is unambiguous). For general X, write X = X^+ - X^- where X^+ = \max(X,0) and X^- = \max(-X,0). Then \mathbb{E}[X] = \mathbb{E}[X^+] - \mathbb{E}[X^-], which is finite only when both parts are finite.

For AI: In practice, neural network outputs and loss values are bounded (through clipping, normalisation, or the bounded nature of activation functions). But gradient values can have very heavy tails, especially in early training. Gradient clipping is a practical acknowledgment that the gradient may not have finite variance, in which case Cauchy-Schwarz and CLT-based approximations break down.

2.3 Linearity of Expectation

The most powerful property of expectation is its linearity, which holds without any independence assumption:

Theorem (Linearity of Expectation). For any random variables X and Y and constants a, b \in \mathbb{R}:

\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]

Proof (continuous case). Using the joint density f_{X,Y}:

\mathbb{E}[aX+bY] = \int\!\!\int (ax+by)\, f_{X,Y}(x,y)\, dx\, dy
= a\int\!\!\int x\, f_{X,Y}(x,y)\, dx\, dy + b\int\!\!\int y\, f_{X,Y}(x,y)\, dx\, dy
= a\int x\, f_X(x)\, dx + b\int y\, f_Y(y)\, dy = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]

where we used \int f_{X,Y}(x,y)\, dy = f_X(x) (marginalisation from Section 03). The discrete case is analogous. \square

Extension. By induction: for any constants a_1, \ldots, a_n and random variables X_1, \ldots, X_n (not necessarily independent):

\mathbb{E}\!\left[\sum_{i=1}^n a_i X_i\right] = \sum_{i=1}^n a_i\, \mathbb{E}[X_i]

Critical point: This holds even when X and Y are dependent. It does NOT extend to products in general: \mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] holds when X \perp Y (more precisely, exactly when X and Y are uncorrelated).

Examples demonstrating the power of linearity:

Sum of dice. S = X_1 + X_2 + \cdots + X_n where each X_i is a fair die. Without worrying about dependence (there is none here, but the theorem doesn't require it): \mathbb{E}[S] = 3.5\,n.

Binomial mean. X \sim \text{Binomial}(n,p). Write X = \sum_{i=1}^n B_i where B_i \sim \text{Bernoulli}(p) (the indicator that trial i succeeds). Then \mathbb{E}[X] = \sum_{i=1}^n \mathbb{E}[B_i] = np. No need to compute the full PMF.

Expected number of fixed points. A random permutation \sigma of \{1,\ldots,n\} has expected number of fixed points (\sigma(i) = i) equal to 1, for any n. Proof: \mathbb{E}[\#\text{fixed points}] = \sum_{i=1}^n P(\sigma(i)=i) = n \cdot (1/n) = 1, by linearity applied to the indicator variables.
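The fixed-point result is easy to confirm by simulation; a short sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 100_000

# Count fixed points sigma(i) = i of a random permutation of {0, ..., n-1}
fixed = [np.sum(rng.permutation(n) == np.arange(n)) for _ in range(trials)]
print(np.mean(fixed))  # ~1.0 regardless of n, exactly as linearity predicts
```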

For AI: Linearity of expectation is the reason we can decompose complex expected losses into sums of simpler expected losses. The expected cross-entropy over a vocabulary of size V is -\sum_{v=1}^V \mathbb{E}[y_v \log \hat{p}_v] - a sum of V simpler expectations. The expected gradient of a sum is the sum of the expected gradients, justifying minibatch averaging as a gradient estimator.

2.4 LOTUS - Law of the Unconscious Statistician

A common need: compute \mathbb{E}[g(X)] for some function g when we know the distribution of X but not necessarily that of g(X) directly. LOTUS provides the answer.

Theorem (LOTUS - Law of the Unconscious Statistician). Let X be a random variable and g : \mathbb{R} \to \mathbb{R} a measurable function.

Discrete:

\mathbb{E}[g(X)] = \sum_{x} g(x) \cdot p_X(x)

Continuous:

\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) \cdot f_X(x) \, dx

Key insight: We do NOT need the distribution of g(X). We use the distribution of X directly. This is why it's called the "unconscious" statistician - you can compute \mathbb{E}[g(X)] without consciously finding the PDF of g(X).

Proof sketch (discrete case). Let Y = g(X). Grouping terms:

\mathbb{E}[Y] = \sum_y y \cdot P(Y=y) = \sum_y y \sum_{x:\, g(x)=y} P(X=x) = \sum_x g(x)\, P(X=x)

The last step rearranges the double sum. \square

Examples:

Variance via LOTUS. \text{Var}(X) = \mathbb{E}[(X-\mu)^2]. Using LOTUS directly on g(x) = (x-\mu)^2:

\text{Var}(X) = \int (x-\mu)^2 f_X(x) \, dx

Second moment of the Exponential. For X \sim \text{Exp}(\lambda), using g(x) = x^2:

\mathbb{E}[X^2] = \int_0^\infty x^2 \lambda e^{-\lambda x} \, dx = \frac{2}{\lambda^2}
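The same second moment can be checked numerically: sample X and average g(X) = X^2 directly, with no reference to the distribution of X^2 (assuming \lambda = 2, so 2/\lambda^2 = 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
x = rng.exponential(scale=1/lam, size=200_000)

# LOTUS in Monte Carlo form: E[g(X)] is estimated by averaging g over
# samples of X itself
print(np.mean(x**2), 2 / lam**2)  # ~0.5 vs exact 0.5
```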

Jensen's inequality (preview). For convex f, LOTUS combined with the supporting hyperplane property gives f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]. (Full treatment in Section 7.)

LOTUS for functions of multiple variables. For jointly distributed (X,Y) with joint density f_{X,Y}:

\mathbb{E}[g(X,Y)] = \int\!\!\int g(x,y)\, f_{X,Y}(x,y) \, dx \, dy

For AI: LOTUS is the theorem behind every expectation computation in variational inference. The ELBO contains \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - an expectation of \log p_\theta(x|z) under q_\phi. LOTUS says: integrate \log p_\theta(x|z) weighted by q_\phi(z|x); there is no need to find the distribution of \log p_\theta(x|z) directly.


3. Variance, Standard Deviation, and Higher Moments

3.1 Variance

The variance of X measures the expected squared deviation from the mean:

\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]

Computational formula. Expanding the square:

\text{Var}(X) = \mathbb{E}[X^2 - 2\mu X + \mu^2] = \mathbb{E}[X^2] - 2\mu\,\mathbb{E}[X] + \mu^2 = \mathbb{E}[X^2] - \mu^2

This is the Konig-Huygens formula: \text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2.

Properties of variance:

  1. \text{Var}(X) \geq 0, with equality iff X is almost surely constant.
  2. \text{Var}(aX + b) = a^2\, \text{Var}(X) for any constants a, b.
  3. \text{Var}(X + Y) = \text{Var}(X) + 2\,\text{Cov}(X,Y) + \text{Var}(Y) (see Section 4.4).
  4. If X \perp Y: \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y).

Proof of property 2.

\text{Var}(aX+b) = \mathbb{E}[(aX+b - (a\mu+b))^2] = \mathbb{E}[(a(X-\mu))^2] = a^2\, \mathbb{E}[(X-\mu)^2] = a^2\,\text{Var}(X)

Note: the constant b does not affect variance (shifts don't change spread).

Standard examples:

  • X \sim \text{Bernoulli}(p): \mathbb{E}[X] = p, \mathbb{E}[X^2] = p, \text{Var}(X) = p - p^2 = p(1-p). Maximum at p = 1/2.
  • X \sim \text{Uniform}[a,b]: \mathbb{E}[X] = (a+b)/2, \text{Var}(X) = (b-a)^2/12.
  • X \sim \mathcal{N}(\mu, \sigma^2): \mathbb{E}[X] = \mu, \text{Var}(X) = \sigma^2 (by construction).
  • X \sim \text{Exp}(\lambda): \mathbb{E}[X] = 1/\lambda, \mathbb{E}[X^2] = 2/\lambda^2, \text{Var}(X) = 1/\lambda^2.
  • X \sim \text{Poisson}(\lambda): \mathbb{E}[X] = \text{Var}(X) = \lambda (a special property).

For AI: Variance quantifies uncertainty. The variance of a model's predictions measures how much predictions vary with the training set - high variance = overfitting. The variance of gradient estimates in SGD determines how noisy the update steps are - high variance slows convergence. Reducing gradient variance is the goal of techniques like control variates, variance reduction in REINFORCE, and importance-weighted estimators.

3.2 Standard Deviation and Coefficient of Variation

The standard deviation \sigma_X = \sqrt{\text{Var}(X)} has the same units as X (variance has squared units). It is the more interpretable spread measure.

The coefficient of variation (CV) normalises the standard deviation by the mean:

\text{CV}(X) = \frac{\sigma_X}{\mathbb{E}[X]}

CV is dimensionless and measures relative spread. A CV of 0.1 means the standard deviation is 10% of the mean. This is useful when comparing spread across distributions with different scales.

Standardisation. Subtracting the mean and dividing by the standard deviation gives the standardised variable:

Z = \frac{X - \mu}{\sigma}

which has \mathbb{E}[Z] = 0 and \text{Var}(Z) = 1.

For AI: Batch Normalisation (Ioffe & Szegedy, 2015) standardises layer activations during training: for each feature, subtract the batch mean and divide by the batch standard deviation. This forces \mathbb{E}[Z] = 0 and \text{Var}(Z) = 1, then rescales with learnable \gamma, \beta. Layer Norm and RMS Norm are variations that normalise over the feature dimension rather than the batch dimension - critical for transformers, where batch sizes can be 1.
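A minimal sketch of the normalisation step (a Layer Norm variant over the last axis; gamma and beta stand in for the learnable rescaling parameters):

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-5):
    """Standardise each feature vector to zero mean and unit variance,
    then rescale by gamma and shift by beta."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

h = np.random.default_rng(0).normal(5.0, 3.0, size=(4, 16))  # raw activations
out = layer_norm(h, gamma=1.0, beta=0.0)
print(out.mean(axis=-1), out.var(axis=-1))  # ~0 and ~1 per row
```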

3.3 Raw vs Central Moments

Raw moments (moments about zero):

\mu'_k = \mathbb{E}[X^k]

Central moments (moments about the mean \mu = \mathbb{E}[X]):

\mu_k = \mathbb{E}[(X - \mu)^k]

Relationship: The central moments can be expressed in terms of raw moments via the binomial theorem:

\mu_k = \sum_{j=0}^k \binom{k}{j} \mu'_j\, (-\mu)^{k-j}

For the first four:

  • \mu_1 = 0 (always - the mean is the center)
  • \mu_2 = \mu'_2 - (\mu'_1)^2 = \mathbb{E}[X^2] - \mu^2 = \text{Var}(X)
  • \mu_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2(\mu'_1)^3
  • \mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2(\mu'_1)^2 - 3(\mu'_1)^4

Standardised moments divide central moments by \sigma^k, making them dimensionless:

\tilde{\mu}_k = \frac{\mu_k}{\sigma^k}

The conventional shape parameters are \gamma_1 = \tilde{\mu}_3 (skewness) and \gamma_2 = \tilde{\mu}_4 - 3 (excess kurtosis; the kurtosis of the normal is 3, which the excess subtracts).
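Sample versions of these standardised moments take a few lines; a sketch, checked against the exponential distribution (whose skewness is 2 and excess kurtosis 6, per Sections 3.4-3.5):

```python
import numpy as np

def shape_moments(x):
    """Sample skewness and excess kurtosis via standardised moments."""
    z = (x - x.mean()) / x.std()
    return np.mean(z**3), np.mean(z**4) - 3

x = np.random.default_rng(0).exponential(size=1_000_000)
print(shape_moments(x))  # ~(2.0, 6.0) for Exp(lambda)
```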

3.4 Skewness

Skewness measures the asymmetry of a distribution about its mean:

\text{Skew}(X) = \gamma_1 = \frac{\mu_3}{\sigma^3} = \frac{\mathbb{E}[(X-\mu)^3]}{\sigma^3}

  • \gamma_1 > 0 (right-skewed / positive skew): Long tail on the right; mean > median > mode. Examples: income distributions, insurance claims, LLM generation probabilities.
  • \gamma_1 = 0 (symmetric): Normal distribution, uniform, any distribution symmetric about its mean.
  • \gamma_1 < 0 (left-skewed / negative skew): Long tail on the left; mean < median < mode. Examples: age at death in developed countries.

SKEWNESS INTUITION
========================================================================

  Right-skewed (\gamma_1 > 0):           Left-skewed (\gamma_1 < 0):

     ##                                          ##
    ####                                        ####
   ######...                           ...##########
  --+--+--+------                 ------+--+--+--+--

  Mode < Median < Mean            Mean < Median < Mode
  Tail pulls mean right           Tail pulls mean left

========================================================================

For AI: Gradient distributions in deep networks are often right-skewed - most gradients are small, but occasional large gradients occur. This is why gradient clipping is standard practice: the long right tail causes instability if unconstrained. Token probability distributions from language models are also right-skewed: most tokens have very low probability, while a handful of likely tokens dominate.

Skewness of common distributions:

  • \mathcal{N}(\mu,\sigma^2): \gamma_1 = 0 (symmetric by construction)
  • \text{Exp}(\lambda): \gamma_1 = 2 (always right-skewed)
  • \text{Poisson}(\lambda): \gamma_1 = 1/\sqrt{\lambda} (approaches 0 as \lambda \to \infty)
  • \text{Beta}(\alpha,\beta): \gamma_1 = 2(\beta-\alpha)\sqrt{\alpha+\beta+1}/[(\alpha+\beta+2)\sqrt{\alpha\beta}]

3.5 Kurtosis and Excess Kurtosis

Kurtosis measures the weight of the tails relative to a normal distribution:

\text{Kurt}(X) = \frac{\mu_4}{\sigma^4} = \frac{\mathbb{E}[(X-\mu)^4]}{\sigma^4}

Excess kurtosis (the standard in statistics) subtracts the normal baseline of 3:

\gamma_2 = \frac{\mu_4}{\sigma^4} - 3

  • \gamma_2 > 0 (leptokurtic): Heavier tails than Gaussian; more probability in the tails and peak. Examples: Student-t distribution, financial returns, gradient norms.
  • \gamma_2 = 0 (mesokurtic): Normal distribution (the baseline).
  • \gamma_2 < 0 (platykurtic): Lighter tails; more uniform. The uniform distribution has \gamma_2 = -6/5.

KURTOSIS AND TAIL WEIGHT
========================================================================

  Leptokurtic (\gamma_2 > 0):    Mesokurtic (\gamma_2 = 0):    Platykurtic (\gamma_2 < 0):

      #                        #                      ###
     ###                      ###                    #####
    #####                    #####                  #######
   .........                .......                 .......

  Heavy tails - outliers    Normal tails            Thin tails
  common. Student-t, loss   Gaussian                Uniform
  gradients, finance

========================================================================

For AI: Gradient norm distributions during training are leptokurtic - occasional large-gradient events are far more common than a Gaussian would predict. This motivates both gradient clipping (cap the maximum norm) and heavy-tailed noise models for understanding why SGD generalises. The Student-t distribution (\gamma_2 = 6/(\nu-4) for \nu > 4) is often used to model heavy-tailed distributions in robust statistics.

Kurtosis of common distributions:

  • \mathcal{N}(\mu,\sigma^2): \gamma_2 = 0 (by definition - the reference)
  • \text{Uniform}[a,b]: \gamma_2 = -6/5
  • \text{Exp}(\lambda): \gamma_2 = 6
  • \text{Laplace}(\mu,b): \gamma_2 = 3
  • \text{Student-}t(\nu): \gamma_2 = 6/(\nu-4) for \nu > 4; infinite for \nu \leq 4

4. Covariance and Correlation

4.1 Covariance as an Operator

The covariance of two random variables X and Y measures their linear co-movement:

\text{Cov}(X,Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]

Computational formula:

\text{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]

Proof: \mathbb{E}[(X-\mu_X)(Y-\mu_Y)] = \mathbb{E}[XY - \mu_Y X - \mu_X Y + \mu_X \mu_Y] = \mathbb{E}[XY] - \mu_Y\mu_X - \mu_X\mu_Y + \mu_X\mu_Y = \mathbb{E}[XY] - \mu_X\mu_Y. \square

Properties (bilinearity):

  1. \text{Cov}(X,X) = \text{Var}(X)
  2. \text{Cov}(X,Y) = \text{Cov}(Y,X) (symmetric)
  3. \text{Cov}(aX+b, cY+d) = ac\,\text{Cov}(X,Y) (bilinear; the additive constants drop out)
  4. \text{Cov}(X+Z, Y) = \text{Cov}(X,Y) + \text{Cov}(Z,Y) (linear in the first argument)
  5. |\text{Cov}(X,Y)| \leq \sqrt{\text{Var}(X)\text{Var}(Y)} (Cauchy-Schwarz, proved in Section 7.3)

Sign interpretation:

  • \text{Cov}(X,Y) > 0: X and Y tend to be above their means simultaneously.
  • \text{Cov}(X,Y) < 0: when X is above its mean, Y tends to be below its mean.
  • \text{Cov}(X,Y) = 0: no linear relationship (but the variables may still be nonlinearly dependent).

Recall from Section 03: The full covariance matrix \Sigma of a random vector collects all pairwise covariances: \Sigma_{ij} = \text{Cov}(X_i, X_j). The multivariate Gaussian is parameterised by (\boldsymbol{\mu}, \Sigma). -> Full treatment: Joint Distributions, Section 03

4.2 Pearson Correlation Coefficient

The Pearson correlation normalises covariance to remove units:

\rho_{XY} = \text{Corr}(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}

Theorem: -1 \leq \rho_{XY} \leq 1.

Proof via Cauchy-Schwarz. By the Cauchy-Schwarz inequality (proved in Section 7.3):

(\text{Cov}(X,Y))^2 = (\mathbb{E}[(X-\mu_X)(Y-\mu_Y)])^2 \leq \mathbb{E}[(X-\mu_X)^2]\,\mathbb{E}[(Y-\mu_Y)^2] = \sigma_X^2 \sigma_Y^2

Dividing both sides by \sigma_X^2 \sigma_Y^2: \rho^2 \leq 1, so |\rho| \leq 1. \square

Boundary cases:

  • \rho = +1: Y = aX + b for some a > 0 (perfect positive linear relationship).
  • \rho = -1: Y = aX + b for some a < 0 (perfect negative linear relationship).
  • \rho = 0: uncorrelated (no linear relationship - but not necessarily independent).

For AI: The correlation coefficient appears in weight matrix analysis. The effective rank of a weight matrix is related to the correlations between its columns. In attention, high correlation between key-query pairs produces high attention weights. In representational similarity analysis (RSA), neural network layers are compared by computing correlation matrices of their activations.

4.3 Independence and Zero Covariance

Theorem. If X \perp Y, then \text{Cov}(X,Y) = 0.

Proof. If X \perp Y, then \mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] (a consequence of independence). Therefore \text{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] = 0. \square

The converse is FALSE. Zero covariance does not imply independence.

Classic counterexample. Let X \sim \text{Uniform}(-1,1) and Y = X^2. Then:

  • \mathbb{E}[X] = 0 (symmetric distribution)
  • \mathbb{E}[XY] = \mathbb{E}[X \cdot X^2] = \mathbb{E}[X^3] = 0 (odd function, symmetric distribution)
  • \text{Cov}(X,Y) = 0

But Y is completely determined by X - knowing X tells you Y = X^2 exactly. They are maximally dependent!
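The counterexample in a few lines of simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1_000_000)
y = x**2                   # Y is a deterministic function of X

print(np.cov(x, y)[0, 1])  # ~0: zero covariance despite total dependence
```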

For jointly Gaussian variables only: Zero covariance DOES imply independence. This is a special property of the Gaussian distribution - it is the unique distribution for which uncorrelated implies independent (for the joint distribution; marginal Gaussians with zero covariance need not be jointly Gaussian).

For AI: Feature selection based on correlation can miss highly informative features that have nonlinear (but zero-covariance) relationships with the target. Mutual information is a better measure of dependence. In contrastive learning (e.g., SimCLR), redundancy reduction methods explicitly decorrelate representations - but decorrelation (zero covariance) and statistical independence are different objectives.

4.4 Variance of Sums

Theorem. For any random variables X and Y:

\text{Var}(X + Y) = \text{Var}(X) + 2\,\text{Cov}(X,Y) + \text{Var}(Y)

Proof:

\text{Var}(X+Y) = \mathbb{E}[(X+Y)^2] - (\mathbb{E}[X+Y])^2
= \mathbb{E}[X^2 + 2XY + Y^2] - (\mu_X + \mu_Y)^2
= (\mathbb{E}[X^2] - \mu_X^2) + 2(\mathbb{E}[XY] - \mu_X\mu_Y) + (\mathbb{E}[Y^2] - \mu_Y^2)
= \text{Var}(X) + 2\,\text{Cov}(X,Y) + \text{Var}(Y) \quad \square

Corollary (independence): If X \perp Y: \text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y).

General sum: For X_1, \ldots, X_n:

\text{Var}\!\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \text{Var}(X_i) + 2\sum_{i<j} \text{Cov}(X_i, X_j)

If all X_i are independent: \text{Var}(\sum_i X_i) = \sum_i \text{Var}(X_i).

Scaling. For iid X_1,\ldots,X_n with variance \sigma^2:

\text{Var}\!\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}

The variance of the sample mean decreases as 1/n - the statistical basis for why larger datasets reduce estimation uncertainty.

For AI: Weight initialisation strategies (Xavier/Glorot, He) are derived from variance propagation. If \mathbf{h}^{[l]} = W^{[l]}\mathbf{h}^{[l-1]}, then the variance of each output neuron is n_{\text{in}} \cdot \text{Var}(w_{ij}) \cdot \text{Var}(h^{[l-1]}_j) (for independent, zero-mean terms). To keep variance constant across layers (preventing vanishing/exploding activations): \text{Var}(w_{ij}) = 1/n_{\text{in}} (Xavier) or 2/n_{\text{in}} (He, for ReLU). This is the variance-of-sums formula applied to layer activations.
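A sketch of the variance propagation behind He initialisation, assuming a deep stack of square ReLU layers:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 20
h = rng.standard_normal(n)

for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))  # He: Var(w) = 2/n_in
    h = np.maximum(W @ h, 0.0)                          # ReLU

print(h.var())  # stays O(1) after 20 layers instead of vanishing/exploding
```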


5. Conditional Expectation

5.1 Definition and Basic Properties

The conditional expectation \mathbb{E}[Y \mid X = x] is the expected value of Y given that X takes the specific value x. It is a function of x, not a number.

Discrete case:

\mathbb{E}[Y \mid X = x] = \sum_y y \cdot P(Y = y \mid X = x) = \sum_y y \cdot \frac{p_{X,Y}(x,y)}{p_X(x)}

Continuous case:

\mathbb{E}[Y \mid X = x] = \int_{-\infty}^\infty y \cdot f_{Y|X}(y|x) \, dy = \int y \cdot \frac{f_{X,Y}(x,y)}{f_X(x)} \, dy

As a random variable. When we write \mathbb{E}[Y \mid X] (without specifying the value of X), we mean the random variable obtained by composing the function x \mapsto \mathbb{E}[Y \mid X=x] with X. This is a function of X and is therefore itself a random variable.

Basic properties:

  1. \mathbb{E}[aY + bZ \mid X] = a\,\mathbb{E}[Y \mid X] + b\,\mathbb{E}[Z \mid X] (linearity)
  2. \mathbb{E}[g(X) \cdot Y \mid X] = g(X) \cdot \mathbb{E}[Y \mid X] (taking out known factors)
  3. If X \perp Y: \mathbb{E}[Y \mid X] = \mathbb{E}[Y] (conditioning on an independent variable doesn't help)
  4. \mathbb{E}[\mathbb{E}[Y \mid X]] = \mathbb{E}[Y] (tower property - see Section 5.2)

Example. Let (X,Y) be jointly uniform on the triangle \{(x,y): 0 \leq y \leq x \leq 1\}. From Section 03 (Appendix V.1), the conditional density is f_{Y|X}(y|x) = 1/x on [0,x]. Therefore:

\mathbb{E}[Y \mid X = x] = \int_0^x y \cdot \frac{1}{x} \, dy = \frac{1}{x} \cdot \frac{x^2}{2} = \frac{x}{2}

The conditional mean of Y given X = x is x/2 - exactly half of x, which makes sense since Y is uniform on [0,x].

5.2 Tower Property (Iterated Expectation)

The tower property (also called the law of iterated expectations or Adam's law) is one of the most useful results in probability:

\mathbb{E}[\mathbb{E}[Y \mid X]] = \mathbb{E}[Y]

More generally, for any function g(X):

\mathbb{E}[g(X) \cdot Y] = \mathbb{E}[g(X) \cdot \mathbb{E}[Y \mid X]]

Proof (continuous case):

\mathbb{E}[\mathbb{E}[Y \mid X]] = \int \mathbb{E}[Y \mid X=x]\, f_X(x) \, dx
= \int \left(\int y \, f_{Y|X}(y|x) \, dy\right) f_X(x) \, dx
= \int\!\!\int y \, f_{Y|X}(y|x)\, f_X(x) \, dy \, dx
= \int\!\!\int y \, f_{X,Y}(x,y) \, dy \, dx = \mathbb{E}[Y] \quad \square

Intuition. If you first average Y within groups defined by X, then average those group means weighted by group size, you get the overall mean of Y.

Example: Expected test score. A class has 40% students from Group A (mean score 75) and 60% from Group B (mean score 85). Overall expected score: 0.4 \times 75 + 0.6 \times 85 = 81. This is exactly the tower property: \mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y|G]] where G is group membership.
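The same computation as a simulation, assuming noisy scores around each group mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
group_b = rng.random(n) < 0.6               # 60% of students in Group B
group_mean = np.where(group_b, 85.0, 75.0)  # E[Y | G]
scores = rng.normal(group_mean, 5.0)        # Y: noise around the group mean

print(scores.mean())  # ~81: E[E[Y|G]] = E[Y]
```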

For AI: The tower property underlies the EM algorithm. The E-step computes Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z|X,\theta^{(t)}}[\log p(X,Z \mid \theta)] - the expected complete-data log-likelihood, where the expectation is over the latent variable Z given the observed X. The tower property guarantees that maximising Q increases \log p(X \mid \theta), the marginal likelihood. In policy gradient methods, \mathbb{E}_\tau[R] = \mathbb{E}_{s_0}[\mathbb{E}_\tau[R \mid s_0]] - decomposing trajectory returns by initial state.

5.3 Conditional Variance and Law of Total Variance

The conditional variance of Y given X = x is:

\text{Var}(Y \mid X = x) = \mathbb{E}[(Y - \mathbb{E}[Y \mid X=x])^2 \mid X = x]

Law of Total Variance (Eve's Law):

\text{Var}(Y) = \mathbb{E}[\text{Var}(Y \mid X)] + \text{Var}(\mathbb{E}[Y \mid X])

This decomposes total variance into:

  • Within-group variance: \mathbb{E}[\text{Var}(Y \mid X)] - average variability within each value of X.
  • Between-group variance: \text{Var}(\mathbb{E}[Y \mid X]) - variability of the group means.

Proof:

\text{Var}(Y) = \mathbb{E}[Y^2] - (\mathbb{E}[Y])^2

Using the tower property twice and writing \mu(x) = \mathbb{E}[Y \mid X=x]:

\mathbb{E}[Y^2] = \mathbb{E}[\mathbb{E}[Y^2 \mid X]] = \mathbb{E}[\text{Var}(Y|X) + (\mathbb{E}[Y|X])^2] = \mathbb{E}[\text{Var}(Y|X)] + \mathbb{E}[\mu(X)^2]

And (\mathbb{E}[Y])^2 = (\mathbb{E}[\mu(X)])^2. Therefore:

\text{Var}(Y) = \mathbb{E}[\text{Var}(Y|X)] + \left(\mathbb{E}[\mu(X)^2] - (\mathbb{E}[\mu(X)])^2\right) = \mathbb{E}[\text{Var}(Y|X)] + \text{Var}(\mu(X)) \quad \square

For AI: The law of total variance is the mathematical foundation of the bias-variance decomposition (Section 8). For a model trained on dataset \mathcal{D}: total error = expected within-dataset error + between-dataset variance. It also explains why mixture models (GMMs) combine within-component variance and between-component variance.

5.4 AI Applications of Conditional Expectation

Optimal predictor. The function f^*(x) = \mathbb{E}[Y \mid X=x] minimises the expected squared error:

f^* = \arg\min_{f} \mathbb{E}[(Y - f(X))^2]

Proof: For any f,

\mathbb{E}[(Y-f(X))^2] = \mathbb{E}[(Y - f^*(X) + f^*(X) - f(X))^2]
= \mathbb{E}[(Y-f^*(X))^2] + 2\,\mathbb{E}[(Y-f^*(X))(f^*(X)-f(X))] + \mathbb{E}[(f^*(X)-f(X))^2]

The cross term vanishes because \mathbb{E}[Y - f^*(X) \mid X] = \mathbb{E}[Y|X] - f^*(X) = 0, so by the tower property \mathbb{E}[(Y-f^*(X))\,g(X)] = 0 for any g. The last term is non-negative. Therefore \mathbb{E}[(Y-f(X))^2] \geq \mathbb{E}[(Y-f^*(X))^2]. \square

This is profound: the best possible predictor (in the MSE sense) is the conditional expectation. Neural networks are universal approximators trained to approximate f^* = \mathbb{E}[Y|X].

Rao-Blackwell theorem. In estimation theory: if \hat{\theta} is any unbiased estimator of \theta and T is a sufficient statistic, then \tilde{\theta} = \mathbb{E}[\hat{\theta} \mid T] is also unbiased and has variance \leq \text{Var}(\hat{\theta}). Conditioning on sufficient statistics never increases variance. This is the variance-reduction analogue of why attention over relevant context helps.

Attention as conditional expectation. The output of self-attention at position i is:

o_i = \sum_j \alpha_{ij} v_j = \mathbb{E}_{j \sim \alpha_i}[v_j]

This is the conditional expectation of the value vectors under the attention distribution \alpha_i over key positions. High-entropy attention = averaging many values; low-entropy (sharp) attention = conditioning on one specific value.
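A sketch making the expectation explicit (scaled dot-product form; the shapes are purely illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Each output row o_i = E_{j ~ alpha_i}[v_j]: a conditional expectation
    of the value vectors under the attention distribution alpha_i."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # scaling motivated in Section 7.3
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)  # each row is a distribution over keys
    return alpha @ V                            # row i: sum_j alpha_ij * v_j

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 8)), rng.standard_normal((5, 8)), rng.standard_normal((5, 4))
print(attention(Q, K, V).shape)  # (3, 4): one expected value vector per query
```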


6. Moment Generating Functions and Characteristic Functions

6.1 MGF Definition and Basic Properties

The moment generating function (MGF) of a random variable X is:

M_X(t) = \mathbb{E}[e^{tX}]

defined for all t in an open interval containing 0. If no such interval exists (the integral diverges), the MGF is said not to exist.

Moments from the MGF. Differentiating k times and evaluating at t=0 gives the k-th raw moment:

\frac{d^k}{dt^k} M_X(t) \bigg|_{t=0} = \mathbb{E}[X^k] = \mu'_k

Proof: M_X(t) = \mathbb{E}[e^{tX}] = \mathbb{E}\left[\sum_{k=0}^\infty \frac{(tX)^k}{k!}\right] = \sum_{k=0}^\infty \frac{\mathbb{E}[X^k]}{k!} t^k (when the interchange of expectation and sum is valid).

Differentiating k times: M_X^{(k)}(0) = \mathbb{E}[X^k]. \square

Uniqueness theorem. If M_X(t) = M_Y(t) for t in an open neighbourhood of 0, then X and Y have the same distribution. The MGF uniquely determines the distribution (when it exists).

Existence. The MGF exists when \mathbb{E}[e^{tX}] < \infty for |t| < t_0, for some t_0 > 0. Heavy-tailed distributions (Cauchy, Pareto with small \alpha, log-normal) do NOT have finite MGFs. The characteristic function (Section 6.4) always exists and is the preferred tool for such distributions.

6.2 MGF of Common Distributions

Bernoulli(p):

M(t) = (1-p) + p e^t = 1 + p(e^t - 1)

Check: M'(t) = pe^t, M'(0) = p = \mathbb{E}[X]. [ok]

Binomial(n,p): (sum of n iid Bernoulli)

M(t) = (1-p+pe^t)^n

First moment: M'(0) = np. [ok]

Poisson(\lambda):

M(t) = e^{\lambda(e^t - 1)}

Derivation: M(t) = \sum_{k=0}^\infty e^{tk} \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda}\sum_k \frac{(\lambda e^t)^k}{k!} = e^{-\lambda} e^{\lambda e^t} = e^{\lambda(e^t-1)}. M'(0) = \lambda, M''(0) = \lambda^2 + \lambda, so \text{Var}(X) = \lambda^2 + \lambda - \lambda^2 = \lambda. [ok]

Normal(\mu, \sigma^2):

M(t) = e^{\mu t + \sigma^2 t^2/2}

Derivation: Complete the square in the exponent of \int e^{tx} \cdot \frac{1}{\sqrt{2\pi}\sigma} e^{-(x-\mu)^2/(2\sigma^2)} \, dx. M'(0) = \mu, M''(0) = \sigma^2 + \mu^2, \text{Var}(X) = \sigma^2. [ok]

Exponential(\lambda):

M(t) = \frac{\lambda}{\lambda - t} \quad \text{for } t < \lambda

Derivation: \int_0^\infty e^{tx} \lambda e^{-\lambda x} \, dx = \lambda \int_0^\infty e^{-(\lambda-t)x} \, dx = \frac{\lambda}{\lambda-t} (converges iff t < \lambda). M'(0) = 1/\lambda, M''(0) = 2/\lambda^2, \text{Var}(X) = 1/\lambda^2. [ok]
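These derivative checks can be automated with symbolic differentiation; a sketch for the exponential MGF using sympy:

```python
import sympy as sp

t, lam = sp.symbols('t lambda', positive=True)
M = lam / (lam - t)  # MGF of Exp(lambda), valid for t < lambda

moments = [sp.simplify(sp.diff(M, t, k).subs(t, 0)) for k in (1, 2, 3)]
print(moments)  # [1/lambda, 2/lambda**2, 6/lambda**3], i.e. E[X^k] = k!/lambda^k
```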

For AI: MGFs appear in the proof of the Chernoff bound (Section 05), which gives exponentially sharp tail bounds. The exponential form \mathbb{E}[e^{tX}] is closely related to the log-partition function in the exponential family: A(\eta) = \log Z(\eta). The moments of the sufficient statistic can be computed by differentiating A(\eta) - the MGF and the log-partition function are connected by a change of variables.

6.3 MGF of Sums and the Reproductive Property

Key theorem. If X \perp Y:

M_{X+Y}(t) = M_X(t) \cdot M_Y(t)

Proof: M_{X+Y}(t) = \mathbb{E}[e^{t(X+Y)}] = \mathbb{E}[e^{tX} e^{tY}] = \mathbb{E}[e^{tX}]\,\mathbb{E}[e^{tY}] = M_X(t)\,M_Y(t), using independence for the third equality. \square

This is the MGF version of the convolution theorem. It provides elegant proofs of reproductive properties:

Gaussian reproductive property. If X \sim \mathcal{N}(\mu_1,\sigma_1^2) and Y \sim \mathcal{N}(\mu_2,\sigma_2^2) independently:

M_{X+Y}(t) = e^{\mu_1 t + \sigma_1^2 t^2/2} \cdot e^{\mu_2 t + \sigma_2^2 t^2/2} = e^{(\mu_1+\mu_2)t + (\sigma_1^2+\sigma_2^2)t^2/2}

This is the MGF of \mathcal{N}(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2). So X+Y \sim \mathcal{N}(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2).

Poisson reproductive property. If X \sim \text{Poisson}(\lambda_1), Y \sim \text{Poisson}(\lambda_2) independently:

M_{X+Y}(t) = e^{\lambda_1(e^t-1)} \cdot e^{\lambda_2(e^t-1)} = e^{(\lambda_1+\lambda_2)(e^t-1)}

So X+Y \sim \text{Poisson}(\lambda_1+\lambda_2).

6.4 Characteristic Function

The characteristic function of X is:

\varphi_X(t) = \mathbb{E}[e^{itX}] = \mathbb{E}[\cos(tX)] + i\,\mathbb{E}[\sin(tX)]

where i = \sqrt{-1}.

Key advantage: \varphi_X(t) exists for ALL distributions (since |e^{itX}| = 1 always), unlike the MGF, which may not exist for heavy-tailed distributions.

Connection to MGF: \varphi_X(t) = M_X(it) when the MGF exists. But the characteristic function is defined even when the MGF is not.

Moments from \varphi_X. When moments exist:

\mathbb{E}[X^k] = \frac{1}{i^k} \varphi_X^{(k)}(0)

Inversion formula. The distribution is uniquely determined by its characteristic function:

f_X(x) = \frac{1}{2\pi} \int_{-\infty}^\infty e^{-itx} \varphi_X(t) \, dt

(when f_X is continuous). This is the Fourier inversion formula - the characteristic function IS the Fourier transform of the PDF.

For AI: Characteristic functions appear in the analysis of convergence of distributions (Central Limit Theorem proofs use characteristic functions because they always exist). The score-based diffusion model framework uses the score function \nabla_x \log p(x), which is related to the characteristic function via Fourier analysis. Random Fourier Features (Rahimi & Recht, 2007) approximate kernel functions by sampling frequencies from the Fourier transform of the kernel.

6.5 Cumulants and the Cumulant Generating Function

The cumulant generating function (CGF) is the log of the MGF:

K_X(t) = \log M_X(t) = \log \mathbb{E}[e^{tX}]

Cumulants \kappa_k are the coefficients in the Taylor expansion of K_X(t):

K_X(t) = \sum_{k=1}^\infty \frac{\kappa_k}{k!} t^k

so \kappa_k = K_X^{(k)}(0).

First four cumulants:

  • \kappa_1 = \mathbb{E}[X] = \mu (mean)
  • \kappa_2 = \text{Var}(X) = \sigma^2 (variance)
  • \kappa_3 = \mathbb{E}[(X-\mu)^3] = \mu_3 (third central moment = \gamma_1 \sigma^3)
  • \kappa_4 = \mu_4 - 3\sigma^4 (excess kurtosis \times \sigma^4)

Key property. For independent X \perp Y:

K_{X+Y}(t) = K_X(t) + K_Y(t) \implies \kappa_k(X+Y) = \kappa_k(X) + \kappa_k(Y)

Cumulants of a sum of independent variables add - this is the additivity property that makes cumulants more natural than moments for sums.

Normal distribution: K_X(t) = \mu t + \sigma^2 t^2/2. All cumulants \kappa_k = 0 for k \geq 3. This characterises the normal distribution - it is the unique distribution with all cumulants beyond the second equal to zero.


7. Jensen's Inequality and Moment Inequalities

7.1 Jensen's Inequality

Definition. A function f : \mathbb{R} \to \mathbb{R} is convex if for all x, y and \lambda \in [0,1]:

f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)

Geometrically: the chord between any two points lies above the curve. Equivalently (for twice-differentiable f): f''(x) \geq 0 everywhere.

Theorem (Jensen's Inequality). If f is convex and \mathbb{E}[|X|] < \infty:

f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]

Proof (via supporting hyperplane). Since f is convex, for any a there exists a supporting hyperplane at a: a linear function \ell(x) = f(a) + c(x-a) such that f(x) \geq \ell(x) for all x (where c is the subgradient at a). Setting a = \mathbb{E}[X]:

f(x) \geq f(\mathbb{E}[X]) + c(x - \mathbb{E}[X])

Taking expectations of both sides:

\mathbb{E}[f(X)] \geq f(\mathbb{E}[X]) + c\,\mathbb{E}[X - \mathbb{E}[X]] = f(\mathbb{E}[X]) + c \cdot 0 = f(\mathbb{E}[X]) \quad \square

Equality condition: f(\mathbb{E}[X]) = \mathbb{E}[f(X)] iff f is linear on the support of X, or X is almost surely constant.

For concave f: f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)] (Jensen's inequality reverses).

Common applications:

  Function f   Convex/Concave   Jensen gives
  e^x          Convex           e^{\mathbb{E}[X]} \leq \mathbb{E}[e^X]
  -\log x      Convex           -\log\mathbb{E}[X] \leq \mathbb{E}[-\log X], i.e., \log\mathbb{E}[X] \geq \mathbb{E}[\log X]
  x^2          Convex           (\mathbb{E}[X])^2 \leq \mathbb{E}[X^2] (variance \geq 0)
  \log x       Concave          \log\mathbb{E}[X] \geq \mathbb{E}[\log X]
  \sqrt{x}     Concave          \sqrt{\mathbb{E}[X]} \geq \mathbb{E}[\sqrt{X}]
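A numeric spot check of the concave-log rows, assuming X \sim \text{Exp}(1) so \mathbb{E}[X] = 1:

```python
import numpy as np

x = np.random.default_rng(0).exponential(size=1_000_000)

print(np.log(x.mean()))  # log E[X]  ~  0.0
print(np.log(x).mean())  # E[log X]  ~ -0.577 (minus the Euler-Mascheroni constant)
# Jensen for concave log: log E[X] >= E[log X], with a strict gap here
```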

7.2 Applications: KL Divergence and ELBO

KL divergence non-negativity. The Kullback-Leibler divergence is:

\text{KL}(p \| q) = \mathbb{E}_p\left[\log \frac{p(X)}{q(X)}\right] = \mathbb{E}_p[\log p(X) - \log q(X)]

Theorem (Gibbs' inequality). \text{KL}(p \| q) \geq 0, with equality iff p = q almost everywhere.

Proof via Jensen:

-\text{KL}(p \| q) = \mathbb{E}_p\left[\log \frac{q(X)}{p(X)}\right] \leq \log\mathbb{E}_p\left[\frac{q(X)}{p(X)}\right] = \log\int p(x)\, \frac{q(x)}{p(x)}\, dx = \log 1 = 0

where we applied Jensen's inequality with the concave function \log. Therefore \text{KL}(p\|q) \geq 0. \square

ELBO derivation via Jensen. For a latent variable model p_\theta(x) = \int p_\theta(x,z) \, dz:

\log p_\theta(x) = \log \int p_\theta(x,z) \, dz = \log \int q_\phi(z|x)\, \frac{p_\theta(x,z)}{q_\phi(z|x)} \, dz
= \log \mathbb{E}_{q_\phi}\left[\frac{p_\theta(x,z)}{q_\phi(z|x)}\right] \geq \mathbb{E}_{q_\phi}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]

where the inequality applies Jensen's inequality to the concave \log. This lower bound is the ELBO:

\mathcal{L}(\phi,\theta;x) = \mathbb{E}_{q_\phi}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \| p(z))

The gap between \log p_\theta(x) and the ELBO is exactly \text{KL}(q_\phi(z|x) \| p_\theta(z|x)) \geq 0. Maximising the ELBO tightens this bound.

Log-sum inequality. For non-negative a_i, b_i:

\sum_i a_i \log \frac{a_i}{b_i} \geq \left(\sum_i a_i\right) \log \frac{\sum_i a_i}{\sum_i b_i}

This is the discrete version of KL non-negativity via Jensen.
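A direct numeric check of Gibbs' inequality on two small discrete distributions:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

print(np.sum(p * np.log(p / q)))  # KL(p || q) > 0
print(np.sum(p * np.log(p / p)))  # KL(p || p) = 0: equality iff p = q
```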

7.3 Cauchy-Schwarz Inequality

Theorem (Cauchy-Schwarz for expectations). For any random variables X and Y:

(\mathbb{E}[XY])^2 \leq \mathbb{E}[X^2] \cdot \mathbb{E}[Y^2]

with equality iff Y = cX almost surely for some constant c.

Proof. For any t \in \mathbb{R}:

0 \leq \mathbb{E}[(tX - Y)^2] = t^2\,\mathbb{E}[X^2] - 2t\,\mathbb{E}[XY] + \mathbb{E}[Y^2]

This is a quadratic in t that is always non-negative. A quadratic at^2 + bt + c \geq 0 for all t iff its discriminant satisfies b^2 - 4ac \leq 0. Here a = \mathbb{E}[X^2], b = -2\,\mathbb{E}[XY], c = \mathbb{E}[Y^2]:

4(\mathbb{E}[XY])^2 - 4\,\mathbb{E}[X^2]\,\mathbb{E}[Y^2] \leq 0 \implies (\mathbb{E}[XY])^2 \leq \mathbb{E}[X^2]\,\mathbb{E}[Y^2] \quad \square

Corollary (|\rho| \leq 1). Apply the Cauchy-Schwarz inequality to the centred variables X - \mu_X and Y - \mu_Y:

(\text{Cov}(X,Y))^2 = (\mathbb{E}[(X-\mu_X)(Y-\mu_Y)])^2 \leq \mathbb{E}[(X-\mu_X)^2]\,\mathbb{E}[(Y-\mu_Y)^2] = \sigma_X^2\sigma_Y^2

Dividing: \rho^2 \leq 1.

For AI: Cauchy-Schwarz appears in attention - the dot product q^\top k is bounded by \|q\|\,\|k\|, motivating the \sqrt{d_k} scaling in scaled dot-product attention (\text{softmax}(QK^\top/\sqrt{d_k})V). Without this scaling, the dot products can become large in magnitude for high-dimensional keys and queries, causing the softmax to saturate and gradients to vanish.

7.4 Lyapunov Inequality

Theorem (Lyapunov's inequality). For 0 < r \leq s:

(\mathbb{E}[|X|^r])^{1/r} \leq (\mathbb{E}[|X|^s])^{1/s}

In words: the L^r norm of X (as a random variable) is non-decreasing in r.

Proof. Apply Jensen's inequality with the convex function f(t) = t^{s/r} (convex since s/r \geq 1) to the random variable Y = |X|^r:

(\mathbb{E}[Y])^{s/r} = f(\mathbb{E}[Y]) \leq \mathbb{E}[f(Y)] = \mathbb{E}[|X|^s]

Taking (1/s)-th powers of both sides: (\mathbb{E}[|X|^r])^{1/r} \leq (\mathbb{E}[|X|^s])^{1/s}. \square

Consequence. If the s-th moment is finite, all lower moments are finite. If \mathbb{E}[X^4] < \infty, then \mathbb{E}[X^3], \mathbb{E}[X^2], \mathbb{E}[|X|] < \infty as well.

For AI: Lyapunov's inequality underlies the Lyapunov CLT (a variant of the central limit theorem for non-identically distributed variables). In the theory of stochastic gradient descent, the existence of higher moments of the gradient controls the tightness of convergence bounds.

7.5 Preview: Tail Bounds from Moments

The most direct application of moments to probability is bounding how far a random variable can stray from its mean. These bounds are covered fully in Section 05, but we preview the key results here.

Preview: Markov's Inequality

For any non-negative random variable X and t > 0:

P(X \geq t) \leq \frac{\mathbb{E}[X]}{t}

Proof: \mathbb{E}[X] = \int_0^\infty x f(x)\,dx \geq \int_t^\infty x f(x)\,dx \geq t\int_t^\infty f(x)\,dx = t\,P(X \geq t). The mean bounds the probability of large values.

-> Full treatment and applications: Section 05, Concentration Inequalities

Preview: Chebyshev's Inequality

For any random variable X with mean \mu and variance \sigma^2:

P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}

Proof: Apply Markov's inequality to the non-negative variable (X-\mu)^2 with threshold k^2\sigma^2: P((X-\mu)^2 \geq k^2\sigma^2) \leq \mathbb{E}[(X-\mu)^2]/(k^2\sigma^2) = 1/k^2. The variance bounds how often X deviates from its mean by k standard deviations.

-> Full treatment and sharper bounds: Section 05, Concentration Inequalities
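A simulation comparing Chebyshev's bound with the actual tail mass of a Gaussian; the bound holds for any distribution with finite variance, so it is loose here:

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(1_000_000)  # mu = 0, sigma = 1

for k in (1.5, 2.0, 3.0):
    empirical = np.mean(np.abs(x) >= k)
    print(k, empirical, 1 / k**2)  # empirical tail mass <= 1/k^2 every time
```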


8. Bias-Variance Decomposition

8.1 Statistical Risk and Optimal Predictor

Consider predicting Y from X using a function f : \mathbb{R}^d \to \mathbb{R}. The statistical risk (expected squared loss) is:

\mathcal{R}(f) = \mathbb{E}[(Y - f(X))^2]

Theorem. The minimiser of \mathcal{R}(f) over all measurable functions is the conditional expectation:

f^*(X) = \mathbb{E}[Y \mid X]

and the minimum risk equals the irreducible noise:

\mathcal{R}(f^*) = \mathbb{E}[\text{Var}(Y \mid X)]

Proof: From Section 5.4, for any f:

\mathcal{R}(f) = \mathbb{E}[(Y - f^*(X))^2] + \mathbb{E}[(f^*(X) - f(X))^2] \geq \mathcal{R}(f^*)

And \mathcal{R}(f^*) = \mathbb{E}[(Y - \mathbb{E}[Y|X])^2] = \mathbb{E}[\text{Var}(Y|X)] by definition. \square

The Bayes risk (irreducible error) \varepsilon^* = \mathbb{E}[\text{Var}(Y|X)] cannot be reduced by any model, no matter how complex. It is the noise inherent in the prediction problem.

For regression: With Y = f^*(X) + \varepsilon, where \varepsilon is independent noise with variance \sigma^2, the Bayes risk is \sigma^2.

For AI: The irreducible error is why perfect test accuracy is impossible on noisy datasets. In language modelling, even a perfect model (one that exactly learns the true distribution) has non-zero cross-entropy loss equal to the entropy of the true distribution H(p^*). GPT-4's cross-entropy on human text is bounded below by the entropy of human language.

8.2 Bias-Variance-Noise Decomposition

In practice, we learn a model \hat{f} from a finite dataset \mathcal{D} of N samples. The model \hat{f} is a random function - it depends on the random training data. We can decompose the expected prediction error at a new test point x_0:

\mathbb{E}_{\mathcal{D}}[(Y - \hat{f}(x_0))^2] = \underbrace{(f^*(x_0) - \mathbb{E}_{\mathcal{D}}[\hat{f}(x_0)])^2}_{\text{Bias}^2} + \underbrace{\text{Var}_{\mathcal{D}}(\hat{f}(x_0))}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}

Proof. Let \bar{f} = \mathbb{E}_{\mathcal{D}}[\hat{f}(x_0)] (the expected prediction) and y_0 = f^*(x_0) + \varepsilon (a noisy observation):

\mathbb{E}_{\mathcal{D},\varepsilon}[(y_0 - \hat{f}(x_0))^2] = \mathbb{E}[(y_0 - \bar{f} + \bar{f} - \hat{f})^2]
= \mathbb{E}[(y_0 - \bar{f})^2] + 2\,\mathbb{E}[(y_0-\bar{f})(\bar{f}-\hat{f})] + \mathbb{E}[(\bar{f}-\hat{f})^2]

The cross term is 2\,\mathbb{E}[(f^*+\varepsilon-\bar{f})(\bar{f}-\hat{f})] = 2(f^*-\bar{f})\,\mathbb{E}[\bar{f}-\hat{f}] + 2\,\mathbb{E}[\varepsilon]\,\mathbb{E}[\bar{f}-\hat{f}] = 0 (since \mathbb{E}[\hat{f}] = \bar{f} and \mathbb{E}[\varepsilon] = 0). Then:

\mathbb{E}[(y_0-\bar{f})^2] = \mathbb{E}[(f^*-\bar{f}+\varepsilon)^2] = (f^*-\bar{f})^2 + \sigma^2 = \text{Bias}^2 + \sigma^2

And \mathbb{E}[(\bar{f}-\hat{f})^2] = \text{Var}(\hat{f}(x_0)). Combining: \text{Bias}^2 + \text{Variance} + \text{Noise}. \square

BIAS-VARIANCE TRADEOFF
========================================================================

  Error                        Total Error
   ^                           /
   |              Variance    /
   |         _______________/
   |        /
   |       /       Bias^2
   |      /  _______________
   |_____/__/________________
   |         Noise
   +-------------------------------- Model Complexity ->

  Classical view: sweet spot minimises total error

========================================================================

Interpretations:

  • High bias: The model is too simple to capture $f^*$ (underfitting). Consistent but wrong. Example: a linear model for a quadratic $f^*$.
  • High variance: The model is too complex - it fits training noise (overfitting). Right on average but variable. Example: high-degree polynomial on limited data.
  • Noise: Irreducible. The best we can do.
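
The decomposition can be checked empirically by refitting a model on many resampled datasets and comparing the directly measured error against bias$^2$ + variance + noise. A minimal NumPy sketch (the sine target, polynomial model family, and all constants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = np.sin                      # true regression function (assumption)
sigma = 0.3                          # noise std
x0 = 0.5                             # test point
n_datasets, n_samples = 2000, 20

for degree in [1, 3, 9]:             # model complexity
    preds = np.empty(n_datasets)
    for d in range(n_datasets):
        x = rng.uniform(0, np.pi, n_samples)
        y = f_star(x) + sigma * rng.normal(size=n_samples)
        coef = np.polyfit(x, y, degree)       # least-squares polynomial fit
        preds[d] = np.polyval(coef, x0)
    bias2 = (f_star(x0) - preds.mean()) ** 2
    var = preds.var()
    y0 = f_star(x0) + sigma * rng.normal(size=n_datasets)  # fresh test noise
    direct = np.mean((y0 - preds) ** 2)       # direct MSE estimate
    print(f"deg={degree}  bias^2={bias2:.4f}  var={var:.4f}  "
          f"bias^2+var+noise={bias2 + var + sigma**2:.4f}  direct={direct:.4f}")
```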

8.3 Double Descent and Modern Deep Learning

The classical bias-variance picture predicts a unique optimal complexity. Modern deep learning violates this: neural networks trained to zero training loss (interpolating the training data, which classical theory predicts should overfit catastrophically) often generalise well. This is the double descent phenomenon.

DOUBLE DESCENT
========================================================================

  Test Error
   ^
   |        Classical regime        Modern regime
   |              |                      |
   |    +----+    |                      |
   |   ++    ++   |                      |
   |  ++      ++  |                 +----+
   | ++        +--+-----------------+
   |               | Interpolation threshold
   +---------------------------------------- Model Size ->
           ^                   ^
    Underfitting         Overparameterised
    regime               regime (good!)

========================================================================

Why does this happen? In overparameterised models:

  • Many solutions interpolate the training data.
  • Gradient descent (especially with weight decay or implicit regularisation from SGD noise) selects the minimum-norm solution among all interpolating solutions.
  • The minimum-norm interpolant has low variance among interpolants - it is the "flattest" fit.

The bias-variance decomposition still holds in double descent, but the variance first rises (at the interpolation threshold, the model just barely interpolates - highly sensitive to training data) and then falls again (overparameterised models have many interpolating solutions, most with low variance).

For AI (2026 context): Large language models like GPT-4, Claude, and Gemini are massively overparameterised. They interpolate (or nearly interpolate) their training data and yet generalise remarkably well - this is the double descent regime. Understanding why overparameterisation helps generalisation remains one of the central open questions in deep learning theory.

8.4 Regularisation as Bias Injection

L2 regularisation (Ridge regression). The regularised estimator minimises:

$$\hat{\theta}_\lambda = \arg\min_\theta \left[\sum_{i=1}^N (y_i - f_\theta(x_i))^2 + \lambda \|\theta\|^2\right]$$

For linear regression with design matrix $X$ (overloading notation): $\hat{\theta}_\lambda = (X^\top X + \lambda I)^{-1}X^\top y$.

Bias-variance analysis of Ridge:

  • Bias: $\hat{\theta}_\lambda$ is a biased estimator of $\theta^*$. Specifically, $\mathbb{E}[\hat{\theta}_\lambda] = (I + \lambda(X^\top X)^{-1})^{-1}\theta^* \neq \theta^*$. Ridge shrinks estimates toward 0, introducing bias.
  • Variance: $\text{Var}(\hat{\theta}_\lambda) \leq \text{Var}(\hat{\theta}_{\text{OLS}})$ for any $\lambda > 0$. The regularised estimator has smaller variance - it is more stable with respect to perturbations in the training data.
  • Trade-off: As $\lambda$ increases, bias increases and variance decreases. The optimal $\lambda$ minimises total risk.

For AI: Weight decay in neural network training (the most common form of regularisation) adds $\lambda \|\theta\|^2$ to the loss, equivalent to L2 regularisation. AdamW implements weight decay correctly by decoupling it from the adaptive learning rate. Understanding the bias-variance trade-off tells us why weight decay helps: it reduces model variance (overfitting) at the cost of some bias.
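
A quick simulation makes the trade-off concrete: as $\lambda$ grows, the ridge estimator's squared bias rises while its variance falls. A minimal NumPy sketch (the design, $\theta^*$, and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = np.array([2.0, -1.0])
N, n_trials, sigma = 30, 5000, 1.0

for lam in [0.0, 1.0, 10.0, 100.0]:
    estimates = np.empty((n_trials, 2))
    for t in range(n_trials):
        X = rng.normal(size=(N, 2))
        y = X @ theta_star + sigma * rng.normal(size=N)
        # ridge estimator: (X^T X + lam I)^{-1} X^T y
        estimates[t] = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
    bias2 = np.sum((estimates.mean(axis=0) - theta_star) ** 2)
    var = np.sum(estimates.var(axis=0))
    print(f"lam={lam:6.1f}  bias^2={bias2:.4f}  var={var:.4f}  MSE={bias2 + var:.4f}")
```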


9. Expectation in ML: Core Applications

9.1 Cross-Entropy Loss as an Expectation

For a $K$-class classification problem, the cross-entropy loss for a single example $(x, y)$ with true label $y \in \{1,\ldots,K\}$ and model probabilities $\hat{p} = (\hat{p}_1,\ldots,\hat{p}_K)$ is:

$$\ell(y, \hat{p}) = -\log \hat{p}_y = -\sum_{k=1}^K \mathbf{1}[y=k] \log \hat{p}_k$$

The expected cross-entropy over the data distribution $p(x,y)$:

$$\mathcal{L}(\theta) = \mathbb{E}_{(X,Y) \sim p}[-\log \hat{p}_{Y}] = \mathbb{E}_{(X,Y) \sim p}\left[-\sum_{k=1}^K \mathbf{1}[Y=k] \log \hat{p}_k(X;\theta)\right]$$

This equals the cross-entropy $H(p, q_\theta) = -\sum_k p_k \log q_k$ between the true label distribution and the model's predicted distribution, averaged over inputs $X$.

Connection to KL divergence. For each input $x$:

$$H(p(\cdot|x), q_\theta(\cdot|x)) = H(p(\cdot|x)) + \text{KL}(p(\cdot|x) \| q_\theta(\cdot|x))$$

where $H(p) = -\mathbb{E}_p[\log p]$ is the entropy of the true conditional. Since $H(p(\cdot|x))$ doesn't depend on $\theta$, minimising cross-entropy is equivalent to minimising KL divergence - i.e., making the model distribution as close as possible to the true distribution, in the KL sense.

For language models. The cross-entropy loss on a token sequence $(w_1,\ldots,w_T)$ is:

$$\mathcal{L} = -\frac{1}{T}\sum_{t=1}^T \log P_\theta(w_t \mid w_1,\ldots,w_{t-1})$$

This is the expected negative log-probability per token - an expectation under the empirical distribution of the training corpus. Minimising this drives the model distribution toward the empirical distribution of human text. The perplexity is $\exp(\mathcal{L})$: a model with perplexity 50 is "equally surprised" by the text as a uniform distribution over 50 tokens would be.
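
A minimal sketch of the loss-to-perplexity computation (assuming NumPy; the random logits stand in for a real model's outputs):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, V = 8, 50                      # sequence length, vocabulary size
logits = rng.normal(size=(T, V))  # stand-in for model outputs
targets = rng.integers(0, V, size=T)

probs = softmax(logits)
nll = -np.log(probs[np.arange(T), targets])  # -log P(w_t | context)
loss = nll.mean()                            # cross-entropy per token
print(f"cross-entropy = {loss:.3f}, perplexity = {np.exp(loss):.1f}")
# with near-uniform logits over V=50 tokens, perplexity is close to 50
```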

9.2 ELBO as an Expected Value

The Variational Autoencoder (VAE) introduces an approximate posterior $q_\phi(z|x)$ (the encoder) to approximate the intractable true posterior $p_\theta(z|x)$. The Evidence Lower BOund (ELBO) is:

$$\mathcal{L}(\phi,\theta;x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \| p(z))$$

Reconstruction term: $\mathbb{E}_{q_\phi}[\log p_\theta(x|z)]$ - the expected log-likelihood of the observed data given a latent code, averaged over the approximate posterior. This rewards the decoder for reconstructing $x$ well from samples of $z \sim q_\phi(z|x)$.

KL regularisation term: $\text{KL}(q_\phi \| p(z))$ - penalises the approximate posterior for straying from the prior $p(z) = \mathcal{N}(0,I)$. For a Gaussian encoder $q_\phi = \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma_\phi^2(x)))$:

$$\text{KL}(q_\phi \| \mathcal{N}(0,I)) = \frac{1}{2}\sum_{j=1}^d \left(\mu_{\phi,j}^2 + \sigma_{\phi,j}^2 - \log\sigma_{\phi,j}^2 - 1\right)$$

Gradient computation. The reconstruction term $\mathbb{E}_{q_\phi}[\log p_\theta(x|z)]$ cannot be differentiated through the sampling operation $z \sim q_\phi$ directly. The reparameterisation trick (Section03) writes $z = \mu_\phi + \sigma_\phi \odot \varepsilon$ with $\varepsilon \sim \mathcal{N}(0,I)$, enabling:

$$\nabla_\phi \mathbb{E}_{q_\phi}[\log p_\theta(x|z)] = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,I)}[\nabla_\phi \log p_\theta(x|\mu_\phi + \sigma_\phi \odot \varepsilon)]$$

This gradient is computable by backpropagation through the deterministic function $z = \mu_\phi + \sigma_\phi \odot \varepsilon$.
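
Both ELBO terms can be sketched directly from these formulas: the KL term in closed form, the reconstruction term by reparameterised Monte Carlo. A minimal NumPy sketch (the decoder log-likelihood is a hypothetical stand-in, not a real VAE decoder):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
mu, log_var = rng.normal(size=d), rng.normal(size=d) * 0.1

# closed-form KL(q_phi || N(0, I)) for a diagonal Gaussian encoder
kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1)

# reparameterised samples: z = mu + sigma * eps, eps ~ N(0, I)
n_samples = 100_000
eps = rng.normal(size=(n_samples, d))
z = mu + np.exp(0.5 * log_var) * eps

# stand-in decoder log-likelihood (a fixed Gaussian around z.sum())
log_p = -0.5 * (z.sum(axis=1) - 1.0) ** 2
elbo = log_p.mean() - kl    # Monte Carlo ELBO estimate
print(f"KL = {kl:.3f}, MC reconstruction = {log_p.mean():.3f}, ELBO = {elbo:.3f}")
```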

9.3 Policy Gradient (REINFORCE)

In reinforcement learning, an agent with policy $\pi_\theta(a|s)$ (parameterised by $\theta$) seeks to maximise the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T r_t\right] = \mathbb{E}_\tau[R(\tau)]$$

where $\tau = (s_0, a_0, r_0, s_1, a_1, \ldots)$ is a trajectory. The REINFORCE gradient is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau) \cdot \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$$

Derivation via log-derivative trick (score function):

$$\nabla_\theta \mathbb{E}_\tau[R] = \nabla_\theta \int R(\tau) p_\theta(\tau) \, d\tau = \int R(\tau) \nabla_\theta p_\theta(\tau) \, d\tau = \int R(\tau) p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} \, d\tau = \mathbb{E}_\tau\left[R(\tau) \nabla_\theta \log p_\theta(\tau)\right]$$

Since $\log p_\theta(\tau) = \sum_t \log \pi_\theta(a_t|s_t) + \text{const}$ (the dynamics don't depend on $\theta$), we get the REINFORCE formula.

The score function $\nabla_\theta \log \pi_\theta(a|s)$ appears because we differentiated the log of the probability. The key identity underlying this:

$$\mathbb{E}_\theta[\nabla_\theta \log p_\theta(X)] = \mathbf{0}$$

(differentiate the normalisation condition $\int p_\theta = 1$ to get $\int \nabla_\theta p_\theta = \mathbf{0}$, then note $\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$).

For RLHF. Reinforcement learning from human feedback uses policy gradients to fine-tune language models. The policy is the LLM: $\pi_\theta(a|s)$ where $s$ is the context and $a$ is the generated token. The reward $R$ comes from a reward model trained on human preferences. REINFORCE and its variants (PPO, GRPO) compute gradients of $\mathbb{E}[R]$ with respect to the LLM parameters.
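
A minimal sketch of the score-function estimator on a two-armed bandit with a softmax policy (assuming NumPy; the bandit, rewards, and hyperparameters are illustrative, and a real RLHF setup adds baselines and trust-region machinery):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                   # logits of a 2-armed softmax policy
true_reward = np.array([1.0, 2.0])    # arm 1 pays more on average
alpha, batch = 0.1, 256

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(200):
    pi = softmax(theta)
    a = rng.choice(2, size=batch, p=pi)           # sample actions
    r = true_reward[a] + rng.normal(size=batch)   # noisy rewards
    # score function for softmax: grad_theta log pi(a) = onehot(a) - pi
    onehot = np.eye(2)[a]
    grad = (r[:, None] * (onehot - pi)).mean(axis=0)  # E[R * grad log pi]
    theta += alpha * grad                             # gradient ASCENT on E[R]

print("final policy:", softmax(theta))  # mass concentrates on arm 1
```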

9.4 Adam Optimizer as Moment Tracking

The Adam optimizer (Kingma & Ba, 2014) maintains exponential moving averages of the first and second moments of the gradient:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \approx \mathbb{E}[g_t]$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \approx \mathbb{E}[g_t^2]$$

where $g_t = \nabla_\theta \mathcal{L}(\theta_t)$ is the gradient at step $t$.

Bias correction. Since $m_0 = v_0 = 0$, the early estimates are biased toward zero. The debiased estimates are:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Why does this work? $\mathbb{E}[m_t] = (1-\beta_1^t) \cdot \mathbb{E}[g_t]$ (when the gradient is stationary), so dividing by $(1-\beta_1^t)$ recovers an unbiased estimate of $\mathbb{E}[g_t]$.

Parameter update:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

The update is the first moment (the direction of the average gradient) scaled by $1/\sqrt{\text{second moment}}$ - normalising by the root mean square of the gradient. This gives an adaptive learning rate: parameters with a large second moment get smaller updates (conservative), parameters with a small second moment get larger updates (aggressive).

Connection to Fisher Information. The diagonal of the Fisher information matrix $\mathcal{I}(\theta)$ is $\text{Var}(g_t) \approx \mathbb{E}[g_t^2]$ (when gradients are zero-mean). Adam's $\hat{v}_t$ approximates the diagonal Fisher, making Adam an approximate natural gradient method.

AdamW. Standard Adam conflates weight decay and L2 regularisation (they are equivalent for SGD but not for adaptive methods). AdamW decouples them:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t - \alpha\lambda\theta_t$$

The last term is direct weight decay - not affected by the adaptive scaling. This is the current default for training large language models.
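
A minimal sketch of the update rules above (assuming NumPy; the quadratic test function and hyperparameters are illustrative):

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9,
               beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update; returns new (theta, m, v). t is 1-indexed."""
    m = beta1 * m + (1 - beta1) * grad         # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad**2      # second-moment EMA
    m_hat = m / (1 - beta1**t)                 # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # adaptive step
    theta = theta - alpha * weight_decay * theta            # decoupled decay
    return theta, m, v

# minimise f(theta) = (theta - 3)^2 with gradient 2(theta - 3)
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 3001):
    grad = 2 * (theta - 3.0)
    theta, m, v = adamw_step(theta, grad, m, v, t, alpha=1e-2)
print(theta)  # close to 3 (pulled slightly toward 0 by the weight decay)
```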


10. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---------|----------------|-----|
| 1 | $\mathbb{E}[g(X)] = g(\mathbb{E}[X])$ (always) | Only true when $g$ is linear. For convex $g$: $g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)]$ (Jensen). For $g(x)=x^2$: $(\mathbb{E}[X])^2 \leq \mathbb{E}[X^2]$. | Apply LOTUS directly to compute $\mathbb{E}[g(X)]$. |
| 2 | Assuming $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$ without checking independence | Holds iff $\text{Cov}(X,Y) = 0$; independence is sufficient, but dependence generally breaks it. | Check $\text{Cov}(X,Y) = 0$ (uncorrelated) before using this; for the full product rule $\mathbb{E}[g(X)h(Y)] = \mathbb{E}[g(X)]\mathbb{E}[h(Y)]$ you need $X \perp Y$. |
| 3 | Confusing variance and standard deviation units | Variance has squared units; std dev has the same units as $X$. Mixing them gives nonsense. | Always track units: $\text{Var}(X) = \sigma^2$ (squared), $\sigma_X = \sqrt{\text{Var}(X)}$ (same as $X$). |
| 4 | $\text{Var}(aX + b) = a^2\text{Var}(X) + b^2$ (wrong) | The constant $b$ shifts location, not spread. Variance is invariant to translation. | $\text{Var}(aX+b) = a^2\text{Var}(X)$ - the $b$ vanishes. |
| 5 | $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y)$ without independence | This only holds when $\text{Cov}(X,Y) = 0$. For positively correlated $X,Y$ the variance is larger. | $\text{Var}(X+Y) = \text{Var}(X) + 2\text{Cov}(X,Y) + \text{Var}(Y)$. |
| 6 | Zero covariance implies independence | False in general. The unit-disk example: $Y=\sqrt{1-X^2}$ has $\text{Cov}(X,Y)=0$ but $Y$ is determined by $X$. | Use mutual information $I(X;Y)$ to test full independence. Covariance only measures linear dependence. |
| 7 | Confusing $\mathbb{E}[Y \mid X=x]$ (a number) with $\mathbb{E}[Y \mid X]$ (a random variable) | $\mathbb{E}[Y \mid X=x]$ is fixed once $x$ is fixed; $\mathbb{E}[Y \mid X]$ is random because $X$ is random. | Be explicit: $h(x) = \mathbb{E}[Y \mid X=x]$; then $\mathbb{E}[Y \mid X] = h(X)$ is the random variable. |
| 8 | Applying Jensen in the wrong direction for concave $f$ | Jensen: convex $f \Rightarrow f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$. For $\log$ (concave): $\log\mathbb{E}[X] \geq \mathbb{E}[\log X]$. | Check the second derivative: $f'' > 0$ = convex (Jensen $\leq$); $f'' < 0$ = concave (Jensen $\geq$). |
| 9 | Assuming the MGF exists for all distributions | Cauchy, Pareto (small $\alpha$), log-normal: MGFs are infinite. Using MGF-based results for these fails. | Use the characteristic function (always exists). Check integrability: $\mathbb{E}[e^{tX}] < \infty$ for $t$ in a neighbourhood of 0. |
| 10 | Confusing raw moments and central moments | $\mu'_k = \mathbb{E}[X^k]$ and $\mu_k = \mathbb{E}[(X-\mu)^k]$ differ unless $\mu=0$. Kurtosis is $\mu_4/\sigma^4$, not $\mu'_4/\sigma^4$. | Explicitly write which you mean. For the Gaussian: $\mu'_2 = \sigma^2 + \mu^2$ but $\mu_2 = \sigma^2$. |

11. Exercises

Exercise 1 * - LOTUS: Expected Value of a Transformation

Let $X \sim \text{Exponential}(\lambda=2)$ with PDF $f(x) = 2e^{-2x}$ for $x \geq 0$.

(a) Compute $\mathbb{E}[X]$ directly from the definition.

(b) Using LOTUS, compute $\mathbb{E}[X^2]$.

(c) Compute $\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$.

(d) Let $Y = e^X$. Compute $\mathbb{E}[Y]$ using LOTUS. (Hint: $\int_0^\infty e^{x} \cdot 2e^{-2x}\,dx = \int_0^\infty 2e^{-x}\,dx$.) What is the domain of $t$ for which $\mathbb{E}[e^{tX}]$ is finite?


Exercise 2 * - Linearity Without Independence

Let $X \sim \text{Uniform}(-1, 1)$ and $Y = X^2$.

(a) Compute $\mathbb{E}[X]$, $\mathbb{E}[Y]$, and $\mathbb{E}[X+Y]$ and verify linearity.

(b) Compute $\text{Cov}(X, Y)$. What does this tell you about the linear relationship?

(c) Are $X$ and $Y$ independent? Justify rigorously.

(d) Compute $\text{Var}(X+Y)$ using the formula $\text{Var}(X) + 2\text{Cov}(X,Y) + \text{Var}(Y)$, and verify numerically.


Exercise 3 * - Moments of the Gaussian

For $X \sim \mathcal{N}(0, 1)$ (standard normal):

(a) Show $\mathbb{E}[X^{2k+1}] = 0$ for all $k \geq 0$ (odd moments vanish).

(b) Using the MGF $M(t) = e^{t^2/2}$, find $\mathbb{E}[X^2]$, $\mathbb{E}[X^4]$, and $\mathbb{E}[X^6]$ by differentiating.

(c) Verify the general formula $\mathbb{E}[X^{2k}] = (2k-1)!! = (2k-1)(2k-3)\cdots 3 \cdot 1$.

(d) For $X \sim \mathcal{N}(\mu, \sigma^2)$, compute the skewness $\gamma_1$ and excess kurtosis $\gamma_2$.


Exercise 4 ** - Tower Property in Action

A fair coin is flipped. If heads (prob 1/2), $X \sim \text{Poisson}(3)$. If tails (prob 1/2), $X \sim \text{Poisson}(7)$.

(a) Using the tower property, compute $\mathbb{E}[X]$ (let $C \in \{H,T\}$ be the coin flip).

(b) Using the law of total variance, compute $\text{Var}(X)$.

(c) Write down the marginal PMF of $X$ (it is a mixture of two Poissons). Verify $\mathbb{E}[X]$ directly.

(d) AI connection: Mixture-of-experts (MoE) models route tokens to different experts. Each expert has its own output distribution. The total output distribution is a mixture; the tower property computes its moments.


Exercise 5 ** - MGF and Moment Extraction

Let $X \sim \text{Gamma}(\alpha, \beta)$ with PDF $f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$ for $x > 0$.

(a) Derive the MGF: show $M_X(t) = \left(\frac{\beta}{\beta-t}\right)^\alpha$ for $t < \beta$.

(b) Use $M_X(t)$ to compute $\mathbb{E}[X]$ and $\text{Var}(X)$ by differentiating.

(c) Show that the sum of $n$ independent $\text{Gamma}(\alpha_i, \beta)$ variables (same rate $\beta$) follows $\text{Gamma}(\sum_i \alpha_i, \beta)$ using the MGF multiplicative property.

(d) The chi-squared distribution is $\chi^2_k = \text{Gamma}(k/2, 1/2)$. What are $\mathbb{E}[\chi^2_k]$ and $\text{Var}(\chi^2_k)$?


Exercise 6 ** - Jensen and KL Divergence

(a) Prove that for any probability distribution $p$ over $\{1,\ldots,K\}$:

$$H(p) = -\sum_k p_k \log p_k \leq \log K$$

with equality iff $p$ is uniform. (Hint: Apply Jensen to $f(x) = -\log x$ or use the KL divergence to uniform.)

(b) Show that $\text{KL}(p \| q) \geq 0$ using Jensen's inequality applied to $-\log$.

(c) Compute $\text{KL}(\mathcal{N}(\mu_1, \sigma_1^2) \| \mathcal{N}(\mu_2, \sigma_2^2))$ analytically. Verify it equals zero when $\mu_1=\mu_2$, $\sigma_1=\sigma_2$.

(d) For the ELBO: show that $\log p_\theta(x) \geq \mathcal{L}(\phi,\theta;x)$ with gap $= \text{KL}(q_\phi(z|x) \| p_\theta(z|x))$.


Exercise 7 *** - Bias-Variance Decomposition

Consider estimating $\theta^* = 2$ from $N=10$ iid samples $X_1,\ldots,X_N \sim \mathcal{N}(\theta^*, 1)$.

(a) Show that the sample mean $\hat{\theta}_1 = \bar{X}_N$ is unbiased with variance $\sigma^2/N = 1/10$.

(b) Consider a regularised estimator $\hat{\theta}_\lambda = \frac{N}{N+\lambda}\bar{X}_N$. Compute its bias and variance as functions of $\lambda > 0$.

(c) The MSE of $\hat{\theta}_\lambda$ is Bias$^2$ + Variance. Find the $\lambda$ that minimises the MSE. (Hint: it depends on $\theta^*$.)

(d) Simulate this: draw 10000 datasets of $N=10$ samples, compute $\hat{\theta}_\lambda$ for $\lambda \in \{0, 0.1, 0.5, 1, 2, 5\}$, and plot bias$^2$, variance, and MSE vs $\lambda$.


Exercise 8 *** - Adam as Moment Estimation

(a) Show that the Adam first-moment estimate satisfies $\mathbb{E}[m_t] = (1-\beta_1^t)\mathbb{E}[g]$ when gradients are iid with mean $\mathbb{E}[g]$ (stationary gradient assumption). Therefore $\hat{m}_t = m_t/(1-\beta_1^t)$ is unbiased.

(b) Implement a minimal Adam optimizer in NumPy (no PyTorch) and apply it to minimise $f(\theta) = \theta^4 - 4\theta^2 + \theta$. Track the first and second moment estimates. Plot $m_t$, $v_t$, and $\theta_t$ over 200 steps.

(c) Show that for the quadratic loss $\mathcal{L}(\theta) = \frac{1}{2}(\theta-\theta^*)^2$, Adam converges to $\theta^*$, and away from convergence $v_t \to (\theta-\theta^*)^2$ (the squared gradient), so $\hat{m}_t/\sqrt{\hat{v}_t} \to \text{sign}(\theta-\theta^*)$: Adam reduces to a fixed-size step of $\alpha$ in the sign direction of the gradient.

(d) Connect to Fisher information: the Fisher information matrix for $p_\theta(x) = \mathcal{N}(\theta, I)$ has diagonal equal to 1 (the variance of the score). Natural gradient descent uses $\mathcal{I}^{-1}\nabla\mathcal{L}$. How does Adam approximate this for a diagonal Fisher?


12. Why This Matters for AI (2026 Perspective)

| Concept | AI/LLM Application | 2026 Context |
|---------|--------------------|--------------|
| $\mathbb{E}[\ell(\theta)]$ - expected loss | Every training objective: cross-entropy, MSE, RLHF reward | All SOTA LLMs (GPT-4, Claude, Gemini) optimise expected log-likelihood; RLHF maximises expected reward from a human preference model |
| Linearity of expectation | Gradient of a sum = sum of gradients; minibatch averaging | Distributed training computes gradients on data shards and averages them - justified by linearity |
| Tower property | EM algorithm E-step; policy gradient decomposition | Mixture-of-experts routing (GPT-4, Mixtral) uses the tower property to compute the expected output over the router distribution |
| $\text{Var}(X)$ - variance | Uncertainty quantification, BatchNorm/LayerNorm, gradient clipping | LayerNorm is standard in every transformer (BERT, GPT, LLaMA); normalises by mean and variance per token |
| Bias-variance decomposition | Model selection, regularisation, double descent | Understanding why GPT-4-scale models (massively overparameterised) still generalise - the double descent regime |
| Jensen's inequality | KL divergence $\geq 0$; ELBO derivation; softmax bounds | VAE encoders (DALL-E 2, Stable Diffusion) optimise the ELBO; Jensen is the mathematical foundation |
| MGF / cumulants | Proof of the CLT, Chernoff bounds, tail analysis | Hoeffding/Chernoff bounds used to prove PAC-learning generalisation bounds for language models |
| Adam / moment tracking | Default optimiser for all large-scale training | AdamW trains GPT-4, LLaMA, Claude, Gemini; Adam's moment tracking is a direct application of $\mathbb{E}[g]$ and $\mathbb{E}[g^2]$ |
| Score function $\nabla\log p$ | REINFORCE, RLHF policy gradient, diffusion model score | Score-based diffusion (Stable Diffusion, DALL-E 3) is defined by $\nabla_x \log p_t(x)$; trained by denoising score matching |
| Cauchy-Schwarz | Attention scaling $1/\sqrt{d_k}$, Fisher information bounds | Scaled dot-product attention in every transformer uses $1/\sqrt{d_k}$ scaling derived from the variance of dot products |
| Conditional expectation | Attention output = conditional mean; optimal predictor = $\mathbb{E}[Y \mid X]$ | Transformers approximate $\mathbb{E}[Y \mid X]$ via attention; mechanistic interpretability studies which conditional distributions each head computes |
| Reparameterisation | VAE encoder, diffusion posterior sampling | Stable Diffusion's latent diffusion uses Gaussian reparameterisation for differentiable sampling through the encoder |

13. Conceptual Bridge

Where We Came From

This section builds directly on the foundations laid in Section01-Section03. The probability space and random variable formalism of Section01 provides the objects - PDFs, PMFs, CDFs - over which we compute expectations. The named distributions of Section02 supply the examples whose moments we derive in Section6.2: the mean $1/\lambda$ and variance $1/\lambda^2$ of the exponential, the mean $np$ and variance $np(1-p)$ of the binomial, the mean $\mu$ and variance $\sigma^2$ that define the Gaussian. The joint distribution machinery of Section03 provides the tools for conditional expectation (Section5) and the covariance matrix (Section4): the tower property is essentially iterated marginalisation, and the conditional variance formula is the law of total variance derived via joint distributions.

Where We Are Going

Section05 (Concentration Inequalities) takes the moment-bound preview from Section7.5 much further. Markov's inequality (from the mean) and Chebyshev's inequality (from the variance) are the first two results, but Section05 develops exponentially sharper bounds: Hoeffding's inequality (bounded variables), Chernoff bounds (via MGFs), and McDiarmid's inequality (functions of independent variables). These bounds are the mathematical machinery of PAC-learning generalisation theory - they quantify how many training examples are needed to guarantee that the empirical risk is close to the true risk.

Section06 (Stochastic Processes) extends expectation to sequences of random variables indexed by time. The Law of Large Numbers proves rigorously that $\bar{X}_N \to \mathbb{E}[X]$ - the sample mean converges to the expectation. The Central Limit Theorem proves that the standardised sample mean converges in distribution to $\mathcal{N}(0,1)$ - the Gaussian emerges as the universal limit of sums, with proofs via the MGF or characteristic function techniques introduced here.

POSITION IN CHAPTER 6 CURRICULUM
========================================================================

  Section01 Probability Spaces     Section02 Distributions
  +-------------------+            +------------------+
  | Kolmogorov axioms |            | Named PDFs/PMFs  |
  | Events, σ-algebra |            | Parameters       |
  | CDF, PDF, PMF     |            | Relationships    |
  +---------+---------+            +--------+---------+
            |                               |
            +--------------+----------------+
                           v
              Section03 Joint Distributions
              +-----------------------+
              | Marginals             |
              | Conditionals f(y|x)   |
              | MVN, Bayes, Chain     |
              +-----------+-----------+
                          |
                          v
          +===================================+
          |  Section04 Expectation & Moments  |  <- YOU ARE HERE
          |  E[X], Var, Cov, MGF              |
          |  Jensen, C-S, LOTUS               |
          |  Bias-Variance, Adam              |
          +=================+=================+
                            |
                 +----------+----------+
                 v                     v
     Section05 Concentration    Section06 Stochastic
     Inequalities               Processes
     +-------------+            +----------------+
     | Markov      |            | LLN: Xbar->mu  |
     | Chebyshev   |            | CLT: ->N(0,1)  |
     | Hoeffding   |            | Gaussian       |
     | PAC bounds  |            | processes      |
     +------+------+            +-------+--------+
            |                           |
            +------------+--------------+
                         v
              Section07 Markov Chains
              +-----------------------+
              | Transition matrices   |
              | Steady state          |
              | MCMC (Bayes sampling) |
              +-----------------------+

========================================================================

The conceptual arc through Section04 is: we began with probability as a framework for describing uncertainty (Section01), learned the vocabulary of named distributions (Section02), understood how to reason about multiple random variables jointly (Section03), and now in Section04 we have the tools to summarise distributions with numbers - expectations, variances, covariances, moments - and to bound the gap between reality and our estimates. Every subsequent section will use the expectation operator as a core tool: Section05 to prove tail bounds, Section06 to formalise convergence, Section07 to analyse Markov chain stationary distributions via matrix expectations.

<- Back to Chapter 6: Probability Theory | Next: Concentration Inequalities ->


Appendix A - Worked Examples: Expectation and Moments

A.1 Expected Value of the Geometric Distribution

The geometric distribution counts the number of trials until the first success in a sequence of iid Bernoulli($p$) trials.

PMF: $P(X=k) = (1-p)^{k-1}p$ for $k = 1, 2, 3, \ldots$

Expected value via the definition:

$$\mathbb{E}[X] = \sum_{k=1}^\infty k(1-p)^{k-1}p = p \sum_{k=1}^\infty k q^{k-1}$$

where $q = 1-p$. Using the identity $\sum_{k=1}^\infty k q^{k-1} = \frac{d}{dq}\sum_{k=0}^\infty q^k = \frac{d}{dq}\frac{1}{1-q} = \frac{1}{(1-q)^2} = \frac{1}{p^2}$:

$$\mathbb{E}[X] = p \cdot \frac{1}{p^2} = \frac{1}{p}$$

Variance via the second moment: $\mathbb{E}[X^2] = \sum_{k=1}^\infty k^2(1-p)^{k-1}p$. Using $\sum k^2 q^{k-1} = \frac{d}{dq}\left[\sum k q^k\right] = \frac{d}{dq}\frac{q}{(1-q)^2} = \frac{1+q}{(1-q)^3}$:

$$\mathbb{E}[X^2] = p \cdot \frac{1+(1-p)}{p^3} = \frac{2-p}{p^2}$$
$$\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \frac{2-p}{p^2} - \frac{1}{p^2} = \frac{1-p}{p^2}$$

AI connection: The geometric distribution models the number of tokens until a specific token (e.g., end-of-sequence) appears, assuming iid generation. In practice tokens are not iid, but the geometric provides a baseline model. With a constant per-token stopping probability $p$, the expected sequence length is $1/p$.
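
A quick Monte Carlo check of both moments (assuming NumPy, whose geometric sampler counts trials up to and including the first success, matching the PMF above):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.2
x = rng.geometric(p, size=1_000_000)  # support {1, 2, 3, ...}
print(f"mean     {x.mean():.3f}  vs 1/p       = {1/p:.3f}")
print(f"variance {x.var():.3f}  vs (1-p)/p^2 = {(1-p)/p**2:.3f}")
```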

A.2 Method of Moments Estimation

The method of moments estimates distribution parameters by setting sample moments equal to theoretical moments and solving.

Example: Estimating $(\alpha, \beta)$ for $\text{Gamma}(\alpha, \beta)$.

Theoretical moments: $\mathbb{E}[X] = \alpha/\beta$ and $\text{Var}(X) = \alpha/\beta^2$.

Given $N$ samples $x_1, \ldots, x_N$, compute the sample mean $\bar{x} = \frac{1}{N}\sum x_i$ and sample variance $s^2 = \frac{1}{N-1}\sum(x_i-\bar{x})^2$.

Set $\bar{x} = \hat{\alpha}/\hat{\beta}$ and $s^2 = \hat{\alpha}/\hat{\beta}^2$. Solving:

$$\hat{\beta} = \frac{\bar{x}}{s^2}, \qquad \hat{\alpha} = \frac{\bar{x}^2}{s^2}$$

For AI: Many Bayesian models require estimating hyperparameters of prior distributions. Method of moments provides fast, closed-form initial estimates that can be refined by maximum likelihood or MCMC. It is also used in moment matching for knowledge distillation: a student model's distribution moments are matched to the teacher's.
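
A minimal sketch of the Gamma example (assuming NumPy, which parameterises the Gamma by shape and scale $= 1/\beta$):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_true, beta_true = 3.0, 2.0
x = rng.gamma(shape=alpha_true, scale=1/beta_true, size=100_000)

xbar, s2 = x.mean(), x.var(ddof=1)   # sample mean and sample variance
beta_hat = xbar / s2                 # method-of-moments estimates
alpha_hat = xbar**2 / s2
print(f"alpha_hat = {alpha_hat:.3f} (true {alpha_true})")
print(f"beta_hat  = {beta_hat:.3f} (true {beta_true})")
```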

A.3 St. Petersburg Paradox: When Expectation Misleads

The St. Petersburg game: flip a coin repeatedly until the first head. If the head appears on flip $k$, win $2^k$ dollars.

Expected winnings:

$$\mathbb{E}[\text{Winnings}] = \sum_{k=1}^\infty 2^k \cdot \frac{1}{2^k} = \sum_{k=1}^\infty 1 = +\infty$$

The expected value is infinite! Yet no rational person would pay more than a few dollars to play this game.

Resolution: Rational agents maximise expected utility, not expected monetary value. For the logarithmic utility function $u(w) = \log(w)$, the expected utility is finite:

$$\mathbb{E}[\log(\text{Winnings})] = \sum_{k=1}^\infty \log(2^k) \cdot \frac{1}{2^k} = \sum_{k=1}^\infty \frac{k \log 2}{2^k} = 2\log 2 < \infty$$

This is Bernoulli's resolution (1738): diminishing marginal utility. The paradox reveals that expected monetary value alone is not a sufficient guide to rational choice - a finite expected utility (via diminishing marginal utility or bounded payoffs) is needed.

For AI: Reinforcement learning reward design must account for this. An agent with unbounded reward function may take extremely risky actions that have infinite expected reward but almost surely fail. This motivates bounded reward functions and regularisation in RLHF: keeping reward signals within a bounded range prevents policy collapse toward infinite-expectation strategies.
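
A short simulation illustrates both halves of the paradox: the running sample mean of the winnings never settles (there is no LLN without a finite mean), while the sample mean of the log-utility converges to $2\log 2$. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
k = rng.geometric(0.5, size=n)   # flip count of the first head
winnings = 2.0 ** k

# the running sample mean keeps jumping as rare huge payouts arrive
for m in [10**2, 10**4, 10**6]:
    print(f"n={m:>7}  mean winnings = {winnings[:m].mean():10.1f}")

# expected log-utility is finite: E[log W] = 2 log 2 ~ 1.386
print(f"mean log-utility = {np.log(winnings).mean():.3f}  (theory {2*np.log(2):.3f})")
```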


Appendix B - Proofs of Key Identities

B.1 Cauchy-Schwarz via Inner Product

The expectation $\mathbb{E}[XY]$ can be viewed as an inner product $\langle X, Y \rangle = \mathbb{E}[XY]$ in the $L^2$ space of square-integrable random variables. The Cauchy-Schwarz inequality for inner products states $|\langle X, Y \rangle|^2 \leq \langle X, X \rangle \langle Y, Y \rangle$, which gives $(\mathbb{E}[XY])^2 \leq \mathbb{E}[X^2]\mathbb{E}[Y^2]$ directly.

This inner product interpretation gives $L^2$ convergence its name: $X_n \to X$ in $L^2$ means $\mathbb{E}[(X_n - X)^2] \to 0$ - convergence in the norm induced by the $L^2$ inner product.

B.2 Variance as Second Cumulant

The CGF of $X$ is $K_X(t) = \log M_X(t) = \log \mathbb{E}[e^{tX}]$. Computing its second derivative at $t=0$:

$$K_X''(t) = \frac{M_X(t)M_X''(t) - (M_X'(t))^2}{(M_X(t))^2}$$

At $t=0$: $M_X(0)=1$, $M_X'(0)=\mathbb{E}[X]$, $M_X''(0)=\mathbb{E}[X^2]$, so:

$$K_X''(0) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \text{Var}(X)$$

The variance is precisely the second cumulant $\kappa_2$.

B.3 Fisher Information and the Score

The Fisher information matrix is defined as:

$$\mathcal{I}(\theta) = \mathbb{E}_\theta\left[(\nabla_\theta \log p_\theta(X))(\nabla_\theta \log p_\theta(X))^\top\right] = \text{Cov}_\theta(\nabla_\theta \log p_\theta(X))$$

The second equality uses the fact that $\mathbb{E}_\theta[\nabla_\theta \log p_\theta(X)] = \mathbf{0}$ (proved by differentiating $\int p_\theta = 1$):

$$\int p_\theta(x)\, dx = 1 \implies \nabla_\theta \int p_\theta(x)\, dx = 0 \implies \int \nabla_\theta p_\theta(x)\, dx = 0 \implies \mathbb{E}_\theta\left[\frac{\nabla_\theta p_\theta(X)}{p_\theta(X)}\right] = \mathbb{E}_\theta[\nabla_\theta \log p_\theta(X)] = \mathbf{0}$$

Therefore the Fisher information is the variance (covariance matrix) of the score function $\nabla_\theta \log p_\theta(X)$.

For AI: Fisher information appears in:

  • Cramer-Rao bound: $\text{Var}(\hat{\theta}) \geq 1/\mathcal{I}(\theta)$ - the variance of any unbiased estimator is at least the reciprocal Fisher information.
  • Natural gradient: The natural gradient descent update $\mathcal{I}(\theta)^{-1}\nabla\mathcal{L}$ moves in the direction of steepest descent in the space of distributions (KL-divergence geometry), rather than in parameter space. Adam approximates this with a diagonal Fisher.
  • Elastic Weight Consolidation (EWC): Used in continual learning to prevent catastrophic forgetting. The Fisher information diagonal identifies which parameters are important for previous tasks.

B.4 Stein's Lemma

Lemma (Stein, 1972). If $X \sim \mathcal{N}(\mu, \sigma^2)$ and $g$ is differentiable with $\mathbb{E}[|g'(X)|] < \infty$:

$$\mathbb{E}[g(X)(X-\mu)] = \sigma^2 \mathbb{E}[g'(X)]$$

Proof: Integration by parts:

$$\mathbb{E}[g(X)(X-\mu)] = \int g(x)(x-\mu) \frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu)^2/(2\sigma^2)}\, dx$$

Note $(x-\mu)e^{-(x-\mu)^2/(2\sigma^2)} = -\sigma^2 \frac{d}{dx}e^{-(x-\mu)^2/(2\sigma^2)}$. Integrating by parts:

$$= \sigma^2 \int g'(x) \frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu)^2/(2\sigma^2)}\, dx = \sigma^2 \mathbb{E}[g'(X)] \quad \square$$

Special cases:

  • $g(x) = x$: $\mathbb{E}[X(X-\mu)] = \sigma^2 \implies \mathbb{E}[X^2] = \mu^2 + \sigma^2$ ✓
  • $g(x) = x^2$: $\mathbb{E}[X^2(X-\mu)] = 2\sigma^2 \mathbb{E}[X] = 2\sigma^2\mu \implies \mathbb{E}[X^3] = \mu^3 + 3\mu\sigma^2$ ✓

For AI: Stein's lemma is the foundation of Stein's identity used in score matching and denoising diffusion models. The score function $\nabla_x \log p(x)$ satisfies $\mathbb{E}_p[g(X)\nabla_x \log p(X) + \nabla_x g(X)] = \mathbf{0}$ (Stein's operator), enabling training by minimising a quadratic loss without computing the intractable normalisation constant of $p$.


Appendix C - Moment Computations for Common Distributions

This table gives raw moments, central moments, MGF, skewness, and excess kurtosis for the distributions used throughout the course.

| Distribution | $\mathbb{E}[X]$ | $\text{Var}(X)$ | Skewness $\gamma_1$ | Ex. Kurtosis $\gamma_2$ | $M_X(t)$ (domain) |
|---|---|---|---|---|---|
| Bernoulli($p$) | $p$ | $p(1-p)$ | $\frac{1-2p}{\sqrt{p(1-p)}}$ | $\frac{1-6p(1-p)}{p(1-p)}$ | $1-p+pe^t$ |
| Binomial($n,p$) | $np$ | $np(1-p)$ | $\frac{1-2p}{\sqrt{np(1-p)}}$ | $\frac{1-6p(1-p)}{np(1-p)}$ | $(1-p+pe^t)^n$ |
| Poisson($\lambda$) | $\lambda$ | $\lambda$ | $1/\sqrt{\lambda}$ | $1/\lambda$ | $e^{\lambda(e^t-1)}$ |
| Geometric($p$) | $1/p$ | $(1-p)/p^2$ | $\frac{2-p}{\sqrt{1-p}}$ | $6 + p^2/(1-p)$ | $\frac{pe^t}{1-(1-p)e^t}$, $t<-\log(1-p)$ |
| Uniform($a,b$) | $(a+b)/2$ | $(b-a)^2/12$ | $0$ | $-6/5$ | $\frac{e^{tb}-e^{ta}}{t(b-a)}$ |
| Normal($\mu,\sigma^2$) | $\mu$ | $\sigma^2$ | $0$ | $0$ | $e^{\mu t+\sigma^2t^2/2}$ |
| Exponential($\lambda$) | $1/\lambda$ | $1/\lambda^2$ | $2$ | $6$ | $\frac{\lambda}{\lambda-t}$, $t<\lambda$ |
| Gamma($\alpha,\beta$) | $\alpha/\beta$ | $\alpha/\beta^2$ | $2/\sqrt{\alpha}$ | $6/\alpha$ | $(\frac{\beta}{\beta-t})^\alpha$, $t<\beta$ |
| Beta($\alpha,\beta$) | $\frac{\alpha}{\alpha+\beta}$ | $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ | $\frac{2(\beta-\alpha)\sqrt{\alpha+\beta+1}}{(\alpha+\beta+2)\sqrt{\alpha\beta}}$ | complex | No closed form |
| Student-$t$($\nu$) | $0$ ($\nu>1$) | $\frac{\nu}{\nu-2}$ ($\nu>2$) | $0$ ($\nu>3$) | $\frac{6}{\nu-4}$ ($\nu>4$) | Does not exist |

Notes:

  • Student-$t$ has no MGF (heavy tails cause $\mathbb{E}[e^{tX}] = \infty$ for all $t \neq 0$).
  • Beta distribution skewness is zero iff $\alpha = \beta$ (symmetric); negative when $\alpha > \beta$, positive when $\alpha < \beta$.
  • The Poisson has equal mean and variance - a property used to test whether count data follow a Poisson (overdispersion, $\text{Var} > \mathbb{E}$, means extra variability; e.g., a negative binomial fits better).

Appendix D - The Exponential Family and Moments

Many common distributions belong to the exponential family, which has an elegant connection between natural parameters and moments.

D.1 Exponential Family Form

A distribution belongs to the exponential family if its density can be written as:

$$p(x;\eta) = h(x)\exp(\eta^\top T(x) - A(\eta))$$

where:

  • $\eta$ = natural parameter vector
  • $T(x)$ = sufficient statistic vector
  • $A(\eta)$ = log-partition function (log-normaliser)
  • $h(x)$ = base measure

D.2 Moments from the Log-Partition Function

Theorem. For an exponential family distribution:

$$\mathbb{E}[T(X)] = \nabla_\eta A(\eta), \qquad \text{Cov}(T(X)) = \nabla^2_\eta A(\eta)$$

Proof sketch. Differentiating $\int p(x;\eta)\,dx = 1$ with respect to $\eta$:

$$0 = \int h(x)e^{\eta^\top T(x) - A(\eta)}(T(x) - \nabla A(\eta))\,dx = \mathbb{E}[T(X)] - \nabla A(\eta)$$

Therefore $\mathbb{E}[T(X)] = \nabla A(\eta)$. Differentiating again gives the covariance formula. $\square$

Examples:

| Distribution | $\eta$ | $T(x)$ | $A(\eta)$ | $\mathbb{E}[T(X)] = \nabla A$ |
|---|---|---|---|---|
| Bernoulli($p$) | $\log\frac{p}{1-p}$ | $x$ | $\log(1+e^\eta)$ | $\frac{e^\eta}{1+e^\eta} = p$ |
| Gaussian($\mu,\sigma^2$) | $(\mu/\sigma^2, -1/(2\sigma^2))$ | $(x, x^2)$ | $-\eta_1^2/(4\eta_2) - \frac{1}{2}\log(-2\eta_2)$ | $(\mu, \sigma^2+\mu^2)$ |
| Poisson($\lambda$) | $\log\lambda$ | $x$ | $e^\eta = \lambda$ | $e^\eta = \lambda$ |

For AI: The log-partition function $A(\eta)$ is the free energy of the exponential family. Its gradient gives the expected sufficient statistics (moments); its Hessian gives the Fisher information matrix. In variational inference, the ELBO is optimised over the natural parameters $\eta$ of the approximate posterior; the optimal $\eta$ satisfies $\nabla A(\eta) = \mathbb{E}_{p}[T(X)]$ (moment matching). This is the connection between maximum entropy, sufficient statistics, and the moments derived in Section6.
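
The theorem is easy to check numerically for the Bernoulli row of the table: differentiating $A(\eta) = \log(1+e^\eta)$ should recover $p$ and $p(1-p)$. A minimal finite-difference sketch (assuming NumPy):

```python
import numpy as np

p = 0.3
eta = np.log(p / (1 - p))         # natural parameter for Bernoulli(p)

A = lambda e: np.log1p(np.exp(e))  # log-partition function
h = 1e-4
dA = (A(eta + h) - A(eta - h)) / (2 * h)             # gradient of A
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2  # Hessian of A

print(f"grad A = {dA:.5f}   vs E[T(X)] = p         = {p}")
print(f"hess A = {d2A:.5f}   vs Var(T(X)) = p(1-p) = {p*(1-p):.5f}")
```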


Appendix E - Convergence in Probability and L^2 Convergence

The sample mean $\bar{X}_N = \frac{1}{N}\sum_{i=1}^N X_i$ estimates $\mathbb{E}[X]$. In what sense does this estimate converge?

E.1 L^2 Convergence (Mean Square Convergence)

$\bar{X}_N$ converges to $\mathbb{E}[X]$ in the mean square ($L^2$) sense:

$$\mathbb{E}[(\bar{X}_N - \mathbb{E}[X])^2] = \text{Var}(\bar{X}_N) = \frac{\text{Var}(X)}{N} \to 0$$

This is an immediate consequence of the variance formula for means (Section4.4). The sample mean's MSE decreases at rate $1/N$.

E.2 Weak Law of Large Numbers (Preview)

The Weak LLN (proved in Section06 using characteristic functions or Chebyshev's inequality) states that for iid $X_i$ with finite mean $\mu$:

$$\bar{X}_N \xrightarrow{P} \mu \quad \text{as } N \to \infty$$

i.e., $P(|\bar{X}_N - \mu| > \varepsilon) \to 0$ for any $\varepsilon > 0$.

Proof via Chebyshev (when $\text{Var}(X) < \infty$): $P(|\bar{X}_N - \mu| > \varepsilon) \leq \frac{\text{Var}(\bar{X}_N)}{\varepsilon^2} = \frac{\text{Var}(X)}{N\varepsilon^2} \to 0$.

This is a direct application of Chebyshev's inequality (from Section7.5 preview) to the sample mean. The LLN justifies: if we run a neural network many times on iid batches and average the loss estimates, the average converges to the true expected loss.
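
A minimal sketch of the $1/N$ decay (assuming NumPy; the Gaussian with $\text{Var}(X)=4$ is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
var_x = 4.0                        # Var(X) for X ~ N(0, 4)
for N in [10, 100, 1000]:
    # 5000 independent sample means, each computed from N draws
    means = rng.normal(0.0, np.sqrt(var_x), size=(5000, N)).mean(axis=1)
    mse = np.mean(means**2)        # E[(Xbar - E[X])^2]; the true mean is 0
    print(f"N={N:5d}  empirical MSE = {mse:.4f}  vs Var(X)/N = {var_x/N:.4f}")
```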

-> Full treatment of LLN and CLT: Section06 Stochastic Processes


Appendix F - Conditional Expectation as Projection

The conditional expectation $\mathbb{E}[Y|X]$ can be understood geometrically as an orthogonal projection in the Hilbert space $L^2(\Omega, P)$ of square-integrable random variables.

Inner product: $\langle X, Y \rangle = \mathbb{E}[XY]$

Projection: $\mathbb{E}[Y|\mathcal{G}]$ (conditional expectation on a sub-$\sigma$-algebra $\mathcal{G}$) is the projection of $Y$ onto the closed subspace of $\mathcal{G}$-measurable random variables.

Geometric interpretation: Among all $\mathcal{G}$-measurable (i.e., functions of $X$) approximations to $Y$, the conditional expectation $\mathbb{E}[Y|X]$ minimises the $L^2$ distance $\mathbb{E}[(Y-g(X))^2]$. This is the projection theorem: the best approximation is the projection, and the residual $Y - \mathbb{E}[Y|X]$ is orthogonal to every $\mathcal{G}$-measurable random variable:

$$\mathbb{E}[(Y - \mathbb{E}[Y|X]) \cdot g(X)] = 0 \quad \text{for all measurable } g$$

Consequence: The tower property $\mathbb{E}[\mathbb{E}[Y|X]] = \mathbb{E}[Y]$ is the projection version of the law of total expectation: projecting $Y$ first onto the functions of $X$ and then onto the constants (taking the expectation) gives the same result as projecting directly onto the constants.

For AI: This projection view clarifies why neural networks trained with MSE loss approximate $\mathbb{E}[Y|X]$: the network is learning the projection of $Y$ onto the subspace of functions representable by the architecture. Deeper networks can represent larger subspaces, hence better approximations to the conditional expectation.


Appendix G - Worked Problems: Moments and Inequalities

G.1 Computing the Moments of the Beta Distribution

The Beta($\alpha, \beta$) distribution has PDF $f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}$ on $[0,1]$, where $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$.

Raw moment: Using the Beta function definition:

$$\mathbb{E}[X^k] = \int_0^1 x^k \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\,dx = \frac{B(\alpha+k, \beta)}{B(\alpha,\beta)} = \frac{\Gamma(\alpha+k)\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\alpha+\beta+k)} = \frac{(\alpha+k-1)(\alpha+k-2)\cdots\alpha}{(\alpha+\beta+k-1)(\alpha+\beta+k-2)\cdots(\alpha+\beta)} = \prod_{j=0}^{k-1}\frac{\alpha+j}{\alpha+\beta+j}$$

First moment: $\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta}$.

Second moment: $\mathbb{E}[X^2] = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}$.

Variance:

$$\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)} - \frac{\alpha^2}{(\alpha+\beta)^2} = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

AI connection: The Beta distribution is the conjugate prior for the Bernoulli/Binomial likelihood. After observing $s$ successes and $f$ failures, the posterior is Beta($\alpha+s$, $\beta+f$). The posterior mean is $\frac{\alpha+s}{\alpha+\beta+s+f}$ - a weighted average of the prior mean $\frac{\alpha}{\alpha+\beta}$ and the sample mean $\frac{s}{s+f}$. As $s+f \to \infty$, the posterior mean converges to the sample mean (data dominates the prior).

G.2 Proving $H(p) \leq \log K$ by Jensen

For a probability distribution $p = (p_1,\ldots,p_K)$ with $p_k > 0$ and $\sum_k p_k = 1$, the Shannon entropy is $H(p) = -\sum_k p_k \log p_k$.

Proof that $H(p) \leq \log K$:

Apply Jensen's inequality with the concave function $\log$:

$$H(p) = \sum_k p_k \log \frac{1}{p_k} = \mathbb{E}_{k \sim p}\left[\log \frac{1}{p_k}\right] \leq \log\mathbb{E}_{k\sim p}\left[\frac{1}{p_k}\right] = \log\sum_k p_k \cdot \frac{1}{p_k} = \log K$$

Equality holds iff $1/p_k$ is constant in $k$ (Jensen holds with equality iff the argument is constant), i.e., $p_k = 1/K$ (uniform). $\square$

Alternative proof via KL divergence:

$$H(p) - \log K = -\sum_k p_k \log p_k - \log K = \sum_k p_k(\log(1/K) - \log p_k) = -\text{KL}(p \| u) \leq 0$$

where $u = (1/K,\ldots,1/K)$ is the uniform distribution. KL non-negativity gives $H(p) \leq \log K$.

G.3 Adam Bias Correction: Full Derivation

At step $t$, the Adam first-moment accumulator is:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$$

Unrolling from $m_0 = 0$:

$$m_t = (1-\beta_1)\sum_{i=1}^t \beta_1^{t-i} g_i$$

Taking expectations (assuming iid gradients with constant mean $\mathbb{E}[g_i] = \mu_g$):

$$\mathbb{E}[m_t] = (1-\beta_1)\mu_g \sum_{i=1}^t \beta_1^{t-i} = (1-\beta_1)\mu_g \cdot \frac{1-\beta_1^t}{1-\beta_1} = (1-\beta_1^t)\mu_g$$

The bias is $\mathbb{E}[m_t] - \mu_g = -\beta_1^t \mu_g$, which decays to zero geometrically. The bias-corrected estimate $\hat{m}_t = m_t / (1-\beta_1^t)$ satisfies $\mathbb{E}[\hat{m}_t] = \mu_g$ (unbiased).

Similarly for $v_t$: the debiased $\hat{v}_t = v_t/(1-\beta_2^t)$ estimates $\mathbb{E}[g_t^2]$ unbiasedly.

In early training ($t$ small), $1-\beta_1^t \approx (1-\beta_1)t$ is small, so $\hat{m}_t = m_t/((1-\beta_1)t)$ greatly amplifies $m_t$. Without bias correction, Adam would take tiny steps at the start: after one step $m_1 = (1-\beta_1)g_1$, only a fraction $1-\beta_1$ (typically 0.1) of the gradient. With bias correction, $\hat{m}_1 = g_1$ - the first step uses the actual gradient.

G.4 Conditional Expectation: Gaussian Case

Let $(X,Y) \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ with $\boldsymbol{\mu} = (\mu_X, \mu_Y)$ and $\Sigma = \begin{pmatrix}\sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2\end{pmatrix}$.

Conditional distribution (from Section03 Schur complement formula):

$$Y \mid X=x \sim \mathcal{N}\!\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X),\; \sigma_Y^2(1-\rho^2)\right)$$

Therefore:

$$\mathbb{E}[Y|X=x] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X), \qquad \text{Var}(Y|X) = \sigma_Y^2(1-\rho^2)$$

Verification via law of total variance:

$$\text{Var}(Y) = \mathbb{E}[\text{Var}(Y|X)] + \text{Var}(\mathbb{E}[Y|X]) = \sigma_Y^2(1-\rho^2) + \text{Var}\!\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(X-\mu_X)\right) = \sigma_Y^2(1-\rho^2) + \rho^2\frac{\sigma_Y^2}{\sigma_X^2}\text{Var}(X) = \sigma_Y^2(1-\rho^2) + \rho^2\sigma_Y^2 = \sigma_Y^2 \quad \checkmark$$

The variance of the conditional mean contributes $\rho^2\sigma_Y^2$ (explained variance), and the within-group variance contributes $(1-\rho^2)\sigma_Y^2$ (unexplained variance). The fraction $\rho^2$ of $\text{Var}(Y)$ explained by $X$ is the coefficient of determination $R^2$ of linear regression.
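
A simulation confirms the bookkeeping. A minimal NumPy sketch (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_x, mu_y, sx, sy, rho = 1.0, -2.0, 1.5, 2.0, 0.6
cov = [[sx**2, rho*sx*sy], [rho*sx*sy, sy**2]]
x, y = rng.multivariate_normal([mu_x, mu_y], cov, size=1_000_000).T

cond_mean = mu_y + rho * (sy / sx) * (x - mu_x)   # E[Y|X]
explained = cond_mean.var()                       # Var(E[Y|X])
unexplained = sy**2 * (1 - rho**2)                # E[Var(Y|X)], constant here

print(f"Var(Y)                  = {y.var():.4f}")
print(f"explained + unexplained = {explained + unexplained:.4f}")
print(f"R^2 = {explained / y.var():.4f}  vs rho^2 = {rho**2:.4f}")
```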


Appendix H - Notation Summary

| Symbol | Meaning | First defined |
|---|---|---|
| $\mathbb{E}[X]$ | Expected value of $X$ | Section2.1 |
| $\mu_X$ or $\mu$ | Mean (expected value) | Section2.1 |
| $\text{Var}(X)$ or $\sigma_X^2$ | Variance of $X$ | Section3.1 |
| $\sigma_X$ | Standard deviation of $X$ | Section3.2 |
| $\mu'_k = \mathbb{E}[X^k]$ | $k$-th raw moment | Section3.3 |
| $\mu_k = \mathbb{E}[(X-\mu)^k]$ | $k$-th central moment | Section3.3 |
| $\gamma_1 = \mu_3/\sigma^3$ | Skewness | Section3.4 |
| $\gamma_2 = \mu_4/\sigma^4 - 3$ | Excess kurtosis | Section3.5 |
| $\text{Cov}(X,Y)$ | Covariance | Section4.1 |
| $\rho_{XY}$ | Pearson correlation | Section4.2 |
| $\mathbb{E}[Y \mid X=x]$ | Conditional expectation (function of $x$) | Section5.1 |
| $\mathbb{E}[Y \mid X]$ | Conditional expectation (random variable) | Section5.1 |
| $\text{Var}(Y \mid X)$ | Conditional variance | Section5.3 |
| $M_X(t) = \mathbb{E}[e^{tX}]$ | Moment generating function | Section6.1 |
| $\varphi_X(t) = \mathbb{E}[e^{itX}]$ | Characteristic function | Section6.4 |
| $K_X(t) = \log M_X(t)$ | Cumulant generating function | Section6.5 |
| $\kappa_k$ | $k$-th cumulant | Section6.5 |
| $\mathcal{L}(\phi,\theta;x)$ | ELBO (evidence lower bound) | Section9.2 |
| $\text{KL}(p\|q)$ | KL divergence | Section7.2 |
| $H(p)$ | Shannon entropy | Section9.1 |

Appendix I - Quick Reference: Key Formulas

Expectation:

$$\mathbb{E}[aX+bY] = a\mathbb{E}[X] + b\mathbb{E}[Y] \quad (\text{always})$$
$$\mathbb{E}[g(X)] = \int g(x)f_X(x)\,dx \quad (\text{LOTUS})$$

Variance:

$$\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$
$$\text{Var}(aX+b) = a^2\text{Var}(X)$$
$$\text{Var}(X+Y) = \text{Var}(X) + 2\text{Cov}(X,Y) + \text{Var}(Y)$$

Conditional expectation:

$$\mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y|X]] \quad (\text{tower property})$$
$$\text{Var}(Y) = \mathbb{E}[\text{Var}(Y|X)] + \text{Var}(\mathbb{E}[Y|X]) \quad (\text{total variance})$$

MGF:

$$M_X^{(k)}(0) = \mathbb{E}[X^k], \qquad M_{X+Y}(t) = M_X(t)M_Y(t) \; (X\perp Y)$$

Jensen's inequality:

$$f \text{ convex} \implies f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$$
$$\text{KL}(p\|q) \geq 0 \quad (\text{Gibbs' inequality, via Jensen})$$

Cauchy-Schwarz:

$$(\mathbb{E}[XY])^2 \leq \mathbb{E}[X^2]\mathbb{E}[Y^2], \qquad |\rho_{XY}| \leq 1$$

Bias-Variance:

$$\mathbb{E}[(Y-\hat{f})^2] = \text{Bias}^2(\hat{f}) + \text{Var}(\hat{f}) + \sigma^2$$

Adam:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t, \quad \hat{m}_t = m_t/(1-\beta_1^t) \approx \mathbb{E}[g_t]$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2, \quad \hat{v}_t = v_t/(1-\beta_2^t) \approx \mathbb{E}[g_t^2]$$

Appendix J - Common Mistakes: Extended Examples

J.1 Jensen's Direction: A Costly Error

One of the most frequent mistakes when applying Jensen's inequality is applying it in the wrong direction. The rule is simple: for convex $f$, $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$; for concave $f$, the inequality reverses.

Common confusion: $\sqrt{\cdot}$ vs. $(\cdot)^2$.

  • $f(x) = x^2$ is convex ($f''=2>0$). Jensen: $(\mathbb{E}[X])^2 \leq \mathbb{E}[X^2]$. Equivalently, $\text{Var}(X) = \mathbb{E}[X^2]-(\mathbb{E}[X])^2 \geq 0$. ✓
  • $f(x) = \sqrt{x}$ is concave ($f''=-\frac{1}{4}x^{-3/2}<0$). Jensen: $\sqrt{\mathbb{E}[X]} \geq \mathbb{E}[\sqrt{X}]$ for $X \geq 0$.

A model that minimises $\mathbb{E}[\sqrt{\text{loss}}]$ is NOT the same as one minimising $\sqrt{\mathbb{E}[\text{loss}]}$. The former cares about the average square-root loss, the latter about the square root of the average loss. In practice this distinction matters for robust loss functions.

Common confusion: $\log$ vs. $\exp$.

  • $f(x) = \log x$ is concave -> $\log\mathbb{E}[X] \geq \mathbb{E}[\log X]$. This is the key inequality behind the ELBO derivation.
  • $f(x) = e^x$ is convex -> $e^{\mathbb{E}[X]} \leq \mathbb{E}[e^X]$. This is the key inequality for bounding the MGF.

The ELBO derivation applies Jensen with $\log$ (concave): $\log \mathbb{E}[Z] \geq \mathbb{E}[\log Z]$ where $Z = p_\theta(x,z)/q_\phi(z|x)$. Students often try to apply it in the wrong direction, getting an upper bound instead of a lower bound.
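
All four directions can be checked on samples in a few lines. A minimal NumPy sketch (Uniform(0, 2) is an illustrative positive test distribution with $\mathbb{E}[X]=1$):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0, size=1_000_000)   # positive, E[X] = 1

print(np.mean(x)**2, "<=", np.mean(x**2))               # x^2 convex
print(np.mean(np.sqrt(x)), "<=", np.sqrt(np.mean(x)))   # sqrt concave
print(np.mean(np.log(x)), "<=", np.log(np.mean(x)))     # log concave
print(np.exp(np.mean(x)), "<=", np.mean(np.exp(x)))     # exp convex
```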

J.2 The Tower Property Subtlety: Nested Conditioning

The tower property states $\mathbb{E}[\mathbb{E}[Y|X]] = \mathbb{E}[Y]$. A more general version for nested conditioning: for $\sigma$-algebras $\mathcal{G}_1 \subset \mathcal{G}_2$:

$$\mathbb{E}[\mathbb{E}[Y|\mathcal{G}_2]|\mathcal{G}_1] = \mathbb{E}[Y|\mathcal{G}_1]$$

Conditioning an already-conditioned variable on less information keeps only the coarser conditioning.

Example: Let $Z$ be a sufficient statistic for $\theta$ given data $X_1,\ldots,X_n$. Then:

$$\mathbb{E}[\hat{\theta}|\theta] = \mathbb{E}[\mathbb{E}[\hat{\theta}|Z]|\theta] = \mathbb{E}[\tilde{\theta}|\theta]$$

where $\tilde{\theta} = \mathbb{E}[\hat{\theta}|Z]$ is the Rao-Blackwellised estimator. The outer expectation equals the original (tower property), but the Rao-Blackwellised estimator has smaller variance. This is NOT circular - it is the statement that smoothing $\hat{\theta}$ by conditioning on $Z$ doesn't change its mean.

J.3 Correlation Does Not Imply Causation - A Statistical View

Zero correlation means $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$ - no linear relationship. But:

  1. There may be nonlinear dependence (Section4.3 example: $Y = X^2$).
  2. Even positive correlation may arise from a common cause (confounding): if $Z \to X$ and $Z \to Y$ (the fork structure from Section03), then $X$ and $Y$ are correlated even if neither causes the other.

In ML: the correlation between model predictions and labels on the test set measures linear predictive ability, not causal understanding. A model can achieve high correlation while exploiting spurious features (shortcuts). For example, language models correlate "hospital" with "disease" not through causal understanding but through co-occurrence patterns.

Conditional independence as the resolution: If $Z$ is the common cause (confounder), $X$ and $Y$ may be conditionally independent given $Z$: $\text{Cov}(X,Y|Z) = 0$ even though $\text{Cov}(X,Y) \neq 0$. Adjusting for confounders (either by conditioning or with instrumental variables) is the statistical approach to causal inference.


Appendix K - Information-Theoretic View of Moments

K.1 Entropy as Negative Expected Log-Probability

The Shannon entropy of a discrete distribution $p$ is:

$$H(p) = -\mathbb{E}_p[\log p(X)] = -\sum_k p_k \log p_k$$

This is simply minus the expected log-probability. For a continuous distribution with PDF $f$, the differential entropy is:

$$h(f) = -\mathbb{E}_f[\log f(X)] = -\int f(x)\log f(x)\,dx$$

The Gaussian maximises differential entropy. Among all distributions on $\mathbb{R}$ with fixed mean $\mu$ and variance $\sigma^2$, the Gaussian $\mathcal{N}(\mu, \sigma^2)$ maximises differential entropy:

$$h(\mathcal{N}(\mu,\sigma^2)) = \frac{1}{2}\log(2\pi e \sigma^2)$$

Proof sketch (via KL divergence): For any distribution $f$ with mean $\mu$ and variance $\sigma^2$, let $g = \mathcal{N}(\mu,\sigma^2)$.

$$0 \leq \text{KL}(f\|g) = \int f\log\frac{f}{g} = -h(f) - \int f\log g$$

Since $\log g(x) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$, and $\int f(x) \cdot \frac{(x-\mu)^2}{2\sigma^2}\,dx = \frac{\sigma^2}{2\sigma^2} = \frac{1}{2}$ (since $f$ has variance $\sigma^2$):

$$\int f\log g = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2} = -h(g)$$

Therefore $0 \leq -h(f) + h(g)$, i.e., $h(f) \leq h(g)$. Equality iff $f = g$ (since $\text{KL}=0$ iff the distributions are equal). $\square$

AI connection: This maximum entropy property explains why the Gaussian prior is so common in Bayesian machine learning: given a known mean and variance (from domain knowledge), the Gaussian is the least informative (maximum entropy) prior consistent with those constraints. It makes the fewest additional assumptions about the distribution.
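A quick numeric sanity check of the maximum-entropy claim, assuming SciPy is available; comparing against a Laplace distribution with matched mean and variance is an arbitrary illustrative choice:

```python
import numpy as np
from scipy import stats

sigma = 1.5
b = sigma / np.sqrt(2)  # Laplace scale with matching variance: Var = 2*b^2 = sigma^2

h_gauss = stats.norm(scale=sigma).entropy()    # differential entropy: (1/2) log(2*pi*e*sigma^2)
h_laplace = stats.laplace(scale=b).entropy()   # differential entropy: 1 + log(2*b)

print(h_gauss, h_laplace)   # ~1.824 vs ~1.752
assert h_gauss > h_laplace  # Gaussian wins at equal mean and variance
```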

K.2 Mutual Information as Expected KL Divergence

The mutual information between $X$ and $Y$ is:

$$I(X;Y) = \text{KL}(p_{X,Y} \| p_X \otimes p_Y) = \mathbb{E}_{(X,Y)}\left[\log\frac{p_{X,Y}(X,Y)}{p_X(X)p_Y(Y)}\right]$$

This is the KL divergence between the joint distribution and the product of the marginals. Since $\text{KL} \geq 0$: $I(X;Y) \geq 0$ with equality iff $X \perp Y$.

Relationship to conditional entropy:

$$I(X;Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)$$

where $H(Y|X) = \mathbb{E}_X[H(Y|X=x)]$ is the conditional entropy (the tower property applied to entropy).

For AI: Mutual information is the gold standard for measuring statistical dependence. It captures all forms of dependence (linear and nonlinear), unlike correlation. Contrastive learning methods (SimCLR, CLIP) can be viewed as maximising a lower bound on $I(\text{view}_1; \text{view}_2)$ - encouraging representations to capture information shared between different views/modalities. Information bottleneck methods (used for analysing neural networks) study the trade-off between compressing $X$ (low $I(Z;X)$) and preserving information about $Y$ (high $I(Z;Y)$).
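For a small discrete joint, $I(X;Y)$ can be computed directly from the expected-log-ratio definition; the 2x2 PMF below is a made-up example. The identity $I = H(X) + H(Y) - H(X,Y)$, equivalent to the conditional-entropy form above, serves as a cross-check:

```python
import numpy as np

# Joint PMF p(x, y) on a 2x2 grid (rows: x, columns: y)
p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_x = p_xy.sum(axis=1, keepdims=True)  # marginal of X
p_y = p_xy.sum(axis=0, keepdims=True)  # marginal of Y

# I(X;Y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ]  -- expected log-ratio
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))

# Cross-check via entropies: I = H(X) + H(Y) - H(X,Y)
H = lambda p: -np.sum(p * np.log(p))
print(mi, H(p_x) + H(p_y) - H(p_xy))  # equal, and > 0 since X and Y are dependent
```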


Appendix L - The Reparameterisation Trick: Full Mathematical Treatment

The reparameterisation trick is a technique for computing gradients of expectations when the distribution depends on the parameters we differentiate with respect to.

L.1 The Problem: Gradient Through Sampling

We want $\nabla_\phi \mathbb{E}_{z \sim q_\phi}[f(z)]$ where $q_\phi$ is a distribution parameterised by $\phi$. Naively:

$$\nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \nabla_\phi \int f(z)\, q_\phi(z)\, dz = \int f(z)\, \nabla_\phi q_\phi(z)\, dz$$

The difficulty: $f(z)\nabla_\phi q_\phi(z)$ is not of the form (density $\times$ integrand), so the integral is no longer an expectation over any fixed distribution - we cannot estimate it simply by sampling $z \sim q_\phi$ and averaging.

L.2 REINFORCE (Score Function Estimator)

The score function trick uses $\nabla_\phi q_\phi = q_\phi \nabla_\phi \log q_\phi$:

$$\nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \int f(z)\, q_\phi(z)\, \nabla_\phi \log q_\phi(z)\, dz = \mathbb{E}_{q_\phi}[f(z) \nabla_\phi \log q_\phi(z)]$$

This is computable by Monte Carlo: sample $z \sim q_\phi$, compute $f(z)\nabla_\phi \log q_\phi(z)$, average over samples. However, this estimator has high variance because $f(z)$ can be large and $\nabla_\phi \log q_\phi(z)$ can vary greatly.

L.3 Reparameterisation

When $q_\phi$ is a location-scale family (or more generally, when there exists a deterministic transformation $g(\phi, \varepsilon)$):

$$z = g(\phi, \varepsilon), \quad \varepsilon \sim p(\varepsilon) \quad \text{(fixed distribution, independent of }\phi\text{)}$$

For the Gaussian: $z = \mu_\phi + \sigma_\phi \odot \varepsilon$, $\varepsilon \sim \mathcal{N}(0,I)$.

Then:

$$\mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{p(\varepsilon)}[f(g(\phi, \varepsilon))]$$

and:

$$\nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{p(\varepsilon)}[\nabla_\phi f(g(\phi, \varepsilon))] = \mathbb{E}_{p(\varepsilon)}\left[\nabla_z f(z) \cdot \nabla_\phi g(\phi,\varepsilon)\right]$$

The gradient now flows through the deterministic function $g$, enabling automatic differentiation. The variance of this estimator is typically much lower than REINFORCE's because $\nabla_z f$ is often smoother than $f \cdot \nabla_\phi \log q_\phi$.

Jacobian of the transform:

$$\nabla_\phi g(\phi, \varepsilon)\big|_{\mu_\phi, \sigma_\phi} = \begin{pmatrix}\frac{\partial z}{\partial \mu_\phi} \\ \frac{\partial z}{\partial \sigma_\phi}\end{pmatrix} = \begin{pmatrix}I \\ \text{diag}(\varepsilon)\end{pmatrix}$$

So the gradient with respect to $\mu_\phi$ is $\mathbb{E}[\nabla_z f(z)]$ and with respect to $\sigma_\phi$ is $\mathbb{E}[\nabla_z f(z) \odot \varepsilon]$.

L.4 Why Lower Variance?

Consider $f(z) = \sin(z)$ with $z \sim \mathcal{N}(\mu, \sigma^2)$.

REINFORCE gradient: $\mathbb{E}[\sin(z) \cdot (z-\mu)/\sigma^2]$. The product $\sin(z)(z-\mu)$ can be large in magnitude.

Reparameterisation gradient: $\mathbb{E}[\cos(\mu+\sigma\varepsilon)] = \mathbb{E}[\cos(z)]$. The integrand is $\cos(z)$, bounded in $[-1,1]$, much more stable.

In general, reparameterisation produces gradients of $O(1)$ magnitude when $f$ is smooth, while REINFORCE gradients scale with the values of $f$, which can be large.
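Both estimators target the same quantity $\partial_\mu \mathbb{E}[\sin(z)] = \mathbb{E}[\cos(z)]$, so their sample means agree while their variances differ dramatically. A minimal NumPy sketch of this comparison (the values of $\mu$, $\sigma$, and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.5, 1.0, 100_000

eps = rng.normal(size=n)
z = mu + sigma * eps  # reparameterised samples z ~ N(mu, sigma^2)

# Score-function (REINFORCE) samples: f(z) * d/dmu log q(z) = sin(z) * (z - mu) / sigma^2
g_reinforce = np.sin(z) * (z - mu) / sigma**2

# Reparameterisation samples: d/dmu sin(mu + sigma*eps) = cos(z)
g_reparam = np.cos(z)

print(g_reinforce.mean(), g_reparam.mean())  # both estimate E[cos(z)] (same gradient)
print(g_reinforce.var(), g_reparam.var())    # REINFORCE variance is far larger
```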


Appendix M - Moments in the Context of Score Matching and Diffusion

M.1 Denoising Score Matching

Diffusion models (DDPM, Score SDE) learn the score function $\nabla_x \log p_t(x)$ at each noise level $t$. The training objective is:

$$\mathcal{L}_\theta = \mathbb{E}_{t, x_0, \varepsilon}\left[\|s_\theta(x_t, t) - \nabla_{x_t}\log p_t(x_t|x_0)\|^2\right]$$

where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon$ and $\varepsilon \sim \mathcal{N}(0,I)$.

The conditional score is:

$$\nabla_{x_t}\log p_t(x_t|x_0) = -\frac{\varepsilon}{\sqrt{1-\bar{\alpha}_t}}$$

The objective is an expectation over the timestep $t$, the data $x_0$, and the noise $\varepsilon$. Absorbing constants into the network parameterisation, the loss simplifies to:

$$\mathcal{L}_\theta = \mathbb{E}_{t,\varepsilon}\left[\|\varepsilon_\theta(x_t,t) - \varepsilon\|^2\right]$$

which is an MSE loss - an empirical estimate of $\mathbb{E}[(Y-f(X))^2]$ where $Y$ is the added noise $\varepsilon$ and $f = \varepsilon_\theta$.

M.2 First Moment of the Reverse Process

The reverse diffusion process gives $p_\theta(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta(x_t,t), \sigma_t^2 I)$ where:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\varepsilon_\theta(x_t, t)\right)$$

The mean of the reverse step is determined by the predicted noise. Equivalently, it can be expressed through the model's denoised estimate of the clean image $x_0$:

$$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\varepsilon_\theta(x_t,t)}{\sqrt{\bar{\alpha}_t}}$$

This is LOTUS applied to the reparameterised relationship: the expected clean image given the noisy image is a simple linear function of the predicted noise, evaluated using the noisy image.
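A sketch of that computation in code; `eps_theta` below is a stub standing in for the trained noise predictor, and the value of $\bar{\alpha}_t$ is an arbitrary illustrative choice:

```python
import numpy as np

def eps_theta(x_t, t):
    """Stub for the learned noise predictor; a real model returns predicted noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bar_t = 0.4            # cumulative alpha-bar at step t (illustrative value)
x_t = rng.normal(size=(8,))  # a noisy sample at step t

# x0_hat = (x_t - sqrt(1 - abar_t) * eps_theta(x_t, t)) / sqrt(abar_t)
x0_hat = (x_t - np.sqrt(1 - alpha_bar_t) * eps_theta(x_t, t)) / np.sqrt(alpha_bar_t)
print(x0_hat)
```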


Appendix N - Worked Problems: Bias-Variance and Regularisation

N.1 Ridge Regression Bias-Variance Tradeoff

Setup. Data: $y = X\theta^* + \varepsilon$ where $X \in \mathbb{R}^{n\times d}$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$.

Ridge estimator: $\hat{\theta}_\lambda = (X^\top X + \lambda I)^{-1}X^\top y$.

Let $A_\lambda = (X^\top X + \lambda I)^{-1}X^\top$.

Bias: $\text{Bias}(\hat{\theta}_\lambda) = \mathbb{E}[\hat{\theta}_\lambda] - \theta^* = A_\lambda X\theta^* - \theta^* = (A_\lambda X - I)\theta^*$.

Since $A_\lambda X = (X^\top X + \lambda I)^{-1}X^\top X$:

$$A_\lambda X - I = (X^\top X + \lambda I)^{-1}X^\top X - I = -\lambda(X^\top X + \lambda I)^{-1}$$

So $\text{Bias}(\hat{\theta}_\lambda) = -\lambda(X^\top X + \lambda I)^{-1}\theta^*$. The bias magnitude grows with $\lambda$.

Variance: $\text{Var}(\hat{\theta}_\lambda) = A_\lambda \text{Var}(y) A_\lambda^\top = \sigma^2 A_\lambda A_\lambda^\top = \sigma^2(X^\top X+\lambda I)^{-1}X^\top X(X^\top X+\lambda I)^{-1}$.

As $\lambda \to \infty$: $(X^\top X+\lambda I)^{-1} \to 0$, so the variance $\to 0$. As $\lambda \to 0$: it recovers the OLS variance $\sigma^2(X^\top X)^{-1}$.

Total MSE: $\text{MSE} = \|\text{Bias}\|^2 + \text{tr}(\text{Var}) = \lambda^2 \theta^{*\top}(X^\top X+\lambda I)^{-2}\theta^* + \sigma^2\,\text{tr}\big((X^\top X+\lambda I)^{-1}X^\top X(X^\top X+\lambda I)^{-1}\big)$.

The optimal $\lambda$ minimises this sum - bias grows, variance shrinks, and there is a sweet spot.
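The closed-form expressions above can be swept over $\lambda$ directly; a sketch assuming a synthetic Gaussian design matrix and a random true $\theta^*$ (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma2 = 100, 10, 1.0
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
XtX = X.T @ X

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    M = np.linalg.inv(XtX + lam * np.eye(d))           # (X^T X + lambda I)^{-1}
    bias2 = lam**2 * theta_star @ M @ M @ theta_star   # ||Bias||^2
    var = sigma2 * np.trace(M @ XtX @ M)               # tr(Var)
    print(f"lambda={lam:7.1f}  bias^2={bias2:.4f}  var={var:.4f}  mse={bias2 + var:.4f}")
```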

N.2 Neural Network Initialisation via Variance Propagation

He initialisation (for ReLU networks) is derived from a variance-propagation analysis of the forward pass.

For a layer $h_j^{[l]} = \text{ReLU}\!\left(\sum_{i} W_{ji}^{[l]} h_i^{[l-1]}\right)$, let $v_l$ denote the variance of the pre-activations $z_j^{[l]} = \sum_i W_{ji}^{[l]} h_i^{[l-1]}$, with iid weights $W_{ji} \sim (0, \sigma_W^2)$ (zero mean, variance $\sigma_W^2$) independent of the inputs. Because the weights are zero-mean, $\text{Var}(z_j^{[l]}) = n_{l-1}\,\sigma_W^2\,\mathbb{E}[(h^{[l-1]})^2]$ even though the post-ReLU inputs are not zero-mean.

Variance propagation:

$$v_l = \text{Var}(z_j^{[l]}) = n_{l-1} \cdot \sigma_W^2 \cdot \mathbb{E}[\text{ReLU}(z)^2]$$

where $z \sim \mathcal{N}(0, v_{l-1})$ is the previous layer's pre-activation. For ReLU: $\mathbb{E}[\text{ReLU}(z)^2] = \mathbb{E}[z^2]/2 = v_{l-1}/2$ (since ReLU zeros half of a symmetric distribution).

Therefore $v_l = n_{l-1} \cdot \sigma_W^2 \cdot v_{l-1}/2$.

For variance to stay constant across layers: $v_l = v_{l-1}$ requires $\sigma_W^2 = 2/n_{l-1}$.

He initialisation: $W \sim \mathcal{N}(0, 2/n_\text{in})$. This preserves variance through the network during the forward pass, preventing activations from vanishing or exploding in deep networks.
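A quick simulation of this propagation, assuming a plain fully-connected ReLU stack (the width, depth, and batch size are arbitrary): with $\sigma_W^2 = 2/n_\text{in}$, the second moment of the activations stays $O(1)$ across depth instead of shrinking or blowing up geometrically.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, batch = 512, 20, 1024

h = rng.normal(size=(batch, width))  # unit-variance inputs
for layer in range(depth):
    W = rng.normal(scale=np.sqrt(2.0 / width), size=(width, width))  # He: Var = 2/n_in
    h = np.maximum(h @ W.T, 0.0)                                     # ReLU activation
    if layer % 5 == 4:
        print(f"layer {layer + 1:2d}: E[h^2] = {np.mean(h**2):.3f}")  # stays ~1
```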


Appendix O - Self-Assessment Checklist

Use this checklist after studying the section to identify gaps before proceeding to Section05.

Core Mechanics (Should be fluent)

  • I can compute $\mathbb{E}[X]$ from a PMF or PDF using the definition.
  • I can apply LOTUS: $\mathbb{E}[g(X)] = \int g(x)f(x)\,dx$ without finding the distribution of $g(X)$.
  • I know that linearity $\mathbb{E}[aX+bY]=a\mathbb{E}[X]+b\mathbb{E}[Y]$ holds without independence.
  • I can compute $\text{Var}(X)$ using $\mathbb{E}[X^2]-(\mathbb{E}[X])^2$.
  • I know $\text{Var}(aX+b)=a^2\text{Var}(X)$ (a shift doesn't change variance).
  • I can compute skewness $\gamma_1=\mu_3/\sigma^3$ and kurtosis $\gamma_2=\mu_4/\sigma^4-3$.

Intermediate Theory (Should understand proofs)

  • I can state and prove the tower property $\mathbb{E}[\mathbb{E}[Y|X]]=\mathbb{E}[Y]$.
  • I can state and prove the law of total variance.
  • I can derive the MGF of the Gaussian, Exponential, and Poisson distributions.
  • I can state Jensen's inequality and identify when it applies ($f$ convex vs. concave).
  • I can prove $\text{KL}(p\|q)\geq 0$ using Jensen.
  • I can state and prove the Cauchy-Schwarz inequality for expectations.
  • I know that $|\rho|\leq 1$ follows from Cauchy-Schwarz.
  • I understand why zero covariance does NOT imply independence (with a counterexample).

Advanced Applications (Should be able to apply)

  • I can derive the bias-variance decomposition from first principles.
  • I can explain what double descent is and why overparameterised models can generalise.
  • I can derive the ELBO using Jensen's inequality.
  • I can explain the reparameterisation trick and why it reduces gradient variance.
  • I understand Adam as tracking first and second moments with bias correction.
  • I can explain why the score function satisfies $\mathbb{E}[\nabla_\theta \log p_\theta(X)]=\mathbf{0}$.

Appendix P - Further Reading and References

Textbooks

  1. Probability and Statistics for Engineering and the Sciences - Jay Devore (2015). Accessible introduction to moments, MGFs, and expectation with engineering examples.

  2. Probability Theory: The Logic of Science - E.T. Jaynes (2003). Philosophical and technical treatment; excellent on entropy and maximum entropy principle.

  3. Pattern Recognition and Machine Learning - Christopher Bishop (2006), Ch. 1-2. The ML-focused treatment of expectations, KL divergence, ELBO, and variational inference.

  4. Deep Learning - Goodfellow, Bengio, Courville (2016), Ch. 3. Standard reference for probability in ML context including bias-variance tradeoff.

  5. Probabilistic Machine Learning: An Introduction - Kevin Murphy (2022), Ch. 2-4. Modern treatment with extensive ML applications including Adam, VAEs, and diffusion models.

Papers

  1. Kingma & Welling (2014). Auto-Encoding Variational Bayes. arXiv:1312.6114. Original VAE paper; derives ELBO via Jensen and introduces reparameterisation trick.

  2. Kingma & Ba (2015). Adam: A Method for Stochastic Optimization. arXiv:1412.6980. Original Adam paper; moment tracking interpretation is explicit in the derivation.

  3. Williams (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning. REINFORCE algorithm; score function gradient estimator.

  4. Ioffe & Szegedy (2015). Batch Normalization. arXiv:1502.03167. Moment normalisation of activations; analysis of how variance affects training dynamics.

  5. Belkin et al. (2019). Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off. PNAS. Double descent phenomenon; formal analysis of why interpolating models can generalise.

  6. Song et al. (2020). Score-Based Generative Modeling through Stochastic Differential Equations. arXiv:2011.13456. Diffusion models as score-matching; score function is a moment of the distribution.


Summary of Key Results

This section established the expectation operator as the central tool in probability and machine learning. Starting from the LOTUS definition, we derived linearity (which holds without independence), the tower property (iterated expectation), and the law of total variance (which decomposes uncertainty into within-group and between-group components).

The moment hierarchy captures distribution shape: the first moment (mean) locates it; the second (variance) measures spread; the third (skewness) captures asymmetry; the fourth (kurtosis) captures tail weight. Moment generating functions encode all moments in a single analytic function and enable elegant proofs of the reproductive property (sum of independent Gaussians is Gaussian) and cumulant additivity.

Jensen's inequality - the most important inequality in this section - gives the KL divergence its non-negativity, the ELBO its existence as a lower bound, and the bias-variance decomposition its clean structure. Cauchy-Schwarz gives $|\rho| \leq 1$ and motivates the attention scaling factor $1/\sqrt{d_k}$.

Every major ML method reviewed in Section 9 is an application of these results: cross-entropy is expected log-loss, the ELBO is Jensen applied to the marginal likelihood, policy gradient is the score function trick, and Adam is first-and-second-moment tracking with debiasing.

The concepts forward-referenced here - Markov's and Chebyshev's inequalities, the Law of Large Numbers, the Central Limit Theorem - will be fully developed in Section05 and Section06, completing the probabilistic toolkit required for modern ML.

The conceptual unification: expectation is a linear functional on the space of random variables. It maps random variables to real numbers while preserving linear structure. Every computation in this section - variance as $\mathbb{E}[(X-\mu)^2]$, covariance as $\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]$, the MGF as $\mathbb{E}[e^{tX}]$, the characteristic function as $\mathbb{E}[e^{itX}]$, the KL divergence as $\mathbb{E}_p[\log p/q]$, Adam's first moment as $\mathbb{E}[g_t]$ - is an application of this single linear functional to different functions of the random variable.

Understanding this unity transforms the apparent complexity of ML training into a coherent framework: we are always estimating expectations from samples, bounding how far sample estimates stray from true expectations, and choosing parameterisations that make those expectations tractable to compute and differentiate.

<- Back to Chapter 6: Probability Theory | Next: Concentration Inequalities ->


End of Section06/04 - Expectation and Moments

| File | Lines / Cells | Status |
| --- | --- | --- |
| notes.md | 2000+ | ✓ Complete |
| theory.ipynb | 38 cells | ✓ Complete |
| exercises.ipynb | 27 cells | ✓ Complete |