Math for LLMs | Probability Theory / Expectation and Moments

Expectation and Moments: Part 2 - Conceptual Bridge to Summary of Key Results

13. Conceptual Bridge

Where We Came From

This section builds directly on the foundations laid in Section01-Section03. The probability spaces and random-variable formalism of Section01 provide the objects - PDFs, PMFs, CDFs - over which we compute expectations. The named distributions of Section02 supply the examples whose moments we derive in Section6.2: the mean $1/\lambda$ and variance $1/\lambda^2$ of the exponential, the mean $np$ and variance $np(1-p)$ of the binomial, the mean $\mu$ and variance $\sigma^2$ that define the Gaussian. The joint-distribution machinery of Section03 provides the tools for conditional expectation (Section5) and the covariance matrix (Section4): the tower property is essentially iterated marginalisation, and the conditional variance formula is the law of total variance derived via joint distributions.

Where We Are Going

Section05 (Concentration Inequalities) takes the moment-bound preview from Section7.5 much further. Markov's inequality (from the mean) and Chebyshev's inequality (from the variance) are the first two results, but Section05 develops exponentially sharper bounds: Hoeffding's inequality (bounded variables), Chernoff bounds (via MGFs), and McDiarmid's inequality (functions of independent variables). These bounds are the mathematical machinery of PAC-learning generalisation theory - they quantify how many training examples are needed to guarantee that the empirical risk is close to the true risk.

Section06 (Stochastic Processes) extends expectation to sequences of random variables indexed by time. The Law of Large Numbers proves rigorously that $\bar{X}_N \to \mathbb{E}[X]$ - the sample mean converges to the expectation. The Central Limit Theorem proves that the standardised sample mean converges in distribution to $\mathcal{N}(0,1)$ - the Gaussian emerges as the universal limit of sums, with proofs via the MGF or characteristic-function techniques introduced here.

POSITION IN CHAPTER 6 CURRICULUM
========================================================================

  Section01 Probability Spaces          Section02 Distributions
  +------------------+            +------------------+
  | Kolmogorov axioms|            | Named PDFs/PMFs  |
  | Events, σ-algebra|            | Parameters       |
  | CDF, PDF, PMF    |            | Relationships    |
  +--------+---------+            +--------+---------+
           |                               |
           +--------------+----------------+
                          v
              Section03 Joint Distributions
              +----------------------+
              | Marginals            |
              | Conditionals f(y|x) |
              | MVN, Bayes, Chain   |
              +----------+----------+
                         |
                         v
          +==============================+
          |  Section04 Expectation & Moments  |  <- YOU ARE HERE
          |  E[X], Var, Cov, MGF       |
          |  Jensen, C-S, LOTUS        |
          |  Bias-Variance, Adam       |
          +==============+=============+
                         |
              +----------+----------+
              v                     v
    Section05 Concentration         Section06 Stochastic
    Inequalities              Processes
    +-------------+           +-------------+
    | Markov      |           | LLN: Xbar->E[X]|
    | Chebyshev   |           | CLT: ->N(0,1)|
    | Hoeffding   |           | Gaussian    |
    | PAC bounds  |           | processes   |
    +-------------+           +-------------+
              |                     |
              +----------+----------+
                         v
              Section07 Markov Chains
              +----------------------+
              | Transition matrices  |
              | Steady state         |
              | MCMC (Bayes sampling)|
              +----------------------+

========================================================================

The conceptual arc through Section04 is: we began with probability as a framework for describing uncertainty (Section01), learned the vocabulary of named distributions (Section02), understood how to reason about multiple random variables jointly (Section03), and now in Section04 we have the tools to summarise distributions with numbers - expectations, variances, covariances, moments - and to bound the gap between reality and our estimates. Every subsequent section will use the expectation operator as a core tool: Section05 to prove tail bounds, Section06 to formalise convergence, Section07 to analyse Markov chain stationary distributions via matrix expectations.



Appendix A - Worked Examples: Expectation and Moments

A.1 Expected Value of the Geometric Distribution

The geometric distribution counts the number of trials until the first success in a sequence of iid Bernoulli($p$) trials.

PMF: $P(X=k) = (1-p)^{k-1}p$ for $k = 1, 2, 3, \ldots$

Expected value via the definition:

$$\mathbb{E}[X] = \sum_{k=1}^\infty k(1-p)^{k-1}p = p \sum_{k=1}^\infty k q^{k-1}$$

where $q = 1-p$. Using the identity $\sum_{k=1}^\infty k q^{k-1} = \frac{d}{dq}\sum_{k=0}^\infty q^k = \frac{d}{dq}\frac{1}{1-q} = \frac{1}{(1-q)^2} = \frac{1}{p^2}$:

$$\mathbb{E}[X] = p \cdot \frac{1}{p^2} = \frac{1}{p}$$

Variance via the second moment: $\mathbb{E}[X^2] = \sum_{k=1}^\infty k^2(1-p)^{k-1}p$. Using $\sum k^2 q^{k-1} = \frac{d}{dq}\left[\sum k q^k\right] = \frac{d}{dq}\frac{q}{(1-q)^2} = \frac{1+q}{(1-q)^3}$:

$$\mathbb{E}[X^2] = p \cdot \frac{1+(1-p)}{p^3} = \frac{2-p}{p^2}$$

$$\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \frac{2-p}{p^2} - \frac{1}{p^2} = \frac{1-p}{p^2}$$

AI connection: The geometric distribution models the number of tokens generated until a specific token (e.g., end-of-sequence) appears, assuming iid generation. In practice tokens are not iid, but the geometric provides a baseline model: with a constant per-token stopping probability $p$, the expected sequence length is $1/p$.
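A quick Monte Carlo sanity check of both formulas (a minimal sketch; NumPy's geometric sampler counts trials until the first success, matching the PMF above):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.2
# One million geometric samples: number of trials until first success.
samples = rng.geometric(p, size=1_000_000)

print(samples.mean())  # ~ 1/p = 5.0
print(samples.var())   # ~ (1-p)/p^2 = 20.0
```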

A.2 Method of Moments Estimation

The method of moments estimates distribution parameters by setting sample moments equal to theoretical moments and solving.

Example: Estimating $(\alpha, \beta)$ for $\text{Gamma}(\alpha, \beta)$.

Theoretical moments: $\mathbb{E}[X] = \alpha/\beta$ and $\text{Var}(X) = \alpha/\beta^2$.

Given $N$ samples $x_1, \ldots, x_N$, compute the sample mean $\bar{x} = \frac{1}{N}\sum x_i$ and sample variance $s^2 = \frac{1}{N-1}\sum(x_i-\bar{x})^2$.

Set $\bar{x} = \hat{\alpha}/\hat{\beta}$ and $s^2 = \hat{\alpha}/\hat{\beta}^2$. Solving:

$$\hat{\beta} = \frac{\bar{x}}{s^2}, \qquad \hat{\alpha} = \frac{\bar{x}^2}{s^2}$$

For AI: Many Bayesian models require estimating hyperparameters of prior distributions. Method of moments provides fast, closed-form initial estimates that can be refined by maximum likelihood or MCMC. It is also used in moment matching for knowledge distillation: a student model's distribution moments are matched to the teacher's.
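A minimal sketch of this estimator in NumPy (the true parameter values are arbitrary; note that NumPy parameterises the gamma sampler by shape and scale $= 1/\beta$):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_true, beta_true = 3.0, 2.0                     # shape, rate
x = rng.gamma(shape=alpha_true, scale=1.0 / beta_true, size=100_000)

xbar = x.mean()
s2 = x.var(ddof=1)                                   # unbiased sample variance

beta_hat = xbar / s2                                 # method-of-moments estimates
alpha_hat = xbar**2 / s2
print(alpha_hat, beta_hat)                           # ~ (3.0, 2.0)
```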

A.3 St. Petersburg Paradox: When Expectation Misleads

The St. Petersburg game: flip a coin repeatedly until the first head. If the first head appears on flip $k$, you win $2^k$ dollars.

Expected winnings:

$$\mathbb{E}[\text{Winnings}] = \sum_{k=1}^\infty 2^k \cdot \frac{1}{2^k} = \sum_{k=1}^\infty 1 = +\infty$$

The expected value is infinite! Yet no rational person would pay more than a few dollars to play this game.

Resolution: Rational agents maximise expected utility, not expected monetary value. For a logarithmic utility function $u(w) = \log(w)$, the expected utility is finite:

$$\mathbb{E}[\log(\text{Winnings})] = \sum_{k=1}^\infty \log(2^k) \cdot \frac{1}{2^k} = \sum_{k=1}^\infty \frac{k \log 2}{2^k} = 2\log 2 < \infty$$

This is Bernoulli's resolution (1738): diminishing marginal utility. The paradox reveals that infinite expected value is not sufficient for rational choice - the existence of all moments (or at least bounded utility) is needed.

For AI: Reinforcement learning reward design must account for this. An agent with an unbounded reward function may take extremely risky actions that have infinite expected reward but almost surely fail. This motivates bounded reward functions and regularisation in RLHF: keeping reward signals within a bounded range prevents policy collapse toward infinite-expectation strategies.
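The divergence is easy to see by simulation (a sketch; sample sizes are arbitrary). The running sample mean of the winnings never stabilises - it jumps whenever a rare huge payoff lands - while the log-utility average settles near $2\log 2 \approx 1.386$:

```python
import numpy as np

rng = np.random.default_rng(2)
# Flips until the first head is Geometric(1/2); payoff is 2^k dollars.
k = rng.geometric(0.5, size=1_000_000)
winnings = 2.0 ** k

for n in [100, 1_000, 10_000, 100_000, 1_000_000]:
    print(n, winnings[:n].mean())    # keeps drifting upward, never converges

print(np.log(winnings).mean())       # ~ 2*log(2) = 1.386, finite
```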


Appendix B - Proofs of Key Identities

B.1 Cauchy-Schwarz via Inner Product

The expectation $\mathbb{E}[XY]$ can be viewed as an inner product $\langle X, Y \rangle = \mathbb{E}[XY]$ in the $L^2$ space of square-integrable random variables. The Cauchy-Schwarz inequality for inner products states $|\langle X, Y \rangle|^2 \leq \langle X, X \rangle \langle Y, Y \rangle$, which gives $(\mathbb{E}[XY])^2 \leq \mathbb{E}[X^2]\,\mathbb{E}[Y^2]$ directly.

This inner-product interpretation gives $L^2$ convergence its name: $X_n \to X$ in $L^2$ means $\mathbb{E}[(X_n - X)^2] \to 0$ - convergence in the norm induced by the $L^2$ inner product.

B.2 Variance as Second Cumulant

The cumulant generating function (CGF) of $X$ is $K_X(t) = \log M_X(t) = \log \mathbb{E}[e^{tX}]$. Computing its second derivative at $t=0$:

$$K_X''(t) = \frac{M_X(t)M_X''(t) - (M_X'(t))^2}{(M_X(t))^2}$$

At $t=0$: $M_X(0)=1$, $M_X'(0)=\mathbb{E}[X]$, $M_X''(0)=\mathbb{E}[X^2]$, so:

$$K_X''(0) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \text{Var}(X)$$

The variance is precisely the second cumulant $\kappa_2$.

B.3 Fisher Information and the Score

The Fisher information matrix is defined as:

$$\mathcal{I}(\theta) = \mathbb{E}_\theta\left[(\nabla_\theta \log p_\theta(X))(\nabla_\theta \log p_\theta(X))^\top\right] = \text{Cov}_\theta(\nabla_\theta \log p_\theta(X))$$

The second equality uses the fact that $\mathbb{E}_\theta[\nabla_\theta \log p_\theta(X)] = \mathbf{0}$ (proved by differentiating $\int p_\theta = 1$):

$$\int p_\theta(x)\,dx = 1 \implies \nabla_\theta \int p_\theta(x)\,dx = 0 \implies \int \nabla_\theta p_\theta(x)\,dx = 0 \implies \mathbb{E}_\theta\left[\frac{\nabla_\theta p_\theta(X)}{p_\theta(X)}\right] = \mathbb{E}_\theta[\nabla_\theta \log p_\theta(X)] = \mathbf{0}$$

Therefore the Fisher information is the variance (covariance matrix) of the score function $\nabla_\theta \log p_\theta(X)$.

For AI: Fisher information appears in:

  • Cramer-Rao bound: $\text{Var}(\hat{\theta}) \geq 1/\mathcal{I}(\theta)$ - the variance of any unbiased estimator is at least the reciprocal of the Fisher information.
  • Natural gradient: The natural-gradient update $\mathcal{I}(\theta)^{-1}\nabla\mathcal{L}$ moves in the direction of steepest descent in the space of distributions (KL-divergence geometry), rather than in parameter space. Adam approximates this with a diagonal Fisher.
  • Elastic Weight Consolidation (EWC): Used in continual learning to prevent catastrophic forgetting. The diagonal of the Fisher information identifies which parameters are important for previous tasks.

B.4 Stein's Lemma

Lemma (Stein, 1972). If $X \sim \mathcal{N}(\mu, \sigma^2)$ and $g$ is differentiable with $\mathbb{E}[|g'(X)|] < \infty$:

$$\mathbb{E}[g(X)(X-\mu)] = \sigma^2\, \mathbb{E}[g'(X)]$$

Proof: Integration by parts:

$$\mathbb{E}[g(X)(X-\mu)] = \int g(x)(x-\mu) \frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu)^2/(2\sigma^2)}\,dx$$

Note that $(x-\mu)e^{-(x-\mu)^2/(2\sigma^2)} = -\sigma^2 \frac{d}{dx}e^{-(x-\mu)^2/(2\sigma^2)}$. Integrating by parts (the boundary terms vanish):

$$= \sigma^2 \int g'(x) \frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu)^2/(2\sigma^2)}\,dx = \sigma^2\, \mathbb{E}[g'(X)] \quad \square$$

Special cases:

  • $g(x) = x$: $\mathbb{E}[X(X-\mu)] = \sigma^2 \implies \mathbb{E}[X^2] = \mu^2 + \sigma^2$ ✓
  • $g(x) = x^2$: $\mathbb{E}[X^2(X-\mu)] = 2\sigma^2\, \mathbb{E}[X] = 2\sigma^2\mu \implies \mathbb{E}[X^3] = \mu^3 + 3\mu\sigma^2$ ✓

For AI: Stein's lemma is the foundation of Stein's identity used in score matching and denoising diffusion models. For a density $p$, the score function $\nabla_x \log p(x)$ satisfies $\mathbb{E}_p[\nabla_x \log p(X)\, g(X) + \nabla_x \cdot g(X)] = 0$ (Stein's operator), enabling training by minimising a quadratic loss without computing the intractable normalisation constant of $p$.
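A numeric check of the lemma for $g(x) = x^2$ (a sketch with arbitrary $\mu$, $\sigma$):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.5, 0.7
x = rng.normal(mu, sigma, size=1_000_000)

# g(x) = x^2, g'(x) = 2x: Stein says E[g(X)(X - mu)] = sigma^2 * E[g'(X)].
lhs = np.mean(x**2 * (x - mu))
rhs = sigma**2 * np.mean(2 * x)
print(lhs, rhs)   # both ~ 2 * sigma^2 * mu = 1.47
```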


Appendix C - Moment Computations for Common Distributions

This table gives the mean, variance, skewness, excess kurtosis, and MGF for the distributions used throughout the course.

| Distribution | $\mathbb{E}[X]$ | $\text{Var}(X)$ | Skewness $\gamma_1$ | Ex. kurtosis $\gamma_2$ | $M_X(t)$ (domain) |
|---|---|---|---|---|---|
| Bernoulli($p$) | $p$ | $p(1-p)$ | $\frac{1-2p}{\sqrt{p(1-p)}}$ | $\frac{1-6p(1-p)}{p(1-p)}$ | $1-p+pe^t$ |
| Binomial($n,p$) | $np$ | $np(1-p)$ | $\frac{1-2p}{\sqrt{np(1-p)}}$ | $\frac{1-6p(1-p)}{np(1-p)}$ | $(1-p+pe^t)^n$ |
| Poisson($\lambda$) | $\lambda$ | $\lambda$ | $1/\sqrt{\lambda}$ | $1/\lambda$ | $e^{\lambda(e^t-1)}$ |
| Geometric($p$) | $1/p$ | $(1-p)/p^2$ | $\frac{2-p}{\sqrt{1-p}}$ | $6 + p^2/(1-p)$ | $\frac{pe^t}{1-(1-p)e^t}$, $t<-\log(1-p)$ |
| Uniform($a,b$) | $(a+b)/2$ | $(b-a)^2/12$ | $0$ | $-6/5$ | $\frac{e^{tb}-e^{ta}}{t(b-a)}$ |
| Normal($\mu,\sigma^2$) | $\mu$ | $\sigma^2$ | $0$ | $0$ | $e^{\mu t+\sigma^2 t^2/2}$ |
| Exponential($\lambda$) | $1/\lambda$ | $1/\lambda^2$ | $2$ | $6$ | $\frac{\lambda}{\lambda-t}$, $t<\lambda$ |
| Gamma($\alpha,\beta$) | $\alpha/\beta$ | $\alpha/\beta^2$ | $2/\sqrt{\alpha}$ | $6/\alpha$ | $\left(\frac{\beta}{\beta-t}\right)^\alpha$, $t<\beta$ |
| Beta($\alpha,\beta$) | $\frac{\alpha}{\alpha+\beta}$ | $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ | $\frac{2(\beta-\alpha)\sqrt{\alpha+\beta+1}}{(\alpha+\beta+2)\sqrt{\alpha\beta}}$ | complex | no closed form |
| Student-$t$($\nu$) | $0$ ($\nu>1$) | $\frac{\nu}{\nu-2}$ ($\nu>2$) | $0$ ($\nu>3$) | $\frac{6}{\nu-4}$ ($\nu>4$) | does not exist |

Notes:

  • Student-$t$ has no MGF (heavy tails cause $\mathbb{E}[e^{tX}] = \infty$ for all $t \neq 0$).
  • Beta distribution skewness is zero iff $\alpha = \beta$ (symmetric); negative when $\alpha > \beta$, positive when $\alpha < \beta$.
  • Poisson has equal mean and variance - a property used to test whether count data follows a Poisson (overdispersion: $\text{Var}(X) > \mathbb{E}[X]$ means extra variability, e.g., a negative binomial fits better).

Appendix D - The Exponential Family and Moments

Many common distributions belong to the exponential family, which has an elegant connection between natural parameters and moments.

D.1 Exponential Family Form

A distribution belongs to the exponential family if its density can be written as:

$$p(x;\eta) = h(x)\exp\left(\eta^\top T(x) - A(\eta)\right)$$

where:

  • $\eta$ = natural parameter vector
  • $T(x)$ = sufficient statistic vector
  • $A(\eta)$ = log-partition function (log-normaliser)
  • $h(x)$ = base measure

D.2 Moments from the Log-Partition Function

Theorem. For an exponential family distribution:

$$\mathbb{E}[T(X)] = \nabla_\eta A(\eta), \qquad \text{Cov}(T(X)) = \nabla^2_\eta A(\eta)$$

Proof sketch. Differentiating $\int p(x;\eta)\,dx = 1$ with respect to $\eta$:

$$0 = \int h(x)\,e^{\eta^\top T(x) - A(\eta)}\left(T(x) - \nabla A(\eta)\right)dx = \mathbb{E}[T(X)] - \nabla A(\eta)$$

Therefore $\mathbb{E}[T(X)] = \nabla A(\eta)$. Differentiating again gives the covariance formula. $\square$

Examples:

| Distribution | $\eta$ | $T(x)$ | $A(\eta)$ | $\mathbb{E}[T(X)] = \nabla A$ |
|---|---|---|---|---|
| Bernoulli($p$) | $\log\frac{p}{1-p}$ | $x$ | $\log(1+e^\eta)$ | $\frac{e^\eta}{1+e^\eta} = p$ |
| Gaussian($\mu,\sigma^2$) | $\left(\frac{\mu}{\sigma^2},\, -\frac{1}{2\sigma^2}\right)$ | $(x, x^2)$ | $-\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)$ | $(\mu,\ \sigma^2+\mu^2)$ |
| Poisson($\lambda$) | $\log\lambda$ | $x$ | $e^\eta = \lambda$ | $e^\eta = \lambda$ |

For AI: The log-partition function $A(\eta)$ is the free energy of the exponential family. Its gradient gives the expected sufficient statistics (moments); its Hessian gives the Fisher information matrix. In variational inference, the ELBO is optimised over the natural parameters $\eta$ of the approximate posterior; the optimal $\eta$ satisfies $\nabla A(\eta) = \mathbb{E}_{p}[T(X)]$ (moment matching). This is the connection between maximum entropy, sufficient statistics, and the moments derived in Section6.
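A tiny numeric check of the theorem for the Bernoulli row (a sketch; the finite-difference step is an arbitrary choice):

```python
import numpy as np

def A(eta):
    # Bernoulli log-partition function: A(eta) = log(1 + e^eta).
    return np.log1p(np.exp(eta))

p = 0.3
eta = np.log(p / (1 - p))   # natural parameter

h = 1e-4
dA = (A(eta + h) - A(eta - h)) / (2 * h)             # ~ E[T(X)] = p
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2  # ~ Var(T(X)) = p(1-p)
print(dA, d2A)   # ~ 0.3, 0.21
```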


Appendix E - Convergence in Probability and L^2 Convergence

The sample mean $\bar{X}_N = \frac{1}{N}\sum_{i=1}^N X_i$ estimates $\mathbb{E}[X]$. In what sense does this estimate converge?

E.1 L^2 Convergence (Mean Square Convergence)

$\bar{X}_N$ converges to $\mathbb{E}[X]$ in the mean-square ($L^2$) sense:

$$\mathbb{E}[(\bar{X}_N - \mathbb{E}[X])^2] = \text{Var}(\bar{X}_N) = \frac{\text{Var}(X)}{N} \to 0$$

This is an immediate consequence of the variance formula for sample means (Section4.4). The sample mean's MSE decreases at rate $1/N$.

E.2 Weak Law of Large Numbers (Preview)

The Weak LLN (proved in Section06 using characteristic functions or Chebyshev's inequality) states that for iid $X_i$ with finite mean $\mu$:

$$\bar{X}_N \xrightarrow{P} \mu \quad \text{as } N \to \infty$$

i.e., $P(|\bar{X}_N - \mu| > \varepsilon) \to 0$ for any $\varepsilon > 0$.

Proof via Chebyshev (when $\text{Var}(X) < \infty$):

$$P(|\bar{X}_N - \mu| > \varepsilon) \leq \frac{\text{Var}(\bar{X}_N)}{\varepsilon^2} = \frac{\text{Var}(X)}{N\varepsilon^2} \to 0$$

This is a direct application of Chebyshev's inequality (previewed in Section7.5) to the sample mean. The LLN justifies a familiar practice: if we evaluate a neural network on many iid batches and average the loss estimates, the average converges to the true expected loss.
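A quick simulation comparing the actual tail probability with the Chebyshev bound (a sketch; the replication count and $\varepsilon$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, var, eps, reps = 0.0, 1.0, 0.1, 5_000

for N in [10, 100, 1_000]:
    xbar = rng.normal(mu, np.sqrt(var), size=(reps, N)).mean(axis=1)
    empirical = np.mean(np.abs(xbar - mu) > eps)   # P(|Xbar - mu| > eps)
    bound = var / (N * eps**2)                     # Chebyshev bound (can exceed 1)
    print(N, empirical, min(bound, 1.0))
```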

-> Full treatment of LLN and CLT: Section06 Stochastic Processes


Appendix F - Conditional Expectation as Projection

The conditional expectation $\mathbb{E}[Y|X]$ can be understood geometrically as an orthogonal projection in the Hilbert space $L^2(\Omega, P)$ of square-integrable random variables.

Inner product: $\langle X, Y \rangle = \mathbb{E}[XY]$

Projection: $\mathbb{E}[Y|\mathcal{G}]$ (conditional expectation with respect to a sub-$\sigma$-algebra $\mathcal{G}$) is the projection of $Y$ onto the closed subspace of $\mathcal{G}$-measurable random variables.

Geometric interpretation: Among all $\mathcal{G}$-measurable approximations to $Y$ (i.e., functions of $X$ when $\mathcal{G} = \sigma(X)$), the conditional expectation $\mathbb{E}[Y|X]$ minimises the $L^2$ distance $\mathbb{E}[(Y-g(X))^2]$. This is the projection theorem: the best approximation is the projection, and the residual $Y - \mathbb{E}[Y|X]$ is orthogonal to every $\mathcal{G}$-measurable random variable:

$$\mathbb{E}[(Y - \mathbb{E}[Y|X]) \cdot g(X)] = 0 \quad \text{for all measurable } g$$

Consequence: The tower property $\mathbb{E}[\mathbb{E}[Y|X]] = \mathbb{E}[Y]$ is the projection version of the law of total expectation: projecting $Y$ onto the subspace of functions of $X$ and then onto the constants gives the same result as projecting directly onto the constants, and the projection onto the constants is exactly the expectation.

For AI: This projection view clarifies why neural networks trained with MSE loss approximate $\mathbb{E}[Y|X]$: the network is learning the projection of $Y$ onto the subspace of functions representable by the architecture. Deeper networks can represent larger subspaces, hence better approximations to the conditional expectation.
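A small numeric illustration of the orthogonality property, with least squares over a finite basis standing in for the $\mathcal{G}$-measurable subspace (a sketch; the basis $\{1, x, x^2\}$ is an arbitrary choice that happens to contain the true conditional mean):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100_000)
y = x**2 + rng.normal(size=x.size)       # E[Y|X=x] = x^2

# Project Y onto the span of {1, x, x^2} via least squares.
G = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(G, y, rcond=None)
proj = G @ coef                           # sample version of E[Y|X]

residual = y - proj
print(G.T @ residual / x.size)            # ~ [0, 0, 0]: residual orthogonal to subspace
print(coef)                               # ~ [0, 0, 1]: recovers x^2
```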


Appendix G - Worked Problems: Moments and Inequalities

G.1 Computing the Moments of the Beta Distribution

The Beta($\alpha, \beta$) distribution has PDF $f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}$ on $[0,1]$, where $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$.

Raw moment: Using the Beta function definition:

$$\mathbb{E}[X^k] = \int_0^1 x^k \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\,dx = \frac{B(\alpha+k, \beta)}{B(\alpha,\beta)} = \frac{\Gamma(\alpha+k)\,\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\alpha+\beta+k)}$$

$$= \frac{(\alpha+k-1)(\alpha+k-2)\cdots\alpha}{(\alpha+\beta+k-1)(\alpha+\beta+k-2)\cdots(\alpha+\beta)} = \prod_{j=0}^{k-1}\frac{\alpha+j}{\alpha+\beta+j}$$

First moment: $\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta}$.

Second moment: $\mathbb{E}[X^2] = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}$.

Variance:

$$\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)} - \frac{\alpha^2}{(\alpha+\beta)^2} = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

AI connection: The Beta distribution is the conjugate prior for the Bernoulli/Binomial likelihood. After observing $s$ successes and $f$ failures, the posterior is Beta($\alpha+s$, $\beta+f$). The posterior mean is $\frac{\alpha+s}{\alpha+\beta+s+f}$ - a weighted average of the prior mean $\frac{\alpha}{\alpha+\beta}$ and the sample mean $\frac{s}{s+f}$. As $s+f \to \infty$, the posterior mean converges to the sample mean (data dominates the prior).
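A minimal sketch of this conjugate update (the prior pseudo-counts and true success probability are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
alpha0, beta0 = 2.0, 2.0                 # Beta prior pseudo-counts
p_true = 0.7

flips = rng.random(500) < p_true
s, f = int(flips.sum()), int((~flips).sum())

prior_mean = alpha0 / (alpha0 + beta0)
sample_mean = s / (s + f)
post_mean = (alpha0 + s) / (alpha0 + beta0 + s + f)
print(prior_mean, sample_mean, post_mean)  # posterior sits between, close to the data
```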

G.2 Proving $H(p) \leq \log K$ by Jensen

For a probability distribution $p = (p_1,\ldots,p_K)$ with $p_k > 0$ and $\sum_k p_k = 1$, the Shannon entropy is $H(p) = -\sum_k p_k \log p_k$.

Proof that $H(p) \leq \log K$:

Apply Jensen's inequality with the concave function log\log:

$$H(p) = \sum_k p_k \log \frac{1}{p_k} = \mathbb{E}_{k \sim p}\left[\log \frac{1}{p_k}\right] \leq \log \mathbb{E}_{k\sim p}\left[\frac{1}{p_k}\right] = \log \sum_k p_k \cdot \frac{1}{p_k} = \log K$$

Equality holds iff $1/p_k$ is constant for all $k$ (Jensen holds with equality iff the argument is constant), i.e., $p_k = 1/K$ (uniform). $\square$

Alternative proof via KL divergence:

$$H(p) - \log K = -\sum_k p_k \log p_k - \log K = \sum_k p_k\left(\log\frac{1}{K} - \log p_k\right) = -\text{KL}(p \,\|\, u) \leq 0$$

where $u = (1/K,\ldots,1/K)$ is the uniform distribution. KL non-negativity gives $H(p) \leq \log K$.
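A quick numeric confirmation (a sketch with $K = 4$):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

K = 4
print(entropy(np.full(K, 1 / K)), np.log(K))  # uniform attains log K ~ 1.386
print(entropy([0.7, 0.1, 0.1, 0.1]))          # any non-uniform p is strictly below
```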

G.3 Adam Bias Correction: Full Derivation

At step $t$, the Adam first-moment accumulator is:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$$

Unrolling from $m_0 = 0$:

$$m_t = (1-\beta_1)\sum_{i=1}^t \beta_1^{t-i} g_i$$

Taking expectations (assuming iid gradients with constant mean $\mathbb{E}[g_i] = \mu_g$):

$$\mathbb{E}[m_t] = (1-\beta_1)\mu_g \sum_{i=1}^t \beta_1^{t-i} = (1-\beta_1)\mu_g \cdot \frac{1-\beta_1^t}{1-\beta_1} = (1-\beta_1^t)\mu_g$$

The bias is $\mathbb{E}[m_t] - \mu_g = -\beta_1^t \mu_g$, which decays to zero geometrically. The bias-corrected estimate $\hat{m}_t = m_t / (1-\beta_1^t)$ satisfies $\mathbb{E}[\hat{m}_t] = \mu_g$ (unbiased).

Similarly for $v_t$: the debiased $\hat{v}_t = v_t/(1-\beta_2^t)$ estimates $\mathbb{E}[g_t^2]$ unbiasedly.

In early training ($t$ small), $1-\beta_1^t \approx (1-\beta_1)t$ is small, so the correction $\hat{m}_t \approx m_t/((1-\beta_1)t)$ greatly amplifies $m_t$. Without bias correction, Adam would take tiny steps at the start, because $m_t \approx (1-\beta_1)g_1$ after one step. With bias correction, $\hat{m}_1 = g_1$ - the first step uses the actual gradient.
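A minimal sketch of the accumulator with and without correction (the gradient noise scale is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(7)
beta1, mu_g = 0.9, 1.0
m = 0.0

for t in range(1, 21):
    g = mu_g + 0.1 * rng.normal()      # noisy gradient with mean mu_g
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1**t)         # bias-corrected estimate
    if t in (1, 5, 10, 20):
        print(t, round(m, 3), round(m_hat, 3))
# Raw m starts near (1 - beta1) * g ~ 0.1; m_hat tracks mu_g ~ 1.0 from step 1.
```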

G.4 Conditional Expectation: Gaussian Case

Let $(X,Y) \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ with $\boldsymbol{\mu} = (\mu_X, \mu_Y)$ and $\Sigma = \begin{pmatrix}\sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2\end{pmatrix}$.

Conditional distribution (from the Schur complement formula of Section03):

$$Y \mid X=x \sim \mathcal{N}\!\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X),\; \sigma_Y^2(1-\rho^2)\right)$$

Therefore:

$$\mathbb{E}[Y|X=x] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X), \qquad \text{Var}(Y|X) = \sigma_Y^2(1-\rho^2)$$

Verification via law of total variance:

$$\text{Var}(Y) = \mathbb{E}[\text{Var}(Y|X)] + \text{Var}(\mathbb{E}[Y|X]) = \sigma_Y^2(1-\rho^2) + \text{Var}\!\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(X-\mu_X)\right)$$

$$= \sigma_Y^2(1-\rho^2) + \rho^2\frac{\sigma_Y^2}{\sigma_X^2}\text{Var}(X) = \sigma_Y^2(1-\rho^2) + \rho^2\sigma_Y^2 = \sigma_Y^2 \quad \checkmark$$

The variance of the conditional mean contributes $\rho^2\sigma_Y^2$ (explained variance), and the within-group variance contributes $(1-\rho^2)\sigma_Y^2$ (unexplained variance). The fraction $\rho^2$ of $\text{Var}(Y)$ explained by $X$ is the coefficient of determination $R^2$ of linear regression.
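A Monte Carlo check of this decomposition (a sketch with arbitrary $\rho$, $\sigma_X$, $\sigma_Y$ and zero means):

```python
import numpy as np

rng = np.random.default_rng(8)
rho, sx, sy = 0.6, 1.0, 2.0
cov = [[sx**2, rho * sx * sy],
       [rho * sx * sy, sy**2]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

cond_mean = rho * (sy / sx) * x              # E[Y|X] with zero means
explained = cond_mean.var()                  # ~ rho^2 * sy^2 = 1.44
unexplained = (y - cond_mean).var()          # ~ (1 - rho^2) * sy^2 = 2.56
print(explained, unexplained, explained + unexplained, y.var())  # sums to Var(Y)
```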


Appendix H - Notation Summary

| Symbol | Meaning | First defined |
|---|---|---|
| $\mathbb{E}[X]$ | Expected value of $X$ | Section2.1 |
| $\mu_X$ or $\mu$ | Mean (expected value) | Section2.1 |
| $\text{Var}(X)$ or $\sigma_X^2$ | Variance of $X$ | Section3.1 |
| $\sigma_X$ | Standard deviation of $X$ | Section3.2 |
| $\mu'_k = \mathbb{E}[X^k]$ | $k$-th raw moment | Section3.3 |
| $\mu_k = \mathbb{E}[(X-\mu)^k]$ | $k$-th central moment | Section3.3 |
| $\gamma_1 = \mu_3/\sigma^3$ | Skewness | Section3.4 |
| $\gamma_2 = \mu_4/\sigma^4 - 3$ | Excess kurtosis | Section3.5 |
| $\text{Cov}(X,Y)$ | Covariance | Section4.1 |
| $\rho_{XY}$ | Pearson correlation | Section4.2 |
| $\mathbb{E}[Y \mid X=x]$ | Conditional expectation (function of $x$) | Section5.1 |
| $\mathbb{E}[Y \mid X]$ | Conditional expectation (random variable) | Section5.1 |
| $\text{Var}(Y \mid X)$ | Conditional variance | Section5.3 |
| $M_X(t) = \mathbb{E}[e^{tX}]$ | Moment generating function | Section6.1 |
| $\varphi_X(t) = \mathbb{E}[e^{itX}]$ | Characteristic function | Section6.4 |
| $K_X(t) = \log M_X(t)$ | Cumulant generating function | Section6.5 |
| $\kappa_k$ | $k$-th cumulant | Section6.5 |
| $\mathcal{L}(\phi,\theta;x)$ | ELBO (evidence lower bound) | Section9.2 |
| $\text{KL}(p \,\Vert\, q)$ | KL divergence | Section7.2 |
| $H(p)$ | Shannon entropy | Section9.1 |

Appendix I - Quick Reference: Key Formulas

Expectation:

$$\mathbb{E}[aX+bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y] \quad (\text{always})$$

$$\mathbb{E}[g(X)] = \int g(x)f_X(x)\,dx \quad (\text{LOTUS})$$

Variance:

$$\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2, \qquad \text{Var}(aX+b) = a^2\,\text{Var}(X)$$

$$\text{Var}(X+Y) = \text{Var}(X) + 2\,\text{Cov}(X,Y) + \text{Var}(Y)$$

Conditional expectation:

$$\mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y|X]] \quad (\text{tower property})$$

$$\text{Var}(Y) = \mathbb{E}[\text{Var}(Y|X)] + \text{Var}(\mathbb{E}[Y|X]) \quad (\text{total variance})$$

MGF:

$$M_X^{(k)}(0) = \mathbb{E}[X^k], \qquad M_{X+Y}(t) = M_X(t)\,M_Y(t) \;\; (X\perp Y)$$

Jensen's inequality:

$$f \text{ convex} \implies f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)], \qquad \text{KL}(p\|q) \geq 0 \quad (\text{Gibbs' inequality, via Jensen})$$

Cauchy-Schwarz:

$$(\mathbb{E}[XY])^2 \leq \mathbb{E}[X^2]\,\mathbb{E}[Y^2], \qquad |\rho_{XY}| \leq 1$$

Bias-Variance:

$$\mathbb{E}[(Y-\hat{f})^2] = \text{Bias}^2(\hat{f}) + \text{Var}(\hat{f}) + \sigma^2$$

Adam:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t, \qquad \hat{m}_t = m_t/(1-\beta_1^t) \approx \mathbb{E}[g_t]$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2, \qquad \hat{v}_t = v_t/(1-\beta_2^t) \approx \mathbb{E}[g_t^2]$$

Appendix J - Common Mistakes: Extended Examples

J.1 Jensen's Direction: A Costly Error

The most frequent mistake when using Jensen's inequality is applying it in the wrong direction. The rule is simple: for convex $f$, $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$; for concave $f$, the inequality reverses.

Common confusion: $\sqrt{\cdot}$ vs. $(\cdot)^2$.

  • $f(x) = x^2$ is convex ($f''=2>0$). Jensen: $(\mathbb{E}[X])^2 \leq \mathbb{E}[X^2]$. Equivalently, $\text{Var}(X) = \mathbb{E}[X^2]-(\mathbb{E}[X])^2 \geq 0$. ✓
  • $f(x) = \sqrt{x}$ is concave ($f''=-\frac{1}{4}x^{-3/2}<0$). Jensen: $\sqrt{\mathbb{E}[X]} \geq \mathbb{E}[\sqrt{X}]$ for $X \geq 0$.

A model that minimises $\mathbb{E}[\sqrt{\text{loss}}]$ is NOT the same as one that minimises $\sqrt{\mathbb{E}[\text{loss}]}$. The former cares about the average square-root loss, the latter about the square root of the average loss. In practice this distinction matters for robust loss functions.

Common confusion: $\log$ vs. $\exp$.

  • $f(x) = \log x$ is concave -> $\log\mathbb{E}[X] \geq \mathbb{E}[\log X]$. This is the key inequality behind the ELBO derivation.
  • $f(x) = e^x$ is convex -> $e^{\mathbb{E}[X]} \leq \mathbb{E}[e^X]$. This is the key inequality for bounding the MGF.

The ELBO derivation applies Jensen with $\log$ (concave): $\log \mathbb{E}[Z] \geq \mathbb{E}[\log Z]$ where $Z = p_\theta(x,z)/q_\phi(z|x)$. Students often apply it in the wrong direction, obtaining an upper bound instead of a lower bound.
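All three directions are easy to confirm numerically (a sketch using a positive random variable so that $\sqrt{\cdot}$ and $\log$ are defined):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.exponential(2.0, size=1_000_000)     # positive random variable

print(x.mean()**2, (x**2).mean())            # (E[X])^2 <= E[X^2]
print(np.sqrt(x.mean()), np.sqrt(x).mean())  # sqrt(E[X]) >= E[sqrt(X)]
print(np.log(x.mean()), np.log(x).mean())    # log(E[X]) >= E[log(X)]
```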

J.2 The Tower Property Subtlety: Nested Conditioning

The tower property states $\mathbb{E}[\mathbb{E}[Y|X]] = \mathbb{E}[Y]$. A more general version for nested conditioning: for $\sigma$-algebras $\mathcal{G}_1 \subset \mathcal{G}_2$:

$$\mathbb{E}[\mathbb{E}[Y|\mathcal{G}_2]\,|\,\mathcal{G}_1] = \mathbb{E}[Y|\mathcal{G}_1]$$

Conditioning an already-conditioned variable on less information keeps only the coarser conditioning.

Example: Let $Z$ be a sufficient statistic for $\theta$ given data $X_1,\ldots,X_n$. Then:

$$\mathbb{E}[\hat{\theta}|\theta] = \mathbb{E}[\mathbb{E}[\hat{\theta}|Z]\,|\,\theta] = \mathbb{E}[\tilde{\theta}|\theta]$$

where $\tilde{\theta} = \mathbb{E}[\hat{\theta}|Z]$ is the Rao-Blackwellised estimator. The outer expectation equals the original (tower property), but the Rao-Blackwellised estimator has smaller variance. This is NOT circular - it is the statement that smoothing $\hat{\theta}$ via conditioning on $Z$ doesn't change its mean.

J.3 Correlation Does Not Imply Causation - A Statistical View

Zero correlation means $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$ - no linear relationship. But:

  1. There may be nonlinear dependence (Section4.3 example: $Y = X^2$).
  2. Even positive correlation may arise from a common cause (confounding): if $Z \to X$ and $Z \to Y$ (the fork structure from Section03), then $X$ and $Y$ are correlated even if neither causes the other.

In ML: the correlation between model predictions and labels on the test set measures linear predictive ability, not causal understanding. A model can achieve high correlation while exploiting spurious features (shortcuts). For example, language models correlate "hospital" with "disease" not through causal understanding but through co-occurrence patterns.

Conditional independence as the resolution: If $Z$ is the common cause (confounder), $X$ and $Y$ may be conditionally independent given $Z$: $\text{Cov}(X,Y|Z) = 0$ even though $\text{Cov}(X,Y) \neq 0$. Adjusting for confounders (by conditioning or via instrumental variables) is the statistical approach to causal inference.


Appendix K - Information-Theoretic View of Moments

K.1 Entropy as Negative Expected Log-Probability

The Shannon entropy of a discrete distribution pp is:

$$H(p) = -\mathbb{E}_p[\log p(X)] = -\sum_k p_k \log p_k$$

This is simply minus the expected log-probability. For a continuous distribution with PDF $f$, the differential entropy is:

$$h(f) = -\mathbb{E}_f[\log f(X)] = -\int f(x)\log f(x)\,dx$$

Gaussian maximises differential entropy. Among all distributions on $\mathbb{R}$ with fixed mean $\mu$ and variance $\sigma^2$, the Gaussian $\mathcal{N}(\mu, \sigma^2)$ maximises differential entropy:

$$h(\mathcal{N}(\mu,\sigma^2)) = \frac{1}{2}\log(2\pi e \sigma^2)$$

Proof sketch (via KL divergence): For any distribution $f$ with mean $\mu$ and variance $\sigma^2$, let $g = \mathcal{N}(\mu,\sigma^2)$.

$$0 \leq \text{KL}(f\|g) = \int f\log\frac{f}{g} = -h(f) - \int f\log g$$

Since $\log g(x) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$, and $\int f(x) \cdot \frac{(x-\mu)^2}{2\sigma^2}\,dx = \frac{\sigma^2}{2\sigma^2} = \frac{1}{2}$ (since $f$ has variance $\sigma^2$):

$$\int f\log g = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2} = -h(g)$$

Therefore $0 \leq -h(f) + h(g)$, i.e., $h(f) \leq h(g)$. Equality holds iff $f = g$ (since $\text{KL}=0$ iff the distributions are equal). $\square$

AI connection: This maximum entropy property explains why the Gaussian prior is so common in Bayesian machine learning: given a known mean and variance (from domain knowledge), the Gaussian is the least informative (maximum entropy) prior consistent with those constraints. It makes the fewest additional assumptions about the distribution.

K.2 Mutual Information as Expected KL Divergence

The mutual information between $X$ and $Y$ is:

$$I(X;Y) = \text{KL}(p_{X,Y} \,\|\, p_X \otimes p_Y) = \mathbb{E}_{(X,Y)}\left[\log\frac{p_{X,Y}(X,Y)}{p_X(X)\,p_Y(Y)}\right]$$

This is the KL divergence between the joint distribution and the product of the marginals. Since $\text{KL} \geq 0$: $I(X;Y) \geq 0$, with equality iff $X \perp Y$.

Relationship to conditional entropy:

$$I(X;Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)$$

where $H(Y|X) = \mathbb{E}_X[H(Y|X=x)]$ is the conditional entropy (the tower property applied to entropy).

For AI: Mutual information is the gold standard for measuring statistical dependence. It captures all forms of dependence (linear and nonlinear), unlike correlation. Contrastive learning methods (SimCLR, CLIP) can be viewed as maximising a lower bound on $I(\text{view}_1; \text{view}_2)$ - encouraging representations to capture information shared between different views/modalities. Information bottleneck methods (used for analysing neural networks) study the trade-off between compressing $X$ (low $I(Z;X)$) and preserving information about $Y$ (high $I(Z;Y)$).
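For discrete variables with a known joint PMF, mutual information is a short computation (a sketch; the joint table is an arbitrary example of dependent binary variables):

```python
import numpy as np

# Joint PMF of two binary variables (rows: X, columns: Y).
pxy = np.array([[0.30, 0.10],
                [0.15, 0.45]])
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)

mi = np.sum(pxy * np.log(pxy / (px * py)))  # KL(p_XY || p_X p_Y)
print(mi)   # > 0, since X and Y are dependent
```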


Appendix L - The Reparameterisation Trick: Full Mathematical Treatment

The reparameterisation trick is a technique for computing gradients of expectations when the distribution depends on the parameters we differentiate with respect to.

L.1 The Problem: Gradient Through Sampling

We want $\nabla_\phi \mathbb{E}_{z \sim q_\phi}[f(z)]$, where $q_\phi$ is a distribution parameterised by $\phi$. Naively:

$$\nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \nabla_\phi \int f(z)\, q_\phi(z)\,dz = \int f(z)\, \nabla_\phi q_\phi(z)\,dz$$

This requires computing $\nabla_\phi q_\phi(z)$ for each $z$, and the integrand is no longer an expectation under $q_\phi$ - the measure itself changes with $\phi$ - so it cannot be estimated by simply sampling $z \sim q_\phi$ and averaging.

L.2 REINFORCE (Score Function Estimator)

The score function trick uses $\nabla_\phi q_\phi = q_\phi \nabla_\phi \log q_\phi$:

$$\nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \int f(z)\, q_\phi(z)\, \nabla_\phi \log q_\phi(z)\,dz = \mathbb{E}_{q_\phi}[f(z)\, \nabla_\phi \log q_\phi(z)]$$

This is computable by Monte Carlo: sample $z \sim q_\phi$, compute $f(z)\,\nabla_\phi \log q_\phi(z)$, and average over samples. However, this estimator has high variance, because $f(z)$ can be large and $\nabla_\phi \log q_\phi(z)$ can vary greatly.

L.3 Reparameterisation

When $q_\phi$ is a location-scale family (or, more generally, when there exists a deterministic transformation $g(\phi, \varepsilon)$):

$$z = g(\phi, \varepsilon), \quad \varepsilon \sim p(\varepsilon) \quad \text{(fixed distribution, independent of } \phi\text{)}$$

For the Gaussian: $z = \mu_\phi + \sigma_\phi \odot \varepsilon$ with $\varepsilon \sim \mathcal{N}(0,I)$.

Then:

$$\mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{p(\varepsilon)}[f(g(\phi, \varepsilon))]$$

and:

$$\nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{p(\varepsilon)}[\nabla_\phi f(g(\phi, \varepsilon))] = \mathbb{E}_{p(\varepsilon)}\left[\nabla_z f(z) \cdot \nabla_\phi g(\phi,\varepsilon)\right]$$

The gradient now flows through the deterministic function $g$, enabling automatic differentiation. The variance of this estimator is typically much lower than REINFORCE's, because $\nabla_z f$ is often smoother than $f \cdot \nabla_\phi \log q_\phi$.

Jacobian of the transform:

$$\nabla_\phi g(\phi, \varepsilon)\big|_{\mu_\phi, \sigma_\phi} = \begin{pmatrix}\partial z/\partial \mu_\phi \\ \partial z/\partial \sigma_\phi\end{pmatrix} = \begin{pmatrix}I \\ \text{diag}(\varepsilon)\end{pmatrix}$$

So the gradient with respect to $\mu_\phi$ is $\mathbb{E}[\nabla_z f(z)]$ and with respect to $\sigma_\phi$ is $\mathbb{E}[\nabla_z f(z) \odot \varepsilon]$.

L.4 Why Lower Variance?

Consider $f(z) = \sin(z)$ with $z \sim \mathcal{N}(\mu, \sigma^2)$.

REINFORCE gradient (w.r.t. $\mu$): $\mathbb{E}[\sin(z) \cdot (z-\mu)/\sigma^2]$. The product $\sin(z)(z-\mu)$ can be large in magnitude.

Reparameterisation gradient: $\mathbb{E}[\cos(\mu+\sigma\varepsilon)] = \mathbb{E}[\cos(z)]$. The integrand $\cos(z)$ is bounded in $[-1,1]$, making the estimator much more stable.

In general, reparameterisation produces gradient estimates of $O(1)$ magnitude when $f$ is smooth, while REINFORCE gradients scale with the values of $f$, which can be large.
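The variance gap is easy to exhibit for this example (a sketch; both estimators below are unbiased for $\partial_\mu \mathbb{E}[\sin(z)]$):

```python
import numpy as np

rng = np.random.default_rng(10)
mu, sigma, n = 0.5, 1.0, 100_000
eps = rng.normal(size=n)
z = mu + sigma * eps

reinforce = np.sin(z) * (z - mu) / sigma**2   # score-function estimator
reparam = np.cos(mu + sigma * eps)            # pathwise (reparameterised) estimator

print(reinforce.mean(), reparam.mean())       # same gradient estimate (~0.53)
print(reinforce.var(), reparam.var())         # REINFORCE variance is far larger
```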


Appendix M - Moments in the Context of Score Matching and Diffusion

M.1 Denoising Score Matching

Diffusion models (DDPM, Score SDE) learn the score function $\nabla_x \log p_t(x)$ at each noise level $t$. The training objective is:

$$\mathcal{L}_\theta = \mathbb{E}_{t, x_0, \varepsilon}\left[\|s_\theta(x_t, t) - \nabla_{x_t}\log p_t(x_t|x_0)\|^2\right]$$

where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon$ and $\varepsilon \sim \mathcal{N}(0,I)$.

The conditional score is:

$$\nabla_{x_t}\log p_t(x_t|x_0) = -\frac{\varepsilon}{\sqrt{1-\bar{\alpha}_t}}$$

This is an expectation over the noise schedule. The loss simplifies to:

$$\mathcal{L}_\theta = \mathbb{E}_{t,\varepsilon}\left[\|\varepsilon_\theta(x_t,t) - \varepsilon\|^2\right]$$

which is an MSE loss - an empirical estimate of $\mathbb{E}[(Y-f(X))^2]$, where $Y$ is the added noise $\varepsilon$ and $f = \varepsilon_\theta$.

M.2 First Moment of the Reverse Process

The reverse diffusion process gives $p_\theta(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta(x_t,t), \sigma_t^2 I)$, where:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon_\theta(x_t, t)\right)$$

The mean of the reverse step is determined by the predicted noise. Inverting the forward reparameterisation with the predicted noise gives the denoised estimate $\hat{x}_0$ of the clean sample:

$$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\varepsilon_\theta(x_t,t)}{\sqrt{\bar{\alpha}_t}}$$

This is LOTUS applied to the reparameterised relationship: the expected clean image given the noisy image is a simple linear function of the predicted noise, evaluated using the noisy image.
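A toy check of this inversion, assuming a perfect noise predictor (a sketch; the noise level and the data are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(11)
alpha_bar_t = 0.5                    # assumed cumulative noise level at step t
x0 = rng.normal(size=1_000)          # stand-in "clean data"
eps = rng.normal(size=1_000)

x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps

# If eps_theta predicted the noise exactly, x0 would be recovered exactly.
x0_hat = (x_t - np.sqrt(1 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
print(np.max(np.abs(x0_hat - x0)))   # ~ 0 (floating-point error only)
```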


Appendix N - Worked Problems: Bias-Variance and Regularisation

N.1 Ridge Regression Bias-Variance Tradeoff

Setup. Data: $y = X\theta^* + \varepsilon$, where $X \in \mathbb{R}^{n\times d}$ and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$.

Ridge estimator: $\hat{\theta}_\lambda = (X^\top X + \lambda I)^{-1}X^\top y$.

Let $A_\lambda = (X^\top X + \lambda I)^{-1}X^\top$.

Bias: $\text{Bias}(\hat{\theta}_\lambda) = \mathbb{E}[\hat{\theta}_\lambda] - \theta^* = A_\lambda X\theta^* - \theta^* = (A_\lambda X - I)\theta^*$.

Since $A_\lambda X = (X^\top X + \lambda I)^{-1}X^\top X$:

$$A_\lambda X - I = (X^\top X + \lambda I)^{-1}X^\top X - I = -\lambda(X^\top X + \lambda I)^{-1}$$

So $\text{Bias}(\hat{\theta}_\lambda) = -\lambda(X^\top X + \lambda I)^{-1}\theta^*$. The bias grows with $\lambda$.

Variance: $\text{Var}(\hat{\theta}_\lambda) = A_\lambda \text{Var}(y) A_\lambda^\top = \sigma^2 A_\lambda A_\lambda^\top = \sigma^2(X^\top X+\lambda I)^{-1}X^\top X(X^\top X+\lambda I)^{-1}$.

As $\lambda \to \infty$: $(X^\top X+\lambda I)^{-1} \to 0$, so the variance $\to 0$. As $\lambda \to 0$: we recover the OLS variance $\sigma^2(X^\top X)^{-1}$.

Total MSE:

$$\text{MSE} = \|\text{Bias}\|^2 + \text{tr}(\text{Var}) = \lambda^2\, \theta^{*\top}(X^\top X+\lambda I)^{-2}\theta^* + \sigma^2\,\text{tr}\!\left((X^\top X+\lambda I)^{-1}X^\top X(X^\top X+\lambda I)^{-1}\right)$$

The optimal λ\lambda minimises this sum - bias grows, variance shrinks, and there is a sweet spot.
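The sweet spot is visible by evaluating the two terms on a synthetic problem (a sketch; the dimensions and $\lambda$ grid are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(12)
n, d, sigma = 50, 10, 1.0
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
XtX = X.T @ X

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    M = np.linalg.inv(XtX + lam * np.eye(d))
    bias2 = lam**2 * theta_star @ M @ M @ theta_star    # ||Bias||^2
    var = sigma**2 * np.trace(M @ XtX @ M)              # tr(Var)
    print(f"lambda={lam:6.1f}  bias^2={bias2:8.4f}  var={var:7.4f}  mse={bias2 + var:7.4f}")
```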

N.2 Neural Network Initialisation via Variance Propagation

He initialisation (for ReLU networks) is derived from a variance-propagation analysis.

For a layer $h_j^{[l]} = \text{ReLU}\!\left(\sum_{i} W_{ji}^{[l]} h_i^{[l-1]}\right)$, with iid inputs $h_i^{[l-1]} \sim (0, v_{l-1})$ (zero mean, variance $v_{l-1}$) and iid weights $W_{ji} \sim (0, \sigma_W^2)$ independent of the inputs:

Variance propagation:

$$\text{Var}(h_j^{[l]}) = n_{l-1} \cdot \sigma_W^2 \cdot \mathbb{E}[\text{ReLU}(z)^2]$$

where $z \sim \mathcal{N}(0, v_{l-1})$. For ReLU: $\mathbb{E}[\text{ReLU}(z)^2] = \text{Var}(z)/2 = v_{l-1}/2$ (since ReLU zeroes half of a symmetric distribution).

Therefore $v_l = n_{l-1} \cdot \sigma_W^2 \cdot v_{l-1}/2$.

For the variance to stay constant across layers, $v_l = v_{l-1}$ requires $\sigma_W^2 = 2/n_{l-1}$.

He initialisation: $W \sim \mathcal{N}(0, 2/n_\text{in})$. This preserves activation variance through the network during the forward pass, preventing activations (and hence gradients) from vanishing or exploding in deep networks.
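A forward-pass simulation showing why the factor of 2 matters (a sketch; width, depth, and batch size are arbitrary). The activation second moment stays near 1 under He scaling and collapses geometrically under a naive $1/n$ scaling:

```python
import numpy as np

rng = np.random.default_rng(13)
n, depth = 512, 30
h0 = rng.normal(size=(n, 1_000))          # inputs with unit second moment

for name, std in [("He (2/n)", np.sqrt(2.0 / n)), ("naive (1/n)", np.sqrt(1.0 / n))]:
    a = h0.copy()
    for _ in range(depth):
        W = std * rng.normal(size=(n, n))
        a = np.maximum(W @ a, 0.0)        # ReLU layer
    print(name, float((a**2).mean()))     # He ~ 1; naive ~ 0.5**depth ~ 1e-9
```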


Appendix O - Self-Assessment Checklist

Use this checklist after studying the section to identify gaps before proceeding to Section05.

Core Mechanics (Should be fluent)

  • I can compute $\mathbb{E}[X]$ from a PMF or PDF using the definition.
  • I can apply LOTUS: $\mathbb{E}[g(X)] = \int g(x)f(x)\,dx$ without finding the distribution of $g(X)$.
  • I know that linearity $\mathbb{E}[aX+bY]=a\,\mathbb{E}[X]+b\,\mathbb{E}[Y]$ holds without independence.
  • I can compute $\text{Var}(X)$ using $\mathbb{E}[X^2]-(\mathbb{E}[X])^2$.
  • I know $\text{Var}(aX+b)=a^2\,\text{Var}(X)$ (a shift doesn't change variance).
  • I can compute skewness $\gamma_1=\mu_3/\sigma^3$ and excess kurtosis $\gamma_2=\mu_4/\sigma^4-3$.

Intermediate Theory (Should understand proofs)

  • I can state and prove the tower property $\mathbb{E}[\mathbb{E}[Y|X]]=\mathbb{E}[Y]$.
  • I can state and prove the law of total variance.
  • I can derive the MGF of the Gaussian, Exponential, and Poisson distributions.
  • I can state Jensen's inequality and identify when it applies ($f$ convex vs. concave).
  • I can prove $\text{KL}(p\|q)\geq 0$ using Jensen.
  • I can state and prove the Cauchy-Schwarz inequality for expectations.
  • I know that $|\rho|\leq 1$ follows from Cauchy-Schwarz.
  • I understand why zero covariance does NOT imply independence (with a counterexample).

Advanced Applications (Should be able to apply)

  • I can derive the bias-variance decomposition from first principles.
  • I can explain what double descent is and why overparameterised models can generalise.
  • I can derive the ELBO using Jensen's inequality.
  • I can explain the reparameterisation trick and why it reduces gradient variance.
  • I understand Adam as tracking first and second moments with bias correction.
  • I can explain why the score function has zero mean: $\mathbb{E}[\nabla_\theta \log p_\theta(X)]=\mathbf{0}$.

Appendix P - Further Reading and References

Textbooks

  1. Probability and Statistics for Engineering and the Sciences - Jay Devore (2015). Accessible introduction to moments, MGFs, and expectation with engineering examples.

  2. Probability Theory: The Logic of Science - E.T. Jaynes (2003). Philosophical and technical treatment; excellent on entropy and maximum entropy principle.

  3. Pattern Recognition and Machine Learning - Christopher Bishop (2006), Ch. 1-2. The ML-focused treatment of expectations, KL divergence, ELBO, and variational inference.

  4. Deep Learning - Goodfellow, Bengio, Courville (2016), Ch. 3. Standard reference for probability in ML context including bias-variance tradeoff.

  5. Probabilistic Machine Learning: An Introduction - Kevin Murphy (2022), Ch. 2-4. Modern treatment with extensive ML applications including Adam, VAEs, and diffusion models.

Papers

  1. Kingma & Welling (2014). Auto-Encoding Variational Bayes. arXiv:1312.6114. Original VAE paper; derives ELBO via Jensen and introduces reparameterisation trick.

  2. Kingma & Ba (2015). Adam: A Method for Stochastic Optimization. arXiv:1412.6980. Original Adam paper; moment tracking interpretation is explicit in the derivation.

  3. Williams (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning. REINFORCE algorithm; score function gradient estimator.

  4. Ioffe & Szegedy (2015). Batch Normalization. arXiv:1502.03167. Moment normalisation of activations; analysis of how variance affects training dynamics.

  5. Belkin et al. (2019). Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off. PNAS. Double descent phenomenon; formal analysis of why interpolating models can generalise.

  6. Song et al. (2020). Score-Based Generative Modeling through Stochastic Differential Equations. arXiv:2011.13456. Diffusion models as score-matching; score function is a moment of the distribution.


Summary of Key Results

This section established the expectation operator as the central tool in probability and machine learning. Starting from the LOTUS definition, we derived linearity (which holds without independence), the tower property (iterated expectation), and the law of total variance (which decomposes uncertainty into within-group and between-group components).

The moment hierarchy captures distribution shape: the first moment (mean) locates it; the second (variance) measures spread; the third (skewness) captures asymmetry; the fourth (kurtosis) captures tail weight. Moment generating functions encode all moments in a single analytic function and enable elegant proofs of the reproductive property (sum of independent Gaussians is Gaussian) and cumulant additivity.

Jensen's inequality - the most important inequality in this section - gives the KL divergence its non-negativity, the ELBO its existence as a lower bound, and the bias-variance decomposition its clean structure. Cauchy-Schwarz gives $|\rho| \leq 1$ and motivates the attention scaling factor $1/\sqrt{d_k}$.

Every major ML method reviewed in Section9 is an application of these results: cross-entropy is expected log-loss, the ELBO is Jensen applied to the marginal likelihood, policy gradient is the score function trick, and Adam is first-and-second-moment tracking with debiasing.

The concepts forward-referenced here - Markov's and Chebyshev's inequalities, the Law of Large Numbers, the Central Limit Theorem - will be fully developed in Section05 and Section06, completing the probabilistic toolkit required for modern ML.

The conceptual unification: expectation is a linear functional on the space of random variables. It maps random variables to real numbers while preserving linear structure. Every computation in this section - variance as $\mathbb{E}[(X-\mu)^2]$, covariance as $\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]$, the MGF as $\mathbb{E}[e^{tX}]$, the characteristic function as $\mathbb{E}[e^{itX}]$, the KL divergence as $\mathbb{E}_p[\log(p/q)]$, Adam's first moment as $\mathbb{E}[g_t]$ - is an application of this single linear functional to different functions of the random variable.

Understanding this unity transforms the apparent complexity of ML training into a coherent framework: we are always estimating expectations from samples, bounding how far sample estimates stray from true expectations, and choosing parameterisations that make those expectations tractable to compute and differentiate.

<- Back to Chapter 6: Probability Theory | Next: Concentration Inequalities ->


End of Section06/04 - Expectation and Moments

| File | Lines / Cells | Status |
|---|---|---|
| notes.md | 2000+ | ✓ Complete |
| theory.ipynb | 38 cells | ✓ Complete |
| exercises.ipynb | 27 cells | ✓ Complete |
