Math for LLMs | Probability Theory / Expectation and Moments

Expectation and Moments: Part 2 - Conceptual Bridge to Summary of Key Results

13. Conceptual Bridge

Where We Came From

This section builds directly on the foundations laid in Section01-Section03. The probability spaces and random-variable formalism of Section01 provide the objects - PDFs, PMFs, CDFs - over which we compute expectations. The named distributions of Section02 supply the examples whose moments we derive in Section6.2: the mean $1/\lambda$ and variance $1/\lambda^2$ of the exponential, the mean $np$ and variance $np(1-p)$ of the binomial, the mean $\mu$ and variance $\sigma^2$ that define the Gaussian. The joint-distribution machinery of Section03 provides the tools for conditional expectation (Section5) and the covariance matrix (Section4): the tower property is essentially iterated marginalisation, and the conditional variance formula is the law of total variance derived via joint distributions.

Where We Are Going

Section05 (Concentration Inequalities) takes the moment-bound preview from Section7.5 much further. Markov's inequality (from the mean) and Chebyshev's inequality (from the variance) are the first two results, but Section05 develops exponentially sharper bounds: Hoeffding's inequality (bounded variables), Chernoff bounds (via MGFs), and McDiarmid's inequality (functions of independent variables). These bounds are the mathematical machinery of PAC-learning generalisation theory - they quantify how many training examples are needed to guarantee that the empirical risk is close to the true risk.

Section06 (Stochastic Processes) extends expectation to sequences of random variables indexed by time. The Law of Large Numbers proves rigorously that $\bar{X}_N \to \mathbb{E}[X]$ - the sample mean converges to the expectation. The Central Limit Theorem proves that the standardised sample mean converges in distribution to $\mathcal{N}(0,1)$ - the Gaussian emerges as the universal limit of sums, with proofs via the MGF or characteristic-function techniques introduced here.

POSITION IN CHAPTER 6 CURRICULUM
========================================================================

  Section01 Probability Spaces          Section02 Distributions
  +------------------+            +------------------+
  | Kolmogorov axioms|            | Named PDFs/PMFs  |
  | Events, σ-algebra|            | Parameters       |
  | CDF, PDF, PMF    |            | Relationships    |
  +--------+---------+            +--------+---------+
           |                               |
           +--------------+----------------+
                          v
              Section03 Joint Distributions
              +----------------------+
              | Marginals            |
              | Conditionals f(y|x) |
              | MVN, Bayes, Chain   |
              +----------+----------+
                         |
                         v
          +==============================+
          |  Section04 Expectation & Moments  |  <- YOU ARE HERE
          |  E[X], Var, Cov, MGF       |
          |  Jensen, C-S, LOTUS        |
          |  Bias-Variance, Adam       |
          +==============+=============+
                         |
              +----------+----------+
              v                     v
    Section05 Concentration         Section06 Stochastic
    Inequalities              Processes
    +-------------+           +-------------+
    | Markov      |           | LLN: Xbar->E[X]|
    | Chebyshev   |           | CLT: ->N(0,1)|
    | Hoeffding   |           | Gaussian    |
    | PAC bounds  |           | processes   |
    +-------------+           +-------------+
              |                     |
              +----------+----------+
                         v
              Section07 Markov Chains
              +----------------------+
              | Transition matrices  |
              | Steady state         |
              | MCMC (Bayes sampling)|
              +----------------------+

========================================================================

The conceptual arc through Section04 is: we began with probability as a framework for describing uncertainty (Section01), learned the vocabulary of named distributions (Section02), understood how to reason about multiple random variables jointly (Section03), and now in Section04 we have the tools to summarise distributions with numbers - expectations, variances, covariances, moments - and to bound the gap between reality and our estimates. Every subsequent section will use the expectation operator as a core tool: Section05 to prove tail bounds, Section06 to formalise convergence, Section07 to analyse Markov chain stationary distributions via matrix expectations.



Appendix A - Worked Examples: Expectation and Moments

A.1 Expected Value of the Geometric Distribution

The geometric distribution counts the number of trials until the first success in a sequence of iid Bernoulli($p$) trials.

PMF: $P(X=k) = (1-p)^{k-1}p$ for $k = 1, 2, 3, \ldots$

Expected value via the definition:

$$\mathbb{E}[X] = \sum_{k=1}^\infty k(1-p)^{k-1}p = p \sum_{k=1}^\infty k q^{k-1}$$

where $q = 1-p$. Using the identity $\sum_{k=1}^\infty k q^{k-1} = \frac{d}{dq}\sum_{k=0}^\infty q^k = \frac{d}{dq}\frac{1}{1-q} = \frac{1}{(1-q)^2} = \frac{1}{p^2}$:

$$\mathbb{E}[X] = p \cdot \frac{1}{p^2} = \frac{1}{p}$$

Variance via the second moment: $\mathbb{E}[X^2] = \sum_{k=1}^\infty k^2(1-p)^{k-1}p$. Using $\sum k^2 q^{k-1} = \frac{d}{dq}\left[\sum k q^k\right] = \frac{d}{dq}\frac{q}{(1-q)^2} = \frac{1+q}{(1-q)^3}$:

$$\mathbb{E}[X^2] = p \cdot \frac{1+(1-p)}{p^3} = \frac{2-p}{p^2}$$

$$\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \frac{2-p}{p^2} - \frac{1}{p^2} = \frac{1-p}{p^2}$$

AI connection: The geometric distribution models the number of tokens generated until a specific token (e.g., end-of-sequence) appears, assuming iid generation. In practice tokens are not iid, but the geometric provides a baseline model: with a constant per-token stopping probability $p$, the expected sequence length is $1/p$.
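A quick Monte Carlo sanity check of both formulas (a minimal sketch; NumPy's geometric sampler counts trials until the first success, matching the PMF above):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.2
# One million geometric samples: number of trials until first success.
samples = rng.geometric(p, size=1_000_000)

print(samples.mean())  # ~ 1/p = 5.0
print(samples.var())   # ~ (1-p)/p^2 = 20.0
```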

A.2 Method of Moments Estimation

The method of moments estimates distribution parameters by setting sample moments equal to theoretical moments and solving.

Example: Estimating $(\alpha, \beta)$ for $\text{Gamma}(\alpha, \beta)$.

Theoretical moments: $\mathbb{E}[X] = \alpha/\beta$ and $\text{Var}(X) = \alpha/\beta^2$.

Given $N$ samples $x_1, \ldots, x_N$, compute the sample mean $\bar{x} = \frac{1}{N}\sum x_i$ and sample variance $s^2 = \frac{1}{N-1}\sum(x_i-\bar{x})^2$.

Set $\bar{x} = \hat{\alpha}/\hat{\beta}$ and $s^2 = \hat{\alpha}/\hat{\beta}^2$. Solving:

$$\hat{\beta} = \frac{\bar{x}}{s^2}, \qquad \hat{\alpha} = \frac{\bar{x}^2}{s^2}$$

For AI: Many Bayesian models require estimating hyperparameters of prior distributions. Method of moments provides fast, closed-form initial estimates that can be refined by maximum likelihood or MCMC. It is also used in moment matching for knowledge distillation: a student model's distribution moments are matched to the teacher's.
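A minimal sketch of this estimator in NumPy (the true parameter values are arbitrary; note that NumPy parameterises the gamma sampler by shape and scale $= 1/\beta$):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_true, beta_true = 3.0, 2.0                     # shape, rate
x = rng.gamma(shape=alpha_true, scale=1.0 / beta_true, size=100_000)

xbar = x.mean()
s2 = x.var(ddof=1)                                   # unbiased sample variance

beta_hat = xbar / s2                                 # method-of-moments estimates
alpha_hat = xbar**2 / s2
print(alpha_hat, beta_hat)                           # ~ (3.0, 2.0)
```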

A.3 St. Petersburg Paradox: When Expectation Misleads

The St. Petersburg game: flip a coin repeatedly until the first head. If the first head appears on flip $k$, you win $2^k$ dollars.

Expected winnings:

$$\mathbb{E}[\text{Winnings}] = \sum_{k=1}^\infty 2^k \cdot \frac{1}{2^k} = \sum_{k=1}^\infty 1 = +\infty$$

The expected value is infinite! Yet no rational person would pay more than a few dollars to play this game.

Resolution: Rational agents maximise expected utility, not expected monetary value. For a logarithmic utility function $u(w) = \log(w)$, the expected utility is finite:

$$\mathbb{E}[\log(\text{Winnings})] = \sum_{k=1}^\infty \log(2^k) \cdot \frac{1}{2^k} = \sum_{k=1}^\infty \frac{k \log 2}{2^k} = 2\log 2 < \infty$$

This is Bernoulli's resolution (1738): diminishing marginal utility. The paradox reveals that infinite expected value is not sufficient for rational choice - the existence of all moments (or at least bounded utility) is needed.

For AI: Reinforcement learning reward design must account for this. An agent with an unbounded reward function may take extremely risky actions that have infinite expected reward but almost surely fail. This motivates bounded reward functions and regularisation in RLHF: keeping reward signals within a bounded range prevents policy collapse toward infinite-expectation strategies.
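The divergence is easy to see by simulation (a sketch; sample sizes are arbitrary). The running sample mean of the winnings never stabilises - it jumps whenever a rare huge payoff lands - while the log-utility average settles near $2\log 2 \approx 1.386$:

```python
import numpy as np

rng = np.random.default_rng(2)
# Flips until the first head is Geometric(1/2); payoff is 2^k dollars.
k = rng.geometric(0.5, size=1_000_000)
winnings = 2.0 ** k

for n in [100, 1_000, 10_000, 100_000, 1_000_000]:
    print(n, winnings[:n].mean())    # keeps drifting upward, never converges

print(np.log(winnings).mean())       # ~ 2*log(2) = 1.386, finite
```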


Appendix B - Proofs of Key Identities

B.1 Cauchy-Schwarz via Inner Product

The expectation $\mathbb{E}[XY]$ can be viewed as an inner product $\langle X, Y \rangle = \mathbb{E}[XY]$ in the $L^2$ space of square-integrable random variables. The Cauchy-Schwarz inequality for inner products states $|\langle X, Y \rangle|^2 \leq \langle X, X \rangle \langle Y, Y \rangle$, which gives $(\mathbb{E}[XY])^2 \leq \mathbb{E}[X^2]\,\mathbb{E}[Y^2]$ directly.

This inner-product interpretation gives $L^2$ convergence its name: $X_n \to X$ in $L^2$ means $\mathbb{E}[(X_n - X)^2] \to 0$ - convergence in the norm induced by the $L^2$ inner product.

B.2 Variance as Second Cumulant

The cumulant generating function (CGF) of $X$ is $K_X(t) = \log M_X(t) = \log \mathbb{E}[e^{tX}]$. Computing its second derivative at $t=0$:

$$K_X''(t) = \frac{M_X(t)M_X''(t) - (M_X'(t))^2}{(M_X(t))^2}$$

At $t=0$: $M_X(0)=1$, $M_X'(0)=\mathbb{E}[X]$, $M_X''(0)=\mathbb{E}[X^2]$, so:

$$K_X''(0) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \text{Var}(X)$$

The variance is precisely the second cumulant $\kappa_2$.

B.3 Fisher Information and the Score

The Fisher information matrix is defined as:

$$\mathcal{I}(\theta) = \mathbb{E}_\theta\left[(\nabla_\theta \log p_\theta(X))(\nabla_\theta \log p_\theta(X))^\top\right] = \text{Cov}_\theta(\nabla_\theta \log p_\theta(X))$$

The second equality uses the fact that $\mathbb{E}_\theta[\nabla_\theta \log p_\theta(X)] = \mathbf{0}$ (proved by differentiating $\int p_\theta = 1$):

$$\int p_\theta(x)\,dx = 1 \implies \nabla_\theta \int p_\theta(x)\,dx = 0 \implies \int \nabla_\theta p_\theta(x)\,dx = 0 \implies \mathbb{E}_\theta\left[\frac{\nabla_\theta p_\theta(X)}{p_\theta(X)}\right] = \mathbb{E}_\theta[\nabla_\theta \log p_\theta(X)] = \mathbf{0}$$

Therefore the Fisher information is the variance (covariance matrix) of the score function $\nabla_\theta \log p_\theta(X)$.

For AI: Fisher information appears in:

  • Cramer-Rao bound: $\text{Var}(\hat{\theta}) \geq 1/\mathcal{I}(\theta)$ - the variance of any unbiased estimator is at least the reciprocal of the Fisher information.
  • Natural gradient: The natural-gradient update $\mathcal{I}(\theta)^{-1}\nabla\mathcal{L}$ moves in the direction of steepest descent in the space of distributions (KL-divergence geometry), rather than in parameter space. Adam approximates this with a diagonal Fisher.
  • Elastic Weight Consolidation (EWC): Used in continual learning to prevent catastrophic forgetting. The diagonal of the Fisher information identifies which parameters are important for previous tasks.

B.4 Stein's Lemma

Lemma (Stein, 1972). If $X \sim \mathcal{N}(\mu, \sigma^2)$ and $g$ is differentiable with $\mathbb{E}[|g'(X)|] < \infty$:

$$\mathbb{E}[g(X)(X-\mu)] = \sigma^2\, \mathbb{E}[g'(X)]$$

Proof: Integration by parts:

$$\mathbb{E}[g(X)(X-\mu)] = \int g(x)(x-\mu) \frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu)^2/(2\sigma^2)}\,dx$$

Note that $(x-\mu)e^{-(x-\mu)^2/(2\sigma^2)} = -\sigma^2 \frac{d}{dx}e^{-(x-\mu)^2/(2\sigma^2)}$. Integrating by parts (the boundary terms vanish):

$$= \sigma^2 \int g'(x) \frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu)^2/(2\sigma^2)}\,dx = \sigma^2\, \mathbb{E}[g'(X)] \quad \square$$

Special cases:

  • $g(x) = x$: $\mathbb{E}[X(X-\mu)] = \sigma^2 \implies \mathbb{E}[X^2] = \mu^2 + \sigma^2$ ✓
  • $g(x) = x^2$: $\mathbb{E}[X^2(X-\mu)] = 2\sigma^2\, \mathbb{E}[X] = 2\sigma^2\mu \implies \mathbb{E}[X^3] = \mu^3 + 3\mu\sigma^2$ ✓

For AI: Stein's lemma is the foundation of Stein's identity used in score matching and denoising diffusion models. For a density $p$, the score function $\nabla_x \log p(x)$ satisfies $\mathbb{E}_p[\nabla_x \log p(X)\, g(X) + \nabla_x \cdot g(X)] = 0$ (Stein's operator), enabling training by minimising a quadratic loss without computing the intractable normalisation constant of $p$.
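A numeric check of the lemma for $g(x) = x^2$ (a sketch with arbitrary $\mu$, $\sigma$):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.5, 0.7
x = rng.normal(mu, sigma, size=1_000_000)

# g(x) = x^2, g'(x) = 2x: Stein says E[g(X)(X - mu)] = sigma^2 * E[g'(X)].
lhs = np.mean(x**2 * (x - mu))
rhs = sigma**2 * np.mean(2 * x)
print(lhs, rhs)   # both ~ 2 * sigma^2 * mu = 1.47
```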


Appendix C - Moment Computations for Common Distributions

This table gives the mean, variance, skewness, excess kurtosis, and MGF for the distributions used throughout the course.

| Distribution | $\mathbb{E}[X]$ | $\text{Var}(X)$ | Skewness $\gamma_1$ | Ex. kurtosis $\gamma_2$ | $M_X(t)$ (domain) |
|---|---|---|---|---|---|
| Bernoulli($p$) | $p$ | $p(1-p)$ | $\frac{1-2p}{\sqrt{p(1-p)}}$ | $\frac{1-6p(1-p)}{p(1-p)}$ | $1-p+pe^t$ |
| Binomial($n,p$) | $np$ | $np(1-p)$ | $\frac{1-2p}{\sqrt{np(1-p)}}$ | $\frac{1-6p(1-p)}{np(1-p)}$ | $(1-p+pe^t)^n$ |
| Poisson($\lambda$) | $\lambda$ | $\lambda$ | $1/\sqrt{\lambda}$ | $1/\lambda$ | $e^{\lambda(e^t-1)}$ |
| Geometric($p$) | $1/p$ | $(1-p)/p^2$ | $\frac{2-p}{\sqrt{1-p}}$ | $6 + p^2/(1-p)$ | $\frac{pe^t}{1-(1-p)e^t}$, $t<-\log(1-p)$ |
| Uniform($a,b$) | $(a+b)/2$ | $(b-a)^2/12$ | $0$ | $-6/5$ | $\frac{e^{tb}-e^{ta}}{t(b-a)}$ |
| Normal($\mu,\sigma^2$) | $\mu$ | $\sigma^2$ | $0$ | $0$ | $e^{\mu t+\sigma^2 t^2/2}$ |
| Exponential($\lambda$) | $1/\lambda$ | $1/\lambda^2$ | $2$ | $6$ | $\frac{\lambda}{\lambda-t}$, $t<\lambda$ |
| Gamma($\alpha,\beta$) | $\alpha/\beta$ | $\alpha/\beta^2$ | $2/\sqrt{\alpha}$ | $6/\alpha$ | $\left(\frac{\beta}{\beta-t}\right)^\alpha$, $t<\beta$ |
| Beta($\alpha,\beta$) | $\frac{\alpha}{\alpha+\beta}$ | $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ | $\frac{2(\beta-\alpha)\sqrt{\alpha+\beta+1}}{(\alpha+\beta+2)\sqrt{\alpha\beta}}$ | complex | no closed form |
| Student-$t$($\nu$) | $0$ ($\nu>1$) | $\frac{\nu}{\nu-2}$ ($\nu>2$) | $0$ ($\nu>3$) | $\frac{6}{\nu-4}$ ($\nu>4$) | does not exist |

Notes:

  • Student-$t$ has no MGF (heavy tails cause $\mathbb{E}[e^{tX}] = \infty$ for all $t \neq 0$).
  • Beta distribution skewness is zero iff $\alpha = \beta$ (symmetric); negative when $\alpha > \beta$, positive when $\alpha < \beta$.
  • Poisson has equal mean and variance - a property used to test whether count data follows a Poisson (overdispersion: $\text{Var}(X) > \mathbb{E}[X]$ means extra variability, e.g., a negative binomial fits better).

Appendix D - The Exponential Family and Moments

Many common distributions belong to the exponential family, which has an elegant connection between natural parameters and moments.

D.1 Exponential Family Form

A distribution belongs to the exponential family if its density can be written as:

$$p(x;\eta) = h(x)\exp\left(\eta^\top T(x) - A(\eta)\right)$$

where:

  • $\eta$ = natural parameter vector
  • $T(x)$ = sufficient statistic vector
  • $A(\eta)$ = log-partition function (log-normaliser)
  • $h(x)$ = base measure

D.2 Moments from the Log-Partition Function

Theorem. For an exponential family distribution:

$$\mathbb{E}[T(X)] = \nabla_\eta A(\eta), \qquad \text{Cov}(T(X)) = \nabla^2_\eta A(\eta)$$

Proof sketch. Differentiating $\int p(x;\eta)\,dx = 1$ with respect to $\eta$:

$$0 = \int h(x)\,e^{\eta^\top T(x) - A(\eta)}\left(T(x) - \nabla A(\eta)\right)dx = \mathbb{E}[T(X)] - \nabla A(\eta)$$

Therefore $\mathbb{E}[T(X)] = \nabla A(\eta)$. Differentiating again gives the covariance formula. $\square$

Examples:

| Distribution | $\eta$ | $T(x)$ | $A(\eta)$ | $\mathbb{E}[T(X)] = \nabla A$ |
|---|---|---|---|---|
| Bernoulli($p$) | $\log\frac{p}{1-p}$ | $x$ | $\log(1+e^\eta)$ | $\frac{e^\eta}{1+e^\eta} = p$ |
| Gaussian($\mu,\sigma^2$) | $\left(\frac{\mu}{\sigma^2},\, -\frac{1}{2\sigma^2}\right)$ | $(x, x^2)$ | $-\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)$ | $(\mu,\ \sigma^2+\mu^2)$ |
| Poisson($\lambda$) | $\log\lambda$ | $x$ | $e^\eta = \lambda$ | $e^\eta = \lambda$ |

For AI: The log-partition function $A(\eta)$ is the free energy of the exponential family. Its gradient gives the expected sufficient statistics (moments); its Hessian gives the Fisher information matrix. In variational inference, the ELBO is optimised over the natural parameters $\eta$ of the approximate posterior; the optimal $\eta$ satisfies $\nabla A(\eta) = \mathbb{E}_{p}[T(X)]$ (moment matching). This is the connection between maximum entropy, sufficient statistics, and the moments derived in Section6.
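A tiny numeric check of the theorem for the Bernoulli row (a sketch; the finite-difference step is an arbitrary choice):

```python
import numpy as np

def A(eta):
    # Bernoulli log-partition function: A(eta) = log(1 + e^eta).
    return np.log1p(np.exp(eta))

p = 0.3
eta = np.log(p / (1 - p))   # natural parameter

h = 1e-4
dA = (A(eta + h) - A(eta - h)) / (2 * h)             # ~ E[T(X)] = p
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2  # ~ Var(T(X)) = p(1-p)
print(dA, d2A)   # ~ 0.3, 0.21
```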


Appendix E - Convergence in Probability and L^2 Convergence

The sample mean $\bar{X}_N = \frac{1}{N}\sum_{i=1}^N X_i$ estimates $\mathbb{E}[X]$. In what sense does this estimate converge?

E.1 L^2 Convergence (Mean Square Convergence)

$\bar{X}_N$ converges to $\mathbb{E}[X]$ in the mean-square ($L^2$) sense:

$$\mathbb{E}[(\bar{X}_N - \mathbb{E}[X])^2] = \text{Var}(\bar{X}_N) = \frac{\text{Var}(X)}{N} \to 0$$

This is an immediate consequence of the variance formula for sample means (Section4.4). The sample mean's MSE decreases at rate $1/N$.

E.2 Weak Law of Large Numbers (Preview)

The Weak LLN (proved in Section06 using characteristic functions or Chebyshev's inequality) states that for iid $X_i$ with finite mean $\mu$:

$$\bar{X}_N \xrightarrow{P} \mu \quad \text{as } N \to \infty$$

i.e., $P(|\bar{X}_N - \mu| > \varepsilon) \to 0$ for any $\varepsilon > 0$.

Proof via Chebyshev (when $\text{Var}(X) < \infty$):

$$P(|\bar{X}_N - \mu| > \varepsilon) \leq \frac{\text{Var}(\bar{X}_N)}{\varepsilon^2} = \frac{\text{Var}(X)}{N\varepsilon^2} \to 0$$

This is a direct application of Chebyshev's inequality (previewed in Section7.5) to the sample mean. The LLN justifies a familiar practice: if we evaluate a neural network on many iid batches and average the loss estimates, the average converges to the true expected loss.
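A quick simulation comparing the actual tail probability with the Chebyshev bound (a sketch; the replication count and $\varepsilon$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, var, eps, reps = 0.0, 1.0, 0.1, 5_000

for N in [10, 100, 1_000]:
    xbar = rng.normal(mu, np.sqrt(var), size=(reps, N)).mean(axis=1)
    empirical = np.mean(np.abs(xbar - mu) > eps)   # P(|Xbar - mu| > eps)
    bound = var / (N * eps**2)                     # Chebyshev bound (can exceed 1)
    print(N, empirical, min(bound, 1.0))
```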

-> Full treatment of LLN and CLT: Section06 Stochastic Processes


Appendix F - Conditional Expectation as Projection

The conditional expectation $\mathbb{E}[Y|X]$ can be understood geometrically as an orthogonal projection in the Hilbert space $L^2(\Omega, P)$ of square-integrable random variables.

Inner product: $\langle X, Y \rangle = \mathbb{E}[XY]$

Projection: $\mathbb{E}[Y|\mathcal{G}]$ (conditional expectation with respect to a sub-$\sigma$-algebra $\mathcal{G}$) is the projection of $Y$ onto the closed subspace of $\mathcal{G}$-measurable random variables.

Geometric interpretation: Among all $\mathcal{G}$-measurable approximations to $Y$ (i.e., functions of $X$ when $\mathcal{G} = \sigma(X)$), the conditional expectation $\mathbb{E}[Y|X]$ minimises the $L^2$ distance $\mathbb{E}[(Y-g(X))^2]$. This is the projection theorem: the best approximation is the projection, and the residual $Y - \mathbb{E}[Y|X]$ is orthogonal to every $\mathcal{G}$-measurable random variable:

$$\mathbb{E}[(Y - \mathbb{E}[Y|X]) \cdot g(X)] = 0 \quad \text{for all measurable } g$$

Consequence: The tower property $\mathbb{E}[\mathbb{E}[Y|X]] = \mathbb{E}[Y]$ is the projection version of the law of total expectation: projecting $Y$ onto the subspace of functions of $X$ and then onto the constants gives the same result as projecting directly onto the constants, and the projection onto the constants is exactly the expectation.

For AI: This projection view clarifies why neural networks trained with MSE loss approximate $\mathbb{E}[Y|X]$: the network is learning the projection of $Y$ onto the subspace of functions representable by the architecture. Deeper networks can represent larger subspaces, hence better approximations to the conditional expectation.
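A small numeric illustration of the orthogonality property, with least squares over a finite basis standing in for the $\mathcal{G}$-measurable subspace (a sketch; the basis $\{1, x, x^2\}$ is an arbitrary choice that happens to contain the true conditional mean):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100_000)
y = x**2 + rng.normal(size=x.size)       # E[Y|X=x] = x^2

# Project Y onto the span of {1, x, x^2} via least squares.
G = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(G, y, rcond=None)
proj = G @ coef                           # sample version of E[Y|X]

residual = y - proj
print(G.T @ residual / x.size)            # ~ [0, 0, 0]: residual orthogonal to subspace
print(coef)                               # ~ [0, 0, 1]: recovers x^2
```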


Appendix G - Worked Problems: Moments and Inequalities

G.1 Computing the Moments of the Beta Distribution

The Beta($\alpha, \beta$) distribution has PDF $f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}$ on $[0,1]$, where $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$.

Raw moment: Using the Beta function definition:

$$\mathbb{E}[X^k] = \int_0^1 x^k \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\,dx = \frac{B(\alpha+k, \beta)}{B(\alpha,\beta)} = \frac{\Gamma(\alpha+k)\,\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\alpha+\beta+k)}$$

$$= \frac{(\alpha+k-1)(\alpha+k-2)\cdots\alpha}{(\alpha+\beta+k-1)(\alpha+\beta+k-2)\cdots(\alpha+\beta)} = \prod_{j=0}^{k-1}\frac{\alpha+j}{\alpha+\beta+j}$$

First moment: $\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta}$.

Second moment: $\mathbb{E}[X^2] = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}$.

Variance:

$$\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)} - \frac{\alpha^2}{(\alpha+\beta)^2} = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

AI connection: The Beta distribution is the conjugate prior for the Bernoulli/Binomial likelihood. After observing $s$ successes and $f$ failures, the posterior is Beta($\alpha+s$, $\beta+f$). The posterior mean is $\frac{\alpha+s}{\alpha+\beta+s+f}$ - a weighted average of the prior mean $\frac{\alpha}{\alpha+\beta}$ and the sample mean $\frac{s}{s+f}$. As $s+f \to \infty$, the posterior mean converges to the sample mean (data dominates the prior).
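A minimal sketch of this conjugate update (the prior pseudo-counts and true success probability are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
alpha0, beta0 = 2.0, 2.0                 # Beta prior pseudo-counts
p_true = 0.7

flips = rng.random(500) < p_true
s, f = int(flips.sum()), int((~flips).sum())

prior_mean = alpha0 / (alpha0 + beta0)
sample_mean = s / (s + f)
post_mean = (alpha0 + s) / (alpha0 + beta0 + s + f)
print(prior_mean, sample_mean, post_mean)  # posterior sits between, close to the data
```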

G.2 Proving $H(p) \leq \log K$ by Jensen

For a probability distribution $p = (p_1,\ldots,p_K)$ with $p_k > 0$ and $\sum_k p_k = 1$, the Shannon entropy is $H(p) = -\sum_k p_k \log p_k$.

Proof that $H(p) \leq \log K$:

Apply Jensen's inequality with the concave function log\log:

$$H(p) = \sum_k p_k \log \frac{1}{p_k} = \mathbb{E}_{k \sim p}\left[\log \frac{1}{p_k}\right] \leq \log \mathbb{E}_{k\sim p}\left[\frac{1}{p_k}\right] = \log \sum_k p_k \cdot \frac{1}{p_k} = \log K$$

Equality holds iff $1/p_k$ is constant for all $k$ (Jensen holds with equality iff the argument is constant), i.e., $p_k = 1/K$ (uniform). $\square$

Alternative proof via KL divergence:

$$H(p) - \log K = -\sum_k p_k \log p_k - \log K = \sum_k p_k\left(\log\frac{1}{K} - \log p_k\right) = -\text{KL}(p \,\|\, u) \leq 0$$

where $u = (1/K,\ldots,1/K)$ is the uniform distribution. KL non-negativity gives $H(p) \leq \log K$.
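A quick numeric confirmation (a sketch with $K = 4$):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

K = 4
print(entropy(np.full(K, 1 / K)), np.log(K))  # uniform attains log K ~ 1.386
print(entropy([0.7, 0.1, 0.1, 0.1]))          # any non-uniform p is strictly below
```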

G.3 Adam Bias Correction: Full Derivation

At step $t$, the Adam first-moment accumulator is:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$$

Unrolling from $m_0 = 0$:

$$m_t = (1-\beta_1)\sum_{i=1}^t \beta_1^{t-i} g_i$$

Taking expectations (assuming iid gradients with constant mean $\mathbb{E}[g_i] = \mu_g$):

$$\mathbb{E}[m_t] = (1-\beta_1)\mu_g \sum_{i=1}^t \beta_1^{t-i} = (1-\beta_1)\mu_g \cdot \frac{1-\beta_1^t}{1-\beta_1} = (1-\beta_1^t)\mu_g$$

The bias is $\mathbb{E}[m_t] - \mu_g = -\beta_1^t \mu_g$, which decays to zero geometrically. The bias-corrected estimate $\hat{m}_t = m_t / (1-\beta_1^t)$ satisfies $\mathbb{E}[\hat{m}_t] = \mu_g$ (unbiased).

Similarly for $v_t$: the debiased $\hat{v}_t = v_t/(1-\beta_2^t)$ estimates $\mathbb{E}[g_t^2]$ unbiasedly.

In early training ($t$ small), $1-\beta_1^t \approx (1-\beta_1)t$ is small, so the correction $\hat{m}_t \approx m_t/((1-\beta_1)t)$ greatly amplifies $m_t$. Without bias correction, Adam would take tiny steps at the start, because $m_t \approx (1-\beta_1)g_1$ after one step. With bias correction, $\hat{m}_1 = g_1$ - the first step uses the actual gradient.
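A minimal sketch of the accumulator with and without correction (the gradient noise scale is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(7)
beta1, mu_g = 0.9, 1.0
m = 0.0

for t in range(1, 21):
    g = mu_g + 0.1 * rng.normal()      # noisy gradient with mean mu_g
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1**t)         # bias-corrected estimate
    if t in (1, 5, 10, 20):
        print(t, round(m, 3), round(m_hat, 3))
# Raw m starts near (1 - beta1) * g ~ 0.1; m_hat tracks mu_g ~ 1.0 from step 1.
```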

G.4 Conditional Expectation: Gaussian Case

Let $(X,Y) \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ with $\boldsymbol{\mu} = (\mu_X, \mu_Y)$ and $\Sigma = \begin{pmatrix}\sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2\end{pmatrix}$.

Conditional distribution (from the Schur complement formula of Section03):

$$Y \mid X=x \sim \mathcal{N}\!\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X),\; \sigma_Y^2(1-\rho^2)\right)$$

Therefore:

$$\mathbb{E}[Y|X=x] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X), \qquad \text{Var}(Y|X) = \sigma_Y^2(1-\rho^2)$$

Verification via law of total variance:

$$\text{Var}(Y) = \mathbb{E}[\text{Var}(Y|X)] + \text{Var}(\mathbb{E}[Y|X]) = \sigma_Y^2(1-\rho^2) + \text{Var}\!\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(X-\mu_X)\right)$$

$$= \sigma_Y^2(1-\rho^2) + \rho^2\frac{\sigma_Y^2}{\sigma_X^2}\text{Var}(X) = \sigma_Y^2(1-\rho^2) + \rho^2\sigma_Y^2 = \sigma_Y^2 \quad \checkmark$$

The variance of the conditional mean contributes $\rho^2\sigma_Y^2$ (explained variance), and the within-group variance contributes $(1-\rho^2)\sigma_Y^2$ (unexplained variance). The fraction $\rho^2$ of $\text{Var}(Y)$ explained by $X$ is the coefficient of determination $R^2$ of linear regression.
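A Monte Carlo check of this decomposition (a sketch with arbitrary $\rho$, $\sigma_X$, $\sigma_Y$ and zero means):

```python
import numpy as np

rng = np.random.default_rng(8)
rho, sx, sy = 0.6, 1.0, 2.0
cov = [[sx**2, rho * sx * sy],
       [rho * sx * sy, sy**2]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

cond_mean = rho * (sy / sx) * x              # E[Y|X] with zero means
explained = cond_mean.var()                  # ~ rho^2 * sy^2 = 1.44
unexplained = (y - cond_mean).var()          # ~ (1 - rho^2) * sy^2 = 2.56
print(explained, unexplained, explained + unexplained, y.var())  # sums to Var(Y)
```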


Appendix H - Notation Summary

| Symbol | Meaning | First defined |
|---|---|---|
| $\mathbb{E}[X]$ | Expected value of $X$ | Section2.1 |
| $\mu_X$ or $\mu$ | Mean (expected value) | Section2.1 |
| $\text{Var}(X)$ or $\sigma_X^2$ | Variance of $X$ | Section3.1 |
| $\sigma_X$ | Standard deviation of $X$ | Section3.2 |
| $\mu'_k = \mathbb{E}[X^k]$ | $k$-th raw moment | Section3.3 |
| $\mu_k = \mathbb{E}[(X-\mu)^k]$ | $k$-th central moment | Section3.3 |
| $\gamma_1 = \mu_3/\sigma^3$ | Skewness | Section3.4 |
| $\gamma_2 = \mu_4/\sigma^4 - 3$ | Excess kurtosis | Section3.5 |
| $\text{Cov}(X,Y)$ | Covariance | Section4.1 |
| $\rho_{XY}$ | Pearson correlation | Section4.2 |
| $\mathbb{E}[Y \mid X=x]$ | Conditional expectation (function of $x$) | Section5.1 |
| $\mathbb{E}[Y \mid X]$ | Conditional expectation (random variable) | Section5.1 |
| $\text{Var}(Y \mid X)$ | Conditional variance | Section5.3 |
| $M_X(t) = \mathbb{E}[e^{tX}]$ | Moment generating function | Section6.1 |
| $\varphi_X(t) = \mathbb{E}[e^{itX}]$ | Characteristic function | Section6.4 |
| $K_X(t) = \log M_X(t)$ | Cumulant generating function | Section6.5 |
| $\kappa_k$ | $k$-th cumulant | Section6.5 |
| $\mathcal{L}(\phi,\theta;x)$ | ELBO (evidence lower bound) | Section9.2 |
| $\text{KL}(p \,\Vert\, q)$ | KL divergence | Section7.2 |
| $H(p)$ | Shannon entropy | Section9.1 |

Appendix I - Quick Reference: Key Formulas

Expectation:

$$\mathbb{E}[aX+bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y] \quad (\text{always})$$

$$\mathbb{E}[g(X)] = \int g(x)f_X(x)\,dx \quad (\text{LOTUS})$$

Variance:

$$\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2, \qquad \text{Var}(aX+b) = a^2\,\text{Var}(X)$$

$$\text{Var}(X+Y) = \text{Var}(X) + 2\,\text{Cov}(X,Y) + \text{Var}(Y)$$

Conditional expectation:

$$\mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y|X]] \quad (\text{tower property})$$

$$\text{Var}(Y) = \mathbb{E}[\text{Var}(Y|X)] + \text{Var}(\mathbb{E}[Y|X]) \quad (\text{total variance})$$

MGF:

$$M_X^{(k)}(0) = \mathbb{E}[X^k], \qquad M_{X+Y}(t) = M_X(t)\,M_Y(t) \;\; (X\perp Y)$$

Jensen's inequality:

$$f \text{ convex} \implies f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)], \qquad \text{KL}(p\|q) \geq 0 \quad (\text{Gibbs' inequality, via Jensen})$$

Cauchy-Schwarz:

$$(\mathbb{E}[XY])^2 \leq \mathbb{E}[X^2]\,\mathbb{E}[Y^2], \qquad |\rho_{XY}| \leq 1$$

Bias-Variance:

$$\mathbb{E}[(Y-\hat{f})^2] = \text{Bias}^2(\hat{f}) + \text{Var}(\hat{f}) + \sigma^2$$

Adam:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t, \qquad \hat{m}_t = m_t/(1-\beta_1^t) \approx \mathbb{E}[g_t]$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2, \qquad \hat{v}_t = v_t/(1-\beta_2^t) \approx \mathbb{E}[g_t^2]$$

Appendix J - Common Mistakes: Extended Examples

J.1 Jensen's Direction: A Costly Error

The most frequent mistake when using Jensen's inequality is applying it in the wrong direction. The rule is simple: for convex $f$, $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$; for concave $f$, the inequality reverses.

Common confusion: $\sqrt{\cdot}$ vs. $(\cdot)^2$.

  • $f(x) = x^2$ is convex ($f''=2>0$). Jensen: $(\mathbb{E}[X])^2 \leq \mathbb{E}[X^2]$. Equivalently, $\text{Var}(X) = \mathbb{E}[X^2]-(\mathbb{E}[X])^2 \geq 0$. ✓
  • $f(x) = \sqrt{x}$ is concave ($f''=-\frac{1}{4}x^{-3/2}<0$). Jensen: $\sqrt{\mathbb{E}[X]} \geq \mathbb{E}[\sqrt{X}]$ for $X \geq 0$.

A model that minimises $\mathbb{E}[\sqrt{\text{loss}}]$ is NOT the same as one that minimises $\sqrt{\mathbb{E}[\text{loss}]}$. The former cares about the average square-root loss, the latter about the square root of the average loss. In practice this distinction matters for robust loss functions.

Common confusion: $\log$ vs. $\exp$.

  • $f(x) = \log x$ is concave -> $\log\mathbb{E}[X] \geq \mathbb{E}[\log X]$. This is the key inequality behind the ELBO derivation.
  • $f(x) = e^x$ is convex -> $e^{\mathbb{E}[X]} \leq \mathbb{E}[e^X]$. This is the key inequality for bounding the MGF.

The ELBO derivation applies Jensen with $\log$ (concave): $\log \mathbb{E}[Z] \geq \mathbb{E}[\log Z]$ where $Z = p_\theta(x,z)/q_\phi(z|x)$. Students often apply it in the wrong direction, obtaining an upper bound instead of a lower bound.
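All three directions are easy to confirm numerically (a sketch using a positive random variable so that $\sqrt{\cdot}$ and $\log$ are defined):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.exponential(2.0, size=1_000_000)     # positive random variable

print(x.mean()**2, (x**2).mean())            # (E[X])^2 <= E[X^2]
print(np.sqrt(x.mean()), np.sqrt(x).mean())  # sqrt(E[X]) >= E[sqrt(X)]
print(np.log(x.mean()), np.log(x).mean())    # log(E[X]) >= E[log(X)]
```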

J.2 The Tower Property Subtlety: Nested Conditioning

The tower property states $\mathbb{E}[\mathbb{E}[Y|X]] = \mathbb{E}[Y]$. A more general version for nested conditioning: for $\sigma$-algebras $\mathcal{G}_1 \subset \mathcal{G}_2$:

$$\mathbb{E}[\mathbb{E}[Y|\mathcal{G}_2]\,|\,\mathcal{G}_1] = \mathbb{E}[Y|\mathcal{G}_1]$$

Conditioning an already-conditioned variable on less information keeps only the coarser conditioning.

Example: Let $Z$ be a sufficient statistic for $\theta$ given data $X_1,\ldots,X_n$. Then:

$$\mathbb{E}[\hat{\theta}|\theta] = \mathbb{E}[\mathbb{E}[\hat{\theta}|Z]\,|\,\theta] = \mathbb{E}[\tilde{\theta}|\theta]$$

where $\tilde{\theta} = \mathbb{E}[\hat{\theta}|Z]$ is the Rao-Blackwellised estimator. The outer expectation equals the original (tower property), but the Rao-Blackwellised estimator has smaller variance. This is NOT circular - it is the statement that smoothing $\hat{\theta}$ via conditioning on $Z$ doesn't change its mean.

J.3 Correlation Does Not Imply Causation - A Statistical View

Zero correlation means $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$ - no linear relationship. But:

  1. There may be nonlinear dependence (Section4.3 example: $Y = X^2$).
  2. Even positive correlation may arise from a common cause (confounding): if $Z \to X$ and $Z \to Y$ (the fork structure from Section03), then $X$ and $Y$ are correlated even if neither causes the other.

In ML: the correlation between model predictions and labels on the test set measures linear predictive ability, not causal understanding. A model can achieve high correlation while exploiting spurious features (shortcuts). For example, language models correlate "hospital" with "disease" not through causal understanding but through co-occurrence patterns.

Conditional independence as the resolution: If $Z$ is the common cause (confounder), $X$ and $Y$ may be conditionally independent given $Z$: $\text{Cov}(X,Y|Z) = 0$ even though $\text{Cov}(X,Y) \neq 0$. Adjusting for confounders (by conditioning or via instrumental variables) is the statistical approach to causal inference.


Appendix K - Information-Theoretic View of Moments

K.1 Entropy as Negative Expected Log-Probability

The Shannon entropy of a discrete distribution pp is:

$$H(p) = -\mathbb{E}_p[\log p(X)] = -\sum_k p_k \log p_k$$

This is simply minus the expected log-probability. For a continuous distribution with PDF $f$, the differential entropy is:

$$h(f) = -\mathbb{E}_f[\log f(X)] = -\int f(x)\log f(x)\,dx$$

Gaussian maximises differential entropy. Among all distributions on $\mathbb{R}$ with fixed mean $\mu$ and variance $\sigma^2$, the Gaussian $\mathcal{N}(\mu, \sigma^2)$ maximises differential entropy:

$$h(\mathcal{N}(\mu,\sigma^2)) = \frac{1}{2}\log(2\pi e \sigma^2)$$

Proof sketch (via KL divergence): For any distribution $f$ with mean $\mu$ and variance $\sigma^2$, let $g = \mathcal{N}(\mu,\sigma^2)$.

$$0 \leq \text{KL}(f\|g) = \int f\log\frac{f}{g} = -h(f) - \int f\log g$$

Since $\log g(x) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$, and $\int f(x) \cdot \frac{(x-\mu)^2}{2\sigma^2}\,dx = \frac{\sigma^2}{2\sigma^2} = \frac{1}{2}$ (since $f$ has variance $\sigma^2$):

$$\int f\log g = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2} = -h(g)$$

Therefore $0 \leq -h(f) + h(g)$, i.e., $h(f) \leq h(g)$. Equality holds iff $f = g$ (since $\text{KL}=0$ iff the distributions are equal). $\square$

AI connection: This maximum entropy property explains why the Gaussian prior is so common in Bayesian machine learning: given a known mean and variance (from domain knowledge), the Gaussian is the least informative (maximum entropy) prior consistent with those constraints. It makes the fewest additional assumptions about the distribution.

K.2 Mutual Information as Expected KL Divergence

The mutual information between $X$ and $Y$ is:

$$I(X;Y) = \text{KL}(p_{X,Y} \,\|\, p_X \otimes p_Y) = \mathbb{E}_{(X,Y)}\left[\log\frac{p_{X,Y}(X,Y)}{p_X(X)\,p_Y(Y)}\right]$$

This is the KL divergence between the joint distribution and the product of the marginals. Since $\text{KL} \geq 0$: $I(X;Y) \geq 0$, with equality iff $X \perp Y$.

Relationship to conditional entropy:

$$I(X;Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)$$

where $H(Y|X) = \mathbb{E}_X[H(Y|X=x)]$ is the conditional entropy (the tower property applied to entropy).

For AI: Mutual information is the gold standard for measuring statistical dependence. It captures all forms of dependence (linear and nonlinear), unlike correlation. Contrastive learning methods (SimCLR, CLIP) can be viewed as maximising a lower bound on $I(\text{view}_1; \text{view}_2)$ - encouraging representations to capture information shared between different views/modalities. Information bottleneck methods (used for analysing neural networks) study the trade-off between compressing $X$ (low $I(Z;X)$) and preserving information about $Y$ (high $I(Z;Y)$).
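For discrete variables with a known joint PMF, mutual information is a short computation (a sketch; the joint table is an arbitrary example of dependent binary variables):

```python
import numpy as np

# Joint PMF of two binary variables (rows: X, columns: Y).
pxy = np.array([[0.30, 0.10],
                [0.15, 0.45]])
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)

mi = np.sum(pxy * np.log(pxy / (px * py)))  # KL(p_XY || p_X p_Y)
print(mi)   # > 0, since X and Y are dependent
```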


Appendix L - The Reparameterisation Trick: Full Mathematical Treatment

The reparameterisation trick is a technique for computing gradients of expectations when the distribution depends on the parameters we differentiate with respect to.

L.1 The Problem: Gradient Through Sampling

We want $\nabla_\phi \mathbb{E}_{z \sim q_\phi}[f(z)]$, where $q_\phi$ is a distribution parameterised by $\phi$. Naively:

$$\nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \nabla_\phi \int f(z)\, q_\phi(z)\,dz = \int f(z)\, \nabla_\phi q_\phi(z)\,dz$$

This requires computing $\nabla_\phi q_\phi(z)$ for each $z$, and the integrand is no longer an expectation under $q_\phi$ - the measure itself changes with $\phi$ - so it cannot be estimated by simply sampling $z \sim q_\phi$ and averaging.

L.2 REINFORCE (Score Function Estimator)

The score function trick uses $\nabla_\phi q_\phi = q_\phi \nabla_\phi \log q_\phi$:

$$\nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \int f(z)\, q_\phi(z)\, \nabla_\phi \log q_\phi(z)\,dz = \mathbb{E}_{q_\phi}[f(z)\, \nabla_\phi \log q_\phi(z)]$$

This is computable by Monte Carlo: sample $z \sim q_\phi$, compute $f(z)\,\nabla_\phi \log q_\phi(z)$, and average over samples. However, this estimator has high variance, because $f(z)$ can be large and $\nabla_\phi \log q_\phi(z)$ can vary greatly.

L.3 Reparameterisation

When $q_\phi$ is a location-scale family (or, more generally, when there exists a deterministic transformation $g(\phi, \varepsilon)$):

$$z = g(\phi, \varepsilon), \quad \varepsilon \sim p(\varepsilon) \quad \text{(fixed distribution, independent of } \phi\text{)}$$

For the Gaussian: $z = \mu_\phi + \sigma_\phi \odot \varepsilon$ with $\varepsilon \sim \mathcal{N}(0,I)$.

Then:

$$\mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{p(\varepsilon)}[f(g(\phi, \varepsilon))]$$

and:

$$\nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{p(\varepsilon)}[\nabla_\phi f(g(\phi, \varepsilon))] = \mathbb{E}_{p(\varepsilon)}\left[\nabla_z f(z) \cdot \nabla_\phi g(\phi,\varepsilon)\right]$$

The gradient now flows through the deterministic function $g$, enabling automatic differentiation. The variance of this estimator is typically much lower than REINFORCE's, because $\nabla_z f$ is often smoother than $f \cdot \nabla_\phi \log q_\phi$.

Jacobian of the transform:

$$\nabla_\phi g(\phi, \varepsilon)\big|_{\mu_\phi, \sigma_\phi} = \begin{pmatrix}\partial z/\partial \mu_\phi \\ \partial z/\partial \sigma_\phi\end{pmatrix} = \begin{pmatrix}I \\ \text{diag}(\varepsilon)\end{pmatrix}$$

So the gradient with respect to $\mu_\phi$ is $\mathbb{E}[\nabla_z f(z)]$ and with respect to $\sigma_\phi$ is $\mathbb{E}[\nabla_z f(z) \odot \varepsilon]$.

L.4 Why Lower Variance?

Consider $f(z) = \sin(z)$ with $z \sim \mathcal{N}(\mu, \sigma^2)$.

REINFORCE gradient (w.r.t. $\mu$): $\mathbb{E}[\sin(z) \cdot (z-\mu)/\sigma^2]$. The product $\sin(z)(z-\mu)$ can be large in magnitude.

Reparameterisation gradient: $\mathbb{E}[\cos(\mu+\sigma\varepsilon)] = \mathbb{E}[\cos(z)]$. The integrand $\cos(z)$ is bounded in $[-1,1]$, making the estimator much more stable.

In general, reparameterisation produces gradient estimates of $O(1)$ magnitude when $f$ is smooth, while REINFORCE gradients scale with the values of $f$, which can be large.
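The variance gap is easy to exhibit for this example (a sketch; both estimators below are unbiased for $\partial_\mu \mathbb{E}[\sin(z)]$):

```python
import numpy as np

rng = np.random.default_rng(10)
mu, sigma, n = 0.5, 1.0, 100_000
eps = rng.normal(size=n)
z = mu + sigma * eps

reinforce = np.sin(z) * (z - mu) / sigma**2   # score-function estimator
reparam = np.cos(mu + sigma * eps)            # pathwise (reparameterised) estimator

print(reinforce.mean(), reparam.mean())       # same gradient estimate (~0.53)
print(reinforce.var(), reparam.var())         # REINFORCE variance is far larger
```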


Appendix M - Moments in the Context of Score Matching and Diffusion

M.1 Denoising Score Matching

Diffusion models (DDPM, Score SDE) learn the score function $\nabla_x \log p_t(x)$ at each noise level $t$. The training objective is:

$$\mathcal{L}_\theta = \mathbb{E}_{t, x_0, \varepsilon}\left[\|s_\theta(x_t, t) - \nabla_{x_t}\log p_t(x_t|x_0)\|^2\right]$$

where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon$ and $\varepsilon \sim \mathcal{N}(0,I)$.

The conditional score is:

$$\nabla_{x_t}\log p_t(x_t|x_0) = -\frac{\varepsilon}{\sqrt{1-\bar{\alpha}_t}}$$

This is an expectation over the noise schedule. The loss simplifies to:

$$\mathcal{L}_\theta = \mathbb{E}_{t,\varepsilon}\left[\|\varepsilon_\theta(x_t,t) - \varepsilon\|^2\right]$$

which is an MSE loss - an empirical estimate of $\mathbb{E}[(Y-f(X))^2]$, where $Y$ is the added noise $\varepsilon$ and $f = \varepsilon_\theta$.

M.2 First Moment of the Reverse Process

The reverse diffusion process gives $p_\theta(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta(x_t,t), \sigma_t^2 I)$, where:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon_\theta(x_t, t)\right)$$

The mean of the reverse step is determined by the predicted noise. Inverting the forward reparameterisation with the predicted noise gives the denoised estimate $\hat{x}_0$ of the clean sample:

$$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\varepsilon_\theta(x_t,t)}{\sqrt{\bar{\alpha}_t}}$$

This is LOTUS applied to the reparameterised relationship: the expected clean image given the noisy image is a simple linear function of the predicted noise, evaluated using the noisy image.
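A toy check of this inversion, assuming a perfect noise predictor (a sketch; the noise level and the data are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(11)
alpha_bar_t = 0.5                    # assumed cumulative noise level at step t
x0 = rng.normal(size=1_000)          # stand-in "clean data"
eps = rng.normal(size=1_000)

x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps

# If eps_theta predicted the noise exactly, x0 would be recovered exactly.
x0_hat = (x_t - np.sqrt(1 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
print(np.max(np.abs(x0_hat - x0)))   # ~ 0 (floating-point error only)
```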


Appendix N - Worked Problems: Bias-Variance and Regularisation

N.1 Ridge Regression Bias-Variance Tradeoff

Setup. Data: $y = X\theta^* + \varepsilon$, where $X \in \mathbb{R}^{n\times d}$ and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$.

Ridge estimator: $\hat{\theta}_\lambda = (X^\top X + \lambda I)^{-1}X^\top y$.

Let $A_\lambda = (X^\top X + \lambda I)^{-1}X^\top$.

Bias: $\text{Bias}(\hat{\theta}_\lambda) = \mathbb{E}[\hat{\theta}_\lambda] - \theta^* = A_\lambda X\theta^* - \theta^* = (A_\lambda X - I)\theta^*$.

Since $A_\lambda X = (X^\top X + \lambda I)^{-1}X^\top X$:

$$A_\lambda X - I = (X^\top X + \lambda I)^{-1}X^\top X - I = -\lambda(X^\top X + \lambda I)^{-1}$$

So $\text{Bias}(\hat{\theta}_\lambda) = -\lambda(X^\top X + \lambda I)^{-1}\theta^*$. The bias grows with $\lambda$.

Variance: $\text{Var}(\hat{\theta}_\lambda) = A_\lambda \text{Var}(y) A_\lambda^\top = \sigma^2 A_\lambda A_\lambda^\top = \sigma^2(X^\top X+\lambda I)^{-1}X^\top X(X^\top X+\lambda I)^{-1}$.

As $\lambda \to \infty$: $(X^\top X+\lambda I)^{-1} \to 0$, so the variance $\to 0$. As $\lambda \to 0$: we recover the OLS variance $\sigma^2(X^\top X)^{-1}$.

Total MSE:

$$\text{MSE} = \|\text{Bias}\|^2 + \text{tr}(\text{Var}) = \lambda^2\, \theta^{*\top}(X^\top X+\lambda I)^{-2}\theta^* + \sigma^2\,\text{tr}\!\left((X^\top X+\lambda I)^{-1}X^\top X(X^\top X+\lambda I)^{-1}\right)$$

The optimal λ\lambda minimises this sum - bias grows, variance shrinks, and there is a sweet spot.
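The sweet spot is visible by evaluating the two terms on a synthetic problem (a sketch; the dimensions and $\lambda$ grid are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(12)
n, d, sigma = 50, 10, 1.0
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
XtX = X.T @ X

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    M = np.linalg.inv(XtX + lam * np.eye(d))
    bias2 = lam**2 * theta_star @ M @ M @ theta_star    # ||Bias||^2
    var = sigma**2 * np.trace(M @ XtX @ M)              # tr(Var)
    print(f"lambda={lam:6.1f}  bias^2={bias2:8.4f}  var={var:7.4f}  mse={bias2 + var:7.4f}")
```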

N.2 Neural Network Initialisation via Variance Propagation

He initialisation (for ReLU networks) is derived from a variance-propagation analysis.

For a layer $h_j^{[l]} = \text{ReLU}\!\left(\sum_{i} W_{ji}^{[l]} h_i^{[l-1]}\right)$, with iid inputs $h_i^{[l-1]} \sim (0, v_{l-1})$ (zero mean, variance $v_{l-1}$) and iid weights $W_{ji} \sim (0, \sigma_W^2)$ independent of the inputs:

Variance propagation:

$$\text{Var}(h_j^{[l]}) = n_{l-1} \cdot \sigma_W^2 \cdot \mathbb{E}[\text{ReLU}(z)^2]$$

where $z \sim \mathcal{N}(0, v_{l-1})$. For ReLU: $\mathbb{E}[\text{ReLU}(z)^2] = \text{Var}(z)/2 = v_{l-1}/2$ (since ReLU zeroes half of a symmetric distribution).

Therefore $v_l = n_{l-1} \cdot \sigma_W^2 \cdot v_{l-1}/2$.

For the variance to stay constant across layers, $v_l = v_{l-1}$ requires $\sigma_W^2 = 2/n_{l-1}$.

He initialisation: $W \sim \mathcal{N}(0, 2/n_\text{in})$. This preserves activation variance through the network during the forward pass, preventing activations (and hence gradients) from vanishing or exploding in deep networks.
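A forward-pass simulation showing why the factor of 2 matters (a sketch; width, depth, and batch size are arbitrary). The activation second moment stays near 1 under He scaling and collapses geometrically under a naive $1/n$ scaling:

```python
import numpy as np

rng = np.random.default_rng(13)
n, depth = 512, 30
h0 = rng.normal(size=(n, 1_000))          # inputs with unit second moment

for name, std in [("He (2/n)", np.sqrt(2.0 / n)), ("naive (1/n)", np.sqrt(1.0 / n))]:
    a = h0.copy()
    for _ in range(depth):
        W = std * rng.normal(size=(n, n))
        a = np.maximum(W @ a, 0.0)        # ReLU layer
    print(name, float((a**2).mean()))     # He ~ 1; naive ~ 0.5**depth ~ 1e-9
```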


Appendix O - Self-Assessment Checklist

Use this checklist after studying the section to identify gaps before proceeding to Section05.

Core Mechanics (Should be fluent)

  • I can compute $\mathbb{E}[X]$ from a PMF or PDF using the definition.
  • I can apply LOTUS: $\mathbb{E}[g(X)] = \int g(x)f(x)\,dx$ without finding the distribution of $g(X)$.
  • I know that linearity $\mathbb{E}[aX+bY]=a\,\mathbb{E}[X]+b\,\mathbb{E}[Y]$ holds without independence.
  • I can compute $\text{Var}(X)$ using $\mathbb{E}[X^2]-(\mathbb{E}[X])^2$.
  • I know $\text{Var}(aX+b)=a^2\,\text{Var}(X)$ (a shift doesn't change variance).
  • I can compute skewness $\gamma_1=\mu_3/\sigma^3$ and excess kurtosis $\gamma_2=\mu_4/\sigma^4-3$.

Intermediate Theory (Should understand proofs)

  • I can state and prove the tower property $\mathbb{E}[\mathbb{E}[Y|X]]=\mathbb{E}[Y]$.
  • I can state and prove the law of total variance.
  • I can derive the MGF of the Gaussian, Exponential, and Poisson distributions.
  • I can state Jensen's inequality and identify when it applies ($f$ convex vs. concave).
  • I can prove $\text{KL}(p\|q)\geq 0$ using Jensen.
  • I can state and prove the Cauchy-Schwarz inequality for expectations.
  • I know that $|\rho|\leq 1$ follows from Cauchy-Schwarz.
  • I understand why zero covariance does NOT imply independence (with a counterexample).

Advanced Applications (Should be able to apply)

  • I can derive the bias-variance decomposition from first principles.
  • I can explain what double descent is and why overparameterised models can generalise.
  • I can derive the ELBO using Jensen's inequality.
  • I can explain the reparameterisation trick and why it reduces gradient variance.
  • I understand Adam as tracking first and second moments with bias correction.
  • I can explain why the score function has zero mean: $\mathbb{E}[\nabla_\theta \log p_\theta(X)]=\mathbf{0}$.

Appendix P - Further Reading and References

Textbooks

  1. Probability and Statistics for Engineering and the Sciences - Jay Devore (2015). Accessible introduction to moments, MGFs, and expectation with engineering examples.

  2. Probability Theory: The Logic of Science - E.T. Jaynes (2003). Philosophical and technical treatment; excellent on entropy and maximum entropy principle.

  3. Pattern Recognition and Machine Learning - Christopher Bishop (2006), Ch. 1-2. The ML-focused treatment of expectations, KL divergence, ELBO, and variational inference.

  4. Deep Learning - Goodfellow, Bengio, Courville (2016), Ch. 3. Standard reference for probability in ML context including bias-variance tradeoff.

  5. Probabilistic Machine Learning: An Introduction - Kevin Murphy (2022), Ch. 2-4. Modern treatment with extensive ML applications including Adam, VAEs, and diffusion models.

Papers

  1. Kingma & Welling (2014). Auto-Encoding Variational Bayes. arXiv:1312.6114. Original VAE paper; derives ELBO via Jensen and introduces reparameterisation trick.

  2. Kingma & Ba (2015). Adam: A Method for Stochastic Optimization. arXiv:1412.6980. Original Adam paper; moment tracking interpretation is explicit in the derivation.

  3. Williams (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning. REINFORCE algorithm; score function gradient estimator.

  4. Ioffe & Szegedy (2015). Batch Normalization. arXiv:1502.03167. Moment normalisation of activations; analysis of how variance affects training dynamics.

  5. Belkin et al. (2019). Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off. PNAS. Double descent phenomenon; formal analysis of why interpolating models can generalise.

  6. Song et al. (2020). Score-Based Generative Modeling through Stochastic Differential Equations. arXiv:2011.13456. Diffusion models as score-matching; score function is a moment of the distribution.


Summary of Key Results

This section established the expectation operator as the central tool in probability and machine learning. Starting from the LOTUS definition, we derived linearity (which holds without independence), the tower property (iterated expectation), and the law of total variance (which decomposes uncertainty into within-group and between-group components).

The moment hierarchy captures distribution shape: the first moment (mean) locates it; the second (variance) measures spread; the third (skewness) captures asymmetry; the fourth (kurtosis) captures tail weight. Moment generating functions encode all moments in a single analytic function and enable elegant proofs of the reproductive property (sum of independent Gaussians is Gaussian) and cumulant additivity.

Jensen's inequality - the most important inequality in this section - gives the KL divergence its non-negativity, the ELBO its existence as a lower bound, and the bias-variance decomposition its clean structure. Cauchy-Schwarz gives $|\rho| \leq 1$ and motivates the attention scaling factor $1/\sqrt{d_k}$.

Every major ML method reviewed in Section9 is an application of these results: cross-entropy is expected log-loss, the ELBO is Jensen applied to the marginal likelihood, policy gradient is the score function trick, and Adam is first-and-second-moment tracking with debiasing.

The concepts forward-referenced here - Markov's and Chebyshev's inequalities, the Law of Large Numbers, the Central Limit Theorem - will be fully developed in Section05 and Section06, completing the probabilistic toolkit required for modern ML.

The conceptual unification: expectation is a linear functional on the space of random variables. It maps random variables to real numbers while preserving linear structure. Every computation in this section - variance as $\mathbb{E}[(X-\mu)^2]$, covariance as $\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]$, the MGF as $\mathbb{E}[e^{tX}]$, the characteristic function as $\mathbb{E}[e^{itX}]$, the KL divergence as $\mathbb{E}_p[\log(p/q)]$, Adam's first moment as $\mathbb{E}[g_t]$ - is an application of this single linear functional to different functions of the random variable.

Understanding this unity transforms the apparent complexity of ML training into a coherent framework: we are always estimating expectations from samples, bounding how far sample estimates stray from true expectations, and choosing parameterisations that make those expectations tractable to compute and differentiate.

<- Back to Chapter 6: Probability Theory | Next: Concentration Inequalities ->


End of Section06/04 - Expectation and Moments

| File | Lines / Cells | Status |
|---|---|---|
| notes.md | 2000+ | ✓ Complete |
| theory.ipynb | 38 cells | ✓ Complete |
| exercises.ipynb | 27 cells | ✓ Complete |
