Expectation and Moments - Part 1: Intuition and Historical Context to Why This Matters for AI (2026 Perspective)
1. Intuition and Historical Context
1.1 The Moment Analogy
The word "moment" comes from classical mechanics. In physics, the first moment of a mass distribution is the center of mass; the second moment is the moment of inertia. Probability theory borrows this language directly: the n-th moment of a random variable X is E[X^n], the expected value of X raised to the n-th power.
Imagine balancing a plank on which weights are placed at positions x_i with weights p_i (probabilities). The balance point is E[X] = ∑_i x_i p_i - the center of mass. The spread around that center is captured by E[(X − μ)²], the second central moment. Higher moments capture the shape: whether the distribution leans left or right (skewness), whether it has heavy tails (kurtosis).
MOMENTS AS SHAPE DESCRIPTORS
========================================================================
First moment (mean \\mu): Location of center
Second central moment (\\sigma^2): Spread about center
Third central moment (\\gamma_1\\sigma^3): Asymmetry (skewness)
Fourth central moment (\\gamma_2\\sigma^4): Tail weight (kurtosis)
Standard normal N(0,1): \\mu=0, \\sigma^2=1, \\gamma_1=0, \\gamma_2=0 (baseline)
Right-skewed (\\gamma_1 > 0): Heavy right tail: income, LLM token probs
#
###
#####=====........
----+--------------------
Heavy-tailed (\\gamma_2 > 0): Leptokurtic: gradient norms, loss landscapes
#
#
###
#####.............
----+--------------------
========================================================================
Every major operation in statistics reduces to computing moments. The mean is the first moment. The variance is the second central moment. The sample statistics we compute to estimate distributions are moment estimates. The method of moments estimation technique literally sets sample moments equal to theoretical moments and solves for parameters.
For AI: Every loss function is an expectation over the data distribution:
L(θ) = E_{(x,y)∼p_data}[ℓ(f_θ(x), y)]
We cannot compute this exactly (the true distribution is unknown), so we estimate it with the sample mean over a minibatch. The entire machinery of stochastic gradient descent rests on this approximation.
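This approximation can be seen directly in a few lines of NumPy. The data distribution, model, and parameter values below are illustrative assumptions, not taken from the lesson; the point is only that a minibatch mean converges to the expected loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data distribution (illustrative): y = 2x + noise.
def sample_batch(n):
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(scale=0.5, size=n)
    return x, y

# Squared loss for a one-parameter linear model.
def squared_loss(w, x, y):
    return (y - w * x) ** 2

# At w = 1.5 the true expected loss is (2 - 1.5)^2 * Var(x) + Var(noise)
# = 0.25 + 0.25 = 0.5; a large "minibatch" mean approximates it.
x, y = sample_batch(100_000)
estimate = squared_loss(1.5, x, y).mean()
print(estimate)  # close to 0.5
```

Smaller minibatches give noisier estimates of the same expectation - that noise is exactly the gradient-estimate variance discussed later in this lesson.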
1.2 Why Moments Matter for AI
Moments appear in nearly every component of modern deep learning:
Loss functions. The mean squared error estimates E[(Y − f(X))²], the expected squared loss. Cross-entropy loss estimates E[−log p_θ(Y|X)]. Every loss is a first moment of some function of the predictions.
Optimisation. The Adam optimizer maintains running estimates of the first moment E[g] and second moment E[g²] of the gradient g. The adaptive learning rate normalises by the square root of the second-moment estimate. Without understanding moments, Adam is a black box; with it, Adam is moment matching with bias correction.
Batch Normalisation. BatchNorm computes the sample mean and variance of activations across a batch, then normalises. This is moment normalisation: forcing the first moment to zero and the second to one, stabilising training by controlling the distribution of layer activations.
KL Divergence. The KL divergence D_KL(P‖Q) = E_P[log(P(X)/Q(X))] is an expectation under P. Its non-negativity (D_KL ≥ 0) follows directly from Jensen's inequality applied to the convex function −log. The ELBO in variational autoencoders is derived by applying Jensen's inequality to decompose log p(x).
Reparameterisation. The reparameterisation trick in VAEs (z = μ_φ + σ_φ ⊙ ε with ε ∼ N(0, I)) enables computing gradients of expectations. The gradient ∇_φ E_{q_φ(z)}[f(z)] is intractable to compute directly, but after reparameterisation it becomes E_{ε∼N(0,I)}[∇_φ f(μ_φ + σ_φ ⊙ ε)], estimable via sampling.
Score functions. The score function s(x) = ∇_x log p(x) satisfies E[s(X)] = 0 - its expectation is zero. This identity (proved by differentiating the normalisation condition ∫ p(x) dx = 1) underlies Fisher information, the REINFORCE policy gradient estimator, and Stein's identity used in score-based diffusion models.
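The zero-mean property of the score is easy to verify numerically. A minimal sketch for the standard normal, whose score is −x (the choice of N(0,1) here is mine, for convenience):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)

# For N(0,1), the score is d/dx log p(x) = -x.
score = -x
mean_score = score.mean()      # E[score] = 0
fisher = (score ** 2).mean()   # E[score^2] = Fisher information = 1 for N(0,1)
print(mean_score, fisher)
```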
1.3 Historical Timeline
HISTORY OF MOMENTS AND EXPECTATION
========================================================================
1654 Pascal-Fermat correspondence: first rigorous treatment of
expected value ("valeur" of a game) in probability
1657 Huygens: De Ratiociniis in Ludo Aleae - published the first
probability textbook with formal expectation calculations
1713 Bernoulli (Jakob): Ars Conjectandi - law of large numbers
(published posthumously); shows sample mean -> E[X]
1730 De Moivre: normal approximation to binomial via moments
1814 Laplace: Theorie Analytique des Probabilites - systematic
use of characteristic functions (Fourier transforms of densities)
1853 Bienayme: first statement of the variance-based tail
inequality; groundwork for moment inequalities
1867 Chebyshev: rigorous proof of the variance inequality (P(|X-\\mu| \\geq k\\sigma) \\leq 1/k^2) and a variance-based law of large numbers
1894 Pearson: coined "moment" in statistics; introduced skewness
and kurtosis; method of moments estimation
1922 Fisher: likelihood, sufficient statistics, and moment
generating functions as tools of parametric inference
1938 Cramer: large deviation theory; MGF-based tail bounds
(foundations of the later Chernoff and Hoeffding bounds)
1992 Williams: REINFORCE algorithm (score-function / log-derivative
gradient estimator for expectations)
2014 Kingma & Ba: Adam optimizer - formal moment-tracking
adaptive gradient method
2014 Kingma & Welling: VAE - reparameterisation trick;
expectations with differentiable sampling
2020 Song et al.: score-based diffusion models - reverse SDE
guidance via score function = \\nabla log p(x)
========================================================================
2. Formal Definition of Expectation
2.1 Expectation for Discrete, Continuous, and Mixed Distributions
The expected value (or expectation or mean) of a random variable X is defined as the probability-weighted average of all possible values.
Discrete random variable. If X takes values in a countable set {x_1, x_2, …} with PMF p(x):
E[X] = ∑_i x_i p(x_i)
provided this sum converges absolutely: ∑_i |x_i| p(x_i) < ∞. If it does not converge absolutely, E[X] does not exist.
Example. For a fair die with P(X = k) = 1/6, k = 1, …, 6:
E[X] = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5
Continuous random variable. If X has PDF f(x):
E[X] = ∫ x f(x) dx
provided ∫ |x| f(x) dx < ∞.
Example. For X ∼ Exponential(λ) with PDF f(x) = λe^{−λx} on [0, ∞):
E[X] = ∫_0^∞ x λe^{−λx} dx = 1/λ
Mixed distribution. If X has a mixed distribution (point mass at x_0 plus a continuous density f_c):
E[X] = p·x_0 + ∫ x f_c(x) dx
where p = P(X = x_0) is the point mass probability (and f_c integrates to 1 − p).
Notation. We write E[X], E X, μ_X, or ⟨X⟩ (physics notation) interchangeably. When the distribution is parameterised by θ: E_θ[X] or E_{X∼p_θ}[X].
For AI: The empirical mean (1/n) ∑_{i=1}^n x_i is the canonical estimator of E[X]. Every forward pass computes a function of the inputs; averaging that function over a batch estimates its expectation. The loss function IS an expectation - we just estimate it from data.
2.2 Existence and Integrability
The expectation E[X] exists (is finite) when E[|X|] < ∞, i.e., when X is integrable. This fails in important cases:
Cauchy distribution. The PDF f(x) = 1/(π(1 + x²)) has ∫ |x| f(x) dx = ∞. The Cauchy distribution has no mean (nor any finite higher moment). A ratio of two independent standard normals follows a Cauchy distribution.
Heavy-tailed distributions. The Pareto distribution with density f(x) = α x_m^α / x^{α+1} for x ≥ x_m has:
- E[X] exists iff α > 1
- Var(X) exists iff α > 2
- k-th moment exists iff α > k
This matters in practice: internet traffic volumes, earthquake magnitudes, word frequencies in language (Zipf's law), and wealth distributions all exhibit Pareto-like tails. Models that assume finite variance (e.g., CLT-based confidence intervals) can fail badly when applied to such data.
Convergence condition. E[X] is well-defined (possibly +∞) for non-negative X (since the integral may diverge to +∞ but is unambiguous). For general X, write X = X⁺ − X⁻ where X⁺ = max(X, 0) and X⁻ = max(−X, 0). Then E[X] = E[X⁺] − E[X⁻], finite only when both pieces are finite.
For AI: In practice, neural network outputs and loss values are bounded (through clipping, normalisation, or the bounded nature of activation functions). But gradient values can have very heavy tails, especially in early training. Gradient clipping is a practical acknowledgment that the gradient may not have finite variance, in which case variance-based tools (Cauchy-Schwarz bounds, CLT approximations) break down.
2.3 Linearity of Expectation
The most powerful property of expectation is its linearity, which holds without any independence assumption:
Theorem (Linearity of Expectation). For any random variables X and Y and constants a, b:
E[aX + bY] = aE[X] + bE[Y]
Proof (continuous case). Using the joint density f(x, y):
E[aX + bY] = ∬ (ax + by) f(x, y) dx dy = a ∬ x f(x, y) dy dx + b ∬ y f(x, y) dx dy = a ∫ x f_X(x) dx + b ∫ y f_Y(y) dy = aE[X] + bE[Y]
where we used ∫ f(x, y) dy = f_X(x) (marginalisation from Section03). The discrete case is analogous.
Extension. By induction: for any constants a_1, …, a_n and random variables X_1, …, X_n (not necessarily independent):
E[∑_i a_i X_i] = ∑_i a_i E[X_i]
Critical point: This holds even when X and Y are dependent. It does NOT extend to products: E[XY] = E[X]E[Y] only when X and Y are uncorrelated (in particular, when they are independent).
Examples demonstrating power of linearity:
Sum of dice. Let S = X_1 + ⋯ + X_n where each X_i is a fair die. Without worrying about dependence (there is none here, but the theorem doesn't require it): E[S] = ∑_i E[X_i] = 3.5n.
Binomial mean. X ∼ Binomial(n, p). Write X = ∑_{i=1}^n I_i where I_i is the indicator that trial i succeeds. Then: E[X] = ∑_i E[I_i] = np. No need to compute the full PMF.
Expected number of fixed points. A random permutation π of {1, …, n} has expected number of fixed points (positions with π(i) = i) equal to 1, for any n. Proof: E[∑_i 1{π(i) = i}] = ∑_i P(π(i) = i) = n · (1/n) = 1.
For AI: Linearity of expectation is the reason we can decompose complex expected losses into sums of simpler expected losses. The expected cross-entropy over a vocabulary of size V is E[−∑_{v=1}^V y_v log p_v] = −∑_{v=1}^V E[y_v log p_v] - a sum of simpler expectations. The expected gradient of a sum is the sum of the expected gradients, justifying minibatch averaging as a gradient estimator.
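The fixed-point example is a nice one to simulate, because the indicator variables involved are dependent yet linearity still gives exactly 1. A quick NumPy check (n = 10 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 100_000

# Count fixed points (positions i with perm[i] == i) in random permutations.
counts = np.array([(rng.permutation(n) == np.arange(n)).sum()
                   for _ in range(trials)])
avg_fixed = counts.mean()
print(avg_fixed)  # ~ 1.0, for any n
```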
2.4 LOTUS - Law of the Unconscious Statistician
A common need: compute E[g(X)] for some function g when we know the distribution of X but not necessarily of g(X) directly. LOTUS provides the answer.
Theorem (LOTUS - Law of the Unconscious Statistician). Let X be a random variable and g a measurable function.
Discrete: E[g(X)] = ∑_x g(x) p(x)
Continuous: E[g(X)] = ∫ g(x) f(x) dx
Key insight: We do NOT need the distribution of Y = g(X). We use the distribution of X directly. This is why it's called the "unconscious" statistician - you can compute E[g(X)] without consciously finding the PDF of g(X).
Proof sketch (discrete case). Let Y = g(X). Grouping terms:
E[Y] = ∑_y y P(Y = y) = ∑_y y ∑_{x: g(x)=y} p(x) = ∑_x g(x) p(x)
The last step rearranges the double sum.
Examples:
Variance via LOTUS. Var(X) = E[(X − μ)²]. Using LOTUS directly on g(x) = (x − μ)²:
Var(X) = ∑_x (x − μ)² p(x) (discrete) or ∫ (x − μ)² f(x) dx (continuous)
Second moment of Exponential. For X ∼ Exponential(λ), using g(x) = x²:
E[X²] = ∫_0^∞ x² λe^{−λx} dx = 2/λ²
Jensen's inequality (preview). For convex φ, LOTUS combined with the supporting hyperplane property gives E[φ(X)] ≥ φ(E[X]). (Full treatment in Section7.)
LOTUS for functions of multiple variables. For jointly distributed (X, Y) with joint density f(x, y):
E[g(X, Y)] = ∬ g(x, y) f(x, y) dx dy
For AI: LOTUS is the theorem behind every expectation computation in variational inference. The ELBO contains E_{q(z|x)}[log p(x|z)] - an expectation of log p(x|z) under q. LOTUS says: integrate log p(x|z) weighted by q(z|x); no need to find the distribution of the random variable log p(x|z) directly.
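Monte Carlo estimation is LOTUS in action: averaging g over samples of X estimates E[g(X)] without ever deriving the distribution of g(X). A sketch for the exponential second moment (λ = 2 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
x = rng.exponential(scale=1 / lam, size=1_000_000)

# LOTUS: estimate E[X^2] by averaging g(x) = x^2 over samples of X,
# never deriving the distribution of X^2 itself.
second_moment = (x ** 2).mean()
print(second_moment)  # ~ 2 / lam^2 = 0.5
```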
3. Variance, Standard Deviation, and Higher Moments
3.1 Variance
The variance of X measures the expected squared deviation from the mean:
Var(X) = E[(X − μ)²], where μ = E[X]
Computational formula. Expanding the square:
Var(X) = E[X² − 2μX + μ²] = E[X²] − 2μ² + μ² = E[X²] − (E[X])²
This is the Konig-Huygens formula: Var(X) = E[X²] − (E[X])².
Properties of variance:
- Var(X) ≥ 0, with equality iff X is almost surely constant.
- Var(aX + b) = a² Var(X) for any constants a, b.
- Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y) (see Section4.4).
- If X ⊥ Y: Var(X + Y) = Var(X) + Var(Y).
Proof of property 2.
Var(aX + b) = E[(aX + b − (aμ + b))²] = E[a²(X − μ)²] = a² Var(X)
Note: the constant b does not affect variance (shifts don't change spread).
Standard examples:
- Bernoulli(p): E[X] = p, E[X²] = p, Var(X) = p(1 − p). Maximum at p = 1/2.
- Binomial(n, p): E[X] = np, Var(X) = np(1 − p).
- Normal(μ, σ²): E[X] = μ, Var(X) = σ² (by construction).
- Exponential(λ): E[X] = 1/λ, E[X²] = 2/λ², Var(X) = 1/λ².
- Poisson(λ): E[X] = Var(X) = λ (special property).
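The two variance formulas (definitional and computational) agree identically on data, and for Poisson samples both land near the rate parameter. A quick check (λ = 4 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(lam=4.0, size=1_000_000).astype(float)

mean = x.mean()
var_direct = ((x - mean) ** 2).mean()     # E[(X - mu)^2]
var_koenig = (x ** 2).mean() - mean ** 2  # E[X^2] - (E[X])^2 (Konig-Huygens)
print(mean, var_direct, var_koenig)       # Poisson: both variances ~ mean ~ 4
```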
For AI: Variance quantifies uncertainty. The variance of a model's predictions measures how much predictions vary with the training set - high variance = overfitting. The variance of gradient estimates in SGD determines how noisy the update steps are - high variance slows convergence. Reducing gradient variance is the goal of techniques like control variates, variance reduction in REINFORCE, and importance-weighted estimators.
3.2 Standard Deviation and Coefficient of Variation
The standard deviation σ = √Var(X) has the same units as X (variance has squared units). It is the more interpretable spread measure.
The coefficient of variation (CV) normalises the standard deviation by the mean:
CV = σ/μ
CV is dimensionless and measures relative spread. A CV of 0.1 means the standard deviation is 10% of the mean. This is useful when comparing spread across distributions with different scales.
Standardisation. Subtracting the mean and dividing by the standard deviation gives the standardised variable:
Z = (X − μ)/σ
which has E[Z] = 0 and Var(Z) = 1.
For AI: Batch Normalisation (Ioffe & Szegedy, 2015) standardises layer activations during training: for each feature, subtract the batch mean and divide by the batch standard deviation. This forces mean 0 and variance 1, then rescales with learnable parameters γ and β. Layer Norm and RMS Norm are variations that normalise over the feature dimension rather than the batch dimension - critical for transformers where batch sizes can be 1.
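The normalisation step (before the learnable rescaling) is just per-feature standardisation. A minimal sketch, with an arbitrary synthetic "activation" batch and the usual small epsilon for numerical safety:

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(256, 8))  # (batch, features)

eps = 1e-5                        # numerical-safety constant, as in BatchNorm
mu = batch.mean(axis=0)           # per-feature batch mean
sigma = batch.std(axis=0)         # per-feature batch standard deviation
z = (batch - mu) / (sigma + eps)  # standardised activations

print(z.mean(axis=0).round(4))    # ~ 0 for every feature
print(z.std(axis=0).round(4))     # ~ 1 for every feature
```

Normalising over `axis=0` is the batch dimension (BatchNorm); normalising the same array over `axis=1` instead would be the Layer Norm variant.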
3.3 Raw vs Central Moments
Raw moments (moments about zero):
μ'_k = E[X^k]
Central moments (moments about the mean μ):
μ_k = E[(X − μ)^k]
Relationship: The central moments can be expressed in terms of raw moments via the binomial theorem:
μ_k = ∑_{j=0}^k C(k, j) μ'_j (−μ)^{k−j}
For the first four:
- μ_1 = 0 (always - the mean is the center)
- μ_2 = μ'_2 − μ²
- μ_3 = μ'_3 − 3μ μ'_2 + 2μ³
- μ_4 = μ'_4 − 4μ μ'_3 + 6μ² μ'_2 − 3μ⁴
Standardised moments divide central moments by σ^k, making them dimensionless:
γ_1 = μ_3/σ³ = skewness, γ_2 = μ_4/σ⁴ − 3 = excess kurtosis (kurtosis of normal = 3; excess kurtosis subtracts 3).
3.4 Skewness
Skewness measures the asymmetry of a distribution about its mean:
γ_1 = E[((X − μ)/σ)³] = μ_3/σ³
- γ_1 > 0 (right-skewed / positive skew): Long tail on the right; mean > median > mode. Examples: income distributions, insurance claims, LLM generation probabilities.
- γ_1 = 0 (symmetric): Normal distribution, uniform, any distribution symmetric about its mean.
- γ_1 < 0 (left-skewed / negative skew): Long tail on the left; mean < median < mode. Examples: age at death in developed countries.
SKEWNESS INTUITION
========================================================================
Right-skewed (\\gamma_1 > 0): Left-skewed (\\gamma_1 < 0):
## ##
#### ####
######... ...##########
--+--+--+------ ------+--+--+--+--
Mode < Median < Mean Mean < Median < Mode
Tail pulls mean right Tail pulls mean left
========================================================================
For AI: Gradient distributions in deep networks are often right-skewed - most gradients are small, but occasional large gradients occur. This is why gradient clipping is standard practice: the long right tail causes instability if unconstrained. Token probability distributions from language models are also right-skewed: most tokens have very low probability, while a handful of likely tokens dominate.
Skewness of common distributions:
- Normal(μ, σ²): γ_1 = 0 (symmetric by construction)
- Exponential(λ): γ_1 = 2 (always right-skewed)
- Poisson(λ): γ_1 = 1/√λ (approaches 0 as λ → ∞)
- Binomial(n, p): γ_1 = (1 − 2p)/√(np(1 − p))
3.5 Kurtosis and Excess Kurtosis
Kurtosis measures the weight of the tails relative to a normal distribution:
Kurt(X) = E[((X − μ)/σ)⁴] = μ_4/σ⁴
Excess kurtosis (the standard in statistics) subtracts the normal baseline of 3:
γ_2 = Kurt(X) − 3
- γ_2 > 0 (leptokurtic): Heavier tails than Gaussian; more probability in the tails and peak. Examples: Student-t distribution, financial returns, gradient norms.
- γ_2 = 0 (mesokurtic): Normal distribution (the baseline).
- γ_2 < 0 (platykurtic): Lighter tails; more uniform. Uniform distribution has γ_2 = −6/5.
KURTOSIS AND TAIL WEIGHT
========================================================================
Leptokurtic (\\gamma_2>0): Mesokurtic (\\gamma_2=0): Platykurtic (\\gamma_2<0):
# # ###
### ### #####
##### ##### #######
......... ....... .......
Heavy tails - outliers Normal tails Thin tails
common. Student-t, loss Gaussian Uniform
gradients, finance
========================================================================
For AI: Gradient norm distributions during training are leptokurtic - occasional large-gradient events are far more common than a Gaussian would predict. This motivates both gradient clipping (cap the maximum norm) and heavy-tailed noise models for understanding why SGD generalises. The Student-t distribution (excess kurtosis γ_2 = 6/(ν − 4) for ν > 4) is often used to model heavy-tailed distributions in robust statistics.
Kurtosis of common distributions (excess kurtosis γ_2):
- Normal: γ_2 = 0 (by definition - the reference)
- Uniform: γ_2 = −6/5
- Exponential: γ_2 = 6
- Laplace: γ_2 = 3
- Student-t(ν): γ_2 = 6/(ν − 4) for ν > 4; infinite for 2 < ν ≤ 4
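Sample skewness and excess kurtosis are just standardised third and fourth sample moments. A quick check against the exponential's known values (skewness 2, excess kurtosis 6):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=2_000_000)   # Exponential(1)

z = (x - x.mean()) / x.std()          # standardise first
skew = (z ** 3).mean()                # sample skewness
excess_kurt = (z ** 4).mean() - 3.0   # sample excess kurtosis
print(skew, excess_kurt)              # ~ 2 and ~ 6
```

Note the large sample size: fourth-moment estimates are themselves high-variance (their error is controlled by the eighth moment), so kurtosis estimates converge slowly.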
4. Covariance and Correlation
4.1 Covariance as an Operator
The covariance of two random variables X and Y measures their linear co-movement:
Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
Computational formula:
Cov(X, Y) = E[XY] − E[X]E[Y]
Proof: E[(X − μ_X)(Y − μ_Y)] = E[XY] − μ_X E[Y] − μ_Y E[X] + μ_X μ_Y = E[XY] − E[X]E[Y].
Properties (bilinearity):
- Cov(X, Y) = Cov(Y, X) (symmetric)
- Cov(aX + b, cY + d) = ac Cov(X, Y) (bilinear, constants drop out)
- Cov(∑_i a_i X_i, Y) = ∑_i a_i Cov(X_i, Y) (linear in first argument)
- Cov(X, Y)² ≤ Var(X) Var(Y) (Cauchy-Schwarz, proved in Section7.3)
Sign interpretation:
- Cov(X, Y) > 0: X and Y tend to be above their means simultaneously.
- Cov(X, Y) < 0: when X is above its mean, Y tends to be below its mean.
- Cov(X, Y) = 0: no linear relationship (but X and Y may still be nonlinearly dependent).
Recall from Section03: The full covariance matrix of a random vector X collects all pairwise covariances: Σ_ij = Cov(X_i, X_j). The multivariate Gaussian is parameterised by (μ, Σ). -> Full treatment: Joint Distributions Section03
4.2 Pearson Correlation Coefficient
The Pearson correlation normalises covariance to remove units:
ρ(X, Y) = Cov(X, Y)/(σ_X σ_Y)
Theorem: −1 ≤ ρ ≤ 1.
Proof via Cauchy-Schwarz. By the Cauchy-Schwarz inequality (proved in Section7.3):
|Cov(X, Y)| ≤ √(Var(X) Var(Y)) = σ_X σ_Y
Dividing both sides by σ_X σ_Y: |ρ| ≤ 1, so −1 ≤ ρ ≤ 1.
Boundary cases:
- ρ = 1: Y = aX + b for some a > 0 (perfect positive linear relationship).
- ρ = −1: Y = aX + b for some a < 0 (perfect negative linear relationship).
- ρ = 0: uncorrelated (no linear relationship - but not necessarily independent).
For AI: The correlation coefficient appears in weight matrix analysis. The effective rank of a weight matrix is related to the correlations between its columns. In attention, high correlation between key-query pairs produces high attention weights. In representational similarity analysis (RSA), neural network layers are compared by computing correlation matrices of their activations.
4.3 Independence and Zero Covariance
Theorem. If X ⊥ Y, then Cov(X, Y) = 0.
Proof. If X ⊥ Y, then E[g(X)h(Y)] = E[g(X)]E[h(Y)] for all (measurable, integrable) g, h (a characterisation of independence). In particular E[XY] = E[X]E[Y]. Therefore Cov(X, Y) = E[XY] − E[X]E[Y] = 0.
The converse is FALSE. Zero covariance does not imply independence.
Classic counterexample. Let X ∼ N(0, 1) and Y = X². Then:
- E[X] = 0 (symmetric distribution)
- Cov(X, Y) = E[X³] − E[X]E[X²] = 0 (odd function, symmetric distribution)
But Y is completely determined by X - knowing X tells you Y exactly. They are maximally dependent!
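The counterexample simulates cleanly. Covariance between X and Y = X² is near zero, yet a nonlinear transform of X (here |X|, my choice) is strongly correlated with Y, revealing the dependence:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = x ** 2                     # Y is a deterministic function of X

cov_xy = (x * y).mean() - x.mean() * y.mean()  # ~ E[X^3] = 0: uncorrelated
corr_abs = np.corrcoef(np.abs(x), y)[0, 1]     # ~ 0.94: dependence shows up
print(cov_xy, corr_abs)                        # after a nonlinear transform
```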
For jointly Gaussian variables only: Zero covariance DOES imply independence. This is a special property of the Gaussian distribution - it is the unique distribution for which uncorrelated implies independent (for the joint distribution; marginal Gaussians with zero covariance need not be jointly Gaussian).
For AI: Feature selection based on correlation can miss highly informative features that have nonlinear (but zero-covariance) relationships with the target. Mutual information is a better measure of dependence. In contrastive learning (e.g., SimCLR), redundancy reduction methods explicitly decorrelate representations - but decorrelation (zero covariance) and statistical independence are different objectives.
4.4 Variance of Sums
Theorem. For any random variables X and Y:
Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
Proof:
Var(X + Y) = E[((X − μ_X) + (Y − μ_Y))²] = E[(X − μ_X)²] + E[(Y − μ_Y)²] + 2E[(X − μ_X)(Y − μ_Y)] = Var(X) + Var(Y) + 2 Cov(X, Y)
Corollary (independence): If X ⊥ Y: Var(X + Y) = Var(X) + Var(Y).
General sum: For X_1, …, X_n:
Var(∑_i X_i) = ∑_i Var(X_i) + 2 ∑_{i<j} Cov(X_i, X_j)
If all X_i are independent: Var(∑_i X_i) = ∑_i Var(X_i).
Scaling. For iid X_1, …, X_n with variance σ²:
Var(X̄) = Var((1/n) ∑_i X_i) = (1/n²) · nσ² = σ²/n
The variance of the sample mean decreases as 1/n - the statistical basis for why larger datasets reduce estimation uncertainty.
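The σ²/n scaling is easy to see empirically by simulating many sample means at different n (the Gaussian population with σ² = 4 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0
results = {}
for n in (10, 100, 1000):
    # 5,000 independent sample means, each over n iid draws with variance 4.
    means = rng.normal(scale=np.sqrt(sigma2), size=(5_000, n)).mean(axis=1)
    results[n] = means.var()
    print(n, results[n])   # ~ sigma2 / n: 0.4, 0.04, 0.004
```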
For AI: Weight initialisation strategies (Xavier/Glorot, He) are derived from variance propagation. If y = ∑_{i=1}^{n_in} w_i x_i with independent zero-mean weights and inputs, then the variance of each output neuron is Var(y) = n_in Var(w) Var(x). To keep variance constant across layers (preventing vanishing/exploding activations): Var(w) = 1/n_in (Xavier) or Var(w) = 2/n_in (He for ReLU). This is the variance-of-sums formula applied to layer activations.
5. Conditional Expectation
5.1 Definition and Basic Properties
The conditional expectation E[Y | X = x] is the expected value of Y given that X takes the specific value x. It is a function of x, not a number.
Discrete case:
E[Y | X = x] = ∑_y y P(Y = y | X = x)
Continuous case:
E[Y | X = x] = ∫ y f_{Y|X}(y | x) dy
As a random variable. When we write E[Y | X] (without specifying the value of X), we mean the random variable obtained by composing the function h(x) = E[Y | X = x] with X. This is a function of X and is therefore itself a random variable.
Basic properties:
- E[aY + bZ | X] = aE[Y | X] + bE[Z | X] (linearity)
- E[g(X)Y | X] = g(X) E[Y | X] (taking out known factors)
- If X ⊥ Y: E[Y | X] = E[Y] (conditioning on an independent variable doesn't help)
- E[E[Y | X]] = E[Y] (tower property - see Section5.2)
Example. Let (X, Y) be jointly uniform on the triangle 0 ≤ y ≤ x ≤ 1. From Section03 (Appendix V.1), the conditional density is f_{Y|X}(y | x) = 1/x on [0, x]. Therefore:
E[Y | X = x] = ∫_0^x y (1/x) dy = x/2
The conditional mean of Y given X = x is x/2 - exactly half of x, which makes sense since Y is uniform on [0, x].
5.2 Tower Property (Iterated Expectation)
The tower property (also called the law of iterated expectations or Adam's law) is one of the most useful results in probability:
E[E[Y | X]] = E[Y]
More generally, for any function g:
E[g(X) E[Y | X]] = E[g(X) Y]
Proof (continuous case):
E[E[Y | X]] = ∫ E[Y | X = x] f_X(x) dx = ∫ (∫ y f_{Y|X}(y | x) dy) f_X(x) dx = ∬ y f(x, y) dy dx = E[Y]
Intuition. If you first average within groups defined by , then average those group means weighted by group size, you get the overall mean of .
Example: Expected test score. A class has 40% students from Group A (mean score 75) and 60% from Group B (mean score 85). Overall expected score: 0.4 × 75 + 0.6 × 85 = 81. This is exactly the tower property: E[S] = E[E[S | G]] where G is group membership.
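The test-score example can be simulated directly; the within-group score spread below (std 5) is an assumption of mine and does not affect the means:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Group A with prob 0.4 (mean 75), group B with prob 0.6 (mean 85).
in_a = rng.random(n) < 0.4
scores = np.where(in_a, rng.normal(75, 5, n), rng.normal(85, 5, n))

overall = scores.mean()                              # E[S]
iterated = 0.4 * scores[in_a].mean() \
         + 0.6 * scores[~in_a].mean()                # E[E[S|G]]
print(overall, iterated)   # both ~ 81
```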
For AI: The tower property underlies the EM algorithm. The E-step computes Q(θ | θ_t) = E_{Z|X,θ_t}[log p(X, Z | θ)] - the expected complete-data log-likelihood, where the expectation is over the latent variable Z given the observed X. The tower property guarantees that maximising Q increases log p(X | θ), the marginal likelihood. In policy gradient methods, E[R] = E[E[R | S_0]] - decomposing trajectory returns by initial state.
5.3 Conditional Variance and Law of Total Variance
The conditional variance of Y given X = x is:
Var(Y | X = x) = E[(Y − E[Y | X = x])² | X = x]
Law of Total Variance (Eve's Law):
Var(Y) = E[Var(Y | X)] + Var(E[Y | X])
This decomposes total variance into:
- Within-group variance: E[Var(Y | X)] - average variability within each value of X.
- Between-group variance: Var(E[Y | X]) - variability of the group means.
Proof:
Using the tower property twice and writing W = E[Y | X]:
Var(Y) = E[Y²] − (E[Y])² = E[E[Y² | X]] − (E[W])² = E[Var(Y | X) + W²] − (E[W])²
And E[W²] − (E[W])² = Var(W). Therefore:
Var(Y) = E[Var(Y | X)] + Var(E[Y | X])
For AI: The law of total variance is the mathematical foundation of the bias-variance decomposition (Section8). For a model trained on dataset : total error = expected within-dataset error + between-dataset variance. It also explains why mixture models (GMMs) combine within-component variance and between-component variance.
5.4 AI Applications of Conditional Expectation
Optimal predictor. The function h*(x) = E[Y | X = x] minimises the expected squared error:
h* = argmin_h E[(Y − h(X))²]
Proof: For any h,
E[(Y − h(X))²] = E[(Y − h*(X) + h*(X) − h(X))²] = E[(Y − h*(X))²] + 2E[(Y − h*(X))(h*(X) − h(X))] + E[(h*(X) − h(X))²]
The cross term vanishes because E[Y − h*(X) | X] = 0, so E[(Y − h*(X)) g(X)] = 0 by the tower property for any g. The last term is non-negative. Therefore E[(Y − h(X))²] ≥ E[(Y − h*(X))²].
This is profound: the best possible predictor (in MSE sense) is the conditional expectation. Neural networks are universal approximators trained to approximate E[Y | X].
Rao-Blackwell theorem. In estimation theory: if θ̂ is any unbiased estimator of θ and T is a sufficient statistic, then θ̂' = E[θ̂ | T] is also unbiased and has variance Var(θ̂') ≤ Var(θ̂). Conditioning on sufficient statistics never increases variance. This is the variance reduction analogue of why attention over relevant context helps.
Attention as conditional expectation. The output of self-attention at position i is:
out_i = ∑_j α_ij v_j, where α_ij = softmax_j(q_i · k_j / √d_k)
This is the conditional expectation of the value vectors under the attention distribution over key positions. High-entropy attention = averaging many values; low-entropy (sharp) attention = conditioning on one specific value.
6. Moment Generating Functions and Characteristic Functions
6.1 MGF Definition and Basic Properties
The moment generating function (MGF) of a random variable X is:
M_X(t) = E[e^{tX}]
defined for all t in an open interval containing 0. If no such interval exists (the integral diverges for every t ≠ 0), the MGF is said not to exist.
Moments from the MGF. Differentiating n times and evaluating at t = 0 gives the n-th raw moment:
M_X^{(n)}(0) = E[X^n]
Proof: M_X(t) = E[∑_n (tX)^n/n!] = ∑_n t^n E[X^n]/n! (when the interchange of expectation and sum is valid).
Differentiating n times and setting t = 0 kills every term except E[X^n].
Uniqueness theorem. If M_X(t) = M_Y(t) for t in an open neighbourhood of 0, then X and Y have the same distribution. The MGF uniquely determines the distribution (when it exists).
Existence. The MGF exists when E[e^{tX}] < ∞ for |t| < h for some h > 0. Heavy-tailed distributions (Cauchy, Pareto with small α, log-normal) do NOT have finite MGFs. The characteristic function (Section6.4) always exists and is the preferred tool for such distributions.
6.2 MGF of Common Distributions
Bernoulli(p):
M(t) = 1 − p + pe^t
Check: M'(0) = p = E[X], M''(0) = p = E[X²]. [ok]
Binomial(n, p): M(t) = (1 − p + pe^t)^n (sum of n iid Bernoulli)
First moment: M'(0) = np. [ok]
Poisson(λ):
M(t) = exp(λ(e^t − 1))
Derivation: E[e^{tX}] = ∑_k e^{tk} e^{−λ} λ^k/k! = e^{−λ} ∑_k (λe^t)^k/k! = e^{λ(e^t − 1)}. M'(0) = λ, M''(0) = λ² + λ, so Var(X) = λ. [ok]
Normal(μ, σ²):
M(t) = exp(μt + σ²t²/2)
Derivation: Complete the square in the exponent of E[e^{tX}]. M'(0) = μ, M''(0) = μ² + σ², Var(X) = σ². [ok]
Exponential(λ):
M(t) = λ/(λ − t) (converges iff t < λ)
Derivation: ∫_0^∞ e^{tx} λe^{−λx} dx = λ/(λ − t). M'(0) = 1/λ, M''(0) = 2/λ², Var(X) = 1/λ². [ok]
For AI: MGFs appear in the proof of the Chernoff bound (Section05), which gives exponentially sharp tail bounds. The exponential form is closely related to the log-partition function in the exponential family: p(x | θ) = h(x) exp(θᵀT(x) − A(θ)). The moments of the sufficient statistic T(x) can be computed by differentiating A(θ) - the MGF and the log-partition function are connected by a change of variables.
6.3 MGF of Sums and the Reproductive Property
Key theorem. If X ⊥ Y:
M_{X+Y}(t) = M_X(t) M_Y(t)
Proof: M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = M_X(t) M_Y(t), using independence for the third equality.
This is the MGF version of the convolution theorem. It provides elegant proofs of reproductive properties:
Gaussian reproductive property. If X ∼ N(μ_1, σ_1²) and Y ∼ N(μ_2, σ_2²) independently:
M_{X+Y}(t) = exp(μ_1 t + σ_1² t²/2) · exp(μ_2 t + σ_2² t²/2) = exp((μ_1 + μ_2)t + (σ_1² + σ_2²)t²/2)
This is the MGF of N(μ_1 + μ_2, σ_1² + σ_2²). So X + Y ∼ N(μ_1 + μ_2, σ_1² + σ_2²).
Poisson reproductive property. If X ∼ Poisson(λ_1), Y ∼ Poisson(λ_2) independently:
M_{X+Y}(t) = exp(λ_1(e^t − 1)) · exp(λ_2(e^t − 1)) = exp((λ_1 + λ_2)(e^t − 1))
So X + Y ∼ Poisson(λ_1 + λ_2).
6.4 Characteristic Function
The characteristic function of X is:
φ_X(t) = E[e^{itX}]
where i = √(−1).
Key advantage: φ_X exists for ALL distributions (since |e^{itX}| = 1 always), unlike the MGF which may not exist for heavy-tailed distributions.
Connection to MGF: φ_X(t) = M_X(it) when the MGF exists. But the characteristic function is defined even when the MGF is not.
Moments from φ. When moments exist:
E[X^n] = i^{−n} φ_X^{(n)}(0)
Inversion formula. The distribution is uniquely determined by its characteristic function:
f(x) = (1/2π) ∫ e^{−itx} φ_X(t) dt
(when X is continuous with integrable φ_X). This is the Fourier inversion formula - the characteristic function IS the Fourier transform of the PDF.
For AI: Characteristic functions appear in the analysis of convergence of distributions (Central Limit Theorem proofs use characteristic functions because they always exist). The score-based diffusion model framework uses the score function ∇_x log p(x), which is related to the characteristic function via Fourier analysis. Random Fourier Features (Rahimi & Recht, 2007) approximate kernel functions via random sampling from the characteristic function of the kernel.
6.5 Cumulants and the Cumulant Generating Function
The cumulant generating function (CGF) is the log of the MGF:
K_X(t) = log M_X(t)
Cumulants κ_n are the coefficients in the Taylor expansion of K:
K_X(t) = ∑_{n≥1} κ_n t^n/n!
so κ_n = K_X^{(n)}(0).
First four cumulants:
- κ_1 = μ (mean)
- κ_2 = σ² (variance)
- κ_3 = μ_3 (third central moment = γ_1 σ³)
- κ_4 = μ_4 − 3σ⁴ (excess kurtosis γ_2 times σ⁴)
Key property. For independent X ⊥ Y:
K_{X+Y}(t) = K_X(t) + K_Y(t)
Cumulants of a sum of independent variables add - this is the additivity property that makes cumulants more natural than moments for sums.
Normal distribution: K(t) = μt + σ²t²/2. All cumulants κ_n = 0 for n ≥ 3. This characterises the normal distribution - it is the unique distribution with all cumulants beyond the second equal to zero.
7. Jensen's Inequality and Moment Inequalities
7.1 Jensen's Inequality
Definition. A function f is convex if for all x, y and λ ∈ [0, 1]:
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
Geometrically: the chord between any two points lies above the curve. Equivalently (for twice-differentiable f): f'' ≥ 0 everywhere.
Theorem (Jensen's Inequality). If f is convex and E[X] exists:
f(E[X]) ≤ E[f(X)]
Proof (via supporting hyperplane). Since f is convex, for any point x_0 there exists a supporting hyperplane at x_0: a linear function ℓ(x) = f(x_0) + g(x − x_0) such that f(x) ≥ ℓ(x) for all x (where g is the subgradient at x_0). Setting x_0 = E[X]:
f(X) ≥ f(E[X]) + g(X − E[X])
Taking expectations of both sides:
E[f(X)] ≥ f(E[X]) + g · E[X − E[X]] = f(E[X])
Equality condition: E[f(X)] = f(E[X]) iff f is linear on the support of X, or X is almost surely constant.
For concave f: E[f(X)] ≤ f(E[X]) (Jensen's inequality reverses).
Common applications:
| Function | Convex/Concave | Jensen gives |
|---|---|---|
| x² | Convex | E[X²] ≥ (E[X])², i.e., Var(X) ≥ 0 |
| e^x | Convex | E[e^X] ≥ e^{E[X]} |
| 1/x (x > 0) | Convex | E[1/X] ≥ 1/E[X] |
| log x | Concave | E[log X] ≤ log E[X] |
| √x | Concave | E[√X] ≤ √E[X] |
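The exp and log inequalities can be checked numerically on any positive random variable; an Exponential with scale 0.25 (my choice, small enough that E[e^X] is finite and low-variance) works well:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=0.25, size=1_000_000)  # positive samples

exp_lhs, exp_rhs = np.exp(x).mean(), np.exp(x.mean())  # convex: lhs >= rhs
log_lhs, log_rhs = np.log(x).mean(), np.log(x.mean())  # concave: lhs <= rhs
print(exp_lhs, exp_rhs)
print(log_lhs, log_rhs)
```

For this distribution E[e^X] = 4/3 while e^{E[X]} = e^{0.25} ≈ 1.284, so the convex gap is visible but not huge; the concave (log) gap is much larger because log is strongly curved near 0.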
7.2 Applications: KL Divergence and ELBO
KL divergence non-negativity. The Kullback-Leibler divergence is:
D_KL(P‖Q) = E_P[log(P(X)/Q(X))]
Theorem (Gibbs' inequality). D_KL(P‖Q) ≥ 0, with equality iff P = Q almost everywhere.
Proof via Jensen:
−D_KL(P‖Q) = E_P[log(Q(X)/P(X))] ≤ log E_P[Q(X)/P(X)] = log ∫ Q(x) dx = log 1 = 0
where we applied Jensen's inequality with the concave function log. Therefore D_KL(P‖Q) ≥ 0.
ELBO derivation via Jensen. For a latent variable model p(x, z):
log p(x) = log ∫ p(x, z) dz = log E_{q(z|x)}[p(x, z)/q(z|x)] ≥ E_{q(z|x)}[log(p(x, z)/q(z|x))]
where the inequality applies Jensen's to the concave log. This lower bound is the ELBO:
ELBO(q) = E_{q(z|x)}[log p(x|z)] − D_KL(q(z|x) ‖ p(z))
The gap between log p(x) and the ELBO is exactly D_KL(q(z|x) ‖ p(z|x)). Maximising the ELBO tightens this bound.
Log-sum inequality. For non-negative a_1, …, a_n and b_1, …, b_n:
∑_i a_i log(a_i/b_i) ≥ (∑_i a_i) log(∑_i a_i / ∑_i b_i)
This is the discrete version of KL non-negativity via Jensen.
7.3 Cauchy-Schwarz Inequality
Theorem (Cauchy-Schwarz for expectations). For any random variables X and Y:
(E[XY])² ≤ E[X²] E[Y²]
with equality iff Y = cX almost surely for some constant c.
Proof. For any t ∈ ℝ:
0 ≤ E[(X − tY)²] = E[X²] − 2t E[XY] + t² E[Y²]
This is a quadratic in t that is always non-negative. A quadratic at² + bt + c ≥ 0 for all t iff its discriminant b² − 4ac ≤ 0. Here a = E[Y²], b = −2E[XY], c = E[X²]:
4(E[XY])² − 4 E[X²] E[Y²] ≤ 0, i.e., (E[XY])² ≤ E[X²] E[Y²]
Corollary (|ρ| ≤ 1). Apply the Cauchy-Schwarz inequality to the centred variables X − μ_X and Y − μ_Y:
Cov(X, Y)² ≤ Var(X) Var(Y)
Dividing: ρ² ≤ 1, i.e., −1 ≤ ρ ≤ 1.
For AI: Cauchy-Schwarz appears in attention - the dot product q · k is bounded by ‖q‖‖k‖, motivating the scaling in scaled dot-product attention (softmax(q · k / √d_k)). Without this scaling, the dot products can become large in magnitude for high-dimensional keys/queries, causing the softmax to saturate and gradients to vanish.
7.4 Lyapunov Inequality
Theorem (Lyapunov's inequality). For 0 < s < t:
(E[|X|^s])^{1/s} ≤ (E[|X|^t])^{1/t}
In words: the L^p norm of X (as a random variable) is non-decreasing in p.
Proof. Apply Jensen's inequality with the convex function φ(y) = y^{t/s} (convex since t/s > 1) to the random variable Y = |X|^s:
(E[|X|^s])^{t/s} = φ(E[Y]) ≤ E[φ(Y)] = E[|X|^t]
Taking t-th roots of both sides gives (E[|X|^s])^{1/s} ≤ (E[|X|^t])^{1/t}.
Consequence. If the t-th moment is finite, all lower moments are finite. If E[X⁴] < ∞, then E[|X|³], E[X²], and E[|X|] are finite as well.
For AI: Lyapunov's inequality underlies the Lyapunov CLT (a variant of the central limit theorem for non-identically distributed variables). In the theory of stochastic gradient descent, the existence of higher moments of the gradient controls the tightness of convergence bounds.
7.5 Preview: Tail Bounds from Moments
The most direct application of moments to probability is bounding how far a random variable can stray from its mean. These are covered fully in Section05, but we preview the key results here.
Preview: Markov's Inequality
For any non-negative random variable X and a > 0:
P(X ≥ a) ≤ E[X]/a
Proof: E[X] ≥ E[X · 1{X ≥ a}] ≥ a P(X ≥ a). The mean bounds the probability of large values.
-> Full treatment and applications: Section05 Concentration Inequalities
Preview: Chebyshev's Inequality
For any random variable X with mean μ and variance σ²:
P(|X − μ| ≥ kσ) ≤ 1/k²
Proof: Apply Markov's inequality to (X − μ)²: P(|X − μ| ≥ kσ) = P((X − μ)² ≥ k²σ²) ≤ σ²/(k²σ²) = 1/k². The variance bounds how often X deviates from its mean by k standard deviations.
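Empirical tail frequencies always sit below the Chebyshev bound. A quick check on Exponential(1), which has μ = σ = 1 (the distribution choice is mine; any finite-variance distribution works):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=1_000_000)   # Exponential(1): mu = 1, sigma = 1
mu, sigma = 1.0, 1.0

tails = {}
for k in (2, 3, 4):
    tails[k] = np.mean(np.abs(x - mu) >= k * sigma)  # empirical tail prob
    print(k, tails[k], 1 / k**2)  # always <= the Chebyshev bound 1/k^2
```

Here the bound is quite loose (the true tail P(X ≥ 3) = e^{-3} ≈ 0.05 vs the bound 0.25) - sharper MGF-based bounds are the subject of Section05.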
-> Full treatment and sharper bounds: Section05 Concentration Inequalities
8. Bias-Variance Decomposition
8.1 Statistical Risk and Optimal Predictor
Consider predicting Y from X using a function h. The statistical risk (expected squared loss) is:
R(h) = E[(Y − h(X))²]
Theorem. The minimiser of R over all measurable functions is the conditional expectation:
h*(x) = E[Y | X = x]
and the minimum risk equals the irreducible noise:
R(h*) = E[Var(Y | X)]
Proof: From Section5.4, for any h:
R(h) = R(h*) + E[(h*(X) − h(X))²] ≥ R(h*)
And R(h*) = E[(Y − E[Y | X])²] = E[Var(Y | X)] by definition.
The Bayes risk (irreducible error) R(h*) cannot be reduced by any model, no matter how complex. It is the noise inherent in the prediction problem.
For regression: With Y = f(X) + ε where ε is independent noise with variance σ², the Bayes risk is σ².
For AI: The irreducible error is why perfect test accuracy is impossible on noisy datasets. In language modelling, even a perfect model (one that exactly learns the true distribution) will have non-zero cross-entropy loss equal to the entropy of the true distribution. GPT-4's perplexity on human text is bounded below by the entropy of human language.
8.2 Bias-Variance-Noise Decomposition
In practice, we learn a model ĥ_D from a finite dataset D of n samples. The model is a random function - it depends on the random training data. We can decompose the expected prediction error at a new test point x:
E_{D,ε}[(y − ĥ_D(x))²] = (f(x) − E_D[ĥ_D(x)])² + Var_D(ĥ_D(x)) + σ² = Bias² + Variance + Noise
Proof. Let h̄(x) = E_D[ĥ_D(x)] (expected prediction) and y = f(x) + ε (noisy observation):
E[(y − ĥ_D(x))²] = E[(f(x) + ε − ĥ_D(x))²] = E[(f(x) − ĥ_D(x))²] + σ²
The cross term is 0 (since E[ε] = 0 and ε is independent of D). Then:
E[(f(x) − ĥ_D(x))²] = (f(x) − h̄(x))² + E[(h̄(x) − ĥ_D(x))²]
where this cross term vanishes because E_D[ĥ_D(x) − h̄(x)] = 0. Combining: Bias² + Variance + Noise.
BIAS-VARIANCE TRADEOFF
========================================================================
Error Total Error
^ /
| Variance /
| _______________/
| /
| / Bias^2
| / _______________
|_____/__/________________
| Noise
+-------------------------------- Model Complexity ->
Classical view: sweet spot minimises total error
========================================================================
Interpretations:
- High bias: The model is too simple to capture f(x) = E[Y | X = x] (underfitting). Consistent but wrong. Example: linear model for a quadratic f.
- High variance: The model is too complex - it fits training noise (overfitting). Right on average but variable. Example: high-degree polynomial on limited data.
- Noise: Irreducible. The best we can do.
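The decomposition can be estimated empirically by refitting a model on many independently drawn datasets. A sketch (the quadratic target, noise level, dataset size, and polynomial degrees are all illustrative assumptions of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2        # true regression function (illustrative)
sigma = 0.5                 # noise std: irreducible error sigma^2 = 0.25
x_test = 0.5

def fit_predict(degree, n=30):
    # Draw a fresh dataset, fit a polynomial, predict at x_test.
    x = rng.uniform(-1, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    return np.polyval(np.polyfit(x, y, degree), x_test)

results = {}
for degree in (1, 2, 9):
    preds = np.array([fit_predict(degree) for _ in range(2_000)])
    results[degree] = ((preds.mean() - f(x_test)) ** 2,  # bias^2
                       preds.var())                       # variance
    print(degree, results[degree])
```

Degree 1 shows the high-bias pattern, degree 2 matches the target's complexity, and degree 9 shows the high-variance pattern - the classical U-shape in miniature.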
8.3 Double Descent and Modern Deep Learning
The classical bias-variance picture predicts a unique optimal complexity. Modern deep learning violates this: neural networks trained to zero training loss (interpolating the training data, which classical theory predicts should overfit catastrophically) often generalise well. This is the double descent phenomenon.
DOUBLE DESCENT
========================================================================
Test Error
^
| Classical regime Modern regime
| | |
| +----+ | |
| ++ ++ | |
| ++ ++ | +----+
| ++ +--+-----------------+
| | Interpolation threshold
+---------------------------------------- Model Size ->
^ ^
Underfitting Overparameterised
regime regime (good!)
========================================================================
Why does this happen? In overparameterised models:
- Many solutions interpolate the training data.
- Gradient descent (especially with weight decay or implicit regularisation from SGD noise) selects the minimum-norm solution among all interpolating solutions.
- The minimum-norm interpolant has low variance among interpolants - it is the "flattest" fit.
The bias-variance decomposition still holds in double descent, but the variance first rises (at the interpolation threshold, the model just barely interpolates - highly sensitive to training data) and then falls again (overparameterised models have many interpolating solutions, most with low variance).
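The minimum-norm claim can be made concrete for a linear model. A minimal sketch (random data; the dimensions $n = 10$ samples, $p = 50$ features are assumptions): among the infinitely many weight vectors that interpolate the data when $p > n$, the pseudoinverse picks the one with smallest L2 norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                          # overparameterised: more features than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta_min = np.linalg.pinv(X) @ y       # minimum-norm interpolating solution
assert np.allclose(X @ beta_min, y)    # fits the training data exactly

# Any other interpolant = beta_min + (a component in the null space of X)
null_dir = rng.normal(size=p)
null_dir -= np.linalg.pinv(X) @ (X @ null_dir)   # project out the row space of X
beta_other = beta_min + null_dir
assert np.allclose(X @ beta_other, y)  # still interpolates...
print(np.linalg.norm(beta_min) < np.linalg.norm(beta_other))  # ...but with larger norm: True
```

Because the minimum-norm solution lies in the row space of $X$, any null-space component added to it is orthogonal and can only increase the norm.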
For AI (2026 context): Large language models like GPT-4, Claude, and Gemini are massively overparameterised. They interpolate (or nearly interpolate) their training data and yet generalise remarkably well - this is the double descent regime. Understanding why overparameterisation helps generalisation remains one of the central open questions in deep learning theory.
8.4 Regularisation as Bias Injection
L2 regularisation (Ridge regression). The regularised estimator minimises:

$$\hat{\beta}_\lambda = \arg\min_\beta\; \|y - X\beta\|^2 + \lambda \|\beta\|^2$$

For linear regression with design matrix $X$ (overloading notation): $\hat{\beta}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$.
Bias-variance analysis of Ridge:
- Bias: $\hat{\beta}_\lambda$ is a biased estimator of $\beta$. Specifically, $\mathbb{E}[\hat{\beta}_\lambda] = (X^\top X + \lambda I)^{-1} X^\top X\,\beta \neq \beta$ for $\lambda > 0$. Ridge shrinks estimates toward 0, introducing bias.
- Variance: $\operatorname{Cov}(\hat{\beta}_\lambda) \preceq \operatorname{Cov}(\hat{\beta}_0)$ for any $\lambda > 0$. The regularised estimator has smaller variance - it is more stable with respect to perturbations in the training data.
- Trade-off: As $\lambda$ increases, bias increases and variance decreases. The optimal $\lambda^*$ minimises total risk.
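Both effects can be computed in closed form from $\hat{\beta}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$. A sketch (the true coefficients and noise level are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 50, 3, 1.0
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.5])                        # assumed true coefficients

lams, bias2s, vars_ = (0.0, 1.0, 10.0, 100.0), [], []
for lam in lams:
    A = np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T   # beta_hat = A y
    bias = A @ X @ beta - beta                           # E[beta_hat] - beta
    bias2s.append(bias @ bias)
    vars_.append(sigma2 * np.trace(A @ A.T))             # total variance tr Cov(beta_hat)
    print(f"lambda={lam:6.1f}  bias^2={bias2s[-1]:.4f}  variance={vars_[-1]:.4f}")
```

At $\lambda = 0$ the bias is zero (plain least squares); as $\lambda$ grows, the squared bias rises monotonically while the variance falls.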
For AI: Weight decay in neural network training (the most common form of regularisation) adds $\frac{\lambda}{2}\|\theta\|^2$ to the loss, equivalent to L2 regularisation. AdamW implements weight decay correctly by decoupling it from the adaptive learning rate. Understanding the bias-variance trade-off tells us why weight decay helps: it reduces model variance (overfitting) at the cost of some bias.
9. Expectation in ML: Core Applications
9.1 Cross-Entropy Loss as an Expectation
For a $K$-class classification problem, the cross-entropy loss for a single example with true label $y \in \{1, \ldots, K\}$ and model probabilities $\hat{p}_1(x), \ldots, \hat{p}_K(x)$ is:

$$\ell(x, y) = -\log \hat{p}_y(x)$$

The expected cross-entropy over the data distribution $p_{\text{data}}(x, y)$:

$$\mathcal{L}(\theta) = \mathbb{E}_{(x, y) \sim p_{\text{data}}}\big[-\log \hat{p}_y(x)\big]$$

This equals the cross-entropy between the true label distribution $p(y \mid x)$ and the model's predicted distribution $\hat{p}(y \mid x)$, averaged over inputs $x$.
Connection to KL divergence. For each input $x$:

$$H\big(p(\cdot \mid x), \hat{p}(\cdot \mid x)\big) = H\big(p(\cdot \mid x)\big) + D_{\mathrm{KL}}\big(p(\cdot \mid x)\,\|\,\hat{p}(\cdot \mid x)\big)$$

where $H(p(\cdot \mid x))$ is the entropy of the true conditional. Since $H(p(\cdot \mid x))$ doesn't depend on $\theta$, minimising cross-entropy is equivalent to minimising KL divergence - i.e., making the model distribution as close as possible to the true distribution, in the KL sense.
For language models. The cross-entropy loss on a token sequence $x_1, \ldots, x_T$ is:

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

This is the expected negative log-probability per token - an expectation under the empirical distribution of the training corpus. Minimising this drives the model distribution toward the empirical distribution of human text. The perplexity is $\exp(\mathcal{L})$: a model with perplexity 50 is "equally surprised" by text as a uniform distribution over 50 tokens.
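Per-token cross-entropy and perplexity can be computed directly from raw logits. A minimal sketch (vocabulary size and sequence length are arbitrary assumptions; `cross_entropy` is a hypothetical helper, not a library call):

```python
import numpy as np

def cross_entropy(logits, targets):
    """Average negative log-probability of the target tokens (per-token cross-entropy)."""
    logits = logits - logits.max(axis=-1, keepdims=True)            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
V, T = 50, 8                                  # assumed vocab size and sequence length
logits = rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)
loss = cross_entropy(logits, targets)
print("loss:", loss, "perplexity:", np.exp(loss))
```

As a sanity check, all-zero logits give a uniform model, whose perplexity is exactly the vocabulary size $V$.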
9.2 ELBO as an Expected Value
The Variational Autoencoder (VAE) introduces an approximate posterior $q_\phi(z \mid x)$ (the encoder) to approximate the intractable true posterior $p_\theta(z \mid x)$. The Evidence Lower BOund (ELBO) is:

$$\mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$

Reconstruction term: $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ - the expected log-likelihood of the observed data given a latent code, averaged over the approximate posterior. This rewards the decoder for reconstructing $x$ well from samples of $q_\phi(z \mid x)$.
KL regularisation term: $D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p(z))$ - penalises the approximate posterior for straying from the prior $p(z) = \mathcal{N}(0, I)$. For a Gaussian encoder $q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x), \operatorname{diag}(\sigma_\phi^2(x))\big)$:

$$D_{\mathrm{KL}} = \frac{1}{2} \sum_{j=1}^{d} \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)$$

Gradient computation. The reconstruction term cannot be differentiated through the sampling operation directly. The reparameterisation trick (Section 3) writes $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, enabling:

$$\nabla_\phi\, \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[\nabla_\phi \log p_\theta\big(x \mid \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon\big)\big]$$

This gradient is computable by backpropagation through the deterministic function $z(\epsilon; \phi)$.
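A minimal numerical sketch of the trick (the toy objective $f(z) = z^2$ is an assumption: then $\mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[z^2] = \mu^2 + \sigma^2$, and the true gradient with respect to $\mu$ is $2\mu$):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
eps = rng.normal(size=200_000)          # noise sampled independently of the parameters
z = mu + sigma * eps                    # reparameterised sample: z ~ N(mu, sigma^2)
# d/dmu f(z) = f'(z) * dz/dmu = 2z * 1, averaged over eps
grad_mu = (2 * z).mean()
print(grad_mu)                          # close to the analytic gradient 2*mu = 3.0
```

Because the randomness lives in `eps`, the Monte Carlo average of pathwise derivatives is an unbiased, low-variance estimate of the gradient - exactly what backpropagation computes in a VAE.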
9.3 Policy Gradient (REINFORCE)
In reinforcement learning, an agent with policy $\pi_\theta(a \mid s)$ (parameterised by $\theta$) seeks to maximise the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\big[R(\tau)\big]$$

where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory. The REINFORCE gradient is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\big[R(\tau)\, \nabla_\theta \log p_\theta(\tau)\big]$$

Derivation via log-derivative trick (score function):

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta}\big[R(\tau)\, \nabla_\theta \log p_\theta(\tau)\big]$$

Since $\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ (dynamics don't depend on $\theta$), we get the REINFORCE formula.
The score function $\nabla_\theta \log \pi_\theta$ appears because we differentiated the log of the probability. The key identity underlying this:

$$\mathbb{E}_{x \sim p_\theta}\big[\nabla_\theta \log p_\theta(x)\big] = 0$$

(differentiate the normalisation condition $\int p_\theta(x)\, dx = 1$ to get $\int \nabla_\theta p_\theta(x)\, dx = 0$, then note $\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$).
For RLHF. Reinforcement learning from human feedback uses policy gradients to fine-tune language models. The policy is the LLM: $\pi_\theta(a \mid s)$, where $s$ is the context and $a$ is the generated token. The reward comes from a reward model trained on human preferences. REINFORCE and its variants (PPO, GRPO) compute gradients of the expected reward with respect to the LLM parameters.
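The estimator can be sanity-checked on a one-step "bandit" with a Gaussian policy $a \sim \mathcal{N}(\theta, 1)$ and an assumed reward $R(a) = -(a - 3)^2$, for which the true gradient is $\nabla_\theta J = -2(\theta - 3)$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, N = 0.0, 200_000
a = rng.normal(theta, 1.0, size=N)       # sample actions from the policy N(theta, 1)
R = -(a - 3.0) ** 2                      # assumed reward, peaked at a = 3
score = a - theta                        # d/dtheta log N(a; theta, 1)
grad = (R * score).mean()                # REINFORCE estimator E[R * score]
print(grad)                              # analytic gradient is -2*(theta - 3) = 6
```

Note that no derivative of $R$ is ever taken - the score function moves the gradient onto the policy, which is why REINFORCE works with black-box rewards such as a learned preference model.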
9.4 Adam Optimizer as Moment Tracking
The Adam optimizer (Kingma & Ba, 2014) maintains exponential moving averages of the first and second moments of the gradient:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

where $g_t = \nabla_\theta \mathcal{L}(\theta_{t-1})$ is the gradient at step $t$.
Bias correction. Since $m_0 = v_0 = 0$, the early estimates are biased toward zero. The debiased estimates are:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Why does this work? $\mathbb{E}[m_t] = (1 - \beta_1^t)\, \mathbb{E}[g]$ (when the gradient is stationary), so dividing by $1 - \beta_1^t$ recovers an unbiased estimate of $\mathbb{E}[g]$.
Parameter update:

$$\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

The update is the first moment (direction of the average gradient) scaled by $1 / \sqrt{\hat{v}_t}$ - normalising by the root mean square of the gradient. This is an adaptive learning rate: parameters with large gradient variance get smaller updates (conservative), parameters with small variance get larger updates (aggressive).
Connection to Fisher Information. The diagonal of the Fisher information matrix is $F_{ii} = \mathbb{E}[g_i^2]$ (when gradients are zero-mean). Adam's $\hat{v}_t$ approximates the diagonal Fisher, making Adam an approximate natural gradient method.
AdamW. Standard Adam conflates weight decay and L2 regularisation (they are equivalent for SGD but not for adaptive methods). AdamW decouples them:

$$\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \alpha \lambda\, \theta_{t-1}$$

The last term is direct weight decay - not affected by the adaptive scaling. This is the current default for training large language models.
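The update rules above fit in a few lines of NumPy. A minimal sketch of one AdamW step (the quadratic test objective and hyperparameter values are assumptions for illustration; defaults follow the common $\beta_1 = 0.9$, $\beta_2 = 0.999$ convention):

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step: Adam moment tracking plus decoupled weight decay."""
    m = b1 * m + (1 - b1) * g            # first-moment EMA
    v = b2 * v + (1 - b2) * g**2         # second-moment EMA
    m_hat = m / (1 - b1**t)              # bias correction
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * theta
    return theta, m, v

# Minimise f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself
theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 5001):
    theta, m, v = adamw_step(theta, theta.copy(), m, v, t, lr=0.05)
print(theta)   # ends near the minimiser at 0
```

Note that the weight-decay term multiplies `theta` directly and bypasses the $1/\sqrt{\hat{v}_t}$ scaling - that decoupling is exactly what distinguishes AdamW from adding an L2 penalty to the loss.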
10. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | $\mathbb{E}[g(X)] = g(\mathbb{E}[X])$ (always) | Only true when $g$ is linear. For convex $g$: $\mathbb{E}[g(X)] \geq g(\mathbb{E}[X])$ (Jensen). For $g(x) = x^2$: $\mathbb{E}[X^2] \geq (\mathbb{E}[X])^2$. | Apply LOTUS directly to compute $\mathbb{E}[g(X)]$. |
| 2 | Assuming $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$ without checking independence | True only for independent (or at least uncorrelated) $X, Y$. Dependent variables violate this. | Check $\operatorname{Cov}(X, Y) = 0$ (uncorrelated) before using this; for the full product rule $\mathbb{E}[f(X)g(Y)] = \mathbb{E}[f(X)]\,\mathbb{E}[g(Y)]$ you need independence. |
| 3 | Confusing variance and standard deviation units | Variance has squared units; std dev has the same units as $X$. Mixing them gives nonsense. | Always track units: $\operatorname{Var}(X)$ (squared units), $\sigma_X = \sqrt{\operatorname{Var}(X)}$ (same units as $X$). |
| 4 | $\operatorname{Var}(aX + b) = a^2\operatorname{Var}(X) + b$ (wrong) | The constant $b$ shifts location, not spread. Variance is invariant to translation. | $\operatorname{Var}(aX + b) = a^2\operatorname{Var}(X)$ - the $b$ vanishes. |
| 5 | $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$ without independence | This only holds when $\operatorname{Cov}(X, Y) = 0$. For positively correlated $X, Y$: the variance is larger. | $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)$. |
| 6 | Zero covariance implies independence | False in general. The unit circle example: $(X, Y)$ uniform on the unit circle has $\operatorname{Cov}(X, Y) = 0$ but $Y^2 = 1 - X^2$ is fully determined by $X$. | Use mutual information to test full independence. Covariance only measures linear dependence. |
| 7 | Confusing $\mathbb{E}[Y \mid X = x]$ (a number) with $\mathbb{E}[Y \mid X]$ (a random variable) | $\mathbb{E}[Y \mid X = x]$ is fixed once $x$ is fixed. $\mathbb{E}[Y \mid X]$ is random because $X$ is random. | Be explicit: $g(x) = \mathbb{E}[Y \mid X = x]$; then $g(X) = \mathbb{E}[Y \mid X]$ is the random variable. |
| 8 | Applying Jensen in the wrong direction for concave $g$ | Jensen: convex $g$ gives $\mathbb{E}[g(X)] \geq g(\mathbb{E}[X])$. For $g = \log$ (concave): $\mathbb{E}[\log X] \leq \log \mathbb{E}[X]$. | Check the second derivative: $g'' \geq 0$ = convex ($\mathbb{E}[g(X)] \geq g(\mathbb{E}[X])$); $g'' \leq 0$ = concave (the inequality reverses). |
| 9 | Assuming the MGF exists for all distributions | Cauchy, Pareto (small $\alpha$), log-normal: the MGF is infinite for all $t > 0$. Using MGF-based results for these fails. | Use the characteristic function (it always exists). Check integrability: $\mathbb{E}[e^{tX}] < \infty$ for $t$ in a neighbourhood of $0$. |
| 10 | Confusing raw moments and central moments | $\mathbb{E}[X^k]$ and $\mathbb{E}[(X - \mu)^k]$ differ unless $\mu = 0$. Kurtosis is $\mathbb{E}[(X - \mu)^4]/\sigma^4$, not $\mathbb{E}[X^4]/\sigma^4$. | Explicitly write which you mean. For Gaussian $\mathcal{N}(\mu, \sigma^2)$: $\mathbb{E}[(X - \mu)^4] = 3\sigma^4$ but $\mathbb{E}[X^4] = \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4$. |
11. Exercises
Exercise 1 * - LOTUS: Expected Value of a Transformation
Let $X$ have PDF $f_X(x) = 2x$ for $x \in [0, 1]$ (and $0$ elsewhere).
(a) Compute $\mathbb{E}[X]$ directly from the definition.
(b) Using LOTUS, compute $\mathbb{E}[X^2]$.
(c) Compute $\operatorname{Var}(X)$.
(d) Let $Y = 1/X$. Compute $\mathbb{E}[Y]$ using LOTUS. (Hint: $\mathbb{E}[1/X] = \int_0^1 \frac{1}{x}\, f_X(x)\, dx$.) What is the range of $p$ for which $\mathbb{E}[X^{-p}]$ is finite?
Exercise 2 * - Linearity Without Independence
Let $X \sim \mathcal{N}(0, 1)$ and $Y = X^2$.
(a) Compute $\mathbb{E}[X]$, $\mathbb{E}[Y]$, and $\mathbb{E}[X + Y]$ and verify linearity.
(b) Compute $\operatorname{Cov}(X, Y)$. What does this tell you about the linear relationship?
(c) Are $X$ and $Y$ independent? Justify rigorously.
(d) Compute $\operatorname{Var}(X + Y)$ using the formula $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)$, and verify numerically.
Exercise 3 * - Moments of the Gaussian
For $Z \sim \mathcal{N}(0, 1)$ (standard normal):
(a) Show $\mathbb{E}[Z^{2k+1}] = 0$ for all $k \geq 0$ (odd moments vanish).
(b) Using the MGF $M_Z(t) = e^{t^2/2}$, find $\mathbb{E}[Z^2]$, $\mathbb{E}[Z^4]$, and $\mathbb{E}[Z^6]$ by differentiating.
(c) Verify the general formula $\mathbb{E}[Z^{2k}] = (2k - 1)!! = \frac{(2k)!}{2^k k!}$.
(d) For $X \sim \mathcal{N}(\mu, \sigma^2)$, compute the skewness $\gamma_1 = \mathbb{E}[(X - \mu)^3]/\sigma^3$ and excess kurtosis $\gamma_2 = \mathbb{E}[(X - \mu)^4]/\sigma^4 - 3$.
Exercise 4 ** - Tower Property in Action
A fair coin is flipped. If heads (prob 1/2), $X \sim \operatorname{Poisson}(\lambda_1)$. If tails (prob 1/2), $X \sim \operatorname{Poisson}(\lambda_2)$.
(a) Using the tower property, compute $\mathbb{E}[X] = \mathbb{E}\big[\mathbb{E}[X \mid C]\big]$ (let $C$ be the coin flip).
(b) Using the law of total variance, compute $\operatorname{Var}(X)$.
(c) Write down the marginal PMF of $X$ (it is a mixture of two Poissons). Verify $\mathbb{E}[X]$ directly.
(d) AI connection: Mixture-of-experts (MoE) models route tokens to different experts. Each expert has its own output distribution. The total output distribution is a mixture; the tower property computes its moments.
Exercise 5 ** - MGF and Moment Extraction
Let $X \sim \operatorname{Exp}(\lambda)$ with PDF $f_X(x) = \lambda e^{-\lambda x}$ for $x \geq 0$.
(a) Derive the MGF: show $M_X(t) = \frac{\lambda}{\lambda - t}$ for $t < \lambda$.
(b) Use $M_X$ to compute $\mathbb{E}[X]$ and $\operatorname{Var}(X)$ by differentiating.
(c) Show that the sum of $n$ independent $\operatorname{Exp}(\lambda)$ variables (same rate $\lambda$) follows $\operatorname{Gamma}(n, \lambda)$ using the MGF multiplicative property.
(d) The chi-squared distribution $\chi^2_k$ is $\operatorname{Gamma}(k/2, 1/2)$. What is $\mathbb{E}[\chi^2_k]$ and $\operatorname{Var}(\chi^2_k)$?
Exercise 6 ** - Jensen and KL Divergence
(a) Prove that for any probability distribution $p = (p_1, \ldots, p_K)$ over $\{1, \ldots, K\}$:

$$H(p) = -\sum_{k=1}^{K} p_k \log p_k \leq \log K$$

with equality iff $p$ is uniform. (Hint: apply Jensen to $\log(1/p_k)$ or use the KL divergence to the uniform distribution.)
(b) Show that $D_{\mathrm{KL}}(p\,\|\,q) \geq 0$ using Jensen's inequality applied to $-\log$.
(c) Compute $D_{\mathrm{KL}}\big(\mathcal{N}(\mu_1, \sigma_1^2)\,\|\,\mathcal{N}(\mu_2, \sigma_2^2)\big)$ analytically. Verify it equals zero when $\mu_1 = \mu_2$, $\sigma_1 = \sigma_2$.
(d) For the ELBO: show that $\log p_\theta(x) \geq \mathrm{ELBO}(\theta, \phi; x)$ with gap $D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)$.
Exercise 7 *** - Bias-Variance Decomposition
Consider estimating $\mu = \mathbb{E}[X]$ from $n$ iid samples $X_1, \ldots, X_n$ with $\operatorname{Var}(X_i) = \sigma^2$.
(a) Show that the sample mean $\bar{X} = \frac{1}{n}\sum_i X_i$ is unbiased with variance $\sigma^2/n$.
(b) Consider a regularised (shrinkage) estimator $\hat{\mu}_\lambda = \frac{1}{1 + \lambda}\,\bar{X}$. Compute its bias and variance as functions of $\lambda$.
(c) The MSE of $\hat{\mu}_\lambda$ is Bias^2 + Variance. Find the $\lambda^*$ that minimises MSE. (Hint: it depends on $\mu^2/\sigma^2$.)
(d) Simulate this: draw 10000 datasets of $n$ samples, compute $\hat{\mu}_\lambda$ for a grid of $\lambda$ values, plot bias^2, variance, and MSE vs $\lambda$.
Exercise 8 *** - Adam as Moment Estimation
(a) Show that the Adam first-moment estimate satisfies $\mathbb{E}[\hat{m}_t] = \mu_g$ when gradients are iid with mean $\mu_g$ (stationary gradient assumption). Therefore $\hat{m}_t$ is unbiased.
(b) Implement a minimal Adam optimizer in NumPy (no PyTorch) and apply it to minimise the quadratic $f(\theta) = \frac{1}{2}\theta^2$. Track the first and second moment estimates. Plot $\hat{m}_t$, $\hat{v}_t$, and $\theta_t$ over 200 steps.
(c) Show that for the quadratic loss $f(\theta) = \frac{1}{2}\theta^2$, Adam converges to $\theta^* = 0$, with the second-moment estimate tracking the squared gradient. When the gradient is (locally) constant, $\hat{m}_t/\sqrt{\hat{v}_t} = \operatorname{sign}(g)$, so every step has magnitude $\alpha$ regardless of the gradient's scale.
(d) Connect to Fisher information: the Fisher information for $X \sim \mathcal{N}(\theta, 1)$ equals 1 (the variance of the score). Natural gradient descent uses the update $F^{-1}\nabla_\theta\mathcal{L}$. How does Adam approximate this for a diagonal Fisher?
12. Why This Matters for AI (2026 Perspective)
| Concept | AI/LLM Application | 2026 Context |
|---|---|---|
| $\mathbb{E}[\mathcal{L}]$ - expected loss | Every training objective: cross-entropy, MSE, RLHF reward | All SOTA LLMs (GPT-4, Claude, Gemini) optimise expected log-likelihood; RLHF maximises expected reward from a human preference model |
| Linearity of expectation | Gradient of a sum = sum of gradients; minibatch averaging | Distributed training computes gradients on data shards and averages them - justified by linearity |
| Tower property | EM algorithm E-step; policy gradient decomposition | Mixture-of-experts routing in GPT-4, Mixtral uses the tower property to compute the expected output over the router distribution |
| $\operatorname{Var}(X)$ - variance | Uncertainty quantification, BatchNorm/LayerNorm, gradient clipping | LayerNorm is standard in every transformer (BERT, GPT, LLaMA); normalises by mean and variance per token |
| Bias-variance decomposition | Model selection, regularisation, double descent | Understanding why GPT-4-scale models (massively overparameterised) still generalise - double descent regime |
| Jensen's inequality | KL divergence $\geq 0$; ELBO derivation; softmax bounds | VAE encoder (DALL-E 2, Stable Diffusion) optimises the ELBO; Jensen is the mathematical foundation |
| MGF / Cumulants | Proof of CLT, Chernoff bounds, tail analysis | Hoeffding/Chernoff bounds used to prove PAC-learning generalisation bounds for language models |
| Adam / moment tracking | Default optimiser for all large-scale training | AdamW trains GPT-4, LLaMA, Claude, Gemini; Adam's moment tracking is a direct application of $\mathbb{E}[g]$ and $\mathbb{E}[g^2]$ |
| Score function $\nabla_\theta \log p_\theta$ | REINFORCE, RLHF policy gradient, diffusion model score | Score-based diffusion (Stable Diffusion, DALL-E 3) is defined by the score $\nabla_x \log p_t(x)$; trained by denoising score matching |
| Cauchy-Schwarz | Attention scaling $1/\sqrt{d_k}$, Fisher information bounds | Scaled dot-product attention in every transformer uses $1/\sqrt{d_k}$ scaling derived from the variance of dot products |
| Conditional expectation | Attention output = conditional mean; optimal predictor = $\mathbb{E}[Y \mid X]$ | Transformers approximate $\mathbb{E}[Y \mid X]$ via attention; mechanistic interpretability studies what conditional distributions each head computes |
| Reparameterisation $z = \mu + \sigma \odot \epsilon$ | VAE encoder, diffusion posterior sampling | Stable Diffusion's latent diffusion uses Gaussian reparameterisation for differentiable sampling through the encoder |