Expectation and Moments: Part 13: Conceptual Bridge to Summary of Key Results
13. Conceptual Bridge
Where We Came From
This section builds directly on the foundations laid in Section01-Section03. The probability-space and random-variable formalism of Section01 provides the objects - PDFs, PMFs, CDFs - over which we compute expectations. The named distributions of Section02 supply the examples whose moments we derive in Section6.2: the mean and variance of the exponential and the binomial, and the mean and variance that define the Gaussian. The joint-distribution machinery of Section03 provides the tools for conditional expectation (Section5) and the covariance matrix (Section4): the tower property is essentially iterated marginalisation, and the conditional variance formula is the law of total variance derived via joint distributions.
Where We Are Going
Section05 (Concentration Inequalities) takes the moment-bound preview from Section7.5 much further. Markov's inequality (from the mean) and Chebyshev's inequality (from the variance) are the first two results, but Section05 develops exponentially sharper bounds: Hoeffding's inequality (bounded variables), Chernoff bounds (via MGFs), and McDiarmid's inequality (functions of independent variables). These bounds are the mathematical machinery of PAC-learning generalisation theory - they quantify how many training examples are needed to guarantee that the empirical risk is close to the true risk.
Section06 (Stochastic Processes) extends expectation to sequences of random variables indexed by time. The Law of Large Numbers proves rigorously that $\bar{X}_n \to E[X]$: the sample mean converges to the expectation. The Central Limit Theorem proves that the standardised sample mean converges in distribution to $\mathcal{N}(0, 1)$: the Gaussian emerges as the universal limit of sums, with proofs via the MGF or characteristic function techniques introduced here.
POSITION IN CHAPTER 6 CURRICULUM
========================================================================
Section01 Probability Spaces Section02 Distributions
+------------------+ +------------------+
| Kolmogorov axioms| | Named PDFs/PMFs |
| Events, \\sigma-algebra| | Parameters |
| CDF, PDF, PMF | | Relationships |
+--------+---------+ +--------+---------+
| |
+--------------+----------------+
v
Section03 Joint Distributions
+----------------------+
| Marginals |
| Conditionals f(y|x) |
| MVN, Bayes, Chain |
+----------+----------+
|
v
+==============================+
| Section04 Expectation & Moments | <- YOU ARE HERE
| E[X], Var, Cov, MGF |
| Jensen, C-S, LOTUS |
| Bias-Variance, Adam |
+==============+=============+
|
+----------+----------+
v v
Section05 Concentration Section06 Stochastic
Inequalities Processes
+-------------+ +-------------+
| Markov | | LLN: Xbar->E[X]|
| Chebyshev | | CLT: ->N(0,1)|
| Hoeffding | | Gaussian |
| PAC bounds | | processes |
+-------------+ +-------------+
| |
+----------+----------+
v
Section07 Markov Chains
+----------------------+
| Transition matrices |
| Steady state |
| MCMC (Bayes sampling)|
+----------------------+
========================================================================
The conceptual arc through Section04 is: we began with probability as a framework for describing uncertainty (Section01), learned the vocabulary of named distributions (Section02), understood how to reason about multiple random variables jointly (Section03), and now in Section04 we have the tools to summarise distributions with numbers - expectations, variances, covariances, moments - and to bound the gap between reality and our estimates. Every subsequent section will use the expectation operator as a core tool: Section05 to prove tail bounds, Section06 to formalise convergence, Section07 to analyse Markov chain stationary distributions via matrix expectations.
<- Back to Chapter 6: Probability Theory | Next: Concentration Inequalities ->
Appendix A - Worked Examples: Expectation and Moments
A.1 Expected Value of the Geometric Distribution
The geometric distribution counts the number of trials until the first success in a sequence of iid Bernoulli($p$) trials.
PMF: $P(X = k) = (1-p)^{k-1} p$ for $k = 1, 2, 3, \dots$
Expected value via the definition:
$E[X] = \sum_{k=1}^{\infty} k\,(1-p)^{k-1} p = p \sum_{k=1}^{\infty} k\, q^{k-1}$
where $q = 1 - p$. Using the identity $\sum_{k=1}^{\infty} k\, q^{k-1} = \frac{1}{(1-q)^2}$:
$E[X] = \frac{p}{(1-q)^2} = \frac{p}{p^2} = \frac{1}{p}$
Variance via the second moment: $\mathrm{Var}(X) = E[X^2] - (E[X])^2$. Using $E[X(X-1)] = \frac{2(1-p)}{p^2}$:
$\mathrm{Var}(X) = \frac{2(1-p)}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{1-p}{p^2}$
AI connection: The geometric distribution models the number of tokens until a specific token (e.g., end-of-sequence) appears, assuming iid generation. In practice tokens are not iid, but the geometric provides a baseline model. With a constant per-token stopping probability $p$, the expected sequence length is $E[X] = 1/p$.
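A quick Monte Carlo check of the two formulas above. This is a minimal sketch using only the standard library; the helper name `simulate_geometric` is illustrative, not from the lesson code.

```python
import random

# Monte Carlo check of the geometric moments E[X] = 1/p, Var(X) = (1-p)/p^2.
def simulate_geometric(p, rng):
    """Count iid Bernoulli(p) trials until the first success."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k

rng = random.Random(0)
p = 0.25
samples = [simulate_geometric(p, rng) for _ in range(200_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

print(mean)  # theory: 1/p = 4.0
print(var)   # theory: (1-p)/p^2 = 12.0
```

With 200,000 samples the estimates land within a few percent of the theoretical values.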
A.2 Method of Moments Estimation
The method of moments estimates distribution parameters by setting sample moments equal to theoretical moments and solving.
Example: Estimating $(\alpha, \beta)$ for $X \sim \mathrm{Gamma}(\alpha, \beta)$ (shape $\alpha$, rate $\beta$).
Theoretical moments: $E[X] = \alpha/\beta$ and $\mathrm{Var}(X) = \alpha/\beta^2$.
Given samples $x_1, \dots, x_n$, compute sample mean $\bar{x}$ and sample variance $s^2$.
Set $\bar{x} = \alpha/\beta$ and $s^2 = \alpha/\beta^2$. Solving:
$\hat{\beta} = \frac{\bar{x}}{s^2}, \qquad \hat{\alpha} = \frac{\bar{x}^2}{s^2}$
For AI: Many Bayesian models require estimating hyperparameters of prior distributions. Method of moments provides fast, closed-form initial estimates that can be refined by maximum likelihood or MCMC. It is also used in moment matching for knowledge distillation: a student model's distribution moments are matched to the teacher's.
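The closed-form estimates above can be verified on synthetic data. A stdlib-only sketch (note `random.gammavariate` takes shape and *scale*, so scale $= 1/\beta$); variable names are illustrative.

```python
import random

# Method-of-moments estimates for Gamma(alpha, beta) (shape alpha, rate beta):
# alpha_hat = xbar^2 / s^2, beta_hat = xbar / s^2.
rng = random.Random(1)
alpha_true, beta_true = 3.0, 2.0
data = [rng.gammavariate(alpha_true, 1.0 / beta_true) for _ in range(100_000)]

xbar = sum(data) / len(data)
s2 = sum((x - xbar) ** 2 for x in data) / len(data)

alpha_hat = xbar ** 2 / s2  # solves xbar = a/b, s2 = a/b^2
beta_hat = xbar / s2
print(alpha_hat, beta_hat)  # close to (3.0, 2.0)
```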
A.3 St. Petersburg Paradox: When Expectation Misleads
The St. Petersburg game: flip a fair coin repeatedly until the first head. If the head appears on flip $k$, win $2^k$ dollars.
Expected winnings:
$E[W] = \sum_{k=1}^{\infty} P(\text{first head on flip } k)\cdot 2^k = \sum_{k=1}^{\infty} 2^{-k} \cdot 2^k = \sum_{k=1}^{\infty} 1 = \infty$
The expected value is infinite! Yet no rational person would pay more than a few dollars to play this game.
Resolution: Rational agents maximise expected utility, not expected monetary value. For a logarithmic utility function $u(w) = \log_2 w$, the expected utility is finite:
$E[u(W)] = \sum_{k=1}^{\infty} 2^{-k} \log_2(2^k) = \sum_{k=1}^{\infty} k\, 2^{-k} = 2$
This is Bernoulli's resolution (1738): diminishing marginal utility. The paradox reveals that an infinite expected value is not sufficient for rational choice - bounded utility (or at least the existence of the relevant moments) is needed.
For AI: Reinforcement learning reward design must account for this. An agent with unbounded reward function may take extremely risky actions that have infinite expected reward but almost surely fail. This motivates bounded reward functions and regularisation in RLHF: keeping reward signals within a bounded range prevents policy collapse toward infinite-expectation strategies.
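The finite expected utility is easy to check by simulation, even though the expected winnings themselves never stabilise. A stdlib sketch with an illustrative `play` helper:

```python
import math
import random

# St. Petersburg game: sample winnings, then average the log2-utility.
# E[winnings] is infinite, but E[log2(winnings)] = sum_k k 2^{-k} = 2.
def play(rng):
    k = 1
    while rng.random() < 0.5:  # tails (prob 1/2) keeps flipping
        k += 1
    return 2 ** k              # first head on flip k pays 2^k dollars

rng = random.Random(42)
winnings = [play(rng) for _ in range(100_000)]
mean_utility = sum(math.log2(w) for w in winnings) / len(winnings)
print(mean_utility)  # theory: 2
```

Rerunning with different seeds, the sample mean of the raw winnings jumps around wildly (rare huge payouts dominate), while `mean_utility` reliably sits near 2.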
Appendix B - Proofs of Key Identities
B.1 Cauchy-Schwarz via Inner Product
The expectation $\langle X, Y \rangle = E[XY]$ can be viewed as an inner product on the space $L^2$ of square-integrable random variables. The Cauchy-Schwarz inequality for inner products states $\langle X, Y \rangle^2 \le \langle X, X \rangle \langle Y, Y \rangle$, which gives $(E[XY])^2 \le E[X^2]\, E[Y^2]$ directly.
This inner product interpretation gives $L^2$ convergence its name: $X_n \to X$ in $L^2$ means $E[(X_n - X)^2] \to 0$ - convergence in the inner product (norm) sense.
B.2 Variance as Second Cumulant
The CGF of $X$ is $K_X(t) = \log M_X(t)$. Computing its second derivative at $t = 0$:
$K''(t) = \frac{M''(t)\, M(t) - (M'(t))^2}{M(t)^2}$
At $t = 0$: $M(0) = 1$, $M'(0) = E[X]$, $M''(0) = E[X^2]$, so:
$K''(0) = E[X^2] - (E[X])^2 = \mathrm{Var}(X)$
The variance is precisely the second cumulant $\kappa_2$.
B.3 Fisher Information and the Score
The Fisher information matrix is defined as:
$\mathcal{I}(\theta) = E\big[\nabla_\theta \log p(x \mid \theta)\; \nabla_\theta \log p(x \mid \theta)^\top\big] = -E\big[\nabla_\theta^2 \log p(x \mid \theta)\big]$
The second equality uses the fact that the score has mean zero, $E[\nabla_\theta \log p(x \mid \theta)] = 0$ (proved by differentiating $\int p(x \mid \theta)\,dx = 1$ under the integral sign).
Therefore the Fisher information is the variance (covariance matrix) of the score function $s(\theta) = \nabla_\theta \log p(x \mid \theta)$.
For AI: Fisher information appears in:
- Cramer-Rao bound: $\mathrm{Var}(\hat{\theta}) \ge \mathcal{I}(\theta)^{-1}$ - the variance of any unbiased estimator is at least the reciprocal Fisher information.
- Natural gradient: The natural gradient descent update moves in the direction of steepest descent in the space of distributions (KL-divergence geometry), rather than parameter space. Adam approximates this with a diagonal Fisher.
- Elastic Weight Consolidation (EWC): Used in continual learning to prevent catastrophic forgetting. The Fisher information diagonal identifies which parameters are important for previous tasks.
B.4 Stein's Lemma
Lemma (Stein, 1972). If $X \sim \mathcal{N}(\mu, \sigma^2)$ and $g$ is differentiable with $E[|g'(X)|] < \infty$:
$E[g(X)(X - \mu)] = \sigma^2\, E[g'(X)]$
Proof: Let $\phi$ denote the $\mathcal{N}(\mu, \sigma^2)$ density. Note $\phi'(x) = -\frac{x - \mu}{\sigma^2}\,\phi(x)$. Integrating by parts (the boundary terms vanish because $\phi$ decays faster than any polynomial):
$E[g(X)(X - \mu)] = \int g(x)(x - \mu)\phi(x)\,dx = -\sigma^2 \int g(x)\phi'(x)\,dx = \sigma^2 \int g'(x)\phi(x)\,dx = \sigma^2\, E[g'(X)]$
Special cases:
- $g(x) = x$: $E[X(X - \mu)] = \sigma^2$, i.e., $\mathrm{Var}(X) = \sigma^2$ [ok]
- $g(x) = x^2$: $E[X^2(X - \mu)] = 2\sigma^2 \mu$, which gives $E[X^3] = \mu^3 + 3\mu\sigma^2$ [ok]
For AI: Stein's lemma is the foundation of Stein's identity used in score matching and denoising diffusion models. The score function $s(x) = \nabla_x \log p(x)$ satisfies $E_p[s(X)\, g(X) + g'(X)] = 0$ (Stein's operator), enabling training by minimising a quadratic loss without computing the intractable normalisation constant of $p$.
Appendix C - Moment Computations for Common Distributions
This table gives the mean, variance, MGF, skewness, and excess kurtosis for the distributions used throughout the course.
| Distribution | $E[X]$ | $\mathrm{Var}(X)$ | $M_X(t)$ (domain) | Skewness | Ex. Kurtosis |
|---|---|---|---|---|---|
| Bernoulli($p$) | $p$ | $p(1-p)$ | $1 - p + pe^t$ | $\frac{1-2p}{\sqrt{p(1-p)}}$ | $\frac{1-6p(1-p)}{p(1-p)}$ |
| Binomial($n, p$) | $np$ | $np(1-p)$ | $(1 - p + pe^t)^n$ | $\frac{1-2p}{\sqrt{np(1-p)}}$ | $\frac{1-6p(1-p)}{np(1-p)}$ |
| Poisson($\lambda$) | $\lambda$ | $\lambda$ | $e^{\lambda(e^t - 1)}$ | $\lambda^{-1/2}$ | $\lambda^{-1}$ |
| Geometric($p$) | $\frac{1}{p}$ | $\frac{1-p}{p^2}$ | $\frac{pe^t}{1 - (1-p)e^t}$, $t < -\ln(1-p)$ | $\frac{2-p}{\sqrt{1-p}}$ | $6 + \frac{p^2}{1-p}$ |
| Uniform($a, b$) | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | $\frac{e^{tb} - e^{ta}}{t(b-a)}$, $t \neq 0$ | $0$ | $-\frac{6}{5}$ |
| Normal($\mu, \sigma^2$) | $\mu$ | $\sigma^2$ | $e^{\mu t + \sigma^2 t^2 / 2}$ | $0$ | $0$ |
| Exponential($\lambda$) | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ | $\frac{\lambda}{\lambda - t}$, $t < \lambda$ | $2$ | $6$ |
| Gamma($\alpha, \beta$) | $\frac{\alpha}{\beta}$ | $\frac{\alpha}{\beta^2}$ | $\big(\frac{\beta}{\beta - t}\big)^\alpha$, $t < \beta$ | $\frac{2}{\sqrt{\alpha}}$ | $\frac{6}{\alpha}$ |
| Beta($\alpha, \beta$) | $\frac{\alpha}{\alpha + \beta}$ | $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ | No closed form | complex | complex |
| Student-$t$($\nu$) | $0$ ($\nu > 1$) | $\frac{\nu}{\nu - 2}$ ($\nu > 2$) | Does not exist | $0$ ($\nu > 3$) | $\frac{6}{\nu - 4}$ ($\nu > 4$) |
Notes:
- Student-$t$ has no MGF (heavy tails cause $E[e^{tX}] = \infty$ for all $t \neq 0$).
- Beta distribution skewness is zero iff $\alpha = \beta$ (symmetric); negative when $\alpha > \beta$, positive when $\alpha < \beta$.
- Poisson has equal mean and variance - a property used to test whether count data follows Poisson (overdispersion: $s^2 > \bar{x}$ means extra variability, e.g., negative binomial is better).
Appendix D - The Exponential Family and Moments
Many common distributions belong to the exponential family, which has an elegant connection between natural parameters and moments.
D.1 Exponential Family Form
A distribution belongs to the exponential family if its density can be written as:
$p(x \mid \eta) = h(x)\, \exp\big(\eta^\top T(x) - A(\eta)\big)$
where:
- $\eta$ = natural parameter vector
- $T(x)$ = sufficient statistic vector
- $A(\eta)$ = log-partition function (log-normaliser)
- $h(x)$ = base measure
D.2 Moments from the Log-Partition Function
Theorem. For an exponential family distribution:
$\nabla_\eta A(\eta) = E[T(X)], \qquad \nabla_\eta^2 A(\eta) = \mathrm{Cov}[T(X)]$
Proof sketch. Differentiating $A(\eta) = \log \int h(x)\, e^{\eta^\top T(x)}\,dx$ with respect to $\eta$:
$\nabla_\eta A(\eta) = \frac{\int T(x)\, h(x)\, e^{\eta^\top T(x)}\,dx}{\int h(x)\, e^{\eta^\top T(x)}\,dx} = E[T(X)]$
Differentiating again gives the covariance formula.
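The theorem can be checked numerically for the Poisson, whose log-partition function is $A(\eta) = e^\eta$ with $\eta = \log\lambda$: finite differences of $A$ should match the mean and variance computed directly from the PMF. Illustrative sketch only.

```python
import math

# Poisson in exponential-family form: A(eta) = exp(eta), eta = log(lam).
# Check A'(eta) = E[X] = lam and A''(eta) = Var(X) = lam.
lam = 3.0
eta = math.log(lam)
A = math.exp  # log-partition function of the Poisson

h = 1e-5
A1 = (A(eta + h) - A(eta - h)) / (2 * h)             # central diff ~ A'(eta)
A2 = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h ** 2  # ~ A''(eta)

# Moments computed directly from the PMF p(k) = e^{-lam} lam^k / k!
pmf = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(60)]
mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
print(A1, mean)  # both ~ 3.0
print(A2, var)   # both ~ 3.0
```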
Examples:
| Distribution | ||||
|---|---|---|---|---|
| Bernoulli() | ||||
| Gaussian() | ||||
| Poisson() |
For AI: The log-partition function is the free energy of the exponential family. Its gradient gives the expected sufficient statistics (moments), its Hessian gives the Fisher information matrix. In variational inference, the ELBO is optimised over the natural parameters of the approximate posterior; at the optimum the expected sufficient statistics satisfy a moment-matching condition. This is the connection between maximum entropy, sufficient statistics, and the moments derived in Section6.
Appendix E - Convergence in Probability and L^2 Convergence
The sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ estimates $\mu = E[X]$. In what sense does this estimate converge?
E.1 L^2 Convergence (Mean Square Convergence)
$\bar{X}_n$ converges to $\mu$ in mean square ($L^2$) sense:
$E\big[(\bar{X}_n - \mu)^2\big] = \mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n} \to 0$
This is an immediate consequence of the variance formula for means (Section4.4). The sample mean's MSE decreases at rate $1/n$.
E.2 Weak Law of Large Numbers (Preview)
The Weak LLN (proved in Section06 using characteristic functions or Chebyshev's inequality) states that for iid $X_i$ with finite mean $\mu$:
$\bar{X}_n \xrightarrow{P} \mu$
i.e., $P(|\bar{X}_n - \mu| > \epsilon) \to 0$ for any $\epsilon > 0$.
Proof via Chebyshev (when $\sigma^2 < \infty$): $P(|\bar{X}_n - \mu| > \epsilon) \le \frac{\mathrm{Var}(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \to 0$.
This is a direct application of Chebyshev's inequality (from the Section7.5 preview) to the sample mean. The LLN justifies averaging: if we evaluate a model's loss on many iid batches and average the estimates, the average converges to the true expected loss.
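The Chebyshev bound and the actual deviation probability can be compared empirically. A stdlib sketch for Uniform(0,1) samples ($\mu = 0.5$, $\sigma^2 = 1/12$); the helper name `deviation_prob` is illustrative.

```python
import random

# Weak LLN in action: P(|Xbar_n - mu| > eps) shrinks with n and stays below
# the Chebyshev bound sigma^2 / (n * eps^2).
rng = random.Random(7)
mu, var, eps = 0.5, 1 / 12, 0.05

def deviation_prob(n, trials=2000):
    """Fraction of trials in which |Xbar_n - mu| > eps."""
    bad = 0
    for _ in range(trials):
        xbar = sum(rng.random() for _ in range(n)) / n
        if abs(xbar - mu) > eps:
            bad += 1
    return bad / trials

results = []
for n in (10, 100, 1000):
    bound = min(var / (n * eps ** 2), 1.0)
    results.append((n, deviation_prob(n), bound))

for n, p_emp, bound in results:
    print(n, p_emp, bound)  # empirical prob never exceeds the bound
```

Note how loose Chebyshev is here: the empirical probability is far below the bound, foreshadowing the exponentially sharper bounds of Section05.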
-> Full treatment of LLN and CLT: Section06 Stochastic Processes
Appendix F - Conditional Expectation as Projection
The conditional expectation $E[Y \mid X]$ can be understood geometrically as an orthogonal projection in the Hilbert space $L^2$ of square-integrable random variables.
Inner product: $\langle U, V \rangle = E[UV]$
Projection: $E[Y \mid \mathcal{G}]$ (conditional expectation on a sub-$\sigma$-algebra $\mathcal{G}$) is the projection of $Y$ onto the closed subspace of $\mathcal{G}$-measurable random variables.
Geometric interpretation: Among all $\sigma(X)$-measurable (i.e., functions of $X$) approximations to $Y$, the conditional expectation minimises the $L^2$ distance $E[(Y - g(X))^2]$. This is the projection theorem: the best approximation is the projection, and the residual is orthogonal to every $\sigma(X)$-measurable random variable:
$E\big[(Y - E[Y \mid X])\, g(X)\big] = 0 \quad \text{for all measurable } g$
Consequence: The tower property $E[E[Y \mid X]] = E[Y]$ is the projection version of the law of total expectation: projecting $Y$ onto the subspace of functions of $X$ and then onto the constants gives the same result as projecting $Y$ directly onto the constants.
For AI: This projection view clarifies why neural networks trained with MSE loss approximate $E[Y \mid X = x]$: the network is learning the projection of $Y$ onto the subspace of functions representable by the architecture. Deeper networks can represent larger subspaces, hence better approximations to the conditional expectation.
Appendix G - Worked Problems: Moments and Inequalities
G.1 Computing the Moments of the Beta Distribution
The Beta($\alpha, \beta$) distribution has PDF $f(x) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}$ on $[0, 1]$, where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$.
Raw moment: Using the Beta function definition:
$E[X^k] = \frac{B(\alpha + k, \beta)}{B(\alpha, \beta)} = \frac{\Gamma(\alpha + k)\,\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\alpha + \beta + k)}$
First moment: $E[X] = \frac{\alpha}{\alpha + \beta}$.
Second moment: $E[X^2] = \frac{\alpha(\alpha + 1)}{(\alpha + \beta)(\alpha + \beta + 1)}$.
Variance:
$\mathrm{Var}(X) = E[X^2] - (E[X])^2 = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$
AI connection: The Beta distribution is the conjugate prior for the Bernoulli/Binomial likelihood. After observing $s$ successes and $f$ failures, the posterior is Beta($\alpha + s$, $\beta + f$). The posterior mean is $\frac{\alpha + s}{\alpha + \beta + s + f}$ - a weighted average of the prior mean $\frac{\alpha}{\alpha + \beta}$ and the sample mean $\frac{s}{s + f}$. As $s + f \to \infty$, the posterior converges to the sample mean (data dominates prior).
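The prior-to-data shift of the posterior mean is easy to see numerically. A minimal sketch with an illustrative `posterior_mean` helper:

```python
# Beta-Bernoulli conjugacy: the posterior mean interpolates between the
# prior mean and the sample mean, converging to the latter as data grows.
alpha, beta = 2.0, 2.0  # prior Beta(2, 2), prior mean 0.5

def posterior_mean(s, f):
    """Posterior mean after s successes and f failures."""
    return (alpha + s) / (alpha + beta + s + f)

# 30% empirical success rate at increasing sample sizes
for n in (10, 100, 10_000):
    s = int(0.3 * n)
    print(n, posterior_mean(s, n - s))  # drifts from 0.5 toward 0.3
```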
G.2 Proving $H(p) \le \log n$ by Jensen
For a probability distribution $p = (p_1, \dots, p_n)$ with $p_i \ge 0$ and $\sum_i p_i = 1$, the Shannon entropy is $H(p) = -\sum_i p_i \log p_i$.
Proof that $H(p) \le \log n$:
Apply Jensen's inequality with the concave function $\log$:
$H(p) = \sum_i p_i \log \frac{1}{p_i} \le \log\Big(\sum_i p_i \cdot \frac{1}{p_i}\Big) = \log n$
Equality holds iff $1/p_i$ is constant for all $i$ (Jensen with equality iff the argument is constant), i.e., $p_i = 1/n$ (uniform).
Alternative proof via KL divergence:
$\log n - H(p) = \sum_i p_i \log \frac{p_i}{1/n} = D_{KL}(p \,\|\, u) \ge 0$
where $u$ is the uniform distribution. KL non-negativity gives $H(p) \le \log n$.
G.3 Adam Bias Correction: Full Derivation
At step $t$, the Adam first-moment accumulator is:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad m_0 = 0$
Unrolling from $m_0 = 0$:
$m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i$
Taking expectations (assuming iid gradients with constant mean $E[g_i] = \mu$):
$E[m_t] = (1 - \beta_1)\, \mu \sum_{i=1}^{t} \beta_1^{t-i} = \mu (1 - \beta_1^t)$
The bias is $E[m_t] - \mu = -\beta_1^t \mu$, which decays to zero geometrically. The bias-corrected estimate $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$ satisfies $E[\hat{m}_t] = \mu$ (unbiased).
Similarly for $v_t$ (with $\beta_2$), and the debiased $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$ estimates $E[g_t^2]$ unbiasedly.
In early training ($t$ small), $1 - \beta_1^t$ is small, so $\frac{1}{1 - \beta_1^t}$ greatly amplifies $m_t$. Without bias correction, Adam would take tiny steps at the start because $m_1 = (1 - \beta_1)\, g_1 = 0.1\, g_1$ after one step (with the default $\beta_1 = 0.9$). With bias correction: $\hat{m}_1 = \frac{m_1}{1 - \beta_1} = g_1$ - the first step uses the actual gradient.
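The derivation above can be traced step by step for a constant gradient stream, where the bias factor $(1 - \beta_1^t)$ appears exactly. Scalar sketch only, not the full optimiser:

```python
# Bias correction in Adam's first moment: with constant gradient g, the raw
# accumulator m_t equals g * (1 - beta1^t); dividing by that factor recovers g.
beta1 = 0.9
g = 1.0  # constant gradient stream
m = 0.0
raw, corrected = [], []
for t in range(1, 11):
    m = beta1 * m + (1 - beta1) * g
    raw.append(m)
    corrected.append(m / (1 - beta1 ** t))

print(raw[0])        # 0.1 -- tiny first step without correction
print(corrected[0])  # 1.0 -- the actual gradient
print(raw[-1])       # still biased low at t = 10
```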
G.4 Conditional Expectation: Gaussian Case
Let $(X, Y)$ be jointly Gaussian with means $\mu_X, \mu_Y$, variances $\sigma_X^2, \sigma_Y^2$, and correlation $\rho$.
Conditional distribution (from the Section03 Schur complement formula):
$Y \mid X = x \;\sim\; \mathcal{N}\Big(\mu_Y + \rho \tfrac{\sigma_Y}{\sigma_X}(x - \mu_X),\; \sigma_Y^2 (1 - \rho^2)\Big)$
Therefore:
$E[Y \mid X] = \mu_Y + \rho \tfrac{\sigma_Y}{\sigma_X}(X - \mu_X), \qquad \mathrm{Var}(Y \mid X) = \sigma_Y^2 (1 - \rho^2)$
Verification via law of total variance:
$\mathrm{Var}(Y) = E[\mathrm{Var}(Y \mid X)] + \mathrm{Var}(E[Y \mid X]) = \sigma_Y^2 (1 - \rho^2) + \rho^2 \sigma_Y^2 = \sigma_Y^2 \;\text{[ok]}$
The variance of the conditional mean contributes $\rho^2 \sigma_Y^2$ (explained variance), and the within-group variance contributes $\sigma_Y^2 (1 - \rho^2)$ (unexplained variance). The fraction $\rho^2$ of $\mathrm{Var}(Y)$ explained by $X$ is the coefficient of determination $R^2$ of linear regression.
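The decomposition can be checked by sampling $Y$ hierarchically (first $X$, then $Y \mid X$) and comparing the total variance with the explained-plus-unexplained sum. Stdlib sketch with zero means for simplicity:

```python
import random

# Law of total variance for a bivariate Gaussian:
# Var(Y) = sigma_Y^2 (1 - rho^2) + rho^2 sigma_Y^2 = sigma_Y^2.
rng = random.Random(3)
rho, sigma_x, sigma_y = 0.8, 1.0, 2.0

ys = []
for _ in range(200_000):
    x = rng.gauss(0.0, sigma_x)
    # sample Y | X = x from N(rho * (sigma_y/sigma_x) * x, sigma_y^2 (1-rho^2))
    y = rng.gauss(rho * sigma_y / sigma_x * x,
                  sigma_y * (1 - rho ** 2) ** 0.5)
    ys.append(y)

mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)
explained = rho ** 2 * sigma_y ** 2           # 2.56
unexplained = sigma_y ** 2 * (1 - rho ** 2)   # 1.44
print(var_y, explained + unexplained)         # both ~ 4.0
```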
Appendix H - Notation Summary
| Symbol | Meaning | First defined |
|---|---|---|
| $E[X]$ | Expected value of $X$ | Section2.1 |
| $\mu$ or $\mu_X$ | Mean (expected value) | Section2.1 |
| $\mathrm{Var}(X)$ or $\sigma^2$ | Variance of $X$ | Section3.1 |
| $\sigma$ | Standard deviation of $X$ | Section3.2 |
| $\mu'_k = E[X^k]$ | $k$-th raw moment | Section3.3 |
| $\mu_k = E[(X - \mu)^k]$ | $k$-th central moment | Section3.3 |
| $\gamma_1$ | Skewness | Section3.4 |
| $\gamma_2$ | Excess kurtosis | Section3.5 |
| $\mathrm{Cov}(X, Y)$ | Covariance | Section4.1 |
| $\rho_{XY}$ | Pearson correlation | Section4.2 |
| $E[Y \mid X = x]$ | Conditional expectation (function of $x$) | Section5.1 |
| $E[Y \mid X]$ | Conditional expectation (random variable) | Section5.1 |
| $\mathrm{Var}(Y \mid X)$ | Conditional variance | Section5.3 |
| $M_X(t)$ | Moment generating function | Section6.1 |
| $\varphi_X(t)$ | Characteristic function | Section6.4 |
| $K_X(t)$ | Cumulant generating function | Section6.5 |
| $\kappa_n$ | $n$-th cumulant | Section6.5 |
| $\mathcal{L}$ | ELBO (evidence lower bound) | Section9.2 |
| $D_{KL}(p \,\|\, q)$ | KL divergence | Section7.2 |
| $H(p)$ | Shannon entropy | Section9.1 |
Appendix I - Quick Reference: Key Formulas
Expectation: $E[g(X)] = \sum_x g(x)\, p_X(x)$ (discrete), $\int g(x)\, f_X(x)\,dx$ (continuous)
Variance: $\mathrm{Var}(X) = E[X^2] - (E[X])^2$
Conditional expectation: $E[Y] = E\big[E[Y \mid X]\big]$ (tower property)
MGF: $M_X(t) = E[e^{tX}]$, with $E[X^k] = M_X^{(k)}(0)$
Jensen's inequality: $f$ convex $\Rightarrow f(E[X]) \le E[f(X)]$
Cauchy-Schwarz: $(E[XY])^2 \le E[X^2]\, E[Y^2]$
Bias-Variance: $E\big[(\hat{f}(x) - y)^2\big] = \mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2$ (irreducible noise)
Adam: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$, $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$, $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$, $\theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
Appendix J - Common Mistakes: Extended Examples
J.1 Jensen's Direction: A Costly Error
One of the most frequent mistakes with Jensen's inequality is applying it in the wrong direction. The rule is simple: for convex $f$, $f(E[X]) \le E[f(X)]$; for concave $f$, the inequality reverses.
Common confusion: $E[X^2]$ vs. $(E[X])^2$.
- $f(x) = x^2$ is convex ($f'' = 2 > 0$). Jensen: $(E[X])^2 \le E[X^2]$. Equivalently, $\mathrm{Var}(X) \ge 0$. [ok]
- $f(x) = \sqrt{x}$ is concave on $x > 0$. Jensen: $E[\sqrt{X}] \le \sqrt{E[X]}$ for $X \ge 0$.
A model that minimises $E[\sqrt{\ell}]$ is NOT the same as one minimising $\sqrt{E[\ell]}$. The former cares about the average square-root loss, the latter about the square root of the average loss. In practice this distinction matters for robust loss functions.
Common confusion: $E[\log X]$ vs. $\log E[X]$.
- $\log$ is concave -> $E[\log X] \le \log E[X]$. This is the key inequality behind the ELBO derivation.
- $e^x$ is convex -> $E[e^{tX}] \ge e^{t E[X]}$. This is the key inequality for bounding the MGF.
The ELBO derivation applies Jensen with $f = \log$ (concave): $\log p(x) = \log E_{q(z)}\big[\frac{p(x, z)}{q(z)}\big] \ge E_{q(z)}\big[\log \frac{p(x, z)}{q(z)}\big]$, where $z \sim q$. Students often try to apply it in the wrong direction, getting an upper bound instead of a lower bound.
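Both directions can be verified on a single positive-valued sample, since Jensen applies to the empirical distribution as well. Illustrative stdlib sketch:

```python
import math
import random

# Both Jensen directions on one sample: convex x^2 gives E[X^2] >= (E[X])^2;
# concave log gives E[log X] <= log E[X].
rng = random.Random(5)
xs = [rng.expovariate(1.0) + 0.1 for _ in range(50_000)]  # strictly positive

mean = sum(xs) / len(xs)
mean_sq = sum(x * x for x in xs) / len(xs)
mean_log = sum(math.log(x) for x in xs) / len(xs)

print(mean_sq, mean ** 2)           # convex: first >= second
print(mean_log, math.log(mean))     # concave: first <= second
```

The gaps are strict here because the sample is non-degenerate; they close only when the argument of Jensen is constant.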
J.2 The Tower Property Subtlety: Nested Conditioning
The tower property states $E[E[Y \mid X]] = E[Y]$. A more general version for nested conditioning: for $\sigma$-algebras $\mathcal{H} \subseteq \mathcal{G}$:
$E\big[E[X \mid \mathcal{G}] \,\big|\, \mathcal{H}\big] = E[X \mid \mathcal{H}]$
Conditioning an already-conditioned expectation on less information keeps the coarser conditioning.
Example: Let $T$ be a sufficient statistic for $\theta$ given data $X$, and let $\hat{\theta}(X)$ be any estimator. Then:
$E\big[E[\hat{\theta}(X) \mid T]\big] = E[\hat{\theta}(X)]$
where $\hat{\theta}_{RB} = E[\hat{\theta}(X) \mid T]$ is the Rao-Blackwellised estimator. The outer expectation equals the original (tower property), but the Rao-Blackwellised estimator has smaller (or equal) variance. This is NOT circular - it is the statement that smoothing $\hat{\theta}$ via conditioning on $T$ doesn't change its mean.
J.3 Correlation Does Not Imply Causation - A Statistical View
Zero correlation means $\mathrm{Cov}(X, Y) = 0$ - no linear relationship. But:
- There may be nonlinear dependence (Section4.3 example: $Y = X^2$ with $X$ symmetric about zero).
- Even positive correlation may arise from a common cause (confounding): if $Z \to X$ and $Z \to Y$ (fork structure from Section03), then $X$ and $Y$ are correlated even if neither causes the other.
In ML: the correlation between model predictions and labels on the test set measures linear predictive ability, not causal understanding. A model can achieve high correlation while exploiting spurious features (shortcuts). For example, language models correlate "hospital" with "disease" not through causal understanding but through co-occurrence patterns.
Conditional independence as the resolution: If $Z$ is the common cause (confounder), $X$ and $Y$ may be conditionally independent given $Z$: $X \perp Y \mid Z$ even though $\mathrm{Cov}(X, Y) \neq 0$. Adjusting for confounders (either by conditioning or instrumental variables) is the statistical approach to causal inference.
Appendix K - Information-Theoretic View of Moments
K.1 Entropy as Negative Expected Log-Probability
The Shannon entropy of a discrete distribution $p$ is:
$H(p) = -\sum_x p(x) \log p(x) = -E[\log p(X)]$
This is simply minus the expected log-probability. For a continuous distribution with PDF $f$, the differential entropy is:
$h(f) = -\int f(x) \log f(x)\,dx = -E[\log f(X)]$
Gaussian maximises differential entropy. Among all distributions on $\mathbb{R}$ with fixed mean $\mu$ and variance $\sigma^2$, the Gaussian maximises differential entropy:
$h(f) \le h(g) = \tfrac{1}{2}\log(2\pi e \sigma^2), \qquad g = \mathcal{N}(\mu, \sigma^2)$
Proof sketch (via KL divergence): For any distribution $f$ with mean $\mu$ and variance $\sigma^2$, let $g = \mathcal{N}(\mu, \sigma^2)$.
$0 \le D_{KL}(f \,\|\, g) = \int f \log \frac{f}{g} = -h(f) - \int f \log g$
Since $\log g(x) = -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2}$, and $\int f(x)(x - \mu)^2\,dx = \sigma^2$ (since $f$ has variance $\sigma^2$): $-\int f \log g = \tfrac{1}{2}\log(2\pi\sigma^2) + \tfrac{1}{2} = h(g)$.
Therefore $0 \le h(g) - h(f)$, i.e., $h(f) \le h(g)$. Equality iff $f = g$ (since $D_{KL} = 0$ iff the distributions are equal).
AI connection: This maximum entropy property explains why the Gaussian prior is so common in Bayesian machine learning: given a known mean and variance (from domain knowledge), the Gaussian is the least informative (maximum entropy) prior consistent with those constraints. It makes the fewest additional assumptions about the distribution.
K.2 Mutual Information as Expected KL Divergence
The mutual information between $X$ and $Y$ is:
$I(X; Y) = E_{p(x, y)}\Big[\log \frac{p(x, y)}{p(x)\, p(y)}\Big] = D_{KL}\big(p(x, y) \,\|\, p(x)\, p(y)\big)$
This is the KL divergence between the joint distribution and the product of marginals. Since $D_{KL} \ge 0$: $I(X; Y) \ge 0$ with equality iff $X \perp Y$.
Relationship to conditional entropy:
$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$
where $H(X \mid Y) = E_Y[H(X \mid Y = y)]$ is the conditional entropy (tower property applied to entropy).
For AI: Mutual information is the gold standard for measuring statistical dependence. It captures all forms of dependence (linear and nonlinear), unlike correlation. Contrastive learning methods (SimCLR, CLIP) can be viewed as maximising a lower bound on $I(X; Y)$ - encouraging representations to capture information shared between different views/modalities. Information bottleneck methods (used for analysing neural networks) study the trade-off between compressing the input $X$ (low $I(X; T)$) and preserving information about the label $Y$ (high $I(T; Y)$).
Appendix L - The Reparameterisation Trick: Full Mathematical Treatment
The reparameterisation trick is a technique for computing gradients of expectations when the distribution depends on the parameters we differentiate with respect to.
L.1 The Problem: Gradient Through Sampling
We want $\nabla_\theta E_{q_\theta(z)}[f(z)]$ where $q_\theta$ is a distribution parameterised by $\theta$. Naively:
$\nabla_\theta \int f(z)\, q_\theta(z)\,dz = \int f(z)\, \nabla_\theta q_\theta(z)\,dz$
The integrand is no longer an expectation with respect to $q_\theta$ ($\nabla_\theta q_\theta$ is not a probability density), so it cannot be estimated by simply sampling $z \sim q_\theta$ - the integration measure itself changes with $\theta$.
L.2 REINFORCE (Score Function Estimator)
The score function trick uses $\nabla_\theta q_\theta = q_\theta\, \nabla_\theta \log q_\theta$:
$\nabla_\theta E_{q_\theta}[f(z)] = E_{q_\theta}\big[f(z)\, \nabla_\theta \log q_\theta(z)\big]$
This is computable by Monte Carlo: sample $z_i \sim q_\theta$, compute $f(z_i)\, \nabla_\theta \log q_\theta(z_i)$, average over samples. However, this estimator has high variance because $f(z)$ can be large in magnitude and $\nabla_\theta \log q_\theta(z)$ can vary greatly across samples.
L.3 Reparameterisation
When $q_\theta$ is a location-scale family (or more generally, when there exists a deterministic transformation $z = g_\theta(\epsilon)$ with $\epsilon \sim p(\epsilon)$ independent of $\theta$):
For Gaussian: $z = \mu + \sigma \epsilon$, $\epsilon \sim \mathcal{N}(0, 1)$.
Then:
$E_{q_\theta}[f(z)] = E_{p(\epsilon)}\big[f(g_\theta(\epsilon))\big]$
and:
$\nabla_\theta E_{q_\theta}[f(z)] = E_{p(\epsilon)}\big[\nabla_\theta f(g_\theta(\epsilon))\big]$
The gradient now flows through the deterministic function $g_\theta$, enabling automatic differentiation. The variance of this estimator is typically much lower than REINFORCE because $\nabla_\theta f(g_\theta(\epsilon))$ is often smoother than $f(z)\, \nabla_\theta \log q_\theta(z)$.
Jacobian of the transform (Gaussian case): $\frac{\partial z}{\partial \mu} = 1$, $\frac{\partial z}{\partial \sigma} = \epsilon$.
So the gradient with respect to $\mu$ is $E[f'(z)]$ and with respect to $\sigma$ is $E[f'(z)\, \epsilon]$.
L.4 Why Lower Variance?
Consider $f(z) = z^2$ with $q_\theta = \mathcal{N}(\mu, 1)$; the true gradient is $\nabla_\mu E[z^2] = 2\mu$.
REINFORCE gradient samples: $f(z)\, \nabla_\mu \log q(z) = z^2(z - \mu)$. The product of $z^2$ and the score can be large in magnitude.
Reparameterisation gradient samples: with $z = \mu + \epsilon$, $\nabla_\mu f(z) = 2z = 2(\mu + \epsilon)$, which has variance exactly $4$ - much more stable.
In general, reparameterisation produces gradients that scale with $|\nabla_z f(z)|$ when $f$ is smooth, while REINFORCE gradients scale with $|f(z)|$ itself, which can be large.
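The variance gap for this exact example can be measured directly: both estimators are unbiased for $2\mu$, but their per-sample variances differ by an order of magnitude. Stdlib sketch:

```python
import random
import statistics

# Estimating d/dmu E[z^2] for z ~ N(mu, 1); the true gradient is 2*mu.
# REINFORCE sample: z^2 * (z - mu). Reparameterisation sample: 2 * (mu + eps).
rng = random.Random(11)
mu = 1.0

reinforce, reparam = [], []
for _ in range(50_000):
    eps = rng.gauss(0.0, 1.0)
    z = mu + eps
    reinforce.append(z * z * (z - mu))  # score-function estimator sample
    reparam.append(2.0 * z)             # pathwise estimator sample

print(statistics.mean(reinforce), statistics.mean(reparam))  # both ~ 2.0
print(statistics.variance(reinforce), statistics.variance(reparam))
# reparam variance is exactly Var(2z) = 4; REINFORCE variance is much larger
```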
Appendix M - Moments in the Context of Score Matching and Diffusion
M.1 Denoising Score Matching
Diffusion models (DDPM, Score SDE) learn the score function $s_\theta(x, t) \approx \nabla_x \log p_t(x)$ at each noise level $t$. The training objective is:
$E_{t, x_0, \epsilon}\Big[\lambda(t)\, \big\| s_\theta(x_t, t) - \nabla_{x_t} \log p(x_t \mid x_0) \big\|^2\Big]$
where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$.
The conditional score is:
$\nabla_{x_t} \log p(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1 - \bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}$
This is an expectation over the noise schedule. With the parameterisation $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t)/\sqrt{1 - \bar{\alpha}_t}$, the loss simplifies to:
$E_{t, x_0, \epsilon}\big[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\big]$
which is an MSE loss - an empirical estimate of the expected squared error between the added noise $\epsilon$ and the network's prediction $\epsilon_\theta(x_t, t)$.
M.2 First Moment of the Reverse Process
The reverse diffusion process gives $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(\mu_\theta(x_t, t),\, \sigma_t^2 I\big)$, where the mean of the reverse step is determined by the predicted noise.
The conditional expectation is the denoised estimate of $x_0$:
$\hat{x}_0 = E[x_0 \mid x_t] \approx \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}$
This inverts the reparameterised forward relationship $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$: the expected clean image given the noisy image is a simple linear function of the noisy image and the predicted noise.
Appendix N - Worked Problems: Bias-Variance and Regularisation
N.1 Ridge Regression Bias-Variance Tradeoff
Setup. Data: $y = X\beta + \varepsilon$ where $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, with $X \in \mathbb{R}^{n \times d}$ fixed.
Ridge estimator: $\hat{\beta}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$.
Let $A_\lambda = (X^\top X + \lambda I)^{-1}$.
Bias: $E[\hat{\beta}_\lambda] = A_\lambda X^\top X \beta$.
Since $A_\lambda X^\top X = I - \lambda A_\lambda$:
$\mathrm{Bias}(\hat{\beta}_\lambda) = E[\hat{\beta}_\lambda] - \beta = -\lambda A_\lambda \beta$
So the bias magnitude increases with $\lambda$ (and is zero at $\lambda = 0$).
Variance: $\mathrm{Cov}(\hat{\beta}_\lambda) = \sigma^2 A_\lambda X^\top X A_\lambda$.
As $\lambda \to \infty$: $A_\lambda \to 0$, so the variance $\to 0$. As $\lambda \to 0$: recovers the OLS variance $\sigma^2 (X^\top X)^{-1}$.
Total MSE: $E\big[\|\hat{\beta}_\lambda - \beta\|^2\big] = \|\mathrm{Bias}\|^2 + \mathrm{tr}\, \mathrm{Cov}(\hat{\beta}_\lambda)$.
The optimal $\lambda$ minimises this sum - as $\lambda$ grows, bias grows while variance shrinks, and there is a sweet spot.
N.2 Neural Network Initialisation via Variance Propagation
He initialisation (for ReLU networks) is derived from the bias-variance / variance propagation analysis.
For a layer $z = Wx$, $a = \mathrm{ReLU}(z)$, with iid inputs (zero mean, variance $\mathrm{Var}(x)$) and iid weights independent of the inputs:
Variance propagation:
$\mathrm{Var}(z_i) = n_{\mathrm{in}}\, \mathrm{Var}(w)\, E[x^2]$
where $n_{\mathrm{in}}$ is the fan-in. For ReLU applied to a symmetric zero-mean input: $E[a^2] = \tfrac{1}{2}\, \mathrm{Var}(z)$ (since ReLU zeros half the distribution).
Therefore $\mathrm{Var}(z^{(l+1)}) = \tfrac{1}{2}\, n_{\mathrm{in}}\, \mathrm{Var}(w)\, \mathrm{Var}(z^{(l)})$.
For variance to stay constant across layers: $\tfrac{1}{2}\, n_{\mathrm{in}}\, \mathrm{Var}(w) = 1$ requires $\mathrm{Var}(w) = \frac{2}{n_{\mathrm{in}}}$.
He initialisation: $W_{ij} \sim \mathcal{N}\big(0, \frac{2}{n_{\mathrm{in}}}\big)$. This preserves variance through the network during the forward pass, preventing activations (and hence gradients) from vanishing or exploding in deep networks.
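The derivation can be observed empirically by pushing random inputs through a stack of ReLU layers: with $\mathrm{Var}(w) = 2/n_{\mathrm{in}}$ the pre-activation variance stays $O(1)$, while the naive $1/n_{\mathrm{in}}$ choice halves it per layer. Pure-Python sketch with illustrative names and a small width for speed:

```python
import random

# Variance propagation through ReLU layers: He init (Var(w) = 2/n) keeps the
# signal variance roughly constant; naive init (Var(w) = 1/n) halves it per
# layer, so after `depth` layers it has shrunk roughly like 2^{-depth}.
rng = random.Random(9)
n = 128     # layer width (fan-in = fan-out = n)
depth = 10

def forward_variance(w_std):
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(depth):
        w = [[rng.gauss(0.0, w_std) for _ in range(n)] for _ in range(n)]
        z = [sum(w[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [max(0.0, zi) for zi in z]  # ReLU
    return sum(zi * zi for zi in z) / n  # final pre-activation second moment

he = forward_variance((2.0 / n) ** 0.5)
naive = forward_variance((1.0 / n) ** 0.5)
print(he)     # stays O(1)
print(naive)  # collapses toward zero
```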
Appendix O - Self-Assessment Checklist
Use this checklist after studying the section to identify gaps before proceeding to Section05.
Core Mechanics (Should be fluent)
- I can compute $E[X]$ from a PMF or PDF using the definition.
- I can apply LOTUS: $E[g(X)] = \sum_x g(x)\, p_X(x)$ (or the integral analogue) without finding the distribution of $g(X)$.
- I know that linearity $E[aX + bY] = aE[X] + bE[Y]$ holds without independence.
- I can compute $\mathrm{Var}(X)$ using $E[X^2] - (E[X])^2$.
- I know $\mathrm{Var}(X + c) = \mathrm{Var}(X)$ (shift doesn't change variance).
- I can compute skewness $\gamma_1 = \frac{E[(X - \mu)^3]}{\sigma^3}$ and excess kurtosis $\gamma_2 = \frac{E[(X - \mu)^4]}{\sigma^4} - 3$.
Intermediate Theory (Should understand proofs)
- I can state and prove the tower property $E[E[Y \mid X]] = E[Y]$.
- I can state and prove the law of total variance.
- I can derive the MGF of the Gaussian, Exponential, and Poisson distributions.
- I can state Jensen's inequality and identify when it applies ($f$ convex vs. concave).
- I can prove $D_{KL}(p \,\|\, q) \ge 0$ using Jensen.
- I can state and prove the Cauchy-Schwarz inequality for expectations.
- I know that $|\rho_{XY}| \le 1$ follows from Cauchy-Schwarz.
- I understand why zero covariance does NOT imply independence (with a counterexample).
Advanced Applications (Should be able to apply)
- I can derive the bias-variance decomposition from first principles.
- I can explain what double descent is and why overparameterised models can generalise.
- I can derive the ELBO using Jensen's inequality.
- I can explain the reparameterisation trick and why it reduces gradient variance.
- I understand Adam as tracking first and second moments with bias correction.
- I can explain why the score function has zero mean: $E[\nabla_\theta \log p(X \mid \theta)] = 0$.
Appendix P - Further Reading and References
Textbooks
- Probability and Statistics for Engineering and the Sciences - Jay Devore (2015). Accessible introduction to moments, MGFs, and expectation with engineering examples.
- Probability Theory: The Logic of Science - E.T. Jaynes (2003). Philosophical and technical treatment; excellent on entropy and the maximum entropy principle.
- Pattern Recognition and Machine Learning - Christopher Bishop (2006), Ch. 1-2. The ML-focused treatment of expectations, KL divergence, ELBO, and variational inference.
- Deep Learning - Goodfellow, Bengio, Courville (2016), Ch. 3. Standard reference for probability in the ML context, including the bias-variance tradeoff.
- Probabilistic Machine Learning: An Introduction - Kevin Murphy (2022), Ch. 2-4. Modern treatment with extensive ML applications including Adam, VAEs, and diffusion models.
Papers
- Kingma & Welling (2014). Auto-Encoding Variational Bayes. arXiv:1312.6114. Original VAE paper; derives the ELBO via Jensen and introduces the reparameterisation trick.
- Kingma & Ba (2015). Adam: A Method for Stochastic Optimization. arXiv:1412.6980. Original Adam paper; the moment-tracking interpretation is explicit in the derivation.
- Williams (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning. REINFORCE algorithm; score function gradient estimator.
- Ioffe & Szegedy (2015). Batch Normalization. arXiv:1502.03167. Moment normalisation of activations; analysis of how variance affects training dynamics.
- Belkin et al. (2019). Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off. PNAS. Double descent phenomenon; formal analysis of why interpolating models can generalise.
- Song et al. (2020). Score-Based Generative Modeling through Stochastic Differential Equations. arXiv:2011.13456. Diffusion models as score matching; the score function as the learned quantity.
Summary of Key Results
This section established the expectation operator as the central tool in probability and machine learning. Starting from the LOTUS definition, we derived linearity (which holds without independence), the tower property (iterated expectation), and the law of total variance (which decomposes uncertainty into within-group and between-group components).
The moment hierarchy captures distribution shape: the first moment (mean) locates it; the second (variance) measures spread; the third (skewness) captures asymmetry; the fourth (kurtosis) captures tail weight. Moment generating functions encode all moments in a single analytic function and enable elegant proofs of the reproductive property (sum of independent Gaussians is Gaussian) and cumulant additivity.
Jensen's inequality - the most important inequality in this section - gives the KL divergence its non-negativity, the ELBO its existence as a lower bound, and the bias-variance decomposition its clean structure. Cauchy-Schwarz gives $|\rho| \le 1$ and motivates the attention scaling factor $1/\sqrt{d_k}$.
Every major ML method reviewed in Section9 is an application of these results: cross-entropy is expected log-loss, the ELBO is Jensen applied to the marginal likelihood, policy gradient is the score function trick, and Adam is first-and-second-moment tracking with debiasing.
The concepts forward-referenced here - Markov's and Chebyshev's inequalities, the Law of Large Numbers, the Central Limit Theorem - will be fully developed in Section05 and Section06, completing the probabilistic toolkit required for modern ML.
The conceptual unification: expectation is a linear functional on the space of random variables. It maps random variables to real numbers while preserving linear structure. Every computation in this section - variance as $E[(X - \mu)^2]$, covariance as $E[(X - \mu_X)(Y - \mu_Y)]$, the MGF as $E[e^{tX}]$, the characteristic function as $E[e^{itX}]$, the KL divergence as $E_p[\log(p/q)]$, Adam's first moment as an exponentially weighted estimate of $E[g_t]$ - is an application of this single linear functional to different functions of the random variable.
Understanding this unity transforms the apparent complexity of ML training into a coherent framework: we are always estimating expectations from samples, bounding how far sample estimates stray from true expectations, and choosing parameterisations that make those expectations tractable to compute and differentiate.
<- Back to Chapter 6: Probability Theory | Next: Concentration Inequalities ->
End of Section06/04 - Expectation and Moments
| File | Lines / Cells | Status |
|---|---|---|
| notes.md | 2000+ | [ok] Complete |
| theory.ipynb | 38 cells | [ok] Complete |
| exercises.ipynb | 27 cells | [ok] Complete |