Chapter 6 — Probability Theory
"Probability is not about chance — it is about the precise language for reasoning under uncertainty. That language is the foundation of every machine learning model."
Overview
Probability theory is the mathematical framework for reasoning about uncertainty, and every component of modern machine learning is probabilistic at its core. Training data is a sample from an unknown distribution; model parameters are drawn randomly at initialisation; predictions are distributions over outcomes; generative models explicitly model data-generating processes; language model outputs are probability distributions over tokens.
This chapter builds the formal theory from sample spaces and probability axioms through random variables, common distributions, joint and conditional probability, expectation and moments, concentration inequalities, stochastic processes, and Markov chains. The goal is not abstract measure theory for its own sake — every concept is grounded in its ML applications, from cross-entropy loss to the central limit theorem underlying batch statistics to Markov chain Monte Carlo in Bayesian inference.
The conceptual arc: probability spaces (§01) → named distributions and their properties (§02) → joint and conditional probability, independence (§03) → expectation, variance, and moments (§04) → concentration inequalities and tail bounds (§05) → stochastic processes and martingales (§06) → Markov chains and steady-state theory (§07).
Subsection Map
| # | Subsection | What It Covers | Canonical Topics |
|---|---|---|---|
| 01 | Introduction and Random Variables | Probability spaces, axioms, discrete and continuous random variables, CDF/PDF/PMF | Probability axioms, sample space, events, random variables, CDF, PDF, PMF, Bernoulli, uniform; basic rules (addition, complement, conditional) |
| 02 | Common Distributions | The named distributions that appear throughout ML; their parameters, moments, and relationships | Bernoulli, Binomial, Poisson, Gaussian, Exponential, Beta, Dirichlet, Categorical, Multinomial, Student-t, Gamma; moment generating functions; exponential family |
| 03 | Joint Distributions | Multivariate probability; conditional distributions; independence; Bayes' theorem | Joint PDF/PMF, marginalisation, conditional distributions, statistical independence, Bayes' theorem, chain rule of probability, multivariate Gaussian |
| 04 | Expectation and Moments | The expectation operator; variance, covariance, and higher moments; moment generating functions | Expected value, linearity of expectation, variance, standard deviation, covariance matrix, correlation, skewness, kurtosis, MGF, characteristic function, LOTUS |
| 05 | Concentration Inequalities | How probability mass concentrates around the mean; tail bounds used in statistical learning theory | Markov's inequality, Chebyshev's inequality, Hoeffding's inequality, Chernoff bounds, McDiarmid's inequality, PAC learning framework, generalisation bounds |
| 06 | Stochastic Processes | Sequences of random variables indexed by time; convergence theorems; central limit theorem | Law of large numbers (weak and strong), central limit theorem, Gaussian processes, Brownian motion, stationary processes, ergodicity |
| 07 | Markov Chains | Memoryless sequential processes; steady-state distributions; MCMC and its role in Bayesian ML | Markov property, transition matrices, steady-state and stationary distributions, detailed balance, MCMC (Metropolis-Hastings, Gibbs sampling), PageRank, hidden Markov models |
Reading Order and Dependencies
01-Introduction-and-Random-Variables (foundation: probability spaces, random variables)
↓
02-Common-Distributions (vocabulary: named distributions and their properties)
↓
03-Joint-Distributions (multivariate: joint, conditional, Bayes' theorem)
↓
04-Expectation-and-Moments (summary statistics: E[X], Var[X], covariance)
↓
05-Concentration-Inequalities (bounds: how far can a random variable stray from its mean?)
↓
06-Stochastic-Processes (sequences: LLN, CLT, Gaussian processes)
↓
07-Markov-Chains (structured processes: MCMC, Bayesian inference machinery)
↓
07-Statistics (Chapter 7) (estimation: MLE, MAP, confidence intervals, hypothesis tests)
What Belongs Where — Canonical Homes
This table is the authoritative scoping guide. If a topic has a canonical home, every other section must give at most a 1–2 paragraph preview with a forward/backward reference — never a full treatment.
| Topic | Canonical Home | Preview Only In |
|---|---|---|
| Probability axioms (Kolmogorov) | §01 | — |
| Sample space, events, set operations | §01 | — |
| CDF, PDF, PMF definitions | §01 | §02 (used for named distributions) |
| Conditional probability | §01 | §03 (extended to joint distributions) |
| Bernoulli and uniform (simple intro) | §01 | §02 (full properties) |
| Named discrete distributions (Binomial, Poisson, Geometric, Categorical) | §02 | §01 (Bernoulli only), §03 (marginals) |
| Named continuous distributions (Gaussian, Exponential, Beta, Dirichlet, Gamma, Student-t) | §02 | §04 (moments of Gaussian), §06 (Gaussian process) |
| Exponential family and natural parameters | §02 | §04 (sufficient statistics and moments) |
| Moment generating functions | §02 | §04 (moments from MGF) |
| Joint distributions, joint PDF/PMF | §03 | §04 (expectation under joint) |
| Marginalisation | §03 | §04 (iterated expectation) |
| Statistical independence | §03 | §01 (preview), §04 (Var[X+Y]=Var[X]+Var[Y] if independent) |
| Bayes' theorem | §03 | §01 (introduced as formula), §07 (Bayesian update) |
| Conditional independence | §03 | §07 (Markov property as conditional independence) |
| Multivariate Gaussian | §03 | §06 (as Gaussian process marginals) |
| Expected value (definition and linearity) | §04 | §01 (informal preview) |
| Variance, covariance, correlation | §04 | §02 (variance of named distributions), §03 (covariance matrix) |
| LOTUS (law of the unconscious statistician) | §04 | — |
| Jensen's inequality | §04 | §05 (used to derive Markov's inequality) |
| Moment generating function applications | §04 | §02 (MGF definitions) |
| Markov's inequality | §05 | §04 (preview: E[X] bounds tails) |
| Chebyshev's inequality | §05 | §04 (preview: Var[X] bounds tails) |
| Hoeffding's, Chernoff bounds | §05 | — |
| PAC learning and generalisation bounds | §05 | — |
| Union bound | §05 | §01 (probability of union, preview) |
| Law of large numbers | §06 | §04 (preview: sample mean → E[X]) |
| Central limit theorem | §06 | §04 (preview: sum of iid → Gaussian) |
| Gaussian processes | §06 | §02 (Gaussian distribution) |
| Brownian motion | §06 | — |
| Markov property | §07 | §03 (conditional independence preview) |
| Transition matrices and steady state | §07 | — |
| Detailed balance (reversibility) | §07 | — |
| Metropolis-Hastings, Gibbs sampling | §07 | §03 (Bayes' theorem used) |
| PageRank | §07 | — |
| Hidden Markov models | §07 | §03 (conditional independence structure) |
Overlap Danger Zones
1. Probability Axioms ↔ Common Distributions
- §01 introduces probability through axioms and defines the simplest distributions (Bernoulli, uniform) as concrete examples of the formalism.
- §02 gives the complete treatment of each named distribution — parameters, PDF/PMF, CDF, mean, variance, MGF, relationships. §02 must not re-derive the axioms; §01 must not provide the full moments and relationships of Binomial, Gaussian, etc.
2. Conditional Probability ↔ Joint Distributions ↔ Bayes
- §01 defines conditional probability and states Bayes' theorem as a formula.
- §03 develops the full machinery: joint PDF/PMF, marginalisation, conditional distributions for random variables, chain rule of probability, and Bayes' theorem with prior/likelihood/posterior structure.
- §07 applies Bayesian updates in the Markov chain setting. §07 must backward-reference §03 for Bayes' theorem.
3. Expectation ↔ Moments ↔ MGF
- §04 is the exclusive home of the expectation operator, variance, covariance, higher moments, LOTUS, and moment generating functions as analytical tools.
- §02 may state the mean and variance of each named distribution (as facts), but must not derive them from first principles — §04 owns those derivations.
4. Concentration Inequalities ↔ Law of Large Numbers
- §05 proves Markov's and Chebyshev's inequalities and derives Hoeffding/Chernoff bounds, culminating in PAC-learning generalisation bounds.
- §06 proves the weak and strong LLN using the tools from §05. §06 should backward-reference §05 rather than re-proving the concentration results it uses.
5. Markov Chains ↔ Conditional Independence
- §03 defines conditional independence .
- §07 uses conditional independence as the definition of the Markov property: . §07 must backward-reference §03 rather than re-define conditional independence.
6. Gaussian Distribution ↔ Gaussian Process ↔ Multivariate Gaussian
- §02 defines the univariate Gaussian: PDF, CDF (via ), moments, MGF, standard normal.
- §03 defines the multivariate Gaussian: joint PDF, covariance matrix, marginals, conditionals, affine transformations.
- §06 defines the Gaussian process as an infinite-dimensional extension: any finite collection of function values follows a multivariate Gaussian. §06 must backward-reference §02 and §03 rather than re-derive Gaussian properties.
Forward and Backward Reference Format
Forward reference (full treatment is later):
> **Preview: Bayes' Theorem for Distributions**
> For random variables $X$ and $Y$ with joint density $p(x,y)$, Bayes' theorem becomes
> $p(x|y) = p(y|x)p(x)/p(y)$, giving the posterior over $x$ given observation $y$.
>
> → _Full treatment: [Joint Distributions](../03-Joint-Distributions/notes.md)_
Backward reference (builds on earlier section):
> **Recall:** The expected value $\mathbb{E}[X] = \int x\, p(x)\, dx$ was defined in
> [Expectation and Moments](../04-Expectation-and-Moments/notes.md). The covariance
> $\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)]$ defined there is the key
> parameter of the multivariate Gaussian developed here.
Key Cross-Chapter Dependencies
From Chapter 5 — Multivariate Calculus:
- Integration in multiple variables (§01/§03 use multivariate integrals for marginalisation)
- Lagrange multipliers (§04 in Ch5) → maximum entropy distributions (§02 exponential family derivation)
- Gradients and log-derivatives → score functions, Fisher information matrix (§03, §04)
From Chapter 4 — Calculus Fundamentals:
- Integration (computing CDFs, normalisation constants, moments)
- Series (computing MGFs, Poisson approximation)
Into Chapter 7 — Statistics:
- §02 (distributions) → likelihood functions and MLE
- §03 (Bayes' theorem) → MAP estimation and Bayesian inference
- §04 (expectation, variance) → unbiased estimators, sample statistics
- §05 (concentration) → confidence intervals and hypothesis testing
- §06 (CLT) → asymptotic normality of estimators
- §07 (Markov chains) → MCMC-based posterior sampling
Into Chapter 8 — Optimisation:
- §02 (Gaussian, softmax) → cross-entropy loss derivation
- §03 (conditional distributions) → EM algorithm (expectation step)
- §04 (KL divergence, entropy) → variational inference
Into Chapter 9 — Information Theory:
- §02 (exponential family) → natural exponential family and sufficient statistics
- §04 (entropy = ) → Shannon entropy definition
- §05 (MGF) → Cramér's theorem, large deviations
ML Concept Map
| ML Concept | Probability Theory Foundation | Section |
|---|---|---|
| Cross-entropy loss | Categorical distribution; negative log-likelihood | §02, §03 |
| Softmax output layer | Categorical distribution parameterisation | §02 |
| Dropout regularisation | Bernoulli random variables over activations | §01, §02 |
| Batch normalisation | Sample mean and variance as random variables; CLT | §04, §06 |
| Variational autoencoder (VAE) | Gaussian reparameterisation; KL divergence | §02, §03 |
| Bayesian neural networks | Prior × likelihood → posterior; Bayes' theorem | §03 |
| Gaussian processes for regression | Gaussian process prior; posterior conditioning | §06 |
| MCMC for posterior sampling | Markov chains; detailed balance; stationary distribution | §07 |
| Attention mechanism as soft lookup | Categorical distribution over keys | §02 |
| Language model outputs | Categorical distribution over vocabulary | §02 |
| Data augmentation | Distribution over augmented inputs | §01, §03 |
| Importance sampling | Change of measure; Radon-Nikodym derivative | §03, §06 |
| PAC generalisation bounds | Hoeffding's inequality; union bound | §05 |
| Double descent / bias-variance | Variance of estimators; bias-variance decomposition | §04 |
| Reinforcement learning (policy) | Markov decision process (MDP); Markov property | §07 |
| RLHF reward modelling | Bradley-Terry model; probability of preference | §02, §03 |
| Normalising flows | Change of variables formula for densities | §03 |
| Diffusion models | Gaussian noise schedule; score function | §02, §06 |
Prerequisites
Before starting this chapter, ensure you are comfortable with:
- Integration (single and multi-variable): computing areas under curves, normalisation constants — §03-Integration
- Series: power series, convergence — §04-Series-and-Sequences
- Set theory basics: union, intersection, complement — introduced in §01 of this chapter
- Matrix algebra: matrix–vector products, determinants, positive definite matrices — Chapter 3
- Lagrange multipliers (for exponential family derivation): §04-Optimality-Conditions
← Previous Chapter: Multivariate Calculus | Next Chapter: Statistics →