READMEMath for LLMs

README

Probability Theory

README

Chapter 6 — Probability Theory

"Probability is not about chance — it is about the precise language for reasoning under uncertainty. That language is the foundation of every machine learning model."

Overview

Probability theory is the mathematical framework for reasoning about uncertainty, and every component of modern machine learning is probabilistic at its core. Training data is a sample from an unknown distribution; model parameters are drawn randomly at initialisation; predictions are distributions over outcomes; generative models explicitly model data-generating processes; language model outputs are probability distributions over tokens.

This chapter builds the formal theory from sample spaces and probability axioms through random variables, common distributions, joint and conditional probability, expectation and moments, concentration inequalities, stochastic processes, and Markov chains. The goal is not abstract measure theory for its own sake — every concept is grounded in its ML applications, from cross-entropy loss to the central limit theorem underlying batch statistics to Markov chain Monte Carlo in Bayesian inference.

The conceptual arc: probability spaces (§01) → named distributions and their properties (§02) → joint and conditional probability, independence (§03) → expectation, variance, and moments (§04) → concentration inequalities and tail bounds (§05) → stochastic processes and martingales (§06) → Markov chains and steady-state theory (§07).


Subsection Map

#SubsectionWhat It CoversCanonical Topics
01Introduction and Random VariablesProbability spaces, axioms, discrete and continuous random variables, CDF/PDF/PMFProbability axioms, sample space, events, random variables, CDF, PDF, PMF, Bernoulli, uniform; basic rules (addition, complement, conditional)
02Common DistributionsThe named distributions that appear throughout ML; their parameters, moments, and relationshipsBernoulli, Binomial, Poisson, Gaussian, Exponential, Beta, Dirichlet, Categorical, Multinomial, Student-t, Gamma; moment generating functions; exponential family
03Joint DistributionsMultivariate probability; conditional distributions; independence; Bayes' theoremJoint PDF/PMF, marginalisation, conditional distributions, statistical independence, Bayes' theorem, chain rule of probability, multivariate Gaussian
04Expectation and MomentsThe expectation operator; variance, covariance, and higher moments; moment generating functionsExpected value, linearity of expectation, variance, standard deviation, covariance matrix, correlation, skewness, kurtosis, MGF, characteristic function, LOTUS
05Concentration InequalitiesHow probability mass concentrates around the mean; tail bounds used in statistical learning theoryMarkov's inequality, Chebyshev's inequality, Hoeffding's inequality, Chernoff bounds, McDiarmid's inequality, PAC learning framework, generalisation bounds
06Stochastic ProcessesSequences of random variables indexed by time; convergence theorems; central limit theoremLaw of large numbers (weak and strong), central limit theorem, Gaussian processes, Brownian motion, stationary processes, ergodicity
07Markov ChainsMemoryless sequential processes; steady-state distributions; MCMC and its role in Bayesian MLMarkov property, transition matrices, steady-state and stationary distributions, detailed balance, MCMC (Metropolis-Hastings, Gibbs sampling), PageRank, hidden Markov models

Reading Order and Dependencies

01-Introduction-and-Random-Variables   (foundation: probability spaces, random variables)
        ↓
02-Common-Distributions                (vocabulary: named distributions and their properties)
        ↓
03-Joint-Distributions                 (multivariate: joint, conditional, Bayes' theorem)
        ↓
04-Expectation-and-Moments             (summary statistics: E[X], Var[X], covariance)
        ↓
05-Concentration-Inequalities          (bounds: how far can a random variable stray from its mean?)
        ↓
06-Stochastic-Processes                (sequences: LLN, CLT, Gaussian processes)
        ↓
07-Markov-Chains                       (structured processes: MCMC, Bayesian inference machinery)
        ↓
07-Statistics (Chapter 7)             (estimation: MLE, MAP, confidence intervals, hypothesis tests)

What Belongs Where — Canonical Homes

This table is the authoritative scoping guide. If a topic has a canonical home, every other section must give at most a 1–2 paragraph preview with a forward/backward reference — never a full treatment.

TopicCanonical HomePreview Only In
Probability axioms (Kolmogorov)§01
Sample space, events, set operations§01
CDF, PDF, PMF definitions§01§02 (used for named distributions)
Conditional probability P(AB)P(A\|B)§01§03 (extended to joint distributions)
Bernoulli and uniform (simple intro)§01§02 (full properties)
Named discrete distributions (Binomial, Poisson, Geometric, Categorical)§02§01 (Bernoulli only), §03 (marginals)
Named continuous distributions (Gaussian, Exponential, Beta, Dirichlet, Gamma, Student-t)§02§04 (moments of Gaussian), §06 (Gaussian process)
Exponential family and natural parameters§02§04 (sufficient statistics and moments)
Moment generating functions§02§04 (moments from MGF)
Joint distributions, joint PDF/PMF§03§04 (expectation under joint)
Marginalisation§03§04 (iterated expectation)
Statistical independence XYX \perp Y§03§01 (preview), §04 (Var[X+Y]=Var[X]+Var[Y] if independent)
Bayes' theorem§03§01 (introduced as formula), §07 (Bayesian update)
Conditional independence§03§07 (Markov property as conditional independence)
Multivariate Gaussian§03§06 (as Gaussian process marginals)
Expected value (definition and linearity)§04§01 (informal preview)
Variance, covariance, correlation§04§02 (variance of named distributions), §03 (covariance matrix)
LOTUS (law of the unconscious statistician)§04
Jensen's inequality§04§05 (used to derive Markov's inequality)
Moment generating function applications§04§02 (MGF definitions)
Markov's inequality§05§04 (preview: E[X] bounds tails)
Chebyshev's inequality§05§04 (preview: Var[X] bounds tails)
Hoeffding's, Chernoff bounds§05
PAC learning and generalisation bounds§05
Union bound§05§01 (probability of union, preview)
Law of large numbers§06§04 (preview: sample mean → E[X])
Central limit theorem§06§04 (preview: sum of iid → Gaussian)
Gaussian processes§06§02 (Gaussian distribution)
Brownian motion§06
Markov property§07§03 (conditional independence preview)
Transition matrices and steady state§07
Detailed balance (reversibility)§07
Metropolis-Hastings, Gibbs sampling§07§03 (Bayes' theorem used)
PageRank§07
Hidden Markov models§07§03 (conditional independence structure)

Overlap Danger Zones

1. Probability Axioms ↔ Common Distributions

  • §01 introduces probability through axioms and defines the simplest distributions (Bernoulli, uniform) as concrete examples of the formalism.
  • §02 gives the complete treatment of each named distribution — parameters, PDF/PMF, CDF, mean, variance, MGF, relationships. §02 must not re-derive the axioms; §01 must not provide the full moments and relationships of Binomial, Gaussian, etc.

2. Conditional Probability ↔ Joint Distributions ↔ Bayes

  • §01 defines conditional probability P(AB)=P(AB)/P(B)P(A|B) = P(A \cap B)/P(B) and states Bayes' theorem as a formula.
  • §03 develops the full machinery: joint PDF/PMF, marginalisation, conditional distributions for random variables, chain rule of probability, and Bayes' theorem with prior/likelihood/posterior structure.
  • §07 applies Bayesian updates in the Markov chain setting. §07 must backward-reference §03 for Bayes' theorem.

3. Expectation ↔ Moments ↔ MGF

  • §04 is the exclusive home of the expectation operator, variance, covariance, higher moments, LOTUS, and moment generating functions as analytical tools.
  • §02 may state the mean and variance of each named distribution (as facts), but must not derive them from first principles — §04 owns those derivations.

4. Concentration Inequalities ↔ Law of Large Numbers

  • §05 proves Markov's and Chebyshev's inequalities and derives Hoeffding/Chernoff bounds, culminating in PAC-learning generalisation bounds.
  • §06 proves the weak and strong LLN using the tools from §05. §06 should backward-reference §05 rather than re-proving the concentration results it uses.

5. Markov Chains ↔ Conditional Independence

  • §03 defines conditional independence XYZX \perp Y | Z.
  • §07 uses conditional independence as the definition of the Markov property: Xt+1X0:t1XtX_{t+1} \perp X_{0:t-1} | X_t. §07 must backward-reference §03 rather than re-define conditional independence.

6. Gaussian Distribution ↔ Gaussian Process ↔ Multivariate Gaussian

  • §02 defines the univariate Gaussian: PDF, CDF (via Φ\Phi), moments, MGF, standard normal.
  • §03 defines the multivariate Gaussian: joint PDF, covariance matrix, marginals, conditionals, affine transformations.
  • §06 defines the Gaussian process as an infinite-dimensional extension: any finite collection of function values follows a multivariate Gaussian. §06 must backward-reference §02 and §03 rather than re-derive Gaussian properties.

Forward and Backward Reference Format

Forward reference (full treatment is later):

> **Preview: Bayes' Theorem for Distributions**
> For random variables $X$ and $Y$ with joint density $p(x,y)$, Bayes' theorem becomes
> $p(x|y) = p(y|x)p(x)/p(y)$, giving the posterior over $x$ given observation $y$.
>
> → _Full treatment: [Joint Distributions](../03-Joint-Distributions/notes.md)_

Backward reference (builds on earlier section):

> **Recall:** The expected value $\mathbb{E}[X] = \int x\, p(x)\, dx$ was defined in
> [Expectation and Moments](../04-Expectation-and-Moments/notes.md). The covariance
> $\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)]$ defined there is the key
> parameter of the multivariate Gaussian developed here.

Key Cross-Chapter Dependencies

From Chapter 5 — Multivariate Calculus:

  • Integration in multiple variables (§01/§03 use multivariate integrals for marginalisation)
  • Lagrange multipliers (§04 in Ch5) → maximum entropy distributions (§02 exponential family derivation)
  • Gradients and log-derivatives → score functions, Fisher information matrix (§03, §04)

From Chapter 4 — Calculus Fundamentals:

  • Integration (computing CDFs, normalisation constants, moments)
  • Series (computing MGFs, Poisson approximation)

Into Chapter 7 — Statistics:

  • §02 (distributions) → likelihood functions and MLE
  • §03 (Bayes' theorem) → MAP estimation and Bayesian inference
  • §04 (expectation, variance) → unbiased estimators, sample statistics
  • §05 (concentration) → confidence intervals and hypothesis testing
  • §06 (CLT) → asymptotic normality of estimators
  • §07 (Markov chains) → MCMC-based posterior sampling

Into Chapter 8 — Optimisation:

  • §02 (Gaussian, softmax) → cross-entropy loss derivation
  • §03 (conditional distributions) → EM algorithm (expectation step)
  • §04 (KL divergence, entropy) → variational inference

Into Chapter 9 — Information Theory:

  • §02 (exponential family) → natural exponential family and sufficient statistics
  • §04 (entropy = E[logp(X)]-\mathbb{E}[\log p(X)]) → Shannon entropy definition
  • §05 (MGF) → Cramér's theorem, large deviations

ML Concept Map

ML ConceptProbability Theory FoundationSection
Cross-entropy loss ylogp^-\sum y \log \hat{p}Categorical distribution; negative log-likelihood§02, §03
Softmax output layerCategorical distribution parameterisation§02
Dropout regularisationBernoulli random variables over activations§01, §02
Batch normalisationSample mean and variance as random variables; CLT§04, §06
Variational autoencoder (VAE)Gaussian reparameterisation; KL divergence§02, §03
Bayesian neural networksPrior × likelihood → posterior; Bayes' theorem§03
Gaussian processes for regressionGaussian process prior; posterior conditioning§06
MCMC for posterior samplingMarkov chains; detailed balance; stationary distribution§07
Attention mechanism as soft lookupCategorical distribution over keys§02
Language model outputsCategorical distribution over vocabulary§02
Data augmentationDistribution over augmented inputs§01, §03
Importance samplingChange of measure; Radon-Nikodym derivative§03, §06
PAC generalisation boundsHoeffding's inequality; union bound§05
Double descent / bias-varianceVariance of estimators; bias-variance decomposition§04
Reinforcement learning (policy)Markov decision process (MDP); Markov property§07
RLHF reward modellingBradley-Terry model; probability of preference§02, §03
Normalising flowsChange of variables formula for densities§03
Diffusion modelsGaussian noise schedule; score function§02, §06

Prerequisites

Before starting this chapter, ensure you are comfortable with:

  • Integration (single and multi-variable): computing areas under curves, normalisation constants — §03-Integration
  • Series: power series, convergence — §04-Series-and-Sequences
  • Set theory basics: union, intersection, complement — introduced in §01 of this chapter
  • Matrix algebra: matrix–vector products, determinants, positive definite matrices — Chapter 3
  • Lagrange multipliers (for exponential family derivation): §04-Optimality-Conditions

← Previous Chapter: Multivariate Calculus | Next Chapter: Statistics →