Math for LLMs
Probability Theory / Introduction and Random Variables: Part 2

13. Why This Matters for AI (2026 Perspective)

Concept | Impact on AI/LLMs

Kolmogorov axioms | The foundation of every probabilistic model. MLE, MAP, VAEs, diffusion models, Bayesian NNs - all rest on these three axioms. Knowing the axioms lets you verify whether a proposed probability model is well-defined.
Conditional probability | The core operation in Bayesian inference and autoregressive language modelling. Every P(x_t \mid x_{<t}) in a language model is a conditional probability. Causal attention implements conditional independence structure.
Bayes' theorem | Underlies RLHF (updating reward model beliefs from human feedback), Bayesian hyperparameter search, and the theoretical analysis of in-context learning as implicit Bayesian inference.
Independence | The i.i.d. assumption justifies mini-batch gradient descent. Conditional independence defines the graphical structure of Bayesian networks and the Markov property of diffusion models.
PMF / Categorical distribution | Every token prediction in a language model is a Categorical PMF. Softmax produces a valid PMF over the vocabulary.
Gaussian distribution | Used in weight initialisation (He, Glorot), VAE latent spaces, diffusion forward processes, Gaussian process priors, and noise injection for differential privacy. The CLT (Section06) explains its ubiquity.
Expected value | The training objective is an expected loss \mathbb{E}_{\mathbf{x},y}[\mathcal{L}(\theta)] over the data distribution. Linearity of expectation justifies mini-batch gradient estimation.
Variance | Gradient variance determines training stability. BatchNorm and LayerNorm are variance-control mechanisms. Dropout variance analysis guides the choice of dropout rate.
Change of variables | Foundation of normalising flows (Glow, RealNVP, neural spline flows). The change-of-variables formula computes the density of a transformed distribution.
Negative log-likelihood | The universal training objective. Cross-entropy loss (classification), MSE (regression under Gaussian noise), and the ELBO (VAEs) all derive from minimising NLL.

Conceptual Bridge

This section establishes the probability-theoretic language that all subsequent sections build upon. We began from the Kolmogorov axioms - three simple rules that define what probabilities must be - and derived the core computational rules (complement, union, conditioning, total probability, Bayes) that allow us to reason under uncertainty. The random variable formalism converts outcomes to numbers, enabling the full machinery of analysis.

Looking back: The section draws on set theory (union, intersection, complement - assumed), on calculus for the continuous case (integration for CDFs and PDFs, differentiation for the fundamental theorem), and on the concept of a function (random variables are functions). The proof techniques (direct calculation from axioms, algebraic manipulation) mirror those in linear algebra and calculus sections of this course.

Looking forward - within this chapter:

  • Section02-Common-Distributions takes the distribution vocabulary started here (Bernoulli, Uniform, Gaussian) and completes it: Binomial, Poisson, Exponential, Beta, Dirichlet, Categorical, Student-t, plus the exponential family.
  • Section03-Joint-Distributions extends the single-variable machinery to multiple random variables: joint PDFs, marginalisation, and Bayes' theorem for random variables rather than events.
  • Section04-Expectation-and-Moments develops expectation much more deeply: LOTUS, covariance matrices, MGFs, characteristic functions, and the full moment theory.
  • Section05-Concentration-Inequalities proves how expectations and variances control tail probabilities (Markov, Chebyshev, Hoeffding) - the mathematical tools for generalisation guarantees.
  • Section06-Stochastic-Processes introduces the Central Limit Theorem (explaining Gaussian ubiquity) and Gaussian processes (infinite-dimensional generalisations).
  • Section07-Markov-Chains uses conditional independence (Section4.2) and Bayes' theorem (Section3.6) to define memoryless sequential processes and MCMC sampling.

Looking forward - across chapters:

  • Chapter 7 (Statistics) builds estimation theory (MLE, MAP, confidence intervals) directly on the probability foundation here.
  • Chapter 8 (Optimisation) uses the Gaussian noise model (Section7.4) and NLL loss (Section10.1) to motivate SGD and Adam.
  • Chapter 9 (Information Theory) defines entropy as -\mathbb{E}[\log p(X)] and KL divergence in terms of two probability measures - direct extensions of Section8.1 and Section10.1.

POSITION IN CURRICULUM
========================================================================

  Section04 Calculus Fundamentals
  Section05 Multivariate Calculus
  |
  +-- Section06 Probability Theory
      +-- 01-Introduction-and-Random-Variables   <== YOU ARE HERE
      |     (probability spaces, RVs, CDF/PDF/PMF, E[X], Var[X])
      +-- 02-Common-Distributions
      |     (Binomial, Poisson, Gaussian, Beta, Dirichlet, ...)
      +-- 03-Joint-Distributions
      |     (joint PDF, marginalisation, Bayes for RVs)
      +-- 04-Expectation-and-Moments
      |     (LOTUS, covariance, MGF, higher moments)
      +-- 05-Concentration-Inequalities
      |     (Markov, Chebyshev, Hoeffding, PAC bounds)
      +-- 06-Stochastic-Processes
      |     (LLN, CLT, Gaussian processes)
      +-- 07-Markov-Chains
            (MCMC, detailed balance, stationary distributions)

========================================================================



Appendix A: The Gaussian Normalisation Integral

The fact that \int_{-\infty}^{\infty} e^{-x^2/2}\, dx = \sqrt{2\pi} is not obvious. Here is the classical proof using a two-dimensional trick.

Claim: I = \int_{-\infty}^{\infty} e^{-x^2/2}\, dx = \sqrt{2\pi}.

Proof (Gaussian integral via polar coordinates):

I^2 = \left(\int_{-\infty}^{\infty} e^{-x^2/2}\, dx\right)\left(\int_{-\infty}^{\infty} e^{-y^2/2}\, dy\right) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)/2}\, dx\, dy

Convert to polar coordinates (r, \theta) with x = r\cos\theta, y = r\sin\theta, dx\, dy = r\, dr\, d\theta:

I^2 = \int_0^{2\pi}\int_0^{\infty} e^{-r^2/2}\, r\, dr\, d\theta = 2\pi \int_0^{\infty} r e^{-r^2/2}\, dr

Let u = r^2/2, du = r\, dr:

I^2 = 2\pi \int_0^{\infty} e^{-u}\, du = 2\pi \cdot 1 = 2\pi

Therefore I = \sqrt{2\pi} (since I > 0). \square

This means the normalisation constant 1/\sqrt{2\pi\sigma^2} in the Gaussian PDF is exactly what is needed to make the integral equal 1. That the Gaussian with any \mu, \sigma^2 integrates to 1 follows by the substitution z = (x-\mu)/\sigma.
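The claim is easy to check numerically; a minimal sketch using SciPy's adaptive quadrature (the integrand and bounds match the integral above):

```python
import numpy as np
from scipy import integrate

# Numerically integrate exp(-x^2/2) over the whole real line.
I, err = integrate.quad(lambda x: np.exp(-x**2 / 2), -np.inf, np.inf)

print(I, np.sqrt(2 * np.pi))  # both approximately 2.5066
assert abs(I - np.sqrt(2 * np.pi)) < 1e-8
```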


Appendix B: Discrete Probability - Worked Examples in Full

B.1 The Birthday Problem

Question: In a room of n people, what is the probability that at least two share a birthday (assuming 365 days, uniform distribution, no leap years)?

This is a classic application of the complement rule.

Let A = "at least two people share a birthday" and A^c = "all birthdays are distinct."

P(A^c) = \frac{365}{365} \cdot \frac{364}{365} \cdot \frac{363}{365} \cdots \frac{365-n+1}{365} = \frac{365!/(365-n)!}{365^n}

The probability of at least one shared birthday:

P(A) = 1 - P(A^c) = 1 - \prod_{k=0}^{n-1}\frac{365-k}{365}

n    P(at least one match)
10   11.7%
23   50.7% - crossover point
50   97.0%
70   99.9%
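The table above can be reproduced with a direct translation of the product formula (the function name is ours):

```python
def p_shared_birthday(n: int, days: int = 365) -> float:
    """P(at least two of n people share a birthday)."""
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= (days - k) / days
    return 1.0 - p_distinct

for n in (10, 23, 50, 70):
    print(n, round(p_shared_birthday(n), 3))  # reproduces the table above
```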

For AI: The birthday paradox is closely related to hash collision probability and the birthday attack in cryptography. In machine learning, it appears in:

  • Retrieval collision: The probability that two embeddings in a high-dimensional space are closer than a threshold depends on how many embeddings exist - birthday-paradox type reasoning
  • Token collision in tokenisation: In BPE tokenisation, character pairs collide and merge; the frequency of collisions follows birthday-paradox dynamics
  • Data deduplication: Estimating the probability of exact or near-duplicate examples in large training datasets

B.2 Geometric Series and PMF Normalisation

Many PMFs involve geometric series. Recall:

\sum_{k=0}^{\infty} r^k = \frac{1}{1-r}, \quad |r| < 1

Application to the Geometric PMF: P(X = k) = (1-p)^{k-1}p for k = 1, 2, 3, \ldots

\sum_{k=1}^{\infty}(1-p)^{k-1}p = p\sum_{j=0}^{\infty}(1-p)^j = p \cdot \frac{1}{1-(1-p)} = \frac{p}{p} = 1 \checkmark

Computing the mean of the Geometric distribution by differentiating the generating function:

\mathbb{E}[X] = \sum_{k=1}^{\infty}k(1-p)^{k-1}p = p\sum_{k=1}^{\infty}k(1-p)^{k-1} = p \cdot \frac{d}{dq}\sum_{k=0}^{\infty}q^k\bigg|_{q=1-p} = p \cdot \frac{d}{dq}\frac{1}{1-q}\bigg|_{q=1-p} = p \cdot \frac{1}{(1-q)^2}\bigg|_{q=1-p} = p \cdot \frac{1}{p^2} = \frac{1}{p}

The expected wait time is 1/p - intuitively, if each trial has probability p of success, on average you need 1/p trials.
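A quick simulation check of this mean, using NumPy's geometric sampler (which, like the PMF above, counts trials up to and including the first success; seed and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.2
# numpy's geometric matches P(X = k) = (1-p)^{k-1} p on k = 1, 2, 3, ...
samples = rng.geometric(p, size=200_000)

print(samples.mean())  # close to 1/p = 5
assert abs(samples.mean() - 1 / p) < 0.05
```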


Appendix C: Continuous Distributions - Computing Moments Directly

C.1 Gaussian Mean and Variance from First Principles

Mean of \mathcal{N}(\mu, \sigma^2): By symmetry of the Gaussian PDF about \mu:

\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)}\, dx

Substitute z = (x-\mu)/\sigma so x = \mu + \sigma z and dx = \sigma\, dz:

= \int_{-\infty}^{\infty} (\mu + \sigma z) \cdot \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\, dz = \mu \underbrace{\int_{-\infty}^{\infty}\phi(z)\,dz}_{=1} + \sigma\underbrace{\int_{-\infty}^{\infty} z\phi(z)\, dz}_{=0\ (\text{odd integrand})} = \mu

Variance of \mathcal{N}(\mu, \sigma^2): With the same substitution, \operatorname{Var}(X) = \mathbb{E}[(X-\mu)^2] = \sigma^2 \mathbb{E}[Z^2] where Z \sim \mathcal{N}(0,1).

\mathbb{E}[Z^2] = \int_{-\infty}^{\infty} z^2 \cdot \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\, dz

Integrate by parts with u = z, dv = z e^{-z^2/2}\, dz, so v = -e^{-z^2/2}:

= \frac{1}{\sqrt{2\pi}}\left[-ze^{-z^2/2}\right]_{-\infty}^{\infty} + \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-z^2/2}\, dz = 0 + 1 = 1

Therefore \operatorname{Var}(X) = \sigma^2 \cdot 1 = \sigma^2. \square

C.2 Uniform Distribution Variance - Direct Computation

For X \sim \text{Uniform}(a,b):

\mathbb{E}[X^2] = \int_a^b x^2 \cdot \frac{1}{b-a}\, dx = \frac{1}{b-a}\cdot\frac{x^3}{3}\bigg|_a^b = \frac{b^3-a^3}{3(b-a)} = \frac{a^2+ab+b^2}{3}

using the factorisation b^3 - a^3 = (b-a)(b^2+ab+a^2).

\operatorname{Var}(X) = \frac{a^2+ab+b^2}{3} - \frac{(a+b)^2}{4} = \frac{4(a^2+ab+b^2) - 3(a+b)^2}{12} = \frac{(b-a)^2}{12}
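A sample-based sanity check of both formulas (the endpoints, seed, and sample size are arbitrary choices):

```python
import numpy as np

a, b = 2.0, 5.0
rng = np.random.default_rng(1)
x = rng.uniform(a, b, size=1_000_000)

# E[X^2] = (a^2 + ab + b^2)/3 and Var(X) = (b - a)^2 / 12
assert abs((x**2).mean() - (a * a + a * b + b * b) / 3) < 0.05
assert abs(x.var() - (b - a) ** 2 / 12) < 0.01
print((x**2).mean(), x.var())
```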

Appendix D: Key Inequalities and Their Probability Theory Roots

Several important inequalities follow directly from the basic probability theory developed in this section:

D.1 Markov's Inequality (Preview)

For any non-negative random variable X and t > 0:

P(X \geq t) \leq \frac{\mathbb{E}[X]}{t}

Proof sketch: \mathbb{E}[X] = \mathbb{E}[X \cdot \mathbf{1}_{X \geq t}] + \mathbb{E}[X \cdot \mathbf{1}_{X < t}] \geq \mathbb{E}[X \cdot \mathbf{1}_{X \geq t}] \geq t \cdot P(X \geq t).

This connects the expected value (Section8.1) to tail probabilities. Full treatment with applications to PAC bounds: Section05-Concentration-Inequalities.
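The bound can be checked empirically; a minimal sketch against an Exponential(1) variable (the choice of distribution, seed, and thresholds is ours - the inequality in fact holds exactly for the empirical distribution of any non-negative sample):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=100_000)  # non-negative, E[X] = 1

# The empirical tail never exceeds the Markov bound E[X]/t.
for t in (0.5, 1.0, 2.0, 5.0):
    tail = (x >= t).mean()
    bound = x.mean() / t
    assert tail <= bound
    print(t, round(tail, 4), round(bound, 4))
```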

D.2 Jensen's Inequality

For a convex function g and random variable X:

g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)]

Examples:

  • g(x) = x^2 (convex): (\mathbb{E}[X])^2 \leq \mathbb{E}[X^2] - the computational variance formula
  • g(x) = e^x (convex): e^{\mathbb{E}[X]} \leq \mathbb{E}[e^X] - used in exponential tail bounds
  • g(x) = -\log x (convex for x > 0): -\log \mathbb{E}[X] \leq \mathbb{E}[-\log X] - used in the ELBO derivation for VAEs

D.3 Cauchy-Schwarz for Expectations

|\mathbb{E}[XY]|^2 \leq \mathbb{E}[X^2] \cdot \mathbb{E}[Y^2]

This bounds the correlation between random variables and is used to prove the variance of sums and covariance properties in Section04.


Appendix E: Probability in High Dimensions - A Preview

All of the distributions introduced in this section are one-dimensional (X : \Omega \to \mathbb{R}). In machine learning, we almost always work with high-dimensional data - image pixels, word embeddings, weight tensors. High-dimensional probability has surprising and sometimes counterintuitive properties:

E.1 The Curse of Dimensionality for Probability

In d dimensions, the unit hypercube [0,1]^d has volume 1, but an inscribed hypersphere of radius 0.5 has volume:

V_d = \frac{\pi^{d/2}}{\Gamma(d/2+1)} \cdot 0.5^d

For d = 2: V_2 = \pi/4 \approx 0.785. For d = 10: V_{10} \approx 0.0025. For d = 100: essentially 0.

Almost all the volume of a high-dimensional cube is in its corners, not near its centre. A uniform distribution over the hypercube concentrates most of its mass far from the centre - which means nearest-neighbour methods based on Euclidean distance break down.

E.2 Gaussian Concentration in High Dimensions

For \mathbf{z} \sim \mathcal{N}(0, I_d) (a standard Gaussian in d dimensions):

\mathbb{E}[\|\mathbf{z}\|^2] = d, \quad \operatorname{Var}(\|\mathbf{z}\|^2) = 2d

By the law of large numbers, \|\mathbf{z}\|^2/d \to 1 as d \to \infty: the mass concentrates on the sphere of radius \sqrt{d}. This means:

  • High-dimensional Gaussian samples are approximately uniformly distributed on the \sqrt{d}-sphere
  • Two independent samples satisfy \mathbf{z}_1 \cdot \mathbf{z}_2 / (\|\mathbf{z}_1\|\|\mathbf{z}_2\|) \approx 0 (random high-dimensional vectors are nearly orthogonal)

For AI: This explains why random embedding matrices in transformers initialised with \mathcal{N}(0, 1/d) produce approximately orthogonal row vectors in high dimensions - a useful inductive bias for attention.
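Both phenomena can be verified in a few lines (dimension, seed, and tolerances are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 10_000
z1 = rng.standard_normal(d)
z2 = rng.standard_normal(d)

# ||z||^2 / d concentrates around 1 (standard deviation sqrt(2/d) ~ 0.014 here) ...
assert abs(np.dot(z1, z1) / d - 1.0) < 0.1

# ... and independent samples are nearly orthogonal.
cos = np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))
assert abs(cos) < 0.05

print(np.dot(z1, z1) / d, cos)
```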

These high-dimensional considerations are fully developed in Section03-Joint-Distributions (multivariate Gaussian) and Section05-Concentration-Inequalities (concentration of measure).


Appendix F: Information Theory Preview - Entropy and KL Divergence

This appendix previews two information-theoretic quantities that appear constantly in ML loss functions. The full derivations using the expectation operator belong in Section04 (Expectation and Moments) and Chapter 9 (Information Theory).

F.1 Entropy: Measuring Uncertainty

The Shannon entropy of a discrete random variable X with PMF p is:

H(X) = -\sum_{x} p(x) \log p(x)

Entropy measures the average "surprise" or uncertainty in XX. Higher entropy = more spread-out distribution = more uncertainty.

Key values:

  • Bernoulli(p): H = -p\log p - (1-p)\log(1-p). Maximum H = \log 2 \approx 0.693 nats at p = 0.5 (fair coin). Zero when p = 0 or p = 1 (certain outcome).
  • Uniform on n values: H = \log n - this is the maximum entropy distribution for a discrete variable with n outcomes.
  • Deterministic: H = 0 - no uncertainty.

For continuous random variables, the differential entropy is:

h(X) = -\int p(x) \log p(x) \, dx

The Gaussian \mathcal{N}(\mu, \sigma^2) has differential entropy h = \frac{1}{2}\log(2\pi e \sigma^2); among continuous distributions with fixed variance, the Gaussian is the maximum entropy distribution.

For AI: The cross-entropy loss for classification is H(y, \hat{p}) = -\sum_c y_c \log \hat{p}_c. When y is a one-hot label, this equals -\log \hat{p}_y - the negative log probability the model assigned to the correct class.

F.2 KL Divergence: Measuring Distribution Distance

The Kullback-Leibler divergence from distribution q to distribution p is:

D_{\mathrm{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_p\left[\log \frac{p(X)}{q(X)}\right]

Properties (proved using Jensen's inequality in Appendix D):

  1. D_{\mathrm{KL}}(p \| q) \geq 0 (non-negativity; follows from the convexity of -\log)
  2. D_{\mathrm{KL}}(p \| q) = 0 \iff p = q (equals zero only when the distributions are identical)
  3. D_{\mathrm{KL}}(p \| q) \neq D_{\mathrm{KL}}(q \| p) in general (asymmetric - not a metric)

Decomposition: cross-entropy = entropy + KL divergence:

H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)

Since H(p) is fixed when p is the true label distribution, minimising cross-entropy loss is equivalent to minimising D_{\mathrm{KL}}(p \| q) - pushing the model's distribution q toward the true distribution p.

For AI: VAE training minimises the ELBO, which includes a D_{\mathrm{KL}}(q_\phi(z|x) \| p(z)) regularisation term. RLHF preference modelling computes KL penalties between the fine-tuned and reference policy distributions.
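The decomposition and the properties above can be verified numerically; a minimal sketch with two hand-picked distributions:

```python
import numpy as np

def entropy(p):
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    return float(-np.sum(p * np.log(q)))

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

# H(p, q) = H(p) + KL(p || q)
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12
assert kl(p, q) > 0 and abs(kl(p, p)) < 1e-12   # non-negative, zero iff equal
assert abs(kl(p, q) - kl(q, p)) > 1e-6           # asymmetric
```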

-> Full treatment: Chapter 9 - Information Theory

F.3 Negative Log-Likelihood Connects Entropy and Probability

For a dataset \{x_1, \ldots, x_n\} drawn i.i.d. from p, the negative log-likelihood under model q_\theta is:

-\log L(\theta) = -\sum_{i=1}^n \log q_\theta(x_i)

By the law of large numbers (Section06 preview), as n \to \infty:

-\frac{1}{n}\log L(\theta) \to \mathbb{E}_p[-\log q_\theta(X)] = H(p, q_\theta)

Minimising NLL is therefore equivalent to minimising cross-entropy, which is equivalent to minimising D_{\mathrm{KL}}(p \| q_\theta). Maximum likelihood estimation is KL minimisation.


Appendix G: The Exponential Family - A Preview

Almost every distribution in Section02 belongs to a unified family called the exponential family. Understanding this family's structure explains why Gaussian, Bernoulli, Poisson, and many other distributions share the same algorithmic properties in ML.

G.1 Exponential Family Form

A distribution belongs to the exponential family if its PDF/PMF can be written as:

p(x; \eta) = h(x) \exp\left(\eta^\top T(x) - A(\eta)\right)

where:

  • \eta \in \mathbb{R}^k - natural parameters (the parameterisation used in optimisation)
  • T(x) \in \mathbb{R}^k - sufficient statistics (all information about \eta in the data is captured by T(x))
  • A(\eta) = \log \int h(x)\exp(\eta^\top T(x)) \, dx - log-partition function (ensures normalisation)
  • h(x) - base measure (does not depend on \eta)

G.2 Standard Distributions as Exponential Family Members

Distribution                   Natural param \eta      Sufficient stat T(x)   Log-partition A(\eta)
Bernoulli(p)                   \log\frac{p}{1-p}       x                      \log(1 + e^\eta)
Gaussian(\mu, fixed \sigma^2)  \mu/\sigma^2            x                      \eta^2\sigma^2/2
Poisson(\lambda)               \log \lambda            x                      e^\eta
Exponential(\lambda)           -\lambda                x                      -\log(-\eta)

The Bernoulli natural parameter \eta = \log(p/(1-p)) is the log-odds or logit. The logistic sigmoid \sigma(\eta) = e^\eta/(1+e^\eta) is its inverse, recovering p from \eta; softmax is the multivariate generalisation. This is why logistic regression outputs are Bernoulli exponential family probabilities.

G.3 Why the Log-Partition Function Matters

The log-partition function A(\eta) is convex, and its derivatives give the cumulants:

\nabla_\eta A(\eta) = \mathbb{E}_{p_\eta}[T(X)] \quad \text{(mean of the sufficient statistics)}

\nabla^2_\eta A(\eta) = \operatorname{Cov}_{p_\eta}[T(X)] \quad \text{(covariance of the sufficient statistics)}

This means computing moments of exponential family distributions reduces to differentiating A(\eta) - a powerful computational shortcut.

For AI: Softmax attention uses the log-partition function. The logsumexp operation \log \sum_j e^{a_j} is the log-partition function of a categorical distribution over query-key matches. The numerical stability trick \log \sum_j e^{a_j - \max_k a_k} + \max_k a_k exploits properties of A(\eta).
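The stability trick can be demonstrated directly (the logit values are chosen to overflow naive exponentiation in float64):

```python
import numpy as np

def logsumexp(a):
    """Numerically stable log-partition function of a categorical with logits a."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

a = np.array([1000.0, 1001.0, 1002.0])

with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(a)))  # exp(1000) overflows: result is inf
stable = logsumexp(a)                   # finite

assert np.isinf(naive) and np.isfinite(stable)
# Shifting all logits shifts the log-partition by the same amount.
assert abs((stable - 1000.0) - logsumexp(a - 1000.0)) < 1e-12
```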

-> Full treatment: Section02-Common-Distributions


Appendix H: Extended Worked Examples

H.1 The Monty Hall Problem - Conditional Probability in Action

Setup: A car is behind one of three doors. You choose Door 1. The host (who knows where the car is) opens Door 3, revealing a goat. Should you switch to Door 2?

Solution using conditional probability:

Let C_i = the event that the car is behind Door i, and O_3 = the event that the host opens Door 3.

Prior: P(C_1) = P(C_2) = P(C_3) = 1/3.

Host strategy: if the car is at Door 1, the host opens Door 2 or 3 with equal probability 1/2; if the car is at Door 2, the host must open Door 3 (probability 1); if the car is at Door 3, the host must open Door 2 (so Door 3 is opened with probability 0).

P(O_3 \mid C_1) = 1/2, \quad P(O_3 \mid C_2) = 1, \quad P(O_3 \mid C_3) = 0

By Bayes' theorem:

P(C_1 \mid O_3) = \frac{P(O_3 \mid C_1)P(C_1)}{P(O_3)} = \frac{(1/2)(1/3)}{1/6 + 1/3 + 0} = \frac{1/6}{1/2} = \frac{1}{3}

P(C_2 \mid O_3) = \frac{P(O_3 \mid C_2)P(C_2)}{P(O_3)} = \frac{(1)(1/3)}{1/2} = \frac{2}{3}

Conclusion: Switching wins with probability 2/3. Staying wins with probability 1/3.

Intuition: The host's action is not random - they always reveal a goat. This action transfers probability mass from Door 3 to Door 2 (given your initial choice was Door 1). The host's knowledge changes the posterior.

For AI: This illustrates why naive probabilistic reasoning fails when the observation process is not independent of the hypothesis. In model evaluation, the selection of which test examples to report affects the posterior over model quality - a key issue in benchmark gaming.
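A simulation confirms the posterior probabilities (the host's random tie-breaking follows the strategy stated above; trial count and seed are arbitrary):

```python
import random

random.seed(0)
trials = 100_000
stay_wins = switch_wins = 0

for _ in range(trials):
    car = random.randrange(3)          # car placed uniformly behind door 0, 1, or 2
    pick = 0                           # you always pick Door 1 (index 0)
    goats = [d for d in (1, 2) if d != car]
    opened = random.choice(goats)      # host opens a goat door, never your pick or the car
    switched = 3 - pick - opened       # the remaining unopened door
    stay_wins += (pick == car)
    switch_wins += (switched == car)

print(stay_wins / trials, switch_wins / trials)  # near 1/3 and 2/3
```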

H.2 The Prosecutor's Fallacy - P(A|B) \neq P(B|A)

A DNA test for a rare genetic marker has a false positive rate of 0.1\% and a false negative rate of 0\%. The marker is present in 0.01\% of the population. A defendant tests positive. A prosecutor claims: "The probability of a false positive is only 0.1\%, so there is a 99.9\% chance the defendant is guilty."

What's wrong: The prosecutor is claiming P(\text{innocent} \mid \text{positive}) = 0.001, but what they have calculated is P(\text{positive} \mid \text{innocent}) = 0.001.

By Bayes' theorem:

  • P(\text{guilty}) = 0.0001 (base rate)
  • P(\text{positive} \mid \text{guilty}) = 1 (no false negatives)
  • P(\text{positive} \mid \text{innocent}) = 0.001 (false positive rate)

P(\text{guilty} \mid \text{positive}) = \frac{1 \cdot 0.0001}{1 \cdot 0.0001 + 0.001 \cdot 0.9999} \approx \frac{0.0001}{0.0011} \approx 9.1\%

The actual probability of guilt given the positive test is only about 9\%, not 99.9\%.

For AI: This fallacy appears in anomaly detection. A model with 99.9\% accuracy on a test set may have only 50\% precision on a rare-class detection task if the base rate of the rare class is 0.1\%.
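The same Bayes computation in code (a direct transcription of the numbers above):

```python
# Posterior P(guilty | positive) via Bayes' theorem.
prior = 0.0001   # P(guilty): base rate of the marker
tpr = 1.0        # P(positive | guilty): no false negatives
fpr = 0.001      # P(positive | innocent): false positive rate

posterior = tpr * prior / (tpr * prior + fpr * (1 - prior))
print(posterior)  # about 0.091 - nowhere near 0.999

assert abs(posterior - 0.0909) < 0.001
```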

H.3 Coupon Collector Problem - Expected Waiting Times

There are n distinct coupons. Each cereal box contains one coupon uniformly at random. How many boxes must you buy (in expectation) to collect all n coupons?

Let T_k = the number of boxes needed to get the k-th new coupon (given you already have k-1 distinct coupons). At that point, the probability of drawing a new coupon is (n-k+1)/n, so T_k \sim \text{Geometric}\left(\frac{n-k+1}{n}\right).

\mathbb{E}[T_k] = \frac{n}{n-k+1}

Total boxes: T = T_1 + T_2 + \cdots + T_n, so by linearity of expectation:

\mathbb{E}[T] = \sum_{k=1}^{n} \frac{n}{n-k+1} = n \sum_{j=1}^{n} \frac{1}{j} = n H_n

where H_n = 1 + 1/2 + 1/3 + \cdots + 1/n \approx \ln n + 0.577 is the n-th harmonic number.

For n = 365 (one coupon per day of the year): \mathbb{E}[T] = 365 \cdot H_{365} \approx 365 \cdot 6.48 \approx 2365 boxes.

For AI: The coupon collector framework models how many steps a stochastic optimizer sampling with replacement needs before it has seen every training example at least once: about n \ln n steps in expectation for n examples. This is one reason shuffled epochs (sampling without replacement, which cover the dataset in exactly n steps) are preferred over i.i.d. sampling.
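A simulation against the n H_n formula (trial count and seed are arbitrary; expect a few percent of sampling noise):

```python
import random

def boxes_to_collect(n: int, rng: random.Random) -> int:
    """Boxes bought until all n coupon types have been seen."""
    seen, boxes = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        boxes += 1
    return boxes

rng = random.Random(0)
n, trials = 365, 200
avg = sum(boxes_to_collect(n, rng) for _ in range(trials)) / trials
expected = n * sum(1 / j for j in range(1, n + 1))  # n * H_n, about 2365 for n = 365

print(avg, expected)
assert abs(avg - expected) / expected < 0.15
```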


Appendix I: Characteristic Functions and Moment Generating Functions - Preview

I.1 The Moment Generating Function (MGF)

The moment generating function of a random variable X is:

M_X(t) = \mathbb{E}[e^{tX}] = \int e^{tx} p(x) \, dx \quad \text{(when it exists)}

Why "moment generating"? Differentiating and evaluating at t = 0:

M_X^{(k)}(0) = \mathbb{E}[X^k]

The k-th derivative of the MGF at zero is the k-th raw moment of X.

For Gaussian X \sim \mathcal{N}(\mu, \sigma^2):

M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)

This elegant form is why the Gaussian is easy to work with analytically: all moments can be extracted by differentiating a simple exponential.

Key property (MGFs of sums of independent variables): If X \perp Y:

M_{X+Y}(t) = M_X(t) \cdot M_Y(t)

This product rule for MGFs is the probabilistic analogue of the convolution theorem, and it is the key tool for proving that sums of independent Gaussians are Gaussian and that sums of independent Poissons are Poisson.
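The product rule is easy to verify for Gaussians using the closed-form MGF above, together with a finite-difference check that M_X'(0) = \mathbb{E}[X] (the parameter values, test points, and step size are arbitrary choices):

```python
import math

def gaussian_mgf(t: float, mu: float, var: float) -> float:
    """M_X(t) = exp(mu*t + var*t^2/2) for X ~ N(mu, var)."""
    return math.exp(mu * t + var * t * t / 2)

mu1, v1 = 1.0, 2.0
mu2, v2 = -0.5, 0.5

# If X and Y are independent Gaussians, X + Y ~ N(mu1 + mu2, v1 + v2),
# and M_{X+Y}(t) equals the product M_X(t) * M_Y(t).
for t in (-1.0, -0.3, 0.7, 1.5):
    lhs = gaussian_mgf(t, mu1 + mu2, v1 + v2)
    rhs = gaussian_mgf(t, mu1, v1) * gaussian_mgf(t, mu2, v2)
    assert abs(lhs - rhs) < 1e-9 * lhs

# First derivative at 0 recovers the mean: M'(0) = E[X].
h = 1e-6
deriv = (gaussian_mgf(h, mu1, v1) - gaussian_mgf(-h, mu1, v1)) / (2 * h)
assert abs(deriv - mu1) < 1e-4
```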

I.2 The Characteristic Function

The characteristic function of X is \varphi_X(t) = \mathbb{E}[e^{itX}] (the Fourier transform of the density). Unlike the MGF, the characteristic function always exists and uniquely determines the distribution. It is the tool used in the proof of the Central Limit Theorem.

-> Full treatment: Section04-Expectation-and-Moments


Appendix J: Simulation and Monte Carlo Methods - A Preview

A recurring theme throughout this chapter is that computing exact probabilities analytically is often hard or impossible. Monte Carlo simulation is the practical alternative: approximate \mathbb{E}[g(X)] by averaging over many samples.

J.1 The Monte Carlo Principle

If X_1, X_2, \ldots, X_n are i.i.d. samples from p(x), then by the Law of Large Numbers (Section06):

\frac{1}{n}\sum_{i=1}^n g(X_i) \xrightarrow{n \to \infty} \mathbb{E}_p[g(X)]

The Monte Carlo estimator \hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n g(X_i) is unbiased (\mathbb{E}[\hat{\mu}_n] = \mathbb{E}_p[g(X)]) and has variance \operatorname{Var}(g(X))/n, so the standard error is \sigma/\sqrt{n} regardless of dimension.

Why this matters: Unlike grid-based numerical integration (which requires exponentially many grid points in high dimensions), Monte Carlo converges at rate O(1/\sqrt{n}) independent of dimension. This makes it the tool of choice for high-dimensional integrals - exactly the setting of modern ML.

J.2 Estimating Probabilities

Since P(A) = \mathbb{E}[\mathbf{1}_A(X)], we can estimate any probability:

\hat{P}(A) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[X_i \in A]

Example: Estimate P(Z > 1.96) for Z \sim \mathcal{N}(0,1) by sampling:

  1. Draw Z_1, \ldots, Z_n \sim \mathcal{N}(0,1)
  2. Count the fraction where Z_i > 1.96
  3. The answer converges to 0.025 (the true tail probability)
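The three steps translate directly to NumPy (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000
z = rng.standard_normal(n)        # step 1: draw Z_1, ..., Z_n
est = (z > 1.96).mean()           # step 2: fraction exceeding 1.96

print(est)                        # step 3: close to the true tail probability 0.025
assert abs(est - 0.025) < 0.001   # standard error here is about 0.00016
```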

J.3 Sampling from Basic Distributions

Inverse CDF method (Universality of the Uniform): If U \sim \text{Uniform}(0,1) and F is a CDF, then F^{-1}(U) has distribution F. This is because:

P(F^{-1}(U) \leq x) = P(U \leq F(x)) = F(x)

Consequences:

  • Sample Exponential(\lambda): X = -\frac{1}{\lambda}\log(U) where U \sim \text{Uniform}(0,1)
  • Sample Geometric(p): X = \lceil \log(U) / \log(1-p) \rceil
  • Sample Bernoulli(p): X = \mathbf{1}[U \leq p]

The Box-Muller transform generates Gaussian samples from uniforms:

Z_1 = \sqrt{-2\log U_1}\cos(2\pi U_2), \quad Z_2 = \sqrt{-2\log U_1}\sin(2\pi U_2)

where U_1, U_2 \sim \text{Uniform}(0,1) i.i.d., and Z_1, Z_2 \sim \mathcal{N}(0,1) i.i.d.
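Both samplers can be sketched and checked in a few lines (seed, rate \lambda, and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500_000

# Inverse-CDF sampling of Exponential(lam): X = -log(U)/lam
lam = 2.0
u = rng.uniform(size=n)
x = -np.log(u) / lam
assert abs(x.mean() - 1 / lam) < 0.01   # E[X] = 1/lam

# Box-Muller: two independent uniforms -> two independent standard normals
u1, u2 = rng.uniform(size=n), rng.uniform(size=n)
r = np.sqrt(-2 * np.log(u1))
z1 = r * np.cos(2 * np.pi * u2)
z2 = r * np.sin(2 * np.pi * u2)
assert abs(z1.mean()) < 0.01 and abs(z1.std() - 1.0) < 0.01
assert abs(np.mean(z1 * z2)) < 0.01     # uncorrelated
```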

J.4 Rejection Sampling

When the inverse CDF has no closed form, use rejection sampling: to sample from p(x), find a proposal q(x) and a constant M such that p(x) \leq M q(x) everywhere.

Algorithm:

  1. Sample Y \sim q
  2. Accept Y with probability p(Y)/(Mq(Y)); otherwise reject and repeat

The accepted samples have distribution p. The acceptance rate is 1/M, so choose M as small as possible.

For AI: Rejection sampling underlies early language model decoding strategies. Modern methods like nucleus sampling (top-p) and temperature sampling directly manipulate the token probability distribution - understanding them requires the CDF and quantile framework from Section7.
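A minimal sketch of the algorithm, targeting the Beta(2,2) density p(x) = 6x(1-x) with a Uniform(0,1) proposal and M = 1.5 (our choice of target; any bounded density on [0,1] works the same way):

```python
import numpy as np

rng = np.random.default_rng(11)

def target(x):
    return 6.0 * x * (1.0 - x)     # Beta(2,2) density on [0,1]; its maximum is 1.5

M = 1.5                             # p(x) <= M * q(x) with q = Uniform(0,1)

samples, proposals = [], 0
while len(samples) < 100_000:
    y = rng.uniform()               # 1. sample from the proposal q
    proposals += 1
    if rng.uniform() < target(y) / M:
        samples.append(y)           # 2. accept with probability p(y)/(M q(y))

samples = np.array(samples)
print(len(samples) / proposals)     # acceptance rate, close to 1/M = 2/3
assert abs(samples.mean() - 0.5) < 0.005   # Beta(2,2) mean = 1/2
assert abs(samples.var() - 0.05) < 0.005   # Beta(2,2) variance = 1/20
```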

J.5 Monte Carlo Integration of Normalisation Constants

Many probabilistic models have densities of the form p(x) = \tilde{p}(x)/Z, where computing the normalisation constant Z = \int \tilde{p}(x) \, dx is intractable. Examples:

  • Bayesian posterior: Z = \int \text{likelihood}(x) \cdot \text{prior}(x) \, dx
  • Boltzmann distribution: Z = \sum_x e^{-E(x)/T} over exponentially many states
  • Normalising flow: Z = 1 by construction, but evaluating the density requires a change-of-variables Jacobian

Monte Carlo methods (particularly MCMC, developed in Section07) are the primary tools for inference in these models.

-> Full treatment: Section07-Markov-Chains (MCMC); Section06-Stochastic-Processes (LLN and CLT)


Appendix K: Probability Distributions in PyTorch and NumPy

Modern ML frameworks have built-in probability distribution objects. Understanding the mathematical definitions in this section makes framework APIs immediately transparent.

K.1 NumPy / SciPy Distributions

import numpy as np
from scipy import stats

# Bernoulli(p=0.3)
rv = stats.bernoulli(p=0.3)
rv.pmf(1)          # P(X=1) = 0.3
rv.cdf(0)          # P(X <= 0) = 0.7
rv.rvs(size=100)   # 100 random samples

# Gaussian N(mu=2, sigma=1.5)
rv = stats.norm(loc=2, scale=1.5)
rv.pdf(2.0)        # f(2.0) = 1/(1.5*sqrt(2*pi)) ~= 0.266
rv.cdf(2.0)        # Phi(0) = 0.5
rv.ppf(0.975)      # inverse CDF (quantile): z such that P(X <= z) = 0.975

# Uniform(a=0, b=1)
rv = stats.uniform(loc=0, scale=1)

The loc and scale parameters implement the location-scale transform X = \text{loc} + \text{scale} \cdot Z, where Z is the standard form of the distribution.

K.2 PyTorch torch.distributions

import torch
from torch.distributions import Bernoulli, Normal, Uniform, Categorical

# Bernoulli - for binary classification output
dist = Bernoulli(probs=torch.tensor(0.3))
dist.log_prob(torch.tensor(1.0))  # log P(X=1) = log(0.3) ≈ -1.204
dist.sample()                      # sample from {0, 1}
dist.entropy()                     # H(X) ≈ 0.611 nats

# Normal - for regression, VAE latent, etc.
dist = Normal(loc=torch.tensor(0.0), scale=torch.tensor(1.0))
dist.log_prob(torch.tensor(1.96))  # log f(1.96) ≈ -2.84
dist.cdf(torch.tensor(1.96))       # ≈ 0.975

# Categorical - for language model outputs
logits = torch.tensor([2.0, 1.0, 0.5])  # unnormalised log probabilities
dist = Categorical(logits=logits)
dist.probs                              # softmax probabilities
dist.log_prob(torch.tensor(0))          # log P(X=0)
dist.entropy()                          # entropy of the distribution

K.3 Reparameterisation - The Cornerstone of VAE Training

The reparameterisation trick lets us differentiate through a sampling operation:

Instead of sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ directly (which has no gradient w.r.t. $\mu, \sigma$), write:

$$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$$

Now $\partial z / \partial \mu = 1$ and $\partial z / \partial \sigma = \epsilon$ - gradients flow through.

# Without reparameterisation (WRONG for training)
z = Normal(mu, sigma).sample()  # no gradient

# With reparameterisation (CORRECT)
eps = torch.randn_like(sigma)   # eps ~ N(0,1)
z = mu + sigma * eps             # z ~ N(mu, sigma^2), gradients flow

This technique requires understanding that $\mathcal{N}(\mu, \sigma^2)$ and $\mu + \sigma \mathcal{N}(0,1)$ have the same distribution - a direct consequence of the location-scale property developed in Section 7.3.

For AI: VAE encoders output $(\mu, \log\sigma^2)$ and use this trick to sample the latent code $z$ while maintaining a differentiable path back to the encoder parameters. The same idea extends to normalising flows and diffusion model sampling.
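The trick can be sanity-checked without any autodiff framework. The sketch below uses the toy objective $f(z) = z^2$, chosen because $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ gives an analytic gradient to compare against:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.2
eps = rng.standard_normal(1_000_000)

# Reparameterised sample z = mu + sigma*eps, objective f(z) = z^2.
# Pathwise (reparameterisation) gradient: df/dmu = f'(z) * dz/dmu = 2z * 1
z = mu + sigma * eps
grad_mu_mc = (2 * z).mean()   # Monte Carlo pathwise estimate

# Analytic check: E[z^2] = mu^2 + sigma^2, so d/dmu E[z^2] = 2*mu = 1.0
print(grad_mu_mc)             # ≈ 1.0
```

This is exactly what `z = mu + sigma * eps` buys you in the PyTorch snippet above: the gradient of an expectation becomes an expectation of pathwise gradients, which a mini-batch can estimate.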


Appendix L: Measure Theory - The Foundations Beneath the Formalism

This section develops probability from elementary axioms without requiring measure theory. But for completeness, this appendix sketches where the measure-theoretic foundations live and when you need them.

L.1 Why Elementary Probability Is Insufficient

The elementary theory (as developed in Section 2) works when the sample space $\Omega$ is discrete or when we can write explicit PDF formulas. It breaks for:

  1. Mixed distributions: a random variable that is sometimes discrete (e.g., exactly zero) and sometimes continuous (e.g., positive real) - arises in zero-inflated models and ReLU activations.

  2. Limits: taking limits of sequences of random variables requires knowing what "$X_n \to X$" means precisely - there are multiple notions of convergence ($L^2$, almost sure, in probability, in distribution) that are not cleanly expressible without measure theory.

  3. Conditional expectation on events of probability zero: $\mathbb{E}[Y \mid X = x]$ when $P(X = x) = 0$ (any continuous $X$) requires the Radon-Nikodym theorem.

  4. Change-of-measure / importance sampling: $\mathbb{E}_p[f(X)] = \mathbb{E}_q\left[\frac{p(X)}{q(X)}f(X)\right]$ when $p$ and $q$ are continuous requires the Radon-Nikodym derivative.
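Item 4 can be demonstrated numerically. The sketch below estimates $\mathbb{E}_p[X^2]$ under $p = \mathcal{N}(0,1)$ using samples from a different proposal $q = \mathcal{N}(1, 2^2)$; the specific densities are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def pdf(x, mu, sigma):
    """Gaussian density, written out explicitly."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Estimate E_p[X^2] with p = N(0,1), sampling from q = N(1,4)
x = rng.normal(1.0, 2.0, size=1_000_000)   # samples from q
w = pdf(x, 0.0, 1.0) / pdf(x, 1.0, 2.0)    # Radon-Nikodym weights p/q
estimate = np.mean(w * x**2)

print(estimate)   # ≈ E_p[X^2] = 1.0
```

The weight ratio $p(x)/q(x)$ is the Radon-Nikodym derivative in concrete form; the estimator is unbiased as long as $q$ puts mass everywhere $p$ does.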

L.2 The Measure-Theoretic Setup

A probability space is a triple $(\Omega, \mathcal{F}, P)$ where:

  • $\Omega$ - sample space (any set)
  • $\mathcal{F}$ - sigma-algebra: a collection of subsets of $\Omega$ closed under complement and countable union (the measurable events)
  • $P: \mathcal{F} \to [0,1]$ - probability measure satisfying Kolmogorov's axioms

A random variable is a measurable function $X: \Omega \to \mathbb{R}$, meaning $\{X \leq x\} \in \mathcal{F}$ for all $x$ (so that $P(X \leq x)$ is well-defined).

The distribution of $X$ is the induced measure $\mu_X(B) = P(X^{-1}(B))$ for Borel sets $B \subseteq \mathbb{R}$.

The Lebesgue integral replaces the Riemann integral and is defined for any measurable function - this is what makes $\mathbb{E}[X] = \int X \, dP$ well-defined even for random variables with no PDF.

L.3 The Lebesgue-Stieltjes Integral Unifies PDF and PMF

For any random variable with CDF $F$, the expectation can be written as a Stieltjes integral:

$$\mathbb{E}[g(X)] = \int g(x) \, dF(x)$$

This notation works for both discrete and continuous cases:

  • Continuous: $dF(x) = f(x)\,dx$ (use the Riemann integral)
  • Discrete: $dF(x) = \sum_k p_k \,\delta(x - x_k)$ (use a sum)
  • Mixed: a combination of both

For AI: Importance sampling in policy gradient methods (PPO, TRPO) uses the Radon-Nikodym derivative implicitly when bounding the ratio $\pi_\theta(a \mid s)/\pi_{\theta_{\text{old}}}(a \mid s)$ - this is the change-of-measure formula at work.
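The mixed case can be checked by simulation. Below, $X = \max(0, Z)$ with $Z \sim \mathcal{N}(0,1)$ (the ReLU example from L.1) has an atom of mass $1/2$ at zero plus a continuous part, and the Stieltjes expectation works out to $\mathbb{E}[X] = 0 \cdot \tfrac{1}{2} + \int_0^\infty x\,\varphi(x)\,dx = 1/\sqrt{2\pi}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# X = ReLU(Z), Z ~ N(0,1): mixed distribution with P(X = 0) = 1/2 (atom)
# plus a continuous part on (0, inf)
z = rng.standard_normal(1_000_000)
x = np.maximum(z, 0.0)

print(x.mean())          # ≈ 1/sqrt(2*pi) ≈ 0.399
print((x == 0).mean())   # ≈ 0.5: the discrete atom at zero
```

Neither a pure PMF nor a pure PDF describes $X$, yet the expectation is perfectly well-defined - the point of the unified $dF$ notation.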

The full measure-theoretic treatment is beyond the scope of this curriculum. The reader interested in rigorous foundations should consult Billingsley's Probability and Measure or Durrett's Probability: Theory and Examples.


Appendix M: Classic Probability Paradoxes and Their Resolutions

M.1 Bertrand's Paradox - Why "Uniform at Random" Needs Specification

Problem: A chord is drawn "at random" in a unit circle. What is the probability the chord is longer than the side of the inscribed equilateral triangle (length $\sqrt{3}$)?

The answer depends on how you sample "at random":

Method 1 - Random endpoints: Choose two points uniformly on the circle's circumference. Probability = $1/3$.

Method 2 - Random midpoint: Choose a point uniformly inside the circle as the chord's midpoint. Probability = $1/4$.

Method 3 - Random radius and distance: Choose a radius uniformly; choose the midpoint's distance from the centre uniformly on $[0,1]$. Probability = $1/2$.

All three methods are geometrically natural, yet give different answers.

Resolution: The phrase "uniform at random" is not well-defined without specifying the sample space and the probability measure on it. The three methods define different probability spaces $(\Omega, \mathcal{F}, P)$ - they are asking different questions.

Lesson for ML: When we say "sample a random minibatch" or "sample a random initialisation," we must specify the distribution explicitly. The implicit assumption (uniform distribution) can lead to different algorithmic properties depending on whether we sample with or without replacement, sample tokens or sequences, etc.
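The three answers can be verified by simulation. This sketch implements each sampling scheme directly (Method 1 fixes one endpoint, which is valid by rotational symmetry; a uniform point in the disk has radius distributed as the square root of a uniform variable):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
threshold = np.sqrt(3)   # side of the inscribed equilateral triangle

# Method 1: random endpoints -> chord length 2*sin(delta/2)
delta = rng.uniform(0, 2 * np.pi, n)
m1 = (2 * np.sin(delta / 2) > threshold).mean()    # -> 1/3

# Method 2: random midpoint in the disk -> chord length 2*sqrt(1 - d^2)
d2 = np.sqrt(rng.uniform(0, 1, n))                 # radius of a uniform point
m2 = (2 * np.sqrt(1 - d2**2) > threshold).mean()   # -> 1/4

# Method 3: midpoint distance from centre uniform on [0, 1]
d3 = rng.uniform(0, 1, n)
m3 = (2 * np.sqrt(1 - d3**2) > threshold).mean()   # -> 1/2

print(round(m1, 3), round(m2, 3), round(m3, 3))
```

The only difference between Methods 2 and 3 is the distribution of the midpoint's distance from the centre - a vivid reminder that "uniform" must name a measure, not a vague intuition.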

M.2 The Two-Envelope Paradox

Two envelopes each contain money; one holds twice the amount of the other. You pick one, see $X$ inside, and are offered the chance to switch. You reason: the other envelope contains either $X/2$ (probability $1/2$) or $2X$ (probability $1/2$). Expected value of switching: $\frac{1}{2}(X/2) + \frac{1}{2}(2X) = \frac{5X}{4} > X$. So you should always switch - but the same reasoning applies after switching back!

Resolution: The calculation is flawed because the events "the other envelope has $2X$" and "the other envelope has $X/2$" cannot both be assigned probability $1/2$ simultaneously for a fixed $X$. The reasoning treats $X$ as fixed and random at the same time. A rigorous Bayesian analysis shows no advantage to switching when the prior over the smaller amount is proper (integrable).

Lesson for ML: The paradox arises from mixing conditional and unconditional reasoning. In ML, this appears in off-policy evaluation (evaluating policy $\pi_{\text{new}}$ using data from $\pi_{\text{old}}$) - naively computing expected reward leads to analogous double-counting errors.
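A quick simulation with a proper prior (an illustrative Exponential(1) prior on the smaller amount) shows that switching confers no advantage in expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Proper prior on the smaller amount S; envelopes hold (S, 2S)
s = rng.exponential(1.0, n)
envelopes = np.stack([s, 2 * s])
pick = rng.integers(0, 2, n)                 # choose an envelope uniformly

keep = envelopes[pick, np.arange(n)]         # value if you keep
switch = envelopes[1 - pick, np.arange(n)]   # value if you switch

print(keep.mean(), switch.mean())   # both ≈ 1.5 * E[S] = 1.5
```

Both strategies have expected value $\frac{3}{2}\mathbb{E}[S]$; the apparent $5X/4$ gain evaporates once the prior is made explicit and the conditioning is done correctly.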

M.3 The St. Petersburg Paradox - When Expected Value Is Infinite

A game: flip a fair coin repeatedly. If the first head appears on flip $k$, you win $2^k$ dollars. Expected value:

$$\mathbb{E}[\text{winnings}] = \sum_{k=1}^\infty 2^k \cdot \frac{1}{2^k} = \sum_{k=1}^\infty 1 = \infty$$

Yet no rational person would pay more than ~$25 to play.

Resolution: Human decision-making is based on utility, not raw monetary value. If utility is $u(w) = \log(w)$ (a common assumption in economics), the expected utility is finite. This gives expected utility theory - the rational framework for decisions under uncertainty.

For AI: Reinforcement learning explicitly maximises expected return (reward), which can lead to high-variance, unstable policies if reward distributions have heavy tails. Techniques like reward normalisation, reward clipping (PPO), and distributional RL (C51, QR-DQN) are direct engineering responses to the St. Petersburg problem.
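A short simulation illustrates both halves of the resolution: the sample mean of the winnings never settles down (the mean is infinite), while the mean log-utility converges to $\mathbb{E}[\log 2^K] = \log 2 \cdot \mathbb{E}[K] = 2\log 2 \approx 1.386$ nats:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Flip until the first head: K ~ Geometric(1/2), winnings = 2^K
k = rng.geometric(0.5, n)
winnings = 2.0 ** k

print(winnings.mean())           # unstable: grows with n (infinite mean)
print(np.log(winnings).mean())   # ≈ 2*log(2) ≈ 1.386: finite log-utility
```

The heavy tail that destabilises the raw mean is the same phenomenon reward clipping and distributional RL are designed to tame.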


Appendix N: Summary of Key Inequalities and Bounds

This appendix collects all the inequalities introduced in this section and previewed from later sections, as a reference card.

Probability Bounds

| Inequality | Statement | When to Use |
| --- | --- | --- |
| Boole (union bound) | $P\!\left(\bigcup_i A_i\right) \leq \sum_i P(A_i)$ | When events may overlap; gives an upper bound |
| Inclusion-exclusion | $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ | Exact for two events |
| Bonferroni | $P\!\left(\bigcap_i A_i^c\right) \geq 1 - \sum_i P(A_i)$ | Lower bound on the "all good" probability |
| Fréchet | $\max(0, P(A)+P(B)-1) \leq P(A \cap B) \leq \min(P(A), P(B))$ | Bounds on an intersection without independence |

Moment-Based Bounds (from Section05)

| Inequality | Statement | Uses |
| --- | --- | --- |
| Markov | $P(X \geq t) \leq \mathbb{E}[X]/t$ for $X \geq 0$ | Only needs $\mathbb{E}[X]$ |
| Chebyshev | $P(\lvert X-\mu\rvert \geq t) \leq \sigma^2/t^2$ | Needs $\mathbb{E}[X]$ and $\operatorname{Var}(X)$ |
| Jensen | $g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)]$ for convex $g$ | Log-sum-exp, ELBO, entropy |
| Cauchy-Schwarz | $\lvert\mathbb{E}[XY]\rvert \leq \sqrt{\mathbb{E}[X^2]\,\mathbb{E}[Y^2]}$ | Covariance and correlation bounds |

Concentration Bounds (from Section05)

| Inequality | Statement | When to Use |
| --- | --- | --- |
| Hoeffding | $P(\bar{X}-\mu \geq t) \leq e^{-2n^2t^2/\sum_i(b_i-a_i)^2}$ for bounded $X_i \in [a_i,b_i]$ | i.i.d. bounded variables |
| Chernoff | $P(X \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3}$ for the Poisson binomial | Sums of independent Bernoullis |
| McDiarmid | If $f$ changes by at most $c_i$ in coordinate $i$: $P(f - \mathbb{E}[f] \geq t) \leq e^{-2t^2/\sum_i c_i^2}$ | General functions of independent variables |

For AI: Hoeffding's inequality directly gives PAC learning generalisation bounds: with $n$ i.i.d. training examples and loss values in $[0,1]$, the gap between empirical and expected loss satisfies $P(\lvert L_{\text{test}} - L_{\text{train}}\rvert \geq \varepsilon) \leq 2e^{-2n\varepsilon^2}$ for any fixed hypothesis.
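As a sanity check, this sketch compares the Hoeffding bound with the empirical tail probability for Bernoulli(0.5) samples (with $[a_i, b_i] = [0, 1]$ the bound simplifies to $e^{-2nt^2}$; it is valid but loose):

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 100, 0.1, 100_000

# X_i ~ Bernoulli(0.5) on [0,1]: Hoeffding gives P(mean - mu >= t) <= exp(-2*n*t^2)
samples = rng.random((trials, n)) < 0.5
empirical = (samples.mean(axis=1) - 0.5 >= t).mean()
bound = np.exp(-2 * n * t**2)

print(empirical, bound)   # empirical ≈ 0.03, bound ≈ 0.135
```

The gap between the empirical tail and the bound is typical: Hoeffding trades tightness for requiring nothing beyond boundedness.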


Appendix O: Sigma-Algebras and the Filtration Framework

This appendix provides a slightly more formal treatment of sigma-algebras for readers who want mathematical precision, without requiring full measure theory.

O.1 Why Sigma-Algebras Appear

The Kolmogorov axioms in Section 2 assume we know which subsets of $\Omega$ count as "events." For finite or countable $\Omega$, every subset is an event. For continuous $\Omega = \mathbb{R}$, we cannot assign probabilities to all subsets (Vitali sets are non-measurable), so we restrict attention to a collection $\mathcal{F}$ closed under the operations we need.

O.2 The Borel Sigma-Algebra

The Borel sigma-algebra $\mathcal{B}(\mathbb{R})$ is the smallest sigma-algebra containing all open intervals $(a, b)$. It includes:

  • All open and closed sets
  • All countable unions and intersections of intervals
  • Every half-line $(-\infty, x]$, so CDF events $\{X \leq x\}$ are measurable with respect to $\mathcal{B}(\mathbb{R})$

For practical purposes, every set you will encounter in ML is Borel-measurable.

O.3 Information and Filtrations

In sequential problems (reinforcement learning, time series, stochastic processes), we need to track how much information is available at each time step. A filtration is an increasing sequence of sigma-algebras:

$$\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$$

where $\mathcal{F}_t$ represents "all information available by time $t$."

A random variable $X_t$ is $\mathcal{F}_t$-measurable if its value is determined by the information in $\mathcal{F}_t$ - that is, you can compute $X_t$ from the history up to time $t$.

Example: In a sequential decision problem:

  • $\mathcal{F}_0 = \{\emptyset, \Omega\}$ - no information (only trivial events)
  • $\mathcal{F}_1 = \sigma(X_1)$ - the sigma-algebra generated by the first observation
  • $\mathcal{F}_t = \sigma(X_1, \ldots, X_t)$ - the sigma-algebra generated by observations up to time $t$

The Markov property (Section 07) states: $X_{t+1}$ depends on $\mathcal{F}_t$ only through $X_t$ (not the full history).

For AI: The attention mechanism's causal mask in GPT-style models enforces a filtration: token $t$ can only attend to tokens at positions $\leq t$. The mask ensures $\mathcal{F}_t$-measurability of each token prediction, making the autoregressive language model a valid probabilistic model of sequences.
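Structurally, such a causal mask is just a lower-triangular matrix. A minimal NumPy sketch (frameworks like PyTorch provide equivalents; masked entries are set to $-\infty$ before the softmax so they receive zero attention):

```python
import numpy as np

# Causal mask for sequence length 4: position t may attend only to positions <= t
T = 4
mask = np.tril(np.ones((T, T), dtype=bool))

# Apply to (here, all-zero) attention scores: disallowed entries -> -inf
scores = np.zeros((T, T))
scores[~mask] = -np.inf
attn = np.exp(scores)
attn /= attn.sum(axis=1, keepdims=True)   # row-wise softmax

print(attn[0])   # [1, 0, 0, 0]: token 0 sees only itself
print(attn[3])   # [0.25, 0.25, 0.25, 0.25]: token 3 sees the full history
```

Row $t$ of the attention matrix is exactly a conditional distribution over the information in $\mathcal{F}_t$: nothing from the future can leak in.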

O.4 The Generated Sigma-Algebra

For any random variable $X$, the generated sigma-algebra is:

$$\sigma(X) = \{X^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\} = \{\{\omega : X(\omega) \in B\} : B \in \mathcal{B}(\mathbb{R})\}$$

This is the smallest sigma-algebra that makes $X$ measurable - it contains exactly the information encoded in $X$.

For a discrete random variable: $\sigma(X) = \sigma(\{X = x_1\}, \{X = x_2\}, \ldots)$ - the partition of $\Omega$ induced by the values of $X$.

Key insight: Conditional expectation $\mathbb{E}[Y \mid X]$ is really $\mathbb{E}[Y \mid \sigma(X)]$ - the expectation of $Y$ given all the information encoded in $X$. This formulation (using sigma-algebras) is what makes conditional expectation rigorously defined even when $P(X = x) = 0$.


Appendix P: Probability Generating Functions

P.1 Definition

For a non-negative integer-valued random variable $X$ (taking values $0, 1, 2, \ldots$), the probability generating function (PGF) is:

$$G_X(z) = \mathbb{E}[z^X] = \sum_{k=0}^\infty p_k z^k$$

where $p_k = P(X = k)$.

The PGF converges for $|z| \leq 1$ and stores all probabilities as coefficients of a power series.

P.2 Recovering Probabilities and Moments

$$p_k = \frac{G_X^{(k)}(0)}{k!} \quad \text{($k$-th derivative at zero)}$$

$$\mathbb{E}[X] = G_X'(1), \quad \mathbb{E}[X(X-1)] = G_X''(1), \quad \operatorname{Var}(X) = G_X''(1) + G_X'(1) - [G_X'(1)]^2$$

P.3 PGFs of Common Distributions

| Distribution | PGF $G(z)$ |
| --- | --- |
| Bernoulli($p$) | $1-p+pz$ |
| Binomial($n,p$) | $(1-p+pz)^n$ |
| Poisson($\lambda$) | $e^{\lambda(z-1)}$ |
| Geometric($p$) | $pz/(1-(1-p)z)$ |

The Binomial PGF $(1-p+pz)^n$ is the $n$-fold product of Bernoulli PGFs - directly encoding that a Binomial is a sum of $n$ i.i.d. Bernoullis (by the product property for independent variables).

Product property: If $X \perp Y$: $G_{X+Y}(z) = G_X(z) \cdot G_Y(z)$.

This is the PGF analogue of the convolution theorem, and it gives a clean proof that the sum of independent Poisson($\lambda_1$) and Poisson($\lambda_2$) variables is Poisson($\lambda_1 + \lambda_2$):

$$G_{\text{Pois}(\lambda_1)}(z) \cdot G_{\text{Pois}(\lambda_2)}(z) = e^{\lambda_1(z-1)} \cdot e^{\lambda_2(z-1)} = e^{(\lambda_1+\lambda_2)(z-1)} = G_{\text{Pois}(\lambda_1+\lambda_2)}(z)
$$
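The same closure can be checked numerically: multiplying PGFs corresponds to convolving PMFs, so convolving the two Poisson PMFs should reproduce the PMF of Poisson($\lambda_1 + \lambda_2$). A SciPy sketch with illustrative rates:

```python
import numpy as np
from scipy import stats

# Sum of independent Poisson(1.5) and Poisson(2.5): convolve the PMFs and
# compare with Poisson(4.0) - the PGF product identity in distribution form
k = np.arange(60)
pmf_sum = np.convolve(stats.poisson(1.5).pmf(k), stats.poisson(2.5).pmf(k))[:60]
pmf_direct = stats.poisson(4.0).pmf(k)

print(np.allclose(pmf_sum, pmf_direct, atol=1e-12))   # True
```

The truncation at 60 terms is harmless here because the Poisson tails beyond that range are numerically negligible.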

For AI: The token generation process in language models can be viewed as repeatedly sampling from a categorical distribution. The PGF of the output token count under various sampling strategies (greedy, top-$k$, nucleus) encodes the statistical properties of generated text length distributions.


Appendix Q: Historical Development of Probability Theory

Understanding how probability theory evolved helps locate the axioms, distributions, and theorems in their proper intellectual context.

Q.1 Pre-Axiomatic Period (1654-1900)

Pascal and Fermat (1654): The famous exchange of letters about the "Problem of Points" (how to divide stakes in an unfinished game) is often cited as the birth of probability theory. They developed the idea of expected value and laid groundwork for combinatorics.

Bernoulli (1713): Jakob Bernoulli's Ars Conjectandi stated the first version of the Law of Large Numbers: relative frequencies of outcomes converge to true probabilities. This transformed probability from gambling mathematics to a science of inference.

Bayes (1763): Thomas Bayes' posthumous Essay gave the first treatment of inverse probability - inferring parameters from observations. Richard Price edited and published it; Laplace independently developed the same ideas more completely.

Laplace (1812): Théorie analytique des probabilités synthesised the field, introducing characteristic functions, the Central Limit Theorem, the method of moments, and the formal definition of probability as a fraction of equally likely cases.

Gauss (1809, 1823): Derived the normal distribution as the error distribution minimising the sum of squared residuals. Proved that the sample mean is the best linear unbiased estimator - establishing connections between probability and statistics.

Chebyshev, Markov, Lyapunov (1867-1901): Proved increasingly general versions of the CLT. Markov introduced his eponymous chains; Lyapunov's proof of the CLT via characteristic functions is essentially the modern proof.

Q.2 Axiomatic Period (1900-1950)

Borel (1909): Proved the strong law of large numbers for the Bernoulli case, introducing Borel sets and measure-theoretic methods.

Kolmogorov (1933): Grundbegriffe der Wahrscheinlichkeitsrechnung placed probability theory on rigorous measure-theoretic foundations. The three axioms in Section2 are Kolmogorov's. This work resolved foundational debates and enabled the rigorous development of stochastic processes.

Wiener (1923): Constructed Brownian motion rigorously as a probability measure on function spaces - the first infinite-dimensional probability space.

Doob (1940s): Developed martingale theory, providing rigorous tools for studying random processes with conditional structure.

Q.3 Modern Period (1950-present)

Shannon (1948): A Mathematical Theory of Communication introduced entropy, mutual information, and channel capacity - connecting probability to communication. Cross-entropy loss in modern ML is a direct descendant.

Robbins-Monro (1951): Stochastic approximation - the theoretical foundation for stochastic gradient descent. Proved that gradient estimates from mini-batches converge to the true gradient.

Dempster-Laird-Rubin (1977): The EM algorithm - an expectation-maximisation procedure for MLE with latent variables. Built on conditional expectation and Jensen's inequality.

Pearl (1988): Bayesian networks and causal inference - graphical models encoding conditional independence structure. Direct descendant of conditional probability in Section4 and Section03 of this chapter.

Goodfellow et al. (2014): Generative Adversarial Networks - training a generator to match a target distribution by playing a minimax game. The divergence between generated and true distributions is what is being minimised (implicitly or explicitly).

Sohl-Dickstein et al. (2015), Ho et al. (2020): Diffusion models - add Gaussian noise progressively (forward process), then learn to denoise (reverse process). Grounded in stochastic processes and Gaussian distributions.

Q.4 The Connections to Modern LLMs

Every component of a modern language model has a probabilistic interpretation rooted in the theory developed in this section:

| LLM Component | Probability Foundation | Section |
| --- | --- | --- |
| Token prediction | Categorical distribution over the vocabulary | Section 6 (discrete RV) |
| Temperature scaling | Scale parameter of the Gumbel distribution | Section 7 (continuous RV) |
| Dropout | Bernoulli mask on activations | Section 6 (Bernoulli) |
| Weight initialisation | Normal distribution, variance scaling | Section 7 (Gaussian) |
| KV cache attention | Conditional probability over past keys | Section 4 (conditioning) |
| RLHF reward model | Bradley-Terry preference model | Section 3 (conditional probability) |
| Beam search | Greedy argmax over joint probability | Section 6 (joint events) |
| Perplexity metric | Exponential of cross-entropy | Section 8 (expectation) |
| Calibration | Agreement between predicted probabilities and frequencies | Section 2 (frequentist) |
| Chain-of-thought | Latent variable model over reasoning traces | Section 4 (conditional independence) |
