Math for LLMs · Probability Theory / Introduction and Random Variables
Introduction to Probability and Random Variables

"Probability theory is nothing but common sense reduced to calculation."

  • Pierre-Simon Laplace, 1814

Overview

Probability theory is the mathematical language for reasoning under uncertainty - and uncertainty is everywhere in machine learning. Every training example is a noisy sample from an unknown distribution; every weight initialisation is a random draw; every language model output is a distribution over tokens. To reason rigorously about any of these, we need the formal apparatus built in this section.

We begin at the foundations: the Kolmogorov axioms that define what a probability is, and the algebraic rules (complement, union, conditional probability, Bayes' theorem) that follow from them. We then introduce the central object of probabilistic modelling - the random variable - and its two modes: discrete (characterised by a probability mass function) and continuous (characterised by a probability density function). We close with the expectation and variance of a random variable, which distil a full distribution into its most useful summary statistics.

Scope: This section is the canonical home for probability axioms, conditional probability, independence, random variable definitions, CDF/PDF/PMF, and the Bernoulli/Uniform/Gaussian distributions at an introductory level. The full catalogue of named distributions (Binomial, Poisson, Beta, Dirichlet, Categorical, Student-t) is covered in Section02. Joint distributions and Bayes' theorem for random variables are developed in Section03. The full treatment of expectation, covariance, LOTUS, and moment generating functions is in Section04.

Prerequisites

  • Basic set theory: union ($A \cup B$), intersection ($A \cap B$), complement ($A^c$), subset ($A \subseteq B$)
  • Single-variable calculus: integration, differentiation - Section03-Integration
  • Series and summation notation - Section04-Series-and-Sequences

Companion Notebooks

Notebook          Description
theory.ipynb      Interactive derivations: Kolmogorov axioms, Bayes' theorem, CDF/PDF/PMF, Bernoulli, Gaussian intro, change of variables, cross-entropy
exercises.ipynb   10 graded exercises: axiom proofs through probabilistic language model analysis

Learning Objectives

After completing this section, you will:

  • State the three Kolmogorov axioms and derive the complement, union, and monotonicity rules from them
  • Compute conditional probabilities and apply the chain rule and law of total probability
  • Apply Bayes' theorem to update beliefs from evidence, naming prior, likelihood, and posterior
  • Distinguish independence from conditional independence and give an example where they differ
  • Define a random variable formally as a measurable function from a sample space to $\mathbb{R}$
  • Write the CDF $F(x) = P(X \leq x)$ for a discrete and a continuous random variable, and list its four properties
  • Compute probabilities from a PMF (discrete case) and a PDF (continuous case)
  • Describe the Bernoulli and Uniform distributions completely: PMF/PDF, mean, variance, and ML uses
  • Introduce the Gaussian distribution and state the 68-95-99.7 rule
  • Compute $\mathbb{E}[X]$ and $\operatorname{Var}(X)$ for simple distributions and apply linearity of expectation
  • Apply the change-of-variables formula for monotone transformations of a continuous random variable
  • Express cross-entropy loss as negative log-likelihood and explain what it minimises

Table of Contents


1. Intuition

1.1 What Is Probability? Three Interpretations

Probability is a number in $[0, 1]$ assigned to an event, measuring how likely that event is to occur. But what does "likely" mean? There are three major interpretations, and each underpins a different school of thought in statistics and machine learning.

1. Classical (equally likely outcomes). If a sample space $\Omega$ has $N$ equally likely outcomes and event $A$ contains $k$ of them, then $P(A) = k/N$. This is the interpretation of textbook dice and card problems. It breaks down when outcomes are not equally likely - flipping a biased coin, for instance.

2. Frequentist (long-run relative frequency). $P(A)$ is the limiting proportion of times $A$ occurs in an infinite sequence of identical, independent experiments: $P(A) = \lim_{n \to \infty} \frac{\text{count of } A \text{ in } n \text{ trials}}{n}$. This is the foundation of classical statistics (hypothesis testing, confidence intervals). Its weakness: it cannot assign probabilities to one-off events ("the probability that GPT-5 passes the bar exam").

3. Bayesian (degree of belief). $P(A)$ represents a rational agent's degree of belief that $A$ is true, updated as new evidence arrives. This interpretation allows probabilities for singular events and underpins Bayesian inference, Bayesian neural networks, and probabilistic generative models. Crucially, different agents can assign different prior probabilities to the same event - and Bayes' theorem tells them how to update rationally.

THREE INTERPRETATIONS OF PROBABILITY
========================================================================

  Classical           Frequentist              Bayesian
  -----------------   --------------------     ----------------------
  P(A) = k/N          P(A) = lim freq(A)       P(A) = degree of belief
  Equally likely       Long-run proportion      Updated by evidence
  outcomes             Objective               Subjective / rational
  Dice, cards          Hypothesis tests         Generative models
                       Confidence intervals     Bayesian NNs, VAEs

  All three satisfy the same axioms (Section 2.2).
  They differ only in WHAT those axioms are applied to.

========================================================================

For AI: Modern ML systems implicitly blend all three. A softmax output layer uses the classical interpretation (outputs sum to 1). Training loop analysis uses frequentist reasoning (expected loss over the data distribution). Bayesian neural networks, RLHF reward models, and variational autoencoders use Bayesian reasoning explicitly.

1.2 Why Probability Is the Language of AI

Every component of a modern AI system is probabilistic:

  • Data: Training sets are finite samples from an unknown data-generating distribution $p_{\text{data}}(\mathbf{x}, y)$. The goal of supervised learning is to find $f_\theta$ that estimates $p_{\text{data}}(y \mid \mathbf{x})$.
  • Model outputs: A language model defines a probability distribution $P(x_t \mid x_{<t})$ over the next token. Generation is sampling from this distribution. Temperature scaling, top-$k$ and top-$p$ sampling are all operations on probability distributions.
  • Loss functions: Cross-entropy is the negative log-likelihood of the data under the model's distribution. Minimising cross-entropy is equivalent to maximising the probability the model assigns to the training data.
  • Regularisation: Dropout applies Bernoulli random variables to activations. Weight decay corresponds to a Gaussian prior in a Bayesian interpretation (MAP estimation).
  • Uncertainty quantification: Calibration of a classifier means its predicted probabilities match empirical frequencies. A well-calibrated model saying "70% confident" should be right 70% of the time.
  • Generative models: Diffusion models define a forward Markov chain (adding Gaussian noise) and learn to reverse it. VAEs learn $p_\theta(\mathbf{x} \mid \mathbf{z})$ and $q_\phi(\mathbf{z} \mid \mathbf{x})$. Normalising flows learn invertible transformations of simple distributions. All are explicitly probabilistic.

1.3 Historical Timeline

Year   Person                      Contribution
1654   Pascal & Fermat             Correspondence on gambling problems - foundations of combinatorial probability
1713   Jacob Bernoulli             Ars Conjectandi - law of large numbers, Bernoulli distribution
1763   Thomas Bayes (posthumous)   Essay on inverse probability - what we now call Bayes' theorem
1812   Pierre-Simon Laplace        Théorie analytique des probabilités - systematic probability theory, Laplace approximation
1837   Siméon Poisson              Poisson distribution - rare events in large samples
1900   Karl Pearson                Chi-squared distribution, correlation coefficient
1933   Andrei Kolmogorov           Rigorous axiomatic foundation - the three axioms that unify all interpretations
1948   Claude Shannon              Information theory - entropy connects probability to communication
1980s  Judea Pearl                 Bayesian networks - graphical models for probabilistic reasoning
1990s  Gelman, Rubin et al.        MCMC methods - practical Bayesian inference for complex models
2013   Kingma & Welling            Variational Autoencoders - learned latent distributions via reparameterisation
2020   Ho et al.                   Denoising Diffusion Probabilistic Models - probabilistic generative modelling at scale

2. Formal Definitions - Probability Spaces

2.1 Sample Spaces and Events

Definition 2.1 (Sample space). The sample space $\Omega$ is the set of all possible outcomes of a random experiment. Each element $\omega \in \Omega$ is called an elementary outcome or sample point.

Examples:

  • Tossing a fair coin: $\Omega = \{\text{H}, \text{T}\}$
  • Rolling a six-sided die: $\Omega = \{1, 2, 3, 4, 5, 6\}$
  • Measuring a person's height: $\Omega = (0, \infty) \subset \mathbb{R}$
  • Choosing a word from a vocabulary: $\Omega = \{\text{the}, \text{cat}, \text{sat}, \ldots\}$ (finite vocabulary)
  • One training step of a neural network: $\Omega =$ all possible mini-batch draws

Definition 2.2 (Event). An event $A$ is a subset of $\Omega$, i.e., $A \subseteq \Omega$. We say event $A$ occurs if the observed outcome $\omega \in A$.

Event algebra. Events combine using set operations:

  • Complement: $A^c = \Omega \setminus A$ - "$A$ does not occur"
  • Union: $A \cup B$ - "$A$ or $B$ (or both) occur"
  • Intersection: $A \cap B$ - "both $A$ and $B$ occur"
  • Difference: $A \setminus B = A \cap B^c$ - "$A$ occurs but $B$ does not"

The $\sigma$-algebra (brief note). For continuous sample spaces (e.g., $\Omega = \mathbb{R}$), we cannot assign probabilities to every subset - some are too pathological (Vitali sets). The solution is to restrict to a $\sigma$-algebra $\mathcal{F}$: a collection of subsets of $\Omega$ that is closed under complement and countable unions. The standard choice for $\Omega = \mathbb{R}$ is the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$, generated by all open intervals. For this course, you can think of "events" as "any reasonable subset of $\Omega$" - the measure-theoretic subtlety is noted but not required.

2.2 The Three Kolmogorov Axioms

In 1933, Andrei Kolmogorov placed probability theory on a rigorous axiomatic foundation, resolving a century of informal debate about what probability "really is."

Definition 2.3 (Probability measure). A probability measure is a function $P : \mathcal{F} \to [0, 1]$ satisfying:

Axiom 1 (Non-negativity):

$$P(A) \geq 0 \quad \text{for all events } A$$

Axiom 2 (Normalisation):

$$P(\Omega) = 1$$

Axiom 3 (Countable additivity / $\sigma$-additivity): If $A_1, A_2, A_3, \ldots$ are pairwise disjoint events (i.e., $A_i \cap A_j = \emptyset$ for $i \neq j$), then:

$$P\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$

The triple $(\Omega, \mathcal{F}, P)$ is called a probability space.

Why these three axioms? They are minimal: they are logically independent (none follows from the other two), they rule out degenerate assignments (e.g., $P(A) = -0.5$ or $P(\Omega) = 2$), and they imply everything else in probability theory via logical deduction.

2.3 Immediate Consequences of the Axioms

From just three axioms, we can derive all the basic rules of probability:

Theorem 2.1 (Probability of the empty set): $P(\emptyset) = 0$.

Proof. $\Omega$ and $\emptyset$ are disjoint with $\Omega \cup \emptyset = \Omega$. By Axiom 3: $P(\Omega) + P(\emptyset) = P(\Omega) = 1$, so $P(\emptyset) = 0$. $\square$

Theorem 2.2 (Finite additivity): For disjoint $A_1, \ldots, A_n$:

$$P(A_1 \cup \cdots \cup A_n) = P(A_1) + \cdots + P(A_n)$$

Proof. Set $A_{n+1} = A_{n+2} = \cdots = \emptyset$, apply Axiom 3, then Theorem 2.1. $\square$

Theorem 2.3 (Complement rule): $P(A^c) = 1 - P(A)$.

Proof. $A$ and $A^c$ are disjoint with $A \cup A^c = \Omega$. By Axiom 3: $P(A) + P(A^c) = P(\Omega) = 1$. $\square$

Theorem 2.4 (Monotonicity): If $A \subseteq B$, then $P(A) \leq P(B)$.

Proof. Write $B = A \cup (B \setminus A)$, where $A$ and $B \setminus A$ are disjoint. By Axiom 3: $P(B) = P(A) + P(B \setminus A) \geq P(A)$, since $P(B \setminus A) \geq 0$ by Axiom 1. $\square$

Theorem 2.5 (Bounds): $0 \leq P(A) \leq 1$ for all $A$.

Proof. $\emptyset \subseteq A \subseteq \Omega$, so by monotonicity: $0 = P(\emptyset) \leq P(A) \leq P(\Omega) = 1$. $\square$

2.4 Probability Measures - Examples and Non-Examples

Examples of valid probability measures:

Sample space $\Omega$    Probability measure $P$
$\{1,2,3,4,5,6\}$        $P(\{k\}) = 1/6$ for each $k$ (fair die)
$\{1,2,3,4,5,6\}$        $P(\{k\}) = k/21$ for each $k$ (loaded die; $\sum_k k/21 = 1$)
$[0,1]$                  $P([a,b]) = b - a$ for $0 \leq a \leq b \leq 1$ (uniform)
$\mathbb{R}$             $P(A) = \int_A \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx$ (standard Gaussian)
$\{0,1\}$                $P(\{1\}) = p$, $P(\{0\}) = 1 - p$ for $p \in [0,1]$ (Bernoulli)

Non-examples (violate at least one axiom):

Assignment                                               Axiom violated
$P(\{H\}) = 0.6$, $P(\{T\}) = 0.6$                       Axiom 2: total is $1.2 \neq 1$
$P(\{H\}) = -0.1$, $P(\{T\}) = 1.1$                      Axiom 1: negative probability
$P(\{H\}) = 0.5$, $P(\{T\}) = 0.5$, $P(\{H,T\}) = 0.8$   Axiom 3: $P(\{H\} \cup \{T\}) \neq P(\{H\}) + P(\{T\})$

3. Computing Probabilities - The Core Rules

3.1 Complement and Monotonicity

The complement rule is one of the most practically useful consequences of the axioms.

Complement rule: $P(A^c) = 1 - P(A)$

This is particularly useful when $P(A^c)$ is easier to compute than $P(A)$ directly. This technique - "compute the complement" - is ubiquitous in probability:

Example. What is the probability that at least one of 10 coin flips is heads (for a fair coin)?

Direct approach: sum $P(\text{exactly } k \text{ heads})$ for $k = 1, \ldots, 10$ - tedious.

Complement approach: $P(\text{at least one head}) = 1 - P(\text{no heads}) = 1 - (1/2)^{10} = 1023/1024 \approx 0.999$.

For AI: The complement rule underlies the "at least one" union bound used in statistical learning theory. If we want $P(\text{at least one of } n \text{ bad events occurs}) \leq \delta$, we equivalently want $P(\text{all events are good}) \geq 1 - \delta$.
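The complement calculation above can be cross-checked by simulation - a minimal sketch using only the standard library (the trial count and seed are arbitrary choices):

```python
import random

random.seed(0)  # reproducible run

# Exact answer via the complement rule: 1 - P(no heads in 10 flips)
exact = 1 - (1 / 2) ** 10

# Frequentist check: estimate P(at least one head) over many simulated runs
trials = 100_000
hits = sum(
    any(random.random() < 0.5 for _ in range(10))  # at least one head in 10 flips?
    for _ in range(trials)
)
estimate = hits / trials

print(f"exact    = {exact:.6f}")
print(f"estimate = {estimate:.6f}")
```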

3.2 Inclusion-Exclusion Principle

Theorem 3.1 (Inclusion-exclusion, two events):

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Proof. Decompose $A \cup B = A \cup (B \setminus A)$ into disjoint parts, and $B = (A \cap B) \cup (B \setminus A)$ into disjoint parts. By Axiom 3: $P(A \cup B) = P(A) + P(B \setminus A)$ and $P(B) = P(A \cap B) + P(B \setminus A)$. Rearranging the second equation: $P(B \setminus A) = P(B) - P(A \cap B)$. Substituting: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. $\square$

The subtraction corrects for double-counting $A \cap B$.

Generalisation to $n$ events:

$$P\!\left(\bigcup_{i=1}^n A_i\right) = \sum_i P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots$$

Union bound (Boole's inequality): A simpler but looser bound:

$$P\!\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n P(A_i)$$

For AI: The union bound is the workhorse of statistical learning theory. To prove that a learned classifier generalises to unseen data with high probability, one typically applies the union bound over all possible hypotheses, then uses concentration inequalities (Section05) to bound each term.
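Both inclusion-exclusion and the union bound can be verified by brute-force enumeration on a small sample space - a sketch where the two-dice events are illustrative choices:

```python
from itertools import product

# Sample space: ordered pairs from two fair dice - 36 equally likely outcomes
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Classical probability: favourable outcomes / total outcomes."""
    return sum(1 for w in omega if event(w)) / len(omega)

A = lambda w: w[0] == 6  # first die shows 6
B = lambda w: w[1] == 6  # second die shows 6

# Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
p_union = prob(lambda w: A(w) or B(w))
assert abs(p_union - (prob(A) + prob(B) - prob(lambda w: A(w) and B(w)))) < 1e-12

# Union bound (Boole): P(A ∪ B) <= P(A) + P(B)
assert p_union <= prob(A) + prob(B)

print(p_union)  # 11/36 ≈ 0.306
```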

3.3 Conditional Probability

When we observe that event $B$ has occurred, this changes our state of knowledge. The probability of $A$ given that $B$ occurred is:

Definition 3.1 (Conditional probability):

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$

Intuition: Conditioning on $B$ restricts the sample space from $\Omega$ to $B$. Within this restricted space, we renormalise by dividing by $P(B)$ to ensure the conditional probabilities sum to 1.

CONDITIONAL PROBABILITY - GEOMETRIC INTUITION
========================================================================

  Before conditioning:              After conditioning on B:
  +----------------------+          +-------------------+
  |  Omega               |          |  B                |
  |   +--------+------+  |          |   +------+        |
  |   |   A    | A∩B  |--+-- B      |   | A∩B  |        |
  |   +--------+------+  |          |   +------+        |
  +----------------------+          |  (renormalised)   |
                                    +-------------------+

  P(A) = area(A) / area(Omega)      P(A|B) = area(A∩B) / area(B)

========================================================================

Key properties of conditional probability:

  • $P(\cdot \mid B)$ is itself a valid probability measure on $(\Omega, \mathcal{F})$: it satisfies all three Kolmogorov axioms
  • $P(B \mid B) = 1$ - given that $B$ occurred, $B$ certainly occurred
  • $P(A^c \mid B) = 1 - P(A \mid B)$ - the complement rule holds conditionally

Non-example: $P(A \mid B) \neq P(B \mid A)$ in general. This asymmetry is the source of the classic prosecutor's fallacy: $P(\text{DNA match} \mid \text{innocent})$ is very small, but $P(\text{innocent} \mid \text{DNA match})$ could be substantial in a large population.
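For a finite, equally likely sample space, the definition reduces to a ratio of counts - a small sketch with two dice (the events are arbitrary illustrations):

```python
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # two fair dice, equally likely

B = [w for w in omega if w[0] % 2 == 0]        # condition: first die is even
A_and_B = [w for w in B if w[0] + w[1] == 8]   # ...and the sum is 8

# P(A | B) = P(A ∩ B) / P(B); with equally likely outcomes this is a count ratio
p_cond = len(A_and_B) / len(B)
print(p_cond)  # 3/18 = 1/6 ≈ 0.1667
```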

3.4 Chain Rule of Probability

Rearranging the definition of conditional probability gives:

$$P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)$$

This extends to any number of events:

Theorem 3.2 (Chain rule / product rule):

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \cdot P(A_2 \mid A_1) \cdot P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1})$$

For AI: The chain rule of probability is the mathematical foundation of autoregressive language models. A language model with vocabulary $\mathcal{V}$ and context window $T$ defines:

$$P(x_1, x_2, \ldots, x_T) = P(x_1) \cdot P(x_2 \mid x_1) \cdot P(x_3 \mid x_1, x_2) \cdots P(x_T \mid x_1, \ldots, x_{T-1})$$

Every GPT-family model is directly computing these conditional probabilities. The model is trained to minimise the average negative log of each factor - i.e., the cross-entropy loss.
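The factorisation can be made concrete with a toy sequence - a sketch in which the conditional probabilities are made-up numbers standing in for a real model's softmax outputs:

```python
import math

# Hypothetical conditionals P(x1), P(x2 | x1), P(x3 | x1, x2) for a
# three-token sequence - illustrative values, not from any real model
cond_probs = [0.20, 0.05, 0.10]

# Chain rule: the joint probability is the product of the conditionals
joint = math.prod(cond_probs)

# Cross-entropy / negative log-likelihood: average of -log of each factor
nll = -sum(math.log(p) for p in cond_probs) / len(cond_probs)

print(joint)          # 0.001 (up to float rounding)
print(round(nll, 4))  # 2.3026
```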

3.5 Law of Total Probability

Definition 3.2 (Partition). Events $B_1, B_2, \ldots, B_n$ form a partition of $\Omega$ if they are pairwise disjoint ($B_i \cap B_j = \emptyset$ for $i \neq j$) and exhaustive ($\bigcup_i B_i = \Omega$).

Theorem 3.3 (Law of total probability). If $B_1, \ldots, B_n$ partition $\Omega$ and $P(B_i) > 0$ for all $i$:

$$P(A) = \sum_{i=1}^n P(A \mid B_i) \cdot P(B_i)$$

Proof. Since $B_1, \ldots, B_n$ partition $\Omega$: $A = \bigcup_i (A \cap B_i)$, a disjoint union. By Axiom 3: $P(A) = \sum_i P(A \cap B_i) = \sum_i P(A \mid B_i) P(B_i)$. $\square$

Example. A language model generates a sentence that is either grammatical ($B_1$, probability 0.7) or ungrammatical ($B_2$, probability 0.3). Given grammatical, the probability a human accepts it is 0.9; given ungrammatical, 0.2. What is the overall acceptance probability?

$$P(\text{accepted}) = P(\text{acc} \mid B_1) P(B_1) + P(\text{acc} \mid B_2) P(B_2) = 0.9 \times 0.7 + 0.2 \times 0.3 = 0.63 + 0.06 = 0.69$$
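The acceptance example is a one-line weighted sum in code - a sketch using the numbers from the example above:

```python
# Law of total probability: P(A) = sum over i of P(A | B_i) * P(B_i)
p_B = {"grammatical": 0.7, "ungrammatical": 0.3}            # partition of Omega
p_acc_given_B = {"grammatical": 0.9, "ungrammatical": 0.2}  # conditionals

p_acc = sum(p_acc_given_B[b] * p_B[b] for b in p_B)
print(round(p_acc, 2))  # 0.69
```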

3.6 Bayes' Theorem

Bayes' theorem inverts conditional probability: from $P(A \mid B)$ we can compute $P(B \mid A)$.

Theorem 3.4 (Bayes' theorem):

$$P(B \mid A) = \frac{P(A \mid B) \cdot P(B)}{P(A)}$$

Applying the law of total probability to the denominator (with the partition $B, B^c$):

$$P(B \mid A) = \frac{P(A \mid B) \cdot P(B)}{P(A \mid B) P(B) + P(A \mid B^c) P(B^c)}$$

Bayesian inference terminology:

Term         Symbol           Meaning
Prior        $P(B)$           Belief about $B$ before seeing $A$
Likelihood   $P(A \mid B)$    How probable is $A$ if $B$ is true?
Evidence     $P(A)$           Total probability of observation $A$
Posterior    $P(B \mid A)$    Updated belief about $B$ after seeing $A$

$$\underbrace{P(B \mid A)}_{\text{posterior}} = \frac{\overbrace{P(A \mid B)}^{\text{likelihood}} \cdot \overbrace{P(B)}^{\text{prior}}}{\underbrace{P(A)}_{\text{evidence}}}$$

Classic example - Spam filter. An email is spam with prior $P(\text{spam}) = 0.3$. The word "lottery" appears in 80% of spam emails but only 5% of legitimate ones. Given that an email contains "lottery", what is $P(\text{spam} \mid \text{lottery})$?

$$P(\text{spam} \mid \text{lottery}) = \frac{0.80 \times 0.30}{0.80 \times 0.30 + 0.05 \times 0.70} = \frac{0.24}{0.24 + 0.035} = \frac{0.24}{0.275} \approx 0.873$$
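The same computation as a reusable helper - a minimal sketch (`bayes_posterior` is an illustrative name, not a library function):

```python
def bayes_posterior(prior, lik_given_h, lik_given_not_h):
    """P(H | E) via Bayes' theorem, expanding P(E) over the partition {H, H^c}."""
    evidence = lik_given_h * prior + lik_given_not_h * (1 - prior)
    return lik_given_h * prior / evidence

# Spam-filter numbers from the example above
post = bayes_posterior(prior=0.30, lik_given_h=0.80, lik_given_not_h=0.05)
print(round(post, 3))  # 0.873
```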

For AI: Bayes' theorem is the core of:

  • Bayesian neural networks: parameters $\theta$ have a prior $P(\theta)$; training updates to the posterior $P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta) P(\theta)$
  • RLHF reward modelling: given human preference data, update beliefs about which completions are better
  • Naive Bayes classifiers: classify documents by computing $P(\text{class} \mid \text{words}) \propto P(\text{words} \mid \text{class}) P(\text{class})$

4. Independence

4.1 Unconditional Independence

Intuitively, events $A$ and $B$ are independent if knowing that one occurred gives no information about whether the other occurred.

Definition 4.1 (Independence). Events $A$ and $B$ are independent, written $A \perp B$, if:

$$P(A \cap B) = P(A) \cdot P(B)$$

Equivalently (when $P(B) > 0$): $P(A \mid B) = P(A)$ - conditioning on $B$ does not change the probability of $A$.

Why this definition? The conditional probability $P(A \mid B) = P(A \cap B)/P(B)$ equals $P(A)$ if and only if $P(A \cap B) = P(A)P(B)$. The multiplicative form is preferred because it is symmetric and well-defined even when $P(A) = 0$ or $P(B) = 0$.

Examples of independent events:

  • Two fair coin flips: $P(\text{H on flip 1} \cap \text{H on flip 2}) = 1/4 = 1/2 \times 1/2$ ✓
  • Drawing with replacement: second draw is independent of first
  • Two separate neural network weight initialisations drawn i.i.d.

Non-examples (dependent events):

  • Drawing without replacement: if first card is an ace, second ace is less likely
  • Correlated features in a dataset
  • Sequential tokens in a language model: $P(x_t \mid x_{t-1}) \neq P(x_t)$

For $n$ events (mutual independence): Events $A_1, \ldots, A_n$ are mutually independent if for every subset $S \subseteq \{1, \ldots, n\}$:

$$P\!\left(\bigcap_{i \in S} A_i\right) = \prod_{i \in S} P(A_i)$$

This is strictly stronger than pairwise independence (see Section 4.3).

4.2 Conditional Independence

Conditional independence is one of the most important concepts in probabilistic modelling and graphical models.

Definition 4.2 (Conditional independence). Events $A$ and $B$ are conditionally independent given $C$, written $A \perp B \mid C$, if:

$$P(A \cap B \mid C) = P(A \mid C) \cdot P(B \mid C)$$

Equivalently (when $P(B \cap C) > 0$): $P(A \mid B, C) = P(A \mid C)$ - once we know $C$, learning $B$ gives no additional information about $A$.

The classic example - explaining away. Let $A$ = "the grass is wet", $B$ = "the sprinkler is on", $C$ = "it rained". Sprinkler and rain are independent causes of wet grass, so $B \perp C$ marginally. But:

  • Given that the grass is wet, the two causes become dependent: learning that it rained "explains away" the wet grass and makes the sprinkler less likely, so $B \not\perp C \mid A$ - conditioning introduced dependence
  • If rain always soaks the grass, then $A \perp B \mid C$: once we know it rained, the sprinkler status tells us nothing more about whether the grass is wet

Caution: Independence and conditional independence are logically independent properties:

  • $A \perp B$ does NOT imply $A \perp B \mid C$ (conditioning can introduce dependence)
  • $A \perp B \mid C$ does NOT imply $A \perp B$ (marginalising out $C$ can introduce dependence)

For AI: Conditional independence is the foundation of:

  • Naive Bayes: assumes features are conditionally independent given the class label
  • Hidden Markov Models: $x_t \perp x_{t'} \mid z_t$ - observations are independent given the hidden state
  • Bayesian networks: the factorisation $P(X_1, \ldots, X_n) = \prod_i P(X_i \mid \text{parents}(X_i))$ encodes a set of conditional independence assumptions
  • Attention masks: attention weights in transformers implement conditional independence (causal masking = future tokens are conditionally independent of past given present)

4.3 Pairwise vs Mutual Independence

Pairwise independence means $P(A_i \cap A_j) = P(A_i)P(A_j)$ for all pairs $(i, j)$.

Mutual independence means the joint probability factorises for every subset (Definition 4.1 for $n$ events).

Mutual independence implies pairwise independence but not vice versa. The following counterexample (Bernstein, 1928) shows that pairwise independence can hold without mutual independence:

Roll two fair dice. Define:

  • $A$ = "first die shows even" - $P(A) = 1/2$
  • $B$ = "second die shows even" - $P(B) = 1/2$
  • $C$ = "sum of the dice is even" - $P(C) = 1/2$

Check pairwise: $P(A \cap B) = 1/4 = P(A)P(B)$ ✓, $P(A \cap C) = 1/4$ ✓, $P(B \cap C) = 1/4$ ✓.

But $P(A \cap B \cap C) = P(\text{both even, sum even}) = P(\text{both even}) = 1/4 \neq P(A)P(B)P(C) = 1/8$.

So $A, B, C$ are pairwise independent but not mutually independent.
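Bernstein's counterexample can be verified by enumerating all 36 outcomes - a short sketch:

```python
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # two fair dice, 36 outcomes

def prob(event):
    return sum(1 for w in omega if event(w)) / len(omega)

A = lambda w: w[0] % 2 == 0            # first die even
B = lambda w: w[1] % 2 == 0            # second die even
C = lambda w: (w[0] + w[1]) % 2 == 0   # sum even

# Pairwise independence holds for all three pairs...
assert prob(lambda w: A(w) and B(w)) == prob(A) * prob(B)
assert prob(lambda w: A(w) and C(w)) == prob(A) * prob(C)
assert prob(lambda w: B(w) and C(w)) == prob(B) * prob(C)

# ...but mutual independence fails: P(A ∩ B ∩ C) = 1/4, not 1/8
print(prob(lambda w: A(w) and B(w) and C(w)))  # 0.25
```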

For AI: The i.i.d. (independent and identically distributed) assumption in ML training is a mutual independence assumption: $P((\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(n)}, y^{(n)})) = \prod_{i=1}^n P(\mathbf{x}^{(i)}, y^{(i)})$. This assumption underlies the derivation of cross-entropy loss as a maximum likelihood estimator.

4.4 Why Independence Matters for AI

Independence assumptions make otherwise intractable problems tractable.

Without independence: The joint distribution of $n$ binary random variables requires $2^n - 1$ parameters. For $n = 100$ features, this is $2^{100} - 1 \approx 10^{30}$ - utterly infeasible.

With conditional independence (Naive Bayes): Given class $C$, features are independent: $P(\mathbf{x} \mid C) = \prod_{j=1}^n P(x_j \mid C)$. Now only $n$ parameters per class - linear in $n$.

Independence in optimisation: Mini-batch gradient descent assumes that samples in a batch are independent draws from the training distribution. This independence makes the batch gradient an unbiased estimator of the full gradient.

Independence and parallelism: Independently distributed data can be processed in parallel without communication - the foundation of distributed ML training (data parallelism).


5. Random Variables - Formal Foundation

5.1 Definition: Random Variable as a Measurable Function

The outcomes in a sample space $\Omega$ can be anything - text strings, images, dice faces. To do mathematics with them, we need to convert them to numbers. A random variable does exactly this.

Definition 5.1 (Random variable). A random variable $X$ is a (measurable) function from the sample space $\Omega$ to the real line $\mathbb{R}$:

$$X : \Omega \to \mathbb{R}, \quad \omega \mapsto X(\omega)$$

The measurability requirement ensures that $\{\omega \in \Omega : X(\omega) \leq x\}$ is a valid event (belongs to $\mathcal{F}$) for every $x \in \mathbb{R}$, so that we can assign it a probability. For all practical purposes, every function you would naturally write down is measurable.

Examples:

  • Die roll: $\Omega = \{1, \ldots, 6\}$, $X(\omega) = \omega$ (the outcome itself)
  • Coin flip: $\Omega = \{H, T\}$, $X(H) = 1$, $X(T) = 0$ (Bernoulli representation)
  • Sentence length: $\Omega = \{\text{all English sentences}\}$, $X(\omega) = |\omega|$ (number of words)
  • Model loss: $\Omega = \{\text{all possible mini-batches}\}$, $X(\omega) = \mathcal{L}(\theta; \omega)$

Non-examples:

  • The sample space $\Omega$ itself is not a random variable (it is the domain, not a function)
  • A deterministic constant $c$ can be viewed as a degenerate random variable $X(\omega) = c$ for all $\omega$, but calling it "random" is misleading

Notation: Random variables are typically denoted by uppercase letters ($X$, $Y$, $Z$); their realised values (specific numbers) by lowercase ($x$, $y$, $z$). So "$X = x$" means "the random variable $X$ takes the value $x$."

The event $\{X \leq x\} = \{\omega \in \Omega : X(\omega) \leq x\}$ is well-defined, and we write $P(X \leq x)$ as shorthand for $P(\{X \leq x\})$.
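The definition is easy to make literal in code - a sketch of a coin-flip random variable as an explicit function on $\Omega$ (a dictionary lookup stands in for the function):

```python
omega = ["H", "T"]          # sample space
X = {"H": 1, "T": 0}        # random variable X: Omega -> R (Bernoulli form)
P = {"H": 0.5, "T": 0.5}    # probability measure on Omega

# P(X <= x) is the probability of the event {ω in Omega : X(ω) <= x}
def prob_X_leq(x):
    return sum(P[w] for w in omega if X[w] <= x)

print(prob_X_leq(-1), prob_X_leq(0), prob_X_leq(1))  # 0 0.5 1.0
```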

5.2 Discrete vs Continuous - Taxonomy

Random variables split into two main types based on the range of $X$.

TAXONOMY OF RANDOM VARIABLES
========================================================================

  Random Variable X: Omega -> R
  |
  +-- Discrete: X takes countably many values {x_1, x_2, x_3, ...}
  |   |
  |   +-- Characterised by PMF: p(x) = P(X = x)
  |   +-- CDF: F(x) = sum of p(x_i) over all x_i <= x   (staircase)
  |   +-- Examples: Bernoulli, Geometric, Binomial, Poisson, Categorical
  |
  +-- Continuous: X takes values in an interval (uncountably many)
      |
      +-- Characterised by PDF: f(x) with P(a <= X <= b) = integral of f over [a, b]
      +-- CDF: F(x) = integral of f from -infinity to x   (smooth, differentiable)
      +-- Examples: Uniform, Gaussian, Exponential, Beta, Gamma

  Note: "Mixed" random variables exist (CDF is a mix of jumps and
  smooth parts) but are rare in practice.

========================================================================

Key distinction - probability at a point:

  • Discrete: $P(X = x)$ can be positive (it is the PMF value)
  • Continuous: $P(X = x) = 0$ for every specific $x$ (the probability of hitting any exact value is zero). Probabilities come from integrating over intervals.

For AI: The distinction is crucial for loss functions. Cross-entropy loss for classification uses a discrete distribution (Categorical) over a finite vocabulary or class set. For regression, the loss is often the negative log-likelihood of a Gaussian (continuous distribution). Diffusion models use a mixture: discrete time steps with continuous noise.

5.3 The Cumulative Distribution Function

The CDF provides a unified description of both discrete and continuous random variables.

Definition 5.2 (Cumulative distribution function). The CDF of a random variable X is:

F_X(x) = P(X \leq x), \quad x \in \mathbb{R}

Examples:

  • Fair die: F(x) = 0 for x < 1; F(x) = k/6 for k \leq x < k+1, k = 1, \ldots, 6; F(x) = 1 for x \geq 6. A staircase with jumps of 1/6 at each integer.
  • Standard normal: F(x) = \Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-t^2/2}\,dt. A smooth S-shaped curve.
  • Bernoulli(p): F(x) = 0 for x < 0; F(x) = 1-p for 0 \leq x < 1; F(x) = 1 for x \geq 1.

Computing interval probabilities from the CDF:

P(a < X \leq b) = F(b) - F(a), \qquad P(X > x) = 1 - F(x)
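
These identities can be checked numerically for the standard normal, whose CDF is expressible through the error function (a plain-Python sketch; the evaluation points are arbitrary illustrative choices):

```python
import math

def normal_cdf(x: float) -> float:
    """CDF of the standard normal, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# P(a < X <= b) = F(b) - F(a) for X ~ N(0, 1)
p_interval = normal_cdf(1.0) - normal_cdf(-1.0)  # P(-1 < X <= 1), about 0.6827
p_tail = 1.0 - normal_cdf(2.0)                   # P(X > 2), about 0.0228
```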

5.4 CDF Properties and the Fundamental Theorem

Theorem 5.1 (CDF properties). Any CDF satisfies:

  1. Monotone non-decreasing: x_1 \leq x_2 \implies F(x_1) \leq F(x_2)
  2. Right-continuous: \lim_{y \downarrow x} F(y) = F(x) for all x
  3. Limits: \lim_{x \to -\infty} F(x) = 0 and \lim_{x \to +\infty} F(x) = 1
  4. Jump characterisation: P(X = x) = F(x) - F(x^-), where F(x^-) = \lim_{y \uparrow x} F(y). For continuous X this jump is 0 everywhere; for discrete X it equals the PMF at jump points.

These four properties completely characterise CDFs - any function satisfying them is the CDF of some random variable. This universality is what makes the CDF a single description covering discrete, continuous, and mixed cases alike.

Fundamental theorem (continuous case): For continuous X with CDF F and PDF f:

f(x) = \frac{d}{dx} F(x), \qquad F(x) = \int_{-\infty}^x f(t)\, dt

The PDF is the derivative of the CDF; the CDF is the integral of the PDF.


6. Discrete Random Variables

6.1 Probability Mass Function

Definition 6.1 (PMF). For a discrete random variable X with countable range \mathcal{X} = \{x_1, x_2, \ldots\}, the probability mass function is:

p_X(x) = P(X = x), \quad x \in \mathcal{X}

Properties:

  1. p_X(x) \geq 0 for all x (Axiom 1)
  2. \sum_{x \in \mathcal{X}} p_X(x) = 1 (Axiom 2)
  3. p_X(x) = 0 for x \notin \mathcal{X}

CDF from PMF: F_X(x) = \sum_{x_i \leq x} p_X(x_i) - sum all PMF values at or below x.

Non-examples of valid PMFs:

  • p(0) = 0.6, p(1) = 0.6: sum = 1.2 > 1
  • p(0) = -0.2, p(1) = 1.2: negative value
  • p(k) = 1/k for k = 1, 2, \ldots: \sum_k 1/k = \infty (harmonic series diverges)

For AI: Every classifier's output layer implicitly defines a PMF over classes. The softmax function \text{softmax}(\mathbf{z})_k = e^{z_k}/\sum_j e^{z_j} produces a vector of non-negative values summing to 1 - a valid PMF over the K classes. Training minimises the cross-entropy between this PMF and the one-hot PMF of the true label.
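
As a sketch of this claim, a minimal softmax (the logits are arbitrary illustrative values) produces a valid PMF by construction:

```python
import math

def softmax(z):
    """Numerically stable softmax: subtract max(z) before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([2.0, 1.0, -1.0, 0.5])
# p is a valid PMF: every entry is non-negative and the entries sum to 1
```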

6.2 Bernoulli Distribution - The Canonical Example

The Bernoulli distribution is the simplest non-trivial probability distribution: a single binary trial.

Definition 6.2 (Bernoulli distribution). X \sim \text{Bernoulli}(p) for p \in [0,1] if:

P(X = 1) = p, \quad P(X = 0) = 1 - p

or equivalently: p_X(x) = p^x (1-p)^{1-x} for x \in \{0, 1\}.

Properties:

  • Mean: \mathbb{E}[X] = 1 \cdot p + 0 \cdot (1-p) = p
  • Variance: \operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = p - p^2 = p(1-p)
  • Maximum variance at p = 1/2: \operatorname{Var}(X) = 1/4
  • CDF: F(x) = 0 for x < 0; F(x) = 1-p for 0 \leq x < 1; F(x) = 1 for x \geq 1
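
The mean and variance are easy to verify by simulation (a plain-Python Monte Carlo sketch; p, the sample size, and the seed are arbitrary choices):

```python
import random

random.seed(0)
p, n = 0.3, 200_000

# Simulate Bernoulli(p) draws: 1 with probability p, else 0
samples = [1 if random.random() < p else 0 for _ in range(n)]

mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / n
# Theory: mean -> p = 0.3, variance -> p(1-p) = 0.21
```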

ML applications of the Bernoulli distribution:

  • Binary classification: Y \sim \text{Bernoulli}(\sigma(\mathbf{w}^\top \mathbf{x})), where \sigma is the sigmoid
  • Dropout: each neuron is active with probability 1 - p_{\text{drop}}; the mask over d units is \text{Bernoulli}(1-p_{\text{drop}})^d
  • Coin-flip initialisation tests: checking whether random initialisation breaks symmetry
  • Binary attention masks: in masked language modelling (BERT, RoBERTa), each position is masked with a \text{Bernoulli}(p_{\text{mask}}) draw
  • Bernoulli reward signals: some RL environments give binary rewards (0 or 1)

Bernoulli and entropy. The binary entropy of a Bernoulli(p) variable is:

H(p) = -p \log p - (1-p) \log(1-p)

This is maximised at p = 1/2 (maximum uncertainty) and equals 0 at p \in \{0, 1\} (certainty). Binary cross-entropy loss has the same form, with the hard label y \in \{0,1\} in place of p and the model's prediction \hat{p} inside the logarithms: \mathcal{L} = -y \log \hat{p} - (1-y)\log(1-\hat{p}).

6.3 Geometric Distribution

The Geometric distribution models waiting times: how many Bernoulli trials until the first success?

Definition 6.3 (Geometric distribution). X \sim \text{Geometric}(p) for p \in (0,1] if:

P(X = k) = (1-p)^{k-1} p, \quad k = 1, 2, 3, \ldots

(Here X is the trial number of the first success; an alternative convention takes X to be the number of failures before the first success, with PMF (1-p)^k p for k = 0, 1, 2, \ldots)

Properties:

  • Mean: \mathbb{E}[X] = 1/p
  • Variance: \operatorname{Var}(X) = (1-p)/p^2
  • Memoryless property: P(X > m+n \mid X > m) = P(X > n) - the past gives no information about the remaining wait time. The Geometric is the only discrete memoryless distribution.

Geometric series check (normalisation):

\sum_{k=1}^\infty (1-p)^{k-1} p = p \sum_{k=0}^\infty (1-p)^k = p \cdot \frac{1}{1-(1-p)} = p \cdot \frac{1}{p} = 1 \checkmark
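
The mean 1/p can likewise be checked by simulation (sketch; p, the sample size, and the seed are arbitrary choices):

```python
import random

random.seed(1)
p = 0.25

def geometric(p: float) -> int:
    """Trial number of the first success in repeated Bernoulli(p) trials."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

n = 100_000
draws = [geometric(p) for _ in range(n)]
mean = sum(draws) / n  # theory: 1/p = 4
```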

ML applications:

  • Sentence length models: the length of a generated sentence can be modelled geometrically (each word has probability p of being the last)
  • Number of gradient steps to convergence: in simplified convergence analyses, the number of steps until \|\nabla\mathcal{L}\| < \varepsilon can have an approximately geometric distribution under certain conditions
  • Retry logic: in RL exploration, the number of random actions before discovering a reward follows a geometric-like distribution in uniform random environments

6.4 Preview: Binomial, Categorical, Poisson

The following distributions have their full treatment in Section02-Common-Distributions. They are introduced here briefly for orientation:

Preview: Binomial Distribution. \text{Binomial}(n, p) counts the number of successes in n independent Bernoulli(p) trials: P(X = k) = \binom{n}{k}p^k(1-p)^{n-k} for k = 0, 1, \ldots, n. Mean np, variance np(1-p). As n \to \infty with np = \lambda fixed, the Binomial converges to Poisson(\lambda).

-> Full treatment: Common Distributions Section2

Preview: Categorical Distribution. \text{Categorical}(\mathbf{p}) with \mathbf{p} \in \Delta^{K-1} (the probability simplex) models selection of one of K categories: P(X = k) = p_k. This is the distribution underlying softmax classifiers and language model next-token prediction.

-> Full treatment: Common Distributions Section4

Preview: Poisson Distribution. \text{Poisson}(\lambda) models the number of events in a fixed interval when events occur independently at constant rate \lambda: P(X = k) = e^{-\lambda}\lambda^k/k!. Mean = Variance = \lambda. Used in modelling word frequencies (Zipf/Poisson approximation), network packet arrivals, and mutation rates.

-> Full treatment: Common Distributions Section3


7. Continuous Random Variables

7.1 Probability Density Function

For continuous random variables, individual points have probability zero. Probability is distributed over intervals via the PDF.

Definition 7.1 (Probability density function). A non-negative function f_X : \mathbb{R} \to [0, \infty) is a PDF for X if:

  1. f_X(x) \geq 0 for all x
  2. \int_{-\infty}^{\infty} f_X(x)\, dx = 1
  3. For any a \leq b: P(a \leq X \leq b) = \int_a^b f_X(x)\, dx

Critical distinction: The PDF f_X(x) is NOT a probability. It is a probability density - probability per unit length. In particular, f_X(x) can exceed 1 (though the total integral must equal 1).

Example: f_X(x) = 2x for x \in [0,1] and 0 elsewhere. This is a valid PDF: it is non-negative and integrates to \int_0^1 2x\,dx = [x^2]_0^1 = 1. Note f_X(0.9) = 1.8 > 1, but this is fine - it is a density, not a probability.

Probability at a point: P(X = x) = \int_x^x f(t)\,dt = 0. This is why conditioning on a continuous event requires care and is handled via conditional densities (Section03).
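
The example above can be verified numerically (a sketch; the midpoint-rule quadrature and grid size are implementation choices):

```python
def f(x: float) -> float:
    """Candidate PDF from the example: f(x) = 2x on [0, 1], 0 elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def integrate(fn, a: float, b: float, n: int = 100_000) -> float:
    """Midpoint-rule quadrature of fn over [a, b]."""
    h = (b - a) / n
    return sum(fn(a + (i + 0.5) * h) for i in range(n)) * h

total = integrate(f, 0.0, 1.0)  # normalisation check: should be 1
# The density exceeds 1 near x = 1, yet the total integral is still 1
```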

Non-examples of valid PDFs:

  • f(x) = x on [0,1]: \int_0^1 x\,dx = 1/2 \neq 1 (not normalised)
  • f(x) = -e^{-x} for x \geq 0: negative values
  • f(x) = 1 for x \in [0,2]: \int_0^2 1\,dx = 2 \neq 1

7.2 From CDF to PDF: the Derivative Relationship

For a continuous random variable with PDF f_X and CDF F_X:

F_X(x) = \int_{-\infty}^x f_X(t)\, dt \quad \Longleftrightarrow \quad f_X(x) = \frac{d}{dx} F_X(x)

This is the Fundamental Theorem of Calculus applied to probability. The CDF accumulates density; the PDF is the rate of accumulation.

Using the CDF to compute probabilities:

P(a \leq X \leq b) = F_X(b) - F_X(a), \qquad P(X > x) = 1 - F_X(x), \qquad P(X \leq x) = F_X(x)

Since continuous X has P(X = a) = 0, the endpoints don't matter: P(a < X \leq b) = P(a \leq X \leq b) = P(a < X < b).

7.3 Uniform Distribution

The Uniform distribution is the maximum-entropy distribution over a bounded interval: all values are equally likely.

Definition 7.2 (Uniform distribution). X \sim \text{Uniform}(a, b) for a < b if:

f_X(x) = \begin{cases} \frac{1}{b-a} & a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}

CDF: F_X(x) = 0 for x < a; F_X(x) = (x-a)/(b-a) for a \leq x \leq b; F_X(x) = 1 for x > b.

Properties:

  • Mean: \mathbb{E}[X] = (a+b)/2 (the midpoint)
  • Variance: \operatorname{Var}(X) = (b-a)^2/12
  • Symmetry: f is symmetric about (a+b)/2

Derivation of variance:

\mathbb{E}[X^2] = \int_a^b x^2 \cdot \frac{1}{b-a}\, dx = \frac{1}{b-a} \cdot \frac{b^3 - a^3}{3} = \frac{a^2 + ab + b^2}{3}

\operatorname{Var}(X) = \frac{a^2+ab+b^2}{3} - \left(\frac{a+b}{2}\right)^2 = \frac{(b-a)^2}{12}
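
Both moments can be confirmed by simulation (sketch; a, b, the sample size, and the seed are arbitrary choices):

```python
import random

random.seed(2)
a, b = 2.0, 5.0
n = 200_000
xs = [random.uniform(a, b) for _ in range(n)]

mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n
# Theory: mean = (a+b)/2 = 3.5, variance = (b-a)^2/12 = 0.75
```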

ML applications of Uniform:

  • Weight initialisation (Xavier/Glorot): weights drawn from \text{Uniform}(-1/\sqrt{n}, 1/\sqrt{n}) to maintain activation variance across layers
  • Batch selection: selecting which training examples to include in a mini-batch (sampling uniformly without replacement)
  • Data augmentation: random cropping positions, rotation angles, and colour jitter magnitudes drawn from uniform distributions
  • Hyperparameter search: random search over hyperparameter ranges samples uniformly from the search space

7.4 Gaussian Distribution - Introduction

The Gaussian (Normal) distribution is the most important continuous distribution in all of probability and statistics, due to the Central Limit Theorem (Section06-Stochastic-Processes) and its mathematical tractability.

Definition 7.3 (Gaussian distribution). X \sim \mathcal{N}(\mu, \sigma^2) with mean \mu \in \mathbb{R} and variance \sigma^2 > 0 if:

f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R}

The standard normal Z \sim \mathcal{N}(0,1) has \mu = 0, \sigma^2 = 1:

\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}

Its CDF is \Phi(z) = \int_{-\infty}^z \phi(t)\, dt, which has no closed form but is tabulated and implemented in all numerical libraries.

Properties:

  • Mean: \mathbb{E}[X] = \mu
  • Variance: \operatorname{Var}(X) = \sigma^2
  • Symmetry: f(\mu - x) = f(\mu + x) - symmetric about \mu
  • 68-95-99.7 rule:
    • P(\mu - \sigma \leq X \leq \mu + \sigma) \approx 68.27\%
    • P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 95.45\%
    • P(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \approx 99.73\%
  • Standardisation: If X \sim \mathcal{N}(\mu, \sigma^2), then Z = (X-\mu)/\sigma \sim \mathcal{N}(0,1)
  • Affine stability: If X \sim \mathcal{N}(\mu, \sigma^2), then aX + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2)
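
The 68% band and the standardisation property can both be checked by simulation (sketch; mu, sigma, the sample size, and the seed are arbitrary choices):

```python
import random

random.seed(3)
mu, sigma = 10.0, 2.0
n = 200_000
xs = [random.gauss(mu, sigma) for _ in range(n)]

# 68-95-99.7 rule: fraction of samples within one standard deviation
within_1sigma = sum(mu - sigma <= x <= mu + sigma for x in xs) / n

# Standardisation: Z = (X - mu) / sigma should have mean 0 and variance 1
zs = [(x - mu) / sigma for x in xs]
z_mean = sum(zs) / n
z_var = sum((z - z_mean) ** 2 for z in zs) / n
```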

Why the Gaussian is ubiquitous:

  1. Central Limit Theorem (preview): The (suitably standardised) sum of many i.i.d. random variables with finite variance converges to a Gaussian - regardless of the original distribution. See Section06.
  2. Maximum entropy: Among all continuous distributions with fixed mean \mu and variance \sigma^2, the Gaussian maximises entropy (it is the "least informative" given these constraints)
  3. Conjugacy: A Gaussian prior is conjugate to a Gaussian likelihood under Bayesian updating - the posterior is again Gaussian

ML applications of the Gaussian:

  • Weight initialisation (He/Kaiming): weights drawn from \mathcal{N}(0, 2/n_{\text{in}}) for ReLU networks
  • Noise models: assuming residuals y - f_\theta(\mathbf{x}) \sim \mathcal{N}(0, \sigma^2) gives the mean squared error loss
  • VAE latent space: z \sim \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x)) with Gaussian prior \mathcal{N}(0, I)
  • Diffusion model forward process: x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon with \varepsilon \sim \mathcal{N}(0, I)
  • Score functions: the score \nabla_x \log p(x) of a Gaussian is linear: \nabla_x \log \mathcal{N}(x;\mu,\sigma^2) = -(x-\mu)/\sigma^2

Preview: Exponential, Beta, Gamma Distributions. The Exponential distribution models waiting times between Poisson events; the Beta models probabilities on [0,1] (conjugate prior for the Bernoulli); the Gamma generalises both. Full treatment: Common Distributions Section5, Section6, Section7.


8. Expectation and Variance - Foundations

8.1 Expected Value: Definition and Linearity

The expected value (or mean) of a random variable is its probability-weighted average - the "centre of mass" of its distribution.

Definition 8.1 (Expected value - discrete). For a discrete X with PMF p_X:

\mathbb{E}[X] = \sum_{x \in \mathcal{X}} x \cdot p_X(x)

(provided the sum converges absolutely: \sum_x |x|\, p_X(x) < \infty).

Definition 8.2 (Expected value - continuous). For a continuous X with PDF f_X:

\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f_X(x)\, dx

(provided the integral converges absolutely).

Examples:

  • Bernoulli(p): \mathbb{E}[X] = 1 \cdot p + 0 \cdot (1-p) = p
  • Uniform(0,1): \mathbb{E}[X] = \int_0^1 x \cdot 1\, dx = 1/2
  • \mathcal{N}(\mu, \sigma^2): \mathbb{E}[X] = \mu (by symmetry)
  • Fair die: \mathbb{E}[X] = (1+2+3+4+5+6)/6 = 3.5

Non-example (expectation undefined): The Cauchy distribution with PDF f(x) = 1/(\pi(1+x^2)) has \int_{-\infty}^\infty |x|/(\pi(1+x^2))\,dx = \infty, so \mathbb{E}[X] does not exist. This is why heavy-tailed distributions require care.

Theorem 8.1 (Linearity of expectation). For any random variables X, Y and constants a, b:

\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]

This holds whether or not X and Y are independent - linearity of expectation is unconditional.

Proof (discrete): \mathbb{E}[aX + bY] = \sum_{x,y}(ax+by)\,P(X=x, Y=y) = a\sum_x x \sum_y P(X=x,Y=y) + b\sum_y y \sum_x P(X=x,Y=y) = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]. \square

For AI: Linearity of expectation is why mini-batch gradient descent works. The gradient of the average loss over a batch equals the average of individual gradients:

\nabla_\theta\, \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\!\left[\mathcal{L}(\mathbf{x};\theta)\right] = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\!\left[\nabla_\theta\, \mathcal{L}(\mathbf{x};\theta)\right]

This follows from linearity of both expectation and differentiation.
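
A minimal numeric illustration with a toy quadratic loss (the batch values and theta here are arbitrary):

```python
# Toy per-example loss L(x; theta) = (theta - x)^2, so dL/dtheta = 2(theta - x)
batch = [1.0, 2.0, 4.0, 7.0]
theta = 0.5

# Average of the per-example gradients
grads = [2.0 * (theta - x) for x in batch]
avg_of_grads = sum(grads) / len(batch)

# Gradient of the average loss, computed analytically via linearity
mean_x = sum(batch) / len(batch)
grad_of_avg_loss = 2.0 * (theta - mean_x)
# The two quantities coincide exactly
```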

Expected value of a function: For g : \mathbb{R} \to \mathbb{R}:

\mathbb{E}[g(X)] = \sum_x g(x)\, p_X(x) \quad (\text{discrete}), \qquad \mathbb{E}[g(X)] = \int g(x)\, f_X(x)\, dx \quad (\text{continuous})

This is the Law of the Unconscious Statistician (LOTUS) - you can compute \mathbb{E}[g(X)] directly from the distribution of X, without first finding the distribution of g(X). (Full treatment in Section04.)

Warning: \mathbb{E}[g(X)] \neq g(\mathbb{E}[X]) in general. For example, \mathbb{E}[X^2] \neq (\mathbb{E}[X])^2 - the gap is exactly the variance. Jensen's inequality gives the direction: if g is convex, g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)].

8.2 Variance and Standard Deviation

Variance measures spread - how far a random variable typically deviates from its mean.

Definition 8.3 (Variance). The variance of X is:

\operatorname{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right]

Computational formula:

\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2

Proof: With \mu = \mathbb{E}[X]: \mathbb{E}[(X-\mu)^2] = \mathbb{E}[X^2 - 2\mu X + \mu^2] = \mathbb{E}[X^2] - 2\mu\,\mathbb{E}[X] + \mu^2 = \mathbb{E}[X^2] - \mu^2. \square

Definition 8.4 (Standard deviation). \sigma_X = \sqrt{\operatorname{Var}(X)}. This has the same units as X and \mathbb{E}[X].

Examples:

  • Bernoulli(p): \operatorname{Var}(X) = p - p^2 = p(1-p). Maximum at p = 1/2: \operatorname{Var}(X) = 1/4.
  • Uniform(a,b): \operatorname{Var}(X) = (b-a)^2/12
  • \mathcal{N}(\mu,\sigma^2): \operatorname{Var}(X) = \sigma^2 (by definition)
  • Constant c: \operatorname{Var}(c) = 0 (no randomness)

Properties of variance:

  1. \operatorname{Var}(X) \geq 0, with equality iff X is constant a.s.
  2. \operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X) (shifting by b doesn't change spread; scaling by a scales variance by a^2)
  3. \operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X,Y)
  4. If X \perp Y: \operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)
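
Properties 2 and 3 can be illustrated numerically (sketch; the constants and seed are arbitrary; taking Y = X gives a maximally correlated pair, so the naive sum rule fails):

```python
import random

random.seed(4)
n = 100_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]

def var(vs):
    m = sum(vs) / len(vs)
    return sum((v - m) ** 2 for v in vs) / len(vs)

# Property 2: Var(aX + b) = a^2 Var(X) -- exact even for the sample variance
a, b = 3.0, 7.0
ys = [a * x + b for x in xs]
ratio = var(ys) / var(xs)  # a^2 = 9

# Property 3 with Y = X (Cov(X, X) = Var(X)): Var(X + X) = 4 Var(X), not 2 Var(X)
doubled = [x + x for x in xs]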

For AI: Variance appears throughout ML:

  • Bias-variance tradeoff: test error = bias^2 + variance + irreducible noise. High-variance models overfit.
  • Gradient variance: the variance of the stochastic gradient estimator determines training stability. Variance reduction techniques (control variates, importance sampling) reduce this.
  • Batch normalisation: standardises activations to have mean 0 and variance 1 within each batch, stabilising training.
  • Uncertainty in predictions: the variance of a Gaussian output head models aleatoric uncertainty.

8.3 Key Properties and Common Pitfalls

Expectation is linear; variance is not. For independent X, Y:

  • \mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y] (independence gives factorisation)
  • \operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) (independence kills the covariance term)
  • But in general: \operatorname{Var}(X+Y) \neq \operatorname{Var}(X) + \operatorname{Var}(Y)

Jensen's inequality: For convex g:

g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)]

Examples: (\mathbb{E}[X])^2 \leq \mathbb{E}[X^2] (taking g(x) = x^2); e^{\mathbb{E}[X]} \leq \mathbb{E}[e^X] (taking g(x) = e^x). Jensen's inequality is used to derive the EM/variational lower bound (ELBO) and to prove the non-negativity of the KL divergence.

Expectation and independence: \mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y] holds if X \perp Y, but the converse is false: being uncorrelated does not imply independence.

8.4 Preview: Covariance, LOTUS, MGF

Preview: Covariance and Correlation. The covariance \operatorname{Cov}(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)] measures linear dependence between two random variables. The covariance matrix \Sigma of a random vector extends this to multiple dimensions. These are the core tools for analysing multivariate distributions.

-> Full treatment: Expectation and Moments Section3

Preview: Law of the Unconscious Statistician (LOTUS). LOTUS allows computing \mathbb{E}[g(X)] using the distribution of X directly: no change of variables required. It is the foundational tool for deriving moments and expected losses under a distribution.

-> Full treatment: Expectation and Moments Section2

Preview: Moment Generating Functions. The MGF M_X(t) = \mathbb{E}[e^{tX}] encodes all moments of X: \mathbb{E}[X^n] = M_X^{(n)}(0). MGFs are a key tool for proving the Central Limit Theorem and deriving tail bounds.

-> Full treatment: Expectation and Moments Section5


9. Transformations of Random Variables

9.1 Functions of a Random Variable

Given a random variable X and a function g : \mathbb{R} \to \mathbb{R}, the composition Y = g(X) is a new random variable. Computing the distribution of Y from the distribution of X is a central skill.

Discrete case. If X is discrete with PMF p_X and g is any function, then Y = g(X) is discrete with:

p_Y(y) = P(Y = y) = P(g(X) = y) = \sum_{x : g(x) = y} p_X(x)

Example. X \sim \text{Uniform}\{1,2,3,4,5,6\} (fair die), Y = X^2 \bmod 6. Then p_Y(1) = P(X^2 \equiv 1 \bmod 6) = P(X \in \{1, 5\}) = 1/3, and similarly for the other values of Y.

Continuous case - CDF method. For continuous X and Y = g(X), compute the CDF of Y:

F_Y(y) = P(Y \leq y) = P(g(X) \leq y) = P(X \in \{x : g(x) \leq y\})

then differentiate to get the PDF: f_Y(y) = F_Y'(y).

9.2 Change-of-Variables Formula

For monotone transformations, the CDF method yields a clean formula.

Theorem 9.1 (Change of variables - monotone, 1-D). Let X have PDF f_X and CDF F_X. Let g be differentiable and strictly monotone on the support of X, with inverse h = g^{-1}. Then Y = g(X) has PDF:

f_Y(y) = f_X(h(y)) \cdot \left|\frac{dh}{dy}\right| = f_X(h(y)) \cdot \left|\frac{1}{g'(h(y))}\right|

Derivation (increasing g):

F_Y(y) = P(Y \leq y) = P(g(X) \leq y) = P(X \leq h(y)) = F_X(h(y))

f_Y(y) = \frac{d}{dy}F_Y(y) = f_X(h(y)) \cdot h'(y) = f_X(h(y)) \cdot \frac{1}{g'(h(y))}

For decreasing g, F_Y(y) = 1 - F_X(h(y)), and differentiating gives a negative sign, cancelled by the absolute value in the formula.

Example. X \sim \mathcal{N}(0,1), Y = e^X (log-normal distribution). Here g(x) = e^x, h(y) = \ln y, h'(y) = 1/y. So:

f_Y(y) = \phi(\ln y) \cdot \frac{1}{y} = \frac{1}{\sqrt{2\pi}}\, e^{-(\ln y)^2/2} \cdot \frac{1}{y}, \quad y > 0

Example. X \sim \mathcal{N}(\mu, \sigma^2), Y = (X - \mu)/\sigma (standardisation). Here g(x) = (x-\mu)/\sigma, h(y) = \mu + \sigma y, h'(y) = \sigma:

f_Y(y) = f_X(\mu + \sigma y) \cdot \sigma = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-y^2/2} \cdot \sigma = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2} = \phi(y)

So Y \sim \mathcal{N}(0,1). \checkmark
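
The log-normal example can be sanity-checked by comparing the empirical CDF of exp(X) against the prediction F_Y(y) = Phi(ln y) (a Monte Carlo sketch; the evaluation point, sample size, and seed are arbitrary choices):

```python
import math
import random

random.seed(5)
n = 200_000
# Y = exp(X) with X ~ N(0, 1): the log-normal distribution
ys = [math.exp(random.gauss(0.0, 1.0)) for _ in range(n)]

# Change of variables predicts F_Y(y) = Phi(ln y); check empirically at y = 2
y0 = 2.0
empirical = sum(y <= y0 for y in ys) / n
analytic = 0.5 * (1.0 + math.erf(math.log(y0) / math.sqrt(2.0)))
```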

For AI: The change-of-variables formula is the mathematical foundation of normalising flows - generative models that learn a bijection g:RdRdg : \mathbb{R}^d \to \mathbb{R}^d transforming a simple base distribution (e.g., N(0,I)\mathcal{N}(0,I)) into a complex target distribution. The log-density of the transformed variable is:

\log p_Y(y) = \log p_X(g^{-1}(y)) + \log \left|\det J_{g^{-1}}(y)\right|

where Jg1J_{g^{-1}} is the Jacobian of the inverse transformation.

9.3 Standardisation and the Z-Score

Definition 9.1 (Z-score / standardisation). For X with mean \mu and standard deviation \sigma > 0:

Z = \frac{X - \mu}{\sigma}

Then \mathbb{E}[Z] = 0 and \operatorname{Var}(Z) = 1, regardless of the distribution of X.

Proof: \mathbb{E}[Z] = \mathbb{E}[(X-\mu)/\sigma] = (\mathbb{E}[X] - \mu)/\sigma = 0; \operatorname{Var}(Z) = \operatorname{Var}(X)/\sigma^2 = \sigma^2/\sigma^2 = 1. \square

Applications in ML:

  • Feature normalisation (preprocessing): standardise each feature so that gradients have comparable scale across dimensions, improving optimiser convergence
  • Batch normalisation (BN): standardises activations within each mini-batch, then applies learned scale and shift parameters (\gamma, \beta): \hat{x} = (x - \hat{\mu})/\hat{\sigma}, output = \gamma\hat{x} + \beta
  • Layer normalisation (LN): the same idea, but normalising across features rather than the batch dimension - standard in transformer models
  • RMS normalisation (RMSNorm): a simplified variant using only the root mean square and dropping the mean subtraction - used in LLaMA, Gemma, and Mistral models: \hat{x}_i = x_i / \text{RMS}(\mathbf{x})

10. Applications in Machine Learning

10.1 Cross-Entropy Loss as Negative Log-Likelihood

Setup. Suppose we observe data \mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\} drawn i.i.d. from the true distribution p_{\text{data}}(y \mid x). A model f_\theta with parameters \theta defines a predicted distribution p_\theta(y \mid x).

Maximum likelihood estimation (MLE). We choose \theta to maximise the probability assigned to the observed data:

\theta^* = \arg\max_\theta \prod_{i=1}^n p_\theta(y^{(i)} \mid x^{(i)}) \quad [\text{likelihood}]

Taking the logarithm (a monotone transformation, so the argmax is unchanged):

\theta^* = \arg\max_\theta \sum_{i=1}^n \log p_\theta(y^{(i)} \mid x^{(i)}) = \arg\min_\theta\; \underbrace{-\frac{1}{n}\sum_{i=1}^n \log p_\theta(y^{(i)} \mid x^{(i)})}_{\text{negative log-likelihood loss}}

For classification. With K classes and p_\theta(y \mid x) = \text{Categorical}(\text{softmax}(f_\theta(x)))_y:

-\log p_\theta(y \mid x) = -\log \text{softmax}(f_\theta(x))_y = -\log \frac{e^{z_y}}{\sum_j e^{z_j}} = \log\sum_j e^{z_j} - z_y

where z = f_\theta(x) are the logits. For a one-hot label \mathbf{y}, this is exactly the cross-entropy loss:

\mathcal{L}(\theta; x, y) = -\sum_{k=1}^K y_k \log p_\theta(k \mid x) = H(p_{\text{data}}, p_\theta)
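
The logsumexp form above is how the loss is computed stably in practice; a minimal sketch (the logits here are arbitrary illustrative values):

```python
import math

def cross_entropy(logits, label):
    """-log softmax(logits)[label], computed stably as logsumexp(z) - z_y."""
    m = max(logits)
    logsumexp = m + math.log(sum(math.exp(z - m) for z in logits))
    return logsumexp - logits[label]

loss = cross_entropy([2.0, 0.5, -1.0], label=0)
# Large logits don't overflow: exp is only applied to z - max(z) <= 0
safe = cross_entropy([1000.0, 0.0], label=0)
```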

For regression. If p_\theta(y \mid x) = \mathcal{N}(f_\theta(x), \sigma^2):

-\log p_\theta(y \mid x) = \frac{1}{2\sigma^2}(y - f_\theta(x))^2 + \text{const}

Minimising negative log-likelihood under a Gaussian noise model is therefore equivalent to minimising mean squared error.

For language modelling. Each token has distribution p_\theta(x_t \mid x_{<t}) = \text{Categorical}(\text{softmax}(z_t)). The NLL loss over a sequence of T tokens is:

\mathcal{L} = -\frac{1}{T}\sum_{t=1}^T \log p_\theta(x_t \mid x_{<t})

This is the standard pretraining loss for GPT-family models. The perplexity is \exp(\mathcal{L}) - lower perplexity means the model assigns higher probability to the actual next token.

10.2 Bayesian Inference: Prior, Likelihood, Posterior

Bayesian inference applies Bayes' theorem to update beliefs about parameters \theta after observing data \mathcal{D}:

\underbrace{p(\theta \mid \mathcal{D})}_{\text{posterior}} \propto \underbrace{p(\mathcal{D} \mid \theta)}_{\text{likelihood}} \cdot \underbrace{p(\theta)}_{\text{prior}}

The normalising constant p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta is the evidence or marginal likelihood.

MAP estimation. The Maximum A Posteriori (MAP) estimate is the mode of the posterior:

\theta^*_{\text{MAP}} = \arg\max_\theta \log p(\theta \mid \mathcal{D}) = \arg\max_\theta \left[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\right]

With a Gaussian prior p(\theta) = \mathcal{N}(0, \lambda^{-1}I), the \log p(\theta) term contributes -\lambda\|\theta\|^2/2 (up to a constant) - i.e., L2 regularisation (weight decay). MAP estimation with a Gaussian prior is exactly regularised MLE.

For AI: Bayesian thinking is increasingly central to modern AI:

  • Model uncertainty (epistemic): the posterior over models captures uncertainty from limited data - Bayesian deep learning approximates this posterior
  • RLHF reward model: human preferences give a likelihood P(\text{human prefers } y_1 \text{ over } y_2 \mid r_1, r_2); updating the reward model is a posterior update
  • In-context learning: some interpretations frame few-shot ICL as implicit Bayesian updating over hypotheses about the task

10.3 Probabilistic Language Models

A language model defines a probability distribution over sequences of tokens. Given vocabulary \mathcal{V}, a sequence \mathbf{x} = (x_1, \ldots, x_T) is assigned probability:

P(\mathbf{x}) = P(x_1) \cdot P(x_2 \mid x_1) \cdot P(x_3 \mid x_1, x_2) \cdots P(x_T \mid x_1, \ldots, x_{T-1}) = \prod_{t=1}^T P(x_t \mid x_{<t})

by the chain rule of probability (Section 3.4). Each conditional P(x_t \mid x_{<t}) is a Categorical distribution over |\mathcal{V}| tokens, produced by the transformer's softmax output.

Sampling strategies as distribution operations:

  • Greedy decoding: always take \arg\max P(x_t \mid x_{<t}) - equivalent to sampling from a degenerate point mass
  • Temperature scaling: replace logits z with z/\tau; \tau \to 0 approaches greedy, \tau \to \infty approaches uniform
  • Top-k sampling: restrict to the k most probable tokens and renormalise - truncates the distribution
  • Top-p (nucleus) sampling: restrict to the smallest set of tokens whose cumulative probability is \geq p and renormalise - adapts the cutoff to the entropy of the distribution
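
These strategies are small operations on a Categorical PMF; a minimal sketch of temperature scaling, top-k truncation, and inverse-CDF sampling (the logits, k, and seed are arbitrary illustrative choices):

```python
import math
import random

def softmax(z):
    """Stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def temperature(logits, tau):
    """Sharpen (tau < 1) or flatten (tau > 1) the next-token distribution."""
    return softmax([z / tau for z in logits])

def top_k(probs, k):
    """Zero out all but the k most probable tokens, then renormalise."""
    keep = set(sorted(range(len(probs)), key=lambda i: -probs[i])[:k])
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def sample(probs, rng):
    """Inverse-CDF sampling from a Categorical distribution."""
    u, c = rng.random(), 0.0
    for i, p in enumerate(probs):
        c += p
        if u <= c:
            return i
    return len(probs) - 1

rng = random.Random(0)
logits = [2.0, 1.0, 0.0, -1.0]
p_sharp = temperature(logits, 0.5)   # sharper than the tau = 1 distribution
p_trunc = top_k(softmax(logits), 2)  # only the two most probable tokens survive
token = sample(p_trunc, rng)
```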

Bits-per-character (BPC) and perplexity: The quality of a language model is measured by how much probability mass it assigns to held-out text. Perplexity = \exp\!\left(-\frac{1}{T}\sum_t \log P(x_t \mid x_{<t})\right). A perplexity of k means the model is "as uncertain as a uniform distribution over k choices" at each step. BPC is the same negative log-likelihood expressed in base-2 logarithms per character rather than per token.

10.4 Sources of Randomness in Neural Network Training

Training a neural network involves multiple distinct sources of randomness, each with a different probability distribution:

| Source | Distribution | When it occurs |
|---|---|---|
| Weight initialisation | \mathcal{N}(0, \sigma^2) or Uniform | Once, at start |
| Mini-batch selection | Uniform (without replacement) | Every step |
| Dropout masks | \text{Bernoulli}(1-p_{\text{drop}})^d | Every forward pass |
| Data augmentation | Uniform (flips, crop positions), Gaussian (colour jitter) | Every sample |
| Label smoothing | Convex combination of Categorical and Uniform | During loss computation |
| Noise injection (DP-SGD) | \mathcal{N}(0, \sigma^2 C^2) (Gaussian noise added to clipped gradients) | Every gradient update |
| Sampling from generative models | Model-defined (Categorical, Gaussian, etc.) | Inference |

Each source of randomness introduces variance into the training process. Understanding their distributions is essential for:

  • Reproducibility (fixing random seeds)
  • Differential privacy analysis (noise calibration)
  • Variance reduction in gradient estimation
  • Debugging training instabilities (distinguishing random vs systematic failures)
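A sketch of how fixing one seed pins down every distribution in the table above; the shapes and hyperparameters here are illustrative, not from a particular framework:

```python
import numpy as np

rng = np.random.default_rng(42)      # one seed -> every draw below is reproducible

d, batch_size, n_data, p_drop = 64, 8, 1000, 0.1
W = rng.normal(0.0, np.sqrt(2.0 / d), size=(d, d))          # Gaussian weight init (once)
batch = rng.choice(n_data, size=batch_size, replace=False)  # uniform mini-batch, no replacement
mask = rng.binomial(1, 1 - p_drop, size=d)                  # Bernoulli(1 - p_drop) dropout mask
dp_noise = rng.normal(0.0, 0.5, size=d)                     # DP-SGD noise, N(0, sigma^2 C^2)

# Re-seeding replays the identical randomness - the basis of reproducible training runs.
rng2 = np.random.default_rng(42)
assert np.allclose(W, rng2.normal(0.0, np.sqrt(2.0 / d), size=(d, d)))
```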

11. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
| --- | --- | --- | --- |
| 1 | Confusing P(A \mid B) with P(B \mid A) | These are completely different quantities (prosecutor's fallacy, base rate neglect) | Always identify which is the "given" and which is the "unknown" before applying Bayes' theorem |
| 2 | Treating f_X(x) as a probability | The PDF is a density - it can exceed 1, and individual point probabilities are always 0 for continuous X | Probabilities come from integrating the PDF: P(a \leq X \leq b) = \int_a^b f(x)dx |
| 3 | Assuming independence from \mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] | Uncorrelated does not imply independent; independence is strictly stronger | Use the definition: X \perp Y \iff P(X \leq x, Y \leq y) = P(X \leq x)P(Y \leq y) for all x, y |
| 4 | Applying Axiom 3 to non-disjoint events | P(A \cup B) = P(A) + P(B) holds only when A \cap B = \emptyset | Use inclusion-exclusion: P(A \cup B) = P(A) + P(B) - P(A \cap B) |
| 5 | Forgetting to check normalisation when constructing a PMF/PDF | An un-normalised function cannot be a PMF or PDF | Always verify \sum_x p(x) = 1 (discrete) or \int f(x)dx = 1 (continuous) |
| 6 | Using \operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) without checking independence | Variance is additive only for uncorrelated (not just any) random variables | Add the covariance term: \operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X,Y) |
| 7 | Confusing \mathbb{E}[g(X)] with g(\mathbb{E}[X]) | Jensen's inequality shows these differ whenever g is nonlinear | Compute \mathbb{E}[g(X)] using LOTUS: \mathbb{E}[g(X)] = \sum_x g(x)p(x) or \int g(x)f(x)dx |
| 8 | Confusing pairwise independence with mutual independence | Events can be pairwise independent but not mutually independent (Bernstein example) | For n > 2 events, verify independence for all 2^n - n - 1 subsets of size at least two, or use a structural argument |
| 9 | Assuming P(A \cap B) = 0 when A and B are independent | Independence means P(A \cap B) = P(A)P(B); this is zero only when P(A) = 0 or P(B) = 0 | Independence \neq mutual exclusivity. Disjoint events (A \cap B = \emptyset) with positive probability are necessarily dependent |
| 10 | Applying the change-of-variables formula without the absolute value of the Jacobian | For decreasing transformations, the Jacobian dh/dy is negative; omitting the absolute value gives a negative PDF | Always write f_Y(y) = f_X(h(y)) \cdot \lvert dh/dy \rvert |
| 11 | Concluding X and Y are independent from a correlation of zero in non-Gaussian data | Zero correlation means uncorrelated, which does NOT imply independence for non-Gaussian distributions | Independence implies zero correlation, but not vice versa. Check independence directly or use copulas |
| 12 | Treating conditional probability as symmetric: P(A \mid B) = P(B \mid A) | P(A \mid B) = P(B \mid A) only if P(A) = P(B) (from Bayes' theorem with equal priors) | Apply Bayes' theorem to convert between the two: P(B \mid A) = P(A \mid B) P(B) / P(A) |

12. Exercises

Exercise 1 * - Axiom Derivations Using only the three Kolmogorov axioms, prove each of the following: (a) P(\emptyset) = 0 (b) P(A^c) = 1 - P(A) (c) P(A \cup B) = P(A) + P(B) - P(A \cap B) (inclusion-exclusion for two events) (d) If A \subseteq B then P(A) \leq P(B) (e) 0 \leq P(A) \leq 1 for all events A

Each proof should cite exactly which axiom(s) and which previously proved results it uses.

Exercise 2 * - Bayes' Theorem Application A medical test for a disease has sensitivity 95% (true positive rate) and specificity 98% (true negative rate). The disease affects 0.5% of the population. (a) Define events D (has disease) and T (tests positive). State all given probabilities. (b) Compute P(T) using the law of total probability. (c) Compute P(D \mid T) using Bayes' theorem. (d) Explain why the result might surprise a clinician unfamiliar with Bayes' theorem. (e) How would the answer change if the disease prevalence were 10% instead of 0.5%? Compute and interpret.

Exercise 3 * - PMF Construction and Verification A biased die has P(X = k) = c \cdot k for k \in \{1, 2, 3, 4, 5, 6\}. (a) Find the normalisation constant c. (b) Verify this is a valid PMF. (c) Compute \mathbb{E}[X]. (d) Compute \operatorname{Var}(X). (e) What is P(X \geq 4)? Compare to the fair die.

Exercise 4 ** - CDF Analysis Let X have CDF F(x) = 1 - e^{-\lambda x} for x \geq 0 and F(x) = 0 for x < 0 (Exponential distribution). (a) Show this is a valid CDF (check all four properties from Theorem 5.1). (b) Find the PDF f(x) by differentiating the CDF. (c) Compute P(X > t) for any t > 0. (d) Show the memoryless property: P(X > s+t \mid X > s) = P(X > t) for all s, t > 0. (e) For \lambda = 1: compute P(1 \leq X \leq 2) exactly and numerically.

Exercise 5 ** - Independence and Conditional Independence Roll two fair dice, letting X be the result of die 1 and Y the result of die 2. Define Z = X + Y (the sum). (a) Verify that X and Y are independent. (b) Show that X and Z are not independent (hint: compute P(X=1, Z=2) and compare to P(X=1)P(Z=2)). (c) Now condition on the event \{Z = 7\}. Compute P(X=1, Y=6 \mid Z=7) and compare it to P(X=1 \mid Z=7) \cdot P(Y=6 \mid Z=7) to show that X and Y are not conditionally independent given Z = 7. (d) Explain in words why knowing Z induces dependence between X and Y.

Exercise 6 ** - Change of Variables Let X \sim \text{Uniform}(0, 1). (a) Find the PDF of Y = -\ln X. What distribution is this? (b) Find the PDF of W = X^2. (c) Find the CDF and PDF of V = \sqrt{X}. (d) If U \sim \text{Uniform}(0,1), show that F_X^{-1}(U) has the same distribution as X for any CDF F_X (the inverse CDF / quantile transform method). (e) How is part (d) used in sampling from arbitrary distributions in ML?

Exercise 7 *** - Cross-Entropy Loss Derivation Consider a K-class classification problem with model p_\theta(y \mid \mathbf{x}) = \text{softmax}(f_\theta(\mathbf{x}))_y. (a) Write the negative log-likelihood for a single example (\mathbf{x}, y^*). (b) Show this equals the cross-entropy H(p_{\text{true}}, p_\theta) where p_{\text{true}} is the one-hot distribution. (c) Show the gradient of the NLL w.r.t. the logits \mathbf{z} = f_\theta(\mathbf{x}) is \text{softmax}(\mathbf{z}) - \mathbf{e}_{y^*} (softmax output minus one-hot target). (d) Explain why label smoothing replaces the one-hot p_{\text{true}} with (1-\varepsilon)\mathbf{e}_{y^*} + \varepsilon/K \cdot \mathbf{1}, and how this changes the loss. (e) For language modelling with vocabulary size |\mathcal{V}| = 32000: if the model assigns probability 0.8 to the correct next token, what is the NLL? The perplexity?

Exercise 8 *** - Probabilistic Analysis of Dropout During training, dropout independently sets each of d activations to 0 with probability p_{\text{drop}}, and scales the remaining by 1/(1-p_{\text{drop}}). (a) Let M_i \sim \text{Bernoulli}(1-p_{\text{drop}}) be the mask for activation i. What is \mathbb{E}[M_i]? (b) Let h_i be the pre-dropout activation. Define \tilde{h}_i = M_i \cdot h_i / (1-p_{\text{drop}}). Show \mathbb{E}[\tilde{h}_i] = h_i (inverted dropout preserves the expected activation). (c) Compute \operatorname{Var}(\tilde{h}_i) as a function of h_i and p_{\text{drop}}. (d) The output of a layer is s = \sum_{i=1}^d \tilde{h}_i. Assuming the masks M_i are independent across neurons, what is \operatorname{Var}(s)? (e) Explain why high p_{\text{drop}} increases gradient variance and can destabilise training. At what p_{\text{drop}} is variance maximised?


13. Why This Matters for AI (2026 Perspective)

| Concept | Impact on AI/LLMs |
| --- | --- |
| Kolmogorov axioms | The foundation of every probabilistic model. MLE, MAP, VAEs, diffusion models, Bayesian NNs - all rest on these three axioms. Knowing the axioms lets you verify whether a proposed probability model is well-defined. |
| Conditional probability | The core operation in Bayesian inference and autoregressive language modelling. Every P(x_t \mid x_{<t}) in a language model is a conditional probability. Causal attention implements conditional independence structure. |
| Bayes' theorem | Underlies RLHF (updating reward model beliefs from human feedback), Bayesian hyperparameter search, and the theoretical analysis of in-context learning as implicit Bayesian inference. |
| Independence | The i.i.d. assumption justifies mini-batch gradient descent. Conditional independence defines the graphical structure of Bayesian networks and the Markov property of diffusion models. |
| PMF / Categorical distribution | Every token prediction in a language model is a Categorical PMF. Softmax produces a valid PMF over the \lvert\mathcal{V}\rvert tokens of the vocabulary. |
| Gaussian distribution | Used in weight initialisation (He, Glorot), VAE latent spaces, diffusion forward processes, Gaussian process priors, and noise injection for differential privacy. The CLT (Section06) explains its ubiquity. |
| Expected value | The training objective is an expected loss \mathbb{E}_{\mathbf{x},y}[\mathcal{L}(\theta)] over the data distribution. Linearity of expectation justifies mini-batch gradient estimation. |
| Variance | Gradient variance determines training stability. BatchNorm and LayerNorm are variance-control mechanisms. Dropout variance analysis guides the choice of dropout rate. |
| Change of variables | Foundation of normalising flows (Glow, RealNVP, neural spline flows). The change-of-variables formula computes the density of a transformed distribution. |
| Negative log-likelihood | The universal training objective. Cross-entropy loss (classification), MSE (regression under Gaussian noise), and the ELBO (VAEs) all derive from minimising NLL. |

Conceptual Bridge

This section establishes the probability-theoretic language that all subsequent sections build upon. We began from the Kolmogorov axioms - three simple rules that define what probabilities must be - and derived the core computational rules (complement, union, conditioning, total probability, Bayes) that allow us to reason under uncertainty. The random variable formalism converts outcomes to numbers, enabling the full machinery of analysis.

Looking back: The section draws on set theory (union, intersection, complement - assumed), on calculus for the continuous case (integration for CDFs and PDFs, differentiation for the fundamental theorem), and on the concept of a function (random variables are functions). The proof techniques (direct calculation from axioms, algebraic manipulation) mirror those in linear algebra and calculus sections of this course.

Looking forward - within this chapter:

  • Section02-Common-Distributions takes the distribution vocabulary started here (Bernoulli, Uniform, Gaussian) and completes it: Binomial, Poisson, Exponential, Beta, Dirichlet, Categorical, Student-t, plus the exponential family.
  • Section03-Joint-Distributions extends the single-variable machinery to multiple random variables: joint PDFs, marginalisation, and Bayes' theorem for random variables rather than events.
  • Section04-Expectation-and-Moments develops expectation much more deeply: LOTUS, covariance matrices, MGFs, characteristic functions, and the full moment theory.
  • Section05-Concentration-Inequalities proves how expectations and variances control tail probabilities (Markov, Chebyshev, Hoeffding) - the mathematical tools for generalisation guarantees.
  • Section06-Stochastic-Processes introduces the Central Limit Theorem (explaining Gaussian ubiquity) and Gaussian processes (infinite-dimensional generalisations).
  • Section07-Markov-Chains uses conditional independence (Section4.2) and Bayes' theorem (Section3.6) to define memoryless sequential processes and MCMC sampling.

Looking forward - across chapters:

  • Chapter 7 (Statistics) builds estimation theory (MLE, MAP, confidence intervals) directly on the probability foundation here.
  • Chapter 8 (Optimisation) uses the Gaussian noise model (Section7.4) and NLL loss (Section10.1) to motivate SGD and Adam.
  • Chapter 9 (Information Theory) defines entropy as -\mathbb{E}[\log p(X)] and KL divergence in terms of two probability measures - direct extensions of Section8.1 and Section10.1.
POSITION IN CURRICULUM
========================================================================

  Section04 Calculus Fundamentals
  Section05 Multivariate Calculus
  |
  +-- Section06 Probability Theory
      +-- 01-Introduction-and-Random-Variables   <== YOU ARE HERE
      |     (probability spaces, RVs, CDF/PDF/PMF, E[X], Var[X])
      +-- 02-Common-Distributions
      |     (Binomial, Poisson, Gaussian, Beta, Dirichlet, ...)
      +-- 03-Joint-Distributions
      |     (joint PDF, marginalisation, Bayes for RVs)
      +-- 04-Expectation-and-Moments
      |     (LOTUS, covariance, MGF, higher moments)
      +-- 05-Concentration-Inequalities
      |     (Markov, Chebyshev, Hoeffding, PAC bounds)
      +-- 06-Stochastic-Processes
      |     (LLN, CLT, Gaussian processes)
      +-- 07-Markov-Chains
            (MCMC, detailed balance, stationary distributions)

========================================================================

<- Back to Probability Theory | Next: Common Distributions ->


Appendix A: The Gaussian Normalisation Integral

The fact that \int_{-\infty}^{\infty} e^{-x^2/2}\, dx = \sqrt{2\pi} is not obvious. Here is the classical proof using a two-dimensional trick.

Claim: I = \int_{-\infty}^{\infty} e^{-x^2/2}\, dx = \sqrt{2\pi}.

Proof (Gaussian integral via polar coordinates):

I^2 = \left(\int_{-\infty}^{\infty} e^{-x^2/2}\, dx\right)\left(\int_{-\infty}^{\infty} e^{-y^2/2}\, dy\right) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)/2}\, dx\, dy

Convert to polar coordinates (r, \theta) with x = r\cos\theta, y = r\sin\theta, dx\, dy = r\, dr\, d\theta:

I^2 = \int_0^{2\pi}\int_0^{\infty} e^{-r^2/2}\, r\, dr\, d\theta = 2\pi \int_0^{\infty} r e^{-r^2/2}\, dr

Let u = r^2/2, du = r\,dr:

I^2 = 2\pi \int_0^{\infty} e^{-u}\, du = 2\pi \cdot 1 = 2\pi

Therefore I = \sqrt{2\pi} (since I > 0). \square

This means the normalisation constant 1/\sqrt{2\pi\sigma^2} in the Gaussian PDF is exactly what is needed to make the integral equal 1. The proof that the Gaussian with any \mu, \sigma^2 integrates to 1 follows by the substitution z = (x-\mu)/\sigma.
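The claim is easy to confirm numerically; a simple Riemann sum over a wide grid reproduces \sqrt{2\pi} (the tails beyond |x| = 10 are negligible):

```python
import numpy as np

# Numerical check of the Gaussian integral: ∫ e^{-x^2/2} dx = sqrt(2*pi).
x = np.linspace(-10.0, 10.0, 200_001)
integral = np.sum(np.exp(-x**2 / 2)) * (x[1] - x[0])
print(integral, np.sqrt(2 * np.pi))   # both ≈ 2.5066
```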


Appendix B: Discrete Probability - Worked Examples in Full

B.1 The Birthday Problem

Question: In a room of n people, what is the probability that at least two share a birthday (assuming 365 days, uniform distribution, no leap years)?

This is a classic application of the complement rule.

Let A = "at least two people share a birthday" and A^c = "all birthdays are distinct."

P(A^c) = \frac{365}{365} \cdot \frac{364}{365} \cdot \frac{363}{365} \cdots \frac{365-n+1}{365} = \frac{365!/(365-n)!}{365^n}

The probability of at least one shared birthday:

P(A) = 1 - P(A^c) = 1 - \prod_{k=0}^{n-1}\frac{365-k}{365}

| n | P(at least one match) |
| --- | --- |
| 10 | 11.7% |
| 23 | 50.7% (crossover point) |
| 50 | 97.0% |
| 70 | 99.9% |
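The table values follow directly from the complement-rule product; a short script to reproduce them:

```python
import numpy as np

def p_shared_birthday(n, days=365):
    """P(at least two of n people share a birthday), via P(A) = 1 - P(A^c)."""
    k = np.arange(n)
    return 1.0 - np.prod((days - k) / days)

for n in (10, 23, 50, 70):
    print(n, round(float(p_shared_birthday(n)), 3))
# n = 23 is the crossover: the first n for which the probability exceeds 1/2
```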

For AI: The birthday paradox is closely related to hash collision probability and the birthday attack in cryptography. In machine learning, it appears in:

  • Retrieval collision: The probability that two embeddings in a high-dimensional space are closer than a threshold depends on how many embeddings exist - birthday-paradox type reasoning
  • Token collision in tokenisation: In BPE tokenisation, character pairs collide and merge; the frequency of collisions follows birthday-paradox dynamics
  • Data deduplication: Estimating the probability of exact or near-duplicate examples in large training datasets

B.2 Geometric Series and PMF Normalisation

Many PMFs involve geometric series. Recall:

\sum_{k=0}^{\infty} r^k = \frac{1}{1-r}, \quad |r| < 1

Application to Geometric PMF: P(X = k) = (1-p)^{k-1}p for k = 1, 2, 3, \ldots

\sum_{k=1}^{\infty}(1-p)^{k-1}p = p\sum_{j=0}^{\infty}(1-p)^j = p \cdot \frac{1}{1-(1-p)} = \frac{p}{p} = 1 \checkmark

Computing the mean of the Geometric via differentiation of the generating function:

\mathbb{E}[X] = \sum_{k=1}^{\infty}k(1-p)^{k-1}p = p\sum_{k=1}^{\infty}k(1-p)^{k-1} = p \cdot \frac{d}{dq}\sum_{k=0}^{\infty}q^k\bigg|_{q=1-p} = p \cdot \frac{d}{dq}\frac{1}{1-q}\bigg|_{q=1-p} = p \cdot \frac{1}{(1-q)^2}\bigg|_{q=1-p} = p \cdot \frac{1}{p^2} = \frac{1}{p}

The expected wait time is 1/p - intuitively, if each trial has probability p of success, on average you need 1/p trials.
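A Monte Carlo check of \mathbb{E}[X] = 1/p; numpy's geometric sampler uses the same "number of trials until first success" convention as the PMF above:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.25
samples = rng.geometric(p, size=200_000)   # X = trial index of the first success
print(samples.mean())                      # ≈ 1/p = 4
```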


Appendix C: Continuous Distributions - Computing Moments Directly

C.1 Gaussian Mean and Variance from First Principles

Mean of \mathcal{N}(\mu, \sigma^2): By symmetry of the Gaussian PDF about \mu:

\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)}\, dx

Substitute z = (x-\mu)/\sigma so x = \mu + \sigma z and dx = \sigma\, dz:

= \int_{-\infty}^{\infty} (\mu + \sigma z) \cdot \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\, dz = \mu \underbrace{\int_{-\infty}^{\infty}\phi(z)\,dz}_{=1} + \sigma\underbrace{\int_{-\infty}^{\infty} z\phi(z)\, dz}_{=0\ (\text{odd integrand})} = \mu

Variance of \mathcal{N}(\mu, \sigma^2): With the same substitution, \operatorname{Var}(X) = \mathbb{E}[(X-\mu)^2] = \sigma^2 \mathbb{E}[Z^2] where Z \sim \mathcal{N}(0,1).

\mathbb{E}[Z^2] = \int_{-\infty}^{\infty} z^2 \cdot \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\, dz

Integrate by parts: u = z, dv = z e^{-z^2/2}\, dz, so v = -e^{-z^2/2}:

= \frac{1}{\sqrt{2\pi}}\left[-ze^{-z^2/2}\right]_{-\infty}^{\infty} + \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-z^2/2}\, dz = 0 + 1 = 1

Therefore \operatorname{Var}(X) = \sigma^2 \cdot 1 = \sigma^2. \square

C.2 Uniform Distribution Variance - Direct Computation

For X \sim \text{Uniform}(a,b):

\mathbb{E}[X^2] = \int_a^b x^2 \cdot \frac{1}{b-a}\, dx = \frac{1}{b-a}\cdot\frac{x^3}{3}\bigg|_a^b = \frac{b^3-a^3}{3(b-a)} = \frac{a^2+ab+b^2}{3}

using the factorisation b^3 - a^3 = (b-a)(b^2+ab+a^2).

\operatorname{Var}(X) = \frac{a^2+ab+b^2}{3} - \frac{(a+b)^2}{4} = \frac{4(a^2+ab+b^2) - 3(a+b)^2}{12} = \frac{(b-a)^2}{12}
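Both moments are easy to sanity-check by simulation; a quick comparison for a = 2, b = 5 (values chosen arbitrarily):

```python
import numpy as np

a, b = 2.0, 5.0
analytic_var = (b - a) ** 2 / 12                 # = 0.75
rng = np.random.default_rng(1)
x = rng.uniform(a, b, size=500_000)
print(x.var(), analytic_var)                     # empirical ≈ analytic
print((x**2).mean(), (a*a + a*b + b*b) / 3)      # E[X^2] check: both ≈ 13
```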

Appendix D: Key Inequalities and Their Probability Theory Roots

Several important inequalities follow directly from the basic probability theory developed in this section:

D.1 Markov's Inequality (Preview)

For any non-negative random variable X and t > 0:

P(X \geq t) \leq \frac{\mathbb{E}[X]}{t}

Proof sketch: \mathbb{E}[X] = \mathbb{E}[X \cdot \mathbf{1}_{X \geq t}] + \mathbb{E}[X \cdot \mathbf{1}_{X < t}] \geq \mathbb{E}[X \cdot \mathbf{1}_{X \geq t}] \geq t \cdot P(X \geq t).

This connects the expected value (Section8.1) to tail probabilities. Full treatment with applications to PAC bounds: Section05-Concentration-Inequalities.

D.2 Jensen's Inequality

For a convex function g and random variable X:

g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)]

Examples:

  • g(x) = x^2 (convex): (\mathbb{E}[X])^2 \leq \mathbb{E}[X^2] - the computational variance formula
  • g(x) = e^x (convex): e^{\mathbb{E}[X]} \leq \mathbb{E}[e^X] - used in exponential tail bounds
  • g(x) = -\log x (convex for x > 0): -\log \mathbb{E}[X] \leq \mathbb{E}[-\log X] - used in ELBO derivation for VAEs
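Jensen's inequality holds for any distribution, including the empirical distribution of a sample, so all three examples can be checked directly (Uniform(0,1) samples keep each expectation finite):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=100_000)

for g in (np.square, np.exp, lambda v: -np.log(v)):
    # g(E[X]) <= E[g(X)] for each convex g above; exact for the empirical measure too
    assert g(X.mean()) <= g(X).mean() + 1e-12

print(X.mean()**2, np.square(X).mean())   # ≈ 0.25 vs ≈ 1/3: the gap is Var(X)
```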

D.3 Cauchy-Schwarz for Expectations

|\mathbb{E}[XY]|^2 \leq \mathbb{E}[X^2] \cdot \mathbb{E}[Y^2]

This bounds the correlation between random variables and is used to prove the variance of sums and covariance properties in Section04.


Appendix E: Probability in High Dimensions - A Preview

All of the distributions introduced in this section are one-dimensional (X : \Omega \to \mathbb{R}). In machine learning, we almost always work with high-dimensional data - image pixels, word embeddings, weight tensors. High-dimensional probability has surprising and sometimes counterintuitive properties:

E.1 The Curse of Dimensionality for Probability

In d dimensions, a unit hypercube [0,1]^d has volume 1. But an inscribed hypersphere of radius 0.5 has volume:

V_d = \frac{\pi^{d/2}}{\Gamma(d/2+1)} \cdot 0.5^d

For d = 2: V_2 = \pi/4 \approx 0.785. For d = 10: V_{10} \approx 0.0025. For d = 100: essentially 0.

Almost all the volume of a high-dimensional cube is in its corners, not near its centre. A uniform distribution over the hypercube concentrates most of its mass far from the centre - which means nearest-neighbour methods based on Euclidean distance break down.

E.2 Gaussian Concentration in High Dimensions

For \mathbf{z} \sim \mathcal{N}(0, I_d) (standard Gaussian in d dimensions):

\mathbb{E}[\|\mathbf{z}\|^2] = d, \quad \operatorname{Var}(\|\mathbf{z}\|^2) = 2d

By the law of large numbers: \|\mathbf{z}\|^2/d \to 1 as d \to \infty (concentrates on the sphere of radius \sqrt{d}). This means:

  • High-dimensional Gaussian samples are approximately uniformly distributed on the \sqrt{d}-sphere
  • The dot product \mathbf{z}_1 \cdot \mathbf{z}_2 \approx 0 for two independent samples (random vectors are nearly orthogonal)

For AI: This explains why random embedding matrices in transformers initialised with \mathcal{N}(0, 1/d) produce approximately orthogonal row vectors in high dimensions - a useful inductive bias for attention.
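Both concentration facts are visible in a short simulation (d = 10,000 chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
z1 = rng.standard_normal(d)
z2 = rng.standard_normal(d)

print(np.dot(z1, z1) / d)   # ≈ 1: the squared norm concentrates around d
cos = np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))
print(cos)                  # ≈ 0: independent directions are nearly orthogonal, O(1/sqrt(d))
```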

These high-dimensional considerations are fully developed in Section03-Joint-Distributions (multivariate Gaussian) and Section05-Concentration-Inequalities (concentration of measure).


Appendix F: Information Theory Preview - Entropy and KL Divergence

This appendix previews two information-theoretic quantities that appear constantly in ML loss functions. The full derivations using the expectation operator belong in Section04 (Expectation and Moments) and Chapter 9 (Information Theory).

F.1 Entropy: Measuring Uncertainty

The Shannon entropy of a discrete random variable X with PMF p is:

H(X) = -\sum_{x} p(x) \log p(x)

Entropy measures the average "surprise" or uncertainty in X. Higher entropy = more spread-out distribution = more uncertainty.

Key values:

  • Bernoulli(p): H = -p\log p - (1-p)\log(1-p). Maximum H = \log 2 \approx 0.693 nats at p = 0.5 (fair coin). Zero when p = 0 or p = 1 (certain outcome).
  • Uniform on n values: H = \log n - this is the maximum entropy distribution for a discrete variable with n outcomes.
  • Deterministic: H = 0 - no uncertainty.
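The three key values can be confirmed with a small helper (0 log 0 is taken as 0 by convention):

```python
import numpy as np

def entropy(pmf):
    """Shannon entropy in nats; terms with p(x) = 0 contribute nothing."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

print(entropy([0.5, 0.5]))      # fair coin: log 2 ≈ 0.693
print(entropy([0.25] * 4))      # uniform on 4 values: log 4 ≈ 1.386
print(entropy([1.0, 0.0]))      # deterministic: 0.0
```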

For continuous random variables, the differential entropy is:

h(X) = -\int p(x) \log p(x) \, dx

The Gaussian \mathcal{N}(\mu, \sigma^2) has differential entropy h = \frac{1}{2}\log(2\pi e \sigma^2); among continuous distributions with a fixed variance, the Gaussian has maximum entropy.

For AI: The cross-entropy loss for classification is H(y, \hat{p}) = -\sum_c y_c \log \hat{p}_c. When y is a one-hot label, this equals -\log \hat{p}_{y} - the negative log probability the model assigned to the correct class.

F.2 KL Divergence: Measuring Distribution Distance

The Kullback-Leibler divergence from distribution q to distribution p is:

D_{\mathrm{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_p\left[\log \frac{p(X)}{q(X)}\right]

Properties (proved using Jensen's inequality in Appendix D):

  1. D_{\mathrm{KL}}(p \| q) \geq 0 (non-negativity; follows from -\log being convex)
  2. D_{\mathrm{KL}}(p \| q) = 0 \iff p = q (equals zero only when the distributions are identical)
  3. D_{\mathrm{KL}}(p \| q) \neq D_{\mathrm{KL}}(q \| p) in general (asymmetric - not a metric)

Decomposition: Cross-entropy = entropy + KL divergence:

H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)

Since H(p) is fixed when p is the true label distribution, minimising cross-entropy loss is equivalent to minimising D_{\mathrm{KL}}(p \| q) - pushing the model's distribution q toward the true distribution p.
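The decomposition can be verified on any pair of PMFs over the same support; two arbitrary three-point distributions suffice:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

H_p = -np.sum(p * np.log(p))              # entropy H(p)
H_pq = -np.sum(p * np.log(q))             # cross-entropy H(p, q)
KL = np.sum(p * np.log(p / q))            # D_KL(p || q)

assert abs(H_pq - (H_p + KL)) < 1e-12     # H(p, q) = H(p) + KL(p || q)
assert KL >= 0                            # non-negativity (Gibbs' inequality)
print(H_p, H_pq, KL)
```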

For AI: VAE training minimises the ELBO, which includes a D_{\mathrm{KL}}(q_\phi(z \mid x) \| p(z)) regularisation term. RLHF preference modelling computes KL penalties between the fine-tuned and reference policy distributions.

-> Full treatment: Chapter 9 - Information Theory

F.3 Negative Log-Likelihood Connects Entropy and Probability

For a dataset \{x_1, \ldots, x_n\} drawn i.i.d. from p, the negative log-likelihood under model q_\theta is:

-\log L(\theta) = -\sum_{i=1}^n \log q_\theta(x_i)

By the law of large numbers (Section06 preview), as n \to \infty:

-\frac{1}{n}\log L(\theta) \to \mathbb{E}_p[-\log q_\theta(X)] = H(p, q_\theta)

Minimising NLL is therefore equivalent to minimising cross-entropy, which is equivalent to minimising D_{\mathrm{KL}}(p \| q_\theta). Maximum likelihood estimation is KL minimisation.


Appendix G: The Exponential Family - A Preview

Almost every distribution in Section02 belongs to a unified family called the exponential family. Understanding this family's structure explains why Gaussian, Bernoulli, Poisson, and many other distributions share the same algorithmic properties in ML.

G.1 Exponential Family Form

A distribution belongs to the exponential family if its PDF/PMF can be written as:

p(x; \eta) = h(x) \exp\left(\eta^\top T(x) - A(\eta)\right)

where:

  • \eta \in \mathbb{R}^k - natural parameters (the parameterisation used in optimisation)
  • T(x) \in \mathbb{R}^k - sufficient statistics (all information about \eta in the data is captured by T(x))
  • A(\eta) = \log \int h(x)\exp(\eta^\top T(x)) \, dx - log-partition function (ensures normalisation)
  • h(x) - base measure (does not depend on \eta)

G.2 Standard Distributions as Exponential Family Members

| Distribution | Natural param \eta | Sufficient stat T(x) | Log-partition A(\eta) |
| --- | --- | --- | --- |
| Bernoulli(p) | \log\frac{p}{1-p} | x | \log(1 + e^\eta) |
| Gaussian(\mu, \sigma^2), fixed \sigma^2 | \mu/\sigma^2 | x | \eta^2\sigma^2/2 |
| Poisson(\lambda) | \log \lambda | x | e^\eta |
| Exponential(\lambda) | -\lambda | x | -\log(-\eta) |

The Bernoulli natural parameter \eta = \log(p/(1-p)) is the log-odds or logit. The logistic sigmoid is its inverse: \sigma(\eta) = e^\eta/(1+e^\eta) recovers p from \eta, and softmax is the multi-class generalisation. This is why logistic regression outputs are Bernoulli exponential family probabilities.

G.3 Why the Log-Partition Function Matters

The log-partition function A(\eta) is convex and its derivatives give the cumulants:

\nabla_\eta A(\eta) = \mathbb{E}_{p_\eta}[T(X)] \quad \text{(mean of sufficient statistics)}
\nabla^2_\eta A(\eta) = \operatorname{Cov}_{p_\eta}[T(X)] \quad \text{(covariance of sufficient statistics)}

This means computing moments of exponential family distributions reduces to differentiating A(\eta) - a powerful computational shortcut.

For AI: Softmax attention uses the log-partition function. The logsumexp operation \log \sum_j e^{a_j} is the log-partition function of a categorical distribution over query-key matches. The numerical stability trick \log \sum_j e^{a_j - \max a} + \max a exploits properties of A(\eta).
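The stability trick is a one-liner; this sketch shows why the shifted form succeeds where the naive one overflows:

```python
import numpy as np

def logsumexp(a):
    """Stable log-partition: log Σ_j e^{a_j} = m + log Σ_j e^{a_j - m}, with m = max(a)."""
    m = np.max(a)
    return float(m + np.log(np.sum(np.exp(a - m))))

a = np.array([1000.0, 1000.0])
print(logsumexp(a))             # -> 1000 + log 2 ≈ 1000.693
# the naive np.log(np.exp(a).sum()) would overflow: exp(1000) is inf in float64
```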

-> Full treatment: Section02-Common-Distributions


Appendix H: Extended Worked Examples

H.1 The Monty Hall Problem - Conditional Probability in Action

Setup: A car is behind one of three doors. You choose Door 1. The host (who knows where the car is) opens Door 3, revealing a goat. Should you switch to Door 2?

Solution using conditional probability:

Let C_i = the event that the car is behind Door i, and O_3 = the event that the host opens Door 3.

Prior: P(C_1) = P(C_2) = P(C_3) = 1/3.

Host strategy: if the car is at Door 1, the host opens Door 2 or 3 with equal probability 1/2; if the car is at Door 2, the host must open Door 3; if the car is at Door 3, the host must open Door 2 (so Door 3 is opened with probability 0).

P(O_3 \mid C_1) = 1/2, \quad P(O_3 \mid C_2) = 1, \quad P(O_3 \mid C_3) = 0

By Bayes' theorem:

P(C_1 \mid O_3) = \frac{P(O_3 \mid C_1)P(C_1)}{P(O_3)} = \frac{(1/2)(1/3)}{1/6 + 1/3 + 0} = \frac{1/6}{1/2} = \frac{1}{3}
P(C_2 \mid O_3) = \frac{P(O_3 \mid C_2)P(C_2)}{P(O_3)} = \frac{(1)(1/3)}{1/2} = \frac{2}{3}

Conclusion: Switching wins with probability 2/3. Staying wins with probability 1/3.

Intuition: The host's action is not random - they always reveal a goat. This action transfers probability mass from Door 3 to Door 2 (given your initial choice was Door 1). The host's knowledge changes the posterior.

For AI: This illustrates why naive probabilistic reasoning fails when the observation process is not independent of the hypothesis. In model evaluation, the selection of which test examples to report affects the posterior over model quality - a key issue in benchmark gaming.
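The posterior of 2/3 for switching can be checked by simulating the host's strategy described above:

```python
import random

random.seed(0)
N = 100_000
wins_switch = 0
for _ in range(N):
    car = random.randint(1, 3)        # car placed uniformly at random
    pick = 1                          # you always choose Door 1
    if car == pick:                   # host opens Door 2 or 3 at random
        host = random.choice((2, 3))
    else:                             # host must open the single remaining goat door
        host = next(d for d in (2, 3) if d != car)
    switch_to = next(d for d in (1, 2, 3) if d not in (pick, host))
    wins_switch += (switch_to == car)

print(wins_switch / N)                # ≈ 2/3
```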

H.2 The Prosecutor's Fallacy - P(A \mid B) \neq P(B \mid A)

A DNA test for a rare genetic marker has false positive rate 0.1% and false negative rate 0%. The marker is present in 0.01% of the population. A defendant tests positive. A prosecutor claims: "The probability of a false positive is only 0.1%, so there is a 99.9% chance the defendant is guilty."

What's wrong: The prosecutor is claiming P(\text{innocent} \mid \text{positive}) = 0.001, but what they have actually calculated is P(\text{positive} \mid \text{innocent}) = 0.001.

By Bayes' theorem:

  • P(guilty)=0.0001P(\text{guilty}) = 0.0001 (base rate)
  • P(positiveguilty)=1P(\text{positive} | \text{guilty}) = 1 (no false negatives)
  • P(positiveinnocent)=0.001P(\text{positive} | \text{innocent}) = 0.001 (false positive rate)
P(guiltypositive)=10.000110.0001+0.0010.99990.00010.001=9.1%P(\text{guilty} | \text{positive}) = \frac{1 \cdot 0.0001}{1 \cdot 0.0001 + 0.001 \cdot 0.9999} \approx \frac{0.0001}{0.001} = 9.1\%

The actual probability of guilt given the positive test is only about 9%9\%, not 99.9%99.9\%.

For AI: This fallacy appears in anomaly detection. A model with 99.9%99.9\% accuracy on a test set may have only 50%50\% precision on a rare-class detection task if the base rate of the rare class is 0.1%0.1\%.
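The Bayes computation above is three lines of arithmetic; a quick check with the (hypothetical) numbers from the example:

```python
# Posterior probability of guilt given a positive test
prior_guilty = 0.0001         # base rate: marker present in 0.01% of population
p_pos_given_guilty = 1.0      # no false negatives
p_pos_given_innocent = 0.001  # false positive rate

# Law of total probability for the evidence, then Bayes' theorem
p_pos = p_pos_given_guilty * prior_guilty + p_pos_given_innocent * (1 - prior_guilty)
posterior_guilty = p_pos_given_guilty * prior_guilty / p_pos
print(round(posterior_guilty, 3))   # ≈ 0.091, not 0.999
```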

H.3 Coupon Collector Problem - Expected Waiting Times

There are nn distinct coupons. Each cereal box contains one coupon uniformly at random. How many boxes must you buy (in expectation) to collect all nn coupons?

Let TkT_k = number of boxes needed to get the kk-th new coupon (given you already have k1k-1 distinct coupons). At that point, the probability of drawing a new coupon is (nk+1)/n(n-k+1)/n, so TkGeometric(nk+1n)T_k \sim \text{Geometric}\left(\frac{n-k+1}{n}\right).

E[Tk]=nnk+1\mathbb{E}[T_k] = \frac{n}{n-k+1}

Total boxes: T=T1+T2++TnT = T_1 + T_2 + \cdots + T_n, so:

E[T]=k=1nnnk+1=nj=1n1j=nHn\mathbb{E}[T] = \sum_{k=1}^{n} \frac{n}{n-k+1} = n \sum_{j=1}^{n} \frac{1}{j} = n H_n

where Hn=1+1/2+1/3++1/nlnn+0.577H_n = 1 + 1/2 + 1/3 + \cdots + 1/n \approx \ln n + 0.577 (harmonic number).

For n = 365 (days of the year): \mathbb{E}[T] = 365 \cdot H_{365} \approx 365 \cdot 6.48 \approx 2365 draws needed to see every birthday at least once.

For AI: The coupon collector framework models how many gradient steps a stochastic optimizer sampling with replacement needs before it has seen every training example at least once. For n training examples, about n \ln n steps are needed in expectation - a factor of \ln n more than the single shuffled epoch that samples without replacement, which is one reason epoch-based shuffling is standard practice.
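A simulation (a sketch with n = 50 coupon types; the helper `coupons_needed` is illustrative) confirms the n H_n formula:

```python
import random

def coupons_needed(n, rng):
    """Number of uniform draws until all n coupon types have appeared."""
    seen, boxes = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        boxes += 1
    return boxes

n, reps = 50, 2000
rng = random.Random(0)
avg = sum(coupons_needed(n, rng) for _ in range(reps)) / reps
n_harmonic = n * sum(1 / j for j in range(1, n + 1))   # n * H_n
print(avg, n_harmonic)   # empirical mean vs n*H_n ≈ 225
```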


Appendix I: Characteristic Functions and Moment Generating Functions - Preview

I.1 The Moment Generating Function (MGF)

The moment generating function of a random variable XX is:

MX(t)=E[etX]=etxp(x)dx(when it exists)M_X(t) = \mathbb{E}[e^{tX}] = \int e^{tx} p(x) \, dx \quad \text{(when it exists)}

Why "moment generating"? Differentiating and evaluating at t=0t=0:

MX(k)(0)=E[Xk]M_X^{(k)}(0) = \mathbb{E}[X^k]

The kk-th derivative of the MGF at zero is the kk-th raw moment of XX.

For Gaussian XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2):

MX(t)=exp(μt+σ2t22)M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)

This elegant form is why the Gaussian is easy to work with analytically: all moments can be extracted by differentiating a simple exponential.

Key property (moment generating functions and sums): If XYX \perp Y:

MX+Y(t)=MX(t)MY(t)M_{X+Y}(t) = M_X(t) \cdot M_Y(t)

This product rule for MGFs is the probabilistic analogue of the convolution theorem, and it is the key tool for proving that sums of independent Gaussians are Gaussian and that sums of independent Poissons are Poisson.
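One way to see the moment-generating property concretely: numerically differentiate the Gaussian MGF at zero with central finite differences and recover the mean and variance (a sketch; the parameter values are illustrative):

```python
import math

mu, sigma = 2.0, 1.5

def M(t):
    """MGF of N(mu, sigma^2): exp(mu*t + sigma^2 * t^2 / 2)."""
    return math.exp(mu * t + 0.5 * (sigma * t) ** 2)

h = 1e-4
m1 = (M(h) - M(-h)) / (2 * h)              # ≈ M'(0)  = E[X]
m2 = (M(h) - 2 * M(0) + M(-h)) / h ** 2    # ≈ M''(0) = E[X^2]
print(m1, m2 - m1 ** 2)                    # ≈ mu = 2.0 and sigma^2 = 2.25
```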

I.2 The Characteristic Function

The characteristic function of XX is φX(t)=E[eitX]\varphi_X(t) = \mathbb{E}[e^{itX}] (Fourier transform of the density). Unlike the MGF, the characteristic function always exists and uniquely determines the distribution. It is the tool used in the proof of the Central Limit Theorem.

-> Full treatment: Section04-Expectation-and-Moments


Appendix J: Simulation and Monte Carlo Methods - A Preview

A recurring theme throughout this chapter is that computing exact probabilities analytically is often hard or impossible. Monte Carlo simulation is the practical alternative: approximate E[g(X)]\mathbb{E}[g(X)] by averaging over many samples.

J.1 The Monte Carlo Principle

If X1,X2,,XnX_1, X_2, \ldots, X_n are i.i.d. samples from p(x)p(x), then by the Law of Large Numbers (Section06):

1ni=1ng(Xi)nEp[g(X)]\frac{1}{n}\sum_{i=1}^n g(X_i) \xrightarrow{n \to \infty} \mathbb{E}_p[g(X)]

The Monte Carlo estimator μ^n=1ni=1ng(Xi)\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n g(X_i) is unbiased (E[μ^n]=E[g(X)]\mathbb{E}[\hat{\mu}_n] = \mathbb{E}[g(X)]) and has variance Var(g(X))/n\operatorname{Var}(g(X))/n, so the standard error is σ/n\sigma/\sqrt{n} regardless of dimension.

Why this matters: Unlike numerical integration (which requires exponentially many grid points in high dimensions), Monte Carlo has convergence rate O(1/n)O(1/\sqrt{n}) independent of dimension. This makes it the tool of choice for high-dimensional integrals - exactly the setting of modern ML.

J.2 Estimating Probabilities

P(A)=E[1A(X)]P(A) = \mathbb{E}[\mathbf{1}_A(X)], so we can estimate any probability:

P^(A)=1ni=1n1[XiA]\hat{P}(A) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[X_i \in A]

Example: Estimate P(Z>1.96)P(Z > 1.96) for ZN(0,1)Z \sim \mathcal{N}(0,1) by sampling:

  1. Draw Z1,,ZnN(0,1)Z_1, \ldots, Z_n \sim \mathcal{N}(0,1)
  2. Count the fraction where Zi>1.96Z_i > 1.96
  3. Answer converges to 0.0250.025 (the true tail probability)
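The three steps above can be sketched in NumPy, along with the Monte Carlo standard error from J.1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
z = rng.standard_normal(n)               # step 1: draw Z_i ~ N(0, 1)
p_hat = (z > 1.96).mean()                # step 2: fraction exceeding 1.96
stderr = np.sqrt(p_hat * (1 - p_hat) / n)   # binomial standard error
print(p_hat, stderr)                     # p_hat ≈ 0.025, stderr ≈ 1.6e-4
```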

J.3 Sampling from Basic Distributions

Inverse CDF method (Universality of the Uniform): If UUniform(0,1)U \sim \text{Uniform}(0,1) and FF is a CDF, then F1(U)F^{-1}(U) has distribution FF. This is because:

P(F1(U)x)=P(UF(x))=F(x)P(F^{-1}(U) \leq x) = P(U \leq F(x)) = F(x)

Consequences:

  • Sample Exponential(λ\lambda): X=1λlog(U)X = -\frac{1}{\lambda}\log(U) where UUniform(0,1)U \sim \text{Uniform}(0,1)
  • Sample Geometric(pp): X=log(U)/log(1p)X = \lceil \log(U) / \log(1-p) \rceil
  • Sample Bernoulli(pp): X=1[Up]X = \mathbf{1}[U \leq p]
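For instance, the Exponential recipe can be checked against the known mean 1/\lambda and CDF 1 - e^{-\lambda x} (a NumPy sketch; \lambda = 2 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
u = rng.uniform(size=500_000)
x = -np.log(u) / lam                 # inverse-CDF sample from Exponential(lam)
mean_x = x.mean()
frac_below_1 = (x <= 1.0).mean()
print(mean_x, frac_below_1)          # ≈ 1/lam = 0.5 and 1 - exp(-2) ≈ 0.865
```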

The Box-Muller transform generates Gaussian samples from uniform:

Z1=2logU1cos(2πU2),Z2=2logU1sin(2πU2)Z_1 = \sqrt{-2\log U_1}\cos(2\pi U_2), \quad Z_2 = \sqrt{-2\log U_1}\sin(2\pi U_2)

where U1,U2Uniform(0,1)U_1, U_2 \sim \text{Uniform}(0,1) i.i.d., and Z1,Z2N(0,1)Z_1, Z_2 \sim \mathcal{N}(0,1) i.i.d.
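The transform is two lines of NumPy; a sketch verifying that the output has the right mean, spread, and that the two streams are uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
u1 = rng.uniform(size=n)
u2 = rng.uniform(size=n)
r = np.sqrt(-2.0 * np.log(u1))       # radius from U1
z1 = r * np.cos(2 * np.pi * u2)      # two independent N(0,1) streams
z2 = r * np.sin(2 * np.pi * u2)
corr = np.corrcoef(z1, z2)[0, 1]
print(z1.mean(), z1.std(), corr)     # ≈ 0, 1, 0
```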

J.4 Rejection Sampling

When the inverse CDF has no closed form, use rejection sampling: to sample from p(x)p(x), find a proposal q(x)q(x) and constant MM such that p(x)Mq(x)p(x) \leq M q(x) everywhere.

Algorithm:

  1. Sample YqY \sim q
  2. Accept YY with probability p(Y)/(Mq(Y))p(Y)/(Mq(Y)); otherwise reject and repeat

The accepted samples have distribution pp. The acceptance rate is 1/M1/M, so choose MM as small as possible.
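A sketch of the algorithm, targeting the Beta(2, 2) density p(x) = 6x(1-x) with a Uniform(0, 1) proposal (this target is an illustrative choice; its maximum gives M = 1.5):

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):
    """Target density: Beta(2, 2), p(x) = 6x(1-x) on [0, 1]."""
    return 6.0 * x * (1.0 - x)

M = 1.5                          # max of p, since q(x) = 1 on [0, 1]
n = 200_000
y = rng.uniform(size=n)          # step 1: proposals from q = Uniform(0, 1)
u = rng.uniform(size=n)
accepted = y[u <= p(y) / M]      # step 2: accept with probability p(Y)/(M q(Y))
accept_rate = len(accepted) / n
mean_acc = accepted.mean()
print(accept_rate, mean_acc)     # ≈ 1/M = 2/3, and ≈ 0.5 (the Beta(2,2) mean)
```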

For AI: Rejection sampling underlies early language model decoding strategies. Modern methods like nucleus sampling (top-pp) and temperature sampling directly manipulate the token probability distribution - understanding these requires the CDF and quantile framework from Section7.

J.5 Monte Carlo Integration of Normalisation Constants

Many probabilistic models have densities of the form p(x)=p~(x)/Zp(x) = \tilde{p}(x)/Z where computing the normalisation constant Z=p~(x)dxZ = \int \tilde{p}(x) \, dx is intractable. Examples:

  • Bayesian posterior: Z=likelihood(x)prior(x)dxZ = \int \text{likelihood}(x) \cdot \text{prior}(x) \, dx
  • Boltzmann distribution: Z=xeE(x)/TZ = \sum_x e^{-E(x)/T} over exponentially many states
  • Normalising flow: Z=1Z = 1 by construction but requires a change-of-variables Jacobian

Monte Carlo methods (particularly MCMC, developed in Section07) are the primary tools for inference in these models.

-> Full treatment: Section07-Markov-Chains (MCMC); Section06-Stochastic-Processes (LLN and CLT)


Appendix K: Probability Distributions in PyTorch and NumPy

Modern ML frameworks have built-in probability distribution objects. Understanding the mathematical definitions in this section makes framework APIs immediately transparent.

K.1 NumPy / SciPy Distributions

import numpy as np
from scipy import stats

# Bernoulli(p=0.3)
rv = stats.bernoulli(p=0.3)
rv.pmf(1)          # P(X=1) = 0.3
rv.cdf(0)          # P(X<=0) = 0.7
rv.rvs(size=100)   # 100 random samples

# Gaussian N(mu=2, sigma=1.5)
rv = stats.norm(loc=2, scale=1.5)
rv.pdf(2.0)        # f(2.0) = 1/(1.5*sqrt(2*pi)) ≈ 0.266
rv.cdf(2.0)        # Phi(0) = 0.5
rv.ppf(0.975)      # inverse CDF (quantile): z such that P(X<=z) = 0.975

# Uniform(a=0, b=1)
rv = stats.uniform(loc=0, scale=1)

The loc and scale parameters implement the location-scale transform: X=loc+scaleZX = \text{loc} + \text{scale} \cdot Z where ZZ is the standard form.

K.2 PyTorch torch.distributions

import torch
from torch.distributions import Bernoulli, Normal, Uniform, Categorical

# Bernoulli - for binary classification output
dist = Bernoulli(probs=torch.tensor(0.3))
dist.log_prob(torch.tensor(1.0))  # log P(X=1) = log(0.3) ≈ -1.204
dist.sample()                      # sample from {0, 1}
dist.entropy()                     # H(X) ≈ 0.611 nats

# Normal - for regression, VAE latent, etc.
dist = Normal(loc=torch.tensor(0.0), scale=torch.tensor(1.0))
dist.log_prob(torch.tensor(1.96))  # log f(1.96) ≈ -2.84
dist.cdf(torch.tensor(1.96))       # ≈ 0.975

# Categorical - for language model outputs
logits = torch.tensor([2.0, 1.0, 0.5])  # unnormalised log probabilities
dist = Categorical(logits=logits)
dist.probs                              # softmax probabilities
dist.log_prob(torch.tensor(0))          # log P(X=0)
dist.entropy()                          # entropy of the distribution

K.3 Reparameterisation - The Cornerstone of VAE Training

The reparameterisation trick lets us differentiate through a sampling operation:

Instead of sampling zN(μ,σ2)z \sim \mathcal{N}(\mu, \sigma^2) (which has no gradient w.r.t. μ,σ\mu, \sigma), write:

z=μ+σϵ,ϵN(0,1)z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)

Now z/μ=1\partial z / \partial \mu = 1 and z/σ=ϵ\partial z / \partial \sigma = \epsilon - gradients flow through.

# Without reparameterisation (WRONG for training)
z = Normal(mu, sigma).sample()  # no gradient

# With reparameterisation (CORRECT)
eps = torch.randn_like(sigma)   # eps ~ N(0, 1)
z = mu + sigma * eps             # z ~ N(mu, sigma^2), gradients flow

This technique requires understanding that N(μ,σ2)\mathcal{N}(\mu, \sigma^2) and μ+σN(0,1)\mu + \sigma \mathcal{N}(0,1) have the same distribution - a direct consequence of the location-scale property developed in Section7.3.

For AI: VAE encoders output (μ,logσ2)(\mu, \log\sigma^2) and use this trick to sample the latent code zz while maintaining a differentiable path back to the encoder parameters. The same idea extends to normalising flows and diffusion model sampling.


Appendix L: Measure Theory - The Foundations Beneath the Formalism

This section develops probability from elementary axioms without requiring measure theory. But for completeness, this appendix sketches where the measure-theoretic foundations live and when you need them.

L.1 Why Elementary Probability Is Insufficient

The elementary theory (as developed in Section2) works when the sample space Ω\Omega is discrete or when we can write explicit PDF formulas. It breaks for:

  1. Mixed distributions: a random variable that is sometimes discrete (e.g., exactly zero) and sometimes continuous (e.g., positive real) - arises in zero-inflated models and ReLU activations.

  2. Limits: taking limits of sequences of random variables requires knowing what "XnXX_n \to X" means precisely - there are multiple notions of convergence (L2L^2, almost sure, in probability, in distribution) that are not cleanly expressible without measure theory.

  3. Conditional expectation on events of probability zero: E[YX=x]\mathbb{E}[Y | X = x] when P(X=x)=0P(X = x) = 0 (any continuous XX) requires the Radon-Nikodym theorem.

  4. Change-of-measure / importance sampling: Ep[f(X)]=Eq[p(X)q(X)f(X)]\mathbb{E}_p[f(X)] = \mathbb{E}_q\left[\frac{p(X)}{q(X)}f(X)\right] when pp and qq are continuous requires the Radon-Nikodym derivative.

L.2 The Measure-Theoretic Setup

A probability space is a triple (Ω,F,P)(\Omega, \mathcal{F}, P) where:

  • Ω\Omega - sample space (any set)
  • F\mathcal{F} - sigma-algebra: a collection of subsets of Ω\Omega closed under complement and countable union (the measurable events)
  • P:F[0,1]P: \mathcal{F} \to [0,1] - probability measure satisfying Kolmogorov's axioms

A random variable is a measurable function X:ΩRX: \Omega \to \mathbb{R}, meaning {Xx}F\{X \leq x\} \in \mathcal{F} for all xx (so that P(Xx)P(X \leq x) is well-defined).

The distribution of XX is the induced measure μX(B)=P(X1(B))\mu_X(B) = P(X^{-1}(B)) for Borel sets BRB \subseteq \mathbb{R}.

The Lebesgue integral replaces the Riemann integral and is defined for any measurable function - this is what makes E[X]=XdP\mathbb{E}[X] = \int X \, dP well-defined even for random variables with no PDF.

L.3 The Lebesgue-Stieltjes Integral Unifies PDF and PMF

For any random variable with CDF FF, the expectation can be written as a Stieltjes integral:

E[g(X)]=g(x)dF(x)\mathbb{E}[g(X)] = \int g(x) \, dF(x)

This notation works for both discrete and continuous cases:

  • Continuous: dF(x)=f(x)dxdF(x) = f(x)\,dx (use Riemann integral)
  • Discrete: dF(x)=kpkδ(xxk)dF(x) = \sum_k p_k \,\delta(x - x_k) (use a sum)
  • Mixed: combination of both

For AI: Importance sampling in policy gradient methods (PPO, TRPO) uses the Radon-Nikodym derivative implicitly when bounding the ratio πθ(as)/πθold(as)\pi_\theta(a|s)/\pi_{\theta_{\text{old}}}(a|s) - this is the change-of-measure formula at work.

The full measure-theoretic treatment is beyond the scope of this curriculum. The reader interested in rigorous foundations should consult Billingsley's Probability and Measure or Durrett's Probability: Theory and Examples.


Appendix M: Classic Probability Paradoxes and Their Resolutions

M.1 Bertrand's Paradox - Why "Uniform at Random" Needs Specification

Problem: A chord is drawn "at random" in a unit circle. What is the probability the chord is longer than the side of the inscribed equilateral triangle (length 3\sqrt{3})?

The answer depends on how you sample "at random":

Method 1 - Random endpoints: Choose two points uniformly on the circle's circumference. Probability = 1/31/3.

Method 2 - Random midpoint: Choose a point uniformly inside the circle as the chord's midpoint. Probability = 1/41/4.

Method 3 - Random radius and distance: Choose a radius uniformly; choose the midpoint's distance from centre uniformly on [0,1][0,1]. Probability = 1/21/2.

All three methods are geometrically natural, yet give different answers.
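All three sampling schemes are easy to simulate; a NumPy sketch estimating the three probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
crit = np.sqrt(3.0)   # side length of the inscribed equilateral triangle

# Method 1: two uniform endpoints on the circumference
theta = rng.uniform(0, 2 * np.pi, size=(n, 2))
chord1 = 2 * np.abs(np.sin((theta[:, 0] - theta[:, 1]) / 2))

# Method 2: uniform midpoint inside the disc (radius of a uniform point is sqrt(U))
r = np.sqrt(rng.uniform(size=n))
chord2 = 2 * np.sqrt(1 - r ** 2)

# Method 3: midpoint distance uniform on [0, 1] along a fixed radius
d = rng.uniform(size=n)
chord3 = 2 * np.sqrt(1 - d ** 2)

p1 = np.mean(chord1 > crit)
p2 = np.mean(chord2 > crit)
p3 = np.mean(chord3 > crit)
print(p1, p2, p3)   # ≈ 1/3, 1/4, 1/2
```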

Resolution: The phrase "uniform at random" is not well-defined without specifying the sample space and the probability measure on it. The three methods define different probability spaces (Ω,F,P)(\Omega, \mathcal{F}, P) - they are asking different questions.

Lesson for ML: When we say "sample a random minibatch" or "sample a random initialisation," we must specify the distribution explicitly. The implicit assumption (uniform distribution) can lead to different algorithmic properties depending on whether we sample with or without replacement, sample tokens or sequences, etc.

M.2 The Two-Envelope Paradox

Two envelopes each contain money; one has twice the amount of the other. You pick one, see XX inside, and are offered the chance to switch. You reason: the other envelope contains either X/2X/2 (probability 1/21/2) or 2X2X (probability 1/21/2). Expected value of switching: 12(X/2)+12(2X)=5X4>X\frac{1}{2}(X/2) + \frac{1}{2}(2X) = \frac{5X}{4} > X. So you should always switch - but the same reasoning applies after switching back!

Resolution: The calculation is flawed because the event "the other envelope has 2X2X" and "the other envelope has X/2X/2" cannot both be assigned probability 1/21/2 simultaneously for a fixed XX. The reasoning treats XX as fixed and random simultaneously. A rigorous Bayesian analysis shows no advantage to switching when the prior over the smaller amount is proper (integrable).

Lesson for ML: The paradox arises from mixing conditional and unconditional reasoning. In ML, this appears in off-policy evaluation (evaluating policy πnew\pi_\text{new} using data from πold\pi_\text{old}) - naively computing expected reward leads to analogous double-counting errors.

M.3 The St. Petersburg Paradox - When Expected Value Is Infinite

A game: flip a fair coin repeatedly. If the first head appears on flip kk, you win 2k2^k dollars. Expected value:

E[winnings]=k=12k12k=k=11=\mathbb{E}[\text{winnings}] = \sum_{k=1}^\infty 2^k \cdot \frac{1}{2^k} = \sum_{k=1}^\infty 1 = \infty

Yet no rational person would pay more than ~$25 to play.

Resolution: Human decision-making is based on utility, not raw monetary value. If utility is u(w)=log(w)u(w) = \log(w) (a common assumption in economics), the expected utility is finite. This gives expected utility theory - the rational framework for decisions under uncertainty.

For AI: Reinforcement learning explicitly maximises expected return (reward), which can lead to high-variance, unstable policies if reward distributions have heavy tails. Techniques like reward normalisation, reward clipping (PPO), and distributional RL (C51, QR-DQN) are direct engineering responses to the St. Petersburg problem.
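Under log-utility the expected utility of the game is finite and easy to compute; truncating the series far into the tail:

```python
import math

# Expected utility under u(w) = log(w):
# E[u] = sum_k 2^{-k} * log(2^k) = log(2) * sum_k k * 2^{-k} = 2 log 2
eu = sum((0.5 ** k) * (k * math.log(2)) for k in range(1, 200))
print(eu)   # ≈ 1.386 = 2 log 2
```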


Appendix N: Summary of Key Inequalities and Bounds

This appendix collects all the inequalities introduced in this section and previewed from later sections, as a reference card.

Probability Bounds

| Inequality | Statement | When to Use |
|---|---|---|
| Boole (union bound) | P\!\left(\bigcup_i A_i\right) \leq \sum_i P(A_i) | When events may overlap; gives upper bound |
| Inclusion-exclusion | P(A \cup B) = P(A) + P(B) - P(A \cap B) | Exact for two events |
| Bonferroni | P\!\left(\bigcap_i A_i^c\right) \geq 1 - \sum_i P(A_i) | Lower bound on "all good" probability |
| Fréchet | \max(0, P(A)+P(B)-1) \leq P(A \cap B) \leq \min(P(A), P(B)) | Bounds on intersection without independence |

Moment-Based Bounds (from Section05)

| Inequality | Statement | Uses |
|---|---|---|
| Markov | P(X \geq t) \leq \mathbb{E}[X]/t for X \geq 0 | Only needs \mathbb{E}[X] |
| Chebyshev | P(\lvert X-\mu\rvert \geq t) \leq \sigma^2/t^2 | Needs \mathbb{E}[X] and \operatorname{Var}(X) |
| Jensen | g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)] for convex g | Log-sum-exp, ELBO, entropy |
| Cauchy-Schwarz | \lvert\mathbb{E}[XY]\rvert \leq \sqrt{\mathbb{E}[X^2]\,\mathbb{E}[Y^2]} | Bounding covariances and correlations |

Concentration Bounds (from Section05)

| Inequality | Statement | When to Use |
|---|---|---|
| Hoeffding | P(\bar{X}-\mu \geq t) \leq e^{-2n^2t^2/\sum_i(b_i-a_i)^2} for bounded X_i \in [a_i, b_i] | i.i.d. bounded variables |
| Chernoff | P(X \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3} for Poisson binomial X, 0 < \delta \leq 1 | Sums of independent Bernoullis |
| McDiarmid | If f changes by at most c_i in coordinate i: P(f - \mathbb{E}[f] \geq t) \leq e^{-2t^2/\sum_i c_i^2} | General functions of independent variables |

For AI: Hoeffding's inequality directly gives PAC learning generalisation bounds: with nn i.i.d. training examples and loss values in [0,1][0,1], the gap between empirical and expected loss satisfies P(LtestLtrainε)2e2nε2P(L_{\text{test}} - L_{\text{train}} \geq \varepsilon) \leq 2e^{-2n\varepsilon^2} for any fixed hypothesis.
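A quick numerical sanity check of Hoeffding's inequality, assuming i.i.d. Uniform(0, 1) samples (so each X_i is bounded in [0, 1] and the exponent reduces to -2nt^2):

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 100, 0.1, 50_000
x = rng.uniform(size=(trials, n))           # X_i in [0, 1], mu = 0.5
empirical = np.mean(x.mean(axis=1) - 0.5 >= t)
bound = np.exp(-2 * n * t ** 2)             # Hoeffding bound, here exp(-2)
print(empirical, bound)                     # empirical probability ≤ bound ≈ 0.135
```

The bound is loose here (the true deviation probability is orders of magnitude smaller), which is typical: Hoeffding trades tightness for generality.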


Appendix O: Sigma-Algebras and the Filtration Framework

This appendix provides a slightly more formal treatment of sigma-algebras for readers who want mathematical precision, without requiring full measure theory.

O.1 Why Sigma-Algebras Appear

The Kolmogorov axioms in Section2 assume we know which subsets of Ω\Omega count as "events." For finite or countable Ω\Omega, every subset is an event. For continuous Ω=R\Omega = \mathbb{R}, we cannot assign probabilities to all subsets (Vitali sets are non-measurable), so we restrict to a collection F\mathcal{F} closed under the operations we need.

O.2 The Borel Sigma-Algebra

The Borel sigma-algebra B(R)\mathcal{B}(\mathbb{R}) is the smallest sigma-algebra containing all open intervals (a,b)(a, b). It includes:

  • All open and closed sets
  • All countable unions and intersections of intervals
  • All CDFs are measurable with respect to B(R)\mathcal{B}(\mathbb{R})

For practical purposes, every set you will encounter in ML is Borel-measurable.

O.3 Information and Filtrations

In sequential problems (reinforcement learning, time series, stochastic processes), we need to track how much information is available at each time step. A filtration is an increasing sequence of sigma-algebras:

F0F1F2\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots

where Ft\mathcal{F}_t represents "all information available by time tt."

A random variable XtX_t is Ft\mathcal{F}_t-measurable if its value is determined by the information in Ft\mathcal{F}_t - that is, you can compute XtX_t from the history up to time tt.

Example: In a sequential decision problem:

  • F0={,Ω}\mathcal{F}_0 = \{\emptyset, \Omega\} - no information (only trivial events)
  • F1=σ(X1)\mathcal{F}_1 = \sigma(X_1) - sigma-algebra generated by the first observation
  • Ft=σ(X1,,Xt)\mathcal{F}_t = \sigma(X_1, \ldots, X_t) - sigma-algebra generated by observations up to time tt

The Markov property (Section07) states: Xt+1X_{t+1} depends on Ft\mathcal{F}_t only through XtX_t (not the full history).

For AI: The attention mechanism's causal mask in GPT-style models enforces a filtration: token tt can only attend to tokens at positions t\leq t. The mask ensures Ft\mathcal{F}_t-measurability of each token prediction, making the autoregressive language model a valid probabilistic model of sequences.

O.4 The Generated Sigma-Algebra

For any random variable XX, the generated sigma-algebra is:

σ(X)={X1(B):BB(R)}={{ω:X(ω)B}:BB(R)}\sigma(X) = \{X^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\} = \{\{\omega : X(\omega) \in B\} : B \in \mathcal{B}(\mathbb{R})\}

This is the smallest sigma-algebra that makes XX measurable - it contains exactly the information encoded in XX.

For a discrete random variable: σ(X)=σ({X=x1},{X=x2},)\sigma(X) = \sigma(\{X = x_1\}, \{X = x_2\}, \ldots) - the partition of Ω\Omega induced by the values of XX.

Key insight: Conditional expectation E[YX]\mathbb{E}[Y | X] is really E[Yσ(X)]\mathbb{E}[Y | \sigma(X)] - the expectation of YY given all the information encoded in XX. This formulation (using sigma-algebras) is what makes conditional expectation rigorously defined even when P(X=x)=0P(X = x) = 0.
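For a finite sample space, this is concrete: E[Y | sigma(X)] just averages Y over each cell of the partition that X induces on Omega. A sketch with two fair dice (X = first die, Y = total):

```python
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes

def cond_exp_Y_given_X(x):
    """Average Y = total over the cell {X = x} of the partition induced by X."""
    cell = [w for w in omega if w[0] == x]     # the event {X = x} in sigma(X)
    return sum(w[0] + w[1] for w in cell) / len(cell)

values = [cond_exp_Y_given_X(x) for x in range(1, 7)]
print(values)            # [4.5, 5.5, 6.5, 7.5, 8.5, 9.5] - each is x + 3.5
print(sum(values) / 6)   # tower property: averaging over x recovers E[Y] = 7.0
```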


Appendix P: Probability Generating Functions

P.1 Definition

For a non-negative integer-valued random variable XX (taking values 0,1,2,0, 1, 2, \ldots), the probability generating function (PGF) is:

GX(z)=E[zX]=k=0pkzkG_X(z) = \mathbb{E}[z^X] = \sum_{k=0}^\infty p_k z^k

where pk=P(X=k)p_k = P(X = k).

The PGF converges for z1|z| \leq 1 and stores all probabilities as coefficients of a power series.

P.2 Recovering Probabilities and Moments

p_k = \frac{G_X^{(k)}(0)}{k!} \quad \text{($k$-th derivative at zero)}

\mathbb{E}[X] = G_X'(1), \quad \mathbb{E}[X(X-1)] = G_X''(1), \quad \operatorname{Var}(X) = G_X''(1) + G_X'(1) - [G_X'(1)]^2

P.3 PGFs of Common Distributions

| Distribution | PGF G(z) |
|---|---|
| Bernoulli(p) | 1-p+pz |
| Binomial(n, p) | (1-p+pz)^n |
| Poisson(\lambda) | e^{\lambda(z-1)} |
| Geometric(p) | pz/(1-(1-p)z) |

The Binomial PGF (1p+pz)n(1-p+pz)^n is the nn-fold product of Bernoulli PGFs - directly encoding that a Binomial is a sum of nn i.i.d. Bernoullis (by the product property for independent variables).

Product property: If XYX \perp Y: GX+Y(z)=GX(z)GY(z)G_{X+Y}(z) = G_X(z) \cdot G_Y(z).

This is the PGF analogue of the convolution theorem, and it gives a clean proof that the sum of independent Poisson(λ1\lambda_1) and Poisson(λ2\lambda_2) is Poisson(λ1+λ2\lambda_1 + \lambda_2):

GPois(λ1)(z)GPois(λ2)(z)=eλ1(z1)eλ2(z1)=e(λ1+λ2)(z1)=GPois(λ1+λ2)(z)G_{\text{Pois}(\lambda_1)}(z) \cdot G_{\text{Pois}(\lambda_2)}(z) = e^{\lambda_1(z-1)} \cdot e^{\lambda_2(z-1)} = e^{(\lambda_1+\lambda_2)(z-1)} = G_{\text{Pois}(\lambda_1+\lambda_2)}(z)
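Since a PGF product corresponds to a convolution of PMFs, the identity can be verified numerically: convolving the Poisson(1.5) and Poisson(2.5) PMFs (truncated far into the tail) reproduces the Poisson(4) PMF. A sketch:

```python
import math

def poisson_pmf(lam, kmax=60):
    """Truncated Poisson PMF; mass beyond kmax is negligible for these rates."""
    return [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(kmax)]

def convolve(p, q):
    """PMF of X + Y for independent X, Y (coefficient product of the PGFs)."""
    return [sum(p[j] * q[k - j] for j in range(k + 1)) for k in range(len(p))]

p1, p2 = poisson_pmf(1.5), poisson_pmf(2.5)
summed = convolve(p1, p2)
target = poisson_pmf(4.0)
err = max(abs(a - b) for a, b in zip(summed, target))
print(err)   # ≈ 0: the convolution is exactly Poisson(1.5 + 2.5)
```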

For AI: Token generation in a language model is repeated sampling from a categorical distribution over the vocabulary. The PGF of the generated token count under various sampling strategies (greedy, top-k, nucleus) encodes the statistical properties of generated text length distributions.


Appendix Q: Historical Development of Probability Theory

Understanding how probability theory evolved helps locate the axioms, distributions, and theorems in their proper intellectual context.

Q.1 Pre-Axiomatic Period (1654-1900)

Pascal and Fermat (1654): The famous exchange of letters about the "Problem of Points" (how to divide stakes in an unfinished game) is often cited as the birth of probability theory. They developed the idea of expected value and laid groundwork for combinatorics.

Bernoulli (1713): Jakob Bernoulli's Ars Conjectandi stated the first version of the Law of Large Numbers: relative frequencies of outcomes converge to true probabilities. This transformed probability from gambling mathematics to a science of inference.

Bayes (1763): Thomas Bayes' posthumous Essay gave the first treatment of inverse probability - inferring parameters from observations. Richard Price edited and published it; Laplace independently developed the same ideas more completely.

Laplace (1812): Théorie analytique des probabilités synthesised the field, introducing characteristic functions, the Central Limit Theorem, the method of moments, and the formal definition of probability as a fraction of equally likely cases.

Gauss (1809, 1823): Derived the normal distribution as the error distribution minimising the sum of squared residuals. Proved that the sample mean is the best linear unbiased estimator - establishing connections between probability and statistics.

Chebyshev, Markov, Lyapunov (1867-1901): Proved increasingly general versions of the CLT. Markov introduced his eponymous chains; Lyapunov's proof of the CLT via characteristic functions is essentially the modern proof.

Q.2 Axiomatic Period (1900-1950)

Borel (1909): Proved the strong law of large numbers for the Bernoulli case, introducing Borel sets and measure-theoretic methods.

Kolmogorov (1933): Grundbegriffe der Wahrscheinlichkeitsrechnung placed probability theory on rigorous measure-theoretic foundations. The three axioms in Section2 are Kolmogorov's. This work resolved foundational debates and enabled the rigorous development of stochastic processes.

Wiener (1923): Constructed Brownian motion rigorously as a probability measure on function spaces - the first infinite-dimensional probability space.

Doob (1940s): Developed martingale theory, providing rigorous tools for studying random processes with conditional structure.

Q.3 Modern Period (1950-present)

Shannon (1948): A Mathematical Theory of Communication introduced entropy, mutual information, and channel capacity - connecting probability to communication. Cross-entropy loss in modern ML is a direct descendant.

Robbins-Monro (1951): Stochastic approximation - the theoretical foundation for stochastic gradient descent. Proved that iterates driven by noisy, unbiased gradient estimates converge to the solution provided the step sizes \alpha_t satisfy \sum_t \alpha_t = \infty and \sum_t \alpha_t^2 < \infty.

Dempster-Laird-Rubin (1977): The EM algorithm - an expectation-maximisation procedure for MLE with latent variables. Built on conditional expectation and Jensen's inequality.

Pearl (1988): Bayesian networks and causal inference - graphical models encoding conditional independence structure. Direct descendant of conditional probability in Section4 and Section03 of this chapter.

Goodfellow et al. (2014): Generative Adversarial Networks - training a generator to match a target distribution by playing a minimax game. The divergence between generated and true distributions is what is being minimised (implicitly or explicitly).

Sohl-Dickstein et al. (2015), Ho et al. (2020): Diffusion models - add Gaussian noise progressively (forward process), then learn to denoise (reverse process). Grounded in stochastic processes and Gaussian distributions.

Q.4 The Connections to Modern LLMs

Every component of a modern language model has a probabilistic interpretation rooted in the theory developed in this section:

| LLM Component | Probability Foundation | Section |
|---|---|---|
| Token prediction | Categorical distribution over vocabulary | Section6 (discrete RV) |
| Temperature scaling | Scale parameter of Gumbel distribution | Section7 (continuous RV) |
| Dropout | Bernoulli mask on activations | Section6 (Bernoulli) |
| Weight initialisation | Normal distribution, variance scaling | Section7 (Gaussian) |
| KV cache attention | Conditional probability over past keys | Section4 (conditioning) |
| RLHF reward model | Bradley-Terry preference model | Section3 (conditional probability) |
| Beam search | Greedy argmax over joint probability | Section6 (joint events) |
| Perplexity metric | Exponential of cross-entropy | Section8 (expectation) |
| Calibration | Agreement between predicted probs and frequencies | Section2 (frequentist) |
| Chain-of-thought | Latent variable model over reasoning traces | Section4 (conditional independence) |

Appendix R: Notation Reference for Chapter 6

This reference card collects all notation used in this chapter. Consistent notation prevents ambiguity when working across sections.

R.1 Sets and Events

| Symbol | Meaning |
|---|---|
| \Omega | Sample space |
| \omega \in \Omega | Elementary outcome |
| A, B, C | Events (subsets of \Omega) |
| A^c | Complement of A |
| A \cap B | Intersection (both A and B) |
| A \cup B | Union (either A or B) |
| A \setminus B | Set difference (A but not B) |
| A \triangle B | Symmetric difference |
| \mathbf{1}_A or \mathbf{1}[A] | Indicator function of event A |
| \mathcal{F} | Sigma-algebra of events |

R.2 Probability

| Symbol | Meaning |
|---|---|
| P(A) | Probability of event A |
| P(A \mid B) | Conditional probability of A given B |
| P(A, B) | Joint probability P(A \cap B) |
| A \perp B | Events A and B are independent |

R.3 Random Variables

| Symbol | Meaning |
|---|---|
| X, Y, Z | Random variables |
| x, y, z | Values (realisations) of random variables |
| F_X(x) | CDF of X: P(X \leq x) |
| f_X(x) | PDF of X (continuous) |
| p_X(x) | PMF of X (discrete) |
| \text{supp}(X) | Support: \{x : p_X(x) > 0\} or \{x : f_X(x) > 0\} |
| X \sim p | X has distribution p |
| X \perp Y | X and Y are independent random variables |
| X \perp Y \mid Z | X and Y are conditionally independent given Z |

R.4 Named Distributions

| Notation | Distribution |
|---|---|
| X \sim \text{Bernoulli}(p) | Bernoulli with success probability p |
| X \sim \text{Binomial}(n, p) | Binomial: n trials, success probability p |
| X \sim \text{Geometric}(p) | Geometric: trials until first success |
| X \sim \text{Poisson}(\lambda) | Poisson with rate \lambda |
| X \sim \text{Uniform}(a, b) | Uniform on interval [a, b] |
| X \sim \mathcal{N}(\mu, \sigma^2) | Gaussian with mean \mu and variance \sigma^2 |
| X \sim \text{Exponential}(\lambda) | Exponential with rate \lambda |
| Z \sim \mathcal{N}(0, 1) | Standard normal |
| \Phi(z) | CDF of the standard normal |

R.5 Expectation and Moments

| Symbol | Meaning |
|---|---|
| \mathbb{E}[X] | Expected value of X |
| \mathbb{E}[X \mid Y] | Conditional expectation of X given Y |
| \mu_X | Mean of X (= \mathbb{E}[X]) |
| \sigma^2_X or \operatorname{Var}(X) | Variance of X |
| \sigma_X | Standard deviation of X |
| \operatorname{Cov}(X, Y) | Covariance of X and Y |
| \rho_{XY} | Pearson correlation |
| M_X(t) | Moment generating function |
| G_X(z) | Probability generating function |
| \varphi_X(t) | Characteristic function |

R.6 Information Theory

| Symbol | Meaning |
|---|---|
| H(X) | Shannon entropy of X |
| H(X, Y) | Joint entropy |
| H(X \mid Y) | Conditional entropy |
| I(X; Y) | Mutual information |
| D_{\mathrm{KL}}(p \parallel q) | KL divergence from q to p |
| H(p, q) | Cross-entropy of q under p |

End of Appendices.