Math for LLMs

Probability Theory / Introduction and Random Variables

Introduction to Probability and Random Variables: Part 1 - Intuition to Exercises

1. Intuition

1.1 What Is Probability? Three Interpretations

Probability is a number in [0, 1] assigned to an event, measuring how likely that event is to occur. But what does "likely" mean? There are three major interpretations, and each underpins a different school of thought in statistics and machine learning.

1. Classical (equally likely outcomes). If a sample space \Omega has N equally likely outcomes and event A contains k of them, then P(A) = k/N. This is the interpretation of textbook dice and card problems. It breaks down when outcomes are not equally likely - flipping a biased coin, for instance.

2. Frequentist (long-run relative frequency). P(A) is the limiting proportion of times A occurs in an infinite sequence of identical, independent experiments:

P(A) = \lim_{n \to \infty} \frac{\text{count of } A \text{ in } n \text{ trials}}{n}

This is the foundation of classical statistics (hypothesis testing, confidence intervals). Its weakness: it cannot assign probabilities to one-off events ("the probability that GPT-5 passes the bar exam").

3. Bayesian (degree of belief). P(A) represents a rational agent's degree of belief that A is true, updated as new evidence arrives. This interpretation allows probabilities for singular events and underpins Bayesian inference, Bayesian neural networks, and probabilistic generative models. Crucially, different agents can assign different prior probabilities to the same event - and Bayes' theorem tells them how to update rationally.

THREE INTERPRETATIONS OF PROBABILITY
========================================================================

  Classical           Frequentist              Bayesian
  -----------------   --------------------     ----------------------
  P(A) = k/N          P(A) = lim freq(A)       P(A) = degree of belief
  Equally likely       Long-run proportion      Updated by evidence
  outcomes             Objective               Subjective / rational
  Dice, cards          Hypothesis tests         Generative models
                       Confidence intervals     Bayesian NNs, VAEs

  All three satisfy the same axioms (Section 2.2).
  They differ only in WHAT those axioms are applied to.

========================================================================

For AI: Modern ML systems implicitly blend all three. A softmax output layer uses the classical interpretation (outputs sum to 1). Training loop analysis uses frequentist reasoning (expected loss over the data distribution). Bayesian neural networks, RLHF reward models, and variational autoencoders use Bayesian reasoning explicitly.

1.2 Why Probability Is the Language of AI

Every component of a modern AI system is probabilistic:

  • Data: Training sets are finite samples from an unknown data-generating distribution p_{\text{data}}(\mathbf{x}, y). The goal of supervised learning is to find f_\theta that estimates p_{\text{data}}(y \mid \mathbf{x}).
  • Model outputs: A language model defines a probability distribution P(x_t \mid x_{<t}) over the next token. Generation is sampling from this distribution. Temperature scaling, top-k and top-p sampling are all operations on probability distributions.
  • Loss functions: Cross-entropy is the negative log-likelihood of the data under the model's distribution. Minimising cross-entropy is equivalent to maximising the probability the model assigns to the training data.
  • Regularisation: Dropout applies Bernoulli random variables to activations. Weight decay corresponds to a Gaussian prior in a Bayesian interpretation (MAP estimation).
  • Uncertainty quantification: Calibration of a classifier means its predicted probabilities match empirical frequencies. A well-calibrated model saying "70% confident" should be right 70% of the time.
  • Generative models: Diffusion models define a forward Markov chain (adding Gaussian noise) and learn to reverse it. VAEs learn p_\theta(\mathbf{x} \mid \mathbf{z}) and q_\phi(\mathbf{z} \mid \mathbf{x}). Normalising flows learn invertible transformations of simple distributions. All are explicitly probabilistic.

1.3 Historical Timeline

Year | Person | Contribution
1654 | Pascal & Fermat | Correspondence on gambling problems - foundations of combinatorial probability
1713 | Jacob Bernoulli | Ars Conjectandi - law of large numbers, Bernoulli distribution
1763 | Thomas Bayes (posthumous) | Essay on inverse probability - what we now call Bayes' theorem
1812 | Pierre-Simon Laplace | Théorie analytique des probabilités - systematic probability theory, Laplace approximation
1837 | Siméon Poisson | Poisson distribution - rare events in large samples
1900 | Karl Pearson | Chi-squared distribution, correlation coefficient
1933 | Andrei Kolmogorov | Rigorous axiomatic foundation - the three axioms that unify all interpretations
1950s | Shannon | Information theory - entropy connects probability to communication
1980s | Judea Pearl | Bayesian networks - graphical models for probabilistic reasoning
1990s | Gelman, Rubin et al. | MCMC methods - practical Bayesian inference for complex models
2013 | Kingma & Welling | Variational Autoencoders - learned latent distributions via reparameterisation
2020 | Ho et al. | Denoising Diffusion Probabilistic Models - probabilistic generative modelling at scale

2. Formal Definitions - Probability Spaces

2.1 Sample Spaces and Events

Definition 2.1 (Sample space). The sample space \Omega is the set of all possible outcomes of a random experiment. Each element \omega \in \Omega is called an elementary outcome or sample point.

Examples:

  • Tossing a fair coin: \Omega = \{H, T\}
  • Rolling a six-sided die: \Omega = \{1, 2, 3, 4, 5, 6\}
  • Measuring a person's height: \Omega = (0, \infty) \subset \mathbb{R}
  • Choosing a word from a vocabulary: \Omega = \{\text{the}, \text{cat}, \text{sat}, \ldots\} (finite vocabulary)
  • One training step of a neural network: \Omega = all possible mini-batch draws

Definition 2.2 (Event). An event A is a subset of \Omega, i.e., A \subseteq \Omega. We say event A occurs if the observed outcome \omega \in A.

Event algebra. Events combine using set operations:

  • Complement: A^c = \Omega \setminus A - "A does not occur"
  • Union: A \cup B - "A or B (or both) occur"
  • Intersection: A \cap B - "both A and B occur"
  • Difference: A \setminus B = A \cap B^c - "A occurs but B does not"

The \sigma-algebra (brief note). For continuous sample spaces (e.g., \Omega = \mathbb{R}), we cannot assign probabilities to every subset - some are too pathological (Vitali sets). The solution is to restrict to a \sigma-algebra \mathcal{F}: a collection of subsets of \Omega that contains \Omega and is closed under complement and countable unions. The standard choice for \Omega = \mathbb{R} is the Borel \sigma-algebra \mathcal{B}(\mathbb{R}), generated by all open intervals. For this course, you can think of "events" as "any reasonable subset of \Omega" - the measure-theoretic subtlety is noted but not required.

2.2 The Three Kolmogorov Axioms

In 1933, Andrei Kolmogorov placed probability theory on a rigorous axiomatic foundation, resolving a century of informal debate about what probability "really is."

Definition 2.3 (Probability measure). A probability measure is a function P : \mathcal{F} \to [0, 1] satisfying:

Axiom 1 (Non-negativity):

P(A) \geq 0 \quad \text{for all events } A

Axiom 2 (Normalisation):

P(\Omega) = 1

Axiom 3 (Countable additivity / \sigma-additivity): If A_1, A_2, A_3, \ldots are pairwise disjoint events (i.e., A_i \cap A_j = \emptyset for i \neq j), then:

P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)

The triple (\Omega, \mathcal{F}, P) is called a probability space.

Why these three axioms? They are minimal: they are logically independent (none follows from the other two), they rule out degenerate assignments (e.g., P(A) = -0.5 or P(\Omega) = 2), and they imply everything else in probability theory via logical deduction.

2.3 Immediate Consequences of the Axioms

From just three axioms, we can derive all the basic rules of probability:

Theorem 2.1 (Probability of the empty set): P(\emptyset) = 0.

Proof. \Omega and \emptyset are disjoint with \Omega \cup \emptyset = \Omega. By Axiom 3: P(\Omega) + P(\emptyset) = P(\Omega) = 1. So P(\emptyset) = 0. \square

Theorem 2.2 (Finite additivity): For disjoint A_1, \ldots, A_n:

P(A_1 \cup \cdots \cup A_n) = P(A_1) + \cdots + P(A_n)

Proof. Set A_{n+1} = A_{n+2} = \cdots = \emptyset, apply Axiom 3, then Theorem 2.1. \square

Theorem 2.3 (Complement rule): P(A^c) = 1 - P(A).

Proof. A and A^c are disjoint with A \cup A^c = \Omega. By Axiom 3: P(A) + P(A^c) = P(\Omega) = 1. \square

Theorem 2.4 (Monotonicity): If A \subseteq B, then P(A) \leq P(B).

Proof. Write B = A \cup (B \setminus A), where A and B \setminus A are disjoint. By Axiom 3: P(B) = P(A) + P(B \setminus A) \geq P(A) (since P(B \setminus A) \geq 0 by Axiom 1). \square

Theorem 2.5 (Bounds): 0 \leq P(A) \leq 1 for all A.

Proof. \emptyset \subseteq A \subseteq \Omega, so by monotonicity: 0 = P(\emptyset) \leq P(A) \leq P(\Omega) = 1. \square

2.4 Probability Measures - Examples and Non-Examples

Examples of valid probability measures:

Sample space \Omega | Probability measure P
\{1,2,3,4,5,6\} | P(\{k\}) = 1/6 for each k (fair die)
\{1,2,3,4,5,6\} | P(\{k\}) = k/21 for each k (loaded die; \sum_k k/21 = 1)
[0,1] | P([a,b]) = b - a for 0 \leq a \leq b \leq 1 (uniform)
\mathbb{R} | P(A) = \int_A \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx (standard Gaussian)
\{0,1\} | P(\{1\}) = p, P(\{0\}) = 1 - p for p \in [0,1] (Bernoulli)

Non-examples (violate at least one axiom):

Assignment | Axiom violated
P(\{H\}) = 0.6, P(\{T\}) = 0.6 | Axiom 2: total = 1.2 \neq 1
P(\{H\}) = -0.1, P(\{T\}) = 1.1 | Axiom 1: negative probability
P(\{H\}) = 0.5, P(\{T\}) = 0.5, P(\{H,T\}) = 0.8 | Axiom 3: P(\{H\} \cup \{T\}) \neq P(\{H\}) + P(\{T\})

3. Computing Probabilities - The Core Rules

3.1 Complement and Monotonicity

The complement rule is one of the most practically useful consequences of the axioms.

Complement rule: P(A^c) = 1 - P(A)

This is particularly useful when P(A^c) is easier to compute than P(A) directly. This technique - "compute the complement" - is ubiquitous in probability:

Example. What is the probability that at least one of 10 coin flips is heads (for a fair coin)?

Direct approach: sum P(\text{exactly } k \text{ heads}) for k = 1, \ldots, 10 - tedious.

Complement approach: P(\text{at least one head}) = 1 - P(\text{no heads}) = 1 - (1/2)^{10} = 1023/1024 \approx 0.999.

For AI: The complement rule underlies the "at least one" union bound used in statistical learning theory. If we want P(\text{at least one of } n \text{ bad events occurs}) \leq \delta, we equivalently want P(\text{all events are good}) \geq 1 - \delta.
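The coin-flip example above can be checked numerically. The sketch below computes the complement-rule answer exactly and cross-checks it against the tedious direct sum over k = 1, ..., 10:

```python
from fractions import Fraction
from math import comb

# P(at least one head in 10 fair flips) via the complement rule
n = 10
p_complement = 1 - Fraction(1, 2) ** n          # 1 - P(no heads)

# Cross-check with the direct sum over k = 1..10 heads
p_direct = sum(Fraction(comb(n, k), 2 ** n) for k in range(1, n + 1))

assert p_complement == p_direct == Fraction(1023, 1024)
print(float(p_complement))  # 0.9990234375
```

Using `Fraction` keeps the arithmetic exact, so the two approaches agree to the last digit rather than merely to floating-point tolerance.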

3.2 Inclusion-Exclusion Principle

Theorem 3.1 (Inclusion-exclusion, two events):

P(A \cup B) = P(A) + P(B) - P(A \cap B)

Proof. Decompose A \cup B = A \cup (B \setminus A) into disjoint parts, and B = (A \cap B) \cup (B \setminus A) into disjoint parts. By Axiom 3: P(A \cup B) = P(A) + P(B \setminus A) and P(B) = P(A \cap B) + P(B \setminus A). Subtracting: P(B \setminus A) = P(B) - P(A \cap B). Substituting: P(A \cup B) = P(A) + P(B) - P(A \cap B). \square

The subtraction corrects for double-counting A \cap B.

Generalisation to n events:

P\left(\bigcup_{i=1}^n A_i\right) = \sum_i P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots

Union bound (Boole's inequality): A simpler but looser bound:

P\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n P(A_i)

For AI: The union bound is the workhorse of statistical learning theory. To prove that a learned classifier generalises to unseen data with high probability, one typically applies the union bound over all possible hypotheses, then uses concentration inequalities (Section 05) to bound each term.

3.3 Conditional Probability

When we observe that event B has occurred, this changes our state of knowledge. The probability of A given that B occurred is:

Definition 3.1 (Conditional probability):

P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0

Intuition: Conditioning on B restricts the sample space from \Omega to B. Within this restricted space, we renormalise by dividing by P(B) to ensure the conditional probabilities sum to 1.

CONDITIONAL PROBABILITY - GEOMETRIC INTUITION
========================================================================

  Before conditioning:             After conditioning on B:
  +------------------------+      +------------------+
  | Ω                      |      | B                |
  |  +-----+------+-----+  |      |  +------------+  |
  |  | A\B | A∩B  | B\A |  |      |  |    A∩B     |  |
  |  +-----+------+-----+  |      |  +------------+  |
  |                        |      |  (renormalised)  |
  +------------------------+      +------------------+

  P(A) = area(A) / area(Ω)        P(A|B) = area(A∩B) / area(B)

========================================================================

Key properties of conditional probability:

  • P(\cdot \mid B) is itself a valid probability measure on (\Omega, \mathcal{F}): it satisfies all three Kolmogorov axioms
  • P(B \mid B) = 1 - given B occurred, B certainly occurred
  • P(A^c \mid B) = 1 - P(A \mid B) - the complement rule holds conditionally

Non-example: P(A \mid B) \neq P(B \mid A) in general. This asymmetry is the source of the classic prosecutor's fallacy: P(\text{DNA match} \mid \text{innocent}) is very small, but P(\text{innocent} \mid \text{DNA match}) can be substantial in a large population.

3.4 Chain Rule of Probability

Rearranging the definition of conditional probability gives:

P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)

This extends to any number of events:

Theorem 3.2 (Chain rule / product rule):

P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \cdot P(A_2 \mid A_1) \cdot P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1})

For AI: The chain rule of probability is the mathematical foundation of autoregressive language models. A language model with vocabulary \mathcal{V} and context window T defines:

P(x_1, x_2, \ldots, x_T) = P(x_1) \cdot P(x_2 \mid x_1) \cdot P(x_3 \mid x_1, x_2) \cdots P(x_T \mid x_1, \ldots, x_{T-1})

Every GPT-family model is directly computing these conditional probabilities. The model is trained to minimise the average negative log of each factor - i.e., the cross-entropy loss.
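A minimal sketch of this factorisation, using a toy hand-specified model (the conditional probabilities below are invented numbers for illustration, and the model is simplified to condition on only the previous token rather than the full prefix):

```python
import math

# Toy autoregressive model: hypothetical next-token probabilities.
p_first = {"the": 0.6, "a": 0.4}
p_next = {  # P(x_t | x_{t-1}) - a Markov simplification of P(x_t | x_{<t})
    "the": {"cat": 0.7, "dog": 0.3},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.9, "ran": 0.1},
    "dog": {"sat": 0.2, "ran": 0.8},
}

def sequence_prob(tokens):
    """Chain rule: P(x_1 .. x_T) = P(x_1) * prod_t P(x_t | x_{t-1})."""
    p = p_first[tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        p *= p_next[prev][cur]
    return p

tokens = ["the", "cat", "sat"]
p = sequence_prob(tokens)          # 0.6 * 0.7 * 0.9 = 0.378

# Cross-entropy loss = average negative log of each chain-rule factor
nll = -sum(math.log(f) for f in [0.6, 0.7, 0.9]) / 3
print(p, nll)
```

The per-token loss `nll` is exactly `-log(p) / T`: minimising cross-entropy and maximising the sequence probability are the same objective.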

3.5 Law of Total Probability

Definition 3.2 (Partition). Events B_1, B_2, \ldots, B_n form a partition of \Omega if they are pairwise disjoint (B_i \cap B_j = \emptyset for i \neq j) and exhaustive (\bigcup_i B_i = \Omega).

Theorem 3.3 (Law of total probability). If B_1, \ldots, B_n partition \Omega and P(B_i) > 0 for all i:

P(A) = \sum_{i=1}^n P(A \mid B_i) \cdot P(B_i)

Proof. Since B_1, \ldots, B_n partition \Omega: A = \bigcup_i (A \cap B_i), a disjoint union. By Axiom 3: P(A) = \sum_i P(A \cap B_i) = \sum_i P(A \mid B_i) P(B_i). \square

Example. A language model generates a sentence that is either grammatical (B_1, probability 0.7) or ungrammatical (B_2, probability 0.3). Given grammatical, the probability a human accepts it is 0.9; given ungrammatical, 0.2. What is the overall acceptance probability?

P(\text{accepted}) = P(\text{acc} \mid B_1) P(B_1) + P(\text{acc} \mid B_2) P(B_2) = 0.9 \times 0.7 + 0.2 \times 0.3 = 0.63 + 0.06 = 0.69
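The worked example above can be sketched in code, with the exact total-probability sum checked against a Monte Carlo simulation of the two-stage experiment:

```python
import random

# Law of total probability for the grammaticality example,
# checked against a Monte Carlo simulation.
p_b = {"grammatical": 0.7, "ungrammatical": 0.3}   # partition of Omega
p_acc_given_b = {"grammatical": 0.9, "ungrammatical": 0.2}

p_accept = sum(p_acc_given_b[b] * p_b[b] for b in p_b)  # 0.69

random.seed(0)
n = 200_000
hits = 0
for _ in range(n):
    # Sample which partition cell occurred, then the conditional outcome
    b = "grammatical" if random.random() < p_b["grammatical"] else "ungrammatical"
    if random.random() < p_acc_given_b[b]:
        hits += 1
print(p_accept, hits / n)  # exact 0.69 vs simulated estimate
```

With 200,000 trials the simulated frequency lands within about 0.002 of the exact answer, illustrating the frequentist reading of the same quantity.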

3.6 Bayes' Theorem

Bayes' theorem inverts conditional probability: from P(A \mid B) we can compute P(B \mid A).

Theorem 3.4 (Bayes' theorem):

P(B \mid A) = \frac{P(A \mid B) \cdot P(B)}{P(A)}

Applying the law of total probability to the denominator (with the partition B, B^c):

P(B \mid A) = \frac{P(A \mid B) \cdot P(B)}{P(A \mid B) P(B) + P(A \mid B^c) P(B^c)}

Bayesian inference terminology:

Term | Symbol | Meaning
Prior | P(B) | Belief about B before seeing A
Likelihood | P(A \mid B) | How probable is A if B is true?
Evidence | P(A) | Total probability of observation A
Posterior | P(B \mid A) | Updated belief about B after seeing A

\underbrace{P(B \mid A)}_{\text{posterior}} = \frac{\overbrace{P(A \mid B)}^{\text{likelihood}} \cdot \overbrace{P(B)}^{\text{prior}}}{\underbrace{P(A)}_{\text{evidence}}}

Classic example - Spam filter. An email is spam with prior P(\text{spam}) = 0.3. The word "lottery" appears in 80% of spam emails but only 5% of legitimate ones. Given that an email contains "lottery", what is P(\text{spam} \mid \text{lottery})?

P(\text{spam} \mid \text{lottery}) = \frac{0.80 \times 0.30}{0.80 \times 0.30 + 0.05 \times 0.70} = \frac{0.24}{0.24 + 0.035} = \frac{0.24}{0.275} \approx 0.873

For AI: Bayes' theorem is the core of:

  • Bayesian neural networks: parameters \theta have a prior P(\theta); training updates to the posterior P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta) P(\theta)
  • RLHF reward modelling: given human preference data, update beliefs about which completions are better
  • Naive Bayes classifiers: classify documents by computing P(\text{class} \mid \text{words}) \propto P(\text{words} \mid \text{class}) P(\text{class})
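The spam-filter calculation above is three lines of code. A minimal sketch, using the two-event partition form of Bayes' theorem (the helper name `posterior` is ours, not a library function):

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' theorem with the two-event partition {B, B^c}."""
    evidence = p_evidence_given_h * prior + p_evidence_given_not_h * (1 - prior)
    return p_evidence_given_h * prior / evidence

# Spam example from above: prior 0.3, "lottery" in 80% of spam, 5% of ham
p = posterior(0.30, 0.80, 0.05)
print(round(p, 3))  # 0.873
```

Note how the prior matters: rerunning with `prior=0.01` drops the posterior sharply even though the likelihood ratio is unchanged, which is exactly the base-rate effect behind the prosecutor's fallacy.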

4. Independence

4.1 Unconditional Independence

Intuitively, events A and B are independent if knowing one occurred gives no information about whether the other occurred.

Definition 4.1 (Independence). Events A and B are independent, written A \perp B, if:

P(A \cap B) = P(A) \cdot P(B)

Equivalently (when P(B) > 0): P(A \mid B) = P(A) - conditioning on B does not change the probability of A.

Why this definition? The conditional probability P(A \mid B) = P(A \cap B)/P(B) equals P(A) if and only if P(A \cap B) = P(A)P(B). The multiplicative form is preferred because it is symmetric in A and B and remains well-defined even when P(A) = 0 or P(B) = 0.

Examples of independent events:

  • Two fair coin flips: P(\text{H on flip 1} \cap \text{H on flip 2}) = 1/4 = 1/2 \times 1/2 ✓
  • Drawing with replacement: second draw is independent of first
  • Two separate neural network weight initialisations drawn i.i.d.

Non-examples (dependent events):

  • Drawing without replacement: if first card is an ace, second ace is less likely
  • Correlated features in a dataset
  • Sequential tokens in a language model: P(x_t \mid x_{t-1}) \neq P(x_t)

For n events (mutual independence): Events A_1, \ldots, A_n are mutually independent if for every subset S \subseteq \{1, \ldots, n\}:

P\left(\bigcap_{i \in S} A_i\right) = \prod_{i \in S} P(A_i)

This is strictly stronger than pairwise independence (see Section 4.3).

4.2 Conditional Independence

Conditional independence is one of the most important concepts in probabilistic modelling and graphical models.

Definition 4.2 (Conditional independence). Events A and B are conditionally independent given C, written A \perp B \mid C, if:

P(A \cap B \mid C) = P(A \mid C) \cdot P(B \mid C)

Equivalently (when P(B \cap C) > 0): P(A \mid B, C) = P(A \mid C) - once we know C, learning B gives no additional information about A.

The classic example - explaining away. Let A = "the grass is wet", B = "the sprinkler is on", C = "it rained". Rain and sprinkler are independent causes of wet grass. But:

  • Without knowing whether it rained: A and B are marginally dependent (if the grass is wet, the sprinkler being on is more likely)
  • Given that it rained: A \perp B \mid C - knowing the sprinkler status adds nothing once we know it rained

Caution: Independence and conditional independence are logically independent:

  • A \perp B does NOT imply A \perp B \mid C (conditioning can introduce dependence)
  • A \perp B \mid C does NOT imply A \perp B (marginalising out C can introduce dependence)

For AI: Conditional independence is the foundation of:

  • Naive Bayes: assumes features are conditionally independent given the class label
  • Hidden Markov Models: x_t \perp x_{t'} \mid z_t - observations are independent given the hidden states
  • Bayesian networks: the factorisation P(X_1, \ldots, X_n) = \prod_i P(X_i \mid \text{parents}(X_i)) encodes a set of conditional independence assumptions
  • Attention masks: causal masking in transformers enforces the autoregressive factorisation - each token's prediction is conditioned only on earlier tokens

4.3 Pairwise vs Mutual Independence

Pairwise independence means P(A_i \cap A_j) = P(A_i)P(A_j) for all pairs (i, j).

Mutual independence means the joint probability factorises for every subset (Definition 4.1 for n events).

Mutual independence implies pairwise independence but not vice versa. The following counterexample (Bernstein, 1928) shows that pairwise independence can hold without mutual independence:

Roll two fair dice. Define:

  • A = "first die shows even" - P(A) = 1/2
  • B = "second die shows even" - P(B) = 1/2
  • C = "sum of the dice is even" - P(C) = 1/2

Check pairwise: P(A \cap B) = 1/4 = P(A)P(B) ✓, P(A \cap C) = 1/4 ✓, P(B \cap C) = 1/4 ✓.

But P(A \cap B \cap C) = P(\text{both even, sum even}) = P(\text{both even}) = 1/4 \neq P(A)P(B)P(C) = 1/8.

So A, B, C are pairwise independent but not mutually independent.
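The dice counterexample is small enough to verify by exhaustive enumeration over all 36 equally likely outcomes:

```python
from fractions import Fraction
from itertools import product

# Exhaustive check of the two-dice example: A, B, C are pairwise
# independent, yet P(A ∩ B ∩ C) != P(A)P(B)P(C).
outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def prob(event):
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

A = lambda w: w[0] % 2 == 0           # first die even
B = lambda w: w[1] % 2 == 0           # second die even
C = lambda w: (w[0] + w[1]) % 2 == 0  # sum even

both = lambda e1, e2: (lambda w: e1(w) and e2(w))

assert prob(both(A, B)) == prob(A) * prob(B) == Fraction(1, 4)
assert prob(both(A, C)) == prob(A) * prob(C) == Fraction(1, 4)
assert prob(both(B, C)) == prob(B) * prob(C) == Fraction(1, 4)
assert prob(lambda w: A(w) and B(w) and C(w)) == Fraction(1, 4)  # not 1/8
```

Any two of the three events determine the third (e.g. if both dice are even the sum is even), which is exactly why the triple intersection fails to factorise.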

For AI: The i.i.d. (independent and identically distributed) assumption in ML training is a mutual independence assumption: P((\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(n)}, y^{(n)})) = \prod_{i=1}^n P(\mathbf{x}^{(i)}, y^{(i)}). This assumption underlies the derivation of cross-entropy loss as a maximum likelihood estimator.

4.4 Why Independence Matters for AI

Independence assumptions make otherwise intractable problems tractable.

Without independence: The joint distribution of n binary random variables requires 2^n - 1 parameters. For n = 100 features, this is 2^{100} - 1 \approx 10^{30} - utterly infeasible.

With conditional independence (Naive Bayes): Given the class C, features are independent: P(\mathbf{x} \mid C) = \prod_{j=1}^n P(x_j \mid C). Now only n parameters per class - linear in n.

Independence in optimisation: Mini-batch gradient descent assumes that the samples in a batch are independent draws from the training distribution. This independence makes the batch gradient an unbiased estimator of the full gradient.

Independence and parallelism: Independently distributed data can be processed in parallel without communication - the foundation of distributed ML training (data parallelism).


5. Random Variables - Formal Foundation

5.1 Definition: Random Variable as a Measurable Function

The outcomes in a sample space \Omega can be anything - text strings, images, dice faces. To do mathematics with them, we need to convert them to numbers. A random variable does exactly this.

Definition 5.1 (Random variable). A random variable X is a (measurable) function from the sample space \Omega to the real line \mathbb{R}:

X : \Omega \to \mathbb{R}, \quad \omega \mapsto X(\omega)

The measurability requirement ensures that \{\omega \in \Omega : X(\omega) \leq x\} is a valid event (belongs to \mathcal{F}) for every x \in \mathbb{R}, so that we can assign it a probability. For all practical purposes, every function you would naturally write down is measurable.

Examples:

  • Die roll: \Omega = \{1, \ldots, 6\}, X(\omega) = \omega (the outcome itself)
  • Coin flip: \Omega = \{H, T\}, X(H) = 1, X(T) = 0 (Bernoulli representation)
  • Sentence length: \Omega = \{\text{all English sentences}\}, X(\omega) = |\omega| (number of words)
  • Model loss: \Omega = \{\text{all possible mini-batches}\}, X(\omega) = \mathcal{L}(\theta; \omega)

Non-examples:

  • The sample space \Omega itself is not a random variable (it is the domain, not a function)
  • A deterministic constant c can be viewed as a degenerate random variable X(\omega) = c for all \omega, but calling it "random" is misleading

Notation: Random variables are typically denoted by uppercase letters (X, Y, Z); their realised values (specific numbers) by lowercase (x, y, z). So "X = x" means "the random variable X takes the value x."

The event \{X \leq x\} = \{\omega \in \Omega : X(\omega) \leq x\} is well-defined, and we write P(X \leq x) as shorthand for P(\{X \leq x\}).
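A random variable really is just a function on the sample space, and the shorthand P(X ≤ x) really is the measure of a set of outcomes. A minimal sketch with \Omega = two coin flips and X = number of heads:

```python
from fractions import Fraction

# Omega: all outcomes of two fair coin flips, with the uniform measure.
omega = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]
P = {w: Fraction(1, 4) for w in omega}

# X : Omega -> R, counting heads
X = lambda w: sum(1 for flip in w if flip == "H")

def cdf(x):
    """F_X(x) = P({w in Omega : X(w) <= x})."""
    return sum(P[w] for w in omega if X(w) <= x)

assert cdf(0) == Fraction(1, 4)   # only (T, T)
assert cdf(1) == Fraction(3, 4)   # everything except (H, H)
assert cdf(2) == 1
```

The event {X ≤ 1} is literally the subset {(H,T), (T,H), (T,T)} of \Omega, and its probability is the sum of the measures of those outcomes.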

5.2 Discrete vs Continuous - Taxonomy

Random variables split into two main types based on the range of X.

TAXONOMY OF RANDOM VARIABLES
========================================================================

  Random Variable X : Omega -> R
  |
  +-- Discrete: X takes countably many values {x_1, x_2, x_3, ...}
  |   |
  |   +-- Characterised by PMF: p(x) = P(X = x)
  |   +-- CDF: F(x) = sum of p(x_i) over all x_i <= x   (staircase)
  |   +-- Examples: Bernoulli, Geometric, Binomial, Poisson, Categorical
  |
  +-- Continuous: X takes values in an interval (uncountably many)
      |
      +-- Characterised by PDF: f(x) with P(a <= X <= b) = integral of f over [a, b]
      +-- CDF: F(x) = integral of f(t) over (-inf, x]   (smooth, differentiable)
      +-- Examples: Uniform, Gaussian, Exponential, Beta, Gamma

  Note: "Mixed" random variables exist (CDF is a mix of jumps and
  smooth parts) but are rare in practice.

========================================================================

Key distinction - probability at a point:

  • Discrete: P(X = x) can be positive (it is the PMF value)
  • Continuous: P(X = x) = 0 for every specific x (the probability of hitting any exact value is zero). Probabilities come from integrating over intervals.

For AI: The distinction is crucial for loss functions. Cross-entropy loss for classification uses a discrete distribution (Categorical) over a finite vocabulary or class set. For regression, the loss is often the negative log-likelihood of a Gaussian (continuous distribution). Diffusion models use a mixture: discrete time steps with continuous noise.

5.3 The Cumulative Distribution Function

The CDF provides a unified description of both discrete and continuous random variables.

Definition 5.2 (Cumulative distribution function). The CDF of a random variable X is:

F_X(x) = P(X \leq x), \quad x \in \mathbb{R}

Examples:

  • Fair die: F(x) = 0 for x < 1; F(x) = k/6 for k \leq x < k+1, k = 1, \ldots, 6; F(x) = 1 for x \geq 6. A staircase with jumps of 1/6 at each integer.
  • Standard normal: F(x) = \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-t^2/2}\,dt. A smooth S-shaped curve.
  • Bernoulli(p): F(x) = 0 for x < 0; F(x) = 1 - p for 0 \leq x < 1; F(x) = 1 for x \geq 1.

Computing interval probabilities from the CDF:

P(a < X \leq b) = F(b) - F(a)

P(X > x) = 1 - F(x)

5.4 CDF Properties and the Fundamental Theorem

Theorem 5.1 (CDF properties). Any CDF satisfies:

  1. Monotone non-decreasing: x_1 \leq x_2 \implies F(x_1) \leq F(x_2)
  2. Right-continuous: \lim_{y \downarrow x} F(y) = F(x) for all x
  3. Limits: \lim_{x \to -\infty} F(x) = 0 and \lim_{x \to +\infty} F(x) = 1
  4. Jump characterisation: P(X = x) = F(x) - F(x^-), where F(x^-) = \lim_{y \uparrow x} F(y). For continuous X this is 0 everywhere; for discrete X it equals the PMF at jump points.

These four properties completely characterise CDFs - any function satisfying them is the CDF of some random variable. This is what makes the CDF both universal and tractable.

Fundamental theorem (continuous case): For continuous X with CDF F and PDF f:

f(x) = \frac{d}{dx} F(x), \quad F(x) = \int_{-\infty}^x f(t)\, dt

The PDF is the derivative of the CDF; the CDF is the integral of the PDF.
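Both directions of the fundamental theorem can be checked numerically for the standard normal, whose CDF is expressible via the error function as \Phi(x) = (1 + \operatorname{erf}(x/\sqrt{2}))/2:

```python
from math import erf, exp, pi, sqrt

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Interval probability from the CDF: P(a < X <= b) = F(b) - F(a)
p_interval = Phi(1.0) - Phi(-1.0)       # the familiar "68%" of mass in [-1, 1]
print(round(p_interval, 4))

# PDF = derivative of CDF, checked by a central finite difference at 0
h = 1e-6
pdf_numeric = (Phi(h) - Phi(-h)) / (2 * h)
pdf_exact = exp(0.0) / sqrt(2 * pi)     # f(0) = 1/sqrt(2*pi)
assert abs(pdf_numeric - pdf_exact) < 1e-6
```

This is also how interval probabilities are computed in practice: libraries expose the CDF, and P(a < X ≤ b) is always a difference of two CDF evaluations.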


6. Discrete Random Variables

6.1 Probability Mass Function

Definition 6.1 (PMF). For a discrete random variable X with countable range \mathcal{X} = \{x_1, x_2, \ldots\}, the probability mass function is:

p_X(x) = P(X = x), \quad x \in \mathcal{X}

Properties:

  1. p_X(x) \geq 0 for all x (Axiom 1)
  2. \sum_{x \in \mathcal{X}} p_X(x) = 1 (Axiom 2)
  3. p_X(x) = 0 for x \notin \mathcal{X}

CDF from PMF: F_X(x) = \sum_{x_i \leq x} p_X(x_i) - sum all PMF values at or below x.

Non-examples of valid PMFs:

  • p(0) = 0.6, p(1) = 0.6: sum = 1.2 > 1
  • p(0) = -0.2, p(1) = 1.2: negative value
  • p(k) = 1/k for k = 1, 2, \ldots: \sum_k 1/k = \infty (the harmonic series diverges)

For AI: Every classifier's output layer implicitly defines a PMF over classes. The softmax function \text{softmax}(\mathbf{z})_k = e^{z_k} / \sum_j e^{z_j} produces a vector of non-negative values summing to 1 - a valid PMF over the K classes. Training minimises the cross-entropy between this PMF and the one-hot PMF of the true label.
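A minimal softmax sketch, checking that its output satisfies the PMF properties of Definition 6.1 (the max-subtraction is the standard numerical-stability trick; it leaves the result unchanged because it cancels in the ratio):

```python
import math

def softmax(z):
    """Numerically stable softmax: subtract max(z) before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -1.0]   # illustrative raw scores
pmf = softmax(logits)

# PMF properties from Definition 6.1
assert all(p >= 0 for p in pmf)
assert abs(sum(pmf) - 1.0) < 1e-12

# Cross-entropy against a one-hot label with true class 0: -log p_0
loss = -math.log(pmf[0])
print(pmf, loss)
```

With a one-hot target, cross-entropy reduces to the negative log of the probability assigned to the true class, tying the loss directly back to the PMF.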

6.2 Bernoulli Distribution - The Canonical Example

The Bernoulli distribution is the simplest non-trivial probability distribution: a single binary trial.

Definition 6.2 (Bernoulli distribution). X \sim \text{Bernoulli}(p) for p \in [0, 1] if:

P(X = 1) = p, \quad P(X = 0) = 1 - p

or equivalently: p_X(x) = p^x (1-p)^{1-x} for x \in \{0, 1\}.

Properties:

  • Mean: \mathbb{E}[X] = 1 \cdot p + 0 \cdot (1-p) = p
  • Variance: \operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = p - p^2 = p(1-p)
  • Maximum variance at p = 1/2: \operatorname{Var}(X) = 1/4
  • CDF: F(x) = 0 for x < 0; F(x) = 1 - p for 0 \leq x < 1; F(x) = 1 for x \geq 1

ML applications of the Bernoulli distribution:

  • Binary classification: YBernoulli(σ(wx))Y \sim \text{Bernoulli}(\sigma(\mathbf{w}^\top \mathbf{x})) where σ\sigma is sigmoid
  • Dropout: each neuron is active with probability 1pdrop1 - p_{\text{drop}}; the mask is Bernoulli(1pdrop)d\text{Bernoulli}(1-p_{\text{drop}})^d
  • Coin-flip initialisation tests: checking whether random initialisation is breaking symmetry
  • Binary attention masks: a position is masked with Bernoulli(pmask)\text{Bernoulli}(p_{\text{mask}}) in masked language modelling (BERT, RoBERTa)
  • Bernoulli reward signals: in RL, some environments give binary rewards (00 or 11)

Bernoulli and entropy. The binary entropy of a Bernoulli(pp) variable is:

H(p)=plogp(1p)log(1p)H(p) = -p \log p - (1-p) \log(1-p)

This is maximised at p=1/2p = 1/2 (maximum uncertainty) and equals 0 at p{0,1}p \in \{0, 1\} (certainty). Binary cross-entropy loss is the cross-entropy between the label and the prediction: for a true label y{0,1}y \in \{0,1\} and predicted probability p^\hat{p}, L=ylogp^(1y)log(1p^)\mathcal{L} = -y \log \hat{p} - (1-y)\log(1-\hat{p}).
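The Bernoulli mean, variance, and binary entropy can all be verified by simulation; a minimal sketch in stdlib Python (the seed, sample size, and helper name are arbitrary choices):

```python
import math
import random

def binary_entropy(p):
    """H(p) = -p log p - (1-p) log(1-p) in nats, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

rng = random.Random(0)
p = 0.3
samples = [1 if rng.random() < p else 0 for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

assert abs(mean - p) < 0.01           # E[X] = p
assert abs(var - p * (1 - p)) < 0.01  # Var(X) = p(1-p)
assert abs(binary_entropy(0.5) - math.log(2)) < 1e-12  # maximum at p = 1/2
```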

6.3 Geometric Distribution

The Geometric distribution models waiting times: how many Bernoulli trials until the first success?

Definition 6.3 (Geometric distribution). XGeometric(p)X \sim \text{Geometric}(p) for p(0,1]p \in (0,1] if:

P(X=k)=(1p)k1p,k=1,2,3,P(X = k) = (1-p)^{k-1} p, \quad k = 1, 2, 3, \ldots

(Here XX is the trial number of the first success; an alternative convention has XX = number of failures before first success, giving PMF (1p)kp(1-p)^k p for k=0,1,2,k = 0, 1, 2, \ldots)

Properties:

  • Mean: E[X]=1/p\mathbb{E}[X] = 1/p
  • Variance: Var(X)=(1p)/p2\operatorname{Var}(X) = (1-p)/p^2
  • Memoryless property: P(X>m+nX>m)=P(X>n)P(X > m+n \mid X > m) = P(X > n) - the past gives no information about the remaining wait time. The Geometric is the only discrete memoryless distribution.

Geometric series check (normalisation):

k=1(1p)k1p=pk=0(1p)k=p11(1p)=p1p=1\sum_{k=1}^\infty (1-p)^{k-1}p = p \sum_{k=0}^\infty (1-p)^k = p \cdot \frac{1}{1-(1-p)} = p \cdot \frac{1}{p} = 1 \checkmark

ML applications:

  • Sentence length models: the length of a generated sentence can be modelled geometrically (each word has probability pp of being the last)
  • Number of gradient steps to convergence: in simplified convergence analyses, the number of steps until L<ε\|\nabla\mathcal{L}\| < \varepsilon can have an approximately geometric distribution under certain conditions
  • Retry logic: in RL exploration, the number of random actions before discovering a reward follows a geometric-like distribution in uniform random environments
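A simulation makes the waiting-time interpretation concrete; the sketch below (stdlib Python, illustrative names) draws Bernoulli trials until the first success and checks the results against the formulas above:

```python
import random

def geometric_sample(p, rng):
    """Trial number of the first success in repeated Bernoulli(p) trials."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k

rng = random.Random(42)
p = 0.2
samples = [geometric_sample(p, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
frac_3 = sum(1 for k in samples if k == 3) / len(samples)

assert abs(mean - 1 / p) < 0.1                 # E[X] = 1/p = 5
assert abs(frac_3 - (1 - p) ** 2 * p) < 0.01   # P(X = 3) = 0.8^2 * 0.2 = 0.128
```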

6.4 Preview: Binomial, Categorical, Poisson

The following distributions have their full treatment in Section02-Common-Distributions. They are introduced here briefly for orientation:

Preview: Binomial Distribution Binomial(n,p)\text{Binomial}(n, p) counts the number of successes in nn independent Bernoulli(pp) trials: P(X=k)=(nk)pk(1p)nkP(X = k) = \binom{n}{k}p^k(1-p)^{n-k} for k=0,1,,nk = 0, 1, \ldots, n. Mean npnp, variance np(1p)np(1-p). As nn \to \infty with np=λnp = \lambda fixed, the Binomial converges to Poisson(λ\lambda).

-> Full treatment: Common Distributions Section2

Preview: Categorical Distribution Categorical(p)\text{Categorical}(\mathbf{p}) with pΔK1\mathbf{p} \in \Delta^{K-1} (the probability simplex) models selection of one of KK categories: P(X=k)=pkP(X = k) = p_k. This is the distribution underlying softmax classifiers and language model next-token prediction.

-> Full treatment: Common Distributions Section4

Preview: Poisson Distribution Poisson(λ)\text{Poisson}(\lambda) models the number of events in a fixed interval when events occur independently at constant rate λ\lambda: P(X=k)=eλλk/k!P(X = k) = e^{-\lambda}\lambda^k/k!. Mean = Variance = λ\lambda. Used in modelling word frequencies (Zipf/Poisson approximation), network packet arrivals, and mutation rates.

-> Full treatment: Common Distributions Section3


7. Continuous Random Variables

7.1 Probability Density Function

For continuous random variables, individual points have probability zero. Probability is distributed over intervals via the PDF.

Definition 7.1 (Probability density function). A non-negative function fX:R[0,)f_X : \mathbb{R} \to [0, \infty) is a PDF for XX if:

  1. fX(x)0f_X(x) \geq 0 for all xx
  2. fX(x)dx=1\int_{-\infty}^{\infty} f_X(x)\, dx = 1
  3. For any aba \leq b: P(aXb)=abfX(x)dxP(a \leq X \leq b) = \int_a^b f_X(x)\, dx

Critical distinction: The PDF fX(x)f_X(x) is NOT a probability. It is a probability density - probability per unit length. In particular, fX(x)f_X(x) can exceed 1 (though the integral must equal 1).

Example: fX(x)=2xf_X(x) = 2x for x[0,1]x \in [0,1] and 00 elsewhere. This is a valid PDF: non-negative, integrates to 012xdx=[x2]01=1\int_0^1 2x\,dx = [x^2]_0^1 = 1. Note fX(0.9)=1.8>1f_X(0.9) = 1.8 > 1, but this is fine - it is a density.

Probability at a point: P(X=x)=xxf(t)dt=0P(X = x) = \int_x^x f(t)\,dt = 0. This is why conditioning on a continuous event requires care and is handled via conditional densities (Section03).

Non-examples of valid PDFs:

  • f(x)=xf(x) = x on [0,1][0,1]: 01xdx=1/21\int_0^1 x\,dx = 1/2 \neq 1 (not normalised)
  • f(x)=exf(x) = -e^{-x} for x0x \geq 0: negative values
  • f(x)=1f(x) = 1 for x[0,2]x \in [0,2]: 021dx=21\int_0^2 1\,dx = 2 \neq 1
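These validity checks can be automated with simple numerical integration; a sketch (stdlib Python, midpoint rule, the `integrate` helper is ours):

```python
def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Valid PDF: f(x) = 2x on [0, 1] is non-negative and integrates to 1,
# even though the density exceeds 1 near x = 1.
assert abs(integrate(lambda x: 2 * x, 0, 1) - 1.0) < 1e-6

# Non-examples: not normalised.
assert abs(integrate(lambda x: x, 0, 1) - 0.5) < 1e-6    # integrates to 1/2
assert abs(integrate(lambda x: 1.0, 0, 2) - 2.0) < 1e-6  # integrates to 2
```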

7.2 From CDF to PDF: the Derivative Relationship

For a continuous random variable with PDF fXf_X and CDF FXF_X:

FX(x)=xfX(t)dtfX(x)=ddxFX(x)F_X(x) = \int_{-\infty}^x f_X(t)\, dt \quad \Longleftrightarrow \quad f_X(x) = \frac{d}{dx} F_X(x)

This is the Fundamental Theorem of Calculus applied to probability. The CDF accumulates density; the PDF is the rate of accumulation.

Using the CDF to compute probabilities:

P(a \leq X \leq b) = F_X(b) - F_X(a), \qquad P(X > x) = 1 - F_X(x), \qquad P(X \leq x) = F_X(x)

Since continuous XX has P(X=a)=0P(X = a) = 0: the endpoints don't matter, P(a<Xb)=P(aXb)=P(a<X<b)P(a < X \leq b) = P(a \leq X \leq b) = P(a < X < b).

7.3 Uniform Distribution

The Uniform distribution is the maximum-entropy distribution over a bounded interval: all values are equally likely.

Definition 7.2 (Uniform distribution). XUniform(a,b)X \sim \text{Uniform}(a, b) for a<ba < b if:

fX(x)={1baaxb0otherwisef_X(x) = \begin{cases} \frac{1}{b-a} & a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}

CDF: FX(x)=0F_X(x) = 0 for x<ax < a; FX(x)=(xa)/(ba)F_X(x) = (x-a)/(b-a) for axba \leq x \leq b; FX(x)=1F_X(x) = 1 for x>bx > b.

Properties:

  • Mean: E[X]=(a+b)/2\mathbb{E}[X] = (a+b)/2 (the midpoint)
  • Variance: Var(X)=(ba)2/12\operatorname{Var}(X) = (b-a)^2/12
  • Symmetry: ff is symmetric about (a+b)/2(a+b)/2

Derivation of variance:

\mathbb{E}[X^2] = \int_a^b x^2 \cdot \frac{1}{b-a}\, dx = \frac{1}{b-a} \cdot \frac{b^3 - a^3}{3} = \frac{a^2 + ab + b^2}{3}

\operatorname{Var}(X) = \frac{a^2+ab+b^2}{3} - \left(\frac{a+b}{2}\right)^2 = \frac{(b-a)^2}{12}

ML applications of Uniform:

  • Weight initialisation (Xavier/Glorot): weights drawn from \text{Uniform}\left(-\sqrt{6/(n_{\text{in}}+n_{\text{out}})},\; \sqrt{6/(n_{\text{in}}+n_{\text{out}})}\right) to keep activation variance stable across layers
  • Batch selection: selecting which training examples to include in a mini-batch (sampling uniformly without replacement)
  • Data augmentation: random cropping positions, rotation angles, colour jitter magnitudes drawn from uniform distributions
  • Hyperparameter search: random search over hyperparameter ranges samples uniformly from the search space
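A Monte Carlo check of the Uniform mean and variance formulas (stdlib Python; the seed, interval, and tolerances are arbitrary choices):

```python
import random

rng = random.Random(0)
a, b = 2.0, 5.0
n = 200_000
samples = [a + (b - a) * rng.random() for _ in range(n)]  # Uniform(a, b)
mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / n

assert abs(mean - (a + b) / 2) < 0.02       # E[X] = midpoint = 3.5
assert abs(var - (b - a) ** 2 / 12) < 0.02  # Var(X) = 9/12 = 0.75
```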

7.4 Gaussian Distribution - Introduction

The Gaussian (Normal) distribution is the most important continuous distribution in all of probability and statistics, due to the Central Limit Theorem (Section06-Stochastic-Processes) and its mathematical tractability.

Definition 7.3 (Gaussian distribution). XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2) with mean μR\mu \in \mathbb{R} and variance σ2>0\sigma^2 > 0 if:

fX(x)=12πσ2exp ⁣((xμ)22σ2),xRf_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R}

The standard normal ZN(0,1)Z \sim \mathcal{N}(0,1) has μ=0\mu = 0, σ2=1\sigma^2 = 1:

ϕ(z)=12πez2/2\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}

Its CDF is Φ(z)=zϕ(t)dt\Phi(z) = \int_{-\infty}^z \phi(t)\, dt, which has no closed form but is tabulated and implemented in all numerical libraries.

Properties:

  • Mean: E[X]=μ\mathbb{E}[X] = \mu
  • Variance: Var(X)=σ2\operatorname{Var}(X) = \sigma^2
  • Symmetry: f(μx)=f(μ+x)f(\mu - x) = f(\mu + x) - symmetric about μ\mu
  • 68-95-99.7 rule:
    • P(μσXμ+σ)68.27%P(\mu - \sigma \leq X \leq \mu + \sigma) \approx 68.27\%
    • P(μ2σXμ+2σ)95.45%P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 95.45\%
    • P(μ3σXμ+3σ)99.73%P(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \approx 99.73\%
  • Standardisation: If XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2), then Z=(Xμ)/σN(0,1)Z = (X-\mu)/\sigma \sim \mathcal{N}(0,1)
  • Affine stability: If XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2), then aX+bN(aμ+b,a2σ2)aX + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2)

Why the Gaussian is ubiquitous:

  1. Central Limit Theorem (preview): The sum of many i.i.d. random variables with finite variance converges to a Gaussian - regardless of the original distribution. See Section06.
  2. Maximum entropy: Among all continuous distributions with fixed mean μ\mu and variance σ2\sigma^2, the Gaussian maximises entropy (is the "least informative" given these constraints)
  3. Conjugacy: The Gaussian is conjugate to itself under Bayesian updating with Gaussian likelihood - the posterior is also Gaussian

ML applications of the Gaussian:

  • Weight initialisation (He/Kaiming): weights drawn from N(0,2/nin)\mathcal{N}(0, 2/n_{\text{in}}) for ReLU networks
  • Noise models: residuals yfθ(x)N(0,σ2)y - f_\theta(\mathbf{x}) \sim \mathcal{N}(0, \sigma^2) gives mean squared error loss
  • VAE latent space: zN(μϕ(x),σϕ2(x))z \sim \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x)) with Gaussian prior N(0,I)\mathcal{N}(0, I)
  • Diffusion models forward process: xt=αˉtx0+1αˉtεx_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t}\varepsilon, εN(0,I)\varepsilon \sim \mathcal{N}(0, I)
  • Score functions: the score xlogp(x)\nabla_x \log p(x) for a Gaussian is linear: xlogN(x;μ,σ2)=(xμ)/σ2\nabla_x \log \mathcal{N}(x;\mu,\sigma^2) = -(x-\mu)/\sigma^2
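The 68-95-99.7 rule follows directly from the standard normal CDF, which stdlib Python exposes through math.erf; a quick check (the helper names are ours):

```python
import math

def phi_cdf(z):
    """Standard normal CDF: Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_within_k_sigma(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) = Phi(k) - Phi(-k), for any Gaussian."""
    return phi_cdf(k) - phi_cdf(-k)

assert abs(prob_within_k_sigma(1) - 0.6827) < 5e-4
assert abs(prob_within_k_sigma(2) - 0.9545) < 5e-4
assert abs(prob_within_k_sigma(3) - 0.9973) < 5e-4
```

By standardisation, the same three probabilities hold for every \mu and \sigma, not just the standard normal.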

Preview: Exponential, Beta, Gamma Distributions The Exponential distribution models waiting times between Poisson events; the Beta models probabilities on [0,1][0,1] (conjugate prior for Bernoulli); the Gamma generalises both. Full treatment: Common Distributions Section5, Section6, Section7.


8. Expectation and Variance - Foundations

8.1 Expected Value: Definition and Linearity

The expected value (or mean) of a random variable is its probability-weighted average - the "centre of mass" of its distribution.

Definition 8.1 (Expected value - discrete). For a discrete XX with PMF pXp_X:

E[X]=xXxpX(x)\mathbb{E}[X] = \sum_{x \in \mathcal{X}} x \cdot p_X(x)

(provided the sum converges absolutely: xpX(x)<\sum |x| p_X(x) < \infty).

Definition 8.2 (Expected value - continuous). For a continuous XX with PDF fXf_X:

E[X]=xfX(x)dx\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f_X(x)\, dx

(provided the integral converges absolutely).

Examples:

  • Bernoulli(p)(p): E[X]=1p+0(1p)=p\mathbb{E}[X] = 1 \cdot p + 0 \cdot (1-p) = p
  • Uniform(0,1)(0,1): E[X]=01x1dx=1/2\mathbb{E}[X] = \int_0^1 x \cdot 1\, dx = 1/2
  • N(μ,σ2)\mathcal{N}(\mu, \sigma^2): E[X]=μ\mathbb{E}[X] = \mu (by symmetry and definition)
  • Fair die: E[X]=(1+2+3+4+5+6)/6=3.5\mathbb{E}[X] = (1+2+3+4+5+6)/6 = 3.5

Non-example (expectation undefined): The Cauchy distribution with PDF f(x) = 1/(\pi(1+x^2)) has \int_{-\infty}^\infty |x|/(\pi(1+x^2))\,dx = \infty, so \mathbb{E}[X] does not exist. This is why heavy-tailed distributions require care.

Theorem 8.1 (Linearity of expectation). For any random variables XX, YY and constants aa, bb:

E[aX+bY]=aE[X]+bE[Y]\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]

This holds whether or not XX and YY are independent - linearity of expectation is unconditional.

Proof (discrete): E[aX+bY]=x,y(ax+by)P(X=x,Y=y)=axxyP(X=x,Y=y)+byyxP(X=x,Y=y)=aE[X]+bE[Y]\mathbb{E}[aX + bY] = \sum_{x,y}(ax+by)P(X=x, Y=y) = a\sum_x x \sum_y P(X=x,Y=y) + b\sum_y y \sum_x P(X=x,Y=y) = a\mathbb{E}[X] + b\mathbb{E}[Y]. \square

For AI: Linearity of expectation is why mini-batch gradient descent works. The gradient of the average loss over a batch equals the average of individual gradients:

θExD ⁣[L(x;θ)]=ExD ⁣[θL(x;θ)]\nabla_\theta \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\!\left[\mathcal{L}(\mathbf{x};\theta)\right] = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\!\left[\nabla_\theta \mathcal{L}(\mathbf{x};\theta)\right]

This follows from linearity of both expectation and differentiation.
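The batch-gradient identity is just linearity in miniature. A toy check with the squared-error loss L_i(\theta) = (\theta - x_i)^2 (the data and \theta value here are illustrative):

```python
# Gradient of the mean loss vs mean of the per-example gradients.
xs = [1.0, 2.0, 4.0, 7.0]
theta = 0.5

grad_of_mean = 2 * (theta - sum(xs) / len(xs))              # d/dtheta of (1/n) sum (theta - x_i)^2
mean_of_grads = sum(2 * (theta - x) for x in xs) / len(xs)  # (1/n) sum d/dtheta (theta - x_i)^2

assert abs(grad_of_mean - mean_of_grads) < 1e-12  # equal, by linearity
```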

Expected value of a function: For g:RRg : \mathbb{R} \to \mathbb{R}:

E[g(X)]=xg(x)pX(x)(discrete),E[g(X)]=g(x)fX(x)dx(continuous)\mathbb{E}[g(X)] = \sum_x g(x) p_X(x) \quad (\text{discrete}), \qquad \mathbb{E}[g(X)] = \int g(x) f_X(x)\, dx \quad (\text{continuous})

This is the Law of the Unconscious Statistician (LOTUS) - you can compute E[g(X)]\mathbb{E}[g(X)] directly from the distribution of XX without finding the distribution of g(X)g(X) first. (Full treatment in Section04.)

Warning: E[g(X)]g(E[X])\mathbb{E}[g(X)] \neq g(\mathbb{E}[X]) in general. For example, E[X2](E[X])2\mathbb{E}[X^2] \neq (\mathbb{E}[X])^2 (the gap is the variance). Jensen's inequality gives the direction: if gg is convex, g(E[X])E[g(X)]g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)].

8.2 Variance and Standard Deviation

Variance measures spread - how far a random variable typically deviates from its mean.

Definition 8.3 (Variance). The variance of XX is:

Var(X)=E ⁣[(XE[X])2]\operatorname{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right]

Computational formula:

Var(X)=E[X2](E[X])2\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2

Proof: \mathbb{E}[(X-\mu)^2] = \mathbb{E}[X^2 - 2\mu X + \mu^2] = \mathbb{E}[X^2] - 2\mu \mathbb{E}[X] + \mu^2 = \mathbb{E}[X^2] - 2\mu^2 + \mu^2 = \mathbb{E}[X^2] - \mu^2. \square

Definition 8.4 (Standard deviation). σX=Var(X)\sigma_X = \sqrt{\operatorname{Var}(X)}. This has the same units as XX and E[X]\mathbb{E}[X].

Examples:

  • Bernoulli(p)(p): Var(X)=pp2=p(1p)\operatorname{Var}(X) = p - p^2 = p(1-p). Maximum at p=1/2p=1/2: Var(X)=1/4\operatorname{Var}(X) = 1/4.
  • Uniform(a,b)(a,b): Var(X)=(ba)2/12\operatorname{Var}(X) = (b-a)^2/12
  • N(μ,σ2)\mathcal{N}(\mu,\sigma^2): Var(X)=σ2\operatorname{Var}(X) = \sigma^2 (by definition)
  • Constant cc: Var(c)=0\operatorname{Var}(c) = 0 (no randomness)

Properties of variance:

  1. Var(X)0\operatorname{Var}(X) \geq 0, with equality iff XX is a constant a.s.
  2. Var(aX+b)=a2Var(X)\operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X) (shifting by bb doesn't change spread; scaling by aa scales variance by a2a^2)
  3. Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X,Y)
  4. If XYX \perp Y: Var(X+Y)=Var(X)+Var(Y)\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)
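Property 3 can be seen numerically by building deliberately correlated variables; a sketch (stdlib Python, with the illustrative construction Y = X + independent noise):

```python
import random

rng = random.Random(1)
n = 200_000
xs = [rng.random() for _ in range(n)]  # X ~ Uniform(0, 1)
ys = [x + rng.random() for x in xs]    # Y = X + noise: positively correlated with X

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((t - m) ** 2 for t in v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

sums = [x + y for x, y in zip(xs, ys)]
lhs = var(sums)
rhs = var(xs) + var(ys) + 2 * cov(xs, ys)

assert abs(lhs - rhs) < 1e-8  # Var(X+Y) = Var(X) + Var(Y) + 2 Cov(X,Y)
assert cov(xs, ys) > 0.05     # Cov(X,Y) = Var(X) ~ 1/12 here, so the naive sum is wrong
```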

For AI: Variance appears throughout ML:

  • Bias-variance tradeoff: test error = bias² + variance + irreducible noise. High-variance models overfit.
  • Gradient variance: the variance of the stochastic gradient estimator determines training stability. Variance reduction techniques (control variates, importance sampling) reduce this.
  • Batch normalisation: standardises activations to have mean 0 and variance 1 within each batch, stabilising training.
  • Uncertainty in predictions: the variance of a Gaussian output head models aleatoric uncertainty.

8.3 Key Properties and Common Pitfalls

Expectation is linear; variance is not. For independent X,YX, Y:

  • E[XY]=E[X]E[Y]\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] (independence gives factorisation)
  • Var(X+Y)=Var(X)+Var(Y)\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) (independence cancels covariance)
  • But in general: Var(X+Y)Var(X)+Var(Y)\operatorname{Var}(X+Y) \neq \operatorname{Var}(X) + \operatorname{Var}(Y)

Jensen's inequality: For convex gg:

g(E[X])E[g(X)]g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)]

Examples: (\mathbb{E}[X])^2 \leq \mathbb{E}[X^2] (taking g(x)=x^2); e^{\mathbb{E}[X]} \leq \mathbb{E}[e^X] (taking g(x)=e^x). Jensen's inequality is used to prove Markov's inequality, the evidence lower bound (ELBO) in EM and variational inference, and the non-negativity of KL divergence.

Expectation and independence: E[XY]=E[X]E[Y]\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] holds if XYX \perp Y, but the converse is false: E[XY]=E[X]E[Y]\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] (uncorrelated) does not imply independence.

8.4 Preview: Covariance, LOTUS, MGF

Preview: Covariance and Correlation The covariance Cov(X,Y)=E[(XμX)(YμY)]\operatorname{Cov}(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)] measures linear dependence between two random variables. The covariance matrix Σ\Sigma of a random vector extends this to multiple dimensions. These are the core tools for analysing multivariate distributions.

-> Full treatment: Expectation and Moments Section3

Preview: Law of the Unconscious Statistician (LOTUS) LOTUS allows computing E[g(X)]\mathbb{E}[g(X)] using the distribution of XX directly: no change-of-variables required. It is the foundational tool for deriving moments and the expected loss under a distribution.

-> Full treatment: Expectation and Moments Section2

Preview: Moment Generating Functions The MGF MX(t)=E[etX]M_X(t) = \mathbb{E}[e^{tX}] encodes all moments of XX: E[Xn]=MX(n)(0)\mathbb{E}[X^n] = M_X^{(n)}(0). MGFs are the key tool for proving the Central Limit Theorem and computing tail bounds.

-> Full treatment: Expectation and Moments Section5


9. Transformations of Random Variables

9.1 Functions of a Random Variable

Given a random variable XX and a function g:RRg : \mathbb{R} \to \mathbb{R}, the composition Y=g(X)Y = g(X) is a new random variable. Computing the distribution of YY from the distribution of XX is a central skill.

Discrete case. If XX is discrete with PMF pXp_X and gg is any function, then Y=g(X)Y = g(X) is discrete with:

pY(y)=P(Y=y)=P(g(X)=y)=x:g(x)=ypX(x)p_Y(y) = P(Y = y) = P(g(X) = y) = \sum_{x : g(x) = y} p_X(x)

Example. XUniform{1,2,3,4,5,6}X \sim \text{Uniform}\{1,2,3,4,5,6\} (fair die), Y=X2mod6Y = X^2 \bmod 6. Then pY(1)=P(X21mod6)=P(X{1,5})=1/3p_Y(1) = P(X^2 \equiv 1 \bmod 6) = P(X \in \{1,5\}) = 1/3, etc.

Continuous case - CDF method. For continuous XX and Y=g(X)Y = g(X), compute the CDF of YY:

FY(y)=P(Yy)=P(g(X)y)=P(X{x:g(x)y})F_Y(y) = P(Y \leq y) = P(g(X) \leq y) = P(X \in \{x : g(x) \leq y\})

then differentiate to get the PDF: fY(y)=FY(y)f_Y(y) = F_Y'(y).

9.2 Change-of-Variables Formula

For monotone transformations, the CDF method yields a clean formula.

Theorem 9.1 (Change of variables - monotone, 1-D). Let XX have PDF fXf_X and CDF FXF_X. Let gg be a differentiable, strictly monotone function on the support of XX, with inverse h=g1h = g^{-1}. Then Y=g(X)Y = g(X) has PDF:

fY(y)=fX(h(y))dhdy=fX(h(y))1g(h(y))f_Y(y) = f_X(h(y)) \cdot \left|\frac{dh}{dy}\right| = f_X(h(y)) \cdot \left|\frac{1}{g'(h(y))}\right|

Derivation (increasing gg):

F_Y(y) = P(Y \leq y) = P(g(X) \leq y) = P(X \leq h(y)) = F_X(h(y))

f_Y(y) = \frac{d}{dy}F_Y(y) = f_X(h(y)) \cdot h'(y) = f_X(h(y)) \cdot \frac{1}{g'(h(y))}

For decreasing gg, FY(y)=1FX(h(y))F_Y(y) = 1 - F_X(h(y)), and differentiating gives a negative sign, cancelled by the absolute value in the formula.

Example. XN(0,1)X \sim \mathcal{N}(0,1), Y=eXY = e^X (log-normal distribution). Here g(x)=exg(x) = e^x, h(y)=lnyh(y) = \ln y, h(y)=1/yh'(y) = 1/y. So:

fY(y)=ϕ(lny)1y=12πe(lny)2/21y,y>0f_Y(y) = \phi(\ln y) \cdot \frac{1}{y} = \frac{1}{\sqrt{2\pi}}\, e^{-(\ln y)^2/2} \cdot \frac{1}{y}, \quad y > 0

Example. XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2), Y=(Xμ)/σY = (X - \mu)/\sigma (standardisation). Here g(x)=(xμ)/σg(x) = (x-\mu)/\sigma, h(y)=μ+σyh(y) = \mu + \sigma y, h(y)=σh'(y) = \sigma:

fY(y)=fX(μ+σy)σ=12πσ2ey2/2σ=12πey2/2=ϕ(y)f_Y(y) = f_X(\mu + \sigma y) \cdot \sigma = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-y^2/2} \cdot \sigma = \frac{1}{\sqrt{2\pi}} e^{-y^2/2} = \phi(y)

So Y \sim \mathcal{N}(0,1). \checkmark
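The log-normal example can be sanity-checked by simulation: the derived CDF F_Y(y) = \Phi(\ln y) should match the empirical fraction of transformed samples. A sketch (stdlib Python; the helper names, seed, and test point are ours):

```python
import math
import random

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

rng = random.Random(7)
n = 200_000
ys = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(n)]  # Y = e^X, X ~ N(0,1)

y0 = 2.0
empirical = sum(1 for y in ys if y <= y0) / n
theoretical = phi(math.log(y0))  # F_Y(y) = P(e^X <= y) = P(X <= ln y) = Phi(ln y)

assert abs(empirical - theoretical) < 0.01
```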

For AI: The change-of-variables formula is the mathematical foundation of normalising flows - generative models that learn a bijection g:RdRdg : \mathbb{R}^d \to \mathbb{R}^d transforming a simple base distribution (e.g., N(0,I)\mathcal{N}(0,I)) into a complex target distribution. The log-density of the transformed variable is:

logpY(y)=logpX(g1(y))+logdetJg1(y)\log p_Y(y) = \log p_X(g^{-1}(y)) + \log |\det J_{g^{-1}}(y)|

where Jg1J_{g^{-1}} is the Jacobian of the inverse transformation.

9.3 Standardisation and the Z-Score

Definition 9.1 (Z-score / standardisation). For XX with mean μ\mu and standard deviation σ>0\sigma > 0:

Z=XμσZ = \frac{X - \mu}{\sigma}

Then E[Z]=0\mathbb{E}[Z] = 0 and Var(Z)=1\operatorname{Var}(Z) = 1 regardless of the distribution of XX.

Proof: E[Z]=E[(Xμ)/σ]=(E[X]μ)/σ=0\mathbb{E}[Z] = \mathbb{E}[(X-\mu)/\sigma] = (\mathbb{E}[X] - \mu)/\sigma = 0. Var(Z)=Var(X)/σ2=σ2/σ2=1\operatorname{Var}(Z) = \operatorname{Var}(X)/\sigma^2 = \sigma^2/\sigma^2 = 1. \square

Applications in ML:

  • Feature normalisation (preprocessing): standardise each feature so that the gradient has comparable scale across dimensions, improving optimiser convergence
  • Batch normalisation (BN): standardises activations within each mini-batch, then applies learned scale and shift parameters (γ,β)(\gamma, \beta): x^=(xμ^)/σ^\hat{x} = (x - \hat{\mu})/\hat{\sigma}, output =γx^+β= \gamma\hat{x} + \beta
  • Layer normalisation (LN): same idea but normalising across features rather than the batch dimension - standard in transformer models
  • RMS Normalisation (RMSNorm): a simplified variant using only the RMS (root mean square), dropping the mean subtraction - used in LLaMA, Gemma, and Mistral models: x^i=xi/RMS(x)\hat{x}_i = x_i / \text{RMS}(\mathbf{x})
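All four techniques above share one core operation. A minimal z-score sketch in stdlib Python (the function name is ours; the eps guard against a zero denominator mirrors what LayerNorm-style implementations do):

```python
import math

def standardise(xs, eps=1e-5):
    """Z-score a vector: subtract the mean, divide by the standard deviation."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return [(x - m) / math.sqrt(v + eps) for x in xs]

z = standardise([2.0, 4.0, 6.0, 8.0])
zm = sum(z) / len(z)
zv = sum((t - zm) ** 2 for t in z) / len(z)

assert abs(zm) < 1e-12       # mean 0
assert abs(zv - 1.0) < 1e-3  # variance 1 (up to eps)
```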

10. Applications in Machine Learning

10.1 Cross-Entropy Loss as Negative Log-Likelihood

Setup. Suppose we observe data D={(x(1),y(1)),,(x(n),y(n))}\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\} drawn i.i.d. from the true distribution pdata(yx)p_{\text{data}}(y \mid x). A model fθf_\theta with parameters θ\theta defines a predicted distribution pθ(yx)p_\theta(y \mid x).

Maximum likelihood estimation (MLE). We want to choose θ\theta to maximise the probability assigned to the observed data:

θ=argmaxθi=1npθ(y(i)x(i))[likelihood]\theta^* = \arg\max_\theta \prod_{i=1}^n p_\theta(y^{(i)} \mid x^{(i)}) \quad [\text{likelihood}]

Taking the logarithm (monotone transformation, same argmax):

θ=argmaxθi=1nlogpθ(y(i)x(i))=argminθ1ni=1nlogpθ(y(i)x(i))negative log-likelihood loss\theta^* = \arg\max_\theta \sum_{i=1}^n \log p_\theta(y^{(i)} \mid x^{(i)}) = \arg\min_\theta \underbrace{-\frac{1}{n}\sum_{i=1}^n \log p_\theta(y^{(i)} \mid x^{(i)})}_{\text{negative log-likelihood loss}}

For classification. With KK classes, pθ(yx)=Categorical(softmax(fθ(x)))yp_\theta(y \mid x) = \text{Categorical}(\text{softmax}(f_\theta(x)))_y:

logpθ(yx)=logsoftmax(fθ(x))y=logezyjezj=logjezjzy-\log p_\theta(y \mid x) = -\log \text{softmax}(f_\theta(x))_y = -\log \frac{e^{z_y}}{\sum_j e^{z_j}} = \log\sum_j e^{z_j} - z_y

where z=fθ(x)z = f_\theta(x) are the logits. For a one-hot label y\mathbf{y}, this is exactly the cross-entropy loss:

L(θ;x,y)=k=1Kyklogpθ(kx)=H(pdata,pθ)\mathcal{L}(\theta; x, y) = -\sum_{k=1}^K y_k \log p_\theta(k \mid x) = H(p_{\text{data}}, p_\theta)

For regression. If pθ(yx)=N(fθ(x),σ2)p_\theta(y \mid x) = \mathcal{N}(f_\theta(x), \sigma^2):

logpθ(yx)=12σ2(yfθ(x))2+const-\log p_\theta(y \mid x) = \frac{1}{2\sigma^2}(y - f_\theta(x))^2 + \text{const}

Minimising negative log-likelihood under a Gaussian noise model is equivalent to minimising mean squared error.

For language modelling. Each token has distribution pθ(xtx<t)=Categorical(softmax(zt))p_\theta(x_t \mid x_{<t}) = \text{Categorical}(\text{softmax}(z_t)). The NLL loss over a sequence of TT tokens is:

L=1Tt=1Tlogpθ(xtx<t)\mathcal{L} = -\frac{1}{T}\sum_{t=1}^T \log p_\theta(x_t \mid x_{<t})

This is the standard pretraining loss for GPT-family models. The perplexity is exp(L)\exp(\mathcal{L}) - a lower perplexity means the model assigns higher probability to the actual next token.
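The identity -\log \text{softmax}(z)_y = \log\sum_j e^{z_j} - z_y is how frameworks actually compute this loss. A stdlib-Python sketch with the usual max-shift for stability (the function names are ours):

```python
import math

def cross_entropy_from_logits(z, y):
    """-log softmax(z)[y] = logsumexp(z) - z[y], computed stably via max-shift."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return lse - z[y]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [t / s for t in e]

z = [2.0, -1.0, 0.5]
direct = -math.log(softmax(z)[0])  # naive route via explicit probabilities
assert abs(cross_entropy_from_logits(z, 0) - direct) < 1e-12
```

Working in logit space avoids forming tiny probabilities that would underflow to 0 before the log.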

10.2 Bayesian Inference: Prior, Likelihood, Posterior

Bayesian inference applies Bayes' theorem to update beliefs about parameters θ\theta after observing data D\mathcal{D}:

p(θD)posteriorp(Dθ)likelihoodp(θ)prior\underbrace{p(\theta \mid \mathcal{D})}_{\text{posterior}} \propto \underbrace{p(\mathcal{D} \mid \theta)}_{\text{likelihood}} \cdot \underbrace{p(\theta)}_{\text{prior}}

The normalising constant p(D)=p(Dθ)p(θ)dθp(\mathcal{D}) = \int p(\mathcal{D} \mid \theta) p(\theta) d\theta is the evidence or marginal likelihood.

MAP estimation. The Maximum A Posteriori (MAP) estimate is the mode of the posterior:

θMAP=argmaxθlogp(θD)=argmaxθ[logp(Dθ)+logp(θ)]\theta^*_{\text{MAP}} = \arg\max_\theta \log p(\theta \mid \mathcal{D}) = \arg\max_\theta \left[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\right]

With a Gaussian prior p(θ)=N(0,λ1I)p(\theta) = \mathcal{N}(0, \lambda^{-1}I), the logp(θ)\log p(\theta) term becomes λθ2/2-\lambda\|\theta\|^2/2 - i.e., L2 regularisation (weight decay). MAP estimation with a Gaussian prior is exactly regularised MLE.

For AI: Bayesian thinking is increasingly central to modern AI:

  • Model uncertainty (epistemic): the posterior over models captures uncertainty from limited data - Bayesian deep learning approximates this posterior
  • RLHF reward model: human preferences give a likelihood P(human prefers y1 over y2r1,r2)P(\text{human prefers } y_1 \text{ over } y_2 \mid r_1, r_2); updating the reward model is a posterior update
  • In-context learning: some interpretations frame few-shot ICL as implicit Bayesian updating over hypotheses about the task
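The posterior \propto likelihood \times prior recipe can be made concrete with a grid approximation for a coin's bias p; a minimal sketch (the counts, grid resolution, and flat prior are illustrative choices):

```python
# Grid posterior for p = P(heads) after observing 7 heads and 3 tails.
heads, tails = 7, 3
grid = [i / 100 for i in range(1, 100)]                 # p values in (0, 1)
prior = [1.0] * len(grid)                               # flat prior
likelihood = [p ** heads * (1 - p) ** tails for p in grid]
unnorm = [pr * lk for pr, lk in zip(prior, likelihood)]
evidence = sum(unnorm)                                  # normalising constant (up to grid spacing)
posterior = [u / evidence for u in unnorm]

map_p = grid[max(range(len(grid)), key=lambda i: posterior[i])]
assert abs(sum(posterior) - 1.0) < 1e-12
assert abs(map_p - heads / (heads + tails)) < 1e-9  # flat prior: MAP coincides with the MLE, 0.7
```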

10.3 Probabilistic Language Models

A language model defines a probability distribution over sequences of tokens. Given vocabulary V\mathcal{V}, a sequence x=(x1,,xT)\mathbf{x} = (x_1, \ldots, x_T) is assigned probability:

P(x)=P(x1)P(x2x1)P(x3x1,x2)P(xTx1,,xT1)=t=1TP(xtx<t)P(\mathbf{x}) = P(x_1) \cdot P(x_2 \mid x_1) \cdot P(x_3 \mid x_1, x_2) \cdots P(x_T \mid x_1, \ldots, x_{T-1}) = \prod_{t=1}^T P(x_t \mid x_{<t})

by the chain rule of probability (Section3.4). Each conditional P(xtx<t)P(x_t \mid x_{<t}) is a Categorical distribution over V|\mathcal{V}| tokens, produced by the transformer's softmax output.

Sampling strategies as distribution operations:

  • Greedy decoding: always take argmaxP(xtx<t)\arg\max P(x_t \mid x_{<t}) - equivalent to a degenerate point mass
  • Temperature scaling: replace logits zz with z/τz/\tau; τ0\tau \to 0 approaches greedy, τ\tau \to \infty approaches uniform
  • Top-kk sampling: restrict to the kk most probable tokens, renormalise - truncates the distribution
  • Top-pp (nucleus) sampling: restrict to the smallest set of tokens whose cumulative probability p\geq p, renormalise - adapts the cutoff to the entropy of the distribution
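These strategies are small operations on the Categorical distribution; a stdlib-Python sketch of temperature scaling and nucleus truncation (the function names are ours):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [t / s for t in e]

def top_p_filter(probs, p):
    """Keep the smallest high-probability set with cumulative mass >= p; renormalise."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / total
    return out

logits = [3.0, 2.0, 0.5, -1.0]
sharp = softmax([z / 0.5 for z in logits])  # tau < 1 sharpens the distribution
flat = softmax([z / 2.0 for z in logits])   # tau > 1 flattens it
assert sharp[0] > softmax(logits)[0] > flat[0]

nucleus = top_p_filter(softmax(logits), 0.9)
assert nucleus[3] == 0.0               # least-likely token truncated
assert abs(sum(nucleus) - 1.0) < 1e-12 # still a valid PMF after renormalising
```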

Bits-per-character (BPC) and perplexity: The quality of a language model is measured by how much probability it assigns to held-out text. BPC is the average negative \log_2-probability per character; perplexity is \exp(-\frac{1}{T}\sum_t \log P(x_t \mid x_{<t})). A perplexity of kk means the model is "as uncertain as a uniform distribution over kk choices" at each step.

10.4 Sources of Randomness in Neural Network Training

Training a neural network involves multiple distinct sources of randomness, each with a different probability distribution:

| Source | Distribution | When it occurs |
|---|---|---|
| Weight initialisation | \mathcal{N}(0, \sigma^2) or Uniform | Once, at start |
| Mini-batch selection | Uniform (without replacement) | Every step |
| Dropout masks | \text{Bernoulli}(1-p_{\text{drop}})^d | Every forward pass |
| Data augmentation | Uniform (flips, crop positions), Gaussian (colour jitter) | Every sample |
| Label smoothing | Convex combination of Categorical and Uniform | During loss computation |
| Noise injection (DP-SGD) | \mathcal{N}(0, \sigma^2 C^2): Gaussian noise added to gradients clipped to norm C | Every gradient update |
| Sampling from generative models | Model-defined (Categorical, Gaussian, etc.) | Inference |

Each source of randomness introduces variance into the training process. Understanding their distributions is essential for:

  • Reproducibility (fixing random seeds)
  • Differential privacy analysis (noise calibration)
  • Variance reduction in gradient estimation
  • Debugging training instabilities (distinguishing random vs systematic failures)

11. Common Mistakes

#MistakeWhy It's WrongFix
1Confusing P(AB)P(A \mid B) with P(BA)P(B \mid A)These are completely different quantities (prosecutor's fallacy, base rate neglect)Always identify which is the "given" and which is the "unknown" before applying Bayes' theorem
2Treating fX(x)f_X(x) as a probabilityThe PDF is a density - it can exceed 1, and individual point probabilities are always 0 for continuous XXProbabilities come from integrating the PDF: P(aXb)=abf(x)dxP(a \leq X \leq b) = \int_a^b f(x)dx
3Assuming independence from E[XY]=E[X]E[Y]\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]Uncorrelated does not imply independent; independence is strictly strongerUse the definition: XY    P(Xx,Yy)=P(Xx)P(Yy)X \perp Y \iff P(X \leq x, Y \leq y) = P(X \leq x)P(Y \leq y) for all x,yx,y
4. Applying Axiom 3 to non-disjoint events. Why it fails: $P(A \cup B) = P(A) + P(B)$ holds only when $A \cap B = \emptyset$. Fix: use inclusion-exclusion, $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

5. Forgetting to check normalisation when constructing a PMF/PDF. Why it fails: an un-normalised function cannot be a PMF or PDF. Fix: always verify $\sum_x p(x) = 1$ (discrete) or $\int f(x)\,dx = 1$ (continuous).

6. Using $\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$ without checking independence. Why it fails: variance is additive only for uncorrelated (not just any) random variables. Fix: add the covariance term, $\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X,Y)$.

7. Confusing $\mathbb{E}[g(X)]$ with $g(\mathbb{E}[X])$. Why it fails: Jensen's inequality shows these differ whenever $g$ is nonlinear. Fix: compute $\mathbb{E}[g(X)]$ using LOTUS, $\mathbb{E}[g(X)] = \sum_x g(x)p(x)$ or $\int g(x)f(x)\,dx$.

8. Confusing pairwise independence with mutual independence. Why it fails: events can be pairwise independent but not mutually independent (Bernstein's example). Fix: for $n > 2$ events, verify the product rule for all $2^n - n - 1$ subsets of size at least two, or use a structural argument.

9. Assuming $P(A \cap B) = 0$ when $A$ and $B$ are independent. Why it fails: independence means $P(A \cap B) = P(A)P(B)$, which is zero only when $P(A) = 0$ or $P(B) = 0$. Fix: independence $\neq$ mutual exclusivity; disjoint events ($A \cap B = \emptyset$) with positive probability are necessarily dependent.

10. Applying the change-of-variables formula without the absolute value of the Jacobian. Why it fails: for decreasing transformations the Jacobian $dh/dy$ is negative, and omitting the absolute value gives a negative PDF. Fix: always write $f_Y(y) = f_X(h(y)) \cdot \left| \frac{dh}{dy} \right|$.

11. Concluding $X$ and $Y$ are independent from a correlation of zero in non-Gaussian data. Why it fails: zero correlation means uncorrelated, which does NOT imply independence outside the jointly Gaussian case. Fix: independence implies zero correlation, but not vice versa; check independence directly or use copulas.

12. Treating conditional probability as symmetric, $P(A \mid B) = P(B \mid A)$. Why it fails: $P(A \mid B) = P(B \mid A)$ only if $P(A) = P(B)$ (from Bayes' theorem with equal priors). Fix: apply Bayes' theorem to convert between the two, $P(B \mid A) = P(A \mid B)\, P(B) / P(A)$.
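Mistake 6 is easy to check empirically. A minimal NumPy sketch (the correlated pair below is an illustrative choice, not taken from the lesson) shows that the naive sum of variances misses the $2\operatorname{Cov}(X,Y)$ term:

```python
import numpy as np

# Illustrative simulation: Y is built from X, so the two are correlated.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = x + 0.5 * rng.standard_normal(100_000)   # Var(X)=1, Var(Y)=1.25, Cov(X,Y)=1

lhs = np.var(x + y)                                      # true Var(X+Y), about 4.25
naive = np.var(x) + np.var(y)                            # misses covariance, about 2.25
full = np.var(x) + np.var(y) + 2 * np.cov(x, y)[0, 1]    # adds 2*Cov(X,Y)

print(f"Var(X+Y) = {lhs:.2f}, naive sum = {naive:.2f}, with covariance = {full:.2f}")
```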

12. Exercises

Exercise 1 * - Axiom Derivations Using only the three Kolmogorov axioms, prove each of the following: (a) $P(\emptyset) = 0$ (b) $P(A^c) = 1 - P(A)$ (c) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ (inclusion-exclusion for two events) (d) if $A \subseteq B$, then $P(A) \leq P(B)$ (e) $0 \leq P(A) \leq 1$ for all events $A$.

Each proof should cite exactly which axiom(s) and which previously proved results it uses.

Exercise 2 * - Bayes' Theorem Application A medical test for a disease has sensitivity 95% (true positive rate) and specificity 98% (true negative rate). The disease affects 0.5% of the population. (a) Define events $D$ (has disease) and $T$ (tests positive), and state all given probabilities. (b) Compute $P(T)$ using the law of total probability. (c) Compute $P(D \mid T)$ using Bayes' theorem. (d) Explain why the result might surprise a clinician unfamiliar with Bayes' theorem. (e) How would the answer change if the disease prevalence were 10% instead of 0.5%? Compute and interpret.
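A quick way to sanity-check your answers to (b), (c), and (e) is to plug the numbers into Bayes' theorem directly. A minimal sketch (the function name is mine):

```python
# Posterior P(D | T) from prevalence, sensitivity, and specificity.
def posterior(prevalence, sensitivity, specificity):
    # Law of total probability: P(T) = P(T|D)P(D) + P(T|D^c)P(D^c)
    p_t = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    # Bayes' theorem: P(D|T) = P(T|D)P(D) / P(T)
    return sensitivity * prevalence / p_t

print(round(posterior(0.005, 0.95, 0.98), 4))  # 0.5% prevalence -> 0.1927
print(round(posterior(0.10, 0.95, 0.98), 4))   # 10% prevalence  -> 0.8407
```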

Exercise 3 * - PMF Construction and Verification A biased die has $P(X = k) = c \cdot k$ for $k \in \{1, 2, 3, 4, 5, 6\}$. (a) Find the normalisation constant $c$. (b) Verify this is a valid PMF. (c) Compute $\mathbb{E}[X]$. (d) Compute $\operatorname{Var}(X)$. (e) What is $P(X \geq 4)$? Compare to the fair die.
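Exact arithmetic with `fractions.Fraction` makes it easy to verify your answers to parts (a) through (e); a minimal sketch:

```python
from fractions import Fraction

# Biased die with P(X = k) = c * k for k = 1..6, computed exactly.
ks = range(1, 7)
c = Fraction(1, sum(ks))                        # normalisation: c = 1/21
p = {k: c * k for k in ks}
assert sum(p.values()) == 1                     # (b) valid PMF

mean = sum(k * p[k] for k in ks)                # (c) E[X]
var = sum(k**2 * p[k] for k in ks) - mean**2    # (d) Var(X) = E[X^2] - E[X]^2
p_ge_4 = sum(p[k] for k in (4, 5, 6))           # (e) P(X >= 4)

print(mean, var, p_ge_4)                        # 13/3 20/9 5/7
```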

Exercise 4 ** - CDF Analysis Let $X$ have CDF $F(x) = 1 - e^{-\lambda x}$ for $x \geq 0$ and $F(x) = 0$ for $x < 0$ (the Exponential distribution). (a) Show this is a valid CDF (check all four properties from Theorem 5.1). (b) Find the PDF $f(x)$ by differentiating the CDF. (c) Compute $P(X > t)$ for any $t > 0$. (d) Show the memoryless property: $P(X > s+t \mid X > s) = P(X > t)$ for all $s, t > 0$. (e) For $\lambda = 1$: compute $P(1 \leq X \leq 2)$ exactly and numerically.
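Parts (d) and (e) can be checked numerically once you have the survival function; a small sketch for $\lambda = 1$ (the values of $s$ and $t$ are arbitrary):

```python
import math

lam = 1.0
surv = lambda t: math.exp(-lam * t)       # survival function: P(X > t) = e^{-lam*t}

s, t = 0.7, 1.3
cond = surv(s + t) / surv(s)              # P(X > s+t | X > s)
assert abs(cond - surv(t)) < 1e-12        # memoryless: equals P(X > t)

# (e) P(1 <= X <= 2) = F(2) - F(1) = e^{-1} - e^{-2}
print(round(math.exp(-1) - math.exp(-2), 4))   # 0.2325
```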

Exercise 5 ** - Independence and Conditional Independence Roll two fair dice, letting $X$ be the result of die 1 and $Y$ the result of die 2. Define $Z = X + Y$ (the sum). (a) Verify that $X$ and $Y$ are independent. (b) Show that $X$ and $Z$ are not independent (hint: compute $P(X=1, Z=2)$ and compare to $P(X=1)P(Z=2)$). (c) Now condition on the event $\{Z = 7\}$: compute $P(X=1, Y=6 \mid Z=7)$ and compare it to $P(X=1 \mid Z=7)\, P(Y=6 \mid Z=7)$ to decide whether $X$ and $Y$ are conditionally independent given $Z = 7$. (d) Explain in words why knowing $Z$ induces dependence between $X$ and $Y$.
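Because the sample space has only 36 equally likely outcomes, every probability in this exercise can be verified by exact enumeration; a sketch:

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # the 36 equally likely (x, y) pairs
P = lambda pred: Fraction(sum(pred(x, y) for x, y in rolls), 36)

# (b) X and Z = X + Y are not independent:
lhs = P(lambda x, y: x == 1 and x + y == 2)                 # 1/36
rhs = P(lambda x, y: x == 1) * P(lambda x, y: x + y == 2)   # (1/6)(1/36)
assert lhs != rhs

# (c) conditioning on Z = 7:
pz7 = P(lambda x, y: x + y == 7)
joint = P(lambda x, y: x == 1 and y == 6) / pz7             # P(X=1, Y=6 | Z=7)
margs = (P(lambda x, y: x == 1 and x + y == 7) / pz7) * \
        (P(lambda x, y: y == 6 and x + y == 7) / pz7)
print(joint, margs)   # 1/6 vs 1/36: given Z=7, X determines Y
```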

Exercise 6 ** - Change of Variables Let $X \sim \text{Uniform}(0, 1)$. (a) Find the PDF of $Y = -\ln X$. What distribution is this? (b) Find the PDF of $W = X^2$. (c) Find the CDF and PDF of $V = \sqrt{X}$. (d) For an arbitrary continuous CDF $F$ with inverse $F^{-1}$, show that if $U \sim \text{Uniform}(0,1)$ then $F^{-1}(U)$ has CDF $F$ (the inverse-CDF / quantile transform method). (e) How is part (d) used in sampling from arbitrary distributions in ML?
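Part (d) in action: sampling an Exponential(1) variable from uniforms via the quantile transform. A minimal sketch (note that since $1-U$ is also uniform, this matches part (a)'s $Y = -\ln X$):

```python
import math
import random

# Inverse-CDF sampling: for Exponential(1), F^{-1}(u) = -ln(1 - u).
random.seed(0)
samples = [-math.log(1.0 - random.random()) for _ in range(200_000)]

mean = sum(samples) / len(samples)
print(round(mean, 2))   # Exponential(1) has mean 1, so this should be close to 1
```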

Exercise 7 *** - Cross-Entropy Loss Derivation Consider a $K$-class classification problem with model $p_\theta(y \mid \mathbf{x}) = \text{softmax}(f_\theta(\mathbf{x}))_y$. (a) Write the negative log-likelihood for a single example $(\mathbf{x}, y^*)$. (b) Show this equals the cross-entropy $H(p_{\text{true}}, p_\theta)$, where $p_{\text{true}}$ is the one-hot distribution. (c) Show that the gradient of the NLL with respect to the logits $\mathbf{z} = f_\theta(\mathbf{x})$ is $\text{softmax}(\mathbf{z}) - \mathbf{e}_{y^*}$ (softmax output minus one-hot target). (d) Explain why label smoothing replaces the one-hot $p_{\text{true}}$ with $(1-\varepsilon)\mathbf{e}_{y^*} + (\varepsilon/K)\mathbf{1}$, and how this changes the loss. (e) For language modelling with vocabulary size $|\mathcal{V}| = 32000$: if the model assigns probability 0.8 to the correct next token, what is the NLL? The perplexity?
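A finite-difference check of part (c), plus the part (e) arithmetic (the example logits are arbitrary):

```python
import math
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift by max for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])     # arbitrary example logits
y_star = 0

grad = softmax(z)
grad[y_star] -= 1.0                # claimed gradient: softmax(z) - one_hot(y*)

# Finite-difference check on the first logit:
eps = 1e-6
nll = lambda zv: -math.log(softmax(zv)[y_star])
fd = (nll(z + np.array([eps, 0, 0])) - nll(z - np.array([eps, 0, 0]))) / (2 * eps)
assert abs(fd - grad[0]) < 1e-5

# (e) NLL = -ln 0.8; perplexity = exp(NLL) = 1/0.8. Vocabulary size does not enter.
print(round(-math.log(0.8), 4), round(1 / 0.8, 2))   # 0.2231 1.25
```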

Exercise 8 *** - Probabilistic Analysis of Dropout During training, dropout independently sets each of $d$ activations to 0 with probability $p_{\text{drop}}$ and scales the remaining ones by $1/(1-p_{\text{drop}})$. (a) Let $M_i \sim \text{Bernoulli}(1-p_{\text{drop}})$ be the mask for activation $i$. What is $\mathbb{E}[M_i]$? (b) Let $h_i$ be the pre-dropout activation and define $\tilde{h}_i = M_i h_i / (1-p_{\text{drop}})$. Show $\mathbb{E}[\tilde{h}_i] = h_i$ (inverted dropout preserves the expected activation). (c) Compute $\operatorname{Var}(\tilde{h}_i)$ as a function of $h_i$ and $p_{\text{drop}}$. (d) The layer output is $s = \sum_{i=1}^d \tilde{h}_i$. Assuming the masks $M_i$ are independent across neurons, what is $\operatorname{Var}(s)$? (e) Explain why high $p_{\text{drop}}$ increases gradient variance and can destabilise training. At what $p_{\text{drop}}$ is the variance maximised?
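Parts (a) through (c) can be checked by simulation; a sketch for a single activation (the values of $h$ and $p_{\text{drop}}$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
h, p_drop = 2.0, 0.3
keep = 1.0 - p_drop

m = rng.binomial(1, keep, size=1_000_000)   # mask M_i ~ Bernoulli(1 - p_drop)
h_tilde = m * h / keep                      # inverted dropout

print(round(h_tilde.mean(), 2))             # close to h = 2.0: unbiased in expectation
print(round(h_tilde.var(), 2))              # close to h^2 * p_drop / (1 - p_drop)
print(round(h**2 * p_drop / keep, 2))       # closed form from part (c)
```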

