1. Intuition
1.1 What Is Probability? Three Interpretations
Probability is a number in $[0, 1]$ assigned to an event, measuring how likely that event is to occur. But what does "likely" mean? There are three major interpretations, and each underpins a different school of thought in statistics and machine learning.
1. Classical (equally likely outcomes). If a sample space $\Omega$ has $N$ equally likely outcomes and event $A$ contains $k$ of them, then $P(A) = k/N$. This is the interpretation of textbook dice and card problems. It breaks down when outcomes are not equally likely - flipping a biased coin, for instance.
2. Frequentist (long-run relative frequency). $P(A)$ is the limiting proportion of times $A$ occurs in an infinite sequence of identical, independent experiments: $P(A) = \lim_{n \to \infty} \frac{n_A}{n}$. This is the foundation of classical statistics (hypothesis testing, confidence intervals). Its weakness: it cannot assign probabilities to one-off events ("the probability that GPT-5 passes the bar exam").
3. Bayesian (degree of belief). $P(A)$ represents a rational agent's degree of belief that $A$ is true, updated as new evidence arrives. This interpretation allows probabilities for singular events and underpins Bayesian inference, Bayesian neural networks, and probabilistic generative models. Crucially, different agents can assign different prior probabilities to the same event - and Bayes' theorem tells them how to update rationally.
THREE INTERPRETATIONS OF PROBABILITY
========================================================================
Classical               Frequentist              Bayesian
-----------------       --------------------     ----------------------
P(A) = k/N              P(A) = lim freq(A)       P(A) = degree of belief
Equally likely          Long-run proportion      Updated by evidence
outcomes                Objective                Subjective / rational
Dice, cards             Hypothesis tests         Generative models
                        Confidence intervals     Bayesian NNs, VAEs

All three satisfy the same axioms (Section 2.2).
They differ only in WHAT those axioms are applied to.
========================================================================
For AI: Modern ML systems implicitly blend all three. A softmax output layer uses the classical interpretation (outputs sum to 1). Training loop analysis uses frequentist reasoning (expected loss over the data distribution). Bayesian neural networks, RLHF reward models, and variational autoencoders use Bayesian reasoning explicitly.
1.2 Why Probability Is the Language of AI
Every component of a modern AI system is probabilistic:
- Data: Training sets are finite samples from an unknown data-generating distribution $p_{\text{data}}(x, y)$. The goal of supervised learning is to find a model $p_\theta(y \mid x)$ that estimates $p_{\text{data}}(y \mid x)$.
- Model outputs: A language model defines a probability distribution $P(x_t \mid x_1, \dots, x_{t-1})$ over the next token. Generation is sampling from this distribution. Temperature scaling, top-$k$ and top-$p$ sampling are all operations on probability distributions.
- Loss functions: Cross-entropy is the negative log-likelihood of the data under the model's distribution. Minimising cross-entropy is equivalent to maximising the probability the model assigns to the training data.
- Regularisation: Dropout applies Bernoulli random variables to activations. Weight decay corresponds to a Gaussian prior in a Bayesian interpretation (MAP estimation).
- Uncertainty quantification: Calibration of a classifier means its predicted probabilities match empirical frequencies. A well-calibrated model saying "70% confident" should be right 70% of the time.
- Generative models: Diffusion models define a forward Markov chain (adding Gaussian noise) and learn to reverse it. VAEs learn an encoder $q_\phi(z \mid x)$ and a decoder $p_\theta(x \mid z)$. Normalising flows learn invertible transformations of simple distributions. All are explicitly probabilistic.
1.3 Historical Timeline
| Year | Person | Contribution |
|---|---|---|
| 1654 | Pascal & Fermat | Correspondence on gambling problems - foundations of combinatorial probability |
| 1713 | Jacob Bernoulli | Ars Conjectandi - law of large numbers, Bernoulli distribution |
| 1763 | Thomas Bayes (posthumous) | Essay on inverse probability - what we now call Bayes' theorem |
| 1812 | Pierre-Simon Laplace | Théorie analytique des probabilités - systematic probability theory, Laplace approximation |
| 1837 | Siméon Poisson | Poisson distribution - rare events in large samples |
| 1900 | Karl Pearson | Chi-squared distribution, correlation coefficient |
| 1933 | Andrei Kolmogorov | Rigorous axiomatic foundation - the three axioms that unify all interpretations |
| 1948 | Claude Shannon | Information theory - entropy connects probability to communication |
| 1980s | Judea Pearl | Bayesian networks - graphical models for probabilistic reasoning |
| 1990s | Gelman, Rubin et al. | MCMC methods - practical Bayesian inference for complex models |
| 2013 | Kingma & Welling | Variational Autoencoders - learned latent distributions via reparameterisation |
| 2020 | Ho et al. | Denoising Diffusion Probabilistic Models - probabilistic generative modelling at scale |
2. Formal Definitions - Probability Spaces
2.1 Sample Spaces and Events
Definition 2.1 (Sample space). The sample space $\Omega$ is the set of all possible outcomes of a random experiment. Each element $\omega \in \Omega$ is called an elementary outcome or sample point.
Examples:
- Tossing a fair coin: $\Omega = \{H, T\}$
- Rolling a six-sided die: $\Omega = \{1, 2, 3, 4, 5, 6\}$
- Measuring a person's height: $\Omega = (0, \infty)$
- Choosing a word from a vocabulary: $\Omega = V$ (finite vocabulary)
- One training step of a neural network: $\Omega$ = all possible mini-batch draws
Definition 2.2 (Event). An event $A$ is a subset of $\Omega$, i.e., $A \subseteq \Omega$. We say event $A$ occurs if the observed outcome $\omega \in A$.
Event algebra. Events combine using set operations:
- Complement: $A^c = \Omega \setminus A$ - "$A$ does not occur"
- Union: $A \cup B$ - "$A$ or $B$ (or both) occur"
- Intersection: $A \cap B$ - "both $A$ and $B$ occur"
- Difference: $A \setminus B = A \cap B^c$ - "$A$ occurs but $B$ does not"
The $\sigma$-algebra (brief note). For continuous sample spaces (e.g., $\Omega = \mathbb{R}$), we cannot assign probabilities to every subset - some are too pathological (Vitali sets). The solution is to restrict to a $\sigma$-algebra $\mathcal{F}$: a collection of subsets of $\Omega$ that is closed under complement and countable unions. The standard choice for $\mathbb{R}$ is the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$, generated by all open intervals. For this course, you can think of "events" as "any reasonable subset of $\Omega$" - the measure-theoretic subtlety is noted but not required.
2.2 The Three Kolmogorov Axioms
In 1933, Andrei Kolmogorov placed probability theory on a rigorous axiomatic foundation, resolving a century of informal debate about what probability "really is."
Definition 2.3 (Probability measure). A probability measure is a function $P : \mathcal{F} \to [0, 1]$ satisfying:
Axiom 1 (Non-negativity): $P(A) \geq 0$ for every event $A$
Axiom 2 (Normalisation): $P(\Omega) = 1$
Axiom 3 (Countable additivity / $\sigma$-additivity): If $A_1, A_2, \dots$ are pairwise disjoint events (i.e., $A_i \cap A_j = \emptyset$ for $i \neq j$), then: $P\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$
The triple $(\Omega, \mathcal{F}, P)$ is called a probability space.
Why these three axioms? They are minimal: they are logically independent (none follows from the other two), they rule out degenerate assignments (e.g., negative probabilities or a total probability different from 1), and they imply everything else in probability theory via logical deduction.
2.3 Immediate Consequences of the Axioms
From just three axioms, we can derive all the basic rules of probability:
Theorem 2.1 (Probability of the empty set): $P(\emptyset) = 0$.
Proof. $\Omega$ and $\emptyset$ are disjoint with $\Omega \cup \emptyset = \Omega$. By Axiom 3: $P(\Omega) = P(\Omega) + P(\emptyset)$. So $P(\emptyset) = 0$.
Theorem 2.2 (Finite additivity): For disjoint $A_1, \dots, A_n$: $P\!\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i)$
Proof. Set $A_{n+1} = A_{n+2} = \dots = \emptyset$, apply Axiom 3, then use Theorem 2.1.
Theorem 2.3 (Complement rule): $P(A^c) = 1 - P(A)$.
Proof. $A$ and $A^c$ are disjoint with $A \cup A^c = \Omega$. By Axiom 3: $P(A) + P(A^c) = P(\Omega) = 1$.
Theorem 2.4 (Monotonicity): If $A \subseteq B$, then $P(A) \leq P(B)$.
Proof. Write $B = A \cup (B \setminus A)$, where $A$ and $B \setminus A$ are disjoint. By Axiom 3: $P(B) = P(A) + P(B \setminus A) \geq P(A)$ (since $P(B \setminus A) \geq 0$ by Axiom 1).
Theorem 2.5 (Bounds): $0 \leq P(A) \leq 1$ for all $A$.
Proof. $A \subseteq \Omega$, so by monotonicity: $P(A) \leq P(\Omega) = 1$. The lower bound is Axiom 1.
2.4 Probability Measures - Examples and Non-Examples
Examples of valid probability measures:
| Sample space | Probability measure |
|---|---|
| $\Omega = \{1, \dots, 6\}$ | $P(\{\omega\}) = \frac{1}{6}$ for each $\omega$ (fair die) |
| $\Omega = \{1, \dots, 6\}$ | $P(\{\omega\}) = p_\omega$ for each $\omega$ (loaded die, $\sum_\omega p_\omega = 1$) |
| $\Omega = [0, 1]$ | $P([a, b]) = b - a$ for $0 \leq a \leq b \leq 1$ (uniform) |
| $\Omega = \mathbb{R}$ | $P(A) = \int_A \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx$ (standard Gaussian) |
| $\Omega = \{0, 1\}$ | $P(\{1\}) = p$, $P(\{0\}) = 1 - p$ for $p \in [0, 1]$ (Bernoulli) |
Non-examples (violate at least one axiom):
| Assignment | Axiom violated |
|---|---|
| $P(\{H\}) = 0.7$, $P(\{T\}) = 0.5$ | Axiom 2: total $= 1.2 \neq 1$ |
| $P(\{H\}) = 1.3$, $P(\{T\}) = -0.3$ | Axiom 1: negative probability |
| $P(\{1\}) = 0.3$, $P(\{2\}) = 0.3$, $P(\{1, 2\}) = 0.5$ | Axiom 3: $P(\{1, 2\}) \neq P(\{1\}) + P(\{2\})$ for disjoint events |
3. Computing Probabilities - The Core Rules
3.1 Complement and Monotonicity
The complement rule is one of the most practically useful consequences of the axioms.
Complement rule: $P(A) = 1 - P(A^c)$
This is particularly useful when $P(A^c)$ is easier to compute than $P(A)$ directly. This technique - "compute the complement" - is ubiquitous in probability:
Example. What is the probability that at least one of 10 coin flips is heads (for a fair coin)?
Direct approach: sum $P(\text{exactly } k \text{ heads})$ for $k = 1, \dots, 10$ - tedious.
Complement approach: $P(\text{at least one head}) = 1 - P(\text{no heads}) = 1 - \left(\tfrac{1}{2}\right)^{10} = 1 - \tfrac{1}{1024} \approx 0.999$.
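The arithmetic is easy to verify numerically. The following is a minimal sketch (using numpy; the random seed and the 100,000-trial sample size are arbitrary choices) comparing the complement-rule answer with a Monte Carlo estimate:

```python
# Quick numerical check of the complement-rule calculation above.
import numpy as np

analytic = 1 - 0.5 ** 10                        # P(at least one head) = 1 - P(no heads)

rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=(100_000, 10))  # 100k experiments of 10 fair flips
simulated = (flips.sum(axis=1) >= 1).mean()     # fraction with at least one head

print(f"analytic  = {analytic:.6f}")   # 0.999023
print(f"simulated = {simulated:.6f}")  # close to the analytic value
```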
For AI: The complement rule underlies the "at least one" union bound used in statistical learning theory. If we want all of $n$ events to hold with probability at least $1 - \delta$, we equivalently want the probability that at least one of them fails to be at most $\delta$.
3.2 Inclusion-Exclusion Principle
Theorem 3.1 (Inclusion-exclusion, two events): $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Proof. Decompose $A \cup B = A \cup (B \setminus A)$ into disjoint parts. Then decompose $B = (A \cap B) \cup (B \setminus A)$ into disjoint parts. By Axiom 3: $P(A \cup B) = P(A) + P(B \setminus A)$ and $P(B) = P(A \cap B) + P(B \setminus A)$. Subtracting: $P(B \setminus A) = P(B) - P(A \cap B)$. Substituting: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
The subtraction of $P(A \cap B)$ corrects for double-counting the overlap.
Generalisation to $n$ events: $P\!\left(\bigcup_{i=1}^{n} A_i\right) = \sum_i P(A_i) - \sum_{i < j} P(A_i \cap A_j) + \sum_{i < j < k} P(A_i \cap A_j \cap A_k) - \dots + (-1)^{n+1} P(A_1 \cap \dots \cap A_n)$
Union bound (Boole's inequality): A simpler but looser bound: $P\!\left(\bigcup_{i=1}^{n} A_i\right) \leq \sum_{i=1}^{n} P(A_i)$
For AI: The union bound is the workhorse of statistical learning theory. To prove that a learned classifier generalises to unseen data with high probability, one typically applies the union bound over all possible hypotheses, then uses concentration inequalities (Section05) to bound each term.
3.3 Conditional Probability
When we observe that event $B$ has occurred, this changes our state of knowledge. The probability of $A$ given that $B$ occurred is:
Definition 3.1 (Conditional probability): $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$, defined whenever $P(B) > 0$.
Intuition: Conditioning on $B$ restricts the sample space from $\Omega$ to $B$. Within this restricted space, we renormalise by dividing by $P(B)$ to ensure the conditional probabilities sum to 1.
CONDITIONAL PROBABILITY - GEOMETRIC INTUITION
========================================================================
Before conditioning:                  After conditioning on B:

+---------------------------+         +---------------------------+
|  Omega                    |         |  B  (new sample space)    |
|   +------+------+         |         |      +------+             |
|   |  A   | A∩B  |   B     |         |      | A∩B  |             |
|   |      |      |         |         |      |      | renormalised|
|   +------+------+         |         |      +------+             |
+---------------------------+         +---------------------------+

P(A) = area(A) / area(Omega)          P(A|B) = area(A∩B) / area(B)
========================================================================
Key properties of conditional probability:
- $P(\cdot \mid B)$ is itself a valid probability measure: it satisfies all three Kolmogorov axioms
- $P(B \mid B) = 1$ - given $B$ occurred, $B$ certainly occurred
- $P(A^c \mid B) = 1 - P(A \mid B)$ - the complement rule holds conditionally
Non-example: $P(A \mid B) \neq P(B \mid A)$ in general. This asymmetry is the source of the classic prosecutor's fallacy: $P(\text{evidence} \mid \text{innocence})$ is very small, but $P(\text{innocence} \mid \text{evidence})$ could be substantial in a large population.
3.4 Chain Rule of Probability
Rearranging the definition of conditional probability gives: $P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
This extends to any number of events:
Theorem 3.2 (Chain rule / product rule): $P(A_1 \cap A_2 \cap \dots \cap A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2)\cdots P(A_n \mid A_1 \cap \dots \cap A_{n-1})$
For AI: The chain rule of probability is the mathematical foundation of autoregressive language models. A language model with vocabulary $V$ and context window $T$ defines: $P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})$
Every GPT-family model is directly computing these conditional probabilities. The model is trained to minimise the average negative log of each factor - i.e., the cross-entropy loss.
3.5 Law of Total Probability
Definition 3.2 (Partition). Events $B_1, \dots, B_n$ form a partition of $\Omega$ if they are pairwise disjoint ($B_i \cap B_j = \emptyset$ for $i \neq j$) and exhaustive ($\bigcup_{i=1}^{n} B_i = \Omega$).
Theorem 3.3 (Law of total probability). If $B_1, \dots, B_n$ partition $\Omega$ and $P(B_i) > 0$ for all $i$: $P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)$
Proof. Since $B_1, \dots, B_n$ partition $\Omega$: $A = \bigcup_{i=1}^{n} (A \cap B_i)$, a disjoint union. By Axiom 3: $P(A) = \sum_i P(A \cap B_i) = \sum_i P(A \mid B_i)\,P(B_i)$.
Example. A language model generates a sentence that is either grammatical ($G$, probability 0.7) or ungrammatical ($G^c$, probability 0.3). Given grammatical, the probability a human accepts it is 0.9; given ungrammatical, 0.2. What is the overall acceptance probability? By the law of total probability: $P(\text{accept}) = 0.9 \times 0.7 + 0.2 \times 0.3 = 0.69$.
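The sketch below just spells out the same law-of-total-probability computation with the stated numbers:

```python
# Law of total probability for the acceptance example above.
p_g = 0.7                    # P(grammatical)
p_accept_given_g = 0.9       # P(accept | grammatical)
p_accept_given_not_g = 0.2   # P(accept | ungrammatical)

p_accept = p_accept_given_g * p_g + p_accept_given_not_g * (1 - p_g)
print(p_accept)  # 0.69
```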
3.6 Bayes' Theorem
Bayes' theorem inverts conditional probability: from $P(B \mid A)$ we can compute $P(A \mid B)$.
Theorem 3.4 (Bayes' theorem): $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$
Applying the law of total probability to the denominator (with a partition $A_1, \dots, A_n$ of $\Omega$): $P(A_i \mid B) = \dfrac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\,P(A_j)}$
Bayesian inference terminology:
| Term | Symbol | Meaning |
|---|---|---|
| Prior | $P(H)$ | Belief about $H$ before seeing $D$ |
| Likelihood | $P(D \mid H)$ | How probable is $D$ if $H$ is true? |
| Evidence | $P(D)$ | Total probability of the observation |
| Posterior | $P(H \mid D)$ | Updated belief about $H$ after seeing $D$ |
Classic example - Spam filter. An email is spam ($S$) with some prior probability $P(S)$. The word "lottery" appears in 80% of spam emails but only 5% of legitimate ones. Given that an email contains "lottery", what is $P(S \mid \text{lottery})$? By Bayes' theorem: $P(S \mid \text{lottery}) = \dfrac{0.8\,P(S)}{0.8\,P(S) + 0.05\,(1 - P(S))}$ - the answer depends strongly on the prior.
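Since the prior is left as an input above, the sketch below also treats it as a parameter; the 0.2 and 0.01 priors passed in are assumed values for illustration only, while the 80%/5% word frequencies come from the example:

```python
# Bayes' theorem for the spam-filter example above.
def posterior_spam(prior_spam: float,
                   p_word_given_spam: float = 0.80,
                   p_word_given_ham: float = 0.05) -> float:
    """P(spam | word) via Bayes' theorem with a two-event partition."""
    evidence = (p_word_given_spam * prior_spam
                + p_word_given_ham * (1 - prior_spam))
    return p_word_given_spam * prior_spam / evidence

print(posterior_spam(0.2))   # 0.8   (with an assumed prior of 0.2)
print(posterior_spam(0.01))  # ~0.14 - the prior matters a lot
```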
For AI: Bayes' theorem is the core of:
- Bayesian neural networks: parameters $\theta$ have a prior $p(\theta)$; training updates it to the posterior $p(\theta \mid \mathcal{D})$
- RLHF reward modelling: given human preference data, update beliefs about which completions are better
- Naive Bayes classifiers: classify documents by computing $P(\text{class} \mid \text{words}) \propto P(\text{class}) \prod_i P(\text{word}_i \mid \text{class})$
4. Independence
4.1 Unconditional Independence
Intuitively, events $A$ and $B$ are independent if knowing that one occurred gives no information about whether the other occurred.
Definition 4.1 (Independence). Events $A$ and $B$ are independent, written $A \perp B$, if: $P(A \cap B) = P(A)\,P(B)$
Equivalently (when $P(B) > 0$): $P(A \mid B) = P(A)$ - conditioning on $B$ does not change the probability of $A$.
Why this definition? The conditional probability $P(A \mid B)$ equals $P(A)$ if and only if $P(A \cap B) = P(A)\,P(B)$. The multiplicative form is preferred as it is symmetric and well-defined even when $P(A) = 0$ or $P(B) = 0$.
Examples of independent events:
- Two fair coin flips: $P(H_1 \cap H_2) = \tfrac{1}{4} = \tfrac{1}{2} \cdot \tfrac{1}{2} = P(H_1)\,P(H_2)$ [ok]
- Drawing with replacement: second draw is independent of first
- Two separate neural network weight initialisations drawn i.i.d.
Non-examples (dependent events):
- Drawing without replacement: if first card is an ace, second ace is less likely
- Correlated features in a dataset
- Sequential tokens in a language model: in general $P(x_t \mid x_{t-1}) \neq P(x_t)$
For $n$ events (mutual independence): Events $A_1, \dots, A_n$ are mutually independent if for every subset $S \subseteq \{1, \dots, n\}$: $P\!\left(\bigcap_{i \in S} A_i\right) = \prod_{i \in S} P(A_i)$
This is strictly stronger than pairwise independence (see Section4.3).
4.2 Conditional Independence
Conditional independence is one of the most important concepts in probabilistic modelling and graphical models.
Definition 4.2 (Conditional independence). Events $A$ and $B$ are conditionally independent given $C$, written $A \perp B \mid C$, if: $P(A \cap B \mid C) = P(A \mid C)\,P(B \mid C)$
Equivalently (when $P(B \cap C) > 0$): $P(A \mid B, C) = P(A \mid C)$ - once we know $C$, learning $B$ gives no additional information about $A$.
The classic example - explaining away. Let $W$ = "the grass is wet", $S$ = "the sprinkler is on", $R$ = "it rained". Rain and sprinkler are independent causes of wet grass. But:
- Without knowing whether it rained: $W$ and $S$ are marginally dependent (if the grass is wet, the sprinkler being on is more likely)
- Given that it rained: $P(W \mid S, R) = P(W \mid R)$ - knowing the sprinkler status adds nothing once we know it rained
Caution: Independence and conditional independence are logically independent:
- $A \perp B$ does NOT imply $A \perp B \mid C$ (conditioning can introduce dependence)
- $A \perp B \mid C$ does NOT imply $A \perp B$ (marginalising out $C$ can introduce dependence)
For AI: Conditional independence is the foundation of:
- Naive Bayes: assumes features are conditionally independent given the class label
- Hidden Markov Models: $P(o_t \mid s_{1:t}, o_{1:t-1}) = P(o_t \mid s_t)$ - observations are independent given the hidden state
- Bayesian networks: the factorisation $P(x_1, \dots, x_n) = \prod_{i=1}^{n} P\big(x_i \mid \mathrm{parents}(x_i)\big)$ encodes a set of conditional independence assumptions
- Attention masks: attention weights in transformers implement conditional independence (causal masking = future tokens are conditionally independent of past given present)
4.3 Pairwise vs Mutual Independence
Pairwise independence means $P(A_i \cap A_j) = P(A_i)\,P(A_j)$ for all pairs $i \neq j$.
Mutual independence means the joint probability factorises over every subset (Definition 4.1 for $n$ events).
Mutual independence implies pairwise independence but not vice versa. The following counterexample (Bernstein, 1928) shows that pairwise independence can hold without mutual independence:
Roll two fair dice. Define:
- = "first die shows even" -
- = "second die shows even" -
- = "sum of dice is even" -
Check pairwise: [ok], [ok], [ok].
But .
So are pairwise independent but not mutually independent.
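Because the sample space has only 36 equally likely outcomes, the claim can be checked exhaustively. A small sketch (the event names A, B, C match the definitions above):

```python
# Exhaustive check of pairwise vs mutual independence for two fair dice.
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely pairs
prob = lambda event: sum(1 for w in outcomes if event(w)) / len(outcomes)

A = lambda w: w[0] % 2 == 0           # first die even
B = lambda w: w[1] % 2 == 0           # second die even
C = lambda w: (w[0] + w[1]) % 2 == 0  # sum even

for X, Y in [(A, B), (A, C), (B, C)]:
    assert abs(prob(lambda w: X(w) and Y(w)) - prob(X) * prob(Y)) < 1e-12  # pairwise OK

p_abc = prob(lambda w: A(w) and B(w) and C(w))
print(p_abc, prob(A) * prob(B) * prob(C))  # 0.25 vs 0.125 - not mutually independent
```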
For AI: The i.i.d. (independent and identically distributed) assumption in ML training is a mutual independence assumption: $p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i)$, with each $x_i \sim p_{\text{data}}$. This assumption underlies the derivation of cross-entropy loss as a maximum likelihood estimator.
4.4 Why Independence Matters for AI
Independence assumptions make otherwise intractable problems tractable.
Without independence: The joint distribution of $d$ binary random variables requires $2^d - 1$ parameters - exponential in $d$, and utterly infeasible for even moderate $d$ (for example, $d = 100$ gives roughly $10^{30}$ parameters).
With conditional independence (Naive Bayes): Given class $y$, features are independent: $P(x_1, \dots, x_d \mid y) = \prod_{i=1}^{d} P(x_i \mid y)$. Now only $d$ parameters per class - linear in $d$.
Independence in optimisation: Mini-batch gradient descent assumes that samples in a batch are independent draws from the training distribution. This independence makes the batch gradient an unbiased estimator of the full gradient.
Independence and parallelism: Independently distributed data can be processed in parallel without communication - the foundation of distributed ML training (data parallelism).
5. Random Variables - Formal Foundation
5.1 Definition: Random Variable as a Measurable Function
The outcomes in a sample space can be anything - text strings, images, dice faces. To do mathematics with them, we need to convert them to numbers. A random variable does exactly this.
Definition 5.1 (Random variable). A random variable $X$ is a (measurable) function from the sample space $\Omega$ to the real line $\mathbb{R}$: $X : \Omega \to \mathbb{R}$
The measurability requirement ensures that $\{X \leq x\}$ is a valid event (belongs to $\mathcal{F}$) for every $x \in \mathbb{R}$, so that we can assign it a probability. For all practical purposes, every function you would naturally write down is measurable.
Examples:
- Die roll: $\Omega = \{1, \dots, 6\}$, $X(\omega) = \omega$ (the outcome itself)
- Coin flip: $\Omega = \{H, T\}$, $X(H) = 1$, $X(T) = 0$ (Bernoulli representation)
- Sentence length: $\Omega$ = the set of possible sentences, $X(\omega)$ = number of words in $\omega$
- Model loss: $\Omega$ = all possible mini-batch draws, $X(\omega)$ = the loss computed on mini-batch $\omega$
Non-examples:
- The sample space $\Omega$ itself is not a random variable (it is the domain, not a function)
- A deterministic constant $c$ can be viewed as a degenerate random variable $X(\omega) = c$ for all $\omega$, but calling it "random" is misleading
Notation: Random variables are typically denoted by uppercase letters ($X$, $Y$, $Z$); their realised values (specific numbers) by lowercase ($x$, $y$, $z$). So "$X = x$" means "the random variable $X$ takes the value $x$."
The event $\{\omega \in \Omega : X(\omega) = x\}$ is well-defined, and we write $P(X = x)$ as shorthand for $P(\{\omega \in \Omega : X(\omega) = x\})$.
5.2 Discrete vs Continuous - Taxonomy
Random variables split into two main types based on the range of $X$.
TAXONOMY OF RANDOM VARIABLES
========================================================================
Random Variable X: Omega -> R
|
+-- Discrete: X takes countably many values {x_1, x_2, x_3, ...}
|   |
|   +-- Characterised by PMF: p(x) = P(X = x)
|   +-- CDF: F(x) = sum of p(x_i) over all x_i <= x   (staircase)
|   +-- Examples: Bernoulli, Geometric, Binomial, Poisson, Categorical
|
+-- Continuous: X takes values in an interval (uncountably many)
    |
    +-- Characterised by PDF: f(x) such that P(a <= X <= b) = integral_a^b f(x) dx
    +-- CDF: F(x) = integral_{-inf}^x f(t) dt   (smooth, differentiable)
    +-- Examples: Uniform, Gaussian, Exponential, Beta, Gamma

Note: "Mixed" random variables exist (CDF is a mix of jumps and
smooth parts) but are rare in practice.
========================================================================
Key distinction - probability at a point:
- Discrete: $P(X = x)$ can be positive (it is the PMF value)
- Continuous: $P(X = x) = 0$ for every specific $x$ (the probability of hitting any exact value is zero). Probabilities come from integrating over intervals.
For AI: The distinction is crucial for loss functions. Cross-entropy loss for classification uses a discrete distribution (Categorical) over a finite vocabulary or class set. For regression, the loss is often the negative log-likelihood of a Gaussian (continuous distribution). Diffusion models use a mixture: discrete time steps with continuous noise.
5.3 The Cumulative Distribution Function
The CDF provides a unified description of both discrete and continuous random variables.
Definition 5.2 (Cumulative distribution function). The CDF of a random variable $X$ is: $F_X(x) = P(X \leq x)$
Examples:
- Fair die: $F(x) = 0$ for $x < 1$; $F(x) = \frac{\lfloor x \rfloor}{6}$ for $1 \leq x < 6$; $F(x) = 1$ for $x \geq 6$. A staircase with jumps of $\frac{1}{6}$ at each integer.
- Standard normal: $F(x) = \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\,dt$. A smooth S-shaped curve.
- Bernoulli($p$): $F(x) = 0$ for $x < 0$; $F(x) = 1 - p$ for $0 \leq x < 1$; $F(x) = 1$ for $x \geq 1$.
Computing interval probabilities from the CDF: $P(a < X \leq b) = F(b) - F(a)$
5.4 CDF Properties and the Fundamental Theorem
Theorem 5.1 (CDF properties). Any CDF $F$ satisfies:
- Monotone non-decreasing: $x_1 \leq x_2 \implies F(x_1) \leq F(x_2)$
- Right-continuous: $\lim_{t \downarrow x} F(t) = F(x)$ for all $x$
- Limits: $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$
- Jump characterisation: $P(X = x) = F(x) - F(x^-)$, where $F(x^-) = \lim_{t \uparrow x} F(t)$. For continuous $X$, this is 0 everywhere; for discrete $X$, it equals the PMF at jump points.
These four properties completely characterise CDFs - any function satisfying them is the CDF of some random variable. This is what makes the CDF both universal and tractable.
Fundamental theorem (continuous case): For continuous $X$ with CDF $F$ and PDF $f$: $F(x) = \int_{-\infty}^{x} f(t)\,dt$ and $f(x) = F'(x)$
The PDF is the derivative of the CDF; the CDF is the integral of the PDF.
6. Discrete Random Variables
6.1 Probability Mass Function
Definition 6.1 (PMF). For a discrete random variable $X$ with countable range $\{x_1, x_2, \dots\}$, the probability mass function is: $p_X(x) = P(X = x)$
Properties:
- $p_X(x_i) \geq 0$ for all $i$ (Axiom 1)
- $\sum_i p_X(x_i) = 1$ (Axiom 2)
- $p_X(x) = 0$ for $x \notin \{x_1, x_2, \dots\}$
CDF from PMF: $F_X(x) = \sum_{x_i \leq x} p_X(x_i)$ - sum all PMF values at or below $x$.
Non-examples of valid PMFs:
- $p(1) = 0.5$, $p(2) = 0.7$: sum $= 1.2 > 1$
- $p(1) = 1.5$, $p(2) = -0.5$: negative value
- $p(k) = \frac{1}{k}$ for $k = 1, 2, 3, \dots$: $\sum_k \frac{1}{k} = \infty$ (harmonic series diverges)
For AI: Every classifier's output layer implicitly defines a PMF over classes. The softmax function $\mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$ produces a vector of non-negative values summing to 1 - a valid PMF over the $K$ classes. Training minimises the cross-entropy between this PMF and the one-hot PMF of the true label.
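As a quick illustration of that claim, here is a minimal softmax sketch (numpy, with the standard max-subtraction trick for numerical stability - an implementation detail not discussed above):

```python
# A softmax output is a valid PMF: non-negative entries that sum to 1.
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()          # stability shift: does not change the result
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, -1.0, 0.5, 3.0])
p = softmax(logits)
print(p)          # all entries >= 0
print(p.sum())    # 1.0 (up to floating-point error)
```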
6.2 Bernoulli Distribution - The Canonical Example
The Bernoulli distribution is the simplest non-trivial probability distribution: a single binary trial.
Definition 6.2 (Bernoulli distribution). $X \sim \mathrm{Bernoulli}(p)$ for $p \in [0, 1]$ if: $P(X = 1) = p$, $P(X = 0) = 1 - p$
or equivalently: $p_X(x) = p^x (1 - p)^{1 - x}$ for $x \in \{0, 1\}$.
Properties:
- Mean: $E[X] = p$
- Variance: $\mathrm{Var}(X) = p(1 - p)$
- Maximum variance at $p = \tfrac{1}{2}$: $\mathrm{Var}(X) = \tfrac{1}{4}$
- CDF: $F(x) = 0$ for $x < 0$; $F(x) = 1 - p$ for $0 \leq x < 1$; $F(x) = 1$ for $x \geq 1$
ML applications of the Bernoulli distribution:
- Binary classification: $P(y = 1 \mid x) = \sigma(f_\theta(x))$, where $\sigma$ is the sigmoid function
- Dropout: each neuron is kept with probability $1 - p$; the mask is a Bernoulli random variable
- Coin-flip initialisation tests: checking whether random initialisation is breaking symmetry
- Binary attention masks: a position is masked with probability $p$ in masked language modelling (BERT, RoBERTa)
- Bernoulli reward signals: in RL, some environments give binary rewards (0 or 1)
Bernoulli and entropy. The binary entropy of a Bernoulli($p$) variable is: $H(p) = -p \log p - (1 - p)\log(1 - p)$
This is maximised at $p = \tfrac{1}{2}$ (maximum uncertainty) and equals 0 at $p \in \{0, 1\}$ (certainty). Binary cross-entropy loss has exactly this form when the true label is $y$ and the model predicts $\hat{p}$: $\mathrm{BCE}(y, \hat{p}) = -y \log \hat{p} - (1 - y)\log(1 - \hat{p})$.
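A short sketch of both formulas (the clipping constant 1e-12 is an assumed guard against log(0), not part of the definitions):

```python
# Binary entropy H(p) and binary cross-entropy BCE(y, p_hat).
import numpy as np

def binary_entropy(p: float) -> float:
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-p * np.log(p) - (1 - p) * np.log(1 - p))

def bce(y: int, p_hat: float) -> float:
    p_hat = np.clip(p_hat, 1e-12, 1 - 1e-12)
    return float(-y * np.log(p_hat) - (1 - y) * np.log(1 - p_hat))

print(binary_entropy(0.5))        # ~0.693 = log 2, the maximum
print(binary_entropy(0.99))       # close to 0 - near certainty
print(bce(1, 0.9), bce(1, 0.1))   # confident-correct vs confident-wrong
```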
6.3 Geometric Distribution
The Geometric distribution models waiting times: how many Bernoulli trials until the first success?
Definition 6.3 (Geometric distribution). $X \sim \mathrm{Geometric}(p)$ for $p \in (0, 1]$ if: $P(X = k) = (1 - p)^{k - 1} p$, $k = 1, 2, 3, \dots$
(Here $X$ is the trial number of the first success; an alternative convention has $X$ = number of failures before the first success, giving PMF $(1 - p)^k p$ for $k = 0, 1, 2, \dots$)
Properties:
- Mean: $E[X] = \frac{1}{p}$
- Variance: $\mathrm{Var}(X) = \frac{1 - p}{p^2}$
- Memoryless property: $P(X > m + n \mid X > m) = P(X > n)$ - the past gives no information about the remaining wait time. The Geometric is the only discrete memoryless distribution.
Geometric series check (normalisation): $\sum_{k=1}^{\infty} (1 - p)^{k - 1} p = \frac{p}{1 - (1 - p)} = 1$
ML applications:
- Sentence length models: the length of a generated sentence can be modelled geometrically (each word has some fixed probability $p$ of being the last)
- Number of gradient steps to convergence: in simplified convergence analyses, the number of steps until a target criterion is first met can have an approximately geometric distribution under certain conditions
- Retry logic: in RL exploration, the number of random actions before discovering a reward follows a geometric-like distribution in uniform random environments
6.4 Preview: Binomial, Categorical, Poisson
The following distributions have their full treatment in Section02-Common-Distributions. They are introduced here briefly for orientation:
Preview: Binomial Distribution counts the number of successes in $n$ independent Bernoulli($p$) trials: $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$ for $k = 0, 1, \dots, n$. Mean $np$, variance $np(1 - p)$. As $n \to \infty$ and $p \to 0$ with $np = \lambda$ fixed, the Binomial converges to Poisson($\lambda$).
-> Full treatment: Common Distributions Section2
Preview: Categorical Distribution with $\pi = (\pi_1, \dots, \pi_K)$ on the probability simplex ($\pi_k \geq 0$, $\sum_k \pi_k = 1$) models selection of one of $K$ categories: $P(X = k) = \pi_k$. This is the distribution underlying softmax classifiers and language model next-token prediction.
-> Full treatment: Common Distributions Section4
Preview: Poisson Distribution models the number of events in a fixed interval when events occur independently at constant rate $\lambda$: $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$, $k = 0, 1, 2, \dots$ Mean = Variance = $\lambda$. Used in modelling word frequencies (Zipf/Poisson approximation), network packet arrivals, and mutation rates.
-> Full treatment: Common Distributions Section3
7. Continuous Random Variables
7.1 Probability Density Function
For continuous random variables, individual points have probability zero. Probability is distributed over intervals via the PDF.
Definition 7.1 (Probability density function). A non-negative function $f_X$ is a PDF for $X$ if:
- $f_X(x) \geq 0$ for all $x$
- For any $a \leq b$: $P(a \leq X \leq b) = \int_a^b f_X(x)\,dx$
Critical distinction: The PDF is NOT a probability. It is a probability density - probability per unit length. In particular, $f_X(x)$ can exceed 1 (though the total integral must equal 1).
Example: $f(x) = 2$ for $x \in [0, \tfrac{1}{2}]$ and $0$ elsewhere. This is a valid PDF: non-negative, and it integrates to $2 \cdot \tfrac{1}{2} = 1$. Note $f(x) = 2 > 1$, but this is fine - it is a density.
Probability at a point: $P(X = a) = \int_a^a f(x)\,dx = 0$. This is why conditioning on a continuous event $\{X = a\}$ requires care and is handled via conditional densities (Section03).
Non-examples of valid PDFs:
- $f(x) = 2$ on $[0, 1]$: $\int_0^1 2\,dx = 2 \neq 1$ (not normalised)
- $f(x) = \sin x$ for $x \in [0, 2\pi]$: negative values
- $f(x) = \frac{1}{x}$ for $x \geq 1$: $\int_1^{\infty} \frac{1}{x}\,dx = \infty$
7.2 From CDF to PDF: the Derivative Relationship
For a continuous random variable with PDF $f$ and CDF $F$: $f(x) = \frac{d}{dx} F(x)$, $F(x) = \int_{-\infty}^{x} f(t)\,dt$
This is the Fundamental Theorem of Calculus applied to probability. The CDF accumulates density; the PDF is the rate of accumulation.
Using the CDF to compute probabilities: $P(a < X \leq b) = F(b) - F(a) = \int_a^b f(x)\,dx$
Since continuous $X$ has $P(X = a) = 0$, the endpoints don't matter: $P(a < X < b) = P(a \leq X \leq b) = F(b) - F(a)$.
7.3 Uniform Distribution
The Uniform distribution is the maximum-entropy distribution over a bounded interval: all values are equally likely.
Definition 7.2 (Uniform distribution). $X \sim \mathrm{Uniform}(a, b)$ for $a < b$ if: $f(x) = \frac{1}{b - a}$ for $x \in [a, b]$, and $f(x) = 0$ otherwise.
CDF: $F(x) = 0$ for $x < a$; $F(x) = \frac{x - a}{b - a}$ for $a \leq x \leq b$; $F(x) = 1$ for $x > b$.
Properties:
- Mean: $E[X] = \frac{a + b}{2}$ (the midpoint)
- Variance: $\mathrm{Var}(X) = \frac{(b - a)^2}{12}$
- Symmetry: the PDF is symmetric about $\frac{a + b}{2}$
Derivation of variance: $E[X^2] = \int_a^b \frac{x^2}{b - a}\,dx = \frac{a^2 + ab + b^2}{3}$, so $\mathrm{Var}(X) = E[X^2] - (E[X])^2 = \frac{a^2 + ab + b^2}{3} - \frac{(a + b)^2}{4} = \frac{(b - a)^2}{12}$.
ML applications of Uniform:
- Weight initialisation (Xavier/Glorot): weights drawn from $\mathrm{Uniform}\!\left(-\sqrt{\tfrac{6}{n_{\text{in}} + n_{\text{out}}}},\ \sqrt{\tfrac{6}{n_{\text{in}} + n_{\text{out}}}}\right)$ to maintain activation variance across layers
- Batch selection: selecting which training examples to include in a mini-batch (sampling uniformly without replacement)
- Data augmentation: random cropping positions, rotation angles, colour jitter magnitudes drawn from uniform distributions
- Hyperparameter search: random search over hyperparameter ranges samples uniformly from the search space
7.4 Gaussian Distribution - Introduction
The Gaussian (Normal) distribution is the most important continuous distribution in all of probability and statistics, due to the Central Limit Theorem (Section06-Stochastic-Processes) and its mathematical tractability.
Definition 7.3 (Gaussian distribution). $X \sim \mathcal{N}(\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$ if: $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
The standard normal has $\mu = 0$, $\sigma^2 = 1$: $\varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
Its CDF is $\Phi(x) = \int_{-\infty}^{x} \varphi(t)\,dt$, which has no closed form but is tabulated and implemented in all numerical libraries.
Properties:
- Mean: $E[X] = \mu$
- Variance: $\mathrm{Var}(X) = \sigma^2$
- Symmetry: $f(\mu + t) = f(\mu - t)$ - symmetric about $\mu$
- 68-95-99.7 rule: $P(|X - \mu| \leq \sigma) \approx 0.68$, $P(|X - \mu| \leq 2\sigma) \approx 0.95$, $P(|X - \mu| \leq 3\sigma) \approx 0.997$
- Standardisation: If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$
- Affine stability: If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $aX + b \sim \mathcal{N}(a\mu + b,\ a^2\sigma^2)$
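The 68-95-99.7 rule and the standardisation property are easy to confirm numerically; the sketch below uses scipy.stats.norm and an arbitrary choice of $\mu = 3$, $\sigma = 2$:

```python
# Numerical check of the 68-95-99.7 rule and of standardisation.
import numpy as np
from scipy.stats import norm

mu, sigma = 3.0, 2.0
for k in (1, 2, 3):
    p = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print(f"P(|X - mu| <= {k} sigma) = {p:.4f}")   # 0.6827, 0.9545, 0.9973

rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=1_000_000)
z = (x - mu) / sigma                 # standardisation
print(z.mean(), z.var())             # approximately 0 and 1
```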
Why the Gaussian is ubiquitous:
- Central Limit Theorem (preview): The sum of many i.i.d. random variables with finite variance converges to a Gaussian - regardless of the original distribution. See Section06.
- Maximum entropy: Among all continuous distributions with fixed mean $\mu$ and variance $\sigma^2$, the Gaussian maximises entropy (is the "least informative" given these constraints)
- Conjugacy: The Gaussian is conjugate to itself under Bayesian updating with Gaussian likelihood - the posterior is also Gaussian
ML applications of the Gaussian:
- Weight initialisation (He/Kaiming): weights drawn from $\mathcal{N}\!\left(0, \tfrac{2}{n_{\text{in}}}\right)$ for ReLU networks
- Noise models: assuming residuals $y - f_\theta(x) \sim \mathcal{N}(0, \sigma^2)$ gives mean squared error loss
- VAE latent space: $q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x), \sigma_\phi^2(x)\big)$ with Gaussian prior $p(z) = \mathcal{N}(0, I)$
- Diffusion models forward process: $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1 - \beta_t}\,x_{t-1},\ \beta_t I\big)$
- Score functions: the score of a Gaussian is linear: $\nabla_x \log f(x) = -\frac{x - \mu}{\sigma^2}$
Preview: Exponential, Beta, Gamma Distributions The Exponential distribution models waiting times between Poisson events; the Beta models probabilities on (conjugate prior for Bernoulli); the Gamma generalises both. Full treatment: Common Distributions Section5, Section6, Section7.
8. Expectation and Variance - Foundations
8.1 Expected Value: Definition and Linearity
The expected value (or mean) of a random variable is its probability-weighted average - the "centre of mass" of its distribution.
Definition 8.1 (Expected value - discrete). For a discrete $X$ with PMF $p$: $E[X] = \sum_i x_i\, p(x_i)$
(provided the sum converges absolutely: $\sum_i |x_i|\, p(x_i) < \infty$).
Definition 8.2 (Expected value - continuous). For a continuous $X$ with PDF $f$: $E[X] = \int_{-\infty}^{\infty} x\, f(x)\,dx$
(provided the integral converges absolutely).
Examples:
- Bernoulli: $E[X] = 1 \cdot p + 0 \cdot (1 - p) = p$
- Uniform$(a, b)$: $E[X] = \frac{a + b}{2}$
- $\mathcal{N}(\mu, \sigma^2)$: $E[X] = \mu$ (by symmetry and definition)
- Fair die: $E[X] = \frac{1 + 2 + \dots + 6}{6} = 3.5$
Non-example (expectation undefined): The Cauchy distribution with PDF $f(x) = \frac{1}{\pi(1 + x^2)}$ has $\int_{-\infty}^{\infty} |x|\,f(x)\,dx = \infty$, so $E[X]$ does not exist. This is why heavy-tailed distributions require care.
Theorem 8.1 (Linearity of expectation). For any random variables $X$, $Y$ and constants $a$, $b$: $E[aX + bY] = a\,E[X] + b\,E[Y]$
This holds whether or not and are independent - linearity of expectation is unconditional.
Proof (discrete): $E[aX + bY] = \sum_{x, y} (ax + by)\,p(x, y) = a\sum_{x, y} x\,p(x, y) + b\sum_{x, y} y\,p(x, y) = a\,E[X] + b\,E[Y]$.
For AI: Linearity of expectation is why mini-batch gradient descent works. The gradient of the average loss over a batch equals the average of individual gradients:
$\nabla_\theta\!\left(\frac{1}{B}\sum_{i=1}^{B} \mathcal{L}_i(\theta)\right) = \frac{1}{B}\sum_{i=1}^{B} \nabla_\theta \mathcal{L}_i(\theta)$
This follows from linearity of both expectation and differentiation.
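A concrete check of this identity, using a toy linear model with squared error (all shapes and names here are illustrative, not a training recipe):

```python
# The gradient of the batch-average loss equals the average of per-sample gradients.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))      # a mini-batch of 32 samples, 5 features
y = rng.normal(size=32)
theta = rng.normal(size=5)

# Per-sample loss L_i = 0.5 * (x_i . theta - y_i)^2, gradient (x_i . theta - y_i) x_i
residuals = X @ theta - y
per_sample_grads = residuals[:, None] * X       # shape (32, 5)

grad_of_average = X.T @ residuals / len(y)      # gradient of (1/B) sum_i L_i
average_of_grads = per_sample_grads.mean(axis=0)

print(np.allclose(grad_of_average, average_of_grads))  # True
```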
Expected value of a function: For $Y = g(X)$: $E[g(X)] = \sum_i g(x_i)\,p(x_i)$ (discrete) or $E[g(X)] = \int g(x)\,f(x)\,dx$ (continuous)
This is the Law of the Unconscious Statistician (LOTUS) - you can compute $E[g(X)]$ directly from the distribution of $X$ without finding the distribution of $g(X)$ first. (Full treatment in Section04.)
Warning: $E[g(X)] \neq g(E[X])$ in general. For example, $E[X^2] \geq (E[X])^2$ (the gap is the variance). Jensen's inequality gives the direction: if $g$ is convex, $E[g(X)] \geq g(E[X])$.
8.2 Variance and Standard Deviation
Variance measures spread - how far a random variable typically deviates from its mean.
Definition 8.3 (Variance). The variance of $X$ is: $\mathrm{Var}(X) = E\big[(X - E[X])^2\big]$
Computational formula: $\mathrm{Var}(X) = E[X^2] - (E[X])^2$
Proof: with $\mu = E[X]$, $E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - \mu^2$.
Definition 8.4 (Standard deviation). $\sigma_X = \sqrt{\mathrm{Var}(X)}$. This has the same units as $X$ and $E[X]$.
Examples:
- Bernoulli: $\mathrm{Var}(X) = p(1 - p)$. Maximum at $p = \tfrac{1}{2}$: $\mathrm{Var}(X) = \tfrac{1}{4}$.
- Uniform$(a, b)$: $\mathrm{Var}(X) = \frac{(b - a)^2}{12}$
- $\mathcal{N}(\mu, \sigma^2)$: $\mathrm{Var}(X) = \sigma^2$ (by definition)
- Constant $c$: $\mathrm{Var}(c) = 0$ (no randomness)
Properties of variance:
- $\mathrm{Var}(X) \geq 0$, with equality iff $X$ is constant a.s.
- $\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$ (shifting by $b$ doesn't change spread; scaling by $a$ scales variance by $a^2$)
- If $X \perp Y$: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$
For AI: Variance appears throughout ML:
- Bias-variance tradeoff: test error = bias^2 + variance + irreducible noise. High-variance models overfit.
- Gradient variance: the variance of the stochastic gradient estimator determines training stability. Variance reduction techniques (control variates, importance sampling) reduce this.
- Batch normalisation: standardises activations to have mean 0 and variance 1 within each batch, stabilising training.
- Uncertainty in predictions: the variance of a Gaussian output head models aleatoric uncertainty.
8.3 Key Properties and Common Pitfalls
Expectation is linear; variance is not. For independent $X$, $Y$:
- $E[XY] = E[X]\,E[Y]$ (independence gives factorisation)
- $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$ (independence cancels the covariance)
- But in general: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$
Jensen's inequality: For convex $g$: $E[g(X)] \geq g(E[X])$
Examples: $E[X^2] \geq (E[X])^2$ (taking $g(x) = x^2$); $E[e^X] \geq e^{E[X]}$ (taking $g(x) = e^x$). Jensen's inequality is used to prove Markov's inequality, the EM lower bound (ELBO), and the non-negativity of the KL divergence.
Expectation and independence: $E[XY] = E[X]\,E[Y]$ holds if $X \perp Y$, but the converse is false: $\mathrm{Cov}(X, Y) = 0$ (uncorrelated) does not imply independence.
8.4 Preview: Covariance, LOTUS, MGF
Preview: Covariance and Correlation The covariance $\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$ measures linear dependence between two random variables. The covariance matrix of a random vector extends this to multiple dimensions. These are the core tools for analysing multivariate distributions.
-> Full treatment: Expectation and Moments Section3
Preview: Law of the Unconscious Statistician (LOTUS) LOTUS allows computing $E[g(X)]$ using the distribution of $X$ directly: no change-of-variables required. It is the foundational tool for deriving moments and the expected loss under a distribution.
-> Full treatment: Expectation and Moments Section2
Preview: Moment Generating Functions The MGF $M_X(t) = E[e^{tX}]$ encodes all moments of $X$. MGFs are the key tool for proving the Central Limit Theorem and computing tail bounds.
-> Full treatment: Expectation and Moments Section5
9. Transformations of Random Variables
9.1 Functions of a Random Variable
Given a random variable $X$ and a function $g : \mathbb{R} \to \mathbb{R}$, the composition $Y = g(X)$ is a new random variable. Computing the distribution of $Y$ from the distribution of $X$ is a central skill.
Discrete case. If $X$ is discrete with PMF $p_X$ and $g$ is any function, then $Y = g(X)$ is discrete with: $p_Y(y) = \sum_{x :\, g(x) = y} p_X(x)$
Example. Let $X$ be a fair die roll and (for illustration) $Y = (X - 3)^2$. Then $p_Y(0) = p_X(3) = \tfrac{1}{6}$, $p_Y(1) = p_X(2) + p_X(4) = \tfrac{2}{6}$, etc.
Continuous case - CDF method. For continuous $X$ and $Y = g(X)$, compute the CDF of $Y$: $F_Y(y) = P(g(X) \leq y)$
then differentiate to get the PDF: $f_Y(y) = F_Y'(y)$.
9.2 Change-of-Variables Formula
For monotone transformations, the CDF method yields a clean formula.
Theorem 9.1 (Change of variables - monotone, 1-D). Let $X$ have PDF $f_X$ and CDF $F_X$. Let $g$ be a differentiable, strictly monotone function on the support of $X$, with inverse $h = g^{-1}$. Then $Y = g(X)$ has PDF: $f_Y(y) = f_X(h(y))\,\lvert h'(y) \rvert$
Derivation (increasing $g$): $F_Y(y) = P(g(X) \leq y) = P(X \leq h(y)) = F_X(h(y))$, so $f_Y(y) = f_X(h(y))\,h'(y)$.
For decreasing $g$, $F_Y(y) = P(X \geq h(y)) = 1 - F_X(h(y))$, and differentiating gives a negative sign, cancelled by the absolute value in the formula.
Example. $X \sim \mathcal{N}(0, 1)$, $Y = e^X$ (log-normal distribution). Here $g(x) = e^x$, $h(y) = \ln y$, $h'(y) = 1/y$. So: $f_Y(y) = \frac{1}{y\sqrt{2\pi}} \exp\!\left(-\frac{(\ln y)^2}{2}\right)$ for $y > 0$
Example. $X \sim \mathcal{N}(\mu, \sigma^2)$, $Y = \frac{X - \mu}{\sigma}$ (standardisation). Here $g(x) = \frac{x - \mu}{\sigma}$, $h(y) = \sigma y + \mu$, $h'(y) = \sigma$: $f_Y(y) = f_X(\sigma y + \mu)\,\sigma = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-y^2/2} \cdot \sigma = \frac{1}{\sqrt{2\pi}} e^{-y^2/2}$
So $Y \sim \mathcal{N}(0, 1)$. [ok]
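The log-normal example can be sanity-checked by comparing the derived density against samples of $e^X$; the sketch below integrates $f_Y$ numerically over an arbitrary interval and compares with the empirical frequency:

```python
# Check f_Y(y) = phi(ln y)/y against samples of Y = exp(X), X ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
y_samples = np.exp(rng.standard_normal(500_000))

def f_Y(y):
    return np.exp(-0.5 * np.log(y) ** 2) / (y * np.sqrt(2 * np.pi))

a, b = 0.5, 2.0                                   # arbitrary interval
empirical = ((y_samples > a) & (y_samples <= b)).mean()
grid = np.linspace(a, b, 10_001)
numeric = np.trapz(f_Y(grid), grid)               # integral of the derived PDF
print(empirical, numeric)                         # both ~0.51
```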
For AI: The change-of-variables formula is the mathematical foundation of normalising flows - generative models that learn a bijection $g$ transforming a simple base distribution (e.g., $\mathcal{N}(0, I)$) into a complex target distribution. The log-density of the transformed variable is: $\log p_Y(y) = \log p_X(h(y)) + \log\lvert \det J_h(y) \rvert$
where $h = g^{-1}$ and $J_h$ is the Jacobian of the inverse transformation.
9.3 Standardisation and the Z-Score
Definition 9.1 (Z-score / standardisation). For $X$ with mean $\mu$ and standard deviation $\sigma > 0$: $Z = \frac{X - \mu}{\sigma}$
Then $E[Z] = 0$ and $\mathrm{Var}(Z) = 1$ regardless of the distribution of $X$.
Proof: $E[Z] = \frac{E[X] - \mu}{\sigma} = 0$ and $\mathrm{Var}(Z) = \frac{1}{\sigma^2}\,\mathrm{Var}(X) = 1$.
Applications in ML:
- Feature normalisation (preprocessing): standardise each feature so that the gradient has comparable scale across dimensions, improving optimiser convergence
- Batch normalisation (BN): standardises activations within each mini-batch, then applies learned scale and shift parameters $\gamma, \beta$: $\hat{x} = \frac{x - \mu_{\text{batch}}}{\sqrt{\sigma^2_{\text{batch}} + \epsilon}}$, output $y = \gamma \hat{x} + \beta$
- Layer normalisation (LN): same idea but normalising across features rather than the batch dimension - standard in transformer models
- RMS Normalisation (RMSNorm): a simplified variant using only the RMS (root mean square), dropping the mean subtraction - used in LLaMA, Gemma, and Mistral models: $y = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma$
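The following minimal numpy sketches mirror the LayerNorm and RMSNorm formulas above (gamma and beta stand in for the learned parameters; eps is the usual numerical-stability constant):

```python
# Minimal sketches of layer normalisation and RMSNorm over the last axis.
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms               # no mean subtraction, no beta

x = np.random.default_rng(0).normal(size=(2, 8))        # (batch, features)
print(layer_norm(x, gamma=1.0, beta=0.0).std(axis=-1))  # ~1 per row
print(rms_norm(x, gamma=1.0))
```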
10. Applications in Machine Learning
10.1 Cross-Entropy Loss as Negative Log-Likelihood
Setup. Suppose we observe data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ drawn i.i.d. from the true distribution $p_{\text{data}}$. A model with parameters $\theta$ defines a predicted distribution $p_\theta(y \mid x)$.
Maximum likelihood estimation (MLE). We want to choose $\theta$ to maximise the probability assigned to the observed data: $\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^{N} p_\theta(y_i \mid x_i)$
Taking the logarithm (monotone transformation, same argmax): $\hat{\theta}_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i) = \arg\min_\theta \left(-\sum_{i=1}^{N} \log p_\theta(y_i \mid x_i)\right)$
For classification. With $K$ classes, the model outputs $p_\theta(y = k \mid x) = \mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$, where $z$ are the logits. For a one-hot label $y$, the negative log-likelihood is exactly the cross-entropy loss: $\mathcal{L}(\theta) = -\sum_{k=1}^{K} y_k \log p_\theta(k \mid x) = -\log p_\theta(y \mid x)$
For regression. If $p_\theta(y \mid x) = \mathcal{N}\big(y;\ f_\theta(x),\ \sigma^2\big)$: $-\log p_\theta(y \mid x) = \frac{(y - f_\theta(x))^2}{2\sigma^2} + \text{const}$
Minimising negative log-likelihood under a Gaussian noise model is equivalent to minimising mean squared error.
For language modelling. Each token $x_t$ has distribution $p_\theta(x_t \mid x_{<t})$. The NLL loss over a sequence of $T$ tokens is: $\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$
This is the standard pretraining loss for GPT-family models. The perplexity is $\exp(\mathcal{L})$ - a lower perplexity means the model assigns higher probability to the actual next token.
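A tiny sketch of the NLL-to-perplexity relationship, using made-up per-token probabilities for a 5-token sequence:

```python
# Token-level NLL and perplexity from per-token probabilities.
import numpy as np

p_correct = np.array([0.60, 0.25, 0.90, 0.05, 0.40])  # model prob. of each actual token

nll = -np.log(p_correct).mean()     # average negative log-likelihood per token
perplexity = np.exp(nll)

print(f"NLL per token = {nll:.3f}")
print(f"perplexity    = {perplexity:.3f}")  # "effective number of choices" per step
```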
10.2 Bayesian Inference: Prior, Likelihood, Posterior
Bayesian inference applies Bayes' theorem to update beliefs about parameters $\theta$ after observing data $\mathcal{D}$: $p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\,p(\theta)}{p(\mathcal{D})}$
The normalising constant $p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\,p(\theta)\,d\theta$ is the evidence or marginal likelihood.
MAP estimation. The Maximum A Posteriori (MAP) estimate is the mode of the posterior: $\hat{\theta}_{\text{MAP}} = \arg\max_\theta \big[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\big]$
With a Gaussian prior $p(\theta) = \mathcal{N}(0, \tau^2 I)$, the term $\log p(\theta)$ becomes $-\frac{\lVert\theta\rVert^2}{2\tau^2} + \text{const}$ - i.e., L2 regularisation (weight decay). MAP estimation with a Gaussian prior is exactly L2-regularised MLE.
For AI: Bayesian thinking is increasingly central to modern AI:
- Model uncertainty (epistemic): the posterior over models captures uncertainty from limited data - Bayesian deep learning approximates this posterior
- RLHF reward model: human preferences give a likelihood $p(\text{preferences} \mid \text{reward model})$; updating the reward model is a posterior update
- In-context learning: some interpretations frame few-shot ICL as implicit Bayesian updating over hypotheses about the task
10.3 Probabilistic Language Models
A language model defines a probability distribution over sequences of tokens. Given vocabulary $V$, a sequence $(x_1, \dots, x_T)$ with $x_t \in V$ is assigned probability: $P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})$
by the chain rule of probability (Section3.4). Each conditional $P(x_t \mid x_{<t})$ is a Categorical distribution over tokens, produced by the transformer's softmax output.
Sampling strategies as distribution operations:
- Greedy decoding: always take $\arg\max_x P(x \mid x_{<t})$ - equivalent to a degenerate point mass
- Temperature scaling: replace logits $z$ with $z / \tau$; $\tau \to 0$ approaches greedy, $\tau \to \infty$ approaches uniform
- Top-$k$ sampling: restrict to the $k$ most probable tokens, renormalise - truncates the distribution
- Top-$p$ (nucleus) sampling: restrict to the smallest set of tokens whose cumulative probability is $\geq p$, renormalise - adapts the cutoff to the entropy of the distribution
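The sketch below implements these operations directly on a probability vector (numpy only; the five-token vocabulary and all probability values are illustrative):

```python
# Decoding strategies as operations on a probability vector.
import numpy as np

rng = np.random.default_rng(0)

def sample_with_temperature(logits, tau=1.0):
    z = logits / tau
    p = np.exp(z - z.max()); p /= p.sum()
    return rng.choice(len(p), p=p)

def top_k_probs(p, k):
    """Zero out all but the k most probable tokens, then renormalise."""
    keep = np.argsort(p)[-k:]
    q = np.zeros_like(p); q[keep] = p[keep]
    return q / q.sum()

def top_p_probs(p, p_cut=0.9):
    """Keep the smallest set of tokens with cumulative probability >= p_cut."""
    order = np.argsort(p)[::-1]
    cumulative = np.cumsum(p[order])
    cutoff = np.searchsorted(cumulative, p_cut) + 1   # number of tokens kept
    q = np.zeros_like(p); q[order[:cutoff]] = p[order[:cutoff]]
    return q / q.sum()

p = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(top_k_probs(p, k=2))                     # mass only on the two most probable tokens
print(top_p_probs(p, p_cut=0.8))               # smallest nucleus covering 80% of the mass
print(sample_with_temperature(np.log(p), tau=0.7))
```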
Bits-per-character (BPC) and perplexity: The quality of a language model is measured by how much probability mass it assigns to held-out text. Perplexity $= \exp(\text{average NLL per token})$. A perplexity of $k$ means the model is "as uncertain as a uniform distribution over $k$ choices" at each step.
10.4 Sources of Randomness in Neural Network Training
Training a neural network involves multiple distinct sources of randomness, each with a different probability distribution:
| Source | Distribution | When it occurs |
|---|---|---|
| Weight initialisation | Gaussian $\mathcal{N}(0, \sigma^2)$ or Uniform | Once, at start |
| Mini-batch selection | Uniform (without replacement) | Every step |
| Dropout masks | Bernoulli | Every forward pass |
| Data augmentation | Uniform (flips), Uniform (crop positions), Gaussian (colour jitter) | Every sample |
| Label smoothing | Convex combination of Categorical and Uniform | During loss computation |
| Noise injection (DP-SGD) | Gaussian noise added to clipped gradients | Every gradient update |
| Sampling from generative models | Model-defined (Categorical, Gaussian, etc.) | Inference |
Each source of randomness introduces variance into the training process. Understanding their distributions is essential for:
- Reproducibility (fixing random seeds)
- Differential privacy analysis (noise calibration)
- Variance reduction in gradient estimation
- Debugging training instabilities (distinguishing random vs systematic failures)
11. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Confusing $P(A \mid B)$ with $P(B \mid A)$ | These are completely different quantities (prosecutor's fallacy, base rate neglect) | Always identify which event is the "given" and which is the "unknown" before applying Bayes' theorem |
| 2 | Treating the PDF value $f(x)$ as a probability | The PDF is a density - it can exceed 1, and individual point probabilities are always 0 for continuous $X$ | Probabilities come from integrating the PDF: $P(a \leq X \leq b) = \int_a^b f(x)\,dx$ |
| 3 | Assuming independence from $\mathrm{Cov}(X, Y) = 0$ | Uncorrelated does not imply independent; independence is strictly stronger | Use the definition: $P(X \leq x, Y \leq y) = P(X \leq x)\,P(Y \leq y)$ for all $x, y$ |
| 4 | Applying Axiom 3 to non-disjoint events | $P(A \cup B) = P(A) + P(B)$ holds only when $A \cap B = \emptyset$ | Use inclusion-exclusion: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ |
| 5 | Forgetting to check normalisation when constructing a PMF/PDF | An un-normalised function cannot be a PMF or PDF | Always verify $\sum_x p(x) = 1$ (discrete) or $\int f(x)\,dx = 1$ (continuous) |
| 6 | Using $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$ without checking independence | Variance is additive only for uncorrelated (not just any) random variables | Add the covariance term: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$ |
| 7 | Confusing $E[g(X)]$ with $g(E[X])$ | Jensen's inequality shows these differ whenever $g$ is nonlinear | Compute $E[g(X)]$ using LOTUS: $\sum_x g(x)\,p(x)$ or $\int g(x)\,f(x)\,dx$ |
| 8 | Confusing pairwise independence with mutual independence | Events can be pairwise independent but not mutually independent (Bernstein example) | For $n$ events, verify the factorisation for all subsets, or use a structural argument |
| 9 | Assuming $P(A \cap B) = 0$ when $A$ and $B$ are independent | Independence means $P(A \cap B) = P(A)\,P(B)$; this is zero only when $P(A) = 0$ or $P(B) = 0$ | Independence $\neq$ mutual exclusivity. Disjoint events ($A \cap B = \emptyset$) with positive probability are necessarily dependent |
| 10 | Applying the change-of-variables formula without the absolute value of the Jacobian | For decreasing transformations, the Jacobian is negative; omitting the absolute value gives a negative PDF | Always write $f_Y(y) = f_X(h(y)) \cdot \lvert h'(y) \rvert$ |
| 11 | Concluding $X$ and $Y$ are independent from a correlation of zero in non-Gaussian data | Zero correlation means uncorrelated, which does NOT imply independence for non-Gaussian distributions | Independence implies zero correlation, but not vice versa. Check independence directly or use copulas |
| 12 | Treating conditional probability as symmetric: $P(A \mid B) = P(B \mid A)$ | These are equal only if $P(A) = P(B)$ (from Bayes' theorem with equal priors) | Apply Bayes' theorem to convert between the two: $P(A \mid B) = P(B \mid A)\,P(A)/P(B)$ |
12. Exercises
Exercise 1 * - Axiom Derivations Using only the three Kolmogorov axioms, prove each of the following: (a) $P(\emptyset) = 0$ (b) $P(A^c) = 1 - P(A)$ (c) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ (inclusion-exclusion for two events) (d) If $A \subseteq B$ then $P(A) \leq P(B)$ (e) $0 \leq P(A) \leq 1$ for all events $A$
Each proof should cite exactly which axiom(s) and which previously proved results it uses.
Exercise 2 * - Bayes' Theorem Application A medical test for a disease has sensitivity 95% (true positive rate) and specificity 98% (true negative rate). The disease affects 0.5% of the population. (a) Define events $D$ (has disease) and $T$ (tests positive). State all given probabilities. (b) Compute $P(T)$ using the law of total probability. (c) Compute $P(D \mid T)$ using Bayes' theorem. (d) Explain why the result might surprise a clinician unfamiliar with Bayes' theorem. (e) How would the answer change if the disease prevalence were 10% instead of 0.5%? Compute $P(D \mid T)$ and interpret.
Exercise 3 * - PMF Construction and Verification A biased die has $P(X = k) = c \cdot k$ for $k = 1, \dots, 6$. (a) Find the normalisation constant $c$. (b) Verify this is a valid PMF. (c) Compute $P(X \geq 4)$. (d) Compute $P(X \text{ is even})$. (e) What is $E[X]$? Compare to the fair die.
Exercise 4 ** - CDF Analysis Let $X$ have CDF $F(x) = 1 - e^{-\lambda x}$ for $x \geq 0$ and $F(x) = 0$ for $x < 0$ (Exponential distribution). (a) Show this is a valid CDF (check all four properties from Theorem 5.1). (b) Find the PDF by differentiating the CDF. (c) Compute $P(a < X \leq b)$ for any $0 \leq a < b$. (d) Show the memoryless property: $P(X > s + t \mid X > s) = P(X > t)$ for all $s, t \geq 0$. (e) For $\lambda = 1$: compute $P(X > 2)$ exactly and numerically.
Exercise 5 ** - Independence and Conditional Independence Roll two fair dice, letting $X$ be the result of die 1 and $Y$ the result of die 2. Define $S = X + Y$ (the sum). (a) Verify that $X$ and $Y$ are independent. (b) Show that $X$ and $S$ are not independent (hint: compute $P(X = 1, S = 12)$ and compare to $P(X = 1)\,P(S = 12)$). (c) Now condition on the event $\{S = 7\}$. Show that conditional independence of $X$ and $Y$ given $S$ fails - or more precisely, write down $P(X = x, Y = y \mid S = 7)$ and $P(X = x \mid S = 7)\,P(Y = y \mid S = 7)$ and check whether they are equal. (d) Explain in words why knowing $S$ induces dependence between $X$ and $Y$.
Exercise 6 ** - Change of Variables Let $X \sim \mathrm{Uniform}(0, 1)$. (a) Find the PDF of $Y = aX + b$ (for constants $a > 0$, $b$). What distribution is this? (b) Find the PDF of $Y = X^2$. (c) Find the CDF and PDF of $Y = -\ln X$. (d) If $U \sim \mathrm{Uniform}(0, 1)$, show that $F^{-1}(U)$ has the same distribution as a random variable with CDF $F$, for any continuous strictly increasing CDF $F$ (the inverse CDF / quantile transform method). (e) How is part (d) used in sampling from arbitrary distributions in ML?
Exercise 7 *** - Cross-Entropy Loss Derivation Consider a $K$-class classification problem with model $p_\theta(y = k \mid x) = \mathrm{softmax}(z)_k$. (a) Write the negative log-likelihood for a single example $(x, y)$. (b) Show this equals the cross-entropy $H(q, p_\theta)$ where $q$ is the one-hot distribution. (c) Show the gradient of the NLL w.r.t. the logits is $\mathrm{softmax}(z) - q$ (softmax output minus one-hot target). (d) Explain why label smoothing replaces the one-hot $q$ with $(1 - \epsilon)\,q + \epsilon / K$, and how this changes the loss. (e) For language modelling with vocabulary size $|V|$: if the model assigns probability 0.8 to the correct next token, what is the NLL? The perplexity?
Exercise 8 *** - Probabilistic Analysis of Dropout During training, dropout independently sets each of $d$ activations to 0 with probability $p$, and scales the remaining by $\frac{1}{1 - p}$. (a) Let $M_i$ be the mask for activation $i$. What is the distribution of $M_i$? (b) Let $a_i$ be the pre-dropout activation. Define $\tilde{a}_i = \frac{M_i\,a_i}{1 - p}$. Show $E[\tilde{a}_i] = a_i$ (inverted dropout preserves the expected activation). (c) Compute $\mathrm{Var}(\tilde{a}_i)$ as a function of $a_i$, $p$. (d) The output of a layer is $s = \sum_i w_i \tilde{a}_i$. Assuming the masks $M_i$ are independent across neurons, what is $\mathrm{Var}(s)$? (e) Explain why high $p$ increases gradient variance and can destabilise training. At what $p$ is the variance of the Bernoulli mask maximised?