Math for LLMs

Probability Theory / Introduction and Random Variables

Introduction to Probability and Random Variables: Part 1 - Intuition to Exercises

1. Intuition

1.1 What Is Probability? Three Interpretations

Probability is a number in [0, 1] assigned to an event, measuring how likely that event is to occur. But what does "likely" mean? There are three major interpretations, and each underpins a different school of thought in statistics and machine learning.

1. Classical (equally likely outcomes). If a sample space \Omega has N equally likely outcomes and event A contains k of them, then P(A) = k/N. This is the interpretation of textbook dice and card problems. It breaks down when outcomes are not equally likely - flipping a biased coin, for instance.

2. Frequentist (long-run relative frequency). P(A) is the limiting proportion of times A occurs in an infinite sequence of identical, independent experiments:

P(A) = \lim_{n \to \infty} \frac{\text{count of } A \text{ in } n \text{ trials}}{n}

This is the foundation of classical statistics (hypothesis testing, confidence intervals). Its weakness: it cannot assign probabilities to one-off events ("the probability that GPT-5 passes the bar exam").

3. Bayesian (degree of belief). P(A) represents a rational agent's degree of belief that A is true, updated as new evidence arrives. This interpretation allows probabilities for singular events and underpins Bayesian inference, Bayesian neural networks, and probabilistic generative models. Crucially, different agents can assign different prior probabilities to the same event - and Bayes' theorem tells them how to update rationally.

THREE INTERPRETATIONS OF PROBABILITY
========================================================================

  Classical           Frequentist              Bayesian
  -----------------   --------------------     ----------------------
  P(A) = k/N          P(A) = lim freq(A)       P(A) = degree of belief
  Equally likely       Long-run proportion      Updated by evidence
  outcomes             Objective               Subjective / rational
  Dice, cards          Hypothesis tests         Generative models
                       Confidence intervals     Bayesian NNs, VAEs

  All three satisfy the same axioms (Section 2.2).
  They differ only in WHAT those axioms are applied to.

========================================================================

For AI: Modern ML systems implicitly blend all three. A softmax output layer uses the classical interpretation (outputs sum to 1). Training loop analysis uses frequentist reasoning (expected loss over the data distribution). Bayesian neural networks, RLHF reward models, and variational autoencoders use Bayesian reasoning explicitly.

1.2 Why Probability Is the Language of AI

Every component of a modern AI system is probabilistic:

  • Data: Training sets are finite samples from an unknown data-generating distribution p_{\text{data}}(\mathbf{x}, y). The goal of supervised learning is to find f_\theta that estimates p_{\text{data}}(y \mid \mathbf{x}).
  • Model outputs: A language model defines a probability distribution P(x_t \mid x_{<t}) over the next token. Generation is sampling from this distribution. Temperature scaling, top-k and top-p sampling are all operations on probability distributions.
  • Loss functions: Cross-entropy is the negative log-likelihood of the data under the model's distribution. Minimising cross-entropy is equivalent to maximising the probability the model assigns to the training data.
  • Regularisation: Dropout applies Bernoulli random variables to activations. Weight decay corresponds to a Gaussian prior in a Bayesian interpretation (MAP estimation).
  • Uncertainty quantification: Calibration of a classifier means its predicted probabilities match empirical frequencies. A well-calibrated model saying "70% confident" should be right 70% of the time.
  • Generative models: Diffusion models define a forward Markov chain (adding Gaussian noise) and learn to reverse it. VAEs learn p_\theta(\mathbf{x} \mid \mathbf{z}) and q_\phi(\mathbf{z} \mid \mathbf{x}). Normalising flows learn invertible transformations of simple distributions. All are explicitly probabilistic.

1.3 Historical Timeline

Year | Person | Contribution
1654 | Pascal & Fermat | Correspondence on gambling problems - foundations of combinatorial probability
1713 | Jacob Bernoulli | Ars Conjectandi - law of large numbers, Bernoulli distribution
1763 | Thomas Bayes (posthumous) | Essay on inverse probability - what we now call Bayes' theorem
1812 | Pierre-Simon Laplace | Théorie analytique des probabilités - systematic probability theory, Laplace approximation
1837 | Siméon Poisson | Poisson distribution - rare events in large samples
1900 | Karl Pearson | Chi-squared distribution, correlation coefficient
1933 | Andrei Kolmogorov | Rigorous axiomatic foundation - the three axioms that unify all interpretations
1950s | Shannon | Information theory - entropy connects probability to communication
1980s | Judea Pearl | Bayesian networks - graphical models for probabilistic reasoning
1990s | Gelman, Rubin et al. | MCMC methods - practical Bayesian inference for complex models
2013 | Kingma & Welling | Variational Autoencoders - learned latent distributions via reparameterisation
2020 | Ho et al. | Denoising Diffusion Probabilistic Models - probabilistic generative modelling at scale

2. Formal Definitions - Probability Spaces

2.1 Sample Spaces and Events

Definition 2.1 (Sample space). The sample space \Omega is the set of all possible outcomes of a random experiment. Each element \omega \in \Omega is called an elementary outcome or sample point.

Examples:

  • Tossing a fair coin: \Omega = \{H, T\}
  • Rolling a six-sided die: \Omega = \{1, 2, 3, 4, 5, 6\}
  • Measuring a person's height: \Omega = (0, \infty) \subset \mathbb{R}
  • Choosing a word from a vocabulary: \Omega = \{\text{the}, \text{cat}, \text{sat}, \ldots\} (finite vocabulary)
  • One training step of a neural network: \Omega = all possible mini-batch draws

Definition 2.2 (Event). An event A is a subset of \Omega, i.e., A \subseteq \Omega. We say event A occurs if the observed outcome \omega \in A.

Event algebra. Events combine using set operations:

  • Complement: A^c = \Omega \setminus A - "A does not occur"
  • Union: A \cup B - "A or B (or both) occur"
  • Intersection: A \cap B - "both A and B occur"
  • Difference: A \setminus B = A \cap B^c - "A occurs but B does not"

The \sigma-algebra (brief note). For continuous sample spaces (e.g., \Omega = \mathbb{R}), we cannot assign probabilities to every subset - some are too pathological (Vitali sets). The solution is to restrict to a \sigma-algebra \mathcal{F}: a collection of subsets of \Omega that contains \Omega and is closed under complement and countable unions. The standard choice for \Omega = \mathbb{R} is the Borel \sigma-algebra \mathcal{B}(\mathbb{R}), generated by all open intervals. For this course, you can think of "events" as "any reasonable subset of \Omega" - the measure-theoretic subtlety is noted but not required.

2.2 The Three Kolmogorov Axioms

In 1933, Andrei Kolmogorov placed probability theory on a rigorous axiomatic foundation, resolving a century of informal debate about what probability "really is."

Definition 2.3 (Probability measure). A probability measure is a function P : \mathcal{F} \to [0, 1] satisfying:

Axiom 1 (Non-negativity):

P(A) \geq 0 \quad \text{for all events } A

Axiom 2 (Normalisation):

P(\Omega) = 1

Axiom 3 (Countable additivity / \sigma-additivity): If A_1, A_2, A_3, \ldots are pairwise disjoint events (i.e., A_i \cap A_j = \emptyset for i \neq j), then:

P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)

The triple (\Omega, \mathcal{F}, P) is called a probability space.

Why these three axioms? They are minimal: they are logically independent (none follows from the other two), they rule out degenerate assignments (e.g., P(A) = -0.5 or P(\Omega) = 2), and they imply everything else in probability theory via logical deduction.

2.3 Immediate Consequences of the Axioms

From just three axioms, we can derive all the basic rules of probability:

Theorem 2.1 (Probability of the empty set): P(\emptyset) = 0.

Proof. \Omega and \emptyset are disjoint with \Omega \cup \emptyset = \Omega. By Axiom 3: P(\Omega) + P(\emptyset) = P(\Omega) = 1. So P(\emptyset) = 0. \square

Theorem 2.2 (Finite additivity): For disjoint A_1, \ldots, A_n:

P(A_1 \cup \cdots \cup A_n) = P(A_1) + \cdots + P(A_n)

Proof. Set A_{n+1} = A_{n+2} = \cdots = \emptyset, apply Axiom 3, then Theorem 2.1. \square

Theorem 2.3 (Complement rule): P(A^c) = 1 - P(A).

Proof. A and A^c are disjoint with A \cup A^c = \Omega. By Axiom 3: P(A) + P(A^c) = P(\Omega) = 1. \square

Theorem 2.4 (Monotonicity): If A \subseteq B, then P(A) \leq P(B).

Proof. Write B = A \cup (B \setminus A), where A and B \setminus A are disjoint. By Axiom 3: P(B) = P(A) + P(B \setminus A) \geq P(A) (since P(B \setminus A) \geq 0 by Axiom 1). \square

Theorem 2.5 (Bounds): 0 \leq P(A) \leq 1 for all A.

Proof. \emptyset \subseteq A \subseteq \Omega, so by monotonicity: 0 = P(\emptyset) \leq P(A) \leq P(\Omega) = 1. \square

2.4 Probability Measures - Examples and Non-Examples

Examples of valid probability measures:

Sample space \Omega | Probability measure P
\{1,2,3,4,5,6\} | P(\{k\}) = 1/6 for each k (fair die)
\{1,2,3,4,5,6\} | P(\{k\}) = k/21 for each k (loaded die; \sum_k k/21 = 1)
[0,1] | P([a,b]) = b - a for 0 \leq a \leq b \leq 1 (uniform)
\mathbb{R} | P(A) = \int_A \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx (standard Gaussian)
\{0,1\} | P(\{1\}) = p, P(\{0\}) = 1 - p for p \in [0,1] (Bernoulli)

Non-examples (violate at least one axiom):

Assignment | Axiom violated
P(\{H\}) = 0.6, P(\{T\}) = 0.6 | Axiom 2: total = 1.2 \neq 1
P(\{H\}) = -0.1, P(\{T\}) = 1.1 | Axiom 1: negative probability
P(\{H\}) = 0.5, P(\{T\}) = 0.5, P(\{H,T\}) = 0.8 | Axiom 3: P(\{H\} \cup \{T\}) \neq P(\{H\}) + P(\{T\})

3. Computing Probabilities - The Core Rules

3.1 Complement and Monotonicity

The complement rule is one of the most practically useful consequences of the axioms.

Complement rule: P(A^c) = 1 - P(A)

This is particularly useful when P(A^c) is easier to compute than P(A) directly. This technique - "compute the complement" - is ubiquitous in probability:

Example. What is the probability that at least one of 10 coin flips is heads (for a fair coin)?

Direct approach: sum P(\text{exactly } k \text{ heads}) for k = 1, \ldots, 10 - tedious.

Complement approach: P(\text{at least one head}) = 1 - P(\text{no heads}) = 1 - (1/2)^{10} = 1023/1024 \approx 0.999.

For AI: The complement rule underlies the "at least one" union bound used in statistical learning theory. If we want P(\text{at least one of } n \text{ bad events occurs}) \leq \delta, we equivalently want P(\text{all events are good}) \geq 1 - \delta.
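The coin-flip example above can be checked numerically. The sketch below computes the complement-rule answer exactly and cross-checks it against the tedious direct sum over k = 1, ..., 10:

```python
from fractions import Fraction
from math import comb

# P(at least one head in 10 fair flips) via the complement rule
n = 10
p_complement = 1 - Fraction(1, 2) ** n          # 1 - P(no heads)

# Cross-check with the direct sum over k = 1..10 heads
p_direct = sum(Fraction(comb(n, k), 2 ** n) for k in range(1, n + 1))

assert p_complement == p_direct == Fraction(1023, 1024)
print(float(p_complement))  # 0.9990234375
```

Using `Fraction` keeps the arithmetic exact, so the two approaches agree to the last digit rather than merely to floating-point tolerance.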

3.2 Inclusion-Exclusion Principle

Theorem 3.1 (Inclusion-exclusion, two events):

P(A \cup B) = P(A) + P(B) - P(A \cap B)

Proof. Decompose A \cup B = A \cup (B \setminus A) into disjoint parts, and B = (A \cap B) \cup (B \setminus A) into disjoint parts. By Axiom 3: P(A \cup B) = P(A) + P(B \setminus A) and P(B) = P(A \cap B) + P(B \setminus A). Subtracting: P(B \setminus A) = P(B) - P(A \cap B). Substituting: P(A \cup B) = P(A) + P(B) - P(A \cap B). \square

The subtraction corrects for double-counting A \cap B.

Generalisation to n events:

P\left(\bigcup_{i=1}^n A_i\right) = \sum_i P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots

Union bound (Boole's inequality): A simpler but looser bound:

P\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n P(A_i)

For AI: The union bound is the workhorse of statistical learning theory. To prove that a learned classifier generalises to unseen data with high probability, one typically applies the union bound over all possible hypotheses, then uses concentration inequalities (Section 05) to bound each term.

3.3 Conditional Probability

When we observe that event B has occurred, this changes our state of knowledge. The probability of A given that B occurred is:

Definition 3.1 (Conditional probability):

P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0

Intuition: Conditioning on B restricts the sample space from \Omega to B. Within this restricted space, we renormalise by dividing by P(B) to ensure the conditional probabilities sum to 1.

CONDITIONAL PROBABILITY - GEOMETRIC INTUITION
========================================================================

  Before conditioning:             After conditioning on B:
  +------------------------+      +------------------+
  | Ω                      |      | B                |
  |  +-----+------+-----+  |      |  +------------+  |
  |  | A\B | A∩B  | B\A |  |      |  |    A∩B     |  |
  |  +-----+------+-----+  |      |  +------------+  |
  |                        |      |  (renormalised)  |
  +------------------------+      +------------------+

  P(A) = area(A) / area(Ω)        P(A|B) = area(A∩B) / area(B)

========================================================================

Key properties of conditional probability:

  • P(\cdot \mid B) is itself a valid probability measure on (\Omega, \mathcal{F}): it satisfies all three Kolmogorov axioms
  • P(B \mid B) = 1 - given B occurred, B certainly occurred
  • P(A^c \mid B) = 1 - P(A \mid B) - the complement rule holds conditionally

Non-example: P(A \mid B) \neq P(B \mid A) in general. This asymmetry is the source of the classic prosecutor's fallacy: P(\text{DNA match} \mid \text{innocent}) is very small, but P(\text{innocent} \mid \text{DNA match}) can be substantial in a large population.

3.4 Chain Rule of Probability

Rearranging the definition of conditional probability gives:

P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)

This extends to any number of events:

Theorem 3.2 (Chain rule / product rule):

P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \cdot P(A_2 \mid A_1) \cdot P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1})

For AI: The chain rule of probability is the mathematical foundation of autoregressive language models. A language model with vocabulary \mathcal{V} and context window T defines:

P(x_1, x_2, \ldots, x_T) = P(x_1) \cdot P(x_2 \mid x_1) \cdot P(x_3 \mid x_1, x_2) \cdots P(x_T \mid x_1, \ldots, x_{T-1})

Every GPT-family model is directly computing these conditional probabilities. The model is trained to minimise the average negative log of each factor - i.e., the cross-entropy loss.
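A minimal sketch of this factorisation, using a toy hand-specified model (the conditional probabilities below are invented numbers for illustration, and the model is simplified to condition on only the previous token rather than the full prefix):

```python
import math

# Toy autoregressive model: hypothetical next-token probabilities.
p_first = {"the": 0.6, "a": 0.4}
p_next = {  # P(x_t | x_{t-1}) - a Markov simplification of P(x_t | x_{<t})
    "the": {"cat": 0.7, "dog": 0.3},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.9, "ran": 0.1},
    "dog": {"sat": 0.2, "ran": 0.8},
}

def sequence_prob(tokens):
    """Chain rule: P(x_1 .. x_T) = P(x_1) * prod_t P(x_t | x_{t-1})."""
    p = p_first[tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        p *= p_next[prev][cur]
    return p

tokens = ["the", "cat", "sat"]
p = sequence_prob(tokens)          # 0.6 * 0.7 * 0.9 = 0.378

# Cross-entropy loss = average negative log of each chain-rule factor
nll = -sum(math.log(f) for f in [0.6, 0.7, 0.9]) / 3
print(p, nll)
```

The per-token loss `nll` is exactly `-log(p) / T`: minimising cross-entropy and maximising the sequence probability are the same objective.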

3.5 Law of Total Probability

Definition 3.2 (Partition). Events B_1, B_2, \ldots, B_n form a partition of \Omega if they are pairwise disjoint (B_i \cap B_j = \emptyset for i \neq j) and exhaustive (\bigcup_i B_i = \Omega).

Theorem 3.3 (Law of total probability). If B_1, \ldots, B_n partition \Omega and P(B_i) > 0 for all i:

P(A) = \sum_{i=1}^n P(A \mid B_i) \cdot P(B_i)

Proof. Since B_1, \ldots, B_n partition \Omega: A = \bigcup_i (A \cap B_i), a disjoint union. By Axiom 3: P(A) = \sum_i P(A \cap B_i) = \sum_i P(A \mid B_i) P(B_i). \square

Example. A language model generates a sentence that is either grammatical (B_1, probability 0.7) or ungrammatical (B_2, probability 0.3). Given grammatical, the probability a human accepts it is 0.9; given ungrammatical, 0.2. What is the overall acceptance probability?

P(\text{accepted}) = P(\text{acc} \mid B_1) P(B_1) + P(\text{acc} \mid B_2) P(B_2) = 0.9 \times 0.7 + 0.2 \times 0.3 = 0.63 + 0.06 = 0.69
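The worked example above can be sketched in code, with the exact total-probability sum checked against a Monte Carlo simulation of the two-stage experiment:

```python
import random

# Law of total probability for the grammaticality example,
# checked against a Monte Carlo simulation.
p_b = {"grammatical": 0.7, "ungrammatical": 0.3}   # partition of Omega
p_acc_given_b = {"grammatical": 0.9, "ungrammatical": 0.2}

p_accept = sum(p_acc_given_b[b] * p_b[b] for b in p_b)  # 0.69

random.seed(0)
n = 200_000
hits = 0
for _ in range(n):
    # Sample which partition cell occurred, then the conditional outcome
    b = "grammatical" if random.random() < p_b["grammatical"] else "ungrammatical"
    if random.random() < p_acc_given_b[b]:
        hits += 1
print(p_accept, hits / n)  # exact 0.69 vs simulated estimate
```

With 200,000 trials the simulated frequency lands within about 0.002 of the exact answer, illustrating the frequentist reading of the same quantity.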

3.6 Bayes' Theorem

Bayes' theorem inverts conditional probability: from P(A \mid B) we can compute P(B \mid A).

Theorem 3.4 (Bayes' theorem):

P(B \mid A) = \frac{P(A \mid B) \cdot P(B)}{P(A)}

Applying the law of total probability to the denominator (with the partition B, B^c):

P(B \mid A) = \frac{P(A \mid B) \cdot P(B)}{P(A \mid B) P(B) + P(A \mid B^c) P(B^c)}

Bayesian inference terminology:

Term | Symbol | Meaning
Prior | P(B) | Belief about B before seeing A
Likelihood | P(A \mid B) | How probable is A if B is true?
Evidence | P(A) | Total probability of observation A
Posterior | P(B \mid A) | Updated belief about B after seeing A

\underbrace{P(B \mid A)}_{\text{posterior}} = \frac{\overbrace{P(A \mid B)}^{\text{likelihood}} \cdot \overbrace{P(B)}^{\text{prior}}}{\underbrace{P(A)}_{\text{evidence}}}

Classic example - Spam filter. An email is spam with prior P(\text{spam}) = 0.3. The word "lottery" appears in 80% of spam emails but only 5% of legitimate ones. Given that an email contains "lottery", what is P(\text{spam} \mid \text{lottery})?

P(\text{spam} \mid \text{lottery}) = \frac{0.80 \times 0.30}{0.80 \times 0.30 + 0.05 \times 0.70} = \frac{0.24}{0.24 + 0.035} = \frac{0.24}{0.275} \approx 0.873

For AI: Bayes' theorem is the core of:

  • Bayesian neural networks: parameters \theta have a prior P(\theta); training updates to the posterior P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta) P(\theta)
  • RLHF reward modelling: given human preference data, update beliefs about which completions are better
  • Naive Bayes classifiers: classify documents by computing P(\text{class} \mid \text{words}) \propto P(\text{words} \mid \text{class}) P(\text{class})
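The spam-filter calculation above is three lines of code. A minimal sketch, using the two-event partition form of Bayes' theorem (the helper name `posterior` is ours, not a library function):

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' theorem with the two-event partition {B, B^c}."""
    evidence = p_evidence_given_h * prior + p_evidence_given_not_h * (1 - prior)
    return p_evidence_given_h * prior / evidence

# Spam example from above: prior 0.3, "lottery" in 80% of spam, 5% of ham
p = posterior(0.30, 0.80, 0.05)
print(round(p, 3))  # 0.873
```

Note how the prior matters: rerunning with `prior=0.01` drops the posterior sharply even though the likelihood ratio is unchanged, which is exactly the base-rate effect behind the prosecutor's fallacy.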

4. Independence

4.1 Unconditional Independence

Intuitively, events A and B are independent if knowing one occurred gives no information about whether the other occurred.

Definition 4.1 (Independence). Events A and B are independent, written A \perp B, if:

P(A \cap B) = P(A) \cdot P(B)

Equivalently (when P(B) > 0): P(A \mid B) = P(A) - conditioning on B does not change the probability of A.

Why this definition? The conditional probability P(A \mid B) = P(A \cap B)/P(B) equals P(A) if and only if P(A \cap B) = P(A)P(B). The multiplicative form is preferred because it is symmetric in A and B and remains well-defined even when P(A) = 0 or P(B) = 0.

Examples of independent events:

  • Two fair coin flips: P(\text{H on flip 1} \cap \text{H on flip 2}) = 1/4 = 1/2 \times 1/2 ✓
  • Drawing with replacement: second draw is independent of first
  • Two separate neural network weight initialisations drawn i.i.d.

Non-examples (dependent events):

  • Drawing without replacement: if first card is an ace, second ace is less likely
  • Correlated features in a dataset
  • Sequential tokens in a language model: P(x_t \mid x_{t-1}) \neq P(x_t)

For n events (mutual independence): Events A_1, \ldots, A_n are mutually independent if for every subset S \subseteq \{1, \ldots, n\}:

P\left(\bigcap_{i \in S} A_i\right) = \prod_{i \in S} P(A_i)

This is strictly stronger than pairwise independence (see Section 4.3).

4.2 Conditional Independence

Conditional independence is one of the most important concepts in probabilistic modelling and graphical models.

Definition 4.2 (Conditional independence). Events A and B are conditionally independent given C, written A \perp B \mid C, if:

P(A \cap B \mid C) = P(A \mid C) \cdot P(B \mid C)

Equivalently (when P(B \cap C) > 0): P(A \mid B, C) = P(A \mid C) - once we know C, learning B gives no additional information about A.

The classic example - explaining away. Let A = "the grass is wet", B = "the sprinkler is on", C = "it rained". Rain and sprinkler are independent causes of wet grass. But:

  • Without knowing whether it rained: A and B are marginally dependent (if the grass is wet, the sprinkler being on is more likely)
  • Given that it rained: A \perp B \mid C - knowing the sprinkler status adds nothing once we know it rained

Caution: Independence and conditional independence are logically independent:

  • A \perp B does NOT imply A \perp B \mid C (conditioning can introduce dependence)
  • A \perp B \mid C does NOT imply A \perp B (marginalising out C can introduce dependence)

For AI: Conditional independence is the foundation of:

  • Naive Bayes: assumes features are conditionally independent given the class label
  • Hidden Markov Models: x_t \perp x_{t'} \mid z_t - observations are independent given the hidden states
  • Bayesian networks: the factorisation P(X_1, \ldots, X_n) = \prod_i P(X_i \mid \text{parents}(X_i)) encodes a set of conditional independence assumptions
  • Attention masks: causal masking in transformers enforces the autoregressive factorisation - each token's prediction is conditioned only on earlier tokens

4.3 Pairwise vs Mutual Independence

Pairwise independence means P(A_i \cap A_j) = P(A_i)P(A_j) for all pairs (i, j).

Mutual independence means the joint probability factorises for every subset (Definition 4.1 for n events).

Mutual independence implies pairwise independence but not vice versa. The following counterexample (Bernstein, 1928) shows that pairwise independence can hold without mutual independence:

Roll two fair dice. Define:

  • A = "first die shows even" - P(A) = 1/2
  • B = "second die shows even" - P(B) = 1/2
  • C = "sum of the dice is even" - P(C) = 1/2

Check pairwise: P(A \cap B) = 1/4 = P(A)P(B) ✓, P(A \cap C) = 1/4 ✓, P(B \cap C) = 1/4 ✓.

But P(A \cap B \cap C) = P(\text{both even, sum even}) = P(\text{both even}) = 1/4 \neq P(A)P(B)P(C) = 1/8.

So A, B, C are pairwise independent but not mutually independent.
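The dice counterexample is small enough to verify by exhaustive enumeration over all 36 equally likely outcomes:

```python
from fractions import Fraction
from itertools import product

# Exhaustive check of the two-dice example: A, B, C are pairwise
# independent, yet P(A ∩ B ∩ C) != P(A)P(B)P(C).
outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def prob(event):
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

A = lambda w: w[0] % 2 == 0           # first die even
B = lambda w: w[1] % 2 == 0           # second die even
C = lambda w: (w[0] + w[1]) % 2 == 0  # sum even

both = lambda e1, e2: (lambda w: e1(w) and e2(w))

assert prob(both(A, B)) == prob(A) * prob(B) == Fraction(1, 4)
assert prob(both(A, C)) == prob(A) * prob(C) == Fraction(1, 4)
assert prob(both(B, C)) == prob(B) * prob(C) == Fraction(1, 4)
assert prob(lambda w: A(w) and B(w) and C(w)) == Fraction(1, 4)  # not 1/8
```

Any two of the three events determine the third (e.g. if both dice are even the sum is even), which is exactly why the triple intersection fails to factorise.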

For AI: The i.i.d. (independent and identically distributed) assumption in ML training is a mutual independence assumption: P((\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(n)}, y^{(n)})) = \prod_{i=1}^n P(\mathbf{x}^{(i)}, y^{(i)}). This assumption underlies the derivation of cross-entropy loss as a maximum likelihood estimator.

4.4 Why Independence Matters for AI

Independence assumptions make otherwise intractable problems tractable.

Without independence: The joint distribution of n binary random variables requires 2^n - 1 parameters. For n = 100 features, this is 2^{100} - 1 \approx 10^{30} - utterly infeasible.

With conditional independence (Naive Bayes): Given the class C, features are independent: P(\mathbf{x} \mid C) = \prod_{j=1}^n P(x_j \mid C). Now only n parameters per class - linear in n.

Independence in optimisation: Mini-batch gradient descent assumes that the samples in a batch are independent draws from the training distribution. This independence makes the batch gradient an unbiased estimator of the full gradient.

Independence and parallelism: Independently distributed data can be processed in parallel without communication - the foundation of distributed ML training (data parallelism).


5. Random Variables - Formal Foundation

5.1 Definition: Random Variable as a Measurable Function

The outcomes in a sample space \Omega can be anything - text strings, images, dice faces. To do mathematics with them, we need to convert them to numbers. A random variable does exactly this.

Definition 5.1 (Random variable). A random variable X is a (measurable) function from the sample space \Omega to the real line \mathbb{R}:

X : \Omega \to \mathbb{R}, \quad \omega \mapsto X(\omega)

The measurability requirement ensures that \{\omega \in \Omega : X(\omega) \leq x\} is a valid event (belongs to \mathcal{F}) for every x \in \mathbb{R}, so that we can assign it a probability. For all practical purposes, every function you would naturally write down is measurable.

Examples:

  • Die roll: \Omega = \{1, \ldots, 6\}, X(\omega) = \omega (the outcome itself)
  • Coin flip: \Omega = \{H, T\}, X(H) = 1, X(T) = 0 (Bernoulli representation)
  • Sentence length: \Omega = \{\text{all English sentences}\}, X(\omega) = |\omega| (number of words)
  • Model loss: \Omega = \{\text{all possible mini-batches}\}, X(\omega) = \mathcal{L}(\theta; \omega)

Non-examples:

  • The sample space \Omega itself is not a random variable (it is the domain, not a function)
  • A deterministic constant c can be viewed as a degenerate random variable X(\omega) = c for all \omega, but calling it "random" is misleading

Notation: Random variables are typically denoted by uppercase letters (X, Y, Z); their realised values (specific numbers) by lowercase (x, y, z). So "X = x" means "the random variable X takes the value x."

The event \{X \leq x\} = \{\omega \in \Omega : X(\omega) \leq x\} is well-defined, and we write P(X \leq x) as shorthand for P(\{X \leq x\}).
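A random variable really is just a function on the sample space, and the shorthand P(X ≤ x) really is the measure of a set of outcomes. A minimal sketch with \Omega = two coin flips and X = number of heads:

```python
from fractions import Fraction

# Omega: all outcomes of two fair coin flips, with the uniform measure.
omega = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]
P = {w: Fraction(1, 4) for w in omega}

# X : Omega -> R, counting heads
X = lambda w: sum(1 for flip in w if flip == "H")

def cdf(x):
    """F_X(x) = P({w in Omega : X(w) <= x})."""
    return sum(P[w] for w in omega if X(w) <= x)

assert cdf(0) == Fraction(1, 4)   # only (T, T)
assert cdf(1) == Fraction(3, 4)   # everything except (H, H)
assert cdf(2) == 1
```

The event {X ≤ 1} is literally the subset {(H,T), (T,H), (T,T)} of \Omega, and its probability is the sum of the measures of those outcomes.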

5.2 Discrete vs Continuous - Taxonomy

Random variables split into two main types based on the range of X.

TAXONOMY OF RANDOM VARIABLES
========================================================================

  Random Variable X : Omega -> R
  |
  +-- Discrete: X takes countably many values {x_1, x_2, x_3, ...}
  |   |
  |   +-- Characterised by PMF: p(x) = P(X = x)
  |   +-- CDF: F(x) = sum of p(x_i) over all x_i <= x   (staircase)
  |   +-- Examples: Bernoulli, Geometric, Binomial, Poisson, Categorical
  |
  +-- Continuous: X takes values in an interval (uncountably many)
      |
      +-- Characterised by PDF: f(x) with P(a <= X <= b) = integral of f over [a, b]
      +-- CDF: F(x) = integral of f(t) over (-inf, x]   (smooth, differentiable)
      +-- Examples: Uniform, Gaussian, Exponential, Beta, Gamma

  Note: "Mixed" random variables exist (CDF is a mix of jumps and
  smooth parts) but are rare in practice.

========================================================================

Key distinction - probability at a point:

  • Discrete: P(X = x) can be positive (it is the PMF value)
  • Continuous: P(X = x) = 0 for every specific x (the probability of hitting any exact value is zero). Probabilities come from integrating over intervals.

For AI: The distinction is crucial for loss functions. Cross-entropy loss for classification uses a discrete distribution (Categorical) over a finite vocabulary or class set. For regression, the loss is often the negative log-likelihood of a Gaussian (continuous distribution). Diffusion models use a mixture: discrete time steps with continuous noise.

5.3 The Cumulative Distribution Function

The CDF provides a unified description of both discrete and continuous random variables.

Definition 5.2 (Cumulative distribution function). The CDF of a random variable X is:

F_X(x) = P(X \leq x), \quad x \in \mathbb{R}

Examples:

  • Fair die: F(x) = 0 for x < 1; F(x) = k/6 for k \leq x < k+1, k = 1, \ldots, 6; F(x) = 1 for x \geq 6. A staircase with jumps of 1/6 at each integer.
  • Standard normal: F(x) = \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-t^2/2}\,dt. A smooth S-shaped curve.
  • Bernoulli(p): F(x) = 0 for x < 0; F(x) = 1 - p for 0 \leq x < 1; F(x) = 1 for x \geq 1.

Computing interval probabilities from the CDF:

P(a < X \leq b) = F(b) - F(a)

P(X > x) = 1 - F(x)

5.4 CDF Properties and the Fundamental Theorem

Theorem 5.1 (CDF properties). Any CDF satisfies:

  1. Monotone non-decreasing: x_1 \leq x_2 \implies F(x_1) \leq F(x_2)
  2. Right-continuous: \lim_{y \downarrow x} F(y) = F(x) for all x
  3. Limits: \lim_{x \to -\infty} F(x) = 0 and \lim_{x \to +\infty} F(x) = 1
  4. Jump characterisation: P(X = x) = F(x) - F(x^-), where F(x^-) = \lim_{y \uparrow x} F(y). For continuous X this is 0 everywhere; for discrete X it equals the PMF at jump points.

These four properties completely characterise CDFs - any function satisfying them is the CDF of some random variable. This is what makes the CDF both universal and tractable.

Fundamental theorem (continuous case): For continuous X with CDF F and PDF f:

f(x) = \frac{d}{dx} F(x), \quad F(x) = \int_{-\infty}^x f(t)\, dt

The PDF is the derivative of the CDF; the CDF is the integral of the PDF.
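Both directions of the fundamental theorem can be checked numerically for the standard normal, whose CDF is expressible via the error function as \Phi(x) = (1 + \operatorname{erf}(x/\sqrt{2}))/2:

```python
from math import erf, exp, pi, sqrt

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Interval probability from the CDF: P(a < X <= b) = F(b) - F(a)
p_interval = Phi(1.0) - Phi(-1.0)       # the familiar "68%" of mass in [-1, 1]
print(round(p_interval, 4))

# PDF = derivative of CDF, checked by a central finite difference at 0
h = 1e-6
pdf_numeric = (Phi(h) - Phi(-h)) / (2 * h)
pdf_exact = exp(0.0) / sqrt(2 * pi)     # f(0) = 1/sqrt(2*pi)
assert abs(pdf_numeric - pdf_exact) < 1e-6
```

This is also how interval probabilities are computed in practice: libraries expose the CDF, and P(a < X ≤ b) is always a difference of two CDF evaluations.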


6. Discrete Random Variables

6.1 Probability Mass Function

Definition 6.1 (PMF). For a discrete random variable X with countable range \mathcal{X} = \{x_1, x_2, \ldots\}, the probability mass function is:

p_X(x) = P(X = x), \quad x \in \mathcal{X}

Properties:

  1. p_X(x) \geq 0 for all x (Axiom 1)
  2. \sum_{x \in \mathcal{X}} p_X(x) = 1 (Axiom 2)
  3. p_X(x) = 0 for x \notin \mathcal{X}

CDF from PMF: F_X(x) = \sum_{x_i \leq x} p_X(x_i) - sum all PMF values at or below x.

Non-examples of valid PMFs:

  • p(0) = 0.6, p(1) = 0.6: sum = 1.2 > 1
  • p(0) = -0.2, p(1) = 1.2: negative value
  • p(k) = 1/k for k = 1, 2, \ldots: \sum_k 1/k = \infty (the harmonic series diverges)

For AI: Every classifier's output layer implicitly defines a PMF over classes. The softmax function \text{softmax}(\mathbf{z})_k = e^{z_k} / \sum_j e^{z_j} produces a vector of non-negative values summing to 1 - a valid PMF over the K classes. Training minimises the cross-entropy between this PMF and the one-hot PMF of the true label.
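A minimal softmax sketch, checking that its output satisfies the PMF properties of Definition 6.1 (the max-subtraction is the standard numerical-stability trick; it leaves the result unchanged because it cancels in the ratio):

```python
import math

def softmax(z):
    """Numerically stable softmax: subtract max(z) before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -1.0]   # illustrative raw scores
pmf = softmax(logits)

# PMF properties from Definition 6.1
assert all(p >= 0 for p in pmf)
assert abs(sum(pmf) - 1.0) < 1e-12

# Cross-entropy against a one-hot label with true class 0: -log p_0
loss = -math.log(pmf[0])
print(pmf, loss)
```

With a one-hot target, cross-entropy reduces to the negative log of the probability assigned to the true class, tying the loss directly back to the PMF.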

6.2 Bernoulli Distribution - The Canonical Example

The Bernoulli distribution is the simplest non-trivial probability distribution: a single binary trial.

Definition 6.2 (Bernoulli distribution). X \sim \text{Bernoulli}(p) for p \in [0, 1] if:

P(X = 1) = p, \quad P(X = 0) = 1 - p

or equivalently: p_X(x) = p^x (1-p)^{1-x} for x \in \{0, 1\}.

Properties:

  • Mean: \mathbb{E}[X] = 1 \cdot p + 0 \cdot (1-p) = p
  • Variance: \operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = p - p^2 = p(1-p)
  • Maximum variance at p = 1/2: \operatorname{Var}(X) = 1/4
  • CDF: F(x) = 0 for x < 0; F(x) = 1 - p for 0 \leq x < 1; F(x) = 1 for x \geq 1

ML applications of the Bernoulli distribution:

  • Binary classification: YBernoulli(σ(wx))Y \sim \text{Bernoulli}(\sigma(\mathbf{w}^\top \mathbf{x})) where σ\sigma is sigmoid
  • Dropout: each neuron is active with probability 1pdrop1 - p_{\text{drop}}; the mask is Bernoulli(1pdrop)d\text{Bernoulli}(1-p_{\text{drop}})^d
  • Coin-flip initialisation tests: checking whether random initialisation is breaking symmetry
  • Binary attention masks: a position is masked with Bernoulli(pmask)\text{Bernoulli}(p_{\text{mask}}) in masked language modelling (BERT, RoBERTa)
  • Bernoulli reward signals: in RL, some environments give binary rewards (00 or 11)

Bernoulli and entropy. The binary entropy of a Bernoulli(pp) variable is:

H(p)=plogp(1p)log(1p)H(p) = -p \log p - (1-p) \log(1-p)

This is maximised at p=1/2p = 1/2 (maximum uncertainty) and equals 0 at p{0,1}p \in \{0, 1\} (certainty). Binary cross-entropy loss is the cross-entropy between the label and the prediction: for a true label y{0,1}y \in \{0,1\} and predicted probability p^\hat{p}, L=ylogp^(1y)log(1p^)\mathcal{L} = -y \log \hat{p} - (1-y)\log(1-\hat{p}).
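The Bernoulli mean, variance, and binary entropy can all be verified by simulation; a minimal sketch in stdlib Python (the seed, sample size, and helper name are arbitrary choices):

```python
import math
import random

def binary_entropy(p):
    """H(p) = -p log p - (1-p) log(1-p) in nats, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

rng = random.Random(0)
p = 0.3
samples = [1 if rng.random() < p else 0 for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

assert abs(mean - p) < 0.01           # E[X] = p
assert abs(var - p * (1 - p)) < 0.01  # Var(X) = p(1-p)
assert abs(binary_entropy(0.5) - math.log(2)) < 1e-12  # maximum at p = 1/2
```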

6.3 Geometric Distribution

The Geometric distribution models waiting times: how many Bernoulli trials until the first success?

Definition 6.3 (Geometric distribution). XGeometric(p)X \sim \text{Geometric}(p) for p(0,1]p \in (0,1] if:

P(X=k)=(1p)k1p,k=1,2,3,P(X = k) = (1-p)^{k-1} p, \quad k = 1, 2, 3, \ldots

(Here XX is the trial number of the first success; an alternative convention has XX = number of failures before first success, giving PMF (1p)kp(1-p)^k p for k=0,1,2,k = 0, 1, 2, \ldots)

Properties:

  • Mean: E[X]=1/p\mathbb{E}[X] = 1/p
  • Variance: Var(X)=(1p)/p2\operatorname{Var}(X) = (1-p)/p^2
  • Memoryless property: P(X>m+nX>m)=P(X>n)P(X > m+n \mid X > m) = P(X > n) - the past gives no information about the remaining wait time. The Geometric is the only discrete memoryless distribution.

Geometric series check (normalisation):

k=1(1p)k1p=pk=0(1p)k=p11(1p)=p1p=1\sum_{k=1}^\infty (1-p)^{k-1}p = p \sum_{k=0}^\infty (1-p)^k = p \cdot \frac{1}{1-(1-p)} = p \cdot \frac{1}{p} = 1 \checkmark

ML applications:

  • Sentence length models: the length of a generated sentence can be modelled geometrically (each word has probability pp of being the last)
  • Number of gradient steps to convergence: in simplified convergence analyses, the number of steps until L<ε\|\nabla\mathcal{L}\| < \varepsilon can have an approximately geometric distribution under certain conditions
  • Retry logic: in RL exploration, the number of random actions before discovering a reward follows a geometric-like distribution in uniform random environments
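A simulation makes the waiting-time interpretation concrete; the sketch below (stdlib Python, illustrative names) draws Bernoulli trials until the first success and checks the results against the formulas above:

```python
import random

def geometric_sample(p, rng):
    """Trial number of the first success in repeated Bernoulli(p) trials."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k

rng = random.Random(42)
p = 0.2
samples = [geometric_sample(p, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
frac_3 = sum(1 for k in samples if k == 3) / len(samples)

assert abs(mean - 1 / p) < 0.1                 # E[X] = 1/p = 5
assert abs(frac_3 - (1 - p) ** 2 * p) < 0.01   # P(X = 3) = 0.8^2 * 0.2 = 0.128
```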

6.4 Preview: Binomial, Categorical, Poisson

The following distributions have their full treatment in Section02-Common-Distributions. They are introduced here briefly for orientation:

Preview: Binomial Distribution Binomial(n,p)\text{Binomial}(n, p) counts the number of successes in nn independent Bernoulli(pp) trials: P(X=k)=(nk)pk(1p)nkP(X = k) = \binom{n}{k}p^k(1-p)^{n-k} for k=0,1,,nk = 0, 1, \ldots, n. Mean npnp, variance np(1p)np(1-p). As nn \to \infty with np=λnp = \lambda fixed, the Binomial converges to Poisson(λ\lambda).

-> Full treatment: Common Distributions Section2

Preview: Categorical Distribution Categorical(p)\text{Categorical}(\mathbf{p}) with pΔK1\mathbf{p} \in \Delta^{K-1} (the probability simplex) models selection of one of KK categories: P(X=k)=pkP(X = k) = p_k. This is the distribution underlying softmax classifiers and language model next-token prediction.

-> Full treatment: Common Distributions Section4

Preview: Poisson Distribution Poisson(λ)\text{Poisson}(\lambda) models the number of events in a fixed interval when events occur independently at constant rate λ\lambda: P(X=k)=eλλk/k!P(X = k) = e^{-\lambda}\lambda^k/k!. Mean = Variance = λ\lambda. Used in modelling word frequencies (Zipf/Poisson approximation), network packet arrivals, and mutation rates.

-> Full treatment: Common Distributions Section3


7. Continuous Random Variables

7.1 Probability Density Function

For continuous random variables, individual points have probability zero. Probability is distributed over intervals via the PDF.

Definition 7.1 (Probability density function). A non-negative function fX:R[0,)f_X : \mathbb{R} \to [0, \infty) is a PDF for XX if:

  1. fX(x)0f_X(x) \geq 0 for all xx
  2. fX(x)dx=1\int_{-\infty}^{\infty} f_X(x)\, dx = 1
  3. For any aba \leq b: P(aXb)=abfX(x)dxP(a \leq X \leq b) = \int_a^b f_X(x)\, dx

Critical distinction: The PDF fX(x)f_X(x) is NOT a probability. It is a probability density - probability per unit length. In particular, fX(x)f_X(x) can exceed 1 (though the integral must equal 1).

Example: fX(x)=2xf_X(x) = 2x for x[0,1]x \in [0,1] and 00 elsewhere. This is a valid PDF: non-negative, integrates to 012xdx=[x2]01=1\int_0^1 2x\,dx = [x^2]_0^1 = 1. Note fX(0.9)=1.8>1f_X(0.9) = 1.8 > 1, but this is fine - it is a density.

Probability at a point: P(X=x)=xxf(t)dt=0P(X = x) = \int_x^x f(t)\,dt = 0. This is why conditioning on a continuous event requires care and is handled via conditional densities (Section03).

Non-examples of valid PDFs:

  • f(x)=xf(x) = x on [0,1][0,1]: 01xdx=1/21\int_0^1 x\,dx = 1/2 \neq 1 (not normalised)
  • f(x)=exf(x) = -e^{-x} for x0x \geq 0: negative values
  • f(x)=1f(x) = 1 for x[0,2]x \in [0,2]: 021dx=21\int_0^2 1\,dx = 2 \neq 1
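These validity checks can be automated with simple numerical integration; a sketch (stdlib Python, midpoint rule, the `integrate` helper is ours):

```python
def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Valid PDF: f(x) = 2x on [0, 1] is non-negative and integrates to 1,
# even though the density exceeds 1 near x = 1.
assert abs(integrate(lambda x: 2 * x, 0, 1) - 1.0) < 1e-6

# Non-examples: not normalised.
assert abs(integrate(lambda x: x, 0, 1) - 0.5) < 1e-6    # integrates to 1/2
assert abs(integrate(lambda x: 1.0, 0, 2) - 2.0) < 1e-6  # integrates to 2
```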

7.2 From CDF to PDF: the Derivative Relationship

For a continuous random variable with PDF fXf_X and CDF FXF_X:

FX(x)=xfX(t)dtfX(x)=ddxFX(x)F_X(x) = \int_{-\infty}^x f_X(t)\, dt \quad \Longleftrightarrow \quad f_X(x) = \frac{d}{dx} F_X(x)

This is the Fundamental Theorem of Calculus applied to probability. The CDF accumulates density; the PDF is the rate of accumulation.

Using the CDF to compute probabilities:

P(a \leq X \leq b) = F_X(b) - F_X(a), \qquad P(X > x) = 1 - F_X(x), \qquad P(X \leq x) = F_X(x)

Since continuous XX has P(X=a)=0P(X = a) = 0: the endpoints don't matter, P(a<Xb)=P(aXb)=P(a<X<b)P(a < X \leq b) = P(a \leq X \leq b) = P(a < X < b).

7.3 Uniform Distribution

The Uniform distribution is the maximum-entropy distribution over a bounded interval: all values are equally likely.

Definition 7.2 (Uniform distribution). XUniform(a,b)X \sim \text{Uniform}(a, b) for a<ba < b if:

fX(x)={1baaxb0otherwisef_X(x) = \begin{cases} \frac{1}{b-a} & a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}

CDF: FX(x)=0F_X(x) = 0 for x<ax < a; FX(x)=(xa)/(ba)F_X(x) = (x-a)/(b-a) for axba \leq x \leq b; FX(x)=1F_X(x) = 1 for x>bx > b.

Properties:

  • Mean: E[X]=(a+b)/2\mathbb{E}[X] = (a+b)/2 (the midpoint)
  • Variance: Var(X)=(ba)2/12\operatorname{Var}(X) = (b-a)^2/12
  • Symmetry: ff is symmetric about (a+b)/2(a+b)/2

Derivation of variance:

\mathbb{E}[X^2] = \int_a^b x^2 \cdot \frac{1}{b-a}\, dx = \frac{1}{b-a} \cdot \frac{b^3 - a^3}{3} = \frac{a^2 + ab + b^2}{3}

\operatorname{Var}(X) = \frac{a^2+ab+b^2}{3} - \left(\frac{a+b}{2}\right)^2 = \frac{(b-a)^2}{12}

ML applications of Uniform:

  • Weight initialisation (Xavier/Glorot): weights drawn from \text{Uniform}\left(-\sqrt{6/(n_{\text{in}}+n_{\text{out}})},\; \sqrt{6/(n_{\text{in}}+n_{\text{out}})}\right) to keep activation variance stable across layers
  • Batch selection: selecting which training examples to include in a mini-batch (sampling uniformly without replacement)
  • Data augmentation: random cropping positions, rotation angles, colour jitter magnitudes drawn from uniform distributions
  • Hyperparameter search: random search over hyperparameter ranges samples uniformly from the search space
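A Monte Carlo check of the Uniform mean and variance formulas (stdlib Python; the seed, interval, and tolerances are arbitrary choices):

```python
import random

rng = random.Random(0)
a, b = 2.0, 5.0
n = 200_000
samples = [a + (b - a) * rng.random() for _ in range(n)]  # Uniform(a, b)
mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / n

assert abs(mean - (a + b) / 2) < 0.02       # E[X] = midpoint = 3.5
assert abs(var - (b - a) ** 2 / 12) < 0.02  # Var(X) = 9/12 = 0.75
```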

7.4 Gaussian Distribution - Introduction

The Gaussian (Normal) distribution is the most important continuous distribution in all of probability and statistics, due to the Central Limit Theorem (Section06-Stochastic-Processes) and its mathematical tractability.

Definition 7.3 (Gaussian distribution). XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2) with mean μR\mu \in \mathbb{R} and variance σ2>0\sigma^2 > 0 if:

fX(x)=12πσ2exp ⁣((xμ)22σ2),xRf_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R}

The standard normal ZN(0,1)Z \sim \mathcal{N}(0,1) has μ=0\mu = 0, σ2=1\sigma^2 = 1:

ϕ(z)=12πez2/2\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}

Its CDF is Φ(z)=zϕ(t)dt\Phi(z) = \int_{-\infty}^z \phi(t)\, dt, which has no closed form but is tabulated and implemented in all numerical libraries.

Properties:

  • Mean: E[X]=μ\mathbb{E}[X] = \mu
  • Variance: Var(X)=σ2\operatorname{Var}(X) = \sigma^2
  • Symmetry: f(μx)=f(μ+x)f(\mu - x) = f(\mu + x) - symmetric about μ\mu
  • 68-95-99.7 rule:
    • P(μσXμ+σ)68.27%P(\mu - \sigma \leq X \leq \mu + \sigma) \approx 68.27\%
    • P(μ2σXμ+2σ)95.45%P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 95.45\%
    • P(μ3σXμ+3σ)99.73%P(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \approx 99.73\%
  • Standardisation: If XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2), then Z=(Xμ)/σN(0,1)Z = (X-\mu)/\sigma \sim \mathcal{N}(0,1)
  • Affine stability: If XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2), then aX+bN(aμ+b,a2σ2)aX + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2)

Why the Gaussian is ubiquitous:

  1. Central Limit Theorem (preview): The sum of many i.i.d. random variables with finite variance converges to a Gaussian - regardless of the original distribution. See Section06.
  2. Maximum entropy: Among all continuous distributions with fixed mean μ\mu and variance σ2\sigma^2, the Gaussian maximises entropy (is the "least informative" given these constraints)
  3. Conjugacy: The Gaussian is conjugate to itself under Bayesian updating with Gaussian likelihood - the posterior is also Gaussian

ML applications of the Gaussian:

  • Weight initialisation (He/Kaiming): weights drawn from N(0,2/nin)\mathcal{N}(0, 2/n_{\text{in}}) for ReLU networks
  • Noise models: residuals yfθ(x)N(0,σ2)y - f_\theta(\mathbf{x}) \sim \mathcal{N}(0, \sigma^2) gives mean squared error loss
  • VAE latent space: zN(μϕ(x),σϕ2(x))z \sim \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x)) with Gaussian prior N(0,I)\mathcal{N}(0, I)
  • Diffusion models forward process: xt=αˉtx0+1αˉtεx_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t}\varepsilon, εN(0,I)\varepsilon \sim \mathcal{N}(0, I)
  • Score functions: the score xlogp(x)\nabla_x \log p(x) for a Gaussian is linear: xlogN(x;μ,σ2)=(xμ)/σ2\nabla_x \log \mathcal{N}(x;\mu,\sigma^2) = -(x-\mu)/\sigma^2
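The 68-95-99.7 rule follows directly from the standard normal CDF, which stdlib Python exposes through math.erf; a quick check (the helper names are ours):

```python
import math

def phi_cdf(z):
    """Standard normal CDF: Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_within_k_sigma(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) = Phi(k) - Phi(-k), for any Gaussian."""
    return phi_cdf(k) - phi_cdf(-k)

assert abs(prob_within_k_sigma(1) - 0.6827) < 5e-4
assert abs(prob_within_k_sigma(2) - 0.9545) < 5e-4
assert abs(prob_within_k_sigma(3) - 0.9973) < 5e-4
```

By standardisation, the same three probabilities hold for every \mu and \sigma, not just the standard normal.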

Preview: Exponential, Beta, Gamma Distributions The Exponential distribution models waiting times between Poisson events; the Beta models probabilities on [0,1][0,1] (conjugate prior for Bernoulli); the Gamma generalises both. Full treatment: Common Distributions Section5, Section6, Section7.


8. Expectation and Variance - Foundations

8.1 Expected Value: Definition and Linearity

The expected value (or mean) of a random variable is its probability-weighted average - the "centre of mass" of its distribution.

Definition 8.1 (Expected value - discrete). For a discrete XX with PMF pXp_X:

E[X]=xXxpX(x)\mathbb{E}[X] = \sum_{x \in \mathcal{X}} x \cdot p_X(x)

(provided the sum converges absolutely: xpX(x)<\sum |x| p_X(x) < \infty).

Definition 8.2 (Expected value - continuous). For a continuous XX with PDF fXf_X:

E[X]=xfX(x)dx\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f_X(x)\, dx

(provided the integral converges absolutely).

Examples:

  • Bernoulli(p)(p): E[X]=1p+0(1p)=p\mathbb{E}[X] = 1 \cdot p + 0 \cdot (1-p) = p
  • Uniform(0,1)(0,1): E[X]=01x1dx=1/2\mathbb{E}[X] = \int_0^1 x \cdot 1\, dx = 1/2
  • N(μ,σ2)\mathcal{N}(\mu, \sigma^2): E[X]=μ\mathbb{E}[X] = \mu (by symmetry and definition)
  • Fair die: E[X]=(1+2+3+4+5+6)/6=3.5\mathbb{E}[X] = (1+2+3+4+5+6)/6 = 3.5

Non-example (expectation undefined): The Cauchy distribution with PDF f(x) = 1/(\pi(1+x^2)) has \int_{-\infty}^\infty |x|/(\pi(1+x^2))\,dx = \infty, so \mathbb{E}[X] does not exist. This is why heavy-tailed distributions require care.

Theorem 8.1 (Linearity of expectation). For any random variables XX, YY and constants aa, bb:

E[aX+bY]=aE[X]+bE[Y]\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]

This holds whether or not XX and YY are independent - linearity of expectation is unconditional.

Proof (discrete): E[aX+bY]=x,y(ax+by)P(X=x,Y=y)=axxyP(X=x,Y=y)+byyxP(X=x,Y=y)=aE[X]+bE[Y]\mathbb{E}[aX + bY] = \sum_{x,y}(ax+by)P(X=x, Y=y) = a\sum_x x \sum_y P(X=x,Y=y) + b\sum_y y \sum_x P(X=x,Y=y) = a\mathbb{E}[X] + b\mathbb{E}[Y]. \square

For AI: Linearity of expectation is why mini-batch gradient descent works. The gradient of the average loss over a batch equals the average of individual gradients:

θExD ⁣[L(x;θ)]=ExD ⁣[θL(x;θ)]\nabla_\theta \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\!\left[\mathcal{L}(\mathbf{x};\theta)\right] = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\!\left[\nabla_\theta \mathcal{L}(\mathbf{x};\theta)\right]

This follows from linearity of both expectation and differentiation.
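The batch-gradient identity is just linearity in miniature. A toy check with the squared-error loss L_i(\theta) = (\theta - x_i)^2 (the data and \theta value here are illustrative):

```python
# Gradient of the mean loss vs mean of the per-example gradients.
xs = [1.0, 2.0, 4.0, 7.0]
theta = 0.5

grad_of_mean = 2 * (theta - sum(xs) / len(xs))              # d/dtheta of (1/n) sum (theta - x_i)^2
mean_of_grads = sum(2 * (theta - x) for x in xs) / len(xs)  # (1/n) sum d/dtheta (theta - x_i)^2

assert abs(grad_of_mean - mean_of_grads) < 1e-12  # equal, by linearity
```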

Expected value of a function: For g:RRg : \mathbb{R} \to \mathbb{R}:

E[g(X)]=xg(x)pX(x)(discrete),E[g(X)]=g(x)fX(x)dx(continuous)\mathbb{E}[g(X)] = \sum_x g(x) p_X(x) \quad (\text{discrete}), \qquad \mathbb{E}[g(X)] = \int g(x) f_X(x)\, dx \quad (\text{continuous})

This is the Law of the Unconscious Statistician (LOTUS) - you can compute E[g(X)]\mathbb{E}[g(X)] directly from the distribution of XX without finding the distribution of g(X)g(X) first. (Full treatment in Section04.)

Warning: E[g(X)]g(E[X])\mathbb{E}[g(X)] \neq g(\mathbb{E}[X]) in general. For example, E[X2](E[X])2\mathbb{E}[X^2] \neq (\mathbb{E}[X])^2 (the gap is the variance). Jensen's inequality gives the direction: if gg is convex, g(E[X])E[g(X)]g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)].

8.2 Variance and Standard Deviation

Variance measures spread - how far a random variable typically deviates from its mean.

Definition 8.3 (Variance). The variance of XX is:

Var(X)=E ⁣[(XE[X])2]\operatorname{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right]

Computational formula:

Var(X)=E[X2](E[X])2\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2

Proof: \mathbb{E}[(X-\mu)^2] = \mathbb{E}[X^2 - 2\mu X + \mu^2] = \mathbb{E}[X^2] - 2\mu \mathbb{E}[X] + \mu^2 = \mathbb{E}[X^2] - 2\mu^2 + \mu^2 = \mathbb{E}[X^2] - \mu^2. \square

Definition 8.4 (Standard deviation). σX=Var(X)\sigma_X = \sqrt{\operatorname{Var}(X)}. This has the same units as XX and E[X]\mathbb{E}[X].

Examples:

  • Bernoulli(p)(p): Var(X)=pp2=p(1p)\operatorname{Var}(X) = p - p^2 = p(1-p). Maximum at p=1/2p=1/2: Var(X)=1/4\operatorname{Var}(X) = 1/4.
  • Uniform(a,b)(a,b): Var(X)=(ba)2/12\operatorname{Var}(X) = (b-a)^2/12
  • N(μ,σ2)\mathcal{N}(\mu,\sigma^2): Var(X)=σ2\operatorname{Var}(X) = \sigma^2 (by definition)
  • Constant cc: Var(c)=0\operatorname{Var}(c) = 0 (no randomness)

Properties of variance:

  1. Var(X)0\operatorname{Var}(X) \geq 0, with equality iff XX is a constant a.s.
  2. Var(aX+b)=a2Var(X)\operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X) (shifting by bb doesn't change spread; scaling by aa scales variance by a2a^2)
  3. Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X,Y)
  4. If XYX \perp Y: Var(X+Y)=Var(X)+Var(Y)\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)
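Property 3 can be seen numerically by building deliberately correlated variables; a sketch (stdlib Python, with the illustrative construction Y = X + independent noise):

```python
import random

rng = random.Random(1)
n = 200_000
xs = [rng.random() for _ in range(n)]  # X ~ Uniform(0, 1)
ys = [x + rng.random() for x in xs]    # Y = X + noise: positively correlated with X

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((t - m) ** 2 for t in v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

sums = [x + y for x, y in zip(xs, ys)]
lhs = var(sums)
rhs = var(xs) + var(ys) + 2 * cov(xs, ys)

assert abs(lhs - rhs) < 1e-8  # Var(X+Y) = Var(X) + Var(Y) + 2 Cov(X,Y)
assert cov(xs, ys) > 0.05     # Cov(X,Y) = Var(X) ~ 1/12 here, so the naive sum is wrong
```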

For AI: Variance appears throughout ML:

  • Bias-variance tradeoff: test error = bias² + variance + irreducible noise. High-variance models overfit.
  • Gradient variance: the variance of the stochastic gradient estimator determines training stability. Variance reduction techniques (control variates, importance sampling) reduce this.
  • Batch normalisation: standardises activations to have mean 0 and variance 1 within each batch, stabilising training.
  • Uncertainty in predictions: the variance of a Gaussian output head models aleatoric uncertainty.

8.3 Key Properties and Common Pitfalls

Expectation is linear; variance is not. For independent X,YX, Y:

  • E[XY]=E[X]E[Y]\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] (independence gives factorisation)
  • Var(X+Y)=Var(X)+Var(Y)\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) (independence cancels covariance)
  • But in general: Var(X+Y)Var(X)+Var(Y)\operatorname{Var}(X+Y) \neq \operatorname{Var}(X) + \operatorname{Var}(Y)

Jensen's inequality: For convex gg:

g(E[X])E[g(X)]g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)]

Examples: (\mathbb{E}[X])^2 \leq \mathbb{E}[X^2] (taking g(x)=x^2); e^{\mathbb{E}[X]} \leq \mathbb{E}[e^X] (taking g(x)=e^x). Jensen's inequality is used to prove Markov's inequality, the evidence lower bound (ELBO) in EM and variational inference, and the non-negativity of KL divergence.

Expectation and independence: E[XY]=E[X]E[Y]\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] holds if XYX \perp Y, but the converse is false: E[XY]=E[X]E[Y]\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] (uncorrelated) does not imply independence.

8.4 Preview: Covariance, LOTUS, MGF

Preview: Covariance and Correlation The covariance Cov(X,Y)=E[(XμX)(YμY)]\operatorname{Cov}(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)] measures linear dependence between two random variables. The covariance matrix Σ\Sigma of a random vector extends this to multiple dimensions. These are the core tools for analysing multivariate distributions.

-> Full treatment: Expectation and Moments Section3

Preview: Law of the Unconscious Statistician (LOTUS) LOTUS allows computing E[g(X)]\mathbb{E}[g(X)] using the distribution of XX directly: no change-of-variables required. It is the foundational tool for deriving moments and the expected loss under a distribution.

-> Full treatment: Expectation and Moments Section2

Preview: Moment Generating Functions The MGF MX(t)=E[etX]M_X(t) = \mathbb{E}[e^{tX}] encodes all moments of XX: E[Xn]=MX(n)(0)\mathbb{E}[X^n] = M_X^{(n)}(0). MGFs are the key tool for proving the Central Limit Theorem and computing tail bounds.

-> Full treatment: Expectation and Moments Section5


9. Transformations of Random Variables

9.1 Functions of a Random Variable

Given a random variable XX and a function g:RRg : \mathbb{R} \to \mathbb{R}, the composition Y=g(X)Y = g(X) is a new random variable. Computing the distribution of YY from the distribution of XX is a central skill.

Discrete case. If XX is discrete with PMF pXp_X and gg is any function, then Y=g(X)Y = g(X) is discrete with:

pY(y)=P(Y=y)=P(g(X)=y)=x:g(x)=ypX(x)p_Y(y) = P(Y = y) = P(g(X) = y) = \sum_{x : g(x) = y} p_X(x)

Example. XUniform{1,2,3,4,5,6}X \sim \text{Uniform}\{1,2,3,4,5,6\} (fair die), Y=X2mod6Y = X^2 \bmod 6. Then pY(1)=P(X21mod6)=P(X{1,5})=1/3p_Y(1) = P(X^2 \equiv 1 \bmod 6) = P(X \in \{1,5\}) = 1/3, etc.

Continuous case - CDF method. For continuous XX and Y=g(X)Y = g(X), compute the CDF of YY:

FY(y)=P(Yy)=P(g(X)y)=P(X{x:g(x)y})F_Y(y) = P(Y \leq y) = P(g(X) \leq y) = P(X \in \{x : g(x) \leq y\})

then differentiate to get the PDF: fY(y)=FY(y)f_Y(y) = F_Y'(y).

9.2 Change-of-Variables Formula

For monotone transformations, the CDF method yields a clean formula.

Theorem 9.1 (Change of variables - monotone, 1-D). Let XX have PDF fXf_X and CDF FXF_X. Let gg be a differentiable, strictly monotone function on the support of XX, with inverse h=g1h = g^{-1}. Then Y=g(X)Y = g(X) has PDF:

fY(y)=fX(h(y))dhdy=fX(h(y))1g(h(y))f_Y(y) = f_X(h(y)) \cdot \left|\frac{dh}{dy}\right| = f_X(h(y)) \cdot \left|\frac{1}{g'(h(y))}\right|

Derivation (increasing gg):

F_Y(y) = P(Y \leq y) = P(g(X) \leq y) = P(X \leq h(y)) = F_X(h(y))

f_Y(y) = \frac{d}{dy}F_Y(y) = f_X(h(y)) \cdot h'(y) = f_X(h(y)) \cdot \frac{1}{g'(h(y))}

For decreasing gg, FY(y)=1FX(h(y))F_Y(y) = 1 - F_X(h(y)), and differentiating gives a negative sign, cancelled by the absolute value in the formula.

Example. XN(0,1)X \sim \mathcal{N}(0,1), Y=eXY = e^X (log-normal distribution). Here g(x)=exg(x) = e^x, h(y)=lnyh(y) = \ln y, h(y)=1/yh'(y) = 1/y. So:

fY(y)=ϕ(lny)1y=12πe(lny)2/21y,y>0f_Y(y) = \phi(\ln y) \cdot \frac{1}{y} = \frac{1}{\sqrt{2\pi}}\, e^{-(\ln y)^2/2} \cdot \frac{1}{y}, \quad y > 0

Example. XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2), Y=(Xμ)/σY = (X - \mu)/\sigma (standardisation). Here g(x)=(xμ)/σg(x) = (x-\mu)/\sigma, h(y)=μ+σyh(y) = \mu + \sigma y, h(y)=σh'(y) = \sigma:

fY(y)=fX(μ+σy)σ=12πσ2ey2/2σ=12πey2/2=ϕ(y)f_Y(y) = f_X(\mu + \sigma y) \cdot \sigma = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-y^2/2} \cdot \sigma = \frac{1}{\sqrt{2\pi}} e^{-y^2/2} = \phi(y)

So Y \sim \mathcal{N}(0,1). \checkmark
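The log-normal example can be sanity-checked by simulation: the derived CDF F_Y(y) = \Phi(\ln y) should match the empirical fraction of transformed samples. A sketch (stdlib Python; the helper names, seed, and test point are ours):

```python
import math
import random

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

rng = random.Random(7)
n = 200_000
ys = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(n)]  # Y = e^X, X ~ N(0,1)

y0 = 2.0
empirical = sum(1 for y in ys if y <= y0) / n
theoretical = phi(math.log(y0))  # F_Y(y) = P(e^X <= y) = P(X <= ln y) = Phi(ln y)

assert abs(empirical - theoretical) < 0.01
```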

For AI: The change-of-variables formula is the mathematical foundation of normalising flows - generative models that learn a bijection g:RdRdg : \mathbb{R}^d \to \mathbb{R}^d transforming a simple base distribution (e.g., N(0,I)\mathcal{N}(0,I)) into a complex target distribution. The log-density of the transformed variable is:

logpY(y)=logpX(g1(y))+logdetJg1(y)\log p_Y(y) = \log p_X(g^{-1}(y)) + \log |\det J_{g^{-1}}(y)|

where Jg1J_{g^{-1}} is the Jacobian of the inverse transformation.

9.3 Standardisation and the Z-Score

Definition 9.1 (Z-score / standardisation). For XX with mean μ\mu and standard deviation σ>0\sigma > 0:

Z=XμσZ = \frac{X - \mu}{\sigma}

Then E[Z]=0\mathbb{E}[Z] = 0 and Var(Z)=1\operatorname{Var}(Z) = 1 regardless of the distribution of XX.

Proof: E[Z]=E[(Xμ)/σ]=(E[X]μ)/σ=0\mathbb{E}[Z] = \mathbb{E}[(X-\mu)/\sigma] = (\mathbb{E}[X] - \mu)/\sigma = 0. Var(Z)=Var(X)/σ2=σ2/σ2=1\operatorname{Var}(Z) = \operatorname{Var}(X)/\sigma^2 = \sigma^2/\sigma^2 = 1. \square

Applications in ML:

  • Feature normalisation (preprocessing): standardise each feature so that the gradient has comparable scale across dimensions, improving optimiser convergence
  • Batch normalisation (BN): standardises activations within each mini-batch, then applies learned scale and shift parameters (γ,β)(\gamma, \beta): x^=(xμ^)/σ^\hat{x} = (x - \hat{\mu})/\hat{\sigma}, output =γx^+β= \gamma\hat{x} + \beta
  • Layer normalisation (LN): same idea but normalising across features rather than the batch dimension - standard in transformer models
  • RMS Normalisation (RMSNorm): a simplified variant using only the RMS (root mean square), dropping the mean subtraction - used in LLaMA, Gemma, and Mistral models: x^i=xi/RMS(x)\hat{x}_i = x_i / \text{RMS}(\mathbf{x})
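All four techniques above share one core operation. A minimal z-score sketch in stdlib Python (the function name is ours; the eps guard against a zero denominator mirrors what LayerNorm-style implementations do):

```python
import math

def standardise(xs, eps=1e-5):
    """Z-score a vector: subtract the mean, divide by the standard deviation."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return [(x - m) / math.sqrt(v + eps) for x in xs]

z = standardise([2.0, 4.0, 6.0, 8.0])
zm = sum(z) / len(z)
zv = sum((t - zm) ** 2 for t in z) / len(z)

assert abs(zm) < 1e-12       # mean 0
assert abs(zv - 1.0) < 1e-3  # variance 1 (up to eps)
```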

10. Applications in Machine Learning

10.1 Cross-Entropy Loss as Negative Log-Likelihood

Setup. Suppose we observe data D={(x(1),y(1)),,(x(n),y(n))}\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\} drawn i.i.d. from the true distribution pdata(yx)p_{\text{data}}(y \mid x). A model fθf_\theta with parameters θ\theta defines a predicted distribution pθ(yx)p_\theta(y \mid x).

Maximum likelihood estimation (MLE). We want to choose θ\theta to maximise the probability assigned to the observed data:

θ=argmaxθi=1npθ(y(i)x(i))[likelihood]\theta^* = \arg\max_\theta \prod_{i=1}^n p_\theta(y^{(i)} \mid x^{(i)}) \quad [\text{likelihood}]

Taking the logarithm (monotone transformation, same argmax):

θ=argmaxθi=1nlogpθ(y(i)x(i))=argminθ1ni=1nlogpθ(y(i)x(i))negative log-likelihood loss\theta^* = \arg\max_\theta \sum_{i=1}^n \log p_\theta(y^{(i)} \mid x^{(i)}) = \arg\min_\theta \underbrace{-\frac{1}{n}\sum_{i=1}^n \log p_\theta(y^{(i)} \mid x^{(i)})}_{\text{negative log-likelihood loss}}

For classification. With KK classes, pθ(yx)=Categorical(softmax(fθ(x)))yp_\theta(y \mid x) = \text{Categorical}(\text{softmax}(f_\theta(x)))_y:

logpθ(yx)=logsoftmax(fθ(x))y=logezyjezj=logjezjzy-\log p_\theta(y \mid x) = -\log \text{softmax}(f_\theta(x))_y = -\log \frac{e^{z_y}}{\sum_j e^{z_j}} = \log\sum_j e^{z_j} - z_y

where z=fθ(x)z = f_\theta(x) are the logits. For a one-hot label y\mathbf{y}, this is exactly the cross-entropy loss:

L(θ;x,y)=k=1Kyklogpθ(kx)=H(pdata,pθ)\mathcal{L}(\theta; x, y) = -\sum_{k=1}^K y_k \log p_\theta(k \mid x) = H(p_{\text{data}}, p_\theta)

For regression. If pθ(yx)=N(fθ(x),σ2)p_\theta(y \mid x) = \mathcal{N}(f_\theta(x), \sigma^2):

logpθ(yx)=12σ2(yfθ(x))2+const-\log p_\theta(y \mid x) = \frac{1}{2\sigma^2}(y - f_\theta(x))^2 + \text{const}

Minimising negative log-likelihood under a Gaussian noise model is equivalent to minimising mean squared error.

For language modelling. Each token has distribution pθ(xtx<t)=Categorical(softmax(zt))p_\theta(x_t \mid x_{<t}) = \text{Categorical}(\text{softmax}(z_t)). The NLL loss over a sequence of TT tokens is:

L=1Tt=1Tlogpθ(xtx<t)\mathcal{L} = -\frac{1}{T}\sum_{t=1}^T \log p_\theta(x_t \mid x_{<t})

This is the standard pretraining loss for GPT-family models. The perplexity is exp(L)\exp(\mathcal{L}) - a lower perplexity means the model assigns higher probability to the actual next token.
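The identity -\log \text{softmax}(z)_y = \log\sum_j e^{z_j} - z_y is how frameworks actually compute this loss. A stdlib-Python sketch with the usual max-shift for stability (the function names are ours):

```python
import math

def cross_entropy_from_logits(z, y):
    """-log softmax(z)[y] = logsumexp(z) - z[y], computed stably via max-shift."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return lse - z[y]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [t / s for t in e]

z = [2.0, -1.0, 0.5]
direct = -math.log(softmax(z)[0])  # naive route via explicit probabilities
assert abs(cross_entropy_from_logits(z, 0) - direct) < 1e-12
```

Working in logit space avoids forming tiny probabilities that would underflow to 0 before the log.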

10.2 Bayesian Inference: Prior, Likelihood, Posterior

Bayesian inference applies Bayes' theorem to update beliefs about parameters θ\theta after observing data D\mathcal{D}:

p(θD)posteriorp(Dθ)likelihoodp(θ)prior\underbrace{p(\theta \mid \mathcal{D})}_{\text{posterior}} \propto \underbrace{p(\mathcal{D} \mid \theta)}_{\text{likelihood}} \cdot \underbrace{p(\theta)}_{\text{prior}}

The normalising constant p(D)=p(Dθ)p(θ)dθp(\mathcal{D}) = \int p(\mathcal{D} \mid \theta) p(\theta) d\theta is the evidence or marginal likelihood.

MAP estimation. The Maximum A Posteriori (MAP) estimate is the mode of the posterior:

θMAP=argmaxθlogp(θD)=argmaxθ[logp(Dθ)+logp(θ)]\theta^*_{\text{MAP}} = \arg\max_\theta \log p(\theta \mid \mathcal{D}) = \arg\max_\theta \left[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\right]

With a Gaussian prior p(θ)=N(0,λ1I)p(\theta) = \mathcal{N}(0, \lambda^{-1}I), the logp(θ)\log p(\theta) term becomes λθ2/2-\lambda\|\theta\|^2/2 - i.e., L2 regularisation (weight decay). MAP estimation with a Gaussian prior is exactly regularised MLE.

For AI: Bayesian thinking is increasingly central to modern AI:

  • Model uncertainty (epistemic): the posterior over models captures uncertainty from limited data - Bayesian deep learning approximates this posterior
  • RLHF reward model: human preferences give a likelihood P(human prefers y1 over y2r1,r2)P(\text{human prefers } y_1 \text{ over } y_2 \mid r_1, r_2); updating the reward model is a posterior update
  • In-context learning: some interpretations frame few-shot ICL as implicit Bayesian updating over hypotheses about the task
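The posterior \propto likelihood \times prior recipe can be made concrete with a grid approximation for a coin's bias p; a minimal sketch (the counts, grid resolution, and flat prior are illustrative choices):

```python
# Grid posterior for p = P(heads) after observing 7 heads and 3 tails.
heads, tails = 7, 3
grid = [i / 100 for i in range(1, 100)]                 # p values in (0, 1)
prior = [1.0] * len(grid)                               # flat prior
likelihood = [p ** heads * (1 - p) ** tails for p in grid]
unnorm = [pr * lk for pr, lk in zip(prior, likelihood)]
evidence = sum(unnorm)                                  # normalising constant (up to grid spacing)
posterior = [u / evidence for u in unnorm]

map_p = grid[max(range(len(grid)), key=lambda i: posterior[i])]
assert abs(sum(posterior) - 1.0) < 1e-12
assert abs(map_p - heads / (heads + tails)) < 1e-9  # flat prior: MAP coincides with the MLE, 0.7
```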

10.3 Probabilistic Language Models

A language model defines a probability distribution over sequences of tokens. Given vocabulary V\mathcal{V}, a sequence x=(x1,,xT)\mathbf{x} = (x_1, \ldots, x_T) is assigned probability:

P(x)=P(x1)P(x2x1)P(x3x1,x2)P(xTx1,,xT1)=t=1TP(xtx<t)P(\mathbf{x}) = P(x_1) \cdot P(x_2 \mid x_1) \cdot P(x_3 \mid x_1, x_2) \cdots P(x_T \mid x_1, \ldots, x_{T-1}) = \prod_{t=1}^T P(x_t \mid x_{<t})

by the chain rule of probability (Section3.4). Each conditional P(xtx<t)P(x_t \mid x_{<t}) is a Categorical distribution over V|\mathcal{V}| tokens, produced by the transformer's softmax output.

Sampling strategies as distribution operations:

  • Greedy decoding: always take argmaxP(xtx<t)\arg\max P(x_t \mid x_{<t}) - equivalent to a degenerate point mass
  • Temperature scaling: replace logits zz with z/τz/\tau; τ0\tau \to 0 approaches greedy, τ\tau \to \infty approaches uniform
  • Top-kk sampling: restrict to the kk most probable tokens, renormalise - truncates the distribution
  • Top-pp (nucleus) sampling: restrict to the smallest set of tokens whose cumulative probability p\geq p, renormalise - adapts the cutoff to the entropy of the distribution
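These strategies are small operations on the Categorical distribution; a stdlib-Python sketch of temperature scaling and nucleus truncation (the function names are ours):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [t / s for t in e]

def top_p_filter(probs, p):
    """Keep the smallest high-probability set with cumulative mass >= p; renormalise."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / total
    return out

logits = [3.0, 2.0, 0.5, -1.0]
sharp = softmax([z / 0.5 for z in logits])  # tau < 1 sharpens the distribution
flat = softmax([z / 2.0 for z in logits])   # tau > 1 flattens it
assert sharp[0] > softmax(logits)[0] > flat[0]

nucleus = top_p_filter(softmax(logits), 0.9)
assert nucleus[3] == 0.0               # least-likely token truncated
assert abs(sum(nucleus) - 1.0) < 1e-12 # still a valid PMF after renormalising
```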

Bits-per-character (BPC) and perplexity: The quality of a language model is measured by how much probability it assigns to held-out text. BPC is the average negative \log_2-probability per character; perplexity is \exp(-\frac{1}{T}\sum_t \log P(x_t \mid x_{<t})). A perplexity of kk means the model is "as uncertain as a uniform distribution over kk choices" at each step.

10.4 Sources of Randomness in Neural Network Training

Training a neural network involves multiple distinct sources of randomness, each with a different probability distribution:

| Source | Distribution | When it occurs |
|---|---|---|
| Weight initialisation | \mathcal{N}(0, \sigma^2) or Uniform | Once, at start |
| Mini-batch selection | Uniform (without replacement) | Every step |
| Dropout masks | \text{Bernoulli}(1-p_{\text{drop}})^d | Every forward pass |
| Data augmentation | Uniform (flips, crop positions), Gaussian (colour jitter) | Every sample |
| Label smoothing | Convex combination of Categorical and Uniform | During loss computation |
| Noise injection (DP-SGD) | \mathcal{N}(0, \sigma^2 C^2): Gaussian noise added to gradients clipped to norm C | Every gradient update |
| Sampling from generative models | Model-defined (Categorical, Gaussian, etc.) | Inference |

Each source of randomness introduces variance into the training process. Understanding their distributions is essential for:

  • Reproducibility (fixing random seeds)
  • Differential privacy analysis (noise calibration)
  • Variance reduction in gradient estimation
  • Debugging training instabilities (distinguishing random vs systematic failures)

11. Common Mistakes

#MistakeWhy It's WrongFix
1Confusing P(AB)P(A \mid B) with P(BA)P(B \mid A)These are completely different quantities (prosecutor's fallacy, base rate neglect)Always identify which is the "given" and which is the "unknown" before applying Bayes' theorem
2Treating fX(x)f_X(x) as a probabilityThe PDF is a density - it can exceed 1, and individual point probabilities are always 0 for continuous XXProbabilities come from integrating the PDF: P(aXb)=abf(x)dxP(a \leq X \leq b) = \int_a^b f(x)dx
3Assuming independence from E[XY]=E[X]E[Y]\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]Uncorrelated does not imply independent; independence is strictly strongerUse the definition: XY    P(Xx,Yy)=P(Xx)P(Yy)X \perp Y \iff P(X \leq x, Y \leq y) = P(X \leq x)P(Y \leq y) for all x,yx,y
4. Applying Axiom 3 to non-disjoint events. Why it fails: $P(A \cup B) = P(A) + P(B)$ holds only when $A \cap B = \emptyset$. Fix: use inclusion-exclusion, $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

5. Forgetting to check normalisation when constructing a PMF/PDF. Why it fails: an un-normalised function cannot be a PMF or PDF. Fix: always verify $\sum_x p(x) = 1$ (discrete) or $\int f(x)\,dx = 1$ (continuous).

6. Using $\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$ without checking independence. Why it fails: variance is additive only for uncorrelated (not just any) random variables. Fix: add the covariance term, $\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X,Y)$.

7. Confusing $\mathbb{E}[g(X)]$ with $g(\mathbb{E}[X])$. Why it fails: Jensen's inequality shows these differ whenever $g$ is nonlinear. Fix: compute $\mathbb{E}[g(X)]$ using LOTUS, $\mathbb{E}[g(X)] = \sum_x g(x)p(x)$ or $\int g(x)f(x)\,dx$.

8. Confusing pairwise independence with mutual independence. Why it fails: events can be pairwise independent but not mutually independent (Bernstein's example). Fix: for $n > 2$ events, verify the product rule for all $2^n - n - 1$ subsets of size at least two, or use a structural argument.

9. Assuming $P(A \cap B) = 0$ when $A$ and $B$ are independent. Why it fails: independence means $P(A \cap B) = P(A)P(B)$, which is zero only when $P(A) = 0$ or $P(B) = 0$. Fix: independence $\neq$ mutual exclusivity; disjoint events ($A \cap B = \emptyset$) with positive probability are necessarily dependent.

10. Applying the change-of-variables formula without the absolute value of the Jacobian. Why it fails: for decreasing transformations the Jacobian $dh/dy$ is negative, and omitting the absolute value gives a negative PDF. Fix: always write $f_Y(y) = f_X(h(y)) \cdot \left| \frac{dh}{dy} \right|$.

11. Concluding $X$ and $Y$ are independent from a correlation of zero in non-Gaussian data. Why it fails: zero correlation means uncorrelated, which does NOT imply independence outside the jointly Gaussian case. Fix: independence implies zero correlation, but not vice versa; check independence directly or use copulas.

12. Treating conditional probability as symmetric, $P(A \mid B) = P(B \mid A)$. Why it fails: $P(A \mid B) = P(B \mid A)$ only if $P(A) = P(B)$ (from Bayes' theorem with equal priors). Fix: apply Bayes' theorem to convert between the two, $P(B \mid A) = P(A \mid B)\, P(B) / P(A)$.
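Mistake 6 is easy to check empirically. A minimal NumPy sketch (the correlated pair below is an illustrative choice, not taken from the lesson) shows that the naive sum of variances misses the $2\operatorname{Cov}(X,Y)$ term:

```python
import numpy as np

# Illustrative simulation: Y is built from X, so the two are correlated.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = x + 0.5 * rng.standard_normal(100_000)   # Var(X)=1, Var(Y)=1.25, Cov(X,Y)=1

lhs = np.var(x + y)                                      # true Var(X+Y), about 4.25
naive = np.var(x) + np.var(y)                            # misses covariance, about 2.25
full = np.var(x) + np.var(y) + 2 * np.cov(x, y)[0, 1]    # adds 2*Cov(X,Y)

print(f"Var(X+Y) = {lhs:.2f}, naive sum = {naive:.2f}, with covariance = {full:.2f}")
```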

12. Exercises

Exercise 1 * - Axiom Derivations Using only the three Kolmogorov axioms, prove each of the following: (a) $P(\emptyset) = 0$ (b) $P(A^c) = 1 - P(A)$ (c) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ (inclusion-exclusion for two events) (d) if $A \subseteq B$, then $P(A) \leq P(B)$ (e) $0 \leq P(A) \leq 1$ for all events $A$.

Each proof should cite exactly which axiom(s) and which previously proved results it uses.

Exercise 2 * - Bayes' Theorem Application A medical test for a disease has sensitivity 95% (true positive rate) and specificity 98% (true negative rate). The disease affects 0.5% of the population. (a) Define events $D$ (has disease) and $T$ (tests positive), and state all given probabilities. (b) Compute $P(T)$ using the law of total probability. (c) Compute $P(D \mid T)$ using Bayes' theorem. (d) Explain why the result might surprise a clinician unfamiliar with Bayes' theorem. (e) How would the answer change if the disease prevalence were 10% instead of 0.5%? Compute and interpret.
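A quick way to sanity-check your answers to (b), (c), and (e) is to plug the numbers into Bayes' theorem directly. A minimal sketch (the function name is mine):

```python
# Posterior P(D | T) from prevalence, sensitivity, and specificity.
def posterior(prevalence, sensitivity, specificity):
    # Law of total probability: P(T) = P(T|D)P(D) + P(T|D^c)P(D^c)
    p_t = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    # Bayes' theorem: P(D|T) = P(T|D)P(D) / P(T)
    return sensitivity * prevalence / p_t

print(round(posterior(0.005, 0.95, 0.98), 4))  # 0.5% prevalence -> 0.1927
print(round(posterior(0.10, 0.95, 0.98), 4))   # 10% prevalence  -> 0.8407
```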

Exercise 3 * - PMF Construction and Verification A biased die has $P(X = k) = c \cdot k$ for $k \in \{1, 2, 3, 4, 5, 6\}$. (a) Find the normalisation constant $c$. (b) Verify this is a valid PMF. (c) Compute $\mathbb{E}[X]$. (d) Compute $\operatorname{Var}(X)$. (e) What is $P(X \geq 4)$? Compare to the fair die.
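Exact arithmetic with `fractions.Fraction` makes it easy to verify your answers to parts (a) through (e); a minimal sketch:

```python
from fractions import Fraction

# Biased die with P(X = k) = c * k for k = 1..6, computed exactly.
ks = range(1, 7)
c = Fraction(1, sum(ks))                        # normalisation: c = 1/21
p = {k: c * k for k in ks}
assert sum(p.values()) == 1                     # (b) valid PMF

mean = sum(k * p[k] for k in ks)                # (c) E[X]
var = sum(k**2 * p[k] for k in ks) - mean**2    # (d) Var(X) = E[X^2] - E[X]^2
p_ge_4 = sum(p[k] for k in (4, 5, 6))           # (e) P(X >= 4)

print(mean, var, p_ge_4)                        # 13/3 20/9 5/7
```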

Exercise 4 ** - CDF Analysis Let $X$ have CDF $F(x) = 1 - e^{-\lambda x}$ for $x \geq 0$ and $F(x) = 0$ for $x < 0$ (the Exponential distribution). (a) Show this is a valid CDF (check all four properties from Theorem 5.1). (b) Find the PDF $f(x)$ by differentiating the CDF. (c) Compute $P(X > t)$ for any $t > 0$. (d) Show the memoryless property: $P(X > s+t \mid X > s) = P(X > t)$ for all $s, t > 0$. (e) For $\lambda = 1$: compute $P(1 \leq X \leq 2)$ exactly and numerically.
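Parts (d) and (e) can be checked numerically once you have the survival function; a small sketch for $\lambda = 1$ (the values of $s$ and $t$ are arbitrary):

```python
import math

lam = 1.0
surv = lambda t: math.exp(-lam * t)       # survival function: P(X > t) = e^{-lam*t}

s, t = 0.7, 1.3
cond = surv(s + t) / surv(s)              # P(X > s+t | X > s)
assert abs(cond - surv(t)) < 1e-12        # memoryless: equals P(X > t)

# (e) P(1 <= X <= 2) = F(2) - F(1) = e^{-1} - e^{-2}
print(round(math.exp(-1) - math.exp(-2), 4))   # 0.2325
```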

Exercise 5 ** - Independence and Conditional Independence Roll two fair dice, letting $X$ be the result of die 1 and $Y$ the result of die 2. Define $Z = X + Y$ (the sum). (a) Verify that $X$ and $Y$ are independent. (b) Show that $X$ and $Z$ are not independent (hint: compute $P(X=1, Z=2)$ and compare to $P(X=1)P(Z=2)$). (c) Now condition on the event $\{Z = 7\}$: compute $P(X=1, Y=6 \mid Z=7)$ and compare it to $P(X=1 \mid Z=7)\, P(Y=6 \mid Z=7)$ to decide whether $X$ and $Y$ are conditionally independent given $Z = 7$. (d) Explain in words why knowing $Z$ induces dependence between $X$ and $Y$.
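Because the sample space has only 36 equally likely outcomes, every probability in this exercise can be verified by exact enumeration; a sketch:

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # the 36 equally likely (x, y) pairs
P = lambda pred: Fraction(sum(pred(x, y) for x, y in rolls), 36)

# (b) X and Z = X + Y are not independent:
lhs = P(lambda x, y: x == 1 and x + y == 2)                 # 1/36
rhs = P(lambda x, y: x == 1) * P(lambda x, y: x + y == 2)   # (1/6)(1/36)
assert lhs != rhs

# (c) conditioning on Z = 7:
pz7 = P(lambda x, y: x + y == 7)
joint = P(lambda x, y: x == 1 and y == 6) / pz7             # P(X=1, Y=6 | Z=7)
margs = (P(lambda x, y: x == 1 and x + y == 7) / pz7) * \
        (P(lambda x, y: y == 6 and x + y == 7) / pz7)
print(joint, margs)   # 1/6 vs 1/36: given Z=7, X determines Y
```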

Exercise 6 ** - Change of Variables Let $X \sim \text{Uniform}(0, 1)$. (a) Find the PDF of $Y = -\ln X$. What distribution is this? (b) Find the PDF of $W = X^2$. (c) Find the CDF and PDF of $V = \sqrt{X}$. (d) For an arbitrary continuous CDF $F$ with inverse $F^{-1}$, show that if $U \sim \text{Uniform}(0,1)$ then $F^{-1}(U)$ has CDF $F$ (the inverse-CDF / quantile transform method). (e) How is part (d) used in sampling from arbitrary distributions in ML?
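Part (d) in action: sampling an Exponential(1) variable from uniforms via the quantile transform. A minimal sketch (note that since $1-U$ is also uniform, this matches part (a)'s $Y = -\ln X$):

```python
import math
import random

# Inverse-CDF sampling: for Exponential(1), F^{-1}(u) = -ln(1 - u).
random.seed(0)
samples = [-math.log(1.0 - random.random()) for _ in range(200_000)]

mean = sum(samples) / len(samples)
print(round(mean, 2))   # Exponential(1) has mean 1, so this should be close to 1
```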

Exercise 7 *** - Cross-Entropy Loss Derivation Consider a $K$-class classification problem with model $p_\theta(y \mid \mathbf{x}) = \text{softmax}(f_\theta(\mathbf{x}))_y$. (a) Write the negative log-likelihood for a single example $(\mathbf{x}, y^*)$. (b) Show this equals the cross-entropy $H(p_{\text{true}}, p_\theta)$, where $p_{\text{true}}$ is the one-hot distribution. (c) Show that the gradient of the NLL with respect to the logits $\mathbf{z} = f_\theta(\mathbf{x})$ is $\text{softmax}(\mathbf{z}) - \mathbf{e}_{y^*}$ (softmax output minus one-hot target). (d) Explain why label smoothing replaces the one-hot $p_{\text{true}}$ with $(1-\varepsilon)\mathbf{e}_{y^*} + (\varepsilon/K)\mathbf{1}$, and how this changes the loss. (e) For language modelling with vocabulary size $|\mathcal{V}| = 32000$: if the model assigns probability 0.8 to the correct next token, what is the NLL? The perplexity?
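A finite-difference check of part (c), plus the part (e) arithmetic (the example logits are arbitrary):

```python
import math
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift by max for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])     # arbitrary example logits
y_star = 0

grad = softmax(z)
grad[y_star] -= 1.0                # claimed gradient: softmax(z) - one_hot(y*)

# Finite-difference check on the first logit:
eps = 1e-6
nll = lambda zv: -math.log(softmax(zv)[y_star])
fd = (nll(z + np.array([eps, 0, 0])) - nll(z - np.array([eps, 0, 0]))) / (2 * eps)
assert abs(fd - grad[0]) < 1e-5

# (e) NLL = -ln 0.8; perplexity = exp(NLL) = 1/0.8. Vocabulary size does not enter.
print(round(-math.log(0.8), 4), round(1 / 0.8, 2))   # 0.2231 1.25
```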

Exercise 8 *** - Probabilistic Analysis of Dropout During training, dropout independently sets each of $d$ activations to 0 with probability $p_{\text{drop}}$ and scales the remaining ones by $1/(1-p_{\text{drop}})$. (a) Let $M_i \sim \text{Bernoulli}(1-p_{\text{drop}})$ be the mask for activation $i$. What is $\mathbb{E}[M_i]$? (b) Let $h_i$ be the pre-dropout activation and define $\tilde{h}_i = M_i h_i / (1-p_{\text{drop}})$. Show $\mathbb{E}[\tilde{h}_i] = h_i$ (inverted dropout preserves the expected activation). (c) Compute $\operatorname{Var}(\tilde{h}_i)$ as a function of $h_i$ and $p_{\text{drop}}$. (d) The layer output is $s = \sum_{i=1}^d \tilde{h}_i$. Assuming the masks $M_i$ are independent across neurons, what is $\operatorname{Var}(s)$? (e) Explain why high $p_{\text{drop}}$ increases gradient variance and can destabilise training. At what $p_{\text{drop}}$ is the variance maximised?
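Parts (a) through (c) can be checked by simulation; a sketch for a single activation (the values of $h$ and $p_{\text{drop}}$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
h, p_drop = 2.0, 0.3
keep = 1.0 - p_drop

m = rng.binomial(1, keep, size=1_000_000)   # mask M_i ~ Bernoulli(1 - p_drop)
h_tilde = m * h / keep                      # inverted dropout

print(round(h_tilde.mean(), 2))             # close to h = 2.0: unbiased in expectation
print(round(h_tilde.var(), 2))              # close to h^2 * p_drop / (1 - p_drop)
print(round(h**2 * p_drop / keep, 2))       # closed form from part (c)
```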

