Probability Theory

Exercises Notebook

Converted from exercises.ipynb for web reading.

§6.1 Introduction and Random Variables — Exercises

10 exercises covering probability axioms, conditional probability, Bayes' theorem, independence, CDF/PDF, discrete and continuous distributions, and ML applications.

Format          Description
Problem         Markdown cell with task description
Your Solution   Code cell with scaffolding
Solution        Code cell with reference solution and checks

Difficulty Levels

Level   Exercises   Focus
★       1–3         Probability axioms, conditional probability, CDF
★★      4–6         Bayes' theorem, distributions, expectation
★★★     7–10        ML applications: cross-entropy, KL divergence

Topic Map

Topic                       Exercise
Sample spaces and axioms    1
Conditional probability     2
CDF, PDF, PMF               3
Bayes' theorem              4
Geometric distribution      5
Expectation and variance    6
Cross-entropy loss          7
KL divergence and VAE       8
Mixture total probability   9
Categorical temperature     10

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
from scipy import stats
import math

try:
    import matplotlib.pyplot as plt
    plt.style.use('seaborn-v0_8-whitegrid')
    plt.rcParams['figure.figsize'] = [10, 5]
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)

def header(title):
    print('\n' + '=' * len(title))
    print(title)
    print('=' * len(title))

def check_close(name, got, expected, tol=1e-8):
    ok = np.allclose(got, expected, atol=tol, rtol=tol)
    print(f"{'PASS' if ok else 'FAIL'}{name}")
    if not ok:
        print(f'  expected: {expected}')
        print(f'  got     : {got}')
    return ok

def check_true(name, cond):
    print(f"{'PASS' if cond else 'FAIL'}{name}")
    return cond

print('Setup complete.')

Exercise 1 ★ — Sample Space and Probability Axioms

Consider the experiment of rolling a fair 8-sided die (\Omega = \{1, 2, 3, 4, 5, 6, 7, 8\}).

Part (a): Write a function P(event, omega) that computes P(A) = |A|/|\Omega| for a uniform probability space.

Part (b): Define events:

  • A = {outcome is even}
  • B = {outcome is prime}
  • C = {outcome > 5}

Compute P(A), P(B), P(C), P(A \cup B), P(A \cap C), P(B^c).

Part (c): Verify the inclusion-exclusion principle: P(A \cup B) = P(A) + P(B) - P(A \cap B)

Part (d): Verify the complement rule: P(A^c) = 1 - P(A).
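
As a concrete check for Part (c): on this sample space A = \{2, 4, 6, 8\} and B = \{2, 3, 5, 7\}, so

A \cup B = \{2,3,4,5,6,7,8\}, \quad A \cap B = \{2\}, \quad P(A \cup B) = \tfrac{7}{8} = \tfrac{4}{8} + \tfrac{4}{8} - \tfrac{1}{8}

The single overlapping outcome (2) is exactly what the subtracted term removes.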

Code cell 5

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 6

# Solution
Omega = set(range(1, 9))

def P(event, omega):
    return len(event) / len(omega)

A = {x for x in Omega if x % 2 == 0}        # {2, 4, 6, 8}
B = {x for x in Omega if all(x % d != 0 for d in range(2, x)) and x > 1}  # {2, 3, 5, 7}
C = {x for x in Omega if x > 5}              # {6, 7, 8}

header('Exercise 1: Sample Space and Probability Axioms')
print(f'A (even)  = {sorted(A)}')
print(f'B (prime) = {sorted(B)}')
print(f'C (>5)    = {sorted(C)}')

check_close('P(A)', P(A, Omega), 4/8)
check_close('P(B)', P(B, Omega), 4/8)
check_close('P(C)', P(C, Omega), 3/8)

# Inclusion-exclusion
lhs = P(A | B, Omega)
rhs = P(A, Omega) + P(B, Omega) - P(A & B, Omega)
check_close('Inclusion-exclusion P(A∪B)', lhs, rhs)

# Complement
Ac = Omega - A
check_close('Complement P(Ac) = 1 - P(A)', P(Ac, Omega), 1 - P(A, Omega))

print('\nTakeaway: The axioms ensure consistent probability assignment; inclusion-exclusion'
      ' corrects for double-counting when events overlap.')

Exercise 2 ★ — Conditional Probability and Chain Rule

A bag contains 4 red balls and 6 blue balls. Two balls are drawn without replacement.

Part (a): Compute P(\text{second is red} \mid \text{first is red}).

Part (b): Compute P(\text{both red}) using the chain rule:

P(A \cap B) = P(B \mid A) \cdot P(A)

Part (c): Compute P(\text{at least one red}) using the complement.

Part (d): Simulate 100,000 draws and verify your analytical answers.

Code cell 8

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 9

# Solution
import numpy as np
np.random.seed(42)

# Analytical
# Part (a): After drawing 1 red, bag has 3 red, 6 blue (9 total)
p_second_red_given_first_red = 3/9

# Part (b): Chain rule
p_first_red  = 4/10
p_both_red   = p_first_red * p_second_red_given_first_red  # = 4/10 * 3/9 = 12/90

# Part (c): Complement of (both blue)
p_both_blue  = (6/10) * (5/9)
p_at_least_one_red = 1 - p_both_blue

# Simulation
bag = np.array([1]*4 + [0]*6)  # 1=red, 0=blue
n = 100000
both_red = 0
for _ in range(n):
    draw = np.random.choice(bag, size=2, replace=False)
    if draw.sum() == 2:
        both_red += 1
p_both_red_sim = both_red / n

header('Exercise 2: Conditional Probability')
check_close('P(2nd red | 1st red)', p_second_red_given_first_red, 1/3)
check_close('P(both red)', p_both_red, 12/90)
check_close('P(at least one red)', p_at_least_one_red, 1 - 30/90)
check_true('Simulation P(both red) within 1%', abs(p_both_red_sim - p_both_red) < 0.01)
print(f'  Simulation: {p_both_red_sim:.4f}, Analytical: {p_both_red:.4f}')

print('\nTakeaway: Conditional probability captures how one event changes the'
      ' probability of another — the foundation of Bayesian updating in ML.')
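
The reference solution's draw-by-draw loop is clear but slow. A vectorized sketch of the same Part (d) simulation (using argsort over uniform noise to build one independent shuffle per trial; one idiom among several) gives the same estimate much faster:

import numpy as np

rng = np.random.default_rng(42)
bag = np.array([1]*4 + [0]*6)                      # 1 = red, 0 = blue
n = 100_000
order = rng.random((n, bag.size)).argsort(axis=1)  # n independent permutations of the 10 balls
draws = bag[order[:, :2]]                          # first two balls of each permutation
print(f'P(both red) ≈ {(draws.sum(axis=1) == 2).mean():.4f} (theory 12/90 ≈ {12/90:.4f})')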

Exercise 3 ★ — CDF, PDF, and the Universality of the Uniform

Part (a): Write a function empirical_cdf(samples, x) that estimates F(x) = P(X \leq x) from samples.

Part (b): Draw n = 5000 samples from \mathcal{N}(0,1) and plot the empirical CDF against the true CDF \Phi(x).

Part (c): Demonstrate the Universality of the Uniform: if X \sim \mathcal{N}(0,1), then U = \Phi(X) \sim \text{Uniform}(0,1). Verify this by:

  1. Drawing samples from \mathcal{N}(0,1)
  2. Applying \Phi to get U samples
  3. Checking that U is approximately uniform via a KS test

Part (d): Recover \mathcal{N}(2, 1.5^2) from Uniform samples using the inverse CDF.
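
The one-line argument behind Part (c), valid for any continuous, strictly increasing CDF \Phi:

P(U \leq u) = P(\Phi(X) \leq u) = P(X \leq \Phi^{-1}(u)) = \Phi(\Phi^{-1}(u)) = u, \quad u \in (0,1)

which is precisely the CDF of Uniform(0,1).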

Code cell 11

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 12

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

def empirical_cdf(samples, x):
    return np.mean(samples <= x)

n = 5000
X = np.random.standard_normal(n)

header = lambda t: print('\n' + '='*len(t) + '\n' + t + '\n' + '='*len(t))
check_close = lambda name, got, expected, tol=0.03: (
    print(f"{'PASS' if abs(got-expected)<tol else 'FAIL'} {name}: got {got:.4f}, expected {expected:.4f}"))
check_true = lambda name, c: print(f"{'PASS' if c else 'FAIL'} {name}")

header('Exercise 3: CDF and Universality of Uniform')

test_points = [-1, 0, 1, 1.96]
for x in test_points:
    emp = empirical_cdf(X, x)
    true = stats.norm.cdf(x)
    check_close(f'x={x}', emp, true)

# Part (c): Universality of Uniform
U = stats.norm.cdf(X)  # Apply Phi to Normal samples
ks_stat, p_val = stats.kstest(U, 'uniform')
print(f'\nKS test for U = Phi(X) ~ Uniform(0,1):')
print(f'  KS statistic = {ks_stat:.4f}, p-value = {p_val:.4f}')
check_true('U is uniform (p > 0.05)', p_val > 0.05)

# Part (d): Inverse CDF to sample N(2, 1.5^2)
mu, sigma = 2.0, 1.5
U2 = np.random.uniform(0, 1, n)
Y  = stats.norm.ppf(U2, loc=mu, scale=sigma)  # N(2, 1.5^2) samples
check_close('Sample mean ≈ mu', Y.mean(), mu, tol=0.05)
check_close('Sample std  ≈ sigma', Y.std(), sigma, tol=0.05)

print('\nTakeaway: The CDF completely characterises a distribution; the inverse CDF'
      ' enables sampling from any distribution using Uniform(0,1) samples.')

Exercise 4 ★★ — Bayes' Theorem: Spam Filtering

A spam filter uses word frequencies. Suppose:

  • P(\text{spam}) = 0.3 (prior — 30% of emails are spam)
  • P(\text{'free'} \mid \text{spam}) = 0.8 (likelihood of 'free' in spam)
  • P(\text{'free'} \mid \text{ham}) = 0.1 (likelihood of 'free' in ham)

Part (a): Compute P(\text{spam} \mid \text{'free'}) using Bayes' theorem.

Part (b): The email also contains 'win'. Given P(\text{'win'} \mid \text{spam}) = 0.7 and P(\text{'win'} \mid \text{ham}) = 0.05, compute the posterior after observing both words, assuming conditional independence given spam/ham.

Part (c): Write a naive_bayes_update(prior, likelihoods_spam, likelihoods_ham) function that applies Naive Bayes iteratively for a list of observed words.

Part (d): Plot how the posterior probability of spam evolves as the words 'free', 'win', 'prize', 'offer', 'hello' are observed in sequence (using the likelihood pairs from your function).

Code cell 14

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 15

# Solution
import numpy as np

def naive_bayes_update(prior, likelihoods_spam, likelihoods_ham):
    """Naive Bayes: iteratively update prior with each observed word."""
    p_spam = prior
    for p_w_spam, p_w_ham in zip(likelihoods_spam, likelihoods_ham):
        p_w = p_w_spam * p_spam + p_w_ham * (1 - p_spam)
        p_spam = p_w_spam * p_spam / p_w
    return p_spam

# Part (a)
p_spam = 0.3
p_free_given_spam = 0.8
p_free_given_ham  = 0.1
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)
p_spam_given_free = p_free_given_spam * p_spam / p_free

# Part (b): both words
p_spam_given_both = naive_bayes_update(0.3, [0.8, 0.7], [0.1, 0.05])

header = lambda t: print('\n' + '='*len(t) + '\n' + t + '\n' + '='*len(t))
check_close = lambda name, got, expected, tol=1e-6: (
    print(f"{'PASS' if abs(got-expected)<tol else 'FAIL'} {name}: {got:.6f}"))
check_true = lambda name, c: print(f"{'PASS' if c else 'FAIL'} {name}")

header('Exercise 4: Naive Bayes Spam Filter')
check_close('P(spam | free)', p_spam_given_free, 0.3*0.8 / (0.3*0.8 + 0.7*0.1))
check_true('P(spam|both words) > P(spam|one word)', p_spam_given_both > p_spam_given_free)
print(f'  P(spam) = 0.3  →  P(spam|free) = {p_spam_given_free:.4f}  →  P(spam|free,win) = {p_spam_given_both:.4f}')

# Part (d): evolution of posterior
words = ['free', 'win', 'prize', 'offer', 'hello']
ls  = [0.8, 0.7, 0.9, 0.75, 0.2]  # likelihoods given spam
lh  = [0.1, 0.05, 0.05, 0.1, 0.5]  # likelihoods given ham

posteriors = [0.3]
prior = 0.3
for p_s, p_h in zip(ls, lh):
    prior = naive_bayes_update(prior, [p_s], [p_h])
    posteriors.append(prior)

print('\nPosterior evolution:')
for i, (w, post) in enumerate(zip(['prior'] + words, posteriors)):
    print(f'  After {w:<8}: P(spam) = {post:.4f}')

try:
    import matplotlib.pyplot as plt
    plt.style.use('seaborn-v0_8-whitegrid')
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(range(len(posteriors)), posteriors, 'o-', color='steelblue', lw=2)
    ax.axhline(0.5, color='red', ls='--', alpha=0.5, label='Decision boundary')
    ax.set_xticks(range(len(posteriors)))
    ax.set_xticklabels(['prior'] + words)
    ax.set_ylabel('P(spam | observed words)')
    ax.set_title('Naive Bayes: Posterior Spam Probability')
    ax.legend(); plt.tight_layout(); plt.show()
except ImportError:
    pass

print('\nTakeaway: Naive Bayes assumes conditional independence — computationally cheap'
      ' and surprisingly effective, but overconfident when independence fails.')
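
In practice Naive Bayes is usually run in log space so that long word lists cannot underflow. A minimal sketch of the equivalent log-odds form of the same update (the function name is illustrative, not part of the exercise):

import numpy as np

def naive_bayes_log_odds(prior, likelihoods_spam, likelihoods_ham):
    # Accumulate log P(spam)/P(ham); each word adds its log-likelihood ratio.
    log_odds = np.log(prior) - np.log1p(-prior)
    for p_w_s, p_w_h in zip(likelihoods_spam, likelihoods_ham):
        log_odds += np.log(p_w_s) - np.log(p_w_h)
    return 1.0 / (1.0 + np.exp(-log_odds))  # sigmoid: back to a probability

print(naive_bayes_log_odds(0.3, [0.8, 0.7], [0.1, 0.05]))  # ≈ 0.9796, matching naive_bayes_update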

Exercise 5 ★★ — Geometric Distribution: Token Repetition

In a language model, suppose at each step there is probability p = 0.15 that the model generates an end-of-sequence (EOS) token. Model the sequence length L as L \sim \text{Geometric}(0.15).

Part (a): Compute \mathbb{E}[L] and \text{Var}(L) analytically.

Part (b): Verify the memoryless property: given that the sequence has already generated 10 tokens without EOS, show that the remaining length has the same distribution.
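
For reference, the memoryless property in Part (b) is immediate from the geometric survival function P(L > k) = (1-p)^k:

P(L > s + t \mid L > s) = \frac{(1-p)^{s+t}}{(1-p)^{s}} = (1-p)^{t} = P(L > t)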

Part (c): Compute the probability that the sequence is longer than 20 tokens.

Part (d): Simulate 50,000 sequences and compare the empirical length distribution to the theoretical PMF. Report the empirical mean and variance.

Code cell 17

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 18

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

p = 0.15
rv = stats.geom(p)

header = lambda t: print('\n' + '='*len(t) + '\n' + t + '\n' + '='*len(t))
check_close_fn = lambda name, got, expected, tol=1e-6: (
    print(f"{'PASS' if abs(got-expected)<tol else 'FAIL'} {name}: {got:.6f}"))
check_true_fn = lambda name, c: print(f"{'PASS' if c else 'FAIL'} {name}")

# Part (a)
e_L   = 1 / p
var_L = (1 - p) / p**2

# Part (b): Memoryless property
s = 10
p_gt_s_t  = rv.sf(s + 5)   # P(L > s+5)
p_gt_s    = rv.sf(s)        # P(L > s)
conditional = p_gt_s_t / p_gt_s   # P(L > s+5 | L > s)
p_gt_5    = rv.sf(5)        # P(L > 5)

# Part (c)
p_longer_20 = rv.sf(20)

# Part (d): Simulation
n_sim = 50000
lengths = rv.rvs(size=n_sim)

header('Exercise 5: Geometric Distribution')
check_close_fn('E[L]   = 1/p (vs scipy)', e_L, rv.mean())
check_close_fn('Var[L] = (1-p)/p^2 (vs scipy)', var_L, rv.var())
check_close_fn('Memoryless property', conditional, p_gt_5, tol=1e-10)
check_close_fn('P(L > 20) = (1-p)^20', p_longer_20, (1 - p)**20)
check_true_fn('Simulated mean close to E[L]', abs(lengths.mean() - e_L) < 0.3)

print(f'\nSimulation: mean={lengths.mean():.2f}, var={lengths.var():.2f}')
print(f'Theory:     mean={e_L:.2f}, var={var_L:.2f}')

print('\nTakeaway: The memoryless property makes Geometric the unique discrete distribution'
      ' without ageing — analogous to how Exponential models event inter-arrival times.')

Exercise 6 ★★ — Expectation, Variance, and LOTUS

Let X \sim \mathcal{N}(0, 1) (standard normal).

Part (a): Using LOTUS, compute \mathbb{E}[X^2], \mathbb{E}[X^4], and \mathbb{E}[X^6] analytically. (Hint: for \mathcal{N}(0,1), the 2k-th moment is (2k-1)!! = (2k-1)(2k-3)\cdots 1.)
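
The hint's double-factorial pattern follows from integration by parts, using x\varphi(x) = -\varphi'(x) for the standard normal density \varphi:

\mathbb{E}[X^{2k}] = (2k-1)\,\mathbb{E}[X^{2k-2}] \quad \Rightarrow \quad \mathbb{E}[X^2] = 1, \;\; \mathbb{E}[X^4] = 3, \;\; \mathbb{E}[X^6] = 15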

Part (b): Verify these formulas via Monte Carlo simulation with n = 10^6 samples.

Part (c): Compute \text{Var}(X^2) analytically using:

\text{Var}(X^2) = \mathbb{E}[X^4] - (\mathbb{E}[X^2])^2

Part (d): Standardisation of a random variable: given Y \sim \mathcal{N}(\mu=3, \sigma^2=4), compute Z = (Y - \mathbb{E}[Y])/\sqrt{\text{Var}(Y)} and verify \mathbb{E}[Z] = 0, \text{Var}(Z) = 1.

Code cell 20

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 21

# Solution
import numpy as np
np.random.seed(42)

header = lambda t: print('\n' + '='*len(t) + '\n' + t + '\n' + '='*len(t))
def check_close(name, got, expected, tol=0.01):
    ok = abs(got - expected) < tol
    print(f"{'PASS' if ok else 'FAIL'}{name}: got {got:.4f}, expected {expected:.4f}")

# Part (a): Double-factorial moments of N(0,1)
# E[X^2] = 1, E[X^4] = 3, E[X^6] = 15
e_x2 = 1    # (2*1 - 1)!! = 1
e_x4 = 3    # (2*2 - 1)!! = 1*3 = 3
e_x6 = 15   # (2*3 - 1)!! = 1*3*5 = 15

# Part (b): Monte Carlo verification
n = 1000000
X = np.random.standard_normal(n)
mc_x2 = (X**2).mean()
mc_x4 = (X**4).mean()
mc_x6 = (X**6).mean()

# Part (c): Var(X^2) = E[X^4] - (E[X^2])^2
var_x2 = e_x4 - e_x2**2   # = 3 - 1 = 2

# Part (d): Standardisation
from scipy import stats
mu_Y, sigma_Y = 3.0, 2.0  # sigma = sqrt(variance) = sqrt(4) = 2
Y = np.random.normal(mu_Y, sigma_Y, n)
Z = (Y - mu_Y) / sigma_Y

header('Exercise 6: Expectation, Variance, LOTUS')
check_close('E[X^2] (analytical)', e_x2, 1.0, tol=1e-10)
check_close('E[X^2] (Monte Carlo)', mc_x2, 1.0, tol=0.01)
check_close('E[X^4] (analytical)', e_x4, 3.0, tol=1e-10)
check_close('E[X^4] (Monte Carlo)', mc_x4, 3.0, tol=0.05)
check_close('E[X^6] (analytical)', e_x6, 15.0, tol=1e-10)
check_close('E[X^6] (Monte Carlo)', mc_x6, 15.0, tol=0.3)
check_close('Var(X^2) = 2', var_x2, 2.0, tol=1e-10)
check_close('E[Z] = 0 after standardisation', Z.mean(), 0.0, tol=0.01)
check_close('Var(Z) = 1 after standardisation', Z.var(), 1.0, tol=0.01)

print('\nTakeaway: LOTUS avoids finding the distribution of g(X) explicitly;'
      ' standardisation is the Gaussian foundation of batch normalisation in NNs.')

Exercise 7 ★★★ — Cross-Entropy Loss and Gradient

Consider a 3-class softmax classifier. The output logits are \mathbf{z} \in \mathbb{R}^3 and the predicted probabilities are \hat{p}_c = e^{z_c} / \sum_j e^{z_j}.

Part (a): Implement softmax(z) in a numerically stable way (subtract the max before exponentiating).

Part (b): Implement cross_entropy_loss(logits, y_true) where y_true is a one-hot vector.

Part (c): Compute the gradient of the cross-entropy loss with respect to the logits:

\frac{\partial \mathcal{L}}{\partial z_c} = \hat{p}_c - y_c

Implement cross_entropy_gradient(logits, y_true) and verify numerically via finite differences.
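
The closed form drops out of the chain rule: with \mathcal{L} = -\sum_j y_j \log \hat{p}_j and \partial \log \hat{p}_j / \partial z_c = \delta_{jc} - \hat{p}_c,

\frac{\partial \mathcal{L}}{\partial z_c} = -\sum_j y_j (\delta_{jc} - \hat{p}_c) = \hat{p}_c \sum_j y_j - y_c = \hat{p}_c - y_c

since the one-hot y_true sums to 1.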

Part (d): Plot the loss surface as a function of z_1 (holding z_0, z_2 fixed) with class 0 as the true class. Observe that the loss grows as the competing logit z_1 increases, so the gradient drives the true-class logit z_0 to dominate.

Code cell 23

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 24

# Solution
import numpy as np
np.random.seed(42)

header = lambda t: print('\n' + '='*len(t) + '\n' + t + '\n' + '='*len(t))
def check_close(name, got, expected, tol=1e-5):
    ok = np.allclose(got, expected, atol=tol)
    print(f"{'PASS' if ok else 'FAIL'}{name}")

def softmax(z):
    z_stable = z - z.max()   # numerical stability
    exp_z = np.exp(z_stable)
    return exp_z / exp_z.sum()

def cross_entropy_loss(logits, y_true):
    p = softmax(logits)
    return -np.sum(y_true * np.log(p + 1e-15))

def cross_entropy_gradient(logits, y_true):
    p = softmax(logits)
    return p - y_true  # elegant closed form

# Numerical gradient via finite differences
def numerical_gradient(f, z, eps=1e-5):
    grad = np.zeros_like(z)
    for i in range(len(z)):
        z_plus  = z.copy(); z_plus[i]  += eps
        z_minus = z.copy(); z_minus[i] -= eps
        grad[i] = (f(z_plus) - f(z_minus)) / (2 * eps)
    return grad

z = np.array([2.0, 1.0, 0.5])
y = np.array([1.0, 0.0, 0.0])

p_hat  = softmax(z)
loss   = cross_entropy_loss(z, y)
grad   = cross_entropy_gradient(z, y)
num_grad = numerical_gradient(lambda z_: cross_entropy_loss(z_, y), z)

header('Exercise 7: Cross-Entropy Loss')
print(f'softmax(z) = {p_hat}')
print(f'loss       = {loss:.4f}  (= -log({p_hat[0]:.4f}) = {-np.log(p_hat[0]):.4f})')
check_close('Softmax sums to 1', p_hat.sum(), 1.0)
check_close('Analytical grad = numerical grad', grad, num_grad, tol=1e-5)
print(f'  Analytical: {grad}')
print(f'  Numerical:  {num_grad}')

# Verify: gradient is p_hat - y
expected_grad = p_hat - y
check_close('Gradient = p_hat - y', grad, expected_grad)

print("\nTakeaway: The cross-entropy gradient p̂-y is the probability model's prediction"
      " error — it pushes logits in the direction of the true class, a direct probabilistic signal.")

Exercise 8 ★★★ — KL Divergence and the VAE Objective

The VAE evidence lower bound (ELBO) contains a KL divergence term:

D_{\text{KL}}\bigl(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\bigr) = \frac{1}{2}\left(\sigma^2 + \mu^2 - 1 - \log \sigma^2\right)
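
This closed form follows directly from the definition D_{\text{KL}}(p\|q) = \mathbb{E}_p[\log p(z) - \log q(z)]: the log-density ratio of the two Gaussians is quadratic in z, and \mathbb{E}_p[(z-\mu)^2] = \sigma^2, \mathbb{E}_p[z^2] = \sigma^2 + \mu^2 give

\mathbb{E}_p\!\left[-\tfrac{1}{2}\log\sigma^2 - \frac{(z-\mu)^2}{2\sigma^2} + \frac{z^2}{2}\right] = -\tfrac{1}{2}\log\sigma^2 - \tfrac{1}{2} + \frac{\sigma^2 + \mu^2}{2}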

Part (a): Implement this closed-form KL divergence for Gaussians.

Part (b): Verify that D_{\text{KL}} = 0 when \mu = 0, \sigma = 1, and that D_{\text{KL}} > 0 otherwise.

Part (c): Estimate the KL divergence via Monte Carlo:

D_{\text{KL}}(p\|q) \approx \frac{1}{n}\sum_{i=1}^n \log\frac{p(z_i)}{q(z_i)}, \quad z_i \sim p

Part (d): Plot the KL divergence landscape as \mu varies in [-3, 3] with \sigma = 1, and as \sigma varies in [0.1, 3] with \mu = 0. Identify the minimum in each case.

Code cell 26

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 27

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

header = lambda t: print('\n' + '='*len(t) + '\n' + t + '\n' + '='*len(t))
def check_close(name, got, expected, tol=1e-5):
    ok = abs(got - expected) < tol
    print(f"{'PASS' if ok else 'FAIL'}{name}: {got:.6f}")

def kl_gaussian(mu, sigma):
    """KL(N(mu, sigma^2) || N(0,1)) = 0.5*(sigma^2 + mu^2 - 1 - log(sigma^2))."""
    return 0.5 * (sigma**2 + mu**2 - 1 - 2*np.log(sigma))

def kl_gaussian_mc(mu, sigma, n=100000):
    z = np.random.normal(mu, sigma, n)
    log_p = stats.norm.logpdf(z, mu, sigma)
    log_q = stats.norm.logpdf(z, 0, 1)
    return (log_p - log_q).mean()

header('Exercise 8: KL Divergence and VAE')

# Part (b): KL = 0 at N(0,1)
check_close('KL(N(0,1)||N(0,1)) = 0', kl_gaussian(0, 1), 0.0)

# Various points
test_cases = [(1, 1), (2, 1), (0, 2), (0, 0.5)]
print('\nKL values (closed form vs Monte Carlo):')
for mu, sigma in test_cases:
    kl_cf = kl_gaussian(mu, sigma)
    kl_mc = kl_gaussian_mc(mu, sigma)
    ok = abs(kl_cf - kl_mc) < 0.05
    print(f'  N({mu},{sigma}²): closed={kl_cf:.4f}, MC={kl_mc:.4f}  {"PASS" if ok else "FAIL"}')

try:
    import matplotlib.pyplot as plt
    plt.style.use('seaborn-v0_8-whitegrid')
    fig, axes = plt.subplots(1, 2, figsize=(14, 4))

    mu_range = np.linspace(-3, 3, 200)
    kl_mu    = [kl_gaussian(m, 1.0) for m in mu_range]
    axes[0].plot(mu_range, kl_mu, 'steelblue', lw=2)
    axes[0].axvline(0, color='red', ls='--', alpha=0.5, label='min at mu=0')
    axes[0].set_xlabel('mu'); axes[0].set_ylabel('KL divergence')
    axes[0].set_title('KL vs. mu (sigma=1)'); axes[0].legend()

    sigma_range = np.linspace(0.1, 3, 200)
    kl_sigma    = [kl_gaussian(0.0, s) for s in sigma_range]
    axes[1].plot(sigma_range, kl_sigma, 'darkorange', lw=2)
    axes[1].axvline(1.0, color='red', ls='--', alpha=0.5, label='min at sigma=1')
    axes[1].set_xlabel('sigma'); axes[1].set_ylabel('KL divergence')
    axes[1].set_title('KL vs. sigma (mu=0)'); axes[1].legend()

    plt.suptitle('KL(N(mu,sigma²)||N(0,1)) — VAE Regularisation Term', fontsize=13)
    plt.tight_layout(); plt.show()
except ImportError:
    pass

print('\nTakeaway: The VAE KL term penalises the encoder for encoding to a distribution'
      ' far from N(0,1). The closed form makes this differentiable — enabling reparameterised'
      ' gradients to flow through the sampling operation.')

Exercise 9 ★★★ — Mixture Models and the Law of Total Probability

A monitoring system receives examples from two data sources. Source S=0 is clean and emits X \sim \mathcal{N}(0,1). Source S=1 is shifted and emits X \sim \mathcal{N}(3,1). The shifted source appears with probability P(S=1) = 0.25.

Part (a): Compute P(X > 2) using the law of total probability.

Part (b): Compute P(S=1 \mid X > 2) using Bayes' theorem.
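
For reference, the two quantities unwind as (with \Phi the standard normal CDF; values rounded):

P(X > 2) = 0.75\,(1 - \Phi(2)) + 0.25\,(1 - \Phi(-1)) \approx 0.75 \cdot 0.0228 + 0.25 \cdot 0.8413 \approx 0.2274

P(S{=}1 \mid X > 2) \approx 0.2103 / 0.2274 \approx 0.925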

Part (c): Verify both probabilities by Monte Carlo simulation.

Part (d): Interpret why this calculation matters for distribution-shift detection.

Code cell 29

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 30

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

p_shift = 0.25
threshold = 2.0
p_tail_clean = 1 - stats.norm.cdf(threshold, loc=0, scale=1)
p_tail_shift = 1 - stats.norm.cdf(threshold, loc=3, scale=1)
p_tail = (1 - p_shift) * p_tail_clean + p_shift * p_tail_shift
p_shift_given_tail = p_shift * p_tail_shift / p_tail

n = 300000
S = (np.random.rand(n) < p_shift).astype(int)
X = np.random.normal(0, 1, n)
X[S == 1] = np.random.normal(3, 1, (S == 1).sum())
mask = X > threshold
p_tail_mc = mask.mean()
p_shift_given_tail_mc = S[mask].mean()

header('Exercise 9: Mixture Models and Total Probability')
print(f'P(X>2) analytic: {p_tail:.4f}')
print(f'P(X>2) Monte Carlo: {p_tail_mc:.4f}')
print(f'P(S=1 | X>2) analytic: {p_shift_given_tail:.4f}')
print(f'P(S=1 | X>2) Monte Carlo: {p_shift_given_tail_mc:.4f}')
check_close('Tail probability matches simulation', p_tail_mc, p_tail, tol=0.01)
check_close('Posterior source probability matches simulation', p_shift_given_tail_mc, p_shift_given_tail, tol=0.02)
print('Takeaway: a rare shifted source can dominate tail events, which is why tail monitoring is useful for ML drift detection.')

Exercise 10 ★★★ — Categorical Sampling and Temperature

A language model produces logits z = [3.0, 1.0, 0.0, -1.0] for four tokens.

Part (a): Convert logits to probabilities with temperature T: p_i(T) = \operatorname{softmax}(z_i/T).

Part (b): Compare entropy and maximum probability for T \in \{0.5, 1, 2\}.

Part (c): Sample 20,000 tokens for each temperature and verify empirical frequencies.

Part (d): Explain why temperature controls diversity in generation.
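
The limiting behaviour behind Part (d), for K tokens (here K = 4, assuming a unique maximum logit):

p_i(T) = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}, \qquad \lim_{T \to \infty} p_i(T) = \frac{1}{K}, \qquad \lim_{T \to 0^+} p_i(T) = \mathbf{1}\!\left[i = \arg\max_j z_j\right]

Dividing the logits by a small T stretches the gaps between them, concentrating mass on the top token; a large T shrinks the gaps toward the uniform distribution.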

Code cell 32

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 33

# Solution
import numpy as np
np.random.seed(42)

def softmax_temperature(logits, T):
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    p = np.asarray(p)
    return -np.sum(p * np.log2(p + 1e-15))

logits = np.array([3.0, 1.0, 0.0, -1.0])
temps = [0.5, 1.0, 2.0]

header('Exercise 10: Categorical Sampling and Temperature')
for T in temps:
    p = softmax_temperature(logits, T)
    samples = np.random.choice(len(p), size=20000, p=p)
    freq = np.bincount(samples, minlength=len(p)) / len(samples)
    print(f'T={T}: p={p.round(4)}, entropy={entropy(p):.3f} bits, max={p.max():.3f}')
    print(f'     empirical={freq.round(4)}')
    check_true(f'empirical frequencies close for T={T}', np.allclose(freq, p, atol=0.015))

p_low = softmax_temperature(logits, 0.5)
p_high = softmax_temperature(logits, 2.0)
check_true('Higher temperature increases entropy', entropy(p_high) > entropy(p_low))
print('Takeaway: lower temperature concentrates probability on the top token; higher temperature spreads mass and increases diversity.')

What to Review After Finishing

  • Kolmogorov axioms — can you derive the complement rule and inclusion-exclusion from them?
  • Conditional probability — can you apply Bayes' theorem to a novel problem?
  • CDF/PDF/PMF relationship — can you go from PDF to CDF and back?
  • Independence — do you understand why pairwise ≠ mutual independence?
  • Expectation / LOTUS — can you compute E[g(X)] without finding the distribution of g(X)?
  • Gaussian distribution — do you know the 68-95-99.7 rule and how to standardise?
  • Cross-entropy — can you derive the gradient of the softmax cross-entropy loss?
  • KL divergence — can you write the closed form for two Gaussians?
  • Mixture total probability — can you explain the probabilistic calculation and the ML relevance?
  • Categorical temperature — can you connect the computation to model behavior?

References

  1. Kolmogorov, A.N. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung.
  2. Bishop, C.M. (2006). Pattern Recognition and Machine Learning, Ch. 1–2.
  3. Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, Ch. 3.
  4. MacKay, D.J.C. (2003). Information Theory, Inference, and Learning Algorithms, Ch. 2.