Exercises Notebook
Converted from exercises.ipynb for web reading.
§6.1 Introduction and Random Variables — Exercises
10 exercises covering probability axioms, conditional probability, Bayes' theorem, independence, CDF/PDF, discrete and continuous distributions, and ML applications.
| Format | Description |
|---|---|
| Problem | Markdown cell with task description |
| Your Solution | Code cell with scaffolding |
| Solution | Code cell with reference solution and checks |
Difficulty Levels
| Level | Exercises | Focus |
|---|---|---|
| ★ | 1–3 | Probability axioms, conditional probability, CDF |
| ★★ | 4–6 | Bayes' theorem, distributions, expectation |
| ★★★ | 7–10 | ML applications: cross-entropy, KL divergence, mixtures, temperature sampling |
Topic Map
| Topic | Exercise |
|---|---|
| Sample spaces and axioms | 1 |
| Conditional probability | 2 |
| CDF, PDF, PMF | 3 |
| Bayes' theorem | 4 |
| Geometric distribution | 5 |
| Expectation and variance | 6 |
| Cross-entropy loss | 7 |
| KL divergence and VAE | 8 |
| Mixture total probability | 9 |
| Categorical temperature | 10 |
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 13,
"axes.titlesize": 15,
"axes.labelsize": 13,
"xtick.labelsize": 11,
"ytick.labelsize": 11,
"legend.fontsize": 11,
"legend.framealpha": 0.85,
"lines.linewidth": 2.0,
"axes.spines.top": False,
"axes.spines.right": False,
"savefig.bbox": "tight",
"savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
from scipy import stats
import math
try:
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = [10, 5]
HAS_MPL = True
except ImportError:
HAS_MPL = False
np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)
def header(title):
print('\n' + '=' * len(title))
print(title)
print('=' * len(title))
def check_close(name, got, expected, tol=1e-8):
ok = np.allclose(got, expected, atol=tol, rtol=tol)
print(f"{'PASS' if ok else 'FAIL'} — {name}")
if not ok:
print(f' expected: {expected}')
print(f' got : {got}')
return ok
def check_true(name, cond):
print(f"{'PASS' if cond else 'FAIL'} — {name}")
return cond
print('Setup complete.')
Exercise 1 ★ — Sample Space and Probability Axioms
Consider the experiment of rolling a fair 8-sided die, so Ω = {1, 2, ..., 8} with all outcomes equally likely.
Part (a): Write a function P(event, omega) that computes P(E) = |E| / |Ω| for a uniform probability space.
Part (b): Define events:
- A = {outcome is even}
- B = {outcome is prime}
- C = {outcome > 5}
Compute P(A), P(B), P(C), P(A ∩ B), P(A ∪ B), and P(A ∩ C).
Part (c): Verify the inclusion-exclusion principle: P(A ∪ B) = P(A) + P(B) - P(A ∩ B).
Part (d): Verify the complement rule: P(Aᶜ) = 1 - P(A).
Code cell 5
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 6
# Solution
Omega = set(range(1, 9))
def P(event, omega):
return len(event) / len(omega)
A = {x for x in Omega if x % 2 == 0} # {2, 4, 6, 8}
B = {x for x in Omega if all(x % d != 0 for d in range(2, x)) and x > 1} # {2, 3, 5, 7}
C = {x for x in Omega if x > 5} # {6, 7, 8}
header('Exercise 1: Sample Space and Probability Axioms')
print(f'A (even) = {sorted(A)}')
print(f'B (prime) = {sorted(B)}')
print(f'C (>5) = {sorted(C)}')
check_close('P(A)', P(A, Omega), 4/8)
check_close('P(B)', P(B, Omega), 4/8)
check_close('P(C)', P(C, Omega), 3/8)
# Inclusion-exclusion
lhs = P(A | B, Omega)
rhs = P(A, Omega) + P(B, Omega) - P(A & B, Omega)
check_close('Inclusion-exclusion P(A∪B)', lhs, rhs)
# Complement
Ac = Omega - A
check_close('Complement P(Ac) = 1 - P(A)', P(Ac, Omega), 1 - P(A, Omega))
print('\nTakeaway: The axioms ensure consistent probability assignment; inclusion-exclusion'
' corrects for double-counting when events overlap.')
Exercise 2 ★ — Conditional Probability and Chain Rule
A bag contains 4 red balls and 6 blue balls. Two balls are drawn without replacement.
Part (a): Compute P(second red | first red).
Part (b): Compute P(both red) using the chain rule: P(R1 ∩ R2) = P(R1) · P(R2 | R1).
Part (c): Compute P(at least one red) using the complement.
Part (d): Simulate 100,000 draws and verify your analytical answers.
Code cell 8
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 9
# Solution
import numpy as np
np.random.seed(42)
# Analytical
# Part (a): After drawing 1 red, bag has 3 red, 6 blue (9 total)
p_second_red_given_first_red = 3/9
# Part (b): Chain rule
p_first_red = 4/10
p_both_red = p_first_red * p_second_red_given_first_red # = 4/10 * 3/9 = 12/90
# Part (c): Complement of (both blue)
p_both_blue = (6/10) * (5/9)
p_at_least_one_red = 1 - p_both_blue
# Simulation
bag = np.array([1]*4 + [0]*6) # 1=red, 0=blue
n = 100000
both_red = 0
for _ in range(n):
draw = np.random.choice(bag, size=2, replace=False)
if draw.sum() == 2:
both_red += 1
p_both_red_sim = both_red / n
header('Exercise 2: Conditional Probability')
check_close('P(2nd red | 1st red)', p_second_red_given_first_red, 1/3)
check_close('P(both red)', p_both_red, 12/90)
check_close('P(at least one red)', p_at_least_one_red, 1 - 30/90)
check_true('Simulation P(both red) within 1%', abs(p_both_red_sim - p_both_red) < 0.01)
print(f' Simulation: {p_both_red_sim:.4f}, Analytical: {p_both_red:.4f}')
print('\nTakeaway: Conditional probability captures how one event changes the'
' probability of another — the foundation of Bayesian updating in ML.')
Exercise 3 ★ — CDF, PDF, and the Universality of the Uniform
Part (a): Write a function empirical_cdf(samples, x) that estimates F(x) = P(X ≤ x) from samples.
Part (b): Draw 5,000 samples from N(0, 1) and plot the empirical CDF against the true CDF Φ(x).
Part (c): Demonstrate the Universality of the Uniform: if X has continuous CDF F, then F(X) ~ Uniform(0, 1). Verify this by:
- Drawing samples from N(0, 1)
- Applying Φ to get samples U = Φ(X)
- Checking that U is approximately uniform via a KS test
Part (d): Recover N(2, 1.5²) samples from Uniform(0, 1) samples using the inverse CDF Φ⁻¹.
Code cell 11
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 12
# Solution
import numpy as np
from scipy import stats
np.random.seed(42)
def empirical_cdf(samples, x):
return np.mean(samples <= x)
n = 5000
X = np.random.standard_normal(n)
header = lambda t: print('\n' + '='*len(t) + '\n' + t + '\n' + '='*len(t))
check_close = lambda name, got, expected, tol=0.03: (
print(f"{'PASS' if abs(got-expected)<tol else 'FAIL'} — {name}: got {got:.4f}, expected {expected:.4f}"))
check_true = lambda name, c: print(f"{'PASS' if c else 'FAIL'} — {name}")
header('Exercise 3: CDF and Universality of Uniform')
test_points = [-1, 0, 1, 1.96]
for x in test_points:
emp = empirical_cdf(X, x)
true = stats.norm.cdf(x)
check_close(f'x={x}', emp, true)
# Part (c): Universality of Uniform
U = stats.norm.cdf(X) # Apply Phi to Normal samples
ks_stat, p_val = stats.kstest(U, 'uniform')
print(f'\nKS test for U = Phi(X) ~ Uniform(0,1):')
print(f' KS statistic = {ks_stat:.4f}, p-value = {p_val:.4f}')
check_true('U is uniform (p > 0.05)', p_val > 0.05)
# Part (d): Inverse CDF to sample N(2, 1.5^2)
mu, sigma = 2.0, 1.5
U2 = np.random.uniform(0, 1, n)
Y = stats.norm.ppf(U2, loc=mu, scale=sigma) # N(2, 1.5^2) samples
check_close('Sample mean ≈ mu', Y.mean(), mu, tol=0.05)
check_close('Sample std ≈ sigma', Y.std(), sigma, tol=0.05)
print('\nTakeaway: The CDF completely characterises a distribution; the inverse CDF'
' enables sampling from any distribution using Uniform(0,1) samples.')
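Part (b) asks for a plot that the reference solution above omits. A minimal sketch, assuming matplotlib is available and reusing X from the solution cell:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

xs = np.sort(X)
ecdf = np.arange(1, len(xs) + 1) / len(xs)  # empirical CDF at the sorted samples
fig, ax = plt.subplots(figsize=(8, 4))
ax.step(xs, ecdf, where='post', label='empirical CDF')
ax.plot(xs, stats.norm.cdf(xs), 'r--', label='true CDF Φ(x)')
ax.set_xlabel('x'); ax.set_ylabel('F(x)')
ax.set_title('Empirical vs. true CDF, N(0, 1)')
ax.legend(); plt.tight_layout(); plt.show()
```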
Exercise 4 ★★ — Bayes' Theorem: Spam Filtering
A spam filter uses word frequencies. Suppose:
- P(spam) = 0.3 (prior — 30% of emails are spam)
- P('free' | spam) = 0.8 (likelihood of 'free' in spam)
- P('free' | ham) = 0.1 (likelihood of 'free' in ham)
Part (a): Compute P(spam | 'free') using Bayes' theorem.
Part (b): The email also contains 'win'. Given P('win' | spam) = 0.7 and P('win' | ham) = 0.05, compute the posterior P(spam | 'free', 'win') after observing both words, assuming conditional independence given spam/ham.
Part (c): Write a naive_bayes_update(prior, likelihoods_spam, likelihoods_ham) function
that applies Naive Bayes iteratively for a list of observed words.
Part (d): Plot how the posterior probability of spam evolves as the words 'free', 'win', 'prize', 'offer', 'hello' are observed in sequence (using the likelihood pairs from your function).
Code cell 14
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 15
# Solution
import numpy as np
def naive_bayes_update(prior, likelihoods_spam, likelihoods_ham):
"""Naive Bayes: iteratively update prior with each observed word."""
p_spam = prior
for p_w_spam, p_w_ham in zip(likelihoods_spam, likelihoods_ham):
p_w = p_w_spam * p_spam + p_w_ham * (1 - p_spam)
p_spam = p_w_spam * p_spam / p_w
return p_spam
# Part (a)
p_spam = 0.3
p_free_given_spam = 0.8
p_free_given_ham = 0.1
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)
p_spam_given_free = p_free_given_spam * p_spam / p_free
# Part (b): both words
p_spam_given_both = naive_bayes_update(0.3, [0.8, 0.7], [0.1, 0.05])
header = lambda t: print('\n' + '='*len(t) + '\n' + t + '\n' + '='*len(t))
check_close = lambda name, got, expected, tol=1e-6: (
print(f"{'PASS' if abs(got-expected)<tol else 'FAIL'} — {name}: {got:.6f}"))
check_true = lambda name, c: print(f"{'PASS' if c else 'FAIL'} — {name}")
header('Exercise 4: Naive Bayes Spam Filter')
check_close('P(spam | free)', p_spam_given_free, 0.3*0.8 / (0.3*0.8 + 0.7*0.1))
check_true('P(spam|both words) > P(spam|one word)', p_spam_given_both > p_spam_given_free)
print(f' P(spam) = 0.3 → P(spam|free) = {p_spam_given_free:.4f} → P(spam|free,win) = {p_spam_given_both:.4f}')
# Part (d): evolution of posterior
words = ['free', 'win', 'prize', 'offer', 'hello']
ls = [0.8, 0.7, 0.9, 0.75, 0.2] # likelihoods given spam
lh = [0.1, 0.05, 0.05, 0.1, 0.5] # likelihoods given ham
posteriors = [0.3]
prior = 0.3
for p_s, p_h in zip(ls, lh):
prior = naive_bayes_update(prior, [p_s], [p_h])
posteriors.append(prior)
print('\nPosterior evolution:')
for i, (w, post) in enumerate(zip(['prior'] + words, posteriors)):
print(f' After {w:<8}: P(spam) = {post:.4f}')
try:
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(range(len(posteriors)), posteriors, 'o-', color='steelblue', lw=2)
ax.axhline(0.5, color='red', ls='--', alpha=0.5, label='Decision boundary')
ax.set_xticks(range(len(posteriors)))
ax.set_xticklabels(['prior'] + words)
ax.set_ylabel('P(spam | observed words)')
ax.set_title('Naive Bayes: Posterior Spam Probability')
ax.legend(); plt.tight_layout(); plt.show()
except ImportError:
pass
print('\nTakeaway: Naive Bayes assumes conditional independence — computationally cheap'
' and surprisingly effective, but overconfident when independence fails.')
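The iterative update is equivalent to summing log-likelihood ratios in log-odds space, a standard reformulation (not required by the exercise) that avoids repeated normalisation. A sketch using the likelihood pairs from part (d):

```python
import numpy as np

prior = 0.3
ls = [0.8, 0.7, 0.9, 0.75, 0.2]   # P(word | spam)
lh = [0.1, 0.05, 0.05, 0.1, 0.5]  # P(word | ham)
# posterior log-odds = prior log-odds + sum of log-likelihood ratios
log_odds = np.log(prior / (1 - prior)) + sum(np.log(s / h) for s, h in zip(ls, lh))
posterior = 1 / (1 + np.exp(-log_odds))
print(f"P(spam | all five words) = {posterior:.4f}")  # matches the iterative update
```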
Exercise 5 ★★ — Geometric Distribution: Token Repetition
In a language model, suppose at each step there is probability p = 0.15 that the model generates an end-of-sequence (EOS) token. Model the sequence length as L ~ Geometric(p).
Part (a): Compute E[L] and Var(L) analytically.
Part (b): Verify the memoryless property: given that the sequence has already generated 10 tokens without EOS, show that the remaining length has the same distribution, i.e. P(L > 15 | L > 10) = P(L > 5).
Part (c): Compute the probability that the sequence is longer than 20 tokens.
Part (d): Simulate 50,000 sequences and compare the empirical length distribution to the theoretical PMF. Report the empirical mean and variance.
Code cell 17
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 18
# Solution
import numpy as np
from scipy import stats
np.random.seed(42)
p = 0.15
rv = stats.geom(p)
header = lambda t: print('\n' + '='*len(t) + '\n' + t + '\n' + '='*len(t))
check_close_fn = lambda name, got, expected, tol=1e-6: (
print(f"{'PASS' if abs(got-expected)<tol else 'FAIL'} — {name}: {got:.6f}"))
check_true_fn = lambda name, c: print(f"{'PASS' if c else 'FAIL'} — {name}")
# Part (a)
e_L = 1 / p
var_L = (1 - p) / p**2
# Part (b): Memoryless property
s = 10
p_gt_s_t = rv.sf(s + 5) # P(L > s+5)
p_gt_s = rv.sf(s) # P(L > s)
conditional = p_gt_s_t / p_gt_s # P(L > s+5 | L > s)
p_gt_5 = rv.sf(5) # P(L > 5)
# Part (c)
p_longer_20 = rv.sf(20)
# Part (d): Simulation
n_sim = 50000
lengths = rv.rvs(size=n_sim)
header('Exercise 5: Geometric Distribution')
check_close_fn('E[L] = 1/p', e_L, 1/p)
check_close_fn('Var[L] = (1-p)/p^2', var_L, (1-p)/p**2)
check_close_fn('Memoryless property', conditional, p_gt_5, tol=1e-10)
check_close_fn('P(L > 20)', p_longer_20, rv.sf(20))
check_true_fn('Simulated mean close to E[L]', abs(lengths.mean() - e_L) < 0.3)
print(f'\nSimulation: mean={lengths.mean():.2f}, var={lengths.var():.2f}')
print(f'Theory: mean={e_L:.2f}, var={var_L:.2f}')
print('\nTakeaway: The memoryless property makes Geometric the unique discrete distribution'
' without ageing — analogous to how Exponential models event inter-arrival times.')
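Part (d) also asks to compare the empirical length distribution to the theoretical PMF; the reference solution only reports moments. A plotting sketch, assuming matplotlib and reusing lengths and rv from the cell above:

```python
import numpy as np
import matplotlib.pyplot as plt

ks = np.arange(1, 31)
emp = np.bincount(lengths, minlength=31)[1:31] / len(lengths)  # empirical P(L = k)
fig, ax = plt.subplots(figsize=(9, 4))
ax.bar(ks, emp, alpha=0.6, label='empirical (50,000 sequences)')
ax.plot(ks, rv.pmf(ks), 'ro-', ms=4, label='Geometric(0.15) PMF')
ax.set_xlabel('sequence length L'); ax.set_ylabel('probability')
ax.legend(); plt.tight_layout(); plt.show()
```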
Exercise 6 ★★ — Expectation, Variance, and LOTUS
Let X ~ N(0, 1) (standard normal).
Part (a): Using LOTUS, compute E[X^2], E[X^4], and E[X^6] analytically. (Hint: for X ~ N(0, 1) and even k, the k-th moment is E[X^k] = (k - 1)!!)
Part (b): Verify these formulas via Monte Carlo simulation with 10^6 samples.
Part (c): Compute Var(X^2) analytically using: Var(X^2) = E[X^4] - (E[X^2])^2.
Part (d): Standardisation of a random variable: given Y ~ N(3, 4), compute Z = (Y - μ) / σ and verify E[Z] = 0, Var(Z) = 1.
Code cell 20
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 21
# Solution
import numpy as np
np.random.seed(42)
header = lambda t: print('\n' + '='*len(t) + '\n' + t + '\n' + '='*len(t))
def check_close(name, got, expected, tol=0.01):
ok = abs(got - expected) < tol
print(f"{'PASS' if ok else 'FAIL'} — {name}: got {got:.4f}, expected {expected:.4f}")
# Part (a): Double-factorial moments of N(0,1)
# E[X^2] = 1, E[X^4] = 3, E[X^6] = 15
e_x2 = 1 # (2*1 - 1)!! = 1
e_x4 = 3 # (2*2 - 1)!! = 1*3 = 3
e_x6 = 15 # (2*3 - 1)!! = 1*3*5 = 15
# Part (b): Monte Carlo verification
n = 1000000
X = np.random.standard_normal(n)
mc_x2 = (X**2).mean()
mc_x4 = (X**4).mean()
mc_x6 = (X**6).mean()
# Part (c): Var(X^2) = E[X^4] - (E[X^2])^2
var_x2 = e_x4 - e_x2**2 # = 3 - 1 = 2
# Part (d): Standardisation
from scipy import stats
mu_Y, sigma_Y = 3.0, 2.0 # sigma = sqrt(variance) = sqrt(4) = 2
Y = np.random.normal(mu_Y, sigma_Y, n)
Z = (Y - mu_Y) / sigma_Y
header('Exercise 6: Expectation, Variance, LOTUS')
check_close('E[X^2] (analytical)', e_x2, 1.0, tol=1e-10)
check_close('E[X^2] (Monte Carlo)', mc_x2, 1.0, tol=0.01)
check_close('E[X^4] (analytical)', e_x4, 3.0, tol=1e-10)
check_close('E[X^4] (Monte Carlo)', mc_x4, 3.0, tol=0.05)
check_close('E[X^6] (analytical)', e_x6, 15.0, tol=1e-10)
check_close('E[X^6] (Monte Carlo)', mc_x6, 15.0, tol=0.3)
check_close('Var(X^2) = 2', var_x2, 2.0, tol=1e-10)
check_close('E[Z] = 0 after standardisation', Z.mean(), 0.0, tol=0.01)
check_close('Var(Z) = 1 after standardisation', Z.var(), 1.0, tol=0.01)
print('\nTakeaway: LOTUS avoids finding the distribution of g(X) explicitly;'
' standardisation is the Gaussian foundation of batch normalisation in NNs.')
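The double-factorial pattern in part (a) extends to every even moment. A quick sketch (not part of the exercise) checking it against scipy.special.factorial2, reusing the X samples from the solution cell:

```python
import numpy as np
from scipy.special import factorial2

for k in [2, 4, 6, 8]:
    analytic = factorial2(k - 1)  # (k-1)!! = 1·3·5·...·(k-1) for even k
    mc = (X ** k).mean()
    print(f"E[X^{k}]: (k-1)!! = {analytic:.0f}, Monte Carlo ≈ {mc:.2f}")
```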
Exercise 7 ★★★ — Cross-Entropy Loss and Gradient
Consider a 3-class softmax classifier. The output logits are z = (z_0, z_1, z_2) and the predicted probabilities are p̂ = softmax(z).
Part (a): Implement softmax(z) numerically stably (subtract max before exponentiation).
Part (b): Implement cross_entropy_loss(logits, y_true) where y_true is a one-hot vector.
Part (c): Compute the gradient of the cross-entropy loss with respect to the logits: ∂L/∂z_i = p̂_i - y_i.
Implement cross_entropy_gradient(logits, y_true) and verify numerically via finite differences.
Part (d): Plot the loss surface as a function of z_0 (holding z_1 and z_2 fixed) for the true class being class 0. Observe how the gradient drives z_0 to be large.
Code cell 23
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 24
# Solution
import numpy as np
np.random.seed(42)
header = lambda t: print('\n' + '='*len(t) + '\n' + t + '\n' + '='*len(t))
def check_close(name, got, expected, tol=1e-5):
ok = np.allclose(got, expected, atol=tol)
print(f"{'PASS' if ok else 'FAIL'} — {name}")
def softmax(z):
z_stable = z - z.max() # numerical stability
exp_z = np.exp(z_stable)
return exp_z / exp_z.sum()
def cross_entropy_loss(logits, y_true):
p = softmax(logits)
return -np.sum(y_true * np.log(p + 1e-15))
def cross_entropy_gradient(logits, y_true):
p = softmax(logits)
return p - y_true # elegant closed form
# Numerical gradient via finite differences
def numerical_gradient(f, z, eps=1e-5):
grad = np.zeros_like(z)
for i in range(len(z)):
z_plus = z.copy(); z_plus[i] += eps
z_minus = z.copy(); z_minus[i] -= eps
grad[i] = (f(z_plus) - f(z_minus)) / (2 * eps)
return grad
z = np.array([2.0, 1.0, 0.5])
y = np.array([1.0, 0.0, 0.0])
p_hat = softmax(z)
loss = cross_entropy_loss(z, y)
grad = cross_entropy_gradient(z, y)
num_grad = numerical_gradient(lambda z_: cross_entropy_loss(z_, y), z)
header('Exercise 7: Cross-Entropy Loss')
print(f'softmax(z) = {p_hat}')
print(f'loss = {loss:.4f} (= -log({p_hat[0]:.4f}) = {-np.log(p_hat[0]):.4f})')
check_close('Softmax sums to 1', p_hat.sum(), 1.0)
check_close('Analytical grad = numerical grad', grad, num_grad, tol=1e-5)
print(f' Analytical: {grad}')
print(f' Numerical: {num_grad}')
# Verify: gradient is p_hat - y
expected_grad = p_hat - y
check_close('Gradient = p_hat - y', grad, expected_grad)
print("\nTakeaway: The cross-entropy gradient p̂ - y is the model's prediction error"
      " in probability space — it pushes logits toward the true class, a direct probabilistic signal.")
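Part (d)'s plot is not included in the reference solution above. A minimal sketch, assuming matplotlib and reusing cross_entropy_loss and y from the cell:

```python
import numpy as np
import matplotlib.pyplot as plt

z0_range = np.linspace(-4.0, 6.0, 200)
losses = [cross_entropy_loss(np.array([z0, 1.0, 0.5]), y) for z0 in z0_range]
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(z0_range, losses, lw=2)
ax.set_xlabel('z_0 (true-class logit)'); ax.set_ylabel('cross-entropy loss')
ax.set_title('Loss vs. true-class logit (z_1 = 1.0, z_2 = 0.5 fixed)')
plt.tight_layout(); plt.show()
```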
Exercise 8 ★★★ — KL Divergence and the VAE Objective
The VAE evidence lower bound (ELBO) contains a KL divergence term, which has a closed form for Gaussians: KL(N(μ, σ²) || N(0, 1)) = 0.5 · (σ² + μ² - 1 - log σ²).
Part (a): Implement this closed-form KL divergence for Gaussians.
Part (b): Verify that KL = 0 when (μ, σ) = (0, 1), and that KL > 0 otherwise.
Part (c): Estimate the KL divergence via Monte Carlo: KL(p || q) ≈ (1/n) Σᵢ [log p(zᵢ) - log q(zᵢ)] with zᵢ ~ p, where p = N(μ, σ²) and q = N(0, 1).
Part (d): Plot the KL divergence landscape as μ varies in [-3, 3] with σ = 1, and as σ varies in (0, 3] with μ = 0. Identify the minimum in each case.
Code cell 26
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 27
# Solution
import numpy as np
from scipy import stats
np.random.seed(42)
header = lambda t: print('\n' + '='*len(t) + '\n' + t + '\n' + '='*len(t))
def check_close(name, got, expected, tol=1e-5):
ok = abs(got - expected) < tol
print(f"{'PASS' if ok else 'FAIL'} — {name}: {got:.6f}")
def kl_gaussian(mu, sigma):
"""KL(N(mu, sigma^2) || N(0,1)) = 0.5*(sigma^2 + mu^2 - 1 - log(sigma^2))."""
return 0.5 * (sigma**2 + mu**2 - 1 - 2*np.log(sigma))
def kl_gaussian_mc(mu, sigma, n=100000):
z = np.random.normal(mu, sigma, n)
log_p = stats.norm.logpdf(z, mu, sigma)
log_q = stats.norm.logpdf(z, 0, 1)
return (log_p - log_q).mean()
header('Exercise 8: KL Divergence and VAE')
# Part (b): KL = 0 at N(0,1)
check_close('KL(N(0,1)||N(0,1)) = 0', kl_gaussian(0, 1), 0.0)
# Various points
test_cases = [(1, 1), (2, 1), (0, 2), (0, 0.5)]
print('\nKL values (closed form vs Monte Carlo):')
for mu, sigma in test_cases:
kl_cf = kl_gaussian(mu, sigma)
kl_mc = kl_gaussian_mc(mu, sigma)
ok = abs(kl_cf - kl_mc) < 0.05
print(f' N({mu},{sigma}²): closed={kl_cf:.4f}, MC={kl_mc:.4f} {"PASS" if ok else "FAIL"}')
try:
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
mu_range = np.linspace(-3, 3, 200)
kl_mu = [kl_gaussian(m, 1.0) for m in mu_range]
axes[0].plot(mu_range, kl_mu, 'steelblue', lw=2)
axes[0].axvline(0, color='red', ls='--', alpha=0.5, label='min at mu=0')
axes[0].set_xlabel('mu'); axes[0].set_ylabel('KL divergence')
axes[0].set_title('KL vs. mu (sigma=1)'); axes[0].legend()
sigma_range = np.linspace(0.1, 3, 200)
kl_sigma = [kl_gaussian(0.0, s) for s in sigma_range]
axes[1].plot(sigma_range, kl_sigma, 'darkorange', lw=2)
axes[1].axvline(1.0, color='red', ls='--', alpha=0.5, label='min at sigma=1')
axes[1].set_xlabel('sigma'); axes[1].set_ylabel('KL divergence')
axes[1].set_title('KL vs. sigma (mu=0)'); axes[1].legend()
plt.suptitle('KL(N(mu,sigma²)||N(0,1)) — VAE Regularisation Term', fontsize=13)
plt.tight_layout(); plt.show()
except ImportError:
pass
print('\nTakeaway: The VAE KL term penalises the encoder for encoding to a distribution'
' far from N(0,1). The closed form makes this differentiable — enabling reparameterised'
' gradients to flow through the sampling operation.')
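In practice a VAE latent is a vector and the encoder outputs log-variances, so the KL term is summed over dimensions. A hedged sketch of that vectorised form (kl_diag_gaussian and the example values are illustrative, not from the exercise):

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    """Sum over dimensions of KL(N(mu_i, exp(logvar_i)) || N(0, 1))."""
    # same closed form as above, with logvar = 2*log(sigma)
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

mu = np.array([0.5, -0.2, 0.0])      # example encoder means
logvar = np.array([0.0, -1.0, 0.5])  # example encoder log-variances
print(f"KL over 3 latent dimensions = {kl_diag_gaussian(mu, logvar):.4f}")
```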
Exercise 9 ★★★ — Mixture Models and the Law of Total Probability
A monitoring system receives examples from two data sources. The clean source (S = 0) emits X ~ N(0, 1). The shifted source (S = 1) emits X ~ N(3, 1). The shifted source appears with probability P(S = 1) = 0.25.
Part (a): Compute P(X > 2) using the law of total probability.
Part (b): Compute P(S = 1 | X > 2) using Bayes' theorem.
Part (c): Verify both probabilities by Monte Carlo simulation.
Part (d): Interpret why this calculation matters for distribution-shift detection.
Code cell 29
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 30
# Solution
import numpy as np
from scipy import stats
np.random.seed(42)
p_shift = 0.25
threshold = 2.0
p_tail_clean = 1 - stats.norm.cdf(threshold, loc=0, scale=1)
p_tail_shift = 1 - stats.norm.cdf(threshold, loc=3, scale=1)
p_tail = (1 - p_shift) * p_tail_clean + p_shift * p_tail_shift
p_shift_given_tail = p_shift * p_tail_shift / p_tail
n = 300000
S = (np.random.rand(n) < p_shift).astype(int)
X = np.random.normal(0, 1, n)
X[S == 1] = np.random.normal(3, 1, (S == 1).sum())
mask = X > threshold
p_tail_mc = mask.mean()
p_shift_given_tail_mc = S[mask].mean()
header('Exercise 9: Mixture Models and Total Probability')
print(f'P(X>2) analytic: {p_tail:.4f}')
print(f'P(X>2) Monte Carlo: {p_tail_mc:.4f}')
print(f'P(S=1 | X>2) analytic: {p_shift_given_tail:.4f}')
print(f'P(S=1 | X>2) Monte Carlo: {p_shift_given_tail_mc:.4f}')
check_close('Tail probability matches simulation', p_tail_mc, p_tail, tol=0.01)
check_close('Posterior source probability matches simulation', p_shift_given_tail_mc, p_shift_given_tail, tol=0.02)
print('Takeaway: a rare shifted source can dominate tail events, which is why tail monitoring is useful for ML drift detection.')
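To make part (d) concrete, a small threshold sweep (a sketch reusing p_shift from the solution cell) shows the posterior probability of the shifted source growing as the threshold moves further into the tail:

```python
from scipy import stats

for t in [0.0, 1.0, 2.0, 3.0, 4.0]:
    tail_clean = stats.norm.sf(t, loc=0, scale=1)  # P(X > t | S = 0)
    tail_shift = stats.norm.sf(t, loc=3, scale=1)  # P(X > t | S = 1)
    tail = (1 - p_shift) * tail_clean + p_shift * tail_shift
    print(f"t = {t:.1f}: P(S=1 | X > t) = {p_shift * tail_shift / tail:.4f}")
```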
Exercise 10 ★★★ — Categorical Sampling and Temperature
A language model produces logits z = (3.0, 1.0, 0.0, -1.0) for four tokens.
Part (a): Convert logits to probabilities with temperature T: p_i = exp(z_i / T) / Σ_j exp(z_j / T).
Part (b): Compare entropy and maximum probability for T ∈ {0.5, 1.0, 2.0}.
Part (c): Sample 20,000 tokens for each temperature and verify empirical frequencies.
Part (d): Explain why temperature controls diversity in generation.
Code cell 32
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 33
# Solution
import numpy as np
np.random.seed(42)
def softmax_temperature(logits, T):
z = np.asarray(logits, dtype=float) / T
z = z - z.max()
p = np.exp(z)
return p / p.sum()
def entropy(p):
p = np.asarray(p)
return -np.sum(p * np.log2(p + 1e-15))
logits = np.array([3.0, 1.0, 0.0, -1.0])
temps = [0.5, 1.0, 2.0]
header('Exercise 10: Categorical Sampling and Temperature')
for T in temps:
p = softmax_temperature(logits, T)
samples = np.random.choice(len(p), size=20000, p=p)
freq = np.bincount(samples, minlength=len(p)) / len(samples)
print(f'T={T}: p={p.round(4)}, entropy={entropy(p):.3f} bits, max={p.max():.3f}')
print(f' empirical={freq.round(4)}')
check_true(f'empirical frequencies close for T={T}', np.allclose(freq, p, atol=0.015))
p_low = softmax_temperature(logits, 0.5)
p_high = softmax_temperature(logits, 2.0)
check_true('Higher temperature increases entropy', entropy(p_high) > entropy(p_low))
print('Takeaway: lower temperature concentrates probability on the top token; higher temperature spreads mass and increases diversity.')
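For part (d), a wider temperature sweep (a sketch reusing softmax_temperature, entropy, and logits from the cell above) makes the diversity effect explicit:

```python
for T in [0.1, 0.5, 1.0, 2.0, 5.0]:
    p = softmax_temperature(logits, T)
    print(f"T = {T:<3}: max prob = {p.max():.3f}, entropy = {entropy(p):.3f} bits")
```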
What to Review After Finishing
- Kolmogorov axioms — can you derive the complement rule and inclusion-exclusion from them?
- Conditional probability — can you apply Bayes' theorem to a novel problem?
- CDF/PDF/PMF relationship — can you go from PDF to CDF and back?
- Independence — do you understand why pairwise ≠ mutual independence?
- Expectation / LOTUS — can you compute E[g(X)] without finding the distribution of g(X)?
- Gaussian distribution — do you know the 68-95-99.7 rule and how to standardise?
- Cross-entropy — can you derive the gradient of the softmax cross-entropy loss?
- KL divergence — can you write the closed form for two Gaussians?
- Mixture total probability — can you explain the probabilistic calculation and its ML relevance?
- Categorical temperature — can you connect the computation to model behaviour?
References
- Kolmogorov, A.N. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung.
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning, Ch. 1–2.
- Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, Ch. 3.
- MacKay, D.J.C. (2003). Information Theory, Inference, and Learning Algorithms, Ch. 2.