Exercises Notebook
Converted from exercises.ipynb for web reading.
Joint Distributions — Exercises
10 exercises covering joint PMF/PDF/CDF, marginalisation, conditionals, independence, covariance matrices, the multivariate Gaussian, the reparameterisation trick, and Naive Bayes classification.
| Format | Description |
|---|---|
| Problem | Markdown cell with task description |
| Your Solution | Code cell with scaffolding |
| Solution | Code cell with reference solution and checks |
Difficulty Levels
| Level | Exercises | Focus |
|---|---|---|
| ★ | 1-3 | Core mechanics: PMF, marginals, conditionals |
| ★★ | 4-6 | Independence, covariance matrices, MVN |
| ★★★ | 7-10 | Reparameterisation trick, Naive Bayes |
Topic Map
| Topic | Exercise |
|---|---|
| Joint PMF and marginalisation | 1 |
| Continuous marginals and conditionals | 2 |
| Chain rule and sequence probability | 3 |
| Independence testing | 4 |
| Covariance matrix and PSD | 5 |
| MVN conditionals (Schur complement) | 6 |
| Reparameterisation trick | 7 |
| Naive Bayes classification | 8 |
| Collider conditional dependence | 9 |
| Autoregressive factorization | 10 |
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 13,
"axes.titlesize": 15,
"axes.labelsize": 13,
"xtick.labelsize": 11,
"ytick.labelsize": 11,
"legend.fontsize": 11,
"legend.framealpha": 0.85,
"lines.linewidth": 2.0,
"axes.spines.top": False,
"axes.spines.right": False,
"savefig.bbox": "tight",
"savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
import numpy.linalg as la
from scipy import stats
try:
import matplotlib.pyplot as plt
HAS_MPL = True
except ImportError:
HAS_MPL = False
np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)
def header(title):
print('\n' + '=' * len(title))
print(title)
print('=' * len(title))
def check_close(name, got, expected, tol=1e-8):
ok = np.allclose(got, expected, atol=tol, rtol=tol)
print(f"{'PASS' if ok else 'FAIL'} — {name}")
if not ok:
print(' expected:', expected)
print(' got :', got)
return ok
def check_true(name, cond):
print(f"{'PASS' if cond else 'FAIL'} — {name}")
return cond
print('Setup complete.')
Exercise 1 ★ — Joint PMF and Marginalisation
Two fair dice are rolled. Let $X$ be the value of die 1 and $Y$ be the value of die 2.
Part (a): Write down the joint PMF $p(x, y)$ as a NumPy array.
Part (b): Compute the marginal PMFs $p_X(x)$ and $p_Y(y)$ by summing.
Part (c): Let $S = X + Y$. Compute $P(S = 7)$ by marginalising the joint.
Part (d): Compute $E[S]$ and $\mathrm{Var}(S)$ from the joint PMF. Verify that $E[S] = E[X] + E[Y]$ and $\mathrm{Var}(S) = \mathrm{Var}(X) + \mathrm{Var}(Y)$ (since $X \perp Y$ for fair dice).
Code cell 5
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 6
# Solution
import numpy as np
header('Exercise 1: Joint PMF and Marginalisation')
# Part (a)
joint_pmf = np.ones((6, 6)) / 36
check_close('joint sums to 1', joint_pmf.sum(), 1.0)
check_true('shape is (6,6)', joint_pmf.shape == (6, 6))
# Part (b)
marginal_x = joint_pmf.sum(axis=1) # sum over Y
marginal_y = joint_pmf.sum(axis=0) # sum over X
check_close('marginal_x is uniform 1/6', marginal_x, np.ones(6)/6)
check_close('marginal_y is uniform 1/6', marginal_y, np.ones(6)/6)
# Part (c): P(S=7) — count cells where x+y=7
vals = np.arange(1, 7)
X_grid, Y_grid = np.meshgrid(vals, vals)
p_s7 = joint_pmf[X_grid + Y_grid == 7].sum()
check_close('P(S=7) = 6/36', p_s7, 6/36)
# Part (d): E[S] and Var(S) from joint
# E[S] = sum_{x,y} (x+y)*p(x,y)
S_vals = X_grid + Y_grid
E_S = (S_vals * joint_pmf).sum()
E_S2 = (S_vals**2 * joint_pmf).sum()
Var_S = E_S2 - E_S**2
check_close('E[S] = 7', E_S, 7.0)
check_close('Var[S] = Var[X]+Var[Y] = 35/6', Var_S, 35/6)
# Verify additivity (independence)
E_X = (vals * marginal_x).sum()
Var_X = ((vals**2) * marginal_x).sum() - E_X**2
check_close('E[S] = E[X]+E[Y]', E_S, 2*E_X)
check_close('Var[S] = Var[X]+Var[Y] (independence)', Var_S, 2*Var_X)
print('\nTakeaway: For independent X,Y, the joint factors as p(x)p(y), which lets'
' us compute moments of X+Y using marginals alone.')
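As a quick cross-check of that takeaway, the full PMF of $S$ can be rebuilt by convolving the two marginals, which is valid precisely because $X$ and $Y$ are independent. A minimal sketch reusing the arrays from the solution above:
# Independence means p_S = p_X * p_Y (discrete convolution)
pmf_S = np.convolve(marginal_x, marginal_y)   # length 11, supports s = 2..12
s_support = np.arange(2, 13)
check_close('P(S=7) via convolution', pmf_S[7 - 2], 6/36)
check_close('E[S] via convolution', (s_support * pmf_S).sum(), 7.0)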
Exercise 2 ★ — Continuous Marginals and Conditionals
Let $(X, Y)$ have joint PDF $f(x, y) = e^{-y}$ for $0 \le x \le y < \infty$.
Part (a): Verify that $f$ is a valid PDF (integrates to 1).
Part (b): Find $f_X(x)$ (the marginal of $X$). What distribution does $X$ follow?
Part (c): Find $f_{Y \mid X}(y \mid x)$. What standard distribution is $Y \mid X = x$?
Part (d): Compute $E[Y \mid X = x]$ and verify numerically by sampling.
Code cell 8
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 9
# Solution
import numpy as np
from scipy.integrate import dblquad, quad
np.random.seed(42)
header('Exercise 2: Continuous Marginals and Conditionals')
# Part (a): Normalisation
def joint_pdf_2(y, x):
return np.exp(-y) if 0 <= x <= y else 0.0
norm, _ = dblquad(joint_pdf_2, 0, 50, lambda x: x, lambda x: 50)
check_close('normalisation = 1', norm, 1.0, tol=1e-6)
# Part (b): Marginal f_X(x) = int_x^inf e^{-y} dy = e^{-x}
# So X ~ Exp(1)
def marginal_x_2(x):
return np.exp(-x) # Exp(1) PDF
mx_check, _ = quad(marginal_x_2, 0, 50)
check_close('marginal_x integrates to 1', mx_check, 1.0, tol=1e-6)
print('X ~ Exp(1) (marginal is exponential)')
# Part (c): f_{Y|X}(y|x) = e^{-y} / e^{-x} = e^{-(y-x)}, y >= x
# This is a Shifted Exp(1): Y|X=x ~ x + Exp(1)
def cond_pdf_y_given_x(y, x):
if y < x:
return 0.0
return np.exp(-(y - x)) # Exp(1) shifted by x
x_test = 2.0
cond_check, _ = quad(lambda y: cond_pdf_y_given_x(y, x_test), x_test, 100)
check_close('conditional PDF integrates to 1 given X=2', cond_check, 1.0, tol=1e-6)
print('Y|X=x ~ x + Exp(1) (shifted exponential)')
# Part (d): E[Y|X=x] = x + 1 (mean of x + Exp(1))
E_Y_given_x = lambda x: x + 1.0
check_close('E[Y|X=2] = 3', E_Y_given_x(x_test), 3.0)
# Verify by sampling from the joint: draw X ~ Exp(1), then Y = X + Exp(1)
N = 200000
X_samp2 = np.random.exponential(1, N)  # X ~ Exp(1) (the marginal from part b)
extra = np.random.exponential(1, N)    # independent Exp(1) increment
Y_samp2 = X_samp2 + extra              # Y | X = x ~ x + Exp(1) (part c)
tol_x = 0.1
mask = np.abs(X_samp2 - x_test) < tol_x
EY_emp = Y_samp2[mask].mean()
print(f'\nE[Y|X≈{x_test}] empirical: {EY_emp:.4f} (theory: {E_Y_given_x(x_test):.4f})')
check_true('empirical close to theory', abs(EY_emp - E_Y_given_x(x_test)) < 0.15)
print('\nTakeaway: f_{Y|X}(y|x) = f(x,y)/f_X(x) is a density in y, not a number.'
' Here it revealed Y|X ~ x+Exp(1): conditioning on X shifts the distribution.')
Exercise 3 ★ — Chain Rule and Sequence Probability
Consider a trigram language model over a 4-word vocabulary $\{\langle s\rangle, \text{cat}, \text{sat}, \text{mat}\}$.
Part (a): Given the conditional probabilities below, compute $P(\text{cat sat mat})$ via the chain rule.
Part (b): Compute the cross-entropy $H = -\log_2 P(\text{cat sat mat})$ in bits.
Part (c): A bigram model gives $P = 0.042$. A trigram model gives $P = 0.031$. Compute the sentence perplexity $\mathrm{PP} = 2^{-\log_2 P} = 1/P$ for each model (no per-token normalisation here). Which model assigns higher probability to this sentence?
Conditional probabilities: $P(\text{cat} \mid \langle s\rangle) = 0.6$, $P(\text{sat} \mid \langle s\rangle\,\text{cat}) = 0.5$, $P(\text{mat} \mid \text{cat sat}) = 0.7$.
Code cell 11
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 12
# Solution
import numpy as np
header('Exercise 3: Chain Rule and Sequence Probability')
P_cat_given_s = 0.6
P_sat_given_cat = 0.5
P_mat_given_sat = 0.7
# Part (a)
P_sentence = P_cat_given_s * P_sat_given_cat * P_mat_given_sat
check_close('P(cat sat mat) = 0.6*0.5*0.7', P_sentence, 0.6*0.5*0.7)
# Part (b)
H_bits = -np.log2(P_sentence)
print(f'H = {H_bits:.4f} bits (= -log2({P_sentence:.4f}))')
check_close('cross-entropy bits', H_bits, -np.log2(0.21))
# Part (c)
P2 = 0.042; P3 = 0.031
PP2 = 2**(-np.log2(P2))
PP3 = 2**(-np.log2(P3))
# Simpler: PP = 1/P
PP2_simple = 1/P2; PP3_simple = 1/P3
check_close('PP2 = 1/P2', PP2, PP2_simple)
check_close('PP3 = 1/P3', PP3, PP3_simple)
print(f'Bigram perplexity: {PP2:.2f}')
print(f'Trigram perplexity: {PP3:.2f}')
check_true('bigram has lower perplexity (higher prob)', PP2 < PP3)
print('\nTakeaway: Lower perplexity = higher probability assigned to sentence.'
' A good language model should minimise perplexity on held-out text.'
' Chain rule ensures any joint factorisation gives the same P(sentence).')
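The last sentence of the takeaway can be checked mechanically: any factorisation order of the chain rule rebuilds the same joint. A tiny sketch with an illustrative 2×2 joint (not part of the exercise data):
# Chain rule is order-agnostic: p(x,y) = p(x)p(y|x) = p(y)p(x|y)
P_toy = np.array([[0.1, 0.3], [0.2, 0.4]])
px_t = P_toy.sum(axis=1); py_t = P_toy.sum(axis=0)
check_close('p(x)p(y|x) rebuilds joint', px_t[:, None] * (P_toy / px_t[:, None]), P_toy)
check_close('p(y)p(x|y) rebuilds joint', py_t[None, :] * (P_toy / py_t[None, :]), P_toy)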
Exercise 4 ★★ — Independence Testing
Part (a): For the joint PMF below, determine whether $X \perp Y$ by checking the factorisation condition $p(x, y) = p_X(x)\,p_Y(y)$.
(rows = $x \in \{0, 1, 2\}$, columns = $y \in \{0, 1, 2\}$)
$$P = \begin{pmatrix} 0.10 & 0.15 & 0.05 \\ 0.20 & 0.30 & 0.10 \\ 0.04 & 0.06 & 0.00 \end{pmatrix}$$
Part (b): Compute $\mathrm{Cov}(X, Y)$ from the joint PMF.
Part (c): Compute the mutual information $I(X; Y) = \sum_{x, y} p(x, y)\,\log\frac{p(x, y)}{p_X(x)\,p_Y(y)}$. What does $I(X; Y) = 0$ imply? What does $\mathrm{Cov}(X, Y) = 0$ imply?
Code cell 14
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 15
# Solution
import numpy as np
header('Exercise 4: Independence Testing')
P = np.array([
[0.10, 0.15, 0.05],
[0.20, 0.30, 0.10],
[0.04, 0.06, 0.00]
])
check_close('P sums to 1', P.sum(), 1.0)
# Part (a)
px = P.sum(axis=1) # marginal X: sum over columns (Y)
py = P.sum(axis=0) # marginal Y: sum over rows (X)
P_indep = np.outer(px, py)
is_independent = np.allclose(P, P_indep)
print(f'Independent: {is_independent}')
print(f'Max deviation from independence: {np.abs(P - P_indep).max():.4f}')
check_true('NOT independent', not is_independent)
# Part (b): Cov(X, Y)
vals_x = np.arange(3)
vals_y = np.arange(3)
EX = (vals_x * px).sum()
EY = (vals_y * py).sum()
EXY = sum(vals_x[x]*vals_y[y]*P[x,y] for x in range(3) for y in range(3))
cov_xy = EXY - EX*EY
print(f'E[X]={EX:.4f}, E[Y]={EY:.4f}, E[XY]={EXY:.4f}')
print(f'Cov(X,Y) = {cov_xy:.6f}')
# Part (c): Mutual information I(X;Y)
MI = 0.0
for x in range(3):
for y in range(3):
if P[x,y] > 0:
MI += P[x,y] * np.log(P[x,y] / (px[x]*py[y]))
print(f'I(X;Y) = {MI:.6f} nats')
check_true('MI >= 0', MI >= -1e-10)
check_true('MI > 0 (not independent)', MI > 0.001)
print('\nTakeaway: Cov(X,Y)=0 does NOT imply I(X;Y)=0 in general.'
' But I(X;Y)=0 iff X and Y are independent.'
' Mutual information is the gold standard for detecting dependence.')
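The review list at the end asks for a counterexample to "zero covariance implies independence"; a sketch with $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$, where symmetry kills the covariance but mutual information still detects the (deterministic) dependence:
# Counterexample: Cov(X,Y) = 0 yet Y = X^2 is fully determined by X
xs2, ys2 = np.array([-1, 0, 1]), np.array([0, 1])
P2c = np.array([[0, 1/3], [1/3, 0], [0, 1/3]])  # rows: x; cols: y
px2 = P2c.sum(axis=1); py2 = P2c.sum(axis=0)
EXY2 = sum(xs2[i]*ys2[j]*P2c[i, j] for i in range(3) for j in range(2))
cov2 = EXY2 - (xs2*px2).sum() * (ys2*py2).sum()
MI2 = sum(P2c[i, j]*np.log(P2c[i, j]/(px2[i]*py2[j]))
          for i in range(3) for j in range(2) if P2c[i, j] > 0)
check_close('Cov = 0 despite dependence', cov2, 0.0)
check_true('MI > 0 detects the dependence', MI2 > 0.1)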
Exercise 5 ★★ — Covariance Matrix and PSD
Part (a): Construct a valid $4 \times 4$ covariance matrix $\Sigma$ with unit diagonal (i.e., all variances = 1) and off-diagonal entries chosen so that the matrix is PSD. Verify PSD via Cholesky decomposition.
Part (b): Sample 10,000 points from $\mathcal{N}(\mathbf{0}, \Sigma)$ and verify that the sample covariance matrix is close to $\Sigma$.
Part (c): Compute the correlation matrix from $\Sigma$ (i.e., normalise by the diagonal). What is always true about the diagonal of a correlation matrix?
Part (d): The matrix $\Sigma + \alpha I$ is PSD for any $\alpha \ge 0$. This is Tikhonov regularisation. Show that adding $\alpha I$ increases all eigenvalues by $\alpha$ without changing eigenvectors. Why is this useful in ML (hint: condition number)?
Code cell 17
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 18
# Solution
import numpy as np
import numpy.linalg as la
np.random.seed(42)
header('Exercise 5: Covariance Matrix and PSD')
# Part (a)
L = np.array([[1,0,0,0],[0.3,1,0,0],[0.2,0.1,1,0],[0.4,0.2,0.3,1]], dtype=float)
Sigma_raw = L @ L.T
D_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma_raw)))
Sigma = D_inv @ Sigma_raw @ D_inv # unit diagonal
check_close('diagonal is all 1s', np.diag(Sigma), np.ones(4))
eigvals = la.eigvalsh(Sigma)
check_true('all eigenvalues > 0 (positive definite, hence PSD)', (eigvals > 0).all())
try:
la.cholesky(Sigma)
print('PASS — Cholesky succeeded (PSD confirmed)')
except la.LinAlgError:
print('FAIL — not PSD')
# Part (b)
mu = np.zeros(4)
samples = np.random.multivariate_normal(mu, Sigma, 10000)
Sigma_est = np.cov(samples.T)
max_err = np.abs(Sigma_est - Sigma).max()
print(f'\nMax entry error between estimated and true Sigma: {max_err:.4f}')
check_true('estimated close to true (max err < 0.05)', max_err < 0.05)
# Part (c): Correlation matrix
D_inv2 = np.diag(1.0 / np.sqrt(np.diag(Sigma_est)))
Corr = D_inv2 @ Sigma_est @ D_inv2
check_close('correlation matrix diagonal is 1', np.diag(Corr), np.ones(4), tol=0.05)
print('Correlation matrix (diagonal = 1 by construction)')
# Part (d): Tikhonov regularisation
alpha = 0.1
Sigma_reg = Sigma + alpha * np.eye(4)
eigs_orig = np.sort(la.eigvalsh(Sigma))
eigs_reg = np.sort(la.eigvalsh(Sigma_reg))
check_close('eigenvalues shift by alpha', eigs_reg - eigs_orig, alpha * np.ones(4))
cond_orig = eigs_orig.max() / eigs_orig.min()
cond_reg = eigs_reg.max() / eigs_reg.min()
print(f'Condition number: original={cond_orig:.2f}, regularised={cond_reg:.2f}')
check_true('regularisation reduces condition number', cond_reg < cond_orig)
print('\nTakeaway: Adding alpha*I to Sigma is the most common form of covariance'
' regularisation. It prevents near-singular inverses by ensuring all eigenvalues'
' are at least alpha. Used in ridge regression, Gaussian processes, and Adam optimizer.')
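Part (d)'s eigenvalue claim is one line of algebra via the spectral decomposition; a sketch of what the numerical check above verifies:
$$\Sigma = Q \Lambda Q^\top \;\Rightarrow\; \Sigma + \alpha I = Q \Lambda Q^\top + \alpha\, Q Q^\top = Q\,(\Lambda + \alpha I)\,Q^\top,$$
so the eigenvectors $Q$ are unchanged, every eigenvalue shifts $\lambda_i \mapsto \lambda_i + \alpha$, and the condition number falls from $\lambda_{\max}/\lambda_{\min}$ to $(\lambda_{\max} + \alpha)/(\lambda_{\min} + \alpha)$.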
Exercise 6 ★★ — MVN Conditionals (Schur Complement)
Let $(X_1, X_2, X_3) \sim \mathcal{N}(\mu, \Sigma)$ with
$$\mu = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 4 & 2 & 1 \\ 2 & 3 & 1 \\ 1 & 1 & 2 \end{pmatrix}.$$
Part (a): Compute the distribution of $X_1 \mid X_2 = 2.5,\; X_3 = 2.5$ using the Schur complement formula.
Part (b): Verify your answer by sampling: draw samples from the joint, keep those with $X_2 \approx 2.5$ and $X_3 \approx 2.5$, and compare the empirical conditional mean and variance.
Part (c): Compute the partial correlation $\rho_{12 \cdot 3}$. How does it differ from the marginal correlation $\rho_{12}$?
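For reference, the Schur complement formula the exercise relies on, for a partition $x = (x_1, x_2)$ with matching blocks of $\mu$ and $\Sigma$:
$$x_1 \mid x_2 \;\sim\; \mathcal{N}\!\big(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\;\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big).$$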
Code cell 20
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 21
# Solution
import numpy as np
np.random.seed(42)
header('Exercise 6: MVN Conditionals (Schur Complement)')
mu_full = np.array([1.0, 2.0, 3.0])
Sigma_full = np.array([[4.,2.,1.],[2.,3.,1.],[1.,1.,2.]])
x2_obs = np.array([2.5, 2.5])
# Partition
idx1, idx2 = [0], [1, 2]
mu1 = mu_full[idx1]; mu2 = mu_full[idx2]
S11 = Sigma_full[np.ix_(idx1, idx1)]
S12 = Sigma_full[np.ix_(idx1, idx2)]
S21 = Sigma_full[np.ix_(idx2, idx1)]
S22 = Sigma_full[np.ix_(idx2, idx2)]
S22_inv = np.linalg.inv(S22)
mu_cond = mu1 + S12 @ S22_inv @ (x2_obs - mu2)
Sigma_cond = S11 - S12 @ S22_inv @ S21
print(f'X1 | X2=2.5, X3=2.5:')
print(f' mu_cond = {mu_cond[0]:.4f}')
print(f' sigma_cond = {np.sqrt(Sigma_cond[0,0]):.4f}')
# Part (b): Verify by sampling
N = 500000
samp = np.random.multivariate_normal(mu_full, Sigma_full, N)
tol = 0.08
mask = (np.abs(samp[:,1] - x2_obs[0]) < tol) & (np.abs(samp[:,2] - x2_obs[1]) < tol)
X1_cond = samp[mask, 0]
print(f'\nSampling ({mask.sum()} matching samples):')
print(f' emp mean = {X1_cond.mean():.4f} (theory: {mu_cond[0]:.4f})')
print(f' emp std = {X1_cond.std():.4f} (theory: {np.sqrt(Sigma_cond[0,0]):.4f})')
check_true('empirical mean close to theory', abs(X1_cond.mean() - mu_cond[0]) < 0.2)
check_true('empirical std close to theory', abs(X1_cond.std() - np.sqrt(Sigma_cond[0,0])) < 0.2)
# Part (c): Partial correlation rho_{12.3}
# Compute conditional covariance of (X1,X2) given X3
idx12, idx3 = [0,1], [2]
S_12_12 = Sigma_full[np.ix_(idx12, idx12)]
S_12_3 = Sigma_full[np.ix_(idx12, idx3)]
S_3_3 = Sigma_full[np.ix_(idx3, idx3)]
Sigma_cond_12 = S_12_12 - S_12_3 @ np.linalg.inv(S_3_3) @ S_12_3.T
rho_12_partial = Sigma_cond_12[0,1] / np.sqrt(Sigma_cond_12[0,0] * Sigma_cond_12[1,1])
rho_12_marginal = Sigma_full[0,1] / np.sqrt(Sigma_full[0,0] * Sigma_full[1,1])
print(f'\nMarginal correlation rho_12 = {rho_12_marginal:.4f}')
print(f'Partial correlation rho_12.3 = {rho_12_partial:.4f}')
print('\nTakeaway: Partial correlation removes the confounding effect of X3.'
' If rho_12 > rho_12.3, part of the X1-X2 correlation was mediated by X3.'
' This is the MVN version of controlling for a covariate.')
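As a cross-check on part (c), the partial correlation also has a standard closed form in terms of pairwise correlations, equivalent to the Schur-complement route used above (both give $\approx 0.507$ here, versus $\rho_{12} \approx 0.577$):
$$\rho_{12 \cdot 3} = \frac{\rho_{12} - \rho_{13}\,\rho_{23}}{\sqrt{(1 - \rho_{13}^2)(1 - \rho_{23}^2)}}.$$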
Exercise 7 ★★★ — Reparameterisation Trick
The reparameterisation trick is the engine of VAE training.
Part (a): Implement two gradient estimators for $\nabla_\mu\, \mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[\sin z]$:
- REINFORCE (score function): $\nabla_\mu \mathbb{E}[f(z)] = \mathbb{E}\big[f(z)\,\nabla_\mu \log p(z; \mu, \sigma)\big] = \mathbb{E}\big[\sin(z)\,(z - \mu)/\sigma^2\big]$
- Reparameterisation: $z = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, so $\nabla_\mu \mathbb{E}[f(z)] = \mathbb{E}\big[\cos(\mu + \sigma\epsilon)\big]$
Compare the variance of the two estimators over 200 Monte Carlo runs.
Part (b): Implement the VAE ELBO computation for a 1D Gaussian encoder:
$$\mathrm{ELBO} = \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] - \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big),$$
where $q(z \mid x) = \mathcal{N}(\mu, \sigma^2)$, $p(z) = \mathcal{N}(0, 1)$, and the KL has a closed form:
$$\mathrm{KL} = \tfrac{1}{2}\big(\mu^2 + \sigma^2 - \log\sigma^2 - 1\big).$$
Code cell 23
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 24
# Solution
import numpy as np
np.random.seed(42)
header('Exercise 7: Reparameterisation Trick')
mu, sigma = 1.5, 0.8
# Part (a): True gradient
N_ref = 2000000
z_ref = np.random.normal(mu, sigma, N_ref)
true_grad = np.cos(z_ref).mean() # d/dmu E[sin(z)] = E[cos(z)] by reparam
print(f'Reference gradient (N={N_ref}): {true_grad:.6f}')
# REINFORCE: grad = E[sin(z) * (z-mu)/sigma^2]
N_mc = 5000
n_runs = 200
rf_ests = []
rp_ests = []
for _ in range(n_runs):
z_rf = np.random.normal(mu, sigma, N_mc)
rf_ests.append((np.sin(z_rf) * (z_rf - mu) / sigma**2).mean())
eps = np.random.normal(0, 1, N_mc)
z_rp = mu + sigma * eps
rp_ests.append(np.cos(z_rp).mean()) # d/dmu sin(mu+sigma*eps) = cos(z)
rf_ests = np.array(rf_ests)
rp_ests = np.array(rp_ests)
print(f'\nGradient estimates (mean ± std over {n_runs} runs, N={N_mc} each):')
print(f' REINFORCE: {rf_ests.mean():.4f} ± {rf_ests.std():.4f}')
print(f' Reparameterisation: {rp_ests.mean():.4f} ± {rp_ests.std():.4f}')
print(f' True: {true_grad:.4f}')
print(f' Variance ratio (RF/RP): {rf_ests.var()/rp_ests.var():.1f}x')
check_true('reparam is closer to truth', abs(rp_ests.mean()-true_grad) < abs(rf_ests.mean()-true_grad) or
rp_ests.var() < rf_ests.var())
check_true('reparam has lower variance', rp_ests.var() < rf_ests.var())
# Part (b): VAE ELBO KL term
def kl_gaussian(mu, sigma):
# KL(N(mu,sigma^2) || N(0,1)) = 0.5*(mu^2 + sigma^2 - log(sigma^2) - 1)
return 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1.0)
# Verify KL = 0 at N(0,1)
check_close('KL(N(0,1)||N(0,1)) = 0', kl_gaussian(0.0, 1.0), 0.0)
check_true('KL >= 0 always', kl_gaussian(0.5, 0.7) >= 0)
# Monte Carlo verification
mu_t, sigma_t = 0.5, 0.7
kl_theory = kl_gaussian(mu_t, sigma_t)
# MC KL: E_{q}[log q(z) - log p(z)]
z_mc = np.random.normal(mu_t, sigma_t, 500000)
log_q = -0.5*np.log(2*np.pi*sigma_t**2) - (z_mc-mu_t)**2/(2*sigma_t**2)
log_p = -0.5*np.log(2*np.pi) - z_mc**2/2
kl_mc = (log_q - log_p).mean()
print(f'\nKL({mu_t},{sigma_t}): theory={kl_theory:.6f}, MC={kl_mc:.6f}')
check_close('KL theory matches MC', kl_theory, kl_mc, tol=0.005)
print('\nTakeaway: The reparameterisation trick replaces high-variance REINFORCE'
' with a deterministic, low-variance gradient. This is why VAEs train stably:'
' gradients flow through the deterministic mu and sigma paths, not through'
' the sampling operation itself.')
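For reference, the closed form implemented in kl_gaussian follows by expanding the log-density ratio with $q = \mathcal{N}(\mu, \sigma^2)$ and $p = \mathcal{N}(0, 1)$:
$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_q\!\left[\log\frac{q(z)}{p(z)}\right] = \mathbb{E}_q\!\left[-\tfrac{1}{2}\log\sigma^2 - \frac{(z - \mu)^2}{2\sigma^2} + \frac{z^2}{2}\right] = \tfrac{1}{2}\big(\mu^2 + \sigma^2 - \log\sigma^2 - 1\big),$$
using $\mathbb{E}_q[(z - \mu)^2] = \sigma^2$ and $\mathbb{E}_q[z^2] = \mu^2 + \sigma^2$.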
Exercise 8 ★★★ — Naive Bayes Classification
Naive Bayes uses conditional independence to classify documents:
$$P(y \mid x_1, \dots, x_d) \;\propto\; P(y) \prod_{j=1}^{d} P(x_j \mid y).$$
Part (a): Implement a Gaussian Naive Bayes classifier for a 2-class, 2-feature dataset. Train on 80% of data, evaluate on 20%.
Part (b): Show that the Naive Bayes decision boundary is linear when both classes have the same per-feature variance (i.e., derive the boundary when $\sigma_{j,0} = \sigma_{j,1}$ for every feature $j$). A derivation sketch follows the reference solution.
Part (c): Compare Naive Bayes accuracy to logistic regression on the same dataset. When does NB outperform LR and vice versa?
Code cell 26
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 27
# Solution
import numpy as np
from scipy import stats
np.random.seed(42)
header('Exercise 8: Naive Bayes Classification')
# Generate data
N_per_class = 300
mu0 = np.array([0.0, 0.0]); mu1 = np.array([2.0, 1.5])
Sigma_class = np.array([[1.5, 0.6],[0.6, 1.0]])
X0 = np.random.multivariate_normal(mu0, Sigma_class, N_per_class)
X1 = np.random.multivariate_normal(mu1, Sigma_class, N_per_class)
X_all = np.vstack([X0, X1])
y_all = np.concatenate([np.zeros(N_per_class), np.ones(N_per_class)]).astype(int)
idx = np.random.permutation(len(y_all))
train_idx = idx[:int(0.8*len(y_all))]
test_idx = idx[int(0.8*len(y_all)):]
X_train, y_train = X_all[train_idx], y_all[train_idx]
X_test, y_test = X_all[test_idx], y_all[test_idx]
# Part (a): Gaussian Naive Bayes
def gnb_train(X, y):
classes = np.unique(y)
priors = {c: (y==c).mean() for c in classes}
means = {c: X[y==c].mean(axis=0) for c in classes}
stds = {c: X[y==c].std(axis=0) for c in classes}
return classes, priors, means, stds
def gnb_predict(params, X):
classes, priors, means, stds = params
log_probs = np.zeros((len(X), len(classes)))
for i, c in enumerate(classes):
log_prior = np.log(priors[c])
log_lik = sum(stats.norm.logpdf(X[:,j], means[c][j], stds[c][j]+1e-9)
for j in range(X.shape[1]))
log_probs[:, i] = log_prior + log_lik
return classes[np.argmax(log_probs, axis=1)]
params = gnb_train(X_train, y_train)
y_pred_nb = gnb_predict(params, X_test)
acc_nb = (y_pred_nb == y_test).mean()
print(f'Gaussian Naive Bayes accuracy: {acc_nb:.4f}')
# Part (c): Compare with logistic regression (sklearn-free implementation)
# Use a simple gradient descent logistic regression
def sigmoid(z): return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))
X_tr_aug = np.hstack([X_train, np.ones((len(X_train), 1))])
X_te_aug = np.hstack([X_test, np.ones((len(X_test), 1))])
w = np.zeros(3)
lr_rate = 0.1
for _ in range(300):
p = sigmoid(X_tr_aug @ w)
grad = X_tr_aug.T @ (p - y_train) / len(y_train)
w -= lr_rate * grad
y_pred_lr = (sigmoid(X_te_aug @ w) > 0.5).astype(int)
acc_lr = (y_pred_lr == y_test).mean()
print(f'Logistic Regression accuracy: {acc_lr:.4f}')
check_true('Naive Bayes accuracy > 0.75', acc_nb > 0.75)
check_true('Logistic Regression accuracy > 0.75', acc_lr > 0.75)
print('\nNB vs LR comparison:')
print(f' NB: {acc_nb:.4f} LR: {acc_lr:.4f}')
print('NB assumes conditional independence (often wrong but fast and interpretable).')
print('LR makes no such assumption; fits the boundary directly from data.')
print('With small N or many features, NB can outperform LR (less variance).')
print('\nTakeaway: Naive Bayes encodes conditional independence as a modelling assumption.'
' Despite being "naive", it often works well in NLP (bag-of-words spam filtering)'
' because even if features are dependent, the decision boundary can be correct.')
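Part (b) is not worked in the code above; a sketch of the derivation under the stated equal-variance assumption ($\sigma_{j,0} = \sigma_{j,1} = \sigma_j$ for every feature $j$, class priors $\pi_0, \pi_1$):
$$\log\frac{P(y{=}1 \mid x)}{P(y{=}0 \mid x)} = \log\frac{\pi_1}{\pi_0} + \sum_j \frac{(x_j - \mu_{j,0})^2 - (x_j - \mu_{j,1})^2}{2\sigma_j^2} = \log\frac{\pi_1}{\pi_0} + \sum_j \frac{(\mu_{j,1} - \mu_{j,0})\,x_j}{\sigma_j^2} + \mathrm{const}.$$
The quadratic terms in $x_j$ cancel, so the boundary (log-odds = 0) is a hyperplane, the same functional form logistic regression fits directly.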
Exercise 9 ★★★ — Conditional Independence in a Collider
Let $X$ and $Y$ be independent $\mathrm{Bernoulli}(1/2)$ variables and define $Z = X \oplus Y$ (exclusive OR).
Part (a): Verify $X$ and $Y$ are marginally independent.
Part (b): Condition on $Z = 1$ and compute the joint distribution of $(X, Y)$.
Part (c): Show that $X$ and $Y$ are not independent given $Z = 1$.
Part (d): Explain why conditioning can create dependence in causal and diagnostic models.
Code cell 29
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 30
# Solution
import numpy as np
# Part (a): joint PMF of two independent Bernoulli(1/2) variables
joint = np.ones((2, 2)) / 4
px = joint.sum(axis=1)
py = joint.sum(axis=0)
independent = np.allclose(joint, np.outer(px, py))
# Parts (b)-(c): condition on Z = X XOR Y = 1 and renormalise
Z = np.array([[0, 1], [1, 0]])  # value of Z for each (x, y) cell
mask = (Z == 1)
joint_given = np.where(mask, joint, 0.0)
joint_given /= joint_given.sum()
px_given = joint_given.sum(axis=1)
py_given = joint_given.sum(axis=0)
conditional_independent = np.allclose(joint_given, np.outer(px_given, py_given))
header('Exercise 9: Conditional Independence in a Collider')
print('P(X,Y):')
print(joint)
print('P(X,Y | Z=1):')
print(joint_given)
check_true('X and Y are marginally independent', independent)
check_true('X and Y are dependent after conditioning on Z=1', not conditional_independent)
print('Takeaway: conditioning is not always harmless; collider conditioning can create'
      ' spurious dependence. Given Z=1, learning X=1 forces Y=0 (explaining away),'
      ' which is why selecting on a common effect can bias causal and diagnostic models.')
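A sampling cross-check of the same effect; a sketch (given Z = 1 the XOR forces X = 1 - Y, so the conditional correlation is exactly -1):
# Monte Carlo: conditioning on the collider Z = X XOR Y creates dependence
rng = np.random.default_rng(0)
X_mc = rng.integers(0, 2, 200000)
Y_mc = rng.integers(0, 2, 200000)
sel = (X_mc ^ Y_mc) == 1
print(f'corr(X,Y)       = {np.corrcoef(X_mc, Y_mc)[0, 1]:+.4f}')
print(f'corr(X,Y | Z=1) = {np.corrcoef(X_mc[sel], Y_mc[sel])[0, 1]:+.4f}')
check_true('marginally uncorrelated', abs(np.corrcoef(X_mc, Y_mc)[0, 1]) < 0.02)
check_true('perfectly anticorrelated given Z=1',
           abs(np.corrcoef(X_mc[sel], Y_mc[sel])[0, 1] + 1) < 1e-9)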
Exercise 10 ★★★ — Autoregressive Factorization of a Token Sequence
A tiny language model assigns conditional probabilities to a three-token sequence $(x_1, x_2, x_3)$.
Part (a): Compute $P(x_1, x_2, x_3) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2)$ with $P(x_1) = 0.30$, $P(x_2 \mid x_1) = 0.50$, and $P(x_3 \mid x_1, x_2) = 0.40$.
Part (b): Compute the negative log-likelihood and the per-token perplexity $\exp(\mathrm{NLL}/3)$.
Part (c): Compare with a model that ignores context after $x_1$, treating the tokens as independent: $P(x_1)\,P(x_2)\,P(x_3)$ with $P(x_2) = 0.20$ and $P(x_3) = 0.25$.
Part (d): Interpret why the chain rule is the probability skeleton of autoregressive LLMs.
Code cell 32
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 33
# Solution
import numpy as np
# Part (a): chain-rule conditionals
p_x1 = 0.30
p_x2_given_x1 = 0.50
p_x3_given_x12 = 0.40
p_joint = p_x1 * p_x2_given_x1 * p_x3_given_x12
nll = -np.log(p_joint)
perplexity = np.exp(nll / 3)
p_x2_ind = 0.20
p_x3_ind = 0.25
p_ind_model = p_x1 * p_x2_ind * p_x3_ind
nll_ind = -np.log(p_ind_model)
perplexity_ind = np.exp(nll_ind / 3)
header('Exercise 10: Autoregressive Factorization')
print(f'Chain-rule probability: {p_joint:.4f}')
print(f'NLL: {nll:.4f}, perplexity: {perplexity:.4f}')
print(f'Independence model probability: {p_ind_model:.4f}')
print(f'Independence model perplexity: {perplexity_ind:.4f}')
check_true('Using context improves this sequence probability', p_joint > p_ind_model)
check_true('Using context lowers perplexity', perplexity < perplexity_ind)
print('Takeaway: autoregressive modeling is joint probability built from conditionals, one token at a time.')
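The general pattern behind part (d), written out: an autoregressive model defines the joint over a length-$T$ sequence via the chain rule, and per-token perplexity is the exponentiated average NLL, exactly what the code computes for $T = 3$:
$$P(x_{1:T}) = \prod_{t=1}^{T} P(x_t \mid x_{<t}), \qquad \mathrm{PP} = \exp\!\Big(-\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{<t})\Big).$$
Every forward pass of an LLM produces one factor $P(x_t \mid x_{<t})$; training minimises the average NLL of these factors.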
What to Review After Finishing
- Marginalisation — can you integrate/sum out a variable from a joint?
- Conditional distributions — $f_{Y \mid X}(y \mid x) = f(x, y) / f_X(x)$; it's a density in $y$ for each fixed $x$
- Independence vs zero covariance — give a counterexample
- Schur complement formula — derive it from scratch for a 2×2 block matrix
- Reparameterisation trick — why does it reduce gradient variance?
- Naive Bayes — when is the decision boundary linear?
- Collider conditional dependence — can you explain the probabilistic calculation and the ML relevance?
- Autoregressive factorization — can you connect the computation to model behavior?
References
- Bishop, C. (2006). Pattern Recognition and Machine Learning, Ch. 2
- Murphy, K. (2022). Probabilistic Machine Learning, Ch. 2-3
- Kingma & Welling (2014). Auto-Encoding Variational Bayes. arXiv:1312.6114
- Joe, H. (1997). Multivariate Models and Dependence Concepts (copulas)
- Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems (d-separation)