Joint Distributions — Exercises

Converted from exercises.ipynb for web reading.

10 exercises covering joint PMF/PDF/CDF, marginalisation, conditionals, independence, covariance matrices, the multivariate Gaussian, the reparameterisation trick, Naive Bayes classification, collider conditioning, and autoregressive factorization.

Format         Description
Problem        Markdown cell with task description
Your Solution  Code cell with scaffolding
Solution       Code cell with reference solution and checks

Difficulty Levels

Level  Exercises  Focus
★      1-3        Core mechanics: PMF, marginals, conditionals
★★     4-6        Independence, covariance matrices, MVN
★★★    7-10       Reparameterisation trick, Naive Bayes, colliders, autoregressive factorization

Topic Map

Topic                                  Exercise
Joint PMF and marginalisation          1
Continuous marginals and conditionals  2
Chain rule and sequence probability    3
Independence testing                   4
Covariance matrix and PSD              5
MVN conditionals (Schur complement)    6
Reparameterisation trick               7
Naive Bayes classification             8
Collider conditional dependence        9
Autoregressive factorization           10

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
import numpy.linalg as la
from scipy import stats

try:
    import matplotlib.pyplot as plt
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)

def header(title):
    print('\n' + '=' * len(title))
    print(title)
    print('=' * len(title))

def check_close(name, got, expected, tol=1e-8):
    ok = np.allclose(got, expected, atol=tol, rtol=tol)
    print(f"{'PASS' if ok else 'FAIL'}{name}")
    if not ok:
        print('  expected:', expected)
        print('  got     :', got)
    return ok

def check_true(name, cond):
    print(f"{'PASS' if cond else 'FAIL'}{name}")
    return cond

print('Setup complete.')

Exercise 1 ★ — Joint PMF and Marginalisation

Two fair dice are rolled. Let $X$ be the value of die 1 and $Y$ be the value of die 2.

Part (a): Write down the joint PMF $p(x,y)$ as a $6 \times 6$ NumPy array.

Part (b): Compute the marginal PMFs $p_X(x)$ and $p_Y(y)$ by summing.

Part (c): Let $S = X + Y$. Compute $P(S=7)$ by marginalising the joint.

Part (d): Compute $\mathbb{E}[S]$ and $\text{Var}(S)$ from the joint PMF. Verify that $\mathbb{E}[S] = \mathbb{E}[X] + \mathbb{E}[Y]$ and $\text{Var}(S) = \text{Var}(X) + \text{Var}(Y)$ (since $X \perp Y$ for fair dice).

Code cell 5

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 6

# Solution
import numpy as np

header('Exercise 1: Joint PMF and Marginalisation')

# Part (a)
joint_pmf = np.ones((6, 6)) / 36
check_close('joint sums to 1', joint_pmf.sum(), 1.0)
check_true('shape is (6,6)', joint_pmf.shape == (6, 6))

# Part (b)
marginal_x = joint_pmf.sum(axis=1)  # sum over Y
marginal_y = joint_pmf.sum(axis=0)  # sum over X
check_close('marginal_x is uniform 1/6', marginal_x, np.ones(6)/6)
check_close('marginal_y is uniform 1/6', marginal_y, np.ones(6)/6)

# Part (c): P(S=7) — count cells where x+y=7
vals = np.arange(1, 7)
X_grid, Y_grid = np.meshgrid(vals, vals)
p_s7 = joint_pmf[X_grid + Y_grid == 7].sum()
check_close('P(S=7) = 6/36', p_s7, 6/36)

# Part (d): E[S] and Var(S) from joint
# E[S] = sum_{x,y} (x+y)*p(x,y)
S_vals = X_grid + Y_grid
E_S = (S_vals * joint_pmf).sum()
E_S2 = (S_vals**2 * joint_pmf).sum()
Var_S = E_S2 - E_S**2
check_close('E[S] = 7', E_S, 7.0)
check_close('Var[S] = Var[X]+Var[Y] = 35/6', Var_S, 35/6)

# Verify additivity (independence)
E_X = (vals * marginal_x).sum()
Var_X = ((vals**2) * marginal_x).sum() - E_X**2
check_close('E[S] = E[X]+E[Y]', E_S, 2*E_X)
check_close('Var[S] = Var[X]+Var[Y] (independence)', Var_S, 2*Var_X)

print('\nTakeaway: For independent X,Y, the joint factors as p(x)p(y), which lets'
      ' us compute moments of X+Y using marginals alone.')
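
As a quick extension of part (c): because the dice are independent, the full PMF of $S$ is the discrete convolution of the marginals, $p_S = p_X * p_Y$. A minimal sketch:

import numpy as np

p_die = np.ones(6) / 6                      # marginal PMF of one fair die
p_S = np.convolve(p_die, p_die)             # PMF of S = X + Y, supported on 2..12
S_support = np.arange(2, 13)
print('P(S=7) =', p_S[S_support == 7][0])   # 6/36, matching the grid count above
print('full PMF of S:', np.round(p_S, 4))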

Exercise 2 ★ — Continuous Marginals and Conditionals

Let $(X, Y)$ have joint PDF $f(x, y) = e^{-y}$ for $0 \le x \le y < \infty$.

Part (a): Verify that $f$ is a valid PDF (integrates to 1).

Part (b): Find $f_X(x)$ (the marginal of $X$). What distribution does $X$ follow?

Part (c): Find $f_{Y|X}(y \mid x)$. What standard distribution is $Y \mid X = x$?

Part (d): Compute $\mathbb{E}[Y \mid X = x]$ and verify numerically by sampling.

Code cell 8

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 9

# Solution
import numpy as np
from scipy.integrate import dblquad, quad
np.random.seed(42)

header('Exercise 2: Continuous Marginals and Conditionals')

# Part (a): Normalisation
def joint_pdf_2(y, x):
    return np.exp(-y) if 0 <= x <= y else 0.0

norm, _ = dblquad(joint_pdf_2, 0, 50, lambda x: x, lambda x: 50)
check_close('normalisation = 1', norm, 1.0, tol=1e-6)

# Part (b): Marginal f_X(x) = int_x^inf e^{-y} dy = e^{-x}
# So X ~ Exp(1)
def marginal_x_2(x):
    return np.exp(-x)  # Exp(1) PDF

mx_check, _ = quad(marginal_x_2, 0, 50)
check_close('marginal_x integrates to 1', mx_check, 1.0, tol=1e-6)
print('X ~ Exp(1) (marginal is exponential)')

# Part (c): f_{Y|X}(y|x) = e^{-y} / e^{-x} = e^{-(y-x)}, y >= x
# This is a Shifted Exp(1): Y|X=x ~ x + Exp(1)
def cond_pdf_y_given_x(y, x):
    if y < x:
        return 0.0
    return np.exp(-(y - x))  # Exp(1) shifted by x

x_test = 2.0
cond_check, _ = quad(lambda y: cond_pdf_y_given_x(y, x_test), x_test, 100)
check_close('conditional PDF integrates to 1 given X=2', cond_check, 1.0, tol=1e-6)
print('Y|X=x ~ x + Exp(1)  (shifted exponential)')

# Part (d): E[Y|X=x] = x + 1  (mean of x + Exp(1))
E_Y_given_x = lambda x: x + 1.0
check_close('E[Y|X=2] = 3', E_Y_given_x(x_test), 3.0)

# Verify by sampling from the joint: draw X ~ Exp(1), then Y = X + Exp(1)
N = 200000
X_samp = np.random.exponential(1, N)            # X ~ Exp(1)
Y_samp = X_samp + np.random.exponential(1, N)   # Y | X ~ x + Exp(1)

tol_x = 0.1
mask = np.abs(X_samp - x_test) < tol_x
EY_emp = Y_samp[mask].mean()
print(f'\nE[Y|X≈{x_test}] empirical: {EY_emp:.4f} (theory: {E_Y_given_x(x_test):.4f})')
check_true('empirical close to theory', abs(EY_emp - E_Y_given_x(x_test)) < 0.15)

print('\nTakeaway: f_{Y|X}(y|x) = f(x,y)/f_X(x) is a density in y, not a number.'
      ' Here it revealed Y|X ~ x+Exp(1): conditioning on X shifts the distribution.')

Exercise 3 ★ — Chain Rule and Sequence Probability

Consider a bigram language model over a 4-word vocabulary $\{\text{<s>}, \text{cat}, \text{sat}, \text{mat}\}$.

Part (a): Given the conditional distributions below, compute $P(\text{cat sat mat})$ via the chain rule.

Part (b): Compute the cross-entropy $H = -\log_2 P(\text{cat sat mat})$ in bits.

Part (c): A bigram model gives $P_2(\text{cat sat mat}) = 0.042$. A trigram model gives $P_3(\text{cat sat mat}) = 0.031$. Compute the perplexity $\text{PP} = 2^H$ for each model. Which model assigns higher probability to this sentence?

Conditional probabilities:

  • $P(\text{cat} \mid \text{<s>}) = 0.6$
  • $P(\text{sat} \mid \text{cat}) = 0.5$
  • $P(\text{mat} \mid \text{sat}) = 0.7$

Code cell 11

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 12

# Solution
import numpy as np

header('Exercise 3: Chain Rule and Sequence Probability')

P_cat_given_s = 0.6
P_sat_given_cat = 0.5
P_mat_given_sat = 0.7

# Part (a)
P_sentence = P_cat_given_s * P_sat_given_cat * P_mat_given_sat
check_close('P(cat sat mat) = 0.6*0.5*0.7', P_sentence, 0.6*0.5*0.7)

# Part (b)
H_bits = -np.log2(P_sentence)
print(f'H = {H_bits:.4f} bits  (= -log2({P_sentence:.4f}))')
check_close('cross-entropy bits', H_bits, -np.log2(0.21))

# Part (c)
P2 = 0.042; P3 = 0.031
PP2 = 2**(-np.log2(P2))
PP3 = 2**(-np.log2(P3))
# Simpler: PP = 1/P
PP2_simple = 1/P2; PP3_simple = 1/P3
check_close('PP2 = 1/P2', PP2, PP2_simple)
check_close('PP3 = 1/P3', PP3, PP3_simple)
print(f'Bigram  perplexity: {PP2:.2f}')
print(f'Trigram perplexity: {PP3:.2f}')
check_true('bigram has lower perplexity (higher prob)', PP2 < PP3)

print('\nTakeaway: Lower perplexity = higher probability assigned to sentence.'
      ' A good language model should minimise perplexity on held-out text.'
      ' Chain rule ensures any joint factorisation gives the same P(sentence).')
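
A note on convention: $\text{PP} = 2^H$ above is a sentence-level perplexity; reported language-model perplexities are usually per token, $\text{PP} = 2^{H/n}$ for an $n$-token sequence (Exercise 10 uses the natural-log version, $e^{\text{NLL}/n}$). A minimal sketch:

import numpy as np

P_sentence = 0.21                      # from part (a)
n_tokens = 3
H_bits = -np.log2(P_sentence)
print(f'sentence-level PP: {2**H_bits:.2f}')              # = 1/P(sentence)
print(f'per-token PP     : {2**(H_bits / n_tokens):.2f}')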

Exercise 4 ★★ — Independence Testing

Part (a): For the joint PMF below, determine whether $X \perp Y$ by checking the factorisation condition.

$$p(x,y) = \begin{pmatrix} 0.10 & 0.15 & 0.05 \\ 0.20 & 0.30 & 0.10 \\ 0.04 & 0.06 & 0.00 \end{pmatrix}$$

(rows = $X \in \{0,1,2\}$, columns = $Y \in \{0,1,2\}$)

Part (b): Compute $\text{Cov}(X, Y)$ from the joint PMF.

Part (c): Compute the mutual information $I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p_X(x) p_Y(y)}$. What does $I(X;Y) = 0$ imply? What does $I(X;Y) > 0$ imply?

Code cell 14

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 15

# Solution
import numpy as np

header('Exercise 4: Independence Testing')

P = np.array([
    [0.10, 0.15, 0.05],
    [0.20, 0.30, 0.10],
    [0.04, 0.06, 0.00]
])
check_close('P sums to 1', P.sum(), 1.0)

# Part (a)
px = P.sum(axis=1)   # marginal X: sum over columns (Y)
py = P.sum(axis=0)   # marginal Y: sum over rows (X)
P_indep = np.outer(px, py)
is_independent = np.allclose(P, P_indep)
print(f'Independent: {is_independent}')
print(f'Max deviation from independence: {np.abs(P - P_indep).max():.4f}')
check_true('NOT independent', not is_independent)

# Part (b): Cov(X, Y)
vals_x = np.arange(3)
vals_y = np.arange(3)
EX = (vals_x * px).sum()
EY = (vals_y * py).sum()
EXY = sum(vals_x[x]*vals_y[y]*P[x,y] for x in range(3) for y in range(3))
cov_xy = EXY - EX*EY
print(f'E[X]={EX:.4f}, E[Y]={EY:.4f}, E[XY]={EXY:.4f}')
print(f'Cov(X,Y) = {cov_xy:.6f}')

# Part (c): Mutual information I(X;Y)
MI = 0.0
for x in range(3):
    for y in range(3):
        if P[x,y] > 0:
            MI += P[x,y] * np.log(P[x,y] / (px[x]*py[y]))
print(f'I(X;Y) = {MI:.6f} nats')
check_true('MI >= 0', MI >= -1e-10)
check_true('MI > 0 (not independent)', MI > 0.001)

print('\nTakeaway: Cov(X,Y)=0 does NOT imply I(X;Y)=0 in general.'
      ' But I(X;Y)=0 iff X and Y are independent.'
      ' Mutual information is the gold standard for detecting dependence.')
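
The first takeaway claim has a classic counterexample worth verifying: take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$. Then $\text{Cov}(X,Y) = 0$ while $I(X;Y) > 0$. A minimal sketch:

import numpy as np

# X uniform on {-1, 0, 1}; Y = X^2, so Y in {0, 1}
xs = np.array([-1, 0, 1])
ys = np.array([0, 1])
P = np.array([[0.0, 1/3],   # X = -1  ->  Y = 1
              [1/3, 0.0],   # X =  0  ->  Y = 0
              [0.0, 1/3]])  # X = +1  ->  Y = 1
px, py = P.sum(axis=1), P.sum(axis=0)
EX = (xs * px).sum(); EY = (ys * py).sum()
EXY = sum(xs[i] * ys[j] * P[i, j] for i in range(3) for j in range(2))
MI = sum(P[i, j] * np.log(P[i, j] / (px[i] * py[j]))
         for i in range(3) for j in range(2) if P[i, j] > 0)
print(f'Cov(X,Y) = {EXY - EX*EY:.4f}, I(X;Y) = {MI:.4f} nats')  # 0.0000, 0.6365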

Exercise 5 ★★ — Covariance Matrix and PSD

Part (a): Construct a valid $4 \times 4$ covariance matrix $\Sigma$ with unit diagonal (i.e., all variances = 1) and off-diagonal entries chosen so that the matrix is PSD. Verify PSD via Cholesky decomposition.

Part (b): Sample $N = 10000$ points from $\mathcal{N}(\mathbf{0}, \Sigma)$ and verify that the sample covariance matrix is close to $\Sigma$.

Part (c): Compute the correlation matrix from $\Sigma$ (i.e., normalise by the diagonal). What is always true about the diagonal of a correlation matrix?

Part (d): The matrix $\Sigma^* = \Sigma + \alpha I$ is PSD for any $\alpha > 0$. This is Tikhonov regularisation. Show that adding $\alpha I$ increases all eigenvalues by $\alpha$ without changing eigenvectors. Why is this useful in ML (hint: condition number)?

Code cell 17

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 18

# Solution
import numpy as np
import numpy.linalg as la
np.random.seed(42)

header('Exercise 5: Covariance Matrix and PSD')

# Part (a)
L = np.array([[1,0,0,0],[0.3,1,0,0],[0.2,0.1,1,0],[0.4,0.2,0.3,1]], dtype=float)
Sigma_raw = L @ L.T
D_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma_raw)))
Sigma = D_inv @ Sigma_raw @ D_inv  # unit diagonal

check_close('diagonal is all 1s', np.diag(Sigma), np.ones(4))
eigvals = la.eigvalsh(Sigma)
check_true('all eigenvalues > 0 (PSD)', (eigvals > 0).all())
try:
    la.cholesky(Sigma)
    print('PASS — Cholesky succeeded (PSD confirmed)')
except la.LinAlgError:
    print('FAIL — not PSD')

# Part (b)
mu = np.zeros(4)
samples = np.random.multivariate_normal(mu, Sigma, 10000)
Sigma_est = np.cov(samples.T)
max_err = np.abs(Sigma_est - Sigma).max()
print(f'\nMax entry error between estimated and true Sigma: {max_err:.4f}')
check_true('estimated close to true (max err < 0.05)', max_err < 0.05)

# Part (c): Correlation matrix
D_inv2 = np.diag(1.0 / np.sqrt(np.diag(Sigma_est)))
Corr = D_inv2 @ Sigma_est @ D_inv2
check_close('correlation matrix diagonal is 1', np.diag(Corr), np.ones(4), tol=0.05)
print('Correlation matrix (diagonal = 1 by construction)')

# Part (d): Tikhonov regularisation
alpha = 0.1
Sigma_reg = Sigma + alpha * np.eye(4)
eigs_orig = np.sort(la.eigvalsh(Sigma))
eigs_reg  = np.sort(la.eigvalsh(Sigma_reg))
check_close('eigenvalues shift by alpha', eigs_reg - eigs_orig, alpha * np.ones(4))
cond_orig = eigs_orig.max() / eigs_orig.min()
cond_reg  = eigs_reg.max()  / eigs_reg.min()
print(f'Condition number: original={cond_orig:.2f}, regularised={cond_reg:.2f}')
check_true('regularisation reduces condition number', cond_reg < cond_orig)

print('\nTakeaway: Adding alpha*I to Sigma is the most common form of covariance'
      ' regularisation. It prevents near-singular inverses by ensuring all eigenvalues'
      ' are at least alpha. Used in ridge regression, Gaussian processes, and, in spirit,'
      ' the epsilon term in Adam.')
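
Part (d) also asserts that the eigenvectors are unchanged; this follows from $(\Sigma + \alpha I)v = \lambda v + \alpha v$. A quick self-contained check on a small symmetric matrix:

import numpy as np
import numpy.linalg as la

A = np.array([[2.0, 0.5], [0.5, 1.0]])        # any symmetric PSD matrix
alpha = 0.1
w, V = la.eigh(A)                              # columns of V are eigenvectors
w2, V2 = la.eigh(A + alpha * np.eye(2))
print('eigenvalue shift:', w2 - w)             # both entries equal alpha
# Same eigenvectors up to sign flips:
print('eigenvectors match:', np.allclose(np.abs(V.T @ V2), np.eye(2), atol=1e-8))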

Exercise 6 ★★ — MVN Conditionals (Schur Complement)

Let $(X_1, X_2, X_3) \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ with

$$\boldsymbol{\mu} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 4 & 2 & 1 \\ 2 & 3 & 1 \\ 1 & 1 & 2 \end{pmatrix}$$

Part (a): Compute the distribution of $X_1 \mid X_2 = 2.5, X_3 = 2.5$ using the Schur complement formula.

Part (b): Verify your answer by sampling: draw $N = 500{,}000$ samples from the joint, keep those with $X_2 \approx 2.5$ and $X_3 \approx 2.5$, and compare the empirical conditional mean and variance.

Part (c): Compute the partial correlation $\rho_{12.3} = \text{Corr}(X_1, X_2 \mid X_3)$. How does it differ from the marginal correlation $\rho_{12}$?

Code cell 20

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 21

# Solution
import numpy as np
np.random.seed(42)

header('Exercise 6: MVN Conditionals (Schur Complement)')

mu_full = np.array([1.0, 2.0, 3.0])
Sigma_full = np.array([[4.,2.,1.],[2.,3.,1.],[1.,1.,2.]])
x2_obs = np.array([2.5, 2.5])

# Partition
idx1, idx2 = [0], [1, 2]
mu1 = mu_full[idx1]; mu2 = mu_full[idx2]
S11 = Sigma_full[np.ix_(idx1, idx1)]
S12 = Sigma_full[np.ix_(idx1, idx2)]
S21 = Sigma_full[np.ix_(idx2, idx1)]
S22 = Sigma_full[np.ix_(idx2, idx2)]

S22_inv = np.linalg.inv(S22)
mu_cond = mu1 + S12 @ S22_inv @ (x2_obs - mu2)
Sigma_cond = S11 - S12 @ S22_inv @ S21

print(f'X1 | X2=2.5, X3=2.5:')
print(f'  mu_cond    = {mu_cond[0]:.4f}')
print(f'  sigma_cond = {np.sqrt(Sigma_cond[0,0]):.4f}')

# Part (b): Verify by sampling
N = 500000
samp = np.random.multivariate_normal(mu_full, Sigma_full, N)
tol = 0.08
mask = (np.abs(samp[:,1] - x2_obs[0]) < tol) & (np.abs(samp[:,2] - x2_obs[1]) < tol)
X1_cond = samp[mask, 0]
print(f'\nSampling ({mask.sum()} matching samples):')
print(f'  emp mean = {X1_cond.mean():.4f}  (theory: {mu_cond[0]:.4f})')
print(f'  emp std  = {X1_cond.std():.4f}  (theory: {np.sqrt(Sigma_cond[0,0]):.4f})')
check_true('empirical mean close to theory', abs(X1_cond.mean() - mu_cond[0]) < 0.2)
check_true('empirical std close to theory', abs(X1_cond.std() - np.sqrt(Sigma_cond[0,0])) < 0.2)

# Part (c): Partial correlation rho_{12.3}
# Compute conditional covariance of (X1,X2) given X3
idx12, idx3 = [0,1], [2]
S_12_12 = Sigma_full[np.ix_(idx12, idx12)]
S_12_3  = Sigma_full[np.ix_(idx12, idx3)]
S_3_3   = Sigma_full[np.ix_(idx3, idx3)]
Sigma_cond_12 = S_12_12 - S_12_3 @ np.linalg.inv(S_3_3) @ S_12_3.T
rho_12_partial = Sigma_cond_12[0,1] / np.sqrt(Sigma_cond_12[0,0] * Sigma_cond_12[1,1])
rho_12_marginal = Sigma_full[0,1] / np.sqrt(Sigma_full[0,0] * Sigma_full[1,1])
print(f'\nMarginal correlation rho_12   = {rho_12_marginal:.4f}')
print(f'Partial correlation rho_12.3  = {rho_12_partial:.4f}')

print('\nTakeaway: Partial correlation removes the confounding effect of X3.'
      ' If rho_12 > rho_12.3, part of the X1-X2 correlation was mediated by X3.'
      ' This is the MVN version of controlling for a covariate.')
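
A useful ML reading of the Schur complement: the weights $\Sigma_{12}\Sigma_{22}^{-1}$ in the conditional mean are exactly the slopes of the best linear predictor of $X_1$ from $(X_2, X_3)$. A minimal sketch checking this against ordinary least squares:

import numpy as np
np.random.seed(0)

mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[4., 2., 1.], [2., 3., 1.], [1., 1., 2.]])
samp = np.random.multivariate_normal(mu, Sigma, 200000)

# Regress X1 on (X2, X3) with an intercept
A = np.column_stack([np.ones(len(samp)), samp[:, 1:]])
coef, *_ = np.linalg.lstsq(A, samp[:, 0], rcond=None)

S12 = Sigma[0:1, 1:]; S22 = Sigma[1:, 1:]
beta = (S12 @ np.linalg.inv(S22)).ravel()      # theoretical slopes [0.6, 0.2]
print('OLS slopes  :', np.round(coef[1:], 3))
print('Schur slopes:', np.round(beta, 3))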

Exercise 7 ★★★ — Reparameterisation Trick

The reparameterisation trick is the engine of VAE training.

Part (a): Implement two gradient estimators for $\nabla_\mu \mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[\sin(z)]$:

  1. REINFORCE: $\nabla_\mu \mathbb{E}[f(z)] = \mathbb{E}[f(z) \cdot (z - \mu)/\sigma^2]$
  2. Reparameterisation: $\nabla_\mu \mathbb{E}[f(\mu + \sigma\varepsilon)] = \mathbb{E}[f'(\mu + \sigma\varepsilon)]$

Compare variance of the two estimators over 200 Monte Carlo runs.

Part (b): Implement the VAE ELBO computation for a 1D Gaussian encoder:

$$\text{ELBO}(x; \phi, \theta) = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \,\|\, p(z))$$

where $q_\phi(z|x) = \mathcal{N}(\mu_\phi, \sigma_\phi^2)$, $p(z) = \mathcal{N}(0, 1)$, and the KL has a closed form:

$$\text{KL}(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)) = \frac{1}{2}(\mu^2 + \sigma^2 - \log\sigma^2 - 1)$$

Code cell 23

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 24

# Solution
import numpy as np
np.random.seed(42)

header('Exercise 7: Reparameterisation Trick')

mu, sigma = 1.5, 0.8

# Part (a): True gradient
N_ref = 2000000
z_ref = np.random.normal(mu, sigma, N_ref)
true_grad = np.cos(z_ref).mean()  # d/dmu E[sin(z)] = E[cos(z)] by reparam
print(f'Reference gradient (N={N_ref}): {true_grad:.6f}')

# REINFORCE: grad = E[sin(z) * (z-mu)/sigma^2]
N_mc = 5000
n_runs = 200
rf_ests = []
rp_ests = []
for _ in range(n_runs):
    z_rf = np.random.normal(mu, sigma, N_mc)
    rf_ests.append((np.sin(z_rf) * (z_rf - mu) / sigma**2).mean())
    eps = np.random.normal(0, 1, N_mc)
    z_rp = mu + sigma * eps
    rp_ests.append(np.cos(z_rp).mean())  # d/dmu sin(mu+sigma*eps) = cos(z)

rf_ests = np.array(rf_ests)
rp_ests = np.array(rp_ests)

print(f'\nGradient estimates (mean ± std over {n_runs} runs, N={N_mc} each):')
print(f'  REINFORCE:        {rf_ests.mean():.4f} ± {rf_ests.std():.4f}')
print(f'  Reparameterisation: {rp_ests.mean():.4f} ± {rp_ests.std():.4f}')
print(f'  True:             {true_grad:.4f}')
print(f'  Variance ratio (RF/RP): {rf_ests.var()/rp_ests.var():.1f}x')

check_true('reparam is closer to truth', abs(rp_ests.mean()-true_grad) < abs(rf_ests.mean()-true_grad) or
           rp_ests.var() < rf_ests.var())
check_true('reparam has lower variance', rp_ests.var() < rf_ests.var())

# Part (b): VAE ELBO KL term
def kl_gaussian(mu, sigma):
    # KL(N(mu,sigma^2) || N(0,1)) = 0.5*(mu^2 + sigma^2 - log(sigma^2) - 1)
    return 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

# Verify KL = 0 at N(0,1)
check_close('KL(N(0,1)||N(0,1)) = 0', kl_gaussian(0.0, 1.0), 0.0)
check_true('KL >= 0 always', kl_gaussian(0.5, 0.7) >= 0)

# Monte Carlo verification
mu_t, sigma_t = 0.5, 0.7
kl_theory = kl_gaussian(mu_t, sigma_t)
# MC KL: E_{q}[log q(z) - log p(z)]
z_mc = np.random.normal(mu_t, sigma_t, 500000)
log_q = -0.5*np.log(2*np.pi*sigma_t**2) - (z_mc-mu_t)**2/(2*sigma_t**2)
log_p = -0.5*np.log(2*np.pi) - z_mc**2/2
kl_mc = (log_q - log_p).mean()
print(f'\nKL({mu_t},{sigma_t}): theory={kl_theory:.6f}, MC={kl_mc:.6f}')
check_close('KL theory matches MC', kl_theory, kl_mc, tol=0.005)

print('\nTakeaway: The reparameterisation trick replaces high-variance REINFORCE'
      ' with a deterministic, low-variance gradient. This is why VAEs train stably:'
      ' gradients flow through the deterministic mu and sigma paths, not through'
      ' the sampling operation itself.')
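
Part (b) asks for the full ELBO, not just the KL term. A minimal sketch of the complete 1D computation, assuming (purely for illustration) a Gaussian decoder $p_\theta(x \mid z) = \mathcal{N}(x; z, 1)$, with reparameterised samples for the reconstruction term and the closed-form KL:

import numpy as np
np.random.seed(0)

def elbo_1d(x, mu_phi, sigma_phi, n_samples=10000):
    # Reparameterised samples: z = mu + sigma * eps, eps ~ N(0,1)
    eps = np.random.normal(0, 1, n_samples)
    z = mu_phi + sigma_phi * eps
    # Reconstruction term: E_q[log p(x|z)] with decoder p(x|z) = N(x; z, 1)
    log_px_given_z = -0.5 * np.log(2 * np.pi) - 0.5 * (x - z) ** 2
    recon = log_px_given_z.mean()
    # Closed-form KL(q || N(0,1))
    kl = 0.5 * (mu_phi**2 + sigma_phi**2 - np.log(sigma_phi**2) - 1.0)
    return recon - kl

print(f'ELBO(x=1.0; mu=0.5, sigma=0.7) ~ {elbo_1d(1.0, 0.5, 0.7):.4f}')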

Exercise 8 ★★★ — Naive Bayes Classification

Naive Bayes uses conditional independence to classify documents:

$$P(Y = c \mid \mathbf{x}) \propto P(Y = c) \prod_{i=1}^d P(x_i \mid Y = c)$$

Part (a): Implement a Gaussian Naive Bayes classifier for a 2-class, 2-feature dataset. Train on 80% of data, evaluate on 20%.

Part (b): Show that the Naive Bayes decision boundary is linear when both classes have the same per-feature variance $\sigma^2$ (i.e., derive when $\log P(Y=1 \mid \mathbf{x}) > \log P(Y=0 \mid \mathbf{x})$).

Part (c): Compare Naive Bayes accuracy to logistic regression on the same dataset. When does NB outperform LR and vice versa?

Code cell 26

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 27

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

header('Exercise 8: Naive Bayes Classification')

# Generate data
N_per_class = 300
mu0 = np.array([0.0, 0.0]); mu1 = np.array([2.0, 1.5])
Sigma_class = np.array([[1.5, 0.6],[0.6, 1.0]])
X0 = np.random.multivariate_normal(mu0, Sigma_class, N_per_class)
X1 = np.random.multivariate_normal(mu1, Sigma_class, N_per_class)
X_all = np.vstack([X0, X1])
y_all = np.concatenate([np.zeros(N_per_class), np.ones(N_per_class)]).astype(int)

idx = np.random.permutation(len(y_all))
train_idx = idx[:int(0.8*len(y_all))]
test_idx  = idx[int(0.8*len(y_all)):]
X_train, y_train = X_all[train_idx], y_all[train_idx]
X_test,  y_test  = X_all[test_idx],  y_all[test_idx]

# Part (a): Gaussian Naive Bayes
def gnb_train(X, y):
    classes = np.unique(y)
    priors  = {c: (y==c).mean() for c in classes}
    means   = {c: X[y==c].mean(axis=0) for c in classes}
    stds    = {c: X[y==c].std(axis=0)  for c in classes}
    return classes, priors, means, stds

def gnb_predict(params, X):
    classes, priors, means, stds = params
    log_probs = np.zeros((len(X), len(classes)))
    for i, c in enumerate(classes):
        log_prior = np.log(priors[c])
        log_lik = sum(stats.norm.logpdf(X[:,j], means[c][j], stds[c][j]+1e-9)
                      for j in range(X.shape[1]))
        log_probs[:, i] = log_prior + log_lik
    return classes[np.argmax(log_probs, axis=1)]

params = gnb_train(X_train, y_train)
y_pred_nb = gnb_predict(params, X_test)
acc_nb = (y_pred_nb == y_test).mean()
print(f'Gaussian Naive Bayes accuracy: {acc_nb:.4f}')

# Part (c): Compare with logistic regression (sklearn-free implementation)
# Use a simple gradient descent logistic regression
def sigmoid(z): return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

X_tr_aug = np.hstack([X_train, np.ones((len(X_train), 1))])
X_te_aug = np.hstack([X_test,  np.ones((len(X_test),  1))])
w = np.zeros(3)
lr_rate = 0.1
for _ in range(300):
    p = sigmoid(X_tr_aug @ w)
    grad = X_tr_aug.T @ (p - y_train) / len(y_train)
    w -= lr_rate * grad

y_pred_lr = (sigmoid(X_te_aug @ w) > 0.5).astype(int)
acc_lr = (y_pred_lr == y_test).mean()
print(f'Logistic Regression accuracy: {acc_lr:.4f}')

check_true('Naive Bayes accuracy > 0.75', acc_nb > 0.75)
check_true('Logistic Regression accuracy > 0.75', acc_lr > 0.75)

print('\nNB vs LR comparison:')
print(f'  NB: {acc_nb:.4f}  LR: {acc_lr:.4f}')
print('NB assumes conditional independence (often wrong but fast and interpretable).')
print('LR makes no such assumption; fits the boundary directly from data.')
print('With small N or many features, NB can outperform LR (less variance).')
print('\nTakeaway: Naive Bayes encodes conditional independence as a modelling assumption.'
      ' Despite being "naive", it often works well in NLP (bag-of-words spam filtering)'
      ' because even if features are dependent, the decision boundary can be correct.')
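
Part (b) can also be checked numerically: with shared per-feature variances, the quadratic terms cancel in the log-odds, leaving an affine function of $\mathbf{x}$. A minimal sketch with made-up class means and variances:

import numpy as np

# Shared per-feature stds; only the class means differ
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.5])
sigma = np.array([1.2, 0.9])
prior0 = prior1 = 0.5

def log_odds(x):
    # log P(y=1|x) - log P(y=0|x); shared normalising constants cancel
    ll1 = -((x - mu1) ** 2 / (2 * sigma ** 2)).sum()
    ll0 = -((x - mu0) ** 2 / (2 * sigma ** 2)).sum()
    return np.log(prior1) - np.log(prior0) + ll1 - ll0

# Closed form after the x^2 terms cancel: w @ x + b
w = (mu1 - mu0) / sigma ** 2
b = np.log(prior1 / prior0) + ((mu0 ** 2 - mu1 ** 2) / (2 * sigma ** 2)).sum()
x = np.array([0.7, -0.3])
print(np.isclose(log_odds(x), w @ x + b))  # True: the boundary w@x + b = 0 is linear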

Exercise 9 ★★★ — Conditional Independence in a Collider

Let $X$ and $Y$ be independent Bernoulli$(0.5)$ variables and define $Z = X \oplus Y$ (exclusive OR).

Part (a): Verify that $X$ and $Y$ are marginally independent.

Part (b): Condition on $Z = 1$ and compute the joint distribution of $(X, Y)$.

Part (c): Show that $X$ and $Y$ are not independent given $Z = 1$.

Part (d): Explain why conditioning can create dependence in causal and diagnostic models.

Code cell 29

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 30

# Solution
import numpy as np

# Part (a): joint PMF of two independent Bernoulli(0.5) variables
joint = np.ones((2, 2)) / 4
px = joint.sum(axis=1)
py = joint.sum(axis=0)
independent = np.allclose(joint, np.outer(px, py))

# Parts (b)-(c): condition on Z = 1, where Z[x, y] = x XOR y
Z = np.array([[0, 1], [1, 0]])
mask = (Z == 1)
joint_given = np.where(mask, joint, 0.0)
joint_given /= joint_given.sum()   # renormalise to get P(X, Y | Z=1)
px_given = joint_given.sum(axis=1)
py_given = joint_given.sum(axis=0)
conditional_independent = np.allclose(joint_given, np.outer(px_given, py_given))

header('Exercise 9: Conditional Independence in a Collider')
print('P(X,Y):')
print(joint)
print('P(X,Y | Z=1):')
print(joint_given)
check_true('X and Y are marginally independent', independent)
check_true('X and Y are dependent after conditioning on Z=1', not conditional_independent)
print('Takeaway: conditioning is not always harmless. X and Y each explain Z, so once'
      ' Z=1 is known, learning X=1 forces Y=0 ("explaining away"): conditioning on a'
      ' common effect (a collider) creates spurious dependence.')
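
The same effect is visible by simulation, which makes part (d) concrete: marginally $X$ and $Y$ are uncorrelated, but among samples with $Z = 1$, knowing $X$ pins down $Y$ exactly. A quick sketch:

import numpy as np
rng = np.random.default_rng(0)

N = 100000
X = rng.integers(0, 2, N)
Y = rng.integers(0, 2, N)
Z = X ^ Y                                # collider: Z = X XOR Y
print('marginal corr(X, Y):', np.round(np.corrcoef(X, Y)[0, 1], 3))  # ~0
sel = Z == 1
print('corr(X, Y | Z=1)   :', np.round(np.corrcoef(X[sel], Y[sel])[0, 1], 3))  # -1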

Exercise 10 ★★★ — Autoregressive Factorization of a Token Sequence

A tiny language model assigns conditional probabilities to a three-token sequence $(x_1, x_2, x_3)$.

Part (a): Compute $p(x_1, x_2, x_3) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1, x_2)$.

Part (b): Compute the negative log-likelihood and perplexity.

Part (c): Compare with a model that assumes conditional independence after $x_1$.

Part (d): Interpret why the chain rule is the probability skeleton of autoregressive LLMs.

Code cell 32

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 33

# Solution
import numpy as np

# Part (a): chain rule p(x1,x2,x3) = p(x1) p(x2|x1) p(x3|x1,x2)
p_x1 = 0.30
p_x2_given_x1 = 0.50
p_x3_given_x12 = 0.40
p_joint = p_x1 * p_x2_given_x1 * p_x3_given_x12

# Part (b): NLL in nats and per-token perplexity
nll = -np.log(p_joint)
perplexity = np.exp(nll / 3)

# Part (c): a model that assumes x2 and x3 are independent of earlier context
p_x2_ind = 0.20
p_x3_ind = 0.25
p_ind_model = p_x1 * p_x2_ind * p_x3_ind
nll_ind = -np.log(p_ind_model)
perplexity_ind = np.exp(nll_ind / 3)

header('Exercise 10: Autoregressive Factorization')
print(f'Chain-rule probability: {p_joint:.4f}')
print(f'NLL: {nll:.4f}, perplexity: {perplexity:.4f}')
print(f'Independence model probability: {p_ind_model:.4f}')
print(f'Independence model perplexity: {perplexity_ind:.4f}')
check_true('Using context improves this sequence probability', p_joint > p_ind_model)
check_true('Using context lowers perplexity', perplexity < perplexity_ind)
print('Takeaway: autoregressive modeling is joint probability built from conditionals, one token at a time.')
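
For part (d): this is exactly how an autoregressive LM scores a sequence under teacher forcing: sum the per-token conditional log-probabilities to get the joint log-probability, then exponentiate the negative per-token average for perplexity. A minimal sketch with the numbers above:

import numpy as np

cond_probs = [0.30, 0.50, 0.40]        # p(x1), p(x2|x1), p(x3|x1,x2)
log_probs = np.log(cond_probs)         # per-token log-likelihoods
seq_logprob = log_probs.sum()          # log p(x1,x2,x3) by the chain rule
ppl = np.exp(-seq_logprob / len(cond_probs))
print(f'log p(sequence) = {seq_logprob:.4f}, per-token perplexity = {ppl:.4f}')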

What to Review After Finishing

  • Marginalisation — can you integrate/sum out a variable from a joint?
  • Conditional distributions — $f_{Y|X} = f_{X,Y}/f_X$; it's a function of $y$
  • Independence vs zero covariance — give a counterexample
  • Schur complement formula — derive it from scratch for a 2×2 block matrix
  • Reparameterisation trick — why does it reduce gradient variance?
  • Naive Bayes — when is the decision boundary linear?
  • Collider conditional dependence — can you explain the probabilistic calculation and the ML relevance?
  • Autoregressive factorization — can you connect the computation to model behavior?
