Joint Distributions — Exercises

Converted from exercises.ipynb for web reading.

10 exercises covering joint PMF/PDF/CDF, marginalisation, conditionals, independence, covariance matrices, the multivariate Gaussian, the reparameterisation trick, Naive Bayes classification, collider conditioning, and autoregressive factorization.

Format         Description
Problem        Markdown cell with task description
Your Solution  Code cell with scaffolding
Solution       Code cell with reference solution and checks

Difficulty Levels

Level  Exercises  Focus
★      1-3        Core mechanics: PMF, marginals, conditionals
★★     4-6        Independence, covariance matrices, MVN
★★★    7-10       Reparameterisation trick, Naive Bayes, colliders, autoregressive factorization

Topic Map

Topic                                  Exercise
Joint PMF and marginalisation          1
Continuous marginals and conditionals  2
Chain rule and sequence probability    3
Independence testing                   4
Covariance matrix and PSD              5
MVN conditionals (Schur complement)    6
Reparameterisation trick               7
Naive Bayes classification             8
Collider conditional dependence        9
Autoregressive factorization           10

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
import numpy.linalg as la
from scipy import stats

try:
    import matplotlib.pyplot as plt
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)

def header(title):
    print('\n' + '=' * len(title))
    print(title)
    print('=' * len(title))

def check_close(name, got, expected, tol=1e-8):
    ok = np.allclose(got, expected, atol=tol, rtol=tol)
    print(f"{'PASS' if ok else 'FAIL'}{name}")
    if not ok:
        print('  expected:', expected)
        print('  got     :', got)
    return ok

def check_true(name, cond):
    print(f"{'PASS' if cond else 'FAIL'}{name}")
    return cond

print('Setup complete.')

Exercise 1 ★ — Joint PMF and Marginalisation

Two fair dice are rolled. Let $X$ be the value of die 1 and $Y$ be the value of die 2.

Part (a): Write down the joint PMF $p(x,y)$ as a $6 \times 6$ NumPy array.

Part (b): Compute the marginal PMFs $p_X(x)$ and $p_Y(y)$ by summing.

Part (c): Let $S = X + Y$. Compute $P(S=7)$ by marginalising the joint.

Part (d): Compute $\mathbb{E}[S]$ and $\text{Var}(S)$ from the joint PMF. Verify that $\mathbb{E}[S] = \mathbb{E}[X] + \mathbb{E}[Y]$ and $\text{Var}(S) = \text{Var}(X) + \text{Var}(Y)$ (since $X \perp Y$ for fair dice).

Code cell 5

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 6

# Solution
import numpy as np

header('Exercise 1: Joint PMF and Marginalisation')

# Part (a)
joint_pmf = np.ones((6, 6)) / 36
check_close('joint sums to 1', joint_pmf.sum(), 1.0)
check_true('shape is (6,6)', joint_pmf.shape == (6, 6))

# Part (b)
marginal_x = joint_pmf.sum(axis=1)  # sum over Y
marginal_y = joint_pmf.sum(axis=0)  # sum over X
check_close('marginal_x is uniform 1/6', marginal_x, np.ones(6)/6)
check_close('marginal_y is uniform 1/6', marginal_y, np.ones(6)/6)

# Part (c): P(S=7) — count cells where x+y=7
vals = np.arange(1, 7)
X_grid, Y_grid = np.meshgrid(vals, vals)
p_s7 = joint_pmf[X_grid + Y_grid == 7].sum()
check_close('P(S=7) = 6/36', p_s7, 6/36)

# Part (d): E[S] and Var(S) from joint
# E[S] = sum_{x,y} (x+y)*p(x,y)
S_vals = X_grid + Y_grid
E_S = (S_vals * joint_pmf).sum()
E_S2 = (S_vals**2 * joint_pmf).sum()
Var_S = E_S2 - E_S**2
check_close('E[S] = 7', E_S, 7.0)
check_close('Var[S] = Var[X]+Var[Y] = 35/6', Var_S, 35/6)

# Verify additivity (independence)
E_X = (vals * marginal_x).sum()
Var_X = ((vals**2) * marginal_x).sum() - E_X**2
check_close('E[S] = E[X]+E[Y]', E_S, 2*E_X)
check_close('Var[S] = Var[X]+Var[Y] (independence)', Var_S, 2*Var_X)

print('\nTakeaway: For independent X,Y, the joint factors as p(x)p(y), which lets'
      ' us compute moments of X+Y using marginals alone.')
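
As a quick extension of part (c): because the dice are independent, the full PMF of $S$ is the discrete convolution of the marginals, $p_S = p_X * p_Y$. A minimal sketch:

import numpy as np

p_die = np.ones(6) / 6                      # marginal PMF of one fair die
p_S = np.convolve(p_die, p_die)             # PMF of S = X + Y, supported on 2..12
S_support = np.arange(2, 13)
print('P(S=7) =', p_S[S_support == 7][0])   # 6/36, matching the grid count above
print('full PMF of S:', np.round(p_S, 4))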

Exercise 2 ★ — Continuous Marginals and Conditionals

Let $(X, Y)$ have joint PDF $f(x, y) = e^{-y}$ for $0 \le x \le y < \infty$.

Part (a): Verify that $f$ is a valid PDF (integrates to 1).

Part (b): Find $f_X(x)$ (the marginal of $X$). What distribution does $X$ follow?

Part (c): Find $f_{Y|X}(y \mid x)$. What standard distribution is $Y \mid X = x$?

Part (d): Compute $\mathbb{E}[Y \mid X = x]$ and verify numerically by sampling.

Code cell 8

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 9

# Solution
import numpy as np
from scipy.integrate import dblquad, quad
np.random.seed(42)

header('Exercise 2: Continuous Marginals and Conditionals')

# Part (a): Normalisation
def joint_pdf_2(y, x):
    return np.exp(-y) if 0 <= x <= y else 0.0

norm, _ = dblquad(joint_pdf_2, 0, 50, lambda x: x, lambda x: 50)
check_close('normalisation = 1', norm, 1.0, tol=1e-6)

# Part (b): Marginal f_X(x) = int_x^inf e^{-y} dy = e^{-x}
# So X ~ Exp(1)
def marginal_x_2(x):
    return np.exp(-x)  # Exp(1) PDF

mx_check, _ = quad(marginal_x_2, 0, 50)
check_close('marginal_x integrates to 1', mx_check, 1.0, tol=1e-6)
print('X ~ Exp(1) (marginal is exponential)')

# Part (c): f_{Y|X}(y|x) = e^{-y} / e^{-x} = e^{-(y-x)}, y >= x
# This is a Shifted Exp(1): Y|X=x ~ x + Exp(1)
def cond_pdf_y_given_x(y, x):
    if y < x:
        return 0.0
    return np.exp(-(y - x))  # Exp(1) shifted by x

x_test = 2.0
cond_check, _ = quad(lambda y: cond_pdf_y_given_x(y, x_test), x_test, 100)
check_close('conditional PDF integrates to 1 given X=2', cond_check, 1.0, tol=1e-6)
print('Y|X=x ~ x + Exp(1)  (shifted exponential)')

# Part (d): E[Y|X=x] = x + 1  (mean of x + Exp(1))
E_Y_given_x = lambda x: x + 1.0
check_close('E[Y|X=2] = 3', E_Y_given_x(x_test), 3.0)

# Verify by sampling from the joint: draw X ~ Exp(1), then Y = X + Exp(1)
N = 200000
X_samp = np.random.exponential(1, N)            # X ~ Exp(1)
Y_samp = X_samp + np.random.exponential(1, N)   # Y | X ~ x + Exp(1)

tol_x = 0.1
mask = np.abs(X_samp - x_test) < tol_x
EY_emp = Y_samp[mask].mean()
print(f'\nE[Y|X≈{x_test}] empirical: {EY_emp:.4f} (theory: {E_Y_given_x(x_test):.4f})')
check_true('empirical close to theory', abs(EY_emp - E_Y_given_x(x_test)) < 0.15)

print('\nTakeaway: f_{Y|X}(y|x) = f(x,y)/f_X(x) is a density in y, not a number.'
      ' Here it revealed Y|X ~ x+Exp(1): conditioning on X shifts the distribution.')

Exercise 3 ★ — Chain Rule and Sequence Probability

Consider a bigram language model over a 4-word vocabulary $\{\text{<s>}, \text{cat}, \text{sat}, \text{mat}\}$.

Part (a): Given the conditional distributions below, compute $P(\text{cat sat mat})$ via the chain rule.

Part (b): Compute the cross-entropy $H = -\log_2 P(\text{cat sat mat})$ in bits.

Part (c): A bigram model gives $P_2(\text{cat sat mat}) = 0.042$. A trigram model gives $P_3(\text{cat sat mat}) = 0.031$. Compute the perplexity $\text{PP} = 2^H$ for each model. Which model assigns higher probability to this sentence?

Conditional probabilities:

  • $P(\text{cat} \mid \text{<s>}) = 0.6$
  • $P(\text{sat} \mid \text{cat}) = 0.5$
  • $P(\text{mat} \mid \text{sat}) = 0.7$

Code cell 11

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 12

# Solution
import numpy as np

header('Exercise 3: Chain Rule and Sequence Probability')

P_cat_given_s = 0.6
P_sat_given_cat = 0.5
P_mat_given_sat = 0.7

# Part (a)
P_sentence = P_cat_given_s * P_sat_given_cat * P_mat_given_sat
check_close('P(cat sat mat) = 0.6*0.5*0.7', P_sentence, 0.6*0.5*0.7)

# Part (b)
H_bits = -np.log2(P_sentence)
print(f'H = {H_bits:.4f} bits  (= -log2({P_sentence:.4f}))')
check_close('cross-entropy bits', H_bits, -np.log2(0.21))

# Part (c)
P2 = 0.042; P3 = 0.031
PP2 = 2**(-np.log2(P2))
PP3 = 2**(-np.log2(P3))
# Simpler: PP = 1/P
PP2_simple = 1/P2; PP3_simple = 1/P3
check_close('PP2 = 1/P2', PP2, PP2_simple)
check_close('PP3 = 1/P3', PP3, PP3_simple)
print(f'Bigram  perplexity: {PP2:.2f}')
print(f'Trigram perplexity: {PP3:.2f}')
check_true('bigram has lower perplexity (higher prob)', PP2 < PP3)

print('\nTakeaway: Lower perplexity = higher probability assigned to sentence.'
      ' A good language model should minimise perplexity on held-out text.'
      ' Chain rule ensures any joint factorisation gives the same P(sentence).')
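
A note on convention: $\text{PP} = 2^H$ above is a sentence-level perplexity; reported language-model perplexities are usually per token, $\text{PP} = 2^{H/n}$ for an $n$-token sequence (Exercise 10 uses the natural-log version, $e^{\text{NLL}/n}$). A minimal sketch:

import numpy as np

P_sentence = 0.21                      # from part (a)
n_tokens = 3
H_bits = -np.log2(P_sentence)
print(f'sentence-level PP: {2**H_bits:.2f}')              # = 1/P(sentence)
print(f'per-token PP     : {2**(H_bits / n_tokens):.2f}')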

Exercise 4 ★★ — Independence Testing

Part (a): For the joint PMF below, determine whether $X \perp Y$ by checking the factorisation condition.

$$p(x,y) = \begin{pmatrix} 0.10 & 0.15 & 0.05 \\ 0.20 & 0.30 & 0.10 \\ 0.04 & 0.06 & 0.00 \end{pmatrix}$$

(rows = $X \in \{0,1,2\}$, columns = $Y \in \{0,1,2\}$)

Part (b): Compute $\text{Cov}(X, Y)$ from the joint PMF.

Part (c): Compute the mutual information $I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p_X(x) p_Y(y)}$. What does $I(X;Y) = 0$ imply? What does $I(X;Y) > 0$ imply?

Code cell 14

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 15

# Solution
import numpy as np

header('Exercise 4: Independence Testing')

P = np.array([
    [0.10, 0.15, 0.05],
    [0.20, 0.30, 0.10],
    [0.04, 0.06, 0.00]
])
check_close('P sums to 1', P.sum(), 1.0)

# Part (a)
px = P.sum(axis=1)   # marginal X: sum over columns (Y)
py = P.sum(axis=0)   # marginal Y: sum over rows (X)
P_indep = np.outer(px, py)
is_independent = np.allclose(P, P_indep)
print(f'Independent: {is_independent}')
print(f'Max deviation from independence: {np.abs(P - P_indep).max():.4f}')
check_true('NOT independent', not is_independent)

# Part (b): Cov(X, Y)
vals_x = np.arange(3)
vals_y = np.arange(3)
EX = (vals_x * px).sum()
EY = (vals_y * py).sum()
EXY = sum(vals_x[x]*vals_y[y]*P[x,y] for x in range(3) for y in range(3))
cov_xy = EXY - EX*EY
print(f'E[X]={EX:.4f}, E[Y]={EY:.4f}, E[XY]={EXY:.4f}')
print(f'Cov(X,Y) = {cov_xy:.6f}')

# Part (c): Mutual information I(X;Y)
MI = 0.0
for x in range(3):
    for y in range(3):
        if P[x,y] > 0:
            MI += P[x,y] * np.log(P[x,y] / (px[x]*py[y]))
print(f'I(X;Y) = {MI:.6f} nats')
check_true('MI >= 0', MI >= -1e-10)
check_true('MI > 0 (not independent)', MI > 0.001)

print('\nTakeaway: Cov(X,Y)=0 does NOT imply I(X;Y)=0 in general.'
      ' But I(X;Y)=0 iff X and Y are independent.'
      ' Mutual information is the gold standard for detecting dependence.')
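
The first takeaway claim has a classic counterexample worth verifying: take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$. Then $\text{Cov}(X,Y) = 0$ while $I(X;Y) > 0$. A minimal sketch:

import numpy as np

# X uniform on {-1, 0, 1}; Y = X^2, so Y in {0, 1}
xs = np.array([-1, 0, 1])
ys = np.array([0, 1])
P = np.array([[0.0, 1/3],   # X = -1  ->  Y = 1
              [1/3, 0.0],   # X =  0  ->  Y = 0
              [0.0, 1/3]])  # X = +1  ->  Y = 1
px, py = P.sum(axis=1), P.sum(axis=0)
EX = (xs * px).sum(); EY = (ys * py).sum()
EXY = sum(xs[i] * ys[j] * P[i, j] for i in range(3) for j in range(2))
MI = sum(P[i, j] * np.log(P[i, j] / (px[i] * py[j]))
         for i in range(3) for j in range(2) if P[i, j] > 0)
print(f'Cov(X,Y) = {EXY - EX*EY:.4f}, I(X;Y) = {MI:.4f} nats')  # 0.0000, 0.6365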

Exercise 5 ★★ — Covariance Matrix and PSD

Part (a): Construct a valid $4 \times 4$ covariance matrix $\Sigma$ with unit diagonal (i.e., all variances = 1) and off-diagonal entries chosen so that the matrix is PSD. Verify PSD via Cholesky decomposition.

Part (b): Sample $N = 10000$ points from $\mathcal{N}(\mathbf{0}, \Sigma)$ and verify that the sample covariance matrix is close to $\Sigma$.

Part (c): Compute the correlation matrix from $\Sigma$ (i.e., normalise by the diagonal). What is always true about the diagonal of a correlation matrix?

Part (d): The matrix $\Sigma^* = \Sigma + \alpha I$ is PSD for any $\alpha > 0$. This is Tikhonov regularisation. Show that adding $\alpha I$ increases all eigenvalues by $\alpha$ without changing eigenvectors. Why is this useful in ML (hint: condition number)?

Code cell 17

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 18

# Solution
import numpy as np
import numpy.linalg as la
np.random.seed(42)

header('Exercise 5: Covariance Matrix and PSD')

# Part (a)
L = np.array([[1,0,0,0],[0.3,1,0,0],[0.2,0.1,1,0],[0.4,0.2,0.3,1]], dtype=float)
Sigma_raw = L @ L.T
D_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma_raw)))
Sigma = D_inv @ Sigma_raw @ D_inv  # unit diagonal

check_close('diagonal is all 1s', np.diag(Sigma), np.ones(4))
eigvals = la.eigvalsh(Sigma)
check_true('all eigenvalues > 0 (PSD)', (eigvals > 0).all())
try:
    la.cholesky(Sigma)
    print('PASS — Cholesky succeeded (PSD confirmed)')
except la.LinAlgError:
    print('FAIL — not PSD')

# Part (b)
mu = np.zeros(4)
samples = np.random.multivariate_normal(mu, Sigma, 10000)
Sigma_est = np.cov(samples.T)
max_err = np.abs(Sigma_est - Sigma).max()
print(f'\nMax entry error between estimated and true Sigma: {max_err:.4f}')
check_true('estimated close to true (max err < 0.05)', max_err < 0.05)

# Part (c): Correlation matrix
D_inv2 = np.diag(1.0 / np.sqrt(np.diag(Sigma_est)))
Corr = D_inv2 @ Sigma_est @ D_inv2
check_close('correlation matrix diagonal is 1', np.diag(Corr), np.ones(4), tol=0.05)
print('Correlation matrix (diagonal = 1 by construction)')

# Part (d): Tikhonov regularisation
alpha = 0.1
Sigma_reg = Sigma + alpha * np.eye(4)
eigs_orig = np.sort(la.eigvalsh(Sigma))
eigs_reg  = np.sort(la.eigvalsh(Sigma_reg))
check_close('eigenvalues shift by alpha', eigs_reg - eigs_orig, alpha * np.ones(4))
cond_orig = eigs_orig.max() / eigs_orig.min()
cond_reg  = eigs_reg.max()  / eigs_reg.min()
print(f'Condition number: original={cond_orig:.2f}, regularised={cond_reg:.2f}')
check_true('regularisation reduces condition number', cond_reg < cond_orig)

print('\nTakeaway: Adding alpha*I to Sigma is the most common form of covariance'
      ' regularisation. It prevents near-singular inverses by ensuring all eigenvalues'
      ' are at least alpha. Used in ridge regression, Gaussian processes, and, in spirit,'
      ' the epsilon term in Adam.')
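
Part (d) also asserts that the eigenvectors are unchanged; this follows from $(\Sigma + \alpha I)v = \lambda v + \alpha v$. A quick self-contained check on a small symmetric matrix:

import numpy as np
import numpy.linalg as la

A = np.array([[2.0, 0.5], [0.5, 1.0]])        # any symmetric PSD matrix
alpha = 0.1
w, V = la.eigh(A)                              # columns of V are eigenvectors
w2, V2 = la.eigh(A + alpha * np.eye(2))
print('eigenvalue shift:', w2 - w)             # both entries equal alpha
# Same eigenvectors up to sign flips:
print('eigenvectors match:', np.allclose(np.abs(V.T @ V2), np.eye(2), atol=1e-8))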

Exercise 6 ★★ — MVN Conditionals (Schur Complement)

Let $(X_1, X_2, X_3) \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ with

$$\boldsymbol{\mu} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 4 & 2 & 1 \\ 2 & 3 & 1 \\ 1 & 1 & 2 \end{pmatrix}$$

Part (a): Compute the distribution of $X_1 \mid X_2 = 2.5, X_3 = 2.5$ using the Schur complement formula.

Part (b): Verify your answer by sampling: draw $N = 500{,}000$ samples from the joint, keep those with $X_2 \approx 2.5$ and $X_3 \approx 2.5$, and compare the empirical conditional mean and variance.

Part (c): Compute the partial correlation $\rho_{12.3} = \text{Corr}(X_1, X_2 \mid X_3)$. How does it differ from the marginal correlation $\rho_{12}$?

Code cell 20

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 21

# Solution
import numpy as np
np.random.seed(42)

header('Exercise 6: MVN Conditionals (Schur Complement)')

mu_full = np.array([1.0, 2.0, 3.0])
Sigma_full = np.array([[4.,2.,1.],[2.,3.,1.],[1.,1.,2.]])
x2_obs = np.array([2.5, 2.5])

# Partition
idx1, idx2 = [0], [1, 2]
mu1 = mu_full[idx1]; mu2 = mu_full[idx2]
S11 = Sigma_full[np.ix_(idx1, idx1)]
S12 = Sigma_full[np.ix_(idx1, idx2)]
S21 = Sigma_full[np.ix_(idx2, idx1)]
S22 = Sigma_full[np.ix_(idx2, idx2)]

S22_inv = np.linalg.inv(S22)
mu_cond = mu1 + S12 @ S22_inv @ (x2_obs - mu2)
Sigma_cond = S11 - S12 @ S22_inv @ S21

print(f'X1 | X2=2.5, X3=2.5:')
print(f'  mu_cond    = {mu_cond[0]:.4f}')
print(f'  sigma_cond = {np.sqrt(Sigma_cond[0,0]):.4f}')

# Part (b): Verify by sampling
N = 500000
samp = np.random.multivariate_normal(mu_full, Sigma_full, N)
tol = 0.08
mask = (np.abs(samp[:,1] - x2_obs[0]) < tol) & (np.abs(samp[:,2] - x2_obs[1]) < tol)
X1_cond = samp[mask, 0]
print(f'\nSampling ({mask.sum()} matching samples):')
print(f'  emp mean = {X1_cond.mean():.4f}  (theory: {mu_cond[0]:.4f})')
print(f'  emp std  = {X1_cond.std():.4f}  (theory: {np.sqrt(Sigma_cond[0,0]):.4f})')
check_true('empirical mean close to theory', abs(X1_cond.mean() - mu_cond[0]) < 0.2)
check_true('empirical std close to theory', abs(X1_cond.std() - np.sqrt(Sigma_cond[0,0])) < 0.2)

# Part (c): Partial correlation rho_{12.3}
# Compute conditional covariance of (X1,X2) given X3
idx12, idx3 = [0,1], [2]
S_12_12 = Sigma_full[np.ix_(idx12, idx12)]
S_12_3  = Sigma_full[np.ix_(idx12, idx3)]
S_3_3   = Sigma_full[np.ix_(idx3, idx3)]
Sigma_cond_12 = S_12_12 - S_12_3 @ np.linalg.inv(S_3_3) @ S_12_3.T
rho_12_partial = Sigma_cond_12[0,1] / np.sqrt(Sigma_cond_12[0,0] * Sigma_cond_12[1,1])
rho_12_marginal = Sigma_full[0,1] / np.sqrt(Sigma_full[0,0] * Sigma_full[1,1])
print(f'\nMarginal correlation rho_12   = {rho_12_marginal:.4f}')
print(f'Partial correlation rho_12.3  = {rho_12_partial:.4f}')

print('\nTakeaway: Partial correlation removes the confounding effect of X3.'
      ' If rho_12 > rho_12.3, part of the X1-X2 correlation was mediated by X3.'
      ' This is the MVN version of controlling for a covariate.')
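
A useful ML reading of the Schur complement: the weights $\Sigma_{12}\Sigma_{22}^{-1}$ in the conditional mean are exactly the slopes of the best linear predictor of $X_1$ from $(X_2, X_3)$. A minimal sketch checking this against ordinary least squares:

import numpy as np
np.random.seed(0)

mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[4., 2., 1.], [2., 3., 1.], [1., 1., 2.]])
samp = np.random.multivariate_normal(mu, Sigma, 200000)

# Regress X1 on (X2, X3) with an intercept
A = np.column_stack([np.ones(len(samp)), samp[:, 1:]])
coef, *_ = np.linalg.lstsq(A, samp[:, 0], rcond=None)

S12 = Sigma[0:1, 1:]; S22 = Sigma[1:, 1:]
beta = (S12 @ np.linalg.inv(S22)).ravel()      # theoretical slopes [0.6, 0.2]
print('OLS slopes  :', np.round(coef[1:], 3))
print('Schur slopes:', np.round(beta, 3))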

Exercise 7 ★★★ — Reparameterisation Trick

The reparameterisation trick is the engine of VAE training.

Part (a): Implement two gradient estimators for $\nabla_\mu \mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[\sin(z)]$:

  1. REINFORCE: $\nabla_\mu \mathbb{E}[f(z)] = \mathbb{E}[f(z) \cdot (z - \mu)/\sigma^2]$
  2. Reparameterisation: $\nabla_\mu \mathbb{E}[f(\mu + \sigma\varepsilon)] = \mathbb{E}[f'(\mu + \sigma\varepsilon)]$

Compare variance of the two estimators over 200 Monte Carlo runs.

Part (b): Implement the VAE ELBO computation for a 1D Gaussian encoder:

$$\text{ELBO}(x; \phi, \theta) = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \,\|\, p(z))$$

where $q_\phi(z|x) = \mathcal{N}(\mu_\phi, \sigma_\phi^2)$, $p(z) = \mathcal{N}(0, 1)$, and the KL has a closed form:

$$\text{KL}(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)) = \frac{1}{2}(\mu^2 + \sigma^2 - \log\sigma^2 - 1)$$

Code cell 23

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 24

# Solution
import numpy as np
np.random.seed(42)

header('Exercise 7: Reparameterisation Trick')

mu, sigma = 1.5, 0.8

# Part (a): True gradient
N_ref = 2000000
z_ref = np.random.normal(mu, sigma, N_ref)
true_grad = np.cos(z_ref).mean()  # d/dmu E[sin(z)] = E[cos(z)] by reparam
print(f'Reference gradient (N={N_ref}): {true_grad:.6f}')

# REINFORCE: grad = E[sin(z) * (z-mu)/sigma^2]
N_mc = 5000
n_runs = 200
rf_ests = []
rp_ests = []
for _ in range(n_runs):
    z_rf = np.random.normal(mu, sigma, N_mc)
    rf_ests.append((np.sin(z_rf) * (z_rf - mu) / sigma**2).mean())
    eps = np.random.normal(0, 1, N_mc)
    z_rp = mu + sigma * eps
    rp_ests.append(np.cos(z_rp).mean())  # d/dmu sin(mu+sigma*eps) = cos(z)

rf_ests = np.array(rf_ests)
rp_ests = np.array(rp_ests)

print(f'\nGradient estimates (mean ± std over {n_runs} runs, N={N_mc} each):')
print(f'  REINFORCE:        {rf_ests.mean():.4f} ± {rf_ests.std():.4f}')
print(f'  Reparameterisation: {rp_ests.mean():.4f} ± {rp_ests.std():.4f}')
print(f'  True:             {true_grad:.4f}')
print(f'  Variance ratio (RF/RP): {rf_ests.var()/rp_ests.var():.1f}x')

check_true('reparam is closer to truth', abs(rp_ests.mean()-true_grad) < abs(rf_ests.mean()-true_grad) or
           rp_ests.var() < rf_ests.var())
check_true('reparam has lower variance', rp_ests.var() < rf_ests.var())

# Part (b): VAE ELBO KL term
def kl_gaussian(mu, sigma):
    # KL(N(mu,sigma^2) || N(0,1)) = 0.5*(mu^2 + sigma^2 - log(sigma^2) - 1)
    return 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

# Verify KL = 0 at N(0,1)
check_close('KL(N(0,1)||N(0,1)) = 0', kl_gaussian(0.0, 1.0), 0.0)
check_true('KL >= 0 always', kl_gaussian(0.5, 0.7) >= 0)

# Monte Carlo verification
mu_t, sigma_t = 0.5, 0.7
kl_theory = kl_gaussian(mu_t, sigma_t)
# MC KL: E_{q}[log q(z) - log p(z)]
z_mc = np.random.normal(mu_t, sigma_t, 500000)
log_q = -0.5*np.log(2*np.pi*sigma_t**2) - (z_mc-mu_t)**2/(2*sigma_t**2)
log_p = -0.5*np.log(2*np.pi) - z_mc**2/2
kl_mc = (log_q - log_p).mean()
print(f'\nKL({mu_t},{sigma_t}): theory={kl_theory:.6f}, MC={kl_mc:.6f}')
check_close('KL theory matches MC', kl_theory, kl_mc, tol=0.005)

print('\nTakeaway: The reparameterisation trick replaces high-variance REINFORCE'
      ' with a deterministic, low-variance gradient. This is why VAEs train stably:'
      ' gradients flow through the deterministic mu and sigma paths, not through'
      ' the sampling operation itself.')
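
Part (b) asks for the full ELBO, not just the KL term. A minimal sketch of the complete 1D computation, assuming (purely for illustration) a Gaussian decoder $p_\theta(x \mid z) = \mathcal{N}(x; z, 1)$, with reparameterised samples for the reconstruction term and the closed-form KL:

import numpy as np
np.random.seed(0)

def elbo_1d(x, mu_phi, sigma_phi, n_samples=10000):
    # Reparameterised samples: z = mu + sigma * eps, eps ~ N(0,1)
    eps = np.random.normal(0, 1, n_samples)
    z = mu_phi + sigma_phi * eps
    # Reconstruction term: E_q[log p(x|z)] with decoder p(x|z) = N(x; z, 1)
    log_px_given_z = -0.5 * np.log(2 * np.pi) - 0.5 * (x - z) ** 2
    recon = log_px_given_z.mean()
    # Closed-form KL(q || N(0,1))
    kl = 0.5 * (mu_phi**2 + sigma_phi**2 - np.log(sigma_phi**2) - 1.0)
    return recon - kl

print(f'ELBO(x=1.0; mu=0.5, sigma=0.7) ~ {elbo_1d(1.0, 0.5, 0.7):.4f}')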

Exercise 8 ★★★ — Naive Bayes Classification

Naive Bayes uses conditional independence to classify documents:

$$P(Y = c \mid \mathbf{x}) \propto P(Y = c) \prod_{i=1}^d P(x_i \mid Y = c)$$

Part (a): Implement a Gaussian Naive Bayes classifier for a 2-class, 2-feature dataset. Train on 80% of data, evaluate on 20%.

Part (b): Show that the Naive Bayes decision boundary is linear when both classes have the same per-feature variance $\sigma^2$ (i.e., derive when $\log P(Y=1 \mid \mathbf{x}) > \log P(Y=0 \mid \mathbf{x})$).

Part (c): Compare Naive Bayes accuracy to logistic regression on the same dataset. When does NB outperform LR and vice versa?

Code cell 26

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 27

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

header('Exercise 8: Naive Bayes Classification')

# Generate data
N_per_class = 300
mu0 = np.array([0.0, 0.0]); mu1 = np.array([2.0, 1.5])
Sigma_class = np.array([[1.5, 0.6],[0.6, 1.0]])
X0 = np.random.multivariate_normal(mu0, Sigma_class, N_per_class)
X1 = np.random.multivariate_normal(mu1, Sigma_class, N_per_class)
X_all = np.vstack([X0, X1])
y_all = np.concatenate([np.zeros(N_per_class), np.ones(N_per_class)]).astype(int)

idx = np.random.permutation(len(y_all))
train_idx = idx[:int(0.8*len(y_all))]
test_idx  = idx[int(0.8*len(y_all)):]
X_train, y_train = X_all[train_idx], y_all[train_idx]
X_test,  y_test  = X_all[test_idx],  y_all[test_idx]

# Part (a): Gaussian Naive Bayes
def gnb_train(X, y):
    classes = np.unique(y)
    priors  = {c: (y==c).mean() for c in classes}
    means   = {c: X[y==c].mean(axis=0) for c in classes}
    stds    = {c: X[y==c].std(axis=0)  for c in classes}
    return classes, priors, means, stds

def gnb_predict(params, X):
    classes, priors, means, stds = params
    log_probs = np.zeros((len(X), len(classes)))
    for i, c in enumerate(classes):
        log_prior = np.log(priors[c])
        log_lik = sum(stats.norm.logpdf(X[:,j], means[c][j], stds[c][j]+1e-9)
                      for j in range(X.shape[1]))
        log_probs[:, i] = log_prior + log_lik
    return classes[np.argmax(log_probs, axis=1)]

params = gnb_train(X_train, y_train)
y_pred_nb = gnb_predict(params, X_test)
acc_nb = (y_pred_nb == y_test).mean()
print(f'Gaussian Naive Bayes accuracy: {acc_nb:.4f}')

# Part (c): Compare with logistic regression (sklearn-free implementation)
# Use a simple gradient descent logistic regression
def sigmoid(z): return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

X_tr_aug = np.hstack([X_train, np.ones((len(X_train), 1))])
X_te_aug = np.hstack([X_test,  np.ones((len(X_test),  1))])
w = np.zeros(3)
lr_rate = 0.1
for _ in range(300):
    p = sigmoid(X_tr_aug @ w)
    grad = X_tr_aug.T @ (p - y_train) / len(y_train)
    w -= lr_rate * grad

y_pred_lr = (sigmoid(X_te_aug @ w) > 0.5).astype(int)
acc_lr = (y_pred_lr == y_test).mean()
print(f'Logistic Regression accuracy: {acc_lr:.4f}')

check_true('Naive Bayes accuracy > 0.75', acc_nb > 0.75)
check_true('Logistic Regression accuracy > 0.75', acc_lr > 0.75)

print('\nNB vs LR comparison:')
print(f'  NB: {acc_nb:.4f}  LR: {acc_lr:.4f}')
print('NB assumes conditional independence (often wrong but fast and interpretable).')
print('LR makes no such assumption; fits the boundary directly from data.')
print('With small N or many features, NB can outperform LR (less variance).')
print('\nTakeaway: Naive Bayes encodes conditional independence as a modelling assumption.'
      ' Despite being "naive", it often works well in NLP (bag-of-words spam filtering)'
      ' because even if features are dependent, the decision boundary can be correct.')
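
Part (b) can also be checked numerically: with shared per-feature variances, the quadratic terms cancel in the log-odds, leaving an affine function of $\mathbf{x}$. A minimal sketch with made-up class means and variances:

import numpy as np

# Shared per-feature stds; only the class means differ
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.5])
sigma = np.array([1.2, 0.9])
prior0 = prior1 = 0.5

def log_odds(x):
    # log P(y=1|x) - log P(y=0|x); shared normalising constants cancel
    ll1 = -((x - mu1) ** 2 / (2 * sigma ** 2)).sum()
    ll0 = -((x - mu0) ** 2 / (2 * sigma ** 2)).sum()
    return np.log(prior1) - np.log(prior0) + ll1 - ll0

# Closed form after the x^2 terms cancel: w @ x + b
w = (mu1 - mu0) / sigma ** 2
b = np.log(prior1 / prior0) + ((mu0 ** 2 - mu1 ** 2) / (2 * sigma ** 2)).sum()
x = np.array([0.7, -0.3])
print(np.isclose(log_odds(x), w @ x + b))  # True: the boundary w@x + b = 0 is linear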

Exercise 9 ★★★ — Conditional Independence in a Collider

Let $X$ and $Y$ be independent Bernoulli$(0.5)$ variables and define $Z = X \oplus Y$ (exclusive OR).

Part (a): Verify that $X$ and $Y$ are marginally independent.

Part (b): Condition on $Z = 1$ and compute the joint distribution of $(X, Y)$.

Part (c): Show that $X$ and $Y$ are not independent given $Z = 1$.

Part (d): Explain why conditioning can create dependence in causal and diagnostic models.

Code cell 29

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 30

# Solution
import numpy as np

# Part (a): joint PMF of two independent Bernoulli(0.5) variables
joint = np.ones((2, 2)) / 4
px = joint.sum(axis=1)
py = joint.sum(axis=0)
independent = np.allclose(joint, np.outer(px, py))

# Parts (b)-(c): condition on Z = 1, where Z[x, y] = x XOR y
Z = np.array([[0, 1], [1, 0]])
mask = (Z == 1)
joint_given = np.where(mask, joint, 0.0)
joint_given /= joint_given.sum()   # renormalise to get P(X, Y | Z=1)
px_given = joint_given.sum(axis=1)
py_given = joint_given.sum(axis=0)
conditional_independent = np.allclose(joint_given, np.outer(px_given, py_given))

header('Exercise 9: Conditional Independence in a Collider')
print('P(X,Y):')
print(joint)
print('P(X,Y | Z=1):')
print(joint_given)
check_true('X and Y are marginally independent', independent)
check_true('X and Y are dependent after conditioning on Z=1', not conditional_independent)
print('Takeaway: conditioning is not always harmless. X and Y each explain Z, so once'
      ' Z=1 is known, learning X=1 forces Y=0 ("explaining away"): conditioning on a'
      ' common effect (a collider) creates spurious dependence.')
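
The same effect is visible by simulation, which makes part (d) concrete: marginally $X$ and $Y$ are uncorrelated, but among samples with $Z = 1$, knowing $X$ pins down $Y$ exactly. A quick sketch:

import numpy as np
rng = np.random.default_rng(0)

N = 100000
X = rng.integers(0, 2, N)
Y = rng.integers(0, 2, N)
Z = X ^ Y                                # collider: Z = X XOR Y
print('marginal corr(X, Y):', np.round(np.corrcoef(X, Y)[0, 1], 3))  # ~0
sel = Z == 1
print('corr(X, Y | Z=1)   :', np.round(np.corrcoef(X[sel], Y[sel])[0, 1], 3))  # -1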

Exercise 10 ★★★ — Autoregressive Factorization of a Token Sequence

A tiny language model assigns conditional probabilities to a three-token sequence $(x_1, x_2, x_3)$.

Part (a): Compute $p(x_1, x_2, x_3) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1, x_2)$.

Part (b): Compute the negative log-likelihood and perplexity.

Part (c): Compare with a model that assumes conditional independence after $x_1$.

Part (d): Interpret why the chain rule is the probability skeleton of autoregressive LLMs.

Code cell 32

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 33

# Solution
import numpy as np

# Part (a): chain rule p(x1,x2,x3) = p(x1) p(x2|x1) p(x3|x1,x2)
p_x1 = 0.30
p_x2_given_x1 = 0.50
p_x3_given_x12 = 0.40
p_joint = p_x1 * p_x2_given_x1 * p_x3_given_x12

# Part (b): NLL in nats and per-token perplexity
nll = -np.log(p_joint)
perplexity = np.exp(nll / 3)

# Part (c): a model that assumes x2 and x3 are independent of earlier context
p_x2_ind = 0.20
p_x3_ind = 0.25
p_ind_model = p_x1 * p_x2_ind * p_x3_ind
nll_ind = -np.log(p_ind_model)
perplexity_ind = np.exp(nll_ind / 3)

header('Exercise 10: Autoregressive Factorization')
print(f'Chain-rule probability: {p_joint:.4f}')
print(f'NLL: {nll:.4f}, perplexity: {perplexity:.4f}')
print(f'Independence model probability: {p_ind_model:.4f}')
print(f'Independence model perplexity: {perplexity_ind:.4f}')
check_true('Using context improves this sequence probability', p_joint > p_ind_model)
check_true('Using context lowers perplexity', perplexity < perplexity_ind)
print('Takeaway: autoregressive modeling is joint probability built from conditionals, one token at a time.')
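
For part (d): this is exactly how an autoregressive LM scores a sequence under teacher forcing: sum the per-token conditional log-probabilities to get the joint log-probability, then exponentiate the negative per-token average for perplexity. A minimal sketch with the numbers above:

import numpy as np

cond_probs = [0.30, 0.50, 0.40]        # p(x1), p(x2|x1), p(x3|x1,x2)
log_probs = np.log(cond_probs)         # per-token log-likelihoods
seq_logprob = log_probs.sum()          # log p(x1,x2,x3) by the chain rule
ppl = np.exp(-seq_logprob / len(cond_probs))
print(f'log p(sequence) = {seq_logprob:.4f}, per-token perplexity = {ppl:.4f}')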

What to Review After Finishing

  • Marginalisation — can you integrate/sum out a variable from a joint?
  • Conditional distributions — $f_{Y|X} = f_{X,Y}/f_X$; it's a function of $y$
  • Independence vs zero covariance — give a counterexample
  • Schur complement formula — derive it from scratch for a 2×2 block matrix
  • Reparameterisation trick — why does it reduce gradient variance?
  • Naive Bayes — when is the decision boundary linear?
  • Collider conditional dependence — can you explain the probabilistic calculation and the ML relevance?
  • Autoregressive factorization — can you connect the computation to model behavior?
