Math for LLMs: Theory Notebook

Probability Theory / Joint Distributions

Converted from theory.ipynb for web reading.

Joint Distributions

"To understand a random vector, you must understand not just its components, but how they dance together."

Interactive theory notebook for §06/03. Work through cells top-to-bottom.

Topics covered:

  • Joint PMF/PDF/CDF and marginalisation
  • Conditional distributions and chain rule
  • Independence: three equivalent characterisations
  • Covariance, correlation, covariance matrices
  • Multivariate Gaussian: geometry, conditionals, affine transforms
  • Change of variables and Jacobians
  • Copulas and Sklar's theorem
  • ML applications: VAE, attention, autoregressive factorisation

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
import scipy.linalg as la
from scipy import stats

try:
    import matplotlib.pyplot as plt
    import matplotlib
    plt.style.use('seaborn-v0_8-whitegrid')
    plt.rcParams['figure.figsize'] = [10, 6]
    plt.rcParams['font.size'] = 12
    HAS_MPL = True
except ImportError:
    HAS_MPL = False
    print('matplotlib not available — skipping plots')

try:
    import seaborn as sns
    HAS_SNS = True
except ImportError:
    HAS_SNS = False

np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)
print('Setup complete.')

1. Joint Distributions: Discrete and Continuous

A joint distribution fully describes the simultaneous behaviour of two or more random variables. For a pair $(X, Y)$:

  • Discrete: joint PMF $p(x,y) = P(X=x, Y=y)$, with $\sum_{x,y} p(x,y) = 1$.
  • Continuous: joint PDF $f(x,y) \ge 0$, with $\iint f(x,y)\,dx\,dy = 1$.
  • CDF: $F(x,y) = P(X \le x, Y \le y)$ for both cases.

The PDF is recovered from the CDF by mixed partial differentiation: $f(x,y) = \partial^2 F(x,y)/\partial x\,\partial y$.
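
As a quick numerical sanity check of this relation (a small sketch added for the web version, not one of the numbered cells), the mixed partial of the bivariate Gaussian CDF can be approximated by central differences and compared against the PDF; the test point and step size h are arbitrary choices.

import numpy as np
from scipy import stats

rho = 0.7
rv = stats.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

def F(x, y):
    # Joint CDF F(x, y) = P(X <= x, Y <= y)
    return rv.cdf([x, y])

# Central-difference approximation of the mixed partial d^2 F / dx dy
x0, y0, h = 0.5, -0.3, 0.05
mixed = (F(x0 + h, y0 + h) - F(x0 + h, y0 - h)
         - F(x0 - h, y0 + h) + F(x0 - h, y0 - h)) / (4 * h**2)

print(f'Finite-difference d2F/dxdy at ({x0}, {y0}): {mixed:.4f}')
print(f'Joint PDF f(x0, y0)                       : {rv.pdf([x0, y0]):.4f}')
print('These should agree to a few decimal places.')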

Code cell 5

# === 1.1 Discrete Joint PMF ===

# Example: roll two dice, X = die 1, Y = die 2
# Joint PMF: p(x,y) = 1/36 for x,y in {1,...,6}
vals = np.arange(1, 7)
joint_pmf = np.ones((6, 6)) / 36
print('Joint PMF (6x6 matrix, each entry = 1/36):')
print(joint_pmf)
print(f'Sum of all probabilities: {joint_pmf.sum():.4f}')

# Marginal PMFs
marginal_x = joint_pmf.sum(axis=1)  # sum over Y
marginal_y = joint_pmf.sum(axis=0)  # sum over X
print(f'\nMarginal P(X=x): {marginal_x}  (should be 1/6 each)')
print(f'Marginal P(Y=y): {marginal_y}  (should be 1/6 each)')

# PMF of the sum S = X + Y (a derived variable; obtained by summing the joint PMF)
S_pmf = {}
for x in range(1, 7):
    for y in range(1, 7):
        s = x + y
        S_pmf[s] = S_pmf.get(s, 0) + 1/36
print('\nPMF of S = X + Y:')
for s in sorted(S_pmf):
    print(f'  P(S={s}) = {S_pmf[s]:.4f} = {int(round(S_pmf[s]*36))}/36')

1.2 Continuous Joint PDF

For continuous $(X, Y)$, probabilities are computed via double integration:

$$P(a \le X \le b,\; c \le Y \le d) = \int_a^b \int_c^d f(x,y)\,dy\,dx$$

A common example: the bivariate Gaussian with correlation $\rho$:

$$f(x,y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\!\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2} + \frac{(y-\mu_2)^2}{\sigma_2^2}\right]\right)$$
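
To make the double-integral formula concrete, here is a small sketch (not one of the numbered cells) that evaluates a rectangle probability for the rho = 0.7 bivariate Gaussian twice: once with scipy's dblquad and once by Monte Carlo. The rectangle [0, 1] x [0, 1] is an arbitrary choice.

import numpy as np
from scipy import stats
from scipy.integrate import dblquad

np.random.seed(42)
rho = 0.7
rv = stats.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]])

# P(0 <= X <= 1, 0 <= Y <= 1) by numerical double integration
p_quad, _ = dblquad(lambda y, x: rv.pdf([x, y]), 0, 1, 0, 1)

# Monte Carlo check with 200k samples
samp = rv.rvs(200000)
p_mc = ((samp[:, 0] >= 0) & (samp[:, 0] <= 1) &
        (samp[:, 1] >= 0) & (samp[:, 1] <= 1)).mean()

print(f'P(0<=X<=1, 0<=Y<=1) via dblquad    : {p_quad:.4f}')
print(f'P(0<=X<=1, 0<=Y<=1) via Monte Carlo: {p_mc:.4f}')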

Code cell 7

# === 1.2 Bivariate Gaussian PDF visualisation ===

mu = np.array([0.0, 0.0])
rho = 0.7
Sigma = np.array([[1.0, rho], [rho, 1.0]])

rv = stats.multivariate_normal(mean=mu, cov=Sigma)

x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X_grid, Y_grid = np.meshgrid(x, y)
pos = np.dstack((X_grid, Y_grid))
Z = rv.pdf(pos)

print(f'Bivariate Gaussian with rho={rho}')
print(f'Peak value at (0,0): {rv.pdf([0,0]):.4f}')
print(f'Theoretical peak: 1/(2*pi*sqrt(1-rho^2)) = {1/(2*np.pi*np.sqrt(1-rho**2)):.4f}')

if HAS_MPL:
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    for ax, r in zip(axes, [0.0, 0.7, -0.7]):
        S = np.array([[1.0, r], [r, 1.0]])
        rv_ = stats.multivariate_normal(mean=[0,0], cov=S)
        Z_ = rv_.pdf(pos)
        ct = ax.contourf(X_grid, Y_grid, Z_, levels=20, cmap='viridis')
        ax.set_title(f'rho = {r}')
        ax.set_xlabel('X'); ax.set_ylabel('Y')
    plt.suptitle('Bivariate Gaussian PDF for different correlations')
    plt.tight_layout()
    plt.show()
    print('PASS — plotted bivariate Gaussians')

2. Marginalisation

Given a joint distribution, we recover individual marginals by integrating out (or summing out) the other variable:

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy, \qquad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx.$$

Marginalisation is one-way: knowing both marginals does not determine the joint — a fact captured by copula theory (Section 9).

Code cell 9

# === 2.1 Marginalisation: Triangular Joint ===
# f(x,y) = 3x for 0 <= y <= x <= 1  (from Appendix V.1)

from scipy.integrate import dblquad, quad

def joint_pdf(y, x):  # note: dblquad integrates inner variable first
    if 0 <= y <= x <= 1:
        return 3*x
    return 0.0

# Verify normalisation
norm, err = dblquad(joint_pdf, 0, 1, 0, lambda x: x)
print(f'Normalisation integral: {norm:.6f} (should be 1.0)')

# Marginal f_X(x) = 3x^2
def marg_x(x):
    return 3*x**2

# Marginal f_Y(y) = 3(1-y^2)/2
def marg_y(y):
    return 1.5*(1 - y**2)

# Verify marginals integrate to 1
mx_int, _ = quad(marg_x, 0, 1)
my_int, _ = quad(marg_y, 0, 1)
print(f'Marginal f_X integrates to: {mx_int:.6f}')
print(f'Marginal f_Y integrates to: {my_int:.6f}')

# E[X] and E[Y]
EX, _ = quad(lambda x: x * marg_x(x), 0, 1)
EY, _ = quad(lambda y: y * marg_y(y), 0, 1)
print(f'E[X] = {EX:.4f} (theory: 3/4 = {3/4:.4f})')
print(f'E[Y] = {EY:.4f} (theory: 3/8 = {3/8:.4f})')

3. Conditional Distributions

The conditional PDF of $Y$ given $X=x$ is defined as:

$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}, \quad \text{where } f_X(x) > 0.$$

This is not a point probability but a function of $y$ for fixed $x$. It must integrate to 1 over $y$: $\int f_{Y|X}(y|x)\,dy = 1$.

The conditional mean $\mathbb{E}[Y|X=x] = \int y\,f_{Y|X}(y|x)\,dy$ is a function of $x$ and is central to regression.
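
For a concrete non-Gaussian case, consider the triangular joint f(x,y) = 3x on 0 <= y <= x <= 1 from Section 2: there f_{Y|X}(y|x) = 3x/(3x^2) = 1/x, i.e. Y | X = x is Uniform(0, x) with conditional mean x/2. The short sketch below (assuming that density) verifies both facts numerically.

import numpy as np
from scipy.integrate import quad

def cond_pdf(y, x):
    # f_{Y|X}(y|x) = f(x,y)/f_X(x) = 3x / (3x^2) = 1/x on 0 <= y <= x
    return 1.0 / x if 0 <= y <= x else 0.0

for x0 in [0.3, 0.6, 0.9]:
    total, _ = quad(cond_pdf, 0, x0, args=(x0,))
    cond_mean, _ = quad(lambda y: y * cond_pdf(y, x0), 0, x0)
    print(f'x={x0}: integral of f(y|x) = {total:.4f}, E[Y|X=x] = {cond_mean:.4f} (theory: {x0/2:.4f})')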

Code cell 11

# === 3.1 Conditional Distributions: Bivariate Gaussian ===

# Joint (X1, X2) ~ N(0, Sigma) with Sigma = [[4, 3],[3, 9]]
mu_joint = np.array([0.0, 0.0])
Sigma_joint = np.array([[4.0, 3.0], [3.0, 9.0]])

# Schur complement formula for X1 | X2 = x2
mu1, mu2 = mu_joint[0], mu_joint[1]
S11, S12, S21, S22 = Sigma_joint[0,0], Sigma_joint[0,1], Sigma_joint[1,0], Sigma_joint[1,1]

x2_given = 3.0
mu_cond = mu1 + S12 / S22 * (x2_given - mu2)
sigma2_cond = S11 - S12**2 / S22

print('X1 | X2 = 3 distribution (Schur complement):')
print(f'  Conditional mean  mu_1|2 = {mu1} + ({S12}/{S22})*({x2_given}-{mu2}) = {mu_cond:.4f}')
print(f'  Conditional var  sig^2_1|2 = {S11} - {S12}^2/{S22} = {sigma2_cond:.4f}')

# Verify by sampling
N = 200000
samples = np.random.multivariate_normal(mu_joint, Sigma_joint, N)
tol = 0.05 * abs(x2_given)
mask = np.abs(samples[:, 1] - x2_given) < tol
cond_samples = samples[mask, 0]
print(f'\nVerification (conditioning X2 ≈ {x2_given} ± {tol:.2f}):')
print(f'  Sample conditional mean: {cond_samples.mean():.4f} (theory: {mu_cond:.4f})')
print(f'  Sample conditional std : {cond_samples.std():.4f}  (theory: {np.sqrt(sigma2_cond):.4f})')

ok_mean = abs(cond_samples.mean() - mu_cond) < 0.15
ok_std  = abs(cond_samples.std() - np.sqrt(sigma2_cond)) < 0.15
print(f'PASS mean — {ok_mean}    PASS std — {ok_std}')

4. Chain Rule of Probability

The chain rule decomposes any joint distribution into a product of conditionals:

$$p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n p(x_i \mid x_1, \ldots, x_{i-1})$$

The ordering of variables can be arbitrary — different orderings give different factorisations, all valid. This factorisation is the mathematical foundation of autoregressive language models: the probability of a token sequence is decomposed as

$$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^T P(w_t \mid w_1, \ldots, w_{t-1})$$

Each conditional $P(w_t \mid w_{<t})$ is what a transformer learns to model.
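
Before the bigram example in the next cell, a minimal sketch on a made-up 2 x 2 joint table confirms that two different orderings of the chain rule recover the same joint: p(x, y) = p(x) p(y|x) = p(y) p(x|y).

import numpy as np

# A small made-up joint PMF over X in {0,1}, Y in {0,1}
joint = np.array([[0.10, 0.30],
                  [0.25, 0.35]])   # joint[x, y]
assert np.isclose(joint.sum(), 1.0)

p_x = joint.sum(axis=1)             # marginal of X
p_y = joint.sum(axis=0)             # marginal of Y
p_y_given_x = joint / p_x[:, None]  # p(y|x)
p_x_given_y = joint / p_y[None, :]  # p(x|y)

# Reconstruct the joint using both orderings of the chain rule
recon_1 = p_x[:, None] * p_y_given_x   # p(x) * p(y|x)
recon_2 = p_y[None, :] * p_x_given_y   # p(y) * p(x|y)

print('Both factorisations recover the joint:',
      np.allclose(recon_1, joint) and np.allclose(recon_2, joint))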

Code cell 13

# === 4.1 Chain Rule: Language Model Probability ===

# Simulate a simple unigram/bigram language model
import numpy as np
np.random.seed(42)

# Vocabulary: 5 words
vocab = ['the', 'cat', 'sat', 'on', 'mat']
V = len(vocab)

# Unigram probs (P(w_1))
unigram = np.array([0.4, 0.2, 0.15, 0.15, 0.1])
assert np.isclose(unigram.sum(), 1.0)

# Bigram transition matrix: bigram[i,j] = P(w_{t+1}=j | w_t=i)
bigram = np.array([
    [0.05, 0.5, 0.0, 0.2, 0.25],  # after 'the'
    [0.1,  0.0, 0.6, 0.2, 0.1 ],  # after 'cat'
    [0.3,  0.1, 0.0, 0.5, 0.1 ],  # after 'sat'
    [0.3,  0.1, 0.0, 0.0, 0.6 ],  # after 'on'
    [0.6,  0.2, 0.1, 0.1, 0.0 ],  # after 'mat'
])
# Verify rows sum to 1
assert np.allclose(bigram.sum(axis=1), 1.0)

# Compute P(the, cat, sat, on, mat) via chain rule
sentence = [0, 1, 2, 3, 4]  # indices
log_prob = np.log(unigram[sentence[0]])
for t in range(1, len(sentence)):
    log_prob += np.log(bigram[sentence[t-1], sentence[t]])

print('Sentence:', ' '.join(vocab[i] for i in sentence))
print(f'Chain rule: log P = {log_prob:.4f}')
print(f'P = {np.exp(log_prob):.8f}')
print('\nBreakdown:')
print(f'  P(the)            = {unigram[0]:.3f}')
print(f'  P(cat | the)      = {bigram[0,1]:.3f}')
print(f'  P(sat | cat)      = {bigram[1,2]:.3f}')
print(f'  P(on  | sat)      = {bigram[2,3]:.3f}')
print(f'  P(mat | on)       = {bigram[3,4]:.3f}')
product = unigram[0]*bigram[0,1]*bigram[1,2]*bigram[2,3]*bigram[3,4]
print(f'  Product           = {product:.8f}')
print(f'PASS — chain rule product matches: {np.isclose(product, np.exp(log_prob))}')

5. Independence

$X$ and $Y$ are independent ($X \perp Y$) when knowing the value of one provides no information about the other. Three equivalent characterisations:

  1. Factorisation: $f_{X,Y}(x,y) = f_X(x)\,f_Y(y)$ for almost all $(x,y)$.
  2. CDF factorisation: $F_{X,Y}(x,y) = F_X(x)\,F_Y(y)$.
  3. Conditional = marginal: $f_{Y|X}(y|x) = f_Y(y)$ for all $x$.

Warning: Zero covariance does NOT imply independence (see the unit disk example). For jointly Gaussian random variables, it does — because the Gaussian is fully characterised by means and covariances.

Code cell 15

# === 5.1 Independence: Three Equivalent Tests ===

import numpy as np
np.random.seed(42)

# Test 1: Independent (X, Y) ~ N(0,I)
N = 100000
X = np.random.normal(0, 1, N)
Y = np.random.normal(0, 1, N)

# Check Cov(X,Y) ≈ 0
cov_indep = np.cov(X, Y)[0, 1]
print('Test 1: Independent standard normals')
print(f'  Cov(X,Y) = {cov_indep:.4f}  (should ≈ 0)')
print(f'  PASS — near zero: {abs(cov_indep) < 0.02}')

# Test 2: Dependent (X, Y) on unit disk — zero cov but not independent
# Sample uniformly from unit disk
r = np.sqrt(np.random.uniform(0, 1, N))
theta = np.random.uniform(0, 2*np.pi, N)
Xd = r * np.cos(theta)
Yd = r * np.sin(theta)

cov_disk = np.cov(Xd, Yd)[0, 1]
print('\nTest 2: Uniform on unit disk')
print(f'  Cov(X,Y) = {cov_disk:.4f}  (should ≈ 0 by symmetry)')

# But X and Y are NOT independent: conditional variance of Y given X=0.9 is tiny
x_val = 0.9
tol = 0.05
mask = np.abs(Xd - x_val) < tol
Y_given_x = Yd[mask]
print(f'  Marginal std(Y)      = {Yd.std():.4f}')
print(f'  Cond std(Y|X≈0.9)   = {Y_given_x.std():.4f}  (much smaller!)')
# Y | X=x is Uniform(-sqrt(1-x^2), sqrt(1-x^2)), whose std is sqrt(1-x^2)/sqrt(3)
print(f'  Theory: sqrt(1-0.9^2)/sqrt(3) = {np.sqrt(1-0.9**2)/np.sqrt(3):.4f} approx')
print(f'  Cov=0 but NOT independent: confirmed')

# Test 3: Bivariate Gaussian: cov=0 => independent
rho = 0.0
Z = np.random.multivariate_normal([0,0], [[1,rho],[rho,1]], N)
Xg, Yg = Z[:, 0], Z[:, 1]
mask2 = np.abs(Xg - 0.9) < 0.05
print(f'\nTest 3: Bivariate Gaussian with rho=0')
print(f'  Marginal std(Y)      = {Yg.std():.4f}')
print(f'  Cond std(Y|X≈0.9)   = {Yg[mask2].std():.4f}  (same as marginal — INDEPENDENT)')

6. Covariance and Correlation

Covariance measures linear dependence:

$$\text{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

Pearson correlation normalises to $[-1, 1]$:

$$\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

The covariance matrix of a random vector $\mathbf{X} \in \mathbb{R}^d$:

$$\Sigma = \mathbb{E}[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^\top], \quad \Sigma_{ij} = \text{Cov}(X_i, X_j)$$

$\Sigma$ is always symmetric positive semi-definite (PSD).
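
A quick numerical sketch of the covariance identity and the correlation bound (the linear-plus-noise pair below is an arbitrary example):

import numpy as np
np.random.seed(42)

N = 100000
X = np.random.normal(0, 2, N)
Y = 0.5 * X + np.random.normal(0, 1, N)   # linearly related to X plus noise

cov_identity = np.mean(X * Y) - X.mean() * Y.mean()   # E[XY] - E[X]E[Y]
cov_numpy = np.cov(X, Y, bias=True)[0, 1]             # population covariance
rho = cov_identity / (X.std() * Y.std())

print(f'E[XY] - E[X]E[Y] = {cov_identity:.4f}')
print(f'np.cov (biased)  = {cov_numpy:.4f}')
print(f'Pearson rho      = {rho:.4f}  (must lie in [-1, 1])')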

Code cell 17

# === 6.1 Covariance Matrix: Construction and Properties ===

import numpy as np
np.random.seed(42)

# Create a 3D Gaussian with known covariance
mu = np.array([1.0, 2.0, 3.0])
# Build a valid PSD matrix via Sigma = A @ A.T + eps*I
A = np.array([[2, 1, 0], [0, 1, 1], [1, 0, 2]], dtype=float)
Sigma_true = A @ A.T + 0.1 * np.eye(3)

print('True covariance matrix Sigma:')
print(Sigma_true)
print(f'Eigenvalues (all > 0, so Sigma is positive definite): {np.linalg.eigvalsh(Sigma_true).round(4)}')

# Sample and estimate
N = 100000
samples = np.random.multivariate_normal(mu, Sigma_true, N)
Sigma_est = np.cov(samples.T)

print('\nEstimated covariance matrix (N=100,000):')
print(Sigma_est.round(4))

max_err = np.abs(Sigma_est - Sigma_true).max()
print(f'\nMax entry error: {max_err:.4f}')
print(f'PASS — estimate close to true: {max_err < 0.05}')

# Verify PSD via Cholesky
try:
    L = np.linalg.cholesky(Sigma_est)
    print('PASS — Cholesky decomposition succeeded (matrix is positive definite)')
except np.linalg.LinAlgError:
    print('FAIL — matrix not PSD')

# Correlation matrix
D_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma_true)))
Corr = D_inv @ Sigma_true @ D_inv
print('\nCorrelation matrix (diagonal should be 1):')
print(Corr.round(4))

6.2 Anscombe's Quartet — Correlation Can Mislead

Four datasets with identical means, variances, and correlations, yet completely different joint distributions. This illustrates why correlation alone is an inadequate summary of a bivariate distribution.

Code cell 19

# === 6.2 Anscombe's Quartet ===

import numpy as np

# Anscombe's quartet — four datasets with same statistics
anscombe = {
    'I':   ([10,8,13,9,11,14,6,4,12,7,5],  [8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68]),
    'II':  ([10,8,13,9,11,14,6,4,12,7,5],  [9.14,8.14,8.74,8.77,9.26,8.10,6.13,3.10,9.13,7.26,4.74]),
    'III': ([10,8,13,9,11,14,6,4,12,7,5],  [7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73]),
    'IV':  ([8,8,8,8,8,8,8,19,8,8,8],       [6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89]),
}

print('Dataset statistics (should all be ~equal):')
print(f'{"Set":>4} | {"E[X]":>8} | {"E[Y]":>8} | {"Var[X]":>8} | {"Var[Y]":>8} | {"Corr":>8}')
print('-' * 55)
for name, (x, y) in anscombe.items():
    x, y = np.array(x, dtype=float), np.array(y)
    r = np.corrcoef(x, y)[0, 1]
    print(f'{name:>4} | {x.mean():>8.3f} | {y.mean():>8.3f} | {x.var():>8.3f} | {y.var():>8.3f} | {r:>8.3f}')

if HAS_MPL:
    fig, axes = plt.subplots(1, 4, figsize=(16, 4))
    for ax, (name, (x, y)) in zip(axes, anscombe.items()):
        x, y = np.array(x, dtype=float), np.array(y)
        ax.scatter(x, y, color='steelblue', s=60, zorder=5)
        # Regression line
        m, b = np.polyfit(x, y, 1)
        xfit = np.linspace(x.min()-0.5, x.max()+0.5, 50)
        ax.plot(xfit, m*xfit + b, 'r-', lw=2)
        ax.set_title(f'Dataset {name}')
        ax.set_xlabel('X'); ax.set_ylabel('Y')
    plt.suptitle("Anscombe's Quartet: same correlation, different distributions")
    plt.tight_layout()
    plt.show()
    print('PASS — plotted Anscombe quartet')

7. Multivariate Gaussian

The multivariate Gaussian $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ is defined by its PDF:

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

Key properties:

  • Marginals of a joint Gaussian are Gaussian.
  • Conditionals are Gaussian (Schur complement formula).
  • Affine transforms: $A\mathbf{X}+\mathbf{b} \sim \mathcal{N}(A\boldsymbol{\mu}+\mathbf{b}, A\Sigma A^\top)$.
  • Maximum entropy: the Gaussian maximises entropy among all distributions with fixed mean and covariance (see the numerical check below).
  • Uncorrelated Gaussian = independent (unique to the Gaussian family).
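
As a numerical check of the maximum-entropy property (a small sketch, not one of the numbered cells): all three distributions below have mean 0 and variance 1, and the Gaussian has the largest differential entropy. The particular comparison distributions are an arbitrary choice.

import numpy as np
from scipy import stats

# All three distributions have mean 0 and variance 1
dists = {
    'Normal(0,1)':    stats.norm(0, 1),
    'Laplace, var=1': stats.laplace(scale=1/np.sqrt(2)),                      # var = 2*scale^2
    'Uniform, var=1': stats.uniform(loc=-np.sqrt(3), scale=2*np.sqrt(3)),     # var = width^2/12
}

for name, rv in dists.items():
    print(f'{name:>16}: var = {rv.var():.4f}, differential entropy = {rv.entropy():.4f} nats')

print(f'Gaussian maximum: 0.5*ln(2*pi*e) = {0.5*np.log(2*np.pi*np.e):.4f} nats')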

Code cell 21

# === 7.1 MVN: Geometry via Eigendecomposition ===

import numpy as np
np.random.seed(42)

# 2D Gaussian: Sigma = Q * diag(lambda) * Q^T (eigendecomposition)
Sigma = np.array([[3.0, 1.5], [1.5, 2.0]])
eigvals, Q = np.linalg.eigh(Sigma)

print('Covariance matrix Sigma:')
print(Sigma)
print(f'Eigenvalues: {eigvals.round(4)}')
print('Eigenvectors (columns = principal axes):')
print(Q.round(4))
print(f'\nReconstruction: Q*diag(lam)*Q.T matches Sigma: {np.allclose(Q @ np.diag(eigvals) @ Q.T, Sigma)}')

# Ellipse radii = sqrt(eigenvalues), oriented along eigenvectors
print(f'\nEllipse semi-axes (1-sigma confidence ellipse):')
for i, (lam, v) in enumerate(zip(eigvals, Q.T)):
    print(f'  Axis {i+1}: length={np.sqrt(lam):.4f}, direction={v.round(4)}')

if HAS_MPL:
    from matplotlib.patches import Ellipse
    fig, ax = plt.subplots(figsize=(8, 6))
    # Sample points
    mu = np.array([0.0, 0.0])
    pts = np.random.multivariate_normal(mu, Sigma, 300)
    ax.scatter(pts[:, 0], pts[:, 1], alpha=0.3, s=15, color='steelblue')
    # Plot eigenvectors scaled by sqrt(eigenvalue)
    for lam, v in zip(eigvals, Q.T):
        scale = np.sqrt(lam)
        ax.annotate('', xy=scale*v, xytext=-scale*v,
                    arrowprops=dict(arrowstyle='<->', color='red', lw=2))
    ax.set_aspect('equal')
    ax.set_title('MVN samples with principal axes (eigenvectors)')
    ax.set_xlabel('X1'); ax.set_ylabel('X2')
    plt.tight_layout()
    plt.show()
    print('PASS — plotted MVN geometry')

Code cell 22

# === 7.2 MVN Conditionals: Schur Complement ===

import numpy as np
from scipy import stats
np.random.seed(42)

# Partition: X = (X1, X2, X3), condition on X2, X3
mu_full = np.array([1.0, 2.0, 3.0])
Sigma_full = np.array([
    [4.0, 1.5, 1.0],
    [1.5, 3.0, 0.5],
    [1.0, 0.5, 2.0]
])

# Partition indices: group 1 = {0}, group 2 = {1,2}
idx1, idx2 = [0], [1, 2]
mu1 = mu_full[idx1]; mu2 = mu_full[idx2]
S11 = Sigma_full[np.ix_(idx1, idx1)]
S12 = Sigma_full[np.ix_(idx1, idx2)]
S21 = Sigma_full[np.ix_(idx2, idx1)]
S22 = Sigma_full[np.ix_(idx2, idx2)]

x2_obs = np.array([2.5, 3.5])  # observed values of X2, X3

# Schur complement
S22_inv = np.linalg.inv(S22)
mu_cond = mu1 + S12 @ S22_inv @ (x2_obs - mu2)
Sigma_cond = S11 - S12 @ S22_inv @ S21

print('X1 | (X2=2.5, X3=3.5) distribution:')
print(f'  mu_cond    = {mu_cond[0]:.4f}')
print(f'  sigma_cond = {np.sqrt(Sigma_cond[0,0]):.4f}')

# Verify by sampling
N = 500000
samp = np.random.multivariate_normal(mu_full, Sigma_full, N)
tol = 0.05
mask = (np.abs(samp[:, 1] - x2_obs[0]) < tol) & (np.abs(samp[:, 2] - x2_obs[1]) < tol)
X1_cond = samp[mask, 0]
print(f'\nSampling verification ({mask.sum()} matching samples):')
print(f'  Sample mean : {X1_cond.mean():.4f} (theory: {mu_cond[0]:.4f})')
print(f'  Sample std  : {X1_cond.std():.4f}  (theory: {np.sqrt(Sigma_cond[0,0]):.4f})')
ok = abs(X1_cond.mean() - mu_cond[0]) < 0.2
print(f'PASS — {ok}')

Code cell 23

# === 7.3 MVN Affine Transforms ===

import numpy as np
np.random.seed(42)

# If X ~ N(mu, Sigma), then Y = AX + b ~ N(A*mu + b, A*Sigma*A.T)
mu_x = np.array([1.0, 2.0])
Sigma_x = np.array([[2.0, 1.0], [1.0, 3.0]])
A = np.array([[1, 2], [0, 1], [1, -1]], dtype=float)  # 3x2 matrix
b = np.array([0.5, 1.0, -0.5])

# Theoretical distribution of Y = AX + b
mu_y_theory = A @ mu_x + b
Sigma_y_theory = A @ Sigma_x @ A.T

print('Theoretical distribution of Y = AX + b:')
print(f'  mu_Y = {mu_y_theory}')
print(f'  Sigma_Y =\n{Sigma_y_theory}')

# Verify by sampling
N = 200000
X_samp = np.random.multivariate_normal(mu_x, Sigma_x, N)
Y_samp = (A @ X_samp.T).T + b

mu_y_emp = Y_samp.mean(axis=0)
Sigma_y_emp = np.cov(Y_samp.T)

print('\nEmpirical distribution:')
print(f'  mu_Y_emp = {mu_y_emp.round(4)}')
print(f'  Sigma_Y_emp =\n{Sigma_y_emp.round(4)}')

ok_mu = np.allclose(mu_y_emp, mu_y_theory, atol=0.02)
ok_sig = np.allclose(Sigma_y_emp, Sigma_y_theory, atol=0.05)
print(f'\nPASS mu — {ok_mu}')
print(f'PASS sigma — {ok_sig}')
print('\nAI connection: In attention, Q=XW_Q, K=XW_K, V=XW_V are affine transforms.')
print('If X ~ MVN, then Q,K,V are also MVN. This is used in theoretical analysis of attention.')

8. Change of Variables and Jacobians

For a smooth bijection $\mathbf{Y} = g(\mathbf{X})$ with inverse $\mathbf{x} = g^{-1}(\mathbf{y})$:

$$f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}(g^{-1}(\mathbf{y})) \cdot \left|\det J_{g^{-1}}(\mathbf{y})\right|$$

where $J_{g^{-1}}$ is the Jacobian matrix of $g^{-1}$:

$$[J_{g^{-1}}]_{ij} = \frac{\partial [g^{-1}]_i}{\partial y_j}$$

The $|\det J|$ factor accounts for how the transformation stretches or compresses volume.

Code cell 25

# === 8.1 Change of Variables: Polar Coordinates ===

import numpy as np
from scipy import stats
np.random.seed(42)

# Transform: (X, Y) i.i.d. N(0,1)  ->  (R, Theta) polar
# X = R*cos(T), Y = R*sin(T)
# f_{X,Y}(x,y) = (1/2pi)*exp(-(x^2+y^2)/2)
#
# Jacobian of inverse (x,y) -> (r,theta):
# J = [[cos T, -r sin T], [sin T, r cos T]]
# |det J| = r (the polar area element)
#
# f_{R,T}(r,theta) = f_{X,Y}(r cos T, r sin T) * r
#                  = (1/2pi)*exp(-r^2/2)*r
# Marginalising over theta: f_R(r) = r*exp(-r^2/2)  (Rayleigh dist.)

# Sample (X,Y) i.i.d. N(0,1)
N = 100000
X = np.random.normal(0, 1, N)
Y = np.random.normal(0, 1, N)
R = np.sqrt(X**2 + Y**2)
Theta = np.arctan2(Y, X)

# Verify: R should follow Rayleigh(sigma=1), i.e., R ~ Rayleigh(1)
# PDF: f(r) = r * exp(-r^2/2), CDF: 1 - exp(-r^2/2)
rayleigh_rv = stats.rayleigh(scale=1)
ks_stat, p_val = stats.kstest(R, rayleigh_rv.cdf)
print('R = sqrt(X^2+Y^2) where X,Y iid N(0,1)')
print(f'KS test vs Rayleigh(1): stat={ks_stat:.4f}, p={p_val:.4f}')
print(f'PASS — R ~ Rayleigh(1): {p_val > 0.05}')

# Verify: Theta should be Uniform(0, 2pi)
Theta_pos = Theta % (2 * np.pi)
ks_stat2, p_val2 = stats.kstest(Theta_pos, stats.uniform(0, 2*np.pi).cdf)
print(f'\nTheta = arctan2(Y,X):')
print(f'KS test vs Uniform(0,2pi): stat={ks_stat2:.4f}, p={p_val2:.4f}')
print(f'PASS — Theta ~ Uniform(0,2pi): {p_val2 > 0.05}')
print('\nKey: f_R(r) = r*exp(-r^2/2) is the Rayleigh PDF — the r factor comes from |det J| = r')

9. Copulas and Sklar's Theorem

A copula separates marginal behaviour from dependence structure.

Sklar's Theorem: Every joint CDF $F_{X,Y}$ can be written as

$$F_{X,Y}(x,y) = C(F_X(x),\, F_Y(y))$$

where $C:[0,1]^2 \to [0,1]$ is a copula — a joint CDF on $[0,1]^2$ with uniform marginals. The copula $C$ is unique when the marginals are continuous.

The copula $C(u,v)$ encodes the entire dependence structure independently of the marginal distributions. This allows constructing joints with arbitrary marginals and specified dependence — powerful for financial modelling and stress testing.

Code cell 27

# === 9.1 Gaussian Copula ===

import numpy as np
from scipy import stats
np.random.seed(42)

# Gaussian copula: sample from correlated Gaussian, then apply marginal transforms
rho = 0.8
Sigma_cop = np.array([[1.0, rho], [rho, 1.0]])
N = 5000

# Step 1: Sample from bivariate N(0, Sigma)
Z = np.random.multivariate_normal([0, 0], Sigma_cop, N)

# Step 2: Apply marginal CDF to get uniform marginals (probability integral transform)
U = stats.norm.cdf(Z)

# Step 3: Apply desired marginal inverse CDFs
# Margin 1: Exponential(1), Margin 2: Beta(2,5)
X_exp = stats.expon(scale=1).ppf(U[:, 0])
X_beta = stats.beta(2, 5).ppf(U[:, 1])

print('Gaussian copula with Exp(1) and Beta(2,5) marginals')
print(f'Correlation in the copula (Z space): rho={rho}')
print(f'Sample Pearson correlation of (X_exp, X_beta): {np.corrcoef(X_exp, X_beta)[0,1]:.4f}')
print(f'  (different from rho because marginals are non-Gaussian)')

# Verify marginals via KS test
ks1, p1 = stats.kstest(X_exp,   stats.expon(scale=1).cdf)
ks2, p2 = stats.kstest(X_beta,  stats.beta(2, 5).cdf)
print(f'\nKS test X_exp  ~ Exp(1):   KS={ks1:.4f}, p={p1:.4f}')
print(f'KS test X_beta ~ Beta(2,5): KS={ks2:.4f}, p={p2:.4f}')
print(f'PASS marginals: {p1 > 0.05 and p2 > 0.05}')

if HAS_MPL:
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    axes[0].scatter(Z[:500, 0], Z[:500, 1], alpha=0.3, s=10, color='steelblue')
    axes[0].set_title('Step 1: Correlated Gaussian (Z-space)')
    axes[0].set_xlabel('Z1'); axes[0].set_ylabel('Z2')
    axes[1].scatter(X_exp[:500], X_beta[:500], alpha=0.3, s=10, color='seagreen')
    axes[1].set_title('Step 3: Exp(1) x Beta(2,5) with Gaussian copula')
    axes[1].set_xlabel('X_exp'); axes[1].set_ylabel('X_beta')
    plt.tight_layout()
    plt.show()
    print('PASS — plotted copula transformation')

10. Conditional Independence

$X$ and $Y$ are conditionally independent given $Z$ ($X \perp Y \mid Z$) if:

$$f_{X,Y|Z}(x,y|z) = f_{X|Z}(x|z) \cdot f_{Y|Z}(y|z)$$

This is different from marginal independence:

| Pattern | $X \perp Y$? | $X \perp Y \mid Z$? |
|---|---|---|
| Chain $X \to Z \to Y$ | No | Yes |
| Fork $X \leftarrow Z \rightarrow Y$ | No | Yes |
| Collider $X \to Z \leftarrow Y$ | Yes | No |

Naive Bayes uses conditional independence: $P(\mathbf{x}|Y=c) = \prod_i P(x_i|Y=c)$, assuming features are conditionally independent given the class label.
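
The next cell demonstrates the fork and collider rows of the table. As a complement, here is a short sketch of the chain row X -> Z -> Y (noise levels chosen arbitrarily): X and Y are correlated marginally, but conditioning on Z removes the correlation.

import numpy as np
np.random.seed(42)

N = 100000
# Chain: X -> Z -> Y
X = np.random.normal(0, 1, N)
Z = X + np.random.normal(0, 0.5, N)   # Z depends on X
Y = Z + np.random.normal(0, 0.5, N)   # Y depends only on Z (given Z, Y ignores X)

print(f'Corr(X,Y) marginal = {np.corrcoef(X, Y)[0,1]:.4f}  (non-zero: marginally dependent)')

# Condition on a narrow slice of Z: the X-Y correlation should vanish
mask = np.abs(Z - 1.0) < 0.05
print(f'Corr(X,Y | Z ≈ 1)  = {np.corrcoef(X[mask], Y[mask])[0,1]:.4f}  (≈0: conditionally independent)')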

Code cell 29

# === 10.1 Conditional Independence: Fork Structure ===

import numpy as np
np.random.seed(42)

# Fork: Z -> X, Z -> Y
# Z ~ Bernoulli(0.5), X|Z ~ N(Z, 1), Y|Z ~ N(2*Z, 1)
N = 100000
Z = np.random.binomial(1, 0.5, N)
X = np.random.normal(Z.astype(float), 1, N)
Y = np.random.normal(2 * Z.astype(float), 1, N)

print('Fork structure: Z -> X, Z -> Y')
print(f'E[X]={X.mean():.4f}, E[Y]={Y.mean():.4f}')
print(f'Corr(X,Y) marginal = {np.corrcoef(X, Y)[0,1]:.4f}  (non-zero: X,Y marginally dependent)')

# Conditional on Z=0 and Z=1: correlation should drop
for z_val in [0, 1]:
    mask = Z == z_val
    corr_cond = np.corrcoef(X[mask], Y[mask])[0, 1]
    print(f'  Corr(X,Y | Z={z_val}) = {corr_cond:.4f}  (≈0: conditionally independent)')

print('PASS — Conditioning on Z removes the correlation (Fork: CI given common cause)')

print('\n--- Collider: X -> Z <- Y ---')
# Collider: X, Y independent; Z = X + Y + noise
X2 = np.random.normal(0, 1, N)
Y2 = np.random.normal(0, 1, N)
Z2 = X2 + Y2 + np.random.normal(0, 0.1, N)  # Z is a noisy sum

print(f'Corr(X2,Y2) marginal = {np.corrcoef(X2, Y2)[0,1]:.4f}  (≈0: marginally independent)')

# Condition on Z2 ≈ 2 (high Z): now X and Y become anti-correlated
mask_z2 = np.abs(Z2 - 2.0) < 0.2
corr_given_z2 = np.corrcoef(X2[mask_z2], Y2[mask_z2])[0, 1]
print(f'Corr(X2,Y2 | Z2≈2) = {corr_given_z2:.4f}  (negative: Berkson paradox / explaining away)')
print('PASS — Conditioning on collider induces negative correlation')

11. ML Application: Reparameterisation Trick (VAE)

Variational Autoencoders (VAEs) require gradients through sampling. The reparameterisation trick makes this possible:

$$\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}\big(\boldsymbol{\mu}_\phi(\mathbf{x}),\; \mathrm{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x}))\big)$$

is rewritten as:

$$\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I)$$

Now $\boldsymbol{\varepsilon}$ is sampled from a fixed distribution, and gradients flow through the deterministic $\boldsymbol{\mu}_\phi$ and $\boldsymbol{\sigma}_\phi$.

Code cell 31

# === 11.1 Reparameterisation Trick: Gradient Estimation ===

import numpy as np
np.random.seed(42)

# Simulate gradient estimation: E_{z~N(mu,sigma^2)}[z^2]
# True value: E[z^2] = mu^2 + sigma^2
# dE/dmu = 2*mu, dE/dsigma = 2*sigma

mu = 2.0
sigma = 1.5
N_mc = 10000

# Method 1: REINFORCE (score function estimator) — high variance
# grad_mu E[f(z)] = E[f(z) * (z-mu)/sigma^2]
z_rf = np.random.normal(mu, sigma, N_mc)
f_rf = z_rf**2
score = (z_rf - mu) / sigma**2
grad_mu_reinforce = np.mean(f_rf * score)

# Method 2: Reparameterisation trick — low variance
# z = mu + sigma * eps, f(z) = (mu + sigma*eps)^2
# df/dmu = 2*(mu + sigma*eps) = 2*z
eps = np.random.normal(0, 1, N_mc)
z_reparam = mu + sigma * eps
grad_mu_reparam = np.mean(2 * z_reparam)  # df/dmu = 2z

true_grad_mu = 2 * mu  # d(mu^2 + sigma^2)/dmu = 2*mu

print(f'True gradient dE[z^2]/dmu = 2*mu = {true_grad_mu:.4f}')
print(f'REINFORCE estimate       = {grad_mu_reinforce:.4f}  (high variance)')
print(f'Reparameterisation est.  = {grad_mu_reparam:.4f}  (low variance)')

# Compare variance
rf_grads = []
for _ in range(200):
    z = np.random.normal(mu, sigma, 1000)   # reuse the SAME z for f(z) and the score factor
    rf_grads.append(np.mean(z**2 * (z - mu) / sigma**2))
rp_grads = [np.mean(2*(mu + sigma*np.random.normal(0,1,1000))) for _ in range(200)]
print(f'\nVariance of 200 estimates (N=1000):')
print(f'  REINFORCE variance   : {np.var(rf_grads):.4f}')
print(f'  Reparam variance     : {np.var(rp_grads):.6f}')
print(f'Variance reduction ratio: {np.var(rf_grads)/np.var(rp_grads):.1f}x')
print('PASS — reparameterisation has much lower variance')

12. ML Application: Attention as Conditional Expectation

Scaled dot-product attention can be interpreted probabilistically:

$$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

The softmax weights $\alpha_{ij} = \text{softmax}\big(q_i^\top k_j / \sqrt{d_k}\big)_j$ define a distribution over key positions. The output for query $i$ is:

$$\mathbf{o}_i = \sum_j \alpha_{ij} \mathbf{v}_j = \mathbb{E}_{j \sim \alpha_i}[\mathbf{v}_j]$$

This is a conditional expectation: given query $q_i$, the output is the expected value vector under the attention distribution. Each row $\alpha_i$ is a conditional distribution over key positions; together with a distribution over query positions, these rows define a joint distribution over $(\text{query position}, \text{key position})$ pairs.

Code cell 33

# === 12.1 Attention as Conditional Expectation ===

import numpy as np
np.random.seed(42)

# Mini attention: seq_len=5, d_k=4, d_v=4
seq_len = 5
d_k, d_v = 4, 4

Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_v)

# Compute attention weights
scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

alpha = softmax(scores)  # (seq_len, seq_len) — attention distribution
output = alpha @ V       # (seq_len, d_v) — conditional expectation of V

print('Attention weights (each row is a distribution over key positions):')
print(alpha.round(4))
print(f'Row sums: {alpha.sum(axis=1).round(6)} (all 1.0 — valid prob distribution)')

# Output[i] = sum_j alpha[i,j] * V[j] = E_{j~alpha[i]}[V[j]]
print('\nOutput[0] = sum_j alpha[0,j] * V[j]:')
manual = sum(alpha[0, j] * V[j] for j in range(seq_len))
print(f'  Manual: {manual.round(6)}')
print(f'  Matrix: {output[0].round(6)}')
print(f'PASS — same result: {np.allclose(manual, output[0])}')

print('\nInterpretation: output[i] is the conditional mean of V')
print('under the distribution alpha[i] over key positions.')
print('High-entropy alpha => output is a broad average.')
print('Low-entropy alpha  => output focuses on one key ("sharp attention").')
entropies = -np.sum(alpha * np.log(alpha + 1e-10), axis=1)
print(f'Attention entropies (nats): {entropies.round(4)}')
print(f'Max entropy = log({seq_len}) = {np.log(seq_len):.4f}')

Code cell 34

# === 12.2 Autoregressive Factorisation: Sequence Probability ===

import numpy as np
from scipy.special import softmax
np.random.seed(42)

# Simulate a tiny autoregressive model (3-word vocabulary)
vocab = ['<s>', 'cat', 'mat']
V = len(vocab)

# Conditional distributions: p(w_t | w_{t-1}) as a 3x3 matrix
# (In reality, this would be a transformer output)
transition = np.array([
    [0.0, 0.6, 0.4],   # after <s>: cat (0.6) or mat (0.4)
    [0.3, 0.2, 0.5],   # after cat: <s> (0.3), cat (0.2), mat (0.5)
    [0.5, 0.3, 0.2],   # after mat: <s> (0.5), cat (0.3), mat (0.2)
])
start_dist = np.array([1.0, 0.0, 0.0])  # always start with <s>

# Compute P('cat', 'mat') via chain rule
# = P(cat | <s>) * P(mat | cat) = 0.6 * 0.5 = 0.3
sentence = [0, 1, 2]  # <s>, cat, mat
log_prob = 0.0
for i in range(1, len(sentence)):
    log_prob += np.log(transition[sentence[i-1], sentence[i]])

print("Chain rule: P('<s>', 'cat', 'mat') via autoregressive factorisation")
print(f"  P(cat | <s>) = {transition[0, 1]:.3f}")
print(f"  P(mat | cat) = {transition[1, 2]:.3f}")
print(f"  P(sentence) = {np.exp(log_prob):.4f} = {transition[0,1]*transition[1,2]:.4f}")

# Monte Carlo sampling from the model
def sample_sequence(length=5):
    seq = [0]  # start with <s>
    for _ in range(length - 1):
        prev = seq[-1]
        next_tok = np.random.choice(V, p=transition[prev])
        seq.append(next_tok)
    return seq

print('\n5 sampled sequences:')
for _ in range(5):
    seq = sample_sequence(4)
    print(' '.join(vocab[t] for t in seq))
print('\nThis is exactly how GPT generates text: sample one token at a time')
print('from p(w_t | w_1,...,w_{t-1}), using the chain rule of probability.')

13. ML Application: Gaussian Mixture Models

A Gaussian Mixture Model (GMM) is a latent variable model:

$$p(\mathbf{x}) = \sum_{k=1}^K \pi_k \,\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k)$$

where $\pi_k$ are mixing weights with $\sum_k \pi_k = 1$.

The joint over $(\mathbf{x}, z)$, where $z$ is the cluster assignment, is:

$$p(\mathbf{x}, z=k) = \pi_k \,\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k)$$

The marginal $p(\mathbf{x}) = \sum_k p(\mathbf{x}, z=k)$ is the mixture, and the conditional $p(z=k|\mathbf{x})$ — the responsibility — is computed via Bayes' rule:

$$r_k(\mathbf{x}) = p(z=k|\mathbf{x}) = \frac{\pi_k \,\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k)}{\sum_{j} \pi_j \,\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_j, \Sigma_j)}$$

Code cell 36

# === 13.1 GMM: EM Algorithm ===

import numpy as np
from scipy import stats
np.random.seed(42)

# Generate data from true GMM: K=2 components in 2D
K = 2
true_mu = [np.array([-2.0, 0.0]), np.array([2.0, 1.0])]
true_Sigma = [np.eye(2), np.array([[1.5, 0.5],[0.5, 0.8]])]
true_pi = [0.4, 0.6]

N = 500
z_true = np.random.choice(K, size=N, p=true_pi)
X_data = np.vstack([
    np.random.multivariate_normal(true_mu[k], true_Sigma[k])
    for k in z_true
])

# EM Algorithm
# Initialise
pi = np.array([0.5, 0.5])
mu = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]
Sigma = [np.eye(2), np.eye(2)]

def gmm_log_likelihood(X, pi, mu, Sigma):
    N = len(X)
    ll = 0.0
    for i in range(N):
        mix = sum(pi[k] * stats.multivariate_normal(mu[k], Sigma[k]).pdf(X[i]) for k in range(K))
        ll += np.log(mix + 1e-300)
    return ll

print('EM training:')
print(f'  Initial log-likelihood: {gmm_log_likelihood(X_data, pi, mu, Sigma):.2f}')

for iteration in range(20):
    # E-step: compute responsibilities
    R = np.zeros((N, K))
    for k in range(K):
        R[:, k] = pi[k] * stats.multivariate_normal(mu[k], Sigma[k]).pdf(X_data)
    R = R / R.sum(axis=1, keepdims=True)

    # M-step: update parameters
    Nk = R.sum(axis=0)  # effective counts
    for k in range(K):
        mu[k] = (R[:, k:k+1] * X_data).sum(axis=0) / Nk[k]
        diff = X_data - mu[k]
        Sigma[k] = (R[:, k:k+1] * diff).T @ diff / Nk[k] + 1e-5 * np.eye(2)
    pi = Nk / N

final_ll = gmm_log_likelihood(X_data, pi, mu, Sigma)
print(f'  Final log-likelihood:   {final_ll:.2f}')
print(f'\nEstimated mixing weights: {pi.round(4)} (true: {true_pi})')
print(f'Estimated mu_0: {mu[0].round(4)} (true: {true_mu[0]})')
print(f'Estimated mu_1: {mu[1].round(4)} (true: {true_mu[1]})')

ok0 = np.allclose(np.sort(pi), np.sort(true_pi), atol=0.1)
print(f'PASS — mixing weights recovered: {ok0}')

14. Common Mistakes and Edge Cases

A summary of the most frequent errors when working with joint distributions.

Code cell 38

# === 14.1 Common Mistakes Demo ===

import numpy as np
from scipy import stats
np.random.seed(42)

print('=== Mistake 1: Treating marginals as determining the joint ===')
# Same marginals, different copulas => different dependence
N = 100000
rho_vals = [0.0, 0.9, -0.9]
for rho in rho_vals:
    Sig = np.array([[1, rho],[rho, 1]])
    samp = np.random.multivariate_normal([0,0], Sig, N)
    U = stats.norm.cdf(samp)  # uniform marginals
    # P(both in top quartile)
    p_joint_top = ((U[:,0] > 0.75) & (U[:,1] > 0.75)).mean()
    p_indep_top = 0.25 * 0.25
    print(f'  rho={rho:+.1f}: P(X>Q75, Y>Q75) = {p_joint_top:.4f}'
          f'  (independence would give {p_indep_top:.4f})')
print('=> Same marginals, very different joint tail behaviour!\n')

print('=== Mistake 2: Cov=0 does not imply independence ===')
X = np.random.uniform(-1, 1, N)
Y = X**2 + np.random.normal(0, 0.01, N)  # Y determined by X!
print(f'  Cov(X, Y) = {np.cov(X,Y)[0,1]:.6f}  (≈0 by symmetry)')
print(f'  But Y = X^2: completely determined by X (MI >> 0)')
print(f'  Conditional var(Y | X≈0.9) ≈ {np.var(Y[np.abs(X-0.9)<0.05]):.6f}')
print(f'  Marginal var(Y)            = {np.var(Y):.4f}  (much larger)\n')

print('=== Mistake 3: Forgetting Jacobian in CoV ===')
# X ~ N(0,1), Y = exp(X) (log-normal)
# Wrong: f_Y(y) = f_X(y) = (1/sqrt(2pi))*exp(-y^2/2)  <-- WRONG
# Right: f_Y(y) = f_X(ln y) * (1/y)                   <-- correct
X3 = np.random.normal(0, 1, N)
Y3 = np.exp(X3)
y_test = np.array([1.0, 2.0, 3.0])
f_wrong = stats.norm.pdf(y_test)            # missing Jacobian
f_right = stats.norm.pdf(np.log(y_test)) / y_test  # correct
f_lognorm = stats.lognorm.pdf(y_test, s=1)  # scipy reference
print('  y       | f_wrong  | f_correct | f_lognorm')
for y, fw, fr, fl in zip(y_test, f_wrong, f_right, f_lognorm):
    print(f'  y={y:.1f}   | {fw:.6f} | {fr:.6f}  | {fl:.6f}')
print('PASS — correct formula matches scipy lognorm')

Summary

This notebook covered the core machinery of joint distributions:

| Topic | Key result |
|---|---|
| Joint PMF/PDF | Marginals via integration/summation |
| Conditionals | $f_{Y\mid X}(y\mid x) = f_{X,Y}(x,y)/f_X(x)$ |
| Chain rule | $p(x_1,\ldots,x_n) = \prod_i p(x_i \mid x_{<i})$ |
| Independence | $f(x,y) = f_X(x)f_Y(y)$; $\text{Cov}=0 \not\Rightarrow$ indep. |
| MVN conditional | Schur complement formula |
| Change of variables | PDF transforms with Jacobian factor $\lvert\det J\rvert$ |
| Copulas | $F(x,y) = C(F_X(x), F_Y(y))$ |
| Reparameterisation | Low-variance gradients for the VAE ELBO |
| Attention | Conditional expectation $\mathbb{E}_{j\sim\alpha_i}[v_j]$ |

Next: §04 Expected Values and Moments