Math for LLMs: Theory Notebook

Probability Theory / Joint Distributions

Converted from theory.ipynb for web reading.

Joint Distributions

"To understand a random vector, you must understand not just its components, but how they dance together."

Interactive theory notebook for §06/03. Work through cells top-to-bottom.

Topics covered:

  • Joint PMF/PDF/CDF and marginalisation
  • Conditional distributions and chain rule
  • Independence: three equivalent characterisations
  • Covariance, correlation, covariance matrices
  • Multivariate Gaussian: geometry, conditionals, affine transforms
  • Change of variables and Jacobians
  • Copulas and Sklar's theorem
  • ML applications: VAE, attention, autoregressive factorisation

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
import scipy.linalg as la
from scipy import stats

try:
    import matplotlib.pyplot as plt
    import matplotlib
    plt.style.use('seaborn-v0_8-whitegrid')
    plt.rcParams['figure.figsize'] = [10, 6]
    plt.rcParams['font.size'] = 12
    HAS_MPL = True
except ImportError:
    HAS_MPL = False
    print('matplotlib not available — skipping plots')

try:
    import seaborn as sns
    HAS_SNS = True
except ImportError:
    HAS_SNS = False

np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)
print('Setup complete.')

1. Joint Distributions: Discrete and Continuous

A joint distribution fully describes the simultaneous behaviour of two or more random variables. For a pair $(X, Y)$:

  • Discrete: joint PMF $p(x,y) = P(X=x, Y=y)$, with $\sum_{x,y} p(x,y) = 1$.
  • Continuous: joint PDF $f(x,y) \ge 0$, with $\iint f(x,y)\,dx\,dy = 1$.
  • CDF: $F(x,y) = P(X \le x, Y \le y)$ for both cases.

The PDF is recovered from the CDF by mixed partial differentiation: $f(x,y) = \partial^2 F(x,y)/\partial x\,\partial y$.
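
As a quick numerical sanity check of this relation (a small sketch added for the web version, not one of the numbered cells), the mixed partial of the bivariate Gaussian CDF can be approximated by central differences and compared against the PDF; the test point and step size h are arbitrary choices.

import numpy as np
from scipy import stats

rho = 0.7
rv = stats.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

def F(x, y):
    # Joint CDF F(x, y) = P(X <= x, Y <= y)
    return rv.cdf([x, y])

# Central-difference approximation of the mixed partial d^2 F / dx dy
x0, y0, h = 0.5, -0.3, 0.05
mixed = (F(x0 + h, y0 + h) - F(x0 + h, y0 - h)
         - F(x0 - h, y0 + h) + F(x0 - h, y0 - h)) / (4 * h**2)

print(f'Finite-difference d2F/dxdy at ({x0}, {y0}): {mixed:.4f}')
print(f'Joint PDF f(x0, y0)                       : {rv.pdf([x0, y0]):.4f}')
print('These should agree to a few decimal places.')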

Code cell 5

# === 1.1 Discrete Joint PMF ===

# Example: roll two dice, X = die 1, Y = die 2
# Joint PMF: p(x,y) = 1/36 for x,y in {1,...,6}
vals = np.arange(1, 7)
joint_pmf = np.ones((6, 6)) / 36
print('Joint PMF (6x6 matrix, each entry = 1/36):')
print(joint_pmf)
print(f'Sum of all probabilities: {joint_pmf.sum():.4f}')

# Marginal PMFs
marginal_x = joint_pmf.sum(axis=1)  # sum over Y
marginal_y = joint_pmf.sum(axis=0)  # sum over X
print(f'\nMarginal P(X=x): {marginal_x}  (should be 1/6 each)')
print(f'Marginal P(Y=y): {marginal_y}  (should be 1/6 each)')

# PMF of the sum S = X + Y (a derived variable; obtained by summing the joint PMF)
S_pmf = {}
for x in range(1, 7):
    for y in range(1, 7):
        s = x + y
        S_pmf[s] = S_pmf.get(s, 0) + 1/36
print('\nPMF of S = X + Y:')
for s in sorted(S_pmf):
    print(f'  P(S={s}) = {S_pmf[s]:.4f} = {int(round(S_pmf[s]*36))}/36')

1.2 Continuous Joint PDF

For continuous $(X, Y)$, probabilities are computed via double integration:

$$P(a \le X \le b,\; c \le Y \le d) = \int_a^b \int_c^d f(x,y)\,dy\,dx$$

A common example: the bivariate Gaussian with correlation $\rho$:

$$f(x,y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\!\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2} + \frac{(y-\mu_2)^2}{\sigma_2^2}\right]\right)$$
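
To make the double-integral formula concrete, here is a small sketch (not one of the numbered cells) that evaluates a rectangle probability for the rho = 0.7 bivariate Gaussian twice: once with scipy's dblquad and once by Monte Carlo. The rectangle [0, 1] x [0, 1] is an arbitrary choice.

import numpy as np
from scipy import stats
from scipy.integrate import dblquad

np.random.seed(42)
rho = 0.7
rv = stats.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]])

# P(0 <= X <= 1, 0 <= Y <= 1) by numerical double integration
p_quad, _ = dblquad(lambda y, x: rv.pdf([x, y]), 0, 1, 0, 1)

# Monte Carlo check with 200k samples
samp = rv.rvs(200000)
p_mc = ((samp[:, 0] >= 0) & (samp[:, 0] <= 1) &
        (samp[:, 1] >= 0) & (samp[:, 1] <= 1)).mean()

print(f'P(0<=X<=1, 0<=Y<=1) via dblquad    : {p_quad:.4f}')
print(f'P(0<=X<=1, 0<=Y<=1) via Monte Carlo: {p_mc:.4f}')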

Code cell 7

# === 1.2 Bivariate Gaussian PDF visualisation ===

mu = np.array([0.0, 0.0])
rho = 0.7
Sigma = np.array([[1.0, rho], [rho, 1.0]])

rv = stats.multivariate_normal(mean=mu, cov=Sigma)

x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X_grid, Y_grid = np.meshgrid(x, y)
pos = np.dstack((X_grid, Y_grid))
Z = rv.pdf(pos)

print(f'Bivariate Gaussian with rho={rho}')
print(f'Peak value at (0,0): {rv.pdf([0,0]):.4f}')
print(f'Theoretical peak: 1/(2*pi*sqrt(1-rho^2)) = {1/(2*np.pi*np.sqrt(1-rho**2)):.4f}')

if HAS_MPL:
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    for ax, r in zip(axes, [0.0, 0.7, -0.7]):
        S = np.array([[1.0, r], [r, 1.0]])
        rv_ = stats.multivariate_normal(mean=[0,0], cov=S)
        Z_ = rv_.pdf(pos)
        ct = ax.contourf(X_grid, Y_grid, Z_, levels=20, cmap='viridis')
        ax.set_title(f'rho = {r}')
        ax.set_xlabel('X'); ax.set_ylabel('Y')
    plt.suptitle('Bivariate Gaussian PDF for different correlations')
    plt.tight_layout()
    plt.show()
    print('PASS — plotted bivariate Gaussians')

2. Marginalisation

Given a joint distribution, we recover individual marginals by integrating out (or summing out) the other variable:

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy, \qquad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx.$$

Marginalisation is one-way: knowing both marginals does not determine the joint — a fact captured by copula theory (Section 9).

Code cell 9

# === 2.1 Marginalisation: Triangular Joint ===
# f(x,y) = 3x for 0 <= y <= x <= 1  (from Appendix V.1)

from scipy.integrate import dblquad, quad

def joint_pdf(y, x):  # note: dblquad integrates inner variable first
    if 0 <= y <= x <= 1:
        return 3*x
    return 0.0

# Verify normalisation
norm, err = dblquad(joint_pdf, 0, 1, 0, lambda x: x)
print(f'Normalisation integral: {norm:.6f} (should be 1.0)')

# Marginal f_X(x) = 3x^2
def marg_x(x):
    return 3*x**2

# Marginal f_Y(y) = 3(1-y^2)/2
def marg_y(y):
    return 1.5*(1 - y**2)

# Verify marginals integrate to 1
mx_int, _ = quad(marg_x, 0, 1)
my_int, _ = quad(marg_y, 0, 1)
print(f'Marginal f_X integrates to: {mx_int:.6f}')
print(f'Marginal f_Y integrates to: {my_int:.6f}')

# E[X] and E[Y]
EX, _ = quad(lambda x: x * marg_x(x), 0, 1)
EY, _ = quad(lambda y: y * marg_y(y), 0, 1)
print(f'E[X] = {EX:.4f} (theory: 3/4 = {3/4:.4f})')
print(f'E[Y] = {EY:.4f} (theory: 3/8 = {3/8:.4f})')

3. Conditional Distributions

The conditional PDF of $Y$ given $X=x$ is defined as:

$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}, \quad \text{where } f_X(x) > 0.$$

This is not a point probability but a function of $y$ for fixed $x$. It must integrate to 1 over $y$: $\int f_{Y|X}(y|x)\,dy = 1$.

The conditional mean $\mathbb{E}[Y|X=x] = \int y\,f_{Y|X}(y|x)\,dy$ is a function of $x$ and is central to regression.
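
For a concrete non-Gaussian case, consider the triangular joint f(x,y) = 3x on 0 <= y <= x <= 1 from Section 2: there f_{Y|X}(y|x) = 3x/(3x^2) = 1/x, i.e. Y | X = x is Uniform(0, x) with conditional mean x/2. The short sketch below (assuming that density) verifies both facts numerically.

import numpy as np
from scipy.integrate import quad

def cond_pdf(y, x):
    # f_{Y|X}(y|x) = f(x,y)/f_X(x) = 3x / (3x^2) = 1/x on 0 <= y <= x
    return 1.0 / x if 0 <= y <= x else 0.0

for x0 in [0.3, 0.6, 0.9]:
    total, _ = quad(cond_pdf, 0, x0, args=(x0,))
    cond_mean, _ = quad(lambda y: y * cond_pdf(y, x0), 0, x0)
    print(f'x={x0}: integral of f(y|x) = {total:.4f}, E[Y|X=x] = {cond_mean:.4f} (theory: {x0/2:.4f})')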

Code cell 11

# === 3.1 Conditional Distributions: Bivariate Gaussian ===

# Joint (X1, X2) ~ N(0, Sigma) with Sigma = [[4, 3],[3, 9]]
mu_joint = np.array([0.0, 0.0])
Sigma_joint = np.array([[4.0, 3.0], [3.0, 9.0]])

# Schur complement formula for X1 | X2 = x2
mu1, mu2 = mu_joint[0], mu_joint[1]
S11, S12, S21, S22 = Sigma_joint[0,0], Sigma_joint[0,1], Sigma_joint[1,0], Sigma_joint[1,1]

x2_given = 3.0
mu_cond = mu1 + S12 / S22 * (x2_given - mu2)
sigma2_cond = S11 - S12**2 / S22

print('X1 | X2 = 3 distribution (Schur complement):')
print(f'  Conditional mean  mu_1|2 = {mu1} + ({S12}/{S22})*({x2_given}-{mu2}) = {mu_cond:.4f}')
print(f'  Conditional var  sig^2_1|2 = {S11} - {S12}^2/{S22} = {sigma2_cond:.4f}')

# Verify by sampling
N = 200000
samples = np.random.multivariate_normal(mu_joint, Sigma_joint, N)
tol = 0.05 * abs(x2_given)
mask = np.abs(samples[:, 1] - x2_given) < tol
cond_samples = samples[mask, 0]
print(f'\nVerification (conditioning X2 ≈ {x2_given} ± {tol:.2f}):')
print(f'  Sample conditional mean: {cond_samples.mean():.4f} (theory: {mu_cond:.4f})')
print(f'  Sample conditional std : {cond_samples.std():.4f}  (theory: {np.sqrt(sigma2_cond):.4f})')

ok_mean = abs(cond_samples.mean() - mu_cond) < 0.15
ok_std  = abs(cond_samples.std() - np.sqrt(sigma2_cond)) < 0.15
print(f'PASS mean — {ok_mean}    PASS std — {ok_std}')

4. Chain Rule of Probability

The chain rule decomposes any joint distribution into a product of conditionals:

$$p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n p(x_i \mid x_1, \ldots, x_{i-1})$$

The ordering of variables can be arbitrary — different orderings give different factorisations, all valid. This factorisation is the mathematical foundation of autoregressive language models: the probability of a token sequence is decomposed as

$$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^T P(w_t \mid w_1, \ldots, w_{t-1})$$

Each conditional $P(w_t \mid w_{<t})$ is what a transformer learns to model.
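
Before the bigram example in the next cell, a minimal sketch on a made-up 2 x 2 joint table confirms that two different orderings of the chain rule recover the same joint: p(x, y) = p(x) p(y|x) = p(y) p(x|y).

import numpy as np

# A small made-up joint PMF over X in {0,1}, Y in {0,1}
joint = np.array([[0.10, 0.30],
                  [0.25, 0.35]])   # joint[x, y]
assert np.isclose(joint.sum(), 1.0)

p_x = joint.sum(axis=1)             # marginal of X
p_y = joint.sum(axis=0)             # marginal of Y
p_y_given_x = joint / p_x[:, None]  # p(y|x)
p_x_given_y = joint / p_y[None, :]  # p(x|y)

# Reconstruct the joint using both orderings of the chain rule
recon_1 = p_x[:, None] * p_y_given_x   # p(x) * p(y|x)
recon_2 = p_y[None, :] * p_x_given_y   # p(y) * p(x|y)

print('Both factorisations recover the joint:',
      np.allclose(recon_1, joint) and np.allclose(recon_2, joint))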

Code cell 13

# === 4.1 Chain Rule: Language Model Probability ===

# Simulate a simple unigram/bigram language model
import numpy as np
np.random.seed(42)

# Vocabulary: 5 words
vocab = ['the', 'cat', 'sat', 'on', 'mat']
V = len(vocab)

# Unigram probs (P(w_1))
unigram = np.array([0.4, 0.2, 0.15, 0.15, 0.1])
assert np.isclose(unigram.sum(), 1.0)

# Bigram transition matrix: bigram[i,j] = P(w_{t+1}=j | w_t=i)
bigram = np.array([
    [0.05, 0.5, 0.0, 0.2, 0.25],  # after 'the'
    [0.1,  0.0, 0.6, 0.2, 0.1 ],  # after 'cat'
    [0.3,  0.1, 0.0, 0.5, 0.1 ],  # after 'sat'
    [0.3,  0.1, 0.0, 0.0, 0.6 ],  # after 'on'
    [0.6,  0.2, 0.1, 0.1, 0.0 ],  # after 'mat'
])
# Verify rows sum to 1
assert np.allclose(bigram.sum(axis=1), 1.0)

# Compute P(the, cat, sat, on, mat) via chain rule
sentence = [0, 1, 2, 3, 4]  # indices
log_prob = np.log(unigram[sentence[0]])
for t in range(1, len(sentence)):
    log_prob += np.log(bigram[sentence[t-1], sentence[t]])

print('Sentence:', ' '.join(vocab[i] for i in sentence))
print(f'Chain rule: log P = {log_prob:.4f}')
print(f'P = {np.exp(log_prob):.8f}')
print('\nBreakdown:')
print(f'  P(the)            = {unigram[0]:.3f}')
print(f'  P(cat | the)      = {bigram[0,1]:.3f}')
print(f'  P(sat | cat)      = {bigram[1,2]:.3f}')
print(f'  P(on  | sat)      = {bigram[2,3]:.3f}')
print(f'  P(mat | on)       = {bigram[3,4]:.3f}')
product = unigram[0]*bigram[0,1]*bigram[1,2]*bigram[2,3]*bigram[3,4]
print(f'  Product           = {product:.8f}')
print(f'PASS — chain rule product matches: {np.isclose(product, np.exp(log_prob))}')

5. Independence

$X$ and $Y$ are independent ($X \perp Y$) when knowing the value of one provides no information about the other. Three equivalent characterisations:

  1. Factorisation: $f_{X,Y}(x,y) = f_X(x)\,f_Y(y)$ for almost all $(x,y)$.
  2. CDF factorisation: $F_{X,Y}(x,y) = F_X(x)\,F_Y(y)$.
  3. Conditional = marginal: $f_{Y|X}(y|x) = f_Y(y)$ for all $x$.

Warning: Zero covariance does NOT imply independence (see the unit disk example). For jointly Gaussian random variables, it does — because the Gaussian is fully characterised by means and covariances.

Code cell 15

# === 5.1 Independence: Three Equivalent Tests ===

import numpy as np
np.random.seed(42)

# Test 1: Independent (X, Y) ~ N(0,I)
N = 100000
X = np.random.normal(0, 1, N)
Y = np.random.normal(0, 1, N)

# Check Cov(X,Y) ≈ 0
cov_indep = np.cov(X, Y)[0, 1]
print('Test 1: Independent standard normals')
print(f'  Cov(X,Y) = {cov_indep:.4f}  (should ≈ 0)')
print(f'  PASS — near zero: {abs(cov_indep) < 0.02}')

# Test 2: Dependent (X, Y) on unit disk — zero cov but not independent
# Sample uniformly from unit disk
r = np.sqrt(np.random.uniform(0, 1, N))
theta = np.random.uniform(0, 2*np.pi, N)
Xd = r * np.cos(theta)
Yd = r * np.sin(theta)

cov_disk = np.cov(Xd, Yd)[0, 1]
print('\nTest 2: Uniform on unit disk')
print(f'  Cov(X,Y) = {cov_disk:.4f}  (should ≈ 0 by symmetry)')

# But X and Y are NOT independent: conditional variance of Y given X=0.9 is tiny
x_val = 0.9
tol = 0.05
mask = np.abs(Xd - x_val) < tol
Y_given_x = Yd[mask]
print(f'  Marginal std(Y)      = {Yd.std():.4f}')
print(f'  Cond std(Y|X≈0.9)   = {Y_given_x.std():.4f}  (much smaller!)')
# Y | X=x is Uniform(-sqrt(1-x^2), sqrt(1-x^2)), whose std is sqrt(1-x^2)/sqrt(3)
print(f'  Theory: sqrt(1-0.9^2)/sqrt(3) = {np.sqrt(1-0.9**2)/np.sqrt(3):.4f} approx')
print(f'  Cov=0 but NOT independent: confirmed')

# Test 3: Bivariate Gaussian: cov=0 => independent
rho = 0.0
Z = np.random.multivariate_normal([0,0], [[1,rho],[rho,1]], N)
Xg, Yg = Z[:, 0], Z[:, 1]
mask2 = np.abs(Xg - 0.9) < 0.05
print(f'\nTest 3: Bivariate Gaussian with rho=0')
print(f'  Marginal std(Y)      = {Yg.std():.4f}')
print(f'  Cond std(Y|X≈0.9)   = {Yg[mask2].std():.4f}  (same as marginal — INDEPENDENT)')

6. Covariance and Correlation

Covariance measures linear dependence:

$$\text{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

Pearson correlation normalises to $[-1, 1]$:

$$\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

The covariance matrix of a random vector $\mathbf{X} \in \mathbb{R}^d$:

$$\Sigma = \mathbb{E}[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^\top], \quad \Sigma_{ij} = \text{Cov}(X_i, X_j)$$

$\Sigma$ is always symmetric positive semi-definite (PSD).
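
A quick numerical sketch of the covariance identity and the correlation bound (the linear-plus-noise pair below is an arbitrary example):

import numpy as np
np.random.seed(42)

N = 100000
X = np.random.normal(0, 2, N)
Y = 0.5 * X + np.random.normal(0, 1, N)   # linearly related to X plus noise

cov_identity = np.mean(X * Y) - X.mean() * Y.mean()   # E[XY] - E[X]E[Y]
cov_numpy = np.cov(X, Y, bias=True)[0, 1]             # population covariance
rho = cov_identity / (X.std() * Y.std())

print(f'E[XY] - E[X]E[Y] = {cov_identity:.4f}')
print(f'np.cov (biased)  = {cov_numpy:.4f}')
print(f'Pearson rho      = {rho:.4f}  (must lie in [-1, 1])')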

Code cell 17

# === 6.1 Covariance Matrix: Construction and Properties ===

import numpy as np
np.random.seed(42)

# Create a 3D Gaussian with known covariance
mu = np.array([1.0, 2.0, 3.0])
# Build a valid PSD matrix via Sigma = A @ A.T + eps*I
A = np.array([[2, 1, 0], [0, 1, 1], [1, 0, 2]], dtype=float)
Sigma_true = A @ A.T + 0.1 * np.eye(3)

print('True covariance matrix Sigma:')
print(Sigma_true)
print(f'Eigenvalues (all > 0, so Sigma is positive definite): {np.linalg.eigvalsh(Sigma_true).round(4)}')

# Sample and estimate
N = 100000
samples = np.random.multivariate_normal(mu, Sigma_true, N)
Sigma_est = np.cov(samples.T)

print('\nEstimated covariance matrix (N=100,000):')
print(Sigma_est.round(4))

max_err = np.abs(Sigma_est - Sigma_true).max()
print(f'\nMax entry error: {max_err:.4f}')
print(f'PASS — estimate close to true: {max_err < 0.05}')

# Verify PSD via Cholesky
try:
    L = np.linalg.cholesky(Sigma_est)
    print('PASS — Cholesky decomposition succeeded (matrix is positive definite)')
except np.linalg.LinAlgError:
    print('FAIL — matrix not PSD')

# Correlation matrix
D_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma_true)))
Corr = D_inv @ Sigma_true @ D_inv
print('\nCorrelation matrix (diagonal should be 1):')
print(Corr.round(4))

6.2 Anscombe's Quartet — Correlation Can Mislead

Four datasets with identical means, variances, and correlations, yet completely different joint distributions. This illustrates why correlation alone is an inadequate summary of a bivariate distribution.

Code cell 19

# === 6.2 Anscombe's Quartet ===

import numpy as np

# Anscombe's quartet — four datasets with same statistics
anscombe = {
    'I':   ([10,8,13,9,11,14,6,4,12,7,5],  [8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68]),
    'II':  ([10,8,13,9,11,14,6,4,12,7,5],  [9.14,8.14,8.74,8.77,9.26,8.10,6.13,3.10,9.13,7.26,4.74]),
    'III': ([10,8,13,9,11,14,6,4,12,7,5],  [7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73]),
    'IV':  ([8,8,8,8,8,8,8,19,8,8,8],       [6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89]),
}

print('Dataset statistics (should all be ~equal):')
print(f'{"Set":>4} | {"E[X]":>8} | {"E[Y]":>8} | {"Var[X]":>8} | {"Var[Y]":>8} | {"Corr":>8}')
print('-' * 55)
for name, (x, y) in anscombe.items():
    x, y = np.array(x, dtype=float), np.array(y)
    r = np.corrcoef(x, y)[0, 1]
    print(f'{name:>4} | {x.mean():>8.3f} | {y.mean():>8.3f} | {x.var():>8.3f} | {y.var():>8.3f} | {r:>8.3f}')

if HAS_MPL:
    fig, axes = plt.subplots(1, 4, figsize=(16, 4))
    for ax, (name, (x, y)) in zip(axes, anscombe.items()):
        x, y = np.array(x, dtype=float), np.array(y)
        ax.scatter(x, y, color='steelblue', s=60, zorder=5)
        # Regression line
        m, b = np.polyfit(x, y, 1)
        xfit = np.linspace(x.min()-0.5, x.max()+0.5, 50)
        ax.plot(xfit, m*xfit + b, 'r-', lw=2)
        ax.set_title(f'Dataset {name}')
        ax.set_xlabel('X'); ax.set_ylabel('Y')
    plt.suptitle("Anscombe's Quartet: same correlation, different distributions")
    plt.tight_layout()
    plt.show()
    print('PASS — plotted Anscombe quartet')

7. Multivariate Gaussian

The multivariate Gaussian $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ is defined by its PDF:

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

Key properties:

  • Marginals of a joint Gaussian are Gaussian.
  • Conditionals are Gaussian (Schur complement formula).
  • Affine transforms: $A\mathbf{X}+\mathbf{b} \sim \mathcal{N}(A\boldsymbol{\mu}+\mathbf{b}, A\Sigma A^\top)$.
  • Maximum entropy: the Gaussian maximises entropy among all distributions with fixed mean and covariance (see the numerical check below).
  • Uncorrelated Gaussian = independent (unique to the Gaussian family).
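
As a numerical check of the maximum-entropy property (a small sketch, not one of the numbered cells): all three distributions below have mean 0 and variance 1, and the Gaussian has the largest differential entropy. The particular comparison distributions are an arbitrary choice.

import numpy as np
from scipy import stats

# All three distributions have mean 0 and variance 1
dists = {
    'Normal(0,1)':    stats.norm(0, 1),
    'Laplace, var=1': stats.laplace(scale=1/np.sqrt(2)),                      # var = 2*scale^2
    'Uniform, var=1': stats.uniform(loc=-np.sqrt(3), scale=2*np.sqrt(3)),     # var = width^2/12
}

for name, rv in dists.items():
    print(f'{name:>16}: var = {rv.var():.4f}, differential entropy = {rv.entropy():.4f} nats')

print(f'Gaussian maximum: 0.5*ln(2*pi*e) = {0.5*np.log(2*np.pi*np.e):.4f} nats')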

Code cell 21

# === 7.1 MVN: Geometry via Eigendecomposition ===

import numpy as np
np.random.seed(42)

# 2D Gaussian: Sigma = Q * diag(lambda) * Q^T (eigendecomposition)
Sigma = np.array([[3.0, 1.5], [1.5, 2.0]])
eigvals, Q = np.linalg.eigh(Sigma)

print('Covariance matrix Sigma:')
print(Sigma)
print(f'Eigenvalues: {eigvals.round(4)}')
print('Eigenvectors (columns = principal axes):')
print(Q.round(4))
print(f'\nReconstruction: Q*diag(lam)*Q.T matches Sigma: {np.allclose(Q @ np.diag(eigvals) @ Q.T, Sigma)}')

# Ellipse radii = sqrt(eigenvalues), oriented along eigenvectors
print(f'\nEllipse semi-axes (1-sigma confidence ellipse):')
for i, (lam, v) in enumerate(zip(eigvals, Q.T)):
    print(f'  Axis {i+1}: length={np.sqrt(lam):.4f}, direction={v.round(4)}')

if HAS_MPL:
    from matplotlib.patches import Ellipse
    fig, ax = plt.subplots(figsize=(8, 6))
    # Sample points
    mu = np.array([0.0, 0.0])
    pts = np.random.multivariate_normal(mu, Sigma, 300)
    ax.scatter(pts[:, 0], pts[:, 1], alpha=0.3, s=15, color='steelblue')
    # Plot eigenvectors scaled by sqrt(eigenvalue)
    for lam, v in zip(eigvals, Q.T):
        scale = np.sqrt(lam)
        ax.annotate('', xy=scale*v, xytext=-scale*v,
                    arrowprops=dict(arrowstyle='<->', color='red', lw=2))
    ax.set_aspect('equal')
    ax.set_title('MVN samples with principal axes (eigenvectors)')
    ax.set_xlabel('X1'); ax.set_ylabel('X2')
    plt.tight_layout()
    plt.show()
    print('PASS — plotted MVN geometry')

Code cell 22

# === 7.2 MVN Conditionals: Schur Complement ===

import numpy as np
from scipy import stats
np.random.seed(42)

# Partition: X = (X1, X2, X3), condition on X2, X3
mu_full = np.array([1.0, 2.0, 3.0])
Sigma_full = np.array([
    [4.0, 1.5, 1.0],
    [1.5, 3.0, 0.5],
    [1.0, 0.5, 2.0]
])

# Partition indices: group 1 = {0}, group 2 = {1,2}
idx1, idx2 = [0], [1, 2]
mu1 = mu_full[idx1]; mu2 = mu_full[idx2]
S11 = Sigma_full[np.ix_(idx1, idx1)]
S12 = Sigma_full[np.ix_(idx1, idx2)]
S21 = Sigma_full[np.ix_(idx2, idx1)]
S22 = Sigma_full[np.ix_(idx2, idx2)]

x2_obs = np.array([2.5, 3.5])  # observed values of X2, X3

# Schur complement
S22_inv = np.linalg.inv(S22)
mu_cond = mu1 + S12 @ S22_inv @ (x2_obs - mu2)
Sigma_cond = S11 - S12 @ S22_inv @ S21

print('X1 | (X2=2.5, X3=3.5) distribution:')
print(f'  mu_cond    = {mu_cond[0]:.4f}')
print(f'  sigma_cond = {np.sqrt(Sigma_cond[0,0]):.4f}')

# Verify by sampling
N = 500000
samp = np.random.multivariate_normal(mu_full, Sigma_full, N)
tol = 0.05
mask = (np.abs(samp[:, 1] - x2_obs[0]) < tol) & (np.abs(samp[:, 2] - x2_obs[1]) < tol)
X1_cond = samp[mask, 0]
print(f'\nSampling verification ({mask.sum()} matching samples):')
print(f'  Sample mean : {X1_cond.mean():.4f} (theory: {mu_cond[0]:.4f})')
print(f'  Sample std  : {X1_cond.std():.4f}  (theory: {np.sqrt(Sigma_cond[0,0]):.4f})')
ok = abs(X1_cond.mean() - mu_cond[0]) < 0.2
print(f'PASS — {ok}')

Code cell 23

# === 7.3 MVN Affine Transforms ===

import numpy as np
np.random.seed(42)

# If X ~ N(mu, Sigma), then Y = AX + b ~ N(A*mu + b, A*Sigma*A.T)
mu_x = np.array([1.0, 2.0])
Sigma_x = np.array([[2.0, 1.0], [1.0, 3.0]])
A = np.array([[1, 2], [0, 1], [1, -1]], dtype=float)  # 3x2 matrix
b = np.array([0.5, 1.0, -0.5])

# Theoretical distribution of Y = AX + b
mu_y_theory = A @ mu_x + b
Sigma_y_theory = A @ Sigma_x @ A.T

print('Theoretical distribution of Y = AX + b:')
print(f'  mu_Y = {mu_y_theory}')
print(f'  Sigma_Y =\n{Sigma_y_theory}')

# Verify by sampling
N = 200000
X_samp = np.random.multivariate_normal(mu_x, Sigma_x, N)
Y_samp = (A @ X_samp.T).T + b

mu_y_emp = Y_samp.mean(axis=0)
Sigma_y_emp = np.cov(Y_samp.T)

print('\nEmpirical distribution:')
print(f'  mu_Y_emp = {mu_y_emp.round(4)}')
print(f'  Sigma_Y_emp =\n{Sigma_y_emp.round(4)}')

ok_mu = np.allclose(mu_y_emp, mu_y_theory, atol=0.02)
ok_sig = np.allclose(Sigma_y_emp, Sigma_y_theory, atol=0.05)
print(f'\nPASS mu — {ok_mu}')
print(f'PASS sigma — {ok_sig}')
print('\nAI connection: In attention, Q=XW_Q, K=XW_K, V=XW_V are affine transforms.')
print('If X ~ MVN, then Q,K,V are also MVN. This is used in theoretical analysis of attention.')

8. Change of Variables and Jacobians

For a smooth bijection $\mathbf{Y} = g(\mathbf{X})$ with inverse $\mathbf{x} = g^{-1}(\mathbf{y})$:

$$f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}(g^{-1}(\mathbf{y})) \cdot \left|\det J_{g^{-1}}(\mathbf{y})\right|$$

where $J_{g^{-1}}$ is the Jacobian matrix of $g^{-1}$:

$$[J_{g^{-1}}]_{ij} = \frac{\partial [g^{-1}]_i}{\partial y_j}$$

The $|\det J|$ factor accounts for how the transformation stretches or compresses volume.

Code cell 25

# === 8.1 Change of Variables: Polar Coordinates ===

import numpy as np
from scipy import stats
np.random.seed(42)

# Transform: (X, Y) i.i.d. N(0,1)  ->  (R, Theta) polar
# X = R*cos(T), Y = R*sin(T)
# f_{X,Y}(x,y) = (1/2pi)*exp(-(x^2+y^2)/2)
#
# Jacobian of inverse (x,y) -> (r,theta):
# J = [[cos T, -r sin T], [sin T, r cos T]]
# |det J| = r (the polar area element)
#
# f_{R,T}(r,theta) = f_{X,Y}(r cos T, r sin T) * r
#                  = (1/2pi)*exp(-r^2/2)*r
# Marginalising over theta: f_R(r) = r*exp(-r^2/2)  (Rayleigh dist.)

# Sample (X,Y) i.i.d. N(0,1)
N = 100000
X = np.random.normal(0, 1, N)
Y = np.random.normal(0, 1, N)
R = np.sqrt(X**2 + Y**2)
Theta = np.arctan2(Y, X)

# Verify: R should follow Rayleigh(sigma=1), i.e., R ~ Rayleigh(1)
# PDF: f(r) = r * exp(-r^2/2), CDF: 1 - exp(-r^2/2)
rayleigh_rv = stats.rayleigh(scale=1)
ks_stat, p_val = stats.kstest(R, rayleigh_rv.cdf)
print('R = sqrt(X^2+Y^2) where X,Y iid N(0,1)')
print(f'KS test vs Rayleigh(1): stat={ks_stat:.4f}, p={p_val:.4f}')
print(f'PASS — R ~ Rayleigh(1): {p_val > 0.05}')

# Verify: Theta should be Uniform(0, 2pi)
Theta_pos = Theta % (2 * np.pi)
ks_stat2, p_val2 = stats.kstest(Theta_pos, stats.uniform(0, 2*np.pi).cdf)
print(f'\nTheta = arctan2(Y,X):')
print(f'KS test vs Uniform(0,2pi): stat={ks_stat2:.4f}, p={p_val2:.4f}')
print(f'PASS — Theta ~ Uniform(0,2pi): {p_val2 > 0.05}')
print('\nKey: f_R(r) = r*exp(-r^2/2) is the Rayleigh PDF — the r factor comes from |det J| = r')

9. Copulas and Sklar's Theorem

A copula separates marginal behaviour from dependence structure.

Sklar's Theorem: Every joint CDF $F_{X,Y}$ can be written as

$$F_{X,Y}(x,y) = C(F_X(x),\, F_Y(y))$$

where $C:[0,1]^2 \to [0,1]$ is a copula — a joint CDF on $[0,1]^2$ with uniform marginals. The copula $C$ is unique when the marginals are continuous.

The copula $C(u,v)$ encodes the entire dependence structure independently of the marginal distributions. This allows constructing joints with arbitrary marginals and specified dependence — powerful for financial modelling and stress testing.

Code cell 27

# === 9.1 Gaussian Copula ===

import numpy as np
from scipy import stats
np.random.seed(42)

# Gaussian copula: sample from correlated Gaussian, then apply marginal transforms
rho = 0.8
Sigma_cop = np.array([[1.0, rho], [rho, 1.0]])
N = 5000

# Step 1: Sample from bivariate N(0, Sigma)
Z = np.random.multivariate_normal([0, 0], Sigma_cop, N)

# Step 2: Apply marginal CDF to get uniform marginals (probability integral transform)
U = stats.norm.cdf(Z)

# Step 3: Apply desired marginal inverse CDFs
# Margin 1: Exponential(1), Margin 2: Beta(2,5)
X_exp = stats.expon(scale=1).ppf(U[:, 0])
X_beta = stats.beta(2, 5).ppf(U[:, 1])

print('Gaussian copula with Exp(1) and Beta(2,5) marginals')
print(f'Correlation in the copula (Z space): rho={rho}')
print(f'Sample Pearson correlation of (X_exp, X_beta): {np.corrcoef(X_exp, X_beta)[0,1]:.4f}')
print(f'  (different from rho because marginals are non-Gaussian)')

# Verify marginals via KS test
ks1, p1 = stats.kstest(X_exp,   stats.expon(scale=1).cdf)
ks2, p2 = stats.kstest(X_beta,  stats.beta(2, 5).cdf)
print(f'\nKS test X_exp  ~ Exp(1):   KS={ks1:.4f}, p={p1:.4f}')
print(f'KS test X_beta ~ Beta(2,5): KS={ks2:.4f}, p={p2:.4f}')
print(f'PASS marginals: {p1 > 0.05 and p2 > 0.05}')

if HAS_MPL:
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    axes[0].scatter(Z[:500, 0], Z[:500, 1], alpha=0.3, s=10, color='steelblue')
    axes[0].set_title('Step 1: Correlated Gaussian (Z-space)')
    axes[0].set_xlabel('Z1'); axes[0].set_ylabel('Z2')
    axes[1].scatter(X_exp[:500], X_beta[:500], alpha=0.3, s=10, color='seagreen')
    axes[1].set_title('Step 3: Exp(1) x Beta(2,5) with Gaussian copula')
    axes[1].set_xlabel('X_exp'); axes[1].set_ylabel('X_beta')
    plt.tight_layout()
    plt.show()
    print('PASS — plotted copula transformation')

10. Conditional Independence

$X$ and $Y$ are conditionally independent given $Z$ ($X \perp Y \mid Z$) if:

$$f_{X,Y|Z}(x,y|z) = f_{X|Z}(x|z) \cdot f_{Y|Z}(y|z)$$

This is different from marginal independence:

| Pattern | $X \perp Y$? | $X \perp Y \mid Z$? |
|---|---|---|
| Chain $X \to Z \to Y$ | No | Yes |
| Fork $X \leftarrow Z \rightarrow Y$ | No | Yes |
| Collider $X \to Z \leftarrow Y$ | Yes | No |

Naive Bayes uses conditional independence: $P(\mathbf{x}|Y=c) = \prod_i P(x_i|Y=c)$, assuming features are conditionally independent given the class label.
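
The next cell demonstrates the fork and collider rows of the table. As a complement, here is a short sketch of the chain row X -> Z -> Y (noise levels chosen arbitrarily): X and Y are correlated marginally, but conditioning on Z removes the correlation.

import numpy as np
np.random.seed(42)

N = 100000
# Chain: X -> Z -> Y
X = np.random.normal(0, 1, N)
Z = X + np.random.normal(0, 0.5, N)   # Z depends on X
Y = Z + np.random.normal(0, 0.5, N)   # Y depends only on Z (given Z, Y ignores X)

print(f'Corr(X,Y) marginal = {np.corrcoef(X, Y)[0,1]:.4f}  (non-zero: marginally dependent)')

# Condition on a narrow slice of Z: the X-Y correlation should vanish
mask = np.abs(Z - 1.0) < 0.05
print(f'Corr(X,Y | Z ≈ 1)  = {np.corrcoef(X[mask], Y[mask])[0,1]:.4f}  (≈0: conditionally independent)')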

Code cell 29

# === 10.1 Conditional Independence: Fork Structure ===

import numpy as np
np.random.seed(42)

# Fork: Z -> X, Z -> Y
# Z ~ Bernoulli(0.5), X|Z ~ N(Z, 1), Y|Z ~ N(2*Z, 1)
N = 100000
Z = np.random.binomial(1, 0.5, N)
X = np.random.normal(Z.astype(float), 1, N)
Y = np.random.normal(2 * Z.astype(float), 1, N)

print('Fork structure: Z -> X, Z -> Y')
print(f'E[X]={X.mean():.4f}, E[Y]={Y.mean():.4f}')
print(f'Corr(X,Y) marginal = {np.corrcoef(X, Y)[0,1]:.4f}  (non-zero: X,Y marginally dependent)')

# Conditional on Z=0 and Z=1: correlation should drop
for z_val in [0, 1]:
    mask = Z == z_val
    corr_cond = np.corrcoef(X[mask], Y[mask])[0, 1]
    print(f'  Corr(X,Y | Z={z_val}) = {corr_cond:.4f}  (≈0: conditionally independent)')

print('PASS — Conditioning on Z removes the correlation (Fork: CI given common cause)')

print('\n--- Collider: X -> Z <- Y ---')
# Collider: X, Y independent; Z = X + Y + noise
X2 = np.random.normal(0, 1, N)
Y2 = np.random.normal(0, 1, N)
Z2 = X2 + Y2 + np.random.normal(0, 0.1, N)  # Z is a noisy sum

print(f'Corr(X2,Y2) marginal = {np.corrcoef(X2, Y2)[0,1]:.4f}  (≈0: marginally independent)')

# Condition on Z2 ≈ 2 (high Z): now X and Y become anti-correlated
mask_z2 = np.abs(Z2 - 2.0) < 0.2
corr_given_z2 = np.corrcoef(X2[mask_z2], Y2[mask_z2])[0, 1]
print(f'Corr(X2,Y2 | Z2≈2) = {corr_given_z2:.4f}  (negative: Berkson paradox / explaining away)')
print('PASS — Conditioning on collider induces negative correlation')

11. ML Application: Reparameterisation Trick (VAE)

Variational Autoencoders (VAEs) require gradients through sampling. The reparameterisation trick makes this possible:

$$\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}\big(\boldsymbol{\mu}_\phi(\mathbf{x}),\; \mathrm{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x}))\big)$$

is rewritten as:

$$\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I)$$

Now $\boldsymbol{\varepsilon}$ is sampled from a fixed distribution, and gradients flow through the deterministic $\boldsymbol{\mu}_\phi$ and $\boldsymbol{\sigma}_\phi$.

Code cell 31

# === 11.1 Reparameterisation Trick: Gradient Estimation ===

import numpy as np
np.random.seed(42)

# Simulate gradient estimation: E_{z~N(mu,sigma^2)}[z^2]
# True value: E[z^2] = mu^2 + sigma^2
# dE/dmu = 2*mu, dE/dsigma = 2*sigma

mu = 2.0
sigma = 1.5
N_mc = 10000

# Method 1: REINFORCE (score function estimator) — high variance
# grad_mu E[f(z)] = E[f(z) * (z-mu)/sigma^2]
z_rf = np.random.normal(mu, sigma, N_mc)
f_rf = z_rf**2
score = (z_rf - mu) / sigma**2
grad_mu_reinforce = np.mean(f_rf * score)

# Method 2: Reparameterisation trick — low variance
# z = mu + sigma * eps, f(z) = (mu + sigma*eps)^2
# df/dmu = 2*(mu + sigma*eps) = 2*z
eps = np.random.normal(0, 1, N_mc)
z_reparam = mu + sigma * eps
grad_mu_reparam = np.mean(2 * z_reparam)  # df/dmu = 2z

true_grad_mu = 2 * mu  # d(mu^2 + sigma^2)/dmu = 2*mu

print(f'True gradient dE[z^2]/dmu = 2*mu = {true_grad_mu:.4f}')
print(f'REINFORCE estimate       = {grad_mu_reinforce:.4f}  (high variance)')
print(f'Reparameterisation est.  = {grad_mu_reparam:.4f}  (low variance)')

# Compare variance
rf_grads = []
for _ in range(200):
    z = np.random.normal(mu, sigma, 1000)   # reuse the SAME z for f(z) and the score factor
    rf_grads.append(np.mean(z**2 * (z - mu) / sigma**2))
rp_grads = [np.mean(2*(mu + sigma*np.random.normal(0,1,1000))) for _ in range(200)]
print(f'\nVariance of 200 estimates (N=1000):')
print(f'  REINFORCE variance   : {np.var(rf_grads):.4f}')
print(f'  Reparam variance     : {np.var(rp_grads):.6f}')
print(f'Variance reduction ratio: {np.var(rf_grads)/np.var(rp_grads):.1f}x')
print('PASS — reparameterisation has much lower variance')

12. ML Application: Attention as Conditional Expectation

Scaled dot-product attention can be interpreted probabilistically:

$$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

The softmax weights $\alpha_{ij} = \text{softmax}\big(q_i^\top k_j / \sqrt{d_k}\big)_j$ define a distribution over key positions. The output for query $i$ is:

$$\mathbf{o}_i = \sum_j \alpha_{ij} \mathbf{v}_j = \mathbb{E}_{j \sim \alpha_i}[\mathbf{v}_j]$$

This is a conditional expectation: given query $q_i$, the output is the expected value vector under the attention distribution. Each row $\alpha_i$ is a conditional distribution over key positions; together with a distribution over query positions, these rows define a joint distribution over $(\text{query position}, \text{key position})$ pairs.

Code cell 33

# === 12.1 Attention as Conditional Expectation ===

import numpy as np
np.random.seed(42)

# Mini attention: seq_len=5, d_k=4, d_v=4
seq_len = 5
d_k, d_v = 4, 4

Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_v)

# Compute attention weights
scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

alpha = softmax(scores)  # (seq_len, seq_len) — attention distribution
output = alpha @ V       # (seq_len, d_v) — conditional expectation of V

print('Attention weights (each row is a distribution over key positions):')
print(alpha.round(4))
print(f'Row sums: {alpha.sum(axis=1).round(6)} (all 1.0 — valid prob distribution)')

# Output[i] = sum_j alpha[i,j] * V[j] = E_{j~alpha[i]}[V[j]]
print('\nOutput[0] = sum_j alpha[0,j] * V[j]:')
manual = sum(alpha[0, j] * V[j] for j in range(seq_len))
print(f'  Manual: {manual.round(6)}')
print(f'  Matrix: {output[0].round(6)}')
print(f'PASS — same result: {np.allclose(manual, output[0])}')

print('\nInterpretation: output[i] is the conditional mean of V')
print('under the distribution alpha[i] over key positions.')
print('High-entropy alpha => output is a broad average.')
print('Low-entropy alpha  => output focuses on one key ("sharp attention").')
entropies = -np.sum(alpha * np.log(alpha + 1e-10), axis=1)
print(f'Attention entropies (nats): {entropies.round(4)}')
print(f'Max entropy = log({seq_len}) = {np.log(seq_len):.4f}')

Code cell 34

# === 12.2 Autoregressive Factorisation: Sequence Probability ===

import numpy as np
from scipy.special import softmax
np.random.seed(42)

# Simulate a tiny autoregressive model (3-word vocabulary)
vocab = ['<s>', 'cat', 'mat']
V = len(vocab)

# Conditional distributions: p(w_t | w_{t-1}) as a 3x3 matrix
# (In reality, this would be a transformer output)
transition = np.array([
    [0.0, 0.6, 0.4],   # after <s>: cat (0.6) or mat (0.4)
    [0.3, 0.2, 0.5],   # after cat: <s> (0.3), cat (0.2), mat (0.5)
    [0.5, 0.3, 0.2],   # after mat: <s> (0.5), cat (0.3), mat (0.2)
])
start_dist = np.array([1.0, 0.0, 0.0])  # always start with <s>

# Compute P('cat', 'mat') via chain rule
# = P(cat | <s>) * P(mat | cat) = 0.6 * 0.5 = 0.3
sentence = [0, 1, 2]  # <s>, cat, mat
log_prob = 0.0
for i in range(1, len(sentence)):
    log_prob += np.log(transition[sentence[i-1], sentence[i]])

print("Chain rule: P('<s>', 'cat', 'mat') via autoregressive factorisation")
print(f"  P(cat | <s>) = {transition[0, 1]:.3f}")
print(f"  P(mat | cat) = {transition[1, 2]:.3f}")
print(f"  P(sentence) = {np.exp(log_prob):.4f} = {transition[0,1]*transition[1,2]:.4f}")

# Monte Carlo sampling from the model
def sample_sequence(length=5):
    seq = [0]  # start with <s>
    for _ in range(length - 1):
        prev = seq[-1]
        next_tok = np.random.choice(V, p=transition[prev])
        seq.append(next_tok)
    return seq

print('\n5 sampled sequences:')
for _ in range(5):
    seq = sample_sequence(4)
    print(' '.join(vocab[t] for t in seq))
print('\nThis is exactly how GPT generates text: sample one token at a time')
print('from p(w_t | w_1,...,w_{t-1}), using the chain rule of probability.')

13. ML Application: Gaussian Mixture Models

A Gaussian Mixture Model (GMM) is a latent variable model:

$$p(\mathbf{x}) = \sum_{k=1}^K \pi_k \,\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k)$$

where $\pi_k$ are mixing weights with $\sum_k \pi_k = 1$.

The joint over $(\mathbf{x}, z)$, where $z$ is the cluster assignment, is:

$$p(\mathbf{x}, z=k) = \pi_k \,\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k)$$

The marginal $p(\mathbf{x}) = \sum_k p(\mathbf{x}, z=k)$ is the mixture, and the conditional $p(z=k|\mathbf{x})$ — the responsibility — is computed via Bayes' rule:

$$r_k(\mathbf{x}) = p(z=k|\mathbf{x}) = \frac{\pi_k \,\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k)}{\sum_{j} \pi_j \,\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_j, \Sigma_j)}$$

Code cell 36

# === 13.1 GMM: EM Algorithm ===

import numpy as np
from scipy import stats
np.random.seed(42)

# Generate data from true GMM: K=2 components in 2D
K = 2
true_mu = [np.array([-2.0, 0.0]), np.array([2.0, 1.0])]
true_Sigma = [np.eye(2), np.array([[1.5, 0.5],[0.5, 0.8]])]
true_pi = [0.4, 0.6]

N = 500
z_true = np.random.choice(K, size=N, p=true_pi)
X_data = np.vstack([
    np.random.multivariate_normal(true_mu[k], true_Sigma[k])
    for k in z_true
])

# EM Algorithm
# Initialise
pi = np.array([0.5, 0.5])
mu = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]
Sigma = [np.eye(2), np.eye(2)]

def gmm_log_likelihood(X, pi, mu, Sigma):
    N = len(X)
    ll = 0.0
    for i in range(N):
        mix = sum(pi[k] * stats.multivariate_normal(mu[k], Sigma[k]).pdf(X[i]) for k in range(K))
        ll += np.log(mix + 1e-300)
    return ll

print('EM training:')
print(f'  Initial log-likelihood: {gmm_log_likelihood(X_data, pi, mu, Sigma):.2f}')

for iteration in range(20):
    # E-step: compute responsibilities
    R = np.zeros((N, K))
    for k in range(K):
        R[:, k] = pi[k] * stats.multivariate_normal(mu[k], Sigma[k]).pdf(X_data)
    R = R / R.sum(axis=1, keepdims=True)

    # M-step: update parameters
    Nk = R.sum(axis=0)  # effective counts
    for k in range(K):
        mu[k] = (R[:, k:k+1] * X_data).sum(axis=0) / Nk[k]
        diff = X_data - mu[k]
        Sigma[k] = (R[:, k:k+1] * diff).T @ diff / Nk[k] + 1e-5 * np.eye(2)
    pi = Nk / N

final_ll = gmm_log_likelihood(X_data, pi, mu, Sigma)
print(f'  Final log-likelihood:   {final_ll:.2f}')
print(f'\nEstimated mixing weights: {pi.round(4)} (true: {true_pi})')
print(f'Estimated mu_0: {mu[0].round(4)} (true: {true_mu[0]})')
print(f'Estimated mu_1: {mu[1].round(4)} (true: {true_mu[1]})')

ok0 = np.allclose(np.sort(pi), np.sort(true_pi), atol=0.1)
print(f'PASS — mixing weights recovered: {ok0}')

14. Common Mistakes and Edge Cases

A summary of the most frequent errors when working with joint distributions.

Code cell 38

# === 14.1 Common Mistakes Demo ===

import numpy as np
from scipy import stats
np.random.seed(42)

print('=== Mistake 1: Treating marginals as determining the joint ===')
# Same marginals, different copulas => different dependence
N = 100000
rho_vals = [0.0, 0.9, -0.9]
for rho in rho_vals:
    Sig = np.array([[1, rho],[rho, 1]])
    samp = np.random.multivariate_normal([0,0], Sig, N)
    U = stats.norm.cdf(samp)  # uniform marginals
    # P(both in top quartile)
    p_joint_top = ((U[:,0] > 0.75) & (U[:,1] > 0.75)).mean()
    p_indep_top = 0.25 * 0.25
    print(f'  rho={rho:+.1f}: P(X>Q75, Y>Q75) = {p_joint_top:.4f}'
          f'  (independence would give {p_indep_top:.4f})')
print('=> Same marginals, very different joint tail behaviour!\n')

print('=== Mistake 2: Cov=0 does not imply independence ===')
X = np.random.uniform(-1, 1, N)
Y = X**2 + np.random.normal(0, 0.01, N)  # Y determined by X!
print(f'  Cov(X, Y) = {np.cov(X,Y)[0,1]:.6f}  (≈0 by symmetry)')
print(f'  But Y = X^2: completely determined by X (MI >> 0)')
print(f'  Conditional var(Y | X≈0.9) ≈ {np.var(Y[np.abs(X-0.9)<0.05]):.6f}')
print(f'  Marginal var(Y)            = {np.var(Y):.4f}  (much larger)\n')

print('=== Mistake 3: Forgetting Jacobian in CoV ===')
# X ~ N(0,1), Y = exp(X) (log-normal)
# Wrong: f_Y(y) = f_X(y) = (1/sqrt(2pi))*exp(-y^2/2)  <-- WRONG
# Right: f_Y(y) = f_X(ln y) * (1/y)                   <-- correct
X3 = np.random.normal(0, 1, N)
Y3 = np.exp(X3)
y_test = np.array([1.0, 2.0, 3.0])
f_wrong = stats.norm.pdf(y_test)            # missing Jacobian
f_right = stats.norm.pdf(np.log(y_test)) / y_test  # correct
f_lognorm = stats.lognorm.pdf(y_test, s=1)  # scipy reference
print('  y       | f_wrong  | f_correct | f_lognorm')
for y, fw, fr, fl in zip(y_test, f_wrong, f_right, f_lognorm):
    print(f'  y={y:.1f}   | {fw:.6f} | {fr:.6f}  | {fl:.6f}')
print('PASS — correct formula matches scipy lognorm')

Summary

This notebook covered the core machinery of joint distributions:

| Topic | Key result |
|---|---|
| Joint PMF/PDF | Marginals via integration/summation |
| Conditionals | $f_{Y\mid X}(y\mid x) = f_{X,Y}(x,y)/f_X(x)$ |
| Chain rule | $p(x_1,\ldots,x_n) = \prod_i p(x_i \mid x_{<i})$ |
| Independence | $f(x,y) = f_X(x)f_Y(y)$; $\text{Cov}=0 \not\Rightarrow$ indep. |
| MVN conditional | Schur complement formula |
| Change of variables | PDF transforms with Jacobian factor $\lvert\det J\rvert$ |
| Copulas | $F(x,y) = C(F_X(x), F_Y(y))$ |
| Reparameterisation | Low-variance gradients for the VAE ELBO |
| Attention | Conditional expectation $\mathbb{E}_{j\sim\alpha_i}[v_j]$ |

Next: §04 Expected Values and Moments