"Probability is not about the odds, but about the belief in the possibility of something happening."
Nassim Nicholas Taleb
Overview
Real-world phenomena are never isolated. A language model's next token depends on every preceding token. A patient's diagnosis depends on dozens of correlated symptoms. A stock price today depends on yesterday's price, the broader market, and macroeconomic variables. To reason about multiple quantities together - to model dependencies, compute conditionals, and propagate uncertainty - we need joint distributions.
This section develops the full theory of multivariate probability: how to define probability distributions over pairs and vectors of random variables, how to extract marginals and conditionals, and when independence permits us to factorise complexity. We give a complete treatment of the multivariate Gaussian, the most important distribution in machine learning, and show how the chain rule of probability underlies the autoregressive factorisation that powers every modern large language model.
The ideas here are the probabilistic backbone of Naive Bayes classifiers, Gaussian Mixture Models, Variational Autoencoders, and the attention mechanism in transformers.
A univariate distribution answers one question: what is the probability that a single random variable takes a particular value? But the world is multivariate. We want to know: what is the probability that a language model generates token A and then token B? What is the probability that a patient has disease D given symptom S and test result T? What is the probability that two neurons in a neural network fire together?
To answer these questions, we need joint distributions - probability distributions over two or more random variables simultaneously. The joint distribution is the most complete probabilistic description of a system: from it, we can derive every single-variable distribution (by marginalising), every conditional distribution (by dividing), and every notion of dependence (by comparing the joint to the product of marginals).
The central tension in probabilistic machine learning is between expressiveness and tractability. A fully general joint distribution over n binary variables requires 2^n − 1 parameters - exponential in n. A fully factored (independence) model requires only n parameters but ignores all dependencies. The art of probabilistic modelling lies in choosing structured assumptions - conditional independence, low-rank covariance, mixture models - that capture the important dependencies while remaining computationally feasible.
For AI: Every language model's output distribution is a joint distribution over token sequences. The reason transformers are tractable is that they use the chain rule to factor $p(x_1,\dots,x_T) = \prod_{t=1}^{T} p(x_t \mid x_1,\dots,x_{t-1})$, converting an intractable $V^T$-dimensional joint into $T$ tractable conditionals.
1.2 From Univariate to Multivariate
When we have a single random variable X, its distribution lives on the real line R. The CDF FX(x)=P(X≤x) is a function of one variable, the density fX(x) integrates to 1 over R.
When we have two random variables X and Y, their joint distribution lives on the plane R2. The joint CDF is FX,Y(x,y)=P(X≤x,Y≤y), a function of two variables. The joint density fX,Y(x,y) integrates to 1 over R2. For n variables X=(X1,…,Xn), the joint distribution lives on Rn.
The key new concept is dependence: knowing the value of X can change our belief about Y. If X and Y are independent, knowing X tells us nothing about Y, and the joint distribution factorises as fX,Y(x,y)=fX(x)⋅fY(y). Independence is a very special, restrictive assumption - and it is almost never exactly true in practice, but often a useful approximation.
1.3 The Independence Assumption in ML
Many machine learning algorithms make explicit independence assumptions to achieve tractability:
Naive Bayes: Assumes features are conditionally independent given the class label. Despite being wrong for virtually all real datasets, this assumption produces competitive classifiers because it massively reduces the number of parameters.
Mean-field variational inference: Assumes the posterior over all latent variables factorises as q(z)=∏iqi(zi). This is the canonical tractability trick in VAEs and Bayesian neural networks.
i.i.d. data assumption: Most supervised learning theory assumes training examples (xi,yi) are drawn independently from the same distribution. This is violated for time series, graphs, and correlated datasets - and the violation matters for generalisation bounds.
Residual stream in transformers: Attention heads in the same layer see the same residual stream, making their outputs dependent. Recent work on multi-head attention analyses this dependence structure.
Understanding exactly which independence assumptions are made - and when they break - is essential for diagnosing failure modes in learned models.
1.4 Historical Context
The study of joint distributions has a rich history tied to the development of statistics and probability theory.
1888 - Francis Galton introduced correlation while studying the joint distribution of parents' and children's heights. His "regression to the mean" was originally an observation about the conditional mean.
1896 - Karl Pearson formalised the correlation coefficient and the bivariate normal distribution, laying the groundwork for multivariate statistics.
1933 - Andrey Kolmogorov provided the axiomatic foundation for probability theory, defining joint distributions via product measures on product spaces.
1959 - Abe Sklar proved his eponymous theorem, showing any joint distribution can be decomposed into marginals and a copula - separating the dependence structure from the marginal distributions.
1989 - Judea Pearl developed graphical models (Bayesian networks), providing a language for specifying conditional independence structures among many variables. His work underpins modern probabilistic AI.
2013 - Diederik Kingma and Max Welling introduced VAEs, where the core idea is the joint distribution p(x,z)=p(z)p(x∣z) over observed data and latent variables.
2017 - "Attention Is All You Need" made the chain rule factorisation ∏tp(xt∣x<t) the dominant paradigm for language modelling at scale.
2. Formal Definitions
2.1 Joint PMF (Discrete Case)
Let X and Y be discrete random variables taking values in countable sets X and Y respectively.
Definition: The joint probability mass function of (X,Y) is
$p_{X,Y}(x,y) = P(X = x,\, Y = y) \quad \text{for all } x \in \mathcal{X},\ y \in \mathcal{Y}$
Axioms: A valid joint PMF must satisfy:
Non-negativity: $p_{X,Y}(x,y) \ge 0$ for all $(x,y)$
Normalisation: $\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x,y) = 1$
Example 1 - Fair dice: Roll two fair 6-sided dice. Let X = value of die 1, Y = value of die 2. Then pX,Y(i,j)=1/36 for all i,j∈{1,…,6}. The joint distribution is uniform over the 6×6 grid.
Example 2 - Dependent variables: Let X ∼ Bernoulli(0.4) and Y = X (perfect dependence). Then p(0,0) = 0.6, p(1,1) = 0.4, and p(0,1) = p(1,0) = 0.
This is a valid joint PMF: all entries non-negative, sum = 1.0.
Non-example 1: p(0,0)=0.3, p(0,1)=0.4, p(1,0)=0.2, p(1,1)=0.2 - the sum is 1.1, not valid.
Non-example 2: p(0,0)=0.5, p(1,1)=0.5, p(0,1)=−0.1, p(1,0)=0.1 - negative entry, not valid.
Extension to n variables: The joint PMF of (X1,…,Xn) is
p(x1,…,xn)=P(X1=x1,…,Xn=xn)
with $\sum_{x_1} \cdots \sum_{x_n} p(x_1,\dots,x_n) = 1$.
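These axioms are mechanical to check by machine. A minimal numpy sketch (the helper name is ours) that validates the example above and rejects both non-examples:

```python
import numpy as np

def is_valid_joint_pmf(p, tol=1e-12):
    """Check the two joint-PMF axioms: non-negativity and normalisation."""
    p = np.asarray(p, dtype=float)
    return bool((p >= 0).all()) and abs(p.sum() - 1.0) < tol

# Example 2 (Y = X with X ~ Bernoulli(0.4)): all mass on the diagonal.
print(is_valid_joint_pmf([[0.6, 0.0], [0.0, 0.4]]))    # True
# Non-example 1: entries sum to 1.1.
print(is_valid_joint_pmf([[0.3, 0.4], [0.2, 0.2]]))    # False
# Non-example 2: a negative entry, even though the total is 1.
print(is_valid_joint_pmf([[0.5, -0.1], [0.1, 0.5]]))   # False
```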
2.2 Joint PDF (Continuous Case)
Let X and Y be continuous random variables.
Definition: A function fX,Y:R2→R≥0 is a joint probability density function for (X,Y) if for every (measurable) set A⊆R2:
$P\big((X,Y) \in A\big) = \iint_A f_{X,Y}(x,y)\, dx\, dy$
Axioms:
$f_{X,Y}(x,y) \ge 0$ for all $(x,y)$
$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dx\, dy = 1$
Important: Unlike probabilities, densities can exceed 1 - only integrals of densities are probabilities.
Example 1 - Uniform on unit square: $f_{X,Y}(x,y) = 1$ for $(x,y) \in [0,1]^2$, zero elsewhere. Check: $\int_0^1 \int_0^1 1\, dx\, dy = 1$. [ok]
Example 2 - Standard bivariate normal (uncorrelated):
$f_{X,Y}(x,y) = \frac{1}{2\pi} \exp\!\left(-\frac{x^2 + y^2}{2}\right)$
This factors as fX(x)⋅fY(y) - the variables are independent standard normals.
Example 3 - Correlated bivariate normal (correlation ρ):
$f_{X,Y}(x,y) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\!\left(-\frac{x^2 - 2\rho xy + y^2}{2(1-\rho^2)}\right)$
Non-example: $f(x,y) = 2(x+y) - 1$ on $[0,1]^2$ - this can be negative (at $(0,0)$ it equals $-1$), so it is not a valid PDF.
Joint PDF in Rn: For a random vector X=(X1,…,Xn)⊤, the joint PDF fX:Rn→R≥0 satisfies
P(X∈A)=∫AfX(x)dx
where dx=dx1⋯dxn.
2.3 Joint CDF
Definition: The joint cumulative distribution function of (X1,…,Xn) is
F(x1,…,xn)=P(X1≤x1,…,Xn≤xn)
Properties:
Monotone: $F$ is non-decreasing in each argument
Limits: $F(x_1,\dots,x_n) \to 1$ as all $x_i \to \infty$; $F \to 0$ if any $x_i \to -\infty$
Right-continuous: $F$ is right-continuous in each argument
Recovery of the PDF: For continuous variables, $f(x_1,\dots,x_n) = \frac{\partial^n F}{\partial x_1 \cdots \partial x_n}$
Computing probabilities from the CDF (bivariate):
P(a<X≤b,c<Y≤d)=F(b,d)−F(a,d)−F(b,c)+F(a,c)
This inclusion-exclusion formula generalises to n dimensions with $2^n$ terms.
2.4 Mixed Discrete-Continuous Distributions
Some important distributions are mixed: discrete in one component, continuous in another.
Example - Spike-and-slab prior (used in sparse Bayesian models): Let $Z \sim \text{Bernoulli}(\pi)$ be a binary inclusion indicator, with $W \mid Z=1 \sim \mathcal{N}(0, \sigma_1^2)$ and $W \mid Z=0 = 0$ (a point mass at zero). The joint distribution over $(Z, W)$ is:
$P(Z=0, W=0) = 1 - \pi$ (discrete atom)
For $w \neq 0$: $f(Z{=}1, W{=}w) = \pi \cdot \frac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-w^2/(2\sigma_1^2)}$ (continuous)
Example - Gaussian Mixture Model (GMM): Let Z∈{1,…,K} be a discrete cluster label with P(Z=k)=πk, and X∣Z=k∼N(μk,Σk). The joint p(Z,X) is mixed. Marginalising out Z gives the mixture density:
$f(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x;\, \mu_k, \Sigma_k)$
This is one of the most widely used density estimation models in unsupervised learning.
3. Marginal and Conditional Distributions
3.1 Marginalisation
Given a joint distribution over (X,Y), we obtain the marginal distribution of X by "integrating out" Y - summing or integrating over all possible values of Y.
Discrete:
$p_X(x) = \sum_{y \in \mathcal{Y}} p_{X,Y}(x,y)$
Continuous:
$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dy$
Intuition: The marginal pX answers: "What is the probability that X=x, regardless of what Y is doing?" We collapse the joint table by summing rows (to get the X marginal) or columns (to get the Y marginal).
Example - From joint table:

          Y=0    Y=1    Y=2  |  pX(x)
X=0       0.1    0.2    0.1  |  0.4
X=1       0.3    0.2    0.1  |  0.6
--------------------------------------
pY(y)     0.4    0.4    0.2  |  1.0
Marginalisation in n dimensions: To obtain the marginal of $X_A$ (a subset of variables indexed by $A$), integrate over all other variables $X_{A^c}$:
$f_{X_A}(x_A) = \int f_X(x_A, x_{A^c})\, dx_{A^c}$
For AI: Training a language model means learning the joint distribution $p(x_1,\dots,x_T)$ over token sequences. At inference time, we often want $p(x_t \mid x_1,\dots,x_{t-1})$ - a conditional derived from this joint. Computing the marginal $p(x_t) = \sum_{x_1,\dots,x_{t-1}} p(x_1,\dots,x_t)$ would require summing over $V^{t-1}$ prefixes - intractable. This is why the chain rule factorisation is essential.
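A short numpy sketch of marginalisation on the joint table above, plus the factorisation check that Section 4 formalises:

```python
import numpy as np

# Joint table from the example above: rows index X in {0,1}, columns Y in {0,1,2}.
p_xy = np.array([[0.1, 0.2, 0.1],
                 [0.3, 0.2, 0.1]])

p_x = p_xy.sum(axis=1)   # sum out Y (collapse columns) -> [0.4, 0.6]
p_y = p_xy.sum(axis=0)   # sum out X (collapse rows)    -> [0.4, 0.4, 0.2]

# Independence would require p_xy == outer(p_x, p_y) at every cell.
print(np.allclose(p_xy, np.outer(p_x, p_y)))   # False: X and Y are dependent
```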
3.2 Conditional Distributions
The conditional distribution of Y given X=x captures how our belief about Y changes when we learn the value of X.
Definition: $f_{Y \mid X}(y \mid x) = \frac{f_{X,Y}(x,y)}{f_X(x)}$, defined wherever $f_X(x) > 0$ (with sums in place of integrals in the discrete case).
Interpretation: $f_{Y \mid X}(y \mid x)$ is a function of $y$ for fixed $x$. As $x$ varies, we get a family of distributions - the conditional distribution is a map from values of $x$ to distributions over $Y$.
Example - Bivariate normal: For (X,Y) jointly Gaussian with correlation ρ:
$Y \mid X = x \sim \mathcal{N}\!\left(\mu_Y + \rho\, \frac{\sigma_Y}{\sigma_X}(x - \mu_X),\;\; \sigma_Y^2 (1 - \rho^2)\right)$
The conditional mean is linear in $x$ - this is the linear regression formula. The conditional variance $\sigma_Y^2(1-\rho^2)$ is never larger than the marginal variance $\sigma_Y^2$ (conditioning reduces uncertainty), with equality iff $\rho = 0$.
For AI: Every output of a neural network is a parameterised conditional distribution. A classifier outputs p(y∣x;θ); a language model outputs p(xt∣x<t;θ); a diffusion model computes p(xt−1∣xt;θ). All of probabilistic deep learning is the art of learning flexible conditional distributions from data.
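In the discrete case the definition is one row-normalisation of the joint table; a minimal sketch reusing the Section 3.1 table:

```python
import numpy as np

# Joint table from Section 3.1: rows X in {0,1}, columns Y in {0,1,2}.
p_xy = np.array([[0.1, 0.2, 0.1],
                 [0.3, 0.2, 0.1]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X, kept as a column

# Conditional p(Y | X = x): divide each row of the joint by the X-marginal.
p_y_given_x = p_xy / p_x
print(p_y_given_x)                      # row x is the distribution p(. | X = x)
print(p_y_given_x.sum(axis=1))          # [1. 1.]: each row normalises
```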
3.3 Chain Rule of Probability
The chain rule (or product rule) expresses any joint distribution as a product of conditional distributions:
$p(x_1,\dots,x_n) = \prod_{i=1}^{n} p(x_i \mid x_1,\dots,x_{i-1})$
Proof: By induction. Base case $n = 2$: $p(x_1, x_2) = p(x_2 \mid x_1)\, p(x_1)$ is the definition of conditional probability. Inductive step: $p(x_1,\dots,x_n) = p(x_n \mid x_1,\dots,x_{n-1}) \cdot p(x_1,\dots,x_{n-1})$; apply the inductive hypothesis to the second factor.
CHAIN RULE ORDERINGS FOR p(A, B, C)
========================================================================
Forward: p(A) · p(B|A) · p(C|A,B)
Reverse: p(C) · p(B|C) · p(A|B,C)
Mixed:   p(B) · p(A|B) · p(C|A,B)

All three represent the same joint distribution p(A,B,C).
Graphical models choose the ordering that exploits conditional
independence to reduce the complexity of each conditional.
========================================================================
For AI - Autoregressive language models: The chain rule directly motivates the autoregressive factorisation:
$p(x_1, x_2, \dots, x_T) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_1, \dots, x_{t-1})$
A GPT model learns one neural network to represent all T conditional distributions simultaneously (via causal masking). Each forward pass computes all T conditionals in parallel during training, while at inference time they are evaluated sequentially. The chain rule is not a modelling choice - it is an exact identity.
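A toy sketch of this factorisation: a hypothetical bigram model over a three-token vocabulary, sampled and scored exactly via the chain rule. All probability values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["A", "B", "C"]
p_first = np.array([0.5, 0.3, 0.2])          # hypothetical p(x_1)
p_next = np.array([[0.1, 0.6, 0.3],           # hypothetical p(x_t | x_{t-1}); row =
                   [0.4, 0.2, 0.4],           # previous token, column = next token
                   [0.3, 0.3, 0.4]])

def sample_sequence(T):
    """Sample x_1..x_T sequentially: x_1 ~ p(x_1), then each conditional in turn."""
    seq = [rng.choice(3, p=p_first)]
    for _ in range(T - 1):
        seq.append(rng.choice(3, p=p_next[seq[-1]]))
    return [vocab[i] for i in seq]

def log_prob(seq):
    """Exact log p(x_1..x_T) as the sum of log-conditionals (chain rule identity)."""
    idx = [vocab.index(s) for s in seq]
    lp = np.log(p_first[idx[0]])
    for prev, nxt in zip(idx, idx[1:]):
        lp += np.log(p_next[prev, nxt])
    return lp

s = sample_sequence(5)
print(s, log_prob(s))
```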
Discrete Bayes - medical testing: Suppose a disease has 1% prevalence and a test is 95% sensitive and 95% specific. Then $P(\text{disease} \mid \text{positive}) = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} \approx 0.16$. Despite a 95% sensitive test, a positive result only gives a 16% posterior probability - because the disease is rare. This is the base rate fallacy, a consequence of not properly using the joint distribution.
Continuous Bayes - Gaussian-Gaussian: Prior $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$, likelihood $X \mid \theta \sim \mathcal{N}(\theta, \sigma^2)$. The posterior is:
$\theta \mid X = x \sim \mathcal{N}\!\left(\frac{\sigma^2 \mu_0 + \sigma_0^2 x}{\sigma^2 + \sigma_0^2},\;\; \frac{\sigma^2 \sigma_0^2}{\sigma^2 + \sigma_0^2}\right)$
The posterior mean is a weighted average of prior mean and observation - a fundamental result in Bayesian estimation.
4. Independence and Conditional Independence
4.1 Independence - Three Equivalent Characterisations
X and Y are independent, written X⊥⊥Y, if and only if any of the following equivalent conditions holds:
Characterisation 1 (Factorisation):
$p_{X,Y}(x,y) = p_X(x) \cdot p_Y(y)$ for all $x, y$
Characterisation 2 (Conditional = Marginal):
$p_{Y \mid X}(y \mid x) = p_Y(y)$ for all $x, y$
Characterisation 3 (CDF Factorisation):
$F_{X,Y}(x,y) = F_X(x) \cdot F_Y(y)$ for all $x, y$
Proof of equivalence (1) <-> (2): If pX,Y=pXpY, then pY∣X(y∣x)=pX,Y/pX=pY. Conversely, pX,Y=pY∣XpX=pYpX.
Independence of functions: If X⊥⊥Y, then g(X)⊥⊥h(Y) for any measurable functions g,h.
Caution: Independence requires factorisation everywhere in the support. A single point where $p_{X,Y}(x,y) \neq p_X(x)\, p_Y(y)$ breaks independence.
4.2 Pairwise vs Mutual Independence
For three or more variables, we must distinguish between pairwise and mutual independence.
(X1,X2,X3) are pairwise independent if Xi⊥⊥Xj for all pairs (i,j)
(X1,X2,X3) are mutually independent if p(x1,x2,x3)=p(x1)p(x2)p(x3) for all (x1,x2,x3)
Mutual independence implies pairwise, but NOT vice versa.
Classic counterexample (Bernstein, 1946): Let $X_1, X_2 \sim \text{Bernoulli}(1/2)$ independently, and $X_3 = X_1 \oplus X_2$ (XOR). Then all pairs are independent (pairwise), but $P(X_3 = 1 \mid X_1 = 1, X_2 = 1) = 0 \neq 1/2 = P(X_3 = 1)$ - NOT mutually independent.
General definition:(X1,…,Xn) are mutually independent iff for every subset S⊆{1,…,n}:
$P\!\left(\bigcap_{i \in S} \{X_i = x_i\}\right) = \prod_{i \in S} P(X_i = x_i)$
This requires $2^n - n - 1$ factorisation conditions - one for each subset of size $\ge 2$.
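The Bernstein XOR counterexample is easy to confirm by simulation; a short numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
x3 = x1 ^ x2                      # XOR: pairwise independent of x1 and of x2

# Pairwise check: P(X1=1, X3=1) should be ~ P(X1=1) P(X3=1) = 0.25.
print(np.mean((x1 == 1) & (x3 == 1)))            # ~0.25
# Mutual independence fails: P(X3=1 | X1=1, X2=1) = 0, not 1/2.
print(np.mean(x3[(x1 == 1) & (x2 == 1)] == 1))   # 0.0
```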
4.3 Conditional Independence
X and Y are conditionally independent given Z, written X⊥⊥Y∣Z, if:
$p(x, y \mid z) = p(x \mid z) \cdot p(y \mid z)$ for all $x, y, z$
Equivalently: p(x∣y,z)=p(x∣z) - knowing Z, additional knowledge of Y provides no information about X.
Conditional independence does not imply marginal independence, and vice versa.
Example 1 (CI but not marginal):Z selects which coin; X and Y are independent flips of that coin. Given Z: X⊥⊥Y∣Z. Marginally: X and Y are correlated (via the shared coin).
Example 2 (Marginal but not CI): X, Y are independent fair coin flips; Z = X⊕Y. Then X⊥⊥Y, but X and Y are not conditionally independent given Z (knowing the XOR makes one flip determine the other).
CONDITIONAL INDEPENDENCE STRUCTURES
========================================================================
Chain:    X -> Z -> Y     X ⊥⊥ Y | Z   (Z blocks the path)
Fork:     X <- Z -> Y     X ⊥⊥ Y | Z   (Z explains the dependence)
Collider: X -> Z <- Y     X ⊥⊥ Y       (marginally independent)
                          X ⊥̸⊥ Y | Z  (conditioning creates dependence)
                          - Berkson's paradox / selection bias
========================================================================
For AI:
Naive Bayes:p(x∣y)=∏ip(xi∣y) - features are conditionally independent given class
Causal masking: In transformers, position t is conditionally independent of future tokens given context x<t
4.4 d-Separation Preview
Berkson's paradox (collider bias) illustrates how conditioning can create dependence. Suppose acceptance to a selective university depends on both academic ability A and athletic skill S, where A⊥⊥S marginally. Among admitted students (conditioning on admission), A and S are negatively correlated - high academic ability reduces the need for athletic skill to justify admission.
This is the collider structure A→Z←S with Z conditioned on. It appears throughout ML as selection bias: training on filtered data creates spurious dependencies between filtered-out features.
Forward reference: The systematic framework for reading off all conditional independence relationships from a directed acyclic graph (d-separation, Bayesian networks, Markov blankets) is developed in the graphical models literature; see Pearl (2000), Causality.
5. Covariance and Correlation
5.1 Covariance
The covariance of X and Y is Cov(X,Y) = E[(X−E[X])(Y−E[Y])] = E[XY] − E[X]E[Y]. Its sign indicates the direction of linear association:
Cov(X,Y)>0: X and Y tend to move in the same direction
Cov(X,Y)<0: X and Y tend to move in opposite directions
Cov(X,Y)=0: uncorrelated - no linear relationship
Key properties:
1. Symmetry: Cov(X,Y) = Cov(Y,X)
2. Bilinearity: Cov(aX + bZ, Y) = a Cov(X,Y) + b Cov(Z,Y)
3. Self-covariance: Cov(X,X) = Var(X)
4. Independence implies zero covariance: X⊥⊥Y ⇒ Cov(X,Y) = 0
5. Variance of sum: Var(X+Y) = Var(X) + 2 Cov(X,Y) + Var(Y)
Caution: Property 4 does NOT reverse. Cov(X,Y)=0 does NOT imply X⊥⊥Y.
Example - Zero covariance, strong dependence: Let X∼Uniform(−1,1) and Y=X2. Then E[XY]=E[X3]=0 (odd function of symmetric distribution), so Cov(X,Y)=0. But Y is a deterministic function of X - perfectly dependent. Covariance only measures linear dependence.
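A numerical check of this example (numpy only; sample size arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 1_000_000)
y = x ** 2                         # deterministic function of x

# Covariance ~ 0 (E[X^3] = 0 by symmetry) despite perfect dependence.
print(np.cov(x, y)[0, 1])          # ~0
# The dependence shows up in conditional behaviour: Var(Y | X near 0) is tiny.
print(y[np.abs(x) < 0.05].var(), y.var())
```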
5.2 Pearson Correlation Coefficient
The Pearson correlation coefficient normalises covariance to [−1,1]:
$\rho(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X\, \sigma_Y}$
Geometric interpretation: ρ is the cosine of the angle between the centred random variables viewed as vectors in $L^2$:
$\rho(X,Y) = \cos\theta, \qquad \theta = \angle\big(X - E[X],\; Y - E[Y]\big)$
Properties:
∣ρ(X,Y)∣≤1 (Cauchy-Schwarz)
ρ=1: perfect positive linear relationship
ρ=−1: perfect negative linear relationship
ρ=0: uncorrelated (no linear relationship - but possibly nonlinear)
Scale invariance:ρ(aX+b,cY+d)=sign(ac)⋅ρ(X,Y)
5.3 Covariance Matrix
For a random vector X=(X1,…,Xn)⊤, the covariance matrix is:
Σ=Cov(X)=E[(X−μ)(X−μ)⊤]
where μ=E[X]. Entry (i,j) is Σij=Cov(Xi,Xj); diagonal Σii=Var(Xi).
Key properties:
Symmetry:Σ=Σ⊤
Positive semi-definiteness:v⊤Σv≥0 for all v∈Rn
Proof of PSD:v⊤Σv=E[(v⊤(X−μ))2]≥0 since the expectation of a squared quantity is non-negative. QED
Transformation:Cov(AX+b)=AΣA⊤
Eigendecomposition:Σ=QΛQ⊤ where Q is orthogonal (eigenvectors) and Λ=diag(λ1,…,λn) with λi≥0. Eigenvectors are the principal axes of the distribution; eigenvalues are the variances along those axes.
For AI:
PCA finds the eigendecomposition of the sample covariance matrix
Adam second moment estimate approximates the diagonal of E[gtgt⊤]
Whitening applies Σ^(−1/2) to decorrelate features - BatchNorm applies the diagonal (per-feature standardisation) version of this idea
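A sketch tying these properties together: estimate a sample covariance, eigendecompose it, and whiten by multiplying with S^(−1/2) (ZCA-style). The covariance matrix below is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])
X = rng.multivariate_normal([0, 0], Sigma, size=100_000)

S = np.cov(X.T)                       # sample covariance (close to Sigma)
evals, Q = np.linalg.eigh(S)          # eigendecomposition S = Q diag(evals) Q^T
print(evals)                          # variances along the principal axes

# ZCA whitening: apply S^{-1/2}; the whitened data has identity covariance.
S_inv_sqrt = Q @ np.diag(evals ** -0.5) @ Q.T
Z = X @ S_inv_sqrt
print(np.cov(Z.T))                    # ~ identity matrix
```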
5.4 Pitfalls: Anscombe, Zero Covariance, Causation
Anscombe's quartet (1973): Four datasets with nearly identical means, variances, and ρ≈0.816 yet completely different visual patterns - linear, curved, linear with outlier, vertical cluster. Lesson: always visualise. Summary statistics can be identical for very different distributions.
Zero covariance ≠ independence: Covariance only captures linear dependence. For general dependence, use mutual information $I(X;Y) = E\!\left[\log \frac{p(X,Y)}{p(X)\, p(Y)}\right]$ - which is zero if and only if X⊥⊥Y.
Correlation ≠ causation: Ice cream sales and drowning rates are positively correlated (confounded by summer). In ML: spurious correlations in training data are learned as shortcuts (Geirhos et al., 2020). Causal invariance methods (IRM, DRO) aim to learn causal rather than associative features.
6. Multivariate Gaussian Distribution
6.1 Definition and Parameterisation
The multivariate Gaussian (MVN) is the most important distribution in machine learning.
Definition:X∈Rn follows X∼N(μ,Σ) with density:
$f(x) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$
where μ∈Rn is the mean vector and Σ∈Rn×n is the covariance matrix (symmetric positive definite).
Parameters:n for μ + n(n+1)/2 for Σ = n(n+3)/2 total.
Natural parameterisation (precision form): Using $\Lambda = \Sigma^{-1}$ (the precision matrix) and $\eta = \Lambda\mu$:
$f(x) \propto \exp\!\left(-\frac{1}{2} x^\top \Lambda x + \eta^\top x\right)$
Special cases:
Σ=σ2I: isotropic - all dimensions independent, equal variance
Applications: Gaussian process regression (predictive distribution at test points), Bayesian linear regression (posterior over weights), Kalman filter (state update step - the Kalman gain $K = \Sigma_{12}\Sigma_{22}^{-1}$).
6.5 Affine Transformations
Theorem: If X∼N(μ,Σ) and Y=AX+b, then:
Y∼N(Aμ+b,AΣA⊤)
Applications:
Whitening:A=Σ−1/2 -> Y∼N(0,I)
PCA projection:A=Qk⊤ (top k eigenvectors) -> k-dimensional projection
Reparameterisation trick: Sample ε∼N(0,I), set z=μ+Lε where LL⊤=Σ - allows gradients to flow through the sampling step in VAEs
6.6 Maximum Entropy Property
Theorem: Among all distributions on $\mathbb{R}^n$ with fixed mean $\mu$ and covariance $\Sigma$, the MVN has maximum differential entropy:
$h\big(\mathcal{N}(\mu,\Sigma)\big) = \frac{1}{2} \log\!\big((2\pi e)^n |\Sigma|\big)$
Implication for AI: Gaussian priors/posteriors are the maximum-entropy (least informative) choice given mean and covariance constraints. This justifies Gaussian weight initialisation (no preferred direction), Gaussian latent priors in VAEs (maximum entropy latent space), and Laplace approximation to the Bayesian posterior.
7. Transformations of Random Vectors
7.1 Change of Variables and Jacobians
Given random vector X with PDF fX and differentiable bijection g:Rn→Rn, the PDF of Y=g(X) is:
$f_Y(y) = f_X\big(g^{-1}(y)\big) \cdot \left|\det J_{g^{-1}}(y)\right|$
where Jg−1 is the Jacobian of the inverse map. The Jacobian determinant measures how much g stretches/compresses volume - probability mass must be conserved.
Example 1 - Polar coordinates: $(X_1, X_2) \sim \mathcal{N}(0, I)$, transform to $(R, \Theta)$. The Jacobian factor is $|\det J| = r$, giving:
$f_{R,\Theta}(r,\theta) = \frac{r}{2\pi}\, e^{-r^2/2}$
So R has the Rayleigh distribution, Θ∼Uniform(0,2π), and R⊥⊥Θ.
Example 2 - Box-Muller transform:U1,U2∼Uniform(0,1) independent. Define:
$X_1 = \sqrt{-2\log U_1}\, \cos(2\pi U_2), \qquad X_2 = \sqrt{-2\log U_1}\, \sin(2\pi U_2)$
Then X1,X2∼N(0,1) independently - a practical algorithm for Gaussian sampling.
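A direct numpy implementation of the transform as a sanity check (seed and sample size arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
u1 = rng.uniform(size=100_000)
u2 = rng.uniform(size=100_000)

# Box-Muller: map two independent uniforms to two independent standard normals.
r = np.sqrt(-2.0 * np.log(u1))        # Rayleigh-distributed radius
x1 = r * np.cos(2 * np.pi * u2)
x2 = r * np.sin(2 * np.pi * u2)

print(x1.mean(), x1.std())            # ~0, ~1
print(np.corrcoef(x1, x2)[0, 1])      # ~0: the two outputs are uncorrelated
```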
Example 3 - Log-normal: $X \sim \mathcal{N}(\mu, \sigma^2)$, $Y = e^X$. Then:
$f_Y(y) = \frac{1}{y\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(\log y - \mu)^2}{2\sigma^2}\right), \quad y > 0$
For AI - Normalising flows: Flows learn a chain g=gK∘⋯∘g1 from a simple base distribution to a complex target. The change-of-variables formula gives exact log-likelihoods:
$\log f_Y(y) = \log f_Z(z) - \sum_{k=1}^{K} \log\left|\det J_{g_k}\right|$
Efficient flows (RealNVP, Glow) use triangular Jacobians with O(n) determinant computation.
7.2 Sum of Random Variables and Convolution
If X⊥⊥Y and Z=X+Y:
$f_Z(z) = (f_X * f_Y)(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx$
Stability under convolution:
N(μ1,σ12)∗N(μ2,σ22)=N(μ1+μ2,σ12+σ22)
Poisson(λ1)∗Poisson(λ2)=Poisson(λ1+λ2)
Gamma(α1,β)∗Gamma(α2,β)=Gamma(α1+α2,β)
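A simulation check of the first stability rule above, with arbitrary parameters:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(1.0, 2.0, 1_000_000)    # N(1, 4)
y = rng.normal(-2.0, 1.0, 1_000_000)   # N(-2, 1), independent of x
z = x + y

# Stability under convolution: the sum should be N(1 + (-2), 4 + 1) = N(-1, 5).
print(z.mean(), z.var())               # ~ -1, ~ 5
```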
Forward reference: The Central Limit Theorem - repeated convolution of any distribution with finite variance converges to a Gaussian - is proved in Section 6.6 (Stochastic Processes) via MGF factorisation.
7.3 Copulas
A copula separates the dependence structure from the marginals.
Sklar's Theorem (1959): For any joint CDF FX,Y with continuous marginals FX and FY, there exists a unique copula C:[0,1]2→[0,1] such that:
FX,Y(x,y)=C(FX(x),FY(y))
The copula C is the joint CDF of (FX(X),FY(Y)) - the probability-integral-transformed variables with Uniform(0,1) marginals.
Common copulas:
Independence:C(u,v)=uv
Gaussian copula: Dependence structure of a bivariate normal with arbitrary marginals
Clayton: Lower tail dependence (joint extremes at low values more likely)
Gumbel: Upper tail dependence
For AI: Copulas model the dependence structure of generated text distributions independently of individual token marginals. In risk modelling, asset returns are often fitted with Gaussian marginals but non-Gaussian tail dependence via t-copulas.
7.4 Order Statistics
Given n i.i.d. samples X1,…,Xn from CDF F with density f, the sorted values X(1)≤⋯≤X(n) are order statistics.
$f_{X_{(k)}}(x) = \frac{n!}{(k-1)!\,(n-k)!}\; F(x)^{k-1}\, \big(1 - F(x)\big)^{n-k}\, f(x)$
Minimum: $f_{X_{(1)}}(x) = n\,(1 - F(x))^{n-1}\, f(x)$
Maximum: $f_{X_{(n)}}(x) = n\, F(x)^{n-1}\, f(x)$
For AI - Top-k sampling: Language model top-k sampling selects among the k highest-probability tokens - analysable via the joint distribution of the k largest order statistics of the logit vector. Top-p (nucleus) sampling is related to the quantile of the cumulative distribution.
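A quick Monte Carlo check of the Beta(k, n−k+1) law for uniform order statistics (derived in Appendix V.7); n and k are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 10, 3
u = np.sort(rng.uniform(size=(200_000, n)), axis=1)
u_k = u[:, k - 1]                      # k-th smallest of n uniforms, per row

# Theory: U_(k) ~ Beta(k, n-k+1), so E = k/(n+1) and Var has the closed form below.
print(u_k.mean(), k / (n + 1))
print(u_k.var(), k * (n - k + 1) / ((n + 1) ** 2 * (n + 2)))
```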
8. ML Applications
8.1 Naive Bayes Classifier
Model: Given class label Y∈{1,…,K} and features x=(x1,…,xd):
$p(x \mid Y = k) = \prod_{i=1}^{d} p(x_i \mid Y = k)$
Features are conditionally independent given the class - reduces parameter count from exponential to linear in d.
Decision rule:
$\hat{y} = \arg\max_k \left[\log p(Y = k) + \sum_{i=1}^{d} \log p(x_i \mid Y = k)\right]$
Variants: Gaussian NB (continuous features), Multinomial NB (word counts), Bernoulli NB (binary features).
Why it works despite being wrong: The independence assumption is almost always false, yet Naive Bayes is competitive because:
Classification only requires the sign of the log-posterior ratio to be correct - much weaker than accurate probability estimation
The independence assumption acts as a regulariser with limited data
The decision boundary is a hyperplane - equivalent to logistic regression with NB features
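A minimal Gaussian Naive Bayes sketch under exactly these assumptions - synthetic data, MLE fitting, no smoothing or numerical safeguards:

```python
import numpy as np

class GaussianNB:
    """Per-class, per-feature univariate normals - the NB factorisation."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = np.array([(y == c).mean() for c in self.classes])
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) for c in self.classes])
        return self

    def predict(self, X):
        # Per class k: log p(Y=k) + sum_i log N(x_i; mu_ki, var_ki)
        ll = -0.5 * (np.log(2 * np.pi * self.var[:, None, :])
                     + (X[None, :, :] - self.mu[:, None, :]) ** 2
                     / self.var[:, None, :])
        score = np.log(self.prior)[:, None] + ll.sum(axis=2)
        return self.classes[np.argmax(score, axis=0)]

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(2, 1, (500, 2))])
y = np.array([0] * 500 + [1] * 500)
print((GaussianNB().fit(X, y).predict(X) == y).mean())   # high train accuracy
```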
8.2 Gaussian Mixture Models
A GMM is a joint distribution p(Z,X) where:
Z∈{1,…,K} is a latent cluster indicator with P(Z=k)=πk
X∣Z=k∼N(μk,Σk)
The marginal mixture density: $p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x;\, \mu_k, \Sigma_k)$
EM algorithm:
E-step:rik=P(Z=k∣xi)∝πkN(xi;μk,Σk) (Bayes' theorem for the joint)
M-step: Update parameters by maximising the expected complete-data log-likelihood
For AI: GMMs model speaker diarisation, sound source separation, and codebook structure in VQ-VAEs. Mechanistic interpretability research (Elhage et al., 2022) has found that transformer residual stream activations often decompose into Gaussian clusters across different input contexts.
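A bare-bones EM loop matching the two steps above; this is a sketch without log-space computation, convergence checks, or covariance regularisation:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def em_gmm(X, K, iters=50, seed=0):
    """Minimal EM for a GMM with full covariances (no numerical safeguards)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]    # initialise means at data points
    Sigma = np.stack([np.cov(X.T)] * K)
    for _ in range(iters):
        # E-step: responsibilities r_ik = P(Z = k | x_i), Bayes' theorem on the joint.
        r = np.stack([pi[k] * mvn.pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted MLE with the soft assignments.
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (r[:, k, None] * Xc).T @ Xc / Nk[k]
    return pi, mu, Sigma

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(4, 1, (300, 2))])
print(em_gmm(X, K=2)[1])   # recovered component means, near (0,0) and (4,4)
```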
8.3 VAE Encoder as Conditional Distribution
A VAE defines a generative joint distribution pθ(x,z)=pθ(x∣z)⋅p(z) with prior p(z)=N(0,I).
The encoder approximates the intractable posterior $p_\theta(z \mid x)$ with a diagonal Gaussian $q_\phi(z \mid x) = \mathcal{N}\!\big(z;\, \mu_\phi(x),\, \mathrm{diag}(\sigma_\phi^2(x))\big)$.
Reparameterisation trick: Write z=μϕ(x)+σϕ(x)⊙ε with ε∼N(0,I). This moves stochasticity into ε (not a function of ϕ), so gradients flow through μϕ and σϕ - the affine transformation property of MVN applied to enable gradient-based learning.
8.4 Attention as Joint Distribution
For query $q_t$ and keys $k_1, \dots, k_T$:
$\alpha_{ts} = \frac{\exp(q_t^\top k_s / \sqrt{d})}{\sum_{s'} \exp(q_t^\top k_{s'} / \sqrt{d})} = p(\text{attend to position } s \mid \text{query } t)$
This defines a conditional distribution over positions given the query. The output is the conditional expectation:
$o_t = \sum_s \alpha_{ts}\, v_s = E_{s \sim p(\cdot \mid t)}[v_s]$
The attention weight matrix $\{\alpha_{ts}\}_{t,s}$, combined with a distribution over query positions, defines a joint distribution over (query, key) position pairs. Causal masking encodes the autoregressive conditional independence assumption: the prediction at position $t$ may depend only on the context $x_{<t}$, never on future tokens.
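A numpy sketch of single-head causal attention as a family of conditional distributions; shapes and the random initialisation are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(8)
T, d = 5, 16
q = rng.normal(size=(T, d))
k = rng.normal(size=(T, d))
v = rng.normal(size=(T, d))

scores = q @ k.T / np.sqrt(d)
scores[np.triu_indices(T, 1)] = -np.inf       # causal mask: no attending to the future
alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)     # row t: p(attend to s | query t)

o = alpha @ v                                 # o_t = E_{s ~ p(.|t)}[v_s]
print(alpha.sum(axis=1))                      # all ones: valid conditionals
```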
8.5 Autoregressive Factorisation
The chain rule gives the autoregressive factorisation used by every modern language model:
$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$
Why this factorisation?
Exact: A mathematical identity - no approximation
Tractable: Each conditional is over vocabulary V (~50K-150K tokens) - manageable
Sample-able: Sequential sampling from conditionals
Trainable: Cross-entropy loss on each position; all positions trained in parallel with teacher forcing
KV cache: Storing (ks,vs) for s<t avoids recomputation - a direct consequence of the sequential conditional structure. The cache grows linearly in sequence length, motivating context-length-efficient architectures (sliding window, linear attention, Mamba).
9. Common Mistakes
1. Confusing p(X,Y) with p(X)p(Y). Why it's wrong: the second assumes independence. Fix: always verify factorisation before assuming independence.
2. Marginalising over the wrong variable. Why it's wrong: summing out X gives pY, not pX. Fix: identify which variable to sum/integrate out.
3. Concluding X⊥⊥Y from Cov(X,Y)=0. Why it's wrong: zero covariance rules out linear dependence only. Fix: compare p(X,Y) with p(X)p(Y); use mutual information.
4. Confusing marginal and conditional independence. Why it's wrong: X⊥⊥Y and X⊥⊥Y∣Z are logically distinct statements. Fix: draw the causal structure; distinguish fork, chain, collider.
5. Wrong block structure in the conditional MVN. Why it's wrong: swapping Σ11 and Σ22 gives the wrong result. Fix: put the conditioning variable consistently in x2.
6. Assuming Gaussian marginals imply a Gaussian joint. Why it's wrong: a non-Gaussian copula can still have Gaussian marginals. Fix: only a jointly Gaussian vector guarantees Gaussian marginals and conditionals.
7. Using det J_g instead of ∣det J_{g⁻¹}∣ in the change of variables. Why it's wrong: the density of Y=g(X) transforms with the Jacobian of the inverse map, in absolute value. Fix: use f_Y(y) = f_X(g⁻¹(y))·∣det J_{g⁻¹}(y)∣.
8. Confusing pairwise and mutual independence. Why it's wrong: pairwise does not imply mutual (Bernstein XOR example). Fix: check all subset factorisations for mutual independence.
9. Using ρ as a complete measure of dependence. Why it's wrong: Anscombe's quartet gives the same ρ for very different distributions. Fix: always visualise; use mutual information for general dependence.
10. Forgetting the normalising constant in continuous Bayes. Why it's wrong: the posterior must integrate to 1. Fix: compute the evidence ∫p(D∣θ)p(θ)dθ or use VI/MCMC.
11. Conditioning on a collider and expecting independence. Why it's wrong: conditioning on a collider creates dependence (Berkson). Fix: draw the causal graph; identify colliders before conditioning.
12. Comparing chain rule terms from different orderings. Why it's wrong: different orderings give different factorisations of the same joint. Fix: fix the ordering; compare p(x1,…,xn) directly.
10. Exercises
Exercise 1 * - Joint PMF Verification
Given the joint PMF table, verify it is valid, compute all marginal distributions, and determine whether X⊥⊥Y.
        Y=0    Y=1    Y=2
X=0     a      0.1    0.15
X=1     0.2    0.1    b
(a) Find a,b such that pX(0)=0.4. (b) Compute marginals pX and pY. (c) Test independence via the factorisation condition. (d) Compute P(X=1∣Y=1).
Exercise 2 * - Continuous Marginals and Conditionals
Let (X,Y) have joint PDF f(x,y)=c⋅xy on [0,2]×[0,1], zero elsewhere.
(a) Find c. (b) Compute marginals fX and fY. (c) Are X and Y independent? (d) Compute fY∣X(y∣x).
Exercise 3 * - Chain Rule and Autoregressive Factorisation
A simplified language model over vocabulary {A,B,C} has: p(A)=0.5, p(B∣A)=0.4, p(C∣A)=0.3, p(A∣B)=0.5.
(a) Compute p(A,B) using the chain rule in forward order. (b) Compute p(B,A) in reverse order. Verify p(A,B)=p(B,A). (c) Implement a simple bigram language model and sample 100 two-token sequences.
Exercise 4 * - Independence Testing with Simulation
(a) Generate $10^5$ pairs (X,Y) where X∼N(0,1) and Y = X² + ε, ε∼N(0,0.1). Compute Pearson ρ. (b) Estimate mutual information Î(X;Y) via a binned histogram. (c) Explain why ρ≈0 but Î(X;Y)≫0.
Exercise 5 ** - Covariance Matrix Properties
Let $X \sim \mathcal{N}(0, \Sigma)$ with $\Sigma = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix}$.
(a) Verify Σ is PSD by computing eigenvalues. (b) Compute the Cholesky factor L and sample from X via X=Lε. (c) Compute ρ(X1,X2). (d) Compute the distribution of Y=X1+X2 using the affine transformation property.
Exercise 6 ** - MVN Conditional Distribution
Let $X = (X_1, X_2, X_3)^\top \sim \mathcal{N}(\mu, \Sigma)$ with $\mu = (1, 2, 3)^\top$ and $\Sigma = \begin{pmatrix} 4 & 2 & 0 \\ 2 & 3 & 1 \\ 0 & 1 & 2 \end{pmatrix}$.
(a) Compute the conditional distribution (X1,X2)∣X3=3.5 via the Schur complement formula. (b) Verify numerically by sampling $10^4$ draws, filtering to X3∈[3.4,3.6], and computing empirical conditional moments.
Exercise 7 *** - Reparameterisation Trick
(a) Show that if $\varepsilon \sim \mathcal{N}(0, I)$ and $z = \mu + L\varepsilon$, then $z \sim \mathcal{N}(\mu, LL^\top)$. (b) For $\mu = [1, -1]$ and $\Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$, sample $10^4$ points and verify the empirical mean and covariance. (c) Compute ∂z/∂μ and ∂z/∂L. Explain why gradients exist and why this is essential for VAE training.
Exercise 8 *** - Naive Bayes vs Logistic Regression
(a) Implement Gaussian Naive Bayes for a 2-class, 2-feature problem. Fit via MLE. (b) Implement logistic regression via gradient descent. (c) Compare decision boundaries on data with independent features. (d) Repeat with strongly correlated features. Analyse which model performs better and why, connecting to the conditional independence assumption.
11. Why This Matters for AI (2026 Perspective)
Chain rule factorisation: Every autoregressive model (GPT, LLaMA, Gemini, Claude) uses ∏t p(xt∣x<t) - the exact chain rule identity.
Conditional distributions: The core output of every neural network: a classifier outputs p(y∣x); LLMs output p(xt∣x<t); diffusion models output p(xt−1∣xt).
MVN conditionals (Schur complement): Gaussian process regression (active in Bayesian optimisation for hyperparameter search), Kalman filtering (robot navigation, sensor fusion).
Reparameterisation trick: VAEs (Kingma & Welling, 2013), continuous normalising flows, diffusion model variational bounds - all require differentiable sampling from Gaussian distributions.
Conditional independence: Naive Bayes (competitive for text classification), mean-field VI (VAE diagonal covariance), causal masking in attention (the AR factorisation independence assumption).
Covariance matrix / PSD: Adam, Shampoo, K-FAC (second-order optimisers) approximate or exploit second-moment structure of gradients. BatchNorm/LayerNorm standardise activations using empirical statistics.
Berkson's paradox (collider bias): Selection bias in training data (models trained on filtered web data), benchmark contamination, shortcut learning (Geirhos et al., 2020).
Copulas: Modelling tail dependence in financial risk; structured dependence between multi-modal outputs; beyond-correlation dependence modelling.
Normalising flows / Jacobians: RealNVP, Glow, Neural ODEs - generative models using exact change-of-variables for tractable log-likelihoods and invertible generation.
GMMs: Speaker diarisation, VQ-VAE codebook structure (discrete latents via k-means on Gaussian clusters), mechanistic interpretability (superposition hypothesis).
Maximum entropy MVN: Justifies Gaussian weight initialisation (Xavier/Kaiming), Gaussian latent priors in VAEs, Laplace approximation to Bayesian neural network posteriors.
12. Conceptual Bridge
This section sits at the heart of probabilistic machine learning. We arrived here from the foundations of single-variable probability (Section 6.1) and named distributions (Section 6.2), and we have now built the full machinery for reasoning about multiple variables simultaneously. The chain rule of probability - a simple product of conditional distributions - is not merely an abstract identity: it is the operating principle of every language model in existence.
The multivariate Gaussian is the cornerstone of continuous multivariate modelling. Its remarkable closure properties (marginals Gaussian, conditionals Gaussian, affine transforms Gaussian) make it analytically tractable in a way no other distribution is. The Schur complement formula for conditional distributions appears in Gaussian process regression, Bayesian linear regression, the Kalman filter, and Gaussian belief propagation - it is one of the most frequently used results in applied probabilistic modelling.
Looking forward, the next section (Section 6.4, Expectation and Variance) develops the theory of moments in depth - deriving E[X], Var(X), and higher moments using LOTUS, the law of total expectation, and the law of total variance. Many quantities referenced here (covariance as E[XY]−E[X]E[Y], the MVN conditional mean) will be derived properly there. Section 6.5 (Concentration Inequalities) develops tail bounds that rely on the joint distribution structure built here. Section 6.6 (Stochastic Processes) proves the Central Limit Theorem via the convolution of joint distributions.
POSITION IN THE PROBABILITY THEORY CHAPTER
========================================================================
Section 6.1 Introduction    Section 6.2 Distributions   Section 6.3 Joint [HERE]
----------------            ------------------          ----------------
Axioms, CDF, PMF,        -> Bernoulli, Gaussian,     ->  Joint PMF/PDF,
PDF, single RVs             Gamma, Beta, ...             Marginals, Conditionals,
                            Exp. family                  Chain rule, MVN,
                                                         Transformations
        v                          v                          v
Section 6.4 Expectation     Section 6.5 Concentration   Section 6.6 CLT/LLN
-----------------           ------------------          ----------------
E[X], Var(X), LOTUS,        Markov, Chebyshev,          LLN, CLT proof
Total expectation/var       Hoeffding, PAC bounds       via joint dist.
                                   v
                          Section 6.7 Markov Chains
                          ------------------
                          Transition matrices,
                          Steady state, MCMC
========================================================================
The ideas in this section - joint distributions, conditional independence, the multivariate Gaussian, the chain rule - are not just mathematical tools. They are the conceptual vocabulary of modern probabilistic AI: the language in which generative models, Bayesian inference, and information-theoretic learning are expressed. Every time a GPT model generates a token, every time a VAE encodes an image, every time a diffusion model denoises - the machinery is joint distributions.
This is $\mathcal{N}(\rho x,\, 1 - \rho^2)$ - confirming the conditional mean $E[Y \mid X = x] = \rho x$ and conditional variance $\mathrm{Var}(Y \mid X = x) = 1 - \rho^2$.
Interpretation of the conditional mean: The slope of the regression line E[Y∣X=x]=ρx equals the correlation coefficient when both variables are standardised. For unstandardised (X,Y) with means (μX,μY) and standard deviations (σX,σY):
$E[Y \mid X = x] = \mu_Y + \rho\, \frac{\sigma_Y}{\sigma_X}\, (x - \mu_X)$
The regression coefficient β=ρσY/σX is the best linear predictor of Y from X.
A.3 Independence Condition for Bivariate Normal
X and Y are independent in the bivariate normal iff ρ=0.
Proof: If $\rho = 0$, then $f(x,y) = \frac{1}{2\pi}\, e^{-(x^2+y^2)/2} = f_X(x)\, f_Y(y)$ - direct factorisation. Conversely, if $f(x,y) = f_X(x)\, f_Y(y)$ for all $x, y$, then $\mathrm{Cov}(X,Y) = 0$, which forces $\rho = 0$. QED
This is a special property of the multivariate Gaussian: for MVN, uncorrelated implies independent. This does NOT hold for general distributions.
Appendix B: The Multivariate Gaussian - Derivation of the Schur Complement Formula
We derive the conditional distribution X1∣X2=x2 from first principles.
Setup:X=(X1⊤,X2⊤)⊤∼N(μ,Σ) with the block partition as in Section 6.4.
Strategy: Construct a new variable Z=X1−Σ12Σ22−1X2 that is independent of X2.
Key insight: The Schur complement $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ is the covariance of the residual $X_1 - \hat{X}_1$, where $\hat{X}_1 = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(X_2 - \mu_2)$ is the best linear predictor (BLUP) of $X_1$ from $X_2$.
Appendix C: Independence Tests in Practice
Given n samples from (X,Y), how do we test independence?
C.1 Pearson's Chi-Squared Test (Discrete)
For discrete variables X and Y, form the contingency table with observed counts Oij (number of samples with X=i,Y=j). Under independence, expected counts are:
$E_{ij} = \frac{(\text{row total}_i)\,(\text{col total}_j)}{n}$
The test statistic:
$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \;\overset{H_0}{\sim}\; \chi^2\big((r-1)(c-1)\big)$
where r and c are the numbers of rows and columns. Reject independence at level α if $\chi^2 > \chi^2_{(r-1)(c-1),\, 1-\alpha}$.
C.2 Pearson Correlation Test (Continuous, Linear)
For continuous (X,Y), test $H_0: \rho = 0$ against $H_1: \rho \neq 0$. The test statistic:
$t = \frac{\hat{\rho}\,\sqrt{n-2}}{\sqrt{1 - \hat{\rho}^2}} \;\overset{H_0}{\sim}\; t_{n-2}$
Limitations: Only detects linear dependence. Fails for Y=X2 (zero correlation, strong dependence).
C.3 Kolmogorov-Smirnov Test for Independence
Compare the empirical joint CDF to the product of empirical marginals. The test statistic:
$D_n = \sup_{x,y} \left| F_n(x,y) - F_X^{(n)}(x)\, F_Y^{(n)}(y) \right|$
Under independence, $D_n \to 0$ in probability at rate $1/\sqrt{n}$.
C.4 Mutual Information Test
Estimate $\hat{I}(X;Y)$ via binned histograms or kernel density estimation. Under independence, $2n\hat{I}(X;Y) \overset{d}{\to} \chi^2_{(r-1)(c-1)}$. More powerful than the plain $\chi^2$ statistic for nonlinear dependence.
For AI model diagnostics: Testing whether two layers' activations are conditionally independent given the input is a key step in mechanistic interpretability. High mutual information between distant layers suggests information bypasses intermediate layers - a sign of skip connections or residual pathways.
Appendix D: Multivariate Change of Variables - Additional Examples
D.1 Bivariate Transformation: Sum and Difference
Let X, Y be independent N(0,1). Define S = X+Y, D = X−Y.
The inverse map is $x = (s+d)/2$, $y = (s-d)/2$ with Jacobian determinant $1/2$, so
$f_{S,D}(s,d) = \frac{1}{2}\, f_{X,Y}\!\left(\frac{s+d}{2}, \frac{s-d}{2}\right) = \frac{1}{4\pi}\, e^{-(s^2 + d^2)/4}$
This factors as $f_S(s)\, f_D(d)$ where $S, D \sim \mathcal{N}(0, 2)$ independently.
Takeaway: The sum and difference of independent standard normals are themselves independent normals - a non-obvious result following directly from the change of variables.
D.2 Exponential to Gamma via Convolution
For $X_1,\dots,X_n \overset{\text{iid}}{\sim} \text{Exp}(\lambda)$, the sum $S_n = X_1 + \cdots + X_n \sim \text{Gamma}(n, \lambda)$.
Proof by MGF: $M_{S_n}(t) = \prod_{i=1}^{n} M_{X_i}(t) = \left(\frac{\lambda}{\lambda - t}\right)^{n}$, which is the MGF of $\text{Gamma}(n, \lambda)$.
Connection to Poisson process: The time until the n-th event in a Poisson(λ) process is exactly Gamma(n,λ) - a direct consequence of inter-arrival times being iid Exp(λ).
Appendix E: Copulas - Detailed Treatment
E.1 Gaussian Copula Construction
The Gaussian copula with correlation matrix R is constructed as follows:
Generate z=(Z1,…,Zd)⊤∼N(0,R)
Apply the probability integral transform: Ui=Φ(Zi) where Φ is the standard normal CDF
The resulting (U1,…,Ud) has the Gaussian copula with correlation R
To generate samples from an arbitrary distribution with Gaussian copula:
4. Invert the marginal CDFs: Xi=Fi−1(Ui)
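A sketch of this four-step recipe using scipy; the Exponential and Student-t marginals are arbitrary choices for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
R = np.array([[1.0, 0.8],
              [0.8, 1.0]])

# Steps 1-3: correlated normals -> probability integral transform -> uniforms.
z = rng.multivariate_normal([0, 0], R, size=50_000)
u = stats.norm.cdf(z)                     # Gaussian-copula samples on [0,1]^2

# Step 4: impose arbitrary marginals, e.g. Exponential(1) and Student-t(3).
x = stats.expon.ppf(u[:, 0])
y = stats.t.ppf(u[:, 1], df=3)
print(stats.kendalltau(x, y)[0])          # ~ (2/pi) arcsin(0.8) ~ 0.59
```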
Properties:
Captures linear (Pearson) correlation at the copula level
Tail dependence: the Gaussian copula has zero tail dependence - extreme events in different variables are asymptotically independent even when there is correlation in the bulk
The Global Financial Crisis (2008): The widespread use of the Gaussian copula for CDO pricing was criticised as a key model failure. The Gaussian copula underestimates joint tail events - the probability that many mortgages default simultaneously during a systemic crisis. The correct model should use a copula with positive tail dependence (e.g., t-copula or Clayton copula).
E.2 Measuring Dependence via Copulas
Kendall's tau: A rank-based correlation measure:
$\tau = P\big[(X_1 - X_2)(Y_1 - Y_2) > 0\big] - P\big[(X_1 - X_2)(Y_1 - Y_2) < 0\big]$, where $(X_1, Y_1)$ and $(X_2, Y_2)$ are independent copies of $(X, Y)$.
Unlike Pearson ρ, Kendall's τ is invariant to monotone transformations of X and Y - it is purely a property of the copula.
Spearman's rho:ρS=ρ(rank(X),rank(Y)) - also invariant to monotone marginal transformations.
For the Gaussian copula with linear correlation ρ: τ=(2/π)arcsin(ρ) and ρS=(6/π)arcsin(ρ/2).
Appendix F: Gaussian Mixture Models - EM Algorithm Details
F.1 Complete Data Log-Likelihood
If we observed both $x_i$ and the latent assignments $z_i$, the complete data log-likelihood is:
$\ell_c(\theta) = \sum_i \sum_k \mathbb{1}[z_i = k]\,\big(\log \pi_k + \log \mathcal{N}(x_i;\, \mu_k, \Sigma_k)\big)$
Each M-step is a weighted MLE - the soft assignments $r_{ik}$ play the role of fractional cluster memberships.
F.4 Convergence
The EM algorithm monotonically increases the marginal log-likelihood ℓ(θ)=∑ilog∑kπkN(xi;μk,Σk) at each iteration. It converges to a local maximum (not necessarily global). In practice:
Random restarts help escape poor local optima
K-means++ initialisation provides a good starting point
The BIC/AIC criteria are used to select the number of components K
Appendix G: The Reparameterisation Trick - Formal Derivation
G.1 The Problem
To train a VAE, we need:
∇ϕEqϕ(z∣x)[f(z)]=∇ϕ∫qϕ(z∣x)f(z)dz
The naive approach - swapping gradient and expectation - fails because the distribution $q_\phi$ depends on $\phi$:
$\nabla_\phi \int q_\phi(z \mid x)\, f(z)\, dz \;\neq\; E_{q_\phi}[\nabla_\phi f(z)]$
G.2 The Reparameterisation
Write z=gϕ(ε,x) where ε∼p(ε) does not depend on ϕ. For the diagonal Gaussian encoder:
z=μϕ(x)+σϕ(x)⊙ε,ε∼N(0,I)
Then:
Eqϕ(z∣x)[f(z)]=Ep(ε)[f(gϕ(ε,x))]
Now the expectation is over p(ε) which does NOT depend on ϕ, so:
∇ϕEqϕ(z∣x)[f(z)]=Ep(ε)[∇ϕf(gϕ(ε,x))]
This gradient can be estimated by Monte Carlo: sample $\varepsilon^{(1)},\dots,\varepsilon^{(L)}$ and compute $\frac{1}{L}\sum_l \nabla_\phi f(g_\phi(\varepsilon^{(l)}, x))$. In practice, L = 1 often suffices.
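A numeric sanity check of the estimator in one dimension, where the answer is known in closed form: for f(z) = z² we have E[f] = μ² + σ², so the exact gradients are 2μ and 2σ:

```python
import numpy as np

rng = np.random.default_rng(11)
mu, sigma = 1.5, 0.7
eps = rng.normal(size=1_000_000)

# Reparameterise z = mu + sigma * eps, then differentiate inside the expectation.
grad_mu = np.mean(2 * (mu + sigma * eps) * 1.0)      # chain rule: dz/dmu = 1
grad_sigma = np.mean(2 * (mu + sigma * eps) * eps)   # chain rule: dz/dsigma = eps
print(grad_mu, 2 * mu)          # ~3.0 vs exact 3.0
print(grad_sigma, 2 * sigma)    # ~1.4 vs exact 1.4
```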
G.3 Why the Affine Transformation Property is Key
The reparameterisation z=μ+Lε works because:
z∼N(μ,LL⊤) by the affine transformation property (Section 6.5)
The Jacobian of the transformation is L - invertible and differentiable
Gradients flow through μ and L as if they were deterministic variables
This trick generalises beyond Gaussians: any distribution expressible as a deterministic transformation of a fixed noise distribution admits reparameterisation. The Gumbel-softmax trick (Jang et al., 2017) extends this to Categorical distributions.
Appendix H: Common Distributions as Joint Distributions
Many standard distributions arise naturally as marginals of joint distributions over multiple variables.
H.1 Chi-Squared as a Sum of Squared Normals
If $Z_1,\dots,Z_k \overset{\text{iid}}{\sim} \mathcal{N}(0,1)$, then $Q = \sum_{i=1}^{k} Z_i^2 \sim \chi^2(k)$.
Proof via MGF: $M_Q(t) = \prod_{i=1}^{k} M_{Z_i^2}(t)$. For $Z^2$ where $Z \sim \mathcal{N}(0,1)$:
$M_{Z^2}(t) = E[e^{tZ^2}] = \frac{1}{\sqrt{1 - 2t}}, \quad t < 1/2$
So $M_Q(t) = (1-2t)^{-k/2}$ - the MGF of $\text{Gamma}(k/2,\, 1/2) = \chi^2(k)$.
For AI: The chi-squared distribution appears in:
Wald tests for neural network weights (are weights significantly different from zero?)
Goodness-of-fit tests for probability distributions learned by generative models
The test statistic of the χ² independence test (Appendix C)
H.2 Student-t as Ratio of Normal to Chi-Squared
If $Z \sim \mathcal{N}(0,1)$ and $V \sim \chi^2(\nu)$ independently, then:
$T = \frac{Z}{\sqrt{V/\nu}} \sim t_\nu$
Intuition: The Student-t arises when we estimate the variance from data (replacing the known σ2 with the sample variance s^2). The heavier tails reflect additional uncertainty from estimating σ2.
For AI: The t-SNE visualisation algorithm uses the Student-t distribution (with ν=1, i.e., Cauchy) in the low-dimensional space to model cluster structure. The heavier tails push apart nearby clusters more aggressively than a Gaussian, creating cleaner cluster separation in 2D.
H.3 F-Distribution as Ratio of Chi-Squareds
If $V_1 \sim \chi^2(d_1)$ and $V_2 \sim \chi^2(d_2)$ independently, then:
$F = \frac{V_1/d_1}{V_2/d_2} \sim F(d_1, d_2)$
Used in ANOVA and comparing variances between groups.
Appendix I: Expectation and Covariance - Key Identities
This appendix collects identities used throughout the section. Full derivations are in Section 6.4.
I.1 Law of Total Expectation
E[X]=EY[E[X∣Y]]
Application:E[X]=EZ[E[X∣Z]]=∑kP(Z=k)μk for a GMM - the overall mean is the mixture-weighted average of component means.
I.2 Law of Total Variance
Var(X)=E[Var(X∣Y)]+Var(E[X∣Y])
Decomposition: Total variance = average within-group variance + between-group variance (from ANOVA). For a GMM: Var(X)=∑kπkΣk+∑kπk(μk−μ)(μk−μ)⊤ where μ=∑kπkμk.
I.3 Bilinearity of Covariance for Vectors
For random vectors X and Y and matrices A,B:
Cov(AX,BY)=ACov(X,Y)B⊤
Special case:Cov(AX)=AΣXA⊤ - the covariance under a linear transformation.
I.4 Independence and Zero Covariance for MVN
For jointly Gaussian random variables: Cov(Xi,Xj)=0⇔Xi⊥⊥Xj.
This is unique to the Gaussian - in general, zero covariance is weaker than independence.
Appendix J: Graphical Models - Overview
A Bayesian network (directed graphical model) is a directed acyclic graph (DAG) where:
Nodes = random variables
Edges = direct dependencies
Each node Xi has a conditional distribution p(Xi∣parents(Xi))
The joint distribution factorises as:
$p(x_1,\dots,x_n) = \prod_{i=1}^{n} p(x_i \mid \mathrm{pa}(x_i))$
This is the chain rule applied with the graphical structure choosing which conditioning variables appear in each factor.
Conditional independence from the graph:X⊥⊥Y∣Z in the joint distribution iff X and Y are d-separated by Z in the DAG - all paths from X to Y are "blocked" by Z.
COMMON GRAPHICAL MODEL PATTERNS
========================================================================
Naive Bayes:              Hidden Markov Model:
        Y                  z_1 -> z_2 -> z_3
      / | \                 |      |      |
     v  v  v                v      v      v
  X_1  X_2  X_3            x_1    x_2    x_3
(features cond.           (observations given hidden state)
 indep. given Y)

LDA:                                   VAE:
alpha -> theta -> z_n -> w_n <- beta   z ~ p(z)  --decoder p(x|z)-->  x
(topic proportions, per-word topic,    (encoder q(z|x) approximates
 observed words, topic-word dists)      the posterior p(z|x))

Kalman Filter:
z_{t-1} -> z_t -> z_{t+1}      (hidden state dynamics)
    |        |        |
    v        v        v
 x_{t-1}    x_t    x_{t+1}     (observations given state)
========================================================================
For AI: Modern large language models can be understood as extremely deep Bayesian networks where each token's probability is conditioned on all previous tokens. The attention pattern (which tokens attend to which) defines an implicit graphical structure that changes with each input.
Appendix K: Numerical Stability in MVN Computations
K.1 Cholesky Decomposition for Sampling
Never invert Σ directly for sampling. Instead:
Compute the Cholesky factor L where Σ = LL⊤ (stable, O(n³))
Sample ε∼N(0,I)
Return x=μ+Lε
Direct inversion Σ−1 amplifies numerical errors; Cholesky is numerically stable.
K.2 Log-Determinant for Density Evaluation
Computing log∣Σ∣ naively via log(det(Sigma)) overflows or underflows for large n. Instead:
Compute Cholesky Σ=LL⊤
log∣Σ∣=2∑ilogLii (sum of log-diagonal elements of L)
This is numerically stable because Lii>0 and we sum logs rather than multiply small numbers.
K.3 Mahalanobis Distance
The Mahalanobis distance d2=(x−μ)⊤Σ−1(x−μ) should be computed as:
Solve Lv=(x−μ) via forward substitution
d2=∥v∥2
This avoids forming Σ−1 explicitly.
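A sketch combining K.1-K.3 into one stable log-density routine (scipy handles the triangular solve; the last lines compare against scipy's reference implementation):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def mvn_logpdf(x, mu, Sigma):
    """Stable MVN log-density: one Cholesky gives log|Sigma| and the Mahalanobis term."""
    L = cholesky(Sigma, lower=True)                  # Sigma = L L^T
    v = solve_triangular(L, x - mu, lower=True)      # forward substitution
    maha = v @ v                                     # (x-mu)^T Sigma^{-1} (x-mu)
    logdet = 2.0 * np.log(np.diag(L)).sum()          # log|Sigma| from diag(L)
    n = len(mu)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + maha)

# Sanity check against scipy's implementation.
from scipy.stats import multivariate_normal
Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])
x, mu = np.array([1.0, -0.5]), np.zeros(2)
print(mvn_logpdf(x, mu, Sigma), multivariate_normal(mu, Sigma).logpdf(x))
```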
K.4 Conditioning on Nearly Degenerate Σ22
When Σ22 is nearly singular (nearly degenerate observations), add a small diagonal regulariser: Σ22 + εI with ε ≈ 10⁻⁶. This is known as jitter in the Gaussian process literature. The resulting conditional covariance is Σ11 − Σ12(Σ22 + εI)⁻¹Σ21, which remains PSD.
Appendix L: The Independence Hierarchy

Mutual independence
      | (implies)
      v
Pairwise independence
      | (implies)
      v
Zero covariance
(for general distributions the implications go one way only;
 for the MVN they are equivalences)

For the multivariate Gaussian, uncorrelated = pairwise independent = mutually independent (if the full vector is jointly Gaussian).
Appendix M: Detailed Exercises with Full Solutions
M.1 Computing the Bivariate Normal Conditional - Full Worked Example
For AI: Model parameter magnitudes after training often follow approximately log-normal distributions. If W∼Log-Normal, then log∣W∣∼N - transforming to log-scale makes analysis easier. Weight pruning thresholds are often set based on the log-normal quantiles.
Appendix N: The Gaussian Process as a Joint Distribution
A Gaussian process (GP) is a joint distribution over function values at infinitely many input points - the continuous generalisation of the MVN.
Definition:f∼GP(m(⋅),k(⋅,⋅)) if for any finite collection of inputs x1,…,xn:
(f(x1),…,f(xn))∼N(m,K)
where mi=m(xi) and Kij=k(xi,xj) is the kernel (covariance function).
GP regression uses the conditional MVN formula (Section 6.4) to make predictions. Given observations $y = f(X) + \varepsilon$ with noise $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, the predictive distribution at new points $X_*$ is:
$f_* \mid y \sim \mathcal{N}\!\left(K_{*n}(K_{nn} + \sigma^2 I)^{-1} y,\;\; K_{**} - K_{*n}(K_{nn} + \sigma^2 I)^{-1} K_{n*}\right)$
This is exactly the Schur complement conditional formula applied to the GP's joint distribution over observed and unobserved function values.
For AI: GPs are used for Bayesian hyperparameter optimisation (BoTorch, GPyTorch) in training large models. The acquisition function (UCB, EI) is computed analytically from the GP's mean and variance - enabled by the closed-form MVN conditional.
Appendix O: Information Theory and Joint Distributions
Why the converse fails: Zero covariance says E[XY]=E[X]E[Y] - a single numerical condition. Independence says p(x,y)=pX(x)pY(y) for ALL (x,y) - infinitely many conditions. A single number cannot encode the information required for independence.
Q.2 Proof that MVN Uncorrelated Implies Independent
Theorem: For jointly Gaussian $(X_1,\dots,X_n) \sim \mathcal{N}(\mu, \Sigma)$: $\mathrm{Cov}(X_i, X_j) = 0$ for all $i \neq j$ implies $(X_1,\dots,X_n)$ are mutually independent.
Proof: If $\Sigma = \mathrm{diag}(\sigma_1^2,\dots,\sigma_n^2)$ (all off-diagonal entries zero), then:
$f(x) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)}$
With diagonal $\Sigma$: $\Sigma^{-1} = \mathrm{diag}(1/\sigma_1^2,\dots,1/\sigma_n^2)$ and $|\Sigma| = \prod_i \sigma_i^2$. So:
$f(x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-(x_i - \mu_i)^2/(2\sigma_i^2)} = \prod_{i=1}^{n} f_{X_i}(x_i)$
and the joint factorises into the product of its marginals. QED
This is Beta(α+1,β) - one success increments α by 1. [ok]
R.2 Continuous Bayes for Signal in Noise
Setup: Observe noisy signal Y=θ+ε where ε∼N(0,σ2) (known) and prior θ∼N(μ0,σ02).
Likelihood:f(y∣θ)=N(y;θ,σ2)
Posterior:
$f(\theta \mid y) \propto f(y \mid \theta)\, f(\theta) \propto \exp\!\left(-\frac{(y - \theta)^2}{2\sigma^2} - \frac{(\theta - \mu_0)^2}{2\sigma_0^2}\right)$
Completing the square in θ, the exponent becomes:
$-\frac{1}{2}\left(\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\left(\theta - \frac{y/\sigma^2 + \mu_0/\sigma_0^2}{1/\sigma^2 + 1/\sigma_0^2}\right)^{2}$
So the posterior is:
$\theta \mid y \sim \mathcal{N}\!\left(\frac{y/\sigma^2 + \mu_0/\sigma_0^2}{1/\sigma^2 + 1/\sigma_0^2},\;\; \frac{1}{1/\sigma^2 + 1/\sigma_0^2}\right)$
Interpretation: The posterior mean is a precision-weighted average of the observation y and the prior mean μ0. More precise information (smaller variance) gets higher weight. When σ02→∞ (flat prior): posterior mean →y (MLE). When σ2→∞ (very noisy observation): posterior mean →μ0 (prior dominates).
This formula is the Kalman filter update equation for the 1D case.
Appendix S: From Joint Distributions to Variational Inference
Variational inference (VI) approximates intractable posteriors by solving an optimisation problem over a tractable family of distributions.
Goal: Approximate p(z∣x)=p(z,x)/p(x) with qϕ(z)∈Q.
Connection to joint distributions: The ELBO decomposes as:
$\mathrm{ELBO}(\phi) = E_{q_\phi}[\log p(x \mid z)] - D_{\mathrm{KL}}\big(q_\phi(z)\, \|\, p(z)\big)$
Reconstruction: how well $q_\phi$ places mass where $p(x \mid z)$ is large
Regularisation: $D_{\mathrm{KL}}(q_\phi \| p(z))$ keeps $q_\phi$ close to the prior $p(z)$
Maximising the ELBO is equivalent to minimising DKL(qϕ(z)∥p(z∣x)) - finding the best joint approximation in Q.
Mean-field VI: The simplest tractable family: qϕ(z)=∏iqϕi(zi). This assumes all latent variables are independent - a strong assumption that leads to underestimated posterior variances (VI is over-confident). VAEs use this assumption in the encoder: the output qϕ(z∣x)=∏dN(zd;μd,σd2) is a diagonal (factored) Gaussian.
For AI (2026): Flow-based posteriors (normalising flows applied to the variational distribution) relax the mean-field assumption while remaining tractable. Diffusion-based VI methods (Geffner et al., 2023) use learned denoising processes as flexible approximate posteriors, achieving near-exact inference for structured joint distributions.
Appendix T: Extended ML Case Studies
T.1 Case Study: Transformer Attention as Conditional Expectation
We show in detail how the self-attention operation is a conditional expectation under a joint distribution over sequence positions.
Setting up the joint distribution: Define a joint distribution over "query position" T and "key position" S with:
$p(S = s \mid T = t) = \alpha_{ts} = \frac{\exp(q_t^\top k_s/\sqrt{d})}{\sum_{s'} \exp(q_t^\top k_{s'}/\sqrt{d})}$
This is a well-defined conditional distribution: $\alpha_{ts} \ge 0$ and $\sum_s \alpha_{ts} = 1$.
The attention output as conditional expectation:
$o_t = \sum_s \alpha_{ts}\, v_s = E_{S \sim p(\cdot \mid T = t)}[v_S]$
This is the conditional mean of the value vector under the attention distribution - exactly the conditional expectation formula E[Y∣X=x] from Section 3.2.
Multi-head attention as mixture:
Each head h has its own attention distribution ph(S∣T). The output is:
$o_t = W_O\, \big[\, o_t^{(1)} \,\|\, \cdots \,\|\, o_t^{(H)} \,\big]$
where $o_t^{(h)} = E_{S \sim p_h(\cdot \mid T = t)}[v_S^{(h)}]$. Different heads can specialise in different conditional distributions - syntax, semantics, coreference, etc.
KV cache as sufficient statistic: The key insight is that once (ks,vs) are computed for position s, they fully determine ph(S=s∣T=t) and vS(h) for any future query position t>s. The KV cache stores exactly these sufficient statistics - the conditional distributions of future positions given past keys and values.
T.2 Case Study: VAE as Latent Variable Model
The VAE is a latent variable model with joint distribution:
pθ(x,z)=pθ(x∣z)⋅p(z)
The chain of joint distributions:
Prior:z∼p(z)=N(0,I) - maximum entropy distribution over latent space
Generative joint:pθ(x,z) - the model's joint distribution
Marginal likelihood:pθ(x)=∫pθ(x∣z)p(z)dz - intractable; ELBO is a lower bound
Reparameterisation implements the affine transformation: The encoder outputs (μϕ,σϕ). Sampling z=μϕ+σϕ⊙ε with ε∼N(0,I) uses the affine transformation property: z∼N(μϕ,diag(σϕ2)).
The second term, computed in closed form (Appendix O.3), enforces that the approximate posterior qϕ(z∣x) stays close to the prior - preventing the encoder from ignoring the prior and using an arbitrary encoding.
T.3 Case Study: Gaussian Process Regression as Conditional MVN
Setting: We observe noisy evaluations y=f(X)+ε at training inputs X. We want to predict f∗=f(X∗) at test points.
Joint prior:
$\begin{pmatrix} f_* \\ y \end{pmatrix} \sim \mathcal{N}\!\left(0,\; \begin{pmatrix} K_{**} & K_{*n} \\ K_{n*} & K_{nn} + \sigma^2 I \end{pmatrix}\right)$
where Kij=k(xi,xj) is the kernel matrix.
Prediction = Conditional MVN: Apply the Schur complement formula (Section 6.4):
f∗∣y∼N(μ∗,Σ∗)
with μ∗=K∗n(Knn+σ2I)−1y and Σ∗=K∗∗−K∗n(Knn+σ2I)−1Kn∗.
The posterior mean μ∗ is a weighted combination of training outputs - the weights K∗n(Knn+σ2I)−1 are the GP analog of the Kalman gain. The posterior variance Σ∗ quantifies predictive uncertainty - it is reduced compared to the prior K∗∗ because we have data.
Connection to Bayesian linear regression: With a linear kernel k(x,x′)=x⊤x′, GP regression reduces to Bayesian linear regression with an isotropic prior on weights. The MVN conditional formula gives the posterior over weights and the predictive distribution simultaneously.
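A compact numpy sketch of this predictive formula with an RBF kernel on toy data; np.linalg.inv is used for brevity where a Cholesky solve would be preferred (Appendix K):

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2))."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

rng = np.random.default_rng(10)
X = np.linspace(-3, 3, 20)                      # training inputs
y = np.sin(X) + 0.1 * rng.normal(size=20)       # noisy observations
Xs = np.linspace(-3, 3, 100)                    # test inputs

sigma2 = 0.01
Knn = rbf(X, X) + sigma2 * np.eye(20)
Ksn = rbf(Xs, X)
Kss = rbf(Xs, Xs)

# Schur complement conditional: f* | y ~ N(mu*, Sigma*).
A = Ksn @ np.linalg.inv(Knn)                    # weights: the GP analog of Kalman gain
mu_star = A @ y
Sigma_star = Kss - A @ Ksn.T
print(mu_star[:3], np.diag(Sigma_star)[:3])     # predictive mean and variance
```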
Appendix U - Worked Examples: The Covariance Matrix Σ = ((4,2),(2,3))
U.1 Conditioning Reduces Variance
Interpretation: The conditional variance of X1 given X2 is 8/3 ≈ 2.67 < 4 - conditioning reduces variance. The reduction 4/3 is the variance explained by X2.
Correlation from covariance: $\rho = \frac{2}{\sqrt{4 \cdot 3}} = \frac{2}{\sqrt{12}} \approx 0.577$.
Conditional distribution for $X_1 \mid X_2 = x_2$ (with $\mu_1 = \mu_2 = 0$):
$X_1 \mid X_2 = x_2 \sim \mathcal{N}\!\left(\frac{2}{3}\, x_2,\; \frac{8}{3}\right)$
U.2 Eigendecomposition of a 2x2 Covariance Matrix
For $\Sigma = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix}$:
Characteristic polynomial: $\lambda^2 - 7\lambda + 8 = 0$
Eigenvalues: $\lambda_{1,2} = \frac{7 \pm \sqrt{49 - 32}}{2} = \frac{7 \pm \sqrt{17}}{2}$
Numerically: $\lambda_1 \approx 5.56$, $\lambda_2 \approx 1.44$.
Eigenvectors: For $\lambda_1$: $(4 - \lambda_1)v_1 + 2v_2 = 0 \Rightarrow v \propto (2,\, \lambda_1 - 4) \approx (2,\, 1.56)$; normalised: $\approx (0.789,\, 0.614)$.
Geometric interpretation: The principal axis of the distribution is at angle $\arctan(0.614/0.789) \approx 37.9°$ from the $x_1$-axis. Variance along the long axis is $\lambda_1 \approx 5.56$; along the short axis, $\lambda_2 \approx 1.44$.
U.3 PSD Check via Cholesky
For $\Sigma = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix}$:
Cholesky: $\Sigma = LL^\top$ where $L = \begin{pmatrix} 2 & 0 \\ 1 & \sqrt{2} \end{pmatrix}$.
Verify: $LL^\top = \begin{pmatrix} 2 & 0 \\ 1 & \sqrt{2} \end{pmatrix} \begin{pmatrix} 2 & 1 \\ 0 & \sqrt{2} \end{pmatrix} = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix}$ [ok]
Since the Cholesky factor exists with strictly positive diagonal entries ($2$ and $\sqrt{2}$), $\Sigma$ is SPD.
Sampling: Generate $\varepsilon \sim \mathcal{N}(0, I)$, then $x = L\varepsilon$ has covariance $L I L^\top = \Sigma$.
Appendix V - Worked Problems: Marginals, Conditionals, and Independence Tests
This appendix provides fully worked solutions to ten representative problems at the level of the exercises. Work through each problem before reading the solution.
Problem. The joint PMF of (X,Y) has entries p(0,0)=0.12, p(0,1)=0.18, p(1,0)=0.28, p(1,1)=0.42. Are X and Y independent?
Solution.
Marginals: P(X=0)=0.30, P(X=1)=0.70; P(Y=0)=0.40, P(Y=1)=0.60.
Check all four cells:
(0,0): 0.30×0.40=0.12 [ok]
(0,1): 0.30×0.60=0.18 [ok]
(1,0): 0.70×0.40=0.28 [ok]
(1,1): 0.70×0.60=0.42 [ok]
Independent. [ok]
V.3 Covariance Computation
Problem. Suppose (X,Y) is uniform on the unit disk $\{(x,y) : x^2 + y^2 \le 1\}$. The joint PDF is $f(x,y) = \frac{1}{\pi}$ on the disk. Compute Cov(X,Y).
Solution.
By symmetry of the unit disk under $(x,y) \mapsto (-x,y)$ and $(x,y) \mapsto (x,-y)$:
$E[X] = 0$ and $E[Y] = 0$ (odd integrand over a symmetric domain).
$E[XY] = \frac{1}{\pi} \iint_{x^2+y^2 \le 1} xy\, dx\, dy = 0$ (same symmetry argument: $xy$ is antisymmetric under $x \mapsto -x$).
Therefore $\mathrm{Cov}(X,Y) = E[XY] - E[X]E[Y] = 0 - 0 \cdot 0 = 0$.
However, X and Y are not independent: if $X = 0.9$ then $Y$ must lie in $[-\sqrt{1 - 0.81},\, \sqrt{1 - 0.81}] = [-0.436,\, 0.436]$, a much tighter range than without conditioning. This is a canonical example of zero covariance without independence.
V.4 Bivariate Normal: Conditional Mean and Variance
Problem. $(X_1, X_2) \sim \mathcal{N}(0, \Sigma)$ with $\Sigma = \begin{pmatrix} 4 & 3 \\ 3 & 9 \end{pmatrix}$. Find the distribution of $X_1 \mid X_2 = x_2$.
Solution.
Using the Schur complement formula with $\mu_1 = \mu_2 = 0$:
$X_1 \mid X_2 = x_2 \sim \mathcal{N}\!\left(\Sigma_{12}\Sigma_{22}^{-1}\, x_2,\;\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right) = \mathcal{N}\!\left(\frac{3}{9}\, x_2,\;\; 4 - \frac{9}{9}\right) = \mathcal{N}\!\left(\frac{x_2}{3},\; 3\right)$
The correlation is $\rho = \frac{3}{\sqrt{4 \cdot 9}} = \frac{3}{6} = \frac{1}{2}$, and the conditional mean $\mu_{1 \mid 2} = \rho\, \frac{\sigma_1}{\sigma_2}\, x_2 = \frac{1}{2} \cdot \frac{2}{3} \cdot x_2 = \frac{x_2}{3}$ confirms the result.
V.5 Reparameterisation Trick Gradient
Problem. The ELBO of a VAE contains the term $E_{z \sim q_\phi(z \mid x)}[f(z)]$ where $q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x),\, \sigma_\phi^2(x)\, I)$. Show that
$\nabla_\phi E_{q_\phi}[f(z)] = E_{\varepsilon \sim \mathcal{N}(0,I)}\big[\nabla_\phi f(\mu_\phi + \sigma_\phi \odot \varepsilon)\big]$.
Solution.
The key insight is to push the gradient inside the expectation. With the direct
parameterisation, qϕ depends on ϕ, making ∇ϕEqϕ[f(z)]
require differentiating through the distribution - the REINFORCE estimator does this
but has high variance.
The reparameterisation trick decouples the stochasticity from the parameters. Write
z=g(ϕ,ε)=μϕ+σϕ⊙ε,ε∼N(0,I).
Then
Ez∼qϕ[f(z)]=Eε∼N(0,I)[f(g(ϕ,ε))].
Now ε is drawn from a fixed distribution independent of ϕ. Assuming sufficient regularity to swap gradient and expectation:
$\nabla_\phi\, E_{\varepsilon}\big[f(g(\phi, \varepsilon))\big] = E_{\varepsilon}\big[\nabla_\phi f(g(\phi, \varepsilon))\big]$
The gradient flows through the deterministic function g via the chain rule:
$\nabla_\phi f(g(\phi, \varepsilon)) = \nabla_z f(z) \cdot \frac{\partial g}{\partial \phi}, \qquad \frac{\partial g}{\partial \mu_\phi} = I, \quad \frac{\partial g}{\partial \sigma_\phi} = \mathrm{diag}(\varepsilon)$
This low-variance estimator is exactly what enables training VAEs with standard backpropagation.
V.6 Change of Variables: Log-Normal
Problem. If X∼N(μ,σ2), find the PDF of Y=eX.
Solution.
The transformation $g(x) = e^x$ is strictly increasing, so its inverse is $g^{-1}(y) = \ln y$ with derivative $\frac{d g^{-1}}{dy} = \frac{1}{y}$. By the univariate change-of-variables formula:
$f_Y(y) = f_X(\ln y) \cdot \frac{1}{y} = \frac{1}{y\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(\ln y - \mu)^2}{2\sigma^2}\right), \quad y > 0$
This is the log-normal distribution. Its mean and variance are
$E[Y] = e^{\mu + \sigma^2/2}, \qquad \mathrm{Var}(Y) = \big(e^{\sigma^2} - 1\big)\, e^{2\mu + \sigma^2}$.
AI relevance. Log-normal distributions model quantities that are products of many
small independent factors - learning rate schedules, weight magnitudes after multiplicative
updates, and token probability products in sequence models. When σ² ≫ 1 the
distribution has a heavy right tail; this explains why gradient clipping is necessary:
the product of many near-1 weights can occasionally explode.
V.7 Order Statistics of Uniform
Problem. Let U(1)≤U(2)≤⋯≤U(n) be the order statistics of
n iid Uniform(0,1) random variables. Find fU(k)(u).
Solution.
For the k-th order statistic, we need exactly k−1 values below u, one value at u, and n−k values above u. This is a multinomial counting argument:
$f_{U_{(k)}}(u) = \frac{n!}{(k-1)!\,(n-k)!}\; u^{k-1}\, (1-u)^{n-k}$
Recognising the kernel: $U_{(k)} \sim \text{Beta}(k,\, n-k+1)$. This is a remarkable result connecting order statistics to the Beta distribution.
Moments:
$E[U_{(k)}] = \frac{k}{n+1}, \qquad \mathrm{Var}(U_{(k)}) = \frac{k(n-k+1)}{(n+1)^2(n+2)}$
AI relevance. Order statistics underlie Top-k sampling in autoregressive generation.
When sampling the next token by selecting from the k highest-probability tokens,
the threshold is the k-th order statistic of the probability vector (in decreasing order).
The Beta distribution for U(k) gives the distribution of this adaptive threshold.
V.8 Conditional Independence: Chain vs Fork vs Collider
Problem. In each graphical model below, state whether X⊥Y∣Z and whether
X⊥Y (marginal independence).
(a) Chain: X→Z→Y
(b) Fork: X←Z→Y
(c) Collider: X→Z←Y
Solution.
(a) Chain. The joint is p(x,z,y)=p(x)p(z∣x)p(y∣z).
Marginal: p(x,y)=∑zp(x)p(z∣x)p(y∣z), generally not p(x)p(y). Not marginally independent.
Conditional: p(x,y∣z)=p(x,z,y)/p(z)=p(x∣z)p(y∣z) after noting p(x∣z)=p(x)p(z∣x)/p(z). X⊥Y∣Z. [ok]
(b) Fork. The joint is $p(z)\, p(x \mid z)\, p(y \mid z)$.
Marginal: $p(x,y) = \sum_z p(z)\, p(x \mid z)\, p(y \mid z)$, generally not $p(x)p(y)$ because both share the cause Z. Not marginally independent.
Conditional: $p(x,y \mid z) = p(x \mid z)\, p(y \mid z)$ by construction. X⊥Y∣Z. [ok]
(c) Collider. The joint is $p(x)\, p(y)\, p(z \mid x,y)$.
Marginal: $p(x,y) = p(x)\, p(y) \sum_z p(z \mid x,y) = p(x)\, p(y)$. Marginally independent. [ok]
Conditional: knowing Z creates dependence between X and Y (explaining away / Berkson's paradox). X and Y are not conditionally independent given Z in general.
Summary table:

Structure    X⊥Y (marginal)?    X⊥Y∣Z?
Chain        No                 Yes
Fork         No                 Yes
Collider     Yes                No
The collider reversal is the most counterintuitive result in probabilistic graphical
models and has real-world consequences: conditioning on a collider (e.g., selecting on a
trait influenced by two independent factors) induces spurious correlations between
otherwise independent causes. In neural architecture search, selecting architectures
that achieve high validation accuracy (collider: architecture traits + dataset difficulty)
can create apparent anti-correlations between design choices that are truly independent.
Appendix W - Notation Reference and Quick Formulas