Markov Chains: Part 1

1. Intuition

1.1 What Is a Markov Chain?

A Markov chain is a sequence of random variables $X_0, X_1, X_2, \ldots$ taking values in a state space $\mathcal{S}$, where the future is independent of the past given the present. Formally, for every $n \geq 0$ and every state $j \in \mathcal{S}$:

$$P(X_{n+1} = j \mid X_0, X_1, \ldots, X_n) = P(X_{n+1} = j \mid X_n)$$

This is the Markov property: the current state $X_n$ is a sufficient statistic for predicting $X_{n+1}$. No matter how long the history, knowing only the present gives as much information about the future as knowing the entire past trajectory.

Why memorylessness is powerful. At first glance, the Markov property looks like a restriction - real systems do have memory. But three things make it useful:

  1. Many systems genuinely satisfy it. A fair die, re-rolled at each step, is Markov (no memory). A chess position captures all relevant game state (Markov). An LLM's key-value cache provides sufficient statistics for the next token (Markov, given the KV cache).

  2. Higher-order Markov chains reduce to first-order. A chain where $X_{n+1}$ depends on $X_n$ and $X_{n-1}$ can be converted to a first-order chain on the augmented state $(X_n, X_{n-1})$. So first-order is universal given a rich enough state space.

  3. The mathematics is tractable. Transition probabilities only involve pairs of states, so the full dynamics are captured by a matrix rather than an exponentially large conditional probability table.

The mental picture: Imagine a particle hopping between states on a graph. At each step, it looks only at which node it currently occupies, then jumps to a neighbouring node with probability given by the edge weights. The history of how it got to the current node is irrelevant - only where it is now matters.

For AI: Every autoregressive language model generates tokens one at a time, each conditioned on all previous tokens. Given a context window of length $n$, the distribution over the next token is fully specified by the current state (the context). This is precisely the Markov property with state space = all possible contexts of length $\leq n$.

1.2 Everywhere in AI

Markov chains appear as the backbone of several major AI paradigms:

Language generation: GPT, LLaMA, and every autoregressive transformer generates text via a Markov chain. The state is the current context (the key-value cache), the transition is the softmax over the vocabulary, and the stationary distribution - if it exists - characterises what the model "talks about" at equilibrium.

PageRank: Google's original ranking algorithm computes the stationary distribution of a Markov chain on the web graph, where each page links to others. A page is important if the random surfer spends a lot of time there at stationarity. Power iteration on the transition matrix gives the PageRank vector.

Reinforcement learning: A Markov Decision Process (MDP) is a Markov chain with actions. The agent's policy induces a Markov chain over states, and the value function is the expected discounted return from each state - computable from the stationary properties of this chain. TD-learning, Q-learning, and policy gradient methods all exploit the Markov structure.

Bayesian inference via MCMC: When the posterior distribution $\pi(\theta \mid \mathcal{D})$ is intractable, MCMC constructs a Markov chain whose stationary distribution is exactly $\pi$. Running the chain for long enough generates samples that represent the posterior. Metropolis-Hastings and Gibbs sampling are the two canonical MCMC algorithms.

Speech and sequence modelling: Hidden Markov Models (HMMs) model sequences where the true state (e.g., phoneme) is latent. The forward-backward algorithm and Viterbi decoder exploit the Markov structure to perform efficient inference.

Diffusion models: The DDPM forward process $x_0 \to x_1 \to \cdots \to x_T$ is a Markov chain where each step adds Gaussian noise. The learned reverse process is a Markov chain running backwards in time. The SDE perspective (Section 06 Stochastic Processes) gives the continuous-time version.

1.3 Historical Timeline

| Year | Contributor | Result |
| --- | --- | --- |
| 1906 | Andrei Markov | Defined Markov chains; proved the law of large numbers for dependent sequences |
| 1907 | Paul & Tatiana Ehrenfest | Urn model illustrating approach to equilibrium; connection to statistical mechanics |
| 1907/1912 | Oskar Perron, Georg Frobenius | Perron-Frobenius theorem (positive matrices have a unique dominant eigenvector) |
| 1913 | Andrei Markov | First application: statistical analysis of letters in Pushkin's Eugene Onegin |
| 1931 | Andrei Kolmogorov | Forward and backward equations for Markov processes; rigorous foundation |
| 1949 | Nicholas Metropolis & Stanislaw Ulam | The Monte Carlo method |
| 1953 | Metropolis, Rosenbluth, Rosenbluth, Teller, Teller | Metropolis algorithm for statistical mechanics; first MCMC |
| 1970 | W.K. Hastings | Generalised Metropolis to asymmetric proposals |
| 1984 | Geman & Geman | Gibbs sampling for image restoration |
| 1998 | Page, Brin, Motwani, Winograd | PageRank: web ranking via Markov chain stationary distribution |
| 2009 | Levin, Peres & Wilmer | Systematic treatment of mixing times; coupling and spectral gap bounds |
| 2020 | Ho, Jain & Abbeel | DDPM: diffusion as a Markov chain with a learned reverse process |
| 2022 | Various | LLMs as Markov chains; stationary distribution theory for language models |

1.4 Taxonomy of Markov Chains

Markov chains are classified along two dimensions:

State space: Finite ($\mathcal{S} = \{1,\ldots,N\}$), countably infinite ($\mathcal{S} = \mathbb{Z}^+$), or continuous ($\mathcal{S} = \mathbb{R}^d$). Most of this section treats finite or countably infinite state spaces; MCMC typically targets continuous spaces.

Time: Discrete ($n = 0, 1, 2, \ldots$) or continuous ($t \geq 0$). Discrete-time chains are developed in Sections 2-6; continuous-time chains in Section 7.

| | Discrete Time | Continuous Time |
| --- | --- | --- |
| Finite states | Finite DTMC: PageRank, HMM | CTMC: queueing theory |
| Countable states | SRW, birth-death chain | Poisson process |
| Continuous states | MCMC on $\mathbb{R}^d$ | Diffusion SDE (Section 06) |

Time-homogeneous vs. inhomogeneous: If the transition probabilities do not depend on $n$, i.e. $P(X_{n+1}=j \mid X_n=i) = P_{ij}$ for all $n$, the chain is time-homogeneous. Almost all chains in this section are time-homogeneous.


2. Formal Definitions

2.1 The Markov Property

Recall: Conditional independence $X \perp Y \mid Z$ was defined in Section 03 Joint Distributions. The Markov property is conditional independence of the future from the past, given the present.

Definition (Discrete-Time Markov Chain). A sequence $\{X_n\}_{n \geq 0}$ of random variables on a measurable state space $(\mathcal{S}, \mathcal{B})$ is a (time-homogeneous) Markov chain if for all $n \geq 0$ and all measurable $B \subseteq \mathcal{S}$:

$$P(X_{n+1} \in B \mid X_0, X_1, \ldots, X_n) = P(X_{n+1} \in B \mid X_n) =: K(X_n, B)$$

The function $K : \mathcal{S} \times \mathcal{B} \to [0,1]$ is called the transition kernel or transition probability. For finite $\mathcal{S} = \{1,\ldots,N\}$, it is encoded by the $N \times N$ transition matrix $P$ with $P_{ij} = P(X_{n+1}=j \mid X_n=i)$.

Equivalent formulation via conditional independence: The Markov property is equivalent to:

$$X_{n+1} \perp (X_0, \ldots, X_{n-1}) \mid X_n$$

This is the cleanest statement: given the present $X_n$, the future $X_{n+1}$ is independent of the entire past $(X_0,\ldots,X_{n-1})$. More generally, the strong Markov property asserts this holds at stopping times: $X_{\tau+1} \perp \mathcal{F}_\tau \mid X_\tau$ for any stopping time $\tau$.

Initial distribution: The full law of the chain is specified by:

  1. The initial distribution $\mu_0$ with $\mu_0(i) = P(X_0 = i)$
  2. The transition matrix $P$ with $P_{ij} \geq 0$ and $\sum_j P_{ij} = 1$ for all $i$

Together, $(\mu_0, P)$ determines all finite-dimensional distributions.

2.2 Transition Matrix

For a finite state space $\mathcal{S} = \{1,\ldots,N\}$, the transition matrix $P$ is an $N \times N$ stochastic matrix: every row sums to 1 and all entries are non-negative. The $(i,j)$ entry is the probability of jumping from state $i$ to state $j$ in one step.

Distribution at step $n$: If $\mu^{(n)}$ is the row vector of probabilities at step $n$, then:

$$\mu^{(n)} = \mu^{(0)} P^n$$

The distribution evolves as a row vector multiplied by $P$ on the right at each step. In component form: $\mu^{(n+1)}_j = \sum_i \mu^{(n)}_i P_{ij}$.

$n$-step transition probabilities: The probability of going from $i$ to $j$ in exactly $n$ steps is the $(i,j)$ entry of $P^n$:

$$P^{(n)}_{ij} = P(X_n = j \mid X_0 = i) = (P^n)_{ij}$$

Computing $P^n$ via matrix exponentiation (eigendecomposition or repeated squaring) is the core computational task for Markov chains.
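A minimal NumPy sketch of both operations, on an illustrative 2-state matrix (the numbers are arbitrary, not taken from the examples below):

```python
import numpy as np

# Sketch: evolving a distribution under an illustrative 2-state matrix.
P = np.array([[0.9, 0.1],    # row i: distribution of X_{n+1} given X_n = i
              [0.5, 0.5]])
mu0 = np.array([1.0, 0.0])   # start in state 0 with certainty

# mu^(n) = mu^(0) P^n, via matrix power (repeated squaring internally)
for n in [1, 2, 10, 50]:
    print(n, mu0 @ np.linalg.matrix_power(P, n))

# (P^n)_{ij} is the n-step transition probability from i to j;
# for large n every row approaches the stationary distribution pi
print(np.linalg.matrix_power(P, 50))
```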

Non-example: A matrix with any row not summing to 1, or with any negative entry, is NOT a valid transition matrix. A doubly-stochastic matrix (non-negative, with both rows and columns summing to 1) is a transition matrix, and its stationary distribution is uniform.

For AI: In a language model with vocabulary size $V$, the bigram transition matrix $P$ has size $V \times V$, where $P_{ij}$ is the probability of token $j$ following token $i$. Real LLMs use much richer state spaces (the context window), but the Markov structure is the same.

2.3 Chapman-Kolmogorov Equations

Theorem (Chapman-Kolmogorov). For any $m, n \geq 0$ and states $i, j$:

$$P^{(m+n)}_{ij} = \sum_{k \in \mathcal{S}} P^{(m)}_{ik} P^{(n)}_{kj}$$

In matrix form: $P^{m+n} = P^m P^n$. This is simply the rule for matrix multiplication: to go from $i$ to $j$ in $m+n$ steps, sum over all intermediate states $k$ at the midpoint.

Proof: By the law of total probability and the Markov property:

$$P(X_{m+n}=j \mid X_0=i) = \sum_k P(X_{m+n}=j \mid X_m=k)\, P(X_m=k \mid X_0=i) = \sum_k P^{(n)}_{kj} P^{(m)}_{ik}$$

Chapman-Kolmogorov is the reason matrix multiplication is the right operation for composing Markov chain transitions. It also says: to predict $m+n$ steps ahead, you can decompose the prediction at any intermediate time $m$.

2.4 Standard Examples

Example 1: Two-state chain (weather model)

$$P = \begin{pmatrix} 1-p & p \\ q & 1-q \end{pmatrix}$$

State 1 = sunny, state 2 = rainy. Sunny stays sunny with probability $1-p$ and turns rainy with probability $p$; rainy turns sunny with probability $q$. The stationary distribution is $\pi = (q/(p+q),\ p/(p+q))$.

Example 2: Simple random walk on $\{0,1,\ldots,N\}$ with reflecting barriers

$$P_{ij} = \begin{cases} 1/2 & |i-j|=1,\ 0 < i < N \\ 1 & i=0,\, j=1 \ \text{ or } \ i=N,\, j=N-1 \end{cases}$$

The particle bounces off the walls at 0 and $N$. By detailed balance, the stationary distribution is proportional to the vertex degree: $\pi_0 = \pi_N = 1/(2N)$ and $\pi_i = 1/N$ for $0 < i < N$ (uniform on the interior, half weight at the reflecting endpoints).

Example 3: PageRank chain

$$P_{ij} = \alpha \cdot \frac{A_{ij}}{\text{out-degree}(i)} + (1-\alpha) \cdot \frac{1}{N}$$

where $A$ is the web adjacency matrix and $\alpha \in (0,1)$ is the damping factor (usually $\alpha=0.85$). The teleportation term $(1-\alpha)/N$ ensures the chain is irreducible and aperiodic. The stationary distribution is the PageRank vector.

Example 4: Gibbs chain on $\{0,1\}^2$

For the Ising model on two spins, $P$ is a $4 \times 4$ matrix constructed by randomly updating one spin at a time conditional on the other. This is the simplest MCMC example.


3. Classification of States

3.1 Communicating Classes and Irreducibility

Definition. State $j$ is accessible from state $i$ (written $i \to j$) if there exists $n \geq 0$ such that $P^{(n)}_{ij} > 0$. States $i$ and $j$ communicate (written $i \leftrightarrow j$) if $i \to j$ and $j \to i$.

Communication is an equivalence relation, and its equivalence classes are called communicating classes. A communicating class $C$ is closed if no state outside $C$ is accessible from any state inside $C$: $i \in C$ and $i \to j$ implies $j \in C$.

Definition (Irreducible Chain). A Markov chain is irreducible if it has exactly one communicating class: every state communicates with every other. Equivalently, for every pair $(i,j)$, there exists $n$ with $P^{(n)}_{ij} > 0$.

Irreducibility is the "connectivity" condition for Markov chains. An irreducible chain cannot get trapped in a subset of states. Most of the theory (stationary distributions, convergence) requires irreducibility.

Decomposition theorem: Any finite Markov chain decomposes into transient states (a particle eventually leaves forever) and one or more closed communicating classes of recurrent states. Once the chain enters a closed class, it stays there forever.

For AI: A language model's transition matrix is nearly irreducible: any token sequence of sufficient length can in principle be generated. Practical degeneracy (zero-probability transitions as the softmax temperature $\to 0$) creates absorbing states.

3.2 Recurrence and Transience

Definition. Define the first return time to state $i$: $T_i = \min\{n \geq 1 : X_n = i\}$, with $T_i = \infty$ if the chain never returns. The return probability is:

$$f_i = P(T_i < \infty \mid X_0 = i)$$

  • State $i$ is recurrent if $f_i = 1$ (return is certain)
  • State $i$ is transient if $f_i < 1$ (positive probability of never returning)

Theorem (Recurrence criterion). State $i$ is recurrent if and only if $\sum_{n=0}^\infty P^{(n)}_{ii} = \infty$, and transient if and only if $\sum_{n=0}^\infty P^{(n)}_{ii} < \infty$.

Proof sketch: Let $R_i = \sum_{n=1}^\infty \mathbf{1}[X_n=i]$ be the number of returns to $i$. Then $\mathbb{E}[R_i \mid X_0=i] = \sum_{n=1}^\infty P^{(n)}_{ii}$. If $f_i=1$, the chain returns infinitely often almost surely, so the sum diverges. If $f_i < 1$, the number of returns is geometric with success probability $1-f_i$, hence has finite mean $f_i/(1-f_i)$, so the sum converges.

Positive vs. null recurrence: Among recurrent states, define the mean return time $\mu_i = \mathbb{E}[T_i \mid X_0=i]$.

  • Positive recurrent: $f_i = 1$ and $\mu_i < \infty$ (returns in finite average time)
  • Null recurrent: $f_i = 1$ but $\mu_i = \infty$ (certain to return, but takes forever on average)

All states of a finite irreducible chain are positive recurrent (finite chains cannot be null-recurrent). The simple random walk on $\mathbb{Z}^d$ is null-recurrent in 1D and 2D, and transient in 3D and above (Pólya's theorem).

For AI: In MCMC, we want the chain to be recurrent (it returns to every region of the posterior). A transient MCMC chain would drift away from the posterior, giving biased samples.

3.3 Periodicity

Definition. The period of state $i$ is:

$$d(i) = \gcd\{n \geq 1 : P^{(n)}_{ii} > 0\}$$

State $i$ is aperiodic if $d(i) = 1$; otherwise it is periodic with period $d(i)$.

Key facts:

  1. All states in the same communicating class have the same period
  2. A chain is called aperiodic if all states have period 1
  3. If $P_{ii} > 0$ for some $i$, then $d(i) = 1$ (self-loops force aperiodicity)
  4. An irreducible aperiodic chain converges to its stationary distribution (Section 4.3)

Example: A two-state chain $0 \leftrightarrow 1$ with $P_{01}=P_{10}=1$ is periodic with period 2: the chain alternates $0,1,0,1,\ldots$ and $P^{(n)}_{00} = 1$ only for even $n$.

Periodicity breaks convergence: If a chain has period $d > 1$, then $P^{(n)}_{ij}$ does not converge as $n \to \infty$; it oscillates. Adding any self-loop probability $\varepsilon > 0$ makes the chain aperiodic.

For AI: Metropolis-Hastings with a symmetric proposal that includes the possibility of staying put is automatically aperiodic (self-loops happen when a proposal is rejected). This is why MCMC chains converge.

3.4 Absorbing States and Absorption Probabilities

Definition. State $i$ is absorbing if $P_{ii} = 1$ (the chain stays at $i$ forever once it arrives). A chain in which every state can reach an absorbing state is called absorbing.

Canonical form: For an absorbing chain with $r$ absorbing states and $t$ transient states, the transition matrix can be written in canonical form:

$$P = \begin{pmatrix} Q & R \\ \mathbf{0} & I_r \end{pmatrix}$$

where $Q$ is the $t \times t$ sub-matrix of transitions among transient states, $R$ is the $t \times r$ matrix of transitions into absorbing states, and $I_r$ is the identity on the absorbing states.

Fundamental matrix: $N = (I - Q)^{-1}$. The $(i,j)$ entry of $N$ is the expected number of times the chain visits transient state $j$ before absorption, starting from transient state $i$.

Absorption probabilities: $B = NR$ gives the probability of being absorbed into each absorbing state. The gambler's ruin formula $P(\text{reach } N \mid \text{start at } a) = a/N$ for the fair game (derived via OST in Section 06) follows directly from this formula.
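A short sketch of this computation for the fair gambler's ruin on $\{0,\ldots,5\}$ (the chain size is illustrative): build $Q$ and $R$, invert $I-Q$, and read off the absorption probabilities.

```python
import numpy as np

# Transient states are fortunes 1..4; absorbing states are 0 (ruin) and 5.
t = 4
Q = np.zeros((t, t))
for i in range(t):
    if i > 0:
        Q[i, i-1] = 0.5          # lose one unit
    if i < t-1:
        Q[i, i+1] = 0.5          # win one unit
R = np.zeros((t, 2))             # columns: absorbed at 0, absorbed at 5
R[0, 0] = 0.5                    # from fortune 1, ruin in one step
R[t-1, 1] = 0.5                  # from fortune 4, reach 5 in one step

N = np.linalg.inv(np.eye(t) - Q)   # fundamental matrix
B = N @ R                          # absorption probabilities
print(B[:, 1])   # P(reach 5 | start at a) -> [0.2, 0.4, 0.6, 0.8] = a/5
```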

3.5 Ergodic Chains

Definition (Ergodic Chain). A Markov chain is ergodic if it is:

  1. Irreducible - every state communicates with every other
  2. Positive recurrent - mean return time is finite for all states
  3. Aperiodic - all states have period 1

Ergodic chains are the "nice" chains: they have a unique stationary distribution, and the chain converges to it from any starting state.

For finite chains: An irreducible chain on a finite state space is automatically positive recurrent. So for finite chains, ergodic = irreducible + aperiodic.

Ergodic theorem: For an ergodic chain, the time-average converges to the space-average:

$$\frac{1}{n}\sum_{k=0}^{n-1} f(X_k) \ \xrightarrow{\ n\to\infty\ }\ \sum_i \pi_i f(i) = \mathbb{E}_\pi[f] \quad \text{a.s.}$$

This is the Markov chain version of the law of large numbers. It is why MCMC works: time-averaging over the chain's trajectory approximates the posterior expectation.


4. Stationary Distributions

4.1 Definition and Existence

Definition (Stationary Distribution). A probability distribution $\pi$ on $\mathcal{S}$ is a stationary distribution (also called invariant distribution or steady-state distribution) for a Markov chain with transition matrix $P$ if:

$$\pi P = \pi \qquad \text{i.e.} \qquad \pi_j = \sum_i \pi_i P_{ij} \ \text{ for all } j$$

Equivalently, $\pi$ is a left eigenvector of $P$ with eigenvalue 1. If the chain starts in distribution $\pi$, it stays in distribution $\pi$ for all time: $\mu^{(0)} = \pi \Rightarrow \mu^{(n)} = \pi$ for all $n$.

Interpretation: $\pi_i$ is the long-run fraction of time the chain spends in state $i$. For PageRank, $\pi_i$ is the importance score of page $i$. For MCMC, $\pi$ is the target posterior distribution.

Existence: Every finite irreducible Markov chain has a unique stationary distribution $\pi$, with $\pi_i > 0$ for all $i$. For infinite chains, existence requires positive recurrence.

Non-uniqueness when reducible: A chain with multiple closed communicating classes has infinitely many stationary distributions - any convex combination of the stationary distributions within each closed class is stationary.

The mean return time formula: For a positive recurrent chain, $\pi_i = 1/\mu_i$, where $\mu_i = \mathbb{E}[T_i \mid X_0=i]$ is the mean return time. States visited more often (smaller $\mu_i$) get higher stationary probability.

4.2 Perron-Frobenius Theorem

The Perron-Frobenius theorem is the fundamental result guaranteeing existence and uniqueness of the stationary distribution.

Theorem (Perron-Frobenius for Stochastic Matrices). Let $P$ be an irreducible stochastic matrix. Then:

  1. $\lambda = 1$ is an eigenvalue of $P$ (always true for stochastic matrices, since $P\mathbf{1}=\mathbf{1}$)
  2. $|\lambda| \leq 1$ for all other eigenvalues $\lambda$
  3. There exists a unique probability vector $\pi > 0$ (componentwise) with $\pi P = \pi$
  4. $\pi$ is the unique (up to scaling) left eigenvector with eigenvalue 1

If additionally $P$ is primitive (irreducible and aperiodic), then $\lambda = 1$ is the unique eigenvalue on the unit circle, and $|\lambda_2| < 1$ strictly. This gap drives convergence.

Proof sketch (uniqueness): Suppose $\pi$ and $\nu$ are both stationary, and let $c = \min_i \pi_i/\nu_i$. Then $\mu = \pi - c\nu$ is non-negative, satisfies $\mu P = \mu$, and has $\mu_i = 0$ for some $i$. For an irreducible chain, a non-negative invariant vector with one zero entry must vanish everywhere (since $0 = \mu_i = \sum_j \mu_j P^{(n)}_{ji}$ with $P^{(n)}_{ji} > 0$ for suitable $n$). Hence $\pi = c\nu$, and normalisation forces $c = 1$.

For AI: PageRank power iteration converges because the damped PageRank matrix is primitive (the teleportation term makes it both irreducible and aperiodic). The dominant eigenvalue is 1, and all other eigenvalues are at most $\alpha < 1$ in magnitude, guaranteeing geometric convergence.

4.3 Convergence to Stationarity

Theorem (Convergence for Ergodic Chains). Let $P$ be irreducible and aperiodic with stationary distribution $\pi$. Then for any initial distribution $\mu^{(0)}$:

$$\|\mu^{(0)} P^n - \pi\|_{\text{TV}} \ \xrightarrow{\ n\to\infty\ }\ 0$$

Moreover, for any starting state $i$: $P^{(n)}_{ij} \to \pi_j$ as $n \to \infty$, for all $j$.

Rate of convergence: For a reversible chain with eigenvalues $1 = \lambda_1 > \lambda_2 \geq \cdots \geq \lambda_N \geq -1$, the convergence rate is governed by $\lambda_* = \max(|\lambda_2|, |\lambda_N|)$:

$$\|P^n(i,\cdot) - \pi\|_{\text{TV}} \leq \frac{1}{2\sqrt{\pi_i}}\, \lambda_*^n$$

The rate $\lambda_*$ is called the second largest eigenvalue modulus (SLEM). The spectral gap is $1 - \lambda_*$; a large gap means fast convergence.

Periodic chains: If $P$ is irreducible but periodic with period $d$, then $P^{(n)}_{ij}$ does not converge; it cycles. However, $P^d$ (the $d$-step chain) is aperiodic and does converge. In practice, a small perturbation (adding laziness) kills periodicity.

The lazy chain: Replace $P$ with $(I + P)/2$. This halves the spectral gap but makes the chain aperiodic with all eigenvalues non-negative, ensuring monotone convergence.

4.4 Null-Recurrent Chains

The simple random walk on $\mathbb{Z}$ returns to the origin with probability 1 (recurrent) but takes infinitely long on average (null-recurrent). Such chains have no stationary probability distribution: the "stationary measure" $\pi_i$ is not normalisable.

The 1D random walk has invariant measure $\pi_i = 1$ for all $i$, which has infinite total mass $\sum_i \pi_i = \infty$. The chain visits every state infinitely often, but spends a vanishing fraction of time at any fixed state.

Implications for AI: Infinite-state chains that model text generation (treating the entire prefix as state) may be null-recurrent or transient. In practice, language models cap context length, making the state space finite and all states positive recurrent.

4.5 Computing Stationary Distributions

Method 1: Solve the linear system

$$\pi P = \pi, \qquad \|\pi\|_1 = 1, \qquad \pi_i \geq 0$$

This is $N+1$ linear equations in $N$ unknowns; the equations $\pi P = \pi$ carry a one-dimensional redundancy, which the normalisation constraint removes. For small $N$, direct linear algebra is exact.

Method 2: Power iteration

$$\pi^{(k+1)} = \pi^{(k)} P$$

Starting from any distribution $\pi^{(0)} > 0$, the iteration converges to $\pi$ at rate $\lambda_*^k$. Power iteration is the PageRank algorithm.

Method 3: Eigendecomposition. Compute the left eigenvector of $P$ corresponding to eigenvalue 1. For symmetric or reversible chains, this is equivalent to finding the dominant eigenvector.

Method 4: MCMC (for continuous state spaces). For distributions $\pi(x) \propto \exp(-f(x))$ on $\mathbb{R}^d$, construct a Markov chain with stationary distribution $\pi$ and run it: the empirical distribution of the chain approximates $\pi$ by the ergodic theorem.
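A sketch of Methods 1 and 2 on the illustrative 2-state matrix from Section 2.2; both recover $\pi = (5/6, 1/6)$, matching the $\pi = (q/(p+q),\ p/(p+q))$ formula from Example 1:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
N = P.shape[0]

# Method 1: solve pi (P - I) = 0 together with sum(pi) = 1.
# Stack the normalisation as an extra equation, solve by least squares.
A = np.vstack([P.T - np.eye(N), np.ones(N)])
b = np.append(np.zeros(N), 1.0)
pi_exact, *_ = np.linalg.lstsq(A, b, rcond=None)

# Method 2: power iteration pi^(k+1) = pi^(k) P.
pi = np.full(N, 1.0 / N)
for _ in range(200):
    pi = pi @ P

print(pi_exact, pi)   # both ~ [0.8333, 0.1667] = (5/6, 1/6)
```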


5. Detailed Balance and Reversibility

5.1 Detailed Balance Equations

Definition (Detailed Balance). A distribution $\pi$ satisfies the detailed balance equations with respect to transition matrix $P$ if:

$$\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for all } i, j \in \mathcal{S}$$

Detailed balance says: the probability flux from $i$ to $j$ at stationarity equals the flux from $j$ to $i$. This is a pointwise balance of probability flows, stronger than the global balance equations $\pi P = \pi$.

Theorem. If $\pi$ satisfies detailed balance with $P$, then $\pi$ is a stationary distribution for $P$.

Proof:

$$\sum_i \pi_i P_{ij} = \sum_i \pi_j P_{ji} = \pi_j \sum_i P_{ji} = \pi_j \cdot 1 = \pi_j$$

So $\pi P = \pi$. The converse is false: a stationary distribution need not satisfy detailed balance.

Significance: Detailed balance is easier to verify than stationarity (it's a pairwise condition rather than a global one), and it characterises reversible chains. All Metropolis-Hastings algorithms are constructed to satisfy detailed balance by design.

5.2 Reversible Markov Chains

Definition (Reversible Chain). A Markov chain is reversible (or time-reversible) with respect to distribution $\pi$ if it satisfies detailed balance. Equivalently, the chain running forwards in time has the same law as the chain running backwards.

The reversed chain: Given a stationary Markov chain with transition matrix $P$, define the time-reversed chain by $\hat{P}_{ij} = \pi_j P_{ji} / \pi_i$. The reversed chain is also a Markov chain, with the same stationary distribution. The original chain is reversible iff $\hat{P} = P$.

Symmetric Markov chains are reversible: If $P_{ij} = P_{ji}$ for all $i,j$ (symmetric transition matrix), then detailed balance holds with the uniform distribution $\pi_i = 1/N$. The symmetric random walk on a graph is reversible.

Non-reversible example: The chain $1 \to 2 \to 3 \to 1$ (cyclic, with $P_{12}=P_{23}=P_{31}=1$) has uniform stationary distribution but is NOT reversible: it has a preferred direction of rotation.

For AI: Reversibility is a sufficient condition for correctness of MCMC algorithms. A Metropolis-Hastings chain with target $\pi$ satisfies detailed balance (it is reversible) and therefore has $\pi$ as its stationary distribution. Non-reversible MCMC (e.g., lifting methods, HMC) can mix faster but is harder to analyse.

5.3 Birth-Death Chains

A birth-death chain is a Markov chain on $\{0,1,2,\ldots\}$ where only nearest-neighbour transitions are allowed:

$$P_{i,i+1} = p_i, \qquad P_{i,i-1} = q_i = 1-p_i, \qquad P_{ij} = 0 \ \text{ for } |i-j| > 1$$

Birth-death chains are automatically reversible, and their stationary distribution can be computed explicitly. The detailed balance equations $\pi_i p_i = \pi_{i+1} q_{i+1}$ give:

$$\pi_n = \pi_0 \prod_{k=0}^{n-1} \frac{p_k}{q_{k+1}}$$

Setting $\pi_0 = 1/Z$ (normalisation) gives the stationary distribution.

Queue models: An $M/M/1$ queue with arrival rate $\lambda$ and service rate $\mu$ is a birth-death chain. Stationary distribution: $\pi_n = (1-\rho)\rho^n$ where $\rho = \lambda/\mu$. The chain is stable (a stationary distribution exists) iff $\rho < 1$.

For AI: The queue length at an LLM serving system (number of pending requests) is approximately a birth-death chain. The stationary distribution under load $\rho = \lambda/\mu$ determines average latency: $\mathbb{E}[\text{queue length}] = \rho/(1-\rho)$.
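A sketch of the product formula for a queue-like chain with constant rates, where the product telescopes to $\pi_n \propto \rho^n$ (the rates are illustrative, and the chain is truncated at a finite length for computation):

```python
import numpy as np

lam, mu = 0.6, 1.0                 # illustrative arrival and service rates
rho = lam / mu                     # rho = 0.6 < 1, so the chain is stable
n_max = 50                         # truncation point (tail mass is tiny)

# product formula: pi_n proportional to prod_{k<n} lambda_k / mu_{k+1} = rho^n
weights = rho ** np.arange(n_max + 1)
pi = weights / weights.sum()       # normalise

print(pi[:5])                                   # matches (1-rho) rho^n
print((1 - rho) * rho ** np.arange(5))
print((np.arange(n_max + 1) * pi).sum(),        # mean queue length
      rho / (1 - rho))                          # = 1.5 in closed form
```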

5.4 Spectral Decomposition of Reversible Chains

For a reversible chain, the transition matrix $P$ has real eigenvalues $1 = \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_N \geq -1$ and an orthonormal basis of eigenvectors (with respect to the $\pi$-weighted inner product).

Spectral decomposition:

$$P^n_{ij} = \pi_j \sum_{k=1}^N \lambda_k^n\, \phi_k(i)\, \phi_k(j)$$

where the $\phi_k$ are the eigenfunctions (normalised in $L^2(\pi)$). As $n \to \infty$, only the $k=1$ term survives (for a primitive chain, $|\lambda_k| < 1$ for $k \geq 2$), giving $P^n_{ij} \to \pi_j$. The rate of convergence is $\lambda_*^n$.

The spectral gap: $\text{gap} = 1 - \lambda_2$ measures how fast the chain forgets its initial state. A chain with spectral gap $\delta$ mixes in $O(1/\delta)$ steps. Bounding the spectral gap from below is the main technical challenge in proving MCMC convergence rates.

Dirichlet form: For reversible chains, the spectral gap equals:

$$\text{gap} = \min_{f \perp \mathbf{1}} \frac{\mathcal{E}(f,f)}{\text{Var}_\pi(f)}$$

where $\mathcal{E}(f,f) = \frac{1}{2}\sum_{i,j} \pi_i P_{ij}\,(f(j)-f(i))^2$ is the Dirichlet form. This variational characterisation is the basis for Poincaré inequality methods.


6. Mixing Times and Convergence Rates

6.1 Total Variation Distance

Definition. The total variation (TV) distance between two probability distributions $\mu$ and $\nu$ on $\mathcal{S}$ is:

$$\|\mu - \nu\|_{\text{TV}} = \sup_{A \subseteq \mathcal{S}} |\mu(A) - \nu(A)| = \frac{1}{2}\sum_i |\mu_i - \nu_i|$$

TV distance measures how distinguishable two distributions are: it equals the maximum difference in probability that any event $A$ can have under $\mu$ vs. $\nu$. The TV distance is 0 iff $\mu = \nu$, and 1 iff the distributions have disjoint support.

Coupling characterisation: $\|\mu - \nu\|_{\text{TV}} = \inf_{(X,Y)} P(X \neq Y)$, where the infimum is over all couplings $(X,Y)$ with marginals $\mu$ and $\nu$. This is the coupling inequality: any coupling gives an upper bound on TV distance.

For AI: TV distance is used to measure how far an MCMC chain is from stationarity, and to certify when the chain has "mixed" sufficiently to produce approximately independent samples from the posterior.

6.2 Mixing Time

Definition. The mixing time of a Markov chain with stationary distribution $\pi$ is:

$$t_{\text{mix}}(\varepsilon) = \min\left\{t \geq 0 : \max_{x \in \mathcal{S}} \|P^t(x,\cdot) - \pi\|_{\text{TV}} \leq \varepsilon\right\}$$

The standard mixing time is $t_{\text{mix}} = t_{\text{mix}}(1/4)$ (the default $\varepsilon = 1/4$ is conventional). Mixing time measures how long the chain must run before its distribution is $\varepsilon$-close to stationarity from the worst-case starting state.

Why the worst case? In applications, we often don't control the starting state. The worst-case mixing time certifies that regardless of initialisation, the chain has converged after $t_{\text{mix}}$ steps.

Submultiplicativity: $t_{\text{mix}}(2^{-k}) \leq k \cdot t_{\text{mix}}$. So to achieve precision $\varepsilon = 2^{-k}$, it suffices to run for $k \cdot t_{\text{mix}}$ steps.
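A sketch that computes $t_{\text{mix}}(\varepsilon)$ by brute force from the definition, on an illustrative 3-state chain small enough to form $P^t$ directly:

```python
import numpy as np

# Illustrative symmetric (doubly stochastic) chain => stationary pi is uniform
P = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
pi = np.full(3, 1/3)

def tv(mu, nu):
    """Total variation distance, 0.5 * L1 norm (Section 6.1)."""
    return 0.5 * np.abs(mu - nu).sum()

def t_mix(P, pi, eps=0.25):
    """Smallest t with worst-case TV distance to pi at most eps."""
    Pt = np.eye(len(pi))
    for t in range(1, 10_000):
        Pt = Pt @ P
        if max(tv(Pt[x], pi) for x in range(len(pi))) <= eps:
            return t
    return None

print(t_mix(P, pi))            # standard t_mix = t_mix(1/4)
print(t_mix(P, pi, eps=1e-6))  # higher precision costs ~ log(1/eps) / gap
```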

6.3 Spectral Gap

For a reversible ergodic chain, the spectral gap $\text{gap} = 1 - \lambda_2$ provides a bound on the mixing time:

Theorem (Spectral Gap Bound):

$$t_{\text{mix}}(\varepsilon) \leq \left\lceil \frac{\log\big(1/(\varepsilon\, \pi_{\min})\big)}{\text{gap}} \right\rceil$$

where $\pi_{\min} = \min_i \pi_i$.

Proof sketch: From the spectral decomposition, $\|P^n(x,\cdot) - \pi\|_{\text{TV}} \leq \frac{1}{2\sqrt{\pi_x}}\,(1-\text{gap})^n$. Setting this $\leq \varepsilon$ and solving for $n$ gives the bound.

Lower bound: There is also a lower bound $t_{\text{mix}} \geq \frac{1}{2\,\text{gap}}$, so the mixing time is $\Theta(1/\text{gap})$ for reversible chains.

Computing the spectral gap: For small matrices, compute the eigenvalues of $P$ directly. For large chains, use the Cheeger inequality: $\text{gap}/2 \leq h \leq \sqrt{2\,\text{gap}}$, where $h$ is the Cheeger constant (conductance) of the chain.

6.4 Coupling Arguments

Definition (Coupling). A coupling of two Markov chains with the same transition matrix $P$ but different initial distributions $\mu, \nu$ is a process $(X_t, Y_t)$ such that each marginal is a valid Markov chain with transition matrix $P$, and once $X_t = Y_t$ the two chains remain together (they have coalesced).

Coupling inequality:

$$\|P^t(\mu,\cdot) - P^t(\nu,\cdot)\|_{\text{TV}} \leq P(X_t \neq Y_t) \leq P(\tau_{\text{meet}} > t)$$

where $\tau_{\text{meet}} = \min\{t : X_t = Y_t\}$ is the coupling (meeting) time.

Grand coupling: For a finite ergodic chain, one can construct a single coupling $\{(X_t^x, X_t^y) : x, y \in \mathcal{S}\}$ of chains started from all possible pairs $(x,y)$ simultaneously. The mixing time then satisfies $t_{\text{mix}} = O(\max_{x,y} \mathbb{E}[\tau_{\text{meet}}])$.

Example: Random walk on the hypercube. For a lazy random walk on $\{0,1\}^d$, couple two chains by updating the same coordinate to the same fresh random bit at each step. Once a coordinate has been chosen, the two chains agree on that bit forever. All $d$ bits agree after $O(d \log d)$ steps (coupon collector), giving $t_{\text{mix}} = O(d \log d)$.
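A simulation sketch of this coupling (dimension and trial count are illustrative): both chains share the coordinate choice and the new bit, so they coalesce once every initially-differing coordinate has been refreshed.

```python
import numpy as np

rng = np.random.default_rng(0)

def coupled_meeting_time(d, rng):
    """Run two coupled lazy walks on {0,1}^d until they meet."""
    x = rng.integers(0, 2, d)      # chain started at a random corner
    y = np.zeros(d, dtype=int)     # chain started at the all-zeros corner
    t = 0
    while not np.array_equal(x, y):
        i = rng.integers(d)        # shared coordinate choice
        b = rng.integers(0, 2)     # shared fresh bit (lazy: stays put w.p. 1/2)
        x[i], y[i] = b, b
        t += 1
    return t

d = 16
times = [coupled_meeting_time(d, rng) for _ in range(200)]
print(f"mean coupling time ~ {np.mean(times):.1f}; d log d ~ {d*np.log(d):.1f}")
```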

6.5 AI Relevance: Fast Mixing and MCMC Warm Starts

Fast mixing in practice: For Bayesian inference in LLM fine-tuning (e.g., RLHF reward models), the posterior over model weights is a distribution on $\mathbb{R}^d$ with $d \sim 10^9$. Direct MCMC is intractable; approximate methods (Langevin dynamics, HMC) exploit gradient information to achieve better mixing.

Spectral gap and dimensionality: For a Gaussian target $\mathcal{N}(0, \Sigma)$, the Langevin SDE mixing time scales as $O(\kappa)$, where $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number of $\Sigma$. Preconditioning (e.g., SGLD with adaptive learning rates, Adam-Langevin) improves the effective spectral gap.

Warm starts: If the initial distribution $\mu^{(0)}$ is already close to $\pi$ (e.g., from a previous MCMC run), the chain needs fewer steps to mix. In sequential Bayesian updating (fine-tuning an LLM on new data), the previous posterior serves as a warm start.

Temperature scheduling: Simulated annealing starts with a high-temperature chain (wide, fast-mixing distribution) and gradually cools toward the target. This exploits fast mixing at high temperature and accurate targeting at low temperature.


7. Continuous-Time Markov Chains

7.1 Generator Matrix

A continuous-time Markov chain (CTMC) $\{X_t\}_{t \geq 0}$ on a finite state space $\mathcal{S}$ satisfies the Markov property in continuous time:

$$P(X_{t+s}=j \mid X_u : u \leq s) = P(X_{t+s}=j \mid X_s)$$

The chain holds in each state for an exponentially distributed sojourn time, then jumps according to a discrete-time chain. The rate of leaving state $i$ is $q_i = \sum_{j \neq i} q_{ij}$, where $q_{ij} \geq 0$ are the jump rates from $i$ to $j$.

Definition (Generator Matrix / Q-Matrix). The generator (or Q-matrix) of a CTMC is the matrix $Q$ with:

$$Q_{ij} = \begin{cases} q_{ij} \geq 0 & i \neq j \quad \text{(jump rate from } i \text{ to } j\text{)} \\ -q_i = -\sum_{j \neq i} q_{ij} & i = j \quad \text{(holding rate)} \end{cases}$$

Key property: the rows of $Q$ sum to zero ($Q \mathbf{1} = \mathbf{0}$). The diagonal entries are non-positive; the off-diagonal entries are non-negative.

Transition matrix: The probability of being in state $j$ at time $t$, given state $i$ at time 0, is:

$$P(t) = e^{Qt} = \sum_{n=0}^\infty \frac{(Qt)^n}{n!}$$

This is the matrix exponential. Note that $P(0) = I$ and $\lim_{t\to\infty} P(t) = \mathbf{1}\pi$ (convergence to stationarity for ergodic chains).
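A sketch using SciPy's matrix exponential on an illustrative 2-state generator; the rows of $P(t)$ approach the stationary distribution $(2/3, 1/3)$ that balances the flux $\pi_0 q_{01} = \pi_1 q_{10}$:

```python
import numpy as np
from scipy.linalg import expm

# Illustrative generator: leave state 0 at rate 1, leave state 1 at rate 2.
Q = np.array([[-1.0,  1.0],
              [ 2.0, -2.0]])

for t in [0.1, 1.0, 10.0]:
    Pt = expm(Q * t)               # P(t) = exp(Qt)
    print(t, Pt[0])                # row 0: P(X_t = . | X_0 = 0)

# flux balance pi_0 * 1 = pi_1 * 2 gives pi = (2/3, 1/3); P(t) rows -> pi
```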

Embedded chain: A CTMC can be decomposed into (1) an embedded discrete-time Markov chain governing the sequence of states visited, and (2) exponential holding times in each state. Jump probabilities: $\tilde{P}_{ij} = q_{ij}/q_i$ for $i \neq j$.

7.2 Kolmogorov Equations

Forward Equation (Fokker-Planck):

$$\frac{d}{dt} P(t) = P(t)\, Q \quad \Rightarrow \quad P(t) = e^{Qt}$$

This equation describes how the distribution evolves forward in time: $\frac{d}{dt}\mu(t) = \mu(t)\, Q$, with solution $\mu(t) = \mu(0)\, e^{Qt}$.

Backward Equation:

$$\frac{d}{dt} P(t) = Q\, P(t)$$

The forward equation applies to the time-marginal distribution; the backward equation to the transition probabilities viewed as functions of the initial state.

Stationary distribution: $\pi$ is stationary if $\frac{d}{dt}\mu(t) = 0$ at $\mu = \pi$, i.e.:

$$\pi Q = \mathbf{0} \ \text{(row vector)} \quad \Leftrightarrow \quad \sum_i \pi_i Q_{ij} = 0 \ \text{ for all } j$$

For an ergodic CTMC, the stationary distribution is unique.

7.3 Stationary Distribution of CTMCs

Detailed balance for CTMCs: A distribution $\pi$ satisfies detailed balance if $\pi_i Q_{ij} = \pi_j Q_{ji}$ for all $i \neq j$. Detailed balance implies stationarity ($\pi Q = 0$).

Computing the stationary distribution: Solve $\pi Q = 0$ with $\sum_i \pi_i = 1$. This is equivalent to finding the null space of $Q^T$ (the right eigenvector of $Q^T$ with eigenvalue 0).

Birth-death CTMC: With birth rate $\lambda_n$ from state $n$ and death rate $\mu_n$, detailed balance gives $\pi_n \lambda_n = \pi_{n+1}\mu_{n+1}$, yielding:

$$\pi_n = \pi_0 \prod_{k=0}^{n-1} \frac{\lambda_k}{\mu_{k+1}}$$

7.4 Poisson Process as CTMC

Recall from Section 06 Stochastic Processes: The Poisson process $N_t$ counts arrivals up to time $t$ with independent Poisson increments. Inter-arrival times are $\text{Exp}(\lambda)$.

The Poisson process is the simplest CTMC: states $\{0,1,2,\ldots\}$, jump rates $q_{n,n+1} = \lambda$ (arrivals only), generator:

$$Q = \begin{pmatrix} -\lambda & \lambda & & \\ & -\lambda & \lambda & \\ & & \ddots & \ddots \end{pmatrix}$$

The Poisson process is transient: it drifts to $+\infty$ and has no stationary distribution. An $M/M/1$ queue (arrivals at rate $\lambda$, service at rate $\mu > \lambda$) modifies the Poisson process by adding downward jumps (service completions), yielding a birth-death CTMC with geometric stationary distribution $\pi_n = (1-\rho)\rho^n$, where $\rho = \lambda/\mu$.

For AI: The Poisson process models token arrivals at an LLM serving endpoint. At high load ($\rho \to 1$), the queue length distribution becomes heavy-tailed, leading to high-percentile latency spikes. The M/M/1 formula $\mathbb{E}[\text{queue}] = \rho/(1-\rho)$ quantifies this: at 90% utilisation the expected queue length is $0.9/0.1 = 9$.


8. MCMC: Markov Chain Monte Carlo

8.1 The Sampling Problem

The challenge: In Bayesian inference, the posterior $\pi(\theta \mid \mathcal{D}) \propto p(\mathcal{D}\mid\theta)\,p(\theta)$ is analytically intractable for most models. Computing $\mathbb{E}_\pi[f(\theta)]$ requires integrating over a high-dimensional distribution we cannot normalise.

MCMC solution: Construct a Markov chain $\{\theta_t\}$ on the parameter space such that:

  1. The chain is ergodic (irreducible, aperiodic, positive recurrent)
  2. The stationary distribution equals the target $\pi$

By the ergodic theorem, time-averaging gives posterior expectations:

$$\frac{1}{T}\sum_{t=1}^T f(\theta_t) \ \xrightarrow{\ T\to\infty\ }\ \mathbb{E}_\pi[f(\theta)]$$

The key insight: we only need to know $\pi(\theta)$ up to a normalising constant. MCMC constructs the chain using only unnormalised evaluations $\tilde{\pi}(\theta) \propto \pi(\theta)$.

8.2 Metropolis-Hastings

Algorithm (Metropolis-Hastings). Given current state $\theta$, target $\pi$, and proposal $q(\cdot \mid \theta)$:

  1. Sample a candidate $\theta' \sim q(\cdot \mid \theta)$
  2. Compute the acceptance ratio: $a = \min\left(1,\ \frac{\pi(\theta')\, q(\theta \mid \theta')}{\pi(\theta)\, q(\theta' \mid \theta)}\right)$
  3. With probability $a$, accept: $\theta_{t+1} = \theta'$; otherwise $\theta_{t+1} = \theta$ (reject and stay)

Correctness: The Metropolis-Hastings chain satisfies detailed balance with respect to $\pi$:

$$\pi(\theta)\, q(\theta' \mid \theta)\, a(\theta, \theta') = \pi(\theta')\, q(\theta \mid \theta')\, a(\theta', \theta)$$

Proof: Without loss of generality, assume $\pi(\theta')\,q(\theta\mid\theta') \leq \pi(\theta)\,q(\theta'\mid\theta)$. Then $a(\theta,\theta')=\frac{\pi(\theta')\,q(\theta\mid\theta')}{\pi(\theta)\,q(\theta'\mid\theta)}$ and $a(\theta',\theta)=1$, so both sides equal $\pi(\theta')\,q(\theta\mid\theta')$.

Special cases:

  • Metropolis algorithm: A symmetric proposal $q(\theta'\mid\theta) = q(\theta\mid\theta')$ simplifies the acceptance probability to $a = \min(1,\ \pi(\theta')/\pi(\theta))$
  • Independence sampler: The proposal $q(\theta'\mid\theta) = q(\theta')$ ignores the current state; acceptance $a = \min\left(1,\ \frac{\pi(\theta')\,q(\theta)}{\pi(\theta)\,q(\theta')}\right)$

Proposal tuning: For a Gaussian random walk proposal $q(\theta'\mid\theta) = \mathcal{N}(\theta, \sigma^2 I)$, the optimal $\sigma$ achieves an acceptance rate of roughly 23.4% in high dimensions (Roberts-Gelman-Gilks 1997). Too small a $\sigma$ gives high acceptance but slow exploration; too large gives low acceptance and inefficiency.
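A minimal random-walk Metropolis sketch for an illustrative unnormalised 1D target (a two-mode Gaussian mixture); note that only $\log\tilde{\pi}$ is ever evaluated, and the symmetric proposal cancels in the ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # unnormalised log density: equal-weight mixture of N(-2, 1) and N(2, 1)
    return np.logaddexp(-0.5 * (x - 2)**2, -0.5 * (x + 2)**2)

def metropolis(n_steps, sigma=2.5, x0=0.0):
    x = x0
    samples = np.empty(n_steps)
    accepts = 0
    for t in range(n_steps):
        x_prop = x + sigma * rng.normal()      # symmetric Gaussian proposal
        # a = min(1, pi(x')/pi(x)); q cancels because it is symmetric
        if np.log(rng.uniform()) < log_target(x_prop) - log_target(x):
            x, accepts = x_prop, accepts + 1
        samples[t] = x
    return samples, accepts / n_steps

samples, acc = metropolis(50_000)
print(f"acceptance rate: {acc:.2f}")   # tune sigma toward a moderate rate
print(f"E[x] ~ {samples.mean():.3f}")  # ~ 0 by symmetry of the target
```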

For AI: MCMC is used for posterior inference in Bayesian neural networks, uncertainty quantification in LLM outputs, and sampling from energy-based models. In practice, Langevin dynamics (gradient-based MCMC) replaces vanilla MH for high-dimensional targets.

8.3 Gibbs Sampling

Algorithm (Gibbs Sampling). For target $\pi(\theta_1,\ldots,\theta_d)$, cycle through the coordinates:

  1. Sample $\theta_1^{(t+1)} \sim \pi(\theta_1 \mid \theta_2^{(t)}, \ldots, \theta_d^{(t)})$
  2. Sample $\theta_2^{(t+1)} \sim \pi(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_d^{(t)})$
  3. $\vdots$
  4. Sample $\theta_d^{(t+1)} \sim \pi(\theta_d \mid \theta_1^{(t+1)}, \ldots, \theta_{d-1}^{(t+1)})$

Correctness: Gibbs sampling is a special case of Metropolis-Hastings in which each coordinate update is an MH step with proposal equal to the full conditional. The acceptance ratio is always 1: every proposal is accepted. This requires the ability to sample from the full conditionals $\pi(\theta_k \mid \theta_{-k})$.
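A sketch for an illustrative target where the full conditionals are available in closed form: a bivariate standard Gaussian with correlation $\rho$, for which $\theta_1 \mid \theta_2 \sim \mathcal{N}(\rho\theta_2,\ 1-\rho^2)$ and symmetrically:

```python
import numpy as np

rng = np.random.default_rng(0)

rho = 0.8                 # illustrative correlation of the Gaussian target
n = 20_000
x = y = 0.0
out = np.empty((n, 2))
for t in range(n):
    # full conditionals: x | y ~ N(rho*y, 1-rho^2), y | x ~ N(rho*x, 1-rho^2)
    x = rho * y + np.sqrt(1 - rho**2) * rng.normal()
    y = rho * x + np.sqrt(1 - rho**2) * rng.normal()
    out[t] = (x, y)

print(np.corrcoef(out[n // 10:].T)[0, 1])   # ~ 0.8 after discarding burn-in
```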

Block Gibbs: Instead of updating one coordinate at a time, update a block of coordinates jointly. More efficient when coordinates within a block are highly correlated.

For AI: Gibbs sampling is used in latent Dirichlet allocation (LDA) for topic modelling - the full conditionals for topic assignments are available in closed form. In RL, it can be used to sample from the Q-function distribution for exploration.

8.4 Convergence Diagnostics

Running an MCMC chain for a finite number of steps gives an approximate (not exact) sample from π\pi. Practical convergence diagnostics:

Trace plots: Plot $\theta_t$ vs. $t$. A well-mixed chain looks like stationary noise. Trends, drifts, or stuck values indicate non-convergence.

$\hat{R}$ statistic (Gelman-Rubin): Run $M \geq 4$ chains from diverse starting points and compute $\hat{R} = \sqrt{V_{\text{between}}/V_{\text{within}}}$. If $\hat{R} \leq 1.01$, the chains have converged to the same distribution.

Effective sample size (ESS): Due to autocorrelation within the chain, $T$ MCMC samples carry less information than $T$ iid samples. $\text{ESS} = T / (1 + 2\sum_{k=1}^\infty \rho_k)$, where $\rho_k$ is the lag-$k$ autocorrelation. $\text{ESS}/T \ll 1$ indicates poor mixing.

Burn-in: The first $B$ samples (before the chain has mixed) are discarded. Typical burn-in is 10-50% of the total samples.
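A sketch ESS estimator from empirical autocorrelations, truncating the sum at the first negative term (one common heuristic among several); it is checked against an AR(1) chain, whose theoretical $\text{ESS}/T$ is $(1-\phi)/(1+\phi)$:

```python
import numpy as np

def ess(x):
    """Effective sample size: T / (1 + 2 * sum of positive autocorrelations)."""
    x = np.asarray(x) - np.mean(x)
    T = len(x)
    acf = np.correlate(x, x, mode="full")[T - 1:] / (np.arange(T, 0, -1) * x.var())
    s = 0.0
    for k in range(1, T):
        if acf[k] < 0:            # truncate at first negative estimate
            break
        s += acf[k]
    return T / (1 + 2 * s)

rng = np.random.default_rng(0)
phi = 0.9                          # AR(1) coefficient (illustrative)
x = np.zeros(5_000)
for t in range(1, len(x)):
    x[t] = phi * x[t - 1] + rng.normal()

print(ess(x) / len(x))             # ~ (1-0.9)/(1+0.9) ~ 0.053
```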

8.5 Hamiltonian Monte Carlo

Motivation: For distributions on $\mathbb{R}^d$, random walk Metropolis-Hastings needs step size $O(d^{-1/2})$ to maintain an acceptable acceptance rate, giving diffusive mixing: $O(d)$ steps to traverse the distribution. Hamiltonian Monte Carlo (HMC) exploits gradient information to take large, high-acceptance steps.

Algorithm: Augment $\theta \in \mathbb{R}^d$ with a momentum $p \sim \mathcal{N}(0, M)$. The Hamiltonian is $H(\theta,p) = -\log\pi(\theta) + \frac{1}{2}p^T M^{-1} p$. Propose a new state by simulating Hamiltonian dynamics with the leapfrog integrator for $L$ steps of size $\varepsilon$, then accept/reject with MH ratio $e^{-\Delta H}$.

Why it works: Hamiltonian dynamics preserves the joint distribution $e^{-H(\theta,p)}$ on $(\theta,p)$. Marginalising out $p$ recovers $\pi(\theta)$. The leapfrog integrator introduces discretisation error, corrected by the MH accept/reject step.

Mixing time: HMC achieves $O(d^{1/4})$ gradient evaluations per independent sample (vs. $O(d)$ for a random walk), making it scalable to high dimensions. NUTS (No-U-Turn Sampler) automates the choice of $L$ and $\varepsilon$.
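A minimal HMC sketch with identity mass matrix for an illustrative standard Gaussian target; the leapfrog half-step/full-step structure and the $e^{-\Delta H}$ correction follow the description above:

```python
import numpy as np

rng = np.random.default_rng(0)

def U(x):      return 0.5 * np.dot(x, x)   # -log pi up to a constant (N(0, I))
def grad_U(x): return x

def hmc_step(x, eps=0.1, L=20):
    p = rng.normal(size=x.shape)           # resample momentum ~ N(0, I)
    x_new, p_new = x.copy(), p.copy()
    # leapfrog: half momentum step, L-1 full steps, final half step
    p_new -= 0.5 * eps * grad_U(x_new)
    for _ in range(L - 1):
        x_new += eps * p_new
        p_new -= eps * grad_U(x_new)
    x_new += eps * p_new
    p_new -= 0.5 * eps * grad_U(x_new)
    # accept with probability exp(-(H_new - H_old))
    dH = (U(x_new) + 0.5 * p_new @ p_new) - (U(x) + 0.5 * p @ p)
    return x_new if np.log(rng.uniform()) < -dH else x

x = np.zeros(5)
samples = []
for _ in range(5_000):
    x = hmc_step(x)
    samples.append(x)
print(np.var(samples, axis=0))   # ~ 1 in each coordinate for N(0, I)
```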

For AI: HMC is the backbone of probabilistic programming systems like Stan and PyMC. In Bayesian deep learning, SGLD (stochastic gradient Langevin dynamics) adds Gaussian noise to SGD steps, yielding approximate Bayesian inference at scale.


9. ML Deep Dive

9.1 Language Model Generation

Every autoregressive language model (GPT, LLaMA, PaLM, Gemini) generates text as a Markov chain. The state at step $n$ is the entire generated sequence $(w_1,\ldots,w_n)$ (encoded in the KV cache), and the transition is:

$$P(w_{n+1} = v \mid w_1,\ldots,w_n) = \text{softmax}(\text{LM}(w_1,\ldots,w_n))_v$$

Temperature and the stationary distribution: At temperature $\tau$:

$$P_\tau(w_{n+1} = v \mid \text{context}) = \frac{\exp(\text{logit}_v / \tau)}{\sum_{v'} \exp(\text{logit}_{v'}/\tau)}$$

  • $\tau \to 0$: greedy decoding (deterministic); $\tau = 1$: standard sampling; $\tau \to \infty$: uniform
  • The generated sequence is a Markov chain; its stationary distribution (if the chain were infinite) would represent the model's "preferred" discourse

Mixing time of language models: For a well-trained LLM, the Markov chain mixes quickly - the model "forgets" its starting state within a few hundred tokens for most topics. This reflects the inductive bias toward coherent text: the chain has a large spectral gap.

Top-$k$ and nucleus sampling: Top-$k$ sampling restricts transitions to the $k$ highest-probability tokens; nucleus (top-$p$) sampling restricts to the minimal set of tokens whose cumulative probability exceeds $p$. Both modify the transition matrix to improve mixing (reducing the probability of getting stuck in low-quality loops).
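A sketch of both filters applied to an illustrative next-token distribution; each zeroes out the excluded tokens and renormalises the transition row:

```python
import numpy as np

def filter_top_k(probs, k):
    out = np.zeros_like(probs)
    idx = np.argsort(probs)[-k:]              # k highest-probability tokens
    out[idx] = probs[idx]
    return out / out.sum()                    # renormalised transition row

def filter_top_p(probs, p):
    order = np.argsort(probs)[::-1]           # tokens by descending probability
    csum = np.cumsum(probs[order])
    keep = order[:np.searchsorted(csum, p) + 1]   # minimal set covering mass p
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

logits = np.array([4.0, 3.5, 1.0, 0.5, -1.0])     # illustrative model output
probs = np.exp(logits) / np.exp(logits).sum()
print(filter_top_k(probs, 2))
print(filter_top_p(probs, 0.9))
```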

Repetition and periodicity: Without repetition penalties, language model chains can enter periodic loops (mode collapse in generation: "the the the ..."). Adding a repetition penalty introduces a state-dependent transition modification that destroys the periodic structure.

9.2 PageRank

Setup: Represent the web as a directed graph with $N$ pages. The random surfer model: a user follows a random outgoing link with probability $\alpha$ (the damping factor), or teleports to a uniformly random page with probability $1-\alpha$.

Transition matrix:

$$P_{ij} = \alpha \cdot \frac{A_{ij}}{d_i} + (1-\alpha) \cdot \frac{1}{N}$$

where $A_{ij} = 1$ if page $i$ links to page $j$, and $d_i = \sum_j A_{ij}$ is the out-degree of page $i$.

Dangling nodes: Pages with no outbound links ($d_i = 0$) break row-stochasticity. Fix: replace the row of each dangling node $i$ with the uniform distribution $1/N$. The resulting matrix is stochastic.

PageRank computation: The PageRank vector $\pi$ satisfies $\pi P = \pi$, computed by power iteration:

$$\pi^{(k+1)} = \pi^{(k)} P$$

Convergence is guaranteed by Perron-Frobenius (the damped matrix is primitive), at geometric rate $\alpha^k$ (since the second eigenvalue is $\leq \alpha$ in magnitude). With $\alpha = 0.85$, convergence to $10^{-8}$ precision takes $\approx 100$ iterations.
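A power iteration sketch on an illustrative 4-page graph with one dangling node (page 3), applying the uniform-row fix and the damping described above:

```python
import numpy as np

A = np.array([[0, 1, 1, 0],     # A[i, j] = 1 if page i links to page j
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)    # page 3 is dangling
N, alpha = A.shape[0], 0.85

d = A.sum(axis=1)
P = np.empty_like(A)
for i in range(N):
    P[i] = A[i] / d[i] if d[i] > 0 else 1.0 / N   # dangling row -> uniform

G = alpha * P + (1 - alpha) / N                   # damped, primitive matrix

pi = np.full(N, 1.0 / N)
for _ in range(100):                              # geometric convergence
    pi = pi @ G
print(pi, pi.sum())                               # PageRank scores, sum to 1
```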

Interpretation: $\pi_i$ is the long-run fraction of time the random surfer spends on page $i$. Pages with high in-degree from other high-PageRank pages get high scores.

For AI: Influence in a social network, citations in academic papers, and token importance in attention patterns can all be modelled as PageRank variants. The attention mechanism in transformers can be viewed as a differentiable, query-dependent version of PageRank.

9.3 Markov Decision Processes

A Markov Decision Process (MDP) extends Markov chains by adding actions:

$$(\mathcal{S}, \mathcal{A}, P, r, \gamma)$$

  • $\mathcal{S}$: state space; $\mathcal{A}$: action space
  • $P(s' \mid s, a)$: transition probabilities given action $a$
  • $r(s, a)$: reward function
  • $\gamma \in [0,1)$: discount factor

A policy $\pi(a \mid s)$ induces a Markov chain on $\mathcal{S}$ with transition matrix $P^\pi(s'\mid s) = \sum_a \pi(a\mid s)\, P(s'\mid s,a)$.

Bellman equation: The value function $V^\pi(s) = \mathbb{E}^\pi\big[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) \mid s_0=s\big]$ satisfies:

$$V^\pi(s) = \sum_a \pi(a\mid s)\left[r(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\, V^\pi(s')\right]$$

In matrix form: $\mathbf{V}^\pi = \mathbf{r}^\pi + \gamma P^\pi \mathbf{V}^\pi$, giving $\mathbf{V}^\pi = (I - \gamma P^\pi)^{-1}\mathbf{r}^\pi$.

Policy evaluation = linear system: Given a fixed policy $\pi$, computing $V^\pi$ means solving $(I - \gamma P^\pi)\mathbf{V}^\pi = \mathbf{r}^\pi$. The matrix is guaranteed invertible, since $\gamma < 1$ implies all eigenvalues of $\gamma P^\pi$ have magnitude $< 1$.
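A sketch of exact policy evaluation for an illustrative 3-state induced chain: solve the linear system rather than inverting explicitly, then check the Bellman equation:

```python
import numpy as np

# Illustrative chain induced by a fixed policy, with per-state rewards.
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.3, 0.7]])
r_pi = np.array([1.0, 0.0, 2.0])
gamma = 0.95

# V = (I - gamma P)^{-1} r, computed as a linear solve
V = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(V)

# sanity check: V satisfies the Bellman equation V = r + gamma P V
assert np.allclose(V, r_pi + gamma * P_pi @ V)
```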

RLHF as MDP: Reinforcement learning from human feedback trains a language model (policy $\pi_\theta$) to maximise a reward model $r_\phi$: an MDP with the KV cache as state, the vocabulary as actions, and the human preference model as reward.

9.4 Hidden Markov Models

A Hidden Markov Model (HMM) is a Markov chain $\{Z_t\}$ of hidden states with observations $\{X_t\}$ emitted from the current hidden state:

$$Z_0 \sim \mu_0, \qquad Z_t \mid Z_{t-1} \sim P_{Z_{t-1}}, \qquad X_t \mid Z_t \sim E_{Z_t}$$

where $P$ is the transition matrix and $E$ is the emission distribution.

Forward algorithm: Compute $\alpha_t(i) = P(X_1,\ldots,X_t,\ Z_t=i)$ recursively:

$$\alpha_t(j) = E_j(X_t) \sum_i \alpha_{t-1}(i)\, P_{ij}, \qquad \alpha_1(j) = \mu_0(j)\, E_j(X_1)$$

Time complexity: $O(T \cdot |\mathcal{S}|^2)$. Used for likelihood computation and decoding.

Viterbi algorithm: Find the most likely hidden state sequence $\hat{Z}_{1:T} = \arg\max_{z_{1:T}} P(z_{1:T} \mid X_{1:T})$ via dynamic programming:

$$\delta_t(j) = E_j(X_t)\, \max_i\, \delta_{t-1}(i)\, P_{ij}$$

Time complexity: same as forward algorithm.
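A sketch of both recursions for an illustrative 2-state HMM with binary emissions (all parameters made up for the example):

```python
import numpy as np

P   = np.array([[0.7, 0.3], [0.4, 0.6]])   # hidden-state transitions
E   = np.array([[0.9, 0.1], [0.2, 0.8]])   # E[z, x] = P(obs x | state z)
mu0 = np.array([0.5, 0.5])
obs = [0, 0, 1, 1, 1]

# Forward: alpha_t(j) = E_j(x_t) * sum_i alpha_{t-1}(i) P_ij
alpha = mu0 * E[:, obs[0]]
for x in obs[1:]:
    alpha = E[:, x] * (alpha @ P)
print("likelihood P(X_{1:T}):", alpha.sum())

# Viterbi: delta_t(j) = E_j(x_t) * max_i delta_{t-1}(i) P_ij, with backpointers
delta = mu0 * E[:, obs[0]]
back = []
for x in obs[1:]:
    scores = delta[:, None] * P             # scores[i, j] = delta_i * P_ij
    back.append(scores.argmax(axis=0))      # best predecessor for each j
    delta = E[:, x] * scores.max(axis=0)
path = [int(delta.argmax())]
for bp in reversed(back):                   # backtrack to recover the path
    path.append(int(bp[path[-1]]))
path.reverse()
print("most likely states:", path)
```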

Baum-Welch (EM for HMMs): Learn $P$, $E$, $\mu_0$ from unlabelled observations. E-step: compute forward-backward probabilities. M-step: update parameters using expected sufficient statistics. Converges to a local maximum of the likelihood.

For AI: HMMs are classical sequence models (speech recognition, gene finding). Modern LLMs subsume HMMs: the hidden state (KV cache) is a rich continuous representation of the context, and emission (next-token distribution) is the softmax output.

9.5 Diffusion Models as CTMCs

The DDPM forward process can be viewed as a discrete-time Markov chain with Gaussian transitions:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$

The marginal is $q(x_t \mid x_0) = \mathcal{N}\big(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)I\big)$, with $\bar\alpha_t = \prod_{s=1}^t (1-\beta_s)$.
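A sketch of the forward marginal under an illustrative linear $\beta_t$ schedule: $\bar\alpha_t$ is a cumulative product, and $x_t$ can be sampled from $x_0$ in one step:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # illustrative linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)      # \bar{alpha}_t

x0 = np.ones(4)                          # a toy "image" of 4 pixels
t = 500
# sample q(x_t | x_0) = N(sqrt(abar_t) x0, (1 - abar_t) I) in one step
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * rng.normal(size=4)
print(alpha_bar[t], x_t)   # signal shrinks toward 0; noise dominates as t -> T
```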

Recall from Section 06 Stochastic Processes: In the SDE formulation (Song et al., 2021), the continuous-time version is a diffusion SDE: $dx = f(x,t)\,dt + g(t)\,dB_t$.

The reverse process: By time-reversal of Markov chains (Anderson, 1982), the reverse of an ergodic CTMC is also a CTMC. The reverse diffusion has:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(t)\big)$$

The neural network learns to parameterise the reverse transitions; equivalently, to approximate the score function $\nabla_x \log q_t(x)$.

