Math for LLMs · Probability Theory

Markov Chains

"A sequence of experiments forms a simple Markov chain if the conditional distribution of each experiment, given all preceding ones, depends only on the immediately preceding experiment."

  • Andrei Markov, 1906

Overview

A Markov chain is a stochastic process with a single, powerful property: the future depends on the past only through the present. This memorylessness - the Markov property - is both a mathematical simplification and a profound modelling insight. It says: knowing the current state is a complete sufficient statistic for predicting what comes next.

This simplicity conceals enormous richness. Markov chains model language (every autoregressive LLM generates tokens one at a time, conditioning on the current state of the context), web structure (PageRank computes the stationary distribution of a Markov chain on the web graph), Bayesian computation (MCMC algorithms construct Markov chains whose stationary distribution is the target posterior), reinforcement learning (MDPs are Markov chains with actions), and speech recognition (HMMs). The mathematics of Markov chains - transition matrices, stationary distributions, mixing times - is the engine beneath all of these.

This section builds the complete theory from first principles: formal definitions, state classification, existence and uniqueness of stationary distributions via Perron-Frobenius, convergence to stationarity and mixing times, reversibility and detailed balance, continuous-time chains, and the MCMC algorithms central to Bayesian ML.

Prerequisites

Companion Notebooks

| Notebook | Description |
|---|---|
| theory.ipynb | Interactive derivations: transition matrices, power iteration, classification, stationary distributions, mixing times, Metropolis-Hastings, Gibbs, PageRank, HMMs |
| exercises.ipynb | 10 graded exercises from state classification to PageRank and HMM decoding |

Learning Objectives

After completing this section, you will be able to:

  1. State the Markov property and express it as conditional independence
  2. Write the transition matrix of a Markov chain and compute $n$-step probabilities via $P^n$
  3. Classify states as communicating, recurrent/transient, periodic/aperiodic, and absorbing
  4. State the Perron-Frobenius theorem and use it to establish existence and uniqueness of stationary distributions for ergodic chains
  5. Compute stationary distributions via linear systems and power iteration
  6. State the detailed balance equations and explain how they relate to reversibility
  7. Define total variation distance and mixing time; bound mixing time via the spectral gap
  8. Write the generator matrix of a CTMC and derive its stationary distribution
  9. Implement Metropolis-Hastings and verify it satisfies detailed balance for the target distribution
  10. Implement Gibbs sampling and explain why it is a special case of Metropolis-Hastings
  11. Compute PageRank via power iteration with teleportation
  12. Describe the HMM forward algorithm and Viterbi decoding

1. Intuition

1.1 What Is a Markov Chain?

A Markov chain is a sequence of random variables $X_0, X_1, X_2, \ldots$ taking values in a state space $\mathcal{S}$, where the future is independent of the past given the present. Formally, for every $n \geq 0$ and every state $j \in \mathcal{S}$:

$$P(X_{n+1} = j \mid X_0, X_1, \ldots, X_n) = P(X_{n+1} = j \mid X_n)$$

This is the Markov property: the current state $X_n$ is a sufficient statistic for predicting $X_{n+1}$. No matter how long the history, knowing only the present gives as much information about the future as knowing the entire past trajectory.

Why memorylessness is powerful. At first glance, the Markov property looks like a restriction - real systems do have memory. But three things make it useful:

  1. Many systems genuinely satisfy it. A fair die, re-rolled at each step, is Markov (no memory). A chess position captures all relevant game state (Markov). An LLM's key-value cache provides sufficient statistics for the next token (Markov, given the KV cache).

  2. Higher-order Markov chains reduce to first-order. A chain where $X_{n+1}$ depends on $X_n$ and $X_{n-1}$ can be converted to a first-order chain on the augmented state $(X_n, X_{n-1})$. So first-order is universal given a rich enough state space.

  3. The mathematics is tractable. Transition probabilities only involve pairs of states, so the full dynamics are captured by a matrix rather than an exponentially large conditional probability table.

The mental picture: Imagine a particle hopping between states on a graph. At each step, it looks only at which node it currently occupies, then jumps to a neighbouring node with probability given by the edge weights. The history of how it got to the current node is irrelevant - only where it is now matters.

For AI: Every autoregressive language model generates tokens one at a time, each conditioned on all previous tokens. Given a context window of length $n$, the distribution over the next token is fully specified by the current state (the context). This is precisely the Markov property with state space = all possible contexts of length $\leq n$.

1.2 Everywhere in AI

Markov chains appear as the backbone of several major AI paradigms:

Language generation: GPT, LLaMA, and every autoregressive transformer generates text via a Markov chain. The state is the current context (the key-value cache), the transition is the softmax over the vocabulary, and the stationary distribution - if it exists - characterises what the model "talks about" at equilibrium.

PageRank: Google's original ranking algorithm computes the stationary distribution of a Markov chain on the web graph, where each page links to others. A page is important if the random surfer spends a lot of time there at stationarity. Power iteration on the transition matrix gives the PageRank vector.

Reinforcement learning: A Markov Decision Process (MDP) is a Markov chain with actions. The agent's policy induces a Markov chain over states, and the value function is the expected discounted return from each state - computable from the stationary properties of this chain. TD-learning, Q-learning, and policy gradient methods all exploit the Markov structure.

Bayesian inference via MCMC: When the posterior distribution $\pi(\theta \mid \mathcal{D})$ is intractable, MCMC constructs a Markov chain whose stationary distribution is exactly $\pi$. Running the chain for long enough generates samples that represent the posterior. Metropolis-Hastings and Gibbs sampling are the two canonical MCMC algorithms.

Speech and sequence modelling: Hidden Markov Models (HMMs) model sequences where the true state (e.g., phoneme) is latent. The forward-backward algorithm and Viterbi decoder exploit the Markov structure to perform efficient inference.

Diffusion models: The DDPM forward process $x_0 \to x_1 \to \cdots \to x_T$ is a Markov chain where each step adds Gaussian noise. The learned reverse process is a Markov chain running backwards in time. The SDE perspective (Section 06 Stochastic Processes) gives the continuous-time version.

1.3 Historical Timeline

| Year | Contributor | Result |
|---|---|---|
| 1906 | Andrei Markov | Defined Markov chains; proved a law of large numbers for dependent sequences |
| 1907 | Paul Ehrenfest | Urn model illustrating approach to equilibrium; connection to statistical mechanics |
| 1907-1912 | Oskar Perron, Georg Frobenius | Perron-Frobenius theorem (irreducible non-negative matrices have a unique dominant eigenvector) |
| 1913 | Andrei Markov | First application: statistical analysis of letters in Pushkin's Eugene Onegin |
| 1931 | Andrei Kolmogorov | Forward and backward equations for Markov processes; rigorous foundation |
| 1949 | Metropolis & Ulam | Monte Carlo method for statistical mechanics |
| 1953 | Metropolis, Rosenbluth ×2, Teller ×2 | Metropolis algorithm published; first MCMC |
| 1970 | W.K. Hastings | Generalized Metropolis to asymmetric proposals |
| 1984 | Geman & Geman | Gibbs sampling for image restoration |
| 1998 | Page, Brin, Motwani, Winograd | PageRank: web ranking via Markov chain stationary distribution |
| 2009 | Levin, Peres, Wilmer | Markov Chains and Mixing Times: systematic mixing time theory; coupling and spectral gap bounds |
| 2020 | Ho, Jain & Abbeel | DDPM: diffusion as a Markov chain with a learned reverse process |
| 2022 | Various | LLMs as Markov chains; stationary distribution theory for language models |

1.4 Taxonomy of Markov Chains

Markov chains are classified along two dimensions:

State space: Finite ($\mathcal{S} = \{1,\ldots,N\}$), countably infinite ($\mathcal{S} = \mathbb{Z}^+$), or continuous ($\mathcal{S} = \mathbb{R}^d$). Most of this section treats finite or countably infinite state spaces; MCMC typically targets continuous spaces.

Time: Discrete ($n = 0, 1, 2, \ldots$) or continuous ($t \geq 0$). Discrete-time chains are developed in Sections 2-6; continuous-time chains in Section 7.

| | Discrete Time | Continuous Time |
|---|---|---|
| Finite states | Finite DTMC - PageRank, HMM | CTMC - queueing theory |
| Countable states | SRW, birth-death chain | Poisson process |
| Continuous states | MCMC on $\mathbb{R}^d$ | Diffusion SDE (Section 06) |

Time-homogeneous vs. inhomogeneous: If the transition probabilities do not depend on $n$ - that is, $P(X_{n+1}=j \mid X_n=i) = P_{ij}$ for all $n$ - the chain is time-homogeneous. Almost all chains in this section are time-homogeneous.


2. Formal Definitions

2.1 The Markov Property

Recall: Conditional independence $X \perp Y \mid Z$ was defined in Section 03 Joint Distributions. The Markov property is conditional independence of the future from the past, given the present.

Definition (Discrete-Time Markov Chain). A sequence $\{X_n\}_{n \geq 0}$ of random variables on a measurable state space $(\mathcal{S}, \mathcal{B})$ is a (time-homogeneous) Markov chain if for all $n \geq 0$ and all measurable $B \subseteq \mathcal{S}$:

$$P(X_{n+1} \in B \mid X_0, X_1, \ldots, X_n) = P(X_{n+1} \in B \mid X_n) =: K(X_n, B)$$

The function $K : \mathcal{S} \times \mathcal{B} \to [0,1]$ is called the transition kernel or transition probability. For finite $\mathcal{S} = \{1,\ldots,N\}$, it is encoded by the $N \times N$ transition matrix $P$ with $P_{ij} = P(X_{n+1}=j \mid X_n=i)$.

Equivalent formulation via conditional independence: The Markov property is equivalent to:

$$X_{n+1} \perp (X_0, \ldots, X_{n-1}) \mid X_n$$

This is the cleanest statement: given the present $X_n$, the future $X_{n+1}$ is independent of the entire past $(X_0,\ldots,X_{n-1})$. More generally, the strong Markov property asserts this holds at stopping times: $X_{\tau+1} \perp \mathcal{F}_\tau \mid X_\tau$ for any stopping time $\tau$.

Initial distribution: The full law of the chain is specified by:

  1. The initial distribution $\mu_0$ with $\mu_0(i) = P(X_0 = i)$
  2. The transition matrix $P$ with $P_{ij} \geq 0$ and $\sum_j P_{ij} = 1$ for all $i$

Together, $(\mu_0, P)$ determines all finite-dimensional distributions.

2.2 Transition Matrix

For a finite state space $\mathcal{S} = \{1,\ldots,N\}$, the transition matrix $P$ is an $N \times N$ stochastic matrix: every row sums to 1, all entries are non-negative. The $(i,j)$ entry is the probability of jumping from state $i$ to state $j$ in one step.

Distribution at step $n$: If $\mu^{(n)}$ is the row vector of probabilities at step $n$, then:

$$\mu^{(n)} = \mu^{(0)} P^n$$

The distribution evolves by multiplying the row vector by $P$ on the right at each step. In component form: $\mu^{(n+1)}_j = \sum_i \mu^{(n)}_i P_{ij}$.

$n$-step transition probabilities: The probability of going from $i$ to $j$ in exactly $n$ steps is the $(i,j)$ entry of $P^n$:

$$P^{(n)}_{ij} = P(X_n = j \mid X_0 = i) = (P^n)_{ij}$$

Computing $P^n$ (via eigendecomposition or repeated squaring) is the core computational task for Markov chains.

Non-example: A matrix with a row summing to more than 1, or with any negative entry, is NOT a valid transition matrix. A doubly-stochastic matrix (both rows and columns sum to 1) is a transition matrix, and its stationary distribution is uniform.

For AI: In a language model with vocabulary size $V$, the transition matrix $P$ of a bigram model has size $V \times V$, where $P_{ij}$ is the probability of token $j$ following token $i$. Real LLMs use much richer state spaces (the context window), but the Markov structure is the same.
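A quick numerical check of these formulas (a sketch assuming NumPy, using the weather chain of Example 1 in Section 2.4 with $p = 0.2$, $q = 0.4$):

```python
import numpy as np

# Two-state weather chain: state 0 = sunny, state 1 = rainy.
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

mu = np.array([1.0, 0.0])            # start: certainly sunny
for n in range(1, 21):
    mu = mu @ P                      # mu^(n) = mu^(n-1) P
    if n in (1, 5, 20):
        print(f"mu^({n}) = {mu}")    # converges toward (2/3, 1/3)

# n-step transition probabilities: P^(n) = n-th matrix power of P
P3 = np.linalg.matrix_power(P, 3)
print("P(X_3 = rainy | X_0 = sunny) =", P3[0, 1])
```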

2.3 Chapman-Kolmogorov Equations

Theorem (Chapman-Kolmogorov). For any $m, n \geq 0$ and states $i, j$:

$$P^{(m+n)}_{ij} = \sum_{k \in \mathcal{S}} P^{(m)}_{ik} P^{(n)}_{kj}$$

In matrix form: $P^{m+n} = P^m P^n$. This is simply the rule for matrix multiplication - to go from $i$ to $j$ in $m+n$ steps, sum over all intermediate states $k$ at time $m$.

Proof: By the law of total probability and the Markov property:

$$P(X_{m+n}=j \mid X_0=i) = \sum_k P(X_{m+n}=j \mid X_m=k)\, P(X_m=k \mid X_0=i) = \sum_k P^{(n)}_{kj} P^{(m)}_{ik}$$

Chapman-Kolmogorov is the reason matrix multiplication is the right operation for composing Markov chain transitions. It also says: to predict $m+n$ steps ahead, you can decompose the prediction at any intermediate time $m$.

2.4 Standard Examples

Example 1: Two-state chain (weather model)

$$P = \begin{pmatrix} 1-p & p \\ q & 1-q \end{pmatrix}$$

State 1 = sunny, state 2 = rainy. Sunny stays sunny with probability $1-p$, turns rainy with probability $p$. Rainy turns sunny with probability $q$. The stationary distribution is $\pi = (q/(p+q),\ p/(p+q))$.

Example 2: Simple random walk on $\{0,1,\ldots,N\}$ with reflecting barriers

$$P_{ij} = \begin{cases} 1/2 & |i-j|=1,\ 0 < i < N \\ 1 & i=0,\, j=1 \ \text{ or } \ i=N,\, j=N-1 \end{cases}$$

The particle bounces off the walls at 0 and $N$. Stationary distribution (from detailed balance): $\pi_0 = \pi_N = 1/(2N)$ and $\pi_i = 1/N$ for $0 < i < N$ - uniform on the interior, with half-weight at the walls.

Example 3: PageRank chain

$$P_{ij} = \alpha \cdot \frac{A_{ij}}{\text{out-degree}(i)} + (1-\alpha) \cdot \frac{1}{N}$$

where $A$ is the web adjacency matrix and $\alpha \in (0,1)$ is the damping factor (usually $\alpha = 0.85$). The teleportation term $(1-\alpha)/N$ ensures the chain is irreducible and aperiodic. The stationary distribution is the PageRank vector.

Example 4: Gibbs chain on $\{0,1\}^2$

For the Ising model on two spins, $P$ is a $4 \times 4$ matrix constructed by randomly updating one spin at a time conditional on the other. This is the simplest MCMC example.


3. Classification of States

3.1 Communicating Classes and Irreducibility

Definition. State $j$ is accessible from state $i$ (written $i \to j$) if there exists $n \geq 0$ such that $P^{(n)}_{ij} > 0$. States $i$ and $j$ communicate (written $i \leftrightarrow j$) if $i \to j$ and $j \to i$.

Communication is an equivalence relation, and its equivalence classes are called communicating classes. A communicating class $C$ is closed if no state outside $C$ is accessible from inside: $i \in C$ and $i \to j$ implies $j \in C$.

Definition (Irreducible Chain). A Markov chain is irreducible if it has exactly one communicating class - every state communicates with every other. Equivalently, for every pair $(i,j)$ there exists $n$ with $P^{(n)}_{ij} > 0$.

Irreducibility is the "connectivity" condition for Markov chains. An irreducible chain cannot get trapped in a subset of states. Most of the theory (stationary distributions, convergence) requires irreducibility.

Decomposition theorem: Any finite Markov chain decomposes into transient states (which the chain eventually leaves forever) and one or more closed communicating classes of recurrent states. Once the chain enters a closed class, it stays there forever.

For AI: A language model's transition matrix is nearly irreducible: any token sequence of sufficient length can be generated (in principle). Practical degeneracy (zero-probability transitions as the softmax temperature goes to 0) creates absorbing states.
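Communicating classes can be computed mechanically from reachability. A minimal sketch (assuming NumPy; boolean closure by repeated squaring, fine for small chains):

```python
import numpy as np

def communicating_classes(P):
    """Partition states into communicating classes.

    i -> j iff some power of (I + P) has a positive (i, j) entry;
    boolean squaring reaches the transitive closure quickly.
    """
    N = P.shape[0]
    R = (np.eye(N) + P) > 0                        # 1-step reachability (incl. self)
    for _ in range(int(np.ceil(np.log2(max(N, 2))))):
        R = (R.astype(float) @ R.astype(float)) > 0
    mutual = R & R.T                               # i <-> j
    seen, classes = set(), []
    for i in range(N):
        if i not in seen:
            cls = set(np.flatnonzero(mutual[i]))
            classes.append(sorted(cls))
            seen |= cls
    return classes

# The Exercise 2 chain (0-indexed): classes {0,1}, {2}, {3,4}.
P = np.array([[0, 1, 0, 0, 0],
              [.5, 0, .5, 0, 0],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
print(communicating_classes(P))   # [[0, 1], [2], [3, 4]]
```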

3.2 Recurrence and Transience

Definition. Define the first return time to state $i$: $T_i = \min\{n \geq 1 : X_n = i\}$, with $T_i = \infty$ if the chain never returns. The return probability is:

$$f_i = P(T_i < \infty \mid X_0 = i)$$

  • State $i$ is recurrent if $f_i = 1$ (return is certain)
  • State $i$ is transient if $f_i < 1$ (positive probability of never returning)

Theorem (Recurrence criterion). State $i$ is recurrent if and only if $\sum_{n=0}^\infty P^{(n)}_{ii} = \infty$, and transient if and only if $\sum_{n=0}^\infty P^{(n)}_{ii} < \infty$.

Proof sketch: Let $R_i = \sum_{n=1}^\infty \mathbf{1}[X_n=i]$ be the number of returns to $i$. Then $\mathbb{E}[R_i \mid X_0=i] = \sum_{n=1}^\infty P^{(n)}_{ii}$. If $f_i = 1$, the chain returns infinitely often almost surely, so the sum diverges. If $f_i < 1$, the number of returns is geometric with finite mean, so the sum converges.

Positive vs. null recurrence: Among recurrent states, define the mean return time $\mu_i = \mathbb{E}[T_i \mid X_0=i]$.

  • Positive recurrent: $f_i = 1$ and $\mu_i < \infty$ (returns in finite average time)
  • Null recurrent: $f_i = 1$ but $\mu_i = \infty$ (certain to return, but the expected wait is infinite)

All states of a finite irreducible chain are positive recurrent (finite chains cannot be null-recurrent). The simple random walk on $\mathbb{Z}^d$ is null-recurrent in one and two dimensions, transient in three and above (Polya's theorem).

For AI: In MCMC, we want the chain to be recurrent (it returns to every region of the posterior). A transient MCMC chain would drift away from the posterior, giving biased samples.

3.3 Periodicity

Definition. The period of state $i$ is:

$$d(i) = \gcd\{n \geq 1 : P^{(n)}_{ii} > 0\}$$

State $i$ is aperiodic if $d(i) = 1$; otherwise it is periodic with period $d(i)$.

Key facts:

  1. All states in the same communicating class have the same period
  2. A chain is called aperiodic if all states have period 1
  3. If $P_{ii} > 0$ for some $i$, then $d(i) = 1$ (self-loops force aperiodicity)
  4. An irreducible aperiodic chain converges to its stationary distribution (Section 4.3)

Example: A two-state chain $0 \leftrightarrow 1$ with $P_{01} = P_{10} = 1$ is periodic with period 2: the chain alternates 0,1,0,1,... and $P^{(n)}_{00} = 1$ only for even $n$.

Periodicity breaks convergence: If a chain has period $d > 1$, then $P^{(n)}_{ij}$ does not converge as $n \to \infty$ - it oscillates. Adding any self-loop probability $\varepsilon > 0$ makes the chain aperiodic.

For AI: A Metropolis-Hastings chain whose proposals can be rejected is automatically aperiodic (the chain stays put whenever a proposal is rejected, creating a self-loop). This is one reason MCMC chains converge.

3.4 Absorbing States and Absorption Probabilities

Definition. State $i$ is absorbing if $P_{ii} = 1$ (the chain stays at $i$ forever once it arrives). A chain is called absorbing if it has at least one absorbing state and every state can reach some absorbing state.

Canonical form: For an absorbing chain with $r$ absorbing states and $t$ transient states, the transition matrix can be written in canonical form:

$$P = \begin{pmatrix} Q & R \\ 0 & I_r \end{pmatrix}$$

where $Q$ is the $t \times t$ sub-matrix of transitions among transient states, $R$ is the $t \times r$ matrix of transitions into absorbing states, and $I_r$ is the identity on absorbing states.

Fundamental matrix: $N = (I - Q)^{-1}$. The $(i,j)$ entry of $N$ is the expected number of visits to transient state $j$ before absorption, starting from transient state $i$.

Absorption probabilities: $B = NR$ gives the probability of absorption into each absorbing state. The gambler's ruin formula $P(\text{reach } N \mid \text{start at } a) = a/N$ (derived via the optional stopping theorem in Section 06) follows directly from this formula.
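A minimal sketch of the fundamental-matrix computation for fair gambler's ruin on $\{0,\ldots,5\}$ (assuming NumPy; the state size and starting fortune are illustrative):

```python
import numpy as np

# Gambler's ruin on {0,...,5}: states 0 and 5 absorbing, fair coin inside.
N = 5
P = np.zeros((N + 1, N + 1))
P[0, 0] = P[N, N] = 1.0
for i in range(1, N):
    P[i, i - 1] = P[i, i + 1] = 0.5

transient = list(range(1, N))                  # canonical form: transient block
absorbing = [0, N]
Q = P[np.ix_(transient, transient)]
R = P[np.ix_(transient, absorbing)]

Nmat = np.linalg.inv(np.eye(len(transient)) - Q)   # fundamental matrix
B = Nmat @ R                                       # absorption probabilities
a = 2                                              # starting fortune
row = transient.index(a)
print("P(reach N | start at a):", B[row, 1])       # = a/N = 0.4
print("Expected steps to absorption:", Nmat.sum(axis=1)[row])  # a(N-a) = 6
```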

3.5 Ergodic Chains

Definition (Ergodic Chain). A Markov chain is ergodic if it is:

  1. Irreducible - every state communicates with every other
  2. Positive recurrent - the mean return time is finite for every state
  3. Aperiodic - all states have period 1

Ergodic chains are the "nice" chains: they have a unique stationary distribution, and the chain converges to it from any starting state.

For finite chains: An irreducible chain on a finite state space is automatically positive recurrent. So for finite chains, ergodic = irreducible + aperiodic.

Ergodic theorem: For an ergodic chain, the time-average converges to the space-average:

$$\frac{1}{n}\sum_{k=0}^{n-1} f(X_k) \xrightarrow{n\to\infty} \sum_i \pi_i f(i) = \mathbb{E}_\pi[f] \quad \text{a.s.}$$

This is the Markov chain version of the law of large numbers. It is why MCMC works: time-averaging over the chain's trajectory approximates the posterior expectation.


4. Stationary Distributions

4.1 Definition and Existence

Definition (Stationary Distribution). A probability distribution $\pi$ on $\mathcal{S}$ is a stationary distribution (also called invariant distribution or steady-state distribution) for a Markov chain with transition matrix $P$ if:

$$\pi P = \pi \qquad \text{i.e.,} \qquad \pi_j = \sum_i \pi_i P_{ij} \ \text{ for all } j$$

Equivalently, $\pi$ is a left eigenvector of $P$ with eigenvalue 1. If the chain starts in distribution $\pi$, it stays in distribution $\pi$ for all time: $\mu^{(0)} = \pi \Rightarrow \mu^{(n)} = \pi$ for all $n$.

Interpretation: $\pi_i$ represents the long-run fraction of time the chain spends in state $i$. For PageRank, $\pi_i$ is the importance score of page $i$. For MCMC, $\pi$ is the target posterior distribution.

Existence: Every finite irreducible Markov chain has a unique stationary distribution $\pi$, with $\pi_i > 0$ for all $i$. For infinite chains, existence requires positive recurrence.

Non-uniqueness when reducible: A chain with multiple closed communicating classes has infinitely many stationary distributions - any convex combination of the stationary distributions of the individual closed classes is stationary.

The mean return time formula: For a positive recurrent chain, $\pi_i = 1/\mu_i$, where $\mu_i = \mathbb{E}[T_i \mid X_0=i]$ is the mean return time. States visited more often (smaller $\mu_i$) get higher stationary probability.

4.2 Perron-Frobenius Theorem

The Perron-Frobenius theorem is the fundamental result guaranteeing existence and uniqueness of the stationary distribution.

Theorem (Perron-Frobenius for Stochastic Matrices). Let $P$ be an irreducible stochastic matrix. Then:

  1. $\lambda = 1$ is an eigenvalue of $P$ (always true for stochastic matrices, since $P\mathbf{1} = \mathbf{1}$)
  2. $|\lambda| \leq 1$ for all other eigenvalues $\lambda$
  3. There exists a unique probability vector $\pi > 0$ (componentwise) with $\pi P = \pi$
  4. $\pi$ is the unique (up to scaling) left eigenvector with eigenvalue 1

If additionally $P$ is primitive (irreducible and aperiodic), then $\lambda = 1$ is the only eigenvalue on the unit circle, and $|\lambda_2| < 1$ strictly. This gap drives convergence.

Proof sketch (uniqueness): Suppose $\pi$ and $\nu$ are both stationary, and let $\mu = \pi - \nu$. Then $\mu P = \mu$ with $\sum_i \mu_i = 0$. Irreducibility forces any non-zero invariant vector to have strictly constant sign (all entries positive or all negative), which contradicts $\sum_i \mu_i = 0$. Hence $\mu = 0$.

For AI: PageRank power iteration converges because the damped PageRank matrix is primitive (the teleportation term makes it irreducible and aperiodic). The dominant eigenvalue is 1, and all other eigenvalues are at most $\alpha < 1$ in magnitude, guaranteeing geometric convergence.

4.3 Convergence to Stationarity

Theorem (Convergence for Ergodic Chains). Let $P$ be irreducible and aperiodic with stationary distribution $\pi$. Then for any initial distribution $\mu^{(0)}$:

$$\|\mu^{(0)} P^n - \pi\|_{\text{TV}} \xrightarrow{n\to\infty} 0$$

Moreover, for any starting state $i$: $P^{(n)}_{ij} \to \pi_j$ as $n \to \infty$ for all $j$.

Rate of convergence: For a reversible chain with eigenvalues $1 = \lambda_1 > \lambda_2 \geq \cdots \geq \lambda_N \geq -1$, the convergence rate is governed by $\lambda_* = \max(|\lambda_2|, |\lambda_N|)$:

$$\|P^n(i,\cdot) - \pi\|_{\text{TV}} \leq \frac{1}{2\sqrt{\pi_i}}\, \lambda_*^n$$

The rate $\lambda_*$ is called the second largest eigenvalue modulus (SLEM). The spectral gap is $1 - \lambda_*$; a large gap means fast convergence.

Periodic chains: If $P$ is irreducible but periodic with period $d$, then $P^{(n)}_{ij}$ does not converge - it cycles. However, $P^d$ (the $d$-step chain) is aperiodic on each cyclic class and does converge. In practice, a small perturbation (adding laziness) kills periodicity.

The lazy chain: Replace $P$ with $(I + P)/2$. This halves the spectral gap but makes the chain aperiodic with all eigenvalues non-negative, ensuring monotone convergence.

4.4 Null-Recurrent Chains

The simple random walk on $\mathbb{Z}$ returns to the origin with probability 1 (recurrent), but the expected return time is infinite (null-recurrent). Such chains have no stationary probability distribution - the "stationary measure" $\pi_i$ is not normalisable.

The 1D random walk has invariant measure $\pi_i = 1$ for all $i$, which has infinite total mass $\sum_i \pi_i = \infty$. The chain visits every state infinitely often, but spends a vanishing fraction of time at any fixed state.

Implications for AI: Infinite-state chains that model text generation (treating the entire prefix as state) may be null-recurrent or transient. In practice, language models cap context length, making the state space finite and all states positive recurrent.

4.5 Computing Stationary Distributions

Method 1: Solve the linear system

$$\pi P = \pi, \quad \|\pi\|_1 = 1, \quad \pi_i \geq 0$$

For an irreducible chain, the system $\pi P = \pi$ has rank $N-1$, so one of its $N$ equations is redundant; replacing it with the normalisation $\sum_i \pi_i = 1$ gives a non-singular system. For small $N$, direct linear algebra is exact.

Method 2: Power iteration

$$\pi^{(k+1)} = \pi^{(k)} P$$

Starting from any distribution $\pi^{(0)} > 0$, the iterates converge to $\pi$ at rate $\lambda_*^k$. Power iteration is the PageRank algorithm.

Method 3: Eigendecomposition. Compute the left eigenvector of $P$ corresponding to eigenvalue 1. For symmetric or reversible chains, this is equivalent to finding the dominant eigenvector.

Method 4: MCMC (for continuous state spaces). For distributions $\pi(x) \propto \exp(-f(x))$ on $\mathbb{R}^d$, construct a Markov chain with stationary distribution $\pi$ and run it - the empirical distribution of the chain approximates $\pi$ by the ergodic theorem.
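Methods 1 and 2 in code (a sketch assuming NumPy; the function names are ours):

```python
import numpy as np

def stationary_linear(P):
    """Method 1: replace one redundant equation of (P^T - I) pi = 0
    with the normalisation sum(pi) = 1, then solve exactly."""
    N = P.shape[0]
    A = P.T - np.eye(N)
    A[-1, :] = 1.0                    # overwrite last equation with sum = 1
    b = np.zeros(N); b[-1] = 1.0
    return np.linalg.solve(A, b)

def stationary_power(P, iters=10_000, tol=1e-12):
    """Method 2: power iteration pi <- pi P until convergence."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        new = pi @ P
        done = np.abs(new - pi).sum() < tol
        pi = new
        if done:
            break
    return pi

P = np.array([[0.8, 0.2], [0.4, 0.6]])
print(stationary_linear(P))   # [2/3, 1/3]
print(stationary_power(P))    # same, up to tolerance
```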


5. Detailed Balance and Reversibility

5.1 Detailed Balance Equations

Definition (Detailed Balance). A distribution $\pi$ satisfies the detailed balance equations with respect to transition matrix $P$ if:

$$\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for all } i, j \in \mathcal{S}$$

Detailed balance says: the probability flux from $i$ to $j$ at stationarity equals the flux from $j$ to $i$. This is a pointwise balance of probability flows, stronger than the global balance equation $\pi P = \pi$.

Theorem. If $\pi$ satisfies detailed balance with $P$, then $\pi$ is a stationary distribution for $P$.

Proof:

$$\sum_i \pi_i P_{ij} = \sum_i \pi_j P_{ji} = \pi_j \sum_i P_{ji} = \pi_j \cdot 1 = \pi_j$$

So $\pi P = \pi$. The converse is false - a stationary distribution need not satisfy detailed balance.

Significance: Detailed balance is easier to verify than stationarity (it is a pairwise condition rather than a global one), and it characterises reversible chains. All Metropolis-Hastings algorithms are constructed to satisfy detailed balance by design.
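A small checker makes the pairwise nature of the condition concrete, and exhibits the failed converse using the 3-cycle that reappears in Section 5.2 (a sketch assuming NumPy):

```python
import numpy as np

def satisfies_detailed_balance(P, pi, tol=1e-10):
    """Check pi_i P_ij == pi_j P_ji for all pairs (i, j)."""
    F = pi[:, None] * P               # flux matrix: F_ij = pi_i P_ij
    return np.allclose(F, F.T, atol=tol)

# The deterministic 3-cycle: uniform stationary distribution,
# but no detailed balance (all flux flows one way around the cycle).
cycle = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0]], dtype=float)
pi = np.ones(3) / 3
print(satisfies_detailed_balance(cycle, pi))   # False
print(np.allclose(pi @ cycle, pi))             # True: still stationary
```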

5.2 Reversible Markov Chains

Definition (Reversible Chain). A Markov chain is reversible (or time-reversible) with respect to distribution $\pi$ if it satisfies detailed balance. Equivalently, started from $\pi$, the chain run forwards in time has the same law as the chain run backwards.

The reversed chain: Given a stationary Markov chain with transition matrix $P$, define the time-reversed chain by $\hat{P}_{ij} = \pi_j P_{ji} / \pi_i$. The reversed chain is also a Markov chain, with the same stationary distribution. The original chain is reversible iff $\hat{P} = P$.

Symmetric Markov chains are reversible: If $P_{ij} = P_{ji}$ for all $i,j$ (symmetric transition matrix), then detailed balance holds with the uniform distribution $\pi_i = 1/N$. The simple random walk on a graph is also reversible, with $\pi_i \propto \deg(i)$ (uniform when the graph is regular).

Non-reversible example: The cyclic chain $1 \to 2 \to 3 \to 1$ (with $P_{12} = P_{23} = P_{31} = 1$) has uniform stationary distribution but is NOT reversible: it has a preferred direction of rotation.

For AI: Reversibility is a sufficient condition for correctness of MCMC algorithms. A Metropolis-Hastings chain with target $\pi$ satisfies detailed balance - it is reversible - and therefore has $\pi$ as its stationary distribution. Non-reversible MCMC (e.g., lifting methods, HMC) can mix faster but is harder to analyse.

5.3 Birth-Death Chains

A birth-death chain is a Markov chain on $\{0,1,2,\ldots\}$ where only nearest-neighbour transitions are allowed:

$$P_{i,i+1} = p_i, \quad P_{i,i-1} = q_i = 1-p_i, \quad P_{ij} = 0 \ \text{ for } |i-j| > 1$$

Birth-death chains are automatically reversible, and their stationary distribution can be computed explicitly. The detailed balance equations $\pi_i p_i = \pi_{i+1} q_{i+1}$ give:

$$\pi_n = \pi_0 \prod_{k=0}^{n-1} \frac{p_k}{q_{k+1}}$$

Choosing $\pi_0$ so that $\sum_n \pi_n = 1$ (normalisation) gives the stationary distribution.

Queue models: An $M/M/1$ queue with arrival rate $\lambda$ and unit service rate is a birth-death chain with traffic intensity $\rho = \lambda$. Stationary distribution: $\pi_n = (1-\rho)\rho^n$. The chain is stable (a stationary distribution exists) iff $\rho < 1$.

For AI: The queue length at an LLM serving system (number of pending requests) is approximately a birth-death chain. The stationary distribution under load $\rho = \lambda/\mu$ determines average latency: $\mathbb{E}[\text{queue length}] = \rho/(1-\rho)$.

5.4 Spectral Decomposition of Reversible Chains

For a reversible chain, the transition matrix $P$ has real eigenvalues $1 = \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_N \geq -1$ and an orthonormal basis of eigenvectors (with respect to the $\pi$-weighted inner product).

Spectral decomposition:

$$P^n_{ij} = \pi_j \sum_{k=1}^N \lambda_k^n \phi_k(i) \phi_k(j)$$

where the $\phi_k$ are the eigenfunctions (normalised in $L^2(\pi)$). As $n \to \infty$, only the $k=1$ term survives (since $|\lambda_k| < 1$ for $k \geq 2$ in an ergodic chain), giving $P^n_{ij} \to \pi_j$. The rate of convergence is $|\lambda_2|^n$.

The spectral gap: $\text{gap} = 1 - \lambda_2$ measures how fast the chain forgets its initial state. A chain with spectral gap $\delta$ mixes in $O(1/\delta)$ steps. Bounding the spectral gap from below is the main technical challenge in proving MCMC convergence rates.

Dirichlet form: For reversible chains, the spectral gap equals:

$$\text{gap} = \min_{f \perp \mathbf{1}} \frac{\mathcal{E}(f,f)}{\text{Var}_\pi(f)}$$

where $\mathcal{E}(f,f) = \frac{1}{2}\sum_{i,j} \pi_i P_{ij}(f(j)-f(i))^2$ is the Dirichlet form. This variational characterisation is the basis for Poincare-inequality methods.


6. Mixing Times and Convergence Rates

6.1 Total Variation Distance

Definition. The total variation (TV) distance between two probability distributions $\mu$ and $\nu$ on $\mathcal{S}$ is:

$$\|\mu - \nu\|_{\text{TV}} = \sup_{A \subseteq \mathcal{S}} |\mu(A) - \nu(A)| = \frac{1}{2}\sum_i |\mu_i - \nu_i|$$

TV distance measures how distinguishable two distributions are: it equals the maximum amount by which the probability of any event $A$ can differ under $\mu$ vs. $\nu$. The TV distance is 0 iff $\mu = \nu$, and 1 iff the distributions have disjoint support.

Coupling characterisation: $\|\mu - \nu\|_{\text{TV}} = \inf_{(X,Y)} P(X \neq Y)$, where the infimum is over all couplings $(X,Y)$ with marginals $\mu$ and $\nu$. This is the coupling inequality: any coupling gives an upper bound on TV distance.

For AI: TV distance is used to measure how far an MCMC chain is from stationarity, and to certify when the chain has "mixed" sufficiently to produce approximately independent samples from the posterior.

6.2 Mixing Time

Definition. The mixing time of a Markov chain with stationary distribution $\pi$ is:

$$t_{\text{mix}}(\varepsilon) = \min\left\{t \geq 0 : \max_{x \in \mathcal{S}} \|P^t(x,\cdot) - \pi\|_{\text{TV}} \leq \varepsilon\right\}$$

The standard mixing time is $t_{\text{mix}} = t_{\text{mix}}(1/4)$ (the choice $\varepsilon = 1/4$ is conventional). Mixing time measures how long the chain must run before its distribution is $\varepsilon$-close to stationarity from the worst-case starting state.

Why the worst case? In applications, we often don't control the starting state. The worst-case mixing time certifies that regardless of initialisation, the chain has converged after $t_{\text{mix}}$ steps.

Submultiplicativity: $t_{\text{mix}}(2^{-k}) \leq k \cdot t_{\text{mix}}$. So to achieve precision $\varepsilon = 2^{-k}$, it suffices to run for $k \cdot t_{\text{mix}}$ steps.
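For small chains the mixing time can be computed by brute force straight from the definition (a sketch assuming NumPy; the lazy cycle walk is illustrative):

```python
import numpy as np

def tv_distance(mu, nu):
    return 0.5 * np.abs(mu - nu).sum()

def mixing_time(P, pi, eps=0.25, max_t=10_000):
    """Smallest t with max_x ||P^t(x, .) - pi||_TV <= eps."""
    N = P.shape[0]
    Pt = np.eye(N)                         # P^0
    for t in range(1, max_t + 1):
        Pt = Pt @ P
        worst = max(tv_distance(Pt[x], pi) for x in range(N))
        if worst <= eps:
            return t
    return None

# Lazy random walk on a cycle of 8 states (aperiodic, uniform pi).
N = 8
P = np.zeros((N, N))
for i in range(N):
    P[i, i] = 0.5
    P[i, (i - 1) % N] = P[i, (i + 1) % N] = 0.25
pi = np.ones(N) / N
print("t_mix(1/4) =", mixing_time(P, pi))
```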

6.3 Spectral Gap

For a reversible ergodic chain, the spectral gap $\text{gap} = 1 - \lambda_*$ (with $\lambda_*$ the SLEM of Section 4.3) bounds the mixing time:

Theorem (Spectral Gap Bound):

$$t_{\text{mix}}(\varepsilon) \leq \left\lceil \frac{\log(1/(\varepsilon \pi_{\min}))}{\text{gap}} \right\rceil$$

where $\pi_{\min} = \min_i \pi_i$.

Proof sketch: From the spectral decomposition, $\|P^n(x,\cdot) - \pi\|_{\text{TV}} \leq \frac{1}{2\sqrt{\pi_x}} (1-\text{gap})^n$. Setting this $\leq \varepsilon$ and solving for $n$ gives the bound.

Lower bound: There is also a matching lower bound, $t_{\text{mix}} \geq \frac{1}{2\,\text{gap}}$, so the mixing time is $\Theta(1/\text{gap})$ for reversible chains, up to the $\log(1/\pi_{\min})$ factor.

Computing the spectral gap: For small matrices, compute the eigenvalues of $P$ directly. For large chains, use the Cheeger inequality: $\text{gap}/2 \leq h \leq \sqrt{2\,\text{gap}}$, where $h$ is the Cheeger constant (conductance) of the chain.

6.4 Coupling Arguments

Definition (Coupling). A coupling of two copies of a Markov chain with transition matrix $P$ and initial distributions $\mu, \nu$ is a process $(X_t, Y_t)$ such that each marginal is a valid Markov chain with transition matrix $P$, and once $X_t = Y_t$ the two chains move together.

Coupling inequality:

$$\|P^t(\mu,\cdot) - P^t(\nu,\cdot)\|_{\text{TV}} \leq P(X_t \neq Y_t) \leq P(\tau_{\text{meet}} > t)$$

where $\tau_{\text{meet}} = \min\{t : X_t = Y_t\}$ is the coupling (meeting) time.

Grand coupling: For a finite ergodic chain, one can construct a single coupling $\{(X_t^x, X_t^y) : x, y \in \mathcal{S}\}$ of chains starting from all possible pairs $(x,y)$ simultaneously. The mixing time then satisfies $t_{\text{mix}} = O(\max_{x,y} \mathbb{E}[\tau_{\text{meet}}])$.

Example: Random walk on the hypercube. For a lazy random walk on $\{0,1\}^d$, couple two chains by updating the same coordinate in both at each step. Once a coordinate has been chosen, the two chains agree on that bit forever after. All $d$ bits agree after $O(d \log d)$ steps (coupon collector), giving $t_{\text{mix}} = O(d \log d)$.

6.5 AI Relevance: Fast Mixing and MCMC Warm Starts

Fast mixing in practice: For Bayesian inference in LLM fine-tuning (e.g., RLHF reward models), the posterior over model weights is a distribution on $\mathbb{R}^d$ with $d \sim 10^9$. Direct MCMC is intractable; approximate methods (Langevin dynamics, HMC) exploit gradient information to achieve better mixing.

Spectral gap and dimensionality: For a Gaussian target $\mathcal{N}(0, \Sigma)$, the Langevin-dynamics mixing time scales as $O(\kappa)$, where $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number of $\Sigma$. Preconditioning (e.g., SGLD with adaptive learning rates, Adam-Langevin) improves the effective spectral gap.

Warm starts: If the initial distribution $\mu^{(0)}$ is already close to $\pi$ (e.g., from a previous MCMC run), the chain needs fewer steps to mix. In sequential Bayesian updating (fine-tuning an LLM on new data), the previous posterior serves as a warm start.

Temperature scheduling: Simulated annealing starts with a high-temperature chain (wide, fast-mixing distribution) and gradually cools toward the target. This exploits fast mixing at high temperature and accurate targeting at low temperature.


7. Continuous-Time Markov Chains

7.1 Generator Matrix

A continuous-time Markov chain (CTMC) $\{X_t\}_{t \geq 0}$ on a finite state space $\mathcal{S}$ satisfies the Markov property in continuous time:

$$P(X_{t+s}=j \mid X_u : u \leq s) = P(X_{t+s}=j \mid X_s)$$

The chain holds in each state for an exponentially distributed sojourn time, then jumps according to a discrete-time chain. The rate of leaving state $i$ is $q_i = \sum_{j \neq i} q_{ij}$, where $q_{ij} \geq 0$ are the jump rates from $i$ to $j$.

Definition (Generator Matrix / Q-Matrix). The generator (or Q-matrix) of a CTMC is the matrix $Q$ with:

$$Q_{ij} = \begin{cases} q_{ij} \geq 0 & i \neq j \quad \text{(jump rate from } i \text{ to } j\text{)} \\ -q_i = -\sum_{j \neq i} q_{ij} & i = j \quad \text{(total exit rate)} \end{cases}$$

Key property: the rows of $Q$ sum to zero ($Q\mathbf{1} = \mathbf{0}$). The diagonal entries are non-positive; the off-diagonal entries are non-negative.

Transition matrix: The probability of being in state $j$ at time $t$ given state $i$ at time 0 is the $(i,j)$ entry of:

$$P(t) = e^{Qt} = \sum_{n=0}^\infty \frac{(Qt)^n}{n!}$$

This is the matrix exponential. Note $P(0) = I$ and $\lim_{t\to\infty} P(t) = \mathbf{1}\pi$ (convergence to stationarity for ergodic chains).

Embedded chain: The CTMC decomposes into (1) an embedded discrete-time Markov chain governing the sequence of states visited, and (2) exponential holding times in each state. Jump probabilities: $\tilde{P}_{ij} = q_{ij}/q_i$ for $i \neq j$.
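A minimal sketch of $P(t) = e^{Qt}$ for a two-state CTMC (assuming NumPy and SciPy's `expm`; the rates are illustrative):

```python
import numpy as np
from scipy.linalg import expm

# Two-state CTMC: leave state 0 at rate 2, leave state 1 at rate 1.
Q = np.array([[-2.0,  2.0],
              [ 1.0, -1.0]])

for t in (0.1, 1.0, 10.0):
    Pt = expm(Q * t)                  # P(t) = e^{Qt}, matrix exponential
    print(f"P({t}) =\n{Pt}")

# Rows of P(t) converge to the stationary distribution pi with pi Q = 0.
# Here flux balance 2*pi_0 = 1*pi_1 gives pi = (1/3, 2/3).
```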

7.2 Kolmogorov Equations

Forward Equation (Fokker-Planck):

$$\frac{d}{dt} P(t) = P(t)\, Q \quad \Rightarrow \quad P(t) = e^{Qt}$$

This equation describes how the distribution evolves forward in time: $\frac{d}{dt}\mu(t) = \mu(t) Q$, with solution $\mu(t) = \mu(0)\, e^{Qt}$.

Backward Equation:

$$\frac{d}{dt} P(t) = Q\, P(t)$$

The forward equation applies to the time-marginal distribution; the backward equation to the transition probabilities viewed as functions of the initial state.

Stationary distribution: $\pi$ is stationary if $\frac{d}{dt}\mu(t) = 0$ at $\mu = \pi$, i.e.:

$$\pi Q = \mathbf{0} \quad \text{(row vector)} \quad \Leftrightarrow \quad \sum_i \pi_i Q_{ij} = 0 \ \text{ for all } j$$

For an ergodic CTMC, the stationary distribution is unique.

7.3 Stationary Distribution of CTMCs

Detailed balance for CTMCs: A distribution $\pi$ satisfies detailed balance if $\pi_i Q_{ij} = \pi_j Q_{ji}$ for all $i \neq j$. Detailed balance implies stationarity ($\pi Q = \mathbf{0}$).

Computing the stationary distribution: Solve $\pi Q = \mathbf{0}$ with $\sum_i \pi_i = 1$. This is equivalent to finding the null space of $Q^T$ (the eigenvector of $Q^T$ with eigenvalue 0).

Birth-death CTMC: With birth rate $\lambda_n$ from state $n$ and death rate $\mu_n$, detailed balance gives $\pi_n \lambda_n = \pi_{n+1}\mu_{n+1}$, yielding:

$$\pi_n = \pi_0 \prod_{k=0}^{n-1} \frac{\lambda_k}{\mu_{k+1}}$$

7.4 Poisson Process as CTMC

Recall from Section 06 Stochastic Processes: The Poisson process $N_t$ counts arrivals up to time $t$, with independent Poisson increments. Inter-arrival times are $\text{Exp}(\lambda)$.

The Poisson process is the simplest CTMC: states $\{0,1,2,\ldots\}$, jump rates $q_{n,n+1} = \lambda$ (arrivals only), generator:

$$Q = \begin{pmatrix} -\lambda & \lambda & & \\ & -\lambda & \lambda & \\ & & \ddots & \ddots \end{pmatrix}$$

The Poisson process is transient - it drifts to $+\infty$ with no stationary distribution. An $M/M/1$ queue (arrivals at rate $\lambda$, service at rate $\mu > \lambda$) modifies the Poisson process by adding downward jumps (service completions), yielding a birth-death CTMC with geometric stationary distribution $\pi_n = (1-\rho)\rho^n$, where $\rho = \lambda/\mu$.

For AI: The Poisson process models request arrivals at an LLM serving endpoint. At high load ($\rho \to 1$), the queue length distribution becomes heavy-tailed, leading to high-percentile latency spikes. The M/M/1 formula $\mathbb{E}[\text{queue}] = \rho/(1-\rho)$ quantifies this: 90% utilisation means an average queue of 9 requests.


8. MCMC: Markov Chain Monte Carlo

8.1 The Sampling Problem

The challenge: In Bayesian inference, the posterior $\pi(\theta \mid \mathcal{D}) \propto p(\mathcal{D}\mid\theta)\,p(\theta)$ is analytically intractable for most models. Computing $\mathbb{E}_\pi[f(\theta)]$ requires integrating over a high-dimensional distribution we cannot normalise.

MCMC solution: Construct a Markov chain $\{\theta_t\}$ on the parameter space such that:

  1. The chain is ergodic (irreducible, aperiodic, positive recurrent)
  2. The stationary distribution equals the target $\pi$

By the ergodic theorem, time-averaging gives posterior expectations:

$$\frac{1}{T}\sum_{t=1}^T f(\theta_t) \xrightarrow{T\to\infty} \mathbb{E}_\pi[f(\theta)]$$

The key insight: we only need to know $\pi(\theta)$ up to a normalising constant. MCMC constructs the chain using only unnormalised evaluations $\tilde{\pi}(\theta) \propto \pi(\theta)$.

8.2 Metropolis-Hastings

Algorithm (Metropolis-Hastings). Given current state $\theta$, target $\pi$, and proposal $q(\cdot \mid \theta)$:

  1. Sample a candidate $\theta' \sim q(\cdot \mid \theta)$
  2. Compute the acceptance probability: $a = \min\left(1,\ \frac{\pi(\theta')\, q(\theta \mid \theta')}{\pi(\theta)\, q(\theta' \mid \theta)}\right)$
  3. With probability $a$: accept, $\theta_{t+1} = \theta'$; otherwise: $\theta_{t+1} = \theta$ (reject and stay)

Correctness: The Metropolis-Hastings chain satisfies detailed balance with respect to $\pi$:

$$\pi(\theta)\, q(\theta' \mid \theta)\, a(\theta, \theta') = \pi(\theta')\, q(\theta \mid \theta')\, a(\theta', \theta)$$

Proof: Without loss of generality, assume $\pi(\theta')q(\theta\mid\theta') \leq \pi(\theta)q(\theta'\mid\theta)$. Then $a(\theta,\theta') = \frac{\pi(\theta')q(\theta\mid\theta')}{\pi(\theta)q(\theta'\mid\theta)}$ and $a(\theta',\theta) = 1$, so both sides equal $\pi(\theta')q(\theta\mid\theta')$.

Special cases:

  • Metropolis algorithm: A symmetric proposal $q(\theta'\mid\theta) = q(\theta\mid\theta')$ simplifies the acceptance probability to $a = \min(1, \pi(\theta')/\pi(\theta))$
  • Independence sampler: The proposal $q(\theta'\mid\theta) = q(\theta')$ does not depend on the current state; acceptance $a = \min(1, \pi(\theta')q(\theta)/(\pi(\theta)q(\theta')))$

Proposal tuning: For a Gaussian random-walk proposal $q(\theta'\mid\theta) = \mathcal{N}(\theta, \sigma^2 I)$, the optimal $\sigma$ achieves an acceptance rate of about 23.4% in high dimensions (Roberts-Gelman-Gilks 1997). Too small a $\sigma$: high acceptance but slow exploration; too large: low acceptance, inefficient.

For AI: MCMC is used for posterior inference in Bayesian neural networks, uncertainty quantification in LLM outputs, and sampling from energy-based models. In practice, Langevin dynamics (gradient-based MCMC) replaces vanilla MH for high-dimensional targets. A minimal sampler is sketched below.
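A minimal random-walk Metropolis sampler (a sketch assuming NumPy; the bimodal target and step size are illustrative, not taken from the notes above):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    """Unnormalised log-density: equal-weight Gaussian mixture at +/-2."""
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def metropolis(log_pi, x0=0.0, sigma=1.0, n_steps=50_000):
    x, samples, accepts = x0, np.empty(n_steps), 0
    for t in range(n_steps):
        x_prop = x + sigma * rng.standard_normal()   # symmetric proposal
        # a = min(1, pi(x')/pi(x)); symmetric q cancels (Metropolis case)
        if np.log(rng.uniform()) < log_pi(x_prop) - log_pi(x):
            x, accepts = x_prop, accepts + 1
        samples[t] = x                               # stay put on rejection
    return samples, accepts / n_steps

samples, acc_rate = metropolis(log_target)
print(f"acceptance rate = {acc_rate:.2f}")
print(f"sample mean = {samples.mean():.3f} (target mean = 0)")
```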

8.3 Gibbs Sampling

Algorithm (Gibbs Sampling). For target $\pi(\theta_1,\ldots,\theta_d)$, cycle through the coordinates:

  1. Sample $\theta_1^{(t+1)} \sim \pi(\theta_1 \mid \theta_2^{(t)}, \ldots, \theta_d^{(t)})$
  2. Sample $\theta_2^{(t+1)} \sim \pi(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_d^{(t)})$
  3. $\vdots$
  4. Sample $\theta_d^{(t+1)} \sim \pi(\theta_d \mid \theta_1^{(t+1)}, \ldots, \theta_{d-1}^{(t+1)})$

Correctness: Gibbs sampling is a special case of Metropolis-Hastings in which each coordinate update is an MH step whose proposal is the full conditional. The acceptance probability is always 1 - every proposal is accepted. This requires the ability to sample from the full conditionals $\pi(\theta_k \mid \theta_{-k})$.

Block Gibbs: Instead of updating one coordinate at a time, update a block of coordinates jointly. This is more efficient when coordinates within a block are highly correlated.

For AI: Gibbs sampling is used in latent Dirichlet allocation (LDA) for topic modelling - the full conditionals for topic assignments are available in closed form. In RL, it can be used to sample from the Q-function distribution for exploration.
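A minimal sketch for a standard bivariate Gaussian with correlation $\rho$, whose full conditionals are the textbook Normals $\theta_1 \mid \theta_2 \sim \mathcal{N}(\rho\theta_2,\, 1-\rho^2)$ (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n_steps = 0.9, 20_000

theta = np.zeros(2)
samples = np.empty((n_steps, 2))
for t in range(n_steps):
    # Full conditionals of a standard bivariate Gaussian with corr rho:
    theta[0] = rng.normal(rho * theta[1], np.sqrt(1 - rho**2))
    theta[1] = rng.normal(rho * theta[0], np.sqrt(1 - rho**2))
    samples[t] = theta

print("empirical correlation:", np.corrcoef(samples.T)[0, 1])  # ~0.9
```

With $\rho$ close to 1, the coordinate-wise updates take tiny steps along the narrow diagonal ridge; this is exactly the high-correlation regime where block Gibbs (or a reparameterisation) mixes faster.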

8.4 Convergence Diagnostics

Running an MCMC chain for a finite number of steps gives an approximate (not exact) sample from $\pi$. Practical convergence diagnostics:

Trace plots: Plot $\theta_t$ vs. $t$. A well-mixed chain looks like stationary noise. Trends, drifts, or stuck values indicate non-convergence.

$\hat{R}$ statistic (Gelman-Rubin): Run $M \geq 4$ chains from diverse starting points. Compute $\hat{R} = \sqrt{V_{\text{between}}/V_{\text{within}}}$. If $\hat{R} \leq 1.01$, the chains have converged to the same distribution.

Effective sample size (ESS): Due to autocorrelation within the chain, $T$ MCMC samples carry less information than $T$ iid samples. $\text{ESS} = T / (1 + 2\sum_{k=1}^\infty \rho_k)$, where $\rho_k$ is the lag-$k$ autocorrelation. $\text{ESS}/T \ll 1$ indicates poor mixing.

Burn-in: The first $B$ samples (drawn before the chain has mixed) are discarded. Typical burn-in is 10-50% of the total samples.

8.5 Hamiltonian Monte Carlo

Motivation: For distributions on $\mathbb{R}^d$, random-walk Metropolis-Hastings needs step size $O(d^{-1/2})$ to maintain a reasonable acceptance rate, giving diffusive mixing: $O(d)$ steps per effectively independent sample. Hamiltonian Monte Carlo (HMC) exploits gradient information to take large, high-acceptance steps.

Algorithm: Augment $\theta \in \mathbb{R}^d$ with a momentum $p \sim \mathcal{N}(0, M)$. The Hamiltonian is $H(\theta,p) = -\log\pi(\theta) + \frac{1}{2}p^T M^{-1} p$. Propose a new state by simulating Hamiltonian dynamics with the leapfrog integrator for $L$ steps of size $\varepsilon$, then accept/reject with MH probability $\min(1, e^{-\Delta H})$.

Why it works: Exact Hamiltonian dynamics preserves the joint distribution $\propto e^{-H(\theta,p)}$ on $(\theta,p)$. Marginalising out $p$ recovers $\pi(\theta)$. The leapfrog integrator introduces discretisation error, which the MH accept/reject step corrects.

Mixing time: HMC achieves $O(d^{1/4})$ gradient evaluations per independent sample (vs. $O(d)$ for a random walk), making it scalable to high dimensions. NUTS (the No-U-Turn Sampler) automates the choice of $L$ and $\varepsilon$.

For AI: HMC is the backbone of probabilistic programming systems like Stan and PyMC. In Bayesian deep learning, SGLD (stochastic gradient Langevin dynamics) adds Gaussian noise to SGD steps, yielding approximate Bayesian inference at scale. A compact leapfrog sketch follows.
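A compact HMC step for a standard Gaussian target (a sketch assuming NumPy; identity mass matrix, and the values of $\varepsilon$ and $L$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_neg_log_pi(theta):
    """Target: standard Gaussian, so -grad log pi(theta) = theta."""
    return theta

def hmc_step(theta, eps=0.1, L=20):
    p = rng.standard_normal(theta.shape)              # resample momentum
    theta_new, p_new = theta.copy(), p.copy()
    p_new -= 0.5 * eps * grad_neg_log_pi(theta_new)   # leapfrog half-step
    for _ in range(L - 1):
        theta_new += eps * p_new
        p_new -= eps * grad_neg_log_pi(theta_new)
    theta_new += eps * p_new
    p_new -= 0.5 * eps * grad_neg_log_pi(theta_new)   # final half-step
    # MH correction: H = -log pi + |p|^2 / 2 (up to a constant)
    H_old = 0.5 * theta @ theta + 0.5 * p @ p
    H_new = 0.5 * theta_new @ theta_new + 0.5 * p_new @ p_new
    return theta_new if np.log(rng.uniform()) < H_old - H_new else theta

theta = np.zeros(5)
draws = []
for _ in range(2_000):
    theta = hmc_step(theta)
    draws.append(theta.copy())
print("per-coordinate variance (target: ~1):", np.var(np.array(draws), axis=0))
```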


9. ML Deep Dive

9.1 Language Model Generation

Every autoregressive language model - GPT, LLaMA, PaLM, Gemini - generates text as a Markov chain. The state at step $n$ is the entire generated sequence $(w_1,\ldots,w_n)$ (summarised in the KV cache), and the transition is:

$$P(w_{n+1} = v \mid w_1,\ldots,w_n) = \text{softmax}(\text{LM}(w_1,\ldots,w_n))_v$$

Temperature and the stationary distribution: At temperature $\tau$:

$$P_\tau(w_{n+1} = v \mid \text{context}) = \frac{\exp(\text{logit}_v / \tau)}{\sum_{v'} \exp(\text{logit}_{v'}/\tau)}$$

  • $\tau \to 0$: greedy decoding (deterministic); $\tau = 1$: standard sampling; $\tau \to \infty$: uniform
  • The generated sequence is a Markov chain; its stationary distribution (if the chain were infinite) would represent the model's "preferred" discourse

Mixing time of language models: For a well-trained LLM, the Markov chain mixes quickly - the model "forgets" its starting state within a few hundred tokens for most topics. This reflects the inductive bias toward coherent text: the chain has a large spectral gap.

Top-$k$ and nucleus sampling: Top-$k$ sampling restricts transitions to the $k$ highest-probability tokens; nucleus (top-$p$) sampling restricts to the smallest set of tokens whose cumulative probability exceeds $p$. Both modify the transition matrix to reduce the probability of entering stuck, low-quality loops (see the sketch after this subsection).

Repetition and periodicity: Without repetition penalties, language-model chains can enter periodic loops (mode collapse in generation: "the the the ..."). A repetition penalty introduces a state-dependent modification of the transitions that destroys the periodic structure.
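A sketch of nucleus truncation as a rewrite of one transition row (assuming NumPy; the logits are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def nucleus_sample(logits, p=0.9, temperature=1.0):
    """Sample from the smallest token set with cumulative probability >= p."""
    z = logits / temperature
    probs = np.exp(z - np.max(z))
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # tokens in descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1     # minimal nucleus size
    keep = order[:cutoff]
    renorm = probs[keep] / probs[keep].sum() # renormalised transition row
    return rng.choice(keep, p=renorm)

logits = np.array([3.0, 2.5, 1.0, -1.0, -4.0])
print([int(nucleus_sample(logits)) for _ in range(10)])
```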

9.2 PageRank

Setup: Represent the web as a directed graph with $N$ pages. The random surfer model: at each step, a user follows a random outgoing link with probability $\alpha$ (the damping factor) or teleports to a uniformly random page with probability $1-\alpha$.

Transition matrix:

$$P_{ij} = \alpha \cdot \frac{A_{ij}}{d_i} + (1-\alpha) \cdot \frac{1}{N}$$

where $A_{ij} = 1$ if page $i$ links to page $j$, and $d_i = \sum_j A_{ij}$ is the out-degree of page $i$.

Dangling nodes: Pages with no outbound links ($d_i = 0$) would give a row of zeros, breaking stochasticity. Fix: replace the row for each dangling node with the uniform distribution $1/N$. The resulting matrix is stochastic.

PageRank computation: The PageRank vector $\pi$ satisfies $\pi P = \pi$, computed by power iteration:

$$\pi^{(k+1)} = \pi^{(k)} P$$

Convergence is guaranteed by Perron-Frobenius (the damped matrix is primitive). Convergence is geometric at rate $\alpha^k$ (the second eigenvalue has magnitude $\leq \alpha$). With $\alpha = 0.85$, convergence to $10^{-8}$ precision takes $\approx 100$ iterations.

Interpretation: $\pi_i$ is the long-run fraction of time the random surfer spends on page $i$. Pages that receive links from other high-PageRank pages get high scores.

For AI: Influence in a social network, citations in academic papers, and token importance in attention patterns can all be modelled as PageRank variants. The attention mechanism in transformers can be viewed as a differentiable, query-dependent version of PageRank. A minimal implementation appears below.
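A minimal dense-matrix implementation (a sketch assuming NumPy; real web graphs need sparse matrix-vector products rather than an explicit $N \times N$ matrix):

```python
import numpy as np

def pagerank(A, alpha=0.85, iters=100):
    """Power iteration on the damped random-surfer chain."""
    N = A.shape[0]
    out = A.sum(axis=1, keepdims=True)
    # Dangling nodes: replace zero rows with the uniform distribution.
    H = np.where(out > 0, A / np.maximum(out, 1), 1.0 / N)
    G = alpha * H + (1 - alpha) / N           # damped ("Google") matrix
    pi = np.full(N, 1.0 / N)
    for _ in range(iters):
        pi = pi @ G                           # pi^(k+1) = pi^(k) G
    return pi

# Tiny 4-page web: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0 ; 3 is dangling.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
print(pagerank(A))
```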

9.3 Markov Decision Processes

A Markov Decision Process (MDP) extends Markov chains by adding actions:

$$(\mathcal{S}, \mathcal{A}, P, r, \gamma)$$

  • $\mathcal{S}$: state space; $\mathcal{A}$: action space
  • $P(s' \mid s, a)$: transition probabilities given action $a$
  • $r(s, a)$: reward function
  • $\gamma \in [0,1)$: discount factor

A policy $\pi(a \mid s)$ induces a Markov chain on $\mathcal{S}$ with transition matrix $P^\pi(s'\mid s) = \sum_a \pi(a\mid s)\, P(s'\mid s,a)$.

Bellman equation: The value function $V^\pi(s) = \mathbb{E}^\pi[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) \mid s_0=s]$ satisfies:

$$V^\pi(s) = \sum_a \pi(a\mid s)\left[r(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\, V^\pi(s')\right]$$

In matrix form: $\mathbf{V}^\pi = \mathbf{r}^\pi + \gamma P^\pi \mathbf{V}^\pi$, giving $\mathbf{V}^\pi = (I - \gamma P^\pi)^{-1}\mathbf{r}^\pi$.

Policy evaluation = linear system: Given a fixed policy $\pi$, computing $V^\pi$ requires solving a linear system in $(I - \gamma P^\pi)$. This matrix is guaranteed invertible, since $\gamma < 1$ implies all eigenvalues of $\gamma P^\pi$ have magnitude $< 1$. A small numerical check follows.
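Exact policy evaluation against Bellman fixed-point iteration (a sketch assuming NumPy; the induced chain and rewards are illustrative):

```python
import numpy as np

# A 3-state chain induced by some fixed policy, with per-state rewards.
gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])
r_pi = np.array([1.0, 0.0, 2.0])

# Exact: V = (I - gamma P_pi)^{-1} r_pi
V = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print("V^pi (exact)     =", V)

# Sanity check: iterate the Bellman operator V <- r + gamma P V.
V_it = np.zeros(3)
for _ in range(1_000):
    V_it = r_pi + gamma * P_pi @ V_it
print("V^pi (iterative) =", V_it)
```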

RLHF as MDP: Reinforcement learning from human feedback trains a language model (policy $\pi_\theta$) to maximise a reward model $r_\phi$ - an MDP with the context (KV cache) as state, the vocabulary as actions, and the human preference model as reward.

9.4 Hidden Markov Models

A Hidden Markov Model (HMM) is a Markov chain $\{Z_t\}$ of hidden states with observations $\{X_t\}$ emitted from the current hidden state:

$$Z_0 \sim \mu_0, \quad Z_t \mid Z_{t-1} \sim P_{Z_{t-1}}, \quad X_t \mid Z_t \sim E_{Z_t}$$

where $P$ is the transition matrix and $E$ is the emission distribution.

Forward algorithm: Compute $\alpha_t(i) = P(X_1,\ldots,X_t, Z_t=i)$ recursively:

$$\alpha_t(j) = E_j(X_t) \sum_i \alpha_{t-1}(i) P_{ij}, \qquad \alpha_1(j) = \mu_0(j)\, E_j(X_1)$$

Time complexity: $O(T \cdot |\mathcal{S}|^2)$. Used for likelihood computation and decoding.

Viterbi algorithm: Find the most likely hidden state sequence $\hat{Z}_{1:T} = \arg\max_{z_{1:T}} P(z_{1:T} \mid X_{1:T})$ via dynamic programming:

$$\delta_t(j) = E_j(X_t) \max_i \delta_{t-1}(i) P_{ij}$$

Time complexity: the same as the forward algorithm. Both are sketched in code below.
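Minimal sketches of both algorithms (assuming NumPy; unscaled probabilities, so switch to log-space for long sequences):

```python
import numpy as np

def forward(P, E, mu0, obs):
    """alpha_t(j) = P(X_1..X_t, Z_t = j); also returns the log-likelihood."""
    alpha = mu0 * E[:, obs[0]]
    for x in obs[1:]:
        alpha = E[:, x] * (alpha @ P)     # E_j(x) * sum_i alpha_i P_ij
    return alpha, np.log(alpha.sum())

def viterbi(P, E, mu0, obs):
    """Most likely hidden path via max-product dynamic programming."""
    T, N = len(obs), len(mu0)
    delta = mu0 * E[:, obs[0]]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] * P       # scores[i, j] = delta_i P_ij
        back[t] = scores.argmax(axis=0)   # best predecessor for each j
        delta = E[:, obs[t]] * scores.max(axis=0)
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):         # backtrack through the pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Two hidden states, two observation symbols (all values illustrative).
P = np.array([[0.7, 0.3], [0.4, 0.6]])
E = np.array([[0.9, 0.1], [0.2, 0.8]])
mu0 = np.array([0.5, 0.5])
obs = [0, 0, 1, 1, 1]
print(viterbi(P, E, mu0, obs))            # [0, 0, 1, 1, 1]
```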

Baum-Welch (EM for HMMs): Learn $P, E, \mu_0$ from unlabelled observations. E-step: compute forward-backward probabilities. M-step: update parameters using expected sufficient statistics. Converges to a local maximum of the likelihood.

For AI: HMMs are classical sequence models (speech recognition, gene finding). Modern LLMs subsume HMMs: the hidden state (KV cache) is a rich continuous representation of the context, and emission (next-token distribution) is the softmax output.

9.5 Diffusion Models as CTMCs

The DDPM forward process can be viewed as a discrete-time Markov chain with Gaussian transitions:

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I)$$

The marginal is $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)I)$ with $\bar\alpha_t = \prod_{s=1}^t (1-\beta_s)$.
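The closed-form marginal lets you jump to any $x_t$ without simulating the chain step by step (a sketch assuming NumPy; the linear $\beta$ schedule is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # illustrative noise schedule
alpha_bar = np.cumprod(1.0 - betas)      # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x_0) in one shot using the closed-form marginal."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

x0 = np.ones(4)
print(q_sample(x0, 10))    # still close to x0
print(q_sample(x0, 999))   # nearly pure noise
```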

Recall from Section 06 Stochastic Processes: In the SDE formulation (Song et al., 2021), the continuous-time limit of this chain is a diffusion SDE: $dx = f(x,t)\,dt + g(t)\,dB_t$.

The reverse process: By time-reversal of Markov processes (Anderson, 1982), the reverse of the diffusion is again a Markov process. The reverse diffusion has:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(t))$$

The neural network learns to parameterise the reverse transitions - equivalently, to approximate the score function $\nabla_x \log q_t(x)$.


10. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Confusing stationarity with convergence | Stationarity ($\pi P = \pi$) is a property of a distribution; convergence ($\mu^{(n)} \to \pi$) is about the chain's trajectory. The chain may start at $\pi$ and be stationary without "converging" (it's already there). | Distinguish: stationarity is the destination; convergence is the journey. |
| 2 | Assuming irreducibility implies a unique stationary distribution | An irreducible chain on an infinite state space may be null-recurrent (no normalisable stationary distribution), e.g., SRW on $\mathbb{Z}$. | A unique stationary distribution requires irreducibility + positive recurrence. |
| 3 | Forgetting aperiodicity for convergence | An irreducible, positive-recurrent but periodic chain has a unique stationary distribution, but $P^{(n)}_{ij}$ does NOT converge - it oscillates. | Convergence to stationarity requires ergodicity = irreducible + positive recurrent + aperiodic. |
| 4 | Treating detailed balance as necessary for stationarity | Detailed balance ($\pi_i P_{ij} = \pi_j P_{ji}$) is sufficient but not necessary for $\pi P = \pi$. Many chains (e.g., cyclic chains) have stationary distributions without satisfying detailed balance. | Detailed balance $\Rightarrow$ stationarity, but stationarity $\not\Rightarrow$ detailed balance. |
| 5 | Confusing the transition matrix orientation | Some books write $P_{ij}$ as the probability from $j$ to $i$ (column stochastic); others as the probability from $i$ to $j$ (row stochastic). This flips $\pi P = \pi$ vs. $P\pi = \pi$. | Fix a convention: row stochastic means rows sum to 1 and $\pi P = \pi$ (left eigenvector). |
| 6 | Ignoring burn-in in MCMC | The first $B$ MCMC samples come from a distribution close to $\mu^{(0)}$ (the initial distribution), not $\pi$. Including them biases posterior estimates. | Always discard a burn-in period. Use $\hat{R}$ and ESS to assess convergence. |
| 7 | Confusing mixing time with burn-in | Mixing time $t_{\text{mix}}$ is a property of the chain (how long until the worst-case distribution is close to $\pi$). Burn-in is a practical heuristic. They are related but not equal. | Burn-in should be at least $t_{\text{mix}}$; more if starting far from $\pi$. |
| 8 | Assuming a high acceptance rate = well-mixed MCMC | A chain that always accepts (tiny step size in MH) explores very slowly; acceptance rate $\approx 1$ with step size $\to 0$ gives high acceptance but poor mixing. | Target $\approx$ 23% acceptance in high dimensions (optimal for Gaussian targets). |
| 9 | Using $P^n$ to compute the stationary distribution for large $n$ | The matrix power $P^n$ converges to the rank-1 matrix $\mathbf{1}\pi$, but floating-point errors accumulate. | Use power iteration on the distribution vector: $\pi^{(k+1)} = \pi^{(k)} P$, which is numerically stable. |
| 10 | Misidentifying the state space for LLM generation | A bigram LM has states = vocabulary (size $V$). A full autoregressive LM has states = entire contexts, exponential in context length. | The Markov chain for a transformer lives on the context window, not the vocabulary alone. |
| 11 | Confusing transient and absorbing | A transient state will eventually be left for good; an absorbing state is never left once entered. Every absorbing state is recurrent (trivially), not transient. | Transient: $P(\text{return}) < 1$; absorbing: $P_{ii} = 1$. Absorbing states are recurrent. |
| 12 | Expecting MCMC samples to be independent | MCMC samples are autocorrelated - consecutive samples are not independent. $\text{ESS} < T$ measures effective independence. | Report ESS, not the raw sample count. Use thinning to reduce autocorrelation if needed. |

11. Exercises

Exercise 1 * - Transition Matrix and Distribution Evolution

A weather model has two states: Sunny (1) and Rainy (2), with transition matrix P=(0.80.20.40.6)P = \begin{pmatrix}0.8 & 0.2 \\ 0.4 & 0.6\end{pmatrix}.

(a) Starting from μ(0)=(1,0)\mu^{(0)} = (1, 0) (certainly sunny), compute μ(1),μ(5),μ(20)\mu^{(1)}, \mu^{(5)}, \mu^{(20)}.

(b) Compute the 3-step transition matrix P3P^3. What is P(X3=2X0=1)P(X_3=2 \mid X_0=1)?

(c) Find the stationary distribution π\pi by solving πP=π\pi P = \pi with π1+π2=1\pi_1 + \pi_2 = 1.

(d) Verify that μ(n)π\mu^{(n)} \to \pi as nn \to \infty by computing μ(n)πTV\|\mu^{(n)} - \pi\|_{\text{TV}} for n=1,5,10,20n = 1, 5, 10, 20.


Exercise 2 * - State Classification

Consider the Markov chain on {1,2,3,4,5}\{1,2,3,4,5\} with transition matrix:

P=(010000.500.500000100000100010)P = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0.5 & 0 & 0.5 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}

(a) Draw the transition diagram. Identify all communicating classes.

(b) Classify each state as recurrent or transient. Justify using the return probability criterion.

(c) For the recurrent states, determine their period. Are any states aperiodic?

(d) Starting from state 1, what is the probability of eventually reaching state 4?


Exercise 3 * - Computing Stationary Distributions

(a) Implement stationary_power_iteration(P, n_iter=1000) that starts from a uniform distribution and returns π(n)\pi^{(n)} after nn iterations. Apply to the 3×33\times3 chain with Pij=1/3P_{ij} = 1/3 for all i,ji,j and a 4×44\times4 random row-stochastic matrix.

(b) Verify your result satisfies πP=π\pi P = \pi to tolerance 10610^{-6}.

(c) For a doubly-stochastic matrix (rows AND columns sum to 1), prove the stationary distribution is uniform without solving any equations. Verify numerically.

(d) For the birth-death chain with pi=0.6p_i = 0.6, qi=0.4q_i = 0.4 on {0,1,2,3,4}\{0,1,2,3,4\} with reflecting barriers, compute π\pi analytically and verify against power iteration.


Exercise 4 ** - Detailed Balance and Birth-Death Chains

(a) Verify that the two-state chain P=(1ppq1q)P = \begin{pmatrix}1-p&p\\q&1-q\end{pmatrix} satisfies detailed balance with π=(q/(p+q),p/(p+q))\pi = (q/(p+q), p/(p+q)).

(b) Implement an $M/M/1$-queue birth-death chain with $\lambda=0.6, \mu=1.0$ on states $\{0,1,\ldots,20\}$, truncated with a reflecting boundary at 20 (an absorbing boundary would trap all mass at 20). Compute the stationary distribution and verify $\pi_n \approx (1-\rho)\rho^n$ with $\rho = 0.6$.

(c) For the cyclic chain 12311\to2\to3\to1 with P12=P23=P31=1P_{12}=P_{23}=P_{31}=1, find the stationary distribution. Does it satisfy detailed balance? Explain why not.

(d) Show that if PP is doubly stochastic, then detailed balance holds with the uniform distribution iff PP is also symmetric.


Exercise 5 ** - Mixing Time and Spectral Gap

(a) For the two-state chain in Exercise 1, compute the eigenvalues of PP. What is the spectral gap?

(b) Using the spectral gap bound, upper-bound the mixing time tmix(0.01)t_{\text{mix}}(0.01).

(c) Compute the TV distance Pt(1,)πTV\|P^t(1,\cdot) - \pi\|_{\text{TV}} for t=1,2,5,10,20t = 1, 2, 5, 10, 20 and plot the convergence. At what tt does it drop below 0.010.01?

(d) For the lazy version P=(I+P)/2P' = (I+P)/2, redo parts (a)-(c). How does the spectral gap change? How does the mixing time change?


Exercise 6 ** - Metropolis-Hastings

(a) Implement Metropolis-Hastings with a Gaussian random walk proposal q(θθ)=N(θ,σ2)q(\theta'|\theta) = \mathcal{N}(\theta, \sigma^2) for target π(θ)exp(θ4/4+θ2/2)\pi(\theta) \propto \exp(-\theta^4/4 + \theta^2/2) (double-well potential). Run for T=50000T=50000 steps with σ=0.5\sigma=0.5.

(b) Verify your sampler satisfies detailed balance: for two states θ1,θ2\theta_1, \theta_2, check π(θ1)K(θ1,θ2)π(θ2)K(θ2,θ1)\pi(\theta_1)\,K(\theta_1,\theta_2) \approx \pi(\theta_2)\,K(\theta_2,\theta_1) numerically where KK is the MH transition kernel.

(c) Compute the acceptance rate and effective sample size (ESS). How does ESS change as σ\sigma varies over {0.1,0.5,1.0,2.0,5.0}\{0.1, 0.5, 1.0, 2.0, 5.0\}?

(d) Estimate Eπ[θ2]\mathbb{E}_\pi[\theta^2] from your samples. Compare to the true value (computed numerically).


Exercise 7 *** - PageRank

(a) Implement PageRank power iteration for the 5-node graph with adjacency matrix:

A=(0110000110000111000101000)A = \begin{pmatrix}0&1&1&0&0\\0&0&1&1&0\\0&0&0&1&1\\1&0&0&0&1\\0&1&0&0&0\end{pmatrix}

with damping factor α=0.85\alpha=0.85. Handle dangling nodes.

(b) Verify that your PageRank vector satisfies πP=π\pi P = \pi to tolerance 10810^{-8}.

(c) Add a new node 6 that links to all existing nodes, and all existing nodes link to node 6. How does the PageRank of node 1 change? Why?

(d) Implement personalised PageRank where the teleportation vector is non-uniform: (0.4,0.2,0.1,0.1,0.1,0.1)(0.4, 0.2, 0.1, 0.1, 0.1, 0.1). Compare to standard PageRank.


Exercise 8 *** - HMM Forward Algorithm and Viterbi

An HMM models DNA sequences with 2 hidden states (CpG island: HH, non-CpG: LL) and 4 observations (A, C, G, T).

(a) Given transition matrix THH=0.5,THL=0.5,TLH=0.4,TLL=0.6T_{HH}=0.5, T_{HL}=0.5, T_{LH}=0.4, T_{LL}=0.6 and emission matrices (HMM notebook provides values), implement the forward algorithm and compute P(X1:5=ACGTT)P(X_{1:5} = \text{ACGTT}) under equal initial probabilities.

(b) Implement the Viterbi algorithm to find the most likely hidden state sequence for the same observation.

(c) Verify the forward probabilities satisfy iαt(i)=P(X1:t)\sum_i \alpha_t(i) = P(X_{1:t}) at each tt.

(d) Compare Viterbi decoding to the most-probable-state-per-position (posterior decoding using the backward algorithm). When do they differ?


12. Why This Matters for AI (2026 Perspective)

| Concept | AI/ML Application | Impact |
|---------|-------------------|--------|
| Transition matrix / stationary dist. | PageRank web graph; LLM token distribution | Foundation for all sequential reasoning |
| Ergodic theorem | MCMC posterior estimation; time-averaging in RL | Justifies replacing expectation with time average |
| Perron-Frobenius | PageRank convergence proof; convergence of power iteration | Guarantees unique ranking vectors exist |
| Detailed balance / reversibility | Correctness of MH, Gibbs, SGLD | All Bayesian MCMC algorithms rely on this |
| Spectral gap / mixing time | MCMC convergence rate; warm-start certification | Explains why MCMC works (or fails) in practice |
| Metropolis-Hastings | Bayesian neural network posteriors; sampling EBMs | Foundation of probabilistic inference in ML |
| Gibbs sampling | LDA topic models; Boltzmann machine inference | Tractable posterior sampling via conditionals |
| HMC / NUTS | Stan, PyMC, posterior sampling in Bayesian DL | Gold standard for Bayesian inference on $\mathbb{R}^d$ |
| MDP / Bellman equation | RL algorithms (PPO, SAC, DQN, RLHF) | Every RL agent solves a Markov chain problem |
| HMM forward-backward | Speech recognition, gene prediction, legacy NLP | Efficient inference on latent sequence models |
| CTMC / generator matrix | LLM serving queues; continuous-time RL | Latency modelling and event-driven simulation |
| Diffusion as CTMC | DDPM/DDIM forward process; score matching | Connects generative modelling to Markov chain theory |
| Power iteration | PageRank; dominant eigenvector computation | $O(\text{nnz})$ per iteration; scales to web-size graphs |
| Lazy chains / warm starts | MCMC initialisation strategy | Reduces burn-in and improves sample efficiency |

13. Conceptual Bridge

Markov chains occupy the central position in the probability curriculum: they are the first class of stochastic processes with nontrivial structure - neither completely independent (iid) nor completely dependent. The Markov property gives just enough memory to model real sequential phenomena while remaining mathematically tractable.

Looking backward: Markov chains build on everything in Chapters 1-6. The transition matrix is a row-stochastic matrix - linear algebra from Chapters 2-3. The stationary distribution is an eigenvector problem solved by Perron-Frobenius. Computing stationary distributions uses the conditional probability machinery from Section03. The ergodic theorem is a strengthening of the law of large numbers (Section06). The Metropolis-Hastings acceptance ratio uses Bayes' theorem (Section03). Mixing times use concentration inequalities (Section05) through the spectral gap. The stochastic process framework of filtrations and stopping times (Section06) provides the rigorous foundation.

Looking forward: Markov chains are the gateway to Chapter 7 (Statistics), where MCMC enables Bayesian inference with intractable posteriors; Chapter 8 (Optimisation), where Markov chain mixing theory explains why SGD finds flat minima; and Chapter 9 (Information Theory), where the entropy of a stationary distribution connects to compression. Markov Decision Processes - the reinforcement learning setting - are a direct extension requiring optimisation over policies.

The deep connection: The central theorem of Markov chain theory - ergodicity implies μ(n)π\mu^{(n)} \to \pi - is the probabilistic analogue of the contraction mapping theorem from analysis: repeated application of the transition operator contracts all distributions toward the unique fixed point π\pi. The spectral gap quantifies the contraction rate, just as the Lipschitz constant does in the deterministic setting. This structural parallel runs through SGD convergence theory, where the gradient operator acts as a stochastic contraction.

MARKOV CHAINS IN THE CURRICULUM
========================================================================

  Section01 Random Variables -------------------------+
  Section02 Distributions --------------------------+ |
  Section03 Joint Distributions ------------------+ | |
  Section04 Expectation & Moments --------------+ | | |
  Section05 Concentration Inequalities -------+ | | | |
  Section06 Stochastic Processes -----------+ | | | | |
                                            v v v v v v
                              +---------------------------------+
                              |    Section07  MARKOV CHAINS     |
                              |                                 |
                              |  Markov property                |
                              |  Transition matrices            |
                              |  Stationary distributions       |
                              |  Mixing times, spectral gap     |
                              |  MCMC (MH, Gibbs, HMC)          |
                              |  PageRank, MDP, HMM             |
                              +----------------+----------------+
                                               |
               +-------------------------------+--------------------+
               v                               v                    v
     Ch7: Statistics               Ch8: Optimisation        Ch9: Info Theory
     (MCMC for Bayes,              (SGD mixing theory,      (Entropy of \pi,
      posterior inference)          Langevin dynamics)       KL from stationarity)
               v                               v
     RL/RLHF: MDP                  Diffusion models
     (policy eval,                 (DDPM as CTMC,
      value functions)              reverse Markov)

========================================================================

Appendix A: Harmonic Functions and the Dirichlet Problem

A function h:SRh : \mathcal{S} \to \mathbb{R} is harmonic for a Markov chain at state ii if:

h(i)=jPijh(j)=(Ph)(i)h(i) = \sum_j P_{ij} h(j) = (Ph)(i)

That is, $h(i)$ equals the expected value of $h$ one step later. Harmonic functions are the "conserved quantities" of Markov chains: wherever the chain starts, the sequence $h(X_n)$ is a martingale.

Maximum principle: For an irreducible chain on a finite state space, the only bounded harmonic functions are constants. This is the probabilistic analogue of the maximum principle for harmonic functions in PDE theory.

Dirichlet problem: On a chain with absorbing boundary S\partial \mathcal{S} and interior S\mathcal{S}^\circ, the function h(i)=Ei[f(Xτ)]h(i) = E_i[f(X_{\tau_\partial})] (expected boundary value at hitting time) is the unique solution to:

h(i)=(Ph)(i) for iS,h(i)=f(i) for iSh(i) = (Ph)(i) \text{ for } i \in \mathcal{S}^\circ, \quad h(i) = f(i) \text{ for } i \in \partial\mathcal{S}

Application - gambler's ruin: Let h(i)=P(reach NX0=i)h(i) = P(\text{reach } N \mid X_0=i). Then hh is harmonic on {1,,N1}\{1,\ldots,N-1\} with boundary conditions h(0)=0h(0)=0, h(N)=1h(N)=1. For the simple random walk: h(i)=i/Nh(i) = i/N.
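As a sanity check, the gambler's-ruin Dirichlet problem is a small linear solve; this sketch (ours) recovers $h(i) = i/N$:

```python
import numpy as np

N = 10                                   # absorbing boundaries at 0 and N
A = np.eye(N - 1)                        # unknowns h(1), ..., h(N-1)
b = np.zeros(N - 1)
for k, i in enumerate(range(1, N)):      # h(i) - (h(i-1) + h(i+1))/2 = boundary mass
    if i > 1:
        A[k, k - 1] = -0.5
    if i < N - 1:
        A[k, k + 1] = -0.5
    else:
        b[k] = 0.5                       # neighbour N has h(N) = 1; h(0) = 0 adds nothing
print(np.linalg.solve(A, b))             # [0.1, 0.2, ..., 0.9] = i/N
```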

For AI: Value functions in reinforcement learning solve a discrete Dirichlet-type problem: the Bellman equation $V^\pi(s) = \sum_a \pi(a|s)\left[r(s,a) + \gamma \sum_{s'}P(s'|s,a)V^\pi(s')\right]$ is the analogue of the Dirichlet problem, with rewards as source terms and discount factor $\gamma$.


Appendix B: Spectral Theory for Non-Reversible Chains

For non-reversible chains, eigenvalues of PP can be complex (though all have λ1|\lambda| \leq 1). The Jordan form replaces the diagonal eigendecomposition.

Pseudo-spectrum: For nearly-reversible chains, the pseudo-spectrum can be much larger than the spectrum, causing transient amplification before eventual convergence.

Non-reversible speedup: Some non-reversible chains mix faster than any reversible chain with the same stationary distribution. The "lifted" or "skewed" chains in the literature achieve mixing time $O(1/h)$ vs. $O(1/h^2)$ for reversible chains, where $h$ is the Cheeger constant.

Lifting: A common technique to construct non-reversible fast-mixing chains: add a "direction" variable d{+1,1}d \in \{+1,-1\} to the state, creating a Markov chain on S×{+,}\mathcal{S} \times \{+,-\} that preferentially moves in one direction while still having π\pi as marginal stationary distribution.


Appendix C: Advanced MCMC Methods

C.1 Stochastic Gradient Langevin Dynamics (SGLD)

For Bayesian inference on large datasets, standard MH requires evaluating logπ(θ)\log\pi(\theta) at the full dataset - too expensive. SGLD combines SGD with Gaussian noise:

θt+1=θtηt2L~(θt)+ηtξt,ξtN(0,I)\theta_{t+1} = \theta_t - \frac{\eta_t}{2}\nabla \tilde{L}(\theta_t) + \sqrt{\eta_t}\,\xi_t, \quad \xi_t \sim \mathcal{N}(0,I)

where L~(θ)=Nniminibatchlogp(xiθ)logp(θ)\tilde{L}(\theta) = -\frac{N}{n}\sum_{i \in \text{minibatch}} \log p(x_i|\theta) - \log p(\theta) is the mini-batch stochastic gradient.

For step sizes satisfying tηt=\sum_t \eta_t = \infty and tηt2<\sum_t \eta_t^2 < \infty (Robbins-Monro conditions), the chain converges to samples from the posterior. At small but fixed step size η\eta, the chain samples from an approximate posterior within O(η)O(\eta) of the true posterior.

For AI: SGLD underlies Bayesian deep learning at scale. It explains why SGD with learning rate decay has regularisation properties: the decaying noise variance means the chain converges to a tighter distribution around the posterior mode.
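A minimal SGLD sketch for a conjugate toy model (all settings are ours): infer the mean of a Gaussian from minibatch gradients and compare against the exact posterior. At fixed step size the sample variance is slightly inflated, as predicted by the $O(\eta)$ bias.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_batch, eta = 10_000, 1_000, 1e-6
data = rng.normal(2.0, 1.0, size=N)           # x_i ~ N(theta, 1), prior theta ~ N(0, 10)

def grad_U(theta, batch):                     # stochastic gradient of -log posterior
    return -(N / n_batch) * np.sum(batch - theta) + theta / 10.0

theta, samples = 0.0, []
for _ in range(50_000):
    batch = rng.choice(data, n_batch)
    theta += -0.5 * eta * grad_U(theta, batch) + np.sqrt(eta) * rng.normal()
    samples.append(theta)

post_var = 1.0 / (N + 0.1)                    # exact conjugate posterior precision N + 1/10
print(np.mean(samples[10_000:]), post_var * data.sum())   # posterior mean
print(np.var(samples[10_000:]), post_var)                 # variance (inflated ~O(eta))
```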

C.2 Parallel Tempering (Replica Exchange)

For multimodal targets, single chains can get stuck in one mode. Parallel tempering runs KK chains at different temperatures β1<β2<<βK\beta_1 < \beta_2 < \cdots < \beta_K (with βK=1\beta_K = 1 being the target). Periodically, adjacent chains swap states with MH acceptance probability:

a=min(1,πβi(xj)πβj(xi)πβi(xi)πβj(xj))a = \min\left(1,\, \frac{\pi_{\beta_i}(x_j)\pi_{\beta_j}(x_i)}{\pi_{\beta_i}(x_i)\pi_{\beta_j}(x_j)}\right)

High-temperature chains mix fast (flat distribution) and pass samples to low-temperature chains through swaps. This enables exploration of multiple modes.
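A compact sketch (temperatures and proposal scale are ours) for the double-well target of Exercise 6; only the $\beta = 1$ chain is recorded as samples:

```python
import numpy as np

rng = np.random.default_rng(2)
log_pi = lambda th: -th**4 / 4 + th**2 / 2        # double-well target (Exercise 6)
betas = np.array([0.1, 0.3, 1.0])                 # beta_K = 1 is the target chain
x = np.zeros(3)

samples = []
for _ in range(100_000):
    for k, beta in enumerate(betas):              # random-walk MH at each temperature
        prop = x[k] + rng.normal(0, 1.0)
        if np.log(rng.random()) < beta * (log_pi(prop) - log_pi(x[k])):
            x[k] = prop
    k = rng.integers(2)                           # propose swapping an adjacent pair
    log_a = (betas[k] - betas[k+1]) * (log_pi(x[k+1]) - log_pi(x[k]))
    if np.log(rng.random()) < log_a:
        x[k], x[k+1] = x[k+1], x[k]
    samples.append(x[-1])
print(np.mean(np.array(samples) > 0))             # ~0.5: both modes are visited
```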

C.3 Slice Sampling

Slice sampling introduces a uniform auxiliary variable uUniform(0,π(θ))u \sim \text{Uniform}(0, \pi(\theta)), creating a joint distribution on (θ,u)(\theta, u) that is uniform on the "slice" {(θ,u):u<π(θ)}\{(\theta, u) : u < \pi(\theta)\}. Marginalising out uu recovers π(θ)\pi(\theta).

The slice sampler alternates between updating uu (trivial given θ\theta) and updating θ\theta (uniform on the slice level set). No tuning parameter; acceptance rate = 1; effective for univariate distributions.
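A 1D sketch using Neal's stepping-out and shrinkage procedure (the width $w$ is the only knob, and the shrinkage loop self-corrects a bad choice):

```python
import numpy as np

def slice_sample(log_pi, x0, n_steps, w=1.0, seed=0):
    """1D slice sampler with stepping-out and shrinkage (Neal, 2003)."""
    rng = np.random.default_rng(seed)
    x, out = x0, []
    for _ in range(n_steps):
        log_u = log_pi(x) + np.log(rng.random())    # height u ~ Uniform(0, pi(x)), in logs
        L = x - w * rng.random()                    # random bracket containing x
        R = L + w
        while log_pi(L) > log_u: L -= w             # step out until both ends exit the slice
        while log_pi(R) > log_u: R += w
        while True:                                 # shrink until a point lands in the slice
            x_new = rng.uniform(L, R)
            if log_pi(x_new) > log_u:
                x = x_new
                break
            if x_new < x: L = x_new
            else: R = x_new
        out.append(x)
    return np.array(out)

samples = slice_sample(lambda t: -t**2 / 2, 0.0, 20_000)   # standard normal target
print(samples.mean(), samples.var())                       # ~0, ~1
```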


Appendix D: Worked Examples

D.1 Google PageRank Computation (Mini Example)

Consider a 4-page web: page 1 links to 2,3; page 2 links to 4; page 3 links to 2; page 4 links to 1,3.

Transition matrix (with no damping for clarity):

P=(01/21/20000101001/201/20)P = \begin{pmatrix}0&1/2&1/2&0\\0&0&0&1\\0&1&0&0\\1/2&0&1/2&0\end{pmatrix}

Stationary distribution (solve $\pi P = \pi$, $\sum\pi_i = 1$): $\pi = (2/13,\ 4/13,\ 3/13,\ 4/13)$ - verified by direct computation.

With damping α=0.85\alpha=0.85: P=0.85P+0.15J/4P' = 0.85 P + 0.15 \cdot J/4 where JJ is the all-ones matrix. Power iteration converges in ~30 iterations.
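A few lines of NumPy confirm the undamped stationary distribution (the chain is aperiodic, since it contains cycles of lengths 3 and 4):

```python
import numpy as np

P = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0]])
pi = np.ones(4) / 4
for _ in range(500):                       # power iteration on the distribution vector
    pi = pi @ P
print(pi)                                  # [2/13, 4/13, 3/13, 4/13]
```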

D.2 MCMC on a Bimodal Distribution

Target: π(θ)e(θ3)2/2+e(θ+3)2/2\pi(\theta) \propto e^{-(\theta-3)^2/2} + e^{-(\theta+3)^2/2} (mixture of two Gaussians at ±3\pm 3).

With Gaussian random walk MH ($\sigma=0.5$): the chain mixes within one mode; inter-modal jumps are rare because the log-density barrier between the modes is $\approx 4$ nats. Acceptance rate ~60%, but ESS/T ~0.02 (poor mixing).

With σ=3\sigma=3: chain jumps between modes; acceptance rate ~15%, ESS/T ~0.10 (better mixing).

This illustrates the exploration-exploitation tradeoff in MCMC proposal design.

D.3 HMM Example: Weather from Ice Core

Two hidden states: Ice Age (I), Warm Period (W). Observations: high (hh) or low (ll) dust in ice core.

Transition: PII=0.7,PIW=0.3,PWI=0.4,PWW=0.6P_{II}=0.7, P_{IW}=0.3, P_{WI}=0.4, P_{WW}=0.6. Emission: P(hI)=0.8,P(lI)=0.2,P(hW)=0.2,P(lW)=0.8P(h|I)=0.8, P(l|I)=0.2, P(h|W)=0.2, P(l|W)=0.8.

For observation sequence $h, l, h, h, l$ (with uniform initial distribution): the Viterbi algorithm gives hidden sequence $I, I, I, I, W$ - the persistence of the ice-age state ($P_{II}=0.7$) narrowly outweighs the single low-dust observation at $t=2$. Per-position posterior decoding instead flips $t=2$ to $W$, a concrete case of Viterbi and posterior decoding disagreeing (cf. Exercise 8d).

The forward algorithm gives likelihood $P(\text{obs sequence}) \approx 0.024$.


Appendix E: Key Theorems and Proofs

E.1 Proof: Stationary Distribution is Unique for Finite Irreducible Chains

Theorem. A finite irreducible Markov chain has a unique stationary distribution π\pi with πi>0\pi_i > 0 for all ii.

Proof. Consider the NN-dimensional simplex Δ={μ:μi0,μi=1}\Delta = \{\mu : \mu_i \geq 0, \sum\mu_i = 1\}. The map T:μμPT: \mu \mapsto \mu P is a continuous map from Δ\Delta to Δ\Delta. By Brouwer's fixed point theorem, TT has at least one fixed point π\pi.

For uniqueness: suppose π\pi and ν\nu are two stationary distributions. Since the chain is irreducible and finite, all states are positive recurrent, so πi=1/μi>0\pi_i = 1/\mu_i > 0 where μi\mu_i is the mean return time. The mean return time is unique (it equals 1/πi1/\pi_i and also 1/νi1/\nu_i if ν\nu is stationary), so πi=νi\pi_i = \nu_i for all ii.

For πi>0\pi_i > 0: by irreducibility, from any state jj there exists a path to ii. The stationary mass at each state in this path is >0> 0 (by positivity of transitions and stationarity), propagating to give πi>0\pi_i > 0.

E.2 Proof: Detailed Balance Implies Stationarity

Theorem. If πiPij=πjPji\pi_i P_{ij} = \pi_j P_{ji} for all i,ji,j, then πP=π\pi P = \pi.

Proof. For each jj:

(πP)j=iπiPij=iπjPji=πjiPji=πj1=πj(\pi P)_j = \sum_i \pi_i P_{ij} = \sum_i \pi_j P_{ji} = \pi_j \sum_i P_{ji} = \pi_j \cdot 1 = \pi_j

where we used detailed balance in the second equality and the row-sum property of stochastic matrices in the fourth.

E.3 Proof: Convergence via Coupling (Sketch)

Theorem. Let PP be ergodic. For any μ,ν\mu, \nu and coupling (Xt,Yt)(X_t, Y_t) with X0μX_0 \sim \mu, Y0νY_0 \sim \nu, define τ=min{t:Xt=Yt}\tau = \min\{t : X_t = Y_t\}. Then:

μPtνPtTVP(τ>t)\|\mu P^t - \nu P^t\|_{\text{TV}} \leq P(\tau > t)

Proof. For any event AA:

μPt(A)νPt(A)=P(XtA)P(YtA)=P(XtA,τt)+P(XtA,τ>t)P(YtA,τt)P(YtA,τ>t)\mu P^t(A) - \nu P^t(A) = P(X_t \in A) - P(Y_t \in A) = P(X_t \in A, \tau \leq t) + P(X_t \in A, \tau > t) - P(Y_t \in A, \tau \leq t) - P(Y_t \in A, \tau > t)

On {τt}\{\tau \leq t\}: Xt=YtX_t = Y_t (coalesce), so P(XtA,τt)=P(YtA,τt)P(X_t \in A, \tau \leq t) = P(Y_t \in A, \tau \leq t). Therefore:

μPt(A)νPt(A)=P(XtA,τ>t)P(YtA,τ>t)P(τ>t)|\mu P^t(A) - \nu P^t(A)| = |P(X_t \in A, \tau > t) - P(Y_t \in A, \tau > t)| \leq P(\tau > t)

Taking supremum over AA gives the result.


Appendix F: Python Recipes for Markov Chains

```python
import numpy as np

# -- 1. Power iteration for stationary distribution
def stationary(P, tol=1e-12, max_iter=10000):
    pi = np.ones(P.shape[0]) / P.shape[0]
    for _ in range(max_iter):
        pi_new = pi @ P
        if np.max(np.abs(pi_new - pi)) < tol:
            return pi_new
        pi = pi_new
    return pi

# -- 2. Simulate a Markov chain
def simulate_chain(P, x0, n_steps):
    N = P.shape[0]
    x = x0
    trajectory = [x]
    for _ in range(n_steps):
        x = np.random.choice(N, p=P[x])
        trajectory.append(x)
    return np.array(trajectory)

# -- 3. Metropolis-Hastings for 1D target
def metropolis_hastings(log_pi, n_steps, sigma=0.5, x0=0.0):
    x = x0
    samples = []
    for _ in range(n_steps):
        x_prime = x + sigma * np.random.randn()
        log_a = log_pi(x_prime) - log_pi(x)
        if np.log(np.random.rand()) < log_a:
            x = x_prime
        samples.append(x)
    return np.array(samples)

# -- 4. PageRank
def pagerank(A, alpha=0.85, tol=1e-8):
    N = A.shape[0]
    out_degree = A.sum(axis=1, keepdims=True)
    # Fix dangling nodes
    dangling = (out_degree == 0).flatten()
    out_degree[dangling] = 1
    P = A / out_degree
    P[dangling] = 1 / N
    teleport = np.ones((N, N)) / N
    P_hat = alpha * P + (1 - alpha) * teleport
    pi = np.ones(N) / N
    for _ in range(1000):
        pi_new = pi @ P_hat
        if np.max(np.abs(pi_new - pi)) < tol:
            return pi_new
        pi = pi_new
    return pi

# -- 5. HMM forward algorithm
def hmm_forward(obs, T, E, pi0):
    """obs: list of observations, T: NxN transition, E: NxM emission, pi0: initial."""
    N = T.shape[0]
    alpha = np.zeros((len(obs), N))
    alpha[0] = pi0 * E[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t-1] @ T) * E[:, obs[t]]
    return alpha, alpha[-1].sum()  # alpha matrix and likelihood
```

Appendix G: Connections to Other Areas

Spectral graph theory: For a random walk on an undirected graph GG, the transition matrix P=D1AP = D^{-1}A (where DD is the degree matrix and AA is the adjacency matrix) has eigenvalues in [1,1][-1,1]. The spectral gap of the Laplacian L=ID1/2AD1/2L = I - D^{-1/2}AD^{-1/2} equals the spectral gap of the walk. Well-connected (expander) graphs have large spectral gaps and fast-mixing random walks.

Information theory: The relative entropy (KL divergence) between the chain's distribution and stationarity decreases monotonically: DKL(μ(n)π)DKL(μ(n+1)π)D_{\text{KL}}(\mu^{(n)} \| \pi) \geq D_{\text{KL}}(\mu^{(n+1)} \| \pi). This is the data processing inequality applied to the Markov transition. The rate of KL decrease is related to the spectral gap.

Ergodic theory: The Markov chain ergodic theorem is a special case of Birkhoff's ergodic theorem in abstract measure-preserving systems. The stationarity condition πP=π\pi P = \pi says π\pi is an invariant measure for the measure-preserving map T:ΩΩT : \Omega \to \Omega on the path space.

Linear programming: Computing the stationary distribution satisfying πP=π\pi P = \pi, π0\pi \geq 0, π1=1\|\pi\|_1 = 1 is a linear system. The LP relaxation of combinatorial problems can sometimes be solved by finding stationary distributions of associated Markov chains.

Quantum computing: Quantum walks are the quantum analogue of random walks on graphs, with amplitudes instead of probabilities. They can achieve quadratic speedups over classical random walks for certain search and sampling problems.


Appendix H: Self-Assessment Checklist

Before moving to Chapter 7 (Statistics), verify you can:

  • Write the transition matrix for any small Markov chain from a verbal description
  • Compute μ(n)=μ(0)Pn\mu^{(n)} = \mu^{(0)}P^n for n=1,2,5n = 1, 2, 5 by hand or code
  • Classify all states of a small chain as communicating/recurrent/transient/absorbing
  • Solve πP=π\pi P = \pi for a 2×22\times2 or 3×33\times3 stochastic matrix
  • State Perron-Frobenius and explain why the stationary distribution is unique
  • Verify detailed balance for a given (π,P)(\pi, P) pair
  • Compute the spectral gap from eigenvalues and give a mixing time bound
  • Implement Metropolis-Hastings for a 1D target and compute acceptance rate
  • Implement power iteration for PageRank with dangling node handling
  • Implement the HMM forward algorithm and compute the likelihood of an observation sequence
  • Explain why MCMC sample autocorrelation reduces the effective sample size
  • State the ergodic theorem and explain why it justifies MCMC estimation

Appendix I: Detailed Proofs - Perron-Frobenius and Mixing

I.1 Perron-Frobenius Theorem (Complete Proof)

We prove the Perron-Frobenius theorem for primitive stochastic matrices.

Step 1: Eigenvalue 1 exists. For any stochastic matrix $P$, $P\mathbf{1} = \mathbf{1}$, so $1$ is an eigenvalue with right eigenvector $\mathbf{1}$ (the vector $\mathbf{1}^T$ is a left eigenvector only when $P$ is doubly stochastic). For the left eigenvector we need more: by the Brouwer fixed-point theorem on $\Delta^{N-1}$, the map $\pi \mapsto \pi P$ has a fixed point - the stationary distribution. This fixed point is a left eigenvector with eigenvalue 1.

Step 2: $|\lambda| \leq 1$ for all eigenvalues. Let $\lambda$ be an eigenvalue with (right) eigenvector $v$ ($Pv = \lambda v$), and let $i^* = \arg\max_i |v_i|$ with $v_{\max} = |v_{i^*}|$. Then $|\lambda| v_{\max} = |(Pv)_{i^*}| = |\sum_j P_{i^*j}v_j| \leq \sum_j P_{i^*j}|v_j| \leq v_{\max}$. So $|\lambda| \leq 1$.

Step 3: λ=1|\lambda| = 1 implies λ=1\lambda = 1 for primitive PP. For a primitive matrix, Pm>0P^m > 0 (all entries strictly positive) for some mm. Suppose λ=1|\lambda| = 1, λ1\lambda \neq 1, with Pv=λvPv = \lambda v and vmax=1v_{\max} = 1. The equality case in Step 2 requires vj=1|v_j| = 1 for all jj and all rows of PP to "align" the phases of vjv_j. But for Pm>0P^m > 0, all rows of PmP^m are strictly positive, and the alignment condition forces λm=1\lambda^m = 1 and all vjv_j equal - meaning v=c1v = c\mathbf{1} and λ=1\lambda = 1. Contradiction.

Step 4: Convergence. Since λ=1\lambda = 1 is the unique eigenvalue on the unit circle for primitive PP, all other eigenvalues λk\lambda_k satisfy λk1δ|\lambda_k| \leq 1 - \delta for some δ>0\delta > 0. The spectral decomposition gives Pn=1π+k2λknϕkψkTP^n = \mathbf{1}\pi + \sum_{k \geq 2} \lambda_k^n \phi_k \psi_k^T where λkn0|\lambda_k|^n \to 0 at rate (1δ)n(1-\delta)^n.

I.2 Coupling Proof of Convergence (Complete)

Theorem. Let PP be ergodic with stationary distribution π\pi. For any xSx \in \mathcal{S}:

Pn(x,)πTVP(τ>n)\|P^n(x,\cdot) - \pi\|_{\text{TV}} \leq P(\tau > n)

where τ\tau is the coalescence time of the Markov coupling (two chains started at xx and π\sim\pi).

Proof. Let (Xn,Yn)(X_n, Y_n) be a coupling with X0=xX_0 = x and Y0πY_0 \sim \pi, where both chains use the same transition PP and coalesce when they meet. Since Y0πY_0 \sim \pi and π\pi is stationary, YnπY_n \sim \pi for all nn.

For any event AA:

Pn(x,A)π(A)=P(XnA)P(YnA)P^n(x, A) - \pi(A) = P(X_n \in A) - P(Y_n \in A)

On {τn}\{\tau \leq n\}: Xn=YnX_n = Y_n (they have coalesced), so their indicators for AA are equal.

P^n(x, A) - \pi(A) = E[\mathbf{1}_{X_n \in A} - \mathbf{1}_{Y_n \in A}] = E\big[(\mathbf{1}_{X_n \in A} - \mathbf{1}_{Y_n \in A})\,\mathbf{1}_{\tau > n}\big] \leq E[\mathbf{1}_{\tau > n}] = P(\tau > n)

Since this holds for any AA and for AcA^c by symmetry, taking the supremum gives the TV bound.

I.3 Spectral Gap Lower Bound via Conductance (Cheeger Inequality)

Definition. The conductance (Cheeger constant) of a chain is:

h=minSS,π(S)1/2iS,jSπiPijπ(S)h = \min_{S \subseteq \mathcal{S}, \pi(S) \leq 1/2} \frac{\sum_{i \in S, j \notin S} \pi_i P_{ij}}{\pi(S)}

Conductance measures the minimum probability flux crossing any "bottleneck" cut in the chain.

Cheeger Inequality: For a reversible chain:

h22gap2h\frac{h^2}{2} \leq \text{gap} \leq 2h

The lower bound is the harder direction and gives $\text{gap} \geq h^2/2$. In words: a severe bottleneck (small $h$) forces a small spectral gap, hence slow mixing. Conversely, if $h$ is bounded below, the chain mixes in $O(1/h^2)$ steps; the upper bound $\text{gap} \leq 2h$ shows mixing can be no faster than order $1/h$.
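For small chains both sides of the inequality can be brute-forced; this sketch (a symmetric two-cluster chain of our own making) computes $h$ over all cuts and compares it with the gap:

```python
import numpy as np
from itertools import combinations

eps = 0.01                                        # weak inter-cluster coupling
P = np.array([[0.54, 0.45, eps,  0.0 ],
              [0.45, 0.54, 0.0,  eps ],
              [eps,  0.0,  0.54, 0.45],
              [0.0,  eps,  0.45, 0.54]])
pi = np.ones(4) / 4                               # symmetric => uniform and reversible

cuts = [S for r in (1, 2) for S in combinations(range(4), r)]
h = min(sum(pi[i] * P[i, j] for i in S for j in set(range(4)) - set(S))
        / pi[list(S)].sum() for S in cuts)        # conductance over cuts with pi(S) <= 1/2
gap = 1 - np.sort(np.linalg.eigvalsh(P))[-2]      # symmetric P has a real spectrum
print(h**2 / 2, gap, 2 * h)                       # gap sits inside [h^2/2, 2h]
```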

For AI: The Cheeger inequality provides practical, geometry-based bounds for MCMC convergence. For neural network posteriors, the conductance can be estimated using gradient information - a well-conditioned posterior (no sharp barriers) has high conductance.


Appendix J: Markov Chains and Entropy

J.1 Entropy Production

The relative entropy (KL divergence) from the chain's distribution to stationarity is:

H(μ(n)π)=iμi(n)logμi(n)πiH(\mu^{(n)} \| \pi) = \sum_i \mu^{(n)}_i \log \frac{\mu^{(n)}_i}{\pi_i}

Data processing inequality (for Markov chains): H(μ(n+1)π)H(μ(n)π)H(\mu^{(n+1)} \| \pi) \leq H(\mu^{(n)} \| \pi)

This follows from the data processing inequality: applying the stochastic map PP cannot increase KL divergence. Equality holds iff μ(n)=π\mu^{(n)} = \pi.
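This monotone decay is easy to watch numerically; the sketch below (a random 5-state chain, ours) prints $D_{\text{KL}}(\mu^{(n)} \| \pi)$ step by step:

```python
import numpy as np

rng = np.random.default_rng(3)
P = rng.random((5, 5)); P /= P.sum(axis=1, keepdims=True)  # all-positive rows: ergodic

pi = np.ones(5) / 5
for _ in range(5_000):                    # stationary distribution by power iteration
    pi = pi @ P

def kl(a, b):
    m = a > 0
    return float(np.sum(a[m] * np.log(a[m] / b[m])))

mu = np.eye(5)[0]                         # start from a point mass
for n in range(8):
    print(n, kl(mu, pi))                  # strictly decreasing toward 0
    mu = mu @ P
```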

For reversible chains: The relative entropy decays at rate governed by the log-Sobolev constant. Chains with larger log-Sobolev constants (tighter functional inequalities) decay faster.

J.2 Entropy of the Stationary Distribution

The Shannon entropy of π\pi is H(π)=iπilogπiH(\pi) = -\sum_i \pi_i \log \pi_i. For a uniform distribution (doubly stochastic PP), H(π)=logNH(\pi) = \log N is maximised. For PageRank, the entropy of π\pi measures how "concentrated" web traffic is - low entropy means a few pages dominate.

Maximum entropy interpretation: The stationary distribution of a reversible chain with detailed balance can be interpreted as the maximum entropy distribution subject to the constraint that the detailed balance equations hold. This connects Markov chains to the exponential family (Section02).

J.3 Kullback-Leibler and MCMC Acceptance

The Metropolis-Hastings acceptance ratio has an information-theoretic interpretation:

a(θ,θ)=min(1,π(θ)q(θθ)π(θ)q(θθ))=min(1,elogπ(θ)logπ(θ)logq(θθ)+logq(θθ))a(\theta, \theta') = \min\left(1, \frac{\pi(\theta')q(\theta|\theta')}{\pi(\theta)q(\theta'|\theta)}\right) = \min\left(1, e^{\log\pi(\theta')-\log\pi(\theta)-\log q(\theta'|\theta)+\log q(\theta|\theta')}\right)

The exponent is the log-ratio of probability densities - related to the KL divergence between the proposal and target. When the proposal qq matches π\pi well, most proposals are accepted.


Appendix K: Continuous-Time Chains and Generators

K.1 Semigroup Theory

The family of transition matrices {P(t)}t0\{P(t)\}_{t \geq 0} forms a Markov semigroup: P(0)=IP(0) = I, P(s+t)=P(s)P(t)P(s+t) = P(s)P(t), and tP(t)t \mapsto P(t) is continuous in tt. The generator Q=ddtP(t)t=0Q = \frac{d}{dt}P(t)\big|_{t=0} characterises the infinitesimal behaviour.

Exponential formula: P(t)=eQtP(t) = e^{Qt} where the matrix exponential can be computed via eigendecomposition of QQ. If QQ has eigenvalues 0=μ1μ2μN0 = \mu_1 \geq \mu_2 \geq \cdots \geq \mu_N (all real and non-positive for reversible chains), then:

P(t)=keμktϕkψkTP(t) = \sum_k e^{\mu_k t} \phi_k \psi_k^T

Convergence to stationarity is at rate $e^{-|\mu_2| t}$, where $|\mu_2| = -\mu_2$ is the spectral gap of $Q$.
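With SciPy's matrix exponential this is a three-line experiment (the generator values are ours): compute $P(t)$ and check that the long-run rows agree and that $\pi Q = 0$.

```python
import numpy as np
from scipy.linalg import expm

Q = np.array([[-2.0,  1.0,  1.0],      # off-diagonal rates; each row sums to zero
              [ 0.5, -1.0,  0.5],
              [ 1.0,  1.0, -2.0]])

Pt = expm(5.0 * Q)                     # P(t) = e^{Qt} at t = 5: rows nearly identical
pi = Pt[0] / Pt[0].sum()
print(Pt)
print(pi @ Q)                          # ~ 0: the stationary distribution solves pi Q = 0
```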

K.2 Reversible CTMCs

A CTMC is reversible with respect to π\pi if πiQij=πjQji\pi_i Q_{ij} = \pi_j Q_{ji} for all iji \neq j. The spectral theory for reversible CTMCs is identical to that for reversible DTMCs: real eigenvalues, orthonormal eigenfunctions, exponential convergence.

Detailed balance for CTMCs:

πiqij=πjqji for all ij\pi_i q_{ij} = \pi_j q_{ji} \text{ for all } i \neq j

This is the continuous-time analogue of the discrete detailed balance. Metropolis-Hastings in continuous time (the Metropolis algorithm for particle simulations) satisfies this by construction.


Appendix L: Theory Problems

Problem L.1. Prove that for a doubly stochastic matrix PP (rows and columns both sum to 1), the uniform distribution is stationary. Give an example showing the converse fails.

Problem L.2. Let PP be irreducible with stationary distribution π\pi. Prove that πi>0\pi_i > 0 for all ii by constructing a path from any jj with πj>0\pi_j > 0 to ii.

Problem L.3. (Coupling inequality) Let PP be ergodic. Construct a coupling of two copies of the chain, one started at xx and one at the stationary distribution π\pi. Use the coupling inequality to prove Pn(x,)πTVP(τmeet>n)\|P^n(x,\cdot) - \pi\|_{\text{TV}} \leq P(\tau_{\text{meet}} > n).

Problem L.4. For the symmetric random walk on {0,1,,N}\{0,1,\ldots,N\} with reflecting barriers, show the spectral gap is 1cos(π/N)π2/(2N2)1 - \cos(\pi/N) \approx \pi^2/(2N^2) for large NN. What is the mixing time?

Problem L.5. (Metropolis-Hastings correctness) Let K(x,y)K(x,y) be the MH transition kernel with proposal qq and target π\pi. Show that πxK(x,y)=πyK(y,x)\pi_x K(x,y) = \pi_y K(y,x) (detailed balance), and conclude π\pi is stationary.

Problem L.6. For a birth-death chain on {0,1,,N}\{0,1,\ldots,N\} with rates pi=pp_i = p and qi=qq_i = q (constant), find the stationary distribution. For what values of pp does the chain have a proper stationary distribution on {0,1,2,}\{0,1,2,\ldots\} (infinite chain)?

Problem L.7. The lazy walk on a graph GG sets Pii=1/2P_{ii} = 1/2 (stay with prob 1/2) and Pij=1/(2di)P_{ij} = 1/(2d_i) for neighbours jj (where did_i is the degree). Show the lazy walk has all non-negative eigenvalues and therefore converges monotonically to stationarity in TV distance.

Problem L.8. (PageRank convergence rate) For the damped PageRank matrix P=αP0+(1α)J/NP = \alpha P_0 + (1-\alpha)J/N with damping α\alpha, show all eigenvalues except the dominant one have magnitude α\leq \alpha. Conclude the PageRank power iteration converges in O(log(1/ε)/log(1/α))O(\log(1/\varepsilon)/\log(1/\alpha)) iterations.


Appendix M: Notation Summary

| Symbol | Meaning |
|--------|---------|
| $\mathcal{S}$ | State space |
| $P_{ij}$ | Transition probability from state $i$ to state $j$ |
| $P^{(n)}_{ij}$ | $n$-step transition probability ($(P^n)_{ij}$) |
| $\mu^{(n)}$ | Distribution at time $n$; row vector |
| $\pi$ | Stationary distribution; row vector with $\pi P = \pi$ |
| $f_i$ | Return probability to state $i$: $P(T_i < \infty \mid X_0 = i)$ |
| $\mu_i$ | Mean return time to state $i$; $\pi_i = 1/\mu_i$ |
| $d(i)$ | Period of state $i$: $\gcd\{n \geq 1 : P^{(n)}_{ii} > 0\}$ |
| $\lVert\mu - \nu\rVert_{\text{TV}}$ | Total variation distance between $\mu$ and $\nu$ |
| $t_{\text{mix}}(\varepsilon)$ | Mixing time to $\varepsilon$-accuracy |
| $\text{gap}(P)$ | Spectral gap: $1 - \lvert\lambda_2\rvert$ |
| $h$ | Cheeger constant (conductance) |
| $Q$ | Generator matrix of a CTMC; $Q_{ij} = q_{ij}$ ($i \neq j$), $Q_{ii} = -q_i$ |
| $P(t) = e^{Qt}$ | CTMC transition matrix at time $t$ |
| $a(\theta,\theta')$ | MH acceptance probability |
| $\tau_{\text{meet}}$ | Coupling time (coalescence time of two chains) |
| $\alpha_t(i)$ | HMM forward variable: $P(X_1,\ldots,X_t, Z_t=i)$ |
| $\delta_t(j)$ | Viterbi variable: $\max_{z_{1:t-1}} P(X_1,\ldots,X_t, Z_t=j, z_{1:t-1})$ |

Appendix N: Advanced ML Applications

N.1 RLHF as a Constrained MDP

Reinforcement learning from human feedback (RLHF) trains a language model policy πθ\pi_\theta to maximise:

Eπθ[rϕ(x,y)]βDKL(πθπref)\mathbb{E}_{\pi_\theta}[r_\phi(x, y)] - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})

where rϕr_\phi is a learned reward model, πref\pi_{\text{ref}} is the reference (SFT) policy, and β\beta controls the KL penalty.

This is a constrained Markov chain optimisation problem: the state space is all possible prefixes (contexts), the action space is the vocabulary, and the KL divergence penalty ensures the learned chain πθ\pi_\theta doesn't deviate too far from the reference chain πref\pi_{\text{ref}} in transition structure.

The optimal policy has an explicit closed form:

π(yx)=πref(yx)erϕ(x,y)/βZ(x)\pi^*(y|x) = \pi_{\text{ref}}(y|x) \cdot \frac{e^{r_\phi(x,y)/\beta}}{Z(x)}

where Z(x)Z(x) is a normalising constant. This is a Gibbs distribution with the reward as energy function - exactly the target distribution for MCMC.

Implication: The RLHF optimal policy can be sampled by constructing a Markov chain with the above stationary distribution. Recent work uses Langevin-type MCMC to directly sample from π\pi^* without RL.

N.2 Diffusion Sampling as Time-Reversed CTMC

The score-based generative modelling framework (Song et al., 2021) writes the forward noising process as a CTMC/SDE and generates by running the reversed process. The key identity is Anderson's reverse-time SDE:

dx=[f(x,t)g(t)2xlogpt(x)]dt+g(t)dBˉtdx = [f(x,t) - g(t)^2 \nabla_x \log p_t(x)]\,dt + g(t)\,d\bar{B}_t

where $\bar{B}_t$ is a Brownian motion running backwards in time. The score function $\nabla_x \log p_t(x)$ supplies the drift correction of the reversed process. Learning this score function via denoising score matching allows sampling by running the reverse process.

Discrete diffusion: For categorical data (text), the VP-SDE is replaced by a CTMC on the vocabulary with generator $Q(t)$. MDLM and other discrete diffusion models choose forward corruption chains $q(x_t|x_0)$ - masking (absorbing-state corruption) or uniform token transitions - and then learn the reverse chain.

N.3 MCMC for LLM Sampling

Standard autoregressive LLM decoding (top-$k$, nucleus) can be viewed as Markov chain sampling from a specific transition matrix. Recent speculative decoding methods (Chen et al., 2023; Leviathan et al., 2023) use a smaller "draft" model as a proposal and the large model as the acceptance criterion - an accept/resample scheme in the spirit of independence Metropolis-Hastings:

  • Proposal: q(xn+1context)=small_LM(xn+1context)q(x_{n+1}|\text{context}) = \text{small\_LM}(x_{n+1}|\text{context})
  • Acceptance: $a = \min(1,\ p_{\text{large}}(x)/q_{\text{small}}(x))$ at each drafted token $x$; rejected tokens are resampled from the normalised residual $(p_{\text{large}} - q_{\text{small}})^+$
  • Guarantees: the accepted tokens have the same distribution as large-LM-only sampling

This is MCMC applied to language model decoding - Markov chain theory directly yields correctness of speculative decoding.
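A single-token sketch of the accept/resample step (toy distributions; real systems apply this per drafted token): resampling rejected drafts from the normalised residual $(p - q)^+$ is what makes the output distribution exactly $p$.

```python
import numpy as np

rng = np.random.default_rng(4)
p = np.array([0.5, 0.3, 0.2])                 # large-model next-token probabilities (toy)
q = np.array([0.3, 0.3, 0.4])                 # draft-model proposal (toy)

def speculative_token():
    x = rng.choice(3, p=q)                    # draft proposes a token
    if rng.random() < min(1.0, p[x] / q[x]):  # accept with prob min(1, p/q)
        return x
    resid = np.maximum(p - q, 0.0)            # on rejection, resample from (p - q)^+
    return rng.choice(3, p=resid / resid.sum())

draws = np.array([speculative_token() for _ in range(100_000)])
print(np.bincount(draws) / len(draws))        # matches p = [0.5, 0.3, 0.2]
```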

N.4 Graph Neural Networks and Random Walks

Graph Neural Networks (GNNs) can be understood through the lens of Markov chains. A single GNN layer performs:

hv(k+1)=f(hv(k),AGGREGATE({hu(k):uN(v)}))h_v^{(k+1)} = f\left(h_v^{(k)},\, \text{AGGREGATE}\left(\{h_u^{(k)} : u \in \mathcal{N}(v)\}\right)\right)

For the simple mean-aggregation GNN: hv(k+1)=σ(Whv(k)+W1dvuvhu(k))h_v^{(k+1)} = \sigma(W h_v^{(k)} + W'\frac{1}{d_v}\sum_{u \sim v} h_u^{(k)}). The aggregation step is applying the random walk transition matrix D1AD^{-1}A to the feature matrix. kk layers of GNN = applying the random walk kk times - the receptive field grows as the kk-step neighbourhood.

Over-smoothing: After many layers, all node features converge to the stationary distribution of the random walk (uniform for regular graphs). This is over-smoothing - the GNN loses discriminative power because the Markov chain mixes. The mixing time bounds developed in this section directly bound the number of useful GNN layers, as the sketch below illustrates.
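Over-smoothing is visible in a dozen lines (the random graph and features are ours): repeated application of $D^{-1}A$ collapses the spread of node features.

```python
import numpy as np

rng = np.random.default_rng(5)
A = (rng.random((20, 20)) < 0.2).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 1.0)                       # self-loops keep the walk aperiodic
P = A / A.sum(axis=1, keepdims=True)           # random-walk matrix D^{-1} A

H = rng.normal(size=(20, 8))                   # initial node features
for k in (1, 2, 4, 8, 16, 32):
    Hk = np.linalg.matrix_power(P, k) @ H
    print(k, float(Hk.std(axis=0).mean()))     # across-node spread shrinks as the walk mixes
```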

N.5 Token Position Encoding as Markov Chain

The self-attention mechanism with rotary position encoding (RoPE) can be viewed as computing a position-dependent Markov chain over tokens. Each attention head defines a stochastic matrix over token positions (the attention weights), and information flows along this chain.

Attention as ergodic averaging: In multi-head attention, each head computes softmax(QKT/d)V\text{softmax}(QK^T/\sqrt{d})V. The softmax weights form a row-stochastic matrix - a one-step Markov transition. The output is the expected value of VV under this distribution. Stacking attention layers is like composing Markov kernels.

Information bottleneck: The rank of the attention matrix limits the "state space" of the Markov chain. Low-rank attention (as in linear attention / Performer) corresponds to a Markov chain with restricted state space - mixing may be faster but with less expressivity.


Appendix O: Review Problems

Review Problem 1. Let PP be a 3×33\times3 stochastic matrix with P11=0.5,P12=0.3,P13=0.2,P21=0.1,P22=0.7,P23=0.2,P31=0.3,P32=0.3,P33=0.4P_{11}=0.5, P_{12}=0.3, P_{13}=0.2, P_{21}=0.1, P_{22}=0.7, P_{23}=0.2, P_{31}=0.3, P_{32}=0.3, P_{33}=0.4. (a) Is this chain irreducible? (b) Find the stationary distribution. (c) Starting from state 1, compute the distribution after 10 steps. (d) Compare to stationarity.

Review Problem 2. Consider a random walk on a cycle of length NN (states {0,1,,N1}\{0,1,\ldots,N-1\}, move clockwise or counterclockwise with equal probability 1/21/2). (a) What is the stationary distribution? (b) What is the period? (c) Compute the mixing time using the spectral gap (eigenvalues: cos(2πk/N)\cos(2\pi k/N) for k=0,,N1k=0,\ldots,N-1). (d) How does mixing time scale with NN?

Review Problem 3. (MCMC) You want to sample from π(x)xa1ebx\pi(x) \propto x^{a-1}e^{-bx} (a Gamma distribution) using MH with log-normal proposal q(xx)=LogNormal(logx,σ2)q(x'|x) = \text{LogNormal}(\log x, \sigma^2). (a) Write down the acceptance ratio. (b) Is qq symmetric? (c) Will this chain have π\pi as stationary distribution?

Review Problem 4. (PageRank) The web has 3 pages: 1 links to 2 and 3; 2 links to 1; 3 links to 1. With damping α=0.85\alpha=0.85, write the PageRank transition matrix and find π\pi by solving a linear system. Which page has the highest PageRank?

Review Problem 5. (HMM) For an HMM with 2 hidden states, 2 observations, transition T=(0.60.40.30.7)T = \begin{pmatrix}0.6&0.4\\0.3&0.7\end{pmatrix}, emission E=(0.90.10.20.8)E = \begin{pmatrix}0.9&0.1\\0.2&0.8\end{pmatrix}, initial π0=(0.5,0.5)\pi_0 = (0.5, 0.5): (a) Compute P(obs=(0,1,0))P(\text{obs}=(0,1,0)) using the forward algorithm. (b) Find the Viterbi path.


Appendix P: Common Distributions in MCMC Targets

When π(θ)exp(f(θ))\pi(\theta) \propto \exp(-f(\theta)), the gradient f\nabla f drives Langevin dynamics. Key shapes:

Gaussian: f(θ)=12θTΣ1θf(\theta) = \frac{1}{2}\theta^T\Sigma^{-1}\theta. Gradient: Σ1θ\Sigma^{-1}\theta. Langevin mixes in O(κ(Σ))O(\kappa(\Sigma)) steps. HMC mixes in O(κ(Σ)1/2)O(\kappa(\Sigma)^{1/2}) steps (advantage for ill-conditioned targets).

Logistic posterior: f(θ)=iyixiTθ+ilog(1+exiTθ)+12σ2θ2f(\theta) = -\sum_i y_i x_i^T\theta + \sum_i \log(1+e^{x_i^T\theta}) + \frac{1}{2\sigma^2}\|\theta\|^2. Strongly log-concave; SGLD mixes in polynomial time.

Mixture of Gaussians: π=kwkN(μk,Σk)\pi = \sum_k w_k \mathcal{N}(\mu_k, \Sigma_k). Multi-modal; standard MCMC can get stuck between modes. Parallel tempering or simulated annealing needed for good mixing.

Heavy-tailed: f(θ)=νlog(1+θ2/ν)f(\theta) = \nu\log(1 + \|\theta\|^2/\nu) (Student-t). Sub-quadratic growth means the tails are heavy. HMC can struggle; NUTS with dual averaging handles this.

Boltzmann (energy-based models): $f(x) = E_\phi(x)$ for a neural network energy $E_\phi$, i.e., $\pi(x) \propto e^{-E_\phi(x)}$. Sampling requires running MCMC; contrastive divergence (CD-$k$) uses short MCMC chains as an approximation.


Appendix Q: Further Reading

  1. Levin, Peres, Wilmer - Markov Chains and Mixing Times (AMS, 2nd ed., 2017) - the definitive reference for mixing times and coupling; freely available online

  2. Norris, J.R. - Markov Chains (Cambridge, 1997) - rigorous treatment of discrete and continuous-time chains with clean proofs

  3. Brooks, Gelman, Jones, Meng - Handbook of Markov Chain Monte Carlo (CRC Press, 2011) - comprehensive MCMC reference; includes HMC, NUTS, diagnostics

  4. Betancourt, M. - "A Conceptual Introduction to Hamiltonian Monte Carlo" (arXiv, 2017) - outstanding intuitive treatment of HMC geometry

  5. Page et al. - "The PageRank Citation Ranking: Bringing Order to the Web" (Stanford Technical Report, 1999) - original PageRank paper

  6. Rabiner, L.R. - "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition" (Proc. IEEE, 1989) - classic HMM reference

  7. Song et al. - "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021) - connects diffusion models to CTMC theory

  8. Welling & Teh - "Bayesian Learning via Stochastic Gradient Langevin Dynamics" (ICML 2011) - introduces SGLD for large-scale Bayesian inference

  9. Sutton & Barto - Reinforcement Learning: An Introduction (MIT Press, 2nd ed., 2018) - MDPs, Bellman equations, TD-learning; free online


Appendix R: Spectral Methods for Large-Scale Markov Chains

R.1 Lanczos Algorithm for Large Transition Matrices

For transition matrices arising in PageRank (N1010N \sim 10^{10} web pages), direct eigendecomposition is infeasible. The power iteration and Lanczos algorithm enable computing the top eigenvectors using only matrix-vector products.

Power iteration for stationary distribution:

  • Memory: O(N)O(N) for the distribution vector
  • Per-iteration cost: O(nnz)O(\text{nnz}) (number of non-zeros in PP) for sparse matrices
  • Convergence in O(log(1/ε)/log(1/λ2))O(\log(1/\varepsilon) / \log(1/|\lambda_2|)) iterations

For the web graph, $|\lambda_2| \leq \alpha = 0.85$ (damping), so roughly 110 iterations give $\varepsilon = 10^{-8}$ precision (since $0.85^{110} \approx 10^{-8}$).

Randomised SVD: For low-rank approximation of PnP^n (needed for long-range transition probability estimation), randomised methods can approximate the top-rr singular vectors in O(Nrlogr)O(N r \log r) time - much cheaper than full SVD.

R.2 Multi-Scale Markov Chains

For chains with multiple timescales (fast-mixing within clusters, slow-mixing between clusters), aggregation-disaggregation methods provide efficient algorithms:

  1. Aggregation: Lump states within each cluster into a single meta-state. Compute the inter-cluster transition matrix Pˉ\bar{P}.
  2. Disaggregation: Compute the within-cluster stationary distributions π(c)\pi^{(c)} for each cluster cc.
  3. Combine: πi=πˉcπi(c)\pi_i = \bar{\pi}_c \cdot \pi^{(c)}_i for state ii in cluster cc.

This decomposition is exact when the chain is "nearly lumpable" (within-cluster transitions are much faster than between-cluster). It's the foundation of multi-scale MCMC methods.

For AI: LLM attention patterns have multi-scale structure: attention heads attend to nearby tokens (fast mixing) and distant context (slow mixing). This multi-scale structure can be exploited for efficient long-context modelling.

R.3 Reversibilisation

Any ergodic Markov chain can be made reversible. Writing $P^*_{ij} = \pi_j P_{ji}/\pi_i$ for the time-reversed chain, the additive reversibilisation is:

\tilde{P} = \frac{1}{2}(P + P^*), \qquad \tilde{P}_{ij} = \frac{1}{2}\left(P_{ij} + \frac{\pi_j P_{ji}}{\pi_i}\right)

(the multiplicative reversibilisation $PP^*$ is another standard choice).

The resulting P~\tilde{P} is reversible with the same stationary distribution π\pi. Reversibilisation can speed up or slow down convergence depending on the chain - it destroys the directional bias that non-reversible chains use for faster mixing.
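A small numerical check (the chain is ours) that the additive reversibilisation preserves $\pi$ and restores detailed balance:

```python
import numpy as np

P = np.array([[0.1, 0.8, 0.1],       # mostly-cyclic, non-reversible, ergodic
              [0.1, 0.1, 0.8],
              [0.8, 0.1, 0.1]])
pi = np.ones(3) / 3                  # doubly stochastic => uniform stationary law

P_star = (pi[None, :] * P.T) / pi[:, None]   # time reversal: P*_ij = pi_j P_ji / pi_i
P_rev = 0.5 * (P + P_star)

print(pi @ P_rev - pi)                       # ~0: same stationary distribution
F = pi[:, None] * P_rev
print(np.abs(F - F.T).max())                 # 0: detailed balance holds
```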


Appendix S: Continuous-State Markov Chains

S.1 Harris Chains

For Markov chains on continuous state spaces (like MCMC on $\mathbb{R}^d$), the discrete-state ergodic theory extends with modifications. A Markov chain with transition kernel $K(x, \cdot)$ is:

  • ϕ\phi-irreducible if there exists a σ\sigma-finite measure ϕ\phi such that for any Borel set AA with ϕ(A)>0\phi(A) > 0, we have n=1Kn(x,A)>0\sum_{n=1}^\infty K^n(x, A) > 0 for all xx.
  • Harris recurrent if it is ϕ\phi-irreducible and returns to any set AA with ϕ(A)>0\phi(A) > 0 with probability 1.
  • Positive Harris recurrent if the chain is Harris recurrent and has a unique invariant probability measure π\pi.

A positive Harris recurrent, aperiodic chain is called a Harris ergodic chain, and the ergodic theorem holds: time averages converge to π\pi-expectations.

For MCMC: The Metropolis-Hastings chain on Rd\mathbb{R}^d with Gaussian proposal and a proper target density π\pi is Harris ergodic under mild regularity conditions. This justifies MCMC estimators for continuous posteriors.

S.2 Geometric Ergodicity

A Markov chain is geometrically ergodic if there exist a function $C(x) < \infty$ and a constant $\rho < 1$ such that:

Kn(x,)πTVC(x)ρnfor all x\|K^n(x, \cdot) - \pi\|_{\text{TV}} \leq C(x)\,\rho^n \quad \text{for all } x

Geometric ergodicity implies a central limit theorem for MCMC estimators:

n(1nt=1nf(Xt)Eπ[f])dN(0,σf2)\sqrt{n}\left(\frac{1}{n}\sum_{t=1}^n f(X_t) - \mathbb{E}_\pi[f]\right) \xrightarrow{d} \mathcal{N}(0, \sigma_f^2)

where σf2=Varπ(f)+2k=1Covπ(f(X0),f(Xk))\sigma_f^2 = \text{Var}_\pi(f) + 2\sum_{k=1}^\infty \text{Cov}_\pi(f(X_0), f(X_k)) is the asymptotic variance. This is the Markov chain CLT - it justifies asymptotic confidence intervals for MCMC estimates.

For AI: Geometric ergodicity of the Langevin algorithm for log-concave targets gives polynomial bounds on the number of gradient evaluations needed for ε\varepsilon-accurate posterior estimates. This is the theoretical basis for Bayesian deep learning via SGLD.


Appendix T: Markov Chain Games and Nash Equilibria

T.1 Stochastic Games

A stochastic game (Shapley, 1953) is an MDP with multiple agents. Two-player zero-sum stochastic games have:

  • State space S\mathcal{S}; action spaces A1,A2\mathcal{A}_1, \mathcal{A}_2
  • Transition P(ss,a1,a2)P(s'|s, a_1, a_2)
  • Reward r(s,a1,a2)r(s, a_1, a_2) (player 1 maximises, player 2 minimises)

At a Nash equilibrium, each player's policy is a best response to the other's. The value function satisfies a minimax Bellman equation:

V(s)=mina2maxa1[r(s,a1,a2)+γsP(ss,a1,a2)V(s)]V^*(s) = \min_{a_2} \max_{a_1} \left[r(s,a_1,a_2) + \gamma \sum_{s'} P(s'|s,a_1,a_2) V^*(s')\right]

For AI: Multi-agent RL (MARL) with competing agents (e.g., AlphaGo, Libratus for poker) uses stochastic game theory. RLHF with adversarial reward models is a stochastic game between the policy and the worst-case reward model.

T.2 Markov Perfect Equilibrium

In dynamic games, a Markov perfect equilibrium (MPE) is a Nash equilibrium where strategies depend only on the current state (Markov property on the strategy). MPE is the game-theoretic analogue of the optimal policy in MDPs.

Computing MPE requires solving a system of coupled Bellman equations - one per player. For two-player zero-sum games, this reduces to linear programming (minimax theorem). For general-sum games, this is PPAD-complete.


Appendix U: Temporal Difference Learning and Martingales

Connecting back to Section06 (Stochastic Processes, Section 3.7), the TD(0) update is:

V(st)V(st)+αt[rt+γV(st+1)V(st)]V(s_t) \leftarrow V(s_t) + \alpha_t [r_t + \gamma V(s_{t+1}) - V(s_t)]

The update direction δt=rt+γV(st+1)V(st)\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) is the TD error. Under the true value function VV^*, E[δtst]=0\mathbb{E}[\delta_t | s_t] = 0 (martingale difference). TD learning is a stochastic approximation algorithm for finding the fixed point of the Bellman operator TπV=rπ+γPπVT^\pi V = r^\pi + \gamma P^\pi V.

Convergence proof via martingale theory: Write Vt=V+εtV_t = V^* + \varepsilon_t (error from true value). Then:

εt+1(st)=(1αt)εt(st)+αt[γPπεt(st)+Mt+1]\varepsilon_{t+1}(s_t) = (1 - \alpha_t)\varepsilon_t(s_t) + \alpha_t[\gamma P^\pi \varepsilon_t(s_t) + M_{t+1}]

where Mt+1=rt+γV(st+1)V(st)M_{t+1} = r_t + \gamma V^*(s_{t+1}) - V^*(s_t) is a martingale difference noise term. Under the Robbins-Monro conditions αt=\sum\alpha_t = \infty, αt2<\sum\alpha_t^2 < \infty, the ODE method (Borkar, 2008) shows εt0\varepsilon_t \to 0 a.s.


Appendix V: Practical MCMC Checklist for AI/ML

Before running MCMC:

  • Identify the target: is π(θD)\pi(\theta | \mathcal{D}) proper? Is logπ\log\pi evaluable efficiently?
  • Choose algorithm: log-concave -> Langevin/HMC; multimodal -> parallel tempering; discrete -> Gibbs
  • Set proposal scale: tune to ~23% acceptance (MH) or ~65% (HMC)
  • Run multiple chains (4\geq 4) from diverse starting points

During MCMC:

  • Monitor trace plots for signs of non-stationarity (trends, sticking)
  • Track acceptance rate: should be in target range
  • Compute R^\hat{R} after every 1000 iterations; stop when R^1.01\hat{R} \leq 1.01

After MCMC:

  • Discard burn-in (first 50% as rule of thumb)
  • Compute ESS per parameter; target ESS 400\geq 400 for reliable estimates
  • Check posterior predictive checks: do simulated data match observed data?
  • Report: posterior mean, posterior std, 95% credible interval, ESS, R^\hat{R}

Common pathologies:

  • Divergent transitions (HMC): posterior geometry has sharp ridges; reparameterise
  • Low ESS: chain mixes slowly; increase step size, use HMC, or apply preconditioning
  • R^>1.1\hat{R} > 1.1: chains haven't agreed; run longer, use better initialisation
  • Bimodal trace plots: chain stuck between modes; use parallel tempering

Appendix W: Worked Computation - Two-State Chain

Full Analysis

Let P=(0.70.30.50.5)P = \begin{pmatrix}0.7 & 0.3 \\ 0.5 & 0.5\end{pmatrix}, starting from μ(0)=(1,0)\mu^{(0)} = (1, 0).

Step 1: Eigenvalues. Characteristic polynomial: det(PλI)=(0.7λ)(0.5λ)0.3×0.5=λ21.2λ+0.20=(λ1)(λ0.2)=0\det(P - \lambda I) = (0.7-\lambda)(0.5-\lambda) - 0.3 \times 0.5 = \lambda^2 - 1.2\lambda + 0.20 = (\lambda-1)(\lambda-0.2) = 0. Eigenvalues: λ1=1,λ2=0.2\lambda_1=1, \lambda_2=0.2.

Step 2: Stationary distribution. From πP=π\pi P = \pi: 0.7π1+0.5π2=π1π2=0.6π10.7\pi_1 + 0.5\pi_2 = \pi_1 \Rightarrow \pi_2 = 0.6\pi_1. With π1+π2=1\pi_1+\pi_2=1: π=(5/8,3/8)\pi = (5/8, 3/8).

Step 3: nn-step distribution. Using the spectral decomposition:

P^n = \mathbf{1}\pi + (0.2)^n \cdot \text{(correction term)}

Explicitly: Pn=(5/83/85/83/8)+(0.2)n(3/83/85/85/8)P^n = \begin{pmatrix}5/8 & 3/8 \\ 5/8 & 3/8\end{pmatrix} + (0.2)^n\begin{pmatrix}3/8 & -3/8 \\ -5/8 & 5/8\end{pmatrix}

So μ(n)=μ(0)Pn=(5/8+3/8(0.2)n, 3/83/8(0.2)n)\mu^{(n)} = \mu^{(0)}P^n = (5/8 + 3/8 \cdot (0.2)^n,\ 3/8 - 3/8 \cdot (0.2)^n).

Step 4: TV distance. μ(n)πTV=123/8(0.2)n+123/8(0.2)n=3/8(0.2)n\|\mu^{(n)} - \pi\|_{\text{TV}} = \frac{1}{2}|{3/8 \cdot (0.2)^n}| + \frac{1}{2}|{3/8 \cdot (0.2)^n}| = 3/8 \cdot (0.2)^n.

At n=5n=5: 3/8×0.00032=0.000123/8 \times 0.00032 = 0.00012 - essentially converged. Spectral gap =10.2=0.8= 1 - 0.2 = 0.8, so mixing is fast.

Step 5: Verify detailed balance.

π1P12=5/8×0.3=0.1875,π2P21=3/8×0.5=0.1875\pi_1 P_{12} = 5/8 \times 0.3 = 0.1875, \quad \pi_2 P_{21} = 3/8 \times 0.5 = 0.1875 \checkmark

The chain is reversible with this stationary distribution.
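The whole computation fits in a few lines of NumPy if you want to double-check the algebra:

```python
import numpy as np

P = np.array([[0.7, 0.3], [0.5, 0.5]])
pi = np.array([5/8, 3/8])

print(np.linalg.eigvals(P))                        # 1.0 and 0.2
for n in (1, 2, 5):
    mu_n = np.array([1.0, 0.0]) @ np.linalg.matrix_power(P, n)
    tv = 0.5 * np.abs(mu_n - pi).sum()
    print(n, mu_n, tv, 3/8 * 0.2**n)               # TV distance equals 3/8 * 0.2^n
```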


Appendix X: Advanced Topics

X.1 Geometric Random Walks for Volume Computation

The ball walk and Gaussian walk are Markov chains used to estimate the volume of convex bodies. Starting from a point xx in a convex body KK, the chain moves to a uniformly random point within a ball of radius rr centred at xx (rejecting if outside KK). The stationary distribution is uniform on KK.

Mixing time analysis: $t_{\text{mix}} = O(n^2\,\text{poly}(1/\varepsilon))$ for the ball walk, where $n$ is the dimension. This gives a polynomial-time randomised algorithm for volume computation (Dyer-Frieze-Kannan theorem).

For AI: High-dimensional sampling problems (Bayesian inference with $d \sim 10^9$ parameters) require understanding how mixing time scales with dimension. Geometric random walks provide the fundamental tools.
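
For illustration, a minimal ball-walk sketch targeting the uniform distribution on a convex body given only by a membership oracle; the body (the unit Euclidean ball), radius `r`, and step count below are illustrative assumptions, not tuned values.

```python
import numpy as np

def ball_walk(in_body, x0, r=0.3, n_steps=10_000, rng=None):
    """Ball walk: propose a uniform point in B(x, r); reject if outside K.
    This is Metropolis with a symmetric proposal and flat target, so the
    stationary distribution is uniform on K."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    d = len(x)
    samples = []
    for _ in range(n_steps):
        # Uniform point in B(0, r): random direction, radius^(1/d) scaling
        u = rng.normal(size=d)
        u *= r * rng.random() ** (1 / d) / np.linalg.norm(u)
        y = x + u
        if in_body(y):          # reject moves that leave K (stay put otherwise)
            x = y
        samples.append(x.copy())
    return np.array(samples)

# Convex body: the unit ball in d dimensions, given as a membership oracle
d = 5
unit_ball = lambda z: np.dot(z, z) <= 1.0
S = ball_walk(unit_ball, x0=np.zeros(d))
print(S.mean(axis=0))   # near 0: the uniform distribution on the ball is centred
```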

X.2 Markov Chain Tree Theorem

For a strongly connected directed graph with transition matrix $P$, the stationary distribution can be computed via spanning trees:

$$\pi_i = \frac{\tau_i}{\sum_j \tau_j}$$

where $\tau_i$ is the sum of weights of all spanning trees directed toward node $i$ (arborescences rooted at $i$). This is the Matrix-Tree theorem (Kirchhoff, 1847) extended to directed graphs.

Implication: The stationary distribution has a combinatorial formula in terms of the graph structure. This provides an alternative to eigendecomposition that can be faster for sparse graphs.

X.3 Lumping and Aggregation

A partition $\{C_1,\ldots,C_m\}$ of $\mathcal{S}$ is lumpable with respect to $P$ if for any two states $i, j$ in the same class $C_k$, the aggregated transitions agree: $\sum_{\ell \in C_m} P_{i\ell} = \sum_{\ell \in C_m} P_{j\ell}$ for all classes $C_m$. In this case, the quotient chain $\bar{P}$ on $\{C_1,\ldots,C_m\}$ is a well-defined Markov chain.

Lumping enables dimensionality reduction: replace a large chain with a smaller aggregated chain. The stationary distribution of the lumped chain aggregates that of the original.
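
A minimal numpy sketch of the lumpability test and quotient-chain construction, on a hypothetical 4-state chain with partition $\{\{0,1\},\{2,3\}\}$:

```python
import numpy as np

def is_lumpable(P, classes, tol=1e-12):
    """Kemeny-Snell lumpability: within each class, every state must have
    the same total transition probability into every class."""
    for C in classes:
        for D in classes:
            block_sums = P[np.ix_(C, D)].sum(axis=1)
            if not np.allclose(block_sums, block_sums[0], atol=tol):
                return False
    return True

def quotient_chain(P, classes):
    """Aggregate P into the quotient chain (valid only if lumpable)."""
    m = len(classes)
    P_bar = np.zeros((m, m))
    for a, C in enumerate(classes):
        for b, D in enumerate(classes):
            P_bar[a, b] = P[np.ix_(C, D)].sum(axis=1)[0]
    return P_bar

# A 4-state chain that is lumpable under {{0,1},{2,3}}
P = np.array([[0.1, 0.3, 0.4, 0.2],
              [0.2, 0.2, 0.1, 0.5],
              [0.3, 0.4, 0.2, 0.1],
              [0.6, 0.1, 0.0, 0.3]])
classes = [[0, 1], [2, 3]]
print(is_lumpable(P, classes))     # True: rows 0,1 send 0.4/0.6; rows 2,3 send 0.7/0.3
print(quotient_chain(P, classes))  # the aggregated 2x2 chain
```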

For AI: Grouping similar states (e.g., tokens with similar embeddings) can define a lumped chain that captures the high-level dynamics of the LLM's token distribution, enabling efficient analysis of LLM behaviour at scale.


Appendix Y: Connections to Physics

Y.1 Statistical Mechanics and Gibbs Measures

The Gibbs distribution $\pi(x) \propto e^{-H(x)/(k_B T)}$ (Boltzmann distribution) is the equilibrium distribution of a physical system with Hamiltonian $H$ at temperature $T$. This is the canonical target distribution for MCMC in physics.

The Metropolis algorithm was originally designed to sample from Gibbs distributions of particle systems (Metropolis et al., 1953). MCMC methods in ML are directly descended from statistical mechanics - including the energy-based models (EBMs) that predate modern deep learning.

Temperature annealing: Starting at high $T$ (flat, easy-to-sample distribution) and gradually cooling towards $T=0$ (greedy optimisation) is simulated annealing - a probabilistic optimisation algorithm. The connection between MCMC sampling and optimisation as $T \to 0$ is the bridge between Bayesian inference and MAP estimation.
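
A minimal simulated-annealing sketch on a toy 1D energy; the energy function, cooling schedule, and proposal scale are illustrative assumptions:

```python
import numpy as np

def simulated_annealing(energy, x0, T0=5.0, cooling=0.999, n_steps=20_000,
                        step=0.5, rng=None):
    """Metropolis dynamics with a slowly decreasing temperature.
    At high T the chain explores; as T -> 0 it concentrates on minima."""
    if rng is None:
        rng = np.random.default_rng(0)
    x, T = x0, T0
    for _ in range(n_steps):
        y = x + step * rng.normal()
        dE = energy(y) - energy(x)
        # Metropolis acceptance for the Gibbs target exp(-E/T)
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            x = y
        T *= cooling          # geometric cooling schedule
    return x

# Toy multimodal energy with its global minimum near x = 2.3
E = lambda x: 0.1 * x**4 - x**2 - 0.5 * x
print(simulated_annealing(E, x0=-3.0))   # typically lands near the global minimum
```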

Y.2 Detailed Balance and Thermodynamic Equilibrium

Detailed balance in physics is called microscopic reversibility or time-reversal symmetry. A physical system is at thermodynamic equilibrium iff its microscopic dynamics satisfy detailed balance. Systems out of equilibrium (driven by energy flows) violate detailed balance and maintain directed probability currents.

Non-equilibrium steady states: Some biological systems (motor proteins, gene regulatory networks) maintain non-reversible Markov chains in steady state - driven by ATP hydrolysis. These are "out-of-equilibrium" in the thermodynamic sense. Non-reversible MCMC is inspired by this physics.

Y.3 Renormalization Group and Multi-Scale Chains

The renormalization group (Wilson, 1971) is a technique in physics for coarse-graining: replacing fine-grained microscopic degrees of freedom with effective coarse-grained ones. This is mathematically analogous to the lumping/aggregation of Markov chains.

In ML, knowledge distillation (training a small model to mimic a large model) is a form of renormalization: the small model's transition matrix approximates the "effective" dynamics of the large model at a coarser scale.


Appendix Z: Glossary

Absorbing state. A state $i$ with $P_{ii}=1$; once the chain arrives, it stays forever. Every absorbing state is recurrent.

Aperiodic. A state $i$ with period $d(i)=1$. A chain is aperiodic if all states are aperiodic. Aperiodicity + irreducibility + positive recurrence = ergodicity.

Birth-death chain. A Markov chain on $\mathbb{Z}_{\geq 0}$ with only nearest-neighbour transitions ($|i-j| \leq 1$). Always reversible; stationary distribution computed via detailed balance.

Chapman-Kolmogorov equations. $P^{(m+n)} = P^m P^n$: the $(m+n)$-step transition matrix factors as a product of $m$- and $n$-step matrices.

Communicating class. A maximal set of states $C$ with $i \leftrightarrow j$ for all $i,j \in C$. Partitions the state space into equivalence classes.

Conductance (Cheeger constant). $h = \min_{S: \pi(S) \leq 1/2} Q(S, S^c)/\pi(S)$, where $Q(S,S^c)$ is the probability flux from $S$ to $S^c$ at stationarity. Bounds: $h^2/2 \leq \text{gap} \leq 2h$.

Coupling. A joint process $(X_t, Y_t)$ where each marginal is a valid Markov chain with transition $P$. Used to bound TV distance via $\|\mu P^n - \nu P^n\|_{\text{TV}} \leq P(\tau > n)$, where $\tau$ is the coalescence time.

CTMC. Continuous-time Markov chain. Holds in each state for an exponential time, then jumps. Specified by a generator matrix $Q$ with $P(t) = e^{Qt}$.

Detailed balance. $\pi_i P_{ij} = \pi_j P_{ji}$: probability flux is equal in both directions at stationarity. Sufficient (not necessary) for $\pi$ to be stationary.

Doubly stochastic matrix. A matrix whose rows AND columns all sum to 1. Stationary distribution is uniform.

Embedded chain. The discrete-time chain of states visited by a CTMC (ignoring holding times). Jump probabilities $\tilde{P}_{ij} = q_{ij}/q_i$.

Ergodic chain. Irreducible + positive recurrent + aperiodic. Convergence to a unique stationary distribution is guaranteed.

Ergodic theorem. For an ergodic chain: $\frac{1}{n}\sum_{k=0}^{n-1} f(X_k) \to \mathbb{E}_\pi[f]$ a.s. Justifies MCMC estimation.

Fundamental matrix. $N = (I-Q)^{-1}$ for absorbing chains, where $Q$ is the sub-matrix of transitions between transient states. $N_{ij}$ = expected visits to state $j$ starting from $i$ before absorption.

Generator matrix (Q-matrix). $Q_{ij} = q_{ij} \geq 0$ for $i \neq j$ (jump rates), $Q_{ii} = -\sum_{j \neq i} q_{ij}$. Rows sum to zero. CTMC transition: $P(t) = e^{Qt}$.

Gibbs sampling. MCMC algorithm that updates one coordinate at a time from its full conditional distribution. Special case of MH with acceptance ratio 1.

Harmonic function. $h(i) = (Ph)(i)$: equals its own expected value one step ahead. The process $h(X_n)$ is a martingale, constant in expectation.

HMC (Hamiltonian Monte Carlo). MCMC using gradient information to make long-range moves. Augments the state with momentum; uses leapfrog integration to simulate Hamiltonian dynamics.

HMM (Hidden Markov Model). A Markov chain $(Z_t)$ of hidden states with observations $(X_t)$ emitted from the current state. Algorithms: forward, backward, Viterbi, Baum-Welch.

Irreducible chain. Every state is accessible from every other: $i \to j$ for all $i,j$. Ensures no trapping in subsets.

Lazy chain. $P' = (I+P)/2$: stays put with probability 1/2, otherwise transitions. Aperiodic by construction; eigenvalues non-negative.

Lumpable partition. A partition where all states in the same class have identical aggregate transition probabilities to every other class. Enables dimensionality reduction.

Markov chain. A sequence of random variables where the future depends on the past only through the present: $P(X_{n+1} \mid X_0,\ldots,X_n) = P(X_{n+1} \mid X_n)$.

Markov property. $X_{n+1} \perp (X_0,\ldots,X_{n-1}) \mid X_n$: conditional independence of future from past given the present.

MDP (Markov Decision Process). Extension of a Markov chain with actions; a policy induces a Markov chain; the Bellman equation characterises the value function.

Metropolis-Hastings. MCMC algorithm: propose $\theta' \sim q(\cdot \mid \theta)$, accept with probability $\min\bigl(1, \pi(\theta')q(\theta \mid \theta')/(\pi(\theta)q(\theta' \mid \theta))\bigr)$. Correct by detailed balance.

Mixing time. $t_{\text{mix}}(\varepsilon) = \min\{t: \max_x \|P^t(x,\cdot)-\pi\|_{\text{TV}} \leq \varepsilon\}$. Bounded by $O(\log(\pi_{\min}^{-1})/\text{gap})$.

Null recurrent. Recurrent state with infinite mean return time ($\mu_i = \infty$). Common in countably infinite chains (e.g., the 1D SRW); no normalisable stationary distribution.

PageRank. Stationary distribution of the random-surfer Markov chain on the web graph. Computed by power iteration.

Period. $d(i) = \gcd\{n \geq 1: P^{(n)}_{ii} > 0\}$. Period 1 = aperiodic.

Perron-Frobenius theorem. For primitive stochastic $P$: eigenvalue 1 is unique and dominant; there is a unique stationary distribution $\pi > 0$; $P^n \to \mathbf{1}\pi^T$ geometrically.

Positive recurrent. Recurrent state with finite mean return time ($\mu_i < \infty$). In a finite irreducible chain, all states are positive recurrent.

Reversible chain. Satisfies detailed balance; the time-reversed chain equals the forward chain. Equivalent to: $P$ is self-adjoint in $L^2(\pi)$.

SGLD. Stochastic gradient Langevin dynamics: SGD + Gaussian noise. Approximates Bayesian inference at scale.

Spectral gap. $\text{gap} = 1 - |\lambda_2|$ for reversible chains. Determines mixing rate: $t_{\text{mix}} = \Theta(1/\text{gap})$.

Stationary distribution. $\pi$ with $\pi P = \pi$, $\pi_i \geq 0$, $\sum_i \pi_i = 1$. The equilibrium distribution; unique for ergodic chains.

Stochastic matrix. $P_{ij} \geq 0$ and $\sum_j P_{ij} = 1$ for all $i$. The transition matrix of a DTMC.

Total variation (TV) distance. $\|\mu-\nu\|_{\text{TV}} = \sup_A|\mu(A)-\nu(A)| = \frac{1}{2}\sum_i|\mu_i-\nu_i|$. The mixing distance for Markov chains.

Transient state. A state $i$ with $f_i = P(\text{return to } i) < 1$. The chain leaves and never returns with positive probability; $\sum_n P^{(n)}_{ii} < \infty$.

Viterbi algorithm. Dynamic programming for the most likely hidden state sequence in an HMM. Complexity $O(T|\mathcal{S}|^2)$.


Appendix AA: Practice Problems with Solutions

AA.1 Absorption Probability Calculation

Problem. A gambler starts with \$3 and plays at a casino. Each round they win \$1 (prob 0.45) or lose \$1 (prob 0.55). The game ends when they reach \$0 (ruin) or \$6 (goal). Find: (a) the probability of reaching \$6 from \$3, (b) the expected duration of the game.

Solution. Let $p = 0.45$, $q = 0.55$, $r = q/p = 0.55/0.45 = 11/9$. Using the absorption formula for biased walks with absorbing boundaries at $0$ and $N=6$:

(a) $P(\text{reach } 6 \mid \text{start at } 3) = \dfrac{1 - r^3}{1 - r^6} = \dfrac{1 - (11/9)^3}{1 - (11/9)^6} \approx \dfrac{1-1.826}{1-3.334} \approx \dfrac{-0.826}{-2.334} \approx 0.354$

Only about a 35% chance of reaching the goal from the midpoint, versus 50% in a fair game - the house edge significantly hurts the gambler.

(b) Expected duration. For a biased walk, $\mathbb{E}[\tau \mid X_0=a] = \dfrac{a}{q-p} - \dfrac{N}{q-p} \cdot P(\text{win})$. With $q-p=0.1$: $\mathbb{E}[\tau] = 3/0.1 - (6/0.1) \times 0.354 = 30 - 21.2 \approx 8.8$ rounds.
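
Both answers can be cross-checked by solving the absorption equations as a linear system; a minimal sketch:

```python
import numpy as np

p, q, N = 0.45, 0.55, 6

# h[a] = P(reach N before 0 | start at a): h(0)=0, h(N)=1,
# and h(a) = p*h(a+1) + q*h(a-1) for interior a.
A = np.zeros((N + 1, N + 1))
b = np.zeros(N + 1)
A[0, 0] = A[N, N] = 1.0
b[N] = 1.0
for a in range(1, N):
    A[a, a] = 1.0
    A[a, a + 1] = -p
    A[a, a - 1] = -q
h = np.linalg.solve(A, b)
print(h[3])        # ~0.354

# t[a] = expected duration: t(0)=t(N)=0, t(a) = 1 + p*t(a+1) + q*t(a-1);
# the same matrix A works with a right-hand side of ones in the interior.
b = np.zeros(N + 1)
b[1:N] = 1.0
t = np.linalg.solve(A, b)
print(t[3])        # ~8.8 rounds
```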

AA.2 PageRank Mini-Example with Computation

Problem. Three web pages: 1 links to {2, 3}, 2 links to {3}, 3 links to {1}. Find the PageRank vector with damping $\alpha=0.8$.

Adjacency matrix:

$$A = \begin{pmatrix}0&1&1\\0&0&1\\1&0&0\end{pmatrix}$$

Out-degrees: $d_1=2$, $d_2=1$, $d_3=1$.

Transition matrix without damping:

$$P_0 = \begin{pmatrix}0&1/2&1/2\\0&0&1\\1&0&0\end{pmatrix}$$

Damped matrix ($\alpha=0.8$):

$$P = 0.8\,P_0 + 0.2 \times \tfrac{1}{3}J = \begin{pmatrix}1/15 & 7/15 & 7/15 \\ 1/15 & 1/15 & 13/15 \\ 13/15 & 1/15 & 1/15\end{pmatrix}$$

Solving $\pi P = \pi$ with $\pi_1+\pi_2+\pi_3=1$ (a $3\times3$ linear system) gives $\pi = (61/159,\ 35/159,\ 63/159) \approx (0.384,\ 0.220,\ 0.396)$.

Page 3 has the highest PageRank - it receives links from both page 1 and page 2, while pages 1 and 2 each receive only one inbound link.
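
Power iteration reproduces this in a few lines; a minimal sketch:

```python
import numpy as np

alpha = 0.8
P0 = np.array([[0, 0.5, 0.5],
               [0, 0,   1  ],
               [1, 0,   0  ]])
# Damped (teleporting) transition matrix
P = alpha * P0 + (1 - alpha) * np.ones((3, 3)) / 3

pi = np.ones(3) / 3
for _ in range(100):              # power iteration: pi <- pi P
    pi = pi @ P
print(pi)                         # ~[0.384 0.220 0.396] = (61, 35, 63)/159
print(np.allclose(pi, pi @ P))    # stationarity check: True
```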

AA.3 HMM Forward Algorithm Computation

Problem. HMM with 2 states H,L and 2 observations A,B.

  • $\pi_0 = (0.5, 0.5)$, $T = \begin{pmatrix}0.7&0.3\\0.4&0.6\end{pmatrix}$
  • Emission: $E_H = (0.9, 0.1)$, $E_L = (0.2, 0.8)$ (probabilities of A and B respectively)
  • Observation sequence: A, B

Forward algorithm:

$t=1$, obs = A: $\alpha_1(H) = 0.5 \times 0.9 = 0.45$, $\alpha_1(L) = 0.5 \times 0.2 = 0.10$

$t=2$, obs = B:

$$\alpha_2(H) = 0.1 \times [0.45 \times 0.7 + 0.10 \times 0.4] = 0.1 \times [0.315 + 0.04] = 0.0355$$
$$\alpha_2(L) = 0.8 \times [0.45 \times 0.3 + 0.10 \times 0.6] = 0.8 \times [0.135 + 0.06] = 0.156$$

Likelihood: $P(\text{obs}=(A,B)) = \alpha_2(H) + \alpha_2(L) = 0.0355 + 0.156 = 0.1915$.

Posterior: $P(Z_2=H \mid \text{obs}) = 0.0355/0.1915 \approx 18.5\%$; $P(Z_2=L \mid \text{obs}) \approx 81.5\%$.

Interpretation: Observing (A,B) - first high-emission then low-emission - suggests state L at time 2.
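
The same recursion in code; a minimal sketch of the forward algorithm for this example:

```python
import numpy as np

pi0 = np.array([0.5, 0.5])            # initial distribution over (H, L)
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])            # transition matrix
E = np.array([[0.9, 0.1],             # emission probs: rows (H, L), cols (A, B)
              [0.2, 0.8]])
obs = [0, 1]                          # observation sequence: A, B

alpha = pi0 * E[:, obs[0]]            # alpha_1
for o in obs[1:]:
    # alpha_{t+1}(j) = [sum_i alpha_t(i) T_ij] * E_j(o)
    alpha = (alpha @ T) * E[:, o]
print(alpha)                          # [0.0355 0.156]
print(alpha.sum())                    # likelihood 0.1915
print(alpha / alpha.sum())            # posterior [0.185 0.815]
```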


Appendix BB: Index of Key Theorems

| Theorem / Result | Where Proved | Key Implication |
|---|---|---|
| Perron-Frobenius | Section 4.2, App. I.1 | Unique stationary distribution for ergodic chains |
| Ergodic theorem | Section 3.5, App. E | Time averages converge to $\mathbb{E}_\pi[f]$; justifies MCMC |
| Chapman-Kolmogorov | Section 2.3 | $n$-step transitions: $P^{m+n} = P^m P^n$ |
| Convergence theorem | Section 4.3 | $\|\mu^{(n)} - \pi\|_{\text{TV}} \to 0$ for ergodic chains |
| Detailed balance $\Rightarrow$ stationarity | Section 5.1, App. E.2 | Sufficient condition; easier to verify |
| Spectral gap bound | Section 6.3 | $t_{\text{mix}} \leq \log(\pi_{\min}^{-1}/\varepsilon)/\text{gap}$ |
| Cheeger inequality | Section 6.3, App. I.3 | $h^2/2 \leq \text{gap} \leq 2h$ |
| Coupling inequality | Section 6.1, App. I.2 | $\|P^n(x,\cdot)-\pi\|_{\text{TV}} \leq P(\tau > n)$ |
| MH correctness | Section 8.2 | Detailed balance implies $\pi$ is stationary |
| MH acceptance optimality | Section 8.2 | ~23% acceptance optimal in high dimensions |
| Markov chain CLT | App. S.2 | $\sqrt{n}(\text{MCMC avg} - \mathbb{E}_\pi[f]) \to \mathcal{N}(0,\sigma_f^2)$ |
| Dirichlet problem / OST | App. A | Value function = expected boundary value |
| Mean return time formula | Section 4.1 | $\pi_i = 1/\mu_i$; occupancy = reciprocal return time |
| Absorption formula | Section 3.4 | $B = (I-Q)^{-1}R$; absorption probabilities from fundamental matrix |
| Viterbi optimality | Section 9.4 | DP gives exact most likely path in $O(T|\mathcal{S}|^2)$ |

Appendix CC: Mixing Times for Specific Chains

A reference table of known mixing time results for important chains:

| Chain | State Space | Mixing Time $t_{\text{mix}}$ | Technique |
|---|---|---|---|
| Two-state chain | $\{0,1\}$ | $O(1/(p+q))$ | Eigenvalue $1-(p+q)$ |
| Random walk on path $P_n$ | $\{0,\ldots,n\}$ | $\Theta(n^2)$ | Eigenvalues $\cos(k\pi/n)$ |
| Random walk on cycle $C_n$ | $\mathbb{Z}/n\mathbb{Z}$ | $\Theta(n^2)$ | Fourier analysis |
| Random walk on hypercube | $\{0,1\}^d$ | $\Theta(d\log d)$ | Coupling (coupon collector) |
| Random walk on complete graph $K_n$ | $\{1,\ldots,n\}$ | $\Theta(1)$ | Spectral gap $= 1-1/(n-1) \approx 1$ |
| Random walk on expander | $\{1,\ldots,n\}$ | $\Theta(\log n)$ | Spectral gap $= \Omega(1)$ |
| Glauber dynamics (Ising, high $T$) | $\{-1,+1\}^n$ | $O(n\log n)$ | Coupling or spectral gap |
| Glauber dynamics (Ising, $T < T_c$) | $\{-1,+1\}^n$ | $\exp(\Omega(n))$ | Phase-transition barrier |
| MH on $\mathcal{N}(0, I_d)$ (Gaussian RW) | $\mathbb{R}^d$ | $\Theta(d)$ | Optimal scaling |
| HMC on $\mathcal{N}(0, I_d)$ | $\mathbb{R}^d$ | $O(d^{1/4})$ | Leapfrog integrator analysis |
| PageRank (damping $\alpha$) | Web pages | $O(\log(N)/\log(1/\alpha))$ | Second eigenvalue $\leq \alpha$ |

Key observations:

  • Dimension scaling: random-walk MH scales as $O(d)$ (diffusive); HMC as $O(d^{1/4})$ (much better)
  • Phase transitions: near the critical temperature, MCMC for the Ising model has exponential mixing time - a computationally hard phase
  • Expander graphs: an $\Omega(1)$ spectral gap means $O(\log n)$ mixing - the best possible for non-trivial graphs
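
Several rows of the table can be verified directly from eigenvalues. A minimal sketch for the lazy walk on the cycle $C_n$ and on the hypercube $\{0,1\}^d$ (lazy versions avoid periodicity), using $1/\text{gap}$ as a proxy for mixing time:

```python
import numpy as np
from itertools import product

def spectral_gap(P):
    """gap = 1 - |lambda_2| for a stochastic matrix P."""
    ev = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return 1 - ev[1]

def lazy_cycle(n):
    """Lazy walk on the n-cycle: stay w.p. 1/2, else step left or right."""
    P = np.eye(n) / 2
    for i in range(n):
        P[i, (i + 1) % n] += 0.25
        P[i, (i - 1) % n] += 0.25
    return P

def lazy_hypercube(d):
    """Lazy walk on {0,1}^d: stay w.p. 1/2, else flip a uniform coordinate."""
    states = list(product([0, 1], repeat=d))
    idx = {s: k for k, s in enumerate(states)}
    P = np.eye(2**d) / 2
    for s in states:
        for j in range(d):
            t = list(s); t[j] ^= 1
            P[idx[s], idx[tuple(t)]] += 0.5 / d
    return P

for n in (8, 16, 32):
    print("cycle", n, 1 / spectral_gap(lazy_cycle(n)))      # grows like n^2
for d in (3, 5, 7):
    print("cube", d, 1 / spectral_gap(lazy_hypercube(d)))   # grows like d (t_mix adds log d)
```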

Appendix DD: Software Ecosystem

The following Python libraries implement the algorithms in this section:

MCMC and Probabilistic Programming:

  • PyMC - full-featured Bayesian modelling with NUTS/HMC sampler
  • Stan (via cmdstanpy/pystan) - the gold standard for Bayesian modelling; generates C++ code for fast sampling
  • NumPyro - JAX-based; supports GPU-accelerated HMC and NUTS
  • blackjax - modular MCMC library (JAX); easy to extend with custom kernels

Markov Chain Simulation:

  • networkx - graph operations; random walks on graphs
  • numpy - transition matrix operations; power iteration; eigendecomposition

Hidden Markov Models:

  • hmmlearn - sklearn-compatible HMM; Baum-Welch training, Viterbi decoding
  • pomegranate - probabilistic models including HMMs with GPU support

Reinforcement Learning (MDP):

  • gymnasium (formerly OpenAI Gym) - RL environments as MDPs
  • stable-baselines3 - RL algorithms (PPO, SAC, TD3) implementing policy optimisation

Diffusion Models:

  • diffusers (HuggingFace) - implements DDPM, DDIM, and score-based models as CTMCs

Appendix EE: Common Markov Chain Constructions

Ehrenfest Model: $N$ balls in two urns; at each step a uniformly random ball moves to the other urn. State = number of balls in urn 1. Birth-death chain; stationary distribution: Binomial$(N, 1/2)$. Models diffusion toward thermodynamic equilibrium.
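
A quick numerical check that Binomial$(N, 1/2)$ satisfies detailed balance for the Ehrenfest chain; a minimal sketch:

```python
import numpy as np
from math import comb

N = 10
pi = np.array([comb(N, k) for k in range(N + 1)]) / 2**N   # Binomial(N, 1/2)

# Ehrenfest rates: from k balls in urn 1, go to k+1 w.p. (N-k)/N, to k-1 w.p. k/N
for k in range(N):
    lhs = pi[k] * (N - k) / N        # flux k -> k+1
    rhs = pi[k + 1] * (k + 1) / N    # flux k+1 -> k
    assert np.isclose(lhs, rhs)
print("detailed balance holds; pi is Binomial(N, 1/2)")
```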

Polya Urn: Start with $a$ red and $b$ blue balls; at each step draw a ball and add another of the drawn colour. Path-dependent; NOT Markov without an extended state. But $X_n = $ fraction red satisfies $\mathbb{E}[X_{n+1} \mid X_n] = X_n$ - a martingale.

Bernoulli-Laplace Diffusion: $N$ balls total, $k$ red and $N-k$ blue, split between two urns of size $N/2$. State = number of red balls in urn 1. Symmetric birth-death chain; used to model genetics and combinatorics.

Random Transposition Walk: On $S_n$ (the permutation group of $n$ elements), pick two positions uniformly and swap. State space: $S_n$ (size $n!$). Mixing time: $\Theta(n\log n)$ (Diaconis-Shahshahani); a standard model of card shuffling.

Metropolis on Graphs: For any graph $G$ and target $\pi$, the Metropolis chain with uniform proposals on neighbours satisfies detailed balance with respect to $\pi$. Useful for sampling graph colourings, independent sets, etc.

Skip-free chains: Transition matrices with $P_{ij} = 0$ for $j > i+1$ (can move up by at most 1, but down arbitrarily). Example: the embedded chain of the GI/M/1 queue. Efficient algorithms via matrix-analytic methods.


Appendix FF: Spectral Graph Theory Connection

FF.1 Normalized Laplacian and Random Walks

For an undirected graph $G=(V,E)$ with degree matrix $D$ and adjacency matrix $A$, the random walk transition matrix is $P = D^{-1}A$. The normalized Laplacian is:

$$\mathcal{L} = I - D^{-1/2}AD^{-1/2} = I - D^{1/2}PD^{-1/2}$$

The eigenvalues of $\mathcal{L}$ satisfy $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_N \leq 2$. The spectral gap of the random walk is $\lambda_2(\mathcal{L})$.

Cheeger inequality for graphs: $\lambda_2/2 \leq h(G) \leq \sqrt{2\lambda_2}$, where $h(G) = \min_{S:|S|\leq|V|/2} |E(S,\bar{S})|/(d_{\min}|S|)$ is the edge expansion of the graph.

Expander graphs: A $d$-regular expander has $\lambda_2(\mathcal{L}) = \Omega(1)$ (bounded below independently of $N$). Examples: Ramanujan graphs achieve $\lambda_2 = 1 - 2\sqrt{d-1}/d$ (optimal). Random regular graphs are expanders with high probability.
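
A minimal sketch computing the normalized Laplacian spectrum of a small bottleneck graph (two triangles joined by one edge, a hypothetical example) and the resulting Cheeger window:

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for an undirected graph."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1 / np.sqrt(d))
    return np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

# Two triangles joined by a single edge: a bottleneck -> small lambda_2
A = np.zeros((6, 6))
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
for i, j in edges:
    A[i, j] = A[j, i] = 1

lam = np.sort(np.linalg.eigvalsh(normalized_laplacian(A)))
print(lam)                            # lam[0] ~ 0; lam[1] small; all <= 2
lam2 = lam[1]
print(lam2 / 2, np.sqrt(2 * lam2))    # Cheeger window containing the expansion
```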

FF.2 Graph-Based ML and Markov Chains

Graph attention networks: GAT attention coefficients $e_{ij} = a(\mathbf{W}h_i, \mathbf{W}h_j)$ (learnable) create a data-dependent transition matrix over the graph. The spectral properties of this attention-weighted adjacency matrix determine the GNN's ability to propagate information.

Label propagation: Semi-supervised learning via label propagation iterates $F \leftarrow \alpha PF + (1-\alpha)Y$, where $P=D^{-1}A$ is the random walk matrix, $F$ holds the current label scores, $Y$ the known labels, and $\alpha$ a damping parameter. This is exactly the PageRank recursion applied to labels - convergence follows from the same Perron-Frobenius analysis.

Spectral clustering: The $k$ smallest eigenvectors of $\mathcal{L}$ embed the graph into $\mathbb{R}^k$ such that well-connected clusters map to nearby points. This exploits the fact that slow-mixing chains (small spectral gap) correspond to loosely connected clusters.


Appendix GG: Long-Range Dependencies and Markov Approximations

GG.1 Finite-Order Markov Chains

A $k$th-order Markov chain satisfies:

$$P(X_n \mid X_0,\ldots,X_{n-1}) = P(X_n \mid X_{n-k},\ldots,X_{n-1})$$

This can always be reduced to a first-order chain on the augmented state $(X_{n-k+1},\ldots,X_n)$, with state space of size $|\mathcal{S}|^k$.

Trade-off: Higher order captures more dependencies but exponentially increases the state space. N-gram language models are $k$th-order Markov chains on the vocabulary - trigram models ($k=2$) are common in classical NLP.

For transformers: A transformer with context length $L$ is effectively an $L$th-order Markov chain for token generation. The KV cache stores the sufficient statistics (the last $L$ tokens) - first-order Markov in the augmented state space.
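
A minimal sketch of the order reduction, lifting a hypothetical second-order chain on $\{0,1\}$ to a first-order chain on pairs:

```python
import numpy as np
from itertools import product

# Second-order chain on {0,1}: P2[(x_{n-2}, x_{n-1})] = dist of x_n
P2 = {(0, 0): [0.9, 0.1], (0, 1): [0.4, 0.6],
      (1, 0): [0.5, 0.5], (1, 1): [0.2, 0.8]}

# First-order lift on augmented states (x_{n-1}, x_n): |S|^2 = 4 states
states = list(product([0, 1], repeat=2))
idx = {s: k for k, s in enumerate(states)}
P = np.zeros((4, 4))
for (a, b) in states:
    for c in (0, 1):
        # (a, b) -> (b, c) with probability P2[(a, b)][c]
        P[idx[(a, b)], idx[(b, c)]] = P2[(a, b)][c]

print(P.sum(axis=1))    # rows sum to 1: a valid first-order chain
# Stationary distribution of the lifted chain = long-run pair frequencies
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(evals.real)])
pi /= pi.sum()
print(dict(zip(states, np.round(pi, 3))))
```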

GG.2 Variable-Length Markov Chains (VLMC)

VLMCs use a context tree to specify the order dynamically: some contexts require deep history, others only shallow. The transition probability is $P(X_n \mid X_{n-1},\ldots) = P(X_n \mid \text{suffix}(X_{n-1},\ldots,X_{n-k}))$, where $k$ depends on the context.

VLMCs provide compact representations of processes with heterogeneous memory. Related to suffix trees and compressed suffix arrays - efficient data structures for sequential data.

GG.3 Hidden Non-Markovian Processes

Many real processes are non-Markovian but become Markovian on an augmented state space. Examples:

  • AR($p$) process: $X_n = \sum_{k=1}^p \phi_k X_{n-k} + \varepsilon_n$. First-order Markov on $(X_n,\ldots,X_{n-p+1})$.
  • ARMA($p,q$): the moving-average component creates non-Markovian behaviour; the augmented state must include latent noise terms.
  • LLM with KV cache: first-order Markov on the KV-cache state; non-Markovian on the last token alone.

Appendix HH: Excursion Theory and Renewal Processes

HH.1 Renewal Theory

A renewal process is a sequence of times $T_1, T_2, \ldots$ where the gaps $T_k - T_{k-1}$ are iid positive random variables (inter-renewal times). The process $N_t = \max\{k : T_k \leq t\}$ counts renewals up to time $t$.

Renewal theorem: $\lim_{t\to\infty} N_t/t = 1/\mu$, where $\mu = \mathbb{E}[T_2-T_1]$ is the mean inter-renewal time. This is the law of large numbers for renewal processes.

Connection to Markov chains: The times of successive visits to a fixed recurrent state $i$ form a renewal process whose inter-renewal distribution is the return-time distribution. The renewal theorem gives $\pi_i = 1/\mu_i$ (the mean return time formula).
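
The identity $\pi_i = 1/\mu_i$ is easy to check by simulation; a minimal sketch using the two-state chain of Appendix W:

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.5, 0.5]])
rng = np.random.default_rng(0)

# Simulate the chain; visits to state 0 form a renewal process
x, returns, last_visit = 0, [], 0
for t in range(1, 100_000):
    x = rng.choice(2, p=P[x])
    if x == 0:
        returns.append(t - last_visit)
        last_visit = t

mu0 = np.mean(returns)
print(mu0, 1 / mu0)    # mean return time ~1.6; reciprocal ~0.625 = pi_0 = 5/8
```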

HH.2 Excursions from a State

An excursion from state $i$ is the portion of the trajectory between two consecutive visits to $i$. Excursions are iid by the strong Markov property. The distribution of an excursion - its length, the states visited, its path structure - encodes rich information about the chain.

For AI: In RLHF, the "excursions" of the LM's generation from topic to topic form a renewal-like structure. Understanding the statistics of these excursions (how long the LM stays on-topic, how it transitions between topics) is important for reward model design.


Appendix II: Summary Diagrams

MARKOV CHAIN THEORY SUMMARY
========================================================================

  CHAIN PROPERTIES                   IMPLICATIONS
  ----------------                   ------------
  Irreducible ----------------------> Every state reachable from every other
  + Positive Recurrent --------------> Stationary distribution π exists
  + Aperiodic -----------------------> π unique; P^n -> 1·π^T (convergence)
  (= Ergodic)

  Reversible -----------------------> Detailed balance π_i P_ij = π_j P_ji
                                     Real eigenvalues; symmetric in L^2(π)
                                     MH, Gibbs, HMC all reversible

  MIXING SPEED HIERARCHY
  -----------------------
  Spectral gap large ---------------> Fast mixing (t_mix = O(1/gap))
  Cheeger constant large -----------> Large spectral gap (Cheeger inequality)
  Non-reversible + directed ---------> Potentially faster than any reversible
  HMC gradient steps ---------------> O(d^{1/4}) vs O(d) for random walk

  COMPUTING π
  -----------
  Small N: solve πP = π, ||π||_1 = 1 (linear system)
  Large N: power iteration π^{k+1} = π^{k} P (PageRank)
  Continuous space: MCMC (Metropolis-Hastings, HMC)
  Birth-death: explicit formula via detailed balance

  ML APPLICATIONS
  ---------------
  Language generation --------------> Markov chain on vocabulary (or context)
  PageRank -------------------------> Stationary dist of web graph walk
  MCMC -----------------------------> Bayesian posterior sampling
  RL (MDP) -------------------------> Policy induces Markov chain; Bellman eqn
  HMM ------------------------------> Latent Markov chain with observations
  Diffusion models -----------------> Forward = CTMC; reverse = learned CTMC
  GNNs -----------------------------> Message passing = transition matrix power

========================================================================