Markov Chains - Part 2: Common Mistakes to Appendix Y (Connections to Physics)

10. Common Mistakes

| # | Mistake | Why it's wrong | Fix |
|---|---------|----------------|-----|
| 1 | Confusing stationarity with convergence | Stationarity ($\pi P = \pi$) is a property of a distribution; convergence ($\mu^{(n)} \to \pi$) is about the chain's trajectory. The chain may start at $\pi$ and be stationary without "converging" (it is already there). | Distinguish: stationarity is the destination; convergence is the journey. |
| 2 | Assuming irreducibility implies a unique stationary distribution | An irreducible chain on an infinite state space may be null-recurrent (no normalisable stationary distribution), e.g., the simple random walk on $\mathbb{Z}$. | A unique stationary distribution requires irreducibility + positive recurrence. |
| 3 | Forgetting aperiodicity for convergence | An irreducible, positive-recurrent but periodic chain has a unique stationary distribution, but $P^{(n)}_{ij}$ does NOT converge; it oscillates. | Convergence to stationarity requires ergodicity = irreducible + positive recurrent + aperiodic. |
| 4 | Treating detailed balance as necessary for stationarity | Detailed balance ($\pi_i P_{ij} = \pi_j P_{ji}$) is sufficient but not necessary for $\pi P = \pi$. Many chains (e.g., cyclic chains) have stationary distributions without satisfying detailed balance. | Detailed balance $\Rightarrow$ stationarity, but stationarity $\not\Rightarrow$ detailed balance. |
| 5 | Confusing the transition matrix orientation | Some books write $P_{ij}$ as the probability from $j$ to $i$ (column stochastic); others as the probability from $i$ to $j$ (row stochastic). This flips $\pi P = \pi$ vs. $P\pi = \pi$. | Fix a convention: row stochastic means rows sum to 1 and $\pi P = \pi$ (left eigenvector). |
| 6 | Ignoring burn-in in MCMC | The first $B$ MCMC samples come from a distribution close to $\mu^{(0)}$ (the initial distribution), not $\pi$. Including them biases posterior estimates. | Always discard a burn-in period. Use $\hat{R}$ and ESS to assess convergence. |
| 7 | Confusing mixing time with burn-in | Mixing time $t_{\text{mix}}$ is a property of the chain (how long until the worst-case distribution is close to $\pi$). Burn-in is a practical heuristic. They are related but not equal. | Burn-in should be at least $t_{\text{mix}}$; more if starting far from $\pi$. |
| 8 | Assuming a high acceptance rate means well-mixed MCMC | A chain that always accepts (tiny step size in MH) explores very slowly; acceptance rate ~1 as the step size $\to 0$ gives high acceptance but poor mixing. | Target ~23% acceptance in high dimensions (optimal for Gaussian targets). |
| 9 | Using $P^n$ to compute the stationary distribution for large $n$ | The matrix powers $P^n$ converge to the rank-1 matrix $\mathbf{1}\pi$, but floating-point errors accumulate. | Use power iteration on the distribution vector: $\pi^{(k+1)} = \pi^{(k)} P$, which is numerically stable. |
| 10 | Misidentifying the state space for LLM generation | A bigram LM has states = vocabulary (size $V$). A full autoregressive LM has states = entire context, exponential in the context length. | The Markov chain for a transformer is on the context window, not the vocabulary alone. |
| 11 | Confusing transient and absorbing | A transient state will eventually be left for good; an absorbing state is one the chain never leaves. Every absorbing state is recurrent (trivially), not transient. | Transient: $P(\text{return}) < 1$; absorbing: $P_{ii} = 1$. Absorbing states are recurrent. |
| 12 | Expecting MCMC samples to be independent | MCMC samples are autocorrelated; consecutive samples are not independent. ESS < $T$ measures the effective number of independent samples. | Report ESS, not the raw sample count. Use thinning to reduce autocorrelation if needed. |

11. Exercises

Exercise 1 * - Transition Matrix and Distribution Evolution

A weather model has two states: Sunny (1) and Rainy (2), with transition matrix $P = \begin{pmatrix}0.8 & 0.2 \\ 0.4 & 0.6\end{pmatrix}$.

(a) Starting from $\mu^{(0)} = (1, 0)$ (certainly sunny), compute $\mu^{(1)}, \mu^{(5)}, \mu^{(20)}$.

(b) Compute the 3-step transition matrix $P^3$. What is $P(X_3 = 2 \mid X_0 = 1)$?

(c) Find the stationary distribution $\pi$ by solving $\pi P = \pi$ with $\pi_1 + \pi_2 = 1$.

(d) Verify that $\mu^{(n)} \to \pi$ as $n \to \infty$ by computing $\|\mu^{(n)} - \pi\|_{\text{TV}}$ for $n = 1, 5, 10, 20$.


Exercise 2 * - State Classification

Consider the Markov chain on $\{1,2,3,4,5\}$ with transition matrix:

$$P = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0.5 & 0 & 0.5 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}$$

(a) Draw the transition diagram. Identify all communicating classes.

(b) Classify each state as recurrent or transient. Justify using the return probability criterion.

(c) For the recurrent states, determine their period. Are any states aperiodic?

(d) Starting from state 1, what is the probability of eventually reaching state 4?


Exercise 3 * - Computing Stationary Distributions

(a) Implement stationary_power_iteration(P, n_iter=1000) that starts from a uniform distribution and returns $\pi^{(n)}$ after $n$ iterations. Apply it to the $3\times3$ chain with $P_{ij} = 1/3$ for all $i, j$ and to a $4\times4$ random row-stochastic matrix.

(b) Verify your result satisfies $\pi P = \pi$ to tolerance $10^{-6}$.

(c) For a doubly stochastic matrix (rows AND columns sum to 1), prove the stationary distribution is uniform without solving any equations. Verify numerically.

(d) For the birth-death chain with $p_i = 0.6$, $q_i = 0.4$ on $\{0,1,2,3,4\}$ with reflecting barriers, compute $\pi$ analytically and verify against power iteration.


Exercise 4 ** - Detailed Balance and Birth-Death Chains

(a) Verify that the two-state chain $P = \begin{pmatrix}1-p & p \\ q & 1-q\end{pmatrix}$ satisfies detailed balance with $\pi = (q/(p+q),\ p/(p+q))$.

(b) Implement an $M/M/1$-queue birth-death chain with $\lambda = 0.6$, $\mu = 1.0$ on states $\{0,1,\ldots,20\}$, truncated (reflecting) at 20. Compute the stationary distribution and verify $\pi_n \approx (1-\rho)\rho^n$ with $\rho = 0.6$.

(c) For the cyclic chain $1 \to 2 \to 3 \to 1$ with $P_{12} = P_{23} = P_{31} = 1$, find the stationary distribution. Does it satisfy detailed balance? Explain why not.

(d) Show that if $P$ is doubly stochastic, then detailed balance holds with the uniform distribution iff $P$ is also symmetric.


Exercise 5 ** - Mixing Time and Spectral Gap

(a) For the two-state chain in Exercise 1, compute the eigenvalues of $P$. What is the spectral gap?

(b) Using the spectral gap bound, upper-bound the mixing time $t_{\text{mix}}(0.01)$.

(c) Compute the TV distance $\|P^t(1,\cdot) - \pi\|_{\text{TV}}$ for $t = 1, 2, 5, 10, 20$ and plot the convergence. At what $t$ does it drop below $0.01$?

(d) For the lazy version $P' = (I+P)/2$, redo parts (a)-(c). How does the spectral gap change? How does the mixing time change?


Exercise 6 ** - Metropolis-Hastings

(a) Implement Metropolis-Hastings with a Gaussian random walk proposal $q(\theta'|\theta) = \mathcal{N}(\theta, \sigma^2)$ for the target $\pi(\theta) \propto \exp(-\theta^4/4 + \theta^2/2)$ (double-well potential). Run for $T = 50000$ steps with $\sigma = 0.5$.

(b) Verify your sampler satisfies detailed balance: for two states $\theta_1, \theta_2$, check $\pi(\theta_1)\,K(\theta_1,\theta_2) \approx \pi(\theta_2)\,K(\theta_2,\theta_1)$ numerically, where $K$ is the MH transition kernel.

(c) Compute the acceptance rate and effective sample size (ESS). How does ESS change as $\sigma$ varies over $\{0.1, 0.5, 1.0, 2.0, 5.0\}$?

(d) Estimate $\mathbb{E}_\pi[\theta^2]$ from your samples. Compare to the true value (computed numerically).


Exercise 7 *** - PageRank

(a) Implement PageRank power iteration for the 5-node graph with adjacency matrix:

$$A = \begin{pmatrix}0&1&1&0&0\\0&0&1&1&0\\0&0&0&1&1\\1&0&0&0&1\\0&1&0&0&0\end{pmatrix}$$

with damping factor $\alpha = 0.85$. Handle dangling nodes.

(b) Verify that your PageRank vector satisfies $\pi P' = \pi$ (for the damped transition matrix $P'$) to tolerance $10^{-8}$.

(c) Add a new node 6 that links to all existing nodes, with all existing nodes linking to node 6. How does the PageRank of node 1 change? Why?

(d) Implement personalised PageRank where the teleportation vector is non-uniform: $(0.4, 0.2, 0.1, 0.1, 0.1, 0.1)$. Compare to standard PageRank.


Exercise 8 *** - HMM Forward Algorithm and Viterbi

An HMM models DNA sequences with 2 hidden states (CpG island: $H$, non-CpG: $L$) and 4 observations (A, C, G, T).

(a) Given the transition matrix $T_{HH} = 0.5$, $T_{HL} = 0.5$, $T_{LH} = 0.4$, $T_{LL} = 0.6$ and emission matrices (the HMM notebook provides the values), implement the forward algorithm and compute $P(X_{1:5} = \text{ACGTT})$ under equal initial probabilities.

(b) Implement the Viterbi algorithm to find the most likely hidden state sequence for the same observations.

(c) Verify the forward probabilities satisfy $\sum_i \alpha_t(i) = P(X_{1:t})$ at each $t$.

(d) Compare Viterbi decoding to the most-probable-state-per-position (posterior decoding using the backward algorithm). When do they differ?


12. Why This Matters for AI (2026 Perspective)

| Concept | AI/ML application | Impact |
|---------|-------------------|--------|
| Transition matrix / stationary dist. | PageRank web graph; LLM token distribution | Foundation for all sequential reasoning |
| Ergodic theorem | MCMC posterior estimation; time-averaging in RL | Justifies replacing expectations with time averages |
| Perron-Frobenius | PageRank convergence proof; convergence of power iteration | Guarantees unique ranking vectors exist |
| Detailed balance / reversibility | Correctness of MH, Gibbs, SGLD | All Bayesian MCMC algorithms rely on this |
| Spectral gap / mixing time | MCMC convergence rate; warm-start certification | Explains why MCMC works (or fails) in practice |
| Metropolis-Hastings | Bayesian neural network posteriors; sampling EBMs | Foundation of probabilistic inference in ML |
| Gibbs sampling | LDA topic models; Boltzmann machine inference | Tractable posterior sampling via conditionals |
| HMC / NUTS | Stan, PyMC, posterior sampling in Bayesian DL | Gold standard for Bayesian inference on $\mathbb{R}^d$ |
| MDP / Bellman equation | RL algorithms (PPO, SAC, DQN, RLHF) | Every RL agent solves a Markov chain problem |
| HMM forward-backward | Speech recognition, gene prediction, legacy NLP | Efficient inference in latent sequence models |
| CTMC / generator matrix | LLM serving queues; continuous-time RL | Latency modelling and event-driven simulation |
| Diffusion as CTMC | DDPM/DDIM forward process; score matching | Connects generative modelling to Markov chain theory |
| Power iteration | PageRank; dominant eigenvector computation | $O(\text{nnz})$ per iteration; scales to web-scale graphs |
| Lazy chains / warm starts | MCMC initialisation strategy | Reduces burn-in and improves sample efficiency |

13. Conceptual Bridge

Markov chains occupy the central position in the probability curriculum: they are the first class of stochastic processes with nontrivial structure - neither completely independent (iid) nor completely dependent. The Markov property gives just enough memory to model real sequential phenomena while remaining mathematically tractable.

Looking backward: Markov chains build on everything in Chapters 1-6. The transition matrix is a row-stochastic matrix - linear algebra from Chapters 2-3. The stationary distribution is an eigenvector problem solved by Perron-Frobenius. Computing stationary distributions uses the conditional probability machinery from Section03. The ergodic theorem is a strengthening of the law of large numbers (Section06). The Metropolis-Hastings acceptance ratio uses Bayes' theorem (Section03). Mixing times use concentration inequalities (Section05) through the spectral gap. The stochastic process framework of filtrations and stopping times (Section06) provides the rigorous foundation.

Looking forward: Markov chains are the gateway to Chapter 7 (Statistics), where MCMC enables Bayesian inference with intractable posteriors; Chapter 8 (Optimisation), where Markov chain mixing theory explains why SGD finds flat minima; and Chapter 9 (Information Theory), where the entropy of a stationary distribution connects to compression. Markov Decision Processes - the reinforcement learning setting - are a direct extension requiring optimisation over policies.

The deep connection: The central theorem of Markov chain theory - ergodicity implies $\mu^{(n)} \to \pi$ - is the probabilistic analogue of the contraction mapping theorem from analysis: repeated application of the transition operator contracts all distributions toward the unique fixed point $\pi$. The spectral gap quantifies the contraction rate, just as the Lipschitz constant does in the deterministic setting. This structural parallel runs through SGD convergence theory, where the gradient operator acts as a stochastic contraction.

MARKOV CHAINS IN THE CURRICULUM
========================================================================

  Section01 Random Variables ------------------+
  Section02 Distributions ---------------------+
  Section03 Joint Distributions ---------------+
  Section04 Expectation & Moments -------------+
  Section05 Concentration Inequalities --------+
  Section06 Stochastic Processes --------------+
                                               |
                                               v
                              +---------------------------------+
                              |     Section07 MARKOV CHAINS     |
                              |                                 |
                              |  Markov property                |
                              |  Transition matrices            |
                              |  Stationary distributions       |
                              |  Mixing times, spectral gap     |
                              |  MCMC (MH, Gibbs, HMC)          |
                              |  PageRank, MDP, HMM             |
                              +---------------+-----------------+
                                              |
               +------------------------------+-------------------+
               v                              v                   v
    Ch7: Statistics                Ch8: Optimisation       Ch9: Info Theory
    (MCMC for Bayes,               (SGD mixing theory,     (Entropy of pi,
     posterior inference)           Langevin dynamics)      KL from stationarity)
               v                              v
    RL/RLHF: MDP                   Diffusion models
    (policy eval,                  (DDPM as CTMC,
     value functions)               reverse Markov)

========================================================================

Appendix A: Harmonic Functions and the Dirichlet Problem

A function $h : \mathcal{S} \to \mathbb{R}$ is harmonic for a Markov chain at state $i$ if:

$$h(i) = \sum_j P_{ij}\, h(j) = (Ph)(i)$$

That is, $h(i)$ equals the expected value of $h$ one step later. Harmonic functions are the "conserved quantities" of Markov chains - if $h(X_0)$ is the starting value, then $h(X_n)$ forms a martingale.

Maximum principle: For an irreducible chain on a finite state space, the only bounded harmonic functions are constants. This is the probabilistic analogue of the maximum principle for harmonic functions in PDE theory.

Dirichlet problem: On a chain with absorbing boundary $\partial\mathcal{S}$ and interior $\mathcal{S}^\circ$, the function $h(i) = E_i[f(X_{\tau_\partial})]$ (the expected boundary value at the hitting time) is the unique solution to:

$$h(i) = (Ph)(i) \ \text{ for } i \in \mathcal{S}^\circ, \qquad h(i) = f(i) \ \text{ for } i \in \partial\mathcal{S}$$

Application - gambler's ruin: Let $h(i) = P(\text{reach } N \mid X_0 = i)$. Then $h$ is harmonic on $\{1,\ldots,N-1\}$ with boundary conditions $h(0) = 0$, $h(N) = 1$. For the simple symmetric random walk: $h(i) = i/N$.
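
As a quick numerical check, here is a minimal sketch (the choice $N = 10$ is illustrative) that solves this Dirichlet problem as a linear system and recovers $h(i) = i/N$:

import numpy as np

# Gambler's ruin on {0,...,N}: simple symmetric walk, absorbing at 0 and N.
N = 10
P = np.zeros((N + 1, N + 1))
P[0, 0] = P[N, N] = 1.0                      # absorbing boundary states
for i in range(1, N):
    P[i, i - 1] = P[i, i + 1] = 0.5          # symmetric steps in the interior

# Interior equations h(i) = (Ph)(i) become (I - P_interior) h = boundary terms.
interior = np.arange(1, N)
A = np.eye(N - 1) - P[np.ix_(interior, interior)]
b = P[interior, N] * 1.0                     # boundary conditions h(N) = 1, h(0) = 0
h = np.linalg.solve(A, b)

print(h)                                     # matches the closed form h(i) = i/N
print(np.allclose(h, interior / N))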

For AI: Value functions in reinforcement learning are harmonic up to reward and discounting: $V^\pi(s) = \sum_a \pi(a|s)\left[r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^\pi(s')\right]$ is a Bellman equation - a discrete-time analogue of the Dirichlet problem with discount factor $\gamma$.


Appendix B: Spectral Theory for Non-Reversible Chains

For non-reversible chains, the eigenvalues of $P$ can be complex (though all satisfy $|\lambda| \leq 1$). The Jordan form replaces the diagonal eigendecomposition.

Pseudo-spectrum: For nearly-reversible chains, the pseudo-spectrum can be much larger than the spectrum, causing transient amplification before eventual convergence.

Non-reversible speedup: Some non-reversible chains mix faster than any reversible chain with the same stationary distribution. The "lifted" or "skewed" chains in the literature achieve mixing time $O(1/h)$ vs. $O(1/h^2)$ for reversible chains, where $h$ is the Cheeger constant.

Lifting: A common technique to construct non-reversible fast-mixing chains: add a "direction" variable $d \in \{+1,-1\}$ to the state, creating a Markov chain on $\mathcal{S} \times \{+,-\}$ that preferentially moves in one direction while still having $\pi$ as its marginal stationary distribution.


Appendix C: Advanced MCMC Methods

C.1 Stochastic Gradient Langevin Dynamics (SGLD)

For Bayesian inference on large datasets, standard MH requires evaluating $\log\pi(\theta)$ on the full dataset - too expensive. SGLD combines SGD with Gaussian noise:

$$\theta_{t+1} = \theta_t - \frac{\eta_t}{2}\nabla \tilde{L}(\theta_t) + \sqrt{\eta_t}\,\xi_t, \qquad \xi_t \sim \mathcal{N}(0, I)$$

where $\tilde{L}(\theta) = -\frac{N}{n}\sum_{i \in \text{minibatch}} \log p(x_i|\theta) - \log p(\theta)$ is the minibatch estimate of the negative log-posterior, whose gradient is the stochastic gradient.

For step sizes satisfying $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$ (the Robbins-Monro conditions), the chain converges to samples from the posterior. At a small but fixed step size $\eta$, the chain samples from an approximate posterior within $O(\eta)$ of the true one.
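
A minimal SGLD sketch on a toy Gaussian-mean posterior; the model, batch size, and step-size schedule below are illustrative choices, not prescribed by the text:

import numpy as np

# Data x_i ~ N(theta, 1) with prior theta ~ N(0, 10); true theta = 1.5.
rng = np.random.default_rng(0)
N_data = 10_000
data = rng.normal(1.5, 1.0, size=N_data)

def grad_minibatch(theta, batch):
    # Gradient of the minibatch negative log-posterior, rescaled by N/n.
    likelihood_term = (N_data / len(batch)) * np.sum(theta - batch)
    prior_term = theta / 10.0
    return likelihood_term + prior_term

theta, samples = 0.0, []
for t in range(1, 5000):
    eta = 1e-4 / t**0.55                     # Robbins-Monro: sum eta = inf, sum eta^2 < inf
    batch = rng.choice(data, size=100)
    theta += -0.5 * eta * grad_minibatch(theta, batch) + np.sqrt(eta) * rng.normal()
    samples.append(theta)

print(np.mean(samples[2000:]))               # should sit near the posterior mean ~1.5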

For AI: SGLD underlies Bayesian deep learning at scale. It explains why SGD with learning rate decay has regularisation properties: the decaying noise variance means the chain converges to a tighter distribution around the posterior mode.

C.2 Parallel Tempering (Replica Exchange)

For multimodal targets, a single chain can get stuck in one mode. Parallel tempering runs $K$ chains at different inverse temperatures $\beta_1 < \beta_2 < \cdots < \beta_K$ (with $\beta_K = 1$ being the target). Periodically, adjacent chains swap states with MH acceptance probability:

$$a = \min\left(1,\ \frac{\pi_{\beta_i}(x_j)\,\pi_{\beta_j}(x_i)}{\pi_{\beta_i}(x_i)\,\pi_{\beta_j}(x_j)}\right)$$

High-temperature (small-$\beta$) chains mix fast over a flattened distribution and pass samples to low-temperature chains through swaps. This enables exploration of multiple modes.
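
A minimal parallel-tempering sketch on a double-well target (the potential, temperature ladder, and proposal scale are illustrative). For tempered densities $\pi_\beta \propto e^{-\beta U}$ the swap acceptance reduces to $\exp\big((\beta_i - \beta_j)(U(x_i) - U(x_j))\big)$:

import numpy as np

rng = np.random.default_rng(1)
U = lambda x: (x**2 - 4.0)**2 / 2.0          # potential; modes near +/- 2
betas = np.array([0.05, 0.2, 1.0])           # beta = 1 is the target chain
xs = np.zeros(len(betas))

samples = []
for step in range(20_000):
    # Within-temperature random-walk Metropolis updates
    for k, beta in enumerate(betas):
        prop = xs[k] + rng.normal(0, 1.0)
        if np.log(rng.random()) < beta * (U(xs[k]) - U(prop)):
            xs[k] = prop
    # Swap move between a random adjacent pair of temperatures
    k = rng.integers(len(betas) - 1)
    log_a = (betas[k] - betas[k + 1]) * (U(xs[k]) - U(xs[k + 1]))
    if np.log(rng.random()) < log_a:
        xs[k], xs[k + 1] = xs[k + 1], xs[k]
    samples.append(xs[-1])                   # record the beta = 1 chain

print(np.mean(np.array(samples) > 0))        # near 0.5: both modes are visited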

C.3 Slice Sampling

Slice sampling introduces a uniform auxiliary variable $u \sim \text{Uniform}(0, \pi(\theta))$, creating a joint distribution on $(\theta, u)$ that is uniform on the "slice" $\{(\theta, u) : u < \pi(\theta)\}$. Marginalising out $u$ recovers $\pi(\theta)$.

The slice sampler alternates between updating $u$ (trivial given $\theta$) and updating $\theta$ (uniform on the slice's level set). There is no tuning parameter, the acceptance rate is 1, and it is effective for univariate distributions.
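
A minimal sketch of Neal-style slice sampling with stepping-out and shrinkage, on a standard Gaussian (the target and the initial width w are illustrative):

import numpy as np

rng = np.random.default_rng(2)
log_pi = lambda x: -0.5 * x**2               # unnormalised log density

def slice_sample_step(x, w=2.0):
    log_u = log_pi(x) + np.log(rng.random()) # height u ~ Uniform(0, pi(x)), in logs
    # Step out: randomly place an interval of width w, grow it to bracket the slice
    left = x - w * rng.random()
    right = left + w
    while log_pi(left) > log_u:
        left -= w
    while log_pi(right) > log_u:
        right += w
    # Shrinkage: sample uniformly; shrink the interval toward x on rejection
    while True:
        prop = rng.uniform(left, right)
        if log_pi(prop) > log_u:
            return prop
        if prop < x:
            left = prop
        else:
            right = prop

x, samples = 0.0, []
for _ in range(10_000):
    x = slice_sample_step(x)
    samples.append(x)
print(np.mean(samples), np.var(samples))     # should be near 0 and 1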


Appendix D: Worked Examples

D.1 Google PageRank Computation (Mini Example)

Consider a 4-page web: page 1 links to 2,3; page 2 links to 4; page 3 links to 2; page 4 links to 1,3.

Transition matrix (with no damping for clarity):

$$P = \begin{pmatrix}0 & 1/2 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0\end{pmatrix}$$

Stationary distribution (solve $\pi P = \pi$, $\sum_i \pi_i = 1$): $\pi = (2/13,\ 4/13,\ 3/13,\ 4/13)$ - verified by direct computation.

With damping $\alpha = 0.85$: $P' = 0.85\,P + 0.15\,J/4$ where $J$ is the all-ones matrix. Power iteration converges in ~30 iterations.
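
The stationary distribution is easy to confirm with a few lines of power iteration (this chain is aperiodic, since it has cycles of lengths 3 and 4):

import numpy as np

P = np.array([
    [0.0, 0.5, 0.5, 0.0],   # page 1 -> 2, 3
    [0.0, 0.0, 0.0, 1.0],   # page 2 -> 4
    [0.0, 1.0, 0.0, 0.0],   # page 3 -> 2
    [0.5, 0.0, 0.5, 0.0],   # page 4 -> 1, 3
])
pi = np.ones(4) / 4
for _ in range(2000):
    pi = pi @ P
print(pi)                                    # ~ [0.1538, 0.3077, 0.2308, 0.3077]
print(np.allclose(pi, [2/13, 4/13, 3/13, 4/13]))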

D.2 MCMC on a Bimodal Distribution

Target: $\pi(\theta) \propto e^{-(\theta-3)^2/2} + e^{-(\theta+3)^2/2}$ (a mixture of two Gaussians at $\pm 3$).

With Gaussian random-walk MH ($\sigma = 0.5$): the chain mixes within one mode; inter-modal jumps are rare because the barrier between the modes is $O(10)$ nats. Acceptance rate ~60%, but ESS/T ~ 0.02 (poor mixing).

With $\sigma = 3$: the chain jumps between modes; acceptance rate ~15%, ESS/T ~ 0.10 (better mixing).

This illustrates the exploration-exploitation tradeoff in MCMC proposal design.
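
A sketch that reproduces the experiment; the ESS estimator below is a crude truncated-autocorrelation heuristic, so the exact numbers will vary around the figures quoted above:

import numpy as np

rng = np.random.default_rng(3)
log_pi = lambda t: np.logaddexp(-0.5 * (t - 3)**2, -0.5 * (t + 3)**2)

def mh(sigma, n=50_000):
    x, out, acc = 0.0, np.empty(n), 0
    for i in range(n):
        prop = x + sigma * rng.normal()
        if np.log(rng.random()) < log_pi(prop) - log_pi(x):
            x, acc = prop, acc + 1
        out[i] = x
    return out, acc / n

def ess(x, max_lag=1000):
    # Integrated autocorrelation time, truncated at the first negative lag
    x = x - x.mean()
    tau = 1.0
    for k in range(1, max_lag):
        rho = np.corrcoef(x[:-k], x[k:])[0, 1]
        if rho < 0:
            break
        tau += 2 * rho
    return len(x) / tau

for sigma in (0.5, 3.0):
    s, rate = mh(sigma)
    print(f"sigma={sigma}: acceptance={rate:.2f}, ESS/T={ess(s)/len(s):.3f}")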

D.3 HMM Example: Weather from Ice Core

Two hidden states: Ice Age (I), Warm Period (W). Observations: high ($h$) or low ($l$) dust in the ice core.

Transition: $P_{II} = 0.7$, $P_{IW} = 0.3$, $P_{WI} = 0.4$, $P_{WW} = 0.6$. Emission: $P(h|I) = 0.8$, $P(l|I) = 0.2$, $P(h|W) = 0.2$, $P(l|W) = 0.8$.

For the observation sequence $h, l, h, h, l$ with a uniform initial distribution, the Viterbi algorithm gives the hidden sequence $I, I, I, I, W$, narrowly beating $I, W, I, I, W$ (high dust = ice age, low dust = warm period; the sticky transitions keep the single low-dust reading at $t = 2$ inside the ice age).

The forward algorithm gives likelihood $P(\text{obs sequence}) \approx 0.024$.
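
These numbers can be checked in a few lines of NumPy (uniform initial distribution assumed):

import numpy as np

T = np.array([[0.7, 0.3],                    # rows/cols ordered I, W
              [0.4, 0.6]])
E = np.array([[0.8, 0.2],                    # columns ordered h, l
              [0.2, 0.8]])
obs = [0, 1, 0, 0, 1]                        # h, l, h, h, l
pi0 = np.array([0.5, 0.5])

# Forward algorithm: total likelihood of the observations
alpha = pi0 * E[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ T) * E[:, o]
print(f"likelihood = {alpha.sum():.4f}")     # ~ 0.0242

# Viterbi in log space: most likely hidden path
delta = np.log(pi0 * E[:, obs[0]])
back = []
for o in obs[1:]:
    scores = delta[:, None] + np.log(T)      # scores[i, j]: best path ending i -> j
    back.append(scores.argmax(axis=0))
    delta = scores.max(axis=0) + np.log(E[:, o])
path = [delta.argmax()]
for b in reversed(back):
    path.append(b[path[-1]])
print([["I", "W"][s] for s in reversed(path)])  # -> ['I', 'I', 'I', 'I', 'W']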


Appendix E: Key Theorems and Proofs

E.1 Proof: Stationary Distribution is Unique for Finite Irreducible Chains

Theorem. A finite irreducible Markov chain has a unique stationary distribution $\pi$, with $\pi_i > 0$ for all $i$.

Proof. Consider the $N$-dimensional probability simplex $\Delta = \{\mu : \mu_i \geq 0,\ \sum_i \mu_i = 1\}$. The map $T : \mu \mapsto \mu P$ is a continuous map from $\Delta$ to $\Delta$. By Brouwer's fixed-point theorem, $T$ has at least one fixed point $\pi$.

For uniqueness: suppose $\pi$ and $\nu$ are two stationary distributions. Since the chain is irreducible and finite, all states are positive recurrent, so $\pi_i = 1/\mu_i > 0$ where $\mu_i$ is the mean return time to state $i$. The mean return time is determined by the chain alone (it equals $1/\pi_i$ and also $1/\nu_i$ if $\nu$ is stationary), so $\pi_i = \nu_i$ for all $i$.

For $\pi_i > 0$: by irreducibility, from any state $j$ there exists a path to $i$ with positive probability. The stationary mass at each state along this path is $> 0$ (by positivity of the transitions and stationarity), propagating to give $\pi_i > 0$.

E.2 Proof: Detailed Balance Implies Stationarity

Theorem. If $\pi_i P_{ij} = \pi_j P_{ji}$ for all $i, j$, then $\pi P = \pi$.

Proof. For each $j$:

$$(\pi P)_j = \sum_i \pi_i P_{ij} = \sum_i \pi_j P_{ji} = \pi_j \sum_i P_{ji} = \pi_j \cdot 1 = \pi_j$$

where we used detailed balance in the second equality and the row-sum property of stochastic matrices in the fourth.

E.3 Proof: Convergence via Coupling (Sketch)

Theorem. Let $P$ be ergodic. For any $\mu, \nu$ and any coupling $(X_t, Y_t)$ with $X_0 \sim \mu$, $Y_0 \sim \nu$, define $\tau = \min\{t : X_t = Y_t\}$. Then:

$$\|\mu P^t - \nu P^t\|_{\text{TV}} \leq P(\tau > t)$$

Proof. For any event $A$:

$$\mu P^t(A) - \nu P^t(A) = P(X_t \in A) - P(Y_t \in A) = P(X_t \in A,\ \tau \leq t) + P(X_t \in A,\ \tau > t) - P(Y_t \in A,\ \tau \leq t) - P(Y_t \in A,\ \tau > t)$$

On $\{\tau \leq t\}$ the chains have coalesced: $X_t = Y_t$, so $P(X_t \in A,\ \tau \leq t) = P(Y_t \in A,\ \tau \leq t)$. Therefore:

$$|\mu P^t(A) - \nu P^t(A)| = |P(X_t \in A,\ \tau > t) - P(Y_t \in A,\ \tau > t)| \leq P(\tau > t)$$

Taking the supremum over $A$ gives the result.


Appendix F: Python Recipes for Markov Chains

import numpy as np

# -- 1. Power iteration for the stationary distribution
def stationary(P, tol=1e-12, max_iter=10000):
    """Left power iteration: iterate pi <- pi P from the uniform distribution."""
    pi = np.ones(P.shape[0]) / P.shape[0]
    for _ in range(max_iter):
        pi_new = pi @ P  # one application of the transition matrix
        if np.max(np.abs(pi_new - pi)) < tol:
            return pi_new
        pi = pi_new
    return pi

# -- 2. Simulate a Markov chain
def simulate_chain(P, x0, n_steps):
    """Sample a trajectory of length n_steps + 1 starting from state x0."""
    N = P.shape[0]
    x = x0
    trajectory = [x]
    for _ in range(n_steps):
        x = np.random.choice(N, p=P[x])  # next state drawn from row P[x]
        trajectory.append(x)
    return np.array(trajectory)

# -- 3. Metropolis-Hastings for a 1D target
def metropolis_hastings(log_pi, n_steps, sigma=0.5, x0=0.0):
    """Random-walk MH; log_pi is the unnormalised log density of the target."""
    x = x0
    samples = []
    for _ in range(n_steps):
        x_prime = x + sigma * np.random.randn()  # symmetric Gaussian proposal
        log_a = log_pi(x_prime) - log_pi(x)      # log acceptance ratio
        if np.log(np.random.rand()) < log_a:
            x = x_prime
        samples.append(x)
    return np.array(samples)

# -- 4. PageRank
def pagerank(A, alpha=0.85, tol=1e-8):
    """Power iteration on the damped chain alpha * P + (1 - alpha) * J / N."""
    N = A.shape[0]
    out_degree = A.sum(axis=1, keepdims=True)
    dangling = (out_degree == 0).flatten()  # nodes with no out-links
    out_degree[dangling] = 1                # avoid division by zero
    P = A / out_degree
    P[dangling] = 1 / N                     # dangling nodes teleport uniformly
    teleport = np.ones((N, N)) / N
    P_hat = alpha * P + (1 - alpha) * teleport
    pi = np.ones(N) / N
    for _ in range(1000):
        pi_new = pi @ P_hat
        if np.max(np.abs(pi_new - pi)) < tol:
            return pi_new
        pi = pi_new
    return pi

# -- 5. HMM forward algorithm
def hmm_forward(obs, T, E, pi0):
    """obs: list of observation indices, T: NxN transition, E: NxM emission, pi0: initial."""
    N = T.shape[0]
    alpha = np.zeros((len(obs), N))
    alpha[0] = pi0 * E[:, obs[0]]  # initialise with the first observation
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t-1] @ T) * E[:, obs[t]]  # propagate, then weight by emission
    return alpha, alpha[-1].sum()  # forward matrix and total likelihood

Appendix G: Connections to Other Areas

Spectral graph theory: For a random walk on an undirected graph $G$, the transition matrix $P = D^{-1}A$ (where $D$ is the degree matrix and $A$ is the adjacency matrix) has eigenvalues in $[-1, 1]$. The spectral gap of the Laplacian $L = I - D^{-1/2}AD^{-1/2}$ equals the spectral gap of the walk. Well-connected (expander) graphs have large spectral gaps and fast-mixing random walks.

Information theory: The relative entropy (KL divergence) between the chain's distribution and stationarity decreases monotonically: $D_{\text{KL}}(\mu^{(n)} \| \pi) \geq D_{\text{KL}}(\mu^{(n+1)} \| \pi)$. This is the data processing inequality applied to the Markov transition. The rate of KL decrease is related to the spectral gap.

Ergodic theory: The Markov chain ergodic theorem is a special case of Birkhoff's ergodic theorem for abstract measure-preserving systems. The stationarity condition $\pi P = \pi$ says $\pi$ induces an invariant measure for the shift map $T : \Omega \to \Omega$ on path space.

Linear programming: Computing the stationary distribution satisfying $\pi P = \pi$, $\pi \geq 0$, $\|\pi\|_1 = 1$ is a linear system. The LP relaxations of combinatorial problems can sometimes be solved by finding stationary distributions of associated Markov chains.

Quantum computing: Quantum walks are the quantum analogue of random walks on graphs, with amplitudes instead of probabilities. They can achieve quadratic speedups over classical random walks for certain search and sampling problems.


Appendix H: Self-Assessment Checklist

Before moving to Chapter 7 (Statistics), verify you can:

  • Write the transition matrix for any small Markov chain from a verbal description
  • Compute $\mu^{(n)} = \mu^{(0)}P^n$ for $n = 1, 2, 5$ by hand or in code
  • Classify all states of a small chain as communicating/recurrent/transient/absorbing
  • Solve $\pi P = \pi$ for a $2\times2$ or $3\times3$ stochastic matrix
  • State Perron-Frobenius and explain why the stationary distribution is unique
  • Verify detailed balance for a given $(\pi, P)$ pair
  • Compute the spectral gap from eigenvalues and give a mixing-time bound
  • Implement Metropolis-Hastings for a 1D target and compute the acceptance rate
  • Implement power iteration for PageRank with dangling-node handling
  • Implement the HMM forward algorithm and compute the likelihood of an observation sequence
  • Explain why MCMC sample autocorrelation reduces the effective sample size
  • State the ergodic theorem and explain why it justifies MCMC estimation

Appendix I: Detailed Proofs - Perron-Frobenius and Mixing

I.1 Perron-Frobenius Theorem (Complete Proof)

We prove the Perron-Frobenius theorem for primitive stochastic matrices.

Step 1: Eigenvalue 1 exists. For any stochastic matrix $P$, $P\mathbf{1} = \mathbf{1}$ (with $\mathbf{1}$ the column vector of ones), so 1 is always an eigenvalue. (Note $\mathbf{1}^T P = \mathbf{1}^T$ holds iff $P$ is doubly stochastic.) More relevantly: by the Brouwer fixed-point theorem on $\Delta^{N-1}$, the map $\pi \mapsto \pi P$ has a fixed point - the stationary distribution. This fixed point is a left eigenvector with eigenvalue 1.

Step 2: $|\lambda| \leq 1$ for all eigenvalues. Let $\lambda$ be an eigenvalue with (right) eigenvector $v$ ($Pv = \lambda v$). Pick $i^*$ with $|v_{i^*}| = v_{\max} = \max_i |v_i|$. Then $|\lambda|\, v_{\max} = |\lambda v_{i^*}| = |(Pv)_{i^*}| = \left|\sum_j P_{i^*j} v_j\right| \leq \sum_j P_{i^*j}\, |v_j| \leq v_{\max}$. So $|\lambda| \leq 1$.

Step 3: $|\lambda| = 1$ implies $\lambda = 1$ for primitive $P$. For a primitive matrix, $P^m > 0$ (all entries strictly positive) for some $m$. Suppose $|\lambda| = 1$, $\lambda \neq 1$, with $Pv = \lambda v$ and $v_{\max} = 1$. The equality case in Step 2 requires $|v_j| = 1$ for all $j$ and all rows of $P$ to "align" the phases of the $v_j$. But all rows of $P^m$ are strictly positive, and the alignment condition then forces $\lambda^m = 1$ and all $v_j$ equal - meaning $v = c\mathbf{1}$ and $\lambda = 1$. Contradiction.

Step 4: Convergence. Since $\lambda = 1$ is the unique eigenvalue on the unit circle for primitive $P$, all other eigenvalues $\lambda_k$ satisfy $|\lambda_k| \leq 1 - \delta$ for some $\delta > 0$. The spectral decomposition gives $P^n = \mathbf{1}\pi + \sum_{k \geq 2} \lambda_k^n \phi_k \psi_k^T$, where $|\lambda_k|^n \to 0$ at rate $(1-\delta)^n$.

I.2 Coupling Proof of Convergence (Complete)

Theorem. Let $P$ be ergodic with stationary distribution $\pi$. For any $x \in \mathcal{S}$:

$$\|P^n(x,\cdot) - \pi\|_{\text{TV}} \leq P(\tau > n)$$

where $\tau$ is the coalescence time of the Markov coupling (two chains started at $x$ and from $\pi$).

Proof. Let $(X_n, Y_n)$ be a coupling with $X_0 = x$ and $Y_0 \sim \pi$, where both chains use the same transition matrix $P$ and move together once they meet. Since $Y_0 \sim \pi$ and $\pi$ is stationary, $Y_n \sim \pi$ for all $n$.

For any event $A$:

$$P^n(x, A) - \pi(A) = P(X_n \in A) - P(Y_n \in A)$$

On $\{\tau \leq n\}$: $X_n = Y_n$ (they have coalesced), so their indicators for $A$ are equal. Hence:

$$P^n(x, A) - \pi(A) = E[\mathbf{1}_{X_n \in A} - \mathbf{1}_{Y_n \in A}] = E\big[(\mathbf{1}_{X_n \in A} - \mathbf{1}_{Y_n \in A})\,\mathbf{1}_{\tau > n}\big] \leq E[\mathbf{1}_{\tau > n}] = P(\tau > n)$$

Since this holds for any $A$ (and for $A^c$ by symmetry), taking the supremum gives the TV bound.

I.3 Spectral Gap Lower Bound via Conductance (Cheeger Inequality)

Definition. The conductance (Cheeger constant) of a chain is:

$$h = \min_{S \subseteq \mathcal{S},\ \pi(S) \leq 1/2} \frac{\sum_{i \in S,\, j \notin S} \pi_i P_{ij}}{\pi(S)}$$

Conductance measures the minimum probability flux crossing any "bottleneck" cut in the chain.

Cheeger inequality: For a reversible chain:

$$\frac{h^2}{2} \leq \text{gap} \leq 2h$$

The lower bound is the harder direction and gives $\text{gap} \geq h^2/2$. In words: a severe bottleneck (small $h$) implies a small spectral gap (slow mixing). Conversely, if $h$ is bounded below, the chain mixes in $O(1/h^2)$ steps (while the upper bound $\text{gap} \leq 2h$ shows mixing can never be faster than order $1/h$).

For AI: The Cheeger inequality provides practical two-sided bounds for MCMC convergence. For neural network posteriors, the conductance can be estimated using gradient information - a well-conditioned posterior (no sharp barriers) has high conductance.


Appendix J: Markov Chains and Entropy

J.1 Entropy Production

The relative entropy (KL divergence) from the chain's distribution to stationarity is:

$$H(\mu^{(n)} \| \pi) = \sum_i \mu^{(n)}_i \log \frac{\mu^{(n)}_i}{\pi_i}$$

Data processing inequality (for Markov chains): $H(\mu^{(n+1)} \| \pi) \leq H(\mu^{(n)} \| \pi)$.

This follows from the data processing inequality: applying the stochastic map $P$ cannot increase KL divergence. Equality holds iff $\mu^{(n)} = \pi$.
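
A quick numerical illustration of this monotone decay, using the two-state chain from Appendix W (whose stationary distribution is $(5/8, 3/8)$):

import numpy as np

P = np.array([[0.7, 0.3],
              [0.5, 0.5]])
pi = np.array([5/8, 3/8])                    # stationary distribution of this P

def kl(mu, pi):
    mask = mu > 0                            # convention: 0 * log 0 = 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / pi[mask])))

mu = np.array([1.0, 0.0])
for n in range(6):
    print(f"n={n}: KL = {kl(mu, pi):.6f}")   # strictly decreasing toward 0
    mu = mu @ P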

For reversible chains: The relative entropy decays at rate governed by the log-Sobolev constant. Chains with larger log-Sobolev constants (tighter functional inequalities) decay faster.

J.2 Entropy of the Stationary Distribution

The Shannon entropy of $\pi$ is $H(\pi) = -\sum_i \pi_i \log \pi_i$. For a uniform distribution (doubly stochastic $P$), $H(\pi) = \log N$ is maximal. For PageRank, the entropy of $\pi$ measures how "concentrated" web traffic is - low entropy means a few pages dominate.

Maximum entropy interpretation: The stationary distribution of a reversible chain with detailed balance can be interpreted as the maximum entropy distribution subject to the constraint that the detailed balance equations hold. This connects Markov chains to the exponential family (Section02).

J.3 Kullback-Leibler and MCMC Acceptance

The Metropolis-Hastings acceptance ratio has an information-theoretic interpretation:

$$a(\theta, \theta') = \min\left(1,\ \frac{\pi(\theta')\,q(\theta|\theta')}{\pi(\theta)\,q(\theta'|\theta)}\right) = \min\left(1,\ e^{\log\pi(\theta') - \log\pi(\theta) - \log q(\theta'|\theta) + \log q(\theta|\theta')}\right)$$

The exponent is the log-ratio of probability densities - related to the KL divergence between the proposal and the target. When the proposal $q$ matches $\pi$ well, most proposals are accepted.


Appendix K: Continuous-Time Chains and Generators

K.1 Semigroup Theory

The family of transition matrices $\{P(t)\}_{t \geq 0}$ forms a Markov semigroup: $P(0) = I$, $P(s+t) = P(s)P(t)$, and $t \mapsto P(t)$ is continuous. The generator $Q = \frac{d}{dt}P(t)\big|_{t=0}$ characterises the infinitesimal behaviour.

Exponential formula: $P(t) = e^{Qt}$, where the matrix exponential can be computed via the eigendecomposition of $Q$. If $Q$ has eigenvalues $0 = \mu_1 \geq \mu_2 \geq \cdots \geq \mu_N$ (all real and non-positive for reversible chains), then:

$$P(t) = \sum_k e^{\mu_k t}\, \phi_k \psi_k^T$$

Convergence to stationarity is at rate $e^{-|\mu_2| t}$, where $|\mu_2| = -\mu_2$ is the spectral gap of $Q$.
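
A two-state sketch (the rates are made up) that builds a generator and checks the exponential formula with scipy.linalg.expm:

import numpy as np
from scipy.linalg import expm

Q = np.array([[-1.0,  1.0],                  # leave state 0 at rate 1
              [ 2.0, -2.0]])                 # leave state 1 at rate 2
for t in (0.1, 1.0, 5.0):
    Pt = expm(Q * t)                         # P(t) = e^{Qt}
    print(t, Pt[0])                          # rows converge to pi = (2/3, 1/3)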

K.2 Reversible CTMCs

A CTMC is reversible with respect to $\pi$ if $\pi_i Q_{ij} = \pi_j Q_{ji}$ for all $i \neq j$. The spectral theory for reversible CTMCs is identical to that for reversible DTMCs: real eigenvalues, orthonormal eigenfunctions, exponential convergence.

Detailed balance for CTMCs:

$$\pi_i q_{ij} = \pi_j q_{ji} \quad \text{for all } i \neq j$$

This is the continuous-time analogue of discrete detailed balance. Metropolis-Hastings in continuous time (the Metropolis algorithm for particle simulations) satisfies it by construction.


Appendix L: Theory Problems

Problem L.1. Prove that for a doubly stochastic matrix $P$ (rows and columns both sum to 1), the uniform distribution is stationary. Give an example showing the converse fails.

Problem L.2. Let $P$ be irreducible with stationary distribution $\pi$. Prove that $\pi_i > 0$ for all $i$ by constructing a path from any $j$ with $\pi_j > 0$ to $i$.

Problem L.3. (Coupling inequality) Let $P$ be ergodic. Construct a coupling of two copies of the chain, one started at $x$ and one at the stationary distribution $\pi$. Use the coupling inequality to prove $\|P^n(x,\cdot) - \pi\|_{\text{TV}} \leq P(\tau_{\text{meet}} > n)$.

Problem L.4. For the symmetric random walk on $\{0,1,\ldots,N\}$ with reflecting barriers, show the spectral gap is $1 - \cos(\pi/N) \approx \pi^2/(2N^2)$ for large $N$. What is the mixing time?

Problem L.5. (Metropolis-Hastings correctness) Let $K(x,y)$ be the MH transition kernel with proposal $q$ and target $\pi$. Show that $\pi_x K(x,y) = \pi_y K(y,x)$ (detailed balance), and conclude that $\pi$ is stationary.

Problem L.6. For a birth-death chain on $\{0,1,\ldots,N\}$ with rates $p_i = p$ and $q_i = q$ (constant), find the stationary distribution. For what values of $p$ does the chain have a proper stationary distribution on $\{0,1,2,\ldots\}$ (the infinite chain)?

Problem L.7. The lazy walk on a graph $G$ sets $P_{ii} = 1/2$ (stay with probability 1/2) and $P_{ij} = 1/(2d_i)$ for neighbours $j$ (where $d_i$ is the degree). Show the lazy walk has all non-negative eigenvalues and therefore converges monotonically to stationarity in TV distance.

Problem L.8. (PageRank convergence rate) For the damped PageRank matrix $P = \alpha P_0 + (1-\alpha)J/N$ with damping $\alpha$, show all eigenvalues except the dominant one have magnitude $\leq \alpha$. Conclude that PageRank power iteration converges in $O(\log(1/\varepsilon)/\log(1/\alpha))$ iterations.


Appendix M: Notation Summary

| Symbol | Meaning |
|--------|---------|
| $\mathcal{S}$ | State space |
| $P_{ij}$ | Transition probability from state $i$ to state $j$ |
| $P^{(n)}_{ij}$ | $n$-step transition probability ($(P^n)_{ij}$) |
| $\mu^{(n)}$ | Distribution at time $n$; row vector |
| $\pi$ | Stationary distribution; row vector with $\pi P = \pi$ |
| $f_i$ | Return probability to state $i$: $P(T_i < \infty \mid X_0 = i)$ |
| $\mu_i$ | Mean return time to state $i$; $\pi_i = 1/\mu_i$ |
| $d(i)$ | Period of state $i$: $\gcd\{n \geq 1 : P^{(n)}_{ii} > 0\}$ |
| $\lVert \mu - \nu \rVert_{\text{TV}}$ | Total variation distance between $\mu$ and $\nu$ |
| $t_{\text{mix}}(\varepsilon)$ | Mixing time to $\varepsilon$-accuracy |
| $\text{gap}(P)$ | Spectral gap: $1 - \lvert\lambda_2\rvert$ |
| $h$ | Cheeger constant (conductance) |
| $Q$ | Generator matrix of a CTMC; $Q_{ij} = q_{ij}$, $Q_{ii} = -q_i$ |
| $P(t) = e^{Qt}$ | CTMC transition matrix at time $t$ |
| $a(\theta,\theta')$ | MH acceptance probability |
| $\tau_{\text{meet}}$ | Coupling time (coalescence time of two chains) |
| $\alpha_t(i)$ | HMM forward variable: $P(X_1,\ldots,X_t,\ Z_t = i)$ |
| $\delta_t(j)$ | Viterbi variable: $\max_{z_{1:t-1}} P(X_1,\ldots,X_t,\ Z_t = j,\ z_{1:t-1})$ |

Appendix N: Advanced ML Applications

N.1 RLHF as a Constrained MDP

Reinforcement learning from human feedback (RLHF) trains a language model policy $\pi_\theta$ to maximise:

$$\mathbb{E}_{\pi_\theta}[r_\phi(x, y)] - \beta\, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$$

where $r_\phi$ is a learned reward model, $\pi_{\text{ref}}$ is the reference (SFT) policy, and $\beta$ controls the KL penalty.

This is a constrained Markov chain optimisation problem: the state space is all possible prefixes (contexts), the action space is the vocabulary, and the KL penalty ensures the learned chain $\pi_\theta$ does not deviate too far from the reference chain $\pi_{\text{ref}}$ in transition structure.

The optimal policy has an explicit closed form:

$$\pi^*(y|x) = \pi_{\text{ref}}(y|x)\, \frac{e^{r_\phi(x,y)/\beta}}{Z(x)}$$

where $Z(x)$ is a normalising constant. This is a Gibbs distribution with the reward as (negative) energy - exactly the kind of target distribution MCMC is designed to sample.

Implication: The RLHF optimal policy can be sampled by constructing a Markov chain with the above stationary distribution. Recent work uses Langevin-type MCMC to sample from $\pi^*$ directly, without RL.

N.2 Diffusion Sampling as Time-Reversed CTMC

The score-based generative modelling framework (Song et al., 2021) writes the forward noising process as a CTMC/SDE and generates by running the reversed process. The key identity is Anderson's reverse-time SDE:

$$dx = \left[f(x,t) - g(t)^2\, \nabla_x \log p_t(x)\right]dt + g(t)\, d\bar{B}_t$$

where $\bar{B}_t$ is a Brownian motion running backwards in time. The score function $\nabla_x \log p_t(x)$ supplies the drift correction of the reversed process. Learning this score function via denoising score matching allows sampling by running the reverse process.

Discrete diffusion: For categorical distributions (text), the VP-SDE is replaced by a CTMC on the vocabulary with generator $Q(t)$. MDLM and other discrete diffusion models take the forward chain $q(x_t|x_0)$ to be masking, uniform transitions, or absorbing-state corruption, then learn the reverse chain.

N.3 MCMC for LLM Sampling

Standard autoregressive LLM decoding (top-$k$, nucleus) can be viewed as Markov chain sampling from a specific transition matrix. Recent speculative decoding methods (Chen et al., 2023; Leviathan et al., 2023) use a smaller "draft" model as a proposal and the large model in the accept/reject test - closely analogous to independence Metropolis-Hastings:

  • Proposal: $q(x_{n+1}|\text{context}) = \text{small\_LM}(x_{n+1}|\text{context})$
  • Acceptance: $a = \min(1,\ \text{large\_LM}/\text{small\_LM})$; on rejection, resample from the normalised residual distribution
  • Guarantee: the emitted tokens have exactly the same distribution as sampling from the large model alone

This is MCMC-style accept/reject reasoning applied to language model decoding - Markov chain theory directly yields the correctness of speculative decoding.
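
A toy, single-token sketch of the accept/reject step on a made-up 4-token vocabulary; the distributions p and q below are illustrative, not from any real model. The residual resampling step is what makes the output distribution exactly match the target:

import numpy as np

rng = np.random.default_rng(4)
p = np.array([0.5, 0.3, 0.15, 0.05])         # "large" (target) model over 4 tokens
q = np.array([0.25, 0.25, 0.25, 0.25])       # "small" (draft) model

def speculative_token():
    x = rng.choice(4, p=q)                   # draft proposes a token
    if rng.random() < min(1.0, p[x] / q[x]): # accept with prob min(1, p/q)
        return x
    residual = np.clip(p - q, 0.0, None)     # on rejection: sample (p - q)^+
    return rng.choice(4, p=residual / residual.sum())

draws = np.array([speculative_token() for _ in range(100_000)])
print(np.bincount(draws) / len(draws))       # ~ p = [0.5, 0.3, 0.15, 0.05]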

N.4 Graph Neural Networks and Random Walks

Graph Neural Networks (GNNs) can be understood through the lens of Markov chains. A single GNN layer performs:

$$h_v^{(k+1)} = f\left(h_v^{(k)},\ \text{AGGREGATE}\left(\{h_u^{(k)} : u \in \mathcal{N}(v)\}\right)\right)$$

For the simple mean-aggregation GNN: $h_v^{(k+1)} = \sigma\big(W h_v^{(k)} + W' \frac{1}{d_v}\sum_{u \sim v} h_u^{(k)}\big)$. The aggregation step applies the random-walk transition matrix $D^{-1}A$ to the feature matrix. $k$ layers of GNN = applying the random walk $k$ times - the receptive field grows with the $k$-step neighbourhood.

Over-smoothing: After many layers, all node features converge toward the stationary profile of the random walk (uniform for regular graphs). This is over-smoothing - the GNN loses discriminative power because the Markov chain mixes. Mixing time bounds from Section06 directly bound the number of useful GNN layers.
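
A tiny demonstration (toy 4-node graph, one-hot features) that repeated aggregation with $D^{-1}A$ washes out the differences between node features:

import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)         # random-walk matrix D^{-1} A
H = np.eye(4)                                # one-hot node features

for k in (1, 5, 50):
    Hk = np.linalg.matrix_power(P, k) @ H    # k rounds of mean aggregation
    # Spread across nodes shrinks as k grows: features become indistinguishable
    print(k, np.round(Hk.std(axis=0), 4))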

N.5 Token Position Encoding as Markov Chain

The self-attention mechanism with rotary position encoding (RoPE) can be viewed as computing a position-dependent Markov chain over tokens. Each attention head defines a stochastic matrix over token positions (the attention weights), and information flows along this chain.

Attention as ergodic averaging: In multi-head attention, each head computes $\text{softmax}(QK^T/\sqrt{d})\,V$. The softmax weights form a row-stochastic matrix - a one-step Markov transition over token positions. The output is the expected value of $V$ under this distribution. Stacking attention layers is like composing Markov kernels.

Information bottleneck: The rank of the attention matrix limits the "state space" of the Markov chain. Low-rank attention (as in linear attention / Performer) corresponds to a Markov chain with restricted state space - mixing may be faster but with less expressivity.


Appendix O: Review Problems

Review Problem 1. Let $P$ be a $3\times3$ stochastic matrix with $P_{11}=0.5$, $P_{12}=0.3$, $P_{13}=0.2$; $P_{21}=0.1$, $P_{22}=0.7$, $P_{23}=0.2$; $P_{31}=0.3$, $P_{32}=0.3$, $P_{33}=0.4$. (a) Is this chain irreducible? (b) Find the stationary distribution. (c) Starting from state 1, compute the distribution after 10 steps. (d) Compare to stationarity.

Review Problem 2. Consider a random walk on a cycle of length $N$ (states $\{0,1,\ldots,N-1\}$; move clockwise or counterclockwise with equal probability $1/2$). (a) What is the stationary distribution? (b) What is the period? (c) Compute the mixing time using the spectral gap (eigenvalues: $\cos(2\pi k/N)$ for $k = 0,\ldots,N-1$). (d) How does the mixing time scale with $N$?

Review Problem 3. (MCMC) You want to sample from $\pi(x) \propto x^{a-1}e^{-bx}$ (a Gamma distribution) using MH with the log-normal proposal $q(x'|x) = \text{LogNormal}(\log x, \sigma^2)$. (a) Write down the acceptance ratio. (b) Is $q$ symmetric? (c) Will this chain have $\pi$ as its stationary distribution?

Review Problem 4. (PageRank) The web has 3 pages: 1 links to 2 and 3; 2 links to 1; 3 links to 1. With damping $\alpha = 0.85$, write the PageRank transition matrix and find $\pi$ by solving a linear system. Which page has the highest PageRank?

Review Problem 5. (HMM) For an HMM with 2 hidden states, 2 observations, transition $T = \begin{pmatrix}0.6 & 0.4 \\ 0.3 & 0.7\end{pmatrix}$, emission $E = \begin{pmatrix}0.9 & 0.1 \\ 0.2 & 0.8\end{pmatrix}$, and initial $\pi_0 = (0.5, 0.5)$: (a) Compute $P(\text{obs} = (0,1,0))$ using the forward algorithm. (b) Find the Viterbi path.


Appendix P: Common Distributions in MCMC Targets

When $\pi(\theta) \propto \exp(-f(\theta))$, the gradient $\nabla f$ drives Langevin dynamics. Key shapes:

Gaussian: $f(\theta) = \frac{1}{2}\theta^T\Sigma^{-1}\theta$. Gradient: $\Sigma^{-1}\theta$. Langevin mixes in $O(\kappa(\Sigma))$ steps; HMC in $O(\kappa(\Sigma)^{1/2})$ steps (an advantage for ill-conditioned targets).

Logistic posterior: $f(\theta) = -\sum_i y_i x_i^T\theta + \sum_i \log(1 + e^{x_i^T\theta}) + \frac{1}{2\sigma^2}\|\theta\|^2$. Strongly log-concave; SGLD mixes in polynomial time.

Mixture of Gaussians: $\pi = \sum_k w_k \mathcal{N}(\mu_k, \Sigma_k)$. Multimodal; standard MCMC can get stuck in one mode. Parallel tempering or simulated annealing is needed for good mixing.

Heavy-tailed: $f(\theta) = \nu \log(1 + \|\theta\|^2/\nu)$ (Student-t). Sub-quadratic growth means the tails are heavy. HMC can struggle; NUTS with dual averaging handles this better.

Boltzmann (energy-based models): $f(x) = E_\phi(x)$ for a neural network energy $E_\phi$, so $\pi(x) \propto e^{-E_\phi(x)}$. Sampling requires running MCMC; contrastive divergence (CD-$k$) uses short MCMC chains as an approximation.


Appendix Q: Further Reading

  1. Levin, Peres, Wilmer - Markov Chains and Mixing Times (AMS, 2nd ed., 2017) - the definitive reference for mixing times and coupling; freely available online

  2. Norris, J.R. - Markov Chains (Cambridge, 1997) - rigorous treatment of discrete and continuous-time chains with clean proofs

  3. Brooks, Gelman, Jones, Meng - Handbook of Markov Chain Monte Carlo (CRC Press, 2011) - comprehensive MCMC reference; includes HMC, NUTS, diagnostics

  4. Betancourt, M. - "A Conceptual Introduction to Hamiltonian Monte Carlo" (arXiv, 2017) - outstanding intuitive treatment of HMC geometry

  5. Page et al. - "The PageRank Citation Ranking: Bringing Order to the Web" (Stanford Technical Report, 1999) - original PageRank paper

  6. Rabiner, L.R. - "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition" (Proc. IEEE, 1989) - classic HMM reference

  7. Song et al. - "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021) - connects diffusion models to CTMC theory

  8. Welling & Teh - "Bayesian Learning via Stochastic Gradient Langevin Dynamics" (ICML 2011) - introduces SGLD for large-scale Bayesian inference

  9. Sutton & Barto - Reinforcement Learning: An Introduction (MIT Press, 2nd ed., 2018) - MDPs, Bellman equations, TD-learning; free online


Appendix R: Spectral Methods for Large-Scale Markov Chains

R.1 Lanczos Algorithm for Large Transition Matrices

For transition matrices arising in PageRank ($N \sim 10^{10}$ web pages), direct eigendecomposition is infeasible. Power iteration and the Lanczos algorithm enable computing the top eigenvectors using only matrix-vector products.

Power iteration for the stationary distribution:

  • Memory: $O(N)$ for the distribution vector
  • Per-iteration cost: $O(\text{nnz})$ (the number of non-zeros in $P$) for sparse matrices
  • Convergence in $O(\log(1/\varepsilon)/\log(1/|\lambda_2|))$ iterations

For the web graph, $|\lambda_2| \leq \alpha = 0.85$ (the damping factor), so roughly 115 iterations give $\varepsilon = 10^{-8}$ precision.

Randomised SVD: For low-rank approximation of $P^n$ (needed for long-range transition probability estimation), randomised methods can approximate the top-$r$ singular vectors in $O(Nr\log r)$ time - much cheaper than a full SVD.

R.2 Multi-Scale Markov Chains

For chains with multiple timescales (fast-mixing within clusters, slow-mixing between clusters), aggregation-disaggregation methods provide efficient algorithms:

  1. Aggregation: Lump the states within each cluster into a single meta-state. Compute the inter-cluster transition matrix $\bar{P}$.
  2. Disaggregation: Compute the within-cluster stationary distributions $\pi^{(c)}$ for each cluster $c$.
  3. Combine: $\pi_i = \bar{\pi}_c\, \pi^{(c)}_i$ for state $i$ in cluster $c$.

This decomposition is exact when the chain is lumpable and a good approximation when it is "nearly lumpable" (within-cluster transitions much faster than between-cluster ones). It is the foundation of multi-scale MCMC methods.

For AI: LLM attention patterns have multi-scale structure: attention heads attend to nearby tokens (fast mixing) and distant context (slow mixing). This multi-scale structure can be exploited for efficient long-context modelling.

R.3 Reversibilisation

Any Markov chain can be made reversible by considering the Doob $h$-transform or by additive reversibilisation:

$$\tilde{P}_{ij} = \frac{1}{2}\left(P_{ij} + \frac{\pi_j P_{ji}}{\pi_i}\right)$$

The resulting $\tilde{P}$ is reversible with the same stationary distribution $\pi$. Reversibilisation can speed up or slow down convergence depending on the chain - it destroys the directional bias that non-reversible chains exploit for faster mixing.
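
A quick numerical check of both claims on a 3-state chain with a directional drift (the matrix is illustrative):

import numpy as np

P = np.array([[0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9],
              [0.9, 0.1, 0.0]])
pi = np.ones(3) / 3                          # doubly stochastic => uniform is stationary

# tilde(P)_ij = (P_ij + pi_j P_ji / pi_i) / 2
P_tilde = 0.5 * (P + (pi[None, :] * P.T) / pi[:, None])

print(np.allclose(pi @ P_tilde, pi))         # pi is still stationary
DB = pi[:, None] * P_tilde                   # detailed balance <=> flux matrix symmetric
print(np.allclose(DB, DB.T))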


Appendix S: Continuous-State Markov Chains

S.1 Harris Chains

For Markov chains on continuous state spaces (like MCMC on $\mathbb{R}^d$), the discrete-state ergodic theory extends with modifications. A Markov chain with transition kernel $K(x, \cdot)$ is:

  • $\phi$-irreducible if there exists a $\sigma$-finite measure $\phi$ such that for any Borel set $A$ with $\phi(A) > 0$, we have $\sum_{n=1}^\infty K^n(x, A) > 0$ for all $x$.
  • Harris recurrent if it is $\phi$-irreducible and returns to any set $A$ with $\phi(A) > 0$ with probability 1.
  • Positive Harris recurrent if the chain is Harris recurrent and has a unique invariant probability measure $\pi$.

A positive Harris recurrent, aperiodic chain is called a Harris ergodic chain, and the ergodic theorem holds: time averages converge to $\pi$-expectations.

For MCMC: The Metropolis-Hastings chain on $\mathbb{R}^d$ with a Gaussian proposal and a proper target density $\pi$ is Harris ergodic under mild regularity conditions. This justifies MCMC estimators for continuous posteriors.

S.2 Geometric Ergodicity

A Markov chain is geometrically ergodic if there exist a function $C(x) < \infty$ and a constant $\rho < 1$ such that:

$$\|K^n(x, \cdot) - \pi\|_{\text{TV}} \leq C(x)\,\rho^n \quad \text{for all } x$$

Geometric ergodicity implies a central limit theorem for MCMC estimators:

$$\sqrt{n}\left(\frac{1}{n}\sum_{t=1}^n f(X_t) - \mathbb{E}_\pi[f]\right) \xrightarrow{d} \mathcal{N}(0, \sigma_f^2)$$

where $\sigma_f^2 = \text{Var}_\pi(f) + 2\sum_{k=1}^\infty \text{Cov}_\pi(f(X_0), f(X_k))$ is the asymptotic variance. This is the Markov chain CLT - it justifies asymptotic confidence intervals for MCMC estimates.

For AI: Geometric ergodicity of the Langevin algorithm for log-concave targets gives polynomial bounds on the number of gradient evaluations needed for $\varepsilon$-accurate posterior estimates. This is the theoretical basis for Bayesian deep learning via SGLD.


Appendix T: Markov Chain Games and Nash Equilibria

T.1 Stochastic Games

A stochastic game (Shapley, 1953) is an MDP with multiple agents. Two-player zero-sum stochastic games have:

  • State space $\mathcal{S}$; action spaces $\mathcal{A}_1, \mathcal{A}_2$
  • Transition $P(s'|s, a_1, a_2)$
  • Reward $r(s, a_1, a_2)$ (player 1 maximises, player 2 minimises)

At a Nash equilibrium, each player's policy is a best response to the other's. The value function satisfies a minimax Bellman equation:

$$V^*(s) = \min_{a_2} \max_{a_1} \left[r(s, a_1, a_2) + \gamma \sum_{s'} P(s'|s, a_1, a_2)\, V^*(s')\right]$$

For AI: Multi-agent RL (MARL) with competing agents (e.g., AlphaGo, Libratus for poker) uses stochastic game theory. RLHF with adversarial reward models is a stochastic game between the policy and the worst-case reward model.

T.2 Markov Perfect Equilibrium

In dynamic games, a Markov perfect equilibrium (MPE) is a Nash equilibrium where strategies depend only on the current state (Markov property on the strategy). MPE is the game-theoretic analogue of the optimal policy in MDPs.

Computing MPE requires solving a system of coupled Bellman equations - one per player. For two-player zero-sum games, this reduces to linear programming (minimax theorem). For general-sum games, this is PPAD-complete.


Appendix U: Temporal Difference Learning and Martingales

Connecting back to Section06 (Stochastic Processes), Section 3.7, the TD(0) algorithm is:

$$V(s_t) \leftarrow V(s_t) + \alpha_t\left[r_t + \gamma V(s_{t+1}) - V(s_t)\right]$$

The update direction $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error. Under the true value function $V^*$, $\mathbb{E}[\delta_t \mid s_t] = 0$ (a martingale difference). TD learning is a stochastic approximation algorithm for finding the fixed point of the Bellman operator $T^\pi V = r^\pi + \gamma P^\pi V$.

Convergence proof via martingale theory: Write $V_t = V^* + \varepsilon_t$ (the error from the true value). Then:

$$\varepsilon_{t+1}(s_t) = (1 - \alpha_t)\,\varepsilon_t(s_t) + \alpha_t\left[\gamma (P^\pi \varepsilon_t)(s_t) + M_{t+1}\right]$$

where $M_{t+1} = r_t + \gamma V^*(s_{t+1}) - V^*(s_t)$ is a martingale-difference noise term. Under the Robbins-Monro conditions $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$, the ODE method (Borkar, 2008) shows $\varepsilon_t \to 0$ a.s.
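
A minimal TD(0) sketch on a toy three-state chain (all numbers illustrative), compared against the exact solution $V = (I - \gamma P^\pi)^{-1} r^\pi$ of the Bellman equation:

import numpy as np

rng = np.random.default_rng(5)
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([1.0, 0.0, 2.0])                # expected reward per state
gamma = 0.9

V_exact = np.linalg.solve(np.eye(3) - gamma * P, r)

V, s = np.zeros(3), 0
for t in range(1, 200_000):
    alpha = 1.0 / (1 + t / 1000)             # Robbins-Monro-style decaying step
    s_next = rng.choice(3, p=P[s])           # sample a transition from the chain
    V[s] += alpha * (r[s] + gamma * V[s_next] - V[s])   # TD(0) update
    s = s_next

print(np.round(V, 2), np.round(V_exact, 2))  # TD estimate vs exact values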


Appendix V: Practical MCMC Checklist for AI/ML

Before running MCMC:

  • Identify the target: is $\pi(\theta \mid \mathcal{D})$ proper? Can $\log\pi$ be evaluated efficiently?
  • Choose an algorithm: log-concave -> Langevin/HMC; multimodal -> parallel tempering; discrete -> Gibbs
  • Set the proposal scale: tune to ~23% acceptance (MH) or ~65% (HMC)
  • Run multiple chains ($\geq 4$) from diverse starting points

During MCMC:

  • Monitor trace plots for signs of non-stationarity (trends, sticking)
  • Track the acceptance rate: it should stay in the target range
  • Compute $\hat{R}$ every 1000 iterations; stop when $\hat{R} \leq 1.01$

After MCMC:

  • Discard burn-in (the first 50% as a rule of thumb)
  • Compute ESS per parameter; target ESS $\geq 400$ for reliable estimates
  • Run posterior predictive checks: do simulated data match observed data?
  • Report: posterior mean, posterior std, 95% credible interval, ESS, $\hat{R}$

Common pathologies:

  • Divergent transitions (HMC): the posterior geometry has sharp ridges; reparameterise
  • Low ESS: the chain mixes slowly; retune the step size, use HMC, or apply preconditioning
  • $\hat{R} > 1.1$: chains haven't agreed; run longer and use better initialisation
  • Bimodal trace plots: chain stuck between modes; use parallel tempering

Appendix W: Worked Computation - Two-State Chain

Full Analysis

Let $P = \begin{pmatrix}0.7 & 0.3 \\ 0.5 & 0.5\end{pmatrix}$, starting from $\mu^{(0)} = (1, 0)$.

Step 1: Eigenvalues. Characteristic polynomial: $\det(P - \lambda I) = (0.7-\lambda)(0.5-\lambda) - 0.3 \times 0.5 = \lambda^2 - 1.2\lambda + 0.2 = (\lambda - 1)(\lambda - 0.2) = 0$. Eigenvalues: $\lambda_1 = 1$, $\lambda_2 = 0.2$.

Step 2: Stationary distribution. From $\pi P = \pi$: $0.7\pi_1 + 0.5\pi_2 = \pi_1 \Rightarrow \pi_2 = 0.6\pi_1$. With $\pi_1 + \pi_2 = 1$: $\pi = (5/8, 3/8)$.

Step 3: $n$-step distribution. The spectral decomposition gives $P^n = \mathbf{1}\pi + (0.2)^n C$ for a fixed correction matrix $C$; explicitly:

$$P^n = \begin{pmatrix}5/8 & 3/8 \\ 5/8 & 3/8\end{pmatrix} + (0.2)^n\begin{pmatrix}3/8 & -3/8 \\ -5/8 & 5/8\end{pmatrix}$$

So $\mu^{(n)} = \mu^{(0)}P^n = \left(5/8 + \tfrac{3}{8}(0.2)^n,\ 3/8 - \tfrac{3}{8}(0.2)^n\right)$.

Step 4: TV distance. $\|\mu^{(n)} - \pi\|_{\text{TV}} = \frac{1}{2}\left|\tfrac{3}{8}(0.2)^n\right| + \frac{1}{2}\left|-\tfrac{3}{8}(0.2)^n\right| = \tfrac{3}{8}(0.2)^n$.

At $n = 5$: $3/8 \times 0.00032 = 0.00012$ - essentially converged. The spectral gap is $1 - 0.2 = 0.8$, so mixing is fast.

Step 5: Verify detailed balance.

$$\pi_1 P_{12} = \tfrac{5}{8} \times 0.3 = 0.1875, \qquad \pi_2 P_{21} = \tfrac{3}{8} \times 0.5 = 0.1875\ \checkmark$$

The chain is reversible with this stationary distribution.
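
All five steps can be confirmed in a few lines:

import numpy as np

P = np.array([[0.7, 0.3],
              [0.5, 0.5]])
print(np.sort(np.linalg.eigvals(P)))         # -> [0.2, 1.0]

pi = np.array([5/8, 3/8])
print(np.allclose(pi @ P, pi))               # stationary
print(np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0]))  # detailed balance

mu = np.array([1.0, 0.0])
for n in range(1, 6):
    mu = mu @ P
    tv = 0.5 * np.abs(mu - pi).sum()
    print(n, tv, 3/8 * 0.2**n)               # matches the closed form (3/8)(0.2)^n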


Appendix X: Advanced Topics

X.1 Geometric Random Walks for Volume Computation

The ball walk and the Gaussian walk are Markov chains used to estimate the volume of convex bodies. Starting from a point $x$ in a convex body $K$, the chain moves to a uniformly random point within a ball of radius $r$ centred at $x$ (rejecting moves that leave $K$). The stationary distribution is uniform on $K$.

Mixing time analysis: $t_{\text{mix}} = O(n^2\, \text{poly}(1/\varepsilon))$ for the ball walk, where $n$ is the dimension. This gives a polynomial-time randomised algorithm for volume computation (the Dyer-Frieze-Kannan theorem).

For AI: High-dimensional sampling problems (Bayesian inference with $d \sim 10^9$ parameters) require understanding how mixing time scales with dimension. Geometric random walks provide the fundamental tools.

X.2 Markov Chain Tree Theorem

For a strongly connected directed graph with transition matrix PP, the stationary distribution can be computed via spanning trees:

$$\pi_i = \frac{\tau_i}{\sum_j \tau_j}$$

where $\tau_i$ is the sum of the weights of all spanning trees directed toward node $i$ (arborescences rooted at $i$). This is the Matrix-Tree theorem (Kirchhoff, 1847) extended to directed graphs.

Implication: The stationary distribution has a combinatorial formula in terms of the graph structure. This provides an alternative to eigendecomposition that can be faster for sparse graphs.

X.3 Lumping and Aggregation

A partition $\{C_1,\ldots,C_m\}$ of $\mathcal{S}$ is lumpable with respect to $P$ if for any two states $i, j$ in the same class $C_k$, the aggregated transitions satisfy $\sum_{\ell \in C_m} P_{i\ell} = \sum_{\ell \in C_m} P_{j\ell}$ for all classes $C_m$. In this case, the quotient chain $\bar{P}$ on $\{C_1,\ldots,C_m\}$ is a well-defined Markov chain.

Lumping enables dimensionality reduction: replace a large chain with a smaller aggregated chain. The stationary distribution of the lumped chain aggregates that of the original.

For AI: Grouping similar states (e.g., tokens with similar embeddings) can define a lumped chain that captures the high-level dynamics of the LLM's token distribution, enabling efficient analysis of LLM behaviour at scale.


Appendix Y: Connections to Physics

Y.1 Statistical Mechanics and Gibbs Measures

The Gibbs distribution $\pi(x) \propto e^{-H(x)/(k_B T)}$ (the Boltzmann distribution) is the equilibrium distribution of a physical system with Hamiltonian $H$ at temperature $T$. This is the canonical target distribution for MCMC in physics.

The Metropolis algorithm was originally designed to sample from Gibbs distributions of particle systems (Metropolis et al., 1953). MCMC methods in ML are directly descended from statistical mechanics - including the energy-based models (EBMs) that predate modern deep learning.

Temperature annealing: Starting at high $T$ (a flat, easy-to-sample distribution) and gradually cooling to $T = 0$ (greedy optimisation) is simulated annealing - a probabilistic optimisation algorithm. The connection between MCMC sampling and optimisation as $T \to 0$ is the bridge between Bayesian inference and MAP estimation.

Y.2 Detailed Balance and Thermodynamic Equilibrium

Detailed balance in physics is called microscopic reversibility or time-reversal symmetry. A physical system is at thermodynamic equilibrium iff its microscopic dynamics satisfy detailed balance. Systems out of equilibrium (driven by energy flows) violate detailed balance and maintain directed probability currents.

Non-equilibrium steady states: Some biological systems (motor proteins, gene regulatory networks) maintain non-reversible Markov chains in steady state - driven by ATP hydrolysis. These are "out-of-equilibrium" in the thermodynamic sense. Non-reversible MCMC is inspired by this physics.

Y.3 Renormalization Group and Multi-Scale Chains

The renormalization group (Wilson, 1971) is a technique in physics for coarse-graining: replacing fine-grained microscopic degrees of freedom with effective coarse-grained ones. This is mathematically analogous to the lumping/aggregation of Markov chains.

In ML, knowledge distillation (training a small model to mimic a large model) is a form of renormalization: the small model's transition matrix approximates the "effective" dynamics of the large model at a coarser scale.

