
Markov Chains: Appendix Z: Glossary to Appendix II: Summary Diagrams

Appendix Z: Glossary

Absorbing state. A state $i$ with $P_{ii}=1$; the chain stays forever once it arrives. Every absorbing state is recurrent.

Aperiodic. A state $i$ with period $d(i)=1$. A chain is aperiodic if all states are aperiodic. Aperiodicity + irreducibility + positive recurrence = ergodicity.

Birth-death chain. A Markov chain on $\mathbb{Z}_{\geq 0}$ with only nearest-neighbour transitions ($|i-j| \leq 1$). Always reversible; stationary distribution computed via detailed balance.

Chapman-Kolmogorov equations. $P^{(m+n)} = P^{(m)} P^{(n)}$: the $(m+n)$-step transition matrix factors as a product of $m$- and $n$-step matrices.

Communicating class. A maximal set of states $C$ where $i \leftrightarrow j$ for all $i,j \in C$. Partitions the state space into equivalence classes.

Conductance (Cheeger constant). $h = \min_{S: \pi(S) \leq 1/2} Q(S, S^c)/\pi(S)$ where $Q(S,S^c)$ is the probability flux from $S$ to $S^c$ at stationarity. Bounds: $h^2/2 \leq \text{gap} \leq 2h$.

Coupling. A joint process $(X_t, Y_t)$ where each marginal is a valid Markov chain with transition $P$. Used to bound TV distance via $\|\mu P^n - \nu P^n\|_{\text{TV}} \leq P(\tau > n)$ where $\tau$ is the coalescence time.

CTMC. Continuous-time Markov chain. Holds in each state for an exponential time, then jumps. Specified by a generator matrix $Q$ with $P(t) = e^{Qt}$.

Detailed balance. $\pi_i P_{ij} = \pi_j P_{ji}$: probability flux is equal in both directions at stationarity. Sufficient (not necessary) for $\pi$ to be stationary.

Doubly stochastic matrix. A stochastic matrix whose columns, as well as rows, all sum to 1. On a finite state space the stationary distribution is uniform.

Embedded chain. The discrete-time chain of states visited by a CTMC (ignoring holding times). Jump probabilities $\tilde{P}_{ij} = q_{ij}/q_i$.

Ergodic chain. Irreducible + positive recurrent + aperiodic. Convergence to unique stationary distribution is guaranteed.

Ergodic theorem. For an ergodic chain: $\frac{1}{n}\sum_{k=0}^{n-1} f(X_k) \to \mathbb{E}_\pi[f]$ a.s. Justifies MCMC estimation.

Fundamental matrix. $N = (I-Q)^{-1}$ for absorbing chains, where $Q$ is the sub-matrix of transitions between transient states. $N_{ij}$ = expected visits to state $j$ starting from $i$ before absorption.

Generator matrix (Q-matrix). $Q_{ij} = q_{ij} \geq 0$ for $i \neq j$ (jump rates), $Q_{ii} = -\sum_{j \neq i} q_{ij}$. Rows sum to zero. CTMC transition: $P(t) = e^{Qt}$.

Gibbs sampling. MCMC algorithm that updates one coordinate at a time from its full conditional distribution. Special case of MH with acceptance ratio 1.

Harmonic function. $h(i) = (Ph)(i)$: equals its own expected value one step ahead. The process $h(X_n)$ is a martingale, hence constant in expectation.

HMC (Hamiltonian Monte Carlo). MCMC using gradient information to make long-range moves. Augments the state with momentum; uses leapfrog integration to simulate Hamiltonian dynamics.

HMM (Hidden Markov Model). A Markov chain $(Z_t)$ of hidden states with observations $(X_t)$ emitted from the current state. Algorithms: forward, backward, Viterbi, Baum-Welch.

Irreducible chain. Every state is accessible from every other: $i \to j$ for all $i,j$. Ensures no trapping in subsets.

Lazy chain. $P' = (I+P)/2$: stays put with prob 1/2, otherwise transitions. Aperiodic by construction; eigenvalues non-negative.

Lumpable partition. A partition where all states in the same class have identical transition probabilities to every other class. Enables dimensionality reduction.

Markov chain. A sequence of random variables where the future depends on the past only through the present: $P(X_{n+1} \mid X_0,\ldots,X_n) = P(X_{n+1} \mid X_n)$.

Markov property. $X_{n+1} \perp (X_0,\ldots,X_{n-1}) \mid X_n$: conditional independence of future from past given present.

MDP (Markov Decision Process). Extension of Markov chain with actions; the policy induces a Markov chain; the Bellman equation characterises the value function.

Metropolis-Hastings. MCMC algorithm: propose $\theta' \sim q(\cdot\mid\theta)$, accept with probability $\min(1, \pi(\theta')q(\theta\mid\theta')/(\pi(\theta)q(\theta'\mid\theta)))$. Correct by detailed balance.

Mixing time. $t_{\text{mix}}(\varepsilon) = \min\{t: \max_x \|P^t(x,\cdot)-\pi\|_{\text{TV}} \leq \varepsilon\}$. Bounded by $O(\log(\pi_{\min}^{-1})/\text{gap})$.

Null recurrent. Recurrent state with infinite mean return time ($\mu_i = \infty$). Possible only on infinite state spaces (e.g. the simple random walk on $\mathbb{Z}$); no normalisable stationary distribution.

PageRank. Stationary distribution of the random surfer Markov chain on the web graph. Computed by power iteration.

Period. $d(i) = \gcd\{n \geq 1: P^{(n)}_{ii} > 0\}$. Period 1 = aperiodic.

Perron-Frobenius theorem. For primitive stochastic $P$: eigenvalue 1 is unique and dominant; unique stationary distribution $\pi > 0$; $P^n \to \mathbf{1}\pi^T$ geometrically.

Positive recurrent. Recurrent state with finite mean return time ($\mu_i < \infty$). In a finite irreducible chain, all states are positive recurrent.

Reversible chain. Satisfies detailed balance; the time-reversed chain equals the forward chain. Equivalent to: $P$ is self-adjoint in $L^2(\pi)$.

SGLD. Stochastic gradient Langevin dynamics: SGD + Gaussian noise. Approximates Bayesian inference at scale.

Spectral gap. $\text{gap} = 1 - |\lambda_2|$ for reversible chains. Governs the mixing rate: $t_{\text{mix}} \asymp 1/\text{gap}$, up to logarithmic factors in $\pi_{\min}^{-1}$.

Stationary distribution. $\pi$ with $\pi P = \pi$, $\pi_i \geq 0$, $\sum_i \pi_i = 1$. The equilibrium distribution; unique for ergodic chains.

Stochastic matrix. $P_{ij} \geq 0$ and $\sum_j P_{ij} = 1$ for all $i$. The transition matrix of a DTMC.

Transient state. State $i$ with $f_i = P(\text{return to } i) < 1$. The chain leaves and never returns with positive probability; $\sum_n P^{(n)}_{ii} < \infty$.

Total variation (TV) distance. $\|\mu-\nu\|_{\text{TV}} = \sup_A|\mu(A)-\nu(A)| = \frac{1}{2}\sum_i|\mu_i-\nu_i|$. The mixing distance for Markov chains.

Viterbi algorithm. Dynamic programming for the most likely hidden state sequence in an HMM. Complexity $O(T|\mathcal{S}|^2)$.
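
To make several of these definitions concrete, here is a minimal numpy sketch (using an assumed 3-state chain, not one from the lesson) that computes a stationary distribution, the spectral gap, and the TV distance to equilibrium:

```python
# Minimal sketch: stationary distribution, spectral gap, TV distance
# for an assumed 3-state chain (the matrix is illustrative, not from the lesson).
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])        # row-stochastic transition matrix

# Stationary distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Spectral gap: 1 - |second-largest eigenvalue modulus|.
mods = np.sort(np.abs(evals))[::-1]
gap = 1.0 - mods[1]

# TV distance between the n-step distribution and pi, starting from state 0.
mu = np.array([1.0, 0.0, 0.0])
tv = lambda a, b: 0.5 * np.abs(a - b).sum()
for n in [1, 5, 20]:
    print(n, tv(mu @ np.linalg.matrix_power(P, n), pi))
print("pi =", pi, "gap =", gap)
```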


Appendix AA: Practice Problems with Solutions

AA.1 Absorption Probability Calculation

Problem. A gambler starts with \$3 and plays at a casino. Each round they win \$1 (prob 0.45) or lose \$1 (prob 0.55). The game ends when they reach \$0 (ruin) or \$6 (goal). Find: (a) the probability of reaching \$6 from \$3, (b) the expected duration of the game.

Solution. Let $p = 0.45$, $q = 0.55$, $r = q/p = 0.55/0.45 = 11/9$. Using the absorption formula for biased walks with absorbing boundaries at $0$ and $N=6$:

(a) Since $1 - r^6 = (1 - r^3)(1 + r^3)$:

$P(\text{reach } 6 \mid \text{start at } 3) = \frac{1 - r^3}{1 - r^6} = \frac{1}{1 + r^3} = \frac{729}{729 + 1331} = \frac{729}{2060} \approx 0.354$

Roughly a 35% chance of reaching the goal from the midpoint, versus 50% for a fair game: the bias significantly hurts the gambler.

(b) Expected duration. For the biased walk, $\mathbb{E}[\tau \mid X_0 = a] = \frac{a}{q-p} - \frac{N}{q-p} \cdot P(\text{win})$. With $a = 3$, $N = 6$, $q - p = 0.1$:

$\mathbb{E}[\tau] = \frac{3}{0.1} - \frac{6}{0.1} \times 0.354 = 30 - 21.2 \approx 8.8 \text{ rounds}$
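
As a numerical check, here is a short sketch that recomputes both answers with the fundamental matrix $N = (I-Q)^{-1}$ from the glossary; the state ordering and variable names are ours, not the lesson's:

```python
# Sketch: verify AA.1 via the fundamental matrix N = (I-Q)^{-1}, B = N R.
import numpy as np

p, q, goal = 0.45, 0.55, 6
transient = range(1, goal)            # states $1..$5; $0 and $6 absorb

Q = np.zeros((5, 5))                  # transient-to-transient block
R = np.zeros((5, 2))                  # transient-to-absorbing block (cols: $0, $6)
for idx, s in enumerate(transient):
    if s - 1 == 0:    R[idx, 0] = q   # lose into ruin
    else:             Q[idx, idx - 1] = q
    if s + 1 == goal: R[idx, 1] = p   # win into goal
    else:             Q[idx, idx + 1] = p

N = np.linalg.inv(np.eye(5) - Q)      # fundamental matrix
B = N @ R                             # absorption probabilities
print("P(reach $6 | start $3) =", B[2, 1])     # ~0.354
print("E[duration | start $3] =", N[2].sum())  # ~8.8 rounds
```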

AA.2 PageRank Mini-Example with Computation

Problem. Three web pages: 1 links to {2,3}, 2 links to {3}, 3 links to {1}. Find the PageRank vector with damping $\alpha = 0.8$.

Adjacency matrix:

$$A = \begin{pmatrix}0&1&1\\0&0&1\\1&0&0\end{pmatrix}$$

Out-degrees: $d_1=2$, $d_2=1$, $d_3=1$.

Transition matrix without damping:

$$P_0 = \begin{pmatrix}0&1/2&1/2\\0&0&1\\1&0&0\end{pmatrix}$$

Damped matrix ($\alpha=0.8$):

$$P = 0.8\,P_0 + 0.2 \times \tfrac{1}{3}J = \begin{pmatrix}1/15&7/15&7/15\\1/15&1/15&13/15\\13/15&1/15&1/15\end{pmatrix}$$

(each non-link entry is $0.2/3 = 1/15$; the link entries are $2/5 + 1/15 = 7/15$ and $4/5 + 1/15 = 13/15$).

Solving $\pi P = \pi$ with $\pi_1+\pi_2+\pi_3=1$ (a $3\times3$ linear system) gives $\pi = \tfrac{1}{159}(61, 35, 63) \approx (0.384, 0.220, 0.396)$.

Page 3 has the highest PageRank: it receives links from both other pages, sitting at the end of every path through the cycle $1\to2\to3\to1$.
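
A quick power-iteration sketch (as described in the glossary's PageRank entry) reproduces this vector; the iteration count is an arbitrary choice:

```python
# Sketch: power iteration for the AA.2 damped matrix.
import numpy as np

P0 = np.array([[0, 0.5, 0.5],
               [0, 0,   1  ],
               [1, 0,   0  ]])
alpha = 0.8
P = alpha * P0 + (1 - alpha) * np.ones((3, 3)) / 3   # damped matrix

pi = np.ones(3) / 3
for _ in range(100):          # power iteration: pi <- pi P
    pi = pi @ P
print(pi)                     # ~ [0.384, 0.220, 0.396]
```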

AA.3 HMM Forward Algorithm Computation

Problem. HMM with 2 states H,L and 2 observations A,B.

  • $\pi_0 = (0.5, 0.5)$, $T = \begin{pmatrix}0.7&0.3\\0.4&0.6\end{pmatrix}$
  • Emission: $E_H = (0.9, 0.1)$, $E_L = (0.2, 0.8)$ (probabilities of A and B respectively)
  • Observation sequence: A, B

Forward algorithm:

$t=1$, obs = A: $\alpha_1(H) = 0.5 \times 0.9 = 0.45$, $\alpha_1(L) = 0.5 \times 0.2 = 0.10$

$t=2$, obs = B:

$\alpha_2(H) = 0.1 \times [0.45 \times 0.7 + 0.10 \times 0.4] = 0.1 \times [0.315 + 0.04] = 0.0355$
$\alpha_2(L) = 0.8 \times [0.45 \times 0.3 + 0.10 \times 0.6] = 0.8 \times [0.135 + 0.06] = 0.156$

Likelihood: $P(\text{obs}=(A,B)) = \alpha_2(H) + \alpha_2(L) = 0.0355 + 0.156 = 0.1915$.

Posterior: $P(Z_2=H \mid \text{obs}) = 0.0355/0.1915 \approx 18.5\%$; $P(Z_2=L \mid \text{obs}) \approx 81.5\%$.

Interpretation: Observing (A,B) - first high-emission then low-emission - suggests state L at time 2.
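
The same recursion in vectorised numpy form; this is a sketch of the standard forward algorithm, with the observation encoding (A = 0, B = 1) our own:

```python
# Sketch: forward recursion alpha_t(z) = e_z(x_t) * sum_z' alpha_{t-1}(z') T_{z'z}.
import numpy as np

pi0 = np.array([0.5, 0.5])                # initial distribution (H, L)
T = np.array([[0.7, 0.3], [0.4, 0.6]])    # transition matrix
E = np.array([[0.9, 0.1], [0.2, 0.8]])    # emission probs for (A, B)
obs = [0, 1]                              # observation sequence A, B

alpha = pi0 * E[:, obs[0]]                # t=1: [0.45, 0.10]
for x in obs[1:]:
    alpha = (alpha @ T) * E[:, x]         # t=2: [0.0355, 0.156]
print("likelihood:", alpha.sum())                  # 0.1915
print("posterior P(Z_2 | obs):", alpha / alpha.sum())
```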


Appendix BB: Index of Key Theorems

| Theorem / Result | Where Proved | Key Implication |
|---|---|---|
| Perron-Frobenius | Section 4.2, App. I.1 | Unique stationary distribution for ergodic chains |
| Ergodic theorem | Section 3.5, App. E | Time averages converge to $\mathbb{E}_\pi[f]$; justifies MCMC |
| Chapman-Kolmogorov | Section 2.3 | $n$-step transitions: $P^{(m+n)} = P^{(m)} P^{(n)}$ |
| Convergence theorem | Section 4.3 | $\lVert\mu^{(n)} - \pi\rVert_{\text{TV}} \to 0$ for ergodic chains |
| Detailed balance $\Rightarrow$ stationarity | Section 5.1, App. E.2 | Sufficient condition; easier to verify |
| Spectral gap bound | Section 6.3 | $t_{\text{mix}} \leq \log(\pi_{\min}^{-1}/\varepsilon)/\text{gap}$ |
| Cheeger inequality | Section 6.3, App. I.3 | $h^2/2 \leq \text{gap} \leq 2h$ |
| Coupling inequality | Section 6.1, App. I.2 | $\lVert P^n(x,\cdot)-\pi\rVert_{\text{TV}} \leq P(\tau > n)$ |
| MH correctness | Section 8.2 | Detailed balance implies $\pi$ is stationary |
| MH acceptance optimality | Section 8.2 | ~23% acceptance optimal in high dimensions |
| Markov chain CLT | App. S.2 | $\sqrt{n}(\text{MCMC avg} - \mathbb{E}_\pi[f]) \to \mathcal{N}(0,\sigma_f^2)$ |
| Dirichlet problem / OST | App. A | Value function = expected boundary value |
| Mean return time formula | Section 4.1 | $\pi_i = 1/\mu_i$; occupancy = reciprocal return time |
| Absorption formula | Section 3.4 | $B = (I-Q)^{-1}R$; absorption probabilities from fundamental matrix |
| Viterbi optimality | Section 9.4 | DP gives exact most likely path in $O(T\lvert\mathcal{S}\rvert^2)$ |

Appendix CC: Mixing Times for Specific Chains

A reference table of known mixing time results for important chains:

| Chain | State Space | Mixing Time $t_{\text{mix}}$ | Technique |
|---|---|---|---|
| Two-state chain | $\{0,1\}$ | $O(1/(p+q))$ | Eigenvalue $1-(p+q)$ |
| Random walk on path $P_n$ | $\{0,\ldots,n\}$ | $\Theta(n^2)$ | Eigenvalues $\cos(k\pi/n)$ |
| Random walk on cycle $C_n$ | $\mathbb{Z}/n\mathbb{Z}$ | $\Theta(n^2)$ | Fourier analysis |
| Random walk on hypercube | $\{0,1\}^d$ | $\Theta(d\log d)$ | Coupling (coupon collector) |
| Random walk on complete graph $K_n$ | $\{1,\ldots,n\}$ | $\Theta(1)$ | Spectral gap $1-1/(n-1) \approx 1$ |
| Random walk on expander | $\{1,\ldots,n\}$ | $\Theta(\log n)$ | Spectral gap $\Omega(1)$ |
| Glauber dynamics (Ising, high $T$) | $\{-1,+1\}^n$ | $O(n\log n)$ | Coupling or spectral gap |
| Glauber dynamics (Ising, $T < T_c$) | $\{-1,+1\}^n$ | $\exp(\Omega(n))$ | Phase transition barrier |
| MH on $\mathcal{N}(0, I_d)$ (Gaussian RW) | $\mathbb{R}^d$ | $\Theta(d)$ | Optimal scaling |
| HMC on $\mathcal{N}(0, I_d)$ | $\mathbb{R}^d$ | $O(d^{1/4})$ | Leapfrog integrator analysis |
| PageRank (damping $\alpha$) | Web pages | $O(\log(N)/\log(1/\alpha))$ | Second eigenvalue $\leq \alpha$ |

Key observations:

  • Dimension scaling: random-walk MH scales as $O(d)$ (diffusive), HMC as $O(d^{1/4})$ (much better)
  • Phase transitions: near the critical temperature, MCMC for the Ising model has exponential mixing time - a computationally hard phase
  • Expander graphs: an $\Omega(1)$ spectral gap means $O(\log n)$ mixing - the best possible for non-trivial graphs
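
The table's spectral-gap entries can be checked numerically for small instances. A sketch, using lazy versions of the walks so that all eigenvalues are non-negative:

```python
# Sketch: spectral gaps of the lazy walk on the n-cycle vs the d-dim hypercube.
# 1/gap tracks the Theta(n^2) and Theta(d log d) mixing rows in the table above.
import numpy as np

def lazy_cycle(n):
    P = np.zeros((n, n))
    for i in range(n):
        P[i, i] = 0.5                       # lazy: stay put with prob 1/2
        P[i, (i - 1) % n] += 0.25
        P[i, (i + 1) % n] += 0.25
    return P

def lazy_hypercube(d):                      # states = binary vectors, 2^d of them
    n = 1 << d
    P = np.zeros((n, n))
    for x in range(n):
        P[x, x] = 0.5
        for k in range(d):
            P[x, x ^ (1 << k)] += 0.5 / d   # flip coordinate k
    return P

for P, name in [(lazy_cycle(20), "cycle n=20"), (lazy_hypercube(6), "hypercube d=6")]:
    ev = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    print(name, "gap =", 1 - ev[1])         # cycle ~ pi^2/n^2, hypercube = 1/d
```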

Appendix DD: Software Ecosystem

The following Python libraries implement the algorithms in this section:

MCMC and Probabilistic Programming:

  • PyMC - full-featured Bayesian modelling with NUTS/HMC sampler
  • Stan (via cmdstanpy/pystan) - the gold standard for Bayesian modelling; generates C++ code for fast sampling
  • NumPyro - JAX-based; supports GPU-accelerated HMC and NUTS
  • blackjax - modular MCMC library (JAX); easy to extend with custom kernels

Markov Chain Simulation:

  • networkx - graph operations; random walks on graphs
  • numpy - transition matrix operations; power iteration; eigendecomposition

Hidden Markov Models:

  • hmmlearn - sklearn-compatible HMM; Baum-Welch training, Viterbi decoding
  • pomegranate - probabilistic models including HMMs with GPU support

Reinforcement Learning (MDP):

  • gymnasium (formerly OpenAI Gym) - RL environments as MDPs
  • stable-baselines3 - RL algorithms (PPO, SAC, TD3) implementing policy optimisation

Diffusion Models:

  • diffusers (HuggingFace) - implements DDPM, DDIM, and score-based models as CTMCs

Appendix EE: Common Markov Chain Constructions

Ehrenfest Model: $N$ balls in two urns; at each step a uniformly random ball moves to the other urn. State = number of balls in urn 1. Birth-death chain; stationary distribution: Binomial$(N, 1/2)$. Models diffusion/thermodynamic equilibrium.
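
A small sketch verifying the Ehrenfest claim via detailed balance (the glossary's sufficient condition); $N = 10$ is an arbitrary choice:

```python
# Sketch: check pi_i P_{i,i+1} = pi_{i+1} P_{i+1,i} for the Ehrenfest chain
# with pi = Binomial(N, 1/2).
import numpy as np
from math import comb

N = 10
pi = np.array([comb(N, i) for i in range(N + 1)], dtype=float) / 2**N
up = lambda i: (N - i) / N      # P(i -> i+1): chosen ball was in urn 2
down = lambda i: i / N          # P(i -> i-1): chosen ball was in urn 1

ok = all(np.isclose(pi[i] * up(i), pi[i + 1] * down(i + 1)) for i in range(N))
print("detailed balance holds:", ok)
```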

Polya Urn: start with $a$ red and $b$ blue balls; each step, add a ball of the drawn colour. Path-dependent; NOT Markov without an extended state. But $X_n = $ fraction red satisfies $\mathbb{E}[X_{n+1} \mid X_n] = X_n$ - a martingale.

Bernoulli-Laplace Diffusion: $N$ balls total, $k$ red and $N-k$ blue, split between two urns of size $N/2$. State = number of red balls in urn 1. Symmetric birth-death chain; used to model genetics and combinatorics.

Random Transposition Walk: on $S_n$ (the permutation group on $n$ elements), pick two positions uniformly and swap. State space: $S_n$ (size $n!$). Mixing time: $\Theta(n\log n)$ (Diaconis-Shahshahani). A standard model of card shuffling.

Metropolis on Graphs: for any graph $G$ and target $\pi$, the Metropolis chain with uniform proposal on neighbours satisfies detailed balance with respect to $\pi$. Useful for sampling graph colourings, independent sets, etc.

Skip-free chains: transition matrices with $P_{ij} = 0$ for $j > i+1$ (the chain can go up by at most 1 but down arbitrarily). Example: the embedded chain of the GI/M/1 queue. Efficient algorithms via matrix-analytic methods.


Appendix FF: Spectral Graph Theory Connection

FF.1 Normalized Laplacian and Random Walks

For an undirected graph $G=(V,E)$ with degree matrix $D$ and adjacency matrix $A$, the random walk transition matrix is $P = D^{-1}A$. The normalized Laplacian is:

$$\mathcal{L} = I - D^{-1/2}AD^{-1/2} = I - D^{1/2}PD^{-1/2}$$

The eigenvalues of $\mathcal{L}$ satisfy $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_N \leq 2$. The spectral gap of the random walk is $\lambda_2(\mathcal{L})$.

Cheeger inequality for graphs: $\lambda_2/2 \leq h(G) \leq \sqrt{2\lambda_2}$, where $h(G) = \min_{S:|S|\leq|V|/2} |E(S,\bar{S})|/(d_{\min}|S|)$ is the edge expansion of the graph.

Expander graphs: a $d$-regular expander has $\lambda_2(\mathcal{L}) = \Omega(1)$ (bounded below independently of $N$). Examples: Ramanujan graphs achieve $\lambda_2 = 1 - 2\sqrt{d-1}/d$ (optimal); random regular graphs are expanders with high probability.
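
A numpy sketch of this identity on an assumed 4-node graph: the spectra of $\mathcal{L}$ and $I - P$ coincide because $D^{1/2}PD^{-1/2}$ is symmetric:

```python
# Sketch: eigenvalues of the normalized Laplacian vs the random walk matrix.
import numpy as np

A = np.array([[0, 1, 1, 0],     # an assumed small undirected graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
P = A / d[:, None]                            # random walk matrix D^{-1} A
L = np.eye(4) - A / np.sqrt(np.outer(d, d))   # I - D^{-1/2} A D^{-1/2}

print(np.sort(np.linalg.eigvalsh(L)))           # 0 = lam_1 <= ... <= 2
print(np.sort(1 - np.linalg.eigvals(P).real))   # same spectrum
```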

FF.2 Graph-Based ML and Markov Chains

Graph attention networks: GAT attention coefficients $e_{ij} = a(\mathbf{W}h_i, \mathbf{W}h_j)$ (learnable) create a data-dependent transition matrix over the graph. The spectral properties of this attention-weighted adjacency matrix determine the GNN's ability to propagate information.

Label propagation: semi-supervised learning via the recursion $F \leftarrow \alpha PF + (1-\alpha)Y$, where $P=D^{-1}A$ is the random walk matrix, $F$ holds the current label estimates, $Y$ the known labels, and $\alpha$ is a damping parameter. This is exactly the PageRank recursion applied to labels - convergence follows from the same Perron-Frobenius analysis.
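
A sketch of the recursion on the same assumed 4-node graph, with two seed labels; convergence is guaranteed because $\alpha P$ has spectral radius below 1:

```python
# Sketch: label propagation F <- alpha P F + (1 - alpha) Y on a toy graph.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)    # random walk matrix

Y = np.zeros((4, 2))
Y[0, 0] = 1.0                           # known label: node 0 -> class 0
Y[3, 1] = 1.0                           # known label: node 3 -> class 1

alpha, F = 0.9, Y.copy()
for _ in range(200):                    # same fixed-point iteration as PageRank
    F = alpha * (P @ F) + (1 - alpha) * Y
print(F.argmax(axis=1))                 # predicted classes for all nodes
```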

Spectral clustering: the eigenvectors of $\mathcal{L}$ for the $k$ smallest eigenvalues embed the graph into $\mathbb{R}^k$ so that well-connected clusters map to nearby points. This exploits the fact that slow-mixing chains (small spectral gap) correspond to loosely connected clusters.


Appendix GG: Long-Range Dependencies and Markov Approximations

GG.1 Finite-Order Markov Chains

A $k$th-order Markov chain satisfies:

$$P(X_n \mid X_0,\ldots,X_{n-1}) = P(X_n \mid X_{n-k},\ldots,X_{n-1})$$

This can always be reduced to a first-order chain on the augmented state $(X_{n-k+1},\ldots,X_n)$ with state space size $|\mathcal{S}|^k$.

Trade-off: higher order captures more dependencies but exponentially increases the state space. N-gram language models are $k$th-order Markov chains on the vocabulary - trigram models ($k=2$) are common in classical NLP.

For transformers: a transformer with context length $L$ generates tokens as an effective $L$th-order Markov chain. The KV cache stores the sufficient statistics (the last $L$ tokens) - first-order Markov in the augmented state space.
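
A toy sketch of the n-gram case: a bigram ($k=1$) model estimated by counting transitions over an assumed miniature corpus:

```python
# Sketch: a bigram language model as a Markov chain on the vocabulary,
# trained by counting word-to-word transitions.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()   # assumed toy corpus
counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    counts[w1][w2] += 1                 # transition count w1 -> w2

def next_word_dist(w):
    c = counts[w]
    total = sum(c.values())
    return {v: n / total for v, n in c.items()}

print(next_word_dist("the"))            # {'cat': 2/3, 'mat': 1/3}
```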

GG.2 Variable-Length Markov Chains (VLMC)

VLMCs use a context tree to specify the order dynamically: some contexts require deep history, others only shallow. The transition probability is $P(X_n \mid X_{n-1},\ldots) = P(X_n \mid \text{suffix}(X_{n-1},\ldots,X_{n-k}))$, where $k$ depends on the context.

VLMCs provide compact representations of processes with heterogeneous memory. Related to suffix trees and compressed suffix arrays - efficient data structures for sequential data.

GG.3 Hidden Non-Markovian Processes

Many real processes are non-Markovian but become Markovian on an augmented state space. Examples:

  • AR($p$) process: $X_n = \sum_{k=1}^p \phi_k X_{n-k} + \varepsilon_n$. First-order Markov on $(X_n,\ldots,X_{n-p+1})$; see the sketch after this list.
  • ARMA($p,q$): the moving-average component creates non-Markovian behaviour; the augmented state space includes latent noise terms.
  • LLM with KV cache: first-order Markov on the KV-cache state; non-Markovian on just the last token.
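
A sketch of the AR(2) case in companion form, where the stacked state $(X_n, X_{n-1})$ makes the recursion first-order Markov; the coefficients are arbitrary:

```python
# Sketch: AR(2) as a first-order Markov chain via the companion matrix.
import numpy as np

phi1, phi2, sigma = 0.5, 0.3, 1.0         # assumed AR(2) coefficients
C = np.array([[phi1, phi2],
              [1.0,  0.0 ]])              # companion matrix

rng = np.random.default_rng(0)
s = np.zeros(2)                           # stacked state (X_n, X_{n-1})
xs = []
for _ in range(5):
    # next state depends only on the current stacked state (+ fresh noise)
    s = C @ s + np.array([sigma * rng.standard_normal(), 0.0])
    xs.append(s[0])
print(xs)
```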

Appendix HH: Excursion Theory and Renewal Processes

HH.1 Renewal Theory

A renewal process is a sequence of times $T_1, T_2, \ldots$ where the differences $T_k - T_{k-1}$ are iid positive random variables (inter-renewal times). The process $N_t = \max\{k : T_k \leq t\}$ counts renewals up to time $t$.

Renewal theorem: $\lim_{t\to\infty} N_t/t = 1/\mu$ where $\mu = \mathbb{E}[T_2-T_1]$ is the mean inter-renewal time. This is the law of large numbers for renewal processes.

Connection to Markov chains: the times of visits to a fixed recurrent state $i$ form a renewal process whose inter-renewal distribution is the return time distribution. The renewal theorem gives $\pi_i = 1/\mu_i$ (mean return time formula).
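
A simulation sketch of this connection, reusing the assumed 3-state chain from the glossary example: the empirical occupancy of a state should match $\pi_i$, the reciprocal of the mean return time:

```python
# Sketch: visits to state 0 form a renewal process; occupancy -> pi_0 = 1/mu_0.
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
rng = np.random.default_rng(1)

state, visits, T = 0, 0, 100_000
for _ in range(T):
    state = rng.choice(3, p=P[state])   # simulate one step
    visits += (state == 0)

# Compare empirical occupancy of state 0 with the exact stationary mass.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))]); pi /= pi.sum()
print("empirical:", visits / T, " pi_0:", pi[0])
```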

HH.2 Excursions from a State

An excursion from state $i$ is the portion of the trajectory between two consecutive visits to $i$. Excursions are iid by the strong Markov property. The distribution of the excursion - its length, states visited, and path structure - encodes rich information about the chain.

For AI: In RLHF, the "excursions" of the LM's generation from topic to topic form a renewal-like structure. Understanding the statistics of these excursions (how long the LM stays on-topic, how it transitions between topics) is important for reward model design.


Appendix II: Summary Diagrams

MARKOV CHAIN THEORY SUMMARY
========================================================================

  CHAIN PROPERTIES                   IMPLICATIONS
  ----------------                   ------------
  Irreducible ----------------------> Every state reachable from every other
  + Positive Recurrent -------------> Stationary distribution pi exists
  + Aperiodic ----------------------> pi unique; P^n -> 1 pi^T (convergence)
  (= Ergodic)

  Reversible -----------------------> Detailed balance pi_i P_ij = pi_j P_ji
                                      Real eigenvalues; symmetric in L^2(pi)
                                      MH, Gibbs, HMC all reversible

  MIXING SPEED HIERARCHY
  -----------------------
  Spectral gap large ---------------> Fast mixing (t_mix = O(1/gap))
  Cheeger constant large -----------> Large spectral gap (Cheeger inequality)
  Non-reversible + directed --------> Potentially faster than any reversible
  HMC gradient steps ---------------> O(d^{1/4}) vs O(d) for random walk

  COMPUTING pi
  ------------
  Small N: solve pi P = pi, sum(pi) = 1 (linear system)
  Large N: power iteration pi^{k+1} = pi^{k} P (PageRank)
  Continuous space: MCMC (Metropolis-Hastings, HMC)
  Birth-death: explicit formula via detailed balance

  ML APPLICATIONS
  ---------------
  Language generation --------------> Markov chain on vocabulary (or context)
  PageRank -------------------------> Stationary dist of web graph walk
  MCMC -----------------------------> Bayesian posterior sampling
  RL (MDP) -------------------------> Policy induces Markov chain; Bellman eqn
  HMM ------------------------------> Latent Markov chain with observations
  Diffusion models -----------------> Forward = CTMC; reverse = learned CTMC
  GNNs -----------------------------> Message passing = transition matrix power

========================================================================
