Markov Chains: Appendix Z: Glossary to Appendix II: Summary Diagrams
Appendix Z: Glossary
Absorbing state. A state $i$ with $P_{ii} = 1$; the chain stays there forever once it arrives. Every absorbing state is recurrent.
Aperiodic. A state with period $d(i) = 1$. A chain is aperiodic if all states are aperiodic. Aperiodicity + irreducibility + positive recurrence = ergodicity.
Birth-death chain. A Markov chain on $\{0, 1, 2, \dots\}$ with only nearest-neighbour transitions ($P_{ij} = 0$ for $|i - j| > 1$). Always reversible; stationary distribution computed via detailed balance.
Chapman-Kolmogorov equations. $P^{(m+n)} = P^{(m)} P^{(n)}$: the $(m+n)$-step transition matrix factors as a product of the $m$- and $n$-step matrices.
Communicating class. A maximal set of states where $i \leftrightarrow j$ for all $i, j$ in the set. Partitions the state space into equivalence classes.
Conductance (Cheeger constant). $\Phi = \min_{S : \pi(S) \le 1/2} Q(S, S^c)/\pi(S)$, where $Q(S, S^c)$ is the probability flux from $S$ to $S^c$ at stationarity. Bounds: $\Phi^2/2 \le \gamma \le 2\Phi$.
Coupling. A joint process $(X_t, Y_t)$ where each marginal is a valid Markov chain with transition $P$. Used to bound TV distance via $\|P^t(x, \cdot) - P^t(y, \cdot)\|_{TV} \le \Pr(\tau_c > t)$, where $\tau_c$ is the coalescence time.
CTMC. Continuous-time Markov chain. Holds in each state for an exponential time, then jumps. Specified by a generator matrix $Q$ with $P(t) = e^{tQ}$.
Detailed balance. $\pi_i P_{ij} = \pi_j P_{ji}$: probability flux is equal in both directions at stationarity. Sufficient (not necessary) for $\pi$ to be stationary.
Doubly stochastic matrix. A matrix where rows AND columns all sum to 1. Stationary distribution is uniform.
Embedded chain. The discrete-time chain of states visited by a CTMC (ignoring holding times). Jump probabilities $P_{ij} = q_{ij}/q_i$ for $j \ne i$, where $q_i = -q_{ii}$.
Ergodic chain. Irreducible + positive recurrent + aperiodic. Convergence to unique stationary distribution is guaranteed.
Ergodic theorem. For an ergodic chain: $\frac{1}{n} \sum_{t=1}^{n} f(X_t) \to \sum_i \pi_i f(i)$ a.s. Justifies MCMC estimation.
Fundamental matrix. $N = (I - Q)^{-1}$ for absorbing chains, where $Q$ is the sub-matrix of transitions between transient states. $N_{ij}$ = expected visits to state $j$ starting from $i$ before absorption.
Generator matrix (Q-matrix). $q_{ij} \ge 0$ for $i \ne j$ (jump rates), $q_{ii} = -\sum_{j \ne i} q_{ij}$. Rows sum to zero. CTMC transition: $P(t) = e^{tQ}$.
Gibbs sampling. MCMC algorithm that updates one coordinate at a time from its full conditional distribution. Special case of MH with acceptance ratio 1.
Harmonic function. $h(i) = \sum_j P_{ij} h(j)$: equals its own expected value one step ahead. The process $h(X_t)$ is a martingale, hence constant in expectation.
HMC (Hamiltonian Monte Carlo). MCMC using gradient information to make long-range moves. Augments the state with momentum; uses leapfrog integration to simulate Hamiltonian dynamics.
HMM (Hidden Markov Model). A Markov chain of hidden states with observations emitted from the current state. Algorithms: forward, backward, Viterbi, Baum-Welch.
Irreducible chain. Every state is accessible from every other: for all $i, j$ there exists $n$ with $P^n_{ij} > 0$. Ensures no trapping in subsets.
Lazy chain. $P_{\text{lazy}} = \frac{1}{2}(I + P)$: stays put with prob 1/2, otherwise transitions as $P$. Aperiodic by construction; eigenvalues non-negative.
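A quick numerical check of the lazification trick; a minimal sketch (the 2x2 matrix is an arbitrary example, not from the text):

```python
import numpy as np

# A 2-state chain whose transition matrix has a negative eigenvalue.
P = np.array([[0.1, 0.9],
              [0.8, 0.2]])

# Lazification: stay put with probability 1/2, otherwise move as P.
P_lazy = 0.5 * (np.eye(2) + P)

print(np.real(np.linalg.eigvals(P)))       # one eigenvalue is negative
print(np.real(np.linalg.eigvals(P_lazy)))  # all eigenvalues >= 0
```

Each eigenvalue $\lambda$ of $P$ maps to $(1 + \lambda)/2 \ge 0$, which is why laziness removes periodicity problems.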
Lumpable partition. A partition where all states in the same class have identical transition probabilities to every other class. Enables dimensionality reduction.
Markov chain. A sequence of random variables $X_0, X_1, X_2, \dots$ where the future depends on the past only through the present: $\Pr(X_{t+1} = j \mid X_t, X_{t-1}, \dots, X_0) = \Pr(X_{t+1} = j \mid X_t)$.
Markov property. $\Pr(X_{t+1} = j \mid X_0, \dots, X_t) = \Pr(X_{t+1} = j \mid X_t)$: conditional independence of future from past given present.
MDP (Markov Decision Process). Extension of Markov chain with actions; the policy induces a Markov chain; the Bellman equation characterises the value function.
Metropolis-Hastings. MCMC algorithm: propose $j \sim q(\cdot \mid i)$, accept with probability $\min\left\{1, \frac{\pi_j q(i \mid j)}{\pi_i q(j \mid i)}\right\}$. Correct by detailed balance.
Mixing time. $t_{\text{mix}}(\varepsilon) = \min\{t : \max_x \|P^t(x, \cdot) - \pi\|_{TV} \le \varepsilon\}$. Bounded by $t_{\text{mix}}(\varepsilon) \le \frac{1}{\gamma} \log \frac{1}{\varepsilon \pi_{\min}}$ for reversible chains.
Null recurrent. Recurrent state with infinite mean return time ($\mathbb{E}_i[T_i] = \infty$). Occurs in countably infinite chains (e.g. 1D SRW); no normalisable stationary distribution.
PageRank. Stationary distribution of the random surfer Markov chain on the web graph. Computed by power iteration.
Period. $d(i) = \gcd\{n \ge 1 : P^n_{ii} > 0\}$. Period 1 = aperiodic.
Perron-Frobenius theorem. For primitive stochastic $P$: eigenvalue 1 is unique and dominant; unique stationary distribution $\pi$; $P^n \to \mathbf{1}\pi^T$ geometrically.
Positive recurrent. Recurrent state with finite mean return time ($\mathbb{E}_i[T_i] < \infty$). In a finite irreducible chain, all states are positive recurrent.
Reversible chain. Satisfies detailed balance; the time-reversed chain equals the forward chain. Equivalent to: $P$ is self-adjoint in $\ell^2(\pi)$.
SGLD. Stochastic gradient Langevin dynamics: SGD + Gaussian noise. Approximates Bayesian inference at scale.
Spectral gap. $\gamma = 1 - \lambda_2$ for reversible chains. Determines mixing rate: $t_{\text{mix}} = O(\gamma^{-1} \log(1/\pi_{\min}))$.
Stationary distribution. $\pi$ with $\pi P = \pi$, $\pi_i \ge 0$, $\sum_i \pi_i = 1$. The equilibrium distribution; unique for ergodic chains.
Stochastic matrix. $P_{ij} \ge 0$ and $\sum_j P_{ij} = 1$ for all $i$. The transition matrix of a DTMC.
Transient state. State with $\Pr(\text{return to } i \mid X_0 = i) < 1$. The chain leaves and never returns with positive probability; $\sum_n P^n_{ii} < \infty$.
Total variation (TV) distance. $\|\mu - \nu\|_{TV} = \frac{1}{2} \sum_i |\mu_i - \nu_i| = \max_A |\mu(A) - \nu(A)|$. The mixing distance for Markov chains.
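The definition is one line of code; a minimal sketch (the two pmfs are arbitrary examples):

```python
import numpy as np

def tv_distance(mu, nu):
    """Total variation distance: half the L1 distance between two pmfs."""
    return 0.5 * np.abs(np.asarray(mu) - np.asarray(nu)).sum()

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
print(tv_distance(mu, nu))  # 0.3 = max over events A of |mu(A) - nu(A)|
```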
Viterbi algorithm. Dynamic programming for the most likely hidden state sequence in an HMM. Complexity $O(TN^2)$ for $T$ observations and $N$ hidden states.
Appendix AA: Practice Problems with Solutions
AA.1 Absorption Probability Calculation
Problem. A gambler starts with \$3 and bets \$1 per round, winning \$1 with probability $p = 1/3$ and losing \$1 otherwise. The game ends at \$0 (ruin) or \$6 (goal). Find (a) the probability of reaching \$6, (b) the expected duration of the game.
Solution. Let $p = 1/3$, $q = 2/3$, $r = q/p = 2$. Using the absorption formula for biased walks with absorbing boundaries at 0 and $N = 6$:
(a) $\Pr(\text{reach } 6 \mid X_0 = 3) = \dfrac{1 - r^3}{1 - r^6} = \dfrac{1 - 8}{1 - 64} = \dfrac{7}{63} = \dfrac{1}{9} \approx 0.11$
Only an 11% chance of reaching the goal from the midpoint - the bias significantly hurts the gambler.
(b) Expected duration. For biased walks: $\mathbb{E}[T] = \dfrac{1}{q - p}\left(i - N \cdot \dfrac{1 - r^i}{1 - r^N}\right)$. With $i = 3$, $N = 6$: $\mathbb{E}[T] = 3\left(3 - 6 \cdot \tfrac{1}{9}\right) = 7$ rounds.
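The answers can be cross-checked with the fundamental-matrix machinery; a sketch assuming win probability $p = 1/3$, start \$3, boundaries at \$0 and \$6:

```python
import numpy as np

p, N, start = 1/3, 6, 3   # assumed win probability, target, starting stake
q = 1 - p

# States 1..N-1 are transient; 0 and N are absorbing.
n = N - 1
T = np.zeros((n, n))       # transient-to-transient block
for i in range(1, N):
    if i + 1 < N:
        T[i - 1, i] = p        # win: i -> i+1
    if i - 1 > 0:
        T[i - 1, i - 2] = q    # lose: i -> i-1

F = np.linalg.inv(np.eye(n) - T)   # fundamental matrix: expected visits
R = np.zeros((n, 2))               # transient -> absorbing transitions
R[0, 0], R[-1, 1] = q, p           # state 1 -> 0, state N-1 -> N
B = F @ R                          # absorption probabilities

print(B[start - 1, 1])      # P(reach N from start) ~ 1/9
print(F[start - 1].sum())   # expected duration ~ 7 rounds
```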
AA.2 PageRank Mini-Example with Computation
Problem. Three web pages: 1 links to {2,3}, 2 links to {3}, 3 links to {1}. Find PageRank with damping $\alpha = 0.85$.
Adjacency matrix: $A = \begin{pmatrix} 0 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}$
Out-degrees: $d_1 = 2$, $d_2 = 1$, $d_3 = 1$.
Transition matrix without damping: $P = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}$
Damped matrix ($\alpha = 0.85$): $G = 0.85\,P + \frac{0.15}{3}\,J = \begin{pmatrix} 0.05 & 0.475 & 0.475 \\ 0.05 & 0.05 & 0.9 \\ 0.9 & 0.05 & 0.05 \end{pmatrix}$
Solving $\pi G = \pi$ with $\sum_i \pi_i = 1$: $\pi \approx (0.388, 0.215, 0.397)$.
Page 3 has highest PageRank - it receives links from both 1 and 2, via the cycle $1 \to 2 \to 3 \to 1$ plus the direct edge $1 \to 3$.
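The stationary vector can be checked by power iteration; a sketch assuming damping $\alpha = 0.85$:

```python
import numpy as np

alpha = 0.85  # assumed damping factor
# Links: 1 -> {2,3}, 2 -> {3}, 3 -> {1}
P = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
G = alpha * P + (1 - alpha) / 3 * np.ones((3, 3))  # damped Google matrix

pi = np.ones(3) / 3
for _ in range(100):     # power iteration: pi <- pi G
    pi = pi @ G
print(pi.round(3))       # ~ [0.388 0.215 0.397]; page 3 ranks highest
```

Convergence is geometric because the damped matrix's second eigenvalue has modulus at most $\alpha$.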
AA.3 HMM Forward Algorithm Computation
Problem. HMM with 2 states (H, L) and 2 observations (A, B).
- Initial: $\pi_0 = (0.6, 0.4)$; transitions: $P(H \to H) = 0.7$, $P(L \to H) = 0.4$
- Emission: $P(A \mid H) = 0.8$, $P(A \mid L) = 0.3$ (so $P(B \mid H) = 0.2$, $P(B \mid L) = 0.7$)
- Observation: A, B
Forward algorithm:
$t = 1$, obs = A: $\alpha_1(H) = 0.6 \cdot 0.8 = 0.48$, $\alpha_1(L) = 0.4 \cdot 0.3 = 0.12$
$t = 2$, obs = B: $\alpha_2(H) = (0.48 \cdot 0.7 + 0.12 \cdot 0.4) \cdot 0.2 = 0.0768$, $\alpha_2(L) = (0.48 \cdot 0.3 + 0.12 \cdot 0.6) \cdot 0.7 = 0.1512$
Likelihood: $P(A, B) = 0.0768 + 0.1512 = 0.228$.
Posterior: $P(X_2 = H \mid A, B) = 0.0768/0.228 \approx 0.337$; $P(X_2 = L \mid A, B) \approx 0.663$.
Interpretation: Observing (A, B) - first the high-emission symbol, then the low-emission one - suggests state L at time 2.
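The forward recursion is a few lines of numpy; a sketch with assumed illustrative parameters (initial $(0.6, 0.4)$, transitions $P(H \to H) = 0.7$, $P(L \to H) = 0.4$, emissions $P(A \mid H) = 0.8$, $P(A \mid L) = 0.3$):

```python
import numpy as np

# Assumed example HMM: states (H, L), observations (A, B).
pi0 = np.array([0.6, 0.4])        # initial distribution
A = np.array([[0.7, 0.3],         # transitions from H to (H, L)
              [0.4, 0.6]])        # transitions from L to (H, L)
B = np.array([[0.8, 0.2],         # emissions from H: (A, B)
              [0.3, 0.7]])        # emissions from L: (A, B)
obs = [0, 1]                      # observed sequence: A, B

alpha = pi0 * B[:, obs[0]]        # forward initialisation
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o] # forward recursion

likelihood = alpha.sum()
posterior = alpha / likelihood    # filtering distribution at the final time
print(likelihood)                 # ~ 0.228
print(posterior)                  # ~ [0.337 0.663]: L more likely at t = 2
```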
Appendix BB: Index of Key Theorems
| Theorem / Result | Where Proved | Key Implication |
|---|---|---|
| Perron-Frobenius | Section 4.2, App. I.1 | Unique stationary distribution for ergodic chains |
| Ergodic theorem | Section 3.5, App. E | Time averages converge to $\sum_i \pi_i f(i)$; justifies MCMC |
| Chapman-Kolmogorov | Section 2.3 | $(m+n)$-step transitions: $P^{(m+n)} = P^{(m)} P^{(n)}$ |
| Convergence theorem | Section 4.3 | $P^n \to \mathbf{1}\pi^T$ for ergodic chains |
| Detailed balance stationarity | Section 5.1, App. E.2 | Sufficient condition for stationarity; easier to verify |
| Spectral gap bound | Section 6.3 | $t_{\text{mix}}(\varepsilon) \le \frac{1}{\gamma} \log \frac{1}{\varepsilon \pi_{\min}}$ |
| Cheeger inequality | Section 6.3, App. I.3 | $\Phi^2/2 \le \gamma \le 2\Phi$ |
| Coupling inequality | Section 6.1, App. I.2 | $\|P^t(x, \cdot) - \pi\|_{TV} \le \Pr(\tau_c > t)$ |
| MH correctness | Section 8.2 | Detailed balance implies $\pi$ is stationary |
| MH acceptance optimality | Section 8.2 | ~23% acceptance optimal in high dimensions |
| Markov chain CLT | App. S.2 | $\sqrt{n}(\bar{f}_n - \pi(f)) \Rightarrow \mathcal{N}(0, \sigma^2_{\text{asym}})$ |
| Dirichlet problem / OST | App. A | Value function = expected boundary value |
| Mean return time formula | Section 4.1 | $\pi_i = 1/\mathbb{E}_i[T_i]$; occupancy = reciprocal return time |
| Absorption formula | Section 3.4 | $N = (I - Q)^{-1}$; absorption probabilities $B = NR$ from fundamental matrix |
| Viterbi optimality | Section 9.4 | DP gives exact most likely path in $O(TN^2)$ |
Appendix CC: Mixing Times for Specific Chains
A reference table of known mixing time results for important chains:
| Chain | State Space | Mixing Time | Technique |
|---|---|---|---|
| Two-state chain | $\{0, 1\}$ | $O(1/(p+q))$ | Eigenvalue |
| Random walk on path | $\{0, 1, \dots, n\}$ | $\Theta(n^2)$ | Eigenvalues |
| Random walk on cycle | $\mathbb{Z}_n$ | $\Theta(n^2)$ | Fourier analysis |
| Random walk on hypercube | $\{0, 1\}^n$ | $\Theta(n \log n)$ | Coupling (coupon collector) |
| Random walk on complete graph | $K_n$ | $O(1)$ | Spectral gap |
| Random walk on expander | $n$ vertices | $O(\log n)$ | Spectral gap |
| Glauber dynamics (Ising, high temperature) | $\{-1, +1\}^n$ | $O(n \log n)$ | Coupling or spectral gap |
| Glauber dynamics (Ising, low temperature) | $\{-1, +1\}^n$ | Exponential | Phase transition barrier |
| MH on $\mathbb{R}^d$ (Gaussian RW) | $\mathbb{R}^d$ | $O(d)$ | Optimal scaling |
| HMC on $\mathbb{R}^d$ | $\mathbb{R}^d$ | $O(d^{1/4})$ | Leapfrog integrator analysis |
| PageRank (damping $\alpha$) | Web pages | $O(\log(1/\varepsilon)/\log(1/\alpha))$ | Second eigenvalue $\le \alpha$ |
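The $\Theta(n^2)$ entry for the cycle can be checked numerically: the spectral gap of the lazy walk on the $n$-cycle scales as $1/n^2$, so $\text{gap} \cdot n^2$ should be roughly constant (a sketch):

```python
import numpy as np

def cycle_gap(n):
    """Spectral gap of the lazy random walk on the n-cycle."""
    P = np.zeros((n, n))
    for i in range(n):
        P[i, (i - 1) % n] = P[i, (i + 1) % n] = 0.25
        P[i, i] = 0.5
    ev = np.sort(np.linalg.eigvalsh(P))   # P is symmetric here
    return 1 - ev[-2]                     # 1 - second-largest eigenvalue

for n in (8, 16, 32):
    print(n, cycle_gap(n) * n**2)  # roughly constant -> gap ~ 1/n^2
```

Analytically the lazy cycle's eigenvalues are $(1 + \cos(2\pi k/n))/2$, so the gap is $(1 - \cos(2\pi/n))/2 \approx \pi^2/n^2$.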
Key observations:
- Dimension scaling: Random-walk MH scales as $O(d)$ (diffusive), HMC as $O(d^{1/4})$ (much better)
- Phase transitions: Near the critical temperature, MCMC for the Ising model has exponential mixing time - a computationally hard phase
- Expander graphs: a constant spectral gap means $O(\log n)$ mixing - the best possible for non-trivial graphs
Appendix DD: Software Ecosystem
The following Python libraries implement the algorithms in this section:
MCMC and Probabilistic Programming:
- `PyMC` - full-featured Bayesian modelling with NUTS/HMC sampler
- `Stan` (via `cmdstanpy`/`pystan`) - the gold standard for Bayesian modelling; generates C++ code for fast sampling
- `NumPyro` - JAX-based; supports GPU-accelerated HMC and NUTS
- `blackjax` - modular MCMC library (JAX); easy to extend with custom kernels
Markov Chain Simulation:
- `networkx` - graph operations; random walks on graphs
- `numpy` - transition matrix operations; power iteration; eigendecomposition
Hidden Markov Models:
- `hmmlearn` - sklearn-compatible HMM; Baum-Welch training, Viterbi decoding
- `pomegranate` - probabilistic models including HMMs with GPU support
Reinforcement Learning (MDP):
- `gymnasium` (formerly OpenAI Gym) - RL environments as MDPs
- `stable-baselines3` - RL algorithms (PPO, SAC, TD3) implementing policy optimisation
Diffusion Models:
- `diffusers` (HuggingFace) - implements DDPM, DDIM, and score-based models as CTMCs
Appendix EE: Common Markov Chain Constructions
Ehrenfest Model: $N$ balls in two urns; at each step a uniformly chosen ball moves to the other urn. State = number of balls in urn 1. Birth-death chain; stationary distribution: Binomial$(N, 1/2)$. Models diffusion/thermodynamic equilibrium.
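A small check that the Ehrenfest chain's detailed-balance solution is Binomial; a sketch with $N = 10$ balls:

```python
import numpy as np
from math import comb

N = 10  # number of balls
# Transition: from state k (balls in urn 1), a uniformly chosen ball moves.
P = np.zeros((N + 1, N + 1))
for k in range(N + 1):
    if k > 0:
        P[k, k - 1] = k / N        # chosen ball was in urn 1
    if k < N:
        P[k, k + 1] = (N - k) / N  # chosen ball was in urn 2

# Birth-death chains: stationary distribution from detailed balance,
# pi_{k+1} = pi_k * P[k, k+1] / P[k+1, k].
pi = np.ones(N + 1)
for k in range(N):
    pi[k + 1] = pi[k] * P[k, k + 1] / P[k + 1, k]
pi /= pi.sum()

binom = np.array([comb(N, k) for k in range(N + 1)]) / 2**N
print(np.allclose(pi, binom))  # True: stationary law is Binomial(N, 1/2)
```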
Polya Urn: Start with $r$ red and $b$ blue balls; at each step draw a ball uniformly and return it together with another ball of the same colour. Path-dependent; NOT Markov without extended state. But the fraction of red balls $f_t$ satisfies $\mathbb{E}[f_{t+1} \mid f_t] = f_t$ - a martingale.
Bernoulli-Laplace Diffusion: $2N$ balls total, $N$ red and $N$ blue, split between two urns of size $N$; at each step a ball from each urn is chosen uniformly and the two are swapped. State = number of red balls in urn 1. Symmetric birth-death chain; used to model genetics and combinatorics.
Random Transposition Walk: On $S_n$ (the permutation group of $n$ elements), pick two positions uniformly and swap them. State space size: $n!$. Mixing time: $\frac{1}{2} n \log n$ with cutoff (Diaconis-Shahshahani). A model of card shuffling; the related riffle shuffle mixes in about $\frac{3}{2} \log_2 n$ shuffles (Bayer-Diaconis theorem).
Metropolis on Graphs: For any graph and target $\pi$, the Metropolis chain with uniform proposal on neighbours satisfies detailed balance with respect to $\pi$. Useful for sampling graph colorings, independent sets, etc.
Skip-free chains: Transition matrices with $P_{ij} = 0$ for $j > i + 1$ (can only go up by 1, but down arbitrarily). Examples: GI/M/1 queues. Efficient algorithms via matrix-analytic methods.
Appendix FF: Spectral Graph Theory Connection
FF.1 Normalized Laplacian and Random Walks
For an undirected graph with degree matrix $D$ and adjacency matrix $A$, the random walk transition matrix is $P = D^{-1} A$. The normalized Laplacian is: $\mathcal{L} = I - D^{-1/2} A D^{-1/2}$
The eigenvalues of $\mathcal{L}$ satisfy $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n \le 2$. Since $D^{1/2} P D^{-1/2} = I - \mathcal{L}$, the spectral gap of the random walk is $\lambda_2(\mathcal{L})$.
Cheeger inequality for graphs: $\frac{h^2}{2} \le \lambda_2 \le 2h$, where $h$ is the edge expansion (conductance) of the graph.
Expander graphs: A $d$-regular expander has $\lambda_2 \ge c > 0$ (bounded below independently of $n$). Examples: Ramanujan graphs achieve the optimal second adjacency eigenvalue $2\sqrt{d - 1}$. Random regular graphs are expanders with high probability.
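The Cheeger inequality can be verified by brute force on a small graph; a sketch using an assumed 6-node barbell (two triangles joined by an edge), computing $\lambda_2$ of the normalized Laplacian and the conductance over all cuts:

```python
import numpy as np
from itertools import combinations

# Barbell graph: triangles {0,1,2} and {3,4,5} joined by edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
d = A.sum(axis=1)
L = np.eye(n) - A / np.sqrt(np.outer(d, d))   # normalized Laplacian
lam2 = np.sort(np.linalg.eigvalsh(L))[1]

# Conductance: minimum over all cuts of cut-size / smaller volume.
vol = d.sum()
phi = min(
    sum(A[i, j] for i in S for j in range(n) if j not in S)
    / min(d[list(S)].sum(), vol - d[list(S)].sum())
    for k in range(1, n) for S in combinations(range(n), k)
)
print(lam2, phi)                      # the bottleneck cut gives phi = 1/7
print(phi**2 / 2 <= lam2 <= 2 * phi)  # Cheeger inequality: True
```

The minimizing cut is exactly the bridge between the two triangles, which is what spectral clustering would find.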
FF.2 Graph-Based ML and Markov Chains
Graph attention networks: GAT attention coefficients (learnable) create a data-dependent transition matrix over the graph. The spectral properties of this attention-weighted adjacency matrix determine the GNN's ability to propagate information.
Label propagation: Semi-supervised learning via the iteration $Y^{(t+1)} = \alpha P Y^{(t)} + (1 - \alpha) Y^{(0)}$, where $P$ is the random walk matrix, $Y$ holds the label scores, $Y^{(0)}$ the known labels, and $\alpha$ is a damping parameter. This is exactly the PageRank recursion applied to labels - convergence follows from the same Perron-Frobenius analysis.
Spectral clustering: The eigenvectors for the $k$ smallest eigenvalues of $\mathcal{L}$ embed the graph into $\mathbb{R}^k$ such that well-connected clusters map to nearby points. This exploits the fact that slow-mixing chains (small spectral gap) correspond to loosely connected clusters.
Appendix GG: Long-Range Dependencies and Markov Approximations
GG.1 Finite-Order Markov Chains
A $k$th-order Markov chain satisfies:
$\Pr(X_{t+1} \mid X_t, X_{t-1}, \dots, X_0) = \Pr(X_{t+1} \mid X_t, \dots, X_{t-k+1})$
This can always be reduced to a first-order chain on the augmented state $(X_{t-k+1}, \dots, X_t)$, with state space size $|S|^k$.
Trade-off: Higher order captures more dependencies but exponentially increases the state space. N-gram language models are $(n-1)$th-order Markov chains on the vocabulary - trigram models ($n = 3$, order 2) are common in classical NLP.
For transformers: A transformer with context length $C$ implements (at most) a $C$th-order Markov chain for token generation. The KV cache stores the sufficient statistics (the last $C$ tokens) - first-order Markov in the augmented state space.
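An n-gram model is a few lines of Python; a sketch of a trigram (order-2) model on a toy corpus, with the corpus wrapped around so every context has a successor (names and corpus are illustrative, not from the text):

```python
from collections import defaultdict, Counter
import random

def fit_ngram(tokens, k):
    """k-th order Markov (n-gram) model: state = last k tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - k):
        counts[tuple(tokens[i:i + k])][tokens[i + k]] += 1
    return counts

def generate(counts, state, n, seed=0):
    """Sample n tokens from the fitted chain, starting from a context."""
    rng = random.Random(seed)
    out = list(state)
    for _ in range(n):
        c = counts[tuple(out[-len(state):])]
        out.append(rng.choices(list(c), weights=list(c.values()))[0])
    return out

corpus = "the cat sat on the mat the cat ran".split()
model = fit_ngram(corpus + corpus[:2], k=2)  # wrap-around trigram model
sample = generate(model, ("the", "cat"), 3)
print(sample)
```

The wrap-around is one way to keep the chain closed on a tiny corpus; real n-gram models use smoothing instead.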
GG.2 Variable-Length Markov Chains (VLMC)
VLMCs use a context tree to specify the order dynamically: some contexts require deep history, others only shallow. The transition probability is $\Pr(X_{t+1} \mid X_t, \dots, X_{t-\ell+1})$, where the context length $\ell$ depends on the context itself.
VLMCs provide compact representations of processes with heterogeneous memory. Related to suffix trees and compressed suffix arrays - efficient data structures for sequential data.
GG.3 Hidden Non-Markovian Processes
Many real processes are non-Markovian but become Markovian on an augmented state space. Examples:
- AR($p$) process: $X_t = \sum_{i=1}^{p} a_i X_{t-i} + \varepsilon_t$. First-order Markov on $(X_t, X_{t-1}, \dots, X_{t-p+1})$.
- ARMA($p, q$): Moving average component creates non-Markovian behaviour; the augmented state space must include latent noise terms.
- LLM with KV cache: First-order Markov on the KV cache state; non-Markovian on just the last token.
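The AR($p$) augmentation can be written as a companion-matrix recursion; a sketch for AR(2) with assumed coefficients:

```python
import numpy as np

# AR(2): X_t = a1*X_{t-1} + a2*X_{t-2} + eps_t.
# First-order Markov in the augmented state Z_t = (X_t, X_{t-1}):
#   Z_t = C @ Z_{t-1} + (eps_t, 0), with companion matrix C.
a1, a2 = 0.5, 0.3   # assumed coefficients
C = np.array([[a1, a2],
              [1.0, 0.0]])

rng = np.random.default_rng(0)
z = np.zeros(2)
xs = []
for _ in range(5):
    z = C @ z + np.array([rng.standard_normal(), 0.0])
    xs.append(z[0])
print(xs)  # an AR(2) sample path, generated by a first-order recursion
```

The same companion-matrix device turns any AR($p$) into a first-order vector recursion, which is the state-space form used by Kalman filters.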
Appendix HH: Excursion Theory and Renewal Processes
HH.1 Renewal Theory
A renewal process is a sequence of times $S_n = T_1 + T_2 + \dots + T_n$ where the $T_i$ are iid positive random variables (inter-renewal times). The process $N(t)$ counts renewals up to time $t$.
Renewal theorem: $N(t)/t \to 1/\mu$ a.s., where $\mu = \mathbb{E}[T_1]$ is the mean inter-renewal time. This is the law of large numbers for renewal processes.
Connection to Markov chains: The times of visits to a fixed recurrent state $i$ form a renewal process with inter-renewal distribution = return time distribution. The renewal theorem gives $\pi_i = 1/\mathbb{E}_i[T_i]$ (mean return time formula).
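The mean return time formula can be checked exactly on a small chain (the matrix is an arbitrary example): compute $\pi$ by eigendecomposition and $\mathbb{E}_0[T_0]$ by first-step analysis, and verify their product is 1:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])

# Stationary distribution: left eigenvector for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi /= pi.sum()

# Mean return time to state 0 by first-step analysis:
# h[j] = expected hitting time of 0 from j (for j != 0), solving
# (I - Q) h = 1 with Q the block of P over states {1, 2}; then
# E[T_0] = 1 + sum_{j != 0} P[0, j] * h[j].
h = np.linalg.solve(np.eye(2) - P[1:, 1:], np.ones(2))
mean_return = 1 + P[0, 1:] @ h

print(pi[0] * mean_return)  # 1.0: occupancy is the reciprocal return time
```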
HH.2 Excursions from a State
An excursion from state is the portion of the trajectory between two consecutive visits to . Excursions are iid by the strong Markov property. The distribution of the excursion - its length, states visited, and path structure - encodes rich information about the chain.
For AI: In RLHF, the "excursions" of the LM's generation from topic to topic form a renewal-like structure. Understanding the statistics of these excursions (how long the LM stays on-topic, how it transitions between topics) is important for reward model design.
Appendix II: Summary Diagrams
MARKOV CHAIN THEORY SUMMARY
========================================================================
CHAIN PROPERTIES IMPLICATIONS
---------------- ------------
Irreducible ----------------------> Every state reachable from every other
+ Positive Recurrent --------------> Stationary distribution pi exists
+ Aperiodic -----------------------> pi unique; P^n -> 1 pi^T (convergence)
(= Ergodic)
Reversible -----------------------> Detailed balance pi_i P_ij = pi_j P_ji
Real eigenvalues; symmetric in L^2(pi)
MH, Gibbs, HMC all reversible
MIXING SPEED HIERARCHY
-----------------------
Spectral gap large ---------------> Fast mixing (t_mix = O(1/gap))
Cheeger constant large -----------> Large spectral gap (Cheeger inequality)
Non-reversible + directed ---------> Potentially faster than any reversible
HMC gradient steps ---------------> O(d^{1/4}) vs O(d) for random walk
COMPUTING pi
-----------
Small N: solve pi P = pi, sum(pi) = 1 (linear system)
Large N: power iteration pi^{k+1} = pi^k P (PageRank)
Continuous space: MCMC (Metropolis-Hastings, HMC)
Birth-death: explicit formula via detailed balance
ML APPLICATIONS
---------------
Language generation --------------> Markov chain on vocabulary (or context)
PageRank -------------------------> Stationary dist of web graph walk
MCMC -----------------------------> Bayesian posterior sampling
RL (MDP) -------------------------> Policy induces Markov chain; Bellman eqn
HMM ------------------------------> Latent Markov chain with observations
Diffusion models -----------------> Forward = CTMC; reverse = learned CTMC
GNNs -----------------------------> Message passing = transition matrix power
========================================================================