
Markov Chains: Appendix Z: Glossary to Appendix II: Summary Diagrams

Appendix Z: Glossary

Absorbing state. A state $i$ with $P_{ii}=1$; the chain stays forever once it arrives. Every absorbing state is recurrent.

Aperiodic. A state $i$ with period $d(i)=1$. A chain is aperiodic if all states are aperiodic. Aperiodicity + irreducibility + positive recurrence = ergodicity.

Birth-death chain. A Markov chain on $\mathbb{Z}_{\geq 0}$ with only nearest-neighbour transitions ($|i-j| \leq 1$). Always reversible; stationary distribution computed via detailed balance.

Chapman-Kolmogorov equations. $P^{(m+n)} = P^{(m)} P^{(n)}$: the $(m+n)$-step transition matrix factors as a product of $m$- and $n$-step matrices.

Communicating class. A maximal set of states $C$ where $i \leftrightarrow j$ for all $i,j \in C$. Partitions the state space into equivalence classes.

Conductance (Cheeger constant). $h = \min_{S: \pi(S) \leq 1/2} Q(S, S^c)/\pi(S)$ where $Q(S,S^c)$ is the probability flux from $S$ to $S^c$ at stationarity. Bounds: $h^2/2 \leq \text{gap} \leq 2h$.

Coupling. A joint process $(X_t, Y_t)$ where each marginal is a valid Markov chain with transition $P$. Used to bound TV distance via $\|\mu P^n - \nu P^n\|_{\text{TV}} \leq P(\tau > n)$ where $\tau$ is the coalescence time.

CTMC. Continuous-time Markov chain. Holds in each state for an exponential time, then jumps. Specified by a generator matrix $Q$ with $P(t) = e^{Qt}$.

Detailed balance. $\pi_i P_{ij} = \pi_j P_{ji}$: probability flux is equal in both directions at stationarity. Sufficient (not necessary) for $\pi$ to be stationary.

Doubly stochastic matrix. A stochastic matrix whose columns, as well as rows, all sum to 1. On a finite state space the stationary distribution is uniform.

Embedded chain. The discrete-time chain of states visited by a CTMC (ignoring holding times). Jump probabilities $\tilde{P}_{ij} = q_{ij}/q_i$.

Ergodic chain. Irreducible + positive recurrent + aperiodic. Convergence to unique stationary distribution is guaranteed.

Ergodic theorem. For an ergodic chain: $\frac{1}{n}\sum_{k=0}^{n-1} f(X_k) \to \mathbb{E}_\pi[f]$ a.s. Justifies MCMC estimation.

Fundamental matrix. $N = (I-Q)^{-1}$ for absorbing chains, where $Q$ is the sub-matrix of transitions between transient states. $N_{ij}$ = expected visits to state $j$ starting from $i$ before absorption.

Generator matrix (Q-matrix). $Q_{ij} = q_{ij} \geq 0$ for $i \neq j$ (jump rates), $Q_{ii} = -\sum_{j \neq i} q_{ij}$. Rows sum to zero. CTMC transition: $P(t) = e^{Qt}$.

Gibbs sampling. MCMC algorithm that updates one coordinate at a time from its full conditional distribution. Special case of MH with acceptance ratio 1.

Harmonic function. $h(i) = (Ph)(i)$: equals its own expected value one step ahead. The process $h(X_n)$ is a martingale, hence constant in expectation.

HMC (Hamiltonian Monte Carlo). MCMC using gradient information to make long-range moves. Augments the state with momentum; uses leapfrog integration to simulate Hamiltonian dynamics.

HMM (Hidden Markov Model). A Markov chain $(Z_t)$ of hidden states with observations $(X_t)$ emitted from the current state. Algorithms: forward, backward, Viterbi, Baum-Welch.

Irreducible chain. Every state is accessible from every other: $i \to j$ for all $i,j$. Ensures no trapping in subsets.

Lazy chain. $P' = (I+P)/2$: stays put with prob 1/2, otherwise transitions. Aperiodic by construction; eigenvalues non-negative.

Lumpable partition. A partition where all states in the same class have identical transition probabilities to every other class. Enables dimensionality reduction.

Markov chain. A sequence of random variables where the future depends on the past only through the present: $P(X_{n+1} \mid X_0,\ldots,X_n) = P(X_{n+1} \mid X_n)$.

Markov property. $X_{n+1} \perp (X_0,\ldots,X_{n-1}) \mid X_n$: conditional independence of future from past given present.

MDP (Markov Decision Process). Extension of Markov chain with actions; the policy induces a Markov chain; the Bellman equation characterises the value function.

Metropolis-Hastings. MCMC algorithm: propose $\theta' \sim q(\cdot\mid\theta)$, accept with probability $\min(1, \pi(\theta')q(\theta\mid\theta')/(\pi(\theta)q(\theta'\mid\theta)))$. Correct by detailed balance.

Mixing time. $t_{\text{mix}}(\varepsilon) = \min\{t: \max_x \|P^t(x,\cdot)-\pi\|_{\text{TV}} \leq \varepsilon\}$. Bounded by $O(\log(\pi_{\min}^{-1})/\text{gap})$.

Null recurrent. Recurrent state with infinite mean return time ($\mu_i = \infty$). Possible only on infinite state spaces (e.g. the simple random walk on $\mathbb{Z}$); no normalisable stationary distribution.

PageRank. Stationary distribution of the random surfer Markov chain on the web graph. Computed by power iteration.

Period. $d(i) = \gcd\{n \geq 1: P^{(n)}_{ii} > 0\}$. Period 1 = aperiodic.

Perron-Frobenius theorem. For primitive stochastic $P$: eigenvalue 1 is unique and dominant; unique stationary distribution $\pi > 0$; $P^n \to \mathbf{1}\pi^T$ geometrically.

Positive recurrent. Recurrent state with finite mean return time ($\mu_i < \infty$). In a finite irreducible chain, all states are positive recurrent.

Reversible chain. Satisfies detailed balance; the time-reversed chain equals the forward chain. Equivalent to: $P$ is self-adjoint in $L^2(\pi)$.

SGLD. Stochastic gradient Langevin dynamics: SGD + Gaussian noise. Approximates Bayesian inference at scale.

Spectral gap. $\text{gap} = 1 - |\lambda_2|$ for reversible chains. Governs the mixing rate: $t_{\text{mix}} \asymp 1/\text{gap}$, up to logarithmic factors in $\pi_{\min}^{-1}$.

Stationary distribution. $\pi$ with $\pi P = \pi$, $\pi_i \geq 0$, $\sum_i \pi_i = 1$. The equilibrium distribution; unique for ergodic chains.

Stochastic matrix. $P_{ij} \geq 0$ and $\sum_j P_{ij} = 1$ for all $i$. The transition matrix of a DTMC.

Transient state. State $i$ with $f_i = P(\text{return to } i) < 1$. The chain leaves and never returns with positive probability; $\sum_n P^{(n)}_{ii} < \infty$.

Total variation (TV) distance. $\|\mu-\nu\|_{\text{TV}} = \sup_A|\mu(A)-\nu(A)| = \frac{1}{2}\sum_i|\mu_i-\nu_i|$. The mixing distance for Markov chains.

Viterbi algorithm. Dynamic programming for the most likely hidden state sequence in an HMM. Complexity $O(T|\mathcal{S}|^2)$.
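
To make several of these definitions concrete, here is a minimal numpy sketch (using an assumed 3-state chain, not one from the lesson) that computes a stationary distribution, the spectral gap, and the TV distance to equilibrium:

```python
# Minimal sketch: stationary distribution, spectral gap, TV distance
# for an assumed 3-state chain (the matrix is illustrative, not from the lesson).
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])        # row-stochastic transition matrix

# Stationary distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Spectral gap: 1 - |second-largest eigenvalue modulus|.
mods = np.sort(np.abs(evals))[::-1]
gap = 1.0 - mods[1]

# TV distance between the n-step distribution and pi, starting from state 0.
mu = np.array([1.0, 0.0, 0.0])
tv = lambda a, b: 0.5 * np.abs(a - b).sum()
for n in [1, 5, 20]:
    print(n, tv(mu @ np.linalg.matrix_power(P, n), pi))
print("pi =", pi, "gap =", gap)
```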


Appendix AA: Practice Problems with Solutions

AA.1 Absorption Probability Calculation

Problem. A gambler starts with \$3 and plays at a casino. Each round they win \$1 (prob 0.45) or lose \$1 (prob 0.55). The game ends when they reach \$0 (ruin) or \$6 (goal). Find: (a) the probability of reaching \$6 from \$3, (b) the expected duration of the game.

Solution. Let $p = 0.45$, $q = 0.55$, $r = q/p = 0.55/0.45 = 11/9$. Using the absorption formula for biased walks with absorbing boundaries at $0$ and $N=6$:

(a) Since $1 - r^6 = (1 - r^3)(1 + r^3)$:

$P(\text{reach } 6 \mid \text{start at } 3) = \frac{1 - r^3}{1 - r^6} = \frac{1}{1 + r^3} = \frac{729}{729 + 1331} = \frac{729}{2060} \approx 0.354$

Roughly a 35% chance of reaching the goal from the midpoint, versus 50% for a fair game: the bias significantly hurts the gambler.

(b) Expected duration. For the biased walk, $\mathbb{E}[\tau \mid X_0 = a] = \frac{a}{q-p} - \frac{N}{q-p} \cdot P(\text{win})$. With $a = 3$, $N = 6$, $q - p = 0.1$:

$\mathbb{E}[\tau] = \frac{3}{0.1} - \frac{6}{0.1} \times 0.354 = 30 - 21.2 \approx 8.8 \text{ rounds}$
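
As a numerical check, here is a short sketch that recomputes both answers with the fundamental matrix $N = (I-Q)^{-1}$ from the glossary; the state ordering and variable names are ours, not the lesson's:

```python
# Sketch: verify AA.1 via the fundamental matrix N = (I-Q)^{-1}, B = N R.
import numpy as np

p, q, goal = 0.45, 0.55, 6
transient = range(1, goal)            # states $1..$5; $0 and $6 absorb

Q = np.zeros((5, 5))                  # transient-to-transient block
R = np.zeros((5, 2))                  # transient-to-absorbing block (cols: $0, $6)
for idx, s in enumerate(transient):
    if s - 1 == 0:    R[idx, 0] = q   # lose into ruin
    else:             Q[idx, idx - 1] = q
    if s + 1 == goal: R[idx, 1] = p   # win into goal
    else:             Q[idx, idx + 1] = p

N = np.linalg.inv(np.eye(5) - Q)      # fundamental matrix
B = N @ R                             # absorption probabilities
print("P(reach $6 | start $3) =", B[2, 1])     # ~0.354
print("E[duration | start $3] =", N[2].sum())  # ~8.8 rounds
```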

AA.2 PageRank Mini-Example with Computation

Problem. Three web pages: 1 links to {2,3}, 2 links to {3}, 3 links to {1}. Find the PageRank vector with damping $\alpha = 0.8$.

Adjacency matrix:

$$A = \begin{pmatrix}0&1&1\\0&0&1\\1&0&0\end{pmatrix}$$

Out-degrees: $d_1=2$, $d_2=1$, $d_3=1$.

Transition matrix without damping:

$$P_0 = \begin{pmatrix}0&1/2&1/2\\0&0&1\\1&0&0\end{pmatrix}$$

Damped matrix ($\alpha=0.8$):

$$P = 0.8\,P_0 + 0.2 \times \tfrac{1}{3}J = \begin{pmatrix}1/15&7/15&7/15\\1/15&1/15&13/15\\13/15&1/15&1/15\end{pmatrix}$$

(each non-link entry is $0.2/3 = 1/15$; the link entries are $2/5 + 1/15 = 7/15$ and $4/5 + 1/15 = 13/15$).

Solving $\pi P = \pi$ with $\pi_1+\pi_2+\pi_3=1$ (a $3\times3$ linear system) gives $\pi = \tfrac{1}{159}(61, 35, 63) \approx (0.384, 0.220, 0.396)$.

Page 3 has the highest PageRank: it receives links from both other pages, sitting at the end of every path through the cycle $1\to2\to3\to1$.
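
A quick power-iteration sketch (as described in the glossary's PageRank entry) reproduces this vector; the iteration count is an arbitrary choice:

```python
# Sketch: power iteration for the AA.2 damped matrix.
import numpy as np

P0 = np.array([[0, 0.5, 0.5],
               [0, 0,   1  ],
               [1, 0,   0  ]])
alpha = 0.8
P = alpha * P0 + (1 - alpha) * np.ones((3, 3)) / 3   # damped matrix

pi = np.ones(3) / 3
for _ in range(100):          # power iteration: pi <- pi P
    pi = pi @ P
print(pi)                     # ~ [0.384, 0.220, 0.396]
```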

AA.3 HMM Forward Algorithm Computation

Problem. HMM with 2 states H,L and 2 observations A,B.

  • $\pi_0 = (0.5, 0.5)$, $T = \begin{pmatrix}0.7&0.3\\0.4&0.6\end{pmatrix}$
  • Emission: $E_H = (0.9, 0.1)$, $E_L = (0.2, 0.8)$ (probabilities of A and B respectively)
  • Observation sequence: A, B

Forward algorithm:

$t=1$, obs = A: $\alpha_1(H) = 0.5 \times 0.9 = 0.45$, $\alpha_1(L) = 0.5 \times 0.2 = 0.10$

$t=2$, obs = B:

$\alpha_2(H) = 0.1 \times [0.45 \times 0.7 + 0.10 \times 0.4] = 0.1 \times [0.315 + 0.04] = 0.0355$
$\alpha_2(L) = 0.8 \times [0.45 \times 0.3 + 0.10 \times 0.6] = 0.8 \times [0.135 + 0.06] = 0.156$

Likelihood: $P(\text{obs}=(A,B)) = \alpha_2(H) + \alpha_2(L) = 0.0355 + 0.156 = 0.1915$.

Posterior: $P(Z_2=H \mid \text{obs}) = 0.0355/0.1915 \approx 18.5\%$; $P(Z_2=L \mid \text{obs}) \approx 81.5\%$.

Interpretation: Observing (A,B) - first high-emission then low-emission - suggests state L at time 2.
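
The same recursion in vectorised numpy form; this is a sketch of the standard forward algorithm, with the observation encoding (A = 0, B = 1) our own:

```python
# Sketch: forward recursion alpha_t(z) = e_z(x_t) * sum_z' alpha_{t-1}(z') T_{z'z}.
import numpy as np

pi0 = np.array([0.5, 0.5])                # initial distribution (H, L)
T = np.array([[0.7, 0.3], [0.4, 0.6]])    # transition matrix
E = np.array([[0.9, 0.1], [0.2, 0.8]])    # emission probs for (A, B)
obs = [0, 1]                              # observation sequence A, B

alpha = pi0 * E[:, obs[0]]                # t=1: [0.45, 0.10]
for x in obs[1:]:
    alpha = (alpha @ T) * E[:, x]         # t=2: [0.0355, 0.156]
print("likelihood:", alpha.sum())                  # 0.1915
print("posterior P(Z_2 | obs):", alpha / alpha.sum())
```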


Appendix BB: Index of Key Theorems

| Theorem / Result | Where Proved | Key Implication |
|---|---|---|
| Perron-Frobenius | Section 4.2, App. I.1 | Unique stationary distribution for ergodic chains |
| Ergodic theorem | Section 3.5, App. E | Time averages converge to $\mathbb{E}_\pi[f]$; justifies MCMC |
| Chapman-Kolmogorov | Section 2.3 | $n$-step transitions: $P^{(m+n)} = P^{(m)} P^{(n)}$ |
| Convergence theorem | Section 4.3 | $\lVert\mu^{(n)} - \pi\rVert_{\text{TV}} \to 0$ for ergodic chains |
| Detailed balance $\Rightarrow$ stationarity | Section 5.1, App. E.2 | Sufficient condition; easier to verify |
| Spectral gap bound | Section 6.3 | $t_{\text{mix}} \leq \log(\pi_{\min}^{-1}/\varepsilon)/\text{gap}$ |
| Cheeger inequality | Section 6.3, App. I.3 | $h^2/2 \leq \text{gap} \leq 2h$ |
| Coupling inequality | Section 6.1, App. I.2 | $\lVert P^n(x,\cdot)-\pi\rVert_{\text{TV}} \leq P(\tau > n)$ |
| MH correctness | Section 8.2 | Detailed balance implies $\pi$ is stationary |
| MH acceptance optimality | Section 8.2 | ~23% acceptance optimal in high dimensions |
| Markov chain CLT | App. S.2 | $\sqrt{n}(\text{MCMC avg} - \mathbb{E}_\pi[f]) \to \mathcal{N}(0,\sigma_f^2)$ |
| Dirichlet problem / OST | App. A | Value function = expected boundary value |
| Mean return time formula | Section 4.1 | $\pi_i = 1/\mu_i$; occupancy = reciprocal return time |
| Absorption formula | Section 3.4 | $B = (I-Q)^{-1}R$; absorption probabilities from fundamental matrix |
| Viterbi optimality | Section 9.4 | DP gives exact most likely path in $O(T\lvert\mathcal{S}\rvert^2)$ |

Appendix CC: Mixing Times for Specific Chains

A reference table of known mixing time results for important chains:

| Chain | State Space | Mixing Time $t_{\text{mix}}$ | Technique |
|---|---|---|---|
| Two-state chain | $\{0,1\}$ | $O(1/(p+q))$ | Eigenvalue $1-(p+q)$ |
| Random walk on path $P_n$ | $\{0,\ldots,n\}$ | $\Theta(n^2)$ | Eigenvalues $\cos(k\pi/n)$ |
| Random walk on cycle $C_n$ | $\mathbb{Z}/n\mathbb{Z}$ | $\Theta(n^2)$ | Fourier analysis |
| Random walk on hypercube | $\{0,1\}^d$ | $\Theta(d\log d)$ | Coupling (coupon collector) |
| Random walk on complete graph $K_n$ | $\{1,\ldots,n\}$ | $\Theta(1)$ | Spectral gap $1-1/(n-1) \approx 1$ |
| Random walk on expander | $\{1,\ldots,n\}$ | $\Theta(\log n)$ | Spectral gap $\Omega(1)$ |
| Glauber dynamics (Ising, high $T$) | $\{-1,+1\}^n$ | $O(n\log n)$ | Coupling or spectral gap |
| Glauber dynamics (Ising, $T < T_c$) | $\{-1,+1\}^n$ | $\exp(\Omega(n))$ | Phase transition barrier |
| MH on $\mathcal{N}(0, I_d)$ (Gaussian RW) | $\mathbb{R}^d$ | $\Theta(d)$ | Optimal scaling |
| HMC on $\mathcal{N}(0, I_d)$ | $\mathbb{R}^d$ | $O(d^{1/4})$ | Leapfrog integrator analysis |
| PageRank (damping $\alpha$) | Web pages | $O(\log(N)/\log(1/\alpha))$ | Second eigenvalue $\leq \alpha$ |

Key observations:

  • Dimension scaling: random-walk MH scales as $O(d)$ (diffusive), HMC as $O(d^{1/4})$ (much better)
  • Phase transitions: near the critical temperature, MCMC for the Ising model has exponential mixing time - a computationally hard phase
  • Expander graphs: an $\Omega(1)$ spectral gap means $O(\log n)$ mixing - the best possible for non-trivial graphs
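
The table's spectral-gap entries can be checked numerically for small instances. A sketch, using lazy versions of the walks so that all eigenvalues are non-negative:

```python
# Sketch: spectral gaps of the lazy walk on the n-cycle vs the d-dim hypercube.
# 1/gap tracks the Theta(n^2) and Theta(d log d) mixing rows in the table above.
import numpy as np

def lazy_cycle(n):
    P = np.zeros((n, n))
    for i in range(n):
        P[i, i] = 0.5                       # lazy: stay put with prob 1/2
        P[i, (i - 1) % n] += 0.25
        P[i, (i + 1) % n] += 0.25
    return P

def lazy_hypercube(d):                      # states = binary vectors, 2^d of them
    n = 1 << d
    P = np.zeros((n, n))
    for x in range(n):
        P[x, x] = 0.5
        for k in range(d):
            P[x, x ^ (1 << k)] += 0.5 / d   # flip coordinate k
    return P

for P, name in [(lazy_cycle(20), "cycle n=20"), (lazy_hypercube(6), "hypercube d=6")]:
    ev = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    print(name, "gap =", 1 - ev[1])         # cycle ~ pi^2/n^2, hypercube = 1/d
```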

Appendix DD: Software Ecosystem

The following Python libraries implement the algorithms in this section:

MCMC and Probabilistic Programming:

  • PyMC - full-featured Bayesian modelling with NUTS/HMC sampler
  • Stan (via cmdstanpy/pystan) - the gold standard for Bayesian modelling; generates C++ code for fast sampling
  • NumPyro - JAX-based; supports GPU-accelerated HMC and NUTS
  • blackjax - modular MCMC library (JAX); easy to extend with custom kernels

Markov Chain Simulation:

  • networkx - graph operations; random walks on graphs
  • numpy - transition matrix operations; power iteration; eigendecomposition

Hidden Markov Models:

  • hmmlearn - sklearn-compatible HMM; Baum-Welch training, Viterbi decoding
  • pomegranate - probabilistic models including HMMs with GPU support

Reinforcement Learning (MDP):

  • gymnasium (formerly OpenAI Gym) - RL environments as MDPs
  • stable-baselines3 - RL algorithms (PPO, SAC, TD3) implementing policy optimisation

Diffusion Models:

  • diffusers (HuggingFace) - implements DDPM, DDIM, and score-based models as CTMCs

Appendix EE: Common Markov Chain Constructions

Ehrenfest Model: $N$ balls in two urns; at each step a uniformly random ball moves to the other urn. State = number of balls in urn 1. Birth-death chain; stationary distribution: Binomial$(N, 1/2)$. Models diffusion/thermodynamic equilibrium.
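
A small sketch verifying the Ehrenfest claim via detailed balance (the glossary's sufficient condition); $N = 10$ is an arbitrary choice:

```python
# Sketch: check pi_i P_{i,i+1} = pi_{i+1} P_{i+1,i} for the Ehrenfest chain
# with pi = Binomial(N, 1/2).
import numpy as np
from math import comb

N = 10
pi = np.array([comb(N, i) for i in range(N + 1)], dtype=float) / 2**N
up = lambda i: (N - i) / N      # P(i -> i+1): chosen ball was in urn 2
down = lambda i: i / N          # P(i -> i-1): chosen ball was in urn 1

ok = all(np.isclose(pi[i] * up(i), pi[i + 1] * down(i + 1)) for i in range(N))
print("detailed balance holds:", ok)
```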

Polya Urn: start with $a$ red and $b$ blue balls; each step, add a ball of the drawn colour. Path-dependent; NOT Markov without an extended state. But $X_n = $ fraction red satisfies $\mathbb{E}[X_{n+1} \mid X_n] = X_n$ - a martingale.

Bernoulli-Laplace Diffusion: $N$ balls total, $k$ red and $N-k$ blue, split between two urns of size $N/2$. State = number of red balls in urn 1. Symmetric birth-death chain; used to model genetics and combinatorics.

Random Transposition Walk: on $S_n$ (the permutation group on $n$ elements), pick two positions uniformly and swap. State space: $S_n$ (size $n!$). Mixing time: $\Theta(n\log n)$ (Diaconis-Shahshahani). A standard model of card shuffling.

Metropolis on Graphs: for any graph $G$ and target $\pi$, the Metropolis chain with uniform proposal on neighbours satisfies detailed balance with respect to $\pi$. Useful for sampling graph colourings, independent sets, etc.

Skip-free chains: transition matrices with $P_{ij} = 0$ for $j > i+1$ (the chain can go up by at most 1 but down arbitrarily). Example: the embedded chain of the GI/M/1 queue. Efficient algorithms via matrix-analytic methods.


Appendix FF: Spectral Graph Theory Connection

FF.1 Normalized Laplacian and Random Walks

For an undirected graph $G=(V,E)$ with degree matrix $D$ and adjacency matrix $A$, the random walk transition matrix is $P = D^{-1}A$. The normalized Laplacian is:

$$\mathcal{L} = I - D^{-1/2}AD^{-1/2} = I - D^{1/2}PD^{-1/2}$$

The eigenvalues of $\mathcal{L}$ satisfy $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_N \leq 2$. The spectral gap of the random walk is $\lambda_2(\mathcal{L})$.

Cheeger inequality for graphs: $\lambda_2/2 \leq h(G) \leq \sqrt{2\lambda_2}$, where $h(G) = \min_{S:|S|\leq|V|/2} |E(S,\bar{S})|/(d_{\min}|S|)$ is the edge expansion of the graph.

Expander graphs: a $d$-regular expander has $\lambda_2(\mathcal{L}) = \Omega(1)$ (bounded below independently of $N$). Examples: Ramanujan graphs achieve $\lambda_2 = 1 - 2\sqrt{d-1}/d$ (optimal); random regular graphs are expanders with high probability.
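
A numpy sketch of this identity on an assumed 4-node graph: the spectra of $\mathcal{L}$ and $I - P$ coincide because $D^{1/2}PD^{-1/2}$ is symmetric:

```python
# Sketch: eigenvalues of the normalized Laplacian vs the random walk matrix.
import numpy as np

A = np.array([[0, 1, 1, 0],     # an assumed small undirected graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
P = A / d[:, None]                            # random walk matrix D^{-1} A
L = np.eye(4) - A / np.sqrt(np.outer(d, d))   # I - D^{-1/2} A D^{-1/2}

print(np.sort(np.linalg.eigvalsh(L)))           # 0 = lam_1 <= ... <= 2
print(np.sort(1 - np.linalg.eigvals(P).real))   # same spectrum
```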

FF.2 Graph-Based ML and Markov Chains

Graph attention networks: GAT attention coefficients $e_{ij} = a(\mathbf{W}h_i, \mathbf{W}h_j)$ (learnable) create a data-dependent transition matrix over the graph. The spectral properties of this attention-weighted adjacency matrix determine the GNN's ability to propagate information.

Label propagation: semi-supervised learning via the recursion $F \leftarrow \alpha PF + (1-\alpha)Y$, where $P=D^{-1}A$ is the random walk matrix, $F$ holds the current label estimates, $Y$ the known labels, and $\alpha$ is a damping parameter. This is exactly the PageRank recursion applied to labels - convergence follows from the same Perron-Frobenius analysis.
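
A sketch of the recursion on the same assumed 4-node graph, with two seed labels; convergence is guaranteed because $\alpha P$ has spectral radius below 1:

```python
# Sketch: label propagation F <- alpha P F + (1 - alpha) Y on a toy graph.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)    # random walk matrix

Y = np.zeros((4, 2))
Y[0, 0] = 1.0                           # known label: node 0 -> class 0
Y[3, 1] = 1.0                           # known label: node 3 -> class 1

alpha, F = 0.9, Y.copy()
for _ in range(200):                    # same fixed-point iteration as PageRank
    F = alpha * (P @ F) + (1 - alpha) * Y
print(F.argmax(axis=1))                 # predicted classes for all nodes
```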

Spectral clustering: the eigenvectors of $\mathcal{L}$ for the $k$ smallest eigenvalues embed the graph into $\mathbb{R}^k$ so that well-connected clusters map to nearby points. This exploits the fact that slow-mixing chains (small spectral gap) correspond to loosely connected clusters.


Appendix GG: Long-Range Dependencies and Markov Approximations

GG.1 Finite-Order Markov Chains

A $k$th-order Markov chain satisfies:

$$P(X_n \mid X_0,\ldots,X_{n-1}) = P(X_n \mid X_{n-k},\ldots,X_{n-1})$$

This can always be reduced to a first-order chain on the augmented state $(X_{n-k+1},\ldots,X_n)$ with state space size $|\mathcal{S}|^k$.

Trade-off: higher order captures more dependencies but exponentially increases the state space. N-gram language models are $k$th-order Markov chains on the vocabulary - trigram models ($k=2$) are common in classical NLP.

For transformers: a transformer with context length $L$ generates tokens as an effective $L$th-order Markov chain. The KV cache stores the sufficient statistics (the last $L$ tokens) - first-order Markov in the augmented state space.
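
A toy sketch of the n-gram case: a bigram ($k=1$) model estimated by counting transitions over an assumed miniature corpus:

```python
# Sketch: a bigram language model as a Markov chain on the vocabulary,
# trained by counting word-to-word transitions.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()   # assumed toy corpus
counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    counts[w1][w2] += 1                 # transition count w1 -> w2

def next_word_dist(w):
    c = counts[w]
    total = sum(c.values())
    return {v: n / total for v, n in c.items()}

print(next_word_dist("the"))            # {'cat': 2/3, 'mat': 1/3}
```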

GG.2 Variable-Length Markov Chains (VLMC)

VLMCs use a context tree to specify the order dynamically: some contexts require deep history, others only shallow. The transition probability is $P(X_n \mid X_{n-1},\ldots) = P(X_n \mid \text{suffix}(X_{n-1},\ldots,X_{n-k}))$, where $k$ depends on the context.

VLMCs provide compact representations of processes with heterogeneous memory. Related to suffix trees and compressed suffix arrays - efficient data structures for sequential data.

GG.3 Hidden Non-Markovian Processes

Many real processes are non-Markovian but become Markovian on an augmented state space. Examples:

  • AR($p$) process: $X_n = \sum_{k=1}^p \phi_k X_{n-k} + \varepsilon_n$. First-order Markov on $(X_n,\ldots,X_{n-p+1})$; see the sketch after this list.
  • ARMA($p,q$): the moving-average component creates non-Markovian behaviour; the augmented state space includes latent noise terms.
  • LLM with KV cache: first-order Markov on the KV-cache state; non-Markovian on just the last token.
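
A sketch of the AR(2) case in companion form, where the stacked state $(X_n, X_{n-1})$ makes the recursion first-order Markov; the coefficients are arbitrary:

```python
# Sketch: AR(2) as a first-order Markov chain via the companion matrix.
import numpy as np

phi1, phi2, sigma = 0.5, 0.3, 1.0         # assumed AR(2) coefficients
C = np.array([[phi1, phi2],
              [1.0,  0.0 ]])              # companion matrix

rng = np.random.default_rng(0)
s = np.zeros(2)                           # stacked state (X_n, X_{n-1})
xs = []
for _ in range(5):
    # next state depends only on the current stacked state (+ fresh noise)
    s = C @ s + np.array([sigma * rng.standard_normal(), 0.0])
    xs.append(s[0])
print(xs)
```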

Appendix HH: Excursion Theory and Renewal Processes

HH.1 Renewal Theory

A renewal process is a sequence of times $T_1, T_2, \ldots$ where the differences $T_k - T_{k-1}$ are iid positive random variables (inter-renewal times). The process $N_t = \max\{k : T_k \leq t\}$ counts renewals up to time $t$.

Renewal theorem: $\lim_{t\to\infty} N_t/t = 1/\mu$ where $\mu = \mathbb{E}[T_2-T_1]$ is the mean inter-renewal time. This is the law of large numbers for renewal processes.

Connection to Markov chains: the times of visits to a fixed recurrent state $i$ form a renewal process whose inter-renewal distribution is the return time distribution. The renewal theorem gives $\pi_i = 1/\mu_i$ (mean return time formula).
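
A simulation sketch of this connection, reusing the assumed 3-state chain from the glossary example: the empirical occupancy of a state should match $\pi_i$, the reciprocal of the mean return time:

```python
# Sketch: visits to state 0 form a renewal process; occupancy -> pi_0 = 1/mu_0.
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
rng = np.random.default_rng(1)

state, visits, T = 0, 0, 100_000
for _ in range(T):
    state = rng.choice(3, p=P[state])   # simulate one step
    visits += (state == 0)

# Compare empirical occupancy of state 0 with the exact stationary mass.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))]); pi /= pi.sum()
print("empirical:", visits / T, " pi_0:", pi[0])
```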

HH.2 Excursions from a State

An excursion from state $i$ is the portion of the trajectory between two consecutive visits to $i$. Excursions are iid by the strong Markov property. The distribution of the excursion - its length, states visited, and path structure - encodes rich information about the chain.

For AI: In RLHF, the "excursions" of the LM's generation from topic to topic form a renewal-like structure. Understanding the statistics of these excursions (how long the LM stays on-topic, how it transitions between topics) is important for reward model design.


Appendix II: Summary Diagrams

MARKOV CHAIN THEORY SUMMARY
========================================================================

  CHAIN PROPERTIES                   IMPLICATIONS
  ----------------                   ------------
  Irreducible ----------------------> Every state reachable from every other
  + Positive Recurrent -------------> Stationary distribution pi exists
  + Aperiodic ----------------------> pi unique; P^n -> 1 pi^T (convergence)
  (= Ergodic)

  Reversible -----------------------> Detailed balance pi_i P_ij = pi_j P_ji
                                      Real eigenvalues; symmetric in L^2(pi)
                                      MH, Gibbs, HMC all reversible

  MIXING SPEED HIERARCHY
  -----------------------
  Spectral gap large ---------------> Fast mixing (t_mix = O(1/gap))
  Cheeger constant large -----------> Large spectral gap (Cheeger inequality)
  Non-reversible + directed --------> Potentially faster than any reversible
  HMC gradient steps ---------------> O(d^{1/4}) vs O(d) for random walk

  COMPUTING pi
  ------------
  Small N: solve pi P = pi, sum(pi) = 1 (linear system)
  Large N: power iteration pi^{k+1} = pi^{k} P (PageRank)
  Continuous space: MCMC (Metropolis-Hastings, HMC)
  Birth-death: explicit formula via detailed balance

  ML APPLICATIONS
  ---------------
  Language generation --------------> Markov chain on vocabulary (or context)
  PageRank -------------------------> Stationary dist of web graph walk
  MCMC -----------------------------> Bayesian posterior sampling
  RL (MDP) -------------------------> Policy induces Markov chain; Bellman eqn
  HMM ------------------------------> Latent Markov chain with observations
  Diffusion models -----------------> Forward = CTMC; reverse = learned CTMC
  GNNs -----------------------------> Message passing = transition matrix power

========================================================================
