Math for LLMs

Stochastic Processes

Notes

"A stochastic process is a mathematical model for a randomly evolving system. To understand modern AI is to understand how randomness unfolds over time."

  • inspired by Doob, Stochastic Processes (1953)

Overview

A stochastic process is a collection of random variables indexed by time: {Xt}tT\{X_t\}_{t \in T}. Where probability theory asks "what is the distribution of a single random outcome?", stochastic processes ask "how does a random system evolve?" This shift - from snapshots to trajectories - is essential for understanding modern AI.

The trajectory of SGD through a loss landscape is a stochastic process. The sequence of tokens generated by a language model is a stochastic process. The forward noising schedule of a diffusion model is a stochastic process. The value function estimates in reinforcement learning satisfy martingale properties. None of these can be understood through static probability alone.

This section builds the rigorous framework - filtrations, adapted processes, martingales, stopping times - and applies it to the canonical examples: random walks, Poisson processes, Brownian motion, and stationary Gaussian processes. Markov chains, which deserve their own dedicated treatment, are previewed here and developed fully in Section 07 (Markov Chains).

Prerequisites

Companion Notebooks

  • theory.ipynb - Interactive derivations: random walks, martingale OST, Poisson process, Brownian motion, OU process, diffusion models, Gaussian processes, SGD as diffusion
  • exercises.ipynb - 10 graded exercises, from martingale verification to the RL TD-error martingale and the GP posterior

Learning Objectives

After completing this section, you will be able to:

  1. Define a stochastic process and classify it by time/state space type
  2. Define filtrations and adapted processes; interpret information accumulation formally
  3. Define martingales and verify the martingale property for standard examples
  4. State and apply Doob's Optional Stopping Theorem with correct conditions
  5. Construct the Doob martingale and connect it to McDiarmid's inequality
  6. Define the Poisson process via its three characterising properties and apply superposition/thinning
  7. Define Brownian motion via Wiener's axioms and compute quadratic variation
  8. Explain the Ornstein-Uhlenbeck process and its connection to regularisation
  9. Define strict and wide-sense stationarity, ergodicity, and the Wiener-Khinchin theorem
  10. Define Gaussian processes, specify the kernel as covariance function, and compute GP posteriors
  11. Explain how diffusion models implement a forward stochastic process and connect to Brownian motion
  12. Model SGD as a stochastic process and identify where martingale/near-martingale structure appears

Table of Contents


1. Intuition and Overview

1.1 What Is a Stochastic Process?

A stochastic process is a family {Xt}tT\{X_t\}_{t \in T} of random variables defined on a common probability space (Ω,F,P)(\Omega, \mathcal{F}, P), indexed by a set TT called the index set or time set. For each fixed outcome ωΩ\omega \in \Omega, the mapping tXt(ω)t \mapsto X_t(\omega) is called a sample path or realisation of the process.

The key conceptual shift from classical probability: rather than asking "what is the distribution of XX?", we ask "how does the random system {Xt}\{X_t\} evolve over time?" This trajectory-level view captures phenomena that static distributions cannot: memory, feedback, path dependence, and accumulation of information.

Informal classification: The simplest stochastic process is a sequence of iid random variables - no memory, no dependence. More interesting processes have memory (the present depends on the past), structure (the conditional distributions have a specific form), or continuity (sample paths are almost surely continuous). The richest models - Brownian motion, diffusion processes - have all three.

For AI: Almost every quantity that evolves during training or inference is a stochastic process. The parameter vector θt\theta_t after tt steps of SGD is a stochastic process. The hidden state hth_t of an RNN at position tt is a stochastic process. The sequence of tokens (w1,w2,)(w_1, w_2, \ldots) sampled from an LLM is a stochastic process. The noise level σt\sigma_t in a diffusion model forward pass is a stochastic process. Stochastic process theory gives us the mathematical language to reason about all of these rigorously.

1.2 The Central Examples

The five canonical stochastic processes appear throughout mathematics and AI:

  1. Simple Random Walk (SRW): Sn=k=1nXkS_n = \sum_{k=1}^n X_k where XkiidUniform{1,+1}X_k \stackrel{iid}{\sim} \text{Uniform}\{-1,+1\}. Discrete time, discrete state space. Models fair gambling, particle diffusion in discrete space, and gradient noise accumulation.

  2. Poisson Process: N(t)N(t) = number of events in [0,t][0,t], with N(t)N(s)Poisson(λ(ts))N(t) - N(s) \sim \text{Poisson}(\lambda(t-s)) independent of the past. Continuous time, discrete (integer) state space. Models arrival streams, request queues, neuron spike trains.

  3. Brownian Motion (Wiener Process): BtB_t with B0=0B_0 = 0, continuous paths, BtBsN(0,ts)B_t - B_s \sim \mathcal{N}(0, t-s) independent of the past. Continuous time, continuous state space. The fundamental building block of continuous-time stochastic processes, and the forward process in diffusion models.

  4. Gaussian Process: A process where every finite collection (Xt1,,Xtn)(X_{t_1}, \ldots, X_{t_n}) is jointly Gaussian. Fully specified by its mean function μ(t)\mu(t) and covariance kernel k(s,t)k(s,t). Foundation of Gaussian process regression (Bayesian nonparametrics) and closely related to neural network functions in the infinite-width limit.

  5. Markov Chain: {Xn}\{X_n\} where P(Xn+1X0,,Xn)=P(Xn+1Xn)P(X_{n+1} | X_0,\ldots,X_n) = P(X_{n+1} | X_n) - the past matters only through the present. Discrete time, arbitrary state space. The backbone of MCMC, reinforcement learning, and language model token generation. Full treatment: Section07 Markov Chains.
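These processes can be felt concretely in a few lines of code. Below is a minimal sketch (helper names are ours, not from the companion notebooks) simulating sample paths of the first three: the simple random walk, the Poisson process via exponential inter-arrivals, and a discretised Brownian path.

```python
import math
import random

def simple_random_walk(n, seed=0):
    """Return S_0, ..., S_n for the simple random walk on Z."""
    rng = random.Random(seed)
    s, path = 0, [0]
    for _ in range(n):
        s += rng.choice((-1, 1))  # X_k ~ Uniform{-1, +1}
        path.append(s)
    return path

def poisson_arrivals(rate, horizon, seed=0):
    """Arrival times in [0, horizon] via iid Exp(rate) inter-arrivals."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return arrivals
        arrivals.append(t)

def brownian_path(n_steps, horizon, seed=0):
    """Discretised Brownian motion: independent N(0, dt) increments."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    b, path = 0.0, [0.0]
    for _ in range(n_steps):
        b += rng.gauss(0.0, math.sqrt(dt))
        path.append(b)
    return path

walk = simple_random_walk(1000)
arrivals = poisson_arrivals(rate=2.0, horizon=10.0)
bm = brownian_path(n_steps=1000, horizon=1.0)
```

Plotting any of these paths (e.g. with matplotlib) makes the qualitative differences visible: the walk moves in unit jumps, the Poisson path is a staircase, the Brownian path is continuous but rough.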

1.3 Why Processes Matter for AI

Stochastic process theory provides the theoretical foundation for several pillar areas of modern AI:

Diffusion models (DDPM, DDIM, Score SDE) define a forward process that progressively corrupts data with Gaussian noise - this is literally a discrete-time approximation to the Ornstein-Uhlenbeck SDE. The reverse process learned by the neural network is a time-reversed SDE, and training minimises a score-matching objective derived from Ito calculus.

SGD dynamics can be approximated by a continuous-time diffusion SDE: dθ=L(θ)dt+2β1dBtd\theta = -\nabla L(\theta)\,dt + \sqrt{2\beta^{-1}}dB_t where β1\beta^{-1} is the effective noise temperature proportional to the learning rate divided by batch size. This approximation (Mandt et al., 2017; Li et al., 2017) explains why SGD finds flat minima: the stationary distribution of the SDE concentrates in regions of low curvature.

Reinforcement learning uses the discounted return Gt=k=0γkrt+kG_t = \sum_{k=0}^\infty \gamma^k r_{t+k}, which is a function of the trajectory (rt,rt+1,)(r_t, r_{t+1}, \ldots) - a stochastic process. The temporal difference error δt=rt+γV(st+1)V(st)\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) forms a martingale difference sequence under the optimal value function, and TD-learning is an online martingale approximation algorithm.

Gaussian processes are used as priors over functions in Bayesian optimisation (tuning LLM hyperparameters), and the neural tangent kernel theory shows that infinite-width neural networks converge to GPs. RoPE (Rotary Position Embedding) encodes relative position using a stationary kernel - exploiting the wide-sense stationarity machinery developed in Section 7.

1.4 Historical Timeline

  Year | Contributor        | Result
  1900 | Louis Bachelier    | Modelled stock prices as a random walk (Brownian motion) in his doctoral thesis
  1905 | Albert Einstein    | Derived the diffusion equation from random molecular motion
  1923 | Norbert Wiener     | Constructed Brownian motion rigorously on path space
  1933 | Andrei Kolmogorov  | Axiomatic probability; forward/backward equations for Markov processes
  1940 | Paul Levy          | Characterised Levy processes; Levy-Khinchin representation
  1944 | Kiyosi Ito         | Developed stochastic calculus (Ito integral, Ito's lemma)
  1953 | Joseph Doob        | Martingale theory; optional stopping; convergence theorems
  1964 | Anatoliy Skorokhod | Embedding theorem; functional CLT (Donsker's theorem)
  1991 | Brockwell & Davis  | Time series analysis unified with stochastic process theory
  2020 | Ho, Jain & Abbeel  | Denoising diffusion probabilistic models (DDPM)
  2021 | Song et al.        | Score-based generative models via SDEs
  2021 | Su et al.          | RoPE: positional encoding as a stationary kernel

2. Formal Framework

2.1 Probability Spaces and Filtrations

The rigorous foundation of stochastic processes builds on probability spaces enhanced with a filtration - a formal model of accumulating information.

Definition (Filtered Probability Space). A filtered probability space is a quadruple (Ω,F,(Ft)t0,P)(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \geq 0}, P) where:

  • (Ω,F,P)(\Omega, \mathcal{F}, P) is a probability space
  • (Ft)t0(\mathcal{F}_t)_{t \geq 0} is a filtration: an increasing family of sub-σ\sigma-algebras, FsFtF\mathcal{F}_s \subseteq \mathcal{F}_t \subseteq \mathcal{F} for all sts \leq t

Intuitively, Ft\mathcal{F}_t represents the information available at time tt - everything that has been observed up to and including time tt. The filtration captures how information accumulates: knowing more over time means larger σ\sigma-algebras.

Definition (Adapted Process). A stochastic process {Xt}\{X_t\} is adapted to the filtration (Ft)(\mathcal{F}_t) if XtX_t is Ft\mathcal{F}_t-measurable for each tt. Intuitively: the value of the process at time tt is determined by information available at time tt (not the future).

The natural filtration: Given a process {Xt}\{X_t\}, the natural filtration is FtX=σ(Xs:st)\mathcal{F}_t^X = \sigma(X_s : s \leq t) - the σ\sigma-algebra generated by the process up to time tt. Every process is adapted to its natural filtration.

For AI: In online learning, Ft\mathcal{F}_t represents the set of training examples seen up to step tt. The model parameters θt\theta_t are Ft\mathcal{F}_t-measurable (they depend only on past data). A look-ahead bias - where θt\theta_t depends on future data - violates adaptedness and invalidates PAC bounds.

In a causal language model, the token at position tt is sampled conditioned on all previous tokens - this is precisely the adaptedness condition. Non-causal (bidirectional) models like BERT use a different filtration where Ft\mathcal{F}_t includes information from both past and future.

Standard conditions (usual conditions): A filtration satisfies the usual conditions if:

  1. (Ft)(\mathcal{F}_t) is right-continuous: Ft=s>tFs\mathcal{F}_t = \bigcap_{s > t} \mathcal{F}_s
  2. F0\mathcal{F}_0 contains all PP-null sets (completeness)

These technical conditions ensure stopping times and martingales behave as expected. All filtrations in this section are assumed to satisfy the usual conditions.

2.2 Classification of Stochastic Processes

Processes are classified by the structure of their time set TT and state space SS:

TAXONOMY OF STOCHASTIC PROCESSES
========================================================================

                         State Space S
                         Discrete              Continuous
         +------------+---------------------+---------------------+
  Time   | Discrete   | Markov chains       | Random walks in R^d |
  Set T  |            | HMMs                | AR(p) models        |
         |            | Galton-Watson trees |                     |
         +------------+---------------------+---------------------+
         | Continuous | Poisson process     | Brownian motion     |
         |            | Birth-death proc.   | OU process          |
         |            | Counting processes  | Diffusion SDEs      |
         +------------+---------------------+---------------------+

========================================================================

Discrete time, discrete state: Markov chains, hidden Markov models (HMMs used in speech recognition), random walks on graphs. The transition dynamics are described by a stochastic matrix.

Discrete time, continuous state: Random walks in Rd\mathbb{R}^d, AR/ARMA time series models, recurrent neural network hidden states. State space is Rd\mathbb{R}^d for some dimension dd.

Continuous time, discrete state: Poisson processes, birth-death processes, queueing models. Transitions happen at random times; between transitions the state is constant.

Continuous time, continuous state: Brownian motion, Ornstein-Uhlenbeck process, general diffusion SDEs. These are the most mathematically sophisticated and most relevant to diffusion models and SGD dynamics.

2.3 Stopping Times

A stopping time (or optional time) formalises the idea of a "decision made using only available information."

Definition (Stopping Time). A random variable τ:Ω[0,]\tau: \Omega \to [0,\infty] is a stopping time with respect to filtration (Ft)(\mathcal{F}_t) if {τt}Ft\{\tau \leq t\} \in \mathcal{F}_t for all t0t \geq 0. Equivalently, the decision "stop now" can be made using only information available at the current time.

Examples:

  • τ=5\tau = 5: constant time - trivially a stopping time
  • τ=inf{t:Xta}\tau = \inf\{t : X_t \geq a\} (first passage time): stopping time (the event {τt}={maxstXsa}\{\tau \leq t\} = \{\max_{s \leq t} X_s \geq a\} is Ft\mathcal{F}_t-measurable)
  • τ=sup{tT:Xta}\tau = \sup\{t \leq T : X_t \geq a\} (last time above level): not a stopping time (requires future knowledge)

The σ\sigma-algebra Fτ\mathcal{F}_\tau: For a stopping time τ\tau, the σ\sigma-algebra at τ\tau is Fτ={AF:A{τt}Ft for all t}\mathcal{F}_\tau = \{A \in \mathcal{F} : A \cap \{\tau \leq t\} \in \mathcal{F}_t \text{ for all } t\}. This represents "information available at the random time τ\tau."

For AI: Early stopping in neural network training is a stopping time: the rule "stop when validation loss hasn't improved for kk steps" depends only on past validation losses - it is Ft\mathcal{F}_t-measurable. In contrast, "stop at the training step that achieves minimum test loss" is NOT a stopping time - it requires knowing future test losses.
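The distinction can be made executable. A small sketch (the function name is illustrative, not from a specific library): a patience rule that scans only the prefix of validation losses, so the stop decision at step t is F_t-measurable by construction.

```python
def early_stopping_time(val_losses, patience):
    """Return the first index t at which the best validation loss so far
    has not improved for `patience` consecutive steps, or len(val_losses)
    if the rule never fires. Scans prefixes only: no look-ahead, so this
    is a stopping time with respect to the natural filtration."""
    best, since_best = float("inf"), 0
    for t, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
        if since_best >= patience:
            return t  # decision uses only val_losses[:t+1]
    return len(val_losses)

losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.69, 0.70, 0.70, 0.70, 0.70]
tau = early_stopping_time(losses, patience=3)
```

By contrast, `argmin(test_losses)` over a full run needs the entire trajectory and is not a stopping time.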

In RL, τ\tau = "the first time the agent reaches the goal state" is a stopping time. Doob's Optional Stopping Theorem (Section 3.3) characterises the distribution of MτM_\tau - the martingale value at this random time.

2.4 Sample Paths and Almost-Sure Properties

For each fixed ωΩ\omega \in \Omega, the sample path tXt(ω)t \mapsto X_t(\omega) is an ordinary function of time. The regularity of sample paths (continuity, measurability, boundedness) is crucial for the applicability of theorems.

Path properties:

  • Continuous paths: tXt(ω)t \mapsto X_t(\omega) is continuous for a.e. ω\omega. Example: Brownian motion.
  • Cadlag paths (right-continuous with left limits): Xt(ω)X_t(\omega) is right-continuous with left limits for a.e. ω\omega. Example: Poisson process. "Cadlag" abbreviates the French continu à droite, limite à gauche.
  • Caglad paths: left-continuous with right limits. Useful for predictable integrands in stochastic integration.

Why continuity matters: A process with continuous paths can be approximated by discrete-time processes as the mesh goes to zero - this is the content of Donsker's invariance principle (Section 6.3) and justifies the SDE approximation of SGD. A process with only cadlag paths (like Poisson) has discontinuities that cannot be approximated this way.

Almost-sure properties: Properties that hold with probability one (a.s.) are the gold standard. For example, Brownian motion is a.s. nowhere differentiable - every sample path is continuous but has infinite variation on every interval. This non-differentiability is responsible for the dt\sqrt{dt} scaling of noise in SDEs (Ito's formula) and explains why the noise injected by a diffusion model forward process has standard deviation \sqrt{1-\bar{\alpha}_t} rather than growing linearly in time.


3. Martingales

3.1 Definition and Intuition

The word martingale comes from a gambling strategy: double your bet after each loss. Mathematically, a martingale captures the notion of a "fair game" - the best prediction of the future value is the current value, given everything known so far.

Definition (Martingale). An adapted process {Mt}t0\{M_t\}_{t \geq 0} with E[Mt]<\mathbb{E}[|M_t|] < \infty for all tt is a martingale with respect to filtration (Ft)(\mathcal{F}_t) if:

E[MtFs]=Msfor all st\mathbb{E}[M_t \mid \mathcal{F}_s] = M_s \quad \text{for all } s \leq t

The condition says: given all information up to time ss, the best prediction of MtM_t is MsM_s. The process neither drifts up nor down in conditional expectation.

Three types:

  • Martingale: E[MtFs]=Ms\mathbb{E}[M_t | \mathcal{F}_s] = M_s - fair game
  • Submartingale: E[MtFs]Ms\mathbb{E}[M_t | \mathcal{F}_s] \geq M_s - tends upward on average (e.g., convex functions of martingales via Jensen)
  • Supermartingale: E[MtFs]Ms\mathbb{E}[M_t | \mathcal{F}_s] \leq M_s - tends downward on average

Discrete-time discrete version: {Mn}n0\{M_n\}_{n \geq 0} is a martingale if E[Mn+1M0,,Mn]=Mn\mathbb{E}[M_{n+1} | M_0,\ldots,M_n] = M_n for all nn. This is equivalent to the martingale difference condition: E[Mn+1MnFn]=0\mathbb{E}[M_{n+1} - M_n | \mathcal{F}_n] = 0.

Immediate consequence (tower property): For any sts \leq t, E[Mt]=E[E[MtFs]]=E[Ms]\mathbb{E}[M_t] = \mathbb{E}[\mathbb{E}[M_t | \mathcal{F}_s]] = \mathbb{E}[M_s]. So the mean of a martingale is constant: E[Mt]=E[M0]\mathbb{E}[M_t] = \mathbb{E}[M_0] for all tt.

3.2 Examples of Martingales

Example 1: Partial Sums of Zero-Mean iid. Let X1,X2,X_1, X_2, \ldots be iid with E[Xi]=0\mathbb{E}[X_i] = 0. Then Sn=k=1nXkS_n = \sum_{k=1}^n X_k is a martingale:

E[Sn+1Fn]=E[Sn+Xn+1Fn]=Sn+E[Xn+1]=Sn\mathbb{E}[S_{n+1} | \mathcal{F}_n] = \mathbb{E}[S_n + X_{n+1} | \mathcal{F}_n] = S_n + \mathbb{E}[X_{n+1}] = S_n

Example 2: Products of Unit-Mean iid. Let X1,X2,X_1, X_2, \ldots be iid with E[Xi]=1\mathbb{E}[X_i] = 1, Xi>0X_i > 0. Then Mn=k=1nXkM_n = \prod_{k=1}^n X_k is a martingale:

E[Mn+1Fn]=MnE[Xn+1]=Mn\mathbb{E}[M_{n+1} | \mathcal{F}_n] = M_n \cdot \mathbb{E}[X_{n+1}] = M_n

This is the likelihood ratio (Radon-Nikodym derivative) martingale - fundamental to sequential hypothesis testing and importance sampling.

Example 3: Conditional Expectations. For any integrable ZZ and filtration (Ft)(\mathcal{F}_t), the process Mt=E[ZFt]M_t = \mathbb{E}[Z | \mathcal{F}_t] is a martingale (the Doob martingale). This follows directly from the tower property: E[MtFs]=E[E[ZFt]Fs]=E[ZFs]=Ms\mathbb{E}[M_t | \mathcal{F}_s] = \mathbb{E}[\mathbb{E}[Z|\mathcal{F}_t] | \mathcal{F}_s] = \mathbb{E}[Z|\mathcal{F}_s] = M_s.

Example 4: Polya Urn. An urn contains rr red and bb blue balls. At each step, draw a ball, record its colour, return it with one extra ball of the same colour. Let RnR_n = fraction of red balls after step nn. Then RnR_n is a martingale. The limiting fraction RBeta(r,b)R_\infty \sim \text{Beta}(r, b).

Example 5: Squared SRW minus time. The simple random walk SnS_n satisfies Sn2nS_n^2 - n is a martingale (when Var(Xk)=1\text{Var}(X_k) = 1):

E[Sn+12(n+1)Fn]=E[(Sn+Xn+1)2Fn]n1=Sn2+1n1=Sn2n\mathbb{E}[S_{n+1}^2 - (n+1) | \mathcal{F}_n] = \mathbb{E}[(S_n + X_{n+1})^2 | \mathcal{F}_n] - n - 1 = S_n^2 + 1 - n - 1 = S_n^2 - n
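Both SRW martingales above can be sanity-checked by Monte Carlo using the constant-mean property E[M_n] = E[M_0] = 0. A sketch (sample sizes chosen loosely; estimates are noisy):

```python
import random

def srw_endpoint(n, rng):
    """Sample S_n for a simple random walk of n steps."""
    return sum(rng.choice((-1, 1)) for _ in range(n))

rng = random.Random(42)
n, trials = 50, 20000
samples = [srw_endpoint(n, rng) for _ in range(trials)]

mean_S = sum(samples) / trials                     # estimates E[S_n] = 0
mean_M = sum(s * s - n for s in samples) / trials  # estimates E[S_n^2 - n] = 0
```

Both averages should hover near zero, up to Monte Carlo error of a few standard errors.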

For AI: The gradient of the expected loss L(θ)\nabla L(\theta) minus the mini-batch gradient estimate L^(θ)\nabla \hat{L}(\theta) is a martingale difference - the fundamental quantity that makes SGD unbiased. The accumulation of these martingale differences over training steps determines the SGD trajectory.

3.3 Doob's Optional Stopping Theorem

Doob's Optional Stopping Theorem (OST) is perhaps the most useful result in martingale theory. It answers: "what is the expected value of a martingale at a stopping time?"

Naively: Since E[Mn]=E[M0]\mathbb{E}[M_n] = \mathbb{E}[M_0] for all nn, one might expect E[Mτ]=E[M0]\mathbb{E}[M_\tau] = \mathbb{E}[M_0] for any stopping time τ\tau. This is false in general: in the doubling strategy, the gambler's wealth is a martingale, yet stopping at the first win yields a sure profit of 1, so the expectation at the stopping time differs from M0=0M_0 = 0. The catch is that both the stopping time and the bet sizes are unbounded.

Theorem (Doob's Optional Stopping Theorem). Let {Mn}\{M_n\} be a martingale and τ\tau a stopping time. Then E[Mτ]=E[M0]\mathbb{E}[M_\tau] = \mathbb{E}[M_0] provided at least one of the following holds:

  1. τ\tau is bounded: τN\tau \leq N a.s. for some constant NN
  2. E[τ]<\mathbb{E}[\tau] < \infty and Mn+1MnC|M_{n+1} - M_n| \leq C a.s. for some constant CC
  3. The stopped process {Mnτ}\{M_{n \wedge \tau}\} is uniformly integrable; a convenient sufficient condition is E[supnMnτ]<\mathbb{E}[\sup_{n} |M_{n \wedge \tau}|] < \infty

Proof sketch (bounded case). E[Mτ]=k=0NE[Mk1τ=k]\mathbb{E}[M_\tau] = \sum_{k=0}^N \mathbb{E}[M_k \mathbf{1}_{\tau=k}]. Write Mk=MNj=kN1(Mj+1Mj)M_k = M_N - \sum_{j=k}^{N-1}(M_{j+1}-M_j), use adaptedness of {τ=k}\{\tau=k\}, and the martingale property E[Mj+1MjFj]=0\mathbb{E}[M_{j+1}-M_j | \mathcal{F}_j] = 0. Each cross term vanishes, leaving E[Mτ]=E[MN]=E[M0]\mathbb{E}[M_\tau] = \mathbb{E}[M_N] = \mathbb{E}[M_0]. \square

Application: Gambler's Ruin. Symmetric random walk SnS_n on {0,1,,N}\{0, 1, \ldots, N\} with absorbing barriers. Start at S0=kS_0 = k. Let τ=inf{n:Sn{0,N}}\tau = \inf\{n : S_n \in \{0, N\}\}.

Since SnS_n is a martingale with bounded increments (each step changes it by exactly 1) and it can be shown that E[τ]<\mathbb{E}[\tau] < \infty, condition 2 of the OST applies:

E[Sτ]=E[S0]=k\mathbb{E}[S_\tau] = \mathbb{E}[S_0] = k E[Sτ]=NP(Sτ=N)+0P(Sτ=0)=NP(win)\mathbb{E}[S_\tau] = N \cdot P(S_\tau = N) + 0 \cdot P(S_\tau = 0) = N \cdot P(\text{win}) P(win)=k/N\Rightarrow P(\text{win}) = k/N

Also: Sn2nS_n^2 - n is a martingale, so E[Sτ2τ]=k20\mathbb{E}[S_\tau^2 - \tau] = k^2 - 0, giving E[τ]=E[Sτ2]k2=(N(k/N)N+0)k2=k(Nk)\mathbb{E}[\tau] = \mathbb{E}[S_\tau^2] - k^2 = (N \cdot (k/N) \cdot N + 0) - k^2 = k(N-k).
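A quick simulation (a sketch, not from the notebooks) checks both OST predictions, P(win) = k/N and E[τ] = k(N−k):

```python
import random

def gamblers_ruin(k, N, rng):
    """Run one symmetric walk from k until it hits 0 or N.
    Returns (won, duration)."""
    s, t = k, 0
    while 0 < s < N:
        s += rng.choice((-1, 1))
        t += 1
    return s == N, t

rng = random.Random(0)
k, N, trials = 3, 10, 20000
results = [gamblers_ruin(k, N, rng) for _ in range(trials)]

p_win = sum(w for w, _ in results) / trials     # theory: k/N = 0.3
mean_tau = sum(t for _, t in results) / trials  # theory: k(N-k) = 21
```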

3.4 Doob's Martingale Inequalities

Beyond Optional Stopping, Doob proved several fundamental inequalities that control the maximum of a martingale.

Theorem (Doob's Maximal Inequality). For a non-negative submartingale {Mn}\{M_n\} and λ>0\lambda > 0:

λP ⁣(max0knMkλ)E[Mn]\lambda \cdot P\!\left(\max_{0 \leq k \leq n} M_k \geq \lambda\right) \leq \mathbb{E}[M_n]

Corollary (Doob's L2L^2 Inequality). For a martingale {Mn}\{M_n\} with E[Mn2]<\mathbb{E}[M_n^2] < \infty:

E ⁣[max0knMk2]4E[Mn2]\mathbb{E}\!\left[\max_{0 \leq k \leq n} M_k^2\right] \leq 4\mathbb{E}[M_n^2]

Theorem (Martingale Convergence Theorem, Doob 1953). Every L1L^1-bounded martingale (i.e., supnE[Mn]<\sup_n \mathbb{E}[|M_n|] < \infty) converges almost surely to a finite limit MM_\infty.

This convergence theorem is the rigorous foundation for why TD-learning and Q-learning converge under appropriate conditions - the value function updates form a bounded supermartingale under the optimal policy.

3.5 The Doob Martingale Construction

Definition. For a function f(X1,,Xn)f(X_1,\ldots,X_n) of independent random variables, define:

Mk=E[f(X1,,Xn)X1,,Xk],k=0,1,,nM_k = \mathbb{E}[f(X_1,\ldots,X_n) \mid X_1,\ldots,X_k], \quad k = 0, 1, \ldots, n

Then M0=E[f]M_0 = \mathbb{E}[f], Mn=f(X1,,Xn)M_n = f(X_1,\ldots,X_n), and {Mk}\{M_k\} is a martingale (by tower property). This is the Doob martingale for ff.

Connection to McDiarmid: The martingale differences are Dk=MkMk1D_k = M_k - M_{k-1}. If ff has bounded differences f(,xk,)f(,xk,)ck|f(\ldots,x_k,\ldots) - f(\ldots,x_k',\ldots)| \leq c_k, then each DkD_k lies a.s. in an interval of length ckc_k. Applying the Azuma-Hoeffding inequality (in its interval form, which sharpens the constant of the symmetric Dkck|D_k| \leq c_k version) to the martingale (M0,M1,,Mn)(M_0, M_1, \ldots, M_n):

P(fE[f]t)=P(MnM0t)exp ⁣(2t2kck2)P(f - \mathbb{E}[f] \geq t) = P(M_n - M_0 \geq t) \leq \exp\!\left(\frac{-2t^2}{\sum_k c_k^2}\right)

This is exactly McDiarmid's inequality - it is Azuma applied to the Doob martingale. The Doob construction thus unifies martingale theory with the concentration inequalities of Section 05.

For AI: The empirical risk R^(h)=1ni=1n(h(xi),yi)\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n \ell(h(x_i), y_i) is a function of nn independent training examples, each contributing a bounded difference ci=1/nc_i = 1/n (for a loss bounded in [0,1][0,1]). The Doob martingale Mk=E[R^(h)z1,,zk]M_k = \mathbb{E}[\hat{R}(h) | z_1,\ldots,z_k] interpolates between M0=R(h)M_0 = R(h) and Mn=R^(h)M_n = \hat{R}(h), and Azuma gives the McDiarmid generalisation bound P(R^Rt)2e2nt2P(|\hat{R}-R| \geq t) \leq 2e^{-2nt^2}.
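For a concrete instance, take f to be the empirical mean of X_1, ..., X_n ~ Uniform[0,1]. Then the Doob martingale has the closed form M_k = (X_1 + ... + X_k + (n−k)μ)/n, so each step moves by at most 1/n, the bounded-difference constant in McDiarmid. A sketch:

```python
import random

rng = random.Random(1)
n, mu = 100, 0.5  # X_i ~ Uniform[0, 1], so mu = E[X_i] = 0.5
xs = [rng.random() for _ in range(n)]

# Doob martingale M_k = E[mean(X) | X_1..X_k] = (X_1+..+X_k + (n-k)*mu)/n
M = []
partial = 0.0
for k in range(n + 1):
    M.append((partial + (n - k) * mu) / n)
    if k < n:
        partial += xs[k]

diffs = [abs(M[k + 1] - M[k]) for k in range(n)]
max_diff = max(diffs)       # each step is (X_k - mu)/n, so at most 1/n
start, end = M[0], M[-1]    # interpolates from E[f] to the realised mean
```

The path starts at the population mean 0.5 and ends at the realised empirical mean, never moving more than c_k = 1/n per step - exactly the structure Azuma exploits.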

3.6 Supermartingales and Submartingales

Supermartingales (E[Mn+1Fn]Mn\mathbb{E}[M_{n+1}|\mathcal{F}_n] \leq M_n) model processes that decrease in conditional expectation - losing bets, damped oscillators, optimisation algorithms.

Key example: For a Lyapunov function V(θ)V(\theta) in optimisation, if E[V(θn+1)θn]V(θn)αL(θn)2+noise\mathbb{E}[V(\theta_{n+1}) | \theta_n] \leq V(\theta_n) - \alpha \|\nabla L(\theta_n)\|^2 + \text{noise}, then V(θn)V(\theta_n) is a "noisy supermartingale." Under appropriate noise control, the Martingale Convergence Theorem guarantees V(θn)VV(\theta_n) \to V^* a.s., implying convergence to a critical point.

Submartingales (E[Mn+1Fn]Mn\mathbb{E}[M_{n+1}|\mathcal{F}_n] \geq M_n) arise as convex functions of martingales (by Jensen's inequality: ff convex, MnM_n martingale \Rightarrow f(Mn)f(M_n) submartingale). Examples: Mn|M_n|, Mn2M_n^2, etMne^{tM_n} (for t>0t > 0).

For AI: The training loss trajectory is typically a supermartingale in the early phases of training (decreasing in expectation per step). Plateau phases correspond to near-martingale behaviour. The Doob decomposition theorem guarantees any submartingale YnY_n can be written as Yn=Mn+AnY_n = M_n + A_n where MnM_n is a martingale and AnA_n is a predictable increasing process - this decomposition underlies variance reduction in stochastic optimisation.

3.7 ML: SGD as a Near-Martingale

Consider the SGD update: θn+1=θnηL^n(θn)\theta_{n+1} = \theta_n - \eta \nabla \hat{L}_n(\theta_n) where L^n\hat{L}_n is the mini-batch loss. Define gn=L^n(θn)L(θn)g_n = \nabla \hat{L}_n(\theta_n) - \nabla L(\theta_n) = gradient noise.

Since each mini-batch is sampled iid, E[gnθn]=0\mathbb{E}[g_n | \theta_n] = 0 - the gradient noise is a martingale difference sequence. The noise accumulation SN=n=1NgnS_N = \sum_{n=1}^N g_n forms a martingale.

The diffusion approximation (continuous-time limit): As step size η0\eta \to 0 with NN \to \infty and ηNT\eta N \to T fixed, the SGD trajectory converges in distribution to the SDE:

dθt=L(θt)dt+ηΣ(θt)dBtd\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\eta \cdot \Sigma(\theta_t)}\,dB_t

where Σ(θ)=Cov(L^(θ))\Sigma(\theta) = \text{Cov}(\nabla \hat{L}(\theta)) is the gradient covariance matrix and BtB_t is a Brownian motion in parameter space. The noise temperature η/B\eta/B (learning rate / batch size) controls the diffusion coefficient.

Variance reduction methods: SVRG (Stochastic Variance Reduced Gradient) and SAG (Stochastic Average Gradient) reduce gn\|g_n\| to make SGD closer to a true gradient descent. They convert the martingale noise gng_n from O(1)O(1) variance to O(1/n)O(1/n) variance by periodically computing the full gradient. The convergence improvement from O(1/n)O(1/\sqrt{n}) (SGD) to O(ρn)O(\rho^n) (linear convergence of SVRG) directly reflects this martingale variance reduction.
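The martingale-difference property of gradient noise is easy to verify numerically. A sketch on a simple quadratic loss (names are ours): for L(θ) = mean_i (θ − x_i)²/2 the full gradient is θ − mean(x), a mini-batch gradient is θ − mean(batch), and their difference should average to zero.

```python
import random

rng = random.Random(7)
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]
full_mean = sum(data) / len(data)

def minibatch_grad(theta, batch_size):
    """Unbiased mini-batch gradient of L(theta) = mean_i (theta - x_i)^2 / 2."""
    batch = rng.sample(data, batch_size)
    return theta - sum(batch) / batch_size

theta = 2.0
full_grad = theta - full_mean
# Gradient noise g_n = minibatch grad - full grad: a martingale difference.
noises = [minibatch_grad(theta, 32) - full_grad for _ in range(5000)]
mean_noise = sum(noises) / len(noises)  # estimates E[g_n | theta] = 0
```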


4. Discrete-Time Random Walks

4.1 Simple Random Walk

Definition. The simple random walk (SRW) on Z\mathbb{Z} is S0=0S_0 = 0, Sn=k=1nXkS_n = \sum_{k=1}^n X_k where XkiidUniform{1,+1}X_k \stackrel{iid}{\sim} \text{Uniform}\{-1, +1\} (so P(Xk=+1)=P(Xk=1)=1/2P(X_k = +1) = P(X_k = -1) = 1/2).

Distribution: P(Sn=k)=(n(n+k)/2)2nP(S_n = k) = \binom{n}{(n+k)/2} 2^{-n} for kn(mod2)k \equiv n \pmod{2}. By CLT: Sn/ndN(0,1)S_n / \sqrt{n} \xrightarrow{d} \mathcal{N}(0,1).

Key identity (Reflection Principle): For m>0m > 0 and kmk \leq m, the number of paths from (0,0)(0,0) to (n,k)(n, k) that touch or cross the level mm equals the number of all paths from (0,2m)(0, 2m) to (n,k)(n, k). This gives:

P ⁣(max0knSkm)=2P(Snm)P(Sn=m)P\!\left(\max_{0 \leq k \leq n} S_k \geq m\right) = 2P(S_n \geq m) - P(S_n = m)

for integers m>0m > 0 - approximately 2P(Snm)2P(S_n \geq m) for large nn.
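The identity can be checked by Monte Carlo; a sketch estimating both sides from the same simulated walks:

```python
import random

rng = random.Random(3)
n, m, trials = 100, 8, 40000

hit_max = 0     # counts {max_{k<=n} S_k >= m}
right = 0.0     # accumulates 2*1{S_n >= m} - 1{S_n = m}
for _ in range(trials):
    s, smax = 0, 0
    for _ in range(n):
        s += rng.choice((-1, 1))
        smax = max(smax, s)
    if smax >= m:
        hit_max += 1
    right += 2.0 * (s >= m) - (s == m)

lhs = hit_max / trials   # P(max S_k >= m)
rhs = right / trials     # 2 P(S_n >= m) - P(S_n = m)
```

Both estimates come from the same samples, so they should agree up to small Monte Carlo error.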

The arc-sine law: The last time the SRW is at 0 before time nn has a distribution that concentrates near 0 and nn, not at n/2n/2 as intuition suggests. Formally: if Ln=max{kn:Sk=0}L_n = \max\{k \leq n : S_k = 0\}, then Ln/ndArcsine(0,1)L_n/n \xrightarrow{d} \text{Arcsine}(0,1).

Biased random walk: With P(Xk=+1)=pP(X_k = +1) = p, P(Xk=1)=1pP(X_k = -1) = 1-p, the walk has drift μ=2p1\mu = 2p-1. It is a submartingale if p>1/2p > 1/2, supermartingale if p<1/2p < 1/2, martingale if p=1/2p = 1/2. The exponential tilting Mn=((1p)/p)SnM_n = ((1-p)/p)^{S_n} is a martingale for any pp.

4.2 Recurrence and Transience

Definition. A random walk is recurrent if it returns to its starting point with probability 1; transient if it escapes to infinity with positive probability.

Polya's Theorem (1921). The SRW on Zd\mathbb{Z}^d is:

  • Recurrent for d=1d = 1 and d=2d = 2 (probability 1 of returning to origin)
  • Transient for d3d \geq 3 (positive probability of never returning)

Proof idea for d=1d=1: The expected number of returns to 0 is nP(S2n=0)=n(2nn)4nn(1/πn)=\sum_{n} P(S_{2n}=0) = \sum_n \binom{2n}{n}4^{-n} \sim \sum_n (1/\sqrt{\pi n}) = \infty by Stirling. By the Markov property, the number of returns is geometric with success probability equal to the return probability; an infinite expected number of returns forces that probability to equal 1, so the walk returns infinitely often a.s.

Proof idea for d=3d=3: The local CLT gives P(S2n=0)Cn3/2P(S_{2n}=0) \sim C n^{-3/2}, so nP(S2n=0)<\sum_n P(S_{2n}=0) < \infty. By the (first) Borel-Cantelli lemma, the walk returns to the origin only finitely many times a.s.
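Polya's dichotomy is visible even in short simulations; a sketch comparing the fraction of walks that revisit the origin within a fixed horizon (the d = 1 fraction approaches 1 as the horizon grows, the d = 3 fraction stays bounded near Polya's constant ≈ 0.34):

```python
import random

def returns_to_origin(d, steps, rng):
    """True if a simple random walk on Z^d revisits 0 within `steps` steps."""
    pos = [0] * d
    for _ in range(steps):
        axis = rng.randrange(d)
        pos[axis] += rng.choice((-1, 1))
        if all(c == 0 for c in pos):
            return True
    return False

rng = random.Random(5)
trials, steps = 2000, 200
frac_d1 = sum(returns_to_origin(1, steps, rng) for _ in range(trials)) / trials
frac_d3 = sum(returns_to_origin(3, steps, rng) for _ in range(trials)) / trials
```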

For AI: In gradient descent on a loss landscape with one flat direction (rank deficiency), the gradient is zero in that direction - the component of the iterate along the flat direction performs a random walk. This walk is recurrent but unbounded: training can drift arbitrarily far in the flat direction, explaining parameter growth and the need for weight decay (which adds a restoring force, converting the free random walk into a mean-reverting, OU-like process).

4.3 Gambler's Ruin

Setup: A gambler starts with kk dollars, plays a fair game (win or lose $1 each round), and stops upon reaching NN dollars (success) or 0 dollars (ruin).

Result (via OST): Using the martingales SnS_n and Sn2nS_n^2 - n:

  • P(ruin)=1k/NP(\text{ruin}) = 1 - k/N, P(success)=k/NP(\text{success}) = k/N
  • E[τ]=k(Nk)\mathbb{E}[\tau] = k(N-k) (expected duration)

Biased gambler (p1/2p \neq 1/2): Use the exponential martingale Mn=((1p)/p)SnM_n = ((1-p)/p)^{S_n}:

P(success)=1(q/p)k1(q/p)N,q=1pP(\text{success}) = \frac{1 - (q/p)^k}{1 - (q/p)^N}, \quad q = 1-p

For AI: Gambler's ruin models early stopping in training: the training loss random-walks around a local minimum, and we stop when it either reaches a "good" threshold (success) or explodes (ruin). The expected stopping time k(Nk)k(N-k) motivates why learning rate schedules that reduce the step size as training progresses can reduce expected convergence time.

4.4 ML: Token Sequences as Random Walks

A language model generates tokens sequentially. Conditional on the context, each token selection is a draw from a distribution - making the sequence w1,w2,w_1, w_2, \ldots a stochastic process. The log-probability of a sequence:

logP(w1,,wT)=t=1TlogP(wtw1,,wt1)\log P(w_1, \ldots, w_T) = \sum_{t=1}^T \log P(w_t | w_1, \ldots, w_{t-1})

is a sum of log-probabilities - a random walk in log-space. The perplexity PPL=exp(1TtlogP(wt))\text{PPL} = \exp(-\frac{1}{T}\sum_t \log P(w_t|\cdot)) is the geometric mean of the inverse probabilities, and concentrates around eHe^{H} where HH is the true entropy rate of the language.
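The concentration of perplexity around e^H can be seen with an iid toy "language" (a sketch; a real LM has context-dependent conditionals, but the random-walk-in-log-space structure is the same):

```python
import math
import random

rng = random.Random(11)
p = {"a": 0.5, "b": 0.25, "c": 0.25}           # toy token distribution
H = -sum(q * math.log(q) for q in p.values())  # entropy rate in nats

T = 20000
tokens = rng.choices(list(p), weights=list(p.values()), k=T)
log_prob = sum(math.log(p[w]) for w in tokens)  # random walk in log-space
ppl = math.exp(-log_prob / T)                   # concentrates near e^H
```

By the law of large numbers (here: the AEP), −log_prob/T → H, so ppl → e^H as T grows.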

Surprise accumulation: The per-token surprise logP(wtw<t)-\log P(w_t | w_{<t}) is non-negative, so its running sum only increases; the martingale lives in the centred process. Under the true distribution, for a well-calibrated model the cumulative surprise minus the cumulative conditional entropy (log-loss minus its conditional mean) is a martingale. If the model is miscalibrated, this centred process drifts: a submartingale when the model is too confident (it underestimates entropy), a supermartingale when it is too diffuse. This connects language model calibration to martingale theory.


5. Poisson Processes

5.1 Definition and Characterisation

The Poisson process is the canonical continuous-time counting process - it counts the number of events in an interval.

Definition (Poisson Process). A right-continuous process {N(t)}t0\{N(t)\}_{t \geq 0} with N(0)=0N(0) = 0 is a Poisson process with rate λ>0\lambda > 0 if:

  1. Independent increments: For any 0t0<t1<<tn0 \leq t_0 < t_1 < \cdots < t_n, the increments N(t1)N(t0),N(t2)N(t1),N(t_1)-N(t_0), N(t_2)-N(t_1), \ldots are mutually independent
  2. Stationary increments: N(t+s)N(s)=dN(t)N(0)N(t+s) - N(s) \stackrel{d}{=} N(t) - N(0) for all t,s0t, s \geq 0
  3. Poisson distribution of increments: N(t)N(s)Poisson(λ(ts))N(t) - N(s) \sim \text{Poisson}(\lambda(t-s)) for t>st > s

Equivalent characterisation via inter-arrivals: If T1,T2,T_1, T_2, \ldots are the inter-arrival times between successive events, then {N(t)}\{N(t)\} is a Poisson process iff T1,T2,iidExp(λ)T_1, T_2, \ldots \stackrel{iid}{\sim} \text{Exp}(\lambda).

The small-interval (Bernoulli) limit: A Poisson process is the limit of Bernoulli processes: divide [0,T][0,T] into nn intervals of length δ=T/n\delta = T/n and place an event in each interval independently with probability λδ\lambda\delta. As nn \to \infty (so δ0\delta \to 0 with λT\lambda T fixed), this converges to a Poisson process with rate λ\lambda.

Martingale structure: N(t)λtN(t) - \lambda t is a martingale (compensated Poisson process). Its predictable quadratic variation is λt\lambda t (the pathwise quadratic variation is N(t)N(t) itself, since every jump has size 11), so (N(t)λt)2λt(N(t)-\lambda t)^2 - \lambda t is also a martingale.
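The inter-arrival characterisation gives a direct simulator, which we can use to check the Poisson marginal and the compensated martingale. A sketch with illustrative rate and horizon:

```python
import numpy as np

# Build a Poisson process from iid Exp(lam) inter-arrival gaps and
# check: E[N(t)] = Var(N(t)) = lam*t, and the compensated process
# N(t) - lam*t has mean ~ 0. Parameters are illustrative.
rng = np.random.default_rng(2)
lam, t, trials = 3.0, 5.0, 20_000

# Enough gaps that the cumulative sum surely exceeds t.
gaps = rng.exponential(1 / lam, size=(trials, int(lam * t * 4) + 50))
arrivals = gaps.cumsum(axis=1)
N_t = (arrivals <= t).sum(axis=1)          # N(t) per sample path

print(N_t.mean(), lam * t)                 # both ~ 15
print(N_t.var(), lam * t)                  # Poisson: var = mean
print((N_t - lam * t).mean())              # compensated: ~ 0
```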

5.2 Properties: Superposition and Thinning

Superposition: If N1(t)N_1(t) and N2(t)N_2(t) are independent Poisson processes with rates λ1\lambda_1 and λ2\lambda_2, then N(t)=N1(t)+N2(t)N(t) = N_1(t) + N_2(t) is a Poisson process with rate λ1+λ2\lambda_1 + \lambda_2. More generally, the superposition of kk independent Poisson processes is Poisson with rate iλi\sum_i \lambda_i.

Thinning (Coloring): Start with a Poisson process N(t)Poisson(λ)N(t) \sim \text{Poisson}(\lambda). Independently colour each arrival red with probability pp and blue with probability 1p1-p. Then the red and blue counting processes are independent Poisson processes with rates λp\lambda p and λ(1p)\lambda(1-p).

Conditioning (uniform order statistics): Given N(T)=nN(T) = n, the arrival times T1<T2<<TnT_1 < T_2 < \cdots < T_n are distributed as the order statistics of nn iid Uniform(0,T)(0,T) random variables. This is the conditional uniformity property - conditioning on the count makes the arrival times maximally spread.

Non-homogeneous Poisson process: Generalise by allowing a time-varying rate λ(t)\lambda(t): N(t)N(s)Poisson(stλ(u)du)N(t) - N(s) \sim \text{Poisson}(\int_s^t \lambda(u)\,du) with independent increments. The compensator is Λ(t)=0tλ(u)du\Lambda(t) = \int_0^t \lambda(u)\,du (the integrated intensity).
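The thinning theorem is easy to verify numerically: thinned counts should be Poisson with rate λp, and the red/blue streams should be uncorrelated. A sketch with made-up rate and colouring probability:

```python
import numpy as np

# Numerical check of thinning: colour each arrival of a rate-lam
# Poisson process red with probability p. Red counts over [0, t] are
# Poisson(lam*p*t), and red/blue counts are independent (correlation
# ~ 0). All parameter values are illustrative.
rng = np.random.default_rng(3)
lam, p, t, trials = 4.0, 0.25, 2.0, 30_000

N = rng.poisson(lam * t, size=trials)      # total arrivals in [0, t]
red = rng.binomial(N, p)                   # thin each arrival w.p. p
blue = N - red

print(red.mean(), lam * p * t)             # ~ 2
print(red.var(), lam * p * t)              # Poisson: var = mean
print(np.corrcoef(red, blue)[0, 1])        # ~ 0: independence
```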

5.3 Compound Poisson Process

Definition. Let {N(t)}\{N(t)\} be a Poisson process with rate λ\lambda and {Yk}\{Y_k\} iid random variables (independent of NN). The compound Poisson process is:

X(t)=k=1N(t)YkX(t) = \sum_{k=1}^{N(t)} Y_k

Key moments:

  • E[X(t)]=λtE[Y1]\mathbb{E}[X(t)] = \lambda t \cdot \mathbb{E}[Y_1]
  • Var(X(t))=λtE[Y12]\text{Var}(X(t)) = \lambda t \cdot \mathbb{E}[Y_1^2]
  • MGF: MX(t)(s)=exp(λt(MY1(s)1))M_{X(t)}(s) = \exp(\lambda t (M_{Y_1}(s) - 1))

For AI: The total compute consumed by serving an LLM API follows a compound Poisson model: requests arrive as a Poisson process, and each request requires a random amount of compute (the "jump size" YkY_k). Capacity planning uses compound Poisson theory to compute the required buffer size to handle peak demand with high probability.
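The moment formulas above can be checked by simulation. A sketch with hypothetical Exp(1) jump sizes (so E[Y]=1, E[Y²]=2) and illustrative rate:

```python
import numpy as np

# Compound Poisson moments: E[X(t)] = lam*t*E[Y] and
# Var(X(t)) = lam*t*E[Y^2], here with Exp(1) jumps.
rng = np.random.default_rng(4)
lam, t, trials = 2.0, 3.0, 50_000

N = rng.poisson(lam * t, size=trials)          # number of jumps
X = np.array([rng.exponential(1.0, n).sum() for n in N])

EY, EY2 = 1.0, 2.0                             # Exp(1): E[Y]=1, E[Y^2]=2
print(X.mean(), lam * t * EY)                  # ~ 6
print(X.var(), lam * t * EY2)                  # ~ 12
```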

5.4 ML: Event Streams and Token Timing

LLM inference event streams: In deployed LLMs, user requests arrive as a Poisson process with rate λ\lambda (requests/second). Under heavy load, the inter-arrival distribution determines queueing behaviour. If token generation takes μ1\mu^{-1} seconds on average, the system is stable when λ<μ\lambda < \mu (arrival rate < service rate) - an M/M/1 queue analysis using Poisson process theory.

KV-cache access patterns: In multi-head attention with prefix caching, the pattern of cache hits/misses follows a complex point process. Poisson approximation applies when individual hit probabilities are small and requests are approximately independent.

Continuous batching: Modern LLM serving (vLLM, TensorRT-LLM) uses continuous batching: new requests join an existing batch as others complete. The batch size at time tt is a birth-death process - a continuous-time Markov chain with Poisson arrivals and Exponential service times.


6. Brownian Motion

6.1 Definition and Wiener's Axioms

Brownian motion (the Wiener process) is the fundamental continuous-time continuous-path stochastic process. It is simultaneously the limit of rescaled random walks (Donsker's theorem) and the building block for all diffusion processes in modern AI.

Definition (Standard Brownian Motion). A stochastic process {Bt}t0\{B_t\}_{t \geq 0} is a standard Brownian motion (or Wiener process) if:

  1. B0=0B_0 = 0 a.s.
  2. Independent increments: For 0t0<t1<<tn0 \leq t_0 < t_1 < \cdots < t_n, the increments Bt1Bt0,Bt2Bt1,B_{t_1}-B_{t_0}, B_{t_2}-B_{t_1}, \ldots are mutually independent
  3. Gaussian increments: BtBsN(0,ts)B_t - B_s \sim \mathcal{N}(0, t-s) for all 0s<t0 \leq s < t
  4. Continuous paths: tBt(ω)t \mapsto B_t(\omega) is continuous a.s.

Existence: Wiener (1923) proved that such a process exists by constructing the Wiener measure on the space C([0,))C([0,\infty)) of continuous functions. A standard modern construction (Levy-Ciesielski) expands the path in the Haar/Schauder wavelet basis. Either way, existence is one of the foundational results of modern probability.

Finite-dimensional distributions: For 0<t1<<tn0 < t_1 < \cdots < t_n, the joint density of (Bt1,,Btn)(B_{t_1}, \ldots, B_{t_n}) is:

p(x1,,xn)=k=1n12π(tktk1)exp ⁣((xkxk1)22(tktk1))p(x_1, \ldots, x_n) = \prod_{k=1}^n \frac{1}{\sqrt{2\pi(t_k - t_{k-1})}} \exp\!\left(-\frac{(x_k - x_{k-1})^2}{2(t_k - t_{k-1})}\right)

with t0=x0=0t_0 = x_0 = 0.

Martingale properties: The following are all martingales:

  • BtB_t (mean zero, no drift)
  • Bt2tB_t^2 - t (variance process)
  • exp(sBts2t/2)\exp(sB_t - s^2t/2) for any sRs \in \mathbb{R} (exponential martingale / Doleans-Dade)

6.2 Key Properties and Quadratic Variation

Self-similarity: {cBt/c2}t0=d{Bt}t0\{cB_{t/c^2}\}_{t \geq 0} \stackrel{d}{=} \{B_t\}_{t \geq 0} for any c>0c > 0. The process "looks the same" at any scale - it is fractal.

Time inversion: {tB1/t}t>0=d{Bt}t>0\{tB_{1/t}\}_{t > 0} \stackrel{d}{=} \{B_t\}_{t > 0} (with B0=0B_0 = 0).

Non-differentiability: Brownian motion is a.s. nowhere differentiable. Formally: for almost every ω\omega, the function tBt(ω)t \mapsto B_t(\omega) is not differentiable at any point. This is why stochastic calculus requires the Ito integral - the standard Riemann-Stieltjes integral does not apply.

Quadratic variation: The key property distinguishing Brownian motion from ordinary functions is its quadratic variation:

[B]tlimP0k(BtkBtk1)2=t(in probability)[B]_t \coloneqq \lim_{|\mathcal{P}| \to 0} \sum_{k} (B_{t_k} - B_{t_{k-1}})^2 = t \quad \text{(in probability)}

where the limit is over partitions P={0=t0<t1<<tn=t}\mathcal{P} = \{0 = t_0 < t_1 < \cdots < t_n = t\} as the mesh P0|\mathcal{P}| \to 0; along nested dyadic partitions the convergence is almost sure. Informally: (dBt)2=dt(dB_t)^2 = dt. This is the foundation of Ito's lemma and the dt\sqrt{dt} scaling of stochastic noise.
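Quadratic variation is easy to see numerically: the sum of squared increments of a simulated BM path stabilises at t, while the same sum for a smooth path vanishes. A sketch on [0, 1]:

```python
import numpy as np

# Realised quadratic variation on [0, 1]: for BM it stays near t = 1
# as the partition refines; for the smooth path f(t) = t it is exactly
# n * (1/n)^2 = 1/n -> 0.
rng = np.random.default_rng(5)

for n in (100, 10_000, 1_000_000):
    dB = rng.normal(0.0, np.sqrt(1.0 / n), size=n)   # BM increments
    qv_bm = (dB ** 2).sum()                          # -> 1
    qv_smooth = n * (1.0 / n) ** 2                   # = 1/n -> 0
    print(n, qv_bm, qv_smooth)
```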

The Levy characterisation: A continuous martingale {Mt}\{M_t\} with M0=0M_0 = 0 and quadratic variation [M]t=t[M]_t = t is a standard Brownian motion. This characterises BM among all continuous martingales.

For AI: The quadratic variation formula (dBt)2=dt(dB_t)^2 = dt explains why diffusion-model noise accumulated by timestep tt scales as t\sqrt{t}, not tt. In DDPM, the forward process produces xt=αˉtx0+1αˉtϵx_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon with ϵN(0,I)\epsilon \sim \mathcal{N}(0, I), where αˉt=s=1t(1βs)\bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s) - the noise scale 1αˉt\sqrt{1-\bar{\alpha}_t} is a discretisation of BM's variance accumulation.

6.3 Brownian Motion as a Limit

Theorem (Donsker's Invariance Principle, 1951). Let X1,X2,X_1, X_2, \ldots be iid with E[Xi]=0\mathbb{E}[X_i] = 0 and E[Xi2]=σ2<\mathbb{E}[X_i^2] = \sigma^2 < \infty. Define the continuous interpolation:

Bn(t)=1σn(Snt+(ntnt)Xnt+1),t[0,1]B_n(t) = \frac{1}{\sigma\sqrt{n}}\left(S_{\lfloor nt \rfloor} + (nt - \lfloor nt \rfloor) X_{\lfloor nt \rfloor + 1}\right), \quad t \in [0,1]

Then BnBB_n \Rightarrow B (convergence in distribution on C([0,1])C([0,1]) with the uniform topology), where BB is standard Brownian motion.

This is the functional central limit theorem - it upgrades the CLT from convergence of individual random variables to convergence of entire paths. The SRW path, rescaled by 1/n1/\sqrt{n} in space and 1/n1/n in time, converges to Brownian motion.

Implications: Any result proved for Brownian motion has a discrete-time analogue for random walks. This transfers insights between the continuous-time theory (SDEs, Ito calculus) and the discrete-time practice (mini-batch SGD, RL).

6.4 Geometric Brownian Motion

Definition. A process {St}\{S_t\} satisfies geometric Brownian motion if:

dSt=μStdt+σStdBt,S0=s0>0dS_t = \mu S_t \,dt + \sigma S_t \,dB_t, \quad S_0 = s_0 > 0

By Ito's lemma applied to f(S)=logSf(S) = \log S:

dlogSt=(μσ22)dt+σdBtd\log S_t = \left(\mu - \frac{\sigma^2}{2}\right)dt + \sigma\,dB_t

St=s0exp ⁣((μσ22)t+σBt)S_t = s_0 \exp\!\left(\left(\mu - \frac{\sigma^2}{2}\right)t + \sigma B_t\right)

So StLogNormal ⁣(logs0+(μσ2/2)t,σ2t)S_t \sim \text{LogNormal}\!\left(\log s_0 + (\mu - \sigma^2/2)t,\, \sigma^2 t\right).
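The lognormal solution gives an exact simulator, and the mean E[S_t] = s0·exp(μt) follows from the lognormal MGF. A sketch with illustrative drift and volatility:

```python
import numpy as np

# Simulate GBM via the exact lognormal solution and check
# E[S_t] = s0 * exp(mu * t) and E[log S_t] = (mu - sigma^2/2) * t.
# Parameter values are illustrative.
rng = np.random.default_rng(6)
s0, mu, sigma, t, trials = 1.0, 0.05, 0.2, 1.0, 200_000

B_t = rng.normal(0.0, np.sqrt(t), size=trials)
S_t = s0 * np.exp((mu - sigma ** 2 / 2) * t + sigma * B_t)

print(S_t.mean(), s0 * np.exp(mu * t))                 # ~ 1.0513
print(np.log(S_t).mean(), (mu - sigma ** 2 / 2) * t)   # ~ 0.03
```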

For AI: GBM models multiplicative noise, which appears in layer-normalisation dynamics. When a transformer residual stream is updated as ht+1=ht+Δth_{t+1} = h_t + \Delta_t where Δt/htN(0,σ2)\Delta_t / \|h_t\| \sim \mathcal{N}(0, \sigma^2), the norm ht\|h_t\| approximately follows GBM, explaining the norm growth observed in deep transformers without proper initialisation.

6.5 Ornstein-Uhlenbeck Process

The Ornstein-Uhlenbeck (OU) process is the unique stationary Gaussian Markov process - and the theoretical foundation for diffusion model forward processes.

Definition (OU SDE). The OU process satisfies:

dXt=θ(Xtμ)dt+σdBt,θ>0dX_t = -\theta(X_t - \mu)\,dt + \sigma\,dB_t, \quad \theta > 0

The drift θ(Xtμ)-\theta(X_t - \mu) is a mean-reverting force pulling the process toward μ\mu.

Solution (via Ito's formula):

Xt=μ+(X0μ)eθt+σ0teθ(ts)dBsX_t = \mu + (X_0 - \mu)e^{-\theta t} + \sigma\int_0^t e^{-\theta(t-s)}\,dB_s

Distributions:

  • XtX0N ⁣(μ+(X0μ)eθt, σ22θ(1e2θt))X_t | X_0 \sim \mathcal{N}\!\left(\mu + (X_0-\mu)e^{-\theta t},\ \frac{\sigma^2}{2\theta}(1 - e^{-2\theta t})\right)
  • Stationary distribution: XN(μ,σ2/(2θ))X_\infty \sim \mathcal{N}(\mu, \sigma^2/(2\theta))
  • Autocorrelation: Corr(Xs,Xt)=eθts\text{Corr}(X_s, X_t) = e^{-\theta|t-s|} - exponential decay
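The conditional-Gaussian formula above doubles as an exact (discretisation-error-free) simulator, which lets us check the stationary distribution. A sketch with illustrative θ, μ, σ:

```python
import numpy as np

# Simulate the OU process with its exact Gaussian transition and check
# the stationary law N(mu, sigma^2/(2*theta)). Parameters illustrative.
rng = np.random.default_rng(7)
theta, mu, sigma, dt, steps, paths = 1.5, 0.0, 1.0, 0.05, 400, 5_000

x = np.full(paths, 3.0)                        # start far from mu
decay = np.exp(-theta * dt)
sd = np.sqrt(sigma ** 2 / (2 * theta) * (1 - np.exp(-2 * theta * dt)))
for _ in range(steps):                         # exact transition kernel
    x = mu + (x - mu) * decay + sd * rng.normal(size=paths)

print(x.mean(), mu)                            # ~ 0: mean reversion
print(x.var(), sigma ** 2 / (2 * theta))       # ~ 1/3: stationary var
```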

For AI: The DDPM forward process is q(xtx0)=N(αˉtx0,(1αˉt)I)q(x_t|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I), which matches the OU solution with θ=1/2\theta = 1/2, μ=0\mu = 0, σ2=1\sigma^2 = 1 (after time reparametrisation). The variance-preserving SDE (VP-SDE) of Song et al. (2021) makes this correspondence exact:

dx=12β(t)xdt+β(t)dBtdx = -\frac{1}{2}\beta(t)x\,dt + \sqrt{\beta(t)}\,dB_t

with αˉt=exp(120tβ(s)ds)\bar{\alpha}_t = \exp(-\frac{1}{2}\int_0^t \beta(s)\,ds). The OU process provides the continuous-time foundation for all modern diffusion models.

L2 regularisation as OU: L2 regularisation in gradient descent adds a restoring force λθ-\lambda\theta to the update, giving a discrete-time OU process for the parameters. The stationary distribution concentrates around θ=0\theta=0 with variance η/λ\propto \eta/\lambda (noise / regularisation), explaining why stronger L2 keeps parameters smaller.

6.6 ML: Diffusion Models

The forward process (data -> noise): Given clean data x0pdatax_0 \sim p_{\text{data}}, DDPM defines:

q(xtxt1)=N(xt;1βtxt1,βtI),t=1,,Tq(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I), \quad t = 1,\ldots,T

The marginal is q(xtx0)=N(αˉtx0,(1αˉt)I)q(x_t | x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I) where αˉt=s=1t(1βs)\bar{\alpha}_t = \prod_{s=1}^t(1-\beta_s). This forward process is a discretised OU process.
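The closed-form marginal can be checked by actually iterating the forward chain. A sketch assuming the standard linear schedule (β from 1e-4 to 0.02 over T = 1000 steps, the common DDPM default); the starting point x0 is illustrative:

```python
import numpy as np

# Iterate q(x_t | x_{t-1}) for the linear beta schedule and compare
# against the closed-form marginal N(sqrt(abar_T) x0, (1 - abar_T)).
rng = np.random.default_rng(8)
T, trials = 1000, 20_000
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)

x0 = 2.0
x = np.full(trials, x0)
for b in betas:                                # forward noising chain
    x = np.sqrt(1.0 - b) * x + np.sqrt(b) * rng.normal(size=trials)

print(x.mean(), np.sqrt(abar[-1]) * x0)        # ~ 0: data destroyed
print(x.var(), 1.0 - abar[-1])                 # ~ 1: pure noise
```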

The reverse process (noise -> data): By Bayes' theorem and the Gaussian Markov property:

pθ(xt1xt)=N(xt1;μθ(xt,t),Σθ(t))p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(t))

The network learns to predict the score xtlogq(xt)\nabla_{x_t}\log q(x_t) (or equivalently, the noise ϵ\epsilon), enabling the reverse process to denoise step by step.

Score matching: The training objective minimises:

L=Et,x0,ϵ ⁣[ϵϵθ(αˉtx0+1αˉtϵ,t)2]\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\!\left[\|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon, t)\|^2\right]

This is equivalent to minimising Fisher divergence between pθp_\theta and qq, connecting to information-geometric stochastic process theory.

Continuous-time perspective (Score SDE, Song et al. 2021): The continuous limit is:

  • Forward SDE: dx=f(x,t)dt+g(t)dBtdx = f(x,t)\,dt + g(t)\,dB_t
  • Reverse SDE: dx=[f(x,t)g(t)2xlogpt(x)]dt+g(t)dBˉtdx = [f(x,t) - g(t)^2\nabla_x\log p_t(x)]\,dt + g(t)\,d\bar{B}_t

The neural network approximates xlogpt(x)\nabla_x\log p_t(x) - the score function - enabling the SDE to run backward from noise to data. This is the unified framework behind DDPM (discrete VP-SDE), DDIM (deterministic ODE), and Consistency Models.


7. Stationary Processes and Ergodicity

7.1 Strict and Wide-Sense Stationarity

Definition (Strict Stationarity). A stochastic process {Xt}\{X_t\} is strictly stationary if for any nn, any times t1,,tnt_1, \ldots, t_n, and any shift hh:

(Xt1+h,,Xtn+h)=d(Xt1,,Xtn)(X_{t_1+h}, \ldots, X_{t_n+h}) \stackrel{d}{=} (X_{t_1}, \ldots, X_{t_n})

The finite-dimensional distributions are invariant under time shifts.

Definition (Wide-Sense Stationarity, WSS). A process {Xt}\{X_t\} is wide-sense stationary (or second-order stationary) if:

  1. E[Xt]=μ\mathbb{E}[X_t] = \mu (constant mean)
  2. Cov(Xt,Xs)=k(ts)\text{Cov}(X_t, X_s) = k(t-s) depends only on the lag τ=ts\tau = t-s

Examples:

  • Strictly stationary but not WSS: iid Cauchy noise (tt-distribution with 1 DOF) - strictly stationary, but the variance does not exist, so WSS is undefined
  • WSS but not strictly stationary: an independent sequence with constant mean 00 and variance 11 but time-varying marginal shape (e.g. Gaussian at even nn, centred exponential at odd nn) - the ACVF k(τ)=1[τ=0]k(\tau) = \mathbf{1}[\tau=0] is stationary, but the finite-dimensional distributions are not. (Beware: Xt=Acos(ωt)+Bsin(ωt)X_t = A\cos(\omega t) + B\sin(\omega t) with A,BN(0,σ2)A,B \sim \mathcal{N}(0,\sigma^2) iid is a Gaussian process with stationary kernel k(τ)=σ2cos(ωτ)k(\tau) = \sigma^2\cos(\omega\tau), hence strictly stationary as well)
  • Both: Gaussian processes with stationary kernels; iid sequences

The autocovariance function k(τ)k(\tau) for a WSS process satisfies:

  • Symmetry: k(τ)=k(τ)k(\tau) = k(-\tau) (real process)
  • Positive semidefiniteness: i,jaiajk(titj)0\sum_{i,j} a_i a_j k(t_i - t_j) \geq 0 for all finite sequences

For AI: A WSS process is fully characterised by μ\mu and k(τ)k(\tau) - enormously parsimonious compared to the general joint distribution. This is why WSS is the right model for relative position encodings in transformers.

7.2 Ergodicity and the Ergodic Theorem

Intuition: A process is ergodic if "time averages equal ensemble averages" - a single long trajectory captures the full statistical behaviour of the process.

Theorem (Birkhoff's Ergodic Theorem, 1931). For a stationary ergodic process {Xt}\{X_t\} with E[X0]<\mathbb{E}[|X_0|] < \infty:

1T0TXtdta.s.E[X0]as T\frac{1}{T}\int_0^T X_t\,dt \xrightarrow{a.s.} \mathbb{E}[X_0] \quad \text{as } T \to \infty

For discrete time: 1Nn=0N1Xna.s.E[X0]\frac{1}{N}\sum_{n=0}^{N-1} X_n \xrightarrow{a.s.} \mathbb{E}[X_0].

What "ergodic" means formally: A stationary process is ergodic if the only time-invariant events have probability 0 or 1 (the σ\sigma-algebra of invariant events is trivial). For a WSS process, a convenient sufficient condition for the time average of the mean to converge is decay of correlations: 1Nn=0N1Cov(X0,Xn)0\frac{1}{N}\sum_{n=0}^{N-1}\text{Cov}(X_0, X_n) \to 0.

Non-ergodic example: Xt=ZX_t = Z for all tt, where ZN(0,1)Z \sim \mathcal{N}(0,1) is fixed once and for all. This is stationary (distribution constant) but not ergodic (time average =ZE[Z]=0= Z \neq \mathbb{E}[Z] = 0 for any realisation).
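The contrast between the ergodic and non-ergodic cases is visible in simulation: time averages converge to the ensemble mean only in the first case. A sketch using an AR(1) process (ergodic, geometrically decaying correlations) against the frozen process X_t = Z from the text:

```python
import numpy as np

# Ergodic vs non-ergodic: time averages of an AR(1) path converge to
# the ensemble mean 0; the frozen process X_t = Z sticks at Z.
rng = np.random.default_rng(9)
N, phi = 200_000, 0.9

# Ergodic: stationary AR(1) with geometric correlation decay
eps = rng.normal(size=N)
x = np.zeros(N)
for n in range(1, N):
    x[n] = phi * x[n - 1] + eps[n]
print(x.mean())                  # ~ 0 = ensemble mean

# Non-ergodic: X_t = Z frozen once and for all
z = rng.normal()
y = np.full(N, z)
print(y.mean(), z)               # time average equals Z, not E[Z] = 0
```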

For AI: SGD convergence analysis assumes that the gradient noise process is ergodic - the time average of gradient variance equals the ensemble average. If the data distribution shifts during training (non-stationarity), this fails. Continual learning methods address the resulting catastrophic forgetting, which is fundamentally a violation of the stationarity assumption.

7.3 Autocorrelation and Power Spectral Density

Definition. The autocovariance function (ACVF) is k(τ)=Cov(Xt,Xt+τ)k(\tau) = \text{Cov}(X_t, X_{t+\tau}). The autocorrelation function (ACF) is ρ(τ)=k(τ)/k(0)\rho(\tau) = k(\tau)/k(0), normalised to [1,1][-1,1].

Examples of ACFs:

  • iid white noise: k(τ)=σ21[τ=0]k(\tau) = \sigma^2 \mathbf{1}[\tau=0]
  • AR(1) process (Xt=ϕXt1+ϵtX_t = \phi X_{t-1} + \epsilon_t, ϕ<1|\phi|<1): k(τ)=σ2ϕτ/(1ϕ2)k(\tau) = \sigma^2\phi^{|\tau|}/(1-\phi^2) - geometric decay
  • OU process: k(τ)=(σ2/2θ)eθτk(\tau) = (\sigma^2/2\theta)e^{-\theta|\tau|} - continuous-time geometric decay
  • Seasonal process: periodic ACF

Theorem (Wiener-Khinchin, 1934). For a WSS process with absolutely summable autocovariance:

S(ω)=τ=k(τ)eiωτ=F[k](ω)S(\omega) = \sum_{\tau=-\infty}^\infty k(\tau)e^{-i\omega\tau} = \mathcal{F}[k](\omega)

The power spectral density (PSD) S(ω)S(\omega) is the Fourier transform of k(τ)k(\tau) and is non-negative: S(ω)0S(\omega) \geq 0 for all ω\omega.

Conversely, any non-negative integrable function S(ω)S(\omega) is the PSD of some WSS process (Bochner's theorem). PSD decomposes the variance of a process by frequency: low-frequency components explain slow variation, high-frequency components explain rapid fluctuation.
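Wiener-Khinchin can be checked directly for the AR(1) ACVF from the list above, whose spectrum is also known in closed form (the standard AR(1) result S(ω) = σ²/|1 − φe^{−iω}|²). A sketch with an illustrative φ:

```python
import numpy as np

# Wiener-Khinchin for AR(1): the Fourier transform of the ACVF
# k(tau) = sigma^2 * phi^|tau| / (1 - phi^2) is real, non-negative,
# and matches the closed-form spectrum sigma^2 / |1 - phi e^{-iw}|^2.
phi, sigma2 = 0.7, 1.0
taus = np.arange(-200, 201)                # truncated ACVF support
k = sigma2 * phi ** np.abs(taus) / (1 - phi ** 2)

omegas = np.linspace(-np.pi, np.pi, 101)
# S(w) = sum_tau k(tau) e^{-i w tau}
S = np.array([(k * np.exp(-1j * w * taus)).sum().real for w in omegas])
S_closed = sigma2 / np.abs(1 - phi * np.exp(-1j * omegas)) ** 2

print(S.min())                             # > 0: PSD is non-negative
print(np.max(np.abs(S - S_closed)))        # ~ 0 (tiny truncation error)
```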

For AI: The attention mechanism computes softmax(QK/dk)V\text{softmax}(QK^\top/\sqrt{d_k})V. For a stationary input sequence, Qi=WQxiQ_i = W_Q x_i and Kj=WKxjK_j = W_K x_j, so the attention score between positions ii and jj depends (via the input autocorrelation) only on ij|i-j| - the relative position. This is the stationarity condition that relative position encodings (ALiBi, T5 bias, RoPE) exploit.

7.4 Gaussian Processes

Definition. A stochastic process {f(t)}tT\{f(t)\}_{t \in T} is a Gaussian process (GP) if every finite collection (f(t1),,f(tn))(f(t_1), \ldots, f(t_n)) is jointly Gaussian. A GP is fully specified by:

  • Mean function: m(t)=E[f(t)]m(t) = \mathbb{E}[f(t)]
  • Covariance kernel: k(s,t)=Cov(f(s),f(t))k(s,t) = \text{Cov}(f(s), f(t))

Notation: fGP(m,k)f \sim \mathcal{GP}(m, k).

The kernel determines everything: Properties of the GP sample paths are determined by kk:

  • Smoothness: the Matern-ν\nu kernel controls differentiability (sample paths are kk times differentiable iff ν>k\nu > k, i.e. ⌈ν⌉−1\lceil\nu\rceil - 1 times differentiable)
  • Length scale: k(s,t)=est2/(22)k(s,t) = e^{-|s-t|^2/(2\ell^2)} (RBF/squared-exponential) has length scale \ell
  • Periodicity: k(s,t)=σ2cos(2πst/p)k(s,t) = \sigma^2\cos(2\pi|s-t|/p) for periodic processes

Common kernels:

KernelFormulaProperties
RBF (squared-exponential)σ2exp(xx2/(22))\sigma^2\exp(-\|x-x'\|^2/(2\ell^2))Infinitely differentiable
Matern-1/2σ2exp(xx/)\sigma^2\exp(-\|x-x'\|/\ell)Continuous but not diff. (= OU)
Matern-3/2σ2(1+3r/)exp(3r/)\sigma^2(1+\sqrt{3}r/\ell)\exp(-\sqrt{3}r/\ell)Once differentiable
Periodicσ2exp(2sin2(πxx/p)/2)\sigma^2\exp(-2\sin^2(\pi\|x-x'\|/p)/\ell^2)Exactly periodic
Linearσ2xx\sigma^2 x^\top x'Bayesian linear regression

GP Posterior (Bayesian inference): Given observations y=f(t)+ϵ\mathbf{y} = f(\mathbf{t}) + \epsilon, ϵN(0,σn2I)\epsilon \sim \mathcal{N}(0, \sigma_n^2 I):

ft,t,yN(μ,Σ)f_* | \mathbf{t}_*, \mathbf{t}, \mathbf{y} \sim \mathcal{N}(\mu_*, \Sigma_*)

μ=m(t)+k(t,t)[k(t,t)+σn2I]1(ym(t))\mu_* = m(\mathbf{t}_*) + k(\mathbf{t}_*, \mathbf{t})[k(\mathbf{t},\mathbf{t}) + \sigma_n^2 I]^{-1}(\mathbf{y} - m(\mathbf{t}))

Σ=k(t,t)k(t,t)[k(t,t)+σn2I]1k(t,t)\Sigma_* = k(\mathbf{t}_*, \mathbf{t}_*) - k(\mathbf{t}_*, \mathbf{t})[k(\mathbf{t},\mathbf{t}) + \sigma_n^2 I]^{-1}k(\mathbf{t}, \mathbf{t}_*)

This is exact Bayesian inference in O(n3)O(n^3) time (from the Cholesky decomposition of the n×nn \times n kernel matrix).
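The posterior formulas above are a few lines of linear algebra. A minimal sketch with an RBF kernel and zero mean function; the data (a sine curve), length scale, and jitter are all illustrative:

```python
import numpy as np

# Exact GP posterior via Cholesky: zero mean, RBF kernel, noise-free
# observations of sin(2*pi*t) with a small jitter for stability.
def rbf(a, b, ell=0.2, s2=1.0):
    """Squared-exponential kernel matrix for 1-D inputs."""
    return s2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

t = np.linspace(0, 1, 8)                   # training inputs
y = np.sin(2 * np.pi * t)                  # observations
t_star = np.array([0.33])                  # test input
sn2 = 1e-6                                 # jitter / noise variance

K = rbf(t, t) + sn2 * np.eye(len(t))
L = np.linalg.cholesky(K)                  # O(n^3) step
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

k_star = rbf(t_star, t)                    # (1, n) cross-covariances
mu_star = k_star @ alpha                   # posterior mean
v = np.linalg.solve(L, k_star.T)
var_star = rbf(t_star, t_star) - v.T @ v   # posterior covariance

print(mu_star[0], np.sin(2 * np.pi * 0.33))  # close: GP interpolates
print(var_star[0, 0])                        # small near the data
```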

For AI: Bayesian optimisation for hyperparameter search uses a GP prior over the loss surface. The acquisition function (UCB, EI) determines where to evaluate next, and the GP posterior updates after each observation. This is the principled method for tuning LLM training hyperparameters (η\eta, β1\beta_1, β2\beta_2, etc.) when each evaluation is expensive.

Neural tangent kernel (NTK): An infinite-width neural network at initialisation is a GP with kernel determined by the architecture. During gradient descent with small learning rate, the network stays approximately GP (lazy training regime). The NTK Θ(x,x)=θf(x;θ)θf(x;θ)\Theta(x,x') = \nabla_\theta f(x;\theta)^\top \nabla_\theta f(x';\theta) determines the learning dynamics via the Gram matrix Θ(X,X)\Theta(X,X).

7.5 ML: RoPE and Stationarity in Attention

Wide-sense stationarity of attention: For a sequence of token embeddings {xt}\{x_t\}, the attention matrix Aij=softmax(qikjd)A_{ij} = \text{softmax}(\frac{q_i^\top k_j}{\sqrt{d}}) can be made to depend only on the relative position iji-j if we choose position encoding appropriately. This is equivalent to making the QQ-KK inner product a stationary kernel.

RoPE (Rotary Position Embedding, Su et al. 2021): RoPE encodes position tt by rotating the query and key vectors:

Rtq,Rtk,where Rt is a block-diagonal rotation matrixR_t\mathbf{q}, \quad R_t\mathbf{k}, \quad \text{where } R_t \text{ is a block-diagonal rotation matrix}

The inner product satisfies:

(Rmq)(Rnk)=qRmRnk=qRnmk(R_m\mathbf{q})^\top(R_n\mathbf{k}) = \mathbf{q}^\top R_m^\top R_n \mathbf{k} = \mathbf{q}^\top R_{n-m}\mathbf{k}

which depends only on mnm-n - exactly the stationarity condition! RoPE makes attention score a stationary kernel in position, enabling efficient relative position encoding.
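The relative-position identity is a two-line check for a single 2-D RoPE block (rotation by position × frequency; the frequency w=0.5 and the vectors are made up):

```python
import numpy as np

# Single 2-D RoPE block: R_t rotates by t*w. The rotated inner product
# depends only on the offset m - n: two pairs with the same offset
# give the same score.
def rot(t, w=0.5):
    c, s = np.cos(t * w), np.sin(t * w)
    return np.array([[c, -s], [s, c]])

q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])
score = lambda m, n: (rot(m) @ q) @ (rot(n) @ k)

print(score(3, 1), score(10, 8))            # equal: same offset 2
print(np.isclose(score(3, 1), score(10, 8)))
```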

ALiBi and linear bias: ALiBi (Press et al., 2022) adds mij-m|i-j| to each attention logit where mm is a per-head slope. The resulting attention scores are of the form f(qi,kj)mijf(q_i, k_j) - m|i-j|, where the mij-m|i-j| term is a stationary (translation-invariant) bias. This enables extrapolation to longer sequences - a key advantage when the training distribution has stationarity structure.


8. Preview: Markov Chains

8.1 The Markov Property

A stochastic process {Xn}\{X_n\} satisfies the Markov property if:

P(Xn+1=jX0,X1,,Xn)=P(Xn+1=jXn)n,jP(X_{n+1} = j \mid X_0, X_1, \ldots, X_n) = P(X_{n+1} = j \mid X_n) \quad \forall n, j

The past is irrelevant - the present state XnX_n is a sufficient statistic for the future. Markov chains are the simplest non-trivial stochastic processes with memory.

Markov chains fit within the stochastic process framework developed here:

  • They are adapted to their natural filtration Fn=σ(X0,,Xn)\mathcal{F}_n = \sigma(X_0,\ldots,X_n)
  • The transition function P(Xn+1=jXn=i)=PijP(X_{n+1}=j|X_n=i) = P_{ij} defines a stochastic matrix
  • Markov chains generate martingales: f(Xn)f(X_n) is a martingale iff ff is harmonic for the transition operator (Pf=fPf = f)

8.2 Forward Reference

Full treatment: Section07 Markov Chains

That section covers: transition matrices, invariant/stationary distributions, detailed balance, recurrence/transience (Markov chain version), mixing times, spectral gap, Perron-Frobenius theorem, reversibility, MCMC (Metropolis-Hastings, Gibbs), and ML applications in language model generation, RL policy evaluation, and PageRank.


9. ML Deep Dive

9.1 SGD Trajectory as Stochastic Process

The parameter sequence {θn}n0\{\theta_n\}_{n \geq 0} produced by SGD is a discrete-time stochastic process with values in Rp\mathbb{R}^p. Understanding its properties as a process - not just its endpoint - reveals why deep learning generalises.

The update rule as a random dynamical system:

θn+1=θnηng~n(θn),g~n(θ)=1BiBn(θ;zi)\theta_{n+1} = \theta_n - \eta_n \tilde{g}_n(\theta_n), \quad \tilde{g}_n(\theta) = \frac{1}{B}\sum_{i \in \mathcal{B}_n}\nabla\ell(\theta; z_i)

where Bn\mathcal{B}_n is a random mini-batch drawn at step nn.

Key stochastic process properties:

  • θn\theta_n is adapted to Fn=σ(B1,,Bn)\mathcal{F}_n = \sigma(\mathcal{B}_1,\ldots,\mathcal{B}_n)
  • E[g~n(θn)Fn1]=L(θn)\mathbb{E}[\tilde{g}_n(\theta_n) | \mathcal{F}_{n-1}] = \nabla L(\theta_n) - unbiasedness means the noise is a martingale difference
  • g~nL(θn)\tilde{g}_n - \nabla L(\theta_n) is a martingale difference sequence

The diffusion approximation (Mandt et al., 2017): Under small step size and local quadratic approximation L(θ)12(θθ)H(θθ)L(\theta) \approx \frac{1}{2}(\theta-\theta^*)^\top H (\theta-\theta^*), the SGD dynamics converge to:

dθ=Hθdt+ηΣdBtd\theta = -H\theta\,dt + \sqrt{\eta \Sigma}\,dB_t

This is a multivariate OU process with mean-reversion matrix HH and noise ηΣ\sqrt{\eta\Sigma}. The stationary distribution is:

θN ⁣(θ, η2ΣH1)\theta_\infty \sim \mathcal{N}\!\left(\theta^*,\ \frac{\eta}{2}\Sigma H^{-1}\right)

Lower learning rate η\eta -> smaller stationary variance -> SGD concentrates more tightly around θ\theta^*. The ratio η/B\eta/B determines the noise temperature.

Flat minima and generalisation: The stationary distribution of the SGD SDE concentrates in regions of small Hessian HH (flat minima). Fisher-Rao information connects Σ\Sigma to the curvature, and the entropy of the stationary distribution logdet(H1)\propto \log\det(H^{-1}) is large for flat minima. This provides a stochastic process explanation for why SGD with large learning rates prefers flat minima that generalise well.
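The OU picture of SGD can be checked in one dimension. A sketch with a quadratic loss and artificial Gaussian gradient noise (h, Sigma, eta are all made-up values; this is the scalar case of the stationary covariance above):

```python
import numpy as np

# 1-D quadratic loss L = h*theta^2/2 with injected Gaussian gradient
# noise of variance Sigma: SGD is a discrete OU process with long-run
# variance ~ eta * Sigma / (2 * h). Parameters are illustrative.
rng = np.random.default_rng(11)
h, Sigma, eta, steps, chains = 1.0, 4.0, 0.01, 5_000, 2_000

theta = np.full(chains, 1.0)
for _ in range(steps):
    grad = h * theta + np.sqrt(Sigma) * rng.normal(size=chains)
    theta = theta - eta * grad             # plain SGD update

print(theta.mean())                        # ~ 0: concentrates at theta*
print(theta.var(), eta * Sigma / (2 * h))  # ~ 0.02: OU stationary var
```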

9.2 RL Value Functions as Martingales

In reinforcement learning, the agent at time tt is in state sts_t, takes action ata_t, receives reward rtr_t, and transitions to st+1s_{t+1}. The discounted return is:

Gt=k=0γkrt+kG_t = \sum_{k=0}^\infty \gamma^k r_{t+k}

The Bellman equation: Under a fixed policy π\pi, the value function Vπ(s)=E[Gtst=s]V^\pi(s) = \mathbb{E}[G_t | s_t = s] satisfies:

Vπ(s)=E[rt+γVπ(st+1)st=s]V^\pi(s) = \mathbb{E}[r_t + \gamma V^\pi(s_{t+1}) | s_t = s]

Martingale characterisation: Define Zt=k=0t1γkrk+γtVπ(st)Z_t = \sum_{k=0}^{t-1}\gamma^k r_k + \gamma^t V^\pi(s_t). Under the true VπV^\pi:

E[Zt+1Ft]=E ⁣[γtrt+γt+1Vπ(st+1)+k=0t1γkrkFt]=Zt\mathbb{E}[Z_{t+1} | \mathcal{F}_t] = \mathbb{E}\!\left[\gamma^t r_t + \gamma^{t+1} V^\pi(s_{t+1}) + \sum_{k=0}^{t-1}\gamma^k r_k \,\Big|\, \mathcal{F}_t\right] = Z_t

So {Zt}\{Z_t\} is a martingale! The TD error δt=rt+γVπ(st+1)Vπ(st)\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) is a martingale difference: E[δtFt]=0\mathbb{E}[\delta_t | \mathcal{F}_t] = 0. TD-learning converges because it performs online martingale approximation - stochastic approximation theory guarantees convergence of martingale difference algorithms under Robbins-Monro conditions.
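The martingale-difference property of the TD error is easy to verify on a toy Markov reward process where V^π is available in closed form from the Bellman equation. The transition matrix, rewards, and discount below are made up:

```python
import numpy as np

# Two-state Markov reward process: solve V = r + gamma * P V exactly,
# then check that TD errors under the true V average to ~ 0.
rng = np.random.default_rng(12)
gamma = 0.9
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])                 # transition matrix
r = np.array([1.0, -1.0])                  # reward of current state

V = np.linalg.solve(np.eye(2) - gamma * P, r)   # Bellman solution

s, deltas = 0, []
for _ in range(100_000):
    s_next = rng.choice(2, p=P[s])
    deltas.append(r[s] + gamma * V[s_next] - V[s])   # TD error
    s = s_next

print(np.mean(deltas))                     # ~ 0: martingale difference
```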

REINFORCE gradient: The policy gradient estimator g^=Gtlogπθ(atst)\hat{g} = G_t \nabla\log\pi_\theta(a_t|s_t) is an unbiased estimator of J(θ)\nabla J(\theta) - it is a martingale difference. The baseline b(st)b(s_t) reduces variance without introducing bias: E[(Gtb(st))logπst]=Jb(st)0=J\mathbb{E}[(G_t - b(s_t))\nabla\log\pi | s_t] = \nabla J - b(s_t)\cdot 0 = \nabla J since E[logπst]=0\mathbb{E}[\nabla\log\pi|s_t] = 0.

9.3 Attention over Sequences

Attention can be viewed as a stochastic process operating over the token sequence:

Token sequence as discrete-time process: (w1,w2,,wT)(w_1, w_2, \ldots, w_T) with embedding xt=Embed(wt)x_t = \text{Embed}(w_t). The attention mechanism at layer \ell computes:

ht()=s=1Tαts()vs(),αts=softmax(qtksdk)h_t^{(\ell)} = \sum_{s=1}^T \alpha_{ts}^{(\ell)} v_s^{(\ell)}, \quad \alpha_{ts} = \text{softmax}\left(\frac{q_t^\top k_s}{\sqrt{d_k}}\right)

The hidden state ht()h_t^{(\ell)} is a weighted average of all positions - a generalisation of the conditional expectation E[Vquery=q]\mathbb{E}[V | \text{query} = q].

Causal attention as filtration: In autoregressive generation, causal masking ensures hth_t depends only on h1,,hth_1, \ldots, h_t - maintaining the adaptedness condition to the natural filtration of the sequence. Bidirectional attention (BERT) uses a "global" σ\sigma-algebra that is not filtered.

KV-cache as path memory: The key-value cache stores {ks,vs}st\{k_s, v_s\}_{s \leq t} - the full path of the process up to position tt. This is the computational realisation of the filtration Ft\mathcal{F}_t: at each new token, the model has access to the complete history.

9.4 Diffusion Model Schedules

Different noise schedules in diffusion models correspond to different continuous-time SDEs:

ScheduleSDEMarginal
Linear (DDPM)dx=β(t)2xdt+β(t)dBtdx = -\frac{\beta(t)}{2}x\,dt + \sqrt{\beta(t)}\,dB_tN(αˉtx0,(1αˉt)I)\mathcal{N}(\sqrt{\bar\alpha_t}x_0, (1-\bar\alpha_t)I)
CosineSame SDE, different β(t)\beta(t)Smoother transition
VE-SDEdx=dσ2(t)/dtdBtdx = \sqrt{d\sigma^2(t)/dt}\,dB_tN(x0,[σ2(t)σ2(0)]I)\mathcal{N}(x_0, [\sigma^2(t)-\sigma^2(0)]I)
SubVP-SDEdx=12β(t)xdt+β(t)(1e20tβ)dBtdx = -\frac{1}{2}\beta(t)x\,dt + \sqrt{\beta(t)(1-e^{-2\int_0^t\beta})}\,dB_tTighter bounds

The reverse SDE (Anderson, 1982) is:

dx=[f(x,t)g(t)2xlogpt(x)]dt+g(t)dBˉtdx = \left[f(x,t) - g(t)^2\nabla_x\log p_t(x)\right]dt + g(t)\,d\bar{B}_t

where Bˉt\bar{B}_t is a backward Brownian motion. The neural network sθ(x,t)xlogpt(x)s_\theta(x,t) \approx \nabla_x\log p_t(x) is trained to estimate the score function, enabling the reverse process to convert noise into data.

Consistency Models (Song et al., 2023) learn to map any point on the reverse SDE trajectory directly to the final sample, bypassing the iterative nature of full diffusion. The key insight uses probability flow ODE theory - replacing the stochastic reverse SDE with a deterministic ODE that has the same marginals.


10. Common Mistakes

#MistakeWhy It's WrongFix
1"Any increasing family of sets is a filtration"Filtration requires σ\sigma-algebras, not just sets. A family of sets AtA_t with AsAtA_s \subset A_t has no algebraic structure.Verify each Ft\mathcal{F}_t is a σ\sigma-algebra (closed under complements and countable unions) and FsFt\mathcal{F}_s \subseteq \mathcal{F}_t for sts \leq t.
2"If E[Mn]=c\mathbb{E}[M_n] = c for all nn, then {Mn}\{M_n\} is a martingale"Constant mean is necessary but not sufficient. Counterexample: Mn=(1)nZM_n = (-1)^n Z with ZZ Rademacher has E[Mn]=0\mathbb{E}[M_n] = 0 for all nn, but E[Mn+1Fn]=MnMn\mathbb{E}[M_{n+1}|\mathcal{F}_n] = -M_n \neq M_n.Verify the defining condition E[Mn+1Fn]=Mn\mathbb{E}[M_{n+1}|\mathcal{F}_n] = M_n, not just constant expectations.
3"OST applies to any stopping time"Without boundedness or uniform integrability, E[Mτ]E[M0]\mathbb{E}[M_\tau] \neq \mathbb{E}[M_0] in general. For SRW, τ\tau = first hitting time of +1+1 has E[τ]=\mathbb{E}[\tau] = \infty and E[Sτ]=10\mathbb{E}[S_\tau] = 1 \neq 0; the doubling strategy fails similarly through unbounded bet sizes.Verify one of the three OST conditions: τ\tau bounded, E[τ]<\mathbb{E}[\tau]<\infty + bounded differences, or UI.
4"Brownian motion is differentiable since it's continuous"Continuity does not imply differentiability. BM is a.s. nowhere differentiable (Paley-Wiener-Zygmund, 1933). Its paths have infinite variation on any interval.Use Ito calculus (dBt2=dtdB_t^2 = dt) for BM integrals, not classical Riemann-Stieltjes.
5"The Poisson process has stationary paths"Stationarity of increments \neq stationarity of the process itself. N(t)Poisson(λt)N(t) \sim \text{Poisson}(\lambda t) - the marginal distribution changes with tt; N(t)N(t) is not stationary.The compensated process N(t)λtN(t) - \lambda t is a martingale; stationarity applies to increments, not levels.
6"Strict stationarity implies WSS"Strict stationarity only implies WSS if the second moment exists. A process with Xt=dCauchy(0,1)X_t \stackrel{d}{=} \text{Cauchy}(0,1) at all times (iid) is strictly stationary but E[Xt2]=\mathbb{E}[X_t^2] = \infty, so WSS is undefined.Always verify second moment existence before claiming WSS from strict stationarity.
7"WSS implies strict stationarity"WSS is a second-moment condition; higher-order structure may still vary with time. Example: an independent sequence with constant mean and variance but time-varying marginal shape (Gaussian at even times, centred exponential at odd times) is WSS but not strictly stationary.WSS -> strict stationarity ONLY for Gaussian processes (since Gaussians are determined by their first two moments).
8"The OU process converges to its mean"The OU process converges to its stationary distribution N(μ,σ2/(2θ))\mathcal{N}(\mu, \sigma^2/(2\theta)), not to the constant μ\mu. It fluctuates around μ\mu with variance σ2/(2θ)\sigma^2/(2\theta) in steady state.Distinguish convergence of the distribution (to stationary distribution) from convergence of the paths (which remain stochastic).
9"Ergodicity means the process repeats"Ergodicity means time averages equal ensemble averages. An ergodic process does not repeat its paths - it explores the full distribution over time.Ergodic = time average converges to ensemble average. Periodicity (repeating paths) is a different and usually violated property.
10"A GP with RBF kernel can model any function"RBF GPs have infinitely smooth sample paths. Functions with discontinuities or kinks (like ReLU activations) cannot be well-approximated by RBF GPs. Matern-1/2 (OU) or Matern-3/2 kernels are better for rough functions.Match kernel smoothness to the smoothness of the function being modelled. Use Matern family for non-smooth functions.
11"DDPM forward process is Brownian motion"DDPM uses a discretised, time-reversed, variance-preserving process. True BM has no mean reversion and linearly growing variance; DDPM's forward process is a discrete OU approximation with αˉt0\bar\alpha_t \to 0.DDPM forward process \approx discrete VP-SDE (OU); the connection to BM is via Donsker's theorem in the limit TT \to \infty, Δt0\Delta t \to 0.

11. Exercises

Exercise 1 * - Martingale Verification

Let $X_1, X_2, \ldots$ be iid with $\mathbb{E}[X_i] = 0$, $\text{Var}(X_i) = \sigma^2$. Define $S_n = \sum_{k=1}^n X_k$ and $V_n = S_n^2 - n\sigma^2$.

(a) Prove $\{S_n\}$ is a martingale with respect to its natural filtration.
(b) Prove $\{V_n\}$ is a martingale. What does $\mathbb{E}[V_\tau] = \mathbb{E}[V_0]$ imply for a bounded stopping time $\tau$?
(c) For $X_k \sim \text{Uniform}\{-1,+1\}$, verify (a) and (b) numerically with $n = 100$ and $10^5$ simulations. Check $\mathbb{E}[S_n] \approx 0$ and $\mathbb{E}[V_n] \approx 0$.
(d) Let $\tau = \min(n, 50)$ for a fixed $n = 100$ simulation. Verify $\mathbb{E}[S_\tau] \approx 0$ and $\mathbb{E}[V_\tau] \approx 0$ empirically.

Exercise 2 * - Doob's OST: Gambler's Ruin

A fair gambler starts with $k = 30$ dollars and plays until reaching $N = 100$ or going broke.

(a) Use OST applied to $S_n$ to derive $P(\text{win}) = k/N$. Compute numerically for $k = 30$, $N = 100$.
(b) Use OST applied to $S_n^2 - n$ to derive $\mathbb{E}[\tau] = k(N-k)$. Compute for the given values.
(c) Simulate $10^5$ games and verify both $P(\text{win})$ and $\mathbb{E}[\tau]$ empirically.
(d) For a biased game with $p = 0.6$, use the exponential martingale $M_n = ((1-p)/p)^{S_n}$ to derive $P(\text{win})$. Verify numerically.

Exercise 3 * - Poisson Process Properties

Let $N(t)$ be a Poisson process with rate $\lambda = 2$.

(a) Compute $P(N(3) \geq 5)$ exactly and via simulation ($10^5$ trials).
(b) Show that $M(t) = N(t) - \lambda t$ is a martingale (verify $\mathbb{E}[M(t)] = 0$ and $\mathbb{E}[M(t) \mid N(s), s \leq r] = M(r)$ for $r < t$).
(c) Thinning: generate a Poisson process with rate $\lambda = 5$ and thin independently with $p = 0.4$. Verify the thinned process is Poisson(2) by checking its inter-arrival distribution.
(d) Superposition: show empirically that the superposition of $\text{Poisson}(1)$ and $\text{Poisson}(2)$ processes is $\text{Poisson}(3)$.

Exercise 4 ** - Doob Martingale and McDiarmid

Let $f(x_1,\ldots,x_n) = \max_{1 \leq k \leq n} x_k$ on $[0,1]^n$ with $X_i \stackrel{iid}{\sim} \text{Uniform}[0,1]$.

(a) Show $f$ has bounded differences: $\sup |f(\ldots,x_k,\ldots) - f(\ldots,x_k',\ldots)| \leq 1$ for each $k$, so $c_k = 1$.
(b) Construct the Doob martingale $M_k = \mathbb{E}[\max_i X_i \mid X_1,\ldots,X_k]$ and compute $M_0$ and $M_n$ analytically.
(c) Apply Azuma's inequality to bound $P(f - \mathbb{E}[f] \geq t)$ and compare to McDiarmid's bound.
(d) Verify both bounds numerically for $n = 100$, $t = 0.05$: compute the empirical tail probability and compare to the bounds.

Exercise 5 ** - Brownian Motion Properties

Simulate $B_t$ on $[0,1]$ using $n = 1000$ steps.

(a) Verify the increment distribution: check that $B_t - B_s \sim \mathcal{N}(0, t-s)$ for several $(s,t)$ pairs using a KS test.
(b) Verify quadratic variation: compute $QV_n = \sum_{k=1}^n (B_{k/n} - B_{(k-1)/n})^2$ for $n = 10, 100, 1000, 10000$ and show $QV_n \to 1$.
(c) Verify non-differentiability heuristically: compute finite differences $\Delta B / \Delta t$ for $\Delta t = 10^{-1}, 10^{-2}, 10^{-3}$ and observe their magnitude grows like $1/\sqrt{\Delta t}$.
(d) Simulate the OU process $dX = -X\,dt + \sqrt{2}\,dB_t$ with Euler-Maruyama. Verify the stationary distribution is $\mathcal{N}(0,1)$.

Exercise 6 ** - Gaussian Process Posterior

Let $f \sim \mathcal{GP}(0, k_{\text{RBF}})$ with $k_{\text{RBF}}(x,x') = \exp(-|x-x'|^2/(2\ell^2))$, $\ell = 0.5$.

(a) Draw 5 sample paths from the GP prior on $[0,5]$ using the covariance matrix.
(b) Given observations $y_i = f(x_i) + \epsilon_i$ at $x = \{0.5, 1.5, 2.5, 3.5\}$ with $\epsilon_i \sim \mathcal{N}(0, 0.01)$, compute the GP posterior mean and variance.
(c) Plot the posterior mean $\pm 2$ standard deviations and verify that all observation points are covered.
(d) Compare the posterior under the RBF kernel vs the Matern-1/2 kernel. Which has smoother interpolation? Which has faster uncertainty growth away from observations?

Exercise 7 *** - SGD as Diffusion Process

Minimise $L(\theta) = \theta^2/2$ (quadratic, $H = 1$, $\theta^* = 0$) using SGD with stochastic gradient $\tilde{g}(\theta) = \theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

(a) Show that the SGD iterates satisfy $\theta_{n+1} = (1-\eta)\theta_n - \eta\epsilon_n$. Find the stationary distribution of $\theta_n$.
(b) Verify analytically that $\text{Var}(\theta_\infty) = \eta\sigma^2/(2-\eta) \approx \eta\sigma^2/2$ for small $\eta$.
(c) Simulate for $\eta \in \{0.01, 0.1, 0.3\}$, $\sigma^2 = 1$, $n = 10000$ steps. Plot the stationary distribution and compare to the analytical prediction.
(d) Verify the diffusion approximation: the continuous-time SDE $d\theta = -\theta\,dt + \sqrt{\eta}\sigma\,dB_t$ has stationary distribution $\mathcal{N}(0, \eta\sigma^2/2)$. Compare to part (b).

Exercise 8 *** - RL TD-Error as Martingale

Consider a Markov reward process: states $\{1, 2, 3\}$, transition matrix $P = \begin{pmatrix} 0.5 & 0.3 & 0.2 \\ 0.4 & 0.4 & 0.2 \\ 0.3 & 0.3 & 0.4 \end{pmatrix}$, rewards $r = [1, 2, 3]$, discount $\gamma = 0.9$.

(a) Compute the true value function $V = (I - \gamma P)^{-1}r$ exactly.
(b) Implement TD(0) to estimate $V$: run $10^4$ episodes of length 50 with updates $V(s_t) \leftarrow V(s_t) + \alpha(r_t + \gamma V(s_{t+1}) - V(s_t))$, $\alpha = 0.01$. Compare to the exact $V$.
(c) Verify the martingale-difference property of the TD error: under the true $V^*$, $\delta_t = r_t + \gamma V^*(s_{t+1}) - V^*(s_t)$ satisfies $\mathbb{E}[\delta_t \mid s_t] = 0$. Check numerically.
(d) Implement a variance-reduced version using the baseline $b(s) = \bar{r}$ (mean reward). Show this reduces the variance of the gradient estimate without introducing bias.


12. Why This Matters for AI (2026 Perspective)

| Concept | AI/ML Impact |
|---|---|
| Filtrations and adaptedness | Causal vs. bidirectional attention; early stopping as stopping time; data leakage detection |
| Martingales | TD-learning convergence; REINFORCE gradient estimators; unbiased gradient noise in SGD |
| Optional Stopping Theorem | Early stopping criteria; gambler's ruin models of gradient explosion; RL episode termination |
| Doob martingale | Unified proof of McDiarmid/Azuma; online learning regret bounds; generalisation theory |
| Poisson process | LLM inference request queues; KV-cache hit modelling; continuous batching server design |
| Brownian motion | Forward process in diffusion models; continuous-time approximation of SGD noise; random feature maps |
| OU process | DDPM/Score-SDE forward processes; L2 regularisation as mean reversion; Adam momentum as filtering |
| Geometric Brownian motion | Residual stream norm growth; multiplicative noise in LoRA fine-tuning |
| Donsker's theorem | Justifies the SDE approximation of SGD; random walk -> BM limit (a CLT for paths) |
| Wide-sense stationarity | RoPE relative position encoding; ALiBi attention bias; translation-invariant kernels |
| Ergodic theorem | Time average = ensemble average in training; failure mode in continual learning |
| Gaussian processes | Bayesian hyperparameter search; neural tangent kernel in the infinite-width limit; GAN discriminator theory |
| GP posterior | Uncertainty quantification in Bayesian deep learning; active learning for LLM fine-tuning data selection |
| Score SDE | Unified framework for DDPM, DDIM, consistency models, flow matching |
| Martingale CLT | Asymptotic normality of SGD parameter estimates; statistical inference on trained models |

13. Conceptual Bridge

Looking backward: This section builds directly on the static probability of Section01-Section05. Filtrations formalise conditional expectations (Section04) over time; the Doob martingale makes McDiarmid's inequality (Section05) a consequence of Azuma applied to a martingale; the Gaussian process is the stochastic-process version of the multivariate Gaussian (Section03). Every concept here has a static probability antecedent.

Looking forward: Stochastic processes enable the dynamic probability of Section07 and beyond. Markov chains (Section07) are stochastic processes satisfying the Markov property - the specialisation of the general framework to memoryless dynamics. Statistics (Section07-Statistics) uses time series models (AR, MA, ARIMA) which are stationary stochastic processes analysed via the autocorrelation and spectral density tools developed here. Optimisation (Section08) uses the SDE/martingale formalism to analyse SGD convergence and diffusion-based sampling.

The martingale as a unifying concept: The martingale is arguably the most powerful single concept in probability theory. It appears in: (a) concentration inequalities as the Doob/Azuma construction, (b) SGD as a near-martingale gradient process, (c) RL value functions as martingale transforms, (d) diffusion model reverse processes as time-reversed SDEs with martingale driving noise, (e) sequential hypothesis testing (SPRT) as a martingale stopping problem. Understanding martingales is understanding the probabilistic structure that makes machine learning algorithms work.

STOCHASTIC PROCESSES: POSITION IN CURRICULUM
========================================================================

  STATIC PROBABILITY (Section01-Section05)
  ---------------------------------------------
  Section01 Random Variables  ->  Section02 Distributions
  Section03 Joint Dist.       ->  Section04 Expectation/MGF
  Section05 Concentration     ->  McDiarmid = Doob + Azuma
  ---------------------------------------------
                | filtrations, martingales, paths
                v
  +-------------------------------------------------+
  |  Section06 STOCHASTIC PROCESSES (THIS SECTION)  |
  |  ---------------------------------------------  |
  |  Filtrations    Martingales   OST               |
  |  Random Walks   Poisson       BM/OU             |
  |  Stationarity   GPs           SGD/RL/Diff       |
  +-------------------------------------------------+
                |
       +--------+--------+
       v                 v
  Section07 Markov Chains   Section07-Stat/05 Time Series
  (Markov property,   (AR/MA/ARIMA, spectral
  MCMC, PageRank,     analysis, forecasting,
  RL policy eval)     seasonal decomposition)
       |
       v
  Section08 Optimisation (SDE/martingale-based
  convergence of SGD, Langevin dynamics,
  diffusion-based sampling)

========================================================================

The AI synthesis: In 2026, the stochastic process framework permeates every major AI subfield. Diffusion models are explicit SDE constructions. RL convergence relies on martingale theory. LLM evaluation uses Hoeffding/CLT on sequences of i.i.d. samples (Poisson processes of test cases). Bayesian optimisation uses GP posteriors. Continual learning grapples with non-stationarity. The practitioner who understands stochastic processes sees the probabilistic skeleton underlying methods that otherwise appear unrelated.


This completes Section06 Stochastic Processes. The framework built here - filtrations, martingales, the canonical processes - is the common language of probability theory in motion.

<- Back to Chapter 6: Probability Theory | Next: Markov Chains ->


Appendix A: Stochastic Calculus Primer

A.1 The Ito Integral

Classical calculus integrates smooth functions. The Ito integral extends integration to Brownian motion, where the integrand is a stochastic process.

Why ordinary integrals fail: Define $\int_0^T B_t\,dB_t$ using Riemann sums. Two natural choices:

  • Left-endpoint (Ito): $\sum_{k=0}^{n-1} B_{t_k}(B_{t_{k+1}} - B_{t_k})$
  • Midpoint (Stratonovich): $\sum_{k=0}^{n-1} \frac{B_{t_k}+B_{t_{k+1}}}{2}(B_{t_{k+1}} - B_{t_k})$

These give different limits as $n \to \infty$! This is unique to stochastic integration - in classical calculus all Riemann sums give the same limit.

Computation:

$$\text{Ito: } \int_0^T B_t\,dB_t = \frac{B_T^2 - T}{2} \qquad \text{Stratonovich: } \int_0^T B_t \circ dB_t = \frac{B_T^2}{2}$$

The Ito integral gives an extra $-T/2$ term from the quadratic variation $[B]_T = T$. This is the source of Ito's lemma.
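The two discretisations can be compared directly in simulation (a sketch; the seed and step count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 1.0, 100_000
dB = rng.normal(0.0, np.sqrt(T / n), size=n)       # Brownian increments
B = np.concatenate([[0.0], np.cumsum(dB)])         # B_0, B_{t_1}, ..., B_T

ito = np.sum(B[:-1] * dB)                          # left-endpoint sum
strat = np.sum(0.5 * (B[:-1] + B[1:]) * dB)        # midpoint sum

print(f"Ito sum          {ito:.4f}  vs (B_T^2 - T)/2 = {(B[-1]**2 - T) / 2:.4f}")
print(f"Stratonovich sum {strat:.4f}  vs  B_T^2 / 2    = {B[-1]**2 / 2:.4f}")
```

The midpoint sum telescopes to $B_T^2/2$ exactly; the left-endpoint sum differs from it by half the accumulated quadratic variation, which concentrates at $T$.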

The Ito integral is a martingale: For any adapted square-integrable integrand $H_t$:

$$M_T = \int_0^T H_t\,dB_t \quad \text{is a martingale}$$

with $\mathbb{E}[M_T^2] = \int_0^T \mathbb{E}[H_t^2]\,dt$ (Ito isometry).

A.2 Ito's Lemma

Theorem (Ito's Lemma). Let $X_t$ satisfy the SDE $dX_t = \mu_t\,dt + \sigma_t\,dB_t$. For any $C^2$ function $f$:

$$df(X_t) = f'(X_t)\,dX_t + \frac{1}{2}f''(X_t)\sigma_t^2\,dt = \left(\mu_t f'(X_t) + \frac{\sigma_t^2}{2}f''(X_t)\right)dt + \sigma_t f'(X_t)\,dB_t$$

The extra term $\frac{1}{2}f''(X_t)\sigma_t^2\,dt$ (the Ito correction) arises from the quadratic variation $(dB_t)^2 = dt$. It has no classical analogue.

Example: GBM. $dS = \mu S\,dt + \sigma S\,dB$. Apply Ito to $f(S) = \log S$:

$$d\log S = \frac{1}{S}\mu S\,dt + \frac{1}{S}\sigma S\,dB - \frac{1}{2}\frac{1}{S^2}\sigma^2 S^2\,dt = \left(\mu - \frac{\sigma^2}{2}\right)dt + \sigma\,dB$$

The $-\sigma^2/2$ Ito correction converts the arithmetic drift $\mu$ to the geometric/log drift $\mu - \sigma^2/2$.
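A quick Monte Carlo check of the log-drift correction, using the exact GBM solution $S_T = \exp((\mu - \sigma^2/2)T + \sigma B_T)$ with $S_0 = 1$ (a sketch; the values $\mu = 0.1$, $\sigma = 0.4$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, T, paths = 0.1, 0.4, 1.0, 100_000
B_T = rng.normal(0.0, np.sqrt(T), size=paths)

# Exact GBM solution with S_0 = 1: S_T = exp((mu - sigma^2/2) T + sigma B_T)
log_S_T = (mu - sigma**2 / 2) * T + sigma * B_T

print(f"E[log S_T] ~ {log_S_T.mean():.4f}  (theory mu - sigma^2/2 = {mu - sigma**2 / 2:.4f})")
print(f"E[S_T]     ~ {np.exp(log_S_T).mean():.4f}  (theory exp(mu T) = {np.exp(mu * T):.4f})")
```

The mean of $\log S_T$ tracks the corrected drift $\mu - \sigma^2/2$ while the mean of $S_T$ itself still grows at rate $\mu$.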

For AI: DDPM noise levels satisfy $d\bar\alpha_t \approx -\beta_t\bar\alpha_t\,dt$ (a linear SDE for $\log\bar\alpha$). The Ito correction explains why the SNR $\bar\alpha_t/(1-\bar\alpha_t)$ does not decrease linearly even for linear schedules - the correction term shifts the effective noise profile.

A.3 Stochastic Differential Equations

General Ito SDE:

$$dX_t = b(X_t, t)\,dt + \sigma(X_t, t)\,dB_t, \quad X_0 = x_0$$

  • $b(x,t)$: drift (deterministic tendency)
  • $\sigma(x,t)$: diffusion coefficient (noise strength)

Existence and uniqueness (Lipschitz conditions): If $b$ and $\sigma$ are Lipschitz in $x$ uniformly in $t$ and have linear growth, the SDE has a unique strong solution (Picard-Lindelof for SDEs).

Important SDEs in ML:

| Process | SDE | Notes |
|---|---|---|
| BM | $dX = dB$ | $b = 0$, $\sigma = 1$ |
| OU | $dX = -\theta X\,dt + \sigma\,dB$ | Mean-reverting |
| GBM | $dX = \mu X\,dt + \sigma X\,dB$ | Multiplicative noise |
| Langevin | $dX = -\nabla U(X)\,dt + \sqrt{2}\,dB$ | Stationary $\propto e^{-U}$ |
| SGD-SDE | $d\theta = -\nabla L(\theta)\,dt + \sqrt{\eta\Sigma}\,dB$ | SGD diffusion approx. |
| DDPM fwd | $dX = -\frac{\beta}{2}X\,dt + \sqrt{\beta}\,dB$ | VP-SDE for diffusion |

Langevin dynamics and MCMC: The Langevin SDE $dX_t = -\nabla U(X_t)\,dt + \sqrt{2}\,dB_t$ has stationary distribution $\pi(x) \propto e^{-U(x)}$. This is the continuous-time version of Langevin MCMC, used for sampling from complex posteriors (e.g., Bayesian neural networks, energy-based models).
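A minimal unadjusted Langevin sampler via Euler-Maruyama, assuming the simple target $U(x) = x^2/2$ so the stationary law should be $\mathcal{N}(0,1)$ (step size and chain length are arbitrary choices; the discretisation introduces a small bias of order $dt$):

```python
import numpy as np

rng = np.random.default_rng(2)
grad_U = lambda x: x                      # U(x) = x^2/2  =>  pi = N(0, 1)
dt, n_steps, burn_in = 0.01, 200_000, 10_000

x, samples = 0.0, np.empty(n_steps - burn_in)
for k in range(n_steps):
    # Euler-Maruyama step of dX = -grad U(X) dt + sqrt(2) dB
    x += -grad_U(x) * dt + np.sqrt(2 * dt) * rng.standard_normal()
    if k >= burn_in:
        samples[k - burn_in] = x

print(f"sample mean {samples.mean():.3f}, sample var {samples.var():.3f}")  # target: 0 and 1
```

Swapping in the gradient of any other potential samples from the corresponding $e^{-U}$, which is the basic mechanism behind Langevin MCMC for energy-based models.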


Appendix B: Convergence Theorems for Martingales

B.1 Martingale Convergence

Theorem (Doob's Martingale Convergence, 1953). If $\{M_n\}$ is a martingale and $\sup_n \mathbb{E}[|M_n|] < \infty$ ($L^1$ bounded), then $M_\infty = \lim_{n\to\infty} M_n$ exists a.s. and $\mathbb{E}[|M_\infty|] < \infty$.

This theorem is foundational for proving convergence of:

  • Bayesian posterior updates (likelihood ratio martingales)
  • Online learning algorithms (regret bounds via bounded martingales)
  • Stochastic approximation algorithms (Robbins-Monro)

The $L^2$ version: If $\sup_n \mathbb{E}[M_n^2] < \infty$, then $M_n \to M_\infty$ both a.s. and in $L^2$.
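One classical illustration (not discussed above, and exactly the Beta-Bernoulli posterior-mean martingale of Bayesian updating) is the Polya urn: the fraction of red balls is a bounded martingale, so the theorem guarantees a.s. convergence of each path. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(3)

def polya_fraction_path(n_draws):
    """Fraction of red balls in a Polya urn starting with 1 red, 1 blue.

    Drawing a ball and returning it with one extra of the same colour makes
    the red fraction a bounded martingale, so it converges a.s."""
    red, total = 1, 2
    path = np.empty(n_draws)
    for i in range(n_draws):
        if rng.random() < red / total:
            red += 1
        total += 1
        path[i] = red / total
    return path

for _ in range(4):
    p = polya_fraction_path(5000)
    print(f"fraction at n=1000: {p[999]:.3f}   at n=5000: {p[-1]:.3f}")
```

Each run settles to a different limit (the limit is Uniform[0,1]-distributed for this urn), but every path stops oscillating - precisely what the convergence theorem asserts.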

B.2 Upcrossing Inequality

The proof of martingale convergence uses the upcrossing inequality: the expected number of times a martingale upcrosses an interval $[a,b]$ (goes from below $a$ to above $b$) satisfies:

$$\mathbb{E}[U_n[a,b]] \leq \frac{\mathbb{E}[(M_n - a)^+]}{b - a}$$

If $M_n$ is $L^1$ bounded, then $\mathbb{E}[U_\infty[a,b]] < \infty$, so the number of upcrossings is finite for every rational interval $[a,b]$. Oscillation in the limit is therefore impossible, forcing convergence.

B.3 Uniform Integrability and $L^1$ Convergence

Definition. A family $\{M_t\}$ is uniformly integrable (UI) if:

$$\sup_t \mathbb{E}[|M_t|\mathbf{1}_{|M_t| > K}] \to 0 \quad \text{as } K \to \infty$$

UI is the condition needed for $\mathbb{E}[M_\infty] = \lim_n \mathbb{E}[M_n]$ (exchange of limit and expectation). Without UI, a.s. convergence does not imply $L^1$ convergence.

Sufficient condition for UI: $\sup_n \mathbb{E}[|M_n|^p] < \infty$ for some $p > 1$.

For AI: Uniform integrability is implicitly required for most SGD convergence proofs that need $\mathbb{E}[L(\theta_N)] \to L^*$ (expectation of the limit = limit of the expectations). When gradient clipping is used, it ensures the gradient process is UI, validating such convergence proofs.


Appendix C: Gaussian Process Details

C.1 Kernel Properties

A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a valid kernel (positive semidefinite / Mercer kernel) iff:

$$\sum_{i,j=1}^n c_i c_j k(x_i, x_j) \geq 0 \quad \forall n,\; x_1,\ldots,x_n \in \mathcal{X},\; c_i \in \mathbb{R}$$

Operations preserving PSD kernels:

  • Sum: $k_1 + k_2$ (the sum of two valid kernels is valid)
  • Product: $k_1 \cdot k_2$ (the product of two valid kernels is valid)
  • Scaling: $\alpha k$ for $\alpha > 0$
  • Composition: $k(x,x') = k_1(\phi(x), \phi(x'))$ for any feature map $\phi$

Bochner's theorem (characterisation of stationary kernels): A continuous function $k(\tau)$ is the covariance of a stationary process iff:

$$k(\tau) = \int e^{i\omega\tau}\,dF(\omega)$$

for some non-decreasing bounded measure $F$ (the spectral measure). The power spectral density is $S(\omega) = dF/d\omega$.
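The positive semidefiniteness and the sum/product closure properties can be spot-checked numerically by looking at Gram-matrix eigenvalues (a sketch with arbitrary inputs; eigenvalues may dip slightly below zero from round-off):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 5, size=50)

def rbf(x, ell):
    """RBF Gram matrix on a 1-d point set."""
    d = x[:, None] - x[None, :]
    return np.exp(-d**2 / (2 * ell**2))

K1, K2 = rbf(x, 0.5), rbf(x, 2.0)
for name, K in [("RBF(0.5)", K1), ("sum", K1 + K2), ("product", K1 * K2)]:
    lam_min = np.linalg.eigvalsh(K).min()
    print(f"{name:9s} smallest eigenvalue = {lam_min:.2e}")  # >= 0 up to round-off
```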

C.2 Sparse GP Approximations

Exact GP inference costs $O(n^3)$ for $n$ training points. For large $n$ (e.g., $n = 10^5$), this is intractable. Sparse GP approximations use $m \ll n$ inducing points $\mathbf{u} = f(\mathbf{z})$:

$$p(f) \approx q(f) = \int p(f|\mathbf{u})\,q(\mathbf{u})\,d\mathbf{u}$$

The most common is the SVGP (Sparse Variational GP), which optimises a variational lower bound (ELBO), reducing the cost to $O(nm^2 + m^3)$.

For AI: Sparse GPs enable scalable Bayesian optimisation for LLM hyperparameter tuning with $n > 1000$ evaluations. The inducing points act as a compressed representation of the observations, analogous to the key-value cache in attention.

C.3 Deep Kernel Learning

Neural-network kernels: Replace the input $x$ with a neural network representation $\phi_\theta(x)$ and use:

$$k_\theta(x,x') = k_{\text{base}}(\phi_\theta(x), \phi_\theta(x'))$$

This is deep kernel learning (DKL): train $\theta$ by maximising the GP marginal likelihood. DKL combines the flexibility of deep representations with the uncertainty quantification of GPs, and is used in scalable Bayesian optimisation for AutoML.


Appendix D: Extended ML Applications

D.1 Diffusion Models: Mathematical Details

The probability flow ODE for the VP-SDE is:

$$\frac{dx}{dt} = f(x,t) - \frac{1}{2}g(t)^2\nabla_x\log p_t(x)$$

This ODE has the same marginal distributions as the reverse SDE but is deterministic. It enables:

  • DDIM sampling: solve the ODE with large step sizes (fewer function evaluations)
  • Latent space interpolation: the deterministic ODE gives a unique encoding of each data point
  • Consistency models: learn a direct $x_t \to x_0$ mapping that is consistent along ODE trajectories

The score function $s_\theta(x_t, t) \approx \nabla_{x_t}\log p_t(x_t)$ is the neural network that drives both the SDE and the ODE. At the optimum (by Tweedie's formula, with $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$):

$$s_\theta^*(x_t, t) = \frac{\sqrt{\bar\alpha_t}\,\mathbb{E}[x_0 \mid x_t] - x_t}{1-\bar\alpha_t} = -\frac{\mathbb{E}[\epsilon \mid x_t]}{\sqrt{1-\bar\alpha_t}}$$

This is a conditional expectation - directly connecting score learning to the Doob martingale construction.

D.2 Online Learning and Martingale Regret Bounds

In online learning, at each round $t$ the learner predicts $\hat{y}_t$, suffers loss $\ell_t(\hat{y}_t)$, and observes the true outcome $y_t$. The regret against the best fixed action $a^*$ is:

$$\text{Regret}_T = \sum_{t=1}^T \ell_t(\hat{y}_t) - \min_{a}\sum_{t=1}^T \ell_t(a)$$

Martingale structure: In Follow-the-Regularised-Leader (FTRL) algorithms, the loss differences $\ell_t(\hat{y}_t) - \mathbb{E}[\ell_t(\hat{y}_t)|\mathcal{F}_{t-1}]$ form a martingale difference sequence. Applying Azuma's inequality (with differences bounded by 1):

$$P\!\left(\text{Regret}_T \geq \mathbb{E}[\text{Regret}_T] + t\sqrt{T}\right) \leq e^{-t^2/2}$$

This gives high-probability regret bounds - the stochastic process framework converts expected regret into individual-run guarantees.

Bandit algorithms (Thompson Sampling): In multi-armed bandits with $K$ arms, at each round the agent samples parameters from the posterior and plays the arm with the highest sampled reward. The number of plays of suboptimal arms is bounded using stopping time arguments on the likelihood ratio martingale.

D.3 RLHF and Reward Models as Stochastic Processes

RLHF (Reinforcement Learning from Human Feedback) trains a language model $\pi_\theta$ to maximise a learned reward model $r_\phi(x,y)$ subject to a KL penalty:

$$\max_\pi \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi}[r_\phi(x,y)] - \beta\,\text{KL}(\pi \,\|\, \pi_{\text{ref}})$$

This is a stochastic control problem: the policy $\pi$ is a controller, the generation trajectory $(y_1,\ldots,y_T)$ is the controlled stochastic process, and the reward $r_\phi(x, y_{1:T})$ is a terminal cost.

The RLHF policy update (PPO) and martingales: PPO maximises the surrogate objective

$$\mathbb{E}_t\!\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}A_t\right]$$

whose gradient is the policy-gradient estimator, with $A_t$ the advantage estimate. Under the true advantage function, $A_t = Q^\pi(s_t,a_t) - V^\pi(s_t)$ has $\mathbb{E}[A_t \mid s_t] = 0$ - it is a martingale difference. PPO clips the ratio to prevent large updates, maintaining the stability of the martingale approximation.

D.4 Stochastic Processes in LLM Interpretability

Mechanistic interpretability (Elhage et al., Nanda et al.) studies circuits in transformer weights. The token representations flowing through a transformer's layers can be modelled as a stochastic process evolving through residual stream updates:

$$x_t^{(\ell+1)} = x_t^{(\ell)} + \text{Attn}^{(\ell)}(x_{1:t}^{(\ell)}) + \text{MLP}^{(\ell)}(x_t^{(\ell)})$$

Each residual update is a perturbation to the current state - the trajectory through layers is a discrete-time process indexed by layer depth $\ell$, not token position $t$. Understanding this process (what information is stored and retrieved at each layer) is the central question of interpretability.

Induction heads are attention heads that implement the operation "if A appeared before B at some position, predict B will appear after the current A." This is a pattern-matching stopping time: the head activates when it detects that the current context matches a past pattern. Formalising this as a stopping time over the layer-depth stochastic process is an active research direction.


Appendix E: The Functional CLT (Donsker's Theorem)

E.1 Weak Convergence on Function Spaces

Donsker's theorem states convergence in distribution on the function space $C([0,1])$ - the space of continuous functions equipped with the supremum norm $\|f\|_\infty = \sup_t |f(t)|$.

Theorem (Donsker, 1951). Let $X_1, X_2, \ldots$ be iid with $\mathbb{E}[X_i] = 0$, $\sigma^2 = \mathbb{E}[X_i^2] < \infty$. Define:

$$W_n(t) = \frac{1}{\sigma\sqrt{n}}S_{\lfloor nt \rfloor}, \quad t \in [0,1]$$

Then $W_n \Rightarrow B$ weakly in $(C([0,1]), \|\cdot\|_\infty)$, where $B$ is standard Brownian motion.

Continuous mapping theorem: If $W_n \Rightarrow B$ and $g: C([0,1]) \to \mathbb{R}$ is continuous, then $g(W_n) \Rightarrow g(B)$. This gives:

  • $\max_{0 \leq k \leq n} S_k / (\sigma\sqrt{n}) \Rightarrow \max_{0 \leq t \leq 1} B_t$ - the reflection principle for random walks
  • $n^{-2}\sum_{k=1}^n S_k^2 / \sigma^2 \Rightarrow \int_0^1 B_t^2\,dt$ - convergence of an integral functional of the path
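The first consequence can be checked by simulation, using $P(\max_{[0,1]} B_t \leq x) = 2\Phi(x) - 1$ from the reflection principle (a sketch; the walk length, trial count, and the choice $x = 1$ are arbitrary):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(5)
n, trials = 1000, 5000
steps = rng.choice([-1.0, 1.0], size=(trials, n))   # +/-1 steps, sigma = 1
S = np.cumsum(steps, axis=1)
M = np.maximum(S.max(axis=1), 0.0) / np.sqrt(n)     # max_k S_k / (sigma sqrt(n))

x = 1.0
Phi = 0.5 * (1 + erf(x / sqrt(2)))                  # standard normal CDF at x
print(f"P(scaled max <= {x}) ~ {(M <= x).mean():.3f}")
print(f"reflection principle 2 Phi({x}) - 1 = {2 * Phi - 1:.3f}")
```

The empirical CDF of the scaled random-walk maximum matches the Brownian value up to a discretisation error of order $n^{-1/2}$.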

E.2 Applications to SGD Analysis

Donsker's theorem justifies the SDE approximation of SGD: as the step size $\eta \to 0$, the piecewise-constant SGD interpolation converges (in distribution, as a path) to the SDE $d\theta = -\nabla L\,dt + \sqrt{\eta\Sigma}\,dB$. This means:

  • Convergence results for the SDE (e.g., mixing time to the stationary distribution) translate to discrete-time SGD results
  • The stationary distribution of SGD approximates $\mathcal{N}(\theta^*, \frac{\eta}{2}H^{-1}\Sigma)$ near a minimum $\theta^*$
  • The escape rate from a local minimum scales as $\exp(-\Delta L / (\eta/B))$ - the Kramers escape rate formula

This is why practitioners observe that increasing the batch size $B$ by a factor $k$ requires proportionally increasing the learning rate $\eta$ by $k$ to maintain the same noise temperature $\eta/B$ and thus the same optimisation dynamics.


Appendix F: Notation Summary

| Symbol | Meaning |
|---|---|
| $(\Omega, \mathcal{F}, P)$ | Probability space |
| $(\mathcal{F}_t)_{t \geq 0}$ | Filtration (increasing family of $\sigma$-algebras) |
| $\{X_t\}_{t \in T}$ | Stochastic process indexed by $T$ |
| $M_n$ | Martingale |
| $\tau$ | Stopping time |
| $\mathcal{F}_\tau$ | $\sigma$-algebra at stopping time $\tau$ |
| $[X]_t$ | Quadratic variation of process $X$ up to time $t$ |
| $B_t$ | Standard Brownian motion (Wiener process) |
| $N(t)$ | Poisson process with rate $\lambda$ |
| $k(s,t)$ | Covariance kernel (for GPs) |
| $\mathcal{GP}(m, k)$ | Gaussian process with mean $m$ and kernel $k$ |
| $dB_t$ | Wiener increment (stochastic differential) |
| $\int_0^T H_t\,dB_t$ | Ito integral |
| $[B]_T = T$ | Quadratic variation of BM |
| $\bar\alpha_t$ | DDPM cumulative noise schedule |
| $s_\theta(x,t)$ | Score function approximation |
| VP-SDE | Variance-Preserving SDE (DDPM continuum) |
| VE-SDE | Variance-Exploding SDE |
| OST | Optional Stopping Theorem (Doob) |
| UI | Uniformly Integrable |
| WSS | Wide-Sense Stationary |
| ACVF | Autocovariance Function |
| PSD | Power Spectral Density |
| NTK | Neural Tangent Kernel |
| OU | Ornstein-Uhlenbeck process |
| GBM | Geometric Brownian Motion |
| SRW | Simple Random Walk |

Appendix G: Worked Examples

G.1 Symmetric Random Walk: Hitting Probabilities

Problem. A particle starts at position $x = 3$ on the integer line. At each step it moves $+1$ or $-1$ with equal probability. Find:
(a) the probability it hits 0 before hitting 10
(b) the expected number of steps to hit $\{0, 10\}$
(c) the probability it ever returns to 3, starting from 3

Solution.

(a) By the gambler's ruin result, $P(\text{hit 0 before 10}) = 1 - 3/10 = 7/10$.

(b) $\mathbb{E}[\tau] = k(N-k) = 3 \times 7 = 21$ steps.

(c) For a symmetric walk on $\mathbb{Z}$, the probability of ever returning to the starting point equals 1 (Polya: the walk is recurrent in $d = 1$). However, the expected return time is $\infty$ (the walk is null-recurrent).

Numerical verification:

import numpy as np

rng = np.random.default_rng(42)
N_trials = 100_000

# Simulate each gambler's-ruin game once, recording both the exit point and
# the number of steps: one set of simulations answers (a) and (b) together.
hits_zero, total_steps = 0, 0
for _ in range(N_trials):
    x, steps = 3, 0
    while 0 < x < 10:
        x += 1 if rng.random() < 0.5 else -1
        steps += 1
    hits_zero += (x == 0)
    total_steps += steps

print(f"P(hit 0) ~ {hits_zero / N_trials:.4f} (theory 0.7)")
print(f"E[tau]   ~ {total_steps / N_trials:.2f} (theory 21)")

G.2 Poisson Process: Conditional Uniformity

Problem. Events arrive as a Poisson process with rate $\lambda$. Given that exactly 5 events occurred in $[0,3]$, what is the distribution of the first event time $T_1$?

Solution. By the conditional uniformity property, given $N(3) = 5$, the 5 event times are the order statistics of 5 iid Uniform$[0,3]$ random variables. Therefore $T_1$ is the minimum of 5 iid Uniform$[0,3]$ variables, i.e. a scaled Beta:

$$T_1 \mid N(3)=5 \;\stackrel{d}{=}\; 3\cdot\text{Beta}(1,5), \qquad \mathbb{E}[T_1 \mid N(3)=5] = \frac{3}{6} = 0.5$$

More precisely: $P(T_1 > t \mid N(3)=5) = (1-t/3)^5$ for $t \in [0,3]$.

Verification: Simulate 5 Uniform$[0,3]$ random variables and take the minimum. Compare to a conditional Poisson process simulation.
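The suggested verification via the order-statistics route, as a sketch (trial count and the evaluation point $t = 1$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
T, k, trials = 3.0, 5, 200_000

# Conditional uniformity: given N(3) = 5, the arrival times are 5 iid
# Uniform[0,3] in some order; T_1 is their minimum.
T1 = rng.uniform(0, T, size=(trials, k)).min(axis=1)

t = 1.0
print(f"E[T_1 | N(3)=5] ~ {T1.mean():.4f}  (theory 0.5)")
print(f"P(T_1 > {t})     ~ {(T1 > t).mean():.4f}  (theory (2/3)^5 = {(1 - t / T)**k:.4f})")
```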

G.3 Brownian Motion: The Arc-Sine Distribution

Problem. Let $L_1 = \sup\{t \in [0,1]: B_t = 0\}$ be the last zero of Brownian motion before time 1. Find $P(L_1 \leq t)$.

Solution. By Levy's arc-sine law:

$$P(L_1 \leq t) = \frac{2}{\pi}\arcsin(\sqrt{t})$$

This is the CDF of the arc-sine distribution on $[0,1]$. The density $p(t) = 1/(\pi\sqrt{t(1-t)})$ concentrates near 0 and 1 - the last zero before time 1 is most likely to be near 0 or near 1, and least likely to be near $1/2$. This violates the naive intuition that "the last zero should be somewhere in the middle."

Implication for AI: The arc-sine law suggests that in a long random walk (or SRW approximation of SGD noise), the trajectory spends most of its time on one side of zero - the noise process can have long "runs" of the same sign. This persistent structure in gradient noise is why momentum-based optimisers (Adam, SGD with momentum) perform better than pure SGD for certain problem structures.
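The arc-sine law can be checked numerically via the SRW approximation of BM, tracking the last visit to zero (a sketch; walk length and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials = 1000, 10_000
steps = rng.choice(np.array([-1, 1], dtype=np.int8), size=(trials, n))
S = steps.cumsum(axis=1, dtype=np.int16)

# Last visit to 0 within n steps, as a fraction of n (0 if no return)
idx = np.arange(1, n + 1)
last_zero = np.where(S == 0, idx, 0).max(axis=1) / n

t = 0.5
emp = (last_zero <= t).mean()
theory = 2 / np.pi * np.arcsin(np.sqrt(t))
print(f"P(L_1 <= {t}) ~ {emp:.3f}   arc-sine law: {theory:.3f}")
```

A histogram of `last_zero` reproduces the U-shaped density, with mass piled up near 0 and 1.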

G.4 OU Process: Analytical vs Numerical

Problem. Solve dX=2(X1)dt+3dBdX = -2(X-1)\,dt + \sqrt{3}\,dB, X0=5X_0 = 5.

Solution. This is OU with θ=2\theta=2, μ=1\mu=1, σ=3\sigma=\sqrt{3}.

  • Solution: Xt=1+(51)e2t+30te2(ts)dBsX_t = 1 + (5-1)e^{-2t} + \sqrt{3}\int_0^t e^{-2(t-s)}\,dB_s
  • Mean: E[Xt]=1+4e2t\mathbb{E}[X_t] = 1 + 4e^{-2t}
  • Variance: Var(Xt)=34(1e4t)\text{Var}(X_t) = \frac{3}{4}(1 - e^{-4t})
  • Stationary distribution: XN(1,3/4)X_\infty \sim \mathcal{N}(1, 3/4)

The process decays exponentially from X0=5X_0=5 toward the mean μ=1\mu=1 with decay rate θ=2\theta=2, while accumulating noise that saturates at σ2/(2θ)=3/4\sigma^2/(2\theta) = 3/4.

For diffusion models: time-discretising this SDE gives a DDPM-style forward step with $\beta_t \approx 4\Delta t$ (matching the mean-reversion rate $2\theta = 4$; the analogy is loose, since the VP-SDE ties the noise scale to $\beta$). As $T \to \infty$, the mean $\mathbb{E}[X_T] = 1 + 4e^{-2T} \to 1$ and the variance $\to 3/4$, so $X_T \approx \mathcal{N}(1, 3/4)$ plays the role of the "pure noise" terminal condition.
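The analytic mean and variance above can be checked against an Euler-Maruyama discretisation (a sketch; step count, path count, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, mu, sigma, x0 = 2.0, 1.0, np.sqrt(3.0), 5.0
T, n_steps, n_paths = 2.0, 1000, 20_000
dt = T / n_steps

# Euler-Maruyama: X_{k+1} = X_k + theta*(mu - X_k)*dt + sigma*sqrt(dt)*xi
X = np.full(n_paths, x0)
for _ in range(n_steps):
    X = X + theta * (mu - X) * dt + sigma * np.sqrt(dt) * rng.normal(size=n_paths)

mean_theory = mu + (x0 - mu) * np.exp(-theta * T)                    # 1 + 4 e^{-4}
var_theory = sigma**2 / (2 * theta) * (1 - np.exp(-2 * theta * T))   # (3/4)(1 - e^{-8})
print(f"mean at T=2: sim {X.mean():.4f}, theory {mean_theory:.4f}")
print(f"var  at T=2: sim {X.var():.4f}, theory {var_theory:.4f}")
```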

G.5 Gaussian Process: Prior vs Posterior

Problem. Fit a GP to three observations (x1,y1)=(0,1)(x_1, y_1) = (0, 1), (x2,y2)=(1,0.5)(x_2, y_2) = (1, -0.5), (x3,y3)=(2,0.8)(x_3, y_3) = (2, 0.8) with RBF kernel k(x,x)=e(xx)2k(x,x') = e^{-(x-x')^2} and noise σn2=0.01\sigma_n^2 = 0.01.

Solution. Build the 3×33 \times 3 kernel matrix KK:

Kij=e(xixj)2+0.01δijK_{ij} = e^{-(x_i-x_j)^2} + 0.01\,\delta_{ij}

Posterior at test point x=1.5x_* = 1.5:

  • Compute k=[k(x,x1),k(x,x2),k(x,x3)]k_* = [k(x_*, x_1), k(x_*, x_2), k(x_*, x_3)]
  • $\mu_* = k_*^\top K^{-1}\mathbf{y}$
  • σ2=k(x,x)kK1k\sigma_*^2 = k(x_*,x_*) - k_*^\top K^{-1} k_*

The posterior mean interpolates between the observations, and the posterior variance is small near observations (where the kernel provides strong correlation) and large away from them.
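A numerical sketch of this computation with the problem's numbers (plain NumPy; `np.linalg.solve` is used instead of an explicit inverse):

```python
import numpy as np

def k(a, b):
    # RBF kernel with the problem's convention k(x, x') = exp(-(x - x')^2)
    return np.exp(-(a[:, None] - b[None, :]) ** 2)

X = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, -0.5, 0.8])
K = k(X, X) + 0.01 * np.eye(3)         # noise variance on the diagonal

xs = np.array([1.5])
k_star = k(X, xs)                      # shape (3, 1)
mu_star = (k_star.T @ np.linalg.solve(K, y)).item()
var_star = (k(xs, xs) - k_star.T @ np.linalg.solve(K, k_star)).item()
print(f"posterior at x*=1.5: mean {mu_star:.3f}, std {np.sqrt(var_star):.3f}")
```

Note that $x_* = 1.5$ sits between observations of opposite sign, so the posterior mean there is pulled toward zero, while the posterior standard deviation stays well below the prior value of 1.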


Appendix H: Self-Assessment Checklist

After completing this section, verify you can:

  • State the definition of a stochastic process and give an example of each of the four combinations of discrete/continuous time and discrete/continuous state space
  • Define a filtration and explain what "adapted" means in ML context
  • Verify the martingale property for a new process by computing E[Mn+1Fn]\mathbb{E}[M_{n+1}|\mathcal{F}_n]
  • State all three conditions of Doob's OST and identify which applies to a given problem
  • Apply OST to compute expected stopping times and boundary hitting probabilities
  • Construct the Doob martingale for a given function ff and derive McDiarmid from Azuma
  • State the three defining properties of a Poisson process and derive the superposition/thinning results
  • State Wiener's four axioms for Brownian motion and compute P(Bta)P(B_t \geq a)
  • Compute the quadratic variation [B]T=T[B]_T = T and state Ito's lemma
  • Solve the OU SDE and state its stationary distribution
  • Connect the DDPM forward process to the VP-SDE and OU process
  • Define WSS and ergodicity, and state the Wiener-Khinchin theorem
  • Specify a GP by its mean function and kernel, and compute the posterior given observations
  • Explain why RoPE makes attention a stationary kernel in position
  • Model SGD as a near-martingale and state the diffusion SDE approximation

Appendix I: Connections to Other Areas

I.1 Information Theory (Section09)

  • The log-likelihood ratio process klog(p/q)(Xk)\sum_k \log(p/q)(X_k) is a supermartingale under QQ and a submartingale under PP
  • The sequential probability ratio test (SPRT) is an optimal stopping time for hypothesis testing, derived using the likelihood ratio martingale
  • The entropy rate $\bar{H} = \lim_n H(X_n|X_{n-1},\ldots,X_1)$ of a stationary process, and its Renyi generalisations, are central concepts at the intersection of stochastic processes and information theory

I.2 Functional Analysis (Section12)

  • The space $C([0,1])$ of continuous functions is a Banach space; Donsker's theorem is a weak-convergence statement on this space (rescaled random walks converge weakly to Wiener measure)
  • Gaussian processes are infinite-dimensional Gaussian distributions on function spaces (Hilbert spaces)
  • Reproducing Kernel Hilbert Spaces (RKHS): For each valid kernel kk, there is a unique RKHS Hk\mathcal{H}_k with the reproducing property f(x)=f,k(,x)Hkf(x) = \langle f, k(\cdot,x)\rangle_{\mathcal{H}_k}. GP regression in Hk\mathcal{H}_k is equivalent to minimising a regularised empirical risk in the RKHS

I.3 Optimisation (Section08)

  • Stochastic approximation (Robbins-Monro, 1951): The original framework for SGD convergence uses martingale theory. Under step sizes tηt=\sum_t \eta_t = \infty, tηt2<\sum_t \eta_t^2 < \infty and bounded variance, the iterates converge a.s.
  • Lyapunov methods: Proving convergence of SGD amounts to constructing a Lyapunov function V(θ)V(\theta) that is a (noisy) supermartingale, then applying martingale convergence
  • Langevin MCMC: The Langevin SDE dθ=U(θ)dt+2dBtd\theta = -\nabla U(\theta)\,dt + \sqrt{2}\,dB_t is used for posterior sampling in Bayesian DL, connecting optimisation and stochastic process theory

I.4 Statistics (Section07-Statistics)

  • Time series analysis (ARIMA, state-space models) is the statistical estimation side of stochastic processes: given observed data X1,,XnX_1,\ldots,X_n, estimate the model parameters of an assumed process
  • Spectral analysis uses the Wiener-Khinchin theorem to estimate the PSD from data via the periodogram
  • Hypothesis testing for stationarity (Augmented Dickey-Fuller test, KPSS test) checks whether an observed time series is consistent with a stationary process

Appendix J: Advanced Topics

J.1 Levy Processes

A Levy process generalises both Brownian motion (continuous paths) and Poisson process (jumps) to arbitrary combinations of drift, diffusion, and jumps.

Levy-Khinchin representation: Every Levy process XtX_t has characteristic function:

E[eiθXt]=exp ⁣(t[iμθσ2θ22+R{0}(eiθx1iθx1x<1)ν(dx)])\mathbb{E}[e^{i\theta X_t}] = \exp\!\left(t\left[i\mu\theta - \frac{\sigma^2\theta^2}{2} + \int_{\mathbb{R}\setminus\{0\}} (e^{i\theta x} - 1 - i\theta x\mathbf{1}_{|x|<1})\,\nu(dx)\right]\right)

where μR\mu \in \mathbb{R} (drift), σ20\sigma^2 \geq 0 (diffusion), and ν\nu is the Levy measure (jump intensity).

For AI: Heavy-tailed gradient distributions in neural network training are better modelled by Levy processes (with large jumps) than by Brownian motion (Gaussian, no jumps). The Levy stable distribution family includes the Gaussian (index α=2\alpha=2) and Cauchy (α=1\alpha=1) as special cases. Empirical studies of gradient statistics in transformers show heavy tails (Student-tt with low degrees of freedom), suggesting Levy process models may be more appropriate than the Gaussian SDE approximation.

J.2 Malliavin Calculus

Malliavin calculus is a variational calculus on Wiener space - it provides derivatives of random variables with respect to the underlying Brownian motion.

The Malliavin derivative DsF\mathcal{D}_s F measures the sensitivity of FF to the Brownian increment at time ss. It satisfies:

DsF=limε0F(B+ε1[s,))F(B)ε\mathcal{D}_s F = \lim_{\varepsilon\to 0}\frac{F(B + \varepsilon\mathbf{1}_{[s,\infty)}) - F(B)}{\varepsilon}

Clark-Ocone formula: For FL2F \in L^2 with E[F]=0\mathbb{E}[F]=0:

F=0TE[DsFFs]dBsF = \int_0^T \mathbb{E}[\mathcal{D}_s F | \mathcal{F}_s]\,dB_s

This represents any square-integrable random variable as a stochastic integral - connecting Malliavin calculus to the martingale representation theorem.

For AI: Malliavin calculus provides pathwise gradients for diffusion models, enabling gradient computation through the stochastic sampling process. This is used in score-based generative model training with more stable gradient estimates than the standard score-matching loss.

J.3 Stochastic Control and HJB Equation

A stochastic control problem seeks to find a policy ut(ω)u_t(\omega) (adapted to Ft\mathcal{F}_t) minimising:

J(u)=E ⁣[0TL(Xt,ut)dt+g(XT)]J(u) = \mathbb{E}\!\left[\int_0^T \mathcal{L}(X_t, u_t)\,dt + g(X_T)\right]

subject to dXt=b(Xt,ut)dt+σ(Xt,ut)dBtdX_t = b(X_t, u_t)\,dt + \sigma(X_t, u_t)\,dB_t.

The Hamilton-Jacobi-Bellman (HJB) equation gives the optimal value function V(x,t)=infuJ(u;x,t)V(x,t) = \inf_u J(u; x,t):

tV=minu ⁣[L(x,u)+b(x,u)xV+12tr(σσx2V)]-\partial_t V = \min_u\!\left[\mathcal{L}(x,u) + b(x,u)^\top\nabla_x V + \frac{1}{2}\text{tr}(\sigma\sigma^\top \nabla^2_x V)\right]

For AI: RLHF can be formulated as a stochastic control problem where the LLM generates tokens sequentially (the SDE), the reward is the RLHF reward model (the cost), and the KL penalty is a regularisation term. The HJB equation characterises the optimal RLHF policy, and PPO approximately solves it using neural network approximations of VV.


Appendix K: Python Recipes for Stochastic Processes

K.1 Simulating Stochastic Processes

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Simple Random Walk
def simulate_srw(n_steps, n_paths=5):
    steps = np.random.choice([-1, 1], size=(n_paths, n_steps))
    return np.hstack([np.zeros((n_paths,1)), steps.cumsum(axis=1)])

# Brownian Motion (via SRW limit)
def simulate_bm(T=1.0, n_steps=1000, n_paths=5):
    dt = T / n_steps
    increments = np.random.normal(0, np.sqrt(dt), (n_paths, n_steps))
    return np.hstack([np.zeros((n_paths,1)), increments.cumsum(axis=1)])

# Ornstein-Uhlenbeck (Euler-Maruyama)
def simulate_ou(theta=1.0, mu=0.0, sigma=1.0, T=10.0, n_steps=1000, n_paths=5):
    dt = T / n_steps
    X = np.zeros((n_paths, n_steps+1))
    X[:,0] = 2.0  # initial condition
    for t in range(n_steps):
        dB = np.random.normal(0, np.sqrt(dt), n_paths)
        X[:,t+1] = X[:,t] + theta*(mu - X[:,t])*dt + sigma*dB
    return X

# Poisson Process
def simulate_poisson(lam=1.0, T=10.0, n_paths=5):
    """Return the list of event times in [0, T) for each path."""
    paths = []
    for _ in range(n_paths):
        t, events = 0.0, []
        while True:
            t += np.random.exponential(1/lam)
            if t >= T:
                break  # discard the arrival that overshoots the horizon
            events.append(t)
        paths.append(events)
    return paths

# Geometric Brownian Motion
def simulate_gbm(mu=0.1, sigma=0.3, S0=100.0, T=1.0, n_steps=252, n_paths=5):
    dt = T / n_steps
    log_S = np.zeros((n_paths, n_steps+1))
    log_S[:,0] = np.log(S0)
    for t in range(n_steps):
        log_S[:,t+1] = log_S[:,t] + (mu - sigma**2/2)*dt + sigma*np.random.normal(0, np.sqrt(dt), n_paths)
    return np.exp(log_S)

K.2 Gaussian Process Posterior

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel."""
    dists = ((X1[:,None] - X2[None,:])**2).sum(-1)
    return variance * np.exp(-dists / (2 * length_scale**2))

def gp_posterior(X_train, y_train, X_test, kernel_fn, noise_var=1e-3):
    """Compute GP posterior mean and variance."""
    K_tt = kernel_fn(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_ts = kernel_fn(X_train, X_test)
    K_ss = kernel_fn(X_test, X_test)

    # Solve K_tt @ alpha = y_train using Cholesky
    cho = cho_factor(K_tt)
    alpha = cho_solve(cho, y_train)
    v = cho_solve(cho, K_ts)

    mu_post = K_ts.T @ alpha
    var_post = np.diag(K_ss) - (K_ts * v).sum(0)
    return mu_post, np.sqrt(np.maximum(var_post, 0))

# Usage:
# X_train = np.array([0., 1., 2.])
# y_train = np.array([1., -0.5, 0.8])
# X_test = np.linspace(-0.5, 2.5, 100)
# mu, std = gp_posterior(X_train, y_train, X_test, rbf_kernel)

K.3 Martingale Verification

import numpy as np

def verify_martingale(process_fn, n_steps=100, n_paths=10000, tol=0.05):
    """
    Check a necessary condition for the martingale property:
    E[M_n] = E[M_0] for every n (constant mean across time).
    Note: constant mean is necessary but not sufficient.
    process_fn: callable that generates an (n_paths, n_steps+1) array.
    """
    paths = process_fn(n_steps, n_paths)
    means = paths.mean(axis=0)
    assert np.allclose(means, means[0], atol=tol), f"Mean not constant: {means[:5]}"
    print(f"Mean at all steps: {means[0]:.4f} +/- {means.std():.4f}")
    print(f"PASS: martingale mean constant within tolerance {tol}")

def simulate_srw_paths(n_steps, n_paths):
    steps = np.random.choice([-1., 1.], size=(n_paths, n_steps))
    return np.hstack([np.zeros((n_paths,1)), steps.cumsum(axis=1)])

np.random.seed(42)
verify_martingale(simulate_srw_paths)

K.4 Diffusion Model Forward Process

import numpy as np

def ddpm_forward_process(x0, T=1000, beta_start=1e-4, beta_end=0.02):
    """
    DDPM forward process: q(x_t | x_0) = N(sqrt(alpha_bar_t)*x0, (1-alpha_bar_t)*I)
    Returns: (trajectory {t: (x_t, alpha_bar_t)} for t = 0..T, alpha_bars array)
    """
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1 - betas
    alpha_bars = np.cumprod(alphas)

    trajectory = {0: (x0, 1.0)}
    x0_arr = np.array(x0, dtype=float)

    for t in range(1, T+1):
        ab = alpha_bars[t-1]
        eps = np.random.randn(*x0_arr.shape)
        x_t = np.sqrt(ab) * x0_arr + np.sqrt(1 - ab) * eps
        trajectory[t] = (x_t, ab)
    return trajectory, alpha_bars

def snr(alpha_bar):
    """Signal-to-noise ratio: alpha_bar / (1 - alpha_bar)."""
    return alpha_bar / (1 - alpha_bar + 1e-9)

# Example:
# x0 = np.array([1.0, 0.0, -1.0])
# traj, alpha_bars = ddpm_forward_process(x0, T=1000)
# With this linear schedule, SNR crosses 1 around t ~ 260,
# and by t = 1000, SNR ~ 0 (pure noise)

K.5 DDPM Schedule Comparison

import numpy as np

def linear_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    return np.linspace(beta_start, beta_end, T)

def cosine_schedule(T=1000, s=0.008):
    t = np.arange(T+1) / T
    f = np.cos((t + s)/(1+s) * np.pi/2)**2
    alpha_bars = f / f[0]
    betas = 1 - alpha_bars[1:] / alpha_bars[:-1]
    return np.clip(betas, 0, 0.999)

T = 1000
betas_linear = linear_schedule(T)
betas_cosine = cosine_schedule(T)

ab_linear = np.cumprod(1 - betas_linear)
ab_cosine = np.cumprod(1 - betas_cosine)

print("Signal preserved at t=T/4, T/2, 3T/4, T:")
for name, ab in [("Linear", ab_linear), ("Cosine", ab_cosine)]:
    vals = [ab[T//4-1], ab[T//2-1], ab[3*T//4-1], ab[-1]]
    print(f"  {name}: {[f'{v:.4f}' for v in vals]}")
# Cosine schedule destroys signal more uniformly across timesteps

Appendix L: Theory Problems (Advanced)

L.1 Optional Stopping and Wald's Identity

Problem. Let X1,X2,X_1, X_2, \ldots be iid with E[Xi]=μ\mathbb{E}[X_i] = \mu and E[Xi2]=σ2+μ2\mathbb{E}[X_i^2] = \sigma^2 + \mu^2. Let τ\tau be a stopping time with E[τ]<\mathbb{E}[\tau] < \infty. Prove Wald's Identity:

\mathbb{E}[S_\tau] = \mu \cdot \mathbb{E}[\tau], \qquad \mathbb{E}\big[(S_\tau - \mu\tau)^2\big] = \sigma^2\,\mathbb{E}[\tau]

Hint: SnnμS_n - n\mu is a martingale. Apply OST. For the second moment, use (Snnμ)2nσ2(S_n - n\mu)^2 - n\sigma^2.
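Both identities can be checked by Monte Carlo (a sketch; the drift $\mu = 0.5$, the threshold 10, the sample size, and the seed are arbitrary choices). Here $\tau$ is the first time the sum exceeds 10, a stopping time with finite mean since $\mu > 0$; the second assertion checks $\mathbb{E}[(S_\tau - \mu\tau)^2] = \sigma^2\mathbb{E}[\tau]$, the identity for the hint's second martingale.

```python
import numpy as np

rng = np.random.default_rng(4)
mu_x, sigma_x = 0.5, 1.0
n_sims = 10_000

S_tau, tau = np.empty(n_sims), np.empty(n_sims)
for i in range(n_sims):
    s, n = 0.0, 0
    while s < 10.0:                    # tau = first n with S_n >= 10
        s += rng.normal(mu_x, sigma_x)
        n += 1
    S_tau[i], tau[i] = s, n

print(f"E[S_tau]             : {S_tau.mean():.3f}  vs  mu E[tau]      = {mu_x * tau.mean():.3f}")
print(f"E[(S_tau - mu tau)^2]: {((S_tau - mu_x * tau) ** 2).mean():.3f}  "
      f"vs  sigma^2 E[tau] = {sigma_x**2 * tau.mean():.3f}")
```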

L.2 The Optional Stopping Theorem: Necessary Conditions

Problem. Show that the condition E[τ]<\mathbb{E}[\tau] < \infty alone does NOT suffice for OST. Construct a martingale {Mn}\{M_n\} and stopping time τ\tau with E[τ]<\mathbb{E}[\tau] < \infty but E[Mτ]E[M0]\mathbb{E}[M_\tau] \neq \mathbb{E}[M_0].

Hint: For a martingale with bounded increments, $\mathbb{E}[\tau] < \infty$ does suffice, so the counterexample needs unbounded increments. Consider the double-or-nothing martingale $M_0 = 1$, $M_{n+1} = 2M_n$ or $0$ with probability $1/2$ each, and let $\tau$ be the first time $M_\tau = 0$; then $\mathbb{E}[\tau] = 2 < \infty$ but $\mathbb{E}[M_\tau] = 0 \neq 1 = \mathbb{E}[M_0]$.

L.3 Levy Characterisation of BM

Problem. Prove that if {Mt}\{M_t\} is a continuous local martingale starting at 0 with quadratic variation [M]t=t[M]_t = t, then MtM_t is a standard Brownian motion.

Hint: Use the exponential martingale eiθMt+θ2t/2e^{i\theta M_t + \theta^2 t/2} and show it is a martingale. Then compute the characteristic function of (Mt1,,Mtn)(M_{t_1},\ldots,M_{t_n}).

L.4 GP Posterior as Hilbert Space Projection

Problem. Show that the GP posterior mean μ(x)=kK1y\mu_*(x) = k_*^\top K^{-1}\mathbf{y} is the orthogonal projection of ff onto the span of {k(,x1),,k(,xn)}\{k(\cdot, x_1),\ldots,k(\cdot, x_n)\} in the RKHS Hk\mathcal{H}_k.

Hint: Use the reproducing property f(xi)=f,k(,xi)Hkf(x_i) = \langle f, k(\cdot,x_i)\rangle_{\mathcal{H}_k} to express the GP as a Hilbert space problem.


Appendix M: References and Further Reading

Textbooks

  1. Karatzas & Shreve, Brownian Motion and Stochastic Calculus (1991) - The standard reference for rigorous continuous-time stochastic processes, Ito calculus, and SDEs. Graduate-level.

  2. Durrett, Probability: Theory and Examples (5th ed., 2019) - Comprehensive coverage including martingales, stopping times, ergodic theory, Brownian motion. Excellent exercises.

  3. Williams, Probability with Martingales (1991) - Elegant, concise introduction to martingale theory. The clearest proof of martingale convergence. Highly recommended.

  4. Oksendal, Stochastic Differential Equations (7th ed., 2014) - Standard SDEs reference. Accessible treatment of Ito calculus and applications.

  5. Rasmussen & Williams, Gaussian Processes for Machine Learning (2006) - The canonical GP reference. Available free online.

  6. Doob, Stochastic Processes (1953) - The original monograph. Still valuable for historical context.

Papers for AI Connections

  1. Song et al. (2021), Score-Based Generative Modeling through Stochastic Differential Equations - Score SDE framework unifying DDPM, SMLD.

  2. Mandt, Hoffman & Blei (2017), Stochastic Gradient Descent as Approximate Bayesian Inference - SDE approximation of SGD.

  3. Su et al. (2021), RoFormer: Enhanced Transformer with Rotary Position Embedding - RoPE as stationary kernel.

  4. Ho, Jain & Abbeel (2020), Denoising Diffusion Probabilistic Models - DDPM.

  5. Li et al. (2017), Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms - SDE theory for SGD.

  6. Song et al. (2023), Consistency Models - Distillation of diffusion models via probability flow ODE.

  7. Tropp (2015), An Introduction to Matrix Concentration Inequalities - Matrix Bernstein, extends scalar martingale results.

  8. Sutton & Barto (2018), Reinforcement Learning: An Introduction (2nd ed.) - RL value functions and TD errors as martingales; Chapter 6 on TD-learning.


Appendix N: Extended Proofs

N.1 Proof of Doob's Maximal Inequality

Theorem. For a non-negative submartingale {Mk}k=0n\{M_k\}_{k=0}^n and λ>0\lambda > 0:

λP ⁣(max0knMkλ)E[Mn]\lambda \cdot P\!\left(\max_{0 \leq k \leq n} M_k \geq \lambda\right) \leq \mathbb{E}[M_n]

Proof. Let $\tau = \min\{k \leq n : M_k \geq \lambda\}$ (the first time the process reaches $\lambda$), with $\tau = \infty$ if no crossing occurs by time $n$. Then $\tau$ is a stopping time (the event $\{\tau = k\}$ is $\mathcal{F}_k$-measurable), and $\{\tau \leq n\} = \{\max_{0 \leq k \leq n} M_k \geq \lambda\}$.

Decompose the expectation E[Mn]\mathbb{E}[M_n]:

E[Mn]=E[Mn1τn]+E[Mn1τ>n]\mathbb{E}[M_n] = \mathbb{E}[M_n \mathbf{1}_{\tau \leq n}] + \mathbb{E}[M_n \mathbf{1}_{\tau > n}]

For the first term: since {Mk}\{M_k\} is a submartingale, E[MnFτ]Mτλ\mathbb{E}[M_n | \mathcal{F}_\tau] \geq M_\tau \geq \lambda on {τn}\{\tau \leq n\}:

E[Mn1τn]=E[E[MnFτ]1τn]λP(τn)=λP ⁣(maxkMkλ)\mathbb{E}[M_n \mathbf{1}_{\tau \leq n}] = \mathbb{E}[\mathbb{E}[M_n|\mathcal{F}_\tau]\mathbf{1}_{\tau \leq n}] \geq \lambda \cdot P(\tau \leq n) = \lambda \cdot P\!\left(\max_k M_k \geq \lambda\right)

Since E[Mn1τ>n]0\mathbb{E}[M_n \mathbf{1}_{\tau > n}] \geq 0:

E[Mn]λP ⁣(maxkMkλ)\mathbb{E}[M_n] \geq \lambda P\!\left(\max_k M_k \geq \lambda\right)

\square

Corollary. Applying to {Mk}\{|M_k|\} (non-negative submartingale since f(x)=xf(x)=|x| is convex):

P ⁣(max0knMkλ)E[Mn]λP\!\left(\max_{0 \leq k \leq n}|M_k| \geq \lambda\right) \leq \frac{\mathbb{E}[|M_n|]}{\lambda}
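The corollary can be checked by Monte Carlo on the SRW, whose absolute value is a non-negative submartingale (a sketch; $\lambda = 15$, $n = 100$, and the path count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n_steps, n_paths, lam = 100, 50_000, 15.0

steps = rng.choice([-1.0, 1.0], size=(n_paths, n_steps))
S = steps.cumsum(axis=1)                      # SRW paths; |S_k| is a submartingale
lhs = (np.abs(S).max(axis=1) >= lam).mean()   # P(max_k |S_k| >= lambda)
rhs = np.abs(S[:, -1]).mean() / lam           # E[|S_n|] / lambda
print(f"P(max|S_k| >= {lam:.0f}) = {lhs:.4f}  <=  E[|S_n|]/lambda = {rhs:.4f}")
```

The bound is not tight here, but note how much better it is than a naive union bound over the $n$ time steps.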

N.2 Proof of Azuma's Inequality

Theorem (Azuma-Hoeffding). Let {Mk}k=0n\{M_k\}_{k=0}^n be a martingale with MkMk1ck|M_k - M_{k-1}| \leq c_k a.s. Then:

P(MnM0t)exp ⁣(t22kck2)P(M_n - M_0 \geq t) \leq \exp\!\left(\frac{-t^2}{2\sum_k c_k^2}\right)

Proof. Let Dk=MkMk1D_k = M_k - M_{k-1} be the martingale differences. Since E[DkFk1]=0\mathbb{E}[D_k|\mathcal{F}_{k-1}] = 0 and Dkck|D_k| \leq c_k, by Hoeffding's lemma:

E[esDkFk1]es2ck2/2\mathbb{E}[e^{sD_k}|\mathcal{F}_{k-1}] \leq e^{s^2c_k^2/2}

By the Markov inequality for es(MnM0)e^{s(M_n - M_0)} (for any s>0s > 0):

P(MnM0t)estE[es(MnM0)]P(M_n - M_0 \geq t) \leq e^{-st}\mathbb{E}[e^{s(M_n-M_0)}]

Factor the MGF by conditioning on $\mathcal{F}_{n-1}$ and applying the tower property:

E[eskDk]=E ⁣[kesDk]=E ⁣[k=1n1esDkE[esDnFn1]]E ⁣[k=1n1esDk]es2cn2/2\mathbb{E}[e^{s\sum_k D_k}] = \mathbb{E}\!\left[\prod_k e^{sD_k}\right] = \mathbb{E}\!\left[\prod_{k=1}^{n-1}e^{sD_k}\mathbb{E}[e^{sD_n}|\mathcal{F}_{n-1}]\right] \leq \mathbb{E}\!\left[\prod_{k=1}^{n-1}e^{sD_k}\right] \cdot e^{s^2c_n^2/2}

Iterating: E[es(MnM0)]es2kck2/2\mathbb{E}[e^{s(M_n-M_0)}] \leq e^{s^2\sum_k c_k^2/2}.

Combining:

P(MnM0t)est+s2ck2/2P(M_n - M_0 \geq t) \leq e^{-st + s^2\sum c_k^2/2}

Optimise over s>0s > 0: set s=t/ck2s = t/\sum c_k^2 to get:

P(MnM0t)exp ⁣(t22ck2)P(M_n - M_0 \geq t) \leq \exp\!\left(\frac{-t^2}{2\sum c_k^2}\right)

\square

N.3 Proof of Polya's Theorem (d=1d=1: Recurrence)

Theorem. The SRW on Z\mathbb{Z} is recurrent.

Proof. It suffices to show the expected number of returns to 0 is infinite: n=1P(S2n=0)=\sum_{n=1}^\infty P(S_{2n}=0) = \infty.

By Stirling's approximation (2nn)4n/πn\binom{2n}{n} \sim 4^n/\sqrt{\pi n}:

P(S2n=0)=(2nn)4n1πnP(S_{2n}=0) = \binom{2n}{n}4^{-n} \sim \frac{1}{\sqrt{\pi n}}

Therefore n=1P(S2n=0)n=11πn=\sum_{n=1}^\infty P(S_{2n}=0) \sim \sum_{n=1}^\infty \frac{1}{\sqrt{\pi n}} = \infty.

The events $\{S_{2n}=0\}$ are not independent (they share steps), so the second Borel-Cantelli lemma does not apply directly. Instead, let $p = P(\text{return to } 0)$. By the strong Markov property, each return restarts the walk, so if $p < 1$ the number of returns would be geometric with mean $p/(1-p) < \infty$, contradicting the infinite expected number of returns. Hence $p = 1$, and iterating the strong Markov property gives $P(\text{return to 0 infinitely often}) = 1$.

Note: In d3d \geq 3, P(S2n=0)Cd/nd/2P(S_{2n}=0) \sim C_d/n^{d/2} by CLT in Rd\mathbb{R}^d, and Cd/nd/2<\sum C_d/n^{d/2} < \infty for d3d \geq 3 (since d/2>1d/2 > 1). By Borel-Cantelli (first lemma): returns occur finitely often, so the walk is transient. \square
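The key asymptotic $P(S_{2n}=0) \sim 1/\sqrt{\pi n}$, and the divergence of its partial sums, can be checked numerically (evaluating the binomial coefficient in log-space avoids float overflow for large $n$):

```python
from math import exp, lgamma, log, pi, sqrt

def p_return(n):
    """P(S_{2n} = 0) = C(2n, n) / 4^n, computed in log-space to avoid overflow."""
    return exp(lgamma(2 * n + 1) - 2 * lgamma(n + 1) - 2 * n * log(2))

for n in (10, 100, 1000):
    print(f"n={n:4d}: P(S_2n=0) = {p_return(n):.6f}, 1/sqrt(pi n) = {1 / sqrt(pi * n):.6f}")

# The partial sums of the return probabilities diverge like 2 sqrt(N/pi) (recurrence):
partial = sum(p_return(n) for n in range(1, 10001))
print(f"sum over n <= 10000: {partial:.1f}  (compare 2 sqrt(10000/pi) = {2 * sqrt(10000 / pi):.1f})")
```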

N.4 Proof that the OU Process Has the Correct Stationary Distribution

Theorem. The stationary distribution of dX=θXdt+σdBdX = -\theta X\,dt + \sigma\,dB (μ=0\mu=0) is N(0,σ2/(2θ))\mathcal{N}(0, \sigma^2/(2\theta)).

Proof via Fokker-Planck equation. The probability density p(x,t)p(x,t) satisfies:

tp=x((θx)p)+σ22x2p=θx(xp)+σ22x2p\partial_t p = -\partial_x((-\theta x)p) + \frac{\sigma^2}{2}\partial_x^2 p = \theta\partial_x(xp) + \frac{\sigma^2}{2}\partial_x^2 p

At stationarity tp=0\partial_t p = 0:

0 = \theta\partial_x(xp) + \frac{\sigma^2}{2}\partial_x^2 p

Expanding $\partial_x(xp) = p + x\,\partial_x p$:

0 = \theta(p + x\partial_x p) + \frac{\sigma^2}{2}\partial_x^2 p

Try p(x)=Z1exp(θx2/σ2)p(x) = Z^{-1}\exp(-\theta x^2/\sigma^2):

xp=2θxσ2p,x2p=(2θσ2+4θ2x2σ4)p\partial_x p = -\frac{2\theta x}{\sigma^2}p, \quad \partial_x^2 p = \left(-\frac{2\theta}{\sigma^2} + \frac{4\theta^2 x^2}{\sigma^4}\right)p

Substituting: $\theta(1 - 2\theta x^2/\sigma^2)p + \frac{\sigma^2}{2}(-2\theta/\sigma^2 + 4\theta^2x^2/\sigma^4)p = [\theta - 2\theta^2x^2/\sigma^2 - \theta + 2\theta^2x^2/\sigma^2]p = 0$, as required.

The normalisation gives p(x)=N(0,σ2/(2θ))p_\infty(x) = \mathcal{N}(0, \sigma^2/(2\theta)). \square


Appendix O: Spectral Theory of Stochastic Processes

O.1 Spectral Representation

Every WSS process {Xt}\{X_t\} with absolutely integrable ACVF admits the spectral representation:

Xt=eiωtdZ(ω)X_t = \int_{-\infty}^\infty e^{i\omega t}\,dZ(\omega)

where Z(ω)Z(\omega) is an orthogonal increment process (the spectral process) satisfying E[dZ(ω)dZ(ω)]=S(ω)δ(ωω)dω\mathbb{E}[dZ(\omega)\overline{dZ(\omega')}] = S(\omega)\delta(\omega-\omega')\,d\omega.

This is the stochastic analogue of the Fourier transform: a stationary process decomposes into uncorrelated frequency components, each contributing variance S(ω)dωS(\omega)\,d\omega in the frequency band [ω,ω+dω][\omega, \omega+d\omega].

For AI: The power spectral density of a token sequence determines how information is distributed across "frequencies" (temporal scales). A language model with strong long-range dependencies (like GPT-4 modelling coherent long documents) will have a PSD with high power at low frequencies. Attention head specialisation can be viewed as learning frequency-selective filters - heads that attend to nearby tokens (high frequency) vs. distant tokens (low frequency).

O.2 Spectral Gap and Mixing

For a stationary Markov chain with transition operator PP, the spectral gap γ=1λ2\gamma = 1 - |\lambda_2| (where λ2\lambda_2 is the second-largest eigenvalue of PP) determines the mixing time tmix1/γt_{\text{mix}} \sim 1/\gamma.

  • Large spectral gap -> fast mixing -> samples become independent quickly
  • Small spectral gap -> slow mixing -> long-range correlations persist

For transformers: The attention pattern at each layer can be viewed as a stochastic matrix ART×TA \in \mathbb{R}^{T\times T} (for sequence length TT). The spectral gap of AA determines how quickly information "mixes" across the sequence. Heads with nearly uniform attention (large spectral gap) aggregate globally; heads with peaked attention (small spectral gap) attend locally.
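The relation between the spectral gap and the speed of convergence can be illustrated on the smallest possible example, a two-state chain (the rates $a, b$ are arbitrary choices). Starting from a point mass, the total-variation distance to stationarity decays exactly like $|\lambda_2|^t$:

```python
import numpy as np

a, b = 0.1, 0.2
P = np.array([[1 - a, a], [b, 1 - b]])
pi_stat = np.array([b, a]) / (a + b)     # stationary distribution (2/3, 1/3)
lam2 = 1 - a - b                         # second eigenvalue; spectral gap = a + b

dist = np.array([1.0, 0.0])              # start deterministically in state 0
for t in range(1, 31):
    dist = dist @ P
    tv = 0.5 * np.abs(dist - pi_stat).sum()
    if t in (1, 10, 30):
        print(f"t={t:2d}: TV distance {tv:.2e}   |lam2|^t = {abs(lam2) ** t:.2e}")
```

For the two-state chain the decay is exact, with constant $\text{TV}(0)$: the initial error vector $(1/3, -1/3)$ is proportional to the left eigenvector of $\lambda_2$.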


Appendix P: Decision Framework for Choosing Process Models

CHOOSE THE RIGHT STOCHASTIC PROCESS MODEL
========================================================================

  Is the system evolving over time?
  +-- No  -> Use static probability (Section01-Section05)
  +-- Yes ->
        Does the state space have a natural structure?
        +-- Countable (e.g., integer counts, categories)
        |   +-- Discrete time -> Markov chain (Section07)
        |   +-- Continuous time -> Poisson or birth-death process
        +-- Continuous (e.g., parameter vector, embedding)
            +-- Does it have independent increments?
            |   +-- Yes, with Gaussian increments -> Brownian motion
            |   +-- Yes, with jump structure -> Poisson / Levy
            |   +-- No, but stationary -> ARMA or OU process
            +-- Does it revert toward an attractor (mean reversion)?
            |   +-- Yes -> OU process or SDE with drift
            +-- Does it only depend on the current state?
                +-- Yes -> Markov process (continuous time)
                +-- No, depends on full history -> GP or ARIMA

========================================================================

  AI Application -> Recommended Model
  ---------------------------------------------------------------------
  SGD parameter trajectory     -> OU SDE (diffusion approx)
  Mini-batch gradient noise    -> Martingale difference sequence
  Diffusion model forward      -> VP-SDE (discrete OU)
  LLM token generation         -> Markov chain (Section07)
  Request arrivals             -> Poisson process
  Hyperparameter optimisation  -> Gaussian process (Bayesian opt)
  RL value function            -> Martingale transform
  Training loss trajectory     -> Supermartingale
  NTK / infinite-width NN      -> Gaussian process
  Residual stream updates      -> Additive noise SDE
  ---------------------------------------------------------------------

Appendix Q: Connections to Measure Theory

Q.1 Stochastic Processes as Measurable Functions

A stochastic process {Xt}tT\{X_t\}_{t \in T} on (Ω,F,P)(\Omega, \mathcal{F}, P) can be viewed as a measurable function from Ω\Omega to the path space RT\mathbb{R}^T. The probabilistic properties of the process are encoded in the pushforward measure PX1P \circ X^{-1} on RT\mathbb{R}^T.

For Brownian motion: The path space is C([0,))C([0,\infty)) (continuous functions). The Wiener measure WW is the unique probability measure on C([0,))C([0,\infty)) such that Bt(ω)=ω(t)B_t(\omega) = \omega(t) (coordinate process) satisfies the four BM axioms. Constructing WW rigorously is Wiener's 1923 achievement.

Q.2 The Radon-Nikodym Theorem and Change of Measure

If PP and QQ are probability measures on (Ω,F)(\Omega, \mathcal{F}) with QPQ \ll P (Q is absolutely continuous wrt P), then there exists the Radon-Nikodym derivative dQ/dP=ZdQ/dP = Z such that Q(A)=EP[Z1A]Q(A) = \mathbb{E}^P[Z\mathbf{1}_A] for all AA.

Girsanov's theorem: Under mild conditions, if BtB_t is a BM under PP and θt\theta_t is adapted, then:

B~t=Bt+0tθsds\tilde{B}_t = B_t + \int_0^t \theta_s\,ds

is a BM under QQ where dQ/dP=exp(0TθtdBt120Tθt2dt)dQ/dP = \exp(-\int_0^T \theta_t\,dB_t - \frac{1}{2}\int_0^T\theta_t^2\,dt) (the Girsanov kernel).

For diffusion models: Girsanov's theorem is the mathematical basis for the reverse diffusion process. The reverse SDE under the data distribution PP corresponds to a change of measure from the noise distribution QQ (standard BM), with the Radon-Nikodym derivative involving the score function logpt\nabla\log p_t. This makes the score-matching objective the likelihood maximisation objective under the reverse measure.


Appendix R: Supplementary: Concentration and Martingales Revisited

R.1 The Azuma-McDiarmid Connection - Full Detail

In Section3.5 we introduced the Doob martingale construction as the proof of McDiarmid's inequality. Here we make the connection fully explicit.

Setup. Let X1,,XnX_1, \ldots, X_n be independent random variables and f:XnRf: \mathcal{X}^n \to \mathbb{R} satisfy the bounded differences condition: for each kk,

supx1,,xn,xkf(x1,,xk,,xn)f(x1,,xk,,xn)ck\sup_{x_1,\ldots,x_n, x_k'} |f(x_1,\ldots,x_k,\ldots,x_n) - f(x_1,\ldots,x_k',\ldots,x_n)| \leq c_k

Step 1: Doob martingale. Define Mk=E[fX1,,Xk]M_k = \mathbb{E}[f | X_1,\ldots,X_k] for k=0,,nk=0,\ldots,n. Then M0=E[f]M_0 = \mathbb{E}[f], Mn=fM_n = f, and {Mk}\{M_k\} is a martingale.

Step 2: Bounded differences. The martingale difference Dk=MkMk1D_k = M_k - M_{k-1} satisfies Dkck|D_k| \leq c_k a.s. because:

|M_k - M_{k-1}| = \left|\mathbb{E}[f\,|\,X_1,\ldots,X_k] - \mathbb{E}[f\,|\,X_1,\ldots,X_{k-1}]\right| \leq \sup_{x_k, x_k'} \mathbb{E}\!\left[\,\left|f(\ldots,x_k,\ldots) - f(\ldots,x_k',\ldots)\right| \,\middle|\, X_{1:k-1}\right] \leq c_k

Step 3: Azuma. Apply Azuma's inequality (proved in SectionN.2) to the martingale {Mk}\{M_k\}:

P(fE[f]t)=P(MnM0t)exp ⁣(t22ck2)P(f - \mathbb{E}[f] \geq t) = P(M_n - M_0 \geq t) \leq \exp\!\left(\frac{-t^2}{2\sum c_k^2}\right)

This IS McDiarmid's inequality. The concentration inequality of Section05 is a consequence of martingale theory.

R.2 The Freedman Inequality

Azuma's inequality uses only the almost-sure bound Dkck|D_k| \leq c_k. Freedman's inequality (1975) also uses conditional variances, giving a Bernstein-type improvement:

Theorem (Freedman, 1975). Let {Mn}\{M_n\} be a martingale with DkM|D_k| \leq M a.s. Define Wn=k=1nE[Dk2Fk1]W_n = \sum_{k=1}^n \mathbb{E}[D_k^2|\mathcal{F}_{k-1}] (total conditional variance). Then for t,v>0t, v > 0:

P(MnM0t,Wnv)exp ⁣(t2/2v+Mt/3)P(M_n - M_0 \geq t, W_n \leq v) \leq \exp\!\left(\frac{-t^2/2}{v + Mt/3}\right)

This is a martingale version of Bernstein's inequality. The event WnvW_n \leq v restricts to "low variance" trajectories. Unconditionally:

P(MnM0t)exp ⁣(t2/2V+Mt/3) where V=supωWn(ω)P(M_n - M_0 \geq t) \leq \exp\!\left(\frac{-t^2/2}{V + Mt/3}\right) \text{ where } V = \sup_\omega W_n(\omega)

For AI: Freedman's inequality gives tighter bounds for SGD convergence when the conditional gradient variance is small. In training phases where gradients are large in norm but consistent in direction (so the accumulated conditional variance $v$ is small relative to the worst-case bound), Freedman's $e^{-t^2/(2v)}$-type bound is tighter than Azuma's $e^{-t^2/(2nM^2)}$.


Appendix S: Practice Problems (All Levels)

Level * (Computation)

S.1. Let Mn=2SnM_n = 2^{S_n} where SnS_n is the SRW. For what value of pp in the biased walk (P(+1)=pP(+1)=p) is Mn=2SnM_n = 2^{S_n} a martingale?

Answer: E[2Sn+1Sn]=2Sn(2p+(1p)/2)\mathbb{E}[2^{S_{n+1}}|S_n] = 2^{S_n}(2p + (1-p)/2). Set equal to 2Sn2^{S_n}: 2p+(1p)/2=12p + (1-p)/2 = 1, so p=1/3p = 1/3.

S.2. A Poisson process has rate λ=3\lambda = 3 (events per hour). Find P(N(2)5)P(N(2) \geq 5) and the expected time to the 4th event.

Answer: N(2)Poisson(6)N(2) \sim \text{Poisson}(6), so P(N(2)5)=1k=04e66k/k!0.715P(N(2)\geq 5) = 1 - \sum_{k=0}^4 e^{-6}6^k/k! \approx 0.715. Time to 4th event: T4Gamma(4,1/3)T_4 \sim \text{Gamma}(4, 1/3), E[T4]=4/3\mathbb{E}[T_4] = 4/3 hours.

S.3. For the OU process with θ=1\theta=1, μ=2\mu=2, σ=2\sigma=\sqrt{2}, X0=0X_0=0: compute E[X3]\mathbb{E}[X_3] and Var(X3)\text{Var}(X_3).

Answer: E[X3]=2+(02)e3=22e31.9\mathbb{E}[X_3] = 2 + (0-2)e^{-3} = 2 - 2e^{-3} \approx 1.9. Var(X3)=22(1e6)=1e60.9975\text{Var}(X_3) = \frac{2}{2}(1-e^{-6}) = 1-e^{-6} \approx 0.9975.

Level ** (Theory)

S.4. Show that for a martingale {Mn}\{M_n\}, the sequence Mn2[M]nM_n^2 - [M]_n is also a martingale, where [M]n=k=1nE[Dk2Fk1][M]_n = \sum_{k=1}^n \mathbb{E}[D_k^2|\mathcal{F}_{k-1}] is the predictable quadratic variation.

S.5. Prove that if ff is convex and {Mn}\{M_n\} is a martingale, then {f(Mn)}\{f(M_n)\} is a submartingale (using Jensen's inequality).

S.6. For the compound Poisson process X(t)=k=1N(t)YkX(t) = \sum_{k=1}^{N(t)} Y_k with λ=2\lambda = 2, YkExp(3)Y_k \sim \text{Exp}(3): compute the MGF MX(t)(s)M_{X(t)}(s) and use it to find E[X(t)]\mathbb{E}[X(t)] and Var(X(t))\text{Var}(X(t)).
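A simulation sketch for S.6, taking $\text{Exp}(3)$ to mean rate 3 (so $\mathbb{E}[Y] = 1/3$) and $t = 5$ as an arbitrary horizon. The MGF $M_{X(t)}(s) = \exp(\lambda t(M_Y(s) - 1))$ yields $\mathbb{E}[X(t)] = \lambda t\,\mathbb{E}[Y]$ and $\text{Var}(X(t)) = \lambda t\,\mathbb{E}[Y^2]$, which the Monte Carlo estimates should match:

```python
import random

random.seed(2)
lam, rate_Y, t = 2.0, 3.0, 5.0   # assuming Exp(3) means rate 3 (mean 1/3)
n_paths = 100_000

samples = []
for _ in range(n_paths):
    # Sample N(t) by accumulating exponential inter-arrival times at rate lam
    n_events, s = 0, random.expovariate(lam)
    while s <= t:
        n_events += 1
        s += random.expovariate(lam)
    # X(t) = sum of N(t) iid Exp(rate_Y) jumps
    samples.append(sum(random.expovariate(rate_Y) for _ in range(n_events)))

mean = sum(samples) / n_paths
var = sum((x - mean) ** 2 for x in samples) / n_paths
# E[X(t)] = lam*t*E[Y] = 2*5/3; Var(X(t)) = lam*t*E[Y^2] = 2*5*(2/9)
print(f"mean {mean:.3f} vs {lam * t / rate_Y:.3f}")
print(f"var  {var:.3f} vs {lam * t * 2 / rate_Y**2:.3f}")
```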

Level *** (AI Applications)

S.7. (Diffusion model noise schedule design) You want a noise schedule βt\beta_t for the DDPM forward process such that the SNR αt/(1αt)\alpha_t/(1-\alpha_t) decreases linearly from SNR0=100\text{SNR}_0 = 100 to SNRT=0.01\text{SNR}_T = 0.01 over T=1000T=1000 steps. Derive the required βt\beta_t and compare to the linear schedule.

S.8. (SGD stationary distribution) In a 2D quadratic loss L(θ)=12θHθL(\theta) = \frac{1}{2}\theta^\top H\theta with H=diag(1,10)H = \text{diag}(1, 10), SGD with step η\eta and gradient noise Σ=I\Sigma = I has stationary distribution N(0,η2H1)\mathcal{N}(0, \frac{\eta}{2}H^{-1}). Show that the ratio of the standard deviations in the two directions is 10\sqrt{10}, and interpret this for optimisation.
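A simulation sketch for S.8 (the step size, run length, and seed are arbitrary choices; the empirical ratio matches $\sqrt{10}$ only up to $O(\eta)$ discretisation corrections):

```python
import math, random

random.seed(3)
eta = 0.01
H = (1.0, 10.0)   # Hessian eigenvalues (diagonal case)
theta = [0.0, 0.0]
burn_in, n_steps = 20_000, 200_000

sums = [0.0, 0.0]
for step in range(burn_in + n_steps):
    for i in (0, 1):
        # SGD step with gradient H*theta plus N(0, I) gradient noise
        theta[i] -= eta * (H[i] * theta[i] + random.gauss(0, 1))
    if step >= burn_in:
        for i in (0, 1):
            sums[i] += theta[i] ** 2

stds = [math.sqrt(s / n_steps) for s in sums]
print(f"std ratio: {stds[0] / stds[1]:.3f} vs predicted sqrt(10) = {math.sqrt(10):.3f}")
```

The low-curvature direction fluctuates more: SGD noise equilibrates to larger variance where the loss is flatter, which is why small-step SGD tends to linger in (and thus prefer) flat directions of the landscape.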

S.9. (Martingale in RLHF) In PPO with KL-regularised reward rθ(s,a)=rϕ(s,a)βlog(πθ(as)/πref(as))r_\theta(s,a) = r_\phi(s,a) - \beta\log(\pi_\theta(a|s)/\pi_{\text{ref}}(a|s)), show that the KL penalty term creates a non-martingale perturbation to the policy gradient estimator. Under what conditions on β\beta does the perturbation become negligible?


Appendix T: Theorem Index

| Theorem | Statement | Section |
| --- | --- | --- |
| Doob's OST | $\mathbb{E}[M_\tau]=\mathbb{E}[M_0]$ under UI or boundedness conditions | Section 3.3 |
| Doob's Maximal Inequality | $\lambda P(\max_{k \leq n} M_k \geq \lambda) \leq \mathbb{E}[M_n^+]$ | Section 3.4, App. N |
| Martingale Convergence | $L^1$-bounded martingale converges a.s. | Section 3.4 |
| Azuma's Inequality | Bounded differences $\Rightarrow e^{-t^2/(2\sum c_k^2)}$ | App. N |
| McDiarmid's Inequality | Bounded-differences function $\Rightarrow$ Azuma on the Doob martingale | Section 3.5 |
| Freedman's Inequality | Martingale Bernstein bound with conditional variance | App. R |
| Polya's Theorem | SRW on $\mathbb{Z}^d$: recurrent iff $d \leq 2$ | Section 4.2, App. N |
| Donsker's Theorem | SRW $\Rightarrow$ BM in $C([0,1])$ | Section 6.3, App. E |
| Levy Characterisation | Continuous martingale with QV $= t$ $\Rightarrow$ BM | Section 6.2 |
| Ito's Lemma | $df(X_t) = f'\,dX + \frac{1}{2}f''\,(dX)^2$ | App. A |
| Wiener-Khinchin | ACVF $\leftrightarrow$ PSD via the Fourier transform | Section 7.3 |
| Bochner's Theorem | ACVF is a valid covariance iff its PSD is non-negative | Section 7.3 |
| Birkhoff Ergodic | Time average = ensemble average a.s. | Section 7.2 |
| Girsanov's Theorem | Change of drift = change of measure | App. Q |
| Fokker-Planck | PDE for the density of an SDE | App. N |
| Anderson's Reverse SDE | Reverse SDE = forward SDE + score term | Section 6.6 |
| Clark-Ocone Formula | Malliavin representation on Wiener space | App. J |
| Wald's Identity | $\mathbb{E}[S_\tau] = \mu\,\mathbb{E}[\tau]$ | App. L |

Appendix U: Continuous-Time Martingales

U.1 Local Martingales

In continuous time, the natural generalisation of a martingale is a local martingale: a process $\{M_t\}$ for which there exist stopping times $\tau_n \uparrow \infty$ such that each stopped process $M_{t \wedge \tau_n}$ is a martingale.

Every continuous local martingale with M0=0M_0 = 0 can be written as a time-changed Brownian motion (by the Dambis-Dubins-Schwarz theorem):

Mt=B[M]tM_t = B_{[M]_t}

where BB is a standard BM and [M]t[M]_t is the quadratic variation of MM.

The Ito integral 0tHsdBs\int_0^t H_s\,dB_s is a local martingale (and a true martingale when E[0THs2ds]<\mathbb{E}[\int_0^T H_s^2\,ds] < \infty).

U.2 The Martingale Representation Theorem

Theorem (Martingale Representation). Every square-integrable martingale {Mt}\{M_t\} adapted to the Brownian filtration can be written as:

Mt=M0+0tHsdBsM_t = M_0 + \int_0^t H_s\,dB_s

for some adapted process HsH_s with E[0THs2ds]<\mathbb{E}[\int_0^T H_s^2\,ds] < \infty.

For AI: In score-based diffusion model training, the training loss can be expressed as an Ito integral against the Wiener process of the forward noise. The neural network effectively learns the integrand Hs(xs)H_s(x_s) that represents the reverse process. The martingale representation theorem guarantees that such a representation exists - it is the theoretical justification for why neural networks can learn to "undo" Brownian noise.

U.3 Feynman-Kac Formula

For the SDE dXt=b(Xt)dt+σ(Xt)dBtdX_t = b(X_t)\,dt + \sigma(X_t)\,dB_t and function ff, the Feynman-Kac formula relates the SDE to a PDE:

u(x,t)=E ⁣[f(XT)exp ⁣(tTr(Xs)ds)Xt=x]u(x,t) = \mathbb{E}\!\left[f(X_T)\exp\!\left(-\int_t^T r(X_s)\,ds\right)\,\Big|\, X_t = x\right]

satisfies the PDE tu=bxu+σ22xxuru-\partial_t u = b\partial_x u + \frac{\sigma^2}{2}\partial_{xx}u - ru with terminal condition u(x,T)=f(x)u(x,T) = f(x).
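A minimal Monte Carlo check of Feynman-Kac (a sketch with illustrative choices: zero drift, unit diffusion, constant discount $r$, and $f(x) = x^2$, for which $u(x,t) = e^{-r(T-t)}(x^2 + (T-t))$ is available in closed form):

```python
import math, random

random.seed(4)
r, T, t, x = 0.1, 1.0, 0.0, 0.5
f = lambda y: y * y

# Feynman-Kac for dX = dB (b = 0, sigma = 1) and constant discount r:
#   u(x,t) = e^{-r(T-t)} * E[f(x + B_{T-t})]
# Analytically E[(x + B_{T-t})^2] = x^2 + (T - t), which solves the PDE
#   -du/dt = (1/2) d2u/dx2 - r*u   with   u(x,T) = x^2.
n_paths = 200_000
tau = T - t
mc = sum(f(x + math.sqrt(tau) * random.gauss(0, 1)) for _ in range(n_paths)) / n_paths
u_mc = math.exp(-r * tau) * mc
u_exact = math.exp(-r * tau) * (x * x + tau)
print(f"Monte Carlo u = {u_mc:.4f}, exact u = {u_exact:.4f}")
```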

For AI: The score function in diffusion models satisfies a backward Kolmogorov equation (the PDE form of the reverse SDE). Feynman-Kac connects this PDE to the expectation form $\nabla\log p_t(x_t) = \mathbb{E}\!\left[\frac{\sqrt{\bar\alpha_t}\,x_0 - x_t}{1-\bar\alpha_t}\,\Big|\,x_t\right]$, which is directly implemented by the neural denoiser.


Appendix V: Numerical Methods for SDEs

V.1 Euler-Maruyama Method

The simplest numerical scheme for dX=b(X,t)dt+σ(X,t)dBtdX = b(X,t)\,dt + \sigma(X,t)\,dB_t:

Xn+1=Xn+b(Xn,tn)Δt+σ(Xn,tn)ΔBnX_{n+1} = X_n + b(X_n, t_n)\Delta t + \sigma(X_n, t_n)\Delta B_n

where ΔBnN(0,Δt)\Delta B_n \sim \mathcal{N}(0, \Delta t).

Convergence: Euler-Maruyama has strong order 1/2 (pathwise error $\sim \sqrt{\Delta t}$) and weak order 1 (distributional error $\sim \Delta t$). Compare to the Euler method for ODEs, which has order 1: the $\sqrt{\Delta t}$ strong order reflects the irregularity of BM paths.
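The $\sqrt{\Delta t}$ rate can be measured empirically on geometric Brownian motion, whose exact solution is available along the same Brownian path (a sketch; the parameters, path counts, and seed are arbitrary choices):

```python
import math, random

random.seed(5)
mu, sigma, x0, T = 0.05, 0.5, 1.0, 1.0

def strong_error(n_steps, n_paths=5_000):
    """E|X_T^EM - X_T^exact| for GBM, driving both with the same Brownian path."""
    dt, err = T / n_steps, 0.0
    for _ in range(n_paths):
        x, b = x0, 0.0
        for _ in range(n_steps):
            db = math.sqrt(dt) * random.gauss(0, 1)
            x += mu * x * dt + sigma * x * db   # Euler-Maruyama step
            b += db
        exact = x0 * math.exp((mu - 0.5 * sigma**2) * T + sigma * b)
        err += abs(x - exact)
    return err / n_paths

e1, e2 = strong_error(8), strong_error(128)
order = math.log(e1 / e2) / math.log(128 / 8)   # slope on a log-log plot
print(f"estimated strong order ~ {order:.2f}")  # close to 0.5
```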

V.2 Milstein Method

The Milstein method improves strong convergence to order 1 by adding an Ito correction:

Xn+1=Xn+bΔt+σΔBn+12σσ[(ΔBn)2Δt]X_{n+1} = X_n + b\Delta t + \sigma\Delta B_n + \frac{1}{2}\sigma\sigma'\left[(\Delta B_n)^2 - \Delta t\right]

The extra term $\frac{1}{2}\sigma\sigma'[(\Delta B_n)^2 - \Delta t]$ comes from the second-order term in Ito's formula. For scalar SDEs it raises the strong order to 1, matching the order the Euler method achieves for ODEs.
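A head-to-head comparison of the two schemes on geometric Brownian motion, where $\sigma(x) = \sigma x$ gives $\sigma\sigma' = \sigma^2 x$ (a sketch; the parameters and seed are arbitrary choices):

```python
import math, random

random.seed(6)
mu, sigma, x0, T, n_steps, n_paths = 0.05, 0.5, 1.0, 1.0, 32, 20_000
dt = T / n_steps

em_err = mil_err = 0.0
for _ in range(n_paths):
    x_em = x_mil = x0
    b = 0.0
    for _ in range(n_steps):
        db = math.sqrt(dt) * random.gauss(0, 1)
        x_em += mu * x_em * dt + sigma * x_em * db
        # Milstein adds the Ito correction; for GBM, sigma(x) = sigma*x so sigma' = sigma
        x_mil += (mu * x_mil * dt + sigma * x_mil * db
                  + 0.5 * sigma**2 * x_mil * (db * db - dt))
        b += db
    exact = x0 * math.exp((mu - 0.5 * sigma**2) * T + sigma * b)
    em_err += abs(x_em - exact)
    mil_err += abs(x_mil - exact)

print(f"strong error: EM {em_err / n_paths:.5f}, Milstein {mil_err / n_paths:.5f}")
```

At this step size the Milstein error is markedly smaller, consistent with its order-1 versus order-1/2 strong convergence.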

V.3 DDPM as Euler-Maruyama

The DDPM update $x_t = \sqrt{1-\beta_t}x_{t-1} + \sqrt{\beta_t}\epsilon_t$ is, to first order in $\beta_t$, the Euler-Maruyama discretisation of the VP-SDE $dx = -\frac{1}{2}\beta(t)x\,dt + \sqrt{\beta(t)}\,dB_t$ with step size $\Delta t = 1/T$ (the discrete $\beta_t$ absorbs the factor $\Delta t$):

$x_t = x_{t-1} - \frac{\beta_t}{2}x_{t-1} + \sqrt{\beta_t}\epsilon_t = (1-\beta_t/2)x_{t-1} + \sqrt{\beta_t}\epsilon_t \approx \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\epsilon_t$

where the approximation $\sqrt{1-\beta_t} \approx 1 - \beta_t/2$ holds for small $\beta_t$. DDIM corresponds to a deterministic discretisation of the probability flow ODE, which has the same marginals as the SDE and tolerates larger step sizes.
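The quality of this approximation is easy to quantify (a sketch; the $\beta$ values below are the endpoints of the common linear schedule, $10^{-4}$ to $0.02$):

```python
import math

# Gap between sqrt(1 - beta) and its first-order expansion 1 - beta/2
gaps = {}
for beta in (1e-4, 0.02):
    exact = math.sqrt(1 - beta)
    approx = 1 - beta / 2
    gaps[beta] = approx - exact   # positive, since sqrt is concave
    print(f"beta={beta}: sqrt(1-beta)={exact:.8f}, 1-beta/2={approx:.8f}, "
          f"gap={gaps[beta]:.2e}")
```

Even at the largest typical $\beta_t = 0.02$ the gap is about $5 \times 10^{-5}$, so the continuum SDE picture is an excellent description of the discrete DDPM chain.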


Appendix W: Bridge Processes and Conditional Diffusions

W.1 Brownian Bridge

A Brownian bridge from aa to bb on [0,T][0,T] is a Brownian motion conditioned to be at bb at time TT:

XtBB=a+(ba)tT+BttTBTX_t^{\text{BB}} = a + (b-a)\frac{t}{T} + B_t - \frac{t}{T}B_T

This is a Gaussian process with E[Xt]=a+(ba)t/T\mathbb{E}[X_t] = a + (b-a)t/T and Cov(Xs,Xt)=s(Tt)/T\text{Cov}(X_s, X_t) = s(T-t)/T for sts \leq t.
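A simulation sketch verifying the bridge covariance via the representation above (endpoints $a = b = 0$; the grid points, path count, and seed are arbitrary choices):

```python
import math, random

random.seed(7)
T, s, t = 1.0, 0.3, 0.7
n_paths = 50_000

xs, xt = [], []
for _ in range(n_paths):
    # Sample (B_s, B_t, B_T) from independent Gaussian increments
    b_s = math.sqrt(s) * random.gauss(0, 1)
    b_t = b_s + math.sqrt(t - s) * random.gauss(0, 1)
    b_T = b_t + math.sqrt(T - t) * random.gauss(0, 1)
    # Bridge from a = 0 to b = 0: X_u = B_u - (u/T) B_T
    xs.append(b_s - (s / T) * b_T)
    xt.append(b_t - (t / T) * b_T)

ms, mt = sum(xs) / n_paths, sum(xt) / n_paths
cov = sum((a - ms) * (b - mt) for a, b in zip(xs, xt)) / n_paths
print(f"empirical Cov(X_s, X_t) = {cov:.4f} vs exact s(T-t)/T = {s * (T - t) / T:.4f}")
```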

For AI: Stochastic interpolation (Albergo & Vanden-Eijnden, 2023) uses Brownian bridges to construct generative models that interpolate between data and noise along curved paths, rather than straight (linear interpolation) or OU (DDPM) paths. Flow matching (Lipman et al., 2022) uses deterministic bridges (straight-line paths) for efficient training.

W.2 Conditional Score Function

Given a forward process q(xtx0)q(x_t|x_0), the conditional score is:

xtlogq(xtx0)=(xtαˉtx0)1αˉt=ϵ1αˉt\nabla_{x_t}\log q(x_t|x_0) = \frac{-(x_t - \sqrt{\bar\alpha_t}x_0)}{1-\bar\alpha_t} = \frac{-\epsilon}{\sqrt{1-\bar\alpha_t}}

where ϵ\epsilon is the noise added in the forward process. The marginal score satisfies:

xtlogq(xt)=E ⁣[xtlogq(xtx0)xt]\nabla_{x_t}\log q(x_t) = \mathbb{E}\!\left[\nabla_{x_t}\log q(x_t|x_0)\,\big|\,x_t\right]

This is a conditional expectation - a Doob martingale evaluated at tt! The diffusion model learns to estimate E[ϵxt]\mathbb{E}[\epsilon|x_t], which by Tweedie's formula equals the MMSE denoiser. The entire diffusion model training objective is a stochastic process estimation problem formulated in martingale language.
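The identity can be checked numerically on a toy 1-D example (a sketch; the two-point data distribution and noise level $\bar\alpha_t$ are illustrative): the analytic score of the Gaussian-mixture marginal coincides with the posterior-weighted average of conditional scores.

```python
import math

alpha_bar = 0.6          # illustrative noise level
v = 1 - alpha_bar        # variance of q(x_t | x_0)
x0_vals = [-1.0, 1.0]    # two-point data distribution, equal weights

def gauss(x, m, var):
    return math.exp(-(x - m) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def marginal_score(x):
    """d/dx log q(x_t): analytic score of the two-component Gaussian mixture."""
    p = sum(0.5 * gauss(x, math.sqrt(alpha_bar) * x0, v) for x0 in x0_vals)
    dp = sum(0.5 * gauss(x, math.sqrt(alpha_bar) * x0, v)
             * (math.sqrt(alpha_bar) * x0 - x) / v for x0 in x0_vals)
    return dp / p

def posterior_avg_cond_score(x):
    """E[ d/dx log q(x_t | x_0) | x_t ] via Bayes posterior weights over x_0."""
    w = [0.5 * gauss(x, math.sqrt(alpha_bar) * x0, v) for x0 in x0_vals]
    z = sum(w)
    return sum(wi / z * (math.sqrt(alpha_bar) * x0 - x) / v
               for wi, x0 in zip(w, x0_vals))

for x in (-1.5, 0.2, 0.9):
    print(marginal_score(x), posterior_avg_cond_score(x))   # identical columns
```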


End of Section06 Stochastic Processes.

<- Back to Chapter 6: Probability Theory | Next: Markov Chains ->


Appendix X: Glossary

| Term | Definition |
| --- | --- |
| Adapted process | Process $X_t$ that is $\mathcal{F}_t$-measurable at each $t$ |
| Autocorrelation function (ACF) | $\rho(\tau) = \text{Cov}(X_t, X_{t+\tau})/\text{Var}(X_t)$ |
| Azuma's inequality | Concentration bound for martingales with bounded differences |
| Brownian motion | Continuous Gaussian process with independent stationary increments |
| Cadlag | Right-continuous with left limits (from the French acronym) |
| Compensated Poisson | $N(t) - \lambda t$, a martingale |
| Doob martingale | $M_k = \mathbb{E}[f(X_1,\dots,X_n) \mid \mathcal{F}_k]$ |
| Ergodic theorem | Time average $\to$ ensemble average a.s. |
| Filtration | Increasing family of $\sigma$-algebras: $\mathcal{F}_s \subseteq \mathcal{F}_t$ for $s \leq t$ |
| Gaussian process (GP) | Process whose every finite-dimensional marginal is jointly Gaussian |
| Ito integral | Stochastic integral $\int H_t\,dB_t$ adapted to the BM filtration |
| Ito's lemma | Chain rule for BM: $df = f'\,dX + \frac{1}{2}f''\,(dX)^2$ |
| Kernel | Positive semidefinite function $k(x,x')$, the covariance of a GP |
| Local martingale | Stopped process $M_{t\wedge\tau_n}$ is a martingale for each $n$ |
| Martingale | $\mathbb{E}[M_t \mid \mathcal{F}_s] = M_s$ for $s \leq t$ |
| Martingale difference | $D_k = M_k - M_{k-1}$ with $\mathbb{E}[D_k \mid \mathcal{F}_{k-1}] = 0$ |
| OU process | $dX = -\theta(X-\mu)\,dt + \sigma\,dB$, a mean-reverting Gaussian process |
| Optional Stopping Theorem | $\mathbb{E}[M_\tau] = \mathbb{E}[M_0]$ under appropriate conditions |
| Poisson process | Counting process with iid exponential inter-arrival times |
| Quadratic variation | $[B]_t = t$ for BM, central to stochastic integration |
| Score function | $\nabla_x \log p_t(x)$, the gradient of the log-density |
| Stationary process | Distribution invariant under time shifts |
| Stopping time | $\tau$ with $\{\tau \leq t\} \in \mathcal{F}_t$: no look-ahead |
| Submartingale | $\mathbb{E}[M_t \mid \mathcal{F}_s] \geq M_s$ |
| Supermartingale | $\mathbb{E}[M_t \mid \mathcal{F}_s] \leq M_s$ |
| Thinning | Poisson process subsampled with prob $p$ is Poisson($\lambda p$) |
| Uniform integrability | $\sup_t \mathbb{E}[\lvert M_t\rvert\,\mathbf{1}_{\lvert M_t\rvert > K}] \to 0$ as $K \to \infty$ |
| VP-SDE | Variance-preserving SDE, the continuum limit of the DDPM forward process |
| Wide-sense stationary | Constant mean + covariance depending only on lag |
| Wiener measure | Probability measure on $C([0,\infty))$ defining BM paths |