Stochastic Processes, Part 1: from 1. Intuition and Overview to 10. Common Mistakes
1. Intuition and Overview
1.1 What Is a Stochastic Process?
A stochastic process is a family of random variables $\{X_t\}_{t \in T}$ defined on a common probability space $(\Omega, \mathcal{F}, P)$, indexed by a set $T$ called the index set or time set. For each fixed outcome $\omega \in \Omega$, the mapping $t \mapsto X_t(\omega)$ is called a sample path or realisation of the process.
The key conceptual shift from classical probability: rather than asking "what is the distribution of $X_t$?", we ask "how does the random system evolve over time?" This trajectory-level view captures phenomena that static distributions cannot: memory, feedback, path dependence, and accumulation of information.
Informal classification: The simplest stochastic process is a sequence of iid random variables - no memory, no dependence. More interesting processes have memory (the present depends on the past), structure (the conditional distributions have a specific form), or continuity (sample paths are almost surely continuous). The richest models - Brownian motion, diffusion processes - have all three.
For AI: Almost every quantity that evolves during training or inference is a stochastic process. The parameter vector $\theta_k$ after $k$ steps of SGD is a stochastic process. The hidden state $h_t$ of an RNN at position $t$ is a stochastic process. The sequence of tokens sampled from an LLM is a stochastic process. The noise level at step $t$ of a diffusion model forward pass is a stochastic process. Stochastic process theory gives us the mathematical language to reason about all of these rigorously.
1.2 The Central Examples
The five canonical stochastic processes appear throughout mathematics and AI:
- Simple Random Walk (SRW): $S_n = \sum_{k=1}^n X_k$ where $X_k \in \{-1, +1\}$ with $P(X_k = \pm 1) = 1/2$. Discrete time, discrete state space. Models fair gambling, particle diffusion in discrete space, and gradient noise accumulation.
- Poisson Process: $N_t$ = number of events in $[0, t]$, with $N_{t+s} - N_t \sim \mathrm{Poisson}(\lambda s)$ independent of the past. Continuous time, discrete (integer) state space. Models arrival streams, request queues, neuron spike trains.
- Brownian Motion (Wiener Process): $B_t$ with $B_0 = 0$, continuous paths, $B_{t+s} - B_t \sim \mathcal{N}(0, s)$ independent of the past. Continuous time, continuous state space. The fundamental building block of continuous-time stochastic processes, and the forward process in diffusion models.
- Gaussian Process: A process where every finite collection $(X_{t_1}, \ldots, X_{t_n})$ is jointly Gaussian. Fully specified by its mean function $m(t)$ and covariance kernel $k(s, t)$. Foundation of Gaussian process regression (Bayesian nonparametrics) and closely related to neural network functions in the infinite-width limit.
- Markov Chain: $\{X_n\}$ where $P(X_{n+1} \in A \mid X_n, \ldots, X_0) = P(X_{n+1} \in A \mid X_n)$ - the past matters only through the present. Discrete time, arbitrary state space. The backbone of MCMC, reinforcement learning, and language model token generation. Full treatment: Section 07, Markov Chains.
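The first three canonical processes are easy to simulate directly from their definitions. A minimal NumPy sketch (all parameter choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simple random walk: cumulative sum of iid +/-1 steps.
steps = rng.choice([-1, 1], size=1000)
srw = np.cumsum(steps)

# Poisson process of rate lam: event times are cumulative sums of
# Exponential(lam) inter-arrival times.
lam = 2.0
arrivals = np.cumsum(rng.exponential(1 / lam, size=500))
N_t = int(np.searchsorted(arrivals, 100.0))  # number of events in [0, 100]

# Brownian motion on [0, 1]: independent N(0, dt) increments.
n, T = 1000, 1.0
dt = T / n
bm = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), size=n))])

print(srw[-1], N_t, bm[-1])
```

`N_t` concentrates around $\lambda t = 200$ events, and `bm[-1]` is approximately $\mathcal{N}(0, 1)$, as the definitions predict.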
1.3 Why Processes Matter for AI
Stochastic process theory provides the theoretical foundation for several pillar areas of modern AI:
Diffusion models (DDPM, DDIM, Score SDE) define a forward process that progressively corrupts data with Gaussian noise - essentially a discrete-time approximation to an Ornstein-Uhlenbeck SDE. The reverse process learned by the neural network is a time-reversed SDE, and training minimises a score-matching objective derived from Itô calculus.
SGD dynamics can be approximated by a continuous-time diffusion SDE: $d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{2T}\,dB_t$, where the effective noise temperature $T$ is proportional to the learning rate divided by the batch size. This approximation (Mandt et al., 2017; Li et al., 2017) explains why SGD finds flat minima: the stationary distribution of the SDE concentrates in regions of low curvature.
Reinforcement learning uses the discounted return $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, which is a function of the trajectory - a stochastic process. The temporal difference error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ forms a martingale difference sequence under the optimal value function, and TD-learning is an online martingale approximation algorithm.
Gaussian processes are used as priors over functions in Bayesian optimisation (tuning LLM hyperparameters), and neural tangent kernel theory shows that infinite-width neural networks converge to GPs. RoPE (Rotary Position Embedding) encodes relative position using a stationary kernel - exploiting the wide-sense stationarity machinery developed in Section 7.
1.4 Historical Timeline
| Year | Contributor | Result |
|---|---|---|
| 1900 | Louis Bachelier | Modelled stock prices as random walk (Brownian motion) in doctoral thesis |
| 1905 | Albert Einstein | Derived diffusion equation from random molecular motion |
| 1923 | Norbert Wiener | Constructed Brownian motion rigorously on path space |
| 1933 | Andrei Kolmogorov | Axiomatic probability; forward/backward equations for Markov processes |
| 1940 | Paul Lévy | Characterised Lévy processes; Lévy-Khinchin representation |
| 1944 | Kiyosi Itô | Developed stochastic calculus (Itô integral, Itô's lemma) |
| 1953 | Joseph Doob | Martingale theory; optional stopping; convergence theorems |
| 1961 | Anatoliy Skorokhod | Embedding theorem, giving another route to the functional CLT (Donsker, 1951) |
| 1991 | Brockwell & Davis | Time series analysis unified with stochastic process theory |
| 2020 | Ho, Jain & Abbeel | Denoising diffusion probabilistic models (DDPM) |
| 2021 | Song et al. | Score-based generative models via SDEs |
| 2022 | Su et al. | RoPE: positional encoding as stationary kernel |
2. Formal Framework
2.1 Probability Spaces and Filtrations
The rigorous foundation of stochastic processes builds on probability spaces enhanced with a filtration - a formal model of accumulating information.
Definition (Filtered Probability Space). A filtered probability space is a quadruple $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \in T}, P)$ where:
- $(\Omega, \mathcal{F}, P)$ is a probability space
- $\{\mathcal{F}_t\}_{t \in T}$ is a filtration: an increasing family of sub-$\sigma$-algebras, $\mathcal{F}_s \subseteq \mathcal{F}_t \subseteq \mathcal{F}$ for all $s \le t$
Intuitively, $\mathcal{F}_t$ represents the information available at time $t$ - everything that has been observed up to and including time $t$. The filtration captures how information accumulates: knowing more over time means larger $\sigma$-algebras.
Definition (Adapted Process). A stochastic process $\{X_t\}$ is adapted to the filtration $\{\mathcal{F}_t\}$ if $X_t$ is $\mathcal{F}_t$-measurable for each $t$. Intuitively: the value of the process at time $t$ is determined by information available at time $t$ (not the future).
The natural filtration: Given a process $\{X_t\}$, the natural filtration is $\mathcal{F}_t^X = \sigma(X_s : s \le t)$ - the $\sigma$-algebra generated by the process up to time $t$. Every process is adapted to its natural filtration.
For AI: In online learning, $\mathcal{F}_t$ represents the set of training examples seen up to step $t$. The model parameters $\theta_t$ are $\mathcal{F}_t$-measurable (they depend only on past data). A look-ahead bias - where $\theta_t$ depends on future data - violates adaptedness and invalidates PAC bounds.
In a causal language model, the token at position $t$ is sampled conditioned on all previous tokens - this is precisely the adaptedness condition. Non-causal (bidirectional) models like BERT use a different filtration, in which the conditioning $\sigma$-algebra includes information from both past and future positions.
Standard conditions (usual conditions): A filtration satisfies the usual conditions if:
- It is right-continuous: $\mathcal{F}_t = \bigcap_{s > t} \mathcal{F}_s$
- $\mathcal{F}_0$ contains all $P$-null sets (completeness)
These technical conditions ensure stopping times and martingales behave as expected. All filtrations in this section are assumed to satisfy the usual conditions.
2.2 Classification of Stochastic Processes
Processes are classified by the structure of their time set $T$ and state space $S$:
Taxonomy of stochastic processes:

| | Discrete state space $S$ | Continuous state space $S$ |
|---|---|---|
| Discrete time $T$ | Markov chains, HMMs, Galton-Watson trees | Random walks in $\mathbb{R}^d$, AR(p) models |
| Continuous time $T$ | Poisson processes, birth-death processes, counting processes | Brownian motion, OU process, diffusion SDEs |
Discrete time, discrete state: Markov chains, hidden Markov models (HMMs used in speech recognition), random walks on graphs. The transition dynamics are described by a stochastic matrix.
Discrete time, continuous state: Random walks in $\mathbb{R}^d$, AR/ARMA time series models, recurrent neural network hidden states. State space is $\mathbb{R}^d$ for some dimension $d$.
Continuous time, discrete state: Poisson processes, birth-death processes, queueing models. Transitions happen at random times; between transitions the state is constant.
Continuous time, continuous state: Brownian motion, Ornstein-Uhlenbeck process, general diffusion SDEs. These are the most mathematically sophisticated and most relevant to diffusion models and SGD dynamics.
2.3 Stopping Times
A stopping time (or optional time) formalises the idea of a "decision made using only available information."
Definition (Stopping Time). A random variable $\tau: \Omega \to T \cup \{\infty\}$ is a stopping time with respect to the filtration $\{\mathcal{F}_t\}$ if $\{\tau \le t\} \in \mathcal{F}_t$ for all $t$. Equivalently, the decision "stop now" can be made using only information available at the current time.
Examples:
- $\tau = t_0$: constant time - trivially a stopping time
- $\tau_a = \inf\{t : X_t \ge a\}$ (first passage time): a stopping time (the event $\{\tau_a \le t\}$ is $\mathcal{F}_t$-measurable)
- $\sigma_a = \sup\{t \le T_0 : X_t \ge a\}$ (last time above level $a$ before a horizon $T_0$): not a stopping time (requires future knowledge)
The $\sigma$-algebra $\mathcal{F}_\tau$: For a stopping time $\tau$, the $\sigma$-algebra at $\tau$ is $\mathcal{F}_\tau = \{A \in \mathcal{F} : A \cap \{\tau \le t\} \in \mathcal{F}_t \text{ for all } t\}$. This represents "information available at the random time $\tau$."
For AI: Early stopping in neural network training is a stopping time: the rule "stop when validation loss hasn't improved for $k$ steps" depends only on past validation losses - it is $\mathcal{F}_t$-measurable. In contrast, "stop at the training step that achieves minimum test loss" is NOT a stopping time - it requires knowing future test losses.
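The patience rule can be written down explicitly. The point of this sketch (a hypothetical helper, not from any library) is that the decision at step $t$ reads only losses observed up to $t$:

```python
# Patience-based early stopping is a stopping time: the decision at
# step t uses only losses observed up to t (it is F_t-measurable).
def early_stop_step(val_losses, patience):
    """Return the first step at which training stops, or None."""
    best, best_step = float("inf"), -1
    for t, loss in enumerate(val_losses):
        if loss < best:
            best, best_step = loss, t
        elif t - best_step >= patience:
            return t  # decided from past losses only
    return None

# "Stop at the argmin of the whole curve" is NOT a stopping time:
# it needs the entire future of the sequence.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74, 0.5]
print(early_stop_step(losses, patience=3))  # stops at step 5, never sees the 0.5
```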
In RL, $\tau$ = "the first time the agent reaches the goal state" is a stopping time. Doob's Optional Stopping Theorem (Section 3.3) characterises $E[M_\tau]$ - the expected martingale value at this random time.
2.4 Sample Paths and Almost-Sure Properties
For each fixed $\omega \in \Omega$, the sample path $t \mapsto X_t(\omega)$ is an ordinary function of time. The regularity of sample paths (continuity, measurability, boundedness) is crucial for the applicability of theorems.
Path properties:
- Continuous paths: $t \mapsto X_t(\omega)$ is continuous for a.e. $\omega$. Example: Brownian motion.
- Càdlàg paths (right-continuous with left limits): $t \mapsto X_t(\omega)$ is right-continuous with left limits for a.e. $\omega$. Example: Poisson process. Pronounced "cadlag", from the French continu à droite, limite à gauche.
- Càglàd paths: left-continuous with right limits. Useful for predictable integrands.
Why continuity matters: A process with continuous paths can be approximated by discrete-time processes as the mesh goes to zero - this is the content of Donsker's invariance principle (Section 6.3) and justifies the SDE approximation of SGD. A process with only càdlàg paths (like the Poisson process) has jump discontinuities that cannot be approximated this way.
Almost-sure properties: Properties that hold with probability one (a.s.) are the gold standard. For example, Brownian motion is a.s. nowhere differentiable - every sample path is continuous but has infinite variation on every interval. This non-differentiability is responsible for the $\sqrt{dt}$ scaling of noise in SDEs (via Itô's formula) and explains why diffusion model noise scales as $\sqrt{t}$ rather than linearly in time.
3. Martingales
3.1 Definition and Intuition
The word martingale comes from a gambling strategy: double your bet after each loss. Mathematically, a martingale captures the notion of a "fair game" - the best prediction of the future value is the current value, given everything known so far.
Definition (Martingale). An adapted process $\{M_t\}$ with $E[|M_t|] < \infty$ for all $t$ is a martingale with respect to the filtration $\{\mathcal{F}_t\}$ if: $E[M_t \mid \mathcal{F}_s] = M_s$ for all $s \le t$.
The condition says: given all information up to time $s$, the best prediction of $M_t$ is $M_s$. The process neither drifts up nor down in conditional expectation.
Three types:
- Martingale: $E[M_t \mid \mathcal{F}_s] = M_s$ - fair game
- Submartingale: $E[M_t \mid \mathcal{F}_s] \ge M_s$ - tends upward on average (e.g., convex functions of martingales via Jensen)
- Supermartingale: $E[M_t \mid \mathcal{F}_s] \le M_s$ - tends downward on average
Discrete-time version: $\{M_n\}$ is a martingale if $E[M_{n+1} \mid \mathcal{F}_n] = M_n$ for all $n$. This is equivalent to the martingale difference condition: $E[M_{n+1} - M_n \mid \mathcal{F}_n] = 0$.
Immediate consequence (tower property): For any $t$, $E[M_t] = E[E[M_t \mid \mathcal{F}_0]] = E[M_0]$. So the mean of a martingale is constant: $E[M_t] = E[M_0]$ for all $t$.
3.2 Examples of Martingales
Example 1: Partial Sums of Zero-Mean iid. Let $X_1, X_2, \ldots$ be iid with $E[X_i] = 0$. Then $S_n = \sum_{i=1}^n X_i$ is a martingale: $E[S_{n+1} \mid \mathcal{F}_n] = S_n + E[X_{n+1}] = S_n$.
Example 2: Products of Unit-Mean iid. Let $Y_1, Y_2, \ldots$ be iid with $Y_i > 0$, $E[Y_i] = 1$. Then $M_n = \prod_{i=1}^n Y_i$ is a martingale: $E[M_{n+1} \mid \mathcal{F}_n] = M_n \cdot E[Y_{n+1}] = M_n$.
This is the likelihood ratio (Radon-Nikodym derivative) martingale - fundamental to sequential hypothesis testing and importance sampling.
Example 3: Conditional Expectations. For any integrable $Z$ and filtration $\{\mathcal{F}_n\}$, the process $M_n = E[Z \mid \mathcal{F}_n]$ is a martingale (the Doob martingale). This follows directly from the tower property: $E[M_{n+1} \mid \mathcal{F}_n] = E[E[Z \mid \mathcal{F}_{n+1}] \mid \mathcal{F}_n] = E[Z \mid \mathcal{F}_n] = M_n$.
Example 4: Pólya Urn. An urn contains $r$ red and $b$ blue balls. At each step, draw a ball uniformly at random, record its colour, and return it with one extra ball of the same colour. Let $M_n$ = fraction of red balls after step $n$. Then $\{M_n\}$ is a martingale, and the limiting fraction satisfies $M_\infty \sim \mathrm{Beta}(r, b)$.
Example 5: Squared SRW minus time. The simple random walk (with $E[X_i] = 0$, $\mathrm{Var}(X_i) = 1$) satisfies: $S_n^2 - n$ is a martingale: $E[S_{n+1}^2 - (n+1) \mid \mathcal{F}_n] = S_n^2 + 2S_n E[X_{n+1}] + 1 - (n+1) = S_n^2 - n$.
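The Pólya urn martingale property can be checked by simulation: the mean of the red fraction should stay at the initial fraction $r/(r+b)$ regardless of the horizon. A small Monte Carlo sketch (parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def polya_fraction(r, b, n_steps, rng):
    """Simulate the Polya urn; return the red fraction after n_steps draws."""
    for _ in range(n_steps):
        if rng.random() < r / (r + b):
            r += 1  # drew red: add one more red ball
        else:
            b += 1  # drew blue: add one more blue ball
    return r / (r + b)

# E[M_n] should stay at the initial fraction r/(r+b) = 1/3 for every n.
fracs = [polya_fraction(1, 2, 100, rng) for _ in range(10_000)]
print(np.mean(fracs))  # close to 1/3, though individual runs spread widely
```

The spread of `fracs` does not shrink with more steps: the martingale converges to a random Beta-distributed limit, not to a constant.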
For AI: The difference between the mini-batch gradient estimate and the gradient of the expected loss is a martingale difference - the fundamental property that makes SGD unbiased. The accumulation of these martingale differences over training steps determines the SGD trajectory.
3.3 Doob's Optional Stopping Theorem
Doob's Optional Stopping Theorem (OST) is perhaps the most useful result in martingale theory. It answers: "what is the expected value of a martingale at a stopping time?"
Naively: Since $E[M_t] = E[M_0]$ for every fixed $t$, one might expect $E[M_\tau] = E[M_0]$ for any stopping time $\tau$. This is false in general: the doubling strategy in gambling achieves $E[M_\tau] \ne E[M_0]$ precisely because its stopping time and its stakes are unbounded.
Theorem (Doob's Optional Stopping Theorem). Let $\{M_t\}$ be a martingale and $\tau$ a stopping time. Then $E[M_\tau] = E[M_0]$ provided at least one of the following holds:
- $\tau$ is bounded: $\tau \le c$ a.s. for some constant $c$
- $E[\tau] < \infty$ and $|M_{t+1} - M_t| \le c$ a.s. for some constant $c$
- $\{M_{t \wedge \tau}\}$ is uniformly integrable
Proof sketch (bounded case, $\tau \le c$). Write $M_\tau = M_0 + \sum_{k=0}^{c-1} (M_{k+1} - M_k)\,\mathbf{1}\{\tau > k\}$. Since $\tau$ is a stopping time, $\{\tau > k\} \in \mathcal{F}_k$, so by the martingale property $E[(M_{k+1} - M_k)\,\mathbf{1}\{\tau > k\}] = E[\mathbf{1}\{\tau > k\}\,E[M_{k+1} - M_k \mid \mathcal{F}_k]] = 0$. Each term vanishes in expectation, leaving $E[M_\tau] = E[M_0]$.
Application: Gambler's Ruin. Consider the symmetric random walk on $\{0, 1, \ldots, N\}$ with absorbing barriers at $0$ and $N$, started at $S_0 = k$. Let $\tau = \inf\{n : S_n \in \{0, N\}\}$.
Since $S_n$ is a martingale with bounded increments and $E[\tau] < \infty$ (indeed, it can be shown that $E[\tau] = k(N-k)$), OST gives $E[S_\tau] = k$. Hence $N \cdot P(S_\tau = N) = k$, so $P(\text{reach } N \text{ before } 0) = k/N$.
Also: $S_n^2 - n$ is a martingale, so $E[S_\tau^2] - E[\tau] = k^2$, giving $E[\tau] = E[S_\tau^2] - k^2 = N^2 \cdot (k/N) - k^2 = k(N-k)$.
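Both OST predictions can be verified by Monte Carlo. A quick sketch (sample sizes chosen only for speed):

```python
import numpy as np

rng = np.random.default_rng(2)

def ruin_trial(k, N, rng):
    """One fair gambler's-ruin game; return (reached N?, duration)."""
    s, t = k, 0
    while 0 < s < N:
        s += rng.choice((-1, 1))
        t += 1
    return s == N, t

k, N = 3, 10
results = [ruin_trial(k, N, rng) for _ in range(5000)]
p_win = np.mean([won for won, _ in results])
mean_tau = np.mean([t for _, t in results])
print(p_win, mean_tau)  # OST predicts k/N = 0.3 and k(N-k) = 21
```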
3.4 Doob's Martingale Inequalities
Beyond Optional Stopping, Doob proved several fundamental inequalities that control the maximum of a martingale.
Theorem (Doob's Maximal Inequality). For a non-negative submartingale $\{M_n\}$ and $\lambda > 0$: $\lambda\, P\left(\max_{k \le n} M_k \ge \lambda\right) \le E[M_n]$.
Corollary (Doob's $L^2$ Inequality). For a martingale with $E[M_n^2] < \infty$: $E\left[\max_{k \le n} M_k^2\right] \le 4\,E[M_n^2]$.
Theorem (Martingale Convergence Theorem, Doob 1953). Every $L^1$-bounded martingale (i.e., $\sup_n E[|M_n|] < \infty$) converges almost surely to a finite limit $M_\infty$.
This convergence theorem is the rigorous foundation for why TD-learning and Q-learning converge under appropriate conditions - the value function updates form a bounded supermartingale under the optimal policy.
3.5 The Doob Martingale Construction
Definition. For a function $f(X_1, \ldots, X_n)$ of independent random variables, define: $M_k = E[f(X_1, \ldots, X_n) \mid X_1, \ldots, X_k]$, for $k = 0, 1, \ldots, n$.
Then $M_0 = E[f]$, $M_n = f(X_1, \ldots, X_n)$, and $\{M_k\}$ is a martingale (by the tower property). This is the Doob martingale for $f$.
Connection to McDiarmid: The martingale differences are $D_k = M_k - M_{k-1}$. If $f$ has bounded differences (changing coordinate $k$ changes $f$ by at most $c_k$), then each $D_k$ is a.s. confined to an interval of length $c_k$. Applying Azuma's inequality (in the form using the conditional range of each difference) to the martingale $\{M_k\}$: $P(|f - E[f]| \ge t) \le 2\exp\left(-\frac{2t^2}{\sum_k c_k^2}\right)$.
This is exactly McDiarmid's inequality - Azuma applied to the Doob martingale. The Doob construction thus unifies martingale theory with the concentration inequalities of Section 05.
For AI: The empirical risk $\hat{R}_n = \frac{1}{n}\sum_{i=1}^n \ell(h; Z_i)$ is a function of $n$ independent training examples; if the loss is bounded by $B$, each example contributes bounded difference $c_i = B/n$. The Doob martingale interpolates from $M_0 = E[\hat{R}_n]$ to $M_n = \hat{R}_n$, and Azuma gives the McDiarmid generalisation bound $P(|\hat{R}_n - E[\hat{R}_n]| \ge t) \le 2\exp(-2nt^2/B^2)$.
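A sanity check of the bound for the simplest case, $f$ = sample mean of Bernoulli variables in $[0,1]$, where $c_i = 1/n$ (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# f(X_1..X_n) = sample mean of Bernoulli(0.5) variables in [0, 1]:
# changing one coordinate moves f by at most c_i = 1/n, so McDiarmid
# gives P(|f - E f| >= t) <= 2 exp(-2 n t^2).
n, t, trials = 100, 0.1, 20_000
devs = np.abs(rng.binomial(n, 0.5, size=trials) / n - 0.5)
empirical = np.mean(devs >= t)
bound = 2 * np.exp(-2 * n * t**2)
print(empirical, bound)  # the empirical frequency sits below the bound
```

The bound ($\approx 0.27$ here) is loose but valid; the true deviation probability is several times smaller.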
3.6 Supermartingales and Submartingales
Supermartingales ($E[M_t \mid \mathcal{F}_s] \le M_s$) model processes that decrease in conditional expectation - losing bets, damped oscillators, optimisation algorithms.
Key example: For a Lyapunov function $V$ in optimisation, if $E[V(\theta_{k+1}) \mid \mathcal{F}_k] \le V(\theta_k)$, then $V(\theta_k)$ is a "noisy supermartingale." Under appropriate noise control, the Martingale Convergence Theorem guarantees $V(\theta_k) \to V_\infty$ a.s., implying convergence to a critical point.
Submartingales ($E[M_t \mid \mathcal{F}_s] \ge M_s$) arise as convex functions of martingales (by Jensen's inequality: $\phi$ convex and $M$ a martingale $\Rightarrow$ $\phi(M)$ a submartingale). Examples: $|M_t|$, $M_t^2$, $e^{\theta M_t}$.
For AI: The training loss trajectory is typically a supermartingale in the early phases of training (decreasing in expectation per step). Plateau phases correspond to near-martingale behaviour. The Doob decomposition theorem guarantees any submartingale can be written as $X_n = M_n + A_n$, where $M_n$ is a martingale and $A_n$ is a predictable increasing process - this decomposition underlies variance reduction in stochastic optimisation.
3.7 ML: SGD as a Near-Martingale
Consider the SGD update: $\theta_{k+1} = \theta_k - \eta \nabla \hat{L}_{B_k}(\theta_k)$, where $\hat{L}_{B_k}$ is the mini-batch loss. Define the gradient noise $\xi_k = \nabla L(\theta_k) - \nabla \hat{L}_{B_k}(\theta_k)$.
Since each mini-batch is sampled iid, $E[\xi_k \mid \mathcal{F}_k] = 0$ - the gradient noise is a martingale difference sequence. The noise accumulation $M_n = \sum_{k=0}^{n-1} \xi_k$ forms a martingale.
The diffusion approximation (continuous-time limit): As the step size $\eta \to 0$ with $t = k\eta$ held fixed, the SGD trajectory converges in distribution to the SDE: $d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\eta\,\Sigma(\theta_t)}\,dB_t$,
where $\Sigma(\theta)$ is the gradient covariance matrix and $B_t$ is a Brownian motion in parameter space. The noise temperature (learning rate / batch size) controls the diffusion coefficient.
Variance reduction methods: SVRG (Stochastic Variance Reduced Gradient) and SAG (Stochastic Average Gradient) reduce $\mathrm{Var}(\xi_k)$ to make SGD behave closer to full-batch gradient descent. By periodically computing the full gradient, they convert the gradient-noise variance from $O(1)$ to a quantity that vanishes as $\theta_k \to \theta^*$. The convergence improvement from the sublinear rate of SGD to the linear convergence of SVRG on strongly convex problems directly reflects this martingale variance reduction.
4. Discrete-Time Random Walks
4.1 Simple Random Walk
Definition. The simple random walk (SRW) on $\mathbb{Z}$ is $S_n = \sum_{k=1}^n X_k$, where the $X_k$ are iid with $P(X_k = +1) = P(X_k = -1) = 1/2$ (so $E[X_k] = 0$, $\mathrm{Var}(X_k) = 1$).
Distribution: $P(S_n = k) = \binom{n}{(n+k)/2} 2^{-n}$ for $k$ of the same parity as $n$. By the CLT: $S_n/\sqrt{n} \Rightarrow \mathcal{N}(0, 1)$.
Key identity (Reflection Principle): The number of paths from $(0, a)$ to $(n, b)$ (with $a, b > 0$) that touch or cross the $x$-axis equals the total number of paths from $(0, -a)$ to $(n, b)$. This yields the distribution of the running maximum:
$P\left(\max_{k \le n} S_k \ge m\right) = 2P(S_n > m) + P(S_n = m)$ for integers $m \ge 0$ - approximately $2P(S_n \ge m)$ for large $n$.
The arc-sine law: The last time the SRW visits 0 before time $2n$ has a distribution that concentrates near $0$ and $2n$, not near the middle as intuition suggests. Formally: if $L_{2n} = \max\{k \le 2n : S_k = 0\}$, then $P(L_{2n}/2n \le x) \to \frac{2}{\pi}\arcsin\sqrt{x}$.
Biased random walk: With $P(X_k = +1) = p$, $P(X_k = -1) = 1-p$, the walk has drift $E[X_k] = 2p - 1$ per step. It is a submartingale if $p > 1/2$, a supermartingale if $p < 1/2$, and a martingale if $p = 1/2$. The exponentially tilted process $M_n = e^{\theta S_n} / \left(p e^{\theta} + (1-p)e^{-\theta}\right)^n$ is a martingale for any $\theta \in \mathbb{R}$.
4.2 Recurrence and Transience
Definition. A random walk is recurrent if it returns to its starting point with probability 1; transient if it escapes to infinity with positive probability.
Pólya's Theorem (1921). The SRW on $\mathbb{Z}^d$ is:
- Recurrent for $d = 1$ and $d = 2$ (probability 1 of returning to the origin)
- Transient for $d \ge 3$ (positive probability of never returning)
Proof idea for $d \le 2$: The expected number of returns to 0 is $\sum_n P(S_{2n} = 0)$, and $P(S_{2n} = 0) \sim c_d\, n^{-d/2}$ by Stirling, so this sum diverges for $d \le 2$. An infinite expected number of returns forces return probability 1: if the return probability were $q < 1$, the total number of returns would be geometric with finite mean $q/(1-q)$.
Proof idea for $d \ge 3$: $P(S_{2n} = 0) \sim c_d\, n^{-d/2}$ is summable by Stirling, so by Borel-Cantelli the walk returns to the origin only finitely many times almost surely.
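Pólya's dichotomy shows up clearly in simulation: the fraction of walks that revisit the origin within a fixed horizon is near 1 in $d = 1$ but plateaus near the $d = 3$ return probability $\approx 0.34$. A rough Monte Carlo sketch (horizons and trial counts illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def return_prob(dim, n_steps, trials, rng):
    """Fraction of SRWs on Z^dim that revisit the origin within n_steps."""
    hits = 0
    for _ in range(trials):
        # Each step moves +-1 along one uniformly chosen axis.
        axes = rng.integers(0, dim, size=n_steps)
        signs = rng.choice((-1, 1), size=n_steps)
        steps = np.zeros((n_steps, dim), dtype=int)
        steps[np.arange(n_steps), axes] = signs
        path = np.cumsum(steps, axis=0)
        if np.any(np.all(path == 0, axis=1)):
            hits += 1
    return hits / trials

p1 = return_prob(1, 2000, 400, rng)
p3 = return_prob(3, 2000, 400, rng)
print(p1, p3)  # p1 approaches 1; p3 stays near Polya's ~0.34
```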
For AI: In gradient descent on a loss landscape with a flat direction (rank deficiency), the gradient vanishes along that direction - the component of the iterate in the flat direction performs a random walk under mini-batch noise. Although recurrent in low dimension, such a walk is unbounded: training can drift arbitrarily far in the flat direction, explaining parameter-norm growth and the need for weight decay, which adds a restoring force and converts the free walk into a mean-reverting (OU-like) process.
4.3 Gambler's Ruin
Setup: A gambler starts with $k$ dollars and plays a fair game (win or lose \$1 with probability $1/2$ each), stopping upon reaching either $\$N$ (target) or $\$0$ (ruin).
Result (via OST): Using the martingales $S_n$ and $S_n^2 - n$:
- $P(\text{reach } N) = k/N$, $P(\text{ruin}) = 1 - k/N$
- $E[\tau] = k(N-k)$ (expected duration)
Biased gambler ($p \ne 1/2$): Use the exponential martingale $M_n = \left(\frac{1-p}{p}\right)^{S_n}$: $P(\text{reach } N) = \dfrac{1 - \left(\frac{1-p}{p}\right)^k}{1 - \left(\frac{1-p}{p}\right)^N}$.
For AI: Gambler's ruin models early stopping in training: the training loss random-walks around a local minimum, and we stop when it either reaches a "good" threshold (success) or explodes (ruin). The expected stopping time motivates why learning rate schedules that reduce the step size as training progresses can reduce expected convergence time.
4.4 ML: Token Sequences as Random Walks
A language model generates tokens sequentially. Conditional on the context, each token is a draw from the distribution $p_\theta(x_t \mid x_{<t})$ - making the sequence a stochastic process. The log-probability of a sequence:
$\log p_\theta(x_{1:T}) = \sum_{t=1}^T \log p_\theta(x_t \mid x_{<t})$
is a sum of per-token log-probabilities - a random walk in log-space. The perplexity $\exp\left(-\frac{1}{T}\log p_\theta(x_{1:T})\right)$ is the geometric mean of the inverse token probabilities, and concentrates around $e^H$, where $H$ is the true entropy rate of the language.
Surprise accumulation: The centred cumulative surprise $M_T = \sum_{t=1}^T \left(-\log p_\theta(x_t \mid x_{<t}) - H_t\right)$, where $H_t$ is the conditional entropy under the true distribution, is a martingale for a well-calibrated model. If the model is miscalibrated, the surprise process is either a submartingale (model too confident -> underestimates entropy) or a supermartingale (model too diffuse). This connects language model calibration to martingale theory.
5. Poisson Processes
5.1 Definition and Characterisation
The Poisson process is the canonical continuous-time counting process - it counts the number of events in an interval.
Definition (Poisson Process). A right-continuous counting process $\{N_t\}_{t \ge 0}$ with $N_0 = 0$ is a Poisson process with rate $\lambda > 0$ if:
- Independent increments: For any $0 \le t_0 < t_1 < \cdots < t_n$, the increments $N_{t_1} - N_{t_0}, \ldots, N_{t_n} - N_{t_{n-1}}$ are mutually independent
- Stationary increments: $N_{t+s} - N_t \stackrel{d}{=} N_s$ for all $s, t \ge 0$
- Poisson distribution of increments: $N_{t+s} - N_t \sim \mathrm{Poisson}(\lambda s)$ for $s > 0$
Equivalent characterisation via inter-arrivals: If $T_1, T_2, \ldots$ are the inter-arrival times between successive events, then $\{N_t\}$ is a Poisson process iff the $T_i$ are iid $\mathrm{Exponential}(\lambda)$.
The Bernoulli limit: A Poisson process can be approximated by a Bernoulli process: divide $[0, t]$ into $n$ intervals of length $t/n$ and place an event in each interval independently with probability $\lambda t/n$. As $n \to \infty$ with $\lambda t$ fixed, this converges to a Poisson process.
Martingale structure: $N_t - \lambda t$ is a martingale (the compensated Poisson process). Its predictable quadratic variation is $\lambda t$, so $(N_t - \lambda t)^2 - \lambda t$ is also a martingale.
5.2 Properties: Superposition and Thinning
Superposition: If $N^{(1)} \sim \mathrm{PP}(\lambda_1)$ and $N^{(2)} \sim \mathrm{PP}(\lambda_2)$ are independent, then $N^{(1)} + N^{(2)} \sim \mathrm{PP}(\lambda_1 + \lambda_2)$. The superposition of independent Poisson processes is Poisson with the summed rate.
Thinning (Colouring): Start with a Poisson process of rate $\lambda$. Independently colour each arrival red with probability $p$ and blue with probability $1-p$. Then the red and blue counting processes are independent Poisson processes with rates $p\lambda$ and $(1-p)\lambda$.
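The thinning property is easy to check numerically: red counts should have mean $p\lambda t$, blue counts mean $(1-p)\lambda t$, and the two should be uncorrelated. A small sketch (parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

lam, t, p, trials = 3.0, 10.0, 0.25, 20_000

# Total arrivals in [0, t] per trial, then colour each one red w.p. p.
n = rng.poisson(lam * t, size=trials)
red = rng.binomial(n, p)
blue = n - red

# Thinning predicts red ~ Poisson(p*lam*t) and blue ~ Poisson((1-p)*lam*t),
# independent of each other.
print(red.mean(), blue.mean(), np.corrcoef(red, blue)[0, 1])
```

The means land near $7.5$ and $22.5$, and the empirical correlation is near zero, even though red and blue are carved out of the same arrival stream.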
Conditioning (uniform order statistics): Given $N_t = n$, the arrival times $(T_1, \ldots, T_n)$ are distributed as the order statistics of $n$ iid $\mathrm{Uniform}[0, t]$ random variables. This is the conditional uniformity property - conditioning on the count makes the arrival times maximally spread.
Non-homogeneous Poisson process: Generalise by allowing a time-varying rate $\lambda(t)$: $N_t - N_s \sim \mathrm{Poisson}\left(\int_s^t \lambda(u)\,du\right)$ with independent increments. The compensator is $\Lambda(t) = \int_0^t \lambda(u)\,du$ (the integrated intensity).
5.3 Compound Poisson Process
Definition. Let $N_t$ be a Poisson process with rate $\lambda$ and $Y_1, Y_2, \ldots$ iid random variables (independent of $N$). The compound Poisson process is: $C_t = \sum_{i=1}^{N_t} Y_i$.
Key moments:
- $E[C_t] = \lambda t\, E[Y]$, $\mathrm{Var}(C_t) = \lambda t\, E[Y^2]$
- MGF: $E[e^{\theta C_t}] = \exp\left(\lambda t\,(E[e^{\theta Y}] - 1)\right)$
For AI: The total compute consumed by serving an LLM API follows a compound Poisson model: requests arrive as a Poisson process, and each request requires a random amount of compute (the "jump size" $Y_i$). Capacity planning uses compound Poisson theory to compute the buffer size required to handle peak demand with high probability.
5.4 ML: Event Streams and Token Timing
LLM inference event streams: In deployed LLMs, user requests arrive approximately as a Poisson process with rate $\lambda$ (requests/second). Under heavy load, the inter-arrival distribution determines queueing behaviour. If serving a request takes $1/\mu$ seconds on average, the system is stable when $\lambda < \mu$ (arrival rate < service rate) - an M/M/1 queue analysis built on Poisson process theory.
KV-cache access patterns: In multi-head attention with prefix caching, the pattern of cache hits/misses follows a complex point process. Poisson approximation applies when individual hit probabilities are small and requests are approximately independent.
Continuous batching: Modern LLM serving (vLLM, TensorRT-LLM) uses continuous batching: new requests join an existing batch as others complete. The batch size at time $t$ is a birth-death process - a continuous-time Markov chain with Poisson arrivals and exponential service times.
6. Brownian Motion
6.1 Definition and Wiener's Axioms
Brownian motion (the Wiener process) is the fundamental continuous-time continuous-path stochastic process. It is simultaneously the limit of rescaled random walks (Donsker's theorem) and the building block for all diffusion processes in modern AI.
Definition (Standard Brownian Motion). A stochastic process $\{B_t\}_{t \ge 0}$ is a standard Brownian motion (or Wiener process) if:
- $B_0 = 0$ a.s.
- Independent increments: For $0 \le t_0 < t_1 < \cdots < t_n$, the increments $B_{t_1} - B_{t_0}, \ldots, B_{t_n} - B_{t_{n-1}}$ are mutually independent
- Gaussian increments: $B_t - B_s \sim \mathcal{N}(0, t-s)$ for all $0 \le s < t$
- Continuous paths: $t \mapsto B_t$ is continuous a.s.
Existence: Wiener (1923) proved that such a process exists by constructing it on the space of continuous functions equipped with the Wiener measure. A standard modern construction uses the Lévy-Ciesielski (Haar/Schauder wavelet) representation; existence is one of the foundational results of modern probability.
Finite-dimensional distributions: For $0 < t_1 < \cdots < t_n$, the joint density of $(B_{t_1}, \ldots, B_{t_n})$ is:
$p(x_1, \ldots, x_n) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi(t_i - t_{i-1})}} \exp\left(-\frac{(x_i - x_{i-1})^2}{2(t_i - t_{i-1})}\right)$
with $t_0 = 0$ and $x_0 = 0$.
Martingale properties: The following are all martingales:
- $B_t$ (mean zero, no drift)
- $B_t^2 - t$ (variance process)
- $\exp\left(\theta B_t - \theta^2 t/2\right)$ for any $\theta$ (exponential martingale / Doléans-Dade exponential)
6.2 Key Properties and Quadratic Variation
Self-similarity: $\{B_{ct}\}_{t \ge 0} \stackrel{d}{=} \{\sqrt{c}\, B_t\}_{t \ge 0}$ for any $c > 0$. The process "looks the same" at any scale - it is statistically fractal.
Time inversion: $\{t B_{1/t}\}_{t > 0}$ is again a standard Brownian motion (with the convention that its value at $t = 0$ is $0$).
Non-differentiability: Brownian motion is a.s. nowhere differentiable. Formally: for almost every $\omega$, the function $t \mapsto B_t(\omega)$ is not differentiable at any point. This is why stochastic calculus requires the Itô integral - the standard Riemann-Stieltjes integral does not apply.
Quadratic variation: The key property distinguishing Brownian motion from ordinary functions is its quadratic variation:
$[B]_t = \lim_{\|\Pi\| \to 0} \sum_i \left(B_{t_{i+1}} - B_{t_i}\right)^2 = t$,
where the limit is over partitions $\Pi = \{0 = t_0 < t_1 < \cdots < t_n = t\}$ as the mesh $\|\Pi\| \to 0$. Informally: $(dB_t)^2 = dt$. This is the foundation of Itô's lemma and the $\sqrt{dt}$ scaling of stochastic noise.
The Lévy characterisation: A continuous martingale $\{M_t\}$ with $M_0 = 0$ and quadratic variation $[M]_t = t$ is a standard Brownian motion. This characterises BM among all continuous martingales.
For AI: The quadratic variation formula explains why diffusion model noise at timestep $t$ scales as $\sqrt{t}$, not $t$. In DDPM, the forward process adds noise scaled by $\sqrt{1 - \bar\alpha_t}$, where $\bar\alpha_t = \prod_{s \le t}(1 - \beta_s)$ - a discretisation of BM's linear-in-time variance accumulation.
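The quadratic variation property can be seen numerically: on a fine grid, the sum of squared Brownian increments is close to $t$, while the sum of absolute increments (the first variation) grows without bound as the mesh shrinks. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(6)

# One Brownian path on [0, 1], sampled on a fine grid of n points.
n = 200_000
dB = rng.normal(0.0, np.sqrt(1 / n), size=n)

quad_var = np.sum(dB**2)         # sum of squared increments -> t = 1
total_var = np.sum(np.abs(dB))   # first variation: grows like sqrt(n)

print(quad_var, total_var)
```

`quad_var` lands within a fraction of a percent of $t = 1$; `total_var` is in the hundreds and diverges as the grid is refined, which is exactly why Riemann-Stieltjes integration against $B_t$ fails.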
6.3 Brownian Motion as a Limit
Theorem (Donsker's Invariance Principle, 1951). Let $X_1, X_2, \ldots$ be iid with $E[X_i] = 0$ and $\mathrm{Var}(X_i) = 1$. Define the continuous interpolation:
$W_n(t) = \frac{1}{\sqrt{n}}\left(S_{\lfloor nt \rfloor} + (nt - \lfloor nt \rfloor)\, X_{\lfloor nt \rfloor + 1}\right), \quad t \in [0, 1].$
Then $W_n \Rightarrow B$ (convergence in distribution on $C[0,1]$ with the uniform topology), where $B$ is standard Brownian motion.
This is the functional central limit theorem - it upgrades the CLT from convergence of individual random variables to convergence of entire paths. The SRW path, rescaled by $\sqrt{n}$ in space and $n$ in time, converges to Brownian motion.
Implications: Any result proved for Brownian motion has a discrete-time analogue for random walks. This transfers insights between the continuous-time theory (SDEs, Ito calculus) and the discrete-time practice (mini-batch SGD, RL).
6.4 Geometric Brownian Motion
Definition. A process $\{S_t\}$ follows geometric Brownian motion if: $dS_t = \mu S_t\,dt + \sigma S_t\,dB_t$.
By Itô's lemma applied to $\log S_t$: $d(\log S_t) = \left(\mu - \frac{\sigma^2}{2}\right)dt + \sigma\,dB_t$.
So $S_t = S_0 \exp\left(\left(\mu - \frac{\sigma^2}{2}\right)t + \sigma B_t\right)$.
For AI: GBM models multiplicative noise, which appears in residual-stream dynamics. When a transformer residual stream is updated as $x_{l+1} = x_l + f_l(x_l)$, with per-layer updates roughly proportional to the current activation scale, the norm $\|x_l\|$ approximately follows a GBM across depth - explaining the norm growth observed in deep transformers without proper initialisation or normalisation.
6.5 Ornstein-Uhlenbeck Process
The Ornstein-Uhlenbeck (OU) process is the unique stationary Gaussian Markov process - and the theoretical foundation for diffusion model forward processes.
Definition (OU SDE). The OU process satisfies: $dX_t = \theta(\mu - X_t)\,dt + \sigma\,dB_t$.
The drift $\theta(\mu - X_t)$ is a mean-reverting force pulling the process toward $\mu$.
Solution (via Itô's formula): $X_t = \mu + (X_0 - \mu)e^{-\theta t} + \sigma\int_0^t e^{-\theta(t-s)}\,dB_s.$
Distributions: $X_t \sim \mathcal{N}\left(\mu + (X_0 - \mu)e^{-\theta t},\ \frac{\sigma^2}{2\theta}\left(1 - e^{-2\theta t}\right)\right)$
- Stationary distribution: $\mathcal{N}\left(\mu, \frac{\sigma^2}{2\theta}\right)$
- Autocorrelation (in stationarity): $\mathrm{Cov}(X_s, X_{s+h}) = \frac{\sigma^2}{2\theta}\,e^{-\theta h}$ - exponential decay
For AI: The DDPM forward step is $x_t = \sqrt{1 - \beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon_t$, a discretisation of a zero-mean OU process (after a time reparametrisation). The variance-preserving SDE (VP-SDE) of Song et al. (2021) makes this correspondence exact:
$dx = -\tfrac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dW_t$,
with $\beta(t)$ the continuous-time noise schedule. The OU process provides the continuous-time foundation for all modern diffusion models.
L2 regularisation as OU: L2 regularisation adds a restoring force $-\lambda\theta$ to the gradient update, giving a discrete-time OU process for the parameters. The stationary distribution concentrates around $\theta = 0$ with variance proportional to the ratio of gradient noise to regularisation strength, explaining why stronger L2 keeps parameters smaller.
6.6 ML: Diffusion Models
The forward process (data -> noise): Given clean data $x_0$, DDPM defines:
$q(x_t \mid x_{t-1}) = \mathcal{N}\left(\sqrt{1 - \beta_t}\,x_{t-1},\ \beta_t I\right).$
The marginal is $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, where $\bar\alpha_t = \prod_{s=1}^t (1 - \beta_s)$. This forward process is a discretised OU process.
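The closed-form marginal can be checked by iterating the one-step update and comparing the empirical mean and variance of $x_T$ against $\sqrt{\bar\alpha_T}\,x_0$ and $1 - \bar\alpha_T$. A toy sketch with an assumed linear schedule (not the schedule of any particular paper):

```python
import numpy as np

rng = np.random.default_rng(7)

T = 100
betas = np.linspace(1e-4, 0.2, T)      # toy linear noise schedule (assumed)
alpha_bar = np.cumprod(1 - betas)

x0 = 1.7                                # fixed scalar "data" point
paths = np.full(50_000, x0)
for b in betas:                         # iterate the one-step forward update
    paths = np.sqrt(1 - b) * paths + np.sqrt(b) * rng.normal(size=paths.shape)

# Closed form predicts x_T ~ N(sqrt(alpha_bar_T) * x0, 1 - alpha_bar_T).
print(paths.mean(), paths.var())
print(np.sqrt(alpha_bar[-1]) * x0, 1 - alpha_bar[-1])
```

By the end of the schedule $\bar\alpha_T$ is tiny, so $x_T$ is nearly standard Gaussian regardless of $x_0$ - the "data -> noise" direction in action.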
The reverse process (noise -> data): By Bayes' theorem and the Gaussian Markov property, the single-step denoising posterior is Gaussian:
$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(\tilde\mu_t(x_t, x_0),\ \tilde\beta_t I\right).$
The network learns to predict the noise $\epsilon$ (or equivalently, the score $\nabla_{x_t} \log q(x_t)$), enabling the reverse process to denoise step by step.
Score matching: The training objective minimises:
$L = E_{t, x_0, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right].$
Up to weighting, this is equivalent to minimising the Fisher divergence between the model and the noised data distribution at each noise level, connecting diffusion training to score-based stochastic process theory.
Continuous-time perspective (Score SDE, Song et al. 2021): The continuous limit is:
- Forward SDE: $dx = f(x, t)\,dt + g(t)\,dW_t$
- Reverse SDE: $dx = \left[f(x, t) - g(t)^2\, \nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{W}_t$
The neural network approximates $\nabla_x \log p_t(x)$ - the score function - enabling the SDE to run backward from noise to data. This is the unified framework behind DDPM (discrete VP-SDE), DDIM (deterministic probability-flow ODE), and Consistency Models.
7. Stationary Processes and Ergodicity
7.1 Strict and Wide-Sense Stationarity
Definition (Strict Stationarity). A stochastic process $\{X_t\}$ is strictly stationary if for any $n$, any times $t_1, \ldots, t_n$, and any shift $h$:
$(X_{t_1 + h}, \ldots, X_{t_n + h}) \stackrel{d}{=} (X_{t_1}, \ldots, X_{t_n}).$
The finite-dimensional distributions are invariant under time shifts.
Definition (Wide-Sense Stationarity, WSS). A process with finite second moments is wide-sense stationary (or second-order stationary) if:
- $E[X_t] = \mu$ (constant mean)
- $\mathrm{Cov}(X_t, X_{t+h}) = \gamma(h)$ depends only on the lag $h$
Examples:
- Strictly stationary but not WSS: iid $t$-distributed noise with 1 degree of freedom (Cauchy) - the distribution is shift-invariant, but the variance is undefined, so WSS does not apply
- WSS but not strictly stationary: an independent sequence whose marginals alternate between two different distributions sharing the same mean and variance - first and second moments are shift-invariant, but higher-order structure is not
- Both: Gaussian processes with stationary kernels; iid sequences with finite variance
The autocovariance function $\gamma$ of a real WSS process satisfies:
- Symmetry: $\gamma(-h) = \gamma(h)$
- Positive semidefiniteness: $\sum_{i,j} a_i a_j\, \gamma(t_i - t_j) \ge 0$ for all finite sequences $(a_i)$, $(t_i)$
For AI: A WSS process is characterised to second order by just $\mu$ and $\gamma(h)$ - enormously parsimonious compared to the general joint distribution. This is why WSS is the right model for relative position encodings in transformers.
7.2 Ergodicity and the Ergodic Theorem
Intuition: A process is ergodic if "time averages equal ensemble averages" - a single long trajectory captures the full statistical behaviour of the process.
Theorem (Birkhoff's Ergodic Theorem, 1931). For a stationary ergodic process with $E[|X_0|] < \infty$:
$\frac{1}{T}\int_0^T X_t\,dt \to E[X_0] \quad \text{a.s. as } T \to \infty.$
For discrete time: $\frac{1}{n}\sum_{t=1}^n X_t \to E[X_1]$ a.s.
What "ergodic" means formally: A stationary process is ergodic if the only shift-invariant events have probability 0 or 1 (the $\sigma$-algebra of invariant events is trivial). A convenient sufficient condition for ergodicity of the mean is covariance decay: $\gamma(h) \to 0$ as $h \to \infty$.
Non-ergodic example: $X_t = Z$ for all $t$, where $Z$ is a non-degenerate random variable fixed once and for all. This is stationary (the distribution is shift-invariant) but not ergodic: the time average equals $Z$, not $E[Z]$, for any single realisation.
For AI: SGD convergence analysis assumes that the gradient noise process is ergodic - the time average of gradient variance equals the ensemble average. If the data distribution shifts during training (non-stationarity), this fails. Continual learning methods address the resulting catastrophic forgetting, which is fundamentally a violation of the stationarity assumption.
7.3 Autocorrelation and Power Spectral Density
Definition. The autocovariance function (ACVF) is $\gamma(h) = \mathrm{Cov}(X_t, X_{t+h})$. The autocorrelation function (ACF) is $\rho(h) = \gamma(h)/\gamma(0)$, normalised so that $\rho(0) = 1$.
Examples of ACFs:
- iid white noise: $\rho(h) = \mathbf{1}\{h = 0\}$
- AR(1) process ($X_{t+1} = \phi X_t + \epsilon_{t+1}$, $|\phi| < 1$): $\rho(h) = \phi^{|h|}$ - geometric decay
- OU process: $\rho(h) = e^{-\theta|h|}$ - continuous-time geometric decay
- Seasonal process: periodic ACF
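The geometric decay of the AR(1) ACF is easy to confirm from a simulated path: the empirical lag-$h$ autocorrelation should track $\phi^{|h|}$ (parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)

phi, n = 0.8, 200_000
eps = rng.normal(size=n)
x = np.empty(n)
x[0] = eps[0] / np.sqrt(1 - phi**2)   # draw X_0 from the stationary law
for t in range(1, n):                 # AR(1): X_t = phi * X_{t-1} + eps_t
    x[t] = phi * x[t - 1] + eps[t]

def acf(x, h):
    """Empirical lag-h autocorrelation of a 1D array."""
    xc = x - x.mean()
    return np.dot(xc[:-h], xc[h:]) / np.dot(xc, xc)

print([acf(x, h) for h in (1, 2, 5)])  # tracks phi, phi^2, phi^5
```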
Theorem (Wiener-Khinchin, 1934). For a WSS process with absolutely summable autocovariance:
$S(\omega) = \sum_{h=-\infty}^{\infty} \gamma(h)\, e^{-2\pi i \omega h}, \qquad \gamma(h) = \int_{-1/2}^{1/2} S(\omega)\, e^{2\pi i \omega h}\, d\omega.$
The power spectral density (PSD) $S(\omega)$ is the Fourier transform of $\gamma$ and is non-negative: $S(\omega) \ge 0$ for all $\omega$.
Conversely, any non-negative integrable function is the PSD of some WSS process (Bochner's theorem identifies valid autocovariances with Fourier transforms of non-negative measures). The PSD decomposes the variance of a process by frequency: low-frequency components explain slow variation, high-frequency components explain rapid fluctuation.
For AI: The attention mechanism computes scores $\langle q_i, k_j \rangle = x_i^\top W_Q^\top W_K\, x_j$. For a wide-sense stationary input sequence, the expected attention score between positions $i$ and $j$ depends (via the input autocorrelation) only on $i - j$ - the relative position. This is the stationarity condition that relative position encodings (ALiBi, T5 bias, RoPE) exploit.
7.4 Gaussian Processes
Definition. A stochastic process is a Gaussian process (GP) if every finite collection $(X_{t_1}, \ldots, X_{t_n})$ is jointly Gaussian. A GP is fully specified by:
- Mean function: $m(t) = E[X_t]$
- Covariance kernel: $k(s, t) = \mathrm{Cov}(X_s, X_t)$
Notation: $f \sim \mathcal{GP}(m, k)$.
The kernel determines everything: Properties of the GP sample paths are determined by $k$:
- Smoothness: the Matern-$\nu$ kernel controls differentiability ($\lceil \nu \rceil - 1$ times differentiable paths)
- Length scale: the RBF/squared-exponential kernel $k(s,t) = \sigma^2 \exp(-(s-t)^2/2\ell^2)$ has length scale $\ell$
- Periodicity: $k(s,t) = \exp(-2\sin^2(\pi \lvert s-t \rvert / p)/\ell^2)$ for processes with period $p$
Common kernels:
| Kernel | Formula | Properties |
|---|---|---|
| RBF (squared-exponential) | $\exp\big(-(s-t)^2/2\ell^2\big)$ | Infinitely differentiable |
| Matern-1/2 | $\exp(-\lvert s-t \rvert/\ell)$ | Continuous but not diff. (= OU) |
| Matern-3/2 | $\big(1 + \sqrt{3}\lvert s-t \rvert/\ell\big)\exp(-\sqrt{3}\lvert s-t \rvert/\ell)$ | Once differentiable |
| Periodic | $\exp\big(-2\sin^2(\pi \lvert s-t \rvert/p)/\ell^2\big)$ | Exactly periodic |
| Linear | $\sigma^2\, s\, t$ | Bayesian linear regression |
GP Posterior (Bayesian inference): Given observations $y_i = f(t_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, the posterior at a test point $t_*$ is Gaussian:
$$\mu_* = k_*^\top (K + \sigma^2 I)^{-1} y, \qquad \sigma_*^2 = k(t_*, t_*) - k_*^\top (K + \sigma^2 I)^{-1} k_*,$$
where $K_{ij} = k(t_i, t_j)$ and $(k_*)_i = k(t_*, t_i)$.
This is exact Bayesian inference in $O(n^3)$ time (from the Cholesky decomposition of the kernel matrix).
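A minimal sketch of the posterior formulas above in numpy; the kernel, length scale, noise level, and target function here are illustrative choices, not canonical:

```python
import numpy as np

def rbf(s, t, ell=0.5):
    """Squared-exponential kernel k(s,t) = exp(-(s-t)^2 / (2 ell^2))."""
    return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2 * ell**2))

rng = np.random.default_rng(1)
t_obs = np.array([0.1, 0.4, 0.7, 0.9])
y = np.sin(2 * np.pi * t_obs) + 0.05 * rng.standard_normal(4)
noise = 0.05**2

t_new = np.linspace(0, 1, 50)
K = rbf(t_obs, t_obs) + noise * np.eye(4)   # kernel matrix + noise
K_star = rbf(t_new, t_obs)                  # cross-covariances

# Posterior mean and covariance via Cholesky (the O(n^3) step)
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + s^2 I)^{-1} y
mu_post = K_star @ alpha
v = np.linalg.solve(L, K_star.T)
cov_post = rbf(t_new, t_new) - v.T @ v

# Posterior variance shrinks near the observations
print(mu_post.shape, np.diag(cov_post).min())
```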
For AI: Bayesian optimisation for hyperparameter search uses a GP prior over the loss surface. The acquisition function (UCB, EI) determines where to evaluate next, and the GP posterior updates after each observation. This is the principled method for tuning LLM training hyperparameters (learning rate, batch size, warmup schedule, etc.) when each evaluation is expensive.
Neural tangent kernel (NTK): An infinite-width neural network at initialisation is a GP with kernel determined by the architecture. During gradient descent with a small learning rate, the network stays approximately a GP (the lazy training regime). The NTK $\Theta(x, x') = \nabla_\theta f(x)^\top \nabla_\theta f(x')$ determines the learning dynamics via its Gram matrix on the training set.
7.5 ML: RoPE and Stationarity in Attention
Wide-sense stationarity of attention: For a sequence of token embeddings $(x_1, \ldots, x_n)$, the attention matrix can be made to depend only on the relative position $i - j$ if the position encoding is chosen appropriately. This is equivalent to making the query-key inner product a stationary kernel.
RoPE (Rotary Position Embedding, Su et al., 2021): RoPE encodes position by rotating the query and key vectors:
$$q_m = R_{\Theta, m} W_Q x_m, \qquad k_n = R_{\Theta, n} W_K x_n,$$
where $R_{\Theta, m}$ is a block-diagonal rotation matrix whose angles are proportional to the position $m$. The inner product satisfies:
$$q_m^\top k_n = (W_Q x_m)^\top R_{\Theta, n-m}\, (W_K x_n),$$
which depends on the positions only through $n - m$ - exactly the stationarity condition! RoPE makes the attention score a stationary kernel in position, enabling efficient relative position encoding.
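The relative-position property is easy to verify in the 2-dimensional case, where the RoPE rotation is a single $2 \times 2$ rotation matrix (a toy sketch; real RoPE applies many such rotations block-diagonally with different frequencies):

```python
import numpy as np

def rot(angle):
    """2x2 rotation matrix R(angle)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta = 0.3
q = np.array([1.0, 2.0])   # content part of a query
k = np.array([0.5, -1.0])  # content part of a key

def score(m, n):
    """Attention logit between positions m and n under 2-D RoPE."""
    return (rot(m * theta) @ q) @ (rot(n * theta) @ k)

# R(m theta)^T R(n theta) = R((n-m) theta), so the score depends
# only on the offset n - m, not on m and n separately.
print(score(0, 5), score(10, 15), score(100, 105))
```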
ALiBi and linear bias: ALiBi (Press et al., 2022) adds a linear penalty $-m \cdot (i - j)$ to the logit of query $i$ attending to key $j \leq i$, where $m$ is a per-head slope. The resulting attention logits have the form $q_i^\top k_j - m(i - j)$, where the bias term is stationary (translation-invariant). This enables extrapolation to longer sequences - a key advantage when the training distribution has stationarity structure.
8. Preview: Markov Chains
8.1 The Markov Property
A stochastic process satisfies the Markov property if:
$$\mathbb{P}(X_{n+1} \in A \mid X_0, \ldots, X_n) = \mathbb{P}(X_{n+1} \in A \mid X_n).$$
Given the present, the past is irrelevant - the current state is a sufficient statistic for the future. Markov chains are the simplest non-trivial stochastic processes with memory.
Markov chains fit within the stochastic process framework developed here:
- They are adapted to their natural filtration $\mathcal{F}_n = \sigma(X_0, \ldots, X_n)$
- The transition function $p(x, y) = \mathbb{P}(X_{n+1} = y \mid X_n = x)$ defines a stochastic matrix $P$
- Many Markov chains give rise to martingales: $(f(X_n))$ is a martingale if $f$ is harmonic for the transition operator, i.e. $Pf = f$
8.2 Forward Reference
Full treatment: Section 07, Markov Chains.
That section covers: transition matrices, invariant/stationary distributions, detailed balance, recurrence/transience (Markov chain version), mixing times, spectral gap, Perron-Frobenius theorem, reversibility, MCMC (Metropolis-Hastings, Gibbs), and ML applications in language model generation, RL policy evaluation, and PageRank.
9. ML Deep Dive
9.1 SGD Trajectory as Stochastic Process
The parameter sequence produced by SGD is a discrete-time stochastic process with values in . Understanding its properties as a process - not just its endpoint - reveals why deep learning generalises.
The update rule as a random dynamical system:
$$\theta_{n+1} = \theta_n - \eta\, \nabla \hat{L}_{B_n}(\theta_n),$$
where $B_n$ is a random mini-batch drawn at step $n$ and $\hat{L}_{B_n}$ is the corresponding mini-batch loss.
Key stochastic process properties:
- $(\theta_n)$ is adapted to $\mathcal{F}_n = \sigma(B_0, \ldots, B_{n-1})$
- $\mathbb{E}[\nabla \hat{L}_{B_n}(\theta) \mid \mathcal{F}_n] = \nabla L(\theta)$ - unbiasedness means the gradient noise is a martingale difference
- $\xi_n = \nabla \hat{L}_{B_n}(\theta_n) - \nabla L(\theta_n)$ is a martingale difference sequence
The diffusion approximation (Mandt et al., 2017): Under a small step size $\eta$ and a local quadratic approximation $L(\theta) \approx \frac{1}{2}(\theta - \theta^*)^\top H (\theta - \theta^*)$, the SGD dynamics converge to:
$$d\theta_t = -H(\theta_t - \theta^*)\,dt + \sqrt{\eta}\,\Sigma^{1/2}\,dW_t.$$
This is a multivariate OU process with mean-reversion matrix $H$ and noise covariance $\eta\Sigma$, where $\Sigma$ is the gradient noise covariance. The stationary distribution is:
$$\theta_\infty \sim \mathcal{N}(\theta^*, \Sigma_s), \qquad H\Sigma_s + \Sigma_s H = \eta\Sigma.$$
Lower learning rate -> smaller stationary variance -> SGD concentrates more tightly around $\theta^*$. The ratio of learning rate to batch size, $\eta/B$, determines the noise temperature.
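The $\eta$-scaling of the stationary variance can be observed directly. A minimal sketch on the 1-D quadratic $L(\theta) = \theta^2/2$ (so $H = 1$) with additive Gaussian gradient noise - halving the learning rate should roughly halve the stationary variance:

```python
import numpy as np

def sgd_stationary_var(eta, n_steps=200_000, sigma=1.0, seed=0):
    """Run SGD on L(theta) = theta^2 / 2 with additive gradient noise
    and estimate the stationary variance of the iterates."""
    rng = np.random.default_rng(seed)
    theta, burn_in, samples = 0.0, 10_000, []
    for n in range(n_steps):
        grad = theta + sigma * rng.standard_normal()  # noisy gradient
        theta -= eta * grad
        if n >= burn_in:
            samples.append(theta)
    return np.var(samples)

# OU theory predicts stationary variance ~ eta * sigma^2 / 2, so
# halving the learning rate should roughly halve the variance.
v1 = sgd_stationary_var(eta=0.1)
v2 = sgd_stationary_var(eta=0.05)
print(v1, v2, v1 / v2)  # ratio close to 2
```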
Flat minima and generalisation: The stationary distribution of the SGD SDE concentrates in regions where the Hessian is small (flat minima). Near an optimum the gradient noise covariance approximates the Fisher information, which ties the noise to the curvature, and the entropy of the stationary distribution is larger in flat minima. This provides a stochastic-process explanation for why SGD with large learning rates prefers flat minima that generalise well.
9.2 RL Value Functions as Martingales
In reinforcement learning, the agent at time $t$ is in state $S_t$, takes action $A_t$, receives reward $R_{t+1}$, and transitions to $S_{t+1}$. The discounted return is:
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad \gamma \in [0, 1).$$
The Bellman equation: Under a fixed policy $\pi$, the value function $V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ satisfies:
$$V^\pi(s) = \mathbb{E}_\pi\big[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s\big].$$
Martingale characterisation: Define $M_t = \sum_{k=0}^{t-1} \gamma^k R_{k+1} + \gamma^t V^\pi(S_t)$. Under the true $V^\pi$:
$$\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = M_t.$$
So $(M_t)$ is a martingale! The TD error $\delta_t = R_{t+1} + \gamma V^\pi(S_{t+1}) - V^\pi(S_t)$ is a martingale difference: $\mathbb{E}[\delta_t \mid \mathcal{F}_t] = 0$. TD-learning converges because it performs online martingale approximation - stochastic approximation theory guarantees convergence of algorithms driven by martingale differences under the Robbins-Monro step-size conditions.
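A small simulation illustrates the martingale-difference property of the TD error. The two-state chain, rewards, and discount below are arbitrary illustrative choices; $V^\pi$ is computed exactly from the Bellman equation, so the empirical mean TD error should vanish:

```python
import numpy as np

gamma = 0.9
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])      # transition matrix of a 2-state chain
r = np.array([1.0, 0.0])        # reward received on leaving each state

# True value function from the Bellman equation V = r + gamma P V
V = np.linalg.solve(np.eye(2) - gamma * P, r)

# Simulate the chain, recording TD errors delta = r + gamma V(s') - V(s)
rng = np.random.default_rng(0)
s, deltas = 0, []
for _ in range(200_000):
    s_next = int(rng.random() < P[s, 1])  # sample next state
    deltas.append(r[s] + gamma * V[s_next] - V[s])
    s = s_next

# With the true V, the TD error is a martingale difference: mean ~ 0
print(np.mean(deltas))
```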
REINFORCE gradient: The policy gradient estimator $\hat{g}_t = \nabla_\theta \log \pi_\theta(A_t \mid S_t)\,(G_t - b(S_t))$ is an unbiased estimator of the policy gradient - its estimation error is a martingale difference. The baseline $b(S_t)$ reduces variance without introducing bias, since $\mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\big] = 0$.
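The zero-mean property of the score function - the reason baselines introduce no bias - can be checked exactly for a softmax policy (the logits below are arbitrary):

```python
import numpy as np

# Softmax policy over 3 actions with logits theta
theta = np.array([0.2, -1.0, 0.5])
pi = np.exp(theta) / np.exp(theta).sum()

# Score function for a softmax: d log pi(a) / d theta = e_a - pi
scores = np.eye(3) - pi  # row a is the gradient of log pi(a)

# E_{a~pi}[grad log pi(a)] = sum_a pi(a) (e_a - pi) = pi - pi = 0,
# so subtracting any baseline b(s) leaves the estimator unbiased.
expected_score = pi @ scores
print(expected_score)  # ~ [0, 0, 0]
```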
9.3 Attention over Sequences
Attention can be viewed as a stochastic process operating over the token sequence:
Token sequence as discrete-time process: $(x_1, \ldots, x_n)$ with embeddings $x_i \in \mathbb{R}^d$. The attention mechanism at each layer computes:
$$h_i = \sum_j \alpha_{ij} v_j, \qquad \alpha_{ij} = \mathrm{softmax}_j\big(q_i^\top k_j/\sqrt{d}\big).$$
The hidden state $h_i$ is a weighted average over positions - a soft generalisation of a conditional expectation given the available context.
Causal attention as filtration: In autoregressive generation, causal masking ensures $h_i$ depends only on $x_1, \ldots, x_i$ - maintaining adaptedness to the natural filtration of the sequence. Bidirectional attention (BERT) conditions on the entire sequence at once - a "global" $\sigma$-algebra rather than a filtration.
KV-cache as path memory: The key-value cache stores $\{(k_j, v_j) : j \leq t\}$ - the full path of the process up to position $t$. This is the computational realisation of the filtration $\mathcal{F}_t$: at each new token, the model has access to the complete history.
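Adaptedness under causal masking can be demonstrated with a toy single-head attention (identity projections, so $q_i = k_i = v_i = x_i$; not a realistic transformer layer): appending a new token leaves all earlier hidden states unchanged.

```python
import numpy as np

def causal_attention(X, d=4):
    """Single-head causal self-attention with identity projections
    (toy model: q_i = k_i = v_i = x_i)."""
    n = X.shape[0]
    logits = X @ X.T / np.sqrt(d)
    logits[np.triu_indices(n, k=1)] = -np.inf   # mask future positions
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))

# Appending a 6th token must not change h_1..h_5: each h_i is
# measurable with respect to the history x_1..x_i (adaptedness).
H5 = causal_attention(X)
H6 = causal_attention(np.vstack([X, rng.standard_normal((1, 4))]))
print(np.allclose(H5, H6[:5]))  # True
```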
9.4 Diffusion Model Schedules
Different noise schedules in diffusion models correspond to different continuous-time SDEs:
| Schedule | SDE | Marginal |
|---|---|---|
| Linear (DDPM) | $dx = -\tfrac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dW_t$ | $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ |
| Cosine | Same SDE, different $\beta(t)$ | Smoother transition |
| VE-SDE | $dx = \sqrt{\tfrac{d}{dt}\sigma^2(t)}\,dW_t$ | $x_t = x_0 + \sigma(t)\,\epsilon$ |
| SubVP-SDE | $dx = -\tfrac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)\big(1 - e^{-2\int_0^t \beta(s)\,ds}\big)}\,dW_t$ | Tighter likelihood bounds |
The reverse SDE (Anderson, 1982) is:
$$dx = \big[f(x, t) - g(t)^2\, \nabla_x \log p_t(x)\big]\,dt + g(t)\,d\bar{W}_t,$$
where $\bar{W}_t$ is a backward Brownian motion. The neural network $s_\theta(x, t)$ is trained to estimate the score function $\nabla_x \log p_t(x)$, enabling the reverse process to convert noise into data.
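A quick sanity check of the VP-SDE marginals: simulating the forward SDE by Euler-Maruyama (constant $\beta$ for simplicity) and comparing against the exact Gaussian marginal $x_T \sim \mathcal{N}(e^{-\beta T/2} x_0,\; 1 - e^{-\beta T})$.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, T, n_steps, n_paths = 1.0, 2.0, 2000, 50_000
dt = T / n_steps

# Forward VP-SDE: dx = -1/2 beta x dt + sqrt(beta) dW  (Euler-Maruyama)
x = np.full(n_paths, 3.0)  # start all paths at x_0 = 3
for _ in range(n_steps):
    x += -0.5 * beta * x * dt + np.sqrt(beta * dt) * rng.standard_normal(n_paths)

# Exact OU marginal: mean e^{-beta T/2} x_0, variance 1 - e^{-beta T}
mean_exact = 3.0 * np.exp(-0.5 * beta * T)
var_exact = 1 - np.exp(-beta * T)
print(x.mean(), mean_exact)
print(x.var(), var_exact)
```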
Consistency Models (Song et al., 2023) learn to map any point on the reverse SDE trajectory directly to the final sample, bypassing the iterative nature of full diffusion. The key insight uses probability flow ODE theory - replacing the stochastic reverse SDE with a deterministic ODE that has the same marginals.
10. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | "Any increasing family of sets is a filtration" | A filtration requires $\sigma$-algebras, not just sets. A family of sets with $A_1 \subseteq A_2 \subseteq \cdots$ has no algebraic structure. | Verify each $\mathcal{F}_t$ is a $\sigma$-algebra (closed under complements and countable unions) and $\mathcal{F}_s \subseteq \mathcal{F}_t$ for $s \leq t$. |
| 2 | "If $\mathbb{E}[M_n] = \mathbb{E}[M_0]$ for all $n$, then $(M_n)$ is a martingale" | Constant mean is necessary but not sufficient. Non-martingale with constant mean: $M_0$ Rademacher, $M_{n+1} = -M_n$ (mean 0 for all $n$, but $\mathbb{E}[M_{n+1} \mid M_n] = -M_n \neq M_n$). | Verify the defining property $\mathbb{E}[M_{n+1} \mid \mathcal{F}_n] = M_n$, not just the marginal means. |
| 3 | "OST applies to any stopping time" | Without boundedness or uniform integrability, $\mathbb{E}[M_\tau] \neq \mathbb{E}[M_0]$ in general. The doubling strategy ($\tau$ = "first win") has $\mathbb{E}[M_\tau] = 1 \neq 0 = \mathbb{E}[M_0]$ and violates OST. | Verify one of the three OST conditions: $\tau$ bounded; $\mathbb{E}[\tau] < \infty$ plus bounded increments; or uniform integrability. |
| 4 | "Brownian motion is differentiable since it's continuous" | Continuity does not imply differentiability. BM is a.s. nowhere differentiable (Paley, Wiener & Zygmund, 1933). Its paths have infinite variation on every interval. | Use Ito calculus (with $(dW_t)^2 = dt$) for BM integrals, not classical Riemann-Stieltjes. |
| 5 | "The Poisson process has stationary paths" | Stationarity of increments $\neq$ stationarity of the process itself. $N_t \sim \mathrm{Poisson}(\lambda t)$ - the marginal distribution changes with $t$; $(N_t)$ is not stationary. | The compensated process $N_t - \lambda t$ is a martingale; stationarity applies to increments, not levels. |
| 6 | "Strict stationarity implies WSS" | Strict stationarity only implies WSS if the second moment exists. An iid Cauchy sequence is strictly stationary but has $\mathbb{E}[X_t^2] = \infty$, so WSS is undefined. | Always verify second-moment existence before claiming WSS from strict stationarity. |
| 7 | "WSS implies strict stationarity" | WSS is a second-moment condition; higher-order moments may vary with time. Example: a process with constant mean and ACVF but time-varying skewness in its innovations is WSS yet not strictly stationary. | WSS -> strict stationarity ONLY for Gaussian processes (since Gaussians are determined by their first two moments). |
| 8 | "The OU process converges to its mean" | The OU process converges in distribution to its stationary law $\mathcal{N}(\mu, \sigma^2/2\theta)$, not to the constant $\mu$. It fluctuates around $\mu$ with variance $\sigma^2/(2\theta)$ in steady state. | Distinguish convergence of the distribution (to the stationary distribution) from convergence of the paths (which remain stochastic). |
| 9 | "Ergodicity means the process repeats" | Ergodicity means time averages equal ensemble averages. An ergodic process does not repeat its paths - it explores the full distribution over time. | Ergodic = time average converges to ensemble average. Periodicity (repeating paths) is a different and usually violated property. |
| 10 | "A GP with RBF kernel can model any function" | RBF GPs have infinitely smooth sample paths. Functions with discontinuities or kinks (like ReLU activations) cannot be well-approximated by RBF GPs. Matern-1/2 (OU) or Matern-3/2 kernels are better for rough functions. | Match kernel smoothness to the smoothness of the function being modelled. Use Matern family for non-smooth functions. |
| 11 | "DDPM forward process is Brownian motion" | DDPM uses a discretised variance-preserving process. True BM has no mean reversion and linearly growing variance; DDPM's forward process is a discrete OU approximation with mean-reversion drift $-\tfrac{1}{2}\beta_t x$. | DDPM forward process $\approx$ discrete VP-SDE (OU); the connection to BM is via Donsker's theorem in the limit of vanishing step size and growing step count. |