
Stochastic Processes: Part 11: Exercises to Appendix U: Continuous-Time Martingales

11. Exercises

Exercise 1 * - Martingale Verification. Let $X_1, X_2, \ldots$ be iid with $\mathbb{E}[X_i] = 0$, $\text{Var}(X_i) = \sigma^2$. Define $S_n = \sum_{k=1}^n X_k$ and $V_n = S_n^2 - n\sigma^2$.

(a) Prove $\{S_n\}$ is a martingale with respect to its natural filtration. (b) Prove $\{V_n\}$ is a martingale. What does $\mathbb{E}[V_\tau] = \mathbb{E}[V_0]$ imply for a bounded stopping time $\tau$? (c) For $X_k \sim \text{Uniform}\{-1,+1\}$, verify (a) and (b) numerically with $n=100$ and $10^5$ simulations. Check $\mathbb{E}[S_n] \approx 0$ and $\mathbb{E}[V_n] \approx 0$. (d) Let $\tau = \min(n, 50)$ for a fixed $n=100$ simulation. Verify $\mathbb{E}[S_\tau] \approx 0$ and $\mathbb{E}[V_\tau] \approx 0$ empirically.
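A minimal simulation sketch for parts (c) and (d), assuming $\pm 1$ steps so that $\sigma^2 = 1$ (variable names are illustrative):

import numpy as np

np.random.seed(0)
n, n_sims = 100, 100_000
X = np.random.choice([-1, 1], size=(n_sims, n))    # E[X] = 0, Var(X) = 1
S = X.cumsum(axis=1)                                # S_1, ..., S_n for each path
V = S**2 - np.arange(1, n + 1)                      # V_k = S_k^2 - k * sigma^2
print(S[:, -1].mean(), V[:, -1].mean())             # both should be close to 0
tau = 50                                            # bounded stopping time min(n, 50)
print(S[:, tau - 1].mean(), V[:, tau - 1].mean())   # also close to 0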

Exercise 2 * - Doob's OST: Gambler's Ruin. A fair gambler starts with $k = 30$ dollars and plays until reaching $N = 100$ or going broke.

(a) Use OST applied to $S_n$ to derive $P(\text{win}) = k/N$. Compute numerically for $k=30$, $N=100$. (b) Use OST applied to $S_n^2 - n$ to derive $\mathbb{E}[\tau] = k(N-k)$. Compute for the given values. (c) Simulate $10^5$ games and verify both $P(\text{win})$ and $\mathbb{E}[\tau]$ empirically. (d) For a biased game with $p=0.6$, use the exponential martingale $M_n = ((1-p)/p)^{S_n}$ to derive $P(\text{win})$. Verify numerically.

Exercise 3 * - Poisson Process Properties. Let $N(t)$ be a Poisson process with rate $\lambda=2$.

(a) Compute $P(N(3) \geq 5)$ exactly and via simulation ($10^5$ trials). (b) Show that $M(t) = N(t) - \lambda t$ is a martingale (verify $\mathbb{E}[M(t)] = 0$ and $\mathbb{E}[M(t) \mid N(s), s \leq r] = M(r)$ for $r < t$). (c) Thinning: generate a Poisson process with rate $\lambda=5$ and thin it independently with $p=0.4$. Verify the thinned process is Poisson($2$) by checking its inter-arrival distribution. (d) Superposition: show that the superposition of $\text{Poisson}(1)$ and $\text{Poisson}(2)$ processes is $\text{Poisson}(3)$ empirically.
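A quick sketch for parts (c) and (d), sampling arrival times via exponential gaps on a long horizon (the helper arrival_times is illustrative, not from the text):

import numpy as np

rng = np.random.default_rng(0)
T = 10_000.0

def arrival_times(rate, horizon, rng):
    """Sample Poisson arrival times on [0, horizon] via exponential inter-arrival gaps."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / rate)
        if t > horizon:
            return np.array(times)
        times.append(t)

# (c) Thin a rate-5 process with p = 0.4 -> inter-arrivals should be Exp(2)
arrivals = arrival_times(5.0, T, rng)
kept = arrivals[rng.random(len(arrivals)) < 0.4]
print("thinned mean gap:", np.diff(kept).mean(), "(theory 1/2)")

# (d) Superpose rate-1 and rate-2 processes -> mean gap should be 1/3
merged = np.sort(np.concatenate([arrival_times(1.0, T, rng),
                                 arrival_times(2.0, T, rng)]))
print("merged mean gap:", np.diff(merged).mean(), "(theory 1/3)")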

Exercise 4 ** - Doob Martingale and McDiarmid. Let $f(x_1,\ldots,x_n) = \max_{1 \leq k \leq n} x_k$ on $[0,1]^n$ with $X_i \stackrel{iid}{\sim} \text{Uniform}[0,1]$.

(a) Show $f$ has bounded differences: $\sup |f(\ldots,x_k,\ldots) - f(\ldots,x_k',\ldots)| \leq 1$ for each $k$, so $c_k = 1$. (b) Construct the Doob martingale $M_k = \mathbb{E}[\max X \mid X_1,\ldots,X_k]$ and compute $M_0$ and $M_n$ analytically. (c) Apply Azuma's inequality to bound $P(f - \mathbb{E}[f] \geq t)$ and compare to McDiarmid's bound. (d) Verify both bounds numerically for $n=100$, $t=0.05$. Compute the empirical tail probability and compare to the bound.
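A minimal sketch for part (d), comparing the empirical tail of $\max_i X_i$ with the McDiarmid bound $\exp(-2t^2/\sum_k c_k^2)$ and the Azuma bound $\exp(-t^2/(2\sum_k c_k^2))$ (both with $c_k = 1$, as in part (a)):

import numpy as np

rng = np.random.default_rng(0)
n, t, n_sims = 100, 0.05, 100_000
X = rng.random((n_sims, n))
f = X.max(axis=1)                          # f = max of n uniforms
mean_f = n / (n + 1)                       # analytical E[max] = n/(n+1)
empirical_tail = (f - mean_f >= t).mean()
mcdiarmid_bound = np.exp(-2 * t**2 / n)    # sum c_k^2 = n with c_k = 1
azuma_bound = np.exp(-t**2 / (2 * n))
print(f"empirical: {empirical_tail:.4f}  McDiarmid: {mcdiarmid_bound:.4f}  Azuma: {azuma_bound:.4f}")
# Both bounds are close to 1 here: c_k = 1 is extremely loose for the max.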

Exercise 5 ** - Brownian Motion Properties. Simulate $B_t$ on $[0,1]$ using $n=1000$ steps.

(a) Verify the increment distribution: check that $B_t - B_s \sim \mathcal{N}(0, t-s)$ holds for several $(s,t)$ pairs using a KS test. (b) Verify quadratic variation: compute $QV_n = \sum_{k=1}^n (B_{k/n} - B_{(k-1)/n})^2$ for $n = 10, 100, 1000, 10000$ and show $QV_n \to 1$. (c) Verify non-differentiability heuristically: compute finite differences $\Delta B / \Delta t$ for $\Delta t = 10^{-1}, 10^{-2}, 10^{-3}$ and observe they grow as $1/\sqrt{\Delta t}$. (d) Simulate the OU process $dX = -X\,dt + \sqrt{2}\,dB_t$ with Euler-Maruyama. Verify the stationary distribution is $\mathcal{N}(0,1)$.
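A minimal sketch for part (b), assuming a fresh Brownian path is simulated for each grid size:

import numpy as np

rng = np.random.default_rng(0)
for n in (10, 100, 1000, 10_000):
    increments = rng.normal(0.0, np.sqrt(1.0 / n), size=n)   # B_{k/n} - B_{(k-1)/n}
    qv = (increments**2).sum()
    print(f"n={n:6d}  QV_n={qv:.4f}")    # concentrates around 1 as n grows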

Exercise 6 ** - Gaussian Process Posterior. Let $f \sim \mathcal{GP}(0, k_{\text{RBF}})$ with $k_{\text{RBF}}(x,x') = \exp(-|x-x'|^2/(2\ell^2))$, $\ell = 0.5$.

(a) Draw 5 sample paths from the GP prior on $[0,5]$ using the covariance matrix. (b) Given observations $y_i = f(x_i) + \epsilon_i$ at $x = \{0.5, 1.5, 2.5, 3.5\}$ with $\epsilon_i \sim \mathcal{N}(0, 0.01)$, compute the GP posterior mean and variance. (c) Plot the posterior mean $\pm 2$ standard deviations and verify that all observation points are covered. (d) Compare the posterior using the RBF kernel vs the Matern-1/2 kernel. Which has smoother interpolation? Which has faster uncertainty growth away from observations?

Exercise 7 *** - SGD as Diffusion Process. Minimise $L(\theta) = \theta^2/2$ (quadratic, $H=1$, $\theta^* = 0$) using SGD with $\tilde{g}(\theta) = \theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

(a) Show that the SGD iterates satisfy $\theta_{n+1} = (1-\eta)\theta_n - \eta\epsilon_n$. Find the stationary distribution of $\theta_n$. (b) Verify analytically that $\text{Var}(\theta_\infty) = \eta\sigma^2/(2-\eta) \approx \eta\sigma^2/2$ for small $\eta$. (c) Simulate for $\eta \in \{0.01, 0.1, 0.3\}$, $\sigma^2 = 1$, $n = 10000$ steps. Plot the stationary distribution and compare to the analytical prediction. (d) Verify the diffusion approximation: the continuous-time SDE $d\theta = -\theta\,dt + \sqrt{\eta}\,\sigma\,dB_t$ has stationary distribution $\mathcal{N}(0, \eta\sigma^2/2)$. Compare to part (b).
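A minimal sketch for part (c), comparing the empirical stationary variance with $\eta\sigma^2/(2-\eta)$ (burn-in and step counts are illustrative):

import numpy as np

rng = np.random.default_rng(0)
sigma2, n_steps, burn_in = 1.0, 200_000, 10_000
for eta in (0.01, 0.1, 0.3):
    theta, samples = 0.0, []
    for n in range(n_steps):
        eps = rng.normal(0.0, np.sqrt(sigma2))
        theta = (1 - eta) * theta - eta * eps     # SGD step on L(theta) = theta^2 / 2
        if n >= burn_in:
            samples.append(theta)
    print(f"eta={eta:4.2f}  empirical var={np.var(samples):.5f}  "
          f"theory={eta * sigma2 / (2 - eta):.5f}")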

Exercise 8 *** - RL TD-Error as Martingale. Consider a Markov reward process: states $\{1, 2, 3\}$, transition matrix $P = [[0.5,0.3,0.2],[0.4,0.4,0.2],[0.3,0.3,0.4]]$, rewards $r = [1, 2, 3]$, discount $\gamma = 0.9$.

(a) Compute the true value function $V = (I - \gamma P)^{-1}r$ exactly. (b) Implement TD(0) to estimate $V$: run $10^4$ episodes of length 50, update $V(s_t) \leftarrow V(s_t) + \alpha(r_t + \gamma V(s_{t+1}) - V(s_t))$ with $\alpha = 0.01$. Compare to the exact $V$. (c) Verify the martingale property of the TD error: under the true $V^*$, the TD error $\delta_t = r_t + \gamma V^*(s_{t+1}) - V^*(s_t)$ has $\mathbb{E}[\delta_t \mid s_t] = 0$. Verify numerically. (d) Implement a variance-reduced version using baseline $b(s) = \bar{r}$ (mean reward). Show this reduces the variance of the gradient estimate without introducing bias.
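A minimal sketch for parts (a) and (c), assuming the reward collected at time $t$ is the deterministic $r(s_t)$ of the current state:

import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.3, 0.3, 0.4]])
r = np.array([1.0, 2.0, 3.0])
gamma = 0.9

# (a) exact value function V* = (I - gamma P)^{-1} r
V = np.linalg.solve(np.eye(3) - gamma * P, r)
print("V* =", V)

# (c) TD errors under V* have conditional mean ~0 in every state
deltas = {s: [] for s in range(3)}
s = 0
for _ in range(200_000):
    s_next = rng.choice(3, p=P[s])
    deltas[s].append(r[s] + gamma * V[s_next] - V[s])
    s = s_next
for s in range(3):
    print(f"E[delta | s={s+1}] = {np.mean(deltas[s]):+.5f}  (should be ~0)")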


12. Why This Matters for AI (2026 Perspective)

| Concept | AI/ML Impact |
|---|---|
| Filtrations and adaptedness | Causal vs. bidirectional attention; early stopping as stopping time; data leakage detection |
| Martingales | TD-learning convergence; REINFORCE gradient estimators; unbiased gradient noise in SGD |
| Optional Stopping Theorem | Early stopping criteria; gambler's ruin models of gradient explosion; RL episode termination |
| Doob Martingale | Unified proof of McDiarmid/Azuma; online learning regret bounds; generalisation theory |
| Poisson Process | LLM inference request queues; KV-cache hit modelling; continuous batching server design |
| Brownian Motion | Forward process in diffusion models; SGD noise continuous-time approximation; random feature maps |
| OU Process | DDPM/Score-SDE forward processes; L2 regularisation as mean reversion; Adam momentum as filtering |
| Geometric Brownian Motion | Residual stream norm growth; multiplicative noise in LoRA fine-tuning |
| Donsker's Theorem | Justifies SDE approximation of SGD; random walk -> BM limit for CLT on paths |
| Wide-Sense Stationarity | RoPE relative position encoding; ALiBi attention bias; translation-invariant kernels |
| Ergodic Theorem | Time-average = ensemble average in training; failure mode in continual learning |
| Gaussian Processes | Bayesian hyperparameter search; neural tangent kernel in infinite-width limit; GAN discriminator theory |
| GP Posterior | Uncertainty quantification in Bayesian deep learning; active learning for LLM fine-tuning data selection |
| Score SDE | Unified framework for DDPM, DDIM, consistency models, flow matching |
| Martingale CLT | Asymptotic normality of SGD parameter estimates; statistical inference on trained models |

13. Conceptual Bridge

Looking backward: This section builds directly on the static probability of Section01-Section05. Filtrations formalise conditional expectations (Section04) over time; the Doob martingale makes McDiarmid's inequality (Section05) a consequence of Azuma applied to a martingale; the Gaussian process is the stochastic-process version of the multivariate Gaussian (Section03). Every concept here has a static probability antecedent.

Looking forward: Stochastic processes enable the dynamic probability of Section07 and beyond. Markov chains (Section07) are stochastic processes satisfying the Markov property - the specialisation of the general framework to memoryless dynamics. Statistics (Section07-Statistics) uses time series models (AR, MA, ARIMA) which are stationary stochastic processes analysed via the autocorrelation and spectral density tools developed here. Optimisation (Section08) uses the SDE/martingale formalism to analyse SGD convergence and diffusion-based sampling.

The martingale as a unifying concept: The martingale is arguably the most powerful single concept in probability theory. It appears in: (a) concentration inequalities as the Doob/Azuma construction, (b) SGD as a near-martingale gradient process, (c) RL value functions as martingale transforms, (d) diffusion model reverse processes as time-reversed SDEs with martingale driving noise, (e) sequential hypothesis testing (SPRT) as a martingale stopping problem. Understanding martingales is understanding the probabilistic structure that makes machine learning algorithms work.

STOCHASTIC PROCESSES: POSITION IN CURRICULUM
========================================================================

  STATIC PROBABILITY (Section01-Section05)
  ---------------------------------------------
  Section01 Random Variables  ->  Section02 Distributions
  Section03 Joint Dist.       ->  Section04 Expectation/MGF
  Section05 Concentration     ->  McDiarmid = Doob + Azuma
  ---------------------------------------------
                | filtrations, martingales, paths
                v
  +---------------------------------------------+
  |  Section06 STOCHASTIC PROCESSES (THIS SECTION)    |
  |  -----------------------------------------  |
  |  Filtrations    Martingales   OST           |
  |  Random Walks   Poisson       BM/OU         |
  |  Stationarity   GPs           SGD/RL/Diff   |
  +---------------------------------------------+
                |
       +--------+--------+
       v                 v
  Section07 Markov Chains   Section07-Stat/05 Time Series
  (Markov property,   (AR/MA/ARIMA, spectral
  MCMC, PageRank,     analysis, forecasting,
  RL policy eval)     seasonal decomposition)
       |
       v
  Section08 Optimisation (SDE/martingale-based
  convergence of SGD, Langevin dynamics,
  diffusion-based sampling)

========================================================================

The AI synthesis: In 2026, the stochastic process framework permeates every major AI subfield. Diffusion models are explicit SDE constructions. RL convergence relies on martingale theory. LLM evaluation uses Hoeffding/CLT on sequences of i.i.d. samples (Poisson processes of test cases). Bayesian optimisation uses GP posteriors. Continual learning grapples with non-stationarity. The practitioner who understands stochastic processes sees the probabilistic skeleton underlying methods that otherwise appear unrelated.


This completes Section06 Stochastic Processes. The framework built here - filtrations, martingales, the canonical processes - is the common language of probability theory in motion.



Appendix A: Stochastic Calculus Primer

A.1 The Ito Integral

Classical calculus integrates smooth functions. The Ito integral extends integration to Brownian motion, where the integrand is a stochastic process.

Why ordinary integrals fail: Define $\int_0^T B_t\,dB_t$ using Riemann sums. Two natural choices:

  • Left-endpoint (Ito): $\sum_{k=0}^{n-1} B_{t_k}(B_{t_{k+1}} - B_{t_k})$
  • Midpoint (Stratonovich): $\sum_{k=0}^{n-1} \frac{B_{t_k}+B_{t_{k+1}}}{2}(B_{t_{k+1}} - B_{t_k})$

These give different limits as $n \to \infty$! This is unique to stochastic integration - in classical calculus all Riemann sums give the same limit.

Computation:

$$\text{Ito: } \int_0^T B_t\,dB_t = \frac{B_T^2 - T}{2} \qquad\qquad \text{Stratonovich: } \int_0^T B_t \circ dB_t = \frac{B_T^2}{2}$$

The Ito integral gives an extra $-T/2$ term from the quadratic variation $[B]_T = T$. This is the source of Ito's lemma.
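A quick numerical check of the two Riemann-sum conventions on one simulated Brownian path (a sketch; $T = 1$ and the step count are illustrative):

import numpy as np

rng = np.random.default_rng(0)
T, n = 1.0, 200_000
dB = rng.normal(0.0, np.sqrt(T / n), size=n)
B = np.concatenate([[0.0], dB.cumsum()])                       # B_{t_0}, ..., B_{t_n}

ito = np.sum(B[:-1] * (B[1:] - B[:-1]))                        # left endpoint
strat = np.sum(0.5 * (B[:-1] + B[1:]) * (B[1:] - B[:-1]))      # midpoint
print(f"Ito         : {ito:.4f}   theory (B_T^2 - T)/2 = {(B[-1]**2 - T) / 2:.4f}")
print(f"Stratonovich: {strat:.4f}   theory  B_T^2 / 2    = {B[-1]**2 / 2:.4f}")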

The Ito integral is a martingale: For any adapted square-integrable integrand $H_t$:

$$M_T = \int_0^T H_t\,dB_t \quad \text{is a martingale}$$

with $\mathbb{E}[M_T^2] = \int_0^T \mathbb{E}[H_t^2]\,dt$ (Ito isometry).

A.2 Ito's Lemma

Theorem (Ito's Lemma). Let $X_t$ satisfy the SDE $dX_t = \mu_t\,dt + \sigma_t\,dB_t$. For any $C^2$ function $f$:

$$df(X_t) = f'(X_t)\,dX_t + \frac{1}{2}f''(X_t)\sigma_t^2\,dt = \left(\mu_t f'(X_t) + \frac{\sigma_t^2}{2}f''(X_t)\right)dt + \sigma_t f'(X_t)\,dB_t$$

The extra term $\frac{1}{2}f''(X_t)\sigma_t^2\,dt$ (the Ito correction) arises from the quadratic variation $(dB_t)^2 = dt$. It has no classical analogue.

Example: GBM. $dS = \mu S\,dt + \sigma S\,dB$. Apply Ito to $f(S) = \log S$:

$$d\log S = \frac{1}{S}\mu S\,dt + \frac{1}{S}\sigma S\,dB - \frac{1}{2}\frac{1}{S^2}\sigma^2 S^2\,dt = \left(\mu - \frac{\sigma^2}{2}\right)dt + \sigma\,dB$$

The $-\sigma^2/2$ Ito correction converts the arithmetic drift $\mu$ to the geometric/log drift $\mu - \sigma^2/2$.

For AI: DDPM noise levels satisfy $d\bar\alpha_t \approx -\beta_t\bar\alpha_t\,dt$ (a linear SDE for $\log\bar\alpha$). The Ito correction explains why the SNR $\bar\alpha_t/(1-\bar\alpha_t)$ does not decrease linearly even for linear schedules - the correction term shifts the effective noise profile.
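A short check of the log-drift correction using the exact GBM solution (a sketch; the parameter values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, S0, T, n_paths = 0.1, 0.3, 1.0, 1.0, 200_000
# Exact GBM solution: S_T = S_0 exp((mu - sigma^2/2) T + sigma B_T)
B_T = rng.normal(0.0, np.sqrt(T), n_paths)
S_T = S0 * np.exp((mu - sigma**2 / 2) * T + sigma * B_T)
print("E[log S_T]:", np.log(S_T).mean(), " theory:", np.log(S0) + (mu - sigma**2 / 2) * T)
print("E[S_T]    :", S_T.mean(), " theory:", S0 * np.exp(mu * T))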

A.3 Stochastic Differential Equations

General Ito SDE:

$$dX_t = b(X_t, t)\,dt + \sigma(X_t, t)\,dB_t, \quad X_0 = x_0$$

  • $b(x,t)$: drift (deterministic tendency)
  • $\sigma(x,t)$: diffusion coefficient (noise strength)

Existence and uniqueness (Lipschitz conditions): If $b$ and $\sigma$ are Lipschitz in $x$ uniformly in $t$ and have linear growth, the SDE has a unique strong solution (Picard-Lindelof for SDEs).

Important SDEs in ML:

| Process | SDE | Notes |
|---|---|---|
| BM | $dX = dB$ | $b=0$, $\sigma=1$ |
| OU | $dX = -\theta X\,dt + \sigma\,dB$ | Mean-reverting |
| GBM | $dX = \mu X\,dt + \sigma X\,dB$ | Multiplicative noise |
| Langevin | $dX = -\nabla U(X)\,dt + \sqrt{2}\,dB$ | Stationary $\propto e^{-U}$ |
| SGD-SDE | $d\theta = -\nabla L(\theta)\,dt + \sqrt{\eta\Sigma}\,dB$ | SGD diffusion approx. |
| DDPM fwd | $dX = -\frac{\beta}{2}X\,dt + \sqrt{\beta}\,dB$ | VP-SDE for diffusion |

Langevin dynamics and MCMC: The Langevin SDE $dX_t = -\nabla U(X_t)\,dt + \sqrt{2}\,dB_t$ has stationary distribution $\pi(x) \propto e^{-U(x)}$. This is the continuous-time version of Langevin MCMC, used for sampling from complex posteriors (e.g., Bayesian neural networks, energy-based models).
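A minimal unadjusted Langevin sampler sketch for the double-well potential $U(x) = (x^2-1)^2$ (step size and run length are illustrative; no Metropolis correction is applied):

import numpy as np

rng = np.random.default_rng(0)
U_grad = lambda x: 4 * x * (x**2 - 1)         # gradient of U(x) = (x^2 - 1)^2
dt, n_steps, burn_in = 1e-2, 200_000, 10_000
x, samples = 0.0, []
for n in range(n_steps):
    x += -U_grad(x) * dt + np.sqrt(2 * dt) * rng.normal()
    if n >= burn_in:
        samples.append(x)
samples = np.array(samples)
# The chain should spend roughly equal time in the two wells near x = -1 and x = +1
print("fraction in right well:", (samples > 0).mean())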


Appendix B: Convergence Theorems for Martingales

B.1 Martingale Convergence

Theorem (Doob's Martingale Convergence, 1953). If $\{M_n\}$ is a martingale and $\sup_n \mathbb{E}[|M_n|] < \infty$ ($L^1$ bounded), then $M_\infty = \lim_{n\to\infty} M_n$ exists a.s. and $\mathbb{E}[|M_\infty|] < \infty$.

This theorem is foundational for proving convergence of:

  • Bayesian posterior updates (likelihood ratio martingales)
  • Online learning algorithms (regret bounds via bounded martingales)
  • Stochastic approximation algorithms (Robbins-Monro)

The $L^2$ version: If $\sup_n \mathbb{E}[M_n^2] < \infty$, then $M_n \to M_\infty$ both a.s. and in $L^2$.

B.2 Upcrossing Inequality

The proof of martingale convergence uses the upcrossing inequality: the number of times a martingale upcrosses an interval $[a,b]$ (goes from below $a$ to above $b$) satisfies:

$$\mathbb{E}[U_n[a,b]] \leq \frac{\mathbb{E}[(M_n - a)^+]}{b - a}$$

If $M_n$ is $L^1$ bounded, then $\mathbb{E}[U_\infty[a,b]] < \infty$, so upcrossings are finite for every rational interval $[a,b]$. This means oscillation in the limit is impossible, forcing convergence.

B.3 Uniform Integrability and $L^1$ Convergence

Definition. A family $\{M_t\}$ is uniformly integrable (UI) if:

$$\sup_t \mathbb{E}[|M_t|\mathbf{1}_{|M_t| > K}] \to 0 \quad \text{as } K \to \infty$$

UI is the condition needed for $\mathbb{E}[M_\infty] = \lim_n \mathbb{E}[M_n]$ (exchange of limit and expectation). Without UI, a.s. convergence does not imply $L^1$ convergence.

Sufficient condition for UI: $\sup_n \mathbb{E}[|M_n|^p] < \infty$ for any $p > 1$.

For AI: Uniform integrability is implicitly required for most SGD convergence proofs that need $\mathbb{E}[L(\theta_N)] \to L^*$ (expectation of the limit = limit of the expectations). When gradient clipping is used, it ensures the gradient process is UI, validating the convergence proofs.


Appendix C: Gaussian Process Details

C.1 Kernel Properties

A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a valid kernel (positive semidefinite / Mercer kernel) iff:

$$\sum_{i,j=1}^n c_i c_j k(x_i, x_j) \geq 0 \quad \forall n,\; x_1,\ldots,x_n \in \mathcal{X},\; c_i \in \mathbb{R}$$

Operations preserving PSD kernels:

  • Sum: $k_1 + k_2$ (the sum of two valid kernels is valid)
  • Product: $k_1 \cdot k_2$ (the product of two valid kernels is valid)
  • Scaling: $\alpha k$ for $\alpha > 0$
  • Composition: $k(x,x') = k_1(\phi(x), \phi(x'))$ for any feature map $\phi$

Bochner's theorem (characterisation of stationary kernels): A continuous function $k(\tau)$ is the covariance of a stationary process iff:

$$k(\tau) = \int e^{i\omega\tau}\,dF(\omega)$$

for some non-decreasing bounded measure $F$ (the spectral measure). The PSD is $S(\omega) = dF/d\omega$.
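Bochner's theorem is also what makes random Fourier features work: sampling frequencies from the spectral measure of the RBF kernel (a Gaussian) gives a finite feature map whose inner products approximate the kernel. A minimal sketch (feature count and test points are illustrative):

import numpy as np

rng = np.random.default_rng(0)
ell, d, n_features = 0.5, 1, 2000
# Spectral measure of the RBF kernel with length-scale ell is N(0, 1/ell^2)
omega = rng.normal(0.0, 1.0 / ell, size=(n_features, d))
b = rng.uniform(0.0, 2 * np.pi, size=n_features)

def phi(x):
    """Random Fourier feature map: phi(x) . phi(x') approximates k_RBF(x, x')."""
    return np.sqrt(2.0 / n_features) * np.cos(x @ omega.T + b)

x1, x2 = np.array([[0.3]]), np.array([[0.9]])
approx = (phi(x1) @ phi(x2).T).item()
exact = np.exp(-(0.3 - 0.9)**2 / (2 * ell**2))
print(f"RFF approx: {approx:.4f}   exact RBF: {exact:.4f}")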

C.2 Sparse GP Approximations

Exact GP inference costs $O(n^3)$ for $n$ training points. For large $n$ (e.g., $n = 10^5$), this is intractable. Sparse GP approximations use $m \ll n$ inducing points $\mathbf{u} = f(\mathbf{z})$:

$$p(f) \approx q(f) = \int p(f\mid\mathbf{u})\,q(\mathbf{u})\,d\mathbf{u}$$

The most common is SVGP (Sparse Variational GP), which optimises a variational lower bound (ELBO), reducing the cost to $O(nm^2 + m^3)$.

For AI: Sparse GPs enable scalable Bayesian optimisation for LLM hyperparameter tuning with $n > 1000$ evaluations. The inducing points act as a compressed representation of the observations, analogous to the key-value cache in attention.

C.3 Deep Kernel Learning

Neural-network kernels: Replace the input $x$ with a neural network representation $\phi_\theta(x)$ and use:

$$k_\theta(x,x') = k_{\text{base}}(\phi_\theta(x), \phi_\theta(x'))$$

This is deep kernel learning (DKL): train $\theta$ by maximising the GP marginal likelihood. DKL combines the flexibility of deep representations with the uncertainty quantification of GPs, and is used in scalable Bayesian optimisation for AutoML.


Appendix D: Extended ML Applications

D.1 Diffusion Models: Mathematical Details

The probability flow ODE for the VP-SDE is:

$$\frac{dx}{dt} = f(x,t) - \frac{1}{2}g(t)^2\,\nabla_x\log p_t(x)$$

This ODE has the same marginal distributions as the reverse SDE but is deterministic. It enables:

  • DDIM sampling: Solve the ODE with large step sizes (fewer function evaluations)
  • Latent space interpolation: Deterministic ODE gives unique encoding of each data point
  • Consistency models: Learn a direct $x_t \to x_0$ mapping that is consistent along ODE trajectories

The score function $s_\theta(x_t, t) \approx \nabla_{x_t}\log p_t(x_t)$ is the neural network that drives both the SDE and the ODE. At the optimum, with $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$,

$$s_\theta^*(x_t, t) = \frac{\sqrt{\bar\alpha_t}\,\mathbb{E}[x_0 \mid x_t] - x_t}{1-\bar\alpha_t} = -\frac{\mathbb{E}[\epsilon \mid x_t]}{\sqrt{1-\bar\alpha_t}}$$

This is a conditional expectation - directly connecting score learning to the Doob martingale construction.

D.2 Online Learning and Martingale Regret Bounds

In online learning, at each round $t$ the learner predicts $\hat{y}_t$, suffers loss $\ell_t(\hat{y}_t)$, and observes the true outcome $y_t$. The regret against the best fixed action $a^*$ is:

$$\text{Regret}_T = \sum_{t=1}^T \ell_t(\hat{y}_t) - \min_{a}\sum_{t=1}^T \ell_t(a)$$

Martingale structure: In Follow-the-Regularised-Leader (FTRL) algorithms, the loss differences $\ell_t(\hat{y}_t) - \mathbb{E}[\ell_t(\hat{y}_t)\mid\mathcal{F}_{t-1}]$ form a martingale difference sequence. Applying Azuma's inequality:

$$P\left(\text{Regret}_T \geq \mathbb{E}[\text{Regret}_T] + t\sqrt{T}\right) \leq e^{-2t^2}$$

This gives high-probability regret bounds - the stochastic process framework converts expected regret into individual-run guarantees.

Bandit algorithms (Thompson Sampling): In multi-armed bandits with $K$ arms, at each round the agent samples parameters from the posterior and plays the arm with the highest sampled expected reward. The number of plays of suboptimal arms is bounded using stopping-time arguments on the likelihood ratio martingale.

D.3 RLHF and Reward Models as Stochastic Processes

RLHF (Reinforcement Learning from Human Feedback) trains a language model $\pi_\theta$ to maximise a learned reward model $r_\phi(x,y)$ subject to a KL penalty:

$$\max_\pi \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi}[r_\phi(x,y)] - \beta\,\text{KL}(\pi \,\|\, \pi_{\text{ref}})$$

This is a stochastic control problem: the policy $\pi$ is a controller, the generation trajectory $(y_1,\ldots,y_T)$ is the controlled stochastic process, and the reward $r_\phi(x, y_{1:T})$ is a terminal cost.

The RLHF policy update (PPO) as a martingale: The policy gradient estimator in PPO is:

$$\hat{g} = \mathbb{E}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\,A_t\right]$$

where $A_t$ is the advantage estimate. Under the true advantage function, $A_t = Q^\pi(s_t,a_t) - V^\pi(s_t)$ has $\mathbb{E}[A_t \mid s_t] = 0$ - it is a martingale difference. PPO clips the ratio to prevent large updates, maintaining stability of the martingale approximation.

D.4 Stochastic Processes in LLM Interpretability

Mechanistic interpretability (Elhage et al., Nanda et al.) studies circuits in transformer weights. The sequence of token representations flowing through a transformer's layers can be modelled as a stochastic process evolving through residual stream updates:

$$x_t^{(\ell+1)} = x_t^{(\ell)} + \text{Attn}^{(\ell)}(x_{1:t}^{(\ell)}) + \text{MLP}^{(\ell)}(x_t^{(\ell)})$$

Each residual update is a perturbation to the current state - the trajectory through layers is a discrete-time process indexed by layer depth $\ell$, not token position $t$. Understanding this process (what information is stored/retrieved at each layer) is the central question of interpretability.

Induction heads are attention heads that implement the operation "if A appeared before B at some position, predict B will appear after the current A." This is a pattern-matching stopping time: the head activates when it detects that the current context matches a past pattern. Formalising this as a stopping time over the layer-depth stochastic process is an active research direction.


Appendix E: The Functional CLT (Donsker's Theorem)

E.1 Weak Convergence on Function Spaces

Donsker's theorem states convergence in distribution on the function space $C([0,1])$ - the space of continuous functions equipped with the supremum norm $\|f\|_\infty = \sup_t |f(t)|$.

Theorem (Donsker, 1951). Let $X_1, X_2, \ldots$ be iid with $\mathbb{E}[X_i]=0$, $\sigma^2 = \mathbb{E}[X_i^2] < \infty$. Define:

$$W_n(t) = \frac{1}{\sigma\sqrt{n}}S_{\lfloor nt \rfloor}, \quad t \in [0,1]$$

Then $W_n \Rightarrow B$ weakly in $(C([0,1]), \|\cdot\|_\infty)$, where $B$ is standard Brownian motion.

Continuous mapping theorem: If $W_n \Rightarrow B$ and $g: C([0,1]) \to \mathbb{R}$ is continuous, then $g(W_n) \Rightarrow g(B)$. This gives:

  • $\max_{0 \leq k \leq n} S_k / (\sigma\sqrt{n}) \Rightarrow \max_{0 \leq t \leq 1} B_t$ - the reflection principle for random walks
  • $n^{-2}\sum_{k=1}^n S_k^2 / \sigma^2 \Rightarrow \int_0^1 B_t^2\,dt$ - a quadratic functional of the Brownian path

E.2 Applications to SGD Analysis

Donsker's theorem justifies the SDE approximation of SGD: as the step size $\eta \to 0$, the piecewise-constant SGD interpolation converges (in distribution, as a path) to the SDE $d\theta = -\nabla L\,dt + \sqrt{\eta\Sigma}\,dB$. This means:

  • Convergence results for the SDE (e.g., mixing time to the stationary distribution) translate to discrete-time SGD results
  • The stationary distribution of SGD approximates $\mathcal{N}(\theta^*, \frac{\eta}{2}H^{-1}\Sigma)$ near a minimum $\theta^*$
  • The escape rate from a local minimum scales as $\exp(-\Delta L / (\eta/B))$ - the Kramers escape rate formula

This is why practitioners observe that increasing the batch size $B$ by a factor $k$ requires proportionally increasing the learning rate $\eta$ by $k$ to maintain the same noise temperature $\eta/B$ and thus the same optimisation dynamics.


Appendix F: Notation Summary

| Symbol | Meaning |
|---|---|
| $(\Omega, \mathcal{F}, P)$ | Probability space |
| $(\mathcal{F}_t)_{t \geq 0}$ | Filtration (increasing family of $\sigma$-algebras) |
| $\{X_t\}_{t \in T}$ | Stochastic process indexed by $T$ |
| $M_n$ | Martingale |
| $\tau$ | Stopping time |
| $\mathcal{F}_\tau$ | $\sigma$-algebra at stopping time $\tau$ |
| $[X]_t$ | Quadratic variation of process $X$ up to time $t$ |
| $B_t$ | Standard Brownian motion (Wiener process) |
| $N(t)$ | Poisson process with rate $\lambda$ |
| $k(s,t)$ | Covariance kernel (for GPs) |
| $\mathcal{GP}(m, k)$ | Gaussian process with mean $m$ and kernel $k$ |
| $dB_t$ | Wiener increment (stochastic differential) |
| $\int_0^T H_t\,dB_t$ | Ito integral |
| $[B]_T = T$ | Quadratic variation of BM |
| $\bar\alpha_t$ | DDPM cumulative noise schedule |
| $s_\theta(x,t)$ | Score function approximation |
| VP-SDE | Variance-Preserving SDE (DDPM continuum) |
| VE-SDE | Variance-Exploding SDE |
| OST | Optional Stopping Theorem (Doob) |
| UI | Uniformly Integrable |
| WSS | Wide-Sense Stationary |
| ACVF | Autocovariance function |
| PSD | Power Spectral Density |
| NTK | Neural Tangent Kernel |
| OU | Ornstein-Uhlenbeck process |
| GBM | Geometric Brownian Motion |
| SRW | Simple Random Walk |

Appendix G: Worked Examples

G.1 Symmetric Random Walk: Hitting Probabilities

Problem. A particle starts at position $x = 3$ on the integer line. At each step it moves $+1$ or $-1$ with equal probability. Find: (a) the probability it hits 0 before hitting 10; (b) the expected number of steps to hit $\{0, 10\}$; (c) the probability it ever returns to 3, starting from 3.

Solution.

(a) By the gambler's ruin result, $P(\text{hit 0 before 10}) = 1 - 3/10 = 7/10$.

(b) $\mathbb{E}[\tau] = k(N-k) = 3 \times 7 = 21$ steps.

(c) For a symmetric walk starting at $x=3$ on $\mathbb{Z}$, the probability of ever returning to 3 equals 1 (recurrence in $d=1$, by Polya's theorem). However, the expected return time is $\infty$ (the walk is null-recurrent).

Numerical verification:

import numpy as np
np.random.seed(42)
N_trials = 100_000

# (a) Gambler's ruin simulation
hits_zero = 0
for _ in range(N_trials):
    x = 3
    while 0 < x < 10:
        x += np.random.choice([-1, 1])
    hits_zero += (x == 0)
print(f"P(hit 0) \\approx {hits_zero/N_trials:.4f}, theory: 0.7000")  # -> 0.7000

# (b) Expected steps
total_steps = 0
for _ in range(N_trials):
    x, steps = 3, 0
    while 0 < x < 10:
        x += np.random.choice([-1, 1])
        steps += 1
    total_steps += steps
print(f"E[\\tau] \\approx {total_steps/N_trials:.2f}, theory: 21.00")  # -> 21.00

G.2 Poisson Process: Conditional Uniformity

Problem. Events arrive as a Poisson process with rate $\lambda$. Given that exactly 5 events occurred in $[0,3]$, what is the distribution of the first event time $T_1$?

Solution. By the conditional uniformity property, given $N(3) = 5$, the 5 event times are distributed as the order statistics of 5 iid Uniform$[0,3]$ random variables. Therefore $T_1 \mid N(3)=5$ is the minimum of 5 iid Uniform$[0,3]$ variables, i.e. a $\text{Beta}(1,5)$ random variable scaled to $[0,3]$:

$$T_1 \mid N(3)=5 \;\sim\; 3\cdot\text{Beta}(1,5), \qquad \mathbb{E}[T_1 \mid N(3)=5] = 3 \cdot \tfrac{1}{6} = 0.5$$

More precisely: $P(T_1 > t \mid N(3)=5) = (1-t/3)^5$ for $t \in [0,3]$.

Verification: Simulate 5 Uniform$[0,3]$ random variables and take the minimum. Compare to a conditional Poisson process simulation.
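A minimal sketch of that verification, comparing the minimum of 5 uniforms with the exact conditional tail:

import numpy as np

rng = np.random.default_rng(0)
n_sims = 200_000
T1 = 3 * rng.random((n_sims, 5)).min(axis=1)    # min of 5 iid Uniform[0,3]
t = 1.0
print("P(T1 > 1) empirical:", (T1 > t).mean(), " exact:", (1 - t / 3)**5)
print("E[T1]     empirical:", T1.mean(), " exact:", 0.5)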

G.3 Brownian Motion: The Arc-Sine Distribution

Problem. Let $L_1 = \sup\{t \in [0,1]: B_t = 0\}$ be the last zero of Brownian motion before time 1. Find $P(L_1 \leq t)$.

Solution. By Levy's arc-sine law:

$$P(L_1 \leq t) = \frac{2}{\pi}\arcsin(\sqrt{t})$$

This is the CDF of the arc-sine distribution on $[0,1]$. The density $p(t) = 1/(\pi\sqrt{t(1-t)})$ concentrates near 0 and 1 - the last zero before time 1 is most likely to be near 0 or near 1, and least likely to be near $1/2$. This violates the naive intuition that "the last zero should be somewhere in the middle."

Implication for AI: The arc-sine law suggests that in a long random walk (or SRW approximation of SGD noise), the trajectory spends most of its time on one side of zero - the noise process can have long "runs" of the same sign. This persistent structure in gradient noise is why momentum-based optimisers (Adam, SGD with momentum) perform better than pure SGD for certain problem structures.

G.4 OU Process: Analytical vs Numerical

Problem. Solve $dX = -2(X-1)\,dt + \sqrt{3}\,dB$, $X_0 = 5$.

Solution. This is OU with $\theta=2$, $\mu=1$, $\sigma=\sqrt{3}$.

  • Solution: $X_t = 1 + (5-1)e^{-2t} + \sqrt{3}\int_0^t e^{-2(t-s)}\,dB_s$
  • Mean: $\mathbb{E}[X_t] = 1 + 4e^{-2t}$
  • Variance: $\text{Var}(X_t) = \frac{3}{4}(1 - e^{-4t})$
  • Stationary distribution: $X_\infty \sim \mathcal{N}(1, 3/4)$

The process decays exponentially from $X_0=5$ toward the mean $\mu=1$ with decay rate $\theta=2$, while accumulating noise that saturates at $\sigma^2/(2\theta) = 3/4$.

For diffusion models: this has the same structure as the DDPM forward process (a time-discretised OU process with $\beta_t \approx 4\,\Delta t$): the contribution of the initial condition, $4e^{-2T}$, decays to zero as $T \to \infty$, so for large $T$ the state is essentially a draw from the stationary distribution $X_T \sim \mathcal{N}(1, 3/4)$ - the analogue of the "pure noise" terminal condition.
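A quick Euler-Maruyama check of the mean and variance formulas at $t = 1$ (a sketch; the step size and path count are illustrative):

import numpy as np

rng = np.random.default_rng(0)
theta, mu, sigma, X0 = 2.0, 1.0, np.sqrt(3.0), 5.0
t_end, n_steps, n_paths = 1.0, 1000, 50_000
dt = t_end / n_steps
X = np.full(n_paths, X0)
for _ in range(n_steps):
    X += theta * (mu - X) * dt + sigma * np.sqrt(dt) * rng.normal(size=n_paths)
print("mean:", X.mean(), " theory:", 1 + 4 * np.exp(-2 * t_end))
print("var :", X.var(), " theory:", 0.75 * (1 - np.exp(-4 * t_end)))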

G.5 Gaussian Process: Prior vs Posterior

Problem. Fit a GP to three observations $(x_1, y_1) = (0, 1)$, $(x_2, y_2) = (1, -0.5)$, $(x_3, y_3) = (2, 0.8)$ with RBF kernel $k(x,x') = e^{-(x-x')^2}$ and noise $\sigma_n^2 = 0.01$.

Solution. Build the $3 \times 3$ kernel matrix $K$:

$$K_{ij} = e^{-(x_i-x_j)^2} + 0.01\,\delta_{ij}$$

Posterior at the test point $x_* = 1.5$:

  • Compute $k_* = [k(x_*, x_1), k(x_*, x_2), k(x_*, x_3)]$
  • $\mu_* = k_*^\top K^{-1}\mathbf{y}$
  • $\sigma_*^2 = k(x_*,x_*) - k_*^\top K^{-1} k_*$

The posterior mean interpolates between the observations, and the posterior variance is small near observations (where the kernel provides strong correlation) and large away from them.


Appendix H: Self-Assessment Checklist

After completing this section, verify you can:

  • State the definition of a stochastic process and give 3 examples of each combination of (discrete/continuous) x (time/state space)
  • Define a filtration and explain what "adapted" means in ML context
  • Verify the martingale property for a new process by computing $\mathbb{E}[M_{n+1}\mid\mathcal{F}_n]$
  • State all three conditions of Doob's OST and identify which applies to a given problem
  • Apply OST to compute expected stopping times and boundary hitting probabilities
  • Construct the Doob martingale for a given function $f$ and derive McDiarmid from Azuma
  • State the three defining properties of a Poisson process and derive the superposition/thinning results
  • State Wiener's four axioms for Brownian motion and compute $P(B_t \geq a)$
  • Compute the quadratic variation $[B]_T = T$ and state Ito's lemma
  • Solve the OU SDE and state its stationary distribution
  • Connect the DDPM forward process to the VP-SDE and OU process
  • Define WSS and ergodicity, and state the Wiener-Khinchin theorem
  • Specify a GP by its mean function and kernel, and compute the posterior given observations
  • Explain why RoPE makes attention a stationary kernel in position
  • Model SGD as a near-martingale and state the diffusion SDE approximation

Appendix I: Connections to Other Areas

I.1 Information Theory (Section09)

  • The log-likelihood ratio process $\sum_k \log\frac{p(X_k)}{q(X_k)}$ is a supermartingale under $Q$ and a submartingale under $P$
  • The sequential probability ratio test (SPRT) is an optimal stopping time for hypothesis testing, derived using the likelihood ratio martingale
  • Renyi entropy rates and the entropy rate $\bar{H} = \lim_n H(X_n \mid X_{n-1},\ldots,X_1)$ of a stationary process are central concepts at the intersection of stochastic processes and information theory

I.2 Functional Analysis (Section12)

  • The space $C([0,1])$ of continuous functions is a Banach space; weak convergence on this space is the setting of Donsker's theorem
  • Gaussian processes are infinite-dimensional Gaussian distributions on function spaces (Hilbert spaces)
  • Reproducing Kernel Hilbert Spaces (RKHS): for each valid kernel $k$ there is a unique RKHS $\mathcal{H}_k$ with the reproducing property $f(x) = \langle f, k(\cdot,x)\rangle_{\mathcal{H}_k}$. GP regression in $\mathcal{H}_k$ is equivalent to minimising a regularised empirical risk in the RKHS

I.3 Optimisation (Section08)

  • Stochastic approximation (Robbins-Monro, 1951): the original framework for SGD convergence uses martingale theory. Under step sizes $\sum_t \eta_t = \infty$, $\sum_t \eta_t^2 < \infty$ and bounded variance, the iterates converge a.s.
  • Lyapunov methods: proving convergence of SGD amounts to constructing a Lyapunov function $V(\theta)$ that is a (noisy) supermartingale, then applying martingale convergence
  • Langevin MCMC: the Langevin SDE $d\theta = -\nabla U(\theta)\,dt + \sqrt{2}\,dB_t$ is used for posterior sampling in Bayesian DL, connecting optimisation and stochastic process theory

I.4 Statistics (Section07-Statistics)

  • Time series analysis (ARIMA, state-space models) is the statistical estimation side of stochastic processes: given observed data $X_1,\ldots,X_n$, estimate the model parameters of an assumed process
  • Spectral analysis uses the Wiener-Khinchin theorem to estimate the PSD from data via the periodogram
  • Hypothesis testing for stationarity (Augmented Dickey-Fuller test, KPSS test) checks whether an observed time series is consistent with a stationary process

Appendix J: Advanced Topics

J.1 Levy Processes

A Levy process generalises both Brownian motion (continuous paths) and Poisson process (jumps) to arbitrary combinations of drift, diffusion, and jumps.

Levy-Khinchin representation: Every Levy process $X_t$ has characteristic function:

$$\mathbb{E}[e^{i\theta X_t}] = \exp\left(t\left[i\mu\theta - \frac{\sigma^2\theta^2}{2} + \int_{\mathbb{R}\setminus\{0\}} \left(e^{i\theta x} - 1 - i\theta x\,\mathbf{1}_{|x|<1}\right)\nu(dx)\right]\right)$$

where $\mu \in \mathbb{R}$ (drift), $\sigma^2 \geq 0$ (diffusion), and $\nu$ is the Levy measure (jump intensity).

For AI: Heavy-tailed gradient distributions in neural network training are better modelled by Levy processes (with large jumps) than by Brownian motion (Gaussian, no jumps). The Levy stable distribution family includes the Gaussian (index $\alpha=2$) and Cauchy ($\alpha=1$) as special cases. Empirical studies of gradient statistics in transformers show heavy tails (Student-$t$ with low degrees of freedom), suggesting Levy process models may be more appropriate than the Gaussian SDE approximation.

J.2 Malliavin Calculus

Malliavin calculus is a variational calculus on Wiener space - it provides derivatives of random variables with respect to the underlying Brownian motion.

The Malliavin derivative $\mathcal{D}_s F$ measures the sensitivity of $F$ to the Brownian increment at time $s$. It satisfies:

$$\mathcal{D}_s F = \lim_{\varepsilon\to 0}\frac{F(B + \varepsilon\mathbf{1}_{[s,\infty)}) - F(B)}{\varepsilon}$$

Clark-Ocone formula: For $F \in L^2$ with $\mathbb{E}[F]=0$:

$$F = \int_0^T \mathbb{E}[\mathcal{D}_s F \mid \mathcal{F}_s]\,dB_s$$

This represents any square-integrable random variable as a stochastic integral - connecting Malliavin calculus to the martingale representation theorem.

For AI: Malliavin calculus provides pathwise gradients for diffusion models, enabling gradient computation through the stochastic sampling process. This is used in score-based generative model training with more stable gradient estimates than the standard score-matching loss.

J.3 Stochastic Control and HJB Equation

A stochastic control problem seeks to find a policy $u_t(\omega)$ (adapted to $\mathcal{F}_t$) minimising:

$$J(u) = \mathbb{E}\left[\int_0^T \mathcal{L}(X_t, u_t)\,dt + g(X_T)\right]$$

subject to $dX_t = b(X_t, u_t)\,dt + \sigma(X_t, u_t)\,dB_t$.

The Hamilton-Jacobi-Bellman (HJB) equation gives the optimal value function $V(x,t) = \inf_u J(u; x,t)$:

$$-\partial_t V = \min_u\left[\mathcal{L}(x,u) + b(x,u)^\top\nabla_x V + \frac{1}{2}\text{tr}\big(\sigma\sigma^\top \nabla^2_x V\big)\right]$$

For AI: RLHF can be formulated as a stochastic control problem where the LLM generates tokens sequentially (the SDE), the reward is the RLHF reward model (the cost), and the KL penalty is a regularisation term. The HJB equation characterises the optimal RLHF policy, and PPO approximately solves it using neural network approximations of $V$.


Appendix K: Python Recipes for Stochastic Processes

K.1 Simulating Stochastic Processes

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Simple Random Walk
def simulate_srw(n_steps, n_paths=5):
    steps = np.random.choice([-1, 1], size=(n_paths, n_steps))
    return np.hstack([np.zeros((n_paths,1)), steps.cumsum(axis=1)])

# Brownian Motion (via SRW limit)
def simulate_bm(T=1.0, n_steps=1000, n_paths=5):
    dt = T / n_steps
    increments = np.random.normal(0, np.sqrt(dt), (n_paths, n_steps))
    return np.hstack([np.zeros((n_paths,1)), increments.cumsum(axis=1)])

# Ornstein-Uhlenbeck (Euler-Maruyama)
def simulate_ou(theta=1.0, mu=0.0, sigma=1.0, T=10.0, n_steps=1000, n_paths=5):
    dt = T / n_steps
    X = np.zeros((n_paths, n_steps+1))
    X[:,0] = 2.0  # initial condition
    for t in range(n_steps):
        dB = np.random.normal(0, np.sqrt(dt), n_paths)
        X[:,t+1] = X[:,t] + theta*(mu - X[:,t])*dt + sigma*dB
    return X

# Poisson Process
def simulate_poisson(lam=1.0, T=10.0, n_paths=5):
    paths = []
    for _ in range(n_paths):
        t, events = 0, [0]
        while t < T:
            t += np.random.exponential(1/lam)
            events.append(min(t, T))
        paths.append(events)
    return paths

# Geometric Brownian Motion
def simulate_gbm(mu=0.1, sigma=0.3, S0=100.0, T=1.0, n_steps=252, n_paths=5):
    dt = T / n_steps
    log_S = np.zeros((n_paths, n_steps+1))
    log_S[:,0] = np.log(S0)
    for t in range(n_steps):
        log_S[:,t+1] = log_S[:,t] + (mu - sigma**2/2)*dt + sigma*np.random.normal(0, np.sqrt(dt), n_paths)
    return np.exp(log_S)

K.2 Gaussian Process Posterior

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel; accepts 1-D (n,) or 2-D (n, d) inputs."""
    X1 = np.asarray(X1, dtype=float).reshape(len(X1), -1)
    X2 = np.asarray(X2, dtype=float).reshape(len(X2), -1)
    dists = ((X1[:, None, :] - X2[None, :, :])**2).sum(-1)
    return variance * np.exp(-dists / (2 * length_scale**2))

def gp_posterior(X_train, y_train, X_test, kernel_fn, noise_var=1e-3):
    """Compute GP posterior mean and variance."""
    K_tt = kernel_fn(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_ts = kernel_fn(X_train, X_test)
    K_ss = kernel_fn(X_test, X_test)

    # Solve K_tt @ alpha = y_train using Cholesky
    cho = cho_factor(K_tt)
    alpha = cho_solve(cho, y_train)
    v = cho_solve(cho, K_ts)

    mu_post = K_ts.T @ alpha
    var_post = np.diag(K_ss) - (K_ts * v).sum(0)
    return mu_post, np.sqrt(np.maximum(var_post, 0))

# Usage:
# X_train = np.array([0., 1., 2.])
# y_train = np.array([1., -0.5, 0.8])
# X_test = np.linspace(-0.5, 2.5, 100)
# mu, std = gp_posterior(X_train, y_train, X_test, rbf_kernel)

K.3 Martingale Verification

import numpy as np

def verify_martingale(process_fn, n_steps=100, n_paths=10000, tol=None):
    """
    Verify the martingale mean property: E[M_n] is constant in n.
    process_fn: callable that generates an (n_paths, n_steps+1) array.
    tol: tolerance for the constant-mean check; defaults to 4 standard errors
         of the terminal mean, so the check passes with high probability.
    """
    paths = process_fn(n_steps, n_paths)
    means = paths.mean(axis=0)
    if tol is None:
        tol = 4 * paths[:, -1].std() / np.sqrt(n_paths)
    assert np.allclose(means, means[0], atol=tol), f"Mean not constant: {means[:5]}"
    print(f"Mean at all steps: {means[0]:.4f} +/- {means.std():.4f}")
    print(f"PASS: Martingale mean constant within tolerance {tol:.4f}")

def simulate_srw_paths(n_steps, n_paths):
    steps = np.random.choice([-1., 1.], size=(n_paths, n_steps))
    return np.hstack([np.zeros((n_paths,1)), steps.cumsum(axis=1)])

np.random.seed(42)
verify_martingale(simulate_srw_paths)

K.4 Diffusion Model Forward Process

import numpy as np

def ddpm_forward_process(x0, T=1000, beta_start=1e-4, beta_end=0.02):
    """
    DDPM forward process: q(x_t | x_0) = N(sqrt(alpha_bar_t)*x0, (1-alpha_bar_t)*I)
    Returns: {t: (x_t, alpha_bar_t)} for t = 0, 1, ..., T
    """
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1 - betas
    alpha_bars = np.cumprod(alphas)

    trajectory = {0: (x0, 1.0)}
    x0_arr = np.array(x0, dtype=float)

    for t in range(1, T+1):
        ab = alpha_bars[t-1]
        eps = np.random.randn(*x0_arr.shape)
        x_t = np.sqrt(ab) * x0_arr + np.sqrt(1 - ab) * eps
        trajectory[t] = (x_t, ab)
    return trajectory, alpha_bars

def snr(alpha_bar):
    """Signal-to-noise ratio: alpha_bar / (1 - alpha_bar)."""
    return alpha_bar / (1 - alpha_bar + 1e-9)

# Example:
# x0 = np.array([1.0, 0.0, -1.0])
# traj, alpha_bars = ddpm_forward_process(x0, T=1000)
# For t=500: SNR \\approx 1 (equal signal and noise)
# For t=1000: SNR \\approx 0 (pure noise)

K.5 DDPM Schedule Comparison

import numpy as np

def linear_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    return np.linspace(beta_start, beta_end, T)

def cosine_schedule(T=1000, s=0.008):
    t = np.arange(T+1) / T
    f = np.cos((t + s)/(1+s) * np.pi/2)**2
    alpha_bars = f / f[0]
    betas = 1 - alpha_bars[1:] / alpha_bars[:-1]
    return np.clip(betas, 0, 0.999)

T = 1000
betas_linear = linear_schedule(T)
betas_cosine = cosine_schedule(T)

ab_linear = np.cumprod(1 - betas_linear)
ab_cosine = np.cumprod(1 - betas_cosine)

print("Signal preserved at t=T/4, T/2, 3T/4, T:")
for name, ab in [("Linear", ab_linear), ("Cosine", ab_cosine)]:
    vals = [ab[T//4-1], ab[T//2-1], ab[3*T//4-1], ab[-1]]
    print(f"  {name}: {[f'{v:.4f}' for v in vals]}")
# Cosine schedule destroys signal more uniformly across timesteps

Appendix L: Theory Problems (Advanced)

L.1 Optional Stopping and Wald's Identity

Problem. Let $X_1, X_2, \ldots$ be iid with $\mathbb{E}[X_i] = \mu$ and $\mathbb{E}[X_i^2] = \sigma^2 + \mu^2$. Let $\tau$ be a stopping time with $\mathbb{E}[\tau] < \infty$. Prove Wald's identities:

$$\mathbb{E}[S_\tau] = \mu\,\mathbb{E}[\tau], \qquad \mathbb{E}\big[(S_\tau - \mu\tau)^2\big] = \sigma^2\,\mathbb{E}[\tau]$$

Hint: $S_n - n\mu$ is a martingale. Apply OST. For the second moment, use $(S_n - n\mu)^2 - n\sigma^2$.

L.2 The Optional Stopping Theorem: Necessary Conditions

Problem. Show that the condition $\mathbb{E}[\tau] < \infty$ alone does NOT suffice for OST. Construct a martingale $\{M_n\}$ and a stopping time $\tau$ with $\mathbb{E}[\tau] < \infty$ but $\mathbb{E}[M_\tau] \neq \mathbb{E}[M_0]$.

Hint: Let $M_n$ be a SRW and $\tau = \min(T, \text{hitting time of level } n)$ for appropriate $T$ depending on $n$...

L.3 Levy Characterisation of BM

Problem. Prove that if $\{M_t\}$ is a continuous local martingale starting at 0 with quadratic variation $[M]_t = t$, then $M_t$ is a standard Brownian motion.

Hint: Use the exponential martingale $e^{i\theta M_t + \theta^2 t/2}$ and show it is a martingale. Then compute the characteristic function of $(M_{t_1},\ldots,M_{t_n})$.

L.4 GP Posterior as Hilbert Space Projection

Problem. Show that the GP posterior mean $\mu_*(x) = k_*^\top K^{-1}\mathbf{y}$ is the orthogonal projection of $f$ onto the span of $\{k(\cdot, x_1),\ldots,k(\cdot, x_n)\}$ in the RKHS $\mathcal{H}_k$.

Hint: Use the reproducing property $f(x_i) = \langle f, k(\cdot,x_i)\rangle_{\mathcal{H}_k}$ to express the GP as a Hilbert space problem.


Appendix M: References and Further Reading

Textbooks

  1. Karatzas & Shreve, Brownian Motion and Stochastic Calculus (1991) - The standard reference for rigorous continuous-time stochastic processes, Ito calculus, and SDEs. Graduate-level.

  2. Durrett, Probability: Theory and Examples (5th ed., 2019) - Comprehensive coverage including martingales, stopping times, ergodic theory, Brownian motion. Excellent exercises.

  3. Williams, Probability with Martingales (1991) - Elegant, concise introduction to martingale theory. The clearest proof of martingale convergence. Highly recommended.

  4. Oksendal, Stochastic Differential Equations (6th ed., 2003) - Standard SDEs reference. Accessible treatment of Ito calculus and applications.

  5. Rasmussen & Williams, Gaussian Processes for Machine Learning (2006) - The canonical GP reference. Available free online.

  6. Doob, Stochastic Processes (1953) - The original monograph. Still valuable for historical context.

Papers for AI Connections

  1. Song et al. (2021), Score-Based Generative Modeling through Stochastic Differential Equations - Score SDE framework unifying DDPM, SMLD.

  2. Mandt, Hoffman & Blei (2017), Stochastic Gradient Descent as Approximate Bayesian Inference - SDE approximation of SGD.

  3. Su et al. (2022), RoFormer: Enhanced Transformer with Rotary Position Embedding - RoPE as stationary kernel.

  4. Ho, Jain & Abbeel (2020), Denoising Diffusion Probabilistic Models - DDPM.

  5. Li et al. (2017), Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms - SDE theory for SGD.

  6. Song et al. (2023), Consistency Models - Distillation of diffusion models via probability flow ODE.

  7. Tropp (2015), An Introduction to Matrix Concentration Inequalities - Matrix Bernstein, extends scalar martingale results.

  8. Sutton & Barto (2018), Reinforcement Learning: An Introduction (2nd ed.) - RL value functions as martingales; Chapter 6 on TD-learning.


Appendix N: Extended Proofs

N.1 Proof of Doob's Maximal Inequality

Theorem. For a non-negative submartingale $\{M_k\}_{k=0}^n$ and $\lambda > 0$:

$$\lambda \cdot P\left(\max_{0 \leq k \leq n} M_k \geq \lambda\right) \leq \mathbb{E}[M_n]$$

Proof. Let $\tau = \min\{k : M_k \geq \lambda\}$ (the first time the process crosses $\lambda$), with $\tau = \infty$ if no crossing occurs by time $n$, so that $\{\tau \leq n\} = \{\max_{0 \leq k \leq n} M_k \geq \lambda\}$. Then $\tau$ is a stopping time (the crossing event is $\mathcal{F}_k$-measurable).

Decompose the expectation $\mathbb{E}[M_n]$:

$$\mathbb{E}[M_n] = \mathbb{E}[M_n \mathbf{1}_{\tau \leq n}] + \mathbb{E}[M_n \mathbf{1}_{\tau > n}]$$

For the first term: since $\{M_k\}$ is a submartingale, $\mathbb{E}[M_n \mid \mathcal{F}_\tau] \geq M_\tau \geq \lambda$ on $\{\tau \leq n\}$:

$$\mathbb{E}[M_n \mathbf{1}_{\tau \leq n}] = \mathbb{E}\big[\mathbb{E}[M_n\mid\mathcal{F}_\tau]\,\mathbf{1}_{\tau \leq n}\big] \geq \lambda \cdot P(\tau \leq n) = \lambda \cdot P\left(\max_k M_k \geq \lambda\right)$$

Since $M_n \geq 0$, the second term satisfies $\mathbb{E}[M_n \mathbf{1}_{\tau > n}] \geq 0$, so:

$$\mathbb{E}[M_n] \geq \lambda\, P\left(\max_k M_k \geq \lambda\right)$$

$\square$

Corollary. Applying this to $\{|M_k|\}$ (a non-negative submartingale, since $f(x)=|x|$ is convex):

$$P\left(\max_{0 \leq k \leq n}|M_k| \geq \lambda\right) \leq \frac{\mathbb{E}[|M_n|]}{\lambda}$$

N.2 Proof of Azuma's Inequality

Theorem (Azuma-Hoeffding). Let $\{M_k\}_{k=0}^n$ be a martingale with $|M_k - M_{k-1}| \leq c_k$ a.s. Then:

$$P(M_n - M_0 \geq t) \leq \exp\left(\frac{-t^2}{2\sum_k c_k^2}\right)$$

Proof. Let $D_k = M_k - M_{k-1}$ be the martingale differences. Since $\mathbb{E}[D_k\mid\mathcal{F}_{k-1}] = 0$ and $|D_k| \leq c_k$, Hoeffding's lemma gives:

$$\mathbb{E}[e^{sD_k}\mid\mathcal{F}_{k-1}] \leq e^{s^2c_k^2/2}$$

By Markov's inequality applied to $e^{s(M_n - M_0)}$ (for any $s > 0$):

$$P(M_n - M_0 \geq t) \leq e^{-st}\,\mathbb{E}[e^{s(M_n-M_0)}]$$

Factor the MGF by conditioning on $\mathcal{F}_{n-1}$ (tower property):

$$\mathbb{E}\big[e^{s\sum_k D_k}\big] = \mathbb{E}\Big[\prod_k e^{sD_k}\Big] = \mathbb{E}\Big[\prod_{k=1}^{n-1}e^{sD_k}\,\mathbb{E}[e^{sD_n}\mid\mathcal{F}_{n-1}]\Big] \leq \mathbb{E}\Big[\prod_{k=1}^{n-1}e^{sD_k}\Big]\, e^{s^2c_n^2/2}$$

Iterating: $\mathbb{E}[e^{s(M_n-M_0)}] \leq e^{s^2\sum_k c_k^2/2}$.

Combining:

$$P(M_n - M_0 \geq t) \leq e^{-st + s^2\sum_k c_k^2/2}$$

Optimise over $s > 0$: setting $s = t/\sum_k c_k^2$ gives:

$$P(M_n - M_0 \geq t) \leq \exp\left(\frac{-t^2}{2\sum_k c_k^2}\right)$$

$\square$

N.3 Proof of Polya's Theorem (d=1d=1: Recurrence)

Theorem. The SRW on $\mathbb{Z}$ is recurrent.

Proof. It suffices to show that the expected number of returns to 0 is infinite: $\sum_{n=1}^\infty P(S_{2n}=0) = \infty$.

By Stirling's approximation, $\binom{2n}{n} \sim 4^n/\sqrt{\pi n}$:

$$P(S_{2n}=0) = \binom{2n}{n}4^{-n} \sim \frac{1}{\sqrt{\pi n}}$$

Therefore $\sum_{n=1}^\infty P(S_{2n}=0) \sim \sum_{n=1}^\infty \frac{1}{\sqrt{\pi n}} = \infty$.

Let $q$ be the probability that the walk ever returns to 0. By the strong Markov property, the number of returns is geometric with expectation $q/(1-q)$. Since this expectation equals $\sum_n P(S_{2n}=0) = \infty$, we must have $q = 1$: the walk returns to 0, and hence returns infinitely often, with probability 1.

Note: In $d \geq 3$, $P(S_{2n}=0) \sim C_d/n^{d/2}$ by the local CLT in $\mathbb{R}^d$, and $\sum_n C_d/n^{d/2} < \infty$ for $d \geq 3$ (since $d/2 > 1$). By the first Borel-Cantelli lemma, returns occur only finitely often, so the walk is transient. $\square$

N.4 Proof that the OU Process Has the Correct Stationary Distribution

Theorem. The stationary distribution of $dX = -\theta X\,dt + \sigma\,dB$ (the $\mu=0$ case) is $\mathcal{N}(0, \sigma^2/(2\theta))$.

Proof via the Fokker-Planck equation. The probability density $p(x,t)$ satisfies:

$$\partial_t p = -\partial_x\big((-\theta x)p\big) + \frac{\sigma^2}{2}\partial_x^2 p = \theta\,\partial_x(xp) + \frac{\sigma^2}{2}\partial_x^2 p$$

At stationarity $\partial_t p = 0$:

$$0 = \theta\,\partial_x(xp) + \frac{\sigma^2}{2}\partial_x^2 p = \theta(p + x\,\partial_x p) + \frac{\sigma^2}{2}\partial_x^2 p$$

Try $p(x) = Z^{-1}\exp(-\theta x^2/\sigma^2)$:

$$\partial_x p = -\frac{2\theta x}{\sigma^2}p, \qquad \partial_x^2 p = \left(-\frac{2\theta}{\sigma^2} + \frac{4\theta^2 x^2}{\sigma^4}\right)p$$

Substituting: $\theta(1 - 2\theta x^2/\sigma^2)p + \frac{\sigma^2}{2}(-2\theta/\sigma^2 + 4\theta^2x^2/\sigma^4)p = \big[\theta - 2\theta^2x^2/\sigma^2 - \theta + 2\theta^2x^2/\sigma^2\big]p = 0$, as required.

The normalisation gives $p_\infty(x) = \mathcal{N}(0, \sigma^2/(2\theta))$. $\square$


Appendix O: Spectral Theory of Stochastic Processes

O.1 Spectral Representation

Every WSS process $\{X_t\}$ with absolutely integrable ACVF admits the spectral representation:

$$X_t = \int_{-\infty}^\infty e^{i\omega t}\,dZ(\omega)$$

where $Z(\omega)$ is an orthogonal-increment process (the spectral process) satisfying $\mathbb{E}[dZ(\omega)\overline{dZ(\omega')}] = S(\omega)\,\delta(\omega-\omega')\,d\omega$.

This is the stochastic analogue of the Fourier transform: a stationary process decomposes into uncorrelated frequency components, each contributing variance $S(\omega)\,d\omega$ in the frequency band $[\omega, \omega+d\omega]$.

For AI: The power spectral density of a token sequence determines how information is distributed across "frequencies" (temporal scales). A language model with strong long-range dependencies (like GPT-4 modelling coherent long documents) will have a PSD with high power at low frequencies. Attention head specialisation can be viewed as learning frequency-selective filters - heads that attend to nearby tokens (high frequency) vs. distant tokens (low frequency).

O.2 Spectral Gap and Mixing

For a stationary Markov chain with transition operator $P$, the spectral gap $\gamma = 1 - |\lambda_2|$ (where $\lambda_2$ is the second-largest eigenvalue of $P$) determines the mixing time $t_{\text{mix}} \sim 1/\gamma$.

  • Large spectral gap -> fast mixing -> samples become independent quickly
  • Small spectral gap -> slow mixing -> long-range correlations persist

For transformers: The attention pattern at each layer can be viewed as a stochastic matrix $A \in \mathbb{R}^{T\times T}$ (for sequence length $T$). The spectral gap of $A$ determines how quickly information "mixes" across the sequence. Heads with nearly uniform attention (large spectral gap) aggregate globally; heads with peaked attention (small spectral gap) attend locally.
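A small illustration of this view (purely a sketch: the "attention" matrices below are synthetic row-stochastic matrices, not taken from a real model):

import numpy as np

def spectral_gap(A):
    """Spectral gap 1 - |lambda_2| of a row-stochastic matrix A."""
    eigvals = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    return 1.0 - eigvals[1]

T = 64
uniform = np.full((T, T), 1.0 / T)          # global-averaging "head"
local = np.zeros((T, T))                    # "head" attending to a 5-token window
for i in range(T):
    lo, hi = max(0, i - 2), min(T, i + 3)
    local[i, lo:hi] = 1.0 / (hi - lo)
print("uniform head gap:", round(spectral_gap(uniform), 4))   # ~1: mixes in one step
print("local head gap  :", round(spectral_gap(local), 4))     # small: slow mixing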


Appendix P: Decision Framework for Choosing Process Models

CHOOSE THE RIGHT STOCHASTIC PROCESS MODEL
========================================================================

  Is the system evolving over time?
  +-- No  -> Use static probability (Section01-Section05)
  +-- Yes ->
        Does the state space have a natural structure?
        +-- Countable (e.g., integer counts, categories)
        |   +-- Discrete time -> Markov chain (Section07)
        |   +-- Continuous time -> Poisson or birth-death process
        +-- Continuous (e.g., parameter vector, embedding)
            +-- Does it have independent increments?
            |   +-- Yes, with Gaussian increments -> Brownian motion
            |   +-- Yes, with jump structure -> Poisson / Levy
            |   +-- No, but stationary -> ARMA or OU process
            +-- Does it decrease toward an attractor?
            |   +-- Yes -> OU process or SDE with drift
            +-- Does it only depend on the current state?
                +-- Yes -> Markov process (continuous time)
                +-- No, depends on full history -> GP or ARIMA

========================================================================

  AI Application -> Recommended Model
  ---------------------------------------------------------------------
  SGD parameter trajectory     -> OU SDE (diffusion approx)
  Mini-batch gradient noise    -> Martingale difference sequence
  Diffusion model forward      -> VP-SDE (discrete OU)
  LLM token generation         -> Markov chain (Section07)
  Request arrivals             -> Poisson process
  Hyperparameter optimisation  -> Gaussian process (Bayesian opt)
  RL value function            -> Martingale transform
  Training loss trajectory     -> Supermartingale
  NTK / infinite-width NN      -> Gaussian process
  Residual stream updates      -> Additive noise SDE
  ---------------------------------------------------------------------

Appendix Q: Connections to Measure Theory

Q.1 Stochastic Processes as Measurable Functions

A stochastic process {Xt}tT\{X_t\}_{t \in T} on (Ω,F,P)(\Omega, \mathcal{F}, P) can be viewed as a measurable function from Ω\Omega to the path space RT\mathbb{R}^T. The probabilistic properties of the process are encoded in the pushforward measure PX1P \circ X^{-1} on RT\mathbb{R}^T.

For Brownian motion: The path space is C([0,\infty)) (continuous functions). The Wiener measure W is the unique probability measure on C([0,\infty)) under which the coordinate process B_t(\omega) = \omega(t) satisfies the four BM axioms. Constructing W rigorously was Wiener's 1923 achievement.

Q.2 The Radon-Nikodym Theorem and Change of Measure

If PP and QQ are probability measures on (Ω,F)(\Omega, \mathcal{F}) with QPQ \ll P (Q is absolutely continuous wrt P), then there exists the Radon-Nikodym derivative dQ/dP=ZdQ/dP = Z such that Q(A)=EP[Z1A]Q(A) = \mathbb{E}^P[Z\mathbf{1}_A] for all AA.

Girsanov's theorem: Under mild conditions, if BtB_t is a BM under PP and θt\theta_t is adapted, then:

B~t=Bt+0tθsds\tilde{B}_t = B_t + \int_0^t \theta_s\,ds

is a BM under QQ where dQ/dP=exp(0TθtdBt120Tθt2dt)dQ/dP = \exp(-\int_0^T \theta_t\,dB_t - \frac{1}{2}\int_0^T\theta_t^2\,dt) (the Girsanov kernel).

For diffusion models: Girsanov's theorem is the mathematical basis for the reverse diffusion process. The reverse SDE under the data distribution PP corresponds to a change of measure from the noise distribution QQ (standard BM), with the Radon-Nikodym derivative involving the score function logpt\nabla\log p_t. This makes the score-matching objective the likelihood maximisation objective under the reverse measure.
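
A quick Monte Carlo sanity check of the change of measure, using a constant drift \theta so that the Girsanov kernel has a closed form depending only on B_T; the drift value and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_paths, T, theta = 500_000, 1.0, 0.8    # constant drift theta (illustrative)

B_T = np.sqrt(T) * rng.standard_normal(n_paths)   # BM under P at time T
B_tilde_T = B_T + theta * T                        # drifted process B_tilde = B + theta*t

# Girsanov kernel for constant theta: dQ/dP = exp(-theta*B_T - 0.5*theta^2*T)
Z = np.exp(-theta * B_T - 0.5 * theta**2 * T)

print("E_P[Z] (should be 1):              ", Z.mean())
print("E_P[B_tilde_T] (drift visible):    ", B_tilde_T.mean())          # ~ theta*T
print("E_Q[B_tilde_T] = E_P[Z * B_tilde_T]:", (Z * B_tilde_T).mean())   # ~ 0: drift removed
print("E_Q[B_tilde_T^2] (should be ~T):   ", (Z * B_tilde_T**2).mean())
```

Reweighting by Z makes the drifted process look like a driftless BM in expectation, which is exactly the "change of drift = change of measure" statement.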


Appendix R: Supplementary: Concentration and Martingales Revisited

R.1 The Azuma-McDiarmid Connection - Full Detail

In Section3.5 we introduced the Doob martingale construction as the proof of McDiarmid's inequality. Here we make the connection fully explicit.

Setup. Let X1,,XnX_1, \ldots, X_n be independent random variables and f:XnRf: \mathcal{X}^n \to \mathbb{R} satisfy the bounded differences condition: for each kk,

supx1,,xn,xkf(x1,,xk,,xn)f(x1,,xk,,xn)ck\sup_{x_1,\ldots,x_n, x_k'} |f(x_1,\ldots,x_k,\ldots,x_n) - f(x_1,\ldots,x_k',\ldots,x_n)| \leq c_k

Step 1: Doob martingale. Define Mk=E[fX1,,Xk]M_k = \mathbb{E}[f | X_1,\ldots,X_k] for k=0,,nk=0,\ldots,n. Then M0=E[f]M_0 = \mathbb{E}[f], Mn=fM_n = f, and {Mk}\{M_k\} is a martingale.

Step 2: Bounded differences. The martingale difference Dk=MkMk1D_k = M_k - M_{k-1} satisfies Dkck|D_k| \leq c_k a.s. because:

MkMk1=E[fX1,,Xk]E[fX1,,Xk1]supxk,xkE[f(,xk,)f(,xk,)X1:k1]ck|M_k - M_{k-1}| = |\mathbb{E}[f|X_1,\ldots,X_k] - \mathbb{E}[f|X_1,\ldots,X_{k-1}]| \leq \sup_{x_k, x_k'} |\mathbb{E}[f(\ldots,x_k,\ldots) - f(\ldots,x_k',\ldots)|X_{1:k-1}]| \leq c_k

Step 3: Azuma. Apply Azuma's inequality (proved in SectionN.2) to the martingale {Mk}\{M_k\}:

P(fE[f]t)=P(MnM0t)exp ⁣(t22ck2)P(f - \mathbb{E}[f] \geq t) = P(M_n - M_0 \geq t) \leq \exp\!\left(\frac{-t^2}{2\sum c_k^2}\right)

This is exactly McDiarmid's inequality: the concentration inequality of Section05 is a consequence of martingale theory.
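
A minimal numerical check of the resulting bound. Here f is taken to be the sample mean of bounded variables purely because its bounded-difference constants are trivially c_k = 1/n; this is an illustration of the inequality, not a tight example, and all sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials, t = 200, 100_000, 0.05

# f(X_1,...,X_n) = sample mean of Uniform[0,1] variables;
# changing one coordinate moves f by at most c_k = 1/n.
X = rng.uniform(size=(trials, n))
f = X.mean(axis=1)
emp_tail = np.mean(f - 0.5 >= t)

c_sq_sum = n * (1.0 / n) ** 2             # sum of c_k^2 = 1/n
bound = np.exp(-t**2 / (2 * c_sq_sum))    # McDiarmid / Azuma-on-Doob bound

print("empirical P(f - E[f] >= t):", emp_tail)
print("McDiarmid bound           :", bound)   # valid, though typically loose for this smooth f
```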

R.2 The Freedman Inequality

Azuma's inequality uses only the almost-sure bound Dkck|D_k| \leq c_k. Freedman's inequality (1975) also uses conditional variances, giving a Bernstein-type improvement:

Theorem (Freedman, 1975). Let \{M_n\} be a martingale whose differences satisfy |D_k| \leq c a.s. Define W_n = \sum_{k=1}^n \mathbb{E}[D_k^2|\mathcal{F}_{k-1}] (the total conditional variance). Then for t, v > 0:

P(M_n - M_0 \geq t,\ W_n \leq v) \leq \exp\!\left(\frac{-t^2/2}{v + ct/3}\right)

This is a martingale version of Bernstein's inequality. The event WnvW_n \leq v restricts to "low variance" trajectories. Unconditionally:

P(M_n - M_0 \geq t) \leq \exp\!\left(\frac{-t^2/2}{V + ct/3}\right) \text{ where } V = \sup_\omega W_n(\omega)

For AI: Freedman's inequality gives tighter bounds for SGD convergence when the gradient variance is small. In low-variance phases of training (e.g., early on, when gradients are large in magnitude but consistent across mini-batches), Freedman gives bounds of order e^{-t^2/(2v)}, tighter than Azuma's e^{-t^2/(2nc^2)}.
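
A back-of-the-envelope comparison of the two exponents for a hypothetical martingale whose increments are bounded by c but have much smaller conditional variance; all numbers below are illustrative.

```python
import numpy as np

# Martingale of length n with |D_k| <= c but conditional variance sigma2 << c^2.
n, c, sigma2, t = 1000, 1.0, 0.01, 20.0

azuma_bound = np.exp(-t**2 / (2 * n * c**2))
v = n * sigma2                                    # total conditional variance W_n
freedman_bound = np.exp(-(t**2 / 2) / (v + c * t / 3))

print("Azuma bound   :", azuma_bound)     # exp(-0.2)  ~ 0.82
print("Freedman bound:", freedman_bound)  # exp(-12)   ~ 6e-6
```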


Appendix S: Practice Problems (All Levels)

Level * (Computation)

S.1. Let Mn=2SnM_n = 2^{S_n} where SnS_n is the SRW. For what value of pp in the biased walk (P(+1)=pP(+1)=p) is Mn=2SnM_n = 2^{S_n} a martingale?

Answer: E[2Sn+1Sn]=2Sn(2p+(1p)/2)\mathbb{E}[2^{S_{n+1}}|S_n] = 2^{S_n}(2p + (1-p)/2). Set equal to 2Sn2^{S_n}: 2p+(1p)/2=12p + (1-p)/2 = 1, so p=1/3p = 1/3.

S.2. A Poisson process has rate λ=3\lambda = 3 (events per hour). Find P(N(2)5)P(N(2) \geq 5) and the expected time to the 4th event.

Answer: N(2) \sim \text{Poisson}(6), so P(N(2)\geq 5) = 1 - \sum_{k=0}^4 e^{-6}6^k/k! \approx 0.715. Time to 4th event: T_4 \sim \text{Gamma}(\text{shape } 4, \text{rate } 3), so \mathbb{E}[T_4] = 4/3 hours.

S.3. For the OU process with θ=1\theta=1, μ=2\mu=2, σ=2\sigma=\sqrt{2}, X0=0X_0=0: compute E[X3]\mathbb{E}[X_3] and Var(X3)\text{Var}(X_3).

Answer: E[X3]=2+(02)e3=22e31.9\mathbb{E}[X_3] = 2 + (0-2)e^{-3} = 2 - 2e^{-3} \approx 1.9. Var(X3)=22(1e6)=1e60.9975\text{Var}(X_3) = \frac{2}{2}(1-e^{-6}) = 1-e^{-6} \approx 0.9975.
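
A quick Monte Carlo check of the Level * answers (S.2 and S.3); the sample sizes and step count below are chosen only for convenience.

```python
import numpy as np

rng = np.random.default_rng(4)

# S.2: N(2) ~ Poisson(6); P(N(2) >= 5) and E[T_4] for rate 3
samples = rng.poisson(6, size=200_000)
print("P(N(2) >= 5) ~", np.mean(samples >= 5))              # about 0.715
inter = rng.exponential(1 / 3, size=(200_000, 4))           # inter-arrival times, mean 1/3
print("E[T_4] ~", inter.sum(axis=1).mean(), "hours")        # about 4/3

# S.3: OU process dX = -(X - 2) dt + sqrt(2) dB, X_0 = 0, via Euler-Maruyama
n_paths, n_steps, T_end = 50_000, 1500, 3.0
dt = T_end / n_steps
X = np.zeros(n_paths)
for _ in range(n_steps):
    X += -(X - 2.0) * dt + np.sqrt(2.0 * dt) * rng.standard_normal(n_paths)
print("E[X_3]   ~", X.mean())    # about 2 - 2e^{-3} ~ 1.90
print("Var[X_3] ~", X.var())     # about 1 - e^{-6} ~ 0.9975
```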

Level ** (Theory)

S.4. Show that for a martingale \{M_n\}, the sequence M_n^2 - [M]_n is also a martingale, where [M]_n = \sum_{k=1}^n \mathbb{E}[D_k^2|\mathcal{F}_{k-1}] is the predictable quadratic variation. (A numerical sanity check follows this block.)

S.5. Prove that if ff is convex and {Mn}\{M_n\} is a martingale, then {f(Mn)}\{f(M_n)\} is a submartingale (using Jensen's inequality).

S.6. For the compound Poisson process X(t)=k=1N(t)YkX(t) = \sum_{k=1}^{N(t)} Y_k with λ=2\lambda = 2, YkExp(3)Y_k \sim \text{Exp}(3): compute the MGF MX(t)(s)M_{X(t)}(s) and use it to find E[X(t)]\mathbb{E}[X(t)] and Var(X(t))\text{Var}(X(t)).
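
An empirical companion to S.4, as referenced there: for the simple random walk D_k = \pm 1, so [M]_n = n and the claim reduces to S_n^2 - n being a martingale. A minimal check with arbitrary sample sizes follows.

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 200, 100_000

# SRW: D_k = +/-1, so E[D_k^2 | F_{k-1}] = 1 and [M]_n = n
steps = rng.choice([-1, 1], size=(trials, n))
S = steps.cumsum(axis=1)
V = S**2 - np.arange(1, n + 1)     # M_n^2 - [M]_n

print("E[S_n^2 - n] ~", V[:, -1].mean())   # about 0
# equal expectations at two times, a consequence of the martingale property
m = 50
print("E[V_n - V_m] ~", (V[:, -1] - V[:, m - 1]).mean())  # about 0
```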

Level *** (AI Applications)

S.7. (Diffusion model noise schedule design) You want a noise schedule \beta_t for the DDPM forward process such that the SNR \bar\alpha_t/(1-\bar\alpha_t) decreases linearly from \text{SNR}_0 = 100 to \text{SNR}_T = 0.01 over T=1000 steps. Derive the required \beta_t and compare to the linear schedule.

S.8. (SGD stationary distribution) In a 2D quadratic loss L(\theta) = \frac{1}{2}\theta^\top H\theta with H = \text{diag}(1, 10), SGD with step \eta and gradient noise \Sigma = I has stationary distribution \mathcal{N}(0, \frac{\eta}{2}H^{-1}). Show that the ratio of the standard deviations in the two directions is \sqrt{10}, and interpret this for optimisation. (A simulation sketch follows this block.)

S.9. (Martingale in RLHF) In PPO with KL-regularised reward rθ(s,a)=rϕ(s,a)βlog(πθ(as)/πref(as))r_\theta(s,a) = r_\phi(s,a) - \beta\log(\pi_\theta(a|s)/\pi_{\text{ref}}(a|s)), show that the KL penalty term creates a non-martingale perturbation to the policy gradient estimator. Under what conditions on β\beta does the perturbation become negligible?
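
A simulation sketch for S.8, as referenced there: run plain SGD on the quadratic loss with isotropic Gaussian gradient noise and compare the empirical stationary spread to \frac{\eta}{2}H^{-1}. The step size, burn-in, and iteration counts are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)

# SGD on L(theta) = 0.5 * theta^T H theta with N(0, I) gradient noise
H = np.diag([1.0, 10.0])
eta, n_steps, burn_in = 0.01, 200_000, 50_000

theta = np.zeros(2)
samples = []
for k in range(n_steps):
    grad = H @ theta + rng.standard_normal(2)   # true gradient + isotropic noise
    theta = theta - eta * grad
    if k >= burn_in:
        samples.append(theta.copy())

samples = np.array(samples)
emp_std = samples.std(axis=0)
pred_std = np.sqrt(eta / 2 * np.diag(np.linalg.inv(H)))   # from (eta/2) H^{-1}

print("empirical stds:", emp_std)
print("predicted stds:", pred_std)
print("std ratio (approximately sqrt(10) =", np.sqrt(10), "):", emp_std[0] / emp_std[1])
```

The flat direction (small Hessian eigenvalue) fluctuates about \sqrt{10} times more than the sharp direction, which is why SGD noise is dominated by flat directions of the loss.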


Appendix T: Theorem Index

Theorem | Statement | Section
---------------------------------------------------------------------
Doob's OST | \mathbb{E}[M_\tau]=\mathbb{E}[M_0] under UI or bounded conditions | Section3.3
Doob's Maximal Inequality | \lambda P(\max M_k \geq \lambda) \leq \mathbb{E}[M_n] | Section3.4, App. N
Martingale Convergence | L^1-bounded martingale converges a.s. | Section3.4
Azuma's Inequality | Bounded differences -> e^{-t^2/2\sum c_k^2} | App. N
McDiarmid's Inequality | Bounded-differences function -> Azuma on Doob martingale | Section3.5
Freedman's Inequality | Martingale Bernstein with conditional variance | App. R
Polya's Theorem | SRW on \mathbb{Z}^d: recurrent iff d \leq 2 | Section4.2, App. N
Donsker's Theorem | SRW \Rightarrow BM in C([0,1]) | Section6.3, App. E
Levy Characterisation | Continuous martingale + QV = t \Rightarrow BM | Section6.2
Ito's Lemma | df(X_t) = f'\,dX + \frac{1}{2}f''\,(dX)^2 | App. A
Wiener-Khinchin | ACVF \leftrightarrow PSD via Fourier | Section7.3
Bochner's Theorem | PSD is valid iff non-negative definite | Section7.3
Birkhoff Ergodic | Time average = ensemble average a.s. | Section7.2
Girsanov's Theorem | Change of drift = change of measure | App. Q
Fokker-Planck | PDE for density of SDE | App. N
Anderson's Reverse SDE | Reverse SDE = forward SDE + score | Section6.6
Clark-Ocone Formula | Malliavin representation on Wiener space | App. J
Wald's Identity | \mathbb{E}[S_\tau] = \mu\mathbb{E}[\tau] | App. L

Appendix U: Continuous-Time Martingales

U.1 Local Martingales

In continuous time, the natural generalisation of a martingale is a local martingale: a process \{M_t\} for which there exist stopping times \tau_n \uparrow \infty such that each stopped process M_{t \wedge \tau_n} is a martingale.

Every continuous local martingale with M0=0M_0 = 0 can be written as a time-changed Brownian motion (by the Dambis-Dubins-Schwarz theorem):

Mt=B[M]tM_t = B_{[M]_t}

where BB is a standard BM and [M]t[M]_t is the quadratic variation of MM.

The Ito integral 0tHsdBs\int_0^t H_s\,dB_s is a local martingale (and a true martingale when E[0THs2ds]<\mathbb{E}[\int_0^T H_s^2\,ds] < \infty).
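
A small numerical check that the Ito integral has mean zero and satisfies the Ito isometry, using the simple adapted integrand H_s = B_s and a left-endpoint discretisation (the non-anticipating Ito convention); grid and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
n_paths, n_steps, T = 20_000, 400, 1.0
dt = T / n_steps

dB = np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))
B = np.cumsum(dB, axis=1)
B_left = np.hstack([np.zeros((n_paths, 1)), B[:, :-1]])   # B at the left endpoint of each step

I = np.sum(B_left * dB, axis=1)                           # approx of int_0^T B_s dB_s

print("mean of Ito integral (should be ~0):         ", I.mean())
print("variance (Ito isometry predicts T^2/2 = 0.5):", I.var())
# closed form int_0^T B dB = (B_T^2 - T)/2; discretisation error shrinks as dt -> 0
print("mean |I - (B_T^2 - T)/2|:", np.mean(np.abs(I - (B[:, -1]**2 - T) / 2)))
```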

U.2 The Martingale Representation Theorem

Theorem (Martingale Representation). Every square-integrable martingale {Mt}\{M_t\} adapted to the Brownian filtration can be written as:

Mt=M0+0tHsdBsM_t = M_0 + \int_0^t H_s\,dB_s

for some adapted process HsH_s with E[0THs2ds]<\mathbb{E}[\int_0^T H_s^2\,ds] < \infty.

For AI: In score-based diffusion model training, the training loss can be expressed as an Ito integral against the Wiener process of the forward noise. The neural network effectively learns the integrand Hs(xs)H_s(x_s) that represents the reverse process. The martingale representation theorem guarantees that such a representation exists - it is the theoretical justification for why neural networks can learn to "undo" Brownian noise.
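
A classical explicit instance of the representation: M_t = \exp(B_t - t/2) is a square-integrable martingale on [0, T], and Ito's lemma gives dM = M\,dB, so the integrand is H_s = M_s. Below is a numerical sketch (grid and sample sizes arbitrary) checking M_T \approx M_0 + \int_0^T H_s\,dB_s.

```python
import numpy as np

rng = np.random.default_rng(8)
n_paths, n_steps, T = 10_000, 500, 1.0
dt = T / n_steps

dB = np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))
B = np.cumsum(dB, axis=1)
t = dt * np.arange(1, n_steps + 1)

# Exponential martingale M_t = exp(B_t - t/2); Ito gives dM = M dB, so H_s = M_s
M = np.exp(B - t / 2)
M_left = np.hstack([np.ones((n_paths, 1)), M[:, :-1]])    # M_0 = 1 at the left endpoint
stoch_int = 1.0 + np.sum(M_left * dB, axis=1)             # M_0 + int_0^T H_s dB_s

print("mean |M_T - representation|:", np.abs(M[:, -1] - stoch_int).mean())  # small; -> 0 as dt -> 0
print("E[M_T] (should be ~1):      ", M[:, -1].mean())
```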

U.3 Feynman-Kac Formula

For the SDE dX_t = b(X_t)\,dt + \sigma(X_t)\,dB_t and functions f (terminal condition) and r (discount rate or potential), the Feynman-Kac formula relates the SDE to a PDE:

u(x,t)=E ⁣[f(XT)exp ⁣(tTr(Xs)ds)Xt=x]u(x,t) = \mathbb{E}\!\left[f(X_T)\exp\!\left(-\int_t^T r(X_s)\,ds\right)\,\Big|\, X_t = x\right]

satisfies the PDE tu=bxu+σ22xxuru-\partial_t u = b\partial_x u + \frac{\sigma^2}{2}\partial_{xx}u - ru with terminal condition u(x,T)=f(x)u(x,T) = f(x).

For AI: The score function in diffusion models satisfies a backward Kolmogorov equation (the PDE form of the reverse SDE). Feynman-Kac connects this PDE to the expectation form \nabla\log p_t(x_t) = \mathbb{E}\!\left[\frac{\sqrt{\bar\alpha_t}\,x_0 - x_t}{1-\bar\alpha_t}\,\Big|\,x_t\right], which is directly implemented by the neural denoiser.
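
A toy Monte Carlo check of the Feynman-Kac formula itself, choosing b = 0, \sigma = 1, constant r, and f(x) = x^2 so that both sides are available in closed form (u(x,t) = e^{-r(T-t)}(x^2 + (T-t))); all values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)

# Toy Feynman-Kac check: b = 0, sigma = 1, r = 0.5, f(x) = x^2.
# Then X_T | X_t = x ~ N(x, T - t), and the discount factor is deterministic.
x, t, T, r = 1.5, 0.2, 1.0, 0.5
n_paths = 500_000

X_T = x + np.sqrt(T - t) * rng.standard_normal(n_paths)
mc = np.mean(np.exp(-r * (T - t)) * X_T**2)               # Monte Carlo estimate of u(x, t)
closed = np.exp(-r * (T - t)) * (x**2 + (T - t))          # closed form from the PDE

print("Monte Carlo u(x,t):", mc)
print("closed form u(x,t):", closed)
```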

