Math for LLMs - Probability Theory / Concentration Inequalities

Concentration Inequalities - Part 2: ML Applications (Section 11) through Appendix N (Extended Worked Problems)

11. ML Applications

11.1 SGD and Gradient Concentration

In stochastic gradient descent, at each step $t$ we compute a mini-batch gradient:

$$\hat{g}_t = \frac{1}{m}\sum_{i \in B_t} \nabla_\theta \ell(\theta_t; x^{(i)})$$

where $B_t$ is a random mini-batch of size $m$. The true gradient is $g_t = \nabla_\theta \mathcal{L}(\theta_t) = \mathbb{E}[\nabla_\theta \ell]$.

Concentration of gradient norm. Assuming $\|\nabla\ell(\theta; x)\|_\infty \leq G$ almost surely, applying Hoeffding coordinate-wise and taking a union bound over the $d$ coordinates gives:

$$P(\|\hat{g}_t - g_t\|_\infty \geq \varepsilon) \leq 2d \cdot \exp\!\left(-\frac{m\varepsilon^2}{2G^2}\right)$$

Implication for learning rate. For SGD to make progress, the gradient noise must be smaller than the signal. Concentration tells us: with probability $1-\delta$, the per-coordinate noise $\|\hat{g}_t - g_t\|_\infty$ is bounded by $G\sqrt{2\log(2d/\delta)/m}$. This justifies the learning rate scaling $\eta \propto 1/\sqrt{t}$ in theory.

Variance reduction. Methods like SVRG (Stochastic Variance-Reduced Gradient) periodically compute the full gradient, reducing the effective noise scale (roughly from $G$ to $G/\sqrt{n}$). Bernstein's inequality quantifies the benefit: convergence can be linear (not just $1/\sqrt{t}$) when the variance is controlled.
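
A quick numerical sanity check of the bound above - a minimal sketch using synthetic per-sample gradients drawn uniformly from $[-G, G]$; the dataset size, batch size, and constants are illustrative, not part of the lesson:

import numpy as np

rng = np.random.default_rng(0)
d, n, G = 50, 100_000, 1.0                  # dimension, dataset size, per-coordinate bound
grads = rng.uniform(-G, G, size=(n, d))     # synthetic per-sample gradients in [-G, G]
g_true = grads.mean(axis=0)                 # "full-batch" gradient

def exceed_prob(m, eps, trials=2000):
    """Empirical P(||g_hat - g||_inf >= eps) for mini-batch size m."""
    hits = 0
    for _ in range(trials):
        batch = grads[rng.choice(n, size=m, replace=False)]
        if np.max(np.abs(batch.mean(axis=0) - g_true)) >= eps:
            hits += 1
    return hits / trials

m, eps = 1024, 0.1
bound = 2 * d * np.exp(-m * eps**2 / (2 * G**2))   # Hoeffding + union bound over coordinates
print(f"empirical exceedance: {exceed_prob(m, eps):.4f}   bound: {min(bound, 1.0):.4f}")

The empirical exceedance probability typically sits far below the bound, illustrating how conservative worst-case concentration is.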

11.2 Confidence Intervals via Hoeffding

Suppose we evaluate an LLM on a benchmark: out of $n$ binary questions (correct/incorrect), we observe accuracy $\hat{p} = k/n$.

By Hoeffding's inequality (with $X_i \in \{0,1\}$, $b-a = 1$), with probability at least $1-\delta$:

$$|\hat{p} - p| \leq \sqrt{\frac{\log(2/\delta)}{2n}}$$

This gives the Hoeffding confidence interval: $p \in \left[\hat{p} - \sqrt{\frac{\log(2/\delta)}{2n}},\ \hat{p} + \sqrt{\frac{\log(2/\delta)}{2n}}\right]$.

Numerical example. For $n = 1000$ questions and $\delta = 0.05$:

$$\text{CI half-width} = \sqrt{\frac{\log(40)}{2000}} \approx \sqrt{\frac{3.69}{2000}} \approx 0.043$$

So a measured accuracy of 73.2% on 1000 questions gives the 95% CI $[68.9\%, 77.5\%]$ - a $\pm 4.3\%$ interval. This explains why claiming "Model A (73.2%) outperforms Model B (72.8%)" on 1000 questions has no statistical support.
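
A minimal sketch of this computation (the helper below is ad hoc, not part of the lesson's code recipes):

import numpy as np

def hoeffding_half_width(n, delta=0.05):
    """Distribution-free CI half-width for accuracy measured on n binary questions."""
    return np.sqrt(np.log(2 / delta) / (2 * n))

n = 1000
hw = hoeffding_half_width(n)                       # ~0.043 for n = 1000
print(f"95% CI half-width for n={n}: +/-{hw:.3f}")
print(f"Model A (73.2%): [{0.732 - hw:.3f}, {0.732 + hw:.3f}]")
print(f"Model B (72.8%): [{0.728 - hw:.3f}, {0.728 + hw:.3f}]")   # the intervals overlap almost entirely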

11.3 Random Features and Kernel Approximation

The Rahimi-Recht random features method approximates the RBF kernel $k(x, x') = e^{-\|x-x'\|^2/(2\sigma^2)}$ as:

$$k(x, x') \approx \frac{1}{D}\sum_{j=1}^D \phi_j(x)\phi_j(x'), \quad \phi_j(x) = \sqrt{2}\cos(\omega_j^\top x + b_j)$$

where $\omega_j \sim \mathcal{N}(0, I/\sigma^2)$ and $b_j \sim \mathcal{U}[0, 2\pi]$.

How many features? The approximation error is bounded by Hoeffding:

$$P\!\left(\left|\frac{1}{D}\sum_{j=1}^D z_j - k(x,x')\right| \geq \varepsilon\right) \leq 2e^{-D\varepsilon^2/8}$$

where $z_j = \phi_j(x)\phi_j(x') \in [-2, 2]$ with $\mathbb{E}[z_j] = k(x, x')$. For $\varepsilon = 0.01$ and $\delta = 0.01$: $D \geq 8\log(2/\delta)/\varepsilon^2 \approx 424{,}000$.

In practice, $D = 1000$--$10000$ suffices with $\varepsilon \approx 0.1$, suggesting the bound is conservative. This is common: worst-case bounds are loose, actual concentrations are tighter.
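
A minimal sketch of the random-features approximation for a single pair of points (dimension, feature count, and bandwidth below are illustrative):

import numpy as np

rng = np.random.default_rng(1)
d, D, sigma = 10, 2000, 1.0
x, y = rng.normal(size=d), rng.normal(size=d)

# Exact RBF kernel value
k_exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Random Fourier features: phi_j(v) = sqrt(2) * cos(w_j . v + b_j), w_j ~ N(0, I / sigma^2)
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)
phi = lambda v: np.sqrt(2) * np.cos(W @ v + b)

k_approx = np.mean(phi(x) * phi(y))
print(f"exact {k_exact:.4f}   approx {k_approx:.4f}   |error| {abs(k_exact - k_approx):.4f}")

With $D = 2000$ the error is typically a few hundredths - far better than the worst-case bound suggests.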

11.4 Concentration in High Dimensions

High-dimensional geometry exhibits surprising concentration phenomena. For $\mathbf{x} \sim \mathcal{N}(0, I_d)$:

Chi-squared concentration. $\|\mathbf{x}\|_2^2 \sim \chi^2_d$ with mean $d$ and variance $2d$. Bernstein gives:

$$P\!\left(\left|\frac{\|\mathbf{x}\|_2^2}{d} - 1\right| \geq t\right) \leq 2\exp\!\left(-d \cdot \min\!\left(\frac{t^2}{4}, \frac{t}{4}\right)\right)$$

So $\|\mathbf{x}\|_2 / \sqrt{d} \to 1$ in probability - all Gaussian vectors in high dimensions have nearly the same norm $\sqrt{d}$.

Concentration of inner products. For independent $\mathbf{x}, \mathbf{y} \sim \mathcal{N}(0, I_d/d)$ (so that $\mathbb{E}\|\mathbf{x}\|_2^2 = 1$):

$$P(|\langle \mathbf{x}, \mathbf{y} \rangle| \geq t) \leq 2e^{-dt^2/2}$$

Nearly-orthogonal random vectors: with $d = 512$ (a typical embedding dimension), random vectors satisfy $|\langle \mathbf{x}, \mathbf{y}\rangle| < 0.1$ with high probability.

For AI (attention). In transformers with $d_k = 64$ (key dimension), random query-key inner products concentrate around 0. This justifies the $1/\sqrt{d_k}$ scaling in attention: without it, softmax inputs would have variance $d_k$, causing near-zero gradients through softmax saturation.
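
A minimal Monte Carlo sketch of both phenomena ($d = 512$ and the trial count are illustrative):

import numpy as np

rng = np.random.default_rng(2)
d, trials = 512, 10_000

X = rng.normal(size=(trials, d))
Y = rng.normal(size=(trials, d))

# Norm concentration: ||x||_2 / sqrt(d) clusters tightly around 1
norms = np.linalg.norm(X, axis=1) / np.sqrt(d)
print(f"norm / sqrt(d): mean {norms.mean():.4f}, std {norms.std():.4f}")

# Inner-product concentration for vectors rescaled to N(0, I_d / d)
inner = np.einsum("ij,ij->i", X / np.sqrt(d), Y / np.sqrt(d))
print(f"P(|<x,y>| >= 0.1) ~ {np.mean(np.abs(inner) >= 0.1):.4f}")
print(f"sub-Gaussian bound: {2 * np.exp(-d * 0.1**2 / 2):.4f}")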

Preview: Law of Large Numbers and CLT

The concentration results above provide finite-sample bounds. The LLN says $\bar{X}_n \to \mu$ as $n \to \infty$; the CLT says the standardised deviation $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ converges in distribution to $\mathcal{N}(0,1)$. These are the asymptotic counterparts of Hoeffding's inequality.

-> Full treatment: Section06 Stochastic Processes


12. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---------|----------------|-----|
| 1 | Applying Markov to a signed random variable | Markov requires $X \geq 0$; for signed $X$, $P(X \geq t)$ can exceed $\mathbb{E}[X]/t$ | Apply Markov to $\lvert X\rvert$ or another non-negative transform |
| 2 | Using Chebyshev for exponential tails | Chebyshev gives $1/k^2$ decay, vastly looser than sub-Gaussian $e^{-k^2/2}$ | If data is bounded or Gaussian, use Hoeffding/Chernoff |
| 3 | Forgetting the factor of 2 in two-sided Hoeffding | One-sided: $e^{-2nt^2/(b-a)^2}$; the two-sided bound has a factor 2 in front | Always write "two-sided" explicitly and include the factor |
| 4 | Confusing the sub-Gaussian parameter with the variance | The sub-Gaussian parameter $\sigma^2$ satisfies $\operatorname{Var}(X) \leq \sigma^2$ but need not equal it | $\operatorname{Var}(X) = M''(0) - M'(0)^2$; the sub-Gaussian parameter can be larger |
| 5 | Applying the union bound over infinite classes directly | The union bound only works for countable (finite/countably infinite) collections | Use an $\varepsilon$-net covering argument first, then a union bound over the net |
| 6 | Computing VC dimension by counting parameters | VC dimension is not the number of parameters; it depends on the structure of the function class | For linear classifiers in $\mathbb{R}^d$: $d_{VC} = d+1$, not the number of weights |
| 7 | Ignoring the $\log(1/\delta)$ term in sample complexity | Setting $\delta = 0.01$ adds $\log(100) \approx 4.6$ - small, but not negligible | Always compute the full bound including $\log(1/\delta)$ |
| 8 | Claiming Chernoff applies to dependent variables | Chernoff requires independence (the MGF factorises into a product) | For dependent variables, use McDiarmid or martingale concentration |
| 9 | Using Hoeffding when the variance is known to be small | Bernstein gives an exponentially tighter bound for small-variance data | Check whether $\operatorname{Var}(X_i) \ll (b-a)^2/4$; if so, use Bernstein |
| 10 | Interpreting VC generalisation bounds as practically tight | VC bounds are often vacuous for deep networks; they describe worst-case behaviour | Use PAC-Bayes or Rademacher bounds, which can be data-dependent and tighter |

13. Exercises

**Exercise 1 (★)** (Markov and Chebyshev) For $X \sim \operatorname{Exp}(1)$ and $Y \sim \mathcal{N}(0,1)$: (a) Compute $P(X \geq 3)$ exactly and via Markov's inequality. What is the ratio? (b) Compute $P(|Y| \geq 3)$ exactly and via Chebyshev. What is the ratio? (c) For which $k$ does Chebyshev give $P(|Y| \geq k) \leq 0.05$? Compare to the exact Gaussian answer. (d) Verify both bounds numerically with $N = 10^6$ samples.

**Exercise 2 (★)** (Hoeffding Sample Complexity) A spam filter classifies emails. On $n$ test emails, it achieves accuracy $\hat{p}$. (a) Using Hoeffding, find the smallest $n$ such that with 99% confidence, $|\hat{p} - p| \leq 0.02$. (b) How does the required $n$ change if we only require $|\hat{p} - p| \leq 0.05$? (c) If $n = 500$ and $\hat{p} = 0.92$, give the 95% Hoeffding confidence interval for $p$. (d) Compare the Hoeffding CI to the Normal-approximation CI $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$.

**Exercise 3 (★)** (Chernoff Bounds for the Binomial) Let $S \sim \operatorname{Bin}(n, p)$ with $n = 1000$, $p = 0.1$ (so $\mu = 100$). (a) Compute $P(S \geq 130)$ exactly and via the multiplicative Chernoff bound ($\delta = 0.3$). (b) Compute $P(S \geq 130)$ via Hoeffding. Compare to Chernoff - which is tighter? (c) Compute $P(S \leq 70)$ via the lower-tail Chernoff bound and compare to the exact value. (d) Plot all three bounds (Hoeffding, upper Chernoff, lower Chernoff) as a function of the threshold.

**Exercise 4 (★★)** (Bernstein: Gradient Noise in SGD) Assume each sample gradient $g_i \in [-G, G]^d$ with $\mathbb{E}[g_i] = g$ and per-coordinate variance $\sigma_k^2$. (a) Apply Bernstein (with a union bound) to bound $P(\|\hat{g} - g\|_\infty \geq \varepsilon)$ for mini-batch size $m$. (b) Compare to the Hoeffding-based bound. Under what condition does Bernstein win? (c) For $G = 1$, $d = 100$, $\sigma_k^2 = 0.01$, find the $m$ needed for $P(\|\hat{g} - g\|_\infty \geq 0.1) \leq 0.05$ via both bounds. (d) Verify numerically: generate synthetic gradients and measure the empirical exceedance probability.

**Exercise 5 (★★)** (McDiarmid on Empirical Risk) Let $\hat{R}(h, \mathcal{D}) = \frac{1}{n}\sum_{i=1}^n \ell(h, z_i)$ with $\ell \in [0, 1]$ and $z_i$ i.i.d. (a) Show that swapping one training example changes $\hat{R}$ by at most $1/n$. (b) Apply McDiarmid's inequality to bound $P(|\hat{R} - \mathbb{E}[\hat{R}]| \geq t)$. (c) Compare this to the direct Hoeffding application. Are they the same? Why? (d) What if $\ell \in [0, L]$ instead of $[0, 1]$? How does the bound scale?

**Exercise 6 (★★)** (PAC Bounds for a Finite Class) Consider $\mathcal{H} = \{h_1, \ldots, h_{1000}\}$, a set of 1000 decision rules for binary classification. (a) How many training examples are needed for uniform convergence with $\varepsilon = 0.05$, $\delta = 0.05$? (b) ERM achieves training error 0. Using the PAC bound, what is the worst-case true error with 95% confidence? (c) If we use $n = 5000$ examples and observe $\hat{R}(\hat{h}) = 0.03$, bound $R(\hat{h})$ with 99% confidence. (d) How does the bound change if $|\mathcal{H}| = 10^6$ (e.g., a lookup table over binary features)?

**Exercise 7 (★★★)** (VC Dimension and Generalisation) Consider linear classifiers in $\mathbb{R}^d$: $\mathcal{H} = \{\operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b): \mathbf{w} \in \mathbb{R}^d, b \in \mathbb{R}\}$. (a) Show by construction that $d+1$ points can be shattered (exhibit a configuration). (b) Show that no $d+2$ points in general position can be shattered. (c) Write the VC generalisation bound for $d = 100$, $n = 10000$, $\delta = 0.05$. (d) Compare to the Rademacher bound for the same class with $\|\mathbf{w}\|_2 \leq 1$ and $\|\mathbf{x}\|_2 \leq 1$.

**Exercise 8 (★★★)** (Rademacher Complexity: Monte Carlo Estimation) Given a dataset $\{x^{(1)}, \ldots, x^{(n)}\} \subset \mathbb{R}^d$ and the linear function class $\mathcal{F} = \{\mathbf{x} \mapsto \mathbf{w}^\top \mathbf{x}: \|\mathbf{w}\|_2 \leq 1\}$: (a) Show that $\hat{\mathfrak{R}}_S(\mathcal{F}) = \frac{1}{n}\mathbb{E}_\sigma\!\left[\left\|\sum_i \sigma_i x^{(i)}\right\|_2\right]$. (b) Implement a Monte Carlo estimator using $T = 1000$ random sign vectors $\sigma$. (c) On the MNIST training set (or a synthetic dataset), compute $\hat{\mathfrak{R}}_S$ and the resulting generalisation bound. (d) How does $\hat{\mathfrak{R}}_S$ change as you vary $n$? Verify the $O(1/\sqrt{n})$ scaling.


14. Why This Matters for AI (2026 Perspective)

| Concept | AI/ML Connection | Current Importance |
|---------|------------------|--------------------|
| Hoeffding confidence intervals | LLM benchmark evaluation: how many test questions for reliable rankings? | Critical - used implicitly in all leaderboard comparisons (MMLU, HELM, LMSYS) |
| PAC-Bayes bounds | Tightest known generalisation bounds for overparameterised models; connects to flat minima and sharpness-aware minimisation (SAM) | Active research frontier, 2023-2026 |
| VC dimension | Classical foundation; vacuous for LLMs with $10^9$+ parameters | Conceptually important; practically replaced by norm-based bounds |
| Rademacher complexity | Data-dependent bounds; finite for norm-constrained transformers | Used in theoretical analysis of attention, LoRA rank bounds |
| McDiarmid stability | Algorithmic stability theory: the output doesn't change much when one training point changes | Foundation of differential privacy; relevant to RLHF data sensitivity |
| Concentration in high dims | $\sqrt{d_k}$ scaling in attention; JL lemma for random projections in retrieval | Directly explains transformer architectural choices (FlashAttention, MLA) |
| Chernoff for random graphs | Theoretical analysis of dropout, random initialisation, random feature networks | Relevant to the lottery ticket hypothesis, neural architecture search |
| Union bound + covering | Uniform convergence for function classes; foundation of all finite-sample learning theory | Every theoretical ML paper uses this framework |
| Sub-Gaussian gradients | Convergence theory for Adam, SGD; required sample complexity for fine-tuning | Used in LoRA theoretical analysis and RLHF convergence guarantees |
| Bernstein vs Hoeffding | Variance-aware bounds; relevant when gradient variance is small (near optima) | Foundation of variance reduction methods (SVRG, SAG, SAGA) |

15. Conceptual Bridge

Looking back. This section builds on Section04 Expectation and Moments, which established $\mathbb{E}[X]$, $\operatorname{Var}(X)$, and the MGF $M_X(t) = \mathbb{E}[e^{tX}]$. The MGF is the critical bridge: Hoeffding's lemma uses convexity of $e^{tx}$; the Chernoff method optimises over $M_X(s)$; Bernstein's inequality exploits the variance $\mathbb{E}[X^2]$. Jensen's inequality (from Section04) underlies both the proof of Hoeffding's lemma and the information-theoretic interpretation of Chernoff bounds.

Looking forward. The Law of Large Numbers says $\bar{X}_n \to \mu$ almost surely. This is the qualitative counterpart of Hoeffding's quantitative bound: Hoeffding gives the rate at which $\bar{X}_n$ concentrates. The Central Limit Theorem (CLT) says the shape of the distribution of $\bar{X}_n$ converges to Gaussian - explaining why sub-Gaussian bounds are the right model for sums of bounded variables. Both are developed in Section06 Stochastic Processes.

The PAC-statistics connection. PAC learning theory is, at its core, applied concentration theory. The "probably" in PAC is the concentration inequality; the "approximately correct" is the $\varepsilon$ tolerance. The sample complexity $n = O((d + \log(1/\delta))/\varepsilon^2)$ is exactly what concentration inequalities give. Chapter 7 (Statistics) will build on this: confidence intervals are Hoeffding bounds, hypothesis tests are Chernoff bounds, and MLE theory uses Bernstein to prove asymptotic normality.

CONCENTRATION INEQUALITIES IN THE CURRICULUM
========================================================================

  Section04 Expectation and Moments
  +- E[X], Var(X), MGF M_X(t)  -------------------------- input
  +- Jensen's inequality         -------------------------- used in proofs
  +- Markov/Chebyshev preview    -------------------------- previewed there

         v

  Section05 Concentration Inequalities   <--- YOU ARE HERE
  +- Markov / Chebyshev            (moment-based, polynomial tails)
  +- Sub-Gaussian / Hoeffding      (bounded, exponential tails)
  +- Chernoff / Bernstein          (MGF-based, variance-aware)
  +- McDiarmid                     (functions of independent variables)
  +- Union bound + covering        (infinite hypothesis classes)
  +- PAC learning / VC dimension   (generalisation theory)
  +- Rademacher complexity         (data-dependent bounds)

         v                                      v

  Section06 Stochastic Processes         Chapter 7: Statistics
  +- Weak LLN (via Chebyshev)      +- Confidence intervals
  +- Strong LLN                    +- Hypothesis testing
  +- CLT (limit of Hoeffding)      +- MLE consistency (Bernstein)

========================================================================

Appendix A: Summary of Key Bounds

| Inequality | Conditions | Bound | Rate |
|------------|------------|-------|------|
| Markov | $X \geq 0$, $\mathbb{E}[X] = \mu$ | $P(X \geq t) \leq \mu/t$ | $O(1/t)$ |
| Chebyshev | $\operatorname{Var}(X) = \sigma^2$ | $P(\lvert X-\mu\rvert \geq t) \leq \sigma^2/t^2$ | $O(1/t^2)$ |
| Cantelli | $\operatorname{Var}(X) = \sigma^2$ | $P(X-\mu \geq t) \leq \sigma^2/(\sigma^2+t^2)$ | $O(1/t^2)$ |
| Sub-Gaussian | $X$ is $\sigma^2$-sub-Gaussian | $P(X \geq t) \leq e^{-t^2/(2\sigma^2)}$ | $O(e^{-t^2})$ |
| Hoeffding | $X_i \in [a_i,b_i]$ i.i.d. | $P(\bar{X}-\mu \geq t) \leq \exp(-2n^2t^2/\sum(b_i-a_i)^2)$ | $O(e^{-nt^2})$ |
| Chernoff (mult.) | $S \sim \operatorname{Bin}(n,p)$, $\delta \in (0,1)$ | $P(S \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3}$ | $O(e^{-\mu})$ |
| Bernstein | $\lvert X_i\rvert \leq c$, variance $\nu^2$ | $P(\bar{X} \geq t) \leq \exp\!\left(-\frac{nt^2/2}{\nu^2+ct/3}\right)$ | $O(e^{-nt^2/\nu^2})$ |
| McDiarmid | $f$ with $c_i$-bounded differences | $P(f - \mathbb{E}[f] \geq t) \leq e^{-2t^2/\sum c_i^2}$ | $O(e^{-t^2})$ |

Appendix B: Sample Complexity Table

Given: i.i.d. data in $[0,1]$; we want $P(|\bar{X}_n - \mu| \leq \varepsilon) \geq 1-\delta$, i.e. $n \geq \log(2/\delta)/(2\varepsilon^2)$.

| $\varepsilon$ | $\delta$ | Hoeffding $n$ | Normal approx $n$ |
|---------------|----------|---------------|-------------------|
| 0.10 | 0.10 | 150 | 68 |
| 0.05 | 0.10 | 600 | 271 |
| 0.05 | 0.05 | 738 | 384 |
| 0.01 | 0.05 | 18,445 | 9,604 |
| 0.01 | 0.01 | 26,492 | 16,587 |

Hoeffding is distribution-free (works for worst case); Normal approximation assumes CLT holds.


<- Back to Chapter 6: Probability Theory | Next: Stochastic Processes ->

Appendix C: Detailed Proofs and Derivations

C.1 Proof of Hoeffding's Lemma in Full

We prove: if $\mathbb{E}[X] = 0$ and $X \in [a, b]$ a.s., then $\mathbb{E}[e^{tX}] \leq e^{t^2(b-a)^2/8}$.

Step 1: Convexity bound. Since $e^{tx}$ is convex in $x$, for $x \in [a, b]$:

$$e^{tx} \leq \frac{b - x}{b - a} e^{ta} + \frac{x - a}{b - a} e^{tb}$$

Step 2: Take expectations. Let $p = \frac{-a}{b-a} \in [0,1]$ (since $a \leq 0 \leq b$ after centering):

$$\mathbb{E}[e^{tX}] \leq \frac{b - \mathbb{E}[X]}{b-a} e^{ta} + \frac{\mathbb{E}[X] - a}{b-a} e^{tb} = \frac{b}{b-a} e^{ta} + \frac{-a}{b-a} e^{tb} = (1-p) e^{ta} + p e^{tb} = e^{ta}\left[(1-p) + p e^{t(b-a)}\right]$$

Step 3: Exponential bound. Let $h = t(b-a)$ and $\phi(h) = \log[(1-p) + pe^h] - ph$. Then:

$$\mathbb{E}[e^{tX}] \leq e^{ta + ph} \cdot e^{\phi(h)} = e^{ta + p \cdot t(b-a)} \cdot e^{\phi(h)}$$

Note: $ta + p \cdot t(b-a) = ta + t \cdot \frac{-a}{b-a}(b-a) = ta - ta = 0$ (since $p = -a/(b-a)$). So:

$$\mathbb{E}[e^{tX}] \leq e^{\phi(h)}$$

Step 4: Bound $\phi(h)$. We have $\phi(0) = 0$ and $\phi'(0) = 0$. By Taylor's theorem:

$$\phi(h) = \phi(0) + h\phi'(0) + \frac{h^2}{2}\phi''(\xi)$$

for some $\xi \in [0, h]$. Computing:

$$\phi''(h) = \frac{pe^h(1-p)}{((1-p)+pe^h)^2} \leq \frac{1}{4}$$

(since $xy \leq (x+y)^2/4$ for $x, y \geq 0$, applied to $x = 1-p$, $y = pe^h$).

Therefore: $\phi(h) \leq \frac{h^2}{8} = \frac{t^2(b-a)^2}{8}$. $\blacksquare$
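
A quick Monte Carlo check of the lemma - a minimal sketch; the uniform distribution and the constants a, b, t are arbitrary choices:

import numpy as np

rng = np.random.default_rng(3)
a, b, t = -1.0, 2.0, 0.7

# A bounded variable on [a, b], recentred so that E[X] = 0 (as the lemma assumes)
u = rng.uniform(a, b, size=1_000_000)
x = u - u.mean()

mgf_empirical = np.mean(np.exp(t * x))
mgf_bound = np.exp(t**2 * (b - a)**2 / 8)
print(f"E[exp(tX)] ~ {mgf_empirical:.4f}  <=  bound {mgf_bound:.4f}")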

C.2 Proof of McDiarmid's Inequality via Azuma's Inequality

Azuma's Inequality. If $(Z_0, Z_1, \ldots, Z_n)$ is a martingale with $|Z_k - Z_{k-1}| \leq c_k$ a.s., then:

$$P(Z_n - Z_0 \geq t) \leq \exp\!\left(-\frac{2t^2}{\sum_{k=1}^n c_k^2}\right)$$

Doob martingale construction. Given independent $X_1, \ldots, X_n$ and $f$ with bounded differences $c_i$, define:

$$Z_k = \mathbb{E}[f(X_1, \ldots, X_n) \mid X_1, \ldots, X_k]$$

Then:

  • $Z_0 = \mathbb{E}[f]$ (a constant, before observing anything)
  • $Z_n = f(X_1, \ldots, X_n)$ (the function value itself)
  • $(Z_k)$ is a martingale by the tower property

Bounding differences. For each $k$:

$$Z_k - Z_{k-1} = \mathbb{E}[f \mid X_1,\ldots,X_k] - \mathbb{E}[f \mid X_1,\ldots,X_{k-1}]$$

Since changing $X_k$ changes $f$ by at most $c_k$ and $Z_{k-1}$ does not depend on $X_k$, we have $|Z_k - Z_{k-1}| \leq c_k$.

Applying Azuma to this Doob martingale gives McDiarmid's inequality. $\blacksquare$

C.3 Proof of the Finite-Class PAC Bound (Agnostic Case)

Setup. $\mathcal{H}$ is finite with $|\mathcal{H}| = M$. Losses $\ell_h^{(i)} = \ell(h, z^{(i)}) \in [0,1]$. True risk $R(h) = \mathbb{E}[\ell_h]$, empirical risk $\hat{R}(h) = \frac{1}{n}\sum_i \ell_h^{(i)}$.

Goal. Show $P(\sup_h |R(h) - \hat{R}(h)| > \varepsilon) \leq \delta$ when $n \geq \frac{\log(2M/\delta)}{2\varepsilon^2}$.

Step 1. Fix $h$. By Hoeffding: $P(|R(h) - \hat{R}(h)| > \varepsilon) \leq 2e^{-2n\varepsilon^2}$.

Step 2. Union bound over all $M$ hypotheses:

$$P(\exists h: |R(h) - \hat{R}(h)| > \varepsilon) \leq \sum_{h \in \mathcal{H}} 2e^{-2n\varepsilon^2} = 2M e^{-2n\varepsilon^2}$$

Step 3. Set $2Me^{-2n\varepsilon^2} \leq \delta$:

$$n \geq \frac{\log(2M/\delta)}{2\varepsilon^2}$$

ERM generalisation. Under uniform convergence, the ERM solution $\hat{h} = \arg\min_h \hat{R}(h)$ satisfies:

$$R(\hat{h}) \leq \hat{R}(\hat{h}) + \varepsilon \leq \hat{R}(h^*) + \varepsilon \leq R(h^*) + 2\varepsilon$$

where $h^* = \arg\min_{h \in \mathcal{H}} R(h)$ is the best hypothesis within $\mathcal{H}$ (the in-class optimum, not the Bayes-optimal classifier over all functions). $\blacksquare$


Appendix D: VC Theory in Depth

D.1 The Growth Function

Definition. The growth function $\Pi_\mathcal{H}(n)$ counts the maximum number of distinct labellings achievable by $\mathcal{H}$ on any $n$ points:

$$\Pi_\mathcal{H}(n) = \max_{x^{(1)},\ldots,x^{(n)} \in \mathcal{X}} \left|\{(h(x^{(1)}),\ldots,h(x^{(n)})): h \in \mathcal{H}\}\right|$$

If $\mathcal{H}$ can shatter $n$ points, $\Pi_\mathcal{H}(n) = 2^n$. The VC dimension $d = d_{VC}$ is the largest $n$ for which $\Pi_\mathcal{H}(n) = 2^n$.

D.2 Sauer-Shelah Lemma

Lemma. For $d_{VC}(\mathcal{H}) = d$ and $n \geq d$:

$$\Pi_\mathcal{H}(n) \leq \sum_{k=0}^d \binom{n}{k}$$

Proof idea. By induction on $n$ and $d$. Split: either $x^{(n)}$ makes no difference to the labelling count, or it doubles some labellings. Careful counting bounds the total.

Consequence. For $n \geq d$: $\Pi_\mathcal{H}(n) \leq (en/d)^d$. The growth function thus transitions from exponential $2^n$ (while the class can shatter) to polynomial $(en/d)^d$ (once $n$ exceeds the VC dimension).

D.3 VC Dimension Examples

| Hypothesis Class | VC Dimension |
|------------------|--------------|
| Threshold on $\mathbb{R}$ | 1 |
| Intervals on $\mathbb{R}$ | 2 |
| Halfspaces in $\mathbb{R}^d$ | $d+1$ |
| Axis-aligned rectangles in $\mathbb{R}^2$ | 4 |
| Convex polygons in $\mathbb{R}^2$ | $\infty$ |
| Degree-$k$ polynomials in $\mathbb{R}$ | $k+1$ |
| Neural nets with $W$ weights | $O(W \log W)$ |
| Kernel classifiers (universal) | $\infty$ |

Why VC $= d+1$ for halfspaces. In $\mathbb{R}^d$, any $d+1$ affinely independent points can be shattered by halfspaces. But for any $d+2$ points, Radon's theorem says they can be partitioned into two groups whose convex hulls intersect - and no halfspace can separate two groups with intersecting convex hulls. Hence $d_{VC} = d+1$.

D.4 Fundamental Theorem of Statistical Learning

Theorem. For binary classification, the following are equivalent:

  1. $\mathcal{H}$ is PAC-learnable
  2. ERM is a successful learner for $\mathcal{H}$
  3. $\mathcal{H}$ has finite VC dimension
  4. $\mathcal{H}$ has the uniform convergence property

This theorem, proved through Sauer-Shelah and the symmetrisation technique (doubling the sample and introducing Rademacher variables), is the bedrock of statistical learning theory.


Appendix E: Advanced Topics in Concentration

E.1 Talagrand's Inequality

Talagrand's inequality (1995) strengthens McDiarmid for "self-bounding" functions - those where the effect of each coordinate is bounded by the function value itself.

Self-bounding functions. $f: \mathcal{X}^n \to \mathbb{R}_{\geq 0}$ is self-bounding if there exist $f_i: \mathcal{X}^{n-1} \to \mathbb{R}$ such that:

$$0 \leq f(\mathbf{x}) - f_i(\mathbf{x}^{-i}) \leq 1 \quad \text{and} \quad \sum_{i=1}^n (f(\mathbf{x}) - f_i(\mathbf{x}^{-i})) \leq f(\mathbf{x})$$

Talagrand's bound. For self-bounding $f$:

$$P(f \geq \mathbb{E}[f] + t) \leq e^{-t^2/(2(\mathbb{E}[f]+t/3))}$$

This is a variance-adaptive McDiarmid - analogous to Bernstein vs Hoeffding.

Applications: Number of ones in a sum of Bernoullis, size of random matchings, empirical risk when the model fits the data.

E.2 Concentration for Heavy-Tailed Distributions

When distributions are heavy-tailed, sub-Gaussian bounds don't apply. Modern techniques include:

Catoni's estimator. For estimating the mean of a distribution with finite variance $\sigma^2$ (but potentially infinite MGF), Catoni's estimator $\hat{\mu}_\psi$ achieves:

$$P(|\hat{\mu}_\psi - \mu| \geq t) \leq 2\exp\!\left(-\frac{nt^2}{2\sigma^2 + 2t\sigma}\right) \approx e^{-nt^2/(2\sigma^2)}$$

using a truncated exponential influence function. This gives Bernstein-like rates without boundedness.

Median of means. Partition the $n$ samples into $k$ groups of $n/k$. Compute the mean of each group and take the median of these group means. The resulting estimator is sub-Gaussian with parameter proportional to $\sigma^2 k/n$ - achieving sub-Gaussian rates for distributions with only two finite moments.
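
A minimal sketch of the median-of-means estimator on heavy-tailed data (the Pareto tail index and group count are illustrative):

import numpy as np

def median_of_means(x, k=10):
    """Split the sample into k groups, average each group, return the median of the group means."""
    groups = np.array_split(np.asarray(x), k)
    return np.median([g.mean() for g in groups])

rng = np.random.default_rng(4)
# Heavy-tailed data: shifted Lomax/Pareto with tail index 2.1 (finite variance, infinite MGF)
data = rng.pareto(2.1, size=10_000) + 1.0
true_mean = 2.1 / 1.1                        # mean of (Lomax(a) + 1) is a / (a - 1)
print(f"plain mean error:      {abs(data.mean() - true_mean):.3f}")
print(f"median-of-means error: {abs(median_of_means(data, k=20) - true_mean):.3f}")

Across repeated runs, the plain mean occasionally takes large excursions driven by a single extreme sample; the median-of-means estimate stays far more stable.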

For AI: Training data for LLMs often has heavy-tailed length distributions (documents follow power laws). Heavy-tail-robust mean estimation is relevant for unbiased training.

E.3 Matrix Concentration Inequalities

Scalar concentration extends to matrices. For many AI applications, we care about concentration of random matrices.

Matrix Hoeffding. For independent random PSD matrices $X_1, \ldots, X_n \in \mathbb{R}^{d \times d}$ with $0 \preceq X_i \preceq cI$ and $\mathbb{E}[\sum_i X_i] = M$, for $t > 0$:

$$P\!\left(\lambda_{\max}\!\left(\sum_i X_i - M\right) \geq t\right) \leq d \cdot e^{-t^2/(2nc^2)}$$

The extra factor $d$ (the matrix dimension) accounts for the eigenvalue structure.

Application: random features. When approximating a kernel matrix $K_{ij} = k(x^{(i)}, x^{(j)})$ by random features, the approximation error in spectral norm concentrates via matrix Hoeffding, justifying the use of $D = O(d^2/\varepsilon^2)$ random features for an $\varepsilon$-accurate Gram matrix approximation.

Matrix Bernstein. For independent zero-mean random matrices $X_i$ with $\|X_i\|_2 \leq R$ and $\sigma^2 = \|\sum_i \mathbb{E}[X_i^2]\|_2$:

$$P\!\left(\left\|\sum_i X_i\right\|_2 \geq t\right) \leq 2d \cdot \exp\!\left(-\frac{t^2/2}{\sigma^2 + Rt/3}\right)$$

Used in randomised linear algebra (randomised SVD, Nystrom approximation).

E.4 PAC-Bayes Theory

PAC-Bayes bounds combine the PAC framework with Bayesian prior knowledge.

Setup. Given a prior $P$ over $\mathcal{H}$ (chosen before seeing data) and a posterior $Q$ (chosen after seeing data):

PAC-Bayes Theorem (McAllester, 1999). For any prior $P$ and $\delta > 0$, with probability $\geq 1-\delta$:

$$\mathbb{E}_{h \sim Q}[R(h)] \leq \mathbb{E}_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{D_{\mathrm{KL}}(Q \| P) + \log(1/\delta)}{2(n-1)}}$$

Why this is powerful: the KL term $D_{\mathrm{KL}}(Q\|P)$ replaces $\log|\mathcal{H}|$ (which can be infinite). For neural networks, if the posterior concentrates near the prior (small KL), the bound is tight. Flat minima (low sharpness) correspond to posteriors that do not drift far from the prior.

Connection to SAM. Sharpness-Aware Minimisation (SAM, 2021) minimises a PAC-Bayes-inspired upper bound on generalisation error by seeking flat minima - directly operationalising PAC-Bayes in practice.


Appendix F: Worked Examples

F.1 Coin Flip Estimation

Problem. We flip a biased coin $n$ times and observe a fraction $\hat{p}$ of heads. How many flips do we need to estimate $p$ within $\pm 0.05$ with 99% confidence?

Hoeffding solution. With $X_i \in \{0,1\}$, $b-a=1$:

$$n \geq \frac{\log(2/0.01)}{2(0.05)^2} = \frac{\log 200}{0.005} = \frac{5.298}{0.005} \approx 1060$$

Normal approximation. $n \geq (2.576/0.05)^2 \cdot 0.25 \approx 664$. Less conservative because it relies on the Gaussian CLT approximation (valid for large $n$).

Interpretation. About 1000-1100 flips are needed for reliable probability estimation. This is why A/B tests often require thousands of users: smaller samples give confidence intervals too wide to detect meaningful differences.

F.2 Multi-Armed Bandit via Chernoff

Problem. A recommendation system has $K = 100$ arms (content types). Each arm $a$ has an unknown click-through rate $p_a$. We want to identify the best arm within $\varepsilon = 0.01$ with $\delta = 0.01$.

Naive approach. Pull each arm $n_0$ times. By Hoeffding + union bound:

$$P(\exists a: |\hat{p}_a - p_a| > \varepsilon) \leq 2K e^{-2n_0\varepsilon^2}$$

Setting this $\leq \delta$: $n_0 \geq \frac{\log(2K/\delta)}{2\varepsilon^2} = \frac{\log(20{,}000)}{0.0002} \approx 49{,}500$. Total pulls: about $4{,}950{,}000$.

Adaptive strategy (UCB). Upper Confidence Bound algorithms concentrate exploration on promising arms, achieving total pulls $O(K\log n/\varepsilon^2)$ - much smaller in practice.
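
A minimal UCB1 sketch (10 illustrative arms rather than the 100 in the worked example; the horizon is arbitrary):

import numpy as np

def ucb1(click_probs, horizon=50_000, seed=0):
    """Minimal UCB1: pull the arm maximising empirical mean + sqrt(2 ln t / n_a)."""
    rng = np.random.default_rng(seed)
    K = len(click_probs)
    counts, sums = np.zeros(K), np.zeros(K)
    for t in range(1, horizon + 1):
        if t <= K:                                   # pull every arm once first
            a = t - 1
        else:
            a = int(np.argmax(sums / counts + np.sqrt(2 * np.log(t) / counts)))
        sums[a] += rng.random() < click_probs[a]     # Bernoulli click
        counts[a] += 1
    return counts

probs = np.linspace(0.01, 0.10, 10)
print("pulls per arm:", ucb1(probs).astype(int))     # most pulls concentrate on the best arm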

F.3 Neural Network Generalisation: PAC-Bayes in Practice

Setup. A 2-layer ReLU network, 100 hidden units, trained on 50,000 MNIST examples. Training error 0.3%, test error 1.2%. Standard VC bound: vacuous (parameter count $\gg n$).

PAC-Bayes approach. Use a Gaussian prior $P = \mathcal{N}(0, \sigma_0^2 I)$ and posterior $Q = \mathcal{N}(\hat{\theta}, \sigma^2 I)$ centred at the trained weights $\hat{\theta}$. Compute:

$$D_{\mathrm{KL}}(Q\|P) = \frac{\|\hat{\theta}\|_2^2}{2\sigma_0^2} + d\log(\sigma_0/\sigma) + \frac{d\sigma^2}{2\sigma_0^2} - \frac{d}{2}$$

With careful tuning of $\sigma$, this can give non-vacuous bounds (< 50% error guarantee) even for deep networks - unlike VC theory.
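
A minimal sketch of the bound's two ingredients - the Gaussian KL term and the resulting complexity term - with synthetic "trained" weights standing in for a real network (all values below are illustrative):

import numpy as np

def kl_gaussian_iso(theta_hat, sigma, sigma0):
    """KL( N(theta_hat, sigma^2 I) || N(0, sigma0^2 I) ) for isotropic Gaussians."""
    d = theta_hat.size
    return (np.sum(theta_hat**2) / (2 * sigma0**2)
            + d * np.log(sigma0 / sigma)
            + d * sigma**2 / (2 * sigma0**2)
            - d / 2)

def pac_bayes_complexity(kl, n, delta=0.05):
    """McAllester-style complexity term added to the empirical risk."""
    return np.sqrt((kl + np.log(1 / delta)) / (2 * (n - 1)))

rng = np.random.default_rng(5)
theta_hat = 0.02 * rng.normal(size=100_000)          # illustrative weight vector
kl = kl_gaussian_iso(theta_hat, sigma=0.045, sigma0=0.05)
print(f"KL ~ {kl:.0f}   complexity term: {pac_bayes_complexity(kl, n=50_000):.3f}")

With these arbitrary values the complexity term comes out around 0.3, illustrating how a posterior close to the prior keeps the bound non-vacuous; a posterior that drifts far away blows it up.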

F.4 Random Projection (Johnson-Lindenstrauss)

Problem. We have $n = 10000$ points in $\mathbb{R}^{1000}$. We want to project to $\mathbb{R}^m$ while preserving all pairwise distances within a factor $(1 \pm \varepsilon)$ for $\varepsilon = 0.1$.

JL Lemma. For $m = O(\log n / \varepsilon^2)$ random projections (from $\mathcal{N}(0, I/m)$):

$$m \geq \frac{4 + 2\log(n^2)}{\varepsilon^2/2 - \varepsilon^3/3} \approx \frac{8\log n}{\varepsilon^2} = \frac{8 \ln(10000)}{0.01} \approx 7400$$

Note that $m$ depends only on $n$, not on the ambient dimension. Here $7400 > 1000$, so at $\varepsilon = 0.1$ the projection gives no compression; the lemma pays off either when the ambient dimension is much larger (e.g. $10^5$--$10^6$, as for sparse text features) or with a coarser tolerance - at $\varepsilon = 0.3$ the same formula gives $m \approx 800 < 1000$.

Proof sketch. A random Gaussian matrix $R \in \mathbb{R}^{m \times d}$ satisfies, for any fixed pair of points $x, y$:

$$P\!\left(\left|\|Rx - Ry\|_2^2/m - \|x-y\|_2^2\right| > \varepsilon\|x-y\|_2^2\right) \leq 2e^{-m\varepsilon^2/8}$$

by chi-squared concentration. A union bound over the $\binom{n}{2}$ pairs gives the JL lemma.

For AI: Locality-sensitive hashing, approximate nearest-neighbour search (used in RAG retrieval), and low-rank adaptation (LoRA) all rely on Johnson-Lindenstrauss-type arguments to justify dimensionality reduction.
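
A minimal sketch of the distortion check (a smaller point set than in the worked example so the all-pairs comparison stays cheap; $\varepsilon = 0.3$ so that $m < d$):

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(6)
n, d, m, eps = 500, 1000, 800, 0.3

X = rng.normal(size=(n, d))
R = rng.normal(size=(m, d)) / np.sqrt(m)     # projection rows ~ N(0, I/m)
Y = X @ R.T                                  # projected points in R^m

ratios = pdist(Y, "sqeuclidean") / pdist(X, "sqeuclidean")
print(f"squared-distance distortion over all pairs: [{ratios.min():.3f}, {ratios.max():.3f}]"
      f"  (target: [{1 - eps:.1f}, {1 + eps:.1f}])")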


Appendix G: Computing Concentration Bounds in Practice

G.1 Choosing the Right Bound

DECISION TREE: WHICH BOUND TO USE
========================================================================

  Do you know E[X]? --- No --> Can't bound tail
       |
       Yes
       |
  Is X >= 0? --- Yes --> Markov: P(X >= t) <= E[X]/t
       |
       No
       |
  Do you know Var(X)? --- Yes --> Chebyshev: P(|X - mu| >= t) <= sigma^2/t^2
       |
       No
       |
  Is X bounded in [a,b]? --- Yes --> Hoeffding (exponential bound)
       |                             or Bernstein if Var(X) known
       |
       No
       |
  Is MGF finite? --- Yes --> Chernoff method (optimise over s)
       |
       No
       |
  Distribution-free needed --> Student-t CI or bootstrap

========================================================================

G.2 Python Recipe for Hoeffding CIs

import numpy as np

def hoeffding_ci(x, delta=0.05, lo=0.0, hi=1.0):
    """Hoeffding confidence interval for bounded data in [lo, hi]."""
    n = len(x)
    x_bar = np.mean(x)
    radius = (hi - lo) * np.sqrt(np.log(2 / delta) / (2 * n))
    return x_bar - radius, x_bar + radius

def hoeffding_sample_size(epsilon, delta, lo=0.0, hi=1.0):
    """Required n for epsilon-accuracy at 1-delta confidence."""
    return int(np.ceil((hi - lo)**2 * np.log(2 / delta) / (2 * epsilon**2)))

G.3 Interpreting Generalisation Gaps

When a model achieves training error 1% and test error 5%:

  • Generalisation gap = 4%
  • Is this within bounds? With $n = 10000$, $\delta = 0.05$: Hoeffding gives $\sqrt{\log(40)/20000} \approx 1.4\%$ per hypothesis
  • For $|\mathcal{H}| = 10^6$ hypotheses: add $\sqrt{\log(10^6)/20000} \approx 2.6\%$
  • Total bound: $\approx 4\%$ - roughly matching the observed gap

In practice, neural networks occupy a tiny corner of their hypothesis class (due to inductive bias of SGD), so effective complexity is much smaller than the VC dimension suggests.


Appendix H: Self-Assessment Checklist

Before moving to Section06 (Stochastic Processes), verify you can:

  • State Markov's inequality from memory and prove it with the indicator trick
  • Derive Chebyshev from Markov by applying Markov to $(X-\mu)^2$
  • Prove Hoeffding's lemma using convexity of $e^{tx}$ and a Taylor expansion
  • Apply the Chernoff method: Markov + MGF bound + optimise over $s$
  • State McDiarmid's inequality and identify the $c_i$ for a given function
  • Compute the sample complexity for given $(\varepsilon, \delta)$ using Hoeffding
  • Define VC dimension and compute it for halfspaces ($= d+1$ in $\mathbb{R}^d$)
  • Write the PAC generalisation bound with $|\mathcal{H}|$ or $d_{VC}$
  • Define Rademacher complexity and explain its interpretation
  • Identify when Bernstein beats Hoeffding (the small-variance case)
  • Give a non-example for sub-Gaussianity (a heavy-tailed distribution)
  • Explain why VC bounds are vacuous for LLMs and what PAC-Bayes offers

Appendix I: Connections to Information Theory

I.1 Kullback-Leibler Divergence and Exponential Families

The Chernoff method has a deep information-theoretic interpretation. The optimal exponent in the Chernoff bound equals the KL divergence under the tilted distribution.

For $X_1, \ldots, X_n$ i.i.d. with distribution $P$, and a target level $a > \mathbb{E}[X]$:

$$\lim_{n \to \infty} \frac{1}{n}\log P(\bar{X}_n \geq a) = -D_{\mathrm{KL}}(P_a \| P)$$

where $P_a$ is the exponential tilt of $P$ that puts its mean at $a$: $P_a(x) \propto e^{s^* x} P(x)$ with $\mathbb{E}_{P_a}[X] = a$.

This is Cramer's theorem - the foundation of large deviations theory. The probability of a rare event decays exponentially at rate given by the KL divergence to the closest distribution that makes the event typical.

For AI: This connection explains why cross-entropy loss (KL divergence from data to model) is the right loss for language models: minimising cross-entropy = maximising the probability of the training data = minimising the Chernoff exponent for generalisation error.

I.2 Entropy and Sauer-Shelah

The growth function $\Pi_\mathcal{H}(n)$ counts the number of distinct behaviours of $\mathcal{H}$ on $n$ points. The log growth rate

$$h(\mathcal{H}) = \lim_{n \to \infty} \frac{\log_2 \Pi_\mathcal{H}(n)}{n}$$

is the combinatorial entropy of $\mathcal{H}$. Sauer-Shelah shows:

  • If $d_{VC} < \infty$: $h(\mathcal{H}) = 0$ (subexponential growth, polynomial in $n$)
  • If $d_{VC} = \infty$: $h(\mathcal{H}) = 1$ (full exponential growth $2^n$)

This binary distinction - polynomial vs exponential growth - is precisely what separates PAC-learnable from non-learnable hypothesis classes.


Appendix J: Connections to Optimisation

J.1 Convergence of SGD via Hoeffding

The standard convergence analysis of SGD for convex functions uses:

  1. Descent lemma: $\mathcal{L}(\theta_{t+1}) \leq \mathcal{L}(\theta_t) - \eta g_t^\top(\theta_t - \theta^*) + \frac{\eta^2 L}{2}\|g_t\|_2^2$

  2. Gradient concentration: $\|\hat{g}_t - g_t\|_2 \leq \varepsilon$ with high probability (Hoeffding)

  3. Telescoping: sum over $T$ steps and apply the Hoeffding bound with a union bound over all $T$ steps (adding $\log T$ to the sample complexity)

The result: after $T = O(G^2/(\varepsilon^2\eta^2))$ steps, SGD achieves $\mathcal{L}(\theta_T) - \mathcal{L}(\theta^*) \leq \varepsilon$ with high probability.

J.2 Generalisation and Algorithmic Stability

Definition (Uniform Stability). Algorithm $\mathcal{A}$ is $\beta$-uniformly stable if for any two datasets $S$, $S'$ differing in one example:

$$\sup_z |\ell(\mathcal{A}(S), z) - \ell(\mathcal{A}(S'), z)| \leq \beta$$

Theorem (Bousquet-Elisseeff, 2002). If $\mathcal{A}$ is $\beta$-stable with $\beta \leq c/n$ for some constant $c$, and losses are bounded in $[0, 1]$, then with probability $\geq 1-\delta$:

$$R(\mathcal{A}(S)) \leq \hat{R}(\mathcal{A}(S)) + 2\beta + (4n\beta + 1)\sqrt{\frac{\log(1/\delta)}{2n}}$$

SGD is stable. For $L$-smooth convex losses, SGD with step size $\eta$ run for $T$ steps is $2\eta TL/n$-uniformly stable. Setting $\eta = O(1/\sqrt{T})$ gives stability $O(\sqrt{T}/n)$ - small when $T \ll n^2$.

This explains why early stopping helps: more SGD steps = less stable = worse generalisation.


Appendix K: Concentration Phenomena in Transformer Architectures

K.1 Attention Score Concentration

In scaled dot-product attention, the query-key inner products $Q_i K_j^\top / \sqrt{d_k}$ determine the attention weights. If the normalised queries and keys have i.i.d. $\mathcal{N}(0, 1/d_k)$ entries, the raw inner products satisfy:

$$P(|Q_i K_j^\top| \geq t) \leq 2e^{-d_k t^2/2}$$

Without normalisation - i.e. with unit-variance entries and no $1/\sqrt{d_k}$ scaling - the softmax input $Q_i K_j^\top$ has standard deviation $O(\sqrt{d_k})$, pushing softmax into its saturating regime (near-uniform or near-one-hot). The scaling brings the standard deviation back to $O(1)$, keeping gradients flowing.

Formal justification. With unit-variance entries, $Q_i K_j^\top = \sum_{l=1}^{d_k} Q_{il} K_{jl}$ where each summand has variance 1, so the sum has variance $d_k$ and standard deviation $\sqrt{d_k}$. After scaling, $Q_i K_j^\top / \sqrt{d_k}$ has standard deviation $\approx 1$ - the range where softmax is neither saturated nor flat.
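
A small sketch of the effect of the scaling on randomly initialised attention logits (dimensions are illustrative):

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(7)
d_k, n_tokens = 64, 128
q = rng.normal(size=d_k)                      # unit-variance query
K = rng.normal(size=(n_tokens, d_k))          # unit-variance keys

logits_raw = K @ q                            # standard deviation ~ sqrt(d_k) ~ 8
logits_scaled = logits_raw / np.sqrt(d_k)     # standard deviation ~ 1

for name, logits in [("unscaled", logits_raw), ("scaled", logits_scaled)]:
    p = softmax(logits)
    print(f"{name:9s} logit std {logits.std():5.2f}   max attention weight {p.max():.3f}")

The unscaled logits put most of the attention mass on a single token (saturation), while the scaled logits spread it out - which is what keeps the softmax gradient alive.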

K.2 Concentration in Residual Networks

Deep residual networks $x_{l+1} = x_l + f_l(x_l)$ accumulate residual perturbations. If each residual block has bounded output $\|f_l(x)\|_2 \leq c$ and the residual updates behave like roughly mean-zero bounded increments, an Azuma-type bound applied to the sequence $(x_l)$ gives:

$$P(\|x_L - x_0\|_2 \geq t) \leq \exp\!\left(-\frac{t^2}{2Lc^2}\right)$$

This says: even with $L = 100$ layers, if each residual block adds a small perturbation ($c \ll 1$), the total drift is bounded. LayerNorm and weight decay enforce exactly this: they keep $\|f_l(x)\|_2$ small, ensuring the representation doesn't drift exponentially with depth.

K.3 LoRA and Low-Rank Concentration

LoRA (Low-Rank Adaptation) parameterises weight updates as $\Delta W = AB$ where $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times k}$ with small rank $r$. The resulting hypothesis class has reduced complexity.

Rademacher complexity of LoRA. For the function class $\{x \mapsto W_0 x + AB x : \|A\|_F \leq s_A, \|B\|_F \leq s_B\}$:

$$\mathfrak{R}_n(\mathcal{F}_{\text{LoRA}}) \leq \frac{s_A s_B \sqrt{r} \cdot C}{\sqrt{n}}$$

where $C = \max_i \|x^{(i)}\|_2$. The key point: the Rademacher complexity scales as $\sqrt{r}$, not $\sqrt{dk}$ (full fine-tuning). With $r \ll \min(d,k)$, LoRA has a provably smaller generalisation gap than full fine-tuning, which helps explain why it doesn't overfit despite being used on small datasets.


Appendix L: Practical Guidelines for ML Practitioners

L.1 When to Report Confidence Intervals

Always report when:

  • Sample size $n < 10000$ (the Hoeffding CI is meaningful at this scale)
  • Comparing two systems (the CI overlap tells you whether the difference is significant)
  • Benchmarking LLMs on standard evaluations

Minimum information to report: point estimate, sample size, CI method (Hoeffding vs Normal approximation), confidence level.

L.2 Common Benchmarking Mistakes

| Mistake | Statistical Problem | Fix |
|---------|---------------------|-----|
| "Model A (73.2%) > Model B (72.8%) on 1000 questions" | The 95% CI is ±4.3% - the difference is insignificant | Need $n \geq 18{,}445$ for a ±1% CI |
| "Best of 10 model variants is significant" | Multiple comparisons: $10 \times 0.05 = 0.5$ chance of at least one false positive | Bonferroni: run each test at $0.05/10 = 0.005$ |
| "Averaging over 5 datasets gives a reliable comparison" | Each dataset is a separate test | Report a CI for each; meta-analysis requires care |
| "Our method wins on 8/10 tasks" | Sign test: you need $\geq 9/10$ wins for $p < 0.05$ | Use a paired t-test or Wilcoxon signed-rank test |
| "The 1000-epoch training curve shows improvement" | A training curve is not test performance | Report test error at convergence, with a CI |

L.3 Required Sample Sizes (Reference Table)

For binary outcomes (accuracy), using Hoeffding with 95% confidence ($\delta = 0.05$):

| Desired precision ($\varepsilon$) | Required $n$ |
|-----------------------------------|--------------|
| ±10% (0.10) | 185 |
| ±5% (0.05) | 738 |
| ±3% (0.03) | 2,050 |
| ±2% (0.02) | 4,612 |
| ±1% (0.01) | 18,445 |
| ±0.5% (0.005) | 73,779 |

Note: these are conservative (distribution-free). The Normal approximation gives roughly half as many, valid when $n\hat{p}(1-\hat{p}) \geq 5$.

L.4 Interpreting Generalisation Bounds for Deep Learning

Current theoretical bounds are often loose for deep networks. Practitioners should interpret them as:

  1. Order-of-magnitude guidance: a bound of $O(\sqrt{d/n})$ correctly predicts that more data helps, but may overestimate the absolute gap by 10-100x
  2. Relative comparisons: PAC-Bayes bounds can rank models by generalisation risk even when the absolute values are imprecise
  3. Design principles: results like "norm-constrained models generalise better" translate directly into regularisation practice (weight decay, gradient clipping, dropout)
  4. Not absolute guarantees: a theoretical bound of 40% error doesn't mean the model has 40% test error - it means the theory alone cannot rule out 40% test error

The theory is most useful as a framework for thinking, not as a numerical oracle.


Appendix M: References and Further Reading

Primary Sources

  1. Boucheron, Lugosi & Massart (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press. - The definitive modern treatment.

  2. Hoeffding, W. (1963). Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 58(301), 13-30.

  3. McDiarmid, C. (1989). On the Method of Bounded Differences. Surveys in Combinatorics, 141, 148-188.

  4. Vapnik, V. & Chervonenkis, A. (1971). On the Uniform Convergence of Relative Frequencies to Their Probabilities. Theory of Probability and Its Applications, 16(2), 264-280.

  5. Valiant, L. (1984). A Theory of the Learnable. Communications of the ACM, 27(11), 1134-1142.

Learning Theory Textbooks

  1. Shalev-Shwartz & Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. - Best introduction to PAC learning for ML practitioners.

  2. Mohri, Rostamizadeh & Talwalkar (2018). Foundations of Machine Learning (2nd ed.). MIT Press. - Covers Rademacher complexity and generalisation theory in depth.

  3. Wainwright, M. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. - Advanced treatment, covers matrix concentration and modern techniques.

Deep Learning Theory

  1. Zhang, Bengio et al. (2017). Understanding Deep Learning Requires Rethinking Generalization. ICLR 2017. - Showed memorisation capacity, challenging classical VC theory.

  2. Dziugaite & Roy (2017). Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many Parameters. UAI 2017. - First non-vacuous PAC-Bayes bounds for deep nets.

  3. Belkin et al. (2019). Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off. PNAS 2019. - Double descent, overparameterisation.

  4. Foret et al. (2021). Sharpness-Aware Minimization for Efficiently Improving Generalization. ICLR 2021. - SAM as PAC-Bayes operationalisation.


Appendix N: Extended Worked Problems

N.1 Bounding the Maximum of Random Variables

Problem. Let $X_1, \ldots, X_n \sim \mathcal{N}(0,1)$ i.i.d. What is $\mathbb{E}[\max_i X_i]$ approximately?

Answer. Each $X_i$ is 1-sub-Gaussian. By the union bound:

$$P(\max_i X_i \geq t) \leq n \cdot P(X_1 \geq t) \leq n e^{-t^2/2}$$

Setting $ne^{-t^2/2} = 1$ gives $t^* = \sqrt{2\log n}$, so $\max_i X_i = O(\sqrt{\log n})$ with high probability.

More precisely, $\mathbb{E}[\max_i X_i] \approx \sqrt{2\log n}\,(1 - O(1/\log n))$. For $n = 1000$: $\sqrt{2 \cdot 6.9} \approx 3.7$, and the lower-order correction brings the expectation down to roughly 3.2-3.3, which Monte Carlo confirms.
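
A minimal Monte Carlo check (sample counts are illustrative):

import numpy as np

rng = np.random.default_rng(8)
n, trials = 1000, 5_000
maxima = rng.normal(size=(trials, n)).max(axis=1)
# The expected maximum of n standard normals sits a little below sqrt(2 ln n)
print(f"E[max_i X_i] ~ {maxima.mean():.3f}   sqrt(2 ln n) = {np.sqrt(2 * np.log(n)):.3f}")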

For AI: The maximum attention logit in a head with $n$ tokens and $d_k$-dimensional keys satisfies $\max_j Q K_j^\top \approx \sqrt{2d_k \log n}$. The softmax of this concentrates on one or a few tokens for large $d_k$ - explaining why attention becomes "sharp" (spiky) in deeper layers where $d_k$ is large.

N.2 Symmetric Rounding

Problem. An algorithm rounds each of $n = 100$ numbers independently up or down by at most 0.5. What is the probability that the total rounding error exceeds 10?

Solution. Each error $E_i \in [-0.5, 0.5]$ with $\mathbb{E}[E_i] = 0$, so each summand has range $b_i - a_i = 1$. For the sum $S = \sum_i E_i$, Hoeffding gives:

$$P(S \geq 10) \leq \exp\!\left(-\frac{2 t^2}{\sum_i (b_i - a_i)^2}\right) = \exp\!\left(-\frac{2 \cdot 10^2}{100 \cdot 1^2}\right) = e^{-2} \approx 0.135$$

The two-sided version: $P(|S| \geq 10) \leq 2e^{-2} \approx 0.27$.

N.3 PAC Bound for a Neural Network Classifier

Problem. A student trains a 3-layer neural network on $n = 5000$ examples and achieves 2% training error. Using the finite-class PAC bound with $|\mathcal{H}| = 2^{10^6}$ (very conservatively, assuming 1 bit per parameter for $10^6$ parameters), what does the bound say about test error?

Solution. The PAC bound (agnostic, finite class):

$$R(\hat{h}) \leq \hat{R}(\hat{h}) + \sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2n}}$$

With $\delta = 0.05$ and $|\mathcal{H}| = 2^{10^6}$:

$$\sqrt{\frac{\log(2 \cdot 2^{10^6}/0.05)}{10000}} = \sqrt{\frac{10^6 \ln 2 + \ln 40}{10000}} \approx \sqrt{\frac{693{,}150}{10000}} \approx 8.3$$

Test error bound: $0.02 + 8.3 \approx 8.3$ - a completely vacuous bound (it exceeds 100%). This illustrates why VC/parameter-counting bounds are useless for deep networks. PAC-Bayes or Rademacher complexity (using $\|\mathbf{w}\|_F$ rather than parameter count) gives non-vacuous results.

N.4 Proving Generalisation for 1-Nearest Neighbour

Problem. For $k$-nearest neighbour classification with $k = 1$: (a) Show that the leave-one-out error is an unbiased estimate of the true error. (b) Use McDiarmid to bound the deviation of the leave-one-out estimate from its expectation.

Solution. (a) The leave-one-out (LOO) error is $\hat{R}_{\text{LOO}} = \frac{1}{n}\sum_i \mathbf{1}[\hat{y}_{-i} \neq y^{(i)}]$, where $\hat{y}_{-i}$ is the prediction on $x^{(i)}$ using all other training points. By symmetry of the i.i.d. draw, $\mathbb{E}[\hat{R}_{\text{LOO}}] = R_{n-1}$ (the true error of 1-NN trained on $n-1$ examples).

(b) Swapping one training example changes $\hat{R}_{\text{LOO}}$ by at most $c/n$ for a constant $c$ (in fixed dimension, a point can be the nearest neighbour of at most a dimension-dependent constant number of other points; we take $c = 2$ in the idealised case where only 2 LOO predictions are affected). By McDiarmid with $c_i = 2/n$:

$$P(|\hat{R}_{\text{LOO}} - \mathbb{E}[\hat{R}_{\text{LOO}}]| \geq t) \leq 2\exp\!\left(-\frac{2t^2}{\sum_i (2/n)^2}\right) = 2\exp\!\left(-\frac{nt^2}{2}\right)$$

N.5 Confidence Region for Gradient Descent Convergence

Problem. SGD is run for $T$ iterations with step size $\eta = 1/\sqrt{T}$. Each gradient estimate is sub-Gaussian with parameter $\sigma^2$. With what probability does SGD achieve $\mathcal{L}(\theta_T) - \mathcal{L}^* \leq \varepsilon$?

Solution. The standard SGD analysis shows $\mathbb{E}[\mathcal{L}(\theta_T) - \mathcal{L}^*] \leq \frac{R^2 + \sigma^2}{\sqrt{T}}$, where $R = \|\theta_0 - \theta^*\|_2$.

For the high-probability version: at each step $t$, $P(\|g_t - \hat{g}_t\|_2 \geq u) \leq 2e^{-mu^2/(2\sigma^2)}$ by sub-Gaussianity (batch size $m$). Taking a union bound over all $T$ steps and using an Azuma-type argument for the algorithm trajectory:

$$P(\mathcal{L}(\theta_T) - \mathcal{L}^* \geq \varepsilon) \leq \delta$$

when $T = O\left(\frac{R^2 \log(1/\delta) + \sigma^2 \log(T/\delta)}{\varepsilon^2}\right)$.

