Math for LLMs - Probability Theory / Concentration Inequalities

Concentration Inequalities - Part 2: ML Applications (Section 11) through Appendix N (Extended Worked Problems)

11. ML Applications

11.1 SGD and Gradient Concentration

In stochastic gradient descent, at each step $t$ we compute a mini-batch gradient:

$$\hat{g}_t = \frac{1}{m}\sum_{i \in B_t} \nabla_\theta \ell(\theta_t; x^{(i)})$$

where $B_t$ is a random mini-batch of size $m$. The true gradient is $g_t = \nabla_\theta \mathcal{L}(\theta_t) = \mathbb{E}[\nabla_\theta \ell]$.

Concentration of gradient norm. Assuming $\|\nabla\ell(\theta; x)\|_\infty \leq G$ almost surely, applying Hoeffding coordinate-wise and taking a union bound over the $d$ coordinates gives:

$$P(\|\hat{g}_t - g_t\|_\infty \geq \varepsilon) \leq 2d \cdot \exp\!\left(-\frac{m\varepsilon^2}{2G^2}\right)$$

Implication for learning rate. For SGD to make progress, the gradient noise must be smaller than the signal. Concentration tells us: with probability $1-\delta$, the per-coordinate noise $\|\hat{g}_t - g_t\|_\infty$ is bounded by $G\sqrt{2\log(2d/\delta)/m}$. This justifies the learning rate scaling $\eta \propto 1/\sqrt{t}$ in theory.

Variance reduction. Methods like SVRG (Stochastic Variance-Reduced Gradient) periodically compute the full gradient, reducing the effective noise scale (roughly from $G$ to $G/\sqrt{n}$). Bernstein's inequality quantifies the benefit: convergence can be linear (not just $1/\sqrt{t}$) when the variance is controlled.
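
A quick numerical sanity check of the bound above - a minimal sketch using synthetic per-sample gradients drawn uniformly from $[-G, G]$; the dataset size, batch size, and constants are illustrative, not part of the lesson:

import numpy as np

rng = np.random.default_rng(0)
d, n, G = 50, 100_000, 1.0                  # dimension, dataset size, per-coordinate bound
grads = rng.uniform(-G, G, size=(n, d))     # synthetic per-sample gradients in [-G, G]
g_true = grads.mean(axis=0)                 # "full-batch" gradient

def exceed_prob(m, eps, trials=2000):
    """Empirical P(||g_hat - g||_inf >= eps) for mini-batch size m."""
    hits = 0
    for _ in range(trials):
        batch = grads[rng.choice(n, size=m, replace=False)]
        if np.max(np.abs(batch.mean(axis=0) - g_true)) >= eps:
            hits += 1
    return hits / trials

m, eps = 1024, 0.1
bound = 2 * d * np.exp(-m * eps**2 / (2 * G**2))   # Hoeffding + union bound over coordinates
print(f"empirical exceedance: {exceed_prob(m, eps):.4f}   bound: {min(bound, 1.0):.4f}")

The empirical exceedance probability typically sits far below the bound, illustrating how conservative worst-case concentration is.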

11.2 Confidence Intervals via Hoeffding

Suppose we evaluate an LLM on a benchmark: out of $n$ binary questions (correct/incorrect), we observe accuracy $\hat{p} = k/n$.

By Hoeffding's inequality (with $X_i \in \{0,1\}$, $b-a = 1$), with probability at least $1-\delta$:

$$|\hat{p} - p| \leq \sqrt{\frac{\log(2/\delta)}{2n}}$$

This gives the Hoeffding confidence interval: $p \in \left[\hat{p} - \sqrt{\frac{\log(2/\delta)}{2n}},\ \hat{p} + \sqrt{\frac{\log(2/\delta)}{2n}}\right]$.

Numerical example. For $n = 1000$ questions and $\delta = 0.05$:

$$\text{CI half-width} = \sqrt{\frac{\log(40)}{2000}} \approx \sqrt{\frac{3.69}{2000}} \approx 0.043$$

So a measured accuracy of 73.2% on 1000 questions gives the 95% CI $[68.9\%, 77.5\%]$ - a $\pm 4.3\%$ interval. This explains why claiming "Model A (73.2%) outperforms Model B (72.8%)" on 1000 questions has no statistical support.
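
A minimal sketch of this computation (the helper below is ad hoc, not part of the lesson's code recipes):

import numpy as np

def hoeffding_half_width(n, delta=0.05):
    """Distribution-free CI half-width for accuracy measured on n binary questions."""
    return np.sqrt(np.log(2 / delta) / (2 * n))

n = 1000
hw = hoeffding_half_width(n)                       # ~0.043 for n = 1000
print(f"95% CI half-width for n={n}: +/-{hw:.3f}")
print(f"Model A (73.2%): [{0.732 - hw:.3f}, {0.732 + hw:.3f}]")
print(f"Model B (72.8%): [{0.728 - hw:.3f}, {0.728 + hw:.3f}]")   # the intervals overlap almost entirely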

11.3 Random Features and Kernel Approximation

The Rahimi-Recht random features method approximates the RBF kernel $k(x, x') = e^{-\|x-x'\|^2/(2\sigma^2)}$ as:

$$k(x, x') \approx \frac{1}{D}\sum_{j=1}^D \phi_j(x)\phi_j(x'), \quad \phi_j(x) = \sqrt{2}\cos(\omega_j^\top x + b_j)$$

where $\omega_j \sim \mathcal{N}(0, I/\sigma^2)$ and $b_j \sim \mathcal{U}[0, 2\pi]$.

How many features? The approximation error is bounded by Hoeffding:

$$P\!\left(\left|\frac{1}{D}\sum_{j=1}^D z_j - k(x,x')\right| \geq \varepsilon\right) \leq 2e^{-D\varepsilon^2/8}$$

where $z_j = \phi_j(x)\phi_j(x') \in [-2, 2]$ with $\mathbb{E}[z_j] = k(x, x')$. For $\varepsilon = 0.01$ and $\delta = 0.01$: $D \geq 8\log(2/\delta)/\varepsilon^2 \approx 424{,}000$.

In practice, $D = 1000$--$10000$ suffices with $\varepsilon \approx 0.1$, suggesting the bound is conservative. This is common: worst-case bounds are loose, actual concentrations are tighter.
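
A minimal sketch of the random-features approximation for a single pair of points (dimension, feature count, and bandwidth below are illustrative):

import numpy as np

rng = np.random.default_rng(1)
d, D, sigma = 10, 2000, 1.0
x, y = rng.normal(size=d), rng.normal(size=d)

# Exact RBF kernel value
k_exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Random Fourier features: phi_j(v) = sqrt(2) * cos(w_j . v + b_j), w_j ~ N(0, I / sigma^2)
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)
phi = lambda v: np.sqrt(2) * np.cos(W @ v + b)

k_approx = np.mean(phi(x) * phi(y))
print(f"exact {k_exact:.4f}   approx {k_approx:.4f}   |error| {abs(k_exact - k_approx):.4f}")

With $D = 2000$ the error is typically a few hundredths - far better than the worst-case bound suggests.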

11.4 Concentration in High Dimensions

High-dimensional geometry exhibits surprising concentration phenomena. For $\mathbf{x} \sim \mathcal{N}(0, I_d)$:

Chi-squared concentration. $\|\mathbf{x}\|_2^2 \sim \chi^2_d$ with mean $d$ and variance $2d$. Bernstein gives:

$$P\!\left(\left|\frac{\|\mathbf{x}\|_2^2}{d} - 1\right| \geq t\right) \leq 2\exp\!\left(-d \cdot \min\!\left(\frac{t^2}{4}, \frac{t}{4}\right)\right)$$

So $\|\mathbf{x}\|_2 / \sqrt{d} \to 1$ in probability - all Gaussian vectors in high dimensions have nearly the same norm $\sqrt{d}$.

Concentration of inner products. For independent $\mathbf{x}, \mathbf{y} \sim \mathcal{N}(0, I_d/d)$ (so that $\mathbb{E}\|\mathbf{x}\|_2^2 = 1$):

$$P(|\langle \mathbf{x}, \mathbf{y} \rangle| \geq t) \leq 2e^{-dt^2/2}$$

Nearly-orthogonal random vectors: with $d = 512$ (a typical embedding dimension), random vectors satisfy $|\langle \mathbf{x}, \mathbf{y}\rangle| < 0.1$ with high probability.

For AI (attention). In transformers with $d_k = 64$ (key dimension), random query-key inner products concentrate around 0. This justifies the $1/\sqrt{d_k}$ scaling in attention: without it, softmax inputs would have variance $d_k$, causing near-zero gradients through softmax saturation.
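
A minimal Monte Carlo sketch of both phenomena ($d = 512$ and the trial count are illustrative):

import numpy as np

rng = np.random.default_rng(2)
d, trials = 512, 10_000

X = rng.normal(size=(trials, d))
Y = rng.normal(size=(trials, d))

# Norm concentration: ||x||_2 / sqrt(d) clusters tightly around 1
norms = np.linalg.norm(X, axis=1) / np.sqrt(d)
print(f"norm / sqrt(d): mean {norms.mean():.4f}, std {norms.std():.4f}")

# Inner-product concentration for vectors rescaled to N(0, I_d / d)
inner = np.einsum("ij,ij->i", X / np.sqrt(d), Y / np.sqrt(d))
print(f"P(|<x,y>| >= 0.1) ~ {np.mean(np.abs(inner) >= 0.1):.4f}")
print(f"sub-Gaussian bound: {2 * np.exp(-d * 0.1**2 / 2):.4f}")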

Preview: Law of Large Numbers and CLT

The concentration results above provide finite-sample bounds. The LLN says $\bar{X}_n \to \mu$ as $n \to \infty$; the CLT says the standardised deviation $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ converges in distribution to $\mathcal{N}(0,1)$. These are the asymptotic counterparts of Hoeffding's inequality.

-> Full treatment: Section06 Stochastic Processes


12. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---------|----------------|-----|
| 1 | Applying Markov to a signed random variable | Markov requires $X \geq 0$; for signed $X$, $P(X \geq t)$ can exceed $\mathbb{E}[X]/t$ | Apply Markov to $\lvert X\rvert$ or another non-negative transform |
| 2 | Using Chebyshev for exponential tails | Chebyshev gives $1/k^2$ decay, vastly looser than sub-Gaussian $e^{-k^2/2}$ | If data is bounded or Gaussian, use Hoeffding/Chernoff |
| 3 | Forgetting the factor of 2 in two-sided Hoeffding | One-sided: $e^{-2nt^2/(b-a)^2}$; the two-sided bound has a factor 2 in front | Always write "two-sided" explicitly and include the factor |
| 4 | Confusing the sub-Gaussian parameter with the variance | The sub-Gaussian parameter $\sigma^2$ satisfies $\operatorname{Var}(X) \leq \sigma^2$ but need not equal it | $\operatorname{Var}(X) = M''(0) - M'(0)^2$; the sub-Gaussian parameter can be larger |
| 5 | Applying the union bound over infinite classes directly | The union bound only works for countable (finite/countably infinite) collections | Use an $\varepsilon$-net covering argument first, then a union bound over the net |
| 6 | Computing VC dimension by counting parameters | VC dimension is not the number of parameters; it depends on the structure of the function class | For linear classifiers in $\mathbb{R}^d$: $d_{VC} = d+1$, not the number of weights |
| 7 | Ignoring the $\log(1/\delta)$ term in sample complexity | Setting $\delta = 0.01$ adds $\log(100) \approx 4.6$ - small, but not negligible | Always compute the full bound including $\log(1/\delta)$ |
| 8 | Claiming Chernoff applies to dependent variables | Chernoff requires independence (the MGF factorises into a product) | For dependent variables, use McDiarmid or martingale concentration |
| 9 | Using Hoeffding when the variance is known to be small | Bernstein gives an exponentially tighter bound for small-variance data | Check whether $\operatorname{Var}(X_i) \ll (b-a)^2/4$; if so, use Bernstein |
| 10 | Interpreting VC generalisation bounds as practically tight | VC bounds are often vacuous for deep networks; they describe worst-case behaviour | Use PAC-Bayes or Rademacher bounds, which can be data-dependent and tighter |

13. Exercises

**Exercise 1 (★)** (Markov and Chebyshev) For $X \sim \operatorname{Exp}(1)$ and $Y \sim \mathcal{N}(0,1)$: (a) Compute $P(X \geq 3)$ exactly and via Markov's inequality. What is the ratio? (b) Compute $P(|Y| \geq 3)$ exactly and via Chebyshev. What is the ratio? (c) For which $k$ does Chebyshev give $P(|Y| \geq k) \leq 0.05$? Compare to the exact Gaussian answer. (d) Verify both bounds numerically with $N = 10^6$ samples.

**Exercise 2 (★)** (Hoeffding Sample Complexity) A spam filter classifies emails. On $n$ test emails, it achieves accuracy $\hat{p}$. (a) Using Hoeffding, find the smallest $n$ such that with 99% confidence, $|\hat{p} - p| \leq 0.02$. (b) How does the required $n$ change if we only require $|\hat{p} - p| \leq 0.05$? (c) If $n = 500$ and $\hat{p} = 0.92$, give the 95% Hoeffding confidence interval for $p$. (d) Compare the Hoeffding CI to the Normal-approximation CI $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$.

**Exercise 3 (★)** (Chernoff Bounds for the Binomial) Let $S \sim \operatorname{Bin}(n, p)$ with $n = 1000$, $p = 0.1$ (so $\mu = 100$). (a) Compute $P(S \geq 130)$ exactly and via the multiplicative Chernoff bound ($\delta = 0.3$). (b) Compute $P(S \geq 130)$ via Hoeffding. Compare to Chernoff - which is tighter? (c) Compute $P(S \leq 70)$ via the lower-tail Chernoff bound and compare to the exact value. (d) Plot all three bounds (Hoeffding, upper Chernoff, lower Chernoff) as a function of the threshold.

**Exercise 4 (★★)** (Bernstein: Gradient Noise in SGD) Assume each sample gradient $g_i \in [-G, G]^d$ with $\mathbb{E}[g_i] = g$ and per-coordinate variance $\sigma_k^2$. (a) Apply Bernstein (with a union bound) to bound $P(\|\hat{g} - g\|_\infty \geq \varepsilon)$ for mini-batch size $m$. (b) Compare to the Hoeffding-based bound. Under what condition does Bernstein win? (c) For $G = 1$, $d = 100$, $\sigma_k^2 = 0.01$, find the $m$ needed for $P(\|\hat{g} - g\|_\infty \geq 0.1) \leq 0.05$ via both bounds. (d) Verify numerically: generate synthetic gradients and measure the empirical exceedance probability.

**Exercise 5 (★★)** (McDiarmid on Empirical Risk) Let $\hat{R}(h, \mathcal{D}) = \frac{1}{n}\sum_{i=1}^n \ell(h, z_i)$ with $\ell \in [0, 1]$ and $z_i$ i.i.d. (a) Show that swapping one training example changes $\hat{R}$ by at most $1/n$. (b) Apply McDiarmid's inequality to bound $P(|\hat{R} - \mathbb{E}[\hat{R}]| \geq t)$. (c) Compare this to the direct Hoeffding application. Are they the same? Why? (d) What if $\ell \in [0, L]$ instead of $[0, 1]$? How does the bound scale?

**Exercise 6 (★★)** (PAC Bounds for a Finite Class) Consider $\mathcal{H} = \{h_1, \ldots, h_{1000}\}$, a set of 1000 decision rules for binary classification. (a) How many training examples are needed for uniform convergence with $\varepsilon = 0.05$, $\delta = 0.05$? (b) ERM achieves training error 0. Using the PAC bound, what is the worst-case true error with 95% confidence? (c) If we use $n = 5000$ examples and observe $\hat{R}(\hat{h}) = 0.03$, bound $R(\hat{h})$ with 99% confidence. (d) How does the bound change if $|\mathcal{H}| = 10^6$ (e.g., a lookup table over binary features)?

**Exercise 7 (★★★)** (VC Dimension and Generalisation) Consider linear classifiers in $\mathbb{R}^d$: $\mathcal{H} = \{\operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b): \mathbf{w} \in \mathbb{R}^d, b \in \mathbb{R}\}$. (a) Show by construction that $d+1$ points can be shattered (exhibit a configuration). (b) Show that no $d+2$ points in general position can be shattered. (c) Write the VC generalisation bound for $d = 100$, $n = 10000$, $\delta = 0.05$. (d) Compare to the Rademacher bound for the same class with $\|\mathbf{w}\|_2 \leq 1$ and $\|\mathbf{x}\|_2 \leq 1$.

**Exercise 8 (★★★)** (Rademacher Complexity: Monte Carlo Estimation) Given a dataset $\{x^{(1)}, \ldots, x^{(n)}\} \subset \mathbb{R}^d$ and the linear function class $\mathcal{F} = \{\mathbf{x} \mapsto \mathbf{w}^\top \mathbf{x}: \|\mathbf{w}\|_2 \leq 1\}$: (a) Show that $\hat{\mathfrak{R}}_S(\mathcal{F}) = \frac{1}{n}\mathbb{E}_\sigma\!\left[\left\|\sum_i \sigma_i x^{(i)}\right\|_2\right]$. (b) Implement a Monte Carlo estimator using $T = 1000$ random sign vectors $\sigma$. (c) On the MNIST training set (or a synthetic dataset), compute $\hat{\mathfrak{R}}_S$ and the resulting generalisation bound. (d) How does $\hat{\mathfrak{R}}_S$ change as you vary $n$? Verify the $O(1/\sqrt{n})$ scaling.


14. Why This Matters for AI (2026 Perspective)

| Concept | AI/ML Connection | Current Importance |
|---------|------------------|--------------------|
| Hoeffding confidence intervals | LLM benchmark evaluation: how many test questions for reliable rankings? | Critical - used implicitly in all leaderboard comparisons (MMLU, HELM, LMSYS) |
| PAC-Bayes bounds | Tightest known generalisation bounds for overparameterised models; connects to flat minima and sharpness-aware minimisation (SAM) | Active research frontier, 2023-2026 |
| VC dimension | Classical foundation; vacuous for LLMs with $10^9$+ parameters | Conceptually important; practically replaced by norm-based bounds |
| Rademacher complexity | Data-dependent bounds; finite for norm-constrained transformers | Used in theoretical analysis of attention, LoRA rank bounds |
| McDiarmid stability | Algorithmic stability theory: the output doesn't change much when one training point changes | Foundation of differential privacy; relevant to RLHF data sensitivity |
| Concentration in high dims | $\sqrt{d_k}$ scaling in attention; JL lemma for random projections in retrieval | Directly explains transformer architectural choices (FlashAttention, MLA) |
| Chernoff for random graphs | Theoretical analysis of dropout, random initialisation, random feature networks | Relevant to the lottery ticket hypothesis, neural architecture search |
| Union bound + covering | Uniform convergence for function classes; foundation of all finite-sample learning theory | Every theoretical ML paper uses this framework |
| Sub-Gaussian gradients | Convergence theory for Adam, SGD; required sample complexity for fine-tuning | Used in LoRA theoretical analysis and RLHF convergence guarantees |
| Bernstein vs Hoeffding | Variance-aware bounds; relevant when gradient variance is small (near optima) | Foundation of variance reduction methods (SVRG, SAG, SAGA) |

15. Conceptual Bridge

Looking back. This section builds on Section04 Expectation and Moments, which established $\mathbb{E}[X]$, $\operatorname{Var}(X)$, and the MGF $M_X(t) = \mathbb{E}[e^{tX}]$. The MGF is the critical bridge: Hoeffding's lemma uses convexity of $e^{tx}$; the Chernoff method optimises over $M_X(s)$; Bernstein's inequality exploits the variance $\mathbb{E}[X^2]$. Jensen's inequality (from Section04) underlies both the proof of Hoeffding's lemma and the information-theoretic interpretation of Chernoff bounds.

Looking forward. The Law of Large Numbers says $\bar{X}_n \to \mu$ almost surely. This is the qualitative counterpart of Hoeffding's quantitative bound: Hoeffding gives the rate at which $\bar{X}_n$ concentrates. The Central Limit Theorem (CLT) says the shape of the distribution of $\bar{X}_n$ converges to Gaussian - explaining why sub-Gaussian bounds are the right model for sums of bounded variables. Both are developed in Section06 Stochastic Processes.

The PAC-statistics connection. PAC learning theory is, at its core, applied concentration theory. The "probably" in PAC is the concentration inequality; the "approximately correct" is the $\varepsilon$ tolerance. The sample complexity $n = O((d + \log(1/\delta))/\varepsilon^2)$ is exactly what concentration inequalities give. Chapter 7 (Statistics) will build on this: confidence intervals are Hoeffding bounds, hypothesis tests are Chernoff bounds, and MLE theory uses Bernstein to prove asymptotic normality.

CONCENTRATION INEQUALITIES IN THE CURRICULUM
========================================================================

  Section04 Expectation and Moments
  +- E[X], Var(X), MGF M_X(t)  -------------------------- input
  +- Jensen's inequality         -------------------------- used in proofs
  +- Markov/Chebyshev preview    -------------------------- previewed there

         v

  Section05 Concentration Inequalities   <--- YOU ARE HERE
  +- Markov / Chebyshev            (moment-based, polynomial tails)
  +- Sub-Gaussian / Hoeffding      (bounded, exponential tails)
  +- Chernoff / Bernstein          (MGF-based, variance-aware)
  +- McDiarmid                     (functions of independent variables)
  +- Union bound + covering        (infinite hypothesis classes)
  +- PAC learning / VC dimension   (generalisation theory)
  +- Rademacher complexity         (data-dependent bounds)

         v                                      v

  Section06 Stochastic Processes         Chapter 7: Statistics
  +- Weak LLN (via Chebyshev)      +- Confidence intervals
  +- Strong LLN                    +- Hypothesis testing
  +- CLT (limit of Hoeffding)      +- MLE consistency (Bernstein)

========================================================================

Appendix A: Summary of Key Bounds

| Inequality | Conditions | Bound | Rate |
|------------|------------|-------|------|
| Markov | $X \geq 0$, $\mathbb{E}[X] = \mu$ | $P(X \geq t) \leq \mu/t$ | $O(1/t)$ |
| Chebyshev | $\operatorname{Var}(X) = \sigma^2$ | $P(\lvert X-\mu\rvert \geq t) \leq \sigma^2/t^2$ | $O(1/t^2)$ |
| Cantelli | $\operatorname{Var}(X) = \sigma^2$ | $P(X-\mu \geq t) \leq \sigma^2/(\sigma^2+t^2)$ | $O(1/t^2)$ |
| Sub-Gaussian | $X$ is $\sigma^2$-sub-Gaussian | $P(X \geq t) \leq e^{-t^2/(2\sigma^2)}$ | $O(e^{-t^2})$ |
| Hoeffding | $X_i \in [a_i,b_i]$ i.i.d. | $P(\bar{X}-\mu \geq t) \leq \exp(-2n^2t^2/\sum(b_i-a_i)^2)$ | $O(e^{-nt^2})$ |
| Chernoff (mult.) | $S \sim \operatorname{Bin}(n,p)$, $\delta \in (0,1)$ | $P(S \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3}$ | $O(e^{-\mu})$ |
| Bernstein | $\lvert X_i\rvert \leq c$, variance $\nu^2$ | $P(\bar{X} \geq t) \leq \exp\!\left(-\frac{nt^2/2}{\nu^2+ct/3}\right)$ | $O(e^{-nt^2/\nu^2})$ |
| McDiarmid | $f$ with $c_i$-bounded differences | $P(f - \mathbb{E}[f] \geq t) \leq e^{-2t^2/\sum c_i^2}$ | $O(e^{-t^2})$ |

Appendix B: Sample Complexity Table

Given: i.i.d. data in $[0,1]$; we want $P(|\bar{X}_n - \mu| \leq \varepsilon) \geq 1-\delta$, i.e. $n \geq \log(2/\delta)/(2\varepsilon^2)$.

| $\varepsilon$ | $\delta$ | Hoeffding $n$ | Normal approx $n$ |
|---------------|----------|---------------|-------------------|
| 0.10 | 0.10 | 150 | 68 |
| 0.05 | 0.10 | 600 | 271 |
| 0.05 | 0.05 | 738 | 384 |
| 0.01 | 0.05 | 18,445 | 9,604 |
| 0.01 | 0.01 | 26,492 | 16,587 |

Hoeffding is distribution-free (works for worst case); Normal approximation assumes CLT holds.


<- Back to Chapter 6: Probability Theory | Next: Stochastic Processes ->

Appendix C: Detailed Proofs and Derivations

C.1 Proof of Hoeffding's Lemma in Full

We prove: if $\mathbb{E}[X] = 0$ and $X \in [a, b]$ a.s., then $\mathbb{E}[e^{tX}] \leq e^{t^2(b-a)^2/8}$.

Step 1: Convexity bound. Since $e^{tx}$ is convex in $x$, for $x \in [a, b]$:

$$e^{tx} \leq \frac{b - x}{b - a} e^{ta} + \frac{x - a}{b - a} e^{tb}$$

Step 2: Take expectations. Let $p = \frac{-a}{b-a} \in [0,1]$ (since $a \leq 0 \leq b$ after centering):

$$\mathbb{E}[e^{tX}] \leq \frac{b - \mathbb{E}[X]}{b-a} e^{ta} + \frac{\mathbb{E}[X] - a}{b-a} e^{tb} = \frac{b}{b-a} e^{ta} + \frac{-a}{b-a} e^{tb} = (1-p) e^{ta} + p e^{tb} = e^{ta}\left[(1-p) + p e^{t(b-a)}\right]$$

Step 3: Exponential bound. Let $h = t(b-a)$ and $\phi(h) = \log[(1-p) + pe^h] - ph$. Then:

$$\mathbb{E}[e^{tX}] \leq e^{ta + ph} \cdot e^{\phi(h)} = e^{ta + p \cdot t(b-a)} \cdot e^{\phi(h)}$$

Note: $ta + p \cdot t(b-a) = ta + t \cdot \frac{-a}{b-a}(b-a) = ta - ta = 0$ (since $p = -a/(b-a)$). So:

$$\mathbb{E}[e^{tX}] \leq e^{\phi(h)}$$

Step 4: Bound $\phi(h)$. We have $\phi(0) = 0$ and $\phi'(0) = 0$. By Taylor's theorem:

$$\phi(h) = \phi(0) + h\phi'(0) + \frac{h^2}{2}\phi''(\xi)$$

for some $\xi \in [0, h]$. Computing:

$$\phi''(h) = \frac{pe^h(1-p)}{((1-p)+pe^h)^2} \leq \frac{1}{4}$$

(since $xy \leq (x+y)^2/4$ for $x, y \geq 0$, applied to $x = 1-p$, $y = pe^h$).

Therefore: $\phi(h) \leq \frac{h^2}{8} = \frac{t^2(b-a)^2}{8}$. $\blacksquare$
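
A quick Monte Carlo check of the lemma - a minimal sketch; the uniform distribution and the constants a, b, t are arbitrary choices:

import numpy as np

rng = np.random.default_rng(3)
a, b, t = -1.0, 2.0, 0.7

# A bounded variable on [a, b], recentred so that E[X] = 0 (as the lemma assumes)
u = rng.uniform(a, b, size=1_000_000)
x = u - u.mean()

mgf_empirical = np.mean(np.exp(t * x))
mgf_bound = np.exp(t**2 * (b - a)**2 / 8)
print(f"E[exp(tX)] ~ {mgf_empirical:.4f}  <=  bound {mgf_bound:.4f}")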

C.2 Proof of McDiarmid's Inequality via Azuma's Inequality

Azuma's Inequality. If $(Z_0, Z_1, \ldots, Z_n)$ is a martingale with $|Z_k - Z_{k-1}| \leq c_k$ a.s., then:

$$P(Z_n - Z_0 \geq t) \leq \exp\!\left(-\frac{2t^2}{\sum_{k=1}^n c_k^2}\right)$$

Doob martingale construction. Given independent $X_1, \ldots, X_n$ and $f$ with bounded differences $c_i$, define:

$$Z_k = \mathbb{E}[f(X_1, \ldots, X_n) \mid X_1, \ldots, X_k]$$

Then:

  • $Z_0 = \mathbb{E}[f]$ (a constant, before observing anything)
  • $Z_n = f(X_1, \ldots, X_n)$ (the function value itself)
  • $(Z_k)$ is a martingale by the tower property

Bounding differences. For each $k$:

$$Z_k - Z_{k-1} = \mathbb{E}[f \mid X_1,\ldots,X_k] - \mathbb{E}[f \mid X_1,\ldots,X_{k-1}]$$

Since changing $X_k$ changes $f$ by at most $c_k$ and $Z_{k-1}$ does not depend on $X_k$, we have $|Z_k - Z_{k-1}| \leq c_k$.

Applying Azuma to this Doob martingale gives McDiarmid's inequality. $\blacksquare$

C.3 Proof of the Finite-Class PAC Bound (Agnostic Case)

Setup. $\mathcal{H}$ is finite with $|\mathcal{H}| = M$. Losses $\ell_h^{(i)} = \ell(h, z^{(i)}) \in [0,1]$. True risk $R(h) = \mathbb{E}[\ell_h]$, empirical risk $\hat{R}(h) = \frac{1}{n}\sum_i \ell_h^{(i)}$.

Goal. Show $P(\sup_h |R(h) - \hat{R}(h)| > \varepsilon) \leq \delta$ when $n \geq \frac{\log(2M/\delta)}{2\varepsilon^2}$.

Step 1. Fix $h$. By Hoeffding: $P(|R(h) - \hat{R}(h)| > \varepsilon) \leq 2e^{-2n\varepsilon^2}$.

Step 2. Union bound over all $M$ hypotheses:

$$P(\exists h: |R(h) - \hat{R}(h)| > \varepsilon) \leq \sum_{h \in \mathcal{H}} 2e^{-2n\varepsilon^2} = 2M e^{-2n\varepsilon^2}$$

Step 3. Set $2Me^{-2n\varepsilon^2} \leq \delta$:

$$n \geq \frac{\log(2M/\delta)}{2\varepsilon^2}$$

ERM generalisation. Under uniform convergence, the ERM solution $\hat{h} = \arg\min_h \hat{R}(h)$ satisfies:

$$R(\hat{h}) \leq \hat{R}(\hat{h}) + \varepsilon \leq \hat{R}(h^*) + \varepsilon \leq R(h^*) + 2\varepsilon$$

where $h^* = \arg\min_{h \in \mathcal{H}} R(h)$ is the best hypothesis within $\mathcal{H}$ (the in-class optimum, not the Bayes-optimal classifier over all functions). $\blacksquare$


Appendix D: VC Theory in Depth

D.1 The Growth Function

Definition. The growth function $\Pi_\mathcal{H}(n)$ counts the maximum number of distinct labellings achievable by $\mathcal{H}$ on any $n$ points:

$$\Pi_\mathcal{H}(n) = \max_{x^{(1)},\ldots,x^{(n)} \in \mathcal{X}} \left|\{(h(x^{(1)}),\ldots,h(x^{(n)})): h \in \mathcal{H}\}\right|$$

If $\mathcal{H}$ can shatter $n$ points, $\Pi_\mathcal{H}(n) = 2^n$. The VC dimension $d = d_{VC}$ is the largest $n$ for which $\Pi_\mathcal{H}(n) = 2^n$.

D.2 Sauer-Shelah Lemma

Lemma. For $d_{VC}(\mathcal{H}) = d$ and $n \geq d$:

$$\Pi_\mathcal{H}(n) \leq \sum_{k=0}^d \binom{n}{k}$$

Proof idea. By induction on $n$ and $d$. Split: either $x^{(n)}$ makes no difference to the labelling count, or it doubles some labellings. Careful counting bounds the total.

Consequence. For $n \geq d$: $\Pi_\mathcal{H}(n) \leq (en/d)^d$. The growth function thus transitions from exponential $2^n$ (while the class can shatter) to polynomial $(en/d)^d$ (once $n$ exceeds the VC dimension).

D.3 VC Dimension Examples

| Hypothesis Class | VC Dimension |
|------------------|--------------|
| Threshold on $\mathbb{R}$ | 1 |
| Intervals on $\mathbb{R}$ | 2 |
| Halfspaces in $\mathbb{R}^d$ | $d+1$ |
| Axis-aligned rectangles in $\mathbb{R}^2$ | 4 |
| Convex polygons in $\mathbb{R}^2$ | $\infty$ |
| Degree-$k$ polynomials in $\mathbb{R}$ | $k+1$ |
| Neural nets with $W$ weights | $O(W \log W)$ |
| Kernel classifiers (universal) | $\infty$ |

Why VC $= d+1$ for halfspaces. In $\mathbb{R}^d$, any $d+1$ affinely independent points can be shattered by halfspaces. But for any $d+2$ points, Radon's theorem says they can be partitioned into two groups whose convex hulls intersect - and no halfspace can separate two groups with intersecting convex hulls. Hence $d_{VC} = d+1$.

D.4 Fundamental Theorem of Statistical Learning

Theorem. For binary classification, the following are equivalent:

  1. $\mathcal{H}$ is PAC-learnable
  2. ERM is a successful learner for $\mathcal{H}$
  3. $\mathcal{H}$ has finite VC dimension
  4. $\mathcal{H}$ has the uniform convergence property

This theorem, proved through Sauer-Shelah and the symmetrisation technique (doubling the sample and introducing Rademacher variables), is the bedrock of statistical learning theory.


Appendix E: Advanced Topics in Concentration

E.1 Talagrand's Inequality

Talagrand's inequality (1995) strengthens McDiarmid for "self-bounding" functions - those where the effect of each coordinate is bounded by the function value itself.

Self-bounding functions. $f: \mathcal{X}^n \to \mathbb{R}_{\geq 0}$ is self-bounding if there exist $f_i: \mathcal{X}^{n-1} \to \mathbb{R}$ such that:

$$0 \leq f(\mathbf{x}) - f_i(\mathbf{x}^{-i}) \leq 1 \quad \text{and} \quad \sum_{i=1}^n (f(\mathbf{x}) - f_i(\mathbf{x}^{-i})) \leq f(\mathbf{x})$$

Talagrand's bound. For self-bounding $f$:

$$P(f \geq \mathbb{E}[f] + t) \leq e^{-t^2/(2(\mathbb{E}[f]+t/3))}$$

This is a variance-adaptive McDiarmid - analogous to Bernstein vs Hoeffding.

Applications: Number of ones in a sum of Bernoullis, size of random matchings, empirical risk when the model fits the data.

E.2 Concentration for Heavy-Tailed Distributions

When distributions are heavy-tailed, sub-Gaussian bounds don't apply. Modern techniques include:

Catoni's estimator. For estimating the mean of a distribution with finite variance $\sigma^2$ (but potentially infinite MGF), Catoni's estimator $\hat{\mu}_\psi$ achieves:

$$P(|\hat{\mu}_\psi - \mu| \geq t) \leq 2\exp\!\left(-\frac{nt^2}{2\sigma^2 + 2t\sigma}\right) \approx e^{-nt^2/(2\sigma^2)}$$

using a truncated exponential influence function. This gives Bernstein-like rates without boundedness.

Median of means. Partition the $n$ samples into $k$ groups of $n/k$. Compute the mean of each group and take the median of these group means. The resulting estimator is sub-Gaussian with parameter proportional to $\sigma^2 k/n$ - achieving sub-Gaussian rates for distributions with only two finite moments.
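
A minimal sketch of the median-of-means estimator on heavy-tailed data (the Pareto tail index and group count are illustrative):

import numpy as np

def median_of_means(x, k=10):
    """Split the sample into k groups, average each group, return the median of the group means."""
    groups = np.array_split(np.asarray(x), k)
    return np.median([g.mean() for g in groups])

rng = np.random.default_rng(4)
# Heavy-tailed data: shifted Lomax/Pareto with tail index 2.1 (finite variance, infinite MGF)
data = rng.pareto(2.1, size=10_000) + 1.0
true_mean = 2.1 / 1.1                        # mean of (Lomax(a) + 1) is a / (a - 1)
print(f"plain mean error:      {abs(data.mean() - true_mean):.3f}")
print(f"median-of-means error: {abs(median_of_means(data, k=20) - true_mean):.3f}")

Across repeated runs, the plain mean occasionally takes large excursions driven by a single extreme sample; the median-of-means estimate stays far more stable.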

For AI: Training data for LLMs often has heavy-tailed length distributions (documents follow power laws). Heavy-tail-robust mean estimation is relevant for unbiased training.

E.3 Matrix Concentration Inequalities

Scalar concentration extends to matrices. For many AI applications, we care about concentration of random matrices.

Matrix Hoeffding. For independent random PSD matrices $X_1, \ldots, X_n \in \mathbb{R}^{d \times d}$ with $0 \preceq X_i \preceq cI$ and $\mathbb{E}[\sum_i X_i] = M$, for $t > 0$:

$$P\!\left(\lambda_{\max}\!\left(\sum_i X_i - M\right) \geq t\right) \leq d \cdot e^{-t^2/(2nc^2)}$$

The extra factor $d$ (the matrix dimension) accounts for the eigenvalue structure.

Application: random features. When approximating a kernel matrix $K_{ij} = k(x^{(i)}, x^{(j)})$ by random features, the approximation error in spectral norm concentrates via matrix Hoeffding, justifying the use of $D = O(d^2/\varepsilon^2)$ random features for an $\varepsilon$-accurate Gram matrix approximation.

Matrix Bernstein. For independent zero-mean random matrices $X_i$ with $\|X_i\|_2 \leq R$ and $\sigma^2 = \|\sum_i \mathbb{E}[X_i^2]\|_2$:

$$P\!\left(\left\|\sum_i X_i\right\|_2 \geq t\right) \leq 2d \cdot \exp\!\left(-\frac{t^2/2}{\sigma^2 + Rt/3}\right)$$

Used in randomised linear algebra (randomised SVD, Nystrom approximation).

E.4 PAC-Bayes Theory

PAC-Bayes bounds combine the PAC framework with Bayesian prior knowledge.

Setup. Given a prior $P$ over $\mathcal{H}$ (chosen before seeing data) and a posterior $Q$ (chosen after seeing data):

PAC-Bayes Theorem (McAllester, 1999). For any prior $P$ and $\delta > 0$, with probability $\geq 1-\delta$:

$$\mathbb{E}_{h \sim Q}[R(h)] \leq \mathbb{E}_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{D_{\mathrm{KL}}(Q \| P) + \log(1/\delta)}{2(n-1)}}$$

Why this is powerful: the KL term $D_{\mathrm{KL}}(Q\|P)$ replaces $\log|\mathcal{H}|$ (which can be infinite). For neural networks, if the posterior concentrates near the prior (small KL), the bound is tight. Flat minima (low sharpness) correspond to posteriors that do not drift far from the prior.

Connection to SAM. Sharpness-Aware Minimisation (SAM, 2021) minimises a PAC-Bayes-inspired upper bound on generalisation error by seeking flat minima - directly operationalising PAC-Bayes in practice.


Appendix F: Worked Examples

F.1 Coin Flip Estimation

Problem. We flip a biased coin $n$ times and observe a fraction $\hat{p}$ of heads. How many flips do we need to estimate $p$ within $\pm 0.05$ with 99% confidence?

Hoeffding solution. With $X_i \in \{0,1\}$, $b-a=1$:

$$n \geq \frac{\log(2/0.01)}{2(0.05)^2} = \frac{\log 200}{0.005} = \frac{5.298}{0.005} \approx 1060$$

Normal approximation. $n \geq (2.576/0.05)^2 \cdot 0.25 \approx 664$. Less conservative because it relies on the Gaussian CLT approximation (valid for large $n$).

Interpretation. About 1000-1100 flips are needed for reliable probability estimation. This is why A/B tests often require thousands of users: smaller samples give confidence intervals too wide to detect meaningful differences.

F.2 Multi-Armed Bandit via Chernoff

Problem. A recommendation system has $K = 100$ arms (content types). Each arm $a$ has an unknown click-through rate $p_a$. We want to identify the best arm within $\varepsilon = 0.01$ with $\delta = 0.01$.

Naive approach. Pull each arm $n_0$ times. By Hoeffding + union bound:

$$P(\exists a: |\hat{p}_a - p_a| > \varepsilon) \leq 2K e^{-2n_0\varepsilon^2}$$

Setting this $\leq \delta$: $n_0 \geq \frac{\log(2K/\delta)}{2\varepsilon^2} = \frac{\log(20{,}000)}{0.0002} \approx 49{,}500$. Total pulls: about $4{,}950{,}000$.

Adaptive strategy (UCB). Upper Confidence Bound algorithms concentrate exploration on promising arms, achieving total pulls $O(K\log n/\varepsilon^2)$ - much smaller in practice.
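
A minimal UCB1 sketch (10 illustrative arms rather than the 100 in the worked example; the horizon is arbitrary):

import numpy as np

def ucb1(click_probs, horizon=50_000, seed=0):
    """Minimal UCB1: pull the arm maximising empirical mean + sqrt(2 ln t / n_a)."""
    rng = np.random.default_rng(seed)
    K = len(click_probs)
    counts, sums = np.zeros(K), np.zeros(K)
    for t in range(1, horizon + 1):
        if t <= K:                                   # pull every arm once first
            a = t - 1
        else:
            a = int(np.argmax(sums / counts + np.sqrt(2 * np.log(t) / counts)))
        sums[a] += rng.random() < click_probs[a]     # Bernoulli click
        counts[a] += 1
    return counts

probs = np.linspace(0.01, 0.10, 10)
print("pulls per arm:", ucb1(probs).astype(int))     # most pulls concentrate on the best arm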

F.3 Neural Network Generalisation: PAC-Bayes in Practice

Setup. A 2-layer ReLU network, 100 hidden units, trained on 50,000 MNIST examples. Training error 0.3%, test error 1.2%. Standard VC bound: vacuous (parameter count $\gg n$).

PAC-Bayes approach. Use a Gaussian prior $P = \mathcal{N}(0, \sigma_0^2 I)$ and posterior $Q = \mathcal{N}(\hat{\theta}, \sigma^2 I)$ centred at the trained weights $\hat{\theta}$. Compute:

$$D_{\mathrm{KL}}(Q\|P) = \frac{\|\hat{\theta}\|_2^2}{2\sigma_0^2} + d\log(\sigma_0/\sigma) + \frac{d\sigma^2}{2\sigma_0^2} - \frac{d}{2}$$

With careful tuning of $\sigma$, this can give non-vacuous bounds (< 50% error guarantee) even for deep networks - unlike VC theory.
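
A minimal sketch of the bound's two ingredients - the Gaussian KL term and the resulting complexity term - with synthetic "trained" weights standing in for a real network (all values below are illustrative):

import numpy as np

def kl_gaussian_iso(theta_hat, sigma, sigma0):
    """KL( N(theta_hat, sigma^2 I) || N(0, sigma0^2 I) ) for isotropic Gaussians."""
    d = theta_hat.size
    return (np.sum(theta_hat**2) / (2 * sigma0**2)
            + d * np.log(sigma0 / sigma)
            + d * sigma**2 / (2 * sigma0**2)
            - d / 2)

def pac_bayes_complexity(kl, n, delta=0.05):
    """McAllester-style complexity term added to the empirical risk."""
    return np.sqrt((kl + np.log(1 / delta)) / (2 * (n - 1)))

rng = np.random.default_rng(5)
theta_hat = 0.02 * rng.normal(size=100_000)          # illustrative weight vector
kl = kl_gaussian_iso(theta_hat, sigma=0.045, sigma0=0.05)
print(f"KL ~ {kl:.0f}   complexity term: {pac_bayes_complexity(kl, n=50_000):.3f}")

With these arbitrary values the complexity term comes out around 0.3, illustrating how a posterior close to the prior keeps the bound non-vacuous; a posterior that drifts far away blows it up.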

F.4 Random Projection (Johnson-Lindenstrauss)

Problem. We have $n = 10000$ points in $\mathbb{R}^{1000}$. We want to project to $\mathbb{R}^m$ while preserving all pairwise distances within a factor $(1 \pm \varepsilon)$ for $\varepsilon = 0.1$.

JL Lemma. For $m = O(\log n / \varepsilon^2)$ random projections (from $\mathcal{N}(0, I/m)$):

$$m \geq \frac{4 + 2\log(n^2)}{\varepsilon^2/2 - \varepsilon^3/3} \approx \frac{8\log n}{\varepsilon^2} = \frac{8 \ln(10000)}{0.01} \approx 7400$$

Note that $m$ depends only on $n$, not on the ambient dimension. Here $7400 > 1000$, so at $\varepsilon = 0.1$ the projection gives no compression; the lemma pays off either when the ambient dimension is much larger (e.g. $10^5$--$10^6$, as for sparse text features) or with a coarser tolerance - at $\varepsilon = 0.3$ the same formula gives $m \approx 800 < 1000$.

Proof sketch. A random Gaussian matrix $R \in \mathbb{R}^{m \times d}$ satisfies, for any fixed pair of points $x, y$:

$$P\!\left(\left|\|Rx - Ry\|_2^2/m - \|x-y\|_2^2\right| > \varepsilon\|x-y\|_2^2\right) \leq 2e^{-m\varepsilon^2/8}$$

by chi-squared concentration. A union bound over the $\binom{n}{2}$ pairs gives the JL lemma.

For AI: Locality-sensitive hashing, approximate nearest-neighbour search (used in RAG retrieval), and low-rank adaptation (LoRA) all rely on Johnson-Lindenstrauss-type arguments to justify dimensionality reduction.
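
A minimal sketch of the distortion check (a smaller point set than in the worked example so the all-pairs comparison stays cheap; $\varepsilon = 0.3$ so that $m < d$):

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(6)
n, d, m, eps = 500, 1000, 800, 0.3

X = rng.normal(size=(n, d))
R = rng.normal(size=(m, d)) / np.sqrt(m)     # projection rows ~ N(0, I/m)
Y = X @ R.T                                  # projected points in R^m

ratios = pdist(Y, "sqeuclidean") / pdist(X, "sqeuclidean")
print(f"squared-distance distortion over all pairs: [{ratios.min():.3f}, {ratios.max():.3f}]"
      f"  (target: [{1 - eps:.1f}, {1 + eps:.1f}])")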


Appendix G: Computing Concentration Bounds in Practice

G.1 Choosing the Right Bound

DECISION TREE: WHICH BOUND TO USE
========================================================================

  Do you know E[X]? --- No --> Can't bound tail
       |
       Yes
       |
  Is X >= 0? --- Yes --> Markov: P(X >= t) <= E[X]/t
       |
       No
       |
  Do you know Var(X)? --- Yes --> Chebyshev: P(|X - mu| >= t) <= sigma^2/t^2
       |
       No
       |
  Is X bounded in [a,b]? --- Yes --> Hoeffding (exponential bound)
       |                             or Bernstein if Var(X) known
       |
       No
       |
  Is MGF finite? --- Yes --> Chernoff method (optimise over s)
       |
       No
       |
  Distribution-free needed --> Student-t CI or bootstrap

========================================================================

G.2 Python Recipe for Hoeffding CIs

import numpy as np

def hoeffding_ci(x, delta=0.05, lo=0.0, hi=1.0):
    """Hoeffding confidence interval for bounded data in [lo, hi]."""
    n = len(x)
    x_bar = np.mean(x)
    radius = (hi - lo) * np.sqrt(np.log(2 / delta) / (2 * n))
    return x_bar - radius, x_bar + radius

def hoeffding_sample_size(epsilon, delta, lo=0.0, hi=1.0):
    """Required n for epsilon-accuracy at 1-delta confidence."""
    return int(np.ceil((hi - lo)**2 * np.log(2 / delta) / (2 * epsilon**2)))

G.3 Interpreting Generalisation Gaps

When a model achieves training error 1% and test error 5%:

  • Generalisation gap = 4%
  • Is this within bounds? With $n = 10000$, $\delta = 0.05$: Hoeffding gives $\sqrt{\log(40)/20000} \approx 1.4\%$ per hypothesis
  • For $|\mathcal{H}| = 10^6$ hypotheses: add $\sqrt{\log(10^6)/20000} \approx 2.6\%$
  • Total bound: $\approx 4\%$ - roughly matching the observed gap

In practice, neural networks occupy a tiny corner of their hypothesis class (due to inductive bias of SGD), so effective complexity is much smaller than the VC dimension suggests.


Appendix H: Self-Assessment Checklist

Before moving to Section06 (Stochastic Processes), verify you can:

  • State Markov's inequality from memory and prove it with the indicator trick
  • Derive Chebyshev from Markov by applying Markov to $(X-\mu)^2$
  • Prove Hoeffding's lemma using convexity of $e^{tx}$ and a Taylor expansion
  • Apply the Chernoff method: Markov + MGF bound + optimise over $s$
  • State McDiarmid's inequality and identify the $c_i$ for a given function
  • Compute the sample complexity for given $(\varepsilon, \delta)$ using Hoeffding
  • Define VC dimension and compute it for halfspaces ($= d+1$ in $\mathbb{R}^d$)
  • Write the PAC generalisation bound with $|\mathcal{H}|$ or $d_{VC}$
  • Define Rademacher complexity and explain its interpretation
  • Identify when Bernstein beats Hoeffding (the small-variance case)
  • Give a non-example for sub-Gaussianity (a heavy-tailed distribution)
  • Explain why VC bounds are vacuous for LLMs and what PAC-Bayes offers

Appendix I: Connections to Information Theory

I.1 Kullback-Leibler Divergence and Exponential Families

The Chernoff method has a deep information-theoretic interpretation. The optimal exponent in the Chernoff bound equals the KL divergence under the tilted distribution.

For $X_1, \ldots, X_n$ i.i.d. with distribution $P$, and a target level $a > \mathbb{E}[X]$:

$$\lim_{n \to \infty} \frac{1}{n}\log P(\bar{X}_n \geq a) = -D_{\mathrm{KL}}(P_a \| P)$$

where $P_a$ is the exponential tilt of $P$ that puts its mean at $a$: $P_a(x) \propto e^{s^* x} P(x)$ with $\mathbb{E}_{P_a}[X] = a$.

This is Cramer's theorem - the foundation of large deviations theory. The probability of a rare event decays exponentially at rate given by the KL divergence to the closest distribution that makes the event typical.

For AI: This connection explains why cross-entropy loss (KL divergence from data to model) is the right loss for language models: minimising cross-entropy = maximising the probability of the training data = minimising the Chernoff exponent for generalisation error.

I.2 Entropy and Sauer-Shelah

The growth function $\Pi_\mathcal{H}(n)$ counts the number of distinct behaviours of $\mathcal{H}$ on $n$ points. The log growth rate

$$h(\mathcal{H}) = \lim_{n \to \infty} \frac{\log_2 \Pi_\mathcal{H}(n)}{n}$$

is the combinatorial entropy of $\mathcal{H}$. Sauer-Shelah shows:

  • If $d_{VC} < \infty$: $h(\mathcal{H}) = 0$ (subexponential growth, polynomial in $n$)
  • If $d_{VC} = \infty$: $h(\mathcal{H}) = 1$ (full exponential growth $2^n$)

This binary distinction - polynomial vs exponential growth - is precisely what separates PAC-learnable from non-learnable hypothesis classes.


Appendix J: Connections to Optimisation

J.1 Convergence of SGD via Hoeffding

The standard convergence analysis of SGD for convex functions uses:

  1. Descent lemma: $\mathcal{L}(\theta_{t+1}) \leq \mathcal{L}(\theta_t) - \eta g_t^\top(\theta_t - \theta^*) + \frac{\eta^2 L}{2}\|g_t\|_2^2$

  2. Gradient concentration: $\|\hat{g}_t - g_t\|_2 \leq \varepsilon$ with high probability (Hoeffding)

  3. Telescoping: sum over $T$ steps and apply the Hoeffding bound with a union bound over all $T$ steps (adding $\log T$ to the sample complexity)

The result: after $T = O(G^2/(\varepsilon^2\eta^2))$ steps, SGD achieves $\mathcal{L}(\theta_T) - \mathcal{L}(\theta^*) \leq \varepsilon$ with high probability.

J.2 Generalisation and Algorithmic Stability

Definition (Uniform Stability). Algorithm $\mathcal{A}$ is $\beta$-uniformly stable if for any two datasets $S$, $S'$ differing in one example:

$$\sup_z |\ell(\mathcal{A}(S), z) - \ell(\mathcal{A}(S'), z)| \leq \beta$$

Theorem (Bousquet-Elisseeff, 2002). If $\mathcal{A}$ is $\beta$-stable with $\beta \leq c/n$ for some constant $c$, and losses are bounded in $[0, 1]$, then with probability $\geq 1-\delta$:

$$R(\mathcal{A}(S)) \leq \hat{R}(\mathcal{A}(S)) + 2\beta + (4n\beta + 1)\sqrt{\frac{\log(1/\delta)}{2n}}$$

SGD is stable. For $L$-smooth convex losses, SGD with step size $\eta$ run for $T$ steps is $2\eta TL/n$-uniformly stable. Setting $\eta = O(1/\sqrt{T})$ gives stability $O(\sqrt{T}/n)$ - small when $T \ll n^2$.

This explains why early stopping helps: more SGD steps = less stable = worse generalisation.


Appendix K: Concentration Phenomena in Transformer Architectures

K.1 Attention Score Concentration

In scaled dot-product attention, the query-key inner products $Q_i K_j^\top / \sqrt{d_k}$ determine the attention weights. If the normalised queries and keys have i.i.d. $\mathcal{N}(0, 1/d_k)$ entries, the raw inner products satisfy:

$$P(|Q_i K_j^\top| \geq t) \leq 2e^{-d_k t^2/2}$$

Without normalisation - i.e. with unit-variance entries and no $1/\sqrt{d_k}$ scaling - the softmax input $Q_i K_j^\top$ has standard deviation $O(\sqrt{d_k})$, pushing softmax into its saturating regime (near-uniform or near-one-hot). The scaling brings the standard deviation back to $O(1)$, keeping gradients flowing.

Formal justification. With unit-variance entries, $Q_i K_j^\top = \sum_{l=1}^{d_k} Q_{il} K_{jl}$ where each summand has variance 1, so the sum has variance $d_k$ and standard deviation $\sqrt{d_k}$. After scaling, $Q_i K_j^\top / \sqrt{d_k}$ has standard deviation $\approx 1$ - the range where softmax is neither saturated nor flat.
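
A small sketch of the effect of the scaling on randomly initialised attention logits (dimensions are illustrative):

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(7)
d_k, n_tokens = 64, 128
q = rng.normal(size=d_k)                      # unit-variance query
K = rng.normal(size=(n_tokens, d_k))          # unit-variance keys

logits_raw = K @ q                            # standard deviation ~ sqrt(d_k) ~ 8
logits_scaled = logits_raw / np.sqrt(d_k)     # standard deviation ~ 1

for name, logits in [("unscaled", logits_raw), ("scaled", logits_scaled)]:
    p = softmax(logits)
    print(f"{name:9s} logit std {logits.std():5.2f}   max attention weight {p.max():.3f}")

The unscaled logits put most of the attention mass on a single token (saturation), while the scaled logits spread it out - which is what keeps the softmax gradient alive.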

K.2 Concentration in Residual Networks

Deep residual networks $x_{l+1} = x_l + f_l(x_l)$ accumulate residual perturbations. If each residual block has bounded output $\|f_l(x)\|_2 \leq c$ and the residual updates behave like roughly mean-zero bounded increments, an Azuma-type bound applied to the sequence $(x_l)$ gives:

$$P(\|x_L - x_0\|_2 \geq t) \leq \exp\!\left(-\frac{t^2}{2Lc^2}\right)$$

This says: even with $L = 100$ layers, if each residual block adds a small perturbation ($c \ll 1$), the total drift is bounded. LayerNorm and weight decay enforce exactly this: they keep $\|f_l(x)\|_2$ small, ensuring the representation doesn't drift exponentially with depth.

K.3 LoRA and Low-Rank Concentration

LoRA (Low-Rank Adaptation) parameterises weight updates as $\Delta W = AB$ where $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times k}$ with small rank $r$. The resulting hypothesis class has reduced complexity.

Rademacher complexity of LoRA. For the function class $\{x \mapsto W_0 x + AB x : \|A\|_F \leq s_A, \|B\|_F \leq s_B\}$:

$$\mathfrak{R}_n(\mathcal{F}_{\text{LoRA}}) \leq \frac{s_A s_B \sqrt{r} \cdot C}{\sqrt{n}}$$

where $C = \max_i \|x^{(i)}\|_2$. The key point: the Rademacher complexity scales as $\sqrt{r}$, not $\sqrt{dk}$ (full fine-tuning). With $r \ll \min(d,k)$, LoRA has a provably smaller generalisation gap than full fine-tuning, which helps explain why it doesn't overfit despite being used on small datasets.


Appendix L: Practical Guidelines for ML Practitioners

L.1 When to Report Confidence Intervals

Always report when:

  • Sample size $n < 10000$ (the Hoeffding CI is meaningful at this scale)
  • Comparing two systems (the CI overlap tells you whether the difference is significant)
  • Benchmarking LLMs on standard evaluations

Minimum information to report: point estimate, sample size, CI method (Hoeffding vs Normal approximation), confidence level.

L.2 Common Benchmarking Mistakes

| Mistake | Statistical Problem | Fix |
|---------|---------------------|-----|
| "Model A (73.2%) > Model B (72.8%) on 1000 questions" | The 95% CI is ±4.3% - the difference is insignificant | Need $n \geq 18{,}445$ for a ±1% CI |
| "Best of 10 model variants is significant" | Multiple comparisons: $10 \times 0.05 = 0.5$ chance of at least one false positive | Bonferroni: run each test at $0.05/10 = 0.005$ |
| "Averaging over 5 datasets gives a reliable comparison" | Each dataset is a separate test | Report a CI for each; meta-analysis requires care |
| "Our method wins on 8/10 tasks" | Sign test: you need $\geq 9/10$ wins for $p < 0.05$ | Use a paired t-test or Wilcoxon signed-rank test |
| "The 1000-epoch training curve shows improvement" | A training curve is not test performance | Report test error at convergence, with a CI |

L.3 Required Sample Sizes (Reference Table)

For binary outcomes (accuracy), using Hoeffding with 95% confidence ($\delta = 0.05$):

| Desired precision ($\varepsilon$) | Required $n$ |
|-----------------------------------|--------------|
| ±10% (0.10) | 185 |
| ±5% (0.05) | 738 |
| ±3% (0.03) | 2,050 |
| ±2% (0.02) | 4,612 |
| ±1% (0.01) | 18,445 |
| ±0.5% (0.005) | 73,779 |

Note: these are conservative (distribution-free). The Normal approximation gives roughly half as many, valid when $n\hat{p}(1-\hat{p}) \geq 5$.

L.4 Interpreting Generalisation Bounds for Deep Learning

Current theoretical bounds are often loose for deep networks. Practitioners should interpret them as:

  1. Order-of-magnitude guidance: a bound of $O(\sqrt{d/n})$ correctly predicts that more data helps, but may overestimate the absolute gap by 10-100x
  2. Relative comparisons: PAC-Bayes bounds can rank models by generalisation risk even when the absolute values are imprecise
  3. Design principles: results like "norm-constrained models generalise better" translate directly into regularisation practice (weight decay, gradient clipping, dropout)
  4. Not absolute guarantees: a theoretical bound of 40% error doesn't mean the model has 40% test error - it means the theory alone cannot rule out 40% test error

The theory is most useful as a framework for thinking, not as a numerical oracle.


Appendix M: References and Further Reading

Primary Sources

  1. Boucheron, Lugosi & Massart (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press. - The definitive modern treatment.

  2. Hoeffding, W. (1963). Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 58(301), 13-30.

  3. McDiarmid, C. (1989). On the Method of Bounded Differences. Surveys in Combinatorics, 141, 148-188.

  4. Vapnik, V. & Chervonenkis, A. (1971). On the Uniform Convergence of Relative Frequencies to Their Probabilities. Theory of Probability and Its Applications, 16(2), 264-280.

  5. Valiant, L. (1984). A Theory of the Learnable. Communications of the ACM, 27(11), 1134-1142.

Learning Theory Textbooks

  1. Shalev-Shwartz & Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. - Best introduction to PAC learning for ML practitioners.

  2. Mohri, Rostamizadeh & Talwalkar (2018). Foundations of Machine Learning (2nd ed.). MIT Press. - Covers Rademacher complexity and generalisation theory in depth.

  3. Wainwright, M. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. - Advanced treatment, covers matrix concentration and modern techniques.

Deep Learning Theory

  1. Zhang, Bengio et al. (2017). Understanding Deep Learning Requires Rethinking Generalization. ICLR 2017. - Showed memorisation capacity, challenging classical VC theory.

  2. Dziugaite & Roy (2017). Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many Parameters. UAI 2017. - First non-vacuous PAC-Bayes bounds for deep nets.

  3. Belkin et al. (2019). Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off. PNAS 2019. - Double descent, overparameterisation.

  4. Foret et al. (2021). Sharpness-Aware Minimization for Efficiently Improving Generalization. ICLR 2021. - SAM as PAC-Bayes operationalisation.


Appendix N: Extended Worked Problems

N.1 Bounding the Maximum of Random Variables

Problem. Let $X_1, \ldots, X_n \sim \mathcal{N}(0,1)$ i.i.d. What is $\mathbb{E}[\max_i X_i]$ approximately?

Answer. Each $X_i$ is 1-sub-Gaussian. By the union bound:

$$P(\max_i X_i \geq t) \leq n \cdot P(X_1 \geq t) \leq n e^{-t^2/2}$$

Setting $ne^{-t^2/2} = 1$ gives $t^* = \sqrt{2\log n}$, so $\max_i X_i = O(\sqrt{\log n})$ with high probability.

More precisely, $\mathbb{E}[\max_i X_i] \approx \sqrt{2\log n}\,(1 - O(1/\log n))$. For $n = 1000$: $\sqrt{2 \cdot 6.9} \approx 3.7$, and the lower-order correction brings the expectation down to roughly 3.2-3.3, which Monte Carlo confirms.
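
A minimal Monte Carlo check (sample counts are illustrative):

import numpy as np

rng = np.random.default_rng(8)
n, trials = 1000, 5_000
maxima = rng.normal(size=(trials, n)).max(axis=1)
# The expected maximum of n standard normals sits a little below sqrt(2 ln n)
print(f"E[max_i X_i] ~ {maxima.mean():.3f}   sqrt(2 ln n) = {np.sqrt(2 * np.log(n)):.3f}")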

For AI: The maximum attention logit in a head with $n$ tokens and $d_k$-dimensional keys satisfies $\max_j Q K_j^\top \approx \sqrt{2d_k \log n}$. The softmax of this concentrates on one or a few tokens for large $d_k$ - explaining why attention becomes "sharp" (spiky) in deeper layers where $d_k$ is large.

N.2 Symmetric Rounding

Problem. An algorithm rounds each of $n = 100$ numbers independently up or down by at most 0.5. What is the probability that the total rounding error exceeds 10?

Solution. Each error $E_i \in [-0.5, 0.5]$ with $\mathbb{E}[E_i] = 0$, so each summand has range $b_i - a_i = 1$. For the sum $S = \sum_i E_i$, Hoeffding gives:

$$P(S \geq 10) \leq \exp\!\left(-\frac{2 t^2}{\sum_i (b_i - a_i)^2}\right) = \exp\!\left(-\frac{2 \cdot 10^2}{100 \cdot 1^2}\right) = e^{-2} \approx 0.135$$

The two-sided version: $P(|S| \geq 10) \leq 2e^{-2} \approx 0.27$.

N.3 PAC Bound for a Neural Network Classifier

Problem. A student trains a 3-layer neural network on $n = 5000$ examples and achieves 2% training error. Using the finite-class PAC bound with $|\mathcal{H}| = 2^{10^6}$ (very conservatively, assuming 1 bit per parameter for $10^6$ parameters), what does the bound say about test error?

Solution. The PAC bound (agnostic, finite class):

$$R(\hat{h}) \leq \hat{R}(\hat{h}) + \sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2n}}$$

With $\delta = 0.05$ and $|\mathcal{H}| = 2^{10^6}$:

$$\sqrt{\frac{\log(2 \cdot 2^{10^6}/0.05)}{10000}} = \sqrt{\frac{10^6 \ln 2 + \ln 40}{10000}} \approx \sqrt{\frac{693{,}150}{10000}} \approx 8.3$$

Test error bound: $0.02 + 8.3 \approx 8.3$ - a completely vacuous bound (it exceeds 100%). This illustrates why VC/parameter-counting bounds are useless for deep networks. PAC-Bayes or Rademacher complexity (using $\|\mathbf{w}\|_F$ rather than parameter count) gives non-vacuous results.

N.4 Proving Generalisation for 1-Nearest Neighbour

Problem. For $k$-nearest neighbour classification with $k = 1$: (a) Show that the leave-one-out error is an unbiased estimate of the true error. (b) Use McDiarmid to bound the deviation of the leave-one-out estimate from its expectation.

Solution. (a) The leave-one-out (LOO) error is $\hat{R}_{\text{LOO}} = \frac{1}{n}\sum_i \mathbf{1}[\hat{y}_{-i} \neq y^{(i)}]$, where $\hat{y}_{-i}$ is the prediction on $x^{(i)}$ using all other training points. By symmetry of the i.i.d. draw, $\mathbb{E}[\hat{R}_{\text{LOO}}] = R_{n-1}$ (the true error of 1-NN trained on $n-1$ examples).

(b) Swapping one training example changes $\hat{R}_{\text{LOO}}$ by at most $c/n$ for a constant $c$ (in fixed dimension, a point can be the nearest neighbour of at most a dimension-dependent constant number of other points; we take $c = 2$ in the idealised case where only 2 LOO predictions are affected). By McDiarmid with $c_i = 2/n$:

$$P(|\hat{R}_{\text{LOO}} - \mathbb{E}[\hat{R}_{\text{LOO}}]| \geq t) \leq 2\exp\!\left(-\frac{2t^2}{\sum_i (2/n)^2}\right) = 2\exp\!\left(-\frac{nt^2}{2}\right)$$

N.5 Confidence Region for Gradient Descent Convergence

Problem. SGD is run for $T$ iterations with step size $\eta = 1/\sqrt{T}$. Each gradient estimate is sub-Gaussian with parameter $\sigma^2$. With what probability does SGD achieve $\mathcal{L}(\theta_T) - \mathcal{L}^* \leq \varepsilon$?

Solution. The standard SGD analysis shows $\mathbb{E}[\mathcal{L}(\theta_T) - \mathcal{L}^*] \leq \frac{R^2 + \sigma^2}{\sqrt{T}}$, where $R = \|\theta_0 - \theta^*\|_2$.

For the high-probability version: at each step $t$, $P(\|g_t - \hat{g}_t\|_2 \geq u) \leq 2e^{-mu^2/(2\sigma^2)}$ by sub-Gaussianity (batch size $m$). Taking a union bound over all $T$ steps and using an Azuma-type argument for the algorithm trajectory:

$$P(\mathcal{L}(\theta_T) - \mathcal{L}^* \geq \varepsilon) \leq \delta$$

when $T = O\left(\frac{R^2 \log(1/\delta) + \sigma^2 \log(T/\delta)}{\varepsilon^2}\right)$.

