
Concentration Inequalities, Part 3: Appendix O (Visualisation Notes) to Supplementary (Concentration in High Dimensions)

Appendix O: Visualisation Notes

O.1 Plotting Concentration Inequalities

When visualising tail bounds, always show:

  1. The exact tail probability (computed numerically) as the ground truth
  2. The bound as a curve above the exact probability
  3. The ratio (bound/exact) to show how loose the bound is
TAIL BOUND TIGHTNESS - Gaussian(0,1)
========================================================================

  t   | P(X >= t) exact | Markov* | Chebyshev | Hoeffding | Bound/Exact
  --------------------------------------------------------------------
  1   | 0.159  | N/A    | 1.000    | 0.607     | 6.3x / 3.8x
  2   | 0.023  | N/A    | 0.250    | 0.135     | 11x / 5.9x
  3   | 0.0013 | N/A    | 0.111    | 0.011     | 85x / 8.5x
  4   | 3.2e-5 | N/A    | 0.0625   | 3.4e-4    | 1953x / 11x
  5   | 2.9e-7 | N/A    | 0.040    | 3.7e-6    | 138,000x / 13x

  *Markov requires X >= 0; not directly applicable to the symmetric Gaussian

========================================================================

At t = 5\sigma: Chebyshev overestimates by roughly 138,000x, while the sub-Gaussian bound is off by only about 13x. Hoeffding for the Gaussian is essentially tight in the exponent - the Gaussian is the sub-Gaussian equality case, and the remaining gap is just the polynomial prefactor in the exact tail.
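A minimal sketch of how such a table can be generated (assuming NumPy and SciPy are available; norm.sf gives the exact Gaussian tail):

import numpy as np
from scipy.stats import norm

# Exact standard-Gaussian tail vs Chebyshev and sub-Gaussian (Hoeffding-style) bounds.
for t in [1, 2, 3, 4, 5]:
    exact = norm.sf(t)                 # ground truth P(X >= t)
    chebyshev = 1.0 / t**2             # Var(X) / t^2 with Var(X) = 1
    sub_gaussian = np.exp(-t**2 / 2)   # e^{-t^2/(2 sigma^2)} with sigma^2 = 1
    print(f"t={t}  exact={exact:.2e}  Chebyshev={chebyshev:.3f} "
          f"({chebyshev / exact:,.0f}x)  sub-Gaussian={sub_gaussian:.2e} "
          f"({sub_gaussian / exact:.1f}x)")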

O.2 Sample Complexity Curves

The sample complexity n(\varepsilon, \delta) = \frac{\log(2/\delta)}{2\varepsilon^2} has a characteristic shape:

  • O(1/\varepsilon^2): doubling precision requires 4x more data
  • O(\log(1/\delta)): going from 95% to 99.9% confidence multiplies the sample size by only \log(2000)/\log(40) \approx 2.1 - logarithmically cheap

This 1/\varepsilon^2 scaling is fundamental: it appears in:

  • Hoeffding (statistical estimation)
  • Monte Carlo (variance \sigma^2, error \sigma/\sqrt{n})
  • PAC learning (generalisation gap)
  • Stochastic optimisation (SGD convergence rate)

All are manifestations of the same underlying concentration phenomenon.
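A small helper (a sketch, assuming NumPy) that makes the two scalings above concrete:

import numpy as np

def hoeffding_sample_size(eps, delta):
    """Samples needed so that |mean estimate - true mean| <= eps with probability
    >= 1 - delta, for a [0,1]-bounded variable (Hoeffding)."""
    return int(np.ceil(np.log(2 / delta) / (2 * eps**2)))

# Halving eps costs 4x more samples; 95% -> 99.9% confidence costs only ~2x.
print(hoeffding_sample_size(0.02, 0.05), hoeffding_sample_size(0.01, 0.05))
print(hoeffding_sample_size(0.01, 0.05), hoeffding_sample_size(0.01, 0.001))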


Appendix P: Glossary of Key Terms

Term | Definition | Section
---- | ---------- | -------
Concentration inequality | A bound on P(X - \mathbb{E}[X] \geq t) or P(\lvert X - \mathbb{E}[X]\rvert \geq t) | Section 1
Sub-Gaussian random variable | X with \mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} for all t | Section 3.1
Sub-Gaussian parameter | \sigma^2: the proxy variance in the sub-Gaussian definition | Section 3.1
Rademacher variable | \sigma \in \{-1, +1\} with equal probability | Section 3.2
Hoeffding's lemma | Bounded RV X \in [a,b] with zero mean \Rightarrow \mathbb{E}[e^{tX}] \leq e^{t^2(b-a)^2/8} | Section 4.1
Chernoff method | Tail bound via P(X \geq t) \leq e^{-st}\mathbb{E}[e^{sX}], optimised over s | Section 5.1
Bounded differences | \lvert f(\cdots x_i \cdots) - f(\cdots x_i' \cdots)\rvert \leq c_i | Section 7
Union bound | P(\bigcup A_i) \leq \sum P(A_i) | Section 8.1
Covering number | Smallest \varepsilon-net size for a set under a metric | Section 8.3
Shattering | \mathcal{H} shatters C if it can realise all 2^{\lvert C\rvert} labellings of C | Section 9
VC dimension | Size of largest shattered set | Section 9.4
Growth function | \Pi_\mathcal{H}(n): max labellings of n points by \mathcal{H} | App D.1
PAC learning | Probably approximately correct: R(\hat{h}) \leq \min_h R(h) + \varepsilon w.p. \geq 1-\delta | Section 9.1
Uniform convergence | \sup_h \lvert \hat{R}(h) - R(h)\rvert \leq \varepsilon over the whole class | Section 9
Rademacher complexity | \mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_\sigma[\sup_f \frac{1}{n}\sum \sigma_i f(x^{(i)})] | Section 10.1
PAC-Bayes | Generalisation bound using D_{\mathrm{KL}}(\text{posterior}\,\|\,\text{prior}) | App E.4
Algorithmic stability | Output insensitive to swapping one training example | App J.2
Exponential tilt | Twisted distribution P_s(x) \propto e^{sx}P(x) used in Chernoff | App I.1

Appendix Q: The Method of Types and Large Deviations

Q.1 Types and Empirical Distributions

The method of types is an elegant combinatorial approach to concentration that complements the analytical MGF-based approach.

Definition. For a sequence x^n = (x_1, \ldots, x_n) \in \mathcal{X}^n, the type (empirical distribution) is:

P_{x^n}(a) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[x_i = a], \quad a \in \mathcal{X}

The set of all sequences of type Q is the type class T_n(Q).

Key counting result. For a finite alphabet |\mathcal{X}| = k:

|T_n(Q)| = \binom{n}{nQ(x_1), \ldots, nQ(x_k)} \approx e^{nH(Q)}

where H(Q) = -\sum_{x} Q(x)\log Q(x) is the entropy of Q. This means there are approximately e^{nH(Q)} sequences with empirical distribution Q.
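A quick numeric check of this count (a sketch using only the standard library; the gap between the two numbers is the polynomial prefactor hidden by the \approx):

from math import comb, exp, log

def type_class_size(counts):
    """Exact number of sequences with the given letter counts (multinomial coefficient)."""
    remaining, size = sum(counts), 1
    for c in counts:
        size *= comb(remaining, c)
        remaining -= c
    return size

def entropy_nats(counts):
    n = sum(counts)
    return -sum((c / n) * log(c / n) for c in counts if c > 0)

counts = [20, 40]                      # binary alphabet, n = 60, Q = (1/3, 2/3)
n = sum(counts)
print(type_class_size(counts))         # exact |T_n(Q)|
print(exp(n * entropy_nats(counts)))   # e^{n H(Q)}, same exponential order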

Q.2 Connection to Chernoff Bounds

For i.i.d. X_i \sim P and empirical distribution P_{X^n}:

P(P_{X^n} = Q) \leq e^{-n D_{\mathrm{KL}}(Q \| P)}

This is the method of types bound. Applying it to the event that the empirical mean is at least a, and taking a union bound over the (at most polynomially many) types:

P(\bar{X}_n \geq a) \leq (n+1)^{|\mathcal{X}|} e^{-n D_{\mathrm{KL}}(Q^* \| P)}

where Q^* is the type closest to P with mean \geq a. For large n, the polynomial factor (n+1)^{|\mathcal{X}|} is negligible, and this matches the Chernoff bound with the KL divergence interpretation.

Q.3 Cramer's Theorem

Theorem (Cramer, 1938). For i.i.d. X_1, \ldots, X_n with distribution P:

\lim_{n\to\infty} \frac{1}{n} \log P(\bar{X}_n \geq a) = -I(a)

where I(a) = \sup_{t} \{ta - \log M_X(t)\} is the Cramer-Legendre transform (rate function), and I(a) = D_{\mathrm{KL}}(P_a \| P) (KL divergence to the tilted distribution).

This is the fundamental result of large deviations theory: rare events occur at exponential rate determined by the KL distance to the "closest" distribution that makes the event typical.

For AI: a language model's cross-entropy on held-out text is H(\text{true dist}) + D_{\mathrm{KL}}(\text{true dist}\,\|\,\text{model}), and perplexity is its exponential. Low perplexity = model close to the true distribution = rare events (unusual tokens) are not overly penalised. Cramer's theorem connects large-deviation rates to exactly this KL divergence, which is part of why cross-entropy is the right loss.
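A numeric illustration (a sketch, assuming NumPy and SciPy; the Bernoulli parameter and threshold are made up for the example): the Legendre transform of the log-MGF recovers the KL rate function.

import numpy as np
from scipy.optimize import minimize_scalar

p, a = 0.3, 0.6  # X_i ~ Bernoulli(p); rate of P(mean >= a) -- illustrative values

def neg_objective(t):
    # -(t*a - log M_X(t)) with M_X(t) = 1 - p + p*e^t
    return -(t * a - np.log(1 - p + p * np.exp(t)))

rate = -minimize_scalar(neg_objective, bounds=(0.0, 50.0), method="bounded").fun
kl = a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))
print(rate, kl)  # both ~0.192: I(a) = D_KL(Ber(a) || Ber(p))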


Appendix R: Generalisation in the Overparameterised Regime

R.1 The Classical Puzzle

Classical theory (VC, Rademacher) says: more parameters = larger hypothesis class = worse generalisation. Yet in practice, overparameterised models (large d \gg n) often generalise better than their smaller counterparts.

The resolution involves multiple mechanisms:

1. Implicit regularisation. Gradient descent on overparameterised linear models converges to the minimum-\ell_2-norm solution. Among all interpolating solutions, this has the smallest Rademacher complexity.

2. Double descent. As model capacity increases past the interpolation threshold (d \approx n), test error first rises (overfitting) then falls (benign overfitting). For the minimum-norm interpolating solution, the variance part of the generalisation error is approximately:

\text{MSE}_{\text{var}} \approx \frac{\sigma^2 n}{d} \cdot \frac{1}{1 - n/d} = \frac{\sigma^2 n}{d - n}

which blows up at the interpolation threshold d \approx n and decreases as d/n \to \infty (overparameterised regime); a small simulation sketch follows below.

3. Benign overfitting. Bartlett et al. (2020) showed that for linear models with Gaussian data, interpolation (zero training error) can generalise well when most singular values of the data matrix are small - the model "memorises" noise in directions orthogonal to the signal.
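A minimal simulation of the double-descent shape described above (a sketch with illustrative sizes, assuming isotropic Gaussian features; np.linalg.pinv gives the minimum-\ell_2-norm least-squares fit):

import numpy as np

rng = np.random.default_rng(0)
n, d_max, sigma = 50, 400, 0.5
beta = rng.normal(size=d_max) / np.sqrt(d_max)      # true signal, norm ~ 1

X_train = rng.normal(size=(n, d_max))
y_train = X_train @ beta + sigma * rng.normal(size=n)
X_test = rng.normal(size=(2000, d_max))
y_test = X_test @ beta

for d in [10, 25, 45, 50, 55, 100, 200, 400]:
    # Minimum-norm least squares using the first d features
    theta = np.linalg.pinv(X_train[:, :d]) @ y_train
    mse = np.mean((X_test[:, :d] @ theta - y_test) ** 2)
    print(f"d = {d:3d}   test MSE = {mse:.3f}")       # peaks near d = n, then falls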

R.2 Neural Tangent Kernel (NTK) Theory

Wide neural networks in the "lazy training" (NTK) regime behave like kernel methods. The generalisation bound becomes:

R(\hat{h}) \leq \hat{R}(\hat{h}) + O\!\left(\sqrt{\frac{\hat{\mathbf{y}}^\top (K_{nn})^{-1} \hat{\mathbf{y}} + \log(1/\delta)}{n}}\right)

where K_{nn} is the NTK kernel matrix. The key quantity \hat{\mathbf{y}}^\top K_{nn}^{-1} \hat{\mathbf{y}} is the effective complexity - analogous to Rademacher complexity but computed from the kernel. For networks trained with weight decay, this remains bounded even as depth and width grow.

R.3 PAC-Bayes for Overparameterised Models

The most successful approach for non-vacuous bounds on deep networks uses PAC-Bayes. The bound:

\mathbb{E}_{h\sim Q}[R(h)] \leq \mathbb{E}_{h\sim Q}[\hat{R}(h)] + \sqrt{\frac{D_{\mathrm{KL}}(Q\|P) + \log(1/\delta)}{2(n-1)}}

applies to a stochastic classifier h \sim Q. For weight-perturbed networks (Gaussian posterior with small variance \sigma^2):

D_{\mathrm{KL}}(Q\|P) \approx \frac{\|\hat{\theta} - \theta_0\|_2^2}{2\sigma_0^2}

This is small when the fine-tuned weights \hat{\theta} don't stray far from the initialisation \theta_0. SAM (Sharpness-Aware Minimisation) explicitly seeks flat minima, where a broad posterior around \hat{\theta} still achieves low empirical risk, keeping this KL term small.
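A tiny sketch of the resulting complexity term (hypothetical numbers, assuming NumPy): the further the fine-tuned weights drift from initialisation, the larger the bound.

import numpy as np

def pac_bayes_complexity(dist_sq, sigma0_sq, n, delta=0.05):
    """Square-root term of the PAC-Bayes bound with an isotropic Gaussian posterior,
    using KL ~ ||theta_hat - theta_0||^2 / (2 sigma_0^2)."""
    kl = dist_sq / (2 * sigma0_sq)
    return np.sqrt((kl + np.log(1 / delta)) / (2 * (n - 1)))

# Small vs large weight displacement on n = 5000 examples (illustrative values)
print(pac_bayes_complexity(1.0, 0.1, 5000), pac_bayes_complexity(100.0, 0.1, 5000))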


Appendix S: Concentration in Continuous-Time Settings

S.1 Azuma's Inequality for Martingales

We've seen Azuma's inequality used to prove McDiarmid. Here's the full statement:

Theorem (Azuma-Hoeffding). Let (Z_0, Z_1, \ldots, Z_n) be a martingale (or supermartingale) with Z_0 = 0 and |Z_k - Z_{k-1}| \leq c_k a.s. Then:

P(Z_n \geq t) \leq \exp\!\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right)

Proof. Apply the Chernoff method: P(Z_n \geq t) \leq e^{-st} \mathbb{E}[e^{sZ_n}]. For a martingale, \mathbb{E}[e^{sZ_k} \mid Z_{k-1}] \leq e^{sZ_{k-1}} \cdot e^{s^2 c_k^2/2} by Hoeffding's lemma (conditional on Z_{k-1}, the increment Z_k - Z_{k-1} \in [-c_k, c_k] is bounded and has zero mean). Telescoping: \mathbb{E}[e^{sZ_n}] \leq e^{s^2 \sum c_k^2 / 2}. Optimise: s = t/\sum c_k^2. \blacksquare
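A quick sanity check of the bound (a sketch, assuming NumPy): a +/-1 random walk is a martingale with increments bounded by c_k = 1, and its simulated tail sits below the Azuma bound.

import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 200, 20.0, 100_000

# Z_n: sum of n independent +/-1 steps, so |Z_k - Z_{k-1}| <= 1
Z_n = rng.choice([-1.0, 1.0], size=(trials, n)).sum(axis=1)

empirical = (Z_n >= t).mean()
azuma = np.exp(-t**2 / (2 * n))    # sum of c_k^2 = n
print(empirical, azuma)            # empirical tail (~0.08) < Azuma bound (~0.37)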

S.2 Doob's Optional Stopping

Theorem. If (M_n) is a martingale and \tau is a bounded stopping time, then \mathbb{E}[M_\tau] = \mathbb{E}[M_0].

Application to SGD. Define M_t = \|\theta_t - \theta^*\|_2^2 - t \cdot C\eta^2\sigma^2 where C is a constant. Under mild conditions, M_t is a supermartingale. Azuma's inequality applied to the stopped process gives high-probability guarantees on convergence time.

This is the mathematical foundation of high-probability SGD convergence - stronger than the usual in-expectation guarantees.


Appendix T: Summary Diagrams

T.1 The Proof Technique Map

PROOF TECHNIQUES FOR CONCENTRATION INEQUALITIES
========================================================================

  Core tool: Markov's inequality
  -------------------------------------------------------------------

  Apply Markov to (X - mu)^2      --> Chebyshev
  Apply Markov to e^{sX}          --> Chernoff method
       +-- with Hoeffding's lemma  --> Hoeffding's inequality
       +-- with Bernstein's lemma  --> Bernstein's inequality
       +-- with Binomial MGF       --> Chernoff for Bin(n,p)

  Apply Azuma to Doob martingale  --> McDiarmid's inequality
  Union bound over hypothesis class --> PAC generalisation bound
  Union bound + covering number    --> Infinite-class PAC
  Rademacher symmetrisation        --> Rademacher complexity bound

  -------------------------------------------------------------------
  Common pattern: P(X >= t) <= E[e^{sX}] / e^{st}  (all MGF methods)
========================================================================

T.2 Bound Quality Comparison

TAIL PROBABILITY COMPARISON - X = mean of n=100 Bernoulli(0.5), t=0.1
========================================================================

  Method           Bound          Notes
  -----------------------------------------------------------------
  Exact (CLT)      ~0.046         Normal approx; valid for large n
  Hoeffding        0.135          Tight for this case (~3x exact)
  Chebyshev        0.250          ~5x looser
  Markov           0.500          ~11x looser (applied to |X-0.5|/0.5)

  Same setup, t=0.2:
  Exact            ~0.000046
  Hoeffding        3.4e-4         ~7x looser
  Chebyshev        0.0625         ~1360x looser

========================================================================

Appendix U: Proofs of Sub-Gaussian Closure and Examples

U.1 Cosh Inequality for Rademacher Variables

Claim. For \varepsilon \in \{-1, +1\} uniform: \mathbb{E}[e^{t\varepsilon}] = \cosh(t) \leq e^{t^2/2}.

Proof. \cosh(t) = \sum_{k=0}^\infty t^{2k}/(2k)! \leq \sum_{k=0}^\infty t^{2k}/(2^k k!) = \sum_{k=0}^\infty (t^2/2)^k/k! = e^{t^2/2}, using (2k)! \geq 2^k k!.

Interpretation. Rademacher variables are the "most concentrated" bounded variables in [-1,1]: their MGF is exactly \cosh(t), slightly below the sub-Gaussian bound e^{t^2/2}. This is also why they are the natural randomisation in symmetrisation arguments; the factor of 2 on Rademacher complexity in generalisation bounds comes from that symmetrisation step.
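A two-line Monte Carlo check of the claim (a sketch, assuming NumPy):

import numpy as np

rng = np.random.default_rng(0)
eps = rng.choice([-1.0, 1.0], size=1_000_000)
for t in [0.5, 1.0, 2.0]:
    mgf = np.mean(np.exp(t * eps))                   # Monte Carlo E[e^{t eps}]
    print(t, mgf, np.cosh(t), np.exp(t**2 / 2))      # mgf ~ cosh(t) <= e^{t^2/2}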

U.2 Sub-Gaussian Parameter for Bounded Variables

Claim. If X \in [a,b] with \mathbb{E}[X] = 0, then X is \frac{(b-a)^2}{4}-sub-Gaussian.

The parameter \frac{(b-a)^2}{4} is achieved when X = \frac{b-a}{2}\varepsilon where \varepsilon is Rademacher - a two-point distribution at the endpoints. This is the "hardest" distribution on [a,b] for Hoeffding's lemma.

Tightness example. For X \in \{-1, +1\} uniform: variance = 1, sub-Gaussian parameter = \frac{(1-(-1))^2}{4} = 1. They agree! For the Rademacher distribution, sub-Gaussian parameter = variance. For other distributions in [-1,1], sub-Gaussian parameter \geq variance.

U.3 Sub-Gaussian Characterisation via Tail Probabilities

The following are equivalent for a mean-zero random variable XX:

  1. (MGF) \mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} for all t
  2. (Tail) P(|X| \geq t) \leq 2e^{-t^2/(2\sigma^2)} for all t \geq 0
  3. (Moments) \|X\|_{L^p} = (\mathbb{E}[|X|^p])^{1/p} \leq \sigma\sqrt{p} for all p \geq 1

The equivalence (up to constants) is non-trivial. The key step: (2) \Rightarrow (1) uses \mathbb{E}[e^{tX}] = 1 + \sum_{k=1}^\infty \frac{t^k \mathbb{E}[X^k]}{k!} and bounds \mathbb{E}[X^k] from the tail condition.


Appendix V: Exercises with Solutions

V.1 Markov and Chebyshev Calculations

Problem. X \sim \text{Poisson}(5). Compute P(X \geq 12) using: (a) Markov's inequality (\mathbb{E}[X] = 5), (b) Chebyshev (\operatorname{Var}(X) = 5, so t = 12 - 5 = 7, k = 7/\sqrt{5} \approx 3.13), (c) the exact value (Poisson CDF).

Solution. (a) Markov: P(X \geq 12) \leq 5/12 \approx 0.417. (b) Chebyshev: P(X \geq 12) \leq P(|X - 5| \geq 7) \leq 5/49 \approx 0.102. (c) Exact: P(X \geq 12) = 1 - \sum_{k=0}^{11} e^{-5}5^k/k! \approx 0.0055.

Chebyshev is roughly 19x loose; Markov is roughly 76x loose. Both bounds hold but are very conservative.
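These values can be checked directly (a sketch, assuming SciPy):

from scipy.stats import poisson

lam, threshold = 5, 12
exact = poisson.sf(threshold - 1, lam)        # P(X >= 12) = 1 - P(X <= 11), ~0.0055
markov = lam / threshold                      # E[X]/t, ~0.417
chebyshev = lam / (threshold - lam) ** 2      # Var(X)/(t - mu)^2, ~0.102
print(exact, markov, chebyshev)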

V.2 Hoeffding Bound Computation

Problem. We estimate the click-through rate p of an ad by showing it to n users and recording clicks X_i \in \{0,1\}. We observe \hat{p} = 0.08 from n = 2000 users. Give a 99% Hoeffding confidence interval.

Solution. With \delta = 0.01, b - a = 1:

\varepsilon = \sqrt{\frac{\log(2/0.01)}{2 \cdot 2000}} = \sqrt{\frac{\log 200}{4000}} = \sqrt{\frac{5.298}{4000}} = \sqrt{0.001325} \approx 0.0364

99% CI: [0.08 - 0.0364, 0.08 + 0.0364] = [0.0436, 0.1164].

Interpretation: we're 99% confident the true CTR is between 4.4% and 11.6% - a very wide interval! To tighten to +/-1%: need n \geq \frac{\log 200}{2(0.01)^2} \approx 26{,}490 users.
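A small helper for this calculation (a sketch, assuming NumPy):

import numpy as np

def hoeffding_ci(p_hat, n, delta=0.01, value_range=1.0):
    """Two-sided Hoeffding confidence interval for the mean of a bounded variable."""
    eps = value_range * np.sqrt(np.log(2 / delta) / (2 * n))
    return p_hat - eps, p_hat + eps

print(hoeffding_ci(0.08, 2000))   # ~(0.044, 0.116), the 99% interval above
# Sample size for a +/-1% interval at 99% confidence (~26,500 users)
print(int(np.ceil(np.log(2 / 0.01) / (2 * 0.01**2))))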

V.3 Chernoff vs Hoeffding

Problem. S \sim \text{Bin}(500, 0.02), so \mu = 10. Bound P(S \geq 20).

Hoeffding: P(S \geq 20) = P(\bar{X} \geq 0.04) \leq \exp(-2 \cdot 500 \cdot (0.04-0.02)^2) = e^{-0.4} \approx 0.67. Useless!

Chernoff: \delta = (20-10)/10 = 1. P(S \geq 20) \leq e^{-10 \cdot 1^2/3} = e^{-3.33} \approx 0.036.

Exact: P(S \geq 20) \approx 0.0035 (Poisson approximation: e^{-10}\sum_{k=20}^\infty 10^k/k!).

Chernoff is roughly 10x loose; Hoeffding is essentially vacuous (a bound above 0.5 says nothing). For rare events with p \ll 1 and large n, Chernoff is the right tool.
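The three numbers can be reproduced as follows (a sketch, assuming NumPy and SciPy; binom.sf gives the exact binomial tail):

import numpy as np
from scipy.stats import binom

n, p, threshold = 500, 0.02, 20
mu = n * p

exact = binom.sf(threshold - 1, n, p)                    # P(S >= 20), ~0.003
hoeffding = np.exp(-2 * n * (threshold / n - p) ** 2)    # e^{-0.4}, ~0.67
delta = threshold / mu - 1
chernoff = np.exp(-mu * delta**2 / 3)                    # e^{-10/3}, ~0.036
print(exact, hoeffding, chernoff)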

V.4 McDiarmid on Dropout

Problem. A network output f(\mathbf{z}_1, \ldots, \mathbf{z}_L) depends on L = 10 layer activations. Dropout independently zeroes each activation, changing the contribution of neuron j in layer l by at most c_{lj} = |w_{lj}| \cdot a_{lj}^{\max} / (1-p). If \sum_{l,j} c_{lj}^2 = 4, bound P(|f - \mathbb{E}[f]| \geq 0.5).

Solution. By McDiarmid: P(|f - \mathbb{E}[f]| \geq 0.5) \leq 2\exp(-2 \cdot 0.25/4) = 2e^{-0.125} \approx 1.76. The bound is vacuous! This means either the c_{lj} are too large, or t is too small relative to the noise.

With larger networks and better weight norms (\sum c_{lj}^2 = 0.1): 2e^{-2 \cdot 0.25/0.1} = 2e^{-5} \approx 0.013. Now meaningful: the dropout output deviates by more than 0.5 less than 1.3% of the time.


Appendix W: Notation Reference for This Section

Symbol | Meaning
------ | -------
P(A) | Probability of event A
\mu = \mathbb{E}[X] | Mean of X
\sigma^2 = \operatorname{Var}(X) | Variance of X
M_X(t) = \mathbb{E}[e^{tX}] | Moment generating function
\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i | Sample mean
R(h) | True (population) risk of hypothesis h
\hat{R}(h) | Empirical risk of h on training data
\mathcal{H} | Hypothesis class
d_{VC} | VC dimension of \mathcal{H}
\Pi_\mathcal{H}(n) | Growth function
\mathfrak{R}_n(\mathcal{F}) | Rademacher complexity
\hat{\mathfrak{R}}_S(\mathcal{F}) | Empirical Rademacher complexity
\varepsilon | Accuracy parameter (how close to optimal)
\delta | Confidence parameter (failure probability)
c_i | Bounded differences constant for coordinate i
\varepsilon-net | Covering set with spacing \varepsilon
\mathcal{N}(\mathcal{H}, d, \varepsilon) | Covering number: size of smallest \varepsilon-net
D_{\mathrm{KL}}(P\|Q) | KL divergence from Q to P
I(a) = \sup_t(ta - \log M(t)) | Cramer rate function
\sigma_i | Rademacher random variable \in \{-1,+1\}
B_\delta(h) | Ball of radius \delta around hypothesis h

Appendix X: More on Rademacher Complexity

X.1 Computing Rademacher Complexity for Common Classes

Halfspaces with margin. For \mathcal{H}_\gamma = \{\mathbf{x} \mapsto \operatorname{sign}(\mathbf{w}^\top \mathbf{x}): \|\mathbf{w}\|_2 \leq 1, \gamma\text{-margin}\} on data with \|\mathbf{x}^{(i)}\|_2 \leq 1:

\mathfrak{R}_n(\mathcal{H}_\gamma) \leq \frac{1}{\gamma\sqrt{n}}

The margin \gamma effectively replaces dimension d! This is why SVMs with large margins generalise well in high dimensions - the effective complexity is O(1/(\gamma^2 n)), not O(d/n).

Two-layer neural network. For \mathcal{F} = \{\mathbf{x} \mapsto \mathbf{v}^\top \sigma(W\mathbf{x}): \|W\|_F \leq B_W, \|\mathbf{v}\|_1 \leq B_v\}:

\mathfrak{R}_n(\mathcal{F}) \leq \frac{B_v B_W C \sqrt{2\log(2d)}}{\sqrt{n}}

where C = \max_i \|\mathbf{x}^{(i)}\|_2. The \log d factor comes from the union bound over d output neurons; the \sqrt{n} denominator shows the O(1/\sqrt{n}) rate as expected.

X.2 Rademacher Complexity and Mutual Information

A deep connection: the Rademacher complexity \hat{\mathfrak{R}}_S(\mathcal{F}) can be bounded via mutual information between the training labels and the class predictions. For bounded function classes:

\hat{\mathfrak{R}}_S(\mathcal{F}) \leq \sqrt{\frac{2I(Y^n; f(X^n))}{n}}

where I(Y^n; f(X^n)) is the mutual information between labels and class predictions. This connects concentration theory to information theory - a richer theory being actively developed (2020-2026) under "information-theoretic generalisation bounds".

X.3 Monte Carlo Estimation of Rademacher Complexity

In practice, \hat{\mathfrak{R}}_S(\mathcal{F}) can be estimated by Monte Carlo:

import numpy as np

def estimate_rademacher(X, W_norm=1.0, T=1000):
    """Monte Carlo estimate of Rademacher complexity for ||w||_2 <= W_norm."""
    n, d = X.shape
    rademachers = []
    for _ in range(T):
        sigma = np.random.choice([-1, 1], size=n)
        # Optimal w: proportional to X^T sigma, normalised
        w_opt = X.T @ sigma
        w_norm = np.linalg.norm(w_opt)
        if w_norm > 0:
            w_opt = w_opt * W_norm / w_norm
        rademachers.append((sigma @ (X @ w_opt)) / n)
    return np.mean(rademachers)

For a dataset of n = 500 MNIST digits (784-dimensional), W_\text{norm} = 1: empirical \hat{\mathfrak{R}}_S \approx 0.038. Rademacher bound: test error \leq train error + 2 \cdot 0.038 + \sqrt{\log(20)/1000} \approx train error + 0.076 + 0.055. For 2% train error: test error \leq 15.1\% - loose but not vacuous!


Appendix Y: Bounds in the Context of Modern ML Systems

Y.1 LLM Benchmark Evaluation Standards

As of 2026, serious LLM evaluations use concentration-aware practices:

HELM (Holistic Evaluation of Language Models): Uses 1000+ examples per scenario, reporting bootstrap confidence intervals. Assumes bounded loss and applies Hoeffding-style reasoning.

BIG-bench: 200+ tasks, with tasks having 100-10000 examples. Standard deviations are reported, equivalent to reporting \sigma/\sqrt{n}, which matches the leading term in a Hoeffding CI.

LiveBench: Uses pairwise human comparisons (Elo-style). The Elo update rule has a concentration interpretation: the Elo gap needed for significance scales as O(1/\sqrt{n_{\text{comparisons}}}).

Statistical significance in model comparisons. With n test examples and two models achieving accuracies \hat{p}_1 and \hat{p}_2:

  • Paired comparison: use McNemar's test (exact)
  • Unpaired: significance threshold |\hat{p}_1 - \hat{p}_2| \geq 2\sqrt{\frac{\log(4/\delta)}{2n}} (Hoeffding)
  • For n = 1000, \delta = 0.05: need a difference \geq 2 \times 0.047 \approx 9.4\% for significance (see the sketch below)
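A sketch of that threshold as a reusable helper (assuming NumPy; the formula is the unpaired Hoeffding threshold from the list above):

import numpy as np

def min_significant_gap(n, delta=0.05):
    """Accuracy gap two models must show on n test examples before the difference
    exceeds what two two-sided Hoeffding intervals can explain."""
    return 2 * np.sqrt(np.log(4 / delta) / (2 * n))

print(min_significant_gap(1000))    # ~0.094: need ~9.4 accuracy points at n = 1000
print(min_significant_gap(10_000))  # ~0.030 with 10x more test examples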

Y.2 Fine-Tuning Sample Complexity via PAC-Bayes

When fine-tuning a pretrained LLM on task-specific data:

Prior: pre-trained weights \theta_0 from language model pre-training on internet-scale data.
Posterior: fine-tuned weights \hat{\theta} on n task examples.
PAC-Bayes bound:

R(\hat{\theta}) \leq \hat{R}(\hat{\theta}) + \sqrt{\frac{\|\hat{\theta} - \theta_0\|_2^2/(2\sigma_0^2) + \log(1/\delta)}{2(n-1)}}

The \|\hat{\theta} - \theta_0\|_2^2 term is small when fine-tuning doesn't move far from pre-training - exactly what LoRA enforces by constraining \Delta W = AB. With rank r \ll \min(d,k):

\|\hat{\theta} - \theta_0\|_2^2 = \|AB\|_F^2 \leq \|A\|_F^2 \|B\|_F^2 \leq r \cdot \text{(per-param budget)}

PAC-Bayes theoretically justifies why LoRA generalises better than full fine-tuning on small datasets.

Y.3 Privacy-Preserving ML via Concentration

Differential Privacy (DP) and concentration inequalities are deeply connected. An algorithm is (\varepsilon, \delta)-DP if changing one training example changes the probability of any set of outputs by at most a multiplicative factor e^{\varepsilon}, up to an additive slack \delta.

DP-SGD (Abadi et al., 2016): Clips each gradient to norm C, adds Gaussian noise \mathcal{N}(0, \sigma^2 C^2 I). The privacy guarantee follows from the Gaussian mechanism, and the utility (convergence) analysis uses concentration inequalities on the noisy gradients.

McDiarmid connection: DP is essentially algorithmic stability with c_i = C/n (each example changes the model by at most C/n in some norm). McDiarmid's inequality gives the same O(\exp(-nt^2/C^2)) flavour of bound for both privacy and stability.


Appendix Z: Complete Theorem Index

This appendix lists every named result in this section for quick reference.

Moment-Based Inequalities

Theorem 2.1 (Markov). X \geq 0, \mathbb{E}[X] = \mu: P(X \geq t) \leq \mu/t.

Theorem 2.2 (Chebyshev). \operatorname{Var}(X) = \sigma^2: P(|X-\mu| \geq t) \leq \sigma^2/t^2.

Theorem 2.3 (Cantelli). One-sided: P(X - \mu \geq t) \leq \sigma^2/(\sigma^2 + t^2).

Sub-Gaussian Inequalities

Definition 3.1 (Sub-Gaussian). X is \sigma^2-sub-G if \mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} for all t.

Theorem 3.2 (Sub-Gaussian tail). P(X \geq t) \leq e^{-t^2/(2\sigma^2)}.

Theorem 3.3 (Sub-Gaussian sum). Independent \sigma_i^2-sub-G: sum is \sum \sigma_i^2-sub-G.

Hoeffding-Type Inequalities

Lemma 4.1 (Hoeffding's lemma). X \in [a,b], \mathbb{E}[X]=0: \mathbb{E}[e^{tX}] \leq e^{t^2(b-a)^2/8}.

Theorem 4.2 (Hoeffding). Independent X_i \in [a_i,b_i]: P(\sum(X_i-\mu_i) \geq t) \leq e^{-2t^2/\sum(b_i-a_i)^2}.

Theorem 4.3 (Two-sided Hoeffding). P(|\bar{X}_n-\mu| \geq t) \leq 2e^{-2nt^2/(b-a)^2}.

Chernoff Bounds

Theorem 5.1 (Chernoff method). P(X \geq t) \leq \inf_{s>0} e^{-st} M_X(s).

Theorem 5.2 (Chernoff upper tail). S \sim \operatorname{Bin}(n,p), \delta > 0: P(S \geq (1+\delta)\mu) \leq (e^\delta/(1+\delta)^{1+\delta})^\mu.

Theorem 5.3 (Chernoff multiplicative). P(S \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3} for \delta \in (0,1).

Theorem 5.4 (Chernoff lower tail). P(S \leq (1-\delta)\mu) \leq e^{-\mu\delta^2/2}.

Bernstein and McDiarmid

Theorem 6.2 (Bernstein). |X_i| \leq c, variance \nu^2: P(\sum X_i \geq t) \leq \exp\!\left(-\frac{t^2/2}{\nu^2+ct/3}\right).

Theorem 7.2 (McDiarmid). f with c_i-bounded differences: P(f-\mathbb{E}[f] \geq t) \leq e^{-2t^2/\sum c_i^2}.

Theorem S.1 (Azuma). Martingale with |Z_k-Z_{k-1}| \leq c_k: P(Z_n \geq t) \leq e^{-t^2/(2\sum c_k^2)}.

Statistical Learning Theory

Lemma 8.1 (Union bound). P(\bigcup A_i) \leq \sum P(A_i).

Lemma 8.3 (Covering number). \ell_2-ball: \mathcal{N}(\mathcal{B}_R, \ell_2, \varepsilon) \leq (3R/\varepsilon)^d.

Theorem 9.2 (Finite-class PAC). n \geq \log(|\mathcal{H}|/\delta)/(2\varepsilon^2) \Rightarrow uniform convergence.

Lemma 9.4 (Sauer-Shelah). \Pi_\mathcal{H}(n) \leq (en/d)^d for n \geq d = d_{VC}.

Theorem 9.5 (VC bound). R(h) \leq \hat{R}(h) + \sqrt{(d\log(2n/d)+\log(4/\delta))/(2n)}.

Theorem 10.2 (Rademacher bound). R(f) \leq \hat{R}(f) + 2\mathfrak{R}_n(\mathcal{F}) + \sqrt{\log(1/\delta)/(2n)}.

Theorem E.4 (PAC-Bayes). \mathbb{E}_Q[R(h)] \leq \mathbb{E}_Q[\hat{R}(h)] + \sqrt{(D_{\mathrm{KL}}(Q\|P)+\log(1/\delta))/(2(n-1))}.

Theorem J.2 (Stability). \beta-stable algorithm: R \leq \hat{R} + 2\beta + (4n\beta+1)\sqrt{\log(1/\delta)/(2n)}.


This completes Section05 Concentration Inequalities. The material here is used extensively in Chapter 7: Statistics for confidence intervals and hypothesis tests, and throughout learning theory.



Supplementary: Concentration in High Dimensions

The Curse and Blessing of Dimensionality

High-dimensional geometry is counterintuitive. In \mathbb{R}^d with large d:

  • Almost all the volume of a ball lies near its surface
  • Random unit vectors are nearly orthogonal: |\langle u, v \rangle| \approx 1/\sqrt{d} w.h.p.
  • The norm of a Gaussian vector concentrates: \|\mathbf{x}\|_2 \approx \sqrt{d} for \mathbf{x} \sim \mathcal{N}(0, I_d)

These facts follow directly from concentration inequalities applied to sums of squares.

Gaussian norm concentration. Let \mathbf{x} \sim \mathcal{N}(0, I_d). Then \|\mathbf{x}\|_2^2 = \sum_{i=1}^d X_i^2 where X_i \stackrel{iid}{\sim} \mathcal{N}(0,1). By Bernstein's inequality (since X_i^2 - 1 is sub-exponential):

P\!\left(\left|\,\|\mathbf{x}\|_2 - \sqrt{d}\,\right| \geq t\right) \leq 2e^{-ct^2}

for a universal constant c > 0.

Implication for dot products. For two independent \mathbf{x}, \mathbf{y} \sim \mathcal{N}(0, I_d/d):

P\!\left(\left|\langle \mathbf{x}, \mathbf{y}\rangle\right| \geq t/\sqrt{d}\right) \leq 2e^{-ct^2}

This is why scaled dot-product attention uses 1/\sqrt{d_k} - it keeps logits O(1) despite d_k-dimensional embeddings.

Geometry of Softmax

Let \mathbf{q}, \mathbf{k}_1, \ldots, \mathbf{k}_n \in \mathbb{R}^{d_k} be the query and keys in a transformer. The attention weight on key j is:

\alpha_j = \frac{e^{\mathbf{q}^\top \mathbf{k}_j / \sqrt{d_k}}}{\sum_{\ell} e^{\mathbf{q}^\top \mathbf{k}_\ell / \sqrt{d_k}}}

Without the 1/\sqrt{d_k} scaling, the logits \mathbf{q}^\top \mathbf{k}_j have variance d_k (for random vectors with unit-variance entries), pushing softmax into near-zero gradient regions - a form of implicit dimension dependence that concentration inequalities make precise.
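A short simulation of this effect (a sketch, assuming NumPy; dimensions are illustrative): unscaled logits have variance of order d_k and push softmax towards a one-hot vector, while the 1/\sqrt{d_k} scaling keeps the weights spread out.

import numpy as np

rng = np.random.default_rng(0)
d_k, n_keys = 512, 32
q = rng.normal(size=d_k)
K = rng.normal(size=(n_keys, d_k))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

raw = K @ q                       # logit variance ~ d_k
scaled = raw / np.sqrt(d_k)       # logit variance ~ 1
print(np.var(raw), np.var(scaled))
print(softmax(raw).max(), softmax(scaled).max())   # ~1.0 (saturated) vs well below 1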

Covering Numbers in Transformer Weight Space

The number of parameters in a transformer layer is O(d^2). For LoRA with rank r, the effective dimensionality is O(dr). By the \varepsilon-covering number bound:

\log \mathcal{N}(\mathcal{F}_{\text{LoRA}}, \varepsilon, \|\cdot\|) = O\!\left(dr \log(1/\varepsilon)\right)

This converts directly to a PAC generalisation bound showing that LoRA generalises with O(\sqrt{dr \log n / n}) excess risk - substantially better than O(\sqrt{d^2 \log n / n}) for full fine-tuning. Concentration inequalities thus provide the theoretical foundation for parameter-efficient fine-tuning.

Summary: Why This Section Matters

Concentration inequalities are the engine behind every finite-sample guarantee in machine learning. They explain:

  1. Why more data helps: bounds shrink as 1/\sqrt{n}
  2. Why simpler models generalise: bounds grow with \sqrt{\log |\mathcal{H}|} or \sqrt{d_{\text{VC}}}
  3. Why high dimensions are manageable: Gaussian concentration keeps norms stable
  4. Why LoRA works: Rademacher complexity of low-rank updates is provably small
  5. Why LLM benchmarks need many examples: CIs on accuracy require n \propto 1/\varepsilon^2

The progression Markov -> Chebyshev -> sub-Gaussian -> Chernoff -> McDiarmid is not merely a tour of tricks - it is a systematic strengthening of assumptions that yields exponentially tighter guarantees, mirroring the progress from "some data" to "structured data" in modern ML.


End of Section05 Concentration Inequalities

Next: Section06 Stochastic Processes - where concentration inequalities meet martingales in continuous time.

Back: Section04 Expectation and Moments - the moment-based tools (MGF, Jensen) used throughout this section.

