Concentration Inequalities — Appendices O–Z and Supplementary: Concentration in High Dimensions
Appendix O: Visualisation Notes
O.1 Plotting Concentration Inequalities
When visualising tail bounds, always show:
- The exact tail probability (computed numerically) as the ground truth
- The bound as a curve above the exact probability
- The ratio (bound/exact) to show how loose the bound is
TAIL BOUND TIGHTNESS - Gaussian(0,1)
========================================================================
t  | P(X >= t) exact | Markov* | Chebyshev | Hoeffding | Cheb./Hoeff. vs exact
------------------------------------------------------------------------
1  | 0.159           | N/A     | 1.000     | 0.607     | 6.3x / 3.8x
2  | 0.023           | N/A     | 0.250     | 0.135     | 11x / 5.9x
3  | 0.0013          | N/A     | 0.111     | 0.011     | 85x / 8.5x
4  | 3.2e-5          | N/A     | 0.0625    | 3.4e-4    | 1953x / 11x
5  | 2.9e-7          | N/A     | 0.040     | 3.7e-6    | 138,000x / 13x
*Markov requires X >= 0; not directly applicable to the symmetric Gaussian
========================================================================
At $t = 5$, Chebyshev overestimates the tail by a factor of about 138,000! The Hoeffding-type (sub-Gaussian) bound $e^{-t^2/2}$ is off only by a polynomial factor of roughly $t\sqrt{2\pi}$ - the Gaussian is exactly the equality case of the sub-Gaussian MGF definition, so the exponential rate is tight.
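A minimal script to reproduce these numbers (exact tail via the complementary error function, Chebyshev's $1/t^2$, and the sub-Gaussian bound $e^{-t^2/2}$; standard library only):

```python
import math

def gaussian_tail_exact(t):
    """Exact P(X >= t) for X ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2))

def chebyshev_bound(t):
    """Chebyshev: P(|X - mu| >= t) <= sigma^2 / t^2 (here sigma = 1)."""
    return min(1.0, 1.0 / t**2)

def subgaussian_bound(t):
    """Sub-Gaussian (Hoeffding-type) tail: P(X >= t) <= exp(-t^2 / 2) for sigma = 1."""
    return math.exp(-t**2 / 2)

for t in [1, 2, 3, 4, 5]:
    exact = gaussian_tail_exact(t)
    print(f"t={t}: exact={exact:.2e}, "
          f"Chebyshev={chebyshev_bound(t):.3g} ({chebyshev_bound(t)/exact:.0f}x), "
          f"sub-Gaussian={subgaussian_bound(t):.2e} ({subgaussian_bound(t)/exact:.1f}x)")
```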
O.2 Sample Complexity Curves
The sample complexity $n = O\!\left(\dfrac{\log(1/\delta)}{\epsilon^2}\right)$ has a characteristic shape:
- $n \propto 1/\epsilon^2$: doubling precision (halving $\epsilon$) requires 4x more data
- $n \propto \log(1/\delta)$: going from 95% to 99.9% confidence adds only about 2x more data - logarithmically cheap (see the sketch after this list)
This $1/\epsilon^2$ scaling is fundamental: it appears in:
- Hoeffding (statistical estimation error $O(1/\sqrt{n})$)
- Monte Carlo (variance $\sigma^2/n$, error $O(1/\sqrt{n})$)
- PAC learning (generalisation gap $O(\sqrt{\log\lvert\mathcal{H}\rvert / n})$)
- Stochastic optimisation (SGD convergence rate $O(1/\sqrt{T})$)
All are manifestations of the same underlying concentration phenomenon.
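A small sample-size calculator based on the two-sided Hoeffding bound makes the scaling concrete (a sketch; the function name and example values are illustrative):

```python
import math

def hoeffding_sample_size(epsilon, delta):
    """Smallest n such that P(|X_bar - mu| >= epsilon) <= delta for X_i in [0, 1],
    using the two-sided Hoeffding bound 2*exp(-2*n*epsilon^2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon**2))

# Halving epsilon quadruples n; tightening delta is only logarithmically more expensive.
print(hoeffding_sample_size(0.05, 0.05))    # ~738
print(hoeffding_sample_size(0.025, 0.05))   # ~2952 (4x more data)
print(hoeffding_sample_size(0.05, 0.001))   # ~1521 (only ~2x despite 50x smaller delta)
```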
Appendix P: Glossary of Key Terms
| Term | Definition | Section |
|---|---|---|
| Concentration inequality | A bound on $P(X - E[X] \geq t)$ or $P(\lvert X - E[X]\rvert \geq t)$ | Section 1 |
| Sub-Gaussian random variable | $X$ with $E[e^{s(X - E[X])}] \leq e^{s^2\sigma^2/2}$ for all $s$ | Section 3.1 |
| Sub-Gaussian parameter | $\sigma^2$: the proxy variance in the sub-Gaussian definition | Section 3.1 |
| Rademacher variable | $\sigma \in \{-1, +1\}$ with equal probability | Section 3.2 |
| Hoeffding's lemma | Bounded RV $X \in [a, b]$ with zero mean satisfies $E[e^{sX}] \leq e^{s^2(b-a)^2/8}$ | Section 4.1 |
| Chernoff method | Tail bound via $P(X \geq t) \leq e^{-st}E[e^{sX}]$, optimised over $s > 0$ | Section 5.1 |
| Bounded differences | $\lvert f(\cdots x_i \cdots) - f(\cdots x_i' \cdots)\rvert \leq c_i$ for each coordinate $i$ | Section 7 |
| Union bound | $P(\bigcup_i A_i) \leq \sum_i P(A_i)$ | Section 8.1 |
| Covering number | Smallest $\epsilon$-net size for a set under a metric | Section 8.3 |
| Shattering | $\mathcal{H}$ shatters $C$ if it can realise all $2^{\lvert C\rvert}$ labellings of $C$ | Section 9 |
| VC dimension | Size of the largest shattered set | Section 9.4 |
| Growth function | $\Pi_{\mathcal{H}}(n)$: max number of labellings of $n$ points by $\mathcal{H}$ | App D.1 |
| PAC learning | Probably approximately correct: error $\leq \epsilon$ with probability $\geq 1 - \delta$ | Section 9.1 |
| Uniform convergence | $\sup_{h \in \mathcal{H}} \lvert \hat{R}(h) - R(h)\rvert$ small with high probability | Section 9.2 |
| Rademacher complexity | $\mathcal{R}_n(\mathcal{H}) = E\bigl[\sup_{h \in \mathcal{H}} \tfrac{1}{n}\sum_i \sigma_i h(x_i)\bigr]$ | Section 10.1 |
| PAC-Bayes | Generalisation bound using $\mathrm{KL}(Q \,\Vert\, P)$ between posterior and prior | App E.4 |
| Algorithmic stability | Output insensitive to swapping one training example | App J.2 |
| Exponential tilt | Twisted distribution used in the Chernoff method and large deviations | App I.1 |
Appendix Q: The Method of Types and Large Deviations
Q.1 Types and Empirical Distributions
The method of types is an elegant combinatorial approach to concentration that complements the analytical MGF-based approach.
Definition. For a sequence $x_1, \ldots, x_n$ over a finite alphabet $\mathcal{X}$, the type (empirical distribution) is:
$$\hat{P}_x(a) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{x_i = a\}, \qquad a \in \mathcal{X}.$$
The set of all sequences of type $P$ is the type class $T(P)$.
Key counting result. For a finite alphabet $\mathcal{X}$:
$$\frac{1}{(n+1)^{\lvert\mathcal{X}\rvert}}\, 2^{nH(P)} \;\leq\; \lvert T(P)\rvert \;\leq\; 2^{nH(P)},$$
where $H(P)$ is the entropy of $P$ (in bits). This means there are approximately $2^{nH(P)}$ sequences with empirical distribution $P$.
Q.2 Connection to Chernoff Bounds
For i.i.d. $X_1, \ldots, X_n \sim Q$ and empirical distribution $\hat{P}_n$:
$$P(\hat{P}_n = P) \;\leq\; 2^{-n\, D(P \,\Vert\, Q)}.$$
This is the method of types bound. Summing over all types $P$ whose mean is at least $t$ (there are at most $(n+1)^{\lvert\mathcal{X}\rvert}$ of them):
$$P(\bar{X}_n \geq t) \;\leq\; (n+1)^{\lvert\mathcal{X}\rvert}\, 2^{-n\, D(P^* \Vert\, Q)},$$
where $P^*$ is the type closest to $Q$ (in KL divergence) with mean at least $t$. For large $n$, the polynomial factor is negligible, and this matches the Chernoff bound with the KL divergence interpretation.
Q.3 Cramer's Theorem
Theorem (Cramér, 1938). For i.i.d. $X_1, \ldots, X_n$ with distribution $Q$ and $t > E[X_1]$:
$$\lim_{n\to\infty} -\frac{1}{n}\log P(\bar{X}_n \geq t) \;=\; I(t),$$
where $I(t) = \sup_{s > 0}\{st - \log E[e^{sX_1}]\}$ is the Cramér-Legendre transform of the log-MGF (the rate function), and equivalently $I(t) = D(Q_t \,\Vert\, Q)$, the KL divergence of the exponentially tilted distribution $Q_t$ with mean $t$ from $Q$.
This is the fundamental result of large deviations theory: rare events occur at an exponential rate determined by the KL distance to the "closest" distribution that makes the event typical.
For AI: language model perplexity is essentially exponentiated cross-entropy, $\exp(H(P) + D(P \,\Vert\, Q))$ for true distribution $P$ and model $Q$. Low perplexity = model close to the true distribution = rare events (unusual tokens) are not overly penalised. Cramér's theorem connects the probability of large deviations (and hence generalisation gaps) to KL divergence, explaining why cross-entropy is the right loss.
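A sketch of the rate function in action for Bernoulli variables: the numerical supremum over $s$ matches the closed-form KL expression, and exponentiating gives the Chernoff/Cramér tail estimate (the values $p = 0.5$, $t = 0.7$, $n = 100$ are illustrative):

```python
import math

def rate_function_bernoulli(t, p, s_max=50.0, steps=20000):
    """Numerically evaluate I(t) = sup_s { s*t - log E[e^{sX}] } for X ~ Bernoulli(p).
    (Closed form: I(t) = KL(Bernoulli(t) || Bernoulli(p)).)"""
    best = 0.0
    for i in range(1, steps + 1):
        s = s_max * i / steps
        log_mgf = math.log(1 - p + p * math.exp(s))
        best = max(best, s * t - log_mgf)
    return best

def kl_bernoulli(t, p):
    return t * math.log(t / p) + (1 - t) * math.log((1 - t) / (1 - p))

p, t, n = 0.5, 0.7, 100
I_numeric = rate_function_bernoulli(t, p)
print(I_numeric, kl_bernoulli(t, p))   # both ~0.0823
print(math.exp(-n * I_numeric))        # Chernoff/Cramer upper bound on P(X_bar >= 0.7), ~2.6e-4
```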
Appendix R: Generalisation in the Overparameterised Regime
R.1 The Classical Puzzle
Classical theory (VC, Rademacher) says: more parameters = larger hypothesis class = worse generalisation. Yet in practice, overparameterised models (number of parameters $p \gg n$) often generalise better than their smaller counterparts.
The resolution involves multiple mechanisms:
1. Implicit regularisation. Gradient descent on overparameterised linear models converges to the minimum-$\ell_2$-norm solution. Among all interpolating solutions, this one has the smallest Rademacher complexity.
2. Double descent. As model capacity increases past the interpolation threshold ($p \approx n$), test error first rises (overfitting) and then falls again (benign overfitting). For minimum-norm least squares with isotropic features, the variance part of the generalisation error scales roughly as
$$\frac{\sigma^2}{p/n - 1},$$
which decreases as $p/n \to \infty$ (overparameterised regime).
3. Benign overfitting. Bartlett et al. (2020) showed that for linear models with Gaussian data, interpolation (zero training error) can still generalise well when most singular values of the data matrix are small - the model "memorises" noise in directions orthogonal to the signal.
R.2 Neural Tangent Kernel (NTK) Theory
Wide neural networks in the "lazy training" (NTK) regime behave like kernel methods. The generalisation bound becomes (up to constants):
$$R(f) - \hat{R}(f) \;\lesssim\; \sqrt{\frac{y^\top K^{-1} y}{n}},$$
where $K$ is the NTK kernel matrix on the training inputs. The key quantity is the effective complexity $y^\top K^{-1} y$ - analogous to a Rademacher complexity but computed from the kernel. For networks trained with weight decay, this quantity remains bounded even as depth and width grow.
R.3 PAC-Bayes for Overparameterised Models
The most successful approach for non-vacuous bounds on deep networks uses PAC-Bayes. The bound:
$$E_{h \sim Q}[R(h)] \;\leq\; E_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{\mathrm{KL}(Q \,\Vert\, P) + \ln(n/\delta)}{2(n-1)}}$$
applies to a stochastic classifier $h \sim Q$. For weight-perturbed networks (Gaussian posterior $Q = \mathcal{N}(w, \sigma^2 I)$ around the trained weights, prior $P = \mathcal{N}(w_0, \sigma^2 I)$ around the initialisation, with small variance $\sigma^2$):
$$\mathrm{KL}(Q \,\Vert\, P) = \frac{\lVert w - w_0\rVert_2^2}{2\sigma^2}.$$
This is small when the fine-tuned weights $w$ don't stray far from the initialisation $w_0$. SAM (Sharpness-Aware Minimisation) explicitly seeks flat minima, where the posterior can spread over a broad region around $w$ without hurting empirical risk, keeping this KL term small.
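A minimal numerical sketch of this KL term for an isotropic Gaussian posterior and prior (the dimension, $\sigma$, and $n$ below are illustrative, not taken from any experiment in the text):

```python
import numpy as np

def pac_bayes_kl_term(w, w0, sigma):
    """KL( N(w, sigma^2 I) || N(w0, sigma^2 I) ) = ||w - w0||^2 / (2 sigma^2)."""
    return float(np.sum((w - w0) ** 2) / (2 * sigma**2))

def pac_bayes_gap(kl, n, delta=0.05):
    """Complexity term sqrt((KL + ln(n/delta)) / (2(n-1))) from the McAllester-style bound."""
    return np.sqrt((kl + np.log(n / delta)) / (2 * (n - 1)))

rng = np.random.default_rng(0)
w0 = rng.normal(size=10_000)                  # "initialisation"
w = w0 + 0.01 * rng.normal(size=10_000)       # fine-tuned weights stay close to init
kl = pac_bayes_kl_term(w, w0, sigma=0.1)
print(kl, pac_bayes_gap(kl, n=50_000))        # small KL -> small complexity term
```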
Appendix S: Concentration in Continuous-Time Settings
S.1 Azuma's Inequality for Martingales
We've seen Azuma's inequality used to prove McDiarmid. Here's the full statement:
Theorem (Azuma-Hoeffding). Let $(M_k)_{k \geq 0}$ be a martingale (or supermartingale) with $M_0 = 0$ and $\lvert M_k - M_{k-1}\rvert \leq c_k$ a.s. Then:
$$P(M_n \geq t) \;\leq\; \exp\!\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right).$$
Proof. Apply the Chernoff method: $P(M_n \geq t) \leq e^{-st} E[e^{sM_n}]$. For a martingale, $E\bigl[e^{s(M_k - M_{k-1})} \mid \mathcal{F}_{k-1}\bigr] \leq e^{s^2 c_k^2/2}$ by Hoeffding's lemma (conditional on $\mathcal{F}_{k-1}$, the increment has zero mean and is bounded). Telescoping: $E[e^{sM_n}] \leq e^{s^2 \sum_k c_k^2 / 2}$. Optimise: $s = t / \sum_k c_k^2$ gives the stated bound.
S.2 Doob's Optional Stopping
Theorem. If $(M_k)$ is a martingale and $\tau$ is a bounded stopping time, then $E[M_\tau] = E[M_0]$.
Application to SGD. Define $V_k = f(w_k) - f^\star + \alpha k$ for a suitable constant $\alpha > 0$. Under mild conditions (bounded gradient noise, suitable step size), $(V_k)$ is a supermartingale. Azuma's inequality applied to the stopped process gives high-probability guarantees on the convergence time $\tau = \min\{k : f(w_k) - f^\star \leq \epsilon\}$.
This is the mathematical foundation of high-probability SGD convergence - stronger than the usual in-expectation guarantees.
Appendix T: Summary Diagrams
T.1 The Proof Technique Map
PROOF TECHNIQUES FOR CONCENTRATION INEQUALITIES
========================================================================
Core tool: Markov's inequality
-------------------------------------------------------------------
Apply Markov to (X - mu)^2 --> Chebyshev
Apply Markov to e^{sX} --> Chernoff method
+-- with Hoeffding's lemma --> Hoeffding's inequality
+-- with Bernstein's lemma --> Bernstein's inequality
+-- with Binomial MGF --> Chernoff for Bin(n,p)
Apply Azuma to Doob martingale --> McDiarmid's inequality
Union bound over hypothesis class --> PAC generalisation bound
Union bound + covering number --> Infinite-class PAC
Rademacher symmetrisation --> Rademacher complexity bound
-------------------------------------------------------------------
Common pattern: P(X >= t) <= E[e^{sX}] / e^{st} (all MGF methods)
========================================================================
T.2 Bound Quality Comparison
TAIL PROBABILITY COMPARISON - X = mean of n=100 Bernoulli(0.5)
========================================================================
One-sided deviation P(X - 0.5 >= t), t = 0.1:
Method              Bound      Notes
-----------------------------------------------------------------
Exact (binomial)    ~0.028     P(S >= 60) for S ~ Bin(100, 0.5)
Hoeffding           0.135      ~5x looser; tightest of the three bounds
Chebyshev           0.250      ~9x looser
Markov              ~0.4       applied to |X - 0.5|; ~14x looser
Same setup, t = 0.2:
Exact               ~3.9e-5
Hoeffding           3.4e-4     ~9x looser
Chebyshev           0.0625     ~1600x looser
========================================================================
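The following sketch recomputes the table, using the exact binomial tail rather than a normal approximation (standard library only):

```python
import math

n, p = 100, 0.5

def exact_upper_tail(t):
    """Exact P(X_bar - p >= t) for X_bar the mean of n Bernoulli(p): a binomial sum."""
    k_min = math.ceil(n * (p + t))
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

def hoeffding(t):
    """One-sided Hoeffding bound for a mean of n variables in [0, 1]."""
    return math.exp(-2 * n * t**2)

def chebyshev(t):
    """Chebyshev bound with Var(X_bar) = p(1-p)/n."""
    return min(1.0, p * (1 - p) / (n * t**2))

for t in [0.1, 0.2]:
    print(t, exact_upper_tail(t), hoeffding(t), chebyshev(t))
```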
Appendix U: Proofs of Sub-Gaussian Closure and Examples
U.1 Cosh Inequality for Rademacher Variables
Claim. For $\sigma$ uniform on $\{-1, +1\}$: $E[e^{s\sigma}] = \cosh(s) \leq e^{s^2/2}$.
Proof. $E[e^{s\sigma}] = \frac{e^s + e^{-s}}{2} = \cosh(s) = \sum_{k \geq 0} \frac{s^{2k}}{(2k)!} \leq \sum_{k \geq 0} \frac{(s^2/2)^k}{k!} = e^{s^2/2}$, using $(2k)! \geq 2^k\, k!$.
Interpretation. Rademacher variables are the extremal bounded variables on $[-1, +1]$: their MGF is exactly $\cosh(s)$, sitting just below the sub-Gaussian bound $e^{s^2/2}$, and no zero-mean distribution on $[-1, +1]$ has a larger MGF (this two-point extremal case is exactly what drives Hoeffding's lemma). The factor of 2 multiplying the Rademacher complexity in generalisation bounds comes from the symmetrisation step that introduces the Rademacher supremum.
U.2 Sub-Gaussian Parameter for Bounded Variables
Claim. If $X \in [a, b]$, then $X - E[X]$ is $\frac{(b-a)^2}{4}$-sub-Gaussian.
The parameter $\frac{(b-a)^2}{4}$ is achieved when $X = \frac{a+b}{2} + \frac{b-a}{2}\sigma$ where $\sigma$ is Rademacher - a two-point distribution at the endpoints. This is the "hardest" distribution on $[a, b]$ for Hoeffding's lemma.
Tightness example. For $\sigma$ Rademacher (uniform on $\{-1, +1\}$): variance $1$ and sub-Gaussian parameter $\frac{(1-(-1))^2}{4} = 1$. They agree! For the Rademacher distribution, sub-Gaussian parameter $=$ variance. For other distributions on $[-1, +1]$, the Hoeffding parameter $1$ exceeds the variance.
U.3 Sub-Gaussian Characterisation via Tail Probabilities
The following are equivalent (up to constant factors in $\sigma$) for a mean-zero random variable $X$:
- (MGF) $E[e^{sX}] \leq e^{s^2\sigma^2/2}$ for all $s \in \mathbb{R}$
- (Tail) $P(\lvert X\rvert \geq t) \leq 2 e^{-t^2/(2\sigma^2)}$ for all $t > 0$
- (Moments) $\bigl(E[\lvert X\rvert^k]\bigr)^{1/k} \leq C\sigma\sqrt{k}$ for all $k \geq 1$
The equivalence (up to constants) is non-trivial. The key step in (Tail) $\Rightarrow$ (MGF) expands $E[e^{sX}] = \sum_{k \geq 0} \frac{s^k E[X^k]}{k!}$ and bounds each moment from the tail condition via $E[\lvert X\rvert^k] = \int_0^\infty k\, t^{k-1} P(\lvert X\rvert \geq t)\, dt$.
Appendix V: Exercises with Solutions
V.1 Markov and Chebyshev Calculations
Problem. $X \sim \mathrm{Poisson}(\lambda)$. For a threshold $t > \lambda$, compute $P(X \geq t)$ using: (a) Markov's inequality ($E[X] = \lambda$); (b) Chebyshev ($\mathrm{Var}(X) = \lambda$, so $P(X \geq t) \leq P(\lvert X - \lambda\rvert \geq t - \lambda)$); (c) the exact value (Poisson CDF).
Solution. (a) Markov: $\lambda / t$. (b) Chebyshev: $\lambda / (t - \lambda)^2$. (c) Exact: $\sum_{k \geq t} e^{-\lambda}\lambda^k / k!$.
In the worked numbers, Chebyshev is about 5x loose and Markov about 21x loose. Both bounds hold but are very conservative.
V.2 Hoeffding Bound Computation
Problem. We estimate the click-through rate $p$ of an ad by showing it to $n$ users and recording clicks $X_i \in \{0, 1\}$. We observe $\hat{p} = 0.08$ from $n = 2000$ users. Give a 99% Hoeffding confidence interval.
Solution. With $n = 2000$, $\delta = 0.01$:
$$\epsilon = \sqrt{\frac{\ln(2/\delta)}{2n}} = \sqrt{\frac{\ln 200}{4000}} \approx 0.036.$$
99% CI: $[0.08 - 0.036,\; 0.08 + 0.036] = [0.044,\; 0.116]$.
Interpretation: we're 99% confident the true CTR is between 4.4% and 11.6% - a very wide interval! To tighten to $\pm 1\%$: need $n = \frac{\ln(2/\delta)}{2\epsilon^2} = \frac{\ln 200}{2 \times 0.01^2} \approx 26{,}500$ users.
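A sketch of the same calculation as a reusable helper (`hoeffding_ci` is our name, not a library function):

```python
import math

def hoeffding_ci(p_hat, n, delta=0.01):
    """Two-sided Hoeffding confidence interval for a mean of [0, 1]-valued samples:
    half-width sqrt(ln(2/delta) / (2n)), valid with probability >= 1 - delta."""
    eps = math.sqrt(math.log(2 / delta) / (2 * n))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)

print(hoeffding_ci(0.08, 2000, delta=0.01))   # roughly (0.044, 0.116)

# Sample size for a +/-1% interval at 99% confidence:
print(math.ceil(math.log(2 / 0.01) / (2 * 0.01**2)))   # ~26,492 users
```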
V.3 Chernoff vs Hoeffding
Problem. $X_i \sim \mathrm{Bernoulli}(p)$ with $p$ small, so the count $S_n = \sum_i X_i$ has mean $\mu = np \ll n$. Bound $P(S_n \geq (1+\delta)\mu)$ for a threshold several times the mean.
Hoeffding: $P(S_n \geq \mu + u) \leq e^{-2u^2/n}$ uses only the range $[0, 1]$ of each $X_i$. For rare events the deviation $u$ is tiny compared with $n$, so the exponent is near zero and the bound exceeds $0.5$. Useless!
Chernoff: $P(S_n \geq (1+\delta)\mu) \leq \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}$, which exploits the small mean $\mu$ rather than the worst-case range.
Exact: the Poisson approximation $S_n \approx \mathrm{Poisson}(np)$ gives the true tail, a small constant factor below the Chernoff bound.
Chernoff ends up roughly 4x loose; Hoeffding is essentially vacuous (a bound above 0.5 says nothing). For rare events with small $p$ and large $n$, Chernoff is the right tool. A numerical comparison follows below.
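A numerical sketch with illustrative values ($n = 100$, $p = 0.01$, threshold $S_n \geq 4$ - chosen by us, not necessarily the exercise's original numbers) showing the same qualitative picture:

```python
import math

# Illustrative numbers: n = 100 trials, p = 0.01, threshold S >= 4.
n, p, k = 100, 0.01, 4
mu = n * p                  # expected number of successes = 1
delta = k / mu - 1          # multiplicative deviation, here delta = 3

hoeffding = math.exp(-2 * n * (k / n - p) ** 2)          # uses the range, not the small mean
chernoff = (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu
exact = sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

print(f"Hoeffding {hoeffding:.3f}, Chernoff {chernoff:.4f}, exact {exact:.5f}")
# Hoeffding ~0.835 (vacuous); Chernoff ~0.078; exact ~0.018
```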
V.4 McDiarmid on Dropout
Problem. A network output $f$ depends on $d$ layer activations. Dropout independently zeroes each activation, changing the contribution of neuron $i$ by at most $c_i$. Bound $P(\lvert f - E[f]\rvert \geq 0.5)$.
Solution. By McDiarmid: $P(\lvert f - E[f]\rvert \geq 0.5) \leq 2\exp\!\left(-\frac{2 \times 0.5^2}{\sum_i c_i^2}\right)$. If the $c_i$ are of order one, so that $\sum_i c_i^2 \gg 1$, the exponent is near zero and the bound exceeds 1 - vacuous! This means either the $c_i$ are too large, or the deviation $t = 0.5$ is too small relative to the noise.
With larger networks and better-controlled weight norms ($\sum_i c_i^2 \approx 0.1$): $2e^{-5} \approx 0.013$. Now meaningful: the dropout output deviates from its mean by more than 0.5 less than 1.3% of the time.
Appendix W: Notation Reference for This Section
| Symbol | Meaning |
|---|---|
| $P(A)$ | Probability of event $A$ |
| $E[X]$, $\mu$ | Mean of $X$ |
| $\mathrm{Var}(X)$, $\sigma^2$ | Variance of $X$ |
| $M_X(s) = E[e^{sX}]$ | Moment generating function |
| $\bar{X}_n$ | Sample mean |
| $R(h)$ | True (population) risk of hypothesis $h$ |
| $\hat{R}(h)$ | Empirical risk of $h$ on training data |
| $\mathcal{H}$ | Hypothesis class |
| $d_{\mathrm{VC}}(\mathcal{H})$ | VC dimension of $\mathcal{H}$ |
| $\Pi_{\mathcal{H}}(n)$ | Growth function |
| $\mathcal{R}_n(\mathcal{H})$ | Rademacher complexity |
| $\hat{\mathcal{R}}_n(\mathcal{H})$ | Empirical Rademacher complexity |
| $\epsilon$ | Accuracy parameter (how close to optimal) |
| $\delta$ | Confidence parameter (failure probability) |
| $c_i$ | Bounded differences constant for coordinate $i$ |
| $\epsilon$-net | Covering set with spacing $\epsilon$ |
| $N(\epsilon)$ | Covering number: size of the smallest $\epsilon$-net |
| $\mathrm{KL}(Q \,\Vert\, P)$ | KL divergence of $Q$ from $P$ |
| $I(t)$ | Cramér rate function |
| $\sigma_i$ | Rademacher random variable |
| $B(h_0, r)$ | Ball of radius $r$ around hypothesis $h_0$ |
Appendix X: More on Rademacher Complexity
X.1 Computing Rademacher Complexity for Common Classes
Halfspaces with margin. For the class $\{x \mapsto \langle w, x\rangle : \lVert w\rVert_2 \leq W\}$ on data with $\lVert x_i\rVert_2 \leq B$:
$$\mathcal{R}_n \;\leq\; \frac{WB}{\sqrt{n}}.$$
The margin $\gamma$ effectively replaces the dimension $d$: a margin-$\gamma$ classifier only needs $W \approx 1/\gamma$, so the effective complexity is $(B/\gamma)^2$, not $d$. This is why SVMs with large margins generalise well in high dimensions.
Two-layer neural network. For $f(x) = \sum_{j=1}^m v_j\, \phi(\langle w_j, x\rangle)$ with a 1-Lipschitz activation $\phi$, $\lVert v\rVert_1 \leq B_1$ and $\lVert w_j\rVert_2 \leq B_2$:
$$\mathcal{R}_n \;\leq\; \frac{2\, B_1 B_2 \max_i \lVert x_i\rVert_2}{\sqrt{n}},$$
where the constant factor comes from the Lipschitz contraction step and the $\ell_1$ combination over the $m$ hidden units; the denominator shows the expected $1/\sqrt{n}$ rate. Notably, the width $m$ does not appear - only the norms do.
X.2 Rademacher Complexity and Mutual Information
A deep connection: generalisation can also be controlled by the mutual information between the training data and the learned predictor. For bounded losses, the expected generalisation gap satisfies (up to constants)
$$\bigl\lvert E[R(W) - \hat{R}(W)]\bigr\rvert \;\lesssim\; \sqrt{\frac{I(S; W)}{n}},$$
where $I(S; W)$ is the mutual information between the training set $S$ and the learned hypothesis $W$. This plays a role analogous to Rademacher complexity but is algorithm-dependent. It connects concentration theory to information theory - a richer theory being actively developed (2020-2026) under the name "information-theoretic generalisation bounds".
X.3 Monte Carlo Estimation of Rademacher Complexity
In practice, $\hat{\mathcal{R}}_n$ can be estimated by Monte Carlo; for the linear class $\{x \mapsto \langle w, x\rangle : \lVert w\rVert_2 \leq W\}$ the inner supremum has a closed form ($w \propto X^\top \sigma$):

```python
import numpy as np

def estimate_rademacher(X, W_norm=1.0, T=1000):
    """Monte Carlo estimate of the empirical Rademacher complexity for ||w||_2 <= W_norm."""
    n, d = X.shape
    rademachers = []
    for _ in range(T):
        sigma = np.random.choice([-1, 1], size=n)
        # The supremum over ||w||_2 <= W_norm is attained at w proportional to X^T sigma.
        w_opt = X.T @ sigma
        w_norm = np.linalg.norm(w_opt)
        if w_norm > 0:
            w_opt = w_opt * W_norm / w_norm
        rademachers.append((sigma @ (X @ w_opt)) / n)
    return np.mean(rademachers)
```
For a dataset of $n$ MNIST digits (784-dimensional) and a norm-bounded linear class: empirical $\hat{\mathcal{R}}_n \approx 0.038$. Rademacher bound: test error $\leq$ train error $+\, 2\hat{\mathcal{R}}_n + \sqrt{\ln(1/\delta)/(2n)} \approx$ train error $+\, 0.076 + 0.055$. For 2% train error: test error $\lesssim 15\%$ - loose but not vacuous!
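A quick sanity check of the estimator on synthetic data (illustrative sizes; it assumes the `estimate_rademacher` function above; rows are normalised so the theoretical bound $WB/\sqrt{n}$ is easy to compare against):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 784))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # normalise rows so B = 1

print(estimate_rademacher(X, W_norm=1.0, T=200))    # Monte Carlo estimate
print(1.0 / np.sqrt(500))                           # theory: at most W*B/sqrt(n) ~ 0.045
```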
Appendix Y: Bounds in the Context of Modern ML Systems
Y.1 LLM Benchmark Evaluation Standards
As of 2026, serious LLM evaluations use concentration-aware practices:
HELM (Holistic Evaluation of Language Models): Uses 1000+ examples per scenario, reporting bootstrap confidence intervals. Assumes bounded loss and applies Hoeffding-style reasoning.
BIG-bench: 200+ tasks, with individual tasks having 100-10,000 examples. Standard errors are reported, equivalent to reporting $\hat\sigma/\sqrt{n}$, which matches the $1/\sqrt{n}$ leading term in a Hoeffding-style CI.
LiveBench: uses human pairwise comparisons (Elo-style). The Elo update rule has a concentration interpretation: the rating gap that can be certified as significant shrinks as $1/\sqrt{\text{number of comparisons}}$.
Statistical significance in model comparisons. With $n$ test examples and two models achieving accuracies $\hat{a}_1$ and $\hat{a}_2$:
- Paired comparison: use McNemar's test (exact)
- Unpaired: significance threshold $\lvert \hat{a}_1 - \hat{a}_2\rvert > \sqrt{\ln(2/\delta)/n}$ (Hoeffding, each accuracy estimated from $n$ examples)
- For example, at $n = 10{,}000$ and $\delta = 0.05$ this threshold is about $0.019$: differences below roughly 2 percentage points are not significant (a helper for this calculation is sketched below)
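A sketch of the unpaired significance threshold as a helper (using the Hoeffding-style formula above; `min_significant_gap` is our name):

```python
import math

def min_significant_gap(n, delta=0.05):
    """Smallest accuracy gap between two models, each scored on n examples, that a
    two-sided Hoeffding bound can certify at confidence 1 - delta (unpaired case)."""
    return math.sqrt(math.log(2 / delta) / n)

for n in [1_000, 10_000, 100_000]:
    print(n, round(min_significant_gap(n), 4))
# 1000 -> ~0.061, 10000 -> ~0.019, 100000 -> ~0.006
```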
Y.2 Fine-Tuning Sample Complexity via PAC-Bayes
When fine-tuning a pretrained LLM on task-specific data:
Prior $P$: centred at the pre-trained weights $w_0$, obtained from language-model pre-training on internet-scale data.
Posterior $Q$: centred at the fine-tuned weights $w$ after training on $n$ task examples.
PAC-Bayes bound:
$$E_{h \sim Q}[R(h)] \;\leq\; E_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{\mathrm{KL}(Q \,\Vert\, P) + \ln(n/\delta)}{2(n-1)}}.$$
The KL term is small when fine-tuning doesn't move far from pre-training - exactly what LoRA enforces by constraining the update to a low-rank factorisation $\Delta W = BA$. With rank $r$, the update has only $r(d_{\text{in}} + d_{\text{out}})$ degrees of freedom instead of $d_{\text{in}} d_{\text{out}}$, and its norm (hence the KL term) stays small.
PAC-Bayes thus gives a theoretical account of why LoRA can generalise better than full fine-tuning on small datasets.
Y.3 Privacy-Preserving ML via Concentration
Differential Privacy (DP) and concentration inequalities are deeply connected. An algorithm is $(\epsilon, \delta)$-DP if changing one training example changes the probability of any output event by at most a multiplicative factor $e^{\epsilon}$, up to an additive slack $\delta$.
DP-SGD (Abadi et al., 2016): clips each per-example gradient to norm $C$ and adds Gaussian noise $\mathcal{N}(0, \sigma^2 C^2 I)$. The privacy guarantee follows from the Gaussian mechanism; the utility (convergence) analysis uses concentration inequalities on the noisy gradients.
McDiarmid connection: DP is a strong, distributional form of algorithmic stability (each example can change the output distribution only by a bounded amount). The same bounded-differences reasoning behind McDiarmid's inequality underlies both the privacy guarantee and stability-based generalisation bounds.
Appendix Z: Complete Theorem Index
This appendix lists every named result in this section for quick reference.
Moment-Based Inequalities
Theorem 2.1 (Markov). $X \geq 0$, $t > 0$: $P(X \geq t) \leq \dfrac{E[X]}{t}$.
Theorem 2.2 (Chebyshev). $t > 0$: $P(\lvert X - E[X]\rvert \geq t) \leq \dfrac{\mathrm{Var}(X)}{t^2}$.
Theorem 2.3 (Cantelli). One-sided: $P(X - E[X] \geq t) \leq \dfrac{\sigma^2}{\sigma^2 + t^2}$.
Sub-Gaussian Inequalities
Definition 3.1 (Sub-Gaussian). $X$ is $\sigma^2$-sub-G if $E[e^{s(X - E[X])}] \leq e^{s^2\sigma^2/2}$ for all $s \in \mathbb{R}$.
Theorem 3.2 (Sub-Gaussian tail). $P(X - E[X] \geq t) \leq e^{-t^2/(2\sigma^2)}$.
Theorem 3.3 (Sub-Gaussian sum). Independent $\sigma_i^2$-sub-G variables: the sum is $\bigl(\sum_i \sigma_i^2\bigr)$-sub-G.
Hoeffding-Type Inequalities
Lemma 4.1 (Hoeffding's lemma). $X \in [a, b]$, $E[X] = 0$: $E[e^{sX}] \leq e^{s^2(b-a)^2/8}$.
Theorem 4.2 (Hoeffding). Independent $X_i \in [a_i, b_i]$: $P\bigl(\sum_i (X_i - E[X_i]) \geq t\bigr) \leq \exp\!\Bigl(-\tfrac{2t^2}{\sum_i (b_i - a_i)^2}\Bigr)$.
Theorem 4.3 (Two-sided Hoeffding). $P\bigl(\bigl\lvert \sum_i (X_i - E[X_i])\bigr\rvert \geq t\bigr) \leq 2\exp\!\Bigl(-\tfrac{2t^2}{\sum_i (b_i - a_i)^2}\Bigr)$.
Chernoff Bounds
Theorem 5.1 (Chernoff method). $P(X \geq t) \leq \inf_{s > 0} e^{-st}\, E[e^{sX}]$.
Theorem 5.2 (Chernoff upper tail). $S \sim \mathrm{Bin}(n, p)$, $q > p$: $P(S \geq qn) \leq e^{-n\,\mathrm{KL}(q \,\Vert\, p)}$.
Theorem 5.3 (Chernoff multiplicative). $P(S \geq (1+\delta)\mu) \leq e^{-\delta^2\mu/3}$ for $0 < \delta \leq 1$, where $\mu = E[S]$.
Theorem 5.4 (Chernoff lower tail). $P(S \leq (1-\delta)\mu) \leq e^{-\delta^2\mu/2}$.
Bernstein and McDiarmid
Theorem 6.2 (Bernstein). $\lvert X_i - E[X_i]\rvert \leq M$, variances $\sigma_i^2$: $P\bigl(\sum_i (X_i - E[X_i]) \geq t\bigr) \leq \exp\!\Bigl(-\tfrac{t^2}{2(\sum_i \sigma_i^2 + Mt/3)}\Bigr)$.
Theorem 7.2 (McDiarmid). $f$ with $c_i$-bounded differences: $P(f - E[f] \geq t) \leq \exp\!\Bigl(-\tfrac{2t^2}{\sum_i c_i^2}\Bigr)$.
Theorem S.1 (Azuma). Martingale with $\lvert M_k - M_{k-1}\rvert \leq c_k$: $P(M_n - M_0 \geq t) \leq \exp\!\Bigl(-\tfrac{t^2}{2\sum_k c_k^2}\Bigr)$.
Statistical Learning Theory
Lemma 8.1 (Union bound). $P\bigl(\bigcup_i A_i\bigr) \leq \sum_i P(A_i)$.
Lemma 8.3 (Covering number). Unit $\ell_2$-ball in $\mathbb{R}^d$: $N(\epsilon) \leq (3/\epsilon)^d$.
Theorem 9.2 (Finite-class PAC). W.p. $\geq 1 - \delta$: $\sup_{h \in \mathcal{H}} \lvert \hat{R}(h) - R(h)\rvert \leq \sqrt{\tfrac{\ln(2\lvert\mathcal{H}\rvert/\delta)}{2n}}$ (uniform convergence).
Lemma 9.4 (Sauer-Shelah). $\Pi_{\mathcal{H}}(n) \leq \bigl(\tfrac{en}{d}\bigr)^d$ for $n \geq d = d_{\mathrm{VC}}(\mathcal{H})$.
Theorem 9.5 (VC bound). W.p. $\geq 1 - \delta$: $R(h) \leq \hat{R}(h) + O\Bigl(\sqrt{\tfrac{d_{\mathrm{VC}}\ln(n/d_{\mathrm{VC}}) + \ln(1/\delta)}{n}}\Bigr)$.
Theorem 10.2 (Rademacher bound). W.p. $\geq 1 - \delta$: $R(h) \leq \hat{R}(h) + 2\mathcal{R}_n(\mathcal{H}) + \sqrt{\tfrac{\ln(1/\delta)}{2n}}$.
Theorem E.4 (PAC-Bayes). W.p. $\geq 1 - \delta$, for all posteriors $Q$: $E_Q[R] \leq E_Q[\hat{R}] + \sqrt{\tfrac{\mathrm{KL}(Q \,\Vert\, P) + \ln(n/\delta)}{2(n-1)}}$.
Theorem J.2 (Stability). $\beta$-stable algorithm: $E[R - \hat{R}] \leq \beta$ (high-probability versions follow via McDiarmid).
This completes Section05 Concentration Inequalities. The material here is used extensively in Chapter 7: Statistics for confidence intervals and hypothesis tests, and throughout learning theory.
<- Back to Chapter 6: Probability Theory | Next: Stochastic Processes ->
Supplementary: Concentration in High Dimensions
The Curse and Blessing of Dimensionality
High-dimensional geometry is counterintuitive. In $\mathbb{R}^d$ with large $d$:
- Almost all the volume of a ball lies near its surface
- Random unit vectors are nearly orthogonal: $\lvert\langle u, v\rangle\rvert = O(1/\sqrt{d})$ w.h.p.
- The norm of a Gaussian vector concentrates: $\lVert X\rVert_2 = \sqrt{d} + O(1)$ w.h.p. for $X \sim \mathcal{N}(0, I_d)$
These facts follow directly from concentration inequalities applied to sums of squares.
Gaussian norm concentration. Let $X \sim \mathcal{N}(0, I_d)$. Then $\lVert X\rVert_2^2 = \sum_{i=1}^d X_i^2$ where $X_i^2 \sim \chi_1^2$. By Bernstein's inequality (since $X_i^2 - 1$ is sub-exponential):
$$P\bigl(\bigl\lvert\, \lVert X\rVert_2^2 - d \,\bigr\rvert \geq t\bigr) \;\leq\; 2\exp\!\left(-c \min\!\left(\tfrac{t^2}{d},\, t\right)\right)$$
for a universal constant $c > 0$.
Implication for dot products. For two independent $q, k \sim \mathcal{N}(0, I_d)$: $\langle q, k\rangle$ has mean $0$ and standard deviation $\sqrt{d}$, so $\langle q, k\rangle/\sqrt{d}$ is $O(1)$ with high probability.
This is why scaled dot-product attention uses $\langle q, k\rangle/\sqrt{d_k}$ - it keeps logits $O(1)$ despite $d_k$-dimensional embeddings. A quick simulation is sketched below.
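A quick simulation of both facts - norm concentration around $\sqrt{d}$ and the $\sqrt{d}$ scale of dot products (dimensions and sample counts are illustrative):

```python
import numpy as np

# Gaussian norms hug sqrt(d), and dot products of independent Gaussian vectors have
# standard deviation sqrt(d), so dividing by sqrt(d) keeps attention-style logits O(1).
rng = np.random.default_rng(0)
for d in [64, 512, 4096]:
    x = rng.normal(size=(1000, d))
    y = rng.normal(size=(1000, d))
    norms = np.linalg.norm(x, axis=1)
    dots = np.einsum("ij,ij->i", x, y)
    print(f"d={d}: ||x|| = {norms.mean():.1f} +/- {norms.std():.2f} (sqrt(d)={np.sqrt(d):.1f}), "
          f"<x,y>/sqrt(d) std = {np.std(dots / np.sqrt(d)):.2f}")
```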
Geometry of Softmax
Let $q$ be the query and $k_1, \ldots, k_m$ the keys in a transformer. The attention weight on key $j$ is:
$$\alpha_j = \frac{\exp\!\bigl(\langle q, k_j\rangle/\sqrt{d_k}\bigr)}{\sum_{l=1}^m \exp\!\bigl(\langle q, k_l\rangle/\sqrt{d_k}\bigr)}.$$
Without the $\sqrt{d_k}$ scaling, the logits have variance of order $d_k$ (for random vectors with unit-variance entries), pushing softmax into saturated, near-zero gradient regions - a form of implicit dimension dependence that concentration inequalities make precise.
Covering Numbers in Transformer Weight Space
The number of parameters in a transformer layer is of order $d_{\text{model}}^2$. For LoRA with rank $r$, the effective dimensionality is $r(d_{\text{in}} + d_{\text{out}}) \ll d_{\text{in}} d_{\text{out}}$. By the $\epsilon$-covering number bound for a norm ball of dimension $D$, $\log N(\epsilon) \leq D \log(3/\epsilon)$, the metric entropy of the LoRA update class scales with $r(d_{\text{in}} + d_{\text{out}})$ rather than $d_{\text{in}} d_{\text{out}}$.
This converts directly to a PAC generalisation bound showing that LoRA generalises with excess risk $\tilde{O}\bigl(\sqrt{r(d_{\text{in}} + d_{\text{out}})/n}\bigr)$ - substantially better than $\tilde{O}\bigl(\sqrt{d_{\text{in}} d_{\text{out}}/n}\bigr)$ for full fine-tuning. Concentration inequalities thus provide the theoretical foundation for parameter-efficient fine-tuning.
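A back-of-the-envelope comparison of the effective dimensionality (and hence the metric-entropy term) for full fine-tuning versus LoRA on a single weight matrix (the dimensions, rank, and $\epsilon$ below are illustrative):

```python
import math

# Illustrative dimensions for one square weight matrix and a rank-8 LoRA update.
d_in, d_out, r, eps = 4096, 4096, 8, 0.1

full_params = d_in * d_out                  # ~16.8M trainable parameters
lora_params = r * (d_in + d_out)            # ~65.5K trainable parameters

# Metric-entropy bound log N(eps) <= D * log(3/eps) with D the effective dimensionality.
log_cover_full = full_params * math.log(3 / eps)
log_cover_lora = lora_params * math.log(3 / eps)

print(full_params, lora_params, full_params / lora_params)   # parameter ratio = 256
print(log_cover_full / log_cover_lora)                       # same ratio in the entropy bound
```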
Summary: Why This Section Matters
Concentration inequalities are the engine behind every finite-sample guarantee in machine learning. They explain:
- Why more data helps: bounds shrink as $1/\sqrt{n}$
- Why simpler models generalise: bounds grow with the VC dimension $d_{\mathrm{VC}}$ or the Rademacher complexity $\mathcal{R}_n(\mathcal{H})$
- Why high dimensions are manageable: Gaussian concentration keeps norms stable around $\sqrt{d}$
- Why LoRA works: Rademacher complexity of low-rank updates is provably small
- Why LLM benchmarks need many examples: tight CIs on accuracy require $n \propto 1/\epsilon^2$, i.e. thousands of examples for percentage-point precision
The progression Markov -> Chebyshev -> sub-Gaussian -> Chernoff -> McDiarmid is not merely a tour of tricks - it is a systematic strengthening of assumptions that yields exponentially tighter guarantees, mirroring the progress from "some data" to "structured data" in modern ML.
End of Section05 Concentration Inequalities
Next: Section06 Stochastic Processes - where concentration inequalities meet martingales in continuous time.
Back: Section04 Expectation and Moments - the moment-based tools (MGF, Jensen) used throughout this section.
Chapter: Section06 Probability Theory README