"The tendency of random quantities to concentrate near their expectations is one of the most powerful and pervasive phenomena in all of mathematics."
- Stéphane Boucheron, Gábor Lugosi & Pascal Massart, Concentration Inequalities (2013)
Overview
A random variable $X$ has an expectation $\mathbb{E}[X]$, but individual realisations scatter around it. Concentration inequalities answer the quantitative question: how far from $\mathbb{E}[X]$ can $X$ stray, and with what probability? These are not mere abstract bounds - they are the mathematical engine behind almost every theoretical guarantee in machine learning.
Every time a practitioner asks "how many training examples do I need to achieve 95% accuracy with 99% confidence?", the answer is a concentration inequality. Every time a theorist proves that a learning algorithm generalises from training data to unseen data, the proof uses a concentration inequality. The PAC learning framework, VC theory, Rademacher complexity, and modern neural tangent kernel theory all rest on the same probabilistic foundation built in this section.
The central insight is a hierarchy: weaker assumptions yield looser bounds, stronger assumptions yield tighter ones. Markov's inequality needs only non-negativity and a finite mean. Chebyshev's needs a finite variance. Hoeffding's needs bounded support. Chernoff's needs a moment generating function. The practitioner's art is matching the right bound to the right problem.
Prerequisites
- Section04 Expectation and Moments - $\mathbb{E}[X]$, $\mathrm{Var}(X)$, the MGF $M_X(s)$, Jensen's inequality, indicator random variables
- Section03 Joint Distributions - independence, conditional distributions
- Section02 Common Distributions - Gaussian, Bernoulli, Binomial
- Section01 Introduction and Random Variables - probability axioms, union bound preview
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive derivations: Markov, Chebyshev, Hoeffding, Chernoff, McDiarmid, PAC bounds, Rademacher complexity |
| exercises.ipynb | 10 graded exercises from tail bound verification to VC dimension and Rademacher estimation |
Learning Objectives
After completing this section, you will be able to:
- State and prove Markov's inequality and identify when it is tight
- Derive Chebyshev's inequality from Markov and apply the $k$-sigma rule quantitatively
- Define sub-Gaussian random variables and prove the sub-Gaussian tail bound
- State and prove Hoeffding's inequality for bounded independent random variables
- Apply the Chernoff method to derive tail bounds via MGF optimisation
- State Bernstein's inequality and explain when it improves on Hoeffding
- Apply McDiarmid's inequality to functions of independent random variables
- Use the union bound with covering arguments to control events over hypothesis classes
- Derive the PAC generalisation bound for finite hypothesis classes
- Define VC dimension and state the generalisation bound with Sauer-Shelah lemma
- Define Rademacher complexity and interpret the Rademacher generalisation bound
- Compute required sample sizes for given $(\varepsilon, \delta)$-PAC guarantees
Table of Contents
- 1. Intuition
- 2. Markov's and Chebyshev's Inequalities
- 3. Sub-Gaussian Random Variables
- 4. Hoeffding's Inequality
- 5. Chernoff Bounds
- 6. Bernstein's Inequality
- 7. McDiarmid's Inequality
- 8. The Union Bound and Covering Arguments
- 9. PAC Learning and Generalisation Bounds
- 10. Rademacher Complexity
- 11. ML Applications
- 12. Common Mistakes
- 13. Exercises
- 14. Why This Matters for AI (2026 Perspective)
- 15. Conceptual Bridge
1. Intuition
1.1 What Is Concentration?
A random variable $X$ has a mean $\mu = \mathbb{E}[X]$. But knowing the mean alone tells us nothing about how a single draw is distributed around it. Concentration inequalities fill this gap by bounding the probability that $X$ deviates far from $\mu$.
The fundamental question is: given what we know about $X$ (its mean, variance, boundedness, or MGF), what is the tightest bound we can place on $P(|X - \mu| \geq t)$?
This matters deeply in practice. In machine learning, we observe the empirical mean of a loss function on training data and want to know how well it approximates the true (population) mean. If $L_1, \dots, L_n$ are i.i.d. losses with mean $\mu$ and we observe $\bar{L}_n = \frac{1}{n}\sum_{i=1}^n L_i$, concentration inequalities tell us how close $\bar{L}_n$ is to $\mu$ with high probability. This is the foundation of statistical learning theory.
The sample mean concentrates around the true mean. As $n$ grows, the deviation $|\bar{L}_n - \mu|$ becomes increasingly small with high probability. This is the content of the Law of Large Numbers - but concentration inequalities give quantitative, finite-$n$ bounds, not just asymptotic guarantees.
For AI: Every evaluation metric in ML (accuracy, F1, BLEU, ROUGE, perplexity) is estimated from a finite test set. Concentration inequalities tell us how many test examples we need to trust the estimate. HELM benchmarks for LLMs, for instance, implicitly rely on Hoeffding-type bounds for the confidence intervals around model comparisons.
1.2 The Hierarchy of Bounds
Concentration inequalities form a clear hierarchy. Moving down requires stronger assumptions but yields exponentially tighter bounds:
CONCENTRATION INEQUALITY HIERARCHY
========================================================================
Assumption                 Inequality        Tail bound form
------------------------------------------------------------------------
E[X] < ∞, X ≥ 0            Markov            P(X ≥ t) ≤ μ/t
                                             (polynomial decay)
Var(X) < ∞                 Chebyshev         P(|X−μ| ≥ t) ≤ σ²/t²
                                             (polynomial decay)
E[e^{sX}] < ∞              Chernoff method   P(X ≥ t) ≤ min_s e^{−st} M(s)
                                             (generic exponential)
X ∈ [a,b] a.s.             Hoeffding         P(X̄−μ ≥ t) ≤ exp(−2nt²/(b−a)²)
(sub-Gaussian)                               (exponential decay)
Var known + bounded        Bernstein         exp(−nt²/(2σ² + 2ct/3))
(variance-aware)                             (tighter for small t)
f(X₁,…,Xₙ) with            McDiarmid         exp(−2t²/Σᵢ cᵢ²)
bounded differences                          (functions of independent Xᵢ)
========================================================================
The key insight: moment-based bounds (Markov, Chebyshev) give polynomial tails $O(1/t^k)$, while MGF-based bounds give exponential tails $O(e^{-ct^2})$. Exponential tails are vastly superior - a Gaussian has $P(|X - \mu| \geq 3\sigma) \approx 0.0027$, while Chebyshev gives only $1/9 \approx 0.111$.
1.3 Historical Timeline
| Year | Person | Contribution |
|---|---|---|
| 1867 | Chebyshev | Proved the inequality bearing his name; used it to prove the weak LLN |
| 1899 | Markov (student of Chebyshev) | Simpler proof via indicator argument; Markov's inequality |
| 1952 | Herman Chernoff | MGF-based exponential tail bounds for sums of independent variables |
| 1963 | Wassily Hoeffding | Tight bounds for bounded variables; modern form used everywhere |
| 1971 | Vapnik & Chervonenkis | VC dimension theory; combinatorial approach to generalisation |
| 1984 | Leslie Valiant | PAC learning framework - theoretical model for machine learning |
| 1989 | Colin McDiarmid | Bounded differences inequality for general functions of independent variables |
| 2002 | Bartlett & Mendelson | Rademacher complexity - data-dependent alternative to VC theory |
| 2013 | Boucheron, Lugosi & Massart | Comprehensive monograph unifying modern concentration theory |
1.4 Why Concentration Matters for ML
Generalisation: A neural network achieves 99% training accuracy. Does it generalise? Concentration inequalities bound $|R(h) - \hat{R}_n(h)|$ - the gap between true and empirical risk. They tell us when we can trust training metrics.
SGD analysis: Stochastic gradient descent uses a mini-batch estimate $\hat{g}$ of the true gradient $\nabla L(\theta)$. Concentration inequalities bound $\|\hat{g} - \nabla L(\theta)\|$, determining the noise level and required step sizes for convergence.
Confidence intervals: When evaluating an LLM on 1000 benchmark questions with accuracy 73.2%, concentration bounds tell us the confidence interval around this estimate.
Random features: The Rahimi-Recht random features method approximates kernel matrices using random projections. How many random features are needed? The answer comes from a Hoeffding bound.
Dropout: Dropout randomly zeroes activations. Each neuron's contribution to the output is bounded, so McDiarmid's inequality bounds the deviation of the output from its mean under dropout noise.
2. Markov's and Chebyshev's Inequalities
2.1 Markov's Inequality
Theorem (Markov's Inequality). Let $X$ be a non-negative random variable with $\mathbb{E}[X] < \infty$. For any $t > 0$:
$$P(X \geq t) \leq \frac{\mathbb{E}[X]}{t}$$
Proof. This follows immediately from the indicator bound $t \cdot \mathbf{1}\{X \geq t\} \leq X$, valid for $X \geq 0$ and $t > 0$: taking expectations gives $t \, P(X \geq t) \leq \mathbb{E}[X]$.
This proof is a master class in the indicator trick: multiply by 1, then bound the indicator.
Tightness. Markov's bound is tight. Consider $X = t$ with probability $\mu/t$ and $X = 0$ otherwise. Then $\mathbb{E}[X] = \mu$ and $P(X \geq t) = \mu/t$ - exactly matching the bound.
Extensions. Markov applies to any non-negative function of $X$: for any non-decreasing $\phi \geq 0$ with $\phi(t) > 0$,
$$P(X \geq t) \leq \frac{\mathbb{E}[\phi(X)]}{\phi(t)}$$
Setting $\phi(x) = x^k$ gives the $k$-th moment bound: $P(|X| \geq t) \leq \mathbb{E}[|X|^k]/t^k$. Setting $\phi(x) = e^{sx}$ gives the Chernoff method.
For AI: Markov's inequality underlies gradient clipping analysis. If $\mathbb{E}[\|g\|] = G$, then $P(\|g\| \geq 10G) \leq 1/10$ - so gradient norms exceed $10G$ on at most 10% of steps.
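A quick numerical check of the bound and its looseness - a sketch, not part of the companion notebooks, using Exponential(1) draws where the exact tail $e^{-t}$ is known:

```python
import numpy as np

# Sanity check of Markov's inequality via Monte Carlo. X ~ Exponential(1),
# so E[X] = 1 and P(X >= t) = e^{-t} exactly; Markov only gives 1/t.
rng = np.random.default_rng(0)
x = rng.exponential(1.0, size=1_000_000)

markov_ok = []
for t in (1.0, 2.0, 5.0):
    tail = (x >= t).mean()      # empirical P(X >= t), close to e^{-t}
    bound = x.mean() / t        # Markov bound E[X]/t
    markov_ok.append(tail <= bound)
```

The bound always holds but is loose: at $t = 5$ the true tail is $e^{-5} \approx 0.0067$ while Markov only guarantees $0.2$.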
2.2 Chebyshev's Inequality
Theorem (Chebyshev's Inequality). For any random variable $X$ with $\mathbb{E}[X] = \mu$ and $\mathrm{Var}(X) = \sigma^2 < \infty$, for any $t > 0$:
$$P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}$$
Equivalently, setting $t = k\sigma$: $P(|X - \mu| \geq k\sigma) \leq 1/k^2$.
Proof. Apply Markov to the non-negative variable $(X - \mu)^2$:
$$P(|X - \mu| \geq t) = P\left((X - \mu)^2 \geq t^2\right) \leq \frac{\mathbb{E}[(X - \mu)^2]}{t^2} = \frac{\sigma^2}{t^2}$$
The $k$-sigma rule. Chebyshev guarantees: at least $1 - 1/k^2$ of the probability mass lies within $k\sigma$ of the mean. For any distribution with finite variance:
| $k$ | Chebyshev bound on $P(|X - \mu| \geq k\sigma)$ | Gaussian actual |
|---|---|---|
| 2 | $\leq 25\%$ | $4.55\%$ |
| 3 | $\leq 11.1\%$ | $0.27\%$ |
| 4 | $\leq 6.25\%$ | $0.0063\%$ |
| 5 | $\leq 4\%$ | $0.000057\%$ |
For AI: Chebyshev justifies batch normalisation. If activations have mean 0 and variance 1, then at most $1/k^2$ of activations lie beyond $\pm k$ - this quantifies the scale-normalisation benefit.
Recall: The variance and its properties were developed fully in Section04 Expectation and Moments. We use it here as a plug-in parameter.
2.3 One-Sided Chebyshev (Cantelli's Inequality)
Chebyshev bounds both tails symmetrically. Cantelli's inequality gives a tighter one-sided bound:
Theorem (Cantelli). For any $t > 0$:
$$P(X - \mu \geq t) \leq \frac{\sigma^2}{\sigma^2 + t^2}$$
Proof. For any $u \geq 0$, by Markov applied to $(X - \mu + u)^2$:
$$P(X - \mu \geq t) \leq \frac{\mathbb{E}[(X - \mu + u)^2]}{(t + u)^2} = \frac{\sigma^2 + u^2}{(t + u)^2}$$
Optimise over $u$: setting $u = \sigma^2/t$ yields $\sigma^2/(\sigma^2 + t^2)$.
For large $t$: Chebyshev gives $\sigma^2/t^2$ for both tails together, while Cantelli gives the one-sided $\sigma^2/(\sigma^2 + t^2) \approx \sigma^2/t^2$ - roughly the same. But for small $t$, Cantelli's bound stays below 1, whereas Chebyshev's becomes vacuous.
2.4 Limitations of Moment-Based Bounds
Markov and Chebyshev have polynomial tails. The Gaussian has $P(|X - \mu| \geq 3\sigma) \approx 0.0027$, but Chebyshev gives only $1/9 \approx 0.111$ - a roughly 40x overestimate. For $t = 5\sigma$, Chebyshev gives $1/25 = 0.04$, while the true Gaussian probability is about $5.7 \times 10^{-7}$.
Why the gap? Chebyshev uses only the first two moments. A distribution can put mass $\sigma^2/(2t^2)$ at each of $\mu \pm t$ and the rest at $\mu$, matching the bound exactly - that is the worst case for Chebyshev. Real distributions with bounded support or thin tails concentrate far more.
When Chebyshev is the right tool:
- Distribution-free bounds (you don't know the family)
- Heavy-tailed distributions (Pareto, $t$-distribution with low degrees of freedom)
- Quick estimates without assuming sub-Gaussianity
3. Sub-Gaussian Random Variables
3.1 Definition and MGF Condition
Sub-Gaussian random variables are those whose tails decay at least as fast as a Gaussian. The key condition is on the MGF.
Definition. A mean-zero random variable $X$ is $\sigma$-sub-Gaussian if for all $s \in \mathbb{R}$:
$$\mathbb{E}[e^{sX}] \leq e^{\sigma^2 s^2 / 2}$$
The parameter $\sigma^2$ is the sub-Gaussian parameter (also called the variance proxy). Note: $\sigma^2$ need not equal $\mathrm{Var}(X)$, though $\mathrm{Var}(X) \leq \sigma^2$ always holds (by differentiating the MGF bound twice at $s = 0$).
Why this condition? Recall from Section04 that the MGF of $\mathcal{N}(0, \sigma^2)$ is exactly $e^{\sigma^2 s^2/2}$. The sub-Gaussian condition says the MGF of $X$ is dominated by that of a Gaussian with variance $\sigma^2$. This immediately implies Gaussian-like tail bounds.
Equivalent characterisations (for mean-zero $X$, up to constants in $\sigma$):
- MGF condition: $\mathbb{E}[e^{sX}] \leq e^{\sigma^2 s^2/2}$ for all $s$
- Tail condition: $P(|X| \geq t) \leq 2e^{-t^2/(2\sigma^2)}$ for all $t > 0$
- Moment condition: $\left(\mathbb{E}[|X|^k]\right)^{1/k} \leq C\sigma\sqrt{k}$ for all $k \geq 1$
3.2 Examples of Sub-Gaussian Variables
Gaussian. $X \sim \mathcal{N}(0, \sigma^2)$ is $\sigma$-sub-Gaussian. The MGF equals $e^{\sigma^2 s^2/2}$ exactly - Gaussian is the equality case.
Bounded random variable. If $X \in [a, b]$ almost surely with $\mathbb{E}[X] = 0$, then $X$ is $\frac{b-a}{2}$-sub-Gaussian. This is Hoeffding's lemma - proved in Section 4.
Rademacher variable. $\varepsilon = \pm 1$ with equal probability. Then $\mathbb{E}[e^{s\varepsilon}] = \cosh(s) \leq e^{s^2/2}$, so $\varepsilon$ is 1-sub-Gaussian. Rademacher variables appear in Rademacher complexity (Section 10).
Bernoulli centered. $X = B - p$ with $B \sim \mathrm{Bernoulli}(p)$. Since $X \in [-p, 1-p]$, it is $\frac{1}{2}$-sub-Gaussian by Hoeffding's lemma.
Non-example. Heavy-tailed distributions are not sub-Gaussian. A $t$-distribution with $\nu$ degrees of freedom has $\mathbb{E}[e^{sX}] = \infty$ for all $s \neq 0$ (and $\mathbb{E}[|X|^k] = \infty$ for $k \geq \nu$). A Pareto distribution is not sub-Gaussian regardless of parameters.
3.3 The Sub-Gaussian Tail Bound
Theorem. If $X$ is $\sigma$-sub-Gaussian with $\mathbb{E}[X] = 0$, then for all $t > 0$:
$$P(X \geq t) \leq e^{-t^2/(2\sigma^2)}$$
Proof (Chernoff method). For any $s > 0$:
$$P(X \geq t) \leq e^{-st}\,\mathbb{E}[e^{sX}] \leq e^{-st + \sigma^2 s^2/2}$$
Optimise over $s$: the minimiser is $s = t/\sigma^2$. Substituting:
$$P(X \geq t) \leq e^{-t^2/\sigma^2 + t^2/(2\sigma^2)} = e^{-t^2/(2\sigma^2)}$$
This proof pattern - Markov + MGF bound + optimise - is the Chernoff method and will recur throughout this section.
3.4 Closure Under Sums
Theorem. If $X_1, \dots, X_n$ are independent and mean-zero, with $X_i$ being $\sigma_i$-sub-Gaussian, then $\sum_{i} X_i$ is $\sqrt{\sum_i \sigma_i^2}$-sub-Gaussian.
Proof. By independence:
$$\mathbb{E}\left[e^{s\sum_i X_i}\right] = \prod_{i=1}^n \mathbb{E}[e^{sX_i}] \leq \prod_{i=1}^n e^{\sigma_i^2 s^2/2} = e^{\left(\sum_i \sigma_i^2\right)s^2/2}$$
Corollary. For i.i.d. $\sigma$-sub-Gaussian variables, the sample mean satisfies $P(\bar{X}_n - \mu \geq t) \leq e^{-nt^2/(2\sigma^2)}$.
For AI: Mini-batch gradient estimation. If each sample's contribution to the gradient is sub-Gaussian with parameter $\sigma$, then the mini-batch gradient averaged over $B$ samples is $\sigma/\sqrt{B}$-sub-Gaussian - the noise decreases as $1/\sqrt{B}$.
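A minimal simulation of the corollary, assuming Rademacher summands (which are 1-sub-Gaussian), so the sample mean of $n$ of them satisfies $P(\bar{X}_n \geq t) \leq e^{-nt^2/2}$:

```python
import numpy as np

# Check the sub-Gaussian tail bound for the mean of n Rademacher variables.
# Each is 1-sub-Gaussian, so P(Xbar >= t) <= exp(-n t^2 / 2).
rng = np.random.default_rng(1)
n, trials, t = 100, 50_000, 0.2

xbar = rng.choice([-1.0, 1.0], size=(trials, n)).mean(axis=1)
empirical = (xbar >= t).mean()      # true tail is ~0.03 here
bound = np.exp(-n * t**2 / 2)       # exp(-2) ~ 0.135
```

The bound holds with room to spare; the Chernoff-style argument trades a constant factor for full generality.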
4. Hoeffding's Inequality
4.1 Hoeffding's Lemma
Hoeffding's lemma is the core technical ingredient: it establishes that any bounded, mean-zero random variable is sub-Gaussian.
Lemma (Hoeffding, 1963). Let $X$ be a random variable with $\mathbb{E}[X] = 0$ and $X \in [a, b]$ almost surely. Then for all $s \in \mathbb{R}$:
$$\mathbb{E}[e^{sX}] \leq e^{s^2(b-a)^2/8}$$
In other words, $X$ is $\frac{b-a}{2}$-sub-Gaussian.
Proof sketch. Since $[a, b]$ is a bounded interval and $x \mapsto e^{sx}$ is convex, we can bound $e^{sx}$ by the chord from $(a, e^{sa})$ to $(b, e^{sb})$:
$$e^{sx} \leq \frac{b - x}{b - a}\,e^{sa} + \frac{x - a}{b - a}\,e^{sb} \quad \text{for } x \in [a, b]$$
Taking expectations (using $\mathbb{E}[X] = 0$):
$$\mathbb{E}[e^{sX}] \leq \frac{b}{b-a}\,e^{sa} - \frac{a}{b-a}\,e^{sb} = e^{\psi(u)}$$
where $u = s(b-a)$ and $\psi$ is an explicit convex function with $\psi(0) = \psi'(0) = 0$. Expanding via Taylor and bounding $\psi''(u) \leq 1/4$ gives $\psi(u) \leq u^2/8$.
Why $(b-a)^2/8$? The factor comes from the worst case: a symmetric two-point distribution on the endpoints, with mean centred in the interval. For a Bernoulli minus its mean, $b - a = 1$ and the sub-Gaussian parameter is $1/2$.
4.2 Hoeffding's Inequality
Theorem (Hoeffding, 1963). Let $X_1, \dots, X_n$ be independent random variables with $\mathbb{E}[X_i] = \mu_i$ and $X_i \in [a_i, b_i]$ almost surely. Then for all $t > 0$:
$$P\left(\sum_{i=1}^n (X_i - \mu_i) \geq t\right) \leq \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$$
Proof. Let $Z_i = X_i - \mu_i$ (mean-zero, bounded in an interval of length $b_i - a_i$). By Hoeffding's lemma, each $Z_i$ is $\frac{b_i - a_i}{2}$-sub-Gaussian. By independence and closure under sums (Section 3.4), $\sum_i Z_i$ is $\sqrt{\frac{1}{4}\sum_i (b_i - a_i)^2}$-sub-Gaussian. Applying the sub-Gaussian tail bound gives the result.
4.3 Two-Sided Hoeffding and Sample Complexity
For i.i.d. $X_i \in [a, b]$ with mean $\mu$, the two-sided version follows by symmetry (apply the one-sided bound to $X_i$ and to $-X_i$):
$$P\left(|\bar{X}_n - \mu| \geq t\right) \leq 2\exp\left(-\frac{2nt^2}{(b-a)^2}\right)$$
Reading the bound: The probability of a large deviation decays exponentially in $n$ and $t^2$. Doubling the sample size squares the exponential factor. Halving the allowed deviation $t$ quadruples the required $n$.
4.4 Required Sample Size
One of the most practically useful results: how large must $n$ be to guarantee $|\bar{X}_n - \mu| \leq \varepsilon$ with probability at least $1 - \delta$?
Setting $2\exp(-2n\varepsilon^2/(b-a)^2) = \delta$ and solving for $n$:
$$n \geq \frac{(b-a)^2}{2\varepsilon^2}\,\ln\frac{2}{\delta}$$
Example. Coin flips ($X_i \in \{0, 1\}$, so $b - a = 1$): to achieve accuracy $\varepsilon = 0.01$ with confidence $1 - \delta = 0.95$:
$$n \geq \frac{1}{2(0.01)^2}\,\ln\frac{2}{0.05} = 5000 \ln 40 \approx 18{,}444$$
So roughly 18,500 coin flips are needed to estimate a probability to within $\pm 0.01$ with 95% confidence.
For AI: Benchmark evaluation. To measure LLM accuracy to within $\pm 1\%$ at 95% confidence on a binary task ($b - a = 1$), Hoeffding requires $n \approx 18{,}444$ examples. This explains why serious LLM evaluations (MMLU, HumanEval, BIG-bench) use thousands of questions.
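The sample-size formula is easy to wrap in a helper - a sketch; `hoeffding_n` is our name, not a standard API:

```python
import math

def hoeffding_n(eps: float, delta: float, b_minus_a: float = 1.0) -> int:
    """Smallest n with P(|Xbar - mu| >= eps) <= delta by two-sided Hoeffding."""
    return math.ceil(b_minus_a**2 / (2 * eps**2) * math.log(2 / delta))

# The coin-flip example from the text: eps = 0.01, delta = 0.05.
n_coins = hoeffding_n(eps=0.01, delta=0.05)
```

Note the asymmetry: tightening $\varepsilon$ costs $1/\varepsilon^2$, while tightening $\delta$ costs only $\ln(1/\delta)$.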
5. Chernoff Bounds
5.1 The Chernoff Method
The Chernoff method is a general technique for deriving exponential tail bounds using the MGF. It is more powerful than Hoeffding when the MGF has a known closed form.
The Chernoff method. For any random variable $X$ with MGF $M(s) = \mathbb{E}[e^{sX}]$ and any $t$:
$$P(X \geq t) \leq \inf_{s > 0}\, e^{-st} M(s)$$
Proof. For any $s > 0$: $P(X \geq t) = P(e^{sX} \geq e^{st}) \leq e^{-st}\,\mathbb{E}[e^{sX}]$ by Markov. Optimise over $s$.
The infimum over $s$ gives the tightest bound. The optimal $s^*$ satisfies $M'(s^*)/M(s^*) = t$, i.e., the mean of $X$ under the exponentially tilted distribution equals $t$.
Recall: The MGF $M(s)$ was developed in Section04 Expectation and Moments. The key property used here is $M_{X+Y}(s) = M_X(s)\,M_Y(s)$ for independent $X, Y$.
5.2 Chernoff for Bernoulli Sums
Let $X_i \sim \mathrm{Bernoulli}(p)$ i.i.d. and $S_n = \sum_{i=1}^n X_i$ with mean $\mu = np$.
The MGF of each $X_i$ is $M(s) = 1 - p + p e^s$.
Upper Chernoff bound. For $\delta > 0$:
$$P\left(S_n \geq (1+\delta)\mu\right) \leq \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}$$
Derivation. Apply the Chernoff method with $t = (1+\delta)\mu$:
$$P\left(S_n \geq (1+\delta)\mu\right) \leq e^{-s(1+\delta)\mu}\,(1 - p + p e^s)^n$$
Substituting $1 - p + p e^s \leq e^{p(e^s - 1)}$ (using $1 + x \leq e^x$):
$$P\left(S_n \geq (1+\delta)\mu\right) \leq e^{-s(1+\delta)\mu + \mu(e^s - 1)}$$
Optimising: $s = \ln(1+\delta)$, giving the stated bound.
5.3 Multiplicative Chernoff Form
For $0 < \delta \leq 1$, a simpler but slightly looser form:
$$P\left(S_n \geq (1+\delta)\mu\right) \leq e^{-\mu\delta^2/3}$$
This follows from the inequality $\frac{e^{\delta}}{(1+\delta)^{1+\delta}} \leq e^{-\delta^2/3}$ for $\delta \in (0, 1]$.
Comparison with Hoeffding. Hoeffding (applied to $\bar{X}_n$ with $t = \delta p$) gives:
$$P\left(S_n \geq (1+\delta)np\right) \leq e^{-2n\delta^2 p^2}$$
Chernoff gives $e^{-np\delta^2/3}$. For small $p$ (rare events), Chernoff's dependence on $p$ rather than $p^2$ makes it far tighter.
5.4 Lower Tail Chernoff
For the lower tail, for $0 < \delta < 1$:
$$P\left(S_n \leq (1-\delta)\mu\right) \leq e^{-\mu\delta^2/2}$$
Note the factor of $1/2$ vs $1/3$ in the exponent - the lower-tail bound is slightly tighter.
Application: balls into bins. When $n$ items are assigned to $n$ bins uniformly at random, the expected load per bin is 1. Chernoff bounds control the maximum load at $O(\ln n / \ln\ln n)$ with high probability - crucial for hashing and load-balancing proofs in distributed systems.
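To see the rare-event advantage concretely, here is a sketch comparing the exact Binomial(1000, 0.01) upper tail at twice its mean against the multiplicative Chernoff and Hoeffding bounds:

```python
import math

# S ~ Binomial(n=1000, p=0.01), mu = 10. Tail P(S >= 20), i.e. delta = 1.
n, p, delta = 1000, 0.01, 1.0
mu = n * p

# Exact tail via the complement of the CDF (math.comb keeps this exact-ish).
exact = 1.0 - sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
                  for k in range(int((1 + delta) * mu)))
chernoff = math.exp(-mu * delta**2 / 3)          # multiplicative Chernoff
hoeffding = math.exp(-2 * n * (delta * p) ** 2)  # Hoeffding with t = delta*p
```

Here Hoeffding is nearly vacuous (about 0.82) because $p^2$ enters its exponent, while Chernoff (about 0.036) is within an order of magnitude of the exact tail.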
6. Bernstein's Inequality
6.1 Sub-Exponential Random Variables
Some random variables have heavier tails than sub-Gaussian but still lighter than arbitrary - these are sub-exponential.
Definition. A mean-zero random variable $X$ is sub-exponential with parameters $(\nu, b)$ if:
$$\mathbb{E}[e^{sX}] \leq e^{\nu^2 s^2/2} \quad \text{for all } |s| \leq \frac{1}{b}$$
Sub-exponential tails decay as $e^{-ct}$ (exponential), slower than the Gaussian $e^{-ct^2}$.
Examples:
- Exponential($\lambda$): sub-exponential but not sub-Gaussian (its tail decays as $e^{-\lambda t}$)
- $\chi^2_k$: sub-exponential (a sum of squared Gaussians)
- Any bounded variable is sub-Gaussian, hence also sub-exponential
6.2 Bernstein's Inequality
Theorem (Bernstein, 1924/1946). Let $X_1, \dots, X_n$ be independent mean-zero variables with $|X_i| \leq c$ and $\frac{1}{n}\sum_i \mathrm{Var}(X_i) \leq \sigma^2$. Then for all $t > 0$:
$$P\left(\frac{1}{n}\sum_{i=1}^n X_i \geq t\right) \leq \exp\left(-\frac{nt^2}{2\sigma^2 + 2ct/3}\right)$$
Intuition. This interpolates between two regimes:
- Small $t$ ($t \lesssim \sigma^2/c$): denominator $\approx 2\sigma^2$, giving $e^{-nt^2/(2\sigma^2)}$ - sub-Gaussian behaviour with the true variance
- Large $t$ ($t \gtrsim \sigma^2/c$): denominator $\approx 2ct/3$, giving $e^{-3nt/(2c)}$ - an exponential (not Gaussian) tail
6.3 Bernstein vs Hoeffding
For i.i.d. variables with $X_i \in [a, b]$, $|X_i - \mu| \leq c$, and variance $\sigma^2$:
| Bound | Formula | Better when |
|---|---|---|
| Hoeffding | $\exp(-2nt^2/(b-a)^2)$ | No variance information available |
| Bernstein | $\exp(-nt^2/(2\sigma^2 + 2ct/3))$ | $\sigma^2 \ll (b-a)^2$ (small variance) |
When Bernstein wins. If $\sigma^2 \ll (b-a)^2$ (the data has small variance despite a large range), Bernstein gives $e^{-nt^2/(2\sigma^2)}$ for small $t$, while Hoeffding gives $e^{-2nt^2/(b-a)^2}$. Since $1/\sigma^2 \gg 1/(b-a)^2$, Bernstein is exponentially tighter.
Example. Suppose $X_i \in [-1, 1]$ but $\sigma^2 = 0.0025$ ($\sigma = 0.05$). Hoeffding's exponent uses the range, $(b-a)^2 = 4$, while Bernstein's uses the variance, $2\sigma^2 = 0.005$ - a 400x improvement in the exponent for small $t$.
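The 400x exponent gap in that example can be checked directly - a sketch with $n = 10^4$ and $t = 0.01$:

```python
import math

# Tail bounds for the mean of n variables in [-1, 1] with variance 0.0025
# (the example above): Hoeffding sees only the range, Bernstein the variance.
n, sigma2, c, t = 10_000, 0.0025, 1.0, 0.01

hoeffding = math.exp(-2 * n * t**2 / 2**2)                # range b - a = 2
bernstein = math.exp(-n * t**2 / (2 * sigma2 + 2 * c * t / 3))
```

Hoeffding gives about $e^{-0.5} \approx 0.61$ (nearly useless), while Bernstein's exponent is roughly $-86$, i.e. an astronomically smaller probability.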
6.4 Application to Gradient Noise
In SGD, the stochastic gradient $\hat{g} = \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla \ell_i(\theta)$ estimates the true gradient $\nabla L(\theta) = \mathbb{E}[\nabla \ell_i(\theta)]$.
If each per-sample gradient is bounded coordinate-wise, $\|\nabla \ell_i(\theta)\|_\infty \leq c$, and the per-coordinate variance is at most $\sigma^2$, then Bernstein bounds:
$$P\left(\|\hat{g} - \nabla L(\theta)\|_\infty \geq t\right) \leq 2d\,\exp\left(-\frac{Bt^2}{2\sigma^2 + 2ct/3}\right)$$
where the factor $d$ comes from a union bound over coordinates. This bound shows why variance reduction methods (SVRG, SAGA) that reduce $\sigma^2$ improve convergence more dramatically than reducing $c$ alone.
7. McDiarmid's Inequality
7.1 Bounded Differences Condition
McDiarmid's inequality generalises Hoeffding from sums to general functions.
Definition (Bounded Differences). A function $f: \mathcal{X}^n \to \mathbb{R}$ satisfies the bounded differences condition with constants $c_1, \dots, c_n$ if for all $x_1, \dots, x_n$ and all $x_i' \in \mathcal{X}$:
$$|f(x_1, \dots, x_i, \dots, x_n) - f(x_1, \dots, x_i', \dots, x_n)| \leq c_i$$
Intuitively: changing the $i$-th coordinate by any amount changes $f$ by at most $c_i$.
7.2 McDiarmid's Inequality
Theorem (McDiarmid, 1989). Let $X_1, \dots, X_n$ be independent random variables and let $f$ satisfy the bounded differences condition with constants $c_1, \dots, c_n$. Then for all $t > 0$:
$$P\left(f(X_1, \dots, X_n) - \mathbb{E}[f] \geq t\right) \leq \exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right)$$
Proof sketch (martingale method). Define the Doob martingale $Z_k = \mathbb{E}[f \mid X_1, \dots, X_k]$ with $Z_0 = \mathbb{E}[f]$ and $Z_n = f$. The differences satisfy $|Z_k - Z_{k-1}| \leq c_k$ by the bounded differences assumption. Apply Azuma's inequality (concentration for bounded martingale differences) to get the result.
7.3 Relationship to Hoeffding
Hoeffding's inequality is a special case of McDiarmid's. For $f(x_1, \dots, x_n) = \frac{1}{n}\sum_i x_i$ with $x_i \in [a, b]$, changing $x_i$ changes $f$ by at most $(b-a)/n$, so $c_i = (b-a)/n$. Then:
$$\sum_{i=1}^n c_i^2 = \frac{(b-a)^2}{n}$$
Substituting into McDiarmid recovers exactly Hoeffding's bound $\exp(-2nt^2/(b-a)^2)$.
7.4 Applications to ML Stability
Training loss stability. Let $S = (z_1, \dots, z_n)$ be a training set and $\hat{R}(S) = \frac{1}{n}\sum_{i=1}^n \ell(h, z_i)$ the empirical risk with $\ell \in [0, 1]$. Swapping one training example changes $\hat{R}$ by at most $1/n$. McDiarmid gives:
$$P\left(|\hat{R}(S) - \mathbb{E}[\hat{R}(S)]| \geq t\right) \leq 2e^{-2nt^2}$$
Dropout. With dropout probability $p$, each neuron's activation is kept with probability $1 - p$ and rescaled. If toggling the $i$-th neuron changes the network output by at most $c_i$, McDiarmid bounds the deviation of the output from its mean under dropout noise.
Data augmentation. If augmenting a single training example can change the training loss by at most $c$, McDiarmid bounds the sensitivity of the trained model's loss to augmentation choices.
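A simulation sketch of the empirical-risk case, with losses taken Uniform[0,1] for illustration (so the bounded differences are $c_i = 1/n$):

```python
import numpy as np

# McDiarmid for f = mean of n losses in [0,1]: sum c_i^2 = 1/n, so
# P(|f - E f| >= t) <= 2 exp(-2 n t^2). Compared against simulation.
rng = np.random.default_rng(2)
n, trials, t = 200, 20_000, 0.05

risk = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)
empirical = (np.abs(risk - 0.5) >= t).mean()
mcdiarmid = 2 * np.exp(-2 * n * t**2)   # = 2 e^{-1} ~ 0.74
```

The simulated exceedance probability is far below the bound, as expected: McDiarmid makes no use of the losses' small variance.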
8. The Union Bound and Covering Arguments
8.1 Union Bound
Lemma (Boole's / Union Bound). For any events $A_1, \dots, A_n$ (not necessarily independent):
$$P\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n P(A_i)$$
Proof. By induction from the two-event case: $P(A \cup B) = P(A) + P(B) - P(A \cap B) \leq P(A) + P(B)$.
The union bound is exact when events are disjoint and pessimistic otherwise. Despite its looseness, it is indispensable because it requires no assumption about dependence between events.
For AI: The union bound appears in virtually every multi-class or multi-hypothesis analysis. PAC theory for a finite hypothesis class $\mathcal{H}$ asks for the probability that any $h \in \mathcal{H}$ is bad - a union bound over all $|\mathcal{H}|$ hypotheses.
8.2 Multiple Hypothesis Testing
Consider $m$ statistical tests, each with individual significance level $\alpha$. If we want the probability that at least one false positive occurs to be at most $\alpha$, the Bonferroni correction sets each test at level $\alpha/m$.
In ML: When training $k$ models and reporting the best, the union bound says the best result can be a false positive with probability up to $k$ times the individual level. This is the multiple comparison problem underlying claims like "our method beats 10 baselines on 5 datasets."
Balancing union bound and concentration. For $M$ bad events each with $P(B_i) \leq \delta'$:
$$P\left(\bigcup_{i=1}^M B_i\right) \leq M\delta'$$
To make this $\leq \delta$: need each $\delta' = \delta/M$. In PAC theory, this costs an extra $\ln M$ in the required sample size.
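The Bonferroni trade-off is easy to simulate - a sketch with $m = 20$ one-sided Gaussian z-tests on pure noise (1.6449 and 2.807 are the standard-normal critical values for levels $0.05$ and $0.05/20$):

```python
import numpy as np

# Family-wise error rate of m tests on pure noise: naive per-test level
# alpha vs the Bonferroni-corrected level alpha/m.
rng = np.random.default_rng(3)
m, trials, alpha = 20, 50_000, 0.05

z = rng.standard_normal((trials, m))
fwer_naive = (z > 1.6449).any(axis=1).mean()   # each test at level 0.05
fwer_bonf = (z > 2.807).any(axis=1).mean()     # each test at level 0.0025
```

Without correction, roughly $1 - 0.95^{20} \approx 64\%$ of noise-only experiments report "a discovery"; with Bonferroni the family-wise rate stays below $\alpha$.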
8.3 Covering Numbers and ε-Nets
The union bound can only be applied to finite collections of events. For continuous hypothesis spaces (e.g., all linear classifiers, all neural networks), we need to discretise first.
Definition. An $\varepsilon$-net of a set $T$ under metric $d$ is a finite set $N \subseteq T$ such that for every $t \in T$, there exists $x \in N$ with $d(t, x) \leq \varepsilon$.
The covering number $N(T, d, \varepsilon)$ is the size of the smallest $\varepsilon$-net.
Covering number of a ball. The unit ball in $\mathbb{R}^d$ has covering number:
$$N(\varepsilon) \leq \left(\frac{3}{\varepsilon}\right)^d$$
This grows polynomially in $1/\varepsilon$ but exponentially in $d$ - the curse of dimensionality for covering.
8.4 The Net Argument
The standard approach for continuous hypothesis classes:
- Build an $\varepsilon$-net $N$ of the hypothesis space with $|N| = N(\varepsilon)$
- Apply concentration + union bound over the net: control $\max_{h \in N} |R(h) - \hat{R}_n(h)|$
- Use Lipschitz continuity to extend from the net to all of $\mathcal{H}$
The final bound pays $\ln N(\varepsilon)$ extra compared to a single hypothesis, plus the $\varepsilon$ approximation error from the net.
Johnson-Lindenstrauss Lemma. As an application of covering-style union bounds, $n$ points in $\mathbb{R}^d$ embed into $\mathbb{R}^k$ with $k = O(\varepsilon^{-2}\ln n)$ random projections, distorting all pairwise distances by at most a factor $1 \pm \varepsilon$, with high probability. This is proved using the union bound over all $\binom{n}{2}$ pairs.
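A sketch of the JL phenomenon with a plain Gaussian projection (dimensions chosen for speed, not tuned against the $O(\varepsilon^{-2}\ln n)$ formula):

```python
import numpy as np

# Project 50 points from d = 2000 down to k = 400 with a random Gaussian
# matrix scaled by 1/sqrt(k); check the worst pairwise distance distortion.
rng = np.random.default_rng(4)
d, k, n_pts = 2000, 400, 50

X = rng.standard_normal((n_pts, d))
P = rng.standard_normal((d, k)) / np.sqrt(k)   # random projection
Y = X @ P

i, j = np.triu_indices(n_pts, 1)               # all pairs
orig = np.linalg.norm(X[i] - X[j], axis=1)
proj = np.linalg.norm(Y[i] - Y[j], axis=1)
max_distortion = np.abs(proj / orig - 1.0).max()
```

Even the worst of the 1,225 pairwise distances is typically distorted by well under 25% - despite a 5x reduction in dimension and with no dependence on $d$.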
9. PAC Learning and Generalisation Bounds
9.1 The PAC Framework
The Probably Approximately Correct (PAC) framework, introduced by Valiant (1984), provides a formal model for machine learning under uncertainty.
Setup: A learner observes $n$ i.i.d. examples $(x_1, y_1), \dots, (x_n, y_n)$ from an unknown distribution $D$ over $\mathcal{X} \times \mathcal{Y}$. The learner selects a hypothesis $\hat{h}$ from a hypothesis class $\mathcal{H}$.
True risk: $R(h) = P_{(x,y)\sim D}\left(h(x) \neq y\right)$
Empirical risk: $\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{h(x_i) \neq y_i\}$
PAC guarantee: A learning algorithm PAC-learns $\mathcal{H}$ if for any $\varepsilon, \delta \in (0, 1)$ and any distribution $D$, given $n \geq n_0(\varepsilon, \delta)$ examples, with probability at least $1 - \delta$:
$$R(\hat{h}) \leq \min_{h \in \mathcal{H}} R(h) + \varepsilon$$
The sample complexity $n_0(\varepsilon, \delta)$ is the central quantity: how many examples are needed?
9.2 Finite Hypothesis Class
Theorem. Let $\mathcal{H}$ be a finite hypothesis class with $|\mathcal{H}| = M$. In the realisable case (some $h^* \in \mathcal{H}$ has $R(h^*) = 0$), for any ERM algorithm (returning $\hat{h}$ with $\hat{R}_n(\hat{h}) = 0$) and any $\varepsilon, \delta \in (0, 1)$, if:
$$n \geq \frac{1}{\varepsilon}\left(\ln M + \ln\frac{1}{\delta}\right)$$
then with probability at least $1 - \delta$, the selected hypothesis satisfies $R(\hat{h}) \leq \varepsilon$.
Proof. Call $h$ "bad" if $R(h) > \varepsilon$. For any single bad $h$, the probability it is consistent with all $n$ examples is $P(\hat{R}_n(h) = 0) \leq (1 - \varepsilon)^n \leq e^{-n\varepsilon}$.
By the union bound over at most $M$ bad hypotheses: $P(\exists \text{ bad } h \text{ with } \hat{R}_n(h) = 0) \leq M e^{-n\varepsilon}$.
Setting this $\leq \delta$ and solving: $n \geq \frac{1}{\varepsilon}\ln\frac{M}{\delta}$. The $1/\varepsilon^2$ version (via Hoeffding instead of the $(1-\varepsilon)^n$ argument) covers the agnostic (non-realisable) case.
9.3 Uniform Convergence
Theorem (Uniform Convergence). For finite $\mathcal{H}$ and losses in $[0, 1]$, for $n \geq \frac{1}{2\varepsilon^2}\ln\frac{2M}{\delta}$, with probability at least $1 - \delta$:
$$\sup_{h \in \mathcal{H}} \left|\hat{R}_n(h) - R(h)\right| \leq \varepsilon$$
Proof. Fix any $h$. By Hoeffding: $P(|\hat{R}_n(h) - R(h)| > \varepsilon) \leq 2e^{-2n\varepsilon^2}$.
By the union bound over all $M$ hypotheses: $P(\exists h : |\hat{R}_n(h) - R(h)| > \varepsilon) \leq 2M e^{-2n\varepsilon^2}$.
Consequence: If uniform convergence holds, then for the ERM $\hat{h}$ and the best hypothesis $h^*$:
$$R(\hat{h}) \leq \hat{R}_n(\hat{h}) + \varepsilon \leq \hat{R}_n(h^*) + \varepsilon \leq R(h^*) + 2\varepsilon$$
The ERM generalises to within $2\varepsilon$ of the best hypothesis in the class.
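The uniform-convergence sample size for a finite class is a one-liner - a sketch; `pac_sample_size` is our name:

```python
import math

# n >= ln(2M/delta) / (2 eps^2) guarantees sup_h |R(h) - R_hat(h)| <= eps
# with probability >= 1 - delta, for a finite class of M hypotheses.
def pac_sample_size(M: int, eps: float, delta: float) -> int:
    return math.ceil(math.log(2 * M / delta) / (2 * eps**2))

n_needed = pac_sample_size(M=1000, eps=0.05, delta=0.05)
```

Note how mildly $n$ grows in $M$: going from $10^3$ to $10^6$ hypotheses roughly adds $\ln 1000 \approx 6.9$ to the bracket, not a factor of 1000.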
9.4 VC Dimension
For infinite hypothesis classes, we need a different complexity measure.
Definition. A hypothesis class $\mathcal{H}$ shatters a set $\{x_1, \dots, x_m\}$ if for every labelling $(y_1, \dots, y_m) \in \{0, 1\}^m$, there exists $h \in \mathcal{H}$ with $h(x_i) = y_i$ for all $i$.
The VC dimension $d_{\mathrm{VC}}(\mathcal{H})$ is the size of the largest set shattered by $\mathcal{H}$.
Examples:
- Threshold classifiers on $\mathbb{R}$: $d_{\mathrm{VC}} = 1$ (can shatter any single point but no pair of points)
- Linear classifiers (halfspaces) in $\mathbb{R}^d$: $d_{\mathrm{VC}} = d + 1$
- Degree-$k$ polynomial classifiers in $\mathbb{R}^d$: $d_{\mathrm{VC}} = \binom{d+k}{k}$ (the number of monomials)
- Circles in $\mathbb{R}^2$: $d_{\mathrm{VC}} = 3$
- Neural networks: $d_{\mathrm{VC}}$ grows with parameter count, approximately linearly (up to logarithmic factors)
Sauer-Shelah Lemma. The growth function $\Pi_{\mathcal{H}}(n)$ (the maximum number of distinct labellings of $n$ points realisable by $\mathcal{H}$) satisfies:
$$\Pi_{\mathcal{H}}(n) \leq \sum_{i=0}^{d_{\mathrm{VC}}} \binom{n}{i} \leq \left(\frac{en}{d_{\mathrm{VC}}}\right)^{d_{\mathrm{VC}}}$$
for $n \geq d_{\mathrm{VC}}$. This says: an $n$-point set can be labelled in at most polynomially many ways by $\mathcal{H}$, even if $|\mathcal{H}| = \infty$.
9.5 Generalisation Bound with VC Dimension
Theorem (VC Generalisation Bound). For any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over $n$ i.i.d. training examples, every $h \in \mathcal{H}$ satisfies (in one standard form):
$$R(h) \leq \hat{R}_n(h) + \sqrt{\frac{8d\ln(2en/d) + 8\ln(4/\delta)}{n}}$$
where $d = d_{\mathrm{VC}}(\mathcal{H})$.
Sample complexity. Setting the right-hand side excess to $\varepsilon$:
$$n = O\left(\frac{d\ln(1/\varepsilon) + \ln(1/\delta)}{\varepsilon^2}\right)$$
The $\ln M$ from the finite case is replaced by $d\ln n$. For a $d$-dimensional linear classifier ($d_{\mathrm{VC}} = d + 1$), we need $n = \tilde{O}(d/\varepsilon^2)$ - linear in dimension, not exponential.
For AI (2026): Modern transformers have billions of parameters, implying enormous VC dimension. Yet they generalise. This "paradox" is partly resolved by:
- Implicit regularisation: gradient descent finds minimum-norm solutions
- Overparameterisation: interpolation threshold, double descent (covered in Section04)
- PAC-Bayes: data-dependent bounds that are tighter for specific algorithms
10. Rademacher Complexity
10.1 Definition
Rademacher complexity measures the richness of a function class on specific data, making it data-dependent and often tighter than VC bounds.
Definition. Given a sample $S = (x_1, \dots, x_n)$ and a function class $\mathcal{F}$, the empirical Rademacher complexity is:
$$\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right]$$
where $\sigma = (\sigma_1, \dots, \sigma_n)$ with $\sigma_i$ i.i.d. uniform on $\{-1, +1\}$ (Rademacher variables).
The Rademacher complexity is $\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{F})]$.
Interpretation. $\hat{\mathfrak{R}}_S(\mathcal{F})$ measures how well $\mathcal{F}$ can fit random labels on the training points. If the class is rich enough to fit any labelling, $\hat{\mathfrak{R}}_S \approx 1$. If it can only fit structured labels, $\hat{\mathfrak{R}}_S \to 0$ as $n$ grows.
10.2 Rademacher Generalisation Bound
Theorem. For any function class $\mathcal{F}$ with values in $[0, 1]$, with probability at least $1 - \delta$:
$$R(f) \leq \hat{R}_n(f) + 2\mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\ln(1/\delta)}{2n}}$$
for all $f \in \mathcal{F}$ simultaneously.
Comparison to VC. The VC bound has $\sqrt{d_{\mathrm{VC}}\ln n / n}$; the Rademacher bound has $\mathfrak{R}_n(\mathcal{F})$. For specific data distributions where the function class is "effectively smaller," Rademacher can be significantly tighter.
10.3 Rademacher of Linear Classifiers
For linear classifiers $\mathcal{F} = \{x \mapsto \langle w, x\rangle : \|w\|_2 \leq B\}$ on data with $\|x_i\|_2 \leq R$:
More precisely, using Cauchy-Schwarz and Jensen:
$$\hat{\mathfrak{R}}_S(\mathcal{F}) = \frac{B}{n}\,\mathbb{E}_{\sigma}\left\|\sum_{i=1}^n \sigma_i x_i\right\|_2 \leq \frac{B}{n}\sqrt{\sum_{i=1}^n \|x_i\|_2^2} \leq \frac{BR}{\sqrt{n}}$$
This is $O(1/\sqrt{n})$, independent of the ambient dimension $d$! This explains why linear classifiers generalise well even in high dimensions, as long as the norm is controlled.
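Because the supremum has the closed form $\frac{B}{n}\|\sum_i \sigma_i x_i\|_2$, the empirical Rademacher complexity of this class can be estimated by Monte Carlo - a sketch on synthetic unit-norm data, so $R = 1$:

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity of
# {x -> <w, x> : ||w||_2 <= B}. The sup over w equals (B/n)||sum_i s_i x_i||.
rng = np.random.default_rng(5)
n, d, B = 500, 20, 1.0

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm points: R = 1

n_mc = 2000
sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))
rad_hat = (B * np.linalg.norm(sigma @ X, axis=1) / n).mean()
bound = B * 1.0 / np.sqrt(n)                    # B*R/sqrt(n) upper bound
```

The estimate sits just below the $BR/\sqrt{n}$ bound (Jensen's inequality is nearly tight here), confirming the dimension-free scaling.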
10.4 Rademacher vs VC
| Property | VC Dimension | Rademacher Complexity |
|---|---|---|
| Distribution | Distribution-free | Data-dependent |
| Calculation | Combinatorial | Expectation over labels |
| Estimate | Hard (often NP-hard) | MC approximation possible |
| Tightness | Worst-case | Often tighter in practice |
| Handles regression | No (classification) | Yes (arbitrary losses) |
| Modern NNs | Vacuous (too large $d_{\mathrm{VC}}$) | Can be finite for sparse/low-rank models |
For AI: Rademacher complexity gives finite (non-vacuous) generalisation bounds for attention heads with bounded query/key norm, explaining why transformers with weight decay generalise despite huge parameter counts.
11. ML Applications
11.1 SGD and Gradient Concentration
In stochastic gradient descent, at each step we compute a mini-batch gradient:
$$\hat{g} = \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla \ell_i(\theta)$$
where $\mathcal{B}$ is a random mini-batch of size $B$. The true gradient is $\nabla L(\theta) = \mathbb{E}[\nabla \ell_i(\theta)]$.
Concentration of gradient noise. Assuming $\|\nabla \ell_i(\theta)\|_\infty \leq c$ almost surely, by Hoeffding applied coordinate-wise and a union bound over the $d$ coordinates:
$$P\left(\|\hat{g} - \nabla L(\theta)\|_\infty \geq t\right) \leq 2d\,\exp\left(-\frac{Bt^2}{2c^2}\right)$$
Implication for learning rate. For SGD to make progress, we need the gradient noise to be smaller than the signal. Concentration tells us: with probability $1 - \delta$, the noise is bounded by $O\left(c\sqrt{\ln(d/\delta)/B}\right)$. This justifies scaling the learning rate with the batch size in theory.
Variance reduction. Methods like SVRG (Stochastic Variance-Reduced Gradient) periodically compute the full gradient, reducing the effective gradient variance. Bernstein's inequality quantifies the benefit: convergence can be linear (not just $O(1/\sqrt{T})$) when the variance is controlled.
11.2 Confidence Intervals via Hoeffding
Suppose we evaluate an LLM on a benchmark: out of $n = 1000$ binary questions (correct/incorrect), we observe accuracy $\hat{p} = 0.732$.
By Hoeffding's inequality (with $X_i \in [0, 1]$, so $b - a = 1$), with probability at least $1 - \delta$:
$$|\hat{p} - p| \leq \sqrt{\frac{\ln(2/\delta)}{2n}}$$
This gives the Hoeffding confidence interval: $\hat{p} \pm \sqrt{\ln(2/\delta)/(2n)}$.
Numerical example. For $n = 1000$ questions and $\delta = 0.05$:
$$\sqrt{\frac{\ln 40}{2000}} \approx 0.043$$
So a measured accuracy of 73.2% on 1000 questions gives 95% CI $\approx [68.9\%, 77.5\%]$ - a $\pm 4.3\%$ interval. This explains why claiming "Model A (73.2%) outperforms Model B (72.8%)" on 1000 questions has no statistical support.
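The interval computation as a sketch (`hoeffding_ci` is our name):

```python
import math

# Hoeffding confidence interval for an accuracy estimate from n binary
# questions: p_hat +/- sqrt(ln(2/delta) / (2n)).
def hoeffding_ci(p_hat: float, n: int, delta: float = 0.05):
    half = math.sqrt(math.log(2 / delta) / (2 * n))
    return p_hat - half, p_hat + half

lo, hi = hoeffding_ci(0.732, 1000)   # the benchmark example above
```

Quadrupling $n$ only halves the interval width - the $1/\sqrt{n}$ tax on evaluation.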
11.3 Random Features and Kernel Approximation
The Rahimi-Recht random features method approximates the shift-invariant RBF kernel $k(x, y) = e^{-\|x - y\|^2/(2\sigma^2)}$ as:
$$k(x, y) \approx \frac{1}{D}\sum_{j=1}^{D} z_j(x)\,z_j(y), \qquad z_j(x) = \sqrt{2}\,\cos(\omega_j^\top x + b_j)$$
where $\omega_j \sim \mathcal{N}(0, \sigma^{-2} I)$ and $b_j \sim \mathrm{Uniform}[0, 2\pi]$.
How many features? Each term is bounded, so the pointwise approximation error is controlled by Hoeffding:
$$P\left(|\hat{k}(x, y) - k(x, y)| \geq \varepsilon\right) \leq 2\exp\left(-\frac{D\varepsilon^2}{8}\right)$$
Solving for $D$: $D \geq \frac{8}{\varepsilon^2}\ln\frac{2}{\delta}$. For $\varepsilon = 0.1$ and $\delta = 0.01$: $D \gtrsim 4{,}200$.
In practice, $D$ in the hundreds to low thousands suffices, suggesting the bound is conservative. This is common: worst-case bounds are loose; actual concentration is tighter.
11.4 Concentration in High Dimensions
High-dimensional geometry exhibits surprising concentration phenomena. For $x \sim \mathcal{N}(0, I_d)$:
Chi-squared concentration. $\|x\|^2 = \sum_{i=1}^d x_i^2 \sim \chi^2_d$ with mean $d$ and variance $2d$. Bernstein-type bounds give:
$$P\left(\left|\|x\|^2 - d\right| \geq td\right) \leq 2e^{-dt^2/8} \quad \text{for } t \in (0, 1)$$
So $\|x\| = \sqrt{d}\,(1 + o(1))$ in probability - all Gaussian vectors in high dimensions have nearly the same norm $\sqrt{d}$.
Concentration of inner products. For independent $x, y \in \mathbb{R}^d$ (unit-variance coordinates):
$$\frac{\langle x, y\rangle}{\|x\|\,\|y\|} = O\left(\frac{1}{\sqrt{d}}\right) \quad \text{with high probability}$$
Nearly-orthogonal random vectors: with $d = 4096$ (a typical embedding dimension), random vectors have $|\cos\angle(x, y)| \lesssim 0.05$ with very high probability.
For AI (attention). In transformers with per-head key dimension $d_k = 64$, random query-key inner products concentrate around 0 with standard deviation $\sqrt{d_k}$. This justifies the $1/\sqrt{d_k}$ scaling in attention: without it, softmax inputs would have variance $d_k$, causing near-zero gradients through softmax saturation.
Preview: Law of Large Numbers and CLT
The concentration results above provide finite-sample bounds. The LLN says $\bar{X}_n \to \mu$ as $n \to \infty$; the CLT says the scaled deviation $\sqrt{n}(\bar{X}_n - \mu)$ converges in distribution to $\mathcal{N}(0, \sigma^2)$. These are the asymptotic counterparts of Hoeffding's inequality.
-> Full treatment: Section06 Stochastic Processes
12. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Applying Markov to a signed random variable | Markov requires $X \geq 0$; for signed $X$, $P(X \geq t)$ can exceed $\mathbb{E}[X]/t$ | Apply Markov to $\lvert X \rvert$ or $(X - \mu)^2$ |
| 2 | Using Chebyshev for exponential tails | Chebyshev gives $1/t^2$ decay, vastly looser than sub-Gaussian $e^{-t^2/(2\sigma^2)}$ | If data is bounded or Gaussian, use Hoeffding/Chernoff |
| 3 | Forgetting the factor of 2 in two-sided Hoeffding | One-sided: $e^{-2nt^2/(b-a)^2}$; the two-sided bound has a factor 2 in front | Always write "two-sided" explicitly and include the factor |
| 4 | Confusing the sub-Gaussian parameter with the variance | The sub-Gaussian parameter $\sigma^2$ satisfies $\mathrm{Var}(X) \leq \sigma^2$ but need not equal it | $\mathrm{Var}(X) \leq \sigma^2$; the sub-Gaussian parameter can be strictly larger |
| 5 | Applying the union bound over infinite classes directly | The union bound only works for countable (finite/countably infinite) collections | Use an $\varepsilon$-net covering argument first, then union bound over the net |
| 6 | Computing VC dimension by counting parameters | VC dimension is not the number of parameters; it depends on the function class structure | For linear classifiers in $\mathbb{R}^d$: $d_{\mathrm{VC}} = d + 1$, not the number of weights |
| 7 | Ignoring the $\ln(1/\delta)$ term in sample complexity | Higher confidence adds $\ln(1/\delta)$ to the bound - small, but not negligible | Always compute the full bound including $\ln(1/\delta)$ |
| 8 | Claiming Chernoff applies to dependent variables | Chernoff requires independence (product of MGFs) | For dependent variables, use McDiarmid or martingale concentration |
| 9 | Using Hoeffding when the variance is known to be small | Bernstein would give an exponentially tighter bound for small-variance data | Check if $\sigma^2 \ll (b-a)^2$; if so, use Bernstein |
| 10 | Interpreting VC generalisation bounds as practically tight | VC bounds are often vacuous for deep networks; they describe worst-case behaviour | Use PAC-Bayes or Rademacher bounds, which can be data-dependent and tighter |
13. Exercises
**Exercise 1 (★)** (Markov and Chebyshev) For a standard Gaussian $X$ and threshold $t = 2$: (a) Compute $P(|X| \geq t)$ exactly and via Markov applied to $|X|$ (using $\mathbb{E}|X| = \sqrt{2/\pi}$). What is the ratio? (b) Compute $P(|X| \geq t)$ exactly and via Chebyshev. What is the ratio? (c) For which $k$ does Chebyshev give $P(|X| \geq k) \leq 0.01$? Compare to the exact Gaussian answer. (d) Verify both bounds numerically with Monte Carlo samples.
**Exercise 2 (★)** (Hoeffding Sample Complexity) A spam filter classifies emails. On $n$ test emails, it achieves accuracy $\hat{p}$. (a) Using Hoeffding, find the smallest $n$ such that with 99% confidence, $|\hat{p} - p| \leq 0.01$. (b) How does the required $n$ change if we only require 95% confidence? (c) If $n = 1000$ and $\hat{p} = 0.95$, give the 95% Hoeffding confidence interval for $p$. (d) Compare the Hoeffding CI to the Normal-approximation CI $\hat{p} \pm 1.96\sqrt{\hat{p}(1 - \hat{p})/n}$.
**Exercise 3 (★)** (Chernoff Bounds for Binomial) Let $S_n \sim \mathrm{Binomial}(n, p)$ with $n = 1000$, $p = 0.01$ (so $\mu = 10$). (a) Compute $P(S_n \geq 20)$ exactly and via the multiplicative Chernoff bound ($\delta = 1$). (b) Compute $P(S_n \geq 20)$ via Hoeffding. Compare to Chernoff - which is tighter? (c) Compute $P(S_n \leq 5)$ via the lower-tail Chernoff bound and compare to the exact value. (d) Plot all three bounds (Hoeffding, upper Chernoff, lower Chernoff) as a function of the threshold.
**Exercise 4 (★★)** (Bernstein: Gradient Noise in SGD) Assume each sample gradient satisfies $\|\nabla \ell_i\|_\infty \leq c$ with per-coordinate variance $\sigma^2 \ll c^2$. (a) Apply Bernstein (with a union bound over the $d$ coordinates) to bound $P(\|\hat{g} - \nabla L\|_\infty \geq t)$ for mini-batch size $B$. (b) Compare to the Hoeffding-based bound. Under what condition does Bernstein win? (c) For concrete values of $d$, $t$, and $\delta$ of your choosing, find the $B$ needed for the bound to equal $\delta$ via both routes. (d) Verify numerically: generate synthetic gradients and measure the empirical exceedance probability.
**Exercise 5 (★★)** (McDiarmid on Empirical Risk) Let $\hat{R}(S) = \frac{1}{n}\sum_{i=1}^n \ell(h, z_i)$ with $\ell \in [0, 1]$ and $z_i$ i.i.d. (a) Show that swapping one training example changes $\hat{R}$ by at most $1/n$. (b) Apply McDiarmid's inequality to bound $P(|\hat{R} - \mathbb{E}[\hat{R}]| \geq t)$. (c) Compare this to the direct Hoeffding application. Are they the same? Why? (d) What if $\ell \in [0, c]$ instead of $[0, 1]$? How does the bound scale?
**Exercise 6 (★★)** (PAC Bounds for Finite Class) Consider $\mathcal{H}$ with 1000 decision rules for binary classification. (a) How many training examples are needed for uniform convergence with $\varepsilon = 0.05$, $\delta = 0.05$? (b) ERM achieves training error 0. Using the realisable PAC bound with the $n$ from part (a), what is the worst-case true error with 95% confidence? (c) If with the same $n$ we observe empirical error 0.1, bound the true error with 99% confidence. (d) How does the bound change if $|\mathcal{H}| = 2^{100}$ (e.g., a lookup table over binary features)?
**Exercise 7 (★★★)** (VC Dimension and Generalisation) Consider linear classifiers in $\mathbb{R}^2$: $h(x) = \mathrm{sign}(w^\top x + b)$. (a) Show by construction that 3 points can be shattered (exhibit a configuration). (b) Show that no 4 points in general position can be shattered. (c) Write the VC generalisation bound for $d_{\mathrm{VC}} = 3$, $n = 10{,}000$, $\delta = 0.05$. (d) Compare to the Rademacher bound for the same class with $\|w\| \leq B$ and $\|x\| \leq R$.
**Exercise 8 (★★★)** (Rademacher Complexity: Monte Carlo Estimation) Given a dataset $S = (x_1, \dots, x_n)$ and the linear function class $\mathcal{F} = \{x \mapsto \langle w, x\rangle : \|w\|_2 \leq B\}$: (a) Show that $\hat{\mathfrak{R}}_S(\mathcal{F}) = \frac{B}{n}\,\mathbb{E}_\sigma\left\|\sum_i \sigma_i x_i\right\|_2$. (b) Implement a Monte Carlo estimator using random sign vectors $\sigma$. (c) On the MNIST training set (or a synthetic dataset), compute $\hat{\mathfrak{R}}_S$ and the resulting generalisation bound. (d) How does $\hat{\mathfrak{R}}_S$ change as you vary $n$? Verify the $1/\sqrt{n}$ scaling.
14. Why This Matters for AI (2026 Perspective)
| Concept | AI/ML Connection | Current Importance |
|---|---|---|
| Hoeffding confidence intervals | LLM benchmark evaluation: how many test questions for reliable rankings? | Critical - used implicitly in all leaderboard comparisons (MMLU, HELM, LMSYS) |
| PAC-Bayes bounds | Tightest known generalisation bounds for overparameterised models; connects to flat minima and sharpness-aware minimisation (SAM) | Active research frontier, 2023-2026 |
| VC dimension | Classical foundation; vacuous for LLMs with billions of parameters | Conceptually important; practically replaced by norm-based bounds |
| Rademacher complexity | Data-dependent bounds; finite for norm-constrained transformers | Used in theoretical analysis of attention, LoRA rank bounds |
| McDiarmid stability | Algorithmic stability theory: output doesn't change much when one training point changes | Foundation of differential privacy; relevant to RLHF data sensitivity |
| Concentration in high dims | $1/\sqrt{d_k}$ scaling in attention; JL lemma for random projections in retrieval | Directly explains transformer architectural choices (FlashAttention, MLA) |
| Chernoff for random graphs | Theoretical analysis of dropout, random initialisation, random feature networks | Relevant to lottery ticket hypothesis, neural architecture search |
| Union bound + covering | Uniform convergence for function classes; foundation of all finite-sample learning theory | Every theoretical ML paper uses this framework |
| Sub-Gaussian gradients | Convergence theory for Adam, SGD; required sample complexity for fine-tuning | Used in LoRA theoretical analysis and RLHF convergence guarantees |
| Bernstein vs Hoeffding | Variance-aware bounds; relevant when gradient variance is small (near optima) | Foundation of variance reduction methods (SVRG, SAG, SAGA) |
15. Conceptual Bridge
Looking back. This section builds on Section04 Expectation and Moments, which established $\mathbb{E}[X]$, $\mathrm{Var}(X)$, and the MGF $M_X(t) = \mathbb{E}[e^{tX}]$. The MGF is the critical bridge: Hoeffding's lemma uses convexity of $e^{sx}$; the Chernoff method optimises $e^{-st}\,\mathbb{E}[e^{sX}]$ over $s > 0$; Bernstein's inequality exploits the variance $\sigma^2$. Jensen's inequality (from Section04) underlies both the proof of Hoeffding's lemma and the information-theoretic interpretation of Chernoff bounds.
Looking forward. The Law of Large Numbers says $\bar{X}_n \to \mu$ almost surely. This is the qualitative counterpart of Hoeffding's quantitative bound: Hoeffding gives the rate at which $\bar{X}_n$ concentrates around $\mu$. The Central Limit Theorem (CLT) says the shape of the distribution of $\sqrt{n}(\bar{X}_n - \mu)$ converges to a Gaussian - explaining why sub-Gaussian bounds are the right model for sums of bounded variables. Both are developed in Section06 Stochastic Processes.
The PAC-statistics connection. PAC learning theory is, at its core, applied concentration theory. The "probably" in PAC (the $1-\delta$) is a concentration inequality; the "approximately correct" (the $\epsilon$) is the tolerance. The sample complexity $n(\epsilon, \delta)$ is exactly what concentration inequalities give. Chapter 7 (Statistics) will build on this: confidence intervals are Hoeffding bounds, hypothesis tests are Chernoff bounds, and MLE theory uses Bernstein-type inequalities to prove asymptotic normality.
CONCENTRATION INEQUALITIES IN THE CURRICULUM
========================================================================
Section04 Expectation and Moments
+- E[X], Var(X), MGF M_X(t) -------------------------- input
+- Jensen's inequality -------------------------- used in proofs
+- Markov/Chebyshev preview -------------------------- previewed there
v
Section05 Concentration Inequalities <--- YOU ARE HERE
+- Markov / Chebyshev (moment-based, polynomial tails)
+- Sub-Gaussian / Hoeffding (bounded, exponential tails)
+- Chernoff / Bernstein (MGF-based, variance-aware)
+- McDiarmid (functions of independent variables)
+- Union bound + covering (infinite hypothesis classes)
+- PAC learning / VC dimension (generalisation theory)
+- Rademacher complexity (data-dependent bounds)
v v
Section06 Stochastic Processes Chapter 7: Statistics
+- Weak LLN (via Chebyshev) +- Confidence intervals
+- Strong LLN +- Hypothesis testing
+- CLT (limit of Hoeffding) +- MLE consistency (Bernstein)
========================================================================
Appendix A: Summary of Key Bounds
| Inequality | Conditions | Bound | Rate |
|---|---|---|---|
| Markov | $X \geq 0$, $t > 0$ | $P(X \geq t) \leq \mathbb{E}[X]/t$ | $1/t$ |
| Chebyshev | $\mathrm{Var}(X) = \sigma^2 < \infty$ | $P(\|X - \mu\| \geq t) \leq \sigma^2/t^2$ | $1/t^2$ |
| Cantelli | $\mathrm{Var}(X) = \sigma^2 < \infty$ | $P(X - \mu \geq t) \leq \sigma^2/(\sigma^2 + t^2)$ | $1/t^2$ (one-sided) |
| Sub-Gaussian | $X$ is $\sigma$-sub-G | $P(\|X - \mu\| \geq t) \leq 2e^{-t^2/(2\sigma^2)}$ | $e^{-t^2}$ |
| Hoeffding | $X_i \in [a, b]$ independent | $P(\|\bar{X}_n - \mu\| \geq t) \leq 2e^{-2nt^2/(b-a)^2}$ | $e^{-nt^2}$ |
| Chernoff (mult.) | $X = \sum_i \mathrm{Bern}(p_i)$, $\mu = \mathbb{E}[X]$ | $P(X \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3}$ for $0 < \delta \leq 1$ | $e^{-\mu\delta^2}$ |
| Bernstein | $\|X_i - \mu\| \leq M$, variance $\sigma^2$ | $P(\|\bar{X}_n - \mu\| \geq t) \leq 2\exp\!\big(-\frac{nt^2}{2\sigma^2 + 2Mt/3}\big)$ | variance-adaptive |
| McDiarmid | $f$ with $c_i$-bounded differences | $P(f - \mathbb{E}f \geq t) \leq \exp\!\big(-\frac{2t^2}{\sum_i c_i^2}\big)$ | $e^{-t^2/\sum_i c_i^2}$ |
Appendix B: Sample Complexity Table
Given: i.i.d. data in $[0, 1]$; want $P(|\bar{X}_n - \mu| > \epsilon) \leq \delta$.
| $\epsilon$ | $\delta$ | Hoeffding $n$ | Normal approx $n$ |
|---|---|---|---|
| 0.10 | 0.10 | 150 | 68 |
| 0.05 | 0.10 | 600 | 271 |
| 0.05 | 0.05 | 738 | 384 |
| 0.01 | 0.05 | 18,445 | 9,604 |
| 0.01 | 0.01 | 26,492 | 16,587 |
Hoeffding is distribution-free (works for worst case); Normal approximation assumes CLT holds.
<- Back to Chapter 6: Probability Theory | Next: Stochastic Processes ->
Appendix C: Detailed Proofs and Derivations
C.1 Proof of Hoeffding's Lemma in Full
We prove: if $\mathbb{E}[X] = 0$ and $X \in [a, b]$ a.s., then $\mathbb{E}[e^{sX}] \leq e^{s^2(b-a)^2/8}$.
Step 1: Convexity bound. Since $x \mapsto e^{sx}$ is convex, for $x \in [a, b]$:
$$e^{sx} \leq \frac{b - x}{b - a}\,e^{sa} + \frac{x - a}{b - a}\,e^{sb}$$
Step 2: Take expectations. Since $\mathbb{E}[X] = 0$ (note $a \leq 0 \leq b$ after centering):
$$\mathbb{E}[e^{sX}] \leq \frac{b}{b - a}\,e^{sa} - \frac{a}{b - a}\,e^{sb}$$
Step 3: Exponential form. Let $p = -a/(b - a)$ and $u = s(b - a)$. Then the right-hand side equals $e^{\phi(u)}$ with:
$$\phi(u) = -pu + \log\!\big(1 - p + pe^u\big)$$
Note: $p \in [0, 1]$ (since $a \leq 0 \leq b$).
Step 4: Bound $\phi$. We have $\phi(0) = 0$ and $\phi'(0) = 0$. By Taylor's theorem:
$$\phi(u) = \frac{u^2}{2}\,\phi''(\xi)$$
for some $\xi$ between $0$ and $u$. Computing:
$$\phi''(u) = \frac{p(1-p)e^u}{(1 - p + pe^u)^2} \leq \frac{1}{4}$$
(since $t(1 - t) \leq 1/4$ for $t \in [0, 1]$, applied to $t = pe^u/(1 - p + pe^u)$).
Therefore: $\mathbb{E}[e^{sX}] \leq e^{u^2/8} = e^{s^2(b-a)^2/8}$. $\square$
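The lemma can be sanity-checked numerically. A minimal sketch: for the extremal two-point distribution $P(X = -1) = P(X = +1) = \tfrac{1}{2}$ (so $a = -1$, $b = 1$, mean zero) the MGF is $\cosh(s)$, and the lemma asserts $\cosh(s) \leq e^{s^2/2}$:

```python
import numpy as np

# Hoeffding's lemma for the extremal case X in {-1, +1} with equal probability:
# E[e^{sX}] = cosh(s), and the lemma claims cosh(s) <= exp(s^2 (b-a)^2 / 8) = exp(s^2/2).
s = np.linspace(-5.0, 5.0, 101)
mgf = np.cosh(s)
bound = np.exp(s**2 * (1 - (-1))**2 / 8)

print(bool(np.all(bound >= mgf)))  # → True: the bound holds pointwise on the grid
```

Equality holds only at $s = 0$; the two-point distribution is the case for which the constant $(b-a)^2/8$ cannot be improved.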
C.2 Proof of McDiarmid's Inequality via Azuma's Inequality
Azuma's Inequality. If $M_0, M_1, \ldots, M_n$ is a martingale with $|M_k - M_{k-1}| \leq c_k$ a.s., then:
$$P(M_n - M_0 \geq t) \leq \exp\!\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right)$$
Doob martingale construction. Given independent $X_1, \ldots, X_n$ and $f$ with bounded differences $c_1, \ldots, c_n$, define:
$$M_k = \mathbb{E}\big[f(X_1, \ldots, X_n) \mid X_1, \ldots, X_k\big]$$
Then:
- $M_0 = \mathbb{E}[f]$ (constant, before observing anything)
- $M_n = f(X_1, \ldots, X_n)$ (the function value itself)
- $(M_k)$ is a martingale by the tower property
Bounding differences. For each $k$, the increment $M_k - M_{k-1}$ measures how much revealing $X_k$ moves the conditional expectation. Since changing $x_k$ changes $f$ by at most $c_k$ and $M_{k-1}$ does not depend on $X_k$, the increment lies in an interval of length $c_k$, so $|M_k - M_{k-1}| \leq c_k$.
Applying Azuma to this Doob martingale - with the conditional version of Hoeffding's lemma, which exploits the length-$c_k$ interval rather than the symmetric range $[-c_k, c_k]$ - yields McDiarmid's inequality with the sharper constant $\exp(-2t^2/\sum_k c_k^2)$.
C.3 Proof of the Finite-Class PAC Bound (Agnostic Case)
Setup. $\mathcal{H}$ finite with $|\mathcal{H}| = M$. Losses in $[0, 1]$. True risk $R(h) = \mathbb{E}[\ell(h, Z)]$, empirical risk $\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n \ell(h, Z_i)$.
Goal. Show $P\big(\sup_{h \in \mathcal{H}} |\hat{R}(h) - R(h)| > \epsilon\big) \leq \delta$ when $n \geq \frac{\log(2M/\delta)}{2\epsilon^2}$.
Step 1. Fix $h$. By Hoeffding: $P(|\hat{R}(h) - R(h)| > \epsilon) \leq 2e^{-2n\epsilon^2}$.
Step 2. Union bound over all $M$ hypotheses:
$$P\big(\exists h \in \mathcal{H} : |\hat{R}(h) - R(h)| > \epsilon\big) \leq 2Me^{-2n\epsilon^2}$$
Step 3. Set $2Me^{-2n\epsilon^2} = \delta$ and solve: $n = \frac{\log(2M/\delta)}{2\epsilon^2}$.
ERM generalisation. Under uniform convergence, the ERM $\hat{h} = \arg\min_{h} \hat{R}(h)$ satisfies:
$$R(\hat{h}) \leq R(h^*) + 2\epsilon$$
where $h^* = \arg\min_{h \in \mathcal{H}} R(h)$ is the best hypothesis in the class (not necessarily the Bayes-optimal classifier, which may lie outside $\mathcal{H}$).
Appendix D: VC Theory in Depth
D.1 The Growth Function
Definition. The growth function $\Pi_{\mathcal{H}}(m)$ counts the maximum number of distinct labellings achievable by $\mathcal{H}$ on any $m$ points:
$$\Pi_{\mathcal{H}}(m) = \max_{x_1, \ldots, x_m} \Big|\big\{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\big\}\Big|$$
If $\mathcal{H}$ can shatter $m$ points, $\Pi_{\mathcal{H}}(m) = 2^m$. The VC dimension $d$ is the largest $m$ for which $\Pi_{\mathcal{H}}(m) = 2^m$.
D.2 Sauer-Shelah Lemma
Lemma. For $\mathcal{H}$ with VC dimension $d$ and any $m$:
$$\Pi_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i}$$
Proof idea. By induction on $m$ and $d$. Split on the last point: either it makes no difference to the labelling count, or it doubles some labellings - and the doubled part has VC dimension at most $d - 1$. Careful counting bounds the total.
Consequence. For $m \geq d$: $\Pi_{\mathcal{H}}(m) \leq (em/d)^d$, polynomial in $m$. The growth function thus transitions from exponential $2^m$ (while the class can still shatter) to polynomial (once $m$ exceeds the VC dimension).
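The lemma is easy to check computationally; a sketch comparing the combinatorial bound with $2^m$ (the function name is ours):

```python
from math import comb

def sauer_bound(m, d):
    """Sauer-Shelah upper bound on the growth function: sum_{i=0}^{d} C(m, i)."""
    return sum(comb(m, i) for i in range(d + 1))

# At or below the VC dimension the bound equals 2^m (shattering is possible);
# well above it, growth is only polynomial in m.
print(sauer_bound(3, 3), 2 ** 3)   # → 8 8
print(sauer_bound(20, 3))          # → 1351  (vs 2^20 = 1,048,576)
```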
D.3 VC Dimension Examples
| Hypothesis Class | VC Dimension |
|---|---|
| Threshold on $\mathbb{R}$ | 1 |
| Intervals on $\mathbb{R}$ | 2 |
| Halfspaces in $\mathbb{R}^d$ | $d + 1$ |
| Axis-aligned rectangles in $\mathbb{R}^2$ | 4 |
| Convex polygons in $\mathbb{R}^2$ ($k$ vertices) | $2k + 1$ |
| Degree-$r$ polynomial thresholds on $\mathbb{R}$ | $r + 1$ |
| Neural nets with $W$ weights (sign activations) | $O(W \log W)$ |
| Kernel classifiers (universal kernel, e.g. RBF) | $\infty$ |
Why VC $= d + 1$ for halfspaces. In $\mathbb{R}^d$, any $d + 1$ points in general position (not all on a common hyperplane) can be shattered by halfspaces. But no set of $d + 2$ points can be shattered: by Radon's theorem, any $d + 2$ points can be split into two groups whose convex hulls intersect, and no halfspace separates two groups with intersecting hulls. Hence $\mathrm{VC} = d + 1$.
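Both facts can be verified computationally for $d = 2$: whether a labelling is realisable by a halfspace is the LP feasibility problem $y_i(w^\top x_i + b) \geq 1$. A sketch, assuming `scipy` is available (helper name ours):

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def halfspace_realisable(X, y):
    """Is there (w, b) with y_i (w.x_i + b) >= 1 for all i?  (LP feasibility)"""
    n, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])  # -y_i(w.x_i + b) <= -1
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.status == 0  # 0 = feasible, 2 = infeasible

# d + 1 = 3 points in general position: every one of the 2^3 labellings works.
tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(all(halfspace_realisable(tri, np.array(s))
          for s in product((-1, 1), repeat=3)))          # → True

# d + 2 = 4 points: the Radon split (opposite corners of a square) fails.
square = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(halfspace_realisable(square, np.array([1, 1, -1, -1])))  # → False
```

The square's diagonal pairs are exactly a Radon partition: their convex hulls (the two diagonals) cross at the centre, so no halfspace can separate them.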
D.4 Fundamental Theorem of Statistical Learning
Theorem. For binary classification, the following are equivalent:
- is PAC-learnable
- ERM is a successful learner for
- has finite VC dimension
- has the uniform convergence property
This theorem, proved through Sauer-Shelah and the symmetrisation technique (doubling the sample and introducing Rademacher variables), is the bedrock of statistical learning theory.
Appendix E: Advanced Topics in Concentration
E.1 Talagrand's Inequality
Talagrand's inequality (1995) strengthens McDiarmid for "self-bounding" functions - those where the effect of each coordinate is bounded by the function value itself.
Self-bounding functions. A non-negative $f(X_1, \ldots, X_n)$ is self-bounding if, writing $f_i$ for the infimum of $f$ over its $i$-th coordinate:
$$0 \leq f - f_i \leq 1 \ \text{ for each } i, \qquad \sum_{i=1}^n (f - f_i) \leq f$$
Talagrand-type bound. For self-bounding $f$:
$$P\big(f \geq \mathbb{E}f + t\big) \leq \exp\!\left(-\frac{t^2}{2\mathbb{E}f + 2t/3}\right)$$
This is a variance-adaptive McDiarmid: the deviation scale is $\sqrt{\mathbb{E}f}$ rather than $\sqrt{n}$ - analogous to Bernstein vs Hoeffding.
Applications: the number of ones in a sum of Bernoullis, sizes of random matchings, and suprema of empirical processes such as $\sup_h(\hat{R}(h) - R(h))$, which matter precisely when the model fits the data.
E.2 Concentration for Heavy-Tailed Distributions
When distributions are heavy-tailed, sub-Gaussian bounds don't apply. Modern techniques include:
Catoni's estimator. For estimating the mean of a distribution with finite variance $\sigma^2$ (but potentially infinite MGF), Catoni's M-estimator achieves, with probability $1 - \delta$:
$$|\hat{\mu} - \mu| \leq O\!\left(\sigma\sqrt{\frac{\log(1/\delta)}{n}}\right)$$
using a truncated (soft-clipped) exponential influence function in place of the raw sample mean. This gives sub-Gaussian-like rates without boundedness.
Median of means. Partition the $n$ samples into $k \approx 8\log(1/\delta)$ groups of size $n/k$. Compute the group means and take their median. With probability $1 - \delta$, the median deviates from $\mu$ by at most $O\big(\sigma\sqrt{\log(1/\delta)/n}\big)$ - sub-Gaussian rates for distributions with only two finite moments.
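A quick illustration of median of means on heavy-tailed, contaminated data (the Lomax/Pareto shape, sample size, and group count below are invented for the demo; helper name ours):

```python
import numpy as np

def median_of_means(x, k, seed=0):
    """Median of the k group means of a randomly partitioned sample."""
    parts = np.array_split(np.random.default_rng(seed).permutation(x), k)
    return float(np.median([p.mean() for p in parts]))

rng = np.random.default_rng(42)
x = rng.pareto(3.0, size=100_000)          # Lomax(3): heavy tail, true mean 0.5
x = np.concatenate([x, np.full(10, 1e6)])  # a handful of gross outliers

# The empirical mean is destroyed (~100 instead of 0.5); the outliers land in
# at most 10 of the 32 groups, so the median of group means barely moves.
print(x.mean(), median_of_means(x, k=32))
```

The mechanism is the one described above: at most a minority of groups can be corrupted (or unlucky), and the median ignores them.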
For AI: Training data for LLMs often has heavy-tailed length distributions (documents follow power laws). Heavy-tail-robust mean estimation is relevant for unbiased training.
E.3 Matrix Concentration Inequalities
Scalar concentration extends to matrices. For many AI applications, we care about concentration of random matrices.
Matrix Hoeffding. For independent, self-adjoint, zero-mean random $d \times d$ matrices $X_i$ with $X_i^2 \preceq A_i^2$ deterministically, and $\sigma^2 = \big\|\sum_i A_i^2\big\|$:
$$P\Big(\lambda_{\max}\Big(\sum_i X_i\Big) \geq t\Big) \leq d\, e^{-t^2/(8\sigma^2)}$$
The extra factor $d$ (matrix dimension) accounts for the union over eigenvalue directions.
Application: random features. When approximating a kernel matrix by random features, the approximation error in spectral norm concentrates with matrix Hoeffding, justifying the use of random features for -accurate Gram matrix approximation.
Matrix Bernstein. For independent zero-mean random matrices with $\|X_i\| \leq L$ and $\sigma^2 = \max\big\{\big\|\sum_i \mathbb{E}[X_i X_i^\top]\big\|, \big\|\sum_i \mathbb{E}[X_i^\top X_i]\big\|\big\}$:
$$P\Big(\Big\|\sum_i X_i\Big\| \geq t\Big) \leq 2d \exp\!\left(-\frac{t^2/2}{\sigma^2 + Lt/3}\right)$$
Used in randomised linear algebra (randomised SVD, Nystrom approximation).
E.4 PAC-Bayes Theory
PAC-Bayes bounds combine the PAC framework with Bayesian prior knowledge.
Setup. Given a prior $P$ over $\mathcal{H}$ (chosen before seeing data) and a posterior $Q$ (chosen after seeing data), define the averaged risks $R(Q) = \mathbb{E}_{h \sim Q}[R(h)]$ and $\hat{R}(Q) = \mathbb{E}_{h \sim Q}[\hat{R}(h)]$.
PAC-Bayes Theorem (McAllester, 1999). For any prior $P$ and any $\delta > 0$, with probability $\geq 1 - \delta$, simultaneously for all posteriors $Q$:
$$R(Q) \leq \hat{R}(Q) + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(2\sqrt{n}/\delta)}{2(n - 1)}}$$
Why this is powerful: the KL term replaces $\log|\mathcal{H}|$ (which can be infinite). For neural networks, if the posterior concentrates near the prior (small KL), the bound is tight. Flat minima (low sharpness) correspond to posteriors that don't drift far from the prior.
Connection to SAM. Sharpness-Aware Minimisation (SAM, 2021) minimises a PAC-Bayes-inspired upper bound on generalisation error by seeking flat minima - directly operationalising PAC-Bayes in practice.
Appendix F: Worked Examples
F.1 Coin Flip Estimation
Problem. We flip a biased coin $n$ times and observe $\hat{p} = (\text{heads})/n$. How many flips to estimate $p$ within $\epsilon = 0.05$ with 99% confidence?
Hoeffding solution. With $\epsilon = 0.05$, $\delta = 0.01$:
$$n \geq \frac{\log(2/\delta)}{2\epsilon^2} = \frac{\log 200}{2 \times 0.0025} \approx 1060$$
Normal approximation. $n \approx z_{0.995}^2 \cdot \tfrac{1}{4} / \epsilon^2 = 2.576^2 \times 0.25 / 0.0025 \approx 664$. Less conservative because it uses the Gaussian CLT approximation (valid for large $n$).
Interpretation. About 1000-1100 flips (by the distribution-free Hoeffding count) are needed for reliable probability estimation. This is why A/B tests often require thousands of users: smaller samples give confidence intervals too wide to detect meaningful differences.
F.2 Multi-Armed Bandit via Chernoff
Problem. A recommendation system has $K$ arms (content types). Each arm has an unknown click-through rate $p_k$. We want to identify an arm within $\epsilon$ of the best with probability $\geq 1 - \delta$.
Naive approach. Pull each arm $n$ times. By Hoeffding + union bound:
$$P\big(\exists k : |\hat{p}_k - p_k| > \epsilon/2\big) \leq 2K e^{-n\epsilon^2/2}$$
Set this to $\delta$: $n = \frac{2\log(2K/\delta)}{\epsilon^2}$ pulls per arm. Total pulls: $O\!\big(\frac{K}{\epsilon^2}\log\frac{K}{\delta}\big)$.
Adaptive strategy (UCB). Upper Confidence Bound algorithms concentrate exploration on promising arms, so the total pull count scales with the inverse-squared gaps to the best arm rather than uniformly in $K/\epsilon^2$ - much smaller in practice.
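A minimal simulation of the naive strategy (the click-through rates and pull budget below are invented for illustration):

```python
import numpy as np

def naive_best_arm(p, pulls_per_arm, rng):
    """Pull every arm equally often; return the index of the best empirical mean."""
    means = [rng.binomial(pulls_per_arm, pk) / pulls_per_arm for pk in p]
    return int(np.argmax(means))

rng = np.random.default_rng(0)
p = [0.50, 0.55, 0.60, 0.65, 0.70]   # hypothetical CTRs; arm 4 is truly best
runs = 200
hits = sum(naive_best_arm(p, 500, rng) == 4 for _ in range(runs))
print(hits / runs)                   # fraction of runs identifying arm 4
```

With 500 pulls per arm and a 0.05 gap between the top two arms, the identification succeeds in the large majority of runs, consistent with the Hoeffding + union bound calculation above.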
F.3 Neural Network Generalisation: PAC-Bayes in Practice
Setup. 2-layer ReLU network, 100 hidden units, trained on 50,000 MNIST examples. Training error 0.3%, test error 1.2%. Standard VC bound: vacuous (the parameter count, roughly $8 \times 10^4$ for 784-dimensional inputs, is comparable to $n$, so the bound exceeds 1).
PAC-Bayes approach. Use a Gaussian prior $P = \mathcal{N}(w_0, \sigma^2 I)$ centred at the initialisation and a Gaussian posterior $Q = \mathcal{N}(w, \sigma^2 I)$ centred at the trained weights $w$. Compute:
$$\mathrm{KL}(Q \,\|\, P) = \frac{\|w - w_0\|^2}{2\sigma^2}$$
With careful tuning of $\sigma$, this can give non-vacuous bounds (< 50% error guarantee) even for deep networks - unlike VC theory (Dziugaite & Roy, 2017).
F.4 Random Projection (Johnson-Lindenstrauss)
Problem. We have $n$ points in $\mathbb{R}^d$. We want to project to $k \ll d$ dimensions while preserving all pairwise distances within a factor $1 \pm \epsilon$.
JL Lemma. A random projection to dimension
$$k = O\!\left(\frac{\log n}{\epsilon^2}\right), \qquad \text{e.g. } k \geq \frac{8\log n}{\epsilon^2},$$
suffices, regardless of the ambient dimension $d$.
So for $n = 10^4$ points at $\epsilon = 0.1$, about $k \approx 7400$ dimensions suffice - and because $k$ is independent of $d$, the compression from $d = 10^5$ or $10^6$ ambient dimensions is massive.
Proof sketch. For a random Gaussian matrix $A \in \mathbb{R}^{k \times d}$ with i.i.d. $\mathcal{N}(0, 1/k)$ entries, and any fixed pair of points $u, v$:
$$P\Big(\big|\,\|A(u - v)\|^2 - \|u - v\|^2\,\big| > \epsilon\,\|u - v\|^2\Big) \leq 2e^{-k(\epsilon^2/4 - \epsilon^3/6)}$$
by chi-squared concentration. A union bound over the $\binom{n}{2}$ pairs gives the JL lemma.
For AI: Locality-sensitive hashing, approximate nearest-neighbour search (used in RAG retrieval), and low-rank adaptation (LoRA) all rely on Johnson-Lindenstrauss-type arguments to justify dimensionality reduction.
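The lemma is easy to test empirically; a sketch with illustrative sizes ($n = 50$ points, ambient $d = 2000$, target $k = 500$, all chosen for speed):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 50, 2_000, 500
X = rng.standard_normal((n, d))
R = rng.standard_normal((d, k)) / np.sqrt(k)   # random Gaussian projection
Y = X @ R

def pairwise_sq_dists(Z):
    g = Z @ Z.T
    s = np.diag(g)
    return s[:, None] + s[None, :] - 2.0 * g

iu = np.triu_indices(n, k=1)                   # all unordered pairs
ratio = pairwise_sq_dists(Y)[iu] / pairwise_sq_dists(X)[iu]
print(ratio.min(), ratio.max())  # every squared distance within ~1 +/- eps
```

Each ratio is a scaled chi-squared with $k$ degrees of freedom, so its fluctuation is on the order of $\sqrt{2/k} \approx 0.06$ here; the union over all pairs widens the worst case only by a $\sqrt{\log n}$ factor.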
Appendix G: Computing Concentration Bounds in Practice
G.1 Choosing the Right Bound
DECISION TREE: WHICH BOUND TO USE
========================================================================
Do you know E[X]? --- No --> Can't bound the tail
        |
       Yes
        |
Is X >= 0 (and only E[X] known)? --- Yes --> Markov: P(X >= t) <= E[X]/t
        |
       No
        |
Do you know Var(X)? --- Yes --> Chebyshev: P(|X - mu| >= t) <= sigma^2/t^2
        |
       No
        |
Is X bounded in [a, b]? --- Yes --> Hoeffding (exponential bound),
        |                           or Bernstein if Var(X) is also known
        |
       No
        |
Is the MGF finite? --- Yes --> Chernoff method (optimise over s)
        |
       No
        |
Heavy tails --> median of means / Catoni, or asymptotic tools (bootstrap)
========================================================================
G.2 Python Recipe for Hoeffding CIs
```python
import numpy as np

def hoeffding_ci(x, delta=0.05, lo=0.0, hi=1.0):
    """Two-sided Hoeffding confidence interval for bounded data in [lo, hi]."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    radius = (hi - lo) * np.sqrt(np.log(2 / delta) / (2 * n))
    return x.mean() - radius, x.mean() + radius

def hoeffding_sample_size(epsilon, delta, lo=0.0, hi=1.0):
    """Smallest n with P(|mean - mu| > epsilon) <= delta, data in [lo, hi]."""
    return int(np.ceil((hi - lo) ** 2 * np.log(2 / delta) / (2 * epsilon ** 2)))
```
G.3 Interpreting Generalisation Gaps
When a model achieves training error 1% and test error 5%:
- Generalisation gap = 4%
- Is this within bounds? With $n$ training examples and confidence $\delta$, Hoeffding gives a deviation of $\sqrt{\log(2/\delta)/(2n)}$ per hypothesis
- For a class of $M$ hypotheses, the union bound inflates this to $\sqrt{\log(2M/\delta)/(2n)}$
- Whether the observed 4% gap is "explained" therefore depends on the effective $M$: naive parameter counting makes the bound vacuous, while effective-complexity measures can bring it to the right order of magnitude
In practice, neural networks occupy a tiny corner of their hypothesis class (due to inductive bias of SGD), so effective complexity is much smaller than the VC dimension suggests.
Appendix H: Self-Assessment Checklist
Before moving to Section06 (Stochastic Processes), verify you can:
- State Markov's inequality from memory and prove it with the indicator trick
- Derive Chebyshev from Markov by applying Markov to $(X - \mu)^2$
- Prove Hoeffding's lemma using convexity of $e^{sx}$ and a Taylor expansion
- Apply the Chernoff method: Markov on $e^{sX}$ + MGF bound + optimise over $s$
- State McDiarmid's inequality and identify the bounded-difference constants $c_i$ for a given function
- Compute the sample complexity for given $(\epsilon, \delta)$ using Hoeffding
- Define VC dimension and compute it for halfspaces ($d + 1$ in $\mathbb{R}^d$)
- Write the PAC generalisation bound with $\log|\mathcal{H}|$ or the VC dimension
- Define Rademacher complexity and explain its interpretation
- Identify when Bernstein beats Hoeffding (small variance case)
- Give a non-example for sub-Gaussian (heavy-tailed distribution)
- Explain why VC bounds are vacuous for LLMs and what PAC-Bayes offers
Appendix I: Connections to Information Theory
I.1 Kullback-Leibler Divergence and Exponential Families
The Chernoff method has a deep information-theoretic interpretation: the optimal exponent in the Chernoff bound is a KL divergence.
For $X_1, \ldots, X_n$ i.i.d. with distribution $P$, and a target mean $a > \mathbb{E}[X]$:
$$P(\bar{X}_n \geq a) \approx e^{-n\,\mathrm{KL}(P_a \,\|\, P)}$$
where $P_a$ is the exponential tilt of $P$ that puts its mean at $a$: $dP_s \propto e^{sx}\,dP$, with $s$ chosen so that $\mathbb{E}_{P_s}[X] = a$.
This is Cramer's theorem - the foundation of large deviations theory. The probability of a rare event decays exponentially at a rate given by the KL divergence to the closest distribution that makes the event typical.
For AI: This connection explains why cross-entropy loss (KL divergence from data to model) is the right loss for language models: minimising cross-entropy = maximising the probability of the training data = minimising the Chernoff exponent for generalisation error.
I.2 Entropy and Sauer-Shelah
The growth function $\Pi_{\mathcal{H}}(m)$ counts the number of distinct behaviours of $\mathcal{H}$ on $m$ points. The log growth rate
$$H_{\mathcal{H}}(m) = \frac{1}{m}\log_2 \Pi_{\mathcal{H}}(m)$$
is the combinatorial entropy of $\mathcal{H}$. Sauer-Shelah shows:
- If $m > d$: $\Pi_{\mathcal{H}}(m) \leq (em/d)^d$, so $H_{\mathcal{H}}(m) \to 0$ (subexponential growth, polynomial in $m$)
- If $m \leq d$: $\Pi_{\mathcal{H}}(m) = 2^m$, so $H_{\mathcal{H}}(m) = 1$ (full exponential growth)
This binary distinction - polynomial vs exponential growth - is precisely what separates PAC-learnable from non-learnable hypothesis classes.
Appendix J: Connections to Optimisation
J.1 Convergence of SGD via Hoeffding
The standard high-probability convergence analysis of SGD for convex functions combines three ingredients:
- Descent lemma: for $L$-smooth $f$, one step with learning rate $\eta$ gives $f(x_{k+1}) \leq f(x_k) - \eta\,\nabla f(x_k)^\top g_k + \frac{L\eta^2}{2}\|g_k\|^2$, where $g_k$ is the stochastic gradient
- Gradient concentration: $\|g_k - \nabla f(x_k)\|$ is small with high probability, via a Hoeffding-type bound for bounded or sub-Gaussian gradient noise
- Telescoping: sum over the $T$ steps and apply a union bound over all steps (adding a $\log T$ factor to the sample complexity)
The result: after $T = O(1/\epsilon^2)$ steps, SGD achieves $f(\bar{x}_T) - f^* \leq \epsilon$ with high probability.
J.2 Generalisation and Algorithmic Stability
Definition (Uniform Stability). Algorithm $A$ is $\beta$-uniformly stable if for any two datasets $S, S'$ differing in one example, and every test point $z$:
$$\big|\ell(A(S), z) - \ell(A(S'), z)\big| \leq \beta$$
Theorem (Bousquet-Elisseeff, 2002). If $A$ is $\beta_n$-stable with $\beta_n \leq c/n$ for some constant $c$, and losses are bounded in $[0, M]$, then with probability $\geq 1 - \delta$:
$$R(A(S)) \leq \hat{R}(A(S)) + 2\beta_n + (4n\beta_n + M)\sqrt{\frac{\log(1/\delta)}{2n}}$$
SGD is stable. For $L$-smooth convex losses, SGD with step size $\eta$ run for $T$ steps is $O(\eta T/n)$-uniformly stable. Setting $\eta = O(1/\sqrt{T})$ gives $O(\sqrt{T}/n)$ stability - small when $T \ll n^2$.
This explains why early stopping helps: more SGD steps = less stable = worse generalisation.
Appendix K: Concentration Phenomena in Transformer Architectures
K.1 Attention Score Concentration
In scaled dot-product attention, the query-key inner products $q^\top k / \sqrt{d_k}$ determine the attention weights. For randomly initialised $q, k \in \mathbb{R}^{d_k}$ with i.i.d. unit-variance entries, the raw inner product $q^\top k$ has standard deviation $\sqrt{d_k}$.
Without the $1/\sqrt{d_k}$ scaling, the softmax input therefore has standard deviation $\sqrt{d_k}$ (tens, for typical head dimensions), pushing the softmax into its saturating regime (near-uniform or near-one-hot) and killing gradients. The scaling restores standard deviation $\approx 1$, keeping gradients flowing.
Formal justification. $q^\top k = \sum_{i=1}^{d_k} q_i k_i$, where each summand has mean 0 and variance 1. The sum of $d_k$ independent such terms has variance $d_k$ (the products are sub-exponential rather than sub-Gaussian, but the sum is approximately Gaussian for large $d_k$), hence standard deviation $\sqrt{d_k}$. After scaling, $q^\top k/\sqrt{d_k}$ has standard deviation 1 - the right scale for softmax.
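The $\sqrt{d_k}$ claim is a two-line experiment (the head dimension and sample count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
q = rng.standard_normal((100_000, d))   # 100k random query/key pairs
k = rng.standard_normal((100_000, d))
logits = (q * k).sum(axis=1)            # raw dot products q.k

print(logits.std())                  # ~ sqrt(256) = 16: huge softmax inputs
print((logits / np.sqrt(d)).std())   # ~ 1 after the 1/sqrt(d_k) scaling
```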
K.2 Concentration in Residual Networks
Deep residual networks accumulate residual perturbations: $x_{\ell+1} = x_\ell + f_\ell(x_\ell)$. If each block's contribution is bounded, $\|f_\ell(x_\ell)\| \leq \beta$, and the centred increments behave like bounded martingale differences, Azuma's inequality applied to the drift of the residual stream gives:
$$P\big(\|x_L - x_0\| \geq \mathbb{E}\|x_L - x_0\| + t\big) \leq \exp\!\left(-\frac{t^2}{2L\beta^2}\right)$$
This says: even with $L$ layers, if each residual block adds a small perturbation ($\beta = O(1/\sqrt{L})$), the total drift stays $O(1)$ with high probability. LayerNorm and weight decay enforce exactly this: they keep $\|f_\ell\|$ small, ensuring the representation doesn't drift exponentially with depth.
K.3 LoRA and Low-Rank Concentration
LoRA (Low-Rank Adaptation) approximates weight updates as $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$, with small rank $r \ll d$. The resulting hypothesis class has sharply reduced complexity.
Rademacher complexity of LoRA. For the function class $\mathcal{F}_r = \{x \mapsto (W_0 + BA)x : \|B\|_F\|A\|_F \leq C\}$, norm-based bounds give
$$\mathfrak{R}_n(\mathcal{F}_r) = O\!\left(\frac{C \cdot \max_i \|x_i\|}{\sqrt{n}}\right)$$
where the achievable norm budget $C$ grows with the rank $r$ rather than with the full dimension. The key: the complexity of the adapted class scales with $r$, not with $d^2$ (full fine-tuning). With $r \ll d$, LoRA has a provably smaller generalisation gap than full fine-tuning, which helps explain why it doesn't overfit despite being used on small datasets.
Appendix L: Practical Guidelines for ML Practitioners
L.1 When to Report Confidence Intervals
Always report when:
- Sample size (Hoeffding CI is meaningful)
- Comparing two systems (the CI overlap tells you if the difference is significant)
- Benchmarking LLMs on standard evaluations
Minimum information to report: point estimate, sample size, CI method (Hoeffding vs Normal approximation), confidence level.
L.2 Common Benchmarking Mistakes
| Mistake | Statistical Problem | Fix |
|---|---|---|
| "Model A (73.2%) > Model B (72.8%) on 1000 questions" | 95% CI is +/-4.2% - difference insignificant | Need for +/-1% CI |
| "Best of 10 model variants is significant" | Multiple comparison: false positive rate | Bonferroni: each test at |
| "Averaging over 5 datasets gives reliable comparison" | Each dataset is a separate test | Report CI for each; meta-analysis requires care |
| "Our method wins on 8/10 tasks" | Sign test: need for | Use paired t-test or Wilcoxon signed-rank |
| "1000-epoch training curve shows improvement" | Training curve is not test performance | Report test error at convergence with CI |
L.3 Required Sample Sizes (Reference Table)
For binary outcomes (accuracy), using Hoeffding with 95% confidence ($\delta = 0.05$):
| Desired precision ($\epsilon$) | Required $n$ |
|---|---|
| +/-10% (0.10) | 185 |
| +/-5% (0.05) | 738 |
| +/-3% (0.03) | 2,050 |
| +/-2% (0.02) | 4,612 |
| +/-1% (0.01) | 18,445 |
| +/-0.5% (0.005) | 73,778 |
Note: These are conservative (distribution-free). The Normal approximation gives roughly half as many, valid when $n\hat{p}(1 - \hat{p})$ is reasonably large (CLT regime).
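The table regenerates from the same Hoeffding formula used throughout this section (helper name ours):

```python
import math

def hoeffding_n(epsilon, delta=0.05):
    """Smallest distribution-free n with P(|mean - mu| > eps) <= delta, data in [0, 1]."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

for eps in (0.10, 0.05, 0.03, 0.02, 0.01, 0.005):
    print(f"+/-{eps}: n = {hoeffding_n(eps):,}")
# → 185, 738, 2,050, 4,612, 18,445, 73,778
```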
L.4 Interpreting Generalisation Bounds for Deep Learning
Current theoretical bounds are often loose for deep networks. Practitioners should interpret them as:
- Order of magnitude guidance: A bound of correctly predicts that more data helps linearly, but may overestimate the absolute gap by 10-100x
- Relative comparisons: PAC-Bayes bounds correctly rank models by generalisation risk even when absolute values are imprecise
- Design principles: Results like "norm-constrained models generalise better" translate directly to regularisation practices (weight decay, gradient clipping, dropout)
- Not absolute guarantees: A theoretical bound of 40% error doesn't mean the model has 40% test error - it means we can't rule out 40% test error from the theory alone
The theory is most useful as a framework for thinking, not as a numerical oracle.
Appendix M: References and Further Reading
Primary Sources
- Boucheron, Lugosi & Massart (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press. - The definitive modern treatment.
- Hoeffding, W. (1963). Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 58(301), 13-30.
- McDiarmid, C. (1989). On the Method of Bounded Differences. Surveys in Combinatorics, 141, 148-188.
- Vapnik, V. & Chervonenkis, A. (1971). On the Uniform Convergence of Relative Frequencies to Their Probabilities. Theory of Probability and Its Applications, 16(2), 264-280.
- Valiant, L. (1984). A Theory of the Learnable. Communications of the ACM, 27(11), 1134-1142.
Learning Theory Textbooks
- Shalev-Shwartz & Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. - Best introduction to PAC learning for ML practitioners.
- Mohri, Rostamizadeh & Talwalkar (2018). Foundations of Machine Learning (2nd ed.). MIT Press. - Covers Rademacher complexity and generalisation theory in depth.
- Wainwright, M. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. - Advanced treatment; covers matrix concentration and modern techniques.
Deep Learning Theory
- Zhang, Bengio et al. (2017). Understanding Deep Learning Requires Rethinking Generalization. ICLR 2017. - Showed memorisation capacity, challenging classical VC theory.
- Dziugaite & Roy (2017). Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many Parameters. UAI 2017. - First non-vacuous PAC-Bayes bounds for deep nets.
- Belkin et al. (2019). Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off. PNAS 2019. - Double descent, overparameterisation.
- Foret et al. (2021). Sharpness-Aware Minimization for Efficiently Improving Generalization. ICLR 2021. - SAM as PAC-Bayes operationalisation.
Appendix N: Extended Worked Problems
N.1 Bounding the Maximum of Random Variables
Problem. Let $X_1, \ldots, X_n \sim \mathcal{N}(0, 1)$ i.i.d. How large is $\max_i X_i$, approximately?
Answer. Each $X_i$ is 1-sub-Gaussian. By the union bound:
$$P\big(\max_i X_i \geq t\big) \leq n\,e^{-t^2/2}$$
Setting the right side to $\delta$: $t = \sqrt{2\log(n/\delta)}$. So $\max_i X_i \lesssim \sqrt{2\log n}$ with high probability.
More precisely, $\mathbb{E}[\max_i X_i] \sim \sqrt{2\log n}$ as $n \to \infty$: the crude union bound is tight to leading order, matching Monte Carlo.
For AI: the maximum attention logit over $n$ tokens behaves like the maximum of $n$ roughly-Gaussian scores, and so grows like $\sqrt{2\log n}$. The softmax of such a maximum concentrates on one or a few tokens for large $n$ - one explanation for why attention becomes "sharp" (spiky) when logit variance is large.
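A Monte Carlo check of the $\sqrt{2\log n}$ scale (sizes chosen for speed):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10_000, 50
maxima = rng.standard_normal((trials, n)).max(axis=1)

# Union-bound prediction: max ~ sqrt(2 log n) ≈ 4.29 for n = 10^4. The true
# E[max] sits a little below it (the union bound ignores the tail's
# polynomial prefactor), but the sqrt(2 log n) scale is right.
print(round(float(maxima.mean()), 2), round(math.sqrt(2 * math.log(n)), 2))
```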
N.2 Symmetric Rounding
Problem. An algorithm rounds each of $n$ numbers independently up or down by at most 0.5. What is the probability the total rounding error exceeds 10?
Solution. Each error $\epsilon_i \in [-0.5, 0.5]$ with $\mathbb{E}[\epsilon_i] = 0$ (symmetric rounding). The sum $S = \sum_{i=1}^n \epsilon_i$ satisfies Hoeffding with $\sum_i (b_i - a_i)^2 = n$, so:
$$P(S \geq 10) \leq e^{-200/n}$$
The two-sided version: $P(|S| \geq 10) \leq 2e^{-200/n}$. For example (our illustrative choice), with $n = 100$ numbers this is $2e^{-2} \approx 0.27$, while with $n = 50$ it is $2e^{-4} \approx 0.037$.
N.3 PAC Bound for a Neural Network Classifier
Problem. A student trains a 3-layer neural network with $W$ parameters on $n$ examples and achieves 2% training error. Using the finite-class PAC bound with $|\mathcal{H}| = 2^W$ (very conservatively, one bit per parameter), what does the bound say about test error?
Solution. The agnostic finite-class PAC bound:
$$R(\hat{h}) \leq \hat{R}(\hat{h}) + \sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2n}} = 0.02 + \sqrt{\frac{W\log 2 + \log(2/\delta)}{2n}}$$
Whenever the parameter count $W$ is comparable to or larger than $n$ - the typical deep-learning regime - the square-root term exceeds 1, so the bound exceeds 100%: completely vacuous. This illustrates why VC/parameter-counting bounds are useless for deep networks. PAC-Bayes or Rademacher complexity (using weight norms rather than parameter count) gives non-vacuous results.
N.4 Proving Generalisation for 1-Nearest Neighbour
Problem. For $k$-nearest-neighbour classification with $k = 1$: (a) Show that the leave-one-out error is an unbiased estimate of the true error. (b) Use McDiarmid to bound the deviation of the leave-one-out estimate from its mean.
Solution. (a) The leave-one-out (LOO) error is $\hat{R}_{\mathrm{LOO}} = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{\hat{y}_{-i}(x_i) \neq y_i\}$, where $\hat{y}_{-i}$ is the 1-NN prediction for $x_i$ using all other training points. By symmetry of the i.i.d. draw, $\mathbb{E}[\hat{R}_{\mathrm{LOO}}] = R_{n-1}$, the true error of 1-NN trained on $n - 1$ examples.
(b) Swapping one training example changes $\hat{R}_{\mathrm{LOO}}$ by at most $c/n$, where $c$ bounds how many points can have a given point as their nearest neighbour (a dimension-dependent constant). By McDiarmid:
$$P\big(|\hat{R}_{\mathrm{LOO}} - \mathbb{E}[\hat{R}_{\mathrm{LOO}}]| \geq t\big) \leq 2\exp\!\left(-\frac{2nt^2}{c^2}\right)$$
N.5 Confidence Region for Gradient Descent Convergence
Problem. SGD is run for $T$ iterations with step size $\eta$ on a convex function. Each gradient estimate is sub-Gaussian with parameter $\sigma^2$. With what probability does SGD achieve $f(\bar{x}_T) - f^* \leq \epsilon$?
Solution. The standard in-expectation analysis shows $\mathbb{E}[f(\bar{x}_T) - f^*] = O(1/\sqrt{T})$ for $\eta = O(1/\sqrt{T})$, where $\bar{x}_T$ is the averaged iterate.
For the high-probability version: at each step, the gradient noise is sub-Gaussian with parameter $\sigma^2/B$ for batch size $B$. The accumulated noise along the trajectory forms a martingale, so an Azuma-type argument plus a union bound over the $T$ steps gives:
$$P\big(f(\bar{x}_T) - f^* > \epsilon\big) \leq \delta \quad \text{when } T = O\!\left(\frac{\sigma^2 \log(1/\delta)}{\epsilon^2}\right)$$
Appendix O: Visualisation Notes
O.1 Plotting Concentration Inequalities
When visualising tail bounds, always show:
- The exact tail probability (computed numerically) as the ground truth
- The bound as a curve above the exact probability
- The ratio (bound/exact) to show how loose the bound is
TAIL BOUND TIGHTNESS - Gaussian(0,1)
========================================================================
 t | P(X >= t) exact | Markov* | Chebyshev | Hoeffding | Bound/Exact
------------------------------------------------------------------------
 1 |  0.159          |  N/A    |  1.000    |  0.607    |  6.3x / 3.8x
 2 |  0.023          |  N/A    |  0.250    |  0.135    |  11x / 5.9x
 3 |  0.0013         |  N/A    |  0.111    |  0.011    |  82x / 8.2x
 4 |  3.2e-5         |  N/A    |  0.0625   |  3.4e-4   |  1973x / 11x
 5 |  2.9e-7         |  N/A    |  0.040    |  3.7e-6   |  140,000x / 13x
*Markov requires X >= 0; not directly applicable to the symmetric Gaussian
========================================================================
At t = 5: Chebyshev overestimates by ~140,000x, while the sub-Gaussian (Hoeffding-type) bound is within a factor ~13. The sub-Gaussian bound captures the exponential rate exactly - the Gaussian IS the extremal sub-Gaussian case - and the remaining gap is only the polynomial prefactor $\frac{1}{t\sqrt{2\pi}}$ that the bound ignores.
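The table regenerates from the closed forms (exact Gaussian tail via `math.erfc`):

```python
import math

def gauss_tail(t):
    """Exact P(Z >= t) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(t / math.sqrt(2))

for t in (1, 2, 3, 4, 5):
    exact = gauss_tail(t)
    chebyshev = 1.0 / t**2              # needs only Var(Z) = 1
    subgauss = math.exp(-t**2 / 2.0)    # sigma = 1 sub-Gaussian tail bound
    print(f"t={t}:  exact={exact:.2e}"
          f"  chebyshev/exact={chebyshev / exact:.0f}x"
          f"  subgauss/exact={subgauss / exact:.0f}x")
```

The polynomial/exponential contrast in the ratio columns is the whole story of this section in one loop.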
O.2 Sample Complexity Curves
The sample complexity $n = O\!\big(\frac{\log(1/\delta)}{\epsilon^2}\big)$ has a characteristic shape:
- The $1/\epsilon^2$ factor: doubling precision (halving $\epsilon$) requires 4x more data
- The $\log(1/\delta)$ factor: going from 95% to 99.9% confidence multiplies the sample size by only $\log(1/0.001)/\log(1/0.05) \approx 2.3$ - logarithmically cheap
This scaling is fundamental: it appears in:
- Hoeffding (statistical estimation)
- Monte Carlo (variance $\sigma^2/n$, error $O(1/\sqrt{n})$)
- PAC learning (generalisation gap $O(\sqrt{\log(|\mathcal{H}|/\delta)/n})$)
- Stochastic optimisation (SGD convergence rate $O(1/\sqrt{T})$)
All are manifestations of the same underlying concentration phenomenon.
Appendix P: Glossary of Key Terms
| Term | Definition | Section |
|---|---|---|
| Concentration inequality | A bound on $P(X - \mathbb{E}X \geq t)$ or $P(\|X - \mathbb{E}X\| \geq t)$ | Section1 |
| Sub-Gaussian random variable | $X$ with $\mathbb{E}[e^{s(X - \mu)}] \leq e^{s^2\sigma^2/2}$ for all $s$ | Section3.1 |
| Sub-Gaussian parameter | $\sigma^2$: the proxy variance in the sub-Gaussian definition | Section3.1 |
| Rademacher variable | $\sigma_i \in \{-1, +1\}$ with equal probability | Section3.2 |
| Hoeffding's lemma | Bounded zero-mean RV: $\mathbb{E}[e^{sX}] \leq e^{s^2(b-a)^2/8}$ | Section4.1 |
| Chernoff method | Tail bound via $\min_{s > 0} e^{-st}\,\mathbb{E}[e^{sX}]$ | Section5.1 |
| Bounded differences | $\|f(\cdots x_i \cdots) - f(\cdots x_i' \cdots)\| \leq c_i$ | |
| Union bound | $P(\bigcup_i A_i) \leq \sum_i P(A_i)$ | Section8.1 |
| Covering number | Smallest $\epsilon$-net size for a set under a metric | Section8.3 |
| Shattering | $\mathcal{H}$ shatters $C$ if it realises all $2^{\|C\|}$ labellings of $C$ | |
| VC dimension | Size of the largest shattered set | Section9.4 |
| Growth function | $\Pi_{\mathcal{H}}(m)$: max labellings of $m$ points by $\mathcal{H}$ | App D.1 |
| PAC learning | Probably approximately correct: w.p. $\geq 1 - \delta$, error $\leq \epsilon$ | Section9.1 |
| Uniform convergence | $\sup_h \|\hat{R}(h) - R(h)\| \leq \epsilon$ over the whole class | |
| Rademacher complexity | $\mathbb{E}_\sigma\big[\sup_f \frac{1}{n}\sum_i \sigma_i f(x_i)\big]$ | Section10.1 |
| PAC-Bayes | Generalisation bound using $\mathrm{KL}(Q \,\|\, P)$ | App E.4 |
| Algorithmic stability | Output insensitive to swapping one training example | App J.2 |
| Exponential tilt | Twisted distribution $dP_s \propto e^{sx}\,dP$ used in Chernoff | App I.1 |
Appendix Q: The Method of Types and Large Deviations
Q.1 Types and Empirical Distributions
The method of types is an elegant combinatorial approach to concentration that complements the analytical MGF-based approach.
Definition. For a sequence $x = (x_1, \ldots, x_n)$ over a finite alphabet $\mathcal{X}$, the type (empirical distribution) is:
$$\hat{P}_x(a) = \frac{1}{n}\big|\{i : x_i = a\}\big|, \qquad a \in \mathcal{X}$$
The set of all sequences of type $Q$ is the type class $T(Q)$.
Key counting result. For a finite alphabet $\mathcal{X}$:
$$\frac{2^{nH(Q)}}{(n + 1)^{|\mathcal{X}|}} \leq |T(Q)| \leq 2^{nH(Q)}$$
where $H(Q)$ is the entropy of $Q$ (in bits). This means there are approximately $2^{nH(Q)}$ sequences with empirical distribution $Q$.
Q.2 Connection to Chernoff Bounds
For $X_1, \ldots, X_n$ i.i.d. $\sim P$ and any type $Q$, the empirical distribution satisfies:
$$P^n\big(\hat{P}_x = Q\big) \leq 2^{-n\,D(Q \,\|\, P)}$$
This is the method-of-types bound. Applying it to the event $\{\bar{X}_n \geq a\}$:
$$P(\bar{X}_n \geq a) \leq (n + 1)^{|\mathcal{X}|}\; 2^{-n\,D(Q^* \| P)}$$
where $Q^*$ is the type closest to $P$ (in KL) among those with mean $\geq a$. For large $n$, the polynomial factor is negligible, and this matches the Chernoff bound with its KL-divergence interpretation.
Q.3 Cramer's Theorem
Theorem (Cramer, 1938). For $X_1, \ldots, X_n$ i.i.d. with distribution $P$ and $a > \mathbb{E}[X]$:
$$\lim_{n \to \infty} \frac{1}{n}\log P(\bar{X}_n \geq a) = -\Lambda^*(a)$$
where $\Lambda^*(a) = \sup_s\,[sa - \log M_X(s)]$ is the Legendre transform of the log-MGF (the rate function), and $\Lambda^*(a) = \mathrm{KL}(P_a \,\|\, P)$, the divergence to the exponentially tilted distribution with mean $a$.
This is the fundamental result of large deviations theory: rare events occur at exponential rate determined by the KL distance to the "closest" distribution that makes the event typical.
For AI: Language-model perplexity is the exponentiated cross-entropy between the data distribution and the model. Low perplexity = model close to the true distribution = rare events (unusual tokens) are not overly penalised. Cramer's theorem connects rare-event probabilities to KL divergence, which is one more reason cross-entropy is the natural loss.
Appendix R: Generalisation in the Overparameterised Regime
R.1 The Classical Puzzle
Classical theory (VC, Rademacher) says: more parameters = larger hypothesis class = worse generalisation. Yet in practice, overparameterised models (large ) often generalise better than their smaller counterparts.
The resolution involves multiple mechanisms:
1. Implicit regularisation. Gradient descent on overparameterised linear models converges to the minimum-$\ell_2$-norm interpolating solution. Among all interpolating solutions, this has the smallest norm-based Rademacher complexity.
2. Double descent. As model capacity increases past the interpolation threshold ($p \approx n$ parameters), test error first rises (classical overfitting) and then falls again. In the overparameterised regime $p \gg n$, the minimum-norm interpolating solution's generalisation error decreases as $p$ grows further - "benign" interpolation.
3. Benign overfitting. Bartlett et al. (2020) showed that for linear models with Gaussian data, interpolation (zero training error) can generalise well when most singular values of the data matrix are small - the model "memorises" noise in directions orthogonal to the signal.
R.2 Neural Tangent Kernel (NTK) Theory
Wide neural networks in the "lazy training" (NTK) regime behave like kernel methods. The generalisation bound takes the kernel form:
$$R(\hat{f}) - \hat{R}(\hat{f}) \lesssim \sqrt{\frac{y^\top K^{-1} y}{n}}$$
where $K$ is the NTK Gram matrix on the training data. The key quantity $y^\top K^{-1} y$ is the effective complexity of the labels with respect to the kernel - analogous to a norm-based Rademacher complexity. When the labels align with the top eigendirections of $K$, this stays bounded even as depth and width grow.
R.3 PAC-Bayes for Overparameterised Models
The most successful approach for non-vacuous bounds on deep networks uses PAC-Bayes. The bound:
$$\mathbb{E}_{h \sim Q}[R(h)] \le \mathbb{E}_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln(2\sqrt{n}/\delta)}{2n}}$$
applies to a stochastic classifier $Q$. For weight-perturbed networks (Gaussian posterior $Q = \mathcal{N}(w, \sigma^2 I)$ with small variance $\sigma^2$ and prior $P = \mathcal{N}(w_0, \sigma^2 I)$):
$$\mathrm{KL}(Q \,\|\, P) = \frac{\|w - w_0\|_2^2}{2\sigma^2}.$$
This is small when the fine-tuned weights $w$ don't stray far from the initialisation $w_0$. SAM (Sharpness-Aware Minimisation) explicitly seeks flat minima, where the posterior can concentrate on a broad region around $w$, keeping this KL term small.
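As a sketch of the Gaussian-posterior computation (the McAllester-style complexity term used here is one common form of the PAC-Bayes bound; the weight vectors are illustrative):

```python
import math

def gaussian_kl(w, w0, sigma2):
    """KL( N(w, sigma2*I) || N(w0, sigma2*I) ) for equal isotropic covariances:
    ||w - w0||^2 / (2 sigma2)."""
    sq_dist = sum((wi - w0i) ** 2 for wi, w0i in zip(w, w0))
    return sq_dist / (2 * sigma2)

def pac_bayes_gap(kl, n, delta=0.05):
    """McAllester-style complexity term sqrt((KL + ln(2 sqrt(n)/delta)) / (2n))."""
    return math.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n))

# Small perturbation of the initialisation -> small KL -> small complexity term
kl = gaussian_kl([1.0, 1.0], [0.0, 0.0], sigma2=1.0)   # = 1.0
gap = pac_bayes_gap(kl, n=10000)
```

Even with $\mathrm{KL} = 0$ the term stays positive (the $\ln(2\sqrt{n}/\delta)$ part), so the bound can never certify zero gap - only that small weight movement keeps the gap near its floor.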
Appendix S: Concentration in Continuous-Time Settings
S.1 Azuma's Inequality for Martingales
We've seen Azuma's inequality used to prove McDiarmid. Here's the full statement:
Theorem (Azuma-Hoeffding). Let $(M_k)_{k \ge 0}$ be a martingale (or supermartingale) with $M_0 = 0$ and $|M_k - M_{k-1}| \le c_k$ a.s. Then:
$$\mathbb{P}(M_n \ge t) \le \exp\!\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right).$$
Proof. Apply the Chernoff method: $\mathbb{P}(M_n \ge t) \le e^{-st}\,\mathbb{E}[e^{sM_n}]$. For a martingale, $\mathbb{E}[e^{s(M_k - M_{k-1})} \mid \mathcal{F}_{k-1}] \le e^{s^2 c_k^2/2}$ by Hoeffding's lemma (conditional on $\mathcal{F}_{k-1}$, the increment has zero mean and is bounded). Telescoping: $\mathbb{E}[e^{sM_n}] \le e^{s^2 \sum_k c_k^2/2}$. Optimise over $s = t/\sum_k c_k^2$ to get the stated bound. $\square$
S.2 Doob's Optional Stopping
Theorem. If $(M_k)$ is a martingale and $\tau$ is a bounded stopping time, then $\mathbb{E}[M_\tau] = \mathbb{E}[M_0]$.
Application to SGD. Define $M_k = f(w_k) + c\,k$, where $c$ is a constant determined by the step size and gradient-noise variance. Under mild conditions, $(M_k)$ is a supermartingale. Azuma's inequality applied to the stopped process $(M_{k \wedge \tau})$ gives high-probability guarantees on convergence time.
This is the mathematical foundation of high-probability SGD convergence - stronger than the usual in-expectation guarantees.
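A simulation against the simplest bounded-increment martingale, a $\pm 1$ random walk, checks that the Azuma tail bound holds (and shows its slack):

```python
import math
import random

random.seed(0)

def azuma_bound(t, c_sq_sum):
    """One-sided Azuma-Hoeffding tail: exp(-t^2 / (2 * sum_k c_k^2))."""
    return math.exp(-t ** 2 / (2 * c_sq_sum))

# A +-1 random walk is a martingale with increments bounded by c_k = 1.
n_steps, trials, t = 100, 20000, 20
hits = sum(
    1 for _ in range(trials)
    if sum(random.choice((-1, 1)) for _ in range(n_steps)) >= t
)
empirical_tail = hits / trials          # true value is about 0.028 here
theoretical = azuma_bound(t, n_steps)   # exp(-2) ~ 0.135
```

The bound is valid but loose by a factor of several - typical for worst-case concentration inequalities.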
Appendix T: Summary Diagrams
T.1 The Proof Technique Map
PROOF TECHNIQUES FOR CONCENTRATION INEQUALITIES
========================================================================
Core tool: Markov's inequality
-------------------------------------------------------------------
Apply Markov to (X-\mu)^2 --> Chebyshev
Apply Markov to e^{sX} --> Chernoff method
+-- with Hoeffding's lemma --> Hoeffding's inequality
+-- with Bernstein's lemma --> Bernstein's inequality
+-- with Binomial MGF --> Chernoff for Bin(n,p)
Apply Azuma to Doob martingale --> McDiarmid's inequality
Union bound over hypothesis class --> PAC generalisation bound
Union bound + covering number --> Infinite-class PAC
Rademacher symmetrisation --> Rademacher complexity bound
-------------------------------------------------------------------
Common pattern: P(X \geq t) \leq E[e^{sX}] / e^{st} (all MGF methods)
========================================================================
T.2 Bound Quality Comparison
TAIL PROBABILITY COMPARISON - X = mean of n=100 Bernoulli(0.5), t=0.1
========================================================================
Method Bound Notes
-----------------------------------------------------------------
Exact (CLT) ~0.046 Normal approx; valid for large n
Hoeffding 0.135 Tight for this case
Chebyshev 0.250 3x looser
Markov 0.500 5x looser (applied to |X-0.5|/0.5)
Same setup, t=0.2:
Exact ~0.000046
Hoeffding 0.000335 ~7x looser
Chebyshev 0.0625 1360x looser
========================================================================
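The Hoeffding and Chebyshev rows of the table can be reproduced in a few lines of Python (a minimal sketch; the "Exact (CLT)" column needs a normal CDF and is omitted here):

```python
import math

def hoeffding_one_sided(n, t):
    """One-sided Hoeffding tail for the mean of n variables in [0, 1]."""
    return math.exp(-2 * n * t ** 2)

def chebyshev_two_sided(n, t, var_one=0.25):
    """Two-sided Chebyshev tail for the mean of n Bernoulli(0.5) draws
    (per-draw variance 0.25, so the mean has variance 0.25/n)."""
    return min(1.0, (var_one / n) / t ** 2)

n = 100
row_t01 = (hoeffding_one_sided(n, 0.1), chebyshev_two_sided(n, 0.1))  # ~(0.135, 0.250)
row_t02 = (hoeffding_one_sided(n, 0.2), chebyshev_two_sided(n, 0.2))  # ~(0.000335, 0.0625)
```

Doubling $t$ multiplies Hoeffding's exponent by four but Chebyshev's denominator only by four - the source of the widening gap between the two rows.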
Appendix U: Proofs of Sub-Gaussian Closure and Examples
U.1 Cosh Inequality for Rademacher Variables
Claim. For $\varepsilon \in \{-1, +1\}$ uniform: $\mathbb{E}[e^{s\varepsilon}] = \cosh(s) \le e^{s^2/2}$.
Proof. $\mathbb{E}[e^{s\varepsilon}] = \frac{e^s + e^{-s}}{2} = \cosh(s) = \sum_{k=0}^{\infty} \frac{s^{2k}}{(2k)!} \le \sum_{k=0}^{\infty} \frac{(s^2/2)^k}{k!} = e^{s^2/2}$, using $(2k)! \ge 2^k k!$. $\square$
Interpretation. Rademacher variables are the "most concentrated" bounded variables in $[-1, 1]$: their MGF is exactly $\cosh(s)$, slightly below the sub-Gaussian bound $e^{s^2/2}$. This is also why a factor of 2 multiplies the Rademacher complexity in generalisation bounds - the factor comes from the symmetrisation step that introduces the Rademacher supremum.
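The MGF inequality for Rademacher variables is easy to verify numerically on a grid of $s$ values:

```python
import math

# Check E[e^{s*eps}] = cosh(s) <= exp(s^2/2) on a grid of s in [-5, 5]
gap_ok = all(
    math.cosh(s) <= math.exp(s ** 2 / 2) + 1e-12
    for s in (x / 10 for x in range(-50, 51))
)
```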
U.2 Sub-Gaussian Parameter for Bounded Variables
Claim. If $X \in [a, b]$ with $\mathbb{E}[X] = \mu$, then $X - \mu$ is $\frac{b-a}{2}$-sub-Gaussian.
The parameter is achieved when $X = \frac{a+b}{2} + \frac{b-a}{2}\,\varepsilon$ where $\varepsilon$ is Rademacher - a two-point distribution at the endpoints. This is the "hardest" distribution on $[a, b]$ for Hoeffding's lemma.
Tightness example. For the Rademacher distribution on $\{-1, +1\}$: variance $1$, optimal sub-Gaussian parameter $\sigma^2 = 1$. They agree! The same happens for the uniform distribution on $[-1, 1]$, where both equal $1/3$. For other distributions in $[-1, 1]$, sub-Gaussian parameter $\ge$ variance.
U.3 Sub-Gaussian Characterisation via Tail Probabilities
The following are equivalent for a mean-zero random variable $X$ (up to constant factors in $\sigma$):
- (MGF) $\mathbb{E}[e^{sX}] \le e^{\sigma^2 s^2/2}$ for all $s \in \mathbb{R}$
- (Tail) $\mathbb{P}(|X| \ge t) \le 2e^{-t^2/(2\sigma^2)}$ for all $t \ge 0$
- (Moments) $(\mathbb{E}[|X|^k])^{1/k} \le C\sigma\sqrt{k}$ for all $k \ge 1$
The equivalence (up to constants) is non-trivial. The key step (Tail $\Rightarrow$ MGF): expand $\mathbb{E}[e^{sX}]$ as a power series in the moments and bound each moment from the tail condition via $\mathbb{E}[|X|^k] = \int_0^\infty k\,t^{k-1}\,\mathbb{P}(|X| \ge t)\,dt$.
Appendix V: Exercises with Solutions
V.1 Markov and Chebyshev Calculations
Problem. $X \sim \mathrm{Poisson}(4)$. Bound $\mathbb{P}(X \ge 10)$ using: (a) Markov's inequality ($\mathbb{E}[X] = 4$) (b) Chebyshev ($\mathrm{Var}(X) = 4$, so $\sigma = 2$, $t - \mu = 6$) (c) The exact value (Poisson CDF)
Solution. (a) Markov: $\mathbb{P}(X \ge 10) \le 4/10 = 0.4$ (b) Chebyshev: $\mathbb{P}(X \ge 10) \le \mathbb{P}(|X - 4| \ge 6) \le 4/36 \approx 0.111$ (c) Exact: $\mathbb{P}(X \ge 10) = 1 - F(9; 4) \approx 0.0081$
Chebyshev is ~14x loose; Markov is ~49x loose. Both bounds hold but are very conservative.
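As a concrete instance (the rate $\lambda = 4$ and threshold $t = 10$ here are illustrative choices), the three quantities can be computed directly:

```python
import math

lam, t = 4, 10  # X ~ Poisson(lam); bound P(X >= t)

markov = lam / t                          # E[X] / t
chebyshev = lam / (t - lam) ** 2          # Var(X) / (t - mu)^2; Var = lam for Poisson
# Exact tail: 1 - CDF(t-1) via the Poisson pmf
exact = 1 - sum(math.exp(-lam) * lam ** k / math.factorial(k) for k in range(t))
```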
V.2 Hoeffding Bound Computation
Problem. We estimate the click-through rate $p$ of an ad by showing it to $n$ users and recording clicks $X_i \in \{0, 1\}$. We observe $\hat{p} = 0.08$ from $n = 2000$ users. Give a 99% Hoeffding confidence interval.
Solution. With $\delta = 0.01$, $n = 2000$:
$$\epsilon = \sqrt{\frac{\ln(2/\delta)}{2n}} = \sqrt{\frac{\ln 200}{4000}} \approx 0.036$$
99% CI: $0.08 \pm 0.036 = [0.044, 0.116]$.
Interpretation: we're 99% confident the true CTR is between 4.4% and 11.6% - a very wide interval! To tighten to +/-1%: need $n = \ln(2/\delta)/(2\epsilon^2) = \ln(200)/(2 \times 0.0001) \approx 26{,}500$ users.
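The interval arithmetic can be checked in a few lines (the observed rate, sample size, and target half-width below are illustrative inputs):

```python
import math

def hoeffding_ci(p_hat, n, delta):
    """Two-sided Hoeffding interval: p_hat +/- sqrt(ln(2/delta) / (2n))."""
    eps = math.sqrt(math.log(2 / delta) / (2 * n))
    return p_hat - eps, p_hat + eps

lo, hi = hoeffding_ci(0.08, 2000, 0.01)          # ~[0.044, 0.116]

# Sample size for half-width 0.01 at the same confidence level
n_needed = math.log(2 / 0.01) / (2 * 0.01 ** 2)  # ~26,500
```

Note the quadratic cost of precision: halving the half-width quadruples the required sample size.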
V.3 Chernoff vs Hoeffding
Problem. $X_i \sim \mathrm{Bernoulli}(0.01)$ with $n = 1000$, so $\mu = \mathbb{E}[S] = 10$ where $S = \sum_i X_i$. Bound $\mathbb{P}(S \ge 15)$.
Hoeffding: $\mathbb{P}(S - 10 \ge 5) \le \exp(-2 \cdot 5^2 / 1000) = e^{-0.05} \approx 0.95$. Useless!
Chernoff: with $\delta = 0.5$, $\mathbb{P}(S \ge (1+\delta)\mu) \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu} = e^{10(0.5 - 1.5\ln 1.5)} \approx 0.34$.
Exact: $\mathbb{P}(S \ge 15) \approx 0.083$ (Poisson approximation: $S \approx \mathrm{Poisson}(10)$).
Chernoff is ~4x loose; Hoeffding is essentially vacuous (bound > 0.5 is useless). For rare events with $p$ small and $n$ large, Chernoff is the right tool.
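A short script makes the comparison concrete (taking, for illustration, $n = 1000$ Bernoulli(0.01) trials and threshold 15):

```python
import math

n, p = 1000, 0.01
mu = n * p              # expected number of successes = 10
threshold = 15          # bound P(S >= 15)

# Hoeffding on the sum: P(S - mu >= a) <= exp(-2 a^2 / n)
a = threshold - mu
hoeffding = math.exp(-2 * a ** 2 / n)

# Multiplicative Chernoff: P(S >= (1+d) mu) <= (e^d / (1+d)^(1+d))^mu
d = threshold / mu - 1
chernoff = (math.exp(d) / (1 + d) ** (1 + d)) ** mu
```

Hoeffding only "sees" the range $[0, 1]$ of each summand, not the tiny success probability; Chernoff exploits the Binomial MGF and so tracks the small mean.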
V.4 McDiarmid on Dropout
Problem. A network output $f(z_1, \dots, z_n)$ depends on $n = 100$ layer activations. Dropout independently zeroes each activation, changing the contribution of neuron $i$ by at most $c_i = 0.1$. Bound $\mathbb{P}(|f - \mathbb{E}[f]| \ge 0.5)$.
Solution. By McDiarmid: $\mathbb{P}(|f - \mathbb{E}[f]| \ge 0.5) \le 2\exp\!\left(-\frac{2(0.5)^2}{100 \times 0.01}\right) = 2e^{-0.5} \approx 1.21$. The bound is vacuous! This means either the $c_i$ are too large, or $t = 0.5$ is too small relative to the noise.
With larger networks and better weight norms ($n = 1000$, $c_i = 0.01$, so $\sum_i c_i^2 = 0.1$): $2\exp(-2 \cdot 0.25/0.1) = 2e^{-5} \approx 0.013$. Now meaningful: the dropout output deviates by more than 0.5 less than 1.3% of the time.
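A small helper shows where the McDiarmid bound crosses from vacuous to useful (the two activation counts and per-neuron sensitivities below are illustrative):

```python
import math

def mcdiarmid_two_sided(t, c_list):
    """Two-sided McDiarmid: P(|f - E[f]| >= t) <= 2 exp(-2 t^2 / sum c_i^2),
    clipped at 1 since any probability bound above 1 is vacuous."""
    return min(1.0, 2 * math.exp(-2 * t ** 2 / sum(c * c for c in c_list)))

vacuous = mcdiarmid_two_sided(0.5, [0.1] * 100)    # sum c_i^2 = 1.0  -> clipped to 1
useful = mcdiarmid_two_sided(0.5, [0.01] * 1000)   # sum c_i^2 = 0.1  -> ~0.013
```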
Appendix W: Notation Reference for This Section
| Symbol | Meaning |
|---|---|
| $\mathbb{P}(A)$ | Probability of event $A$ |
| $\mu = \mathbb{E}[X]$ | Mean of $X$ |
| $\sigma^2 = \mathrm{Var}(X)$ | Variance of $X$ |
| $M_X(s) = \mathbb{E}[e^{sX}]$ | Moment generating function |
| $\bar{X}_n$ | Sample mean |
| $R(h)$ | True (population) risk of hypothesis $h$ |
| $\hat{R}(h)$ | Empirical risk of $h$ on training data |
| $\mathcal{H}$ | Hypothesis class |
| $d_{\mathrm{VC}}(\mathcal{H})$ | VC dimension of $\mathcal{H}$ |
| $\Pi_{\mathcal{H}}(n)$ | Growth function |
| $\mathcal{R}_n(\mathcal{H})$ | Rademacher complexity |
| $\hat{\mathcal{R}}_S(\mathcal{H})$ | Empirical Rademacher complexity |
| $\epsilon$ | Accuracy parameter (how close to optimal) |
| $\delta$ | Confidence parameter (failure probability) |
| $c_i$ | Bounded differences constant for coordinate $i$ |
| $\epsilon$-net | Covering set with spacing $\epsilon$ |
| $N(\epsilon, \mathcal{H})$ | Covering number: size of smallest $\epsilon$-net |
| $\mathrm{KL}(Q \,\\|\, P)$ | KL divergence of $Q$ from $P$ |
| $I(t)$ | Cramér rate function |
| $\sigma_i$ | Rademacher random variable |
| $B(h_0, r)$ | Ball of radius $r$ around hypothesis $h_0$ |
Appendix X: More on Rademacher Complexity
X.1 Computing Rademacher Complexity for Common Classes
Halfspaces with margin. For halfspaces achieving margin $\gamma$ on data with $\|x_i\|_2 \le R$:
$$\hat{\mathcal{R}}_S \le \sqrt{\frac{R^2/\gamma^2}{n}}$$
The margin effectively replaces dimension $d$! This is why SVMs with large margins generalise well in high dimensions - the effective complexity is $R^2/\gamma^2$, not $d$.
Two-layer neural network. For $f(x) = \sum_{j=1}^m u_j\,\phi(\langle w_j, x \rangle)$ with $\|u\|_1 \le B_u$, $\|w_j\|_2 \le B_w$, and 1-Lipschitz activation $\phi$:
$$\mathcal{R}_n \le \frac{2\,B_u B_w R}{\sqrt{n}},$$
where $R = \max_i \|x_i\|_2$. The factor 2 comes from the contraction step that peels off the activation; the denominator shows the expected $O(1/\sqrt{n})$ rate.
X.2 Rademacher Complexity and Mutual Information
A deep connection: generalisation can also be bounded via mutual information. For losses bounded in $[0, 1]$, bounds of the form
$$\text{generalisation gap} \le \sqrt{\frac{I(S; W)}{2n}}$$
hold, where $I(S; W)$ is the mutual information between the training sample and the learned predictor. This connects concentration theory to information theory - a richer theory being actively developed (2020-2026) under "information-theoretic generalisation bounds."
X.3 Monte Carlo Estimation of Rademacher Complexity
In practice, can be estimated by Monte Carlo:
import numpy as np

def estimate_rademacher(X, W_norm=1.0, T=1000):
    """Monte Carlo estimate of Rademacher complexity for {x -> <w, x> : ||w||_2 <= W_norm}."""
    n, d = X.shape
    rademachers = []
    for _ in range(T):
        sigma = np.random.choice([-1, 1], size=n)
        # Optimal w: proportional to X^T sigma, rescaled to the norm constraint
        w_opt = X.T @ sigma
        w_norm = np.linalg.norm(w_opt)
        if w_norm > 0:
            w_opt = w_opt * W_norm / w_norm
        rademachers.append((sigma @ (X @ w_opt)) / n)
    return np.mean(rademachers)
For a sample of MNIST digits (784-dimensional) with $W = 1$: empirical $\hat{\mathcal{R}}_S \approx 0.076$. Rademacher bound: test error $\le$ train error $+ 0.076 + 0.055$. For 2% train error: test error $\le 15\%$ - loose but not vacuous!
Appendix Y: Bounds in the Context of Modern ML Systems
Y.1 LLM Benchmark Evaluation Standards
As of 2026, serious LLM evaluations use concentration-aware practices:
HELM (Holistic Evaluation of Language Models): Uses 1000+ examples per scenario, reporting bootstrap confidence intervals. Assumes bounded loss and applies Hoeffding-style reasoning.
BIG-bench: 200+ tasks, with tasks having 100-10000 examples. Standard deviations are reported; the standard error $\hat{\sigma}/\sqrt{n}$ matches the $1/\sqrt{n}$ leading term of a Hoeffding CI.
Chatbot Arena: Uses human pairwise comparisons (Elo-style). The Elo update rule has a concentration interpretation: the Elo gap needed for significance scales as $1/\sqrt{\text{number of comparisons}}$.
Statistical significance in model comparisons. With $n$ test examples and two models achieving accuracies $\hat{a}_1$ and $\hat{a}_2$:
- Paired comparison: use McNemar's test (exact)
- Unpaired: significance threshold $|\hat{a}_1 - \hat{a}_2| > \sqrt{2\ln(2/\delta)/n}$ (Hoeffding)
- For $n = 1000$, $\delta = 0.05$: need $\approx 8.6\%$ difference for significance
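A minimal calculator for the unpaired case (assuming the conservative Hoeffding-style threshold $\sqrt{2\ln(2/\delta)/n}$; paired tests are sharper):

```python
import math

def min_detectable_gap(n, delta=0.05):
    """Accuracy gap at which an unpaired Hoeffding-style test at level delta
    separates two models evaluated on n test examples."""
    return math.sqrt(2 * math.log(2 / delta) / n)

gap_1k = min_detectable_gap(1000)    # ~0.086: need ~8.6 accuracy points
gap_10k = min_detectable_gap(10000)  # ~0.027
```

The $1/\sqrt{n}$ rate means a 10x larger test set only shrinks the detectable gap by about 3x.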
Y.2 Fine-Tuning Sample Complexity via PAC-Bayes
When fine-tuning a pretrained LLM on task-specific data:
Prior: Pre-trained weights $w_0$ from language model pre-training on internet-scale data. Posterior: Fine-tuned weights $w$ on $n$ task examples. PAC-Bayes bound:
$$\mathbb{E}_{h \sim Q}[R(h)] \le \mathbb{E}_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln(2\sqrt{n}/\delta)}{2n}}$$
The KL term is small when fine-tuning doesn't move far from pre-training - exactly what LoRA enforces by constraining the update to $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$. With rank $r \ll \min(d, k)$, only $r(d + k)$ parameters move, so the KL term scales with $r(d + k)$ rather than $dk$.
PAC-Bayes thus offers a theoretical account of why LoRA can generalise better than full fine-tuning on small datasets.
Y.3 Privacy-Preserving ML via Concentration
Differential Privacy (DP) and concentration inequalities are deeply connected. An algorithm is $(\epsilon, \delta)$-DP if changing one training example changes the probability of any output event by at most a factor $e^{\epsilon}$, up to an additive slack $\delta$.
DP-SGD (Abadi et al., 2016): Clips each per-example gradient to norm $C$, adds Gaussian noise $\mathcal{N}(0, \sigma^2 C^2 I)$. The privacy guarantee follows from the Gaussian mechanism, and the utility (convergence) analysis uses concentration inequalities on the noisy gradients.
McDiarmid connection: DP is essentially algorithmic stability (each example changes the algorithm's behaviour by a bounded amount). McDiarmid's inequality turns this bounded-difference property into concentration, powering both the privacy and the generalisation analyses.
Appendix Z: Complete Theorem Index
This appendix lists every named result in this section for quick reference.
Moment-Based Inequalities
Theorem 2.1 (Markov). $X \ge 0$, $t > 0$: $\mathbb{P}(X \ge t) \le \mathbb{E}[X]/t$.
Theorem 2.2 (Chebyshev). $\mathrm{Var}(X) = \sigma^2$: $\mathbb{P}(|X - \mu| \ge t) \le \sigma^2/t^2$.
Theorem 2.3 (Cantelli). One-sided: $\mathbb{P}(X - \mu \ge t) \le \frac{\sigma^2}{\sigma^2 + t^2}$.
Sub-Gaussian Inequalities
Definition 3.1 (Sub-Gaussian). $X$ is $\sigma$-sub-G if $\mathbb{E}[e^{s(X - \mu)}] \le e^{\sigma^2 s^2/2}$ for all $s \in \mathbb{R}$.
Theorem 3.2 (Sub-Gaussian tail). $\mathbb{P}(|X - \mu| \ge t) \le 2e^{-t^2/(2\sigma^2)}$.
Theorem 3.3 (Sub-Gaussian sum). Independent $\sigma_i$-sub-G $X_i$: the sum is $\sqrt{\sum_i \sigma_i^2}$-sub-G.
Hoeffding-Type Inequalities
Lemma 4.1 (Hoeffding's lemma). $X \in [a, b]$, $\mathbb{E}[X] = 0$: $\mathbb{E}[e^{sX}] \le e^{s^2(b-a)^2/8}$.
Theorem 4.2 (Hoeffding). Independent $X_i \in [a_i, b_i]$: $\mathbb{P}\big(\bar{X}_n - \mathbb{E}[\bar{X}_n] \ge t\big) \le \exp\!\left(-\frac{2n^2t^2}{\sum_i (b_i - a_i)^2}\right)$.
Theorem 4.3 (Two-sided Hoeffding). $\mathbb{P}\big(|\bar{X}_n - \mathbb{E}[\bar{X}_n]| \ge t\big) \le 2\exp\!\left(-\frac{2n^2t^2}{\sum_i (b_i - a_i)^2}\right)$.
Chernoff Bounds
Theorem 5.1 (Chernoff method). $\mathbb{P}(X \ge t) \le \inf_{s > 0} e^{-st}\,\mathbb{E}[e^{sX}]$.
Theorem 5.2 (Chernoff upper tail). $S \sim \mathrm{Bin}(n, p)$, $\mu = np$: $\mathbb{P}(S \ge (1+\delta)\mu) \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}$.
Theorem 5.3 (Chernoff multiplicative). $\mathbb{P}(S \ge (1+\delta)\mu) \le e^{-\mu\delta^2/3}$ for $0 < \delta \le 1$.
Theorem 5.4 (Chernoff lower tail). $\mathbb{P}(S \le (1-\delta)\mu) \le e^{-\mu\delta^2/2}$.
Bernstein and McDiarmid
Theorem 6.2 (Bernstein). $|X_i - \mu| \le M$, variance $\sigma^2$: $\mathbb{P}(\bar{X}_n - \mu \ge t) \le \exp\!\left(-\frac{nt^2}{2\sigma^2 + 2Mt/3}\right)$.
Theorem 7.2 (McDiarmid). $f$ with $c_i$-bounded differences: $\mathbb{P}(f - \mathbb{E}[f] \ge t) \le \exp\!\left(-\frac{2t^2}{\sum_i c_i^2}\right)$.
Theorem S.1 (Azuma). Martingale with $|M_k - M_{k-1}| \le c_k$: $\mathbb{P}(M_n - M_0 \ge t) \le \exp\!\left(-\frac{t^2}{2\sum_k c_k^2}\right)$.
Statistical Learning Theory
Lemma 8.1 (Union bound). $\mathbb{P}\big(\bigcup_i A_i\big) \le \sum_i \mathbb{P}(A_i)$.
Lemma 8.3 (Covering number). Unit $\ell_2$-ball in $\mathbb{R}^d$: $N(\epsilon) \le (3/\epsilon)^d$ for $\epsilon \in (0, 1]$.
Theorem 9.2 (Finite-class PAC). $n \ge \frac{\ln(2|\mathcal{H}|/\delta)}{2\epsilon^2}$ $\Rightarrow$ uniform convergence: $\sup_{h \in \mathcal{H}} |R(h) - \hat{R}(h)| \le \epsilon$ w.p. $\ge 1 - \delta$.
Lemma 9.4 (Sauer-Shelah). $\Pi_{\mathcal{H}}(n) \le \left(\frac{en}{d}\right)^d$ for $n \ge d = d_{\mathrm{VC}}(\mathcal{H})$.
Theorem 9.5 (VC bound). $R(h) \le \hat{R}(h) + \sqrt{\frac{8\,d_{\mathrm{VC}}\ln(2en/d_{\mathrm{VC}}) + 8\ln(4/\delta)}{n}}$.
Theorem 10.2 (Rademacher bound). $R(h) \le \hat{R}(h) + 2\mathcal{R}_n(\mathcal{H}) + \sqrt{\frac{\ln(1/\delta)}{2n}}$.
Theorem E.4 (PAC-Bayes). $\mathbb{E}_{h \sim Q}[R(h)] \le \mathbb{E}_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{\mathrm{KL}(Q\|P) + \ln(2\sqrt{n}/\delta)}{2n}}$.
Theorem J.2 (Stability). $\beta$-stable algorithm with loss bounded by $M$: $\mathbb{E}[R - \hat{R}] \le \beta$, and w.h.p. $R \le \hat{R} + 2\beta + (4n\beta + M)\sqrt{\frac{\ln(1/\delta)}{2n}}$.
This completes Section05 Concentration Inequalities. The material here is used extensively in Chapter 7: Statistics for confidence intervals and hypothesis tests, and throughout learning theory.
<- Back to Chapter 6: Probability Theory | Next: Stochastic Processes ->
Supplementary: Concentration in High Dimensions
The Curse and Blessing of Dimensionality
High-dimensional geometry is counterintuitive. In $\mathbb{R}^d$ with large $d$:
- Almost all the volume of a ball lies near its surface
- Random unit vectors are nearly orthogonal: $|\langle u, v \rangle| = O(1/\sqrt{d})$ w.h.p.
- The norm of a Gaussian vector concentrates: $\|g\|_2 \approx \sqrt{d}$ for $g \sim \mathcal{N}(0, I_d)$
These facts follow directly from concentration inequalities applied to sums of squares.
Gaussian norm concentration. Let $g \sim \mathcal{N}(0, I_d)$. Then $\|g\|_2^2 = \sum_{i=1}^d g_i^2$ where $\mathbb{E}[g_i^2] = 1$. By Bernstein's inequality (since $g_i^2 - 1$ is sub-exponential):
$$\mathbb{P}\big(\big|\|g\|_2^2 - d\big| \ge td\big) \le 2e^{-c\,d\,\min(t,\, t^2)}$$
for a universal constant $c > 0$.
Implication for dot products. For two independent $u, v \sim \mathcal{N}(0, I_d)$:
$$\mathbb{P}\big(|\langle u, v \rangle| \ge t\sqrt{d}\big) \le 2e^{-ct^2},$$
so dot products are typically of order $\sqrt{d}$. This is why scaled dot-product attention divides by $\sqrt{d_k}$ - it keeps logits $O(1)$ despite $d_k$-dimensional embeddings.
Geometry of Softmax
Let $q$ be the query and $k_1, \dots, k_m$ the keys in a transformer. The attention weight on key $j$ is:
$$\alpha_j = \frac{\exp\big(\langle q, k_j \rangle / \sqrt{d_k}\big)}{\sum_{l=1}^m \exp\big(\langle q, k_l \rangle / \sqrt{d_k}\big)}$$
Without the $1/\sqrt{d_k}$ scaling, the logits have variance $d_k$ (for unit-variance random vectors), pushing softmax into near-zero gradient regions - a form of implicit dimension dependence that concentration inequalities make precise.
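A quick simulation illustrates the variance claim (the dimension and sample count below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, samples = 256, 5000

# Attention logits <q, k> for independent standard-normal queries and keys
q = rng.standard_normal((samples, d_k))
k = rng.standard_normal((samples, d_k))
logits = np.einsum("ij,ij->i", q, k)   # row-wise dot products

raw_var = logits.var()                        # concentrates near d_k = 256
scaled_var = (logits / np.sqrt(d_k)).var()    # concentrates near 1
```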
Covering Numbers in Transformer Weight Space
The number of parameters in a $d \times d$ transformer weight matrix is $d^2$. For LoRA with rank $r$, the effective dimensionality is $2dr$. By the $\epsilon$-covering number bound:
$$\ln N(\epsilon) \le 2dr\,\ln(3/\epsilon) \quad \text{vs.} \quad d^2 \ln(3/\epsilon)$$
This converts directly to a PAC generalisation bound showing that LoRA generalises with excess risk $O\big(\sqrt{dr/n}\big)$ - substantially better than $O\big(\sqrt{d^2/n}\big)$ for full fine-tuning. Concentration inequalities thus provide the theoretical foundation for parameter-efficient fine-tuning.
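Under the standard $\epsilon$-net bound $\ln N(\epsilon) \le \dim \cdot \ln(3/\epsilon)$, the LoRA-vs-full comparison reduces to a ratio of effective dimensions (the $d$ and $r$ values below are illustrative):

```python
import math

def log_covering_number(dim, eps):
    """ln N(eps) <= dim * ln(3/eps): eps-net bound for a unit ball in R^dim."""
    return dim * math.log(3 / eps)

d, r, eps = 4096, 8, 0.1
full_ft = log_covering_number(d * d, eps)     # full fine-tuning: d^2 parameters
lora = log_covering_number(2 * d * r, eps)    # LoRA: 2*d*r parameters
ratio = full_ft / lora                        # = d / (2r), independent of eps
```

The $\ln(3/\epsilon)$ factor cancels in the ratio: the sample-complexity advantage is purely the parameter-count ratio $d/(2r)$.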
Summary: Why This Section Matters
Concentration inequalities are the engine behind every finite-sample guarantee in machine learning. They explain:
- Why more data helps: bounds shrink as $O(1/\sqrt{n})$
- Why simpler models generalise: bounds grow with $d_{\mathrm{VC}}$ or $\mathcal{R}_n(\mathcal{H})$
- Why high dimensions are manageable: Gaussian concentration keeps norms stable
- Why LoRA works: Rademacher complexity of low-rank updates is provably small
- Why LLM benchmarks need many examples: CIs of half-width $\epsilon$ on accuracy require $n = O(1/\epsilon^2)$
The progression Markov -> Chebyshev -> sub-Gaussian -> Chernoff -> McDiarmid is not merely a tour of tricks - it is a systematic strengthening of assumptions that yields exponentially tighter guarantees, mirroring the progress from "some data" to "structured data" in modern ML.
End of Section05 Concentration Inequalities
Next: Section06 Stochastic Processes - where concentration inequalities meet martingales in continuous time.
Back: Section04 Expectation and Moments - the moment-based tools (MGF, Jensen) used throughout this section.
Chapter: Section06 Probability Theory README