Concentration Inequalities, Part 1: Intuition to Rademacher Complexity
1. Intuition
1.1 What Is Concentration?
A random variable $X$ has a mean $\mu = \mathbb{E}[X]$. But knowing the mean alone tells us nothing about how a single draw is distributed around it. Concentration inequalities fill this gap by bounding the probability that $X$ deviates far from $\mu$.
The fundamental question is: given what we know about $X$ (its mean, variance, boundedness, or MGF), what is the tightest bound we can place on $P(|X - \mu| \geq t)$?
This matters deeply in practice. In machine learning, we observe the empirical mean of a loss function on training data and want to know how well it approximates the true (population) mean. If $X_1, \dots, X_n$ are i.i.d. losses with mean $\mu$ and we observe $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$, concentration inequalities tell us how close $\bar{X}_n$ is to $\mu$ with high probability. This is the foundation of statistical learning theory.
The sample mean concentrates around the true mean: as $n$ grows, the deviation $|\bar{X}_n - \mu|$ becomes increasingly small with high probability. This is the content of the Law of Large Numbers - but concentration inequalities give quantitative, finite-$n$ bounds, not just asymptotic guarantees.
For AI: Every evaluation metric in ML (accuracy, F1, BLEU, ROUGE, perplexity) is estimated from a finite test set. Concentration inequalities tell us how many test examples we need to trust the estimate. HELM benchmarks for LLMs, for instance, implicitly rely on Hoeffding-type bounds for the confidence intervals around model comparisons.
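A quick simulation makes the finite-$n$ picture concrete (an illustrative sketch; the Bernoulli distribution, trial count, and seed are arbitrary choices):

```python
import random

def max_deviation(n, trials=1000, p=0.5, seed=0):
    """Largest |sample mean - p| seen over repeated Bernoulli(p) experiments."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(n)) / n
        worst = max(worst, abs(mean - p))
    return worst

devs = {n: max_deviation(n) for n in (100, 1600)}
for n, dev in devs.items():
    # Concentration predicts deviations shrink like 1/sqrt(n).
    print(f"n={n:5d}  worst observed deviation = {dev:.3f}")
```

Growing $n$ by a factor of 16 should shrink the worst observed deviation by roughly a factor of 4, in line with the $1/\sqrt{n}$ rate the bounds below quantify.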
1.2 The Hierarchy of Bounds
Concentration inequalities form a clear hierarchy. Moving down requires stronger assumptions but yields exponentially tighter bounds:
Concentration inequality hierarchy:

| Assumption | Inequality | Tail bound form | Decay |
|---|---|---|---|
| $\mathbb{E}[X] < \infty$, $X \geq 0$ | Markov | $P(X \geq t) \leq \mu/t$ | polynomial |
| $\mathrm{Var}(X) < \infty$ | Chebyshev | $P(\lvert X-\mu\rvert \geq t) \leq \sigma^2/t^2$ | polynomial |
| $\mathbb{E}[e^{sX}] < \infty$ | Chernoff method | $P(X \geq t) \leq \min_s e^{-st} M(s)$ | exponential (general) |
| $X \in [a,b]$ a.s. | Hoeffding | $P(\bar{X}_n - \mu \geq t) \leq \exp(-2nt^2/(b-a)^2)$ | exponential (sub-Gaussian) |
| variance and bound known | Bernstein | $\exp(-nt^2/(2\sigma^2 + 2ct/3))$ | exponential (variance-aware, tighter for small $t$) |
| $f(X_1,\dots,X_n)$ with bounded differences | McDiarmid | $\exp(-2t^2/\sum_i c_i^2)$ | exponential (functions of i.i.d. variables) |
The key insight: moment-based bounds (Markov, Chebyshev) give polynomial tails $O(t^{-k})$, while MGF-based bounds give exponential tails $e^{-ct^2}$. Exponential tails are vastly superior - a Gaussian has $P(|X - \mu| \geq 3\sigma) \approx 0.0027$, while Chebyshev gives only $1/9 \approx 0.111$.
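The gap between the rows of the hierarchy can be checked numerically. The snippet below compares the three bounds for the mean of $n = 200$ fair coin flips at deviation $t = 0.1$ (a toy comparison; the constants are arbitrary choices):

```python
import math

n, mu, t = 200, 0.5, 0.1            # mean of n fair coin flips, deviation t
var_mean = 0.25 / n                 # Var(Xbar) for Bernoulli(1/2)

markov = mu / (mu + t)                       # P(Xbar >= mu+t) <= E[Xbar]/(mu+t)
chebyshev = var_mean / t**2                  # P(|Xbar-mu| >= t) <= sigma^2/t^2
hoeffding = 2 * math.exp(-2 * n * t**2)      # two-sided, range [0,1]

print(f"Markov:    {markov:.3f}")
print(f"Chebyshev: {chebyshev:.3f}")
print(f"Hoeffding: {hoeffding:.4f}")
```

Each step down the hierarchy tightens the bound by an order of magnitude or more at this sample size.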
1.3 Historical Timeline
| Year | Person | Contribution |
|---|---|---|
| 1867 | Chebyshev | Proved the inequality bearing his name; used it to prove the weak LLN |
| 1899 | Markov (student of Chebyshev) | Simpler proof via indicator argument; Markov's inequality |
| 1952 | Herman Chernoff | MGF-based exponential tail bounds for sums of independent variables |
| 1963 | Wassily Hoeffding | Tight bounds for bounded variables; modern form used everywhere |
| 1971 | Vapnik & Chervonenkis | VC dimension theory; combinatorial approach to generalisation |
| 1984 | Leslie Valiant | PAC learning framework - theoretical model for machine learning |
| 1989 | Colin McDiarmid | Bounded differences inequality for general functions of independent variables |
| 2002 | Bartlett & Mendelson | Rademacher complexity - data-dependent alternative to VC theory |
| 2013 | Boucheron, Lugosi & Massart | Comprehensive monograph unifying modern concentration theory |
1.4 Why Concentration Matters for ML
Generalisation: A neural network achieves 99% training accuracy. Does it generalise? Concentration inequalities bound $R(h) - \hat{R}(h)$ - the gap between true and empirical risk. They tell us when we can trust training metrics.
SGD analysis: Stochastic gradient descent uses a mini-batch estimate $\hat{g}$ of the true gradient $\nabla L(\theta)$. Concentration inequalities bound $\|\hat{g} - \nabla L(\theta)\|$, determining the noise level and required step sizes for convergence.
Confidence intervals: When evaluating an LLM on 1000 benchmark questions with accuracy 73.2%, concentration bounds tell us the confidence interval around this estimate (roughly $\pm 4.3\%$ at 95% confidence by Hoeffding).
Random features: The Rahimi-Recht random features method approximates kernel matrices using random projections. How many random features are needed? The answer follows from a Hoeffding bound.
Dropout: Dropout randomly zeroes activations. The kept activations are bounded, so McDiarmid's inequality bounds the fluctuation of the network output.
2. Markov's and Chebyshev's Inequalities
2.1 Markov's Inequality
Theorem (Markov's Inequality). Let $X$ be a non-negative random variable with $\mathbb{E}[X] = \mu < \infty$. For any $t > 0$: $$P(X \geq t) \leq \frac{\mu}{t}.$$
Proof. This follows immediately from the pointwise inequality $X \geq t\,\mathbf{1}\{X \geq t\}$, valid for $X \geq 0$ and $t > 0$: taking expectations, $\mu = \mathbb{E}[X] \geq t\,P(X \geq t)$.
This proof is a master class in the indicator trick: insert an indicator, then bound it.
Tightness. Markov's bound is tight. Consider $X = t$ with probability $\mu/t$ and $X = 0$ otherwise. Then $\mathbb{E}[X] = \mu$ and $P(X \geq t) = \mu/t$ - exactly matching the bound.
Extensions. Markov applies to any non-negative function of $X$: for any non-decreasing $\phi \geq 0$ with $\phi(t) > 0$, $$P(X \geq t) \leq \frac{\mathbb{E}[\phi(X)]}{\phi(t)}.$$
Setting $\phi(x) = |x|^k$ gives the $k$-th moment bound: $P(|X| \geq t) \leq \mathbb{E}|X|^k/t^k$. Setting $\phi(x) = e^{sx}$ gives the Chernoff method.
For AI: Markov's inequality underlies gradient clipping analysis. If $\mathbb{E}\|g\| \leq G$, then $P(\|g\| \geq 10G) \leq 0.1$ - so gradient norms exceed $10G$ on at most 10% of steps.
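The two-point tightness construction above can be checked by simulation (a minimal sketch; the sample count and seed are arbitrary):

```python
import random

# X = t with probability mu/t, else 0: the equality case for Markov.
mu, t = 1.0, 10.0
rng = random.Random(42)
samples = [t if rng.random() < mu / t else 0.0 for _ in range(200_000)]

emp_mean = sum(samples) / len(samples)
emp_tail = sum(s >= t for s in samples) / len(samples)
print(f"empirical mean      = {emp_mean:.3f}  (target {mu})")
print(f"empirical P(X >= t) = {emp_tail:.4f}  vs Markov bound mu/t = {mu / t}")
```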
2.2 Chebyshev's Inequality
Theorem (Chebyshev's Inequality). For any random variable $X$ with $\mathbb{E}[X] = \mu$ and $\mathrm{Var}(X) = \sigma^2 < \infty$, for any $t > 0$: $$P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}.$$
Equivalently, setting $t = k\sigma$: $P(|X - \mu| \geq k\sigma) \leq 1/k^2$.
Proof. Apply Markov to the non-negative variable $(X - \mu)^2$: $$P(|X - \mu| \geq t) = P\big((X - \mu)^2 \geq t^2\big) \leq \frac{\mathbb{E}[(X - \mu)^2]}{t^2} = \frac{\sigma^2}{t^2}.$$
The $k$-sigma rule. Chebyshev guarantees: at least $1 - 1/k^2$ of probability mass lies within $k\sigma$ of the mean. For any distribution with finite variance:
| $k$ | Chebyshev bound | Gaussian actual |
|---|---|---|
| 2 | $\geq 75\%$ within $2\sigma$ | $95.45\%$ |
| 3 | $\geq 88.9\%$ within $3\sigma$ | $99.73\%$ |
| 4 | $\geq 93.75\%$ within $4\sigma$ | $99.994\%$ |
| 5 | $\geq 96\%$ within $5\sigma$ | $99.99994\%$ |
For AI: Chebyshev justifies batch normalisation. If activations have variance $\sigma^2$, then at most $1/k^2$ of activations lie beyond $k\sigma$ - this quantifies the scale-normalisation benefit.
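The Gaussian column of the table above can be reproduced exactly with the error function (no simulation needed):

```python
import math

def gaussian_within(k):
    """P(|Z| <= k) for a standard Gaussian, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (2, 3, 4, 5):
    cheb = 1 - 1 / k**2          # distribution-free Chebyshev guarantee
    print(f"k={k}: Chebyshev >= {cheb:.4f}, Gaussian actual = {gaussian_within(k):.6f}")
```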
Recall: The variance and its properties were developed fully in Section 04 (Expectation and Moments). We use it here as a plug-in parameter.
2.3 One-Sided Chebyshev (Cantelli's Inequality)
Chebyshev bounds both tails symmetrically. Cantelli's inequality gives a tighter one-sided bound:
Theorem (Cantelli). For any $t > 0$: $$P(X - \mu \geq t) \leq \frac{\sigma^2}{\sigma^2 + t^2}.$$
Proof. For any $u \geq 0$, by Markov applied to $(X - \mu + u)^2$: $$P(X - \mu \geq t) \leq P\big((X - \mu + u)^2 \geq (t + u)^2\big) \leq \frac{\sigma^2 + u^2}{(t + u)^2}.$$
Optimise over $u$: setting the derivative to zero gives $u = \sigma^2/t$, yielding $\frac{\sigma^2}{\sigma^2 + t^2}$.
For large $t$: Cantelli's one-sided bound $\frac{\sigma^2}{\sigma^2 + t^2} \approx \frac{\sigma^2}{t^2}$ matches Chebyshev's two-sided $\frac{\sigma^2}{t^2}$ - roughly the same. But for small $t$, Cantelli avoids the factor of 2 wasted when a symmetric two-sided bound is split per tail.
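Comparing the two bounds numerically (with $\sigma^2 = 1$; the grid of $t$ values is arbitrary):

```python
sigma2 = 1.0
rows = []
for t in (0.5, 1.0, 2.0, 5.0):
    cantelli = sigma2 / (sigma2 + t**2)        # one-sided Cantelli bound
    chebyshev = min(1.0, sigma2 / t**2)        # two-sided Chebyshev, capped at 1
    rows.append((t, cantelli, chebyshev))
    print(f"t={t}: Cantelli={cantelli:.3f}  Chebyshev={chebyshev:.3f}")
```

Cantelli is never trivial (it is always below 1) and is at its most useful for small $t$, where Chebyshev saturates.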
2.4 Limitations of Moment-Based Bounds
Markov and Chebyshev have polynomial tails. The Gaussian has $P(|X - \mu| \geq 3\sigma) \approx 0.0027$, but Chebyshev gives only $1/9 \approx 0.111$ - roughly a 41x overestimate. For $t = 5\sigma$, Chebyshev gives $1/25 = 0.04$, while the true Gaussian probability is $\approx 5.7 \times 10^{-7}$.
Why the gap? Chebyshev uses only the first two moments. A distribution could concentrate mass exactly at $\mu \pm t$ (with the rest at $\mu$) and match any pair $(\mu, \sigma^2)$ - that is the worst case for Chebyshev. Real distributions with bounded support or thin tails concentrate far more.
When Chebyshev is the right tool:
- Distribution-free bounds (don't know the family)
- Heavy-tailed distributions (Pareto, $t$-distribution with low degrees of freedom)
- Quick estimates without assuming sub-Gaussianity
3. Sub-Gaussian Random Variables
3.1 Definition and MGF Condition
Sub-Gaussian random variables are those whose tails decay at least as fast as a Gaussian. The key condition is on the MGF.
Definition. A mean-zero random variable $X$ is $\sigma$-sub-Gaussian if for all $s \in \mathbb{R}$: $$\mathbb{E}[e^{sX}] \leq e^{\sigma^2 s^2/2}.$$
The parameter $\sigma^2$ is the sub-Gaussian parameter (also called the proxy variance). Note: $\sigma^2$ need not equal $\mathrm{Var}(X)$, though $\mathrm{Var}(X) \leq \sigma^2$ always holds (by differentiating the MGF bound twice at $s = 0$).
Why this condition? Recall from Section 04 that the MGF of $\mathcal{N}(0, \sigma^2)$ is exactly $e^{\sigma^2 s^2/2}$. The sub-Gaussian condition says the MGF of $X$ is dominated by that of a Gaussian with variance $\sigma^2$. This immediately implies Gaussian-like tail bounds.
Equivalent characterisations (for mean-zero $X$, up to constant factors):
- MGF condition: $\mathbb{E}[e^{sX}] \leq e^{\sigma^2 s^2/2}$ for all $s \in \mathbb{R}$
- Tail condition: $P(|X| \geq t) \leq 2e^{-ct^2/\sigma^2}$ for all $t > 0$, for some constant $c > 0$
- Moment condition: $(\mathbb{E}|X|^k)^{1/k} \leq C\sigma\sqrt{k}$ for all $k \geq 1$, for some constant $C > 0$
3.2 Examples of Sub-Gaussian Variables
Gaussian. $X \sim \mathcal{N}(0, \sigma^2)$ is $\sigma$-sub-Gaussian. The MGF equals $e^{\sigma^2 s^2/2}$ exactly - Gaussian is the equality case.
Bounded random variable. If $X \in [a, b]$ almost surely with $\mathbb{E}[X] = 0$, then $X$ is $\frac{b-a}{2}$-sub-Gaussian. This is Hoeffding's lemma - proved in Section 4.1.
Rademacher variable. $\varepsilon = \pm 1$ with equal probability. Then $\mathbb{E}[e^{s\varepsilon}] = \cosh(s) \leq e^{s^2/2}$, so $\varepsilon$ is 1-sub-Gaussian. Rademacher variables appear in Rademacher complexity (Section 10).
Bernoulli, centred. $X = B - p$ with $B \sim \mathrm{Bern}(p)$. Since $X \in [-p, 1-p]$, it is $\frac{1}{2}$-sub-Gaussian by Hoeffding's lemma.
Non-example. Heavy-tailed distributions are not sub-Gaussian. A $t$-distribution with $\nu$ degrees of freedom has $\mathbb{E}|X|^k = \infty$ when $k \geq \nu$, so its moments grow far too fast. A Pareto distribution is not sub-Gaussian regardless of parameters.
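The Rademacher MGF bound $\cosh(s) \leq e^{s^2/2}$ is easy to verify numerically on a grid (a sanity check, not a proof; the grid range is arbitrary):

```python
import math

# E[e^{s*eps}] = cosh(s) for a Rademacher variable; check it never exceeds
# the 1-sub-Gaussian envelope exp(s^2/2) on a grid of s values.
worst_ratio = max(math.cosh(s) / math.exp(s**2 / 2)
                  for s in (i / 100 for i in range(-500, 501)))
print(f"max cosh(s)/exp(s^2/2) over s in [-5, 5]: {worst_ratio:.6f}")
```

The ratio peaks at exactly 1 at $s = 0$, confirming the envelope is tight only there.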
3.3 The Sub-Gaussian Tail Bound
Theorem. If $X$ is $\sigma$-sub-Gaussian with $\mathbb{E}[X] = 0$, then for all $t > 0$: $$P(X \geq t) \leq e^{-t^2/(2\sigma^2)}.$$
Proof (Chernoff method). For any $s > 0$: $$P(X \geq t) \leq e^{-st}\,\mathbb{E}[e^{sX}] \leq e^{-st + \sigma^2 s^2/2}.$$
Optimise over $s$: the exponent is minimised at $s = t/\sigma^2$. Substituting: $$P(X \geq t) \leq e^{-t^2/(2\sigma^2)}.$$
This proof pattern - Markov + MGF bound + optimise - is the Chernoff method and will recur throughout this section.
3.4 Closure Under Sums
Theorem. If $X_1, \dots, X_n$ are independent with $X_i$ being $\sigma_i$-sub-Gaussian and mean zero, then $\sum_i X_i$ is $\sqrt{\sum_i \sigma_i^2}$-sub-Gaussian.
Proof. By independence: $$\mathbb{E}\Big[e^{s\sum_i X_i}\Big] = \prod_{i=1}^n \mathbb{E}[e^{sX_i}] \leq \prod_{i=1}^n e^{\sigma_i^2 s^2/2} = e^{s^2 \sum_i \sigma_i^2/2}.$$
Corollary. For i.i.d. $\sigma$-sub-Gaussian variables, the sample mean satisfies $$P(|\bar{X}_n| \geq t) \leq 2e^{-nt^2/(2\sigma^2)}.$$
For AI: Mini-batch gradient estimation. If each sample's contribution to the gradient is sub-Gaussian with parameter $\sigma$, then the mini-batch gradient of size $B$ is $\sigma/\sqrt{B}$-sub-Gaussian - the noise decreases as $1/\sqrt{B}$.
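The $1/\sqrt{B}$ noise scaling can be observed directly. Here the "per-sample gradients" are stand-ins drawn uniformly from $[-1, 1]$ (a hypothetical toy model, not a real training loop; trial counts and seed are arbitrary):

```python
import random
import statistics

def grad_noise_std(batch_size, trials=4000, seed=1):
    """Std of a mini-batch mean of bounded per-sample values in [-1, 1]."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.uniform(-1, 1) for _ in range(batch_size))
             for _ in range(trials)]
    return statistics.pstdev(means)

s1, s16 = grad_noise_std(1), grad_noise_std(16)
print(f"std at B=1: {s1:.3f}, at B=16: {s16:.3f}, ratio {s1 / s16:.2f} (expect ~4)")
```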
4. Hoeffding's Inequality
4.1 Hoeffding's Lemma
Hoeffding's lemma is the core technical ingredient: it establishes that any bounded, mean-zero random variable is sub-Gaussian.
Lemma (Hoeffding, 1963). Let $X$ be a random variable with $\mathbb{E}[X] = 0$ and $X \in [a, b]$ almost surely. Then for all $s \in \mathbb{R}$: $$\mathbb{E}[e^{sX}] \leq e^{s^2(b-a)^2/8}.$$
In other words, $X$ is $\frac{b-a}{2}$-sub-Gaussian.
Proof sketch. Since $[a, b]$ is a bounded interval and $e^{sx}$ is convex in $x$, we can bound $e^{sx}$ by the chord from $(a, e^{sa})$ to $(b, e^{sb})$: $$e^{sx} \leq \frac{b - x}{b - a}e^{sa} + \frac{x - a}{b - a}e^{sb}.$$
Taking expectations (using $\mathbb{E}[X] = 0$, so the coefficients become $\frac{b}{b-a}$ and $\frac{-a}{b-a}$): $$\mathbb{E}[e^{sX}] \leq \frac{b}{b-a}e^{sa} - \frac{a}{b-a}e^{sb} = e^{\phi(u)},$$
where $u = s(b-a)$, $p = -a/(b-a)$, and $\phi(u) = -pu + \ln(1 - p + pe^u)$ with $\phi(0) = \phi'(0) = 0$. Expanding $\phi$ via Taylor series around $0$ and bounding $\phi''(u) \leq 1/4$ (achieved at the symmetric worst case) gives $\phi(u) \leq u^2/8$.
Why $(b-a)^2/8$? The factor comes from the worst case of a variable centred in the interval. For a Bernoulli($\frac{1}{2}$) minus its mean, $b - a = 1$ and the sub-Gaussian parameter is $\frac{1}{2}$.
4.2 Hoeffding's Inequality
Theorem (Hoeffding, 1963). Let $X_1, \dots, X_n$ be independent random variables with $X_i \in [a_i, b_i]$ almost surely. Then for all $t > 0$: $$P\Big(\sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \geq t\Big) \leq \exp\Big(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\Big).$$
Proof. Let $Z_i = X_i - \mathbb{E}[X_i]$ (mean-zero, $Z_i \in [a_i - \mathbb{E}X_i,\, b_i - \mathbb{E}X_i]$). By Hoeffding's lemma, each $Z_i$ is $\frac{b_i - a_i}{2}$-sub-Gaussian. By independence and closure under sums (Section 3.4), $\sum_i Z_i$ is $\sqrt{\sum_i (b_i - a_i)^2/4}$-sub-Gaussian. Applying the sub-Gaussian tail bound gives the result.
4.3 Two-Sided Hoeffding and Sample Complexity
For i.i.d. $X_i \in [a, b]$ with mean $\mu$, the two-sided version follows by symmetry (apply the one-sided bound to $X_i$ and $-X_i$): $$P(|\bar{X}_n - \mu| \geq t) \leq 2\exp\Big(-\frac{2nt^2}{(b-a)^2}\Big).$$
Reading the bound: The probability of a large deviation decays exponentially in $n$ and $t^2$. Doubling the sample size squares the exponential factor. Halving the allowed deviation $t$ quadruples the required $n$.
4.4 Required Sample Size
One of the most practically useful results: how large must $n$ be to guarantee $|\bar{X}_n - \mu| \leq t$ with probability at least $1 - \delta$?
Setting $2\exp(-2nt^2/(b-a)^2) \leq \delta$ and solving for $n$: $$n \geq \frac{(b-a)^2}{2t^2}\ln\frac{2}{\delta}.$$
Example. Coin flips ($b - a = 1$, $\mu = p$): to achieve accuracy $t = 0.01$ with confidence $1 - \delta = 0.95$: $$n \geq \frac{1}{2(0.01)^2}\ln\frac{2}{0.05} = \frac{\ln 40}{0.0002} \approx 18{,}445.$$
So roughly 18,500 coin flips are needed to estimate a probability to within $\pm 0.01$ with 95% confidence.
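The sample-size formula is a one-liner worth keeping around (a direct transcription of the bound above):

```python
import math

def hoeffding_n(t, delta, a=0.0, b=1.0):
    """Samples needed so that |Xbar - mu| <= t with probability >= 1 - delta."""
    return math.ceil((b - a) ** 2 * math.log(2 / delta) / (2 * t ** 2))

print(hoeffding_n(0.01, 0.05))   # the coin-flip example above
print(hoeffding_n(0.02, 0.05))   # doubling the tolerance cuts n by 4x
```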
For AI: Benchmark evaluation. To measure LLM accuracy to within $\pm 1\%$ at 95% confidence on a binary task ($b - a = 1$), Hoeffding requires $n \approx 18{,}500$ examples. This explains why serious LLM evaluations (MMLU, HumanEval, BIG-bench) use thousands of questions.
5. Chernoff Bounds
5.1 The Chernoff Method
The Chernoff method is a general technique for deriving exponential tail bounds using the MGF. It is more powerful than Hoeffding when the MGF has a known closed form.
The Chernoff method. For any random variable $X$ and any $t$: $$P(X \geq t) \leq \inf_{s > 0} e^{-st}\,M(s), \qquad M(s) = \mathbb{E}[e^{sX}].$$
Proof. For any $s > 0$: $P(X \geq t) = P(e^{sX} \geq e^{st}) \leq e^{-st}\,\mathbb{E}[e^{sX}]$ by Markov. Optimise over $s$.
The infimum over $s$ gives the tightest bound. The optimal $s^*$ satisfies $M'(s^*)/M(s^*) = t$, i.e., the mean of $X$ under the exponentially tilted distribution equals $t$.
Recall: the MGF $M(s) = \mathbb{E}[e^{sX}]$ was developed in Section 04 (Expectation and Moments). The key property used here is $M_{X+Y}(s) = M_X(s)\,M_Y(s)$ for independent $X$ and $Y$.
5.2 Chernoff for Bernoulli Sums
Let $X_i \sim \mathrm{Bern}(p)$ i.i.d. and $S_n = \sum_{i=1}^n X_i$ with mean $\mu = np$.
The MGF of $X_i$ is $\mathbb{E}[e^{sX_i}] = 1 - p + pe^s$.
Upper Chernoff bound. For $\delta > 0$: $$P\big(S_n \geq (1 + \delta)\mu\big) \leq \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu.$$
Derivation. Apply the Chernoff method with $t = (1 + \delta)\mu$: $$P\big(S_n \geq (1+\delta)\mu\big) \leq e^{-s(1+\delta)\mu}\,(1 - p + pe^s)^n.$$
Substituting $1 + x \leq e^x$ (using $1 - p + pe^s = 1 + p(e^s - 1) \leq e^{p(e^s - 1)}$): $$P\big(S_n \geq (1+\delta)\mu\big) \leq e^{\mu(e^s - 1) - s(1+\delta)\mu}.$$
Optimising: $s = \ln(1 + \delta)$, giving the stated bound.
5.3 Multiplicative Chernoff Form
For $0 < \delta \leq 1$, a simpler but slightly looser form: $$P\big(S_n \geq (1+\delta)\mu\big) \leq e^{-\mu\delta^2/3}.$$
This follows from the inequality $\frac{e^\delta}{(1+\delta)^{1+\delta}} \leq e^{-\delta^2/3}$ for $\delta \in (0, 1]$.
Comparison with Hoeffding. Hoeffding (applied to $\bar{X}_n = S_n/n$ with $t = \delta p$) gives: $$P\big(S_n \geq (1+\delta)np\big) \leq \exp(-2n\delta^2 p^2).$$
Chernoff gives $e^{-np\delta^2/3}$. For small $p$ (rare events), Chernoff's dependence on $p$ rather than $p^2$ makes it far tighter.
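For a rare event ($p = 0.01$), the two bounds on the same deviation differ dramatically (an illustrative comparison; the constants are arbitrary):

```python
import math

n, p, delta = 1000, 0.01, 1.0     # bound P(S_n >= 2*mu), a doubling of the mean
mu = n * p

chernoff = math.exp(-mu * delta**2 / 3)          # multiplicative Chernoff
hoeffding = math.exp(-2 * n * (delta * p) ** 2)  # Hoeffding with t = delta*p

print(f"Chernoff:  {chernoff:.4f}")   # depends on p through mu = np
print(f"Hoeffding: {hoeffding:.4f}")  # depends on p^2: nearly vacuous here
```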
5.4 Lower Tail Chernoff
For the lower tail, for $0 < \delta < 1$: $$P\big(S_n \leq (1-\delta)\mu\big) \leq e^{-\mu\delta^2/2}.$$
Note the factor of $\frac{1}{2}$ vs $\frac{1}{3}$ in the exponent - the lower tail is slightly tighter.
Application: balls into bins. When $n$ items are assigned to $n$ bins uniformly at random, the expected load per bin is $1$. Chernoff bounds control the maximum load ($O(\log n/\log\log n)$ with high probability) - crucial for hashing and load-balancing proofs in distributed systems.
6. Bernstein's Inequality
6.1 Sub-Exponential Random Variables
Some random variables have heavier tails than sub-Gaussian but still lighter than arbitrary - these are sub-exponential.
Definition. A mean-zero random variable $X$ is sub-exponential with parameters $(\nu, \alpha)$ if: $$\mathbb{E}[e^{sX}] \leq e^{\nu^2 s^2/2} \quad \text{for all } |s| \leq \frac{1}{\alpha}.$$
Sub-exponential tails decay as $e^{-ct}$ (exponential), slower than the Gaussian $e^{-ct^2}$.
Examples:
- Exponential$(\lambda)$: sub-exponential with parameters on the order of $1/\lambda$, but not sub-Gaussian (its upper tail is exactly $e^{-\lambda t}$)
- $\chi^2_k$: sub-exponential (a sum of squared Gaussians)
- Any bounded variable is sub-Gaussian, hence also sub-exponential
6.2 Bernstein's Inequality
Theorem (Bernstein, 1924/1946). Let $X_1, \dots, X_n$ be independent mean-zero variables with $|X_i| \leq c$ and $\frac{1}{n}\sum_i \mathrm{Var}(X_i) \leq \sigma^2$. Then for all $t > 0$: $$P\Big(\frac{1}{n}\sum_{i=1}^n X_i \geq t\Big) \leq \exp\Big(-\frac{nt^2}{2\sigma^2 + 2ct/3}\Big).$$
Intuition. This interpolates between two regimes:
- Small $t$ ($t \lesssim \sigma^2/c$): denominator $\approx 2\sigma^2$, gives $e^{-nt^2/(2\sigma^2)}$ - sub-Gaussian with variance $\sigma^2$
- Large $t$ ($t \gtrsim \sigma^2/c$): denominator $\approx 2ct/3$, gives $e^{-3nt/(2c)}$ - exponential tail
6.3 Bernstein vs Hoeffding
For i.i.d. variables with $X_i \in [a, b]$, $|X_i - \mathbb{E}X_i| \leq c$, and variance $\sigma^2$:

| Bound | Formula | Better when |
|---|---|---|
| Hoeffding | $\exp(-2nt^2/(b-a)^2)$ | No variance info |
| Bernstein | $\exp(-nt^2/(2\sigma^2 + 2ct/3))$ | $\sigma^2 \ll (b-a)^2$ (small variance) |

When Bernstein wins. If $\sigma^2 \ll (b-a)^2$ (data has small variance despite large range), Bernstein gives $\approx e^{-nt^2/(2\sigma^2)}$ for small $t$, while Hoeffding gives $e^{-2nt^2/(b-a)^2}$. Since $\sigma^2 \ll (b-a)^2$, Bernstein is exponentially tighter.
Example. Suppose $|X_i| \leq 1$ (so $(b-a)^2 = 4$) but $\sigma^2 = 0.0025$. Hoeffding uses the range, giving exponent $nt^2/2$, while Bernstein uses the variance, giving exponent $\approx nt^2/(2\sigma^2) = 200nt^2$ - a 400x improvement in the exponent for small $t$.
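The comparison is a direct plug-in of the two formulas (a transcription using the constants from the example above):

```python
import math

def hoeffding_tail(n, t, a=-1.0, b=1.0):
    return math.exp(-2 * n * t**2 / (b - a) ** 2)

def bernstein_tail(n, t, sigma2, c=1.0):
    return math.exp(-n * t**2 / (2 * sigma2 + 2 * c * t / 3))

n, t, sigma2 = 1000, 0.05, 0.0025    # small variance, range [-1, 1]
print(f"Hoeffding: {hoeffding_tail(n, t):.2e}")
print(f"Bernstein: {bernstein_tail(n, t, sigma2):.2e}")
```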
6.4 Application to Gradient Noise
In SGD, the stochastic gradient $\hat{g} = \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla \ell_i(\theta)$ over a mini-batch of size $B$ estimates the true gradient $\nabla L(\theta)$.
If each per-sample gradient coordinate is bounded, $|\nabla \ell_i(\theta)_j| \leq G$, and the per-coordinate variance is at most $\sigma^2$, then Bernstein bounds: $$P\big(\|\hat{g} - \nabla L(\theta)\|_\infty \geq t\big) \leq 2d\exp\Big(-\frac{Bt^2}{2\sigma^2 + 2Gt/3}\Big),$$
where the factor $2d$ comes from a union bound over the $d$ coordinates. This bound shows why variance reduction methods (SVRG, SAGA) that reduce $\sigma^2$ improve convergence more dramatically than reducing $G$ alone.
7. McDiarmid's Inequality
7.1 Bounded Differences Condition
McDiarmid's inequality generalises Hoeffding from sums to general functions.
Definition (Bounded Differences). A function $f: \mathcal{X}^n \to \mathbb{R}$ satisfies the bounded differences condition with constants $c_1, \dots, c_n$ if for all $x_1, \dots, x_n$ and all $x_i' \in \mathcal{X}$: $$|f(x_1, \dots, x_i, \dots, x_n) - f(x_1, \dots, x_i', \dots, x_n)| \leq c_i.$$
Intuitively: changing the $i$-th coordinate by any amount changes $f$ by at most $c_i$.
7.2 McDiarmid's Inequality
Theorem (McDiarmid, 1989). Let $X_1, \dots, X_n$ be independent random variables and let $f$ satisfy the bounded differences condition with constants $c_1, \dots, c_n$. Then for all $t > 0$: $$P\big(f(X_1, \dots, X_n) - \mathbb{E}[f] \geq t\big) \leq \exp\Big(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\Big).$$
Proof sketch (martingale method). Define the Doob martingale $Z_k = \mathbb{E}[f \mid X_1, \dots, X_k]$ with $Z_0 = \mathbb{E}[f]$ and $Z_n = f$. The differences satisfy $|Z_k - Z_{k-1}| \leq c_k$ by the bounded differences assumption. Apply Azuma's inequality (concentration for bounded martingale differences) to get the result.
7.3 Relationship to Hoeffding
Hoeffding's inequality is a special case of McDiarmid's. For $f(x_1, \dots, x_n) = \frac{1}{n}\sum_i x_i$ with $x_i \in [a, b]$, changing $x_i$ changes $f$ by at most $(b-a)/n$, so $c_i = (b-a)/n$. Then: $$\sum_{i=1}^n c_i^2 = \frac{(b-a)^2}{n}.$$
Substituting into McDiarmid recovers exactly Hoeffding's bound $\exp(-2nt^2/(b-a)^2)$.
7.4 Applications to ML Stability
Training loss stability. Let $S = (z_1, \dots, z_n)$ be a training set and $\hat{R}(S) = \frac{1}{n}\sum_i \ell(h, z_i)$ the empirical risk with $\ell \in [0, 1]$. Swapping one training example changes $\hat{R}$ by at most $1/n$. McDiarmid gives: $$P\big(|\hat{R} - \mathbb{E}[\hat{R}]| \geq t\big) \leq 2e^{-2nt^2}.$$
Dropout. With dropout probability $p$, each kept neuron's activation is rescaled by $1/(1-p)$. If toggling the $i$-th neuron changes the network output by at most $c_i$, McDiarmid bounds the output fluctuation from dropout noise.
Data augmentation. If augmenting a single training example can change the training loss by at most , McDiarmid bounds the sensitivity of the trained model to augmentation choices.
8. The Union Bound and Covering Arguments
8.1 Union Bound
Lemma (Boole's Inequality / Union Bound). For any events $A_1, \dots, A_m$ (not necessarily independent): $$P\Big(\bigcup_{i=1}^m A_i\Big) \leq \sum_{i=1}^m P(A_i).$$
Proof. By induction from the two-event case: $P(A \cup B) = P(A) + P(B) - P(A \cap B) \leq P(A) + P(B)$.
The union bound is exact when events are disjoint and pessimistic otherwise. Despite its looseness, it is indispensable because it requires no assumption about dependence between events.
For AI: The union bound appears in virtually every multi-class or multi-hypothesis analysis. PAC theory for a finite hypothesis class $\mathcal{H}$ asks for the probability that any $h \in \mathcal{H}$ is bad - a union bound over all $|\mathcal{H}|$ hypotheses.
8.2 Multiple Hypothesis Testing
Consider $m$ statistical tests, each with individual significance level $\alpha$. If we want the probability that at least one false positive occurs to be at most $\alpha$, the Bonferroni correction sets each test at level $\alpha/m$.
In ML: When training $k$ models and reporting the best, the union bound says the best result can be a false positive with probability up to $k\alpha$. This is the multiple comparison problem underlying claims like "our method beats 10 baselines on 5 datasets."
Balancing union bound and concentration. For $M$ bad events, each with $P(A_i) \leq \delta'$: $$P\Big(\bigcup_{i=1}^M A_i\Big) \leq M\delta'.$$
To make this $\leq \delta$: need each $\delta' = \delta/M$. In PAC theory, this costs an extra $\ln M$ in the required sample size.
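A simulation of the multiple-testing effect, with null "tests" modelled as independent uniform p-values (a toy sketch; trial count and seed are arbitrary):

```python
import random

def any_false_positive(m, alpha, trials=2000, seed=0):
    """Fraction of trials in which at least one of m null tests rejects."""
    rng = random.Random(seed)
    hits = sum(any(rng.random() < alpha for _ in range(m)) for _ in range(trials))
    return hits / trials

m, alpha = 20, 0.05
print(f"uncorrected: {any_false_positive(m, alpha):.3f}  (union bound: {m * alpha:.2f})")
print(f"Bonferroni:  {any_false_positive(m, alpha / m):.3f}  (target: {alpha})")
```

Uncorrected, roughly two thirds of experiments contain a spurious "significant" result; the Bonferroni level restores the intended family-wise rate.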
8.3 Covering Numbers and ε-Nets
The union bound can only be applied to finite collections of events. For continuous hypothesis spaces (e.g., all linear classifiers, all neural networks), we need to discretise first.
Definition. An $\varepsilon$-net of a set $T$ under metric $d$ is a finite set $N \subseteq T$ such that for every $t \in T$, there exists $s \in N$ with $d(s, t) \leq \varepsilon$.
The covering number $N(T, d, \varepsilon)$ is the size of the smallest $\varepsilon$-net.
Covering number of a ball. The unit Euclidean ball in $\mathbb{R}^d$ has covering number: $$N(\varepsilon) \leq \Big(1 + \frac{2}{\varepsilon}\Big)^d \leq \Big(\frac{3}{\varepsilon}\Big)^d \quad (\varepsilon \leq 1).$$
This grows polynomially in $1/\varepsilon$ but exponentially in $d$ - the curse of dimensionality for covering.
8.4 The Net Argument
The standard approach for continuous hypothesis classes:
- Build an $\varepsilon$-net $N_\varepsilon$ of the hypothesis space with $|N_\varepsilon| \leq N(\varepsilon)$
- Apply concentration + union bound over the net: control $\max_{h \in N_\varepsilon} |R(h) - \hat{R}(h)|$
- Use Lipschitz continuity to extend from the net to the whole space
The final bound pays an extra $\ln N(\varepsilon)$ compared to a single hypothesis, plus the approximation error from the net.
Johnson-Lindenstrauss Lemma. As an application, if $k = O(\varepsilon^{-2}\log n)$ random projection dimensions are used, $n$ points in $\mathbb{R}^d$ embed into $\mathbb{R}^k$ with distortion $1 \pm \varepsilon$ of all pairwise distances, with high probability. This is proved using the union bound over all $\binom{n}{2}$ pairs.
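A single-vector sketch of the JL phenomenon: a random Gaussian projection, scaled by $1/\sqrt{k}$, approximately preserves norms (dimensions and seed are arbitrary toy choices; a full JL statement would check all pairwise distances):

```python
import math
import random

rng = random.Random(0)
d, k = 300, 200                       # ambient and projected dimensions
x = [rng.gauss(0, 1) for _ in range(d)]

# Each projected coordinate is <g, x> for a fresh standard Gaussian vector g.
proj = [sum(rng.gauss(0, 1) * xi for xi in x) / math.sqrt(k) for _ in range(k)]

orig_norm = math.sqrt(sum(v * v for v in x))
proj_norm = math.sqrt(sum(v * v for v in proj))
print(f"norm distortion: {proj_norm / orig_norm:.3f}  (should be close to 1)")
```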
9. PAC Learning and Generalisation Bounds
9.1 The PAC Framework
The Probably Approximately Correct (PAC) framework, introduced by Valiant (1984), provides a formal model for machine learning under uncertainty.
Setup: A learner observes $n$ i.i.d. examples $(x_i, y_i)$ from an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. The learner selects a hypothesis $h$ from a hypothesis class $\mathcal{H}$.
True risk: $R(h) = P_{(x,y) \sim \mathcal{D}}\big(h(x) \neq y\big)$
Empirical risk: $\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{h(x_i) \neq y_i\}$
PAC guarantee: A learning algorithm PAC-learns $\mathcal{H}$ if for any $\varepsilon, \delta \in (0, 1)$ and any distribution $\mathcal{D}$, given $n \geq n(\varepsilon, \delta)$ examples, with probability at least $1 - \delta$: $$R(\hat{h}) \leq \min_{h \in \mathcal{H}} R(h) + \varepsilon.$$
The sample complexity $n(\varepsilon, \delta)$ is the central quantity: how many examples are needed?
9.2 Finite Hypothesis Class
Theorem. Let $\mathcal{H}$ be a finite hypothesis class with $|\mathcal{H}| < \infty$. For any ERM algorithm (minimising $\hat{R}$) and any $\varepsilon, \delta \in (0, 1)$, if: $$n \geq \frac{1}{\varepsilon}\Big(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\Big),$$
then with probability at least $1 - \delta$, the selected hypothesis satisfies $R(\hat{h}) \leq \varepsilon$ (in the realisable case with $\min_{h \in \mathcal{H}} R(h) = 0$).
Proof. Consider the event "bad $h$": $R(h) > \varepsilon$ but $\hat{R}(h) = 0$.
A bad $h$ classifies all $n$ examples correctly with probability $P(\hat{R}(h) = 0) \leq (1 - \varepsilon)^n \leq e^{-n\varepsilon}$.
By union bound: $P(\exists \text{ bad } h) \leq |\mathcal{H}|\,e^{-n\varepsilon}$.
Setting this $\leq \delta$ and solving: $n \geq \frac{1}{\varepsilon}\ln\frac{|\mathcal{H}|}{\delta}$. The $1/\varepsilon^2$ version (via Hoeffding) covers the agnostic (non-realisable) case.
9.3 Uniform Convergence
Theorem (Uniform Convergence). For finite $\mathcal{H}$ and losses in $[0, 1]$, for any $\varepsilon > 0$: $$P\Big(\sup_{h \in \mathcal{H}} |R(h) - \hat{R}(h)| > \varepsilon\Big) \leq 2|\mathcal{H}|\,e^{-2n\varepsilon^2}.$$
Proof. Fix any $h$. By Hoeffding: $P\big(|R(h) - \hat{R}(h)| > \varepsilon\big) \leq 2e^{-2n\varepsilon^2}$.
By union bound over all hypotheses: $P\big(\exists h: |R(h) - \hat{R}(h)| > \varepsilon\big) \leq 2|\mathcal{H}|\,e^{-2n\varepsilon^2}$.
Consequence: If uniform convergence holds, then for ERM $\hat{h}$ and best hypothesis $h^* = \arg\min_h R(h)$: $$R(\hat{h}) \leq \hat{R}(\hat{h}) + \varepsilon \leq \hat{R}(h^*) + \varepsilon \leq R(h^*) + 2\varepsilon.$$
The ERM generalises to within $2\varepsilon$ of the best hypothesis.
9.4 VC Dimension
For infinite hypothesis classes, we need a different complexity measure.
Definition. A hypothesis class $\mathcal{H}$ shatters a set $\{x_1, \dots, x_m\}$ if for every labelling $y \in \{0, 1\}^m$, there exists $h \in \mathcal{H}$ with $h(x_i) = y_i$ for all $i$.
The VC dimension $d_{\mathrm{VC}}(\mathcal{H})$ is the size of the largest set shattered by $\mathcal{H}$.
Examples:
- Threshold classifiers on $\mathbb{R}$: $d_{\mathrm{VC}} = 1$ (can shatter any single point but not two)
- Linear classifiers (halfspaces) in $\mathbb{R}^d$: $d_{\mathrm{VC}} = d + 1$
- Degree-$k$ polynomial threshold classifiers on $\mathbb{R}$: $d_{\mathrm{VC}} = k + 1$
- Circles in $\mathbb{R}^2$: $d_{\mathrm{VC}} = 3$
- Neural networks: $d_{\mathrm{VC}}$ grows with parameter count, approximately linearly up to logarithmic factors
Sauer-Shelah Lemma. The growth function $\Pi_{\mathcal{H}}(n) = \max_{x_1, \dots, x_n} \lvert\{(h(x_1), \dots, h(x_n)) : h \in \mathcal{H}\}\rvert$ satisfies: $$\Pi_{\mathcal{H}}(n) \leq \sum_{i=0}^{d} \binom{n}{i} \leq \Big(\frac{en}{d}\Big)^d$$
for $n \geq d$, where $d = d_{\mathrm{VC}}(\mathcal{H})$. This says: an $n$-point set can be labelled in at most $O(n^d)$ ways by $\mathcal{H}$, even if $\lvert\mathcal{H}\rvert = \infty$.
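The Sauer-Shelah bound is easy to evaluate with exact integer arithmetic:

```python
import math

def sauer_bound(n, d):
    """Max number of labelings of n points by a class of VC dimension d."""
    return sum(math.comb(n, i) for i in range(d + 1))

n, d = 100, 5
print(f"Sauer bound:       {sauer_bound(n, d):,}")    # polynomial in n
print(f"all labelings 2^n: {2**n:.3e}")               # exponential in n
print(f"(en/d)^d:          {(math.e * n / d) ** d:.3e}")  # closed-form bound
```

One hundred points admit $2^{100} \approx 10^{30}$ labellings in principle, but a VC-dimension-5 class realises fewer than $10^8$ of them.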
9.5 Generalisation Bound with VC Dimension
Theorem (VC Generalisation Bound). For any $\delta > 0$, with probability at least $1 - \delta$ over $n$ i.i.d. training examples, every $h \in \mathcal{H}$ satisfies: $$R(h) \leq \hat{R}(h) + O\left(\sqrt{\frac{d\ln(n/d) + \ln(1/\delta)}{n}}\right),$$
where $d = d_{\mathrm{VC}}(\mathcal{H})$.
Sample complexity. Setting the right-hand side excess to $\varepsilon$: $$n = O\left(\frac{d\ln(1/\varepsilon) + \ln(1/\delta)}{\varepsilon^2}\right).$$
The $\ln|\mathcal{H}|$ from the finite case is replaced by $d_{\mathrm{VC}}$. For a $d$-dimensional linear classifier ($d_{\mathrm{VC}} = d + 1$), we need $n = O(d/\varepsilon^2)$ - linear in dimension, not exponential.
For AI (2026): Modern transformers have billions of parameters, implying enormous VC dimension. Yet they generalise. This "paradox" is partly resolved by:
- Implicit regularisation: gradient descent finds minimum-norm solutions
- Overparameterisation: interpolation threshold, double descent (covered in Section 04)
- PAC-Bayes: data-dependent bounds that are tighter for specific algorithms
10. Rademacher Complexity
10.1 Definition
Rademacher complexity measures the richness of a function class on specific data, making it data-dependent and often tighter than VC bounds.
Definition. Given a sample $S = (x_1, \dots, x_n)$ and a function class $\mathcal{F}$, the empirical Rademacher complexity is: $$\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right],$$
where $\sigma = (\sigma_1, \dots, \sigma_n)$ with $\sigma_i$ i.i.d. uniform on $\{-1, +1\}$ (Rademacher variables).
The (population) Rademacher complexity is $\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{F})]$.
Interpretation. $\hat{\mathfrak{R}}_S(\mathcal{F})$ measures how well $\mathcal{F}$ can correlate with random labels on the training points. If the class is rich enough to fit any labelling, $\hat{\mathfrak{R}}_S$ stays bounded away from $0$. If it can only fit structured labels, $\hat{\mathfrak{R}}_S \to 0$ as $n$ grows.
10.2 Rademacher Generalisation Bound
Theorem. For any function class $\mathcal{F}$ with values in $[0, 1]$, with probability at least $1 - \delta$: $$\mathbb{E}[f(x)] \leq \frac{1}{n}\sum_{i=1}^n f(x_i) + 2\mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\ln(1/\delta)}{2n}}$$
for all $f \in \mathcal{F}$ simultaneously.
Comparison to VC. The VC bound has $\sqrt{d\ln(n/d)/n}$; the Rademacher bound has $2\mathfrak{R}_n(\mathcal{F})$. For specific data distributions where the function class is "effectively smaller," Rademacher can be significantly tighter.
10.3 Rademacher of Linear Classifiers
For linear classifiers $\mathcal{F} = \{x \mapsto \langle w, x \rangle : \|w\|_2 \leq B\}$ on data with $\|x_i\|_2 \leq R$: $$\hat{\mathfrak{R}}_S(\mathcal{F}) \leq \frac{BR}{\sqrt{n}}.$$
More precisely, using Cauchy-Schwarz (and Jensen for the middle step): $$\hat{\mathfrak{R}}_S(\mathcal{F}) = \frac{B}{n}\,\mathbb{E}_\sigma\left\lVert\sum_{i=1}^n \sigma_i x_i\right\rVert_2 \leq \frac{B}{n}\sqrt{\sum_{i=1}^n \|x_i\|_2^2} \leq \frac{BR}{\sqrt{n}}.$$
This is $O(BR/\sqrt{n})$, independent of the ambient dimension $d$! This explains why linear classifiers generalise well even in high dimensions, as long as the norm is controlled.
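Because the supremum for this class has the closed form $(B/n)\lVert\sum_i \sigma_i x_i\rVert_2$, the empirical Rademacher complexity can be estimated by Monte Carlo (toy Gaussian data; sizes, trial count, and seed are arbitrary):

```python
import math
import random

rng = random.Random(0)
n, d, B = 200, 10, 1.0
X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
R = max(math.sqrt(sum(v * v for v in x)) for x in X)      # data radius

def rademacher_estimate(trials=500):
    """MC average of (B/n) * ||sum_i sigma_i x_i||_2 over random sign vectors."""
    total = 0.0
    for _ in range(trials):
        signs = [rng.choice((-1, 1)) for _ in range(n)]
        summed = [sum(s * x[j] for s, x in zip(signs, X)) for j in range(d)]
        total += B / n * math.sqrt(sum(v * v for v in summed))
    return total / trials

est = rademacher_estimate()
print(f"empirical Rademacher ~ {est:.4f}  vs bound B*R/sqrt(n) = {B * R / math.sqrt(n):.4f}")
```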
10.4 Rademacher vs VC
| Property | VC Dimension | Rademacher Complexity |
|---|---|---|
| Distribution | Distribution-free | Data-dependent |
| Calculation | Combinatorial | Expectation over labels |
| Estimate | Hard (often NP-hard) | MC approximation possible |
| Tightness | Worst-case | Often tighter in practice |
| Handles regression | No (classification) | Yes (arbitrary losses) |
| Modern NNs | Vacuous ($d_{\mathrm{VC}}$ too large) | Can be finite for sparse/low-rank models |
For AI: Rademacher complexity gives finite (non-vacuous) generalisation bounds for attention heads with bounded query/key norm, helping explain why transformers with weight decay can generalise despite huge parameter counts.