Concentration Inequalities: Part 11: ML Applications to Appendix N: Extended Worked Problems
11. ML Applications
11.1 SGD and Gradient Concentration
In stochastic gradient descent, at each step we compute a mini-batch gradient:
$$\hat g_B(\theta) = \frac{1}{b}\sum_{i \in B} \nabla \ell(\theta; x_i),$$
where $B$ is a random mini-batch of size $b$. The true gradient is $\nabla L(\theta) = \mathbb{E}[\nabla \ell(\theta; x_i)]$.
Concentration of gradient norm. Assuming $\|\nabla \ell(\theta; x_i)\|_\infty \le G$ almost surely, by Hoeffding applied coordinate-wise and a union bound over the $d$ coordinates:
$$\mathbb{P}\big(\|\hat g_B(\theta) - \nabla L(\theta)\|_\infty \ge t\big) \le 2d\exp\!\left(-\frac{b t^2}{2G^2}\right).$$
Implication for learning rate. For SGD to make progress, the gradient noise must be smaller than the signal. Concentration tells us: with probability $1 - \delta$, the noise is bounded by $G\sqrt{2\log(2d/\delta)/b}$. This $1/\sqrt{b}$ decay of the noise is what justifies batch-size-dependent learning-rate scaling in theory.
Variance reduction. Methods like SVRG (Stochastic Variance-Reduced Gradient) periodically compute the full gradient, reducing the variance of the stochastic gradient estimate as the iterates approach the optimum. Bernstein's inequality quantifies the benefit: convergence can be linear (not just sublinear in the iteration count) when the variance is controlled.
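A minimal simulation of this effect (a sketch, assuming synthetic per-sample gradients bounded in $[-1, 1]$ so that $G = 1$; the dataset size, dimension, and batch sizes are illustrative). It checks that the sup-norm mini-batch error shrinks like $1/\sqrt{b}$ and stays below the Hoeffding-plus-union-bound radius:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, delta = 50_000, 10, 0.05                          # illustrative sizes
per_sample = rng.uniform(-1.0, 1.0, size=(n, d))        # bounded "per-sample gradients", G = 1
full_grad = per_sample.mean(axis=0)                     # stands in for the true gradient

for b in [8, 64, 512, 4096]:
    errs = []
    for _ in range(2000):
        batch = rng.choice(n, size=b, replace=False)
        g_hat = per_sample[batch].mean(axis=0)
        errs.append(np.abs(g_hat - full_grad).max())    # sup-norm error of the mini-batch gradient
    # Hoeffding + union bound over d coordinates: t = G * sqrt(2 log(2d/delta) / b)
    radius = np.sqrt(2 * np.log(2 * d / delta) / b)
    print(f"b={b:5d}  empirical 95th percentile: {np.quantile(errs, 0.95):.4f}   bound: {radius:.4f}")
```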
11.2 Confidence Intervals via Hoeffding
Suppose we evaluate an LLM on a benchmark: out of $n$ binary questions (correct/incorrect), we observe accuracy $\hat p = \frac{1}{n}\sum_{i=1}^n X_i$.
By Hoeffding's inequality (with $a_i = 0$, $b_i = 1$), with probability at least $1 - \delta$:
$$|\hat p - p| \le \sqrt{\frac{\log(2/\delta)}{2n}}.$$
This gives the Hoeffding confidence interval: $\hat p \pm \sqrt{\log(2/\delta)/(2n)}$.
Numerical example. For $n = 1000$ questions and $\delta = 0.05$:
$$\sqrt{\frac{\log(2/0.05)}{2 \cdot 1000}} = \sqrt{\frac{\log 40}{2000}} \approx 0.043.$$
So a measured accuracy of 73.2% on 1000 questions gives 95% CI $[68.9\%, 77.5\%]$ - a $\pm 4.3$ percentage-point interval. This explains why claiming "Model A (73.2%) outperforms Model B (72.8%)" on 1000 questions has no statistical support.
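A short computation of the interval above (a sketch; the benchmark size and accuracy are the numbers from this example):

```python
import numpy as np

def hoeffding_radius(n: int, delta: float = 0.05) -> float:
    """Two-sided Hoeffding CI half-width for a mean of [0, 1]-valued outcomes."""
    return float(np.sqrt(np.log(2 / delta) / (2 * n)))

n, acc = 1000, 0.732
r = hoeffding_radius(n)                                  # ~0.043
print(f"95% Hoeffding CI: [{acc - r:.3f}, {acc + r:.3f}]")

# Sample size needed for a +/-0.5% Hoeffding interval at 95% confidence
target = 0.005
print("questions needed:", int(np.ceil(np.log(2 / 0.05) / (2 * target**2))))   # ~74,000
```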
11.3 Random Features and Kernel Approximation
The Rahimi-Recht random features method approximates the RBF kernel $k(x, y) = \exp\!\big(-\|x - y\|^2/(2\sigma^2)\big)$ as:
$$k(x, y) \approx z(x)^\top z(y), \qquad z(x) = \sqrt{\tfrac{2}{m}}\big(\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_m^\top x + b_m)\big),$$
where $\omega_j \sim \mathcal{N}(0, \sigma^{-2} I)$ and $b_j \sim \mathrm{Unif}[0, 2\pi]$.
How many features? For a fixed pair $(x, y)$, each summand of $z(x)^\top z(y)$ lies in $[-2/m, 2/m]$, so the approximation error is bounded by Hoeffding:
$$\mathbb{P}\big(|z(x)^\top z(y) - k(x, y)| \ge \epsilon\big) \le 2\exp\!\left(-\frac{m\epsilon^2}{8}\right).$$
Solving for $m$: $m \ge \frac{8}{\epsilon^2}\log\frac{2}{\delta}$ features give $\epsilon$-accuracy with probability $1 - \delta$ at a fixed pair of points. For $\epsilon = 0.1$ and $\delta = 0.01$ this is $m \approx 4{,}200$.
In practice, far fewer features typically suffice for good downstream accuracy, suggesting the bound is conservative. This is common: worst-case bounds are loose; actual concentration is tighter.
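A small sketch of the construction (assuming the cosine feature map written above; the dimension, bandwidth, and feature counts are illustrative). The empirical error at a fixed pair of points decays roughly like $1/\sqrt{m}$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma = 20, 1.0
x, y = rng.normal(size=d), rng.normal(size=d)
k_true = np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))    # exact RBF kernel value

for m in [100, 1_000, 10_000]:
    errs = []
    for _ in range(200):
        omega = rng.normal(scale=1.0 / sigma, size=(m, d))  # omega_j ~ N(0, sigma^{-2} I)
        b = rng.uniform(0, 2 * np.pi, size=m)
        zx = np.sqrt(2.0 / m) * np.cos(omega @ x + b)
        zy = np.sqrt(2.0 / m) * np.cos(omega @ y + b)
        errs.append(abs(zx @ zy - k_true))
    print(f"m={m:6d}  mean |error| = {np.mean(errs):.4f}   1/sqrt(m) = {1/np.sqrt(m):.4f}")
```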
11.4 Concentration in High Dimensions
High-dimensional geometry exhibits surprising concentration phenomena. For $X \sim \mathcal{N}(0, I_d)$:
Chi-squared concentration. $\|X\|^2 = \sum_{i=1}^d X_i^2 \sim \chi^2_d$ with mean $d$ and variance $2d$. A Bernstein-type bound gives:
$$\mathbb{P}\big(\big|\|X\|^2 - d\big| \ge t\big) \le 2\exp\!\left(-\min\!\left(\frac{t^2}{8d}, \frac{t}{8}\right)\right).$$
So $\|X\|/\sqrt{d} \to 1$ in probability - all Gaussian vectors in high dimensions have nearly the same norm $\approx \sqrt{d}$.
Concentration of inner products. For independent $X, Y \in \mathbb{R}^d$ with i.i.d. zero-mean, unit-variance coordinates, $\langle X, Y\rangle$ has mean 0 and standard deviation $\sqrt{d}$, so the cosine similarity satisfies:
$$\frac{|\langle X, Y\rangle|}{\|X\|\,\|Y\|} = O\!\left(\frac{1}{\sqrt{d}}\right) \quad \text{with high probability.}$$
Nearly-orthogonal random vectors: with $d$ in the hundreds or thousands (typical embedding dimensions), random vectors have cosine similarity of order $1/\sqrt{d}$ (a few percent) with very high probability.
For AI (attention). In transformers with key dimension $d_k$, random query-key inner products concentrate around 0 with standard deviation $\sqrt{d_k}$. This justifies the $1/\sqrt{d_k}$ scaling in attention: without it, softmax inputs would have variance $d_k$, causing near-zero gradients through the softmax saturation.
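A quick numerical check of both phenomena (a sketch; the dimensions and sample counts are illustrative): norms concentrate at $\sqrt{d}$ and cosines shrink like $1/\sqrt{d}$.

```python
import numpy as np

rng = np.random.default_rng(2)
for d in [16, 256, 4096]:
    X = rng.normal(size=(2000, d))
    norms = np.linalg.norm(X, axis=1) / np.sqrt(d)            # should concentrate near 1
    A, B = X[0::2], X[1::2]                                   # 1000 independent pairs
    cos = (A * B).sum(axis=1) / (np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1))
    print(f"d={d:5d}  norm/sqrt(d) = {norms.mean():.3f} +/- {norms.std():.3f}   "
          f"mean |cos| = {np.abs(cos).mean():.3f}   1/sqrt(d) = {1/np.sqrt(d):.3f}")
```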
Preview: Law of Large Numbers and CLT
The concentration results above provide finite-sample bounds. The LLN says $\bar X_n \to \mu$ as $n \to \infty$; the CLT says the scaled deviation $\sqrt{n}(\bar X_n - \mu)$ converges in distribution to $\mathcal{N}(0, \sigma^2)$. These are the asymptotic counterparts of Hoeffding's inequality.
-> Full treatment: Section06 Stochastic Processes
12. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Applying Markov to a signed random variable | Markov requires $X \ge 0$; for signed $X$, $\mathbb{P}(X \ge t)$ can exceed $\mathbb{E}[X]/t$ | Apply Markov to $\lvert X\rvert$ or $(X - \mu)^2$ |
| 2 | Using Chebyshev for exponential tails | Chebyshev gives only $1/t^2$ decay, vastly looser than the sub-Gaussian $e^{-t^2/(2\sigma^2)}$ | If data is bounded or Gaussian, use Hoeffding/Chernoff |
| 3 | Forgetting the factor of 2 in two-sided Hoeffding | One-sided: $\mathbb{P}(\bar X - \mu \ge t) \le e^{-2nt^2}$; the two-sided bound has a factor 2 in front | Always write "two-sided" explicitly and include the factor |
| 4 | Confusing sub-Gaussian parameter with variance | The sub-Gaussian parameter $\sigma^2$ satisfies $\mathrm{Var}(X) \le \sigma^2$ but is not equal to it | $\mathrm{Var}(X) \le \sigma^2$; the sub-Gaussian parameter can be larger |
| 5 | Applying union bound over infinite classes directly | Union bound only works for countable (finite/countably infinite) collections | Use an $\epsilon$-net covering argument first, then union bound over the net |
| 6 | Computing VC dimension by counting parameters | VC dimension is not the number of parameters; it depends on the function class structure | For linear classifiers in $\mathbb{R}^d$: $\mathrm{VC} = d + 1$; in general VC is not the weight count |
| 7 | Ignoring the $\log(1/\delta)$ term in sample complexity | Setting $\delta = 0.05$ adds $\log(1/\delta) \approx 3$ - small, but not negligible | Always compute the full bound including $\log(1/\delta)$ |
| 8 | Claiming Chernoff applies to dependent variables | Chernoff requires independence (product of MGFs) | For dependent variables, use McDiarmid or martingale concentration |
| 9 | Using Hoeffding when variance is known to be small | Bernstein would give an exponentially tighter bound for small-variance data | Check if $\sigma^2 \ll (b - a)^2$; if so, use Bernstein |
| 10 | Interpreting VC generalisation bounds as practically tight | VC bounds are often vacuous for deep networks; they describe worst-case behaviour | Use PAC-Bayes or Rademacher bounds, which can be data-dependent and tighter |
13. Exercises
**Exercise 1 *** (Markov and Chebyshev) For and : (a) Compute exactly and via Markov's inequality. What is the ratio? (b) Compute exactly and via Chebyshev. What is the ratio? (c) For which does Chebyshev give ? Compare to the exact Gaussian answer. (d) Verify both bounds numerically with samples.
**Exercise 2 *** (Hoeffding Sample Complexity) A spam filter classifies emails. On test emails, it achieves accuracy . (a) Using Hoeffding, find the smallest such that with 99% confidence, . (b) How does the required change if we only require ? (c) If and , give the 95% Hoeffding confidence interval for . (d) Compare the Hoeffding CI to the Normal-approximation CI .
**Exercise 3 *** (Chernoff Bounds for Binomial) Let with , (so ). (a) Compute exactly and via the multiplicative Chernoff bound (). (b) Compute via Hoeffding. Compare to Chernoff - which is tighter? (c) Compute via the lower-tail Chernoff bound and compare to the exact value. (d) Plot all three bounds (Hoeffding, upper Chernoff, lower Chernoff) as a function of the threshold.
**Exercise 4 **** (Bernstein: Gradient Noise in SGD) Assume each sample gradient with and per-coordinate variance . (a) Apply Bernstein (with union bound) to bound for mini-batch size . (b) Compare to the Hoeffding-based bound. Under what condition does Bernstein win? (c) For , , , find the needed for via both bounds. (d) Verify numerically: generate synthetic gradients and measure empirical exceedance probability.
**Exercise 5 **** (McDiarmid on Empirical Risk) Let with and i.i.d. (a) Show that swapping one training example changes by at most . (b) Apply McDiarmid's inequality to bound . (c) Compare this to the direct Hoeffding application. Are they the same? Why? (d) What if instead of ? How does the bound scale?
**Exercise 6 **** (PAC Bounds for Finite Class) Consider with 1000 decision rules for binary classification. (a) How many training examples are needed for uniform convergence with , ? (b) ERM achieves training error 0. Using the PAC bound, what is the worst-case true error with 95% confidence? (c) If we use examples and observe , bound with 99% confidence. (d) How does the bound change if (e.g., a lookup table over binary features)?
**Exercise 7 ***** (VC Dimension and Generalisation) Consider linear classifiers in : . (a) Show by construction that points can be shattered (exhibit a configuration). (b) Show that no points in general position can be shattered. (c) Write the VC generalisation bound for , , . (d) Compare to the Rademacher bound for the same class with and .
**Exercise 8 ***** (Rademacher Complexity: Monte Carlo Estimation) Given a dataset and linear function class : (a) Show that . (b) Implement a Monte Carlo estimator using random sign vectors . (c) On the MNIST training set (or a synthetic dataset), compute and the resulting generalisation bound. (d) How does change as you vary ? Verify the scaling.
14. Why This Matters for AI (2026 Perspective)
| Concept | AI/ML Connection | Current Importance |
|---|---|---|
| Hoeffding confidence intervals | LLM benchmark evaluation: how many test questions for reliable rankings? | Critical - used implicitly in all leaderboard comparisons (MMLU, HELM, LMSYS) |
| PAC-Bayes bounds | Tightest known generalisation bounds for overparameterised models; connects to flat minima and sharpness-aware minimisation (SAM) | Active research frontier, 2023-2026 |
| VC dimension | Classical foundation; vacuous for LLMs with billions of parameters | Conceptually important; practically replaced by norm-based bounds |
| Rademacher complexity | Data-dependent bounds; finite for norm-constrained transformers | Used in theoretical analysis of attention, LoRA rank bounds |
| McDiarmid stability | Algorithmic stability theory: output doesn't change much when one training point changes | Foundation of differential privacy; relevant to RLHF data sensitivity |
| Concentration in high dims | $1/\sqrt{d_k}$ scaling in attention; JL lemma for random projections in retrieval | Directly explains transformer architectural choices (FlashAttention, MLA) |
| Chernoff for random graphs | Theoretical analysis of dropout, random initialisation, random feature networks | Relevant to lottery ticket hypothesis, neural architecture search |
| Union bound + covering | Uniform convergence for function classes; foundation of all finite-sample learning theory | Every theoretical ML paper uses this framework |
| Sub-Gaussian gradients | Convergence theory for Adam, SGD; required sample complexity for fine-tuning | Used in LoRA theoretical analysis and RLHF convergence guarantees |
| Bernstein vs Hoeffding | Variance-aware bounds; relevant when gradient variance is small (near optima) | Foundation of variance reduction methods (SVRG, SAG, SAGA) |
15. Conceptual Bridge
Looking back. This section builds on Section04 Expectation and Moments, which established $\mathbb{E}[X]$, $\mathrm{Var}(X)$, and the MGF $M_X(t) = \mathbb{E}[e^{tX}]$. The MGF is the critical bridge: Hoeffding's lemma uses convexity of $e^{tx}$; the Chernoff method optimises over $t > 0$; Bernstein's inequality exploits the variance $\sigma^2$. Jensen's inequality (from Section04) underlies both Hoeffding's lemma proof and the information-theoretic interpretation of Chernoff bounds.
Looking forward. The Law of Large Numbers says $\bar X_n \to \mu$ almost surely. This is the qualitative counterpart of Hoeffding's quantitative bound: Hoeffding gives the rate at which $\bar X_n$ concentrates. The Central Limit Theorem (CLT) says the shape of the distribution of $\sqrt{n}(\bar X_n - \mu)$ converges to Gaussian - explaining why sub-Gaussian bounds are the right model for sums of bounded variables. Both are developed in Section06 Stochastic Processes.
The PAC-statistics connection. PAC learning theory is, at its core, applied concentration theory. The "probably" in PAC is the confidence $1 - \delta$ from a concentration inequality; the "approximately correct" is the tolerance $\epsilon$. The sample complexity $n(\epsilon, \delta)$ is exactly what concentration inequalities give. Chapter 7 (Statistics) will build on this: confidence intervals are Hoeffding bounds, hypothesis tests are Chernoff bounds, and MLE theory uses Bernstein-type arguments to prove asymptotic normality.
CONCENTRATION INEQUALITIES IN THE CURRICULUM
========================================================================
Section04 Expectation and Moments
+- E[X], Var(X), MGF M_X(t) -------------------------- input
+- Jensen's inequality -------------------------- used in proofs
+- Markov/Chebyshev preview -------------------------- previewed there
v
Section05 Concentration Inequalities <--- YOU ARE HERE
+- Markov / Chebyshev (moment-based, polynomial tails)
+- Sub-Gaussian / Hoeffding (bounded, exponential tails)
+- Chernoff / Bernstein (MGF-based, variance-aware)
+- McDiarmid (functions of independent variables)
+- Union bound + covering (infinite hypothesis classes)
+- PAC learning / VC dimension (generalisation theory)
+- Rademacher complexity (data-dependent bounds)
v v
Section06 Stochastic Processes Chapter 7: Statistics
+- Weak LLN (via Chebyshev) +- Confidence intervals
+- Strong LLN +- Hypothesis testing
+- CLT (limit of Hoeffding) +- MLE consistency (Bernstein)
========================================================================
Appendix A: Summary of Key Bounds
| Inequality | Conditions | Bound | Rate |
|---|---|---|---|
| Markov | $X \ge 0$, $t > 0$ | $\mathbb{P}(X \ge t) \le \mathbb{E}[X]/t$ | $1/t$ |
| Chebyshev | $\mathrm{Var}(X) = \sigma^2 < \infty$ | $\mathbb{P}(\lvert X - \mu\rvert \ge t) \le \sigma^2/t^2$ | $1/t^2$ |
| Cantelli | $\mathrm{Var}(X) = \sigma^2 < \infty$ | $\mathbb{P}(X - \mu \ge t) \le \dfrac{\sigma^2}{\sigma^2 + t^2}$ | $1/t^2$ (one-sided) |
| Sub-Gaussian | $X$ $\sigma^2$-sub-G | $\mathbb{P}(\lvert X - \mu\rvert \ge t) \le 2e^{-t^2/(2\sigma^2)}$ | $e^{-t^2}$ |
| Hoeffding | $X_i \in [a_i, b_i]$ independent | $\mathbb{P}(\lvert \bar X - \mu\rvert \ge t) \le 2\exp\!\big(-\tfrac{2n^2t^2}{\sum_i (b_i - a_i)^2}\big)$ | $e^{-nt^2}$ |
| Chernoff (mult.) | $X_i \in \{0, 1\}$ independent, $\mu = \mathbb{E}\sum_i X_i$ | $\mathbb{P}\big(\sum_i X_i \ge (1+\delta)\mu\big) \le \big(\tfrac{e^\delta}{(1+\delta)^{1+\delta}}\big)^{\mu}$ | $e^{-\mu\delta^2/3}$ |
| Bernstein | $\lvert X_i - \mu\rvert \le M$, variance $\sigma^2$ | $\mathbb{P}(\lvert \bar X - \mu\rvert \ge t) \le 2\exp\!\big(-\tfrac{nt^2}{2\sigma^2 + 2Mt/3}\big)$ | $e^{-nt^2}$ to $e^{-nt}$ |
| McDiarmid | $f$ with $c_i$-bounded differences | $\mathbb{P}(\lvert f - \mathbb{E}f\rvert \ge t) \le 2\exp\!\big(-\tfrac{2t^2}{\sum_i c_i^2}\big)$ | $e^{-t^2}$ |
Appendix B: Sample Complexity Table
Given: i.i.d. data in $[0, 1]$, want $\mathbb{P}(|\bar X_n - \mu| \le \epsilon) \ge 1 - \delta$.
| $\epsilon$ | $\delta$ | Hoeffding $n$ | Normal approx. $n$ |
|---|---|---|---|
| 0.10 | 0.10 | 150 | 68 |
| 0.05 | 0.10 | 600 | 271 |
| 0.05 | 0.05 | 738 | 384 |
| 0.01 | 0.05 | 18,445 | 9,604 |
| 0.01 | 0.01 | 26,492 | 16,587 |
Hoeffding is distribution-free (works for worst case); Normal approximation assumes CLT holds.
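The table can be regenerated with a few lines (a sketch using the two formulas above; scipy is assumed only for the Normal quantile):

```python
import numpy as np
from scipy import stats

def hoeffding_n(eps: float, delta: float) -> int:
    """Two-sided Hoeffding sample size for [0, 1]-bounded data."""
    return int(np.ceil(np.log(2 / delta) / (2 * eps**2)))

def normal_approx_n(eps: float, delta: float) -> int:
    """CLT-based sample size with worst-case variance 1/4."""
    z = stats.norm.ppf(1 - delta / 2)
    return int(np.ceil(z**2 / (4 * eps**2)))

for eps, delta in [(0.10, 0.10), (0.05, 0.10), (0.05, 0.05), (0.01, 0.05), (0.01, 0.01)]:
    print(f"eps={eps:.2f}  delta={delta:.2f}  Hoeffding n={hoeffding_n(eps, delta):>6}  "
          f"Normal n={normal_approx_n(eps, delta):>6}")
```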
Appendix C: Detailed Proofs and Derivations
C.1 Proof of Hoeffding's Lemma in Full
We prove: if $\mathbb{E}[X] = 0$ and $X \in [a, b]$ a.s., then $\mathbb{E}[e^{tX}] \le \exp\!\big(t^2(b-a)^2/8\big)$.
Step 1: Convexity bound. Since $x \mapsto e^{tx}$ is convex, for $x \in [a, b]$:
$$e^{tx} \le \frac{b - x}{b - a}e^{ta} + \frac{x - a}{b - a}e^{tb}.$$
Step 2: Take expectations. Let $p = \frac{-a}{b - a}$ (well defined since $a \le 0 \le b$ after centering). Using $\mathbb{E}[X] = 0$:
$$\mathbb{E}[e^{tX}] \le \frac{b}{b - a}e^{ta} + \frac{-a}{b - a}e^{tb} = (1 - p)e^{ta} + p\,e^{tb}.$$
Step 3: Exponential bound. Let $u = t(b - a)$ and $\varphi(u) = \log\big((1 - p)e^{ta} + p\,e^{tb}\big)$. Then:
$$\varphi(u) = -pu + \log\big(1 - p + p\,e^{u}\big).$$
Note: $\varphi(0) = 0$ and $\varphi'(0) = -p + p = 0$ (by direct computation). So:
$$\mathbb{E}[e^{tX}] \le e^{\varphi(u)}.$$
Step 4: Bound $\varphi''$. We have $\varphi(0) = 0$ and $\varphi'(0) = 0$. By Taylor's theorem:
$$\varphi(u) = \frac{u^2}{2}\varphi''(\xi)$$
for some $\xi \in (0, u)$. Computing, with $q(\xi) = \frac{p\,e^{\xi}}{1 - p + p\,e^{\xi}}$:
$$\varphi''(\xi) = q(\xi)\big(1 - q(\xi)\big) \le \frac{1}{4}$$
(since $q(1 - q) \le \tfrac14$ for $q \in [0, 1]$, applied to $q = q(\xi)$).
Therefore: $\varphi(u) \le \dfrac{u^2}{8} = \dfrac{t^2(b - a)^2}{8}$, and so $\mathbb{E}[e^{tX}] \le \exp\!\big(t^2(b-a)^2/8\big)$. $\blacksquare$
C.2 Proof of McDiarmid's Inequality via Azuma's Inequality
Azuma's Inequality. If $(M_k)_{k=0}^n$ is a martingale with $|M_k - M_{k-1}| \le c_k$ a.s., then:
$$\mathbb{P}\big(|M_n - M_0| \ge t\big) \le 2\exp\!\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right).$$
Doob martingale construction. Given independent $X_1, \dots, X_n$ and $f$ with bounded differences $c_1, \dots, c_n$, define:
$$M_k = \mathbb{E}\big[f(X_1, \dots, X_n) \mid X_1, \dots, X_k\big].$$
Then:
- $M_0 = \mathbb{E}[f]$ (constant, before observing anything)
- $M_n = f(X_1, \dots, X_n)$ (the function value itself)
- $(M_k)$ is a martingale by the tower property
Bounding differences. For each $k$:
$$|M_k - M_{k-1}| = \big|\,\mathbb{E}[f \mid X_1, \dots, X_k] - \mathbb{E}[f \mid X_1, \dots, X_{k-1}]\,\big|.$$
Since changing $X_k$ changes $f$ by at most $c_k$ and $M_{k-1}$ averages over $X_k$, we have $|M_k - M_{k-1}| \le c_k$.
Applying Azuma to this Doob martingale gives McDiarmid's inequality (a slightly sharper conditional argument recovers the factor 2 in the exponent $2t^2/\sum_i c_i^2$).
C.3 Proof of the Finite-Class PAC Bound (Agnostic Case)
Setup. $\mathcal{H}$ finite with $|\mathcal{H}| = M$. Losses $\ell(h(x), y) \in [0, 1]$. True risk $R(h) = \mathbb{E}[\ell(h(X), Y)]$, empirical risk $\hat R_n(h) = \frac{1}{n}\sum_{i=1}^n \ell(h(x_i), y_i)$.
Goal. Show $\sup_{h \in \mathcal{H}} |R(h) - \hat R_n(h)| \le \epsilon$ with probability at least $1 - \delta$ when $n \ge \dfrac{\log(2M/\delta)}{2\epsilon^2}$.
Step 1. Fix $h \in \mathcal{H}$. By Hoeffding: $\mathbb{P}\big(|R(h) - \hat R_n(h)| \ge \epsilon\big) \le 2e^{-2n\epsilon^2}$.
Step 2. Union bound over all hypotheses:
$$\mathbb{P}\big(\exists h \in \mathcal{H}: |R(h) - \hat R_n(h)| \ge \epsilon\big) \le 2M e^{-2n\epsilon^2}.$$
Step 3. Set $2M e^{-2n\epsilon^2} = \delta$ and solve for $n$:
$$n \ge \frac{\log(2M/\delta)}{2\epsilon^2}.$$
ERM generalisation. Under uniform convergence, the ERM $\hat h = \arg\min_{h \in \mathcal{H}} \hat R_n(h)$ satisfies:
$$R(\hat h) \le \min_{h \in \mathcal{H}} R(h) + 2\epsilon,$$
where the minimiser on the right is the best hypothesis in $\mathcal{H}$ (not necessarily the Bayes-optimal classifier).
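A Monte Carlo sanity check of the union-bound argument (a sketch with assumptions: each "hypothesis" is reduced to a fixed error probability so its losses are i.i.d. Bernoulli; $M$, $n$, and the repetition count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
M, n, delta = 1_000, 1_000, 0.05
eps = np.sqrt(np.log(2 * M / delta) / (2 * n))        # uniform-convergence radius

p = rng.uniform(0.1, 0.5, size=M)                     # true risks R(h), one per hypothesis
exceed, reps = 0, 200
for _ in range(reps):
    losses = rng.random((M, n)) < p[:, None]          # n i.i.d. 0/1 losses per hypothesis
    sup_dev = np.max(np.abs(losses.mean(axis=1) - p)) # worst deviation over the class
    exceed += sup_dev > eps
print(f"eps = {eps:.4f}   empirical P(sup deviation > eps) = {exceed / reps:.3f}   (bound: {delta})")
```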
Appendix D: VC Theory in Depth
D.1 The Growth Function
Definition. The growth function $\Pi_{\mathcal{H}}(n)$ counts the maximum number of distinct labellings achievable by $\mathcal{H}$ on any $n$ points:
$$\Pi_{\mathcal{H}}(n) = \max_{x_1, \dots, x_n} \big|\{(h(x_1), \dots, h(x_n)) : h \in \mathcal{H}\}\big|.$$
If $\mathcal{H}$ can shatter $n$ points, $\Pi_{\mathcal{H}}(n) = 2^n$. The VC dimension is the largest $n$ for which $\Pi_{\mathcal{H}}(n) = 2^n$.
D.2 Sauer-Shelah Lemma
Lemma. For $\mathrm{VC}(\mathcal{H}) = d$ and $n \ge d$:
$$\Pi_{\mathcal{H}}(n) \le \sum_{i=0}^{d} \binom{n}{i} \le \left(\frac{en}{d}\right)^{d}.$$
Proof idea. By induction on $n$ and $d$. Split on the last point: either it makes no difference to the labelling count, or it doubles some labellings. Careful counting bounds the total.
Consequence. For $n \ge d$: $\Pi_{\mathcal{H}}(n) \le (en/d)^d$, polynomial in $n$. The growth function transitions from exponential ($2^n$, while the class can still shatter) to polynomial (once $n$ exceeds the VC dimension).
D.3 VC Dimension Examples
| Hypothesis Class | VC Dimension |
|---|---|
| Thresholds on $\mathbb{R}$ | 1 |
| Intervals on $\mathbb{R}$ | 2 |
| Halfspaces in $\mathbb{R}^d$ | $d + 1$ |
| Axis-aligned rectangles in $\mathbb{R}^2$ | 4 |
| Convex polygons in $\mathbb{R}^2$ with $k$ vertices | $2k + 1$ |
| Degree-$p$ polynomial classifiers in $\mathbb{R}^d$ | $\binom{d + p}{p}$ |
| Neural nets with $W$ weights | $O(W \log W)$ (threshold units) |
| Kernel classifiers (universal kernel) | $\infty$ |
Why VC $= d + 1$ for halfspaces. In $\mathbb{R}^d$, any $d + 1$ points in general position (affinely independent) can be shattered by a halfspace. But for any $d + 2$ points, by Radon's theorem, they can be split into two groups whose convex hulls intersect - no halfspace separates them. Hence $\mathrm{VC} = d + 1$.
D.4 Fundamental Theorem of Statistical Learning
Theorem. For binary classification, the following are equivalent:
- $\mathcal{H}$ is PAC-learnable
- ERM is a successful learner for $\mathcal{H}$
- $\mathcal{H}$ has finite VC dimension
- $\mathcal{H}$ has the uniform convergence property
This theorem, proved through Sauer-Shelah and the symmetrisation technique (doubling the sample and introducing Rademacher variables), is the bedrock of statistical learning theory.
Appendix E: Advanced Topics in Concentration
E.1 Talagrand's Inequality
Talagrand's inequality (1995) strengthens McDiarmid for "self-bounding" functions - those where the effect of each coordinate is bounded by the function value itself.
Self-bounding functions. $f \ge 0$ is self-bounding if there exist functions $f_i$ of $x^{(i)} = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)$ such that:
$$0 \le f(x) - f_i(x^{(i)}) \le 1 \quad \text{and} \quad \sum_{i=1}^{n}\big(f(x) - f_i(x^{(i)})\big) \le f(x).$$
Talagrand's bound. For self-bounding $f$:
$$\mathbb{P}\big(f - \mathbb{E}f \ge t\big) \le \exp\!\left(-\frac{t^2}{2\,\mathbb{E}f + 2t/3}\right).$$
This is a variance-adaptive McDiarmid - analogous to Bernstein vs Hoeffding.
Applications: Number of ones in a sum of Bernoullis, size of random matchings, empirical risk when the model fits the data.
E.2 Concentration for Heavy-Tailed Distributions
When distributions are heavy-tailed, sub-Gaussian bounds don't apply. Modern techniques include:
Catoni's estimator. For estimating the mean of a distribution with finite variance $\sigma^2$ (but potentially infinite MGF), Catoni's estimator $\hat\mu_{\mathrm{Cat}}$ achieves, with probability $1 - \delta$:
$$|\hat\mu_{\mathrm{Cat}} - \mu| \le \sigma\sqrt{\frac{2\log(2/\delta)}{n}},$$
using a truncated exponential influence function. This gives Bernstein-like rates without boundedness.
Median of means. Partition the $n$ samples into $k \approx 8\log(1/\delta)$ groups of size $n/k$. Compute the group means and take their median. The resulting estimator deviates from $\mu$ by at most $O\big(\sigma\sqrt{\log(1/\delta)/n}\big)$ with probability $1 - \delta$ - achieving sub-Gaussian rates for distributions with only two moments.
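A sketch verifying the median-of-means guarantee on heavy-tailed data (assumptions: Student-t with 2.5 degrees of freedom, which has finite variance but no MGF; the constant 32 comes from the Chebyshev-per-group plus Hoeffding-on-the-median argument and is not optimised):

```python
import numpy as np

def median_of_means(x: np.ndarray, k: int) -> float:
    """Split x into k groups, average each group, return the median of the group means."""
    return float(np.median([g.mean() for g in np.array_split(x, k)]))

rng = np.random.default_rng(4)
delta = 0.01
k = int(np.ceil(8 * np.log(1 / delta)))                  # number of groups, as in the recipe above
n, sigma = 10_000, np.sqrt(2.5 / 0.5)                    # t(2.5) has variance df/(df-2) = 5
bound = sigma * np.sqrt(32 * np.log(1 / delta) / n)      # deviation guarantee at confidence 1 - delta

errs = []
for _ in range(2_000):
    x = rng.standard_t(df=2.5, size=n)                   # heavy-tailed sample, true mean 0
    errs.append(abs(median_of_means(x, k)))
print(f"k = {k} groups, bound = {bound:.4f}, "
      f"empirical P(|error| > bound) = {np.mean(np.array(errs) > bound):.4f} (target {delta})")
```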
For AI: Training data for LLMs often has heavy-tailed length distributions (documents follow power laws). Heavy-tail-robust mean estimation is relevant for unbiased training.
E.3 Matrix Concentration Inequalities
Scalar concentration extends to matrices. For many AI applications, we care about concentration of random matrices.
Matrix Hoeffding. For independent, zero-mean, symmetric random matrices $X_i \in \mathbb{R}^{d \times d}$ with $X_i^2 \preceq A_i^2$ a.s., for all $t > 0$:
$$\mathbb{P}\Big(\lambda_{\max}\Big(\sum_i X_i\Big) \ge t\Big) \le d\,\exp\!\left(-\frac{t^2}{8\sigma^2}\right), \qquad \sigma^2 = \Big\|\sum_i A_i^2\Big\|.$$
The extra factor $d$ (matrix dimension) accounts for the eigenvalue structure (a union over eigendirections).
Application: random features. When approximating a kernel matrix by random features, the approximation error in spectral norm concentrates with matrix Hoeffding, justifying the use of random features for -accurate Gram matrix approximation.
Matrix Bernstein. For independent zero-mean symmetric random matrices $X_i$ with $\|X_i\| \le R$ and $v = \big\|\sum_i \mathbb{E}[X_i^2]\big\|$:
$$\mathbb{P}\Big(\Big\|\sum_i X_i\Big\| \ge t\Big) \le 2d\,\exp\!\left(-\frac{t^2/2}{v + Rt/3}\right).$$
Used in randomised linear algebra (randomised SVD, Nystrom approximation).
E.4 PAC-Bayes Theory
PAC-Bayes bounds combine the PAC framework with Bayesian prior knowledge.
Setup. Given a prior $P$ over $\mathcal{H}$ (chosen before seeing data) and a posterior $Q$ (chosen after seeing data):
PAC-Bayes Theorem (McAllester, 1999). For any prior $P$ and any $\delta > 0$, with probability at least $1 - \delta$ over the sample, simultaneously for all posteriors $Q$:
$$\mathbb{E}_{h \sim Q}[R(h)] \le \mathbb{E}_{h \sim Q}[\hat R_n(h)] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(2\sqrt{n}/\delta)}{2n}}.$$
Why this is powerful: The KL term replaces $\log|\mathcal{H}|$ (which can be infinite). For neural networks, if the posterior concentrates near the prior (small KL), the bound is tight. Flat minima (low sharpness) correspond to posteriors that don't drift far from the prior.
Connection to SAM. Sharpness-Aware Minimisation (SAM, 2021) minimises a PAC-Bayes-inspired upper bound on generalisation error by seeking flat minima - directly operationalising PAC-Bayes in practice.
Appendix F: Worked Examples
F.1 Coin Flip Estimation
Problem. We flip a biased coin $n$ times and observe $k$ heads. How many flips are needed to estimate $p$ within $\pm 0.05$ with 99% confidence?
Hoeffding solution. With $\epsilon = 0.05$, $\delta = 0.01$:
$$n \ge \frac{\log(2/\delta)}{2\epsilon^2} = \frac{\log 200}{2 \cdot (0.05)^2} \approx 1060.$$
Normal approximation. $n \ge \dfrac{z_{0.995}^2 \cdot \tfrac14}{\epsilon^2} = \dfrac{(2.576)^2 \cdot 0.25}{0.0025} \approx 664$. Less conservative because it uses the Gaussian CLT approximation (valid for large $n$).
Interpretation. About 1000-1100 flips are needed for reliable probability estimation. This is why A/B tests often require thousands of users: smaller samples give confidence intervals too wide to detect meaningful differences.
F.2 Multi-Armed Bandit via Chernoff
Problem. A recommendation system has $K$ arms (content types). Each arm $a$ has unknown click-through rate $\mu_a$. We want to identify an arm within $\epsilon$ of the best with probability at least $1 - \delta$.
Naive approach. Pull each arm $n$ times. By Hoeffding + union bound over the $K$ arms:
$$\mathbb{P}\big(\exists a: |\hat\mu_a - \mu_a| \ge \epsilon/2\big) \le 2K e^{-n\epsilon^2/2}.$$
Set this equal to $\delta$: $n \ge \frac{2}{\epsilon^2}\log\frac{2K}{\delta}$. Total pulls: $Kn$.
Adaptive strategy (UCB). Upper Confidence Bound algorithms concentrate exploration on promising arms, achieving a total number of pulls that adapts to the gaps between arms - much smaller in practice.
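A simulation of the naive strategy (a sketch; the number of arms, the click-through rates, $\epsilon$, and $\delta$ are illustrative assumptions, not figures from the original example):

```python
import numpy as np

rng = np.random.default_rng(5)
K, eps, delta = 10, 0.05, 0.05
mu = rng.uniform(0.2, 0.6, size=K)                       # hypothetical click-through rates
n = int(np.ceil(2 / eps**2 * np.log(2 * K / delta)))     # pulls per arm from Hoeffding + union bound

failures, reps = 0, 1_000
for _ in range(reps):
    clicks = rng.random((K, n)) < mu[:, None]            # Bernoulli pulls for every arm
    best_hat = int(np.argmax(clicks.mean(axis=1)))
    failures += (mu.max() - mu[best_hat]) > eps          # picked an arm more than eps worse
print(f"pulls per arm: {n}, total pulls: {K * n}, failure rate: {failures / reps:.4f} (target {delta})")
```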
F.3 Neural Network Generalisation: PAC-Bayes in Practice
Setup. 2-layer ReLU network, 100 hidden units, trained on 50,000 MNIST examples. Training error 0.3%, test error 1.2%. Standard VC bound: vacuous (parameter count $\approx 8 \times 10^4$).
PAC-Bayes approach. Use a Gaussian prior $P = \mathcal{N}(w_0, \sigma_P^2 I)$ and posterior $Q = \mathcal{N}(\hat w, \sigma_Q^2 I)$ centred at the trained weights $\hat w$. Compute $\mathrm{KL}(Q \,\|\, P)$ and plug it into the PAC-Bayes bound above.
With careful tuning of $\sigma_P$ and $\sigma_Q$, this can give non-vacuous bounds (< 50% error guarantee) even for deep networks - unlike VC theory.
F.4 Random Projection (Johnson-Lindenstrauss)
Problem. We have $n = 10{,}000$ points in $\mathbb{R}^d$. We want to project to $k \ll d$ dimensions, preserving all pairwise distances within a factor $1 \pm \epsilon$ for $\epsilon = 0.1$.
JL Lemma. For random projections to $k$ dimensions (from any ambient dimension $d$), it suffices that:
$$k \ge \frac{8\log n}{\epsilon^2} = \frac{8\log 10{,}000}{0.01} \approx 7{,}400.$$
So about 7,400 dimensions suffice regardless of the original dimension $d$ - from $d = 10{,}000$ this is already 26% compression, and the savings from much higher-dimensional data are massive.
Proof sketch. A random Gaussian matrix $A \in \mathbb{R}^{k \times d}$ with i.i.d. $\mathcal{N}(0, 1/k)$ entries satisfies, for any fixed pair of points $u, v$:
$$\mathbb{P}\Big(\big|\,\|A(u - v)\|^2 - \|u - v\|^2\,\big| > \epsilon\|u - v\|^2\Big) \le 2\exp\!\big(-k(\epsilon^2/4 - \epsilon^3/6)\big)$$
by chi-squared concentration. A union bound over the $\binom{n}{2}$ pairs gives the JL lemma.
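A small projection experiment (a sketch with deliberately reduced sizes so it runs quickly: 200 points, ambient dimension 5,000, $\epsilon = 0.2$; these are illustrative, not the figures from the worked example). Almost every pair lands inside the $1 \pm \epsilon$ band; occasional marginal excursions reflect the loose constant:

```python
import numpy as np

rng = np.random.default_rng(6)
n_pts, d, eps = 200, 5_000, 0.2
k = int(np.ceil(8 * np.log(n_pts) / eps**2))             # JL target dimension with the constant 8
X = rng.normal(size=(n_pts, d))
A = rng.normal(scale=1 / np.sqrt(k), size=(k, d))        # random Gaussian projection
Y = X @ A.T

def pairwise_sq_dists(Z: np.ndarray) -> np.ndarray:
    sq = (Z * Z).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2 * (Z @ Z.T)

iu = np.triu_indices(n_pts, k=1)
ratios = pairwise_sq_dists(Y)[iu] / pairwise_sq_dists(X)[iu]
within = np.mean((ratios > 1 - eps) & (ratios < 1 + eps))
print(f"k = {k}: {100 * within:.2f}% of pairs within 1 +/- {eps}; "
      f"extremes: [{ratios.min():.3f}, {ratios.max():.3f}]")
```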
For AI: Locality-sensitive hashing, approximate nearest-neighbour search (used in RAG retrieval), and low-rank adaptation (LoRA) all rely on Johnson-Lindenstrauss-type arguments to justify dimensionality reduction.
Appendix G: Computing Concentration Bounds in Practice
G.1 Choosing the Right Bound
DECISION TREE: WHICH BOUND TO USE
========================================================================
Do you know E[X]? --- No --> Can't bound tail
|
Yes
|
Is X >= 0? --- Yes --> Markov: P(X >= t) <= E[X]/t
|
No
|
Do you know Var(X)? --- Yes --> Chebyshev: P(|X-mu| >= t) <= sigma^2/t^2
|
No
|
Is X bounded in [a,b]? --- Yes --> Hoeffding (exponential bound)
| or Bernstein if Var(X) known
|
No
|
Is MGF finite? --- Yes --> Chernoff method (optimise over s)
|
No
|
Distribution-free needed --> Student-t CI or bootstrap
========================================================================
G.2 Python Recipe for Hoeffding CIs
```python
import numpy as np

def hoeffding_ci(x, delta=0.05, lo=0.0, hi=1.0):
    """Hoeffding confidence interval for bounded data in [lo, hi]."""
    n = len(x)
    x_bar = np.mean(x)
    radius = (hi - lo) * np.sqrt(np.log(2 / delta) / (2 * n))
    return x_bar - radius, x_bar + radius

def hoeffding_sample_size(epsilon, delta, lo=0.0, hi=1.0):
    """Required n for epsilon-accuracy at 1-delta confidence."""
    return int(np.ceil((hi - lo)**2 * np.log(2 / delta) / (2 * epsilon**2)))
```
G.3 Interpreting Generalisation Gaps
When a model achieves training error 1% and test error 5%:
- Generalisation gap = 4%
- Is this within bounds? With , : Hoeffding gives per hypothesis
- For hypotheses: add
- Total bound: - exactly matching the observed gap!
In practice, neural networks occupy a tiny corner of their hypothesis class (due to inductive bias of SGD), so effective complexity is much smaller than the VC dimension suggests.
Appendix H: Self-Assessment Checklist
Before moving to Section06 (Stochastic Processes), verify you can:
- State Markov's inequality from memory and prove it with the indicator trick
- Derive Chebyshev from Markov by applying Markov to $(X - \mu)^2$
- Prove Hoeffding's lemma using convexity of $e^{tx}$ and Taylor expansion
- Apply the Chernoff method: Markov + MGF bound + optimise over $t > 0$
- State McDiarmid's inequality and identify the bounded differences $c_i$ for a given function
- Compute sample complexity for given $(\epsilon, \delta)$ using Hoeffding
- Define VC dimension and compute it for halfspaces ($d + 1$ in $\mathbb{R}^d$)
- Write the PAC generalisation bound with $\log|\mathcal{H}|$ or the VC dimension
- Define Rademacher complexity and explain its interpretation
- Identify when Bernstein beats Hoeffding (small variance case)
- Give a non-example for sub-Gaussian (heavy-tailed distribution)
- Explain why VC bounds are vacuous for LLMs and what PAC-Bayes offers
Appendix I: Connections to Information Theory
I.1 Kullback-Leibler Divergence and Exponential Families
The Chernoff method has a deep information-theoretic interpretation. The optimal exponent in the Chernoff bound equals the KL divergence under the tilted distribution.
For $X_1, \dots, X_n$ i.i.d. with distribution $P$, and target mean $a > \mathbb{E}[X]$:
$$-\frac{1}{n}\log \mathbb{P}(\bar X_n \ge a) \;\to\; \sup_{t > 0}\big(ta - \log M_X(t)\big) \;=\; \mathrm{KL}(P_a \,\|\, P),$$
where $P_a$ is the exponential tilt of $P$ that puts its mean at $a$: $\dfrac{dP_a}{dP}(x) = \dfrac{e^{t^* x}}{M_X(t^*)}$ with $\mathbb{E}_{P_a}[X] = a$.
This is Cramer's theorem - the foundation of large deviations theory. The probability of a rare event decays exponentially at rate given by the KL divergence to the closest distribution that makes the event typical.
For AI: This connection explains why cross-entropy loss (KL divergence from data to model) is the right loss for language models: minimising cross-entropy = maximising the probability of the training data = minimising the Chernoff exponent for generalisation error.
I.2 Entropy and Sauer-Shelah
The growth function $\Pi_{\mathcal{H}}(n)$ counts the number of distinct behaviours of $\mathcal{H}$ on $n$ points. The log growth rate:
$$H_{\mathcal{H}}(n) = \frac{\log_2 \Pi_{\mathcal{H}}(n)}{n}$$
is the combinatorial entropy of $\mathcal{H}$. Sauer-Shelah shows:
- If $n > d = \mathrm{VC}(\mathcal{H})$: $\log_2 \Pi_{\mathcal{H}}(n) \le d\log_2(en/d)$ (subexponential growth, polynomial in $n$)
- If $n \le d$: $\Pi_{\mathcal{H}}(n) = 2^n$ (full exponential growth, $H_{\mathcal{H}}(n) = 1$)
This binary distinction - polynomial vs exponential growth - is precisely what separates PAC-learnable from non-learnable hypothesis classes.
Appendix J: Connections to Optimisation
J.1 Convergence of SGD via Hoeffding
The standard convergence analysis of SGD for convex, $L$-smooth functions uses:
- Descent lemma: $f(\theta_{t+1}) \le f(\theta_t) - \eta\,\nabla f(\theta_t)^\top \hat g_t + \frac{L\eta^2}{2}\|\hat g_t\|^2$
- Gradient concentration: $\|\hat g_t - \nabla f(\theta_t)\| \le \epsilon$ with high probability (Hoeffding)
- Telescoping: sum over all $T$ steps and apply the Hoeffding bound with a union bound over the steps (adding a $\log T$ factor to the sample complexity)
The result: after $T = O(1/\epsilon^2)$ steps, SGD achieves $f(\bar\theta_T) - f(\theta^*) \le \epsilon$ with high probability.
J.2 Generalisation and Algorithmic Stability
Definition (Uniform Stability). Algorithm $\mathcal{A}$ is $\beta$-uniformly stable if for any two datasets $S, S'$ differing in one example:
$$\sup_{z}\big|\ell(\mathcal{A}(S), z) - \ell(\mathcal{A}(S'), z)\big| \le \beta.$$
Theorem (Bousquet-Elisseeff, 2002). If $\mathcal{A}$ is $\beta$-stable with $\beta \le c/n$ for some constant $c$, and losses are bounded in $[0, 1]$, then with probability $1 - \delta$:
$$R(\mathcal{A}(S)) \le \hat R_n(\mathcal{A}(S)) + 2\beta + (4n\beta + 1)\sqrt{\frac{\log(1/\delta)}{2n}}.$$
SGD is stable. For $L$-smooth convex losses, SGD with step size $\eta$ for $T$ steps is $O(\eta T/n)$-uniformly stable. Setting $\eta \propto 1/\sqrt{T}$ gives $O(\sqrt{T}/n)$ stability - small when $T \ll n^2$.
This explains why early stopping helps: more SGD steps = less stable = worse generalisation.
Appendix K: Concentration Phenomena in Transformer Architectures
K.1 Attention Score Concentration
In scaled dot-product attention, query-key inner products determine the attention weights. For randomly initialised $q, k \in \mathbb{R}^{d_k}$ with i.i.d. zero-mean, unit-variance entries, the inner products satisfy $\mathbb{E}[q^\top k] = 0$ and $\mathrm{Var}(q^\top k) = d_k$.
Without the $1/\sqrt{d_k}$ scaling, the softmax input has standard deviation $\sqrt{d_k}$, pushing softmax into its saturating regime (near-uniform or near-one-hot). The scaling ensures standard deviation $\approx 1$, keeping gradients flowing.
Formal justification. With $d_k$-dimensional queries and keys, $q^\top k = \sum_{i=1}^{d_k} q_i k_i$, where each summand has variance 1. By sub-Gaussian closure under summation, the sum has variance proxy $O(d_k)$, so standard deviation $\approx \sqrt{d_k}$. After scaling: $q^\top k / \sqrt{d_k}$ has standard deviation $\approx 1$, optimal for softmax.
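A quick illustration (a sketch; the key dimension, sequence length, and trial count are illustrative): without the $1/\sqrt{d_k}$ factor, a randomly initialised head already produces near-one-hot attention, i.e. operates in the saturated regime.

```python
import numpy as np

rng = np.random.default_rng(7)
d_k, T, n_queries = 64, 128, 200                   # key dimension, sequence length, trials
K = rng.normal(size=(T, d_k))

for name, scale in [("unscaled", 1.0), ("scaled by 1/sqrt(d_k)", 1.0 / np.sqrt(d_k))]:
    stds, max_w, ents = [], [], []
    for _ in range(n_queries):
        z = (K @ rng.normal(size=d_k)) * scale     # attention logits for one random query
        p = np.exp(z - z.max()); p /= p.sum()
        stds.append(z.std())
        max_w.append(p.max())
        ents.append(-(p * np.log(p + 1e-300)).sum())   # entropy; epsilon guards log(0)
    print(f"{name:22s}  logit std ~ {np.mean(stds):5.2f}   max weight ~ {np.mean(max_w):.3f}   "
          f"entropy ~ {np.mean(ents):.2f} (max {np.log(T):.2f})")
```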
K.2 Concentration in Residual Networks
Deep residual networks accumulate residual perturbations: $h_L = h_0 + \sum_{\ell=1}^{L} f_\ell(h_{\ell-1})$. If each residual block has bounded output $\|f_\ell(h_{\ell-1})\| \le \epsilon$, then by Azuma's inequality applied to the martingale of partial sums:
$$\mathbb{P}\big(\|h_L - h_0 - \mathbb{E}[h_L - h_0]\| \ge t\big) \le 2\exp\!\left(-\frac{t^2}{2L\epsilon^2}\right).$$
This says: even with $L$ layers, if each residual block adds a small perturbation ($\le \epsilon$), the total drift around its mean is $O(\epsilon\sqrt{L})$, not $O(\epsilon L)$. LayerNorm and weight decay enforce exactly this: they keep $\|f_\ell\|$ small, ensuring the representation doesn't drift exponentially with depth.
K.3 LoRA and Low-Rank Concentration
LoRA (Low-Rank Adaptation) approximates weight updates as $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with small rank $r \ll \min(d, k)$. The resulting hypothesis class has reduced complexity:
Rademacher complexity of LoRA. For the function class of norm-bounded, rank-$r$ updates $\{x \mapsto (W_0 + BA)x\}$, the Rademacher complexity scales with the number of trainable parameters $r(d + k)$ and the norm bound on $BA$.
The key: the complexity grows like $r(d + k)$, not $dk$ (full fine-tuning). With $r \ll \min(d, k)$, LoRA has a provably smaller generalisation gap than full fine-tuning, explaining why it doesn't overfit despite being used on small datasets.
Appendix L: Practical Guidelines for ML Practitioners
L.1 When to Report Confidence Intervals
Always report when:
- Sample size (Hoeffding CI is meaningful)
- Comparing two systems (the CI overlap tells you if the difference is significant)
- Benchmarking LLMs on standard evaluations
Minimum information to report: point estimate, sample size, CI method (Hoeffding vs Normal approximation), confidence level.
L.2 Common Benchmarking Mistakes
| Mistake | Statistical Problem | Fix |
|---|---|---|
| "Model A (73.2%) > Model B (72.8%) on 1000 questions" | 95% CI is +/-4.2% - difference insignificant | Need for +/-1% CI |
| "Best of 10 model variants is significant" | Multiple comparison: false positive rate | Bonferroni: each test at |
| "Averaging over 5 datasets gives reliable comparison" | Each dataset is a separate test | Report CI for each; meta-analysis requires care |
| "Our method wins on 8/10 tasks" | Sign test: need for | Use paired t-test or Wilcoxon signed-rank |
| "1000-epoch training curve shows improvement" | Training curve is not test performance | Report test error at convergence with CI |
L.3 Required Sample Sizes (Reference Table)
For binary outcomes (accuracy), using Hoeffding with 95% confidence ($\delta = 0.05$):
| Desired precision ($\epsilon$) | Required $n$ |
|---|---|
| +/-10% (0.10) | 185 |
| +/-5% (0.05) | 738 |
| +/-3% (0.03) | 2,056 |
| +/-2% (0.02) | 4,626 |
| +/-1% (0.01) | 18,445 |
| +/-0.5% (0.005) | 73,779 |
Note: These are conservative (distribution-free). The Normal approximation gives roughly half as many, valid when $n$ is large and $\hat p$ is not too close to 0 or 1.
L.4 Interpreting Generalisation Bounds for Deep Learning
Current theoretical bounds are often loose for deep networks. Practitioners should interpret them as:
- Order of magnitude guidance: A bound of correctly predicts that more data helps linearly, but may overestimate the absolute gap by 10-100x
- Relative comparisons: PAC-Bayes bounds correctly rank models by generalisation risk even when absolute values are imprecise
- Design principles: Results like "norm-constrained models generalise better" translate directly to regularisation practices (weight decay, gradient clipping, dropout)
- Not absolute guarantees: A theoretical bound of 40% error doesn't mean the model has 40% test error - it means we can't rule out 40% test error from the theory alone
The theory is most useful as a framework for thinking, not as a numerical oracle.
Appendix M: References and Further Reading
Primary Sources
- Boucheron, Lugosi & Massart (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press. - The definitive modern treatment.
- Hoeffding, W. (1963). Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 58(301), 13-30.
- McDiarmid, C. (1989). On the Method of Bounded Differences. Surveys in Combinatorics, 141, 148-188.
- Vapnik, V. & Chervonenkis, A. (1971). On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Probability and Its Applications, 16(2), 264-280.
- Valiant, L. (1984). A Theory of the Learnable. Communications of the ACM, 27(11), 1134-1142.
Learning Theory Textbooks
- Shalev-Shwartz & Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. - Best introduction to PAC learning for ML practitioners.
- Mohri, Rostamizadeh & Talwalkar (2018). Foundations of Machine Learning (2nd ed.). MIT Press. - Covers Rademacher complexity and generalisation theory in depth.
- Wainwright, M. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. - Advanced treatment, covers matrix concentration and modern techniques.
Deep Learning Theory
- Zhang, Bengio et al. (2017). Understanding Deep Learning Requires Rethinking Generalization. ICLR 2017. - Showed memorisation capacity, challenging classical VC theory.
- Dziugaite & Roy (2017). Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. UAI 2017. - First non-vacuous PAC-Bayes bounds for deep nets.
- Belkin et al. (2019). Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off. PNAS 2019. - Double descent, overparameterisation.
- Foret et al. (2021). Sharpness-Aware Minimization for Efficiently Improving Generalization. ICLR 2021. - SAM as PAC-Bayes operationalisation.
Appendix N: Extended Worked Problems
N.1 Bounding the Maximum of Random Variables
Problem. Let $X_1, \dots, X_n \sim \mathcal{N}(0, 1)$ i.i.d. What is $\max_{i \le n} X_i$ approximately?
Answer. Each $X_i$ is 1-sub-Gaussian. By the union bound:
$$\mathbb{P}\Big(\max_{i \le n} X_i \ge t\Big) \le n\,e^{-t^2/2}.$$
Setting $n\,e^{-t^2/2} = \delta$: $t = \sqrt{2\log(n/\delta)}$. So $\max_i X_i \le \sqrt{2\log(n/\delta)}$ with high probability.
More precisely, $\mathbb{E}\big[\max_i X_i\big] \approx \sqrt{2\log n}$ for large $n$ (the leading-order term), matching Monte Carlo estimates up to lower-order corrections.
For AI: The maximum attention logit in a head with $T$ tokens and $d_k$-dimensional keys satisfies $\max_j q^\top k_j/\sqrt{d_k} \approx \sqrt{2\log T}$. The softmax of this concentrates on one or a few tokens for large $T$ - explaining why attention becomes "sharp" (spiky) in layers and heads where the logit scale is large.
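A Monte Carlo check of the $\sqrt{2\log n}$ approximation (a sketch; the values of $n$ and the repetition count are illustrative). The estimate sits slightly below $\sqrt{2\log n}$ because of lower-order corrections:

```python
import numpy as np

rng = np.random.default_rng(8)
for n in [10**3, 10**4, 10**5, 10**6]:
    maxima = [rng.normal(size=n).max() for _ in range(50)]   # 50 Monte Carlo repetitions
    print(f"n = {n:>9,d}   Monte Carlo E[max] ~ {np.mean(maxima):.3f}   "
          f"sqrt(2 log n) = {np.sqrt(2 * np.log(n)):.3f}")
```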
N.2 Symmetric Rounding
Problem. An algorithm rounds each of $n$ numbers independently up or down by at most 0.5. What is the probability the total rounding error exceeds 10?
Solution. Each error $\epsilon_i \in [-0.5, 0.5]$ with $\mathbb{E}[\epsilon_i] = 0$. The sum $S = \sum_{i=1}^n \epsilon_i$ has $\mathbb{E}[S] = 0$, so by Hoeffding:
$$\mathbb{P}(S \ge 10) \le \exp\!\left(-\frac{2 \cdot 10^2}{n \cdot 1^2}\right) = e^{-200/n}.$$
The two-sided version: $\mathbb{P}(|S| \ge 10) \le 2e^{-200/n}$.
N.3 PAC Bound for a Neural Network Classifier
Problem. A student trains a 3-layer neural network on examples and achieves 2% training error. Using the finite-class PAC bound with (very conservatively, assuming 1-bit per parameter for parameters), what does the bound say about test error?
Solution. The PAC bound (agnostic, finite class):
$$R(\hat h) \le \hat R_n(\hat h) + \sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2n}}.$$
With $|\mathcal{H}| = 2^{W}$ for $W$ parameters (so $\log|\mathcal{H}| = W\log 2$) and $n$ training examples:
$$R(\hat h) \le 0.02 + \sqrt{\frac{W\log 2 + \log(2/\delta)}{2n}}.$$
Test error bound: since $W \gg n$ for modern networks, the complexity term exceeds 1 - a completely vacuous bound (exceeds 100%). This illustrates why VC/parameter-counting bounds are useless for deep networks. PAC-Bayes or Rademacher complexity (using norms rather than parameter count) gives non-vacuous results.
N.4 Proving Generalisation for 1-Nearest Neighbour
Problem. For -nearest neighbour classification with : (a) Show that the leave-one-out error is an unbiased estimate of the true error. (b) Use McDiarmid to bound the variance of the generalisation error around the leave-one-out estimate.
Solution. (a) The leave-one-out (LOO) error is $\hat R_{\mathrm{LOO}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{\hat y_{-i}(x_i) \ne y_i\}$, where $\hat y_{-i}(x_i)$ is the prediction on $x_i$ using all other training points. By symmetry of the i.i.d. draw, $\mathbb{E}[\hat R_{\mathrm{LOO}}] = R_{n-1}$ (the true error of 1-NN trained on $n - 1$ examples).
(b) Swapping one training example changes $\hat R_{\mathrm{LOO}}$ by at most $c_i = O(1/n)$: it affects its own LOO prediction and the LOO predictions of the points whose nearest neighbour it was - at most 2 on the real line, and at most a dimension-dependent constant in general. By McDiarmid:
$$\mathbb{P}\big(|\hat R_{\mathrm{LOO}} - \mathbb{E}[\hat R_{\mathrm{LOO}}]| \ge t\big) \le 2\exp\!\big(-c\,n t^2\big)$$
for a constant $c$ depending only on the dimension.
N.5 Confidence Region for Gradient Descent Convergence
Problem. SGD is run for $T$ iterations with step size $\eta$. Each gradient estimate is sub-Gaussian with parameter $\sigma^2$. With what probability does SGD achieve $f(\bar\theta_T) - f^* \le \epsilon$?
Solution. Standard (in-expectation) SGD analysis shows:
$$\mathbb{E}[f(\bar\theta_T)] - f^* \le \frac{\|\theta_0 - \theta^*\|^2}{2\eta T} + \frac{\eta\sigma^2}{2},$$
where $\bar\theta_T = \frac{1}{T}\sum_{t=1}^{T}\theta_t$ is the average iterate.
For the high-probability version: at each step $t$, the gradient noise is $O\big(\sigma\sqrt{\log(T/\delta)}\big)$ by sub-Gaussianity (reduced by a factor $\sqrt{b}$ for batch size $b$). Taking a union bound over all $T$ steps and using the Azuma-type argument for the algorithm trajectory gives $f(\bar\theta_T) - f^* \le \epsilon$ with probability at least $1 - \delta$
when $T = O\big(\epsilon^{-2}\log(1/\delta)\big)$ and $\eta \propto 1/\sqrt{T}$.