Concentration Inequalities, Part 1: Sections 1 (Intuition) to 10 (Rademacher Complexity)

1. Intuition

1.1 What Is Concentration?

A random variable XX has a mean μ=E[X]\mu = \mathbb{E}[X]. But knowing the mean alone tells us nothing about how a single draw is distributed around it. Concentration inequalities fill this gap by bounding the probability that XX deviates far from μ\mu.

The fundamental question is: given what we know about XX (its mean, variance, boundedness, or MGF), what is the tightest bound we can place on P(Xμt)P(|X - \mu| \geq t)?

This matters deeply in practice. In machine learning, we observe the empirical mean of a loss function on training data and want to know how well it approximates the true (population) mean. If X1,,XnX_1, \ldots, X_n are i.i.d. losses with mean μ\mu and we observe Xˉn=1nXi\bar{X}_n = \frac{1}{n}\sum X_i, concentration inequalities tell us how close Xˉn\bar{X}_n is to μ\mu with high probability. This is the foundation of statistical learning theory.

The sample mean concentrates around the true mean. As nn grows, the deviation Xˉnμ|\bar{X}_n - \mu| becomes increasingly small with high probability. This is the content of the Law of Large Numbers - but concentration inequalities give quantitative, finite-nn bounds, not just asymptotic guarantees.

For AI: Every evaluation metric in ML (accuracy, F1, BLEU, ROUGE, perplexity) is estimated from a finite test set. Concentration inequalities tell us how many test examples we need to trust the estimate. HELM benchmarks for LLMs, for instance, implicitly rely on Hoeffding-type bounds for the confidence intervals around model comparisons.
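A quick simulation makes the shrinking deviation concrete. This is a sketch using NumPy; the "loss" here is a stand-in Bernoulli variable with true mean 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.5  # true mean of a Bernoulli(0.5) stand-in loss

def deviation(n):
    """|X_bar_n - mu| for one draw of n i.i.d. Bernoulli losses."""
    return abs(rng.binomial(1, mu, size=n).mean() - mu)

# Deviations shrink roughly like 1/sqrt(n) as the sample grows.
devs = {n: deviation(n) for n in (100, 10_000, 1_000_000)}
print(devs)
```

Concentration inequalities quantify exactly how fast this shrinkage happens for finite n.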

1.2 The Hierarchy of Bounds

Concentration inequalities form a clear hierarchy. Moving down requires stronger assumptions but yields exponentially tighter bounds:

CONCENTRATION INEQUALITY HIERARCHY
========================================================================

  Assumption               Inequality        Tail bound form
  ----------------------------------------------------------------------
  E[X] < \infty, X >= 0    Markov            P(X >= t) <= \mu/t
  (polynomial decay)

  Var(X) < \infty          Chebyshev         P(|X-\mu| >= t) <= \sigma^2/t^2
  (polynomial decay)

  E[e^{sX}] < \infty       Chernoff method   P(X >= t) <= min_s e^{-st} M(s)
  (general exponential)

  X in [a,b] a.s.          Hoeffding         P(\bar{X}-\mu >= t) <= exp(-2nt^2/(b-a)^2)
  (sub-Gaussian)           (exponential decay)

  Var known + bounded      Bernstein         exp(-nt^2/(2\sigma^2 + 2ct/3))
  (variance-aware)         (tighter for small t)

  f(X_1,...,X_n) with      McDiarmid         exp(-2t^2/\sum_i c_i^2)
  bounded differences      (functions of i.i.d. variables)

========================================================================

The key insight: moment-based bounds (Markov, Chebyshev) give polynomial tails O(1/tk)O(1/t^k), while MGF-based bounds give exponential tails O(ect2)O(e^{-ct^2}). Exponential tails are vastly superior - a Gaussian has P(X3σ)0.003P(|X| \geq 3\sigma) \approx 0.003, while Chebyshev gives only 1/90.111/9 \approx 0.11.
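The gap between polynomial and exponential tails is easy to tabulate. A sketch comparing Chebyshev's distribution-free bound with the exact Gaussian tail (via `math.erfc`):

```python
import math

def chebyshev_two_sided(k):
    """Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2 for any finite-variance X."""
    return 1.0 / k**2

def gaussian_two_sided(k):
    """Exact two-sided tail for N(0, 1): P(|X| >= k) = erfc(k / sqrt(2))."""
    return math.erfc(k / math.sqrt(2))

for k in (2, 3, 5):
    print(k, chebyshev_two_sided(k), gaussian_two_sided(k))
```

At k = 3 the Chebyshev bound is roughly 40x looser than the Gaussian truth, and the gap widens rapidly with k.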

1.3 Historical Timeline

Year | Person | Contribution
1867 | Chebyshev | Proved the inequality bearing his name; used it to prove the weak LLN
1899 | Markov (student of Chebyshev) | Simpler proof via indicator argument; Markov's inequality
1952 | Herman Chernoff | MGF-based exponential tail bounds for sums of independent variables
1963 | Wassily Hoeffding | Tight bounds for bounded variables; modern form used everywhere
1971 | Vapnik & Chervonenkis | VC dimension theory; combinatorial approach to generalisation
1984 | Leslie Valiant | PAC learning framework - theoretical model for machine learning
1989 | Colin McDiarmid | Bounded differences inequality for general functions of independent variables
2002 | Bartlett & Mendelson | Rademacher complexity - data-dependent alternative to VC theory
2013 | Boucheron, Lugosi & Massart | Comprehensive monograph unifying modern concentration theory

1.4 Why Concentration Matters for ML

Generalisation: A neural network achieves 99% training accuracy. Does it generalise? Concentration inequalities bound R(h)R^(h)|R(h) - \hat{R}(h)| - the gap between true and empirical risk. They tell us when we can trust training metrics.

SGD analysis: Stochastic gradient descent uses a mini-batch estimate g^\hat{g} of the true gradient gg. Concentration inequalities bound g^g2\|\hat{g} - g\|_2, determining the noise level and required step sizes for convergence.

Confidence intervals: When evaluating an LLM on 1000 benchmark questions with accuracy 73.2%, concentration bounds tell us the confidence interval around this estimate.

Random features: The Rahimi-Recht random features method approximates kernel matrices using random projections. How many random features are needed? The answer is a Hoeffding bound.

Dropout: Dropout randomly zeroes activations. The kept activations are bounded, so McDiarmid's inequality bounds the output variance.


2. Markov's and Chebyshev's Inequalities

2.1 Markov's Inequality

Theorem (Markov's Inequality). Let X0X \geq 0 be a non-negative random variable with E[X]<\mathbb{E}[X] < \infty. For any t>0t > 0:

P(Xt)E[X]tP(X \geq t) \leq \frac{\mathbb{E}[X]}{t}

Proof. This follows immediately from the indicator 1[Xt]X/t\mathbf{1}[X \geq t] \leq X/t for X0X \geq 0, t>0t > 0:

P(Xt)=E[1[Xt]]E ⁣[Xt]=E[X]tP(X \geq t) = \mathbb{E}[\mathbf{1}[X \geq t]] \leq \mathbb{E}\!\left[\frac{X}{t}\right] = \frac{\mathbb{E}[X]}{t}

This proof is a master class in the indicator trick: write the probability as the expectation of an indicator, then bound the indicator.

Tightness. Markov's bound is tight. Consider X=tX = t with probability μ/t\mu/t and X=0X = 0 otherwise. Then E[X]=μ\mathbb{E}[X] = \mu and P(Xt)=μ/tP(X \geq t) = \mu/t - exactly matching the bound.

Extensions. Markov applies to any non-negative function of XX: for any non-decreasing ϕ0\phi \geq 0,

P(Xt)=P(ϕ(X)ϕ(t))E[ϕ(X)]ϕ(t)P(X \geq t) = P(\phi(X) \geq \phi(t)) \leq \frac{\mathbb{E}[\phi(X)]}{\phi(t)}

Setting ϕ(x)=xk\phi(x) = x^k gives the kk-th moment bound: P(Xt)E[Xk]/tkP(X \geq t) \leq \mathbb{E}[X^k]/t^k. Setting ϕ(x)=esx\phi(x) = e^{sx} gives the Chernoff method.

For AI: Markov's inequality underlies gradient clipping analysis. If E[g2]G\mathbb{E}[\|g\|_2] \leq G, then P(g2cG)1/cP(\|g\|_2 \geq cG) \leq 1/c - so gradient norms exceed 10G10G at most 10% of steps.
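A quick empirical check that Markov's bound holds. This is a sketch with NumPy; the exponential distribution stands in for non-negative gradient-norm draws:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)  # non-negative, E[X] = 1

# Markov: P(X >= t) <= E[X]/t. The empirical tail always obeys the bound,
# though for light-tailed data it is very loose (true tail here is e^{-t}).
for t in (2.0, 5.0, 10.0):
    print(t, (x >= t).mean(), 1.0 / t)
```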

2.2 Chebyshev's Inequality

Theorem (Chebyshev's Inequality). For any random variable XX with E[X]=μ\mathbb{E}[X] = \mu and Var(X)=σ2<\operatorname{Var}(X) = \sigma^2 < \infty, for any t>0t > 0:

P(Xμt)σ2t2P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}

Equivalently, setting t=kσt = k\sigma: P(Xμkσ)1k2P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}.

Proof. Apply Markov to the non-negative variable Y=(Xμ)2Y = (X - \mu)^2:

P(Xμt)=P((Xμ)2t2)E[(Xμ)2]t2=σ2t2P(|X - \mu| \geq t) = P((X-\mu)^2 \geq t^2) \leq \frac{\mathbb{E}[(X-\mu)^2]}{t^2} = \frac{\sigma^2}{t^2}

The kk-sigma rule. Chebyshev guarantees: at least 11/k21 - 1/k^2 of probability mass lies within kσk\sigma of the mean. For any distribution with finite variance:

k | Chebyshev bound | Gaussian actual
2 | >= 75% within 2\sigma | ~95.4%
3 | >= 88.9% within 3\sigma | ~99.7%
4 | >= 93.75% within 4\sigma | ~99.994%
5 | >= 96% within 5\sigma | ~99.99994%

For AI: Chebyshev justifies batch normalisation. If activations have Var(X)=σ2\operatorname{Var}(X) = \sigma^2, then at most 1/k21/k^2 of activations lie beyond kσk\sigma - this quantifies the scale-normalisation benefit.

Recall: The variance \operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 and its properties were developed fully in Section 04, Expectation and Moments. We use it here as a plug-in parameter.

2.3 One-Sided Chebyshev (Cantelli's Inequality)

Chebyshev bounds both tails symmetrically. Cantelli's inequality gives a tighter one-sided bound:

Theorem (Cantelli). For any t>0t > 0:

P(Xμt)σ2σ2+t2P(X - \mu \geq t) \leq \frac{\sigma^2}{\sigma^2 + t^2}

Proof. For any s>0s > 0, by Markov applied to (Xμ+s)2(X - \mu + s)^2:

P(Xμt)=P(Xμ+st+s)E[(Xμ+s)2](t+s)2=σ2+s2(t+s)2P(X - \mu \geq t) = P(X - \mu + s \geq t + s) \leq \frac{\mathbb{E}[(X-\mu+s)^2]}{(t+s)^2} = \frac{\sigma^2 + s^2}{(t+s)^2}

Optimise over ss: /s=0\partial/\partial s = 0 gives s=σ2/ts = \sigma^2/t, yielding σ2/(σ2+t2)\sigma^2/(\sigma^2 + t^2).

For large t: Chebyshev gives \sigma^2/t^2 and Cantelli gives \sigma^2/(\sigma^2 + t^2) \approx \sigma^2/t^2 - essentially the same. For small t the difference matters: Chebyshev's bound exceeds 1 (vacuous) once t < \sigma, while Cantelli's never does, and for a single tail Cantelli is always at least as tight, since \sigma^2/(\sigma^2+t^2) \leq \sigma^2/t^2.
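A minimal sketch comparing the two one-line formulas:

```python
def cantelli(sigma2, t):
    """One-sided Cantelli: P(X - mu >= t) <= sigma^2 / (sigma^2 + t^2)."""
    return sigma2 / (sigma2 + t**2)

def chebyshev(sigma2, t):
    """Two-sided Chebyshev: P(|X - mu| >= t) <= sigma^2 / t^2 (may exceed 1)."""
    return sigma2 / t**2

# At t = sigma the Chebyshev bound is vacuous (= 1); Cantelli still gives 1/2.
print(cantelli(1.0, 1.0), chebyshev(1.0, 1.0))
```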

2.4 Limitations of Moment-Based Bounds

Markov and Chebyshev have polynomial tails. The Gaussian has one-sided tail P(X \geq 3\sigma) \approx 0.0013, but Chebyshev gives only 1/9 \approx 0.111 - roughly two orders of magnitude too pessimistic. For t = 10\sigma, Chebyshev gives 0.01, while the true Gaussian tail probability is \approx 10^{-23}.

Why the gap? Chebyshev uses only the first two moments. A distribution could concentrate all its mass at μ±t\mu \pm t and match any (μ,σ2)(\mu, \sigma^2) pair - that's the worst case for Chebyshev. Real distributions with bounded support or thin tails concentrate far more.

When Chebyshev is the right tool:

  • Distribution-free bounds (don't know the family)
  • Heavy-tailed distributions (Pareto, tt-distribution with low df)
  • Quick estimates without assuming sub-Gaussianity

3. Sub-Gaussian Random Variables

3.1 Definition and MGF Condition

Sub-Gaussian random variables are those whose tails decay at least as fast as a Gaussian. The key condition is on the MGF.

Definition. A mean-zero random variable XX is σ2\sigma^2-sub-Gaussian if for all tRt \in \mathbb{R}:

E[etX]eσ2t2/2\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2}

The parameter σ2\sigma^2 is the sub-Gaussian parameter (also called the proxy variance). Note: σ2\sigma^2 need not equal Var(X)\operatorname{Var}(X), though Var(X)σ2\operatorname{Var}(X) \leq \sigma^2 always holds (by differentiation at t=0t=0).

Why this condition? Recall from Section 04 that the MGF of \mathcal{N}(0, \sigma^2) is exactly e^{\sigma^2 t^2/2}. The sub-Gaussian condition says the MGF of X is dominated by that of a Gaussian with variance \sigma^2. This immediately implies Gaussian-like tail bounds.

Equivalent characterisations (for mean-zero XX):

  1. MGF condition: E[etX]eσ2t2/2\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} for all tt
  2. Tail condition: P(Xt)2et2/(2σ2)P(|X| \geq t) \leq 2e^{-t^2/(2\sigma^2)} for all t0t \geq 0
  3. Moment condition: E[X2k](2k1)!!σ2k\mathbb{E}[X^{2k}] \leq (2k-1)!! \cdot \sigma^{2k} for all k1k \geq 1

3.2 Examples of Sub-Gaussian Variables

Gaussian. XN(0,σ2)X \sim \mathcal{N}(0, \sigma^2) is σ2\sigma^2-sub-Gaussian. The MGF equals eσ2t2/2e^{\sigma^2 t^2/2} exactly - Gaussian is the equality case.

Bounded random variable. If X \in [a, b] almost surely with \mathbb{E}[X] = \mu, then X - \mu is \frac{(b-a)^2}{4}-sub-Gaussian. This is Hoeffding's lemma - proved in Section 4.

Rademacher variable. \varepsilon \in \{-1, +1\} with equal probability. Then \mathbb{E}[e^{t\varepsilon}] = \cosh(t) \leq e^{t^2/2}, so \varepsilon is 1-sub-Gaussian. Rademacher variables appear in Rademacher complexity (Section 10).

Bernoulli centered. X=Bern(p)p{p,1p}X = \operatorname{Bern}(p) - p \in \{-p, 1-p\}. Since X[p,1p]X \in [-p, 1-p], it is 14\frac{1}{4}-sub-Gaussian by Hoeffding's lemma.

Non-example. Heavy-tailed distributions are not sub-Gaussian. A t-distribution has \mathbb{E}[e^{tX}] = \infty for every t \neq 0, regardless of its degrees of freedom - its tails are polynomial. A Pareto distribution is likewise not sub-Gaussian for any parameters.
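The Rademacher claim \cosh(t) \leq e^{t^2/2} can be checked numerically. A sketch over a grid of t values:

```python
import math

# Rademacher MGF: E[e^{t*eps}] = (e^t + e^{-t})/2 = cosh(t).
# Verify the 1-sub-Gaussian domination cosh(t) <= exp(t^2/2) on a grid.
ts = [i / 10 for i in range(-50, 51)]
assert all(math.cosh(t) <= math.exp(t * t / 2) + 1e-12 for t in ts)
print("cosh(t) <= exp(t^2/2) holds on the grid")
```

(The inequality is exact for all real t: compare the Taylor coefficients t^{2k}/(2k)! versus t^{2k}/(2^k k!).)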

3.3 The Sub-Gaussian Tail Bound

Theorem. If XX is σ2\sigma^2-sub-Gaussian with E[X]=0\mathbb{E}[X] = 0, then for all t0t \geq 0:

P(Xt)et2/(2σ2)P(X \geq t) \leq e^{-t^2/(2\sigma^2)}

Proof (Chernoff method). For any s>0s > 0:

P(Xt)=P(esXest)E[esX]esteσ2s2/2est=eσ2s2/2stP(X \geq t) = P(e^{sX} \geq e^{st}) \leq \frac{\mathbb{E}[e^{sX}]}{e^{st}} \leq \frac{e^{\sigma^2 s^2/2}}{e^{st}} = e^{\sigma^2 s^2/2 - st}

Optimise over ss: /s(σ2s2/2st)=0s=t/σ2\partial/\partial s(\sigma^2 s^2/2 - st) = 0 \Rightarrow s^* = t/\sigma^2. Substituting:

P(Xt)eσ2(t/σ2)2/2tt/σ2=et2/(2σ2)t2/σ2=et2/(2σ2)P(X \geq t) \leq e^{\sigma^2(t/\sigma^2)^2/2 - t \cdot t/\sigma^2} = e^{t^2/(2\sigma^2) - t^2/\sigma^2} = e^{-t^2/(2\sigma^2)}

This proof pattern - Markov + MGF bound + optimise - is the Chernoff method and will recur throughout this section.
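The optimisation step can be verified numerically. A sketch with assumed values \sigma^2 = 2, t = 3:

```python
import math

sigma2, t = 2.0, 3.0
s_star = t / sigma2  # claimed minimiser of sigma^2 s^2/2 - s t

bound = math.exp(sigma2 * s_star**2 / 2 - s_star * t)
# At s*, the Chernoff bound equals exp(-t^2 / (2 sigma^2)).
assert abs(bound - math.exp(-t**2 / (2 * sigma2))) < 1e-12

# Grid search: no other s does better (the exponent is convex in s).
grid = [i / 100 for i in range(1, 500)]
assert all(math.exp(sigma2 * s * s / 2 - s * t) >= bound - 1e-12 for s in grid)
print(bound)
```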

3.4 Closure Under Sums

Theorem. If X1,,XnX_1, \ldots, X_n are independent with XiX_i being σi2\sigma_i^2-sub-Gaussian and E[Xi]=0\mathbb{E}[X_i] = 0, then S=iXiS = \sum_i X_i is (iσi2)(\sum_i \sigma_i^2)-sub-Gaussian.

Proof. By independence:

E[etS]=i=1nE[etXi]i=1neσi2t2/2=e(iσi2)t2/2\mathbb{E}[e^{tS}] = \prod_{i=1}^n \mathbb{E}[e^{tX_i}] \leq \prod_{i=1}^n e^{\sigma_i^2 t^2/2} = e^{(\sum_i \sigma_i^2)t^2/2}

Corollary. For nn i.i.d. σ2\sigma^2-sub-Gaussian variables, the sample mean Xˉn=S/n\bar{X}_n = S/n satisfies P(Xˉnt)ent2/(2σ2)P(\bar{X}_n \geq t) \leq e^{-nt^2/(2\sigma^2)}.

For AI: Mini-batch gradient estimation. If each sample's contribution to the gradient is sub-Gaussian with parameter σ2\sigma^2, then the mini-batch gradient of size mm is σ2/m\sigma^2/m-sub-Gaussian - the noise decreases as 1/m1/m.


4. Hoeffding's Inequality

4.1 Hoeffding's Lemma

Hoeffding's lemma is the core technical ingredient: it establishes that any bounded, mean-zero random variable is sub-Gaussian.

Lemma (Hoeffding, 1963). Let XX be a random variable with E[X]=0\mathbb{E}[X] = 0 and X[a,b]X \in [a, b] almost surely. Then for all tRt \in \mathbb{R}:

E[etX]exp ⁣(t2(ba)28)\mathbb{E}[e^{tX}] \leq \exp\!\left(\frac{t^2(b-a)^2}{8}\right)

In other words, XX is (ba)24\frac{(b-a)^2}{4}-sub-Gaussian.

Proof sketch. Since [a,b][a, b] is a bounded interval and etxe^{tx} is convex in xx, we can bound etxe^{tx} by the chord from (a,eta)(a, e^{ta}) to (b,etb)(b, e^{tb}):

etxbxbaeta+xabaetbe^{tx} \leq \frac{b-x}{b-a} e^{ta} + \frac{x-a}{b-a} e^{tb}

Taking expectations (using E[X]=0\mathbb{E}[X] = 0, so E[(bX)/(ba)]=b/(ba)\mathbb{E}[(b-X)/(b-a)] = b/(b-a) and E[(Xa)/(ba)]=a/(ba)\mathbb{E}[(X-a)/(b-a)] = -a/(b-a)):

E[etX]bbaetaabaetb=eg(u)\mathbb{E}[e^{tX}] \leq \frac{b}{b-a} e^{ta} - \frac{a}{b-a} e^{tb} = e^{g(u)}

where u=t(ba)u = t(b-a) and g(u)=pu+log(1p+peu)g(u) = -pu + \log(1 - p + pe^u) with p=a/(ba)p = -a/(b-a). Expanding gg via Taylor series around u=0u=0 and bounding g(u)1/4g''(u) \leq 1/4 (achieved at p=1/2p = 1/2) gives g(u)u2/8=t2(ba)2/8g(u) \leq u^2/8 = t^2(b-a)^2/8.

Why (ba)2/8(b-a)^2/8? The factor 1/81/8 comes from the worst case p=1/2p = 1/2 (centred in the interval). For a {0,1}\{0,1\} Bernoulli minus its mean, ba=1b - a = 1 and the sub-Gaussian parameter is 1/41/4.
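Hoeffding's lemma can be spot-checked for a centred Bernoulli, whose MGF has a closed form. A sketch; p = 0.3 is an arbitrary choice:

```python
import math

p = 0.3  # X = Bern(p) - p takes values -p and 1-p, so b - a = 1

def mgf(t):
    """E[e^{tX}] for the centred Bernoulli."""
    return (1 - p) * math.exp(-t * p) + p * math.exp(t * (1 - p))

# Hoeffding's lemma: MGF dominated by exp(t^2 (b-a)^2 / 8) = exp(t^2 / 8).
for t in (-3.0, -1.0, 0.5, 2.0):
    assert mgf(t) <= math.exp(t * t / 8)
print("Hoeffding's lemma verified on sample points")
```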

4.2 Hoeffding's Inequality

Theorem (Hoeffding, 1963). Let X1,,XnX_1, \ldots, X_n be independent random variables with E[Xi]=μi\mathbb{E}[X_i] = \mu_i and Xi[ai,bi]X_i \in [a_i, b_i] almost surely. Then for all t>0t > 0:

P ⁣(i=1n(Xiμi)t)exp ⁣(2t2i=1n(biai)2)P\!\left(\sum_{i=1}^n (X_i - \mu_i) \geq t\right) \leq \exp\!\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)

Proof. Let Y_i = X_i - \mu_i (mean-zero, Y_i \in [a_i - \mu_i, b_i - \mu_i]). By Hoeffding's lemma, each Y_i is \frac{(b_i - a_i)^2}{4}-sub-Gaussian. By independence and closure under sums (Section 3.4), \sum Y_i is \frac{\sum(b_i-a_i)^2}{4}-sub-Gaussian. Applying the sub-Gaussian tail bound:

P ⁣(Yit)exp ⁣(t22(biai)24)=exp ⁣(2t2(biai)2)P\!\left(\sum Y_i \geq t\right) \leq \exp\!\left(-\frac{t^2}{2 \cdot \frac{\sum(b_i-a_i)^2}{4}}\right) = \exp\!\left(-\frac{2t^2}{\sum(b_i-a_i)^2}\right)

4.3 Two-Sided Hoeffding and Sample Complexity

For i.i.d. Xi[a,b]X_i \in [a, b] with mean μ\mu, the two-sided version follows by symmetry (apply one-sided to XX and X-X):

P(Xˉnμt)2exp ⁣(2nt2(ba)2)P(|\bar{X}_n - \mu| \geq t) \leq 2\exp\!\left(-\frac{2nt^2}{(b-a)^2}\right)

Reading the bound: The probability of a large deviation decays exponentially in nn and t2t^2. Doubling the sample size squares the exponential factor. Halving the allowed deviation quadruples the required nn.

4.4 Required Sample Size

One of the most practically useful results: how large must nn be to guarantee Xˉnμε|\bar{X}_n - \mu| \leq \varepsilon with probability at least 1δ1 - \delta?

Setting 2exp(2nt2/(ba)2)δ2\exp(-2nt^2/(b-a)^2) \leq \delta and solving for nn:

n(ba)2log(2/δ)2ε2n \geq \frac{(b-a)^2 \log(2/\delta)}{2\varepsilon^2}

Example. Coin flips (Xi{0,1}X_i \in \{0,1\}, ba=1b - a = 1): to achieve ε=0.01\varepsilon = 0.01 accuracy with δ=0.05\delta = 0.05 confidence:

nlog(40)0.000218,444n \geq \frac{\log(40)}{0.0002} \approx 18{,}444

So roughly 18,500 coin flips are needed to estimate a probability to within ±0.01\pm 0.01 with 95% confidence.
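The required-n formula is a one-liner. A sketch; `hoeffding_n` is a name introduced here:

```python
import math

def hoeffding_n(eps, delta, a=0.0, b=1.0):
    """Smallest n with 2 exp(-2 n eps^2 / (b - a)^2) <= delta."""
    return math.ceil((b - a)**2 * math.log(2 / delta) / (2 * eps**2))

print(hoeffding_n(0.01, 0.05))  # 18445 coin flips
```

Note the quadratic cost in accuracy: halving eps from 0.02 to 0.01 quadruples the required n.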

For AI: Benchmark evaluation. To measure LLM accuracy to within \pm 1\% at 95% confidence on a binary task (b-a=1), Hoeffding requires n \geq \log(40)/(2 \cdot 0.01^2) \approx 18{,}445 examples. This explains why serious LLM evaluations (MMLU, HumanEval, BIG-bench) use thousands of questions.


5. Chernoff Bounds

5.1 The Chernoff Method

The Chernoff method is a general technique for deriving exponential tail bounds using the MGF. It is more powerful than Hoeffding when the MGF has a known closed form.

The Chernoff method. For any random variable XX and t>0t > 0:

P(Xa)infs>0esaE[esX]=infs>0esaMX(s)P(X \geq a) \leq \inf_{s > 0} e^{-sa} \cdot \mathbb{E}[e^{sX}] = \inf_{s > 0} e^{-sa} M_X(s)

Proof. For any s>0s > 0: P(Xa)=P(esXesa)esaE[esX]P(X \geq a) = P(e^{sX} \geq e^{sa}) \leq e^{-sa} \mathbb{E}[e^{sX}] by Markov. Optimise over ss.

The infimum over ss gives the tightest bound. The optimal ss^* satisfies MX(s)/MX(s)=aM_X'(s^*)/M_X(s^*) = a, i.e., the mean of XX under the tilted distribution equals aa.

Recall: MX(t)=E[etX]M_X(t) = \mathbb{E}[e^{tX}] was developed in Section04 Expectation and Moments. The key property used here is MX+Y(t)=MX(t)MY(t)M_{X+Y}(t) = M_X(t) \cdot M_Y(t) for independent X,YX, Y.

5.2 Chernoff for Bernoulli Sums

Let X1,,XnBern(p)X_1, \ldots, X_n \sim \operatorname{Bern}(p) i.i.d. and S=i=1nXiBin(n,p)S = \sum_{i=1}^n X_i \sim \operatorname{Bin}(n, p) with mean μ=np\mu = np.

The MGF of SS is MS(t)=(1p+pet)nM_S(t) = (1 - p + pe^t)^n.

Upper Chernoff bound. For δ>0\delta > 0:

P(S(1+δ)μ)(eδ(1+δ)1+δ)μP(S \geq (1+\delta)\mu) \leq \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu

Derivation. Apply the Chernoff method with a=(1+δ)μa = (1+\delta)\mu:

P(S(1+δ)μ)infs>0es(1+δ)μMS(s)P(S \geq (1+\delta)\mu) \leq \inf_{s>0} e^{-s(1+\delta)\mu} M_S(s)

Substituting MS(s)=(1p+pes)neμ(es1)M_S(s) = (1-p+pe^s)^n \leq e^{\mu(e^s-1)} (using 1+xex1 + x \leq e^x):

\leq \inf_{s>0} \exp(\mu(e^s - 1) - s(1+\delta)\mu) = \inf_{s>0} \exp(\mu(e^s - 1 - s(1+\delta)))

Optimising: s=log(1+δ)s^* = \log(1+\delta), giving the stated bound.

5.3 Multiplicative Chernoff Form

For δ(0,1)\delta \in (0, 1), a simpler but slightly looser form:

P(S(1+δ)μ)eμδ2/3P(S \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3}

This follows from the inequality (eδ(1+δ)1+δ)μeμδ2/3\left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu \leq e^{-\mu\delta^2/3}.

Comparison with Hoeffding. Hoeffding (applied to Bin(n,p)\operatorname{Bin}(n,p) with ba=1b-a=1) gives:

P(S(1+δ)μ)=P(Xˉp(1+δ))exp(2n(pδ)2)P(S \geq (1+\delta)\mu) = P(\bar{X} \geq p(1+\delta)) \leq \exp(-2n(p\delta)^2)

Chernoff gives \exp(-\mu\delta^2/3) = \exp(-np\delta^2/3): its exponent is linear in p, while Hoeffding's is quadratic in p. For small p (rare events), np \gg np^2, so Chernoff's dependence on \mu = np makes it far tighter.
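Plugging in numbers for a rare event shows the gap. A sketch with assumed n = 10,000 and p = 0.001:

```python
import math

n, p, delta = 10_000, 0.001, 1.0  # rare event: mu = np = 10
mu = n * p

chernoff = math.exp(-mu * delta**2 / 3)        # exponent scales with mu = np
hoeffding = math.exp(-2 * n * (p * delta)**2)  # exponent scales with n p^2

# Chernoff gives a meaningful bound; Hoeffding is nearly vacuous here.
print(chernoff, hoeffding)
```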

5.4 Lower Tail Chernoff

For the lower tail, for δ(0,1)\delta \in (0, 1):

P(S(1δ)μ)eμδ2/2P(S \leq (1-\delta)\mu) \leq e^{-\mu\delta^2/2}

Note the factor of 1/21/2 vs 1/31/3 in the exponent - the lower tail is slightly tighter.

Application: balls into bins. When nn items are assigned to mm bins uniformly at random, the expected load per bin is μ=n/m\mu = n/m. Chernoff bounds control the maximum load - crucial for hashing and load-balancing proofs in distributed systems.
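A quick simulation of the balls-into-bins setup (a sketch with NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, m_bins = 100_000, 100

# Assign each item to a uniformly random bin and tally the loads.
loads = np.bincount(rng.integers(0, m_bins, size=n_items), minlength=m_bins)
mu = n_items / m_bins  # expected load per bin = 1000

# Chernoff predicts every load stays within a small multiple of mu.
print(loads.max(), loads.min(), mu)
```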


6. Bernstein's Inequality

6.1 Sub-Exponential Random Variables

Some random variables have heavier tails than sub-Gaussian but still lighter than arbitrary - these are sub-exponential.

Definition. A mean-zero random variable XX is sub-exponential with parameters (ν2,b)(\nu^2, b) if:

E[etX]eν2t2/2for all t1/b\mathbb{E}[e^{tX}] \leq e^{\nu^2 t^2/2} \quad \text{for all } |t| \leq 1/b

Sub-exponential tails decay as et/be^{-t/b} (exponential), slower than Gaussian et2/(2σ2)e^{-t^2/(2\sigma^2)}.

Examples:

  • Exponential(1), centred: sub-exponential with (\nu^2, b) = (4, 2) (the exact constants depend on convention)
  • χk2\chi^2_k: sub-exponential (sum of squared Gaussians)
  • Any bounded variable is also sub-exponential

6.2 Bernstein's Inequality

Theorem (Bernstein, 1924/1946). Let X1,,XnX_1, \ldots, X_n be independent mean-zero variables with Xic|X_i| \leq c and i=1nE[Xi2]=ν2\sum_{i=1}^n \mathbb{E}[X_i^2] = \nu^2. Then for all t>0t > 0:

P ⁣(i=1nXit)exp ⁣(t2/2ν2+ct/3)P\!\left(\sum_{i=1}^n X_i \geq t\right) \leq \exp\!\left(-\frac{t^2/2}{\nu^2 + ct/3}\right)

Intuition. This interpolates between two regimes:

  • Small tt (tν2/ct \ll \nu^2/c): Denominator ν2\approx \nu^2, gives et2/(2ν2)e^{-t^2/(2\nu^2)} - sub-Gaussian with variance ν2\nu^2
  • Large tt (tν2/ct \gg \nu^2/c): Denominator ct/3\approx ct/3, gives e3t/(2c)e^{-3t/(2c)} - exponential tail

6.3 Bernstein vs Hoeffding

For nn i.i.d. variables with Xic|X_i| \leq c and sample variance s2=1nE[Xi2]s^2 = \frac{1}{n}\sum \mathbb{E}[X_i^2]:

Bound | Formula (for P(\bar{X} - \mu \geq t)) | Better when
Hoeffding | \exp(-2nt^2/(2c)^2) = \exp(-nt^2/(2c^2)) | No variance info
Bernstein | \exp(-nt^2/(2(s^2 + ct/3))) | s^2 \ll c^2 (small variance)

When Bernstein wins. If s2c2s^2 \ll c^2 (data has small variance despite large range), Bernstein gives exp(nt2/(2s2))\exp(-nt^2/(2s^2)) for small tt, while Hoeffding gives exp(nt2/(2c2))\exp(-nt^2/(2c^2)). Since s2c2s^2 \ll c^2, Bernstein is exponentially tighter.

Example. Suppose Xi[1,1]X_i \in [-1, 1] but Var(Xi)=0.01\operatorname{Var}(X_i) = 0.01. Hoeffding uses range (ba)2=4(b-a)^2 = 4, while Bernstein uses variance 0.010.01 - a 400x improvement in the exponent for small tt.
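Plugging the example's numbers into the mean-deviation forms of the two bounds (a sketch; n = 1000 and t = 0.05 are assumed here, with c = 1 and s^2 = 0.01 from the example):

```python
import math

n, c, s2, t = 1000, 1.0, 0.01, 0.05  # wide range, tiny variance

hoeffding = math.exp(-n * t**2 / (2 * c**2))          # only uses the range
bernstein = math.exp(-n * t**2 / (2 * (s2 + c * t / 3)))  # uses the variance

# Bernstein's exponent is ~37x larger, so its bound is astronomically smaller.
print(hoeffding, bernstein)
```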

6.4 Application to Gradient Noise

In SGD, the stochastic gradient g^t=1miBt(wt;xi)\hat{g}_t = \frac{1}{m}\sum_{i \in B_t} \nabla \ell(w_t; x_i) estimates the true gradient gt=L(wt)g_t = \nabla \mathcal{L}(w_t).

If each (w;x)\nabla\ell(w; x) is bounded: (w;x)2G\|\nabla\ell(w; x)\|_2 \leq G, and the sample gradient variance is σg2\sigma_g^2, then Bernstein bounds:

P ⁣(g^tgt2t)dexp ⁣(mt2/2dσg2+Gt/3)P\!\left(\|\hat{g}_t - g_t\|_2 \geq t\right) \leq d \cdot \exp\!\left(-\frac{mt^2/2}{d\sigma_g^2 + Gt/3}\right)

where the factor dd comes from a union bound over coordinates. This bound shows why variance reduction methods (SVRG, SAGA) that reduce σg2\sigma_g^2 improve convergence more dramatically than reducing GG alone.


7. McDiarmid's Inequality

7.1 Bounded Differences Condition

McDiarmid's inequality generalises Hoeffding from sums to general functions.

Definition (Bounded Differences). A function f:XnRf: \mathcal{X}^n \to \mathbb{R} satisfies the bounded differences condition with constants c1,,cn>0c_1, \ldots, c_n > 0 if for all i[n]i \in [n] and all x1,,xn,xiXx_1, \ldots, x_n, x_i' \in \mathcal{X}:

f(x1,,xi,,xn)f(x1,,xi,,xn)ci|f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n)| \leq c_i

Intuitively: changing the ii-th coordinate by any amount changes ff by at most cic_i.

7.2 McDiarmid's Inequality

Theorem (McDiarmid, 1989). Let X1,,XnX_1, \ldots, X_n be independent random variables and let ff satisfy the bounded differences condition with constants c1,,cnc_1, \ldots, c_n. Then for all t>0t > 0:

P(f(X1,,Xn)E[f]t)exp ⁣(2t2i=1nci2)P(f(X_1, \ldots, X_n) - \mathbb{E}[f] \geq t) \leq \exp\!\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right)

Proof sketch (martingale method). Define the Doob martingale Zk=E[fX1,,Xk]Z_k = \mathbb{E}[f \mid X_1, \ldots, X_k] with Z0=E[f]Z_0 = \mathbb{E}[f] and Zn=fZ_n = f. The differences Dk=ZkZk1D_k = Z_k - Z_{k-1} satisfy Dkck|D_k| \leq c_k by the bounded differences assumption. Apply Azuma's inequality (concentration for bounded martingale differences) to get the result.

7.3 Relationship to Hoeffding

Hoeffding's inequality is a special case of McDiarmid's. For f(X1,,Xn)=1ni=1nXif(X_1, \ldots, X_n) = \frac{1}{n}\sum_{i=1}^n X_i with Xi[ai,bi]X_i \in [a_i, b_i], changing XiX_i changes ff by at most (biai)/n(b_i - a_i)/n, so ci=(biai)/nc_i = (b_i - a_i)/n. Then:

ici2=i(biai)2n2\sum_i c_i^2 = \frac{\sum_i (b_i - a_i)^2}{n^2}

Substituting into McDiarmid recovers exactly Hoeffding's bound.

7.4 Applications to ML Stability

Training loss with missing data. Let D={z1,,zn}\mathcal{D} = \{z_1, \ldots, z_n\} be a training set and R^(D)=1ni=1n(h,zi)\hat{R}(\mathcal{D}) = \frac{1}{n}\sum_{i=1}^n \ell(h, z_i) the empirical risk with [0,1]\ell \in [0, 1]. Swapping one training example changes R^\hat{R} by at most 1/n1/n. McDiarmid gives:

P(R^E[R^]t)2exp(2nt2)P(|\hat{R} - \mathbb{E}[\hat{R}]| \geq t) \leq 2\exp(-2nt^2)
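The c_i = 1/n bounded-differences property can be checked directly. A sketch with NumPy; the losses are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
losses = rng.uniform(0.0, 1.0, size=n)  # per-example losses in [0, 1]
r_hat = losses.mean()

# Swapping any single example moves the empirical risk by at most 1/n.
for i in range(n):
    swapped = losses.copy()
    swapped[i] = rng.uniform(0.0, 1.0)
    assert abs(swapped.mean() - r_hat) <= 1.0 / n + 1e-12
print("bounded differences with c_i = 1/n confirmed")
```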

Dropout. With dropout probability pp, each neuron's activation aia_i is scaled by 1[kept]/(1p)\mathbf{1}[\text{kept}] / (1-p). The network output f(z1,,zd)f(z_1, \ldots, z_d) changes by at most ci=wiaimax/(1p)c_i = |w_i| \cdot a_i^{\max} / (1-p) when the ii-th neuron is toggled. McDiarmid bounds the output variance from dropout noise.

Data augmentation. If augmenting a single training example can change the training loss by at most cc, McDiarmid bounds the sensitivity of the trained model to augmentation choices.


8. The Union Bound and Covering Arguments

8.1 Union Bound

Lemma (Boole's / Union Bound). For any events A1,,AkA_1, \ldots, A_k (not necessarily independent):

P ⁣(i=1kAi)i=1kP(Ai)P\!\left(\bigcup_{i=1}^k A_i\right) \leq \sum_{i=1}^k P(A_i)

Proof. By induction: P(AB)=P(A)+P(B)P(AB)P(A)+P(B)P(A \cup B) = P(A) + P(B) - P(A \cap B) \leq P(A) + P(B).

The union bound is exact when events are disjoint and pessimistic otherwise. Despite its looseness, it is indispensable because it requires no assumption about dependence between events.

For AI: The union bound appears in virtually every multi-class or multi-hypothesis analysis. PAC theory for a finite hypothesis class H\mathcal{H} asks for the probability that any hHh \in \mathcal{H} is bad - union bound over all H|\mathcal{H}| hypotheses.

8.2 Multiple Hypothesis Testing

Consider mm statistical tests each with individual significance level α\alpha. If we want the probability that at least one false positive occurs to be at most δ\delta, the Bonferroni correction sets each test at level δ/m\delta/m.

In ML: When training mm models and reporting the best, the union bound says the best result can be a false positive with probability up to mαm\alpha. This is the multiple comparison problem underlying claims like "our method beats 10 baselines on 5 datasets."

Balancing union bound and concentration. For kk bad events AiA_i each with P(Ai)εP(A_i) \leq \varepsilon:

P(i:Ai)kεP(\exists i: A_i) \leq k\varepsilon

To make this δ\leq \delta: need each P(Ai)δ/kP(A_i) \leq \delta/k. In PAC theory, this costs an extra logk\log k in the required sample size.

8.3 Covering Numbers and \varepsilon-Nets

The union bound can only be applied to finite collections of events. For continuous hypothesis spaces (e.g., all linear classifiers, all neural networks), we need to discretise first.

Definition. An ε\varepsilon-net of a set H\mathcal{H} under metric dd is a finite set NεH\mathcal{N}_\varepsilon \subseteq \mathcal{H} such that for every hHh \in \mathcal{H}, there exists h^Nε\hat{h} \in \mathcal{N}_\varepsilon with d(h,h^)εd(h, \hat{h}) \leq \varepsilon.

The covering number N(H,d,ε)\mathcal{N}(\mathcal{H}, d, \varepsilon) is the size of the smallest ε\varepsilon-net.

Covering number of a ball. The 2\ell_2-ball {w:w2R}\{\mathbf{w}: \|\mathbf{w}\|_2 \leq R\} in Rd\mathbb{R}^d has covering number:

N(BR,2,ε)(2Rε+1)d(3Rε)d\mathcal{N}(\mathcal{B}_R, \ell_2, \varepsilon) \leq \left(\frac{2R}{\varepsilon} + 1\right)^d \leq \left(\frac{3R}{\varepsilon}\right)^d

This grows polynomially in 1/ε1/\varepsilon but exponentially in dd - the curse of dimensionality for covering.

8.4 The Net Argument

The standard approach for continuous hypothesis classes:

  1. Build an ε\varepsilon-net Nε\mathcal{N}_\varepsilon with Nε=N(H,d,ε)|\mathcal{N}_\varepsilon| = \mathcal{N}(\mathcal{H}, d, \varepsilon)
  2. Apply concentration + union bound over the net: control suphNεR^(h)R(h)\sup_{h \in \mathcal{N}_\varepsilon} |\hat{R}(h) - R(h)|
  3. Use Lipschitz continuity to extend from net to all of H\mathcal{H}

The final bound pays logN(H,d,ε)\log \mathcal{N}(\mathcal{H}, d, \varepsilon) extra compared to a single hypothesis, plus the approximation error ε\varepsilon from the net.

Johnson-Lindenstrauss Lemma. As an application of covering, if m=O(log(n)/ε2)m = O(\log(n)/\varepsilon^2) random projections are used, nn points in Rd\mathbb{R}^d embed into Rm\mathbb{R}^m with (1±ε)(1\pm\varepsilon) distortion of all pairwise distances, with high probability. This is proved using the union bound over all (n2)\binom{n}{2} pairs.
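A small Johnson-Lindenstrauss demonstration with a Gaussian random projection (a sketch; the constant 8 in the choice of m is one common convention, not canonical):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, eps = 50, 1000, 0.5
m = int(8 * np.log(n) / eps**2)  # O(log n / eps^2) target dimension

X = rng.normal(size=(n, d))
P = rng.normal(size=(d, m)) / np.sqrt(m)  # random Gaussian projection
Y = X @ P

# Fraction of pairs whose squared distance is preserved to (1 +/- eps).
pairs = list(combinations(range(n), 2))
ok = sum(
    (1 - eps) * np.sum((X[i] - X[j])**2)
    <= np.sum((Y[i] - Y[j])**2)
    <= (1 + eps) * np.sum((X[i] - X[j])**2)
    for i, j in pairs
)
print(m, ok / len(pairs))  # 1000 dims squeezed into ~125, low distortion
```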


9. PAC Learning and Generalisation Bounds

9.1 The PAC Framework

The Probably Approximately Correct (PAC) framework, introduced by Valiant (1984), provides a formal model for machine learning under uncertainty.

Setup: A learner observes nn i.i.d. examples {(x(i),y(i))}\{(x^{(i)}, y^{(i)})\} from an unknown distribution D\mathcal{D} over X×Y\mathcal{X} \times \mathcal{Y}. The learner selects a hypothesis hh from a hypothesis class H\mathcal{H}.

True risk: R(h)=P(x,y)D(h(x)y)=E[1[h(x)y]]R(h) = P_{(x,y) \sim \mathcal{D}}(h(x) \neq y) = \mathbb{E}[\mathbf{1}[h(x) \neq y]]

Empirical risk: R^(h)=1ni=1n1[h(x(i))y(i)]\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[h(x^{(i)}) \neq y^{(i)}]

PAC guarantee: A learning algorithm A\mathcal{A} is PAC-learning H\mathcal{H} if for any ε,δ>0\varepsilon, \delta > 0 and any distribution D\mathcal{D}, given nn0(ε,δ)n \geq n_0(\varepsilon, \delta) examples, with probability at least 1δ1-\delta:

R(A(D))minhHR(h)+εR(\mathcal{A}(\mathcal{D})) \leq \min_{h \in \mathcal{H}} R(h) + \varepsilon

The sample complexity n0(ε,δ)n_0(\varepsilon, \delta) is the central quantity: how many examples are needed?

9.2 Finite Hypothesis Class

Theorem. Let H\mathcal{H} be a finite hypothesis class with H<|\mathcal{H}| < \infty. For any ERM algorithm (minimising R^(h)\hat{R}(h)) and any ε,δ>0\varepsilon, \delta > 0, if:

nlog(H/δ)2ε2n \geq \frac{\log(|\mathcal{H}|/\delta)}{2\varepsilon^2}

then with probability at least 1δ1-\delta, the selected hypothesis h^\hat{h} satisfies R(h^)minhR(h)+εR(\hat{h}) \leq \min_{h^*} R(h^*) + \varepsilon (in the realisable case with R^(h)=0\hat{R}(h^*) = 0).

Proof. Consider the event "bad hh": Bh={R(h)>ε but R^(h)=0}B_h = \{R(h) > \varepsilon \text{ but } \hat{R}(h) = 0\}.

Since R(h) > \varepsilon, each of the n examples independently lands outside h's error region with probability at most 1 - \varepsilon, so P(B_h) \leq (1-\varepsilon)^n \leq e^{-n\varepsilon}.

By union bound: P(hH:Bh)HenεP(\exists h \in \mathcal{H}: B_h) \leq |\mathcal{H}| \cdot e^{-n\varepsilon}.

Setting this \leq \delta and solving gives n \geq \frac{\log(|\mathcal{H}|/\delta)}{\varepsilon} in the realisable case. The \varepsilon^2 version in the theorem statement covers the agnostic (non-realisable) case, where two-sided Hoeffding replaces the (1-\varepsilon)^n computation.

9.3 Uniform Convergence

Theorem (Uniform Convergence). For finite $\mathcal{H}$ and losses in $[0, 1]$, if $n \geq \frac{\log(2|\mathcal{H}|/\delta)}{2\varepsilon^2}$, then:

$$P\!\left(\sup_{h \in \mathcal{H}} |\hat{R}(h) - R(h)| > \varepsilon\right) \leq \delta$$

Proof. Fix any $h \in \mathcal{H}$. By Hoeffding: $P(|\hat{R}(h) - R(h)| > \varepsilon) \leq 2e^{-2n\varepsilon^2}$.

By the union bound over all $|\mathcal{H}|$ hypotheses: $P(\sup_h |\hat{R}(h) - R(h)| > \varepsilon) \leq 2|\mathcal{H}| e^{-2n\varepsilon^2} \leq \delta$ for the stated $n$.

Consequence: If uniform convergence holds, then for ERM $\hat{h} = \arg\min_h \hat{R}(h)$ and the best hypothesis $h^* = \arg\min_h R(h)$:

$$R(\hat{h}) \leq \hat{R}(\hat{h}) + \varepsilon \leq \hat{R}(h^*) + \varepsilon \leq R(h^*) + 2\varepsilon$$

The ERM generalises to within $2\varepsilon$ of the best hypothesis.
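The uniform-convergence bound can be stress-tested by Monte Carlo. The sketch below (my own toy construction: thresholds on $[0,1]$ with a noiseless labelling rule) checks how often the sup-deviation over the class exceeds the Hoeffding-plus-union-bound $\varepsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(0, 1, 21)       # finite class: h_t(x) = 1[x > t]
n, trials, delta = 500, 200, 0.05

# Toy ground truth: x ~ U[0,1], y = 1[x > 0.5]. h_t disagrees with the truth
# exactly on the interval between t and 0.5, so R(h_t) = |t - 0.5|.
true_risk = np.abs(thresholds - 0.5)

# Uniform-convergence epsilon: sqrt(log(2|H|/delta) / (2n)) ~ 0.082 here.
eps = np.sqrt(np.log(2 * len(thresholds) / delta) / (2 * n))

exceed = 0
for _ in range(trials):
    x = rng.uniform(size=n)
    y = (x > 0.5).astype(int)
    emp_risk = np.array([np.mean((x > t).astype(int) != y) for t in thresholds])
    if np.max(np.abs(emp_risk - true_risk)) > eps:
        exceed += 1

# The theorem promises exceedance in at most a delta fraction of runs;
# in practice the union bound is loose and exceedances are rarer still.
```

That the empirical exceedance rate sits far below $\delta$ illustrates a recurring theme: union bounds are worst-case over correlated events and are rarely tight.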

9.4 VC Dimension

For infinite hypothesis classes, we need a different complexity measure.

Definition. A hypothesis class $\mathcal{H}$ shatters a set $C = \{x_1, \ldots, x_k\}$ if for every labelling $y \in \{0,1\}^k$, there exists $h \in \mathcal{H}$ with $h(x_i) = y_i$ for all $i$.

The VC dimension $d_{VC}(\mathcal{H})$ is the size of the largest set shattered by $\mathcal{H}$.

Examples:

  • Threshold classifiers on $\mathbb{R}$: $d_{VC} = 1$ (can shatter any single point but no pair)
  • Linear classifiers (halfspaces) in $\mathbb{R}^d$: $d_{VC} = d + 1$
  • Degree-$k$ polynomial classifiers in $\mathbb{R}^d$: $d_{VC} = \binom{d+k}{k}$
  • Circles (interiors of discs) in $\mathbb{R}^2$: $d_{VC} = 3$
  • Feedforward ReLU networks with $W$ parameters and $L$ layers: $d_{VC} = O(WL \log W)$ — roughly linear in the parameter count for fixed depth

Sauer-Shelah Lemma. The growth function $\Pi_\mathcal{H}(n) = \max_{x_1,\ldots,x_n} |\{(h(x_1),\ldots,h(x_n)) : h \in \mathcal{H}\}|$ satisfies:

$$\Pi_\mathcal{H}(n) \leq \sum_{i=0}^{d} \binom{n}{i} \leq \left(\frac{en}{d}\right)^d$$

for $n \geq d = d_{VC}(\mathcal{H})$. This says: an $n$-point set can be labelled in at most $(en/d)^d$ ways by $\mathcal{H}$ — polynomial in $n$, not $2^n$ — even if $|\mathcal{H}| = \infty$.
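A quick check of the lemma for the simplest class above. Thresholds on $\mathbb{R}$ have $d_{VC} = 1$, and their growth function is exactly $n+1$ (on $n$ sorted points the realisable patterns are $1\cdots1$, $01\cdots1$, ..., $0\cdots0$), which indeed sits between the Sauer-Shelah sum and the loose $(en/d)^d$ form:

```python
import math

def sauer_bound(n, d):
    """Sauer-Shelah: sum_{i<=d} C(n, i), plus the looser (en/d)^d for n >= d."""
    s = sum(math.comb(n, i) for i in range(d + 1))
    loose = (math.e * n / d) ** d
    return s, loose

# For d = 1 the sum is 1 + n, matching the exact growth function of thresholds.
for n in [1, 5, 50, 1000]:
    s, loose = sauer_bound(n, 1)
    assert (n + 1) == s <= loose   # polynomial in n, nowhere near 2^n
```

For contrast, `sauer_bound(50, 2)` gives $1 + 50 + \binom{50}{2} = 1276$ realisable patterns out of $2^{50}$ possible labellings.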

9.5 Generalisation Bound with VC Dimension

Theorem (VC Generalisation Bound). For any $\delta > 0$, with probability at least $1-\delta$ over $n$ i.i.d. training examples, every $h \in \mathcal{H}$ satisfies:

$$R(h) \leq \hat{R}(h) + \sqrt{\frac{d(\log(2n/d) + 1) + \log(4/\delta)}{2n}}$$

where $d = d_{VC}(\mathcal{H})$.

Sample complexity. Setting the square-root term to $\varepsilon$ and solving for $n$:

$$n = O\!\left(\frac{d \log(1/\varepsilon) + \log(1/\delta)}{\varepsilon^2}\right)$$

and a sharper analysis removes the logarithmic factor, giving $n = O\!\left(\frac{d + \log(1/\delta)}{\varepsilon^2}\right)$.

The $\log|\mathcal{H}|$ from the finite case is replaced by $d_{VC}$. For a $d$-dimensional linear classifier ($d_{VC} = d+1$), we need $n = O(d/\varepsilon^2)$ — linear in dimension, not exponential.
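To get a feel for the numbers, the square-root term of the VC bound can be evaluated directly (a sketch; the function name and the choice of a linear classifier in $\mathbb{R}^{100}$ are mine):

```python
import math

def vc_bound_gap(n, d, delta):
    """The excess-risk term of the VC generalisation bound:
    sqrt((d (log(2n/d) + 1) + log(4/delta)) / (2n))."""
    return math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / (2 * n))

# Linear classifier in R^100, so d_VC = 101; delta = 0.05.
g_small = vc_bound_gap(10**3, 101, 0.05)   # ~0.45: nearly vacuous
g_large = vc_bound_gap(10**6, 101, 0.05)   # ~0.02: a meaningful guarantee
```

The gap shrinks like $\sqrt{d \log n / n}$: with $n = 10^3$ examples the guarantee is nearly vacuous, while $n = 10^6$ pins the true risk to within a couple of percent of the empirical risk.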

For AI (2026): Modern transformers have billions of parameters, implying an enormous VC dimension that renders the bound above vacuous. Yet they generalise. This "paradox" is partly resolved by:

  1. Implicit regularisation: gradient descent is biased toward low-norm solutions
  2. Overparameterisation: the interpolation threshold and double descent (covered in Section 4)
  3. PAC-Bayes: data-dependent bounds that are tighter for specific algorithms

10. Rademacher Complexity

10.1 Definition

Rademacher complexity measures the richness of a function class on the actual data, making it data-dependent and often tighter than VC bounds.

Definition. Given a sample $S = (x^{(1)}, \ldots, x^{(n)})$ and a function class $\mathcal{F}$, the empirical Rademacher complexity is:

$$\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}}\!\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x^{(i)})\right]$$

where $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_n)$ with $\sigma_i \sim \operatorname{Uniform}(\{-1, +1\})$ i.i.d. (Rademacher variables).

The Rademacher complexity is $\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{F})]$.

Interpretation. $\hat{\mathfrak{R}}_S(\mathcal{F})$ measures how well $\mathcal{F}$ can correlate with random $\pm 1$ labels on the training points. If the class is rich enough to fit any labelling, $\hat{\mathfrak{R}}_S \approx 1$ (for classes taking values in $[-1, 1]$). If it can only fit structured labels, $\hat{\mathfrak{R}}_S \approx 0$.
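Because the definition is an expectation over random signs, it invites Monte Carlo estimation. The sketch below (my own helper; it represents a finite function class by its value matrix on the sample) confirms the two extremes of the interpretation:

```python
import itertools
import numpy as np

def empirical_rademacher(fvals, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    fvals: (m, n) array; row j holds (f_j(x^(1)), ..., f_j(x^(n))) for a
    finite (sub)class of m functions evaluated on the n sample points.
    """
    rng = np.random.default_rng(seed)
    n = fvals.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))    # random sign vectors
    # For each sign vector, sup over the class of (1/n) * sum_i sigma_i f(x_i):
    sups = (sigma @ fvals.T / n).max(axis=1)
    return sups.mean()

n = 10
# Maximally rich class: realises every sign pattern on n points -> complexity 1.
rich = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
r_rich = empirical_rademacher(rich)

# Trivial class: one constant function -> complexity ~ 0 (signs average out).
trivial = np.ones((1, n))
r_trivial = empirical_rademacher(trivial)
```

For the rich class the sup always picks the row equal to the sign vector itself, so the estimate is exactly 1; for the constant class it is the mean of $\frac{1}{n}\sum_i \sigma_i$ over draws, which hovers near 0.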

10.2 Rademacher Generalisation Bound

Theorem. For any function class $\mathcal{F}$ with values in $[0, 1]$, with probability at least $1-\delta$:

$$R(f) \leq \hat{R}(f) + 2\mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}}$$

for all $f \in \mathcal{F}$ simultaneously.

Comparison to VC. The VC bound pays roughly $\sqrt{d/n}$; the Rademacher bound pays $2\mathfrak{R}_n(\mathcal{F})$. For data distributions on which the function class is "effectively smaller" than its worst case, the Rademacher bound can be significantly tighter.

10.3 Rademacher of Linear Classifiers

For linear predictors $\mathcal{F} = \{\mathbf{x} \mapsto \langle \mathbf{w}, \mathbf{x} \rangle : \|\mathbf{w}\|_2 \leq B\}$ on data with $\|\mathbf{x}^{(i)}\|_2 \leq C$:

$$\hat{\mathfrak{R}}_S(\mathcal{F}) \leq \frac{B}{n}\sqrt{\sum_{i=1}^n \|\mathbf{x}^{(i)}\|_2^2} \leq \frac{BC}{\sqrt{n}}$$

Derivation. Cauchy-Schwarz resolves the sup over $\|\mathbf{w}\|_2 \leq B$, then Jensen's inequality and $\mathbb{E}[\sigma_i \sigma_j] = \mathbf{1}[i = j]$ give:

$$\hat{\mathfrak{R}}_S(\mathcal{F}) = \frac{B}{n}\mathbb{E}_\sigma\!\left[\left\|\sum_{i=1}^n \sigma_i \mathbf{x}^{(i)}\right\|_2\right] \leq \frac{B}{n}\sqrt{\mathbb{E}_\sigma\!\left\|\sum_{i=1}^n \sigma_i \mathbf{x}^{(i)}\right\|_2^2} = \frac{B}{n}\sqrt{\sum_{i=1}^n \|\mathbf{x}^{(i)}\|_2^2} \leq \frac{BC}{\sqrt{n}}$$

This is $O(1/\sqrt{n})$, independent of the ambient dimension $d$! This explains why linear classifiers generalise well even in high dimensions, as long as the norm $\|\mathbf{w}\|_2$ is controlled.
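The first equality in the derivation makes this bound directly checkable: a Monte Carlo estimate of $\frac{B}{n}\mathbb{E}_\sigma\|\sum_i \sigma_i \mathbf{x}^{(i)}\|_2$ should land below $BC/\sqrt{n}$. A sketch with synthetic data (the dimensions, norms, and seed are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, B, C = 200, 5, 2.0, 1.0
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # enforce ||x_i||_2 = C = 1

# By Cauchy-Schwarz, sup_{||w|| <= B} <w, v> = B ||v||_2, so the empirical
# Rademacher complexity equals (B/n) E_sigma || sum_i sigma_i x_i ||_2.
sigma = rng.choice([-1.0, 1.0], size=(5000, n))
rad_mc = B * np.linalg.norm(sigma @ X, axis=1).mean() / n

bound = B * C / np.sqrt(n)                      # the BC / sqrt(n) bound
```

The estimate typically sits just below the bound: the only slack in the derivation is Jensen's inequality, which is nearly tight when $\|\sum_i \sigma_i \mathbf{x}^{(i)}\|_2$ concentrates.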

10.4 Rademacher vs VC

| Property | VC Dimension | Rademacher Complexity |
|---|---|---|
| Distribution | Distribution-free | Data-dependent |
| Calculation | Combinatorial | Expectation over random signs |
| Estimation | Hard (often NP-hard) | Monte Carlo approximation possible |
| Tightness | Worst-case | Often tighter in practice |
| Handles regression | No (classification only) | Yes (arbitrary bounded losses) |
| Modern NNs | Vacuous (too large $d$) | Can be finite for sparse/low-rank models |

For AI: Norm-based Rademacher bounds can be finite (non-vacuous) for attention heads with bounded query/key norms, which helps explain why transformers trained with weight decay generalise despite huge parameter counts.

