Concentration Inequalities, Part 1: Sections 1 (Intuition) to 10 (Rademacher Complexity)

1. Intuition

1.1 What Is Concentration?

A random variable XX has a mean μ=E[X]\mu = \mathbb{E}[X]. But knowing the mean alone tells us nothing about how a single draw is distributed around it. Concentration inequalities fill this gap by bounding the probability that XX deviates far from μ\mu.

The fundamental question is: given what we know about XX (its mean, variance, boundedness, or MGF), what is the tightest bound we can place on P(Xμt)P(|X - \mu| \geq t)?

This matters deeply in practice. In machine learning, we observe the empirical mean of a loss function on training data and want to know how well it approximates the true (population) mean. If X1,,XnX_1, \ldots, X_n are i.i.d. losses with mean μ\mu and we observe Xˉn=1nXi\bar{X}_n = \frac{1}{n}\sum X_i, concentration inequalities tell us how close Xˉn\bar{X}_n is to μ\mu with high probability. This is the foundation of statistical learning theory.

The sample mean concentrates around the true mean. As nn grows, the deviation Xˉnμ|\bar{X}_n - \mu| becomes increasingly small with high probability. This is the content of the Law of Large Numbers - but concentration inequalities give quantitative, finite-nn bounds, not just asymptotic guarantees.

For AI: Every evaluation metric in ML (accuracy, F1, BLEU, ROUGE, perplexity) is estimated from a finite test set. Concentration inequalities tell us how many test examples we need to trust the estimate. HELM benchmarks for LLMs, for instance, implicitly rely on Hoeffding-type bounds for the confidence intervals around model comparisons.
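A quick simulation makes the shrinking deviation concrete. This is a sketch using NumPy; the "loss" here is a stand-in Bernoulli variable with true mean 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.5  # true mean of a Bernoulli(0.5) stand-in loss

def deviation(n):
    """|X_bar_n - mu| for one draw of n i.i.d. Bernoulli losses."""
    return abs(rng.binomial(1, mu, size=n).mean() - mu)

# Deviations shrink roughly like 1/sqrt(n) as the sample grows.
devs = {n: deviation(n) for n in (100, 10_000, 1_000_000)}
print(devs)
```

Concentration inequalities quantify exactly how fast this shrinkage happens for finite n.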

1.2 The Hierarchy of Bounds

Concentration inequalities form a clear hierarchy. Moving down requires stronger assumptions but yields exponentially tighter bounds:

CONCENTRATION INEQUALITY HIERARCHY
========================================================================

  Assumption               Inequality        Tail bound form
  ----------------------------------------------------------------------
  E[X] < \infty, X >= 0    Markov            P(X >= t) <= \mu/t
  (polynomial decay)

  Var(X) < \infty          Chebyshev         P(|X-\mu| >= t) <= \sigma^2/t^2
  (polynomial decay)

  E[e^{sX}] < \infty       Chernoff method   P(X >= t) <= min_s e^{-st} M(s)
  (general exponential)

  X in [a,b] a.s.          Hoeffding         P(\bar{X}-\mu >= t) <= exp(-2nt^2/(b-a)^2)
  (sub-Gaussian)           (exponential decay)

  Var known + bounded      Bernstein         exp(-nt^2/(2\sigma^2 + 2ct/3))
  (variance-aware)         (tighter for small t)

  f(X_1,...,X_n) with      McDiarmid         exp(-2t^2/\sum_i c_i^2)
  bounded differences      (functions of i.i.d. variables)

========================================================================

The key insight: moment-based bounds (Markov, Chebyshev) give polynomial tails O(1/tk)O(1/t^k), while MGF-based bounds give exponential tails O(ect2)O(e^{-ct^2}). Exponential tails are vastly superior - a Gaussian has P(X3σ)0.003P(|X| \geq 3\sigma) \approx 0.003, while Chebyshev gives only 1/90.111/9 \approx 0.11.
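The gap between polynomial and exponential tails is easy to tabulate. A sketch comparing Chebyshev's distribution-free bound with the exact Gaussian tail (via `math.erfc`):

```python
import math

def chebyshev_two_sided(k):
    """Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2 for any finite-variance X."""
    return 1.0 / k**2

def gaussian_two_sided(k):
    """Exact two-sided tail for N(0, 1): P(|X| >= k) = erfc(k / sqrt(2))."""
    return math.erfc(k / math.sqrt(2))

for k in (2, 3, 5):
    print(k, chebyshev_two_sided(k), gaussian_two_sided(k))
```

At k = 3 the Chebyshev bound is roughly 40x looser than the Gaussian truth, and the gap widens rapidly with k.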

1.3 Historical Timeline

Year | Person | Contribution
1867 | Chebyshev | Proved the inequality bearing his name; used it to prove the weak LLN
1899 | Markov (student of Chebyshev) | Simpler proof via indicator argument; Markov's inequality
1952 | Herman Chernoff | MGF-based exponential tail bounds for sums of independent variables
1963 | Wassily Hoeffding | Tight bounds for bounded variables; modern form used everywhere
1971 | Vapnik & Chervonenkis | VC dimension theory; combinatorial approach to generalisation
1984 | Leslie Valiant | PAC learning framework - theoretical model for machine learning
1989 | Colin McDiarmid | Bounded differences inequality for general functions of independent variables
2002 | Bartlett & Mendelson | Rademacher complexity - data-dependent alternative to VC theory
2013 | Boucheron, Lugosi & Massart | Comprehensive monograph unifying modern concentration theory

1.4 Why Concentration Matters for ML

Generalisation: A neural network achieves 99% training accuracy. Does it generalise? Concentration inequalities bound R(h)R^(h)|R(h) - \hat{R}(h)| - the gap between true and empirical risk. They tell us when we can trust training metrics.

SGD analysis: Stochastic gradient descent uses a mini-batch estimate g^\hat{g} of the true gradient gg. Concentration inequalities bound g^g2\|\hat{g} - g\|_2, determining the noise level and required step sizes for convergence.

Confidence intervals: When evaluating an LLM on 1000 benchmark questions with accuracy 73.2%, concentration bounds tell us the confidence interval around this estimate.

Random features: The Rahimi-Recht random features method approximates kernel matrices using random projections. How many random features are needed? The answer is a Hoeffding bound.

Dropout: Dropout randomly zeroes activations. The kept activations are bounded, so McDiarmid's inequality bounds the output variance.


2. Markov's and Chebyshev's Inequalities

2.1 Markov's Inequality

Theorem (Markov's Inequality). Let X0X \geq 0 be a non-negative random variable with E[X]<\mathbb{E}[X] < \infty. For any t>0t > 0:

P(Xt)E[X]tP(X \geq t) \leq \frac{\mathbb{E}[X]}{t}

Proof. This follows immediately from the indicator 1[Xt]X/t\mathbf{1}[X \geq t] \leq X/t for X0X \geq 0, t>0t > 0:

P(Xt)=E[1[Xt]]E ⁣[Xt]=E[X]tP(X \geq t) = \mathbb{E}[\mathbf{1}[X \geq t]] \leq \mathbb{E}\!\left[\frac{X}{t}\right] = \frac{\mathbb{E}[X]}{t}

This proof is a master class in the indicator trick: write the probability as the expectation of an indicator, then bound the indicator.

Tightness. Markov's bound is tight. Consider X=tX = t with probability μ/t\mu/t and X=0X = 0 otherwise. Then E[X]=μ\mathbb{E}[X] = \mu and P(Xt)=μ/tP(X \geq t) = \mu/t - exactly matching the bound.

Extensions. Markov applies to any non-negative function of XX: for any non-decreasing ϕ0\phi \geq 0,

P(Xt)=P(ϕ(X)ϕ(t))E[ϕ(X)]ϕ(t)P(X \geq t) = P(\phi(X) \geq \phi(t)) \leq \frac{\mathbb{E}[\phi(X)]}{\phi(t)}

Setting ϕ(x)=xk\phi(x) = x^k gives the kk-th moment bound: P(Xt)E[Xk]/tkP(X \geq t) \leq \mathbb{E}[X^k]/t^k. Setting ϕ(x)=esx\phi(x) = e^{sx} gives the Chernoff method.

For AI: Markov's inequality underlies gradient clipping analysis. If E[g2]G\mathbb{E}[\|g\|_2] \leq G, then P(g2cG)1/cP(\|g\|_2 \geq cG) \leq 1/c - so gradient norms exceed 10G10G at most 10% of steps.
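A quick empirical check that Markov's bound holds. This is a sketch with NumPy; the exponential distribution stands in for non-negative gradient-norm draws:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)  # non-negative, E[X] = 1

# Markov: P(X >= t) <= E[X]/t. The empirical tail always obeys the bound,
# though for light-tailed data it is very loose (true tail here is e^{-t}).
for t in (2.0, 5.0, 10.0):
    print(t, (x >= t).mean(), 1.0 / t)
```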

2.2 Chebyshev's Inequality

Theorem (Chebyshev's Inequality). For any random variable XX with E[X]=μ\mathbb{E}[X] = \mu and Var(X)=σ2<\operatorname{Var}(X) = \sigma^2 < \infty, for any t>0t > 0:

P(Xμt)σ2t2P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}

Equivalently, setting t=kσt = k\sigma: P(Xμkσ)1k2P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}.

Proof. Apply Markov to the non-negative variable Y=(Xμ)2Y = (X - \mu)^2:

P(Xμt)=P((Xμ)2t2)E[(Xμ)2]t2=σ2t2P(|X - \mu| \geq t) = P((X-\mu)^2 \geq t^2) \leq \frac{\mathbb{E}[(X-\mu)^2]}{t^2} = \frac{\sigma^2}{t^2}

The kk-sigma rule. Chebyshev guarantees: at least 11/k21 - 1/k^2 of probability mass lies within kσk\sigma of the mean. For any distribution with finite variance:

k | Chebyshev bound | Gaussian actual
2 | >= 75% within 2\sigma | ~95.4%
3 | >= 88.9% within 3\sigma | ~99.7%
4 | >= 93.75% within 4\sigma | ~99.994%
5 | >= 96% within 5\sigma | ~99.99994%

For AI: Chebyshev justifies batch normalisation. If activations have Var(X)=σ2\operatorname{Var}(X) = \sigma^2, then at most 1/k21/k^2 of activations lie beyond kσk\sigma - this quantifies the scale-normalisation benefit.

Recall: The variance \operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 and its properties were developed fully in Section 04, Expectation and Moments. We use it here as a plug-in parameter.

2.3 One-Sided Chebyshev (Cantelli's Inequality)

Chebyshev bounds both tails symmetrically. Cantelli's inequality gives a tighter one-sided bound:

Theorem (Cantelli). For any t>0t > 0:

P(Xμt)σ2σ2+t2P(X - \mu \geq t) \leq \frac{\sigma^2}{\sigma^2 + t^2}

Proof. For any s>0s > 0, by Markov applied to (Xμ+s)2(X - \mu + s)^2:

P(Xμt)=P(Xμ+st+s)E[(Xμ+s)2](t+s)2=σ2+s2(t+s)2P(X - \mu \geq t) = P(X - \mu + s \geq t + s) \leq \frac{\mathbb{E}[(X-\mu+s)^2]}{(t+s)^2} = \frac{\sigma^2 + s^2}{(t+s)^2}

Optimise over ss: /s=0\partial/\partial s = 0 gives s=σ2/ts = \sigma^2/t, yielding σ2/(σ2+t2)\sigma^2/(\sigma^2 + t^2).

For large t: Chebyshev gives \sigma^2/t^2 and Cantelli gives \sigma^2/(\sigma^2 + t^2) \approx \sigma^2/t^2 - essentially the same. For small t the difference matters: Chebyshev's bound exceeds 1 (vacuous) once t < \sigma, while Cantelli's never does, and for a single tail Cantelli is always at least as tight, since \sigma^2/(\sigma^2+t^2) \leq \sigma^2/t^2.
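A minimal sketch comparing the two one-line formulas:

```python
def cantelli(sigma2, t):
    """One-sided Cantelli: P(X - mu >= t) <= sigma^2 / (sigma^2 + t^2)."""
    return sigma2 / (sigma2 + t**2)

def chebyshev(sigma2, t):
    """Two-sided Chebyshev: P(|X - mu| >= t) <= sigma^2 / t^2 (may exceed 1)."""
    return sigma2 / t**2

# At t = sigma the Chebyshev bound is vacuous (= 1); Cantelli still gives 1/2.
print(cantelli(1.0, 1.0), chebyshev(1.0, 1.0))
```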

2.4 Limitations of Moment-Based Bounds

Markov and Chebyshev have polynomial tails. The Gaussian has one-sided tail P(X \geq 3\sigma) \approx 0.0013, but Chebyshev gives only 1/9 \approx 0.111 - roughly two orders of magnitude too pessimistic. For t = 10\sigma, Chebyshev gives 0.01, while the true Gaussian tail probability is \approx 10^{-23}.

Why the gap? Chebyshev uses only the first two moments. A distribution could concentrate all its mass at μ±t\mu \pm t and match any (μ,σ2)(\mu, \sigma^2) pair - that's the worst case for Chebyshev. Real distributions with bounded support or thin tails concentrate far more.

When Chebyshev is the right tool:

  • Distribution-free bounds (don't know the family)
  • Heavy-tailed distributions (Pareto, tt-distribution with low df)
  • Quick estimates without assuming sub-Gaussianity

3. Sub-Gaussian Random Variables

3.1 Definition and MGF Condition

Sub-Gaussian random variables are those whose tails decay at least as fast as a Gaussian. The key condition is on the MGF.

Definition. A mean-zero random variable XX is σ2\sigma^2-sub-Gaussian if for all tRt \in \mathbb{R}:

E[etX]eσ2t2/2\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2}

The parameter σ2\sigma^2 is the sub-Gaussian parameter (also called the proxy variance). Note: σ2\sigma^2 need not equal Var(X)\operatorname{Var}(X), though Var(X)σ2\operatorname{Var}(X) \leq \sigma^2 always holds (by differentiation at t=0t=0).

Why this condition? Recall from Section 04 that the MGF of \mathcal{N}(0, \sigma^2) is exactly e^{\sigma^2 t^2/2}. The sub-Gaussian condition says the MGF of X is dominated by that of a Gaussian with variance \sigma^2. This immediately implies Gaussian-like tail bounds.

Equivalent characterisations (for mean-zero XX):

  1. MGF condition: E[etX]eσ2t2/2\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} for all tt
  2. Tail condition: P(Xt)2et2/(2σ2)P(|X| \geq t) \leq 2e^{-t^2/(2\sigma^2)} for all t0t \geq 0
  3. Moment condition: E[X2k](2k1)!!σ2k\mathbb{E}[X^{2k}] \leq (2k-1)!! \cdot \sigma^{2k} for all k1k \geq 1

3.2 Examples of Sub-Gaussian Variables

Gaussian. XN(0,σ2)X \sim \mathcal{N}(0, \sigma^2) is σ2\sigma^2-sub-Gaussian. The MGF equals eσ2t2/2e^{\sigma^2 t^2/2} exactly - Gaussian is the equality case.

Bounded random variable. If X \in [a, b] almost surely with \mathbb{E}[X] = \mu, then X - \mu is \frac{(b-a)^2}{4}-sub-Gaussian. This is Hoeffding's lemma - proved in Section 4.

Rademacher variable. \varepsilon \in \{-1, +1\} with equal probability. Then \mathbb{E}[e^{t\varepsilon}] = \cosh(t) \leq e^{t^2/2}, so \varepsilon is 1-sub-Gaussian. Rademacher variables appear in Rademacher complexity (Section 10).

Bernoulli centered. X=Bern(p)p{p,1p}X = \operatorname{Bern}(p) - p \in \{-p, 1-p\}. Since X[p,1p]X \in [-p, 1-p], it is 14\frac{1}{4}-sub-Gaussian by Hoeffding's lemma.

Non-example. Heavy-tailed distributions are not sub-Gaussian. A t-distribution has \mathbb{E}[e^{tX}] = \infty for every t \neq 0, regardless of its degrees of freedom - its tails are polynomial. A Pareto distribution is likewise not sub-Gaussian for any parameters.
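The Rademacher claim \cosh(t) \leq e^{t^2/2} can be checked numerically. A sketch over a grid of t values:

```python
import math

# Rademacher MGF: E[e^{t*eps}] = (e^t + e^{-t})/2 = cosh(t).
# Verify the 1-sub-Gaussian domination cosh(t) <= exp(t^2/2) on a grid.
ts = [i / 10 for i in range(-50, 51)]
assert all(math.cosh(t) <= math.exp(t * t / 2) + 1e-12 for t in ts)
print("cosh(t) <= exp(t^2/2) holds on the grid")
```

(The inequality is exact for all real t: compare the Taylor coefficients t^{2k}/(2k)! versus t^{2k}/(2^k k!).)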

3.3 The Sub-Gaussian Tail Bound

Theorem. If XX is σ2\sigma^2-sub-Gaussian with E[X]=0\mathbb{E}[X] = 0, then for all t0t \geq 0:

P(Xt)et2/(2σ2)P(X \geq t) \leq e^{-t^2/(2\sigma^2)}

Proof (Chernoff method). For any s>0s > 0:

P(Xt)=P(esXest)E[esX]esteσ2s2/2est=eσ2s2/2stP(X \geq t) = P(e^{sX} \geq e^{st}) \leq \frac{\mathbb{E}[e^{sX}]}{e^{st}} \leq \frac{e^{\sigma^2 s^2/2}}{e^{st}} = e^{\sigma^2 s^2/2 - st}

Optimise over ss: /s(σ2s2/2st)=0s=t/σ2\partial/\partial s(\sigma^2 s^2/2 - st) = 0 \Rightarrow s^* = t/\sigma^2. Substituting:

P(Xt)eσ2(t/σ2)2/2tt/σ2=et2/(2σ2)t2/σ2=et2/(2σ2)P(X \geq t) \leq e^{\sigma^2(t/\sigma^2)^2/2 - t \cdot t/\sigma^2} = e^{t^2/(2\sigma^2) - t^2/\sigma^2} = e^{-t^2/(2\sigma^2)}

This proof pattern - Markov + MGF bound + optimise - is the Chernoff method and will recur throughout this section.
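The optimisation step can be verified numerically. A sketch with assumed values \sigma^2 = 2, t = 3:

```python
import math

sigma2, t = 2.0, 3.0
s_star = t / sigma2  # claimed minimiser of sigma^2 s^2/2 - s t

bound = math.exp(sigma2 * s_star**2 / 2 - s_star * t)
# At s*, the Chernoff bound equals exp(-t^2 / (2 sigma^2)).
assert abs(bound - math.exp(-t**2 / (2 * sigma2))) < 1e-12

# Grid search: no other s does better (the exponent is convex in s).
grid = [i / 100 for i in range(1, 500)]
assert all(math.exp(sigma2 * s * s / 2 - s * t) >= bound - 1e-12 for s in grid)
print(bound)
```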

3.4 Closure Under Sums

Theorem. If X1,,XnX_1, \ldots, X_n are independent with XiX_i being σi2\sigma_i^2-sub-Gaussian and E[Xi]=0\mathbb{E}[X_i] = 0, then S=iXiS = \sum_i X_i is (iσi2)(\sum_i \sigma_i^2)-sub-Gaussian.

Proof. By independence:

E[etS]=i=1nE[etXi]i=1neσi2t2/2=e(iσi2)t2/2\mathbb{E}[e^{tS}] = \prod_{i=1}^n \mathbb{E}[e^{tX_i}] \leq \prod_{i=1}^n e^{\sigma_i^2 t^2/2} = e^{(\sum_i \sigma_i^2)t^2/2}

Corollary. For nn i.i.d. σ2\sigma^2-sub-Gaussian variables, the sample mean Xˉn=S/n\bar{X}_n = S/n satisfies P(Xˉnt)ent2/(2σ2)P(\bar{X}_n \geq t) \leq e^{-nt^2/(2\sigma^2)}.

For AI: Mini-batch gradient estimation. If each sample's contribution to the gradient is sub-Gaussian with parameter σ2\sigma^2, then the mini-batch gradient of size mm is σ2/m\sigma^2/m-sub-Gaussian - the noise decreases as 1/m1/m.


4. Hoeffding's Inequality

4.1 Hoeffding's Lemma

Hoeffding's lemma is the core technical ingredient: it establishes that any bounded, mean-zero random variable is sub-Gaussian.

Lemma (Hoeffding, 1963). Let XX be a random variable with E[X]=0\mathbb{E}[X] = 0 and X[a,b]X \in [a, b] almost surely. Then for all tRt \in \mathbb{R}:

E[etX]exp ⁣(t2(ba)28)\mathbb{E}[e^{tX}] \leq \exp\!\left(\frac{t^2(b-a)^2}{8}\right)

In other words, XX is (ba)24\frac{(b-a)^2}{4}-sub-Gaussian.

Proof sketch. Since [a,b][a, b] is a bounded interval and etxe^{tx} is convex in xx, we can bound etxe^{tx} by the chord from (a,eta)(a, e^{ta}) to (b,etb)(b, e^{tb}):

etxbxbaeta+xabaetbe^{tx} \leq \frac{b-x}{b-a} e^{ta} + \frac{x-a}{b-a} e^{tb}

Taking expectations (using E[X]=0\mathbb{E}[X] = 0, so E[(bX)/(ba)]=b/(ba)\mathbb{E}[(b-X)/(b-a)] = b/(b-a) and E[(Xa)/(ba)]=a/(ba)\mathbb{E}[(X-a)/(b-a)] = -a/(b-a)):

E[etX]bbaetaabaetb=eg(u)\mathbb{E}[e^{tX}] \leq \frac{b}{b-a} e^{ta} - \frac{a}{b-a} e^{tb} = e^{g(u)}

where u=t(ba)u = t(b-a) and g(u)=pu+log(1p+peu)g(u) = -pu + \log(1 - p + pe^u) with p=a/(ba)p = -a/(b-a). Expanding gg via Taylor series around u=0u=0 and bounding g(u)1/4g''(u) \leq 1/4 (achieved at p=1/2p = 1/2) gives g(u)u2/8=t2(ba)2/8g(u) \leq u^2/8 = t^2(b-a)^2/8.

Why (ba)2/8(b-a)^2/8? The factor 1/81/8 comes from the worst case p=1/2p = 1/2 (centred in the interval). For a {0,1}\{0,1\} Bernoulli minus its mean, ba=1b - a = 1 and the sub-Gaussian parameter is 1/41/4.
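Hoeffding's lemma can be spot-checked for a centred Bernoulli, whose MGF has a closed form. A sketch; p = 0.3 is an arbitrary choice:

```python
import math

p = 0.3  # X = Bern(p) - p takes values -p and 1-p, so b - a = 1

def mgf(t):
    """E[e^{tX}] for the centred Bernoulli."""
    return (1 - p) * math.exp(-t * p) + p * math.exp(t * (1 - p))

# Hoeffding's lemma: MGF dominated by exp(t^2 (b-a)^2 / 8) = exp(t^2 / 8).
for t in (-3.0, -1.0, 0.5, 2.0):
    assert mgf(t) <= math.exp(t * t / 8)
print("Hoeffding's lemma verified on sample points")
```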

4.2 Hoeffding's Inequality

Theorem (Hoeffding, 1963). Let X1,,XnX_1, \ldots, X_n be independent random variables with E[Xi]=μi\mathbb{E}[X_i] = \mu_i and Xi[ai,bi]X_i \in [a_i, b_i] almost surely. Then for all t>0t > 0:

P ⁣(i=1n(Xiμi)t)exp ⁣(2t2i=1n(biai)2)P\!\left(\sum_{i=1}^n (X_i - \mu_i) \geq t\right) \leq \exp\!\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)

Proof. Let Y_i = X_i - \mu_i (mean-zero, Y_i \in [a_i - \mu_i, b_i - \mu_i]). By Hoeffding's lemma, each Y_i is \frac{(b_i - a_i)^2}{4}-sub-Gaussian. By independence and closure under sums (Section 3.4), \sum Y_i is \frac{\sum(b_i-a_i)^2}{4}-sub-Gaussian. Applying the sub-Gaussian tail bound:

P ⁣(Yit)exp ⁣(t22(biai)24)=exp ⁣(2t2(biai)2)P\!\left(\sum Y_i \geq t\right) \leq \exp\!\left(-\frac{t^2}{2 \cdot \frac{\sum(b_i-a_i)^2}{4}}\right) = \exp\!\left(-\frac{2t^2}{\sum(b_i-a_i)^2}\right)

4.3 Two-Sided Hoeffding and Sample Complexity

For i.i.d. Xi[a,b]X_i \in [a, b] with mean μ\mu, the two-sided version follows by symmetry (apply one-sided to XX and X-X):

P(Xˉnμt)2exp ⁣(2nt2(ba)2)P(|\bar{X}_n - \mu| \geq t) \leq 2\exp\!\left(-\frac{2nt^2}{(b-a)^2}\right)

Reading the bound: The probability of a large deviation decays exponentially in nn and t2t^2. Doubling the sample size squares the exponential factor. Halving the allowed deviation quadruples the required nn.

4.4 Required Sample Size

One of the most practically useful results: how large must nn be to guarantee Xˉnμε|\bar{X}_n - \mu| \leq \varepsilon with probability at least 1δ1 - \delta?

Setting 2exp(2nt2/(ba)2)δ2\exp(-2nt^2/(b-a)^2) \leq \delta and solving for nn:

n(ba)2log(2/δ)2ε2n \geq \frac{(b-a)^2 \log(2/\delta)}{2\varepsilon^2}

Example. Coin flips (Xi{0,1}X_i \in \{0,1\}, ba=1b - a = 1): to achieve ε=0.01\varepsilon = 0.01 accuracy with δ=0.05\delta = 0.05 confidence:

nlog(40)0.000218,444n \geq \frac{\log(40)}{0.0002} \approx 18{,}444

So roughly 18,500 coin flips are needed to estimate a probability to within ±0.01\pm 0.01 with 95% confidence.
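The required-n formula is a one-liner. A sketch; `hoeffding_n` is a name introduced here:

```python
import math

def hoeffding_n(eps, delta, a=0.0, b=1.0):
    """Smallest n with 2 exp(-2 n eps^2 / (b - a)^2) <= delta."""
    return math.ceil((b - a)**2 * math.log(2 / delta) / (2 * eps**2))

print(hoeffding_n(0.01, 0.05))  # 18445 coin flips
```

Note the quadratic cost in accuracy: halving eps from 0.02 to 0.01 quadruples the required n.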

For AI: Benchmark evaluation. To measure LLM accuracy to within \pm 1\% at 95% confidence on a binary task (b-a=1), Hoeffding requires n \geq \log(40)/(2 \cdot 0.01^2) \approx 18{,}445 examples. This explains why serious LLM evaluations (MMLU, HumanEval, BIG-bench) use thousands of questions.


5. Chernoff Bounds

5.1 The Chernoff Method

The Chernoff method is a general technique for deriving exponential tail bounds using the MGF. It is more powerful than Hoeffding when the MGF has a known closed form.

The Chernoff method. For any random variable XX and t>0t > 0:

P(Xa)infs>0esaE[esX]=infs>0esaMX(s)P(X \geq a) \leq \inf_{s > 0} e^{-sa} \cdot \mathbb{E}[e^{sX}] = \inf_{s > 0} e^{-sa} M_X(s)

Proof. For any s>0s > 0: P(Xa)=P(esXesa)esaE[esX]P(X \geq a) = P(e^{sX} \geq e^{sa}) \leq e^{-sa} \mathbb{E}[e^{sX}] by Markov. Optimise over ss.

The infimum over ss gives the tightest bound. The optimal ss^* satisfies MX(s)/MX(s)=aM_X'(s^*)/M_X(s^*) = a, i.e., the mean of XX under the tilted distribution equals aa.

Recall: MX(t)=E[etX]M_X(t) = \mathbb{E}[e^{tX}] was developed in Section04 Expectation and Moments. The key property used here is MX+Y(t)=MX(t)MY(t)M_{X+Y}(t) = M_X(t) \cdot M_Y(t) for independent X,YX, Y.

5.2 Chernoff for Bernoulli Sums

Let X1,,XnBern(p)X_1, \ldots, X_n \sim \operatorname{Bern}(p) i.i.d. and S=i=1nXiBin(n,p)S = \sum_{i=1}^n X_i \sim \operatorname{Bin}(n, p) with mean μ=np\mu = np.

The MGF of SS is MS(t)=(1p+pet)nM_S(t) = (1 - p + pe^t)^n.

Upper Chernoff bound. For δ>0\delta > 0:

P(S(1+δ)μ)(eδ(1+δ)1+δ)μP(S \geq (1+\delta)\mu) \leq \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu

Derivation. Apply the Chernoff method with a=(1+δ)μa = (1+\delta)\mu:

P(S(1+δ)μ)infs>0es(1+δ)μMS(s)P(S \geq (1+\delta)\mu) \leq \inf_{s>0} e^{-s(1+\delta)\mu} M_S(s)

Substituting MS(s)=(1p+pes)neμ(es1)M_S(s) = (1-p+pe^s)^n \leq e^{\mu(e^s-1)} (using 1+xex1 + x \leq e^x):

\leq \inf_{s>0} \exp(\mu(e^s - 1) - s(1+\delta)\mu) = \inf_{s>0} \exp(\mu(e^s - 1 - s(1+\delta)))

Optimising: s=log(1+δ)s^* = \log(1+\delta), giving the stated bound.

5.3 Multiplicative Chernoff Form

For δ(0,1)\delta \in (0, 1), a simpler but slightly looser form:

P(S(1+δ)μ)eμδ2/3P(S \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3}

This follows from the inequality (eδ(1+δ)1+δ)μeμδ2/3\left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu \leq e^{-\mu\delta^2/3}.

Comparison with Hoeffding. Hoeffding (applied to Bin(n,p)\operatorname{Bin}(n,p) with ba=1b-a=1) gives:

P(S(1+δ)μ)=P(Xˉp(1+δ))exp(2n(pδ)2)P(S \geq (1+\delta)\mu) = P(\bar{X} \geq p(1+\delta)) \leq \exp(-2n(p\delta)^2)

Chernoff gives \exp(-\mu\delta^2/3) = \exp(-np\delta^2/3): its exponent is linear in p, while Hoeffding's is quadratic in p. For small p (rare events), np \gg np^2, so Chernoff's dependence on \mu = np makes it far tighter.
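Plugging in numbers for a rare event shows the gap. A sketch with assumed n = 10,000 and p = 0.001:

```python
import math

n, p, delta = 10_000, 0.001, 1.0  # rare event: mu = np = 10
mu = n * p

chernoff = math.exp(-mu * delta**2 / 3)        # exponent scales with mu = np
hoeffding = math.exp(-2 * n * (p * delta)**2)  # exponent scales with n p^2

# Chernoff gives a meaningful bound; Hoeffding is nearly vacuous here.
print(chernoff, hoeffding)
```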

5.4 Lower Tail Chernoff

For the lower tail, for δ(0,1)\delta \in (0, 1):

P(S(1δ)μ)eμδ2/2P(S \leq (1-\delta)\mu) \leq e^{-\mu\delta^2/2}

Note the factor of 1/21/2 vs 1/31/3 in the exponent - the lower tail is slightly tighter.

Application: balls into bins. When nn items are assigned to mm bins uniformly at random, the expected load per bin is μ=n/m\mu = n/m. Chernoff bounds control the maximum load - crucial for hashing and load-balancing proofs in distributed systems.
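A quick simulation of the balls-into-bins setup (a sketch with NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, m_bins = 100_000, 100

# Assign each item to a uniformly random bin and tally the loads.
loads = np.bincount(rng.integers(0, m_bins, size=n_items), minlength=m_bins)
mu = n_items / m_bins  # expected load per bin = 1000

# Chernoff predicts every load stays within a small multiple of mu.
print(loads.max(), loads.min(), mu)
```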


6. Bernstein's Inequality

6.1 Sub-Exponential Random Variables

Some random variables have heavier tails than sub-Gaussian but still lighter than arbitrary - these are sub-exponential.

Definition. A mean-zero random variable XX is sub-exponential with parameters (ν2,b)(\nu^2, b) if:

E[etX]eν2t2/2for all t1/b\mathbb{E}[e^{tX}] \leq e^{\nu^2 t^2/2} \quad \text{for all } |t| \leq 1/b

Sub-exponential tails decay as et/be^{-t/b} (exponential), slower than Gaussian et2/(2σ2)e^{-t^2/(2\sigma^2)}.

Examples:

  • Exponential(1), centred: sub-exponential with (\nu^2, b) = (4, 2) (the exact constants depend on convention)
  • χk2\chi^2_k: sub-exponential (sum of squared Gaussians)
  • Any bounded variable is also sub-exponential

6.2 Bernstein's Inequality

Theorem (Bernstein, 1924/1946). Let X1,,XnX_1, \ldots, X_n be independent mean-zero variables with Xic|X_i| \leq c and i=1nE[Xi2]=ν2\sum_{i=1}^n \mathbb{E}[X_i^2] = \nu^2. Then for all t>0t > 0:

P ⁣(i=1nXit)exp ⁣(t2/2ν2+ct/3)P\!\left(\sum_{i=1}^n X_i \geq t\right) \leq \exp\!\left(-\frac{t^2/2}{\nu^2 + ct/3}\right)

Intuition. This interpolates between two regimes:

  • Small tt (tν2/ct \ll \nu^2/c): Denominator ν2\approx \nu^2, gives et2/(2ν2)e^{-t^2/(2\nu^2)} - sub-Gaussian with variance ν2\nu^2
  • Large tt (tν2/ct \gg \nu^2/c): Denominator ct/3\approx ct/3, gives e3t/(2c)e^{-3t/(2c)} - exponential tail

6.3 Bernstein vs Hoeffding

For nn i.i.d. variables with Xic|X_i| \leq c and sample variance s2=1nE[Xi2]s^2 = \frac{1}{n}\sum \mathbb{E}[X_i^2]:

Bound | Formula (for P(\bar{X} - \mu \geq t)) | Better when
Hoeffding | \exp(-2nt^2/(2c)^2) = \exp(-nt^2/(2c^2)) | No variance info
Bernstein | \exp(-nt^2/(2(s^2 + ct/3))) | s^2 \ll c^2 (small variance)

When Bernstein wins. If s2c2s^2 \ll c^2 (data has small variance despite large range), Bernstein gives exp(nt2/(2s2))\exp(-nt^2/(2s^2)) for small tt, while Hoeffding gives exp(nt2/(2c2))\exp(-nt^2/(2c^2)). Since s2c2s^2 \ll c^2, Bernstein is exponentially tighter.

Example. Suppose Xi[1,1]X_i \in [-1, 1] but Var(Xi)=0.01\operatorname{Var}(X_i) = 0.01. Hoeffding uses range (ba)2=4(b-a)^2 = 4, while Bernstein uses variance 0.010.01 - a 400x improvement in the exponent for small tt.
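Plugging the example's numbers into the mean-deviation forms of the two bounds (a sketch; n = 1000 and t = 0.05 are assumed here, with c = 1 and s^2 = 0.01 from the example):

```python
import math

n, c, s2, t = 1000, 1.0, 0.01, 0.05  # wide range, tiny variance

hoeffding = math.exp(-n * t**2 / (2 * c**2))          # only uses the range
bernstein = math.exp(-n * t**2 / (2 * (s2 + c * t / 3)))  # uses the variance

# Bernstein's exponent is ~37x larger, so its bound is astronomically smaller.
print(hoeffding, bernstein)
```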

6.4 Application to Gradient Noise

In SGD, the stochastic gradient g^t=1miBt(wt;xi)\hat{g}_t = \frac{1}{m}\sum_{i \in B_t} \nabla \ell(w_t; x_i) estimates the true gradient gt=L(wt)g_t = \nabla \mathcal{L}(w_t).

If each (w;x)\nabla\ell(w; x) is bounded: (w;x)2G\|\nabla\ell(w; x)\|_2 \leq G, and the sample gradient variance is σg2\sigma_g^2, then Bernstein bounds:

P ⁣(g^tgt2t)dexp ⁣(mt2/2dσg2+Gt/3)P\!\left(\|\hat{g}_t - g_t\|_2 \geq t\right) \leq d \cdot \exp\!\left(-\frac{mt^2/2}{d\sigma_g^2 + Gt/3}\right)

where the factor dd comes from a union bound over coordinates. This bound shows why variance reduction methods (SVRG, SAGA) that reduce σg2\sigma_g^2 improve convergence more dramatically than reducing GG alone.


7. McDiarmid's Inequality

7.1 Bounded Differences Condition

McDiarmid's inequality generalises Hoeffding from sums to general functions.

Definition (Bounded Differences). A function f:XnRf: \mathcal{X}^n \to \mathbb{R} satisfies the bounded differences condition with constants c1,,cn>0c_1, \ldots, c_n > 0 if for all i[n]i \in [n] and all x1,,xn,xiXx_1, \ldots, x_n, x_i' \in \mathcal{X}:

f(x1,,xi,,xn)f(x1,,xi,,xn)ci|f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n)| \leq c_i

Intuitively: changing the ii-th coordinate by any amount changes ff by at most cic_i.

7.2 McDiarmid's Inequality

Theorem (McDiarmid, 1989). Let X1,,XnX_1, \ldots, X_n be independent random variables and let ff satisfy the bounded differences condition with constants c1,,cnc_1, \ldots, c_n. Then for all t>0t > 0:

P(f(X1,,Xn)E[f]t)exp ⁣(2t2i=1nci2)P(f(X_1, \ldots, X_n) - \mathbb{E}[f] \geq t) \leq \exp\!\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right)

Proof sketch (martingale method). Define the Doob martingale Zk=E[fX1,,Xk]Z_k = \mathbb{E}[f \mid X_1, \ldots, X_k] with Z0=E[f]Z_0 = \mathbb{E}[f] and Zn=fZ_n = f. The differences Dk=ZkZk1D_k = Z_k - Z_{k-1} satisfy Dkck|D_k| \leq c_k by the bounded differences assumption. Apply Azuma's inequality (concentration for bounded martingale differences) to get the result.

7.3 Relationship to Hoeffding

Hoeffding's inequality is a special case of McDiarmid's. For f(X1,,Xn)=1ni=1nXif(X_1, \ldots, X_n) = \frac{1}{n}\sum_{i=1}^n X_i with Xi[ai,bi]X_i \in [a_i, b_i], changing XiX_i changes ff by at most (biai)/n(b_i - a_i)/n, so ci=(biai)/nc_i = (b_i - a_i)/n. Then:

ici2=i(biai)2n2\sum_i c_i^2 = \frac{\sum_i (b_i - a_i)^2}{n^2}

Substituting into McDiarmid recovers exactly Hoeffding's bound.

7.4 Applications to ML Stability

Training loss with missing data. Let D={z1,,zn}\mathcal{D} = \{z_1, \ldots, z_n\} be a training set and R^(D)=1ni=1n(h,zi)\hat{R}(\mathcal{D}) = \frac{1}{n}\sum_{i=1}^n \ell(h, z_i) the empirical risk with [0,1]\ell \in [0, 1]. Swapping one training example changes R^\hat{R} by at most 1/n1/n. McDiarmid gives:

P(R^E[R^]t)2exp(2nt2)P(|\hat{R} - \mathbb{E}[\hat{R}]| \geq t) \leq 2\exp(-2nt^2)
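The c_i = 1/n bounded-differences property can be checked directly. A sketch with NumPy; the losses are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
losses = rng.uniform(0.0, 1.0, size=n)  # per-example losses in [0, 1]
r_hat = losses.mean()

# Swapping any single example moves the empirical risk by at most 1/n.
for i in range(n):
    swapped = losses.copy()
    swapped[i] = rng.uniform(0.0, 1.0)
    assert abs(swapped.mean() - r_hat) <= 1.0 / n + 1e-12
print("bounded differences with c_i = 1/n confirmed")
```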

Dropout. With dropout probability pp, each neuron's activation aia_i is scaled by 1[kept]/(1p)\mathbf{1}[\text{kept}] / (1-p). The network output f(z1,,zd)f(z_1, \ldots, z_d) changes by at most ci=wiaimax/(1p)c_i = |w_i| \cdot a_i^{\max} / (1-p) when the ii-th neuron is toggled. McDiarmid bounds the output variance from dropout noise.

Data augmentation. If augmenting a single training example can change the training loss by at most cc, McDiarmid bounds the sensitivity of the trained model to augmentation choices.


8. The Union Bound and Covering Arguments

8.1 Union Bound

Lemma (Boole's / Union Bound). For any events A1,,AkA_1, \ldots, A_k (not necessarily independent):

P ⁣(i=1kAi)i=1kP(Ai)P\!\left(\bigcup_{i=1}^k A_i\right) \leq \sum_{i=1}^k P(A_i)

Proof. By induction: P(AB)=P(A)+P(B)P(AB)P(A)+P(B)P(A \cup B) = P(A) + P(B) - P(A \cap B) \leq P(A) + P(B).

The union bound is exact when events are disjoint and pessimistic otherwise. Despite its looseness, it is indispensable because it requires no assumption about dependence between events.

For AI: The union bound appears in virtually every multi-class or multi-hypothesis analysis. PAC theory for a finite hypothesis class H\mathcal{H} asks for the probability that any hHh \in \mathcal{H} is bad - union bound over all H|\mathcal{H}| hypotheses.

8.2 Multiple Hypothesis Testing

Consider mm statistical tests each with individual significance level α\alpha. If we want the probability that at least one false positive occurs to be at most δ\delta, the Bonferroni correction sets each test at level δ/m\delta/m.

In ML: When training mm models and reporting the best, the union bound says the best result can be a false positive with probability up to mαm\alpha. This is the multiple comparison problem underlying claims like "our method beats 10 baselines on 5 datasets."

Balancing union bound and concentration. For kk bad events AiA_i each with P(Ai)εP(A_i) \leq \varepsilon:

P(i:Ai)kεP(\exists i: A_i) \leq k\varepsilon

To make this δ\leq \delta: need each P(Ai)δ/kP(A_i) \leq \delta/k. In PAC theory, this costs an extra logk\log k in the required sample size.

8.3 Covering Numbers and \varepsilon-Nets

The union bound can only be applied to finite collections of events. For continuous hypothesis spaces (e.g., all linear classifiers, all neural networks), we need to discretise first.

Definition. An ε\varepsilon-net of a set H\mathcal{H} under metric dd is a finite set NεH\mathcal{N}_\varepsilon \subseteq \mathcal{H} such that for every hHh \in \mathcal{H}, there exists h^Nε\hat{h} \in \mathcal{N}_\varepsilon with d(h,h^)εd(h, \hat{h}) \leq \varepsilon.

The covering number N(H,d,ε)\mathcal{N}(\mathcal{H}, d, \varepsilon) is the size of the smallest ε\varepsilon-net.

Covering number of a ball. The 2\ell_2-ball {w:w2R}\{\mathbf{w}: \|\mathbf{w}\|_2 \leq R\} in Rd\mathbb{R}^d has covering number:

N(BR,2,ε)(2Rε+1)d(3Rε)d\mathcal{N}(\mathcal{B}_R, \ell_2, \varepsilon) \leq \left(\frac{2R}{\varepsilon} + 1\right)^d \leq \left(\frac{3R}{\varepsilon}\right)^d

This grows polynomially in 1/ε1/\varepsilon but exponentially in dd - the curse of dimensionality for covering.

8.4 The Net Argument

The standard approach for continuous hypothesis classes:

  1. Build an ε\varepsilon-net Nε\mathcal{N}_\varepsilon with Nε=N(H,d,ε)|\mathcal{N}_\varepsilon| = \mathcal{N}(\mathcal{H}, d, \varepsilon)
  2. Apply concentration + union bound over the net: control suphNεR^(h)R(h)\sup_{h \in \mathcal{N}_\varepsilon} |\hat{R}(h) - R(h)|
  3. Use Lipschitz continuity to extend from net to all of H\mathcal{H}

The final bound pays logN(H,d,ε)\log \mathcal{N}(\mathcal{H}, d, \varepsilon) extra compared to a single hypothesis, plus the approximation error ε\varepsilon from the net.

Johnson-Lindenstrauss Lemma. As an application of covering, if m=O(log(n)/ε2)m = O(\log(n)/\varepsilon^2) random projections are used, nn points in Rd\mathbb{R}^d embed into Rm\mathbb{R}^m with (1±ε)(1\pm\varepsilon) distortion of all pairwise distances, with high probability. This is proved using the union bound over all (n2)\binom{n}{2} pairs.
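A small Johnson-Lindenstrauss demonstration with a Gaussian random projection (a sketch; the constant 8 in the choice of m is one common convention, not canonical):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, eps = 50, 1000, 0.5
m = int(8 * np.log(n) / eps**2)  # O(log n / eps^2) target dimension

X = rng.normal(size=(n, d))
P = rng.normal(size=(d, m)) / np.sqrt(m)  # random Gaussian projection
Y = X @ P

# Fraction of pairs whose squared distance is preserved to (1 +/- eps).
pairs = list(combinations(range(n), 2))
ok = sum(
    (1 - eps) * np.sum((X[i] - X[j])**2)
    <= np.sum((Y[i] - Y[j])**2)
    <= (1 + eps) * np.sum((X[i] - X[j])**2)
    for i, j in pairs
)
print(m, ok / len(pairs))  # 1000 dims squeezed into ~125, low distortion
```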


9. PAC Learning and Generalisation Bounds

9.1 The PAC Framework

The Probably Approximately Correct (PAC) framework, introduced by Valiant (1984), provides a formal model for machine learning under uncertainty.

Setup: A learner observes nn i.i.d. examples {(x(i),y(i))}\{(x^{(i)}, y^{(i)})\} from an unknown distribution D\mathcal{D} over X×Y\mathcal{X} \times \mathcal{Y}. The learner selects a hypothesis hh from a hypothesis class H\mathcal{H}.

True risk: R(h)=P(x,y)D(h(x)y)=E[1[h(x)y]]R(h) = P_{(x,y) \sim \mathcal{D}}(h(x) \neq y) = \mathbb{E}[\mathbf{1}[h(x) \neq y]]

Empirical risk: R^(h)=1ni=1n1[h(x(i))y(i)]\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[h(x^{(i)}) \neq y^{(i)}]

PAC guarantee: A learning algorithm A\mathcal{A} is PAC-learning H\mathcal{H} if for any ε,δ>0\varepsilon, \delta > 0 and any distribution D\mathcal{D}, given nn0(ε,δ)n \geq n_0(\varepsilon, \delta) examples, with probability at least 1δ1-\delta:

R(A(D))minhHR(h)+εR(\mathcal{A}(\mathcal{D})) \leq \min_{h \in \mathcal{H}} R(h) + \varepsilon

The sample complexity n0(ε,δ)n_0(\varepsilon, \delta) is the central quantity: how many examples are needed?

9.2 Finite Hypothesis Class

Theorem. Let H\mathcal{H} be a finite hypothesis class with H<|\mathcal{H}| < \infty. For any ERM algorithm (minimising R^(h)\hat{R}(h)) and any ε,δ>0\varepsilon, \delta > 0, if:

nlog(H/δ)2ε2n \geq \frac{\log(|\mathcal{H}|/\delta)}{2\varepsilon^2}

then with probability at least 1δ1-\delta, the selected hypothesis h^\hat{h} satisfies R(h^)minhR(h)+εR(\hat{h}) \leq \min_{h^*} R(h^*) + \varepsilon (in the realisable case with R^(h)=0\hat{R}(h^*) = 0).

Proof. Consider the event "bad hh": Bh={R(h)>ε but R^(h)=0}B_h = \{R(h) > \varepsilon \text{ but } \hat{R}(h) = 0\}.

Since R(h) > \varepsilon, each of the n examples independently lands outside h's error region with probability at most 1 - \varepsilon, so P(B_h) \leq (1-\varepsilon)^n \leq e^{-n\varepsilon}.

By union bound: P(hH:Bh)HenεP(\exists h \in \mathcal{H}: B_h) \leq |\mathcal{H}| \cdot e^{-n\varepsilon}.

Setting this \leq \delta and solving gives n \geq \frac{\log(|\mathcal{H}|/\delta)}{\varepsilon} in the realisable case. The \varepsilon^2 version in the theorem statement covers the agnostic (non-realisable) case, where two-sided Hoeffding replaces the (1-\varepsilon)^n computation.

9.3 Uniform Convergence

Theorem (Uniform Convergence). For finite $\mathcal{H}$ and losses in $[0, 1]$, if $n \geq \frac{\log(2|\mathcal{H}|/\delta)}{2\varepsilon^2}$, then:

$$P\!\left(\sup_{h \in \mathcal{H}} |\hat{R}(h) - R(h)| > \varepsilon\right) \leq \delta$$

Proof. Fix any $h \in \mathcal{H}$. By Hoeffding: $P(|\hat{R}(h) - R(h)| > \varepsilon) \leq 2e^{-2n\varepsilon^2}$.

By the union bound over all $|\mathcal{H}|$ hypotheses: $P(\sup_h |\hat{R}(h) - R(h)| > \varepsilon) \leq 2|\mathcal{H}| e^{-2n\varepsilon^2} \leq \delta$ for the stated $n$.

Consequence: If uniform convergence holds, then for ERM $\hat{h} = \arg\min_h \hat{R}(h)$ and the best hypothesis $h^* = \arg\min_h R(h)$:

$$R(\hat{h}) \leq \hat{R}(\hat{h}) + \varepsilon \leq \hat{R}(h^*) + \varepsilon \leq R(h^*) + 2\varepsilon$$

The ERM generalises to within $2\varepsilon$ of the best hypothesis.
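The uniform-convergence bound can be stress-tested by Monte Carlo. The sketch below (my own toy construction: thresholds on $[0,1]$ with a noiseless labelling rule) checks how often the sup-deviation over the class exceeds the Hoeffding-plus-union-bound $\varepsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(0, 1, 21)       # finite class: h_t(x) = 1[x > t]
n, trials, delta = 500, 200, 0.05

# Toy ground truth: x ~ U[0,1], y = 1[x > 0.5]. h_t disagrees with the truth
# exactly on the interval between t and 0.5, so R(h_t) = |t - 0.5|.
true_risk = np.abs(thresholds - 0.5)

# Uniform-convergence epsilon: sqrt(log(2|H|/delta) / (2n)) ~ 0.082 here.
eps = np.sqrt(np.log(2 * len(thresholds) / delta) / (2 * n))

exceed = 0
for _ in range(trials):
    x = rng.uniform(size=n)
    y = (x > 0.5).astype(int)
    emp_risk = np.array([np.mean((x > t).astype(int) != y) for t in thresholds])
    if np.max(np.abs(emp_risk - true_risk)) > eps:
        exceed += 1

# The theorem promises exceedance in at most a delta fraction of runs;
# in practice the union bound is loose and exceedances are rarer still.
```

That the empirical exceedance rate sits far below $\delta$ illustrates a recurring theme: union bounds are worst-case over correlated events and are rarely tight.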

9.4 VC Dimension

For infinite hypothesis classes, we need a different complexity measure.

Definition. A hypothesis class $\mathcal{H}$ shatters a set $C = \{x_1, \ldots, x_k\}$ if for every labelling $y \in \{0,1\}^k$, there exists $h \in \mathcal{H}$ with $h(x_i) = y_i$ for all $i$.

The VC dimension $d_{VC}(\mathcal{H})$ is the size of the largest set shattered by $\mathcal{H}$.

Examples:

  • Threshold classifiers on $\mathbb{R}$: $d_{VC} = 1$ (can shatter any single point but no pair)
  • Linear classifiers (halfspaces) in $\mathbb{R}^d$: $d_{VC} = d + 1$
  • Degree-$k$ polynomial classifiers in $\mathbb{R}^d$: $d_{VC} = \binom{d+k}{k}$
  • Circles (interiors of discs) in $\mathbb{R}^2$: $d_{VC} = 3$
  • Feedforward ReLU networks with $W$ parameters and $L$ layers: $d_{VC} = O(WL \log W)$ — roughly linear in the parameter count for fixed depth

Sauer-Shelah Lemma. The growth function $\Pi_\mathcal{H}(n) = \max_{x_1,\ldots,x_n} |\{(h(x_1),\ldots,h(x_n)) : h \in \mathcal{H}\}|$ satisfies:

$$\Pi_\mathcal{H}(n) \leq \sum_{i=0}^{d} \binom{n}{i} \leq \left(\frac{en}{d}\right)^d$$

for $n \geq d = d_{VC}(\mathcal{H})$. This says: an $n$-point set can be labelled in at most $(en/d)^d$ ways by $\mathcal{H}$ — polynomial in $n$, not $2^n$ — even if $|\mathcal{H}| = \infty$.
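A quick check of the lemma for the simplest class above. Thresholds on $\mathbb{R}$ have $d_{VC} = 1$, and their growth function is exactly $n+1$ (on $n$ sorted points the realisable patterns are $1\cdots1$, $01\cdots1$, ..., $0\cdots0$), which indeed sits between the Sauer-Shelah sum and the loose $(en/d)^d$ form:

```python
import math

def sauer_bound(n, d):
    """Sauer-Shelah: sum_{i<=d} C(n, i), plus the looser (en/d)^d for n >= d."""
    s = sum(math.comb(n, i) for i in range(d + 1))
    loose = (math.e * n / d) ** d
    return s, loose

# For d = 1 the sum is 1 + n, matching the exact growth function of thresholds.
for n in [1, 5, 50, 1000]:
    s, loose = sauer_bound(n, 1)
    assert (n + 1) == s <= loose   # polynomial in n, nowhere near 2^n
```

For contrast, `sauer_bound(50, 2)` gives $1 + 50 + \binom{50}{2} = 1276$ realisable patterns out of $2^{50}$ possible labellings.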

9.5 Generalisation Bound with VC Dimension

Theorem (VC Generalisation Bound). For any $\delta > 0$, with probability at least $1-\delta$ over $n$ i.i.d. training examples, every $h \in \mathcal{H}$ satisfies:

$$R(h) \leq \hat{R}(h) + \sqrt{\frac{d(\log(2n/d) + 1) + \log(4/\delta)}{2n}}$$

where $d = d_{VC}(\mathcal{H})$.

Sample complexity. Setting the square-root term to $\varepsilon$ and solving for $n$:

$$n = O\!\left(\frac{d \log(1/\varepsilon) + \log(1/\delta)}{\varepsilon^2}\right)$$

and a sharper analysis removes the logarithmic factor, giving $n = O\!\left(\frac{d + \log(1/\delta)}{\varepsilon^2}\right)$.

The $\log|\mathcal{H}|$ from the finite case is replaced by $d_{VC}$. For a $d$-dimensional linear classifier ($d_{VC} = d+1$), we need $n = O(d/\varepsilon^2)$ — linear in dimension, not exponential.
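To get a feel for the numbers, the square-root term of the VC bound can be evaluated directly (a sketch; the function name and the choice of a linear classifier in $\mathbb{R}^{100}$ are mine):

```python
import math

def vc_bound_gap(n, d, delta):
    """The excess-risk term of the VC generalisation bound:
    sqrt((d (log(2n/d) + 1) + log(4/delta)) / (2n))."""
    return math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / (2 * n))

# Linear classifier in R^100, so d_VC = 101; delta = 0.05.
g_small = vc_bound_gap(10**3, 101, 0.05)   # ~0.45: nearly vacuous
g_large = vc_bound_gap(10**6, 101, 0.05)   # ~0.02: a meaningful guarantee
```

The gap shrinks like $\sqrt{d \log n / n}$: with $n = 10^3$ examples the guarantee is nearly vacuous, while $n = 10^6$ pins the true risk to within a couple of percent of the empirical risk.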

For AI (2026): Modern transformers have billions of parameters, implying an enormous VC dimension that renders the bound above vacuous. Yet they generalise. This "paradox" is partly resolved by:

  1. Implicit regularisation: gradient descent is biased toward low-norm solutions
  2. Overparameterisation: the interpolation threshold and double descent (covered in Section 4)
  3. PAC-Bayes: data-dependent bounds that are tighter for specific algorithms

10. Rademacher Complexity

10.1 Definition

Rademacher complexity measures the richness of a function class on the actual data, making it data-dependent and often tighter than VC bounds.

Definition. Given a sample $S = (x^{(1)}, \ldots, x^{(n)})$ and a function class $\mathcal{F}$, the empirical Rademacher complexity is:

$$\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}}\!\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x^{(i)})\right]$$

where $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_n)$ with $\sigma_i \sim \operatorname{Uniform}(\{-1, +1\})$ i.i.d. (Rademacher variables).

The Rademacher complexity is $\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{F})]$.

Interpretation. $\hat{\mathfrak{R}}_S(\mathcal{F})$ measures how well $\mathcal{F}$ can correlate with random $\pm 1$ labels on the training points. If the class is rich enough to fit any labelling, $\hat{\mathfrak{R}}_S \approx 1$ (for classes taking values in $[-1, 1]$). If it can only fit structured labels, $\hat{\mathfrak{R}}_S \approx 0$.
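Because the definition is an expectation over random signs, it invites Monte Carlo estimation. The sketch below (my own helper; it represents a finite function class by its value matrix on the sample) confirms the two extremes of the interpretation:

```python
import itertools
import numpy as np

def empirical_rademacher(fvals, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    fvals: (m, n) array; row j holds (f_j(x^(1)), ..., f_j(x^(n))) for a
    finite (sub)class of m functions evaluated on the n sample points.
    """
    rng = np.random.default_rng(seed)
    n = fvals.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))    # random sign vectors
    # For each sign vector, sup over the class of (1/n) * sum_i sigma_i f(x_i):
    sups = (sigma @ fvals.T / n).max(axis=1)
    return sups.mean()

n = 10
# Maximally rich class: realises every sign pattern on n points -> complexity 1.
rich = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
r_rich = empirical_rademacher(rich)

# Trivial class: one constant function -> complexity ~ 0 (signs average out).
trivial = np.ones((1, n))
r_trivial = empirical_rademacher(trivial)
```

For the rich class the sup always picks the row equal to the sign vector itself, so the estimate is exactly 1; for the constant class it is the mean of $\frac{1}{n}\sum_i \sigma_i$ over draws, which hovers near 0.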

10.2 Rademacher Generalisation Bound

Theorem. For any function class $\mathcal{F}$ with values in $[0, 1]$, with probability at least $1-\delta$:

$$R(f) \leq \hat{R}(f) + 2\mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}}$$

for all $f \in \mathcal{F}$ simultaneously.

Comparison to VC. The VC bound pays roughly $\sqrt{d/n}$; the Rademacher bound pays $2\mathfrak{R}_n(\mathcal{F})$. For data distributions on which the function class is "effectively smaller" than its worst case, the Rademacher bound can be significantly tighter.

10.3 Rademacher of Linear Classifiers

For linear predictors $\mathcal{F} = \{\mathbf{x} \mapsto \langle \mathbf{w}, \mathbf{x} \rangle : \|\mathbf{w}\|_2 \leq B\}$ on data with $\|\mathbf{x}^{(i)}\|_2 \leq C$:

$$\hat{\mathfrak{R}}_S(\mathcal{F}) \leq \frac{B}{n}\sqrt{\sum_{i=1}^n \|\mathbf{x}^{(i)}\|_2^2} \leq \frac{BC}{\sqrt{n}}$$

Derivation. Cauchy-Schwarz resolves the sup over $\|\mathbf{w}\|_2 \leq B$, then Jensen's inequality and $\mathbb{E}[\sigma_i \sigma_j] = \mathbf{1}[i = j]$ give:

$$\hat{\mathfrak{R}}_S(\mathcal{F}) = \frac{B}{n}\mathbb{E}_\sigma\!\left[\left\|\sum_{i=1}^n \sigma_i \mathbf{x}^{(i)}\right\|_2\right] \leq \frac{B}{n}\sqrt{\mathbb{E}_\sigma\!\left\|\sum_{i=1}^n \sigma_i \mathbf{x}^{(i)}\right\|_2^2} = \frac{B}{n}\sqrt{\sum_{i=1}^n \|\mathbf{x}^{(i)}\|_2^2} \leq \frac{BC}{\sqrt{n}}$$

This is $O(1/\sqrt{n})$, independent of the ambient dimension $d$! This explains why linear classifiers generalise well even in high dimensions, as long as the norm $\|\mathbf{w}\|_2$ is controlled.
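The first equality in the derivation makes this bound directly checkable: a Monte Carlo estimate of $\frac{B}{n}\mathbb{E}_\sigma\|\sum_i \sigma_i \mathbf{x}^{(i)}\|_2$ should land below $BC/\sqrt{n}$. A sketch with synthetic data (the dimensions, norms, and seed are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, B, C = 200, 5, 2.0, 1.0
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # enforce ||x_i||_2 = C = 1

# By Cauchy-Schwarz, sup_{||w|| <= B} <w, v> = B ||v||_2, so the empirical
# Rademacher complexity equals (B/n) E_sigma || sum_i sigma_i x_i ||_2.
sigma = rng.choice([-1.0, 1.0], size=(5000, n))
rad_mc = B * np.linalg.norm(sigma @ X, axis=1).mean() / n

bound = B * C / np.sqrt(n)                      # the BC / sqrt(n) bound
```

The estimate typically sits just below the bound: the only slack in the derivation is Jensen's inequality, which is nearly tight when $\|\sum_i \sigma_i \mathbf{x}^{(i)}\|_2$ concentrates.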

10.4 Rademacher vs VC

| Property | VC Dimension | Rademacher Complexity |
|---|---|---|
| Distribution | Distribution-free | Data-dependent |
| Calculation | Combinatorial | Expectation over random signs |
| Estimation | Hard (often NP-hard) | Monte Carlo approximation possible |
| Tightness | Worst-case | Often tighter in practice |
| Handles regression | No (classification only) | Yes (arbitrary bounded losses) |
| Modern NNs | Vacuous (too large $d$) | Can be finite for sparse/low-rank models |

For AI: Norm-based Rademacher bounds can be finite (non-vacuous) for attention heads with bounded query/key norms, which helps explain why transformers trained with weight decay generalise despite huge parameter counts.

