
Concentration Inequalities, Part 3: Appendix O (Visualisation Notes) to Supplementary (Concentration in High Dimensions)

Appendix O: Visualisation Notes

O.1 Plotting Concentration Inequalities

When visualising tail bounds, always show:

  1. The exact tail probability (computed numerically) as the ground truth
  2. The bound as a curve above the exact probability
  3. The ratio (bound/exact) to show how loose the bound is
TAIL BOUND TIGHTNESS - Gaussian(0,1)
========================================================================

  t   | P(X >= t) exact | Markov* | Chebyshev | Hoeffding | Bound/Exact
  --------------------------------------------------------------------
  1   | 0.159  | N/A    | 1.000    | 0.607     | 6.3x / 3.8x
  2   | 0.023  | N/A    | 0.250    | 0.135     | 11x / 5.9x
  3   | 0.0013 | N/A    | 0.111    | 0.011     | 85x / 8.5x
  4   | 3.2e-5 | N/A    | 0.0625   | 3.4e-4    | 1953x / 11x
  5   | 2.9e-7 | N/A    | 0.040    | 3.7e-6    | 138,000x / 13x

  *Markov requires X >= 0; not directly applicable to the symmetric Gaussian

========================================================================

At t = 5\sigma: Chebyshev overestimates by roughly 138,000x, while the sub-Gaussian bound is off by only about 13x. Hoeffding for the Gaussian is essentially tight in the exponent - the Gaussian is the sub-Gaussian equality case, and the remaining gap is just the polynomial prefactor in the exact tail.
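A minimal sketch of how such a table can be generated (assuming NumPy and SciPy are available; norm.sf gives the exact Gaussian tail):

import numpy as np
from scipy.stats import norm

# Exact standard-Gaussian tail vs Chebyshev and sub-Gaussian (Hoeffding-style) bounds.
for t in [1, 2, 3, 4, 5]:
    exact = norm.sf(t)                 # ground truth P(X >= t)
    chebyshev = 1.0 / t**2             # Var(X) / t^2 with Var(X) = 1
    sub_gaussian = np.exp(-t**2 / 2)   # e^{-t^2/(2 sigma^2)} with sigma^2 = 1
    print(f"t={t}  exact={exact:.2e}  Chebyshev={chebyshev:.3f} "
          f"({chebyshev / exact:,.0f}x)  sub-Gaussian={sub_gaussian:.2e} "
          f"({sub_gaussian / exact:.1f}x)")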

O.2 Sample Complexity Curves

The sample complexity n(\varepsilon, \delta) = \frac{\log(2/\delta)}{2\varepsilon^2} has a characteristic shape:

  • O(1/\varepsilon^2): doubling precision requires 4x more data
  • O(\log(1/\delta)): going from 95% to 99.9% confidence multiplies the sample size by only \log(2000)/\log(40) \approx 2.1 - logarithmically cheap

This 1/\varepsilon^2 scaling is fundamental: it appears in:

  • Hoeffding (statistical estimation)
  • Monte Carlo (variance \sigma^2, error \sigma/\sqrt{n})
  • PAC learning (generalisation gap)
  • Stochastic optimisation (SGD convergence rate)

All are manifestations of the same underlying concentration phenomenon.
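A small helper (a sketch, assuming NumPy) that makes the two scalings above concrete:

import numpy as np

def hoeffding_sample_size(eps, delta):
    """Samples needed so that |mean estimate - true mean| <= eps with probability
    >= 1 - delta, for a [0,1]-bounded variable (Hoeffding)."""
    return int(np.ceil(np.log(2 / delta) / (2 * eps**2)))

# Halving eps costs 4x more samples; 95% -> 99.9% confidence costs only ~2x.
print(hoeffding_sample_size(0.02, 0.05), hoeffding_sample_size(0.01, 0.05))
print(hoeffding_sample_size(0.01, 0.05), hoeffding_sample_size(0.01, 0.001))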


Appendix P: Glossary of Key Terms

Term | Definition | Section
---- | ---------- | -------
Concentration inequality | A bound on P(X - \mathbb{E}[X] \geq t) or P(\lvert X - \mathbb{E}[X]\rvert \geq t) | Section 1
Sub-Gaussian random variable | X with \mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} for all t | Section 3.1
Sub-Gaussian parameter | \sigma^2: the proxy variance in the sub-Gaussian definition | Section 3.1
Rademacher variable | \sigma \in \{-1, +1\} with equal probability | Section 3.2
Hoeffding's lemma | Bounded RV X \in [a,b] with zero mean \Rightarrow \mathbb{E}[e^{tX}] \leq e^{t^2(b-a)^2/8} | Section 4.1
Chernoff method | Tail bound via P(X \geq t) \leq e^{-st}\mathbb{E}[e^{sX}], optimised over s | Section 5.1
Bounded differences | \lvert f(\cdots x_i \cdots) - f(\cdots x_i' \cdots)\rvert \leq c_i | Section 7
Union bound | P(\bigcup A_i) \leq \sum P(A_i) | Section 8.1
Covering number | Smallest \varepsilon-net size for a set under a metric | Section 8.3
Shattering | \mathcal{H} shatters C if it can realise all 2^{\lvert C\rvert} labellings of C | Section 9
VC dimension | Size of largest shattered set | Section 9.4
Growth function | \Pi_\mathcal{H}(n): max labellings of n points by \mathcal{H} | App D.1
PAC learning | Probably approximately correct: R(\hat{h}) \leq \min_h R(h) + \varepsilon w.p. \geq 1-\delta | Section 9.1
Uniform convergence | \sup_h \lvert \hat{R}(h) - R(h)\rvert \leq \varepsilon over the whole class | Section 9
Rademacher complexity | \mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_\sigma[\sup_f \frac{1}{n}\sum \sigma_i f(x^{(i)})] | Section 10.1
PAC-Bayes | Generalisation bound using D_{\mathrm{KL}}(\text{posterior}\,\|\,\text{prior}) | App E.4
Algorithmic stability | Output insensitive to swapping one training example | App J.2
Exponential tilt | Twisted distribution P_s(x) \propto e^{sx}P(x) used in Chernoff | App I.1

Appendix Q: The Method of Types and Large Deviations

Q.1 Types and Empirical Distributions

The method of types is an elegant combinatorial approach to concentration that complements the analytical MGF-based approach.

Definition. For a sequence x^n = (x_1, \ldots, x_n) \in \mathcal{X}^n, the type (empirical distribution) is:

P_{x^n}(a) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[x_i = a], \quad a \in \mathcal{X}

The set of all sequences of type Q is the type class T_n(Q).

Key counting result. For a finite alphabet |\mathcal{X}| = k:

|T_n(Q)| = \binom{n}{nQ(x_1), \ldots, nQ(x_k)} \approx e^{nH(Q)}

where H(Q) = -\sum_{x} Q(x)\log Q(x) is the entropy of Q. This means there are approximately e^{nH(Q)} sequences with empirical distribution Q.
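A quick numeric check of this count (a sketch using only the standard library; the gap between the two numbers is the polynomial prefactor hidden by the \approx):

from math import comb, exp, log

def type_class_size(counts):
    """Exact number of sequences with the given letter counts (multinomial coefficient)."""
    remaining, size = sum(counts), 1
    for c in counts:
        size *= comb(remaining, c)
        remaining -= c
    return size

def entropy_nats(counts):
    n = sum(counts)
    return -sum((c / n) * log(c / n) for c in counts if c > 0)

counts = [20, 40]                      # binary alphabet, n = 60, Q = (1/3, 2/3)
n = sum(counts)
print(type_class_size(counts))         # exact |T_n(Q)|
print(exp(n * entropy_nats(counts)))   # e^{n H(Q)}, same exponential order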

Q.2 Connection to Chernoff Bounds

For i.i.d. X_i \sim P and empirical distribution P_{X^n}:

P(P_{X^n} = Q) \leq e^{-n D_{\mathrm{KL}}(Q \| P)}

This is the method of types bound. Applying it to the event that the empirical mean is at least a, and taking a union bound over the (at most polynomially many) types:

P(\bar{X}_n \geq a) \leq (n+1)^{|\mathcal{X}|} e^{-n D_{\mathrm{KL}}(Q^* \| P)}

where Q^* is the type closest to P with mean \geq a. For large n, the polynomial factor (n+1)^{|\mathcal{X}|} is negligible, and this matches the Chernoff bound with the KL divergence interpretation.

Q.3 Cramer's Theorem

Theorem (Cramer, 1938). For i.i.d. X_1, \ldots, X_n with distribution P:

\lim_{n\to\infty} \frac{1}{n} \log P(\bar{X}_n \geq a) = -I(a)

where I(a) = \sup_{t} \{ta - \log M_X(t)\} is the Cramer-Legendre transform (rate function), and I(a) = D_{\mathrm{KL}}(P_a \| P) (KL divergence to the tilted distribution).

This is the fundamental result of large deviations theory: rare events occur at exponential rate determined by the KL distance to the "closest" distribution that makes the event typical.

For AI: a language model's cross-entropy on held-out text is H(\text{true dist}) + D_{\mathrm{KL}}(\text{true dist}\,\|\,\text{model}), and perplexity is its exponential. Low perplexity = model close to the true distribution = rare events (unusual tokens) are not overly penalised. Cramer's theorem connects large-deviation rates to exactly this KL divergence, which is part of why cross-entropy is the right loss.
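A numeric illustration (a sketch, assuming NumPy and SciPy; the Bernoulli parameter and threshold are made up for the example): the Legendre transform of the log-MGF recovers the KL rate function.

import numpy as np
from scipy.optimize import minimize_scalar

p, a = 0.3, 0.6  # X_i ~ Bernoulli(p); rate of P(mean >= a) -- illustrative values

def neg_objective(t):
    # -(t*a - log M_X(t)) with M_X(t) = 1 - p + p*e^t
    return -(t * a - np.log(1 - p + p * np.exp(t)))

rate = -minimize_scalar(neg_objective, bounds=(0.0, 50.0), method="bounded").fun
kl = a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))
print(rate, kl)  # both ~0.192: I(a) = D_KL(Ber(a) || Ber(p))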


Appendix R: Generalisation in the Overparameterised Regime

R.1 The Classical Puzzle

Classical theory (VC, Rademacher) says: more parameters = larger hypothesis class = worse generalisation. Yet in practice, overparameterised models (large d \gg n) often generalise better than their smaller counterparts.

The resolution involves multiple mechanisms:

1. Implicit regularisation. Gradient descent on overparameterised linear models converges to the minimum-\ell_2-norm solution. Among all interpolating solutions, this has the smallest Rademacher complexity.

2. Double descent. As model capacity increases past the interpolation threshold (d \approx n), test error first rises (overfitting) then falls (benign overfitting). For the minimum-norm interpolating solution, the variance part of the generalisation error is approximately:

\text{MSE}_{\text{var}} \approx \frac{\sigma^2 n}{d} \cdot \frac{1}{1 - n/d} = \frac{\sigma^2 n}{d - n}

which blows up at the interpolation threshold d \approx n and decreases as d/n \to \infty (overparameterised regime); a small simulation sketch follows below.

3. Benign overfitting. Bartlett et al. (2020) showed that for linear models with Gaussian data, interpolation (zero training error) can generalise well when most singular values of the data matrix are small - the model "memorises" noise in directions orthogonal to the signal.
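A minimal simulation of the double-descent shape described above (a sketch with illustrative sizes, assuming isotropic Gaussian features; np.linalg.pinv gives the minimum-\ell_2-norm least-squares fit):

import numpy as np

rng = np.random.default_rng(0)
n, d_max, sigma = 50, 400, 0.5
beta = rng.normal(size=d_max) / np.sqrt(d_max)      # true signal, norm ~ 1

X_train = rng.normal(size=(n, d_max))
y_train = X_train @ beta + sigma * rng.normal(size=n)
X_test = rng.normal(size=(2000, d_max))
y_test = X_test @ beta

for d in [10, 25, 45, 50, 55, 100, 200, 400]:
    # Minimum-norm least squares using the first d features
    theta = np.linalg.pinv(X_train[:, :d]) @ y_train
    mse = np.mean((X_test[:, :d] @ theta - y_test) ** 2)
    print(f"d = {d:3d}   test MSE = {mse:.3f}")       # peaks near d = n, then falls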

R.2 Neural Tangent Kernel (NTK) Theory

Wide neural networks in the "lazy training" (NTK) regime behave like kernel methods. The generalisation bound becomes:

R(\hat{h}) \leq \hat{R}(\hat{h}) + O\!\left(\sqrt{\frac{\hat{\mathbf{y}}^\top (K_{nn})^{-1} \hat{\mathbf{y}} + \log(1/\delta)}{n}}\right)

where K_{nn} is the NTK kernel matrix. The key quantity \hat{\mathbf{y}}^\top K_{nn}^{-1} \hat{\mathbf{y}} is the effective complexity - analogous to Rademacher complexity but computed from the kernel. For networks trained with weight decay, this remains bounded even as depth and width grow.

R.3 PAC-Bayes for Overparameterised Models

The most successful approach for non-vacuous bounds on deep networks uses PAC-Bayes. The bound:

\mathbb{E}_{h\sim Q}[R(h)] \leq \mathbb{E}_{h\sim Q}[\hat{R}(h)] + \sqrt{\frac{D_{\mathrm{KL}}(Q\|P) + \log(1/\delta)}{2(n-1)}}

applies to a stochastic classifier h \sim Q. For weight-perturbed networks (Gaussian posterior with small variance \sigma^2):

D_{\mathrm{KL}}(Q\|P) \approx \frac{\|\hat{\theta} - \theta_0\|_2^2}{2\sigma_0^2}

This is small when the fine-tuned weights \hat{\theta} don't stray far from the initialisation \theta_0. SAM (Sharpness-Aware Minimisation) explicitly seeks flat minima, where a broad posterior around \hat{\theta} still achieves low empirical risk, keeping this KL term small.
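A tiny sketch of the resulting complexity term (hypothetical numbers, assuming NumPy): the further the fine-tuned weights drift from initialisation, the larger the bound.

import numpy as np

def pac_bayes_complexity(dist_sq, sigma0_sq, n, delta=0.05):
    """Square-root term of the PAC-Bayes bound with an isotropic Gaussian posterior,
    using KL ~ ||theta_hat - theta_0||^2 / (2 sigma_0^2)."""
    kl = dist_sq / (2 * sigma0_sq)
    return np.sqrt((kl + np.log(1 / delta)) / (2 * (n - 1)))

# Small vs large weight displacement on n = 5000 examples (illustrative values)
print(pac_bayes_complexity(1.0, 0.1, 5000), pac_bayes_complexity(100.0, 0.1, 5000))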


Appendix S: Concentration in Continuous-Time Settings

S.1 Azuma's Inequality for Martingales

We've seen Azuma's inequality used to prove McDiarmid. Here's the full statement:

Theorem (Azuma-Hoeffding). Let (Z_0, Z_1, \ldots, Z_n) be a martingale (or supermartingale) with Z_0 = 0 and |Z_k - Z_{k-1}| \leq c_k a.s. Then:

P(Z_n \geq t) \leq \exp\!\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right)

Proof. Apply the Chernoff method: P(Z_n \geq t) \leq e^{-st} \mathbb{E}[e^{sZ_n}]. For a martingale, \mathbb{E}[e^{sZ_k} \mid Z_{k-1}] \leq e^{sZ_{k-1}} \cdot e^{s^2 c_k^2/2} by Hoeffding's lemma (conditional on Z_{k-1}, the increment Z_k - Z_{k-1} \in [-c_k, c_k] is bounded and has zero mean). Telescoping: \mathbb{E}[e^{sZ_n}] \leq e^{s^2 \sum c_k^2 / 2}. Optimise: s = t/\sum c_k^2. \blacksquare
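A quick sanity check of the bound (a sketch, assuming NumPy): a +/-1 random walk is a martingale with increments bounded by c_k = 1, and its simulated tail sits below the Azuma bound.

import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 200, 20.0, 100_000

# Z_n: sum of n independent +/-1 steps, so |Z_k - Z_{k-1}| <= 1
Z_n = rng.choice([-1.0, 1.0], size=(trials, n)).sum(axis=1)

empirical = (Z_n >= t).mean()
azuma = np.exp(-t**2 / (2 * n))    # sum of c_k^2 = n
print(empirical, azuma)            # empirical tail (~0.08) < Azuma bound (~0.37)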

S.2 Doob's Optional Stopping

Theorem. If (M_n) is a martingale and \tau is a bounded stopping time, then \mathbb{E}[M_\tau] = \mathbb{E}[M_0].

Application to SGD. Define M_t = \|\theta_t - \theta^*\|_2^2 - t \cdot C\eta^2\sigma^2 where C is a constant. Under mild conditions, M_t is a supermartingale. Azuma's inequality applied to the stopped process gives high-probability guarantees on convergence time.

This is the mathematical foundation of high-probability SGD convergence - stronger than the usual in-expectation guarantees.


Appendix T: Summary Diagrams

T.1 The Proof Technique Map

PROOF TECHNIQUES FOR CONCENTRATION INEQUALITIES
========================================================================

  Core tool: Markov's inequality
  -------------------------------------------------------------------

  Apply Markov to (X - mu)^2      --> Chebyshev
  Apply Markov to e^{sX}          --> Chernoff method
       +-- with Hoeffding's lemma  --> Hoeffding's inequality
       +-- with Bernstein's lemma  --> Bernstein's inequality
       +-- with Binomial MGF       --> Chernoff for Bin(n,p)

  Apply Azuma to Doob martingale  --> McDiarmid's inequality
  Union bound over hypothesis class --> PAC generalisation bound
  Union bound + covering number    --> Infinite-class PAC
  Rademacher symmetrisation        --> Rademacher complexity bound

  -------------------------------------------------------------------
  Common pattern: P(X >= t) <= E[e^{sX}] / e^{st}  (all MGF methods)
========================================================================

T.2 Bound Quality Comparison

TAIL PROBABILITY COMPARISON - X = mean of n=100 Bernoulli(0.5), t=0.1
========================================================================

  Method           Bound          Notes
  -----------------------------------------------------------------
  Exact (CLT)      ~0.046         Normal approx; valid for large n
  Hoeffding        0.135          Tight for this case (~3x exact)
  Chebyshev        0.250          ~5x looser
  Markov           0.500          ~11x looser (applied to |X-0.5|/0.5)

  Same setup, t=0.2:
  Exact            ~0.000046
  Hoeffding        3.4e-4         ~7x looser
  Chebyshev        0.0625         ~1360x looser

========================================================================

Appendix U: Proofs of Sub-Gaussian Closure and Examples

U.1 Cosh Inequality for Rademacher Variables

Claim. For \varepsilon \in \{-1, +1\} uniform: \mathbb{E}[e^{t\varepsilon}] = \cosh(t) \leq e^{t^2/2}.

Proof. \cosh(t) = \sum_{k=0}^\infty t^{2k}/(2k)! \leq \sum_{k=0}^\infty t^{2k}/(2^k k!) = \sum_{k=0}^\infty (t^2/2)^k/k! = e^{t^2/2}, using (2k)! \geq 2^k k!.

Interpretation. Rademacher variables are the "most concentrated" bounded variables in [-1,1]: their MGF is exactly \cosh(t), slightly below the sub-Gaussian bound e^{t^2/2}. This is also why they are the natural randomisation in symmetrisation arguments; the factor of 2 on Rademacher complexity in generalisation bounds comes from that symmetrisation step.
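A two-line Monte Carlo check of the claim (a sketch, assuming NumPy):

import numpy as np

rng = np.random.default_rng(0)
eps = rng.choice([-1.0, 1.0], size=1_000_000)
for t in [0.5, 1.0, 2.0]:
    mgf = np.mean(np.exp(t * eps))                   # Monte Carlo E[e^{t eps}]
    print(t, mgf, np.cosh(t), np.exp(t**2 / 2))      # mgf ~ cosh(t) <= e^{t^2/2}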

U.2 Sub-Gaussian Parameter for Bounded Variables

Claim. If X \in [a,b] with \mathbb{E}[X] = 0, then X is \frac{(b-a)^2}{4}-sub-Gaussian.

The parameter \frac{(b-a)^2}{4} is achieved when X = \frac{b-a}{2}\varepsilon where \varepsilon is Rademacher - a two-point distribution at the endpoints. This is the "hardest" distribution on [a,b] for Hoeffding's lemma.

Tightness example. For X \in \{-1, +1\} uniform: variance = 1, sub-Gaussian parameter = \frac{(1-(-1))^2}{4} = 1. They agree! For the Rademacher distribution, sub-Gaussian parameter = variance. For other distributions in [-1,1], sub-Gaussian parameter \geq variance.

U.3 Sub-Gaussian Characterisation via Tail Probabilities

The following are equivalent for a mean-zero random variable XX:

  1. (MGF) \mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} for all t
  2. (Tail) P(|X| \geq t) \leq 2e^{-t^2/(2\sigma^2)} for all t \geq 0
  3. (Moments) \|X\|_{L^p} = (\mathbb{E}[|X|^p])^{1/p} \leq \sigma\sqrt{p} for all p \geq 1

The equivalence (up to constants) is non-trivial. The key step: (2) \Rightarrow (1) uses \mathbb{E}[e^{tX}] = 1 + \sum_{k=1}^\infty \frac{t^k \mathbb{E}[X^k]}{k!} and bounds \mathbb{E}[X^k] from the tail condition.


Appendix V: Exercises with Solutions

V.1 Markov and Chebyshev Calculations

Problem. X \sim \text{Poisson}(5). Compute P(X \geq 12) using: (a) Markov's inequality (\mathbb{E}[X] = 5), (b) Chebyshev (\operatorname{Var}(X) = 5, so t = 12 - 5 = 7, k = 7/\sqrt{5} \approx 3.13), (c) the exact value (Poisson CDF).

Solution. (a) Markov: P(X \geq 12) \leq 5/12 \approx 0.417. (b) Chebyshev: P(X \geq 12) \leq P(|X - 5| \geq 7) \leq 5/49 \approx 0.102. (c) Exact: P(X \geq 12) = 1 - \sum_{k=0}^{11} e^{-5}5^k/k! \approx 0.0055.

Chebyshev is roughly 19x loose; Markov is roughly 76x loose. Both bounds hold but are very conservative.
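These values can be checked directly (a sketch, assuming SciPy):

from scipy.stats import poisson

lam, threshold = 5, 12
exact = poisson.sf(threshold - 1, lam)        # P(X >= 12) = 1 - P(X <= 11), ~0.0055
markov = lam / threshold                      # E[X]/t, ~0.417
chebyshev = lam / (threshold - lam) ** 2      # Var(X)/(t - mu)^2, ~0.102
print(exact, markov, chebyshev)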

V.2 Hoeffding Bound Computation

Problem. We estimate the click-through rate p of an ad by showing it to n users and recording clicks X_i \in \{0,1\}. We observe \hat{p} = 0.08 from n = 2000 users. Give a 99% Hoeffding confidence interval.

Solution. With \delta = 0.01, b - a = 1:

\varepsilon = \sqrt{\frac{\log(2/0.01)}{2 \cdot 2000}} = \sqrt{\frac{\log 200}{4000}} = \sqrt{\frac{5.298}{4000}} = \sqrt{0.001325} \approx 0.0364

99% CI: [0.08 - 0.0364, 0.08 + 0.0364] = [0.0436, 0.1164].

Interpretation: we're 99% confident the true CTR is between 4.4% and 11.6% - a very wide interval! To tighten to +/-1%: need n \geq \frac{\log 200}{2(0.01)^2} \approx 26{,}490 users.
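A small helper for this calculation (a sketch, assuming NumPy):

import numpy as np

def hoeffding_ci(p_hat, n, delta=0.01, value_range=1.0):
    """Two-sided Hoeffding confidence interval for the mean of a bounded variable."""
    eps = value_range * np.sqrt(np.log(2 / delta) / (2 * n))
    return p_hat - eps, p_hat + eps

print(hoeffding_ci(0.08, 2000))   # ~(0.044, 0.116), the 99% interval above
# Sample size for a +/-1% interval at 99% confidence (~26,500 users)
print(int(np.ceil(np.log(2 / 0.01) / (2 * 0.01**2))))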

V.3 Chernoff vs Hoeffding

Problem. S \sim \text{Bin}(500, 0.02), so \mu = 10. Bound P(S \geq 20).

Hoeffding: P(S \geq 20) = P(\bar{X} \geq 0.04) \leq \exp(-2 \cdot 500 \cdot (0.04-0.02)^2) = e^{-0.4} \approx 0.67. Useless!

Chernoff: \delta = (20-10)/10 = 1. P(S \geq 20) \leq e^{-10 \cdot 1^2/3} = e^{-3.33} \approx 0.036.

Exact: P(S \geq 20) \approx 0.0035 (Poisson approximation: e^{-10}\sum_{k=20}^\infty 10^k/k!).

Chernoff is roughly 10x loose; Hoeffding is essentially vacuous (a bound above 0.5 says nothing). For rare events with p \ll 1 and large n, Chernoff is the right tool.
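The three numbers can be reproduced as follows (a sketch, assuming NumPy and SciPy; binom.sf gives the exact binomial tail):

import numpy as np
from scipy.stats import binom

n, p, threshold = 500, 0.02, 20
mu = n * p

exact = binom.sf(threshold - 1, n, p)                    # P(S >= 20), ~0.003
hoeffding = np.exp(-2 * n * (threshold / n - p) ** 2)    # e^{-0.4}, ~0.67
delta = threshold / mu - 1
chernoff = np.exp(-mu * delta**2 / 3)                    # e^{-10/3}, ~0.036
print(exact, hoeffding, chernoff)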

V.4 McDiarmid on Dropout

Problem. A network output f(\mathbf{z}_1, \ldots, \mathbf{z}_L) depends on L = 10 layer activations. Dropout independently zeroes each activation, changing the contribution of neuron j in layer l by at most c_{lj} = |w_{lj}| \cdot a_{lj}^{\max} / (1-p). If \sum_{l,j} c_{lj}^2 = 4, bound P(|f - \mathbb{E}[f]| \geq 0.5).

Solution. By McDiarmid: P(|f - \mathbb{E}[f]| \geq 0.5) \leq 2\exp(-2 \cdot 0.25/4) = 2e^{-0.125} \approx 1.76. The bound is vacuous! This means either the c_{lj} are too large, or t is too small relative to the noise.

With larger networks and better weight norms (\sum c_{lj}^2 = 0.1): 2e^{-2 \cdot 0.25/0.1} = 2e^{-5} \approx 0.013. Now meaningful: the dropout output deviates by more than 0.5 less than 1.3% of the time.


Appendix W: Notation Reference for This Section

Symbol | Meaning
------ | -------
P(A) | Probability of event A
\mu = \mathbb{E}[X] | Mean of X
\sigma^2 = \operatorname{Var}(X) | Variance of X
M_X(t) = \mathbb{E}[e^{tX}] | Moment generating function
\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i | Sample mean
R(h) | True (population) risk of hypothesis h
\hat{R}(h) | Empirical risk of h on training data
\mathcal{H} | Hypothesis class
d_{VC} | VC dimension of \mathcal{H}
\Pi_\mathcal{H}(n) | Growth function
\mathfrak{R}_n(\mathcal{F}) | Rademacher complexity
\hat{\mathfrak{R}}_S(\mathcal{F}) | Empirical Rademacher complexity
\varepsilon | Accuracy parameter (how close to optimal)
\delta | Confidence parameter (failure probability)
c_i | Bounded differences constant for coordinate i
\varepsilon-net | Covering set with spacing \varepsilon
\mathcal{N}(\mathcal{H}, d, \varepsilon) | Covering number: size of smallest \varepsilon-net
D_{\mathrm{KL}}(P\|Q) | KL divergence from Q to P
I(a) = \sup_t(ta - \log M(t)) | Cramer rate function
\sigma_i | Rademacher random variable \in \{-1,+1\}
B_\delta(h) | Ball of radius \delta around hypothesis h

Appendix X: More on Rademacher Complexity

X.1 Computing Rademacher Complexity for Common Classes

Halfspaces with margin. For \mathcal{H}_\gamma = \{\mathbf{x} \mapsto \operatorname{sign}(\mathbf{w}^\top \mathbf{x}): \|\mathbf{w}\|_2 \leq 1, \gamma\text{-margin}\} on data with \|\mathbf{x}^{(i)}\|_2 \leq 1:

\mathfrak{R}_n(\mathcal{H}_\gamma) \leq \frac{1}{\gamma\sqrt{n}}

The margin \gamma effectively replaces dimension d! This is why SVMs with large margins generalise well in high dimensions - the effective complexity is O(1/(\gamma^2 n)), not O(d/n).

Two-layer neural network. For \mathcal{F} = \{\mathbf{x} \mapsto \mathbf{v}^\top \sigma(W\mathbf{x}): \|W\|_F \leq B_W, \|\mathbf{v}\|_1 \leq B_v\}:

\mathfrak{R}_n(\mathcal{F}) \leq \frac{B_v B_W C \sqrt{2\log(2d)}}{\sqrt{n}}

where C = \max_i \|\mathbf{x}^{(i)}\|_2. The \log d factor comes from the union bound over d output neurons; the \sqrt{n} denominator shows the O(1/\sqrt{n}) rate as expected.

X.2 Rademacher Complexity and Mutual Information

A deep connection: the Rademacher complexity \hat{\mathfrak{R}}_S(\mathcal{F}) can be bounded via mutual information between the training labels and the class predictions. For bounded function classes:

\hat{\mathfrak{R}}_S(\mathcal{F}) \leq \sqrt{\frac{2I(Y^n; f(X^n))}{n}}

where I(Y^n; f(X^n)) is the mutual information between labels and class predictions. This connects concentration theory to information theory - a richer theory being actively developed (2020-2026) under "information-theoretic generalisation bounds".

X.3 Monte Carlo Estimation of Rademacher Complexity

In practice, \hat{\mathfrak{R}}_S(\mathcal{F}) can be estimated by Monte Carlo:

import numpy as np

def estimate_rademacher(X, W_norm=1.0, T=1000):
    """Monte Carlo estimate of Rademacher complexity for ||w||_2 <= W_norm."""
    n, d = X.shape
    rademachers = []
    for _ in range(T):
        sigma = np.random.choice([-1, 1], size=n)
        # Optimal w: proportional to X^T sigma, normalised
        w_opt = X.T @ sigma
        w_norm = np.linalg.norm(w_opt)
        if w_norm > 0:
            w_opt = w_opt * W_norm / w_norm
        rademachers.append((sigma @ (X @ w_opt)) / n)
    return np.mean(rademachers)

For a dataset of n = 500 MNIST digits (784-dimensional), W_\text{norm} = 1: empirical \hat{\mathfrak{R}}_S \approx 0.038. Rademacher bound: test error \leq train error + 2 \cdot 0.038 + \sqrt{\log(20)/1000} \approx train error + 0.076 + 0.055. For 2% train error: test error \leq 15.1\% - loose but not vacuous!


Appendix Y: Bounds in the Context of Modern ML Systems

Y.1 LLM Benchmark Evaluation Standards

As of 2026, serious LLM evaluations use concentration-aware practices:

HELM (Holistic Evaluation of Language Models): Uses 1000+ examples per scenario, reporting bootstrap confidence intervals. Assumes bounded loss and applies Hoeffding-style reasoning.

BIG-bench: 200+ tasks, with tasks having 100-10000 examples. Standard deviations are reported, equivalent to reporting \sigma/\sqrt{n}, which matches the leading term in a Hoeffding CI.

LiveBench: Uses pairwise human comparisons (Elo-style). The Elo update rule has a concentration interpretation: the Elo gap needed for significance scales as O(1/\sqrt{n_{\text{comparisons}}}).

Statistical significance in model comparisons. With n test examples and two models achieving accuracies \hat{p}_1 and \hat{p}_2:

  • Paired comparison: use McNemar's test (exact)
  • Unpaired: significance threshold |\hat{p}_1 - \hat{p}_2| \geq 2\sqrt{\frac{\log(4/\delta)}{2n}} (Hoeffding)
  • For n = 1000, \delta = 0.05: need a difference \geq 2 \times 0.047 \approx 9.4\% for significance (see the sketch below)
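A sketch of that threshold as a reusable helper (assuming NumPy; the formula is the unpaired Hoeffding threshold from the list above):

import numpy as np

def min_significant_gap(n, delta=0.05):
    """Accuracy gap two models must show on n test examples before the difference
    exceeds what two two-sided Hoeffding intervals can explain."""
    return 2 * np.sqrt(np.log(4 / delta) / (2 * n))

print(min_significant_gap(1000))    # ~0.094: need ~9.4 accuracy points at n = 1000
print(min_significant_gap(10_000))  # ~0.030 with 10x more test examples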

Y.2 Fine-Tuning Sample Complexity via PAC-Bayes

When fine-tuning a pretrained LLM on task-specific data:

Prior: pre-trained weights \theta_0 from language model pre-training on internet-scale data.
Posterior: fine-tuned weights \hat{\theta} on n task examples.
PAC-Bayes bound:

R(\hat{\theta}) \leq \hat{R}(\hat{\theta}) + \sqrt{\frac{\|\hat{\theta} - \theta_0\|_2^2/(2\sigma_0^2) + \log(1/\delta)}{2(n-1)}}

The \|\hat{\theta} - \theta_0\|_2^2 term is small when fine-tuning doesn't move far from pre-training - exactly what LoRA enforces by constraining \Delta W = AB. With rank r \ll \min(d,k):

\|\hat{\theta} - \theta_0\|_2^2 = \|AB\|_F^2 \leq \|A\|_F^2 \|B\|_F^2 \leq r \cdot \text{(per-param budget)}

PAC-Bayes theoretically justifies why LoRA generalises better than full fine-tuning on small datasets.

Y.3 Privacy-Preserving ML via Concentration

Differential Privacy (DP) and concentration inequalities are deeply connected. An algorithm is (\varepsilon, \delta)-DP if changing one training example changes the probability of any set of outputs by at most a multiplicative factor e^{\varepsilon}, up to an additive slack \delta.

DP-SGD (Abadi et al., 2016): Clips each gradient to norm C, adds Gaussian noise \mathcal{N}(0, \sigma^2 C^2 I). The privacy guarantee follows from the Gaussian mechanism, and the utility (convergence) analysis uses concentration inequalities on the noisy gradients.

McDiarmid connection: DP is essentially algorithmic stability with c_i = C/n (each example changes the model by at most C/n in some norm). McDiarmid's inequality gives the same O(\exp(-nt^2/C^2)) flavour of bound for both privacy and stability.


Appendix Z: Complete Theorem Index

This appendix lists every named result in this section for quick reference.

Moment-Based Inequalities

Theorem 2.1 (Markov). X \geq 0, \mathbb{E}[X] = \mu: P(X \geq t) \leq \mu/t.

Theorem 2.2 (Chebyshev). \operatorname{Var}(X) = \sigma^2: P(|X-\mu| \geq t) \leq \sigma^2/t^2.

Theorem 2.3 (Cantelli). One-sided: P(X - \mu \geq t) \leq \sigma^2/(\sigma^2 + t^2).

Sub-Gaussian Inequalities

Definition 3.1 (Sub-Gaussian). X is \sigma^2-sub-G if \mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} for all t.

Theorem 3.2 (Sub-Gaussian tail). P(X \geq t) \leq e^{-t^2/(2\sigma^2)}.

Theorem 3.3 (Sub-Gaussian sum). Independent \sigma_i^2-sub-G: sum is \sum \sigma_i^2-sub-G.

Hoeffding-Type Inequalities

Lemma 4.1 (Hoeffding's lemma). X \in [a,b], \mathbb{E}[X]=0: \mathbb{E}[e^{tX}] \leq e^{t^2(b-a)^2/8}.

Theorem 4.2 (Hoeffding). Independent X_i \in [a_i,b_i]: P(\sum(X_i-\mu_i) \geq t) \leq e^{-2t^2/\sum(b_i-a_i)^2}.

Theorem 4.3 (Two-sided Hoeffding). P(|\bar{X}_n-\mu| \geq t) \leq 2e^{-2nt^2/(b-a)^2}.

Chernoff Bounds

Theorem 5.1 (Chernoff method). P(X \geq t) \leq \inf_{s>0} e^{-st} M_X(s).

Theorem 5.2 (Chernoff upper tail). S \sim \operatorname{Bin}(n,p), \delta > 0: P(S \geq (1+\delta)\mu) \leq (e^\delta/(1+\delta)^{1+\delta})^\mu.

Theorem 5.3 (Chernoff multiplicative). P(S \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3} for \delta \in (0,1).

Theorem 5.4 (Chernoff lower tail). P(S \leq (1-\delta)\mu) \leq e^{-\mu\delta^2/2}.

Bernstein and McDiarmid

Theorem 6.2 (Bernstein). |X_i| \leq c, variance \nu^2: P(\sum X_i \geq t) \leq \exp\!\left(-\frac{t^2/2}{\nu^2+ct/3}\right).

Theorem 7.2 (McDiarmid). f with c_i-bounded differences: P(f-\mathbb{E}[f] \geq t) \leq e^{-2t^2/\sum c_i^2}.

Theorem S.1 (Azuma). Martingale with |Z_k-Z_{k-1}| \leq c_k: P(Z_n \geq t) \leq e^{-t^2/(2\sum c_k^2)}.

Statistical Learning Theory

Lemma 8.1 (Union bound). P(\bigcup A_i) \leq \sum P(A_i).

Lemma 8.3 (Covering number). \ell_2-ball: \mathcal{N}(\mathcal{B}_R, \ell_2, \varepsilon) \leq (3R/\varepsilon)^d.

Theorem 9.2 (Finite-class PAC). n \geq \log(|\mathcal{H}|/\delta)/(2\varepsilon^2) \Rightarrow uniform convergence.

Lemma 9.4 (Sauer-Shelah). \Pi_\mathcal{H}(n) \leq (en/d)^d for n \geq d = d_{VC}.

Theorem 9.5 (VC bound). R(h) \leq \hat{R}(h) + \sqrt{(d\log(2n/d)+\log(4/\delta))/(2n)}.

Theorem 10.2 (Rademacher bound). R(f) \leq \hat{R}(f) + 2\mathfrak{R}_n(\mathcal{F}) + \sqrt{\log(1/\delta)/(2n)}.

Theorem E.4 (PAC-Bayes). \mathbb{E}_Q[R(h)] \leq \mathbb{E}_Q[\hat{R}(h)] + \sqrt{(D_{\mathrm{KL}}(Q\|P)+\log(1/\delta))/(2(n-1))}.

Theorem J.2 (Stability). \beta-stable algorithm: R \leq \hat{R} + 2\beta + (4n\beta+1)\sqrt{\log(1/\delta)/(2n)}.


This completes Section05 Concentration Inequalities. The material here is used extensively in Chapter 7: Statistics for confidence intervals and hypothesis tests, and throughout learning theory.



Supplementary: Concentration in High Dimensions

The Curse and Blessing of Dimensionality

High-dimensional geometry is counterintuitive. In \mathbb{R}^d with large d:

  • Almost all the volume of a ball lies near its surface
  • Random unit vectors are nearly orthogonal: |\langle u, v \rangle| \approx 1/\sqrt{d} w.h.p.
  • The norm of a Gaussian vector concentrates: \|\mathbf{x}\|_2 \approx \sqrt{d} for \mathbf{x} \sim \mathcal{N}(0, I_d)

These facts follow directly from concentration inequalities applied to sums of squares.

Gaussian norm concentration. Let \mathbf{x} \sim \mathcal{N}(0, I_d). Then \|\mathbf{x}\|_2^2 = \sum_{i=1}^d X_i^2 where X_i \stackrel{iid}{\sim} \mathcal{N}(0,1). By Bernstein's inequality (since X_i^2 - 1 is sub-exponential):

P\!\left(\left|\,\|\mathbf{x}\|_2 - \sqrt{d}\,\right| \geq t\right) \leq 2e^{-ct^2}

for a universal constant c > 0.

Implication for dot products. For two independent \mathbf{x}, \mathbf{y} \sim \mathcal{N}(0, I_d/d):

P\!\left(\left|\langle \mathbf{x}, \mathbf{y}\rangle\right| \geq t/\sqrt{d}\right) \leq 2e^{-ct^2}

This is why scaled dot-product attention uses 1/\sqrt{d_k} - it keeps logits O(1) despite d_k-dimensional embeddings.

Geometry of Softmax

Let \mathbf{q}, \mathbf{k}_1, \ldots, \mathbf{k}_n \in \mathbb{R}^{d_k} be the query and keys in a transformer. The attention weight on key j is:

\alpha_j = \frac{e^{\mathbf{q}^\top \mathbf{k}_j / \sqrt{d_k}}}{\sum_{\ell} e^{\mathbf{q}^\top \mathbf{k}_\ell / \sqrt{d_k}}}

Without the 1/\sqrt{d_k} scaling, the logits \mathbf{q}^\top \mathbf{k}_j have variance d_k (for random vectors with unit-variance entries), pushing softmax into near-zero gradient regions - a form of implicit dimension dependence that concentration inequalities make precise.
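A short simulation of this effect (a sketch, assuming NumPy; dimensions are illustrative): unscaled logits have variance of order d_k and push softmax towards a one-hot vector, while the 1/\sqrt{d_k} scaling keeps the weights spread out.

import numpy as np

rng = np.random.default_rng(0)
d_k, n_keys = 512, 32
q = rng.normal(size=d_k)
K = rng.normal(size=(n_keys, d_k))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

raw = K @ q                       # logit variance ~ d_k
scaled = raw / np.sqrt(d_k)       # logit variance ~ 1
print(np.var(raw), np.var(scaled))
print(softmax(raw).max(), softmax(scaled).max())   # ~1.0 (saturated) vs well below 1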

Covering Numbers in Transformer Weight Space

The number of parameters in a transformer layer is O(d^2). For LoRA with rank r, the effective dimensionality is O(dr). By the \varepsilon-covering number bound:

\log \mathcal{N}(\mathcal{F}_{\text{LoRA}}, \varepsilon, \|\cdot\|) = O\!\left(dr \log(1/\varepsilon)\right)

This converts directly to a PAC generalisation bound showing that LoRA generalises with O(\sqrt{dr \log n / n}) excess risk - substantially better than O(\sqrt{d^2 \log n / n}) for full fine-tuning. Concentration inequalities thus provide the theoretical foundation for parameter-efficient fine-tuning.

Summary: Why This Section Matters

Concentration inequalities are the engine behind every finite-sample guarantee in machine learning. They explain:

  1. Why more data helps: bounds shrink as 1/\sqrt{n}
  2. Why simpler models generalise: bounds grow with \sqrt{\log |\mathcal{H}|} or \sqrt{d_{\text{VC}}}
  3. Why high dimensions are manageable: Gaussian concentration keeps norms stable
  4. Why LoRA works: Rademacher complexity of low-rank updates is provably small
  5. Why LLM benchmarks need many examples: CIs on accuracy require n \propto 1/\varepsilon^2

The progression Markov -> Chebyshev -> sub-Gaussian -> Chernoff -> McDiarmid is not merely a tour of tricks - it is a systematic strengthening of assumptions that yields exponentially tighter guarantees, mirroring the progress from "some data" to "structured data" in modern ML.


End of Section05 Concentration Inequalities

Next: Section06 Stochastic Processes - where concentration inequalities meet martingales in continuous time.

Back: Section04 Expectation and Moments - the moment-based tools (MGF, Jensen) used throughout this section.

