Theory Notebook
Converted from theory.ipynb for web reading.
Concentration Inequalities
"One of the great gifts of probability theory is that it allows us to turn complex dependencies into simple, manageable bounds." — Stéphane Boucheron
Interactive exploration of concentration inequalities: from Markov and Chebyshev to sub-Gaussian tails, Hoeffding, Chernoff, Bernstein, and McDiarmid. Applications to PAC learning, VC theory, Rademacher complexity, and LLM evaluation.
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 13,
"axes.titlesize": 15,
"axes.labelsize": 13,
"xtick.labelsize": 11,
"ytick.labelsize": 11,
"legend.fontsize": 11,
"legend.framealpha": 0.85,
"lines.linewidth": 2.0,
"axes.spines.top": False,
"axes.spines.right": False,
"savefig.bbox": "tight",
"savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
from scipy import stats
try:
import matplotlib.pyplot as plt
import matplotlib
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 12
HAS_MPL = True
except ImportError:
HAS_MPL = False
try:
import seaborn as sns
HAS_SNS = True
except ImportError:
HAS_SNS = False
np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)
print('Setup complete.')
1. Intuition: Why Tail Bounds Matter
A random variable $X$ is concentrated if most of its probability mass lies near its mean. Concentration inequalities quantify this with explicit tail bounds of the form $P(|X - \mathbb{E}[X]| \geq t) \leq b(t)$. In ML, $X$ might be a generalisation error, an empirical risk, or a gradient norm.
The hierarchy (weakest → strongest assumptions, loosest → tightest bounds):
- Markov — only non-negativity: $P(X \geq t) \leq \mathbb{E}[X]/t$ — polynomial tail
- Chebyshev — finite variance: $P(|X - \mu| \geq t) \leq \sigma^2/t^2$ — polynomial
- Sub-Gaussian/Hoeffding — bounded support or light tails: $P(|X - \mu| \geq t) \leq 2e^{-t^2/(2\sigma^2)}$ — exponential
- Chernoff — MGF exists: $P(X \geq t) \leq \inf_{\lambda > 0} e^{-\lambda t}\,\mathbb{E}[e^{\lambda X}]$ — exponential, shape depends on MGF
- Bernstein — variance + bounded range: $\exp\!\big(-\tfrac{t^2/2}{v + Mt/3}\big)$ — exponential, transitions Gaussian → Poisson regime
- McDiarmid — function of independent variables with bounded differences: $\exp\!\big(-2t^2/\sum_i c_i^2\big)$
Code cell 5
# === 1.1 Tail Bound Comparison ===
t = np.linspace(0.1, 5, 500)
mu = 1.0 # E[X] for Markov
sigma2 = 1.0 # Var(X) for Chebyshev
c2 = 1.0 # sum c_i^2 for Hoeffding/sub-Gaussian
markov = mu / t
cheby = sigma2 / t**2
hoeffding = np.exp(-2 * t**2 / c2)
gaussian_tail = 2 * stats.norm.sf(t) # 2P(Z >= t) — actual Gaussian tail
if HAS_MPL:
fig, ax = plt.subplots(figsize=(10,5))
ax.semilogy(t, markov, label='Markov: $\mu/t$', lw=2)
ax.semilogy(t, cheby, label='Chebyshev: $\sigma^2/t^2$', lw=2)
ax.semilogy(t, hoeffding, label='Hoeffding/sub-Gaussian', lw=2)
ax.semilogy(t, gaussian_tail,label='Gaussian tail (actual)', lw=2, ls='--')
ax.set_xlim(0.5, 5)
ax.set_ylim(1e-7, 2)
ax.set_xlabel('Deviation $t$')
ax.set_ylabel('$P(|X - \\mu| \\geq t)$ (log scale)')
ax.set_title('Tail Bound Hierarchy: Polynomial vs Exponential Decay')
ax.legend()
plt.tight_layout()
plt.show()
# Print at t=3
t0 = 3.0
print(f'At t={t0}:')
print(f' Markov (mu=1): {mu/t0:.4f}')
print(f' Chebyshev (sigma2=1): {sigma2/t0**2:.4f}')
print(f' Hoeffding (c2=1): {np.exp(-2*t0**2/c2):.4f}')
print(f' Gaussian (actual): {2*stats.norm.sf(t0):.6f}')
2. Markov and Chebyshev Inequalities
2.1 Markov's Inequality
Theorem. For non-negative $X$ with $\mathbb{E}[X] < \infty$ and any $t > 0$:
$$P(X \geq t) \leq \frac{\mathbb{E}[X]}{t}.$$
Proof. $\mathbb{E}[X] \geq \mathbb{E}[X\,\mathbf{1}\{X \geq t\}] \geq t\,P(X \geq t)$. Divide by $t$. $\blacksquare$
2.2 Chebyshev's Inequality
Apply Markov to $(X - \mu)^2$:
$$P(|X - \mu| \geq t) = P\big((X - \mu)^2 \geq t^2\big) \leq \frac{\sigma^2}{t^2}.$$
One-sided Cantelli. $P(X - \mu \geq t) \leq \dfrac{\sigma^2}{\sigma^2 + t^2}$ (sharper than the two-sided bound for large $t$).
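A quick numerical check of Cantelli against two-sided Chebyshev (a minimal sketch using Exponential(1), whose one-sided tail is known exactly):

```python
import numpy as np

np.random.seed(0)
X = np.random.exponential(1.0, 1_000_000)  # mean 1, variance 1
mu, sigma2 = 1.0, 1.0
for t in [1.0, 2.0, 3.0]:
    actual = (X - mu >= t).mean()
    cantelli = sigma2 / (sigma2 + t**2)   # one-sided Cantelli
    chebyshev = sigma2 / t**2             # two-sided Chebyshev, for reference
    print(f't={t}: actual={actual:.4f}  Cantelli={cantelli:.4f}  Chebyshev={chebyshev:.4f}')
```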
Code cell 7
# === 2. Markov and Chebyshev Numerical Verification ===
np.random.seed(42)
N = 1_000_000
# Exponential distribution: E[X] = 1/lambda, Var = 1/lambda^2
lam = 1.0
samples_exp = np.random.exponential(1.0/lam, N)
mu_exp = 1.0 / lam
var_exp = 1.0 / lam**2
ts = [2.0, 3.0, 5.0]
print('Exponential(1) — one-sided bounds')
print(f' {'t':>4} {'P(X>=t) actual':>18} {'Markov':>10} {'Ratio':>8}')
for t in ts:
actual = (samples_exp >= t).mean()
markov = mu_exp / t
print(f' {t:>4.1f} {actual:>18.6f} {markov:>10.6f} {actual/markov:>8.3f}')
# Gaussian: test Chebyshev
samples_norm = np.random.normal(0, 1, N)
print('\nGaussian(0,1) — two-sided Chebyshev')
print(f' {'t':>4} {'P(|X|>=t) actual':>20} {'Chebyshev':>12} {'Gaussian tail':>15}')
for t in [1.0, 2.0, 3.0]:
actual = (np.abs(samples_norm) >= t).mean()
cheby = 1.0 / t**2 # sigma^2 = 1
gtail = 2 * stats.norm.sf(t)
print(f' {t:>4.1f} {actual:>20.6f} {cheby:>12.6f} {gtail:>15.6f}')
print('\nTakeaway: Chebyshev is loose for Gaussian (1/t^2 vs e^{-t^2/2}) — ')
print('but holds for ANY distribution with finite variance.')
3. Sub-Gaussian Random Variables
A zero-mean random variable $X$ is $\sigma$-sub-Gaussian if:
$$\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} \quad \text{for all } t \in \mathbb{R}.$$
Equivalent characterisation (tail form):
$$P(|X| \geq t) \leq 2e^{-t^2/(2\sigma^2)}.$$
Key examples:
- $X \sim \mathcal{N}(0, \sigma^2)$: $\sigma$-sub-Gaussian (equality in the MGF bound)
- Bounded $X \in [a, b]$: $\frac{b-a}{2}$-sub-Gaussian (Hoeffding's lemma)
- Rademacher $X = \pm 1$ w.p. $\frac{1}{2}$ each: $1$-sub-Gaussian
Closure: a sum of independent $\sigma_i$-sub-Gaussians is $\big(\sum_i \sigma_i^2\big)^{1/2}$-sub-Gaussian — checked in the sketch below.
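A minimal Monte Carlo check of the closure property: the sum of $n$ independent Rademacher variables should satisfy the $\sqrt{n}$-sub-Gaussian tail bound (the choices of `n` and the test points `t` are arbitrary):

```python
import numpy as np

np.random.seed(0)
n, N_mc = 25, 200_000
# Sum of n independent Rademachers: each is 1-sub-Gaussian,
# so by closure the sum is sqrt(n)-sub-Gaussian (sigma^2 = n).
S = np.where(np.random.rand(N_mc, n) < 0.5, -1.0, 1.0).sum(axis=1)
sigma2 = n
for t in [5.0, 10.0, 15.0]:
    tail = (S >= t).mean()
    bound = np.exp(-t**2 / (2 * sigma2))   # sub-Gaussian tail with sigma^2 = n
    print(f't={t:>4}: P(S >= t)={tail:.4f} <= exp(-t^2/(2n))={bound:.4f}')
```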
Code cell 9
# === 3. Sub-Gaussian MGF Verification ===
np.random.seed(42)
N = 500_000
ts = np.linspace(-2, 2, 100)
# Gaussian N(0,1): sigma^2 = 1
X_gauss = np.random.normal(0, 1, N)
# Uniform [-1,1]: sigma^2 = (b-a)^2/4 = 1
X_unif = np.random.uniform(-1, 1, N)
# Rademacher: sigma^2 = 1
X_rad = np.where(np.random.rand(N) < 0.5, -1.0, 1.0)
sub_gauss_bound = np.exp(ts**2 / 2) # sigma^2 = 1
if HAS_MPL:
fig, ax = plt.subplots(figsize=(10,5))
for X, label, color in [
(X_gauss, 'Gaussian N(0,1)', 'C0'),
(X_unif, 'Uniform[-1,1]', 'C1'),
(X_rad, 'Rademacher ±1', 'C2'),
]:
mgf = np.array([np.exp(t * X).mean() for t in ts])
ax.plot(ts, mgf, label=f'E[e^{{tX}}] {label}', color=color)
ax.plot(ts, sub_gauss_bound, 'k--', lw=2, label='Sub-Gaussian bound $e^{t^2/2}$')
ax.set_xlabel('$t$')
ax.set_ylabel('MGF')
ax.set_title('Sub-Gaussian MGF Bound Verification')
ax.set_ylim(0, 8)
ax.legend()
plt.tight_layout()
plt.show()
# Verify bound holds at t=1.5
t_test = 1.5
bound = np.exp(t_test**2 / 2)
for X, name in [(X_gauss,'Gaussian'), (X_unif,'Uniform'), (X_rad,'Rademacher')]:
mgf_mc = np.exp(t_test * X).mean()
ok = mgf_mc <= bound * 1.05 # 5% tolerance for MC
print(f'{'PASS' if ok else 'FAIL'} — {name}: MGF({t_test})={mgf_mc:.4f} <= {bound:.4f}')
4. Hoeffding's Inequality
Theorem (Hoeffding, 1963). Let $X_1, \dots, X_n$ be independent with $X_i \in [a_i, b_i]$, and let $\bar{X}_n = \frac{1}{n}\sum_i X_i$ with $\mu = \mathbb{E}[\bar{X}_n]$. Then:
$$P(|\bar{X}_n - \mu| \geq t) \leq 2\exp\!\left(-\frac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$
Hoeffding's Lemma (key step). For zero-mean $X \in [a, b]$:
$$\mathbb{E}[e^{\lambda X}] \leq e^{\lambda^2(b-a)^2/8}.$$
Proof sketch. Convexity of $e^{\lambda x}$ gives $e^{\lambda x} \leq \frac{b-x}{b-a}e^{\lambda a} + \frac{x-a}{b-a}e^{\lambda b}$. Take expectations, write the result as $e^{L(h)}$ with $h = \lambda(b-a)$, Taylor-expand $L$, and show $L''(h) \leq 1/4$, which gives $L(h) \leq h^2/8$.
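A small numerical check of Hoeffding's lemma (sketch; a fair Bernoulli shifted to $\{-1/2, +1/2\}$, so $b - a = 1$):

```python
import numpy as np

np.random.seed(0)
# Zero-mean bounded variable: fair coin on {-1/2, +1/2}
X = np.random.binomial(1, 0.5, 500_000) - 0.5
a, b = -0.5, 0.5
for lam in [0.5, 1.0, 2.0, 4.0]:
    mgf = np.exp(lam * X).mean()
    bound = np.exp(lam**2 * (b - a)**2 / 8)   # Hoeffding's lemma
    print(f'lambda={lam}: E[e^(lam X)]={mgf:.4f} <= e^(lam^2/8)={bound:.4f}')
```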
Sample complexity (iid case, $X_i \in [0, 1]$, tolerance $\varepsilon$, confidence $1 - \delta$):
$$P(|\bar{X}_n - \mu| \geq \varepsilon) \leq 2e^{-2n\varepsilon^2}.$$
Setting the RHS $= \delta$ gives $n \geq \dfrac{\ln(2/\delta)}{2\varepsilon^2}$.
Code cell 11
# === 4. Hoeffding: Sample Complexity and Verification ===
np.random.seed(42)
# --- Sample complexity formula ---
def hoeffding_n(eps, delta):
"""Min samples so P(|X_bar - mu| >= eps) <= delta, X_i in [0,1]."""
return np.ceil(np.log(2 / delta) / (2 * eps**2)).astype(int)
print('Sample complexity n(eps, delta) via Hoeffding:')
print(f' {'eps':>6} {'delta':>8} {'n':>8}')
for eps in [0.01, 0.05, 0.1]:
for delta in [0.05, 0.01]:
print(f' {eps:>6.2f} {delta:>8.3f} {hoeffding_n(eps, delta):>8}')
# --- Empirical verification ---
mu_true = 0.3
n_values = [50, 100, 500, 1000]
n_trials = 50000
eps = 0.1
print('\nHoeffding bound vs empirical tail (X_i ~ Bernoulli(0.3), eps=0.1):')
print(f' {'n':>6} {'P(|X_bar-mu|>=eps) empirical':>32} {'Hoeffding bound':>18}')
for n in n_values:
samples = np.random.binomial(1, mu_true, (n_trials, n)).mean(axis=1)
empirical = (np.abs(samples - mu_true) >= eps).mean()
bound = 2 * np.exp(-2 * n * eps**2)
print(f' {n:>6} {empirical:>32.6f} {bound:>18.6f}')
5. Chernoff Bounds
The Chernoff method. For any $\lambda > 0$, apply Markov to $e^{\lambda X}$:
$$P(X \geq t) = P(e^{\lambda X} \geq e^{\lambda t}) \leq e^{-\lambda t}\,\mathbb{E}[e^{\lambda X}].$$
Optimise over $\lambda$:
$$P(X \geq t) \leq \inf_{\lambda > 0} e^{-\lambda t} M_X(\lambda).$$
Binomial Chernoff. Let $X \sim \mathrm{Bin}(n, p)$, $\mu = np$, $\delta > 0$:
- Upper tail: $P(X \geq (1+\delta)\mu) \leq \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu$
- Simplified upper: $P(X \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3}$ for $\delta \in (0, 1]$
- Lower tail: $P(X \leq (1-\delta)\mu) \leq e^{-\mu\delta^2/2}$ for $\delta \in (0, 1)$
The gap between Hoeffding ($e^{-2n\varepsilon^2}$) and Chernoff ($e^{-\mu\delta^2/3}$) is that Chernoff exploits the actual Binomial MGF, not just the bounded range.
Code cell 13
# === 5. Chernoff vs Hoeffding for Binomial ===
np.random.seed(42)
def chernoff_upper(delta, mu):
"""Simplified Chernoff upper: exp(-mu*delta^2/3) for delta in (0,1]."""
return np.exp(-mu * delta**2 / 3)
def hoeffding_upper(delta, n):
"""Hoeffding upper: exp(-2*n*delta^2) for X_i in [0,1]."""
return np.exp(-2 * n * delta**2)
# Compare bounds for B(n=100, p=0.1) => mu=10
n, p = 100, 0.1
mu = n * p
N_mc = 500_000
samples = np.random.binomial(n, p, N_mc)
print(f'Binomial(n={n}, p={p}), mu={mu}')
print(f' {'delta':>6} {'t=(1+d)mu':>12} {'Empirical':>12} '
f'{'Chernoff':>12} {'Hoeffding':>12}')
for delta in [0.2, 0.5, 1.0]:
t = (1 + delta) * mu
empirical = (samples >= t).mean()
chernoff_b = chernoff_upper(delta, mu)
hoeffding_b = hoeffding_upper(delta * p, n) # eps = delta*p
print(f' {delta:>6.1f} {t:>12.1f} {empirical:>12.6f} '
f'{chernoff_b:>12.6f} {hoeffding_b:>12.6f}')
print('\nChernoff is tighter when p is small (exploits Binomial MGF).')
6. Bernstein's Inequality
Bernstein's inequality bridges the Gaussian (sub-Gaussian) and exponential (Poisson-type) tail regimes by using variance information on top of a range bound.
Theorem (Bernstein). Let $X_1, \dots, X_n$ be independent, zero-mean, with $|X_i| \leq M$ a.s. Then, with $v = \sum_i \mathbb{E}[X_i^2]$:
$$P\!\left(\sum_{i=1}^n X_i \geq t\right) \leq \exp\!\left(-\frac{t^2/2}{v + Mt/3}\right).$$
Interpretation:
- Gaussian regime ($Mt \ll v$): bound $\approx e^{-t^2/(2v)}$ — sub-Gaussian with variance $v$
- Poisson regime ($Mt \gg v$): bound $\approx e^{-3t/(2M)}$ — exponential in $t$ rather than $t^2$
vs. Hoeffding. Bernstein uses variance information; Hoeffding only uses the range. Bernstein is strictly tighter when the variance is small relative to the squared range.
Code cell 15
# === 6. Bernstein vs Hoeffding Comparison ===
np.random.seed(42)
def bernstein_bound(t, v, M):
"""Bernstein: exp(-t^2/2 / (v + Mt/3))."""
return np.exp(-t**2 / 2 / (v + M * t / 3))
def hoeffding_bound(t, n, a, b):
"""Hoeffding: exp(-2t^2 / (n*(b-a)^2))."""
return np.exp(-2 * t**2 / (n * (b - a)**2))
# Sparse Bernoulli: X_i ~ Bern(p) - p, p=0.01
# mean=0, variance=p*(1-p)~p, range=[-(p), 1-p]
n_vars = 100
p = 0.01
v_total = n_vars * p * (1 - p) # total variance
M = 1 - p # max |X_i|
ts = np.linspace(0.01, 2.0, 300)
bern_b = bernstein_bound(ts, v_total, M)
hoeff_b = hoeffding_bound(ts, n_vars, -p, 1-p)
# Empirical
N_mc = 500_000
sim = (np.random.binomial(1, p, (N_mc, n_vars)) - p).sum(axis=1)
if HAS_MPL:
fig, ax = plt.subplots(figsize=(10,5))
ts_emp = np.linspace(0.01, sim.max(), 200)
emp = np.array([(sim >= t).mean() for t in ts_emp])
ax.semilogy(ts, bern_b, label='Bernstein', lw=2)
ax.semilogy(ts, hoeff_b, label='Hoeffding', lw=2)
ax.semilogy(ts_emp, np.maximum(emp, 1e-7), 'k.', ms=2, label='Empirical')
ax.set_xlabel('$t$')
ax.set_ylabel('$P(S_n \\geq t)$')
ax.set_title(f'Bernstein vs Hoeffding (n={n_vars} sparse Bernoulli, p={p})')
ax.legend()
plt.tight_layout()
plt.show()
print(f'Variance v={v_total:.4f}, M={M:.2f}')
print(f'Bernstein at t=0.5: {bernstein_bound(0.5, v_total, M):.6f}')
print(f'Hoeffding at t=0.5: {hoeffding_bound(0.5, n_vars, -p, 1-p):.6f}')
print('Bernstein is tighter when variance << range^2.')
7. McDiarmid's Inequality
McDiarmid extends Hoeffding from sums to general functions of independent variables.
Definition (Bounded Differences). $f : \mathcal{X}^n \to \mathbb{R}$ has the $(c_1, \dots, c_n)$-bounded differences property if for each $i$:
$$\sup_{x_1, \dots, x_n,\, x_i'} \big|f(x_1, \dots, x_i, \dots, x_n) - f(x_1, \dots, x_i', \dots, x_n)\big| \leq c_i.$$
Theorem (McDiarmid, 1989). If $f$ has bounded differences with constants $c_i$ and $X_1, \dots, X_n$ are independent, then:
$$P\big(f(X) - \mathbb{E}[f(X)] \geq t\big) \leq \exp\!\left(-\frac{2t^2}{\sum_i c_i^2}\right).$$
Proof idea. Define the Doob martingale $M_k = \mathbb{E}[f(X) \mid X_1, \dots, X_k]$. Each difference $M_k - M_{k-1}$ lies in an interval of length $c_k$ and has zero conditional mean. Apply Azuma's inequality to the martingale.
Key application: empirical risk. Let $\hat{R}_n(h) = \frac{1}{n}\sum_i \ell(h; Z_i)$ with $\ell \in [0, 1]$. Changing one sample changes $\hat{R}_n$ by at most $1/n$, so $c_i = 1/n$ and:
$$P\big(|\hat{R}_n(h) - \mathbb{E}[\hat{R}_n(h)]| \geq t\big) \leq 2e^{-2nt^2}.$$
Code cell 17
# === 7. McDiarmid on Empirical Risk ===
np.random.seed(42)
def mcdiarmid_bound(t, n, c_max=None):
"""McDiarmid bound with c_i = 1/n (all equal), c_max=1/n."""
if c_max is None:
c_max = 1.0 / n
return np.exp(-2 * t**2 / (n * c_max**2))
# Simulate: classifier h, threshold at 0.5
# True risk R(h) = P(error) = 0.2
R_true = 0.2
n_values = [50, 100, 500, 1000]
n_trials = 100_000
t = 0.1
print(f'Empirical risk deviation (t={t}), true R={R_true}')
print(f' {'n':>6} {'P(R_hat-R>=t) sim':>22} {'McDiarmid bound':>18}')
for n in n_values:
errors = np.random.binomial(1, R_true, (n_trials, n)).mean(axis=1)
empirical = (errors - R_true >= t).mean()
bound = mcdiarmid_bound(t, n)
print(f' {n:>6} {empirical:>22.6f} {bound:>18.6f}')
# Also show: symmetric bound P(|R_hat - R| >= t) <= 2*exp(-2nt^2)
print(f'\nTwo-sided: P(|R_hat - R| >= {t}) <= 2*exp(-2n*{t}^2)')
for n in [100, 1000]:
print(f' n={n}: {2*mcdiarmid_bound(t, n):.6f}')
print('\nThis is the foundation of PAC-style generalisation bounds.')
8. Union Bound and Covering Numbers
8.1 Union Bound (Boole's Inequality)
For any events $A_1, \dots, A_K$:
$$P\!\left(\bigcup_{i=1}^K A_i\right) \leq \sum_{i=1}^K P(A_i).$$
ML application. Applying McDiarmid/Hoeffding to each $h \in \mathcal{H}$ and taking the union over all hypotheses:
$$P\big(\exists h \in \mathcal{H} : |\hat{R}_n(h) - R(h)| \geq t\big) \leq 2|\mathcal{H}|\,e^{-2nt^2}.$$
Setting the RHS $= \delta$ gives $t = \sqrt{\ln(2|\mathcal{H}|/\delta)/(2n)}$.
8.2 Covering Numbers
For infinite $\mathcal{H}$, replace $|\mathcal{H}|$ with the covering number $N(\varepsilon, \mathcal{H}, d)$: the minimum number of $\varepsilon$-balls needed to cover $\mathcal{H}$ under the metric $d$.
Metric entropy: $\log N(\varepsilon)$. For the ball $\{\theta : \|\theta\|_2 \leq 1\} \subset \mathbb{R}^d$: $\log N(\varepsilon) \leq d\log(3/\varepsilon)$.
Code cell 19
# === 8. Union Bound for Finite Hypothesis Class ===
np.random.seed(42)
# Scenario: |H| decision stumps on binary features
# Each stump: threshold on one of d features => |H| = 2d
def pac_bound_finite(H_size, n, delta):
"""Generalisation gap t such that P(exists h: gap >= t) <= delta."""
return np.sqrt(np.log(H_size / delta) / (2 * n))
print('Generalisation gap bound for finite hypothesis classes:')
print(f' {'|H|':>10} {'n':>6} {'delta':>6} {'Gap bound t':>12}')
for H_size in [10, 100, 1000, 10000]:
for n in [100, 1000]:
t = pac_bound_finite(H_size, n, 0.05)
print(f' {H_size:>10} {n:>6} {'0.05':>6} {t:>12.4f}')
print('\nSample complexity: n >= log(|H|/delta) / (2*eps^2)')
for H_size in [100, 10000]:
n_required = np.ceil(np.log(H_size / 0.05) / (2 * 0.05**2)).astype(int)
print(f' |H|={H_size}: n >= {n_required} for eps=0.05, delta=0.05')
# Covering number for linear classifiers in R^d, ||theta|| <= 1
print('\nCovering number log N(eps) for {theta: ||theta||<=1} in R^d:')
for d in [10, 100, 1000]:
eps = 0.1
log_N = d * np.log(3 / eps)
print(f' d={d:4d}: log N = {log_N:.1f} => N = {np.exp(log_N):.2e}')
9. PAC Learning and VC Dimension
9.1 PAC Framework
A hypothesis class $\mathcal{H}$ is PAC learnable if there exists an algorithm $\mathcal{A}$ such that for all distributions $\mathcal{D}$ and all $\varepsilon, \delta \in (0, 1)$, given $n \geq n(\varepsilon, \delta)$ samples:
$$P\Big(R(\hat{h}) - \min_{h \in \mathcal{H}} R(h) \leq \varepsilon\Big) \geq 1 - \delta.$$
Finite class. $n = O\big(\log(|\mathcal{H}|/\delta)/\varepsilon^2\big)$ suffices (union bound).
9.2 VC Dimension
The VC dimension $d_{\mathrm{VC}}(\mathcal{H})$ is the size of the largest set of points that $\mathcal{H}$ can shatter (realise all $2^n$ labellings).
Examples:
- Intervals on $\mathbb{R}$: $d_{\mathrm{VC}} = 2$
- Half-spaces in $\mathbb{R}^d$: $d_{\mathrm{VC}} = d + 1$
- Degree-$p$ polynomial classifiers in $\mathbb{R}^d$: $d_{\mathrm{VC}} = \binom{d+p}{p}$
- Neural networks (binary output, $W$ parameters): $d_{\mathrm{VC}} = O(W \log W)$
VC generalisation bound (Vapnik-Chervonenkis). With probability $\geq 1 - \delta$, for all $h \in \mathcal{H}$:
$$R(h) \leq \hat{R}_n(h) + \sqrt{\frac{2\,d_{\mathrm{VC}}\ln(en/d_{\mathrm{VC}}) + 2\ln(4/\delta)}{n}}.$$
Code cell 21
# === 9. VC Dimension: Shattering and PAC Bounds ===
np.random.seed(42)
# --- Shattering demo: half-spaces in R^2 ---
# 3 points in general position CAN be shattered by half-spaces
# 4 points CANNOT always be shattered (XOR configuration fails)
def linear_shatter_check(points):
"""Check if set of points can be shattered by linear classifier."""
n = len(points)
from itertools import product
can_realise_all = True
for labels in product([-1, 1], repeat=n):
labels = np.array(labels)
# Try linear separator with bias: w.x + b = 0
# Randomised search for a separator (heuristic: may miss a valid one with small probability)
found = False
for trial in range(1000):
w = np.random.randn(2)
b = np.random.randn()
preds = np.sign(points @ w + b)
if np.all(preds == labels):
found = True
break
if not found:
can_realise_all = False
break
return can_realise_all
# 3 points in general position
pts3 = np.array([[0., 0.], [1., 0.], [0., 1.]])
# 4 points - XOR pattern
pts4 = np.array([[0., 0.], [1., 1.], [1., 0.], [0., 1.]])
print('VC dimension of half-spaces in R^2 is 3:')
print(f' Can shatter 3 pts in general position: {linear_shatter_check(pts3)}')
print(f' Can shatter 4 pts (XOR): {linear_shatter_check(pts4)}')
# --- VC generalisation bound ---
def vc_bound(d_vc, n, delta):
"""VC generalisation gap."""
return np.sqrt((2 * d_vc * np.log(np.e * n / d_vc) + 2 * np.log(4 / delta)) / n)
print('\nVC generalisation bound (delta=0.05):')
print(f' {'d_VC':>6} {'n':>6} {'Gap bound':>12}')
for d_vc in [3, 10, 100]:
for n in [100, 1000, 10000]:
gap = vc_bound(d_vc, n, 0.05)
print(f' {d_vc:>6} {n:>6} {gap:>12.4f}')
10. Rademacher Complexity
Rademacher complexity measures the ability of $\mathcal{H}$ to fit random noise.
Empirical Rademacher complexity:
$$\hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_\sigma\!\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i h(x_i)\right],$$
where the $\sigma_i$ are uniform on $\{\pm 1\}$, independent of everything else (Rademacher random variables).
Population Rademacher complexity: $\mathfrak{R}_n(\mathcal{H}) = \mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{H})]$.
Theorem (Rademacher Generalisation Bound). With probability $\geq 1 - \delta$, for all $h \in \mathcal{H}$:
$$R(h) \leq \hat{R}_n(h) + 2\mathfrak{R}_n(\mathcal{H}) + \sqrt{\frac{\ln(1/\delta)}{2n}}.$$
Key examples:
- Finite class: $\hat{\mathfrak{R}}_S(\mathcal{H}) \leq \sqrt{2\ln|\mathcal{H}|/n}$ (Massart's lemma; checked numerically below)
- Linear class $\{x \mapsto \langle w, x\rangle : \|w\|_2 \leq W\}$ with data $\|x_i\|_2 \leq X$: $\hat{\mathfrak{R}}_S \leq WX/\sqrt{n}$
- LoRA rank-$r$ update: complexity scales as $\sqrt{dr/n}$ (vs $\sqrt{d^2/n}$ for a full $d \times d$ layer)
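A brute-force check of the finite-class (Massart) bound, representing each hypothesis by its prediction vector in $\{-1, +1\}^n$ — a sketch with a random class, where `H_size` and `n` are arbitrary choices:

```python
import numpy as np

np.random.seed(0)
n, H_size, n_mc = 200, 64, 2000
# Each hypothesis h is its vector of predictions on the n points
H = np.where(np.random.rand(H_size, n) < 0.5, -1.0, 1.0)
sup_vals = np.empty(n_mc)
for i in range(n_mc):
    sigma = np.random.choice([-1.0, 1.0], size=n)
    sup_vals[i] = (H @ sigma).max() / n   # sup_h (1/n) sum_i sigma_i h(x_i)
R_hat = sup_vals.mean()
massart = np.sqrt(2 * np.log(H_size) / n)
print(f'Empirical Rademacher complexity: {R_hat:.4f}')
print(f'Massart bound sqrt(2 ln|H| / n): {massart:.4f}')
```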
Code cell 23
# === 10. Rademacher Complexity: Monte Carlo Estimation ===
np.random.seed(42)
def rademacher_mc(X, W_norm, n_mc=2000):
"""
Estimate empirical Rademacher complexity for linear functions ||w||_2 <= W_norm
on data matrix X (n x d).
hat_R = E_sigma[sup_{||w||<=W} (1/n) sigma^T X w]
= (W_norm / n) * E_sigma[||X^T sigma||_2]
"""
n, d = X.shape
norms = []
for _ in range(n_mc):
sigma = np.random.choice([-1., 1.], size=n)
norms.append(np.linalg.norm(X.T @ sigma))
return W_norm * np.mean(norms) / n
def rademacher_theory(n, d, W_norm, X_norm):
"""Theoretical bound: W_norm * X_norm / sqrt(n)."""
return W_norm * X_norm / np.sqrt(n)
n, d = 100, 50
W_norm = 1.0
X = np.random.randn(n, d) / np.sqrt(d) # ||x_i||_2 ~ 1
X_norm = np.linalg.norm(X, axis=1).mean()
R_hat_mc = rademacher_mc(X, W_norm)
R_theory = rademacher_theory(n, d, W_norm, X_norm)
print(f'Empirical Rademacher complexity (n={n}, d={d}, W={W_norm})')
print(f' Monte Carlo estimate: {R_hat_mc:.4f}')
print(f' Theory bound: {R_theory:.4f}')
print(f' Ratio MC/theory: {R_hat_mc/R_theory:.3f}')
# LoRA vs full: compare as d varies
print('\nRademacher complexity: LoRA rank-r vs full layer')
print(f' {'d':>6} {'rank r':>8} {'LoRA R':>10} {'Full R':>10} {'Ratio':>8}')
for d in [64, 256, 1024]:
r = 4
# Simplified: paramcount ~ d*r vs d^2
R_lora = np.sqrt(d * r / n)
R_full = np.sqrt(d**2 / n)
print(f' {d:>6} {r:>8} {R_lora:>10.4f} {R_full:>10.4f} {R_lora/R_full:>8.4f}')
print('\nTakeaway: LoRA Rademacher complexity scales as sqrt(d*r/n) vs sqrt(d^2/n).')
11. Applications: SGD and Gradient Concentration
11.1 Mini-Batch Gradient as a Random Variable
In SGD, the mini-batch gradient $\hat{g} = \frac{1}{B}\sum_{i=1}^B \nabla\ell(\theta; z_i)$ is an unbiased estimator of the full gradient $g$. By independence of the batch samples, applying Hoeffding per coordinate gives:
$$P(|\hat{g}_j - g_j| \geq t) \leq 2\exp\!\left(-\frac{Bt^2}{2G_j^2}\right),$$
where $G_j$ bounds the $j$-th gradient component. The batch size $B$ directly controls gradient noise.
Bernstein for SGD. If $|\nabla\ell_j| \leq G$ and per-sample gradients have variance $\sigma^2$, Bernstein gives:
$$P(|\hat{g}_j - g_j| \geq t) \leq 2\exp\!\left(-\frac{Bt^2/2}{\sigma^2 + Gt/3}\right).$$
This shows gradient noise switches from a Gaussian regime (small $t$) to an exponential regime (large $t$); a numerical sketch follows.
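A sketch comparing the two bounds for a single gradient coordinate; `G`, `sigma2`, and `B` below are assumed illustrative values with $\sigma^2 \ll G^2$:

```python
import numpy as np

# Assumed values: per-sample gradients bounded by G, with small variance sigma2
G, sigma2, B = 1.0, 0.01, 64
print(f'Batch size B={B}, bound G={G}, per-sample variance sigma^2={sigma2}')
for t in [0.02, 0.05, 0.1]:
    hoeffding = 2 * np.exp(-2 * B * t**2 / (2 * G)**2)            # range-only
    bernstein = 2 * np.exp(-B * t**2 / 2 / (sigma2 + G * t / 3))  # variance-aware
    print(f'  t={t}: Hoeffding={hoeffding:.4f}  Bernstein={bernstein:.4f}')
```

With a small per-sample variance, Bernstein is non-vacuous at deviations where the range-only Hoeffding bound still exceeds 1.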
Code cell 25
# === 11.1 SGD Mini-Batch Gradient Concentration ===
np.random.seed(42)
# Simulate: gradient = mean of B IID bounded rvs in [-G, G]
G = 1.0 # gradient bound per sample
mu_grad = 0.3 # true gradient
def simulate_sgd_concentration(B, n_mc=100_000):
"""Simulate |g_hat - g_true| for mini-batch size B."""
# Each per-sample gradient: Uniform[mu_grad - G, mu_grad + G] (range width 2G).
# No clipping here: clipping a shifted uniform to [-G, G] would bias the mean.
grads = np.random.uniform(mu_grad - G, mu_grad + G, (n_mc, B))
batch_grads = grads.mean(axis=1)
return np.abs(batch_grads - mu_grad)
ts = [0.1, 0.2, 0.5]
print('P(|g_hat - g_true| >= t) for various batch sizes B')
print(f' {'B':>6} {'t=0.1':>10} {'t=0.2':>10} {'t=0.5':>10} {'Hoeffding(B,t=0.1)':>20}')
for B in [1, 8, 32, 128]:
devs = simulate_sgd_concentration(B)
probs = [(devs >= t).mean() for t in ts]
hb = 2 * np.exp(-2 * B * 0.1**2 / (2*G)**2)
print(f' {B:>6} {probs[0]:>10.5f} {probs[1]:>10.5f} {probs[2]:>10.5f} {hb:>20.5f}')
print('\n4x batch size => exponent scales by 4 (tail bound raised to the 4th power) in the Gaussian regime.')
11.2 LLM Benchmark Confidence Intervals
When evaluating an LLM on a benchmark of $n$ questions, the measured accuracy $\hat{a}$ is an estimate of the true accuracy $a$. By Hoeffding (or the exact Clopper-Pearson interval for the Binomial):
$$P(|\hat{a} - a| \geq \varepsilon) \leq 2e^{-2n\varepsilon^2}, \qquad \varepsilon(\delta) = \sqrt{\frac{\ln(2/\delta)}{2n}}.$$
Practical implications (at $\delta = 0.05$):
- MMLU with $n \approx 14{,}000$ questions: $\varepsilon \approx 1.1\%$
- Smaller benchmarks ($n = 1{,}000$): $\varepsilon \approx 4.3\%$
- Sub-task evaluation ($n = 50$): $\varepsilon \approx 19\%$ — barely distinguishable differences
Benchmark contamination shifts the distribution, invalidating the iid assumption and making concentration bounds overly optimistic — one of the key failure modes in LLM evaluation.
Code cell 27
# === 11.2 LLM Benchmark Confidence Intervals ===
from scipy import stats as sp_stats
def hoeffding_ci(n, delta=0.05):
"""Half-width of Hoeffding confidence interval."""
return np.sqrt(np.log(2 / delta) / (2 * n))
def clopper_pearson_ci(k, n, alpha=0.05):
"""Exact Clopper-Pearson CI for Binomial."""
lo = sp_stats.beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0
hi = sp_stats.beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1
return lo, hi
benchmarks = [
('MMLU', 14042),
('BIG-Bench Hard', 6511),
('HellaSwag', 10042),
('Custom (500)', 500),
('Sub-task (50)', 50),
]
print('Confidence interval half-widths for LLM benchmarks (delta=5%):')
print(f' {'Benchmark':>24} {'n':>6} {'Hoeffding eps':>14} {'CP CI width (acc=0.7)':>22}')
for name, n in benchmarks:
eps = hoeffding_ci(n)
k = int(0.7 * n)
lo, hi = clopper_pearson_ci(k, n)
print(f' {name:>24} {n:>6} {eps:>14.4f} [{lo:.4f}, {hi:.4f}] = {hi-lo:.4f}')
print('\nImplication: for n=50 sub-tasks, a 17% margin is needed to be 95% confident.')
print('Most LLM benchmark comparisons within 1% are not statistically meaningful for small n.')
11.3 Johnson-Lindenstrauss Lemma
Theorem (JL Lemma). For any set of $m$ points in $\mathbb{R}^d$ and any $\varepsilon \in (0, 1)$, there exists a linear map $f : \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\varepsilon^{-2}\log m)$ such that for all pairs $u, v$:
$$(1-\varepsilon)\|u - v\|^2 \;\leq\; \|f(u) - f(v)\|^2 \;\leq\; (1+\varepsilon)\|u - v\|^2.$$
Proof sketch. For a random Gaussian projection $f(x) = Rx/\sqrt{k}$ with $R_{ij} \sim \mathcal{N}(0, 1)$, each projected squared distance is a scaled chi-squared sum with $k$ degrees of freedom. Chernoff bounds give per-pair failure probability $\leq 2e^{-k(\varepsilon^2/2 - \varepsilon^3/3)/2}$. A union bound over all $\binom{m}{2}$ pairs then succeeds for $k \gtrsim 4\ln m/(\varepsilon^2/2 - \varepsilon^3/3)$.
Applications in LLMs:
- Random projections for approximate nearest neighbour in embedding lookup
- Dimensionality reduction before FAISS indexing
- Sketching attention matrices: approximating $QK^\top$ via random projections
Code cell 29
# === 11.3 Johnson-Lindenstrauss Lemma Verification ===
np.random.seed(42)
def jl_project(X, k):
"""Random Gaussian projection to k dimensions."""
d = X.shape[1]
R = np.random.randn(d, k) / np.sqrt(k)
return X @ R
def jl_min_k(m, eps, delta=0.05):
"""Minimum k for JL embedding of m points."""
return int(np.ceil(4 * np.log(m / delta) / (eps**2 / 2 - eps**3 / 3)))
# Generate m random points in R^d
m, d = 200, 500
X = np.random.randn(m, d)
# Compute pairwise distances in original space (sample 500 pairs)
n_pairs = 500
idx = np.random.choice(m, (n_pairs, 2))
dists_orig = np.linalg.norm(X[idx[:,0]] - X[idx[:,1]], axis=1)
print(f'JL Lemma: m={m} points, d={d}, testing projections')
print(f' {'k':>6} {'max |ratio-1|':>18} {'frac within eps=0.2':>22} {'theory k':>12}')
for eps in [0.1, 0.2, 0.3]:
k_theory = jl_min_k(m, eps)
for k in [20, 50, 100, 200]:
X_proj = jl_project(X, k)
dists_proj = np.linalg.norm(X_proj[idx[:,0]] - X_proj[idx[:,1]], axis=1)
ratios = (dists_proj / dists_orig)**2
max_dev = np.max(np.abs(ratios - 1))
frac_ok = np.mean(np.abs(ratios - 1) <= eps)
if eps == 0.2: # only print for eps=0.2
print(f' {k:>6} {max_dev:>18.4f} {frac_ok:>22.4f} {k_theory:>12}')
print(f'\nFor eps=0.2: theory requires k >= {jl_min_k(m, 0.2)} for m={m} points.')
11.4 Sub-Gaussian Properties in Transformers
Attention Logit Concentration
In self-attention, the logit between query $q$ and key $k$ is $q^\top k/\sqrt{d_k}$. If the entries $q_i, k_i$ are iid $\mathcal{N}(0, 1)$, then conditionally on $q$ the logit $q^\top k$ is $\|q\|_2$-sub-Gaussian, and $\|q\|_2^2$ concentrates at $d_k$:
$$\mathrm{Var}(q^\top k) = d_k, \qquad \mathrm{Var}\!\left(q^\top k/\sqrt{d_k}\right) = 1.$$
This motivates the $1/\sqrt{d_k}$ scaling — without it, logits would have variance $d_k$, pushing the softmax into saturation.
Residual Stream Norm Concentration
With $L$ residual layers, each adding a vector $v_\ell$ with $\|v_\ell\|_2 \leq 1$ (bounded by RMSNorm), the total residual satisfies $\|\sum_\ell v_\ell\|_2 \leq L$ in the worst case. But if the directions are random, the norm concentrates at $\sqrt{L}$ — explaining why deep transformers rescale residual branches at initialisation (e.g., GPT-2's $1/\sqrt{2L}$ rule). A quick simulation follows.
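A minimal sketch: sum $L$ random unit vectors (the dimension `d` is an arbitrary choice) and compare $\mathbb{E}\|\sum_\ell v_\ell\|_2$ with $\sqrt{L}$.

```python
import numpy as np

np.random.seed(0)
d, n_mc = 256, 200
print(f'{"L":>5} {"E||sum v_l||":>14} {"sqrt(L)":>8}')
for L in [4, 16, 64, 256]:
    V = np.random.randn(n_mc, L, d)
    V /= np.linalg.norm(V, axis=2, keepdims=True)   # unit-norm layer contributions
    norms = np.linalg.norm(V.sum(axis=1), axis=1)   # norm of the summed residual
    print(f'{L:>5} {norms.mean():>14.3f} {np.sqrt(L):>8.3f}')
```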
Gradient Noise in Large-Batch Training
The linear scaling rule (Goyal et al., 2017): scaling the batch size $B \to kB$ calls for scaling the learning rate $\eta \to k\eta$ to keep the effective noise level fixed. This follows from concentration: the mini-batch gradient variance scales as $\sigma^2/B$, so the SGD noise scale (proportional to $\eta/B$ in the SDE view) is preserved when $\eta$ and $B$ grow together.
Code cell 31
# === 11.4 Attention Logit Concentration ===
np.random.seed(42)
N_mc = 100_000
print('Attention logit variance before and after scaling:')
print(f' {'d_k':>6} {'Var(q^T k) unscaled':>22} {'Var(q^T k/sqrt(d_k))':>24} {'Softmax entropy':>16}')
for d_k in [4, 16, 64, 256]:
q = np.random.randn(N_mc, d_k)
k = np.random.randn(N_mc, d_k)
logits_raw = (q * k).sum(axis=1)
logits_scaled = logits_raw / np.sqrt(d_k)
var_raw = logits_raw.var()
var_scaled = logits_scaled.var()
# Softmax entropy over 10 keys (smaller MC sample here to keep memory modest)
n_keys, N_ent = 10, 20_000
keys_all = np.random.randn(N_ent, n_keys, d_k)
q_batched = q[:N_ent, np.newaxis, :]  # (N_ent, 1, d_k)
logits_all = (q_batched * keys_all).sum(axis=2)  # (N_ent, n_keys)
softmax_scaled = np.exp(logits_all / np.sqrt(d_k))
softmax_scaled /= softmax_scaled.sum(axis=1, keepdims=True)
entropy_scaled = -(softmax_scaled * np.log(softmax_scaled + 1e-9)).sum(axis=1).mean()
print(f' {d_k:>6} {var_raw:>22.3f} {var_scaled:>24.3f} {entropy_scaled:>16.3f}')
print('\nWith scaling, Var ≈ 1 regardless of d_k — prevents softmax saturation.')
print('Max entropy = log(10) =', round(np.log(10), 3), '(uniform softmax)')
12. Choosing the Right Bound: A Decision Framework
| Scenario | Best Bound | Formula |
|---|---|---|
| $X \geq 0$, only mean known | Markov | $\mathbb{E}[X]/t$ |
| Finite variance, arbitrary distribution | Chebyshev | $\sigma^2/t^2$ |
| Bounded iid sum, worst case | Hoeffding | $2e^{-2n\varepsilon^2}$ |
| Bounded iid sum, small variance | Bernstein | $\exp\big(-\tfrac{t^2/2}{v + Mt/3}\big)$ |
| Binomial, multiplicative deviation | Chernoff | $e^{-\mu\delta^2/3}$ |
| Function of iid vars, bounded diffs | McDiarmid | $\exp\big(-2t^2/\sum_i c_i^2\big)$ |
| Generalisation for finite $\mathcal{H}$ | Union + Hoeffding | $\sqrt{\ln(2\lvert\mathcal{H}\rvert/\delta)/(2n)}$ |
| Generalisation for general $\mathcal{H}$ | Rademacher | $2\mathfrak{R}_n(\mathcal{H}) + \sqrt{\ln(1/\delta)/(2n)}$ |
| High-dimensional distances | JL Lemma | $k = O(\varepsilon^{-2}\log m)$ |
The key question: What do you know about $X$?
- Non-negative + mean → Markov
- + variance → Chebyshev
- + bounded support → Hoeffding
- + small variance → Bernstein
- + MGF structure → Chernoff
Code cell 33
# === 12. Bound Comparison on a Single Example ===
np.random.seed(42)
# n=100 iid X_i ~ Uniform[0,1], S_n = sum, E[S_n]=50, Var=100/12
n = 100
mu = 0.5
sigma2 = 1.0 / 12 # Var(Uniform[0,1])
a, b = 0.0, 1.0
# We want P(S_n/n - mu >= t) for t = 0.05
t = 0.05
# Bounds for sample mean deviation
# Markov: P(X_bar >= mu + t) = P(S_n >= n(mu+t)) <= E[S_n]/(n(mu+t)) = mu/(mu+t)
markov_b = mu / (mu + t)
cheby_b = sigma2 / (n * t**2) # Var(X_bar) = sigma2/n
hoeffding_b = np.exp(-2 * n * t**2 / (b - a)**2)
bernstein_b = np.exp(-n * t**2 / 2 / (sigma2 + (b-a)*t/3))
# Empirical
N_mc = 1_000_000
samples = np.random.uniform(a, b, (N_mc, n)).mean(axis=1)
empirical = (samples - mu >= t).mean()
print(f'P(X_bar - {mu} >= {t}), n={n}, X_i ~ Uniform[0,1]:')
print(f' Empirical: {empirical:.6f}')
print(f' Markov: {markov_b:.6f} (ratio {markov_b/empirical:.1f}x)')
print(f' Chebyshev: {cheby_b:.6f} (ratio {cheby_b/empirical:.1f}x)')
print(f' Hoeffding: {hoeffding_b:.6f} (ratio {hoeffding_b/empirical:.1f}x)')
print(f' Bernstein: {bernstein_b:.6f} (ratio {bernstein_b/empirical:.1f}x)')
print(f'\nSmaller ratio = tighter bound. Bernstein wins by using variance info.')
Summary
| Concept | Key Formula | ML Relevance |
|---|---|---|
| Markov | $P(X \geq t) \leq \mathbb{E}[X]/t$ | Baseline; gradient norm bounds |
| Chebyshev | $P(\lvert X - \mu\rvert \geq t) \leq \sigma^2/t^2$ | Variance-only assumption; WLLN rate |
| Sub-Gaussian | $\mathbb{E}[e^{tX}] \leq e^{\sigma^2t^2/2}$ | Bounded/Gaussian noise |
| Hoeffding | $P(\lvert\bar{X}_n - \mu\rvert \geq \varepsilon) \leq 2e^{-2n\varepsilon^2}$ | Sample complexity in PAC |
| Chernoff | $\inf_{\lambda>0} e^{-\lambda t}M_X(\lambda)$ | Binomial counts, hitting probs |
| Bernstein | $\exp\big(-\tfrac{t^2/2}{v + Mt/3}\big)$ | Sparse gradients in SGD |
| McDiarmid | $\exp\big(-2t^2/\sum_i c_i^2\big)$ | Empirical risk, any function |
| Union bound | $P(\cup_i A_i) \leq \sum_i P(A_i)$ | Any finite hypothesis class |
| VC bound | $\sqrt{(2d_{\mathrm{VC}}\ln(en/d_{\mathrm{VC}}) + 2\ln(4/\delta))/n}$ | Shattering dimension |
| Rademacher | $2\mathfrak{R}_n + \sqrt{\ln(1/\delta)/(2n)}$ | Data-dependent, tightest |
| JL Lemma | $k = O(\varepsilon^{-2}\log m)$ | Random projections, embeddings |
→ See notes.md for full proofs and theory.
→ See exercises.ipynb for practice problems.
13. Preview: Law of Large Numbers via Concentration
The Weak Law of Large Numbers (WLLN) follows immediately from Chebyshev:
$$P(|\bar{X}_n - \mu| \geq \varepsilon) \leq \frac{\sigma^2}{n\varepsilon^2} \xrightarrow{n \to \infty} 0.$$
The Strong LLN requires more work (Borel-Cantelli or martingale argument), but Chebyshev's inequality gives the WLLN with a quantitative rate.
Strong LLN via Borel-Cantelli (sketch). Along the subsequence $n_k = k^2$, Chebyshev gives $P(|\bar{X}_{k^2} - \mu| \geq \varepsilon) \leq \sigma^2/(k^2\varepsilon^2)$. Since $\sum_k k^{-2} < \infty$, Borel-Cantelli gives $\bar{X}_{n_k} \to \mu$ a.s. along the subsequence; interpolating between consecutive $n_k$ completes the proof.
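The summability driving Borel-Cantelli is easy to check numerically (a minimal sketch):

```python
import numpy as np

eps, sigma2 = 0.1, 1.0
ks = np.arange(1, 100_001)
bounds = sigma2 / (ks**2 * eps**2)   # Chebyshev bound at n_k = k^2
for K in [10, 100, 10_000, 100_000]:
    print(f'sum of first {K:>6} bounds: {bounds[:K].sum():.3f}')
# The series converges to sigma^2 * (pi^2/6) / eps^2, so Borel-Cantelli applies
print(f'limit sigma^2 * (pi^2/6) / eps^2 = {sigma2 * np.pi**2 / 6 / eps**2:.3f}')
```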
CLT preview. The Central Limit Theorem tells us the shape of the deviation:
$$\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2).$$
This is sharper than Chebyshev's $\sigma^2/(n\varepsilon^2)$ but only asymptotic; concentration inequalities give non-asymptotic guarantees valid for every finite $n$.
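A sketch contrasting the two for the Exponential(1) mean: the CLT approximation tracks the truth but is not a guarantee, while Chebyshev is a valid (loose) bound.

```python
import numpy as np
from scipy import stats

np.random.seed(0)
n, N_mc, eps = 400, 20_000, 0.1
xbar = np.random.exponential(1.0, (N_mc, n)).mean(axis=1)   # mu = sigma = 1
actual = (np.abs(xbar - 1.0) >= eps).mean()
clt = 2 * stats.norm.sf(eps * np.sqrt(n))   # asymptotic approximation, not a bound
cheby = 1.0 / (n * eps**2)                  # valid non-asymptotic bound
print(f'P(|X_bar - 1| >= {eps}) at n={n}: actual={actual:.4f}  CLT={clt:.4f}  Chebyshev={cheby:.3f}')
```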
→ Full treatment: §06/06 Stochastic Processes for martingales; §07 Statistics for CLT applications.
Code cell 36
# === 13. LLN Convergence: Chebyshev Rate vs Empirical ===
np.random.seed(42)
# Exponential(1): mean=1, var=1
N_mc = 10_000
ns = np.logspace(1, 4, 30).astype(int)
eps = 0.1
empirical_probs = []
chebyshev_bounds = []
hoeffding_bounds_clip = []
for n in ns:
samples = np.random.exponential(1.0, (N_mc, n)).mean(axis=1)
prob = (np.abs(samples - 1.0) >= eps).mean()
empirical_probs.append(prob)
chebyshev_bounds.append(1.0 / (n * eps**2)) # sigma^2=1
# Hoeffding: need bounded, so clip to [0, 10]; c_i ~ 10/n
hoeffding_bounds_clip.append(2 * np.exp(-2 * n * eps**2 / (10**2)))
if HAS_MPL:
fig, ax = plt.subplots(figsize=(10,5))
ax.loglog(ns, empirical_probs, 'o-', label='Empirical P(|X_bar-1|>=0.1)', ms=4)
ax.loglog(ns, chebyshev_bounds, '--', label='Chebyshev bound (1/n)', lw=2)
ax.loglog(ns, hoeffding_bounds_clip, ':', label='Hoeffding (clipped at 10)', lw=2)
ax.set_xlabel('Sample size $n$')
ax.set_ylabel('$P(|\\bar{X}_n - \\mu| \\geq 0.1)$')
ax.set_title('LLN Rate: Empirical vs Chebyshev Bound (Exponential(1))')
ax.legend()
plt.tight_layout()
plt.show()
n_near = min(ns, key=lambda x: abs(x - 100))   # grid point closest to n=100
emp_near = empirical_probs[ns.tolist().index(n_near)]
print(f'At n={n_near}: empirical={emp_near:.4f}, Chebyshev={1/(n_near*eps**2):.4f}')
14. Azuma's Inequality for Martingales
Azuma extends Hoeffding from iid sums to martingales.
Theorem (Azuma-Hoeffding). Let $(M_k)_{k=0}^n$ be a martingale with bounded differences: $|M_k - M_{k-1}| \leq c_k$ a.s. Then:
$$P(M_n - M_0 \geq t) \leq \exp\!\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right).$$
Why this is powerful: the underlying variables need not be independent — the martingale differences only need to be bounded and have zero conditional mean. This is exactly the structure of the Doob martingale used in McDiarmid's proof.
Doob martingale. For $f(X_1, \dots, X_n)$ with independent $X_i$, define:
$$M_k = \mathbb{E}[f(X_1, \dots, X_n) \mid X_1, \dots, X_k].$$
Then $M_0 = \mathbb{E}[f]$, $M_n = f(X_1, \dots, X_n)$, and $|M_k - M_{k-1}| \leq c_k$ iff $f$ has $c_k$-bounded differences in coordinate $k$. Azuma applied to the Doob martingale yields McDiarmid; a minimal construction follows.
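A minimal construction (sketch) for $f =$ the empirical mean of $n$ Bernoulli($p$) variables, where $M_k$ has the closed form $(X_1 + \cdots + X_k)/n + (n-k)p/n$:

```python
import numpy as np

np.random.seed(0)
n, p = 20, 0.3
X = np.random.binomial(1, p, n)
# Doob martingale for f = mean(X): M_k = E[f | X_1..X_k]
#                                      = (X_1 + ... + X_k)/n + (n - k) * p / n
k = np.arange(n + 1)
partial = np.concatenate([[0.0], X.cumsum()])
M = partial / n + (n - k) * p / n
D = np.diff(M)                        # D_k = (X_k - p)/n
print(f'M_0 = {M[0]:.3f} (= E[f]),  M_n = {M[-1]:.3f} (= f(X))')
print(f'max |D_k| = {np.abs(D).max():.3f} <= c_k = 1/n = {1/n:.3f}')
```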
Applications in ML:
- Online learning regret bounds
- Stochastic approximation (SGD) convergence
- Adaptive data collection (active learning, bandits)
Code cell 38
# === 14. Azuma: Random Walk and Martingale ===
np.random.seed(42)
# Simple martingale: partial sums of bounded differences D_k in [-c, c]
n_steps = 200
c = 1.0
N_mc = 50_000
# D_k ~ Uniform[-c, c], zero mean
D = np.random.uniform(-c, c, (N_mc, n_steps))
M = D.cumsum(axis=1) # martingale starting at 0
# Azuma bound for M_n >= t
ts = np.linspace(1, 20, 100)
azuma = np.exp(-ts**2 / (2 * n_steps * c**2))
empirical = np.array([(M[:,-1] >= t).mean() for t in ts])
if HAS_MPL:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Left: sample paths
ax = axes[0]
for i in range(20):
ax.plot(M[i], alpha=0.4, lw=0.8)
ax.axhline(0, color='k', lw=1)
# Azuma CI bands
steps = np.arange(n_steps)
band = np.sqrt(2 * (steps+1) * c**2 * np.log(2 / 0.05))
ax.plot(steps, band, 'r--', lw=2, label='Azuma 95% band')
ax.plot(steps, -band, 'r--', lw=2)
ax.set_xlabel('Step $k$')
ax.set_ylabel('$M_k$')
ax.set_title('Martingale Sample Paths + Azuma Bounds')
ax.legend()
# Right: tail bound
ax = axes[1]
ax.semilogy(ts, empirical, 'o-', ms=3, label='Empirical')
ax.semilogy(ts, azuma, '--', lw=2, label='Azuma bound')
ax.set_xlabel('$t$')
ax.set_ylabel('$P(M_n \\geq t)$')
ax.set_title(f'Azuma Tail Bound (n={n_steps}, c={c})')
ax.legend()
plt.tight_layout()
plt.show()
t_test = 10.0
emp = (M[:,-1] >= t_test).mean()
az = np.exp(-t_test**2 / (2 * n_steps * c**2))
print(f'At t={t_test}: empirical={emp:.5f}, Azuma bound={az:.5f}')
15. Beyond Sub-Gaussian: Heavy Tails
Not all random variables in ML are bounded or sub-Gaussian. Gradient norms in transformer training can be heavy-tailed (empirically close to a Student-$t$ distribution). For such variables, exponential bounds fail; polynomial or sub-exponential bounds must be used.
Sub-exponential variables. $X$ is sub-exponential with parameters $(\nu, b)$ if:
$$\mathbb{E}[e^{\lambda(X - \mu)}] \leq e^{\nu^2\lambda^2/2} \quad \text{for all } |\lambda| \leq 1/b.$$
Sub-exponential behaves like sub-Gaussian for small deviations, with an exponential tail for large deviations. Example: $\chi^2$ with $k$ degrees of freedom is sub-exponential with $\nu = 2\sqrt{k}$, $b = 4$.
Catoni's estimator. For heavy-tailed means with finite variance $\sigma^2$, the sample mean is not optimal. Catoni's robust estimator $\hat{\mu}_C$ achieves:
$$|\hat{\mu}_C - \mu| \leq O\!\left(\sigma\sqrt{\frac{\log(1/\delta)}{n}}\right)$$
with probability $\geq 1 - \delta$ — the sub-Gaussian rate, despite heavy tails. A sketch of the estimator follows.
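A sketch of Catoni's M-estimator on heavy-tailed Student-$t$ data, solved by bisection; the influence function follows Catoni's construction, while the tuning of `alpha` here is a simplified assumption:

```python
import numpy as np

np.random.seed(0)

def catoni_psi(x):
    """Catoni's narrowest influence function (odd extension of log(1 + x + x^2/2))."""
    return np.sign(x) * np.log1p(np.abs(x) + x**2 / 2)

def catoni_mean(x, sigma2, delta=0.05, iters=60):
    """Catoni's M-estimator: solve sum psi(alpha (x_i - theta)) = 0 by bisection."""
    n = len(x)
    alpha = np.sqrt(2 * np.log(1 / delta) / (n * sigma2))  # simplified tuning
    lo, hi = x.min(), x.max()
    for _ in range(iters):
        mid = (lo + hi) / 2
        if catoni_psi(alpha * (x - mid)).sum() > 0:   # the sum is decreasing in theta
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

df, n, n_trials = 2.5, 200, 2000
var_t = df / (df - 2)                                 # Student-t variance
err_mean, err_cat = [], []
for _ in range(n_trials):
    x = np.random.standard_t(df, n)
    err_mean.append(abs(x.mean()))
    err_cat.append(abs(catoni_mean(x, var_t)))
print(f'99th percentile |error| over {n_trials} trials (t({df}), n={n}):')
print(f'  sample mean: {np.quantile(err_mean, 0.99):.3f}')
print(f'  Catoni:      {np.quantile(err_cat, 0.99):.3f}')
```

The comparison at an extreme quantile is the point: the sample mean's worst-case errors are inflated by heavy tails, while the bounded influence function caps each sample's contribution.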
Forward reference: Matrix concentration inequalities (Tropp's matrix Bernstein, non-commutative Khintchine) handle random matrices — directly applicable to random attention weight matrices. See §08 Advanced Topics in Probability if available.
Code cell 40
# === 15. Sub-Exponential: Chi-Squared vs Gaussian Tails ===
np.random.seed(42)
k = 5 # degrees of freedom
N_mc = 1_000_000
# chi^2(k): sum of k standard Gaussians squared
X_chi2 = np.random.chisquare(k, N_mc) # mean=k, var=2k
mu_chi2 = k
sigma2_chi2 = 2 * k
ts = np.linspace(1, 30, 200)
# Sub-exponential Bernstein bound: P(X - mu >= t) <= exp(-t^2/(2(2k + 4t)))
def se_bernstein(t, k):
v = 2 * k # variance = 2k
b = 4 # parameter b for chi^2
return np.exp(-t**2 / 2 / (v + b * t))
sub_gauss_b = np.exp(-ts**2 / (2 * sigma2_chi2)) # treating as sub-Gaussian
bernstein_b = se_bernstein(ts, k)
empirical_b = np.array([(X_chi2 - mu_chi2 >= t).mean() for t in ts])
if HAS_MPL:
fig, ax = plt.subplots(figsize=(10,5))
ax.semilogy(ts, empirical_b, 'k-', lw=2, label='Empirical $\chi^2(5)$')
ax.semilogy(ts, sub_gauss_b, '--', label='Sub-Gaussian bound (wrong!)', lw=2)
ax.semilogy(ts, bernstein_b, '-.', label='Bernstein sub-exp bound', lw=2)
ax.set_xlabel('$t$')
ax.set_ylabel('$P(X - \\mu \\geq t)$')
ax.set_title('$\\chi^2(5)$ Tail: Sub-Gaussian vs Sub-Exponential Bound')
ax.set_xlim(0, 30)
ax.legend()
plt.tight_layout()
plt.show()
t_test = 15.0
print(f'At t={t_test} above mean:')
print(f' Empirical: {(X_chi2 - mu_chi2 >= t_test).mean():.6f}')
print(f' Sub-Gaussian: {np.exp(-t_test**2/(2*sigma2_chi2)):.6f} (too tight!)')
print(f' Bernstein: {se_bernstein(t_test, k):.6f} (valid bound)')
16. Practical Guidelines for ML Practitioners
When Bounds Are Tight vs Loose
Concentration bounds are worst-case over all distributions satisfying the assumptions. They can be very loose in practice:
- VC bounds for neural networks: $d_{\mathrm{VC}} = O(W\log W)$ is enormous, giving vacuous bounds
- Rademacher complexity for transformers: also often loose due to norm-based analysis
- PAC-Bayes bounds (McAllester, Catoni) are often much tighter in practice
What Bounds ARE Useful For
Despite looseness, concentration inequalities provide:
- Scaling intuition: generalisation gap $= O(1/\sqrt{n})$ — more data always helps
- Architecture comparison: LoRA has Rademacher factor $\sqrt{dr/n}$ vs $\sqrt{d^2/n}$ for full fine-tuning
- Benchmark validity: CIs on accuracy estimates via Hoeffding or Clopper-Pearson
- Hyperparameter reasoning: batch size $B$ → gradient noise scales as $1/\sqrt{B}$
- Privacy analysis: DP-SGD noise calibrated via concentration (Gaussian mechanism)
The Sub-Gaussian Assumption in Practice
Empirical gradient distributions in transformers are often heavy-tailed (Student-$t$-like). This is why gradient clipping is used — it effectively imposes bounded differences, making the Hoeffding/Bernstein assumptions approximately valid after clipping.
Code cell 42
# === 16. Gradient Clipping: From Heavy-Tailed to Sub-Gaussian ===
np.random.seed(42)
N_mc = 500_000
# Heavy-tailed gradient: Cauchy-like (Student-t with low df)
df = 3
grads_heavy = np.random.standard_t(df, N_mc)
def clip_gradients(g, clip_norm):
return np.clip(g, -clip_norm, clip_norm)
clip_values = [1.0, 2.0, 5.0, np.inf]
print('Effect of gradient clipping on tail behaviour:')
print(f' {'Clip':>8} {'P(|g|>3)':>12} {'P(|g|>5)':>12} {'P(|g|>10)':>12} {'Std':>8}')
for clip in clip_values:
if clip == np.inf:
g = grads_heavy
label = 'None'
else:
g = clip_gradients(grads_heavy, clip)
label = f'{clip:.1f}'
p3 = (np.abs(g) >= 3).mean()
p5 = (np.abs(g) >= 5).mean()
p10 = (np.abs(g) >= 10).mean()
std = g.std()
print(f' {label:>8} {p3:>12.5f} {p5:>12.5f} {p10:>12.5f} {std:>8.4f}')
print('\nGaussian(0,1) for reference:')
g_gauss = np.random.randn(N_mc)
print(f' {'Gaussian':>8} {(np.abs(g_gauss)>=3).mean():>12.5f} '
f'{(np.abs(g_gauss)>=5).mean():>12.5f} {(np.abs(g_gauss)>=10).mean():>12.5f} '
f'{g_gauss.std():>8.4f}')
print('\nClipping at 2.0 makes tail comparable to Gaussian — enabling Hoeffding/Bernstein analysis.')
17. PAC-Bayes Bounds
PAC-Bayes bounds are often the tightest available for neural networks.
Theorem (McAllester PAC-Bayes). For a prior $P$ on $\mathcal{H}$ (chosen before seeing data) and any posterior $Q$ on $\mathcal{H}$, with probability $\geq 1 - \delta$ over the sample:
$$R(Q) \leq \hat{R}_n(Q) + \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln(n/\delta)}{2(n-1)}}.$$
Intuition: the generalisation gap is controlled by $\mathrm{KL}(Q\,\|\,P)$ — how far the posterior deviates from the prior. If training barely moves the parameters (small KL), the model generalises well.
Connection to regularisation:
- $L_2$ regularisation $\leftrightarrow$ Gaussian prior $P = \mathcal{N}(0, \sigma_P^2 I)$
- $\mathrm{KL} \propto \|\theta\|_2^2/\sigma_P^2$ for Gaussian posteriors centred at $\theta$
- LoRA: small KL because the update $\Delta W = BA$ has rank $r \ll d$, so the posterior stays close to the prior
This is why deep learning 'works' despite overparameterisation: SGD with a small learning rate plus early stopping yields posteriors close to the initialisation prior, keeping the KL — and hence the generalisation gap — small.
Code cell 44
# === 17. PAC-Bayes: KL and Generalisation Gap ===
np.random.seed(42)
from scipy.special import xlogy
def kl_bernoulli(q, p):
"""KL divergence between Bernoulli(q) and Bernoulli(p)."""
q = np.clip(q, 1e-9, 1 - 1e-9)
p = np.clip(p, 1e-9, 1 - 1e-9)
return xlogy(q, q/p) + xlogy(1-q, (1-q)/(1-p))
def pacbayes_bound(R_hat, kl, n, delta=0.05):
"""McAllester PAC-Bayes bound."""
return R_hat + np.sqrt((kl + np.log(n / delta)) / (2 * (n - 1)))
# Scenario: linear model, Gaussian prior N(0, I), Gaussian posterior N(w*, sigma^2 I)
def gaussian_kl(theta, sigma2_prior=1.0, sigma2_post=0.01, d=100):
"""KL(N(theta, sigma2_post I) || N(0, sigma2_prior I))."""
kl = 0.5 * (d * (sigma2_post / sigma2_prior - 1 - np.log(sigma2_post / sigma2_prior))
+ np.dot(theta, theta) / sigma2_prior)
return kl
n, d = 1000, 100
R_hat = 0.1 # empirical risk
print('PAC-Bayes generalisation bound vs KL(posterior||prior):')
print(f' {'||theta||_2':>12} {'KL':>10} {'Gap bound':>12} {'R bound':>10}')
for norm_theta in [0.5, 1.0, 2.0, 5.0, 10.0]:
theta = np.ones(d) * norm_theta / np.sqrt(d)
kl = gaussian_kl(theta)
gap = np.sqrt((kl + np.log(n / 0.05)) / (2 * (n - 1)))
R_bound = R_hat + gap
print(f' {norm_theta:>12.1f} {kl:>10.2f} {gap:>12.4f} {R_bound:>10.4f}')
print(f'\nKey insight: KL grows as ||theta||^2 / sigma2_prior.')
print('L2 regularisation (weight decay) penalises exactly this term.')
18. Double Descent and the Failure of Classical Bounds
Classical VC bounds predict monotonically increasing generalisation error with model complexity. Double descent (Belkin et al., 2019) shows that modern overparameterised models can achieve low test error even when $p \gg n$:
Schematically: test error first falls, then rises to a peak at the interpolation threshold ($p \approx n$) as variance grows in the classical regime (bias↓, variance↑), then falls again in the modern overparameterised regime, where implicit regularisation tames the variance as $p/n$ grows.
Why classical bounds fail:
- VC/Rademacher bounds don't capture the effect of the optimisation algorithm (SGD's implicit bias)
- Minimum-norm interpolation has implicit regularisation not captured by norm bounds
- Benign overfitting (Bartlett et al., 2020): overparameterised models can fit noise and still generalise if signal-to-noise is high
Concentration still applies: Even in double descent, Rademacher/PAC-Bayes bounds apply — they're just not tight. The issue is that classical bounds use worst-case norms, while SGD finds solutions with small effective norms via implicit regularisation.
Code cell 46
# === 18. Double Descent: Polynomial Regression Demo ===
np.random.seed(42)
from numpy.polynomial import polynomial as P
# True function: sin(2*pi*x) with noise
def true_fn(x):
return np.sin(2 * np.pi * x)
n_train = 20
sigma_noise = 0.2
x_train = np.linspace(0, 1, n_train)
y_train = true_fn(x_train) + np.random.randn(n_train) * sigma_noise
x_test = np.linspace(0, 1, 200)
y_test = true_fn(x_test)
degrees = list(range(1, 35))
train_errors = []
test_errors = []
for deg in degrees:
# Vandermonde design matrix
V_train = np.vander(x_train, deg + 1, increasing=True)
V_test = np.vander(x_test, deg + 1, increasing=True)
# Minimum-norm least squares (pseudoinverse)
w = np.linalg.pinv(V_train) @ y_train
train_err = np.mean((V_train @ w - y_train)**2)
test_err = np.mean((V_test @ w - y_test)**2)
train_errors.append(train_err)
test_errors.append(test_err)
if HAS_MPL:
fig, ax = plt.subplots(figsize=(10, 5))
ax.semilogy(degrees, train_errors, 'o-', ms=4, label='Train MSE')
ax.semilogy(degrees, test_errors, 's-', ms=4, label='Test MSE')
ax.axvline(n_train, color='r', ls='--', lw=2, label=f'n_train={n_train} (interpolation threshold)')
ax.set_xlabel('Polynomial degree (model complexity)')
ax.set_ylabel('MSE')
ax.set_title('Double Descent in Polynomial Regression')
ax.legend()
plt.tight_layout()
plt.show()
peak_test = max(test_errors[:n_train + 2])  # peak near the interpolation threshold
final_test = test_errors[-1]
print(f'Test error peak near interpolation: {peak_test:.4f}')
print(f'Test error at highest degree: {final_test:.4f}')
print('Double descent: error peaks then decreases in overparameterised regime.')
19. Differential Privacy and Concentration
Differentially Private SGD (DP-SGD; Abadi et al., 2016) adds calibrated Gaussian noise to gradients to prevent membership inference. The noise scale is determined by concentration:
- Clip individual gradients: $\tilde{g}_i = g_i \cdot \min(1, C/\|g_i\|_2)$ — ensures $\|\tilde{g}_i\|_2 \leq C$
- Aggregate + noise: $\hat{g} = \frac{1}{B}\big(\sum_i \tilde{g}_i + \mathcal{N}(0, \sigma^2 C^2 I)\big)$
- Privacy accounting: per step, the Gaussian mechanism gives $(\varepsilon, \delta)$-DP when $\sigma \geq \sqrt{2\ln(1.25/\delta)}/\varepsilon$
The Gaussian mechanism analysis rests on the sub-Gaussian concentration of the noise. The privacy-utility tradeoff: larger $\sigma$ → better privacy, but larger gradient noise → slower convergence.
Moments accountant / Rényi DP gives tighter accounting via concentration of privacy loss across steps.
Code cell 48
# === 19. DP-SGD Noise Calibration ===
np.random.seed(42)
def dp_sgd_sigma(eps, delta, C, B, n):
    """
    Noise multiplier sigma for (eps, delta)-DP, so the added noise is sigma * C * I
    with clipping norm C. Simplified Gaussian-mechanism calibration with a 1/q
    subsampling heuristic.
    """
    q = B / n  # sampling ratio
    return np.sqrt(2 * np.log(1.25 / delta)) / (eps * q)
n = 60000 # MNIST-size training set
B = 256
C = 1.0
print('DP-SGD noise sigma for various privacy budgets:')
print(f' (n={n}, batch={B}, clip_norm={C})')
print(f' {'eps':>6} {'delta':>10} {'sigma':>10} {'noise/signal ratio':>20}')
for eps in [0.1, 0.5, 1.0, 5.0, 10.0]:
for delta in [1e-5, 1e-3]:
sigma = dp_sgd_sigma(eps, delta, C, B, n)
# noise/signal: sigma * C / (batch_grad_norm ~ C)
ratio = sigma
print(f' {eps:>6.1f} {delta:>10.0e} {sigma:>10.4f} {ratio:>20.4f}')
print('\nInterpretation: sigma >> 1 means privacy noise dominates gradient signal.')
print('Strong privacy (eps=0.1) requires enormous noise -> very slow convergence.')
20. Preview: Matrix Concentration Inequalities
Scalar concentration inequalities extend to random matrices via matrix versions of Bernstein.
Theorem (Matrix Bernstein, Tropp 2012). Let $S = \sum_i X_i$ where the $X_i$ are independent random Hermitian $d \times d$ matrices with $\mathbb{E}[X_i] = 0$ and $\|X_i\|_2 \leq M$ a.s. Let $v = \big\|\sum_i \mathbb{E}[X_i^2]\big\|_2$. Then:
$$P(\|S\|_2 \geq t) \leq 2d\,\exp\!\left(-\frac{t^2/2}{v + Mt/3}\right),$$
where $d$ is the matrix dimension and $\|\cdot\|_2$ is the spectral norm.
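A sketch checking matrix Bernstein by simulation, with $X_i = \varepsilon_i A_i$ for fixed symmetric $A_i$ (spectral norm 1) and Rademacher signs $\varepsilon_i$; the bound holds, though conservatively (note the dimensional factor $2d$):

```python
import numpy as np

np.random.seed(0)
d, n = 10, 50
# Fixed symmetric matrices A_i normalised to ||A_i||_2 = 1;
# X_i = eps_i * A_i with Rademacher eps_i, so E[X_i] = 0 and M = 1.
A = np.random.randn(n, d, d)
A = (A + A.transpose(0, 2, 1)) / 2
A /= np.linalg.norm(A, ord=2, axis=(1, 2))[:, None, None]
M = 1.0
v = np.linalg.norm((A @ A).sum(axis=0), 2)   # v = ||sum_i E[X_i^2]|| = ||sum_i A_i^2||
N_mc = 5000
norms = np.empty(N_mc)
for i in range(N_mc):
    eps = np.random.choice([-1.0, 1.0], n)
    norms[i] = np.linalg.norm((eps[:, None, None] * A).sum(axis=0), 2)
for t in [8.0, 10.0, 12.0]:
    emp = (norms >= t).mean()
    bound = 2 * d * np.exp(-t**2 / 2 / (v + M * t / 3))
    print(f't={t}: empirical={emp:.4f}  matrix Bernstein={min(bound, 1.0):.4f}')
```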
Applications:
- Random attention: concentration of $QK^\top$ around its expectation
- Covariance estimation: spectral norm concentration
- LoRA: concentration of low-rank update spectral norm
- Random features: approximation error of random feature maps for kernel methods
Full treatment in §08 Advanced Probability Topics (if available) or Tropp's An Introduction to Matrix Concentration Inequalities (2015, arXiv:1501.01571).
Code cell 50
# === 20. Matrix Concentration: Random Attention ===
np.random.seed(42)
# Simulate spectral norm concentration of the random attention logit matrix
# A = (Q K^T) / d_k, built from Q, K with iid N(0, 1/d_k) entries
def attention_spectral_norm(n_seq, d_k, N_mc=500):
"""Empirical distribution of ||QK^T/sqrt(d_k)||_2."""
norms = []
for _ in range(N_mc):
Q = np.random.randn(n_seq, d_k) / np.sqrt(d_k)
K = np.random.randn(n_seq, d_k) / np.sqrt(d_k)
A = Q @ K.T # already has 1/d_k from normalisation
norms.append(np.linalg.norm(A, 2))
return np.array(norms)
print('Spectral norm of A = Q K^T / d_k (Q, K entries iid N(0, 1/d_k))')
print(f' {'n_seq':>6} {'d_k':>5} {'E[||A||]':>10} {'Std[||A||]':>12} '
f'{'sqrt(n_seq/d_k)':>15}')
for n_seq, d_k in [(10,64), (20,64), (50,64), (100,64), (64,64), (64,256)]:
norms = attention_spectral_norm(n_seq, d_k, N_mc=300)
theory = np.sqrt(n_seq / d_k)  # heuristic scaling (up to constants), not an exact limit
print(f' {n_seq:>6} {d_k:>5} {norms.mean():>10.4f} {norms.std():>12.4f} '
f'{theory:>15.4f}')
print('\nSpectral norm grows like sqrt(n_seq/d_k) — longer sequences or smaller d_k -> larger logits.')
print('This motivates logit scaling (and logit soft-capping in some architectures) for long sequences.')
Code cell 51
# === Final: Concentration Inequality Summary Plot ===
np.random.seed(42)
if HAS_MPL:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Plot 1: Tail hierarchy
ax = axes[0, 0]
t = np.linspace(0.2, 5, 300)
ax.semilogy(t, 1/t, label='Markov: 1/t')
ax.semilogy(t, 1/t**2, label='Chebyshev: 1/t^2')
ax.semilogy(t, np.exp(-2*t**2), label='Hoeffding/sub-Gauss')
ax.semilogy(t, 2*stats.norm.sf(t), label='Gaussian (actual)', ls='--')
ax.set_title('Tail Bound Hierarchy')
ax.set_xlabel('t'); ax.set_ylabel('Bound (log scale)')
ax.legend(fontsize=9); ax.set_ylim(1e-7, 2)
# Plot 2: PAC learning curve
ax = axes[0, 1]
ns = np.logspace(1, 5, 100)
for H_log, style in [(2, '-'), (5, '--'), (10, ':')]:
H = 10**H_log
gap = np.sqrt(np.log(H / 0.05) / (2 * ns))
ax.loglog(ns, gap, style, label=f'|H|=10^{H_log}')
ax.set_title('PAC Generalisation Gap vs n')
ax.set_xlabel('Training samples n')
ax.set_ylabel('Gap bound')
ax.legend(); ax.axhline(0.05, color='gray', ls=':', lw=1)
# Plot 3: Hoeffding verification
ax = axes[1, 0]
ns_test = np.arange(10, 500, 10)
N_mc = 50000
empirical_50 = []
bounds_50 = []
for n in ns_test:
s = np.random.binomial(1, 0.5, (N_mc, n)).mean(axis=1)
empirical_50.append((np.abs(s - 0.5) >= 0.1).mean())
bounds_50.append(2 * np.exp(-2 * n * 0.01))
ax.semilogy(ns_test, empirical_50, label='Empirical')
ax.semilogy(ns_test, bounds_50, '--', label='Hoeffding')
ax.set_title('Hoeffding: |X_bar - 0.5| >= 0.1')
ax.set_xlabel('n'); ax.set_ylabel('P (log scale)')
ax.legend()
# Plot 4: LoRA Rademacher scaling
ax = axes[1, 1]
ds = np.logspace(1, 3, 50)
n_val = 1000
for r, style in [(1, '-'), (4, '--'), (16, ':'), (64, '-.')]:
R_lora = np.sqrt(ds * r / n_val)
ax.loglog(ds, R_lora, style, label=f'LoRA r={r}')
R_full = np.sqrt(ds**2 / n_val)
ax.loglog(ds, R_full, 'k-', lw=2, label='Full (r=d)')
ax.set_title(f'Rademacher Complexity vs d (n={n_val})')
ax.set_xlabel('Dimension d')
ax.set_ylabel('$\\mathfrak{R}$ bound')
ax.legend(fontsize=9)
plt.suptitle('Concentration Inequalities: Summary', fontsize=14, y=1.01)
plt.tight_layout()
plt.show()
print('Summary plots generated.')
Notebook Complete
This notebook has covered:
- Tail bound hierarchy — Markov → Chebyshev → sub-Gaussian → Hoeffding → Chernoff → Bernstein → McDiarmid
- PAC learning — finite class bounds via union bound, VC dimension, shattering
- Rademacher complexity — data-dependent bounds, LoRA analysis
- ML applications — SGD gradient concentration, LLM benchmark CIs, JL lemma
- Transformers — attention scaling, residual norms, gradient clipping
- Modern theory — PAC-Bayes, double descent, DP-SGD, matrix concentration
References
- Boucheron, Lugosi & Massart, Concentration Inequalities (2013) — OUP
- Tropp, An Introduction to Matrix Concentration Inequalities (2015) — arXiv:1501.01571
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014) — Chapters 2-7
- Wainwright, High-Dimensional Statistics (2019) — Cambridge
- Vershynin, High-Dimensional Probability (2018) — Cambridge
→ exercises.ipynb — 8 graded problems
→ notes.md — full theoretical treatment with proofs