Math for LLMs: Theory Notebook

Converted from theory.ipynb for web reading.

Concentration Inequalities

"One of the great gifts of probability theory is that it allows us to turn complex dependencies into simple, manageable bounds." — Stéphane Boucheron

Interactive exploration of concentration inequalities: from Markov and Chebyshev to sub-Gaussian tails, Hoeffding, Chernoff, Bernstein, and McDiarmid. Applications to PAC learning, VC theory, Rademacher complexity, and LLM evaluation.

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
from scipy import stats

try:
    import matplotlib.pyplot as plt
    import matplotlib
    plt.style.use('seaborn-v0_8-whitegrid')
    plt.rcParams['figure.figsize'] = [10, 6]
    plt.rcParams['font.size'] = 12
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

try:
    import seaborn as sns
    HAS_SNS = True
except ImportError:
    HAS_SNS = False

np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)
print('Setup complete.')

1. Intuition: Why Tail Bounds Matter

A random variable is concentrated if most of its mass lies near the mean. Concentration inequalities quantify this with explicit tail bounds $P(X \geq \mu + t)$. In ML, $X$ might be a generalisation error, empirical risk, or gradient norm.

The hierarchy (weakest → strongest assumptions, loosest → tightest bounds):

  • Markov — only non-negativity: $P(X \geq t) \leq \mathbb{E}[X]/t$ — polynomial tail $O(1/t)$
  • Chebyshev — finite variance: $P(|X - \mu| \geq t) \leq \sigma^2/t^2$ — polynomial $O(1/t^2)$
  • Sub-Gaussian/Hoeffding — bounded support or light tails: $O(e^{-2t^2/\sum c_i^2})$
  • Chernoff — MGF exists: $\inf_{s>0} e^{-st} M_X(s)$ — exponential, shape depends on the MGF
  • Bernstein — variance + bounded range: exponential, with a Gaussian→exponential tail transition
  • McDiarmid — function of independent variables with bounded differences: $O(e^{-2t^2/\sum c_i^2})$

Code cell 5

# === 1.1 Tail Bound Comparison ===

t = np.linspace(0.1, 5, 500)
mu = 1.0    # E[X] for Markov
sigma2 = 1.0  # Var(X) for Chebyshev
c2 = 1.0    # sum c_i^2 for Hoeffding/sub-Gaussian

markov   = mu / t
cheby    = sigma2 / t**2
hoeffding = np.exp(-2 * t**2 / c2)
gaussian_tail = 2 * stats.norm.sf(t)  # 2P(Z >= t) — actual Gaussian tail

if HAS_MPL:
    fig, ax = plt.subplots(figsize=(10,5))
    ax.semilogy(t, markov,        label=r'Markov: $\mu/t$',            lw=2)
    ax.semilogy(t, cheby,         label=r'Chebyshev: $\sigma^2/t^2$',  lw=2)
    ax.semilogy(t, hoeffding,     label='Hoeffding/sub-Gaussian',      lw=2)
    ax.semilogy(t, gaussian_tail, label='Gaussian tail (actual)',      lw=2, ls='--')
    ax.set_xlim(0.5, 5)
    ax.set_ylim(1e-7, 2)
    ax.set_xlabel('Deviation $t$')
    ax.set_ylabel('$P(|X - \\mu| \\geq t)$ (log scale)')
    ax.set_title('Tail Bound Hierarchy: Polynomial vs Exponential Decay')
    ax.legend()
    plt.tight_layout()
    plt.show()

# Print at t=3
t0 = 3.0
print(f'At t={t0}:')
print(f'  Markov (mu=1):        {mu/t0:.4f}')
print(f'  Chebyshev (sigma2=1): {sigma2/t0**2:.4f}')
print(f'  Hoeffding (c2=1):     {np.exp(-2*t0**2/c2):.4f}')
print(f'  Gaussian (actual):    {2*stats.norm.sf(t0):.6f}')

2. Markov and Chebyshev Inequalities

2.1 Markov's Inequality

Theorem. For non-negative $X$ with $\mathbb{E}[X] < \infty$ and $t > 0$:

$$P(X \geq t) \leq \frac{\mathbb{E}[X]}{t}$$

Proof. $\mathbb{E}[X] = \mathbb{E}[X \mathbf{1}_{X \geq t}] + \mathbb{E}[X \mathbf{1}_{X < t}] \geq t \cdot P(X \geq t) + 0$. Divide by $t$. $\square$

2.2 Chebyshev's Inequality

Apply Markov to $(X - \mu)^2 \geq 0$:

$$P(|X - \mu| \geq t) = P\big((X-\mu)^2 \geq t^2\big) \leq \frac{\mathbb{E}[(X-\mu)^2]}{t^2} = \frac{\sigma^2}{t^2}$$

One-sided Cantelli. $P(X - \mu \geq t) \leq \frac{\sigma^2}{\sigma^2 + t^2}$ (sharper for large $t$).
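The gap between Cantelli and the two-sided Chebyshev bound can be checked by Monte Carlo; a minimal sketch on Exponential(1) (an illustrative choice, where $\mu = \sigma^2 = 1$):

```python
import numpy as np

# Compare the one-sided Cantelli bound with the (two-sided) Chebyshev bound
# on a skewed distribution: Exponential(1), mu = 1, sigma^2 = 1.
rng = np.random.default_rng(0)
X = rng.exponential(1.0, 1_000_000)
mu, var = 1.0, 1.0

for t in [1.0, 2.0, 3.0]:
    actual   = (X - mu >= t).mean()
    cantelli = var / (var + t**2)
    cheby    = var / t**2  # two-sided Chebyshev, used here as a one-sided bound
    print(f"t={t}: actual={actual:.4f}  Cantelli={cantelli:.4f}  Chebyshev={cheby:.4f}")
```

Cantelli is uniformly below $\sigma^2/t^2$, and at $t = 1$ Chebyshev is vacuous while Cantelli still gives $1/2$.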

Code cell 7

# === 2. Markov and Chebyshev Numerical Verification ===

np.random.seed(42)
N = 1_000_000

# Exponential distribution: E[X] = 1/lambda, Var = 1/lambda^2
lam = 1.0
samples_exp = np.random.exponential(1.0/lam, N)
mu_exp  = 1.0 / lam
var_exp = 1.0 / lam**2

ts = [2.0, 3.0, 5.0]
print('Exponential(1) — one-sided bounds')
print(f"  {'t':>4}  {'P(X>=t) actual':>18}  {'Markov':>10}  {'Ratio':>8}")
for t in ts:
    actual  = (samples_exp >= t).mean()
    markov  = mu_exp / t
    print(f'  {t:>4.1f}  {actual:>18.6f}  {markov:>10.6f}  {actual/markov:>8.3f}')

# Gaussian: test Chebyshev
samples_norm = np.random.normal(0, 1, N)
print('\nGaussian(0,1) — two-sided Chebyshev')
print(f"  {'t':>4}  {'P(|X|>=t) actual':>20}  {'Chebyshev':>12}  {'Gaussian tail':>15}")
for t in [1.0, 2.0, 3.0]:
    actual = (np.abs(samples_norm) >= t).mean()
    cheby  = 1.0 / t**2  # sigma^2 = 1
    gtail  = 2 * stats.norm.sf(t)
    print(f'  {t:>4.1f}  {actual:>20.6f}  {cheby:>12.6f}  {gtail:>15.6f}')

print('\nTakeaway: Chebyshev is loose for Gaussian (1/t^2 vs e^{-t^2/2}) — ')
print('but holds for ANY distribution with finite variance.')

3. Sub-Gaussian Random Variables

A zero-mean random variable $X$ is $\sigma^2$-sub-Gaussian if:

$$\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2 / 2} \quad \forall t \in \mathbb{R}$$

Equivalent characterisation (tail form):

$$P(|X| \geq t) \leq 2e^{-t^2/(2\sigma^2)}$$

Key examples:

  • $\mathcal{N}(0, \sigma^2)$: $\sigma^2$-sub-Gaussian (equality in the MGF bound)
  • Bounded $X \in [a, b]$ with $\mathbb{E}[X] = 0$: $((b-a)/2)^2$-sub-Gaussian (Hoeffding's lemma)
  • Rademacher $X = \pm 1$ w.p. $1/2$ each: $1$-sub-Gaussian

Closure: a sum of independent $\sigma_i^2$-sub-Gaussian variables is $(\sum_i \sigma_i^2)$-sub-Gaussian.
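The closure property can be verified directly on Rademacher sums, whose MGF is known in closed form; a small sketch (sample sizes are arbitrary choices):

```python
import numpy as np

# Closure under sums: S = X_1 + ... + X_n with X_i iid Rademacher (each
# 1-sub-Gaussian), so S should be n-sub-Gaussian: E[e^{sS}] <= e^{n s^2 / 2}.
rng = np.random.default_rng(42)
n, N = 20, 200_000
S = rng.choice([-1.0, 1.0], size=(N, n)).sum(axis=1)

for s in [0.1, 0.3, 0.5]:
    mgf_mc   = np.exp(s * S).mean()
    mgf_true = np.cosh(s) ** n          # exact: E[e^{s X_i}] = cosh(s)
    bound    = np.exp(n * s**2 / 2)
    print(f"s={s}: MC={mgf_mc:.4f}  exact={mgf_true:.4f}  bound={bound:.4f}")
```

The exact MGF $\cosh(s)^n$ sits strictly below $e^{n s^2/2}$, confirming the $\sum_i \sigma_i^2$ parameter for the sum.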

Code cell 9

# === 3. Sub-Gaussian MGF Verification ===

np.random.seed(42)
N = 500_000
ts = np.linspace(-2, 2, 100)

# Gaussian N(0,1): sigma^2 = 1
X_gauss = np.random.normal(0, 1, N)
# Uniform [-1,1]: sigma^2 = (b-a)^2/4 = 1
X_unif  = np.random.uniform(-1, 1, N)
# Rademacher: sigma^2 = 1
X_rad   = np.where(np.random.rand(N) < 0.5, -1.0, 1.0)

sub_gauss_bound = np.exp(ts**2 / 2)  # sigma^2 = 1

if HAS_MPL:
    fig, ax = plt.subplots(figsize=(10,5))
    for X, label, color in [
        (X_gauss, 'Gaussian N(0,1)', 'C0'),
        (X_unif,  'Uniform[-1,1]',  'C1'),
        (X_rad,   'Rademacher ±1',  'C2'),
    ]:
        mgf = np.array([np.exp(t * X).mean() for t in ts])
        ax.plot(ts, mgf, label=f'E[e^{{tX}}] {label}', color=color)
    ax.plot(ts, sub_gauss_bound, 'k--', lw=2, label='Sub-Gaussian bound $e^{t^2/2}$')
    ax.set_xlabel('$t$')
    ax.set_ylabel('MGF')
    ax.set_title('Sub-Gaussian MGF Bound Verification')
    ax.set_ylim(0, 8)
    ax.legend()
    plt.tight_layout()
    plt.show()

# Verify bound holds at t=1.5
t_test = 1.5
bound = np.exp(t_test**2 / 2)
for X, name in [(X_gauss,'Gaussian'), (X_unif,'Uniform'), (X_rad,'Rademacher')]:
    mgf_mc = np.exp(t_test * X).mean()
    ok = mgf_mc <= bound * 1.05  # 5% tolerance for MC
    print(f"{'PASS' if ok else 'FAIL'} {name}: MGF({t_test})={mgf_mc:.4f} <= {bound:.4f}")

4. Hoeffding's Inequality

Theorem (Hoeffding, 1963). Let $X_1, \ldots, X_n$ be independent with $X_i \in [a_i, b_i]$ and $\mathbb{E}[X_i] = \mu_i$. Let $S_n = \sum_{i=1}^n X_i$. Then:

$$P(S_n - \mathbb{E}[S_n] \geq t) \leq \exp\!\left(\frac{-2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$$

Hoeffding's Lemma (key step). For $X \in [a, b]$ with $\mathbb{E}[X] = 0$:

$$\mathbb{E}[e^{tX}] \leq e^{t^2 (b-a)^2 / 8}$$

Proof sketch. Convexity of $e^{tx}$ gives $e^{tx} \leq \frac{b-x}{b-a}e^{ta} + \frac{x-a}{b-a}e^{tb}$. Take expectations, substitute $u = t(b-a)$ and $h(u) = \log\left[(1-p)e^{-pu} + p e^{(1-p)u}\right]$ with $p = -a/(b-a)$, then Taylor-expand $h$ and show $h''(u) \leq 1/4$, which gives $h(u) \leq u^2/8$. $\square$

Sample complexity (iid case, $X_i \in [0,1]$, $\varepsilon = t/n$):

$$P\!\left(|\bar{X}_n - \mu| \geq \varepsilon\right) \leq 2e^{-2n\varepsilon^2}$$

Setting $\delta = 2e^{-2n\varepsilon^2}$ gives $n \geq \frac{\log(2/\delta)}{2\varepsilon^2}$.

Code cell 11

# === 4. Hoeffding: Sample Complexity and Verification ===

np.random.seed(42)

# --- Sample complexity formula ---
def hoeffding_n(eps, delta):
    """Min samples so P(|X_bar - mu| >= eps) <= delta, X_i in [0,1]."""
    return np.ceil(np.log(2 / delta) / (2 * eps**2)).astype(int)

print('Sample complexity n(eps, delta) via Hoeffding:')
print(f"  {'eps':>6}  {'delta':>8}  {'n':>8}")
for eps in [0.01, 0.05, 0.1]:
    for delta in [0.05, 0.01]:
        print(f'  {eps:>6.2f}  {delta:>8.3f}  {hoeffding_n(eps, delta):>8}')

# --- Empirical verification ---
mu_true = 0.3
n_values = [50, 100, 500, 1000]
n_trials = 50000
eps = 0.1

print('\nHoeffding bound vs empirical tail (X_i ~ Bernoulli(0.3), eps=0.1):')
print(f"  {'n':>6}  {'P(|X_bar-mu|>=eps) empirical':>32}  {'Hoeffding bound':>18}")
for n in n_values:
    samples = np.random.binomial(1, mu_true, (n_trials, n)).mean(axis=1)
    empirical = (np.abs(samples - mu_true) >= eps).mean()
    bound = 2 * np.exp(-2 * n * eps**2)
    print(f'  {n:>6}  {empirical:>32.6f}  {bound:>18.6f}')

5. Chernoff Bounds

The Chernoff method. For any $s > 0$, apply Markov to $e^{sX}$:

$$P(X \geq t) = P(e^{sX} \geq e^{st}) \leq e^{-st}\,\mathbb{E}[e^{sX}] = e^{-st} M_X(s)$$

Optimise over $s > 0$:

$$P(X \geq t) \leq \inf_{s > 0}\, e^{-st} M_X(s)$$

Binomial Chernoff. Let $S = \sum_{i=1}^n X_i$ with $X_i \stackrel{iid}{\sim} \text{Bern}(p)$ and $\mu = np$:

  • Upper tail: $P(S \geq (1+\delta)\mu) \leq \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu$
  • Simplified upper: $P(S \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3}$ for $\delta \in (0,1]$
  • Lower tail: $P(S \leq (1-\delta)\mu) \leq e^{-\mu\delta^2/2}$ for $\delta \in (0,1)$

The gap between Hoeffding ($e^{-2n\varepsilon^2}$) and Chernoff ($e^{-np\delta^2/3}$) comes from Chernoff exploiting the actual Binomial MGF rather than only the bounded range.
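The tight $\left(e^\delta/(1+\delta)^{1+\delta}\right)^\mu$ form can also be compared numerically against the simplified $e^{-\mu\delta^2/3}$ version; a sketch on an illustrative Binomial(100, 0.1):

```python
import numpy as np

# Tight multiplicative Chernoff vs the simplified exp(-mu*delta^2/3) form,
# for S ~ Binomial(n=100, p=0.1), mu = 10.
n, p = 100, 0.1
mu = n * p

def chernoff_tight(delta, mu):
    # (e^delta / (1+delta)^(1+delta))^mu, computed in log space for stability
    return np.exp(mu * (delta - (1 + delta) * np.log1p(delta)))

def chernoff_simple(delta, mu):
    return np.exp(-mu * delta**2 / 3)

rng = np.random.default_rng(42)
samples = rng.binomial(n, p, 500_000)

for delta in [0.2, 0.5, 1.0]:
    emp = (samples >= (1 + delta) * mu).mean()
    print(f"delta={delta}: empirical={emp:.5f}  "
          f"tight={chernoff_tight(delta, mu):.5f}  simple={chernoff_simple(delta, mu):.5f}")
```

For $\delta \in (0,1]$ the tight form is always at least as small as the simplified one, since $\delta - (1+\delta)\log(1+\delta) \leq -\delta^2/3$ on that range.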

Code cell 13

# === 5. Chernoff vs Hoeffding for Binomial ===

np.random.seed(42)

def chernoff_upper(delta, mu):
    """Simplified Chernoff upper: exp(-mu*delta^2/3) for delta in (0,1]."""
    return np.exp(-mu * delta**2 / 3)

def hoeffding_upper(delta, n):
    """Hoeffding upper: exp(-2*n*delta^2) for X_i in [0,1]."""
    return np.exp(-2 * n * delta**2)

# Compare bounds for B(n=100, p=0.1) => mu=10
n, p = 100, 0.1
mu = n * p

N_mc = 500_000
samples = np.random.binomial(n, p, N_mc)

print(f'Binomial(n={n}, p={p}), mu={mu}')
print(f"  {'delta':>6}  {'t=(1+d)mu':>12}  {'Empirical':>12}  "
      f"{'Chernoff':>12}  {'Hoeffding':>12}")
for delta in [0.2, 0.5, 1.0]:
    t = (1 + delta) * mu
    empirical  = (samples >= t).mean()
    chernoff_b = chernoff_upper(delta, mu)
    hoeffding_b = hoeffding_upper(delta * p, n)  # eps = delta*p
    print(f'  {delta:>6.1f}  {t:>12.1f}  {empirical:>12.6f}  '
          f'{chernoff_b:>12.6f}  {hoeffding_b:>12.6f}')

print('\nChernoff is tighter when p is small (exploits Binomial MGF).')

6. Bernstein's Inequality

Bernstein's inequality bridges the polynomial (Chebyshev) and exponential (sub-Gaussian) regimes.

Theorem (Bernstein). Let $X_1, \ldots, X_n$ be independent and zero-mean with $|X_i| \leq M$. Then:

$$P\!\left(\sum_i X_i \geq t\right) \leq \exp\!\left(\frac{-t^2/2}{\sum_i \mathbb{E}[X_i^2] + Mt/3}\right)$$

Interpretation. Let $v = \sum_i \text{Var}(X_i)$:

  • Gaussian regime ($t \ll v/M$): bound $\approx e^{-t^2/(2v)}$ — sub-Gaussian with $\sigma^2 = v$
  • Poisson regime ($t \gg v/M$): bound $\approx e^{-3t/(2M)}$ — exponential in $t$ rather than $t^2$

vs. Hoeffding. Bernstein uses variance information; Hoeffding uses only the range. Bernstein is strictly tighter when the variance is small relative to the squared range.

Code cell 15

# === 6. Bernstein vs Hoeffding Comparison ===

np.random.seed(42)

def bernstein_bound(t, v, M):
    """Bernstein: exp(-t^2/2 / (v + Mt/3))."""
    return np.exp(-t**2 / 2 / (v + M * t / 3))

def hoeffding_bound(t, n, a, b):
    """Hoeffding: exp(-2t^2 / (n*(b-a)^2))."""
    return np.exp(-2 * t**2 / (n * (b - a)**2))

# Sparse Bernoulli: X_i ~ Bern(p) - p, p=0.01
# mean=0, variance=p*(1-p)~p, range=[-(p), 1-p]
n_vars = 100
p = 0.01
v_total = n_vars * p * (1 - p)  # total variance
M = 1 - p  # max |X_i|

ts = np.linspace(0.01, 2.0, 300)

bern_b  = bernstein_bound(ts, v_total, M)
hoeff_b = hoeffding_bound(ts, n_vars, -p, 1-p)

# Empirical
N_mc = 500_000
sim = (np.random.binomial(1, p, (N_mc, n_vars)) - p).sum(axis=1)

if HAS_MPL:
    fig, ax = plt.subplots(figsize=(10,5))
    ts_emp = np.linspace(0.01, sim.max(), 200)
    emp = np.array([(sim >= t).mean() for t in ts_emp])
    ax.semilogy(ts, bern_b,  label='Bernstein', lw=2)
    ax.semilogy(ts, hoeff_b, label='Hoeffding', lw=2)
    ax.semilogy(ts_emp, np.maximum(emp, 1e-7), 'k.', ms=2, label='Empirical')
    ax.set_xlabel('$t$')
    ax.set_ylabel('$P(S_n \\geq t)$')
    ax.set_title(f'Bernstein vs Hoeffding (n={n_vars} sparse Bernoulli, p={p})')
    ax.legend()
    plt.tight_layout()
    plt.show()

print(f'Variance v={v_total:.4f}, M={M:.2f}')
print(f'Bernstein at t=0.5:  {bernstein_bound(0.5, v_total, M):.6f}')
print(f'Hoeffding at t=0.5:  {hoeffding_bound(0.5, n_vars, -p, 1-p):.6f}')
print('Bernstein is tighter when variance << range^2.')

7. McDiarmid's Inequality

McDiarmid extends Hoeffding from sums to general functions of independent variables.

Definition (Bounded Differences). $f(x_1,\ldots,x_n)$ has the $(c_1,\ldots,c_n)$-bounded differences property if for each $i$:

$$\sup_{x_1,\ldots,x_n,\,x_i'} \left|f(x_1,\ldots,x_i,\ldots,x_n) - f(x_1,\ldots,x_i',\ldots,x_n)\right| \leq c_i$$

Theorem (McDiarmid, 1989). If $f$ has bounded differences with constants $c_1,\ldots,c_n$ and $X_1,\ldots,X_n$ are independent, then:

$$P(f(\mathbf{X}) - \mathbb{E}[f(\mathbf{X})] \geq t) \leq \exp\!\left(\frac{-2t^2}{\sum_{i=1}^n c_i^2}\right)$$

Proof idea. Define the Doob martingale $D_k = \mathbb{E}[f \mid X_1,\ldots,X_k] - \mathbb{E}[f]$. The differences $D_k - D_{k-1}$ are bounded in $[-c_k, c_k]$ and have zero conditional mean; apply Azuma's inequality to the martingale. $\square$

Key application: empirical risk. Let $f(S) = \hat{R}(h) = \frac{1}{n}\sum_i \ell(h(x_i), y_i)$ with $\ell \in [0,1]$. Changing one sample changes $\hat{R}$ by at most $1/n$, so $c_i = 1/n$ and:

$$P(\hat{R}(h) - R(h) \geq t) \leq e^{-2nt^2}$$

Code cell 17

# === 7. McDiarmid on Empirical Risk ===

np.random.seed(42)

def mcdiarmid_bound(t, n, c_max=None):
    """McDiarmid bound with c_i = 1/n (all equal), c_max=1/n."""
    if c_max is None:
        c_max = 1.0 / n
    return np.exp(-2 * t**2 / (n * c_max**2))

# Simulate: classifier h, threshold at 0.5
# True risk R(h) = P(error) = 0.2
R_true = 0.2

n_values = [50, 100, 500, 1000]
n_trials = 100_000
t = 0.1

print(f'Empirical risk deviation (t={t}), true R={R_true}')
print(f"  {'n':>6}  {'P(R_hat-R>=t) sim':>22}  {'McDiarmid bound':>18}")
for n in n_values:
    errors = np.random.binomial(1, R_true, (n_trials, n)).mean(axis=1)
    empirical = (errors - R_true >= t).mean()
    bound = mcdiarmid_bound(t, n)
    print(f'  {n:>6}  {empirical:>22.6f}  {bound:>18.6f}')

# Also show: symmetric bound P(|R_hat - R| >= t) <= 2*exp(-2nt^2)
print(f'\nTwo-sided: P(|R_hat - R| >= {t}) <= 2*exp(-2n*{t}^2)')
for n in [100, 1000]:
    print(f'  n={n}: {2*mcdiarmid_bound(t, n):.6f}')
print('\nThis is the foundation of PAC-style generalisation bounds.')

8. Union Bound and Covering Numbers

8.1 Union Bound (Boole's Inequality)

For any events $A_1, \ldots, A_m$:

$$P\!\left(\bigcup_{i=1}^m A_i\right) \leq \sum_{i=1}^m P(A_i)$$

ML application. Applying McDiarmid to each $h \in \mathcal{H}$ and taking the union over all $|\mathcal{H}|$ hypotheses:

$$P\!\left(\exists h \in \mathcal{H}: \hat{R}(h) - R(h) \geq t\right) \leq |\mathcal{H}| \cdot e^{-2nt^2}$$

Setting the RHS $= \delta$ gives $t = \sqrt{\frac{\log(|\mathcal{H}|/\delta)}{2n}}$.
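A quick simulation of this application; for simplicity the sketch draws an independent sample per hypothesis (the union bound itself requires no independence):

```python
import numpy as np

# Monte Carlo sketch of the union bound over a finite "hypothesis class":
# m hypotheses, each with true risk R, risk estimated from n samples.
rng = np.random.default_rng(42)
m, n, R, t = 20, 200, 0.3, 0.1
n_trials = 3000

# gaps[trial, h] = R_hat(h) - R
gaps = rng.binomial(1, R, (n_trials, m, n)).mean(axis=2) - R
p_single = (gaps[:, 0] >= t).mean()          # one fixed hypothesis
p_any    = (gaps.max(axis=1) >= t).mean()    # P(exists h with gap >= t)

bound_single = np.exp(-2 * n * t**2)
print(f"one h:     P={p_single:.4f}  (Hoeffding bound {bound_single:.4f})")
print(f"any of {m}: P={p_any:.4f}  (union bound {min(1.0, m * bound_single):.4f})")
```

The simultaneous-deviation probability grows with $m$ but stays below $m \cdot e^{-2nt^2}$, which is what pays for the $\log|\mathcal{H}|$ term in the gap bound.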

8.2 Covering Numbers

For infinite H\mathcal{H}, replace H|\mathcal{H}| with the covering number N(H,ε,d)\mathcal{N}(\mathcal{H}, \varepsilon, d): the minimum number of ε\varepsilon-balls needed to cover H\mathcal{H} under metric dd.

Metric entropy: logN(H,ε,)\log \mathcal{N}(\mathcal{H}, \varepsilon, \|\cdot\|) — for HRd\mathcal{H} \subset \mathbb{R}^d with θB\|\theta\| \leq B:

logNdlog ⁣(3Bε)\log \mathcal{N} \leq d \log\!\left(\frac{3B}{\varepsilon}\right)

Code cell 19

# === 8. Union Bound for Finite Hypothesis Class ===

np.random.seed(42)

# Scenario: |H| decision stumps on binary features
# Each stump: threshold on one of d features => |H| = 2d

def pac_bound_finite(H_size, n, delta):
    """Generalisation gap t such that P(exists h: gap >= t) <= delta."""
    return np.sqrt(np.log(H_size / delta) / (2 * n))

print('Generalisation gap bound for finite hypothesis classes:')
print(f"  {'|H|':>10}  {'n':>6}  {'delta':>6}  {'Gap bound t':>12}")
for H_size in [10, 100, 1000, 10000]:
    for n in [100, 1000]:
        t = pac_bound_finite(H_size, n, 0.05)
        print(f"  {H_size:>10}  {n:>6}  {'0.05':>6}  {t:>12.4f}")

print('\nSample complexity: n >= log(|H|/delta) / (2*eps^2)')
for H_size in [100, 10000]:
    n_required = np.ceil(np.log(H_size / 0.05) / (2 * 0.05**2)).astype(int)
    print(f'  |H|={H_size}: n >= {n_required} for eps=0.05, delta=0.05')

# Covering number for linear classifiers in R^d, ||theta|| <= 1
print('\nCovering number log N(eps) for {theta: ||theta||<=1} in R^d:')
for d in [10, 100, 1000]:
    eps = 0.1
    log_N = d * np.log(3 / eps)
    print(f'  d={d:4d}: log N = {log_N:.1f} => N = {np.exp(log_N):.2e}')

9. PAC Learning and VC Dimension

9.1 PAC Framework

A hypothesis class $\mathcal{H}$ is PAC learnable if there exists an algorithm $A$ such that for all distributions $\mathcal{D}$ and all $\varepsilon, \delta > 0$, given $n \geq n(\varepsilon, \delta)$ samples:

$$P_{S \sim \mathcal{D}^n}\!\left(R(A(S)) \leq \min_{h \in \mathcal{H}} R(h) + \varepsilon\right) \geq 1 - \delta$$

Finite class. $n \geq \frac{\log(|\mathcal{H}|/\delta)}{2\varepsilon^2}$ suffices (union bound).

9.2 VC Dimension

The VC dimension $d_{\text{VC}}(\mathcal{H})$ is the size of the largest set that $\mathcal{H}$ can shatter (realise all $2^m$ labellings of some $m$-point set).

Examples:

  • Intervals on $\mathbb{R}$: $d_{\text{VC}} = 2$
  • Half-spaces in $\mathbb{R}^d$: $d_{\text{VC}} = d + 1$
  • Degree-$k$ polynomial thresholds on $\mathbb{R}$: $d_{\text{VC}} = k + 1$
  • Neural networks (binary output, $p$ parameters): $d_{\text{VC}} = O(p \log p)$

VC generalisation bound (Vapnik–Chervonenkis). With probability $\geq 1 - \delta$:

$$R(h) \leq \hat{R}(h) + \sqrt{\frac{2 d_{\text{VC}} \log(en/d_{\text{VC}}) + 2\log(4/\delta)}{n}}$$

Code cell 21

# === 9. VC Dimension: Shattering and PAC Bounds ===

np.random.seed(42)

# --- Shattering demo: half-spaces in R^2 ---
# 3 points in general position CAN be shattered by half-spaces
# 4 points CANNOT always be shattered (XOR configuration fails)

def linear_shatter_check(points):
    """Check if set of points can be shattered by linear classifier."""
    n = len(points)
    from itertools import product
    can_realise_all = True
    for labels in product([-1, 1], repeat=n):
        labels = np.array(labels)
        # Try linear separator with bias: w.x + b = 0
        # Use LP / direct check via small optimization
        found = False
        for trial in range(1000):
            w = np.random.randn(2)
            b = np.random.randn()
            preds = np.sign(points @ w + b)
            if np.all(preds == labels):
                found = True
                break
        if not found:
            can_realise_all = False
            break
    return can_realise_all

# 3 points in general position
pts3 = np.array([[0., 0.], [1., 0.], [0., 1.]])
# 4 points - XOR pattern
pts4 = np.array([[0., 0.], [1., 1.], [1., 0.], [0., 1.]])

print('VC dimension of half-spaces in R^2 is 3:')
print(f'  Can shatter 3 pts in general position: {linear_shatter_check(pts3)}')
print(f'  Can shatter 4 pts (XOR): {linear_shatter_check(pts4)}')

# --- VC generalisation bound ---
def vc_bound(d_vc, n, delta):
    """VC generalisation gap."""
    return np.sqrt((2 * d_vc * np.log(np.e * n / d_vc) + 2 * np.log(4 / delta)) / n)

print('\nVC generalisation bound (delta=0.05):')
print(f"  {'d_VC':>6}  {'n':>6}  {'Gap bound':>12}")
for d_vc in [3, 10, 100]:
    for n in [100, 1000, 10000]:
        gap = vc_bound(d_vc, n, 0.05)
        print(f'  {d_vc:>6}  {n:>6}  {gap:>12.4f}')

10. Rademacher Complexity

Rademacher complexity measures the ability of a function class $\mathcal{F}$ to fit random noise.

Empirical Rademacher complexity:

$$\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}}\!\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(x_i)\right]$$

where $\sigma_i \stackrel{iid}{\sim} \text{Uniform}(\{-1, +1\})$ (Rademacher random variables).

Population Rademacher complexity: $\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{F})]$.

Theorem (Rademacher Generalisation Bound). With probability $\geq 1 - \delta$, for all $f \in \mathcal{F}$:

$$R(f) \leq \hat{R}(f) + 2\mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}}$$

Key examples:

  • Finite class: $\mathfrak{R}_n \leq \sqrt{\log|\mathcal{H}|/(2n)}$
  • Linear with $\|w\|_2 \leq W$ and data $\|x\|_2 \leq X$: $\mathfrak{R}_n \leq WX/\sqrt{n}$
  • LoRA rank-$r$: $\mathfrak{R}_n = O\!\left(\sqrt{dr/n}\right)$ (vs $O\!\left(\sqrt{d^2/n}\right)$ for the full update)
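For a concrete finite-class check, Massart's lemma bounds the empirical Rademacher complexity of $M$ fixed $\pm 1$-valued vectors by $\sqrt{2\log M/n}$ (a slightly different constant than the finite-class bullet above); a Monte Carlo sketch with illustrative sizes:

```python
import numpy as np

# MC estimate of the empirical Rademacher complexity of a finite class:
# M fixed vectors f_j in {-1,+1}^n, hat_R = E_sigma[ max_j (1/n) sigma . f_j ].
# Massart's lemma bounds this by sqrt(2 log M / n) for +-1-valued classes.
rng = np.random.default_rng(42)
n, M, n_mc = 200, 100, 5000
F = rng.choice([-1.0, 1.0], size=(M, n))       # the finite function class

sup_vals = []
for _ in range(n_mc):
    sigma = rng.choice([-1.0, 1.0], size=n)
    sup_vals.append((F @ sigma).max() / n)     # sup over the class for this sigma
R_hat = np.mean(sup_vals)

print(f"MC estimate:   {R_hat:.4f}")
print(f"Massart bound: {np.sqrt(2 * np.log(M) / n):.4f}")
```

The estimate sits just below the bound, as expected for a class of nearly orthogonal random sign vectors.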

Code cell 23

# === 10. Rademacher Complexity: Monte Carlo Estimation ===

np.random.seed(42)

def rademacher_mc(X, W_norm, n_mc=2000):
    """
    Estimate empirical Rademacher complexity for linear functions ||w||_2 <= W_norm
    on data matrix X (n x d).
    hat_R = E_sigma[sup_{||w||<=W} (1/n) sigma^T X w]
          = (W_norm / n) * E_sigma[||X^T sigma||_2]
    """
    n, d = X.shape
    norms = []
    for _ in range(n_mc):
        sigma = np.random.choice([-1., 1.], size=n)
        norms.append(np.linalg.norm(X.T @ sigma))
    return W_norm * np.mean(norms) / n

def rademacher_theory(n, d, W_norm, X_norm):
    """Theoretical bound: W_norm * X_norm / sqrt(n)."""
    return W_norm * X_norm / np.sqrt(n)

n, d = 100, 50
W_norm = 1.0
X = np.random.randn(n, d) / np.sqrt(d)  # ||x_i||_2 ~ 1
X_norm = np.linalg.norm(X, axis=1).mean()

R_hat_mc = rademacher_mc(X, W_norm)
R_theory  = rademacher_theory(n, d, W_norm, X_norm)

print(f'Empirical Rademacher complexity (n={n}, d={d}, W={W_norm})')
print(f'  Monte Carlo estimate: {R_hat_mc:.4f}')
print(f'  Theory bound:         {R_theory:.4f}')
print(f'  Ratio MC/theory:      {R_hat_mc/R_theory:.3f}')

# LoRA vs full: compare as d varies
print('\nRademacher complexity: LoRA rank-r vs full layer')
print(f"  {'d':>6}  {'rank r':>8}  {'LoRA R':>10}  {'Full R':>10}  {'Ratio':>8}")
for d in [64, 256, 1024]:
    r = 4
    # Simplified: paramcount ~ d*r vs d^2
    R_lora = np.sqrt(d * r / n)
    R_full = np.sqrt(d**2 / n)
    print(f'  {d:>6}  {r:>8}  {R_lora:>10.4f}  {R_full:>10.4f}  {R_lora/R_full:>8.4f}')

print('\nTakeaway: LoRA Rademacher complexity scales as sqrt(d*r/n) vs sqrt(d^2/n).')

11. Applications: SGD and Gradient Concentration

11.1 Mini-Batch Gradient as a Random Variable

In SGD, the mini-batch gradient $\hat{\mathbf{g}}_t = \frac{1}{B}\sum_{i \in \mathcal{B}_t} \nabla \ell(\theta; z_i)$ is an estimator of the full gradient $\mathbf{g}_t = \nabla L(\theta)$. By independence of the batch samples, applying Hoeffding per coordinate gives:

$$P\!\left(|\hat{g}_{t,j} - g_{t,j}| \geq \varepsilon\right) \leq 2\exp\!\left(\frac{-2B\varepsilon^2}{(b_j - a_j)^2}\right)$$

where $[a_j, b_j]$ bounds the $j$-th gradient component. The batch size $B$ directly controls the gradient noise.

Bernstein for SGD. If $\|\nabla \ell\|_2 \leq G$ and the per-coordinate gradient variance is $\sigma^2$, Bernstein gives:

$$P\!\left(\|\hat{\mathbf{g}}_t - \mathbf{g}_t\|_2 \geq \varepsilon\right) \leq \exp\!\left(\frac{-B\varepsilon^2/2}{d\sigma^2 + G\varepsilon/3}\right)$$

This shows gradient noise switches from a Gaussian regime (small $\varepsilon$) to an exponential regime (large $\varepsilon$).

Code cell 25

# === 11.1 SGD Mini-Batch Gradient Concentration ===

np.random.seed(42)

# Simulate: gradient = mean of B IID bounded rvs in [-G, G]
G = 1.0  # gradient bound per sample
mu_grad = 0.3  # true gradient

def simulate_sgd_concentration(B, n_mc=100_000):
    """Simulate |g_hat - g_true| for mini-batch size B."""
    # Each per-sample gradient: Uniform[mu_grad - G, mu_grad + G] (range width 2G).
    # Note: clipping to [-G, G] here would bias the mean away from mu_grad.
    grads = np.random.uniform(mu_grad - G, mu_grad + G, (n_mc, B))
    batch_grads = grads.mean(axis=1)
    return np.abs(batch_grads - mu_grad)

ts = [0.1, 0.2, 0.5]
print('P(|g_hat - g_true| >= t) for various batch sizes B')
print(f"  {'B':>6}  {'t=0.1':>10}  {'t=0.2':>10}  {'t=0.5':>10}  {'Hoeffding(B,t=0.1)':>20}")
for B in [1, 8, 32, 128]:
    devs = simulate_sgd_concentration(B)
    probs = [(devs >= t).mean() for t in ts]
    hb = 2 * np.exp(-2 * B * 0.1**2 / (2*G)**2)
    print(f'  {B:>6}  {probs[0]:>10.5f}  {probs[1]:>10.5f}  {probs[2]:>10.5f}  {hb:>20.5f}')

print('\n4x batch size halves the noise std; the Hoeffding exponent scales linearly in B.')

11.2 LLM Benchmark Confidence Intervals

When evaluating an LLM on a benchmark, the accuracy $\hat{A} = k/n$ is an estimate of the true accuracy $A$. By Hoeffding (or the exact Clopper–Pearson interval for the Binomial):

$$P(|\hat{A} - A| \geq \varepsilon) \leq 2e^{-2n\varepsilon^2}$$

Practical implications:

  • MMLU with $n = 14{,}042$ questions: $\varepsilon = 0.01$ gives $\delta \leq 2e^{-2 \cdot 14042 \cdot 0.01^2} \approx 0.12$, i.e. only about 88% confidence at a one-point margin
  • Smaller benchmarks ($n = 500$): $\varepsilon_{95\%} = \sqrt{\log(2/0.05)/(2 \cdot 500)} \approx 0.061$
  • Sub-task evaluation ($n = 50$): $\varepsilon_{95\%} \approx 0.19$ — differences of many points are indistinguishable

Benchmark contamination shifts the distribution, invalidating the iid assumption and making concentration bounds overly optimistic — one of the key failure modes in LLM evaluation.

Code cell 27

# === 11.2 LLM Benchmark Confidence Intervals ===

from scipy import stats as sp_stats

def hoeffding_ci(n, delta=0.05):
    """Half-width of Hoeffding confidence interval."""
    return np.sqrt(np.log(2 / delta) / (2 * n))

def clopper_pearson_ci(k, n, alpha=0.05):
    """Exact Clopper-Pearson CI for Binomial."""
    lo = sp_stats.beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0
    hi = sp_stats.beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1
    return lo, hi

benchmarks = [
    ('MMLU',        14042),
    ('BIG-Bench Hard', 6511),
    ('HellaSwag',   10042),
    ('Custom (500)',   500),
    ('Sub-task (50)',   50),
]

print('Confidence interval half-widths for LLM benchmarks (delta=5%):')
print(f"  {'Benchmark':>24}  {'n':>6}  {'Hoeffding eps':>14}  {'CP CI width (acc=0.7)':>22}")
for name, n in benchmarks:
    eps = hoeffding_ci(n)
    k = int(0.7 * n)
    lo, hi = clopper_pearson_ci(k, n)
    print(f'  {name:>24}  {n:>6}  {eps:>14.4f}  [{lo:.4f}, {hi:.4f}] = {hi-lo:.4f}')

print('\nImplication: for n=50 sub-tasks, a ~19% margin is needed to be 95% confident.')
print('Most LLM benchmark comparisons within 1% are not statistically meaningful at small n.')

11.3 Johnson-Lindenstrauss Lemma

Theorem (JL Lemma). For any set of $m$ points in $\mathbb{R}^d$ and $\varepsilon \in (0, 1/2)$, there exists a linear map $f: \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\varepsilon^{-2} \log m)$ such that:

$$\forall u, v: \quad (1-\varepsilon)\|u-v\|^2 \leq \|f(u)-f(v)\|^2 \leq (1+\varepsilon)\|u-v\|^2$$

Proof sketch. For a random Gaussian projection $R \in \mathbb{R}^{k \times d}$ with $R_{ij} \sim \mathcal{N}(0, 1/k)$, each squared projected distance is a scaled chi-squared variable with $k$ degrees of freedom. Chernoff bounds give a per-pair failure probability $\leq 2e^{-k(\varepsilon^2/4 - \varepsilon^3/6)}$; a union bound over all $\binom{m}{2}$ pairs makes the total failure probability $< 1$ once $k = \Omega(\varepsilon^{-2}\log m)$.

Applications in LLMs:

  • Random projections for approximate nearest-neighbour search in embedding lookup
  • Dimensionality reduction before FAISS indexing
  • Sketching attention matrices: approximate $QK^\top$ via random projections

Code cell 29

# === 11.3 Johnson-Lindenstrauss Lemma Verification ===

np.random.seed(42)

def jl_project(X, k):
    """Random Gaussian projection to k dimensions."""
    d = X.shape[1]
    R = np.random.randn(d, k) / np.sqrt(k)
    return X @ R

def jl_min_k(m, eps, delta=0.05):
    """Minimum k for JL embedding of m points."""
    return int(np.ceil(4 * np.log(m / delta) / (eps**2 / 2 - eps**3 / 3)))

# Generate m random points in R^d
m, d = 200, 500
X = np.random.randn(m, d)

# Compute pairwise distances in original space (sample 500 pairs)
n_pairs = 500
idx = np.random.choice(m, (n_pairs, 2))
dists_orig = np.linalg.norm(X[idx[:,0]] - X[idx[:,1]], axis=1)

print(f'JL Lemma: m={m} points, d={d}, testing projections')
print(f"  {'k':>6}  {'max |ratio-1|':>18}  {'frac within eps=0.2':>22}  {'theory k':>12}")
for eps in [0.1, 0.2, 0.3]:
    k_theory = jl_min_k(m, eps)
    for k in [20, 50, 100, 200]:
        X_proj = jl_project(X, k)
        dists_proj = np.linalg.norm(X_proj[idx[:,0]] - X_proj[idx[:,1]], axis=1)
        ratios = (dists_proj / dists_orig)**2
        max_dev = np.max(np.abs(ratios - 1))
        frac_ok = np.mean(np.abs(ratios - 1) <= eps)
        if eps == 0.2:  # only print for eps=0.2
            print(f'  {k:>6}  {max_dev:>18.4f}  {frac_ok:>22.4f}  {k_theory:>12}')

print(f'\nFor eps=0.2: theory requires k >= {jl_min_k(m, 0.2)} for m={m} points.')

11.4 Sub-Gaussian Properties in Transformers

Attention Logit Concentration

In self-attention, the logit between query $\mathbf{q}$ and key $\mathbf{k}_j$ is $z_j = \mathbf{q}^\top \mathbf{k}_j / \sqrt{d_k}$. If $\mathbf{q}, \mathbf{k}_j \sim \mathcal{N}(0, I_{d_k})$, then each $z_j$ has unit variance regardless of $d_k$:

$$\text{Var}(z_j) = \mathbb{E}\!\left[\frac{(\mathbf{q}^\top \mathbf{k}_j)^2}{d_k}\right] = 1$$

This motivates the $\sqrt{d_k}$ scaling — without it, logits would have variance $d_k$, pushing the softmax into saturation.

Residual Stream Norm Concentration

With $L$ residual layers, each adding a vector $\delta_\ell$ with $\|\delta_\ell\|_2 \approx c$ (bounded by RMSNorm), the worst case for the total update is $\|\sum_\ell \delta_\ell\|_2 \leq Lc$. But if the directions are random, the norm concentrates at $O(c\sqrt{L})$ — explaining why deep transformers need $1/\sqrt{L}$-type scaling at initialisation (e.g., GPT-2's $1/\sqrt{2L}$ rule).
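A quick check of the $\sqrt{L}$ scaling with random unit-norm updates (the dimensions here are illustrative):

```python
import numpy as np

# Sum of L random unit-norm vectors (random directions): the norm of the sum
# concentrates near c * sqrt(L), far below the worst case c * L.
rng = np.random.default_rng(42)
d, c = 512, 1.0
for L in [4, 16, 64, 256]:
    deltas = rng.standard_normal((L, d))
    deltas = c * deltas / np.linalg.norm(deltas, axis=1, keepdims=True)
    total = np.linalg.norm(deltas.sum(axis=0))
    print(f"L={L:4d}: ||sum|| = {total:7.2f}   sqrt(L) = {np.sqrt(L):6.2f}   worst case = {L * c:.0f}")
```

The cross terms $\langle\delta_i, \delta_j\rangle$ have mean zero, so $\mathbb{E}\|\sum_\ell \delta_\ell\|^2 = Lc^2$ and the norm tracks $c\sqrt{L}$ closely.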

Gradient Noise in Large-Batch Training

The linear scaling rule (Goyal et al., 2017): batch size BkBB \to kB requires LR αkα\alpha \to k\alpha to maintain the same effective noise level. This follows from iid averaging: mini-batch gradient variance scales as 1/B1/B, so the noise-to-signal ratio σ/B\sigma/\sqrt{B} determines convergence.
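The 1/B variance scaling can be confirmed with a toy model; treating per-example gradients as iid Gaussians with made-up `g_true` and `sigma` is an assumption for illustration, not a property of real training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: per-example "gradients" iid with mean g_true and std sigma
g_true, sigma = 1.0, 2.0
N_mc = 20_000

print('Mini-batch gradient variance vs batch size (expect sigma^2 / B):')
for B in [8, 32, 128, 512]:
    g_batch = (g_true + sigma * rng.standard_normal((N_mc, B))).mean(axis=1)
    print(f'  B={B:>4}: Var = {g_batch.var():.5f},  sigma^2/B = {sigma**2 / B:.5f}')
```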

Code cell 31

# === 11.4 Attention Logit Concentration ===

np.random.seed(42)

N_mc = 100_000

print('Attention logit variance before and after scaling:')
print(f"  {'d_k':>6}  {'Var(q^T k) unscaled':>22}  {'Var(q^T k/sqrt(d_k))':>24}  {'Softmax entropy':>16}")

for d_k in [4, 16, 64, 256]:
    q = np.random.randn(N_mc, d_k)
    k = np.random.randn(N_mc, d_k)
    logits_raw   = (q * k).sum(axis=1)
    logits_scaled = logits_raw / np.sqrt(d_k)

    var_raw    = logits_raw.var()
    var_scaled = logits_scaled.var()

    # Softmax entropy over 10 random keys, using the scaled logits
    n_keys = 10
    keys_all = np.random.randn(N_mc, n_keys, d_k)
    q_batched = q[:, np.newaxis, :]  # (N, 1, d)
    logits_all = (q_batched * keys_all).sum(axis=2)  # (N, n_keys)
    softmax_scaled = np.exp(logits_all / np.sqrt(d_k))
    softmax_scaled /= softmax_scaled.sum(axis=1, keepdims=True)
    entropy_scaled = -(softmax_scaled * np.log(softmax_scaled + 1e-9)).sum(axis=1).mean()

    print(f'  {d_k:>6}  {var_raw:>22.3f}  {var_scaled:>24.3f}  {entropy_scaled:>16.3f}')

print('\nWith scaling, Var ≈ 1 regardless of d_k — prevents softmax saturation.')
print('Max entropy = log(10) =', round(np.log(10), 3), '(uniform softmax)')

12. Choosing the Right Bound: A Decision Framework

| Scenario | Best Bound | Formula |
|---|---|---|
| $X \geq 0$, only mean known | Markov | $\mu/t$ |
| Finite variance, arbitrary dist | Chebyshev | $\sigma^2/t^2$ |
| Bounded iid sum, worst-case | Hoeffding | $2e^{-2t^2/\sum(b_i-a_i)^2}$ |
| Bounded iid sum, small variance | Bernstein | $e^{-t^2/(2v+2Mt/3)}$ |
| Binomial, multiplicative | Chernoff | $e^{-\mu\delta^2/3}$ |
| Function of iid vars, bounded diffs | McDiarmid | $e^{-2t^2/\sum c_i^2}$ |
| Generalisation for finite $\mathcal{H}$ | Union + Hoeffding | $\sqrt{\log(\lvert\mathcal{H}\rvert/\delta)/(2n)}$ |
| Generalisation for general $\mathcal{H}$ | Rademacher | $2\mathfrak{R}_n + \sqrt{\log(1/\delta)/(2n)}$ |
| High-dimensional distances | JL Lemma | $k = O(\varepsilon^{-2}\log m)$ |

The key question: What do you know about XX?

  • Non-negative + known mean → Markov
  • + known variance → Chebyshev
  • + boundedness → Hoeffding
  • + small variance → Bernstein
  • + MGF structure → Chernoff

Code cell 33

# === 12. Bound Comparison on a Single Example ===

np.random.seed(42)

# n=100 iid X_i ~ Uniform[0,1], S_n = sum, E[S_n]=50, Var=100/12
n = 100
mu = 0.5
sigma2 = 1.0 / 12  # Var(Uniform[0,1])
a, b = 0.0, 1.0

# We want P(S_n/n - mu >= t) for t = 0.05
t = 0.05

# Bounds for sample mean deviation
# Markov applied to S_n: P(S_n >= n(mu+t)) <= E[S_n] / (n(mu+t)) = mu/(mu+t)
markov_b    = mu / (mu + t)
cheby_b     = sigma2 / (n * t**2)  # Var(X_bar) = sigma2/n
hoeffding_b = np.exp(-2 * n * t**2 / (b - a)**2)
bernstein_b = np.exp(-n * t**2 / 2 / (sigma2 + (b-a)*t/3))

# Empirical
N_mc = 1_000_000
samples = np.random.uniform(a, b, (N_mc, n)).mean(axis=1)
empirical = (samples - mu >= t).mean()

print(f'P(X_bar - {mu} >= {t}), n={n}, X_i ~ Uniform[0,1]:')
print(f'  Empirical:  {empirical:.6f}')
print(f'  Markov:     {markov_b:.6f}  (ratio {markov_b/empirical:.1f}x)')
print(f'  Chebyshev:  {cheby_b:.6f}  (ratio {cheby_b/empirical:.1f}x)')
print(f'  Hoeffding:  {hoeffding_b:.6f}  (ratio {hoeffding_b/empirical:.1f}x)')
print(f'  Bernstein:  {bernstein_b:.6f}  (ratio {bernstein_b/empirical:.1f}x)')
print(f'\nSmaller ratio = tighter bound. Bernstein wins by using variance info.')

Summary

| Concept | Key Formula | ML Relevance |
|---|---|---|
| Markov | $P(X \geq t) \leq \mu/t$ | Baseline; gradient norm bounds |
| Chebyshev | $P(\lvert X-\mu\rvert \geq t) \leq \sigma^2/t^2$ | Weak LLN; variance-only guarantees |
| Sub-Gaussian | $\mathbb{E}[e^{tX}] \leq e^{\sigma^2t^2/2}$ | Bounded/Gaussian noise |
| Hoeffding | $2e^{-2t^2/\sum(b_i-a_i)^2}$ | Sample complexity in PAC |
| Chernoff | $\inf_s e^{-st}M_X(s)$ | Binomial counts, hitting probs |
| Bernstein | Exp with variance term | Sparse gradients in SGD |
| McDiarmid | $e^{-2t^2/\sum c_i^2}$ | Empirical risk, any function |
| Union bound | $P(\bigcup A_i) \leq \sum P(A_i)$ | Any finite hypothesis class |
| VC bound | $\hat{R} + O(\sqrt{d_{VC}/n})$ | Shattering dimension |
| Rademacher | $\hat{R} + 2\mathfrak{R}_n + O(1/\sqrt{n})$ | Data-dependent, tightest |
| JL Lemma | $k = O(\varepsilon^{-2}\log m)$ | Random projections, embeddings |

→ See notes.md for full proofs and theory.
→ See exercises.ipynb for practice problems.


13. Preview: Law of Large Numbers via Concentration

The Weak Law of Large Numbers (WLLN) follows immediately from Chebyshev:

P ⁣(Xˉnμε)σ2nε2n0P\!\left(|\bar{X}_n - \mu| \geq \varepsilon\right) \leq \frac{\sigma^2}{n\varepsilon^2} \xrightarrow{n\to\infty} 0

The Strong LLN requires more work (Borel-Cantelli or martingale argument), but Chebyshev's inequality gives the WLLN with a quantitative rate.

Strong LLN via Borel-Cantelli (sketch). For the subsequence nk=k2n_k = k^2, Chebyshev gives P(Xˉnkμε)σ2/(nkε2)P(|\bar{X}_{n_k} - \mu| \geq \varepsilon) \leq \sigma^2/(n_k \varepsilon^2). Since k1/k2<\sum_k 1/k^2 < \infty, Borel-Cantelli gives convergence a.s. along nkn_k. Interpolate between subsequences to complete the proof.
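The summability driving the Borel-Cantelli step is easy to check numerically; σ² = 1 and ε = 0.1 below are illustrative choices:

```python
import numpy as np

# Chebyshev bounds along the subsequence n_k = k^2 (sigma^2 = 1, eps = 0.1 assumed)
sigma2, eps = 1.0, 0.1
ks = np.arange(1, 10_001)
bounds = sigma2 / (ks**2 * eps**2)   # P(|X_bar_{k^2} - mu| >= eps) <= sigma^2 / (k^2 eps^2)

# Partial sums converge (to (sigma^2/eps^2) * pi^2/6), so Borel-Cantelli applies
print('Sum of bounds up to k = 10^4:', round(bounds.sum(), 3))
print('Limit (sigma^2/eps^2) * pi^2/6 =', round(sigma2 / eps**2 * np.pi**2 / 6, 3))
```

Along the full sequence n instead of n_k = k², the bounds are σ²/(nε²), whose sum diverges — which is exactly why the subsequence trick is needed.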

CLT preview. The Central Limit Theorem tells us the shape of the deviation:

n(Xˉnμ)dN(0,σ2)\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)

This is sharper than Chebyshev's O(1/t2)O(1/t^2) but is asymptotic. Concentration inequalities give non-asymptotic guarantees valid for finite nn.

→ Full treatment: §06/06 Stochastic Processes for martingales; §07 Statistics for CLT applications.

Code cell 36

# === 13. LLN Convergence: Chebyshev Rate vs Empirical ===

np.random.seed(42)

# Exponential(1): mean=1, var=1
N_mc = 10_000
ns = np.logspace(1, 4, 30).astype(int)

eps = 0.1
empirical_probs = []
chebyshev_bounds = []
hoeffding_bounds_clip = []

for n in ns:
    samples = np.random.exponential(1.0, (N_mc, n)).mean(axis=1)
    prob = (np.abs(samples - 1.0) >= eps).mean()
    empirical_probs.append(prob)
    chebyshev_bounds.append(1.0 / (n * eps**2))  # sigma^2=1
    # Hoeffding: need bounded, so clip to [0, 10]; c_i ~ 10/n
    hoeffding_bounds_clip.append(2 * np.exp(-2 * n * eps**2 / (10**2)))

if HAS_MPL:
    fig, ax = plt.subplots(figsize=(10,5))
    ax.loglog(ns, empirical_probs, 'o-', label='Empirical P(|X_bar-1|>=0.1)', ms=4)
    ax.loglog(ns, chebyshev_bounds, '--', label='Chebyshev bound (1/n)', lw=2)
    ax.loglog(ns, hoeffding_bounds_clip, ':', label='Hoeffding (clipped at 10)', lw=2)
    ax.set_xlabel('Sample size $n$')
    ax.set_ylabel('$P(|\\bar{X}_n - \\mu| \\geq 0.1)$')
    ax.set_title('LLN Rate: Empirical vs Chebyshev Bound (Exponential(1))')
    ax.legend()
    plt.tight_layout()
    plt.show()

idx100 = int(np.argmin(np.abs(ns - 100)))
print(f'At n={ns[idx100]}: empirical={empirical_probs[idx100]:.4f}, '
      f'Chebyshev={chebyshev_bounds[idx100]:.4f}')

14. Azuma's Inequality for Martingales

Azuma extends Hoeffding from iid sums to martingales.

Theorem (Azuma-Hoeffding). Let (Mk)k=0n(M_k)_{k=0}^n be a martingale with bounded differences: MkMk1ck|M_k - M_{k-1}| \leq c_k a.s. Then:

P(MnM0t)exp ⁣(t22k=1nck2)P(M_n - M_0 \geq t) \leq \exp\!\left(\frac{-t^2}{2\sum_{k=1}^n c_k^2}\right)

Why this is powerful: The XkX_k need not be independent — only the martingale difference Dk=MkMk1D_k = M_k - M_{k-1} needs to be bounded and have zero conditional mean. This is exactly the structure of the Doob martingale used in McDiarmid's proof.

Doob martingale. For f(X1,,Xn)f(X_1,\ldots,X_n) with independent XiX_i, define:

Mk=E[fX1,,Xk]M_k = \mathbb{E}[f | X_1,\ldots,X_k]

Then M0=E[f]M_0 = \mathbb{E}[f], Mn=fM_n = f, and MkMk1ck|M_k - M_{k-1}| \leq c_k iff ff has ckc_k-bounded difference in XkX_k. Azuma applied to the Doob martingale yields McDiarmid.
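For f = sample mean of Uniform[0,1] variables, the Doob martingale has a closed form, so all three properties can be verified directly; n and the Monte Carlo size below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

n, N_mc = 50, 5000
mu = 0.5  # E[X_i] for X_i ~ Uniform[0, 1]

X = rng.uniform(0, 1, (N_mc, n))
f = X.mean(axis=1)

# Doob martingale M_k = E[f | X_1..X_k] = (X_1 + ... + X_k + (n - k) * mu) / n
k = np.arange(n + 1)
M = (np.concatenate([np.zeros((N_mc, 1)), X.cumsum(axis=1)], axis=1)
     + (n - k) * mu) / n

print('M_0 (should equal E[f] = 0.5):', M[:, 0].mean())
print('M_n recovers f exactly:', np.allclose(M[:, -1], f))
print('max_k |M_k - M_{k-1}| =', np.abs(np.diff(M, axis=1)).max(),
      '  (bound 0.5/n =', 0.5 / n, ')')
```

Here the differences are (X_k − μ)/n, bounded by 0.5/n, so Azuma recovers the Hoeffding-type rate for the sample mean.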

Applications in ML:

  • Online learning regret bounds
  • Stochastic approximation (SGD) convergence
  • Adaptive data collection (active learning, bandits)

Code cell 38

# === 14. Azuma: Random Walk and Martingale ===

np.random.seed(42)

# Simple martingale: partial sums of bounded differences D_k in [-c, c]
n_steps = 200
c = 1.0
N_mc = 50_000

# D_k ~ Uniform[-c, c], zero mean
D = np.random.uniform(-c, c, (N_mc, n_steps))
M = D.cumsum(axis=1)  # martingale starting at 0

# Azuma bound for M_n >= t
ts = np.linspace(1, 20, 100)
azuma = np.exp(-ts**2 / (2 * n_steps * c**2))
empirical = np.array([(M[:,-1] >= t).mean() for t in ts])

if HAS_MPL:
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Left: sample paths
    ax = axes[0]
    for i in range(20):
        ax.plot(M[i], alpha=0.4, lw=0.8)
    ax.axhline(0, color='k', lw=1)
    # Azuma CI bands
    steps = np.arange(n_steps)
    band = np.sqrt(2 * (steps+1) * c**2 * np.log(2 / 0.05))
    ax.plot(steps, band, 'r--', lw=2, label='Azuma 95% band')
    ax.plot(steps, -band, 'r--', lw=2)
    ax.set_xlabel('Step $k$')
    ax.set_ylabel('$M_k$')
    ax.set_title('Martingale Sample Paths + Azuma Bounds')
    ax.legend()

    # Right: tail bound
    ax = axes[1]
    ax.semilogy(ts, empirical, 'o-', ms=3, label='Empirical')
    ax.semilogy(ts, azuma, '--', lw=2, label='Azuma bound')
    ax.set_xlabel('$t$')
    ax.set_ylabel('$P(M_n \\geq t)$')
    ax.set_title(f'Azuma Tail Bound (n={n_steps}, c={c})')
    ax.legend()

    plt.tight_layout()
    plt.show()

t_test = 10.0
emp = (M[:,-1] >= t_test).mean()
az  = np.exp(-t_test**2 / (2 * n_steps * c**2))
print(f'At t={t_test}: empirical={emp:.5f}, Azuma bound={az:.5f}')

15. Beyond Sub-Gaussian: Heavy Tails

Not all random variables in ML are bounded or sub-Gaussian. Gradient norms in transformer training can be heavy-tailed (Student-tt distribution empirically). For such variables, exponential bounds fail; polynomial bounds must be used.

Sub-exponential variables. XX is sub-exponential with parameters (ν2,b)(\nu^2, b) if:

E[etX]et2ν2/2for t1/b\mathbb{E}[e^{tX}] \leq e^{t^2\nu^2/2} \quad \text{for } |t| \leq 1/b

Sub-exponential = sub-Gaussian for small tt, exponential tail for large tt. Example: χ2\chi^2 with kk DOF is sub-exponential with ν2=4k\nu^2 = 4k, b=4b = 4.

Catoni's estimator. For heavy-tailed means with finite variance σ2\sigma^2, the sample mean is not optimal. Catoni's robust estimator achieves:

μ^Catμσ2log(2/δ)n|\hat{\mu}_{\text{Cat}} - \mu| \leq \sigma\sqrt{\frac{2\log(2/\delta)}{n}}

with probability 1δ1-\delta — the sub-Gaussian rate, despite heavy tails!
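A minimal sketch of Catoni's estimator, assuming the variance σ² is known and solving for the root of the influence sum by bisection; the ψ below is one standard choice of truncation function:

```python
import numpy as np

def catoni_psi(x):
    """Catoni influence function: log(1 + x + x^2/2) for x >= 0, odd extension for x < 0."""
    return np.sign(x) * np.log1p(np.abs(x) + x**2 / 2)

def catoni_mean(x, sigma2, delta=0.05):
    """Root of m -> sum_i psi(alpha * (x_i - m)); the sum is decreasing in m, so bisect."""
    n = len(x)
    alpha = np.sqrt(2 * np.log(2 / delta) / (n * sigma2))
    lo, hi = float(x.min()), float(x.max())
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if catoni_psi(alpha * (x - mid)).sum() > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
n, df, trials = 500, 3, 200        # Student-t(3): heavy tails, mean 0, variance 3
err_mean, err_cat = [], []
for _ in range(trials):
    x = rng.standard_t(df, n)
    err_mean.append(abs(x.mean()))
    err_cat.append(abs(catoni_mean(x, sigma2=3.0)))
print(f'Worst |error| over {trials} trials: sample mean {max(err_mean):.4f}, '
      f'Catoni {max(err_cat):.4f}')
```

The truncation tames the influence of extreme samples, which is what buys the sub-Gaussian-rate guarantee.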

Forward reference: Matrix concentration inequalities (Tropp's matrix Bernstein, non-commutative Khintchine) handle random matrices — directly applicable to random attention weight matrices. See §08 Advanced Topics in Probability if available.

Code cell 40

# === 15. Sub-Exponential: Chi-Squared vs Gaussian Tails ===

np.random.seed(42)

k = 5  # degrees of freedom
N_mc = 1_000_000

# chi^2(k): sum of k standard Gaussians squared
X_chi2 = np.random.chisquare(k, N_mc)  # mean=k, var=2k
mu_chi2 = k
sigma2_chi2 = 2 * k

ts = np.linspace(1, 30, 200)

# Sub-exponential Bernstein bound: P(X - mu >= t) <= exp(-t^2/(2(2k + 4t)))
def se_bernstein(t, k):
    v = 2 * k  # variance = 2k
    b = 4      # parameter b for chi^2
    return np.exp(-t**2 / 2 / (v + b * t))

sub_gauss_b = np.exp(-ts**2 / (2 * sigma2_chi2))  # treating as sub-Gaussian
bernstein_b = se_bernstein(ts, k)
empirical_b = np.array([(X_chi2 - mu_chi2 >= t).mean() for t in ts])

if HAS_MPL:
    fig, ax = plt.subplots(figsize=(10,5))
    ax.semilogy(ts, empirical_b, 'k-', lw=2, label='Empirical $\\chi^2(5)$')
    ax.semilogy(ts, sub_gauss_b, '--', label='Sub-Gaussian bound (wrong!)', lw=2)
    ax.semilogy(ts, bernstein_b, '-.',  label='Bernstein sub-exp bound', lw=2)
    ax.set_xlabel('$t$')
    ax.set_ylabel('$P(X - \\mu \\geq t)$')
    ax.set_title('$\\chi^2(5)$ Tail: Sub-Gaussian vs Sub-Exponential Bound')
    ax.set_xlim(0, 30)
    ax.legend()
    plt.tight_layout()
    plt.show()

t_test = 15.0
print(f'At t={t_test} above mean:')
print(f'  Empirical:      {(X_chi2 - mu_chi2 >= t_test).mean():.6f}')
print(f'  Sub-Gaussian:   {np.exp(-t_test**2/(2*sigma2_chi2)):.6f}  (too tight!)')
print(f'  Bernstein:      {se_bernstein(t_test, k):.6f}  (valid bound)')

16. Practical Guidelines for ML Practitioners

When Bounds Are Tight vs Loose

Concentration bounds are worst-case over all distributions satisfying the assumptions. They can be very loose in practice:

  • VC bounds for neural networks: dVC=O(plogp)d_{\text{VC}} = O(p \log p) is enormous, giving vacuous >1> 1 bounds
  • Rademacher complexity for transformers: also often loose due to norm-based analysis
  • PAC-Bayes bounds (Catoni, McAllester) are often much tighter in practice

What Bounds ARE Useful For

Despite looseness, concentration inequalities provide:

  1. Scaling intuition: O(1/n)O(1/\sqrt{n}) generalisation gap — more data always helps
  2. Architecture comparison: LoRA has r/d\sqrt{r/d} Rademacher factor vs full fine-tuning
  3. Benchmark validity: CIs on accuracy estimates via Hoeffding or Clopper-Pearson
  4. Hyperparameter reasoning: batch size BB → gradient noise scales as 1/B1/\sqrt{B}
  5. Privacy analysis: DP-SGD noise calibrated via concentration (Gaussian mechanism)
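For point 3, a sketch comparing a Hoeffding confidence interval with the exact Clopper-Pearson interval; the accuracy and sample size are made up for illustration:

```python
import numpy as np
from scipy import stats

def hoeffding_ci(acc, n, delta=0.05):
    """Two-sided Hoeffding CI: half-width sqrt(log(2/delta) / (2n))."""
    h = np.sqrt(np.log(2 / delta) / (2 * n))
    return max(0.0, acc - h), min(1.0, acc + h)

def clopper_pearson_ci(k, n, delta=0.05):
    """Exact binomial CI from beta quantiles."""
    lo = stats.beta.ppf(delta / 2, k, n - k + 1) if k > 0 else 0.0
    hi = stats.beta.ppf(1 - delta / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

n_eval, k_correct = 1000, 843   # hypothetical benchmark result
acc = k_correct / n_eval
print(f'Accuracy {acc:.3f} on n={n_eval} items:')
print('  Hoeffding 95% CI:       ({:.4f}, {:.4f})'.format(*hoeffding_ci(acc, n_eval)))
print('  Clopper-Pearson 95% CI: ({:.4f}, {:.4f})'.format(*clopper_pearson_ci(k_correct, n_eval)))
```

Hoeffding is distribution-free and hence wider; Clopper-Pearson exploits the binomial likelihood and tightens the interval, especially for accuracies away from 0.5.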

The Sub-Gaussian Assumption in Practice

Empirical gradient distributions in transformers are often heavy-tailed (Student-tt like). This is why gradient clipping is used — it effectively imposes bounded differences, making the Hoeffding/Bernstein assumptions approximately valid after clipping.

Code cell 42

# === 16. Gradient Clipping: From Heavy-Tailed to Sub-Gaussian ===

np.random.seed(42)

N_mc = 500_000

# Heavy-tailed gradient proxy: Student-t with low df (df=3: finite variance, heavy tails)
df = 3
grads_heavy = np.random.standard_t(df, N_mc)

def clip_gradients(g, clip_norm):
    return np.clip(g, -clip_norm, clip_norm)

clip_values = [1.0, 2.0, 5.0, np.inf]

print('Effect of gradient clipping on tail behaviour:')
print(f"  {'Clip':>8}  {'P(|g|>3)':>12}  {'P(|g|>5)':>12}  {'P(|g|>10)':>12}  {'Std':>8}")
for clip in clip_values:
    if clip == np.inf:
        g = grads_heavy
        label = 'None'
    else:
        g = clip_gradients(grads_heavy, clip)
        label = f'{clip:.1f}'
    p3  = (np.abs(g) >= 3).mean()
    p5  = (np.abs(g) >= 5).mean()
    p10 = (np.abs(g) >= 10).mean()
    std = g.std()
    print(f'  {label:>8}  {p3:>12.5f}  {p5:>12.5f}  {p10:>12.5f}  {std:>8.4f}')

print('\nGaussian(0,1) for reference:')
g_gauss = np.random.randn(N_mc)
print(f"  {'Gaussian':>8}  {(np.abs(g_gauss)>=3).mean():>12.5f}  "
      f"{(np.abs(g_gauss)>=5).mean():>12.5f}  {(np.abs(g_gauss)>=10).mean():>12.5f}  "
      f"{g_gauss.std():>8.4f}")

print('\nClipping at 2.0 makes tail comparable to Gaussian — enabling Hoeffding/Bernstein analysis.')

17. PAC-Bayes Bounds

PAC-Bayes bounds are often the tightest available for neural networks.

Theorem (McAllester PAC-Bayes). For prior PP on H\mathcal{H} (chosen before seeing data) and any posterior QQ on H\mathcal{H}, with probability 1δ\geq 1-\delta over SDnS \sim \mathcal{D}^n:

EhQ[R(h)]EhQ[R^(h)]+KL(QP)+ln(n/δ)2(n1)\mathbb{E}_{h \sim Q}[R(h)] \leq \mathbb{E}_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{\text{KL}(Q \| P) + \ln(n/\delta)}{2(n-1)}}

Intuition: The generalisation gap is controlled by KL(QP)\text{KL}(Q \| P) — how much the posterior QQ deviates from the prior. If training barely moves the parameters (small KL), the model generalises well.

Connection to regularisation:

  • L2L^2 regularisation \Leftrightarrow Gaussian prior P=N(0,λ1I)P = \mathcal{N}(0, \lambda^{-1}I)
  • KL(QP)=λ2θ2+const(Q \| P) = \frac{\lambda}{2}\|\theta\|^2 + \text{const} for Gaussian posteriors
  • LoRA: small KL because rdr \ll d, so posterior stays close to prior

This is why deep learning 'works' despite overparameterisation: SGD with small LR + early stopping creates posteriors QQ close to the initialisation prior PP, keeping KL small and generalisation gap bounded.

Code cell 44

# === 17. PAC-Bayes: KL and Generalisation Gap ===

np.random.seed(42)

from scipy.special import xlogy

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    q = np.clip(q, 1e-9, 1 - 1e-9)
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return xlogy(q, q/p) + xlogy(1-q, (1-q)/(1-p))

def pacbayes_bound(R_hat, kl, n, delta=0.05):
    """McAllester PAC-Bayes bound."""
    return R_hat + np.sqrt((kl + np.log(n / delta)) / (2 * (n - 1)))

# Scenario: linear model, Gaussian prior N(0, I), Gaussian posterior N(w*, sigma^2 I)
def gaussian_kl(theta, sigma2_prior=1.0, sigma2_post=0.01, d=100):
    """KL(N(theta, sigma2_post I) || N(0, sigma2_prior I))."""
    kl = 0.5 * (d * (sigma2_post / sigma2_prior - 1 - np.log(sigma2_post / sigma2_prior))
                + np.dot(theta, theta) / sigma2_prior)
    return kl

n, d = 1000, 100
R_hat = 0.1  # empirical risk

print('PAC-Bayes generalisation bound vs KL(posterior||prior):')
print(f"  {'||theta||_2':>12}  {'KL':>10}  {'Gap bound':>12}  {'R bound':>10}")
for norm_theta in [0.5, 1.0, 2.0, 5.0, 10.0]:
    theta = np.ones(d) * norm_theta / np.sqrt(d)
    kl = gaussian_kl(theta)
    gap = np.sqrt((kl + np.log(n / 0.05)) / (2 * (n - 1)))
    R_bound = R_hat + gap
    print(f'  {norm_theta:>12.1f}  {kl:>10.2f}  {gap:>12.4f}  {R_bound:>10.4f}')

print(f'\nKey insight: KL grows as ||theta||^2 / sigma2_prior.')
print('L2 regularisation (weight decay) penalises exactly this term.')

18. Double Descent and the Failure of Classical Bounds

Classical VC bounds predict monotonically increasing generalisation error with model complexity. Double descent (Belkin et al., 2019) shows that modern overparameterised models can achieve low test error even when pnp \gg n:

test error
    |              classical regime      interpolation threshold    modern regime
    |              bias↓, var↑                  |                  var↓ (implicit reg)
    |    underfitting → ──────────────────────── peak ──────────────────── overfit? No!
    |________________ model complexity (p/n) ___________________________

Why classical bounds fail:

  1. VC/Rademacher bounds don't capture the effect of optimisation algorithm (SGD bias)
  2. Minimum-norm interpolation has implicit regularisation not captured by norm bounds
  3. Benign overfitting (Bartlett et al., 2020): overparameterised models can fit noise and still generalise if signal-to-noise is high

Concentration still applies: Even in double descent, Rademacher/PAC-Bayes bounds apply — they're just not tight. The issue is that classical bounds use worst-case norms, while SGD finds solutions with small effective norms via implicit regularisation.

Code cell 46

# === 18. Double Descent: Polynomial Regression Demo ===

np.random.seed(42)

from numpy.polynomial import polynomial as P

# True function: sin(2*pi*x) with noise
def true_fn(x):
    return np.sin(2 * np.pi * x)

n_train = 20
sigma_noise = 0.2
x_train = np.linspace(0, 1, n_train)
y_train = true_fn(x_train) + np.random.randn(n_train) * sigma_noise

x_test = np.linspace(0, 1, 200)
y_test = true_fn(x_test)

degrees = list(range(1, 35))
train_errors = []
test_errors = []

for deg in degrees:
    # Vandermonde design matrix
    V_train = np.vander(x_train, deg + 1, increasing=True)
    V_test  = np.vander(x_test,  deg + 1, increasing=True)

    # Minimum-norm least squares (pseudoinverse)
    w = np.linalg.pinv(V_train) @ y_train

    train_err = np.mean((V_train @ w - y_train)**2)
    test_err  = np.mean((V_test  @ w - y_test)**2)
    train_errors.append(train_err)
    test_errors.append(test_err)

if HAS_MPL:
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.semilogy(degrees, train_errors, 'o-', ms=4, label='Train MSE')
    ax.semilogy(degrees, test_errors,  's-', ms=4, label='Test MSE')
    ax.axvline(n_train, color='r', ls='--', lw=2, label=f'n_train={n_train} (interpolation threshold)')
    ax.set_xlabel('Polynomial degree (model complexity)')
    ax.set_ylabel('MSE')
    ax.set_title('Double Descent in Polynomial Regression')
    ax.legend()
    plt.tight_layout()
    plt.show()

interp_idx = n_train - 1
peak_test = max(test_errors[:n_train+2])
final_test = test_errors[-1]
print(f'Test error peak near interpolation: {peak_test:.4f}')
print(f'Test error at highest degree: {final_test:.4f}')
print('Double descent: error peaks then decreases in overparameterised regime.')

19. Differential Privacy and Concentration

Differentially Private SGD (DP-SGD, Abadi et al. 2016) adds calibrated Gaussian noise to gradients to prevent membership inference. The noise scale is determined by concentration:

  1. Clip individual gradients: g~i=gi/max(1,gi2/C)\tilde{g}_i = g_i / \max(1, \|g_i\|_2/C) — ensures g~i2C\|\tilde{g}_i\|_2 \leq C
  2. Aggregate + noise: g^=1B(ig~i+N(0,σ2C2I))\hat{g} = \frac{1}{B}(\sum_i \tilde{g}_i + \mathcal{N}(0, \sigma^2 C^2 I))
  3. Privacy accounting: (ε,δ)(\varepsilon, \delta)-DP is achieved when:
σC2ln(1.25/δ)εB/n\sigma \geq \frac{C\sqrt{2\ln(1.25/\delta)}}{\varepsilon \cdot B / n}

Gaussian mechanism analysis uses the sub-Gaussian concentration of the noise. The privacy-utility tradeoff: larger σ\sigma → better privacy, larger gradient noise → slower convergence.

Moments accountant / Rényi DP gives tighter accounting via concentration of privacy loss across steps.
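Steps 1-2 above can be sketched as a single aggregation function; the gradient values are synthetic, and this is an illustrative sketch, not a reference implementation of any DP library:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(per_example_grads, C, sigma, rng):
    """Clip each per-example gradient to norm C, sum, add N(0, sigma^2 C^2 I), average."""
    B, d = per_example_grads.shape
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / C)
    noised = clipped.sum(axis=0) + sigma * C * rng.standard_normal(d)
    return noised / B

B, d = 256, 1000
grads = 0.05 + 0.1 * rng.standard_normal((B, d))  # synthetic per-example gradients
g_hat = dp_sgd_step(grads, C=1.0, sigma=1.0, rng=rng)
g_clean = grads.mean(axis=0)
print('||clean mean grad|| =', round(float(np.linalg.norm(g_clean)), 4))
print('||DP grad - clean|| =', round(float(np.linalg.norm(g_hat - g_clean)), 4))
```

Clipping caps the per-example sensitivity at C, which is exactly what lets the Gaussian mechanism's noise scale be calibrated.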

Code cell 48

# === 19. DP-SGD Noise Calibration ===

np.random.seed(42)

def dp_sgd_sigma(eps, delta, C, B, n):
    """
    Noise multiplier sigma for (eps, delta)-DP.
    Returns sigma (noise is sigma * C * I).
    """
    q = B / n  # sampling ratio
    sigma = C * np.sqrt(2 * np.log(1.25 / delta)) / (eps * q)
    return sigma

n = 60000  # MNIST-size training set
B = 256
C = 1.0

print('DP-SGD noise sigma for various privacy budgets:')
print(f'  (n={n}, batch={B}, clip_norm={C})')
print(f"  {'eps':>6}  {'delta':>10}  {'sigma':>10}  {'noise/signal ratio':>20}")
for eps in [0.1, 0.5, 1.0, 5.0, 10.0]:
    for delta in [1e-5, 1e-3]:
        sigma = dp_sgd_sigma(eps, delta, C, B, n)
        # noise/signal: sigma * C / (batch_grad_norm ~ C)
        ratio = sigma
        print(f'  {eps:>6.1f}  {delta:>10.0e}  {sigma:>10.4f}  {ratio:>20.4f}')

print('\nInterpretation: sigma >> 1 means privacy noise dominates gradient signal.')
print('Strong privacy (eps=0.1) requires enormous noise -> very slow convergence.')

20. Preview: Matrix Concentration Inequalities

Scalar concentration inequalities extend to random matrices via matrix versions of Bernstein.

Theorem (Matrix Bernstein, Tropp 2012). Let S=kXkS = \sum_k X_k where XkX_k are independent random Hermitian matrices with E[Xk]=0\mathbb{E}[X_k] = 0 and XkM\|X_k\| \leq M a.s. Let v=kE[Xk2]v = \|\sum_k \mathbb{E}[X_k^2]\|. Then:

P(St)dexp ⁣(t2/2v+Mt/3)P(\|S\| \geq t) \leq d \cdot \exp\!\left(\frac{-t^2/2}{v + Mt/3}\right)

where dd is the matrix dimension and \|\cdot\| is the spectral norm.
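A Monte Carlo sanity check of the bound, using summands X_k = ε_k A_k with Rademacher signs ε_k and fixed Hermitian A_k of unit spectral norm — an illustrative construction, not the general setting:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_terms, N_mc = 10, 50, 2000

# Fixed Hermitian A_k with ||A_k|| = 1, so M = 1; summands are X_k = eps_k * A_k
A = rng.standard_normal((n_terms, d, d))
A = (A + A.transpose(0, 2, 1)) / 2
A /= np.linalg.norm(A, 2, axis=(1, 2))[:, None, None]
M = 1.0
v = np.linalg.norm(np.einsum('kij,kjl->il', A, A), 2)  # v = ||sum_k E[X_k^2]|| = ||sum_k A_k^2||

# Monte Carlo over Rademacher signs
eps_signs = rng.choice([-1.0, 1.0], (N_mc, n_terms))
S_norms = np.linalg.norm(np.einsum('mk,kij->mij', eps_signs, A), 2, axis=(1, 2))

print(f'd={d}, n_terms={n_terms}, v={v:.2f}')
for t in [5.0, 10.0, 15.0]:
    emp = (S_norms >= t).mean()
    bound = d * np.exp(-t**2 / 2 / (v + M * t / 3))
    print(f'  t={t:>5.1f}: empirical={emp:.4f}, matrix Bernstein bound={min(bound, 1.0):.4f}')
```

As in the scalar case, the bound is conservative but captures the transition from Gaussian-like (t² in v) to exponential (t in M) tail behaviour.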

Applications:

  • Random attention: concentration of QK/dkQK^\top / \sqrt{d_k} away from its expectation
  • Covariance estimation: Σ^Σ\hat{\Sigma} - \Sigma spectral norm concentration
  • LoRA: concentration of low-rank update ΔW=AB\Delta W = AB spectral norm
  • Random features: approximation error of random feature maps for kernel methods

Full treatment in §08 Advanced Probability Topics (if available) or Tropp's An Introduction to Matrix Concentration Inequalities (2015, arXiv:1501.01571).

Code cell 50

# === 20. Matrix Concentration: Random Attention ===

np.random.seed(42)

# Simulate spectral norm concentration of the scaled attention logit matrix
# A = Q K^T / d_k, where Q, K have iid N(0, 1) entries before each 1/sqrt(d_k) scaling

def attention_spectral_norm(n_seq, d_k, N_mc=500):
    """Empirical distribution of ||QK^T/sqrt(d_k)||_2."""
    norms = []
    for _ in range(N_mc):
        Q = np.random.randn(n_seq, d_k) / np.sqrt(d_k)
        K = np.random.randn(n_seq, d_k) / np.sqrt(d_k)
        A = Q @ K.T  # the two 1/sqrt(d_k) factors give the overall 1/d_k scaling
        norms.append(np.linalg.norm(A, 2))
    return np.array(norms)

print('Spectral norm of Q K^T / d_k (entries of Q, K have variance 1/d_k)')
print(f"  {'n_seq':>6}  {'d_k':>5}  {'E[||A||]':>10}  {'Std[||A||]':>12}  "
      f"{'sqrt(n_seq/d_k)':>15}")
for n_seq, d_k in [(10,64), (20,64), (50,64), (100,64), (64,64), (64,256)]:
    norms = attention_spectral_norm(n_seq, d_k, N_mc=300)
    theory = np.sqrt(n_seq / d_k)  # heuristic Marchenko-Pastur-type scaling
    print(f'  {n_seq:>6}  {d_k:>5}  {norms.mean():>10.4f}  {norms.std():>12.4f}  '
          f'{theory:>15.4f}')

print('\nSpectral norm grows like sqrt(n_seq/d_k) — longer sequences or smaller d_k -> larger logits.')

Code cell 51

# === Final: Concentration Inequality Summary Plot ===

np.random.seed(42)

if HAS_MPL:
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Plot 1: Tail hierarchy
    ax = axes[0, 0]
    t = np.linspace(0.2, 5, 300)
    ax.semilogy(t, 1/t,                   label='Markov: 1/t')
    ax.semilogy(t, 1/t**2,                label='Chebyshev: 1/t^2')
    ax.semilogy(t, np.exp(-2*t**2),       label='Hoeffding/sub-Gauss')
    ax.semilogy(t, 2*stats.norm.sf(t),    label='Gaussian (actual)', ls='--')
    ax.set_title('Tail Bound Hierarchy')
    ax.set_xlabel('t'); ax.set_ylabel('Bound (log scale)')
    ax.legend(fontsize=9); ax.set_ylim(1e-7, 2)

    # Plot 2: PAC learning curve
    ax = axes[0, 1]
    ns = np.logspace(1, 5, 100)
    for H_log, style in [(2, '-'), (5, '--'), (10, ':')]:
        H = 10**H_log
        gap = np.sqrt(np.log(H / 0.05) / (2 * ns))
        ax.loglog(ns, gap, style, label=f'|H|=10^{H_log}')
    ax.set_title('PAC Generalisation Gap vs n')
    ax.set_xlabel('Training samples n')
    ax.set_ylabel('Gap bound')
    ax.legend(); ax.axhline(0.05, color='gray', ls=':', lw=1)

    # Plot 3: Hoeffding verification
    ax = axes[1, 0]
    ns_test = np.arange(10, 500, 10)
    N_mc = 50000
    empirical_50 = []
    bounds_50 = []
    for n in ns_test:
        s = np.random.binomial(1, 0.5, (N_mc, n)).mean(axis=1)
        empirical_50.append((np.abs(s - 0.5) >= 0.1).mean())
        bounds_50.append(2 * np.exp(-2 * n * 0.01))
    ax.semilogy(ns_test, empirical_50, label='Empirical')
    ax.semilogy(ns_test, bounds_50, '--', label='Hoeffding')
    ax.set_title('Hoeffding: |X_bar - 0.5| >= 0.1')
    ax.set_xlabel('n'); ax.set_ylabel('P (log scale)')
    ax.legend()

    # Plot 4: LoRA Rademacher scaling
    ax = axes[1, 1]
    ds = np.logspace(1, 3, 50)
    n_val = 1000
    for r, style in [(1, '-'), (4, '--'), (16, ':'), (64, '-.')]:
        R_lora = np.sqrt(ds * r / n_val)
        ax.loglog(ds, R_lora, style, label=f'LoRA r={r}')
    R_full = np.sqrt(ds**2 / n_val)
    ax.loglog(ds, R_full, 'k-', lw=2, label='Full (r=d)')
    ax.set_title(f'Rademacher Complexity vs d (n={n_val})')
    ax.set_xlabel('Dimension d')
    ax.set_ylabel('$\\mathfrak{R}$ bound')
    ax.legend(fontsize=9)

    plt.suptitle('Concentration Inequalities: Summary', fontsize=14, y=1.01)
    plt.tight_layout()
    plt.show()

print('Summary plots generated.')

Notebook Complete

This notebook has covered:

  1. Tail bound hierarchy — Markov → Chebyshev → sub-Gaussian → Hoeffding → Chernoff → Bernstein → McDiarmid
  2. PAC learning — finite class bounds via union bound, VC dimension, shattering
  3. Rademacher complexity — data-dependent bounds, LoRA analysis
  4. ML applications — SGD gradient concentration, LLM benchmark CIs, JL lemma
  5. Transformers — attention scaling, residual norms, gradient clipping
  6. Modern theory — PAC-Bayes, double descent, DP-SGD, matrix concentration

References

  1. Boucheron, Lugosi & Massart, Concentration Inequalities (2013) — OUP
  2. Tropp, An Introduction to Matrix Concentration Inequalities (2015) — arXiv:1501.01571
  3. Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014) — Chapters 2-7
  4. Wainwright, High-Dimensional Statistics (2019) — Cambridge
  5. Vershynin, High-Dimensional Probability (2018) — Cambridge

exercises.ipynb — 8 graded problems
notes.md — full theoretical treatment with proofs