Math for LLMs: Theory Notebook

Converted from theory.ipynb for web reading.

Concentration Inequalities

"One of the great gifts of probability theory is that it allows us to turn complex dependencies into simple, manageable bounds." — Stéphane Boucheron

Interactive exploration of concentration inequalities: from Markov and Chebyshev to sub-Gaussian tails, Hoeffding, Chernoff, Bernstein, and McDiarmid. Applications to PAC learning, VC theory, Rademacher complexity, and LLM evaluation.

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
from scipy import stats

try:
    import matplotlib.pyplot as plt
    import matplotlib
    plt.style.use('seaborn-v0_8-whitegrid')
    plt.rcParams['figure.figsize'] = [10, 6]
    plt.rcParams['font.size'] = 12
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

try:
    import seaborn as sns
    HAS_SNS = True
except ImportError:
    HAS_SNS = False

np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)
print('Setup complete.')

1. Intuition: Why Tail Bounds Matter

A random variable is concentrated if most of its mass lies near the mean. Concentration inequalities quantify this with explicit tail bounds $P(X \geq \mu + t)$. In ML, $X$ might be a generalisation error, empirical risk, or gradient norm.

The hierarchy (weakest → strongest assumptions, loosest → tightest bounds):

  • Markov — only non-negativity: $P(X \geq t) \leq \mathbb{E}[X]/t$ — polynomial tail $O(1/t)$
  • Chebyshev — finite variance: $P(|X - \mu| \geq t) \leq \sigma^2/t^2$ — polynomial $O(1/t^2)$
  • Sub-Gaussian/Hoeffding — bounded support or light tails: $O(e^{-2t^2/\sum c_i^2})$
  • Chernoff — MGF exists: $\inf_{s>0} e^{-st} M_X(s)$ — exponential, shape depends on the MGF
  • Bernstein — variance + bounded range: exponential, with a Gaussian→exponential tail transition
  • McDiarmid — function of independent variables with bounded differences: $O(e^{-2t^2/\sum c_i^2})$

Code cell 5

# === 1.1 Tail Bound Comparison ===

t = np.linspace(0.1, 5, 500)
mu = 1.0    # E[X] for Markov
sigma2 = 1.0  # Var(X) for Chebyshev
c2 = 1.0    # sum c_i^2 for Hoeffding/sub-Gaussian

markov   = mu / t
cheby    = sigma2 / t**2
hoeffding = np.exp(-2 * t**2 / c2)
gaussian_tail = 2 * stats.norm.sf(t)  # 2P(Z >= t) — actual Gaussian tail

if HAS_MPL:
    fig, ax = plt.subplots(figsize=(10,5))
    ax.semilogy(t, markov,        label=r'Markov: $\mu/t$',            lw=2)
    ax.semilogy(t, cheby,         label=r'Chebyshev: $\sigma^2/t^2$',  lw=2)
    ax.semilogy(t, hoeffding,     label='Hoeffding/sub-Gaussian',      lw=2)
    ax.semilogy(t, gaussian_tail, label='Gaussian tail (actual)',      lw=2, ls='--')
    ax.set_xlim(0.5, 5)
    ax.set_ylim(1e-7, 2)
    ax.set_xlabel('Deviation $t$')
    ax.set_ylabel('$P(|X - \\mu| \\geq t)$ (log scale)')
    ax.set_title('Tail Bound Hierarchy: Polynomial vs Exponential Decay')
    ax.legend()
    plt.tight_layout()
    plt.show()

# Print at t=3
t0 = 3.0
print(f'At t={t0}:')
print(f'  Markov (mu=1):        {mu/t0:.4f}')
print(f'  Chebyshev (sigma2=1): {sigma2/t0**2:.4f}')
print(f'  Hoeffding (c2=1):     {np.exp(-2*t0**2/c2):.4f}')
print(f'  Gaussian (actual):    {2*stats.norm.sf(t0):.6f}')

2. Markov and Chebyshev Inequalities

2.1 Markov's Inequality

Theorem. For non-negative $X$ with $\mathbb{E}[X] < \infty$ and $t > 0$:

$$P(X \geq t) \leq \frac{\mathbb{E}[X]}{t}$$

Proof. $\mathbb{E}[X] = \mathbb{E}[X \mathbf{1}_{X \geq t}] + \mathbb{E}[X \mathbf{1}_{X < t}] \geq t \cdot P(X \geq t) + 0$. Divide by $t$. $\square$

2.2 Chebyshev's Inequality

Apply Markov to $(X - \mu)^2 \geq 0$:

$$P(|X - \mu| \geq t) = P\big((X-\mu)^2 \geq t^2\big) \leq \frac{\mathbb{E}[(X-\mu)^2]}{t^2} = \frac{\sigma^2}{t^2}$$

One-sided Cantelli. $P(X - \mu \geq t) \leq \frac{\sigma^2}{\sigma^2 + t^2}$ (sharper for large $t$).
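The gap between Cantelli and the two-sided Chebyshev bound can be checked by Monte Carlo; a minimal sketch on Exponential(1) (an illustrative choice, where $\mu = \sigma^2 = 1$):

```python
import numpy as np

# Compare the one-sided Cantelli bound with the (two-sided) Chebyshev bound
# on a skewed distribution: Exponential(1), mu = 1, sigma^2 = 1.
rng = np.random.default_rng(0)
X = rng.exponential(1.0, 1_000_000)
mu, var = 1.0, 1.0

for t in [1.0, 2.0, 3.0]:
    actual   = (X - mu >= t).mean()
    cantelli = var / (var + t**2)
    cheby    = var / t**2  # two-sided Chebyshev, used here as a one-sided bound
    print(f"t={t}: actual={actual:.4f}  Cantelli={cantelli:.4f}  Chebyshev={cheby:.4f}")
```

Cantelli is uniformly below $\sigma^2/t^2$, and at $t = 1$ Chebyshev is vacuous while Cantelli still gives $1/2$.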

Code cell 7

# === 2. Markov and Chebyshev Numerical Verification ===

np.random.seed(42)
N = 1_000_000

# Exponential distribution: E[X] = 1/lambda, Var = 1/lambda^2
lam = 1.0
samples_exp = np.random.exponential(1.0/lam, N)
mu_exp  = 1.0 / lam
var_exp = 1.0 / lam**2

ts = [2.0, 3.0, 5.0]
print('Exponential(1) — one-sided bounds')
print(f"  {'t':>4}  {'P(X>=t) actual':>18}  {'Markov':>10}  {'Ratio':>8}")
for t in ts:
    actual  = (samples_exp >= t).mean()
    markov  = mu_exp / t
    print(f'  {t:>4.1f}  {actual:>18.6f}  {markov:>10.6f}  {actual/markov:>8.3f}')

# Gaussian: test Chebyshev
samples_norm = np.random.normal(0, 1, N)
print('\nGaussian(0,1) — two-sided Chebyshev')
print(f"  {'t':>4}  {'P(|X|>=t) actual':>20}  {'Chebyshev':>12}  {'Gaussian tail':>15}")
for t in [1.0, 2.0, 3.0]:
    actual = (np.abs(samples_norm) >= t).mean()
    cheby  = 1.0 / t**2  # sigma^2 = 1
    gtail  = 2 * stats.norm.sf(t)
    print(f'  {t:>4.1f}  {actual:>20.6f}  {cheby:>12.6f}  {gtail:>15.6f}')

print('\nTakeaway: Chebyshev is loose for Gaussian (1/t^2 vs e^{-t^2/2}) — ')
print('but holds for ANY distribution with finite variance.')

3. Sub-Gaussian Random Variables

A zero-mean random variable $X$ is $\sigma^2$-sub-Gaussian if:

$$\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2 / 2} \quad \forall t \in \mathbb{R}$$

Equivalent characterisation (tail form):

$$P(|X| \geq t) \leq 2e^{-t^2/(2\sigma^2)}$$

Key examples:

  • $\mathcal{N}(0, \sigma^2)$: $\sigma^2$-sub-Gaussian (equality in the MGF bound)
  • Bounded $X \in [a, b]$ with $\mathbb{E}[X] = 0$: $((b-a)/2)^2$-sub-Gaussian (Hoeffding's lemma)
  • Rademacher $X = \pm 1$ w.p. $1/2$ each: $1$-sub-Gaussian

Closure: a sum of independent $\sigma_i^2$-sub-Gaussian variables is $(\sum_i \sigma_i^2)$-sub-Gaussian.
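The closure property can be verified directly on Rademacher sums, whose MGF is known in closed form; a small sketch (sample sizes are arbitrary choices):

```python
import numpy as np

# Closure under sums: S = X_1 + ... + X_n with X_i iid Rademacher (each
# 1-sub-Gaussian), so S should be n-sub-Gaussian: E[e^{sS}] <= e^{n s^2 / 2}.
rng = np.random.default_rng(42)
n, N = 20, 200_000
S = rng.choice([-1.0, 1.0], size=(N, n)).sum(axis=1)

for s in [0.1, 0.3, 0.5]:
    mgf_mc   = np.exp(s * S).mean()
    mgf_true = np.cosh(s) ** n          # exact: E[e^{s X_i}] = cosh(s)
    bound    = np.exp(n * s**2 / 2)
    print(f"s={s}: MC={mgf_mc:.4f}  exact={mgf_true:.4f}  bound={bound:.4f}")
```

The exact MGF $\cosh(s)^n$ sits strictly below $e^{n s^2/2}$, confirming the $\sum_i \sigma_i^2$ parameter for the sum.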

Code cell 9

# === 3. Sub-Gaussian MGF Verification ===

np.random.seed(42)
N = 500_000
ts = np.linspace(-2, 2, 100)

# Gaussian N(0,1): sigma^2 = 1
X_gauss = np.random.normal(0, 1, N)
# Uniform [-1,1]: sigma^2 = (b-a)^2/4 = 1
X_unif  = np.random.uniform(-1, 1, N)
# Rademacher: sigma^2 = 1
X_rad   = np.where(np.random.rand(N) < 0.5, -1.0, 1.0)

sub_gauss_bound = np.exp(ts**2 / 2)  # sigma^2 = 1

if HAS_MPL:
    fig, ax = plt.subplots(figsize=(10,5))
    for X, label, color in [
        (X_gauss, 'Gaussian N(0,1)', 'C0'),
        (X_unif,  'Uniform[-1,1]',  'C1'),
        (X_rad,   'Rademacher ±1',  'C2'),
    ]:
        mgf = np.array([np.exp(t * X).mean() for t in ts])
        ax.plot(ts, mgf, label=f'E[e^{{tX}}] {label}', color=color)
    ax.plot(ts, sub_gauss_bound, 'k--', lw=2, label='Sub-Gaussian bound $e^{t^2/2}$')
    ax.set_xlabel('$t$')
    ax.set_ylabel('MGF')
    ax.set_title('Sub-Gaussian MGF Bound Verification')
    ax.set_ylim(0, 8)
    ax.legend()
    plt.tight_layout()
    plt.show()

# Verify bound holds at t=1.5
t_test = 1.5
bound = np.exp(t_test**2 / 2)
for X, name in [(X_gauss,'Gaussian'), (X_unif,'Uniform'), (X_rad,'Rademacher')]:
    mgf_mc = np.exp(t_test * X).mean()
    ok = mgf_mc <= bound * 1.05  # 5% tolerance for MC
    print(f"{'PASS' if ok else 'FAIL'} {name}: MGF({t_test})={mgf_mc:.4f} <= {bound:.4f}")

4. Hoeffding's Inequality

Theorem (Hoeffding, 1963). Let $X_1, \ldots, X_n$ be independent with $X_i \in [a_i, b_i]$ and $\mathbb{E}[X_i] = \mu_i$. Let $S_n = \sum_{i=1}^n X_i$. Then:

$$P(S_n - \mathbb{E}[S_n] \geq t) \leq \exp\!\left(\frac{-2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$$

Hoeffding's Lemma (key step). For $X \in [a, b]$ with $\mathbb{E}[X] = 0$:

$$\mathbb{E}[e^{tX}] \leq e^{t^2 (b-a)^2 / 8}$$

Proof sketch. Convexity of $e^{tx}$ gives $e^{tx} \leq \frac{b-x}{b-a}e^{ta} + \frac{x-a}{b-a}e^{tb}$. Take expectations, substitute $u = t(b-a)$ and $h(u) = \log\left[(1-p)e^{-pu} + p e^{(1-p)u}\right]$ with $p = -a/(b-a)$, then Taylor-expand $h$ and show $h''(u) \leq 1/4$, which gives $h(u) \leq u^2/8$. $\square$

Sample complexity (iid case, $X_i \in [0,1]$, $\varepsilon = t/n$):

$$P\!\left(|\bar{X}_n - \mu| \geq \varepsilon\right) \leq 2e^{-2n\varepsilon^2}$$

Setting $\delta = 2e^{-2n\varepsilon^2}$ gives $n \geq \frac{\log(2/\delta)}{2\varepsilon^2}$.

Code cell 11

# === 4. Hoeffding: Sample Complexity and Verification ===

np.random.seed(42)

# --- Sample complexity formula ---
def hoeffding_n(eps, delta):
    """Min samples so P(|X_bar - mu| >= eps) <= delta, X_i in [0,1]."""
    return np.ceil(np.log(2 / delta) / (2 * eps**2)).astype(int)

print('Sample complexity n(eps, delta) via Hoeffding:')
print(f"  {'eps':>6}  {'delta':>8}  {'n':>8}")
for eps in [0.01, 0.05, 0.1]:
    for delta in [0.05, 0.01]:
        print(f'  {eps:>6.2f}  {delta:>8.3f}  {hoeffding_n(eps, delta):>8}')

# --- Empirical verification ---
mu_true = 0.3
n_values = [50, 100, 500, 1000]
n_trials = 50000
eps = 0.1

print('\nHoeffding bound vs empirical tail (X_i ~ Bernoulli(0.3), eps=0.1):')
print(f"  {'n':>6}  {'P(|X_bar-mu|>=eps) empirical':>32}  {'Hoeffding bound':>18}")
for n in n_values:
    samples = np.random.binomial(1, mu_true, (n_trials, n)).mean(axis=1)
    empirical = (np.abs(samples - mu_true) >= eps).mean()
    bound = 2 * np.exp(-2 * n * eps**2)
    print(f'  {n:>6}  {empirical:>32.6f}  {bound:>18.6f}')

5. Chernoff Bounds

The Chernoff method. For any $s > 0$, apply Markov to $e^{sX}$:

$$P(X \geq t) = P(e^{sX} \geq e^{st}) \leq e^{-st}\,\mathbb{E}[e^{sX}] = e^{-st} M_X(s)$$

Optimise over $s > 0$:

$$P(X \geq t) \leq \inf_{s > 0}\, e^{-st} M_X(s)$$

Binomial Chernoff. Let $S = \sum_{i=1}^n X_i$ with $X_i \stackrel{iid}{\sim} \text{Bern}(p)$ and $\mu = np$:

  • Upper tail: $P(S \geq (1+\delta)\mu) \leq \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu$
  • Simplified upper: $P(S \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3}$ for $\delta \in (0,1]$
  • Lower tail: $P(S \leq (1-\delta)\mu) \leq e^{-\mu\delta^2/2}$ for $\delta \in (0,1)$

The gap between Hoeffding ($e^{-2n\varepsilon^2}$) and Chernoff ($e^{-np\delta^2/3}$) comes from Chernoff exploiting the actual Binomial MGF rather than only the bounded range.
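The tight $\left(e^\delta/(1+\delta)^{1+\delta}\right)^\mu$ form can also be compared numerically against the simplified $e^{-\mu\delta^2/3}$ version; a sketch on an illustrative Binomial(100, 0.1):

```python
import numpy as np

# Tight multiplicative Chernoff vs the simplified exp(-mu*delta^2/3) form,
# for S ~ Binomial(n=100, p=0.1), mu = 10.
n, p = 100, 0.1
mu = n * p

def chernoff_tight(delta, mu):
    # (e^delta / (1+delta)^(1+delta))^mu, computed in log space for stability
    return np.exp(mu * (delta - (1 + delta) * np.log1p(delta)))

def chernoff_simple(delta, mu):
    return np.exp(-mu * delta**2 / 3)

rng = np.random.default_rng(42)
samples = rng.binomial(n, p, 500_000)

for delta in [0.2, 0.5, 1.0]:
    emp = (samples >= (1 + delta) * mu).mean()
    print(f"delta={delta}: empirical={emp:.5f}  "
          f"tight={chernoff_tight(delta, mu):.5f}  simple={chernoff_simple(delta, mu):.5f}")
```

For $\delta \in (0,1]$ the tight form is always at least as small as the simplified one, since $\delta - (1+\delta)\log(1+\delta) \leq -\delta^2/3$ on that range.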

Code cell 13

# === 5. Chernoff vs Hoeffding for Binomial ===

np.random.seed(42)

def chernoff_upper(delta, mu):
    """Simplified Chernoff upper: exp(-mu*delta^2/3) for delta in (0,1]."""
    return np.exp(-mu * delta**2 / 3)

def hoeffding_upper(delta, n):
    """Hoeffding upper: exp(-2*n*delta^2) for X_i in [0,1]."""
    return np.exp(-2 * n * delta**2)

# Compare bounds for B(n=100, p=0.1) => mu=10
n, p = 100, 0.1
mu = n * p

N_mc = 500_000
samples = np.random.binomial(n, p, N_mc)

print(f'Binomial(n={n}, p={p}), mu={mu}')
print(f"  {'delta':>6}  {'t=(1+d)mu':>12}  {'Empirical':>12}  "
      f"{'Chernoff':>12}  {'Hoeffding':>12}")
for delta in [0.2, 0.5, 1.0]:
    t = (1 + delta) * mu
    empirical  = (samples >= t).mean()
    chernoff_b = chernoff_upper(delta, mu)
    hoeffding_b = hoeffding_upper(delta * p, n)  # eps = delta*p
    print(f'  {delta:>6.1f}  {t:>12.1f}  {empirical:>12.6f}  '
          f'{chernoff_b:>12.6f}  {hoeffding_b:>12.6f}')

print('\nChernoff is tighter when p is small (exploits Binomial MGF).')

6. Bernstein's Inequality

Bernstein's inequality bridges the polynomial (Chebyshev) and exponential (sub-Gaussian) regimes.

Theorem (Bernstein). Let $X_1, \ldots, X_n$ be independent and zero-mean with $|X_i| \leq M$. Then:

$$P\!\left(\sum_i X_i \geq t\right) \leq \exp\!\left(\frac{-t^2/2}{\sum_i \mathbb{E}[X_i^2] + Mt/3}\right)$$

Interpretation. Let $v = \sum_i \text{Var}(X_i)$:

  • Gaussian regime ($t \ll v/M$): bound $\approx e^{-t^2/(2v)}$ — sub-Gaussian with $\sigma^2 = v$
  • Poisson regime ($t \gg v/M$): bound $\approx e^{-3t/(2M)}$ — exponential in $t$ rather than $t^2$

vs. Hoeffding. Bernstein uses variance information; Hoeffding uses only the range. Bernstein is strictly tighter when the variance is small relative to the squared range.

Code cell 15

# === 6. Bernstein vs Hoeffding Comparison ===

np.random.seed(42)

def bernstein_bound(t, v, M):
    """Bernstein: exp(-t^2/2 / (v + Mt/3))."""
    return np.exp(-t**2 / 2 / (v + M * t / 3))

def hoeffding_bound(t, n, a, b):
    """Hoeffding: exp(-2t^2 / (n*(b-a)^2))."""
    return np.exp(-2 * t**2 / (n * (b - a)**2))

# Sparse Bernoulli: X_i ~ Bern(p) - p, p=0.01
# mean=0, variance=p*(1-p)~p, range=[-(p), 1-p]
n_vars = 100
p = 0.01
v_total = n_vars * p * (1 - p)  # total variance
M = 1 - p  # max |X_i|

ts = np.linspace(0.01, 2.0, 300)

bern_b  = bernstein_bound(ts, v_total, M)
hoeff_b = hoeffding_bound(ts, n_vars, -p, 1-p)

# Empirical
N_mc = 500_000
sim = (np.random.binomial(1, p, (N_mc, n_vars)) - p).sum(axis=1)

if HAS_MPL:
    fig, ax = plt.subplots(figsize=(10,5))
    ts_emp = np.linspace(0.01, sim.max(), 200)
    emp = np.array([(sim >= t).mean() for t in ts_emp])
    ax.semilogy(ts, bern_b,  label='Bernstein', lw=2)
    ax.semilogy(ts, hoeff_b, label='Hoeffding', lw=2)
    ax.semilogy(ts_emp, np.maximum(emp, 1e-7), 'k.', ms=2, label='Empirical')
    ax.set_xlabel('$t$')
    ax.set_ylabel('$P(S_n \\geq t)$')
    ax.set_title(f'Bernstein vs Hoeffding (n={n_vars} sparse Bernoulli, p={p})')
    ax.legend()
    plt.tight_layout()
    plt.show()

print(f'Variance v={v_total:.4f}, M={M:.2f}')
print(f'Bernstein at t=0.5:  {bernstein_bound(0.5, v_total, M):.6f}')
print(f'Hoeffding at t=0.5:  {hoeffding_bound(0.5, n_vars, -p, 1-p):.6f}')
print('Bernstein is tighter when variance << range^2.')

7. McDiarmid's Inequality

McDiarmid extends Hoeffding from sums to general functions of independent variables.

Definition (Bounded Differences). $f(x_1,\ldots,x_n)$ has the $(c_1,\ldots,c_n)$-bounded differences property if for each $i$:

$$\sup_{x_1,\ldots,x_n,\,x_i'} \left|f(x_1,\ldots,x_i,\ldots,x_n) - f(x_1,\ldots,x_i',\ldots,x_n)\right| \leq c_i$$

Theorem (McDiarmid, 1989). If $f$ has bounded differences with constants $c_1,\ldots,c_n$ and $X_1,\ldots,X_n$ are independent, then:

$$P(f(\mathbf{X}) - \mathbb{E}[f(\mathbf{X})] \geq t) \leq \exp\!\left(\frac{-2t^2}{\sum_{i=1}^n c_i^2}\right)$$

Proof idea. Define the Doob martingale $D_k = \mathbb{E}[f \mid X_1,\ldots,X_k] - \mathbb{E}[f]$. The differences $D_k - D_{k-1}$ are bounded in $[-c_k, c_k]$ and have zero conditional mean; apply Azuma's inequality to the martingale. $\square$

Key application: empirical risk. Let $f(S) = \hat{R}(h) = \frac{1}{n}\sum_i \ell(h(x_i), y_i)$ with $\ell \in [0,1]$. Changing one sample changes $\hat{R}$ by at most $1/n$, so $c_i = 1/n$ and:

$$P(\hat{R}(h) - R(h) \geq t) \leq e^{-2nt^2}$$

Code cell 17

# === 7. McDiarmid on Empirical Risk ===

np.random.seed(42)

def mcdiarmid_bound(t, n, c_max=None):
    """McDiarmid bound with c_i = 1/n (all equal), c_max=1/n."""
    if c_max is None:
        c_max = 1.0 / n
    return np.exp(-2 * t**2 / (n * c_max**2))

# Simulate: classifier h, threshold at 0.5
# True risk R(h) = P(error) = 0.2
R_true = 0.2

n_values = [50, 100, 500, 1000]
n_trials = 100_000
t = 0.1

print(f'Empirical risk deviation (t={t}), true R={R_true}')
print(f"  {'n':>6}  {'P(R_hat-R>=t) sim':>22}  {'McDiarmid bound':>18}")
for n in n_values:
    errors = np.random.binomial(1, R_true, (n_trials, n)).mean(axis=1)
    empirical = (errors - R_true >= t).mean()
    bound = mcdiarmid_bound(t, n)
    print(f'  {n:>6}  {empirical:>22.6f}  {bound:>18.6f}')

# Also show: symmetric bound P(|R_hat - R| >= t) <= 2*exp(-2nt^2)
print(f'\nTwo-sided: P(|R_hat - R| >= {t}) <= 2*exp(-2n*{t}^2)')
for n in [100, 1000]:
    print(f'  n={n}: {2*mcdiarmid_bound(t, n):.6f}')
print('\nThis is the foundation of PAC-style generalisation bounds.')

8. Union Bound and Covering Numbers

8.1 Union Bound (Boole's Inequality)

For any events $A_1, \ldots, A_m$:

$$P\!\left(\bigcup_{i=1}^m A_i\right) \leq \sum_{i=1}^m P(A_i)$$

ML application. Applying McDiarmid to each $h \in \mathcal{H}$ and taking the union over all $|\mathcal{H}|$ hypotheses:

$$P\!\left(\exists h \in \mathcal{H}: \hat{R}(h) - R(h) \geq t\right) \leq |\mathcal{H}| \cdot e^{-2nt^2}$$

Setting the RHS $= \delta$ gives $t = \sqrt{\frac{\log(|\mathcal{H}|/\delta)}{2n}}$.
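A quick simulation of this application; for simplicity the sketch draws an independent sample per hypothesis (the union bound itself requires no independence):

```python
import numpy as np

# Monte Carlo sketch of the union bound over a finite "hypothesis class":
# m hypotheses, each with true risk R, risk estimated from n samples.
rng = np.random.default_rng(42)
m, n, R, t = 20, 200, 0.3, 0.1
n_trials = 3000

# gaps[trial, h] = R_hat(h) - R
gaps = rng.binomial(1, R, (n_trials, m, n)).mean(axis=2) - R
p_single = (gaps[:, 0] >= t).mean()          # one fixed hypothesis
p_any    = (gaps.max(axis=1) >= t).mean()    # P(exists h with gap >= t)

bound_single = np.exp(-2 * n * t**2)
print(f"one h:     P={p_single:.4f}  (Hoeffding bound {bound_single:.4f})")
print(f"any of {m}: P={p_any:.4f}  (union bound {min(1.0, m * bound_single):.4f})")
```

The simultaneous-deviation probability grows with $m$ but stays below $m \cdot e^{-2nt^2}$, which is what pays for the $\log|\mathcal{H}|$ term in the gap bound.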

8.2 Covering Numbers

For infinite H\mathcal{H}, replace H|\mathcal{H}| with the covering number N(H,ε,d)\mathcal{N}(\mathcal{H}, \varepsilon, d): the minimum number of ε\varepsilon-balls needed to cover H\mathcal{H} under metric dd.

Metric entropy: logN(H,ε,)\log \mathcal{N}(\mathcal{H}, \varepsilon, \|\cdot\|) — for HRd\mathcal{H} \subset \mathbb{R}^d with θB\|\theta\| \leq B:

logNdlog ⁣(3Bε)\log \mathcal{N} \leq d \log\!\left(\frac{3B}{\varepsilon}\right)

Code cell 19

# === 8. Union Bound for Finite Hypothesis Class ===

np.random.seed(42)

# Scenario: |H| decision stumps on binary features
# Each stump: threshold on one of d features => |H| = 2d

def pac_bound_finite(H_size, n, delta):
    """Generalisation gap t such that P(exists h: gap >= t) <= delta."""
    return np.sqrt(np.log(H_size / delta) / (2 * n))

print('Generalisation gap bound for finite hypothesis classes:')
print(f"  {'|H|':>10}  {'n':>6}  {'delta':>6}  {'Gap bound t':>12}")
for H_size in [10, 100, 1000, 10000]:
    for n in [100, 1000]:
        t = pac_bound_finite(H_size, n, 0.05)
        print(f"  {H_size:>10}  {n:>6}  {'0.05':>6}  {t:>12.4f}")

print('\nSample complexity: n >= log(|H|/delta) / (2*eps^2)')
for H_size in [100, 10000]:
    n_required = np.ceil(np.log(H_size / 0.05) / (2 * 0.05**2)).astype(int)
    print(f'  |H|={H_size}: n >= {n_required} for eps=0.05, delta=0.05')

# Covering number for linear classifiers in R^d, ||theta|| <= 1
print('\nCovering number log N(eps) for {theta: ||theta||<=1} in R^d:')
for d in [10, 100, 1000]:
    eps = 0.1
    log_N = d * np.log(3 / eps)
    print(f'  d={d:4d}: log N = {log_N:.1f} => N = {np.exp(log_N):.2e}')

9. PAC Learning and VC Dimension

9.1 PAC Framework

A hypothesis class $\mathcal{H}$ is PAC learnable if there exists an algorithm $A$ such that for all distributions $\mathcal{D}$ and all $\varepsilon, \delta > 0$, given $n \geq n(\varepsilon, \delta)$ samples:

$$P_{S \sim \mathcal{D}^n}\!\left(R(A(S)) \leq \min_{h \in \mathcal{H}} R(h) + \varepsilon\right) \geq 1 - \delta$$

Finite class. $n \geq \frac{\log(|\mathcal{H}|/\delta)}{2\varepsilon^2}$ suffices (union bound).

9.2 VC Dimension

The VC dimension $d_{\text{VC}}(\mathcal{H})$ is the size of the largest set that $\mathcal{H}$ can shatter (realise all $2^m$ labellings of some $m$-point set).

Examples:

  • Intervals on $\mathbb{R}$: $d_{\text{VC}} = 2$
  • Half-spaces in $\mathbb{R}^d$: $d_{\text{VC}} = d + 1$
  • Degree-$k$ polynomial thresholds on $\mathbb{R}$: $d_{\text{VC}} = k + 1$
  • Neural networks (binary output, $p$ parameters): $d_{\text{VC}} = O(p \log p)$

VC generalisation bound (Vapnik–Chervonenkis). With probability $\geq 1 - \delta$:

$$R(h) \leq \hat{R}(h) + \sqrt{\frac{2 d_{\text{VC}} \log(en/d_{\text{VC}}) + 2\log(4/\delta)}{n}}$$

Code cell 21

# === 9. VC Dimension: Shattering and PAC Bounds ===

np.random.seed(42)

# --- Shattering demo: half-spaces in R^2 ---
# 3 points in general position CAN be shattered by half-spaces
# 4 points CANNOT always be shattered (XOR configuration fails)

def linear_shatter_check(points):
    """Check if set of points can be shattered by linear classifier."""
    n = len(points)
    from itertools import product
    can_realise_all = True
    for labels in product([-1, 1], repeat=n):
        labels = np.array(labels)
        # Try linear separator with bias: w.x + b = 0
        # Use LP / direct check via small optimization
        found = False
        for trial in range(1000):
            w = np.random.randn(2)
            b = np.random.randn()
            preds = np.sign(points @ w + b)
            if np.all(preds == labels):
                found = True
                break
        if not found:
            can_realise_all = False
            break
    return can_realise_all

# 3 points in general position
pts3 = np.array([[0., 0.], [1., 0.], [0., 1.]])
# 4 points - XOR pattern
pts4 = np.array([[0., 0.], [1., 1.], [1., 0.], [0., 1.]])

print('VC dimension of half-spaces in R^2 is 3:')
print(f'  Can shatter 3 pts in general position: {linear_shatter_check(pts3)}')
print(f'  Can shatter 4 pts (XOR): {linear_shatter_check(pts4)}')

# --- VC generalisation bound ---
def vc_bound(d_vc, n, delta):
    """VC generalisation gap."""
    return np.sqrt((2 * d_vc * np.log(np.e * n / d_vc) + 2 * np.log(4 / delta)) / n)

print('\nVC generalisation bound (delta=0.05):')
print(f"  {'d_VC':>6}  {'n':>6}  {'Gap bound':>12}")
for d_vc in [3, 10, 100]:
    for n in [100, 1000, 10000]:
        gap = vc_bound(d_vc, n, 0.05)
        print(f'  {d_vc:>6}  {n:>6}  {gap:>12.4f}')

10. Rademacher Complexity

Rademacher complexity measures the ability of a function class $\mathcal{F}$ to fit random noise.

Empirical Rademacher complexity:

$$\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}}\!\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(x_i)\right]$$

where $\sigma_i \stackrel{iid}{\sim} \text{Uniform}(\{-1, +1\})$ (Rademacher random variables).

Population Rademacher complexity: $\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{F})]$.

Theorem (Rademacher Generalisation Bound). With probability $\geq 1 - \delta$, for all $f \in \mathcal{F}$:

$$R(f) \leq \hat{R}(f) + 2\mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}}$$

Key examples:

  • Finite class: $\mathfrak{R}_n \leq \sqrt{\log|\mathcal{H}|/(2n)}$
  • Linear with $\|w\|_2 \leq W$ and data $\|x\|_2 \leq X$: $\mathfrak{R}_n \leq WX/\sqrt{n}$
  • LoRA rank-$r$: $\mathfrak{R}_n = O\!\left(\sqrt{dr/n}\right)$ (vs $O\!\left(\sqrt{d^2/n}\right)$ for the full update)
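For a concrete finite-class check, Massart's lemma bounds the empirical Rademacher complexity of $M$ fixed $\pm 1$-valued vectors by $\sqrt{2\log M/n}$ (a slightly different constant than the finite-class bullet above); a Monte Carlo sketch with illustrative sizes:

```python
import numpy as np

# MC estimate of the empirical Rademacher complexity of a finite class:
# M fixed vectors f_j in {-1,+1}^n, hat_R = E_sigma[ max_j (1/n) sigma . f_j ].
# Massart's lemma bounds this by sqrt(2 log M / n) for +-1-valued classes.
rng = np.random.default_rng(42)
n, M, n_mc = 200, 100, 5000
F = rng.choice([-1.0, 1.0], size=(M, n))       # the finite function class

sup_vals = []
for _ in range(n_mc):
    sigma = rng.choice([-1.0, 1.0], size=n)
    sup_vals.append((F @ sigma).max() / n)     # sup over the class for this sigma
R_hat = np.mean(sup_vals)

print(f"MC estimate:   {R_hat:.4f}")
print(f"Massart bound: {np.sqrt(2 * np.log(M) / n):.4f}")
```

The estimate sits just below the bound, as expected for a class of nearly orthogonal random sign vectors.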

Code cell 23

# === 10. Rademacher Complexity: Monte Carlo Estimation ===

np.random.seed(42)

def rademacher_mc(X, W_norm, n_mc=2000):
    """
    Estimate empirical Rademacher complexity for linear functions ||w||_2 <= W_norm
    on data matrix X (n x d).
    hat_R = E_sigma[sup_{||w||<=W} (1/n) sigma^T X w]
          = (W_norm / n) * E_sigma[||X^T sigma||_2]
    """
    n, d = X.shape
    norms = []
    for _ in range(n_mc):
        sigma = np.random.choice([-1., 1.], size=n)
        norms.append(np.linalg.norm(X.T @ sigma))
    return W_norm * np.mean(norms) / n

def rademacher_theory(n, d, W_norm, X_norm):
    """Theoretical bound: W_norm * X_norm / sqrt(n)."""
    return W_norm * X_norm / np.sqrt(n)

n, d = 100, 50
W_norm = 1.0
X = np.random.randn(n, d) / np.sqrt(d)  # ||x_i||_2 ~ 1
X_norm = np.linalg.norm(X, axis=1).mean()

R_hat_mc = rademacher_mc(X, W_norm)
R_theory  = rademacher_theory(n, d, W_norm, X_norm)

print(f'Empirical Rademacher complexity (n={n}, d={d}, W={W_norm})')
print(f'  Monte Carlo estimate: {R_hat_mc:.4f}')
print(f'  Theory bound:         {R_theory:.4f}')
print(f'  Ratio MC/theory:      {R_hat_mc/R_theory:.3f}')

# LoRA vs full: compare as d varies
print('\nRademacher complexity: LoRA rank-r vs full layer')
print(f"  {'d':>6}  {'rank r':>8}  {'LoRA R':>10}  {'Full R':>10}  {'Ratio':>8}")
for d in [64, 256, 1024]:
    r = 4
    # Simplified: paramcount ~ d*r vs d^2
    R_lora = np.sqrt(d * r / n)
    R_full = np.sqrt(d**2 / n)
    print(f'  {d:>6}  {r:>8}  {R_lora:>10.4f}  {R_full:>10.4f}  {R_lora/R_full:>8.4f}')

print('\nTakeaway: LoRA Rademacher complexity scales as sqrt(d*r/n) vs sqrt(d^2/n).')

11. Applications: SGD and Gradient Concentration

11.1 Mini-Batch Gradient as a Random Variable

In SGD, the mini-batch gradient $\hat{\mathbf{g}}_t = \frac{1}{B}\sum_{i \in \mathcal{B}_t} \nabla \ell(\theta; z_i)$ is an estimator of the full gradient $\mathbf{g}_t = \nabla L(\theta)$. By independence of the batch samples, applying Hoeffding per coordinate gives:

$$P\!\left(|\hat{g}_{t,j} - g_{t,j}| \geq \varepsilon\right) \leq 2\exp\!\left(\frac{-2B\varepsilon^2}{(b_j - a_j)^2}\right)$$

where $[a_j, b_j]$ bounds the $j$-th gradient component. The batch size $B$ directly controls the gradient noise.

Bernstein for SGD. If $\|\nabla \ell\|_2 \leq G$ and the per-coordinate gradient variance is $\sigma^2$, Bernstein gives:

$$P\!\left(\|\hat{\mathbf{g}}_t - \mathbf{g}_t\|_2 \geq \varepsilon\right) \leq \exp\!\left(\frac{-B\varepsilon^2/2}{d\sigma^2 + G\varepsilon/3}\right)$$

This shows gradient noise switches from a Gaussian regime (small $\varepsilon$) to an exponential regime (large $\varepsilon$).

Code cell 25

# === 11.1 SGD Mini-Batch Gradient Concentration ===

np.random.seed(42)

# Simulate: gradient = mean of B IID bounded rvs in [-G, G]
G = 1.0  # gradient bound per sample
mu_grad = 0.3  # true gradient

def simulate_sgd_concentration(B, n_mc=100_000):
    """Simulate |g_hat - g_true| for mini-batch size B."""
    # Each per-sample gradient: Uniform[mu_grad - G, mu_grad + G] (range width 2G).
    # Note: clipping to [-G, G] here would bias the mean away from mu_grad.
    grads = np.random.uniform(mu_grad - G, mu_grad + G, (n_mc, B))
    batch_grads = grads.mean(axis=1)
    return np.abs(batch_grads - mu_grad)

ts = [0.1, 0.2, 0.5]
print('P(|g_hat - g_true| >= t) for various batch sizes B')
print(f"  {'B':>6}  {'t=0.1':>10}  {'t=0.2':>10}  {'t=0.5':>10}  {'Hoeffding(B,t=0.1)':>20}")
for B in [1, 8, 32, 128]:
    devs = simulate_sgd_concentration(B)
    probs = [(devs >= t).mean() for t in ts]
    hb = 2 * np.exp(-2 * B * 0.1**2 / (2*G)**2)
    print(f'  {B:>6}  {probs[0]:>10.5f}  {probs[1]:>10.5f}  {probs[2]:>10.5f}  {hb:>20.5f}')

print('\n4x batch size halves the noise std; the Hoeffding exponent scales linearly in B.')

11.2 LLM Benchmark Confidence Intervals

When evaluating an LLM on a benchmark, the accuracy $\hat{A} = k/n$ is an estimate of the true accuracy $A$. By Hoeffding (or the exact Clopper–Pearson interval for the Binomial):

$$P(|\hat{A} - A| \geq \varepsilon) \leq 2e^{-2n\varepsilon^2}$$

Practical implications:

  • MMLU with $n = 14{,}042$ questions: $\varepsilon = 0.01$ gives $\delta \leq 2e^{-2 \cdot 14042 \cdot 0.01^2} \approx 0.12$, i.e. only about 88% confidence at a one-point margin
  • Smaller benchmarks ($n = 500$): $\varepsilon_{95\%} = \sqrt{\log(2/0.05)/(2 \cdot 500)} \approx 0.061$
  • Sub-task evaluation ($n = 50$): $\varepsilon_{95\%} \approx 0.19$ — differences of many points are indistinguishable

Benchmark contamination shifts the distribution, invalidating the iid assumption and making concentration bounds overly optimistic — one of the key failure modes in LLM evaluation.

Code cell 27

# === 11.2 LLM Benchmark Confidence Intervals ===

from scipy import stats as sp_stats

def hoeffding_ci(n, delta=0.05):
    """Half-width of Hoeffding confidence interval."""
    return np.sqrt(np.log(2 / delta) / (2 * n))

def clopper_pearson_ci(k, n, alpha=0.05):
    """Exact Clopper-Pearson CI for Binomial."""
    lo = sp_stats.beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0
    hi = sp_stats.beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1
    return lo, hi

benchmarks = [
    ('MMLU',        14042),
    ('BIG-Bench Hard', 6511),
    ('HellaSwag',   10042),
    ('Custom (500)',   500),
    ('Sub-task (50)',   50),
]

print('Confidence interval half-widths for LLM benchmarks (delta=5%):')
print(f"  {'Benchmark':>24}  {'n':>6}  {'Hoeffding eps':>14}  {'CP CI width (acc=0.7)':>22}")
for name, n in benchmarks:
    eps = hoeffding_ci(n)
    k = int(0.7 * n)
    lo, hi = clopper_pearson_ci(k, n)
    print(f'  {name:>24}  {n:>6}  {eps:>14.4f}  [{lo:.4f}, {hi:.4f}] = {hi-lo:.4f}')

print('\nImplication: for n=50 sub-tasks, a ~19% margin is needed to be 95% confident.')
print('Most LLM benchmark comparisons within 1% are not statistically meaningful at small n.')

11.3 Johnson-Lindenstrauss Lemma

Theorem (JL Lemma). For any set of $m$ points in $\mathbb{R}^d$ and $\varepsilon \in (0, 1/2)$, there exists a linear map $f: \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\varepsilon^{-2} \log m)$ such that:

$$\forall u, v: \quad (1-\varepsilon)\|u-v\|^2 \leq \|f(u)-f(v)\|^2 \leq (1+\varepsilon)\|u-v\|^2$$

Proof sketch. For a random Gaussian projection $R \in \mathbb{R}^{k \times d}$ with $R_{ij} \sim \mathcal{N}(0, 1/k)$, each squared projected distance is a scaled chi-squared variable with $k$ degrees of freedom. Chernoff bounds give a per-pair failure probability $\leq 2e^{-k(\varepsilon^2/4 - \varepsilon^3/6)}$; a union bound over all $\binom{m}{2}$ pairs makes the total failure probability $< 1$ once $k = \Omega(\varepsilon^{-2}\log m)$.

Applications in LLMs:

  • Random projections for approximate nearest-neighbour search in embedding lookup
  • Dimensionality reduction before FAISS indexing
  • Sketching attention matrices: approximate $QK^\top$ via random projections

Code cell 29

# === 11.3 Johnson-Lindenstrauss Lemma Verification ===

np.random.seed(42)

def jl_project(X, k):
    """Random Gaussian projection to k dimensions."""
    d = X.shape[1]
    R = np.random.randn(d, k) / np.sqrt(k)
    return X @ R

def jl_min_k(m, eps, delta=0.05):
    """Minimum k for JL embedding of m points."""
    return int(np.ceil(4 * np.log(m / delta) / (eps**2 / 2 - eps**3 / 3)))

# Generate m random points in R^d
m, d = 200, 500
X = np.random.randn(m, d)

# Compute pairwise distances in original space (sample 500 pairs)
n_pairs = 500
idx = np.random.choice(m, (n_pairs, 2))
dists_orig = np.linalg.norm(X[idx[:,0]] - X[idx[:,1]], axis=1)

print(f'JL Lemma: m={m} points, d={d}, testing projections')
print(f"  {'k':>6}  {'max |ratio-1|':>18}  {'frac within eps=0.2':>22}  {'theory k':>12}")
for eps in [0.1, 0.2, 0.3]:
    k_theory = jl_min_k(m, eps)
    for k in [20, 50, 100, 200]:
        X_proj = jl_project(X, k)
        dists_proj = np.linalg.norm(X_proj[idx[:,0]] - X_proj[idx[:,1]], axis=1)
        ratios = (dists_proj / dists_orig)**2
        max_dev = np.max(np.abs(ratios - 1))
        frac_ok = np.mean(np.abs(ratios - 1) <= eps)
        if eps == 0.2:  # only print for eps=0.2
            print(f'  {k:>6}  {max_dev:>18.4f}  {frac_ok:>22.4f}  {k_theory:>12}')

print(f'\nFor eps=0.2: theory requires k >= {jl_min_k(m, 0.2)} for m={m} points.')

11.4 Sub-Gaussian Properties in Transformers

Attention Logit Concentration

In self-attention, the logit between query $\mathbf{q}$ and key $\mathbf{k}_j$ is $z_j = \mathbf{q}^\top \mathbf{k}_j / \sqrt{d_k}$. If $\mathbf{q}, \mathbf{k}_j \sim \mathcal{N}(0, I_{d_k})$, then each $z_j$ has unit variance regardless of $d_k$:

$$\text{Var}(z_j) = \mathbb{E}\!\left[\frac{(\mathbf{q}^\top \mathbf{k}_j)^2}{d_k}\right] = 1$$

This motivates the $\sqrt{d_k}$ scaling — without it, logits would have variance $d_k$, pushing the softmax into saturation.

Residual Stream Norm Concentration

With $L$ residual layers, each adding a vector $\delta_\ell$ with $\|\delta_\ell\|_2 \approx c$ (bounded by RMSNorm), the worst case for the total update is $\|\sum_\ell \delta_\ell\|_2 \leq Lc$. But if the directions are random, the norm concentrates at $O(c\sqrt{L})$ — explaining why deep transformers need $1/\sqrt{L}$-type scaling at initialisation (e.g., GPT-2's $1/\sqrt{2L}$ rule).
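A quick check of the $\sqrt{L}$ scaling with random unit-norm updates (the dimensions here are illustrative):

```python
import numpy as np

# Sum of L random unit-norm vectors (random directions): the norm of the sum
# concentrates near c * sqrt(L), far below the worst case c * L.
rng = np.random.default_rng(42)
d, c = 512, 1.0
for L in [4, 16, 64, 256]:
    deltas = rng.standard_normal((L, d))
    deltas = c * deltas / np.linalg.norm(deltas, axis=1, keepdims=True)
    total = np.linalg.norm(deltas.sum(axis=0))
    print(f"L={L:4d}: ||sum|| = {total:7.2f}   sqrt(L) = {np.sqrt(L):6.2f}   worst case = {L * c:.0f}")
```

The cross terms $\langle\delta_i, \delta_j\rangle$ have mean zero, so $\mathbb{E}\|\sum_\ell \delta_\ell\|^2 = Lc^2$ and the norm tracks $c\sqrt{L}$ closely.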

Gradient Noise in Large-Batch Training

The linear scaling rule (Goyal et al., 2017): batch size BkBB \to kB requires LR αkα\alpha \to k\alpha to maintain the same effective noise level. This follows from iid averaging: mini-batch gradient variance scales as 1/B1/B, so the noise-to-signal ratio σ/B\sigma/\sqrt{B} determines convergence.
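The 1/B variance scaling can be confirmed with a toy model; treating per-example gradients as iid Gaussians with made-up `g_true` and `sigma` is an assumption for illustration, not a property of real training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: per-example "gradients" iid with mean g_true and std sigma
g_true, sigma = 1.0, 2.0
N_mc = 20_000

print('Mini-batch gradient variance vs batch size (expect sigma^2 / B):')
for B in [8, 32, 128, 512]:
    g_batch = (g_true + sigma * rng.standard_normal((N_mc, B))).mean(axis=1)
    print(f'  B={B:>4}: Var = {g_batch.var():.5f},  sigma^2/B = {sigma**2 / B:.5f}')
```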

Code cell 31

# === 11.4 Attention Logit Concentration ===

np.random.seed(42)

N_mc = 100_000

print('Attention logit variance before and after scaling:')
print(f"  {'d_k':>6}  {'Var(q^T k) unscaled':>22}  {'Var(q^T k/sqrt(d_k))':>24}  {'Softmax entropy':>16}")

for d_k in [4, 16, 64, 256]:
    q = np.random.randn(N_mc, d_k)
    k = np.random.randn(N_mc, d_k)
    logits_raw   = (q * k).sum(axis=1)
    logits_scaled = logits_raw / np.sqrt(d_k)

    var_raw    = logits_raw.var()
    var_scaled = logits_scaled.var()

    # Softmax entropy over 10 random keys, using the scaled logits
    n_keys = 10
    keys_all = np.random.randn(N_mc, n_keys, d_k)
    q_batched = q[:, np.newaxis, :]  # (N, 1, d)
    logits_all = (q_batched * keys_all).sum(axis=2)  # (N, n_keys)
    softmax_scaled = np.exp(logits_all / np.sqrt(d_k))
    softmax_scaled /= softmax_scaled.sum(axis=1, keepdims=True)
    entropy_scaled = -(softmax_scaled * np.log(softmax_scaled + 1e-9)).sum(axis=1).mean()

    print(f'  {d_k:>6}  {var_raw:>22.3f}  {var_scaled:>24.3f}  {entropy_scaled:>16.3f}')

print('\nWith scaling, Var ≈ 1 regardless of d_k — prevents softmax saturation.')
print('Max entropy = log(10) =', round(np.log(10), 3), '(uniform softmax)')

12. Choosing the Right Bound: A Decision Framework

| Scenario | Best Bound | Formula |
|---|---|---|
| $X \geq 0$, only mean known | Markov | $\mu/t$ |
| Finite variance, arbitrary dist | Chebyshev | $\sigma^2/t^2$ |
| Bounded iid sum, worst-case | Hoeffding | $2e^{-2t^2/\sum(b_i-a_i)^2}$ |
| Bounded iid sum, small variance | Bernstein | $e^{-t^2/(2v+2Mt/3)}$ |
| Binomial, multiplicative | Chernoff | $e^{-\mu\delta^2/3}$ |
| Function of iid vars, bounded diffs | McDiarmid | $e^{-2t^2/\sum c_i^2}$ |
| Generalisation for finite $\mathcal{H}$ | Union + Hoeffding | $\sqrt{\log(\lvert\mathcal{H}\rvert/\delta)/(2n)}$ |
| Generalisation for general $\mathcal{H}$ | Rademacher | $2\mathfrak{R}_n + \sqrt{\log(1/\delta)/(2n)}$ |
| High-dimensional distances | JL Lemma | $k = O(\varepsilon^{-2}\log m)$ |

The key question: What do you know about XX?

  • Non-negative + known mean → Markov
  • + known variance → Chebyshev
  • + boundedness → Hoeffding
  • + small variance → Bernstein
  • + MGF structure → Chernoff

Code cell 33

# === 12. Bound Comparison on a Single Example ===

np.random.seed(42)

# n=100 iid X_i ~ Uniform[0,1], S_n = sum, E[S_n]=50, Var=100/12
n = 100
mu = 0.5
sigma2 = 1.0 / 12  # Var(Uniform[0,1])
a, b = 0.0, 1.0

# We want P(S_n/n - mu >= t) for t = 0.05
t = 0.05

# Bounds for sample mean deviation
# Markov applied to S_n: P(S_n >= n(mu+t)) <= E[S_n] / (n(mu+t)) = mu/(mu+t)
markov_b    = mu / (mu + t)
cheby_b     = sigma2 / (n * t**2)  # Var(X_bar) = sigma2/n
hoeffding_b = np.exp(-2 * n * t**2 / (b - a)**2)
bernstein_b = np.exp(-n * t**2 / 2 / (sigma2 + (b-a)*t/3))

# Empirical
N_mc = 1_000_000
samples = np.random.uniform(a, b, (N_mc, n)).mean(axis=1)
empirical = (samples - mu >= t).mean()

print(f'P(X_bar - {mu} >= {t}), n={n}, X_i ~ Uniform[0,1]:')
print(f'  Empirical:  {empirical:.6f}')
print(f'  Markov:     {markov_b:.6f}  (ratio {markov_b/empirical:.1f}x)')
print(f'  Chebyshev:  {cheby_b:.6f}  (ratio {cheby_b/empirical:.1f}x)')
print(f'  Hoeffding:  {hoeffding_b:.6f}  (ratio {hoeffding_b/empirical:.1f}x)')
print(f'  Bernstein:  {bernstein_b:.6f}  (ratio {bernstein_b/empirical:.1f}x)')
print(f'\nSmaller ratio = tighter bound. Bernstein wins by using variance info.')

Summary

| Concept | Key Formula | ML Relevance |
|---|---|---|
| Markov | $P(X \geq t) \leq \mu/t$ | Baseline; gradient norm bounds |
| Chebyshev | $P(\lvert X-\mu\rvert \geq t) \leq \sigma^2/t^2$ | Weak LLN; variance-only guarantees |
| Sub-Gaussian | $\mathbb{E}[e^{tX}] \leq e^{\sigma^2t^2/2}$ | Bounded/Gaussian noise |
| Hoeffding | $2e^{-2t^2/\sum(b_i-a_i)^2}$ | Sample complexity in PAC |
| Chernoff | $\inf_s e^{-st}M_X(s)$ | Binomial counts, hitting probs |
| Bernstein | Exp with variance term | Sparse gradients in SGD |
| McDiarmid | $e^{-2t^2/\sum c_i^2}$ | Empirical risk, any function |
| Union bound | $P(\bigcup A_i) \leq \sum P(A_i)$ | Any finite hypothesis class |
| VC bound | $\hat{R} + O(\sqrt{d_{VC}/n})$ | Shattering dimension |
| Rademacher | $\hat{R} + 2\mathfrak{R}_n + O(1/\sqrt{n})$ | Data-dependent, tightest |
| JL Lemma | $k = O(\varepsilon^{-2}\log m)$ | Random projections, embeddings |

→ See notes.md for full proofs and theory.
→ See exercises.ipynb for practice problems.


13. Preview: Law of Large Numbers via Concentration

The Weak Law of Large Numbers (WLLN) follows immediately from Chebyshev:

P ⁣(Xˉnμε)σ2nε2n0P\!\left(|\bar{X}_n - \mu| \geq \varepsilon\right) \leq \frac{\sigma^2}{n\varepsilon^2} \xrightarrow{n\to\infty} 0

The Strong LLN requires more work (Borel-Cantelli or martingale argument), but Chebyshev's inequality gives the WLLN with a quantitative rate.

Strong LLN via Borel-Cantelli (sketch). For the subsequence nk=k2n_k = k^2, Chebyshev gives P(Xˉnkμε)σ2/(nkε2)P(|\bar{X}_{n_k} - \mu| \geq \varepsilon) \leq \sigma^2/(n_k \varepsilon^2). Since k1/k2<\sum_k 1/k^2 < \infty, Borel-Cantelli gives convergence a.s. along nkn_k. Interpolate between subsequences to complete the proof.
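The summability driving the Borel-Cantelli step is easy to check numerically; σ² = 1 and ε = 0.1 below are illustrative choices:

```python
import numpy as np

# Chebyshev bounds along the subsequence n_k = k^2 (sigma^2 = 1, eps = 0.1 assumed)
sigma2, eps = 1.0, 0.1
ks = np.arange(1, 10_001)
bounds = sigma2 / (ks**2 * eps**2)   # P(|X_bar_{k^2} - mu| >= eps) <= sigma^2 / (k^2 eps^2)

# Partial sums converge (to (sigma^2/eps^2) * pi^2/6), so Borel-Cantelli applies
print('Sum of bounds up to k = 10^4:', round(bounds.sum(), 3))
print('Limit (sigma^2/eps^2) * pi^2/6 =', round(sigma2 / eps**2 * np.pi**2 / 6, 3))
```

Along the full sequence n instead of n_k = k², the bounds are σ²/(nε²), whose sum diverges — which is exactly why the subsequence trick is needed.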

CLT preview. The Central Limit Theorem tells us the shape of the deviation:

n(Xˉnμ)dN(0,σ2)\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)

This is sharper than Chebyshev's O(1/t2)O(1/t^2) but is asymptotic. Concentration inequalities give non-asymptotic guarantees valid for finite nn.

→ Full treatment: §06/06 Stochastic Processes for martingales; §07 Statistics for CLT applications.

Code cell 36

# === 13. LLN Convergence: Chebyshev Rate vs Empirical ===

np.random.seed(42)

# Exponential(1): mean=1, var=1
N_mc = 10_000
ns = np.logspace(1, 4, 30).astype(int)

eps = 0.1
empirical_probs = []
chebyshev_bounds = []
hoeffding_bounds_clip = []

for n in ns:
    samples = np.random.exponential(1.0, (N_mc, n)).mean(axis=1)
    prob = (np.abs(samples - 1.0) >= eps).mean()
    empirical_probs.append(prob)
    chebyshev_bounds.append(1.0 / (n * eps**2))  # sigma^2=1
    # Hoeffding: need bounded, so clip to [0, 10]; c_i ~ 10/n
    hoeffding_bounds_clip.append(2 * np.exp(-2 * n * eps**2 / (10**2)))

if HAS_MPL:
    fig, ax = plt.subplots(figsize=(10,5))
    ax.loglog(ns, empirical_probs, 'o-', label='Empirical P(|X_bar-1|>=0.1)', ms=4)
    ax.loglog(ns, chebyshev_bounds, '--', label='Chebyshev bound (1/n)', lw=2)
    ax.loglog(ns, hoeffding_bounds_clip, ':', label='Hoeffding (clipped at 10)', lw=2)
    ax.set_xlabel('Sample size $n$')
    ax.set_ylabel('$P(|\\bar{X}_n - \\mu| \\geq 0.1)$')
    ax.set_title('LLN Rate: Empirical vs Chebyshev Bound (Exponential(1))')
    ax.legend()
    plt.tight_layout()
    plt.show()

idx100 = int(np.argmin(np.abs(ns - 100)))
print(f'At n={ns[idx100]}: empirical={empirical_probs[idx100]:.4f}, '
      f'Chebyshev={chebyshev_bounds[idx100]:.4f}')

14. Azuma's Inequality for Martingales

Azuma extends Hoeffding from iid sums to martingales.

Theorem (Azuma-Hoeffding). Let (Mk)k=0n(M_k)_{k=0}^n be a martingale with bounded differences: MkMk1ck|M_k - M_{k-1}| \leq c_k a.s. Then:

P(MnM0t)exp ⁣(t22k=1nck2)P(M_n - M_0 \geq t) \leq \exp\!\left(\frac{-t^2}{2\sum_{k=1}^n c_k^2}\right)

Why this is powerful: The XkX_k need not be independent — only the martingale difference Dk=MkMk1D_k = M_k - M_{k-1} needs to be bounded and have zero conditional mean. This is exactly the structure of the Doob martingale used in McDiarmid's proof.

Doob martingale. For f(X1,,Xn)f(X_1,\ldots,X_n) with independent XiX_i, define:

Mk=E[fX1,,Xk]M_k = \mathbb{E}[f | X_1,\ldots,X_k]

Then M0=E[f]M_0 = \mathbb{E}[f], Mn=fM_n = f, and MkMk1ck|M_k - M_{k-1}| \leq c_k iff ff has ckc_k-bounded difference in XkX_k. Azuma applied to the Doob martingale yields McDiarmid.
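For f = sample mean of Uniform[0,1] variables, the Doob martingale has a closed form, so all three properties can be verified directly; n and the Monte Carlo size below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

n, N_mc = 50, 5000
mu = 0.5  # E[X_i] for X_i ~ Uniform[0, 1]

X = rng.uniform(0, 1, (N_mc, n))
f = X.mean(axis=1)

# Doob martingale M_k = E[f | X_1..X_k] = (X_1 + ... + X_k + (n - k) * mu) / n
k = np.arange(n + 1)
M = (np.concatenate([np.zeros((N_mc, 1)), X.cumsum(axis=1)], axis=1)
     + (n - k) * mu) / n

print('M_0 (should equal E[f] = 0.5):', M[:, 0].mean())
print('M_n recovers f exactly:', np.allclose(M[:, -1], f))
print('max_k |M_k - M_{k-1}| =', np.abs(np.diff(M, axis=1)).max(),
      '  (bound 0.5/n =', 0.5 / n, ')')
```

Here the differences are (X_k − μ)/n, bounded by 0.5/n, so Azuma recovers the Hoeffding-type rate for the sample mean.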

Applications in ML:

  • Online learning regret bounds
  • Stochastic approximation (SGD) convergence
  • Adaptive data collection (active learning, bandits)

Code cell 38

# === 14. Azuma: Random Walk and Martingale ===

np.random.seed(42)

# Simple martingale: partial sums of bounded differences D_k in [-c, c]
n_steps = 200
c = 1.0
N_mc = 50_000

# D_k ~ Uniform[-c, c], zero mean
D = np.random.uniform(-c, c, (N_mc, n_steps))
M = D.cumsum(axis=1)  # martingale starting at 0

# Azuma bound for M_n >= t
ts = np.linspace(1, 20, 100)
azuma = np.exp(-ts**2 / (2 * n_steps * c**2))
empirical = np.array([(M[:,-1] >= t).mean() for t in ts])

if HAS_MPL:
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Left: sample paths
    ax = axes[0]
    for i in range(20):
        ax.plot(M[i], alpha=0.4, lw=0.8)
    ax.axhline(0, color='k', lw=1)
    # Azuma CI bands
    steps = np.arange(n_steps)
    band = np.sqrt(2 * (steps+1) * c**2 * np.log(2 / 0.05))
    ax.plot(steps, band, 'r--', lw=2, label='Azuma 95% band')
    ax.plot(steps, -band, 'r--', lw=2)
    ax.set_xlabel('Step $k$')
    ax.set_ylabel('$M_k$')
    ax.set_title('Martingale Sample Paths + Azuma Bounds')
    ax.legend()

    # Right: tail bound
    ax = axes[1]
    ax.semilogy(ts, empirical, 'o-', ms=3, label='Empirical')
    ax.semilogy(ts, azuma, '--', lw=2, label='Azuma bound')
    ax.set_xlabel('$t$')
    ax.set_ylabel('$P(M_n \\geq t)$')
    ax.set_title(f'Azuma Tail Bound (n={n_steps}, c={c})')
    ax.legend()

    plt.tight_layout()
    plt.show()

t_test = 10.0
emp = (M[:,-1] >= t_test).mean()
az  = np.exp(-t_test**2 / (2 * n_steps * c**2))
print(f'At t={t_test}: empirical={emp:.5f}, Azuma bound={az:.5f}')

15. Beyond Sub-Gaussian: Heavy Tails

Not all random variables in ML are bounded or sub-Gaussian. Gradient norms in transformer training can be heavy-tailed (Student-tt distribution empirically). For such variables, exponential bounds fail; polynomial bounds must be used.

Sub-exponential variables. XX is sub-exponential with parameters (ν2,b)(\nu^2, b) if:

E[etX]et2ν2/2for t1/b\mathbb{E}[e^{tX}] \leq e^{t^2\nu^2/2} \quad \text{for } |t| \leq 1/b

Sub-exponential = sub-Gaussian for small tt, exponential tail for large tt. Example: χ2\chi^2 with kk DOF is sub-exponential with ν2=4k\nu^2 = 4k, b=4b = 4.

Catoni's estimator. For heavy-tailed means with finite variance σ2\sigma^2, the sample mean is not optimal. Catoni's robust estimator achieves:

μ^Catμσ2log(2/δ)n|\hat{\mu}_{\text{Cat}} - \mu| \leq \sigma\sqrt{\frac{2\log(2/\delta)}{n}}

with probability 1δ1-\delta — the sub-Gaussian rate, despite heavy tails!
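A minimal sketch of Catoni's estimator, assuming the variance σ² is known and solving for the root of the influence sum by bisection; the ψ below is one standard choice of truncation function:

```python
import numpy as np

def catoni_psi(x):
    """Catoni influence function: log(1 + x + x^2/2) for x >= 0, odd extension for x < 0."""
    return np.sign(x) * np.log1p(np.abs(x) + x**2 / 2)

def catoni_mean(x, sigma2, delta=0.05):
    """Root of m -> sum_i psi(alpha * (x_i - m)); the sum is decreasing in m, so bisect."""
    n = len(x)
    alpha = np.sqrt(2 * np.log(2 / delta) / (n * sigma2))
    lo, hi = float(x.min()), float(x.max())
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if catoni_psi(alpha * (x - mid)).sum() > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
n, df, trials = 500, 3, 200        # Student-t(3): heavy tails, mean 0, variance 3
err_mean, err_cat = [], []
for _ in range(trials):
    x = rng.standard_t(df, n)
    err_mean.append(abs(x.mean()))
    err_cat.append(abs(catoni_mean(x, sigma2=3.0)))
print(f'Worst |error| over {trials} trials: sample mean {max(err_mean):.4f}, '
      f'Catoni {max(err_cat):.4f}')
```

The truncation tames the influence of extreme samples, which is what buys the sub-Gaussian-rate guarantee.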

Forward reference: Matrix concentration inequalities (Tropp's matrix Bernstein, non-commutative Khintchine) handle random matrices — directly applicable to random attention weight matrices. See §08 Advanced Topics in Probability if available.

Code cell 40

# === 15. Sub-Exponential: Chi-Squared vs Gaussian Tails ===

np.random.seed(42)

k = 5  # degrees of freedom
N_mc = 1_000_000

# chi^2(k): sum of k standard Gaussians squared
X_chi2 = np.random.chisquare(k, N_mc)  # mean=k, var=2k
mu_chi2 = k
sigma2_chi2 = 2 * k

ts = np.linspace(1, 30, 200)

# Sub-exponential Bernstein bound: P(X - mu >= t) <= exp(-t^2/(2(2k + 4t)))
def se_bernstein(t, k):
    v = 2 * k  # variance = 2k
    b = 4      # parameter b for chi^2
    return np.exp(-t**2 / 2 / (v + b * t))

sub_gauss_b = np.exp(-ts**2 / (2 * sigma2_chi2))  # treating as sub-Gaussian
bernstein_b = se_bernstein(ts, k)
empirical_b = np.array([(X_chi2 - mu_chi2 >= t).mean() for t in ts])

if HAS_MPL:
    fig, ax = plt.subplots(figsize=(10,5))
    ax.semilogy(ts, empirical_b, 'k-', lw=2, label='Empirical $\\chi^2(5)$')
    ax.semilogy(ts, sub_gauss_b, '--', label='Sub-Gaussian bound (wrong!)', lw=2)
    ax.semilogy(ts, bernstein_b, '-.',  label='Bernstein sub-exp bound', lw=2)
    ax.set_xlabel('$t$')
    ax.set_ylabel('$P(X - \\mu \\geq t)$')
    ax.set_title('$\\chi^2(5)$ Tail: Sub-Gaussian vs Sub-Exponential Bound')
    ax.set_xlim(0, 30)
    ax.legend()
    plt.tight_layout()
    plt.show()

t_test = 15.0
print(f'At t={t_test} above mean:')
print(f'  Empirical:      {(X_chi2 - mu_chi2 >= t_test).mean():.6f}')
print(f'  Sub-Gaussian:   {np.exp(-t_test**2/(2*sigma2_chi2)):.6f}  (too tight!)')
print(f'  Bernstein:      {se_bernstein(t_test, k):.6f}  (valid bound)')

16. Practical Guidelines for ML Practitioners

When Bounds Are Tight vs Loose

Concentration bounds are worst-case over all distributions satisfying the assumptions. They can be very loose in practice:

  • VC bounds for neural networks: dVC=O(plogp)d_{\text{VC}} = O(p \log p) is enormous, giving vacuous >1> 1 bounds
  • Rademacher complexity for transformers: also often loose due to norm-based analysis
  • PAC-Bayes bounds (Catoni, McAllester) are often much tighter in practice

What Bounds ARE Useful For

Despite looseness, concentration inequalities provide:

  1. Scaling intuition: O(1/n)O(1/\sqrt{n}) generalisation gap — more data always helps
  2. Architecture comparison: LoRA has r/d\sqrt{r/d} Rademacher factor vs full fine-tuning
  3. Benchmark validity: CIs on accuracy estimates via Hoeffding or Clopper-Pearson
  4. Hyperparameter reasoning: batch size BB → gradient noise scales as 1/B1/\sqrt{B}
  5. Privacy analysis: DP-SGD noise calibrated via concentration (Gaussian mechanism)
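For point 3, a sketch comparing a Hoeffding confidence interval with the exact Clopper-Pearson interval; the accuracy and sample size are made up for illustration:

```python
import numpy as np
from scipy import stats

def hoeffding_ci(acc, n, delta=0.05):
    """Two-sided Hoeffding CI: half-width sqrt(log(2/delta) / (2n))."""
    h = np.sqrt(np.log(2 / delta) / (2 * n))
    return max(0.0, acc - h), min(1.0, acc + h)

def clopper_pearson_ci(k, n, delta=0.05):
    """Exact binomial CI from beta quantiles."""
    lo = stats.beta.ppf(delta / 2, k, n - k + 1) if k > 0 else 0.0
    hi = stats.beta.ppf(1 - delta / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

n_eval, k_correct = 1000, 843   # hypothetical benchmark result
acc = k_correct / n_eval
print(f'Accuracy {acc:.3f} on n={n_eval} items:')
print('  Hoeffding 95% CI:       ({:.4f}, {:.4f})'.format(*hoeffding_ci(acc, n_eval)))
print('  Clopper-Pearson 95% CI: ({:.4f}, {:.4f})'.format(*clopper_pearson_ci(k_correct, n_eval)))
```

Hoeffding is distribution-free and hence wider; Clopper-Pearson exploits the binomial likelihood and tightens the interval, especially for accuracies away from 0.5.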

The Sub-Gaussian Assumption in Practice

Empirical gradient distributions in transformers are often heavy-tailed (Student-tt like). This is why gradient clipping is used — it effectively imposes bounded differences, making the Hoeffding/Bernstein assumptions approximately valid after clipping.

Code cell 42

# === 16. Gradient Clipping: From Heavy-Tailed to Sub-Gaussian ===

np.random.seed(42)

N_mc = 500_000

# Heavy-tailed gradient proxy: Student-t with low df (df=3: finite variance, heavy tails)
df = 3
grads_heavy = np.random.standard_t(df, N_mc)

def clip_gradients(g, clip_norm):
    return np.clip(g, -clip_norm, clip_norm)

clip_values = [1.0, 2.0, 5.0, np.inf]

print('Effect of gradient clipping on tail behaviour:')
print(f"  {'Clip':>8}  {'P(|g|>3)':>12}  {'P(|g|>5)':>12}  {'P(|g|>10)':>12}  {'Std':>8}")
for clip in clip_values:
    if clip == np.inf:
        g = grads_heavy
        label = 'None'
    else:
        g = clip_gradients(grads_heavy, clip)
        label = f'{clip:.1f}'
    p3  = (np.abs(g) >= 3).mean()
    p5  = (np.abs(g) >= 5).mean()
    p10 = (np.abs(g) >= 10).mean()
    std = g.std()
    print(f'  {label:>8}  {p3:>12.5f}  {p5:>12.5f}  {p10:>12.5f}  {std:>8.4f}')

print('\nGaussian(0,1) for reference:')
g_gauss = np.random.randn(N_mc)
print(f"  {'Gaussian':>8}  {(np.abs(g_gauss)>=3).mean():>12.5f}  "
      f"{(np.abs(g_gauss)>=5).mean():>12.5f}  {(np.abs(g_gauss)>=10).mean():>12.5f}  "
      f"{g_gauss.std():>8.4f}")

print('\nClipping at 2.0 makes tail comparable to Gaussian — enabling Hoeffding/Bernstein analysis.')

17. PAC-Bayes Bounds

PAC-Bayes bounds are often the tightest available for neural networks.

Theorem (McAllester PAC-Bayes). For prior PP on H\mathcal{H} (chosen before seeing data) and any posterior QQ on H\mathcal{H}, with probability 1δ\geq 1-\delta over SDnS \sim \mathcal{D}^n:

EhQ[R(h)]EhQ[R^(h)]+KL(QP)+ln(n/δ)2(n1)\mathbb{E}_{h \sim Q}[R(h)] \leq \mathbb{E}_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{\text{KL}(Q \| P) + \ln(n/\delta)}{2(n-1)}}

Intuition: The generalisation gap is controlled by KL(QP)\text{KL}(Q \| P) — how much the posterior QQ deviates from the prior. If training barely moves the parameters (small KL), the model generalises well.

Connection to regularisation:

  • L2L^2 regularisation \Leftrightarrow Gaussian prior P=N(0,λ1I)P = \mathcal{N}(0, \lambda^{-1}I)
  • KL(QP)=λ2θ2+const(Q \| P) = \frac{\lambda}{2}\|\theta\|^2 + \text{const} for Gaussian posteriors
  • LoRA: small KL because rdr \ll d, so posterior stays close to prior

This is why deep learning 'works' despite overparameterisation: SGD with small LR + early stopping creates posteriors QQ close to the initialisation prior PP, keeping KL small and generalisation gap bounded.

Code cell 44

# === 17. PAC-Bayes: KL and Generalisation Gap ===

np.random.seed(42)

from scipy.special import xlogy

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    q = np.clip(q, 1e-9, 1 - 1e-9)
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return xlogy(q, q/p) + xlogy(1-q, (1-q)/(1-p))

def pacbayes_bound(R_hat, kl, n, delta=0.05):
    """McAllester PAC-Bayes bound."""
    return R_hat + np.sqrt((kl + np.log(n / delta)) / (2 * (n - 1)))

# Scenario: linear model, Gaussian prior N(0, I), Gaussian posterior N(w*, sigma^2 I)
def gaussian_kl(theta, sigma2_prior=1.0, sigma2_post=0.01, d=100):
    """KL(N(theta, sigma2_post I) || N(0, sigma2_prior I))."""
    kl = 0.5 * (d * (sigma2_post / sigma2_prior - 1 - np.log(sigma2_post / sigma2_prior))
                + np.dot(theta, theta) / sigma2_prior)
    return kl

n, d = 1000, 100
R_hat = 0.1  # empirical risk

print('PAC-Bayes generalisation bound vs KL(posterior||prior):')
print(f"  {'||theta||_2':>12}  {'KL':>10}  {'Gap bound':>12}  {'R bound':>10}")
for norm_theta in [0.5, 1.0, 2.0, 5.0, 10.0]:
    theta = np.ones(d) * norm_theta / np.sqrt(d)
    kl = gaussian_kl(theta)
    gap = np.sqrt((kl + np.log(n / 0.05)) / (2 * (n - 1)))
    R_bound = R_hat + gap
    print(f'  {norm_theta:>12.1f}  {kl:>10.2f}  {gap:>12.4f}  {R_bound:>10.4f}')

print(f'\nKey insight: KL grows as ||theta||^2 / sigma2_prior.')
print('L2 regularisation (weight decay) penalises exactly this term.')

18. Double Descent and the Failure of Classical Bounds

Classical VC bounds predict monotonically increasing generalisation error with model complexity. Double descent (Belkin et al., 2019) shows that modern overparameterised models can achieve low test error even when pnp \gg n:

test error
    |              classical regime      interpolation threshold    modern regime
    |              bias↓, var↑                  |                  var↓ (implicit reg)
    |    underfitting → ──────────────────────── peak ──────────────────── overfit? No!
    |________________ model complexity (p/n) ___________________________

Why classical bounds fail:

  1. VC/Rademacher bounds don't capture the effect of optimisation algorithm (SGD bias)
  2. Minimum-norm interpolation has implicit regularisation not captured by norm bounds
  3. Benign overfitting (Bartlett et al., 2020): overparameterised models can fit noise and still generalise if signal-to-noise is high

Concentration still applies: Even in double descent, Rademacher/PAC-Bayes bounds apply — they're just not tight. The issue is that classical bounds use worst-case norms, while SGD finds solutions with small effective norms via implicit regularisation.

Code cell 46

# === 18. Double Descent: Polynomial Regression Demo ===

np.random.seed(42)

from numpy.polynomial import polynomial as P

# True function: sin(2*pi*x) with noise
def true_fn(x):
    return np.sin(2 * np.pi * x)

n_train = 20
sigma_noise = 0.2
x_train = np.linspace(0, 1, n_train)
y_train = true_fn(x_train) + np.random.randn(n_train) * sigma_noise

x_test = np.linspace(0, 1, 200)
y_test = true_fn(x_test)

degrees = list(range(1, 35))
train_errors = []
test_errors = []

for deg in degrees:
    # Vandermonde design matrix
    V_train = np.vander(x_train, deg + 1, increasing=True)
    V_test  = np.vander(x_test,  deg + 1, increasing=True)

    # Minimum-norm least squares (pseudoinverse)
    w = np.linalg.pinv(V_train) @ y_train

    train_err = np.mean((V_train @ w - y_train)**2)
    test_err  = np.mean((V_test  @ w - y_test)**2)
    train_errors.append(train_err)
    test_errors.append(test_err)

if HAS_MPL:
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.semilogy(degrees, train_errors, 'o-', ms=4, label='Train MSE')
    ax.semilogy(degrees, test_errors,  's-', ms=4, label='Test MSE')
    ax.axvline(n_train, color='r', ls='--', lw=2, label=f'n_train={n_train} (interpolation threshold)')
    ax.set_xlabel('Polynomial degree (model complexity)')
    ax.set_ylabel('MSE')
    ax.set_title('Double Descent in Polynomial Regression')
    ax.legend()
    plt.tight_layout()
    plt.show()

interp_idx = n_train - 1
peak_test = max(test_errors[:n_train+2])
final_test = test_errors[-1]
print(f'Test error peak near interpolation: {peak_test:.4f}')
print(f'Test error at highest degree: {final_test:.4f}')
print('Double descent: error peaks then decreases in overparameterised regime.')

19. Differential Privacy and Concentration

Differentially Private SGD (DP-SGD, Abadi et al. 2016) adds calibrated Gaussian noise to gradients to prevent membership inference. The noise scale is determined by concentration:

  1. Clip individual gradients: g~i=gi/max(1,gi2/C)\tilde{g}_i = g_i / \max(1, \|g_i\|_2/C) — ensures g~i2C\|\tilde{g}_i\|_2 \leq C
  2. Aggregate + noise: g^=1B(ig~i+N(0,σ2C2I))\hat{g} = \frac{1}{B}(\sum_i \tilde{g}_i + \mathcal{N}(0, \sigma^2 C^2 I))
  3. Privacy accounting: (ε,δ)(\varepsilon, \delta)-DP is achieved when:
σC2ln(1.25/δ)εB/n\sigma \geq \frac{C\sqrt{2\ln(1.25/\delta)}}{\varepsilon \cdot B / n}

Gaussian mechanism analysis uses the sub-Gaussian concentration of the noise. The privacy-utility tradeoff: larger σ\sigma → better privacy, larger gradient noise → slower convergence.

Moments accountant / Rényi DP gives tighter accounting via concentration of privacy loss across steps.
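Steps 1-2 above can be sketched as a single aggregation function; the gradient values are synthetic, and this is an illustrative sketch, not a reference implementation of any DP library:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(per_example_grads, C, sigma, rng):
    """Clip each per-example gradient to norm C, sum, add N(0, sigma^2 C^2 I), average."""
    B, d = per_example_grads.shape
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / C)
    noised = clipped.sum(axis=0) + sigma * C * rng.standard_normal(d)
    return noised / B

B, d = 256, 1000
grads = 0.05 + 0.1 * rng.standard_normal((B, d))  # synthetic per-example gradients
g_hat = dp_sgd_step(grads, C=1.0, sigma=1.0, rng=rng)
g_clean = grads.mean(axis=0)
print('||clean mean grad|| =', round(float(np.linalg.norm(g_clean)), 4))
print('||DP grad - clean|| =', round(float(np.linalg.norm(g_hat - g_clean)), 4))
```

Clipping caps the per-example sensitivity at C, which is exactly what lets the Gaussian mechanism's noise scale be calibrated.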

Code cell 48

# === 19. DP-SGD Noise Calibration ===

np.random.seed(42)

def dp_sgd_sigma(eps, delta, C, B, n):
    """
    Noise multiplier sigma for (eps, delta)-DP.
    Returns sigma (noise is sigma * C * I).
    """
    q = B / n  # sampling ratio
    sigma = C * np.sqrt(2 * np.log(1.25 / delta)) / (eps * q)
    return sigma

n = 60000  # MNIST-size training set
B = 256
C = 1.0

print('DP-SGD noise sigma for various privacy budgets:')
print(f'  (n={n}, batch={B}, clip_norm={C})')
print(f"  {'eps':>6}  {'delta':>10}  {'sigma':>10}  {'noise/signal ratio':>20}")
for eps in [0.1, 0.5, 1.0, 5.0, 10.0]:
    for delta in [1e-5, 1e-3]:
        sigma = dp_sgd_sigma(eps, delta, C, B, n)
        # noise/signal: sigma * C / (batch_grad_norm ~ C)
        ratio = sigma
        print(f'  {eps:>6.1f}  {delta:>10.0e}  {sigma:>10.4f}  {ratio:>20.4f}')

print('\nInterpretation: sigma >> 1 means privacy noise dominates gradient signal.')
print('Strong privacy (eps=0.1) requires enormous noise -> very slow convergence.')

20. Preview: Matrix Concentration Inequalities

Scalar concentration inequalities extend to random matrices via matrix versions of Bernstein.

Theorem (Matrix Bernstein, Tropp 2012). Let S=kXkS = \sum_k X_k where XkX_k are independent random Hermitian matrices with E[Xk]=0\mathbb{E}[X_k] = 0 and XkM\|X_k\| \leq M a.s. Let v=kE[Xk2]v = \|\sum_k \mathbb{E}[X_k^2]\|. Then:

P(St)dexp ⁣(t2/2v+Mt/3)P(\|S\| \geq t) \leq d \cdot \exp\!\left(\frac{-t^2/2}{v + Mt/3}\right)

where dd is the matrix dimension and \|\cdot\| is the spectral norm.
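A Monte Carlo sanity check of the bound, using summands X_k = ε_k A_k with Rademacher signs ε_k and fixed Hermitian A_k of unit spectral norm — an illustrative construction, not the general setting:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_terms, N_mc = 10, 50, 2000

# Fixed Hermitian A_k with ||A_k|| = 1, so M = 1; summands are X_k = eps_k * A_k
A = rng.standard_normal((n_terms, d, d))
A = (A + A.transpose(0, 2, 1)) / 2
A /= np.linalg.norm(A, 2, axis=(1, 2))[:, None, None]
M = 1.0
v = np.linalg.norm(np.einsum('kij,kjl->il', A, A), 2)  # v = ||sum_k E[X_k^2]|| = ||sum_k A_k^2||

# Monte Carlo over Rademacher signs
eps_signs = rng.choice([-1.0, 1.0], (N_mc, n_terms))
S_norms = np.linalg.norm(np.einsum('mk,kij->mij', eps_signs, A), 2, axis=(1, 2))

print(f'd={d}, n_terms={n_terms}, v={v:.2f}')
for t in [5.0, 10.0, 15.0]:
    emp = (S_norms >= t).mean()
    bound = d * np.exp(-t**2 / 2 / (v + M * t / 3))
    print(f'  t={t:>5.1f}: empirical={emp:.4f}, matrix Bernstein bound={min(bound, 1.0):.4f}')
```

As in the scalar case, the bound is conservative but captures the transition from Gaussian-like (t² in v) to exponential (t in M) tail behaviour.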

Applications:

  • Random attention: concentration of QK/dkQK^\top / \sqrt{d_k} away from its expectation
  • Covariance estimation: Σ^Σ\hat{\Sigma} - \Sigma spectral norm concentration
  • LoRA: concentration of low-rank update ΔW=AB\Delta W = AB spectral norm
  • Random features: approximation error of random feature maps for kernel methods

Full treatment in §08 Advanced Probability Topics (if available) or Tropp's An Introduction to Matrix Concentration Inequalities (2015, arXiv:1501.01571).

Code cell 50

# === 20. Matrix Concentration: Random Attention ===

np.random.seed(42)

# Simulate spectral norm concentration of the scaled attention logit matrix
# A = Q K^T / d_k, where Q, K have iid N(0, 1) entries before each 1/sqrt(d_k) scaling

def attention_spectral_norm(n_seq, d_k, N_mc=500):
    """Empirical distribution of ||QK^T/sqrt(d_k)||_2."""
    norms = []
    for _ in range(N_mc):
        Q = np.random.randn(n_seq, d_k) / np.sqrt(d_k)
        K = np.random.randn(n_seq, d_k) / np.sqrt(d_k)
        A = Q @ K.T  # the two 1/sqrt(d_k) factors give the overall 1/d_k scaling
        norms.append(np.linalg.norm(A, 2))
    return np.array(norms)

print('Spectral norm of Q K^T / d_k (entries of Q, K have variance 1/d_k)')
print(f"  {'n_seq':>6}  {'d_k':>5}  {'E[||A||]':>10}  {'Std[||A||]':>12}  "
      f"{'sqrt(n_seq/d_k)':>15}")
for n_seq, d_k in [(10,64), (20,64), (50,64), (100,64), (64,64), (64,256)]:
    norms = attention_spectral_norm(n_seq, d_k, N_mc=300)
    theory = np.sqrt(n_seq / d_k)  # heuristic Marchenko-Pastur-type scaling
    print(f'  {n_seq:>6}  {d_k:>5}  {norms.mean():>10.4f}  {norms.std():>12.4f}  '
          f'{theory:>15.4f}')

print('\nSpectral norm grows like sqrt(n_seq/d_k) — longer sequences or smaller d_k -> larger logits.')

Code cell 51

# === Final: Concentration Inequality Summary Plot ===

np.random.seed(42)

if HAS_MPL:
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Plot 1: Tail hierarchy
    ax = axes[0, 0]
    t = np.linspace(0.2, 5, 300)
    ax.semilogy(t, 1/t,                   label='Markov: 1/t')
    ax.semilogy(t, 1/t**2,                label='Chebyshev: 1/t^2')
    ax.semilogy(t, np.exp(-2*t**2),       label='Hoeffding/sub-Gauss')
    ax.semilogy(t, 2*stats.norm.sf(t),    label='Gaussian (actual)', ls='--')
    ax.set_title('Tail Bound Hierarchy')
    ax.set_xlabel('t'); ax.set_ylabel('Bound (log scale)')
    ax.legend(fontsize=9); ax.set_ylim(1e-7, 2)

    # Plot 2: PAC learning curve
    ax = axes[0, 1]
    ns = np.logspace(1, 5, 100)
    for H_log, style in [(2, '-'), (5, '--'), (10, ':')]:
        H = 10**H_log
        gap = np.sqrt(np.log(H / 0.05) / (2 * ns))
        ax.loglog(ns, gap, style, label=f'|H|=10^{H_log}')
    ax.set_title('PAC Generalisation Gap vs n')
    ax.set_xlabel('Training samples n')
    ax.set_ylabel('Gap bound')
    ax.legend(); ax.axhline(0.05, color='gray', ls=':', lw=1)

    # Plot 3: Hoeffding verification
    ax = axes[1, 0]
    ns_test = np.arange(10, 500, 10)
    N_mc = 50000
    empirical_50 = []
    bounds_50 = []
    for n in ns_test:
        s = np.random.binomial(1, 0.5, (N_mc, n)).mean(axis=1)
        empirical_50.append((np.abs(s - 0.5) >= 0.1).mean())
        bounds_50.append(2 * np.exp(-2 * n * 0.01))
    ax.semilogy(ns_test, empirical_50, label='Empirical')
    ax.semilogy(ns_test, bounds_50, '--', label='Hoeffding')
    ax.set_title('Hoeffding: |X_bar - 0.5| >= 0.1')
    ax.set_xlabel('n'); ax.set_ylabel('P (log scale)')
    ax.legend()

    # Plot 4: LoRA Rademacher scaling
    ax = axes[1, 1]
    ds = np.logspace(1, 3, 50)
    n_val = 1000
    for r, style in [(1, '-'), (4, '--'), (16, ':'), (64, '-.')]:
        R_lora = np.sqrt(ds * r / n_val)
        ax.loglog(ds, R_lora, style, label=f'LoRA r={r}')
    R_full = np.sqrt(ds**2 / n_val)
    ax.loglog(ds, R_full, 'k-', lw=2, label='Full (r=d)')
    ax.set_title(f'Rademacher Complexity vs d (n={n_val})')
    ax.set_xlabel('Dimension d')
    ax.set_ylabel('$\\mathfrak{R}$ bound')
    ax.legend(fontsize=9)

    plt.suptitle('Concentration Inequalities: Summary', fontsize=14, y=1.01)
    plt.tight_layout()
    plt.show()

print('Summary plots generated.')

Notebook Complete

This notebook has covered:

  1. Tail bound hierarchy — Markov → Chebyshev → sub-Gaussian → Hoeffding → Chernoff → Bernstein → McDiarmid
  2. PAC learning — finite class bounds via union bound, VC dimension, shattering
  3. Rademacher complexity — data-dependent bounds, LoRA analysis
  4. ML applications — SGD gradient concentration, LLM benchmark CIs, JL lemma
  5. Transformers — attention scaling, residual norms, gradient clipping
  6. Modern theory — PAC-Bayes, double descent, DP-SGD, matrix concentration

References

  1. Boucheron, Lugosi & Massart, Concentration Inequalities (2013) — OUP
  2. Tropp, An Introduction to Matrix Concentration Inequalities (2015) — arXiv:1501.01571
  3. Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014) — Chapters 2-7
  4. Wainwright, High-Dimensional Statistics (2019) — Cambridge
  5. Vershynin, High-Dimensional Probability (2018) — Cambridge

exercises.ipynb — 8 graded problems
notes.md — full theoretical treatment with proofs