Math for LLMs — Probability Theory / Concentration Inequalities

Converted from exercises.ipynb for web reading.

Concentration Inequalities — Exercises

10 exercises covering tail bounds, sample complexity, PAC learning, and ML applications.

Format:

  • Problem — Markdown cell with task description
  • Your Solution — Code cell with scaffolding
  • Solution — Code cell with reference solution and checks

Difficulty Levels

  • ★ Exercises 1–3 — Core mechanics
  • ★★ Exercises 4–6 — Deeper theory
  • ★★★ Exercises 7–10 — AI / ML applications

Topic Map

  • Markov and Chebyshev — Exercise 1
  • Hoeffding sample complexity — Exercise 2
  • Chernoff bounds for Binomial — Exercise 3
  • Bernstein for SGD gradient noise — Exercise 4
  • McDiarmid on empirical risk — Exercise 5
  • PAC bounds for finite class — Exercise 6
  • VC dimension of linear classifiers — Exercise 7
  • Rademacher complexity Monte Carlo — Exercise 8
  • Empirical Bernstein intervals — Exercise 9
  • Uniform confidence dashboards — Exercise 10

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
from scipy import stats

try:
    import matplotlib.pyplot as plt
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)

def header(title):
    print('\n' + '=' * len(title))
    print(title)
    print('=' * len(title))

def check_close(name, got, expected, tol=1e-6):
    ok = np.allclose(got, expected, atol=tol, rtol=tol)
    print(f"{'PASS' if ok else 'FAIL'} {name}")
    if not ok:
        print('  expected:', expected)
        print('  got     :', got)
    return ok

def check_true(name, cond):
    print(f"{'PASS' if cond else 'FAIL'} {name}")
    return cond

print('Setup complete.')

Exercise 1 ★ — Markov and Chebyshev Inequalities

Let $X \sim \text{Exp}(\lambda)$ with rate $\lambda = 2$ (mean $1/\lambda = 0.5$, variance $1/\lambda^2 = 0.25$).

(a) Compute the Markov bound $P(X \geq t) \leq \mathbb{E}[X]/t$ for $t = 1, 2, 3$. Compare with the exact tail $P(X \geq t) = e^{-\lambda t}$.

(b) For $Y = X - 0.5$ (centred), compute the Chebyshev bound $P(|Y| \geq t) \leq \sigma^2/t^2$ for $t = 0.5, 1.0, 1.5$. Compare with the exact tail.

(c) Verify both bounds hold empirically with $N = 10^6$ samples. For each bound, compute the tightness ratio (bound / true probability).

(d) For $Z \sim \mathcal{N}(0, 1)$, compare Chebyshev and the actual Gaussian tail $P(|Z| \geq t)$ at $t = 1, 2, 3$. How much looser is Chebyshev?

Code cell 5

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 6

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

lam = 2.0
mu = 1.0 / lam
sigma2 = 1.0 / lam**2

# (a) Markov
ts_markov = [1.0, 2.0, 3.0]
print('(a) Markov bounds for Exp(2):')
for t in ts_markov:
    mb = mu / t
    exact = np.exp(-lam * t)
    print(f'  t={t}: Markov={mb:.4f}, Exact={exact:.6f}, Ratio={mb/exact:.2f}x')

# (b) Chebyshev
ts_cheby = [0.5, 1.0, 1.5]
print('\n(b) Chebyshev bounds for |X - mu|:')
for t in ts_cheby:
    cb = sigma2 / t**2
    # P(|X - mu| >= t) = P(X >= mu+t) + P(X <= mu-t)
    # For Exp(2): P(X >= mu+t) = e^{-lam(mu+t)}, P(X <= mu-t) = 1-e^{-lam*max(mu-t,0)}
    exact_upper = np.exp(-lam * (mu + t))
    exact_lower = max(1 - np.exp(-lam * max(mu - t, 0)), 0)
    exact = exact_upper + exact_lower
    ratio = cb / max(exact, 1e-9)
    print(f'  t={t}: Chebyshev={cb:.4f}, Exact={exact:.6f}, Ratio={ratio:.2f}x')

# (c) Empirical
N = 1_000_000
samples = np.random.exponential(1.0/lam, N)

header('Exercise 1: Markov and Chebyshev')
for t in [1.0, 2.0]:
    emp = (samples >= t).mean()
    mb = mu / t
    exact = np.exp(-lam * t)
    check_true(f'Markov bound holds at t={t}', mb >= emp * 0.999)
    check_close(f'Exact tail at t={t}', exact, emp, tol=0.005)

for t in [0.5, 1.0]:
    emp = (np.abs(samples - mu) >= t).mean()
    cb = sigma2 / t**2
    check_true(f'Chebyshev bound holds at t={t}', cb >= emp * 0.999)

# (d) Gaussian
print('\n(d) Gaussian Chebyshev vs actual:')
for t in [1.0, 2.0, 3.0]:
    cb = 1.0 / t**2  # sigma^2=1
    actual = 2 * stats.norm.sf(t)
    print(f'  t={t}: Chebyshev={cb:.4f}, Gaussian={actual:.6f}, Ratio={cb/actual:.1f}x')

print('\nTakeaway: Markov and Chebyshev are loose (polynomial tails) but assumption-free.')

Exercise 2 ★ — Hoeffding Sample Complexity

A spam filter is evaluated by computing the error rate $\hat{\varepsilon} = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[y_i \neq \hat{y}_i]$ on a test set. The true error rate is $\varepsilon^*$.

(a) By Hoeffding's inequality, derive the minimum number of test examples $n$ needed so that:

$$P(|\hat{\varepsilon} - \varepsilon^*| \geq 0.02) \leq 0.05$$

(b) Implement hoeffding_n(eps, delta) returning the minimum $n$. Compute $n$ for $(\varepsilon, \delta) \in \{0.01, 0.02, 0.05\} \times \{0.01, 0.05, 0.1\}$.

(c) Verify empirically: generate $N = 10^5$ trials, each drawing $n$ Bernoulli samples with $\varepsilon^* = 0.15$, and check that $P(|\hat{\varepsilon} - \varepsilon^*| \geq \text{eps}) \leq \delta$.

(d) How much larger must $n$ be to achieve half the error tolerance ($\text{eps} \to \text{eps}/2$)? Show this requires $4\times$ more data.

Code cell 8

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 9

# Solution
import numpy as np
np.random.seed(42)

def hoeffding_n(eps, delta):
    return int(np.ceil(np.log(2 / delta) / (2 * eps**2)))

# (a)
n_required = hoeffding_n(0.02, 0.05)

header('Exercise 2: Hoeffding Sample Complexity')
check_true('n(eps=0.02, delta=0.05) == 4612', n_required == 4612)
print(f'  n = {n_required}')

# (b)
print('\n(b) Sample complexity table (rows=eps, cols=delta):')
header_row = '  eps\\delta  ' + '  '.join([f'{d:.2f}' for d in [0.01, 0.05, 0.1]])
print(header_row)
for eps in [0.01, 0.02, 0.05]:
    row = f'  {eps:.2f}       '
    for delta in [0.01, 0.05, 0.1]:
        row += f'{hoeffding_n(eps,delta):>8}'
    print(row)

# (c) Empirical
eps_test, delta_test = 0.05, 0.05
eps_true = 0.15
n_test = hoeffding_n(eps_test, delta_test)
N_mc = 100_000
samples = np.random.binomial(1, eps_true, (N_mc, n_test)).mean(axis=1)
empirical_p = (np.abs(samples - eps_true) >= eps_test).mean()
print(f'\n(c) n={n_test}: P(|eps_hat - eps*| >= {eps_test}) = {empirical_p:.4f} <= {delta_test}')
check_true('Empirical <= delta', empirical_p <= delta_test)

# (d) Scaling
n1 = hoeffding_n(0.05, 0.05)
n2 = hoeffding_n(0.025, 0.05)  # eps/2
print(f'\n(d) n(eps=0.05) = {n1}, n(eps/2=0.025) = {n2}, ratio = {n2/n1:.2f}')
check_close('4x scaling (eps -> eps/2 requires 4x data)', n2/n1, 4.0, tol=0.2)
print('\nTakeaway: Sample complexity scales as 1/eps^2 — halving error needs 4x data.')
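For reference, the one-line algebra behind hoeffding_n above, solving the two-sided Hoeffding bound for $n$:

```latex
% Two-sided Hoeffding for the mean of n iid [0,1] variables:
%   P(|eps_hat - eps*| >= eps) <= 2 exp(-2 n eps^2)
% Requiring the right-hand side to be at most delta and solving for n:
\[
2e^{-2n\varepsilon^2} \le \delta
\;\Longleftrightarrow\;
n \ge \frac{\log(2/\delta)}{2\varepsilon^2},
\qquad\text{hence } n = \left\lceil \frac{\log(2/\delta)}{2\varepsilon^2} \right\rceil .
\]
```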

Exercise 3 ★ — Chernoff Bounds for Binomial

A random test has $n = 100$ true/false questions. A student who guesses uniformly at random answers each question correctly with probability $p = 0.5$ (expected score: $\mu = np = 50$).

(a) Compute the simplified Chernoff upper bound:

$$P(S \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3}$$

for the student scoring $\geq 60$ (i.e., $\delta = 0.2$). Compare to the exact Binomial tail.

(b) Implement chernoff_upper(delta, mu) and chernoff_lower(delta, mu). Compute both bounds and exact values for $P(S \geq 60)$, $P(S \geq 70)$, $P(S \leq 40)$.

(c) Compare Chernoff vs Hoeffding at the same points. Which is tighter, and why does Chernoff win for Binomial?

(d) A detection system raises an alert if $\geq 80$ of 100 iid sensors trigger (each has false-positive rate $p = 0.6$). What is the false alarm probability? Use Chernoff to bound it.

Code cell 11

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 12

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

def chernoff_upper(delta, mu):
    return np.exp(-mu * delta**2 / 3)

def chernoff_lower(delta, mu):
    return np.exp(-mu * delta**2 / 2)

def hoeffding_bound(t, n):
    """P(S/n - p >= t/n) <= exp(-2*(t/n)^2*n) = exp(-2*t^2/n)."""
    return np.exp(-2 * t**2 / n)

n, p = 100, 0.5
mu = n * p  # = 50

header('Exercise 3: Chernoff Bounds for Binomial')

# (a)
delta_a = 0.2
bound_a = chernoff_upper(delta_a, mu)
exact_a = 1 - stats.binom.cdf(59, n, p)  # P(S >= 60)
check_true('Chernoff upper bound holds for P(S>=60)', bound_a >= exact_a)
print(f'  P(S>=60): Chernoff={bound_a:.5f}, Exact={exact_a:.6f}')

# (b)
print('\n(b) Chernoff vs exact (upper tail):')
for t_val, delta in [(60, 0.2), (70, 0.4)]:
    bound = chernoff_upper(delta, mu)
    exact = 1 - stats.binom.cdf(t_val-1, n, p)
    hoeff  = hoeffding_bound(t_val - mu, n)
    print(f'  P(S>={t_val}): Chernoff={bound:.5f}, Exact={exact:.6f}, Hoeffding={hoeff:.5f}')
    check_true(f'Chernoff holds at t={t_val}', bound >= exact)

# Lower tail
delta_low = 0.2
bound_low = chernoff_lower(delta_low, mu)
exact_low = stats.binom.cdf(40, n, p)
print(f'  P(S<=40): Chernoff={bound_low:.5f}, Exact={exact_low:.6f}')
check_true('Chernoff lower bound holds for P(S<=40)', bound_low >= exact_low)

# (c) Chernoff is tighter because it uses MGF structure of Binomial
t_test = 10  # S >= 60
print(f'\n(c) At t=60 (10 above mean): Chernoff={chernoff_upper(0.2, mu):.5f}, '
      f'Hoeffding={hoeffding_bound(t_test, n):.5f}')
print('Chernoff is tighter: uses E[e^{tS}] = (pe^t + (1-p))^n, vs range-only Hoeffding.')

# (d)
n_sensors, p_fp = 100, 0.6
mu_d = n_sensors * p_fp  # = 60
delta_d = (80 - mu_d) / mu_d  # = 1/3
chernoff_d = chernoff_upper(delta_d, mu_d)
exact_d = 1 - stats.binom.cdf(79, n_sensors, p_fp)
print(f'\n(d) P(alert) = P(S>=80): Chernoff={chernoff_d:.5f}, Exact={exact_d:.6f}')
check_true('Chernoff bounds detection alert rate', chernoff_d >= exact_d)
print('\nTakeaway: Chernoff exploits the full MGF of the Binomial; tighter than Hoeffding for p-specific analysis.')
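The simplified bound $e^{-\mu\delta^2/3}$ comes from relaxing the generic MGF recipe $P(S \geq t) \leq \inf_{s>0} e^{-st} M_S(s)$. A minimal supplementary sketch (an addition, not part of the reference solution) that minimises the MGF bound numerically for the Binomial(100, 0.5) upper tail and shows it sits between the exact tail and the simplified bound:

```python
import numpy as np
from scipy import stats

n, p, t = 100, 0.5, 60
s_grid = np.linspace(1e-4, 2.0, 20000)          # search grid over s > 0
# Generic Chernoff: e^{-st} * M_S(s), with M_S(s) = (p e^s + 1 - p)^n
mgf_bound = np.exp(-s_grid * t) * (p * np.exp(s_grid) + 1 - p) ** n
opt_chernoff = mgf_bound.min()                   # inf_s e^{-st} M_S(s)

simplified = np.exp(-(n * p) * 0.2**2 / 3)       # e^{-mu*delta^2/3}, delta=0.2
exact = stats.binom.sf(t - 1, n, p)              # P(S >= 60)

print(f'optimised Chernoff: {opt_chernoff:.6f}')
print(f'simplified bound  : {simplified:.6f}')
print(f'exact tail        : {exact:.6f}')
```

The optimised bound equals $e^{-n\,\mathrm{KL}(0.6\,\|\,0.5)} \approx 0.134$, a factor ~4 tighter than the simplified $e^{-\mu\delta^2/3} \approx 0.51$, yet still well above the exact tail.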

Exercise 4 ★★ — Bernstein Inequality for SGD Gradient Noise

In mini-batch SGD, the gradient estimate is $\hat{g} = \frac{1}{B}\sum_{i \in \mathcal{B}} g_i$, where each $g_i$ is a stochastic gradient. Assume the $g_i$ are iid with:

  • $\mathbb{E}[g_i] = g$ (true gradient)
  • $\text{Var}(g_i) = \sigma^2 = 1.0$ (gradient variance)
  • $|g_i - g| \leq M = 2.0$ a.s. (bounded deviations)

(a) Implement bernstein_bound(t, B, sigma2, M) for the tail probability $P(\hat{g} - g \geq t)$ using Bernstein's inequality:

$$P\!\left(\sum_i (g_i - g) \geq t\right) \leq \exp\!\left(\frac{-t^2/2}{v + Mt/3}\right)$$

where $v = B\sigma^2$ (total variance).

(b) Compare Bernstein vs Hoeffding at $t = 0.1, 0.5, 1.0$ for $B \in \{8, 32, 128\}$. Show Bernstein is tighter.

(c) Find the minimum batch size $B$ to ensure $P(|\hat{g} - g| \geq 0.1) \leq 0.05$. Compare the Bernstein and Hoeffding sample-complexity requirements.

(d) Simulate: generate $g_i \sim \mathcal{N}(g, \sigma^2)$ clipped to $[g-M, g+M]$. Verify the Bernstein bound empirically for $B = 32$, $t = 0.3$.

Code cell 14

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 15

# Solution
import numpy as np
np.random.seed(42)

g_true = 0.5
sigma2 = 1.0
M = 2.0

def bernstein_bound(t, B, sigma2, M):
    """Two-sided: 2*exp(...). t is the mean deviation."""
    # For sum S = sum(g_i - g), P(S >= t*B) via Bernstein:
    # numerator: (t*B)^2/2, denom: B*sigma2 + M*(t*B)/3
    T = t * B  # convert mean deviation to sum deviation
    return 2 * np.exp(-T**2 / 2 / (B * sigma2 + M * T / 3))

def hoeffding_bound_sgd(t, B, M):
    """Hoeffding: 2*exp(-2*B*t^2 / (2M)^2)."""
    return 2 * np.exp(-2 * B * t**2 / (2 * M)**2)

header('Exercise 4: Bernstein for SGD Gradient Noise')

# (a) Test
print('(a) Bounds at B=32:')
B_test = 32
for t in [0.1, 0.5, 1.0]:
    bern = bernstein_bound(t, B_test, sigma2, M)
    hoef = hoeffding_bound_sgd(t, B_test, M)
    print(f'  t={t}: Bernstein={bern:.5f}, Hoeffding={hoef:.5f}, '
          f'Tighter: {"Bernstein" if bern < hoef else "Hoeffding"}')

# (b)
print('\n(b) Bernstein vs Hoeffding across B (t=0.3):')
t_b = 0.3
for B in [8, 32, 128]:
    bern = bernstein_bound(t_b, B, sigma2, M)
    hoef = hoeffding_bound_sgd(t_b, B, M)
    print(f'  B={B:4d}: Bernstein={bern:.5f}, Hoeffding={hoef:.5f}')

# (c) Minimum B via search
print('\n(c) Min B for P(|g_hat-g|>=0.1) <= 0.05:')
eps, delta = 0.1, 0.05
for B in range(1, 1000):
    if bernstein_bound(eps, B, sigma2, M) <= delta:
        B_bern = B
        break
B_hoef = None
for B in range(1, 20000):
    if hoeffding_bound_sgd(eps, B, M) <= delta:
        B_hoef = B
        break
if B_hoef is None:
    raise RuntimeError("Hoeffding search range was too small")
print(f'  Bernstein needs B >= {B_bern}')
print(f'  Hoeffding needs B >= {B_hoef}')
check_true('Bernstein needs fewer samples', B_bern <= B_hoef)

# (d) Empirical
B_emp = 32
t_emp = 0.3
N_mc = 200_000
gi = np.random.normal(g_true, np.sqrt(sigma2), (N_mc, B_emp)).clip(g_true-M, g_true+M)
g_hat = gi.mean(axis=1)
empirical = (np.abs(g_hat - g_true) >= t_emp).mean()
bern_bound = bernstein_bound(t_emp, B_emp, sigma2, M)
print(f'\n(d) B={B_emp}, t={t_emp}: empirical={empirical:.5f}, Bernstein={bern_bound:.5f}')
check_true('Bernstein bound holds empirically', bern_bound >= empirical * 0.99)
print('\nTakeaway: Bernstein uses variance info; significantly tighter than Hoeffding for small variance.')

Exercise 5 ★★ — McDiarmid's Inequality on Empirical Risk

Let $S = (z_1, \ldots, z_n)$ be iid training examples and $\hat{R}(h, S) = \frac{1}{n}\sum_{i=1}^n \ell(h(x_i), y_i)$ with $\ell \in [0,1]$.

(a) Show that $\hat{R}$ satisfies the bounded differences property with $c_i = 1/n$ (changing one example changes the empirical risk by at most $1/n$).

(b) By McDiarmid's inequality:

$$P(\hat{R}(h,S) - R(h) \geq t) \leq e^{-2nt^2}$$

Implement mcdiarmid_bound(t, n) and compute bounds for $t = 0.05, 0.1$ and $n = 100, 1000$.

(c) Consider a data-dependent statistic: the maximum deviation over $k$ held-out folds,

$$D = \max_{j=1,\ldots,k} |\hat{R}_j - R_j|$$

Apply McDiarmid to $D$: changing one of the $n$ examples moves at most one fold's $\hat{R}_j$ by $k/n$, giving $P(D - \mathbb{E}[D] \geq t) \leq e^{-2nt^2/k^2}$ for the full split.

(d) Simulate: draw $n = 200$ iid Bernoulli($p = 0.3$) examples (true error 0.3). Compute $\hat{R} = \bar{X}$. Verify the McDiarmid bound holds across $10^5$ trials at $t = 0.05$.

(e) How does McDiarmid extend the Hoeffding result? Give an example where McDiarmid applies but Hoeffding does NOT (a non-sum function).

Code cell 17

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 18

# Solution
import numpy as np
np.random.seed(42)

def mcdiarmid_bound(t, n):
    return np.exp(-2 * n * t**2)

header('Exercise 5: McDiarmid on Empirical Risk')

# (a) Bounded differences argument
print('(a) Bounded differences:')
print('  Changing z_i to z_i\' changes R_hat by |l(h(x_i),y_i) - l(h(x_i\'),y_i\')| / n <= 1/n')
print('  So c_i = 1/n for all i, sum c_i^2 = n*(1/n)^2 = 1/n')

# (b)
print('\n(b) McDiarmid bounds:')
for t in [0.05, 0.1]:
    for n in [100, 1000]:
        bound = mcdiarmid_bound(t, n)
        print(f'  t={t}, n={n}: bound={bound:.6f}')

# (d) Simulation
n, p_true = 200, 0.3
t_test = 0.05
N_mc = 100_000
samples = np.random.binomial(1, p_true, (N_mc, n))
R_hat = samples.mean(axis=1)
empirical = (R_hat - p_true >= t_test).mean()
bound = mcdiarmid_bound(t_test, n)

print(f'\n(d) n={n}, t={t_test}:')
print(f'  Empirical P(R_hat - R >= t) = {empirical:.5f}')
print(f'  McDiarmid bound             = {bound:.5f}')
check_true('McDiarmid bound holds', bound >= empirical)

# (e) Non-sum: maximum of n values
print('\n(e) McDiarmid example: f(x1,...,xn) = max(x1,...,xn) on [0,1]^n')
print('  max changes by at most 1 when one x_i changes => c_i = 1 for each i')
print('  McDiarmid: P(max - E[max] >= t) <= exp(-2t^2/n)')
print('  Hoeffding does NOT apply (max is not a sum)')
# Verify: for U[0,1], E[max_n] = n/(n+1)
n_test = 50
N_mc2 = 100_000
maxs = np.random.uniform(0, 1, (N_mc2, n_test)).max(axis=1)
E_max = n_test / (n_test + 1)
t_e = 0.05
emp_e = (maxs - E_max >= t_e).mean()
mc_bound = np.exp(-2 * t_e**2 / n_test)
print(f'  max of {n_test} Uniforms: empirical P(max-E[max]>={t_e}) = {emp_e:.5f}, '
      f'McDiarmid = {mc_bound:.5f}')
check_true('McDiarmid holds for max function', mc_bound >= emp_e)
print('\nTakeaway: McDiarmid applies to any function with bounded differences, not just sums.')
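The reference solution above has no code for the fold-maximum statistic. A minimal Monte-Carlo sketch (an addition, not from the original notebook), assuming $k$ equal folds of an $n$-example split: changing one example moves at most one fold's $\hat{R}_j$ by $k/n$, so McDiarmid gives $P(D - \mathbb{E}[D] \geq t) \leq \exp(-2nt^2/k^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, p = 200, 5, 0.3            # n examples, k folds, Bernoulli(0.3) losses
t = 0.10
N_mc = 20_000

losses = rng.binomial(1, p, (N_mc, k, n // k))    # k folds of n/k losses each
D = np.abs(losses.mean(axis=2) - p).max(axis=1)   # max fold deviation per trial

E_D = D.mean()                                     # Monte-Carlo estimate of E[D]
emp = (D - E_D >= t).mean()
bound = np.exp(-2 * n * t**2 / k**2)

print(f'empirical P(D - E[D] >= {t}) = {emp:.5f}')
print(f'McDiarmid bound             = {bound:.5f}')
```

As with the other McDiarmid checks, the bound is loose but holds; note it depends on the fold count $k$ through the bounded-difference constant $k/n$.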

Exercise 6 ★★ — PAC Bounds for Finite Hypothesis Class

A binary classifier is selected from a finite class with $|\mathcal{H}| = 2^{20}$ (about $10^6$) using empirical risk minimisation (ERM).

(a) Using the union bound + Hoeffding, derive the generalisation bound:

$$P\!\left(\exists h \in \mathcal{H}: R(h) > \hat{R}(h) + \sqrt{\frac{\log(|\mathcal{H}|/\delta)}{2n}}\right) \leq \delta$$

Implement pac_bound_finite(H_size, n, delta) returning the generalisation gap.

(b) Compute the generalisation bound for the ERM hypothesis $h^* = \arg\min_h \hat{R}(h)$ when $n = 10000$, $|\mathcal{H}| = 2^{20}$, $\delta = 0.05$.

(c) Sample complexity: find the minimum $n$ for the bound to be $< 0.05$. How does $n$ scale with $\log|\mathcal{H}|$?

(d) Simulate the PAC setup: generate $|\mathcal{H}| = 100$ random classifiers on a dataset, each with true risk $R(h) = 0.2$. Find the ERM hypothesis and check that the gap bound holds.

(e) Why can't we just use $n \geq \log(|\mathcal{H}|)$ (ignoring the $1/\varepsilon^2$)? Give an example where this fails.

Code cell 20

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 21

# Solution
import numpy as np
np.random.seed(42)

def pac_bound_finite(H_size, n, delta=0.05):
    return np.sqrt(np.log(H_size / delta) / (2 * n))

header('Exercise 6: PAC Bounds for Finite Hypothesis Class')

# (a) Verify formula
print('(a) Union bound + Hoeffding derivation:')
print('  P(exists h: R(h) > R_hat(h) + t)')
print('  <= |H| * P(R(h) > R_hat(h) + t)   [union bound]')
print('  <= |H| * exp(-2nt^2)               [Hoeffding on R_hat(h)]')
print('  Set |H|*exp(-2nt^2) = delta => t = sqrt(log(|H|/delta) / 2n)')

# (b)
gap = pac_bound_finite(2**20, 10000, 0.05)
print(f'\n(b) Gap for |H|=2^20, n=10000, delta=0.05: {gap:.4f}')
check_true('Gap is reasonable (< 0.2)', gap < 0.2)

# (c)
eps_target = 0.05
n_min = int(np.ceil(np.log(2**20 / 0.05) / (2 * eps_target**2)))
print(f'\n(c) Min n for gap < {eps_target}: n >= {n_min}')
check_true('n_min gives gap < target', pac_bound_finite(2**20, n_min) < eps_target + 1e-6)

# (d) Simulation
H_size = 100
n_data = 500
R_true = 0.2
N_mc = 10_000

max_gaps = []
for _ in range(N_mc):
    # Each h has true risk 0.2; R_hat is empirical on n_data samples
    R_hats = np.random.binomial(n_data, R_true, H_size) / n_data
    h_erm = np.argmin(R_hats)
    gap_erm = R_true - R_hats[h_erm]  # R(h*) - R_hat(h*)
    max_gaps.append(gap_erm)

bound = pac_bound_finite(H_size, n_data, 0.05)
frac_exceeding = np.mean(np.array(max_gaps) > bound)
print(f'\n(d) Simulation (H={H_size}, n={n_data}):')
print(f'  PAC bound = {bound:.4f}')
print(f'  P(gap > bound) = {frac_exceeding:.4f}  (should be <= 0.05)')
check_true('PAC bound holds in simulation', frac_exceeding <= 0.06)

# (e)
print('\n(e) Why 1/eps^2 matters:')
print('  With n = log(|H|): gap = sqrt(log(|H|/d) / (2*log(|H|))) ~ sqrt(1/2) ~ 0.7')
print('  This is vacuous! Need n >> log(|H|) / eps^2 for useful bounds.')
vacuous_n = int(np.log(2**20))
print(f'  n=log(|H|)={vacuous_n}: gap={pac_bound_finite(2**20, vacuous_n):.3f} (useless!)')
print('\nTakeaway: Generalisation requires O(log(|H|)/eps^2) samples — the eps^2 term is essential.')
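Part (c) also asks how $n$ scales with $\log|\mathcal{H}|$; the reference solution computes a single n_min. A small supplementary sweep (same formula as above, not part of the original notebook) makes the linear-in-$\log|\mathcal{H}|$ scaling explicit:

```python
import numpy as np

eps, delta = 0.05, 0.05
for bits in [10, 20, 40, 80]:                      # |H| = 2^bits
    logH = bits * np.log(2)
    # n = log(|H|/delta) / (2 eps^2), as derived in part (a)
    n_min = int(np.ceil((logH + np.log(1 / delta)) / (2 * eps**2)))
    print(f'|H| = 2^{bits:<3d}  log|H| = {logH:6.2f}  n_min = {n_min}')
```

Squaring $|\mathcal{H}|$ (doubling the bits) roughly doubles n_min: $n$ grows linearly in $\log|\mathcal{H}|$ but quadratically in $1/\varepsilon$.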

Exercise 7 ★★★ — VC Dimension of Linear Classifiers

A linear classifier in $\mathbb{R}^d$ predicts $\hat{y} = \text{sign}(\mathbf{w}^\top \mathbf{x} + b)$.

(a) Show that the VC dimension of half-spaces in $\mathbb{R}^d$ is $d+1$ by:

  • Exhibiting $d+1$ points that CAN be shattered (use the standard basis + origin)
  • Arguing no $d+2$ points can be shattered (use Radon's theorem / LP duality argument)

(b) Implement can_shatter(points): check whether a given set of points can be shattered by linear classifiers in $\mathbb{R}^2$ (brute-force over all $2^m$ labellings).

(c) Use your function to verify: 3 points in general position CAN be shattered, 4 points (the XOR pattern) CANNOT.

(d) Implement the VC generalisation bound vc_bound(d_vc, n, delta) and compute it for $d_{\text{VC}} \in \{3, 10, 100\}$ and $n \in \{100, 1000, 10000\}$.

(e) Simulate learning with the optimal linear separator on linearly separable data ($d = 2$, $d_{\text{VC}} = 3$). Plot the empirical generalisation gap vs the VC bound for $n \in [50, 5000]$. Does the VC bound track the actual gap?

Code cell 23

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 24

# Solution
import numpy as np
from itertools import product
np.random.seed(42)

def can_shatter(points, n_trials=3000):
    """Randomised check: may falsely report 'not shatterable' if the random
    search misses a separating hyperplane; a 'shatterable' answer is exact."""
    m, d = points.shape
    for labels in product([-1., 1.], repeat=m):
        labels = np.array(labels)
        found = False
        for _ in range(n_trials):
            w = np.random.randn(d)
            b = np.random.randn()
            preds = np.sign(points @ w + b)
            if np.all(preds == labels):
                found = True
                break
        if not found:
            return False
    return True

def vc_bound(d_vc, n, delta=0.05):
    return np.sqrt((2 * d_vc * np.log(np.e * n / d_vc) + 2 * np.log(4 / delta)) / n)

header('Exercise 7: VC Dimension of Linear Classifiers')

# (a) Exhibit d+1=3 shatterable points in R^2
print('(a) Standard basis + origin in R^2:')
pts_basis = np.array([[0., 0.], [1., 0.], [0., 1.]])
check_true('Standard basis shattered by half-spaces', can_shatter(pts_basis))

# (b)-(c)
pts3 = np.array([[0., 0.], [1., 0.], [0., 1.]])
pts4_xor = np.array([[0., 0.], [1., 1.], [1., 0.], [0., 1.]])
shatter3 = can_shatter(pts3)
shatter4 = can_shatter(pts4_xor)

check_true('3 pts in general position CAN be shattered', shatter3)
check_true('4 pts in XOR pattern CANNOT be shattered', not shatter4)
print(f'  3 pts shatterable: {shatter3}')
print(f'  4 pts (XOR) shatterable: {shatter4}')

# (d)
print('\n(d) VC generalisation bounds (delta=0.05):')
print(f"  {'d_vc':>6}  {'n':>8}  {'bound':>10}")
for d_vc in [3, 10, 100]:
    for n in [100, 1000, 10000]:
        bound = vc_bound(d_vc, n)
        print(f'  {d_vc:>6}  {n:>8}  {bound:>10.4f}')

# (e) Simulate: learn perceptron on linearly separable data
print('\n(e) Empirical vs VC bound for d=2 linear classifier:')
n_sizes = [50, 100, 200, 500, 1000, 2000, 5000]
N_mc = 500
d_vc_true = 3

emp_gaps = []
for n in n_sizes:
    gaps = []
    for _ in range(N_mc):
        # Generate data from true linear model
        X = np.random.randn(n + 1000, 2)
        w_true = np.array([1., -1.])
        y = np.sign(X @ w_true + 0.5)

        X_tr, y_tr = X[:n], y[:n]
        X_te, y_te = X[n:], y[n:]

        # Fit via closed-form sign of linear discriminant
        w_fit = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]
        y_pred_tr = np.sign(X_tr @ w_fit)
        y_pred_te = np.sign(X_te @ w_fit)

        R_hat = (y_pred_tr != y_tr).mean()
        R_true = (y_pred_te != y_te).mean()
        gaps.append(R_true - R_hat)
    emp_gaps.append(np.mean(gaps))

print(f"  {'n':>6}  {'Empirical gap':>16}  {'VC bound':>12}")
for n, gap in zip(n_sizes, emp_gaps):
    print(f'  {n:>6}  {gap:>16.4f}  {vc_bound(d_vc_true, n):>12.4f}')

print('\nTakeaway: VC bound is loose in practice but gives correct 1/sqrt(n) scaling.')
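Part (a)'s "no $d+2$ points" direction rests on Radon's theorem: any $d+2$ points in $\mathbb{R}^d$ split into two groups whose convex hulls intersect, and no half-space can label those groups oppositely. A minimal supplementary sketch (an addition, not part of the reference solution) that computes a Radon partition for 4 points in $\mathbb{R}^2$:

```python
import numpy as np

pts = np.array([[0., 0.], [1., 1.], [1., 0.], [0., 1.]])  # XOR configuration

# Find a nonzero lam with sum_i lam_i * [x_i; 1] = 0: the 3x4 matrix of
# homogeneous coordinates has a nontrivial null space since 4 > 3.
A = np.vstack([pts.T, np.ones(4)])          # shape (3, 4)
_, _, Vt = np.linalg.svd(A)
lam = Vt[-1]                                 # null-space vector
group1 = np.where(lam > 0)[0]
group2 = np.where(lam <= 0)[0]
print('Radon partition:', group1.tolist(), 'vs', group2.tolist())
```

For the XOR points the partition is the two diagonals, {(0,0),(1,1)} vs {(1,0),(0,1)}: exactly the labelling that can_shatter fails to realise with a half-space.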

Exercise 8 ★★★ — Rademacher Complexity and LoRA Analysis

Rademacher complexity measures how well a function class fits random noise.

(a) Implement rademacher_mc(X, F_samples, n_mc) that estimates the empirical Rademacher complexity of a function class evaluated on data $X$:

$$\hat{\mathfrak{R}}_S(\mathcal{F}) \approx \frac{1}{T}\sum_{t=1}^T \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i^{(t)} f(x_i)$$

(b) For linear functions $f_w(x) = w^\top x / \|x\|$ with $\|w\|_2 \leq W$, verify the bound $\hat{\mathfrak{R}}_S \leq W / \sqrt{n}$.

(c) Compare Rademacher complexity for:

  • Full weight-matrix update: $\Delta W \in \mathbb{R}^{d \times d}$, $\|\Delta W\|_F \leq 1$
  • LoRA update: $\Delta W = AB$, $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times d}$, $r \ll d$

Show that the LoRA Rademacher complexity is $O(\sqrt{r/d})$ times smaller than for full fine-tuning.

(d) Implement the full Rademacher generalisation bound and compute it for $n = 1000$, $d = 64$, $r = 4$, $\delta = 0.05$. Compare to the bound for full fine-tuning ($r = d = 64$).

(e) Discuss why, despite LoRA having a smaller Rademacher bound, practical LLM fine-tuning often shows similar generalisation to full fine-tuning. What does this tell us about the tightness of Rademacher bounds?

Code cell 26

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 27

# Solution
import numpy as np
np.random.seed(42)

def rademacher_mc(X, W_norm, n_mc=2000):
    """
    Empirical Rademacher complexity for {f_w : ||w||<=W_norm}.
    sup_w (1/n) sigma^T X w = (W_norm/n) ||X^T sigma||_2
    """
    n = X.shape[0]
    norms = []
    for _ in range(n_mc):
        sigma = np.random.choice([-1., 1.], size=n)
        norms.append(np.linalg.norm(X.T @ sigma))
    return W_norm * np.mean(norms) / n

def rademacher_lora(n, d, r, W_norm=1.0):
    """
    Rademacher bound for LoRA: O(W_norm * sqrt(d*r/n)).
    Simplified: using Frobenius norm of Delta W = AB, ||Delta W||_F <= W_norm.
    """
    return W_norm * np.sqrt(d * r / n)

def rademacher_full(n, d, W_norm=1.0):
    return W_norm * np.sqrt(d**2 / n)

def rad_generalisation_bound(R_hat, R_n, n, delta):
    return R_hat + 2 * R_n + np.sqrt(np.log(1/delta) / (2*n))

header('Exercise 8: Rademacher Complexity and LoRA')

# (b) Verify linear bound
n, d = 100, 50
X = np.random.randn(n, d) / np.sqrt(d)
W_norm = 1.0
R_mc = rademacher_mc(X, W_norm)

print('(b) Linear Rademacher complexity:')
print(f'  MC estimate:  {R_mc:.4f}')
print(f'  Theory bound: {W_norm / np.sqrt(n):.4f}')
print(f'  MC <= theory: {R_mc <= W_norm / np.sqrt(n) * 1.5}')

# (c) LoRA vs full
print('\n(c) LoRA vs Full Rademacher (n=1000, d=256, W_norm=1):')
n_c, d_c = 1000, 256
print(f"  {'rank r':>8}  {'LoRA R':>10}  {'Full R':>10}  {'Ratio':>8}")
R_full_c = rademacher_full(n_c, d_c)
for r in [1, 4, 16, 64, 256]:
    R_lora = rademacher_lora(n_c, d_c, r)
    print(f'  {r:>8}  {R_lora:>10.5f}  {R_full_c:>10.5f}  {R_lora/R_full_c:>8.4f}')

# (d) Generalisation bound
n_d, d_d, r_d = 1000, 64, 4
R_hat = 0.05  # empirical risk

R_lora_d = rademacher_lora(n_d, d_d, r_d)
R_full_d = rademacher_full(n_d, d_d)
bound_lora = rad_generalisation_bound(R_hat, R_lora_d, n_d, 0.05)
bound_full = rad_generalisation_bound(R_hat, R_full_d, n_d, 0.05)

print(f'\n(d) Generalisation bounds (n={n_d}, d={d_d}, r={r_d}, R_hat={R_hat}):')
print(f'  LoRA (r={r_d}): R_n={R_lora_d:.4f}, bound={bound_lora:.4f}')
print(f'  Full (r={d_d}): R_n={R_full_d:.4f}, bound={bound_full:.4f}')
check_true('LoRA bound < Full bound', bound_lora < bound_full)

# (e) Discussion
print('\n(e) Discussion:')
print('  Rademacher bounds are norm-based and conservative.')
print('  In practice, SGD on full fine-tuning finds solutions with small effective norms.')
print('  The implicit regularisation of SGD makes full fine-tuning behave like low-norm solutions.')
print('  LoRA explicitly constrains the rank, but the bound being smaller does NOT mean')
print('  actual generalisation is better — both can achieve similar test errors.')
print('  PAC-Bayes bounds (using KL from initialisation) are empirically much tighter.')
print('\nTakeaway: Rademacher complexity gives correct scaling (sqrt(r/n)) but constants are loose.')
print('For practical architecture design: LoRA has provably smaller complexity class.')

Exercise 9 ★★★ — Empirical Bernstein Confidence Intervals

For bounded losses $X_i \in [0,1]$, empirical Bernstein bounds use the sample variance instead of only the range.

(a) Implement a simple empirical Bernstein radius.

(b) Compare it with Hoeffding for low-variance Bernoulli losses.

(c) Verify by simulation that both intervals cover the true mean at roughly the requested confidence.

(d) Explain why variance-sensitive bounds matter for reliable evaluation.

Code cell 29

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 30

# Solution
import numpy as np
np.random.seed(42)

def hoeffding_radius(n, delta):
    return np.sqrt(np.log(2/delta) / (2*n))

def empirical_bernstein_radius(samples, delta):
    n = len(samples)
    v = np.var(samples, ddof=1)
    return np.sqrt(2 * v * np.log(3/delta) / n) + 3 * np.log(3/delta) / n

p_true = 0.03
n = 1000
delta = 0.05
trials = 1000
cover_h = 0
cover_eb = 0
radii_h = []
radii_eb = []
for _ in range(trials):
    x = (np.random.rand(n) < p_true).astype(float)
    mean = x.mean()
    rh = hoeffding_radius(n, delta)
    reb = empirical_bernstein_radius(x, delta)
    cover_h += abs(mean - p_true) <= rh
    cover_eb += abs(mean - p_true) <= reb
    radii_h.append(rh)
    radii_eb.append(reb)

header('Exercise 9: Empirical Bernstein Confidence Intervals')
print(f'Average Hoeffding radius: {np.mean(radii_h):.4f}')
print(f'Average empirical Bernstein radius: {np.mean(radii_eb):.4f}')
print(f'Hoeffding coverage: {cover_h/trials:.3f}')
print(f'Empirical Bernstein coverage: {cover_eb/trials:.3f}')
check_true('Empirical Bernstein is tighter on low-variance losses', np.mean(radii_eb) < np.mean(radii_h))
check_true('Both methods cover at high rate in this simulation', cover_h/trials > 0.93 and cover_eb/trials > 0.90)
print('Takeaway: evaluation bounds can be much tighter when the observed variance is small.')
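For part (d), a supplementary sweep (an addition, not part of the reference solution) showing how the empirical Bernstein radius tracks the loss variance while the range-only Hoeffding radius stays fixed:

```python
import numpy as np

rng = np.random.default_rng(42)
n, delta = 1000, 0.05
hoeff = np.sqrt(np.log(2 / delta) / (2 * n))       # range-only radius, fixed

for p in [0.5, 0.1, 0.01]:                          # Bernoulli loss rates
    x = (rng.random(n) < p).astype(float)
    v = np.var(x, ddof=1)
    # Same empirical Bernstein radius as in the solution above
    eb = np.sqrt(2 * v * np.log(3 / delta) / n) + 3 * np.log(3 / delta) / n
    print(f'p={p:<5} sample var={v:.4f}  Hoeffding={hoeff:.4f}  EmpBernstein={eb:.4f}')
```

At $p = 0.5$ (maximal variance) Bernstein is actually wider; for rare-event losses such as $p = 0.01$ it is several times narrower, which is precisely the regime of safety-metric evaluation.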

Exercise 10 ★★★ — Uniform Confidence Over Many Metrics

A model report tracks $m = 25$ safety metrics. Each metric is estimated from $n$ iid bounded examples.

(a) Use a union bound to build a simultaneous Hoeffding interval for all metrics.

(b) Compare the radius to a single-metric interval.

(c) Simulate all metrics and verify the chance that any estimate misses its true mean.

(d) Interpret why dashboards need simultaneous, not only per-metric, uncertainty.

Code cell 32

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 33

# Solution
import numpy as np
np.random.seed(42)

m = 25
n = 800
delta = 0.05
true_ps = np.linspace(0.02, 0.25, m)

single_radius = np.sqrt(np.log(2/delta)/(2*n))
uniform_radius = np.sqrt(np.log(2*m/delta)/(2*n))
trials = 1000
miss_any = 0
for _ in range(trials):
    X = (np.random.rand(n, m) < true_ps).astype(float)
    estimates = X.mean(axis=0)
    miss_any += np.any(np.abs(estimates - true_ps) > uniform_radius)

header('Exercise 10: Uniform Confidence Over Many Metrics')
print(f'Single-metric radius: {single_radius:.4f}')
print(f'Union-bound uniform radius: {uniform_radius:.4f}')
print(f'Empirical P(any miss): {miss_any/trials:.3f}, target <= {delta}')
check_true('Uniform radius is wider than single radius', uniform_radius > single_radius)
check_true('Any-miss rate is controlled in simulation', miss_any/trials <= 0.08)
print('Takeaway: once a report has many metrics, per-metric confidence intervals understate dashboard-level risk.')
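For part (d), a supplementary simulation (an addition, not part of the reference solution) contrasting per-metric intervals with the dashboard-level guarantee. To make the effect visible it assumes worst-case metrics at $p = 0.5$, where Hoeffding is closest to tight: each per-metric interval misses with probability below $\delta$, yet the chance that at least one of the $m$ intervals misses is far above $\delta$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, delta = 25, 800, 0.05
true_ps = np.full(m, 0.5)                          # worst-case variance
radius = np.sqrt(np.log(2 / delta) / (2 * n))      # per-metric Hoeffding radius

trials = 2000
miss = np.zeros(m)                                 # per-metric miss counts
any_miss = 0
for _ in range(trials):
    est = (rng.random((n, m)) < true_ps).mean(axis=0)
    bad = np.abs(est - true_ps) > radius
    miss += bad
    any_miss += bad.any()

print(f'worst per-metric miss rate: {miss.max() / trials:.3f}  (each <= {delta})')
print(f'any-metric miss rate      : {any_miss / trials:.3f}  (well above {delta})')
```

Per-metric failure rates stay under $\delta$, but the any-metric failure rate accumulates across the dashboard, which is exactly what the wider union-bound radius controls.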

What to Review After Finishing

  • Markov: $P(X \geq t) \leq \mathbb{E}[X]/t$ — non-negative + mean only
  • Chebyshev: $P(|X-\mu| \geq t) \leq \sigma^2/t^2$ — polynomial tail, universal
  • Hoeffding: $2e^{-2n\varepsilon^2}$ for bounded iid — exponential tail, $n = O(\log(1/\delta)/\varepsilon^2)$
  • Chernoff: $\inf_s e^{-st} M_X(s)$ — MGF method, tightest for Binomial
  • Bernstein: adds variance term — transitions polynomial → exponential at $t = v/M$
  • McDiarmid: bounded differences → exponential tail for ANY function of iid vars
  • Union bound: $P(\bigcup A_i) \leq \sum P(A_i)$ — foundation of PAC generalisation
  • VC dimension: largest shattered set; $d_{\text{VC}} = d+1$ for half-spaces in $\mathbb{R}^d$
  • Rademacher: $\hat{\mathfrak{R}} = \mathbb{E}_\sigma[\sup_f \frac{1}{n}\sum \sigma_i f(x_i)]$ — data-dependent, tighter than VC
  • LoRA complexity: $O(\sqrt{dr/n})$ vs $O(\sqrt{d^2/n})$ — theoretical basis for PEFT
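The checklist above can be made concrete with one tail event and four bounds. A minimal sketch (an addition, not from the original notebook) for the mean of $n = 100$ iid Bernoulli(0.5) variables and the event $\{\bar{X} - 0.5 \geq 0.1\}$, i.e. $P(S \geq 60)$:

```python
import numpy as np
from scipy import stats

n, p, t = 100, 0.5, 0.1
exact = stats.binom.sf(int(n * (p + t)) - 1, n, p)       # P(S >= 60)

markov = (n * p) / (n * (p + t))                          # mean / threshold, on S
cheby = (p * (1 - p) / n) / t**2                          # Var(mean) / t^2
hoeff = np.exp(-2 * n * t**2)                             # variables bounded in [0,1]
s = np.linspace(1e-4, 2, 10000)                           # optimised Chernoff over s
chern = (np.exp(-s * n * (p + t)) * (p * np.exp(s) + 1 - p) ** n).min()

for name, b in [('Markov', markov), ('Chebyshev', cheby),
                ('Hoeffding', hoeff), ('Chernoff', chern)]:
    print(f'{name:>9}: {b:.5f}  (exact tail {exact:.5f})')
```

The bounds tighten in exactly the order of the checklist: Markov > Chebyshev > Hoeffding > Chernoff > exact, with each step buying a tighter tail by assuming more (mean, then variance, then boundedness, then the full MGF).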

References

  1. Boucheron, Lugosi & Massart, Concentration Inequalities (2013) — OUP
  2. Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014) — Chapters 2-7
  3. Wainwright, High-Dimensional Statistics (2019) — Cambridge
  4. Vershynin, High-Dimensional Probability (2018) — Cambridge
  5. Tropp, An Introduction to Matrix Concentration Inequalities (2015) — arXiv:1501.01571
  6. Bartlett & Mendelson, Rademacher and Gaussian complexities (2002) — JMLR
  7. Hu et al., LoRA: Low-Rank Adaptation (2022) — arXiv:2106.09685