Exercises Notebook
Converted from exercises.ipynb for web reading.
Concentration Inequalities — Exercises
10 exercises covering tail bounds, sample complexity, PAC learning, and ML applications.
| Format | Description |
|---|---|
| Problem | Markdown cell with task description |
| Your Solution | Code cell with scaffolding |
| Solution | Code cell with reference solution and checks |
Difficulty Levels
| Level | Exercises | Focus |
|---|---|---|
| ★ | 1–3 | Core mechanics |
| ★★ | 4–6 | Deeper theory |
| ★★★ | 7–10 | AI / ML applications |
Topic Map
| Topic | Exercise |
|---|---|
| Markov and Chebyshev | 1 |
| Hoeffding sample complexity | 2 |
| Chernoff bounds for Binomial | 3 |
| Bernstein for SGD gradient noise | 4 |
| McDiarmid on empirical risk | 5 |
| PAC bounds for finite class | 6 |
| VC dimension of linear classifiers | 7 |
| Rademacher complexity Monte Carlo | 8 |
| Empirical Bernstein intervals | 9 |
| Uniform confidence dashboards | 10 |
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 13,
"axes.titlesize": 15,
"axes.labelsize": 13,
"xtick.labelsize": 11,
"ytick.labelsize": 11,
"legend.fontsize": 11,
"legend.framealpha": 0.85,
"lines.linewidth": 2.0,
"axes.spines.top": False,
"axes.spines.right": False,
"savefig.bbox": "tight",
"savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
from scipy import stats
try:
import matplotlib.pyplot as plt
HAS_MPL = True
except ImportError:
HAS_MPL = False
np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)
def header(title):
print('\n' + '=' * len(title))
print(title)
print('=' * len(title))
def check_close(name, got, expected, tol=1e-6):
ok = np.allclose(got, expected, atol=tol, rtol=tol)
print(f"{'PASS' if ok else 'FAIL'} — {name}")
if not ok:
print(' expected:', expected)
print(' got :', got)
return ok
def check_true(name, cond):
print(f"{'PASS' if cond else 'FAIL'} — {name}")
return cond
print('Setup complete.')
Exercise 1 ★ — Markov and Chebyshev Inequalities
Let $X \sim \operatorname{Exp}(\lambda)$ with rate $\lambda = 2$ (mean $\mu = 1/\lambda = 0.5$, variance $\sigma^2 = 1/\lambda^2 = 0.25$).
(a) Compute the Markov bound $P(X \ge t) \le \mu/t$ for $t \in \{1, 2, 3\}$. Compare with the exact tail $P(X \ge t) = e^{-\lambda t}$.
(b) For $Y = X - \mu$ (centred), compute the Chebyshev bound $P(|X - \mu| \ge t) \le \sigma^2/t^2$ for $t \in \{0.5, 1.0, 1.5\}$. Compare with exact.
(c) Verify both bounds hold empirically with $10^6$ samples. For each bound, compute the tightness ratio (bound / true probability).
(d) For $Z \sim \mathcal{N}(0, 1)$, compare Chebyshev and the actual Gaussian tail at $t \in \{1, 2, 3\}$. How much looser is Chebyshev?
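As a sanity check for part (a), a worked instance of the two bounds using only the parameters above:
$$P(X \ge 1) \le \frac{\mu}{1} = 0.5, \qquad P(X \ge 1) = e^{-2} \approx 0.135,$$
so the Markov bound is about $3.7\times$ looser than the truth here; Chebyshev at $t = 0.5$ gives $\sigma^2/t^2 = 0.25/0.25 = 1$, a vacuous bound at that radius.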
Code cell 5
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 6
# Solution
import numpy as np
from scipy import stats
np.random.seed(42)
lam = 2.0
mu = 1.0 / lam
sigma2 = 1.0 / lam**2
# (a) Markov
ts_markov = [1.0, 2.0, 3.0]
print('(a) Markov bounds for Exp(2):')
for t in ts_markov:
mb = mu / t
exact = np.exp(-lam * t)
print(f' t={t}: Markov={mb:.4f}, Exact={exact:.6f}, Ratio={mb/exact:.2f}x')
# (b) Chebyshev
ts_cheby = [0.5, 1.0, 1.5]
print('\n(b) Chebyshev bounds for |X - mu|:')
for t in ts_cheby:
cb = sigma2 / t**2
# P(|X - mu| >= t) = P(X >= mu+t) + P(X <= mu-t)
# For Exp(2): P(X >= mu+t) = e^{-lam(mu+t)}, P(X <= mu-t) = 1-e^{-lam*max(mu-t,0)}
exact_upper = np.exp(-lam * (mu + t))
exact_lower = max(1 - np.exp(-lam * max(mu - t, 0)), 0)
exact = exact_upper + exact_lower
ratio = cb / max(exact, 1e-9)
print(f' t={t}: Chebyshev={cb:.4f}, Exact={exact:.6f}, Ratio={ratio:.2f}x')
# (c) Empirical
N = 1_000_000
samples = np.random.exponential(1.0/lam, N)
header('Exercise 1: Markov and Chebyshev')
for t in [1.0, 2.0]:
emp = (samples >= t).mean()
mb = mu / t
exact = np.exp(-lam * t)
check_true(f'Markov bound holds at t={t}', mb >= emp * 0.999)
check_close(f'Exact tail at t={t}', exact, emp, tol=0.005)
for t in [0.5, 1.0]:
emp = (np.abs(samples - mu) >= t).mean()
cb = sigma2 / t**2
check_true(f'Chebyshev bound holds at t={t}', cb >= emp * 0.999)
# (d) Gaussian
print('\n(d) Gaussian Chebyshev vs actual:')
for t in [1.0, 2.0, 3.0]:
cb = 1.0 / t**2 # sigma^2=1
actual = 2 * stats.norm.sf(t)
print(f' t={t}: Chebyshev={cb:.4f}, Gaussian={actual:.6f}, Ratio={cb/actual:.1f}x')
print('\nTakeaway: Markov and Chebyshev are loose (polynomial tails) but assumption-free.')
Exercise 2 ★ — Hoeffding Sample Complexity
A spam filter is evaluated by computing the error rate $\hat\varepsilon$ on a test set. The true error rate is $\varepsilon^*$.
(a) By Hoeffding's inequality, derive the minimum number of test examples $n$ needed so that $P(|\hat\varepsilon - \varepsilon^*| \ge \varepsilon) \le \delta$.
(b) Implement hoeffding_n(eps, delta) returning the minimum $n$. Compute it for $\varepsilon \in \{0.01, 0.02, 0.05\}$ and $\delta \in \{0.01, 0.05, 0.1\}$.
(c) Verify empirically: generate $10^5$ trials, each drawing $n$ Bernoulli samples with $\varepsilon^* = 0.15$, and check that $P(|\hat\varepsilon - \varepsilon^*| \ge \varepsilon) \le \delta$.
(d) How much larger must $n$ be to achieve half the error tolerance ($\varepsilon \to \varepsilon/2$)? Show this requires $4\times$ more data.
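For reference, the derivation that part (a) asks for is one line of algebra from the two-sided Hoeffding bound:
$$P\big(|\hat\varepsilon - \varepsilon^*| \ge \varepsilon\big) \le 2e^{-2n\varepsilon^2} \le \delta \iff n \ge \frac{\ln(2/\delta)}{2\varepsilon^2},$$
which is exactly what hoeffding_n in the reference solution computes (with a ceiling).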
Code cell 8
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 9
# Solution
import numpy as np
np.random.seed(42)
def hoeffding_n(eps, delta):
return int(np.ceil(np.log(2 / delta) / (2 * eps**2)))
# (a)
n_required = hoeffding_n(0.02, 0.05)
header('Exercise 2: Hoeffding Sample Complexity')
check_true('n(eps=0.02, delta=0.05) >= 1844', n_required >= 1844)
print(f' n = {n_required}')
# (b)
print('\n(b) Sample complexity table (rows=eps, cols=delta):')
header_row = ' eps\\delta' + ''.join([f'{d:>8.2f}' for d in [0.01, 0.05, 0.1]])
print(header_row)
for eps in [0.01, 0.02, 0.05]:
    row = f' {eps:>8.2f} '
    for delta in [0.01, 0.05, 0.1]:
        row += f'{hoeffding_n(eps,delta):>8}'
    print(row)
# (c) Empirical
eps_test, delta_test = 0.05, 0.05
eps_true = 0.15
n_test = hoeffding_n(eps_test, delta_test)
N_mc = 100_000
samples = np.random.binomial(1, eps_true, (N_mc, n_test)).mean(axis=1)
empirical_p = (np.abs(samples - eps_true) >= eps_test).mean()
print(f'\n(c) n={n_test}: P(|eps_hat - eps*| >= {eps_test}) = {empirical_p:.4f} <= {delta_test}')
check_true('Empirical <= delta', empirical_p <= delta_test)
# (d) Scaling
n1 = hoeffding_n(0.05, 0.05)
n2 = hoeffding_n(0.025, 0.05) # eps/2
print(f'\n(d) n(eps=0.05) = {n1}, n(eps/2=0.025) = {n2}, ratio = {n2/n1:.2f}')
check_close('4x scaling (eps -> eps/2 requires 4x data)', n2/n1, 4.0, tol=0.2)
print('\nTakeaway: Sample complexity scales as 1/eps^2 — halving error needs 4x data.')
Exercise 3 ★ — Chernoff Bounds for Binomial
A random test has $n = 100$ true/false questions. A student who guesses uniformly at random answers each question correctly with probability $p = 0.5$ (expected score: $\mu = np = 50$).
(a) Compute the simplified Chernoff upper bound $P(S \ge (1+\delta)\mu) \le e^{-\mu\delta^2/3}$ for the student scoring $\ge 60$ (i.e., $\delta = 0.2$). Compare to the exact Binomial tail.
(b) Implement chernoff_upper(delta, mu) and chernoff_lower(delta, mu). Compute both bounds and exact values for $P(S \ge 60)$, $P(S \ge 70)$, and $P(S \le 40)$.
(c) Compare Chernoff vs Hoeffding at the same points. Which is tighter here, and when does the Chernoff (MGF) method win for the Binomial?
(d) A detection system raises an alert if $\ge 80$ of 100 iid sensors trigger (each has false-positive rate $p = 0.6$). What is the false alarm probability? Use Chernoff to bound it.
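Part (c)'s "why" is the MGF (Chernoff) method; a sketch of its starting point for $S \sim \mathrm{Bin}(n, p)$:
$$P(S \ge a) \le e^{-\lambda a}\,\mathbb{E}\big[e^{\lambda S}\big] = e^{-\lambda a}\left(p e^{\lambda} + 1 - p\right)^n \quad \text{for any } \lambda > 0.$$
Optimising over $\lambda$ and simplifying yields the multiplicative bounds $e^{-\mu\delta^2/3}$ (upper) and $e^{-\mu\delta^2/2}$ (lower) used in the reference solution.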
Code cell 11
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 12
# Solution
import numpy as np
from scipy import stats
np.random.seed(42)
def chernoff_upper(delta, mu):
return np.exp(-mu * delta**2 / 3)
def chernoff_lower(delta, mu):
return np.exp(-mu * delta**2 / 2)
def hoeffding_bound(t, n):
"""P(S/n - p >= t/n) <= exp(-2*(t/n)^2*n) = exp(-2*t^2/n)."""
return np.exp(-2 * t**2 / n)
n, p = 100, 0.5
mu = n * p # = 50
header('Exercise 3: Chernoff Bounds for Binomial')
# (a)
delta_a = 0.2
bound_a = chernoff_upper(delta_a, mu)
exact_a = 1 - stats.binom.cdf(59, n, p) # P(S >= 60)
check_true('Chernoff upper bound holds for P(S>=60)', bound_a >= exact_a)
print(f' P(S>=60): Chernoff={bound_a:.5f}, Exact={exact_a:.6f}')
# (b)
print('\n(b) Chernoff vs exact (upper tail):')
for t_val, delta in [(60, 0.2), (70, 0.4)]:
bound = chernoff_upper(delta, mu)
exact = 1 - stats.binom.cdf(t_val-1, n, p)
hoeff = hoeffding_bound(t_val - mu, n)
print(f' P(S>={t_val}): Chernoff={bound:.5f}, Exact={exact:.6f}, Hoeffding={hoeff:.5f}')
check_true(f'Chernoff holds at t={t_val}', bound >= exact)
# Lower tail
delta_low = 0.2
bound_low = chernoff_lower(delta_low, mu)
exact_low = stats.binom.cdf(40, n, p)
print(f' P(S<=40): Chernoff={bound_low:.5f}, Exact={exact_low:.6f}')
check_true('Chernoff lower bound holds for P(S<=40)', bound_low >= exact_low)
# (c) At p = 1/2 the simplified Chernoff bound actually loses to Hoeffding;
# the optimised MGF bound wins when p (hence the variance np(1-p)) is small.
t_test = 10  # S >= 60
print(f'\n(c) At t=60 (10 above mean): Chernoff={chernoff_upper(0.2, mu):.5f}, '
      f'Hoeffding={hoeffding_bound(t_test, n):.5f}')
print('Here Hoeffding is tighter: at p=1/2 its range-based variance proxy n/4 is exact,')
print('while the simplified e^{-mu*delta^2/3} bound discards MGF information.')
print('The full Chernoff bound E[e^{tS}] = (pe^t + 1-p)^n wins when p is far from 1/2.')
# (d)
n_sensors, p_fp = 100, 0.6
mu_d = n_sensors * p_fp # = 60
delta_d = (80 - mu_d) / mu_d # = 1/3
chernoff_d = chernoff_upper(delta_d, mu_d)
exact_d = 1 - stats.binom.cdf(79, n_sensors, p_fp)
print(f'\n(d) P(alert) = P(S>=80): Chernoff={chernoff_d:.5f}, Exact={exact_d:.6f}')
check_true('Chernoff bounds detection alert rate', chernoff_d >= exact_d)
print('\nTakeaway: the Chernoff method adapts to p via the Binomial MGF; the simplified bound is convenient, but at p=1/2 range-based Hoeffding is already sharp.')
Exercise 4 ★★ — Bernstein Inequality for SGD Gradient Noise
In mini-batch SGD, the gradient estimate is where each is a stochastic gradient. Assume the are iid with:
- (true gradient)
- (gradient variance)
- a.s. (bounded deviations)
(a) Implement bernstein_bound(t, B, sigma2, M) for the tail probability using Bernstein's inequality:
where (total variance).
(b) Compare Bernstein vs Hoeffding at for . Show Bernstein is tighter.
(c) Find the minimum batch size to ensure . Compare Bernstein and Hoeffding sample complexity requirements.
(d) Simulate: generate clipped to . Verify the Bernstein bound empirically for , .
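It may help to see why Bernstein interpolates between two regimes; from the bound in (a):
$$2\exp\!\left(-\frac{(Bt)^2/2}{B\sigma^2 + M(Bt)/3}\right) \approx \begin{cases} 2e^{-Bt^2/(2\sigma^2)} & t \ll 3\sigma^2/M \quad \text{(sub-Gaussian, variance-driven)} \\ 2e^{-3Bt/(2M)} & t \gg 3\sigma^2/M \quad \text{(sub-exponential, range-driven)} \end{cases}$$
Hoeffding, which only knows the range $2M$, always pays the worst-case variance proxy $M^2$; this is the gap the exercise quantifies.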
Code cell 14
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 15
# Solution
import numpy as np
np.random.seed(42)
g_true = 0.5
sigma2 = 1.0
M = 2.0
def bernstein_bound(t, B, sigma2, M):
"""Two-sided: 2*exp(...). t is the mean deviation."""
# For sum S = sum(g_i - g), P(S >= t*B) via Bernstein:
# numerator: (t*B)^2/2, denom: B*sigma2 + M*(t*B)/3
T = t * B # convert mean deviation to sum deviation
return 2 * np.exp(-T**2 / 2 / (B * sigma2 + M * T / 3))
def hoeffding_bound_sgd(t, B, M):
"""Hoeffding: 2*exp(-2*B*t^2 / (2M)^2)."""
return 2 * np.exp(-2 * B * t**2 / (2 * M)**2)
header('Exercise 4: Bernstein for SGD Gradient Noise')
# (a) Test
print('(a) Bounds at B=32:')
B_test = 32
for t in [0.1, 0.5, 1.0]:
bern = bernstein_bound(t, B_test, sigma2, M)
hoef = hoeffding_bound_sgd(t, B_test, M)
print(f' t={t}: Bernstein={bern:.5f}, Hoeffding={hoef:.5f}, '
f'Tighter: {"Bernstein" if bern < hoef else "Hoeffding"}')
# (b)
print('\n(b) Bernstein vs Hoeffding across B (t=0.3):')
t_b = 0.3
for B in [8, 32, 128]:
bern = bernstein_bound(t_b, B, sigma2, M)
hoef = hoeffding_bound_sgd(t_b, B, M)
print(f' B={B:4d}: Bernstein={bern:.5f}, Hoeffding={hoef:.5f}')
# (c) Minimum B via search
print('\n(c) Min B for P(|g_hat-g|>=0.1) <= 0.05:')
eps, delta = 0.1, 0.05
for B in range(1, 1000):
if bernstein_bound(eps, B, sigma2, M) <= delta:
B_bern = B
break
B_hoef = None
for B in range(1, 20000):
if hoeffding_bound_sgd(eps, B, M) <= delta:
B_hoef = B
break
if B_hoef is None:
raise RuntimeError("Hoeffding search range was too small")
print(f' Bernstein needs B >= {B_bern}')
print(f' Hoeffding needs B >= {B_hoef}')
check_true('Bernstein needs fewer samples', B_bern <= B_hoef)
# (d) Empirical
B_emp = 32
t_emp = 0.3
N_mc = 200_000
gi = np.random.normal(g_true, np.sqrt(sigma2), (N_mc, B_emp)).clip(g_true-M, g_true+M)
g_hat = gi.mean(axis=1)
empirical = (np.abs(g_hat - g_true) >= t_emp).mean()
bern_bound = bernstein_bound(t_emp, B_emp, sigma2, M)
print(f'\n(d) B={B_emp}, t={t_emp}: empirical={empirical:.5f}, Bernstein={bern_bound:.5f}')
check_true('Bernstein bound holds empirically', bern_bound >= empirical * 0.99)
print('\nTakeaway: Bernstein uses variance info; significantly tighter than Hoeffding for small variance.')
Exercise 5 ★★ — McDiarmid's Inequality on Empirical Risk
Let $z_1, \dots, z_n$ be iid training examples and $\hat R(z_1, \dots, z_n) = \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), y_i)$ with $\ell \in [0, 1]$.
(a) Show that $\hat R$ satisfies the bounded differences property with $c_i = 1/n$ (changing one example changes the empirical risk by at most $1/n$).
(b) By McDiarmid's inequality: $P(\hat R - \mathbb{E}[\hat R] \ge t) \le \exp(-2nt^2)$. Implement mcdiarmid_bound(t, n) and compute bounds for $t \in \{0.05, 0.1\}$ and $n \in \{100, 1000\}$.
(c) Suppose we have a data-dependent statistic: the maximum deviation over held-out folds. Apply McDiarmid to show that a corresponding exponential tail bound holds for the full split.
(d) Simulate: draw $n = 200$ iid Bernoulli(0.3) examples (true error 0.3). Compute $\hat R$. Verify the McDiarmid bound holds across $10^5$ trials at $t = 0.05$.
(e) How does McDiarmid extend the Hoeffding result? Give an example where McDiarmid applies but Hoeffding does NOT (a non-sum function).
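For reference (used in parts (a) and (e)), the general statement: if $f$ satisfies $|f(\dots, z_i, \dots) - f(\dots, z_i', \dots)| \le c_i$ for every coordinate $i$, then
$$P\big(f(Z_1, \dots, Z_n) - \mathbb{E}[f] \ge t\big) \le \exp\!\left(-\frac{2t^2}{\sum_{i=1}^{n} c_i^2}\right).$$
With $c_i = 1/n$ this specialises to $\exp(-2nt^2)$, the mcdiarmid_bound below; with $c_i = 1$ (the max example in part (e)) it gives $\exp(-2t^2/n)$.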
Code cell 17
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 18
# Solution
import numpy as np
np.random.seed(42)
def mcdiarmid_bound(t, n):
return np.exp(-2 * n * t**2)
header('Exercise 5: McDiarmid on Empirical Risk')
# (a) Bounded differences argument
print('(a) Bounded differences:')
print(' Changing z_i to z_i\' changes R_hat by |l(h(x_i),y_i) - l(h(x_i\'),y_i\')| / n <= 1/n')
print(' So c_i = 1/n for all i, sum c_i^2 = n*(1/n)^2 = 1/n')
# (b)
print('\n(b) McDiarmid bounds:')
for t in [0.05, 0.1]:
for n in [100, 1000]:
bound = mcdiarmid_bound(t, n)
print(f' t={t}, n={n}: bound={bound:.6f}')
# (d) Simulation
n, p_true = 200, 0.3
t_test = 0.05
N_mc = 100_000
samples = np.random.binomial(1, p_true, (N_mc, n))
R_hat = samples.mean(axis=1)
empirical = (R_hat - p_true >= t_test).mean()
bound = mcdiarmid_bound(t_test, n)
print(f'\n(d) n={n}, t={t_test}:')
print(f' Empirical P(R_hat - R >= t) = {empirical:.5f}')
print(f' McDiarmid bound = {bound:.5f}')
check_true('McDiarmid bound holds', bound >= empirical)
# (e) Non-sum: maximum of n values
print('\n(e) McDiarmid example: f(x1,...,xn) = max(x1,...,xn) on [0,1]^n')
print(' max changes by at most 1 when one x_i changes => c_i = 1 for each i')
print(' McDiarmid: P(max - E[max] >= t) <= exp(-2t^2/n)')
print(' Hoeffding does NOT apply (max is not a sum)')
# Verify: for U[0,1], E[max_n] = n/(n+1)
n_test = 50
N_mc2 = 100_000
maxs = np.random.uniform(0, 1, (N_mc2, n_test)).max(axis=1)
E_max = n_test / (n_test + 1)
t_e = 0.05
emp_e = (maxs - E_max >= t_e).mean()
mc_bound = np.exp(-2 * t_e**2 / n_test)
print(f' max of {n_test} Uniforms: empirical P(max-E[max]>={t_e}) = {emp_e:.5f}, '
f'McDiarmid = {mc_bound:.5f}')
check_true('McDiarmid holds for max function', mc_bound >= emp_e)
print('\nTakeaway: McDiarmid applies to any function with bounded differences, not just sums.')
Exercise 6 ★★ — PAC Bounds for Finite Hypothesis Class
A binary classifier is selected from a finite class $\mathcal{H}$ with $|\mathcal{H}| = 2^{20}$ (about $10^6$) using empirical risk minimisation (ERM).
(a) Using the union bound + Hoeffding, derive the generalisation bound: with probability at least $1 - \delta$, simultaneously for all $h \in \mathcal{H}$,
$$R(h) \le \hat R(h) + \sqrt{\frac{\ln(|\mathcal{H}|/\delta)}{2n}}.$$
Implement pac_bound_finite(H_size, n, delta) returning the generalisation gap.
(b) Compute the generalisation bound for the ERM hypothesis when $|\mathcal{H}| = 2^{20}$, $n = 10{,}000$, $\delta = 0.05$.
(c) Sample complexity: find the minimum $n$ for the bound to be $\le 0.05$. How does $n$ scale with $|\mathcal{H}|$ and $\varepsilon$?
(d) Simulate the PAC setup: generate $|\mathcal{H}| = 100$ random classifiers on a dataset of $n = 500$ examples, each with true risk $R(h) = 0.2$. Find the ERM, check that the gap bound holds.
(e) Why can't we just use $n = \log|\mathcal{H}|$ samples (ignoring the $1/\varepsilon^2$)? Give an example where this fails.
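A worked instance of the sample-complexity inversion in part (c), using the numbers from (b):
$$n \ge \frac{\ln(|\mathcal{H}|/\delta)}{2\varepsilon^2} = \frac{20\ln 2 + \ln 20}{2 \cdot 0.05^2} \approx \frac{16.86}{0.005} \approx 3372,$$
so roughly 3,400 samples suffice for a 0.05 gap even with a million hypotheses — the logarithmic dependence on $|\mathcal{H}|$ is what makes this feasible.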
Code cell 20
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 21
# Solution
import numpy as np
np.random.seed(42)
def pac_bound_finite(H_size, n, delta=0.05):
return np.sqrt(np.log(H_size / delta) / (2 * n))
header('Exercise 6: PAC Bounds for Finite Hypothesis Class')
# (a) Verify formula
print('(a) Union bound + Hoeffding derivation:')
print(' P(exists h: R(h) > R_hat(h) + t)')
print(' <= |H| * P(R(h) > R_hat(h) + t) [union bound]')
print(' <= |H| * exp(-2nt^2) [Hoeffding on R_hat(h)]')
print(' Set |H|*exp(-2nt^2) = delta => t = sqrt(log(|H|/delta) / 2n)')
# (b)
gap = pac_bound_finite(2**20, 10000, 0.05)
print(f'\n(b) Gap for |H|=2^20, n=10000, delta=0.05: {gap:.4f}')
check_true('Gap is reasonable (< 0.2)', gap < 0.2)
# (c)
eps_target = 0.05
n_min = int(np.ceil(np.log(2**20 / 0.05) / (2 * eps_target**2)))
print(f'\n(c) Min n for gap < {eps_target}: n >= {n_min}')
check_true('n_min gives gap < target', pac_bound_finite(2**20, n_min) < eps_target + 1e-6)
# (d) Simulation
H_size = 100
n_data = 500
R_true = 0.2
N_mc = 10_000
max_gaps = []
for _ in range(N_mc):
# Each h has true risk 0.2; R_hat is empirical on n_data samples
R_hats = np.random.binomial(n_data, R_true, H_size) / n_data
h_erm = np.argmin(R_hats)
gap_erm = R_true - R_hats[h_erm] # R(h*) - R_hat(h*)
max_gaps.append(gap_erm)
bound = pac_bound_finite(H_size, n_data, 0.05)
frac_exceeding = np.mean(np.array(max_gaps) > bound)
print(f'\n(d) Simulation (H={H_size}, n={n_data}):')
print(f' PAC bound = {bound:.4f}')
print(f' P(gap > bound) = {frac_exceeding:.4f} (should be <= 0.05)')
check_true('PAC bound holds in simulation', frac_exceeding <= 0.06)
# (e)
print('\n(e) Why 1/eps^2 matters:')
print(' With n = log(|H|): gap = sqrt(log(|H|/d) / (2*log(|H|))) ~ sqrt(1/2) ~ 0.7')
print(' This is vacuous! Need n >> log(|H|) / eps^2 for useful bounds.')
vacuous_n = int(np.log(2**20))
print(f' n=log(|H|)={vacuous_n}: gap={pac_bound_finite(2**20, vacuous_n):.3f} (useless!)')
print('\nTakeaway: Generalisation requires O(log(|H|)/eps^2) samples — the eps^2 term is essential.')
Exercise 7 ★★★ — VC Dimension of Linear Classifiers
A linear classifier in $\mathbb{R}^d$ predicts $\hat y = \operatorname{sign}(w^\top x + b)$.
(a) Show that the VC dimension of half-spaces in $\mathbb{R}^d$ is $d + 1$ by:
- Exhibiting $d + 1$ points that CAN be shattered (use the standard basis + origin)
- Arguing no $d + 2$ points can be shattered (use Radon's theorem / LP duality argument)
(b) Implement can_shatter(points): check whether a given set of points can be shattered by linear classifiers in $\mathbb{R}^2$ (brute-force over all labellings).
(c) Use your function to verify: 3 points in general position CAN be shattered, 4 points (including the XOR pattern) CANNOT.
(d) Implement the VC generalisation bound vc_bound(d_vc, n, delta) and compute it for $d_{VC} \in \{3, 10, 100\}$ and $n \in \{100, 1000, 10000\}$.
(e) Simulate learning with the optimal linear separator on linearly separable data ($d = 2$, $d_{VC} = 3$). Plot the empirical generalisation gap vs the VC bound for $n \in \{50, \dots, 5000\}$. Does the VC bound track the actual gap?
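The bound in part (d) is not fully specified above; a common form (the one the reference solution implements) is
$$R(h) \le \hat R(h) + \sqrt{\frac{2\, d_{VC} \ln(e n / d_{VC}) + 2\ln(4/\delta)}{n}},$$
which combines the Sauer–Shelah growth-function bound with a symmetrisation argument; the constants vary across textbooks.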
Code cell 23
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 24
# Solution
import numpy as np
from itertools import product
np.random.seed(42)
def can_shatter(points, n_trials=3000):
m, d = points.shape
for labels in product([-1., 1.], repeat=m):
labels = np.array(labels)
found = False
for _ in range(n_trials):
w = np.random.randn(d)
b = np.random.randn()
preds = np.sign(points @ w + b)
if np.all(preds == labels):
found = True
break
if not found:
return False
return True
def vc_bound(d_vc, n, delta=0.05):
return np.sqrt((2 * d_vc * np.log(np.e * n / d_vc) + 2 * np.log(4 / delta)) / n)
header('Exercise 7: VC Dimension of Linear Classifiers')
# (a) Exhibit d+1=3 shatterable points in R^2
print('(a) Standard basis + origin in R^2:')
pts_basis = np.array([[0., 0.], [1., 0.], [0., 1.]])
check_true('Standard basis shattered by half-spaces', can_shatter(pts_basis))
# (b)-(c)
pts3 = np.array([[0., 0.], [1., 0.], [0., 1.]])
pts4_xor = np.array([[0., 0.], [1., 1.], [1., 0.], [0., 1.]])
shatter3 = can_shatter(pts3)
shatter4 = can_shatter(pts4_xor)
check_true('3 pts in general position CAN be shattered', shatter3)
check_true('4 pts in XOR pattern CANNOT be shattered', not shatter4)
print(f' 3 pts shatterable: {shatter3}')
print(f' 4 pts (XOR) shatterable: {shatter4}')
# (d)
print('\n(d) VC generalisation bounds (delta=0.05):')
print(f"  {'d_vc':>6} {'n':>8} {'bound':>10}")
for d_vc in [3, 10, 100]:
for n in [100, 1000, 10000]:
bound = vc_bound(d_vc, n)
print(f' {d_vc:>6} {n:>8} {bound:>10.4f}')
# (e) Simulate: learn perceptron on linearly separable data
print('\n(e) Empirical vs VC bound for d=2 linear classifier:')
n_sizes = [50, 100, 200, 500, 1000, 2000, 5000]
N_mc = 500
d_vc_true = 3
emp_gaps = []
for n in n_sizes:
gaps = []
for _ in range(N_mc):
# Generate data from true linear model
X = np.random.randn(n + 1000, 2)
w_true = np.array([1., -1.])
y = np.sign(X @ w_true + 0.5)
X_tr, y_tr = X[:n], y[:n]
X_te, y_te = X[n:], y[n:]
# Fit via closed-form sign of linear discriminant
w_fit = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]
y_pred_tr = np.sign(X_tr @ w_fit)
y_pred_te = np.sign(X_te @ w_fit)
R_hat = (y_pred_tr != y_tr).mean()
R_true = (y_pred_te != y_te).mean()
gaps.append(R_true - R_hat)
emp_gaps.append(np.mean(gaps))
print(f"  {'n':>6} {'Empirical gap':>16} {'VC bound':>12}")
for n, gap in zip(n_sizes, emp_gaps):
print(f' {n:>6} {gap:>16.4f} {vc_bound(d_vc_true, n):>12.4f}')
print('\nTakeaway: VC bound is loose in practice but gives correct 1/sqrt(n) scaling.')
Exercise 8 ★★★ — Rademacher Complexity and LoRA Analysis
Rademacher complexity measures how well a function class fits random noise.
(a) Implement rademacher_mc(X, F_samples, n_mc) that estimates the empirical Rademacher complexity for a function class $\mathcal{F}$ given by a set of functions evaluated on data $x_1, \dots, x_n$:
$$\hat{\mathfrak{R}}_n(\mathcal{F}) = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(x_i)\right], \qquad \sigma_i \ \text{iid uniform on } \{-1, +1\}.$$
(b) For linear functions $f_w(x) = w^\top x$ with $\|w\|_2 \le W$, verify the bound $\hat{\mathfrak{R}}_n \le W \max_i \|x_i\|_2 / \sqrt{n}$.
(c) Compare Rademacher complexity for:
- Full weight matrix update: $\Delta W \in \mathbb{R}^{d \times d}$, $\|\Delta W\|_F \le 1$
- LoRA update: $\Delta W = AB$, $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times d}$, $r \ll d$
Show that the LoRA Rademacher complexity is $\sqrt{r/d}$ times smaller than full fine-tuning.
(d) Implement the full Rademacher generalisation bound and compute it for $n = 1000$, $d = 64$, $r = 4$, $\hat R = 0.05$. Compare to the bound for full fine-tuning ($r = d$).
(e) Discuss why, despite LoRA having a smaller Rademacher bound, practical LLM fine-tuning often shows similar generalisation to full fine-tuning. What does this tell us about the tightness of Rademacher bounds?
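Two facts the reference solution leans on, stated here for convenience: the generalisation bound in (d) is the standard one, and for the linear class in (a)–(b) the supremum has a closed form via Cauchy–Schwarz (equality at $w \propto X^\top \sigma$):
$$R(f) \le \hat R(f) + 2\hat{\mathfrak{R}}_n(\mathcal{F}) + \sqrt{\frac{\ln(1/\delta)}{2n}}, \qquad \sup_{\|w\|_2 \le W} \frac{1}{n}\sum_{i=1}^{n} \sigma_i\, w^\top x_i = \frac{W}{n}\,\big\|X^\top \sigma\big\|_2.$$
The closed form is what makes the Monte Carlo estimator cheap: each draw of $\sigma$ costs one matrix–vector product instead of an optimisation over $w$.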
Code cell 26
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 27
# Solution
import numpy as np
np.random.seed(42)
def rademacher_mc(X, W_norm, n_mc=2000):
"""
Empirical Rademacher complexity for {f_w : ||w||<=W_norm}.
sup_w (1/n) sigma^T X w = (W_norm/n) ||X^T sigma||_2
"""
n = X.shape[0]
norms = []
for _ in range(n_mc):
sigma = np.random.choice([-1., 1.], size=n)
norms.append(np.linalg.norm(X.T @ sigma))
return W_norm * np.mean(norms) / n
def rademacher_lora(n, d, r, W_norm=1.0):
"""
Rademacher bound for LoRA: O(W_norm * sqrt(d*r/n)).
Simplified: using Frobenius norm of Delta W = AB, ||Delta W||_F <= W_norm.
"""
return W_norm * np.sqrt(d * r) / n # times 1/n for per-sample bound, simplification
def rademacher_full(n, d, W_norm=1.0):
return W_norm * np.sqrt(d**2) / n
def rad_generalisation_bound(R_hat, R_n, n, delta):
return R_hat + 2 * R_n + np.sqrt(np.log(1/delta) / (2*n))
header('Exercise 8: Rademacher Complexity and LoRA')
# (b) Verify linear bound
n, d = 100, 50
X = np.random.randn(n, d) / np.sqrt(d)
W_norm = 1.0
R_mc = rademacher_mc(X, W_norm)
R_theory = W_norm * np.linalg.norm(X, axis=1).mean() / np.sqrt(n)
print('(b) Linear Rademacher complexity:')
print(f' MC estimate: {R_mc:.4f}')
print(f' Theory bound: {W_norm / np.sqrt(n):.4f}')
print(f' MC <= theory: {R_mc <= W_norm / np.sqrt(n) * 1.5}')
# (c) LoRA vs full
print('\n(c) LoRA vs Full Rademacher (n=1000, d=256, W_norm=1):')
n_c, d_c = 1000, 256
print(f"  {'rank r':>8} {'LoRA R':>10} {'Full R':>10} {'Ratio':>8}")
R_full_c = rademacher_full(n_c, d_c)
for r in [1, 4, 16, 64, 256]:
R_lora = rademacher_lora(n_c, d_c, r)
print(f' {r:>8} {R_lora:>10.5f} {R_full_c:>10.5f} {R_lora/R_full_c:>8.4f}')
# (d) Generalisation bound
n_d, d_d, r_d = 1000, 64, 4
R_hat = 0.05 # empirical risk
R_lora_d = rademacher_lora(n_d, d_d, r_d)
R_full_d = rademacher_full(n_d, d_d)
bound_lora = rad_generalisation_bound(R_hat, R_lora_d, n_d, 0.05)
bound_full = rad_generalisation_bound(R_hat, R_full_d, n_d, 0.05)
print(f'\n(d) Generalisation bounds (n={n_d}, d={d_d}, r={r_d}, R_hat={R_hat}):')
print(f' LoRA (r={r_d}): R_n={R_lora_d:.4f}, bound={bound_lora:.4f}')
print(f' Full (r={d_d}): R_n={R_full_d:.4f}, bound={bound_full:.4f}')
check_true('LoRA bound < Full bound', bound_lora < bound_full)
# (e) Discussion
print('\n(e) Discussion:')
print(' Rademacher bounds are norm-based and conservative.')
print(' In practice, SGD on full fine-tuning finds solutions with small effective norms.')
print(' The implicit regularisation of SGD makes full fine-tuning behave like low-norm solutions.')
print('  LoRA explicitly constrains the rank, but the bound being smaller does NOT mean')
print(' actual generalisation is better — both can achieve similar test errors.')
print(' PAC-Bayes bounds (using KL from initialisation) are empirically much tighter.')
print('\nTakeaway: Rademacher complexity gives correct scaling (sqrt(r/n)) but constants are loose.')
print('For practical architecture design: LoRA has provably smaller complexity class.')
Exercise 9 ★★★ — Empirical Bernstein Confidence Intervals
For bounded losses $\ell_i \in [0, 1]$, empirical Bernstein bounds use the sample variance instead of only the range.
(a) Implement a simple empirical Bernstein radius.
(b) Compare it with Hoeffding for low-variance Bernoulli losses.
(c) Verify by simulation that both intervals cover the true mean at roughly the requested confidence.
(d) Explain why variance-sensitive bounds matter for reliable evaluation.
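The radius used in the reference solution is an empirical Bernstein bound in the style of Audibert, Munos and Szepesvári (a variance term plus a fast $1/n$ range term); with sample variance $\hat V_n$ and losses in $[0, 1]$:
$$r_{\mathrm{EB}}(n, \delta) = \sqrt{\frac{2 \hat V_n \ln(3/\delta)}{n}} + \frac{3\ln(3/\delta)}{n}.$$
Compare with the Hoeffding radius $\sqrt{\ln(2/\delta)/(2n)}$: when $\hat V_n \ll 1/4$ the first term is much smaller, and the $O(1/n)$ correction is negligible for large $n$.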
Code cell 29
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 30
# Solution
import numpy as np
np.random.seed(42)
def hoeffding_radius(n, delta):
return np.sqrt(np.log(2/delta) / (2*n))
def empirical_bernstein_radius(samples, delta):
n = len(samples)
v = np.var(samples, ddof=1)
return np.sqrt(2 * v * np.log(3/delta) / n) + 3 * np.log(3/delta) / n
p_true = 0.03
n = 1000
delta = 0.05
trials = 1000
cover_h = 0
cover_eb = 0
radii_h = []
radii_eb = []
for _ in range(trials):
x = (np.random.rand(n) < p_true).astype(float)
mean = x.mean()
rh = hoeffding_radius(n, delta)
reb = empirical_bernstein_radius(x, delta)
cover_h += abs(mean - p_true) <= rh
cover_eb += abs(mean - p_true) <= reb
radii_h.append(rh)
radii_eb.append(reb)
header('Exercise 9: Empirical Bernstein Confidence Intervals')
print(f'Average Hoeffding radius: {np.mean(radii_h):.4f}')
print(f'Average empirical Bernstein radius: {np.mean(radii_eb):.4f}')
print(f'Hoeffding coverage: {cover_h/trials:.3f}')
print(f'Empirical Bernstein coverage: {cover_eb/trials:.3f}')
check_true('Empirical Bernstein is tighter on low-variance losses', np.mean(radii_eb) < np.mean(radii_h))
check_true('Both methods cover at high rate in this simulation', cover_h/trials > 0.93 and cover_eb/trials > 0.90)
print('Takeaway: evaluation bounds can be much tighter when the observed variance is small.')
Exercise 10 ★★★ — Uniform Confidence Over Many Metrics
A model report tracks $m = 25$ safety metrics. Each metric is estimated from $n = 800$ iid bounded examples.
(a) Use a union bound to build a simultaneous Hoeffding interval for all $m$ metrics.
(b) Compare the radius to a single-metric interval.
(c) Simulate all $m$ metrics and verify the chance that any estimate misses its true mean.
(d) Interpret why dashboards need simultaneous, not only per-metric, uncertainty.
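The uniform radius in the solution comes from a one-line union-bound calculation over the $m$ metrics:
$$P\Big(\exists\, j \le m: |\hat p_j - p_j| \ge t\Big) \le \sum_{j=1}^{m} 2e^{-2nt^2} = 2m\,e^{-2nt^2} \le \delta \iff t \ge \sqrt{\frac{\ln(2m/\delta)}{2n}}.$$
With $m = 25$ and $\delta = 0.05$, the simultaneous radius is only about $1.37\times$ the single-metric one: the union bound's cost is logarithmic in $m$.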
Code cell 32
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 33
# Solution
import numpy as np
np.random.seed(42)
m = 25
n = 800
delta = 0.05
true_ps = np.linspace(0.02, 0.25, m)
single_radius = np.sqrt(np.log(2/delta)/(2*n))
uniform_radius = np.sqrt(np.log(2*m/delta)/(2*n))
trials = 1000
miss_any = 0
for _ in range(trials):
X = (np.random.rand(n, m) < true_ps).astype(float)
estimates = X.mean(axis=0)
miss_any += np.any(np.abs(estimates - true_ps) > uniform_radius)
header('Exercise 10: Uniform Confidence Over Many Metrics')
print(f'Single-metric radius: {single_radius:.4f}')
print(f'Union-bound uniform radius: {uniform_radius:.4f}')
print(f'Empirical P(any miss): {miss_any/trials:.3f}, target <= {delta}')
check_true('Uniform radius is wider than single radius', uniform_radius > single_radius)
check_true('Any-miss rate is controlled in simulation', miss_any/trials <= 0.08)
print('Takeaway: once a report has many metrics, per-metric confidence intervals understate dashboard-level risk.')
What to Review After Finishing
- Markov: $P(X \ge t) \le \mathbb{E}[X]/t$ — non-negative $X$ + mean only
- Chebyshev: $P(|X - \mu| \ge t) \le \sigma^2/t^2$ — polynomial tail, universal
- Hoeffding: $P(|\bar X_n - \mu| \ge t) \le 2e^{-2nt^2}$ for bounded iid $X_i \in [0, 1]$ — exponential tail, $n = O(\log(1/\delta)/\varepsilon^2)$
- Chernoff: $P(S \ge (1+\delta)\mu) \le e^{-\mu\delta^2/3}$ — MGF method, adapts to $p$ for the Binomial
- Bernstein: adds the variance term $\sigma^2$ — tail transitions sub-Gaussian → sub-exponential around $t \approx \sigma^2/M$
- McDiarmid: bounded differences → exponential tail for ANY function of iid variables
- Union bound: $P(\bigcup_i A_i) \le \sum_i P(A_i)$ — foundation of PAC generalisation
- VC dimension: size of the largest shattered set; $d + 1$ for half-spaces in $\mathbb{R}^d$
- Rademacher: $R(f) \le \hat R(f) + 2\hat{\mathfrak{R}}_n + O(\sqrt{\log(1/\delta)/n})$ — data-dependent, tighter than VC
- LoRA complexity: $O(\sqrt{dr})$ vs $O(d)$ — theoretical basis for PEFT
References
- Boucheron, Lugosi & Massart, Concentration Inequalities (2013) — OUP
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014) — Chapters 2–7
- Wainwright, High-Dimensional Statistics (2019) — Cambridge
- Vershynin, High-Dimensional Probability (2018) — Cambridge
- Tropp, An Introduction to Matrix Concentration Inequalities (2015) — arXiv:1501.01571
- Bartlett & Mendelson, Rademacher and Gaussian complexities (2002) — JMLR
- Hu et al., LoRA: Low-Rank Adaptation (2022) — arXiv:2106.09685