Expectation and Moments — Exercises

Converted from exercises.ipynb for web reading.

10 exercises covering LOTUS, linearity, variance, conditional expectation, MGFs, Jensen's inequality, bias-variance decomposition, and Adam.

Format          Description
Problem         Markdown cell with task description
Your Solution   Code cell with scaffolding
Solution        Code cell with reference solution and checks

Difficulty Levels

Level   Exercises   Focus
★       1–3         Core mechanics
★★      4–6         Deeper theory
★★★     7–10        AI / ML applications

Topic Map

Topic                           Exercise
LOTUS                           1
Linearity of expectation        2
MGF and Gaussian moments        3
Tower property                  4
Gamma MGF and cumulants         5
Jensen and KL divergence        6
Bias-variance decomposition     7
Adam as moment estimation       8
Mini-batch gradient variance    9
Control variates                10

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
from scipy import stats
from scipy.integrate import quad

try:
    import matplotlib.pyplot as plt
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)

def header(title):
    print('\n' + '=' * len(title))
    print(title)
    print('=' * len(title))

def check_close(name, got, expected, tol=1e-4):
    ok = np.allclose(got, expected, atol=tol, rtol=tol)
    print(f"{'PASS' if ok else 'FAIL'}{name}")
    if not ok:
        print(f'  expected: {expected}')
        print(f'  got     : {got}')
    return ok

def check_true(name, cond):
    print(f"{'PASS' if cond else 'FAIL'}{name}")
    return cond

print('Setup complete.')

Exercise 1 ★ — LOTUS: Computing Expectations Without Finding Distributions

Let $X \sim \text{Gamma}(\alpha, \beta)$ with PDF

$$f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0$$

and parameters $\alpha = 3$, $\beta = 2$.

(a) Compute $\mathbb{E}[X]$ by numerical integration (LOTUS with $g(x) = x$). Check against the theoretical value $\alpha/\beta$.

(b) Compute $\mathbb{E}[X^2]$ and $\text{Var}(X)$. Check the variance against $\alpha/\beta^2$.

(c) Compute $\mathbb{E}[\log X]$ using LOTUS. The theoretical value is $\psi(\alpha) - \log(\beta)$, where $\psi$ is the digamma function.

(d) Compute $\mathbb{E}[1/X]$ and verify $\mathbb{E}[1/X] = \beta/(\alpha - 1)$ for $\alpha > 1$.

(e) Verify all answers by Monte Carlo simulation with $N = 10^5$ samples.
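
For reference, LOTUS applied to this density, together with the standard Gamma moments that the checks below use:

$$\mathbb{E}[g(X)] = \int_0^\infty g(x)\, f_X(x)\, dx, \qquad \mathbb{E}[X] = \frac{\alpha}{\beta}, \quad \mathbb{E}[X^2] = \frac{\alpha(\alpha+1)}{\beta^2}, \quad \text{Var}(X) = \frac{\alpha}{\beta^2}.$$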

Code cell 5

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 6

# Solution
from scipy.integrate import quad
from scipy.special import gamma as gamma_fn, digamma
import numpy as np
from scipy import stats

alpha, beta = 3.0, 2.0

def gamma_pdf(x, a, b):
    return b**a / gamma_fn(a) * x**(a-1) * np.exp(-b*x)

# (a) E[X]
E_X, _ = quad(lambda x: x * gamma_pdf(x, alpha, beta), 0, np.inf)

# (b) E[X^2] and Var(X)
E_X2, _ = quad(lambda x: x**2 * gamma_pdf(x, alpha, beta), 0, np.inf)
Var_X = E_X2 - E_X**2

# (c) E[log X]
E_logX, _ = quad(lambda x: np.log(x) * gamma_pdf(x, alpha, beta), 0, np.inf)
E_logX_theory = digamma(alpha) - np.log(beta)

# (d) E[1/X]
E_invX, _ = quad(lambda x: (1.0/x) * gamma_pdf(x, alpha, beta), 0, np.inf)
E_invX_theory = beta / (alpha - 1)  # valid for alpha > 1

# (e) Monte Carlo
N = 100000
np.random.seed(42)
samples = np.random.gamma(alpha, 1.0/beta, N)  # numpy parameterises by scale = 1/beta

header('Exercise 1: LOTUS — Gamma Distribution Moments')
check_close('E[X] theory (alpha/beta)', E_X, alpha/beta)
check_close('E[X] vs MC', E_X, samples.mean(), tol=0.02)
check_close('Var(X) theory (alpha/beta^2)', Var_X, alpha/beta**2)
check_close('Var(X) vs MC', Var_X, samples.var(), tol=0.02)
check_close('E[log X] theory (digamma(alpha)-log(beta))', E_logX, E_logX_theory)
check_close('E[log X] vs MC', E_logX, np.log(samples).mean(), tol=0.02)
check_close('E[1/X] theory (beta/(alpha-1))', E_invX, E_invX_theory)
check_close('E[1/X] vs MC', E_invX, (1.0/samples).mean(), tol=0.01)
print('\nTakeaway: LOTUS lets us compute E[g(X)] directly from f_X — no need to find the distribution of g(X).')

Exercise 2 ★ — Linearity of Expectation Without Independence

Let $X \sim \mathcal{N}(0, 1)$ and define $Y = X^2$. Note that $X$ and $Y$ are not independent.

(a) Compute $\mathbb{E}[X + Y]$ using linearity: $\mathbb{E}[X] + \mathbb{E}[Y]$. Verify by Monte Carlo.

(b) Compute $\mathbb{E}[3X - 2Y + 5]$ using linearity. Verify by Monte Carlo.

(c) Compute $\mathbb{E}[XY]$. Is this equal to $\mathbb{E}[X] \cdot \mathbb{E}[Y]$? Explain why or why not.

(d) Using the indicator variable trick: for $X_1, \ldots, X_{100}$ i.i.d. $\text{Uniform}(0,1)$, let $N = \sum_{i=1}^{100} \mathbf{1}[X_i > 0.7]$. Compute $\mathbb{E}[N]$ by linearity and verify by simulation.

(e) Show by simulation that $\text{Var}(X + Y) \neq \text{Var}(X) + \text{Var}(Y)$ for dependent variables with non-zero covariance. Here $\text{Cov}(X, X^2) = \mathbb{E}[X^3] = 0$, so the sum rule happens to hold despite the dependence; use the shifted pair $X' = X + 1$, $Y' = X'^2$, for which $\text{Cov}(X', Y') = 2 \neq 0$.
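
Two identities the parts above rely on (both are verified numerically in the reference solution):

$$\mathbb{E}[N] = \sum_{i=1}^{100} \mathbb{E}\big[\mathbf{1}[X_i > 0.7]\big] = 100 \cdot P(X_1 > 0.7) = 30, \qquad \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y).$$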

Code cell 8

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 9

# Solution
import numpy as np
np.random.seed(42)
N = 200000

X = np.random.normal(0, 1, N)
Y = X**2

# Theory: E[X]=0, E[Y]=E[X^2]=Var(X)+(E[X])^2=1, E[XY]=E[X^3]=0 (odd moment of N(0,1))

# (a)
E_X_plus_Y_linearity = 0 + 1  # E[X] + E[X^2] = 0 + 1 = 1
E_X_plus_Y_mc = (X + Y).mean()

# (b)
E_combo_linearity = 3*0 - 2*1 + 5  # = 3
E_combo_mc = (3*X - 2*Y + 5).mean()

# (c): E[XY] = E[X^3] = 0 (odd moment of symmetric distribution)
E_XY = (X * Y).mean()
E_X_times_E_Y = X.mean() * Y.mean()  # 0 * 1 = 0
# Coincidentally both are ~0 here, but for different reasons
# E[XY] = E[X^3] = 0 by symmetry; E[X]*E[Y] = 0*1 = 0
# They agree but X and Y are NOT independent (proven by P(Y<1) vs P(Y<1|X>1.5))

# (d) indicator trick: E[N] = 100 * P(X_i > 0.7) = 100 * 0.3 = 30
E_N_linearity = 100 * 0.3
# one row per trial: count how many of the 100 uniforms exceed 0.7, then average the counts
N_draws = (np.random.uniform(0, 1, size=(20000, 100)) > 0.7).sum(axis=1)
E_N_sim = N_draws.mean()  # scalar, approx 30

# (e) For X ~ N(0,1) and Y = X^2, Cov(X,Y) = E[X^3] = 0, so Var(X+Y) = Var(X)+Var(Y) holds anyway.
#     Use the shifted pair X' = X+1, Y' = X'^2 (covariance 2) to see the sum rule fail.
Xs = X + 1.0
Ys = Xs**2
Var_X_val = Xs.var()
Var_Y_val = Ys.var()
Var_XpY_decomp = Var_X_val + Var_Y_val  # wrong when the covariance is non-zero
Var_XpY_direct = (Xs + Ys).var()
Cov_XY = np.cov(Xs, Ys)[0, 1]

header('Exercise 2: Linearity of Expectation')
check_close('E[X+Y]: linearity vs MC', E_X_plus_Y_linearity, E_X_plus_Y_mc, tol=0.02)
check_close('E[3X-2Y+5]: linearity vs MC', E_combo_linearity, E_combo_mc, tol=0.02)
print(f'\nE[XY] = {E_XY:.4f}  (E[X]*E[Y] = {E_X_times_E_Y:.4f} — agree here, but X,Y still dependent!)')
check_close('E[N] indicator trick', E_N_linearity, 30.0)
check_close('E[N] simulation', E_N_sim, 30.0, tol=0.05)
print(f"\nVariance of a sum of dependent variables (X' = X+1, Y' = X'^2):")
print(f"  Var(X')+Var(Y') = {Var_XpY_decomp:.4f}  (assumes zero covariance)")
print(f"  Var(X'+Y')      = {Var_XpY_direct:.4f}  (true value)")
print(f"  Cov(X',Y')      = {Cov_XY:.4f}  (non-zero!)")
print(f"  Var(X'+Y') = Var(X')+Var(Y')+2Cov(X',Y') = {Var_XpY_decomp + 2*Cov_XY:.4f}")
check_true("Var(X'+Y') ≠ Var(X')+Var(Y') when covariance is non-zero", abs(Var_XpY_decomp - Var_XpY_direct) > 0.1)
print('\nTakeaway: Linearity of expectation always holds; Var(X+Y)=Var(X)+Var(Y) needs zero covariance (independence is sufficient).')

Exercise 3 ★ — Moments of the Gaussian via MGF Differentiation

For $X \sim \mathcal{N}(\mu, \sigma^2)$, the MGF is

$$M_X(t) = e^{\mu t + \sigma^2 t^2/2}$$

(a) Differentiate $M_X(t)$ analytically to derive:

  • $M_X'(0) = \mathbb{E}[X]$ — verify it equals $\mu$
  • $M_X''(0) = \mathbb{E}[X^2]$ — verify it equals $\mu^2 + \sigma^2$
  • $M_X'''(0) = \mathbb{E}[X^3]$ — derive the formula

(b) Implement gaussian_mgf(t, mu, sigma) and use numerical differentiation to extract the first four moments.

(c) From the moments, compute the skewness $\gamma_1 = \mu_3/\sigma^3$ and excess kurtosis $\gamma_2 = \mu_4/\sigma^4 - 3$. Verify: for the Gaussian, $\gamma_1 = 0$ and $\gamma_2 = 0$.

(d) Repeat for $X \sim \text{Exp}(\lambda)$ with MGF $M(t) = \lambda/(\lambda - t)$ for $t < \lambda$. Verify $\gamma_1 = 2$ and $\gamma_2 = 6$.
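
For reference, differentiating the Gaussian MGF directly (the reference solution below recovers the same values numerically):

$$M_X'(t) = (\mu + \sigma^2 t)\,M_X(t), \qquad M_X''(t) = \big[\sigma^2 + (\mu + \sigma^2 t)^2\big] M_X(t), \qquad M_X'''(t) = \big[3\sigma^2(\mu + \sigma^2 t) + (\mu + \sigma^2 t)^3\big] M_X(t),$$

so $M_X'(0) = \mu$, $M_X''(0) = \mu^2 + \sigma^2$, and $M_X'''(0) = \mu^3 + 3\mu\sigma^2$.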

Code cell 11

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 12

# Solution
import numpy as np

mu, sigma = 2.0, 1.5

def gaussian_mgf(t, mu, sigma):
    return np.exp(mu*t + 0.5*sigma**2*t**2)

def numerical_moments(mgf_fn, n_moments=4, h=1e-2):
    """k-th raw moment = k-th derivative of the MGF at t=0, via 5-point central differences.

    h must not be too small: the k=4 stencil divides by h**4, so a very small h
    (e.g. 1e-4) loses the signal to floating-point cancellation.
    """
    moments = []
    for k in range(1, n_moments+1):
        # central finite differences: f^(k)(0) ≈ (1/h^k) * sum_i c_i * f(i*h)
        ts = np.array([-2*h, -h, 0, h, 2*h])
        vals = np.array([mgf_fn(t) for t in ts])
        if k == 1:
            d = (-vals[4] + 8*vals[3] - 8*vals[1] + vals[0]) / (12*h)
        elif k == 2:
            d = (-vals[4] + 16*vals[3] - 30*vals[2] + 16*vals[1] - vals[0]) / (12*h**2)
        elif k == 3:
            d = (vals[4] - 2*vals[3] + 2*vals[1] - vals[0]) / (2*h**3)
        elif k == 4:
            d = (vals[4] - 4*vals[3] + 6*vals[2] - 4*vals[1] + vals[0]) / h**4
        moments.append(d)
    return np.array(moments)

moments = numerical_moments(lambda t: gaussian_mgf(t, mu, sigma))

# Central moments from raw moments
m1 = moments[0]  # E[X]
m2 = moments[1]  # E[X^2]
m3 = moments[2]  # E[X^3]
m4 = moments[3]  # E[X^4]
mu2 = m2 - m1**2          # Var(X)
mu3 = m3 - 3*m1*mu2 - m1**3
mu4 = m4 - 4*m1*m3 + 6*m1**2*m2 - 3*m1**4

skewness = mu3 / mu2**1.5
ex_kurt  = mu4 / mu2**2 - 3

# Exponential
lam = 2.0
def exp_mgf(t, lam):
    return lam / (lam - t)

moments_exp = numerical_moments(lambda t: exp_mgf(t, lam), h=1e-2)  # keep |t| well inside t < lam
m1e = moments_exp[0]; m2e = moments_exp[1]
m3e = moments_exp[2]; m4e = moments_exp[3]
mu2e = m2e - m1e**2
mu3e = m3e - 3*m1e*mu2e - m1e**3
mu4e = m4e - 4*m1e*m3e + 6*m1e**2*m2e - 3*m1e**4
skew_exp = mu3e / mu2e**1.5
kurt_exp = mu4e / mu2e**2 - 3

header('Exercise 3: Gaussian and Exponential Moments via MGF')
print(f'Gaussian N({mu}, {sigma}^2):')
check_close('E[X] = mu', moments[0], mu)
check_close('E[X^2] = mu^2+sigma^2', moments[1], mu**2 + sigma**2)
check_close('Skewness = 0', skewness, 0.0, tol=0.01)
check_close('Excess kurtosis = 0', ex_kurt, 0.0, tol=0.05)
print(f'\nExponential(lam={lam}):')
check_close('E[X] = 1/lam', moments_exp[0], 1/lam, tol=0.01)
check_close('Var(X) = 1/lam^2', mu2e, 1/lam**2, tol=0.01)
check_close('Skewness = 2', skew_exp, 2.0, tol=0.1)
check_close('Excess kurtosis = 6', kurt_exp, 6.0, tol=0.3)
print('\nTakeaway: MGF encodes all moments; differentiation extracts them without integration.')

Exercise 4 ★★ — Tower Property and Law of Total Variance

A customer service centre receives calls. The number of calls per hour $N$ follows a Poisson distribution with rate $\Lambda$, where $\Lambda$ itself is random: $\Lambda \sim \text{Gamma}(\alpha, \beta)$ with $\alpha = 2$, $\beta = 0.5$.

(a) Use the tower property to compute $\mathbb{E}[N]$:

$$\mathbb{E}[N] = \mathbb{E}[\mathbb{E}[N \mid \Lambda]] = \mathbb{E}[\Lambda] = \alpha/\beta$$

(b) Use the law of total variance to compute $\text{Var}(N)$:

$$\text{Var}(N) = \mathbb{E}[\text{Var}(N \mid \Lambda)] + \text{Var}(\mathbb{E}[N \mid \Lambda])$$

Note: for $N \mid \Lambda \sim \text{Pois}(\Lambda)$, $\text{Var}(N \mid \Lambda) = \Lambda$.

(c) Verify both results by simulation: draw $\Lambda \sim \text{Gamma}(\alpha, \beta)$ (in NumPy: shape $\alpha$, scale $1/\beta$), then $N \mid \Lambda \sim \text{Pois}(\Lambda)$.

(d) The marginal distribution of $N$ (a Poisson with Gamma-distributed rate) is Negative Binomial. Verify this by comparing the sampled marginal PMF to $\text{NegBin}(r = \alpha, p = \beta/(\beta+1))$.
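
Plugging the Poisson conditional moments into these two identities gives the values the reference solution checks:

$$\mathbb{E}[N] = \mathbb{E}[\Lambda] = \frac{\alpha}{\beta} = 4, \qquad \text{Var}(N) = \mathbb{E}[\Lambda] + \text{Var}(\Lambda) = \frac{\alpha}{\beta} + \frac{\alpha}{\beta^2} = 12.$$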

Code cell 14

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 15

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

alpha, beta = 2.0, 0.5

# (a)
E_N = alpha / beta  # = 4

# (b)
E_Var_N_given_L = alpha / beta        # E[Lambda]
Var_E_N_given_L = alpha / beta**2     # Var(Lambda)
Var_N = E_Var_N_given_L + Var_E_N_given_L

# (c)
N_sim = 100000
Lambda_samples = np.random.gamma(alpha, 1.0/beta, N_sim)  # scale = 1/beta
N_samples = np.random.poisson(Lambda_samples)  # vectorised: one Poisson draw per sampled rate

# (d) Negative Binomial: r=alpha, p=beta/(beta+1)
r = alpha
p = beta / (beta + 1)
max_k = 30
ks = np.arange(max_k)
negbin_pmf = stats.nbinom.pmf(ks, r, p)
sample_pmf, _ = np.histogram(N_samples, bins=np.arange(max_k+1), density=False)
sample_pmf = sample_pmf / N_sim
tvd = np.abs(sample_pmf - negbin_pmf).sum() / 2

header('Exercise 4: Tower Property and Law of Total Variance')
check_close('E[N] = alpha/beta = 4', E_N, 4.0)
check_close('E[N] simulation', E_N, N_samples.mean(), tol=0.05)
check_close('Var(N) = alpha/beta + alpha/beta^2', Var_N, alpha/beta + alpha/beta**2)
check_close('Var(N) simulation', Var_N, N_samples.var(), tol=0.5)
print(f'\nLaw of total variance decomposition:')
print(f'  E[Var(N|Lambda)] = {E_Var_N_given_L:.2f}  ("within" component)')
print(f'  Var(E[N|Lambda]) = {Var_E_N_given_L:.2f}  ("between" component)')
print(f'  Total Var(N)     = {Var_N:.2f}')
print(f'  Note: Var(N) > E[N] = {E_N} — overdispersion vs Poisson!')
print(f'\nMarginal ~ NegBin({r}, {p:.3f}):')
print(f'  Total variation distance vs sample: {tvd:.4f}  (should be small)')
check_true('TVD < 0.02 (close to NegBin)', tvd < 0.02)
print('\nTakeaway: Tower property: E[Y] = E[E[Y|X]] — marginalise by conditioning then averaging.')

Exercise 5 ★★ — MGF of the Gamma Distribution and Cumulants

The $\text{Gamma}(\alpha, \beta)$ distribution (shape $\alpha$, rate $\beta$) has MGF

$$M(t) = \left(\frac{\beta}{\beta - t}\right)^\alpha, \quad t < \beta$$

(a) Implement gamma_mgf(t, alpha, beta) and verify it numerically against Monte Carlo estimates $\hat{M}(t) = \frac{1}{N}\sum_{i=1}^N e^{tX_i}$.

(b) Compute the first four cumulants using the cumulant generating function $K(t) = \log M(t) = \alpha \log(\beta) - \alpha \log(\beta - t)$:

  • $\kappa_1 = K'(0) = \alpha/\beta$ (mean)
  • $\kappa_2 = K''(0) = \alpha/\beta^2$ (variance)
  • $\kappa_3 = K'''(0)$ (third cumulant)
  • $\kappa_4 = K^{(4)}(0)$ (fourth cumulant)

(c) From the cumulants, compute the skewness $\gamma_1 = \kappa_3/\kappa_2^{3/2}$ and excess kurtosis $\gamma_2 = \kappa_4/\kappa_2^2$. Verify: $\gamma_1 = 2/\sqrt{\alpha}$, $\gamma_2 = 6/\alpha$.

(d) Verify all results by Monte Carlo with $\alpha = 3$, $\beta = 2$, $N = 10^5$.
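
Since $K(t)$ has such a simple form, its derivatives can be written down directly (the reference solution recovers the same values by finite differences):

$$K^{(k)}(t) = \frac{\alpha\,(k-1)!}{(\beta - t)^k} \quad (k \geq 1), \qquad \text{so} \quad \kappa_k = K^{(k)}(0) = \frac{\alpha\,(k-1)!}{\beta^k}: \quad \kappa_1 = \frac{\alpha}{\beta},\; \kappa_2 = \frac{\alpha}{\beta^2},\; \kappa_3 = \frac{2\alpha}{\beta^3},\; \kappa_4 = \frac{6\alpha}{\beta^4}.$$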

Code cell 17

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 18

# Solution
import numpy as np
np.random.seed(42)

alpha, beta = 3.0, 2.0

def gamma_mgf(t, alpha, beta):
    return (beta / (beta - t))**alpha

def gamma_cgf(t, alpha, beta):
    return alpha * np.log(beta) - alpha * np.log(beta - t)

# (a) Verify
N_mc = 100000
samples = np.random.gamma(alpha, 1.0/beta, N_mc)

# (b) Cumulants via 5-point central finite differences of K(t).
#     h must be moderate: the kappa4 stencil divides by h**4, and a very small h
#     (e.g. 1e-4) would destroy the estimate through floating-point cancellation.
h = 1e-2
ts_fd = np.array([-2*h, -h, 0, h, 2*h])
K_vals = np.array([gamma_cgf(t, alpha, beta) for t in ts_fd])

kappa1 = (-K_vals[4] + 8*K_vals[3] - 8*K_vals[1] + K_vals[0]) / (12*h)
kappa2 = (-K_vals[4] + 16*K_vals[3] - 30*K_vals[2] + 16*K_vals[1] - K_vals[0]) / (12*h**2)
kappa3 = (K_vals[4] - 2*K_vals[3] + 2*K_vals[1] - K_vals[0]) / (2*h**3)
kappa4 = (K_vals[4] - 4*K_vals[3] + 6*K_vals[2] - 4*K_vals[1] + K_vals[0]) / h**4

# (c)
skewness = kappa3 / kappa2**1.5
ex_kurt  = kappa4 / kappa2**2

header('Exercise 5: Gamma MGF and Cumulants')
for t in [0.1, 0.5, 1.0]:
    mgf_theory = gamma_mgf(t, alpha, beta)
    mgf_mc = np.exp(t * samples).mean()
    # the MC estimate of M(t) gets noisy as t approaches beta (e^{tX} becomes heavy-tailed),
    # so use a looser tolerance than for the deterministic checks
    check_close(f'MGF(t={t}): theory vs MC', mgf_theory, mgf_mc, tol=0.05)

print(f'\nCumulants:')
check_close('kappa1 = alpha/beta', kappa1, alpha/beta)
check_close('kappa2 = alpha/beta^2', kappa2, alpha/beta**2)
check_close('kappa3 = 2*alpha/beta^3', kappa3, 2*alpha/beta**3)
check_close('kappa4 = 6*alpha/beta^4', kappa4, 6*alpha/beta**4)

print(f'\nSkewness and kurtosis:')
check_close('Skewness = 2/sqrt(alpha)', skewness, 2/np.sqrt(alpha), tol=0.01)
check_close('Excess kurtosis = 6/alpha', ex_kurt, 6/alpha, tol=0.05)
print(f'\nMC verification: skewness={((samples-samples.mean())**3).mean()/samples.std()**3:.3f}, '
      f'kurtosis={(((samples-samples.mean())**4).mean()/samples.std()**4 - 3):.3f}')
print('\nTakeaway: CGF K(t)=log M(t) has a simpler form; its derivatives give cumulants directly.')

Exercise 6 ★★ — Jensen's Inequality and the KL Divergence

(a) Use Jensen's inequality to prove $\mathbb{E}[\log X] \leq \log \mathbb{E}[X]$ for any positive random variable $X$. Verify numerically for $X \sim \text{Exp}(1)$.

(b) Implement kl_divergence(p, q) for discrete distributions and verify $\text{KL}(p \| q) \geq 0$ with equality iff $p = q$.

(c) Show that the KL divergence is not symmetric: compute $\text{KL}(p \| q)$ and $\text{KL}(q \| p)$ for two specific distributions and verify they differ.

(d) For two Gaussians $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$, the KL divergence has a closed form:

$$\text{KL}(p \| q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

Verify by Monte Carlo and by numerical integration.

(e) Derive the ELBO inequality: for any distributions $q(z)$ and $p(x,z)$,

$$\log p(x) \geq \mathbb{E}_q[\log p(x,z)] - \mathbb{E}_q[\log q(z)] = \text{ELBO}$$

Verify numerically for a simple Gaussian model.
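
Both inequalities follow from the same Jensen step (the reference solution checks them numerically):

$$\text{KL}(p \| q) = \mathbb{E}_p\!\left[-\log\frac{q(X)}{p(X)}\right] \geq -\log \mathbb{E}_p\!\left[\frac{q(X)}{p(X)}\right] = -\log 1 = 0, \qquad \log p(x) = \text{ELBO} + \text{KL}\big(q(z) \,\|\, p(z \mid x)\big) \geq \text{ELBO}.$$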

Code cell 20

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 21

# Solution
import numpy as np
from scipy import stats
from scipy.integrate import quad
np.random.seed(42)

# (a)
X = np.random.exponential(1.0, 200000)
E_logX = np.log(X).mean()      # theory: -gamma ≈ -0.5772
log_EX = np.log(X.mean())      # log(E[X]) = log(1) = 0

# (b)
def kl_divergence(p, q, eps=1e-10):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > eps
    return np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps)))

p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

# (d)
def gaussian_kl(mu1, sigma1, mu2, sigma2):
    return (np.log(sigma2/sigma1)
            + (sigma1**2 + (mu1-mu2)**2) / (2*sigma2**2)
            - 0.5)

mu1, sigma1 = 1.0, 1.5
mu2, sigma2 = 2.0, 2.0
kl_closed = gaussian_kl(mu1, sigma1, mu2, sigma2)

# Numerical integration verification
def kl_integrand(x):
    p = stats.norm.pdf(x, mu1, sigma1)
    q = stats.norm.pdf(x, mu2, sigma2)
    if p < 1e-300 or q < 1e-300:
        return 0.0
    return p * np.log(p / q)

kl_numerical, _ = quad(kl_integrand, mu1 - 8*sigma1, mu1 + 8*sigma1)

# MC verification
samples_p = np.random.normal(mu1, sigma1, 500000)
log_p = stats.norm.logpdf(samples_p, mu1, sigma1)
log_q = stats.norm.logpdf(samples_p, mu2, sigma2)
kl_mc = (log_p - log_q).mean()

# (e) ELBO: log p(x) >= E_q[log p(x,z)] - E_q[log q(z)] = ELBO
# Simple model: p(z) = N(0,1), p(x|z) = N(z, 1), q(z) = N(mu_q, sigma_q^2)
# Marginal: p(x) = N(0, 2) (integrate out z)
x_obs = 1.5
log_px = stats.norm.logpdf(x_obs, 0, np.sqrt(2))  # true log p(x)
mu_q, sigma_q = 0.75, 0.7  # approximate posterior q(z)
z_samples = np.random.normal(mu_q, sigma_q, 100000)
log_pxz = stats.norm.logpdf(x_obs, z_samples, 1.0) + stats.norm.logpdf(z_samples, 0, 1.0)
log_qz  = stats.norm.logpdf(z_samples, mu_q, sigma_q)
elbo = (log_pxz - log_qz).mean()

header('Exercise 6: Jensen\'s Inequality and KL Divergence')
print(f'(a) Jensen: E[log X] = {E_logX:.4f} <= log(E[X]) = {log_EX:.4f}')
check_true('E[log X] <= log(E[X])', E_logX <= log_EX)
check_close('KL(p||p) = 0', kl_divergence(p, p), 0.0, tol=1e-8)
check_true('KL(p||q) > 0', kl_divergence(p, q) > 0)
print(f'\n(c) Asymmetry: KL(p||q)={kl_divergence(p,q):.4f}, KL(q||p)={kl_divergence(q,p):.4f}')
check_true('KL is asymmetric', abs(kl_divergence(p,q) - kl_divergence(q,p)) > 0.01)
print(f'\n(d) Gaussian KL: closed={kl_closed:.4f}, numerical={kl_numerical:.4f}, MC={kl_mc:.4f}')
check_close('closed == numerical', kl_closed, kl_numerical, tol=0.01)
check_close('closed == MC', kl_closed, kl_mc, tol=0.01)
print(f'\n(e) ELBO: log p(x) = {log_px:.4f}, ELBO = {elbo:.4f}')
check_true('ELBO <= log p(x)', elbo <= log_px + 0.01)
print('\nTakeaway: KL >= 0 by Jensen; the ELBO is a lower bound on log p(x) for any q.')

Exercise 7 ★★★ — Bias-Variance Decomposition

Consider estimating $f(x) = \sin(2\pi x)$ from $n$ noisy observations $y_i = f(x_i) + \varepsilon_i$, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$.

(a) Implement bias_variance_decomposition(degree, n_train, sigma, n_datasets, x_test) that:

  • Generates n_datasets independent training sets of size n_train
  • Fits a degree-$d$ polynomial to each
  • Returns bias², variance, and noise at each test point

(b) Verify the decomposition: $\text{MSE} = \text{Bias}^2 + \text{Variance} + \sigma^2$.

(c) Plot (or print) how bias² and variance change as the degree goes from 1 to 12 (with $n = 15$, $\sigma = 0.3$). Identify the optimal degree.

(d) Show the effect of increasing $n$: fit degree 6 for $n \in \{10, 20, 50, 100\}$. Which component dominates at small $n$? At large $n$?
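
For reference, the decomposition in part (b) comes from expanding the expected squared error over both the training-set draw $D$ and the test noise, with $y = f(x) + \varepsilon$:

$$\mathbb{E}_{D,\varepsilon}\big[(\hat{f}_D(x) - y)^2\big] = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D\big[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}.$$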

Code cell 23

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 24

# Solution
import numpy as np
np.random.seed(42)

def true_fn(x):
    return np.sin(2 * np.pi * x)

def poly_predict(x_train, y_train, x_test, degree):
    Phi_tr = np.column_stack([x_train**k for k in range(degree+1)])
    Phi_te = np.column_stack([x_test**k for k in range(degree+1)])
    w, _, _, _ = np.linalg.lstsq(Phi_tr, y_train, rcond=None)
    return Phi_te @ w

def bias_variance_decomp(degree, n_train, sigma, n_datasets=300, x_test=None):
    if x_test is None:
        x_test = np.linspace(0, 1, 50)
    f_test = true_fn(x_test)
    preds, sq_errs = [], []
    for _ in range(n_datasets):
        x_tr = np.linspace(0, 1, n_train)
        y_tr = true_fn(x_tr) + np.random.normal(0, sigma, n_train)
        pred = poly_predict(x_tr, y_tr, x_test, degree)
        preds.append(pred)
        # score against *noisy* test targets so the MSE includes the irreducible sigma^2 term
        y_te = f_test + np.random.normal(0, sigma, len(x_test))
        sq_errs.append((pred - y_te)**2)
    preds = np.array(preds)   # (n_datasets, len(x_test))
    mean_pred = preds.mean(axis=0)
    bias2    = np.mean((mean_pred - f_test)**2)
    variance = np.mean(preds.var(axis=0))
    noise    = sigma**2
    mse      = np.mean(sq_errs)  # averaged over datasets and test points
    return bias2, variance, noise, mse

header('Exercise 7: Bias-Variance Decomposition')

# (b) Verify decomposition
sigma = 0.3
bias2, var, noise, mse = bias_variance_decomp(4, 20, sigma)
print(f'Degree=4, n=20:')
print(f'  Bias^2={bias2:.4f}, Var={var:.4f}, Noise={noise:.4f}')
print(f'  Bias^2+Var+Noise={bias2+var+noise:.4f}, MSE={mse:.4f}')
check_close('MSE = Bias^2 + Var + Noise', mse, bias2+var+noise, tol=0.02)

# (c) Sweep degrees
print(f'\nDegree sweep (n=15, sigma=0.3):')
print(f'{"Degree":8s} {"Bias^2":10s} {"Variance":10s} {"Total MSE":10s}')
best_mse = float('inf'); best_d = 0
for d in range(1, 13):
    b2, v, ns, mse_d = bias_variance_decomp(d, 15, sigma)
    print(f'  d={d:2d}:  bias2={b2:.4f}  var={v:.4f}  mse={b2+v+ns:.4f}')
    if b2+v+ns < best_mse:
        best_mse = b2+v+ns; best_d = d
print(f'  => Optimal degree: {best_d}')

# (d) Effect of n
print(f'\nEffect of n (degree=6, sigma=0.3):')
for n in [10, 20, 50, 100]:
    b2, v, ns, mse_n = bias_variance_decomp(6, n, sigma)
    dom = 'bias dominates' if b2 > v else 'variance dominates'
    print(f'  n={n:4d}: bias2={b2:.4f}, var={v:.4f}  ({dom})')
print('\nTakeaway: Bias decreases with complexity; variance increases; optimal = minimum MSE.')

Exercise 8 ★★★ — Adam as Adaptive Moment Estimation

Adam tracks running estimates of the first and second moments of the gradient:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \approx \mathbb{E}[g_t], \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \approx \mathbb{E}[g_t^2]$$

with bias corrections $\hat{m}_t = m_t/(1-\beta_1^t)$, $\hat{v}_t = v_t/(1-\beta_2^t)$.

(a) Implement adam_optimizer(grad_fn, theta0, lr, n_steps) and minimise $f(\theta) = (\theta_1 - 3)^2 + 10(\theta_2 + 1)^2$.

(b) Show that without bias correction, Adam converges slowly at first (it underestimates the moments). Quantify the initialisation bias at $t = 1$: $m_1 = (1-\beta_1)g_1$ vs $\hat{m}_1 = g_1$.

(c) Implement SGD with the same learning rate for comparison and count the steps each optimiser needs to reach a small loss.

(d) Analyse the effective step $\alpha \cdot \hat{m}_t / \sqrt{\hat{v}_t + \varepsilon}$: show that its magnitude is bounded by approximately $\alpha$ per parameter, regardless of gradient scale.

(e) Demonstrate Adam's robustness to gradient scale: multiply the gradient by 100 and show that Adam still converges, while SGD with the same learning rate diverges.
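
A one-line calculation behind part (b), assuming for simplicity that the gradient distribution is stationary so that $\mathbb{E}[g_i] = \mathbb{E}[g]$ for all $i$:

$$m_t = (1-\beta_1)\sum_{i=1}^{t} \beta_1^{\,t-i} g_i \;\Rightarrow\; \mathbb{E}[m_t] = (1-\beta_1^t)\,\mathbb{E}[g], \qquad \text{so } \hat{m}_t = \frac{m_t}{1-\beta_1^t} \text{ is unbiased (and similarly for } \hat{v}_t\text{)}.$$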

Code cell 26

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 27

# Solution
import numpy as np
np.random.seed(42)

def f(theta):
    return (theta[0] - 3)**2 + 10*(theta[1] + 1)**2

def grad_f(theta, scale=1.0):
    return scale * np.array([2*(theta[0]-3), 20*(theta[1]+1)])

def adam(grad_fn, theta0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=500):
    theta = theta0.copy().astype(float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    losses = [f(theta)]
    for t in range(1, n_steps+1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction
        v_hat = v / (1 - beta2**t)  # bias correction
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
        losses.append(f(theta))
    return theta, np.array(losses)

def sgd(grad_fn, theta0, lr=0.01, n_steps=500):
    theta = theta0.copy().astype(float)
    losses = [f(theta)]
    for _ in range(n_steps):
        g = grad_fn(theta)
        theta -= lr * g
        losses.append(f(theta))
    return theta, np.array(losses)

theta0 = np.array([0.0, 0.0])

theta_adam, losses_adam = adam(grad_f, theta0)
theta_sgd, losses_sgd = sgd(grad_f, theta0, lr=0.01)

header('Exercise 8: Adam as Moment Estimation')
check_close('Adam converges: theta -> [3, -1]', theta_adam, np.array([3.0, -1.0]), tol=0.01)
check_close('SGD converges:  theta -> [3, -1]', theta_sgd, np.array([3.0, -1.0]), tol=0.01)

# (b) Bias correction analysis
beta1, beta2 = 0.9, 0.999
g1 = grad_f(theta0)
m1_no_corr = (1 - beta1) * g1
m1_corr    = m1_no_corr / (1 - beta1**1)
print(f'\n(b) Bias correction at t=1:')
print(f'  True gradient g1    = {g1}')
print(f'  m1 without correction = {m1_no_corr}  (only {100*(1-beta1):.0f}% of g1)')
print(f'  m1 with correction    = {m1_corr}     (= g1, fully corrected)')

# (c) Convergence comparison
print(f'\n(c) Adam vs SGD (500 steps):')
adam_steps_to_1e3 = next((i for i,l in enumerate(losses_adam) if l < 1e-3), None)
sgd_steps_to_1e3  = next((i for i,l in enumerate(losses_sgd)  if l < 1e-3), None)
print(f'  Adam: {adam_steps_to_1e3} steps to loss < 1e-3')
print(f'  SGD:  {sgd_steps_to_1e3} steps to loss < 1e-3')

# (d) Effective learning rate
eps = 1e-8
theta_test = np.array([1.0, 0.5])
g = grad_f(theta_test)
eff_lr = 0.01 * g / (np.sqrt(g**2) + eps)  # for a constant-sign gradient, m_hat/sqrt(v_hat) -> sign(g)
print(f'\n(d) Effective step size (sign of gradient, |step| ≈ lr):')
print(f'  gradient: {g}')
print(f'  effective step: {eff_lr}')
print(f'  |step| ≈ lr = 0.01: {np.allclose(np.abs(eff_lr), 0.01, atol=0.001)}')

# (e) Gradient scale robustness
theta_adam_scaled, losses_adam_scaled = adam(lambda t: grad_f(t, scale=100), theta0)
theta_sgd_scaled, losses_sgd_scaled = sgd(lambda t: grad_f(t, scale=100), theta0, lr=0.01)
print(f'\n(e) With gradient scale=100:')
print(f'  Adam final loss: {losses_adam_scaled[-1]:.6f}  (converges!)')
print(f'  SGD final loss:  {losses_sgd_scaled[-1]:.6f}  (diverges: nan or very large)')
check_close('Adam robust to scale', theta_adam_scaled, np.array([3.0, -1.0]), tol=0.05)
print('\nTakeaway: Adam estimates E[g] and E[g^2] per-parameter; effective LR ≈ lr regardless of gradient scale.')

Exercise 9 ★★★ — Mini-Batch Gradient Variance

Let per-example gradients $g_i$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. The mini-batch gradient is $\bar{g}_B = \frac{1}{B}\sum_{i=1}^B g_i$.

(a) Derive $\operatorname{Var}(\bar{g}_B) = \sigma^2/B$.

(b) Verify this scaling by simulation for several batch sizes.

(c) Explain the trade-off between gradient noise and compute efficiency.
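
The derivation in part (a) is two lines, using independence and the scaling property of variance:

$$\operatorname{Var}(\bar{g}_B) = \operatorname{Var}\!\Big(\frac{1}{B}\sum_{i=1}^B g_i\Big) = \frac{1}{B^2}\sum_{i=1}^B \operatorname{Var}(g_i) = \frac{\sigma^2}{B},$$

so the standard error of the mini-batch gradient shrinks only as $1/\sqrt{B}$.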

Code cell 29

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 30

# Solution
import numpy as np
np.random.seed(42)

mu, sigma = 0.3, 2.0
batch_sizes = [1, 4, 16, 64, 256]
n_trials = 20000

header('Exercise 9: Mini-Batch Gradient Variance')
for B in batch_sizes:
    batches = np.random.normal(mu, sigma, size=(n_trials, B))
    means = batches.mean(axis=1)
    empirical_var = means.var(ddof=0)
    theory_var = sigma**2 / B
    print(f'B={B:3d}: empirical Var={empirical_var:.4f}, theory={theory_var:.4f}')
    check_close(f'variance scales as 1/B for B={B}', empirical_var, theory_var, tol=0.06)
print('Takeaway: larger batches reduce estimator variance, but with diminishing returns proportional to 1/B.')

Exercise 10 ★★★ — Control Variates for Variance Reduction

Estimate $\mathbb{E}[e^X]$ for $X \sim \mathcal{N}(0,1)$. Use $C = X$ as a control variate with known mean $\mathbb{E}[C] = 0$.

(a) Implement the naive Monte Carlo estimator.

(b) Estimate the optimal control coefficient $b^* = \operatorname{Cov}(Y, C)/\operatorname{Var}(C)$, where $Y = e^X$.

(c) Compare the estimator variance with and without the control variate.

(d) Explain why variance reduction matters for policy gradients and variational inference.
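
The estimator in parts (b)–(c) is $Y_{\text{cv}} = Y - b\,(C - \mathbb{E}[C])$; a short calculation shows why it helps:

$$\mathbb{E}[Y_{\text{cv}}] = \mathbb{E}[Y], \qquad \operatorname{Var}(Y_{\text{cv}}) = \operatorname{Var}(Y) - 2b\,\operatorname{Cov}(Y, C) + b^2\operatorname{Var}(C),$$

which is minimised at $b^* = \operatorname{Cov}(Y, C)/\operatorname{Var}(C)$, giving $\operatorname{Var}(Y_{\text{cv}}) = (1 - \rho_{YC}^2)\,\operatorname{Var}(Y)$.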

Code cell 32

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 33

# Solution
import numpy as np
np.random.seed(42)

n = 200000
X = np.random.randn(n)
Y = np.exp(X)
C = X
b_star = np.cov(Y, C, ddof=0)[0, 1] / np.var(C)
Y_cv = Y - b_star * (C - 0.0)
true_value = np.exp(0.5)

header('Exercise 10: Control Variates')
print(f'True E[e^X]: {true_value:.4f}')
print(f'Naive estimate: {Y.mean():.4f}')
print(f'Control-variate estimate: {Y_cv.mean():.4f}')
print(f'Optimal b*: {b_star:.4f}')
print(f'Naive sample variance: {Y.var():.4f}')
print(f'Control-variate sample variance: {Y_cv.var():.4f}')
check_true('Control variate reduces variance', Y_cv.var() < Y.var())
check_close('Control-variate estimate is close to truth', Y_cv.mean(), true_value, tol=0.02)
print('Takeaway: subtracting a correlated zero-mean term keeps the expectation fixed while reducing Monte Carlo noise.')

What to Review After Finishing

  • LOTUS: $\mathbb{E}[g(X)] = \int g(x) f_X(x)\,dx$ — no need to find the distribution of $g(X)$
  • Linearity: $\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$ — no independence needed
  • Variance formula: $\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$
  • Tower property: $\mathbb{E}[\mathbb{E}[Y|X]] = \mathbb{E}[Y]$
  • Law of total variance: $\text{Var}(Y) = \mathbb{E}[\text{Var}(Y|X)] + \text{Var}(\mathbb{E}[Y|X])$
  • MGF: $M_X(t) = \mathbb{E}[e^{tX}]$; the $k$-th derivative at 0 gives the $k$-th raw moment
  • Jensen: convex $f \Rightarrow f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$; proves $\text{KL} \geq 0$
  • Bias-variance: $\text{MSE} = \text{Bias}^2 + \text{Var} + \sigma^2$
  • Adam: $m_t \approx \mathbb{E}[g_t]$, $v_t \approx \mathbb{E}[g_t^2]$; bias correction at small $t$
  • Mini-batch gradient variance — can you explain the probabilistic calculation and the ML relevance?
  • Control variates — can you connect the computation to model behavior?

References

  1. DeGroot & Schervish, Probability and Statistics (4th ed.) — Chapters 4–5
  2. Bishop, PRML (2006) — Appendix B (probability review)
  3. Goodfellow, Bengio & Courville, Deep Learning (2016) — Chapter 3
  4. Kingma & Ba, Adam: A Method for Stochastic Optimization (2015) — arXiv:1412.6980
  5. Hastie, Tibshirani & Friedman, ESL (2009) — §7.3 (bias-variance tradeoff)
  6. Belkin et al., Reconciling modern ML and the bias-variance tradeoff (2019) — double descent