Expectation and Moments — Exercises

Converted from exercises.ipynb for web reading.

10 exercises covering LOTUS, linearity, variance, conditional expectation, MGFs, Jensen's inequality, bias-variance decomposition, and Adam.

Format          Description
Problem         Markdown cell with task description
Your Solution   Code cell with scaffolding
Solution        Code cell with reference solution and checks

Difficulty Levels

Level   Exercises   Focus
★       1–3         Core mechanics
★★      4–6         Deeper theory
★★★     7–10        AI / ML applications

Topic Map

Topic                           Exercise
LOTUS                           1
Linearity of expectation        2
MGF and Gaussian moments        3
Tower property                  4
Gamma MGF and cumulants         5
Jensen and KL divergence        6
Bias-variance decomposition     7
Adam as moment estimation       8
Mini-batch gradient variance    9
Control variates                10

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
from scipy import stats
from scipy.integrate import quad

try:
    import matplotlib.pyplot as plt
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)

def header(title):
    print('\n' + '=' * len(title))
    print(title)
    print('=' * len(title))

def check_close(name, got, expected, tol=1e-4):
    ok = np.allclose(got, expected, atol=tol, rtol=tol)
    print(f"{'PASS' if ok else 'FAIL'}{name}")
    if not ok:
        print(f'  expected: {expected}')
        print(f'  got     : {got}')
    return ok

def check_true(name, cond):
    print(f"{'PASS' if cond else 'FAIL'}{name}")
    return cond

print('Setup complete.')

Exercise 1 ★ — LOTUS: Computing Expectations Without Finding Distributions

Let $X \sim \text{Gamma}(\alpha, \beta)$ with PDF

$$f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0$$

and parameters $\alpha = 3$, $\beta = 2$.

(a) Compute $\mathbb{E}[X]$ by numerical integration (LOTUS with $g(x) = x$). Check against the theoretical value $\alpha/\beta$.

(b) Compute $\mathbb{E}[X^2]$ and $\text{Var}(X)$. Check the variance against $\alpha/\beta^2$.

(c) Compute $\mathbb{E}[\log X]$ using LOTUS. The theoretical value is $\psi(\alpha) - \log(\beta)$, where $\psi$ is the digamma function.

(d) Compute $\mathbb{E}[1/X]$ and verify $\mathbb{E}[1/X] = \beta/(\alpha - 1)$ for $\alpha > 1$.

(e) Verify all answers by Monte Carlo simulation with $N = 10^5$ samples.
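
For reference, LOTUS applied to this density, together with the standard Gamma moments that the checks below use:

$$\mathbb{E}[g(X)] = \int_0^\infty g(x)\, f_X(x)\, dx, \qquad \mathbb{E}[X] = \frac{\alpha}{\beta}, \quad \mathbb{E}[X^2] = \frac{\alpha(\alpha+1)}{\beta^2}, \quad \text{Var}(X) = \frac{\alpha}{\beta^2}.$$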

Code cell 5

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 6

# Solution
from scipy.integrate import quad
from scipy.special import gamma as gamma_fn, digamma
import numpy as np
from scipy import stats

alpha, beta = 3.0, 2.0

def gamma_pdf(x, a, b):
    return b**a / gamma_fn(a) * x**(a-1) * np.exp(-b*x)

# (a) E[X]
E_X, _ = quad(lambda x: x * gamma_pdf(x, alpha, beta), 0, np.inf)

# (b) E[X^2] and Var(X)
E_X2, _ = quad(lambda x: x**2 * gamma_pdf(x, alpha, beta), 0, np.inf)
Var_X = E_X2 - E_X**2

# (c) E[log X]
E_logX, _ = quad(lambda x: np.log(x) * gamma_pdf(x, alpha, beta), 0, np.inf)
E_logX_theory = digamma(alpha) - np.log(beta)

# (d) E[1/X]
E_invX, _ = quad(lambda x: (1.0/x) * gamma_pdf(x, alpha, beta), 0, np.inf)
E_invX_theory = beta / (alpha - 1)  # valid for alpha > 1

# (e) Monte Carlo
N = 100000
np.random.seed(42)
samples = np.random.gamma(alpha, 1.0/beta, N)  # numpy parameterises by scale = 1/beta

header('Exercise 1: LOTUS — Gamma Distribution Moments')
check_close('E[X] theory (alpha/beta)', E_X, alpha/beta)
check_close('E[X] vs MC', E_X, samples.mean(), tol=0.02)
check_close('Var(X) theory (alpha/beta^2)', Var_X, alpha/beta**2)
check_close('Var(X) vs MC', Var_X, samples.var(), tol=0.02)
check_close('E[log X] theory (digamma(alpha)-log(beta))', E_logX, E_logX_theory)
check_close('E[log X] vs MC', E_logX, np.log(samples).mean(), tol=0.02)
check_close('E[1/X] theory (beta/(alpha-1))', E_invX, E_invX_theory)
check_close('E[1/X] vs MC', E_invX, (1.0/samples).mean(), tol=0.01)
print('\nTakeaway: LOTUS lets us compute E[g(X)] directly from f_X — no need to find the distribution of g(X).')

Exercise 2 ★ — Linearity of Expectation Without Independence

Let $X \sim \mathcal{N}(0, 1)$ and define $Y = X^2$. Note that $X$ and $Y$ are not independent.

(a) Compute $\mathbb{E}[X + Y]$ using linearity: $\mathbb{E}[X] + \mathbb{E}[Y]$. Verify by Monte Carlo.

(b) Compute $\mathbb{E}[3X - 2Y + 5]$ using linearity. Verify by Monte Carlo.

(c) Compute $\mathbb{E}[XY]$. Is this equal to $\mathbb{E}[X] \cdot \mathbb{E}[Y]$? Explain why or why not.

(d) Using the indicator variable trick: for $X_1, \ldots, X_{100}$ i.i.d. $\text{Uniform}(0,1)$, let $N = \sum_{i=1}^{100} \mathbf{1}[X_i > 0.7]$. Compute $\mathbb{E}[N]$ by linearity and verify by simulation.

(e) Show by simulation that $\text{Var}(X + Y) \neq \text{Var}(X) + \text{Var}(Y)$ for dependent variables with non-zero covariance. Here $\text{Cov}(X, X^2) = \mathbb{E}[X^3] = 0$, so the sum rule happens to hold despite the dependence; use the shifted pair $X' = X + 1$, $Y' = X'^2$, for which $\text{Cov}(X', Y') = 2 \neq 0$.
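
Two identities the parts above rely on (both are verified numerically in the reference solution):

$$\mathbb{E}[N] = \sum_{i=1}^{100} \mathbb{E}\big[\mathbf{1}[X_i > 0.7]\big] = 100 \cdot P(X_1 > 0.7) = 30, \qquad \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y).$$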

Code cell 8

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 9

# Solution
import numpy as np
np.random.seed(42)
N = 200000

X = np.random.normal(0, 1, N)
Y = X**2

# Theory: E[X]=0, E[Y]=E[X^2]=Var(X)+(E[X])^2=1, E[XY]=E[X^3]=0 (odd moment of N(0,1))

# (a)
E_X_plus_Y_linearity = 0 + 1  # E[X] + E[X^2] = 0 + 1 = 1
E_X_plus_Y_mc = (X + Y).mean()

# (b)
E_combo_linearity = 3*0 - 2*1 + 5  # = 3
E_combo_mc = (3*X - 2*Y + 5).mean()

# (c): E[XY] = E[X^3] = 0 (odd moment of symmetric distribution)
E_XY = (X * Y).mean()
E_X_times_E_Y = X.mean() * Y.mean()  # 0 * 1 = 0
# Coincidentally both are ~0 here, but for different reasons
# E[XY] = E[X^3] = 0 by symmetry; E[X]*E[Y] = 0*1 = 0
# They agree but X and Y are NOT independent (proven by P(Y<1) vs P(Y<1|X>1.5))

# (d) indicator trick: E[N] = 100 * P(X_i > 0.7) = 100 * 0.3 = 30
E_N_linearity = 100 * 0.3
# one row per trial: count how many of the 100 uniforms exceed 0.7, then average the counts
N_draws = (np.random.uniform(0, 1, size=(20000, 100)) > 0.7).sum(axis=1)
E_N_sim = N_draws.mean()  # scalar, approx 30

# (e) For X ~ N(0,1) and Y = X^2, Cov(X,Y) = E[X^3] = 0, so Var(X+Y) = Var(X)+Var(Y) holds anyway.
#     Use the shifted pair X' = X+1, Y' = X'^2 (covariance 2) to see the sum rule fail.
Xs = X + 1.0
Ys = Xs**2
Var_X_val = Xs.var()
Var_Y_val = Ys.var()
Var_XpY_decomp = Var_X_val + Var_Y_val  # wrong when the covariance is non-zero
Var_XpY_direct = (Xs + Ys).var()
Cov_XY = np.cov(Xs, Ys)[0, 1]

header('Exercise 2: Linearity of Expectation')
check_close('E[X+Y]: linearity vs MC', E_X_plus_Y_linearity, E_X_plus_Y_mc, tol=0.02)
check_close('E[3X-2Y+5]: linearity vs MC', E_combo_linearity, E_combo_mc, tol=0.02)
print(f'\nE[XY] = {E_XY:.4f}  (E[X]*E[Y] = {E_X_times_E_Y:.4f} — agree here, but X,Y still dependent!)')
check_close('E[N] indicator trick', E_N_linearity, 30.0)
check_close('E[N] simulation', E_N_sim, 30.0, tol=0.05)
print(f"\nVariance of a sum of dependent variables (X' = X+1, Y' = X'^2):")
print(f"  Var(X')+Var(Y') = {Var_XpY_decomp:.4f}  (assumes zero covariance)")
print(f"  Var(X'+Y')      = {Var_XpY_direct:.4f}  (true value)")
print(f"  Cov(X',Y')      = {Cov_XY:.4f}  (non-zero!)")
print(f"  Var(X'+Y') = Var(X')+Var(Y')+2Cov(X',Y') = {Var_XpY_decomp + 2*Cov_XY:.4f}")
check_true("Var(X'+Y') ≠ Var(X')+Var(Y') when covariance is non-zero", abs(Var_XpY_decomp - Var_XpY_direct) > 0.1)
print('\nTakeaway: Linearity of expectation always holds; Var(X+Y)=Var(X)+Var(Y) needs zero covariance (independence is sufficient).')

Exercise 3 ★ — Moments of the Gaussian via MGF Differentiation

For $X \sim \mathcal{N}(\mu, \sigma^2)$, the MGF is

$$M_X(t) = e^{\mu t + \sigma^2 t^2/2}$$

(a) Differentiate $M_X(t)$ analytically to derive:

  • $M_X'(0) = \mathbb{E}[X]$ — verify it equals $\mu$
  • $M_X''(0) = \mathbb{E}[X^2]$ — verify it equals $\mu^2 + \sigma^2$
  • $M_X'''(0) = \mathbb{E}[X^3]$ — derive the formula

(b) Implement gaussian_mgf(t, mu, sigma) and use numerical differentiation to extract the first four moments.

(c) From the moments, compute the skewness $\gamma_1 = \mu_3/\sigma^3$ and excess kurtosis $\gamma_2 = \mu_4/\sigma^4 - 3$. Verify: for the Gaussian, $\gamma_1 = 0$ and $\gamma_2 = 0$.

(d) Repeat for $X \sim \text{Exp}(\lambda)$ with MGF $M(t) = \lambda/(\lambda - t)$ for $t < \lambda$. Verify $\gamma_1 = 2$ and $\gamma_2 = 6$.
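
For reference, differentiating the Gaussian MGF directly (the reference solution below recovers the same values numerically):

$$M_X'(t) = (\mu + \sigma^2 t)\,M_X(t), \qquad M_X''(t) = \big[\sigma^2 + (\mu + \sigma^2 t)^2\big] M_X(t), \qquad M_X'''(t) = \big[3\sigma^2(\mu + \sigma^2 t) + (\mu + \sigma^2 t)^3\big] M_X(t),$$

so $M_X'(0) = \mu$, $M_X''(0) = \mu^2 + \sigma^2$, and $M_X'''(0) = \mu^3 + 3\mu\sigma^2$.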

Code cell 11

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 12

# Solution
import numpy as np

mu, sigma = 2.0, 1.5

def gaussian_mgf(t, mu, sigma):
    return np.exp(mu*t + 0.5*sigma**2*t**2)

def numerical_moments(mgf_fn, n_moments=4, h=1e-2):
    """k-th raw moment = k-th derivative of the MGF at t=0, via 5-point central differences.

    h must not be too small: the k=4 stencil divides by h**4, so a very small h
    (e.g. 1e-4) loses the signal to floating-point cancellation.
    """
    moments = []
    for k in range(1, n_moments+1):
        # central finite differences: f^(k)(0) ≈ (1/h^k) * sum_i c_i * f(i*h)
        ts = np.array([-2*h, -h, 0, h, 2*h])
        vals = np.array([mgf_fn(t) for t in ts])
        if k == 1:
            d = (-vals[4] + 8*vals[3] - 8*vals[1] + vals[0]) / (12*h)
        elif k == 2:
            d = (-vals[4] + 16*vals[3] - 30*vals[2] + 16*vals[1] - vals[0]) / (12*h**2)
        elif k == 3:
            d = (vals[4] - 2*vals[3] + 2*vals[1] - vals[0]) / (2*h**3)
        elif k == 4:
            d = (vals[4] - 4*vals[3] + 6*vals[2] - 4*vals[1] + vals[0]) / h**4
        moments.append(d)
    return np.array(moments)

moments = numerical_moments(lambda t: gaussian_mgf(t, mu, sigma))

# Central moments from raw moments
m1 = moments[0]  # E[X]
m2 = moments[1]  # E[X^2]
m3 = moments[2]  # E[X^3]
m4 = moments[3]  # E[X^4]
mu2 = m2 - m1**2          # Var(X)
mu3 = m3 - 3*m1*mu2 - m1**3
mu4 = m4 - 4*m1*m3 + 6*m1**2*m2 - 3*m1**4

skewness = mu3 / mu2**1.5
ex_kurt  = mu4 / mu2**2 - 3

# Exponential
lam = 2.0
def exp_mgf(t, lam):
    return lam / (lam - t)

moments_exp = numerical_moments(lambda t: exp_mgf(t, lam), h=1e-2)  # keep |t| well inside t < lam
m1e = moments_exp[0]; m2e = moments_exp[1]
m3e = moments_exp[2]; m4e = moments_exp[3]
mu2e = m2e - m1e**2
mu3e = m3e - 3*m1e*mu2e - m1e**3
mu4e = m4e - 4*m1e*m3e + 6*m1e**2*m2e - 3*m1e**4
skew_exp = mu3e / mu2e**1.5
kurt_exp = mu4e / mu2e**2 - 3

header('Exercise 3: Gaussian and Exponential Moments via MGF')
print(f'Gaussian N({mu}, {sigma}^2):')
check_close('E[X] = mu', moments[0], mu)
check_close('E[X^2] = mu^2+sigma^2', moments[1], mu**2 + sigma**2)
check_close('Skewness = 0', skewness, 0.0, tol=0.01)
check_close('Excess kurtosis = 0', ex_kurt, 0.0, tol=0.05)
print(f'\nExponential(lam={lam}):')
check_close('E[X] = 1/lam', moments_exp[0], 1/lam, tol=0.01)
check_close('Var(X) = 1/lam^2', mu2e, 1/lam**2, tol=0.01)
check_close('Skewness = 2', skew_exp, 2.0, tol=0.1)
check_close('Excess kurtosis = 6', kurt_exp, 6.0, tol=0.3)
print('\nTakeaway: MGF encodes all moments; differentiation extracts them without integration.')

Exercise 4 ★★ — Tower Property and Law of Total Variance

A customer service centre receives calls. The number of calls per hour $N$ follows a Poisson distribution with rate $\Lambda$, where $\Lambda$ itself is random: $\Lambda \sim \text{Gamma}(\alpha, \beta)$ with $\alpha = 2$, $\beta = 0.5$.

(a) Use the tower property to compute $\mathbb{E}[N]$:

$$\mathbb{E}[N] = \mathbb{E}[\mathbb{E}[N \mid \Lambda]] = \mathbb{E}[\Lambda] = \alpha/\beta$$

(b) Use the law of total variance to compute $\text{Var}(N)$:

$$\text{Var}(N) = \mathbb{E}[\text{Var}(N \mid \Lambda)] + \text{Var}(\mathbb{E}[N \mid \Lambda])$$

Note: for $N \mid \Lambda \sim \text{Pois}(\Lambda)$, $\text{Var}(N \mid \Lambda) = \Lambda$.

(c) Verify both results by simulation: draw $\Lambda \sim \text{Gamma}(\alpha, \beta)$ (in NumPy: shape $\alpha$, scale $1/\beta$), then $N \mid \Lambda \sim \text{Pois}(\Lambda)$.

(d) The marginal distribution of $N$ (a Poisson with Gamma-distributed rate) is Negative Binomial. Verify this by comparing the sampled marginal PMF to $\text{NegBin}(r = \alpha, p = \beta/(\beta+1))$.
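
Plugging the Poisson conditional moments into these two identities gives the values the reference solution checks:

$$\mathbb{E}[N] = \mathbb{E}[\Lambda] = \frac{\alpha}{\beta} = 4, \qquad \text{Var}(N) = \mathbb{E}[\Lambda] + \text{Var}(\Lambda) = \frac{\alpha}{\beta} + \frac{\alpha}{\beta^2} = 12.$$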

Code cell 14

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 15

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

alpha, beta = 2.0, 0.5

# (a)
E_N = alpha / beta  # = 4

# (b)
E_Var_N_given_L = alpha / beta        # E[Lambda]
Var_E_N_given_L = alpha / beta**2     # Var(Lambda)
Var_N = E_Var_N_given_L + Var_E_N_given_L

# (c)
N_sim = 100000
Lambda_samples = np.random.gamma(alpha, 1.0/beta, N_sim)  # scale = 1/beta
N_samples = np.random.poisson(Lambda_samples)  # vectorised: one Poisson draw per sampled rate

# (d) Negative Binomial: r=alpha, p=beta/(beta+1)
r = alpha
p = beta / (beta + 1)
max_k = 30
ks = np.arange(max_k)
negbin_pmf = stats.nbinom.pmf(ks, r, p)
sample_pmf, _ = np.histogram(N_samples, bins=np.arange(max_k+1), density=False)
sample_pmf = sample_pmf / N_sim
tvd = np.abs(sample_pmf - negbin_pmf).sum() / 2

header('Exercise 4: Tower Property and Law of Total Variance')
check_close('E[N] = alpha/beta = 4', E_N, 4.0)
check_close('E[N] simulation', E_N, N_samples.mean(), tol=0.05)
check_close('Var(N) = alpha/beta + alpha/beta^2', Var_N, alpha/beta + alpha/beta**2)
check_close('Var(N) simulation', Var_N, N_samples.var(), tol=0.5)
print(f'\nLaw of total variance decomposition:')
print(f'  E[Var(N|Lambda)] = {E_Var_N_given_L:.2f}  ("within" component)')
print(f'  Var(E[N|Lambda]) = {Var_E_N_given_L:.2f}  ("between" component)')
print(f'  Total Var(N)     = {Var_N:.2f}')
print(f'  Note: Var(N) > E[N] = {E_N} — overdispersion vs Poisson!')
print(f'\nMarginal ~ NegBin({r}, {p:.3f}):')
print(f'  Total variation distance vs sample: {tvd:.4f}  (should be small)')
check_true('TVD < 0.02 (close to NegBin)', tvd < 0.02)
print('\nTakeaway: Tower property: E[Y] = E[E[Y|X]] — marginalise by conditioning then averaging.')

Exercise 5 ★★ — MGF of the Gamma Distribution and Cumulants

The $\text{Gamma}(\alpha, \beta)$ distribution (shape $\alpha$, rate $\beta$) has MGF

$$M(t) = \left(\frac{\beta}{\beta - t}\right)^\alpha, \quad t < \beta$$

(a) Implement gamma_mgf(t, alpha, beta) and verify it numerically against Monte Carlo estimates $\hat{M}(t) = \frac{1}{N}\sum_{i=1}^N e^{tX_i}$.

(b) Compute the first four cumulants using the cumulant generating function $K(t) = \log M(t) = \alpha \log(\beta) - \alpha \log(\beta - t)$:

  • $\kappa_1 = K'(0) = \alpha/\beta$ (mean)
  • $\kappa_2 = K''(0) = \alpha/\beta^2$ (variance)
  • $\kappa_3 = K'''(0)$ (third cumulant)
  • $\kappa_4 = K^{(4)}(0)$ (fourth cumulant)

(c) From the cumulants, compute the skewness $\gamma_1 = \kappa_3/\kappa_2^{3/2}$ and excess kurtosis $\gamma_2 = \kappa_4/\kappa_2^2$. Verify: $\gamma_1 = 2/\sqrt{\alpha}$, $\gamma_2 = 6/\alpha$.

(d) Verify all results by Monte Carlo with $\alpha = 3$, $\beta = 2$, $N = 10^5$.
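
Since $K(t)$ has such a simple form, its derivatives can be written down directly (the reference solution recovers the same values by finite differences):

$$K^{(k)}(t) = \frac{\alpha\,(k-1)!}{(\beta - t)^k} \quad (k \geq 1), \qquad \text{so} \quad \kappa_k = K^{(k)}(0) = \frac{\alpha\,(k-1)!}{\beta^k}: \quad \kappa_1 = \frac{\alpha}{\beta},\; \kappa_2 = \frac{\alpha}{\beta^2},\; \kappa_3 = \frac{2\alpha}{\beta^3},\; \kappa_4 = \frac{6\alpha}{\beta^4}.$$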

Code cell 17

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 18

# Solution
import numpy as np
np.random.seed(42)

alpha, beta = 3.0, 2.0

def gamma_mgf(t, alpha, beta):
    return (beta / (beta - t))**alpha

def gamma_cgf(t, alpha, beta):
    return alpha * np.log(beta) - alpha * np.log(beta - t)

# (a) Verify
N_mc = 100000
samples = np.random.gamma(alpha, 1.0/beta, N_mc)

# (b) Cumulants via 5-point central finite differences of K(t).
#     h must be moderate: the kappa4 stencil divides by h**4, and a very small h
#     (e.g. 1e-4) would destroy the estimate through floating-point cancellation.
h = 1e-2
ts_fd = np.array([-2*h, -h, 0, h, 2*h])
K_vals = np.array([gamma_cgf(t, alpha, beta) for t in ts_fd])

kappa1 = (-K_vals[4] + 8*K_vals[3] - 8*K_vals[1] + K_vals[0]) / (12*h)
kappa2 = (-K_vals[4] + 16*K_vals[3] - 30*K_vals[2] + 16*K_vals[1] - K_vals[0]) / (12*h**2)
kappa3 = (K_vals[4] - 2*K_vals[3] + 2*K_vals[1] - K_vals[0]) / (2*h**3)
kappa4 = (K_vals[4] - 4*K_vals[3] + 6*K_vals[2] - 4*K_vals[1] + K_vals[0]) / h**4

# (c)
skewness = kappa3 / kappa2**1.5
ex_kurt  = kappa4 / kappa2**2

header('Exercise 5: Gamma MGF and Cumulants')
for t in [0.1, 0.5, 1.0]:
    mgf_theory = gamma_mgf(t, alpha, beta)
    mgf_mc = np.exp(t * samples).mean()
    # the MC estimate of M(t) gets noisy as t approaches beta (e^{tX} becomes heavy-tailed),
    # so use a looser tolerance than for the deterministic checks
    check_close(f'MGF(t={t}): theory vs MC', mgf_theory, mgf_mc, tol=0.05)

print(f'\nCumulants:')
check_close('kappa1 = alpha/beta', kappa1, alpha/beta)
check_close('kappa2 = alpha/beta^2', kappa2, alpha/beta**2)
check_close('kappa3 = 2*alpha/beta^3', kappa3, 2*alpha/beta**3)
check_close('kappa4 = 6*alpha/beta^4', kappa4, 6*alpha/beta**4)

print(f'\nSkewness and kurtosis:')
check_close('Skewness = 2/sqrt(alpha)', skewness, 2/np.sqrt(alpha), tol=0.01)
check_close('Excess kurtosis = 6/alpha', ex_kurt, 6/alpha, tol=0.05)
print(f'\nMC verification: skewness={((samples-samples.mean())**3).mean()/samples.std()**3:.3f}, '
      f'kurtosis={(((samples-samples.mean())**4).mean()/samples.std()**4 - 3):.3f}')
print('\nTakeaway: CGF K(t)=log M(t) has a simpler form; its derivatives give cumulants directly.')

Exercise 6 ★★ — Jensen's Inequality and the KL Divergence

(a) Use Jensen's inequality to prove $\mathbb{E}[\log X] \leq \log \mathbb{E}[X]$ for any positive random variable $X$. Verify numerically for $X \sim \text{Exp}(1)$.

(b) Implement kl_divergence(p, q) for discrete distributions and verify $\text{KL}(p \| q) \geq 0$ with equality iff $p = q$.

(c) Show that the KL divergence is not symmetric: compute $\text{KL}(p \| q)$ and $\text{KL}(q \| p)$ for two specific distributions and verify they differ.

(d) For two Gaussians $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$, the KL divergence has a closed form:

$$\text{KL}(p \| q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

Verify by Monte Carlo and by numerical integration.

(e) Derive the ELBO inequality: for any distributions $q(z)$ and $p(x,z)$,

$$\log p(x) \geq \mathbb{E}_q[\log p(x,z)] - \mathbb{E}_q[\log q(z)] = \text{ELBO}$$

Verify numerically for a simple Gaussian model.
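
Both inequalities follow from the same Jensen step (the reference solution checks them numerically):

$$\text{KL}(p \| q) = \mathbb{E}_p\!\left[-\log\frac{q(X)}{p(X)}\right] \geq -\log \mathbb{E}_p\!\left[\frac{q(X)}{p(X)}\right] = -\log 1 = 0, \qquad \log p(x) = \text{ELBO} + \text{KL}\big(q(z) \,\|\, p(z \mid x)\big) \geq \text{ELBO}.$$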

Code cell 20

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 21

# Solution
import numpy as np
from scipy import stats
from scipy.integrate import quad
np.random.seed(42)

# (a)
X = np.random.exponential(1.0, 200000)
E_logX = np.log(X).mean()      # theory: -gamma ≈ -0.5772
log_EX = np.log(X.mean())      # log(E[X]) = log(1) = 0

# (b)
def kl_divergence(p, q, eps=1e-10):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > eps
    return np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps)))

p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

# (d)
def gaussian_kl(mu1, sigma1, mu2, sigma2):
    return (np.log(sigma2/sigma1)
            + (sigma1**2 + (mu1-mu2)**2) / (2*sigma2**2)
            - 0.5)

mu1, sigma1 = 1.0, 1.5
mu2, sigma2 = 2.0, 2.0
kl_closed = gaussian_kl(mu1, sigma1, mu2, sigma2)

# Numerical integration verification
def kl_integrand(x):
    p = stats.norm.pdf(x, mu1, sigma1)
    q = stats.norm.pdf(x, mu2, sigma2)
    if p < 1e-300 or q < 1e-300:
        return 0.0
    return p * np.log(p / q)

kl_numerical, _ = quad(kl_integrand, mu1 - 8*sigma1, mu1 + 8*sigma1)

# MC verification
samples_p = np.random.normal(mu1, sigma1, 500000)
log_p = stats.norm.logpdf(samples_p, mu1, sigma1)
log_q = stats.norm.logpdf(samples_p, mu2, sigma2)
kl_mc = (log_p - log_q).mean()

# (e) ELBO: log p(x) >= E_q[log p(x,z)] - E_q[log q(z)] = ELBO
# Simple model: p(z) = N(0,1), p(x|z) = N(z, 1), q(z) = N(mu_q, sigma_q^2)
# Marginal: p(x) = N(0, 2) (integrate out z)
x_obs = 1.5
log_px = stats.norm.logpdf(x_obs, 0, np.sqrt(2))  # true log p(x)
mu_q, sigma_q = 0.75, 0.7  # approximate posterior q(z)
z_samples = np.random.normal(mu_q, sigma_q, 100000)
log_pxz = stats.norm.logpdf(x_obs, z_samples, 1.0) + stats.norm.logpdf(z_samples, 0, 1.0)
log_qz  = stats.norm.logpdf(z_samples, mu_q, sigma_q)
elbo = (log_pxz - log_qz).mean()

header('Exercise 6: Jensen\'s Inequality and KL Divergence')
print(f'(a) Jensen: E[log X] = {E_logX:.4f} <= log(E[X]) = {log_EX:.4f}')
check_true('E[log X] <= log(E[X])', E_logX <= log_EX)
check_close('KL(p||p) = 0', kl_divergence(p, p), 0.0, tol=1e-8)
check_true('KL(p||q) > 0', kl_divergence(p, q) > 0)
print(f'\n(c) Asymmetry: KL(p||q)={kl_divergence(p,q):.4f}, KL(q||p)={kl_divergence(q,p):.4f}')
check_true('KL is asymmetric', abs(kl_divergence(p,q) - kl_divergence(q,p)) > 0.01)
print(f'\n(d) Gaussian KL: closed={kl_closed:.4f}, numerical={kl_numerical:.4f}, MC={kl_mc:.4f}')
check_close('closed == numerical', kl_closed, kl_numerical, tol=0.01)
check_close('closed == MC', kl_closed, kl_mc, tol=0.01)
print(f'\n(e) ELBO: log p(x) = {log_px:.4f}, ELBO = {elbo:.4f}')
check_true('ELBO <= log p(x)', elbo <= log_px + 0.01)
print('\nTakeaway: KL >= 0 by Jensen; the ELBO is a lower bound on log p(x) for any q.')

Exercise 7 ★★★ — Bias-Variance Decomposition

Consider estimating $f(x) = \sin(2\pi x)$ from $n$ noisy observations $y_i = f(x_i) + \varepsilon_i$, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$.

(a) Implement bias_variance_decomposition(degree, n_train, sigma, n_datasets, x_test) that:

  • Generates n_datasets independent training sets of size n_train
  • Fits a degree-$d$ polynomial to each
  • Returns bias², variance, and noise at each test point

(b) Verify the decomposition: $\text{MSE} = \text{Bias}^2 + \text{Variance} + \sigma^2$.

(c) Plot (or print) how bias² and variance change as the degree goes from 1 to 12 (with $n = 15$, $\sigma = 0.3$). Identify the optimal degree.

(d) Show the effect of increasing $n$: fit degree 6 for $n \in \{10, 20, 50, 100\}$. Which component dominates at small $n$? At large $n$?
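
For reference, the decomposition in part (b) comes from expanding the expected squared error over both the training-set draw $D$ and the test noise, with $y = f(x) + \varepsilon$:

$$\mathbb{E}_{D,\varepsilon}\big[(\hat{f}_D(x) - y)^2\big] = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D\big[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}.$$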

Code cell 23

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 24

# Solution
import numpy as np
np.random.seed(42)

def true_fn(x):
    return np.sin(2 * np.pi * x)

def poly_predict(x_train, y_train, x_test, degree):
    Phi_tr = np.column_stack([x_train**k for k in range(degree+1)])
    Phi_te = np.column_stack([x_test**k for k in range(degree+1)])
    w, _, _, _ = np.linalg.lstsq(Phi_tr, y_train, rcond=None)
    return Phi_te @ w

def bias_variance_decomp(degree, n_train, sigma, n_datasets=300, x_test=None):
    if x_test is None:
        x_test = np.linspace(0, 1, 50)
    f_test = true_fn(x_test)
    preds, sq_errs = [], []
    for _ in range(n_datasets):
        x_tr = np.linspace(0, 1, n_train)
        y_tr = true_fn(x_tr) + np.random.normal(0, sigma, n_train)
        pred = poly_predict(x_tr, y_tr, x_test, degree)
        preds.append(pred)
        # score against *noisy* test targets so the MSE includes the irreducible sigma^2 term
        y_te = f_test + np.random.normal(0, sigma, len(x_test))
        sq_errs.append((pred - y_te)**2)
    preds = np.array(preds)   # (n_datasets, len(x_test))
    mean_pred = preds.mean(axis=0)
    bias2    = np.mean((mean_pred - f_test)**2)
    variance = np.mean(preds.var(axis=0))
    noise    = sigma**2
    mse      = np.mean(sq_errs)  # averaged over datasets and test points
    return bias2, variance, noise, mse

header('Exercise 7: Bias-Variance Decomposition')

# (b) Verify decomposition
sigma = 0.3
bias2, var, noise, mse = bias_variance_decomp(4, 20, sigma)
print(f'Degree=4, n=20:')
print(f'  Bias^2={bias2:.4f}, Var={var:.4f}, Noise={noise:.4f}')
print(f'  Bias^2+Var+Noise={bias2+var+noise:.4f}, MSE={mse:.4f}')
check_close('MSE = Bias^2 + Var + Noise', mse, bias2+var+noise, tol=0.02)

# (c) Sweep degrees
print(f'\nDegree sweep (n=15, sigma=0.3):')
print(f'{"Degree":8s} {"Bias^2":10s} {"Variance":10s} {"Total MSE":10s}')
best_mse = float('inf'); best_d = 0
for d in range(1, 13):
    b2, v, ns, mse_d = bias_variance_decomp(d, 15, sigma)
    print(f'  d={d:2d}:  bias2={b2:.4f}  var={v:.4f}  mse={b2+v+ns:.4f}')
    if b2+v+ns < best_mse:
        best_mse = b2+v+ns; best_d = d
print(f'  => Optimal degree: {best_d}')

# (d) Effect of n
print(f'\nEffect of n (degree=6, sigma=0.3):')
for n in [10, 20, 50, 100]:
    b2, v, ns, mse_n = bias_variance_decomp(6, n, sigma)
    dom = 'bias dominates' if b2 > v else 'variance dominates'
    print(f'  n={n:4d}: bias2={b2:.4f}, var={v:.4f}  ({dom})')
print('\nTakeaway: Bias decreases with complexity; variance increases; optimal = minimum MSE.')

Exercise 8 ★★★ — Adam as Adaptive Moment Estimation

Adam tracks running estimates of the first and second moments of the gradient:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \approx \mathbb{E}[g_t], \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \approx \mathbb{E}[g_t^2]$$

with bias corrections $\hat{m}_t = m_t/(1-\beta_1^t)$, $\hat{v}_t = v_t/(1-\beta_2^t)$.

(a) Implement adam_optimizer(grad_fn, theta0, lr, n_steps) and minimise $f(\theta) = (\theta_1 - 3)^2 + 10(\theta_2 + 1)^2$.

(b) Show that without bias correction, Adam converges slowly at first (it underestimates the moments). Quantify the initialisation bias at $t = 1$: $m_1 = (1-\beta_1)g_1$ vs $\hat{m}_1 = g_1$.

(c) Implement SGD with the same learning rate for comparison and count the steps each optimiser needs to reach a small loss.

(d) Analyse the effective step $\alpha \cdot \hat{m}_t / \sqrt{\hat{v}_t + \varepsilon}$: show that its magnitude is bounded by approximately $\alpha$ per parameter, regardless of gradient scale.

(e) Demonstrate Adam's robustness to gradient scale: multiply the gradient by 100 and show that Adam still converges, while SGD with the same learning rate diverges.
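
A one-line calculation behind part (b), assuming for simplicity that the gradient distribution is stationary so that $\mathbb{E}[g_i] = \mathbb{E}[g]$ for all $i$:

$$m_t = (1-\beta_1)\sum_{i=1}^{t} \beta_1^{\,t-i} g_i \;\Rightarrow\; \mathbb{E}[m_t] = (1-\beta_1^t)\,\mathbb{E}[g], \qquad \text{so } \hat{m}_t = \frac{m_t}{1-\beta_1^t} \text{ is unbiased (and similarly for } \hat{v}_t\text{)}.$$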

Code cell 26

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 27

# Solution
import numpy as np
np.random.seed(42)

def f(theta):
    return (theta[0] - 3)**2 + 10*(theta[1] + 1)**2

def grad_f(theta, scale=1.0):
    return scale * np.array([2*(theta[0]-3), 20*(theta[1]+1)])

def adam(grad_fn, theta0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=500):
    theta = theta0.copy().astype(float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    losses = [f(theta)]
    for t in range(1, n_steps+1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction
        v_hat = v / (1 - beta2**t)  # bias correction
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
        losses.append(f(theta))
    return theta, np.array(losses)

def sgd(grad_fn, theta0, lr=0.01, n_steps=500):
    theta = theta0.copy().astype(float)
    losses = [f(theta)]
    for _ in range(n_steps):
        g = grad_fn(theta)
        theta -= lr * g
        losses.append(f(theta))
    return theta, np.array(losses)

theta0 = np.array([0.0, 0.0])

theta_adam, losses_adam = adam(grad_f, theta0)
theta_sgd, losses_sgd = sgd(grad_f, theta0, lr=0.01)

header('Exercise 8: Adam as Moment Estimation')
check_close('Adam converges: theta -> [3, -1]', theta_adam, np.array([3.0, -1.0]), tol=0.01)
check_close('SGD converges:  theta -> [3, -1]', theta_sgd, np.array([3.0, -1.0]), tol=0.01)

# (b) Bias correction analysis
beta1, beta2 = 0.9, 0.999
g1 = grad_f(theta0)
m1_no_corr = (1 - beta1) * g1
m1_corr    = m1_no_corr / (1 - beta1**1)
print(f'\n(b) Bias correction at t=1:')
print(f'  True gradient g1    = {g1}')
print(f'  m1 without correction = {m1_no_corr}  (only {100*(1-beta1):.0f}% of g1)')
print(f'  m1 with correction    = {m1_corr}     (= g1, fully corrected)')

# (c) Convergence comparison
print(f'\n(c) Adam vs SGD (500 steps):')
adam_steps_to_1e3 = next((i for i,l in enumerate(losses_adam) if l < 1e-3), None)
sgd_steps_to_1e3  = next((i for i,l in enumerate(losses_sgd)  if l < 1e-3), None)
print(f'  Adam: {adam_steps_to_1e3} steps to loss < 1e-3')
print(f'  SGD:  {sgd_steps_to_1e3} steps to loss < 1e-3')

# (d) Effective learning rate
eps = 1e-8
theta_test = np.array([1.0, 0.5])
g = grad_f(theta_test)
eff_lr = 0.01 * g / (np.sqrt(g**2) + eps)  # for a constant-sign gradient, m_hat/sqrt(v_hat) -> sign(g)
print(f'\n(d) Effective step size (sign of gradient, |step| ≈ lr):')
print(f'  gradient: {g}')
print(f'  effective step: {eff_lr}')
print(f'  |step| ≈ lr = 0.01: {np.allclose(np.abs(eff_lr), 0.01, atol=0.001)}')

# (e) Gradient scale robustness
theta_adam_scaled, losses_adam_scaled = adam(lambda t: grad_f(t, scale=100), theta0)
theta_sgd_scaled, losses_sgd_scaled = sgd(lambda t: grad_f(t, scale=100), theta0, lr=0.01)
print(f'\n(e) With gradient scale=100:')
print(f'  Adam final loss: {losses_adam_scaled[-1]:.6f}  (converges!)')
print(f'  SGD final loss:  {losses_sgd_scaled[-1]:.6f}  (diverges: nan or very large)')
check_close('Adam robust to scale', theta_adam_scaled, np.array([3.0, -1.0]), tol=0.05)
print('\nTakeaway: Adam estimates E[g] and E[g^2] per-parameter; effective LR ≈ lr regardless of gradient scale.')

Exercise 9 ★★★ — Mini-Batch Gradient Variance

Let per-example gradients $g_i$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. The mini-batch gradient is $\bar{g}_B = \frac{1}{B}\sum_{i=1}^B g_i$.

(a) Derive $\operatorname{Var}(\bar{g}_B) = \sigma^2/B$.

(b) Verify this scaling by simulation for several batch sizes.

(c) Explain the trade-off between gradient noise and compute efficiency.
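
The derivation in part (a) is two lines, using independence and the scaling property of variance:

$$\operatorname{Var}(\bar{g}_B) = \operatorname{Var}\!\Big(\frac{1}{B}\sum_{i=1}^B g_i\Big) = \frac{1}{B^2}\sum_{i=1}^B \operatorname{Var}(g_i) = \frac{\sigma^2}{B},$$

so the standard error of the mini-batch gradient shrinks only as $1/\sqrt{B}$.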

Code cell 29

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 30

# Solution
import numpy as np
np.random.seed(42)

mu, sigma = 0.3, 2.0
batch_sizes = [1, 4, 16, 64, 256]
n_trials = 20000

header('Exercise 9: Mini-Batch Gradient Variance')
for B in batch_sizes:
    batches = np.random.normal(mu, sigma, size=(n_trials, B))
    means = batches.mean(axis=1)
    empirical_var = means.var(ddof=0)
    theory_var = sigma**2 / B
    print(f'B={B:3d}: empirical Var={empirical_var:.4f}, theory={theory_var:.4f}')
    check_close(f'variance scales as 1/B for B={B}', empirical_var, theory_var, tol=0.06)
print('Takeaway: larger batches reduce estimator variance, but with diminishing returns proportional to 1/B.')

Exercise 10 ★★★ — Control Variates for Variance Reduction

Estimate $\mathbb{E}[e^X]$ for $X \sim \mathcal{N}(0,1)$. Use $C = X$ as a control variate with known mean $\mathbb{E}[C] = 0$.

(a) Implement the naive Monte Carlo estimator.

(b) Estimate the optimal control coefficient $b^* = \operatorname{Cov}(Y, C)/\operatorname{Var}(C)$, where $Y = e^X$.

(c) Compare the estimator variance with and without the control variate.

(d) Explain why variance reduction matters for policy gradients and variational inference.
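
The estimator in parts (b)–(c) is $Y_{\text{cv}} = Y - b\,(C - \mathbb{E}[C])$; a short calculation shows why it helps:

$$\mathbb{E}[Y_{\text{cv}}] = \mathbb{E}[Y], \qquad \operatorname{Var}(Y_{\text{cv}}) = \operatorname{Var}(Y) - 2b\,\operatorname{Cov}(Y, C) + b^2\operatorname{Var}(C),$$

which is minimised at $b^* = \operatorname{Cov}(Y, C)/\operatorname{Var}(C)$, giving $\operatorname{Var}(Y_{\text{cv}}) = (1 - \rho_{YC}^2)\,\operatorname{Var}(Y)$.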

Code cell 32

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 33

# Solution
import numpy as np
np.random.seed(42)

n = 200000
X = np.random.randn(n)
Y = np.exp(X)
C = X
b_star = np.cov(Y, C, ddof=0)[0, 1] / np.var(C)
Y_cv = Y - b_star * (C - 0.0)
true_value = np.exp(0.5)

header('Exercise 10: Control Variates')
print(f'True E[e^X]: {true_value:.4f}')
print(f'Naive estimate: {Y.mean():.4f}')
print(f'Control-variate estimate: {Y_cv.mean():.4f}')
print(f'Optimal b*: {b_star:.4f}')
print(f'Naive sample variance: {Y.var():.4f}')
print(f'Control-variate sample variance: {Y_cv.var():.4f}')
check_true('Control variate reduces variance', Y_cv.var() < Y.var())
check_close('Control-variate estimate is close to truth', Y_cv.mean(), true_value, tol=0.02)
print('Takeaway: subtracting a correlated zero-mean term keeps the expectation fixed while reducing Monte Carlo noise.')

What to Review After Finishing

  • LOTUS: $\mathbb{E}[g(X)] = \int g(x) f_X(x)\,dx$ — no need to find the distribution of $g(X)$
  • Linearity: $\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$ — no independence needed
  • Variance formula: $\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$
  • Tower property: $\mathbb{E}[\mathbb{E}[Y|X]] = \mathbb{E}[Y]$
  • Law of total variance: $\text{Var}(Y) = \mathbb{E}[\text{Var}(Y|X)] + \text{Var}(\mathbb{E}[Y|X])$
  • MGF: $M_X(t) = \mathbb{E}[e^{tX}]$; the $k$-th derivative at 0 gives the $k$-th raw moment
  • Jensen: convex $f \Rightarrow f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$; proves $\text{KL} \geq 0$
  • Bias-variance: $\text{MSE} = \text{Bias}^2 + \text{Var} + \sigma^2$
  • Adam: $m_t \approx \mathbb{E}[g_t]$, $v_t \approx \mathbb{E}[g_t^2]$; bias correction at small $t$
  • Mini-batch gradient variance — can you explain the probabilistic calculation and the ML relevance?
  • Control variates — can you connect the computation to model behavior?

References

  1. DeGroot & Schervish, Probability and Statistics (4th ed.) — Chapters 4–5
  2. Bishop, PRML (2006) — Appendix B (probability review)
  3. Goodfellow, Bengio & Courville, Deep Learning (2016) — Chapter 3
  4. Kingma & Ba, Adam: A Method for Stochastic Optimization (2015) — arXiv:1412.6980
  5. Hastie, Tibshirani & Friedman, ESL (2009) — §7.3 (bias-variance tradeoff)
  6. Belkin et al., Reconciling modern ML and the bias-variance tradeoff (2019) — double descent