Converted from exercises.ipynb for web reading.

Series and Sequences - Exercises

10 graded exercises covering the full section arc, from core calculus mechanics to ML-facing applications.

Each exercise uses three cells:

  • Problem — markdown cell with the task description
  • Your Solution — code cell for learner work
  • Solution — reference solution with checks
Difficulty: straightforward -> moderate -> challenging.

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
import numpy.linalg as la
from scipy import integrate, special, stats
from math import factorial
import matplotlib.patches as patches

COLORS = {
    "primary": "#0077BB",
    "secondary": "#EE7733",
    "tertiary": "#009988",
    "error": "#CC3311",
    "neutral": "#555555",
    "highlight": "#EE3377",
}
HAS_MPL = True
np.set_printoptions(precision=8, suppress=True)
np.random.seed(42)

def header(title):
    print("\n" + "=" * len(title))
    print(title)
    print("=" * len(title))

def check_true(name, cond):
    ok = bool(cond)
    print(f"{'PASS' if ok else 'FAIL'} - {name}")
    return ok

def check_close(name, got, expected, tol=1e-8):
    ok = np.allclose(got, expected, atol=tol, rtol=tol)
    print(f"{'PASS' if ok else 'FAIL'} - {name}: got {got}, expected {expected}")
    return ok



def centered_diff(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

def forward_diff(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

def backward_diff(f, x, h=1e-6):
    return (f(x) - f(x - h)) / h



def grad_check(f, x, analytic_grad, h=1e-6):
    x = np.asarray(x, dtype=float)
    analytic_grad = np.asarray(analytic_grad, dtype=float)
    numeric_grad = np.zeros_like(x, dtype=float)
    for idx in np.ndindex(x.shape):
        x_plus = x.copy(); x_minus = x.copy()
        x_plus[idx] += h; x_minus[idx] -= h
        numeric_grad[idx] = (f(x_plus) - f(x_minus)) / (2 * h)
    denom = la.norm(analytic_grad) + la.norm(numeric_grad) + 1e-12
    return la.norm(analytic_grad - numeric_grad) / denom



def check(name, got, expected, tol=1e-8):
    return check_close(name, got, expected, tol=tol)

print("Chapter helper setup complete.")

Exercise 1 (★): Geometric Series

The geometric series $\sum_{n=0}^{\infty} r^n = \frac{1}{1-r}$ for $|r| < 1$ is the foundation of Adam bias correction and reinforcement learning returns.

(a) Compute $\sum_{n=0}^{\infty} (2/3)^n$ using the closed form.

(b) Verify numerically that the partial sum $S_N \to 1/(1-r)$ as $N \to \infty$ (the partial-sum identity is sketched after part (d)).

(c) Compute the discounted return $G_0 = \sum_{t=0}^{\infty} \gamma^t \cdot 1$ for $\gamma = 0.95$ (constant reward = 1).

(d) For $r = -1/4$, compute $\sum_{n=0}^{\infty} (-1/4)^n$ exactly, then verify with 1000 partial terms.
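As a reminder (not part of the graded task), the partial-sum identity used in (b) follows from a one-line telescoping argument:

$$(1-r)\,S_N = (1-r)\sum_{n=0}^{N} r^n = 1 - r^{N+1} \quad\Longrightarrow\quad S_N = \frac{1-r^{N+1}}{1-r} \;\longrightarrow\; \frac{1}{1-r} \quad (|r|<1).$$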

Code cell 5

# Your Solution
# Exercise 1 - learner workspace
# Write your solution here, then run the reference solution below to compare.
print("Learner workspace ready for Exercise 1.")

Code cell 6

# Solution
# Exercise 1 - reference solution
r = 2/3

# (a) Closed form: 1/(1-r)
sum_a = 1 / (1 - r)

# (b) Partial sum: (1 - r^(N+1))/(1 - r)
def partial_geom_sum(r, N):
    return (1 - r**(N+1)) / (1 - r)

S_100 = partial_geom_sum(r, 100)

# (c) G0 = 1/(1-gamma)
gamma = 0.95
G0 = 1 / (1 - gamma)

# (d)
r_alt = -1/4
sum_d = 1 / (1 - r_alt)  # = 4/5 = 0.8
partial_d = sum(r_alt**n for n in range(1000))

header('Exercise 1: Geometric Series')
print(f'(a) sum (2/3)^n = {sum_a:.8f}')
check_close('(a) = 3.0', sum_a, 3.0)
check_close('(b) S_100 approx 3', S_100, 3.0, tol=1e-6)
print(f'(c) G0 = 1/(1-0.95) = {G0:.4f}')
check_close('(c) G0 = 20', G0, 20.0)
check_close('(d) exact = 0.8', sum_d, 0.8)
check_close('(d) partial sum converges to 0.8', partial_d, 0.8, tol=1e-6)

print("\nTakeaway: The geometric series formula 1/(1-r) is the foundation "
      "of Adam's bias correction and discounted RL returns.")

print("Exercise 1 solution complete.")

Exercise 2 (★): Convergence Tests

Determine whether each series converges or diverges. Implement a function that computes partial sums and confirms the verdict numerically.

(a) $\sum_{n=1}^{\infty} \frac{1}{n^{3/2}}$ — apply the p-series test

(b) $\sum_{n=1}^{\infty} \frac{n}{e^n}$ — apply the ratio test

(c) $\sum_{n=1}^{\infty} \frac{(-1)^n}{\sqrt{n}}$ — apply the alternating series test

(d) $\sum_{n=1}^{\infty} \frac{n!}{2^n}$ — apply the ratio test

For each, compute the first 100 partial sums and state the verdict. (The test statements are recalled below.)
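For reference, the standard statements of the tests in play, recalled here for convenience:

$$\text{Ratio test: } L = \lim_{n\to\infty}\left|\frac{a_{n+1}}{a_n}\right|; \quad L < 1 \Rightarrow \text{converges absolutely}, \quad L > 1 \Rightarrow \text{diverges}, \quad L = 1 \Rightarrow \text{inconclusive}.$$

$$\text{p-series: } \sum_{n=1}^{\infty} \frac{1}{n^p} \text{ converges iff } p > 1. \qquad \text{Alternating series: } \sum_n (-1)^n b_n \text{ converges if } b_n \downarrow 0.$$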

Code cell 8

# Your Solution
# Exercise 2 - learner workspace
# Write your solution here, then run the reference solution below to compare.
print("Learner workspace ready for Exercise 2.")

Code cell 9

# Solution
# Exercise 2 - reference solution

def partial_sums(a_fn, N):
    s = 0.0
    ps = []
    for n in range(1, N+1):
        s += a_fn(n)
        ps.append(s)
    return ps

from math import factorial, exp, sqrt

a_fn_a = lambda n: 1 / n**1.5
a_fn_b = lambda n: n / exp(n)
a_fn_c = lambda n: (-1)**n / sqrt(n)
a_fn_d = lambda n: factorial(n) / 2**n

S_a = partial_sums(a_fn_a, 100)
S_b = partial_sums(a_fn_b, 30)
S_c = partial_sums(a_fn_c, 100)
S_d = partial_sums(a_fn_d, 20)

# Ratio test for (b)
ratios_b = [abs(a_fn_b(n+1)/a_fn_b(n)) for n in range(1, 20)]
# Ratio test for (d)
ratios_d = [abs(a_fn_d(n+1)/a_fn_d(n)) for n in range(1, 15)]

header('Exercise 2: Convergence Tests')
print(f'(a) sum 1/n^1.5: S_100 = {S_a[-1]:.6f} (converges, p=1.5>1)')
check_true('(a) converges (S_100 < 5)', S_a[-1] < 5 and S_a[-1] > 2)
print(f'(b) sum n/e^n: S_30 = {S_b[-1]:.8f}, ratio -> {ratios_b[-1]:.6f} < 1 (converges)')
check_true('(b) ratio test L < 1', ratios_b[-1] < 0.5)
print(f'(c) sum (-1)^n/sqrt(n): S_100 = {S_c[-1]:.6f} (converges by AST)')
check_true('(c) converges (|S| < 1)', abs(S_c[-1]) < 1)
print(f'(d) sum n!/2^n: S_20 last term = {a_fn_d(20):.2e} -> inf (diverges by ratio L->inf)')
check_true('(d) diverges (ratio -> inf)', ratios_d[-1] > 5)

print('\nTakeaway: The ratio test is most powerful for factorials; '
      'p-series works directly; alternating series test handles signed series.')

print("Exercise 2 solution complete.")

Exercise 3 (★): Radius of Convergence

The radius of convergence $R$ determines where a power series is valid.

(a) Find $R$ for $\sum_{n=0}^{\infty} \frac{x^n}{3^n}$ using the ratio method (stated below).

(b) Find $R$ and the full interval of convergence for $\sum_{n=1}^{\infty} \frac{(-1)^n x^n}{n}$. Check both endpoints.

(c) Show that $\sum_{n=0}^{\infty} n!\,x^n$ has $R = 0$.

(d) For the power series $\ln(1+x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1} x^n}{n}$, verify numerically that the series converges at $x=1$ but diverges at $x=-1$.
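The ratio method used in (a)–(c), stated as a formula (a standard fact, recalled here):

$$R = \lim_{n\to\infty}\left|\frac{c_n}{c_{n+1}}\right| \quad\text{whenever this limit exists.}$$

For example, in (a) the coefficients are $c_n = 3^{-n}$, so $|c_n/c_{n+1}| = 3$ for every $n$ and $R = 3$ immediately.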

Code cell 11

# Your Solution
# Exercise 3 - learner workspace
# Write your solution here, then run the reference solution below to compare.
print("Learner workspace ready for Exercise 3.")

Code cell 12

# Solution
# Exercise 3 - reference solution

def radius_ratio(coeffs):
    """R = lim |c_n / c_{n+1}|"""
    ratios = [abs(coeffs[n]) / abs(coeffs[n+1])
              for n in range(len(coeffs)-1) if coeffs[n+1] != 0]
    return ratios[-1]

# (a) c_n = 1/3^n; R = lim |(1/3^n)/(1/3^(n+1))| = 3
c_a = [1/3**n for n in range(30)]
R_a = radius_ratio(c_a)

# (b) c_n = (-1)^n/n; |c_n/c_{n+1}| = (n+1)/n -> 1, so R = 1
# (use many terms so the last ratio is close to the limit)
c_b = [(-1)**n / n for n in range(1, 2001)]
R_b = radius_ratio(c_b)

# (c) c_n = n!; R = lim |c_n/c_{n+1}| = lim 1/(n+1) = 0, so R = 0
c_c = [float(factorial(n)) for n in range(1, 15)]
R_c_ratios = [c_c[n]/c_c[n+1] for n in range(len(c_c)-1)]

# (d) ln(1+x) series
def log1p_series(x, N):
    return sum((-1)**(n+1) * x**n / n for n in range(1, N+1))

S_x1 = log1p_series(1.0, 1000)    # should converge to ln(2)
S_xm1 = log1p_series(-1.0, 1000)  # should diverge (harmonic)

header('Exercise 3: Radius of Convergence')
print(f'(a) R = {R_a:.4f} (should be 3.0)')
check_close('(a) R = 3', R_a, 3.0)
print(f'(b) R = {R_b:.4f} (should be 1.0)')
check_close('(b) R = 1', R_b, 1.0, tol=1e-3)  # last ratio is 2000/1999, not the exact limit
print(f'(c) ratios |c_n/c_(n+1)| -> {R_c_ratios[-1]:.6f}; the limit is 0, so R = 0')
check_true('(c) R -> 0', R_c_ratios[-1] < 0.1)
print(f'(d) At x=1: S_1000 = {S_x1:.8f}, ln(2) = {np.log(2):.8f}')
check_close('(d) ln(1+1) via series', S_x1, np.log(2), tol=1e-3)
print(f'(d) At x=-1: S_1000 = {S_xm1:.4f} (harmonic series, diverges to -inf)')
check_true('(d) x=-1 diverges (S < -5)', S_xm1 < -5)

print('\nTakeaway: Radius of convergence from ratio test; must check endpoints separately.')

print("Exercise 3 solution complete.")

Exercise 4 (★★): Taylor Series Derivation

Derive and verify the Maclaurin series for elementary functions.

(a) Derive the series for $e^{-x}$ by computing $c_n = f^{(n)}(0)/n!$.

(b) Starting from $e^x = \sum x^n/n!$, substitute $x \to x^2$ to get $e^{x^2}$, then $x \to -x^2$ to get $e^{-x^2}$.

(c) Integrate the series for $e^{-t^2}$ term by term from 0 to $x$ to get the series for $\int_0^x e^{-t^2}\,dt$ (written out below).

(d) How many terms are needed to compute $\int_0^{0.5} e^{-t^2}\,dt$ to within $10^{-8}$?
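The term-by-term integration in (c), written out (this is the identity the reference solution implements):

$$\int_0^x e^{-t^2}\,dt = \int_0^x \sum_{n=0}^{\infty} \frac{(-1)^n t^{2n}}{n!}\,dt = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{n!\,(2n+1)}.$$

The result is an alternating series, so the truncation error in (d) is bounded by the first omitted term.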

Code cell 14

# Your Solution
# Exercise 4 - learner workspace
# Write your solution here, then run the reference solution below to compare.
print("Learner workspace ready for Exercise 4.")

Code cell 15

# Solution
# Exercise 4 - reference solution

import numpy as np
from math import factorial

# (a) e^(-x) = sum (-1)^n x^n / n!
def taylor_exp_neg_x(x, N):
    return sum((-1)**n * x**n / factorial(n) for n in range(N))

# (b) e^(x^2) = sum x^(2n)/n!  and  e^(-x^2) = sum (-1)^n x^(2n)/n!
def taylor_exp_x_squared(x, N):
    return sum(x**(2*n) / factorial(n) for n in range(N))

def taylor_exp_neg_x_squared(x, N):
    return sum((-1)**n * x**(2*n) / factorial(n) for n in range(N))

# (c) Integrate e^(-t^2) = sum (-1)^n t^(2n)/n! term by term:
#   integral_0^x = sum (-1)^n x^(2n+1) / (n! * (2n+1))
def integral_exp_neg_t_squared(x, N):
    return sum((-1)**n * x**(2*n+1) / (factorial(n) * (2*n+1)) for n in range(N))

# (d) True value = sqrt(pi)/2 * erf(0.5)
x0 = 0.5
try:
    from scipy.special import erf
    true_val = np.sqrt(np.pi)/2 * erf(x0)
except ImportError:
    # Fallback: use the series itself, truncated far past convergence
    true_val = integral_exp_neg_t_squared(x0, 50)

# Find minimum N
for N in range(1, 30):
    approx = integral_exp_neg_t_squared(x0, N)
    err = abs(approx - true_val)
    if err < 1e-8:
        min_N = N
        break

header('Exercise 4: Taylor Series Derivation')
check_close('(a) e^(-0.5)', taylor_exp_neg_x(0.5, 20), np.exp(-0.5))
check_close('(b) e^(0.5^2)', taylor_exp_x_squared(0.5, 20), np.exp(0.25))
check_close('(b) e^(-0.5^2)', taylor_exp_neg_x_squared(0.5, 20), np.exp(-0.25))
print(f'(c) integral_0^0.5 e^(-t^2) dt (N=10): '
      f'{integral_exp_neg_t_squared(0.5, 10):.10f}')
check_close('(c) matches true value', integral_exp_neg_t_squared(0.5, 10), true_val)
print(f'(d) Minimum N for error < 1e-8: N = {min_N}')
check_true('(d) error < 1e-8', abs(integral_exp_neg_t_squared(x0, min_N) - true_val) < 1e-8)

print('\nTakeaway: Substitution and term-by-term integration produce new series; '
      'alternating series error bound = first omitted term.')

print("Exercise 4 solution complete.")

Exercise 5 (★★): Lagrange Remainder and Error Bounds

(a) Write the 4th-order Taylor polynomial for $\ln(1+x)$ at $a=0$.

(b) Use the Lagrange remainder $|R_n(x)| \le \frac{M}{(n+1)!}|x|^{n+1}$ to bound the error $|\ln(1.2) - T_4(0.2)|$. (Hint: $f^{(5)}(x) = 4!/(1+x)^5$; bound it on $[0, 0.2]$. The bound is evaluated below.)

(c) Compute the actual error and confirm the bound holds.

(d) How many terms of the $\ln(1+x)$ series are needed to compute $\ln(1.1)$ to 10 decimal places? (Hint: alternating series error $\le$ first omitted term.)
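Spelled out for reference (the same numbers the solution cell checks):

$$T_4(x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4}, \qquad |R_4(0.2)| \le \frac{\max_{\xi\in[0,0.2]}|f^{(5)}(\xi)|}{5!}\,(0.2)^5 = \frac{24 \cdot 0.2^5}{120} = 6.4\times 10^{-5}.$$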

Code cell 17

# Your Solution
# Exercise 5 - learner workspace
# Write your solution here, then run the reference solution below to compare.
print("Learner workspace ready for Exercise 5.")

Code cell 18

# Solution
# Exercise 5 - reference solution

import numpy as np
from math import factorial

# (a) T4(x) = x - x^2/2 + x^3/3 - x^4/4
def T4_log(x):
    return x - x**2/2 + x**3/3 - x**4/4

# (b) f^(5)(x) = 4!/(1+x)^5; M = 24 at x=0 (max on [0,0.2])
x0 = 0.2
n = 4
M = factorial(4)  # = 24, since f^(5)(x) = 4!/(1+x)^5 <= 24 on [0, 0.2]
lagrange_bound = M * x0**(n+1) / factorial(n+1)  # M * 0.2^5 / 5!

# (c) Actual error
actual_err = abs(np.log(1.2) - T4_log(x0))

# (d) Alternating series: |error| <= |first omitted term| = x^(N+1)/(N+1)
x_target = 0.1
target_tol = 5e-11  # 10 decimal places
min_N = 1
while x_target**(min_N+1) / (min_N+1) >= target_tol:
    min_N += 1

header('Exercise 5: Lagrange Remainder')
print(f'(a) T4(0.2) = {T4_log(x0):.10f}')
print(f'    ln(1.2) = {np.log(1.2):.10f}')
print(f'(b) Lagrange bound = {lagrange_bound:.4e}')
print(f'(c) Actual error   = {actual_err:.4e}')
check_true('(c) actual error <= Lagrange bound', actual_err <= lagrange_bound)
print(f'(d) Min N for 10 decimal places: N = {min_N}')
# Verify
approx_ln11 = sum((-1)**(n+1) * 0.1**n / n for n in range(1, min_N+1))
check_close('(d) ln(1.1) to 10 places', approx_ln11, np.log(1.1), tol=1e-10)

print('\nTakeaway: Lagrange remainder gives guaranteed error bounds; '
      'for alternating series, error <= first omitted term (often tighter).')

print("Exercise 5 solution complete.")

Exercise 6 (★★): GELU Taylor Analysis

The GELU activation is $\text{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the Gaussian CDF.

(a) Using $e^{-t^2} = \sum_n (-1)^n t^{2n}/n!$ and term-by-term integration, show that $\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}}\sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{n!\,(2n+1)}$.

(b) Compute the first 4 non-zero terms of the GELU series, using $\text{GELU}(x) = x \cdot \frac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$ (this identity is recalled below).

(c) Compare GELU, its Taylor approximation (4 terms), and ReLU for $x \in [-3, 3]$.

(d) At what $|x|$ does the 4-term approximation exceed 1% relative error?
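The identity linking $\Phi$ and $\operatorname{erf}$ that (b) relies on, recalled for convenience:

$$\Phi(x) = \frac{1}{2}\left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right), \qquad\text{so}\qquad \operatorname{GELU}(x) = \frac{x}{2}\left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right).$$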

Code cell 20

# Your Solution
# Exercise 6 - learner workspace
# Write your solution here, then run the reference solution below to compare.
print("Learner workspace ready for Exercise 6.")

Code cell 21

# Solution
# Exercise 6 - reference solution

import numpy as np
from math import factorial, sqrt, pi

try:
    from scipy.special import erf as scipy_erf
    HAS_SCIPY = True
except ImportError:
    HAS_SCIPY = False

# (a) erf(x) = (2/sqrt(pi)) * sum (-1)^n x^(2n+1) / (n! * (2n+1))
def erf_series(x, N):
    coeff = 2 / sqrt(pi)
    return coeff * sum((-1)**n * x**(2*n+1) / (factorial(n) * (2*n+1)) for n in range(N))

# (b) GELU(x) = x * 0.5 * (1 + erf(x/sqrt(2)))
def gelu_series(x, N):
    return x * 0.5 * (1 + erf_series(x / sqrt(2), N))

# (c) Comparison
x_arr = np.linspace(-3, 3, 300)

if HAS_SCIPY:
    gelu_exact = x_arr * 0.5 * (1 + scipy_erf(x_arr / sqrt(2)))
else:
    gelu_exact = np.vectorize(lambda x: gelu_series(x, 30))(x_arr)

gelu_4terms = np.vectorize(lambda x: gelu_series(x, 4))(x_arr)
relu = np.maximum(0, x_arr)

if HAS_MPL:
    plt.figure(figsize=(9, 5))
    plt.plot(x_arr, gelu_exact, 'b-', lw=2, label='GELU (exact)')
    plt.plot(x_arr, gelu_4terms, 'r--', lw=2, label='GELU (4-term Taylor)')
    plt.plot(x_arr, relu, 'g:', lw=2, label='ReLU')
    plt.legend(); plt.title('GELU vs Taylor Approximation vs ReLU')
    plt.xlabel('x'); plt.show()

# (d) Find 1% crossover
x_test = np.linspace(0, 3, 1000)
gelu_ex = np.vectorize(lambda x: gelu_series(x, 30))(x_test)
gelu_4 = np.vectorize(lambda x: gelu_series(x, 4))(x_test)
rel_err = np.abs(gelu_ex - gelu_4) / (np.abs(gelu_ex) + 1e-10)
crossover = x_test[rel_err > 0.01][0] if (rel_err > 0.01).any() else float('inf')

header('Exercise 6: GELU Taylor Analysis')
check_close('(a) erf(0.5)', erf_series(0.5, 15),
            scipy_erf(0.5) if HAS_SCIPY else erf_series(0.5, 30), tol=1e-6)
# Compare at exactly x = 1.0 (the grid x_arr need not contain 1.0)
gelu1_exact = 0.5 * (1 + scipy_erf(1/sqrt(2))) if HAS_SCIPY else gelu_series(1.0, 30)
check_close('(b) GELU(1.0)', gelu_series(1.0, 15), gelu1_exact, tol=1e-4)
print(f'(d) 1% relative error crossover at |x| ≈ {crossover:.2f}')
check_true('(d) crossover > 1', crossover > 1.0)

print('\nTakeaway: GELU is smooth everywhere; its Taylor series works well for |x| < ~2.')

print("Exercise 6 solution complete.")

Exercise 7 (★★★): Linear Attention via Taylor Approximation

Standard attention computes $\text{Attn}(Q,K,V) = \text{softmax}(QK^\top/\sqrt{d})\,V$, which costs $O(n^2 d)$. A key idea in linear attention is to approximate $e^{q^\top k} \approx \phi(q)^\top \phi(k)$ using a feature map $\phi$.

(a) Show that the 1st-order Taylor expansion of $e^x$ gives $e^{q^\top k/\sqrt{d}} \approx 1 + q^\top k/\sqrt{d}$. What is the error (2nd-order remainder)?

(b) Using the approximation $e^{q^\top k} \approx 1 + q^\top k$, rewrite un-normalized attention as $\sum_j (1 + q^\top k_j)\,v_j$. Show this decomposes into $\sum_j v_j + q^\top \big(\sum_j k_j v_j^\top\big)$ — meaning the $k_j, v_j$ cumulative sums can be precomputed (the decomposition is written out below).

(c) Simulate: generate random $Q, K, V \in \mathbb{R}^{n \times d}$, compute exact softmax attention and the linear approximation, and measure the approximation error.

(d) How does the approximation error scale with sequence length $n$ and dimension $d$?
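The decomposition in (b), written out — linearity of the sum is all that is needed:

$$\sum_{j=1}^{n} (1 + q^\top k_j)\,v_j = \sum_{j=1}^{n} v_j + q^\top \Big(\sum_{j=1}^{n} k_j v_j^\top\Big),$$

where $\sum_j v_j$ and the $d \times d$ matrix $\sum_j k_j v_j^\top$ are computed once and shared across all queries. Each query then costs $O(d^2)$ instead of $O(nd)$, giving $O(n d^2)$ attention overall.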

Code cell 23

# Your Solution
# Exercise 7 - learner workspace
# Write your solution here, then run the reference solution below to compare.
print("Learner workspace ready for Exercise 7.")

Code cell 24

# Solution
# Exercise 7 - reference solution

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def exact_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ V

def linear_attention(Q, K, V):
    # (a)/(b): e^(q^T k / sqrt(d)) ~ 1 + q^T k / sqrt(d)
    # => un-normalized attention = sum_j (1 + q^T k_j / sqrt(d)) v_j
    n, d = Q.shape
    Qs = Q / np.sqrt(d)  # fold the 1/sqrt(d) scaling into the queries, matching exact_attention
    # Kernel: phi(x) = [1, x]  (concat constant 1)
    # Un-normalized: sum_j v_j + Qs @ (K^T V)
    sum_v = V.sum(axis=0, keepdims=True)          # (1, d)
    KtV = K.T @ V                                  # (d, d), precomputed once
    unnorm = sum_v + Qs @ KtV                      # (n, d)
    # Normalize by sum of kernel weights = sum_j (1 + q^T k_j/sqrt(d)) = n + q^T sum(k_j)/sqrt(d)
    sum_k = K.sum(axis=0, keepdims=True)           # (1, d)
    norm = n + (Qs @ sum_k.T)                      # (n, 1)
    return unnorm / norm

n, d = 32, 16
np.random.seed(42)
Q = np.random.randn(n, d) / np.sqrt(d)
K = np.random.randn(n, d) / np.sqrt(d)
V = np.random.randn(n, d)

exact = exact_attention(Q, K, V)
linear = linear_attention(Q, K, V)
err = np.abs(exact - linear).max()

# (d) Error scaling with sequence length n (head dim d held fixed at 16)
errors = {}
for nn in [16, 32, 64]:
    Q2 = np.random.randn(nn, d) / np.sqrt(d)
    K2 = np.random.randn(nn, d) / np.sqrt(d)
    V2 = np.random.randn(nn, d)
    e2 = np.abs(exact_attention(Q2, K2, V2) - linear_attention(Q2, K2, V2)).mean()
    errors[nn] = e2

header('Exercise 7: Linear Attention via Taylor')
print(f'Max abs error (n=32, d=16): {err:.4f}')
check_true('exact and linear have same shape', exact.shape == linear.shape)
print('Error vs n:', {k: f'{v:.4f}' for k, v in errors.items()})
check_true('error < 0.5 for small n, d', err < 0.5)

print('\nTakeaway: 1st-order Taylor of exp enables O(nd^2) attention, but accuracy '
      'degrades for large sequence lengths and sharp attention distributions.')

print("Exercise 7 solution complete.")

Exercise 8 (★★★): Adam Bias Correction via Geometric Series

Adam maintains exponential moving averages: $m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$. Starting from $m_0 = 0$, unrolling gives $m_t = (1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i}g_i$.

(a) If all gradients equal $g$ (constant), show that $m_t = g(1 - \beta_1^t)$. (Use the geometric series formula; the computation is sketched below.)

(b) Hence $\mathbb{E}[m_t] = \mathbb{E}[g](1-\beta_1^t)$, so the bias-corrected estimate is $\hat{m}_t = m_t / (1-\beta_1^t)$. Verify numerically for $\beta_1 = 0.9$ and $t = 1, 5, 10, 100$.

(c) Plot the bias factor $(1-\beta_1^t)$ for $t = 1, \ldots, 50$ and $\beta_1 \in \{0.9, 0.99, 0.999\}$. Explain why $\beta_1 = 0.999$ requires many more steps to 'warm up'.

(d) Implement a minimal Adam update step and verify that the loss decreases on the quadratic $f(\theta) = \|\theta\|^2$.
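The computation behind (a), for constant gradients $g_i = g$ (the geometric series from Exercise 1 with ratio $\beta_1$):

$$m_t = (1-\beta_1)\,g \sum_{i=1}^{t} \beta_1^{t-i} = (1-\beta_1)\,g \sum_{k=0}^{t-1} \beta_1^{k} = (1-\beta_1)\,g\,\frac{1-\beta_1^t}{1-\beta_1} = g\,(1-\beta_1^t).$$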

Code cell 26

# Your Solution
# Exercise 8 - learner workspace
# Write your solution here, then run the reference solution below to compare.
print("Learner workspace ready for Exercise 8.")

Code cell 27

# Solution
# Exercise 8 - reference solution

import numpy as np

# (a) Geometric series: sum_{i=1}^t beta^(t-i) = (1-beta^t)/(1-beta)
# m_t = (1-beta1) * g * (1-beta1^t)/(1-beta1) = g*(1-beta1^t)

def bias_factor(beta, t):
    return 1 - beta**t

beta1 = 0.9
print('(b) Bias factors for beta1=0.9:')
for t in [1, 5, 10, 100]:
    bf = bias_factor(beta1, t)
    print(f'  t={t:3d}: 1-beta^t = {bf:.6f}, correction = {1/bf:.4f}')

if HAS_MPL:
    t_arr = np.arange(1, 51)
    plt.figure(figsize=(9, 4))
    for beta, color in zip([0.9, 0.99, 0.999], ['blue', 'orange', 'red']):
        plt.plot(t_arr, [bias_factor(beta, t) for t in t_arr],
                 color=color, label=f'beta={beta}')
    plt.axhline(0.99, ls='--', color='gray', label='99% warmup')
    plt.xlabel('Step t'); plt.ylabel('Bias factor 1-β^t')
    plt.title('Adam Bias Correction Warmup'); plt.legend(); plt.show()

# (d) Minimal Adam on f(theta) = ||theta||^2
def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
losses = []
for t in range(1, 201):
    grad = 2 * theta  # gradient of ||theta||^2
    theta, m, v = adam_step(theta, grad, m, v, t)
    losses.append(np.sum(theta**2))

header('Exercise 8: Adam Bias Correction')
check_close('(a) bias_factor(0.9, 1)', bias_factor(0.9, 1), 0.1)
check_close('(a) bias_factor(0.9, 100)', bias_factor(0.9, 100), 1.0, tol=1e-5)
check_true('(d) loss decreased', losses[-1] < losses[0])
check_true('(d) converged to near 0', losses[-1] < 1e-4)
print(f'(d) Initial loss: {losses[0]:.4f}, Final loss: {losses[-1]:.2e}')

print('\nTakeaway: Adam bias correction = dividing by (1-beta^t), derived directly '
      'from geometric series; large beta requires many warmup steps.')

print("Exercise 8 solution complete.")

Exercise 9 (★★★): Neumann Series for an Inverse

If the spectral radius satisfies $\rho(A) < 1$, then

$$(I-A)^{-1} = \sum_{k=0}^{\infty} A^k.$$

Implement truncated Neumann sums and verify that the approximation improves as the number of terms grows. (The telescoping identity behind this is sketched below.)
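Why truncation converges, sketched by multiplying the partial sum by $I - A$ and telescoping:

$$(I-A)\sum_{k=0}^{K} A^k = I - A^{K+1} \quad\Longrightarrow\quad (I-A)^{-1} - \sum_{k=0}^{K} A^k = (I-A)^{-1} A^{K+1} \to 0,$$

since $\|A^{K+1}\| \to 0$ whenever $\rho(A) < 1$.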

Code cell 29

# Your Solution
# Exercise 9 - learner workspace
# Build partial sums I + A + ... + A^K and compare with np.linalg.inv(I-A).
print("Learner workspace ready for Exercise 9.")

Code cell 30

# Solution
# Exercise 9 - Neumann inverse series
header("Exercise 9: Neumann series")

A = np.array([[0.20, 0.05], [-0.10, 0.15]])
I = np.eye(2)
true_inv = la.inv(I - A)

def neumann_inverse(A, K):
    S = np.eye(A.shape[0])
    power = np.eye(A.shape[0])
    for _ in range(1, K + 1):
        power = power @ A
        S = S + power
    return S

prev = np.inf
for K in [1, 2, 4, 8, 16]:
    approx = neumann_inverse(A, K)
    err = la.norm(approx - true_inv, ord='fro')
    print(f"K={K:2d}: Frobenius error={err:.3e}")
    check_true("error is decreasing", err < prev)
    prev = err
print("Takeaway: many iterative solvers and residual refinements are controlled infinite series.")

Exercise 10 (★★★): Taylor Approximation of GELU

The GELU activation can be written

$$\operatorname{GELU}(x) = x\,\Phi(x),$$

where $\Phi$ is the standard normal CDF. Approximate GELU near zero by fitting a polynomial and compare the approximation error on a small interval. (The exact Maclaurin coefficients are listed below for comparison.)
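For comparison with the fitted coefficients, the exact Maclaurin expansion of GELU — obtained from the $\operatorname{erf}$ series of Exercise 6 via $\Phi(x) = \frac{1}{2}\left(1+\operatorname{erf}(x/\sqrt{2})\right)$ — begins

$$\operatorname{GELU}(x) = \frac{x}{2} + \frac{x^2}{\sqrt{2\pi}} - \frac{x^4}{6\sqrt{2\pi}} + \frac{x^6}{40\sqrt{2\pi}} - \cdots$$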

Code cell 32

# Your Solution
# Exercise 10 - learner workspace
# Fit a polynomial approximation to GELU around zero and inspect its error.
print("Learner workspace ready for Exercise 10.")

Code cell 33

# Solution
# Exercise 10 - GELU polynomial approximation
header("Exercise 10: GELU polynomial approximation")

def gelu(x):
    return x * stats.norm.cdf(x)

xs_fit = np.linspace(-1.0, 1.0, 201)
ys_fit = gelu(xs_fit)
coef = np.polyfit(xs_fit, ys_fit, deg=5)
poly = np.poly1d(coef)
xs_test = np.linspace(-1.25, 1.25, 251)
err = np.max(np.abs(poly(xs_test) - gelu(xs_test)))
print("degree-5 coefficients:", coef)
print(f"max error on [-1.25,1.25]: {err:.6e}")
check_true("degree-5 local approximation is accurate", err < 2e-3)
print("Takeaway: activation approximations are series ideas made computational.")

What to Review After Finishing

  • Geometric series — closed form $S = a/(1-r)$ and partial sum $S_N = a(1-r^N)/(1-r)$
  • Convergence tests — know when to apply the ratio, integral, comparison, and alternating series tests
  • Radius of convergence — the formula $R = 1/\limsup_{n} |c_n|^{1/n}$; behavior at the endpoints
  • Taylor series — derivation from $c_n = f^{(n)}(a)/n!$; the standard series table
  • Lagrange remainder — guaranteeing error bounds, not just computing approximations
  • ML connections — Adam (geometric series), linear attention (Taylor of exp), GELU (erf Taylor)

References

  1. Rudin, Principles of Mathematical Analysis, Ch. 3 (sequences) and Ch. 8 (series)
  2. Apostol, Calculus Vol. 1, Ch. 11–12 (power series, Taylor series)
  3. Kingma & Ba, Adam: A Method for Stochastic Optimization (2015) — geometric series in §2
  4. Katharopoulos et al., Transformers are RNNs (2020) — linear attention via Taylor of exp
  5. Hendrycks & Gimpel, Gaussian Error Linear Units (GELUs) (2016)