Theory Notebook
Converted from theory.ipynb for web reading.
Summation & Product Notation — The Syntax of Quantitative AI
"Virtually every formula in machine learning — loss functions, gradients, attention scores, probability distributions, normalisation constants — is a sum or a product. Fluency with Σ and Π is the difference between reading ML papers fluidly and decoding them symbol by symbol."
This notebook is the interactive companion to notes.md. It demonstrates every concept from all 16 sections with runnable Python and NumPy code, connecting each mathematical idea to its concrete use in modern AI systems (transformers, LLMs, diffusion models, RAG pipelines).
| Section | Topic | What You'll Build |
|---|---|---|
| 1 | Intuition | Why Σ/Π exist; where they appear in AI; historical timeline |
| 2 | Summation — Formal Definitions | Recursive definition, index sets, multiple indices |
| 3 | Product Notation — Formal Definitions | Π basics, empty product, factorial, log-likelihood |
| 4 | Algebraic Properties of Summation | Linearity, splitting, reindexing, telescoping, interchange |
| 5 | Algebraic Properties of Products | Splitting, constants, log↔product conversion |
| 6 | Important Summation Formulas | Arithmetic/geometric series, Taylor series, harmonic numbers |
| 7 | Summation Bounds & Estimation | Comparison, integral test, Cauchy-Schwarz, Stirling |
| 8 | Special Summation Patterns in AI | Softmax, cross-entropy, attention, perplexity, BM25 |
| 9 | Einstein Summation Convention | Implicit sums, einsum in PyTorch/NumPy, tensor contraction |
| 10 | Double Sums & Index Interaction | Diagonal sums, order interchange, Cauchy product |
| 11 | Convergence of Infinite Series | Partial sums, absolute convergence, tests, power series |
| 12 | Summation in Probability & Statistics | Expectation, variance, entropy, KL divergence |
| 13–14 | Notation Variants & Common Mistakes | ML conventions, subscript/superscript, pitfalls table |
| 15 | Exercises | 8 progressive exercises with scaffolds and solutions |
| 16 | Why This Matters for AI (2026) | Impact table and conceptual bridge |
Prerequisites: Python, NumPy. PyTorch optional but recommended for einsum demos.
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 13,
"axes.titlesize": 15,
"axes.labelsize": 13,
"xtick.labelsize": 11,
"ytick.labelsize": 11,
"legend.fontsize": 11,
"legend.framealpha": 0.85,
"lines.linewidth": 2.0,
"axes.spines.top": False,
"axes.spines.right": False,
"savefig.bbox": "tight",
"savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
import math
import warnings
from typing import List, Tuple, Callable
np.random.seed(42)
np.set_printoptions(precision=8, suppress=True)
# Optional: PyTorch for einsum and tensor demos
try:
import torch
torch.manual_seed(42)
HAS_TORCH = True
print(f'NumPy {np.__version__} | PyTorch {torch.__version__}')
if torch.cuda.is_available():
print(f'GPU: {torch.cuda.get_device_name()}')
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
print('Device: Apple Silicon (MPS)')
else:
print('Device: CPU')
except ImportError:
HAS_TORCH = False
print(f'NumPy {np.__version__} | PyTorch not installed (NumPy demos still work)')
# Optional: matplotlib for visualisations
try:
import matplotlib.pyplot as plt
HAS_PLT = True
except ImportError:
HAS_PLT = False
print('matplotlib not installed — plots will be skipped')
1. Intuition — What Summation and Product Notation Are, and Why They Exist
1.1 What Is Summation and Product Notation?
Summation ($\Sigma$) and product ($\Pi$) notation are compact symbolic languages for expressing repeated addition and multiplication over collections of terms.
- Without them: writing the loss over a million training examples requires a million terms
- With them: one line — $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta, x_i)$
They are not merely abbreviations — they are precise mathematical objects with well-defined rules for manipulation, interchange, and transformation.
1.2 Why This Notation Exists
| Problem | Without Notation | With Notation |
|---|---|---|
| Sum of squares 1 to n | $1^2 + 2^2 + 3^2 + \cdots + n^2$ | $\sum_{i=1}^{n} i^2$ |
| Joint probability | $P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_1, t_2)\cdots$ | $\prod_{i=1}^{n} P(t_i \mid t_{<i})$ |
| Attention output | $\alpha_{i1}V_1 + \alpha_{i2}V_2 + \cdots + \alpha_{in}V_n$ | $\sum_{j=1}^{n} \alpha_{ij} V_j$ |
| Gradient of batch loss | $\nabla_\theta \ell_1 + \nabla_\theta \ell_2 + \cdots + \nabla_\theta \ell_N$ | $\sum_{i=1}^{N} \nabla_\theta \ell_i$ |
Compact notation reveals structure, enables algebraic manipulation, and generalises naturally from finite sums → infinite series → integrals.
1.3 Where These Appear in AI
Every core formula in modern ML uses Σ or Π:
| Formula | Expression |
|---|---|
| Cross-entropy loss | $\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log P(y_i \mid x_i)$ |
| Softmax | $P(v) = \dfrac{e^{z_v}}{\sum_{v'} e^{z_{v'}}}$ |
| Attention | $\mathrm{Attn}(Q,K,V)_i = \sum_j \alpha_{ij} V_j$ |
| Dot product | $q \cdot k = \sum_i q_i k_i$ |
| Matrix multiply | $(AB)_{ij} = \sum_k A_{ik} B_{kj}$ |
| Perplexity | $\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_i \log P(t_i \mid t_{<i})\right)$ |
| Likelihood | $P(\mathrm{seq}) = \prod_i P(t_i \mid t_{<i})$ |
1.4 The Index as a Bound Variable
The index variable ($i$ in $\sum_{i=1}^{n} f(i)$) is a bound variable — it has no meaning outside the sum. $\sum_{i=1}^{n} f(i)$ and $\sum_{j=1}^{n} f(j)$ are identical expressions (alpha-equivalence).
Critical for AI: $\sum_i \alpha_{ij}$ and $\sum_j \alpha_{ij}$ are completely different sums!
1.5 The Relationship to Integration
DISCRETE → CONTINUOUS
∑ᵢ₌₁ⁿ f(i) → ∫₁ⁿ f(x) dx
Index i ∈ {1,...,n} → x ∈ [1, n]
Step size 1 → Infinitesimal dx
∑ → ∫ as step → 0 (Riemann sum definition)
∏ᵢ f(i) → exp(∫ log f(x) dx)
This bridge is not just analogy — Riemann integration is literally defined as a limit of summation.
Code cell 5
# ══════════════════════════════════════════════════════════════════
# 1.1–1.3 WHY Σ AND Π EXIST — DEMONSTRATION
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("WITHOUT VS WITH SUMMATION NOTATION")
print("=" * 65)
# Without notation: writing out all terms
terms_without = [i**2 for i in range(1, 11)]
print(f"\nWithout notation: {' + '.join(str(t) for t in terms_without)}")
print(f" = {sum(terms_without)}")
# With notation: one compact expression
n = 10
result_with = sum(i**2 for i in range(1, n+1))
print(f"\nWith notation: Σᵢ₌₁¹⁰ i² = {result_with}")
print(f" → Same answer, infinitely more readable")
# ── AI formulas that are really just sums ──
print(f"\n{'=' * 65}")
print("AI FORMULAS AS SUMS AND PRODUCTS")
print("=" * 65)
# Dot product: q·k = Σᵢ qᵢkᵢ
q = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([0.5, 1.5, 0.2, 1.0])
dot_manual = sum(q[i] * k[i] for i in range(len(q)))
dot_numpy = np.dot(q, k)
print(f"\nDot product q·k = Σᵢ qᵢkᵢ")
print(f" Manual sum: {dot_manual}")
print(f" np.dot: {dot_numpy}")
print(f" Match: {np.isclose(dot_manual, dot_numpy)} ✓")
# Matrix multiply: (AB)ᵢⱼ = Σₖ AᵢₖBₖⱼ
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C_manual = np.zeros((2, 2))
for i in range(2):
for j in range(2):
C_manual[i, j] = sum(A[i, k] * B[k, j] for k in range(2))
C_numpy = A @ B
print(f"\nMatrix multiply (AB)ᵢⱼ = Σₖ AᵢₖBₖⱼ")
print(f" Manual:\n{C_manual}")
print(f" A @ B:\n{C_numpy}")
print(f" Match: {np.allclose(C_manual, C_numpy)} ✓")
# Likelihood as product: P(seq) = Πᵢ P(tᵢ|context)
probs = [0.3, 0.5, 0.8, 0.6, 0.4]
likelihood_product = 1.0
for p in probs:
likelihood_product *= p
log_likelihood_sum = sum(np.log(p) for p in probs)
print(f"\nLikelihood: Πᵢ P(tᵢ|ctx) = {likelihood_product:.6f}")
print(f"Log-likelihood: Σᵢ log P = {log_likelihood_sum:.6f}")
print(f"exp(Σ log P) = Π P: {np.exp(log_likelihood_sum):.6f}")
print(f" Match: {np.isclose(likelihood_product, np.exp(log_likelihood_sum))} ✓")
print(f"\n → log converts Π to Σ — the most important trick in ML!")
Code cell 6
# ══════════════════════════════════════════════════════════════════
# 1.4 THE INDEX AS A BOUND VARIABLE (Alpha-Equivalence)
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("BOUND VARIABLES — INDEX NAME DOES NOT MATTER")
print("=" * 65)
n = 5
sum_i = sum(i**2 for i in range(1, n+1))
sum_j = sum(j**2 for j in range(1, n+1))
sum_k = sum(k**2 for k in range(1, n+1))
print(f"\n Σᵢ₌₁⁵ i² = {sum_i}")
print(f" Σⱼ₌₁⁵ j² = {sum_j}")
print(f" Σₖ₌₁⁵ k² = {sum_k}")
print(f" All equal: {sum_i == sum_j == sum_k} ✓ (alpha-equivalence)")
# CRITICAL: Σᵢ αᵢⱼ vs Σⱼ αᵢⱼ are DIFFERENT
print(f"\n{'─' * 65}")
print("CRITICAL: WHICH INDEX YOU SUM OVER MATTERS!")
alpha = np.array([[0.5, 0.3, 0.2],
[0.1, 0.6, 0.3],
[0.4, 0.4, 0.2]])
print(f"\n Attention matrix α (3×3):\n{alpha}")
print(f"\n Σⱼ αᵢⱼ (sum each ROW) = {alpha.sum(axis=1)}")
print(f" Σᵢ αᵢⱼ (sum each COLUMN) = {alpha.sum(axis=0)}")
print(f" → Different sums! Row sums = 1.0 (probability distribution)")
# ══════════════════════════════════════════════════════════════════
# 1.5 HISTORICAL TIMELINE
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("HISTORICAL TIMELINE OF SUMMATION NOTATION")
print("=" * 65)
timeline = [
("~250 BCE", "Archimedes", "Geometric series; area of parabola (no notation)"),
("1593", "Viète", "Systematic symbolic algebra → paves way"),
("1675", "Leibniz", "∫ as elongated S for 'summa' (continuous)"),
("~1750", "Euler", "Introduces Σ (capital sigma) for discrete sums"),
("~1800", "Gauss", "Σᵢ₌₁ⁿ i = n(n+1)/2 at age 10"),
("1821", "Cauchy", "Rigorous convergence theory for infinite series"),
("1854", "Riemann", "Riemann sum → integral (Σ → ∫ as Δx → 0)"),
("1916", "Einstein", "Summation convention: repeated index = sum"),
("2017", "Vaswani et al.", "Transformer: Attn = Σⱼ αᵢⱼVⱼ everywhere"),
]
print(f"\n {'Year':<12} {'Who':<18} {'What'}")
print(f" {'─'*12} {'─'*18} {'─'*45}")
for year, who, what in timeline:
print(f" {year:<12} {who:<18} {what}")
# ══════════════════════════════════════════════════════════════════
# 1.6 SUM ↔ INTEGRAL CONNECTION (Riemann Sum)
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("SUM → INTEGRAL: RIEMANN SUM DEMONSTRATION")
print("=" * 65)
# Approximate ∫₀¹ x² dx using Riemann sums with increasing n
def riemann_sum(f, a, b, n):
"""Left Riemann sum: Σᵢ₌₀ⁿ⁻¹ f(xᵢ)·Δx where Δx = (b-a)/n"""
dx = (b - a) / n
return sum(f(a + i * dx) * dx for i in range(n))
exact = 1/3 # ∫₀¹ x² dx = 1/3
print(f"\n Target: ∫₀¹ x² dx = 1/3 ≈ {exact:.10f}")
print(f"\n {'n (terms)':<12} {'Σᵢ f(xᵢ)Δx':<18} {'Error':<15} {'Δx'}")
print(f" {'─'*12} {'─'*18} {'─'*15} {'─'*10}")
for n_terms in [5, 10, 50, 100, 1000, 10000]:
approx = riemann_sum(lambda x: x**2, 0, 1, n_terms)
err = abs(approx - exact)
dx = 1.0 / n_terms
print(f" {n_terms:<12} {approx:<18.10f} {err:<15.2e} {dx:.6f}")
print(f"\n → As n → ∞ and Δx → 0: Σ → ∫ (definition of Riemann integral)")
2. Summation Notation — Formal Definitions
2.1 Basic Definition
$$\sum_{i=m}^{n} f(i) = f(m) + f(m+1) + \cdots + f(n)$$
Components:
- $\sum$ — summation operator (capital Greek sigma)
- $i$ — index variable (dummy/bound variable; any name works)
- $m$ — lower limit (inclusive)
- $n$ — upper limit (inclusive)
- $f(i)$ — summand (expression evaluated at each index value)
- Number of terms: $n - m + 1$
Empty sum convention: if $n < m$, the sum has no terms and equals 0 (additive identity).
2.2 Formal Definition via Recursion
- Base case: $\sum_{i=m}^{m} f(i) = f(m)$ (single term)
- Recursive step: $\sum_{i=m}^{n} f(i) = \left(\sum_{i=m}^{n-1} f(i)\right) + f(n)$ for $n > m$
- Empty base: $\sum_{i=m}^{n} f(i) = 0$ for $n < m$
2.3 Index Set Notation
More general: sum over an arbitrary index set $S$: $\sum_{i \in S} f(i)$
AI examples:
- $\sum_{v \in V} \exp(z_v)$ — sum over vocabulary tokens
- $\sum_{(i,j)\,:\,i<j} f(i,j)$ — sum over upper-triangular pairs
- $\sum_{\ell} \sum_{h} f(\ell, h)$ — sum over all layers and heads
2.4 Multiple Indices
$$\sum_{i=1}^{m} \sum_{j=1}^{n} f(i,j) = \sum_{i=1}^{m} \left(\sum_{j=1}^{n} f(i,j)\right)$$
Process: for each fixed $i$, compute inner sum over $j$; then sum results over $i$. Total terms: $m \cdot n$.
Code cell 8
# ══════════════════════════════════════════════════════════════════
# 2.1–2.2 SUMMATION: BASIC + RECURSIVE DEFINITION
# ══════════════════════════════════════════════════════════════════
def sigma(f, m, n):
"""Σᵢ₌ₘⁿ f(i) — iterative definition."""
return sum(f(i) for i in range(m, n + 1))
def sigma_recursive(f, m, n):
"""Σᵢ₌ₘⁿ f(i) — recursive definition (2.2)."""
if n < m:
return 0 # Empty sum = 0 (additive identity)
if n == m:
return f(m) # Base case: single term
return sigma_recursive(f, m, n - 1) + f(n) # Recursive step
print("=" * 65)
print("SUMMATION: ITERATIVE vs RECURSIVE DEFINITION")
print("=" * 65)
# Test with f(i) = i²
f = lambda i: i**2
for n in [1, 3, 5, 10]:
iter_val = sigma(f, 1, n)
rec_val = sigma_recursive(f, 1, n)
formula = n * (n + 1) * (2 * n + 1) // 6 # closed form
print(f" Σᵢ₌₁^{n:<2d} i² iterative={iter_val:<6d} recursive={rec_val:<6d} formula={formula:<6d} ✓")
# Empty sum convention
empty = sigma(f, 5, 3) # n < m → 0
print(f"\n Empty sum Σᵢ₌₅³ f(i) = {empty} (additive identity)")
# ══════════════════════════════════════════════════════════════════
# 2.3 INDEX SET NOTATION
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("INDEX SET NOTATION: Σ OVER ARBITRARY SETS")
print("=" * 65)
# Sum over arbitrary index set
def sigma_set(f, index_set):
"""Σ_{i ∈ S} f(i) — sum over arbitrary index set S."""
return sum(f(i) for i in index_set)
# Example 1: Sum over vocabulary (simulated as indices)
vocab = {0, 1, 2, 3, 4} # 5-token vocabulary
logits = np.array([2.0, 1.0, 0.5, -1.0, 0.3])
Z = sigma_set(lambda v: np.exp(logits[v]), vocab)
print(f"\n Softmax denominator Z = Σ_{{v ∈ V}} exp(zᵥ) = {Z:.4f}")
print(f" np.sum(np.exp(logits)) = {np.sum(np.exp(logits)):.4f} ✓")
# Example 2: Sum over pairs (i,j) with i < j
pairs = [(i, j) for i in range(4) for j in range(4) if i < j]
pair_sum = sigma_set(lambda p: p[0] + p[1], pairs)
print(f"\n Σ_{{(i,j): i<j, i,j∈{{0..3}}}} (i+j) = {pair_sum}")
print(f" Pairs: {pairs}")
# Example 3: Sum with indicator function
data = [3, 7, 2, 8, 1, 5, 9, 4]
count_gt5 = sum(1 for x in data if x > 5)
print(f"\n Σᵢ 𝟙[xᵢ > 5] = {count_gt5} (counting elements > 5 in {data})")
# ══════════════════════════════════════════════════════════════════
# 2.4 MULTIPLE INDICES — DOUBLE SUMS
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("DOUBLE SUMS: Σᵢ Σⱼ f(i,j)")
print("=" * 65)
# Compute Σᵢ₌₁³ Σⱼ₌₁⁴ (i·j) step by step
m, n = 3, 4
print(f"\n Computing Σᵢ₌₁³ Σⱼ₌₁⁴ (i·j):")
total = 0
for i in range(1, m + 1):
inner = sum(i * j for j in range(1, n + 1))
total += inner
terms = ' + '.join(f'{i}·{j}' for j in range(1, n + 1))
print(f" i={i}: Σⱼ₌₁⁴ {i}·j = {terms} = {inner}")
print(f" Total: {total}")
print(f" Formula: (Σᵢ i)(Σⱼ j) = {sum(range(1,4))} × {sum(range(1,5))} = {sum(range(1,4)) * sum(range(1,5))} ✓")
# Matrix as double sum: Σᵢ Σⱼ Aᵢⱼ = sum of all entries
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
double_sum = sum(A[i, j] for i in range(3) for j in range(3))
print(f"\n Matrix A:\n{A}")
print(f" Σᵢ Σⱼ Aᵢⱼ = {double_sum} = np.sum(A) = {np.sum(A)} ✓")
3. Product Notation — Formal Definitions
3.1 Basic Definition
$$\prod_{i=m}^{n} f(i) = f(m) \cdot f(m+1) \cdots f(n)$$
- Same structure as $\sum$ but with $\prod$ (capital Greek pi) and multiplication
- Empty product: if $n < m$, value = 1 (multiplicative identity)
- Recursive: $\prod_{i=m}^{n} f(i) = \left(\prod_{i=m}^{n-1} f(i)\right) \cdot f(n)$
3.2 Why Empty Product = 1 and Empty Sum = 0
| Operation | Identity Element | Empty Result | Reason |
|---|---|---|---|
| Addition ($\sum$) | 0 | $\sum_{i \in \emptyset} f(i) = 0$ | Adding nothing = additive identity |
| Multiplication ($\prod$) | 1 | $\prod_{i \in \emptyset} f(i) = 1$ | Multiplying nothing = multiplicative identity |
Consistency check: $0! = \prod_{i=1}^{0} i = 1$ ✓ (matches math.factorial(0) in the code cell below)
3.3 Factorial as Product
$$n! = \prod_{i=1}^{n} i$$
Stirling: $n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n$ — grows faster than any exponential.
3.4 Product Notation in Probability — The Log Trick
$$\log \prod_{i=1}^{n} P(t_i \mid t_{<i}) = \sum_{i=1}^{n} \log P(t_i \mid t_{<i})$$
This conversion is one of the most important transformations in all of ML:
- Products of small probabilities underflow → log converts to sum of log-probabilities
- Numerically stable, easier to differentiate, and computationally cheaper
Code cell 10
# ══════════════════════════════════════════════════════════════════
# 3.1–3.2 PRODUCT NOTATION & EMPTY PRODUCT
# ══════════════════════════════════════════════════════════════════
def pi_product(f, m, n):
"""Πᵢ₌ₘⁿ f(i) — iterative product."""
result = 1 # multiplicative identity
for i in range(m, n + 1):
result *= f(i)
return result
def pi_recursive(f, m, n):
"""Πᵢ₌ₘⁿ f(i) — recursive definition."""
if n < m:
return 1 # Empty product = 1 (multiplicative identity)
if n == m:
return f(m)
return pi_recursive(f, m, n - 1) * f(n)
print("=" * 65)
print("PRODUCT NOTATION: Πᵢ₌ₘⁿ f(i)")
print("=" * 65)
# Basic product demonstration
f = lambda i: i
for n in [1, 3, 5, 8, 10]:
iter_val = pi_product(f, 1, n)
rec_val = pi_recursive(f, 1, n)
fact_val = math.factorial(n)
print(f" Πᵢ₌₁^{n:<2d} i = {n}! iter={iter_val:<10d} rec={rec_val:<10d} math={fact_val:<10d} ✓")
# Empty product: Πᵢ₌₅³ = 1
empty_prod = pi_product(f, 5, 3)
empty_sum = sigma(lambda i: i**2, 5, 3)
print(f"\n Empty product Πᵢ₌₅³ i = {empty_prod} (multiplicative identity)")
print(f" Empty sum Σᵢ₌₅³ i² = {empty_sum} (additive identity)")
# 0! = 1 (empty product convention)
print(f"\n 0! = Πᵢ₌₁⁰ i = {pi_product(f, 1, 0)} = {math.factorial(0)} ✓")
# ══════════════════════════════════════════════════════════════════
# 3.3 FACTORIAL & STIRLING'S APPROXIMATION
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("FACTORIAL GROWTH & STIRLING'S APPROXIMATION")
print("=" * 65)
print(f"\n {'n':<6} {'n!':<20} {'Stirling':<20} {'Relative Error'}")
print(f" {'─'*6} {'─'*20} {'─'*20} {'─'*15}")
for n in [1, 5, 10, 20, 50, 100]:
exact = math.factorial(n)
stirling = np.sqrt(2 * np.pi * n) * (n / np.e) ** n
rel_err = abs(exact - stirling) / exact
print(f" {n:<6} {exact:<20.4g} {stirling:<20.4g} {rel_err:<15.6f}")
print(f"\n Stirling: n! ≈ √(2πn)(n/e)ⁿ — gets better as n → ∞")
# ══════════════════════════════════════════════════════════════════
# 3.4 THE LOG TRICK: Π → Σ VIA LOGARITHM
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("THE LOG TRICK: log Πᵢ f(i) = Σᵢ log f(i)")
print("=" * 65)
# Simulate LLM token probabilities
token_probs = [0.3, 0.5, 0.8, 0.02, 0.15, 0.6, 0.4, 0.1, 0.05, 0.7]
n_tokens = len(token_probs)
# Direct product — suffers from underflow with many small probs
likelihood = pi_product(lambda i: token_probs[i], 0, n_tokens - 1)
# Log-sum — numerically stable
log_likelihood = sum(np.log(p) for p in token_probs)
# Verify: exp(Σ log pᵢ) = Π pᵢ
reconstructed = np.exp(log_likelihood)
print(f"\n Token probs: {token_probs}")
print(f"\n Direct Π P(tᵢ|ctx) = {likelihood:.2e} ← tiny! underflow risk")
print(f" Σ log P(tᵢ|ctx) = {log_likelihood:.6f} ← stable log-space")
print(f" exp(Σ log P) = {reconstructed:.2e} ← reconstructed")
print(f" Match: {np.isclose(likelihood, reconstructed)} ✓")
# Perplexity: PPL = exp(−(1/N) Σ log P)
cross_entropy = -log_likelihood / n_tokens
perplexity = np.exp(cross_entropy)
# Equivalently: geometric mean of 1/P
geo_mean = np.prod([1/p for p in token_probs]) ** (1/n_tokens)
print(f"\n Cross-entropy H = −(1/N)Σ log P = {cross_entropy:.4f}")
print(f" Perplexity PPL = exp(H) = {perplexity:.4f}")
print(f" Geometric mean = (Π 1/P)^(1/N) = {geo_mean:.4f}")
print(f" Match: {np.isclose(perplexity, geo_mean)} ✓")
print(f"\n → Lower PPL = model is more confident/accurate")
4. Algebraic Properties of Summation
4.1 Linearity of Summation
$$\sum_{i=m}^{n} \bigl(\alpha f(i) + \beta g(i)\bigr) = \alpha \sum_{i=m}^{n} f(i) + \beta \sum_{i=m}^{n} g(i)$$
This is the single most important property. It means:
- Scalar multiple: $\sum_i c\,f(i) = c \sum_i f(i)$
- Sum of sums: $\sum_i \bigl(f(i) + g(i)\bigr) = \sum_i f(i) + \sum_i g(i)$
- AI: gradient of sum = sum of gradients: $\nabla_\theta \sum_i \mathcal{L}(\theta, x_i) = \sum_i \nabla_\theta \mathcal{L}(\theta, x_i)$
4.2 Constant Summand
$$\sum_{i=1}^{n} c = n \cdot c$$
4.3 Splitting and Combining Sums
$$\sum_{i=m}^{n} f(i) = \sum_{i=m}^{k} f(i) + \sum_{i=k+1}^{n} f(i), \qquad m \le k < n$$
4.4 Index Shifting (Reindexing)
$$\sum_{i=m}^{n} f(i) = \sum_{j=m+k}^{n+k} f(j-k)$$
Reverse: $\sum_{i=m}^{n} f(i) = \sum_{i=m}^{n} f(m+n-i)$
4.5 Telescoping Sums
$$\sum_{i=1}^{n} \bigl(f(i+1) - f(i)\bigr) = f(n+1) - f(1)$$
All intermediate terms cancel — only first and last survive.
4.6 Interchange of Summation Order (Discrete Fubini)
$$\sum_{i=1}^{m} \sum_{j=1}^{n} f(i,j) = \sum_{j=1}^{n} \sum_{i=1}^{m} f(i,j)$$
Always valid for finite sums. For infinite sums: requires absolute convergence.
Code cell 12
# ══════════════════════════════════════════════════════════════════
# 4.1 LINEARITY OF SUMMATION
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("LINEARITY: Σ(αf + βg) = αΣf + βΣg")
print("=" * 65)
n = 10
alpha_c, beta_c = 3.0, -2.0
f_vals = np.array([i**2 for i in range(1, n+1)], dtype=float)
g_vals = np.array([i for i in range(1, n+1)], dtype=float)
# LHS: Σᵢ (α·f(i) + β·g(i))
lhs = np.sum(alpha_c * f_vals + beta_c * g_vals)
# RHS: α·Σf + β·Σg
rhs = alpha_c * np.sum(f_vals) + beta_c * np.sum(g_vals)
print(f"\n f(i) = i², g(i) = i, α = {alpha_c}, β = {beta_c}, n = {n}")
print(f" LHS = Σ(αf + βg) = {lhs}")
print(f" RHS = αΣf + βΣg = {rhs}")
print(f" Equal: {np.isclose(lhs, rhs)} ✓")
# AI application: gradient of sum = sum of gradients
print(f"\n AI: ∇_θ Σᵢ L(θ,xᵢ) = Σᵢ ∇_θ L(θ,xᵢ)")
print(" → Mini-batch gradient = (1/|B|) sum_{i in B} grad l_i")
# ══════════════════════════════════════════════════════════════════
# 4.2–4.3 CONSTANT SUMMAND & SPLITTING
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("CONSTANT SUMMAND & SPLITTING SUMS")
print("=" * 65)
# Constant summand: Σᵢ₌₁ⁿ c = n·c
c, n = 7, 5
const_sum = sigma(lambda i: c, 1, n)
print(f"\n Σᵢ₌₁⁵ 7 = {const_sum} = 5 × 7 = {5 * 7} ✓")
# Splitting: Σᵢ₌₁¹⁰ = Σᵢ₌₁⁵ + Σᵢ₌₆¹⁰
full = sigma(lambda i: i**2, 1, 10)
part1 = sigma(lambda i: i**2, 1, 5)
part2 = sigma(lambda i: i**2, 6, 10)
print(f"\n Σᵢ₌₁¹⁰ i² = {full}")
print(f" Σᵢ₌₁⁵ i² + Σᵢ₌₆¹⁰ i² = {part1} + {part2} = {part1 + part2} ✓")
# ══════════════════════════════════════════════════════════════════
# 4.4 INDEX SHIFTING (REINDEXING)
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("INDEX SHIFTING: Σᵢ₌ₘⁿ f(i) = Σⱼ₌ₘ₊ₖⁿ⁺ᵏ f(j−k)")
print("=" * 65)
# Shift by k=2: Σᵢ₌₁⁵ i² = Σⱼ₌₃⁷ (j-2)²
original = sigma(lambda i: i**2, 1, 5)
shifted = sigma(lambda j: (j-2)**2, 3, 7)
print(f"\n Σᵢ₌₁⁵ i² = {original}")
print(f" Σⱼ₌₃⁷ (j−2)² = {shifted}")
print(f" Equal: {original == shifted} ✓ (j = i + 2)")
# Reverse index: Σᵢ₌₁⁵ f(i) = Σᵢ₌₁⁵ f(6−i)
reverse = sigma(lambda i: (6-i)**2, 1, 5)
print(f"\n Σᵢ₌₁⁵ i² = {original}")
print(f" Σᵢ₌₁⁵ (6−i)² = {reverse}")
print(f" Equal: {original == reverse} ✓ (reverse index)")
# ══════════════════════════════════════════════════════════════════
# 4.5 TELESCOPING SUMS
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("TELESCOPING: Σᵢ₌₁ⁿ (f(i+1) − f(i)) = f(n+1) − f(1)")
print("=" * 65)
# Example: Σᵢ₌₁ⁿ (1/i − 1/(i+1)) = 1 − 1/(n+1)
n = 10
telescope = sigma(lambda i: 1/i - 1/(i+1), 1, n)
closed_form = 1 - 1/(n+1)
print(f"\n Σᵢ₌₁¹⁰ (1/i − 1/(i+1))")
terms = [f"(1/{i} − 1/{i+1})" for i in range(1, 6)]
print(f" = {' + '.join(terms)} + ...")
print(f" = {telescope:.10f}")
print(f" = 1 − 1/11 = {closed_form:.10f}")
print(f" Match: {np.isclose(telescope, closed_form)} ✓")
# Show the cancellation explicitly
print(f"\n Expansion showing cancellation:")
for i in range(1, 6):
cancel = " ← cancels with next" if i < 5 else ""
print(f" + 1/{i} − 1/{i+1}{cancel}")
print(f" Only 1/1 and −1/{n+1} survive!")
# AI: Residual network parameter update
print(f"\n AI: θₜ − θ₀ = Σₛ₌₁ᵗ (θₛ − θₛ₋₁) = Σₛ₌₁ᵗ Δθₛ (telescoping)")
# ══════════════════════════════════════════════════════════════════
# 4.6 INTERCHANGE OF SUMMATION ORDER
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("INTERCHANGE: Σᵢ Σⱼ f(i,j) = Σⱼ Σᵢ f(i,j)")
print("=" * 65)
# Row-first vs column-first on a matrix
M = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# Row-first: Σᵢ (Σⱼ Mᵢⱼ)
row_first = sum(sum(M[i, j] for j in range(3)) for i in range(3))
# Column-first: Σⱼ (Σᵢ Mᵢⱼ)
col_first = sum(sum(M[i, j] for i in range(3)) for j in range(3))
print(f"\n Matrix M:\n{M}")
print(f"\n Row-first: Σᵢ(Σⱼ Mᵢⱼ) = Σᵢ{M.sum(axis=1).tolist()} = {row_first}")
print(f" Column-first: Σⱼ(Σᵢ Mᵢⱼ) = Σⱼ{M.sum(axis=0).tolist()} = {col_first}")
print(f" Equal: {row_first == col_first} ✓ (Fubini for finite sums)")
# AI: E[Σᵢ Xᵢ] = Σᵢ E[Xᵢ] (linearity of expectation = interchange)
print(f"\n AI: 𝔼[Σᵢ Xᵢ] = Σᵢ 𝔼[Xᵢ] (linearity of expectation)")
print(f" → Valid even if Xᵢ are dependent!")
5. Algebraic Properties of Products
5.1 Splitting Products
$$\prod_{i=m}^{n} f(i) = \left(\prod_{i=m}^{k} f(i)\right)\left(\prod_{i=k+1}^{n} f(i)\right), \qquad m \le k < n$$
5.2 Product of a Constant
$$\prod_{i=1}^{n} c = c^n$$
5.3 Interchange of Product Order
$$\prod_{i} \prod_{j} f(i,j) = \prod_{j} \prod_{i} f(i,j)$$
5.4 Distributing Product over Multiplication
$$\prod_{i} \bigl(f(i)\,g(i)\bigr) = \left(\prod_{i} f(i)\right)\left(\prod_{i} g(i)\right)$$
Does NOT distribute over addition: $\prod_i \bigl(f(i) + g(i)\bigr) \ne \prod_i f(i) + \prod_i g(i)$
5.5 Logarithm Converts Product to Sum — The Key Identity
$$\log \prod_{i} f(i) = \sum_{i} \log f(i)$$
This is why every ML system works in log-space:
- Cross-entropy: $\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log P(y_i \mid x_i)$
- Perplexity: $\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(t_i \mid t_{<i})\right)$
Code cell 14
# ══════════════════════════════════════════════════════════════════
# 5.1–5.4 PRODUCT ALGEBRAIC PROPERTIES
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("ALGEBRAIC PROPERTIES OF PRODUCTS")
print("=" * 65)
# 5.1 Splitting: Πᵢ₌₁¹⁰ i = (Πᵢ₌₁⁵ i) · (Πᵢ₌₆¹⁰ i)
full = pi_product(lambda i: i, 1, 10)
p1 = pi_product(lambda i: i, 1, 5)
p2 = pi_product(lambda i: i, 6, 10)
print(f"\n Splitting: Πᵢ₌₁¹⁰ i = {full}")
print(f" (Πᵢ₌₁⁵ i)(Πᵢ₌₆¹⁰ i) = {p1} × {p2} = {p1 * p2} ✓")
# 5.2 Constant product: Πᵢ₌₁ⁿ c = cⁿ
c, n = 3, 5
const_prod = pi_product(lambda i: c, 1, n)
print(f"\n Πᵢ₌₁⁵ 3 = {const_prod} = 3⁵ = {3**5} ✓")
# 5.4 Product distributes over multiplication (but NOT addition)
f_vals = np.array([2, 3, 4, 5], dtype=float)
g_vals = np.array([1, 2, 1, 3], dtype=float)
prod_fg = np.prod(f_vals * g_vals)
prod_f_prod_g = np.prod(f_vals) * np.prod(g_vals)
print(f"\n Πᵢ(fᵢ·gᵢ) = {prod_fg:.0f}")
print(f" (Πᵢ fᵢ)(Πᵢ gᵢ) = {prod_f_prod_g:.0f}")
print(f" Equal: {np.isclose(prod_fg, prod_f_prod_g)} ✓ (distributes over ×)")
# Does NOT distribute over addition
prod_sum = np.prod(f_vals + g_vals)
sum_prods = np.prod(f_vals) + np.prod(g_vals)
print(f"\n Πᵢ(fᵢ+gᵢ) = {prod_sum:.0f}")
print(f" Πᵢ fᵢ + Πᵢ gᵢ = {sum_prods:.0f}")
print(f" Equal: {np.isclose(prod_sum, sum_prods)} ← NOT equal! Product ≠ distribute over +")
# ══════════════════════════════════════════════════════════════════
# 5.5 LOG ↔ PRODUCT — COMPREHENSIVE DEMONSTRATION
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("LOG CONVERTS PRODUCT TO SUM — NUMERICAL STABILITY DEMO")
print("=" * 65)
# Simulate 50-token sequence with varying probabilities
np.random.seed(42)
probs_50 = np.random.uniform(0.01, 0.3, size=50)
# Direct product: underflows quickly
direct_product = np.prod(probs_50)
# Log-sum: stable
log_sum = np.sum(np.log(probs_50))
from_log = np.exp(log_sum)
print(f"\n 50-token sequence probabilities: P(tᵢ|ctx) ∈ [0.01, 0.3]")
print(f"\n Direct Π P(tᵢ) = {direct_product:.6e}")
print(f" Σ log P(tᵢ) = {log_sum:.4f}")
print(f" exp(Σ log P) = {from_log:.6e}")
print(f" Match: {np.isclose(direct_product, from_log)} ✓")
# With 500 tokens: direct product → 0.0 (underflow!)
probs_500 = np.random.uniform(0.01, 0.3, size=500)
direct_500 = np.prod(probs_500)
log_500 = np.sum(np.log(probs_500))
print(f"\n 500-token sequence:")
print(f" Direct Π P(tᵢ) = {direct_500} ← UNDERFLOW!")
print(f" Σ log P(tᵢ) = {log_500:.4f} ← perfectly fine in log-space")
print(f"\n → This is why ALL production ML systems work in log-space!")
6. Important Summation Formulas
6.1 Arithmetic Series (Gauss's Formula)
$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$$
Gauss at age 10: pair first + last: $1 + n,\ 2 + (n-1),\ \ldots$; $n/2$ pairs, each summing to $n+1$.
6.2 Sum of Squares
$$\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}$$
6.3 Sum of Cubes — The Remarkable Identity
$$\sum_{i=1}^{n} i^3 = \left(\frac{n(n+1)}{2}\right)^2 = \left(\sum_{i=1}^{n} i\right)^2$$
Sum of cubes = square of the sum of the first $n$ integers!
6.4 Geometric Series (Finite)
$$\sum_{i=0}^{n-1} r^i = \frac{1-r^n}{1-r} \qquad (r \ne 1)$$
6.5 Geometric Series (Infinite) — for $|r| < 1$:
$$\sum_{i=0}^{\infty} r^i = \frac{1}{1-r}$$
Derivative: $\sum_{i=1}^{\infty} i\,r^{i-1} = \frac{1}{(1-r)^2}$ (checked numerically in the sketch below)
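The differentiated geometric series is not exercised in the code cells below, so here is a minimal numeric check. The cutoff of 2,000 terms is an arbitrary choice, picked so the truncated tail is negligible for these values of $r$.

```python
# Minimal sketch (not from the original notebook): verify Σᵢ₌₁^∞ i·r^(i-1) = 1/(1-r)²
# for |r| < 1 by truncating the infinite sum at 2,000 terms.
for r in [0.3, 0.5, 0.9]:
    partial = sum(i * r**(i - 1) for i in range(1, 2001))  # truncated sum
    closed = 1.0 / (1.0 - r)**2                            # closed form
    print(f"r={r}: Σ i·r^(i-1) ≈ {partial:.6f}   1/(1-r)² = {closed:.6f}")
```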
6.6 Exponential and Taylor Series
$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}, \qquad \sin x = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{(2n+1)!}, \qquad \ln(1+x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1} x^n}{n} \quad (|x| \le 1,\ x \ne -1)$$
6.7 Harmonic Numbers
$$H_n = \sum_{i=1}^{n} \frac{1}{i} \approx \ln n + \gamma, \qquad \gamma \approx 0.5772$$
The harmonic series $\sum_{n=1}^{\infty} \frac{1}{n}$ diverges (grows without bound).
$p$-series: $\sum_{n=1}^{\infty} \frac{1}{n^p}$ converges iff $p > 1$.
Code cell 16
# ══════════════════════════════════════════════════════════════════
# 6.1–6.3 CLASSICAL CLOSED-FORM SUMMATION FORMULAS
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("CLASSICAL SUMMATION FORMULAS — VERIFY ALL")
print("=" * 65)
print(f"\n {'n':<6} {'Σi':<10} {'n(n+1)/2':<10} {'Σi²':<12} {'formula':<12} {'Σi³':<12} {'(Σi)²':<12}")
print(f" {'─'*6} {'─'*10} {'─'*10} {'─'*12} {'─'*12} {'─'*12} {'─'*12}")
for n in [1, 3, 5, 10, 20, 50, 100]:
# 6.1 Arithmetic: Σᵢ₌₁ⁿ i = n(n+1)/2
sum_i = sum(range(1, n+1))
gauss = n * (n + 1) // 2
# 6.2 Squares: Σᵢ₌₁ⁿ i² = n(n+1)(2n+1)/6
sum_i2 = sum(i**2 for i in range(1, n+1))
sq_formula = n * (n + 1) * (2 * n + 1) // 6
# 6.3 Cubes: Σᵢ₌₁ⁿ i³ = (n(n+1)/2)²
sum_i3 = sum(i**3 for i in range(1, n+1))
cube_formula = gauss ** 2
print(f" {n:<6} {sum_i:<10} {gauss:<10} {sum_i2:<12} {sq_formula:<12} {sum_i3:<12} {cube_formula:<12}")
print(f"\n All verified: Σi³ = (Σi)² — remarkable identity! ✓")
print(f" AI: n tokens → O(n²) attention pairs; Σᵢ₌₁ⁿ i = n(n+1)/2 ≈ n²/2")
# ══════════════════════════════════════════════════════════════════
# 6.4–6.5 GEOMETRIC SERIES
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("GEOMETRIC SERIES: Σᵢ₌₀ⁿ⁻¹ rⁱ = (1−rⁿ)/(1−r)")
print("=" * 65)
def geometric_sum(r, n):
"""Closed form: (1 - rⁿ) / (1 - r)"""
if np.isclose(r, 1.0):
return float(n)
return (1 - r**n) / (1 - r)
print(f"\n {'r':<8} {'n':<6} {'Direct Σ':<18} {'Formula':<18} {'Match'}")
print(f" {'─'*8} {'─'*6} {'─'*18} {'─'*18} {'─'*6}")
for r in [0.5, 0.9, 0.99, 2.0]:
for n in [5, 20]:
direct = sum(r**i for i in range(n))
formula = geometric_sum(r, n)
print(f" {r:<8.2f} {n:<6} {direct:<18.8f} {formula:<18.8f} {'✓' if np.isclose(direct, formula) else '✗'}")
# Infinite geometric series: Σᵢ₌₀^∞ rⁱ = 1/(1-r) for |r| < 1
print(f"\n Infinite geometric series (|r| < 1):")
for r in [0.5, 0.9, 0.99]:
partial_100 = sum(r**i for i in range(100))
infinite = 1 / (1 - r)
print(f" r={r:.2f}: Σᵢ₌₀⁹⁹ rⁱ = {partial_100:.8f} → 1/(1−r) = {infinite:.8f}")
print(f"\n AI: Geometric decay in ALiBi attention; discounted rewards in RL: Σᵢ γⁱrᵢ")
Code cell 17
# ══════════════════════════════════════════════════════════════════
# 6.6 TAYLOR SERIES — FUNCTIONS AS INFINITE SUMS
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("TAYLOR SERIES: FUNCTIONS AS Σ")
print("=" * 65)
def exp_taylor(x, n_terms):
"""eˣ = Σₙ₌₀^∞ xⁿ/n!"""
return sum(x**n / math.factorial(n) for n in range(n_terms))
def sin_taylor(x, n_terms):
"""sin(x) = Σₙ₌₀^∞ (-1)ⁿ x^(2n+1)/(2n+1)!"""
return sum((-1)**n * x**(2*n+1) / math.factorial(2*n+1) for n in range(n_terms))
def ln1px_taylor(x, n_terms):
"""ln(1+x) = Σₙ₌₁^∞ (-1)^(n+1) xⁿ/n for |x| ≤ 1"""
return sum((-1)**(n+1) * x**n / n for n in range(1, n_terms + 1))
x = 1.0 # Evaluate at x = 1
print(f"\n x = {x}")
print(f"\n {'Terms':<8} {'exp(x) Taylor':<18} {'sin(x) Taylor':<18} {'ln(1+x) Taylor':<18}")
print(f" {'─'*8} {'─'*18} {'─'*18} {'─'*18}")
for n_terms in [1, 3, 5, 10, 20]:
e_approx = exp_taylor(x, n_terms)
s_approx = sin_taylor(x, min(n_terms, 10))
l_approx = ln1px_taylor(x, n_terms)
print(f" {n_terms:<8} {e_approx:<18.10f} {s_approx:<18.10f} {l_approx:<18.10f}")
print(f"\n Exact: {' ':<8}{np.exp(x):<18.10f} {np.sin(x):<18.10f} {np.log(1+x):<18.10f}")
print(f"\n AI uses: exp() in softmax, log() in cross-entropy, sin()/cos() in positional encodings")
# ══════════════════════════════════════════════════════════════════
# 6.7 HARMONIC NUMBERS & DIVERGENCE
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("HARMONIC NUMBERS: Hₙ = Σᵢ₌₁ⁿ 1/i ≈ ln(n) + γ")
print("=" * 65)
EULER_MASCHERONI = 0.5772156649 # γ
print(f"\n {'n':<10} {'Hₙ = Σ 1/i':<18} {'ln(n) + γ':<18} {'Error':<12} {'ln(n)':<12}")
print(f" {'─'*10} {'─'*18} {'─'*18} {'─'*12} {'─'*12}")
for n in [1, 5, 10, 50, 100, 1000, 10000]:
H_n = sum(1.0/i for i in range(1, n+1))
approx = np.log(n) + EULER_MASCHERONI
err = abs(H_n - approx)
print(f" {n:<10} {H_n:<18.8f} {approx:<18.8f} {err:<12.6f} {np.log(n):<12.4f}")
print(f"\n γ (Euler–Mascheroni) ≈ {EULER_MASCHERONI}")
print(f" Hₙ grows like ln(n) — DIVERGES, but very slowly")
# p-series convergence
print(f"\n p-series: Σ 1/nᵖ — converges iff p > 1")
for p in [0.5, 1.0, 1.5, 2.0, 3.0]:
partial = sum(1.0/n**p for n in range(1, 10001))
converges = "DIVERGES" if p <= 1 else f"→ ζ({p:.1f}) ≈ {partial:.6f}"
print(f" p={p:.1f}: Σₙ₌₁¹⁰⁰⁰⁰ 1/nᵖ = {partial:.6f} {converges}")
print(f"\n ζ(2) = π²/6 ≈ {np.pi**2/6:.6f} (Basel problem, Euler 1734)")
print(f" AI: Zipf's law (word freq ∝ 1/rank); harmonic sums in complexity analysis")
7. Summation Bounds and Estimation
7.1 Comparison Test
If $0 \le a_n \le b_n$ for all $n$:
- $\sum_n b_n$ converges $\Rightarrow$ $\sum_n a_n$ converges
- $\sum_n a_n$ diverges $\Rightarrow$ $\sum_n b_n$ diverges
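A minimal numeric illustration of the comparison test. The pair $1/(n^2+1) \le 1/n^2$ is an arbitrary example chosen for this sketch, not taken from the notes.

```python
# Minimal sketch (not from the original notebook): 0 ≤ 1/(n²+1) ≤ 1/n², and Σ 1/n²
# converges (to π²/6), so Σ 1/(n²+1) must converge too; its partial sums level off.
for N in [10, 100, 1000, 100000]:
    bigger = sum(1.0 / n**2 for n in range(1, N + 1))         # known convergent bound
    smaller = sum(1.0 / (n**2 + 1) for n in range(1, N + 1))  # dominated series
    print(f"N={N:<7} Σ 1/(n²+1) = {smaller:.6f} ≤ Σ 1/n² = {bigger:.6f}")
```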
7.2 Integral Test
For positive decreasing $f$: $\sum_{n=1}^{\infty} f(n)$ and $\int_{1}^{\infty} f(x)\,dx$ converge or diverge together.
7.3 Bounding Partial Sums
- Upper: $\sum_{i=1}^{n} a_i \le n \cdot \max_i a_i$
- Cauchy-Schwarz: $\left(\sum_i a_i b_i\right)^2 \le \left(\sum_i a_i^2\right)\left(\sum_i b_i^2\right)$
- Triangle inequality: $\left|\sum_i a_i\right| \le \sum_i |a_i|$
7.4 Stirling's Approximation
$$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n \quad\Longleftrightarrow\quad \log n! = \sum_{i=1}^{n} \log i \approx n\log n - n + \tfrac{1}{2}\log(2\pi n)$$
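Code cell 10 compared $n!$ with Stirling directly; the log form is the one used in practice when $n!$ itself would overflow. Here is a minimal sketch comparing $\sum_i \log i$ with the log-form approximation (the particular values of $n$ are arbitrary).

```python
# Minimal sketch (not from the original notebook): log(n!) = Σ log i vs Stirling's
# log form n·ln n - n + ½·ln(2πn); the gap shrinks like 1/(12n).
import math

for n in [10, 100, 1000, 10000]:
    log_fact = sum(math.log(i) for i in range(1, n + 1))                 # Σᵢ log i
    stirling = n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)     # log-form Stirling
    print(f"n={n:<6} Σ log i = {log_fact:.4f}   Stirling ≈ {stirling:.4f}   diff = {log_fact - stirling:.6f}")
```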
7.5 Asymptotic Notation for Sums
- $\sum_{i=1}^{n} i = O(n^2)$; $\sum_{i=1}^{n} \frac{1}{i} = O(\log n)$
- Attention: $O(n^2 d)$; Total transformer layer: $O(n^2 d + n d^2)$
Code cell 19
# ══════════════════════════════════════════════════════════════════
# 7.1–7.2 COMPARISON TEST & INTEGRAL TEST
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("INTEGRAL TEST BOUNDS ON HARMONIC NUMBERS")
print("=" * 65)
# Integral test: ln(n+1) ≤ Hₙ ≤ 1 + ln(n)
print(f"\n {'n':<10} {'ln(n+1)':<14} {'Hₙ':<14} {'1+ln(n)':<14} {'In bounds?'}")
print(f" {'─'*10} {'─'*14} {'─'*14} {'─'*14} {'─'*10}")
for n in [5, 10, 50, 100, 1000]:
H_n = sum(1.0/i for i in range(1, n+1))
lower = np.log(n + 1)
upper = 1.0 + np.log(n)
in_bounds = lower <= H_n <= upper
print(f" {n:<10} {lower:<14.6f} {H_n:<14.6f} {upper:<14.6f} {'✓' if in_bounds else '✗'}")
# ══════════════════════════════════════════════════════════════════
# 7.3 CAUCHY-SCHWARZ INEQUALITY FOR SUMS
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("CAUCHY-SCHWARZ: (Σ aᵢbᵢ)² ≤ (Σ aᵢ²)(Σ bᵢ²)")
print("=" * 65)
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 1.0, 4.0, 0.5, 3.0])
lhs = np.dot(a, b)**2
rhs = np.dot(a, a) * np.dot(b, b)
print(f"\n a = {a}")
print(f" b = {b}")
print(f"\n (Σ aᵢbᵢ)² = {np.dot(a,b):.2f}² = {lhs:.2f}")
print(f" (Σ aᵢ²)(Σ bᵢ²) = {np.dot(a,a):.2f} × {np.dot(b,b):.2f} = {rhs:.2f}")
print(f" LHS ≤ RHS: {lhs <= rhs + 1e-10} ✓ (Cauchy-Schwarz)")
# Equality holds iff a ∝ b
a_prop = np.array([1.0, 2.0, 3.0])
b_prop = 2.5 * a_prop # b = 2.5a (proportional)
lhs_eq = np.dot(a_prop, b_prop)**2
rhs_eq = np.dot(a_prop, a_prop) * np.dot(b_prop, b_prop)
print(f"\n Equality when a ∝ b: a = {a_prop}, b = 2.5a = {b_prop}")
print(f" (Σ aᵢbᵢ)² = {lhs_eq:.2f}, (Σ aᵢ²)(Σ bᵢ²) = {rhs_eq:.2f}")
print(f" Equal: {np.isclose(lhs_eq, rhs_eq)} ✓")
# Triangle inequality
print(f"\n Triangle inequality: |Σᵢ aᵢ| ≤ Σᵢ |aᵢ|")
mixed = np.array([3, -2, 5, -1, 4])
print(f" a = {mixed}")
print(f" |Σ aᵢ| = |{np.sum(mixed)}| = {abs(np.sum(mixed))}")
print(f" Σ |aᵢ| = {np.sum(np.abs(mixed))}")
print(f" {abs(np.sum(mixed))} ≤ {np.sum(np.abs(mixed))} ✓")
# ══════════════════════════════════════════════════════════════════
# 7.4–7.5 STIRLING & ASYMPTOTIC COMPLEXITY
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("ASYMPTOTIC COMPLEXITY OF COMMON SUMS")
print("=" * 65)
print(f"\n {'Sum':<25} {'n=100':<15} {'n=1000':<15} {'Growth'}")
print(f" {'─'*25} {'─'*15} {'─'*15} {'─'*15}")
sums_data = [
("Σᵢ₌₁ⁿ 1 = n", lambda n: n, "O(n)"),
("Σᵢ₌₁ⁿ i = n²/2", lambda n: n*(n+1)//2, "O(n²)"),
("Σᵢ₌₁ⁿ i² ≈ n³/3", lambda n: n*(n+1)*(2*n+1)//6, "O(n³)"),
("Σᵢ₌₁ⁿ log(i)=log(n!)",lambda n: sum(np.log(i) for i in range(1,n+1)), "O(n log n)"),
("Σᵢ₌₁ⁿ 1/i ≈ ln(n)", lambda n: sum(1.0/i for i in range(1,n+1)), "O(log n)"),
]
for label, func, growth in sums_data:
v100 = func(100)
v1000 = func(1000)
print(f" {label:<25} {v100:<15.2f} {v1000:<15.2f} {growth}")
# Transformer FLOP count
print(f"\n Transformer complexity per layer:")
for n_seq, d_model in [(512, 768), (2048, 1024), (8192, 4096)]:
attn_flops = 2 * n_seq * n_seq * d_model # QKᵀ and αV
ffn_flops = 2 * n_seq * d_model * (4 * d_model) # 2 linear layers
total = attn_flops + ffn_flops
print(f" n={n_seq:<5} d={d_model:<5} Attn: {attn_flops:.2e} FFN: {ffn_flops:.2e} Total: {total:.2e}")
print(f"\n Attn = O(n²d), FFN = O(nd²) → Total = O(n²d + nd²)")
8. Special Summation Patterns in AI
8.1 Softmax and the Normalisation Sum (Partition Function)
$$P(v) = \frac{e^{z_v}}{\sum_{v' \in V} e^{z_{v'}}}, \qquad Z = \sum_{v' \in V} e^{z_{v'}}$$
Log-sum-exp trick (numerically stable): $\log \sum_{v} e^{z_v} = m + \log \sum_{v} e^{z_v - m}$, where $m = \max_v z_v$.
8.2 Cross-Entropy Loss as Double Sum
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{v \in V} y_{iv} \log P(v \mid x_i) = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i \mid x_i)$$
Gradient: $\dfrac{\partial \mathcal{L}_i}{\partial z_v} = P(v \mid x_i) - \mathbb{1}[v = y_i]$ — predicted minus true.
8.3 Attention as Weighted Sum
$$\mathrm{Attn}(Q,K,V)_i = \sum_{j} \alpha_{ij} V_j, \qquad \alpha_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d_k})}{\sum_{j'} \exp(q_i \cdot k_{j'} / \sqrt{d_k})}$$
$\sum_j \alpha_{ij} = 1$ — attention is a convex combination of value vectors.
8.4 Perplexity as Geometric Mean
$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(t_i \mid t_{<i})\right) = \left(\prod_{i=1}^{N} \frac{1}{P(t_i \mid t_{<i})}\right)^{1/N}$$
8.5 BM25 as Weighted Sum
$$\mathrm{score}(q,d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t,d)\,(k_1+1)}{f(t,d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}$$
Code cell 21
# ══════════════════════════════════════════════════════════════════
# 8.1 SOFTMAX — NORMALISATION SUM & LOG-SUM-EXP TRICK
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("SOFTMAX: P(v) = exp(zᵥ) / Σᵥ' exp(zᵥ')")
print("=" * 65)
def softmax_naive(z):
"""Naive softmax — may overflow for large z."""
e = np.exp(z)
return e / np.sum(e)
def softmax_stable(z):
"""Numerically stable softmax using max trick."""
m = np.max(z) # m = max(z)
e = np.exp(z - m) # shift by max → largest exp = 1
return e / np.sum(e) # Z = Σ exp(zᵥ − m)
def log_sum_exp(z):
"""log Σ exp(zᵥ) = m + log Σ exp(zᵥ − m)"""
m = np.max(z)
return m + np.log(np.sum(np.exp(z - m)))
# Demo with normal logits
logits = np.array([2.0, 1.0, 0.5, -1.0, 3.0])
probs = softmax_stable(logits)
Z = np.sum(np.exp(logits))
print(f"\n Logits z = {logits}")
print(f" Z = Σ exp(zᵥ) = {Z:.4f}")
print(f" log Z (naive) = {np.log(Z):.6f}")
print(f" log Z (log-sum-exp) = {log_sum_exp(logits):.6f}")
print(f" Softmax P(v): {probs}")
print(f" Σ P(v) = {np.sum(probs):.10f} ✓ (sums to 1)")
# Demo with extreme logits — naive overflows!
extreme = np.array([1000.0, 999.0, 998.0])
print(f"\n Extreme logits: {extreme}")
with warnings.catch_warnings():
warnings.simplefilter("ignore")
naive_result = softmax_naive(extreme)
stable_result = softmax_stable(extreme)
print(f" Naive softmax: {naive_result} ← overflow produces nan!")
print(f" Stable softmax: {stable_result} ← correct!")
# ══════════════════════════════════════════════════════════════════
# 8.2 CROSS-ENTROPY LOSS & ITS GRADIENT
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("CROSS-ENTROPY: L = −(1/N) Σᵢ log P(yᵢ|xᵢ)")
print("=" * 65)
# Simulate 5 training examples, vocab size 4
np.random.seed(42)
N, V_size = 5, 4
logits_batch = np.random.randn(N, V_size)
labels = np.array([2, 0, 3, 1, 2]) # true labels
# Compute cross-entropy loss
total_loss = 0.0
print(f"\n {'Example':<10} {'True y':<8} {'P(y|x)':<12} {'−log P(y|x)':<14}")
print(f" {'─'*10} {'─'*8} {'─'*12} {'─'*14}")
for i in range(N):
probs_i = softmax_stable(logits_batch[i])
p_true = probs_i[labels[i]]
loss_i = -np.log(p_true)
total_loss += loss_i
print(f" {i+1:<10} {labels[i]:<8} {p_true:<12.6f} {loss_i:<14.6f}")
avg_loss = total_loss / N
print(f"\n L = −(1/{N}) Σ log P(yᵢ|xᵢ) = {avg_loss:.6f}")
# Gradient: ∂L/∂zᵥ = P(v|x) − 𝟙[v=y]
print(f"\n Gradient for example 1 (y=2):")
probs_1 = softmax_stable(logits_batch[0])
grad = probs_1.copy()
grad[labels[0]] -= 1.0 # subtract indicator
print(f" P(v|x): {probs_1}")
print(f" 𝟙[v=y]: {[int(v == labels[0]) for v in range(V_size)]}")
print(f" ∂L/∂zᵥ: {grad}")
print(f" Σᵥ ∂L/∂zᵥ = {np.sum(grad):.10f} (sums to 0!) ✓")
Code cell 22
# ══════════════════════════════════════════════════════════════════
# 8.3 ATTENTION AS WEIGHTED SUM
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("ATTENTION: Attn(Q,K,V)ᵢ = Σⱼ αᵢⱼ Vⱼ")
print("=" * 65)
np.random.seed(42)
n_seq, d_k, d_v = 4, 3, 3 # 4 tokens, dim 3
Q = np.random.randn(n_seq, d_k)
K = np.random.randn(n_seq, d_k)
V = np.random.randn(n_seq, d_v)
# Step 1: Scores sᵢⱼ = qᵢ·kⱼ / √dₖ = Σₖ Qᵢₖ Kⱼₖ / √dₖ
scores = Q @ K.T / np.sqrt(d_k)
print(f"\n Score matrix (sᵢⱼ = qᵢ·kⱼ/√dₖ):\n{scores}")
# Step 2: Attention weights αᵢⱼ = softmax(sᵢ) over j
alpha = np.zeros_like(scores)
for i in range(n_seq):
alpha[i] = softmax_stable(scores[i])
print(f"\n Attention weights α (each row sums to 1):\n{np.round(alpha, 4)}")
print(f" Row sums: {alpha.sum(axis=1)}")
# Step 3: Output = Σⱼ αᵢⱼ Vⱼ (weighted sum of value vectors)
attn_output_manual = np.zeros((n_seq, d_v))
for i in range(n_seq):
for j in range(n_seq):
attn_output_manual[i] += alpha[i, j] * V[j]
attn_output_matrix = alpha @ V # same thing as matrix multiply
print(f"\n Attention output (manual Σⱼ αᵢⱼVⱼ):\n{np.round(attn_output_manual, 4)}")
print(f" Attention output (α @ V):\n{np.round(attn_output_matrix, 4)}")
print(f" Match: {np.allclose(attn_output_manual, attn_output_matrix)} ✓")
# Causal mask: restrict j ≤ i
print(f"\n Causal attention (j ≤ i only):")
causal_mask = np.triu(np.ones((n_seq, n_seq)) * (-1e9), k=1)
scores_causal = scores + causal_mask
alpha_causal = np.zeros_like(scores_causal)
for i in range(n_seq):
alpha_causal[i] = softmax_stable(scores_causal[i])
print(f" Causal α:\n{np.round(alpha_causal, 4)}")
print(f" → Token 0 attends only to itself; token 3 attends to all 4")
# ══════════════════════════════════════════════════════════════════
# 8.4–8.5 PERPLEXITY & BM25
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("PERPLEXITY: PPL = exp(−(1/N) Σ log P)")
print("=" * 65)
# Simulate model probabilities for a 10-token sequence
token_probs = np.array([0.8, 0.5, 0.3, 0.9, 0.1, 0.6, 0.4, 0.7, 0.2, 0.05])
N_tok = len(token_probs)
cross_ent = -np.mean(np.log(token_probs))
ppl = np.exp(cross_ent)
geo_mean = np.prod(1.0 / token_probs) ** (1.0 / N_tok)
print(f"\n Token probs: {token_probs}")
print(f" Cross-entropy H = −(1/N)Σ log P = {cross_ent:.4f}")
print(f" PPL = exp(H) = {ppl:.4f}")
print(f" PPL = (Π 1/P)^(1/N) = {geo_mean:.4f}")
print(f" Match: {np.isclose(ppl, geo_mean)} ✓")
print(f"\n Interpretation: model is as uncertain as choosing uniformly")
print(f" from {ppl:.1f} equally likely tokens on average")
# BM25 scoring
print(f"\n{'─' * 65}")
print("BM25: score(q,d) = Σ_{t ∈ q} IDF(t) · TF_mod(t,d)")
print("─" * 65)
def bm25_score(query_terms, doc_tf, doc_len, avg_dl, idf, k1=1.5, b=0.75):
"""BM25 score = Σ_{t ∈ q} IDF(t) · (f·(k1+1)) / (f + k1·(1 - b + b·|d|/avgdl))"""
score = 0.0
details = []
for t in query_terms:
if t in doc_tf:
f = doc_tf[t]
tf_mod = (f * (k1 + 1)) / (f + k1 * (1 - b + b * doc_len / avg_dl))
term_score = idf.get(t, 0) * tf_mod
score += term_score
details.append((t, idf[t], f, tf_mod, term_score))
return score, details
query = ["neural", "network", "training"]
doc_tf = {"neural": 3, "network": 2, "training": 5, "the": 20}
idf = {"neural": 2.5, "network": 2.0, "training": 1.8}
score, details = bm25_score(query, doc_tf, doc_len=100, avg_dl=150, idf=idf)
print(f"\n Query: {query}")
print(f" {'Term':<12} {'IDF':<8} {'TF':<6} {'TF_mod':<10} {'Score'}")
print(f" {'─'*12} {'─'*8} {'─'*6} {'─'*10} {'─'*8}")
for t, i, f, tf_m, s in details:
print(f" {t:<12} {i:<8.2f} {f:<6} {tf_m:<10.4f} {s:.4f}")
print(f"\n BM25(q,d) = Σ = {score:.4f}")
9. Einstein Summation Convention
9.1 Motivation & Rule
In tensor operations, summation over repeated indices is so common that Einstein (1916) proposed: if an index appears exactly twice in a product, sum over it automatically.
- Free index: appears once → result has that index (no summation)
- Dummy index: appears twice → summed over; does not appear in result
9.2 Examples
| Operation | Einstein | Explicit | Result |
|---|---|---|---|
| Dot product | $a_i b_i$ | $\sum_i a_i b_i$ | scalar |
| Matrix-vector | $A_{ij} x_j$ | $\sum_j A_{ij} x_j$ | vector (index $i$) |
| Matrix-matrix | $A_{ik} B_{kj}$ | $\sum_k A_{ik} B_{kj}$ | matrix (indices $i,j$) |
| Trace | $A_{ii}$ | $\sum_i A_{ii}$ | scalar |
| Outer product | $a_i b_j$ (no repeated) | $a_i b_j$ (no sum) | matrix |
| Quadratic form | $x_i A_{ij} x_j$ | $\sum_i \sum_j x_i A_{ij} x_j$ | scalar |
9.3 Rules
- Each index appears at most twice per term
- Repeated index → summed; do not write $\sum$
- Free index → must match on both sides of equation
- Three+ appearances → ambiguous; write explicitly
9.4 einsum in NumPy / PyTorch
np.einsum("ij,jk->ik", A, B) # matrix multiply
np.einsum("ij,ij->", A, B) # Frobenius inner product
np.einsum("bhid,bhjd->bhij", Q, K) # batched attention scores
9.5 Tensor Contraction
Generalisation of matmul to higher-order tensors, e.g. $C_{ij\ell} = \sum_{k} A_{ijk} B_{k\ell}$ (contract over $k$).
Code cell 24
# ══════════════════════════════════════════════════════════════════
# 9.1–9.2 EINSTEIN SUMMATION — ALL EXAMPLES
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("EINSTEIN SUMMATION CONVENTION — np.einsum DEMONSTRATIONS")
print("=" * 65)
A = np.array([[1, 2, 3],
[4, 5, 6]]) # 2×3
B = np.array([[7, 8],
[9, 10],
[11, 12]]) # 3×2
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
# 1. Dot product: aᵢbᵢ = Σᵢ aᵢbᵢ
dot = np.einsum('i,i->', x, y)
print(f"\n 1. Dot product: aᵢbᵢ")
print(f" einsum('i,i->', x, y) = {dot}")
print(f" np.dot(x, y) = {np.dot(x, y)} ✓")
# 2. Matrix-vector: Aᵢⱼxⱼ = Σⱼ Aᵢⱼxⱼ
mv = np.einsum('ij,j->i', A, x)
print(f"\n 2. Matrix-vector: Aᵢⱼxⱼ")
print(f" einsum('ij,j->i', A, x) = {mv}")
print(f" A @ x = {A @ x} ✓")
# 3. Matrix-matrix: AᵢₖBₖⱼ = Σₖ AᵢₖBₖⱼ
mm = np.einsum('ik,kj->ij', A, B)
print(f"\n 3. Matrix multiply: AᵢₖBₖⱼ")
print(f" einsum('ik,kj->ij', A, B):\n{mm}")
print(f" A @ B:\n{A @ B} ✓")
# 4. Trace: Aᵢᵢ = Σᵢ Aᵢᵢ
S = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
tr = np.einsum('ii->', S)
print(f"\n 4. Trace: Aᵢᵢ")
print(f" einsum('ii->', S) = {tr}")
print(f" np.trace(S) = {np.trace(S)} ✓")
# 5. Outer product: aᵢbⱼ (no repeated index)
outer = np.einsum('i,j->ij', x, y)
print(f"\n 5. Outer product: aᵢbⱼ (no sum — free indices i,j)")
print(f" einsum('i,j->ij', x, y):\n{outer}")
print(f" np.outer(x, y):\n{np.outer(x, y)} ✓")
# 6. Quadratic form: xᵢAᵢⱼxⱼ = Σᵢ Σⱼ xᵢAᵢⱼxⱼ
quad = np.einsum('i,ij,j->', x, S, x)
print(f"\n 6. Quadratic form: xᵢAᵢⱼxⱼ")
print(f" einsum('i,ij,j->', x, S, x) = {quad}")
print(f" x @ S @ x = {x @ S @ x} ✓")
# 7. Transpose: Aᵢⱼ → Aⱼᵢ
trans = np.einsum('ij->ji', A)
print(f"\n 7. Transpose: einsum('ij->ji', A):\n{trans}")
# 8. Frobenius inner product: Σᵢⱼ AᵢⱼBᵢⱼ
F1 = np.array([[1, 2], [3, 4]])
F2 = np.array([[5, 6], [7, 8]])
frob = np.einsum('ij,ij->', F1, F2)
print(f"\n 8. Frobenius: AᵢⱼBᵢⱼ = {frob} = tr(AᵀB) = {np.trace(F1.T @ F2)} ✓")
Code cell 25
# ══════════════════════════════════════════════════════════════════
# 9.4 EINSUM FOR ATTENTION (BATCHED MULTI-HEAD)
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("BATCHED MULTI-HEAD ATTENTION VIA EINSUM")
print("=" * 65)
np.random.seed(42)
batch, heads, seq, dim = 2, 3, 4, 5
Q = np.random.randn(batch, heads, seq, dim)
K = np.random.randn(batch, heads, seq, dim)
V = np.random.randn(batch, heads, seq, dim)
# Attention scores: sᵢⱼ = Σₖ Qᵢₖ Kⱼₖ / √dₖ
# einsum: "bhid,bhjd->bhij" — sum over d (dummy); free: b,h,i,j
scores = np.einsum('bhid,bhjd->bhij', Q, K) / np.sqrt(dim)
print(f"\n Q shape: {Q.shape} (batch, heads, seq, dim)")
print(f" K shape: {K.shape}")
print(f" Scores = einsum('bhid,bhjd->bhij', Q, K)/√d")
print(f" Scores shape: {scores.shape} (batch, heads, seq, seq)")
print(f"\n Free indices: b(batch), h(head), i(query), j(key)")
print(f" Dummy index: d(dimension) — summed over!")
# Softmax over j (key dimension)
def batched_softmax(x, axis=-1):
e = np.exp(x - np.max(x, axis=axis, keepdims=True))
return e / np.sum(e, axis=axis, keepdims=True)
alpha = batched_softmax(scores, axis=-1)
print(f"\n α shape: {alpha.shape} (after softmax over j)")
print(f" Row sums (should be 1): {alpha[0, 0].sum(axis=-1)}")
# Output: oᵢ = Σⱼ αᵢⱼ Vⱼ → einsum('bhij,bhjd->bhid')
output = np.einsum('bhij,bhjd->bhid', alpha, V)
print(f"\n Output = einsum('bhij,bhjd->bhid', α, V)")
print(f" Output shape: {output.shape} (same as Q — correct!)")
# ══════════════════════════════════════════════════════════════════
# 9.5 TENSOR CONTRACTION
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("TENSOR CONTRACTION — GENERALISED MATMUL")
print("=" * 65)
# Contract 3D tensor with 2D tensor
T = np.random.randn(3, 4, 5) # Aᵢⱼₖ
M = np.random.randn(5, 2) # Bₖₗ
# Contract over k: Cᵢⱼₗ = Σₖ Aᵢⱼₖ Bₖₗ
C = np.einsum('ijk,kl->ijl', T, M)
print(f"\n A shape: {T.shape} (3D tensor)")
print(f" B shape: {M.shape} (matrix)")
print(f" Σₖ Aᵢⱼₖ Bₖₗ → C shape: {C.shape}")
# Multi-index contraction: Σⱼ Σₖ AᵢⱼₖBⱼₖₗ = Cᵢₗ
T2 = np.random.randn(3, 4, 5)
T3 = np.random.randn(4, 5, 2)
C2 = np.einsum('ijk,jkl->il', T2, T3)
print(f"\n Σⱼ Σₖ Aᵢⱼₖ Bⱼₖₗ → C shape: {C2.shape} (contracted 2 indices)")
# FLOPs: product of all index ranges
flops = 3 * 4 * 5 * 2 # i × j × k × l
print(f" FLOPs: {flops} = 3×4×5×2 (product of ALL index ranges)")
# PyTorch einsum comparison (if available)
if HAS_TORCH:
Q_t = torch.randn(2, 3, 4, 5)
K_t = torch.randn(2, 3, 4, 5)
scores_t = torch.einsum('bhid,bhjd->bhij', Q_t, K_t)
print(f"\n torch.einsum('bhid,bhjd->bhij', Q, K) shape: {list(scores_t.shape)}")
print(f" → Same as NumPy einsum — lingua franca of tensor ops!")
else:
print(f"\n (PyTorch not available — einsum works identically with torch.einsum)")
10. Double Sums and Index Interaction
10.1 Diagonal, Upper-Triangular, and Lower-Triangular Sums
- Diagonal: $\sum_{i} A_{ii} = \mathrm{tr}(A)$
- Upper triangular: $\sum_{i < j} A_{ij}$ — causal attention restricts to $j \le i$
- Symmetrisation: $\sum_{i,j} A_{ij} = \sum_{i} A_{ii} + \sum_{i<j} \bigl(A_{ij} + A_{ji}\bigr)$
10.2 Changing Order in Triangular Sums
$$\sum_{i=1}^{n} \sum_{j=1}^{i} f(i,j) = \sum_{j=1}^{n} \sum_{i=j}^{n} f(i,j)$$
Both cover the same pairs with $j \le i$; just different traversal order.
10.3 Cauchy Product (Product of Two Series)
$$\left(\sum_{n=0}^{\infty} a_n\right)\left(\sum_{n=0}^{\infty} b_n\right) = \sum_{n=0}^{\infty} c_n, \qquad c_n = \sum_{k=0}^{n} a_k b_{n-k}$$
where $c_n$ is a convolution — the same operation used in CNNs!
10.4 Summation by Parts (Abel's Summation)
$$\sum_{k=1}^{n} a_k b_k = A_n b_n - \sum_{k=1}^{n-1} A_k \bigl(b_{k+1} - b_k\bigr)$$
where $A_k = \sum_{i=1}^{k} a_i$ — the discrete analogue of integration by parts.
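Code cell 27 below covers 10.1–10.3 but not Abel's identity, so here is a minimal numeric check. The vector length (8) and random seed are arbitrary choices for the sketch.

```python
# Minimal sketch (not from the original notebook): verify summation by parts,
# Σₖ₌₁ⁿ aₖbₖ = Aₙbₙ - Σₖ₌₁ⁿ⁻¹ Aₖ(bₖ₊₁ - bₖ), with Aₖ the running partial sums of a.
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(8)
b = rng.standard_normal(8)
n = len(a)

A = np.cumsum(a)                                  # Aₖ = Σᵢ₌₁ᵏ aᵢ (0-based: A[k])
lhs = np.sum(a * b)                               # direct Σ aₖbₖ
rhs = A[-1] * b[-1] - sum(A[k] * (b[k + 1] - b[k]) for k in range(n - 1))
print(f"direct    = {lhs:.10f}")
print(f"by parts  = {rhs:.10f}")
print(f"match: {np.isclose(lhs, rhs)}")
```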
Code cell 27
# ══════════════════════════════════════════════════════════════════
# 10.1 DIAGONAL & TRIANGULAR SUMS
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("DIAGONAL & TRIANGULAR SUMS")
print("=" * 65)
M = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16]])
n = 4
# Diagonal sum: Σᵢ Mᵢᵢ
diag_sum = sum(M[i, i] for i in range(n))
print(f"\n M:\n{M}")
print(f"\n Diagonal: Σᵢ Mᵢᵢ = {diag_sum} = tr(M) = {np.trace(M)} ✓")
# Upper triangular (i < j): Σᵢ Σⱼ>ᵢ Mᵢⱼ
upper = sum(M[i, j] for i in range(n) for j in range(i+1, n))
print(f" Upper triangular (i < j): {upper}")
print(f" np.triu(M, k=1).sum() = {np.triu(M, k=1).sum()} ✓")
# Lower triangular (i > j): Σᵢ Σⱼ<ᵢ Mᵢⱼ
lower = sum(M[i, j] for i in range(n) for j in range(i))
print(f" Lower triangular (i > j): {lower}")
# Verify: full = diagonal + upper + lower
full = np.sum(M)
print(f"\n Full = diag + upper + lower: {full} = {diag_sum} + {upper} + {lower} = {diag_sum + upper + lower} ✓")
# AI: Causal attention mask
print(f"\n AI: Causal attention uses LOWER triangular (j ≤ i):")
causal = np.tril(np.ones((4, 4)))
print(f" Mask:\n{causal.astype(int)}")
# ══════════════════════════════════════════════════════════════════
# 10.2 CHANGING ORDER IN TRIANGULAR SUMS
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("INTERCHANGE ORDER: Σᵢ₌₁ⁿ Σⱼ₌₁ⁱ = Σⱼ₌₁ⁿ Σᵢ₌ⱼⁿ")
print("=" * 65)
n = 5
f_func = lambda i, j: i + j
# Order 1: fix i, sum j from 1 to i
order1 = 0
print(f"\n Order 1: Σᵢ₌₁⁵ Σⱼ₌₁ⁱ f(i,j) where f(i,j) = i+j")
for i in range(1, n+1):
inner = sum(f_func(i, j) for j in range(1, i+1))
order1 += inner
terms = [f"f({i},{j})" for j in range(1, i+1)]
print(f" i={i}: Σⱼ₌₁^{i} = {' + '.join(terms)} = {inner}")
print(f" Total = {order1}")
# Order 2: fix j, sum i from j to n
order2 = 0
print(f"\n Order 2: Σⱼ₌₁⁵ Σᵢ₌ⱼ⁵ f(i,j)")
for j in range(1, n+1):
inner = sum(f_func(i, j) for i in range(j, n+1))
order2 += inner
terms = [f"f({i},{j})" for i in range(j, n+1)]
print(f" j={j}: Σᵢ₌{j}⁵ = {' + '.join(terms)} = {inner}")
print(f" Total = {order2}")
print(f"\n Match: {order1 == order2} ✓ (same pairs, different order)")
# ══════════════════════════════════════════════════════════════════
# 10.3 CAUCHY PRODUCT = CONVOLUTION
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("CAUCHY PRODUCT: cₖ = Σᵢ₌₀ᵏ aᵢb_{k-i} (CONVOLUTION)")
print("=" * 65)
a = np.array([1.0, 2.0, 3.0]) # polynomial: 1 + 2x + 3x²
b = np.array([4.0, 5.0]) # polynomial: 4 + 5x
# Product polynomial coefficients via Cauchy product
n_a, n_b = len(a), len(b)
c = np.zeros(n_a + n_b - 1)
for k in range(len(c)):
for i in range(k + 1):
if i < n_a and (k - i) < n_b:
c[k] += a[i] * b[k - i]
print(f"\n a(x) = {a[0]:.0f} + {a[1]:.0f}x + {a[2]:.0f}x²")
print(f" b(x) = {b[0]:.0f} + {b[1]:.0f}x")
print(f"\n Cauchy product cₖ = Σᵢ aᵢb_{k-i}:")
for k in range(len(c)):
terms = []
for i in range(k + 1):
if i < n_a and (k - i) < n_b:
terms.append(f"a[{i}]·b[{k-i}] = {a[i]*b[k-i]:.0f}")
print(f" c[{k}] = {' + '.join(terms)} = {c[k]:.0f}")
print(f"\n c(x) = {' + '.join(f'{c[k]:.0f}x^{k}' if k > 0 else f'{c[k]:.0f}' for k in range(len(c)))}")
print(f" np.convolve(a, b) = {np.convolve(a, b)}")
print(f" Match: {np.allclose(c, np.convolve(a, b))} ✓")
print(f"\n AI: Convolution in CNNs uses this same sum structure!")
11. Convergence of Infinite Series
11.1 Partial Sums and Convergence
$S_n = \sum_{k=1}^{n} a_k$ — the $n$-th partial sum.
The series $\sum_{n} a_n$ converges to $S$ if $\lim_{n \to \infty} S_n = S$.
Necessary condition: if $\sum_n a_n$ converges, then $a_n \to 0$. But NOT sufficient — harmonic series: $\frac{1}{n} \to 0$ yet $\sum_n \frac{1}{n} = \infty$.
11.2 Absolute vs Conditional Convergence
- Absolute: $\sum_n |a_n|$ converges → $\sum_n a_n$ converges (safe to rearrange)
- Conditional: $\sum_n a_n$ converges but not absolutely → Riemann rearrangement theorem: can be rearranged to sum to ANY value (see the sketch below)!
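A small sketch of the rearrangement phenomenon: taking two positive terms of the alternating harmonic series for every negative term is a classical rearrangement that converges to $\tfrac{3}{2}\ln 2$ instead of $\ln 2$. The truncation counts below are arbitrary choices for the demo.

```python
# Minimal sketch (not from the original notebook): Riemann rearrangement in action.
import numpy as np

def rearranged_sum(blocks):
    """Take 2 positive terms (1, 1/3, 1/5, ...) then 1 negative term (-1/2, -1/4, ...)."""
    total, pos, neg = 0.0, 1, 2
    for _ in range(blocks):
        total += 1.0 / pos + 1.0 / (pos + 2)   # two odd-denominator terms
        pos += 4
        total -= 1.0 / neg                     # one even-denominator term
        neg += 2
    return total

original = sum((-1)**(n + 1) / n for n in range(1, 200001))
print(f"original order             ≈ {original:.6f}   (ln 2     = {np.log(2):.6f})")
print(f"rearranged (2 pos : 1 neg) ≈ {rearranged_sum(100000):.6f}   (3/2·ln 2 = {1.5 * np.log(2):.6f})")
```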
11.3 Key Convergence Tests
| Test | Condition | Conclusion |
|---|---|---|
| Ratio | $L = \lim_{n\to\infty} \lvert a_{n+1}/a_n \rvert$ | $L < 1$ converges; $L > 1$ diverges; $L = 1$ inconclusive |
| Root | $L = \lim_{n\to\infty} \lvert a_n \rvert^{1/n}$ | $L < 1$ converges; $L > 1$ diverges; $L = 1$ inconclusive |
| Alternating | $a_n > 0$, decreasing, $a_n \to 0$ | $\sum_n (-1)^n a_n$ converges |
| $p$-series | $\sum_n \frac{1}{n^p}$ | Converges iff $p > 1$ |
11.4 Power Series
$$f(x) = \sum_{n=0}^{\infty} c_n x^n$$
Radius of convergence $R$: the series converges for $|x| < R$, diverges for $|x| > R$.
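A minimal sketch estimating $R$ from the ratio test $R = \lim_{n} \lvert c_n / c_{n+1} \rvert$. The three coefficient sequences and the cutoff $n = 40$ are illustrative choices, not from the notes.

```python
# Minimal sketch (not from the original notebook): estimate the radius of convergence
# via |cₙ/cₙ₊₁| for large n. For cₙ = 1/n! the ratio keeps growing, i.e. R = ∞.
import math

coeffs = {
    "cₙ = 1    (Σ xⁿ,    R = 1)  ": lambda n: 1.0,
    "cₙ = 1/n! (Σ xⁿ/n!, R = ∞)  ": lambda n: 1.0 / math.factorial(n),
    "cₙ = 2ⁿ   (Σ 2ⁿxⁿ,  R = 1/2)": lambda n: 2.0**n,
}
N = 40
for label, c in coeffs.items():
    ratio = abs(c(N) / c(N + 1))   # |cₙ/cₙ₊₁| at n = N, an estimate of R
    print(f"{label}   |c₄₀/c₄₁| ≈ {ratio:.4f}")
```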
11.5 AI: Learning Rate Convergence (Robbins-Monro)
$\sum_{t=1}^{\infty} \eta_t = \infty$ (take enough steps) AND $\sum_{t=1}^{\infty} \eta_t^2 < \infty$ (don't overshoot).
Example: $\eta_t = 1/t$ satisfies both: $\sum_t \frac{1}{t} = \infty$, $\sum_t \frac{1}{t^2} = \frac{\pi^2}{6} < \infty$.
Code cell 29
# ══════════════════════════════════════════════════════════════════
# 11.1–11.2 CONVERGENCE: PARTIAL SUMS & ABSOLUTE VS CONDITIONAL
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("CONVERGENCE OF INFINITE SERIES")
print("=" * 65)
# Geometric series: converges for |r| < 1
print(f"\n Geometric series Σ rⁿ — partial sums Sₙ:")
print(f" {'n':<8} {'r=0.5':<14} {'r=0.9':<14} {'r=1.1':<14} {'1/n (harmonic)':<14}")
print(f" {'─'*8} {'─'*14} {'─'*14} {'─'*14} {'─'*14}")
for n in [5, 10, 50, 100, 500, 1000]:
s_05 = sum(0.5**i for i in range(n))
s_09 = sum(0.9**i for i in range(n))
s_11 = sum(1.1**i for i in range(n))
h_n = sum(1.0/i for i in range(1, n+1))
print(f" {n:<8} {s_05:<14.6f} {s_09:<14.6f} {s_11:<14.2f} {h_n:<14.6f}")
print(f"\n r=0.5 → 1/(1-0.5) = {1/0.5:.1f}")
print(f" r=0.9 → 1/(1-0.9) = {1/0.1:.1f}")
print(f" r=1.1 → DIVERGES (grows without bound)")
print(f" 1/n → DIVERGES (harmonic, but very slowly)")
# Alternating harmonic: converges conditionally
print(f"\n{'─' * 65}")
print("CONDITIONAL CONVERGENCE: Σ (-1)ⁿ⁺¹/n = ln(2)")
print("─" * 65)
print(f"\n {'n':<8} {'Σ (-1)ⁿ⁺¹/n':<18} {'ln(2)':<18} {'Error'}")
print(f" {'─'*8} {'─'*18} {'─'*18} {'─'*12}")
for n in [10, 100, 1000, 10000]:
alt_sum = sum((-1)**(i+1) / i for i in range(1, n+1))
err = abs(alt_sum - np.log(2))
print(f" {n:<8} {alt_sum:<18.10f} {np.log(2):<18.10f} {err:<12.2e}")
print(f"\n Converges conditionally: Σ |(-1)ⁿ⁺¹/n| = Σ 1/n = ∞ (diverges!)")
print(f" Riemann theorem: rearranging terms can make it sum to ANY value!")
# ══════════════════════════════════════════════════════════════════
# 11.3 CONVERGENCE TESTS
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("CONVERGENCE TESTS")
print("=" * 65)
# Ratio test: aₙ = xⁿ/n!
print(f"\n Ratio test for eˣ series: aₙ = xⁿ/n!")
x_val = 2.0
print(f" x = {x_val}")
print(f" {'n':<6} {'aₙ = xⁿ/n!':<18} {'|aₙ₊₁/aₙ|':<18} {'Verdict'}")
print(f" {'─'*6} {'─'*18} {'─'*18} {'─'*20}")
for n in range(1, 10):
a_n = x_val**n / math.factorial(n)
a_n1 = x_val**(n+1) / math.factorial(n+1)
ratio = abs(a_n1 / a_n)
verdict = "< 1 → converges" if ratio < 1 else "> 1 → diverges" if ratio > 1 else "= 1"
print(f" {n:<6} {a_n:<18.8f} {ratio:<18.8f} {verdict}")
print(f" → |aₙ₊₁/aₙ| = x/(n+1) → 0 < 1 for any x → eˣ converges ∀ x ✓")
# ══════════════════════════════════════════════════════════════════
# 11.5 ROBBINS-MONRO CONDITIONS FOR LEARNING RATE
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("ROBBINS-MONRO: Σ ηₜ = ∞ AND Σ ηₜ² < ∞")
print("=" * 65)
schedules = {
"ηₜ = 1/t": lambda t: 1.0/t,
"ηₜ = 1/√t": lambda t: 1.0/np.sqrt(t),
"ηₜ = 1/t²": lambda t: 1.0/t**2,
"ηₜ = 0.01 (const)": lambda t: 0.01,
}
T = 10000
print(f"\n T = {T}")
print(f"\n {'Schedule':<20} {'Σ ηₜ':<14} {'Σ ηₜ²':<14} {'Σηₜ=∞?':<10} {'Σηₜ²<∞?':<10} {'Valid?'}")
print(f" {'─'*20} {'─'*14} {'─'*14} {'─'*10} {'─'*10} {'─'*6}")
for name, lr_fn in schedules.items():
sum_lr = sum(lr_fn(t) for t in range(1, T+1))
sum_lr2 = sum(lr_fn(t)**2 for t in range(1, T+1))
grows = sum_lr > 50 # heuristic: still growing
bounded = sum_lr2 < 100
valid = grows and bounded
print(f" {name:<20} {sum_lr:<14.4f} {sum_lr2:<14.6f} {'✓' if grows else '✗':<10} {'✓' if bounded else '✗':<10} {'✓' if valid else '✗'}")
print(f"\n 1/t satisfies both: Σ 1/t = ∞ (harmonic) and Σ 1/t² = π²/6 ≈ {np.pi**2/6:.4f} ✓")
print(f" → Adam adapts lr per-parameter, effectively satisfying these conditions")
12. Summation in Probability and Statistics
12.1 Expectation as Weighted Summation
The expected value is the fundamental bridge between summation and probability:
$$\mathbb{E}[X] = \sum_{x} x \, P(X = x)$$
For a function $g$: $\mathbb{E}[g(X)] = \sum_{x} g(x)\, P(X = x)$
12.2 Variance via Summation
$$\mathrm{Var}(X) = \sum_{x} (x - \mu)^2\, P(X = x) = \mathbb{E}[X^2] - \bigl(\mathbb{E}[X]\bigr)^2$$
12.3 Entropy and Information
Shannon entropy measures uncertainty using a sum of log-probabilities:
$$H(X) = -\sum_{x} P(x) \log P(x)$$
12.4 KL Divergence
Measures how distribution $P$ differs from $Q$:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
Properties:
- $D_{\mathrm{KL}}(P\|Q) \ge 0$ (Gibbs' inequality, from Jensen's inequality)
- $D_{\mathrm{KL}}(P\|Q) \ne D_{\mathrm{KL}}(Q\|P)$ — not symmetric
- Cross-entropy: $H(P,Q) = H(P) + D_{\mathrm{KL}}(P\|Q)$
12.5 Jensen's Inequality
For convex $\varphi$: $\varphi\!\bigl(\mathbb{E}[X]\bigr) \le \mathbb{E}\!\bigl[\varphi(X)\bigr]$
This proves $D_{\mathrm{KL}}(P\|Q) \ge 0$ and explains why:
- Average of logs ≤ log of average: $\mathbb{E}[\log X] \le \log \mathbb{E}[X]$
- $\bigl(\mathbb{E}[X]\bigr)^2 \le \mathbb{E}[X^2]$ (since $x^2$ is convex)
12.6 Law of Total Expectation
$$\mathbb{E}[X] = \sum_{k} P(Z = k)\, \mathbb{E}[X \mid Z = k]$$
Used in mixture models: each component contributes $\mathbb{E}[X \mid Z = k]$ weighted by mixing probability $\pi_k = P(Z = k)$.
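Code cell 31 below does not cover 12.6, so here is a minimal sketch for a two-component Gaussian mixture. The mixing weights, component means, and sample size are arbitrary choices for the demo.

```python
# Minimal sketch (not from the original notebook): law of total expectation,
# E[X] = Σₖ πₖ·E[X|Z=k], checked against a direct Monte-Carlo estimate.
import numpy as np

rng = np.random.default_rng(42)
pi = np.array([0.3, 0.7])       # mixing probabilities πₖ = P(Z=k)
means = np.array([1.0, 5.0])    # component means E[X | Z=k]

total_expectation = np.sum(pi * means)          # analytic: Σₖ πₖ E[X|Z=k]

z = rng.choice(2, size=200_000, p=pi)           # sample Z ~ Categorical(π)
x = rng.normal(means[z], 1.0)                   # then X ~ Normal(mean_Z, 1)
print(f"Σₖ πₖ E[X|Z=k]   = {total_expectation:.4f}")
print(f"Monte-Carlo E[X] ≈ {x.mean():.4f}")
```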
12.7 Inclusion-Exclusion Principle
$$P\!\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots$$
Compact: $P\!\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{\emptyset \ne S \subseteq \{1,\ldots,n\}} (-1)^{|S|+1}\, P\!\left(\bigcap_{i \in S} A_i\right)$
Code cell 31
# ══════════════════════════════════════════════════════════════════
# 12 SUMMATION IN PROBABILITY & STATISTICS
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("SUMMATION IN PROBABILITY & STATISTICS")
print("=" * 65)
# --- Expectation and Variance ---
print("\n12.1–12.2 Expectation & Variance")
print("─" * 50)
# Fair die
x = np.arange(1, 7)
p = np.ones(6) / 6
E_X = np.sum(x * p)
E_X2 = np.sum(x**2 * p)
Var_X = np.sum((x - E_X)**2 * p)
Var_shortcut = E_X2 - E_X**2
print(f"\n Fair die: X ∈ {list(x)}")
print(f" E[X] = Σ xᵢpᵢ = {E_X:.4f} (analytic: 3.5)")
print(f" E[X²] = Σ xᵢ²pᵢ = {E_X2:.4f}")
print(f" Var(X) = Σ (xᵢ-μ)²pᵢ = {Var_X:.4f}")
print(f" Var(X) = E[X²]-E[X]² = {Var_shortcut:.4f} ✓")
assert np.isclose(Var_X, Var_shortcut)
# --- Entropy ---
print(f"\n{'─' * 50}")
print("12.3 Entropy H(X) = -Σ pᵢ log pᵢ")
print("─" * 50)
def entropy(probs):
"""Shannon entropy in nats, handling p=0."""
p = np.array(probs, dtype=np.float64)
mask = p > 0
return -np.sum(p[mask] * np.log(p[mask]))
distributions = {
"Uniform(6)": [1/6]*6,
"Peaked": [0.9, 0.02, 0.02, 0.02, 0.02, 0.02],
"Deterministic": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
"Binary fair": [0.5, 0.5],
}
print(f"\n {'Distribution':<18} {'H (nats)':<12} {'H (bits)':<12} {'Max H?'}")
print(f" {'─'*18} {'─'*12} {'─'*12} {'─'*8}")
for name, p_dist in distributions.items():
h = entropy(p_dist)
h_bits = h / np.log(2)
max_h = np.log(len(p_dist))
print(f" {name:<18} {h:<12.4f} {h_bits:<12.4f} {'= max' if np.isclose(h, max_h) else ''}")
# --- KL Divergence ---
print(f"\n{'─' * 50}")
print("12.4 KL Divergence — asymmetry demonstration")
print("─" * 50)
def kl_divergence(p, q):
"""KL(P || Q) = Σ pᵢ log(pᵢ/qᵢ)"""
p, q = np.array(p), np.array(q)
mask = p > 0
return np.sum(p[mask] * np.log(p[mask] / q[mask]))
P = [0.4, 0.3, 0.2, 0.1]
Q = [0.25, 0.25, 0.25, 0.25]
kl_pq = kl_divergence(P, Q)
kl_qp = kl_divergence(Q, P)
cross_h = -np.sum(np.array(P) * np.log(np.array(Q)))
print(f"\n P = {P}")
print(f" Q = {Q}")
print(f" KL(P || Q) = {kl_pq:.6f}")
print(f" KL(Q || P) = {kl_qp:.6f}")
print(f" KL(P||Q) ≠ KL(Q||P): asymmetric! ✓")
print(f"\n Cross-entropy H(P,Q) = -Σ pᵢ log qᵢ = {cross_h:.6f}")
print(f" H(P) + KL(P||Q) = {entropy(P) + kl_pq:.6f}")
print(f" H(P,Q) = H(P) + KL(P||Q): {np.isclose(cross_h, entropy(P) + kl_pq)} ✓")
# --- Jensen's Inequality ---
print(f"\n{'─' * 50}")
print("12.5 Jensen's Inequality: φ(E[X]) ≤ E[φ(X)] for convex φ")
print("─" * 50)
x_vals = np.array([1.0, 4.0, 9.0])
p_vals = np.array([0.2, 0.5, 0.3])
# x² is convex
E_x = np.sum(x_vals * p_vals)
E_x2 = np.sum(x_vals**2 * p_vals)
Ex_squared = E_x**2
print(f"\n φ(x) = x² (convex)")
print(f" E[X] = {E_x:.2f}")
print(f" (E[X])² = {Ex_squared:.4f}")
print(f" E[X²] = {E_x2:.4f}")
print(f" (E[X])² ≤ E[X²]: {Ex_squared:.4f} ≤ {E_x2:.4f} → {Ex_squared <= E_x2} ✓")
# -log is convex: proves KL ≥ 0
E_log = np.sum(p_vals * np.log(x_vals))
log_E = np.log(E_x)
print(f"\n φ(x) = -log(x) (convex) → E[log X] ≤ log E[X]")
print(f" E[log X] = {E_log:.4f}")
print(f" log E[X] = {log_E:.4f}")
print(f" E[log X] ≤ log E[X]: {E_log <= log_E} ✓")
# --- Inclusion-Exclusion ---
print(f"\n{'─' * 50}")
print("12.7 Inclusion-Exclusion Principle")
print("─" * 50)
# Three events with known overlaps
pA, pB, pC = 0.5, 0.4, 0.3
pAB, pAC, pBC = 0.2, 0.15, 0.1
pABC = 0.05
p_union = pA + pB + pC - pAB - pAC - pBC + pABC
print(f"\n P(A)=0.5, P(B)=0.4, P(C)=0.3")
print(f" P(A∩B)=0.2, P(A∩C)=0.15, P(B∩C)=0.1, P(A∩B∩C)=0.05")
print(f" P(A∪B∪C) = {pA}+{pB}+{pC} - {pAB}-{pAC}-{pBC} + {pABC}")
print(f" = {p_union:.2f}")
print(f" (Alternating signs: +singles, -pairs, +triples)")
13. Notation Variants and Conventions in ML
13.1 Subscript vs Superscript Indexing
Different ML subfields use different conventions:
| Convention | Example | Where used |
|---|---|---|
| Subscript for samples | $\sum_{i=1}^{N} \ell(x_i, y_i)$ | Most ML papers |
| Superscript for samples | $\sum_{i=1}^{m} \ell\!\left(x^{(i)}, y^{(i)}\right)$ | Andrew Ng's courses, CS229 |
| Subscript for features | $\sum_{j=1}^{d} w_j x_j$ | Feature-level operations |
| Superscript for layers | $a^{(l)} = \sigma\!\left(W^{(l)} a^{(l-1)} + b^{(l)}\right)$ | Neural network notation |
| Superscript for iteration | $\theta^{(t+1)} = \theta^{(t)} - \eta \nabla L$ | Optimization algorithms |
13.2 Compact Notation Styles
| Notation | Meaning | Context |
|---|---|---|
| $\sum_i a_i$ | Sum over all valid $i$ | When range is clear from context |
| $\sum_{x \in S} f(x)$ | Sum over elements of set $S$ | Probability, combinatorics |
| $\sum_{\mathbf{x}} p(\mathbf{x})$ | Sum over all configurations | Partition functions, graphical models |
| $\sum_{(x, y) \in \mathcal{D}} \ell(x, y)$ | Sum over dataset $\mathcal{D}$ | Empirical risk minimization |
13.3 Neural Network Formulas in Σ Notation
Forward pass (fully connected layer):
$$z_j = \sum_k w_{jk} a_k + b_j$$
Matrix form:
$$\mathbf{z} = W \mathbf{a} + \mathbf{b}$$
Backpropagation (error signal):
$$\delta_j = \sigma'(z_j) \sum_k w_{kj} \delta_k$$
Gradient of loss w.r.t. weights:
$$\frac{\partial L}{\partial w_{jk}} = \frac{1}{N} \sum_{i=1}^{N} \delta_j^{(i)} a_k^{(i)}$$
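Code cell 33 below verifies the forward pass and the weight gradient; here is a minimal added sketch of the error-signal recursion, with a sigmoid activation assumed purely for illustration, comparing the explicit Σ against the matrix form σ'(z) ⊙ (Wᵀδ):

import numpy as np

np.random.seed(0)
n_hidden, n_out = 4, 3
W_next = np.random.randn(n_out, n_hidden)        # weights from this layer to the next
delta_next = np.random.randn(n_out)              # error signal δ at the next layer
z_hidden = np.random.randn(n_hidden)             # pre-activations z at this layer
sig = 1.0 / (1.0 + np.exp(-z_hidden))
sigma_prime = sig * (1.0 - sig)                  # σ'(z) for the sigmoid

# Explicit Σ version: δ_j = σ'(z_j) Σ_k w_kj δ_k
delta_explicit = np.array([
    sigma_prime[j] * sum(W_next[k, j] * delta_next[k] for k in range(n_out))
    for j in range(n_hidden)
])

# Matrix version: δ = σ'(z) ⊙ (Wᵀ δ_next)
delta_matrix = sigma_prime * (W_next.T @ delta_next)

print(np.allclose(delta_explicit, delta_matrix))  # True: same sum, two notations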
13.4 Implicit Summation in Matrix Notation
Many ML formulas hide summations inside matrix operations:
| Matrix form | Explicit summation |
|---|---|
| $\mathbf{a}^\top \mathbf{b}$ | $\sum_i a_i b_i$ |
| $W\mathbf{a}$ | $(W\mathbf{a})_j = \sum_k w_{jk} a_k$ |
| $\mathrm{tr}(AB)$ | $\sum_i \sum_j A_{ij} B_{ji}$ |
| $\lVert A \rVert_F^2$ | $\sum_i \sum_j A_{ij}^2$ |
Code cell 33
# ══════════════════════════════════════════════════════════════════
# 13 NOTATION VARIANTS & NEURAL NETWORK FORMULAS
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("NOTATION VARIANTS & NEURAL NETWORK FORMULAS")
print("=" * 65)
# --- 13.3: Neural network forward pass in Σ notation ---
print("\n13.3 Forward pass: z_j = Σ_k w_jk * a_k + b_j")
print("─" * 50)
np.random.seed(42)
n_in, n_out = 4, 3
# Layer weights and biases
W = np.random.randn(n_out, n_in) * 0.5
b = np.random.randn(n_out) * 0.1
a_prev = np.array([1.0, 0.5, -0.3, 0.8])
# Explicit summation (the Σ version)
z_explicit = np.zeros(n_out)
for j in range(n_out):
s = 0.0
for k in range(n_in):
s += W[j, k] * a_prev[k]
z_explicit[j] = s + b[j]
# Matrix form (implicit summation)
z_matrix = W @ a_prev + b
print(f"\n Input a = {a_prev}")
print(f" W shape: ({n_out}, {n_in})")
print(f"\n {'Neuron j':<12} {'Σ_k w_jk*a_k + b_j':<22} {'W@a + b':<22} {'Match?'}")
print(f" {'─'*12} {'─'*22} {'─'*22} {'─'*6}")
for j in range(n_out):
print(f" j={j:<9} {z_explicit[j]:<22.8f} {z_matrix[j]:<22.8f} {'✓' if np.isclose(z_explicit[j], z_matrix[j]) else '✗'}")
# --- 13.4: Hidden summations in matrix ops ---
print(f"\n{'─' * 50}")
print("13.4 Hidden summations inside matrix operations")
print("─" * 50)
A = np.array([[1, 2], [3, 4]], dtype=float)
B = np.array([[5, 6], [7, 8]], dtype=float)
x_vec = np.array([1.0, 2.0])
# trace(AB) = Σ_i Σ_j A_ij B_ji
trace_builtin = np.trace(A @ B)
trace_explicit = sum(A[i, j] * B[j, i] for i in range(2) for j in range(2))
# Frobenius norm: ||A||_F² = Σ_i Σ_j A_ij²
frob_builtin = np.linalg.norm(A, 'fro')**2
frob_explicit = sum(A[i, j]**2 for i in range(2) for j in range(2))
# dot product: a⊤b = Σ_i a_i b_i
dot_builtin = x_vec @ x_vec
dot_explicit = sum(x_vec[i]**2 for i in range(len(x_vec)))
print(f"\n {'Operation':<25} {'Matrix form':<14} {'Σ notation':<14} {'Match?'}")
print(f" {'─'*25} {'─'*14} {'─'*14} {'─'*6}")
print(f" {'tr(AB)':<25} {trace_builtin:<14.4f} {trace_explicit:<14.4f} {'✓' if np.isclose(trace_builtin, trace_explicit) else '✗'}")
print(f" {'||A||_F²':<25} {frob_builtin:<14.4f} {frob_explicit:<14.4f} {'✓' if np.isclose(frob_builtin, frob_explicit) else '✗'}")
print(f" {'x⊤x':<25} {dot_builtin:<14.4f} {dot_explicit:<14.4f} {'✓' if np.isclose(dot_builtin, dot_explicit) else '✗'}")
# --- Backprop gradient formula ---
print(f"\n{'─' * 50}")
print("∂L/∂w_jk = (1/N) Σ_i δ_j^(i) a_k^(i)")
print("─" * 50)
N = 5 # batch size
deltas = np.random.randn(N, n_out) # error signals δ
a_batch = np.random.randn(N, n_in) # activations
# Explicit double sum
dW_explicit = np.zeros((n_out, n_in))
for j in range(n_out):
for k in range(n_in):
for i in range(N):
dW_explicit[j, k] += deltas[i, j] * a_batch[i, k]
dW_explicit[j, k] /= N
# Matrix form: (1/N) δᵀ a
dW_matrix = (1/N) * deltas.T @ a_batch
print(f"\n Batch size N={N}, output={n_out}, input={n_in}")
print(f" Max |dW_explicit - dW_matrix| = {np.max(np.abs(dW_explicit - dW_matrix)):.2e}")
print(f" Triple sum ≡ matrix multiply: {np.allclose(dW_explicit, dW_matrix)} ✓")
print(f"\n Key insight: The Σ notation makes index relationships explicit,")
print(f" but matrix ops are what GPUs actually compute efficiently.")
14. Common Mistakes and Pitfalls
14.1 Index Errors
| Mistake | Why it's wrong | Correct |
|---|---|---|
| $\sum_i x_i y_i = \left(\sum_i x_i\right)\left(\sum_i y_i\right)$ | Products don't distribute over Σ | Keep the product inside a single sum: $\sum_i x_i y_i$ |
| $\sum_{i=1}^{n} f(i) = \sum_{j=0}^{n-1} f(j)$ | Changed index but not the term | Must change both: $\sum_{i=1}^{n} f(i) = \sum_{j=0}^{n-1} f(j+1)$ |
| $\sum_{i=1}^{n} a_i = \sum_{i=0}^{n} a_i$ | Wrong re-indexing (off-by-one adds or drops a term) | Shift bounds and terms together: $\sum_{i=1}^{n} a_i = \sum_{i=0}^{n-1} a_{i+1}$ |
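A quick numerical check of the shift rule in the middle row, using an arbitrary illustrative term f(i) = i² + 1:

def f(i):
    return i**2 + 1                                     # any term works; chosen arbitrarily

n = 10
original      = sum(f(i) for i in range(1, n + 1))      # Σ_{i=1..n} f(i)
correct_shift = sum(f(j + 1) for j in range(0, n))      # Σ_{j=0..n-1} f(j+1): bounds AND term shifted
wrong_shift   = sum(f(j) for j in range(0, n))          # bounds shifted, term unchanged: the mistake
print(original, correct_shift, wrong_shift)             # 395 395 295 — first two match, third does not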
14.2 Scope and Binding Errors
The index variable is bound — it has no meaning outside the sum:
❌ Writing $\left(\sum_{i=1}^{n} x_i\right) + i$ — $i$ is not defined outside the sum
This is analogous to loop variables in code: for i in range(n): total += x[i] — you wouldn't use i after the loop and expect it to hold a meaningful value.
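The same scoping appears in Python generator expressions, where the index is local to the expression; a tiny illustrative sketch:

x = [10, 20, 30]
total = sum(x[i] for i in range(len(x)))   # i is bound inside the generator expression
print(total)                                # 60
# print(i)  # NameError: i does not exist out here, just like Σ's bound index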
14.3 Algebraic Mistakes
- Sum of squares ≠ Square of sum: $\sum_i x_i^2 \ne \left(\sum_i x_i\right)^2$
- Sum of products ≠ Product of sums: $\sum_i x_i y_i \ne \left(\sum_i x_i\right)\left(\sum_i y_i\right)$
- Product of sums ≠ Sum of products: $\prod_i (a_i + b_i) \ne \prod_i a_i + \prod_i b_i$
- Log of sum ≠ Sum of logs: $\log \sum_i x_i \ne \sum_i \log x_i$, BUT $\log \prod_i x_i = \sum_i \log x_i$ ✓
14.4 Numerical Pitfalls
| Pitfall | Example | Solution |
|---|---|---|
| Underflow/overflow in products | $\prod_i p_i \to 0$ for small $p_i$ | Work in log-space: $\log \prod_i p_i = \sum_i \log p_i$ |
| Catastrophic cancellation | $E[X^2] - (E[X])^2$ for large $\lvert X \rvert$ | Welford's online algorithm |
| Summing many small floats | $\sum_{i=1}^{10^7} 0.1$ drifts from the exact value | Kahan compensated summation |
| Softmax overflow | $e^{z_i}$ overflows for large $z_i$ | Subtract max: $e^{z_i - \max_j z_j}$ |
14.5 Order of Summation Matters (When It Shouldn't)
In exact arithmetic, $\sum_i a_i$ gives the same result in any order (addition is commutative and associative). In floating-point, order matters due to rounding. Kahan summation compensates for accumulated errors.
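A short added demonstration, with values chosen to force rounding, of how evaluation order alone changes a floating-point sum before any compensation is applied:

a = [1.0e16, 1.0, -1.0e16]
left_to_right = (a[0] + a[1]) + a[2]    # 1.0 is absorbed: 1e16 + 1 rounds back to 1e16, so the result is 0.0
rearranged    = (a[0] + a[2]) + a[1]    # cancel the large terms first, then add 1.0 → 1.0
print(left_to_right, rearranged)        # 0.0 1.0 — same terms, different order, different answer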
Code cell 35
# ══════════════════════════════════════════════════════════════════
# 14 COMMON MISTAKES AND PITFALLS
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("COMMON MISTAKES AND PITFALLS")
print("=" * 65)
# --- Mistake 1: Σ(xy) ≠ (Σx)(Σy) ---
print("\n14.1 Σ xᵢyᵢ ≠ (Σ xᵢ)(Σ yᵢ)")
print("─" * 50)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.0, 6.0, 7.0, 8.0])
sum_xy = np.sum(x * y)
sum_x_times_sum_y = np.sum(x) * np.sum(y)
print(f" x = {list(x)}, y = {list(y)}")
print(f" Σ xᵢyᵢ = {sum_xy:.1f}")
print(f" (Σ xᵢ)(Σ yᵢ) = {sum_x_times_sum_y:.1f}")
print(f" Equal? {np.isclose(sum_xy, sum_x_times_sum_y)} ← THEY ARE NOT EQUAL!")
covariance_bias = (sum_xy / len(x)) - (np.mean(x) * np.mean(y))
print(f" Difference matters: this gap IS the covariance = {covariance_bias:.4f}")
# --- Mistake 2: Σ x² ≠ (Σ x)² ---
print(f"\n{'─' * 50}")
print("14.3 Σ xᵢ² ≠ (Σ xᵢ)²")
print("─" * 50)
sum_sq = np.sum(x**2)
sq_sum = np.sum(x)**2
print(f" x = {list(x)}")
print(f" Σ xᵢ² = {sum_sq:.1f}")
print(f" (Σ xᵢ)² = {sq_sum:.1f}")
print(f" This gap = n·Var(X) = {sq_sum - len(x)*sum_sq/len(x):.1f}")
# --- Numerical: Kahan summation ---
print(f"\n{'─' * 50}")
print("14.4–14.5 Kahan Compensated Summation")
print("─" * 50)
def kahan_sum(values):
"""Kahan compensated summation — reduces floating-point error."""
s = 0.0
c = 0.0 # compensation for lost low-order bits
for v in values:
y = v - c
t = s + y
c = (t - s) - y # algebraically zero, but captures rounding error
s = t
return s
# Sum of 10 million 0.1's (should be 1,000,000.0)
n = 10_000_000
values = [0.1] * n
naive = sum(values)
compensated = kahan_sum(values)
numpy_result = np.sum(np.full(n, 0.1))
exact = 1_000_000.0
print(f"\n Summing {n:,} copies of 0.1 (exact answer: {exact:,.1f})")
print(f" {'Method':<20} {'Result':<22} {'Error'}")
print(f" {'─'*20} {'─'*22} {'─'*14}")
print(f" {'Naive sum()':<20} {naive:<22.10f} {abs(naive - exact):<14.10f}")
print(f" {'Kahan sum':<20} {compensated:<22.10f} {abs(compensated - exact):<14.10f}")
print(f" {'numpy.sum':<20} {numpy_result:<22.10f} {abs(numpy_result - exact):<14.10f}")
# --- Welford's online variance ---
print(f"\n{'─' * 50}")
print("14.4 Welford's Algorithm (numerically stable variance)")
print("─" * 50)
def welford_variance(data):
"""Welford's one-pass numerically stable variance."""
n = 0
mean = 0.0
M2 = 0.0
for x in data:
n += 1
delta = x - mean
mean += delta / n
delta2 = x - mean
M2 += delta * delta2
return M2 / n if n > 0 else 0.0
# Large values with small differences → catastrophic cancellation
data_shifted = np.array([1e8 + 1, 1e8 + 2, 1e8 + 3, 1e8 + 4, 1e8 + 5], dtype=np.float64)
naive_var = np.mean(data_shifted**2) - np.mean(data_shifted)**2
welford_var = welford_variance(data_shifted)
numpy_var = np.var(data_shifted)
print(f"\n Data: [1e8+1, 1e8+2, ..., 1e8+5] (true variance = 2.0)")
print(f" {'Method':<25} {'Variance':<18} {'Error'}")
print(f" {'─'*25} {'─'*18} {'─'*10}")
print(f" {'Naive (E[X²]-E[X]²)':<25} {naive_var:<18.6f} {abs(naive_var - 2.0):<10.6f}")
print(f" {'Welford online':<25} {welford_var:<18.6f} {abs(welford_var - 2.0):<10.6f}")
print(f" {'numpy.var':<25} {numpy_var:<18.6f} {abs(numpy_var - 2.0):<10.6f}")
15. Practice Exercises
Exercise 1: Closed-Form Verification
Prove and verify numerically: $\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}$ for $n \in \{100, 1000, 10000\}$.
Exercise 2: Telescoping Sum
Simplify $\sum_{i=1}^{n} \left(\frac{1}{i} - \frac{1}{i+1}\right)$ and verify.
Exercise 3: Double Sum Interchange
Show that $\sum_i \sum_j a_i b_j = \left(\sum_i a_i\right)\left(\sum_j b_j\right)$.
Exercise 4: Log-Sum-Exp
Implement the numerically stable version of $\log \sum_i e^{x_i}$ and compare with the naive version for extreme inputs.
Exercise 5: Implement Softmax Gradient
Given softmax output $\sigma_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$, show that $\frac{\partial \sigma_i}{\partial z_j} = \sigma_i(\delta_{ij} - \sigma_j)$, where $\delta_{ij}$ is the Kronecker delta.
Exercise 6: Cauchy-Schwarz Application
Verify that $\left(\sum_i a_i b_i\right)^2 \le \left(\sum_i a_i^2\right)\left(\sum_i b_i^2\right)$ for random vectors, and find when equality holds.
Exercise 7: Einstein Summation
Convert these to np.einsum and verify:
- Batch matrix multiply: $C_{bij} = \sum_k A_{bik} B_{bkj}$
- Bilinear form: $s = \sum_i \sum_j x_i M_{ij} y_j$
- Batch outer product: $O_{bij} = a_{bi} b_{bj}$
Exercise 8: Convergence Analysis
For the series $\sum_{n=1}^{\infty} \frac{n^2}{2^n}$, apply the ratio test and compute partial sums to estimate the limit.
Code cell 37
# ══════════════════════════════════════════════════════════════════
# 15 PRACTICE EXERCISES — SOLUTIONS
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("PRACTICE EXERCISES — SOLUTIONS")
print("=" * 65)
# --- Exercise 1: Closed-form for Σ i² ---
print("\nExercise 1: Σ i² = n(n+1)(2n+1)/6")
print("─" * 50)
for n in [100, 1000, 10000]:
brute = sum(i**2 for i in range(1, n+1))
closed = n * (n + 1) * (2*n + 1) // 6
print(f" n={n:<6}: brute={brute:<15}, closed={closed:<15}, match={brute == closed} ✓")
# --- Exercise 2: Telescoping ---
print(f"\nExercise 2: Σ (1/i - 1/(i+1)) = 1 - 1/(n+1)")
print("─" * 50)
for n in [10, 100, 1000]:
tele = sum(1/i - 1/(i+1) for i in range(1, n+1))
closed_val = 1 - 1/(n+1)
print(f" n={n:<6}: sum={tele:.10f}, 1-1/(n+1)={closed_val:.10f}, match={np.isclose(tele, closed_val)} ✓")
# --- Exercise 3: Double sum separation ---
print(f"\nExercise 3: ΣᵢΣⱼ aᵢbⱼ = (Σᵢ aᵢ)(Σⱼ bⱼ)")
print("─" * 50)
a = np.array([2, 3, 5, 7])
b = np.array([1, 4, 6])
double_sum = sum(a[i] * b[j] for i in range(len(a)) for j in range(len(b)))
product_sums = np.sum(a) * np.sum(b)
print(f" a = {list(a)}, b = {list(b)}")
print(f" ΣᵢΣⱼ aᵢbⱼ = {double_sum}, (Σ aᵢ)(Σ bⱼ) = {product_sums}")
print(f" Works because bⱼ doesn't depend on i, so Σⱼ bⱼ factors out ✓")
# --- Exercise 4: Log-Sum-Exp stability ---
print(f"\nExercise 4: Numerically stable log-sum-exp")
print("─" * 50)
x_extreme = np.array([1000.0, 1000.1, 999.8])
x_max = np.max(x_extreme)
# Naive: will overflow
try:
naive_lse = np.log(np.sum(np.exp(x_extreme)))
except:
naive_lse = float('inf')
# Stable version
stable_lse = x_max + np.log(np.sum(np.exp(x_extreme - x_max)))
print(f" x = {list(x_extreme)}")
print(f" Naive log(Σ eˣ): {naive_lse}")
print(f" Stable max + log(Σ e^(x-max)): {stable_lse:.6f}")
print(f" The max-subtraction trick preserves the result without overflow ✓")
# --- Exercise 5: Softmax Jacobian ---
print(f"\nExercise 5: Softmax Jacobian ∂σᵢ/∂zⱼ = σᵢ(δᵢⱼ - σⱼ)")
print("─" * 50)
z = np.array([2.0, 1.0, 0.5])
sigma = np.exp(z - np.max(z))
sigma = sigma / np.sum(sigma)
# Analytic Jacobian
n = len(z)
jacobian = np.zeros((n, n))
for i in range(n):
for j in range(n):
if i == j:
jacobian[i, j] = sigma[i] * (1 - sigma[j])
else:
jacobian[i, j] = -sigma[i] * sigma[j]
# Numerical Jacobian for verification
eps = 1e-7
numerical_jac = np.zeros((n, n))
for j in range(n):
z_plus = z.copy(); z_plus[j] += eps
z_minus = z.copy(); z_minus[j] -= eps
s_plus = np.exp(z_plus - np.max(z_plus)); s_plus /= s_plus.sum()
s_minus = np.exp(z_minus - np.max(z_minus)); s_minus /= s_minus.sum()
numerical_jac[:, j] = (s_plus - s_minus) / (2 * eps)
print(f" z = {list(z)}")
print(f" σ = [{', '.join(f'{s:.4f}' for s in sigma)}]")
print(f" Max |analytic - numerical| = {np.max(np.abs(jacobian - numerical_jac)):.2e} ✓")
# --- Exercise 6: Cauchy-Schwarz ---
print(f"\nExercise 6: Cauchy-Schwarz (Σ aᵢbᵢ)² ≤ (Σ aᵢ²)(Σ bᵢ²)")
print("─" * 50)
np.random.seed(7)
a_rand = np.random.randn(10)
b_rand = np.random.randn(10)
lhs = np.sum(a_rand * b_rand)**2
rhs = np.sum(a_rand**2) * np.sum(b_rand**2)
print(f" Random a, b: LHS = {lhs:.6f} ≤ RHS = {rhs:.6f}: {lhs <= rhs + 1e-10} ✓")
# Equality when a = cb
b_prop = 3.0 * a_rand
lhs_eq = np.sum(a_rand * b_prop)**2
rhs_eq = np.sum(a_rand**2) * np.sum(b_prop**2)
print(f" b = 3a: LHS = {lhs_eq:.6f} = RHS = {rhs_eq:.6f}: {np.isclose(lhs_eq, rhs_eq)} ✓ (equality!)")
# --- Exercise 7: Einstein summation ---
print(f"\nExercise 7: Einstein summation with np.einsum")
print("─" * 50)
B_size, I, K, J = 2, 3, 4, 5
A7 = np.random.randn(B_size, I, K)
B7 = np.random.randn(B_size, K, J)
C_batch = np.einsum('bik,bkj->bij', A7, B7)
C_loop = np.array([A7[b] @ B7[b] for b in range(B_size)])
print(f" Batch matmul: einsum('bik,bkj->bij') match: {np.allclose(C_batch, C_loop)} ✓")
x7 = np.random.randn(4)
M7 = np.random.randn(4, 5)
y7 = np.random.randn(5)
bilinear = np.einsum('i,ij,j->', x7, M7, y7)
bilinear_loop = x7 @ M7 @ y7
print(f" Bilinear form: einsum('i,ij,j->') match: {np.isclose(bilinear, bilinear_loop)} ✓")
a7 = np.random.randn(B_size, 3)
b7 = np.random.randn(B_size, 4)
outer_batch = np.einsum('bi,bj->bij', a7, b7)
outer_loop = np.array([np.outer(a7[b], b7[b]) for b in range(B_size)])
print(f" Batch outer: einsum('bi,bj->bij') match: {np.allclose(outer_batch, outer_loop)} ✓")
# --- Exercise 8: Convergence of Σ n²/2ⁿ ---
print(f"\nExercise 8: Σ n²/2ⁿ convergence (ratio test + partial sums)")
print("─" * 50)
# Ratio test: a_{n+1}/a_n = (n+1)²/(2n²) → 1/2 < 1 → converges
print(" Ratio test: |(n+1)²/2^(n+1)| / |n²/2ⁿ| = (n+1)²/(2n²) → 1/2 < 1 ✓")
# Exact sum: Σ n²xⁿ = x(1+x)/(1-x)³ at x=1/2 → (1/2)(3/2)/(1/2)³ = 6
partial = [sum(k**2 / 2**k for k in range(1, n+1)) for n in [5, 10, 20, 50]]
print(f" Exact value: x(1+x)/(1-x)³ at x=½ = 6.0")
print(f" Partial sums: S₅={partial[0]:.6f}, S₁₀={partial[1]:.6f}, "
f"S₂₀={partial[2]:.6f}, S₅₀={partial[3]:.6f}")
print(f" Converges to 6.0 ✓")
16. Why This Matters for AI
Impact Table
| AI Component | Summation/Product Role | Key Formula |
|---|---|---|
| Forward pass | Linear combination | $z_j = \sum_k w_{jk} a_k + b_j$ |
| Loss function | Empirical risk | $L = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i), y_i)$ |
| Softmax | Normalization | $\sigma_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ |
| Attention | Weighted combination | $\mathrm{out} = \sum_j \alpha_j v_j$ |
| Cross-entropy | Classification loss | $-\sum_c y_c \log \hat{y}_c$ |
| Likelihood | Maximum likelihood estimation | $L(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$ |
| Gradient descent | Mini-batch SGD | $\theta \leftarrow \theta - \frac{\eta}{\lvert B \rvert} \sum_{i \in B} \nabla_\theta \ell_i$ |
| Batch norm | Normalization statistics | $\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i,\; \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$ |
| Entropy regularization | Exploration in RL | $H(\pi) = -\sum_a \pi(a \mid s) \log \pi(a \mid s)$ |
| ELBO | Variational inference | $\mathbb{E}_{q(z)}[\log p(x \mid z)] - D_{\mathrm{KL}}(q(z) \,\|\, p(z))$ |
| Perplexity | Language model eval | $\exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)$ |
Conceptual Bridge
┌─────────────────────────────────────────────────────────┐
│ SUMMATION & PRODUCT NOTATION │
│ ═══════════════════════════════════════ │
│ │
│ Σ (Summation) Π (Product) │
│ │ │ │
│ ├─ Forward pass ├─ Likelihood │
│ ├─ Loss functions ├─ Posterior (Bayes) │
│ ├─ Gradient averages ├─ Partition functions │
│ ├─ Attention weights └─ Factorial / combinatorics │
│ ├─ Batch statistics │
│ └─ Einstein notation ── einsum ── GPU kernels │
│ │
│ "If you can read a Σ, you can read any ML paper." │
└─────────────────────────────────────────────────────────┘
Key Takeaways
- Every ML model is built on summations — linear combinations, averages, expectations, weighted sums
- Products appear in likelihoods, posteriors, and partition functions — always convert to log-sums for computation
- Einstein summation (np.einsum) is the universal language for tensor operations in modern deep learning
- Numerical stability (log-sum-exp, Kahan summation, Welford's algorithm) separates working code from broken code
- Convergence theory (ratio test, p-series, Robbins-Monro) determines when your training will actually converge
References
- Knuth, D.E. The Art of Computer Programming, Vol. 1, §1.2.3 — Sums and Products (1968)
- Graham, Knuth, Patashnik. Concrete Mathematics, Ch. 2 — Sums (1994)
- Goodfellow, Bengio, Courville. Deep Learning, Ch. 2 — Linear Algebra notation (2016)
- Vaswani et al. "Attention Is All You Need" — Σ in scaled dot-product attention (2017)
- Einstein, A. "The Foundation of the General Theory of Relativity" — implicit summation convention (1916)
Code cell 39
# ══════════════════════════════════════════════════════════════════
# 16 FINAL SUMMARY — COMPLETE NOTATION REFERENCE
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("COMPLETE NOTATION QUICK REFERENCE")
print("=" * 65)
reference = {
"Σᵢ₌₁ⁿ aᵢ": "Basic summation: a₁ + a₂ + ... + aₙ",
"Πᵢ₌₁ⁿ aᵢ": "Basic product: a₁ · a₂ · ... · aₙ",
"Σᵢ∈S f(i)": "Sum over index set S",
"ΣΣ aᵢⱼ": "Double sum (can interchange if abs convergent)",
"Aᵢⱼ Bⱼₖ": "Einstein: implicit sum over repeated j",
"log Π = Σ log": "Products → sums via logarithm",
"Σ (f(k)-f(k+1))": "Telescoping: collapses to f(1) - f(n+1)",
"max + log Σ exp": "Log-sum-exp trick (numerical stability)",
}
print(f"\n {'Notation':<22} {'Meaning'}")
print(f" {'─'*22} {'─'*43}")
for notation, meaning in reference.items():
print(f" {notation:<22} {meaning}")
# AI formula that uses ALL the concepts
print(f"\n{'=' * 65}")
print("PUTTING IT ALL TOGETHER: Transformer Loss")
print("=" * 65)
print("""
The cross-entropy loss for a Transformer language model:
L = -(1/T) Σₜ₌₁ᵀ log( exp(zₜ[yₜ]) / Σⱼ₌₁ⱽ exp(zₜ[j]) )
This single formula uses:
✓ Summation (outer Σ over tokens, inner Σ over vocab)
✓ Product (likelihood Π → log converts to Σ)
✓ Log-sum-exp (numerical stability of softmax)
✓ Einstein notation (in the QKᵀ attention computation)
✓ Double sums (batch × sequence × head dimensions)
✓ Convergence theory (learning rate schedule)
✓ Jensen's inequality (proves KL ≥ 0 in ELBO)
Every section of this chapter appears in this one equation.
""")
print("=" * 65)
print("END OF CHAPTER: Summation and Product Notation")
print("=" * 65)