Theory Notebook
Converted from theory.ipynb for web reading.
Summation & Product Notation — The Syntax of Quantitative AI
"Virtually every formula in machine learning — loss functions, gradients, attention scores, probability distributions, normalisation constants — is a sum or a product. Fluency with Σ and Π is the difference between reading ML papers fluidly and decoding them symbol by symbol."
This notebook is the interactive companion to notes.md. It demonstrates every concept from all 16 sections with runnable Python and NumPy code, connecting each mathematical idea to its concrete use in modern AI systems (transformers, LLMs, diffusion models, RAG pipelines).
| Section | Topic | What You'll Build |
|---|---|---|
| 1 | Intuition | Why Σ/Π exist; where they appear in AI; historical timeline |
| 2 | Summation — Formal Definitions | Recursive definition, index sets, multiple indices |
| 3 | Product Notation — Formal Definitions | Π basics, empty product, factorial, log-likelihood |
| 4 | Algebraic Properties of Summation | Linearity, splitting, reindexing, telescoping, interchange |
| 5 | Algebraic Properties of Products | Splitting, constants, log↔product conversion |
| 6 | Important Summation Formulas | Arithmetic/geometric series, Taylor series, harmonic numbers |
| 7 | Summation Bounds & Estimation | Comparison, integral test, Cauchy-Schwarz, Stirling |
| 8 | Special Summation Patterns in AI | Softmax, cross-entropy, attention, perplexity, BM25 |
| 9 | Einstein Summation Convention | Implicit sums, einsum in PyTorch/NumPy, tensor contraction |
| 10 | Double Sums & Index Interaction | Diagonal sums, order interchange, Cauchy product |
| 11 | Convergence of Infinite Series | Partial sums, absolute convergence, tests, power series |
| 12 | Summation in Probability & Statistics | Expectation, variance, entropy, KL divergence |
| 13–14 | Notation Variants & Common Mistakes | ML conventions, subscript/superscript, pitfalls table |
| 15 | Exercises | 8 progressive exercises with scaffolds and solutions |
| 16 | Why This Matters for AI (2026) | Impact table and conceptual bridge |
Prerequisites: Python, NumPy. PyTorch optional but recommended for einsum demos.
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 13,
"axes.titlesize": 15,
"axes.labelsize": 13,
"xtick.labelsize": 11,
"ytick.labelsize": 11,
"legend.fontsize": 11,
"legend.framealpha": 0.85,
"lines.linewidth": 2.0,
"axes.spines.top": False,
"axes.spines.right": False,
"savefig.bbox": "tight",
"savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
import math
import warnings
from typing import List, Tuple, Callable
np.random.seed(42)
np.set_printoptions(precision=8, suppress=True)
# Optional: PyTorch for einsum and tensor demos
try:
import torch
torch.manual_seed(42)
HAS_TORCH = True
print(f'NumPy {np.__version__} | PyTorch {torch.__version__}')
if torch.cuda.is_available():
print(f'GPU: {torch.cuda.get_device_name()}')
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
print('Device: Apple Silicon (MPS)')
else:
print('Device: CPU')
except ImportError:
HAS_TORCH = False
print(f'NumPy {np.__version__} | PyTorch not installed (NumPy demos still work)')
# Optional: matplotlib for visualisations
try:
import matplotlib.pyplot as plt
HAS_PLT = True
except ImportError:
HAS_PLT = False
print('matplotlib not installed — plots will be skipped')
1. Intuition — What Summation and Product Notation Are, and Why They Exist
1.1 What Is Summation and Product Notation?
Summation ($\Sigma$) and product ($\Pi$) notation are compact symbolic languages for expressing repeated addition and multiplication over collections of terms.
- Without them: writing the loss over a million training examples requires a million terms
- With them: one line — $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta, x_i)$
They are not merely abbreviations — they are precise mathematical objects with well-defined rules for manipulation, interchange, and transformation.
1.2 Why This Notation Exists
| Problem | Without Notation | With Notation |
|---|---|---|
| Sum of squares 1 to n | $1^2 + 2^2 + 3^2 + \cdots + n^2$ | $\sum_{i=1}^{n} i^2$ |
| Joint probability | $P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_1, t_2)\cdots$ | $\prod_{i=1}^{n} P(t_i \mid t_{<i})$ |
| Attention output | $\alpha_{i1}V_1 + \alpha_{i2}V_2 + \cdots + \alpha_{in}V_n$ | $\sum_{j=1}^{n} \alpha_{ij} V_j$ |
| Gradient of batch loss | $\nabla_\theta \ell_1 + \nabla_\theta \ell_2 + \cdots + \nabla_\theta \ell_N$ | $\sum_{i=1}^{N} \nabla_\theta \ell_i$ |
Compact notation reveals structure, enables algebraic manipulation, and generalises naturally from finite sums → infinite series → integrals.
1.3 Where These Appear in AI
Every core formula in modern ML uses Σ or Π:
| Formula | Expression |
|---|---|
| Cross-entropy loss | $\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log P(y_i \mid x_i)$ |
| Softmax | $P(v) = \dfrac{e^{z_v}}{\sum_{v'} e^{z_{v'}}}$ |
| Attention | $\mathrm{Attn}(Q,K,V)_i = \sum_j \alpha_{ij} V_j$ |
| Dot product | $q \cdot k = \sum_i q_i k_i$ |
| Matrix multiply | $(AB)_{ij} = \sum_k A_{ik} B_{kj}$ |
| Perplexity | $\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_i \log P(t_i \mid t_{<i})\right)$ |
| Likelihood | $P(\mathrm{seq}) = \prod_i P(t_i \mid t_{<i})$ |
1.4 The Index as a Bound Variable
The index variable ($i$ in $\sum_{i=1}^{n} f(i)$) is a bound variable — it has no meaning outside the sum. $\sum_{i=1}^{n} f(i)$ and $\sum_{j=1}^{n} f(j)$ are identical expressions (alpha-equivalence).
Critical for AI: $\sum_i \alpha_{ij}$ and $\sum_j \alpha_{ij}$ are completely different sums!
1.5 The Relationship to Integration
DISCRETE → CONTINUOUS
∑ᵢ₌₁ⁿ f(i) → ∫₁ⁿ f(x) dx
Index i ∈ {1,...,n} → x ∈ [1, n]
Step size 1 → Infinitesimal dx
∑ → ∫ as step → 0 (Riemann sum definition)
∏ᵢ f(i) → exp(∫ log f(x) dx)
This bridge is not just analogy — Riemann integration is literally defined as a limit of summation.
Code cell 5
# ══════════════════════════════════════════════════════════════════
# 1.1–1.3 WHY Σ AND Π EXIST — DEMONSTRATION
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("WITHOUT VS WITH SUMMATION NOTATION")
print("=" * 65)
# Without notation: writing out all terms
terms_without = [i**2 for i in range(1, 11)]
print(f"\nWithout notation: {' + '.join(str(t) for t in terms_without)}")
print(f" = {sum(terms_without)}")
# With notation: one compact expression
n = 10
result_with = sum(i**2 for i in range(1, n+1))
print(f"\nWith notation: Σᵢ₌₁¹⁰ i² = {result_with}")
print(f" → Same answer, infinitely more readable")
# ── AI formulas that are really just sums ──
print(f"\n{'=' * 65}")
print("AI FORMULAS AS SUMS AND PRODUCTS")
print("=" * 65)
# Dot product: q·k = Σᵢ qᵢkᵢ
q = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([0.5, 1.5, 0.2, 1.0])
dot_manual = sum(q[i] * k[i] for i in range(len(q)))
dot_numpy = np.dot(q, k)
print(f"\nDot product q·k = Σᵢ qᵢkᵢ")
print(f" Manual sum: {dot_manual}")
print(f" np.dot: {dot_numpy}")
print(f" Match: {np.isclose(dot_manual, dot_numpy)} ✓")
# Matrix multiply: (AB)ᵢⱼ = Σₖ AᵢₖBₖⱼ
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C_manual = np.zeros((2, 2))
for i in range(2):
for j in range(2):
C_manual[i, j] = sum(A[i, k] * B[k, j] for k in range(2))
C_numpy = A @ B
print(f"\nMatrix multiply (AB)ᵢⱼ = Σₖ AᵢₖBₖⱼ")
print(f" Manual:\n{C_manual}")
print(f" A @ B:\n{C_numpy}")
print(f" Match: {np.allclose(C_manual, C_numpy)} ✓")
# Likelihood as product: P(seq) = Πᵢ P(tᵢ|context)
probs = [0.3, 0.5, 0.8, 0.6, 0.4]
likelihood_product = 1.0
for p in probs:
likelihood_product *= p
log_likelihood_sum = sum(np.log(p) for p in probs)
print(f"\nLikelihood: Πᵢ P(tᵢ|ctx) = {likelihood_product:.6f}")
print(f"Log-likelihood: Σᵢ log P = {log_likelihood_sum:.6f}")
print(f"exp(Σ log P) = Π P: {np.exp(log_likelihood_sum):.6f}")
print(f" Match: {np.isclose(likelihood_product, np.exp(log_likelihood_sum))} ✓")
print(f"\n → log converts Π to Σ — the most important trick in ML!")
Code cell 6
# ══════════════════════════════════════════════════════════════════
# 1.4 THE INDEX AS A BOUND VARIABLE (Alpha-Equivalence)
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("BOUND VARIABLES — INDEX NAME DOES NOT MATTER")
print("=" * 65)
n = 5
sum_i = sum(i**2 for i in range(1, n+1))
sum_j = sum(j**2 for j in range(1, n+1))
sum_k = sum(k**2 for k in range(1, n+1))
print(f"\n Σᵢ₌₁⁵ i² = {sum_i}")
print(f" Σⱼ₌₁⁵ j² = {sum_j}")
print(f" Σₖ₌₁⁵ k² = {sum_k}")
print(f" All equal: {sum_i == sum_j == sum_k} ✓ (alpha-equivalence)")
# CRITICAL: Σᵢ αᵢⱼ vs Σⱼ αᵢⱼ are DIFFERENT
print(f"\n{'─' * 65}")
print("CRITICAL: WHICH INDEX YOU SUM OVER MATTERS!")
alpha = np.array([[0.5, 0.3, 0.2],
[0.1, 0.6, 0.3],
[0.4, 0.4, 0.2]])
print(f"\n Attention matrix α (3×3):\n{alpha}")
print(f"\n Σⱼ αᵢⱼ (sum each ROW) = {alpha.sum(axis=1)}")
print(f" Σᵢ αᵢⱼ (sum each COLUMN) = {alpha.sum(axis=0)}")
print(f" → Different sums! Row sums = 1.0 (probability distribution)")
# ══════════════════════════════════════════════════════════════════
# 1.5 HISTORICAL TIMELINE
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("HISTORICAL TIMELINE OF SUMMATION NOTATION")
print("=" * 65)
timeline = [
("~250 BCE", "Archimedes", "Geometric series; area of parabola (no notation)"),
("1593", "Viète", "Systematic symbolic algebra → paves way"),
("1675", "Leibniz", "∫ as elongated S for 'summa' (continuous)"),
("~1750", "Euler", "Introduces Σ (capital sigma) for discrete sums"),
("~1800", "Gauss", "Σᵢ₌₁ⁿ i = n(n+1)/2 at age 10"),
("1821", "Cauchy", "Rigorous convergence theory for infinite series"),
("1854", "Riemann", "Riemann sum → integral (Σ → ∫ as Δx → 0)"),
("1916", "Einstein", "Summation convention: repeated index = sum"),
("2017", "Vaswani et al.", "Transformer: Attn = Σⱼ αᵢⱼVⱼ everywhere"),
]
print(f"\n {'Year':<12} {'Who':<18} {'What'}")
print(f" {'─'*12} {'─'*18} {'─'*45}")
for year, who, what in timeline:
print(f" {year:<12} {who:<18} {what}")
# ══════════════════════════════════════════════════════════════════
# 1.6 SUM ↔ INTEGRAL CONNECTION (Riemann Sum)
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("SUM → INTEGRAL: RIEMANN SUM DEMONSTRATION")
print("=" * 65)
# Approximate ∫₀¹ x² dx using Riemann sums with increasing n
def riemann_sum(f, a, b, n):
"""Left Riemann sum: Σᵢ₌₀ⁿ⁻¹ f(xᵢ)·Δx where Δx = (b-a)/n"""
dx = (b - a) / n
return sum(f(a + i * dx) * dx for i in range(n))
exact = 1/3 # ∫₀¹ x² dx = 1/3
print(f"\n Target: ∫₀¹ x² dx = 1/3 ≈ {exact:.10f}")
print(f"\n {'n (terms)':<12} {'Σᵢ f(xᵢ)Δx':<18} {'Error':<15} {'Δx'}")
print(f" {'─'*12} {'─'*18} {'─'*15} {'─'*10}")
for n_terms in [5, 10, 50, 100, 1000, 10000]:
approx = riemann_sum(lambda x: x**2, 0, 1, n_terms)
err = abs(approx - exact)
dx = 1.0 / n_terms
print(f" {n_terms:<12} {approx:<18.10f} {err:<15.2e} {dx:.6f}")
print(f"\n → As n → ∞ and Δx → 0: Σ → ∫ (definition of Riemann integral)")
2. Summation Notation — Formal Definitions
2.1 Basic Definition
$$\sum_{i=m}^{n} f(i) = f(m) + f(m+1) + \cdots + f(n)$$
Components:
- $\sum$ — summation operator (capital Greek sigma)
- $i$ — index variable (dummy/bound variable; any name works)
- $m$ — lower limit (inclusive)
- $n$ — upper limit (inclusive)
- $f(i)$ — summand (expression evaluated at each index value)
- Number of terms: $n - m + 1$
Empty sum convention: if $n < m$, the sum has no terms and equals 0 (additive identity).
2.2 Formal Definition via Recursion
- Base case: $\sum_{i=m}^{m} f(i) = f(m)$ (single term)
- Recursive step: $\sum_{i=m}^{n} f(i) = \left(\sum_{i=m}^{n-1} f(i)\right) + f(n)$ for $n > m$
- Empty base: $\sum_{i=m}^{n} f(i) = 0$ for $n < m$
2.3 Index Set Notation
More general: sum over an arbitrary index set $S$: $\sum_{i \in S} f(i)$
AI examples:
- $\sum_{v \in V} \exp(z_v)$ — sum over vocabulary tokens
- $\sum_{(i,j)\,:\,i<j} f(i,j)$ — sum over upper-triangular pairs
- $\sum_{\ell} \sum_{h} f(\ell, h)$ — sum over all layers and heads
2.4 Multiple Indices
$$\sum_{i=1}^{m} \sum_{j=1}^{n} f(i,j) = \sum_{i=1}^{m} \left(\sum_{j=1}^{n} f(i,j)\right)$$
Process: for each fixed $i$, compute inner sum over $j$; then sum results over $i$. Total terms: $m \cdot n$.
Code cell 8
# ══════════════════════════════════════════════════════════════════
# 2.1–2.2 SUMMATION: BASIC + RECURSIVE DEFINITION
# ══════════════════════════════════════════════════════════════════
def sigma(f, m, n):
"""Σᵢ₌ₘⁿ f(i) — iterative definition."""
return sum(f(i) for i in range(m, n + 1))
def sigma_recursive(f, m, n):
"""Σᵢ₌ₘⁿ f(i) — recursive definition (2.2)."""
if n < m:
return 0 # Empty sum = 0 (additive identity)
if n == m:
return f(m) # Base case: single term
return sigma_recursive(f, m, n - 1) + f(n) # Recursive step
print("=" * 65)
print("SUMMATION: ITERATIVE vs RECURSIVE DEFINITION")
print("=" * 65)
# Test with f(i) = i²
f = lambda i: i**2
for n in [1, 3, 5, 10]:
iter_val = sigma(f, 1, n)
rec_val = sigma_recursive(f, 1, n)
formula = n * (n + 1) * (2 * n + 1) // 6 # closed form
print(f" Σᵢ₌₁^{n:<2d} i² iterative={iter_val:<6d} recursive={rec_val:<6d} formula={formula:<6d} ✓")
# Empty sum convention
empty = sigma(f, 5, 3) # n < m → 0
print(f"\n Empty sum Σᵢ₌₅³ f(i) = {empty} (additive identity)")
# ══════════════════════════════════════════════════════════════════
# 2.3 INDEX SET NOTATION
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("INDEX SET NOTATION: Σ OVER ARBITRARY SETS")
print("=" * 65)
# Sum over arbitrary index set
def sigma_set(f, index_set):
"""Σ_{i ∈ S} f(i) — sum over arbitrary index set S."""
return sum(f(i) for i in index_set)
# Example 1: Sum over vocabulary (simulated as indices)
vocab = {0, 1, 2, 3, 4} # 5-token vocabulary
logits = np.array([2.0, 1.0, 0.5, -1.0, 0.3])
Z = sigma_set(lambda v: np.exp(logits[v]), vocab)
print(f"\n Softmax denominator Z = Σ_{{v ∈ V}} exp(zᵥ) = {Z:.4f}")
print(f" np.sum(np.exp(logits)) = {np.sum(np.exp(logits)):.4f} ✓")
# Example 2: Sum over pairs (i,j) with i < j
pairs = [(i, j) for i in range(4) for j in range(4) if i < j]
pair_sum = sigma_set(lambda p: p[0] + p[1], pairs)
print(f"\n Σ_{{(i,j): i<j, i,j∈{{0..3}}}} (i+j) = {pair_sum}")
print(f" Pairs: {pairs}")
# Example 3: Sum with indicator function
data = [3, 7, 2, 8, 1, 5, 9, 4]
count_gt5 = sum(1 for x in data if x > 5)
print(f"\n Σᵢ 𝟙[xᵢ > 5] = {count_gt5} (counting elements > 5 in {data})")
# ══════════════════════════════════════════════════════════════════
# 2.4 MULTIPLE INDICES — DOUBLE SUMS
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("DOUBLE SUMS: Σᵢ Σⱼ f(i,j)")
print("=" * 65)
# Compute Σᵢ₌₁³ Σⱼ₌₁⁴ (i·j) step by step
m, n = 3, 4
print(f"\n Computing Σᵢ₌₁³ Σⱼ₌₁⁴ (i·j):")
total = 0
for i in range(1, m + 1):
inner = sum(i * j for j in range(1, n + 1))
total += inner
terms = ' + '.join(f'{i}·{j}' for j in range(1, n + 1))
print(f" i={i}: Σⱼ₌₁⁴ {i}·j = {terms} = {inner}")
print(f" Total: {total}")
print(f" Formula: (Σᵢ i)(Σⱼ j) = {sum(range(1,4))} × {sum(range(1,5))} = {sum(range(1,4)) * sum(range(1,5))} ✓")
# Matrix as double sum: Σᵢ Σⱼ Aᵢⱼ = sum of all entries
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
double_sum = sum(A[i, j] for i in range(3) for j in range(3))
print(f"\n Matrix A:\n{A}")
print(f" Σᵢ Σⱼ Aᵢⱼ = {double_sum} = np.sum(A) = {np.sum(A)} ✓")
3. Product Notation — Formal Definitions
3.1 Basic Definition
$$\prod_{i=m}^{n} f(i) = f(m) \cdot f(m+1) \cdots f(n)$$
- Same structure as $\sum$ but with $\prod$ (capital Greek pi) and multiplication
- Empty product: if $n < m$, value = 1 (multiplicative identity)
- Recursive: $\prod_{i=m}^{n} f(i) = \left(\prod_{i=m}^{n-1} f(i)\right) \cdot f(n)$
3.2 Why Empty Product = 1 and Empty Sum = 0
| Operation | Identity Element | Empty Result | Reason |
|---|---|---|---|
| Addition ($\sum$) | 0 | $\sum_{i \in \emptyset} f(i) = 0$ | Adding nothing = additive identity |
| Multiplication ($\prod$) | 1 | $\prod_{i \in \emptyset} f(i) = 1$ | Multiplying nothing = multiplicative identity |
Consistency check: $0! = \prod_{i=1}^{0} i = 1$ ✓ (matches math.factorial(0) in the code cell below)
3.3 Factorial as Product
$$n! = \prod_{i=1}^{n} i$$
Stirling: $n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n$ — grows faster than any exponential.
3.4 Product Notation in Probability — The Log Trick
$$\log \prod_{i=1}^{n} P(t_i \mid t_{<i}) = \sum_{i=1}^{n} \log P(t_i \mid t_{<i})$$
This conversion is one of the most important transformations in all of ML:
- Products of small probabilities underflow → log converts to sum of log-probabilities
- Numerically stable, easier to differentiate, and computationally cheaper
Code cell 10
# ══════════════════════════════════════════════════════════════════
# 3.1–3.2 PRODUCT NOTATION & EMPTY PRODUCT
# ══════════════════════════════════════════════════════════════════
def pi_product(f, m, n):
"""Πᵢ₌ₘⁿ f(i) — iterative product."""
result = 1 # multiplicative identity
for i in range(m, n + 1):
result *= f(i)
return result
def pi_recursive(f, m, n):
"""Πᵢ₌ₘⁿ f(i) — recursive definition."""
if n < m:
return 1 # Empty product = 1 (multiplicative identity)
if n == m:
return f(m)
return pi_recursive(f, m, n - 1) * f(n)
print("=" * 65)
print("PRODUCT NOTATION: Πᵢ₌ₘⁿ f(i)")
print("=" * 65)
# Basic product demonstration
f = lambda i: i
for n in [1, 3, 5, 8, 10]:
iter_val = pi_product(f, 1, n)
rec_val = pi_recursive(f, 1, n)
fact_val = math.factorial(n)
print(f" Πᵢ₌₁^{n:<2d} i = {n}! iter={iter_val:<10d} rec={rec_val:<10d} math={fact_val:<10d} ✓")
# Empty product: Πᵢ₌₅³ = 1
empty_prod = pi_product(f, 5, 3)
empty_sum = sigma(lambda i: i**2, 5, 3)
print(f"\n Empty product Πᵢ₌₅³ i = {empty_prod} (multiplicative identity)")
print(f" Empty sum Σᵢ₌₅³ i² = {empty_sum} (additive identity)")
# 0! = 1 (empty product convention)
print(f"\n 0! = Πᵢ₌₁⁰ i = {pi_product(f, 1, 0)} = {math.factorial(0)} ✓")
# ══════════════════════════════════════════════════════════════════
# 3.3 FACTORIAL & STIRLING'S APPROXIMATION
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("FACTORIAL GROWTH & STIRLING'S APPROXIMATION")
print("=" * 65)
print(f"\n {'n':<6} {'n!':<20} {'Stirling':<20} {'Relative Error'}")
print(f" {'─'*6} {'─'*20} {'─'*20} {'─'*15}")
for n in [1, 5, 10, 20, 50, 100]:
exact = math.factorial(n)
stirling = np.sqrt(2 * np.pi * n) * (n / np.e) ** n
rel_err = abs(exact - stirling) / exact
print(f" {n:<6} {exact:<20.4g} {stirling:<20.4g} {rel_err:<15.6f}")
print(f"\n Stirling: n! ≈ √(2πn)(n/e)ⁿ — gets better as n → ∞")
# ══════════════════════════════════════════════════════════════════
# 3.4 THE LOG TRICK: Π → Σ VIA LOGARITHM
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("THE LOG TRICK: log Πᵢ f(i) = Σᵢ log f(i)")
print("=" * 65)
# Simulate LLM token probabilities
token_probs = [0.3, 0.5, 0.8, 0.02, 0.15, 0.6, 0.4, 0.1, 0.05, 0.7]
n_tokens = len(token_probs)
# Direct product — suffers from underflow with many small probs
likelihood = pi_product(lambda i: token_probs[i], 0, n_tokens - 1)
# Log-sum — numerically stable
log_likelihood = sum(np.log(p) for p in token_probs)
# Verify: exp(Σ log pᵢ) = Π pᵢ
reconstructed = np.exp(log_likelihood)
print(f"\n Token probs: {token_probs}")
print(f"\n Direct Π P(tᵢ|ctx) = {likelihood:.2e} ← tiny! underflow risk")
print(f" Σ log P(tᵢ|ctx) = {log_likelihood:.6f} ← stable log-space")
print(f" exp(Σ log P) = {reconstructed:.2e} ← reconstructed")
print(f" Match: {np.isclose(likelihood, reconstructed)} ✓")
# Perplexity: PPL = exp(−(1/N) Σ log P)
cross_entropy = -log_likelihood / n_tokens
perplexity = np.exp(cross_entropy)
# Equivalently: geometric mean of 1/P
geo_mean = np.prod([1/p for p in token_probs]) ** (1/n_tokens)
print(f"\n Cross-entropy H = −(1/N)Σ log P = {cross_entropy:.4f}")
print(f" Perplexity PPL = exp(H) = {perplexity:.4f}")
print(f" Geometric mean = (Π 1/P)^(1/N) = {geo_mean:.4f}")
print(f" Match: {np.isclose(perplexity, geo_mean)} ✓")
print(f"\n → Lower PPL = model is more confident/accurate")
4. Algebraic Properties of Summation
4.1 Linearity of Summation
$$\sum_{i=m}^{n} \bigl(\alpha f(i) + \beta g(i)\bigr) = \alpha \sum_{i=m}^{n} f(i) + \beta \sum_{i=m}^{n} g(i)$$
This is the single most important property. It means:
- Scalar multiple: $\sum_i c\,f(i) = c \sum_i f(i)$
- Sum of sums: $\sum_i \bigl(f(i) + g(i)\bigr) = \sum_i f(i) + \sum_i g(i)$
- AI: gradient of sum = sum of gradients: $\nabla_\theta \sum_i \mathcal{L}(\theta, x_i) = \sum_i \nabla_\theta \mathcal{L}(\theta, x_i)$
4.2 Constant Summand
$$\sum_{i=1}^{n} c = n \cdot c$$
4.3 Splitting and Combining Sums
$$\sum_{i=m}^{n} f(i) = \sum_{i=m}^{k} f(i) + \sum_{i=k+1}^{n} f(i), \qquad m \le k < n$$
4.4 Index Shifting (Reindexing)
$$\sum_{i=m}^{n} f(i) = \sum_{j=m+k}^{n+k} f(j-k)$$
Reverse: $\sum_{i=m}^{n} f(i) = \sum_{i=m}^{n} f(m+n-i)$
4.5 Telescoping Sums
$$\sum_{i=1}^{n} \bigl(f(i+1) - f(i)\bigr) = f(n+1) - f(1)$$
All intermediate terms cancel — only first and last survive.
4.6 Interchange of Summation Order (Discrete Fubini)
$$\sum_{i=1}^{m} \sum_{j=1}^{n} f(i,j) = \sum_{j=1}^{n} \sum_{i=1}^{m} f(i,j)$$
Always valid for finite sums. For infinite sums: requires absolute convergence.
Code cell 12
# ══════════════════════════════════════════════════════════════════
# 4.1 LINEARITY OF SUMMATION
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("LINEARITY: Σ(αf + βg) = αΣf + βΣg")
print("=" * 65)
n = 10
alpha_c, beta_c = 3.0, -2.0
f_vals = np.array([i**2 for i in range(1, n+1)], dtype=float)
g_vals = np.array([i for i in range(1, n+1)], dtype=float)
# LHS: Σᵢ (α·f(i) + β·g(i))
lhs = np.sum(alpha_c * f_vals + beta_c * g_vals)
# RHS: α·Σf + β·Σg
rhs = alpha_c * np.sum(f_vals) + beta_c * np.sum(g_vals)
print(f"\n f(i) = i², g(i) = i, α = {alpha_c}, β = {beta_c}, n = {n}")
print(f" LHS = Σ(αf + βg) = {lhs}")
print(f" RHS = αΣf + βΣg = {rhs}")
print(f" Equal: {np.isclose(lhs, rhs)} ✓")
# AI application: gradient of sum = sum of gradients
print(f"\n AI: ∇_θ Σᵢ L(θ,xᵢ) = Σᵢ ∇_θ L(θ,xᵢ)")
print(" → Mini-batch gradient = (1/|B|) sum_{i in B} grad l_i")
# ══════════════════════════════════════════════════════════════════
# 4.2–4.3 CONSTANT SUMMAND & SPLITTING
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("CONSTANT SUMMAND & SPLITTING SUMS")
print("=" * 65)
# Constant summand: Σᵢ₌₁ⁿ c = n·c
c, n = 7, 5
const_sum = sigma(lambda i: c, 1, n)
print(f"\n Σᵢ₌₁⁵ 7 = {const_sum} = 5 × 7 = {5 * 7} ✓")
# Splitting: Σᵢ₌₁¹⁰ = Σᵢ₌₁⁵ + Σᵢ₌₆¹⁰
full = sigma(lambda i: i**2, 1, 10)
part1 = sigma(lambda i: i**2, 1, 5)
part2 = sigma(lambda i: i**2, 6, 10)
print(f"\n Σᵢ₌₁¹⁰ i² = {full}")
print(f" Σᵢ₌₁⁵ i² + Σᵢ₌₆¹⁰ i² = {part1} + {part2} = {part1 + part2} ✓")
# ══════════════════════════════════════════════════════════════════
# 4.4 INDEX SHIFTING (REINDEXING)
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("INDEX SHIFTING: Σᵢ₌ₘⁿ f(i) = Σⱼ₌ₘ₊ₖⁿ⁺ᵏ f(j−k)")
print("=" * 65)
# Shift by k=2: Σᵢ₌₁⁵ i² = Σⱼ₌₃⁷ (j-2)²
original = sigma(lambda i: i**2, 1, 5)
shifted = sigma(lambda j: (j-2)**2, 3, 7)
print(f"\n Σᵢ₌₁⁵ i² = {original}")
print(f" Σⱼ₌₃⁷ (j−2)² = {shifted}")
print(f" Equal: {original == shifted} ✓ (j = i + 2)")
# Reverse index: Σᵢ₌₁⁵ f(i) = Σᵢ₌₁⁵ f(6−i)
reverse = sigma(lambda i: (6-i)**2, 1, 5)
print(f"\n Σᵢ₌₁⁵ i² = {original}")
print(f" Σᵢ₌₁⁵ (6−i)² = {reverse}")
print(f" Equal: {original == reverse} ✓ (reverse index)")
# ══════════════════════════════════════════════════════════════════
# 4.5 TELESCOPING SUMS
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("TELESCOPING: Σᵢ₌₁ⁿ (f(i+1) − f(i)) = f(n+1) − f(1)")
print("=" * 65)
# Example: Σᵢ₌₁ⁿ (1/i − 1/(i+1)) = 1 − 1/(n+1)
n = 10
telescope = sigma(lambda i: 1/i - 1/(i+1), 1, n)
closed_form = 1 - 1/(n+1)
print(f"\n Σᵢ₌₁¹⁰ (1/i − 1/(i+1))")
terms = [f"(1/{i} − 1/{i+1})" for i in range(1, 6)]
print(f" = {' + '.join(terms)} + ...")
print(f" = {telescope:.10f}")
print(f" = 1 − 1/11 = {closed_form:.10f}")
print(f" Match: {np.isclose(telescope, closed_form)} ✓")
# Show the cancellation explicitly
print(f"\n Expansion showing cancellation:")
for i in range(1, 6):
cancel = " ← cancels with next" if i < 5 else ""
print(f" + 1/{i} − 1/{i+1}{cancel}")
print(f" Only 1/1 and −1/{n+1} survive!")
# AI: Residual network parameter update
print(f"\n AI: θₜ − θ₀ = Σₛ₌₁ᵗ (θₛ − θₛ₋₁) = Σₛ₌₁ᵗ Δθₛ (telescoping)")
# ══════════════════════════════════════════════════════════════════
# 4.6 INTERCHANGE OF SUMMATION ORDER
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("INTERCHANGE: Σᵢ Σⱼ f(i,j) = Σⱼ Σᵢ f(i,j)")
print("=" * 65)
# Row-first vs column-first on a matrix
M = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# Row-first: Σᵢ (Σⱼ Mᵢⱼ)
row_first = sum(sum(M[i, j] for j in range(3)) for i in range(3))
# Column-first: Σⱼ (Σᵢ Mᵢⱼ)
col_first = sum(sum(M[i, j] for i in range(3)) for j in range(3))
print(f"\n Matrix M:\n{M}")
print(f"\n Row-first: Σᵢ(Σⱼ Mᵢⱼ) = Σᵢ{M.sum(axis=1).tolist()} = {row_first}")
print(f" Column-first: Σⱼ(Σᵢ Mᵢⱼ) = Σⱼ{M.sum(axis=0).tolist()} = {col_first}")
print(f" Equal: {row_first == col_first} ✓ (Fubini for finite sums)")
# AI: E[Σᵢ Xᵢ] = Σᵢ E[Xᵢ] (linearity of expectation = interchange)
print(f"\n AI: 𝔼[Σᵢ Xᵢ] = Σᵢ 𝔼[Xᵢ] (linearity of expectation)")
print(f" → Valid even if Xᵢ are dependent!")
5. Algebraic Properties of Products
5.1 Splitting Products
$$\prod_{i=m}^{n} f(i) = \left(\prod_{i=m}^{k} f(i)\right)\left(\prod_{i=k+1}^{n} f(i)\right), \qquad m \le k < n$$
5.2 Product of a Constant
$$\prod_{i=1}^{n} c = c^n$$
5.3 Interchange of Product Order
$$\prod_{i} \prod_{j} f(i,j) = \prod_{j} \prod_{i} f(i,j)$$
5.4 Distributing Product over Multiplication
$$\prod_{i} \bigl(f(i)\,g(i)\bigr) = \left(\prod_{i} f(i)\right)\left(\prod_{i} g(i)\right)$$
Does NOT distribute over addition: $\prod_i \bigl(f(i) + g(i)\bigr) \ne \prod_i f(i) + \prod_i g(i)$
5.5 Logarithm Converts Product to Sum — The Key Identity
$$\log \prod_{i} f(i) = \sum_{i} \log f(i)$$
This is why every ML system works in log-space:
- Cross-entropy: $\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log P(y_i \mid x_i)$
- Perplexity: $\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(t_i \mid t_{<i})\right)$
Code cell 14
# ══════════════════════════════════════════════════════════════════
# 5.1–5.4 PRODUCT ALGEBRAIC PROPERTIES
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("ALGEBRAIC PROPERTIES OF PRODUCTS")
print("=" * 65)
# 5.1 Splitting: Πᵢ₌₁¹⁰ i = (Πᵢ₌₁⁵ i) · (Πᵢ₌₆¹⁰ i)
full = pi_product(lambda i: i, 1, 10)
p1 = pi_product(lambda i: i, 1, 5)
p2 = pi_product(lambda i: i, 6, 10)
print(f"\n Splitting: Πᵢ₌₁¹⁰ i = {full}")
print(f" (Πᵢ₌₁⁵ i)(Πᵢ₌₆¹⁰ i) = {p1} × {p2} = {p1 * p2} ✓")
# 5.2 Constant product: Πᵢ₌₁ⁿ c = cⁿ
c, n = 3, 5
const_prod = pi_product(lambda i: c, 1, n)
print(f"\n Πᵢ₌₁⁵ 3 = {const_prod} = 3⁵ = {3**5} ✓")
# 5.4 Product distributes over multiplication (but NOT addition)
f_vals = np.array([2, 3, 4, 5], dtype=float)
g_vals = np.array([1, 2, 1, 3], dtype=float)
prod_fg = np.prod(f_vals * g_vals)
prod_f_prod_g = np.prod(f_vals) * np.prod(g_vals)
print(f"\n Πᵢ(fᵢ·gᵢ) = {prod_fg:.0f}")
print(f" (Πᵢ fᵢ)(Πᵢ gᵢ) = {prod_f_prod_g:.0f}")
print(f" Equal: {np.isclose(prod_fg, prod_f_prod_g)} ✓ (distributes over ×)")
# Does NOT distribute over addition
prod_sum = np.prod(f_vals + g_vals)
sum_prods = np.prod(f_vals) + np.prod(g_vals)
print(f"\n Πᵢ(fᵢ+gᵢ) = {prod_sum:.0f}")
print(f" Πᵢ fᵢ + Πᵢ gᵢ = {sum_prods:.0f}")
print(f" Equal: {np.isclose(prod_sum, sum_prods)} ← NOT equal! Product ≠ distribute over +")
# ══════════════════════════════════════════════════════════════════
# 5.5 LOG ↔ PRODUCT — COMPREHENSIVE DEMONSTRATION
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("LOG CONVERTS PRODUCT TO SUM — NUMERICAL STABILITY DEMO")
print("=" * 65)
# Simulate 50-token sequence with varying probabilities
np.random.seed(42)
probs_50 = np.random.uniform(0.01, 0.3, size=50)
# Direct product: underflows quickly
direct_product = np.prod(probs_50)
# Log-sum: stable
log_sum = np.sum(np.log(probs_50))
from_log = np.exp(log_sum)
print(f"\n 50-token sequence probabilities: P(tᵢ|ctx) ∈ [0.01, 0.3]")
print(f"\n Direct Π P(tᵢ) = {direct_product:.6e}")
print(f" Σ log P(tᵢ) = {log_sum:.4f}")
print(f" exp(Σ log P) = {from_log:.6e}")
print(f" Match: {np.isclose(direct_product, from_log)} ✓")
# With 500 tokens: direct product → 0.0 (underflow!)
probs_500 = np.random.uniform(0.01, 0.3, size=500)
direct_500 = np.prod(probs_500)
log_500 = np.sum(np.log(probs_500))
print(f"\n 500-token sequence:")
print(f" Direct Π P(tᵢ) = {direct_500} ← UNDERFLOW!")
print(f" Σ log P(tᵢ) = {log_500:.4f} ← perfectly fine in log-space")
print(f"\n → This is why ALL production ML systems work in log-space!")
6. Important Summation Formulas
6.1 Arithmetic Series (Gauss's Formula)
$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$$
Gauss at age 10: pair first + last: $1 + n,\ 2 + (n-1),\ \ldots$; $n/2$ pairs, each summing to $n+1$.
6.2 Sum of Squares
$$\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}$$
6.3 Sum of Cubes — The Remarkable Identity
$$\sum_{i=1}^{n} i^3 = \left(\frac{n(n+1)}{2}\right)^2 = \left(\sum_{i=1}^{n} i\right)^2$$
Sum of cubes = square of the sum of the first $n$ integers!
6.4 Geometric Series (Finite)
$$\sum_{i=0}^{n-1} r^i = \frac{1-r^n}{1-r} \qquad (r \ne 1)$$
6.5 Geometric Series (Infinite) — for $|r| < 1$:
$$\sum_{i=0}^{\infty} r^i = \frac{1}{1-r}$$
Derivative: $\sum_{i=1}^{\infty} i\,r^{i-1} = \frac{1}{(1-r)^2}$ (checked numerically in the sketch below)
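The differentiated geometric series is not exercised in the code cells below, so here is a minimal numeric check. The cutoff of 2,000 terms is an arbitrary choice, picked so the truncated tail is negligible for these values of $r$.

```python
# Minimal sketch (not from the original notebook): verify Σᵢ₌₁^∞ i·r^(i-1) = 1/(1-r)²
# for |r| < 1 by truncating the infinite sum at 2,000 terms.
for r in [0.3, 0.5, 0.9]:
    partial = sum(i * r**(i - 1) for i in range(1, 2001))  # truncated sum
    closed = 1.0 / (1.0 - r)**2                            # closed form
    print(f"r={r}: Σ i·r^(i-1) ≈ {partial:.6f}   1/(1-r)² = {closed:.6f}")
```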
6.6 Exponential and Taylor Series
$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}, \qquad \sin x = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{(2n+1)!}, \qquad \ln(1+x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1} x^n}{n} \quad (|x| \le 1,\ x \ne -1)$$
6.7 Harmonic Numbers
$$H_n = \sum_{i=1}^{n} \frac{1}{i} \approx \ln n + \gamma, \qquad \gamma \approx 0.5772$$
The harmonic series $\sum_{n=1}^{\infty} \frac{1}{n}$ diverges (grows without bound).
$p$-series: $\sum_{n=1}^{\infty} \frac{1}{n^p}$ converges iff $p > 1$.
Code cell 16
# ══════════════════════════════════════════════════════════════════
# 6.1–6.3 CLASSICAL CLOSED-FORM SUMMATION FORMULAS
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("CLASSICAL SUMMATION FORMULAS — VERIFY ALL")
print("=" * 65)
print(f"\n {'n':<6} {'Σi':<10} {'n(n+1)/2':<10} {'Σi²':<12} {'formula':<12} {'Σi³':<12} {'(Σi)²':<12}")
print(f" {'─'*6} {'─'*10} {'─'*10} {'─'*12} {'─'*12} {'─'*12} {'─'*12}")
for n in [1, 3, 5, 10, 20, 50, 100]:
# 6.1 Arithmetic: Σᵢ₌₁ⁿ i = n(n+1)/2
sum_i = sum(range(1, n+1))
gauss = n * (n + 1) // 2
# 6.2 Squares: Σᵢ₌₁ⁿ i² = n(n+1)(2n+1)/6
sum_i2 = sum(i**2 for i in range(1, n+1))
sq_formula = n * (n + 1) * (2 * n + 1) // 6
# 6.3 Cubes: Σᵢ₌₁ⁿ i³ = (n(n+1)/2)²
sum_i3 = sum(i**3 for i in range(1, n+1))
cube_formula = gauss ** 2
print(f" {n:<6} {sum_i:<10} {gauss:<10} {sum_i2:<12} {sq_formula:<12} {sum_i3:<12} {cube_formula:<12}")
print(f"\n All verified: Σi³ = (Σi)² — remarkable identity! ✓")
print(f" AI: n tokens → O(n²) attention pairs; Σᵢ₌₁ⁿ i = n(n+1)/2 ≈ n²/2")
# ══════════════════════════════════════════════════════════════════
# 6.4–6.5 GEOMETRIC SERIES
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("GEOMETRIC SERIES: Σᵢ₌₀ⁿ⁻¹ rⁱ = (1−rⁿ)/(1−r)")
print("=" * 65)
def geometric_sum(r, n):
"""Closed form: (1 - rⁿ) / (1 - r)"""
if np.isclose(r, 1.0):
return float(n)
return (1 - r**n) / (1 - r)
print(f"\n {'r':<8} {'n':<6} {'Direct Σ':<18} {'Formula':<18} {'Match'}")
print(f" {'─'*8} {'─'*6} {'─'*18} {'─'*18} {'─'*6}")
for r in [0.5, 0.9, 0.99, 2.0]:
for n in [5, 20]:
direct = sum(r**i for i in range(n))
formula = geometric_sum(r, n)
print(f" {r:<8.2f} {n:<6} {direct:<18.8f} {formula:<18.8f} {'✓' if np.isclose(direct, formula) else '✗'}")
# Infinite geometric series: Σᵢ₌₀^∞ rⁱ = 1/(1-r) for |r| < 1
print(f"\n Infinite geometric series (|r| < 1):")
for r in [0.5, 0.9, 0.99]:
partial_100 = sum(r**i for i in range(100))
infinite = 1 / (1 - r)
print(f" r={r:.2f}: Σᵢ₌₀⁹⁹ rⁱ = {partial_100:.8f} → 1/(1−r) = {infinite:.8f}")
print(f"\n AI: Geometric decay in ALiBi attention; discounted rewards in RL: Σᵢ γⁱrᵢ")
Code cell 17
# ══════════════════════════════════════════════════════════════════
# 6.6 TAYLOR SERIES — FUNCTIONS AS INFINITE SUMS
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("TAYLOR SERIES: FUNCTIONS AS Σ")
print("=" * 65)
def exp_taylor(x, n_terms):
"""eˣ = Σₙ₌₀^∞ xⁿ/n!"""
return sum(x**n / math.factorial(n) for n in range(n_terms))
def sin_taylor(x, n_terms):
"""sin(x) = Σₙ₌₀^∞ (-1)ⁿ x^(2n+1)/(2n+1)!"""
return sum((-1)**n * x**(2*n+1) / math.factorial(2*n+1) for n in range(n_terms))
def ln1px_taylor(x, n_terms):
"""ln(1+x) = Σₙ₌₁^∞ (-1)^(n+1) xⁿ/n for |x| ≤ 1"""
return sum((-1)**(n+1) * x**n / n for n in range(1, n_terms + 1))
x = 1.0 # Evaluate at x = 1
print(f"\n x = {x}")
print(f"\n {'Terms':<8} {'exp(x) Taylor':<18} {'sin(x) Taylor':<18} {'ln(1+x) Taylor':<18}")
print(f" {'─'*8} {'─'*18} {'─'*18} {'─'*18}")
for n_terms in [1, 3, 5, 10, 20]:
e_approx = exp_taylor(x, n_terms)
s_approx = sin_taylor(x, min(n_terms, 10))
l_approx = ln1px_taylor(x, n_terms)
print(f" {n_terms:<8} {e_approx:<18.10f} {s_approx:<18.10f} {l_approx:<18.10f}")
print(f"\n Exact: {' ':<8}{np.exp(x):<18.10f} {np.sin(x):<18.10f} {np.log(1+x):<18.10f}")
print(f"\n AI uses: exp() in softmax, log() in cross-entropy, sin()/cos() in positional encodings")
# ══════════════════════════════════════════════════════════════════
# 6.7 HARMONIC NUMBERS & DIVERGENCE
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("HARMONIC NUMBERS: Hₙ = Σᵢ₌₁ⁿ 1/i ≈ ln(n) + γ")
print("=" * 65)
EULER_MASCHERONI = 0.5772156649 # γ
print(f"\n {'n':<10} {'Hₙ = Σ 1/i':<18} {'ln(n) + γ':<18} {'Error':<12} {'ln(n)':<12}")
print(f" {'─'*10} {'─'*18} {'─'*18} {'─'*12} {'─'*12}")
for n in [1, 5, 10, 50, 100, 1000, 10000]:
H_n = sum(1.0/i for i in range(1, n+1))
approx = np.log(n) + EULER_MASCHERONI
err = abs(H_n - approx)
print(f" {n:<10} {H_n:<18.8f} {approx:<18.8f} {err:<12.6f} {np.log(n):<12.4f}")
print(f"\n γ (Euler–Mascheroni) ≈ {EULER_MASCHERONI}")
print(f" Hₙ grows like ln(n) — DIVERGES, but very slowly")
# p-series convergence
print(f"\n p-series: Σ 1/nᵖ — converges iff p > 1")
for p in [0.5, 1.0, 1.5, 2.0, 3.0]:
partial = sum(1.0/n**p for n in range(1, 10001))
converges = "DIVERGES" if p <= 1 else f"→ ζ({p:.1f}) ≈ {partial:.6f}"
print(f" p={p:.1f}: Σₙ₌₁¹⁰⁰⁰⁰ 1/nᵖ = {partial:.6f} {converges}")
print(f"\n ζ(2) = π²/6 ≈ {np.pi**2/6:.6f} (Basel problem, Euler 1734)")
print(f" AI: Zipf's law (word freq ∝ 1/rank); harmonic sums in complexity analysis")
7. Summation Bounds and Estimation
7.1 Comparison Test
If $0 \le a_n \le b_n$ for all $n$:
- $\sum_n b_n$ converges $\Rightarrow$ $\sum_n a_n$ converges
- $\sum_n a_n$ diverges $\Rightarrow$ $\sum_n b_n$ diverges
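A minimal numeric illustration of the comparison test. The pair $1/(n^2+1) \le 1/n^2$ is an arbitrary example chosen for this sketch, not taken from the notes.

```python
# Minimal sketch (not from the original notebook): 0 ≤ 1/(n²+1) ≤ 1/n², and Σ 1/n²
# converges (to π²/6), so Σ 1/(n²+1) must converge too; its partial sums level off.
for N in [10, 100, 1000, 100000]:
    bigger = sum(1.0 / n**2 for n in range(1, N + 1))         # known convergent bound
    smaller = sum(1.0 / (n**2 + 1) for n in range(1, N + 1))  # dominated series
    print(f"N={N:<7} Σ 1/(n²+1) = {smaller:.6f} ≤ Σ 1/n² = {bigger:.6f}")
```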
7.2 Integral Test
For positive decreasing $f$: $\sum_{n=1}^{\infty} f(n)$ and $\int_{1}^{\infty} f(x)\,dx$ converge or diverge together.
7.3 Bounding Partial Sums
- Upper: $\sum_{i=1}^{n} a_i \le n \cdot \max_i a_i$
- Cauchy-Schwarz: $\left(\sum_i a_i b_i\right)^2 \le \left(\sum_i a_i^2\right)\left(\sum_i b_i^2\right)$
- Triangle inequality: $\left|\sum_i a_i\right| \le \sum_i |a_i|$
7.4 Stirling's Approximation
$$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n \quad\Longleftrightarrow\quad \log n! = \sum_{i=1}^{n} \log i \approx n\log n - n + \tfrac{1}{2}\log(2\pi n)$$
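Code cell 10 compared $n!$ with Stirling directly; the log form is the one used in practice when $n!$ itself would overflow. Here is a minimal sketch comparing $\sum_i \log i$ with the log-form approximation (the particular values of $n$ are arbitrary).

```python
# Minimal sketch (not from the original notebook): log(n!) = Σ log i vs Stirling's
# log form n·ln n - n + ½·ln(2πn); the gap shrinks like 1/(12n).
import math

for n in [10, 100, 1000, 10000]:
    log_fact = sum(math.log(i) for i in range(1, n + 1))                 # Σᵢ log i
    stirling = n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)     # log-form Stirling
    print(f"n={n:<6} Σ log i = {log_fact:.4f}   Stirling ≈ {stirling:.4f}   diff = {log_fact - stirling:.6f}")
```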
7.5 Asymptotic Notation for Sums
- $\sum_{i=1}^{n} i = O(n^2)$; $\sum_{i=1}^{n} \frac{1}{i} = O(\log n)$
- Attention: $O(n^2 d)$; Total transformer layer: $O(n^2 d + n d^2)$
Code cell 19
# ══════════════════════════════════════════════════════════════════
# 7.1–7.2 COMPARISON TEST & INTEGRAL TEST
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("INTEGRAL TEST BOUNDS ON HARMONIC NUMBERS")
print("=" * 65)
# Integral test: ln(n+1) ≤ Hₙ ≤ 1 + ln(n)
print(f"\n {'n':<10} {'ln(n+1)':<14} {'Hₙ':<14} {'1+ln(n)':<14} {'In bounds?'}")
print(f" {'─'*10} {'─'*14} {'─'*14} {'─'*14} {'─'*10}")
for n in [5, 10, 50, 100, 1000]:
H_n = sum(1.0/i for i in range(1, n+1))
lower = np.log(n + 1)
upper = 1.0 + np.log(n)
in_bounds = lower <= H_n <= upper
print(f" {n:<10} {lower:<14.6f} {H_n:<14.6f} {upper:<14.6f} {'✓' if in_bounds else '✗'}")
# ══════════════════════════════════════════════════════════════════
# 7.3 CAUCHY-SCHWARZ INEQUALITY FOR SUMS
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("CAUCHY-SCHWARZ: (Σ aᵢbᵢ)² ≤ (Σ aᵢ²)(Σ bᵢ²)")
print("=" * 65)
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 1.0, 4.0, 0.5, 3.0])
lhs = np.dot(a, b)**2
rhs = np.dot(a, a) * np.dot(b, b)
print(f"\n a = {a}")
print(f" b = {b}")
print(f"\n (Σ aᵢbᵢ)² = {np.dot(a,b):.2f}² = {lhs:.2f}")
print(f" (Σ aᵢ²)(Σ bᵢ²) = {np.dot(a,a):.2f} × {np.dot(b,b):.2f} = {rhs:.2f}")
print(f" LHS ≤ RHS: {lhs <= rhs + 1e-10} ✓ (Cauchy-Schwarz)")
# Equality holds iff a ∝ b
a_prop = np.array([1.0, 2.0, 3.0])
b_prop = 2.5 * a_prop # b = 2.5a (proportional)
lhs_eq = np.dot(a_prop, b_prop)**2
rhs_eq = np.dot(a_prop, a_prop) * np.dot(b_prop, b_prop)
print(f"\n Equality when a ∝ b: a = {a_prop}, b = 2.5a = {b_prop}")
print(f" (Σ aᵢbᵢ)² = {lhs_eq:.2f}, (Σ aᵢ²)(Σ bᵢ²) = {rhs_eq:.2f}")
print(f" Equal: {np.isclose(lhs_eq, rhs_eq)} ✓")
# Triangle inequality
print(f"\n Triangle inequality: |Σᵢ aᵢ| ≤ Σᵢ |aᵢ|")
mixed = np.array([3, -2, 5, -1, 4])
print(f" a = {mixed}")
print(f" |Σ aᵢ| = |{np.sum(mixed)}| = {abs(np.sum(mixed))}")
print(f" Σ |aᵢ| = {np.sum(np.abs(mixed))}")
print(f" {abs(np.sum(mixed))} ≤ {np.sum(np.abs(mixed))} ✓")
# ══════════════════════════════════════════════════════════════════
# 7.4–7.5 STIRLING & ASYMPTOTIC COMPLEXITY
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("ASYMPTOTIC COMPLEXITY OF COMMON SUMS")
print("=" * 65)
print(f"\n {'Sum':<25} {'n=100':<15} {'n=1000':<15} {'Growth'}")
print(f" {'─'*25} {'─'*15} {'─'*15} {'─'*15}")
sums_data = [
("Σᵢ₌₁ⁿ 1 = n", lambda n: n, "O(n)"),
("Σᵢ₌₁ⁿ i = n²/2", lambda n: n*(n+1)//2, "O(n²)"),
("Σᵢ₌₁ⁿ i² ≈ n³/3", lambda n: n*(n+1)*(2*n+1)//6, "O(n³)"),
("Σᵢ₌₁ⁿ log(i)=log(n!)",lambda n: sum(np.log(i) for i in range(1,n+1)), "O(n log n)"),
("Σᵢ₌₁ⁿ 1/i ≈ ln(n)", lambda n: sum(1.0/i for i in range(1,n+1)), "O(log n)"),
]
for label, func, growth in sums_data:
v100 = func(100)
v1000 = func(1000)
print(f" {label:<25} {v100:<15.2f} {v1000:<15.2f} {growth}")
# Transformer FLOP count
print(f"\n Transformer complexity per layer:")
for n_seq, d_model in [(512, 768), (2048, 1024), (8192, 4096)]:
attn_flops = 2 * n_seq * n_seq * d_model # QKᵀ and αV
ffn_flops = 2 * n_seq * d_model * (4 * d_model) # 2 linear layers
total = attn_flops + ffn_flops
print(f" n={n_seq:<5} d={d_model:<5} Attn: {attn_flops:.2e} FFN: {ffn_flops:.2e} Total: {total:.2e}")
print(f"\n Attn = O(n²d), FFN = O(nd²) → Total = O(n²d + nd²)")
8. Special Summation Patterns in AI
8.1 Softmax and the Normalisation Sum (Partition Function)
$$P(v) = \frac{e^{z_v}}{\sum_{v' \in V} e^{z_{v'}}}, \qquad Z = \sum_{v' \in V} e^{z_{v'}}$$
Log-sum-exp trick (numerically stable): $\log \sum_{v} e^{z_v} = m + \log \sum_{v} e^{z_v - m}$, where $m = \max_v z_v$.
8.2 Cross-Entropy Loss as Double Sum
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{v \in V} y_{iv} \log P(v \mid x_i) = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i \mid x_i)$$
Gradient: $\dfrac{\partial \mathcal{L}_i}{\partial z_v} = P(v \mid x_i) - \mathbb{1}[v = y_i]$ — predicted minus true.
8.3 Attention as Weighted Sum
$$\mathrm{Attn}(Q,K,V)_i = \sum_{j} \alpha_{ij} V_j, \qquad \alpha_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d_k})}{\sum_{j'} \exp(q_i \cdot k_{j'} / \sqrt{d_k})}$$
$\sum_j \alpha_{ij} = 1$ — attention is a convex combination of value vectors.
8.4 Perplexity as Geometric Mean
$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(t_i \mid t_{<i})\right) = \left(\prod_{i=1}^{N} \frac{1}{P(t_i \mid t_{<i})}\right)^{1/N}$$
8.5 BM25 as Weighted Sum
$$\mathrm{score}(q,d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t,d)\,(k_1+1)}{f(t,d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}$$
Code cell 21
# ══════════════════════════════════════════════════════════════════
# 8.1 SOFTMAX — NORMALISATION SUM & LOG-SUM-EXP TRICK
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("SOFTMAX: P(v) = exp(zᵥ) / Σᵥ' exp(zᵥ')")
print("=" * 65)
def softmax_naive(z):
"""Naive softmax — may overflow for large z."""
e = np.exp(z)
return e / np.sum(e)
def softmax_stable(z):
"""Numerically stable softmax using max trick."""
m = np.max(z) # m = max(z)
e = np.exp(z - m) # shift by max → largest exp = 1
return e / np.sum(e) # Z = Σ exp(zᵥ − m)
def log_sum_exp(z):
"""log Σ exp(zᵥ) = m + log Σ exp(zᵥ − m)"""
m = np.max(z)
return m + np.log(np.sum(np.exp(z - m)))
# Demo with normal logits
logits = np.array([2.0, 1.0, 0.5, -1.0, 3.0])
probs = softmax_stable(logits)
Z = np.sum(np.exp(logits))
print(f"\n Logits z = {logits}")
print(f" Z = Σ exp(zᵥ) = {Z:.4f}")
print(f" log Z (naive) = {np.log(Z):.6f}")
print(f" log Z (log-sum-exp) = {log_sum_exp(logits):.6f}")
print(f" Softmax P(v): {probs}")
print(f" Σ P(v) = {np.sum(probs):.10f} ✓ (sums to 1)")
# Demo with extreme logits — naive overflows!
extreme = np.array([1000.0, 999.0, 998.0])
print(f"\n Extreme logits: {extreme}")
with warnings.catch_warnings():
warnings.simplefilter("ignore")
naive_result = softmax_naive(extreme)
stable_result = softmax_stable(extreme)
print(f" Naive softmax: {naive_result} ← overflow produces nan!")
print(f" Stable softmax: {stable_result} ← correct!")
# ══════════════════════════════════════════════════════════════════
# 8.2 CROSS-ENTROPY LOSS & ITS GRADIENT
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("CROSS-ENTROPY: L = −(1/N) Σᵢ log P(yᵢ|xᵢ)")
print("=" * 65)
# Simulate 5 training examples, vocab size 4
np.random.seed(42)
N, V_size = 5, 4
logits_batch = np.random.randn(N, V_size)
labels = np.array([2, 0, 3, 1, 2]) # true labels
# Compute cross-entropy loss
total_loss = 0.0
print(f"\n {'Example':<10} {'True y':<8} {'P(y|x)':<12} {'−log P(y|x)':<14}")
print(f" {'─'*10} {'─'*8} {'─'*12} {'─'*14}")
for i in range(N):
probs_i = softmax_stable(logits_batch[i])
p_true = probs_i[labels[i]]
loss_i = -np.log(p_true)
total_loss += loss_i
print(f" {i+1:<10} {labels[i]:<8} {p_true:<12.6f} {loss_i:<14.6f}")
avg_loss = total_loss / N
print(f"\n L = −(1/{N}) Σ log P(yᵢ|xᵢ) = {avg_loss:.6f}")
# Gradient: ∂L/∂zᵥ = P(v|x) − 𝟙[v=y]
print(f"\n Gradient for example 1 (y=2):")
probs_1 = softmax_stable(logits_batch[0])
grad = probs_1.copy()
grad[labels[0]] -= 1.0 # subtract indicator
print(f" P(v|x): {probs_1}")
print(f" 𝟙[v=y]: {[int(v == labels[0]) for v in range(V_size)]}")
print(f" ∂L/∂zᵥ: {grad}")
print(f" Σᵥ ∂L/∂zᵥ = {np.sum(grad):.10f} (sums to 0!) ✓")
Code cell 22
# ══════════════════════════════════════════════════════════════════
# 8.3 ATTENTION AS WEIGHTED SUM
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("ATTENTION: Attn(Q,K,V)ᵢ = Σⱼ αᵢⱼ Vⱼ")
print("=" * 65)
np.random.seed(42)
n_seq, d_k, d_v = 4, 3, 3 # 4 tokens, dim 3
Q = np.random.randn(n_seq, d_k)
K = np.random.randn(n_seq, d_k)
V = np.random.randn(n_seq, d_v)
# Step 1: Scores sᵢⱼ = qᵢ·kⱼ / √dₖ = Σₖ Qᵢₖ Kⱼₖ / √dₖ
scores = Q @ K.T / np.sqrt(d_k)
print(f"\n Score matrix (sᵢⱼ = qᵢ·kⱼ/√dₖ):\n{scores}")
# Step 2: Attention weights αᵢⱼ = softmax(sᵢ) over j
alpha = np.zeros_like(scores)
for i in range(n_seq):
alpha[i] = softmax_stable(scores[i])
print(f"\n Attention weights α (each row sums to 1):\n{np.round(alpha, 4)}")
print(f" Row sums: {alpha.sum(axis=1)}")
# Step 3: Output = Σⱼ αᵢⱼ Vⱼ (weighted sum of value vectors)
attn_output_manual = np.zeros((n_seq, d_v))
for i in range(n_seq):
for j in range(n_seq):
attn_output_manual[i] += alpha[i, j] * V[j]
attn_output_matrix = alpha @ V # same thing as matrix multiply
print(f"\n Attention output (manual Σⱼ αᵢⱼVⱼ):\n{np.round(attn_output_manual, 4)}")
print(f" Attention output (α @ V):\n{np.round(attn_output_matrix, 4)}")
print(f" Match: {np.allclose(attn_output_manual, attn_output_matrix)} ✓")
# Causal mask: restrict j ≤ i
print(f"\n Causal attention (j ≤ i only):")
causal_mask = np.triu(np.ones((n_seq, n_seq)) * (-1e9), k=1)
scores_causal = scores + causal_mask
alpha_causal = np.zeros_like(scores_causal)
for i in range(n_seq):
alpha_causal[i] = softmax_stable(scores_causal[i])
print(f" Causal α:\n{np.round(alpha_causal, 4)}")
print(f" → Token 0 attends only to itself; token 3 attends to all 4")
# ══════════════════════════════════════════════════════════════════
# 8.4–8.5 PERPLEXITY & BM25
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("PERPLEXITY: PPL = exp(−(1/N) Σ log P)")
print("=" * 65)
# Simulate model probabilities for a 10-token sequence
token_probs = np.array([0.8, 0.5, 0.3, 0.9, 0.1, 0.6, 0.4, 0.7, 0.2, 0.05])
N_tok = len(token_probs)
cross_ent = -np.mean(np.log(token_probs))
ppl = np.exp(cross_ent)
geo_mean = np.prod(1.0 / token_probs) ** (1.0 / N_tok)
print(f"\n Token probs: {token_probs}")
print(f" Cross-entropy H = −(1/N)Σ log P = {cross_ent:.4f}")
print(f" PPL = exp(H) = {ppl:.4f}")
print(f" PPL = (Π 1/P)^(1/N) = {geo_mean:.4f}")
print(f" Match: {np.isclose(ppl, geo_mean)} ✓")
print(f"\n Interpretation: model is as uncertain as choosing uniformly")
print(f" from {ppl:.1f} equally likely tokens on average")
# BM25 scoring
print(f"\n{'─' * 65}")
print("BM25: score(q,d) = Σ_{t ∈ q} IDF(t) · TF_mod(t,d)")
print("─" * 65)
def bm25_score(query_terms, doc_tf, doc_len, avg_dl, idf, k1=1.5, b=0.75):
"""BM25 score = Σ_{t ∈ q} IDF(t) · (f·(k1+1)) / (f + k1·(1 - b + b·|d|/avgdl))"""
score = 0.0
details = []
for t in query_terms:
if t in doc_tf:
f = doc_tf[t]
tf_mod = (f * (k1 + 1)) / (f + k1 * (1 - b + b * doc_len / avg_dl))
term_score = idf.get(t, 0) * tf_mod
score += term_score
details.append((t, idf[t], f, tf_mod, term_score))
return score, details
query = ["neural", "network", "training"]
doc_tf = {"neural": 3, "network": 2, "training": 5, "the": 20}
idf = {"neural": 2.5, "network": 2.0, "training": 1.8}
score, details = bm25_score(query, doc_tf, doc_len=100, avg_dl=150, idf=idf)
print(f"\n Query: {query}")
print(f" {'Term':<12} {'IDF':<8} {'TF':<6} {'TF_mod':<10} {'Score'}")
print(f" {'─'*12} {'─'*8} {'─'*6} {'─'*10} {'─'*8}")
for t, i, f, tf_m, s in details:
print(f" {t:<12} {i:<8.2f} {f:<6} {tf_m:<10.4f} {s:.4f}")
print(f"\n BM25(q,d) = Σ = {score:.4f}")
9. Einstein Summation Convention
9.1 Motivation & Rule
In tensor operations, summation over repeated indices is so common that Einstein (1916) proposed: if an index appears exactly twice in a product, sum over it automatically.
- Free index: appears once → result has that index (no summation)
- Dummy index: appears twice → summed over; does not appear in result
9.2 Examples
| Operation | Einstein | Explicit | Result |
|---|---|---|---|
| Dot product | $a_i b_i$ | $\sum_i a_i b_i$ | scalar |
| Matrix-vector | $A_{ij} x_j$ | $\sum_j A_{ij} x_j$ | vector (index $i$) |
| Matrix-matrix | $A_{ik} B_{kj}$ | $\sum_k A_{ik} B_{kj}$ | matrix (indices $i,j$) |
| Trace | $A_{ii}$ | $\sum_i A_{ii}$ | scalar |
| Outer product | $a_i b_j$ (no repeated) | $a_i b_j$ (no sum) | matrix |
| Quadratic form | $x_i A_{ij} x_j$ | $\sum_i \sum_j x_i A_{ij} x_j$ | scalar |
9.3 Rules
- Each index appears at most twice per term
- Repeated index → summed; do not write $\sum$
- Free index → must match on both sides of equation
- Three+ appearances → ambiguous; write explicitly
9.4 einsum in NumPy / PyTorch
np.einsum("ij,jk->ik", A, B) # matrix multiply
np.einsum("ij,ij->", A, B) # Frobenius inner product
np.einsum("bhid,bhjd->bhij", Q, K) # batched attention scores
9.5 Tensor Contraction
Generalisation of matmul to higher-order tensors, e.g. $C_{ij\ell} = \sum_{k} A_{ijk} B_{k\ell}$ (contract over $k$).
Code cell 24
# ══════════════════════════════════════════════════════════════════
# 9.1–9.2 EINSTEIN SUMMATION — ALL EXAMPLES
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("EINSTEIN SUMMATION CONVENTION — np.einsum DEMONSTRATIONS")
print("=" * 65)
A = np.array([[1, 2, 3],
[4, 5, 6]]) # 2×3
B = np.array([[7, 8],
[9, 10],
[11, 12]]) # 3×2
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
# 1. Dot product: aᵢbᵢ = Σᵢ aᵢbᵢ
dot = np.einsum('i,i->', x, y)
print(f"\n 1. Dot product: aᵢbᵢ")
print(f" einsum('i,i->', x, y) = {dot}")
print(f" np.dot(x, y) = {np.dot(x, y)} ✓")
# 2. Matrix-vector: Aᵢⱼxⱼ = Σⱼ Aᵢⱼxⱼ
mv = np.einsum('ij,j->i', A, x)
print(f"\n 2. Matrix-vector: Aᵢⱼxⱼ")
print(f" einsum('ij,j->i', A, x) = {mv}")
print(f" A @ x = {A @ x} ✓")
# 3. Matrix-matrix: AᵢₖBₖⱼ = Σₖ AᵢₖBₖⱼ
mm = np.einsum('ik,kj->ij', A, B)
print(f"\n 3. Matrix multiply: AᵢₖBₖⱼ")
print(f" einsum('ik,kj->ij', A, B):\n{mm}")
print(f" A @ B:\n{A @ B} ✓")
# 4. Trace: Aᵢᵢ = Σᵢ Aᵢᵢ
S = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
tr = np.einsum('ii->', S)
print(f"\n 4. Trace: Aᵢᵢ")
print(f" einsum('ii->', S) = {tr}")
print(f" np.trace(S) = {np.trace(S)} ✓")
# 5. Outer product: aᵢbⱼ (no repeated index)
outer = np.einsum('i,j->ij', x, y)
print(f"\n 5. Outer product: aᵢbⱼ (no sum — free indices i,j)")
print(f" einsum('i,j->ij', x, y):\n{outer}")
print(f" np.outer(x, y):\n{np.outer(x, y)} ✓")
# 6. Quadratic form: xᵢAᵢⱼxⱼ = Σᵢ Σⱼ xᵢAᵢⱼxⱼ
quad = np.einsum('i,ij,j->', x, S, x)
print(f"\n 6. Quadratic form: xᵢAᵢⱼxⱼ")
print(f" einsum('i,ij,j->', x, S, x) = {quad}")
print(f" x @ S @ x = {x @ S @ x} ✓")
# 7. Transpose: Aᵢⱼ → Aⱼᵢ
trans = np.einsum('ij->ji', A)
print(f"\n 7. Transpose: einsum('ij->ji', A):\n{trans}")
# 8. Frobenius inner product: Σᵢⱼ AᵢⱼBᵢⱼ
F1 = np.array([[1, 2], [3, 4]])
F2 = np.array([[5, 6], [7, 8]])
frob = np.einsum('ij,ij->', F1, F2)
print(f"\n 8. Frobenius: AᵢⱼBᵢⱼ = {frob} = tr(AᵀB) = {np.trace(F1.T @ F2)} ✓")
Code cell 25
# ══════════════════════════════════════════════════════════════════
# 9.4 EINSUM FOR ATTENTION (BATCHED MULTI-HEAD)
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("BATCHED MULTI-HEAD ATTENTION VIA EINSUM")
print("=" * 65)
np.random.seed(42)
batch, heads, seq, dim = 2, 3, 4, 5
Q = np.random.randn(batch, heads, seq, dim)
K = np.random.randn(batch, heads, seq, dim)
V = np.random.randn(batch, heads, seq, dim)
# Attention scores: sᵢⱼ = Σₖ Qᵢₖ Kⱼₖ / √dₖ
# einsum: "bhid,bhjd->bhij" — sum over d (dummy); free: b,h,i,j
scores = np.einsum('bhid,bhjd->bhij', Q, K) / np.sqrt(dim)
print(f"\n Q shape: {Q.shape} (batch, heads, seq, dim)")
print(f" K shape: {K.shape}")
print(f" Scores = einsum('bhid,bhjd->bhij', Q, K)/√d")
print(f" Scores shape: {scores.shape} (batch, heads, seq, seq)")
print(f"\n Free indices: b(batch), h(head), i(query), j(key)")
print(f" Dummy index: d(dimension) — summed over!")
# Softmax over j (key dimension)
def batched_softmax(x, axis=-1):
e = np.exp(x - np.max(x, axis=axis, keepdims=True))
return e / np.sum(e, axis=axis, keepdims=True)
alpha = batched_softmax(scores, axis=-1)
print(f"\n α shape: {alpha.shape} (after softmax over j)")
print(f" Row sums (should be 1): {alpha[0, 0].sum(axis=-1)}")
# Output: oᵢ = Σⱼ αᵢⱼ Vⱼ → einsum('bhij,bhjd->bhid')
output = np.einsum('bhij,bhjd->bhid', alpha, V)
print(f"\n Output = einsum('bhij,bhjd->bhid', α, V)")
print(f" Output shape: {output.shape} (same as Q — correct!)")
# ══════════════════════════════════════════════════════════════════
# 9.5 TENSOR CONTRACTION
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("TENSOR CONTRACTION — GENERALISED MATMUL")
print("=" * 65)
# Contract 3D tensor with 2D tensor
T = np.random.randn(3, 4, 5) # Aᵢⱼₖ
M = np.random.randn(5, 2) # Bₖₗ
# Contract over k: Cᵢⱼₗ = Σₖ Aᵢⱼₖ Bₖₗ
C = np.einsum('ijk,kl->ijl', T, M)
print(f"\n A shape: {T.shape} (3D tensor)")
print(f" B shape: {M.shape} (matrix)")
print(f" Σₖ Aᵢⱼₖ Bₖₗ → C shape: {C.shape}")
# Multi-index contraction: Σⱼ Σₖ AᵢⱼₖBⱼₖₗ = Cᵢₗ
T2 = np.random.randn(3, 4, 5)
T3 = np.random.randn(4, 5, 2)
C2 = np.einsum('ijk,jkl->il', T2, T3)
print(f"\n Σⱼ Σₖ Aᵢⱼₖ Bⱼₖₗ → C shape: {C2.shape} (contracted 2 indices)")
# FLOPs: product of all index ranges
flops = 3 * 4 * 5 * 2 # i × j × k × l
print(f" FLOPs: {flops} = 3×4×5×2 (product of ALL index ranges)")
# PyTorch einsum comparison (if available)
if HAS_TORCH:
Q_t = torch.randn(2, 3, 4, 5)
K_t = torch.randn(2, 3, 4, 5)
scores_t = torch.einsum('bhid,bhjd->bhij', Q_t, K_t)
print(f"\n torch.einsum('bhid,bhjd->bhij', Q, K) shape: {list(scores_t.shape)}")
print(f" → Same as NumPy einsum — lingua franca of tensor ops!")
else:
print(f"\n (PyTorch not available — einsum works identically with torch.einsum)")
10. Double Sums and Index Interaction
10.1 Diagonal, Upper-Triangular, and Lower-Triangular Sums
- Diagonal: $\sum_{i} A_{ii} = \mathrm{tr}(A)$
- Upper triangular: $\sum_{i < j} A_{ij}$ — causal attention restricts to $j \le i$
- Symmetrisation: $\sum_{i,j} A_{ij} = \sum_{i} A_{ii} + \sum_{i<j} \bigl(A_{ij} + A_{ji}\bigr)$
10.2 Changing Order in Triangular Sums
$$\sum_{i=1}^{n} \sum_{j=1}^{i} f(i,j) = \sum_{j=1}^{n} \sum_{i=j}^{n} f(i,j)$$
Both cover the same pairs with $j \le i$; just different traversal order.
10.3 Cauchy Product (Product of Two Series)
$$\left(\sum_{n=0}^{\infty} a_n\right)\left(\sum_{n=0}^{\infty} b_n\right) = \sum_{n=0}^{\infty} c_n, \qquad c_n = \sum_{k=0}^{n} a_k b_{n-k}$$
where $c_n$ is a convolution — the same operation used in CNNs!
10.4 Summation by Parts (Abel's Summation)
$$\sum_{k=1}^{n} a_k b_k = A_n b_n - \sum_{k=1}^{n-1} A_k \bigl(b_{k+1} - b_k\bigr)$$
where $A_k = \sum_{i=1}^{k} a_i$ — the discrete analogue of integration by parts.
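Code cell 27 below covers 10.1–10.3 but not Abel's identity, so here is a minimal numeric check. The vector length (8) and random seed are arbitrary choices for the sketch.

```python
# Minimal sketch (not from the original notebook): verify summation by parts,
# Σₖ₌₁ⁿ aₖbₖ = Aₙbₙ - Σₖ₌₁ⁿ⁻¹ Aₖ(bₖ₊₁ - bₖ), with Aₖ the running partial sums of a.
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(8)
b = rng.standard_normal(8)
n = len(a)

A = np.cumsum(a)                                  # Aₖ = Σᵢ₌₁ᵏ aᵢ (0-based: A[k])
lhs = np.sum(a * b)                               # direct Σ aₖbₖ
rhs = A[-1] * b[-1] - sum(A[k] * (b[k + 1] - b[k]) for k in range(n - 1))
print(f"direct    = {lhs:.10f}")
print(f"by parts  = {rhs:.10f}")
print(f"match: {np.isclose(lhs, rhs)}")
```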
Code cell 27
# ══════════════════════════════════════════════════════════════════
# 10.1 DIAGONAL & TRIANGULAR SUMS
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("DIAGONAL & TRIANGULAR SUMS")
print("=" * 65)
M = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16]])
n = 4
# Diagonal sum: Σᵢ Mᵢᵢ
diag_sum = sum(M[i, i] for i in range(n))
print(f"\n M:\n{M}")
print(f"\n Diagonal: Σᵢ Mᵢᵢ = {diag_sum} = tr(M) = {np.trace(M)} ✓")
# Upper triangular (i < j): Σᵢ Σⱼ>ᵢ Mᵢⱼ
upper = sum(M[i, j] for i in range(n) for j in range(i+1, n))
print(f" Upper triangular (i < j): {upper}")
print(f" np.triu(M, k=1).sum() = {np.triu(M, k=1).sum()} ✓")
# Lower triangular (i > j): Σᵢ Σⱼ<ᵢ Mᵢⱼ
lower = sum(M[i, j] for i in range(n) for j in range(i))
print(f" Lower triangular (i > j): {lower}")
# Verify: full = diagonal + upper + lower
full = np.sum(M)
print(f"\n Full = diag + upper + lower: {full} = {diag_sum} + {upper} + {lower} = {diag_sum + upper + lower} ✓")
# AI: Causal attention mask
print(f"\n AI: Causal attention uses LOWER triangular (j ≤ i):")
causal = np.tril(np.ones((4, 4)))
print(f" Mask:\n{causal.astype(int)}")
# ══════════════════════════════════════════════════════════════════
# 10.2 CHANGING ORDER IN TRIANGULAR SUMS
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("INTERCHANGE ORDER: Σᵢ₌₁ⁿ Σⱼ₌₁ⁱ = Σⱼ₌₁ⁿ Σᵢ₌ⱼⁿ")
print("=" * 65)
n = 5
f_func = lambda i, j: i + j
# Order 1: fix i, sum j from 1 to i
order1 = 0
print(f"\n Order 1: Σᵢ₌₁⁵ Σⱼ₌₁ⁱ f(i,j) where f(i,j) = i+j")
for i in range(1, n+1):
inner = sum(f_func(i, j) for j in range(1, i+1))
order1 += inner
terms = [f"f({i},{j})" for j in range(1, i+1)]
print(f" i={i}: Σⱼ₌₁^{i} = {' + '.join(terms)} = {inner}")
print(f" Total = {order1}")
# Order 2: fix j, sum i from j to n
order2 = 0
print(f"\n Order 2: Σⱼ₌₁⁵ Σᵢ₌ⱼ⁵ f(i,j)")
for j in range(1, n+1):
inner = sum(f_func(i, j) for i in range(j, n+1))
order2 += inner
terms = [f"f({i},{j})" for i in range(j, n+1)]
print(f" j={j}: Σᵢ₌{j}⁵ = {' + '.join(terms)} = {inner}")
print(f" Total = {order2}")
print(f"\n Match: {order1 == order2} ✓ (same pairs, different order)")
# ══════════════════════════════════════════════════════════════════
# 10.3 CAUCHY PRODUCT = CONVOLUTION
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("CAUCHY PRODUCT: cₖ = Σᵢ₌₀ᵏ aᵢb_{k-i} (CONVOLUTION)")
print("=" * 65)
a = np.array([1.0, 2.0, 3.0]) # polynomial: 1 + 2x + 3x²
b = np.array([4.0, 5.0]) # polynomial: 4 + 5x
# Product polynomial coefficients via Cauchy product
n_a, n_b = len(a), len(b)
c = np.zeros(n_a + n_b - 1)
for k in range(len(c)):
for i in range(k + 1):
if i < n_a and (k - i) < n_b:
c[k] += a[i] * b[k - i]
print(f"\n a(x) = {a[0]:.0f} + {a[1]:.0f}x + {a[2]:.0f}x²")
print(f" b(x) = {b[0]:.0f} + {b[1]:.0f}x")
print(f"\n Cauchy product cₖ = Σᵢ aᵢb_{k-i}:")
for k in range(len(c)):
terms = []
for i in range(k + 1):
if i < n_a and (k - i) < n_b:
terms.append(f"a[{i}]·b[{k-i}] = {a[i]*b[k-i]:.0f}")
print(f" c[{k}] = {' + '.join(terms)} = {c[k]:.0f}")
print(f"\n c(x) = {' + '.join(f'{c[k]:.0f}x^{k}' if k > 0 else f'{c[k]:.0f}' for k in range(len(c)))}")
print(f" np.convolve(a, b) = {np.convolve(a, b)}")
print(f" Match: {np.allclose(c, np.convolve(a, b))} ✓")
print(f"\n AI: Convolution in CNNs uses this same sum structure!")
11. Convergence of Infinite Series
11.1 Partial Sums and Convergence
$S_n = \sum_{k=1}^{n} a_k$ — the $n$-th partial sum.
The series $\sum_{n} a_n$ converges to $S$ if $\lim_{n \to \infty} S_n = S$.
Necessary condition: if $\sum_n a_n$ converges, then $a_n \to 0$. But NOT sufficient — harmonic series: $\frac{1}{n} \to 0$ yet $\sum_n \frac{1}{n} = \infty$.
11.2 Absolute vs Conditional Convergence
- Absolute: $\sum_n |a_n|$ converges → $\sum_n a_n$ converges (safe to rearrange)
- Conditional: $\sum_n a_n$ converges but not absolutely → Riemann rearrangement theorem: can be rearranged to sum to ANY value (see the sketch below)!
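A small sketch of the rearrangement phenomenon: taking two positive terms of the alternating harmonic series for every negative term is a classical rearrangement that converges to $\tfrac{3}{2}\ln 2$ instead of $\ln 2$. The truncation counts below are arbitrary choices for the demo.

```python
# Minimal sketch (not from the original notebook): Riemann rearrangement in action.
import numpy as np

def rearranged_sum(blocks):
    """Take 2 positive terms (1, 1/3, 1/5, ...) then 1 negative term (-1/2, -1/4, ...)."""
    total, pos, neg = 0.0, 1, 2
    for _ in range(blocks):
        total += 1.0 / pos + 1.0 / (pos + 2)   # two odd-denominator terms
        pos += 4
        total -= 1.0 / neg                     # one even-denominator term
        neg += 2
    return total

original = sum((-1)**(n + 1) / n for n in range(1, 200001))
print(f"original order             ≈ {original:.6f}   (ln 2     = {np.log(2):.6f})")
print(f"rearranged (2 pos : 1 neg) ≈ {rearranged_sum(100000):.6f}   (3/2·ln 2 = {1.5 * np.log(2):.6f})")
```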
11.3 Key Convergence Tests
| Test | Condition | Conclusion |
|---|---|---|
| Ratio | $L = \lim_{n\to\infty} \lvert a_{n+1}/a_n \rvert$ | $L < 1$ converges; $L > 1$ diverges; $L = 1$ inconclusive |
| Root | $L = \lim_{n\to\infty} \lvert a_n \rvert^{1/n}$ | $L < 1$ converges; $L > 1$ diverges; $L = 1$ inconclusive |
| Alternating | $a_n > 0$, decreasing, $a_n \to 0$ | $\sum_n (-1)^n a_n$ converges |
| $p$-series | $\sum_n \frac{1}{n^p}$ | Converges iff $p > 1$ |
11.4 Power Series
$$f(x) = \sum_{n=0}^{\infty} c_n x^n$$
Radius of convergence $R$: the series converges for $|x| < R$, diverges for $|x| > R$.
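A minimal sketch estimating $R$ from the ratio test $R = \lim_{n} \lvert c_n / c_{n+1} \rvert$. The three coefficient sequences and the cutoff $n = 40$ are illustrative choices, not from the notes.

```python
# Minimal sketch (not from the original notebook): estimate the radius of convergence
# via |cₙ/cₙ₊₁| for large n. For cₙ = 1/n! the ratio keeps growing, i.e. R = ∞.
import math

coeffs = {
    "cₙ = 1    (Σ xⁿ,    R = 1)  ": lambda n: 1.0,
    "cₙ = 1/n! (Σ xⁿ/n!, R = ∞)  ": lambda n: 1.0 / math.factorial(n),
    "cₙ = 2ⁿ   (Σ 2ⁿxⁿ,  R = 1/2)": lambda n: 2.0**n,
}
N = 40
for label, c in coeffs.items():
    ratio = abs(c(N) / c(N + 1))   # |cₙ/cₙ₊₁| at n = N, an estimate of R
    print(f"{label}   |c₄₀/c₄₁| ≈ {ratio:.4f}")
```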
11.5 AI: Learning Rate Convergence (Robbins-Monro)
$\sum_{t=1}^{\infty} \eta_t = \infty$ (take enough steps) AND $\sum_{t=1}^{\infty} \eta_t^2 < \infty$ (don't overshoot).
Example: $\eta_t = 1/t$ satisfies both: $\sum_t \frac{1}{t} = \infty$, $\sum_t \frac{1}{t^2} = \frac{\pi^2}{6} < \infty$.
Code cell 29
# ══════════════════════════════════════════════════════════════════
# 11.1–11.2 CONVERGENCE: PARTIAL SUMS & ABSOLUTE VS CONDITIONAL
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("CONVERGENCE OF INFINITE SERIES")
print("=" * 65)
# Geometric series: converges for |r| < 1
print(f"\n Geometric series Σ rⁿ — partial sums Sₙ:")
print(f" {'n':<8} {'r=0.5':<14} {'r=0.9':<14} {'r=1.1':<14} {'1/n (harmonic)':<14}")
print(f" {'─'*8} {'─'*14} {'─'*14} {'─'*14} {'─'*14}")
for n in [5, 10, 50, 100, 500, 1000]:
s_05 = sum(0.5**i for i in range(n))
s_09 = sum(0.9**i for i in range(n))
s_11 = sum(1.1**i for i in range(n))
h_n = sum(1.0/i for i in range(1, n+1))
print(f" {n:<8} {s_05:<14.6f} {s_09:<14.6f} {s_11:<14.2f} {h_n:<14.6f}")
print(f"\n r=0.5 → 1/(1-0.5) = {1/0.5:.1f}")
print(f" r=0.9 → 1/(1-0.9) = {1/0.1:.1f}")
print(f" r=1.1 → DIVERGES (grows without bound)")
print(f" 1/n → DIVERGES (harmonic, but very slowly)")
# Alternating harmonic: converges conditionally
print(f"\n{'─' * 65}")
print("CONDITIONAL CONVERGENCE: Σ (-1)ⁿ⁺¹/n = ln(2)")
print("─" * 65)
print(f"\n {'n':<8} {'Σ (-1)ⁿ⁺¹/n':<18} {'ln(2)':<18} {'Error'}")
print(f" {'─'*8} {'─'*18} {'─'*18} {'─'*12}")
for n in [10, 100, 1000, 10000]:
alt_sum = sum((-1)**(i+1) / i for i in range(1, n+1))
err = abs(alt_sum - np.log(2))
print(f" {n:<8} {alt_sum:<18.10f} {np.log(2):<18.10f} {err:<12.2e}")
print(f"\n Converges conditionally: Σ |(-1)ⁿ⁺¹/n| = Σ 1/n = ∞ (diverges!)")
print(f" Riemann theorem: rearranging terms can make it sum to ANY value!")
# ══════════════════════════════════════════════════════════════════
# 11.3 CONVERGENCE TESTS
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("CONVERGENCE TESTS")
print("=" * 65)
# Ratio test: aₙ = xⁿ/n!
print(f"\n Ratio test for eˣ series: aₙ = xⁿ/n!")
x_val = 2.0
print(f" x = {x_val}")
print(f" {'n':<6} {'aₙ = xⁿ/n!':<18} {'|aₙ₊₁/aₙ|':<18} {'Verdict'}")
print(f" {'─'*6} {'─'*18} {'─'*18} {'─'*20}")
for n in range(1, 10):
a_n = x_val**n / math.factorial(n)
a_n1 = x_val**(n+1) / math.factorial(n+1)
ratio = abs(a_n1 / a_n)
verdict = "< 1 → converges" if ratio < 1 else "> 1 → diverges" if ratio > 1 else "= 1"
print(f" {n:<6} {a_n:<18.8f} {ratio:<18.8f} {verdict}")
print(f" → |aₙ₊₁/aₙ| = x/(n+1) → 0 < 1 for any x → eˣ converges ∀ x ✓")
# ══════════════════════════════════════════════════════════════════
# 11.5 ROBBINS-MONRO CONDITIONS FOR LEARNING RATE
# ══════════════════════════════════════════════════════════════════
print(f"\n{'=' * 65}")
print("ROBBINS-MONRO: Σ ηₜ = ∞ AND Σ ηₜ² < ∞")
print("=" * 65)
schedules = {
"ηₜ = 1/t": lambda t: 1.0/t,
"ηₜ = 1/√t": lambda t: 1.0/np.sqrt(t),
"ηₜ = 1/t²": lambda t: 1.0/t**2,
"ηₜ = 0.01 (const)": lambda t: 0.01,
}
T = 10000
print(f"\n T = {T}")
print(f"\n {'Schedule':<20} {'Σ ηₜ':<14} {'Σ ηₜ²':<14} {'Σηₜ=∞?':<10} {'Σηₜ²<∞?':<10} {'Valid?'}")
print(f" {'─'*20} {'─'*14} {'─'*14} {'─'*10} {'─'*10} {'─'*6}")
for name, lr_fn in schedules.items():
sum_lr = sum(lr_fn(t) for t in range(1, T+1))
sum_lr2 = sum(lr_fn(t)**2 for t in range(1, T+1))
grows = sum_lr > 50 # heuristic: still growing
bounded = sum_lr2 < 100
valid = grows and bounded
print(f" {name:<20} {sum_lr:<14.4f} {sum_lr2:<14.6f} {'✓' if grows else '✗':<10} {'✓' if bounded else '✗':<10} {'✓' if valid else '✗'}")
print(f"\n 1/t satisfies both: Σ 1/t = ∞ (harmonic) and Σ 1/t² = π²/6 ≈ {np.pi**2/6:.4f} ✓")
print(f" → Adam adapts lr per-parameter, effectively satisfying these conditions")
12. Summation in Probability and Statistics
12.1 Expectation as Weighted Summation
The expected value is the fundamental bridge between summation and probability:
$$\mathbb{E}[X] = \sum_{x} x \, P(X = x)$$
For a function $g$: $\mathbb{E}[g(X)] = \sum_{x} g(x)\, P(X = x)$
12.2 Variance via Summation
$$\mathrm{Var}(X) = \sum_{x} (x - \mu)^2\, P(X = x) = \mathbb{E}[X^2] - \bigl(\mathbb{E}[X]\bigr)^2$$
12.3 Entropy and Information
Shannon entropy measures uncertainty using a sum of log-probabilities:
$$H(X) = -\sum_{x} P(x) \log P(x)$$
12.4 KL Divergence
Measures how distribution $P$ differs from $Q$:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
Properties:
- $D_{\mathrm{KL}}(P\|Q) \ge 0$ (Gibbs' inequality, from Jensen's inequality)
- $D_{\mathrm{KL}}(P\|Q) \ne D_{\mathrm{KL}}(Q\|P)$ — not symmetric
- Cross-entropy: $H(P,Q) = H(P) + D_{\mathrm{KL}}(P\|Q)$
12.5 Jensen's Inequality
For convex $\varphi$: $\varphi\!\bigl(\mathbb{E}[X]\bigr) \le \mathbb{E}\!\bigl[\varphi(X)\bigr]$
This proves $D_{\mathrm{KL}}(P\|Q) \ge 0$ and explains why:
- Average of logs ≤ log of average: $\mathbb{E}[\log X] \le \log \mathbb{E}[X]$
- $\bigl(\mathbb{E}[X]\bigr)^2 \le \mathbb{E}[X^2]$ (since $x^2$ is convex)
12.6 Law of Total Expectation
$$\mathbb{E}[X] = \sum_{k} P(Z = k)\, \mathbb{E}[X \mid Z = k]$$
Used in mixture models: each component contributes $\mathbb{E}[X \mid Z = k]$ weighted by mixing probability $\pi_k = P(Z = k)$.
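Code cell 31 below does not cover 12.6, so here is a minimal sketch for a two-component Gaussian mixture. The mixing weights, component means, and sample size are arbitrary choices for the demo.

```python
# Minimal sketch (not from the original notebook): law of total expectation,
# E[X] = Σₖ πₖ·E[X|Z=k], checked against a direct Monte-Carlo estimate.
import numpy as np

rng = np.random.default_rng(42)
pi = np.array([0.3, 0.7])       # mixing probabilities πₖ = P(Z=k)
means = np.array([1.0, 5.0])    # component means E[X | Z=k]

total_expectation = np.sum(pi * means)          # analytic: Σₖ πₖ E[X|Z=k]

z = rng.choice(2, size=200_000, p=pi)           # sample Z ~ Categorical(π)
x = rng.normal(means[z], 1.0)                   # then X ~ Normal(mean_Z, 1)
print(f"Σₖ πₖ E[X|Z=k]   = {total_expectation:.4f}")
print(f"Monte-Carlo E[X] ≈ {x.mean():.4f}")
```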
12.7 Inclusion-Exclusion Principle
$$P\!\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots$$
Compact: $P\!\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{\emptyset \ne S \subseteq \{1,\ldots,n\}} (-1)^{|S|+1}\, P\!\left(\bigcap_{i \in S} A_i\right)$
Code cell 31
# ══════════════════════════════════════════════════════════════════
# 12 SUMMATION IN PROBABILITY & STATISTICS
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("SUMMATION IN PROBABILITY & STATISTICS")
print("=" * 65)
# --- Expectation and Variance ---
print("\n12.1–12.2 Expectation & Variance")
print("─" * 50)
# Fair die
x = np.arange(1, 7)
p = np.ones(6) / 6
E_X = np.sum(x * p)
E_X2 = np.sum(x**2 * p)
Var_X = np.sum((x - E_X)**2 * p)
Var_shortcut = E_X2 - E_X**2
print(f"\n Fair die: X ∈ {list(x)}")
print(f" E[X] = Σ xᵢpᵢ = {E_X:.4f} (analytic: 3.5)")
print(f" E[X²] = Σ xᵢ²pᵢ = {E_X2:.4f}")
print(f" Var(X) = Σ (xᵢ-μ)²pᵢ = {Var_X:.4f}")
print(f" Var(X) = E[X²]-E[X]² = {Var_shortcut:.4f} ✓")
assert np.isclose(Var_X, Var_shortcut)
# --- Entropy ---
print(f"\n{'─' * 50}")
print("12.3 Entropy H(X) = -Σ pᵢ log pᵢ")
print("─" * 50)
def entropy(probs):
"""Shannon entropy in nats, handling p=0."""
p = np.array(probs, dtype=np.float64)
mask = p > 0
return -np.sum(p[mask] * np.log(p[mask]))
distributions = {
"Uniform(6)": [1/6]*6,
"Peaked": [0.9, 0.02, 0.02, 0.02, 0.02, 0.02],
"Deterministic": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
"Binary fair": [0.5, 0.5],
}
print(f"\n {'Distribution':<18} {'H (nats)':<12} {'H (bits)':<12} {'Max H?'}")
print(f" {'─'*18} {'─'*12} {'─'*12} {'─'*8}")
for name, p_dist in distributions.items():
h = entropy(p_dist)
h_bits = h / np.log(2)
max_h = np.log(len(p_dist))
print(f" {name:<18} {h:<12.4f} {h_bits:<12.4f} {'= max' if np.isclose(h, max_h) else ''}")
# --- KL Divergence ---
print(f"\n{'─' * 50}")
print("12.4 KL Divergence — asymmetry demonstration")
print("─" * 50)
def kl_divergence(p, q):
"""KL(P || Q) = Σ pᵢ log(pᵢ/qᵢ)"""
p, q = np.array(p), np.array(q)
mask = p > 0
return np.sum(p[mask] * np.log(p[mask] / q[mask]))
P = [0.4, 0.3, 0.2, 0.1]
Q = [0.25, 0.25, 0.25, 0.25]
kl_pq = kl_divergence(P, Q)
kl_qp = kl_divergence(Q, P)
cross_h = -np.sum(np.array(P) * np.log(np.array(Q)))
print(f"\n P = {P}")
print(f" Q = {Q}")
print(f" KL(P || Q) = {kl_pq:.6f}")
print(f" KL(Q || P) = {kl_qp:.6f}")
print(f" KL(P||Q) ≠ KL(Q||P): asymmetric! ✓")
print(f"\n Cross-entropy H(P,Q) = -Σ pᵢ log qᵢ = {cross_h:.6f}")
print(f" H(P) + KL(P||Q) = {entropy(P) + kl_pq:.6f}")
print(f" H(P,Q) = H(P) + KL(P||Q): {np.isclose(cross_h, entropy(P) + kl_pq)} ✓")
# --- Jensen's Inequality ---
print(f"\n{'─' * 50}")
print("12.5 Jensen's Inequality: φ(E[X]) ≤ E[φ(X)] for convex φ")
print("─" * 50)
x_vals = np.array([1.0, 4.0, 9.0])
p_vals = np.array([0.2, 0.5, 0.3])
# x² is convex
E_x = np.sum(x_vals * p_vals)
E_x2 = np.sum(x_vals**2 * p_vals)
Ex_squared = E_x**2
print(f"\n φ(x) = x² (convex)")
print(f" E[X] = {E_x:.2f}")
print(f" (E[X])² = {Ex_squared:.4f}")
print(f" E[X²] = {E_x2:.4f}")
print(f" (E[X])² ≤ E[X²]: {Ex_squared:.4f} ≤ {E_x2:.4f} → {Ex_squared <= E_x2} ✓")
# -log is convex: proves KL ≥ 0
E_log = np.sum(p_vals * np.log(x_vals))
log_E = np.log(E_x)
print(f"\n φ(x) = -log(x) (convex) → E[log X] ≤ log E[X]")
print(f" E[log X] = {E_log:.4f}")
print(f" log E[X] = {log_E:.4f}")
print(f" E[log X] ≤ log E[X]: {E_log <= log_E} ✓")
# --- Inclusion-Exclusion ---
print(f"\n{'─' * 50}")
print("12.7 Inclusion-Exclusion Principle")
print("─" * 50)
# Three events with known overlaps
pA, pB, pC = 0.5, 0.4, 0.3
pAB, pAC, pBC = 0.2, 0.15, 0.1
pABC = 0.05
p_union = pA + pB + pC - pAB - pAC - pBC + pABC
print(f"\n P(A)=0.5, P(B)=0.4, P(C)=0.3")
print(f" P(A∩B)=0.2, P(A∩C)=0.15, P(B∩C)=0.1, P(A∩B∩C)=0.05")
print(f" P(A∪B∪C) = {pA}+{pB}+{pC} - {pAB}-{pAC}-{pBC} + {pABC}")
print(f" = {p_union:.2f}")
print(f" (Alternating signs: +singles, -pairs, +triples)")
13. Notation Variants and Conventions in ML
13.1 Subscript vs Superscript Indexing
Different ML subfields use different conventions:
| Convention | Example | Where used |
|---|---|---|
| Subscript for samples | $\sum_{i=1}^{N} \ell(x_i, y_i)$ | Most ML papers |
| Superscript for samples | $\sum_{i=1}^{m} \ell\!\left(x^{(i)}, y^{(i)}\right)$ | Andrew Ng's courses, CS229 |
| Subscript for features | $\sum_{j=1}^{d} w_j x_j$ | Feature-level operations |
| Superscript for layers | $a^{(l)} = \sigma\!\left(W^{(l)} a^{(l-1)} + b^{(l)}\right)$ | Neural network notation |
| Superscript for iteration | $\theta^{(t+1)} = \theta^{(t)} - \eta \nabla L$ | Optimization algorithms |
13.2 Compact Notation Styles
| Notation | Meaning | Context |
|---|---|---|
| $\sum_i a_i$ | Sum over all valid $i$ | When range is clear from context |
| $\sum_{x \in S} f(x)$ | Sum over elements of set $S$ | Probability, combinatorics |
| $\sum_{\mathbf{x}} p(\mathbf{x})$ | Sum over all configurations | Partition functions, graphical models |
| $\sum_{(x, y) \in \mathcal{D}} \ell(x, y)$ | Sum over dataset $\mathcal{D}$ | Empirical risk minimization |
13.3 Neural Network Formulas in Σ Notation
Forward pass (fully connected layer):
$$z_j = \sum_k w_{jk} a_k + b_j$$
Matrix form:
$$\mathbf{z} = W \mathbf{a} + \mathbf{b}$$
Backpropagation (error signal):
$$\delta_j = \sigma'(z_j) \sum_k w_{kj} \delta_k$$
Gradient of loss w.r.t. weights:
$$\frac{\partial L}{\partial w_{jk}} = \frac{1}{N} \sum_{i=1}^{N} \delta_j^{(i)} a_k^{(i)}$$
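Code cell 33 below verifies the forward pass and the weight gradient; here is a minimal added sketch of the error-signal recursion, with a sigmoid activation assumed purely for illustration, comparing the explicit Σ against the matrix form σ'(z) ⊙ (Wᵀδ):

import numpy as np

np.random.seed(0)
n_hidden, n_out = 4, 3
W_next = np.random.randn(n_out, n_hidden)        # weights from this layer to the next
delta_next = np.random.randn(n_out)              # error signal δ at the next layer
z_hidden = np.random.randn(n_hidden)             # pre-activations z at this layer
sig = 1.0 / (1.0 + np.exp(-z_hidden))
sigma_prime = sig * (1.0 - sig)                  # σ'(z) for the sigmoid

# Explicit Σ version: δ_j = σ'(z_j) Σ_k w_kj δ_k
delta_explicit = np.array([
    sigma_prime[j] * sum(W_next[k, j] * delta_next[k] for k in range(n_out))
    for j in range(n_hidden)
])

# Matrix version: δ = σ'(z) ⊙ (Wᵀ δ_next)
delta_matrix = sigma_prime * (W_next.T @ delta_next)

print(np.allclose(delta_explicit, delta_matrix))  # True: same sum, two notations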
13.4 Implicit Summation in Matrix Notation
Many ML formulas hide summations inside matrix operations:
| Matrix form | Explicit summation |
|---|---|
| $\mathbf{a}^\top \mathbf{b}$ | $\sum_i a_i b_i$ |
| $W\mathbf{a}$ | $(W\mathbf{a})_j = \sum_k w_{jk} a_k$ |
| $\mathrm{tr}(AB)$ | $\sum_i \sum_j A_{ij} B_{ji}$ |
| $\lVert A \rVert_F^2$ | $\sum_i \sum_j A_{ij}^2$ |
Code cell 33
# ══════════════════════════════════════════════════════════════════
# 13 NOTATION VARIANTS & NEURAL NETWORK FORMULAS
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("NOTATION VARIANTS & NEURAL NETWORK FORMULAS")
print("=" * 65)
# --- 13.3: Neural network forward pass in Σ notation ---
print("\n13.3 Forward pass: z_j = Σ_k w_jk * a_k + b_j")
print("─" * 50)
np.random.seed(42)
n_in, n_out = 4, 3
# Layer weights and biases
W = np.random.randn(n_out, n_in) * 0.5
b = np.random.randn(n_out) * 0.1
a_prev = np.array([1.0, 0.5, -0.3, 0.8])
# Explicit summation (the Σ version)
z_explicit = np.zeros(n_out)
for j in range(n_out):
s = 0.0
for k in range(n_in):
s += W[j, k] * a_prev[k]
z_explicit[j] = s + b[j]
# Matrix form (implicit summation)
z_matrix = W @ a_prev + b
print(f"\n Input a = {a_prev}")
print(f" W shape: ({n_out}, {n_in})")
print(f"\n {'Neuron j':<12} {'Σ_k w_jk*a_k + b_j':<22} {'W@a + b':<22} {'Match?'}")
print(f" {'─'*12} {'─'*22} {'─'*22} {'─'*6}")
for j in range(n_out):
print(f" j={j:<9} {z_explicit[j]:<22.8f} {z_matrix[j]:<22.8f} {'✓' if np.isclose(z_explicit[j], z_matrix[j]) else '✗'}")
# --- 13.4: Hidden summations in matrix ops ---
print(f"\n{'─' * 50}")
print("13.4 Hidden summations inside matrix operations")
print("─" * 50)
A = np.array([[1, 2], [3, 4]], dtype=float)
B = np.array([[5, 6], [7, 8]], dtype=float)
x_vec = np.array([1.0, 2.0])
# trace(AB) = Σ_i Σ_j A_ij B_ji
trace_builtin = np.trace(A @ B)
trace_explicit = sum(A[i, j] * B[j, i] for i in range(2) for j in range(2))
# Frobenius norm: ||A||_F² = Σ_i Σ_j A_ij²
frob_builtin = np.linalg.norm(A, 'fro')**2
frob_explicit = sum(A[i, j]**2 for i in range(2) for j in range(2))
# dot product: a⊤b = Σ_i a_i b_i
dot_builtin = x_vec @ x_vec
dot_explicit = sum(x_vec[i]**2 for i in range(len(x_vec)))
print(f"\n {'Operation':<25} {'Matrix form':<14} {'Σ notation':<14} {'Match?'}")
print(f" {'─'*25} {'─'*14} {'─'*14} {'─'*6}")
print(f" {'tr(AB)':<25} {trace_builtin:<14.4f} {trace_explicit:<14.4f} {'✓' if np.isclose(trace_builtin, trace_explicit) else '✗'}")
print(f" {'||A||_F²':<25} {frob_builtin:<14.4f} {frob_explicit:<14.4f} {'✓' if np.isclose(frob_builtin, frob_explicit) else '✗'}")
print(f" {'x⊤x':<25} {dot_builtin:<14.4f} {dot_explicit:<14.4f} {'✓' if np.isclose(dot_builtin, dot_explicit) else '✗'}")
# --- Backprop gradient formula ---
print(f"\n{'─' * 50}")
print("∂L/∂w_jk = (1/N) Σ_i δ_j^(i) a_k^(i)")
print("─" * 50)
N = 5 # batch size
deltas = np.random.randn(N, n_out) # error signals δ
a_batch = np.random.randn(N, n_in) # activations
# Explicit double sum
dW_explicit = np.zeros((n_out, n_in))
for j in range(n_out):
for k in range(n_in):
for i in range(N):
dW_explicit[j, k] += deltas[i, j] * a_batch[i, k]
dW_explicit[j, k] /= N
# Matrix form: (1/N) δᵀ a
dW_matrix = (1/N) * deltas.T @ a_batch
print(f"\n Batch size N={N}, output={n_out}, input={n_in}")
print(f" Max |dW_explicit - dW_matrix| = {np.max(np.abs(dW_explicit - dW_matrix)):.2e}")
print(f" Triple sum ≡ matrix multiply: {np.allclose(dW_explicit, dW_matrix)} ✓")
print(f"\n Key insight: The Σ notation makes index relationships explicit,")
print(f" but matrix ops are what GPUs actually compute efficiently.")
14. Common Mistakes and Pitfalls
14.1 Index Errors
| Mistake | Why it's wrong | Correct |
|---|---|---|
| $\sum_i x_i y_i = \left(\sum_i x_i\right)\left(\sum_i y_i\right)$ | Products don't distribute over Σ | Keep the product inside a single sum: $\sum_i x_i y_i$ |
| $\sum_{i=1}^{n} f(i) = \sum_{j=0}^{n-1} f(j)$ | Changed index but not the term | Must change both: $\sum_{i=1}^{n} f(i) = \sum_{j=0}^{n-1} f(j+1)$ |
| $\sum_{i=1}^{n} a_i = \sum_{i=0}^{n} a_i$ | Wrong re-indexing (off-by-one adds or drops a term) | Shift bounds and terms together: $\sum_{i=1}^{n} a_i = \sum_{i=0}^{n-1} a_{i+1}$ |
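A quick numerical check of the shift rule in the middle row, using an arbitrary illustrative term f(i) = i² + 1:

def f(i):
    return i**2 + 1                                     # any term works; chosen arbitrarily

n = 10
original      = sum(f(i) for i in range(1, n + 1))      # Σ_{i=1..n} f(i)
correct_shift = sum(f(j + 1) for j in range(0, n))      # Σ_{j=0..n-1} f(j+1): bounds AND term shifted
wrong_shift   = sum(f(j) for j in range(0, n))          # bounds shifted, term unchanged: the mistake
print(original, correct_shift, wrong_shift)             # 395 395 295 — first two match, third does not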
14.2 Scope and Binding Errors
The index variable is bound — it has no meaning outside the sum:
❌ Writing $\left(\sum_{i=1}^{n} x_i\right) + i$ — $i$ is not defined outside the sum
This is analogous to loop variables in code: for i in range(n): total += x[i] — you wouldn't use i after the loop and expect it to hold a meaningful value.
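The same scoping appears in Python generator expressions, where the index is local to the expression; a tiny illustrative sketch:

x = [10, 20, 30]
total = sum(x[i] for i in range(len(x)))   # i is bound inside the generator expression
print(total)                                # 60
# print(i)  # NameError: i does not exist out here, just like Σ's bound index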
14.3 Algebraic Mistakes
- Sum of squares ≠ Square of sum: $\sum_i x_i^2 \ne \left(\sum_i x_i\right)^2$
- Sum of products ≠ Product of sums: $\sum_i x_i y_i \ne \left(\sum_i x_i\right)\left(\sum_i y_i\right)$
- Product of sums ≠ Sum of products: $\prod_i (a_i + b_i) \ne \prod_i a_i + \prod_i b_i$
- Log of sum ≠ Sum of logs: $\log \sum_i x_i \ne \sum_i \log x_i$, BUT $\log \prod_i x_i = \sum_i \log x_i$ ✓
14.4 Numerical Pitfalls
| Pitfall | Example | Solution |
|---|---|---|
| Underflow/overflow in products | $\prod_i p_i \to 0$ for small $p_i$ | Work in log-space: $\log \prod_i p_i = \sum_i \log p_i$ |
| Catastrophic cancellation | $E[X^2] - (E[X])^2$ for large $\lvert X \rvert$ | Welford's online algorithm |
| Summing many small floats | $\sum_{i=1}^{10^7} 0.1$ drifts from the exact value | Kahan compensated summation |
| Softmax overflow | $e^{z_i}$ overflows for large $z_i$ | Subtract max: $e^{z_i - \max_j z_j}$ |
14.5 Order of Summation Matters (When It Shouldn't)
In exact arithmetic, $\sum_i a_i$ gives the same result in any order (addition is commutative and associative). In floating-point, order matters due to rounding. Kahan summation compensates for accumulated errors.
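A short added demonstration, with values chosen to force rounding, of how evaluation order alone changes a floating-point sum before any compensation is applied:

a = [1.0e16, 1.0, -1.0e16]
left_to_right = (a[0] + a[1]) + a[2]    # 1.0 is absorbed: 1e16 + 1 rounds back to 1e16, so the result is 0.0
rearranged    = (a[0] + a[2]) + a[1]    # cancel the large terms first, then add 1.0 → 1.0
print(left_to_right, rearranged)        # 0.0 1.0 — same terms, different order, different answer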
Code cell 35
# ══════════════════════════════════════════════════════════════════
# 14 COMMON MISTAKES AND PITFALLS
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("COMMON MISTAKES AND PITFALLS")
print("=" * 65)
# --- Mistake 1: Σ(xy) ≠ (Σx)(Σy) ---
print("\n14.1 Σ xᵢyᵢ ≠ (Σ xᵢ)(Σ yᵢ)")
print("─" * 50)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.0, 6.0, 7.0, 8.0])
sum_xy = np.sum(x * y)
sum_x_times_sum_y = np.sum(x) * np.sum(y)
print(f" x = {list(x)}, y = {list(y)}")
print(f" Σ xᵢyᵢ = {sum_xy:.1f}")
print(f" (Σ xᵢ)(Σ yᵢ) = {sum_x_times_sum_y:.1f}")
print(f" Equal? {np.isclose(sum_xy, sum_x_times_sum_y)} ← THEY ARE NOT EQUAL!")
covariance_bias = (sum_xy / len(x)) - (np.mean(x) * np.mean(y))
print(f" Difference matters: this gap IS the covariance = {covariance_bias:.4f}")
# --- Mistake 2: Σ x² ≠ (Σ x)² ---
print(f"\n{'─' * 50}")
print("14.3 Σ xᵢ² ≠ (Σ xᵢ)²")
print("─" * 50)
sum_sq = np.sum(x**2)
sq_sum = np.sum(x)**2
print(f" x = {list(x)}")
print(f" Σ xᵢ² = {sum_sq:.1f}")
print(f" (Σ xᵢ)² = {sq_sum:.1f}")
print(f" This gap = n·Var(X) = {sq_sum - len(x)*sum_sq/len(x):.1f}")
# --- Numerical: Kahan summation ---
print(f"\n{'─' * 50}")
print("14.4–14.5 Kahan Compensated Summation")
print("─" * 50)
def kahan_sum(values):
"""Kahan compensated summation — reduces floating-point error."""
s = 0.0
c = 0.0 # compensation for lost low-order bits
for v in values:
y = v - c
t = s + y
c = (t - s) - y # algebraically zero, but captures rounding error
s = t
return s
# Sum of 10 million 0.1's (should be 1,000,000.0)
n = 10_000_000
values = [0.1] * n
naive = sum(values)
compensated = kahan_sum(values)
numpy_result = np.sum(np.full(n, 0.1))
exact = 1_000_000.0
print(f"\n Summing {n:,} copies of 0.1 (exact answer: {exact:,.1f})")
print(f" {'Method':<20} {'Result':<22} {'Error'}")
print(f" {'─'*20} {'─'*22} {'─'*14}")
print(f" {'Naive sum()':<20} {naive:<22.10f} {abs(naive - exact):<14.10f}")
print(f" {'Kahan sum':<20} {compensated:<22.10f} {abs(compensated - exact):<14.10f}")
print(f" {'numpy.sum':<20} {numpy_result:<22.10f} {abs(numpy_result - exact):<14.10f}")
# --- Welford's online variance ---
print(f"\n{'─' * 50}")
print("14.4 Welford's Algorithm (numerically stable variance)")
print("─" * 50)
def welford_variance(data):
"""Welford's one-pass numerically stable variance."""
n = 0
mean = 0.0
M2 = 0.0
for x in data:
n += 1
delta = x - mean
mean += delta / n
delta2 = x - mean
M2 += delta * delta2
return M2 / n if n > 0 else 0.0
# Large values with small differences → catastrophic cancellation
data_shifted = np.array([1e8 + 1, 1e8 + 2, 1e8 + 3, 1e8 + 4, 1e8 + 5], dtype=np.float64)
naive_var = np.mean(data_shifted**2) - np.mean(data_shifted)**2
welford_var = welford_variance(data_shifted)
numpy_var = np.var(data_shifted)
print(f"\n Data: [1e8+1, 1e8+2, ..., 1e8+5] (true variance = 2.0)")
print(f" {'Method':<25} {'Variance':<18} {'Error'}")
print(f" {'─'*25} {'─'*18} {'─'*10}")
print(f" {'Naive (E[X²]-E[X]²)':<25} {naive_var:<18.6f} {abs(naive_var - 2.0):<10.6f}")
print(f" {'Welford online':<25} {welford_var:<18.6f} {abs(welford_var - 2.0):<10.6f}")
print(f" {'numpy.var':<25} {numpy_var:<18.6f} {abs(numpy_var - 2.0):<10.6f}")
15. Practice Exercises
Exercise 1: Closed-Form Verification
Prove and verify numerically: $\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}$ for $n \in \{100, 1000, 10000\}$.
Exercise 2: Telescoping Sum
Simplify $\sum_{i=1}^{n} \left(\frac{1}{i} - \frac{1}{i+1}\right)$ and verify.
Exercise 3: Double Sum Interchange
Show that $\sum_i \sum_j a_i b_j = \left(\sum_i a_i\right)\left(\sum_j b_j\right)$.
Exercise 4: Log-Sum-Exp
Implement the numerically stable version of $\log \sum_i e^{x_i}$ and compare with the naive version for extreme inputs.
Exercise 5: Implement Softmax Gradient
Given softmax output $\sigma_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$, show that $\frac{\partial \sigma_i}{\partial z_j} = \sigma_i(\delta_{ij} - \sigma_j)$, where $\delta_{ij}$ is the Kronecker delta.
Exercise 6: Cauchy-Schwarz Application
Verify that $\left(\sum_i a_i b_i\right)^2 \le \left(\sum_i a_i^2\right)\left(\sum_i b_i^2\right)$ for random vectors, and find when equality holds.
Exercise 7: Einstein Summation
Convert these to np.einsum and verify:
- Batch matrix multiply: $C_{bij} = \sum_k A_{bik} B_{bkj}$
- Bilinear form: $s = \sum_i \sum_j x_i M_{ij} y_j$
- Batch outer product: $O_{bij} = a_{bi} b_{bj}$
Exercise 8: Convergence Analysis
For the series $\sum_{n=1}^{\infty} \frac{n^2}{2^n}$, apply the ratio test and compute partial sums to estimate the limit.
Code cell 37
# ══════════════════════════════════════════════════════════════════
# 15 PRACTICE EXERCISES — SOLUTIONS
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("PRACTICE EXERCISES — SOLUTIONS")
print("=" * 65)
# --- Exercise 1: Closed-form for Σ i² ---
print("\nExercise 1: Σ i² = n(n+1)(2n+1)/6")
print("─" * 50)
for n in [100, 1000, 10000]:
brute = sum(i**2 for i in range(1, n+1))
closed = n * (n + 1) * (2*n + 1) // 6
print(f" n={n:<6}: brute={brute:<15}, closed={closed:<15}, match={brute == closed} ✓")
# --- Exercise 2: Telescoping ---
print(f"\nExercise 2: Σ (1/i - 1/(i+1)) = 1 - 1/(n+1)")
print("─" * 50)
for n in [10, 100, 1000]:
tele = sum(1/i - 1/(i+1) for i in range(1, n+1))
closed_val = 1 - 1/(n+1)
print(f" n={n:<6}: sum={tele:.10f}, 1-1/(n+1)={closed_val:.10f}, match={np.isclose(tele, closed_val)} ✓")
# --- Exercise 3: Double sum separation ---
print(f"\nExercise 3: ΣᵢΣⱼ aᵢbⱼ = (Σᵢ aᵢ)(Σⱼ bⱼ)")
print("─" * 50)
a = np.array([2, 3, 5, 7])
b = np.array([1, 4, 6])
double_sum = sum(a[i] * b[j] for i in range(len(a)) for j in range(len(b)))
product_sums = np.sum(a) * np.sum(b)
print(f" a = {list(a)}, b = {list(b)}")
print(f" ΣᵢΣⱼ aᵢbⱼ = {double_sum}, (Σ aᵢ)(Σ bⱼ) = {product_sums}")
print(f" Works because bⱼ doesn't depend on i, so Σⱼ bⱼ factors out ✓")
# --- Exercise 4: Log-Sum-Exp stability ---
print(f"\nExercise 4: Numerically stable log-sum-exp")
print("─" * 50)
x_extreme = np.array([1000.0, 1000.1, 999.8])
x_max = np.max(x_extreme)
# Naive: will overflow
try:
naive_lse = np.log(np.sum(np.exp(x_extreme)))
except:
naive_lse = float('inf')
# Stable version
stable_lse = x_max + np.log(np.sum(np.exp(x_extreme - x_max)))
print(f" x = {list(x_extreme)}")
print(f" Naive log(Σ eˣ): {naive_lse}")
print(f" Stable max + log(Σ e^(x-max)): {stable_lse:.6f}")
print(f" The max-subtraction trick preserves the result without overflow ✓")
# --- Exercise 5: Softmax Jacobian ---
print(f"\nExercise 5: Softmax Jacobian ∂σᵢ/∂zⱼ = σᵢ(δᵢⱼ - σⱼ)")
print("─" * 50)
z = np.array([2.0, 1.0, 0.5])
sigma = np.exp(z - np.max(z))
sigma = sigma / np.sum(sigma)
# Analytic Jacobian
n = len(z)
jacobian = np.zeros((n, n))
for i in range(n):
for j in range(n):
if i == j:
jacobian[i, j] = sigma[i] * (1 - sigma[j])
else:
jacobian[i, j] = -sigma[i] * sigma[j]
# Numerical Jacobian for verification
eps = 1e-7
numerical_jac = np.zeros((n, n))
for j in range(n):
z_plus = z.copy(); z_plus[j] += eps
z_minus = z.copy(); z_minus[j] -= eps
s_plus = np.exp(z_plus - np.max(z_plus)); s_plus /= s_plus.sum()
s_minus = np.exp(z_minus - np.max(z_minus)); s_minus /= s_minus.sum()
numerical_jac[:, j] = (s_plus - s_minus) / (2 * eps)
print(f" z = {list(z)}")
print(f" σ = [{', '.join(f'{s:.4f}' for s in sigma)}]")
print(f" Max |analytic - numerical| = {np.max(np.abs(jacobian - numerical_jac)):.2e} ✓")
# --- Exercise 6: Cauchy-Schwarz ---
print(f"\nExercise 6: Cauchy-Schwarz (Σ aᵢbᵢ)² ≤ (Σ aᵢ²)(Σ bᵢ²)")
print("─" * 50)
np.random.seed(7)
a_rand = np.random.randn(10)
b_rand = np.random.randn(10)
lhs = np.sum(a_rand * b_rand)**2
rhs = np.sum(a_rand**2) * np.sum(b_rand**2)
print(f" Random a, b: LHS = {lhs:.6f} ≤ RHS = {rhs:.6f}: {lhs <= rhs + 1e-10} ✓")
# Equality when a = cb
b_prop = 3.0 * a_rand
lhs_eq = np.sum(a_rand * b_prop)**2
rhs_eq = np.sum(a_rand**2) * np.sum(b_prop**2)
print(f" b = 3a: LHS = {lhs_eq:.6f} = RHS = {rhs_eq:.6f}: {np.isclose(lhs_eq, rhs_eq)} ✓ (equality!)")
# --- Exercise 7: Einstein summation ---
print(f"\nExercise 7: Einstein summation with np.einsum")
print("─" * 50)
B_size, I, K, J = 2, 3, 4, 5
A7 = np.random.randn(B_size, I, K)
B7 = np.random.randn(B_size, K, J)
C_batch = np.einsum('bik,bkj->bij', A7, B7)
C_loop = np.array([A7[b] @ B7[b] for b in range(B_size)])
print(f" Batch matmul: einsum('bik,bkj->bij') match: {np.allclose(C_batch, C_loop)} ✓")
x7 = np.random.randn(4)
M7 = np.random.randn(4, 5)
y7 = np.random.randn(5)
bilinear = np.einsum('i,ij,j->', x7, M7, y7)
bilinear_loop = x7 @ M7 @ y7
print(f" Bilinear form: einsum('i,ij,j->') match: {np.isclose(bilinear, bilinear_loop)} ✓")
a7 = np.random.randn(B_size, 3)
b7 = np.random.randn(B_size, 4)
outer_batch = np.einsum('bi,bj->bij', a7, b7)
outer_loop = np.array([np.outer(a7[b], b7[b]) for b in range(B_size)])
print(f" Batch outer: einsum('bi,bj->bij') match: {np.allclose(outer_batch, outer_loop)} ✓")
# --- Exercise 8: Convergence of Σ n²/2ⁿ ---
print(f"\nExercise 8: Σ n²/2ⁿ convergence (ratio test + partial sums)")
print("─" * 50)
# Ratio test: a_{n+1}/a_n = (n+1)²/(2n²) → 1/2 < 1 → converges
print(" Ratio test: |(n+1)²/2^(n+1)| / |n²/2ⁿ| = (n+1)²/(2n²) → 1/2 < 1 ✓")
# Exact sum: Σ n²xⁿ = x(1+x)/(1-x)³ at x=1/2 → (1/2)(3/2)/(1/2)³ = 6
partial = [sum(k**2 / 2**k for k in range(1, n+1)) for n in [5, 10, 20, 50]]
print(f" Exact value: x(1+x)/(1-x)³ at x=½ = 6.0")
print(f" Partial sums: S₅={partial[0]:.6f}, S₁₀={partial[1]:.6f}, "
f"S₂₀={partial[2]:.6f}, S₅₀={partial[3]:.6f}")
print(f" Converges to 6.0 ✓")
16. Why This Matters for AI
Impact Table
| AI Component | Summation/Product Role | Key Formula |
|---|---|---|
| Forward pass | Linear combination | $z_j = \sum_k w_{jk} a_k + b_j$ |
| Loss function | Empirical risk | $L = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i), y_i)$ |
| Softmax | Normalization | $\sigma_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ |
| Attention | Weighted combination | $\mathrm{out} = \sum_j \alpha_j v_j$ |
| Cross-entropy | Classification loss | $-\sum_c y_c \log \hat{y}_c$ |
| Likelihood | Maximum likelihood estimation | $L(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$ |
| Gradient descent | Mini-batch SGD | $\theta \leftarrow \theta - \frac{\eta}{\lvert B \rvert} \sum_{i \in B} \nabla_\theta \ell_i$ |
| Batch norm | Normalization statistics | $\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i,\; \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$ |
| Entropy regularization | Exploration in RL | $H(\pi) = -\sum_a \pi(a \mid s) \log \pi(a \mid s)$ |
| ELBO | Variational inference | $\mathbb{E}_{q(z)}[\log p(x \mid z)] - D_{\mathrm{KL}}(q(z) \,\|\, p(z))$ |
| Perplexity | Language model eval | $\exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)$ |
Conceptual Bridge
┌─────────────────────────────────────────────────────────┐
│ SUMMATION & PRODUCT NOTATION │
│ ═══════════════════════════════════════ │
│ │
│ Σ (Summation) Π (Product) │
│ │ │ │
│ ├─ Forward pass ├─ Likelihood │
│ ├─ Loss functions ├─ Posterior (Bayes) │
│ ├─ Gradient averages ├─ Partition functions │
│ ├─ Attention weights └─ Factorial / combinatorics │
│ ├─ Batch statistics │
│ └─ Einstein notation ── einsum ── GPU kernels │
│ │
│ "If you can read a Σ, you can read any ML paper." │
└─────────────────────────────────────────────────────────┘
Key Takeaways
- Every ML model is built on summations — linear combinations, averages, expectations, weighted sums
- Products appear in likelihoods, posteriors, and partition functions — always convert to log-sums for computation
- Einstein summation (np.einsum) is the universal language for tensor operations in modern deep learning
- Numerical stability (log-sum-exp, Kahan summation, Welford's algorithm) separates working code from broken code
- Convergence theory (ratio test, p-series, Robbins-Monro) determines when your training will actually converge
References
- Knuth, D.E. The Art of Computer Programming, Vol. 1, §1.2.3 — Sums and Products (1968)
- Graham, Knuth, Patashnik. Concrete Mathematics, Ch. 2 — Sums (1994)
- Goodfellow, Bengio, Courville. Deep Learning, Ch. 2 — Linear Algebra notation (2016)
- Vaswani et al. "Attention Is All You Need" — Σ in scaled dot-product attention (2017)
- Einstein, A. "The Foundation of the General Theory of Relativity" — implicit summation convention (1916)
Code cell 39
# ══════════════════════════════════════════════════════════════════
# 16 FINAL SUMMARY — COMPLETE NOTATION REFERENCE
# ══════════════════════════════════════════════════════════════════
print("=" * 65)
print("COMPLETE NOTATION QUICK REFERENCE")
print("=" * 65)
reference = {
"Σᵢ₌₁ⁿ aᵢ": "Basic summation: a₁ + a₂ + ... + aₙ",
"Πᵢ₌₁ⁿ aᵢ": "Basic product: a₁ · a₂ · ... · aₙ",
"Σᵢ∈S f(i)": "Sum over index set S",
"ΣΣ aᵢⱼ": "Double sum (can interchange if abs convergent)",
"Aᵢⱼ Bⱼₖ": "Einstein: implicit sum over repeated j",
"log Π = Σ log": "Products → sums via logarithm",
"Σ (f(k)-f(k+1))": "Telescoping: collapses to f(1) - f(n+1)",
"max + log Σ exp": "Log-sum-exp trick (numerical stability)",
}
print(f"\n {'Notation':<22} {'Meaning'}")
print(f" {'─'*22} {'─'*43}")
for notation, meaning in reference.items():
print(f" {notation:<22} {meaning}")
# AI formula that uses ALL the concepts
print(f"\n{'=' * 65}")
print("PUTTING IT ALL TOGETHER: Transformer Loss")
print("=" * 65)
print("""
The cross-entropy loss for a Transformer language model:
L = -(1/T) Σₜ₌₁ᵀ log( exp(zₜ[yₜ]) / Σⱼ₌₁ⱽ exp(zₜ[j]) )
This single formula uses:
✓ Summation (outer Σ over tokens, inner Σ over vocab)
✓ Product (likelihood Π → log converts to Σ)
✓ Log-sum-exp (numerical stability of softmax)
✓ Einstein notation (in the QKᵀ attention computation)
✓ Double sums (batch × sequence × head dimensions)
✓ Convergence theory (learning rate schedule)
✓ Jensen's inequality (proves KL ≥ 0 in ELBO)
Every section of this chapter appears in this one equation.
""")
print("=" * 65)
print("END OF CHAPTER: Summation and Product Notation")
print("=" * 65)