Math for LLMs — Theory Notebook

Summation and Product Notation

Mathematical Foundations / Summation and Product Notation

Converted from theory.ipynb for web reading.

Summation & Product Notation — The Syntax of Quantitative AI

"Virtually every formula in machine learning — loss functions, gradients, attention scores, probability distributions, normalisation constants — is a sum or a product. Fluency with Σ and Π is the difference between reading ML papers fluidly and decoding them symbol by symbol."

This notebook is the interactive companion to notes.md. It demonstrates every concept from all 16 sections with runnable Python and NumPy code, connecting each mathematical idea to its concrete use in modern AI systems (transformers, LLMs, diffusion models, RAG pipelines).

| Section | Topic | What You'll Build |
|---|---|---|
| 1 | Intuition | Why Σ/Π exist; where they appear in AI; historical timeline |
| 2 | Summation — Formal Definitions | Recursive definition, index sets, multiple indices |
| 3 | Product Notation — Formal Definitions | Π basics, empty product, factorial, log-likelihood |
| 4 | Algebraic Properties of Summation | Linearity, splitting, reindexing, telescoping, interchange |
| 5 | Algebraic Properties of Products | Splitting, constants, log↔product conversion |
| 6 | Important Summation Formulas | Arithmetic/geometric series, Taylor series, harmonic numbers |
| 7 | Summation Bounds & Estimation | Comparison, integral test, Cauchy-Schwarz, Stirling |
| 8 | Special Summation Patterns in AI | Softmax, cross-entropy, attention, perplexity, BM25 |
| 9 | Einstein Summation Convention | Implicit sums, einsum in PyTorch/NumPy, tensor contraction |
| 10 | Double Sums & Index Interaction | Diagonal sums, order interchange, Cauchy product |
| 11 | Convergence of Infinite Series | Partial sums, absolute convergence, tests, power series |
| 12 | Summation in Probability & Statistics | Expectation, variance, entropy, KL divergence |
| 13–14 | Notation Variants & Common Mistakes | ML conventions, subscript/superscript, pitfalls table |
| 15 | Exercises | 8 progressive exercises with scaffolds and solutions |
| 16 | Why This Matters for AI (2026) | Impact table and conceptual bridge |

Prerequisites: Python, NumPy. PyTorch optional but recommended for einsum demos.

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
import math
import warnings
from typing import List, Tuple, Callable

np.random.seed(42)
np.set_printoptions(precision=8, suppress=True)

# Optional: PyTorch for einsum and tensor demos
try:
    import torch
    torch.manual_seed(42)
    HAS_TORCH = True
    print(f'NumPy {np.__version__} | PyTorch {torch.__version__}')
    if torch.cuda.is_available():
        print(f'GPU: {torch.cuda.get_device_name()}')
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        print('Device: Apple Silicon (MPS)')
    else:
        print('Device: CPU')
except ImportError:
    HAS_TORCH = False
    print(f'NumPy {np.__version__} | PyTorch not installed (NumPy demos still work)')

# Optional: matplotlib for visualisations
try:
    import matplotlib.pyplot as plt
    HAS_PLT = True
except ImportError:
    HAS_PLT = False
    print('matplotlib not installed — plots will be skipped')

1. Intuition — What Summation and Product Notation Are, and Why They Exist

1.1 What Is Summation and Product Notation?

Summation ($\Sigma$) and product ($\Pi$) notation are compact symbolic languages for expressing repeated addition and multiplication over collections of terms.

  • Without them: writing the loss over a million training examples requires a million terms
  • With them: one line: $L = -\frac{1}{N}\sum_{i=1}^{N} \log P(y_i \mid x_i)$

They are not merely abbreviations — they are precise mathematical objects with well-defined rules for manipulation, interchange, and transformation.

1.2 Why This Notation Exists

| Problem | Without Notation | With Notation |
|---|---|---|
| Sum of squares 1 to $n$ | $1^2 + 2^2 + 3^2 + \ldots + n^2$ | $\sum_{i=1}^{n} i^2$ |
| Joint probability | $P(A_1) \cdot P(A_2) \cdots P(A_n)$ | $\prod_{i=1}^{n} P(A_i)$ |
| Attention output | $\alpha_1 V_1 + \alpha_2 V_2 + \ldots$ | $\sum_j \alpha_{ij} V_j$ |
| Gradient of batch loss | $\frac{1}{N}(\nabla l_1 + \nabla l_2 + \ldots)$ | $\frac{1}{N}\sum_i \nabla l_i$ |

Compact notation reveals structure, enables algebraic manipulation, and generalises naturally from finite sums → infinite series → integrals.

1.3 Where These Appear in AI

Every core formula in modern ML uses Σ or Π:

| Formula | Expression |
|---|---|
| Cross-entropy loss | $L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{v \in V} \mathbb{1}[y_i=v]\log P(v \mid x_i)$ |
| Softmax | $P(v \mid x) = \frac{\exp(z_v)}{\sum_{v' \in V}\exp(z_{v'})}$ |
| Attention | $\text{Attn}(Q,K,V)_i = \sum_j \alpha_{ij} V_j$ |
| Dot product | $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ |
| Matrix multiply | $(AB)_{ij} = \sum_k A_{ik} B_{kj}$ |
| Perplexity | $\text{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(t_i \mid \text{ctx})\right)$ |
| Likelihood | $P(\text{seq}) = \prod_{i=1}^{n} P(t_i \mid t_1,\ldots,t_{i-1})$ |

1.4 The Index as a Bound Variable

The index variable ($i$ in $\sum_i$) is a bound variable — it has no meaning outside the sum. $\sum_{i=1}^{n} i^2$ and $\sum_{j=1}^{n} j^2$ are identical expressions (alpha-equivalence).

Critical for AI: $\sum_i \alpha_{ij}$ and $\sum_j \alpha_{ij}$ are completely different sums!

1.5 The Relationship to Integration

DISCRETE                         CONTINUOUS
∑ᵢ₌₁ⁿ f(i)              →       ∫₁ⁿ f(x) dx
Index i ∈ {1,...,n}       →       x ∈ [1, n]
Step size 1               →       Infinitesimal dx
∑ → ∫ as step → 0                (Riemann sum definition)
∏ᵢ f(i)                  →       exp(∫ log f(x) dx)

This bridge is not just analogy — Riemann integration is literally defined as a limit of summation.
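The product row of the correspondence above can be checked numerically as well: a Riemann-style product converges to the exponential of the integral of the log. A minimal sketch, assuming a hypothetical `riemann_product` helper (not part of the notebook's code cells):

```python
import numpy as np

def riemann_product(f, a, b, n):
    """Riemann-style product: Π f(xᵢ)^Δx = exp(Σ Δx·log f(xᵢ)) → exp(∫ log f dx)."""
    dx = (b - a) / n
    xs = a + np.arange(n) * dx          # left endpoints, like the left Riemann sum
    return np.exp(np.sum(dx * np.log(f(xs))))

# For f(x) = x on [1, 2]: ∫₁² ln x dx = 2·ln 2 − 1, so the limit is exp(2·ln 2 − 1) = 4/e
exact = 4 / np.e
for n in [10, 100, 10_000]:
    approx = riemann_product(lambda x: x, 1.0, 2.0, n)
    print(f"n = {n:>6}: {approx:.8f}   (exact 4/e = {exact:.8f})")
```

As with the Riemann sum demo below, the approximation tightens as the step size shrinks.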

Code cell 5

# ══════════════════════════════════════════════════════════════════
# 1.1–1.3  WHY Σ AND Π EXIST — DEMONSTRATION
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("WITHOUT VS WITH SUMMATION NOTATION")
print("=" * 65)

# Without notation: writing out all terms
terms_without = [i**2 for i in range(1, 11)]
print(f"\nWithout notation: {' + '.join(str(t) for t in terms_without)}")
print(f"  = {sum(terms_without)}")

# With notation: one compact expression
n = 10
result_with = sum(i**2 for i in range(1, n+1))
print(f"\nWith notation:    Σᵢ₌₁¹⁰ i² = {result_with}")
print(f"  → Same answer, infinitely more readable")

# ── AI formulas that are really just sums ──
print(f"\n{'=' * 65}")
print("AI FORMULAS AS SUMS AND PRODUCTS")
print("=" * 65)

# Dot product: q·k = Σᵢ qᵢkᵢ
q = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([0.5, 1.5, 0.2, 1.0])
dot_manual = sum(q[i] * k[i] for i in range(len(q)))
dot_numpy  = np.dot(q, k)
print(f"\nDot product q·k = Σᵢ qᵢkᵢ")
print(f"  Manual sum:  {dot_manual}")
print(f"  np.dot:      {dot_numpy}")
print(f"  Match: {np.isclose(dot_manual, dot_numpy)} ✓")

# Matrix multiply: (AB)ᵢⱼ = Σₖ AᵢₖBₖⱼ
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C_manual = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        C_manual[i, j] = sum(A[i, k] * B[k, j] for k in range(2))
C_numpy = A @ B
print(f"\nMatrix multiply (AB)ᵢⱼ = Σₖ AᵢₖBₖⱼ")
print(f"  Manual:\n{C_manual}")
print(f"  A @ B:\n{C_numpy}")
print(f"  Match: {np.allclose(C_manual, C_numpy)} ✓")

# Likelihood as product: P(seq) = Πᵢ P(tᵢ|context)
probs = [0.3, 0.5, 0.8, 0.6, 0.4]
likelihood_product = 1.0
for p in probs:
    likelihood_product *= p
log_likelihood_sum = sum(np.log(p) for p in probs)
print(f"\nLikelihood:  Πᵢ P(tᵢ|ctx) = {likelihood_product:.6f}")
print(f"Log-likelihood: Σᵢ log P    = {log_likelihood_sum:.6f}")
print(f"exp(Σ log P) = Π P:           {np.exp(log_likelihood_sum):.6f}")
print(f"  Match: {np.isclose(likelihood_product, np.exp(log_likelihood_sum))} ✓")
print(f"\n  → log converts Π to Σ — the most important trick in ML!")

Code cell 6

# ══════════════════════════════════════════════════════════════════
# 1.4  THE INDEX AS A BOUND VARIABLE (Alpha-Equivalence)
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("BOUND VARIABLES — INDEX NAME DOES NOT MATTER")
print("=" * 65)

n = 5
sum_i = sum(i**2 for i in range(1, n+1))
sum_j = sum(j**2 for j in range(1, n+1))
sum_k = sum(k**2 for k in range(1, n+1))
print(f"\n  Σᵢ₌₁⁵ i² = {sum_i}")
print(f"  Σⱼ₌₁⁵ j² = {sum_j}")
print(f"  Σₖ₌₁⁵ k² = {sum_k}")
print(f"  All equal: {sum_i == sum_j == sum_k} ✓  (alpha-equivalence)")

# CRITICAL: Σᵢ αᵢⱼ vs Σⱼ αᵢⱼ are DIFFERENT
print(f"\n{'─' * 65}")
print("CRITICAL: WHICH INDEX YOU SUM OVER MATTERS!")
alpha = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.4, 0.4, 0.2]])
print(f"\n  Attention matrix α (3×3):\n{alpha}")
print(f"\n  Σⱼ αᵢⱼ (sum each ROW)    = {alpha.sum(axis=1)}")
print(f"  Σᵢ αᵢⱼ (sum each COLUMN) = {alpha.sum(axis=0)}")
print(f"  → Different sums! Row sums = 1.0 (probability distribution)")

# ══════════════════════════════════════════════════════════════════
# 1.5  HISTORICAL TIMELINE
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("HISTORICAL TIMELINE OF SUMMATION NOTATION")
print("=" * 65)

timeline = [
    ("~250 BCE",  "Archimedes",       "Geometric series; area of parabola (no notation)"),
    ("1593",      "Viète",            "Systematic symbolic algebra → paves way"),
    ("1675",      "Leibniz",          "∫ as elongated S for 'summa' (continuous)"),
    ("~1750",     "Euler",            "Introduces Σ (capital sigma) for discrete sums"),
    ("~1800",     "Gauss",            "Σᵢ₌₁ⁿ i = n(n+1)/2 at age 10"),
    ("1821",      "Cauchy",           "Rigorous convergence theory for infinite series"),
    ("1854",      "Riemann",          "Riemann sum → integral (Σ → ∫ as Δx → 0)"),
    ("1916",      "Einstein",         "Summation convention: repeated index = sum"),
    ("2017",      "Vaswani et al.",   "Transformer: Attn = Σⱼ αᵢⱼVⱼ everywhere"),
]
print(f"\n  {'Year':<12} {'Who':<18} {'What'}")
print(f"  {'─'*12} {'─'*18} {'─'*45}")
for year, who, what in timeline:
    print(f"  {year:<12} {who:<18} {what}")

# ══════════════════════════════════════════════════════════════════
# 1.6  SUM ↔ INTEGRAL CONNECTION (Riemann Sum)
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("SUM → INTEGRAL: RIEMANN SUM DEMONSTRATION")
print("=" * 65)

# Approximate ∫₀¹ x² dx using Riemann sums with increasing n
def riemann_sum(f, a, b, n):
    """Left Riemann sum: Σᵢ₌₀ⁿ⁻¹ f(xᵢ)·Δx where Δx = (b-a)/n"""
    dx = (b - a) / n
    return sum(f(a + i * dx) * dx for i in range(n))

exact = 1/3  # ∫₀¹ x² dx = 1/3
print(f"\n  Target: ∫₀¹ x² dx = 1/3 ≈ {exact:.10f}")
print(f"\n  {'n (terms)':<12} {'Σᵢ f(xᵢ)Δx':<18} {'Error':<15} {'Δx'}")
print(f"  {'─'*12} {'─'*18} {'─'*15} {'─'*10}")
for n_terms in [5, 10, 50, 100, 1000, 10000]:
    approx = riemann_sum(lambda x: x**2, 0, 1, n_terms)
    err = abs(approx - exact)
    dx = 1.0 / n_terms
    print(f"  {n_terms:<12} {approx:<18.10f} {err:<15.2e} {dx:.6f}")
print(f"\n  → As n → ∞ and Δx → 0: Σ → ∫  (definition of Riemann integral)")

2. Summation Notation — Formal Definitions

2.1 Basic Definition

$$\sum_{i=m}^{n} f(i) = f(m) + f(m+1) + f(m+2) + \ldots + f(n)$$

Components:

  • $\Sigma$ — summation operator (capital Greek sigma)
  • $i$ — index variable (dummy/bound variable; any name works)
  • $m$ — lower limit (inclusive)
  • $n$ — upper limit (inclusive)
  • $f(i)$ — summand (expression evaluated at each index value)
  • Number of terms: $n - m + 1$

Empty sum convention: if $n < m$, the sum has no terms and equals 0 (additive identity).

2.2 Formal Definition via Recursion

  • Base case: $\sum_{i=m}^{m} f(i) = f(m)$ (single term)
  • Recursive step: $\sum_{i=m}^{n} f(i) = \left(\sum_{i=m}^{n-1} f(i)\right) + f(n)$
  • Empty base: $\sum_{i=m}^{m-1} f(i) = 0$

2.3 Index Set Notation

More generally, one can sum over an arbitrary index set $S$:

$$\sum_{i \in S} f(i) = \text{sum of } f(i) \text{ for all } i \in S$$

AI examples:

  • $\sum_{v \in V} \exp(z_v)$ — sum over vocabulary tokens
  • $\sum_{(i,j):\, i < j} a_{ij}$ — sum over upper-triangular pairs
  • $\sum_{l=1}^{L}\sum_{h=1}^{H} \text{Attn}_{l,h}$ — sum over all layers and heads

2.4 Multiple Indices

$$\sum_{i=1}^{m}\sum_{j=1}^{n} f(i,j) = \sum_{i=1}^{m}\left(\sum_{j=1}^{n} f(i,j)\right)$$

Process: for each fixed $i$, compute the inner sum over $j$; then sum the results over $i$. Total terms: $m \times n$.

Code cell 8

# ══════════════════════════════════════════════════════════════════
# 2.1–2.2  SUMMATION: BASIC + RECURSIVE DEFINITION
# ══════════════════════════════════════════════════════════════════

def sigma(f, m, n):
    """Σᵢ₌ₘⁿ f(i) — iterative definition."""
    return sum(f(i) for i in range(m, n + 1))

def sigma_recursive(f, m, n):
    """Σᵢ₌ₘⁿ f(i) — recursive definition (2.2)."""
    if n < m:
        return 0            # Empty sum = 0 (additive identity)
    if n == m:
        return f(m)         # Base case: single term
    return sigma_recursive(f, m, n - 1) + f(n)  # Recursive step

print("=" * 65)
print("SUMMATION: ITERATIVE vs RECURSIVE DEFINITION")
print("=" * 65)

# Test with f(i) = i²
f = lambda i: i**2
for n in [1, 3, 5, 10]:
    iter_val = sigma(f, 1, n)
    rec_val  = sigma_recursive(f, 1, n)
    formula  = n * (n + 1) * (2 * n + 1) // 6  # closed form
    print(f"  Σᵢ₌₁^{n:<2d} i²  iterative={iter_val:<6d}  recursive={rec_val:<6d}  formula={formula:<6d}  ✓")

# Empty sum convention
empty = sigma(f, 5, 3)  # n < m → 0
print(f"\n  Empty sum Σᵢ₌₅³ f(i) = {empty}  (additive identity)")

# ══════════════════════════════════════════════════════════════════
# 2.3  INDEX SET NOTATION
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("INDEX SET NOTATION: Σ OVER ARBITRARY SETS")
print("=" * 65)

# Sum over arbitrary index set
def sigma_set(f, index_set):
    """Σ_{i ∈ S} f(i) — sum over arbitrary index set S."""
    return sum(f(i) for i in index_set)

# Example 1: Sum over vocabulary (simulated as indices)
vocab = {0, 1, 2, 3, 4}  # 5-token vocabulary
logits = np.array([2.0, 1.0, 0.5, -1.0, 0.3])
Z = sigma_set(lambda v: np.exp(logits[v]), vocab)
print(f"\n  Softmax denominator Z = Σ_{{v ∈ V}} exp(zᵥ) = {Z:.4f}")
print(f"  np.sum(np.exp(logits))                     = {np.sum(np.exp(logits)):.4f}  ✓")

# Example 2: Sum over pairs (i,j) with i < j
pairs = [(i, j) for i in range(4) for j in range(4) if i < j]
pair_sum = sigma_set(lambda p: p[0] + p[1], pairs)
print(f"\n  Σ_{{(i,j): i<j, i,j∈{{0..3}}}} (i+j) = {pair_sum}")
print(f"  Pairs: {pairs}")

# Example 3: Sum with indicator function
data = [3, 7, 2, 8, 1, 5, 9, 4]
count_gt5 = sum(1 for x in data if x > 5)
print(f"\n  Σᵢ 𝟙[xᵢ > 5] = {count_gt5}  (counting elements > 5 in {data})")

# ══════════════════════════════════════════════════════════════════
# 2.4  MULTIPLE INDICES — DOUBLE SUMS
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("DOUBLE SUMS: Σᵢ Σⱼ f(i,j)")
print("=" * 65)

# Compute Σᵢ₌₁³ Σⱼ₌₁⁴ (i·j) step by step
m, n = 3, 4
print(f"\n  Computing Σᵢ₌₁³ Σⱼ₌₁⁴ (i·j):")
total = 0
for i in range(1, m + 1):
    inner = sum(i * j for j in range(1, n + 1))
    total += inner
    terms = ' + '.join(f'{i}·{j}' for j in range(1, n + 1))
    print(f"    i={i}: Σⱼ₌₁⁴ {i}·j = {terms} = {inner}")
print(f"  Total: {total}")
print(f"  Formula: (Σᵢ i)(Σⱼ j) = {sum(range(1,4))} × {sum(range(1,5))} = {sum(range(1,4)) * sum(range(1,5))}  ✓")

# Matrix as double sum: Σᵢ Σⱼ Aᵢⱼ = sum of all entries
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
double_sum = sum(A[i, j] for i in range(3) for j in range(3))
print(f"\n  Matrix A:\n{A}")
print(f"  Σᵢ Σⱼ Aᵢⱼ = {double_sum} = np.sum(A) = {np.sum(A)}  ✓")

3. Product Notation — Formal Definitions

3.1 Basic Definition

$$\prod_{i=m}^{n} f(i) = f(m) \cdot f(m+1) \cdot f(m+2) \cdots f(n)$$

  • Same structure as $\Sigma$ but with $\Pi$ (capital Greek pi) and multiplication
  • Empty product: if $n < m$, value = 1 (multiplicative identity)
  • Recursive: $\prod_{i=m}^{n} f(i) = \left(\prod_{i=m}^{n-1} f(i)\right) \cdot f(n)$

3.2 Why Empty Product = 1 and Empty Sum = 0

| Operation | Identity Element | Empty Result | Reason |
|---|---|---|---|
| Addition ($\Sigma$) | 0 | $\sum_{\emptyset} = 0$ | Adding nothing = additive identity |
| Multiplication ($\Pi$) | 1 | $\prod_{\emptyset} = 1$ | Multiplying nothing = multiplicative identity |

Consistency check: $0! = \prod_{i=1}^{0} i = 1$ ✓ (matches $\Gamma(1) = 1$)
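That consistency check can be verified directly with the standard library; `math.gamma` evaluates the Gamma function, which extends the factorial via $\Gamma(n+1) = n!$:

```python
import math

# Empty-product convention: 0! = Πᵢ₌₁⁰ i = 1, which matches Γ(1) = 1
print(f"0! = {math.factorial(0)},  Γ(1) = {math.gamma(1)}")

# More generally Γ(n+1) = n! for non-negative integers n
for n in range(6):
    assert math.gamma(n + 1) == math.factorial(n)
print("Γ(n+1) = n! holds for n = 0..5")
```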

3.3 Factorial as Product

$$n! = \prod_{i=1}^{n} i = 1 \cdot 2 \cdot 3 \cdots n$$

Stirling: $n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n$ — grows faster than any exponential.

3.4 Product Notation in Probability — The Log Trick

$$P(t_1, \ldots, t_n) = \prod_{i=1}^{n} P(t_i \mid t_1, \ldots, t_{i-1}) \qquad\quad \log P(t_1, \ldots, t_n) = \sum_{i=1}^{n} \log P(t_i \mid t_1, \ldots, t_{i-1})$$

This $\Pi \xrightarrow{\log} \Sigma$ conversion is one of the most important transformations in all of ML:

  • Products of small probabilities underflow → log converts them to a sum of log-probabilities
  • Numerically stable, easier to differentiate, and computationally cheaper

Code cell 10

# ══════════════════════════════════════════════════════════════════
# 3.1–3.2  PRODUCT NOTATION & EMPTY PRODUCT
# ══════════════════════════════════════════════════════════════════

def pi_product(f, m, n):
    """Πᵢ₌ₘⁿ f(i) — iterative product."""
    result = 1  # multiplicative identity
    for i in range(m, n + 1):
        result *= f(i)
    return result

def pi_recursive(f, m, n):
    """Πᵢ₌ₘⁿ f(i) — recursive definition."""
    if n < m:
        return 1            # Empty product = 1 (multiplicative identity)
    if n == m:
        return f(m)
    return pi_recursive(f, m, n - 1) * f(n)

print("=" * 65)
print("PRODUCT NOTATION: Πᵢ₌ₘⁿ f(i)")
print("=" * 65)

# Basic product demonstration
f = lambda i: i
for n in [1, 3, 5, 8, 10]:
    iter_val = pi_product(f, 1, n)
    rec_val  = pi_recursive(f, 1, n)
    fact_val = math.factorial(n)
    print(f"  Πᵢ₌₁^{n:<2d} i = {n}!  iter={iter_val:<10d}  rec={rec_val:<10d}  math={fact_val:<10d}  ✓")

# Empty product: Πᵢ₌₅³ = 1
empty_prod = pi_product(f, 5, 3)
empty_sum  = sigma(lambda i: i**2, 5, 3)
print(f"\n  Empty product Πᵢ₌₅³ i = {empty_prod}  (multiplicative identity)")
print(f"  Empty sum     Σᵢ₌₅³ i² = {empty_sum}  (additive identity)")

# 0! = 1 (empty product convention)
print(f"\n  0! = Πᵢ₌₁⁰ i = {pi_product(f, 1, 0)} = {math.factorial(0)}  ✓")

# ══════════════════════════════════════════════════════════════════
# 3.3  FACTORIAL & STIRLING'S APPROXIMATION
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("FACTORIAL GROWTH & STIRLING'S APPROXIMATION")
print("=" * 65)

print(f"\n  {'n':<6} {'n!':<20} {'Stirling':<20} {'Relative Error'}")
print(f"  {'─'*6} {'─'*20} {'─'*20} {'─'*15}")
for n in [1, 5, 10, 20, 50, 100]:
    exact = math.factorial(n)
    stirling = np.sqrt(2 * np.pi * n) * (n / np.e) ** n
    rel_err = abs(exact - stirling) / exact
    print(f"  {n:<6} {exact:<20.4g} {stirling:<20.4g} {rel_err:<15.6f}")
print(f"\n  Stirling: n! ≈ √(2πn)(n/e)ⁿ — gets better as n → ∞")

# ══════════════════════════════════════════════════════════════════
# 3.4  THE LOG TRICK: Π → Σ VIA LOGARITHM
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("THE LOG TRICK: log Πᵢ f(i) = Σᵢ log f(i)")
print("=" * 65)

# Simulate LLM token probabilities
token_probs = [0.3, 0.5, 0.8, 0.02, 0.15, 0.6, 0.4, 0.1, 0.05, 0.7]
n_tokens = len(token_probs)

# Direct product — suffers from underflow with many small probs
likelihood = pi_product(lambda i: token_probs[i], 0, n_tokens - 1)

# Log-sum — numerically stable
log_likelihood = sum(np.log(p) for p in token_probs)

# Verify: exp(Σ log pᵢ) = Π pᵢ
reconstructed = np.exp(log_likelihood)

print(f"\n  Token probs: {token_probs}")
print(f"\n  Direct Π P(tᵢ|ctx)  = {likelihood:.2e}   ← tiny! underflow risk")
print(f"  Σ log P(tᵢ|ctx)     = {log_likelihood:.6f}   ← stable log-space")
print(f"  exp(Σ log P)         = {reconstructed:.2e}   ← reconstructed")
print(f"  Match: {np.isclose(likelihood, reconstructed)} ✓")

# Perplexity: PPL = exp(−(1/N) Σ log P)
cross_entropy = -log_likelihood / n_tokens
perplexity = np.exp(cross_entropy)
# Equivalently: geometric mean of 1/P
geo_mean = np.prod([1/p for p in token_probs]) ** (1/n_tokens)

print(f"\n  Cross-entropy H  = −(1/N)Σ log P = {cross_entropy:.4f}")
print(f"  Perplexity PPL   = exp(H)         = {perplexity:.4f}")
print(f"  Geometric mean   = (Π 1/P)^(1/N)  = {geo_mean:.4f}")
print(f"  Match: {np.isclose(perplexity, geo_mean)} ✓")
print(f"\n  → Lower PPL = model is more confident/accurate")

4. Algebraic Properties of Summation

4.1 Linearity of Summation

$$\sum_{i} \bigl(\alpha f(i) + \beta g(i)\bigr) = \alpha \sum_{i} f(i) + \beta \sum_{i} g(i)$$

This is the single most important property. It means:

  • Scalar multiple: $\sum_i c \cdot f(i) = c \cdot \sum_i f(i)$
  • Sum of sums: $\sum_i (f(i) + g(i)) = \sum_i f(i) + \sum_i g(i)$
  • AI: gradient of sum = sum of gradients: $\nabla_\theta \sum_i L(\theta, x_i) = \sum_i \nabla_\theta L(\theta, x_i)$
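The gradient identity in the last bullet can be spot-checked numerically with a toy squared-error loss. The data and the `grad_li` helper here are hypothetical, chosen only to illustrate that linearity holds:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # 8 toy examples, 3 parameters
y = rng.normal(size=8)
theta = rng.normal(size=3)

def grad_li(theta, i):
    """Analytic gradient of the i-th per-example loss lᵢ(θ) = (θ·xᵢ − yᵢ)²."""
    return 2 * (X[i] @ theta - y[i]) * X[i]

def total_loss(theta):
    """Σᵢ lᵢ(θ)."""
    return np.sum((X @ theta - y) ** 2)

# Sum of per-example gradients: Σᵢ ∇lᵢ
sum_of_grads = sum(grad_li(theta, i) for i in range(8))

# Central finite-difference gradient of the summed loss: ∇ Σᵢ lᵢ
eps = 1e-6
fd = np.array([(total_loss(theta + eps * e) - total_loss(theta - eps * e)) / (2 * eps)
               for e in np.eye(3)])

print("Σᵢ ∇lᵢ       =", sum_of_grads)
print("∇ Σᵢ lᵢ (FD) =", fd)
print("Match:", np.allclose(sum_of_grads, fd, atol=1e-4))
```

This is exactly the property that makes mini-batch gradient descent well-defined: the batch gradient is the mean of per-example gradients.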

4.2 Constant Summand

$$\sum_{i=1}^{n} c = n \cdot c \qquad \text{(sum of } n \text{ copies of } c\text{)}$$

4.3 Splitting and Combining Sums

$$\sum_{i=m}^{n} f(i) = \sum_{i=m}^{k} f(i) + \sum_{i=k+1}^{n} f(i) \qquad (m \leq k < n)$$

4.4 Index Shifting (Reindexing)

$$\sum_{i=m}^{n} f(i) = \sum_{j=m+k}^{n+k} f(j-k) \quad \text{where } j = i + k$$

Reverse: $\sum_{i=m}^{n} f(i) = \sum_{i=m}^{n} f(m+n-i)$

4.5 Telescoping Sums

$$\sum_{i=1}^{n} \bigl(f(i+1) - f(i)\bigr) = f(n+1) - f(1)$$

All intermediate terms cancel — only first and last survive.

4.6 Interchange of Summation Order (Discrete Fubini)

$$\sum_{i=1}^{m}\sum_{j=1}^{n} f(i,j) = \sum_{j=1}^{n}\sum_{i=1}^{m} f(i,j)$$

Always valid for finite sums. For infinite sums: requires absolute convergence.
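The absolute-convergence caveat is not hypothetical: for a conditionally convergent series, reordering the terms changes the value (Riemann's rearrangement theorem). A quick numerical illustration with the alternating harmonic series, using an ad hoc truncation of 200,000 terms:

```python
import numpy as np

N = 200_000

# Natural order: 1 − 1/2 + 1/3 − 1/4 + ... → ln 2
natural = sum((-1) ** (k + 1) / k for k in range(1, N + 1))

# Rearranged: two positive (odd) terms, then one negative (even) term → (3/2)·ln 2
rearranged, odd, even = 0.0, 1, 2
for _ in range(N // 3):
    rearranged += 1 / odd + 1 / (odd + 2) - 1 / even
    odd += 4
    even += 2

print(f"natural order: {natural:.6f}   (ln 2       ≈ {np.log(2):.6f})")
print(f"rearranged:    {rearranged:.6f}   ((3/2)·ln 2 ≈ {1.5 * np.log(2):.6f})")
```

Same terms, different order, different limit — which is why interchanging infinite sums requires absolute convergence.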

Code cell 12

# ══════════════════════════════════════════════════════════════════
# 4.1  LINEARITY OF SUMMATION
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("LINEARITY: Σ(αf + βg) = αΣf + βΣg")
print("=" * 65)

n = 10
alpha_c, beta_c = 3.0, -2.0
f_vals = np.array([i**2 for i in range(1, n+1)], dtype=float)
g_vals = np.array([i for i in range(1, n+1)], dtype=float)

# LHS: Σᵢ (α·f(i) + β·g(i))
lhs = np.sum(alpha_c * f_vals + beta_c * g_vals)
# RHS: α·Σf + β·Σg
rhs = alpha_c * np.sum(f_vals) + beta_c * np.sum(g_vals)
print(f"\n  f(i) = i², g(i) = i, α = {alpha_c}, β = {beta_c}, n = {n}")
print(f"  LHS = Σ(αf + βg) = {lhs}")
print(f"  RHS = αΣf + βΣg  = {rhs}")
print(f"  Equal: {np.isclose(lhs, rhs)} ✓")

# AI application: gradient of sum = sum of gradients
print(f"\n  AI: ∇_θ Σᵢ L(θ,xᵢ) = Σᵢ ∇_θ L(θ,xᵢ)")
print("  → Mini-batch gradient = (1/|B|) sum_{i in B} grad l_i")

# ══════════════════════════════════════════════════════════════════
# 4.2–4.3  CONSTANT SUMMAND & SPLITTING
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("CONSTANT SUMMAND & SPLITTING SUMS")
print("=" * 65)

# Constant summand: Σᵢ₌₁ⁿ c = n·c
c, n = 7, 5
const_sum = sigma(lambda i: c, 1, n)
print(f"\n  Σᵢ₌₁⁵ 7 = {const_sum} = 5 × 7 = {5 * 7}  ✓")

# Splitting: Σᵢ₌₁¹⁰ = Σᵢ₌₁⁵ + Σᵢ₌₆¹⁰
full = sigma(lambda i: i**2, 1, 10)
part1 = sigma(lambda i: i**2, 1, 5)
part2 = sigma(lambda i: i**2, 6, 10)
print(f"\n  Σᵢ₌₁¹⁰ i² = {full}")
print(f"  Σᵢ₌₁⁵ i² + Σᵢ₌₆¹⁰ i² = {part1} + {part2} = {part1 + part2}  ✓")

# ══════════════════════════════════════════════════════════════════
# 4.4  INDEX SHIFTING (REINDEXING)
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("INDEX SHIFTING: Σᵢ₌ₘⁿ f(i) = Σⱼ₌ₘ₊ₖⁿ⁺ᵏ f(j−k)")
print("=" * 65)

# Shift by k=2: Σᵢ₌₁⁵ i² = Σⱼ₌₃⁷ (j-2)²
original = sigma(lambda i: i**2, 1, 5)
shifted  = sigma(lambda j: (j-2)**2, 3, 7)
print(f"\n  Σᵢ₌₁⁵ i²     = {original}")
print(f"  Σⱼ₌₃⁷ (j−2)² = {shifted}")
print(f"  Equal: {original == shifted} ✓  (j = i + 2)")

# Reverse index: Σᵢ₌₁⁵ f(i) = Σᵢ₌₁⁵ f(6−i)
reverse = sigma(lambda i: (6-i)**2, 1, 5)
print(f"\n  Σᵢ₌₁⁵ i²     = {original}")
print(f"  Σᵢ₌₁⁵ (6−i)² = {reverse}")
print(f"  Equal: {original == reverse} ✓  (reverse index)")

# ══════════════════════════════════════════════════════════════════
# 4.5  TELESCOPING SUMS
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("TELESCOPING: Σᵢ₌₁ⁿ (f(i+1) − f(i)) = f(n+1) − f(1)")
print("=" * 65)

# Example: Σᵢ₌₁ⁿ (1/i − 1/(i+1)) = 1 − 1/(n+1)
n = 10
telescope = sigma(lambda i: 1/i - 1/(i+1), 1, n)
closed_form = 1 - 1/(n+1)
print(f"\n  Σᵢ₌₁¹⁰ (1/i − 1/(i+1))")
terms = [f"(1/{i} − 1/{i+1})" for i in range(1, 6)]
print(f"    = {' + '.join(terms)} + ...")
print(f"    = {telescope:.10f}")
print(f"    = 1 − 1/11 = {closed_form:.10f}")
print(f"  Match: {np.isclose(telescope, closed_form)} ✓")

# Show the cancellation explicitly
print(f"\n  Expansion showing cancellation:")
for i in range(1, 6):
    cancel = "  ← cancels with next" if i < 5 else ""
    print(f"    + 1/{i} − 1/{i+1}{cancel}")
print(f"    Only 1/1 and −1/{n+1} survive!")

# AI: Residual network parameter update
print(f"\n  AI: θₜ − θ₀ = Σₛ₌₁ᵗ (θₛ − θₛ₋₁) = Σₛ₌₁ᵗ Δθₛ  (telescoping)")

# ══════════════════════════════════════════════════════════════════
# 4.6  INTERCHANGE OF SUMMATION ORDER
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("INTERCHANGE: Σᵢ Σⱼ f(i,j) = Σⱼ Σᵢ f(i,j)")
print("=" * 65)

# Row-first vs column-first on a matrix
M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Row-first: Σᵢ (Σⱼ Mᵢⱼ)
row_first = sum(sum(M[i, j] for j in range(3)) for i in range(3))
# Column-first: Σⱼ (Σᵢ Mᵢⱼ)
col_first = sum(sum(M[i, j] for i in range(3)) for j in range(3))

print(f"\n  Matrix M:\n{M}")
print(f"\n  Row-first:    Σᵢ(Σⱼ Mᵢⱼ) = Σᵢ{M.sum(axis=1).tolist()} = {row_first}")
print(f"  Column-first: Σⱼ(Σᵢ Mᵢⱼ) = Σⱼ{M.sum(axis=0).tolist()} = {col_first}")
print(f"  Equal: {row_first == col_first} ✓  (Fubini for finite sums)")

# AI: E[Σᵢ Xᵢ] = Σᵢ E[Xᵢ] (linearity of expectation = interchange)
print(f"\n  AI: 𝔼[Σᵢ Xᵢ] = Σᵢ 𝔼[Xᵢ]  (linearity of expectation)")
print(f"  → Valid even if Xᵢ are dependent!")

5. Algebraic Properties of Products

5.1 Splitting Products

$$\prod_{i=m}^{n} f(i) = \prod_{i=m}^{k} f(i) \cdot \prod_{i=k+1}^{n} f(i) \quad (m \leq k < n)$$

5.2 Product of a Constant

$$\prod_{i=1}^{n} c = c^n$$

5.3 Interchange of Product Order

$$\prod_{i=1}^{m}\prod_{j=1}^{n} f(i,j) = \prod_{j=1}^{n}\prod_{i=1}^{m} f(i,j)$$

5.4 Distributing Product over Multiplication

$$\prod_{i=1}^{n} \bigl(f(i) \cdot g(i)\bigr) = \left(\prod_{i=1}^{n} f(i)\right) \cdot \left(\prod_{i=1}^{n} g(i)\right)$$

Does NOT distribute over addition: $\prod_i (f(i) + g(i)) \neq \prod_i f(i) + \prod_i g(i)$

5.5 Logarithm Converts Product to Sum — The Key Identity

$$\log \prod_{i=1}^{n} f(i) = \sum_{i=1}^{n} \log f(i) \qquad \text{and} \qquad \prod_{i=1}^{n} f(i) = \exp\!\left(\sum_{i=1}^{n} \log f(i)\right)$$

This is why every ML system works in log-space:

  • Cross-entropy: $-\sum_i \log P = -\log \prod_i P = \log \prod_i (1/P)$
  • Perplexity: $\text{PPL} = \exp\!\left(-\frac{1}{N}\sum_i \log P\right) = \left(\prod_i \frac{1}{P}\right)^{1/N}$

Code cell 14

# ══════════════════════════════════════════════════════════════════
# 5.1–5.4  PRODUCT ALGEBRAIC PROPERTIES
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("ALGEBRAIC PROPERTIES OF PRODUCTS")
print("=" * 65)

# 5.1 Splitting: Πᵢ₌₁¹⁰ i = (Πᵢ₌₁⁵ i) · (Πᵢ₌₆¹⁰ i)
full = pi_product(lambda i: i, 1, 10)
p1   = pi_product(lambda i: i, 1, 5)
p2   = pi_product(lambda i: i, 6, 10)
print(f"\n  Splitting: Πᵢ₌₁¹⁰ i = {full}")
print(f"            (Πᵢ₌₁⁵ i)(Πᵢ₌₆¹⁰ i) = {p1} × {p2} = {p1 * p2}  ✓")

# 5.2 Constant product: Πᵢ₌₁ⁿ c = cⁿ
c, n = 3, 5
const_prod = pi_product(lambda i: c, 1, n)
print(f"\n  Πᵢ₌₁⁵ 3 = {const_prod} = 3⁵ = {3**5}  ✓")

# 5.4 Product distributes over multiplication (but NOT addition)
f_vals = np.array([2, 3, 4, 5], dtype=float)
g_vals = np.array([1, 2, 1, 3], dtype=float)

prod_fg = np.prod(f_vals * g_vals)
prod_f_prod_g = np.prod(f_vals) * np.prod(g_vals)
print(f"\n  Πᵢ(fᵢ·gᵢ)   = {prod_fg:.0f}")
print(f"  (Πᵢ fᵢ)(Πᵢ gᵢ) = {prod_f_prod_g:.0f}")
print(f"  Equal: {np.isclose(prod_fg, prod_f_prod_g)} ✓  (distributes over ×)")

# Does NOT distribute over addition
prod_sum = np.prod(f_vals + g_vals)
sum_prods = np.prod(f_vals) + np.prod(g_vals)
print(f"\n  Πᵢ(fᵢ+gᵢ)        = {prod_sum:.0f}")
print(f"  Πᵢ fᵢ + Πᵢ gᵢ    = {sum_prods:.0f}")
print(f"  Equal: {np.isclose(prod_sum, sum_prods)}  ← NOT equal! Product ≠ distribute over +")

# ══════════════════════════════════════════════════════════════════
# 5.5  LOG ↔ PRODUCT — COMPREHENSIVE DEMONSTRATION
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("LOG CONVERTS PRODUCT TO SUM — NUMERICAL STABILITY DEMO")
print("=" * 65)

# Simulate 50-token sequence with varying probabilities
np.random.seed(42)
probs_50 = np.random.uniform(0.01, 0.3, size=50)

# Direct product: underflows quickly
direct_product = np.prod(probs_50)

# Log-sum: stable
log_sum = np.sum(np.log(probs_50))
from_log = np.exp(log_sum)

print(f"\n  50-token sequence probabilities: P(tᵢ|ctx) ∈ [0.01, 0.3]")
print(f"\n  Direct Π P(tᵢ)  = {direct_product:.6e}")
print(f"  Σ log P(tᵢ)     = {log_sum:.4f}")
print(f"  exp(Σ log P)     = {from_log:.6e}")
print(f"  Match: {np.isclose(direct_product, from_log)} ✓")

# With 500 tokens: direct product → 0.0 (underflow!)
probs_500 = np.random.uniform(0.01, 0.3, size=500)
direct_500 = np.prod(probs_500)
log_500 = np.sum(np.log(probs_500))
print(f"\n  500-token sequence:")
print(f"  Direct Π P(tᵢ)  = {direct_500}  ← UNDERFLOW!")
print(f"  Σ log P(tᵢ)     = {log_500:.4f}  ← perfectly fine in log-space")
print(f"\n  → This is why ALL production ML systems work in log-space!")

6. Important Summation Formulas

6.1 Arithmetic Series (Gauss's Formula)

$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$$

Gauss at age 10: pair first with last, $1 + n = 2 + (n{-}1) = \ldots$; that gives $n/2$ pairs, each summing to $n{+}1$.

6.2 Sum of Squares

$$\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}$$

6.3 Sum of Cubes — The Remarkable Identity

$$\sum_{i=1}^{n} i^3 = \left(\frac{n(n+1)}{2}\right)^2 = \left(\sum_{i=1}^{n} i\right)^2$$

Sum of cubes = square of the sum of the first $n$ integers!

6.4 Geometric Series (Finite)

$$\sum_{i=0}^{n-1} r^i = \frac{1 - r^n}{1 - r} \quad (r \neq 1)$$

6.5 Geometric Series (Infinite) — for $|r| < 1$:

$$\sum_{i=0}^{\infty} r^i = \frac{1}{1-r}$$

Derivative: $\sum_{i=1}^{\infty} i \, r^{i-1} = \frac{1}{(1-r)^2}$

6.6 Exponential and Taylor Series

$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}, \qquad \sin(x) = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{(2n+1)!}, \qquad \ln(1{+}x) = \sum_{n=1}^{\infty}\frac{(-1)^{n+1}x^n}{n}$$

6.7 Harmonic Numbers

$$H_n = \sum_{i=1}^{n} \frac{1}{i} \approx \ln(n) + \gamma \quad (\gamma \approx 0.5772 \text{, Euler–Mascheroni constant})$$

The harmonic series $\sum_{n=1}^{\infty} 1/n$ diverges (grows without bound).

$p$-series: $\sum 1/n^p$ converges iff $p > 1$.

Code cell 16

# ══════════════════════════════════════════════════════════════════
# 6.1–6.3  CLASSICAL CLOSED-FORM SUMMATION FORMULAS
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("CLASSICAL SUMMATION FORMULAS — VERIFY ALL")
print("=" * 65)

print(f"\n  {'n':<6} {'Σi':<10} {'n(n+1)/2':<10} {'Σi²':<12} {'formula':<12} {'Σi³':<12} {'(Σi)²':<12}")
print(f"  {'─'*6} {'─'*10} {'─'*10} {'─'*12} {'─'*12} {'─'*12} {'─'*12}")

for n in [1, 3, 5, 10, 20, 50, 100]:
    # 6.1 Arithmetic: Σᵢ₌₁ⁿ i = n(n+1)/2
    sum_i = sum(range(1, n+1))
    gauss = n * (n + 1) // 2

    # 6.2 Squares: Σᵢ₌₁ⁿ i² = n(n+1)(2n+1)/6
    sum_i2 = sum(i**2 for i in range(1, n+1))
    sq_formula = n * (n + 1) * (2 * n + 1) // 6

    # 6.3 Cubes: Σᵢ₌₁ⁿ i³ = (n(n+1)/2)²
    sum_i3 = sum(i**3 for i in range(1, n+1))
    cube_formula = gauss ** 2

    print(f"  {n:<6} {sum_i:<10} {gauss:<10} {sum_i2:<12} {sq_formula:<12} {sum_i3:<12} {cube_formula:<12}")

print(f"\n  All verified: Σi³ = (Σi)² — remarkable identity! ✓")
print(f"  AI: n tokens → O(n²) attention pairs; Σᵢ₌₁ⁿ i = n(n+1)/2 ≈ n²/2")

# ══════════════════════════════════════════════════════════════════
# 6.4–6.5  GEOMETRIC SERIES
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("GEOMETRIC SERIES: Σᵢ₌₀ⁿ⁻¹ rⁱ = (1−rⁿ)/(1−r)")
print("=" * 65)

def geometric_sum(r, n):
    """Closed form: (1 - rⁿ) / (1 - r)"""
    if np.isclose(r, 1.0):
        return float(n)
    return (1 - r**n) / (1 - r)

print(f"\n  {'r':<8} {'n':<6} {'Direct Σ':<18} {'Formula':<18} {'Match'}")
print(f"  {'─'*8} {'─'*6} {'─'*18} {'─'*18} {'─'*6}")
for r in [0.5, 0.9, 0.99, 2.0]:
    for n in [5, 20]:
        direct = sum(r**i for i in range(n))
        formula = geometric_sum(r, n)
        print(f"  {r:<8.2f} {n:<6} {direct:<18.8f} {formula:<18.8f} {'✓' if np.isclose(direct, formula) else '✗'}")

# Infinite geometric series: Σᵢ₌₀^∞ rⁱ = 1/(1-r) for |r| < 1
print(f"\n  Infinite geometric series (|r| < 1):")
for r in [0.5, 0.9, 0.99]:
    partial_100 = sum(r**i for i in range(100))
    infinite = 1 / (1 - r)
    print(f"    r={r:.2f}: Σᵢ₌₀⁹⁹ rⁱ = {partial_100:.8f}  →  1/(1−r) = {infinite:.8f}")

print(f"\n  AI: Geometric decay in ALiBi attention; discounted rewards in RL: Σᵢ γⁱrᵢ")
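The derivative identity from 6.5, $\sum_{i=1}^{\infty} i\,r^{i-1} = 1/(1-r)^2$, is not exercised in the cell above. A quick numeric sketch (the 10 000-term cutoff is an arbitrary truncation):

```python
# Differentiated geometric series: Σᵢ₌₁^∞ i·r^(i-1) = 1/(1−r)²  for |r| < 1
for r in [0.5, 0.9]:
    partial = sum(i * r**(i - 1) for i in range(1, 10_000))
    closed = 1 / (1 - r)**2
    print(f"r={r}: Σ i·r^(i-1) ≈ {partial:.8f}   1/(1−r)² = {closed:.8f}")
```

The same sum gives the mean of a geometric distribution, $\sum_i i\,p(1-p)^{i-1} = 1/p$.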

Code cell 17

# ══════════════════════════════════════════════════════════════════
# 6.6  TAYLOR SERIES — FUNCTIONS AS INFINITE SUMS
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("TAYLOR SERIES: FUNCTIONS AS Σ")
print("=" * 65)

def exp_taylor(x, n_terms):
    """eˣ = Σₙ₌₀^∞ xⁿ/n!"""
    return sum(x**n / math.factorial(n) for n in range(n_terms))

def sin_taylor(x, n_terms):
    """sin(x) = Σₙ₌₀^∞ (-1)ⁿ x^(2n+1)/(2n+1)!"""
    return sum((-1)**n * x**(2*n+1) / math.factorial(2*n+1) for n in range(n_terms))

def ln1px_taylor(x, n_terms):
    """ln(1+x) = Σₙ₌₁^∞ (-1)^(n+1) xⁿ/n  for |x| ≤ 1"""
    return sum((-1)**(n+1) * x**n / n for n in range(1, n_terms + 1))

x = 1.0  # Evaluate at x = 1
print(f"\n  x = {x}")
print(f"\n  {'Terms':<8} {'exp(x) Taylor':<18} {'sin(x) Taylor':<18} {'ln(1+x) Taylor':<18}")
print(f"  {'─'*8} {'─'*18} {'─'*18} {'─'*18}")
for n_terms in [1, 3, 5, 10, 20]:
    e_approx = exp_taylor(x, n_terms)
    s_approx = sin_taylor(x, min(n_terms, 10))
    l_approx = ln1px_taylor(x, n_terms)
    print(f"  {n_terms:<8} {e_approx:<18.10f} {s_approx:<18.10f} {l_approx:<18.10f}")

print(f"\n  Exact:  {' ':<8}{np.exp(x):<18.10f} {np.sin(x):<18.10f} {np.log(1+x):<18.10f}")
print(f"\n  AI uses: exp() in softmax, log() in cross-entropy, sin()/cos() in positional encodings")

# ══════════════════════════════════════════════════════════════════
# 6.7  HARMONIC NUMBERS & DIVERGENCE
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("HARMONIC NUMBERS: Hₙ = Σᵢ₌₁ⁿ 1/i ≈ ln(n) + γ")
print("=" * 65)

EULER_MASCHERONI = 0.5772156649  # γ

print(f"\n  {'n':<10} {'Hₙ = Σ 1/i':<18} {'ln(n) + γ':<18} {'Error':<12} {'ln(n)':<12}")
print(f"  {'─'*10} {'─'*18} {'─'*18} {'─'*12} {'─'*12}")
for n in [1, 5, 10, 50, 100, 1000, 10000]:
    H_n = sum(1.0/i for i in range(1, n+1))
    approx = np.log(n) + EULER_MASCHERONI
    err = abs(H_n - approx)
    print(f"  {n:<10} {H_n:<18.8f} {approx:<18.8f} {err:<12.6f} {np.log(n):<12.4f}")

print(f"\n  γ (Euler–Mascheroni) ≈ {EULER_MASCHERONI}")
print(f"  Hₙ grows like ln(n) — DIVERGES, but very slowly")

# p-series convergence
print(f"\n  p-series: Σ 1/nᵖ — converges iff p > 1")
for p in [0.5, 1.0, 1.5, 2.0, 3.0]:
    partial = sum(1.0/n**p for n in range(1, 10001))
    converges = "DIVERGES" if p <= 1 else f"→ ζ({p:.1f}) ≈ {partial:.6f}"
    print(f"    p={p:.1f}: Σₙ₌₁¹⁰⁰⁰⁰ 1/nᵖ = {partial:.6f}  {converges}")

print(f"\n  ζ(2) = π²/6 ≈ {np.pi**2/6:.6f} (Basel problem, Euler 1734)")
print(f"  AI: Zipf's law (word freq ∝ 1/rank); harmonic sums in complexity analysis")

7. Summation Bounds and Estimation

7.1 Comparison Test

If $0 \le f(n) \le g(n)$ for all $n \ge N$:

  • $\sum g(n)$ converges $\Rightarrow$ $\sum f(n)$ converges
  • $\sum f(n)$ diverges $\Rightarrow$ $\sum g(n)$ diverges
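A minimal numeric illustration of the comparison test (the pair of series here, $\sum 1/(2^n+n)$ dominated by $\sum 1/2^n$, is an assumed textbook example, not from the notes):

```python
# Comparison test: 0 ≤ 1/(2ⁿ + n) ≤ 1/2ⁿ for n ≥ 1, and Σₙ₌₁^∞ 1/2ⁿ = 1
# converges, so the smaller series must converge as well.
bound   = sum(1 / 2**n for n in range(1, 60))        # geometric upper bound → 1
smaller = sum(1 / (2**n + n) for n in range(1, 60))  # dominated term-by-term
print(f"Σ 1/2ⁿ     ≈ {bound:.6f}")
print(f"Σ 1/(2ⁿ+n) ≈ {smaller:.6f}  (≤ bound, so it converges)")
```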

7.2 Integral Test

For positive decreasing $f$: $\sum_{n=1}^{\infty} f(n)$ and $\int_1^{\infty} f(x)\,dx$ converge or diverge together.

$$\int_1^{n+1} f(x)\,dx \le \sum_{i=1}^{n} f(i) \le f(1) + \int_1^{n} f(x)\,dx$$

7.3 Bounding Partial Sums

  • Upper: $\sum_i f(i) \le n \cdot \max_i f(i)$
  • Cauchy-Schwarz: $\left(\sum_i a_i b_i\right)^2 \le \left(\sum_i a_i^2\right)\left(\sum_i b_i^2\right)$
  • Triangle inequality: $|\sum_i a_i| \le \sum_i |a_i|$

7.4 Stirling's Approximation

$$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n \qquad \Rightarrow \qquad \log(n!) \approx n\log n - n$$

7.5 Asymptotic Notation for Sums

  • $\sum_{i=1}^{n} i = O(n^2)$; $\sum_{i=1}^{n} \log i = O(n \log n)$
  • Attention: $\sum_i \sum_j 1 = n^2 = O(n^2)$; total transformer: $O(n^2 d + n d^2)$

Code cell 19

# ══════════════════════════════════════════════════════════════════
# 7.1–7.2  COMPARISON TEST & INTEGRAL TEST
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("INTEGRAL TEST BOUNDS ON HARMONIC NUMBERS")
print("=" * 65)

# Integral test: ln(n+1) ≤ Hₙ ≤ 1 + ln(n)
print(f"\n  {'n':<10} {'ln(n+1)':<14} {'Hₙ':<14} {'1+ln(n)':<14} {'In bounds?'}")
print(f"  {'─'*10} {'─'*14} {'─'*14} {'─'*14} {'─'*10}")
for n in [5, 10, 50, 100, 1000]:
    H_n = sum(1.0/i for i in range(1, n+1))
    lower = np.log(n + 1)
    upper = 1.0 + np.log(n)
    in_bounds = lower <= H_n <= upper
    print(f"  {n:<10} {lower:<14.6f} {H_n:<14.6f} {upper:<14.6f} {'✓' if in_bounds else '✗'}")

# ══════════════════════════════════════════════════════════════════
# 7.3  CAUCHY-SCHWARZ INEQUALITY FOR SUMS
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("CAUCHY-SCHWARZ: (Σ aᵢbᵢ)² ≤ (Σ aᵢ²)(Σ bᵢ²)")
print("=" * 65)

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 1.0, 4.0, 0.5, 3.0])

lhs = np.dot(a, b)**2
rhs = np.dot(a, a) * np.dot(b, b)
print(f"\n  a = {a}")
print(f"  b = {b}")
print(f"\n  (Σ aᵢbᵢ)²     = {np.dot(a,b):.2f}² = {lhs:.2f}")
print(f"  (Σ aᵢ²)(Σ bᵢ²) = {np.dot(a,a):.2f} × {np.dot(b,b):.2f} = {rhs:.2f}")
print(f"  LHS ≤ RHS: {lhs <= rhs + 1e-10} ✓  (Cauchy-Schwarz)")

# Equality holds iff a ∝ b
a_prop = np.array([1.0, 2.0, 3.0])
b_prop = 2.5 * a_prop  # b = 2.5a (proportional)
lhs_eq = np.dot(a_prop, b_prop)**2
rhs_eq = np.dot(a_prop, a_prop) * np.dot(b_prop, b_prop)
print(f"\n  Equality when a ∝ b: a = {a_prop}, b = 2.5a = {b_prop}")
print(f"  (Σ aᵢbᵢ)² = {lhs_eq:.2f},  (Σ aᵢ²)(Σ bᵢ²) = {rhs_eq:.2f}")
print(f"  Equal: {np.isclose(lhs_eq, rhs_eq)} ✓")

# Triangle inequality
print(f"\n  Triangle inequality: |Σᵢ aᵢ| ≤ Σᵢ |aᵢ|")
mixed = np.array([3, -2, 5, -1, 4])
print(f"  a = {mixed}")
print(f"  |Σ aᵢ| = |{np.sum(mixed)}| = {abs(np.sum(mixed))}")
print(f"  Σ |aᵢ| = {np.sum(np.abs(mixed))}")
print(f"  {abs(np.sum(mixed))}{np.sum(np.abs(mixed))} ✓")

# ══════════════════════════════════════════════════════════════════
# 7.4–7.5  STIRLING & ASYMPTOTIC COMPLEXITY
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("ASYMPTOTIC COMPLEXITY OF COMMON SUMS")
print("=" * 65)

print(f"\n  {'Sum':<25} {'n=100':<15} {'n=1000':<15} {'Growth'}")
print(f"  {'─'*25} {'─'*15} {'─'*15} {'─'*15}")

sums_data = [
    ("Σᵢ₌₁ⁿ 1 = n",        lambda n: n,                                "O(n)"),
    ("Σᵢ₌₁ⁿ i = n²/2",     lambda n: n*(n+1)//2,                       "O(n²)"),
    ("Σᵢ₌₁ⁿ i² ≈ n³/3",    lambda n: n*(n+1)*(2*n+1)//6,               "O(n³)"),
    ("Σᵢ₌₁ⁿ log(i)=log(n!)",lambda n: sum(np.log(i) for i in range(1,n+1)), "O(n log n)"),
    ("Σᵢ₌₁ⁿ 1/i ≈ ln(n)",  lambda n: sum(1.0/i for i in range(1,n+1)), "O(log n)"),
]
for label, func, growth in sums_data:
    v100 = func(100)
    v1000 = func(1000)
    print(f"  {label:<25} {v100:<15.2f} {v1000:<15.2f} {growth}")

# Transformer FLOP count
print(f"\n  Transformer complexity per layer:")
for n_seq, d_model in [(512, 768), (2048, 1024), (8192, 4096)]:
    attn_flops = 2 * n_seq * n_seq * d_model  # QKᵀ and αV
    ffn_flops = 2 * n_seq * d_model * (4 * d_model)  # 2 linear layers
    total = attn_flops + ffn_flops
    print(f"    n={n_seq:<5} d={d_model:<5} Attn: {attn_flops:.2e}  FFN: {ffn_flops:.2e}  Total: {total:.2e}")
print(f"\n  Attn = O(n²d), FFN = O(nd²) → Total = O(n²d + nd²)")
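Section 7.4's Stirling formula is not verified in the cell above. A brief check (a sketch — `math.lgamma(n+1)` computes $\log(n!)$ without overflow):

```python
import math

# Stirling: n! ≈ √(2πn)·(n/e)ⁿ  ⇒  log(n!) ≈ ½·log(2πn) + n·log n − n
for n in [5, 10, 50, 100]:
    exact    = math.lgamma(n + 1)                                # log(n!)
    stirling = 0.5 * math.log(2 * math.pi * n) + n * (math.log(n) - 1)
    crude    = n * math.log(n) - n                               # leading term only
    print(f"n={n:<4} log(n!)={exact:10.4f}  Stirling={stirling:10.4f}  n·ln n − n={crude:10.4f}")
```

The full Stirling approximation is accurate to within $1/(12n)$; the crude $n\log n - n$ term is what justifies the $O(n \log n)$ entry in the table above.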

8. Special Summation Patterns in AI

8.1 Softmax and the Normalisation Sum (Partition Function)

$$Z(\mathbf{z}) = \sum_{v=1}^{|V|} \exp(z_v), \qquad P(v \mid \mathbf{x}) = \frac{\exp(z_v)}{Z(\mathbf{z})}$$

Log-sum-exp trick (numerically stable): $\log Z = m + \log\sum_v \exp(z_v - m)$, where $m = \max_v z_v$.

8.2 Cross-Entropy Loss as Double Sum

$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\log P_\theta(y_i \mid x_i)$$

Gradient: $\frac{\partial L}{\partial z_v} = P_\theta(v \mid x) - \mathbb{1}[v = y]$ — predicted minus true.

8.3 Attention as Weighted Sum

$$\text{Attn}(Q,K,V)_i = \sum_{j=1}^{n} \alpha_{ij} V_j, \qquad \alpha_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d_k})}{\sum_l \exp(q_i \cdot k_l / \sqrt{d_k})}$$

$\sum_j \alpha_{ij} = 1$ — attention is a convex combination of value vectors.

8.4 Perplexity as Geometric Mean

$$\text{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(t_i \mid \text{ctx})\right) = \left(\prod_i \frac{1}{P(t_i \mid \text{ctx})}\right)^{1/N}$$

8.5 BM25 as Weighted Sum

$$\text{BM25}(q,d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f_{t,d}(k_1+1)}{f_{t,d} + k_1(1 - b + b \cdot |d|/\text{avgdl})}$$

Code cell 21

# ══════════════════════════════════════════════════════════════════
# 8.1  SOFTMAX — NORMALISATION SUM & LOG-SUM-EXP TRICK
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("SOFTMAX: P(v) = exp(zᵥ) / Σᵥ' exp(zᵥ')")
print("=" * 65)

def softmax_naive(z):
    """Naive softmax — may overflow for large z."""
    e = np.exp(z)
    return e / np.sum(e)

def softmax_stable(z):
    """Numerically stable softmax using max trick."""
    m = np.max(z)              # m = max(z)
    e = np.exp(z - m)          # shift by max → largest exp = 1
    return e / np.sum(e)       # Z = Σ exp(zᵥ − m)

def log_sum_exp(z):
    """log Σ exp(zᵥ) = m + log Σ exp(zᵥ − m)"""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

# Demo with normal logits
logits = np.array([2.0, 1.0, 0.5, -1.0, 3.0])
probs = softmax_stable(logits)
Z = np.sum(np.exp(logits))

print(f"\n  Logits z = {logits}")
print(f"  Z = Σ exp(zᵥ)         = {Z:.4f}")
print(f"  log Z (naive)          = {np.log(Z):.6f}")
print(f"  log Z (log-sum-exp)    = {log_sum_exp(logits):.6f}")
print(f"  Softmax P(v):          {probs}")
print(f"  Σ P(v) = {np.sum(probs):.10f}  ✓ (sums to 1)")

# Demo with extreme logits — naive overflows!
extreme = np.array([1000.0, 999.0, 998.0])
print(f"\n  Extreme logits: {extreme}")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    naive_result = softmax_naive(extreme)
    stable_result = softmax_stable(extreme)
print(f"  Naive softmax:  {naive_result}  ← overflow produces nan!")
print(f"  Stable softmax: {stable_result}  ← correct!")

# ══════════════════════════════════════════════════════════════════
# 8.2  CROSS-ENTROPY LOSS & ITS GRADIENT
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("CROSS-ENTROPY: L = −(1/N) Σᵢ log P(yᵢ|xᵢ)")
print("=" * 65)

# Simulate 5 training examples, vocab size 4
np.random.seed(42)
N, V_size = 5, 4
logits_batch = np.random.randn(N, V_size)
labels = np.array([2, 0, 3, 1, 2])  # true labels

# Compute cross-entropy loss
total_loss = 0.0
print(f"\n  {'Example':<10} {'True y':<8} {'P(y|x)':<12} {'−log P(y|x)':<14}")
print(f"  {'─'*10} {'─'*8} {'─'*12} {'─'*14}")
for i in range(N):
    probs_i = softmax_stable(logits_batch[i])
    p_true = probs_i[labels[i]]
    loss_i = -np.log(p_true)
    total_loss += loss_i
    print(f"  {i+1:<10} {labels[i]:<8} {p_true:<12.6f} {loss_i:<14.6f}")

avg_loss = total_loss / N
print(f"\n  L = −(1/{N}) Σ log P(yᵢ|xᵢ) = {avg_loss:.6f}")

# Gradient: ∂L/∂zᵥ = P(v|x) − 𝟙[v=y]
print(f"\n  Gradient for example 1 (y=2):")
probs_1 = softmax_stable(logits_batch[0])
grad = probs_1.copy()
grad[labels[0]] -= 1.0  # subtract indicator
print(f"    P(v|x):  {probs_1}")
print(f"    𝟙[v=y]:  {[int(v == labels[0]) for v in range(V_size)]}")
print(f"    ∂L/∂zᵥ:  {grad}")
print(f"    Σᵥ ∂L/∂zᵥ = {np.sum(grad):.10f}  (sums to 0!) ✓")

Code cell 22

# ══════════════════════════════════════════════════════════════════
# 8.3  ATTENTION AS WEIGHTED SUM
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("ATTENTION: Attn(Q,K,V)ᵢ = Σⱼ αᵢⱼ Vⱼ")
print("=" * 65)

np.random.seed(42)
n_seq, d_k, d_v = 4, 3, 3  # 4 tokens, dim 3

Q = np.random.randn(n_seq, d_k)
K = np.random.randn(n_seq, d_k)
V = np.random.randn(n_seq, d_v)

# Step 1: Scores sᵢⱼ = qᵢ·kⱼ / √dₖ = Σₖ Qᵢₖ Kⱼₖ / √dₖ
scores = Q @ K.T / np.sqrt(d_k)
print(f"\n  Score matrix (sᵢⱼ = qᵢ·kⱼ/√dₖ):\n{scores}")

# Step 2: Attention weights αᵢⱼ = softmax(sᵢ) over j
alpha = np.zeros_like(scores)
for i in range(n_seq):
    alpha[i] = softmax_stable(scores[i])
print(f"\n  Attention weights α (each row sums to 1):\n{np.round(alpha, 4)}")
print(f"  Row sums: {alpha.sum(axis=1)}")

# Step 3: Output = Σⱼ αᵢⱼ Vⱼ (weighted sum of value vectors)
attn_output_manual = np.zeros((n_seq, d_v))
for i in range(n_seq):
    for j in range(n_seq):
        attn_output_manual[i] += alpha[i, j] * V[j]

attn_output_matrix = alpha @ V  # same thing as matrix multiply
print(f"\n  Attention output (manual Σⱼ αᵢⱼVⱼ):\n{np.round(attn_output_manual, 4)}")
print(f"  Attention output (α @ V):\n{np.round(attn_output_matrix, 4)}")
print(f"  Match: {np.allclose(attn_output_manual, attn_output_matrix)} ✓")

# Causal mask: restrict j ≤ i
print(f"\n  Causal attention (j ≤ i only):")
causal_mask = np.triu(np.ones((n_seq, n_seq)) * (-1e9), k=1)
scores_causal = scores + causal_mask
alpha_causal = np.zeros_like(scores_causal)
for i in range(n_seq):
    alpha_causal[i] = softmax_stable(scores_causal[i])
print(f"  Causal α:\n{np.round(alpha_causal, 4)}")
print(f"  → Token 0 attends only to itself; token 3 attends to all 4")

# ══════════════════════════════════════════════════════════════════
# 8.4–8.5  PERPLEXITY & BM25
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("PERPLEXITY: PPL = exp(−(1/N) Σ log P)")
print("=" * 65)

# Simulate model probabilities for a 10-token sequence
token_probs = np.array([0.8, 0.5, 0.3, 0.9, 0.1, 0.6, 0.4, 0.7, 0.2, 0.05])
N_tok = len(token_probs)

cross_ent = -np.mean(np.log(token_probs))
ppl = np.exp(cross_ent)
geo_mean = np.prod(1.0 / token_probs) ** (1.0 / N_tok)

print(f"\n  Token probs: {token_probs}")
print(f"  Cross-entropy H = −(1/N)Σ log P = {cross_ent:.4f}")
print(f"  PPL = exp(H)                     = {ppl:.4f}")
print(f"  PPL = (Π 1/P)^(1/N)              = {geo_mean:.4f}")
print(f"  Match: {np.isclose(ppl, geo_mean)} ✓")
print(f"\n  Interpretation: model is as uncertain as choosing uniformly")
print(f"  from {ppl:.1f} equally likely tokens on average")

# BM25 scoring
print(f"\n{'─' * 65}")
print("BM25: score(q,d) = Σ_{t ∈ q} IDF(t) · TF_mod(t,d)")
print("─" * 65)

def bm25_score(query_terms, doc_tf, doc_len, avg_dl, idf, k1=1.5, b=0.75):
    """BM25 score = Σ_{t ∈ q} IDF(t) · (f·(k1+1)) / (f + k1·(1 - b + b·|d|/avgdl))"""
    score = 0.0
    details = []
    for t in query_terms:
        if t in doc_tf:
            f = doc_tf[t]
            tf_mod = (f * (k1 + 1)) / (f + k1 * (1 - b + b * doc_len / avg_dl))
            term_score = idf.get(t, 0) * tf_mod
            score += term_score
            details.append((t, idf[t], f, tf_mod, term_score))
    return score, details

query = ["neural", "network", "training"]
doc_tf = {"neural": 3, "network": 2, "training": 5, "the": 20}
idf = {"neural": 2.5, "network": 2.0, "training": 1.8}
score, details = bm25_score(query, doc_tf, doc_len=100, avg_dl=150, idf=idf)

print(f"\n  Query: {query}")
print(f"  {'Term':<12} {'IDF':<8} {'TF':<6} {'TF_mod':<10} {'Score'}")
print(f"  {'─'*12} {'─'*8} {'─'*6} {'─'*10} {'─'*8}")
for t, i, f, tf_m, s in details:
    print(f"  {t:<12} {i:<8.2f} {f:<6} {tf_m:<10.4f} {s:.4f}")
print(f"\n  BM25(q,d) = Σ = {score:.4f}")

9. Einstein Summation Convention

9.1 Motivation & Rule

In tensor operations, summation over repeated indices is so common that Einstein (1916) proposed: if an index appears exactly twice in a product, sum over it automatically.

$$a_i b_i \equiv \sum_i a_i b_i \qquad \text{(repeated index } i \text{ → implicit sum)}$$
  • Free index: appears once → result has that index (no summation)
  • Dummy index: appears twice → summed over; does not appear in result

9.2 Examples

| Operation | Einstein | Explicit | Result |
| --- | --- | --- | --- |
| Dot product | $a_i b_i$ | $\sum_i a_i b_i$ | scalar |
| Matrix-vector | $A_{ij} x_j$ | $\sum_j A_{ij} x_j$ | vector (index $i$) |
| Matrix-matrix | $A_{ik} B_{kj}$ | $\sum_k A_{ik} B_{kj}$ | matrix (indices $i,j$) |
| Trace | $A_{ii}$ | $\sum_i A_{ii}$ | scalar |
| Outer product | $a_i b_j$ | $a_i b_j$ (no repeated index) | matrix |
| Quadratic form | $x_i A_{ij} x_j$ | $\sum_i \sum_j x_i A_{ij} x_j$ | scalar |

9.3 Rules

  1. Each index appears at most twice per term
  2. Repeated index → summed; do not write $\Sigma$
  3. Free index → must match on both sides of equation
  4. Three+ appearances → ambiguous; write $\Sigma$ explicitly

9.4 einsum in NumPy / PyTorch

np.einsum("ij,jk->ik", A, B)      # matrix multiply
np.einsum("ij,ij->", A, B)        # Frobenius inner product
np.einsum("bhid,bhjd->bhij", Q, K) # batched attention scores

9.5 Tensor Contraction

Generalisation of matmul to higher-order tensors: $\sum_k A_{ijk} B_{klm} = C_{ijlm}$

Code cell 24

# ══════════════════════════════════════════════════════════════════
# 9.1–9.2  EINSTEIN SUMMATION — ALL EXAMPLES
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("EINSTEIN SUMMATION CONVENTION — np.einsum DEMONSTRATIONS")
print("=" * 65)

A = np.array([[1, 2, 3],
              [4, 5, 6]])  # 2×3
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])   # 3×2
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# 1. Dot product: aᵢbᵢ = Σᵢ aᵢbᵢ
dot = np.einsum('i,i->', x, y)
print(f"\n  1. Dot product: aᵢbᵢ")
print(f"     einsum('i,i->', x, y) = {dot}")
print(f"     np.dot(x, y)          = {np.dot(x, y)}  ✓")

# 2. Matrix-vector: Aᵢⱼxⱼ = Σⱼ Aᵢⱼxⱼ
mv = np.einsum('ij,j->i', A, x)
print(f"\n  2. Matrix-vector: Aᵢⱼxⱼ")
print(f"     einsum('ij,j->i', A, x) = {mv}")
print(f"     A @ x                    = {A @ x}  ✓")

# 3. Matrix-matrix: AᵢₖBₖⱼ = Σₖ AᵢₖBₖⱼ
mm = np.einsum('ik,kj->ij', A, B)
print(f"\n  3. Matrix multiply: AᵢₖBₖⱼ")
print(f"     einsum('ik,kj->ij', A, B):\n{mm}")
print(f"     A @ B:\n{A @ B}  ✓")

# 4. Trace: Aᵢᵢ = Σᵢ Aᵢᵢ
S = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
tr = np.einsum('ii->', S)
print(f"\n  4. Trace: Aᵢᵢ")
print(f"     einsum('ii->', S) = {tr}")
print(f"     np.trace(S)       = {np.trace(S)}  ✓")

# 5. Outer product: aᵢbⱼ (no repeated index)
outer = np.einsum('i,j->ij', x, y)
print(f"\n  5. Outer product: aᵢbⱼ (no sum — free indices i,j)")
print(f"     einsum('i,j->ij', x, y):\n{outer}")
print(f"     np.outer(x, y):\n{np.outer(x, y)}  ✓")

# 6. Quadratic form: xᵢAᵢⱼxⱼ = Σᵢ Σⱼ xᵢAᵢⱼxⱼ
quad = np.einsum('i,ij,j->', x, S, x)
print(f"\n  6. Quadratic form: xᵢAᵢⱼxⱼ")
print(f"     einsum('i,ij,j->', x, S, x) = {quad}")
print(f"     x @ S @ x                    = {x @ S @ x}  ✓")

# 7. Transpose: Aᵢⱼ → Aⱼᵢ
trans = np.einsum('ij->ji', A)
print(f"\n  7. Transpose: einsum('ij->ji', A):\n{trans}")

# 8. Frobenius inner product: Σᵢⱼ AᵢⱼBᵢⱼ
F1 = np.array([[1, 2], [3, 4]])
F2 = np.array([[5, 6], [7, 8]])
frob = np.einsum('ij,ij->', F1, F2)
print(f"\n  8. Frobenius: AᵢⱼBᵢⱼ = {frob} = tr(AᵀB) = {np.trace(F1.T @ F2)}  ✓")

Code cell 25

# ══════════════════════════════════════════════════════════════════
# 9.4  EINSUM FOR ATTENTION (BATCHED MULTI-HEAD)
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("BATCHED MULTI-HEAD ATTENTION VIA EINSUM")
print("=" * 65)

np.random.seed(42)
batch, heads, seq, dim = 2, 3, 4, 5

Q = np.random.randn(batch, heads, seq, dim)
K = np.random.randn(batch, heads, seq, dim)
V = np.random.randn(batch, heads, seq, dim)

# Attention scores: sᵢⱼ = Σₖ Qᵢₖ Kⱼₖ / √dₖ
# einsum: "bhid,bhjd->bhij" — sum over d (dummy); free: b,h,i,j
scores = np.einsum('bhid,bhjd->bhij', Q, K) / np.sqrt(dim)
print(f"\n  Q shape: {Q.shape}  (batch, heads, seq, dim)")
print(f"  K shape: {K.shape}")
print(f"  Scores = einsum('bhid,bhjd->bhij', Q, K)/√d")
print(f"  Scores shape: {scores.shape}  (batch, heads, seq, seq)")
print(f"\n  Free indices: b(batch), h(head), i(query), j(key)")
print(f"  Dummy index:  d(dimension) — summed over!")

# Softmax over j (key dimension)
def batched_softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

alpha = batched_softmax(scores, axis=-1)
print(f"\n  α shape: {alpha.shape}  (after softmax over j)")
print(f"  Row sums (should be 1): {alpha[0, 0].sum(axis=-1)}")

# Output: oᵢ = Σⱼ αᵢⱼ Vⱼ → einsum('bhij,bhjd->bhid')
output = np.einsum('bhij,bhjd->bhid', alpha, V)
print(f"\n  Output = einsum('bhij,bhjd->bhid', α, V)")
print(f"  Output shape: {output.shape}  (same as Q — correct!)")

# ══════════════════════════════════════════════════════════════════
# 9.5  TENSOR CONTRACTION
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("TENSOR CONTRACTION — GENERALISED MATMUL")
print("=" * 65)

# Contract 3D tensor with 2D tensor
T = np.random.randn(3, 4, 5)   # Aᵢⱼₖ
M = np.random.randn(5, 2)       # Bₖₗ
# Contract over k: Cᵢⱼₗ = Σₖ Aᵢⱼₖ Bₖₗ
C = np.einsum('ijk,kl->ijl', T, M)
print(f"\n  A shape: {T.shape} (3D tensor)")
print(f"  B shape: {M.shape} (matrix)")
print(f"  Σₖ Aᵢⱼₖ Bₖₗ → C shape: {C.shape}")

# Multi-index contraction: Σⱼ Σₖ AᵢⱼₖBⱼₖₗ = Cᵢₗ
T2 = np.random.randn(3, 4, 5)
T3 = np.random.randn(4, 5, 2)
C2 = np.einsum('ijk,jkl->il', T2, T3)
print(f"\n  Σⱼ Σₖ Aᵢⱼₖ Bⱼₖₗ → C shape: {C2.shape}  (contracted 2 indices)")

# FLOPs: product of all index ranges
flops = 3 * 4 * 5 * 2  # i × j × k × l
print(f"  FLOPs: {flops} = 3×4×5×2 (product of ALL index ranges)")

# PyTorch einsum comparison (if available)
if HAS_TORCH:
    Q_t = torch.randn(2, 3, 4, 5)
    K_t = torch.randn(2, 3, 4, 5)
    scores_t = torch.einsum('bhid,bhjd->bhij', Q_t, K_t)
    print(f"\n  torch.einsum('bhid,bhjd->bhij', Q, K) shape: {list(scores_t.shape)}")
    print(f"  → Same as NumPy einsum — lingua franca of tensor ops!")
else:
    print(f"\n  (PyTorch not available — einsum works identically with torch.einsum)")

10. Double Sums and Index Interaction

10.1 Diagonal, Upper-Triangular, and Lower-Triangular Sums

  • Diagonal: $\sum_i f(i,i)$
  • Upper triangular: $\sum_i \sum_{j > i} f(i,j)$ — causal attention instead keeps only $j \leq i$ (the lower triangle plus diagonal)
  • Symmetrisation: $\sum_{i \neq j} f(i,j) = \sum_i \sum_j f(i,j) - \sum_i f(i,i)$

10.2 Changing Order in Triangular Sums

$$\sum_{i=1}^{n}\sum_{j=1}^{i} f(i,j) = \sum_{j=1}^{n}\sum_{i=j}^{n} f(i,j)$$

Both cover the same pairs $(i,j)$ with $1 \leq j \leq i \leq n$; just different traversal order.

10.3 Cauchy Product (Product of Two Series)

$$\left(\sum_{i=0}^{\infty} a_i\right)\!\left(\sum_{j=0}^{\infty} b_j\right) = \sum_{k=0}^{\infty}\left(\sum_{i=0}^{k} a_i b_{k-i}\right) = \sum_k c_k$$

where $c_k = \sum_{i=0}^{k} a_i b_{k-i}$ is a convolution — the same operation used in CNNs!

10.4 Summation by Parts (Abel's Summation)

$$\sum_{i=m}^{n} a_i b_i = A_n b_n - A_{m-1} b_m - \sum_{i=m}^{n-1} A_i(b_{i+1} - b_i)$$

where $A_k = \sum_{i=m}^{k} a_i$ — the discrete analogue of integration by parts.

Code cell 27

# ══════════════════════════════════════════════════════════════════
# 10.1  DIAGONAL & TRIANGULAR SUMS
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("DIAGONAL & TRIANGULAR SUMS")
print("=" * 65)

M = np.array([[1,  2,  3,  4],
              [5,  6,  7,  8],
              [9,  10, 11, 12],
              [13, 14, 15, 16]])
n = 4

# Diagonal sum: Σᵢ Mᵢᵢ
diag_sum = sum(M[i, i] for i in range(n))
print(f"\n  M:\n{M}")
print(f"\n  Diagonal: Σᵢ Mᵢᵢ = {diag_sum} = tr(M) = {np.trace(M)}  ✓")

# Upper triangular (i < j): Σᵢ Σⱼ>ᵢ Mᵢⱼ
upper = sum(M[i, j] for i in range(n) for j in range(i+1, n))
print(f"  Upper triangular (i < j): {upper}")
print(f"    np.triu(M, k=1).sum() = {np.triu(M, k=1).sum()}  ✓")

# Lower triangular (i > j): Σᵢ Σⱼ<ᵢ Mᵢⱼ
lower = sum(M[i, j] for i in range(n) for j in range(i))
print(f"  Lower triangular (i > j): {lower}")

# Verify: full = diagonal + upper + lower
full = np.sum(M)
print(f"\n  Full = diag + upper + lower: {full} = {diag_sum} + {upper} + {lower} = {diag_sum + upper + lower}  ✓")

# AI: Causal attention mask
print(f"\n  AI: Causal attention uses LOWER triangular (j ≤ i):")
causal = np.tril(np.ones((4, 4)))
print(f"  Mask:\n{causal.astype(int)}")

# ══════════════════════════════════════════════════════════════════
# 10.2  CHANGING ORDER IN TRIANGULAR SUMS
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("INTERCHANGE ORDER: Σᵢ₌₁ⁿ Σⱼ₌₁ⁱ = Σⱼ₌₁ⁿ Σᵢ₌ⱼⁿ")
print("=" * 65)

n = 5
f_func = lambda i, j: i + j

# Order 1: fix i, sum j from 1 to i
order1 = 0
print(f"\n  Order 1: Σᵢ₌₁⁵ Σⱼ₌₁ⁱ f(i,j)  where f(i,j) = i+j")
for i in range(1, n+1):
    inner = sum(f_func(i, j) for j in range(1, i+1))
    order1 += inner
    terms = [f"f({i},{j})" for j in range(1, i+1)]
    print(f"    i={i}: Σⱼ₌₁^{i} = {' + '.join(terms)} = {inner}")
print(f"  Total = {order1}")

# Order 2: fix j, sum i from j to n
order2 = 0
print(f"\n  Order 2: Σⱼ₌₁⁵ Σᵢ₌ⱼ⁵ f(i,j)")
for j in range(1, n+1):
    inner = sum(f_func(i, j) for i in range(j, n+1))
    order2 += inner
    terms = [f"f({i},{j})" for i in range(j, n+1)]
    print(f"    j={j}: Σᵢ₌{j}⁵ = {' + '.join(terms)} = {inner}")
print(f"  Total = {order2}")
print(f"\n  Match: {order1 == order2} ✓  (same pairs, different order)")

# ══════════════════════════════════════════════════════════════════
# 10.3  CAUCHY PRODUCT = CONVOLUTION
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("CAUCHY PRODUCT: cₖ = Σᵢ₌₀ᵏ aᵢb_{k-i}  (CONVOLUTION)")
print("=" * 65)

a = np.array([1.0, 2.0, 3.0])  # polynomial: 1 + 2x + 3x²
b = np.array([4.0, 5.0])       # polynomial: 4 + 5x

# Product polynomial coefficients via Cauchy product
n_a, n_b = len(a), len(b)
c = np.zeros(n_a + n_b - 1)
for k in range(len(c)):
    for i in range(k + 1):
        if i < n_a and (k - i) < n_b:
            c[k] += a[i] * b[k - i]

print(f"\n  a(x) = {a[0]:.0f} + {a[1]:.0f}x + {a[2]:.0f}x²")
print(f"  b(x) = {b[0]:.0f} + {b[1]:.0f}x")
print("\n  Cauchy product cₖ = Σᵢ aᵢb_{k-i}:")  # not an f-string: {k-i} is literal, not leftover loop vars
for k in range(len(c)):
    terms = []
    for i in range(k + 1):
        if i < n_a and (k - i) < n_b:
            terms.append(f"a[{i}]·b[{k-i}] = {a[i]*b[k-i]:.0f}")
    print(f"    c[{k}] = {' + '.join(terms)} = {c[k]:.0f}")

print(f"\n  c(x) = {' + '.join(f'{c[k]:.0f}x^{k}' if k > 0 else f'{c[k]:.0f}' for k in range(len(c)))}")
print(f"  np.convolve(a, b) = {np.convolve(a, b)}")
print(f"  Match: {np.allclose(c, np.convolve(a, b))} ✓")
print(f"\n  AI: Convolution in CNNs uses this same sum structure!")
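Section 10.4's summation by parts is not exercised above. A direct numeric check of Abel's identity with $m=1$ (so $A_0 = 0$), on arbitrary random vectors:

```python
import numpy as np

# Abel's summation by parts (m = 1):
#   Σᵢ₌₁ⁿ aᵢbᵢ = Aₙbₙ − Σᵢ₌₁ⁿ⁻¹ Aᵢ(bᵢ₊₁ − bᵢ),   where Aₖ = Σᵢ₌₁ᵏ aᵢ
rng = np.random.default_rng(0)
n = 8
a = rng.standard_normal(n)
b = rng.standard_normal(n)

lhs = np.sum(a * b)
A = np.cumsum(a)                     # A[k] holds A_{k+1} (0-based indexing)
rhs = A[-1] * b[-1] - sum(A[i] * (b[i + 1] - b[i]) for i in range(n - 1))
print(f"Σ aᵢbᵢ = {lhs:.6f}   Abel RHS = {rhs:.6f}   match: {np.isclose(lhs, rhs)}")
```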

11. Convergence of Infinite Series

11.1 Partial Sums and Convergence

$S_n = \sum_{i=1}^{n} a_i$ — the $n$-th partial sum.

The series converges to $L$ if $\lim_{n \to \infty} S_n = L$.

Necessary condition: if $\sum a_i$ converges, then $a_i \to 0$. But NOT sufficient — harmonic series: $1/n \to 0$ yet $\sum 1/n = \infty$.

11.2 Absolute vs Conditional Convergence

  • Absolute: $\sum |a_i|$ converges → $\sum a_i$ converges (safe to rearrange)
  • Conditional: converges but not absolutely → Riemann rearrangement theorem: the terms can be rearranged to sum to ANY value!
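The rearrangement claim can be seen numerically: taking one positive term of the alternating harmonic series followed by two negative terms is the classic rearrangement that converges to $\ln(2)/2$ instead of $\ln 2$ (a sketch; 10 000 groups is an arbitrary truncation):

```python
import math

# Rearranged alternating harmonic series: +1/(2k−1) − 1/(4k−2) − 1/(4k)
# Each group equals ½·(1/(2k−1) − 1/(2k)), so the limit is ln(2)/2.
total = 0.0
for k in range(1, 10_001):
    total += 1 / (2*k - 1) - 1 / (4*k - 2) - 1 / (4*k)

print(f"Rearranged sum = {total:.6f}")
print(f"ln(2)/2        = {math.log(2) / 2:.6f}  ← same terms, different limit!")
```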

11.3 Key Convergence Tests

| Test | Condition | Conclusion |
| --- | --- | --- |
| Ratio | $L = \lim_{n\to\infty} \lvert a_{n+1}/a_n \rvert$ | $L < 1$ converges; $L > 1$ diverges; $L = 1$ inconclusive |
| Root | $L = \lim_{n\to\infty} \lvert a_n \rvert^{1/n}$ | same conclusions as the ratio test |
| Alternating | $b_n \geq 0$, decreasing, $b_n \to 0$ | $\sum (-1)^n b_n$ converges |
| $p$-series | $\sum 1/n^p$ | converges iff $p > 1$ |

11.4 Power Series

$\sum_{n=0}^{\infty} c_n (x - a)^n$ — radius of convergence $R$: converges for $|x-a| < R$, diverges for $|x-a| > R$.
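The radius of convergence can be estimated numerically from the ratio test, $R = \lim_{n\to\infty} |c_n / c_{n+1}|$. A sketch, using $\sum x^n/n$ (whose radius is 1) as an assumed example:

```python
# Radius of convergence of Σₙ₌₁^∞ xⁿ/n:  cₙ = 1/n, so
# R = lim |cₙ/cₙ₊₁| = lim (n+1)/n = 1
for n in [10, 100, 1_000, 10_000]:
    c_n, c_n1 = 1 / n, 1 / (n + 1)
    print(f"n={n:<6} |cₙ/cₙ₊₁| = {c_n / c_n1:.6f}")
print("→ R = 1: converges for |x| < 1, diverges for |x| > 1")
```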

11.5 AI: Learning Rate Convergence (Robbins-Monro)

$\sum_t \eta_t = \infty$ (take enough steps) AND $\sum_t \eta_t^2 < \infty$ (don't overshoot).

Example: $\eta_t = 1/t$ satisfies both: $\sum 1/t = \infty$, $\sum 1/t^2 = \pi^2/6 < \infty$.

Code cell 29

# ══════════════════════════════════════════════════════════════════
# 11.1–11.2  CONVERGENCE: PARTIAL SUMS & ABSOLUTE VS CONDITIONAL
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("CONVERGENCE OF INFINITE SERIES")
print("=" * 65)

# Geometric series: converges for |r| < 1
print(f"\n  Geometric series Σ rⁿ — partial sums Sₙ:")
print(f"  {'n':<8} {'r=0.5':<14} {'r=0.9':<14} {'r=1.1':<14} {'1/n (harmonic)':<14}")
print(f"  {'─'*8} {'─'*14} {'─'*14} {'─'*14} {'─'*14}")
for n in [5, 10, 50, 100, 500, 1000]:
    s_05 = sum(0.5**i for i in range(n))
    s_09 = sum(0.9**i for i in range(n))
    s_11 = sum(1.1**i for i in range(n))
    h_n  = sum(1.0/i for i in range(1, n+1))
    print(f"  {n:<8} {s_05:<14.6f} {s_09:<14.6f} {s_11:<14.2f} {h_n:<14.6f}")
print(f"\n  r=0.5 → 1/(1-0.5) = {1/0.5:.1f}")
print(f"  r=0.9 → 1/(1-0.9) = {1/0.1:.1f}")
print(f"  r=1.1 → DIVERGES (grows without bound)")
print(f"  1/n   → DIVERGES (harmonic, but very slowly)")

# Alternating harmonic: converges conditionally
print(f"\n{'─' * 65}")
print("CONDITIONAL CONVERGENCE: Σ (-1)ⁿ⁺¹/n = ln(2)")
print("─" * 65)

print(f"\n  {'n':<8} {'Σ (-1)ⁿ⁺¹/n':<18} {'ln(2)':<18} {'Error'}")
print(f"  {'─'*8} {'─'*18} {'─'*18} {'─'*12}")
for n in [10, 100, 1000, 10000]:
    alt_sum = sum((-1)**(i+1) / i for i in range(1, n+1))
    err = abs(alt_sum - np.log(2))
    print(f"  {n:<8} {alt_sum:<18.10f} {np.log(2):<18.10f} {err:<12.2e}")
print(f"\n  Converges conditionally: Σ |(-1)ⁿ⁺¹/n| = Σ 1/n = ∞ (diverges!)")
print(f"  Riemann theorem: rearranging terms can make it sum to ANY value!")

# ══════════════════════════════════════════════════════════════════
# 11.3  CONVERGENCE TESTS
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("CONVERGENCE TESTS")
print("=" * 65)

# Ratio test: aₙ = xⁿ/n!
print(f"\n  Ratio test for eˣ series: aₙ = xⁿ/n!")
x_val = 2.0
print(f"  x = {x_val}")
print(f"  {'n':<6} {'aₙ = xⁿ/n!':<18} {'|aₙ₊₁/aₙ|':<18} {'Verdict'}")
print(f"  {'─'*6} {'─'*18} {'─'*18} {'─'*20}")
for n in range(1, 10):
    a_n = x_val**n / math.factorial(n)
    a_n1 = x_val**(n+1) / math.factorial(n+1)
    ratio = abs(a_n1 / a_n)
    verdict = "< 1 → converges" if ratio < 1 else "> 1 → diverges" if ratio > 1 else "= 1"
    print(f"  {n:<6} {a_n:<18.8f} {ratio:<18.8f} {verdict}")
print(f"  → |aₙ₊₁/aₙ| = x/(n+1) → 0 < 1 for any x → eˣ converges ∀ x ✓")

# ══════════════════════════════════════════════════════════════════
# 11.5  ROBBINS-MONRO CONDITIONS FOR LEARNING RATE
# ══════════════════════════════════════════════════════════════════

print(f"\n{'=' * 65}")
print("ROBBINS-MONRO: Σ ηₜ = ∞  AND  Σ ηₜ² < ∞")
print("=" * 65)

schedules = {
    "ηₜ = 1/t":        lambda t: 1.0/t,
    "ηₜ = 1/√t":       lambda t: 1.0/np.sqrt(t),
    "ηₜ = 1/t²":       lambda t: 1.0/t**2,
    "ηₜ = 0.01 (const)": lambda t: 0.01,
}

T = 10000
print(f"\n  T = {T}")
print(f"\n  {'Schedule':<20} {'Σ ηₜ':<14} {'Σ ηₜ²':<14} {'Σηₜ=∞?':<10} {'Σηₜ²<∞?':<10} {'Valid?'}")
print(f"  {'─'*20} {'─'*14} {'─'*14} {'─'*10} {'─'*10} {'─'*6}")
for name, lr_fn in schedules.items():
    sum_lr = sum(lr_fn(t) for t in range(1, T+1))
    sum_lr2 = sum(lr_fn(t)**2 for t in range(1, T+1))
    # Fixed thresholds on the partial sums misclassify slowly diverging
    # series (e.g. Σ ηₜ² = Σ 1/t for ηₜ = 1/√t), so test the TAIL over
    # (T, 2T]: a non-negligible tail means the sum is still growing.
    tail_lr  = sum(lr_fn(t) for t in range(T+1, 2*T+1))
    tail_lr2 = sum(lr_fn(t)**2 for t in range(T+1, 2*T+1))
    grows = tail_lr > 0.1      # Σ ηₜ still accumulating → diverges to ∞ ✓
    bounded = tail_lr2 < 0.01  # Σ ηₜ² tail negligible → converges ✓
    valid = grows and bounded
    print(f"  {name:<20} {sum_lr:<14.4f} {sum_lr2:<14.6f} {'✓' if grows else '✗':<10} {'✓' if bounded else '✗':<10} {'✓' if valid else '✗'}")

print(f"\n  1/t satisfies both: Σ 1/t = ∞ (harmonic) and Σ 1/t² = π²/6 ≈ {np.pi**2/6:.4f} ✓")
print(f"  → Adam adapts lr per-parameter, effectively satisfying these conditions")

12. Summation in Probability and Statistics

12.1 Expectation as Weighted Summation

The expected value is the fundamental bridge between summation and probability:

$$E[X] = \sum_{i=1}^{n} x_i \, P(X = x_i)$$

For a function $g(X)$: $\quad E[g(X)] = \sum_{i} g(x_i) \, p_i$

12.2 Variance via Summation

$$\text{Var}(X) = E[(X - \mu)^2] = \sum_{i=1}^{n} (x_i - \mu)^2 \, p_i = E[X^2] - (E[X])^2$$

12.3 Entropy and Information

Shannon entropy measures uncertainty using a sum of log-probabilities:

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i \qquad \text{(bits if } \log_2\text{, nats if } \ln\text{)}$$

12.4 KL Divergence

Measures how distribution QQ differs from PP:

$$D_\text{KL}(P \| Q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}$$

Properties:

  • $D_\text{KL} \geq 0$ (Gibbs' inequality, from $\ln x \leq x - 1$)
  • $D_\text{KL}(P \| Q) \neq D_\text{KL}(Q \| P)$ — not symmetric
  • Cross-entropy: $H(P, Q) = H(P) + D_\text{KL}(P \| Q)$

12.5 Jensen's Inequality

For convex $\varphi$: $\quad \varphi\!\left(\sum_i p_i x_i\right) \leq \sum_i p_i\,\varphi(x_i)$

This proves $D_\text{KL} \geq 0$ and explains why:

  • Average of logs ≤ log of average
  • $E[\log X] \leq \log E[X]$ (since $-\log$ is convex)

12.6 Law of Total Expectation

$$E[X] = \sum_{j} E[X \mid Y = y_j] \, P(Y = y_j)$$

Used in mixture models: each component contributes $E[X \mid \text{component}_j]$, weighted by the mixing probability $\pi_j$.
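As a sketch with made-up mixture parameters, the identity is one line of NumPy, and a Monte-Carlo sample from the mixture agrees:

```python
import numpy as np

rng = np.random.default_rng(0)
pi_mix = np.array([0.3, 0.7])   # mixing probabilities P(Y = y_j)
mu = np.array([-2.0, 5.0])      # component means E[X | Y = y_j]

# Law of total expectation: E[X] = Σⱼ E[X | Y=yⱼ] P(Y=yⱼ)
E_total = np.sum(pi_mix * mu)   # 0.3·(-2) + 0.7·5 = 2.9

# Monte-Carlo check: sample a component j, then X ~ N(μⱼ, 1)
comp = rng.choice(2, size=200_000, p=pi_mix)
samples = rng.normal(mu[comp], 1.0)
print(E_total, samples.mean())  # both ≈ 2.9
```

The per-component variance (here 1.0) does not affect the mean — only the mixing weights and component means enter the sum.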

12.7 Inclusion-Exclusion Principle

$$P\!\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots$$

Compact: $\displaystyle P\!\left(\bigcup_{i=1}^n A_i\right) = \sum_{\emptyset \neq S \subseteq [n]} (-1)^{|S|+1} P\!\left(\bigcap_{i \in S} A_i\right)$
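The compact subset form can be verified by brute force on concrete sets — here divisibility events in $\{1,\dots,100\}$, a small self-checking example:

```python
from itertools import combinations

U = range(1, 101)
# A = multiples of 2, B = multiples of 3, C = multiples of 5
events = [{x for x in U if x % d == 0} for d in (2, 3, 5)]

# Direct: P(A ∪ B ∪ C) by counting the union
direct = len(set.union(*events)) / 100

# Inclusion–exclusion: sum over nonempty subsets S with sign (-1)^(|S|+1)
incl_excl = sum(
    (-1) ** (r + 1) * len(set.intersection(*S)) / 100
    for r in range(1, len(events) + 1)
    for S in combinations(events, r)
)

print(direct, incl_excl)   # both 0.74
```

With $n$ events the compact form ranges over $2^n - 1$ subsets — the alternating "singles − pairs + triples − …" pattern in one sum.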

Code cell 31

# ══════════════════════════════════════════════════════════════════
# 12  SUMMATION IN PROBABILITY & STATISTICS
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("SUMMATION IN PROBABILITY & STATISTICS")
print("=" * 65)

# --- Expectation and Variance ---
print("\n12.1–12.2  Expectation & Variance")
print("─" * 50)

# Fair die
x = np.arange(1, 7)
p = np.ones(6) / 6

E_X = np.sum(x * p)
E_X2 = np.sum(x**2 * p)
Var_X = np.sum((x - E_X)**2 * p)
Var_shortcut = E_X2 - E_X**2

print(f"\n  Fair die: X ∈ {list(x)}")
print(f"  E[X]  = Σ xᵢpᵢ  = {E_X:.4f}  (analytic: 3.5)")
print(f"  E[X²] = Σ xᵢ²pᵢ = {E_X2:.4f}")
print(f"  Var(X) = Σ (xᵢ-μ)²pᵢ = {Var_X:.4f}")
print(f"  Var(X) = E[X²]-E[X]² = {Var_shortcut:.4f}  ✓")
assert np.isclose(Var_X, Var_shortcut)

# --- Entropy ---
print(f"\n{'─' * 50}")
print("12.3  Entropy H(X) = -Σ pᵢ log pᵢ")
print("─" * 50)

def entropy(probs):
    """Shannon entropy in nats, handling p=0."""
    p = np.array(probs, dtype=np.float64)
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]))

distributions = {
    "Uniform(6)":     [1/6]*6,
    "Peaked":         [0.9, 0.02, 0.02, 0.02, 0.02, 0.02],
    "Deterministic":  [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "Binary fair":    [0.5, 0.5],
}

print(f"\n  {'Distribution':<18} {'H (nats)':<12} {'H (bits)':<12} {'Max H?'}")
print(f"  {'─'*18} {'─'*12} {'─'*12} {'─'*8}")
for name, p_dist in distributions.items():
    h = entropy(p_dist)
    h_bits = h / np.log(2)
    max_h = np.log(len(p_dist))
    print(f"  {name:<18} {h:<12.4f} {h_bits:<12.4f} {'= max' if np.isclose(h, max_h) else ''}")

# --- KL Divergence ---
print(f"\n{'─' * 50}")
print("12.4  KL Divergence — asymmetry demonstration")
print("─" * 50)

def kl_divergence(p, q):
    """KL(P || Q) = Σ pᵢ log(pᵢ/qᵢ)"""
    p, q = np.array(p), np.array(q)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

P = [0.4, 0.3, 0.2, 0.1]
Q = [0.25, 0.25, 0.25, 0.25]

kl_pq = kl_divergence(P, Q)
kl_qp = kl_divergence(Q, P)
cross_h = -np.sum(np.array(P) * np.log(np.array(Q)))

print(f"\n  P = {P}")
print(f"  Q = {Q}")
print(f"  KL(P || Q)   = {kl_pq:.6f}")
print(f"  KL(Q || P)   = {kl_qp:.6f}")
print(f"  KL(P||Q) ≠ KL(Q||P): asymmetric! ✓")
print(f"\n  Cross-entropy H(P,Q) = -Σ pᵢ log qᵢ = {cross_h:.6f}")
print(f"  H(P) + KL(P||Q) = {entropy(P) + kl_pq:.6f}")
print(f"  H(P,Q) = H(P) + KL(P||Q): {np.isclose(cross_h, entropy(P) + kl_pq)} ✓")

# --- Jensen's Inequality ---
print(f"\n{'─' * 50}")
print("12.5  Jensen's Inequality: φ(E[X]) ≤ E[φ(X)]  for convex φ")
print("─" * 50)

x_vals = np.array([1.0, 4.0, 9.0])
p_vals = np.array([0.2, 0.5, 0.3])

# x² is convex
E_x = np.sum(x_vals * p_vals)
E_x2 = np.sum(x_vals**2 * p_vals)
Ex_squared = E_x**2

print(f"\n  φ(x) = x² (convex)")
print(f"  E[X]    = {E_x:.2f}")
print(f"  (E[X])² = {Ex_squared:.4f}")
print(f"  E[X²]   = {E_x2:.4f}")
print(f"  (E[X])² ≤ E[X²]: {Ex_squared:.4f}{E_x2:.4f}{Ex_squared <= E_x2} ✓")

# -log is convex: proves KL ≥ 0
E_log = np.sum(p_vals * np.log(x_vals))
log_E = np.log(E_x)
print(f"\n  φ(x) = -log(x) (convex) → E[log X] ≤ log E[X]")
print(f"  E[log X]  = {E_log:.4f}")
print(f"  log E[X]  = {log_E:.4f}")
print(f"  E[log X] ≤ log E[X]: {E_log <= log_E} ✓")

# --- Inclusion-Exclusion ---
print(f"\n{'─' * 50}")
print("12.7  Inclusion-Exclusion Principle")
print("─" * 50)

# Three events with known overlaps
pA, pB, pC = 0.5, 0.4, 0.3
pAB, pAC, pBC = 0.2, 0.15, 0.1
pABC = 0.05

p_union = pA + pB + pC - pAB - pAC - pBC + pABC
print(f"\n  P(A)=0.5, P(B)=0.4, P(C)=0.3")
print(f"  P(A∩B)=0.2, P(A∩C)=0.15, P(B∩C)=0.1, P(A∩B∩C)=0.05")
print(f"  P(A∪B∪C) = {pA}+{pB}+{pC} - {pAB}-{pAC}-{pBC} + {pABC}")
print(f"           = {p_union:.2f}")
print(f"  (Alternating signs: +singles, -pairs, +triples)")

13. Notation Variants and Conventions in ML

13.1 Subscript vs Superscript Indexing

Different ML subfields use different conventions:

| Convention | Example | Where used |
| --- | --- | --- |
| Subscript for samples | $x_i, y_i$ | Most ML papers |
| Superscript for samples | $x^{(i)}, y^{(i)}$ | Andrew Ng's courses, CS229 |
| Subscript for features | $x_j, w_j$ | Feature-level operations |
| Superscript for layers | $a^{[l]}, W^{[l]}$ | Neural network notation |
| Superscript for iteration | $\theta^{(t)}$ | Optimization algorithms |

13.2 Compact Notation Styles

| Notation | Meaning | Context |
| --- | --- | --- |
| $\sum_i$ | Sum over all valid $i$ | When the range is clear from context |
| $\sum_{x \in \mathcal{X}}$ | Sum over elements of a set | Probability, combinatorics |
| $\sum_{\mathbf{x}}$ | Sum over all configurations | Partition functions, graphical models |
| $\sum_{(x,y) \sim \mathcal{D}}$ | Sum over a dataset | Empirical risk minimization |

13.3 Neural Network Formulas in Σ Notation

Forward pass (fully connected layer):

$$z_j^{[l]} = \sum_{k=1}^{n_{l-1}} w_{jk}^{[l]} a_k^{[l-1]} + b_j^{[l]}$$

Matrix form: $\mathbf{z}^{[l]} = W^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}$

Backpropagation (error signal):

$$\delta_j^{[l]} = \sum_{k} w_{kj}^{[l+1]} \delta_k^{[l+1]} \cdot \sigma'(z_j^{[l]})$$

Gradient of loss w.r.t. weights:

$$\frac{\partial \mathcal{L}}{\partial w_{jk}^{[l]}} = \frac{1}{N}\sum_{i=1}^{N} \delta_j^{[l](i)} a_k^{[l-1](i)}$$

13.4 Implicit Summation in Matrix Notation

Many ML formulas hide summations inside matrix operations:

| Matrix form | Explicit summation |
| --- | --- |
| $\mathbf{y} = W\mathbf{x}$ | $y_i = \sum_j W_{ij} x_j$ |
| $L = \lVert \mathbf{y} - \hat{\mathbf{y}} \rVert^2$ | $L = \sum_i (y_i - \hat{y}_i)^2$ |
| $\text{tr}(AB)$ | $\sum_i \sum_j A_{ij} B_{ji}$ |
| $\mathbf{a}^\top\mathbf{b}$ | $\sum_i a_i b_i$ |
| $\lVert A \rVert_F^2$ | $\sum_i \sum_j A_{ij}^2$ |

Code cell 33

# ══════════════════════════════════════════════════════════════════
# 13  NOTATION VARIANTS & NEURAL NETWORK FORMULAS
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("NOTATION VARIANTS & NEURAL NETWORK FORMULAS")
print("=" * 65)

# --- 13.3: Neural network forward pass in Σ notation ---
print("\n13.3  Forward pass: z_j = Σ_k w_jk * a_k + b_j")
print("─" * 50)

np.random.seed(42)
n_in, n_out = 4, 3

# Layer weights and biases
W = np.random.randn(n_out, n_in) * 0.5
b = np.random.randn(n_out) * 0.1
a_prev = np.array([1.0, 0.5, -0.3, 0.8])

# Explicit summation (the Σ version)
z_explicit = np.zeros(n_out)
for j in range(n_out):
    s = 0.0
    for k in range(n_in):
        s += W[j, k] * a_prev[k]
    z_explicit[j] = s + b[j]

# Matrix form (implicit summation)
z_matrix = W @ a_prev + b

print(f"\n  Input a = {a_prev}")
print(f"  W shape: ({n_out}, {n_in})")
print(f"\n  {'Neuron j':<12} {'Σ_k w_jk*a_k + b_j':<22} {'W@a + b':<22} {'Match?'}")
print(f"  {'─'*12} {'─'*22} {'─'*22} {'─'*6}")
for j in range(n_out):
    print(f"  j={j:<9} {z_explicit[j]:<22.8f} {z_matrix[j]:<22.8f} {'✓' if np.isclose(z_explicit[j], z_matrix[j]) else '✗'}")

# --- 13.4: Hidden summations in matrix ops ---
print(f"\n{'─' * 50}")
print("13.4  Hidden summations inside matrix operations")
print("─" * 50)

A = np.array([[1, 2], [3, 4]], dtype=float)
B = np.array([[5, 6], [7, 8]], dtype=float)
x_vec = np.array([1.0, 2.0])

# trace(AB) = Σ_i Σ_j A_ij B_ji
trace_builtin = np.trace(A @ B)
trace_explicit = sum(A[i, j] * B[j, i] for i in range(2) for j in range(2))

# Frobenius norm: ||A||_F² = Σ_i Σ_j A_ij²
frob_builtin = np.linalg.norm(A, 'fro')**2
frob_explicit = sum(A[i, j]**2 for i in range(2) for j in range(2))

# dot product: a⊤b = Σ_i a_i b_i
dot_builtin = x_vec @ x_vec
dot_explicit = sum(x_vec[i]**2 for i in range(len(x_vec)))

print(f"\n  {'Operation':<25} {'Matrix form':<14} {'Σ notation':<14} {'Match?'}")
print(f"  {'─'*25} {'─'*14} {'─'*14} {'─'*6}")
print(f"  {'tr(AB)':<25} {trace_builtin:<14.4f} {trace_explicit:<14.4f} {'✓' if np.isclose(trace_builtin, trace_explicit) else '✗'}")
print(f"  {'||A||_F²':<25} {frob_builtin:<14.4f} {frob_explicit:<14.4f} {'✓' if np.isclose(frob_builtin, frob_explicit) else '✗'}")
print(f"  {'x⊤x':<25} {dot_builtin:<14.4f} {dot_explicit:<14.4f} {'✓' if np.isclose(dot_builtin, dot_explicit) else '✗'}")

# --- Backprop gradient formula ---
print(f"\n{'─' * 50}")
print("∂L/∂w_jk = (1/N) Σ_i δ_j^(i) a_k^(i)")
print("─" * 50)

N = 5  # batch size
deltas = np.random.randn(N, n_out)      # error signals δ
a_batch = np.random.randn(N, n_in)      # activations

# Explicit double sum
dW_explicit = np.zeros((n_out, n_in))
for j in range(n_out):
    for k in range(n_in):
        for i in range(N):
            dW_explicit[j, k] += deltas[i, j] * a_batch[i, k]
        dW_explicit[j, k] /= N

# Matrix form: (1/N) δᵀ a
dW_matrix = (1/N) * deltas.T @ a_batch

print(f"\n  Batch size N={N}, output={n_out}, input={n_in}")
print(f"  Max |dW_explicit - dW_matrix| = {np.max(np.abs(dW_explicit - dW_matrix)):.2e}")
print(f"  Triple sum ≡ matrix multiply: {np.allclose(dW_explicit, dW_matrix)} ✓")
print(f"\n  Key insight: The Σ notation makes index relationships explicit,")
print(f"  but matrix ops are what GPUs actually compute efficiently.")

14. Common Mistakes and Pitfalls

14.1 Index Errors

| Mistake | Why it's wrong | Correct |
| --- | --- | --- |
| $\sum_{i=1}^{n} x_i y_i = \sum_{i=1}^{n} x_i \cdot \sum_{i=1}^{n} y_i$ | Products don't distribute over $\Sigma$ | $\sum_i x_i y_i \neq (\sum_i x_i)(\sum_j y_j)$ |
| $\sum_{i=1}^{n} x_i = \sum_{j=1}^{n} x_i$ | Changed the index but not the term | Change both: $\sum_{j=1}^{n} x_j$ |
| $\sum_{i=0}^{n-1} a_i = \sum_{j=1}^{n} a_j$ | Wrong re-indexing | $\sum_{j=1}^{n} a_{j-1}$ |

14.2 Scope and Binding Errors

The index variable is bound — it has no meaning outside the sum:

$\left(\sum_{i=1}^{n} x_i\right) + i$ — $i$ is not defined outside the sum

This is analogous to loop variables in code: after `for i in range(n): total += x[i]`, you shouldn't rely on `i` holding a meaningful value.

14.3 Algebraic Mistakes

  • Sum of squares ≠ square of sum: $\sum x_i^2 \neq \left(\sum x_i\right)^2$
  • Sum of products ≠ product of sums: $\sum x_i y_i \neq \sum x_i \cdot \sum y_i$
  • Product of sums ≠ sum of products: $\prod (a_i + b_i) \neq \prod a_i + \prod b_i$
  • Log of sum ≠ sum of logs: $\log\left(\sum x_i\right) \neq \sum \log x_i$, BUT $\log\left(\prod x_i\right) = \sum \log x_i$

14.4 Numerical Pitfalls

| Pitfall | Example | Solution |
| --- | --- | --- |
| Underflow/overflow in products | $\prod_{i=1}^{1000} p_i$ for small $p_i$ | Work in log-space: $\sum \log p_i$ |
| Catastrophic cancellation | $\sum x_i^2 - n\bar{x}^2$ for large $x_i$ | Welford's online algorithm |
| Summing many small floats | $\sum_{i=1}^{10^6} 0.1$ | Kahan compensated summation |
| Softmax overflow | $e^{1000}$ | Subtract the max: $e^{x_i - \max(x)}$ |
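The first row of the table is worth seeing fail: a product of a thousand small probabilities underflows `float64` to exactly zero, while the log-space sum stays perfectly representable. A minimal sketch with synthetic probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.uniform(0.01, 0.2, size=1000)   # 1000 small "probabilities"

# Each factor ≤ 0.2, so Π pᵢ < 0.2¹⁰⁰⁰ ≈ 10⁻⁶⁹⁹ — far below the
# smallest float64 (~5·10⁻³²⁴), so the product underflows to 0.0.
direct = np.prod(p)

# Log-space: Σ log pᵢ is just a modest negative number.
log_sum = np.sum(np.log(p))

print(f"Π pᵢ     = {direct}")       # 0.0 — all information lost
print(f"Σ log pᵢ = {log_sum:.2f}")  # finite and usable
```

This is exactly why log-likelihoods, not likelihoods, are what ML code actually computes.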

14.5 Order of Summation Matters (When It Shouldn't)

In exact arithmetic, $a + b + c = c + b + a$. In floating-point, order matters due to rounding. Kahan summation compensates for the accumulated error.

Code cell 35

# ══════════════════════════════════════════════════════════════════
# 14  COMMON MISTAKES AND PITFALLS
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("COMMON MISTAKES AND PITFALLS")
print("=" * 65)

# --- Mistake 1: Σ(xy) ≠ (Σx)(Σy) ---
print("\n14.1  Σ xᵢyᵢ ≠ (Σ xᵢ)(Σ yᵢ)")
print("─" * 50)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.0, 6.0, 7.0, 8.0])

sum_xy = np.sum(x * y)
sum_x_times_sum_y = np.sum(x) * np.sum(y)

print(f"  x = {list(x)}, y = {list(y)}")
print(f"  Σ xᵢyᵢ      = {sum_xy:.1f}")
print(f"  (Σ xᵢ)(Σ yᵢ) = {sum_x_times_sum_y:.1f}")
print(f"  Equal? {np.isclose(sum_xy, sum_x_times_sum_y)} ← THEY ARE NOT EQUAL!")
covariance_bias = (sum_xy / len(x)) - (np.mean(x) * np.mean(y))
print(f"  Difference matters: this gap IS the covariance = {covariance_bias:.4f}")

# --- Mistake 2: Σ x² ≠ (Σ x)² ---
print(f"\n{'─' * 50}")
print("14.3  Σ xᵢ² ≠ (Σ xᵢ)²")
print("─" * 50)

sum_sq = np.sum(x**2)
sq_sum = np.sum(x)**2
print(f"  x = {list(x)}")
print(f"  Σ xᵢ² = {sum_sq:.1f}")
print(f"  (Σ xᵢ)² = {sq_sum:.1f}")
print(f"  Related via variance: Σ xᵢ² - (Σ xᵢ)²/n = n·Var(X) = {sum_sq - sq_sum/len(x):.1f}")

# --- Numerical: Kahan summation ---
print(f"\n{'─' * 50}")
print("14.4–14.5  Kahan Compensated Summation")
print("─" * 50)

def kahan_sum(values):
    """Kahan compensated summation — reduces floating-point error."""
    s = 0.0
    c = 0.0  # compensation for lost low-order bits
    for v in values:
        y = v - c
        t = s + y
        c = (t - s) - y   # algebraically zero, but captures rounding error
        s = t
    return s

# Sum of 10 million 0.1's (should be 1,000,000.0)
n = 10_000_000
values = [0.1] * n

naive = sum(values)
compensated = kahan_sum(values)
numpy_result = np.sum(np.full(n, 0.1))
exact = 1_000_000.0

print(f"\n  Summing {n:,} copies of 0.1 (exact answer: {exact:,.1f})")
print(f"  {'Method':<20} {'Result':<22} {'Error'}")
print(f"  {'─'*20} {'─'*22} {'─'*14}")
print(f"  {'Naive sum()':<20} {naive:<22.10f} {abs(naive - exact):<14.10f}")
print(f"  {'Kahan sum':<20} {compensated:<22.10f} {abs(compensated - exact):<14.10f}")
print(f"  {'numpy.sum':<20} {numpy_result:<22.10f} {abs(numpy_result - exact):<14.10f}")

# --- Welford's online variance ---
print(f"\n{'─' * 50}")
print("14.4  Welford's Algorithm (numerically stable variance)")
print("─" * 50)

def welford_variance(data):
    """Welford's one-pass numerically stable variance."""
    n = 0
    mean = 0.0
    M2 = 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        delta2 = x - mean
        M2 += delta * delta2
    return M2 / n if n > 0 else 0.0

# Large values with small differences → catastrophic cancellation
data_shifted = np.array([1e8 + 1, 1e8 + 2, 1e8 + 3, 1e8 + 4, 1e8 + 5], dtype=np.float64)

naive_var = np.mean(data_shifted**2) - np.mean(data_shifted)**2
welford_var = welford_variance(data_shifted)
numpy_var = np.var(data_shifted)

print(f"\n  Data: [1e8+1, 1e8+2, ..., 1e8+5]  (true variance = 2.0)")
print(f"  {'Method':<25} {'Variance':<18} {'Error'}")
print(f"  {'─'*25} {'─'*18} {'─'*10}")
print(f"  {'Naive (E[X²]-E[X]²)':<25} {naive_var:<18.6f} {abs(naive_var - 2.0):<10.6f}")
print(f"  {'Welford online':<25} {welford_var:<18.6f} {abs(welford_var - 2.0):<10.6f}")
print(f"  {'numpy.var':<25} {numpy_var:<18.6f} {abs(numpy_var - 2.0):<10.6f}")

15. Practice Exercises

Exercise 1: Closed-Form Verification

Prove and verify numerically: $\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}$ for $n = 100, 1000, 10000$.

Exercise 2: Telescoping Sum

Simplify $\sum_{i=1}^{n} \left(\frac{1}{i} - \frac{1}{i+1}\right)$ and verify.

Exercise 3: Double Sum Interchange

Show that $\sum_{i=1}^{m}\sum_{j=1}^{n} a_i b_j = \left(\sum_{i=1}^{m} a_i\right)\left(\sum_{j=1}^{n} b_j\right)$.

Exercise 4: Log-Sum-Exp

Implement the numerically stable version of $\log\left(\sum_{i=1}^{n} e^{x_i}\right)$ and compare it with the naive version on extreme inputs.

Exercise 5: Implement Softmax Gradient

Given softmax output $\sigma_i$, show that $\frac{\partial \sigma_i}{\partial z_j} = \sigma_i(\delta_{ij} - \sigma_j)$, where $\delta_{ij}$ is the Kronecker delta.

Exercise 6: Cauchy-Schwarz Application

Verify that $\left(\sum_i a_i b_i\right)^2 \leq \left(\sum_i a_i^2\right)\left(\sum_i b_i^2\right)$ for random vectors, and find when equality holds.

Exercise 7: Einstein Summation

Convert these to np.einsum and verify:

  1. Batch matrix multiply: $C_{bij} = \sum_k A_{bik} B_{bkj}$
  2. Bilinear form: $s = \sum_i \sum_j x_i M_{ij} y_j$
  3. Batch outer product: $O_{bij} = a_{bi} \, b_{bj}$

Exercise 8: Convergence Analysis

For the series $\sum_{n=1}^{\infty} \frac{n^2}{2^n}$, apply the ratio test and compute partial sums to estimate the limit.

Code cell 37

# ══════════════════════════════════════════════════════════════════
# 15  PRACTICE EXERCISES — SOLUTIONS
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("PRACTICE EXERCISES — SOLUTIONS")
print("=" * 65)

# --- Exercise 1: Closed-form for Σ i² ---
print("\nExercise 1: Σ i² = n(n+1)(2n+1)/6")
print("─" * 50)
for n in [100, 1000, 10000]:
    brute = sum(i**2 for i in range(1, n+1))
    closed = n * (n + 1) * (2*n + 1) // 6
    print(f"  n={n:<6}: brute={brute:<15}, closed={closed:<15}, match={brute == closed} ✓")

# --- Exercise 2: Telescoping ---
print(f"\nExercise 2: Σ (1/i - 1/(i+1)) = 1 - 1/(n+1)")
print("─" * 50)
for n in [10, 100, 1000]:
    tele = sum(1/i - 1/(i+1) for i in range(1, n+1))
    closed_val = 1 - 1/(n+1)
    print(f"  n={n:<6}: sum={tele:.10f}, 1-1/(n+1)={closed_val:.10f}, match={np.isclose(tele, closed_val)} ✓")

# --- Exercise 3: Double sum separation ---
print(f"\nExercise 3: ΣᵢΣⱼ aᵢbⱼ = (Σᵢ aᵢ)(Σⱼ bⱼ)")
print("─" * 50)
a = np.array([2, 3, 5, 7])
b = np.array([1, 4, 6])
double_sum = sum(a[i] * b[j] for i in range(len(a)) for j in range(len(b)))
product_sums = np.sum(a) * np.sum(b)
print(f"  a = {list(a)}, b = {list(b)}")
print(f"  ΣᵢΣⱼ aᵢbⱼ = {double_sum}, (Σ aᵢ)(Σ bⱼ) = {product_sums}")
print(f"  Works because bⱼ doesn't depend on i, so Σⱼ bⱼ factors out ✓")

# --- Exercise 4: Log-Sum-Exp stability ---
print(f"\nExercise 4: Numerically stable log-sum-exp")
print("─" * 50)
x_extreme = np.array([1000.0, 1000.1, 999.8])
x_max = np.max(x_extreme)

# Naive: np.exp(1000) overflows to inf (NumPy emits a warning rather than
# raising), so log(Σ eˣ) comes out as inf — the answer is lost
with np.errstate(over='ignore'):
    naive_lse = np.log(np.sum(np.exp(x_extreme)))
# Stable version
stable_lse = x_max + np.log(np.sum(np.exp(x_extreme - x_max)))
print(f"  x = {list(x_extreme)}")
print(f"  Naive log(Σ eˣ): {naive_lse}")
print(f"  Stable max + log(Σ e^(x-max)): {stable_lse:.6f}")
print(f"  The max-subtraction trick preserves the result without overflow ✓")

# --- Exercise 5: Softmax Jacobian ---
print(f"\nExercise 5: Softmax Jacobian ∂σᵢ/∂zⱼ = σᵢ(δᵢⱼ - σⱼ)")
print("─" * 50)
z = np.array([2.0, 1.0, 0.5])
sigma = np.exp(z - np.max(z))
sigma = sigma / np.sum(sigma)

# Analytic Jacobian
n = len(z)
jacobian = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i == j:
            jacobian[i, j] = sigma[i] * (1 - sigma[j])
        else:
            jacobian[i, j] = -sigma[i] * sigma[j]

# Numerical Jacobian for verification
eps = 1e-7
numerical_jac = np.zeros((n, n))
for j in range(n):
    z_plus = z.copy(); z_plus[j] += eps
    z_minus = z.copy(); z_minus[j] -= eps
    s_plus = np.exp(z_plus - np.max(z_plus)); s_plus /= s_plus.sum()
    s_minus = np.exp(z_minus - np.max(z_minus)); s_minus /= s_minus.sum()
    numerical_jac[:, j] = (s_plus - s_minus) / (2 * eps)

print(f"  z = {list(z)}")
print(f"  σ = [{', '.join(f'{s:.4f}' for s in sigma)}]")
print(f"  Max |analytic - numerical| = {np.max(np.abs(jacobian - numerical_jac)):.2e} ✓")

# --- Exercise 6: Cauchy-Schwarz ---
print(f"\nExercise 6: Cauchy-Schwarz (Σ aᵢbᵢ)² ≤ (Σ aᵢ²)(Σ bᵢ²)")
print("─" * 50)
np.random.seed(7)
a_rand = np.random.randn(10)
b_rand = np.random.randn(10)
lhs = np.sum(a_rand * b_rand)**2
rhs = np.sum(a_rand**2) * np.sum(b_rand**2)
print(f"  Random a, b: LHS = {lhs:.6f} ≤ RHS = {rhs:.6f}: {lhs <= rhs + 1e-10} ✓")
# Equality when a = cb
b_prop = 3.0 * a_rand
lhs_eq = np.sum(a_rand * b_prop)**2
rhs_eq = np.sum(a_rand**2) * np.sum(b_prop**2)
print(f"  b = 3a:       LHS = {lhs_eq:.6f} = RHS = {rhs_eq:.6f}: {np.isclose(lhs_eq, rhs_eq)} ✓ (equality!)")

# --- Exercise 7: Einstein summation ---
print(f"\nExercise 7: Einstein summation with np.einsum")
print("─" * 50)
B_size, I, K, J = 2, 3, 4, 5
A7 = np.random.randn(B_size, I, K)
B7 = np.random.randn(B_size, K, J)
C_batch = np.einsum('bik,bkj->bij', A7, B7)
C_loop = np.array([A7[b] @ B7[b] for b in range(B_size)])
print(f"  Batch matmul: einsum('bik,bkj->bij') match: {np.allclose(C_batch, C_loop)} ✓")

x7 = np.random.randn(4)
M7 = np.random.randn(4, 5)
y7 = np.random.randn(5)
bilinear = np.einsum('i,ij,j->', x7, M7, y7)
bilinear_loop = x7 @ M7 @ y7
print(f"  Bilinear form: einsum('i,ij,j->') match: {np.isclose(bilinear, bilinear_loop)} ✓")

a7 = np.random.randn(B_size, 3)
b7 = np.random.randn(B_size, 4)
outer_batch = np.einsum('bi,bj->bij', a7, b7)
outer_loop = np.array([np.outer(a7[b], b7[b]) for b in range(B_size)])
print(f"  Batch outer:   einsum('bi,bj->bij') match: {np.allclose(outer_batch, outer_loop)} ✓")

# --- Exercise 8: Convergence of Σ n²/2ⁿ ---
print(f"\nExercise 8: Σ n²/2ⁿ convergence (ratio test + partial sums)")
print("─" * 50)
# Ratio test: a_{n+1}/a_n = (n+1)²/(2n²) → 1/2 < 1 → converges
print("  Ratio test: |(n+1)²/2^(n+1)| / |n²/2ⁿ| = (n+1)²/(2n²) → 1/2 < 1 ✓")
# Exact sum: Σ n²xⁿ = x(1+x)/(1-x)³ at x=1/2 → (1/2)(3/2)/(1/2)³ = 6
partial = [sum(k**2 / 2**k for k in range(1, n+1)) for n in [5, 10, 20, 50]]
print(f"  Exact value: x(1+x)/(1-x)³ at x=½ = 6.0")
print(f"  Partial sums: S₅={partial[0]:.6f}, S₁₀={partial[1]:.6f}, "
      f"S₂₀={partial[2]:.6f}, S₅₀={partial[3]:.6f}")
print(f"  Converges to 6.0 ✓")

16. Why This Matters for AI

Impact Table

| AI Component | Key Formula | Summation/Product Role |
| --- | --- | --- |
| Forward pass | $z = \sum_k w_k x_k + b$ | Linear combination |
| Loss function | $L = \frac{1}{N}\sum_{i=1}^{N} \ell(y_i, \hat{y}_i)$ | Empirical risk |
| Softmax | $\sigma_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ | Normalization |
| Attention | $\text{out} = \sum_j \alpha_j v_j$ | Weighted combination |
| Cross-entropy | $L = -\sum_c y_c \log \hat{y}_c$ | Classification loss |
| Likelihood | $\mathcal{L} = \prod_{i=1}^{N} p(x_i \mid \theta)$ | Maximum likelihood estimation |
| Gradient descent | $\theta_{t+1} = \theta_t - \eta \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla \ell_i$ | Mini-batch SGD |
| Batch norm | $\mu = \frac{1}{B}\sum_i x_i, \; \sigma^2 = \frac{1}{B}\sum_i (x_i - \mu)^2$ | Normalization statistics |
| Entropy regularization | $H(\pi) = -\sum_a \pi(a) \log \pi(a)$ | Exploration in RL |
| ELBO | $\mathcal{L} = E_q[\log p(x,z)] - E_q[\log q(z)]$ | Variational inference |
| Perplexity | $\text{PPL} = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T} \log p(w_t)\right)$ | Language model eval |

Conceptual Bridge

┌─────────────────────────────────────────────────────────┐
│  SUMMATION & PRODUCT NOTATION                           │
│  ═══════════════════════════════════════                 │
│                                                         │
│  Σ (Summation)          Π (Product)                     │
│    │                      │                             │
│    ├─ Forward pass        ├─ Likelihood                 │
│    ├─ Loss functions      ├─ Posterior (Bayes)          │
│    ├─ Gradient averages   ├─ Partition functions        │
│    ├─ Attention weights   └─ Factorial / combinatorics  │
│    ├─ Batch statistics                                  │
│    └─ Einstein notation ── einsum ── GPU kernels        │
│                                                         │
│  "If you can read a Σ, you can read any ML paper."     │
└─────────────────────────────────────────────────────────┘

Key Takeaways

  1. Every ML model is built on summations — linear combinations, averages, expectations, weighted sums
  2. Products appear in likelihoods, posteriors, and partition functions — always convert to log-sums for computation
  3. Einstein summation (np.einsum) is the universal language for tensor operations in modern deep learning
  4. Numerical stability (log-sum-exp, Kahan summation, Welford's algorithm) separates working code from broken code
  5. Convergence theory (ratio test, p-series, Robbins-Monro) determines when your training will actually converge

References

  • Knuth, D.E. The Art of Computer Programming, Vol. 1, §1.2.3 — Sums and Products (1968)
  • Graham, Knuth, Patashnik. Concrete Mathematics, Ch. 2 — Sums (1994)
  • Goodfellow, Bengio, Courville. Deep Learning, Ch. 2 — Linear Algebra notation (2016)
  • Vaswani et al. "Attention Is All You Need" — Σ in scaled dot-product attention (2017)
  • Einstein, A. "The Foundation of the General Theory of Relativity" — implicit summation convention (1916)

Code cell 39

# ══════════════════════════════════════════════════════════════════
# 16  FINAL SUMMARY — COMPLETE NOTATION REFERENCE
# ══════════════════════════════════════════════════════════════════

print("=" * 65)
print("COMPLETE NOTATION QUICK REFERENCE")
print("=" * 65)

reference = {
    "Σᵢ₌₁ⁿ aᵢ":       "Basic summation: a₁ + a₂ + ... + aₙ",
    "Πᵢ₌₁ⁿ aᵢ":       "Basic product:   a₁ · a₂ · ... · aₙ",
    "Σᵢ∈S f(i)":       "Sum over index set S",
    "ΣΣ aᵢⱼ":          "Double sum (can interchange if abs convergent)",
    "Aᵢⱼ Bⱼₖ":         "Einstein: implicit sum over repeated j",
    "log Π = Σ log":    "Products → sums via logarithm",
    "Σ (f(k)-f(k+1))":  "Telescoping: collapses to f(1) - f(n+1)",
    "max + log Σ exp":   "Log-sum-exp trick (numerical stability)",
}

print(f"\n  {'Notation':<22} {'Meaning'}")
print(f"  {'─'*22} {'─'*43}")
for notation, meaning in reference.items():
    print(f"  {notation:<22} {meaning}")

# AI formula that uses ALL the concepts
print(f"\n{'=' * 65}")
print("PUTTING IT ALL TOGETHER: Transformer Loss")
print("=" * 65)
print("""
  The cross-entropy loss for a Transformer language model:

    L = -(1/T) Σₜ₌₁ᵀ log( exp(zₜ[yₜ]) / Σⱼ₌₁ⱽ exp(zₜ[j]) )

  This single formula uses:
    ✓ Summation (outer Σ over tokens, inner Σ over vocab)
    ✓ Product   (likelihood Π → log converts to Σ)
    ✓ Log-sum-exp (numerical stability of softmax)
    ✓ Einstein notation (in the QKᵀ attention computation)
    ✓ Double sums (batch × sequence × head dimensions)
    ✓ Convergence theory (learning rate schedule)
    ✓ Jensen's inequality (proves KL ≥ 0 in ELBO)

  Every section of this chapter appears in this one equation.
""")

print("=" * 65)
print("END OF CHAPTER: Summation and Product Notation")
print("=" * 65)