Math for LLMs

Summation and Product Notation

Mathematical Foundations / Summation and Product Notation

Notes

"Summation and product notation are the syntax of quantitative AI reasoning - the grammar in which every loss function, every gradient, every attention score, and every probability distribution is expressed. Fluency with \Sigma and \Pi is the difference between reading ML papers as fluid prose and decoding them symbol by symbol."

Overview

Summation notation (\Sigma) and product notation (\Pi) are compact symbolic languages for expressing repeated addition and multiplication over collections of terms. They are not merely abbreviations - they are precise mathematical objects with well-defined rules for manipulation, interchange, and transformation.

For AI and LLMs specifically, virtually every formula you will encounter is written in \Sigma or \Pi notation: loss functions, gradients, attention scores, probability distributions, normalisation constants, perplexity, entropy, KL divergence, matrix multiplication. Without mastery of these notations, every formula requires line-by-line decoding; with it, mathematical expressions read like natural language.

This chapter covers summation and product notation from first principles through to the specific patterns that appear in modern transformer-based AI systems. Every concept is grounded in how it appears in real machine learning computation.

Prerequisites

  • Basic arithmetic operations (addition, multiplication, exponentiation)
  • Familiarity with sets and index notation
  • Functions and mappings (previous chapter)
  • Logarithms and exponentials
  • Familiarity with Python/NumPy (for companion notebook)

Companion Notebooks

Notebook     | Description
theory.ipynb | Interactive code: summation formulas, product identities, convergence demos, einsum operations

Learning Objectives

After completing this section, you will:

  • Read and write summation (\Sigma) and product (\Pi) notation fluently in any ML context
  • Apply all algebraic properties: linearity, splitting, reindexing, interchange, telescoping
  • Derive closed-form results for arithmetic, geometric, and power series
  • Convert between products and sums using logarithms (the foundation of log-likelihood)
  • Recognize Einstein summation convention as the shorthand behind tensor notation and einsum
  • Manipulate double sums, change summation order, and handle triangular index regions
  • Determine convergence of infinite series using ratio, root, comparison, and integral tests
  • Decompose every major AI formula (softmax, cross-entropy, attention, perplexity, BM25) into its summation/product structure
  • Identify and avoid the top 10 summation/product mistakes that cause bugs in ML code

1. Intuition

1.1 What Is Summation and Product Notation?

Summation notation (\Sigma) and product notation (\Pi) are compact symbolic languages for expressing repeated addition and multiplication over collections of terms. They are the mathematical equivalent of a for loop - but with algebraic structure that allows formal manipulation.

Without them, writing the loss over a million training examples requires a million terms. With them, one line:

L = -\frac{1}{N} \sum_{i=1}^{N} \log P_\theta(y_i \mid x_i)

This single expression encodes: "for each of the N training examples, compute the log-probability of the correct label y_i given input x_i under model parameters \theta, negate, and average."

They are not merely abbreviations - they are precise mathematical objects with well-defined rules for manipulation, interchange, and transformation. You can:

  • Split a sum into parts: \sum_{i=1}^{100} f(i) = \sum_{i=1}^{50} f(i) + \sum_{i=51}^{100} f(i)
  • Factor out constants: \sum_{i=1}^{n} c \cdot f(i) = c \cdot \sum_{i=1}^{n} f(i)
  • Interchange order: \sum_i \sum_j f(i,j) = \sum_j \sum_i f(i,j) (under certain conditions)
  • Convert products to sums: \log \prod_i f(i) = \sum_i \log f(i)
  • Telescope: \sum_{i=1}^{n} (f(i+1) - f(i)) = f(n+1) - f(1)

For AI, fluency with \Sigma and \Pi is non-negotiable. Virtually every formula in machine learning is written in summation or product notation - loss functions, gradients, attention scores, probability distributions, normalisation constants. The researcher who reads these fluently thinks in terms of structure; the one who doesn't decodes symbol by symbol and loses the forest for the trees.

THE NOTATION AS A COMPRESSION TOOL
=======================================================================

Without notation (N = 5):
L = -(1/5)(log P(y_1|x_1) + log P(y_2|x_2) + log P(y_3|x_3) + log P(y_4|x_4) + log P(y_5|x_5))

With notation (N = anything):
L = -(1/N) \sum_{i=1}^{N} log P(y_i | x_i)

The notation doesn't just save space - it reveals structure:
  - The 1/N shows it's an average
  - The log shows we're in log-probability space
  - The \sum shows we're aggregating over examples
  - The index i shows which variable changes

1.2 Why This Notation Exists

The development of summation notation was driven by several fundamental needs:

Human working memory is finite. Writing out 10,000 individual terms is impossible and obscures structure. A mathematician needs to reason about patterns in sums, not individual terms. Compact notation makes patterns visible.

Notation reveals structure. Consider \sum_{i=1}^{n} i^2. This immediately communicates "sum of squares from 1 to n" - the pattern is evident in a way that 1 + 4 + 9 + 16 + \ldots only suggests. The reader can immediately ask: does it have a closed form? Is it O(n^3)? How does it relate to \int_0^n x^2 \, dx?

Enables algebraic manipulation. Once expressed in \Sigma/\Pi notation, sums can be split, combined, reindexed, bounded, and transformed using precise algebraic rules. This is how mathematicians prove things about sums, not by expanding and hand-checking.

Generalises naturally. The same notation scales from finite sums to infinite series to integrals:

\sum_{i=1}^{n} f(i) \quad \longrightarrow \quad \sum_{i=1}^{\infty} f(i) \quad \longrightarrow \quad \int_1^{\infty} f(x) \, dx

Same conceptual framework, deeper mathematics.

Historical necessity. As mathematics moved from specific numerical examples to general theorems about arbitrary n, notation had to scale. Gauss couldn't write "1 + 2 + 3 + \ldots + 100 = 5050" for every value - he needed \sum_{i=1}^{n} i = n(n+1)/2.

1.3 Where These Appear in AI

Every major formula in machine learning is a sum or product (or both). Here is a comprehensive inventory:

Loss functions (sums):

L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{v \in V} \mathbb{1}[y_i = v] \log P(v \mid x_i)

This is the categorical cross-entropy loss: a double sum over training examples (i) and vocabulary tokens (v).

Softmax normalisation (sum in denominator):

P(v \mid x) = \frac{\exp(z_v)}{\sum_{v' \in V} \exp(z_{v'})}

The denominator is a sum over the entire vocabulary - the partition function that ensures probabilities sum to 1.

Attention mechanism (weighted sum):

\text{Attn}(Q, K, V)_i = \sum_{j=1}^{n} \alpha_{ij} V_j, \qquad \alpha_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d_k})}{\sum_{\ell=1}^{n} \exp(q_i \cdot k_\ell / \sqrt{d_k})}

Each attention output is a weighted sum of value vectors, where the weights \alpha_{ij} themselves involve summation in their softmax normalisation.

Dot product (sum of elementwise products):

q \cdot k = \sum_{i=1}^{d_k} q_i k_i

The fundamental operation in attention: a sum of d_k products.

Matrix multiplication (sum over shared dimension):

(AB)_{ij} = \sum_{k=1}^{d} A_{ik} B_{kj}

Every linear layer, every projection, every attention computation involves matrix multiplication - which is itself a collection of sums.

Perplexity (product converted to sum via log):

\text{PPL} = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(t_i \mid t_1, \ldots, t_{i-1})\right)

The standard evaluation metric for language models: an exponentiated average of log-probabilities.

Sequence likelihood (product):

P(\text{sequence}) = \prod_{i=1}^{n} P(t_i \mid t_1, \ldots, t_{i-1})

The chain rule of probability applied to a sequence - a product of conditional probabilities.

Joint probability of independent events (product):

P(A_1 \cap A_2 \cap \ldots \cap A_n) = \prod_{i=1}^{n} P(A_i)

BM25 retrieval score (weighted sum):

\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f_{t,d} \cdot (k_1 + 1)}{f_{t,d} + k_1\left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}

The standard sparse retrieval function in RAG systems: a sum over query terms, each contributing an IDF-weighted term frequency score.

BPE merge cost (sum over pairs):

\text{Cost} = \sum_{(a,b)} \text{count}(a,b) \cdot \text{merge\_cost}(a,b)

1.4 The Index as a Variable

The index variable (the i in \sum_i) is a bound variable - it has no meaning outside the sum, just as the variable of integration has no meaning outside the integral.

\sum_{i=1}^{n} i^2 = \sum_{j=1}^{n} j^2 = \sum_{k=1}^{n} k^2 = \sum_{\star=1}^{n} \star^2

All four expressions are identical. The name of the index does not matter. This is called alpha-equivalence: renaming bound variables preserves meaning. It is the same principle in:

  • Summation: \sum_{i=1}^{n} f(i) = \sum_{j=1}^{n} f(j)
  • Integration: \int_0^1 x \, dx = \int_0^1 t \, dt
  • Programming: sum(f(i) for i in range(n)) = sum(f(j) for j in range(n))
  • Logic: \forall x: P(x) is the same as \forall y: P(y)

Critical for AI. When multiple sums interact, index names carry essential information about which sum they belong to:

\sum_i \alpha_{ij} \neq \sum_j \alpha_{ij}

The left side sums over the first index (rows), producing a result that still depends on j. The right side sums over the second index (columns), producing a result that depends on i. In attention: \sum_j \alpha_{ij} = 1 (attention weights for query i sum to 1), but \sum_i \alpha_{ij} gives the total attention received by key position j - a completely different quantity.
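The distinction is easy to check numerically. Below is a minimal NumPy sketch (toy random logits, not a real model) contrasting \sum_j \alpha_{ij} with \sum_i \alpha_{ij}:

import numpy as np

# Toy attention weights: 3 queries attending over 4 keys (random scores, softmax per row).
rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 4))                          # stand-in for q_i . k_j / sqrt(d_k)
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

print(alpha.sum(axis=1))   # sum over j: each query's weights -> [1. 1. 1.]
print(alpha.sum(axis=0))   # sum over i: total attention received by each key (need not be 1)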

1.5 Historical Timeline

HISTORICAL DEVELOPMENT OF SUMMATION NOTATION
=======================================================================

~1800 BCE  Egypt/Babylon    Arithmetic series computed via specific examples;
                            no general notation; "add these five numbers"

~250 BCE   Archimedes       Summation of geometric series; area under parabola
                            via method of exhaustion; still no compact notation

1593       Viete            Introduced systematic symbolic algebra; first use
                            of letters for unknowns; paved way for notation

1675       Leibniz          Introduced \int as elongated S for "summa" (sum);
                            integration conceived as continuous summation

18th c.    Euler            Introduced \Sigma (capital sigma) for discrete summation;
                            standardised the notation we use today; wrote
                            \sum_{n=1}^{\infty} 1/n^2 = \pi^2/6 (Basel problem, 1735)

1777-1855  Gauss            Master of summation manipulation; derived closed
                            forms for many classical sums; the "prince of
                            mathematicians" famously summed 1+2+...+100 as a child

1821       Cauchy           Rigorous theory of series convergence; defined when
                            infinite sums are valid; \epsilon-\delta framework

1854       Riemann          Riemann sum -> integral; summation as the formal
                            foundation of integration theory

1916       Einstein         Einstein summation convention: repeated indices
                            imply summation; revolutionised tensor notation

20th c.    Standard         \Sigma and \Pi become universal mathematical notation;
                            every textbook, every paper

2016+      Modern ML        Every NeurIPS/ICML/ICLR paper assumes fluency;
                            torch.einsum brings Einstein notation to code

1.6 The Relationship to Integration

The connection between summation and integration is not merely an analogy - it is a deep mathematical fact. The Riemann integral is defined as a limit of sums:

\int_a^b f(x) \, dx = \lim_{n \to \infty} \sum_{i=1}^{n} f(x_i^*) \Delta x

where \Delta x = (b-a)/n and x_i^* is a sample point in the i-th subinterval.

DISCRETE <--> CONTINUOUS CORRESPONDENCE
=======================================================================

SUMMATION                          INTEGRATION
----------                         -----------
\sum_{i=1}^n f(i)            ->      \int_1^n f(x) dx

Index i \in {1, 2, ..., n}   ->      x \in [1, n]  (continuous variable)
Step size = 1                ->      Infinitesimal dx
\sum -> \int as step -> 0            (Riemann sum definition)

\prod_{i=1}^n f(i)           ->      exp(\int_1^n ln f(x) dx)
Discrete product             ->      Continuous product (via log)

\sum_i f(i) P(i)             ->      \int f(x) p(x) dx
Discrete expectation         ->      Continuous expectation

-\sum_i p_i log p_i          ->      -\int p(x) log p(x) dx
Discrete entropy            ->      Differential entropy

This bridge matters for AI because many derivations convert between sums and integrals:

  • The evidence lower bound (ELBO) involves expectations that are sums (discrete latent variables) or integrals (continuous latent variables)
  • Monte Carlo estimation approximates integrals by sums: \int f(x) p(x) \, dx \approx \frac{1}{K} \sum_{k=1}^{K} f(x_k), with samples x_k \sim p
  • Riemann sum approximations appear in numerical integration throughout scientific ML

2. Summation Notation - Formal Definitions

2.1 Basic Definition

The summation of f(i) as i ranges from m to n is defined as:

\sum_{i=m}^{n} f(i) = f(m) + f(m+1) + f(m+2) + \ldots + f(n)

The components of this notation are:

Component | Name               | Role
\Sigma    | Summation operator | Capital Greek sigma; indicates repeated addition
i         | Index variable     | Dummy/bound variable; any name works
m         | Lower limit        | First value of the index (inclusive)
n         | Upper limit        | Last value of the index (inclusive)
f(i)      | Summand            | Expression evaluated at each index value

Number of terms: n - m + 1 (when n \geq m).

Empty sum convention: If n < m, the sum has no terms. By convention, the value of an empty sum is 0 - the additive identity:

\sum_{i=5}^{3} f(i) = 0 \qquad (\text{empty sum; } 3 < 5)

This convention is not arbitrary; it is the only value consistent with the algebraic properties of summation (see 3.2).

Concrete examples:

\sum_{i=1}^{4} i^2 = 1^2 + 2^2 + 3^2 + 4^2 = 1 + 4 + 9 + 16 = 30

\sum_{i=0}^{3} 2^i = 2^0 + 2^1 + 2^2 + 2^3 = 1 + 2 + 4 + 8 = 15

\sum_{i=1}^{n} 1 = \underbrace{1 + 1 + \ldots + 1}_{n \text{ times}} = n

2.2 Formal Definition via Recursion

The summation notation can be defined with complete mathematical precision using recursion:

Base case (empty sum):

\sum_{i=m}^{m-1} f(i) = 0

Recursive step:

\sum_{i=m}^{n} f(i) = \left(\sum_{i=m}^{n-1} f(i)\right) + f(n) \qquad (n \geq m)

This definition is equivalent to the "expand and add" interpretation but is fully rigorous. It also makes clear why the empty sum equals 0: from the recursive step with n = m:

\sum_{i=m}^{m} f(i) = \left(\sum_{i=m}^{m-1} f(i)\right) + f(m) = 0 + f(m) = f(m) \quad \checkmark

The recursion unfolds exactly like a for loop:

def sigma(f, m, n):
    """Recursive definition of \sum^i_=_m^n f(i)"""
    if n < m:
        return 0          # Empty sum = additive identity
    return sigma(f, m, n-1) + f(n)  # Recursive step
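A quick sanity check of the recursive definition against Python's built-in sum (using the sigma function above; the empty-sum case returns 0 as claimed):

print(sigma(lambda i: i**2, 1, 4))      # 30, matching 1 + 4 + 9 + 16
print(sum(i**2 for i in range(1, 5)))   # 30, the iterative equivalent
print(sigma(lambda i: i, 5, 3))         # 0, the empty sum (upper limit below lower limit)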

2.3 Index Set Notation

The basic \sum_{i=m}^{n} notation assumes the index ranges over consecutive integers \{m, m+1, \ldots, n\}. But sums often range over arbitrary sets. The index set notation generalises:

\sum_{i \in S} f(i) = \text{sum of } f(i) \text{ for all } i \in S

This is the most general form. The limits notation is a special case where S = \{m, m+1, \ldots, n\}.

AI-relevant examples:

\sum_{v \in V} \exp(z_v) \qquad \text{(sum over all tokens } v \text{ in vocabulary } V\text{)}

\sum_{\substack{(i,j) \\ i < j}} a_{ij} \qquad \text{(sum over all pairs with } i < j\text{; upper triangular)}

\sum_{p \text{ prime}, \, p \leq N} \frac{1}{p} \qquad \text{(sum over primes up to } N\text{)}

\sum_{l=1}^{L} \sum_{h=1}^{H} \text{Attn}_{l,h} \qquad \text{(sum over all layers and attention heads)}

Set-builder notation with conditions:

\sum_{\substack{i=1 \\ i \text{ odd}}}^{n} f(i) = f(1) + f(3) + f(5) + \ldots

This sums only over odd values of i. The condition acts as a filter on the index set.
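In code, an index-set or filtered sum is just a generator expression with a condition. A minimal sketch (toy values; NumPy used only to cross-check the triangular sum):

import numpy as np

n = 10
odd_sum = sum(i for i in range(1, n + 1) if i % 2 == 1)    # sum over odd i only

# Sum over the index set {(i, j) : i < j} of a small matrix (strict upper triangle):
A = np.arange(16, dtype=float).reshape(4, 4)
upper = sum(A[i, j] for i in range(4) for j in range(4) if i < j)
print(odd_sum, upper, np.triu(A, k=1).sum())               # the last two values agree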

2.4 Multiple Indices

A double sum sums over two independent indices:

\sum_{i=1}^{m} \sum_{j=1}^{n} f(i,j) = \sum_{i=1}^{m} \left(\sum_{j=1}^{n} f(i,j)\right)

The process: for each fixed i, compute the inner sum over j; then sum those results over i.

Total number of terms: m \times n.

Concrete example:

\sum_{i=1}^{2} \sum_{j=1}^{3} ij = \sum_{i=1}^{2} (i \cdot 1 + i \cdot 2 + i \cdot 3) = \sum_{i=1}^{2} 6i = 6 + 12 = 18

Triple sums and higher are defined analogously by nesting more \Sigma symbols. In AI, you frequently see:

\sum_{b=1}^{B} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{b,i,j} \qquad \text{(sum over batch, query position, key position)}

AI example - matrix multiplication as double sum:

Computing all entries of C = AB where A \in \mathbb{R}^{m \times d}, B \in \mathbb{R}^{d \times n}:

C_{ij} = \sum_{k=1}^{d} A_{ik} B_{kj}

Computing the entire matrix C means visiting every combination of i, j, and k:

\sum_{i=1}^{m} \sum_{j=1}^{n} \sum_{k=1}^{d} A_{ik} B_{kj}

This triple sum involves m \times n \times d individual multiply-add operations.
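A minimal NumPy sketch of the same computation, with the contracted index k written out as an explicit loop and then via einsum (toy shapes, for illustration only):

import numpy as np

rng = np.random.default_rng(1)
A, B = rng.normal(size=(2, 3)), rng.normal(size=(3, 4))    # m=2, d=3, n=4

C_loops = np.zeros((2, 4))
for i in range(2):
    for j in range(4):
        for k in range(3):                                 # k is the summed index
            C_loops[i, j] += A[i, k] * B[k, j]

C_einsum = np.einsum("ik,kj->ij", A, B)                    # repeated index k is contracted
print(np.allclose(C_loops, A @ B), np.allclose(C_einsum, A @ B))   # True True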

2.5 Sigma Notation Conventions

In practice, several shorthand conventions are widely used:

Implicit limits. When the index set is clear from context:

\sum_i f(i) \qquad \text{(limits understood from context)}

Vector sum. For a vector \mathbf{x} = (x_1, x_2, \ldots, x_n):

\sum_i x_i = x_1 + x_2 + \ldots + x_n

All-ones vector. The sum of all components can be written as a dot product with the all-ones vector \mathbf{1} = (1, 1, \ldots, 1)^\top:

\mathbf{1}^\top \mathbf{x} = \sum_i x_i

This is heavily used in ML: torch.sum(x) is the same as ones @ x.

Indicator function notation. The indicator (or Iverson bracket) \mathbb{1}[\text{condition}] equals 1 when the condition is true and 0 otherwise:

\sum_i \mathbb{1}[\text{condition}(i)] = \text{number of } i \text{ satisfying the condition}

For example, \sum_{i=1}^{n} \mathbb{1}[x_i > 0] counts the number of positive elements in a vector.

Vector/matrix sums. When summing vectors or matrices, the sum is computed componentwise:

\sum_{i=1}^{n} \mathbf{v}_i = \mathbf{v}_1 + \mathbf{v}_2 + \ldots + \mathbf{v}_n \qquad \text{(each } \mathbf{v}_i \text{ is a vector)}

This appears in attention: \text{Attn}_i = \sum_j \alpha_{ij} \mathbf{V}_j is a weighted sum of vectors.


3. Product Notation - Formal Definitions

3.1 Basic Definition

The product of f(i) as i ranges from m to n is defined as:

\prod_{i=m}^{n} f(i) = f(m) \cdot f(m+1) \cdot f(m+2) \cdots f(n)

The components are identical to summation notation but with \Pi (capital Greek pi) indicating repeated multiplication instead of addition.

Component | Name             | Role
\Pi       | Product operator | Capital Greek pi; indicates repeated multiplication
i         | Index variable   | Dummy/bound variable; any name works
m         | Lower limit      | First value of the index (inclusive)
n         | Upper limit      | Last value of the index (inclusive)
f(i)      | Factor           | Expression evaluated at each index value

Number of factors: n - m + 1 (same counting as summation).

Empty product convention: If n < m, the product has no factors. By convention, the value of an empty product is 1 - the multiplicative identity:

\prod_{i=5}^{3} f(i) = 1 \qquad (\text{empty product; } 3 < 5)

Recursive definition:

\prod_{i=m}^{m-1} f(i) = 1 \qquad (\text{base case: empty product})

\prod_{i=m}^{n} f(i) = \left(\prod_{i=m}^{n-1} f(i)\right) \cdot f(n) \qquad (n \geq m)

Concrete examples:

\prod_{i=1}^{4} i = 1 \cdot 2 \cdot 3 \cdot 4 = 24 = 4!

\prod_{i=0}^{3} 2 = 2^4 = 16

\prod_{i=1}^{n} \frac{1}{P(t_i \mid \text{context})} = \frac{1}{P(t_1 \mid \text{ctx})} \cdot \frac{1}{P(t_2 \mid \text{ctx})} \cdots \frac{1}{P(t_n \mid \text{ctx})} \qquad \text{(inverse probability product)}

3.2 Why Empty Product = 1 and Empty Sum = 0

These conventions are not arbitrary - they are the only values consistent with the algebraic rules:

Additive identity argument. The identity for addition is 0: a + 0 = a. An empty sum contributes no terms, so it must equal the additive identity:

\sum_{\emptyset} = 0

Consistency check: The recursive definition gives \sum_{i=1}^{1} f(i) = \sum_{i=1}^{0} f(i) + f(1) = 0 + f(1) = f(1). If the empty sum were anything other than 0, this would fail. OK

Multiplicative identity argument. The identity for multiplication is 1: a \cdot 1 = a. An empty product contributes no factors, so it must equal the multiplicative identity:

\prod_{\emptyset} = 1

Consistency check: The recursive definition gives \prod_{i=1}^{1} f(i) = \prod_{i=1}^{0} f(i) \cdot f(1) = 1 \cdot f(1) = f(1). OK

Factorial consistency. The factorial n! = \prod_{i=1}^{n} i must satisfy 0! = 1:

0! = \prod_{i=1}^{0} i = 1 \qquad (\text{empty product convention})

This is also consistent with \Gamma(1) = 1, the combinatorial identity \binom{n}{0} = \frac{n!}{0! \cdot n!} = 1, and the Taylor series e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} (which needs 0! = 1 for the n = 0 term to be x^0/0! = 1).

IDENTITY ELEMENTS - THE PATTERN
=======================================================================

Operation       Identity    Empty Result             Why
---------       --------    ------------             ---
Addition        0           empty sum = 0            Adding nothing = 0
Multiplication  1           empty product = 1        Multiplying nothing = 1
Union           {}          empty union = {}         Unioning nothing = empty set
Intersection    Universal   empty intersection = U   Intersecting nothing = everything
Logical AND     True        empty AND = True         No conditions to fail
Logical OR      False       empty OR = False         No conditions to satisfy

3.3 Factorial as Product Notation

The factorial of n is defined as:

n! = \prod_{i=1}^{n} i = 1 \cdot 2 \cdot 3 \cdots n

with the convention 0! = 1 (empty product).

Growth rate. Factorial grows faster than any polynomial and faster than any exponential with a fixed base:

n  | n!                   | n^n                  | 2^n       | n^2
1  | 1                    | 1                    | 2         | 1
5  | 120                  | 3,125                | 32        | 25
10 | 3,628,800            | 10,000,000,000       | 1,024     | 100
20 | 2.43 \times 10^{18}  | 1.05 \times 10^{26}  | 1,048,576 | 400

For large n, Stirling's approximation gives:

n! \approx \sqrt{2\pi n} \left(\frac{n}{e}\right)^n

AI relevance:

  • Permutations: The number of permutations of n items is n!. This creates combinatorial explosion in search spaces. A vocabulary of just 30 tokens has 30! \approx 2.65 \times 10^{32} possible orderings.
  • Permutation symmetry in neural networks: A hidden layer with n neurons has n! equivalent parameter configurations obtained by permuting the neurons. This symmetry means the loss landscape has at least n! equivalent minima.
  • Combinatorics: Binomial coefficients \binom{n}{k} = \frac{n!}{k!(n-k)!} appear in counting arguments throughout ML theory.

3.4 Product Notation in Probability

Product notation is the natural language of probability, where independent events multiply:

Joint probability of independent events:

P(A_1 \cap A_2 \cap \ldots \cap A_n) = \prod_{i=1}^{n} P(A_i) \qquad \text{(if } A_1, \ldots, A_n \text{ are independent)}

Likelihood of a sequence (chain rule of probability):

P(t_1, t_2, \ldots, t_n) = \prod_{i=1}^{n} P(t_i \mid t_1, \ldots, t_{i-1})

This is the fundamental equation of autoregressive language models. A transformer computes each factor P(t_i \mid t_1, \ldots, t_{i-1}) using causal attention, and multiplies them all together to get the sequence probability.

Log-likelihood converts product to sum:

\log P(t_1, \ldots, t_n) = \log \prod_{i=1}^{n} P(t_i \mid t_1, \ldots, t_{i-1}) = \sum_{i=1}^{n} \log P(t_i \mid t_1, \ldots, t_{i-1})

This conversion from \Pi to \Sigma via logarithm is one of the most important transformations in all of machine learning. It is used because:

  1. Numerical stability: Products of many small probabilities underflow to zero (0.01^{100} = 10^{-200}, far below the FP32 minimum of roughly 10^{-38}). Taking logarithms converts the product to a sum of manageable-sized numbers (100 \times \log(0.01) = -460.5).

  2. Computational convenience: Sums are easier to differentiate than products. The gradient of \sum_i \log P_i is \sum_i \frac{\nabla P_i}{P_i}, cleanly decomposing into per-example terms.

  3. Statistical theory: The negative log-likelihood is the cross-entropy loss, connecting maximum likelihood estimation to information theory.

THE PRODUCT-TO-SUM CONVERSION VIA LOGARITHM
=======================================================================

                    log
    \prod_i P(t_i|ctx) ---------> \sum_i log P(t_i|ctx)

    Product of           Sum of
    probabilities        log-probabilities

    Underflows for       Numerically stable
    long sequences       for any length

    Hard to              Easy to
    differentiate        differentiate

    <- - - - - exp - - - - <-

    exp(\sum_i log P_i) = \prod_i P_i     (inverse transformation)
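A minimal numerical sketch of why the conversion matters (made-up constant probabilities, float64 arithmetic):

import numpy as np

probs = np.full(300, 0.01)                 # 300 tokens, each with probability 0.01

print(np.prod(probs))                      # 0.0 -- the raw product underflows
print(np.sum(np.log(probs)))               # about -1381.6, perfectly representable
print(np.exp(np.sum(np.log(probs))))       # exponentiating back still underflows to 0.0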

4. Algebraic Properties of Summation

These properties are the tools for manipulating sums - the algebraic rules that allow you to simplify, decompose, and transform summation expressions.

4.1 Linearity of Summation

The most important property. Summation is a linear operator:

Scalar multiple:

\sum_{i=1}^{n} c \cdot f(i) = c \cdot \sum_{i=1}^{n} f(i) \qquad (c \text{ constant w.r.t. } i)

Sum of sums:

\sum_{i=1}^{n} \bigl(f(i) + g(i)\bigr) = \sum_{i=1}^{n} f(i) + \sum_{i=1}^{n} g(i)

Combined (linearity):

\sum_{i=1}^{n} \bigl(\alpha f(i) + \beta g(i)\bigr) = \alpha \sum_{i=1}^{n} f(i) + \beta \sum_{i=1}^{n} g(i)

Proof sketch: Expand both sides by the definition of summation (write out all terms); use commutativity and associativity of addition to rearrange; observe both sides are equal term by term.

Why this matters for AI: Linearity allows decomposition of complex sums into simpler parts. The most critical application:

\nabla_\theta \sum_{i=1}^{N} L(\theta, x_i) = \sum_{i=1}^{N} \nabla_\theta L(\theta, x_i)

The gradient of a sum equals the sum of gradients. This is why mini-batch gradient descent works: you can compute per-example gradients independently and then sum them. It is also why gradient accumulation works: you can split a large batch into smaller sub-batches, compute gradients for each, and sum them - the result is identical to computing the gradient of the full batch.

Warning: The constant c must not depend on the summation index i. If c depends on i, it cannot be factored out:

\sum_{i=1}^{n} c(i) \cdot f(i) \neq c(\cdot) \cdot \sum_{i=1}^{n} f(i) \qquad \text{(WRONG if } c \text{ depends on } i\text{)}

4.2 Constant Summand

\sum_{i=1}^{n} c = n \cdot c

The sum of n copies of a constant c equals n times c. This follows from linearity: \sum_{i=1}^{n} c = c \cdot \sum_{i=1}^{n} 1 = c \cdot n.

Special case: \sum_{i=1}^{n} 1 = n - a useful identity for counting.

AI applications:

  • Normalisation check: \sum_{v=1}^{|V|} P(v) = 1 - all probabilities must sum to 1. If P is uniform over |V| tokens, then P(v) = 1/|V| and \sum_v P(v) = |V| \cdot (1/|V|) = 1. OK
  • Batch average: \sum_{i=1}^{N} (1/N) = 1 - confirms that (1/N) \sum_i f(i) is a proper average.
  • Attention conservation: \sum_j \alpha_{ij} = 1 for softmax attention weights - each query's weights sum to 1.

4.3 Splitting and Combining Sums

Splitting at index k (where m \leq k < n):

\sum_{i=m}^{n} f(i) = \sum_{i=m}^{k} f(i) + \sum_{i=k+1}^{n} f(i)

Both parts are contiguous and their union covers the full range.

Merging adjacent ranges:

\sum_{i=m}^{k} f(i) + \sum_{i=k+1}^{n} f(i) = \sum_{i=m}^{n} f(i)

Non-adjacent ranges: For disjoint sets:

\sum_{i \in S} f(i) + \sum_{i \in T} f(i) = \sum_{i \in S \cup T} f(i) \qquad \text{(only if } S \cap T = \emptyset\text{)}

If S \cap T \neq \emptyset, elements in the intersection are counted twice.

AI applications:

  • Mini-batch splitting: Split gradient computation across mini-batches: \sum_{i=1}^{N} \nabla l_i = \sum_{i=1}^{|B_1|} \nabla l_i + \sum_{i=|B_1|+1}^{|B_1|+|B_2|} \nabla l_i + \ldots
  • Distributed training: Each GPU computes a partial sum; the all-reduce operation merges them.
  • Gradient accumulation: Accumulate gradients over K micro-batches before applying an update: \nabla L_{\text{effective}} = \sum_{b=1}^{K} \nabla L_{B_b}.

4.4 Index Shifting (Reindexing)

Shift by constant k:

\sum_{i=m}^{n} f(i) = \sum_{j=m+k}^{n+k} f(j - k) \qquad \text{where } j = i + k

Both limits shift by k, and the summand substitutes i = j - k.

Rename index (alpha-equivalence):

\sum_{i=m}^{n} f(i) = \sum_{j=m}^{n} f(j) \qquad \text{(different name, same sum)}

Reverse index:

\sum_{i=m}^{n} f(i) = \sum_{i=m}^{n} f(m + n - i) \qquad \text{(substitute } i \to m + n - i\text{)}

When i goes from m to n, the value m + n - i goes from n down to m. The same terms are summed in reverse order.

Applications of reindexing:

  • Gauss's trick: Reverse and add to derive \sum_{i=1}^{n} i = n(n+1)/2 (checked numerically in the sketch below):
    • S = 1 + 2 + \ldots + n
    • S = n + (n-1) + \ldots + 1 (reversed)
    • 2S = (n+1) + (n+1) + \ldots + (n+1) = n(n+1)
    • S = n(n+1)/2
  • AI: relative positional encodings: Reindex from absolute positions to relative: \sum_j f(i, j) = \sum_{\delta} f(i, i - \delta) where \delta = i - j.
  • Causal attention: Reindex to count from current position backward.
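Each of these identities can be checked directly in code. A small sketch (arbitrary choice of f and limits):

n, k, m = 10, 3, 1
f = lambda i: i ** 2

forward   = sum(f(i) for i in range(m, n + 1))
shifted   = sum(f(j - k) for j in range(m + k, n + k + 1))   # substitute j = i + k
reflected = sum(f(m + n - i) for i in range(m, n + 1))       # substitute i -> m + n - i
print(forward == shifted == reflected)                       # True

print(sum(range(1, n + 1)) == n * (n + 1) // 2)              # Gauss's formula holds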

4.5 Telescoping Sums

A telescoping sum is one where most terms cancel in pairs, leaving only the first and last:

\sum_{i=1}^{n} \bigl(f(i+1) - f(i)\bigr) = f(n+1) - f(1)

Proof by expansion:

\bigl(f(2) - f(1)\bigr) + \bigl(f(3) - f(2)\bigr) + \bigl(f(4) - f(3)\bigr) + \ldots + \bigl(f(n+1) - f(n)\bigr)

Every intermediate term appears once with + and once with -; they cancel:

= -f(1) + \cancel{f(2)} - \cancel{f(2)} + \cancel{f(3)} - \cancel{f(3)} + \ldots + \cancel{f(n)} - \cancel{f(n)} + f(n+1) = f(n+1) - f(1)

Generalised form:

\sum_{i=1}^{n} \bigl(g(i) - g(i-1)\bigr) = g(n) - g(0)

Example - partial fractions: Show that \sum_{i=1}^{n} \frac{1}{i(i+1)} = \frac{n}{n+1}:

\frac{1}{i(i+1)} = \frac{1}{i} - \frac{1}{i+1}

\sum_{i=1}^{n} \left(\frac{1}{i} - \frac{1}{i+1}\right) = \frac{1}{1} - \frac{1}{n+1} = \frac{n}{n+1}

AI applications:

  • Residual networks: The total change in a ResNet's representation across T layers: \Delta h = h_T - h_0 = \sum_{t=1}^{T} (h_t - h_{t-1}) = \sum_{t=1}^{T} \delta h_t. The total transformation telescopes: it equals the sum of all residual updates.
  • Gradient descent accumulation: Total parameter change: \theta_T - \theta_0 = \sum_{t=1}^{T} (\theta_t - \theta_{t-1}) = -\sum_{t=1}^{T} \eta_t \nabla L(\theta_{t-1}).
  • Proving closed-form formulas: Many summation identities are proved by finding a function g such that f(i) = g(i) - g(i-1); the telescoping collapse is easy to verify numerically, as in the sketch below.
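A small numerical sketch of the telescoping collapse (arbitrary choice of g; the partial-fraction example from above is checked as well):

import math

g = lambda i: math.log(1 + i)                # any function g works
n = 50

telescoped = sum(g(i) - g(i - 1) for i in range(1, n + 1))
print(math.isclose(telescoped, g(n) - g(0)))                     # True

partial_fractions = sum(1 / (i * (i + 1)) for i in range(1, n + 1))
print(math.isclose(partial_fractions, n / (n + 1)))              # True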

4.6 Interchange of Summation Order

For a double sum over a rectangular region, the order of summation can always be interchanged:

\sum_{i=1}^{m} \sum_{j=1}^{n} f(i,j) = \sum_{j=1}^{n} \sum_{i=1}^{m} f(i,j)

Why this works: The double sum ranges over all pairs (i, j) with 1 \leq i \leq m and 1 \leq j \leq n. The left side sums row by row (fix i, sweep j); the right side sums column by column (fix j, sweep i). Both cover the same set of m \times n pairs, and addition is commutative and associative.

SUMMING A MATRIX: ROW-FIRST vs COLUMN-FIRST
=======================================================================

       j=1   j=2   j=3         Row-first (\sum_i \sum_j):
     +-----+-----+-----+         Row 1: a_11 + a_12 + a_13 = R_1
 i=1 | a_11| a_12| a_13|         Row 2: a_21 + a_22 + a_23 = R_2
     +-----+-----+-----+         Total: R_1 + R_2
 i=2 | a_21| a_22| a_23|
     +-----+-----+-----+       Column-first (\sum_j \sum_i):
                                 Col 1: a_11 + a_21 = C_1
                                 Col 2: a_12 + a_22 = C_2
                                 Col 3: a_13 + a_23 = C_3
                                 Total: C_1 + C_2 + C_3

Both give the same total: all 6 elements summed.

For finite sums: interchange is always safe. No conditions needed.

For infinite sums: interchange requires absolute convergence: \sum_i \sum_j |f(i,j)| < \infty. Without this, rearrangement can change the value (Riemann rearrangement theorem).

AI applications:

  • Linearity of expectation: \mathbb{E}\left[\sum_i X_i\right] = \sum_i \mathbb{E}[X_i]. This is interchange of summation and expectation (which is itself a sum/integral).
  • Multi-head attention: \sum_h \sum_i \equiv \sum_i \sum_h - compute attention for all heads first then sum, or compute per-position then sum over heads; same result.
  • Batch-feature decomposition: When computing gradients, you can sum over the batch dimension first or the feature dimension first; the order doesn't matter for finite sums.

5. Algebraic Properties of Products

5.1 Product of Products (Splitting)

Just as sums can be split at any index, products can be split at any index:

\prod_{i=m}^{n} f(i) = \prod_{i=m}^{k} f(i) \cdot \prod_{i=k+1}^{n} f(i) \qquad (m \leq k < n)

Proof: The full product f(m) \cdot f(m+1) \cdots f(k) \cdot f(k+1) \cdots f(n) can be grouped into two sub-products at any break point k.

AI application: Factoring a sequence probability:

P(t_1, \ldots, t_n) = \underbrace{P(t_1, \ldots, t_k)}_{\text{prefix probability}} \cdot \underbrace{P(t_{k+1}, \ldots, t_n \mid t_1, \ldots, t_k)}_{\text{continuation probability}}

This splitting enables prefix-based caching in autoregressive generation: compute the prefix probability once, then extend it token by token.

5.2 Product of a Constant

\prod_{i=1}^{n} c = c^n

Multiplying n copies of a constant c equals c to the n-th power.

Special cases:

  • \prod_{i=1}^{n} 2 = 2^n - binary growth
  • \prod_{i=1}^{n} (1/2) = (1/2)^n = 2^{-n} - exponential decay
  • \prod_{i=1}^{n} (-1) = (-1)^n - alternating sign

AI application: If a model assigns uniform probability 1/|V| to each token in a sequence of length n:

P(\text{sequence}) = \prod_{i=1}^{n} \frac{1}{|V|} = \frac{1}{|V|^n}

With vocabulary size |V| = 50{,}000 and n = 100 tokens, the probability is (1/50{,}000)^{100} \approx 10^{-470} - far below any floating-point representation, which is exactly why we use log-probabilities.

5.3 Product of a Product (Nested Products)

Nested products over independent index sets can be interchanged, just like double sums:

\prod_{i=1}^{m} \prod_{j=1}^{n} f(i,j) = \prod_{j=1}^{n} \prod_{i=1}^{m} f(i,j)

Both sides multiply together the same m \times n factors; only the order differs. Since multiplication is commutative and associative, the result is the same.

5.4 Distributing Product over Multiplication

Product distributes over pointwise multiplication of terms:

\prod_{i=1}^{n} \bigl(f(i) \cdot g(i)\bigr) = \left(\prod_{i=1}^{n} f(i)\right) \cdot \left(\prod_{i=1}^{n} g(i)\right)

Proof: Expand: (f(1)g(1))(f(2)g(2))\cdots(f(n)g(n)). By commutativity of multiplication, rearrange to (f(1)f(2)\cdots f(n))(g(1)g(2)\cdots g(n)).

Product does NOT distribute over addition:

\prod_{i=1}^{n} \bigl(f(i) + g(i)\bigr) \neq \prod_{i=1}^{n} f(i) + \prod_{i=1}^{n} g(i)

Counterexample: (a + b)(c + d) = ac + ad + bc + bd \neq ac + bd.

This is a common source of errors. Multiplication inside a product distributes cleanly, but addition inside a product creates cross-terms - the result expands into 2^n terms in general.

5.5 Logarithm Converts Product to Sum

This is arguably the most important property of product notation for all of machine learning:

\log \prod_{i=1}^{n} f(i) = \sum_{i=1}^{n} \log f(i)

Proof: By the fundamental logarithm identity \log(ab) = \log(a) + \log(b), applied repeatedly:

\log(f(1) \cdot f(2) \cdots f(n)) = \log f(1) + \log f(2) + \ldots + \log f(n) = \sum_{i=1}^{n} \log f(i)

Inverse direction:

\prod_{i=1}^{n} f(i) = \exp\!\left(\sum_{i=1}^{n} \log f(i)\right)

Why this matters immensely for AI:

  1. Numerical stability. Products of many small probabilities underflow to zero in finite-precision arithmetic. If P(t_i \mid \text{ctx}) = 0.01 for all i in a sequence of length 100, then \prod P_i = 10^{-200}, which is below the minimum of FP32 (\approx 10^{-38}). But \sum \log P_i = -460.5, perfectly representable.

  2. Gradient computation. Differentiating sums is straightforward: \frac{\partial}{\partial \theta} \sum_i \log P_i = \sum_i \frac{\partial \log P_i}{\partial \theta}. Differentiating products requires the product rule applied n-1 times.

  3. Decomposition. Log-likelihood decomposes into per-example terms: each \log P(t_i \mid \text{ctx}) can be computed, stored, and analysed independently.

AI examples of product-to-sum conversion:

\underbrace{\log P(t_1, \ldots, t_n)}_{\text{log-likelihood}} = \underbrace{\sum_{i=1}^{n} \log P(t_i \mid \text{context})}_{\text{sum of log-probs}}

\underbrace{-\frac{1}{N} \sum_{i=1}^{N} \log P(t_i \mid \text{ctx})}_{\text{cross-entropy loss}} = \underbrace{-\log \prod_{i=1}^{N} P(t_i \mid \text{ctx})^{1/N}}_{\text{negative log geometric mean}}

\underbrace{\text{PPL}}_{\text{perplexity}} = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log P\right) = \left(\prod_{i=1}^{N} \frac{1}{P(t_i \mid \text{ctx})}\right)^{1/N}
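A minimal sketch of the perplexity computation from per-token log-probabilities (made-up probabilities, not model output), showing that the sum form and the product form agree:

import numpy as np

token_probs = np.array([0.2, 0.05, 0.5, 0.1])        # toy P(t_i | context)
log_probs = np.log(token_probs)

cross_entropy = -log_probs.mean()                    # average negative log-probability
ppl_from_sum = np.exp(cross_entropy)

ppl_from_prod = np.prod(1.0 / token_probs) ** (1.0 / len(token_probs))   # geometric-mean form
print(ppl_from_sum, ppl_from_prod)                   # identical up to floating-point error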

6. Important Summation Formulas

These closed-form results are essential tools. Knowing them lets you immediately evaluate sums that arise in complexity analysis, error bounds, and algorithmic design.

6.1 Arithmetic Series

\sum_{i=1}^{n} i = \frac{n(n+1)}{2}

Gauss's derivation (pairing argument): Write the sum forward and backward:

S = 1 + 2 + 3 + \ldots + n

S = n + (n-1) + (n-2) + \ldots + 1

Adding these two equations column by column:

2S = (1+n) + (2+n-1) + (3+n-2) + \ldots + (n+1) = n \times (n+1)

Therefore S = n(n+1)/2.

General arithmetic series (first term a, common difference d, n terms):

\sum_{i=0}^{n-1} (a + id) = na + \frac{n(n-1)}{2} d = \frac{n}{2}\bigl(2a + (n-1)d\bigr) = \frac{n(\text{first} + \text{last})}{2}

Sum of first n odd numbers:

\sum_{i=1}^{n} (2i - 1) = n^2

Proof: \sum_{i=1}^{n} (2i - 1) = 2 \sum_{i=1}^{n} i - \sum_{i=1}^{n} 1 = 2 \cdot \frac{n(n+1)}{2} - n = n^2 + n - n = n^2.

AI relevance: The number of pairwise interactions in full self-attention over n tokens is \sum_{i=1}^{n} n = n^2 (each of n queries attends to n keys), giving O(n^2) attention complexity. More precisely, causal attention has \sum_{i=1}^{n} i = n(n+1)/2 interactions - still O(n^2) but roughly half.

6.2 Sum of Squares

\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}

Proof by telescoping. Use the identity i^3 - (i-1)^3 = 3i^2 - 3i + 1:

\sum_{i=1}^{n} \bigl(i^3 - (i-1)^3\bigr) = n^3 \qquad \text{(telescoping)}

\sum_{i=1}^{n} (3i^2 - 3i + 1) = n^3

3\sum i^2 - 3\sum i + n = n^3

3\sum i^2 = n^3 + 3 \cdot \frac{n(n+1)}{2} - n = n^3 + \frac{3n^2 + 3n}{2} - n

\sum i^2 = \frac{n(n+1)(2n+1)}{6}

AI relevance: Squared norms \|\mathbf{x}\|^2 = \sum_i x_i^2 appear everywhere - L2 regularisation, variance calculations, gradient norms. While \sum_i i^2 specifically appears in complexity analysis, the sum-of-squares pattern is ubiquitous.

6.3 Sum of Cubes

\sum_{i=1}^{n} i^3 = \left(\frac{n(n+1)}{2}\right)^2 = \left(\sum_{i=1}^{n} i\right)^2

A remarkable identity: the sum of cubes equals the square of the sum of the first n integers.

n | \sum i^3              | (\sum i)^2   | Equal?
1 | 1                     | 1^2 = 1      | OK
2 | 1 + 8 = 9             | 3^2 = 9      | OK
3 | 1 + 8 + 27 = 36       | 6^2 = 36     | OK
4 | 1 + 8 + 27 + 64 = 100 | 10^2 = 100   | OK

Proof by induction: Base case n = 1: 1^3 = 1 = (1 \cdot 2/2)^2 = 1. OK
Inductive step: assume \sum_{i=1}^{k} i^3 = (k(k+1)/2)^2. Then:

\sum_{i=1}^{k+1} i^3 = \frac{k^2(k+1)^2}{4} + (k+1)^3 = (k+1)^2 \left(\frac{k^2}{4} + k + 1\right) = (k+1)^2 \cdot \frac{(k+2)^2}{4}

which equals ((k+1)(k+2)/2)^2. OK

6.4 Geometric Series (Finite)

\sum_{i=0}^{n-1} r^i = \frac{1 - r^n}{1 - r} \qquad (r \neq 1)

Proof: Let S = 1 + r + r^2 + \ldots + r^{n-1}. Then:

rS = r + r^2 + \ldots + r^n

S - rS = 1 - r^n

S(1 - r) = 1 - r^n

S = \frac{1 - r^n}{1 - r}

Special case r = 1: \sum_{i=0}^{n-1} 1 = n.

Shifted version: \sum_{i=m}^{n} r^i = r^m \cdot \frac{1 - r^{n-m+1}}{1 - r}.

Alternative form (ratio > 1): \sum_{i=0}^{n-1} r^i = \frac{r^n - 1}{r - 1} for r > 1.

AI relevance:

  • ALiBi (Attention with Linear Biases): Uses geometric-like decay in attention: positions further away receive exponentially decreasing attention bias.
  • Exponential moving averages: In the Adam optimizer, m_t = (1 - \beta_1) \sum_{i=0}^{t-1} \beta_1^i g_{t-i} - the weights form a geometric series summing to \frac{1 - \beta_1^t}{1 - \beta_1} (see the sketch after this list).
  • Discounted rewards in RL: \sum_{t=0}^{T} \gamma^t r_t where \gamma < 1 is the discount factor.
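A small sketch of the geometric weights behind an exponential moving average (assuming \beta_1 = 0.9 and t = 20 purely for illustration):

import numpy as np

beta1, t = 0.9, 20
weights = (1 - beta1) * beta1 ** np.arange(t)   # weight placed on the most recent t gradients

print(weights.sum())        # 1 - beta1**t, from the finite geometric series
print(1 - beta1 ** t)       # same value via the closed form (~0.878 here)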

6.5 Geometric Series (Infinite)

\sum_{i=0}^{\infty} r^i = \frac{1}{1 - r} \qquad (|r| < 1)

This is the limit of the finite geometric series as n \to \infty: since |r| < 1, we have r^n \to 0, so \frac{1 - r^n}{1 - r} \to \frac{1}{1 - r}.

For |r| \geq 1, the series diverges (terms do not shrink to zero).

Derivatives of geometric series:

\sum_{i=1}^{\infty} i r^{i-1} = \frac{1}{(1 - r)^2} \qquad (|r| < 1)

\sum_{i=0}^{\infty} i r^i = \frac{r}{(1 - r)^2} \qquad (|r| < 1)

These are obtained by differentiating \sum r^i = \frac{1}{1 - r} with respect to r term by term (valid within the radius of convergence).

AI relevance:

  • Expected geometric distribution: If a random variable X \sim \text{Geom}(p), then \mathbb{E}[X] = \sum_{k=1}^{\infty} k(1-p)^{k-1} p = 1/p, which uses the derivative of the geometric series.
  • Discounted return in RL: \sum_{t=0}^{\infty} \gamma^t = \frac{1}{1 - \gamma} gives the maximum possible return under discount \gamma (for per-step rewards bounded by 1).

6.6 Exponential and Taylor Series

Many functions used in AI are defined by their Taylor (power) series - which are infinite sums:

Exponential function:

e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots

Converges for all x \in \mathbb{R}. The exponential function is the unique function satisfying f'(x) = f(x) with f(0) = 1.

Sine and cosine:

\sin(x) = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{(2n+1)!} = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \ldots

\cos(x) = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n}}{(2n)!} = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \ldots

Natural logarithm:

\ln(1 + x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1} x^n}{n} = x - \frac{x^2}{2} + \frac{x^3}{3} - \ldots \qquad (|x| \leq 1, \; x \neq -1)

Binomial series:

(1 + x)^\alpha = \sum_{n=0}^{\infty} \binom{\alpha}{n} x^n \qquad (|x| < 1)

where \binom{\alpha}{n} = \frac{\alpha(\alpha-1)\cdots(\alpha - n + 1)}{n!} (generalised binomial coefficient).

AI relevance:

  • Softmax uses \exp(z); its Taylor series explains why \exp(z) \approx 1 + z for small z - helpful for linearised attention approximations (e.g., Performers).
  • Log-sum-exp trick uses \ln(); understanding its series expansion helps with numerical stability analysis.
  • Positional encodings in transformers use \sin() and \cos(); their series representations explain periodicity and frequency properties.
  • GELU activation: \text{GELU}(x) = x \Phi(x) where \Phi involves the error function, which has a Taylor series expansion.

6.7 Harmonic Series and Harmonic Numbers

The harmonic series is the sum of reciprocals:

\sum_{n=1}^{\infty} \frac{1}{n} = 1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \ldots = \infty \qquad \text{(diverges!)}

Despite the terms approaching zero, the sum grows without bound. This is a classic example showing that a_n \to 0 is necessary but not sufficient for convergence.

The n-th harmonic number is the partial sum:

H_n = \sum_{i=1}^{n} \frac{1}{i} \approx \ln(n) + \gamma

where \gamma \approx 0.5772 is the Euler-Mascheroni constant.

The p-series generalises the harmonic series:

\sum_{n=1}^{\infty} \frac{1}{n^p} \qquad \begin{cases} \text{converges} & \text{if } p > 1 \\ \text{diverges} & \text{if } p \leq 1 \end{cases}

The special case p = 2 gives Euler's famous result: \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} \approx 1.6449.

The Riemann zeta function \zeta(p) = \sum_{n=1}^{\infty} \frac{1}{n^p} unifies all p-series.

AI relevance:

  • Zipf's law: Word frequency in natural language follows \text{freq}(r) \propto 1/r where r is the rank. The total frequency is proportional to H_{|V|}, a harmonic number.
  • BPE merge counts follow approximately Zipfian distributions; analysis involves harmonic numbers.
  • Complexity analysis: \sum_{i=1}^{n} 1/i = O(\log n); this appears in analyses of algorithm running times and expected values of random processes (see the numerical check below).
  • Coupon collector problem: Expected number of samples to see all n items is n H_n \approx n \ln n - relevant for coverage analysis in data sampling.
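A quick numerical check of H_n \approx \ln(n) + \gamma (toy values of n):

import numpy as np

gamma = 0.5772156649                     # Euler-Mascheroni constant
for n in (10, 100, 1000, 10_000):
    H_n = np.sum(1.0 / np.arange(1, n + 1))
    print(n, H_n, np.log(n) + gamma)     # the gap shrinks roughly like 1/(2n)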

7. Summation Bounds and Estimation

In both theoretical analysis and practical AI engineering, we rarely need the exact value of a sum. Instead, we need tight bounds - upper and lower estimates that tell us how a sum grows. This section equips you with the main bounding techniques.

7.1 Comparison Test

If 0 \leq a_i \leq b_i for all i, then:

\sum_{i} a_i \leq \sum_{i} b_i

Why it matters: This is the workhorse for bounding sums whose exact evaluation is intractable.

Worked example - bounding softmax tail:

Suppose attention logits satisfy z_i \leq z_{\max} - \delta_i where \delta_i \geq c \cdot i for tokens sorted by relevance. Then:

\sum_{i=k}^{n} e^{z_i} \leq \sum_{i=k}^{n} e^{z_{\max} - ci} = e^{z_{\max}} \sum_{i=k}^{n} e^{-ci} \leq e^{z_{\max}} \cdot \frac{e^{-ck}}{1 - e^{-c}}

This shows that the tail of the softmax distribution decays exponentially - justifying sparse attention approximations that ignore low-scoring tokens.

General pattern:

To bound \sum f(i):
  1. Find a simpler g(i) with f(i) \leq g(i) for all i     (upper bound)
  2. Find a simpler h(i) with h(i) \leq f(i) for all i     (lower bound)
  3. Evaluate \sum g(i) and \sum h(i) using known formulas

7.2 Integral Test

For a positive, monotonically decreasing function f:

\int_{1}^{n+1} f(x)\,dx \leq \sum_{i=1}^{n} f(i) \leq f(1) + \int_{1}^{n} f(x)\,dx

  f(x) |
       |#
       |# #
       |# # #
       |# # # #
       |# # # # #
       +--+--+--+--+--+---- i
        1  2  3  4  5

  # = f(i) rectangles (left sum = upper bound for a decreasing f)
  The curve f(x) passes through the tops of the bars;
  the integral is the area under the curve (between the two bounds).

Derivation: Since f is decreasing, on the interval [i, i+1]:

f(i+1) \leq f(x) \leq f(i) \quad \text{for } x \in [i, i+1]

Integrating both sides over [i, i+1]:

f(i+1) \leq \int_{i}^{i+1} f(x)\,dx \leq f(i)

Summing from i = 1 to n-1 for the right inequality, and from i = 1 to n for the left, yields the result.

AI example - harmonic sum bounds:

With f(x) = 1/x:

\ln(n+1) \leq H_n \leq 1 + \ln n

This gives us H_n = \ln n + \gamma + O(1/n) where \gamma \approx 0.5772 is the Euler-Mascheroni constant.

Application: Bounding the expected number of unique tokens seen after m samples from a vocabulary of size V:

E[\text{unique tokens}] = V \left(1 - \left(1 - \frac{1}{V}\right)^m\right) \approx V(1 - e^{-m/V})

The integral test helps bound partial sums that arise in the coverage analysis.

7.3 Cauchy-Schwarz and Triangle Inequality for Sums

Triangle inequality for sums:

\left| \sum_{i=1}^{n} a_i \right| \leq \sum_{i=1}^{n} |a_i|

This bounds the magnitude of a signed sum by the sum of absolute values. It explains why gradient norms after summing over a batch can be bounded:

\left\| \sum_{i=1}^{B} \nabla_\theta \mathcal{L}_i \right\| \leq \sum_{i=1}^{B} \left\| \nabla_\theta \mathcal{L}_i \right\|

Cauchy-Schwarz inequality for sums:

\left( \sum_{i=1}^{n} a_i b_i \right)^2 \leq \left( \sum_{i=1}^{n} a_i^2 \right) \left( \sum_{i=1}^{n} b_i^2 \right)

Or in vector notation: (a \cdot b)^2 \leq \|a\|^2 \|b\|^2, which means |\cos \theta| \leq 1.

Proof sketch:

Consider the quadratic Q(t) = \sum_{i=1}^{n} (a_i + t b_i)^2 \geq 0 for all t \in \mathbb{R}.

Expanding: Q(t) = \left(\sum b_i^2\right) t^2 + 2\left(\sum a_i b_i\right) t + \left(\sum a_i^2\right) \geq 0.

Since this quadratic is non-negative for all t, its discriminant must be \leq 0:

4\left(\sum a_i b_i\right)^2 - 4\left(\sum a_i^2\right)\left(\sum b_i^2\right) \leq 0

which gives the result.

AI applications:

Application                | Bound Used           | Purpose
Cosine similarity          | |\cos \theta| \leq 1 | Validates similarity metric range
Gradient clipping analysis | Triangle inequality  | Bounds effect of clip on sum
Attention weight analysis  | Cauchy-Schwarz       | Bounds QK dot product magnitude
Variance decomposition     | Cauchy-Schwarz       | Jensen-like bounds on expectations

7.4 Stirling's Approximation

Stirling's approximation connects the product (factorial) to continuous functions:

n! \approx \sqrt{2\pi n} \left(\frac{n}{e}\right)^n

Taking logarithms:

\ln(n!) = \sum_{k=1}^{n} \ln k \approx n \ln n - n + \frac{1}{2} \ln(2\pi n)

Derivation via integral test:

\sum_{k=1}^{n} \ln k \approx \int_{1}^{n} \ln x \, dx = [x \ln x - x]_{1}^{n} = n \ln n - n + 1

The \sqrt{2\pi n} correction comes from more careful analysis (Laplace's method on the Gamma function integral).

Accuracy table:

n   | n!                   | Stirling             | Relative Error
1   | 1                    | 0.922                | 7.8%
5   | 120                  | 118.0                | 1.6%
10  | 3,628,800            | 3,598,695            | 0.83%
50  | 3.04 \times 10^{64}  | 3.04 \times 10^{64}  | 0.17%
100 | 9.33 \times 10^{157} | 9.32 \times 10^{157} | 0.083%

The approximation improves rapidly: relative error \sim 1/(12n).
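The table is easy to reproduce. A minimal sketch comparing the exact factorial with Stirling's approximation:

import math

for n in (1, 5, 10, 50, 100):
    exact = math.factorial(n)
    stirling = math.sqrt(2 * math.pi * n) * (n / math.e) ** n
    print(n, (exact - stirling) / exact)   # relative error, shrinking roughly like 1/(12n)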

AI relevance:

  • Combinatorial arguments: The number of permutations of n tokens is n!; Stirling tells us \log_2(n!) \approx n \log_2 n bits to represent a permutation - relevant for sorting networks used in differentiable sorting.
  • Log-likelihood of multinomial: \ln \binom{n}{k_1, \ldots, k_V} = \ln n! - \sum_{v} \ln k_v!; Stirling converts this to entropy: \approx n H(p) where p_v = k_v/n.
  • Capacity arguments: Channel capacity and model capacity bounds often use \ln \binom{n}{k} \approx n H(k/n).

7.5 Asymptotic Notation for Sums

When exact sums are intractable, we characterise their growth rate:

Notation      | Meaning                       | Example Sum                    | Growth
O(g(n))       | Upper bounded by c \cdot g(n) | \sum_{i=1}^{n} i = O(n^2)      | At most
\Omega(g(n))  | Lower bounded by c \cdot g(n) | \sum_{i=1}^{n} i = \Omega(n^2) | At least
\Theta(g(n))  | Tightly bounded               | \sum_{i=1}^{n} i = \Theta(n^2) | Exactly this rate
o(g(n))       | Dominated by g(n)             | \sum_{i=1}^{n} 1/i = o(n)      | Strictly less
\sim          | Asymptotically equal          | H_n \sim \ln n                 | Ratio -> 1

Key asymptotic results for common sums:

\sum_{i=1}^{n} i^k = \Theta(n^{k+1}) \quad \text{(polynomial sums)}

\sum_{i=0}^{n} r^i = \begin{cases} \Theta(r^n) & \text{if } r > 1 \\ \Theta(n) & \text{if } r = 1 \\ \Theta(1) & \text{if } 0 < r < 1 \end{cases} \quad \text{(geometric sums)}

\sum_{i=1}^{n} \frac{1}{i^p} = \begin{cases} \Theta(\ln n) & \text{if } p = 1 \\ \Theta(1) & \text{if } p > 1 \\ \Theta(n^{1-p}) & \text{if } 0 < p < 1 \end{cases} \quad \text{(harmonic-type sums)}

AI complexity examples:

OperationSum FormAsymptoticPractical Impact
Self-attention (full)i=1nn=n2\sum_{i=1}^{n} n = n^2Θ(n2)\Theta(n^2)Quadratic in sequence length
Self-attention (sparse, top-kk)i=1nk\sum_{i=1}^{n} kΘ(nk)\Theta(nk)Linear if k=O(1)k = O(1)
Batch gradienti=1BO(p)\sum_{i=1}^{B} O(p)Θ(Bp)\Theta(Bp)Linear in batch \times params
Transformer layerO(n2d+nd2)O(n^2 d + n d^2)depends on nn vs ddn<dn < d: O(nd2)O(nd^2); n>dn > d: O(n2d)O(n^2 d)
Vocabulary projectionv=1Vd\sum_{v=1}^{V} dΘ(Vd)\Theta(Vd)Bottleneck for large vocab

8. Special Summation Patterns in AI

This section catalogues the specific summation patterns that appear repeatedly across AI and machine learning. Recognising these patterns lets you read papers fluently, debug implementations, and reason about computational costs.

8.1 Softmax and the Partition Function

The softmax function normalises a vector of logits zRKz \in \mathbb{R}^K into a probability distribution:

softmax(z)i=ezij=1Kezj\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

The denominator Z=j=1KezjZ = \sum_{j=1}^{K} e^{z_j} is the partition function - a term borrowed from statistical mechanics.

Anatomy of softmax:

  Input logits:     z = [2.0,  1.0,  0.5,  -1.0]
                         |      |      |      |
  Exponentiate:    ez = [7.39, 2.72, 1.65,  0.37]
                         |      |      |      |
  Partition sum:   Z  = 7.39 + 2.72 + 1.65 + 0.37 = 12.13
                         |      |      |      |
  Normalise:       p  = [0.61, 0.22, 0.14,  0.03]
                         |      |      |      |
  Verify:          sum(p) = 0.61 + 0.22 + 0.14 + 0.03 = 1.00  OK

Numerical stability - the log-sum-exp trick:

Direct computation overflows for large logits. The standard trick:

logZ=logjezj=m+logjezjmwhere m=maxjzj\log Z = \log \sum_{j} e^{z_j} = m + \log \sum_{j} e^{z_j - m} \quad \text{where } m = \max_j z_j

This works because ezj=emezjm\sum e^{z_j} = e^m \sum e^{z_j - m}, and now the largest exponent is e0=1e^0 = 1.
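A minimal NumPy sketch of the trick; the function names here are illustrative, not from any library:

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)                      # overflows for large logits
    return e / e.sum()

def softmax_stable(z):
    m = z.max()                        # subtract the max before exponentiating
    e = np.exp(z - m)                  # largest exponent is now e^0 = 1
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(z))    # [nan nan nan] -- exp(1000) overflows
print(softmax_stable(z))   # [0.09003057 0.24472847 0.66524096]
```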

Gradient of log-partition:

A beautiful identity links the partition function to expectations:

logZzi=eziZ=softmax(z)i=pi\frac{\partial \log Z}{\partial z_i} = \frac{e^{z_i}}{Z} = \text{softmax}(z)_i = p_i

Higher derivatives give cumulants of the distribution - the second derivative gives the variance.

Where partition functions appear in AI:

ContextPartition FunctionSize
Classification (softmax)c=1Cezc\sum_{c=1}^{C} e^{z_c}num_classes
Language modellingv=1Vezv\sum_{v=1}^{V} e^{z_v}vocab_size (~50K-100K)
Contrastive learning (InfoNCE)j=1Nesim(q,kj)/τ\sum_{j=1}^{N} e^{\text{sim}(q, k_j)/\tau}num_negatives
Energy-based modelseE(x)dx\int e^{-E(x)} dxIntractable (continuous)
CRF layer (NER)all pathses(path)\sum_{\text{all paths}} e^{s(\text{path})}Exponential (use DP)

8.2 Cross-Entropy Loss as Summation

Cross-entropy loss is a double summation - over the batch and over classes:

L=1Bi=1Bc=1Cyiclogy^ic\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{c=1}^{C} y_{ic} \log \hat{y}_{ic}

For one-hot targets (yic=1[c=ci]y_{ic} = \mathbb{1}[c = c_i^*]), the inner sum collapses:

L=1Bi=1Blogy^i,ci\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \log \hat{y}_{i, c_i^*}

Expansion for a single sample:

logy^c=logezcjezj=zc+logjezj-\log \hat{y}_c = -\log \frac{e^{z_c}}{\sum_j e^{z_j}} = -z_c + \log \sum_j e^{z_j}

So the loss is: (negative logit of correct class) + (log partition function).

Gradient with respect to logit zkz_k:

Lzk=y^kyk=pk1[k=c]\frac{\partial \mathcal{L}}{\partial z_k} = \hat{y}_k - y_k = p_k - \mathbb{1}[k = c^*]

This is the remarkably clean result that makes softmax + cross-entropy so popular: the gradient is simply (predicted probability - target probability) for each class.
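A small NumPy sketch that checks the identity $\partial \mathcal{L}/\partial z_k = p_k - \mathbb{1}[k = c^*]$ against a central finite-difference estimate (toy logits; a sanity check, not library code):

```python
import numpy as np

def log_sum_exp(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def softmax(z):
    return np.exp(z - log_sum_exp(z))

def nll(z, c):
    return -z[c] + log_sum_exp(z)           # -log softmax(z)[c] = -z_c + log Z

z = np.array([2.0, -1.0, 0.5, 0.3])
c = 2                                        # index of the correct class

analytic = softmax(z) - np.eye(len(z))[c]    # p_k - 1[k = c*]

eps = 1e-6
numeric = np.array([
    (nll(z + eps * np.eye(len(z))[k], c) - nll(z - eps * np.eye(len(z))[k], c)) / (2 * eps)
    for k in range(len(z))
])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```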

Label smoothing variant:

Instead of one-hot yy, use smoothed targets: ycsmooth=(1ϵ)1[c=c]+ϵ/Cy_c^{\text{smooth}} = (1 - \epsilon) \cdot \mathbb{1}[c = c^*] + \epsilon / C.

Lsmooth=c=1Cycsmoothlogy^c=(1ϵ)LCE+ϵ1CHuniform(y^)\mathcal{L}_{\text{smooth}} = -\sum_{c=1}^{C} y_c^{\text{smooth}} \log \hat{y}_c = (1-\epsilon)\mathcal{L}_{\text{CE}} + \epsilon \cdot \frac{1}{C} H_{\text{uniform}}(\hat{y})

The extra term ϵCc(logy^c)\frac{\epsilon}{C} \sum_c (-\log \hat{y}_c) encourages the model to keep non-target probabilities from collapsing to zero - acting as a regulariser on confidence.

8.3 Attention as Weighted Sum

The core attention operation is a weighted sum of value vectors:

Attention(Q,K,V)i=j=1nαijVj\text{Attention}(Q, K, V)_i = \sum_{j=1}^{n} \alpha_{ij} V_j

where attention weights αij\alpha_{ij} come from softmax over scaled dot-product scores:

αij=exp(QiKj/dk)=1nexp(QiK/dk)\alpha_{ij} = \frac{\exp(Q_i \cdot K_j / \sqrt{d_k})}{\sum_{\ell=1}^{n} \exp(Q_i \cdot K_\ell / \sqrt{d_k})}

The full computation graph:

  Q_i --+
        +-- Q_i * K_j / \sqrtd_k ---> softmax over j ---> \alpha_ij
  K_j --+                                              |
                                                        +-- \sum_j \alpha_ij * V_j ---> output_i
  V_j -------------------------------------------------+

Three sums in attention:

  1. Dot product: QiKj=d=1dkQi,dKj,dQ_i \cdot K_j = \sum_{d=1}^{d_k} Q_{i,d} \cdot K_{j,d} - sum over the head dimension
  2. Softmax normaliser: Zi=j=1nexp(QiKj/dk)Z_i = \sum_{j=1}^{n} \exp(Q_i \cdot K_j / \sqrt{d_k}) - sum over keys
  3. Weighted value: oi=j=1nαijVjo_i = \sum_{j=1}^{n} \alpha_{ij} V_j - sum over values

Multi-head attention adds a fourth sum:

MultiHead(Q,K,V)=Concat(head1,,headh)WO=h=1HheadhWhO\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O = \sum_{h=1}^{H} \text{head}_h W_h^O

(The concatenation + linear projection is equivalent to a sum of projected heads.)
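A minimal single-head sketch that makes the three sums explicit and checks the loop version against a vectorised einsum version (toy shapes, NumPy only):

```python
import numpy as np

n, d_k = 4, 8                                     # sequence length, head dimension
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))

# Sum 1: dot products over the head dimension
scores = np.einsum('id,jd->ij', Q, K) / np.sqrt(d_k)

# Sum 2: softmax normaliser over keys (per query row)
alpha = np.exp(scores)
alpha /= alpha.sum(axis=1, keepdims=True)

# Sum 3: weighted sum of value vectors
out = np.einsum('ij,jd->id', alpha, V)

# Same computation with explicit loops, term by term
out_loops = np.zeros((n, d_k))
for i in range(n):
    s = np.array([sum(Q[i, d] * K[j, d] for d in range(d_k)) / np.sqrt(d_k) for j in range(n)])
    a = np.exp(s) / np.exp(s).sum()
    for j in range(n):
        out_loops[i] += a[j] * V[j]

print(np.allclose(out, out_loops))   # True
```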

Computational cost (counting multiplications):

| Step | Operation | Cost |
| --- | --- | --- |
| $QK^T$ | $n \times d_k \times n$ per head | $O(n^2 d_k)$ |
| Softmax | Per row of $n$ values | $O(n^2)$ |
| Attention $\times$ V | $n \times n \times d_v$ per head | $O(n^2 d_v)$ |
| All heads | Multiply by $h$ | $O(h \cdot n^2 d_k)$ |
| Total | (since $h \cdot d_k = d_{\text{model}}$) | $O(n^2 d_{\text{model}})$ |

8.4 Matrix Multiplication as Summation

Every matrix multiplication is a summation in disguise:

Cij=k=1pAikBkjC_{ij} = \sum_{k=1}^{p} A_{ik} B_{kj}

This single formula underlies virtually all neural network computation.

Layer types expressed as matrix multiply:

LayerFormulaDimensions
Dense/LinearY=XW+bY = XW + b(B×din)(din×dout)(B \times d_{\text{in}}) \cdot (d_{\text{in}} \times d_{\text{out}})
Multi-head QKV projectionQ=XWQQ = XW^Q(n×d)(d×dkh)(n \times d) \cdot (d \times d_k h)
Embedding lookupE=one_hot(x)WEE = \text{one\_hot}(x) \cdot W_E(B×V)(V×d)(B \times V) \cdot (V \times d)
Vocabulary projectionZ=HWETZ = H W_E^T(B×d)(d×V)(B \times d) \cdot (d \times V)
Convolution (im2col)Y=unfold(X)WY = \text{unfold}(X) \cdot WReshaped to GEMM

FLOPs formula: Multiplying an (m×p)(m \times p) matrix by a (p×n)(p \times n) matrix requires:

  • m×p×nm \times p \times n multiplications
  • m×(p1)×nm \times (p-1) \times n additions
  • Total: 2mpn\approx 2mpn FLOPs

For a transformer with L layers and model dimension d, the compute per token is approximately 12Ld² multiply-accumulates (roughly 24Ld² FLOPs if each multiply-add counts as two floating-point operations), accounting for the QKV projections, the attention output projection, and the FFN.
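A toy sketch of the summation behind one matrix multiply, counting multiplications as it goes (illustrative only; real layers dispatch to BLAS/cuBLAS kernels rather than Python loops):

```python
import numpy as np

m, p, n = 3, 4, 5
rng = np.random.default_rng(1)
A, B = rng.standard_normal((m, p)), rng.standard_normal((p, n))

C = np.zeros((m, n))
mults = 0
for i in range(m):
    for j in range(n):
        for k in range(p):            # C_ij = sum_k A_ik * B_kj
            C[i, j] += A[i, k] * B[k, j]
            mults += 1

print(np.allclose(C, A @ B))          # True
print(mults, m * p * n)               # 60 60  -- m*p*n multiplications
```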

8.5 Gradient as Sum Over Batch

The empirical risk (training loss) is an average over the batch:

L(θ)=1Bi=1B(fθ(xi),yi)\mathcal{L}(\theta) = \frac{1}{B} \sum_{i=1}^{B} \ell(f_\theta(x_i), y_i)

Its gradient is therefore also an average:

θL=1Bi=1Bθ(fθ(xi),yi)\nabla_\theta \mathcal{L} = \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta \ell(f_\theta(x_i), y_i)

Key insight: The gradient is linear in the summation - each sample contributes independently. This enables:

  1. Data parallelism: Compute per-sample gradients on different GPUs, then sum (all-reduce).
  2. Gradient accumulation: Sum gradients over micro-batches before updating weights.
  3. Per-sample gradient clipping (DP-SGD): Clip θi\|\nabla_\theta \ell_i\| before averaging - required for differential privacy.

Gradient accumulation pattern:

  Micro-batch 1:  g_1 = \nablaL(\theta; B_1)     -+
  Micro-batch 2:  g_2 = \nablaL(\theta; B_2)     -+
  Micro-batch 3:  g_3 = \nablaL(\theta; B_3)     -+--->  g = (g_1 + g_2 + g_3 + g_4) / 4
  Micro-batch 4:  g_4 = \nablaL(\theta; B_4)     -+
                                              \theta <- \theta - \eta * g

Effective batch size = k=1ABk\sum_{k=1}^{A} |B_k| where AA is the number of accumulation steps.
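A minimal PyTorch sketch of the accumulation pattern; the model, data, and loss here are placeholders. It relies only on the fact that `backward()` adds into `.grad`, which is exactly the running sum of micro-batch gradients:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 1)                  # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4                                 # number of micro-batches per update

opt.zero_grad()
for step in range(accum_steps):
    x = torch.randn(8, 10)                      # one micro-batch of placeholder data
    y = torch.randn(8, 1)
    loss = F.mse_loss(model(x), y)
    (loss / accum_steps).backward()             # .grad accumulates the sum of scaled micro-batch gradients
opt.step()                                      # one update with the averaged gradient
opt.zero_grad()
print("effective batch size:", 8 * accum_steps) # 32
```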

Variance of stochastic gradient:

Var[1Bi=1Bgi]=1B2i=1BVar[gi]=σ2B\text{Var}\left[\frac{1}{B} \sum_{i=1}^{B} g_i\right] = \frac{1}{B^2} \sum_{i=1}^{B} \text{Var}[g_i] = \frac{\sigma^2}{B}

(assuming i.i.d. samples). Doubling the batch size halves the gradient variance - but has diminishing returns on convergence speed (linear speedup only up to a critical batch size).

8.6 Perplexity as Geometric Mean

Perplexity is the standard evaluation metric for language models:

PPL=exp(1Tt=1Tlogp(wtw<t))\text{PPL} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{<t})\right)

Decomposing the formula:

The inner sum 1Tt=1Tlogp(wtw<t)-\frac{1}{T} \sum_{t=1}^{T} \log p(w_t | w_{<t}) is the average cross-entropy per token.

Exponentiating converts it to a geometric mean of inverse probabilities:

PPL=(t=1T1p(wtw<t))1/T\text{PPL} = \left(\prod_{t=1}^{T} \frac{1}{p(w_t \mid w_{<t})}\right)^{1/T}

Intuition: If PPL = 50, the model is "as confused as if it were choosing uniformly among 50 tokens at each step."

Worked example:

  Sentence: "The cat sat on the mat"
  Token probabilities: [0.1, 0.05, 0.2, 0.3, 0.1, 0.15]

  Log probs: [-2.303, -2.996, -1.609, -1.204, -2.303, -1.897]
  Sum of log probs: -12.312
  Average: -12.312 / 6 = -2.052
  PPL = exp(2.052) = 7.78

  Interpretation: Model is ~8-way confused on average
  (a perfect model would have PPL = 1)
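The same worked example in a few lines of Python (a sketch; the probabilities are the toy values above, computed in both the log-average and geometric-mean forms):

```python
import math

probs = [0.1, 0.05, 0.2, 0.3, 0.1, 0.15]            # p(w_t | w_<t) for each token
avg_nll = -sum(math.log(p) for p in probs) / len(probs)
ppl = math.exp(avg_nll)

# Equivalent geometric-mean form: (prod 1/p_t)^(1/T)
ppl_geo = math.prod(1 / p for p in probs) ** (1 / len(probs))
print(round(ppl, 2), round(ppl_geo, 2))             # 7.78 7.78
```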

Properties arising from the summation structure:

  • Perplexity is multiplicative over independent segments: if we split text into parts A and B, PPL(AB)=PPL(A)A/TPPL(B)B/T\text{PPL}(AB) = \text{PPL}(A)^{|A|/T} \cdot \text{PPL}(B)^{|B|/T} (weighted geometric mean).
  • Outlier sensitivity: A single very-low-probability token (e.g., p=106p = 10^{-6}) drastically inflates PPL. This is why models are sometimes evaluated with rare-token filtering.
  • Bits per character (BPC): BPC=1Tt(log2p(wtw<t))=log2(PPL)\text{BPC} = \frac{1}{T} \sum_{t} (-\log_2 p(w_t | w_{<t})) = \log_2(\text{PPL}).

8.7 BM25 as Weighted Sum

BM25, the backbone of traditional information retrieval, is a weighted sum over query terms:

BM25(q,d)=tqIDF(t)f(t,d)(k1+1)f(t,d)+k1(1b+bd/avgdl)\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot (1 - b + b \cdot |d| / \text{avgdl})}

Components as summation:

  • Outer sum tq\sum_{t \in q}: Iterates over unique query terms - typically 3-10 terms.
  • IDF (Inverse Document Frequency): IDF(t)=logNnt+0.5nt+0.5\text{IDF}(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5} where NN = total documents, ntn_t = documents containing term tt.
  • TF saturation: The fraction f(t,d)(k1+1)f(t,d)+k1()\frac{f(t,d)(k_1+1)}{f(t,d) + k_1(\ldots)} saturates - repeated occurrences have diminishing returns.

Why BM25 is a pure summation pattern:

  Score = 0
  For each query term t:
      Score += IDF(t) \times TF_saturation(t, d)
             --------   ------------------
             weight     adjusted count

  This is a weighted sum: \sum w_t * s_t

Modern hybrid retrieval combines BM25 with dense retrieval:

score(q,d)=αBM25(q,d)+(1α)simdense(q,d)\text{score}(q, d) = \alpha \cdot \text{BM25}(q, d) + (1 - \alpha) \cdot \text{sim}_{\text{dense}}(q, d)

Both components are summations. The dense similarity is a dot product iqidi\sum_i q_i d_i, and BM25 is a weighted sum over terms. The hybrid score is itself a weighted sum - sums all the way down.
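A compact sketch of the scoring sum on a toy corpus (k1 = 1.5 and b = 0.75 are conventional defaults; this is the formula only, not a production retrieval implementation):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "dogs and cats living together".split(),
    "the quick brown fox".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(t for d in docs for t in set(d))      # n_t: documents containing term t
k1, b = 1.5, 0.75

def idf(t):
    # as in the formula above; can go negative for terms in more than half the documents
    return math.log((N - df[t] + 0.5) / (df[t] + 0.5))

def bm25(query, doc):
    tf = Counter(doc)
    score = 0.0
    for t in query:                                 # outer sum over query terms
        denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(t) * tf[t] * (k1 + 1) / denom  # weighted, saturated term contribution
    return score

print([round(bm25("cat mat".split(), d), 3) for d in docs])   # highest for the first document
```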


9. Einstein Summation Convention

This section is a bridge, not the full treatment. The goal here is to recognize Einstein notation as a compressed way of writing repeated sums, so the dedicated next chapter, Einstein Summation and Index Notation, can focus on tensor structure, index discipline, and implementation patterns without reintroducing the summation idea from scratch.

9.1 Motivation

In many formulas, the summation symbol eventually becomes visual noise. Matrix multiplication is the classic example:

Cij=kAikBkjC_{ij} = \sum_k A_{ik} B_{kj}

Einstein notation suppresses the explicit sigma and lets the repeated index carry the summation implicitly:

Cij=AikBkjC_{ij} = A_{ik} B_{kj}

The mathematical content is the same. What changes is the visual emphasis: the notation highlights the index pattern rather than the bookkeeping symbol.

9.2 Examples of Einstein Notation

The pattern is easiest to learn by translating familiar sums:

OperationSigma notationEinstein notation
Dot productiaibi\sum_i a_i b_iaibia_i b_i
Matrix-vectorjAijxj\sum_j A_{ij} x_jAijxjA_{ij} x_j
Matrix productkAikBkj\sum_k A_{ik} B_{kj}AikBkjA_{ik} B_{kj}
TraceiAii\sum_i A_{ii}AiiA_{ii}
Bilinear formi,jaiMijbj\sum_{i,j} a_i M_{ij} b_jaiMijbja_i M_{ij} b_j

Seen this way, Einstein notation is not a new operation. It is a shorthand for repeated-index sums you already know how to read.

9.3 Rules for Einstein Notation

A safe first-pass reading rule is:

  • indices that appear twice in one term are summed
  • indices that appear once are free and determine the output shape
  • free indices must match across both sides of an equation
  • if an index appears three or more times, stop and rewrite the expression explicitly

Those rules are enough to parse the most common ML examples. The next chapter develops the more careful index discipline needed for higher-rank tensors and more complex contractions.

9.4 Einstein Notation in PyTorch: einsum

Libraries such as PyTorch and NumPy make the connection explicit through einsum, which turns index patterns into code:

import torch

A, B = torch.randn(3, 4), torch.randn(4, 5)
C = torch.einsum('ik,kj->ij', A, B)              # matrix multiply: C_ij = sum_k A_ik B_kj

Q = K = torch.randn(2, 8, 10, 64)                # (batch, heads, seq, head_dim)
scores = torch.einsum('bhqd,bhkd->bhqk', Q, K)   # attention scores, summed over d

For this chapter, the important idea is conceptual: einsum is programmable Einstein notation. Once you can read the repeated indices, you can often read the code directly from the math.

9.5 Tensor Contraction

Einstein notation naturally leads to tensor contraction, where repeated indices are summed out and the remaining free indices define the output tensor. Dot products, matrix multiplication, traces, and attention score computations are all contractions of this kind.

That is the real reason this preview belongs here: it shows that summation notation does not disappear in tensor algebra. It becomes more compact, more structural, and more expressive. The full transition from explicit sums to tensor contractions continues in Einstein Summation and Index Notation.


10. Double Sums and Index Interaction

When two or more indices appear in the same summation, their interaction creates rich mathematical structures. This section explores double (and higher) sums with attention to the patterns that arise in AI.

10.1 Row-Major vs Column-Major Intuition

A double sum i=1mj=1naij\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij} sums all entries of an m×nm \times n array. The question is: in what order?

  Row-major (outer=i, inner=j):      Column-major (outer=j, inner=i):

  +-------------------+              +-------------------+
  | -> -> -> -> -> -> |  row 1       | v  v  v  v  v  v  |
  | -> -> -> -> -> -> |  row 2       | v  v  v  v  v  v  |
  | -> -> -> -> -> -> |  row 3       | v  v  v  v  v  v  |
  +-------------------+              +-------------------+

  Process each row left-to-right,    Process each column top-to-bottom,
  then move to next row.             then move to next column.

For finite sums, both orders give the same result (Fubini's theorem for sums). But the computational pattern differs:

  • Row-major accesses contiguous memory in C/PyTorch (row-contiguous tensors).
  • Column-major accesses contiguous memory in Fortran/MATLAB.

In AI, choosing the right traversal order affects cache performance and GPU coalescing.

10.2 Diagonal Sums

Summing along diagonals of a matrix isolates elements where indices satisfy ji=kj - i = k (constant offset):

Main diagonal (i=ji = j): The trace.

tr(A)=i=1naii\text{tr}(A) = \sum_{i=1}^{n} a_{ii}

Off-diagonals: iai,i+k\sum_{i} a_{i, i+k} sums the kk-th super-diagonal (for k>0k > 0) or sub-diagonal (for k<0k < 0).

AI connection - relative position in attention:

In relative positional schemes, the attention score depends on the offset between positions. ALiBi, for example, adds an explicit bias b(|i - j|) to the scores (RoPE instead obtains the relative dependence by rotating Q and K rather than adding a bias):

scoreij=QiKj+b(ij)\text{score}_{ij} = Q_i \cdot K_j + b(|i - j|)

The bias matrix BB where Bij=b(ij)B_{ij} = b(|i - j|) is a Toeplitz matrix - constant along each diagonal. Summing along diagonals gives the total contribution of each relative distance.

10.3 Upper and Lower Triangular Sums

Many AI computations restrict summation to triangular regions:

i=1nj=inaij(upper triangle: ji)\sum_{i=1}^{n} \sum_{j=i}^{n} a_{ij} \quad \text{(upper triangle: } j \geq i\text{)} i=1nj=1iaij(lower triangle: ji)\sum_{i=1}^{n} \sum_{j=1}^{i} a_{ij} \quad \text{(lower triangle: } j \leq i\text{)}
  Upper triangular:         Lower triangular (causal):
  +- - - - - - -+          +- - - - - - -+
  | # # # # # # |          | # * * * * * |
  | * # # # # # |          | # # * * * * |
  | * * # # # # |          | # # # * * * |
  | * * * # # # |          | # # # # * * |
  | * * * * # # |          | # # # # # * |
  | * * * * * # |          | # # # # # # |
  +- - - - - - -+          +- - - - - - -+

  # = included in sum       Used in causal (autoregressive)
  * = excluded               attention masks

Causal attention uses lower-triangular masking:

αij={exp(sij)kiexp(sik)if ji0if j>i\alpha_{ij} = \begin{cases} \frac{\exp(s_{ij})}{\sum_{k \leq i} \exp(s_{ik})} & \text{if } j \leq i \\ 0 & \text{if } j > i \end{cases}

Count of elements:

  • Upper or lower triangle: i=1n(ni+1)=n(n+1)2\sum_{i=1}^{n} (n - i + 1) = \frac{n(n+1)}{2} elements
  • Strict upper/lower (no diagonal): n(n1)2\frac{n(n-1)}{2} elements
  • This halving is why causal attention can use roughly half the FLOPs of bidirectional attention, provided the implementation actually skips the masked positions (a naive implementation still computes the full n×n score matrix and then masks it).
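A small PyTorch sketch of this structure: build the lower-triangular causal mask, confirm the n(n+1)/2 count, and apply it before the softmax (illustrative only; frameworks ship equivalent helpers):

```python
import torch

n = 6
mask = torch.tril(torch.ones(n, n, dtype=torch.bool))    # lower triangle: j <= i
print(mask.sum().item(), n * (n + 1) // 2)               # 21 21

# masking scores before softmax: excluded positions get -inf
scores = torch.randn(n, n)
masked = scores.masked_fill(~mask, float('-inf'))
alpha = torch.softmax(masked, dim=-1)                     # each row sums to 1 over j <= i only
print(torch.allclose(alpha.sum(dim=-1), torch.ones(n)))   # True
```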

10.4 Changing Order of Summation

Finite case (always valid):

i=1mj=1naij=j=1ni=1maij\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij} = \sum_{j=1}^{n} \sum_{i=1}^{m} a_{ij}

Triangular region - changing bounds:

i=1nj=inaij=j=1ni=1jaij\sum_{i=1}^{n} \sum_{j=i}^{n} a_{ij} = \sum_{j=1}^{n} \sum_{i=1}^{j} a_{ij}

Proof by picture:

  Original:                    Swapped:
  (fix i, vary j\geqi)           (fix j, vary i\leqj)

  j ->                         j ->
  1 2 3 4                     1 2 3 4
  +---------                  +---------
i=1| # # # #              i=1| # # # #
i=2|   # # #              i=2|   # # #
i=3|     # #              i=3|     # #
i=4|       #              i=4|       #

  Same set of (i,j) pairs - just visited in different order!

General rule for dependent limits:

i=abj=f(i)g(i)=j=minfmaxgiS(j)\sum_{i=a}^{b} \sum_{j=f(i)}^{g(i)} = \sum_{j=\min f}^{\max g} \sum_{i \in S(j)}

where S(j)={i[a,b]:f(i)jg(i)}S(j) = \{i \in [a,b] : f(i) \leq j \leq g(i)\}.

AI application - expectation of loss over data and weights:

In Bayesian deep learning, we often need to interchange an expectation over parameters and a sum over data:

Eθ[i=1n(xi;θ)]=i=1nEθ[(xi;θ)]E_\theta\left[\sum_{i=1}^{n} \ell(x_i; \theta)\right] = \sum_{i=1}^{n} E_\theta[\ell(x_i; \theta)]

This interchange (linearity of expectation) lets us compute per-example expected losses separately.

10.5 Cauchy Product of Series

The Cauchy product defines multiplication of two series:

(i=0ai)(j=0bj)=n=0cnwhere cn=k=0nakbnk\left(\sum_{i=0}^{\infty} a_i\right) \left(\sum_{j=0}^{\infty} b_j\right) = \sum_{n=0}^{\infty} c_n \quad \text{where } c_n = \sum_{k=0}^{n} a_k b_{n-k}

The inner sum cn=k=0nakbnkc_n = \sum_{k=0}^{n} a_k b_{n-k} is a convolution - the same operation that defines conv layers in CNNs.

Connection to convolution:

  1D discrete convolution:

  a = [a_0, a_1, a_2, a_3]     (filter/kernel)
  b = [b_0, b_1, b_2, b_3, b_4] (input signal)

  (a * b)_n = \sum_k a_k * b_{n-k}

  This is exactly the Cauchy product - the coefficient of x^n
  in the polynomial product A(x)*B(x)

Mertens' theorem: If ai\sum a_i converges absolutely and bj\sum b_j converges, then the Cauchy product converges to the product of the sums. This guarantees that polynomial multiplication "works" for convergent power series.
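A quick NumPy check that the convolution above and the polynomial-product coefficients are the same sum (a minimal sketch; `np.convolve` computes exactly $c_n = \sum_k a_k b_{n-k}$):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])        # coefficients of A(x) = 1 + 2x + 3x^2
b = np.array([4.0, 5.0])             # coefficients of B(x) = 4 + 5x

conv = np.convolve(a, b)             # c_n = sum_k a_k * b_{n-k}
poly = np.polynomial.polynomial.polymul(a, b)   # coefficients of A(x) * B(x)

print(conv)                          # [ 4. 13. 22. 15.]
print(np.allclose(conv, poly))       # True
```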

AI applications:

  • 1D convolution in CNNs/WaveNet is a finite Cauchy product.
  • Polynomial multiplication via FFT: the Cauchy product of two length-nn sequences can be computed in O(nlogn)O(n \log n) using FFT, rather than O(n2)O(n^2) directly.
  • Generating functions in combinatorics: if A(x)A(x) and B(x)B(x) are generating functions, their product A(x)B(x)A(x) B(x) has coefficients given by the Cauchy product - used in counting arguments for data structures.

10.6 Vandermonde's Identity and Convolution

Vandermonde's identity is a combinatorial convolution:

(m+nr)=k=0r(mk)(nrk)\binom{m+n}{r} = \sum_{k=0}^{r} \binom{m}{k} \binom{n}{r-k}

Proof idea: Count the number of ways to choose rr items from a group of m+nm + n people (divided into group A of mm and group B of nn). Choose kk from A and rkr - k from B, then sum over all valid kk.

Connection to generating functions:

(1+x)m(1+x)n=(1+x)m+n(1+x)^m (1+x)^n = (1+x)^{m+n}. Comparing coefficients of xrx^r on both sides gives Vandermonde's identity - the coefficient on the left is the Cauchy product of binomial coefficients.

AI relevance:

  • Feature interaction counting: If model A has mm features and model B has nn features, the number of combined feature subsets of size rr follows Vandermonde's identity.
  • Mixture models: When combining categorical distributions, the convolution of their PMFs follows the same pattern.

10.7 Abel's Summation by Parts

Abel's summation is the discrete analogue of integration by parts:

k=1nakbk=Anbnk=1n1Ak(bk+1bk)\sum_{k=1}^{n} a_k b_k = A_n b_n - \sum_{k=1}^{n-1} A_k (b_{k+1} - b_k)

where Ak=i=1kaiA_k = \sum_{i=1}^{k} a_i is the partial sum.

Compare with integration by parts:

ContinuousDiscrete
udv=uvvdu\int u \, dv = uv - \int v \, duakbk=AnbnAkΔbk\sum a_k b_k = A_n b_n - \sum A_k \Delta b_k
uAku \leftrightarrow A_k (antiderivative of aa)Ak=i=1kaiA_k = \sum_{i=1}^k a_i
dvbkdv \leftrightarrow b_kΔbk=bk+1bk\Delta b_k = b_{k+1} - b_k

When to use Abel's summation:

The technique is useful when:

  1. The partial sums AkA_k are bounded (even if the aka_k oscillate).
  2. The sequence bkb_k is monotonically decreasing to 0.

Under these conditions, Abel's summation shows the original sum converges (Dirichlet's test).

AI application - convergence of weighted gradient sums:

In exponentially weighted moving averages (used in Adam, EMA models):

\bar{g}_t = \sum_{k=1}^{t} \beta^{t-k} g_k

Abel's summation helps analyse the properties of this weighted sum, particularly showing that the contribution of early gradients decays geometrically and the sum converges even if individual gradients gkg_k oscillate.
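A small sketch contrasting the recursive EMA update with the explicit weighted sum it unrolls into (plain Python/NumPy; β = 0.9 and the random "gradients" are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal(50)          # a noisy sequence of "gradients"
beta = 0.9

# Recursive form: g_bar_t = beta * g_bar_{t-1} + g_t, starting from 0
g_bar = 0.0
for g_t in g:
    g_bar = beta * g_bar + g_t

# Unrolled form: g_bar_t = sum_{k=1}^{t} beta^(t-k) g_k
t = len(g)
unrolled = sum(beta ** (t - k) * g[k - 1] for k in range(1, t + 1))

print(np.isclose(g_bar, unrolled))   # True
```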


11. Convergence of Infinite Series

Infinite sums - series - appear throughout AI: Taylor expansions for activation functions, geometric series for discounted rewards, power series for matrix functions. Understanding when an infinite sum converges (has a finite value) is essential for both theoretical proofs and numerical stability.

11.1 Partial Sums and the Definition of Convergence

An infinite series n=1an\sum_{n=1}^{\infty} a_n is defined as the limit of its partial sums:

S=n=1an=limNSNwhere SN=n=1NanS = \sum_{n=1}^{\infty} a_n = \lim_{N \to \infty} S_N \quad \text{where } S_N = \sum_{n=1}^{N} a_n

The series converges if this limit exists and is finite; otherwise it diverges.

Three types of behavior:

  Converges (e.g., \sum 1/2^n):        Diverges to \infty (e.g., \sum 1/n):

  S_N |        ---------- S=2       S_N |              /
      |      /                          |            /
      |    /                            |          /
      |  /                              |        /
      |/                                |      /
      +---------------- N               |    /
                                        |  /
  Partial sums approach a limit         |/
                                        +---------------- N
                                        Partial sums grow without bound

  Oscillates/Diverges (e.g., \sum (-1)^n):

  S_N |    *       *       *
      |  1 --+  1 --+  1 --
      |      |      |
      |  0 --+  0 --+  0 --
      +---------------- N
  Partial sums bounce between 0 and 1

Necessary condition (divergence test):

If an\sum a_n converges, then limnan=0\lim_{n \to \infty} a_n = 0.

Contrapositive: If liman0\lim a_n \neq 0 or doesn't exist, the series diverges.

Warning: The converse is false! an0a_n \to 0 does NOT guarantee convergence. The harmonic series 1/n\sum 1/n has an=1/n0a_n = 1/n \to 0 but diverges.

11.2 Absolute vs Conditional Convergence

Absolute convergence: an\sum |a_n| converges.

Conditional convergence: an\sum a_n converges but an\sum |a_n| diverges.

Key theorem: Absolute convergence implies convergence. (The converse is false.)

Example: The alternating harmonic series n=1(1)n+1n=ln2\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n} = \ln 2 converges conditionally - it converges, but 1/n\sum 1/n diverges.

Why absolute convergence matters in AI:

  • Absolutely convergent series can be rearranged in any order without changing the sum. Conditionally convergent series cannot (Riemann's rearrangement theorem - you can rearrange a conditionally convergent series to converge to any value!).
  • Parallel computation of sums (across GPUs) implicitly rearranges terms. If the series is only conditionally convergent, different summation orders give different results - a source of numerical non-determinism.
  • Most sums in AI (loss over data, gradient components) involve non-negative terms, so convergence is always absolute.

11.3 Convergence Tests Summary

| Test | Statement | Best For |
| --- | --- | --- |
| Divergence | If $a_n \not\to 0$, diverges | Quick rejection |
| Comparison | If $0 \leq a_n \leq b_n$ and $\sum b_n$ converges, so does $\sum a_n$ | Bounding by known series |
| Limit comparison | If $\lim a_n/b_n = L > 0$, both converge or both diverge | Same rate detection |
| Ratio | If $\lim \lvert a_{n+1}/a_n \rvert = r$: converges if $r < 1$, diverges if $r > 1$ | Factorials, exponentials |
| Root | If $\lim \lvert a_n \rvert^{1/n} = r$: converges if $r < 1$, diverges if $r > 1$ | $n$-th powers |
| Integral | $\sum f(n)$ and $\int f(x)\,dx$ converge/diverge together (for positive decreasing $f$) | Continuous analogues |
| Alternating series | If $\lvert a_n \rvert$ decreases to 0, $\sum (-1)^n a_n$ converges | Sign-alternating terms |
| $p$-series | $\sum 1/n^p$ converges iff $p > 1$ | Polynomial decay |

Ratio test in detail (most commonly used):

L=limnan+1anL = \lim_{n \to \infty} \left| \frac{a_{n+1}}{a_n} \right|
  • L<1L < 1: Absolutely convergent. The terms decay at least geometrically.
  • L>1L > 1: Divergent. The terms grow.
  • L=1L = 1: Inconclusive. Need another test.

AI example - convergence of Taylor series for exe^x:

ex=n=0xnn!,an+1an=xn+10 as ne^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}, \quad \frac{a_{n+1}}{a_n} = \frac{|x|}{n+1} \to 0 \text{ as } n \to \infty

Since L=0<1L = 0 < 1, the series converges for all xx. This is why exe^x (and hence softmax) is well-defined everywhere.

11.4 Power Series and Radius of Convergence

A power series n=0cn(xa)n\sum_{n=0}^{\infty} c_n (x - a)^n converges in some interval centered at aa:

R=1lim supncn1/n(radius of convergence)R = \frac{1}{\limsup_{n \to \infty} |c_n|^{1/n}} \quad \text{(radius of convergence)}

The series converges absolutely for xa<R|x - a| < R and diverges for xa>R|x - a| > R.

Key power series in AI:

FunctionPower SeriesRadius
exe^xxnn!\sum \frac{x^n}{n!}R=R = \infty
ln(1+x)\ln(1+x)(1)n+1xnn\sum \frac{(-1)^{n+1} x^n}{n}R=1R = 1
11x\frac{1}{1-x}xn\sum x^nR=1R = 1
tanh(x)\tanh(x)xx33+2x515x - \frac{x^3}{3} + \frac{2x^5}{15} - \cdotsR=π/2R = \pi/2
GELU (xΦ(x))(x \Phi(x))ComplicatedR=R = \infty
sigmoid(x)\text{sigmoid}(x)12+x4x348+\frac{1}{2} + \frac{x}{4} - \frac{x^3}{48} + \cdotsR=πR = \pi

Why radius of convergence matters:

  • The Taylor approximation of tanh\tanh is only valid for x<π/2|x| < \pi/2 - outside this range, the polynomial approximation diverges wildly. This matters for hardware implementations that approximate activation functions with polynomials.
  • The ln(1+x)\ln(1+x) series only converges for x1|x| \leq 1 - so log1p(x) is numerically implemented differently for large xx.

11.5 Convergence of Training - An Analogy

While not an infinite series in the mathematical sense, training a neural network produces a sequence of losses {Lt}t=0\{\mathcal{L}_t\}_{t=0}^{\infty} that we hope converges.

Connections to series convergence:

Series ConceptTraining Analogue
Partial sum SNS_NCumulative loss or running average
an0a_n \to 0 necessaryLoss improvements must shrink
Absolute convergenceAll parameter dimensions converge
Conditional convergenceLoss oscillates but trends down
Radius of convergenceLearning rate stability region
Geometric decay rnr^nExponential learning rate schedule

Learning rate as a convergence factor:

For SGD with learning rate η\eta on a quadratic loss surface with Hessian eigenvalue λ\lambda:

errort(1ηλ)t\text{error}_t \propto (1 - \eta \lambda)^t

This is a geometric series ratio. Convergence requires 1ηλ<1|1 - \eta \lambda| < 1, i.e., η<2/λ\eta < 2/\lambda.

For the maximum eigenvalue λmax\lambda_{\max} of the Hessian, the critical learning rate is:

ηcrit=2λmax\eta_{\text{crit}} = \frac{2}{\lambda_{\max}}

Beyond this, training diverges - exactly like a geometric series with r>1|r| > 1.
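A minimal sketch on a one-dimensional quadratic loss with Hessian eigenvalue λ = 1, showing geometric convergence below η_crit = 2/λ and divergence above it:

```python
lam = 1.0                      # Hessian eigenvalue of the quadratic loss L = 0.5 * lam * theta^2
for eta in [0.5, 1.0, 1.9, 2.1]:
    theta = 1.0
    for _ in range(50):
        theta -= eta * lam * theta      # gradient step; error scales by (1 - eta*lam) each step
    print(f"eta={eta:>4}: |theta| after 50 steps = {abs(theta):.3e}")
# eta < 2/lam converges (|1 - eta*lam| < 1); eta = 2.1 diverges
```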


12. Summation in Probability and Statistics

Summation is the language of probability theory for discrete random variables. Every key concept - expectation, variance, entropy - is defined as a sum.

12.1 Expectation as Weighted Sum

The expected value of a discrete random variable XX with PMF p(x)p(x):

E[X]=xXxp(x)E[X] = \sum_{x \in \mathcal{X}} x \cdot p(x)

More generally, for any function gg:

E[g(X)]=xXg(x)p(x)E[g(X)] = \sum_{x \in \mathcal{X}} g(x) \cdot p(x)

This is a weighted sum where the weights are probabilities (summing to 1):

  Expectation = weighted average

  Values:      x_1     x_2     x_3     x_4
  Probabilities: p_1     p_2     p_3     p_4
                  |       |       |       |
                  v       v       v       v
  E[X] = ------ x_1p_1 + x_2p_2 + x_3p_3 + x_4p_4 ------

  Constraint: p_1 + p_2 + p_3 + p_4 = 1

Key properties (all following from linearity of summation):

PropertyFormulaSummation Origin
LinearityE[aX+bY]=aE[X]+bE[Y]E[aX + bY] = aE[X] + bE[Y](ax+by)p=axp+byp\sum (ax + by)p = a\sum xp + b\sum yp
ConstantE[c]=cE[c] = ccp(x)=cp(x)=c\sum c \cdot p(x) = c \sum p(x) = c
Non-negativityX0E[X]0X \geq 0 \Rightarrow E[X] \geq 0Sum of non-negative terms
IndicatorE[1A]=P(A)E[\mathbb{1}_A] = P(A)xAp(x)=P(A)\sum_{x \in A} p(x) = P(A)

AI example - expected loss:

The training objective E(x,y)D[(fθ(x),y)]E_{(x,y) \sim \mathcal{D}}[\ell(f_\theta(x), y)] becomes, for a finite dataset:

E^[]=1ni=1n(fθ(xi),yi)\hat{E}[\ell] = \frac{1}{n} \sum_{i=1}^{n} \ell(f_\theta(x_i), y_i)

This is the empirical mean - a sum with equal weights 1/n1/n - that approximates the true expectation.

12.2 Variance and Higher Moments

Variance measures spread around the mean:

Var(X)=E[(Xμ)2]=x(xμ)2p(x)\text{Var}(X) = E[(X - \mu)^2] = \sum_x (x - \mu)^2 p(x)

Computational formula (more numerically stable in some cases):

Var(X)=E[X2](E[X])2=(xx2p(x))(xxp(x))2\text{Var}(X) = E[X^2] - (E[X])^2 = \left(\sum_x x^2 p(x)\right) - \left(\sum_x x p(x)\right)^2

Proof:

Var(X)=E[(Xμ)2]=E[X22μX+μ2]=E[X2]2μE[X]+μ2=E[X2]μ2\text{Var}(X) = E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - \mu^2

Variance of a sum:

Var(i=1nXi)=i=1nVar(Xi)+2i<jCov(Xi,Xj)\text{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \text{Var}(X_i) + 2 \sum_{i<j} \text{Cov}(X_i, X_j)

For independent variables, covariances vanish: Var(Xi)=Var(Xi)\text{Var}(\sum X_i) = \sum \text{Var}(X_i).

AI application - gradient variance:

The variance of the stochastic gradient with batch size BB:

Var[g^]=Var[1Bi=1Bgi]=1B2i=1BVar[gi]=σ2B\text{Var}[\hat{g}] = \text{Var}\left[\frac{1}{B} \sum_{i=1}^{B} g_i\right] = \frac{1}{B^2} \sum_{i=1}^{B} \text{Var}[g_i] = \frac{\sigma^2}{B}

This 1/B1/B scaling is why larger batches reduce gradient noise but with diminishing returns on convergence speed.

12.3 Entropy as Expected Surprise

Shannon entropy measures the average "surprise" or "information content" of a distribution:

H(X)=xXp(x)logp(x)=E[logp(X)]H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = E[-\log p(X)]

Decomposition of cross-entropy:

H(p,q)=xp(x)logq(x)=H(p)+DKL(pq)H(p, q) = -\sum_x p(x) \log q(x) = H(p) + D_{\text{KL}}(p \| q)

where DKL(pq)=xp(x)logp(x)q(x)0D_{\text{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \geq 0.

AI significance:

  • Training with cross-entropy loss minimises H(p,q)H(p, q) where pp is the true distribution and qq is the model. Since H(p)H(p) is fixed, this is equivalent to minimising DKL(pq)D_{\text{KL}}(p \| q).
  • Maximum entropy occurs at the uniform distribution: Hmax=logXH_{\max} = \log |\mathcal{X}|.
  • Temperature scaling in softmax: pi=ezi/Tjezj/Tp_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}. As TT \to \infty, entropy increases toward logK\log K; as T0T \to 0, entropy decreases toward 0 (argmax).

Entropy of common distributions:

DistributionPMFEntropy
Bernoulli(pp)p,(1p)p, (1-p)plogp(1p)log(1p)-p\log p - (1-p)\log(1-p)
Uniform(KK)1/K1/K eachlogK\log K
Geometric(pp)(1p)n1p(1-p)^{n-1} p(1p)log(1p)plogpp\frac{-(1-p)\log(1-p) - p \log p}{p}
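The temperature-scaling behaviour noted above is easy to check numerically. A minimal NumPy sketch (toy logits; entropy in nats):

```python
import numpy as np

def entropy_at_temperature(z, T):
    p = np.exp(z / T - np.max(z / T))          # stabilised softmax at temperature T
    p /= p.sum()
    return -np.sum(p * np.log(p + 1e-12))      # H = -sum p log p (nats)

z = np.array([3.0, 1.0, 0.0, -1.0])
for T in [0.1, 1.0, 10.0, 100.0]:
    print(f"T={T:>6}: H = {entropy_at_temperature(z, T):.3f}")
# As T grows, H approaches log(4) ~= 1.386; as T -> 0, H -> 0
```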

12.4 Law of Total Expectation and Total Variance

Law of total expectation (tower property):

E[X]=E[E[XY]]=yE[XY=y]P(Y=y)E[X] = E[E[X \mid Y]] = \sum_y E[X \mid Y = y] P(Y = y)

This is a double summation - the inner sum computes E[XY=y]E[X \mid Y = y] for each yy, and the outer sum averages over yy.

Proof:

yE[XY=y]P(Y=y)=y(xxP(X=xY=y))P(Y=y)\sum_y E[X \mid Y=y] P(Y=y) = \sum_y \left(\sum_x x \, P(X=x \mid Y=y)\right) P(Y=y) =yxxP(X=x,Y=y)=xxyP(X=x,Y=y)=xxP(X=x)=E[X]= \sum_y \sum_x x \, P(X=x, Y=y) = \sum_x x \sum_y P(X=x, Y=y) = \sum_x x \, P(X=x) = E[X]

AI application - mixture models:

For a Gaussian mixture model p(x)=k=1KπkN(x;μk,σk2)p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x; \mu_k, \sigma_k^2):

E[X]=k=1KπkμkE[X] = \sum_{k=1}^{K} \pi_k \mu_k

Law of total variance:

Var(X)=E[Var(XY)]+Var(E[XY])\text{Var}(X) = E[\text{Var}(X \mid Y)] + \text{Var}(E[X \mid Y])

=(average within-group variance)+(between-group variance)= \text{(average within-group variance)} + \text{(between-group variance)}

This decomposition appears in ANOVA, random effects models, and the bias-variance decomposition of ML models.

12.5 Inclusion-Exclusion Principle

For the union of events:

P(i=1nAi)=k=1n(1)k+1S=kP(iSAi)P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{k=1}^{n} (-1)^{k+1} \sum_{|S|=k} P\left(\bigcap_{i \in S} A_i\right)

Expanded for n=3n = 3:

P(A1A2A3)=P(A1)+P(A2)+P(A3)P(A1A2)P(A1A3)P(A2A3)+P(A1A2A3)P(A_1 \cup A_2 \cup A_3) = P(A_1) + P(A_2) + P(A_3) - P(A_1 \cap A_2) - P(A_1 \cap A_3) - P(A_2 \cap A_3) + P(A_1 \cap A_2 \cap A_3)
  Venn diagram (n=3):

         A_1
        /  \
       / +1 \
      /  /-\  \
     |  |-1 |  |
  A_2--- |+1| ---A_3
      \  \-/  /
       \ -1 /
        \  /

  Each region counted: singles (+), pairs (-), triple (+)
  Alternating signs prevent over-counting

The sum has 2n12^n - 1 terms - exponential in the number of sets.

AI applications:

  • Deduplication analysis: Probability that at least one near-duplicate exists in a dataset.
  • Hash collision bounds: P(any collision among n items in m buckets)P(\text{any collision among } n \text{ items in } m \text{ buckets}).
  • Union bound (Boole's inequality): The simpler upper bound P(Ai)P(Ai)P(\bigcup A_i) \leq \sum P(A_i) is often sufficient and avoids the exponential blowup.
  • Coverage testing: Probability that a test suite covers at least one of several code paths.

Bonferroni inequalities truncate the inclusion-exclusion sum:

  • Truncating after odd terms gives an upper bound.
  • Truncating after even terms gives a lower bound.

This provides a practical way to get bounds without computing all 2n2^n terms.


13. Notation Variants and Conventions Across Fields

Different communities use different summation conventions. Recognising these variants prevents confusion when reading papers from different fields.

13.1 Computer Science / Programming

ConventionMeaningNotes
0-indexed: i=0n1\sum_{i=0}^{n-1}Sum over nn elementsDefault in Python, C, most languages
sum(list)i=0len1list[i]\sum_{i=0}^{\text{len}-1} \text{list}[i]No explicit indices
reduce(+, arr)Left fold: (((a0+a1)+a2)+)(\cdots((a_0 + a_1) + a_2) + \cdots)Associativity matters for floats
Array slice: a[i:j]Elements ai,ai+1,,aj1a_i, a_{i+1}, \ldots, a_{j-1}Exclusive upper bound

13.2 Physics

ConventionMeaningNotes
Einstein summationRepeated indices summedNo Σ\Sigma written
Upper/lower indicesaibi=iaibia^i b_i = \sum_i a^i b_iContravariant/covariant distinction
Metric contractionai=gijaja_i = g_{ij} a^jLowering with metric tensor
μ\partial_\muxμ\frac{\partial}{\partial x^\mu}4-vector gradient

13.3 Statistics

ConventionMeaningNotes
xˉ=1nxi\bar{x} = \frac{1}{n}\sum x_iSample meanIndices often omitted
i\sum_i with no boundsSum over all observationsContext determines range
Ep[f(X)]E_p[f(X)]xf(x)p(x)\sum_x f(x) p(x)Expectation notation
i<j\sum_{i<j}i=1n1j=i+1n\sum_{i=1}^{n-1} \sum_{j=i+1}^{n}All unordered pairs

13.4 LaTeX / Typesetting

% Inline: limits on side          Display: limits above/below
$\sum_{i=1}^{n} a_i$              $$\sum_{i=1}^{n} a_i$$

% Product
$\prod_{j=1}^{m} b_j$            $$\prod_{j=1}^{m} b_j$$

% Multiple indices
$\sum_{\substack{i=1\\i \neq k}}^{n}$    % Exclusion

% Sum over a set
$\sum_{x \in S}$                  % Index set notation

13.5 Notation Translation Table

| Your Field Says | Mathematicians Write | Einstein Writes | Python Says |
| --- | --- | --- | --- |
| Dot product | $\sum_{i=1}^{n} a_i b_i$ | $a_i b_i$ | `a @ b` or `np.dot(a, b)` |
| Matrix multiply | $\sum_k A_{ik} B_{kj}$ | $A_{ik} B_{kj}$ | `A @ B` |
| Weighted average | $\sum_i w_i x_i$ | $w_i x_i$ | `np.average(x, weights=w)` |
| Trace | $\sum_i A_{ii}$ | $A_{ii}$ | `np.trace(A)` |
| Batch mean | $\frac{1}{B}\sum_{i=1}^{B} x_i$ | - | `x.mean(dim=0)` |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | - | `F.softmax(z, dim=-1)` |

14. Common Mistakes and Pitfalls

Summation notation is deceptively simple, but subtle errors can cause bugs in proofs, papers, and code. This section catalogues the most frequent mistakes.

14.1 The Top 10 Summation Mistakes

| # | Mistake | Wrong | Correct | AI Consequence |
| --- | --- | --- | --- | --- |
| 1 | Index scope leak | Using $i$ outside $\sum_i$ | $i$ is a bound (dummy) variable | Undefined variable bugs |
| 2 | Off-by-one bounds | $\sum_{i=0}^{n}$ vs $\sum_{i=1}^{n}$ | Count elements: $n+1$ vs $n$ | Dimension mismatch errors |
| 3 | Distributing $\Sigma$ over products | $\sum a_i b_i = (\sum a_i)(\sum b_i)$ | $\sum a_i b_i \neq (\sum a_i)(\sum b_i)$ | Wrong loss gradients |
| 4 | Distributing $\Pi$ over sums | $\prod(a_i + b_i) = \prod a_i + \prod b_i$ | Product distributes over factors, not addends | Wrong likelihood |
| 5 | Swapping $\sum$ and non-linear $f$ | $f(\sum a_i) = \sum f(a_i)$ | Only true for linear $f$ | Jensen's inequality violation |
| 6 | Forgetting empty cases | Assuming $n \geq 1$ | Empty sum = 0, empty product = 1 | Edge case crashes |
| 7 | Wrong $\Sigma \leftrightarrow \Pi$ conversion | $\log \sum = \sum \log$ | $\log \prod = \sum \log$, NOT $\log \sum$ | log-sum-exp vs sum-of-logs |
| 8 | Index collision | $\sum_i a_i \sum_i b_i$ | $\sum_i a_i \sum_j b_j$ | Ambiguous scope |
| 9 | Telescoping failure | Cancelling wrong terms | Write out the first few terms explicitly | Wrong simplification |
| 10 | Infinite sum truncation | Treating $\sum_{n=0}^{N}$ as $\sum_{n=0}^{\infty}$ | Bound the tail: $\sum_{n>N}$ | Approximation error |

14.2 Detailed Examples

Mistake 3 in detail - sum of products \neq product of sums:

i=12aibi=a1b1+a2b2\sum_{i=1}^{2} a_i b_i = a_1 b_1 + a_2 b_2 (i=12ai)(i=12bi)=a1b1+a1b2+a2b1+a2b2\left(\sum_{i=1}^{2} a_i\right)\left(\sum_{i=1}^{2} b_i\right) = a_1 b_1 + a_1 b_2 + a_2 b_1 + a_2 b_2

The product of sums has cross terms (a1b2,a2b1a_1 b_2, a_2 b_1). The correct relationship:

(iai)(jbj)=ijaibj\left(\sum_i a_i\right)\left(\sum_j b_j\right) = \sum_i \sum_j a_i b_j

This is a double sum (outer product), not a single sum (dot product).

Mistake 5 in detail - Jensen's inequality:

For convex ff: f(ipixi)ipif(xi)f\left(\sum_i p_i x_i\right) \leq \sum_i p_i f(x_i) (where pi=1\sum p_i = 1, pi0p_i \geq 0).

For concave ff: the inequality reverses.

Since log\log is concave:

log(ipixi)ipilog(xi)\log\left(\sum_i p_i x_i\right) \geq \sum_i p_i \log(x_i)

This is the basis of the ELBO (Evidence Lower Bound) in variational inference:

logp(x)=logzp(x,z)=logzq(z)p(x,z)q(z)zq(z)logp(x,z)q(z)=ELBO\log p(x) = \log \sum_z p(x,z) = \log \sum_z q(z) \frac{p(x,z)}{q(z)} \geq \sum_z q(z) \log \frac{p(x,z)}{q(z)} = \text{ELBO}

Mistake 7 in detail - log of sum vs sum of logs:

  +==================================================+
  |  CRITICAL DISTINCTION:                           |
  |                                                  |
  |  log(\sum a_i)  \neq  \sum log(a_i)   <- WRONG to swap   |
  |                                                  |
  |  log(\prod a_i)  =  \sum log(a_i)     <- CORRECT         |
  |                                                  |
  |  Logs convert PRODUCTS to SUMS, not SUMS to SUMS |
  +==================================================+
  |  NLL loss = -log p(data)                         |
  |           = -log \prod p(x_i)    (for i.i.d. data)   |
  |           = -\sum log p(x_i)     <- sum of logs       |
  |                                                  |
  |  BUT:  -log \sum p(x_i)  is DIFFERENT - this is      |
  |        the log of a mixture, not a product.      |
  +==================================================+

14.3 Numerical Pitfalls

Catastrophic cancellation:

When summing numbers with alternating signs or vastly different magnitudes, floating-point errors accumulate:

i=1naiwhere some ai0 and some ai0\sum_{i=1}^{n} a_i \quad \text{where some } a_i \gg 0 \text{ and some } a_i \ll 0

Solutions:

  • Kahan summation: Tracks the running compensation term; reduces error from O(nϵ)O(n \epsilon) to O(ϵ)O(\epsilon) (see the sketch after this list).
  • Pairwise summation: Recursively splits the array and sums pairs; reduces error to O(ϵlogn)O(\epsilon \log n). (This is what NumPy uses.)
  • Sort-then-sum: Sum smallest-magnitude terms first.
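A minimal sketch of Kahan (compensated) summation versus naive accumulation, using a sum where 1000 small terms would otherwise be absorbed by one large term (plain Python floats):

```python
xs = [1e16] + [1.0] * 1000

def kahan_sum(values):
    total, comp = 0.0, 0.0
    for x in values:
        y = x - comp              # apply the compensation carried from the previous step
        t = total + y
        comp = (t - total) - y    # the low-order bits that t failed to absorb
        total = t
    return total

naive = 0.0
for x in xs:
    naive += x

print(f"naive: {naive:.1f}")           # 10000000000000000.0  (the 1000 small terms vanish)
print(f"kahan: {kahan_sum(xs):.1f}")   # ~10000000000001000.0 (within a few ulps of exact)
print(f"exact: {1e16 + 1000:.1f}")
```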

Overflow/underflow in products:

i=1npi0(underflow for large n)\prod_{i=1}^{n} p_i \to 0 \quad \text{(underflow for large } n \text{)}

Solution: Work in log space: accumulate \sum_i \log p_i directly rather than forming \prod_i p_i and then taking the log. The two are equal mathematically, but the product underflows long before the sum of logs loses precision.
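A two-line illustration of the underflow (1000 factors of 0.01; the true product is 10^-2000, far below the smallest positive float):

```python
import math

probs = [0.01] * 1000
product = math.prod(probs)                     # underflows to 0.0
log_sum = sum(math.log(p) for p in probs)      # -4605.17..., perfectly representable

print(product, log_sum)
```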


15. Exercises and Practice Problems

Exercise Set 1: Basic Manipulation

E1.1 Expand and compute: k=15(2k1)\sum_{k=1}^{5} (2k - 1)

E1.2 Write in sigma notation: 3+6+9+12++3n3 + 6 + 9 + 12 + \cdots + 3n

E1.3 Expand and compute: j=25jj1\prod_{j=2}^{5} \frac{j}{j-1}

E1.4 Prove: i=1n(aiai1)=ana0\sum_{i=1}^{n} (a_i - a_{i-1}) = a_n - a_0 (telescoping)

Exercise Set 2: Algebraic Properties

E2.1 Show that i=1n(xixˉ)=0\sum_{i=1}^{n} (x_i - \bar{x}) = 0 where xˉ=1ni=1nxi\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.

E2.2 Prove: i=1n(xixˉ)2=i=1nxi2nxˉ2\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2.

E2.3 Verify Gauss's trick: i=1ni=n(n+1)2\sum_{i=1}^{n} i = \frac{n(n+1)}{2} for n=100n = 100.

E2.4 Show that i=1neai=ei=1nai\prod_{i=1}^{n} e^{a_i} = e^{\sum_{i=1}^{n} a_i}.

Exercise Set 3: Summation Formulas

E3.1 Derive i=0n1ri=1rn1r\sum_{i=0}^{n-1} r^i = \frac{1 - r^n}{1 - r} by multiplying both sides by (1r)(1-r).

E3.2 Compute k=013k\sum_{k=0}^{\infty} \frac{1}{3^k}.

E3.3 Find a closed form for k=1nk2k\sum_{k=1}^{n} k \cdot 2^k using the derivative trick on the geometric series.

E3.4 Compute k=1n1k(k+1)\sum_{k=1}^{n} \frac{1}{k(k+1)} using partial fractions and telescoping.

Exercise Set 4: Double Sums

E4.1 Compute i=13j=13ij\sum_{i=1}^{3} \sum_{j=1}^{3} ij two ways (row-first and column-first).

E4.2 Change the order of summation: i=1nj=inaij=j=1ni=1jaij\sum_{i=1}^{n} \sum_{j=i}^{n} a_{ij} = \sum_{j=1}^{n} \sum_{i=1}^{j} a_{ij}.

E4.3 Show that i=1nj=1naibj=(i=1nai)(j=1nbj)\sum_{i=1}^{n} \sum_{j=1}^{n} a_i b_j = \left(\sum_{i=1}^{n} a_i\right)\left(\sum_{j=1}^{n} b_j\right).

Exercise Set 5: Einstein Notation

E5.1 Convert to standard sigma notation: Tij=AikBkj+CijT_{ij} = A_{ik} B_{kj} + C_{ij}.

E5.2 Write the einsum string for computing the trace of a matrix product: tr(AB)=AijBji\text{tr}(AB) = A_{ij} B_{ji}.

E5.3 What does 'bhqd,bhkd->bhqk' compute? Identify free and dummy indices.

Exercise Set 6: AI Applications

E6.1 Show that softmax probabilities sum to 1: i=1Kezijezj=1\sum_{i=1}^{K} \frac{e^{z_i}}{\sum_j e^{z_j}} = 1.

E6.2 Derive the gradient zk(logezcjezj)=softmax(z)k1[k=c]\frac{\partial}{\partial z_k}\left(-\log \frac{e^{z_c}}{\sum_j e^{z_j}}\right) = \text{softmax}(z)_k - \mathbb{1}[k=c].

E6.3 Compute the perplexity of a model that assigns probability 0.1 to each of 5 tokens in a sequence.

E6.4 Show that the gradient of the mean loss 1Bii\frac{1}{B}\sum_i \ell_i equals the mean of the gradients 1Bii\frac{1}{B}\sum_i \nabla \ell_i.

Exercise Set 7: Convergence

E7.1 Apply the ratio test to determine convergence of n=1n22n\sum_{n=1}^{\infty} \frac{n^2}{2^n}.

E7.2 Determine whether n=11n3/2\sum_{n=1}^{\infty} \frac{1}{n^{3/2}} converges.

E7.3 Find the radius of convergence of n=0xnn2+1\sum_{n=0}^{\infty} \frac{x^n}{n^2 + 1}.

Exercise Set 8: Probability

E8.1 For a fair die, compute E[X]E[X] and Var(X)\text{Var}(X) from the definitions using summation.

E8.2 Verify that H(Bernoulli(1/2))=log20.693H(\text{Bernoulli}(1/2)) = \log 2 \approx 0.693 nats (= 1 bit).

E8.3 Use the law of total expectation to find E[X]E[X] where XN=nUniform{1,,n}X \mid N=n \sim \text{Uniform}\{1, \ldots, n\} and NGeometric(1/2)N \sim \text{Geometric}(1/2).


16. Why This Matters: The Summation Lens on AI

Summation and product notation are not just mathematical conveniences - they are the computational atoms from which all of AI is built.

The Impact Map

AI ConceptCore Summation PatternWithout Understanding It
Forward passMatrix multiply kWikxk\sum_k W_{ik} x_kCannot read model architectures
BackpropagationChain rule + sum over pathsCannot derive or debug gradients
Loss functionsCross-entropy ylogy^-\sum y \log \hat{y}Cannot design new objectives
Optimisersθt=θt1η\theta_t = \theta_{t-1} - \eta \sumCannot tune learning rates
AttentionWeighted sum αjVj\sum \alpha_j V_jCannot understand transformers
EvaluationPerplexity = exp(1Tlogp)\exp(-\frac{1}{T}\sum \log p)Cannot interpret model quality
ProbabilityE[X]=xp(x)E[X] = \sum x \, p(x)Cannot reason about uncertainty
Regularisationλθi2\lambda \sum \theta_i^2 (L2)Cannot balance fit vs complexity
Batch processing1Bi\frac{1}{B} \sum \ell_iCannot scale training
Information theoryH=plogpH = -\sum p \log pCannot measure information

The Three Skills This Section Builds

  +-----------------------------------------------------+
  |           Fluency with Summation Notation            |
  |                                                     |
  |  1. READ:  Parse \sum and \prod expressions in papers      |
  |            Identify index ranges and dummy vars      |
  |            Recognise standard patterns               |
  |                                                     |
  |  2. MANIPULATE:  Apply algebraic properties          |
  |                  Change summation order               |
  |                  Use telescoping and cancellation     |
  |                  Convert between \sum, \prod, and log       |
  |                                                     |
  |  3. CONNECT:  Map formulas to code                   |
  |               Map code to formulas                   |
  |               Estimate computational cost            |
  |               Debug via mathematical reasoning       |
  +-----------------------------------------------------+

Conceptual Bridge: From Summation to Einstein Summation and Index Notation

  +------------------------------------------------------------------+
  |                      CONCEPTUAL BRIDGE                          |
  |                                                                  |
  |  This Section (04)              Next Section (05)                |
  |  Summation & Products           Einstein Summation & Tensors     |
  |                                                                  |
  |  \sum_i a_i b_i     ------->     a_i b_i (implicit sum)           |
  |  \sum_k A_ik B_kj   ------->     A_ik B_kj (matrix multiply)      |
  |  Scalar indices     ------->     Tensor indices (rank \geq 2)       |
  |  Finite sums        ------->     Tensor contraction              |
  |  Index manipulation ------->     Index gymnastics                 |
  |  torch.sum()        ------->     torch.einsum()                  |
  |                                                                  |
  |  Key transition:                                                 |
  |  "Explicit sigma" -> "Implicit contraction" -> "Tensor networks"  |
  |                                                                  |
  |  You now have:                                                   |
  |  OK Fluency with \sum and \prod notation                                |
  |  OK Algebraic manipulation tools                                  |
  |  OK Closed-form formulas for standard sums                       |
  |  OK Convergence analysis for infinite series                     |
  |  OK Recognition of AI summation patterns                         |
  |                                                                  |
  |  Next you'll learn:                                              |
  |  -> Einstein convention as shorthand for repeated-index sums      |
  |  -> Tensor rank, contraction, and index notation                  |
  |  -> Covariant vs contravariant indices                            |
  |  -> Advanced einsum patterns for multi-dimensional operations     |
  |  -> Tensor networks and their connection to neural architectures  |
  +------------------------------------------------------------------+

References and Further Reading

  1. Knuth, D. E. (1997). The Art of Computer Programming, Vol. 1: Fundamental Algorithms, 3rd ed. Addison-Wesley. - Definitive treatment of summation techniques (Chapter 1.2.3).

  2. Graham, R. L., Knuth, D. E., & Patashnik, O. (1994). Concrete Mathematics, 2nd ed. Addison-Wesley. - The best resource for summation methods, with hundreds of worked examples.

  3. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. - See Chapter 2 for notation conventions and Chapter 3 for probability/information theory.

  4. Vaswani, A., et al. (2017). "Attention Is All You Need." NeurIPS. - The transformer paper that made αijVj\sum \alpha_{ij} V_j the central operation in modern AI.

  5. Einstein, A. (1916). "Die Grundlage der allgemeinen Relativitätstheorie." Annalen der Physik. - Introduction of the summation convention.

  6. Apostol, T. M. (1974). Mathematical Analysis, 2nd ed. Addison-Wesley. - Rigorous treatment of series convergence.

  7. Robertson, S. E., & Zaragoza, H. (2009). "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval. - BM25 derivation and analysis.


Navigation: <- 03 Functions and Mappings | up Mathematical Foundations | 05 Einstein Summation and Index Notation ->