"Summation and product notation are the syntax of quantitative AI reasoning - the grammar in which every loss function, every gradient, every attention score, and every probability distribution is expressed. Fluency with \Sigma and \Pi is the difference between reading ML papers as fluid prose and decoding them symbol by symbol."
Overview
Summation notation (\Sigma) and product notation (\Pi) are compact symbolic languages for expressing repeated addition and multiplication over collections of terms. They are not merely abbreviations - they are precise mathematical objects with well-defined rules for manipulation, interchange, and transformation.
For AI and LLMs specifically, virtually every formula you will encounter is written in \Sigma or \Pi notation: loss functions, gradients, attention scores, probability distributions, normalisation constants, perplexity, entropy, KL divergence, matrix multiplication. Without mastery of these notations, every formula requires line-by-line decoding; with it, mathematical expressions read like natural language.
This chapter covers summation and product notation from first principles through to the specific patterns that appear in modern transformer-based AI systems. Every concept is grounded in how it appears in real machine learning computation.
Prerequisites
- Basic arithmetic operations (addition, multiplication, exponentiation)
- Familiarity with sets and index notation
- Functions and mappings (previous chapter)
- Logarithms and exponentials
- Familiarity with Python/NumPy (for companion notebook)
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive code: summation formulas, product identities, convergence demos, einsum operations |
Learning Objectives
After completing this section, you will:
- Read and write summation (\Sigma) and product (\Pi) notation fluently in any ML context
- Apply all algebraic properties: linearity, splitting, reindexing, interchange, telescoping
- Derive closed-form results for arithmetic, geometric, and power series
- Convert between products and sums using logarithms (the foundation of log-likelihood)
- Recognize Einstein summation convention as the shorthand behind tensor notation and einsum
- Manipulate double sums, change summation order, and handle triangular index regions
- Determine convergence of infinite series using ratio, root, comparison, and integral tests
- Decompose every major AI formula (softmax, cross-entropy, attention, perplexity, BM25) into its summation/product structure
- Identify and avoid the top 10 summation/product mistakes that cause bugs in ML code
Table of Contents
- Summation and Product Notation
- Overview
- Prerequisites
- Companion Notebooks
- Learning Objectives
- Table of Contents
- 1. Intuition
- 2. Summation Notation - Formal Definitions
- 3. Product Notation - Formal Definitions
- 4. Algebraic Properties of Summation
- 5. Algebraic Properties of Products
- 6. Important Summation Formulas
- 7. Summation Bounds and Estimation
- 8. Special Summation Patterns in AI
- 9. Einstein Summation Convention
- 10. Double Sums and Index Interaction
- 11. Convergence of Infinite Series
- 12. Summation in Probability and Statistics
- 13. Notation Variants and Conventions
- 14. Common Mistakes
- 15. Exercises
- 16. Why This Matters for AI (2026 Perspective)
- Conceptual Bridge
1. Intuition
1.1 What Is Summation and Product Notation?
Summation notation (\Sigma) and product notation (\Pi) are compact symbolic languages for expressing repeated addition and multiplication over collections of terms. They are the mathematical equivalent of a for loop - but with algebraic structure that allows formal manipulation.
Without them, writing the loss over a million training examples requires a million terms. With them, one line:

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i \mid x_i; \theta)$$

This single expression encodes: "for each of the $N$ training examples, compute the log-probability of the correct label $y_i$ given input $x_i$ under model parameters $\theta$, negate, and average."
They are not merely abbreviations - they are precise mathematical objects with well-defined rules for manipulation, interchange, and transformation. You can:
- Split a sum into parts: $\sum_{i=1}^{n} a_i = \sum_{i=1}^{k} a_i + \sum_{i=k+1}^{n} a_i$
- Factor out constants: $\sum_{i} c\,a_i = c \sum_{i} a_i$
- Interchange order: $\sum_i \sum_j a_{ij} = \sum_j \sum_i a_{ij}$ (under certain conditions)
- Convert products to sums: $\log \prod_i a_i = \sum_i \log a_i$
- Telescope: $\sum_{i=1}^{n} (a_{i+1} - a_i) = a_{n+1} - a_1$
For AI, fluency with \Sigma and \Pi is non-negotiable. Virtually every formula in machine learning is written in summation or product notation - loss functions, gradients, attention scores, probability distributions, normalisation constants. The researcher who reads these fluently thinks in terms of structure; the one who doesn't decodes symbol by symbol and loses the forest for the trees.
THE NOTATION AS A COMPRESSION TOOL
=======================================================================
Without notation (N = 5):
L = -(1/5)(log P(y_1|x_1) + log P(y_2|x_2) + log P(y_3|x_3) + log P(y_4|x_4) + log P(y_5|x_5))
With notation (N = anything):
L = -(1/N) \sum_{i=1}^{N} log P(y_i|x_i)
The notation doesn't just save space - it reveals structure:
- The 1/N shows it's an average
- The log shows we're in log-probability space
- The \sum shows we're aggregating over examples
- The index i shows which variable changes
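A minimal NumPy sketch of the same compression, using made-up per-example probabilities: the expanded five-term form and the \Sigma form compute the identical average negative log-likelihood.

```python
import numpy as np

# Hypothetical per-example probabilities P(y_i | x_i) from some model
probs = np.array([0.9, 0.7, 0.95, 0.4, 0.8])

# Expanded form: five log terms written out by hand (N = 5)
expanded = -(np.log(0.9) + np.log(0.7) + np.log(0.95)
             + np.log(0.4) + np.log(0.8)) / 5

# Sigma form: the same average negative log-likelihood for any N
loss = -np.mean(np.log(probs))

assert np.isclose(expanded, loss)
print(loss)  # ~0.331
```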
1.2 Why This Notation Exists
The development of summation notation was driven by several fundamental needs:
Human working memory is finite. Writing out 10,000 individual terms is impossible and obscures structure. A mathematician needs to reason about patterns in sums, not individual terms. Compact notation makes patterns visible.
Notation reveals structure. Consider $\sum_{i=1}^{n} i^2$. This immediately communicates "sum of squares from 1 to $n$" - the pattern is evident in a way that writing out $1 + 4 + 9 + \cdots$ only suggests. The reader can immediately ask: does this have a closed form? Is it on the order of $n^3$? How does it relate to $\int_1^n x^2\,dx$?
Enables algebraic manipulation. Once expressed in \Sigma/\Pi notation, sums can be split, combined, reindexed, bounded, and transformed using precise algebraic rules. This is how mathematicians prove things about sums, not by expanding and hand-checking.
Generalises naturally. The same notation scales from finite sums to infinite series to integrals:

$$\sum_{i=1}^{n} f(i) \;\longrightarrow\; \sum_{i=1}^{\infty} f(i) \;\longrightarrow\; \int f(x)\,dx$$

Same conceptual framework, deeper mathematics.
Historical necessity. As mathematics moved from specific numerical examples to general theorems about arbitrary $n$, notation had to scale. Gauss couldn't write "$1 + 2 + \cdots + 100 = 5050$" for every value - he needed $\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$.
1.3 Where These Appear in AI
Every major formula in machine learning is a sum or product (or both). Here is a comprehensive inventory:
Loss functions (sums):
This is the categorical cross-entropy loss: a double sum over training examples and over vocabulary tokens.
Softmax normalisation (sum in denominator):
The denominator is a sum over the entire vocabulary - the partition function that ensures probabilities sum to 1.
Attention mechanism (weighted sum):
Each attention output is a weighted sum of value vectors, where the weights themselves involve summation in their softmax normalisation.
Dot product (sum of elementwise products):
The fundamental operation in attention: a sum of products.
Matrix multiplication (sum over shared dimension):
Every linear layer, every projection, every attention computation involves matrix multiplication - which is itself a collection of sums.
Perplexity (product converted to sum via log):
The standard evaluation metric for language models: an exponentiated average of log-probabilities.
Sequence likelihood (product):
The chain rule of probability applied to a sequence - a product of conditional probabilities.
Joint probability of independent events (product):
BM25 retrieval score (weighted sum):
The standard sparse retrieval function in RAG systems: a sum over query terms, each contributing an IDF-weighted term frequency score.
BPE merge cost (sum over pairs):
1.4 The Index as a Variable
The index variable (the in ) is a bound variable - it has no meaning outside the sum, just as the variable of integration has no meaning outside the integral.
All four expressions are identical. The name of the index does not matter. This is called alpha-equivalence: renaming bound variables preserves meaning. It is the same principle in:
- Summation: $\sum_{i=1}^{n} f(i) = \sum_{j=1}^{n} f(j)$
- Integration: $\int_a^b f(x)\,dx = \int_a^b f(t)\,dt$
- Programming: sum(f(i) for i in range(n)) == sum(f(j) for j in range(n))
- Logic: $\forall x\,P(x)$ is the same as $\forall y\,P(y)$
Critical for AI. When multiple sums interact, index names carry essential information about which sum they belong to: $\sum_i a_{ij}$ versus $\sum_j a_{ij}$.
The first sums over the first index (rows), producing a result that still depends on $j$. The second sums over the second index (columns), producing a result that depends on $i$. In attention: $\sum_j \alpha_{ij} = 1$ (attention weights for query $i$ sum to 1), but $\sum_i \alpha_{ij}$ gives the total attention received by key position $j$ - a completely different quantity.
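The distinction is easy to check numerically. A small sketch with a toy attention-weight matrix (rows = queries, columns = keys): summing over the key axis returns 1 for every query, while summing over the query axis gives the total attention each key receives.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4))
alpha = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax per row

print(alpha.sum(axis=1))  # sum over j: [1. 1. 1. 1.] - each query's weights sum to 1
print(alpha.sum(axis=0))  # sum over i: total attention received by each key - not 1 in general
```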
1.5 Historical Timeline
HISTORICAL DEVELOPMENT OF SUMMATION NOTATION
=======================================================================
~1800 BCE Egypt/Babylon Arithmetic series computed via specific examples;
no general notation; "add these five numbers"
~250 BCE Archimedes Summation of geometric series; area under parabola
via method of exhaustion; still no compact notation
1593 Viete Introduced systematic symbolic algebra; first use
of letters for unknowns; paved way for notation
1675 Leibniz Introduced \int as elongated S for "summa" (sum);
integration conceived as continuous summation
18th c. Euler Introduced \Sigma (capital sigma) for discrete summation;
standardised the notation we use today; wrote
\sum_{n=1}^{\infty} 1/n^2 = \pi^2/6 (Basel problem, 1735)
1777-1855 Gauss Master of summation manipulation; derived closed
forms for many classical sums; the "prince of
mathematicians" famously summed 1+2+...+100 as a child
1821 Cauchy Rigorous theory of series convergence; defined when
infinite sums are valid; \epsilon-\delta framework
1854 Riemann Riemann sum -> integral; summation as the formal
foundation of integration theory
1916 Einstein Einstein summation convention: repeated indices
imply summation; revolutionised tensor notation
20th c. Standard \Sigma and \Pi become universal mathematical notation;
every textbook, every paper
2016+ Modern ML Every NeurIPS/ICML/ICLR paper assumes fluency;
torch.einsum brings Einstein notation to code
1.6 The Relationship to Integration
The connection between summation and integration is not merely an analogy - it is a deep mathematical fact. The Riemann integral is defined as a limit of sums:

$$\int_a^b f(x)\,dx = \lim_{n \to \infty} \sum_{i=1}^{n} f(x_i^*)\,\Delta x$$

where $\Delta x = \frac{b-a}{n}$ and $x_i^*$ is a sample point in the $i$-th subinterval.
DISCRETE <--> CONTINUOUS CORRESPONDENCE
=======================================================================
SUMMATION INTEGRATION
---------- -----------
\sum_{i=1}^{n} f(i)            ->   \int_1^n f(x) dx
Index i \in {1, 2, ..., n}     ->   x \in [1, n] (continuous variable)
Step size = 1                  ->   Infinitesimal dx
\sum -> \int as step -> 0 (Riemann sum definition)
\prod_{i=1}^{n} f(i)           ->   exp(\int_1^n ln f(x) dx)
Discrete product               ->   Continuous product (via log)
\sum_i f(i)*P(i)               ->   \int f(x)*p(x) dx
Discrete expectation           ->   Continuous expectation
-\sum_i p_i log p_i            ->   -\int p(x) log p(x) dx
Discrete entropy -> Differential entropy
This bridge matters for AI because many derivations convert between sums and integrals:
- The evidence lower bound (ELBO) involves expectations that are sums (discrete latent variables) or integrals (continuous latent variables)
- Monte Carlo estimation approximates integrals by sums: $\int f(x)\,p(x)\,dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)$ with $x_i \sim p$ (see the code sketch after this list)
- Riemann sum approximations appear in numerical integration throughout scientific ML
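A small sketch of the Monte Carlo bullet above, assuming a standard normal $p$ and $f(x) = x^2$ (so the true integral is the variance, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)   # x_i ~ p

estimate = np.mean(samples ** 2)           # (1/N) * sum_i f(x_i) approximates E[f(X)]
print(estimate)                            # close to the exact value 1.0
```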
2. Summation Notation - Formal Definitions
2.1 Basic Definition
The summation of $f(i)$ as $i$ ranges from $m$ to $n$ is defined as:

$$\sum_{i=m}^{n} f(i) = f(m) + f(m+1) + \cdots + f(n)$$
The components of this notation are:
| Component | Name | Role |
|---|---|---|
| $\sum$ | Summation operator | Capital Greek sigma; indicates repeated addition |
| $i$ | Index variable | Dummy/bound variable; any name works |
| $m$ | Lower limit | First value of the index (inclusive) |
| $n$ | Upper limit | Last value of the index (inclusive) |
| $f(i)$ | Summand | Expression evaluated at each index value |
Number of terms: $n - m + 1$ (when $n \geq m$).
Empty sum convention: If $n < m$, the sum has no terms. By convention, the value of an empty sum is 0 - the additive identity:
This convention is not arbitrary; it is the only value consistent with the algebraic properties of summation (see 3.2).
Concrete examples:
2.2 Formal Definition via Recursion
The summation notation can be defined with complete mathematical precision using recursion:
Base case (empty sum): $\sum_{i=m}^{n} f(i) = 0$ whenever $n < m$.
Recursive step: $\sum_{i=m}^{n} f(i) = \left(\sum_{i=m}^{n-1} f(i)\right) + f(n)$ for $n \geq m$.
This definition is equivalent to the "expand and add" interpretation but is fully rigorous. It also makes clear why the empty sum equals 0: from the recursive step with $n = m$, $\sum_{i=m}^{m} f(i) = \sum_{i=m}^{m-1} f(i) + f(m)$, which equals $f(m)$ only if the empty sum is 0.
The recursion unfolds exactly like a for loop:
def sigma(f, m, n):
    """Recursive definition of \sum_{i=m}^{n} f(i)."""
    if n < m:
        return 0                          # Empty sum = additive identity
    return sigma(f, m, n - 1) + f(n)      # Recursive step
2.3 Index Set Notation
The basic notation assumes the index ranges over consecutive integers $m, m+1, \ldots, n$. But sums often range over arbitrary sets. The index set notation generalises: $\sum_{i \in S} f(i)$ sums $f$ over every element of the set $S$.
This is the most general form. The limits notation is a special case where $S = \{m, m+1, \ldots, n\}$.
AI-relevant examples:
Set-builder notation with conditions:
This sums only over odd values of $i$. The condition acts as a filter on the index set.
2.4 Multiple Indices
A double sum sums over two independent indices:

$$\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij} = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} a_{ij} \right)$$

The process: for each fixed $i$, compute the inner sum over $j$; then sum those results over $i$.
Total number of terms: $m \cdot n$.
Concrete example:
Triple sums and higher are defined analogously by nesting more symbols. In AI, you frequently see:
AI example - matrix multiplication as double sum:
Computing all entries of $C = AB$, where $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$: $C_{ij} = \sum_{l=1}^{k} A_{il} B_{lj}$.
To compute the entire matrix $C$, we sum over all index triples $(i, j, l)$.
This triple sum involves $m \cdot n \cdot k$ individual multiply-add operations.
2.5 Sigma Notation Conventions
In practice, several shorthand conventions are widely used:
Implicit limits. When the index set is clear from context:
Vector sum. For a vector $x \in \mathbb{R}^n$: $\sum_{i=1}^{n} x_i$.
All-ones vector. The sum of all components can be written as a dot product with the all-ones vector $\mathbf{1}$: $\sum_{i=1}^{n} x_i = \mathbf{1}^\top x$.
This is heavily used in ML: torch.sum(x) is the same as ones @ x.
Indicator function notation. The indicator (or Iverson bracket) equals 1 when the condition is true and 0 otherwise:
For example, $\sum_{i=1}^{n} \mathbb{1}[x_i > 0]$ counts the number of positive elements in a vector.
Vector/matrix sums. When summing vectors or matrices, the sum is computed componentwise:
This appears in attention: $\sum_j \alpha_{ij} v_j$ is a weighted sum of value vectors.
3. Product Notation - Formal Definitions
3.1 Basic Definition
The product of $f(i)$ as $i$ ranges from $m$ to $n$ is defined as:

$$\prod_{i=m}^{n} f(i) = f(m) \cdot f(m+1) \cdots f(n)$$

The components are identical to summation notation but with $\Pi$ (capital Greek pi) indicating repeated multiplication instead of addition.
| Component | Name | Role |
|---|---|---|
| $\prod$ | Product operator | Capital Greek pi; indicates repeated multiplication |
| $i$ | Index variable | Dummy/bound variable; any name works |
| $m$ | Lower limit | First value of the index (inclusive) |
| $n$ | Upper limit | Last value of the index (inclusive) |
| $f(i)$ | Factor | Expression evaluated at each index value |
Number of factors: $n - m + 1$ (same counting as summation).
Empty product convention: If $n < m$, the product has no factors. By convention, the value of an empty product is 1 - the multiplicative identity:
Recursive definition:
Concrete examples:
3.2 Why Empty Product = 1 and Empty Sum = 0
These conventions are not arbitrary - they are the only values consistent with the algebraic rules:
Additive identity argument. The identity for addition is 0: $a + 0 = a$. An empty sum contributes no terms, so it must equal the additive identity:
Consistency check: The recursive definition gives $\sum_{i=m}^{m} f(i) = \sum_{i=m}^{m-1} f(i) + f(m) = 0 + f(m) = f(m)$. If the empty sum were anything other than 0, this would fail. OK
Multiplicative identity argument. The identity for multiplication is 1: $a \cdot 1 = a$. An empty product contributes no factors, so it must equal the multiplicative identity:
Consistency check: The recursive definition gives $\prod_{i=m}^{m} f(i) = \prod_{i=m}^{m-1} f(i) \cdot f(m) = 1 \cdot f(m) = f(m)$. OK
Factorial consistency. The factorial must satisfy $n! = n \cdot (n-1)!$: setting $n = 1$ gives $1! = 1 \cdot 0!$, forcing $0! = 1$.
This is also consistent with the empty-product definition of $0!$, the combinatorial identity $\binom{n}{0} = 1$, and the Taylor series $e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}$ (which needs $0! = 1$ for the $n = 0$ term to be $1$).
IDENTITY ELEMENTS - THE PATTERN
=======================================================================
Operation Identity Empty Result Why
--------- -------- ------------ ---
Addition 0 \sum\emptyset = 0 Adding nothing = 0
Multiplication 1 \prod\emptyset = 1 Multiplying nothing = 1
Union \emptyset \cup\emptyset = \emptyset Unioning nothing = empty set
Intersection Universal \cap\emptyset = U Intersecting nothing = everything
Logical AND True \wedge\emptyset = True No conditions to fail
Logical OR False \vee\emptyset = False No conditions to satisfy
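Python's built-ins follow exactly these conventions, which is a quick way to confirm the table:

```python
import math

print(sum([]))            # 0 - empty sum returns the additive identity
print(math.prod([]))      # 1 - empty product returns the multiplicative identity
print(math.factorial(0))  # 1 - consistent with 0! as an empty product
```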
3.3 Factorial as Product Notation
The factorial of $n$ is defined as: $n! = \prod_{i=1}^{n} i = 1 \cdot 2 \cdots n$,
with the convention $0! = 1$ (empty product).
Growth rate. Factorial grows faster than any polynomial or exponential base:
| $n$ | $n!$ | $n^n$ | $2^n$ | $n^2$ |
|---|---|---|---|---|
| 1 | 1 | 1 | 2 | 1 |
| 5 | 120 | 3,125 | 32 | 25 |
| 10 | 3,628,800 | 10,000,000,000 | 1,024 | 100 |
| 20 | 2.43 \times 10^{18} | 1.05 \times 10^{26} | 1,048,576 | 400 |
For large $n$, Stirling's approximation gives: $n! \approx \sqrt{2\pi n}\,\left(\frac{n}{e}\right)^n$.
AI relevance:
- Permutations: The number of permutations of $n$ items is $n!$. This creates combinatorial explosion in search spaces. A vocabulary of just 30 tokens has $30! \approx 2.7 \times 10^{32}$ possible orderings.
- Permutation symmetry in neural networks: A hidden layer with $n$ neurons has $n!$ equivalent parameter configurations obtained by permuting the neurons. This symmetry means the loss landscape has at least $n!$ equivalent minima.
- Combinatorics: Binomial coefficients $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$ appear in counting arguments throughout ML theory.
3.4 Product Notation in Probability
Product notation is the natural language of probability, where independent events multiply:
Joint probability of independent events: $P(A_1 \cap A_2 \cap \cdots \cap A_n) = \prod_{i=1}^{n} P(A_i)$
Likelihood of a sequence (chain rule of probability): $P(t_1, t_2, \ldots, t_n) = \prod_{i=1}^{n} P(t_i \mid t_1, \ldots, t_{i-1})$
This is the fundamental equation of autoregressive language models. A transformer computes each factor using causal attention, and multiplies them all together to get the sequence probability.
Log-likelihood converts product to sum: $\log P(t_1, \ldots, t_n) = \sum_{i=1}^{n} \log P(t_i \mid t_{<i})$
This conversion from $\prod$ to $\sum$ via the logarithm is one of the most important transformations in all of machine learning. It is used because:
- Numerical stability: Products of many small probabilities underflow to zero - for long sequences the product falls below the smallest representable floating-point value. Taking logarithms converts the product to a sum of manageable-sized numbers.
- Computational convenience: Sums are easier to differentiate than products. The gradient of a sum of per-example log-likelihoods is the sum of per-example gradients, cleanly decomposing into per-example terms.
- Statistical theory: The negative log-likelihood is the cross-entropy loss, connecting maximum likelihood estimation to information theory.
THE PRODUCT-TO-SUM CONVERSION VIA LOGARITHM
=======================================================================
log
\prod_i P(t_i|ctx) ---------> \sum_i log P(t_i|ctx)
Product of Sum of
probabilities log-probabilities
Underflows for Numerically stable
long sequences for any length
Hard to Easy to
differentiate differentiate
<- - - - - exp - - - - <-
exp(\sum_i log P_i) = \prod_i P_i (inverse transformation)
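A tiny demonstration of the diagram, using an assumed sequence of 1,000 tokens each with probability 0.1: the raw product underflows to zero while the log-space sum stays well inside floating-point range.

```python
import numpy as np

probs = np.full(1000, 0.1)            # hypothetical per-token probabilities

print(np.prod(probs))                 # 0.0 - underflow (the true value is 1e-1000)
print(np.sum(np.log(probs)))          # ~ -2302.6 - perfectly representable
print(np.exp(np.sum(np.log(probs))))  # exponentiating back still underflows, as expected
```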
4. Algebraic Properties of Summation
These properties are the tools for manipulating sums - the algebraic rules that allow you to simplify, decompose, and transform summation expressions.
4.1 Linearity of Summation
The most important property. Summation is a linear operator:
Scalar multiple: $\sum_{i=1}^{n} c\,a_i = c \sum_{i=1}^{n} a_i$
Sum of sums: $\sum_{i=1}^{n} (a_i + b_i) = \sum_{i=1}^{n} a_i + \sum_{i=1}^{n} b_i$
Combined (linearity): $\sum_{i=1}^{n} (\alpha a_i + \beta b_i) = \alpha \sum_{i=1}^{n} a_i + \beta \sum_{i=1}^{n} b_i$
Proof sketch: Expand both sides by the definition of summation (write out all terms); use commutativity and associativity of addition to rearrange; observe both sides are equal term by term.
Why this matters for AI: Linearity allows decomposition of complex sums into simpler parts. The most critical application:
The gradient of a sum equals the sum of gradients. This is why mini-batch gradient descent works: You can compute per-example gradients independently and then sum them. It is also why gradient accumulation works: you can split a large batch into smaller sub-batches, compute gradients for each, and sum them - the result is identical to computing the gradient of the full batch.
Warning: The constant $c$ must not depend on the summation index $i$. If $c_i$ depends on $i$, it cannot be factored out: $\sum_i c_i a_i \neq c \sum_i a_i$.
4.2 Constant Summand
The sum of $n$ copies of a constant $c$ equals $n$ times $c$: $\sum_{i=1}^{n} c = nc$. This follows from linearity: $\sum_{i=1}^{n} c = c \sum_{i=1}^{n} 1 = cn$.
Special case: $\sum_{i=1}^{n} 1 = n$ - a useful identity for counting.
AI applications:
- Normalisation check: $\sum_i p_i = 1$ - all probabilities must sum to 1. If $p$ is uniform over $V$ tokens, then $p_i = 1/V$ and $\sum_{i=1}^{V} 1/V = 1$. OK
- Batch average: $\frac{1}{N} \sum_{i=1}^{N} 1 = 1$ - confirms that $\frac{1}{N} \sum_{i=1}^{N} L_i$ is a proper average.
- Attention conservation: $\sum_j \alpha_{ij} = 1$ for softmax attention weights - each query's weights sum to 1.
4.3 Splitting and Combining Sums
Splitting at index $k$ (where $m \leq k < n$): $\sum_{i=m}^{n} a_i = \sum_{i=m}^{k} a_i + \sum_{i=k+1}^{n} a_i$
Both parts are contiguous and their union covers the full range.
Merging adjacent ranges: reading the same identity right to left combines two adjacent sums into one.
Non-adjacent ranges: For disjoint sets $S$ and $T$: $\sum_{i \in S \cup T} a_i = \sum_{i \in S} a_i + \sum_{i \in T} a_i$.
If $S \cap T \neq \emptyset$, elements in the intersection are counted twice.
AI applications:
- Mini-batch splitting: Split gradient computation across mini-batches:
- Distributed training: Each GPU computes a partial sum; the all-reduce operation merges them.
- Gradient accumulation: Accumulate gradients over micro-batches before applying an update: $g = \frac{1}{K} \sum_{k=1}^{K} \nabla L(\theta; B_k)$.
4.4 Index Shifting (Reindexing)
Shift by constant $c$: $\sum_{i=m}^{n} f(i) = \sum_{j=m+c}^{n+c} f(j - c)$
Both limits shift by $c$, and the summand substitutes $i = j - c$.
Rename index (alpha-equivalence): $\sum_{i=m}^{n} f(i) = \sum_{k=m}^{n} f(k)$
Reverse index: $\sum_{i=1}^{n} f(i) = \sum_{j=1}^{n} f(n + 1 - j)$
When $j$ goes from $1$ to $n$, the value $n + 1 - j$ goes from $n$ down to $1$. The same terms are summed in reverse order.
Applications of reindexing:
- Gauss's trick: Reverse and add to derive $\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$:
- $S = 1 + 2 + \cdots + n$ and $S = n + (n-1) + \cdots + 1$ (reversed); adding gives $2S = n(n+1)$
- AI: relative positional encodings: Reindex from absolute positions $(i, j)$ to relative offsets $d = i - j$, where $d$ ranges over $-(n-1), \ldots, n-1$.
- Causal attention: Reindex to count from current position backward.
4.5 Telescoping Sums
A telescoping sum is one where most terms cancel in pairs, leaving only the first and last:

$$\sum_{i=1}^{n} (a_{i+1} - a_i) = a_{n+1} - a_1$$

Proof by expansion:
Every intermediate term appears once with a plus sign and once with a minus sign; they cancel:
Generalised form: $\sum_{i=m}^{n} \big(F(i+1) - F(i)\big) = F(n+1) - F(m)$
Example - partial fractions: Show that $\sum_{i=1}^{n} \frac{1}{i(i+1)} = 1 - \frac{1}{n+1}$. Since $\frac{1}{i(i+1)} = \frac{1}{i} - \frac{1}{i+1}$, the sum telescopes.
AI applications:
- Residual networks: The total change in a ResNet's representation across layers: $h_L - h_0 = \sum_{l=0}^{L-1} (h_{l+1} - h_l)$. The total transformation telescopes: it equals the sum of all residual updates.
- Gradient descent accumulation: Total parameter change: $\theta_T - \theta_0 = \sum_{t=0}^{T-1} (\theta_{t+1} - \theta_t)$.
- Proving closed-form formulas: Many summation identities are proved by finding a function $F$ such that $f(i) = F(i+1) - F(i)$.
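A quick numerical check of the telescoping identity on an arbitrary sequence:

```python
import numpy as np

a = np.cumsum(np.random.default_rng(0).normal(size=10))  # arbitrary sequence a_1..a_10

telescoped = np.sum(np.diff(a))              # sum of consecutive differences a_{i+1} - a_i
print(np.isclose(telescoped, a[-1] - a[0]))  # True: only first and last terms survive
```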
4.6 Interchange of Summation Order
For a double sum over a rectangular region, the order of summation can always be interchanged:

$$\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij} = \sum_{j=1}^{n} \sum_{i=1}^{m} a_{ij}$$

Why this works: The double sum ranges over all pairs $(i, j)$ with $1 \leq i \leq m$ and $1 \leq j \leq n$. The left side sums row by row (fix $i$, sweep $j$); the right side sums column by column (fix $j$, sweep $i$). Both cover the same set of pairs, and addition is commutative and associative.
SUMMING A MATRIX: ROW-FIRST vs COLUMN-FIRST
=======================================================================
      j=1  j=2  j=3          Row-first (\sum_i \sum_j):
     +----+----+----+          Row 1: a_11 + a_12 + a_13 = R_1
 i=1 | a_11| a_12| a_13|       Row 2: a_21 + a_22 + a_23 = R_2
     +----+----+----+          Total: R_1 + R_2
 i=2 | a_21| a_22| a_23|
     +----+----+----+        Column-first (\sum_j \sum_i):
                               Col 1: a_11 + a_21 = C_1
                               Col 2: a_12 + a_22 = C_2
                               Col 3: a_13 + a_23 = C_3
                               Total: C_1 + C_2 + C_3
Both give the same total: all 6 elements summed.
For finite sums: interchange is always safe. No conditions needed.
For infinite sums: interchange requires absolute convergence: $\sum_i \sum_j |a_{ij}| < \infty$. Without this, rearrangement can change the value (Riemann rearrangement theorem).
AI applications:
- Linearity of expectation: $E\left[\sum_i X_i\right] = \sum_i E[X_i]$. This is interchange of summation and expectation (which is itself a sum/integral).
- Multi-head attention: compute attention for all heads first and then sum over positions, or compute per-position outputs and then sum over heads; same result.
- Batch-feature decomposition: When computing gradients, you can sum over the batch dimension first or the feature dimension first; the order doesn't matter for finite sums.
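A two-line check of the interchange property on a small matrix - row-first and column-first summation give the same total:

```python
import numpy as np

A = np.arange(12).reshape(3, 4)

row_first = A.sum(axis=1).sum()   # inner sum over columns j, outer sum over rows i
col_first = A.sum(axis=0).sum()   # inner sum over rows i, outer sum over columns j
print(row_first, col_first)       # 66 66 - finite double sums can always be swapped
```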
5. Algebraic Properties of Products
5.1 Product of Products (Splitting)
Just as sums can be split at any index, products can be split at any index:

$$\prod_{i=m}^{n} f(i) = \left(\prod_{i=m}^{k} f(i)\right) \cdot \left(\prod_{i=k+1}^{n} f(i)\right)$$

Proof: The full product can be grouped into two sub-products at any break point $k$.
AI application: Factoring a sequence probability: $P(t_1, \ldots, t_n) = \left(\prod_{i=1}^{k} P(t_i \mid t_{<i})\right) \cdot \left(\prod_{i=k+1}^{n} P(t_i \mid t_{<i})\right)$.
This splitting enables prefix-based caching in autoregressive generation: compute the prefix probability once, then extend it token by token.
5.2 Product of a Constant
Multiplying $n$ copies of a constant $c$ equals $c$ to the $n$-th power: $\prod_{i=1}^{n} c = c^n$.
Special cases:
- $\prod_{i=1}^{n} 2 = 2^n$ - binary growth
- $\prod_{i=1}^{n} \tfrac{1}{2} = 2^{-n}$ - exponential decay
- $\prod_{i=1}^{n} (-1) = (-1)^n$ - alternating sign
AI application: If a model assigns uniform probability $1/V$ to each token in a sequence of length $n$: $\prod_{i=1}^{n} \frac{1}{V} = V^{-n}$.
With a realistic vocabulary size and even a few hundred tokens, this probability is astronomically small - far below any floating-point representation, which is exactly why we use log-probabilities.
5.3 Product of a Product (Nested Products)
Nested products over independent index sets can be interchanged, just like double sums: $\prod_i \prod_j f(i, j) = \prod_j \prod_i f(i, j)$.
Both sides multiply together the same factors; only the order differs. Since multiplication is commutative and associative, the result is the same.
5.4 Distributing Product over Multiplication
Product distributes over pointwise multiplication of terms: $\prod_{i=1}^{n} (a_i b_i) = \left(\prod_{i=1}^{n} a_i\right)\left(\prod_{i=1}^{n} b_i\right)$
Proof: Expand: $(a_1 b_1)(a_2 b_2) \cdots (a_n b_n)$. By commutativity of multiplication, rearrange to $(a_1 a_2 \cdots a_n)(b_1 b_2 \cdots b_n)$.
Product does NOT distribute over addition: $\prod_{i} (a_i + b_i) \neq \prod_{i} a_i + \prod_{i} b_i$
Counterexample: with $a_i = b_i = 1$ and $n = 2$, the left side is $(1+1)(1+1) = 4$ while the right side is $1 + 1 = 2$.
This is a common source of errors. While multiplying the product distributes cleanly, adding inside a product creates cross-terms - the result expands into $2^n$ terms in general.
5.5 Logarithm Converts Product to Sum
This is arguably the most important property of product notation for all of machine learning:

$$\log \prod_{i=1}^{n} a_i = \sum_{i=1}^{n} \log a_i \qquad (a_i > 0)$$

Proof: By the fundamental logarithm identity $\log(ab) = \log a + \log b$, applied repeatedly: $\log(a_1 a_2 \cdots a_n) = \log a_1 + \log(a_2 \cdots a_n) = \cdots = \log a_1 + \log a_2 + \cdots + \log a_n$.
Inverse direction: $\prod_{i=1}^{n} a_i = \exp\left(\sum_{i=1}^{n} \log a_i\right)$
Why this matters immensely for AI:
- Numerical stability. Products of many small probabilities underflow to zero in finite-precision arithmetic. If $P(t_i) \leq 0.1$ for every token in a sequence of length 100, then $\prod_i P(t_i) \leq 10^{-100}$, which is below the smallest normal FP32 value (about $10^{-38}$). But the corresponding sum of natural logs is only about $-230$, perfectly representable.
- Gradient computation. Differentiating sums is straightforward: $\nabla_\theta \sum_i f_i(\theta) = \sum_i \nabla_\theta f_i(\theta)$. Differentiating a product requires the product rule applied $n$ times.
- Decomposition. Log-likelihood decomposes into per-example terms: each $\log P(y_i \mid x_i)$ can be computed, stored, and analysed independently.
AI examples of product-to-sum conversion:
6. Important Summation Formulas
These closed-form results are essential tools. Knowing them lets you immediately evaluate sums that arise in complexity analysis, error bounds, and algorithmic design.
6.1 Arithmetic Series
Gauss's derivation (pairing argument): Write the sum forward and backward: $S = 1 + 2 + \cdots + n$ and $S = n + (n-1) + \cdots + 1$.
Adding these two equations column by column: $2S = (n+1) + (n+1) + \cdots + (n+1) = n(n+1)$.
Therefore $S = \frac{n(n+1)}{2}$.
General arithmetic series (first term $a$, common difference $d$, $n$ terms): $\sum_{i=0}^{n-1} (a + id) = \frac{n}{2}\left(2a + (n-1)d\right)$.
Sum of first $n$ odd numbers: $\sum_{i=1}^{n} (2i - 1) = n^2$
Proof: $\sum_{i=1}^{n} (2i - 1) = 2 \sum_{i=1}^{n} i - \sum_{i=1}^{n} 1 = n(n+1) - n = n^2$.
AI relevance: The number of pairwise interactions in full self-attention over $n$ tokens is $n^2$ (each of $n$ queries attends to $n$ keys), giving $O(n^2)$ attention complexity. More precisely, causal attention has $\frac{n(n+1)}{2}$ interactions - still $O(n^2)$ but roughly half.
6.2 Sum of Squares
The closed form is $\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}$.
Proof by telescoping. Use the identity $(i+1)^3 - i^3 = 3i^2 + 3i + 1$, sum both sides from $1$ to $n$, and solve for $\sum_{i=1}^{n} i^2$.
AI relevance: Squared norms appear everywhere - L2 regularisation, variance calculations, gradient norms. While $\sum_{i=1}^{n} i^2$ specifically appears in complexity analysis, the sum-of-squares pattern is ubiquitous.
6.3 Sum of Cubes
A remarkable identity: the sum of cubes equals the square of the sum of the first $n$ integers: $\sum_{i=1}^{n} i^3 = \left(\sum_{i=1}^{n} i\right)^2 = \left(\frac{n(n+1)}{2}\right)^2$.
| $n$ | $\sum_{i=1}^{n} i^3$ | $\left(\sum_{i=1}^{n} i\right)^2$ | Equal? |
|---|---|---|---|
| 1 | 1 | 1^2 = 1 | OK |
| 2 | 1 + 8 = 9 | 3^2 = 9 | OK |
| 3 | 1 + 8 + 27 = 36 | 6^2 = 36 | OK |
| 4 | 1 + 8 + 27 + 64 = 100 | 10^2 = 100 | OK |
Proof by induction: Base case $n = 1$: $1^3 = 1 = \left(\frac{1 \cdot 2}{2}\right)^2$. OK
Inductive step: assume $\sum_{i=1}^{n} i^3 = \left(\frac{n(n+1)}{2}\right)^2$. Then: $\sum_{i=1}^{n+1} i^3 = \left(\frac{n(n+1)}{2}\right)^2 + (n+1)^3 = (n+1)^2 \left(\frac{n^2}{4} + n + 1\right) = \frac{(n+1)^2 (n+2)^2}{4}$,
which equals $\left(\frac{(n+1)(n+2)}{2}\right)^2$. OK
6.4 Geometric Series (Finite)
The closed form is $\sum_{i=0}^{n} r^i = \frac{1 - r^{n+1}}{1 - r}$ for $r \neq 1$.
Proof: Let $S = \sum_{i=0}^{n} r^i$. Then $S - rS = 1 - r^{n+1}$, and dividing by $1 - r$ gives the result.
Special case $r = 1$: $\sum_{i=0}^{n} 1 = n + 1$.
Shifted version: $\sum_{i=1}^{n} r^i = \frac{r(1 - r^n)}{1 - r}$.
Alternative form (ratio > 1): $\sum_{i=0}^{n} r^i = \frac{r^{n+1} - 1}{r - 1}$ for $r > 1$.
AI relevance:
- ALiBi (Attention with Linear Biases): Uses geometric-like decay in attention: positions further away receive exponentially decreasing attention bias.
- Exponential moving averages: In the Adam optimizer, $m_t = (1 - \beta_1) \sum_{k=0}^{t-1} \beta_1^{k}\, g_{t-k}$ - the weights form a geometric series summing to $1 - \beta_1^{t}$, which is exactly the bias-correction factor.
- Discounted rewards in RL: $G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$, where $\gamma \in [0, 1)$ is the discount factor.
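A sketch of the discounted-return sum, with an assumed constant reward of 1 and $\gamma = 0.99$, checked against the finite geometric series formula:

```python
import numpy as np

gamma, n = 0.99, 500
rewards = np.ones(n)                                   # assumed constant reward r_k = 1

discounted = np.sum(gamma ** np.arange(n) * rewards)   # sum_k gamma^k * r_k
closed_form = (1 - gamma ** n) / (1 - gamma)           # finite geometric series

print(discounted, closed_form)   # both ~99.3; the infinite-horizon limit is 1/(1-gamma) = 100
```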
6.5 Geometric Series (Infinite)
For $|r| < 1$, $\sum_{i=0}^{\infty} r^i = \frac{1}{1 - r}$. This is the limit of the finite geometric series as $n \to \infty$: since $|r| < 1$, we have $r^{n+1} \to 0$, so $\frac{1 - r^{n+1}}{1 - r} \to \frac{1}{1 - r}$.
For $|r| \geq 1$, the series diverges (terms do not shrink to zero).
Derivatives of geometric series: $\sum_{i=1}^{\infty} i\,r^{i-1} = \frac{1}{(1-r)^2}$ and $\sum_{i=1}^{\infty} i\,r^{i} = \frac{r}{(1-r)^2}$.
These are obtained by differentiating $\sum_i r^i = \frac{1}{1-r}$ with respect to $r$ term by term (valid within the radius of convergence).
AI relevance:
- Expected geometric distribution: If a random variable $X \sim \text{Geometric}(p)$, then $E[X] = \sum_{k=1}^{\infty} k\,(1-p)^{k-1} p = \frac{1}{p}$, which uses the derivative of the geometric series.
- Discounted return in RL: $\sum_{k=0}^{\infty} \gamma^{k} r_{\max} = \frac{r_{\max}}{1 - \gamma}$ gives the maximum possible return under discount $\gamma$.
6.6 Exponential and Taylor Series
Many functions used in AI are defined by their Taylor (power) series - which are infinite sums:
Exponential function: $e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}$
Converges for all $x$. The exponential function is the unique function satisfying $f'(x) = f(x)$ with $f(0) = 1$.
Sine and cosine: $\sin x = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{(2n+1)!}$, $\quad \cos x = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n}}{(2n)!}$
Natural logarithm: $\ln(1 + x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1} x^n}{n}$ for $-1 < x \leq 1$
Binomial series: $(1 + x)^{\alpha} = \sum_{n=0}^{\infty} \binom{\alpha}{n} x^n$ for $|x| < 1$,
where $\binom{\alpha}{n} = \frac{\alpha(\alpha - 1) \cdots (\alpha - n + 1)}{n!}$ (generalised binomial coefficient).
AI relevance:
- Softmax uses $e^x$; its Taylor series explains why $e^x \approx 1 + x$ for small $x$ - helpful for linearised attention approximations (e.g., Performers).
- Log-sum-exp trick uses $\log \sum_i e^{x_i}$; understanding its series expansion helps with numerical stability analysis.
- Positional encodings in transformers use $\sin$ and $\cos$; their series representations explain periodicity and frequency properties.
- GELU activation: $\text{GELU}(x) = x\,\Phi(x)$, where $\Phi$ involves the error function, which has a Taylor series expansion.
6.7 Harmonic Series and Harmonic Numbers
The harmonic series is the sum of reciprocals: $\sum_{n=1}^{\infty} \frac{1}{n}$, which diverges.
Despite the terms approaching zero, the sum grows without bound. This is a classic example showing that $a_n \to 0$ is necessary but not sufficient for convergence.
The $n$-th harmonic number is the partial sum: $H_n = \sum_{k=1}^{n} \frac{1}{k} \approx \ln n + \gamma$,
where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant.
The p-series generalises the harmonic series: $\sum_{n=1}^{\infty} \frac{1}{n^p}$ converges if and only if $p > 1$.
The special case $p = 2$ gives Euler's famous result: $\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}$.
The Riemann zeta function $\zeta(p) = \sum_{n=1}^{\infty} n^{-p}$ unifies all $p$-series.
AI relevance:
- Zipf's law: Word frequency in natural language follows $f(r) \propto 1/r$ where $r$ is the rank. The total frequency of the top $n$ words is proportional to $H_n$, a harmonic number.
- BPE merge counts follow approximately Zipfian distributions; analysis involves harmonic numbers.
- Complexity analysis: $H_n = \Theta(\log n)$; this appears in analyses of algorithm running times and expected values of random processes.
- Coupon collector problem: Expected number of samples to see all $n$ items is $n H_n \approx n \ln n$ - relevant for coverage analysis in data sampling.
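A quick check of the $H_n \approx \ln n + \gamma$ approximation used throughout this subsection:

```python
import numpy as np

n = 10_000
H_n = np.sum(1.0 / np.arange(1, n + 1))   # n-th harmonic number
approx = np.log(n) + 0.5772156649         # ln n + Euler-Mascheroni constant

print(H_n, approx)   # 9.7876..., 9.7876... - they differ by roughly 1/(2n)
```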
7. Summation Bounds and Estimation
In both theoretical analysis and practical AI engineering, we rarely need the exact value of a sum. Instead, we need tight bounds - upper and lower estimates that tell us how a sum grows. This section equips you with the main bounding techniques.
7.1 Comparison Test
If $0 \leq f(i) \leq g(i)$ for all $i$, then: $\sum_i f(i) \leq \sum_i g(i)$.
Why it matters: This is the workhorse for bounding sums whose exact evaluation is intractable.
Worked example - bounding softmax tail:
Suppose the attention logits, sorted by relevance, decay at least linearly with rank. Then the exponentiated tail is bounded above by a geometric series:
This shows that the tail of the softmax distribution decays exponentially - justifying sparse attention approximations that ignore low-scoring tokens.
General pattern:
To bound \sum f(i):
1. Find a simpler g(i) with f(i) \leq g(i) for all i (upper bound)
2. Find a simpler h(i) with h(i) \leq f(i) for all i (lower bound)
3. Evaluate \sum g(i) and \sum h(i) using known formulas
7.2 Integral Test
For a positive, monotonically decreasing function $f$: $\int_{m}^{n+1} f(x)\,dx \;\leq\; \sum_{i=m}^{n} f(i) \;\leq\; f(m) + \int_{m}^{n} f(x)\,dx$
f(x) |
|#
|# #
|# # #
|# # #
|# # #
+--+--+--+--+---- i
1 2 3 4 5
# = f(i) rectangles (left sum = upper bound)
The curve f(x) passes through the tops of the bars
= gap between rectangle and curve
Integral = area under curve (between the bounds)
Derivation: Since $f$ is decreasing, on the interval $[i, i+1]$: $f(i+1) \leq f(x) \leq f(i)$.
Integrating both sides over $[i, i+1]$: $f(i+1) \leq \int_i^{i+1} f(x)\,dx \leq f(i)$.
Summing these inequalities over the appropriate ranges of $i$ for each side yields the result.
AI example - harmonic sum bounds:
With $f(x) = 1/x$: $\ln(n+1) \leq H_n \leq 1 + \ln n$.
This gives us $H_n = \ln n + \gamma + o(1)$, where $\gamma$ is the Euler-Mascheroni constant.
Application: Bounding the expected number of unique tokens seen after $m$ samples from a vocabulary of size $V$:
The integral test helps bound partial sums that arise in the coverage analysis.
7.3 Cauchy-Schwarz and Triangle Inequality for Sums
Triangle inequality for sums: $\left|\sum_i a_i\right| \leq \sum_i |a_i|$
This bounds the magnitude of a signed sum by the sum of absolute values. It explains why gradient norms after summing over a batch can be bounded: $\left\|\sum_i g_i\right\| \leq \sum_i \|g_i\|$.
Cauchy-Schwarz inequality for sums: $\left(\sum_i a_i b_i\right)^2 \leq \left(\sum_i a_i^2\right)\left(\sum_i b_i^2\right)$
Or in vector notation: $|a^\top b| \leq \|a\|\,\|b\|$, which means the cosine similarity always lies in $[-1, 1]$.
Proof sketch:
Consider the quadratic $q(t) = \sum_i (a_i t + b_i)^2 \geq 0$ for all $t$.
Expanding: $q(t) = t^2 \sum_i a_i^2 + 2t \sum_i a_i b_i + \sum_i b_i^2$.
Since this quadratic is non-negative for all $t$, its discriminant must be $\leq 0$:
which gives the result.
AI applications:
| Application | Bound Used | Purpose |
|---|---|---|
| Cosine similarity | Cauchy-Schwarz | Validates similarity metric range |
| Gradient clipping analysis | Triangle inequality | Bounds effect of clip on sum |
| Attention weight analysis | Cauchy-Schwarz | Bounds QK dot product magnitude |
| Variance decomposition | Cauchy-Schwarz | Jensen-like bounds on expectations |
7.4 Stirling's Approximation
Stirling's approximation connects the product (factorial) to continuous functions: $n! \approx \sqrt{2\pi n}\,\left(\frac{n}{e}\right)^n$
Taking logarithms: $\ln n! \approx n \ln n - n + \frac{1}{2} \ln(2\pi n)$
Derivation via integral test: $\ln n! = \sum_{i=1}^{n} \ln i \approx \int_1^{n} \ln x\,dx = n \ln n - n + 1$.
The $\frac{1}{2}\ln(2\pi n)$ correction comes from more careful analysis (Laplace's method on the Gamma function integral).
Accuracy table:
| $n$ | $n!$ | Stirling | Relative Error |
|---|---|---|---|
| 1 | 1 | 0.922 | 7.8% |
| 5 | 120 | 118.02 | 1.6% |
| 10 | 3628800 | 3598695 | 0.83% |
| 50 | $\approx 3.04 \times 10^{64}$ | $\approx 3.04 \times 10^{64}$ | 0.17% |
| 100 | $\approx 9.33 \times 10^{157}$ | $\approx 9.33 \times 10^{157}$ | 0.083% |
The approximation improves rapidly: the relative error is approximately $\frac{1}{12n}$.
AI relevance:
- Combinatorial arguments: The number of permutations of $n$ tokens is $n!$; Stirling tells us roughly $n \log_2 n - n \log_2 e$ bits to represent a permutation - relevant for sorting networks used in differentiable sorting.
- Log-likelihood of multinomial: counts enter through factorials $\log \binom{n}{n_1, \ldots, n_k}$; Stirling converts this to an entropy, $\log \binom{n}{n_1, \ldots, n_k} \approx n\,H(\hat{p})$ where $\hat{p}_i = n_i / n$.
- Capacity arguments: Channel capacity and model capacity bounds often use Stirling-type estimates of $\log n!$.
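A small sketch comparing exact factorials with Stirling's formula, reproducing the accuracy table above in code:

```python
import math

def stirling(n: int) -> float:
    """Stirling's approximation sqrt(2*pi*n) * (n/e)^n."""
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

for n in (1, 5, 10, 50):
    exact = math.factorial(n)
    approx = stirling(n)
    print(n, f"rel err = {(exact - approx) / exact:.2%}")  # shrinks roughly like 1/(12n)
```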
7.5 Asymptotic Notation for Sums
When exact sums are intractable, we characterise their growth rate:
| Notation | Meaning | Example Sum | Growth |
|---|---|---|---|
| $O(g)$ | Upper bounded by $g$ | $\sum_{i=1}^{n} i = O(n^2)$ | At most |
| $\Omega(g)$ | Lower bounded by $g$ | $\sum_{i=1}^{n} i = \Omega(n^2)$ | At least |
| $\Theta(g)$ | Tightly bounded | $\sum_{i=1}^{n} i = \Theta(n^2)$ | Exactly this rate |
| $o(g)$ | Dominated by $g$ | $H_n = o(n)$ | Strictly less |
| $\sim$ | Asymptotically equal | $H_n \sim \ln n$ | Ratio -> 1 |
Key asymptotic results for common sums:
AI complexity examples:
| Operation | Sum Form | Asymptotic | Practical Impact |
|---|---|---|---|
| Self-attention (full) | $\sum_{j=1}^{n}$ for each of $n$ queries | $O(n^2 d)$ | Quadratic in sequence length |
| Self-attention (sparse, top-$k$) | $\sum_{j \in \text{top-}k}$ for each query | $O(n k d)$ | Linear if $k$ is fixed |
| Batch gradient | $\sum_{i=1}^{B} \nabla \ell_i$ | $O(B \cdot P)$ | Linear in batch \times params |
| Transformer layer | attention sums + FFN sums | $O(n^2 d + n d^2)$ | depends on $n$ vs $d$: long sequences are attention-bound, short ones FFN-bound |
| Vocabulary projection | $\sum_{k=1}^{d}$ for each of $V$ logits | $O(n d V)$ | Bottleneck for large vocab |
8. Special Summation Patterns in AI
This section catalogues the specific summation patterns that appear repeatedly across AI and machine learning. Recognising these patterns lets you read papers fluently, debug implementations, and reason about computational costs.
8.1 Softmax and the Partition Function
The softmax function normalises a vector of logits into a probability distribution:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}$$

The denominator $Z = \sum_{j} e^{z_j}$ is the partition function - a term borrowed from statistical mechanics.
Anatomy of softmax:
Input logits: z = [2.0, 1.0, 0.5, -1.0]
| | | |
Exponentiate: e^z = [7.39, 2.72, 1.65, 0.37]
| | | |
Partition sum: Z = 7.39 + 2.72 + 1.65 + 0.37 = 12.13
| | | |
Normalise: p = [0.61, 0.22, 0.14, 0.03]
| | | |
Verify: \sum p = 0.61 + 0.22 + 0.14 + 0.03 = 1.00 OK
Numerical stability - the log-sum-exp trick:
Direct computation overflows for large logits. The standard trick:
This works because , and now the largest exponent is .
Gradient of log-partition:
A beautiful identity links the partition function to expectations:
Higher derivatives give cumulants of the distribution - the second derivative gives the variance.
Where partition functions appear in AI:
| Context | Partition Function | Size |
|---|---|---|
| Classification (softmax) | $\sum_{c=1}^{C} e^{z_c}$ | num_classes |
| Language modelling | $\sum_{v \in V} e^{z_v}$ | vocab_size (~50K-100K) |
| Contrastive learning (InfoNCE) | $\sum_{j} e^{\text{sim}(q, k_j)/\tau}$ | num_negatives |
| Energy-based models | $\int e^{-E(x)}\,dx$ | Intractable (continuous) |
| CRF layer (NER) | $\sum_{\text{label sequences}} e^{\text{score}}$ | Exponential (use DP) |
8.2 Cross-Entropy Loss as Summation
Cross-entropy loss is a double summation - over the batch and over classes:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log p_{i,c}$$

For one-hot targets ($y_{i,c} = 1$ only for the correct class $c_i$), the inner sum collapses: $L = -\frac{1}{N} \sum_{i=1}^{N} \log p_{i, c_i}$.
Expansion for a single sample: $-\log p_{c} = -z_{c} + \log \sum_{j} e^{z_j}$
So the loss is: (negative logit of correct class) + (log partition function).
Gradient with respect to logit $z_j$: $\frac{\partial L}{\partial z_j} = p_j - y_j$
This is the remarkably clean result that makes softmax + cross-entropy so popular: the gradient is simply (predicted probability - target probability) for each class.
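A numerical check of that gradient, comparing finite differences against $p - y$ for a toy logit vector (the values below are made-up illustration data):

```python
import numpy as np

z = np.array([2.0, -1.0, 0.5])       # toy logits
y = np.array([0.0, 0.0, 1.0])        # one-hot target
p = np.exp(z) / np.exp(z).sum()      # softmax probabilities

loss = -np.sum(y * np.log(p))        # cross-entropy for one sample

eps = 1e-6
num_grad = np.zeros_like(z)
for j in range(len(z)):              # finite-difference gradient per logit
    zp = z.copy()
    zp[j] += eps
    pp = np.exp(zp) / np.exp(zp).sum()
    num_grad[j] = (-np.sum(y * np.log(pp)) - loss) / eps

print(np.allclose(num_grad, p - y, atol=1e-4))  # True: the gradient is p - y
```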
Label smoothing variant:
Instead of one-hot $y$, use smoothed targets: $\tilde{y}_c = (1 - \epsilon)\,y_c + \frac{\epsilon}{C}$.
The extra term encourages the model to keep non-target probabilities from collapsing to zero - acting as a regulariser on confidence.
8.3 Attention as Weighted Sum
The core attention operation is a weighted sum of value vectors: $\text{output}_i = \sum_{j} \alpha_{ij} v_j$,
where attention weights come from softmax over scaled dot-product scores: $\alpha_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d_k})}{\sum_{j'} \exp(q_i \cdot k_{j'} / \sqrt{d_k})}$.
The full computation graph:
Q_i --+
     +-- Q_i * K_j / \sqrt{d_k} ---> softmax over j ---> \alpha_ij
K_j --+ |
+-- \sum_j \alpha_ij * V_j ---> output_i
V_j -------------------------------------------------+
Three sums in attention:
- Dot product: $q_i \cdot k_j = \sum_{m=1}^{d_k} q_{i,m}\,k_{j,m}$ - sum over the head dimension
- Softmax normaliser: $\sum_{j'} \exp(q_i \cdot k_{j'} / \sqrt{d_k})$ - sum over keys
- Weighted value: $\sum_{j} \alpha_{ij} v_j$ - sum over values
Multi-head attention adds a fourth sum:
(The concatenation + linear projection is equivalent to a sum of projected heads.)
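The three sums can be written as explicit loops and checked against the vectorised form. A sketch with assumed toy dimensions (single head, no projections):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 5, 8                                   # toy sequence length and head dimension
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

out = np.zeros((n, d_k))
for i in range(n):
    scores = np.array([Q[i] @ K[j] / np.sqrt(d_k) for j in range(n)])  # sum 1: dot products
    alpha = np.exp(scores) / np.exp(scores).sum()                      # sum 2: softmax normaliser
    out[i] = sum(alpha[j] * V[j] for j in range(n))                    # sum 3: weighted values

scores = Q @ K.T / np.sqrt(d_k)                 # same computation, vectorised
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
print(np.allclose(out, alpha @ V))              # True
```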
Computational cost (counting multiplications):
| Step | Operation | Cost |
|---|---|---|
| QK^T | $n^2$ dot products of length $d_k$ | $O(n^2 d_k)$ per head |
| Softmax | Normalise each row over $n$ values | $O(n^2)$ |
| Attention \times V | Weighted sum of $n$ value vectors per query | $O(n^2 d_v)$ per head |
| All heads | Multiply by $h$ heads | $O(n^2 h d_k)$ |
| Total | | $O(n^2 d)$ (since $h \cdot d_k = d$) |
8.4 Matrix Multiplication as Summation
Every matrix multiplication is a summation in disguise: $(AB)_{ij} = \sum_{k} A_{ik} B_{kj}$.
This single formula underlies virtually all neural network computation.
Layer types expressed as matrix multiply:
| Layer | Formula | Dimensions |
|---|---|---|
| Dense/Linear | $y_i = \sum_{k} W_{ik} x_k + b_i$ | $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ |
| Multi-head QKV projection | $Q = XW_Q,\; K = XW_K,\; V = XW_V$ | $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ |
| Embedding lookup | one-hot row times embedding matrix | $E \in \mathbb{R}^{V \times d}$ |
| Vocabulary projection | $z = hW_{\text{vocab}}$ | $W_{\text{vocab}} \in \mathbb{R}^{d \times V}$ |
| Convolution (im2col) | Reshaped to GEMM | Kernel unrolled into matrix columns |
FLOPs formula: Multiplying an $m \times k$ matrix by a $k \times n$ matrix requires:
- $m \cdot n \cdot k$ multiplications
- $m \cdot n \cdot (k - 1)$ additions
- Total: approximately $2mnk$ FLOPs
For a transformer with $L$ layers and model dimension $d$, the forward-pass FLOPs per token are roughly twice the parameter count, on the order of $24 L d^2$ (accounting for QKV projections, attention output, and FFN).
8.5 Gradient as Sum Over Batch
The empirical risk (training loss) is an average over the batch: $L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i), y_i\big)$.
Its gradient is therefore also an average: $\nabla_\theta L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell\big(f_\theta(x_i), y_i\big)$.
Key insight: The gradient is linear in the summation - each sample contributes independently. This enables:
- Data parallelism: Compute per-sample gradients on different GPUs, then sum (all-reduce).
- Gradient accumulation: Sum gradients over micro-batches before updating weights.
- Per-sample gradient clipping (DP-SGD): Clip before averaging - required for differential privacy.
Gradient accumulation pattern:
Micro-batch 1: g_1 = \nablaL(\theta; B_1) -+
Micro-batch 2: g_2 = \nablaL(\theta; B_2) -+
Micro-batch 3: g_3 = \nablaL(\theta; B_3) -+---> g = (g_1 + g_2 + g_3 + g_4) / 4
Micro-batch 4: g_4 = \nablaL(\theta; B_4) -+
\theta <- \theta - \eta * g
Effective batch size = $K \times$ (micro-batch size), where $K$ is the number of accumulation steps.
Variance of stochastic gradient: $\text{Var}[\hat{g}_B] = \frac{1}{B}\,\text{Var}[\nabla \ell_i]$
(assuming i.i.d. samples). Doubling the batch size halves the gradient variance - but has diminishing returns on convergence speed (linear speedup only up to a critical batch size).
8.6 Perplexity as Geometric Mean
Perplexity is the standard evaluation metric for language models:

$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(t_i \mid t_{<i})\right)$$

Decomposing the formula:
The inner sum, divided by $N$, is the average cross-entropy per token.
Exponentiating converts it to a geometric mean of inverse probabilities: $\text{PPL} = \left(\prod_{i=1}^{N} \frac{1}{P(t_i \mid t_{<i})}\right)^{1/N}$.
Intuition: If PPL = 50, the model is "as confused as if it were choosing uniformly among 50 tokens at each step."
Worked example:
Sentence: "The cat sat on the mat"
Token probabilities: [0.1, 0.05, 0.2, 0.3, 0.1, 0.15]
Log probs: [-2.303, -2.996, -1.609, -1.204, -2.303, -1.897]
Sum of log probs: -12.312
Average: -12.312 / 6 = -2.052
PPL = exp(2.052) = 7.78
Interpretation: Model is ~8-way confused on average
(a perfect model would have PPL = 1)
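The same worked example in code (the probabilities are the illustrative values from the box above):

```python
import numpy as np

probs = np.array([0.1, 0.05, 0.2, 0.3, 0.1, 0.15])   # per-token probabilities

avg_nll = -np.mean(np.log(probs))                     # average cross-entropy per token
ppl = np.exp(avg_nll)
geometric_mean = np.prod(1.0 / probs) ** (1.0 / len(probs))  # product form of the same quantity

print(ppl, geometric_mean)                            # ~7.78  ~7.78
```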
Properties arising from the summation structure:
- Perplexity combines across segments as a weighted geometric mean: if we split text into parts A and B with $N_A$ and $N_B$ tokens, $\text{PPL} = \text{PPL}_A^{\,N_A/N} \cdot \text{PPL}_B^{\,N_B/N}$.
- Outlier sensitivity: A single very-low-probability token drastically inflates PPL. This is why models are sometimes evaluated with rare-token filtering.
- Bits per character (BPC): the same average negative log-likelihood expressed in base 2 and per character, $\text{BPC} = \log_2(\text{PPL}_{\text{char}})$.
8.7 BM25 as Weighted Sum
BM25, the backbone of traditional information retrieval, is a weighted sum over query terms:

$$\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\text{avgdl}}\right)}$$

Components as summation:
- Outer sum $\sum_{t \in q}$: Iterates over unique query terms - typically 3-10 terms.
- IDF (Inverse Document Frequency): $\text{IDF}(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5}$, where $N$ = total documents, $n_t$ = documents containing term $t$.
- TF saturation: The fraction $\frac{f(t,d)(k_1+1)}{f(t,d) + k_1(\cdot)}$ saturates as $f(t, d)$ grows - repeated occurrences have diminishing returns.
Why BM25 is a pure summation pattern:
Score = 0
For each query term t:
Score += IDF(t) \times TF_saturation(t, d)
-------- ------------------
weight adjusted count
This is a weighted sum: \sum w_t * s_t
Modern hybrid retrieval combines BM25 with dense retrieval: $\text{score}(q, d) = \lambda \cdot \text{sim}_{\text{dense}}(q, d) + (1 - \lambda) \cdot \text{BM25}(q, d)$.
Both components are summations. The dense similarity is a dot product $\sum_k q_k d_k$, and BM25 is a weighted sum over terms. The hybrid score is itself a weighted sum - sums all the way down.
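A toy sketch of BM25 as a weighted sum, with made-up statistics and the common default parameters $k_1 = 1.5$, $b = 0.75$ (a simplified illustration, not a production scorer):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avgdl, df, n_docs, k1=1.5, b=0.75):
    """Weighted sum over query terms: IDF(t) * saturated term frequency."""
    score = 0.0
    for t in query_terms:                                 # outer sum over query terms
        tf = doc_tf.get(t, 0)
        idf = math.log((n_docs - df.get(t, 0) + 0.5) / (df.get(t, 0) + 0.5) + 1)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))
    return score

print(bm25_score(["summation", "notation"],
                 doc_tf={"summation": 3, "notation": 1},
                 doc_len=120, avgdl=100,
                 df={"summation": 5, "notation": 50}, n_docs=1000))
```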
9. Einstein Summation Convention
This section is a bridge, not the full treatment. The goal here is to recognize Einstein notation as a compressed way of writing repeated sums, so the dedicated next chapter, Einstein Summation and Index Notation, can focus on tensor structure, index discipline, and implementation patterns without reintroducing the summation idea from scratch.
9.1 Motivation
In many formulas, the summation symbol eventually becomes visual noise. Matrix multiplication is the classic example: $C_{ij} = \sum_{k} A_{ik} B_{kj}$.
Einstein notation suppresses the explicit sigma and lets the repeated index carry the summation implicitly: $C_{ij} = A_{ik} B_{kj}$.
The mathematical content is the same. What changes is the visual emphasis: the notation highlights the index pattern rather than the bookkeeping symbol.
9.2 Examples of Einstein Notation
The pattern is easiest to learn by translating familiar sums:
| Operation | Sigma notation | Einstein notation |
|---|---|---|
| Dot product | $\sum_i a_i b_i$ | $a_i b_i$ |
| Matrix-vector | $y_i = \sum_j A_{ij} x_j$ | $y_i = A_{ij} x_j$ |
| Matrix product | $C_{ij} = \sum_k A_{ik} B_{kj}$ | $C_{ij} = A_{ik} B_{kj}$ |
| Trace | $\sum_i A_{ii}$ | $A_{ii}$ |
| Bilinear form | $\sum_i \sum_j x_i A_{ij} y_j$ | $x_i A_{ij} y_j$ |
Seen this way, Einstein notation is not a new operation. It is a shorthand for repeated-index sums you already know how to read.
9.3 Rules for Einstein Notation
A safe first-pass reading rule is:
- indices that appear twice in one term are summed
- indices that appear once are free and determine the output shape
- free indices must match across both sides of an equation
- if an index appears three or more times, stop and rewrite the expression explicitly
Those rules are enough to parse the most common ML examples. The next chapter develops the more careful index discipline needed for higher-rank tensors and more complex contractions.
9.4 Einstein Notation in PyTorch: einsum
Libraries such as PyTorch and NumPy make the connection explicit through einsum, which turns index patterns into code:
# matrix multiply
C = torch.einsum('ik,kj->ij', A, B)
# attention scores
scores = torch.einsum('bhqd,bhkd->bhqk', Q, K)
For this chapter, the important idea is conceptual: einsum is programmable Einstein notation. Once you can read the repeated indices, you can often read the code directly from the math.
9.5 Tensor Contraction
Einstein notation naturally leads to tensor contraction, where repeated indices are summed out and the remaining free indices define the output tensor. Dot products, matrix multiplication, traces, and attention score computations are all contractions of this kind.
That is the real reason this preview belongs here: it shows that summation notation does not disappear in tensor algebra. It becomes more compact, more structural, and more expressive. The full transition from explicit sums to tensor contractions continues in Einstein Summation and Index Notation.
10. Double Sums and Index Interaction
When two or more indices appear in the same summation, their interaction creates rich mathematical structures. This section explores double (and higher) sums with attention to the patterns that arise in AI.
10.1 Row-Major vs Column-Major Intuition
A double sum sums all entries of an array. The question is: in what order?
Row-major (outer=i, inner=j): Column-major (outer=j, inner=i):
+------------------+ +------------------+
| -> -> -> -> -> -> | row 1 | down down down down down down |
| -> -> -> -> -> -> | row 2 | down down down down down down |
| -> -> -> -> -> -> | row 3 | down down down down down down |
+------------------+ +------------------+
Process each row left-to-right, Process each column top-to-bottom,
then move to next row. then move to next column.
For finite sums, both orders give the same result (Fubini's theorem for sums). But the computational pattern differs:
- Row-major accesses contiguous memory in C/PyTorch (row-contiguous tensors).
- Column-major accesses contiguous memory in Fortran/MATLAB.
In AI, choosing the right traversal order affects cache performance and GPU coalescing.
10.2 Diagonal Sums
Summing along diagonals of a matrix isolates elements where the indices satisfy $j - i = k$ (constant offset): $\sum_{j - i = k} a_{ij}$.
Main diagonal ($k = 0$): The trace, $\text{tr}(A) = \sum_i a_{ii}$.
Off-diagonals: $\sum_i a_{i, i+k}$ sums the $k$-th super-diagonal (for $k > 0$) or sub-diagonal (for $k < 0$).
AI connection - relative position in attention:
In relative positional encodings (e.g., ALiBi, RoPE), the attention bias depends on the offset $i - j$:
The bias matrix $B$ with $B_{ij} = b(i - j)$ is a Toeplitz matrix - constant along each diagonal. Summing along diagonals gives the total contribution of each relative distance.
10.3 Upper and Lower Triangular Sums
Many AI computations restrict summation to triangular regions:
Upper triangular: Lower triangular (causal):
+- - - - - - -+ +- - - - - - -+
| # # # # # # | | # * * * * * |
| * # # # # # | | # # * * * * |
| * * # # # # | | # # # * * * |
| * * * # # # | | # # # # * * |
| * * * * # # | | # # # # # * |
| * * * * * # | | # # # # # # |
+- - - - - - -+ +- - - - - - -+
# = included in sum Used in causal (autoregressive)
* = excluded attention masks
Causal attention uses lower-triangular masking: position $i$ may only attend to positions $j \leq i$, so the output is $\sum_{j \leq i} \alpha_{ij} v_j$.
Count of elements:
- Upper or lower triangle: $\frac{n(n+1)}{2}$ elements
- Strict upper/lower (no diagonal): $\frac{n(n-1)}{2}$ elements
- This halving is why causal attention uses roughly half the FLOPs of bidirectional attention.
10.4 Changing Order of Summation
Finite case (always valid):
Triangular region - changing bounds: $\sum_{i=1}^{n} \sum_{j=i}^{n} a_{ij} = \sum_{j=1}^{n} \sum_{i=1}^{j} a_{ij}$
Proof by picture:
Original: Swapped:
(fix i, vary j\geqi) (fix j, vary i\leqj)
j -> j ->
1 2 3 4 1 2 3 4
+--------- +---------
i=1| # # # # i=1| # # # #
i=2| # # # i=2| # # #
i=3| # # i=3| # #
i=4| # i=4| #
Same set of (i,j) pairs - just visited in different order!
General rule for dependent limits:
where the new inner bounds are chosen so that exactly the same set of $(i, j)$ pairs is covered.
AI application - expectation of loss over data and weights:
In Bayesian deep learning, we often need to interchange an expectation over parameters and a sum over data: $E_{\theta}\left[\sum_{i=1}^{N} \ell_i(\theta)\right] = \sum_{i=1}^{N} E_{\theta}[\ell_i(\theta)]$.
This interchange (linearity of expectation) lets us compute per-example expected losses separately.
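The triangular-bound swap from this subsection is easy to verify numerically on a random matrix:

```python
import numpy as np

n = 6
A = np.random.default_rng(0).normal(size=(n, n))

row_first = sum(A[i, j] for i in range(n) for j in range(i, n))   # fix i, vary j >= i
col_first = sum(A[i, j] for j in range(n) for i in range(j + 1))  # fix j, vary i <= j

print(np.isclose(row_first, col_first))   # True - same set of (i, j) pairs, different order
```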
10.5 Cauchy Product of Series
The Cauchy product defines multiplication of two series: $\left(\sum_{n=0}^{\infty} a_n\right)\left(\sum_{n=0}^{\infty} b_n\right) = \sum_{n=0}^{\infty} c_n$, where $c_n = \sum_{k=0}^{n} a_k b_{n-k}$.
The inner sum is a convolution - the same operation that defines conv layers in CNNs.
Connection to convolution:
1D discrete convolution:
a = [a_0, a_1, a_2, a_3] (filter/kernel)
b = [b_0, b_1, b_2, b_3, b_4] (input signal)
(a * b)_n = \sum_k a_k * b_{n-k}
This is exactly the Cauchy product - the coefficient of x^n
in the polynomial product A(x)*B(x)
Mertens' theorem: If $\sum a_n$ converges absolutely and $\sum b_n$ converges, then the Cauchy product converges to the product of the sums. This guarantees that polynomial multiplication "works" for convergent power series.
AI applications:
- 1D convolution in CNNs/WaveNet is a finite Cauchy product.
- Polynomial multiplication via FFT: the Cauchy product of two length-$n$ sequences can be computed in $O(n \log n)$ using FFT, rather than $O(n^2)$ directly.
- Generating functions in combinatorics: if and are generating functions, their product has coefficients given by the Cauchy product - used in counting arguments for data structures.
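A short check that the Cauchy product of two coefficient sequences matches NumPy's convolution (toy coefficients):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])     # coefficients of A(x)
b = np.array([4.0, 5.0])          # coefficients of B(x)

# c_n = sum_k a_k * b_{n-k}, with out-of-range terms treated as zero
c = np.array([sum(a[k] * b[n - k] for k in range(len(a)) if 0 <= n - k < len(b))
              for n in range(len(a) + len(b) - 1)])

print(c)                  # [ 4. 13. 22. 15.]
print(np.convolve(a, b))  # identical: discrete convolution is the Cauchy product
```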
10.6 Vandermonde's Identity and Convolution
Vandermonde's identity is a combinatorial convolution: $\binom{m+n}{r} = \sum_{k=0}^{r} \binom{m}{k}\binom{n}{r-k}$.
Proof idea: Count the number of ways to choose $r$ items from a group of $m + n$ people (divided into group A of $m$ and group B of $n$). Choose $k$ from A and $r - k$ from B, then sum over all valid $k$.
Connection to generating functions:
$(1+x)^m (1+x)^n = (1+x)^{m+n}$. Comparing coefficients of $x^r$ on both sides gives Vandermonde's identity - the coefficient on the left is the Cauchy product of binomial coefficients.
AI relevance:
- Feature interaction counting: If model A has $m$ features and model B has $n$ features, the number of combined feature subsets of size $r$ follows Vandermonde's identity.
- Mixture models: When combining categorical distributions, the convolution of their PMFs follows the same pattern.
10.7 Abel's Summation by Parts
Abel's summation is the discrete analogue of integration by parts: $\sum_{i=1}^{n} a_i b_i = A_n b_n - \sum_{i=1}^{n-1} A_i (b_{i+1} - b_i)$,
where $A_i = \sum_{k=1}^{i} a_k$ is the partial sum.
Compare with integration by parts:
| Continuous | Discrete |
|---|---|
| $\int u\,dv = uv - \int v\,du$ | $\sum a_i b_i = A_n b_n - \sum A_i\,\Delta b_i$ |
| $F(x) = \int f(x)\,dx$ (antiderivative of $f$) | $A_i = \sum_{k \leq i} a_k$ (partial sum of $a$) |
When to use Abel's summation:
The technique is useful when:
- The partial sums $A_i$ are bounded (even if the $a_i$ oscillate).
- The sequence $b_i$ is monotonically decreasing to 0.
Under these conditions, Abel's summation shows the original sum converges (Dirichlet's test).
AI application - convergence of weighted gradient sums:
In exponentially weighted moving averages (used in Adam, EMA models): $m_t = (1 - \beta) \sum_{k=0}^{t-1} \beta^{k}\, g_{t-k}$.
Abel's summation helps analyse the properties of this weighted sum, particularly showing that the contribution of early gradients decays geometrically and the sum converges even if individual gradients oscillate.
11. Convergence of Infinite Series
Infinite sums - series - appear throughout AI: Taylor expansions for activation functions, geometric series for discounted rewards, power series for matrix functions. Understanding when an infinite sum converges (has a finite value) is essential for both theoretical proofs and numerical stability.
11.1 Partial Sums and the Definition of Convergence
An infinite series is defined as the limit of its partial sums: $\sum_{n=1}^{\infty} a_n = \lim_{N \to \infty} S_N$, where $S_N = \sum_{n=1}^{N} a_n$.
The series converges if this limit exists and is finite; otherwise it diverges.
Three types of behavior:
Converges (e.g., \sum 1/2^n): Diverges to \infty (e.g., \sum 1/n):
S_N | ---------- S=2 S_N | /
| / | /
| / | /
| / | /
|/ | /
+---------------- N | /
| /
Partial sums approach a limit |/
+---------------- N
Partial sums grow without bound
Oscillates/Diverges (e.g., \sum (-1)^n):
S_N | * * *
| 1 --+ 1 --+ 1 --
| | |
| 0 --+ 0 --+ 0 --
+---------------- N
Partial sums bounce between 0 and 1
Necessary condition (divergence test):
If $\sum a_n$ converges, then $a_n \to 0$.
Contrapositive: If $\lim a_n \neq 0$ or the limit doesn't exist, the series diverges.
Warning: The converse is false! $a_n \to 0$ does NOT guarantee convergence. The harmonic series has $a_n = 1/n \to 0$ but diverges.
11.2 Absolute vs Conditional Convergence
Absolute convergence: $\sum |a_n|$ converges.
Conditional convergence: $\sum a_n$ converges but $\sum |a_n|$ diverges.
Key theorem: Absolute convergence implies convergence. (The converse is false.)
Example: The alternating harmonic series $\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n}$ converges conditionally - it converges (to $\ln 2$), but $\sum \frac{1}{n}$ diverges.
Why absolute convergence matters in AI:
- Absolutely convergent series can be rearranged in any order without changing the sum. Conditionally convergent series cannot (Riemann's rearrangement theorem - you can rearrange a conditionally convergent series to converge to any value!).
- Parallel computation of sums (across GPUs) implicitly rearranges terms. If the series is only conditionally convergent, different summation orders give different results - a source of numerical non-determinism.
- Most sums in AI (loss over data, gradient components) involve non-negative terms, so convergence is always absolute.
11.3 Convergence Tests Summary
| Test | Statement | Best For |
|---|---|---|
| Divergence | If $a_n \not\to 0$, $\sum a_n$ diverges | Quick rejection |
| Comparison | If $0 \leq a_n \leq b_n$ and $\sum b_n$ converges, so does $\sum a_n$ | Bounding by known series |
| Limit comparison | If $\lim a_n / b_n$ is a finite positive constant, both converge or both diverge | Same rate detection |
| Ratio | If $\lim \lvert a_{n+1}/a_n \rvert = L$: converges if $L < 1$, diverges if $L > 1$ | Factorials, exponentials |
| Root | If $\lim \lvert a_n \rvert^{1/n} = L$: converges if $L < 1$, diverges if $L > 1$ | $n$-th powers |
| Integral | $\sum f(n)$ and $\int f(x)\,dx$ converge/diverge together (for positive decreasing $f$) | Continuous analogues |
| Alternating series | If $a_n$ decreases to 0, $\sum (-1)^n a_n$ converges | Sign-alternating terms |
| $p$-series | $\sum 1/n^p$ converges iff $p > 1$ | Polynomial decay |
Ratio test in detail (most commonly used):
- $L < 1$: Absolutely convergent. The terms decay at least geometrically.
- $L > 1$: Divergent. The terms grow.
- $L = 1$: Inconclusive. Need another test.
AI example - convergence of the Taylor series for $e^x$: the ratio of successive terms is $\frac{|x|^{n+1}/(n+1)!}{|x|^{n}/n!} = \frac{|x|}{n+1} \to 0$.
Since the limit is $0 < 1$, the series converges for all $x$. This is why $e^x$ (and hence softmax) is well-defined everywhere.
11.4 Power Series and Radius of Convergence
A power series $\sum_{n=0}^{\infty} a_n (x - c)^n$ converges in some interval centered at $c$:
The series converges absolutely for $|x - c| < R$ and diverges for $|x - c| > R$, where $R$ is the radius of convergence.
Key power series in AI:
| Function | Power Series | Radius |
|---|---|---|
| GELU | Complicated (involves the error-function series) | $\infty$ |
Why radius of convergence matters:
- A truncated Taylor approximation of an activation function is only valid inside its radius of convergence - outside this range, the polynomial approximation diverges wildly. This matters for hardware implementations that approximate activation functions with polynomials.
- The series $\ln(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots$ only converges for $-1 < x \leq 1$ - so log1p(x) is numerically implemented differently for large $x$.
11.5 Convergence of Training - An Analogy
While not an infinite series in the mathematical sense, training a neural network produces a sequence of losses that we hope converges.
Connections to series convergence:
| Series Concept | Training Analogue |
|---|---|
| Partial sum | Cumulative loss or running average |
| necessary | Loss improvements must shrink |
| Absolute convergence | All parameter dimensions converge |
| Conditional convergence | Loss oscillates but trends down |
| Radius of convergence | Learning rate stability region |
| Geometric decay | Exponential learning rate schedule |
Learning rate as a convergence factor:
For SGD with learning rate $\eta$ on a quadratic loss surface with Hessian eigenvalue $\lambda$, the error in that eigendirection is multiplied by $(1 - \eta\lambda)$ at every step, so after $t$ steps it scales as $(1 - \eta\lambda)^t$.
This is a geometric series ratio. Convergence requires $|1 - \eta\lambda| < 1$, i.e., $\eta < 2/\lambda$.
For $\lambda_{\max}$, the maximum eigenvalue of the Hessian, the critical learning rate is: $\eta_{\text{crit}} = \frac{2}{\lambda_{\max}}$.
Beyond this, training diverges - exactly like a geometric series with $|r| > 1$.
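A one-dimensional sketch of this threshold, assuming a quadratic loss $\tfrac{1}{2}\lambda x^2$ with $\lambda = 4$ (so the critical learning rate is $0.5$):

```python
lam = 4.0                          # Hessian eigenvalue of the toy quadratic loss

for eta in (0.1, 0.4, 0.6):        # below, near, and above the critical rate 2/lam = 0.5
    x = 1.0
    for _ in range(50):
        x -= eta * lam * x         # each step multiplies the error by (1 - eta*lam)
    print(eta, x)                  # shrinks toward 0 for eta < 0.5, blows up for eta > 0.5
```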
12. Summation in Probability and Statistics
Summation is the language of probability theory for discrete random variables. Every key concept - expectation, variance, entropy - is defined as a sum.
12.1 Expectation as Weighted Sum
The expected value of a discrete random variable $X$ with PMF $p(x)$: $E[X] = \sum_{x} x\,p(x)$
More generally, for any function $g$: $E[g(X)] = \sum_{x} g(x)\,p(x)$
This is a weighted sum where the weights are probabilities (summing to 1):
Expectation = weighted average
Values: x_1 x_2 x_3 x_4
Probabilities: p_1 p_2 p_3 p_4
| | | |
v v v v
E[X] = ------ x_1p_1 + x_2p_2 + x_3p_3 + x_4p_4 ------
Constraint: p_1 + p_2 + p_3 + p_4 = 1
Key properties (all following from linearity of summation):
| Property | Formula | Summation Origin |
|---|---|---|
| Linearity | $E[aX + bY] = aE[X] + bE[Y]$ | $\sum (ax + by)\,p = a\sum x\,p + b\sum y\,p$ |
| Constant | $E[c] = c$ | $\sum_x c\,p(x) = c \sum_x p(x) = c$ |
| Non-negativity | $X \geq 0 \Rightarrow E[X] \geq 0$ | Sum of non-negative terms |
| Indicator | $E[\mathbb{1}[A]] = P(A)$ | $\sum_x \mathbb{1}[x \in A]\,p(x) = \sum_{x \in A} p(x)$ |
AI example - expected loss:
The training objective becomes, for a finite dataset:
This is the empirical mean - a sum with equal weights - that approximates the true expectation.
12.2 Variance and Higher Moments
Variance measures spread around the mean: $\text{Var}[X] = E[(X - \mu)^2] = \sum_x (x - \mu)^2\,p(x)$, where $\mu = E[X]$.
Computational formula (more numerically stable in some cases): $\text{Var}[X] = E[X^2] - (E[X])^2$
Proof: $E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu^2 + \mu^2 = E[X^2] - \mu^2$.
Variance of a sum (independent variables): $\text{Var}\left[\sum_i X_i\right] = \sum_i \text{Var}[X_i]$
For independent variables, covariances vanish: $\text{Cov}(X_i, X_j) = 0$ for $i \neq j$.
AI application - gradient variance:
The variance of the stochastic gradient with batch size $B$: $\text{Var}[\hat{g}_B] = \frac{1}{B}\,\text{Var}[\nabla \ell_i]$
This $1/B$ scaling is why larger batches reduce gradient noise but with diminishing returns on convergence speed (linear speedup only up to a critical batch size).
12.3 Entropy as Expected Surprise
Shannon entropy measures the average "surprise" or "information content" of a distribution: $H(p) = -\sum_{i} p_i \log p_i$
Decomposition of cross-entropy: $H(p, q) = -\sum_i p_i \log q_i = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$,
where $D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}$.
AI significance:
- Training with cross-entropy loss minimises $H(p, q)$ where $p$ is the true distribution and $q$ is the model. Since $H(p)$ is fixed, this is equivalent to minimising $D_{\mathrm{KL}}(p \,\|\, q)$.
- Maximum entropy occurs at the uniform distribution: $H = \log n$ for $n$ equally likely outcomes.
- Temperature scaling in softmax: $p_i \propto e^{z_i / T}$. As $T \to \infty$, entropy increases toward $\log V$ (uniform); as $T \to 0$, entropy decreases toward 0 (argmax).
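A sketch of the temperature effect on entropy, using an arbitrary logit vector:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log p_i (natural log)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

z = np.array([2.0, 1.0, 0.5, -1.0])           # arbitrary logits
for T in (0.1, 1.0, 10.0):
    p = np.exp(z / T) / np.exp(z / T).sum()   # temperature-scaled softmax
    print(T, entropy(p))                      # rises toward log(4) ~ 1.386 as T grows
```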
Entropy of common distributions:
| Distribution | PMF | Entropy |
|---|---|---|
| Bernoulli($p$) | $P(1) = p$, $P(0) = 1 - p$ | $-p \log p - (1 - p)\log(1 - p)$ |
| Uniform($n$) | $1/n$ each | $\log n$ |
| Geometric($p$) | $P(k) = (1 - p)^{k-1} p$ | $\frac{-(1-p)\log(1-p) - p\log p}{p}$ |
12.4 Law of Total Expectation and Total Variance
Law of total expectation (tower property): $E[X] = E\big[E[X \mid Y]\big] = \sum_{y} P(Y = y)\,E[X \mid Y = y]$
This is a double summation - the inner sum computes $E[X \mid Y = y]$ for each $y$, and the outer sum averages over $y$.
Proof:
AI application - mixture models:
For a Gaussian mixture model with mixture weights $\pi_k$ and component means $\mu_k$: $E[X] = \sum_k \pi_k\,\mu_k$.
Law of total variance: $\text{Var}[X] = E\big[\text{Var}[X \mid Y]\big] + \text{Var}\big[E[X \mid Y]\big]$
This decomposition appears in ANOVA, random effects models, and the bias-variance decomposition of ML models.
12.5 Inclusion-Exclusion Principle
For the union of events: $P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i} P(A_i) - \sum_{i < j} P(A_i \cap A_j) + \sum_{i < j < k} P(A_i \cap A_j \cap A_k) - \cdots$
Expanded for $n = 3$: $P(A_1 \cup A_2 \cup A_3) = P(A_1) + P(A_2) + P(A_3) - P(A_1 \cap A_2) - P(A_1 \cap A_3) - P(A_2 \cap A_3) + P(A_1 \cap A_2 \cap A_3)$
Venn diagram (n=3):
A_1
/ \
/ +1 \
/ /-\ \
| |-1 | |
A_2--- |+1| ---A_3
\ \-/ /
\ -1 /
\ /
Each region counted: singles (+), pairs (-), triple (+)
Alternating signs prevent over-counting
The sum has $2^n - 1$ terms - exponential in the number of sets.
AI applications:
- Deduplication analysis: Probability that at least one near-duplicate exists in a dataset.
- Hash collision bounds: union-style bounds on the probability that at least one pair of items collides.
- Union bound (Boole's inequality): The simpler upper bound $P\left(\bigcup_i A_i\right) \leq \sum_i P(A_i)$ is often sufficient and avoids the exponential blowup.
- Coverage testing: Probability that a test suite covers at least one of several code paths.
Bonferroni inequalities truncate the inclusion-exclusion sum:
- Truncating after an odd number of terms (ending on an added term) gives an upper bound.
- Truncating after an even number of terms (ending on a subtracted term) gives a lower bound.
This provides a practical way to get bounds without computing all terms.
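A small sketch comparing exact inclusion-exclusion with the union bound for three events; the probabilities are made up, and independence is assumed only so the intersection probabilities are easy to compute:

```python
from itertools import combinations

p = [0.10, 0.20, 0.15]                     # P(A_i) for three independent events (illustrative)

def inter(idx):
    """P of the intersection of the chosen events (independence assumed)."""
    prob = 1.0
    for i in idx:
        prob *= p[i]
    return prob

# Inclusion-exclusion: alternating sum over all non-empty subsets
exact = 0.0
for k in range(1, len(p) + 1):
    sign = (-1) ** (k + 1)
    exact += sign * sum(inter(idx) for idx in combinations(range(len(p)), k))

union_bound = sum(p)                       # Boole's inequality: P(union) <= sum P(A_i)
print(f"exact = {exact:.4f}  union bound = {union_bound:.4f}")
# exact = 1 - 0.9 * 0.8 * 0.85 = 0.388; the union bound 0.45 overestimates slightly
```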
13. Notation Variants and Conventions Across Fields
Different communities use different summation conventions. Recognising these variants prevents confusion when reading papers from different fields.
13.1 Computer Science / Programming
| Convention | Meaning | Notes |
|---|---|---|
| 0-indexed: $\sum_{i=0}^{n-1} a_i$ | Sum over $n$ elements | Default in Python, C, most languages |
| `sum(list)` | $\sum_{x \in \text{list}} x$ | No explicit indices |
| `reduce(+, arr)` | Left fold: $(((a_0 + a_1) + a_2) + \cdots)$ | Associativity matters for floats |
| Array slice: `a[i:j]` | Elements $a_i, \ldots, a_{j-1}$ | Exclusive upper bound |
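A few lines illustrating these conventions, and why fold order matters for floats; `operator.add` is used to make the left fold explicit:

```python
from functools import reduce
import math
import operator

a = [1e16, 1.0, -1e16]                     # true sum is 1.0

print(reduce(operator.add, a))             # plain left fold: 0.0 -- the 1.0 is absorbed by 1e16
print(math.fsum(a))                        # exactly rounded sum: 1.0
print(sum(a))                              # builtin sum; recent CPython versions compensate internally

print(a[0:2])                              # slice a[i:j] covers indices i..j-1 (exclusive upper bound)
```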
13.2 Physics
| Convention | Meaning | Notes |
|---|---|---|
| Einstein summation: $a_\mu b^\mu$ | Repeated indices are summed over | No $\sum$ written |
| Upper/lower indices: $x^\mu$, $x_\mu$ | Contravariant/covariant distinction | |
| Metric contraction: $x_\mu = g_{\mu\nu} x^\nu$ | Lowering an index with the metric tensor | |
| 4-vector gradient: $\partial_\mu = \partial/\partial x^\mu$ | Derivative with respect to each coordinate | |
13.3 Statistics
| Convention | Meaning | Notes |
|---|---|---|
| $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ | Sample mean | Indices often omitted |
| $\sum_i$ with no bounds | Sum over all observations | Context determines range |
| $\mathbb{E}[X] = \sum_x x\, p(x)$ | Expectation notation | |
| $\sum_{i<j}$ | All unordered pairs | |
13.4 LaTeX / Typesetting
% Inline: limits on side Display: limits above/below
$\sum_{i=1}^{n} a_i$ $$\sum_{i=1}^{n} a_i$$
% Product
$\prod_{j=1}^{m} b_j$ $$\prod_{j=1}^{m} b_j$$
% Multiple indices
$\sum_{\substack{i=1\\i \neq k}}^{n}$ % Exclusion
% Sum over a set
$\sum_{x \in S}$ % Index set notation
13.5 Notation Translation Table
| Your Field Says | Mathematicians Write | Einstein Writes | Python Says |
|---|---|---|---|
| Dot product | $\sum_i a_i b_i$ | $a_i b_i$ | `a @ b` or `np.dot(a, b)` |
| Matrix multiply | $C_{ij} = \sum_k A_{ik} B_{kj}$ | $A_{ik} B_{kj}$ | `A @ B` |
| Weighted average | $\frac{\sum_i w_i x_i}{\sum_i w_i}$ | - | `np.average(x, weights=w)` |
| Trace | $\sum_i A_{ii}$ | $A_{ii}$ | `np.trace(A)` |
| Batch mean | $\frac{1}{B}\sum_{b=1}^{B} x_b$ | - | `x.mean(dim=0)` |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | - | `F.softmax(z, dim=-1)` |
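The same operations expressed in the table's "dialects", using `np.einsum` for the Einstein column; shapes and values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
x, w = rng.normal(size=5), rng.random(5)

assert np.isclose(a @ b, np.einsum('i,i->', a, b))                  # dot product: sum_i a_i b_i
assert np.allclose(A @ B, np.einsum('ik,kj->ij', A, B))             # matrix multiply: sum_k A_ik B_kj
assert np.isclose(np.trace(A @ A.T), np.einsum('ij,ij->', A, A))    # trace of A A^T: sum_ij A_ij^2
assert np.isclose(np.average(x, weights=w), (w @ x) / w.sum())      # weighted average
print("all dialects agree")
```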
14. Common Mistakes and Pitfalls
Summation notation is deceptively simple, but subtle errors can cause bugs in proofs, papers, and code. This section catalogues the most frequent mistakes.
14.1 The Top 10 Summation Mistakes
| # | Mistake | Wrong | Correct | AI Consequence |
|---|---|---|---|---|
| 1 | Index scope leak | Using $i$ outside $\sum_i$ | $i$ is a bound (dummy) variable with no meaning outside the sum | Undefined variable bugs |
| 2 | Off-by-one bounds | $\sum_{i=0}^{n}$ vs $\sum_{i=0}^{n-1}$ | Count the elements: $n+1$ vs $n$ | Dimension mismatch errors |
| 3 | Distributing $\sum$ over products | $\sum_i a_i b_i = \left(\sum_i a_i\right)\left(\sum_i b_i\right)$ | $\sum_i a_i b_i \neq \left(\sum_i a_i\right)\left(\sum_i b_i\right)$ in general | Wrong loss gradients |
| 4 | Distributing $\prod$ over sums | $\prod_i (a_i + b_i) = \prod_i a_i + \prod_i b_i$ | Products distribute over factors, not addends | Wrong likelihood |
| 5 | Swapping $\sum$ and non-linear $f$ | $f\!\left(\sum_i x_i\right) = \sum_i f(x_i)$ | Only true for linear $f$ | Jensen's inequality violation |
| 6 | Forgetting empty cases | Assuming $\sum_{i \in \emptyset} a_i$ is undefined | Empty sum = 0, empty product = 1 | Edge case crashes |
| 7 | Wrong $\log \leftrightarrow \sum$ conversion | $\log\sum_i a_i = \sum_i \log a_i$ | $\log\prod_i a_i = \sum_i \log a_i$, NOT $\log\sum_i a_i$ | log-sum-exp vs sum-of-logs confusion |
| 8 | Index collision | Reusing the same dummy index in nested sums | Use distinct indices: $\sum_i \sum_j a_{ij}$ | Ambiguous scope |
| 9 | Telescoping failure | Cancelling the wrong terms | Write out the first few terms explicitly | Wrong simplification |
| 10 | Infinite sum truncation | Treating $\sum_{i=1}^{N}$ as $\sum_{i=1}^{\infty}$ | Bound the tail: $\big|\sum_{i=N+1}^{\infty} a_i\big|$ | Approximation error |
14.2 Detailed Examples
Mistake 3 in detail - sum of products \neq product of sums:
The product of sums has cross terms ($a_i b_j$ with $i \neq j$). The correct relationship:
$$\left(\sum_{i=1}^{n} a_i\right)\left(\sum_{j=1}^{n} b_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i b_j$$
This is a double sum (outer product), not a single sum (dot product).
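A three-line numerical check of the difference (the vectors are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(np.sum(a * b))               # sum of products (dot product): 32
print(np.sum(a) * np.sum(b))       # product of sums: 90 -- includes all cross terms a_i b_j
print(np.sum(np.outer(a, b)))      # double sum over the outer product: also 90
```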
Mistake 5 in detail - Jensen's inequality:
For convex $f$: $f\!\left(\sum_i \lambda_i x_i\right) \leq \sum_i \lambda_i f(x_i)$ (where $\lambda_i \geq 0$, $\sum_i \lambda_i = 1$).
For concave $f$, the inequality reverses.
Since $\log$ is concave:
$$\log\!\left(\sum_i \lambda_i x_i\right) \geq \sum_i \lambda_i \log x_i$$
This is the basis of the ELBO (Evidence Lower Bound) in variational inference:
$$\log p(x) = \log \sum_z q(z)\,\frac{p(x, z)}{q(z)} \;\geq\; \sum_z q(z)\log\frac{p(x, z)}{q(z)}$$
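A quick numerical check of Jensen's inequality for the concave $\log$; the weights $\lambda_i$ and values are arbitrary, with the weights summing to 1:

```python
import numpy as np

x   = np.array([0.5, 2.0, 8.0])
lam = np.array([0.2, 0.5, 0.3])          # lambda_i >= 0, sum to 1

lhs = np.log(np.sum(lam * x))            # log of the weighted sum
rhs = np.sum(lam * np.log(x))            # weighted sum of logs
print(lhs, rhs, lhs >= rhs)              # log(sum) >= sum(log) because log is concave
```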
Mistake 7 in detail - log of sum vs sum of logs:
+==================================================+
| CRITICAL DISTINCTION: |
| |
| log(\sum a^i) \neq \sum log(a^i) <- WRONG to swap |
| |
| log(\prod a^i) = \sum log(a^i) <- CORRECT |
| |
| Logs convert PRODUCTS to SUMS, not SUMS to SUMS |
+==================================================+
| NLL loss = -log p(data) |
| = -log \prod p(x^i) (for i.i.d. data) |
| = -\sum log p(x^i) <- the log-sum trick |
| |
| BUT: -log \sum p(x^i) is DIFFERENT - this is |
| the log of a mixture, not a product. |
+==================================================+
14.3 Numerical Pitfalls
Catastrophic cancellation:
When summing numbers with alternating signs or vastly different magnitudes, floating-point rounding errors accumulate: in float64, for example, $(10^{16} + 1) - 10^{16}$ evaluates to $0$, not $1$, because the $1$ is absorbed.
Solutions:
- Kahan summation: Tracks a running compensation term; reduces the worst-case error from $O(n\varepsilon)$ to $O(\varepsilon)$.
- Pairwise summation: Recursively splits the array and sums each half; reduces the error to $O(\varepsilon \log n)$. (This is what NumPy's `sum` uses.)
- Sort-then-sum: Sum smallest-magnitude terms first.
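A sketch of compensated (Kahan) summation against a plain left-to-right loop, with `math.fsum` as the exactly rounded reference; the input is a standard stress case, not taken from the text:

```python
import math

def naive_sum(xs):
    """Plain left-to-right accumulation."""
    total = 0.0
    for x in xs:
        total += x
    return total

def kahan_sum(xs):
    """Compensated (Kahan) summation: carry the lost low-order bits in c."""
    total, c = 0.0, 0.0
    for x in xs:
        y = x - c                  # apply the compensation to the incoming term
        t = total + y              # low-order bits of y may be lost here...
        c = (t - total) - y        # ...and are recovered into c for the next step
        total = t
    return total

xs = [0.1] * 10_000_000            # exact value: 10,000,000 x (the float closest to 0.1)
print(naive_sum(xs))               # drifts in the trailing digits
print(kahan_sum(xs))               # stays at (or within an ulp of) the exactly rounded result
print(math.fsum(xs))               # exactly rounded reference
```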
Overflow/underflow in products:
Multiplying many probabilities $\prod_{i=1}^{n} p_i$ with $p_i < 1$ underflows to $0$ once the product drops below the smallest positive float.
Solution: Always work in log-space: compute $\sum_i \log p_i$ instead of $\prod_i p_i$.
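A sketch of the underflow problem and the log-space fix; the per-token probabilities are randomly generated stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.3, size=1000)        # stand-in per-token probabilities for a long sequence

print(np.prod(p))                            # naive product underflows to 0.0
log_prob = np.sum(np.log(p))                 # sum of logs stays finite (a large negative number of nats)
print(log_prob)

perplexity = np.exp(-log_prob / len(p))      # perplexity from the mean negative log-probability
print(perplexity)                            # roughly the "effective branching factor"
```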
15. Exercises and Practice Problems
Exercise Set 1: Basic Manipulation
E1.1 Expand and compute:
E1.2 Write in sigma notation:
E1.3 Expand and compute:
E1.4 Prove: (telescoping)
Exercise Set 2: Algebraic Properties
E2.1 Show that where .
E2.2 Prove: .
E2.3 Verify Gauss's trick: $\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$ for $n = 100$.
E2.4 Show that .
Exercise Set 3: Summation Formulas
E3.1 Derive the geometric-series formula $\sum_{i=0}^{n} r^i = \frac{1 - r^{n+1}}{1 - r}$ by multiplying both sides by $1 - r$.
E3.2 Compute .
E3.3 Find a closed form for $\sum_{i=1}^{\infty} i\, r^i$ (for $|r| < 1$) using the derivative trick on the geometric series.
E3.4 Compute $\sum_{k=1}^{\infty} \frac{1}{k(k+1)}$ using partial fractions and telescoping.
Exercise Set 4: Double Sums
E4.1 Compute two ways (row-first and column-first).
E4.2 Change the order of summation: .
E4.3 Show that .
Exercise Set 5: Einstein Notation
E5.1 Convert to standard sigma notation: .
E5.2 Write the einsum string for computing the trace of a matrix product: $\operatorname{tr}(AB) = \sum_i \sum_j A_{ij} B_{ji}$.
E5.3 What does 'bhqd,bhkd->bhqk' compute? Identify free and dummy indices.
Exercise Set 6: AI Applications
E6.1 Show that softmax probabilities sum to 1: $\sum_i \frac{e^{z_i}}{\sum_j e^{z_j}} = 1$.
E6.2 Derive the gradient of the cross-entropy loss with respect to the logits: $\frac{\partial}{\partial z_j}\big(-\log\operatorname{softmax}(z)_y\big) = \operatorname{softmax}(z)_j - \mathbf{1}[j = y]$.
E6.3 Compute the perplexity of a model that assigns probability 0.1 to each of 5 tokens in a sequence.
E6.4 Show that the gradient of the mean loss equals the mean of the gradients: $\nabla_\theta \frac{1}{N}\sum_{i=1}^{N} \ell_i(\theta) = \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \ell_i(\theta)$.
Exercise Set 7: Convergence
E7.1 Apply the ratio test to determine convergence of .
E7.2 Determine whether converges.
E7.3 Find the radius of convergence of .
Exercise Set 8: Probability
E8.1 For a fair die, compute $\mathbb{E}[X]$ and $\mathrm{Var}(X)$ from the definitions using summation.
E8.2 Verify that $H(\mathrm{Bernoulli}(0.5)) = \ln 2 \approx 0.693$ nats (= 1 bit).
E8.3 Use the law of total expectation to find where and .
16. Why This Matters: The Summation Lens on AI
Summation and product notation are not just mathematical conveniences - they are the computational atoms from which all of AI is built.
The Impact Map
| AI Concept | Core Summation Pattern | Without Understanding It |
|---|---|---|
| Forward pass | Matrix multiply $\sum_j W_{ij} x_j$ | Cannot read model architectures |
| Backpropagation | Chain rule + sum over paths | Cannot derive or debug gradients |
| Loss functions | Cross-entropy $-\sum_i y_i \log \hat{y}_i$ | Cannot design new objectives |
| Optimisers | Geometric series in the learning rate; momentum as an exponentially weighted sum | Cannot tune learning rates |
| Attention | Weighted sum $\sum_j \alpha_{ij} v_j$ | Cannot understand transformers |
| Evaluation | Perplexity $= \exp\!\big(-\frac{1}{N}\sum_i \log p(x_i \mid x_{<i})\big)$ | Cannot interpret model quality |
| Probability | Normalisation $\sum_x p(x) = 1$, expectation $\sum_x x\, p(x)$ | Cannot reason about uncertainty |
| Regularisation | $\lambda \sum_i w_i^2$ (L2) | Cannot balance fit vs complexity |
| Batch processing | Batch mean $\frac{1}{B}\sum_{b=1}^{B} \ell_b$ | Cannot scale training |
| Information theory | Entropy $-\sum_x p(x)\log p(x)$ | Cannot measure information |
The Three Skills This Section Builds
+-----------------------------------------------------+
| Fluency with Summation Notation |
| |
| 1. READ: Parse \sum and \prod expressions in papers |
| Identify index ranges and dummy vars |
| Recognise standard patterns |
| |
| 2. MANIPULATE: Apply algebraic properties |
| Change summation order |
| Use telescoping and cancellation |
| Convert between \sum, \prod, and log |
| |
| 3. CONNECT: Map formulas to code |
| Map code to formulas |
| Estimate computational cost |
| Debug via mathematical reasoning |
+-----------------------------------------------------+
Conceptual Bridge: From Summation to Einstein Summation and Index Notation
+------------------------------------------------------------------+
| CONCEPTUAL BRIDGE |
| |
| This Section (04) Next Section (05) |
| Summation & Products Einstein Summation & Tensors |
| |
\sum_i a_ib_i -------> a_ib_i (implicit sum)
\sum_k A_ikB_kj -------> A_ikB_kj (matrix multiply)
| Scalar indices -------> Tensor indices (rank \geq 2) |
| Finite sums -------> Tensor contraction |
| Index manipulation -------> Index gymnastics |
| torch.sum() -------> torch.einsum() |
| |
| Key transition: |
| "Explicit sigma" -> "Implicit contraction" -> "Tensor networks" |
| |
| You now have: |
| OK Fluency with \sum and \prod notation |
| OK Algebraic manipulation tools |
| OK Closed-form formulas for standard sums |
| OK Convergence analysis for infinite series |
| OK Recognition of AI summation patterns |
| |
| Next you'll learn: |
| -> Einstein convention as shorthand for repeated-index sums |
| -> Tensor rank, contraction, and index notation |
| -> Covariant vs contravariant indices |
| -> Advanced einsum patterns for multi-dimensional operations |
| -> Tensor networks and their connection to neural architectures |
+------------------------------------------------------------------+
References and Further Reading
- Knuth, D. E. (1997). The Art of Computer Programming, Vol. 1: Fundamental Algorithms, 3rd ed. Addison-Wesley. - Definitive treatment of summation techniques (Chapter 1.2.3).
- Graham, R. L., Knuth, D. E., & Patashnik, O. (1994). Concrete Mathematics, 2nd ed. Addison-Wesley. - The best resource for summation methods, with hundreds of worked examples.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. - See Chapter 2 for notation conventions and Chapter 3 for probability/information theory.
- Vaswani, A., et al. (2017). "Attention Is All You Need." NeurIPS. - The transformer paper that made the attention weighted sum the central operation in modern AI.
- Einstein, A. (1916). "Die Grundlage der allgemeinen Relativitätstheorie." Annalen der Physik. - Introduction of the summation convention.
- Apostol, T. M. (1974). Mathematical Analysis, 2nd ed. Addison-Wesley. - Rigorous treatment of series convergence.
- Robertson, S. E., & Zaragoza, H. (2009). "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval. - BM25 derivation and analysis.
Navigation: <- 03 Functions and Mappings | up Mathematical Foundations | 05 Einstein Summation and Index Notation ->