"I have made a great discovery in mathematics; I have suppressed the summation sign every time that the summation must be made over an index which occurs twice." - Albert Einstein, 1916
Overview
Einstein summation convention is the notational engine of modern tensor computation. Wherever the same index appears exactly twice in a single term, summation over all values of that index is implied - no summation sign needed. This convention, born in general relativity, now powers every matrix multiply, every attention score, every gradient computation in deep learning. Once you can read index notation, you can read any transformer paper, derive any backpropagation formula, and debug any shape error - mechanically.
Prerequisites
- Summation and product notation ($\sum$, $\prod$) from Module 04
- Matrix operations: multiplication, transpose, trace
- Basic calculus: partial derivatives, chain rule
- Familiarity with NumPy arrays / PyTorch tensors
Learning Objectives
After completing this section, you will be able to:
- Read and write tensor expressions in Einstein index notation
- Identify free indices (result dimensions) and dummy indices (contracted/summed) in any expression
- Derive gradient formulas mechanically using index notation and the Kronecker delta
- Translate any tensor operation into an np.einsum / torch.einsum string
- Express transformer attention, backpropagation, and tensor decompositions in index form
- Analyse symmetry, equivariance, and contraction order using index structure
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive code: einsum operations, gradient verification, contraction order benchmarks, attention in index notation |
| exercises.ipynb | Practice problems: index identification, Kronecker delta manipulation, gradient derivation, einsum implementation |
Table of Contents
- 1. Intuition
- 2. Formal Definitions
- 3. Basic Operations in Index Notation
- 4. Contraction Operations
- 5. The Kronecker Delta and Levi-Civita Symbol
- 6. Index Manipulation Rules
- 7. Einstein Notation for Gradients and Backpropagation
- 8. Batched Operations and Batch Indices
- 9. Einsum in Practice
- 10. Tensor Products and Decompositions
- 11. Index Notation for Common AI Architectures
- 12. Symmetries, Invariances, and Equivariances
- 13. Advanced Index Manipulations
- 14. Common Mistakes
- 15. Exercises
- 16. Why This Matters for AI
- 17. Conceptual Bridge
1. Intuition
1.1 What Is Einstein Summation?
Einstein summation convention is a notational shorthand: whenever the same index appears exactly twice in a single term, summation over all values of that index is implied automatically. No symbol needed - the repetition of an index is its own summation instruction.
Consider the matrix-vector product. In standard notation:
$$y_i = \sum_{j=1}^{n} A_{ij} x_j$$
In Einstein notation, the same expression is simply:
$$y_i = A_{ij} x_j$$
The index $j$ appears twice (once on $A$, once on $x$), so summation over $j$ is implied. The index $i$ appears once - it is a free index, meaning the equation holds for each value of $i$.
The result: tensor equations that would fill pages compress to single lines. The structure of the operation is immediately visible in the index pattern.
For AI: every matmul, every dot product, every attention score, every gradient computation is a sum over repeated indices. Einstein notation makes this structure explicit and manipulable.
1.2 Why This Notation Transforms Thinking
Without index notation, $C = AB$ means "matrix multiply $A$ and $B$" - the mechanics are hidden. With index notation:
$$C_{ij} = A_{ik} B_{kj}$$
The repeated $k$ tells you exactly what is summed, over what range, and how inputs connect to outputs.
Pattern recognition from indices alone:
| Expression | Repeated Index | Free Indices | Operation |
|---|---|---|---|
| $A_{ik} B_{kj}$ | $k$ | $i, j$ | Matrix multiply |
| $A_{ij} B_{ij}$ | $i, j$ | none | Frobenius inner product (scalar) |
| $u_i v_j$ | none | $i, j$ | Outer product |
| $u_i v_i$ | $i$ | none | Dot product (scalar) |
Debugging power: If an equation has a free index on one side but not the other, it is wrong. Index balance checking catches errors mechanically - before running any code.
Generalisation: Every operation on any-order tensor is expressed uniformly. There is no special notation for matrices vs vectors vs 4D tensors - just indices.
1.3 The Core Idea in One Sentence
Two rules govern everything:
- Free index: appears once -> result has that index -> no summation
- Dummy index: appears twice -> summed over -> disappears from result
That's it. Everything else follows from these two rules.
THE TWO RULES
===================================================================
Expression: C_ij = A_ik B_kj
Index i: appears ONCE -> FREE -> C has index i (rows)
Index j: appears ONCE -> FREE -> C has index j (cols)
Index k: appears TWICE -> DUMMY -> summed over -> gone from result
Result shape: C is a matrix (has i,j) OK
Computation: C_ij = Σ_k A_ik B_kj OK
Expression: s = u_i v_i
Index i: appears TWICE -> DUMMY -> summed over -> gone from result
Result shape: s is a scalar (no free indices) OK
Computation: s = Σ_i u_i v_i = dot product OK
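A minimal NumPy check of the two rules (array shapes and names here are illustrative):
import numpy as np
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
u, v = np.random.randn(6), np.random.randn(6)
# C_ij = A_ik B_kj: k is repeated (dummy) -> summed; i, j are free -> kept
C = np.einsum('ik,kj->ij', A, B)
assert C.shape == (3, 5) and np.allclose(C, A @ B)
# s = u_i v_i: i is repeated -> summed; no free indices -> scalar
s = np.einsum('i,i->', u, v)
assert np.isclose(s, u @ v)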
1.4 Where This Appears in AI
Every core operation in deep learning is an Einstein summation:
| Operation | Index Notation | Dummy | Free | Description |
|---|---|---|---|---|
| Matrix multiply | $y_i = W_{ij} x_j$ | $j$ | $i$ | Forward pass linear layer |
| Dot product | $s = u_i v_i$ | $i$ | - | Scalar score |
| Attention score | $S_{ij} = Q_{ia} K_{ja}$ | $a$ | $i, j$ | Query-key similarity |
| Attention output | $O_{ia} = \alpha_{ij} V_{ja}$ | $j$ | $i, a$ | Weighted value sum |
| Gradient of matmul | $\partial L/\partial A_{ik} = (\partial L/\partial C)_{ij} B_{kj}$ | $j$ | $i, k$ | Backprop through $C = AB$ |
| Layer norm gradient | $\partial L/\partial x_i$ involves sums over $j$ | $j$ | $i$ | Non-local gradient |
| Multi-head attention | $O_{bhia} = \alpha_{bhij} V_{bhja}$ | $j$ | $b, h, i, a$ | Across heads |
1.5 Historical and Modern Context
| Year | Development | Significance |
|---|---|---|
| 1900 | Ricci & Levi-Civita: absolute differential calculus | Index notation precursor |
| 1916 | Einstein: general relativity paper | Introduced implicit summation convention |
| 1920s-50s | Standard in physics | GR, electromagnetism, continuum mechanics |
| 1960s-90s | Penrose: abstract index notation | Diagrammatic tensor calculus |
| 2006 | NumPy numpy.einsum | Einstein convention for array computing |
| 2016 | PyTorch torch.einsum | GPU-accelerated einsum |
| 2017 | Vaswani et al.: "Attention Is All You Need" | Attention expressed as index contractions |
| 2018 | opt_einsum library | Optimal contraction order for multi-tensor einsum |
| 2022 | Dao et al.: FlashAttention | Attention kernels designed from index structure |
| 2024 | DeepSeek: MLA (Multi-head Latent Attention) | Low-rank contraction |
| 2025-26 | JAX einsum + JIT; einsum as ML lingua franca | Seamless compilation of arbitrary contractions |
1.6 The Hierarchy of Tensor Operations
TENSOR RANK HIERARCHY
===================================================================
Scalar (rank 0): s - no indices
Vector (rank 1): v_i - one index
Matrix (rank 2): M_ij - two indices
3-Tensor (rank 3): T_ijk - three indices
n-Tensor (rank n): T_{i_1i_2...i_n} - n indices
RANK-CHANGING OPERATIONS:
+-------------------------------------------------------------+
| Contract one pair of indices -> reduce rank by 2 |
| Outer product -> rank = sum of input ranks |
| Transpose -> relabel indices (same rank) |
| Trace -> contract diagonal (rank -2) |
+-------------------------------------------------------------+
EXAMPLES:
dot product: rank 1 + rank 1 - 2 = rank 0 (scalar)
matrix-vector: rank 2 + rank 1 - 2 = rank 1 (vector)
matrix-matrix: rank 2 + rank 2 - 2 = rank 2 (matrix)
trace: rank 2 - 2 = rank 0 (scalar)
outer product: rank 1 + rank 1 = rank 2 (matrix)
2. Formal Definitions
2.1 Tensors as Indexed Arrays
A tensor of rank $n$ over index sets $I_1, \dots, I_n$ is a function:
$$T : I_1 \times I_2 \times \cdots \times I_n \to \mathbb{R}$$
It assigns a real number to each tuple of indices.
| Rank | Name | Notation | Example | Shape |
|---|---|---|---|---|
| 0 | Scalar | $s$ | Loss value, learning rate | () |
| 1 | Vector | $v_i$ | Embedding, gradient | $(n)$ |
| 2 | Matrix | $M_{ij}$ | Weight matrix, attention scores | $(m, n)$ |
| 3 | 3-tensor | $T_{ijk}$ | Batched matrices, projections | $(B, m, n)$ |
| 4 | 4-tensor | $T_{ijkl}$ | Multi-head attention scores | $(B, H, T, T)$ |
The shape is the tuple of index set sizes $(|I_1|, \dots, |I_n|)$. It determines memory layout and operation validity.
2.2 The Einstein Summation Convention - Precise Statement
Convention: In any expression, if an index label appears exactly twice in a single term, it is a dummy index and is implicitly summed over all values in its range.
Formally, for dummy index $k$ over range $\{1, \dots, n\}$:
$$A_{ik} B_{kj} \;\equiv\; \sum_{k=1}^{n} A_{ik} B_{kj}$$
An index appearing once is a free index: the equation holds for each value of that index independently. The range of each index is determined by the tensor dimensions at that position.
Example walkthrough: $C_{ij} = A_{ik} B_{kj}$
- $k$ appears twice -> dummy -> implicit sum: $\sum_k A_{ik} B_{kj}$
- $i$ appears once -> free -> equation holds for all $i$
- $j$ appears once -> free -> equation holds for all $j$
- Result: $C_{ij} = \sum_k A_{ik} B_{kj}$, same as standard matrix multiply
2.3 Free vs Dummy Indices - Precise Definitions
Free index - appears exactly once in a term:
- Result expression carries that index
- The equation holds for every value of the free index independently
- Example: $C_{ij} = A_{ik} B_{kj}$ - indices $i$ and $j$ are free; equation holds for all $(i, j)$
Dummy index (bound/contracted index) - appears exactly twice in a single term:
- Implicitly summed over its entire range
- Does not appear in the result
- Example: $y_i = A_{ij} x_j$ - index $j$ is dummy; summed from 1 to $n$; gone from result
Invalid - index appearing three or more times:
- $a_i b_i c_i$ is ambiguous: which two occurrences to contract?
- Einstein convention forbids this; write an explicit $\sum$ if needed
- Example: $A_{ij} B_{jk} v_j$ has index $j$ appearing three times - not valid
Consistency rule - free indices must be the same on both sides of an equation: $y_i = A_{ij} x_j$ is valid; $y_i = A_{jk} x_k$ is not (the free index $j$ on the right has no partner on the left).
This mechanical check catches dimension errors before code runs.
2.4 Index Ranges
Each index position has a defined range from the tensor's shape:
- In $A_{ij}$ where $A \in \mathbb{R}^{m \times n}$: $i \in \{1, \dots, m\}$, $j \in \{1, \dots, n\}$
- Contracted indices must have matching ranges: $A_{ik} B_{kj}$ is valid only if $A$'s second dimension equals $B$'s first dimension
- This is exactly the matrix multiply shape compatibility condition: $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$; $k$ ranges over $\{1, \dots, n\}$
Shape inference - from index structure alone, the result shape is determined:
- $C_{ij} = A_{ik} B_{kj}$: $i$ ranges over $m$, $j$ ranges over $p$ -> result shape $(m, p)$
- $s = A_{ij} B_{ij}$: no free indices -> result shape () - scalar
- $v_i = T_{ijk} M_{jk}$: $j$ and $k$ are dummy; $i$ is free -> result shape $(n_i)$ - vector
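A quick NumPy sketch of shape inference from the index strings alone (the array sizes below are illustrative):
import numpy as np
A = np.random.randn(3, 4)     # A_ik
B = np.random.randn(4, 5)     # B_kj
P = np.random.randn(3, 4)     # P_ij, same shape as A
T = np.random.randn(6, 4, 5)  # T_ijk
M = np.random.randn(4, 5)     # M_jk
print(np.einsum('ik,kj->ij', A, B).shape)   # (3, 5)  - free i, j survive
print(np.einsum('ij,ij->', A, P).shape)     # ()      - no free indices: scalar
print(np.einsum('ijk,jk->i', T, M).shape)   # (6,)    - j, k contracted; i free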
2.5 Covariant and Contravariant Indices
In differential geometry and general relativity, upper (contravariant) and lower (covariant) indices are distinguished:
- Upper index $v^i$: contravariant vector; transforms like coordinate differences
- Lower index $w_i$: covariant vector; transforms like gradient components
- Einstein convention in physics: sum over one upper and one lower index only: $v^i w_i$
- Metric tensor $g_{ij}$: raises and lowers indices; $v_i = g_{ij} v^j$
For machine learning: we almost always ignore the up/down distinction. All indices are effectively lower. No metric tensor is needed because we work in flat Euclidean space where $g_{ij} = \delta_{ij}$. When reading physics literature, understand the convention; in ML, treat all indices as lower.
COVARIANT vs CONTRAVARIANT - PHYSICS vs ML
===================================================================
PHYSICS (General Relativity):
Upper index: v^i (contravariant - transforms one way)
Lower index: w_i (covariant - transforms the other)
Contraction: v^i w_i (one up, one down -> valid)
Metric: v_i = g_ij v^j (raises/lowers indices)
MACHINE LEARNING:
All indices: v_i, w_j (no up/down distinction)
Contraction: v_i w_i (two of same name -> valid)
Metric: g_ij = \delta_ij (identity - flat space)
-> In ML, upper/lower distinction collapses. All that matters:
- Same index twice = sum over it
- Same index once = free dimension
3. Basic Operations in Index Notation
3.1 Scalar Multiplication
Scalar times vector: $w_i = c\, v_i$
Scalar times matrix: $B_{ij} = c\, A_{ij}$
The scalar $c$ has no indices; it multiplies each component. No summation occurs - there is no repeated index.
3.2 Vector Addition
$w_i = u_i + v_i$
Same free index $i$ on both terms. No repeated index, no summation. Purely elementwise.
3.3 Inner Product (Dot Product)
$s = u_i v_i$
Both $u$ and $v$ carry index $i$. Since $i$ appears twice, it is a dummy index - summed over. The result is a scalar (no free indices).
Equivalently: $s = \sum_i u_i v_i = \mathbf{u} \cdot \mathbf{v}$
Sanity check: result has no free indices -> result must be scalar OK
3.4 Outer Product
$M_{ij} = u_i v_j$
No repeated index - $i$ appears once, $j$ appears once. Both are free. No summation. The result is $M \in \mathbb{R}^{m \times n}$ where $u \in \mathbb{R}^m$, $v \in \mathbb{R}^n$.
Equivalently: $M = \mathbf{u} \mathbf{v}^\top$
Each entry is $M_{ij} = u_i v_j$. This always produces a rank-1 matrix.
Key distinction from dot product: $u_i v_i$ (same index -> sum -> scalar) vs $u_i v_j$ (different indices -> no sum -> matrix).
3.5 Matrix-Vector Multiply
$y_i = A_{ij} x_j$
- $j$ is dummy (summed): connects $A$'s columns to $x$'s entries
- $i$ is free (result index): selects row of $A$ and row of result
- $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$; $j$ ranges over $\{1, \dots, n\}$ OK
This is the standard matrix-vector product $y = Ax$.
3.6 Matrix-Matrix Multiply
$C_{ij} = A_{ik} B_{kj}$
- $k$ is dummy: summed over; connects $A$'s columns to $B$'s rows
- $i, j$ are free: $i$ selects a row of $A$, $j$ selects a column of $B$
- $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, $C \in \mathbb{R}^{m \times p}$; $k$ ranges over $\{1, \dots, n\}$ OK
Equivalently: $C_{ij} = \sum_k A_{ik} B_{kj}$, i.e. $C = AB$.
3.7 Transpose
$(A^\top)_{ij} = A_{ji}$
Simply swap index labels. No summation; pure relabelling. Note that $A_{ij}$ and $A_{ji}$ are different expressions - different index order means transposed access.
3.8 Trace
$\operatorname{tr}(A) = A_{ii}$
The same index appears twice on the same tensor - sum over diagonal elements. Result is scalar; no free index. Valid only for square matrices (both index ranges must be equal).
Trace of product: $\operatorname{tr}(AB) = A_{ik} B_{ki}$
Both pairs ($i$ and $k$) are dummy -> scalar. This equals $\operatorname{tr}(BA)$ - rename dummy indices to verify: $A_{ik} B_{ki} = B_{ki} A_{ik}$ (relabel $i \leftrightarrow k$) OK
3.9 Frobenius Inner Product
$\langle A, B \rangle_F = A_{ij} B_{ij}$
Both $i$ and $j$ are dummy - double summation. Result is scalar. Equivalently: $\langle A, B \rangle_F = \sum_i \sum_j A_{ij} B_{ij} = \operatorname{tr}(A^\top B)$
Frobenius norm: $\|A\|_F = \sqrt{A_{ij} A_{ij}}$
3.10 Element-wise (Hadamard) Product
$C_{ij} = A_{ij} B_{ij}$ (no sum)
Both $A$ and $B$ share indices $i, j$. But here $i, j$ are free - both appear in the result. No summation; elementwise multiplication.
Critical distinction: The expression $A_{ij} B_{ij}$ looks the same as the Frobenius inner product. The difference is whether $i, j$ are free or dummy - which depends on whether the result carries those indices:
| Einsum string | $i, j$ status | Operation | Result |
|---|---|---|---|
| "ij,ij->ij" | free | Elementwise multiply | Matrix |
| "ij,ij->" | dummy | Frobenius inner product | Scalar |
In standard mathematical notation, context determines interpretation. In einsum, the output string makes it explicit.
COMPLETE BASIC OPERATIONS REFERENCE
===================================================================
Operation Index Form Einsum Free Dummy Result
----------------- -------------- --------- ---- ----- ------
Scalar multiply s*v_i scalar i - vector
Vector add u_i + v_i - i - vector
Dot product u_i v_i "i,i->" - i scalar
Outer product u_i v_j "i,j->ij" i,j - matrix
Matrix-vector A_ij v_j "ij,j->i" i j vector
Matrix multiply A_ik B_kj "ik,kj->ij" i,j k matrix
Transpose A_ji "ij->ji" i,j - matrix
Trace A_ii "ii->" - i scalar
Frobenius inner A_ij B_ij "ij,ij->" - i,j scalar
Hadamard product A_ij B_ij "ij,ij->ij" i,j - matrix
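The reference table above can be checked line by line against standard NumPy operations; a small sketch with illustrative shapes:
import numpy as np
u, v = np.random.randn(4), np.random.randn(4)
A, B = np.random.randn(4, 4), np.random.randn(4, 4)
assert np.isclose(np.einsum('i,i->', u, v), u @ v)              # dot product
assert np.allclose(np.einsum('i,j->ij', u, v), np.outer(u, v))  # outer product
assert np.allclose(np.einsum('ik,kj->ij', A, B), A @ B)         # matrix multiply
assert np.isclose(np.einsum('ii->', A), np.trace(A))            # trace
assert np.isclose(np.einsum('ij,ij->', A, B), np.sum(A * B))    # Frobenius inner product
assert np.allclose(np.einsum('ij,ij->ij', A, B), A * B)         # Hadamard product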
4. Contraction Operations
4.1 What Is Contraction?
Contraction is the operation of summing over one or more pairs of repeated indices, reducing tensor rank by 2 for each contracted pair.
Given input tensors of ranks $r_1$ and $r_2$, contracting over $p$ pairs yields a result of rank:
$$r_{\text{result}} = r_1 + r_2 - 2p$$
| Inputs | Ranks | Contracted pairs | Result rank | Operation |
|---|---|---|---|---|
| scalar × scalar | 0 + 0 | 0 | 0 | Scalar multiply |
| vector · vector | 1 + 1 | 1 | 0 | Dot product |
| matrix × vector | 2 + 1 | 1 | 1 | Linear map |
| matrix × matrix | 2 + 2 | 1 | 2 | Matrix multiply |
| Frobenius inner product | 2 + 2 | 2 | 0 | Scalar |
| trace | 2 | 1 (self) | 0 | Diagonal sum |
4.2 Single Tensor Contraction (Trace-Like)
Contracting two indices of the same tensor generalises the matrix trace:
$$S_i = T_{ijj} = \sum_j T_{ijj}$$
Result: rank-1 tensor (vector). $i$ is free; $j$ is dummy.
This is a partial trace - tracing over a subset of indices. It generalises the matrix trace to higher-order tensors.
AI example: In multi-head attention, computing per-head statistics involves contracting over query/key positions while keeping the head dimension free.
4.3 Two-Tensor Contraction
General form: contract tensors $A$ (rank $r_A$) and $B$ (rank $r_B$) over $p$ shared indices -> result rank $r_A + r_B - 2p$.
| Contraction | Index form | Operation |
|---|---|---|
| 1 pair | $C_{ij} = A_{ik} B_{kj}$ | Matrix multiply |
| 2 pairs | $s = A_{ij} B_{ij}$ | Frobenius inner product |
| 2 pairs (mixed rank) | $v_i = T_{ijk} M_{jk}$ | Vector result |
4.4 Contraction Order and Computational Cost
For the expression $ABC$ (three matrices, $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, $C \in \mathbb{R}^{p \times q}$), there are two possible contraction orders:
(AB)C - left to right:
- Compute $AB$: cost $m \cdot n \cdot p$ multiplications
- Compute $(AB)C$: cost $m \cdot p \cdot q$ multiplications
A(BC) - right to left:
- Compute $BC$: cost $n \cdot p \cdot q$ multiplications
- Compute $A(BC)$: cost $m \cdot n \cdot q$ multiplications
For rectangular tensors, different orders can differ by orders of magnitude.
Example: $A, B, C$ all $1000 \times 1000$ (square)
- $(AB)C$: $2 \times 10^9$ multiplications
- $A(BC)$: $2 \times 10^9$ multiplications - same cost
But: $A \in \mathbb{R}^{1000 \times 4}$, $B \in \mathbb{R}^{4 \times 1000}$, $C \in \mathbb{R}^{1000 \times 4}$
- $A(BC)$: $4 \cdot 1000 \cdot 4 + 1000 \cdot 4 \cdot 4 = 3.2 \times 10^4$ multiplications
- $(AB)C$: $1000 \cdot 4 \cdot 1000 + 1000 \cdot 1000 \cdot 4 = 8 \times 10^6$ multiplications <- 250× slower!
Finding the optimal contraction order is NP-hard in general, but greedy approaches work well in practice. The opt_einsum library finds near-optimal contraction orders for multi-tensor einsum expressions.
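A quick timing sketch of the two orders, using the rectangular dimensions from the example above (the sizes are illustrative; np.einsum can also be asked to pick a cheap path itself):
import numpy as np
from time import perf_counter
A = np.random.randn(1000, 4)
B = np.random.randn(4, 1000)
C = np.random.randn(1000, 4)
t0 = perf_counter(); left = (A @ B) @ C; t1 = perf_counter()
right = A @ (B @ C); t2 = perf_counter()
print(f"(AB)C: {t1 - t0:.4f}s   A(BC): {t2 - t1:.4f}s")
# Let einsum choose the contraction path automatically:
D = np.einsum('ik,kl,lj->ij', A, B, C, optimize='optimal')
assert np.allclose(D, left)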
4.5 The Generalised Contraction Formula
For tensors and , contracting over shared index set :
- Free indices: union of all non-contracted indices from both tensors
- Result shape: determined by free index ranges
- This is the most general single-step operation expressible in
einsum
CONTRACTION VISUALISED
===================================================================
Matrix Multiply: C_ij = A_ik B_kj
A (rank 2) B (rank 2) C (rank 2)
+---+ +---+ +---+
| |--k------k-| | = | |
| | | | | |
+---+ +---+ +---+
<-> i <-> j <-> i <-> j
(free) (free) (free indices)
k is contracted (summed) -> disappears from result
i, j are free -> appear in result
Frobenius Inner Product: s = A_ij B_ij
A (rank 2) B (rank 2) s (rank 0)
+---+ +---+
| |--i------i-| | = * (scalar)
| |--j------j-| |
+---+ +---+
Both i and j contracted -> no free indices -> scalar result
5. The Kronecker Delta and Levi-Civita Symbol
5.1 Kronecker Delta
The Kronecker delta is the identity matrix expressed in index notation:
$$\delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$$
As a matrix: $\delta = I$ - the identity matrix.
The key property - index substitution (index killing): $\delta_{ij} v_j = v_i$
Proof: $\delta_{ij} v_j = \sum_j \delta_{ij} v_j = v_i$ (only the term with $j = i$ survives the sum).
This is the most important property in all of index notation. It says: contracting any tensor with $\delta_{ij}$ replaces index $j$ with index $i$ (or vice versa). The delta "kills" the summed index and substitutes the other.
Other properties:
- Trace of identity: $\delta_{ii} = n$ (dimension of the index space)
- Acts as identity: $\delta_{ik} A_{kj} = A_{ij}$ (same as $IA = A$ in matrix notation)
- Idempotent: $\delta_{ik} \delta_{kj} = \delta_{ij}$ (identity times identity = identity)
5.2 Kronecker Delta Identities
| Identity | Proof | Matrix Equivalent |
|---|---|---|
| $\delta_{ij} v_j = v_i$ | Only $j = i$ survives | $Iv = v$ |
| $\delta_{ik} A_{kj} = A_{ij}$ | Index substitution: $k \to i$ | $IA = A$ |
| $\delta_{ik} \delta_{kj} = \delta_{ij}$ | Index substitution: $k \to j$ | $I \cdot I = I$ |
| $\delta_{ij} A_{ij} = A_{ii} = \operatorname{tr}(A)$ | Substitution reduces to trace | $\operatorname{tr}(IA) = \operatorname{tr}(A)$ |
AI use: The Kronecker delta appears in every gradient derivation. When you differentiate $T_{ab}$ with respect to $T_{cd}$, the derivative is $\partial T_{ab} / \partial T_{cd} = \delta_{ac} \delta_{bd}$ - the delta selects which element you're differentiating. This is the engine of backpropagation derivations.
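A small NumPy check of the substitution property and of the delta acting as a gradient "selector" (np.eye plays the role of $\delta$; shapes are illustrative):
import numpy as np
n = 5
delta = np.eye(n)
v, A = np.random.randn(n), np.random.randn(n, n)
assert np.allclose(np.einsum('ij,j->i', delta, v), v)     # delta_ij v_j = v_i
assert np.allclose(np.einsum('ik,kj->ij', delta, A), A)   # delta_ik A_kj = A_ij
assert np.isclose(np.einsum('ii->', delta), n)            # delta_ii = n
# Selector example: d(Wx)_i / dW_kl = delta_ik x_l
W, x = np.random.randn(3, 4), np.random.randn(4)
jac = np.einsum('ik,l->ikl', np.eye(3), x)                # shape (3, 3, 4)
assert np.allclose(np.einsum('ikl,kl->i', jac, W), W @ x)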
5.3 Levi-Civita Symbol (Permutation Symbol)
The Levi-Civita symbol $\epsilon_{ijk}$ is a completely antisymmetric tensor. In 3D:
$$\epsilon_{ijk} = \begin{cases} +1 & (i,j,k) \text{ an even permutation of } (1,2,3) \\ -1 & (i,j,k) \text{ an odd permutation of } (1,2,3) \\ 0 & \text{any index repeated} \end{cases}$$
Even permutations (cyclic): $\epsilon_{123} = \epsilon_{231} = \epsilon_{312} = +1$
Odd permutations (one swap): $\epsilon_{132} = \epsilon_{213} = \epsilon_{321} = -1$
The symbol is non-zero only when all three indices are distinct. In $n$ dimensions, $\epsilon_{i_1 \cdots i_n}$ generalises to $n!$ non-zero values out of $n^n$ possible.
5.4 Levi-Civita Applications
Cross product: $(\mathbf{u} \times \mathbf{v})_i = \epsilon_{ijk} u_j v_k$
The index structure encodes the cross product formula: $i$ is free (result component), $j$ and $k$ are dummy (summed over).
Determinant (3×3): $\det(A) = \epsilon_{ijk} A_{1i} A_{2j} A_{3k}$
All of $i, j, k$ are dummy -> result is scalar. The Levi-Civita selects only the six permutations with correct signs.
Key identity - $\epsilon$-$\delta$ contraction: $\epsilon_{ijk} \epsilon_{ilm} = \delta_{jl} \delta_{km} - \delta_{jm} \delta_{kl}$
This allows simplifying double Levi-Civita contractions into Kronecker deltas - essential for proofs involving cross products and curl operations.
AI context: The Levi-Civita symbol is less directly used than the Kronecker delta in ML. It appears in:
- Determinant computations for normalising flows (the Jacobian log-determinant $\log|\det J|$)
- Theoretical analyses of rotation equivariance
- Differential geometry of neural manifolds
5.5 Generalised Kronecker Delta
The generalised Kronecker delta extends the concept to multi-index antisymmetry:
$$\delta^{i_1 \cdots i_k}_{j_1 \cdots j_k} = \det\!\big(\delta^{i_a}_{j_b}\big)_{a,b=1,\dots,k}$$
This expresses antisymmetrisation and symmetrisation operators, and connects to the Levi-Civita symbol: in $n$ dimensions, $\delta^{i_1 \cdots i_n}_{j_1 \cdots j_n} = \epsilon^{i_1 \cdots i_n} \epsilon_{j_1 \cdots j_n}$.
6. Index Manipulation Rules
6.1 Renaming Dummy Indices
Dummy indices are bound variables - any name works: $A_{ik} B_{kj} = A_{im} B_{mj}$. This is alpha-equivalence, just like renaming bound variables in lambda calculus.
Rules for renaming:
- Rename consistently throughout the entire term: both occurrences change together
- Don't rename to an index already in use (avoid clashes)
- Free indices cannot be renamed within a term without renaming on both sides of the equation
Use renaming to:
- Avoid index clashes when combining expressions
- Make contraction structure explicit
- Prepare expressions for interchange or simplification
6.2 Symmetry and Antisymmetry
Symmetric tensor: $S_{ij} = S_{ji}$ (unchanged under index swap)
- Examples: metric tensor $g_{ij}$; Hessian $H_{ij} = \partial_i \partial_j f$; covariance matrix $\Sigma_{ij}$; Gram matrix $G_{ij}$
Antisymmetric tensor: $A_{ij} = -A_{ji}$ (sign flip under index swap)
- Examples: Levi-Civita $\epsilon_{ijk}$; cross product matrix
- Consequence: $A_{ii} = -A_{ii} = 0$ - diagonal elements are always zero
Critical identity - antisymmetric contracted with symmetric vanishes:
If $A_{ij}$ is antisymmetric and $S_{ij}$ is symmetric: $A_{ij} S_{ij} = 0$
Proof: $A_{ij} S_{ij} = -A_{ji} S_{ji} = -A_{ij} S_{ij}$ (relabel the dummy indices $i \leftrightarrow j$), so the contraction equals its own negative and must be zero.
AI relevance: Attention score matrices are generally not symmetric (queries ≠ keys). Weight matrices are not symmetric. Understanding symmetry properties helps simplify gradient expressions.
6.3 Symmetrisation and Antisymmetrisation
Any rank-2 tensor decomposes uniquely into symmetric and antisymmetric parts:
$$T_{ij} = T_{(ij)} + T_{[ij]}$$
where:
Symmetric part: $T_{(ij)} = \tfrac{1}{2}(T_{ij} + T_{ji})$
Antisymmetric part: $T_{[ij]} = \tfrac{1}{2}(T_{ij} - T_{ji})$
For higher-rank tensors, symmetrisation/antisymmetrisation extends to any subset of indices by averaging over all permutations of those indices (with alternating signs in the antisymmetric case).
AI application: Decomposing the attention weight gradient into symmetric and antisymmetric components can reveal which parts drive updates to query vs key projections independently.
6.4 Index Raising and Lowering (Physics)
In differential geometry, the metric tensor $g_{ij}$ converts between covariant and contravariant indices: $v_i = g_{ij} v^j$ and $v^i = g^{ij} v_j$.
In flat Euclidean space: $g_{ij} = \delta_{ij}$ -> raising and lowering are trivial (no change).
ML context: All ML computations happen in Euclidean space (or spaces treated as Euclidean). The metric is identity; the upper/lower distinction collapses. When reading physics papers, mentally replace $g_{ij} \to \delta_{ij}$.
Exception - natural gradient: the Fisher information matrix $F_{ij}$ acts as a metric on parameter space. In natural gradient descent, $F^{-1}$ plays the role of $g^{ij}$, and the "covariant gradient" differs from the ordinary gradient. This is one place where the physics convention is genuinely useful in ML.
6.5 Partial Derivatives in Index Notation
Partial derivatives get their own index notation, making gradient calculations compact:
| Notation | Standard form | Einstein form | Result |
|---|---|---|---|
| Gradient | $\nabla f$ | $\partial_i f$ | Vector |
| Jacobian | $\partial f_i / \partial x_j$ | $J_{ij} = \partial_j f_i$ | Matrix |
| Hessian | $\partial^2 f / \partial x_i \partial x_j$ | $H_{ij} = \partial_i \partial_j f$ | Matrix (symmetric) |
| Divergence | $\nabla \cdot \mathbf{v}$ | $\partial_i v_i$ | Scalar ($i$ is dummy) |
| Laplacian | $\nabla^2 f$ | $\partial_i \partial_i f$ | Scalar ($i$ is dummy) |
The divergence has $i$ appearing twice -> summed -> scalar. The Laplacian has the same structure - a "trace" of second derivatives.
AI connection: Gradient computation in index notation is the foundation of Section 7. The Jacobian-vector product $J_{ij} v_j$ is exactly what automatic differentiation computes in forward mode. The vector-Jacobian product $u_i J_{ij}$ is reverse mode (backpropagation).
7. Einstein Notation for Gradients and Backpropagation
Index notation turns gradient derivations from pattern matching into mechanical computation. The core tool is the Kronecker delta - the derivative of a tensor component with respect to itself.
7.1 Gradient of Linear Form
$f = a_i x_i$ (dot product $f = \mathbf{a}^\top \mathbf{x}$)
$$\frac{\partial f}{\partial x_j} = a_i \frac{\partial x_i}{\partial x_j} = a_i \delta_{ij} = a_j$$
The $\delta_{ij}$ kills the sum over $i$, substituting $j$ for $i$. Result: $\nabla_x f = \mathbf{a}$ OK
7.2 Gradient of Quadratic Form
$f = x_i A_{ij} x_j$ (quadratic form $f = \mathbf{x}^\top A \mathbf{x}$)
Apply the product rule (both the left and right $x$ depend on $x_k$):
$$\frac{\partial f}{\partial x_k} = \delta_{ik} A_{ij} x_j + x_i A_{ij} \delta_{jk} = A_{kj} x_j + x_i A_{ik} = (A + A^\top)_{kj} x_j$$
When $A$ is symmetric ($A_{ij} = A_{ji}$): $\partial f / \partial x_k = 2 A_{kj} x_j$, i.e. $\nabla f = 2A\mathbf{x}$.
7.3 Gradient of Matrix Multiply (A side)
Loss $L$; output $C_{ij} = A_{ik} B_{kj}$ where $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$; $C \in \mathbb{R}^{m \times p}$.
Given upstream gradient $G_{ij} = \partial L / \partial C_{ij}$, find $\partial L / \partial A_{lm}$:
$$\frac{\partial L}{\partial A_{lm}} = \frac{\partial L}{\partial C_{ij}} \frac{\partial C_{ij}}{\partial A_{lm}}$$
Compute the inner derivative:
$$\frac{\partial C_{ij}}{\partial A_{lm}} = \frac{\partial (A_{ik} B_{kj})}{\partial A_{lm}} = \delta_{il} \delta_{km} B_{kj} = \delta_{il} B_{mj}$$
Substitute back:
$$\frac{\partial L}{\partial A_{lm}} = G_{ij}\, \delta_{il} B_{mj} = G_{lj} B_{mj}$$
In index notation: $\partial L / \partial A_{lm} = G_{lj} B_{mj}$, i.e. in matrix form $\partial L / \partial A = G B^\top$.
This is the standard backpropagation formula - derived mechanically using only $\delta$ substitution.
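A numerical sketch checking this formula against finite differences (shapes and the loss $L = \sum_{ij} G_{ij} C_{ij}$ are chosen only for the check):
import numpy as np
m, n, p = 3, 4, 5
A, B = np.random.randn(m, n), np.random.randn(n, p)
G = np.random.randn(m, p)                  # upstream gradient dL/dC
dA = np.einsum('lj,mj->lm', G, B)          # analytical: dL/dA = G B^T
eps = 1e-6
dA_num = np.zeros_like(A)
for l in range(m):
    for k in range(n):
        Ap, Am = A.copy(), A.copy()
        Ap[l, k] += eps; Am[l, k] -= eps
        dA_num[l, k] = (np.sum(G * (Ap @ B)) - np.sum(G * (Am @ B))) / (2 * eps)
print(np.max(np.abs(dA - dA_num)))         # ~1e-9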
7.4 Gradient of Matrix Multiply (B side)
By the same mechanical steps, with $C_{ij} = A_{ik} B_{kj}$ and upstream gradient $G_{ij}$:
$$\frac{\partial L}{\partial B_{mn}} = G_{in} A_{im}, \qquad \frac{\partial L}{\partial B} = A^\top G$$
7.5 Gradient of Dot Product
For $f = u_i v_i$: $\partial f / \partial u_j = v_i \delta_{ij} = v_j$ and, symmetrically, $\partial f / \partial v_j = u_j$.
7.6 Gradient of Trace
For $f = \operatorname{tr}(AB) = A_{ik} B_{ki}$:
$$\frac{\partial f}{\partial A_{lm}} = \delta_{il} \delta_{km} B_{ki} = B_{ml} = (B^\top)_{lm}$$
Similarly:
- $\partial \operatorname{tr}(A) / \partial A_{lm} = \delta_{lm}$
- $\partial \operatorname{tr}(AA) / \partial A_{lm} = 2 A_{ml}$ (since $A$ appears twice, both factors contribute)
7.7 Gradient of Softmax in Index Notation
Softmax: $p_i = \dfrac{e^{z_i}}{\sum_k e^{z_k}}$
The Jacobian is derived via the quotient rule:
$$\frac{\partial p_i}{\partial z_j} = \frac{e^{z_i} \delta_{ij} \sum_k e^{z_k} - e^{z_i} e^{z_j}}{\left(\sum_k e^{z_k}\right)^2} = p_i \delta_{ij} - p_i p_j = p_i(\delta_{ij} - p_j)$$
Gradient of loss through softmax: $\dfrac{\partial L}{\partial z_j} = \dfrac{\partial L}{\partial p_i}\, p_i(\delta_{ij} - p_j)$
For cross-entropy loss $L = -\log p_t$ with true class $t$: $\partial L / \partial p_i = -1/p_t$ for $i = t$, zero otherwise. This simplifies to:
$$\frac{\partial L}{\partial z_j} = p_j - \delta_{jt}$$
The elegant result: the gradient of cross-entropy loss through softmax is just predicted probability minus the one-hot target. This simplicity comes directly from the index-notation derivation.
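A short numerical check of the combined result $\partial L/\partial z_j = p_j - \delta_{jt}$ (the vector length and class index are arbitrary choices for the check):
import numpy as np
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()
z = np.random.randn(6)
t = 2                                     # true class index
p = softmax(z)
grad_analytic = p - np.eye(6)[t]          # p_j - delta_jt
eps = 1e-6
grad_num = np.array([
    (-np.log(softmax(z + eps * np.eye(6)[j])[t])
     + np.log(softmax(z - eps * np.eye(6)[j])[t])) / (2 * eps)
    for j in range(6)
])
print(np.max(np.abs(grad_analytic - grad_num)))   # ~1e-9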
GRADIENT DERIVATION CHEAT SHEET
===================================================================
Function Gradient via Index Notation
------------------------ --------------------------------------
f = a_i x_i                ∂f/∂x_j = a_j                       (linear)
f = x_i A_ij x_j           ∂f/∂x_k = (A + A^T)_kj x_j          (quadratic)
C_ij = A_ik B_kj           ∂L/∂A_lm = (∂L/∂C)_lj B_mj          (A side)
                           ∂L/∂B_mn = A_im (∂L/∂C)_in          (B side)
f = tr(AB)                 ∂f/∂A_lm = B_ml = (B^T)_lm
p_v = softmax(z)_v         ∂p_v/∂z_w = p_v(δ_vw - p_w)
TOOL: ∂T_ab/∂T_cd = δ_ac δ_bd   (Kronecker delta - the engine)
8. Batched Operations and Batch Indices
8.1 Introducing the Batch Index
In ML, the same operation is applied to independent examples simultaneously. The batch dimension is represented by adding a leading index $b$.
Batched matrix-vector (per-sample weights): $y_{bi} = A_{bij} x_{bj}$
For each batch element $b$, compute $y_b = A_b x_b$ independently.
Shared weight matrix (common in neural nets): $y_{bi} = W_{ij} x_{bj}$
$W$ has no $b$ index - the same weight matrix is applied to every sample. $x_{bj}$ varies per batch element.
In einsum: "ij,bj->bi" - $b$ is free (appears in output), $j$ is dummy (contracted).
8.2 Batched Matrix Multiply
$C_{bij} = A_{bik} B_{bkj}$
Each batch element $b$ has its own $A_b$ and $B_b$; multiply independently. $k$ is dummy; $b, i, j$ are free.
Einsum: "bik,bkj->bij"
PyTorch: torch.bmm(A, B) for rank-3 tensors.
8.3 Multi-Head Attention in Index Notation
Full multi-head attention with explicit indices:
- Tensors $Q_{bhid}$, $K_{bhjd}$, $V_{bhjd}$ where $b$ = batch, $h$ = heads, $i, j$ = sequence length, $d$ = head dimension
Attention scores: $S_{bhij} = Q_{bhid} K_{bhjd}$
- $d$ is dummy (summed over head dimension -> dot product)
- $b, h, i, j$ are free (per batch, per head, per query-key pair)
- Einsum: "bhid,bhjd->bhij" (this is $QK^\top$ per head)
After softmax (applied over $j$ for each $i$): $\alpha_{bhij} = \mathrm{softmax}_j(S_{bhij}/\sqrt{d_k})$
Attention output: $O_{bhid} = \alpha_{bhij} V_{bhjd}$
- $j$ is dummy (weighted sum over values)
- $b, h, i, d$ are free (per batch, per head, per position, per dimension)
- Einsum: "bhij,bhjd->bhid"
8.4 Broadcasting as Implicit Index Repetition
NumPy/PyTorch broadcasting corresponds to an index not appearing on one tensor - it is implicitly repeated:
| Code operation | Index notation | Broadcasting rule |
|---|---|---|
| X + b (add bias) | $Y_{bi} = X_{bi} + \beta_i$ | Batch index absent on $\beta$ - same bias for all batch elements |
| X * v (scale features) | $Y_{bi} = X_{bi} v_i$ | Batch index absent on $v$ - same vector for all batch elements |
| X + PE (positional encoding) | $Y_{bti} = X_{bti} + PE_{ti}$ | Batch index absent on PE - same positional encoding for all batches |
Understanding broadcasting as implicit indexing prevents shape errors. If you write the index notation first and check which tensors carry which indices, the broadcasting rules become obvious.
8.5 Reduction Operations
Summing over an index eliminates it from the result:
| Operation | Index notation | Einsum | Description |
|---|---|---|---|
| Sum over batch | $s_i = X_{bi}$ ($b$ dummy) | "bi->i" | Sum features across the batch |
| Mean over batch | $\bar{x}_i = \frac{1}{B}\sum_b X_{bi}$ | "bi->i" then divide by B | Batch mean |
| Sum over sequence | $s_{bi} = X_{bti}$ ($t$ dummy) | "bti->bi" | Sum over positions |
| Global sum | $s = X_{bti}$ ($b, t, i$ all dummy) | "bti->" | Total sum -> scalar |
Pooling operations are all forms of reduction over specific index dimensions. Average pooling = sum + normalise. Max pooling breaks the einsum pattern (non-linear) but is still a reduction over an index.
9. Einsum in Practice
9.1 Einsum String Syntax
The einsum function (available in NumPy, PyTorch, JAX, TensorFlow) implements Einstein summation directly:
Format: "input1_indices, input2_indices, ... -> output_indices"
Rules:
- Input operands separated by commas; output after the arrow (->)
- Each character is an index label (single letter)
- Index appearing on multiple inputs -> contraction (sum) if not in output
- Index in output -> free (kept in result)
- Index omitted from output -> dummy (summed over)
Example: "ik,kj->ij" means - matrix multiply. Index appears on both inputs but not in output -> contracted.
9.2 Standard Operations as Einsum Strings
| Operation | Mathematical | Einsum String | Notes |
|---|---|---|---|
| Dot product | $u_i v_i$ | "i,i->" | Both indices contracted |
| Outer product | $u_i v_j$ | "i,j->ij" | No contraction |
| Matrix-vector | $A_{ij} v_j$ | "ij,j->i" | $j$ contracted |
| Matrix multiply | $A_{ik} B_{kj}$ | "ik,kj->ij" | $k$ contracted |
| Batched matmul | $A_{bik} B_{bkj}$ | "bik,bkj->bij" | $k$ contracted; $b$ free |
| Transpose | $A_{ji}$ | "ij->ji" | Relabel only |
| Trace | $A_{ii}$ | "ii->" | Self-contraction |
| Row sum | $\sum_j A_{ij}$ | "ij->i" | $j$ contracted |
| Column sum | $\sum_i A_{ij}$ | "ij->j" | $i$ contracted |
| Element-wise multiply | $A_{ij} B_{ij}$ (no sum) | "ij,ij->ij" | No contraction |
| Frobenius inner product | $A_{ij} B_{ij}$ | "ij,ij->" | Both contracted |
| Attention scores ($QK^\top$) | $Q_{ia} K_{ja}$ | "ia,ja->ij" | $a$ contracted |
| Attention output | $\alpha_{ij} V_{ja}$ | "ij,ja->ia" | $j$ contracted |
| Quadratic form | $x_i M_{ij} x_j$ | "i,ij,j->" | $i, j$ contracted |
| Bilinear form | $x_i M_{ij} y_j$ | "i,ij,j->" | $i, j$ contracted |
| Batch outer product | $x_{bi} y_{bj}$ | "bi,bj->bij" | $b$ free; no contraction |
| Diagonal extraction | $A_{ii}$ (no sum) | "ii->i" | Diagonal as vector |
9.3 Multi-Operand Einsum
Einsum supports more than two input tensors:
torch.einsum("ijk,jl,km->ilm", A, B, C)
This computes $D_{ilm} = A_{ijk} B_{jl} C_{km}$ - contracting $j$ (between $A$ and $B$) and $k$ (between $A$ and $C$) simultaneously.
Contraction order matters. For the three-tensor contraction $A_{ijk} B_{jl} C_{km}$:
- Option 1: contract $j$ first (combine $A$ and $B$), then $k$
- Option 2: contract $k$ first (combine $A$ and $C$), then $j$
For rectangular tensors, one order can be orders of magnitude faster. The opt_einsum library finds near-optimal contraction paths automatically:
import numpy as np
import opt_einsum
A, B, C = np.random.randn(10, 20, 30), np.random.randn(20, 40), np.random.randn(30, 50)
# Finds a near-optimal contraction order for large multi-tensor expressions
path, info = opt_einsum.contract_path("ijk,jl,km->ilm", A, B, C)
result = opt_einsum.contract("ijk,jl,km->ilm", A, B, C, optimize=path)
9.4 Implicit vs Explicit Mode
Implicit mode (no arrow): output indices = all indices appearing exactly once, in alphabetical order.
"ij,jk"-> implicit output"ik"( appears twice -> contracted)- Convenient shorthand; may be ambiguous for complex expressions
Explicit mode (with arrow): output indices specified exactly.
"ij,jk->ki"-> transposed result (column-major output)- Always use explicit mode for clarity
Rule of thumb: Always use explicit mode in production code. Implicit mode is fine for quick interactive exploration but introduces ambiguity in complex expressions.
9.5 Performance Considerations
- Einsum vs manual matmul: Modern frameworks compile einsum to optimised CUDA kernels. For standard patterns (matmul, bmm), performance is equivalent.
- Fused operations: Einsum can fuse multiple contractions, avoiding materialisation of intermediate tensors. Example: "bhid,bhjd->bhij" computes attention scores without explicitly materialising the transpose.
- Limitations: Not all einsum patterns are equally optimised on all GPUs. For critical paths (attention in production), manually optimised kernels (FlashAttention) outperform generic einsum.
- Memory: Einsum can be memory-efficient by not materialising intermediates, but can also consume unexpected memory for large multi-operand expressions.
9.6 Common Pitfalls in Code
| Pitfall | Problem | Fix |
|---|---|---|
| Shape mismatch | "ik,kj->ij" requires A.shape[1] == B.shape[0] | Check shapes before einsum |
| Wrong index string | "ij,jk->ij" is NOT matmul | Should be "ij,jk->ik" |
| Missing batch dim | "ij,jk->ik" on batched input | Use "bij,bjk->bik" |
| Implicit contraction surprise | Accidentally sharing index names | Use explicit output -> |
| Integer tensors | Einsum with int tensors may give wrong results | Cast to float first |
| Repeated index same tensor | "ij,ij->ij" vs "ii->" | Understand the difference |
10. Tensor Products and Decompositions
10.1 Tensor Product (Outer Product Generalisation)
The tensor product of $A$ (rank $r$) and $B$ (rank $s$) produces a rank-$(r+s)$ tensor:
$$(A \otimes B)_{i_1 \cdots i_r\, j_1 \cdots j_s} = A_{i_1 \cdots i_r} B_{j_1 \cdots j_s}$$
No contraction - all indices are free. The result carries all indices of both operands.
Special cases:
- Outer product of vectors: $u_i v_j$ -> rank-2 (matrix)
- Outer product of matrices: $A_{ij} B_{kl}$ -> rank-4
- Kronecker product: specific arrangement of matrix tensor product
AI connection - LoRA: The low-rank update $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, is a sum of rank-1 outer products: $\Delta W_{ij} = B_{ir} A_{rj}$. The index $r$ ranges over the LoRA rank, making the low-rank structure explicit.
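A sketch of this decomposition in NumPy, with illustrative sizes: the einsum contraction over the rank index $r$ equals the explicit sum of rank-1 outer products.
import numpy as np
d, k, r = 64, 64, 8
B = np.random.randn(d, r)
A = np.random.randn(r, k)
dW = np.einsum('ir,rj->ij', B, A)      # Delta W_ij = B_ir A_rj; r is the dummy index
assert np.linalg.matrix_rank(dW) <= r
dW_sum = sum(np.outer(B[:, i], A[i, :]) for i in range(r))   # sum of r rank-1 terms
assert np.allclose(dW, dW_sum)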
10.2 Tensor Unfolding (Matricisation)
Reshaping a rank- tensor into a matrix by selecting which indices form rows vs columns:
Mode-$n$ unfolding $T_{(n)}$: index $i_n$ forms rows; all other indices (combined) form columns.
Example: $T \in \mathbb{R}^{n_1 \times n_2 \times n_3}$
- Mode-1 unfolding: $n_1 \times (n_2 n_3)$ matrix
- Mode-2 unfolding: $n_2 \times (n_1 n_3)$ matrix
- Mode-3 unfolding: $n_3 \times (n_1 n_2)$ matrix
Used in Tucker decomposition, tensor-train networks, and model compression. The unfolding is the bridge between higher-order tensor operations and standard matrix algorithms (SVD, eigendecomposition).
10.3 CP Decomposition (CANDECOMP/PARAFAC)
Decompose a tensor as a sum of rank-1 tensors:
$$T_{ijk} = \sum_{r=1}^{R} \lambda_r\, a_{ir} b_{jr} c_{kr}$$
In Einstein notation: $T_{ijk} = \lambda_r a_{ir} b_{jr} c_{kr}$ - the index $r$ is dummy (summed over the $R$ rank-1 components).
Each term is a rank-1 tensor (outer product of three vectors). $R$ is the tensor rank - the minimum number of rank-1 terms needed for exact representation.
AI use: Low-rank CP decomposition for compressing embedding tables, convolutional filters, and MoE routing tensors.
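A small sketch of reconstructing a CP-format tensor with a single einsum (the rank and mode sizes are illustrative):
import numpy as np
R, n1, n2, n3 = 3, 4, 5, 6
lam = np.random.randn(R)
a, b, c = np.random.randn(n1, R), np.random.randn(n2, R), np.random.randn(n3, R)
T = np.einsum('r,ir,jr,kr->ijk', lam, a, b, c)   # T_ijk = lambda_r a_ir b_jr c_kr
assert T.shape == (n1, n2, n3)
# Element check against the explicit sum over r
i, j, k = 1, 2, 3
assert np.isclose(T[i, j, k], sum(lam[r] * a[i, r] * b[j, r] * c[k, r] for r in range(R)))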
10.4 Tucker Decomposition
$$T_{ijk} = G_{abc}\, U_{ia} V_{jb} W_{kc}$$
where $a, b, c$ are dummy indices (contracted):
- $G$ = core tensor (small, captures interactions between modes)
- $U, V, W$ = factor matrices (orthogonal, capture principal components per mode)
This generalises SVD to higher-order tensors. Truncating the core tensor dimensions yields compression.
AI use: Compressing convolutional filters (4D tensors: output channels × input channels × height × width). Compressing transformer weight tensors. The Tucker structure explicitly shows which modes interact through the core tensor.
10.5 Tensor Train (TT) Decomposition / Matrix Product States
Decompose a rank-$n$ tensor as a chain of 3-index cores:
$$T_{i_1 i_2 \cdots i_n} = G^{(1)}_{i_1 \alpha_1}\, G^{(2)}_{\alpha_1 i_2 \alpha_2} \cdots G^{(n)}_{\alpha_{n-1} i_n}$$
- $\alpha_1, \dots, \alpha_{n-1}$: bond indices (contracted between adjacent cores)
- $i_1, \dots, i_n$: physical indices (free; correspond to original tensor dimensions)
- Bond dimension $\chi$: controls approximation quality
Compression ratio: An order-$n$ tensor with mode size $d$ stores $d^n$ values. TT format stores $O(n d \chi^2)$ parameters - exponential compression when $\chi$ is small.
AI connections:
- Tensor train for embedding tables: compress embedding into chain of small matrices
- Matrix product states for language modelling (theoretical connection to autoregressive models)
- Quantum-inspired ML: DMRG-like training of tensor network classifiers
TENSOR DECOMPOSITION OVERVIEW
===================================================================
CP Decomposition:
T_ijk = Σ_r λ_r * a_ir * b_jr * c_kr
-> Sum of rank-1 outer products
-> Parameters: R(n_1 + n_2 + n_3)
Tucker Decomposition:
T_ijk = G_abc * U_ia * V_jb * W_kc
-> Core tensor + factor matrices
-> Parameters: r_1r_2r_3 + n_1r_1 + n_2r_2 + n_3r_3
Tensor Train:
T_{i_1i_2...i_n} = G^1_{i_1 α_1} * G^2_{α_1 i_2 α_2} * ... * G^n_{α_{n-1} i_n}
-> Chain of 3-index cores
-> Parameters: O(n·d·χ^2) for order n, mode size d, bond dimension χ
SVD (matrix, rank-2):
A_ij = U_ik σ_k V_jk
-> Special case of all three decompositions
LoRA (practical):
ΔW_ij = B_ir A_rj (r = LoRA rank, typically 4-64)
-> Rank-r matrix update; r is the dummy index
11. Index Notation for Common AI Architectures
11.1 Fully Connected Layer
Forward pass: $y_i = \sigma(W_{ij} x_j + b_i)$
- $j$ is dummy: summed over the input dimension
- $i$ is free: indexes output neurons
- $\sigma$ is the activation function (ReLU, GELU, etc.) applied elementwise
Gradient w.r.t. weights: $\dfrac{\partial L}{\partial W_{ij}} = \delta_i x_j$
where $\delta_i$ is the error signal at neuron $i$.
In matrix form: $\partial L / \partial W = \boldsymbol{\delta}\, \mathbf{x}^\top$ - an outer product of error and input.
Gradient w.r.t. input: $\dfrac{\partial L}{\partial x_j} = \delta_i W_{ij}$
The index $i$ is dummy (summed over output neurons). This propagates the error backward through the layer.
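A compact sketch of this forward/backward pair as einsums, using a ReLU activation and illustrative sizes:
import numpy as np
d_in, d_out = 8, 5
W, bias = np.random.randn(d_out, d_in), np.random.randn(d_out)
x = np.random.randn(d_in)
z = np.einsum('ij,j->i', W, x) + bias    # pre-activation: z_i = W_ij x_j + b_i
y = np.maximum(z, 0)                     # ReLU
dy = np.random.randn(d_out)              # upstream gradient dL/dy_i
delta = dy * (z > 0)                     # error signal delta_i = dL/dy_i * relu'(z_i)
dW = np.einsum('i,j->ij', delta, x)      # dL/dW_ij = delta_i x_j (outer product)
dx = np.einsum('i,ij->j', delta, W)      # dL/dx_j = delta_i W_ij (i contracted)
assert dW.shape == W.shape and dx.shape == x.shape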
11.2 Embedding Lookup
$E \in \mathbb{R}^{V \times d}$: embedding matrix; token $t$: discrete index.
Forward: $x_i = E_{ti}$
Select row $t$ from $E$. Index $i$ is free (embedding dimension); $t$ is a fixed value, not a summation index.
Gradient: $\dfrac{\partial L}{\partial E_{vi}} = \delta_{vt}\, \dfrac{\partial L}{\partial x_i}$
Only row $v = t$ has a non-zero gradient - the Kronecker delta enforces sparsity. This is why embedding updates use sparse gradient operations rather than dense matrix updates.
11.3 Layer Normalisation
Input: $x_i$ (index $i$ over feature dimension $d$).
Statistics (computed over features): $\mu = \frac{1}{d} x_j$, $\;\sigma^2 = \frac{1}{d}(x_j - \mu)(x_j - \mu)$ (repeated $j$ -> summed)
Normalisation: $\hat{x}_i = \dfrac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$
Output: $y_i = \gamma_i \hat{x}_i + \beta_i$ (elementwise; $\gamma, \beta$ are learnable scale and shift)
Gradient complexity: $\partial L/\partial x_i$ involves sums over $j$ of gradient terms because $\mu$ and $\sigma^2$ depend on all $x_j$. The gradient is non-local - changing any single $x_j$ affects the normalisation of every $\hat{x}_i$. In index notation, with $g_i = \gamma_i\, \partial L/\partial y_i$:
$$\frac{\partial L}{\partial x_i} = \frac{1}{\sqrt{\sigma^2 + \epsilon}}\left(g_i - \frac{1}{d}\sum_j g_j - \frac{\hat{x}_i}{d}\sum_j g_j \hat{x}_j\right)$$
The sums over $j$ (dummy index) make the non-locality explicit.
11.4 Causal Self-Attention - Complete Index Derivation
This is the central computation of the transformer. Every step in index notation:
Input: $X_{td}$ (sequence of $T$ tokens, each $d_{\text{model}}$-dimensional)
Step 1: Projections - $Q_{ia} = X_{id} W^Q_{da}$, $\;K_{ja} = X_{jd} W^K_{da}$, $\;V_{ja} = X_{jd} W^V_{da}$
Index $d$ is dummy (summed over model dimension $d_{\text{model}}$). Indices $i, j$ (position) and $a$ (head dimension $d_k$) are free.
Step 2: Raw scores - $S_{ij} = Q_{ia} K_{ja} / \sqrt{d_k}$
Index $a$ is dummy (dot product over head dimension). Indices $i$ (query position) and $j$ (key position) are free. This is $QK^\top/\sqrt{d_k}$.
Step 3: Causal mask - $\tilde{S}_{ij} = S_{ij} + M_{ij}$ with $M_{ij} = -\infty$ for $j > i$, else $0$
Mask out future positions. Both $i, j$ remain free.
Step 4: Attention weights - $\alpha_{ij} = \exp(\tilde{S}_{ij}) \big/ \sum_k \exp(\tilde{S}_{ik})$
Index $k$ is dummy in the denominator (sum over key positions). Softmax is applied per query position $i$.
Step 5: Attention output - $O_{ia} = \alpha_{ij} V_{ja}$
Index $j$ is dummy (weighted sum over value positions). Indices $i$ (position) and $a$ (head dimension) are free.
Step 6: Output projection - $Y_{id} = O_{ia} W^O_{ad}$
Index $a$ is dummy. This projects the concatenated head outputs back to model dimension.
11.5 Feed-Forward Network (SwiGLU)
The modern transformer FFN uses gated linear units:
Input: $x_d$ (index $d$ over model dimension $d_{\text{model}}$)
Gate projection: $g_f = W^G_{fd} x_d$ ($f$ over FFN dimension $d_{\text{ff}}$; $d$ is dummy)
Up projection: $u_f = W^U_{fd} x_d$ ($f$ over $d_{\text{ff}}$; $d$ is dummy)
SwiGLU activation: $h_f = \mathrm{Swish}(g_f)\, u_f$ (elementwise; $f$ is free; no summation)
Down projection: $y_d = W^D_{df} h_f$ ($f$ is dummy; $d$ is free)
Index flow: $d_{\text{model}}$-dim input -> $d_{\text{ff}}$-dim hidden (via two parallel projections) -> $d_{\text{model}}$-dim output. The index dimensions trace the information flow explicitly.
11.6 Gradient of Attention Output
Given upstream gradient $G_{ia} = \partial L / \partial O_{ia}$, derive gradients w.r.t. $V$, $\alpha$, and $S$:
Gradient w.r.t. $V$: $\dfrac{\partial L}{\partial V_{ja}} = \alpha_{ij} G_{ia}$
In matrix form: $\partial L / \partial V = \alpha^\top G$ OK ($i$ is dummy; $j, a$ are free)
Gradient w.r.t. $\alpha$: $\dfrac{\partial L}{\partial \alpha_{ij}} = G_{ia} V_{ja}$
Gradient w.r.t. $S$ (through softmax): $\dfrac{\partial L}{\partial S_{ij}} = \alpha_{ij}\left(\dfrac{\partial L}{\partial \alpha_{ij}} - \sum_k \alpha_{ik} \dfrac{\partial L}{\partial \alpha_{ik}}\right)$
This is the softmax Jacobian from Section 7.7 applied to each row $i$. These are exactly the formulas implemented in FlashAttention's backward pass.
12. Symmetries, Invariances, and Equivariances
12.1 Permutation Invariance in Index Notation
A function $f$ is permutation-invariant if $f(PX) = f(X)$ for any permutation $P$. In index form: the output depends only on the set of rows $\{x_i\}$, not on their order.
Example - sum pooling: $s_d = X_{td}$ ($t$ is dummy)
The sum doesn't change if we permute the rows of $X$. Any reduction over the position index $t$ that doesn't depend on position yields a permutation-invariant result.
12.2 Equivariance via Index Structure
A function $f$ is permutation-equivariant if permuting inputs permutes outputs the same way:
$$f(PX) = P f(X)$$
where $P$ is a permutation matrix ($P_{ij} \in \{0, 1\}$, one 1 per row and column).
Attention without positional encoding is permutation-equivariant. Proof via index notation:
Permute input: $X'_{td} = P_{ts} X_{sd}$ (apply the permutation to positions).
Compute $Q'_{ia} = X'_{id} W^Q_{da} = P_{is} X_{sd} W^Q_{da} = P_{is} Q_{sa}$
Similarly: $K'_{ja} = P_{jt} K_{ta}$, $V'_{ja} = P_{jt} V_{ta}$
Scores: $S'_{ij} = Q'_{ia} K'_{ja} = P_{is} P_{jt} Q_{sa} K_{ta} = P_{is} P_{jt} S_{st}$
After softmax (preserves permutation structure): $\alpha'_{ij} = P_{is} P_{jt} \alpha_{st}$
Output: $O'_{ia} = \alpha'_{ij} V'_{ja} = P_{is} P_{jt} \alpha_{st} P_{ju} V_{ua}$
Since $P_{jt} P_{ju} = \delta_{tu}$ (orthogonality of permutation matrices):
Therefore: $O'_{ia} = P_{is} \alpha_{st} V_{ta} = P_{is} O_{sa}$ - the output is permuted the same way as the input OK
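The same equivariance can be checked numerically; a single-head sketch with illustrative sizes and randomly chosen weights:
import numpy as np
T, d, d_k = 6, 8, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((T, d))
WQ, WK, WV = (rng.standard_normal((d, d_k)) for _ in range(3))
def attention(X):
    Q, K, V = X @ WQ, X @ WK, X @ WV
    S = np.einsum('ia,ja->ij', Q, K) / np.sqrt(d_k)
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)
    return np.einsum('ij,ja->ia', A, V)
P = np.eye(T)[rng.permutation(T)]                        # permutation matrix
assert np.allclose(attention(P @ X), P @ attention(X))   # f(PX) = P f(X)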
With positional encoding: The encoding $PE_{td}$ breaks equivariance because it depends on the position index $t$ directly. Different permutations map positions to different encodings.
12.3 Invariant and Equivariant Aggregation
| Aggregation | Index Form | Property | Use in AI |
|---|---|---|---|
| Sum pooling | $s_d = X_{td}$ ($t$ dummy) | Permutation-invariant | Set functions, DeepSets |
| Mean pooling | $s_d = \frac{1}{T}\sum_t X_{td}$ | Permutation-invariant | Sentence embeddings |
| Max pooling | $s_d = \max_t X_{td}$ | Permutation-invariant | PointNet |
| Attention | $O_{ia} = \alpha_{ij} V_{ja}$ | Permutation-equivariant | Set Transformer |
The presence or absence of the position index in the output determines whether the operation is invariant (position index summed away) or equivariant (position index kept).
12.4 Group Equivariance in Index Notation
For a group acting on inputs via representation :
where applies group element to position .
A function is -equivariant if:
Constraint on weights: A linear map is -equivariant iff:
This constrains to a specific shared-weight structure:
| Group | Constraint | Result |
|---|---|---|
| Cyclic (translations) | Circular convolution | |
| Full symmetric () | same for all ; same for all | Sum/mean pooling |
| Rotation () | commutes with rotation matrices | Spherical harmonics convolution |
Index notation makes these symmetry constraints explicit and verifiable - you can check equivariance by substituting the group action into the index expression.
13. Advanced Index Manipulations
13.1 Diagrammatic Notation (Penrose / Tensor Networks)
Tensor network diagrams represent tensors and their contractions visually:
- Each tensor is a node (box or circle) with one wire (line) per index
- Contraction: connect wires between tensors (shared index = connected wire)
- Free indices: dangling wires (not connected to another tensor)
- Trace: loop a wire back to the same tensor
TENSOR NETWORK DIAGRAMS
===================================================================
Vector v_i: Matrix M_ij: Rank-3 T_ijk:
| i | i | j | i | j
* +-+ +-+
|M| |T|---- k
+-+ +-+
Matrix multiply C_ij = A_ik B_kj:
| i | j
+-+ k +-+ k is contracted (internal wire)
|A|-----------|B| i, j are free (dangling wires)
+-+ +-+
Trace: tr(A) = A_ii:
+------+
| +-+ | Wire loops back -> self-contraction
+--|A|-+ No dangling wires -> scalar
+-+
Frobenius: A_ij B_ij:
+-+ i,j +-+ Both wires connected
|A|========|B| -> all contracted -> scalar
+-+ +-+
AI applications of tensor networks:
- Matrix Product States (MPS) for language modelling: autoregressive probability as tensor train
- MERA (Multi-scale Entanglement Renormalisation Ansatz): hierarchical tensor decomposition
- Tensor network contraction for efficient inference in graphical models
13.2 Differential Forms and the Wedge Product
Differential $k$-forms are completely antisymmetric rank-$k$ tensors $\omega_{i_1 \cdots i_k}$.
Wedge product: $(\alpha \wedge \beta)_{i_1 \cdots i_{p+q}}$ = antisymmetrisation of $\alpha_{i_1 \cdots i_p}\, \beta_{i_{p+1} \cdots i_{p+q}}$
Exterior derivative: $d\omega$ of a $k$-form is a $(k+1)$-form:
$$(d\omega)_{i_0 i_1 \cdots i_k} = \sum_{a=0}^{k} (-1)^a\, \partial_{i_a}\, \omega_{i_0 \cdots \hat{i}_a \cdots i_k}$$
where $\hat{i}_a$ means "omit index $i_a$".
Connection to ML:
- Normalising flows require computing $\log|\det J|$ where $J_{ij} = \partial f_i / \partial x_j$ is the Jacobian. The determinant is expressible via the top exterior power: $\det J = \epsilon_{i_1 \cdots i_n} J_{1 i_1} \cdots J_{n i_n}$
- Riemannian geometry of neural network loss surfaces uses differential forms for curvature tensors
- The Jacobian log-determinant in variational inference is a wedge product computation
13.3 Tensor Symmetrisation Operators
Symmetriser $P_S$ and antisymmetriser $P_A$ as projection operators in index space:
$$\big(P_S\big)^{kl}_{ij} = \tfrac{1}{2}\big(\delta^k_i \delta^l_j + \delta^l_i \delta^k_j\big), \qquad \big(P_A\big)^{kl}_{ij} = \tfrac{1}{2}\big(\delta^k_i \delta^l_j - \delta^l_i \delta^k_j\big)$$
Applied to a rank-2 tensor $T_{kl}$:
- $(P_S T)_{ij} = \tfrac{1}{2}(T_{ij} + T_{ji})$ - symmetric part
- $(P_A T)_{ij} = \tfrac{1}{2}(T_{ij} - T_{ji})$ - antisymmetric part
Projection properties:
- $P_S^2 = P_S$ and $P_A^2 = P_A$ (idempotent - projecting twice = projecting once)
- $P_S + P_A = \mathrm{Id}$ (every tensor = symmetric part + antisymmetric part)
- $P_S P_A = P_A P_S = 0$ (orthogonal projections)
13.4 Spectral Methods in Index Notation
Eigendecomposition (symmetric $A$): $A_{ij} = \lambda_k\, u_{ik} u_{jk}$
Here $k$ is dummy - summed over eigenvalues. $u_{ik}$ is the $i$-th component of the $k$-th eigenvector. In matrix form: $A = U \Lambda U^\top$.
SVD: $A_{ij} = U_{ik}\, \sigma_k\, V_{jk}$
$k$ ranges over the rank. $\Sigma$ is implicitly diagonal ($\Sigma_{kk'} = \sigma_k \delta_{kk'}$), which simplifies to just $\sigma_k$ when contracted.
Low-rank approximation: truncate the sum to $k \le r$ -> rank-$r$ approximation.
LoRA in index notation: $\Delta W_{ij} = B_{ir} A_{rj}$
The index $r$ is the bottleneck. Parameters: $r(d + k)$ instead of $d \cdot k$. For typical values (e.g. $d = k = 4096$, $r = 8$): compression ratio $\approx 256\times$.
13.5 The Score Function and Stein's Lemma
Score function: $s_i(x) = \partial_i \log p(x)$
Stein's identity (integration by parts): $\mathbb{E}_{x \sim p}\!\left[\, s_i(x)\, f(x) + \partial_i f(x)\, \right] = 0$
In index notation with tensor-valued $f_j$: $\mathbb{E}_p\!\left[\, s_i f_j + \partial_i f_j\, \right] = 0$
AI applications:
- Score-based generative models (diffusion models): learn $s_i(x) = \partial_i \log p(x)$ directly; sample via Langevin dynamics
- REINFORCE (policy gradient): $\nabla_{\theta} \mathbb{E}[R] = \mathbb{E}\!\left[ R\, \partial_{\theta_i} \log \pi_\theta \right]$ is a score function estimator; the score is taken with respect to the parameter index $i$
- Stein Variational Gradient Descent (SVGD): uses the score function with a kernel to approximate posterior inference
14. Common Mistakes
14.1 Mistake Table
| # | Mistake | Why Wrong | Correct |
|---|---|---|---|
| 1 | Thinking $A_{ij} B_{ij}$ is matrix multiply | Sums over BOTH $i$ and $j$ simultaneously; each $(i,j)$ pair contributes | $A_{ij} B_{ij}$ = Frobenius inner product; matmul is $A_{ik} B_{kj}$ |
| 2 | Treating $A_{ij}$ and $A_{ji}$ as interchangeable | $A_{ji}$ is the transpose of $A_{ij}$; different unless $A$ is symmetric | $A_{ij} v_j = (Av)_i$ while $A_{ji} v_j = (A^\top v)_i$ |
| 3 | Free index mismatch: $v_i = A_{jk} u_k$ | Left has free $i$; right has free $j$; inconsistent | Free indices must match on both sides |
| 4 | Triple index: $A_{ij} B_{jk} v_j$ | Index $j$ appears three times - ambiguous | Write an explicit $\sum$; Einstein requires exactly two |
| 5 | Thinking $\delta_{ij}$ means "set $i = j$ everywhere" | $\delta_{ij}$ acts as an index substitutor when contracted; it doesn't globally equate $i$ and $j$ | $\delta_{ij} v_j = v_i$ - $j$ is replaced by $i$; $j$ disappears |
| 6 | "ij,jk->ij" is matrix multiply | This keeps both $i$ and $j$ in the output; $j$ is NOT contracted | Matrix multiply is "ij,jk->ik" |
| 7 | Outer product has a repeated index | $u_i v_j$ has DIFFERENT index names; no repetition = no contraction | Different indices = free = outer product |
| 8 | Tensor rank = matrix rank | Tensor rank (number of indices) ≠ matrix rank (dimension of column space) | Completely different concepts; "rank" is overloaded |
| 9 | Renaming a free index changes the expression | Renaming consistently on BOTH sides is valid (alpha-equivalence) | $y_i = A_{ij} x_j$ and $y_m = A_{mj} x_j$ are identical |
| 10 | Einstein notation only works for two tensors | Multi-tensor contractions follow the same rules | $x_i M_{ij} y_j$ contracts two pairs |
14.2 The Einsum Debugging Checklist
When an einsum expression gives unexpected results, check:
- Index count: Does each index appear the correct number of times? (Exactly once = free; exactly twice = contracted)
- Shape compatibility: Do contracted indices have matching dimensions?
- Output indices: Are the output indices exactly what you want? (Missing one = extra contraction; extra one = missing contraction)
- Index collision: Did you accidentally reuse an index name? ("ij,ik->jk" contracts over $i$; is that intended?)
- Batch dimension: Does the batch index appear in all the right places?
- Contraction vs elementwise: "ij,ij->" vs "ij,ij->ij" - contracted vs elementwise
14.3 Index Balance as Type Checking
Think of index balance like type checking in a programming language:
TYPE CHECKING ANALOGY
===================================================================
Expression Free indices "Type"
-------------------- ------------- --------------
A_ij B_kj i, k Matrix (2 free)
v_i i Vector (1 free)
A_ii (none) Scalar (0 free)
Type-correct equation:
v_i = A_ij u_j <- both sides: 1 free index (i) OK
Type error:
v_i = A_jk u_k <- left: free i; right: free j NO
s = A_ij B_jk <- left: 0 free; right: i,k NO
-> If free indices don't match, the equation is WRONG.
-> This is a mechanical check - no mathematical understanding needed.
15. Exercises
Exercise 1: Index Identification
For each expression below, identify all free indices, all dummy indices, the resulting shape, and write the equivalent np.einsum string.
| # | Expression | Free | Dummy | Shape | Einsum |
|---|---|---|---|---|---|
| 1 | $A_{ij} x_j$ | | | | |
| 2 | $A_{ij} B_{jk}$ | | | | |
| 3 | $u_i v_i$ | | | | |
| 4 | $u_i v_j$ | | | | |
| 5 | $A_{ij} B_{jk} C_{kl}$ | | | | |
▸Solution
| # | Free | Dummy | Shape | Einsum |
|---|---|---|---|---|
| 1 | $i$ | $j$ | vector | ij,j->i |
| 2 | $i, k$ | $j$ | matrix | ij,jk->ik |
| 3 | (none) | $i$ | scalar | i,i-> |
| 4 | $i, j$ | (none) | matrix | i,j->ij |
| 5 | $i, l$ | $j, k$ | matrix | ij,jk,kl->il |
Exercise 2: Kronecker Delta Manipulation
Simplify each expression using the substitution property $\delta_{ij} v_j = v_i$:
- $\delta_{ij} A_{jk}$
- $\delta_{ij} \delta_{jk} v_k$
- $\delta_{ii}$ where $i$ ranges over $\{1, \dots, n\}$
- $\epsilon_{ijk} \delta_{jk}$
▸Solution
- $\delta_{ij} A_{jk} = A_{ik}$
Contract over $j$: the delta forces $j = i$, substituting gives $A_{ik}$.
- $\delta_{ij} \delta_{jk} v_k = v_i$
First delta: $j \to i$, giving $\delta_{ik} v_k$. Second delta: $k \to i$, giving $v_i$.
- $\delta_{ii} = n$
This is the trace of the identity matrix: $\sum_{i=1}^{n} 1 = n$.
- $\epsilon_{ijk} \delta_{jk} = 0$
$\delta_{jk}$ is symmetric in $j, k$; $\epsilon_{ijk}$ is antisymmetric in $j, k$. Contraction of symmetric with antisymmetric always gives zero.
Exercise 3: Gradient Derivation
Derive the gradient of each scalar loss with respect to the indicated variable using index notation. Show all steps.
3a. $f = x_i A_{ij} x_j$ (quadratic form). Find $\partial f / \partial x_k$.
▸Solution
Apply the product rule ($A$ is constant):
$$\frac{\partial f}{\partial x_k} = \delta_{ik} A_{ij} x_j + x_i A_{ij} \delta_{jk} = A_{kj} x_j + x_i A_{ik}$$
In the second term, rename the dummy index $i \to j$: $x_i A_{ik} = A_{jk} x_j = (A^\top)_{kj} x_j$. So:
$$\frac{\partial f}{\partial x_k} = (A + A^\top)_{kj} x_j$$
If $A$ is symmetric ($A_{ij} = A_{ji}$), this simplifies to $2 A_{kj} x_j$.
3b. where , , are matrices. Find .
▸Solution
Write in index notation. Let be the residual.
Differentiate with respect to :
Now .
Setting to zero gives the normal equation .
3c. $L = -y_i \log p_i$ where $p = \mathrm{softmax}(z)$ and $y$ is one-hot. Find $\partial L / \partial z_j$.
▸Solution
Let $p_i = \mathrm{softmax}(z)_i$. Then $L = -y_i \log p_i$. By the chain rule:
$$\frac{\partial L}{\partial z_j} = -y_i \frac{1}{p_i} \frac{\partial p_i}{\partial z_j}$$
The softmax Jacobian is $\partial p_i / \partial z_j = p_i(\delta_{ij} - p_j)$. Substituting:
$$\frac{\partial L}{\partial z_j} = -y_i(\delta_{ij} - p_j) = -y_j + p_j \sum_i y_i = p_j - y_j$$
This elegant result is why cross-entropy + softmax is universally used: the gradient is simply the difference between predictions and targets.
Exercise 4: Einsum Implementation
Write NumPy einsum calls (and verify with standard NumPy operations) for:
4a. Batched trace: given $A$ with shape $(B, n, n)$, compute the trace of each matrix in the batch.
# Your solution:
traces = np.einsum('bii->b', A)
# Verification:
assert np.allclose(traces, np.trace(A, axis1=1, axis2=2))
4b. Bilinear form: given $x$ of shape $(n,)$, $M$ of shape $(n, n)$, $y$ of shape $(n,)$, compute $s = x_i M_{ij} y_j$.
# Your solution:
s = np.einsum('i,ij,j->', x, M, y)
# Verification:
assert np.isclose(s, x @ M @ y)
4c. Multi-head attention scores: given $Q$ of shape $(B, H, T, d)$ and $K$ of shape $(B, H, T, d)$, compute $S_{bhij} = Q_{bhid} K_{bhjd}$.
# Your solution:
S = np.einsum('bhid,bhjd->bhij', Q, K)
# Verification:
assert np.allclose(S, Q @ K.transpose(0, 1, 3, 2))
Exercise 5: Contraction Order Optimisation
Consider the expression $D = ABC$ where:
- $A$ is $512 \times 1024$
- $B$ is $1024 \times 8$
- $C$ is $8 \times 1024$
5a. Compute the total FLOPs for left-to-right evaluation: $(AB)C$.
5b. Compute the total FLOPs for right-to-left evaluation: $A(BC)$.
5c. Which order is faster and by how much?
▸Solution
5a. Left-to-right: first compute $AB$, shape $512 \times 8$:
- FLOPs: $512 \cdot 1024 \cdot 8 \approx 4.2 \times 10^6$
Then $(AB)C$, shape $512 \times 1024$:
- FLOPs: $512 \cdot 8 \cdot 1024 \approx 4.2 \times 10^6$
Total: $\approx 8.4 \times 10^6$ FLOPs.
5b. Right-to-left: first compute $BC$, shape $1024 \times 1024$:
- FLOPs: $1024 \cdot 8 \cdot 1024 \approx 8.4 \times 10^6$
Then $A(BC)$, shape $512 \times 1024$:
- FLOPs: $512 \cdot 1024 \cdot 1024 \approx 5.4 \times 10^8$
Total: $\approx 5.5 \times 10^8$ FLOPs.
5c. Left-to-right is faster - by roughly $65\times$! This mirrors the LoRA pattern: always multiply through the low-rank factor first. The intermediate tensor $AB$ has shape $512 \times 8$ - far smaller than $BC$ at $1024 \times 1024$.
Exercise 6: Softmax Gradient Verification
6a. Derive the Jacobian $\partial p_i / \partial z_j$ where $p_i = e^{z_i} / \sum_k e^{z_k}$ using the quotient rule in index notation.
▸Solution
$$\frac{\partial p_i}{\partial z_j} = \frac{e^{z_i}\delta_{ij}\sum_k e^{z_k} - e^{z_i} e^{z_j}}{\left(\sum_k e^{z_k}\right)^2} = p_i \delta_{ij} - p_i p_j = p_i(\delta_{ij} - p_j)$$
6b. Verify numerically with finite differences:
import numpy as np
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()
z = np.random.randn(5)
p = softmax(z)
# Analytical Jacobian
J_analytical = np.diag(p) - np.outer(p, p)
# Numerical Jacobian (finite differences)
eps = 1e-7
J_numerical = np.zeros((5, 5))
for j in range(5):
    z_plus = z.copy(); z_plus[j] += eps
    z_minus = z.copy(); z_minus[j] -= eps
    J_numerical[:, j] = (softmax(z_plus) - softmax(z_minus)) / (2 * eps)
print("Max error:", np.max(np.abs(J_analytical - J_numerical)))
# Should be < 1e-6
Exercise 7: Symmetry Analysis
For each tensor expression, determine if it is symmetric, antisymmetric, or neither in the indicated indices.
| # | Expression | Indices | Answer |
|---|---|---|---|
| 1 | $T_{ij} = A_{ik} A_{jk}$ (i.e. $AA^\top$) | $i, j$ | |
| 2 | $T_{ij} = u_i v_j - u_j v_i$ | $i, j$ | |
| 3 | $H_{ij} = \partial_i \partial_j f$ (Hessian) | $i, j$ | |
▸Solution
- $T_{ij} = A_{ik} A_{jk}$
Symmetric. Swap $i \leftrightarrow j$: $T_{ji} = A_{jk} A_{ik} = T_{ij}$. This is $AA^\top$, which is always symmetric.
- $T_{ij} = u_i v_j - u_j v_i$
Antisymmetric. Swap $i \leftrightarrow j$: $T_{ji} = u_j v_i - u_i v_j = -T_{ij}$.
- $H_{ij} = \partial_i \partial_j f$
Symmetric (for $C^2$ functions, by Schwarz's theorem). $\partial_i \partial_j f = \partial_j \partial_i f$ when $f$ has continuous second partial derivatives.
Exercise 8: Full Attention Pass
Implement the complete forward and backward pass for single-head self-attention using only index notation and np.einsum.
Given: input with shape , weight matrices each .
Forward pass - write each step in index notation and einsum:
import numpy as np
T, d, d_k = 8, 16, 4
X = np.random.randn(T, d)
WQ = np.random.randn(d, d_k)
WK = np.random.randn(d, d_k)
WV = np.random.randn(d, d_k)
# Step 1: Queries, Keys, Values
# Q_ia = X_id WQ_da -> einsum: 'id,da->ia'
Q = np.einsum('id,da->ia', X, WQ)
K = np.einsum('id,da->ia', X, WK)
V = np.einsum('id,da->ia', X, WV)
# Step 2: Attention scores
# S_ij = Q_ia K_ja / sqrt(d_k) -> einsum: 'ia,ja->ij'
S = np.einsum('ia,ja->ij', Q, K) / np.sqrt(d_k)
# Step 3: Attention weights (softmax over j for each i)
S_max = S.max(axis=1, keepdims=True)
exp_S = np.exp(S - S_max)
alpha = exp_S / exp_S.sum(axis=1, keepdims=True) # shape (T, T)
# Step 4: Output
# O_ia = alpha_ij V_ja -> einsum: 'ij,ja->ia'
O = np.einsum('ij,ja->ia', alpha, V)
Backward pass - given upstream gradient $\partial L / \partial O_{ia}$ (dO below):
dO = np.random.randn(T, d_k) # upstream gradient
# Step 4 backward: dL/d(alpha) and dL/dV
# dL/dV_ja = alpha_ij * dO_ia -> einsum: 'ij,ia->ja'
dV = np.einsum('ij,ia->ja', alpha, dO)
# dL/d(alpha_ij) = dO_ia * V_ja -> einsum: 'ia,ja->ij'
dalpha = np.einsum('ia,ja->ij', dO, V)
# Step 3 backward: softmax gradient
# dL/dS_ij = alpha_ij * (dalpha_ij - sum_k alpha_ik dalpha_ik)
sum_term = np.einsum('ij,ij->i', alpha, dalpha) # shape (T,)
dS = alpha * (dalpha - sum_term[:, None]) / np.sqrt(d_k)
# Step 2 backward: dL/dQ and dL/dK
# dL/dQ_ia = dS_ij K_ja -> einsum: 'ij,ja->ia'
dQ = np.einsum('ij,ja->ia', dS, K)
# dL/dK_ja = dS_ij Q_ia -> einsum: 'ij,ia->ja'
dK = np.einsum('ij,ia->ja', dS, Q)
# Step 1 backward: dL/dW^Q, dL/dW^K, dL/dW^V
# dL/dWQ_da = X_id * dQ_ia -> einsum: 'id,ia->da'
dWQ = np.einsum('id,ia->da', X, dQ)
dWK = np.einsum('id,ia->da', X, dK)
dWV = np.einsum('id,ia->da', X, dV)
Verify your implementation with numerical gradient checking (finite differences on each weight matrix).
16. Why This Matters for AI
16.1 Impact Across the AI Stack
Einstein summation notation is not merely academic shorthand - it is the computational lingua franca of modern AI systems. Every layer of the stack, from theoretical research papers to production inference engines, benefits from thinking in indices.
| Domain | Without Index Notation | With Index Notation |
|---|---|---|
| Research papers | Ambiguous matrix expressions, dimensions unclear | Precise, self-documenting equations |
| Attention mechanisms | Nested loops or opaque bmm calls | Single einsum string: bhid,bhjd->bhij |
| Backpropagation | Memorize gradient formulas | Derive gradients mechanically |
| Tensor compilers | Manual kernel fusion | Automatic optimization from contraction specs |
| LoRA / adapters | Ad-hoc low-rank tricks | Clear decomposition: $\Delta W_{ij} = B_{ir} A_{rj}$ |
| Multi-head attention | Reshapes and transposes | Natural head/batch indices: $O_{bhia} = \alpha_{bhij} V_{bhja}$ |
| Distributed training | Confusing sharding specs | Shard along a named index |
| Quantization | Unclear where precision is lost | Per-index (per-channel) scaling factors |
| Sparse attention | Masking heuristics | Constraint on the index range: $j \in \mathcal{N}(i)$ |
| Graph neural networks | Aggregation as a special case | Generalised contraction: $h'_{id'} = A_{ij} h_{jd} W_{dd'}$ |
| Diffusion models | Score-function algebra | Index-level Stein identity derivation |
16.2 Key Takeaways
- Einstein summation is a language, not just notation. Learning it changes how you read papers, write code, and debug models.
- Free indices determine the output shape; dummy indices determine the computation.
- einsum is the universal API for turning index notation into executable tensor code.
- Once a forward pass is written in index notation, many backward-pass steps become mechanical through product-rule reasoning and identities such as $\partial T_{ab}/\partial T_{cd} = \delta_{ac}\delta_{bd}$.
- Index notation reveals optimization opportunities: contraction order, low-rank structure, sparsity, and parallel axes all become easier to see.
- Modern architectures are index expressions. Transformers, GNNs, diffusion models, and mixture-of-experts all benefit from this viewpoint.
- Symmetries often appear as constraints on how indices transform, which makes the notation useful for principled architecture design.
17. Conceptual Bridge
17.1 From Mathematical Notation to Working Code
paper equation
-> attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
-> index notation
-> O_ia = softmax_j(Q_id K_jd / sqrt(d_k)) V_ja
-> einsum strings
-> scores = einsum('bhid,bhjd->bhij', Q, K)
-> output = einsum('bhij,bhja->bhia', attn, V)
-> optimized kernels and hardware-specific implementations
At every level, index notation tells you what data flows where, which dimensions contract, and what the output shape must be. That is the bridge from paper math to production tensor programs.
17.2 Where This Leads Next
The next chapters turn indices into the working language of vectors, matrices, tensors, gradients, and efficient kernels. Once you can read repeated-index structure fluently, large model equations become far less mysterious: they collapse into a small set of contraction patterns you can analyze, implement, and optimize.
References and Further Reading
| Resource | Topic | Type |
|---|---|---|
| Einstein, A. (1916). "The Foundation of the General Theory of Relativity" | Original convention introduction | Paper |
| Kolda & Bader (2009). "Tensor Decompositions and Applications" | CP, Tucker, tensor networks | Survey |
| Vaswani et al. (2017). "Attention Is All You Need" | Transformer architecture | Paper |
| Laue, Mitterreiter, and Giesen (2020). "A Simple and Efficient Tensor Calculus" | Automatic tensor differentiation | Paper |
| opt_einsum documentation | Contraction-order optimization | Software |
| Penrose (1971). "Applications of Negative Dimensional Tensors" | Diagrammatic tensor notation | Paper |
| Dao et al. (2022). "FlashAttention" | Memory-efficient attention kernels | Paper |
| Hu et al. (2021). "LoRA: Low-Rank Adaptation" | Low-rank weight decomposition | Paper |
| NumPy documentation - numpy.einsum | Reference implementation | Documentation |
<- Previous: 04-Summation-and-Product-Notation | Home | Next: 06-Proof-Techniques ->