Math for LLMs

Einstein Summation and Index Notation

"I have made a great discovery in mathematics; I have suppressed the summation sign every time that the summation must be made over an index which occurs twice." - Albert Einstein, 1916

Overview

Einstein summation convention is the notational engine of modern tensor computation. Wherever the same index appears exactly twice in a single term, summation over all values of that index is implied - no \Sigma needed. This convention, born in general relativity, now powers every matrix multiply, every attention score, every gradient computation in deep learning. Once you can read index notation, you can read any transformer paper, derive any backpropagation formula, and debug any shape error - mechanically.

Prerequisites

  • Summation and product notation (\Sigma, \Pi) from Module 04
  • Matrix operations: multiplication, transpose, trace
  • Basic calculus: partial derivatives, chain rule
  • Familiarity with NumPy arrays / PyTorch tensors

Learning Objectives

After completing this section, you will be able to:

  1. Read and write tensor expressions in Einstein index notation
  2. Identify free indices (result dimensions) and dummy indices (contracted/summed) in any expression
  3. Derive gradient formulas mechanically using index notation and the Kronecker delta
  4. Translate any tensor operation into a np.einsum / torch.einsum string
  5. Express transformer attention, backpropagation, and tensor decompositions in index form
  6. Analyse symmetry, equivariance, and contraction order using index structure

Companion Notebooks

  • theory.ipynb - Interactive code: einsum operations, gradient verification, contraction order benchmarks, attention in index notation
  • exercises.ipynb - Practice problems: index identification, Kronecker delta manipulation, gradient derivation, einsum implementation


1. Intuition

1.1 What Is Einstein Summation?

Einstein summation convention is a notational shorthand: whenever the same index appears exactly twice in a single term, summation over all values of that index is implied automatically. No \Sigma symbol needed - the repetition of an index is its own summation instruction.

Consider the matrix-vector product. In standard notation:

w_i = \sum_{j=1}^{n} A_{ij} v_j

In Einstein notation, the same expression is simply:

w_i = A_{ij} v_j

The index j appears twice (once on A, once on v), so summation over j is implied. The index i appears once - it is a free index, meaning the equation holds for each value of i.

The result: tensor equations that would fill pages compress to single lines. The structure of the operation is immediately visible in the index pattern.

For AI: every matmul, every dot product, every attention score, every gradient computation is a sum over repeated indices. Einstein notation makes this structure explicit and manipulable.

1.2 Why This Notation Transforms Thinking

Without index notation, AB means "matrix multiply A and B" - the mechanics are hidden. With index notation:

(AB)_{ij} = A_{ik} B_{kj}

The repeated k tells you exactly what is summed, over what range, and how inputs connect to outputs.

Pattern recognition from indices alone:

Expression    | Repeated index | Free indices | Operation
A_{ik} B_{kj} | k              | i, j         | Matrix multiply
A_{ij} B_{ij} | i, j           | none         | Frobenius inner product (scalar)
A_{ik} B_{jk} | k              | i, j         | A B^\top
u_i v_j       | none           | i, j         | Outer product
u_i v_i       | i              | none         | Dot product (scalar)

Debugging power: If an equation has a free index on one side but not the other, it is wrong. Index balance checking catches errors mechanically - before running any code.

Generalisation: Every operation on any-order tensor is expressed uniformly. There is no special notation for matrices vs vectors vs 4D tensors - just indices.

1.3 The Core Idea in One Sentence

Two rules govern everything:

  • Free index: appears once -> result has that index -> no summation
  • Dummy index: appears twice -> summed over -> disappears from result

That's it. Everything else follows from these two rules.

THE TWO RULES
===================================================================

Expression: C_ij = A_ik B_kj

  Index i: appears ONCE -> FREE    -> C has index i (rows)
  Index j: appears ONCE -> FREE    -> C has index j (cols)
  Index k: appears TWICE -> DUMMY  -> summed over -> gone from result

  Result shape: C is a matrix (has i,j) OK
  Computation: C_ij = \Sigma_k A_ik B_kj OK

Expression: s = u_i v_i

  Index i: appears TWICE -> DUMMY  -> summed over -> gone from result

  Result shape: s is a scalar (no free indices) OK
  Computation: s = \Sigma_i u_i v_i = dot product OK
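
The two rules map directly onto np.einsum. A minimal sketch - the shapes and values below are illustrative, not from the companion notebooks:

import numpy as np

A = np.random.rand(3, 4)   # A_ik
B = np.random.rand(4, 5)   # B_kj
u = np.random.rand(6)
v = np.random.rand(6)

# C_ij = A_ik B_kj : k is dummy (summed), i and j are free (kept in the output)
C = np.einsum("ik,kj->ij", A, B)
assert np.allclose(C, A @ B)

# s = u_i v_i : i is dummy, no free indices -> scalar
s = np.einsum("i,i->", u, v)
assert np.allclose(s, u @ v)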

1.4 Where This Appears in AI

Every core operation in deep learning is an Einstein summation:

Operation            | Index notation                                                        | Dummy | Free | Description
Matrix multiply      | C_{ij} = A_{ik} B_{kj}                                                | k     | i, j | Forward pass linear layer
Dot product          | s = a_i b_i                                                           | i     | -    | Scalar score
Attention score      | S_{ij} = Q_{ik} K_{jk} / \sqrt{d_k}                                   | k     | i, j | Query-key similarity
Attention output     | O_{ia} = \alpha_{ij} V_{ja}                                           | j     | i, a | Weighted value sum
Gradient of matmul   | \partial L / \partial A_{ik} = (\partial L / \partial C_{ij}) B_{kj} | j     | i, k | Backprop through C = AB
Layer norm gradient  | \partial L / \partial x_i involves \sum_j                             | j     | i    | Non-local gradient
Multi-head attention | O_{ia} = \sum_h W^h_{a,\text{out}} \alpha^h_{ij} V^h_{j,d}            | h, j  | i, a | Across heads

1.5 Historical and Modern Context

Year      | Development                                          | Significance
1900      | Ricci & Levi-Civita: absolute differential calculus | Index notation precursor
1916      | Einstein: general relativity paper                   | Introduced implicit summation convention
1920s-50s | Standard in physics                                  | GR, electromagnetism, continuum mechanics
1960s-90s | Penrose: abstract index notation                     | Diagrammatic tensor calculus
2006      | NumPy numpy.einsum                                   | Einstein convention for array computing
2016      | PyTorch torch.einsum                                 | GPU-accelerated einsum
2018      | opt_einsum library                                   | Optimal contraction order for multi-tensor einsum
2017      | Vaswani et al.: "Attention Is All You Need"          | Attention expressed as index contractions
2022      | Dao et al.: FlashAttention                           | Attention kernels designed from index structure
2024      | DeepSeek: MLA (Multi-head Latent Attention)          | Low-rank contraction K_{ja} = c_{jr} W^K_{ra}
2025-26   | JAX einsum + JIT; einsum as ML lingua franca         | Seamless compilation of arbitrary contractions

1.6 The Hierarchy of Tensor Operations

TENSOR RANK HIERARCHY
===================================================================

  Scalar (rank 0):   s            - no indices
  Vector (rank 1):   v_i          - one index
  Matrix (rank 2):   M_ij         - two indices
  3-Tensor (rank 3): T_ijk        - three indices
  n-Tensor (rank n): T_{i_1i_2...i_n} - n indices

RANK-CHANGING OPERATIONS:
  +-------------------------------------------------------------+
  |  Contract one pair of indices  -> reduce rank by 2           |
  |  Outer product                 -> rank = sum of input ranks  |
  |  Transpose                    -> relabel indices (same rank) |
  |  Trace                        -> contract diagonal (rank -2) |
  +-------------------------------------------------------------+

EXAMPLES:
  dot product:     rank 1 + rank 1 - 2 = rank 0 (scalar)
  matrix-vector:   rank 2 + rank 1 - 2 = rank 1 (vector)
  matrix-matrix:   rank 2 + rank 2 - 2 = rank 2 (matrix)
  trace:           rank 2 - 2           = rank 0 (scalar)
  outer product:   rank 1 + rank 1      = rank 2 (matrix)

2. Formal Definitions

2.1 Tensors as Indexed Arrays

A tensor T of rank r over index sets I_1, I_2, \ldots, I_r is a function:

T: I_1 \times I_2 \times \cdots \times I_r \to \mathbb{R}

It assigns a real number T_{i_1 i_2 \ldots i_r} to each tuple of indices.

Rank | Name     | Notation | Example                             | Shape
0    | Scalar   | s        | Loss value, learning rate           | ()
1    | Vector   | v_i      | Embedding, gradient                 | (n,)
2    | Matrix   | M_{ij}   | Weight matrix, attention scores     | (m, n)
3    | 3-tensor | T_{ijk}  | Batched matrices, Q/K/V projections | (B, n, d)
4    | 4-tensor | T_{bhij} | Multi-head attention scores         | (B, H, n, n)

The shape is the tuple of index set sizes (n_1, n_2, \ldots, n_r). It determines memory layout and operation validity.

2.2 The Einstein Summation Convention - Precise Statement

Convention: In any expression, if an index label \alpha appears exactly twice in a single term, it is a dummy index and is implicitly summed over all values in its range.

Formally, for dummy index \alpha over range \{1, \ldots, n\}:

A_{\ldots\alpha\ldots} B_{\ldots\alpha\ldots} \equiv \sum_{\alpha=1}^{n} A_{\ldots\alpha\ldots} B_{\ldots\alpha\ldots}

An index appearing once is a free index: the equation holds for each value of that index independently. The range of each index is determined by the tensor dimensions at that position.

Example walkthrough:

C_{ij} = A_{ik} B_{kj}
  • k appears twice -> dummy -> implicit sum: C_{ij} = \sum_{k=1}^{p} A_{ik} B_{kj}
  • i appears once -> free -> equation holds for all i \in \{1,\ldots,m\}
  • j appears once -> free -> equation holds for all j \in \{1,\ldots,n\}
  • Result: C \in \mathbb{R}^{m \times n}, same as the standard matrix multiply C = AB

2.3 Free vs Dummy Indices - Precise Definitions

Free index - appears exactly once in a term:

  • Result expression carries that index
  • The equation holds for every value of the free index independently
  • Example: A_{ij} = B_{ik} C_{kj} - indices i and j are free; the equation holds for all (i, j)

Dummy index (bound/contracted index) - appears exactly twice in a single term:

  • Implicitly summed over its entire range
  • Does not appear in the result
  • Example: B_{ik} C_{kj} - index k is dummy; summed from 1 to p; gone from the result

Invalid - index appearing three or more times:

  • A_{iii} is ambiguous: which two occurrences to contract?
  • Einstein convention forbids this; write an explicit \Sigma if needed
  • Example: A_{iij} B_i has index i appearing three times - not valid

Consistency rule - free indices must be the same on both sides of an equation:

v_i = A_{ij} u_j \quad \checkmark \quad \text{(both sides have free index } i\text{)}

v_i = A_{jk} u_k \quad \times \quad \text{(left has free } i\text{; right has free } j\text{; inconsistent)}

This mechanical check catches dimension errors before code runs.

2.4 Index Ranges

Each index position has a defined range from the tensor's shape:

  • In A_{ij} where A \in \mathbb{R}^{m \times n}: i \in \{1,\ldots,m\}, j \in \{1,\ldots,n\}
  • Contracted indices must have matching ranges: A_{ik} B_{kj} is valid only if A's second dimension equals B's first dimension
  • This is exactly the matrix multiply shape compatibility condition: A \in \mathbb{R}^{m \times p}, B \in \mathbb{R}^{p \times n}; k ranges over \{1,\ldots,p\}

Shape inference - from index structure alone, the result shape is determined:

  • A_{ik} B_{kj}: i ranges over m, j ranges over n -> result shape (m, n)
  • v_i w_i: no free indices -> result shape () - scalar
  • T_{ijk} U_{jl} V_{kl}: j, k, and l are all dummy; i is free -> result shape (n_i,) - vector
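
Shape inference can be checked mechanically with np.einsum - a small sketch with made-up shapes:

import numpy as np

A = np.random.rand(3, 4)          # A_ik
B = np.random.rand(4, 5)          # B_kj
print(np.einsum("ik,kj->ij", A, B).shape)        # (3, 5): i, j free

v = np.random.rand(7)
w = np.random.rand(7)
print(np.einsum("i,i->", v, w).shape)            # (): no free indices -> scalar

T = np.random.rand(3, 4, 5)       # T_ijk
U = np.random.rand(4, 6)          # U_jl
V = np.random.rand(5, 6)          # V_kl
print(np.einsum("ijk,jl,kl->i", T, U, V).shape)  # (3,): only i is free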

2.5 Covariant and Contravariant Indices

In differential geometry and general relativity, upper (contravariant) and lower (covariant) indices are distinguished:

  • Upper index v^i: contravariant vector; transforms like coordinate differences
  • Lower index w_i: covariant vector; transforms like gradient components
  • Einstein convention in physics: sum over one upper and one lower index only: A_{ij} B^j{}_k
  • Metric tensor g_{ij}: raises and lowers indices; v_i = g_{ij} v^j

For machine learning: we almost always ignore the up/down distinction. All indices are effectively lower. No metric tensor is needed because we work in flat Euclidean space where g_{ij} = \delta_{ij}. When reading physics literature, understand the convention; in ML, treat all indices as lower.

COVARIANT vs CONTRAVARIANT - PHYSICS vs ML
===================================================================

  PHYSICS (General Relativity):
    Upper index:  v^i         (contravariant - transforms one way)
    Lower index:  w_i         (covariant - transforms the other)
    Contraction:  v^i w_i     (one up, one down -> valid)
    Metric:       v_i = g_ij v^j   (raises/lowers indices)

  MACHINE LEARNING:
    All indices:  v_i, w_j    (no up/down distinction)
    Contraction:  v_i w_i     (two of same name -> valid)
    Metric:       g_ij = \delta_ij (identity - flat space)

  -> In ML, upper/lower distinction collapses. All that matters:
    - Same index twice = sum over it
    - Same index once  = free dimension

3. Basic Operations in Index Notation

3.1 Scalar Multiplication

Scalar times vector: (sv)_i = s \cdot v_i

Scalar times matrix: (sA)_{ij} = s \cdot A_{ij}

The scalar s has no indices; it multiplies each component. No summation occurs - there is no repeated index.

3.2 Vector Addition

(u + v)_i = u_i + v_i

Same free index i on both terms. No repeated index, no summation. Purely elementwise.

3.3 Inner Product (Dot Product)

s = u_i v_i = \sum_i u_i v_i

Both u and v carry index i. Since i appears twice, it is a dummy index - summed over. The result s is a scalar (no free indices).

Equivalently: s = u^\top v = \langle u, v \rangle

Sanity check: the result has no free indices -> the result must be a scalar OK

3.4 Outer Product

C_{ij} = u_i v_j

No repeated index - i appears once, j appears once. Both are free. No summation. The result C \in \mathbb{R}^{m \times n} where m = \dim(u), n = \dim(v).

Equivalently: C = u v^\top

Each entry C_{ij} = u_i \times v_j. This always produces a rank-1 matrix.

Key distinction from dot product: u_i v_i (same index -> sum -> scalar) vs u_i v_j (different indices -> no sum -> matrix).

3.5 Matrix-Vector Multiply

w_i = A_{ij} v_j = \sum_j A_{ij} v_j
  • j is dummy (summed): connects A's columns to v's entries
  • i is free (result index): selects the row of A and the row of the result w
  • w \in \mathbb{R}^m, A \in \mathbb{R}^{m \times n}, v \in \mathbb{R}^n; j ranges over n OK

This is the standard matrix-vector product w = Av.

3.6 Matrix-Matrix Multiply

C_{ij} = A_{ik} B_{kj} = \sum_k A_{ik} B_{kj}
  • k is dummy: summed over; connects A's columns to B's rows
  • i, j are free: i selects the row of A, j selects the column of B
  • C \in \mathbb{R}^{m \times n}, A \in \mathbb{R}^{m \times p}, B \in \mathbb{R}^{p \times n}; k ranges over p OK

Equivalently: C = AB

3.7 Transpose

(A^\top)_{ij} = A_{ji}

Simply swap the index labels. No summation; pure relabelling. Note that A_{ji} and A_{ij} are different expressions - different index order means transposed access.

3.8 Trace

\text{tr}(A) = A_{ii} = \sum_i A_{ii}

The same index appears twice on the same tensor - sum over the diagonal elements. The result is a scalar; no free index. Valid only for square matrices (both index ranges must be equal).

Trace of product:

\text{tr}(AB) = A_{ij} B_{ji} = A_{ik} B_{ki}

Both pairs (i, j) or (i, k) are dummy -> scalar. This equals \text{tr}(BA) - rename the dummy indices to verify: A_{ij} B_{ji} = B_{ji} A_{ij} = B_{ij} A_{ji} (relabel i \leftrightarrow j) = \text{tr}(BA) OK

3.9 Frobenius Inner Product

\langle A, B \rangle_F = A_{ij} B_{ij} = \sum_i \sum_j A_{ij} B_{ij}

Both i and j are dummy - double summation. The result is a scalar. Equivalently:

\langle A, B \rangle_F = \text{tr}(A^\top B) = (A^\top)_{ji} B_{ij} = A_{ij} B_{ij} \quad \checkmark

Frobenius norm: \|A\|_F^2 = A_{ij} A_{ij}

3.10 Element-wise (Hadamard) Product

(A \odot B)_{ij} = A_{ij} B_{ij}

Both A and B share indices i, j. But here i, j are free - both appear in the result. No summation; elementwise multiplication.

Critical distinction: The expression A_{ij} B_{ij} looks the same as the Frobenius inner product. The difference is whether i, j are free or dummy - which depends on whether the result carries those indices:

Einsum string | i, j status | Operation               | Result
"ij,ij->ij"   | free        | Elementwise multiply    | Matrix
"ij,ij->"     | dummy       | Frobenius inner product | Scalar

In standard mathematical notation, context determines interpretation. In einsum, the output string makes it explicit.

COMPLETE BASIC OPERATIONS REFERENCE
===================================================================

  Operation          Index Form       Einsum     Free  Dummy  Result
  -----------------  --------------   ---------  ----  -----  ------
  Scalar multiply    s*v_i            scalar     i     -      vector
  Vector add         u_i + v_i        -          i     -      vector
  Dot product        u_i v_i          "i,i->"    -     i      scalar
  Outer product      u_i v_j          "i,j->ij"  i,j   -      matrix
  Matrix-vector      A_ij v_j         "ij,j->i"  i     j      vector
  Matrix multiply    A_ik B_kj        "ik,kj->ij" i,j  k      matrix
  Transpose          A_ji             "ij->ji"   i,j   -      matrix
  Trace              A_ii             "ii->"     -     i      scalar
  Frobenius inner    A_ij B_ij        "ij,ij->"  -     i,j    scalar
  Hadamard product   A_ij B_ij        "ij,ij->ij" i,j  -      matrix
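
The Hadamard/Frobenius ambiguity from Section 3.10 disappears once the output indices are written down. A brief numpy sketch (the arrays are illustrative):

import numpy as np

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)

had  = np.einsum("ij,ij->ij", A, B)   # i, j free       -> elementwise matrix
frob = np.einsum("ij,ij->", A, B)     # i, j contracted -> scalar
assert np.allclose(had, A * B)
assert np.allclose(frob, np.sum(A * B))
assert np.allclose(np.einsum("ii->", A), np.trace(A))    # trace
assert np.allclose(np.einsum("ij->ji", A), A.T)          # transpose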

4. Contraction Operations

4.1 What Is Contraction?

Contraction is the operation of summing over one or more pairs of repeated indices, reducing tensor rank by 2 for each contracted pair.

Given input tensors of ranks r_1 and r_2, contracting over k pairs yields a result of rank:

\text{result rank} = r_1 + r_2 - 2k

Inputs          | Ranks | Contracted pairs | Result rank | Operation
scalar x scalar | 0 + 0 | 0                | 0           | Scalar multiply
vector x vector | 1 + 1 | 1                | 0           | Dot product
matrix x vector | 2 + 1 | 1                | 1           | Linear map
matrix x matrix | 2 + 2 | 1                | 2           | Matrix multiply
matrix x matrix | 2 + 2 | 2                | 0           | Frobenius inner product
matrix (trace)  | 2     | 1 (self)         | 0           | Diagonal sum

4.2 Single Tensor Contraction (Trace-Like)

Contracting two indices of the same tensor generalises the matrix trace:

T_{ijk} \xrightarrow{\text{contract } i,j} S_k = T_{iik} = \sum_i T_{iik}

Result: rank-1 tensor (vector). k is free; i is dummy.

This is a partial trace - tracing over a subset of indices. It generalises the matrix trace to higher-order tensors.

AI example: In multi-head attention, computing per-head statistics involves contracting over query/key positions while keeping the head dimension free.

4.3 Two-Tensor Contraction

General form: contract tensors A (rank p) and B (rank q) over k shared indices -> result rank p + q - 2k.

Contraction          | Index form               | p + q - 2k    | Operation
1 pair               | A_{ik} B_{kj} \to C_{ij} | 2 + 2 - 2 = 2 | Matrix multiply
2 pairs              | A_{ij} B_{ij} \to s      | 2 + 2 - 4 = 0 | Frobenius inner product
2 pairs (mixed rank) | A_{ijk} B_{jk} \to v_i   | 3 + 2 - 4 = 1 | Vector result

4.4 Contraction Order and Computational Cost

For the expression A_{ij} B_{jk} C_{kl} (three matrices), there are two possible contraction orders:

(AB)C - left to right:

  1. Compute D_{ik} = A_{ij} B_{jk}: cost O(m \cdot p \cdot n) multiplications
  2. Compute D_{ik} C_{kl}: cost O(m \cdot n \cdot q) multiplications

A(BC) - right to left:

  1. Compute E_{jl} = B_{jk} C_{kl}: cost O(p \cdot n \cdot q) multiplications
  2. Compute A_{ij} E_{jl}: cost O(m \cdot p \cdot q) multiplications

For rectangular tensors, different orders can differ by orders of magnitude.

Example: A \in \mathbb{R}^{1000 \times 10}, B \in \mathbb{R}^{10 \times 10}, C \in \mathbb{R}^{10 \times 1000}

  • (AB)C: 1000 \times 10 \times 10 + 1000 \times 10 \times 1000 = 10^5 + 10^7 = 10.1 \times 10^6
  • A(BC): 10 \times 10 \times 1000 + 1000 \times 10 \times 1000 = 10^5 + 10^7 = 10.1 \times 10^6

But: A \in \mathbb{R}^{1000 \times 1000}, B \in \mathbb{R}^{1000 \times 2}, C \in \mathbb{R}^{2 \times 1000}

  • (AB)C: 1000^2 \times 2 + 1000 \times 2 \times 1000 = 2 \times 10^6 + 2 \times 10^6 = 4 \times 10^6
  • A(BC): 1000 \times 2 \times 1000 + 1000^2 \times 1000 = 2 \times 10^6 + 10^9 <- roughly 250x slower!

Finding the optimal contraction order is NP-hard in general, but greedy approaches work well in practice. The opt_einsum library finds near-optimal contraction orders for multi-tensor einsum expressions.
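
A quick way to see the path choice is np.einsum_path, which reports the contraction order and estimated cost. A sketch using the shapes from the second example above:

import numpy as np

A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 2)
C = np.random.rand(2, 1000)

# einsum_path reports the chosen order and estimated FLOP counts;
# here it typically contracts A and B first (~4e6 multiplies) instead of
# B and C first, which would lead to an ~1e9-multiply final step.
path, info = np.einsum_path("ij,jk,kl->il", A, B, C, optimize="optimal")
print(info)

D = np.einsum("ij,jk,kl->il", A, B, C, optimize="optimal")
assert np.allclose(D, A @ B @ C)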

4.5 The Generalised Contraction Formula

For tensors A_{i_1 \ldots i_p} and B_{j_1 \ldots j_q}, contracting over a shared index set S:

C_{\text{free indices}} = \sum_{\alpha \in S} A_{\ldots\alpha\ldots} B_{\ldots\alpha\ldots}
  • Free indices: union of all non-contracted indices from both tensors
  • Result shape: determined by free index ranges
  • This is the most general single-step operation expressible in einsum
CONTRACTION VISUALISED
===================================================================

Matrix Multiply: C_ij = A_ik B_kj

  A (rank 2)      B (rank 2)        C (rank 2)
  +---+           +---+             +---+
  |   |--k------k-|   |    =        |   |
  |   |           |   |             |   |
  +---+           +---+             +---+
   <-> i              <-> j              <-> i  <-> j
   (free)           (free)           (free indices)

  k is contracted (summed) -> disappears from result
  i, j are free -> appear in result

Frobenius Inner Product: s = A_ij B_ij

  A (rank 2)      B (rank 2)        s (rank 0)
  +---+           +---+
  |   |--i------i-|   |    =        * (scalar)
  |   |--j------j-|   |
  +---+           +---+

  Both i and j contracted -> no free indices -> scalar result

5. The Kronecker Delta and Levi-Civita Symbol

5.1 Kronecker Delta

The Kronecker delta is the identity matrix expressed in index notation:

\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}

As a matrix: \delta_{ij} = I_{ij} - the identity matrix.

The key property - index substitution (index killing):

A_{i\ldots} \delta_{ij} = A_{j\ldots}

Proof: \sum_i A_i \delta_{ij} = A_j \times 1 = A_j (only the term with i = j survives the sum).

This is the most important property in all of index notation. It says: contracting any tensor with \delta_{ij} replaces index i with index j (or vice versa). The delta "kills" the summed index and substitutes the other.

Other properties:

  • Trace of identity: \delta_{ii} = \sum_i 1 = n (dimension of the index space)
  • Acts as identity: \delta_{ij} u_j = u_i (same as Iu = u in matrix notation)
  • Idempotent: \delta_{ij}\delta_{jk} = \delta_{ik} (identity times identity = identity)

5.2 Kronecker Delta Identities

Identity                                   | Proof                                                  | Matrix equivalent
\delta_{ij}\delta_{jk} = \delta_{ik}       | \sum_j \delta_{ij}\delta_{jk}: only j = i = k survives | I \cdot I = I
\delta_{ii} = n                            | \sum_{i=1}^n 1 = n                                     | \text{tr}(I) = n
\delta_{ij} A_{jk} = A_{ik}                | Index substitution: j \to i                            | IA = A
A_{ij} \delta_{jk} = A_{ik}                | Index substitution: j \to k                            | AI = A
\delta_{ij} A_{ij} = A_{ii} = \text{tr}(A) | Substitution reduces to trace                          | \text{tr}(IA) = \text{tr}(A)

AI use: The Kronecker delta appears in every gradient derivation. When you differentiate C_{ij} = A_{ik} B_{kj} with respect to A_{lm}, the derivative is \partial A_{ik} / \partial A_{lm} = \delta_{il}\delta_{km} - the delta selects which element you're differentiating. This is the engine of backpropagation derivations.
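
The substitution property is easy to verify numerically: the Kronecker delta is just np.eye. A small sketch (shapes are illustrative):

import numpy as np

A = np.random.rand(5, 7)
delta = np.eye(5)                            # delta_ij

# delta_ij A_jk = A_ik  (index substitution / "index killing")
assert np.allclose(np.einsum("ij,jk->ik", delta, A), A)
# delta_ii = n (trace of the identity)
assert np.einsum("ii->", delta) == 5
# delta_ij A_ij = A_ii = tr(A) for square A
S = np.random.rand(5, 5)
assert np.allclose(np.einsum("ij,ij->", np.eye(5), S), np.trace(S))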

5.3 Levi-Civita Symbol (Permutation Symbol)

The Levi-Civita symbol is a completely antisymmetric tensor. In 3D:

\varepsilon_{ijk} = \begin{cases} +1 & \text{if } (i,j,k) \text{ is an even permutation of } (1,2,3) \\ -1 & \text{if } (i,j,k) \text{ is an odd permutation of } (1,2,3) \\ 0 & \text{if any two indices are equal} \end{cases}

Even permutations (cyclic): (1,2,3), (2,3,1), (3,1,2) -> \varepsilon = +1

Odd permutations (one swap): (1,3,2), (3,2,1), (2,1,3) -> \varepsilon = -1

The symbol is non-zero only when all three indices are distinct. In n dimensions, \varepsilon_{i_1 i_2 \ldots i_n} generalises to n! non-zero values out of n^n possible.

5.4 Levi-Civita Applications

Cross product:

(u \times v)_i = \varepsilon_{ijk} u_j v_k = \sum_j \sum_k \varepsilon_{ijk} u_j v_k

The index structure encodes the cross product formula: i is free (result component), j and k are dummy (summed over).

Determinant (3x3):

\det(A) = \varepsilon_{ijk} A_{1i} A_{2j} A_{3k}

All of i, j, k are dummy -> the result is a scalar. The Levi-Civita symbol selects only the six permutations with the correct signs.

Key identity - \varepsilon-\delta contraction:

\varepsilon_{ijk} \varepsilon_{ilm} = \delta_{jl}\delta_{km} - \delta_{jm}\delta_{kl}

This allows simplifying double Levi-Civita contractions into Kronecker deltas - essential for proofs involving cross products and curl operations.

AI context: The Levi-Civita symbol is less directly used than the Kronecker delta in ML. It appears in:

  • Determinant computations for normalising flows (\log |\det J|)
  • Theoretical analyses of rotation equivariance
  • Differential geometry of neural manifolds
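
A brief numpy sketch of the cross-product formula, building \varepsilon_{ijk} explicitly (using permutation-matrix determinants to get the signs is just one convenient construction):

import numpy as np
from itertools import permutations

eps = np.zeros((3, 3, 3))
for p in permutations(range(3)):
    i, j, k = p
    # determinant of the permutation matrix = sign of the permutation (+1 even, -1 odd)
    eps[i, j, k] = np.linalg.det(np.eye(3)[list(p)])

u = np.random.rand(3)
v = np.random.rand(3)
# (u x v)_i = eps_ijk u_j v_k  (j, k contracted; i free)
assert np.allclose(np.einsum("ijk,j,k->i", eps, u, v), np.cross(u, v))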

5.5 Generalised Kronecker Delta

The generalised Kronecker delta extends the concept to multi-index antisymmetry:

\delta^{i_1 i_2 \ldots i_k}_{j_1 j_2 \ldots j_k} = \begin{cases} +1 & \text{if } (i_1,\ldots,i_k) \text{ is an even permutation of } (j_1,\ldots,j_k) \\ -1 & \text{if odd permutation} \\ 0 & \text{otherwise} \end{cases}

This expresses antisymmetrisation and symmetrisation operators, and connects to the Levi-Civita symbol:

\varepsilon_{ijk} \varepsilon_{lmn} = \delta^{ijk}_{lmn} = \delta_{il}\delta_{jm}\delta_{kn} + \delta_{im}\delta_{jn}\delta_{kl} + \delta_{in}\delta_{jl}\delta_{km} - \delta_{il}\delta_{jn}\delta_{km} - \delta_{im}\delta_{jl}\delta_{kn} - \delta_{in}\delta_{jm}\delta_{kl}

6. Index Manipulation Rules

6.1 Renaming Dummy Indices

Dummy indices are bound variables - any name works. This is alpha-equivalence, just like in lambda calculus.

A_{ij} B_{jk} = A_{il} B_{lk} \quad \text{(rename } j \to l \text{; identical expression)}

Rules for renaming:

  1. Rename consistently throughout the entire term: both occurrences change together
  2. Don't rename to an index already in use (avoid clashes)
  3. Free indices cannot be renamed within a term without renaming on both sides of the equation

Use renaming to:

  • Avoid index clashes when combining expressions
  • Make contraction structure explicit
  • Prepare expressions for interchange or simplification

6.2 Symmetry and Antisymmetry

Symmetric tensor: A_{ij} = A_{ji} (unchanged under index swap)

  • Examples: metric tensor g_{ij}; Hessian \partial^2 f / \partial x_i \partial x_j; covariance matrix \Sigma_{ij}; Gram matrix G_{ij} = x_i^\top x_j

Antisymmetric tensor: A_{ij} = -A_{ji} (sign flip under index swap)

  • Examples: Levi-Civita \varepsilon_{ijk}; cross product matrix [\omega]_\times
  • Consequence: A_{ii} = -A_{ii} \implies A_{ii} = 0 - diagonal elements are always zero

Critical identity - antisymmetric contracted with symmetric vanishes:

If A_{ij} is antisymmetric and S_{ij} is symmetric:

A_{ij} S_{ij} = 0

Proof:

A_{ij} S_{ij} = A_{ji} S_{ji} \quad \text{(rename } i \leftrightarrow j\text{)} = (-A_{ij}) S_{ij} \quad \text{(antisymmetry of } A\text{, symmetry of } S\text{)}

\implies 2 A_{ij} S_{ij} = 0 \implies A_{ij} S_{ij} = 0 \quad \checkmark

AI relevance: Attention score matrices S_{ij} = Q_i \cdot K_j are generally not symmetric (queries \neq keys). Weight matrices are not symmetric. Understanding symmetry properties helps simplify gradient expressions.

6.3 Symmetrisation and Antisymmetrisation

Any rank-2 tensor decomposes uniquely into symmetric and antisymmetric parts:

T_{ij} = T_{(ij)} + T_{[ij]}

where:

Symmetric part:

T_{(ij)} = \frac{1}{2}(T_{ij} + T_{ji})

Antisymmetric part:

T_{[ij]} = \frac{1}{2}(T_{ij} - T_{ji})

For higher-rank tensors, symmetrisation/antisymmetrisation extends to any subset of indices:

T_{(ij)k} = \frac{1}{2}(T_{ijk} + T_{jik})

AI application: Decomposing the attention weight gradient into symmetric and antisymmetric components can reveal which parts drive updates to query vs key projections independently.

6.4 Index Raising and Lowering (Physics)

In differential geometry, the metric tensor g_{ij} converts between covariant and contravariant indices:

v_i = g_{ij} v^j \quad \text{(lower index)}

v^i = g^{ij} v_j \quad \text{(raise index, using the inverse metric } g^{ij}\text{)}

In flat Euclidean space: g_{ij} = \delta_{ij} -> raising and lowering are trivial (no change).

ML context: All ML computations happen in Euclidean space (or spaces treated as Euclidean). The metric is the identity; the upper/lower distinction collapses. When reading physics papers, mentally replace g_{ij} \to \delta_{ij}.

Exception - natural gradient: The Fisher information matrix F_{ij} acts as a metric on parameter space. In natural gradient descent, F_{ij} plays the role of g_{ij}, and the "covariant gradient" \tilde{\nabla}_i = F^{ij} \nabla_j differs from the ordinary gradient. This is one place where the physics convention is genuinely useful in ML.

6.5 Partial Derivatives in Index Notation

Partial derivatives get their own index notation, making gradient calculations compact:

Notation   | Standard form                                       | Einstein form           | Result
Gradient   | (\nabla f)_i = \partial f / \partial x_i            | \partial_i f            | Vector
Jacobian   | J_{ij} = \partial f_i / \partial x_j                | \partial_j f_i          | Matrix
Hessian    | H_{ij} = \partial^2 f / (\partial x_i \partial x_j) | \partial_i \partial_j f | Matrix (symmetric)
Divergence | \nabla \cdot v = \sum_i \partial v_i / \partial x_i | \partial_i v_i          | Scalar (i is dummy)
Laplacian  | \nabla^2 f = \sum_i \partial^2 f / \partial x_i^2   | \partial_i \partial_i f | Scalar (i is dummy)

The divergence \partial_i v_i has i appearing twice -> summed -> scalar. The Laplacian \partial_i \partial_i f has the same structure - a "trace" of second derivatives.

AI connection: Gradient computation in index notation is the foundation of Section 7. The Jacobian-vector product J_{ij} v_j = \partial_j f_i \cdot v_j is exactly what automatic differentiation computes in forward mode. The vector-Jacobian product u_i J_{ij} = u_i \partial_j f_i is reverse mode (backpropagation).


7. Einstein Notation for Gradients and Backpropagation

Index notation turns gradient derivations from pattern matching into mechanical computation. The core tool is the Kronecker delta - the derivative of a tensor component with respect to itself.

7.1 Gradient of Linear Form

f(x) = a_i x_i (dot product a \cdot x)

\frac{\partial f}{\partial x_j} = \frac{\partial}{\partial x_j}(a_i x_i) = a_i \frac{\partial x_i}{\partial x_j} = a_i \delta_{ij} = a_j

The \delta_{ij} kills the sum over i, substituting j for i. Result: \nabla_x(a^\top x) = a OK

7.2 Gradient of Quadratic Form

f(x) = x_i A_{ij} x_j (quadratic form x^\top A x)

\frac{\partial f}{\partial x_k} = \frac{\partial}{\partial x_k}(x_i A_{ij} x_j)

Apply the product rule (both x_i and x_j depend on x_k):

= \frac{\partial x_i}{\partial x_k} A_{ij} x_j + x_i A_{ij} \frac{\partial x_j}{\partial x_k} = \delta_{ik} A_{ij} x_j + x_i A_{ij} \delta_{jk} = A_{kj} x_j + x_i A_{ik} = (Ax)_k + (A^\top x)_k = (A + A^\top)_{kj} x_j

When A is symmetric (A_{ij} = A_{ji}): \frac{\partial f}{\partial x_k} = 2 A_{kj} x_j = 2(Ax)_k

\boxed{\nabla_x(x^\top A x) = (A + A^\top)x \quad \text{; for symmetric } A\text{: } 2Ax}
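
A quick autograd check of this result - the quadratic form written as a three-operand einsum, with illustrative sizes:

import torch

torch.manual_seed(0)
A = torch.randn(5, 5)
x = torch.randn(5, requires_grad=True)

f = torch.einsum("i,ij,j->", x, A, x)     # f = x_i A_ij x_j
f.backward()

assert torch.allclose(x.grad, (A + A.T) @ x, atol=1e-5)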

7.3 Gradient of Matrix Multiply (A side)

Loss L; output C = AB where A \in \mathbb{R}^{m \times p}, B \in \mathbb{R}^{p \times n}; C_{ij} = A_{ik} B_{kj}.

Given upstream gradient \frac{\partial L}{\partial C_{ij}}, find \frac{\partial L}{\partial A_{lm}}:

\frac{\partial L}{\partial A_{lm}} = \frac{\partial L}{\partial C_{ij}} \frac{\partial C_{ij}}{\partial A_{lm}}

Compute the inner derivative:

\frac{\partial C_{ij}}{\partial A_{lm}} = \frac{\partial}{\partial A_{lm}}(A_{ik} B_{kj}) = \delta_{il}\delta_{km} B_{kj} = \delta_{il} B_{mj}

Substitute back:

\frac{\partial L}{\partial A_{lm}} = \frac{\partial L}{\partial C_{ij}} \delta_{il} B_{mj} = \frac{\partial L}{\partial C_{lj}} B_{mj}

In index notation: \left(\frac{\partial L}{\partial A}\right)_{lm} = \left(\frac{\partial L}{\partial C}\right)_{lj} B_{mj}

\boxed{\frac{\partial L}{\partial A} = \frac{\partial L}{\partial C} \cdot B^\top}

This is the standard backpropagation formula - derived mechanically using only \delta_{ij} substitution.

7.4 Gradient of Matrix Multiply (B side)

\frac{\partial L}{\partial B_{mn}} = \frac{\partial L}{\partial C_{ij}} \frac{\partial C_{ij}}{\partial B_{mn}}

\frac{\partial C_{ij}}{\partial B_{mn}} = \frac{\partial}{\partial B_{mn}}(A_{ik} B_{kj}) = A_{im}\delta_{jn}

\frac{\partial L}{\partial B_{mn}} = \frac{\partial L}{\partial C_{ij}} A_{im}\delta_{jn} = A_{im} \frac{\partial L}{\partial C_{in}}

\boxed{\frac{\partial L}{\partial B} = A^\top \cdot \frac{\partial L}{\partial C}}
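
Both boxed formulas can be checked against autograd. A minimal torch sketch, with an arbitrary matrix G standing in for the upstream gradient \partial L / \partial C:

import torch

torch.manual_seed(0)
A = torch.randn(3, 4, requires_grad=True)
B = torch.randn(4, 5, requires_grad=True)
C = A @ B                                  # C_ij = A_ik B_kj

G = torch.randn(3, 5)                      # upstream gradient dL/dC
C.backward(G)

assert torch.allclose(A.grad, G @ B.T)     # dL/dA_lm = (dL/dC)_lj B_mj
assert torch.allclose(B.grad, A.T @ G)     # dL/dB_mn = A_im (dL/dC)_in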

7.5 Gradient of Dot Product

s = a_i b_i

\frac{\partial s}{\partial a_j} = b_i \delta_{ij} = b_j \implies \nabla_a(a^\top b) = b

\frac{\partial s}{\partial b_j} = a_i \delta_{ij} = a_j \implies \nabla_b(a^\top b) = a

7.6 Gradient of Trace

f = \text{tr}(AB) = A_{ij} B_{ji}

\frac{\partial f}{\partial A_{lm}} = \frac{\partial}{\partial A_{lm}}(A_{ij} B_{ji}) = \delta_{il}\delta_{jm} B_{ji} = B_{ml} = (B^\top)_{lm}

\boxed{\nabla_A \text{tr}(AB) = B^\top}

Similarly:

  • \nabla_A \text{tr}(A^\top B) = B
  • \nabla_A \text{tr}(A^\top A) = 2A (since A appears twice, both occurrences contribute)

7.7 Gradient of Softmax in Index Notation

Softmax: p_v = \frac{\exp(z_v)}{\sum_w \exp(z_w)} = \frac{\exp(z_v)}{Z}

The Jacobian \frac{\partial p_v}{\partial z_w} is derived via the quotient rule:

\frac{\partial p_v}{\partial z_w} = \frac{\frac{\partial \exp(z_v)}{\partial z_w} \cdot Z - \exp(z_v) \cdot \frac{\partial Z}{\partial z_w}}{Z^2}

where \frac{\partial \exp(z_v)}{\partial z_w} = \delta_{vw} \exp(z_v) and \frac{\partial Z}{\partial z_w} = \exp(z_w):

= \frac{\delta_{vw}\exp(z_v) Z - \exp(z_v)\exp(z_w)}{Z^2} = p_v\delta_{vw} - p_v p_w

\boxed{\frac{\partial p_v}{\partial z_w} = p_v(\delta_{vw} - p_w)}

Gradient of loss through softmax:

\frac{\partial L}{\partial z_w} = \sum_v \frac{\partial L}{\partial p_v} \frac{\partial p_v}{\partial z_w} = \sum_v \frac{\partial L}{\partial p_v} p_v(\delta_{vw} - p_w) = p_w \frac{\partial L}{\partial p_w} - p_w \sum_v p_v \frac{\partial L}{\partial p_v}

For cross-entropy loss with true class y: \frac{\partial L}{\partial p_v} = -1/p_v for v = y, zero otherwise. This simplifies to:

\frac{\partial L}{\partial z_w} = p_w - \mathbb{1}[w = y]

The elegant result: the gradient of cross-entropy loss through softmax is just predicted probability minus the one-hot target. This simplicity comes directly from the index-notation derivation.

GRADIENT DERIVATION CHEAT SHEET
===================================================================

  Function                  Gradient via Index Notation
  ------------------------  --------------------------------------
  f = a_i x_i               \partialf/\partialx_j = a_j          (linear)
  f = x_i A_ij x_j          \partialf/\partialx_k = (A+A^T)_kj x_j (quadratic)
  C_ij = A_ik B_kj          \partialL/\partialA_lm = (\partialL/\partialC)_lj B_mj   (A side)
                             \partialL/\partialB_mn = A_im (\partialL/\partialC)_in    (B side)
  f = tr(AB)                 \partialf/\partialA_lm = B_ml = (B^T)_lm
  p_v = softmax(z)_v         \partialp_v/\partialz_w = p_v(\delta_vw - p_w)

  TOOL: \partialT_ab/\partialT_cd = \delta_ac \delta_bd  (Kronecker delta - the engine)
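
A small autograd check of the p - one-hot result, using torch's built-in cross-entropy (the logit size and class index are illustrative):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
z = torch.randn(6, requires_grad=True)
y = torch.tensor(2)                          # true class

loss = F.cross_entropy(z.unsqueeze(0), y.unsqueeze(0))   # softmax + NLL
loss.backward()

p = torch.softmax(z.detach(), dim=0)
one_hot = torch.zeros(6)
one_hot[y] = 1.0
assert torch.allclose(z.grad, p - one_hot, atol=1e-6)    # dL/dz = p - 1[w=y]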

8. Batched Operations and Batch Indices

8.1 Introducing the Batch Index

In ML, the same operation is applied to B independent examples simultaneously. The batch dimension is represented by adding a leading index b \in \{1, \ldots, B\}.

Batched matrix-vector (per-sample weights):

w_{bi} = A_{bij} v_{bj}

For each batch element b, compute A_b v_b independently.

Shared weight matrix (common in neural nets):

w_{bi} = A_{ij} v_{bj}

A has no b index - the same weight matrix is applied to every sample. v_{bj} varies per batch element.

In einsum: "ij,bj->bi" - b is free (appears in the output), j is dummy (contracted).

8.2 Batched Matrix Multiply

C_{bij} = A_{bik} B_{bkj}

Each batch element b has its own A_b and B_b; multiply independently. k is dummy; b, i, j are free.

Einsum: "bik,bkj->bij"

PyTorch: torch.bmm(A, B) for rank-3 tensors.
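
A one-line equivalence check (shapes are illustrative):

import torch

A = torch.randn(8, 3, 4)    # A_bik
B = torch.randn(8, 4, 5)    # B_bkj
assert torch.allclose(torch.einsum("bik,bkj->bij", A, B), torch.bmm(A, B))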

8.3 Multi-Head Attention in Index Notation

Full multi-head attention with explicit indices:

  • Q, K, V \in \mathbb{R}^{B \times H \times n \times d} where B = batch, H = heads, n = sequence length, d = head dimension

Attention scores:

S_{bhij} = \frac{Q_{bhia} K_{bhja}}{\sqrt{d}}
  • a is dummy (summed over the head dimension -> dot product)
  • b, h, i, j are free (per batch, per head, per query-key pair)
  • Einsum: "bhid,bhjd->bhij" (this is QK^\top per head)

After softmax (applied over j for each b, h, i):

\alpha_{bhij} = \text{softmax}_j(S_{bhij})

Attention output:

O_{bhia} = \alpha_{bhij} V_{bhja}
  • j is dummy (weighted sum over values)
  • b, h, i, a are free (per batch, per head, per position, per dimension)
  • Einsum: "bhij,bhjd->bhid"

8.4 Broadcasting as Implicit Index Repetition

NumPy/PyTorch broadcasting corresponds to an index not appearing on one tensor - it is implicitly repeated:

Operation (index notation)  | Broadcast index | Broadcasting rule
A_{bi} + c_i                | b absent on c   | Same bias for all batch elements
(Av)_{bi} = A_{bij} v_j     | b absent on v   | Same vector for all batch elements
x_{bti} + \text{PE}_{ti}    | b absent on PE  | Same positional encoding for all batches

Understanding broadcasting as implicit indexing prevents shape errors. If you write the index notation first and check which tensors carry which indices, the broadcasting rules become obvious.

8.5 Reduction Operations

Summing over an index eliminates it from the result:

Operation         | Index notation              | Einsum      | Description
Sum over batch    | s_i = x_{bi} (b dummy)      | "bi->i"     | Sum features across the batch
Mean over batch   | \mu_i = \frac{1}{B} x_{bi}  | "bi->i" / B | Batch mean
Sum over sequence | s_{bi} = x_{bti} (t dummy)  | "bti->bi"   | Sum over positions
Global sum        | s = x_{bti} (b, t, i dummy) | "bti->"     | Total sum -> scalar

Pooling operations are all forms of reduction over specific index dimensions. Average pooling = sum + normalise. Max pooling breaks the einsum pattern (non-linear) but is still a reduction over an index.


9. Einsum in Practice

9.1 Einsum String Syntax

The einsum function (available in NumPy, PyTorch, JAX, TensorFlow) implements Einstein summation directly:

Format: "input1_indices, input2_indices, ... -> output_indices"

Rules:

  • Input operands separated by commas; output after arrow (->)
  • Each character is an index label (single letter)
  • Index appearing on multiple inputs -> contraction (sum) if not in output
  • Index in output -> free (kept in result)
  • Index omitted from output -> dummy (summed over)

Example: "ik,kj->ij" means Cij=AikBkjC_{ij} = A_{ik}B_{kj} - matrix multiply. Index kk appears on both inputs but not in output -> contracted.

9.2 Standard Operations as Einsum Strings

Operation                  | Mathematical       | Einsum string  | Notes
Dot product                | a_i b_i            | "i,i->"        | Both indices contracted
Outer product              | a_i b_j            | "i,j->ij"      | No contraction
Matrix-vector              | A_{ij} v_j         | "ij,j->i"      | j contracted
Matrix multiply            | A_{ik} B_{kj}      | "ik,kj->ij"    | k contracted
Batched matmul             | A_{bik} B_{bkj}    | "bik,bkj->bij" | k contracted; b free
Transpose                  | A_{ji}             | "ij->ji"       | Relabel only
Trace                      | A_{ii}             | "ii->"         | Self-contraction
Row sum                    | \sum_j A_{ij}      | "ij->i"        | j contracted
Column sum                 | \sum_i A_{ij}      | "ij->j"        | i contracted
Element-wise multiply      | A_{ij} B_{ij}      | "ij,ij->ij"    | No contraction
Frobenius inner product    | A_{ij} B_{ij}      | "ij,ij->"      | Both contracted
Attention scores (QK^\top) | Q_{ia} K_{ja}      | "ia,ja->ij"    | a contracted
Attention output           | \alpha_{ij} V_{ja} | "ij,ja->ia"    | j contracted
Quadratic form             | x_i A_{ij} x_j     | "i,ij,j->"     | i, j contracted
Bilinear form              | x_i M_{ij} y_j     | "i,ij,j->"     | i, j contracted
Batch outer product        | a_{bi} b_{bj}      | "bi,bj->bij"   | b free
Diagonal extraction        | A_{ii}             | "ii->i"        | Diagonal as vector

9.3 Multi-Operand Einsum

Einsum supports more than two input tensors:

torch.einsum("ijk,jl,km->ilm", A, B, C)

This computes D_{ilm} = A_{ijk} B_{jl} C_{km} - contracting j (between A and B) and k (between A and C) simultaneously.

Contraction order matters. For the three-tensor contraction A_{ij} B_{jk} C_{kl} \to D_{il}:

  • Option 1: (AB)C - compute E_{ik} = A_{ij} B_{jk} first, then E_{ik} C_{kl}
  • Option 2: A(BC) - compute F_{jl} = B_{jk} C_{kl} first, then A_{ij} F_{jl}

For rectangular tensors, one order can be orders of magnitude faster. The opt_einsum library finds near-optimal contraction paths automatically:

import numpy as np
import opt_einsum

# Example operands (shapes chosen so the subscripts line up)
A, B, C = np.random.rand(10, 20, 30), np.random.rand(20, 5), np.random.rand(30, 7)

# Finds a near-optimal contraction order for large multi-tensor expressions
path, info = opt_einsum.contract_path("ijk,jl,km->ilm", A, B, C)
result = opt_einsum.contract("ijk,jl,km->ilm", A, B, C, optimize=path)

9.4 Implicit vs Explicit Mode

Implicit mode (no arrow): output indices = all indices appearing exactly once, in alphabetical order.

  • "ij,jk" -> implicit output "ik" (jj appears twice -> contracted)
  • Convenient shorthand; may be ambiguous for complex expressions

Explicit mode (with arrow): output indices specified exactly.

  • "ij,jk->ki" -> transposed result (column-major output)
  • Always use explicit mode for clarity

Rule of thumb: Always use explicit mode in production code. Implicit mode is fine for quick interactive exploration but introduces ambiguity in complex expressions.

9.5 Performance Considerations

  • Einsum vs manual matmul: Modern frameworks compile einsum to optimised CUDA kernels. For standard patterns (matmul, bmm), performance is equivalent.
  • Fused operations: Einsum can fuse multiple contractions, avoiding materialisation of intermediate tensors. Example: "bhid,bhjd->bhij" computes attention scores without explicitly computing the transpose.
  • Limitations: Not all einsum patterns are equally optimised on all GPUs. For critical paths (attention in production), manually optimised kernels (FlashAttention) outperform generic einsum.
  • Memory: Einsum can be memory-efficient by not materialising intermediates, but can also consume unexpected memory for large multi-operand expressions.

9.6 Common Pitfalls in Code

Pitfall                       | Problem                                          | Fix
Shape mismatch                | A_{ik} B_{kj} requires A.shape[1] == B.shape[0]  | Check shapes before einsum
Wrong index string            | "ij,jk->ij" is NOT matmul                        | Should be "ij,jk->ik"
Missing batch dim             | "ij,jk->ik" on batched input                     | Use "bij,bjk->bik"
Implicit contraction surprise | Accidentally sharing index names                 | Use explicit output ->
Integer tensors               | Einsum with int tensors may give wrong results   | Cast to float first
Repeated index, same tensor   | "ij,ij->ij" vs "ii->"                            | Understand the difference

10. Tensor Products and Decompositions

10.1 Tensor Product (Outer Product Generalisation)

The tensor product of A (rank p) and B (rank q) produces a rank-(p+q) tensor:

(A \otimes B)_{i_1 \ldots i_p j_1 \ldots j_q} = A_{i_1 \ldots i_p} B_{j_1 \ldots j_q}

No contraction - all indices are free. The result carries all indices of both operands.

Special cases:

  • Outer product of vectors: (u \otimes v)_{ij} = u_i v_j -> rank-2 (matrix)
  • Outer product of matrices: (A \otimes B)_{ijkl} = A_{ij} B_{kl} -> rank-4
  • Kronecker product: specific arrangement of the matrix tensor product

AI connection - LoRA: The low-rank update \Delta W = BA where B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times d} is a sum of r rank-1 outer products: \Delta W_{ij} = B_{ik} A_{kj} = \sum_{k=1}^r B_{ik} A_{kj}. The index k ranges over the LoRA rank r, making the low-rank structure explicit.
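
A small sketch of the LoRA update as an einsum over the rank index (the values of d and r are illustrative, and the scaling factor used in practice is omitted):

import torch

d, r = 512, 8
B = torch.randn(d, r) * 0.01     # B in R^{d x r}
A = torch.randn(r, d) * 0.01     # A in R^{r x d}

delta_W = torch.einsum("ik,kj->ij", B, A)      # k runs over the LoRA rank r
assert torch.linalg.matrix_rank(delta_W) <= r  # update has rank at most r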

10.2 Tensor Unfolding (Matricisation)

Reshaping a rank-r tensor into a matrix by selecting which indices form rows vs columns:

Mode-k unfolding T_{(k)}: index k forms the rows; all other indices (combined) form the columns.

\text{Shape: } n_k \times (n_1 \times \cdots \times n_{k-1} \times n_{k+1} \times \cdots \times n_r)

Example: T \in \mathbb{R}^{3 \times 4 \times 5}

  • Mode-1 unfolding: 3 \times 20 matrix
  • Mode-2 unfolding: 4 \times 15 matrix
  • Mode-3 unfolding: 5 \times 12 matrix

Used in Tucker decomposition, tensor-train networks, and model compression. The unfolding is the bridge between higher-order tensor operations and standard matrix algorithms (SVD, eigendecomposition).

10.3 CP Decomposition (CANDECOMP/PARAFAC)

Decompose a tensor as a sum of rank-1 tensors:

T_{ijk} = \sum_{r=1}^{R} \lambda_r \, a^{(1)}_{ir} \, a^{(2)}_{jr} \, a^{(3)}_{kr}

In Einstein notation: T_{ijk} = \lambda_r \, a^{(1)}_{ir} \, a^{(2)}_{jr} \, a^{(3)}_{kr} - the index r is dummy (summed over rank components).

Each term \lambda_r \, a^{(1)}_{ir} \, a^{(2)}_{jr} \, a^{(3)}_{kr} is a rank-1 tensor (an outer product of three vectors). R is the tensor rank - the minimum number of rank-1 terms needed for exact representation.

AI use: Low-rank CP decomposition for compressing embedding tables, convolutional filters, and MoE routing tensors.
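
Reconstructing a tensor from CP factors is a single einsum over the dummy rank index r - a sketch with made-up sizes:

import numpy as np

I, J, K, R = 4, 5, 6, 3
lam = np.random.rand(R)
a = np.random.rand(I, R)
b = np.random.rand(J, R)
c = np.random.rand(K, R)

# r is the dummy index; i, j, k are free
T = np.einsum("r,ir,jr,kr->ijk", lam, a, b, c)
print(T.shape)   # (4, 5, 6)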

10.4 Tucker Decomposition

T_{ijk} = G_{abc} \, U^{(1)}_{ia} \, U^{(2)}_{jb} \, U^{(3)}_{kc}

where a, b, c are dummy indices (contracted):

  • G = core tensor (small, captures interactions between modes)
  • U^{(n)} = factor matrices (orthogonal, capture principal components per mode)

This generalises SVD to higher-order tensors. Truncating the core tensor dimensions yields compression.

AI use: Compressing convolutional filters (4D tensors: output channels \times input channels \times height \times width). Compressing transformer weight tensors. The Tucker structure explicitly shows which modes interact through the core tensor.

10.5 Tensor Train (TT) Decomposition / Matrix Product States

Decompose a rank-r tensor as a chain of 3-index cores:

T_{i_1 i_2 \ldots i_r} = G^{(1)}_{i_1 \alpha_1} G^{(2)}_{\alpha_1 i_2 \alpha_2} G^{(3)}_{\alpha_2 i_3 \alpha_3} \cdots G^{(r)}_{\alpha_{r-1} i_r}
  • \alpha_k: bond indices (contracted between adjacent cores)
  • i_k: physical indices (free; correspond to the original tensor dimensions)
  • Bond dimension \chi = \max(\text{range of } \alpha_k): controls approximation quality

Compression ratio: An order-r tensor of mode size n stores n^r values. TT format stores O(r \, n \, \chi^2) parameters - exponential compression when \chi is small.

AI connections:

  • Tensor train for embedding tables: compress a |V| \times d embedding into a chain of small matrices
  • Matrix product states for language modelling (theoretical connection to autoregressive models)
  • Quantum-inspired ML: DMRG-like training of tensor network classifiers
TENSOR DECOMPOSITION OVERVIEW
===================================================================

  CP Decomposition:
    T_ijk = \Sigma_r \lambda_r * a_ir * b_jr * c_kr
    -> Sum of rank-1 outer products
    -> Parameters: R(n_1 + n_2 + n_3)

  Tucker Decomposition:
    T_ijk = G_abc * U_ia * V_jb * W_kc
    -> Core tensor + factor matrices
    -> Parameters: r_1r_2r_3 + n_1r_1 + n_2r_2 + n_3r_3

  Tensor Train:
    T_{i_1i_2...i_n} = G^1_{i_1\alpha_1} * G^2_{\alpha_1i_2\alpha_2} * ... * G^n_{\alpha_{n-1}i_n}
    -> Chain of 3-index cores
    -> Parameters: O(n*d*\chi^2) for order-d, mode-n, bond-\chi

  SVD (matrix, rank-2):
    A_ij = U_ik \Sigma_k V_jk
    -> Special case of all three decompositions

  LoRA (practical):
    \DeltaW_ij = B_ir A_rj    (r = LoRA rank, typically 4-64)
    -> Rank-r matrix update; r is the dummy index

11. Index Notation for Common AI Architectures

11.1 Fully Connected Layer

Forward pass:

h_i = \sigma(z_i) \quad \text{where} \quad z_i = W_{ij} x_j + b_i
  • j is dummy: summed over the input dimension
  • i is free: indexes output neurons
  • \sigma is the activation function (ReLU, GELU, etc.), applied elementwise

Gradient w.r.t. weights:

\frac{\partial L}{\partial W_{ij}} = \frac{\partial L}{\partial h_i} \sigma'(z_i) x_j = \delta_i x_j

where \delta_i = \frac{\partial L}{\partial h_i} \sigma'(z_i) is the error signal at neuron i.

In matrix form: \frac{\partial L}{\partial W} = \delta x^\top - an outer product of error and input.

Gradient w.r.t. input:

\frac{\partial L}{\partial x_j} = \delta_i W_{ij} = (W^\top \delta)_j

The index i is dummy (summed over output neurons). This propagates the error backward through the layer.

11.2 Embedding Lookup

E \in \mathbb{R}^{|V| \times d}: embedding matrix; token t \in \{1, \ldots, |V|\}: discrete index.

Forward:

e_i = E_{ti}

Select row t from E. Index i is free (embedding dimension); t is a fixed value, not a summation index.

Gradient:

\frac{\partial L}{\partial E_{vi}} = \frac{\partial L}{\partial e_i} \cdot \mathbb{1}[v = t] = \delta_{vt} \frac{\partial L}{\partial e_i}

Only row t has a non-zero gradient - the Kronecker delta \delta_{vt} enforces sparsity. This is why embedding updates use sparse gradient operations rather than dense matrix updates.
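
A short torch sketch showing the \delta_{vt} sparsity - only the looked-up row receives gradient (vocabulary size and token id are illustrative):

import torch

V, d = 10, 4
E = torch.randn(V, d, requires_grad=True)
t = 3                                   # token id

e = E[t]                                # forward: e_i = E_ti
loss = e.sum()
loss.backward()

nonzero_rows = (E.grad.abs().sum(dim=1) > 0).nonzero().flatten()
print(nonzero_rows)                     # tensor([3]) - delta_vt sparsity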

11.3 Layer Normalisation

Input: x_i (index i over feature dimension d).

Statistics (computed over features):

\mu = \frac{1}{d} \sum_i x_i = \frac{1}{d} x_i \quad (i \text{ is dummy}) \qquad \sigma^2 = \frac{1}{d} \sum_i (x_i - \mu)^2

Normalisation: \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \varepsilon}}

Output: y_i = \gamma_i \hat{x}_i + \beta_i (elementwise; \gamma, \beta are learnable scale and shift)

Gradient complexity: \frac{\partial L}{\partial x_i} involves sums over j of gradient terms because \mu and \sigma^2 depend on all x_j. The gradient is non-local - changing any single x_j affects the normalisation of every x_i. In index notation:

\frac{\partial L}{\partial x_i} = \frac{1}{\sqrt{\sigma^2 + \varepsilon}} \left[ \gamma_i \frac{\partial L}{\partial y_i} - \frac{1}{d}\sum_j \gamma_j \frac{\partial L}{\partial y_j} - \hat{x}_i \cdot \frac{1}{d}\sum_j \gamma_j \frac{\partial L}{\partial y_j} \hat{x}_j \right]

The sums over j (dummy index) make the non-locality explicit.

11.4 Causal Self-Attention - Complete Index Derivation

This is the central computation of the transformer. Every step in index notation:

Input: X \in \mathbb{R}^{n \times d} (a sequence of n tokens, each d-dimensional)

Step 1: Projections

Q_{ia} = X_{ib} W^Q_{ba} \qquad K_{ia} = X_{ib} W^K_{ba} \qquad V_{ia} = X_{ib} W^V_{ba}

Index b is dummy (summed over the model dimension d). Indices i (position) and a (head dimension d_k) are free.

Step 2: Raw scores

S_{ij} = \frac{Q_{ia} K_{ja}}{\sqrt{d_k}}

Index a is dummy (dot product over the head dimension). Indices i (query position) and j (key position) are free. This is QK^\top / \sqrt{d_k}.

Step 3: Causal mask

\tilde{S}_{ij} = S_{ij} - \infty \cdot \mathbb{1}[j > i]

Mask out future positions. Both i, j remain free.

Step 4: Attention weights

\alpha_{ij} = \frac{\exp(\tilde{S}_{ij})}{\sum_l \exp(\tilde{S}_{il})}

Index l is dummy in the denominator (sum over key positions). Softmax is applied per query position i.

Step 5: Attention output

O_{ia} = \alpha_{ij} V_{ja}

Index j is dummy (weighted sum over value positions). Indices i (position) and a (head dimension) are free.

Step 6: Output projection

Y_{ia} = O_{ib} W^O_{ba}

Index b is dummy. This projects the concatenated head outputs back to the model dimension.

11.5 Feed-Forward Network (SwiGLU)

The modern transformer FFN uses gated linear units:

Input: x_i (index i over model dimension d)

Gate projection: g_j = W^g_{ji} x_i (j over the FFN dimension d_{ff}; i is dummy)

Up projection: u_j = W^u_{ji} x_i (j over d_{ff}; i is dummy)

SwiGLU activation: h_j = \text{Swish}(g_j) \cdot u_j (elementwise; j is free; no summation)

Down projection: y_i = W^d_{ij} h_j (j is dummy; i is free)

Index flow: d-dim input -> d_{ff}-dim hidden (via two parallel projections) -> d-dim output. The index dimensions trace the information flow explicitly.

11.6 Gradient of Attention Output

Given upstream gradient \frac{\partial L}{\partial O_{ia}}, derive gradients w.r.t. V, \alpha, and S:

Gradient w.r.t. V:

\frac{\partial L}{\partial V_{ja}} = \sum_i \frac{\partial L}{\partial O_{ia}} \alpha_{ij} = \alpha_{ij} \frac{\partial L}{\partial O_{ia}}

In matrix form: \frac{\partial L}{\partial V} = \alpha^\top \frac{\partial L}{\partial O} OK (i is dummy; j, a are free)

Gradient w.r.t. \alpha:

\frac{\partial L}{\partial \alpha_{ij}} = \sum_a \frac{\partial L}{\partial O_{ia}} V_{ja} = \left(\frac{\partial L}{\partial O} \cdot V^\top\right)_{ij}

Gradient w.r.t. S (through softmax):

\frac{\partial L}{\partial S_{ij}} = \alpha_{ij}\left[\frac{\partial L}{\partial \alpha_{ij}} - \sum_l \alpha_{il} \frac{\partial L}{\partial \alpha_{il}}\right]

This is the softmax Jacobian from Section 7.7 applied to each row i. These are exactly the formulas implemented in FlashAttention's backward pass.


12. Symmetries, Invariances, and Equivariances

12.1 Permutation Invariance in Index Notation

A function f is permutation-invariant if f(\{x_i\}) = f(\{x_{\pi(i)}\}) for any permutation \pi. In index form: the output depends only on the set \{x_j\}, not on their order.

Example - sum pooling: s_a = \sum_i X_{ia} = X_{ia} (i is dummy)

The sum doesn't change if we permute the rows of X. Any reduction over the position index i that doesn't depend on position yields a permutation-invariant result.

12.2 Equivariance via Index Structure

A function f: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d} is permutation-equivariant if permuting the inputs permutes the outputs the same way:

f(PX)_{ia} = (P f(X))_{ia}

where P is a permutation matrix (P_{ij} = \delta_{i\pi(j)}).

Attention without positional encoding is permutation-equivariant. Proof via index notation:

Permute the input: X'_{ia} = P_{ib} X_{ba} (apply the permutation to positions).

Compute Q'_{ia} = X'_{ib} W^Q_{ba} = P_{ic} X_{cb} W^Q_{ba} = P_{ic} Q_{ca}

Similarly: K'_{ia} = P_{ic} K_{ca} and V'_{ia} = P_{ic} V_{ca}

Scores: S'_{ij} = Q'_{ia} K'_{ja} = P_{ic} Q_{ca} P_{jd} K_{da} = P_{ic} P_{jd} S_{cd}

After the softmax (which preserves the permutation structure): \alpha'_{ij} = P_{ic} P_{jd} \alpha_{cd}

Output: O'_{ia} = \alpha'_{ij} V'_{ja} = P_{ic} P_{jd} \alpha_{cd} P_{je} V_{ea}

Since P_{jd} P_{je} = \delta_{de} (orthogonality of permutation matrices): O'_{ia} = P_{ic} \alpha_{cd} V_{da} = P_{ic} O_{ca}

Therefore O' = PO: the output is permuted in exactly the same way as the input.

With positional encoding: the added encoding \text{PE}_{ia} breaks equivariance because it depends on the position i directly - after permuting the tokens, each token is paired with a different encoding, so the output is no longer a simple permutation of the original.
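
The equivariance argument is easy to check numerically. A small sketch, assuming unmasked single-head attention built purely from einsum calls (all names below are local to this example):

import numpy as np

rng = np.random.default_rng(0)
T, d, d_k = 5, 8, 4
X = rng.normal(size=(T, d))
WQ, WK, WV = (rng.normal(size=(d, d_k)) for _ in range(3))

def attention(X):
    Q = np.einsum('id,da->ia', X, WQ)
    K = np.einsum('id,da->ia', X, WK)
    V = np.einsum('id,da->ia', X, WV)
    S = np.einsum('ia,ja->ij', Q, K) / np.sqrt(d_k)
    alpha = np.exp(S - S.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    return np.einsum('ij,ja->ia', alpha, V)

perm = rng.permutation(T)
P = np.eye(T)[perm]                      # permutation matrix: (P X)_i = X_{perm(i)}

print(np.allclose(attention(P @ X), P @ attention(X)))  # True: f(PX) = P f(X)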

12.3 Invariant and Equivariant Aggregation

| Aggregation | Index Form | Property | Use in AI |
|---|---|---|---|
| Sum pooling | s_a = X_{ia} (i dummy) | Permutation-invariant | Set functions, DeepSets |
| Mean pooling | s_a = \frac{1}{n} X_{ia} | Permutation-invariant | Sentence embeddings |
| Max pooling | s_a = \max_i X_{ia} | Permutation-invariant | PointNet |
| Attention | O_{ia} = \alpha_{ij} V_{ja} | Permutation-equivariant | Set Transformer |

The presence or absence of the position index i in the output determines whether the operation is invariant (no i) or equivariant (keeps i).

12.4 Group Equivariance in Index Notation

For a group G acting on inputs via a representation \rho:

(\rho(g)X)_{ia} = X_{o(g,i),\,a}

where o(g,i) applies the group element g to position i.

A function f is G-equivariant if:

f(\rho(g)X)_{ia} = (\rho(g)f(X))_{ia} = f(X)_{o(g,i),\,a}

Constraint on weights: a linear map W_{ij} is G-equivariant iff:

W_{ij} = W_{o(g,i),\,o(g,j)} \quad \forall g \in G

This constrains W to a specific shared-weight structure:

| Group G | Constraint | Result |
|---|---|---|
| Cyclic (translations) | W_{ij} = W_{i+1,j+1} | Circular convolution |
| Full symmetric (S_n) | W_{ij} same for all i \neq j; W_{ii} same for all i | Sum/mean pooling |
| Rotation (SO(3)) | W commutes with rotation matrices | Spherical harmonics convolution |

Index notation makes these symmetry constraints explicit and verifiable - you can check equivariance by substituting the group action into the index expression.
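
A quick check of the cyclic-group row: a circulant matrix (W_{ij} depending only on (j - i) mod n) satisfies the constraint W_{ij} = W_{i+1,j+1} and therefore commutes with a cyclic shift. This is a minimal illustration under those assumptions, not a general equivariance test:

import numpy as np

n = 6
w = np.random.randn(n)                          # one row of filter weights
W = np.array([[w[(j - i) % n] for j in range(n)] for i in range(n)])  # circulant: W_ij = w_{(j-i) mod n}

# constraint W_ij = W_{i+1, j+1} (indices mod n)
assert np.allclose(W, np.roll(np.roll(W, 1, axis=0), 1, axis=1))

x = np.random.randn(n)
shift = lambda v: np.roll(v, 1)                 # cyclic translation of positions
print(np.allclose(shift(W @ x), W @ shift(x)))  # True: shift-equivariant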


13. Advanced Index Manipulations

13.1 Diagrammatic Notation (Penrose / Tensor Networks)

Tensor network diagrams represent tensors and their contractions visually:

  • Each tensor is a node (box or circle) with one wire (line) per index
  • Contraction: connect wires between tensors (shared index = connected wire)
  • Free indices: dangling wires (not connected to another tensor)
  • Trace: loop a wire back to the same tensor
TENSOR NETWORK DIAGRAMS
===================================================================

  Vector v_i:          Matrix M_ij:         Rank-3 T_ijk:
      | i                  | i  | j              | i  | j
      *                    +-+                   +-+
                           |M|                   |T|---- k
                           +-+                   +-+

  Matrix multiply C_ij = A_ik B_kj:
      | i           | j
      +-+     k     +-+           k is contracted (internal wire)
      |A|-----------|B|           i, j are free (dangling wires)
      +-+           +-+

  Trace: tr(A) = A_ii:
      +------+
      |  +-+ |                    Wire loops back -> self-contraction
      +--|A|-+                    No dangling wires -> scalar
         +-+

  Frobenius: A_ij B_ij:
      +-+  i,j   +-+             Both wires connected
      |A|========|B|             -> all contracted -> scalar
      +-+        +-+

AI applications of tensor networks:

  • Matrix Product States (MPS) for language modelling: autoregressive probability as tensor train
  • MERA (Multi-scale Entanglement Renormalisation Ansatz): hierarchical tensor decomposition
  • Tensor network contraction for efficient inference in graphical models

13.2 Differential Forms and the Wedge Product

Differential k-forms are completely antisymmetric rank-k tensors \omega_{i_1 \ldots i_k}.

Wedge product: (\alpha \wedge \beta)_{i_1 \ldots i_k j_1 \ldots j_l} = antisymmetrisation of \alpha_{i_1 \ldots i_k} \beta_{j_1 \ldots j_l}

Exterior derivative: d\omega of a k-form is a (k+1)-form:

(d\omega)_{i_0 i_1 \ldots i_k} = \sum_{a=0}^{k} (-1)^a \partial_{i_a} \omega_{i_0 \ldots \hat{i}_a \ldots i_k}

where \hat{i}_a means "omit index i_a".

Connection to ML:

  • Normalising flows require computing \log|\det J| where J is the Jacobian. The determinant is expressible via the top exterior power: \det(J) = J_{1 i_1} J_{2 i_2} \cdots J_{n i_n} \varepsilon_{i_1 i_2 \ldots i_n} (a numerical check follows this list)
  • Riemannian geometry of neural network loss surfaces uses differential forms for curvature tensors
  • The Jacobian log-determinant in variational inference is a wedge product computation
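
A small numerical check of the Levi-Civita expression for a 3x3 Jacobian. Building the epsilon tensor explicitly is purely didactic; it is not how log-determinants are computed in practice:

import numpy as np
from itertools import permutations

n = 3
eps = np.zeros((n, n, n))
for perm in permutations(range(n)):
    # parity of the permutation via its inversion count
    inversions = sum(perm[a] > perm[b] for a in range(n) for b in range(a + 1, n))
    eps[perm] = (-1) ** inversions

J = np.random.randn(n, n)
det_via_eps = np.einsum('i,j,k,ijk->', J[0], J[1], J[2], eps)   # J_1i J_2j J_3k eps_ijk
print(np.isclose(det_via_eps, np.linalg.det(J)))                # True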

13.3 Tensor Symmetrisation Operators

Symmetriser SS and antisymmetriser AA as projection operators in index space:

S^{kl}_{ij} = \frac{1}{2}(\delta^k_i \delta^l_j + \delta^l_i \delta^k_j) \qquad A^{kl}_{ij} = \frac{1}{2}(\delta^k_i \delta^l_j - \delta^l_i \delta^k_j)

Applied to a rank-2 tensor T:

  • (ST)_{ij} = S^{kl}_{ij} T_{kl} = \frac{1}{2}(T_{ij} + T_{ji}) - the symmetric part
  • (AT)_{ij} = A^{kl}_{ij} T_{kl} = \frac{1}{2}(T_{ij} - T_{ji}) - the antisymmetric part

Projection properties:

  • S^2 = S and A^2 = A (idempotent - projecting twice = projecting once)
  • S + A = I (every tensor = symmetric part + antisymmetric part)
  • SA = AS = 0 (orthogonal projections) - see the numerical check after this list
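
A quick NumPy check of these projection properties, building S and A from Kronecker deltas exactly as defined above:

import numpy as np

n = 4
I = np.eye(n)
Sym  = 0.5 * (np.einsum('ik,jl->ijkl', I, I) + np.einsum('il,jk->ijkl', I, I))
Anti = 0.5 * (np.einsum('ik,jl->ijkl', I, I) - np.einsum('il,jk->ijkl', I, I))

T = np.random.randn(n, n)
ST = np.einsum('ijkl,kl->ij', Sym, T)     # (ST)_ij = (T_ij + T_ji) / 2
AT = np.einsum('ijkl,kl->ij', Anti, T)    # (AT)_ij = (T_ij - T_ji) / 2

assert np.allclose(ST, 0.5 * (T + T.T)) and np.allclose(AT, 0.5 * (T - T.T))
assert np.allclose(ST + AT, T)                                       # S + A = I
assert np.allclose(np.einsum('ijkl,klmn->ijmn', Sym, Sym), Sym)      # S^2 = S
assert np.allclose(np.einsum('ijkl,klmn->ijmn', Sym, Anti), 0.0)     # SA = 0
print("all projection properties hold")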

13.4 Spectral Methods in Index Notation

Eigendecomposition (symmetric A):

A_{ij} = \lambda_k v_{ik} v_{jk}

Here k is dummy - summed over eigenvalues. v_{ik} is the i-th component of the k-th eigenvector. In matrix form: A = V \Lambda V^\top.

SVD:

A_{ij} = U_{ik} \Sigma_k V_{jk}

k ranges over the rank. \Sigma_k is shorthand for the diagonal \Sigma_{kl} = \sigma_k \delta_{kl}, which collapses to a single index when contracted.

Low-rank approximation: truncate the sum to k \leq r -> a rank-r approximation.

LoRA in index notation:

\Delta W_{ij} = B_{ik} A_{kj} \quad (k \text{ ranges over the rank } r)

The index k is the bottleneck. Parameters: (d_1 + d_2) \times r instead of d_1 \times d_2. For typical values (d_1 = d_2 = 4096, r = 16): compression ratio = 4096^2 / (2 \times 4096 \times 16) = 128\times.
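
A sketch of the LoRA update written with einsum so the bottleneck index is explicit. The dimensions here are smaller than the 4096/16 example above; the point is the contraction order, which applies the low-rank factors to the activation first so the full \Delta W is never materialised:

import numpy as np

d1, d2, r = 1024, 1024, 16
W0 = np.random.randn(d1, d2)     # frozen pretrained weight
B  = np.random.randn(d1, r)      # Delta W_ij = B_ik A_kj
A  = np.random.randn(r, d2)
x  = np.random.randn(d2)

# y_i = W0_ij x_j + B_ik (A_kj x_j): the intermediate A x has only r components
y = np.einsum('ij,j->i', W0, x) + np.einsum('ik,kj,j->i', B, A, x, optimize=True)

print((d1 + d2) * r, "trainable parameters vs", d1 * d2)   # 32768 vs 1048576 -> 32x fewer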

13.5 The Score Function and Stein's Lemma

Score function: s_i(x) = \frac{\partial}{\partial x_i} \log p(x) = \partial_i \log p(x)

Stein's identity (integration by parts):

\mathbb{E}_p[f(x)\, s_i(x)] = -\mathbb{E}_p\left[\frac{\partial f}{\partial x_i}\right]

In index notation with a tensor-valued f:

\mathbb{E}_p[f_a(x)\, s_i(x)] = -\mathbb{E}_p[\partial_i f_a(x)]
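
A Monte Carlo sanity check of Stein's identity for a standard Gaussian, where p = N(0, I) and the score is simply s_i(x) = -x_i (the test function below is an arbitrary choice for illustration):

import numpy as np

rng = np.random.default_rng(0)
N = 200_000
x = rng.normal(size=(N, 2))                     # samples from p = N(0, I), so s_i(x) = -x_i

f  = lambda x: np.exp(x[:, 0]) * x[:, 1] ** 2   # scalar test function
df = [lambda x: np.exp(x[:, 0]) * x[:, 1] ** 2,       # partial_0 f
      lambda x: 2 * np.exp(x[:, 0]) * x[:, 1]]        # partial_1 f

for i in range(2):
    lhs = np.mean(f(x) * (-x[:, i]))            # E[ f(x) s_i(x) ]
    rhs = -np.mean(df[i](x))                    # -E[ partial_i f(x) ]
    print(f"i={i}: lhs={lhs:+.3f}  rhs={rhs:+.3f}")   # equal up to Monte Carlo noise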

AI applications:

  • Score-based generative models (diffusion models): learn s_i(x, t) \approx \nabla_{x_i} \log p_t(x) directly; sample via Langevin dynamics x_{t+1,i} = x_{t,i} + \eta\, s_i(x_t) + \sqrt{2\eta}\,\xi_i
  • REINFORCE (policy gradient): \nabla_\theta J = \mathbb{E}_\pi[R \cdot \nabla_\theta \log \pi(a|s)] is a score-function estimator; in index form: \partial_\theta J = \mathbb{E}[R \cdot s_\theta]
  • Stein Variational Gradient Descent (SVGD): uses the score function with a kernel to approximate posterior inference

14. Common Mistakes

14.1 Mistake Table

| # | Mistake | Why Wrong | Correct |
|---|---|---|---|
| 1 | A_{ij}B_{ij} = (\sum_i A_{ij})(\sum_j B_{ij}) | Sums over BOTH i and j simultaneously; each pair (i, j) contributes A_{ij}B_{ij} | A_{ij}B_{ij} = \sum_i \sum_j A_{ij}B_{ij} = Frobenius inner product |
| 2 | A_{ij}B_{ji} = A_{ij}B_{ij} | B_{ji} is the transpose of B_{ij}; different unless B is symmetric | A_{ij}B_{ji} = \text{tr}(AB) while A_{ij}B_{ij} = \text{tr}(A^\top B) |
| 3 | Free index mismatch: v_i = A_{jk}u_k | Left has free i; right has free j; inconsistent | Free indices must match on both sides |
| 4 | Triple index: A_{iii}B_i | Index i appears three times - ambiguous | Write an explicit \Sigma; the Einstein convention requires exactly two |
| 5 | \delta_{ij} means "set i = j everywhere" | \delta_{ij} acts as an index substitutor when contracted; it doesn't globally equate i and j | A_{ki}\delta_{ij} = A_{kj} - i is replaced by j in A; the \delta disappears |
| 6 | "ij,jk->ij" is matrix multiply | This keeps both i and j in the output; j is NOT contracted | Matrix multiply is "ij,jk->ik" |
| 7 | Outer product has a repeated index | a_i b_j has DIFFERENT index names; no repetition = no contraction | Different indices = free = outer product |
| 8 | Tensor rank = matrix rank | Tensor rank (number of indices) \neq matrix rank (dimension of the column space) | Completely different concepts; "rank" is overloaded |
| 9 | Renaming a free index changes the expression | Renaming consistently on BOTH sides is valid (alpha-equivalence) | v_i = A_{ij}u_j and v_k = A_{kj}u_j are identical |
| 10 | Einstein notation only works for two tensors | Multi-tensor contractions follow the same rules | A_{ij}B_{jk}C_{kl} \to D_{il} contracts two pairs |

14.2 The Einsum Debugging Checklist

When an einsum expression gives unexpected results, check:

  1. Index count: Does each index appear the correct number of times? (Exactly once = free; exactly twice = contracted)
  2. Shape compatibility: Do contracted indices have matching dimensions?
  3. Output indices: Are the output indices exactly what you want? (Missing one = extra contraction; extra one = missing contraction)
  4. Index collision: Did you accidentally reuse an index name? ("ij,ik->jk" contracts over i; is that intended?)
  5. Batch dimension: Does the batch index appear in all the right places?
  6. Contraction vs elementwise: "ij,ij->" vs "ij,ij->ij" - contracted vs elementwise (see the sketch after this checklist)
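
A short illustration of items 3 and 6 (and mistake #6 from the table above): keeping a contracted index in the output, or dropping one you meant to keep, silently changes the operation while still running without error:

import numpy as np

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)

correct = np.einsum('ij,jk->ik', A, B)        # true matrix product: j contracted
assert np.allclose(correct, A @ B)

oops = np.einsum('ij,jk->ij', A, B)           # j kept -> NOT contracted; k dropped -> summed
assert oops.shape == (3, 4)
assert np.allclose(oops, A * B.sum(axis=1))   # = A_ij * sum_k B_jk, not a matmul

C = np.random.randn(3, 4)
frob = np.einsum('ij,ij->', A, C)             # scalar: Frobenius inner product
elem = np.einsum('ij,ij->ij', A, C)           # (3, 4): Hadamard product, nothing summed
assert np.isclose(frob, elem.sum())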

14.3 Index Balance as Type Checking

Think of index balance like type checking in a programming language:

TYPE CHECKING ANALOGY
===================================================================

  Expression              Free indices    "Type"
  --------------------    -------------   --------------
  A_ij B_kj               i, k           Matrix (2 free)
  v_i                     i              Vector (1 free)
  A_ii                    (none)         Scalar (0 free)

  Type-correct equation:
    v_i = A_ij u_j        <- both sides: 1 free index (i)  OK

  Type error:
    v_i = A_jk u_k        <- left: free i; right: free j   NO
    s = A_ij B_jk          <- left: 0 free; right: i,k     NO

  -> If free indices don't match, the equation is WRONG.
  -> This is a mechanical check - no mathematical understanding needed.

15. Exercises

Exercise 1: Index Identification

For each expression below, identify all free indices, all dummy indices, the resulting shape, and write the equivalent np.einsum string.

| # | Expression | Free | Dummy | Shape | Einsum |
|---|---|---|---|---|---|
| 1 | c_i = A_{ij} b_j | | | | |
| 2 | C_{ik} = A_{ij} B_{jk} | | | | |
| 3 | s = v_i v_i | | | | |
| 4 | T_{ij} = u_i v_j | | | | |
| 5 | D_{il} = A_{ij} B_{jk} C_{kl} | | | | |
Solution
| # | Free | Dummy | Shape | Einsum |
|---|---|---|---|---|
| 1 | i | j | (n,) vector | ij,j->i |
| 2 | i, k | j | (n, p) matrix | ij,jk->ik |
| 3 | (none) | i | scalar | i,i-> |
| 4 | i, j | (none) | (n, m) matrix | i,j->ij |
| 5 | i, l | j, k | (n, q) matrix | ij,jk,kl->il |

Exercise 2: Kronecker Delta Manipulation

Simplify each expression using the substitution property \delta_{ij} x_j = x_i:

  1. \delta_{ij} \delta_{jk} = \;?
  2. \delta_{ij} A_{jk} \delta_{kl} = \;?
  3. \delta_{ii} where i ranges over 1, \dots, n = \;?
  4. \varepsilon_{ijk} \delta_{jk} = \;?
Solution
  1. \delta_{ij} \delta_{jk} = \delta_{ik}

    Contract over j: the first delta forces j = i; substituting gives \delta_{ik}.

  2. \delta_{ij} A_{jk} \delta_{kl} = A_{il}

    First delta: j \to i, giving A_{ik} \delta_{kl}. Second delta: k \to l, giving A_{il}.

  3. \delta_{ii} = n

    This is the trace of the identity matrix: \sum_{i=1}^{n} \delta_{ii} = \sum_{i=1}^{n} 1 = n.

  4. \varepsilon_{ijk} \delta_{jk} = 0

    \delta_{jk} is symmetric in (j, k); \varepsilon_{ijk} is antisymmetric in (j, k). Contracting a symmetric tensor with an antisymmetric one always gives zero.


Exercise 3: Gradient Derivation

Derive the gradient of each scalar loss with respect to the indicated variable using index notation. Show all steps.

3a. L = x_i A_{ij} x_j (quadratic form). Find \frac{\partial L}{\partial x_k}.

Solution

\frac{\partial L}{\partial x_k} = \frac{\partial}{\partial x_k} \left( x_i A_{ij} x_j \right)

Apply the product rule (A_{ij} is constant):

= \delta_{ik} A_{ij} x_j + x_i A_{ij} \delta_{jk} = A_{kj} x_j + x_i A_{ik}

In the second term, rename i \to j: x_j A_{jk}. So:

\frac{\partial L}{\partial x_k} = A_{kj} x_j + A_{jk} x_j = (A_{kj} + A_{jk}) x_j

\boxed{\nabla_x L = (A + A^T) x}

If A is symmetric (A = A^T), this simplifies to 2Ax.

3b. L = \| Y - X W \|_F^2 where Y_{ij}, X_{ik}, W_{kj} are matrices. Find \frac{\partial L}{\partial W_{pq}}.

Solution

Write L in index notation. Let R_{ij} = Y_{ij} - X_{ik} W_{kj} be the residual.

L = R_{ij} R_{ij} = (Y_{ij} - X_{ik} W_{kj})(Y_{ij} - X_{ik} W_{kj})

Differentiate with respect to W_{pq}:

\frac{\partial L}{\partial W_{pq}} = 2 R_{ij} \cdot \frac{\partial R_{ij}}{\partial W_{pq}}

Now \frac{\partial R_{ij}}{\partial W_{pq}} = -X_{ik} \delta_{kp} \delta_{jq} = -X_{ip} \delta_{jq}.

\frac{\partial L}{\partial W_{pq}} = 2 R_{ij} \cdot (-X_{ip} \delta_{jq}) = -2 X_{ip} R_{iq}

\boxed{\frac{\partial L}{\partial W} = -2 X^T (Y - XW)}

Setting this to zero gives the normal equation X^T X W = X^T Y.

3c. L = -y_j \log(\text{softmax}(z)_j) where \text{softmax}(z)_i = \frac{e^{z_i}}{\sum_k e^{z_k}} and y is one-hot. Find \frac{\partial L}{\partial z_m}.

Solution

Let p_i = \text{softmax}(z)_i. Then L = -y_j \log p_j. By the chain rule:

\frac{\partial L}{\partial z_m} = -y_j \frac{1}{p_j} \frac{\partial p_j}{\partial z_m}

The softmax Jacobian is \frac{\partial p_j}{\partial z_m} = p_j (\delta_{jm} - p_m). Substituting:

\frac{\partial L}{\partial z_m} = -y_j \frac{1}{p_j} \cdot p_j (\delta_{jm} - p_m) = -y_j (\delta_{jm} - p_m) = -y_m + p_m \sum_j y_j = -y_m + p_m \cdot 1 = p_m - y_m

\boxed{\frac{\partial L}{\partial z} = \text{softmax}(z) - y}

This elegant result is why cross-entropy + softmax is universally used: the gradient is simply the difference between predictions and targets.


Exercise 4: Einsum Implementation

Write NumPy einsum calls (and verify with standard NumPy operations) for:

4a. Batched trace: given A with shape (B, n, n), compute the trace of each matrix in the batch.

# Your solution:
traces = np.einsum('bii->b', A)

# Verification:
assert np.allclose(traces, np.trace(A, axis1=1, axis2=2))

4b. Bilinear form: given x with shape (n,), M with shape (n, n), y with shape (n,), compute s = x_i M_{ij} y_j.

# Your solution:
s = np.einsum('i,ij,j->', x, M, y)

# Verification:
assert np.isclose(s, x @ M @ y)

4c. Multi-head attention scores: given Q with shape (B, H, T, d_k) and K with shape (B, H, T, d_k), compute S_{bhij} = Q_{bhid} K_{bhjd}.

# Your solution:
S = np.einsum('bhid,bhjd->bhij', Q, K)

# Verification:
assert np.allclose(S, Q @ K.transpose(0, 1, 3, 2))

Exercise 5: Contraction Order Optimisation

Consider the expression C_{il} = A_{ij} B_{jk} D_{kl} where:

  • A is (10000, 512)
  • B is (512, 4)
  • D is (4, 512)

5a. Compute the total FLOPs for left-to-right evaluation: (A_{ij} B_{jk}) D_{kl}.

5b. Compute the total FLOPs for right-to-left evaluation: A_{ij} (B_{jk} D_{kl}).

5c. Which order is faster and by how much?

Solution

5a. Left-to-right: first compute T_{ik} = A_{ij} B_{jk}, shape (10000, 4):

  • FLOPs: 10000 \times 4 \times 512 = 20{,}480{,}000

Then C_{il} = T_{ik} D_{kl}, shape (10000, 512):

  • FLOPs: 10000 \times 512 \times 4 = 20{,}480{,}000

Total: \mathbf{40{,}960{,}000} FLOPs.

5b. Right-to-left: first compute E_{jl} = B_{jk} D_{kl}, shape (512, 512):

  • FLOPs: 512 \times 512 \times 4 = 1{,}048{,}576

Then C_{il} = A_{ij} E_{jl}, shape (10000, 512):

  • FLOPs: 10000 \times 512 \times 512 = 2{,}621{,}440{,}000

Total: \mathbf{2{,}622{,}488{,}576} FLOPs.

5c. Left-to-right is \approx 64\times faster! This mirrors the LoRA pattern: route the computation through the narrow rank index so intermediates stay small. The intermediate tensor T_{ik} has shape (10000, 4) - far smaller than E_{jl} at (512, 512).
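
The same conclusion can be read off np.einsum_path, which reports the pairwise contraction order and its estimated cost (the opt_einsum package exposes the same machinery); for these shapes it should pick the cheap left-to-right order:

import numpy as np

A = np.random.randn(10000, 512)
B = np.random.randn(512, 4)
D = np.random.randn(4, 512)

path, info = np.einsum_path('ij,jk,kl->il', A, B, D, optimize='optimal')
print(info)                                            # chosen order and FLOP estimate

C = np.einsum('ij,jk,kl->il', A, B, D, optimize=path)  # reuse the precomputed path
assert np.allclose(C, (A @ B) @ D)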


Exercise 6: Softmax Gradient Verification

6a. Derive the Jacobian \frac{\partial p_i}{\partial z_j} where p_i = \text{softmax}(z)_i using the quotient rule in index notation.

Solution

p_i = \frac{e^{z_i}}{S}, \quad S = \sum_k e^{z_k}

\frac{\partial p_i}{\partial z_j} = \frac{\delta_{ij} e^{z_i} \cdot S - e^{z_i} \cdot e^{z_j}}{S^2} = \frac{e^{z_i}}{S} \left( \delta_{ij} - \frac{e^{z_j}}{S} \right) = p_i (\delta_{ij} - p_j)

\boxed{J_{ij} = p_i (\delta_{ij} - p_j) = \text{diag}(p) - p p^T}

6b. Verify numerically with finite differences:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.random.randn(5)
p = softmax(z)

# Analytical Jacobian
J_analytical = np.diag(p) - np.outer(p, p)

# Numerical Jacobian (finite differences)
eps = 1e-7
J_numerical = np.zeros((5, 5))
for j in range(5):
    z_plus = z.copy(); z_plus[j] += eps
    z_minus = z.copy(); z_minus[j] -= eps
    J_numerical[:, j] = (softmax(z_plus) - softmax(z_minus)) / (2 * eps)

print("Max error:", np.max(np.abs(J_analytical - J_numerical)))
# Should be < 1e-6

Exercise 7: Symmetry Analysis

For each tensor expression, determine if it is symmetric, antisymmetric, or neither in the indicated indices.

| # | Expression | Indices | Answer |
|---|---|---|---|
| 1 | S_{ij} = A_{ik} A_{jk} (i.e. A A^T) | (i, j) | |
| 2 | T_{ij} = A_{ik} B_{kj} - A_{jk} B_{ki} | (i, j) | |
| 3 | R_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} (Hessian) | (i, j) | |
Solution
  1. Symmetric. Swap i \leftrightarrow j: S_{ji} = A_{jk} A_{ik} = A_{ik} A_{jk} = S_{ij}. This is A A^T, which is always symmetric.

  2. Antisymmetric. Swap i \leftrightarrow j: T_{ji} = A_{jk} B_{ki} - A_{ik} B_{kj} = -(A_{ik} B_{kj} - A_{jk} B_{ki}) = -T_{ij}.

  3. Symmetric (for C^2 functions, by Schwarz's theorem): \frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i} when f has continuous second partial derivatives.


Exercise 8: Full Attention Pass

Implement the complete forward and backward pass for single-head self-attention using only index notation and np.einsum.

Given: input X with shape (T, d) and weight matrices W^Q, W^K, W^V, each of shape (d, d_k).

Forward pass - write each step in index notation and einsum:

import numpy as np

T, d, d_k = 8, 16, 4
X = np.random.randn(T, d)
WQ = np.random.randn(d, d_k)
WK = np.random.randn(d, d_k)
WV = np.random.randn(d, d_k)

# Step 1: Queries, Keys, Values
# Q_ia = X_id WQ_da   ->   einsum: 'id,da->ia'
Q = np.einsum('id,da->ia', X, WQ)
K = np.einsum('id,da->ia', X, WK)
V = np.einsum('id,da->ia', X, WV)

# Step 2: Attention scores
# S_ij = Q_ia K_ja / sqrt(d_k)   ->   einsum: 'ia,ja->ij'
S = np.einsum('ia,ja->ij', Q, K) / np.sqrt(d_k)

# Step 3: Attention weights (softmax over j for each i)
S_max = S.max(axis=1, keepdims=True)
exp_S = np.exp(S - S_max)
alpha = exp_S / exp_S.sum(axis=1, keepdims=True)  # shape (T, T)

# Step 4: Output
# O_ia = alpha_ij V_ja   ->   einsum: 'ij,ja->ia'
O = np.einsum('ij,ja->ia', alpha, V)

Backward pass - given the upstream gradient \frac{\partial L}{\partial O_{ia}}:

dO = np.random.randn(T, d_k)  # upstream gradient

# Step 4 backward: dL/d(alpha) and dL/dV
# dL/dV_ja = alpha_ij * dO_ia   ->   einsum: 'ij,ia->ja'
dV = np.einsum('ij,ia->ja', alpha, dO)
# dL/d(alpha_ij) = dO_ia * V_ja   ->   einsum: 'ia,ja->ij'
dalpha = np.einsum('ia,ja->ij', dO, V)

# Step 3 backward: softmax gradient per row i, with the 1/sqrt(d_k) scaling from
# Step 2 folded in, so dS below is the gradient w.r.t. the unscaled scores Q K^T
# dL/dS_ij = alpha_ij * (dalpha_ij - sum_k alpha_ik dalpha_ik)
sum_term = np.einsum('ij,ij->i', alpha, dalpha)  # shape (T,)
dS = alpha * (dalpha - sum_term[:, None]) / np.sqrt(d_k)

# Step 2 backward: dL/dQ and dL/dK
# dL/dQ_ia = dS_ij K_ja   ->   einsum: 'ij,ja->ia'
dQ = np.einsum('ij,ja->ia', dS, K)
# dL/dK_ja = dS_ij Q_ia   ->   einsum: 'ij,ia->ja'
dK = np.einsum('ij,ia->ja', dS, Q)

# Step 1 backward: dL/dW^Q, dL/dW^K, dL/dW^V
# dL/dWQ_da = X_id * dQ_ia   ->   einsum: 'id,ia->da'
dWQ = np.einsum('id,ia->da', X, dQ)
dWK = np.einsum('id,ia->da', X, dK)
dWV = np.einsum('id,ia->da', X, dV)

Verify your implementation with numerical gradient checking (finite differences on each weight matrix).


16. Why This Matters for AI

16.1 Impact Across the AI Stack

Einstein summation notation is not merely academic shorthand - it is the computational lingua franca of modern AI systems. Every layer of the stack, from theoretical research papers to production inference engines, benefits from thinking in indices.

| Domain | Without Index Notation | With Index Notation |
|---|---|---|
| Research papers | Ambiguous matrix expressions, dimensions unclear | Precise, self-documenting equations |
| Attention mechanisms | Nested loops or opaque bmm calls | Single einsum string: bhid,bhjd->bhij |
| Backpropagation | Memorize gradient formulas | Derive gradients mechanically |
| Tensor compilers | Manual kernel fusion | Automatic optimization from contraction specs |
| LoRA / adapters | Ad-hoc low-rank tricks | Clear decomposition: W_{ij} \approx A_{ik} B_{kj} |
| Multi-head attention | Reshapes and transposes | Natural batch index: Q_{bhid} |
| Distributed training | Confusing sharding specs | Shard along a named index |
| Quantization | Unclear where precision is lost | Per-index scaling: \hat{W}_{ij} = s_i W_{ij}^{(q)} |
| Sparse attention | Masking heuristics | Constraint on the index range of allowed (i, j) pairs |
| Graph neural networks | Aggregation as a special case | Generalized: h_{ia}^{(l+1)} = \sigma(A_{ij} h_{jb}^{(l)} W_{ba}) |
| Diffusion models | Score-function algebra | Index-level Stein identity derivation |

16.2 Key Takeaways

  1. Einstein summation is a language, not just notation. Learning it changes how you read papers, write code, and debug models.
  2. Free indices determine the output shape; dummy indices determine the computation.
  3. einsum is the universal API for turning index notation into executable tensor code.
  4. Once a forward pass is written in index notation, many backward-pass steps become mechanical through product-rule reasoning and identities such as \frac{\partial x_i}{\partial x_j} = \delta_{ij}.
  5. Index notation reveals optimization opportunities: contraction order, low-rank structure, sparsity, and parallel axes all become easier to see.
  6. Modern architectures are index expressions. Transformers, GNNs, diffusion models, and mixture-of-experts all benefit from this viewpoint.
  7. Symmetries often appear as constraints on how indices transform, which makes the notation useful for principled architecture design.

17. Conceptual Bridge

17.1 From Mathematical Notation to Working Code

paper equation
  -> attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
  -> index notation
  -> O_ia = softmax_j(Q_id K_jd / sqrt(d_k)) V_ja
  -> einsum strings
  -> scores = einsum('bhid,bhjd->bhij', Q, K)
  -> output = einsum('bhij,bhja->bhia', attn, V)
  -> optimized kernels and hardware-specific implementations

At every level, index notation tells you what data flows where, which dimensions contract, and what the output shape must be. That is the bridge from paper math to production tensor programs.

17.2 Where This Leads Next

The next chapters turn indices into the working language of vectors, matrices, tensors, gradients, and efficient kernels. Once you can read repeated-index structure fluently, large model equations become far less mysterious: they collapse into a small set of contraction patterns you can analyze, implement, and optimize.

References and Further Reading

| Resource | Topic | Type |
|---|---|---|
| Einstein, A. (1916). "The Foundation of the General Theory of Relativity" | Original convention introduction | Paper |
| Kolda & Bader (2009). "Tensor Decompositions and Applications" | CP, Tucker, tensor networks | Survey |
| Vaswani et al. (2017). "Attention Is All You Need" | Transformer architecture | Paper |
| Laue, Mitterreiter & Giesen (2020). "A Simple and Efficient Tensor Calculus" | Automatic tensor differentiation | Paper |
| opt_einsum documentation | Contraction-order optimization | Software |
| Penrose (1971). "Applications of Negative Dimensional Tensors" | Diagrammatic tensor notation | Paper |
| Dao et al. (2022). "FlashAttention" | Memory-efficient attention kernels | Paper |
| Hu et al. (2021). "LoRA: Low-Rank Adaptation" | Low-rank weight decomposition | Paper |
| NumPy documentation - numpy.einsum | Reference implementation | Documentation |
