
Converted from theory.ipynb for web reading.

Einstein Summation and Index Notation — From Subscripts to Transformers

"I have made a great discovery in mathematics; I have suppressed the summation sign every time that the summation must be made over an index which occurs twice." — Albert Einstein (1916)

This notebook is the interactive companion to notes.md. Every concept is implemented with explicit loops and verified against NumPy/einsum.

| Section | Topic | What You'll Build |
|---|---|---|
| 1 | Intuition | Why repeated indices = summation, free vs dummy |
| 2 | Formal Definitions | Tensors, convention rules, covariant/contravariant |
| 3 | Basic Operations | Dot product, outer product, matmul, trace in index form |
| 4 | Contraction Operations | Rank reduction, contraction order, cost analysis |
| 5 | Kronecker Delta & Levi-Civita | Identity as tensor, cross product, determinant |
| 6 | Index Manipulation Rules | Renaming, symmetry, partial derivatives |
| 7 | Gradients & Backprop | Gradient of linear/quadratic forms, matmul backprop |
| 8 | Batched Operations | Batch indices, multi-head attention, broadcasting |
| 9 | Einsum in Practice | String syntax, performance, common pitfalls |
| 10 | Tensor Products & Decompositions | CP, Tucker, Tensor Train, LoRA |
| 11 | AI Architectures | FC layers, attention, LayerNorm, SwiGLU |
| 12 | Symmetries & Equivariances | Permutation invariance, group equivariance |
| 13 | Advanced Manipulations | Tensor networks, spectral methods, score function |
| 14 | Common Mistakes | Index errors, shape bugs, notation traps |
| 15 | Exercises | 8 graduated exercises from basic to production |
| 16 | Why This Matters | 2026 perspective on index notation in AI |

Prerequisites: Python, NumPy. Understanding of summation notation (Section 04).

References:

  • Einstein, A. (1916). "Die Grundlage der allgemeinen Relativitätstheorie"
  • Kolda & Bader (2009). "Tensor Decompositions and Applications", SIAM Review
  • NumPy einsum documentation (numpy.org)
  • Vaswani et al. (2017). "Attention Is All You Need"
  • Dao et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention"

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
import warnings

np.random.seed(42)
np.set_printoptions(precision=6, suppress=True)

# Optional: PyTorch for real implementations
try:
    import torch
    torch.manual_seed(42)
    HAS_TORCH = True
    print(f'NumPy {np.__version__} | PyTorch {torch.__version__}')
except ImportError:
    torch = None
    HAS_TORCH = False
    print(f'NumPy {np.__version__} | PyTorch not installed (NumPy demos still work)')

1. Intuition

1.1 What Is Einstein Summation?

Einstein summation convention: whenever the same index appears exactly twice in a single term, summation over all values of that index is implied automatically.

No $\Sigma$ symbol needed — the repetition of an index IS the summation instruction.

$$a_i b_i \equiv \sum_{i=1}^{n} a_i b_i \quad \text{(dot product — i repeated twice)}$$

$$A_{ik} B_{kj} \equiv \sum_{k=1}^{p} A_{ik} B_{kj} \quad \text{(matrix multiply — k repeated twice)}$$

The result: tensor equations that would fill pages compress to single lines. The structure of the operation is immediately visible in the index pattern.

Code cell 5

# ══════════════════════════════════════════════════════════════════
# 1.1 Einstein Convention — The Core Idea
# ══════════════════════════════════════════════════════════════════

# Standard notation with explicit Σ:
#   s = Σ_i  a_i * b_i        (dot product)
#   C_ij = Σ_k  A_ik * B_kj   (matrix multiply)
#
# Einstein notation (Σ suppressed, repeated index = sum):
#   s = a_i b_i               (i appears twice → sum over i)
#   C_ij = A_ik B_kj          (k appears twice → sum over k)

# Let's see this in action with explicit loops
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Dot product: s = a_i b_i (i is repeated → summed)
# Explicit loop — what Einstein notation means
s = 0.0
for i in range(len(a)):
    s += a[i] * b[i]

print('1.1 Einstein Summation Convention')
print('=' * 50)
print(f'a = {a}')
print(f'b = {b}')
print(f'\nDot product: s = a_i b_i')
print(f'  Expanded: a[0]b[0] + a[1]b[1] + a[2]b[2]')
print(f'  = {a[0]}×{b[0]} + {a[1]}×{b[1]} + {a[2]}×{b[2]}')
print(f'  = {s}')
print(f'  np.einsum("i,i->", a, b) = {np.einsum("i,i->", a, b)}')
print(f'  Match: {abs(s - np.einsum("i,i->", a, b)) < 1e-10} ✓')

# Matrix multiply: C_ij = A_ik B_kj (k is repeated → summed)
A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)  # 3×2
B = np.array([[7, 8, 9], [10, 11, 12]], dtype=float)  # 2×3

m, p = A.shape
p2, n = B.shape
C = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        for k in range(p):  # k is the repeated (dummy) index
            C[i, j] += A[i, k] * B[k, j]

print(f'\nMatrix multiply: C_ij = A_ik B_kj')
print(f'  A shape: {A.shape}, B shape: {B.shape}')
print(f'  k repeated → summed over k=0..{p-1}')
print(f'  i,j free → result shape: ({m},{n})')
print(f'  C =\n{C}')
print(f'  np.einsum("ik,kj->ij", A, B) =\n{np.einsum("ik,kj->ij", A, B)}')
print(f'  Match: {np.allclose(C, np.einsum("ik,kj->ij", A, B))} ✓')

1.2 Why This Notation Transforms Thinking

Without index notation, AB means "matrix multiply A and B" — the mechanics are hidden.

With index notation, $(AB)_{ij} = A_{ik}B_{kj}$ — the repeated $k$ tells you:

  • What is summed (the $k$ dimension)
  • Over what range ($k = 1, \ldots, p$)
  • How inputs connect to outputs (A's columns match B's rows)

Pattern Recognition:

  • $A_{ik}B_{kj}$ → matrix multiply (inner contraction)
  • $A_{ij}B_{ij}$ → elementwise multiply then sum (Frobenius inner product)
  • $A_{ik}B_{jk}$ → multiply $A$ by $B^\top$ (outer-dimension contraction)
  • $u_i v_j$ → outer product (no repeated index, no sum)

Debugging: if a free index appears on one side but not the other, the equation is wrong. Index balance checking catches errors mechanically.
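
np.einsum performs the same mechanical check at runtime: output indices must appear in the inputs, and a dummy index must have the same range in every operand. A minimal sketch (the arrays here are just placeholders):

import numpy as np

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)

# Balanced: free indices i, j appear in the inputs and the output → fine
C = np.einsum('ik,kj->ij', A, B)

# Unbalanced: output index 'l' never appears in any input → ValueError
try:
    np.einsum('ik,kj->il', A, B)
except ValueError as err:
    print('index-balance error caught:', err)

# Dummy-range mismatch: k has size 4 in A but 5 in B.T → ValueError
try:
    np.einsum('ik,kj->ij', A, B.T)
except ValueError as err:
    print('range-mismatch error caught:', err)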

Code cell 7

# ══════════════════════════════════════════════════════════════════
# 1.2 Pattern Recognition via Index Structure
# ══════════════════════════════════════════════════════════════════

np.random.seed(42)
A = np.random.randn(3, 4)
B = np.random.randn(3, 4)  # same shape as A for some operations
C = np.random.randn(4, 2)  # compatible for matmul

# Pattern 1: A_ik C_kj → matrix multiply (k repeated across tensors)
result1 = np.zeros((3, 2))
for i in range(3):
    for j in range(2):
        for k in range(4):
            result1[i, j] += A[i, k] * C[k, j]

# Pattern 2: A_ij B_ij → Frobenius inner product (BOTH i,j repeated → scalar)
result2 = 0.0
for i in range(3):
    for j in range(4):
        result2 += A[i, j] * B[i, j]

# Pattern 3: A_ik B_jk → A @ B^T (k repeated; i,j free → matrix)
result3 = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        for k in range(4):
            result3[i, j] += A[i, k] * B[j, k]

print('1.2 Pattern Recognition via Index Structure')
print('=' * 50)

print('\nPattern 1: A_ik C_kj = matrix multiply (A @ C)')
print(f'  Free: i,j  Dummy: k  → rank-2 result ({result1.shape})')
print(f'  einsum: {np.allclose(result1, np.einsum("ik,kj->ij", A, C))} ✓')

print('\nPattern 2: A_ij B_ij = Frobenius inner product (scalar)')
print(f'  Free: none  Dummy: i,j  → rank-0 result = {result2:.6f}')
print(f'  einsum: {np.allclose(result2, np.einsum("ij,ij->", A, B))} ✓')
print(f'  Equivalently: tr(A^T B) = {np.trace(A.T @ B):.6f} ✓')

print('\nPattern 3: A_ik B_jk = A @ B^T (k contracted differently)')
print(f'  Free: i,j  Dummy: k  → rank-2 result ({result3.shape})')
print(f'  einsum: {np.allclose(result3, np.einsum("ik,jk->ij", A, B))} ✓')
print(f'  Equivalently: A @ B.T check: {np.allclose(result3, A @ B.T)} ✓')

print('\n--- The index structure alone tells you the operation! ---')
print('  ik,kj→ij : inner dimension matched → matmul')
print('  ij,ij→   : same indices, no output → sum of products')
print('  ik,jk→ij : outer dimension matched → A @ B.T')

1.3 The Core Idea in One Sentence

Everything in Einstein notation follows from exactly two rules:

| Index Type | Appears | Effect | In Result? |
|---|---|---|---|
| Free index | Once in term | No summation | Yes — result has that dimension |
| Dummy index | Twice in term | Summed over | No — disappears from result |

$$\underbrace{C_{{\color{blue}i}{\color{blue}j}}}_{\text{free: } i,j} = \underbrace{A_{{\color{blue}i}{\color{red}k}}}_{\text{free: } i \text{, dummy: } k}\,\underbrace{B_{{\color{red}k}{\color{blue}j}}}_{\text{dummy: } k \text{, free: } j}$$

That's it. Everything else — matmul, attention, gradients — follows from these two rules.

Code cell 9

# ══════════════════════════════════════════════════════════════════
# 1.3 Free vs Dummy — The Only Two Rules You Need
# ══════════════════════════════════════════════════════════════════

def analyze_einstein(expr, tensor_shapes):
    """Analyze an Einstein expression: find free/dummy indices, result shape.
    
    Args:
        expr: string like 'ik,kj->ij' or just 'ik,kj' (implicit)
        tensor_shapes: dict mapping index labels to their dimension sizes
    Returns:
        dict with free, dummy indices and result shape
    """
    # Parse expression
    if '->' in expr:
        inputs_str, output_str = expr.split('->')
    else:
        inputs_str = expr
        output_str = None
    
    input_parts = inputs_str.split(',')
    
    # Count occurrences of each index across ALL input operands
    index_count = {}
    for part in input_parts:
        for ch in part:
            if ch not in index_count:
                index_count[ch] = 0
            index_count[ch] += 1
    
    # Classify: appears once = free, appears twice = dummy
    free_indices = []
    dummy_indices = []
    for idx, count in index_count.items():
        if count == 1:
            free_indices.append(idx)
        elif count == 2:
            dummy_indices.append(idx)
        else:
            print(f'  WARNING: index "{idx}" appears {count} times — invalid!')
    
    # Determine output indices
    if output_str is not None:
        out_indices = list(output_str)
    else:
        # Implicit: all free indices in alphabetical order
        out_indices = sorted(free_indices)
    
    # Compute result shape
    result_shape = tuple(tensor_shapes.get(idx, '?') for idx in out_indices)
    
    return {
        'expression': expr,
        'free': free_indices,
        'dummy': dummy_indices,
        'output_indices': out_indices,
        'result_shape': result_shape
    }

print('1.3 Analyzing Einstein Expressions')
print('=' * 60)

# Test cases
cases = [
    ('i,i->',   {'i': 3},           'Dot product: a_i b_i'),
    ('i,j->ij', {'i': 3, 'j': 4},   'Outer product: a_i b_j'),
    ('ij,j->i', {'i': 3, 'j': 4},   'Matrix-vector: A_ij v_j'),
    ('ik,kj->ij', {'i': 3, 'k': 4, 'j': 5}, 'Matrix multiply: A_ik B_kj'),
    ('ii->',    {'i': 3},            'Trace: A_ii'),
    ('ij,ij->', {'i': 3, 'j': 4},   'Frobenius: A_ij B_ij'),
]

for expr, shapes, desc in cases:
    info = analyze_einstein(expr, shapes)
    free_str = ','.join(info['free']) if info['free'] else '(none)'
    dummy_str = ','.join(info['dummy']) if info['dummy'] else '(none)'
    print(f'\n  {desc}')
    print(f'    Expression: {expr}')
    print(f'    Free: {free_str}  Dummy: {dummy_str}')
    print(f'    Result shape: {info["result_shape"]}')

1.4 Where This Appears in AI

Every operation in a modern transformer is a sum over repeated indices:

| Operation | Index Notation | Dummy (Summed) | Free (Result) |
|---|---|---|---|
| Matrix multiply | $C_{ij} = A_{ik}B_{kj}$ | $k$ | $i, j$ |
| Dot product | $s = a_i b_i$ | $i$ | (none → scalar) |
| Attention score | $S_{ij} = Q_{ik}K_{jk}/\sqrt{d}$ | $k$ | $i, j$ |
| Attention output | $O_{ia} = \alpha_{ij}V_{ja}$ | $j$ | $i, a$ |
| Gradient of matmul | $\partial L/\partial A_{ik} = (\partial L/\partial C)_{ij} B_{kj}$ | $j$ | $i, k$ |
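
The gradient row is not exercised in the code cell below (gradients get a full treatment in Section 7). Here is a minimal numerical check of that formula, assuming an upstream gradient $G = \partial L/\partial C$ for $C = AB$ and the loss $L = \sum_{ij} G_{ij} C_{ij}$:

import numpy as np

np.random.seed(0)
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
G = np.random.randn(3, 5)              # upstream gradient dL/dC

# Index form: dL/dA_ik = (dL/dC)_ij B_kj  →  einsum 'ij,kj->ik'  (= G @ B.T)
dA = np.einsum('ij,kj->ik', G, B)

# Finite-difference check on one entry of A
eps = 1e-6
A_pert = A.copy()
A_pert[1, 2] += eps
num_grad = (np.sum(G * (A_pert @ B)) - np.sum(G * (A @ B))) / eps
print(dA[1, 2], num_grad)              # the two values agree closely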

Code cell 11

# ══════════════════════════════════════════════════════════════════
# 1.4 Every AI Operation as Index Contraction
# ══════════════════════════════════════════════════════════════════

np.random.seed(42)

# Set up a mini-transformer scenario
seq_len, d_model = 4, 8
X = np.random.randn(seq_len, d_model)         # input embedding
W_q = np.random.randn(d_model, d_model) * 0.1  # query projection

print('1.4 AI Operations as Index Contractions')
print('=' * 55)

# (a) Attention queries: Q_ia = X_ib W^Q_ba  (b is dummy, i,a free)
Q = np.zeros((seq_len, d_model))
for i in range(seq_len):
    for a in range(d_model):
        for b in range(d_model):  # b = dummy (repeated)
            Q[i, a] += X[i, b] * W_q[b, a]

print('(a) Query projection: Q_ia = X_ib W^Q_ba')
print(f'    X shape: ({seq_len},{d_model}), W_q shape: ({d_model},{d_model})')
print(f'    dummy: b (summed)  free: i,a  → Q shape: {Q.shape}')
print(f'    einsum check: {np.allclose(Q, np.einsum("ib,ba->ia", X, W_q))} ✓')

# (b) Attention score: S_ij = Q_ik K_jk / sqrt(d)
K = np.random.randn(seq_len, d_model)  # key matrix
d_k = d_model
S = np.zeros((seq_len, seq_len))
for i in range(seq_len):
    for j in range(seq_len):
        for k in range(d_model):  # k = dummy
            S[i, j] += Q[i, k] * K[j, k]
        S[i, j] /= np.sqrt(d_k)

print(f'\n(b) Attention score: S_ij = Q_ik K_jk / √d')
print(f'    dummy: k  free: i,j  → S shape: {S.shape}')
print(f'    einsum check: {np.allclose(S, np.einsum("ik,jk->ij", Q, K) / np.sqrt(d_k))} ✓')

# (c) Softmax + weighted sum: O_ia = α_ij V_ja  (j = dummy)
# Stable softmax per row
alpha = np.zeros_like(S)
for i in range(seq_len):
    row = S[i]
    max_val = row[0]
    for v in row:
        if v > max_val:
            max_val = v
    exp_row = np.zeros(seq_len)
    exp_sum = 0.0
    for j in range(seq_len):
        exp_row[j] = np.exp(row[j] - max_val)
        exp_sum += exp_row[j]
    for j in range(seq_len):
        alpha[i, j] = exp_row[j] / exp_sum

V = np.random.randn(seq_len, d_model)
O = np.zeros((seq_len, d_model))
for i in range(seq_len):
    for a in range(d_model):
        for j in range(seq_len):  # j = dummy
            O[i, a] += alpha[i, j] * V[j, a]

print(f'\n(c) Attention output: O_ia = α_ij V_ja')
print(f'    dummy: j  free: i,a  → O shape: {O.shape}')
print(f'    einsum check: {np.allclose(O, np.einsum("ij,ja->ia", alpha, V))} ✓')
print(f'\n    Summary: Q@K^T/√d → softmax → weighted V')
print(f'    All three steps are index contractions!')

1.5 Historical and Modern Context

| Era | Development | Significance |
|---|---|---|
| 1900 | Ricci & Levi-Civita: absolute differential calculus | Index notation precursor |
| 1916 | Einstein: general relativity paper | Introduced the summation convention |
| 1920s–90s | Standard in physics | Electromagnetism, continuum mechanics |
| 1960s | Penrose: abstract index notation | Distinguished covariant/contravariant |
| 2006 | NumPy np.einsum | Einstein convention for array computing |
| 2017 | torch.einsum, tf.einsum | Standard in ML frameworks |
| 2018 | opt_einsum library | Optimal contraction order for multi-tensor |
| 2020s | Transformer lingua franca | FlashAttention, JAX, einsum everywhere |

1.6 The Hierarchy of Tensor Operations

Scalar (rank 0): s          — no indices
Vector (rank 1): v_i        — one index
Matrix (rank 2): M_ij       — two indices
3-Tensor (rank 3): T_ijk    — three indices
n-Tensor (rank n): T_{i₁...iₙ} — n indices

Key Operations:
  Contract one pair    → reduce rank by 2
  Outer product       → increase rank by sum
  Transpose           → relabel indices
  Trace               → contract diagonal

Code cell 13

# ══════════════════════════════════════════════════════════════════
# 1.6 Tensor Hierarchy — Rank, Shape, and Contraction
# ══════════════════════════════════════════════════════════════════

print('1.6 Tensor Hierarchy')
print('=' * 60)

# Build tensors of each rank
scalar = 42.0                                    # rank 0: no indices
vector = np.array([1.0, 2.0, 3.0])              # rank 1: v_i
matrix = np.random.randn(3, 4)                  # rank 2: M_ij
tensor3 = np.random.randn(3, 4, 5)             # rank 3: T_ijk
tensor4 = np.random.randn(2, 3, 4, 5)          # rank 4: T_ijkl

for name, t in [('Scalar', scalar), ('Vector', vector),
                ('Matrix', matrix), ('3-Tensor', tensor3),
                ('4-Tensor', tensor4)]:
    shape = np.shape(t)
    rank = len(shape)
    size = np.size(t)
    indices = ''.join(['ijklmnop'[r] for r in range(rank)])
    idx_str = f'T_{indices}' if indices else 's'
    print(f'  {name:10s}: rank={rank}, shape={str(shape):12s}, '
          f'elements={size:5d}, notation: {idx_str}')

# Demonstrate contraction reducing rank
print(f'\nContraction reduces rank by 2:')

# Rank 2 → Rank 0: trace  A_ii → scalar
A_sq = np.random.randn(4, 4)
trace_loop = 0.0
for i in range(4):
    trace_loop += A_sq[i, i]
print(f'  Trace: rank 2 → rank 0 (scalar = {trace_loop:.4f})')

# Rank 3 → Rank 1: partial trace  T_ijk with j=k → v_i = T_ikk
v_partial = np.zeros(3)
for i in range(3):
    for k in range(min(4, 5)):  # contract j=k axes (sizes 4,5 — use min)
        v_partial[i] += tensor3[i, k, k]
print(f'  Partial trace: rank 3 → rank 1 (shape = {v_partial.shape})')

# Outer product increases rank
outer = np.zeros((3, 3))
u = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])
for i in range(3):
    for j in range(3):
        outer[i, j] = u[i] * w[j]  # no repeated index → no sum
print(f'  Outer product: rank 1 + rank 1 → rank 2 ({outer.shape})')
print(f'    u_i w_j = rank(u) + rank(w) = 1 + 1 = 2 ✓')

2. Formal Definitions

2.1 Tensors as Indexed Arrays

A tensor $T$ of rank $r$ over index sets $I_1, I_2, \ldots, I_r$ is a function:

$$T: I_1 \times I_2 \times \cdots \times I_r \to \mathbb{R}$$

It assigns a real number $T_{i_1 i_2 \ldots i_r}$ to each tuple of indices.

| Rank | Name | Notation | Example |
|---|---|---|---|
| 0 | Scalar | $s$ | Temperature, loss value |
| 1 | Vector | $v_i$ | Embedding, gradient |
| 2 | Matrix | $M_{ij}$ | Weight matrix, attention scores |
| 3 | 3-Tensor | $T_{ijk}$ | Batched matrices, RGB images |
| 4 | 4-Tensor | $T_{ijkl}$ | Conv filters, batch×head×seq×dim |

Shape = tuple of index set sizes $(n_1, n_2, \ldots, n_r)$; it determines memory layout and operation validity.

Code cell 15

# ══════════════════════════════════════════════════════════════════
# 2.1 Tensors as Indexed Arrays
# ══════════════════════════════════════════════════════════════════

print('2.1 Tensors as Indexed Arrays')
print('=' * 55)

# A tensor maps index tuples → real numbers
# T: I₁ × I₂ × ... × Iᵣ → ℝ

# Rank 0: scalar — T() = s
scalar_val = 3.14
print(f'\nRank 0 (scalar): s = {scalar_val}')
print(f'  No indices. T() → ℝ.  Example: loss = {scalar_val}')

# Rank 1: vector — T(i) = v_i
embedding = np.array([0.5, -1.2, 0.3, 0.8])
print(f'\nRank 1 (vector): v_i, shape={embedding.shape}')
print(f'  One index i ∈ {{0,...,{len(embedding)-1}}}')
print(f'  v_0={embedding[0]}, v_1={embedding[1]}, v_2={embedding[2]}, v_3={embedding[3]}')

# Rank 2: matrix — T(i,j) = M_ij
W = np.array([[1, 2, 3],
              [4, 5, 6]], dtype=float)
print(f'\nRank 2 (matrix): M_ij, shape={W.shape}')
print(f'  Two indices: i ∈ {{0,1}}, j ∈ {{0,1,2}}')
print(f'  M_00={W[0,0]}, M_01={W[0,1]}, M_12={W[1,2]}')

# Rank 3: 3-tensor — T(i,j,k) = T_ijk
batch_data = np.random.randn(2, 3, 4)
print(f'\nRank 3 (3-tensor): T_ijk, shape={batch_data.shape}')
print(f'  Three indices: i ∈ {{0,1}}, j ∈ {{0,1,2}}, k ∈ {{0,...,3}}')
print(f'  # elements: {batch_data.size} = 2 × 3 × 4')
print(f'  AI example: batch(2) × seq(3) × embed(4)')

# Rank 4: attention tensor
B, H, N, D = 2, 4, 8, 16
attn_tensor = np.random.randn(B, H, N, D)
print(f'\nRank 4 (4-tensor): T_bhij, shape={attn_tensor.shape}')
print(f'  b={B}(batch), h={H}(heads), i={N}(seq), j={D}(dim)')
print(f'  # elements: {attn_tensor.size:,} = {B}×{H}×{N}×{D}')
print(f'  AI example: multi-head attention Q/K/V tensors')

2.2 The Einstein Summation Convention — Precise Statement

If an index label $\alpha$ appears exactly twice in a single term, it is a dummy index and is implicitly summed over all values in its range:

$$A_{\ldots\alpha\ldots} B_{\ldots\alpha\ldots} \equiv \sum_{\alpha=1}^{n} A_{\ldots\alpha\ldots} B_{\ldots\alpha\ldots}$$

An index appearing once is a free index — the equation holds for each value of that index.

2.3 Free vs Dummy Indices — Precise Definitions

  • Free index: appears exactly once → result has that dimension → equation holds ∀ values
  • Dummy (bound/contracted) index: appears exactly twice → summed over → disappears from result
  • Invalid: index appearing 3+ times — ambiguous, not allowed
  • Consistency rule: free indices must match on both sides of =

$$\underbrace{v_i}_{\text{free: } i} = \underbrace{A_{ij}}_{\text{free: } i\text{, dummy: } j} \underbrace{u_j}_{\text{dummy: } j} \quad \text{✓ (both sides have free } i\text{)}$$

$$v_i = A_{jk} u_k \quad \text{✗ (left has free } i\text{, right has free } j \text{ — inconsistent!)}$$

Code cell 17

# ══════════════════════════════════════════════════════════════════
# 2.2-2.3 Convention Rules & Index Classification
# ══════════════════════════════════════════════════════════════════

print('2.2-2.3 Einstein Convention: Free vs Dummy Indices')
print('=' * 55)

# Demonstrate: v_i = A_ij u_j
A = np.array([[1, 2, 3],
              [4, 5, 6]], dtype=float)  # 2×3
u = np.array([1.0, 0.5, -1.0])         # length 3

# v_i = A_ij u_j  (j is dummy → summed; i is free → result index)
v = np.zeros(2)
for i in range(2):         # free index: iterate, keep in result
    for j in range(3):     # dummy index: iterate, sum, disappears
        v[i] += A[i, j] * u[j]

print('Expression: v_i = A_ij u_j')
print(f'  j appears twice (in A and u) → DUMMY → summed over j=0..2')
print(f'  i appears once → FREE → result indexed by i')
print(f'  A: {A.shape}, u: {u.shape} → v: {v.shape}')
print(f'  v = {v}')
print(f'  Check A @ u = {A @ u}  ✓')

# Demonstrate consistency checking
print(f'\nConsistency Check:')
print(f'  v_i = A_ij u_j  →  left has {{i}}, right has {{i}}  ✓')
print(f'  v_i = A_jk u_k  →  left has {{i}}, right has {{j}}  ✗ WRONG!')
print(f'  s = a_i b_i     →  left has {{}},  right has {{}}   ✓ (scalar)')

# Demonstrate invalid: index appearing 3 times
print(f'\nInvalid Notation:')
print(f'  A_iii B_i  → i appears 4 times → AMBIGUOUS → not allowed')
print(f'  A_ij B_jk C_kj → j appears 3 times → INVALID')
print(f'  Rule: each index must appear exactly 1 (free) or 2 (dummy) times')

# Demonstrate range matching for dummy indices
print(f'\nRange Matching:')
print(f'  A_ik B_kj requires A.shape[1] == B.shape[0] (k range must match)')
print(f'  A: (m×p), B: (p×n) → k ranges over 1..p')
print(f'  This IS the shape compatibility condition for matrix multiply!')

2.4 Index Ranges

Each index position has a defined range from the tensor's shape:

  • $i$ in $A_{ij}$ where $A \in \mathbb{R}^{m \times n}$: $i \in \{1, \ldots, m\}$, $j \in \{1, \ldots, n\}$
  • Contracted indices must have matching ranges: $A_{ik}B_{kj}$ is valid only if $A$'s 2nd dim = $B$'s 1st dim
  • Shape inference: from the index structure alone, you can determine the result shape

2.5 Covariant and Contravariant Indices (Physics Notation)

In differential geometry / general relativity:

  • Upper index $v^i$ (contravariant): transforms like coordinate differences
  • Lower index $w_i$ (covariant): transforms like gradient components
  • Einstein convention: sum over one upper and one lower: $A_{ij}B^{j}{}_{k}$
  • Metric tensor $g_{ij}$ raises/lowers: $v_i = g_{ij}v^j$

In ML: all indices are effectively lower (Euclidean space, $g_{ij} = \delta_{ij}$). The up/down distinction collapses. Know it exists for reading physics papers; ignore it for coding.

Code cell 19

# ══════════════════════════════════════════════════════════════════
# 2.4-2.5 Index Ranges, Shape Inference, Metric Tensor
# ══════════════════════════════════════════════════════════════════

print('2.4 Index Ranges & Shape Inference')
print('=' * 55)

# Shape inference from index structure
def infer_shape(einsum_str, input_shapes):
    """Infer output shape from einsum string and input shapes."""
    inputs_str, output_str = einsum_str.split('->')
    input_parts = inputs_str.split(',')
    
    # Build index → dimension mapping
    idx_to_dim = {}
    for part, shape in zip(input_parts, input_shapes):
        for pos, ch in enumerate(part):
            if ch in idx_to_dim:
                # Contracted index — verify range match
                if idx_to_dim[ch] != shape[pos]:
                    return f'ERROR: index {ch} has mismatched ranges {idx_to_dim[ch]} vs {shape[pos]}'
            else:
                idx_to_dim[ch] = shape[pos]
    
    # Build output shape
    out_shape = tuple(idx_to_dim[ch] for ch in output_str)
    return out_shape

# Test shape inference on various operations
cases = [
    ('ik,kj->ij', [(3, 4), (4, 5)], 'matmul A(3×4) B(4×5)'),
    ('ij,j->i',   [(3, 4), (4,)],   'matvec A(3×4) v(4)'),
    ('i,i->',     [(3,), (3,)],     'dot product a(3) b(3)'),
    ('i,j->ij',   [(3,), (4,)],     'outer product a(3) b(4)'),
    ('bij,bjk->bik', [(2, 3, 4), (2, 4, 5)], 'batched matmul'),
    ('bhid,bhjd->bhij', [(2,4,8,16), (2,4,8,16)], 'attention scores'),
]

for expr, shapes, desc in cases:
    result = infer_shape(expr, shapes)
    print(f'  {expr:20s} | inputs={shapes} → output={result}')
    print(f'    {desc}')

# Test range mismatch detection
print(f'\nRange mismatch detection:')
bad_result = infer_shape('ik,kj->ij', [(3, 4), (5, 6)])
print(f'  ik,kj->ij with A(3×4), B(5×6): {bad_result}')
print(f'  k must range 1..4 in A AND 1..5 in B — impossible!')

# 2.5 Physics notation: metric tensor and index raising/lowering
print(f'\n{"=" * 55}')
print(f'2.5 Covariant/Contravariant — The Metric Tensor')
print(f'{"=" * 55}')

# In general relativity: metric tensor g_ij raises/lowers indices
# In ML (Euclidean space): g_ij = δ_ij (identity) → no distinction

# Minkowski metric (special relativity) — NOT used in ML, but shows concept
eta = np.zeros((4, 4))
eta[0, 0] = -1.0  # time component
eta[1, 1] = 1.0   # space x
eta[2, 2] = 1.0   # space y
eta[3, 3] = 1.0   # space z

print(f'\nMinkowski metric η_ij (special relativity):')
print(f'{eta}')
print(f'  Used to raise/lower indices in physics: v_i = η_ij v^j')

# In ML: metric is identity → v_i = δ_ij v^j = v_i (trivial!)
I_metric = np.eye(4)
v_contra = np.array([1.0, 2.0, 3.0, 4.0])  # v^i (contravariant)

# Lower the index: v_i = g_ij v^j
v_covariant_physics = np.zeros(4)
for i in range(4):
    for j in range(4):
        v_covariant_physics[i] += eta[i, j] * v_contra[j]

v_covariant_ml = np.zeros(4)
for i in range(4):
    for j in range(4):
        v_covariant_ml[i] += I_metric[i, j] * v_contra[j]

print(f'\nv^i = {v_contra}')
print(f'Physics (η): v_i = η_ij v^j = {v_covariant_physics}')
print(f'ML (δ):      v_i = δ_ij v^j = {v_covariant_ml}')
print(f'\nIn ML, upper=lower, so we drop the distinction entirely.')

3. Basic Operations in Index Notation

Every linear algebra operation has a clean index form. The key insight: the index structure alone tells you the operation.

| Operation | Index Form | Free | Dummy | Result |
|---|---|---|---|---|
| Scalar mult | $(sA)_{ij} = s \cdot A_{ij}$ | $i,j$ | (none) | matrix |
| Vector add | $(u+v)_i = u_i + v_i$ | $i$ | (none) | vector |
| Dot product | $s = u_i v_i$ | (none) | $i$ | scalar |
| Outer product | $C_{ij} = u_i v_j$ | $i,j$ | (none) | matrix |
| Matrix-vector | $w_i = A_{ij} v_j$ | $i$ | $j$ | vector |
| Matrix multiply | $C_{ij} = A_{ik} B_{kj}$ | $i,j$ | $k$ | matrix |
| Transpose | $(A^T)_{ij} = A_{ji}$ | $i,j$ | (none) | matrix |
| Trace | $\text{tr}(A) = A_{ii}$ | (none) | $i$ | scalar |
| Frobenius | $\langle A,B\rangle_F = A_{ij}B_{ij}$ | (none) | $i,j$ | scalar |
| Hadamard | $(A \odot B)_{ij} = A_{ij}B_{ij}$ | $i,j$ | (none) | matrix |

Code cell 21

# ══════════════════════════════════════════════════════════════════
# 3.1 SCALAR MULTIPLICATION: (sA)_ij = s · A_ij
# ══════════════════════════════════════════════════════════════════

print("=" * 70)
print("3.1 SCALAR MULTIPLICATION")
print("  Index form: (sA)_ij = s · A_ij")
print("  Free indices: i, j  |  Dummy indices: none")
print("=" * 70)

# Scalar times a matrix — every element gets multiplied
s = 3.0
A = np.array([[1, 2, 3],
              [4, 5, 6]])
m, n = A.shape

# Raw loop implementation
result_loop = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        result_loop[i, j] = s * A[i, j]

# Einsum equivalent: a scalar can be passed with an empty subscript,
# e.g. np.einsum(',ij->ij', s, A); plain broadcasting is clearer here
result_numpy = s * A

print(f"\nScalar s = {s}")
print(f"Matrix A ({m}×{n}):\n{A}")
print(f"\n(sA)_ij via explicit loops:\n{result_loop}")
print(f"(sA)_ij via NumPy:         \n{result_numpy}")
print(f"\n✓ Match: {np.allclose(result_loop, result_numpy)}")

# Key insight: no summation because no repeated indices
print("\n--- Insight ---")
print("No repeated index → no summation → element-wise operation")
print("Each (i,j) entry is independent of all others")

# ══════════════════════════════════════════════════════════════════
# 3.2 VECTOR ADDITION: (u + v)_i = u_i + v_i
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("3.2 VECTOR ADDITION")
print("  Index form: w_i = u_i + v_i")
print("  Free indices: i  |  Dummy indices: none")
print("=" * 70)

u = np.array([1, 2, 3, 4])
v = np.array([10, 20, 30, 40])
n = len(u)

# Raw loop implementation
w_loop = np.zeros(n)
for i in range(n):
    w_loop[i] = u[i] + v[i]

w_numpy = u + v

print(f"\nu = {u}")
print(f"v = {v}")
print(f"\nw_i = u_i + v_i via loop:  {w_loop}")
print(f"w_i = u_i + v_i via NumPy: {w_numpy}")
print(f"\n✓ Match: {np.allclose(w_loop, w_numpy)}")

# Note: Addition is NOT an Einstein summation operation
# Einstein convention only covers products, not sums of tensors
print("\n--- Important Note ---")
print("Einstein convention handles PRODUCTS with summed indices.")
print("Addition (u_i + v_i) is a separate operation — both terms")
print("share the same free index i, but there's no contraction.")

# ══════════════════════════════════════════════════════════════════
# 3.3 INNER PRODUCT (DOT PRODUCT): s = u_i v_i
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("3.3 INNER PRODUCT (DOT PRODUCT)")
print("  Index form: s = u_i v_i  (i is repeated → summed)")  
print("  Free indices: none  |  Dummy indices: i")
print("=" * 70)

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
n = len(u)

# Raw loop: explicit summation over repeated index i
s_loop = 0.0
for i in range(n):
    s_loop += u[i] * v[i]

s_dot = np.dot(u, v)
s_einsum = np.einsum('i,i->', u, v)

print(f"\nu = {u}")
print(f"v = {v}")
print(f"\ns = u_i v_i = Σ_i u_i·v_i")
print(f"  Step by step: {' + '.join(f'{u[i]}×{v[i]}' for i in range(n))}")
print(f"              = {' + '.join(f'{u[i]*v[i]:.0f}' for i in range(n))}")
print(f"              = {s_loop}")
print(f"\nvia loop:   {s_loop}")
print(f"via np.dot: {s_dot}")
print(f"via einsum: {s_einsum}")
print(f"\n✓ All match: {np.allclose(s_loop, s_dot) and np.allclose(s_loop, s_einsum)}")

# Geometric interpretation
norm_u = np.sqrt(np.einsum('i,i->', u, u))
norm_v = np.sqrt(np.einsum('i,i->', v, v))
cos_theta = s_einsum / (norm_u * norm_v)
print(f"\n--- Geometric View ---")
print(f"||u|| = {norm_u:.4f},  ||v|| = {norm_v:.4f}")
print(f"cos(θ) = u·v / (||u|| ||v||) = {cos_theta:.4f}")
print(f"θ = {np.degrees(np.arccos(cos_theta)):.2f}°")

# ══════════════════════════════════════════════════════════════════
# 3.4 OUTER PRODUCT: C_ij = u_i v_j
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("3.4 OUTER PRODUCT")
print("  Index form: C_ij = u_i · v_j  (no repeated → no sum)")
print("  Free indices: i, j  |  Dummy indices: none")
print("=" * 70)

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0])
m, n = len(u), len(v)

# Raw loop: no summation, just products
C_loop = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        C_loop[i, j] = u[i] * v[j]

C_outer = np.outer(u, v)
C_einsum = np.einsum('i,j->ij', u, v)

print(f"\nu = {u}  (length {m})")
print(f"v = {v}  (length {n})")
print(f"\nC_ij = u_i · v_j  ({m}×{n} matrix):")
print(f"\nvia loop:\n{C_loop}")
print(f"\nvia np.outer:\n{C_outer}")
print(f"\nvia einsum('i,j->ij'):\n{C_einsum}")
print(f"\n✓ All match: {np.allclose(C_loop, C_outer) and np.allclose(C_loop, C_einsum)}")

# Key difference from inner product
print("\n--- Inner vs Outer: The Index Story ---")
print("Inner product: u_i v_i  → i repeated → sum → scalar")
print("Outer product: u_i v_j  → i,j both free → no sum → matrix")
print(f"Inner: R^{m} × R^{n} → R¹ (if m=n)")
print(f"Outer: R^{m} × R^{n} → R^{{{m}×{n}}}")

# Rank-1 matrix property
print("\n--- Rank-1 Property ---")
rank = np.linalg.matrix_rank(C_loop)
print(f"rank(u ⊗ v) = {rank}")
print("Every outer product of vectors produces a rank-1 matrix.")
print("Any rank-r matrix = sum of r outer products (SVD connection)")

# ══════════════════════════════════════════════════════════════════
# 3.5 MATRIX-VECTOR MULTIPLY: w_i = A_ij v_j
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("3.5 MATRIX-VECTOR MULTIPLY")
print("  Index form: w_i = A_ij v_j  (j repeated → sum over j)")
print("  Free indices: i  |  Dummy indices: j")
print("=" * 70)

A = np.array([[1, 2, 3],
              [4, 5, 6]])
v = np.array([1.0, 0.0, -1.0])
m, n = A.shape

# Raw loop: for each row i, sum over column j
w_loop = np.zeros(m)
for i in range(m):
    for j in range(n):
        w_loop[i] += A[i, j] * v[j]

w_matmul = A @ v
w_einsum = np.einsum('ij,j->i', A, v)

print(f"\nA ({m}×{n}):\n{A}")
print(f"v = {v}")
print(f"\nw_i = A_ij v_j = Σ_j A_ij·v_j")
for i in range(m):
    terms = ' + '.join(f'{A[i,j]}×{v[j]:.0f}' for j in range(n))
    print(f"  w_{i} = {terms} = {w_loop[i]:.1f}")

print(f"\nvia loop:               {w_loop}")
print(f"via A @ v:              {w_matmul}")
print(f"via einsum('ij,j->i'):  {w_einsum}")
print(f"\n✓ All match: {np.allclose(w_loop, w_matmul) and np.allclose(w_loop, w_einsum)}")

# Connection to neural networks
print("\n--- Neural Network Connection ---")
b = np.array([0.1, -0.2])
z = w_einsum + b
print(f"Linear layer: z_i = W_ij x_j + b_i")
print(f"  W@x = {w_einsum}")
print(f"  b   = {b}")
print(f"  z   = {z}")
print(f"This is the fundamental operation of every neural network layer.")

# ══════════════════════════════════════════════════════════════════
# 3.6 MATRIX-MATRIX MULTIPLY: C_ij = A_ik B_kj
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("3.6 MATRIX-MATRIX MULTIPLY")
print("  Index form: C_ij = A_ik B_kj  (k repeated → sum over k)")
print("  Free indices: i, j  |  Dummy indices: k")
print("=" * 70)

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])  # 3×2
B = np.array([[7, 8, 9],
              [10, 11, 12]])  # 2×3
m, p = A.shape
p2, n = B.shape

# Raw triple loop
C_loop = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        for k in range(p):
            C_loop[i, j] += A[i, k] * B[k, j]

C_matmul = A @ B
C_einsum = np.einsum('ik,kj->ij', A, B)

print(f"\nA ({m}×{p}):\n{A}")
print(f"\nB ({p}×{n}):\n{B}")
print(f"\nC_ij = A_ik B_kj = Σ_k A_ik·B_kj  ({m}×{n} result)")
print(f"\nDetailed computation:")
for i in range(m):
    for j in range(n):
        terms = ' + '.join(f'{A[i,k]}×{B[k,j]}' for k in range(p))
        print(f"  C_{i}{j} = {terms} = {C_loop[i,j]:.0f}")

print(f"\nvia loop:\n{C_loop}")
print(f"\nvia A @ B:\n{C_matmul}")
print(f"\nvia einsum('ik,kj->ij'):\n{C_einsum}")
print(f"\n✓ All match: {np.allclose(C_loop, C_matmul) and np.allclose(C_loop, C_einsum)}")

# Shape rule
print(f"\n--- Shape Rule ---")
print(f"A is (m×k) = ({m}×{p})")  
print(f"B is (k×n) = ({p}×{n})")
print(f"C is (m×n) = ({m}×{n})")
print(f"The contracted index k disappears. Free indices i,j survive.")
print(f"This is the FUNDAMENTAL pattern of Einstein summation.")

Code cell 22

# ══════════════════════════════════════════════════════════════════
# 3.7 TRANSPOSE: (A^T)_ij = A_ji
# ══════════════════════════════════════════════════════════════════

print("=" * 70)
print("3.7 TRANSPOSE")
print("  Index form: (A^T)_ij = A_ji  (swap index order)")
print("  Free indices: i, j  |  Dummy indices: none")
print("=" * 70)

A = np.array([[1, 2, 3],
              [4, 5, 6]])
m, n = A.shape

# Raw loop: swap row and column indices
AT_loop = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        AT_loop[i, j] = A[j, i]  # Note: A_ji, not A_ij

AT_numpy = A.T
AT_einsum = np.einsum('ij->ji', A)

print(f"\nA ({m}×{n}):\n{A}")
print(f"\n(A^T)_ij = A_ji  ({n}×{m}):")
print(f"\nvia loop:\n{AT_loop}")
print(f"\nvia A.T:\n{AT_numpy}")
print(f"\nvia einsum('ij->ji'):\n{AT_einsum}")
print(f"\n✓ All match: {np.allclose(AT_loop, AT_numpy) and np.allclose(AT_loop, AT_einsum)}")

# Transpose = relabeling indices
print("\n--- Index Relabeling Insight ---")
print("Transpose is NOT a computation — it's a relabeling.")
print("In index notation: just swap which index is 'row' vs 'column'.")
print("The data doesn't change, only its interpretation does.")
print(f"\n(A@A^T)_ij = A_ik (A^T)_kj = A_ik A_jk")
AAT = np.einsum('ik,jk->ij', A, A)
print(f"A @ A^T via einsum('ik,jk->ij'):\n{AAT}")
print(f"A @ A^T via NumPy:\n{A @ A.T}")
print(f"✓ Match: {np.allclose(AAT, A @ A.T)}")

# ══════════════════════════════════════════════════════════════════
# 3.8 TRACE: tr(A) = A_ii  (repeated index → sum along diagonal)
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("3.8 TRACE")
print("  Index form: tr(A) = A_ii  (same index twice → sum)")
print("  Free indices: none  |  Dummy indices: i")
print("=" * 70)

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
n = A.shape[0]

# Raw loop: sum diagonal elements
trace_loop = 0.0
for i in range(n):
    trace_loop += A[i, i]

trace_numpy = np.trace(A)
trace_einsum = np.einsum('ii->', A)

print(f"\nA ({n}×{n}):\n{A}")
print(f"\ntr(A) = A_ii = Σ_i A_ii")
print(f"  = {' + '.join(f'A_{i}{i}={A[i,i]}' for i in range(n))}")
print(f"  = {trace_loop}")
print(f"\nvia loop:     {trace_loop}")
print(f"via np.trace: {trace_numpy}")
print(f"via einsum:   {trace_einsum}")
print(f"\n✓ All match: {trace_loop == trace_numpy == trace_einsum}")

# Trace properties in index notation
print("\n--- Trace Properties (Index Proofs) ---")
B = np.random.randn(3, 3)

# tr(A + B) = tr(A) + tr(B)
tr_sum = np.einsum('ii->', A + B)
tr_A = np.einsum('ii->', A)
tr_B = np.einsum('ii->', B)
print(f"tr(A+B) = {tr_sum:.4f}")
print(f"tr(A) + tr(B) = {tr_A + tr_B:.4f}")
print(f"✓ Linearity: {np.allclose(tr_sum, tr_A + tr_B)}")

# tr(AB) = tr(BA)  — cyclic property
# Index proof: tr(AB) = (AB)_ii = A_ik B_ki = B_ki A_ik = (BA)_kk = tr(BA)
AB = A @ B
BA = B @ A
print(f"\ntr(AB) = {np.trace(AB):.4f}")
print(f"tr(BA) = {np.trace(BA):.4f}")
print(f"✓ Cyclic: {np.allclose(np.trace(AB), np.trace(BA))}")
print("  Proof: (AB)_ii = A_ik B_ki = B_ki A_ik = (BA)_kk  ■")

# tr(A^T) = tr(A)
print(f"\ntr(A^T) = {np.trace(A.T)}")
print(f"tr(A)   = {np.trace(A)}")
print(f"✓ Transpose: {np.trace(A.T) == np.trace(A)}")
print("  Proof: (A^T)_ii = A_ii  (diagonal is unchanged)  ■")

# ══════════════════════════════════════════════════════════════════
# 3.9 FROBENIUS INNER PRODUCT: <A,B>_F = A_ij B_ij
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("3.9 FROBENIUS INNER PRODUCT")
print("  Index form: ⟨A,B⟩_F = A_ij B_ij  (both i,j repeated → sum)")
print("  Free indices: none  |  Dummy indices: i, j")
print("=" * 70)

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])
m, n = A.shape

# Raw loop: sum over all index pairs
frob_loop = 0.0
for i in range(m):
    for j in range(n):
        frob_loop += A[i, j] * B[i, j]

frob_trace = np.trace(A.T @ B)  # tr(A^T B) = A_ji B_ji = A_ij B_ij
frob_einsum = np.einsum('ij,ij->', A, B)

print(f"\nA:\n{A}")
print(f"\nB:\n{B}")
print(f"\n⟨A,B⟩_F = A_ij B_ij = Σ_i Σ_j A_ij·B_ij")
terms = []
for i in range(m):
    for j in range(n):
        terms.append(f"{A[i,j]}×{B[i,j]}")
print(f"  = {' + '.join(terms)}")
print(f"  = {frob_loop}")

print(f"\nvia loop:        {frob_loop}")
print(f"via tr(A^T B):   {frob_trace}")  
print(f"via einsum:      {frob_einsum}")
print(f"\n✓ All match: {np.allclose(frob_loop, frob_trace) and np.allclose(frob_loop, frob_einsum)}")

# Frobenius norm via inner product
frob_norm_sq = np.einsum('ij,ij->', A, A)
frob_norm = np.sqrt(frob_norm_sq)
print(f"\n--- Frobenius Norm ---")
print(f"||A||_F² = A_ij A_ij = {frob_norm_sq}")
print(f"||A||_F  = √({frob_norm_sq}) = {frob_norm:.4f}")
print(f"np.linalg.norm(A, 'fro') = {np.linalg.norm(A, 'fro'):.4f}")
print(f"✓ Match: {np.allclose(frob_norm, np.linalg.norm(A, 'fro'))}")

# Connection: Frobenius = trace
print(f"\nEquivalence: ||A||_F² = A_ij A_ij = (A^T A)_jj = tr(A^T A)")
print(f"tr(A^T A) = {np.trace(A.T @ A)}")
print(f"A_ij A_ij = {frob_norm_sq}")
print(f"✓ {np.allclose(np.trace(A.T @ A), frob_norm_sq)}")

# ══════════════════════════════════════════════════════════════════
# 3.10 HADAMARD PRODUCT: (A ⊙ B)_ij = A_ij B_ij (FREE, not summed)
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("3.10 HADAMARD (ELEMENT-WISE) PRODUCT")
print("  Index form: (A ⊙ B)_ij = A_ij B_ij  (i,j are FREE)")
print("  Free indices: i, j  |  Dummy indices: none")
print("=" * 70)

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])
m, n = A.shape

# Raw loop: element-wise with no summation
H_loop = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        H_loop[i, j] = A[i, j] * B[i, j]

H_numpy = A * B
H_einsum = np.einsum('ij,ij->ij', A, B)  # Note: ij in output!

print(f"\nA:\n{A}")
print(f"\nB:\n{B}")
print(f"\n(A ⊙ B)_ij = A_ij · B_ij  (same indices, but they're FREE)")
print(f"\nvia loop:\n{H_loop}")
print(f"\nvia A * B:\n{H_numpy}")
print(f"\nvia einsum('ij,ij->ij'):\n{H_einsum}")
print(f"\n✓ All match: {np.allclose(H_loop, H_numpy) and np.allclose(H_loop, H_einsum)}")

# CRITICAL distinction: Hadamard vs Frobenius
print("\n" + "=" * 70)
print("CRITICAL: HADAMARD vs FROBENIUS — SAME INDICES, DIFFERENT MEANING")
print("=" * 70)
print(f"\nFrobenius: einsum('ij,ij->') → scalar   {np.einsum('ij,ij->', A, B)}")
print(f"Hadamard:  einsum('ij,ij->ij') → matrix")
print(f"{np.einsum('ij,ij->ij', A, B)}")
print(f"\nThe ONLY difference: what appears after '->'")
print(f"  'ij,ij->'     means i,j are DUMMY (summed out) → scalar")
print(f"  'ij,ij->ij'   means i,j are FREE (kept) → matrix")
print(f"\nThis is why explicit einsum notation ('->') is clearer than")
print(f"implicit Einstein convention, where context determines meaning.")

# AI usage: gating mechanisms
print(f"\n--- AI Application: Gating ---")
signal = np.array([[0.5, 1.2], [0.8, -0.3]])
gate = np.array([[0.9, 0.1], [0.7, 0.0]])  # From sigmoid
gated = np.einsum('ij,ij->ij', signal, gate)
print(f"LSTM/GRU gating: output_ij = signal_ij · gate_ij")
print(f"Signal:\n{signal}")
print(f"Gate (sigmoid output):\n{gate}")
print(f"Gated output:\n{gated}")
print(f"Gate=0 blocks signal, Gate=1 passes signal.")

4. Contraction Operations — The Heart of Einstein Summation

Contraction is the defining operation of Einstein notation: when an index appears twice, we sum over it, reducing the tensor's rank by 2 for each contracted pair.

$$T^{i_1 \ldots i_p}{}_{j_1 \ldots j_q} \xrightarrow{\text{contract } i_r = j_s} T^{i_1 \ldots \hat{i}_r \ldots i_p}{}_{j_1 \ldots \hat{j}_s \ldots j_q}$$

The hat ($\hat{\ }$) means "remove this index." Each contraction:

  • Removes 2 indices (1 upper + 1 lower, or 2 matching)
  • Reduces rank by 2
  • Produces a sum over the contracted range

Key Insight for AI

Every neural network forward pass is a sequence of contractions:

  • Linear layer: contract weight indices with input indices
  • Attention: contract query/key dimensions
  • Convolution: contract over kernel spatial + channel dimensions (see the sketch below)
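
A minimal sketch of that last point, assuming a stride-1, unpadded convolution in the ML sense (cross-correlation, no kernel flip) and NumPy ≥ 1.20 for sliding_window_view:

import numpy as np

x = np.random.randn(2, 3, 8, 8)       # batch b, channels c, height, width
K = np.random.randn(5, 3, 3, 3)       # out-channels o, in-channels c, kh, kw

# Extract 3×3 sliding windows (stride 1, no padding) → (b, c, H_out, W_out, kh, kw)
windows = np.lib.stride_tricks.sliding_window_view(x, (3, 3), axis=(2, 3))

# One contraction over channel c and kernel offsets p, q:
#   y_{boij} = Σ_{cpq} windows_{bcijpq} · K_{ocpq}
y = np.einsum('bcijpq,ocpq->boij', windows, K)
print(y.shape)                        # (2, 5, 6, 6)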

4.1 Single Contraction

Matrix-vector ($w_i = A_{ij} v_j$) contracts 1 pair: rank 2+1 → rank 1

4.2 Double Contraction

Frobenius inner product ($s = A_{ij} B_{ij}$) contracts 2 pairs: rank 2+2 → rank 0

4.3 Tensor Contractions

Higher-order tensors with selective contraction: $C_{ik} = T_{ijk} v_j$
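
A minimal sketch of this pattern, in the same loop-then-einsum style as the code cell below (contract only the middle index $j$, keep $i$ and $k$ free):

import numpy as np

T = np.random.randn(3, 4, 5)
v = np.random.randn(4)

C_loop = np.zeros((3, 5))
for i in range(3):
    for k in range(5):
        for j in range(4):            # j is the dummy index
            C_loop[i, k] += T[i, j, k] * v[j]

C_einsum = np.einsum('ijk,j->ik', T, v)
print(np.allclose(C_loop, C_einsum))  # True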

4.4 Contraction Chains

Multiple contractions in sequence: $y_i = A_{ij} B_{jk} x_k$ (2 contractions)

Code cell 24

# ══════════════════════════════════════════════════════════════════
# 4.1 SINGLE CONTRACTION  —  Contract one index pair
# ══════════════════════════════════════════════════════════════════

print("=" * 70)
print("4.1 SINGLE CONTRACTION")
print("  One repeated index pair → one summation → rank drops by 2")
print("=" * 70)

# Example 1: Matrix-vector  w_i = A_ij v_j
# rank(A)=2 + rank(v)=1 - 2*1_contraction = 1 → vector
A = np.array([[2, 1, 0],
              [0, 3, 1],
              [1, 0, 2]])
v = np.array([1.0, 2.0, 3.0])
n = A.shape[0]

print("\n--- Example: Matrix-Vector (rank 2+1 → 1) ---")
w_loop = np.zeros(n)
for i in range(n):
    for j in range(n):  # j is contracted
        w_loop[i] += A[i, j] * v[j]
w_einsum = np.einsum('ij,j->i', A, v)
print(f"A_ij v_j → w_i")
print(f"Rank: 2 + 1 - 2 = 1 (vector)")
print(f"Result: {w_einsum}")
print(f"✓ Match: {np.allclose(w_loop, w_einsum)}")

# Example 2: 3rd-order tensor contracted with vector
# C_ij = T_ijk v_k  (contract k)
# rank(T)=3 + rank(v)=1 - 2*1 = 2 → matrix
print("\n--- Example: 3rd-Order Tensor × Vector (rank 3+1 → 2) ---")
T = np.random.randn(3, 3, 4)  # 3×3×4
v = np.random.randn(4)
n1, n2, n3 = T.shape

C_loop = np.zeros((n1, n2))
for i in range(n1):
    for j in range(n2):
        for k in range(n3):  # k is contracted
            C_loop[i, j] += T[i, j, k] * v[k]

C_einsum = np.einsum('ijk,k->ij', T, v)
print(f"T_ijk v_k → C_ij")
print(f"Rank: 3 + 1 - 2 = 2 (matrix)")
print(f"Shape: {T.shape} contracted with {v.shape}{C_einsum.shape}")
print(f"✓ Match: {np.allclose(C_loop, C_einsum)}")

# Example 3: Self-contraction (trace-like)
# B_k = T_iik  (contract i with itself)
print("\n--- Example: Self-Contraction (rank 3 → 1) ---")
T = np.random.randn(3, 3, 4)  # first two dims must match
B_loop = np.zeros(4)
for k in range(4):
    for i in range(3):  # i summed (appears twice)
        B_loop[k] += T[i, i, k]

B_einsum = np.einsum('iik->k', T)
print(f"T_iik → B_k  (sum diagonal in first two indices)")
print(f"Shape: {T.shape}{B_einsum.shape}")
print(f"✓ Match: {np.allclose(B_loop, B_einsum)}")

# ══════════════════════════════════════════════════════════════════
# 4.2 DOUBLE CONTRACTION — Contract two index pairs simultaneously
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("4.2 DOUBLE CONTRACTION")
print("  Two repeated index pairs → two sums → rank drops by 4")
print("=" * 70)

# Example 1: Frobenius inner product  s = A_ij B_ij
# rank 2+2 - 4 = 0 → scalar
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

s_loop = 0.0
for i in range(2):
    for j in range(2):
        s_loop += A[i, j] * B[i, j]

s_einsum = np.einsum('ij,ij->', A, B)
print(f"\n--- Frobenius: A_ij B_ij (rank 2+2-4 = 0) ---")
print(f"s = {s_einsum}")
print(f"✓ Match: {np.allclose(s_loop, s_einsum)}")

# Example 2: 4th-order tensor double contraction
# C = T_ijkl S_ijkl  → scalar (all 4 indices contracted)
print(f"\n--- 4th-Order Double Contraction ---")
T = np.random.randn(2, 3, 2, 3)
S = np.random.randn(2, 3, 2, 3)

c_loop = 0.0
for i in range(2):
    for j in range(3):
        for k in range(2):
            for l in range(3):
                c_loop += T[i, j, k, l] * S[i, j, k, l]

c_einsum = np.einsum('ijkl,ijkl->', T, S)
print(f"T_ijkl S_ijkl → scalar  (4 contractions)")
print(f"Rank: 4 + 4 - 8 = 0")
print(f"Result: {c_einsum:.6f}")
print(f"✓ Match: {np.allclose(c_loop, c_einsum)}")

# Example 3: Partial double contraction
# C_il = T_ijkl M_jk  (contract j and k, keep i and l free)
print(f"\n--- Partial Double Contraction ---")
T = np.random.randn(3, 4, 4, 5)
M = np.random.randn(4, 4)

C_loop = np.zeros((3, 5))
for i in range(3):
    for l in range(5):
        for j in range(4):
            for k in range(4):
                C_loop[i, l] += T[i, j, k, l] * M[j, k]

C_einsum = np.einsum('ijkl,jk->il', T, M)
print(f"T_ijkl M_jk → C_il  (contract j,k; keep i,l)")
print(f"Rank: 4 + 2 - 4 = 2")
print(f"Shape: {T.shape} with {M.shape}{C_einsum.shape}")
print(f"✓ Match: {np.allclose(C_loop, C_einsum)}")

# ══════════════════════════════════════════════════════════════════
# 4.3 CONTRACTION ORDER AND EFFICIENCY
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("4.3 CONTRACTION ORDER AND COMPUTATIONAL COST")
print("=" * 70)

# y_i = A_ij B_jk x_k  — Two ways to compute
A = np.random.randn(100, 50)
B = np.random.randn(50, 200)
x = np.random.randn(200)

# Method 1: (AB)x — first contract j, then k
# Cost: m*n*p + m*p = 100*50*200 + 100*200 = 1,020,000 multiplications
# Intermediate: 100×200 matrix

# Method 2: A(Bx) — first contract k, then j
# Cost: n*p + m*n = 50*200 + 100*50 = 15,000 multiplications
# Intermediate: 50-element vector

print("\ny_i = A_ij B_jk x_k")
print(f"\nA: {A.shape},  B: {B.shape},  x: {x.shape}")

import time

# Time Method 1: (AB)x
t0 = time.perf_counter()
for _ in range(1000):
    y1 = (A @ B) @ x
t1 = time.perf_counter()
time_ABx = t1 - t0

# Time Method 2: A(Bx)
t0 = time.perf_counter()
for _ in range(1000):
    y2 = A @ (B @ x)
t1 = time.perf_counter()
time_Abx = t1 - t0

print(f"\nMethod 1: (AB)x")
print(f"  FLOPs: {100*50*200} + {100*200} = {100*50*200 + 100*200:,}")
print(f"  Time (1000 runs): {time_ABx:.4f}s")

print(f"\nMethod 2: A(Bx)")
print(f"  FLOPs: {50*200} + {100*50} = {50*200 + 100*50:,}")
print(f"  Time (1000 runs): {time_Abx:.4f}s")

print(f"\nSpeedup: {time_ABx/time_Abx:.1f}x")
print(f"FLOP ratio: {(100*50*200 + 100*200) / (50*200 + 100*50):.1f}x")
print(f"\n✓ Results match: {np.allclose(y1, y2)}")

print("\n--- Key Insight ---")
print("Index notation doesn't specify contraction ORDER.")
print("A_ij B_jk x_k just says 'sum over j and k'.")
print("But the ORDER you contract determines cost!")
print("numpy.einsum_path finds optimal contraction order automatically.")

# Show optimal path
path, info = np.einsum_path('ij,jk,k->i', A, B, x, optimize='optimal')
print(f"\nnp.einsum_path optimal order: {path}")
print(info)

# ══════════════════════════════════════════════════════════════════
# 4.4 CONTRACTION CHAINS IN NEURAL NETWORKS
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("4.4 CONTRACTION CHAINS IN NEURAL NETWORKS")
print("=" * 70)

# A 3-layer MLP is a chain of contractions:
# h1_i = W1_ij x_j           (contract j)
# h2_k = W2_ki h1_i          (contract i)
# y_m  = W3_mk h2_k          (contract k)
# Combined: y_m = W3_mk W2_ki W1_ij x_j  (3 contractions)

np.random.seed(42)
d_in, d_h1, d_h2, d_out = 10, 8, 6, 4

W1 = np.random.randn(d_h1, d_in) * 0.1
W2 = np.random.randn(d_h2, d_h1) * 0.1
W3 = np.random.randn(d_out, d_h2) * 0.1
x = np.random.randn(d_in)

# Sequential (natural) contractions
h1 = np.einsum('ij,j->i', W1, x)          # contract j
h2 = np.einsum('ki,i->k', W2, h1)          # contract i  
y_seq = np.einsum('mk,k->m', W3, h2)       # contract k

# All at once (einsum handles the chain)
y_chain = np.einsum('mk,ki,ij,j->m', W3, W2, W1, x)

# Verify against standard matmul chain
y_matmul = W3 @ W2 @ W1 @ x

print("3-Layer Linear Network: y_m = W3_mk W2_ki W1_ij x_j")
print(f"\nDimensions: {d_in}{d_h1}{d_h2}{d_out}")
print(f"\nSequential contractions:")
print(f"  h1 = W1 @ x:   {h1.shape}")
print(f"  h2 = W2 @ h1:  {h2.shape}")
print(f"  y  = W3 @ h2:  {y_seq.shape}")
print(f"\ny (sequential): {y_seq}")
print(f"y (chain):      {y_chain}")
print(f"y (matmul):     {y_matmul}")
print(f"\n✓ All match: {np.allclose(y_seq, y_chain) and np.allclose(y_seq, y_matmul)}")

# Show contraction path
path, info = np.einsum_path('mk,ki,ij,j->m', W3, W2, W1, x, optimize='optimal')
print(f"\nOptimal contraction path: {path}")
print("The optimizer contracts right-to-left (Bx first, then A(Bx))...")
print("because vectors are cheaper to multiply than matrices.")

5. Special Tensors: Kronecker Delta and Levi-Civita Symbol

Two tensors appear so frequently in index notation that they have special names and symbols. They serve as the identity and orientation operators of tensor algebra.

5.1 Kronecker Delta $\delta_{ij}$

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$

Properties:

  • $\delta_{ij}$ is the identity matrix $I$ in disguise
  • Index substitution: $\delta_{ij} A_{jk} = A_{ik}$ (replaces $j$ with $i$)
  • Trace extraction: $\delta_{ii} = n$ (trace of identity = dimension)
  • Sifting property: $\delta_{ij} v_j = v_i$ (selects the $i$-th component)

In AI: the identity matrix, skip connections ($y_i = f(x)_i + \delta_{ij} x_j$), and residual networks.

5.2 Levi-Civita Symbol $\varepsilon_{ijk}$

$$\varepsilon_{ijk} = \begin{cases} +1 & \text{if } (i,j,k) \text{ is an even permutation of } (1,2,3) \\ -1 & \text{if } (i,j,k) \text{ is an odd permutation of } (1,2,3) \\ 0 & \text{if any two indices are equal} \end{cases}$$

Key uses:

  • Cross product: $(u \times v)_i = \varepsilon_{ijk} u_j v_k$
  • Determinant: $\det(A) = \varepsilon_{ijk} A_{1i} A_{2j} A_{3k}$
  • Volume forms and orientation

Less common in ML than $\delta_{ij}$, but essential in physics-informed neural networks and geometric deep learning.

Code cell 26

# ══════════════════════════════════════════════════════════════════
# 5.1 KRONECKER DELTA: δ_ij — The Identity Tensor
# ══════════════════════════════════════════════════════════════════

print("=" * 70)
print("5.1 KRONECKER DELTA δ_ij")
print("  δ_ij = 1 if i=j, 0 otherwise  (= identity matrix)")
print("=" * 70)

# Build Kronecker delta explicitly
n = 4

delta = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        delta[i, j] = 1.0 if i == j else 0.0

print(f"\nδ_ij for n={n}:")
print(delta)
print(f"= np.eye({n}): {np.allclose(delta, np.eye(n))}")

# Property 1: Index substitution  δ_ij A_jk = A_ik
print(f"\n--- Property 1: Index Substitution ---")
print("δ_ij A_jk = A_ik  (δ replaces j with i)")
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
I3 = np.eye(3)

# Explicit loop
result_loop = np.zeros((3, 3))
for i in range(3):
    for k in range(3):
        for j in range(3):  # j is contracted
            result_loop[i, k] += I3[i, j] * A[j, k]

print(f"\nA:\n{A}")
print(f"\nδ_ij A_jk via loop:\n{result_loop}")
print(f"= A itself: {np.allclose(result_loop, A)}")
print("✓ δ is the identity: it just passes A through unchanged!")

# Property 2: Sifting  δ_ij v_j = v_i
print(f"\n--- Property 2: Sifting ---")
print("δ_ij v_j = v_i  (selects the i-th component)")
v = np.array([10.0, 20.0, 30.0])
sift_loop = np.zeros(3)
for i in range(3):
    for j in range(3):
        sift_loop[i] += I3[i, j] * v[j]
print(f"v = {v}")
print(f"δ_ij v_j = {sift_loop}")
print(f"= v: {np.allclose(sift_loop, v)}")

# Property 3: Trace  δ_ii = n
print(f"\n--- Property 3: Trace ---")
print(f"δ_ii = Σ_i δ_ii = number of dimensions")
trace_delta = 0
for i in range(n):
    trace_delta += delta[i, i]
print(f"δ_ii for n={n}: {trace_delta}")
print(f"np.einsum('ii->', delta) = {np.einsum('ii->', delta)}")

# Property 4: Contraction with Kronecker
print(f"\n--- Property 4: Double Kronecker ---")
print("δ_ij δ_jk = δ_ik  (identity × identity = identity)")
result = np.einsum('ij,jk->ik', I3, I3)
print(f"δ_ij δ_jk:\n{result}")
print(f"= δ_ik: {np.allclose(result, I3)}")

# AI Application: Residual Connection
print(f"\n--- AI Application: Residual/Skip Connection ---")
print("ResNet: y_i = f(x)_i + x_i = f(x)_i + δ_ij x_j")
print("The skip connection IS the Kronecker delta!")
x = np.array([1.0, 2.0, 3.0])
fx = np.array([0.1, -0.2, 0.3])  # Some nonlinear transform
y_residual = fx + np.einsum('ij,j->i', np.eye(3), x)
y_simple = fx + x
print(f"f(x) = {fx}")
print(f"x    = {x}")
print(f"y = f(x) + δ_ij x_j = {y_residual}")
print(f"y = f(x) + x        = {y_simple}")
print(f"✓ Same: {np.allclose(y_residual, y_simple)}")

# ══════════════════════════════════════════════════════════════════
# 5.2 LEVI-CIVITA SYMBOL ε_ijk — Orientation & Cross Products
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("5.2 LEVI-CIVITA SYMBOL ε_ijk")
print("  Encodes permutation sign: +1 even, -1 odd, 0 repeated")
print("=" * 70)

# Build Levi-Civita tensor explicitly
def build_levi_civita(n=3):
    """Build n-dimensional Levi-Civita symbol from first principles."""
    eps = np.zeros([n] * n)
    if n == 3:
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    if (i, j, k) in [(0,1,2), (1,2,0), (2,0,1)]:
                        eps[i, j, k] = 1.0   # even permutation
                    elif (i, j, k) in [(0,2,1), (2,1,0), (1,0,2)]:
                        eps[i, j, k] = -1.0  # odd permutation
                    # else: 0 (repeated index)
    return eps

eps = build_levi_civita(3)

print(f"\nε_ijk (3×3×3 tensor, only 6 non-zero entries):")
for i in range(3):
    for j in range(3):
        for k in range(3):
            if eps[i, j, k] != 0:
                print(f"  ε_{i+1}{j+1}{k+1} = {eps[i,j,k]:+.0f}")

print(f"\nTotal non-zero: {np.count_nonzero(eps)} out of {eps.size}")

# Cross Product: (u × v)_i = ε_ijk u_j v_k
print(f"\n--- Cross Product via Levi-Civita ---")
print("(u × v)_i = ε_ijk u_j v_k")
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])

# Raw loop
cross_loop = np.zeros(3)
for i in range(3):
    for j in range(3):
        for k in range(3):
            cross_loop[i] += eps[i, j, k] * u[j] * v[k]

cross_numpy = np.cross(u, v)
cross_einsum = np.einsum('ijk,j,k->i', eps, u, v)

print(f"\nu = {u}  (x-axis)")
print(f"v = {v}  (y-axis)")
print(f"\n(u × v)_i = ε_ijk u_j v_k:")
for i in range(3):
    terms = []
    for j in range(3):
        for k in range(3):
            if eps[i,j,k] != 0 and u[j] != 0 and v[k] != 0:
                terms.append(f"ε_{i+1}{j+1}{k+1}·u_{j+1}·v_{k+1} = {eps[i,j,k]:+.0f}·{u[j]}·{v[k]}")
    if terms:
        print(f"  i={i+1}: {', '.join(terms)}{cross_loop[i]:+.0f}")
    else:
        print(f"  i={i+1}: (all terms zero) → 0")

print(f"\nvia loop:    {cross_loop}")
print(f"via np.cross:{cross_numpy}")
print(f"via einsum:  {cross_einsum}")
print(f"✓ Match: {np.allclose(cross_loop, cross_numpy)}")
print(f"x̂ × ŷ = ẑ ✓" if np.allclose(cross_loop, [0,0,1]) else "")

# Determinant via Levi-Civita
print(f"\n--- Determinant via Levi-Civita ---")
print("det(A) = ε_ijk A_1i A_2j A_3k")
A = np.array([[2, 1, 3],
              [0, 4, 1],
              [1, 0, 2]])

# Raw loop
det_loop = 0.0
for i in range(3):
    for j in range(3):
        for k in range(3):
            det_loop += eps[i, j, k] * A[0, i] * A[1, j] * A[2, k]

det_numpy = np.linalg.det(A)
det_einsum = np.einsum('ijk,i,j,k->', eps, A[0], A[1], A[2])

print(f"\nA:\n{A}")
print(f"\ndet(A) via ε_ijk:")
print(f"  loop:   {det_loop}")
print(f"  numpy:  {det_numpy}")
print(f"  einsum: {det_einsum}")
print(f"✓ Match: {np.allclose(det_loop, det_numpy)}")

# ε-δ identity (important relation)
print(f"\n--- The ε-δ Identity ---")
print("ε_ijk ε_imn = δ_jm δ_kn − δ_jn δ_km")
print("This connects cross products to dot products.")
print("Proof: contract over i in the product of two Levi-Civita symbols.")

# Verify numerically
I3 = np.eye(3)
lhs = np.einsum('ijk,imn->jkmn', eps, eps)
rhs = np.zeros((3, 3, 3, 3))
for j in range(3):
    for k in range(3):
        for m in range(3):
            for nn in range(3):
                rhs[j, k, m, nn] = I3[j, m] * I3[k, nn] - I3[j, nn] * I3[k, m]

print(f"\n✓ ε_ijk ε_imn = δ_jm δ_kn − δ_jn δ_km: {np.allclose(lhs, rhs)}")
print("This identity is used extensively in vector calculus identities.")

6. Index Manipulation Rules

These algebraic rules let you transform one index expression into another — like simplifying equations. Mastering these rules means you can derive gradient formulas, optimize computations, and verify identities by pure symbol manipulation.

Rules Summary

| Rule | Example | What Happens |
|---|---|---|
| Renaming | $A_{ij}B_{jk} = A_{im}B_{mk}$ | Dummy indices can be relabeled |
| Factoring | $A_{ij}C_{k} + B_{ij}C_{k} = (A_{ij}+B_{ij})C_{k}$ | Factor common terms |
| Substitution | $\delta_{ij}A_{jk} = A_{ik}$ | Kronecker delta replaces index |
| Symmetry | $S_{ij} = S_{ji}$ implies $S_{ij}A_{ij} = S_{ij}A_{ji}$ | Exploit symmetric tensors |
| Antisymmetry | $\varepsilon_{ijk} = -\varepsilon_{jik}$ | Sign changes on swap |
| Product rule | $\partial_k(A_{ij}B_{jl}) = (\partial_k A_{ij})B_{jl} + A_{ij}(\partial_k B_{jl})$ | Leibniz rule with indices |

Code cell 28

# ══════════════════════════════════════════════════════════════════
# 6.1 DUMMY INDEX RENAMING (RELABELING)
# ══════════════════════════════════════════════════════════════════

print("=" * 70)
print("6.1 DUMMY INDEX RENAMING")
print("  A dummy (summed) index is a variable of summation.")
print("  Like ∫f(x)dx = ∫f(t)dt, we can rename it freely.")
print("=" * 70)

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)

# A_ij B_jk  using j as dummy
C1 = np.einsum('ij,jk->ik', A, B)

# A_im B_mk  using m as dummy (same operation, relabeled)
C2 = np.einsum('im,mk->ik', A, B)

print(f"\nA_ij B_jk (dummy=j):  shape {C1.shape}")
print(f"A_im B_mk (dummy=m):  shape {C2.shape}")
print(f"✓ Identical: {np.allclose(C1, C2)}")

# But you CANNOT rename free indices!
print(f"\n--- Free indices CANNOT be renamed ---")
print("A_ij B_jk → result has free indices i,k")
print("If you rename i→p, k→q, the RESULT changes name: C_pq")
print("The values are the same, but the tensor is 'called' differently.")
C_renamed = np.einsum('pj,jq->pq', A, B)  # same values
print(f"A_pj B_jq → C_pq: same values? {np.allclose(C1, C_renamed)}")

# ══════════════════════════════════════════════════════════════════
# 6.2 KRONECKER DELTA SUBSTITUTION
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("6.2 KRONECKER DELTA SUBSTITUTION")
print("  δ_ij replaces j with i (or i with j) wherever it appears")
print("=" * 70)

n = 4
A = np.random.randn(n, n)
delta = np.eye(n)

# Rule: δ_ij A_jk = A_ik
print("\n--- Rule: δ_ij A_jk = A_ik ---")
lhs = np.einsum('ij,jk->ik', delta, A)  # contract j via delta
rhs = A.copy()                            # just A_ik directly
print(f"δ_ij A_jk: \n{lhs[:3,:3]}")
print(f"A_ik:      \n{rhs[:3,:3]}")
print(f"✓ Equal: {np.allclose(lhs, rhs)}")

# Rule: δ_ij A_ik = A_jk (substitute i→j)
print("\n--- Rule: δ_ij A_ik = A_jk ---")
lhs2 = np.einsum('ij,ik->jk', delta, A)
print(f"δ_ij A_ik = A_jk: {np.allclose(lhs2, A)}")

# Chain of substitutions
print("\n--- Chain: δ_ij δ_jk A_kl = δ_ik A_kl = A_il ---")
chain = np.einsum('ij,jk,kl->il', delta, delta, A)
print(f"δ_ij δ_jk A_kl = A_il: {np.allclose(chain, A)}")
print("Each δ substitution 'eats' one index. Like composing identity functions.")

# ══════════════════════════════════════════════════════════════════
# 6.3 SYMMETRY AND ANTISYMMETRY EXPLOITATION
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("6.3 SYMMETRY AND ANTISYMMETRY")
print("=" * 70)

# Symmetric tensor: S_ij = S_ji
n = 3
S = np.random.randn(n, n)
S = (S + S.T) / 2  # Make symmetric

print(f"\n--- Symmetric Tensor: S_ij = S_ji ---")
print(f"S:\n{S}")
print(f"S = S^T: {np.allclose(S, S.T)}")

# Key property: S_ij A_ij = S_ij A_ji (can swap A's indices)
A = np.random.randn(n, n)
val1 = np.einsum('ij,ij->', S, A)
val2 = np.einsum('ij,ji->', S, A)
print(f"\nS_ij A_ij = {val1:.6f}")
print(f"S_ij A_ji = {val2:.6f}")
print(f"✓ Equal: {np.allclose(val1, val2)}")
print("Proof: S_ij A_ji = S_ji A_ji (rename i↔j) = S_ji A_ij (swap dummy)")
print("       = S_ij A_ij (by symmetry of S). ■")

# Antisymmetric tensor: T_ij = -T_ji
T = np.random.randn(n, n)
T = (T - T.T) / 2  # Make antisymmetric

print(f"\n--- Antisymmetric Tensor: T_ij = -T_ji ---")
print(f"T:\n{np.round(T, 3)}")
print(f"T = -T^T: {np.allclose(T, -T.T)}")

# Key property: T_ij S_ij = 0 (antisymmetric × symmetric = 0)
contraction = np.einsum('ij,ij->', T, S)
print(f"\nT_ij S_ij = {contraction:.10f}")
print(f"✓ = 0: {np.allclose(contraction, 0)}")
print("Proof: T_ij S_ij = -T_ji S_ji = -T_ij S_ij")
print("       → 2·T_ij S_ij = 0 → T_ij S_ij = 0. ■")

# Any matrix = symmetric + antisymmetric
print(f"\n--- Decomposition: Any Matrix = Symmetric + Antisymmetric ---")
M = np.random.randn(n, n)
M_sym = (M + M.T) / 2
M_anti = (M - M.T) / 2
print(f"M = M_sym + M_anti: {np.allclose(M, M_sym + M_anti)}")
print(f"M_sym is symmetric: {np.allclose(M_sym, M_sym.T)}")
print(f"M_anti is antisymmetric: {np.allclose(M_anti, -M_anti.T)}")

# ══════════════════════════════════════════════════════════════════
# 6.4 PRODUCT RULE (LEIBNIZ) WITH INDICES
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("6.4 PRODUCT RULE WITH INDICES")
print("  ∂/∂x_k (A_ij B_jl) = (∂A_ij/∂x_k) B_jl + A_ij (∂B_jl/∂x_k)")
print("=" * 70)

# Numerical verification with parameter-dependent matrices
# Let A(t) = t*A0 and B(t) = t^2*B0 for scalar parameter t
# d/dt(AB) should equal (dA/dt)B + A(dB/dt)

A0 = np.random.randn(3, 4)
B0 = np.random.randn(4, 3)
t = 2.0
dt = 1e-7

def AB_at(t):
    return (t * A0) @ (t**2 * B0)

# Numerical derivative
dAB_numerical = (AB_at(t + dt) - AB_at(t - dt)) / (2 * dt)

# Analytical via product rule
dA_dt = A0          # d(tA0)/dt = A0
dB_dt = 2*t * B0    # d(t²B0)/dt = 2tB0
A_t = t * A0
B_t = t**2 * B0

# Product rule: d(AB)/dt = (dA/dt)B + A(dB/dt)
dAB_product_rule = dA_dt @ B_t + A_t @ dB_dt

print(f"\nA(t) = t·A₀,  B(t) = t²·B₀,  at t={t}")
print(f"\nd/dt[A(t)B(t)] via finite differences:")
print(f"{dAB_numerical[:2,:2]}")
print(f"\nd/dt[A(t)B(t)] via product rule (dA/dt·B + A·dB/dt):")
print(f"{dAB_product_rule[:2,:2]}")
print(f"\n✓ Match: {np.allclose(dAB_numerical, dAB_product_rule, atol=1e-5)}")

# Index form of the chain:
print(f"\n--- Index Form ---")
print("d/dt (A_ij B_jk) = (dA_ij/dt) B_jk + A_ij (dB_jk/dt)")
print("Each term preserves the FREE indices (i,k).")
print("The DUMMY index j appears in every term.")
print("This is how backpropagation derives gradients!")

7. Gradients and Backpropagation in Index Notation

This is where Einstein summation becomes essential for AI. Backpropagation is just the chain rule applied to indexed expressions. Understanding gradients in index notation lets you:

  1. Derive gradient formulas for any layer
  2. Verify automatic differentiation outputs
  3. Optimize custom gradient implementations

Key Derivatives in Index Form

| Operation | Forward | Gradient w.r.t. input |
|---|---|---|
| Linear | $y_i = W_{ij} x_j$ | $\frac{\partial L}{\partial x_j} = W_{ij} \frac{\partial L}{\partial y_i}$ |
| Quadratic | $L = x_i A_{ij} x_j$ | $\frac{\partial L}{\partial x_k} = A_{kj}x_j + A_{ik}x_i = (A+A^T)_{kj}x_j$ |
| Softmax CE | $L = -y_i \log(p_i)$ | $\frac{\partial L}{\partial z_k} = p_k - y_k$ |

The Fundamental Rule

$$\frac{\partial (A_{ij} x_j)}{\partial x_k} = A_{ij} \delta_{jk} = A_{ik}$$

The Kronecker delta $\delta_{jk} = \frac{\partial x_j}{\partial x_k}$ is the Jacobian of the identity map.

Code cell 30

# ══════════════════════════════════════════════════════════════════
# 7.1 DERIVATIVE OF LINEAR OPERATION: ∂(Wx)/∂x AND ∂(Wx)/∂W
# ══════════════════════════════════════════════════════════════════

print("=" * 70)
print("7.1 DERIVATIVES OF LINEAR OPERATIONS")
print("=" * 70)

np.random.seed(42)

# Forward: y_i = W_ij x_j
W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # 2×3
x = np.array([1.0, -1.0, 2.0])    # 3

y = np.einsum('ij,j->i', W, x)
print(f"Forward: y_i = W_ij x_j")
print(f"W ({W.shape[0]}×{W.shape[1]}):\n{W}")
print(f"x = {x}")
print(f"y = {y}")

# --- Gradient ∂y_i/∂x_k ---
# ∂y_i/∂x_k = ∂(W_ij x_j)/∂x_k = W_ij δ_jk = W_ik
# This is the JACOBIAN of the linear map
print(f"\n--- ∂y_i/∂x_k = W_ik (the Jacobian) ---")
print("Derivation:")
print("  ∂y_i/∂x_k = ∂(W_ij x_j)/∂x_k")
print("            = W_ij (∂x_j/∂x_k)")
print("            = W_ij δ_jk          ← ∂x_j/∂x_k = δ_jk")
print("            = W_ik               ← δ substitution")

# Verify numerically
jacobian_analytical = W.copy()  # ∂y_i/∂x_k = W_ik

jacobian_numerical = np.zeros((2, 3))
eps = 1e-7
for k in range(3):
    x_plus = x.copy(); x_plus[k] += eps
    x_minus = x.copy(); x_minus[k] -= eps
    jacobian_numerical[:, k] = (W @ x_plus - W @ x_minus) / (2 * eps)

print(f"\nJacobian (analytical) = W:\n{jacobian_analytical}")
print(f"Jacobian (numerical):\n{np.round(jacobian_numerical, 6)}")
print(f"✓ Match: {np.allclose(jacobian_analytical, jacobian_numerical)}")

# --- Gradient ∂y_i/∂W_kl ---
# ∂y_i/∂W_kl = ∂(W_ij x_j)/∂W_kl = δ_ik δ_jl x_j = δ_ik x_l
print(f"\n--- ∂y_i/∂W_kl = δ_ik x_l ---")
print("Derivation:")
print("  ∂y_i/∂W_kl = ∂(W_ij x_j)/∂W_kl")
print("             = (∂W_ij/∂W_kl) x_j")
print("             = δ_ik δ_jl x_j       ← ∂W_ij/∂W_kl = δ_ik δ_jl")
print("             = δ_ik x_l             ← δ_jl contracts with x_j")

# ∂y_i/∂W_kl is a 4th-order tensor (i,k,l) but structured
# For loss gradient: ∂L/∂W_kl = (∂L/∂y_i)(∂y_i/∂W_kl) = (∂L/∂y_k) x_l
print(f"\n--- For backprop with scalar loss L ---")
print("∂L/∂W_kl = (∂L/∂y_i)(∂y_i/∂W_kl) = (∂L/∂y_i) δ_ik x_l = (∂L/∂y_k) x_l")
print("This is an OUTER PRODUCT of the upstream gradient and the input!")

# Simulate backprop
dL_dy = np.array([1.0, -0.5])  # upstream gradient
dL_dW = np.einsum('k,l->kl', dL_dy, x)  # outer product

# Verify numerically
dL_dW_numerical = np.zeros_like(W)
for k in range(2):
    for l in range(3):
        W_plus = W.copy(); W_plus[k, l] += eps
        W_minus = W.copy(); W_minus[k, l] -= eps
        y_plus = W_plus @ x; y_minus = W_minus @ x
        L_plus = np.dot(dL_dy, y_plus); L_minus = np.dot(dL_dy, y_minus)
        dL_dW_numerical[k, l] = (L_plus - L_minus) / (2 * eps)

print(f"\n∂L/∂y = {dL_dy}")
print(f"x = {x}")
print(f"\n∂L/∂W = (∂L/∂y) ⊗ x (outer product):")
print(f"Analytical:\n{dL_dW}")
print(f"Numerical:\n{np.round(dL_dW_numerical, 6)}")
print(f"✓ Match: {np.allclose(dL_dW, dL_dW_numerical)}")

# ══════════════════════════════════════════════════════════════════
# 7.2 CHAIN RULE IN INDEX NOTATION
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("7.2 CHAIN RULE — BACKPROPAGATION IS INDEX CONTRACTION")
print("=" * 70)

# Two-layer network (no activation for clarity):
# h_i = W1_ij x_j        (layer 1)
# y_k = W2_ki h_i        (layer 2)
# L   = 0.5 * (y_k - t_k)(y_k - t_k)  (MSE loss)

W1 = np.array([[1.0, 2.0],
               [3.0, 4.0],
               [5.0, 6.0]])   # 3×2
W2 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])  # 2×3
x = np.array([1.0, -1.0])         # input
t = np.array([0.0, 1.0])           # target

# Forward pass
h = np.einsum('ij,j->i', W1, x)      # h_i = W1_ij x_j
y = np.einsum('ki,i->k', W2, h)      # y_k = W2_ki h_i
L = 0.5 * np.einsum('k,k->', y - t, y - t)   # MSE

print(f"Forward pass:")
print(f"  x = {x}")
print(f"  h = W1·x = {h}")
print(f"  y = W2·h = {y}")
print(f"  L = 0.5·||y-t||² = {L:.4f}")

# Backward pass using index notation
# Step 1: ∂L/∂y_k = y_k - t_k
dL_dy = y - t
print(f"\nBackward pass:")
print(f"  ∂L/∂y_k = y_k - t_k = {dL_dy}")

# Step 2: ∂L/∂h_i = ∂L/∂y_k · ∂y_k/∂h_i = ∂L/∂y_k · W2_ki = W2^T_ik · ∂L/∂y_k
# ∂L/∂h_i = W2_ki (∂L/∂y_k)   ← sum over k
dL_dh = np.einsum('ki,k->i', W2, dL_dy)
print(f"  ∂L/∂h_i = W2_ki (∂L/∂y_k) = {dL_dh}")

# Step 3: ∂L/∂x_j = ∂L/∂h_i · ∂h_i/∂x_j = ∂L/∂h_i · W1_ij = W1^T_ji · ∂L/∂h_i
dL_dx = np.einsum('ij,i->j', W1, dL_dh)
print(f"  ∂L/∂x_j = W1_ij (∂L/∂h_i) = {dL_dx}")

# Step 4: Weight gradients
dL_dW2 = np.einsum('k,i->ki', dL_dy, h)    # outer product
dL_dW1 = np.einsum('i,j->ij', dL_dh, x)    # outer product
print(f"\n  ∂L/∂W2_ki = (∂L/∂y_k) h_i:")
print(f"  {dL_dW2}")
print(f"\n  ∂L/∂W1_ij = (∂L/∂h_i) x_j:")
print(f"  {dL_dW1}")

# Numerical verification
eps = 1e-7
print(f"\n--- Numerical Verification ---")
dL_dx_num = np.zeros_like(x)
for j in range(len(x)):
    x_p = x.copy(); x_p[j] += eps
    x_m = x.copy(); x_m[j] -= eps
    y_p = W2 @ (W1 @ x_p); y_m = W2 @ (W1 @ x_m)
    L_p = 0.5 * np.sum((y_p - t)**2)
    L_m = 0.5 * np.sum((y_m - t)**2)
    dL_dx_num[j] = (L_p - L_m) / (2 * eps)

print(f"∂L/∂x analytical: {dL_dx}")
print(f"∂L/∂x numerical:  {np.round(dL_dx_num, 6)}")
print(f"✓ Match: {np.allclose(dL_dx, dL_dx_num, atol=1e-5)}")

print(f"\n--- Key Pattern ---")
print("Backprop = chain of contractions flowing BACKWARD:")
print("  ∂L/∂y_k → ∂L/∂h_i = W2_ki·(∂L/∂y_k) → ∂L/∂x_j = W1_ij·(∂L/∂h_i)")
print("Each step: contract upstream gradient with transposed weight.")
print("In index notation: transposing = swapping which index is contracted.")

# ══════════════════════════════════════════════════════════════════
# 7.3 QUADRATIC FORM GRADIENT: ∂(x^T A x)/∂x
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("7.3 QUADRATIC FORM GRADIENT")
print("  L = x_i A_ij x_j  →  ∂L/∂x_k = (A + A^T)_kj x_j")
print("=" * 70)

A = np.array([[2.0, 1.0],
              [3.0, 4.0]])  # Not symmetric!
x = np.array([1.0, 2.0])

# Forward: L = x_i A_ij x_j
L = np.einsum('i,ij,j->', x, A, x)

# Gradient derivation:
# ∂L/∂x_k = ∂(x_i A_ij x_j)/∂x_k
#          = (∂x_i/∂x_k) A_ij x_j + x_i A_ij (∂x_j/∂x_k)   ← product rule
#          = δ_ik A_ij x_j + x_i A_ij δ_jk                    ← ∂x_i/∂x_k = δ_ik
#          = A_kj x_j + x_i A_ik                               ← δ substitution
#          = A_kj x_j + A^T_ki x_i                             ← rename: A_ik → A^T_ki
#          = (A + A^T)_kj x_j                                  ← factor

print("Derivation:")
print("  ∂L/∂x_k = ∂(x_i A_ij x_j)/∂x_k")
print("           = δ_ik A_ij x_j + x_i A_ij δ_jk    (product rule)")
print("           = A_kj x_j + A^T_ki x_i              (δ substitution)")
print("           = (A + A^T)_kj x_j                   (combine)")

grad_analytical = (A + A.T) @ x

# Numerical verification
grad_numerical = np.zeros_like(x)
for k in range(len(x)):
    x_p = x.copy(); x_p[k] += eps
    x_m = x.copy(); x_m[k] -= eps
    L_p = x_p @ A @ x_p
    L_m = x_m @ A @ x_m
    grad_numerical[k] = (L_p - L_m) / (2 * eps)

print(f"\nA:\n{A}")
print(f"x = {x}")
print(f"L = x^T A x = {L}")
print(f"\n∂L/∂x = (A + A^T)x:")
print(f"  A + A^T:\n{A + A.T}")
print(f"  Analytical: {grad_analytical}")
print(f"  Numerical:  {np.round(grad_numerical, 6)}")
print(f"  ✓ Match: {np.allclose(grad_analytical, grad_numerical, atol=1e-5)}")

# Special case: symmetric A (common in ML — Hessians, covariance)
print(f"\n--- Special Case: Symmetric A ---")
print("If A_ij = A_ji, then A + A^T = 2A")
print("So ∂(x^T A x)/∂x = 2Ax  (the familiar formula)")
A_sym = (A + A.T) / 2
print(f"A_sym:\n{A_sym}")
grad_sym = 2 * A_sym @ x
print(f"2·A_sym·x = {grad_sym}")
print(f"(A+A^T)x  = {(A + A.T) @ x}")
print(f"✓ Same: {np.allclose(grad_sym, (A + A.T) @ x)}")

8. Batched Operations and Multi-Dimensional Indexing

In practice, AI operations are batched — we process $B$ samples simultaneously for efficiency. Einstein notation handles batching naturally by adding batch indices that remain free (never contracted).

The Batch Index Convention

| Notation | Meaning | Example |
|---|---|---|
| $C_{bij} = A_{bik} B_{bkj}$ | Batched matmul | $B$ separate matrix multiplies |
| $S_{bhij} = Q_{bhid} K_{bhjd}$ | Multi-head attention | $B$ batches, $H$ heads |
| $y_{bi} = W_{ij} x_{bj}$ | Shared weights | Same $W$ applied to all samples |

Key insight: batch indices survive into the output; they are always free, never dummy. Weights shared across the batch (like $W_{ij}$ above) simply carry no batch index.

Why This Matters

  • GPUs process batches in parallel
  • Einsum naturally expresses "do this operation independently for each batch element"
  • No explicit loops over batch dimension needed

Code cell 32

# ══════════════════════════════════════════════════════════════════
# 8.1 BATCHED MATRIX-VECTOR: y_bi = W_ij x_bj (shared weights)
# ══════════════════════════════════════════════════════════════════

print("=" * 70)
print("8.1 BATCHED MATRIX-VECTOR MULTIPLY")
print("  y_bi = W_ij x_bj  (b=batch, j=contracted, i=output)")
print("=" * 70)

np.random.seed(42)
batch_size = 4
d_in = 3
d_out = 2

W = np.random.randn(d_out, d_in)     # Shared weights (no batch index)
X = np.random.randn(batch_size, d_in)  # Batched input

# Raw loop: explicit batch loop
Y_loop = np.zeros((batch_size, d_out))
for b in range(batch_size):
    for i in range(d_out):
        for j in range(d_in):
            Y_loop[b, i] += W[i, j] * X[b, j]

# Einsum: batch index b is free, j is contracted
Y_einsum = np.einsum('ij,bj->bi', W, X)

# Matrix form: X @ W^T
Y_matmul = X @ W.T

print(f"\nW ({d_out}×{d_in}) — shared across batch:")
print(f"{W}")
print(f"\nX ({batch_size}×{d_in}) — batched input:")
print(f"{X}")

print(f"\ny_bi = W_ij x_bj via einsum('ij,bj->bi'):")
print(f"{Y_einsum}")
print(f"\n✓ Loop match:   {np.allclose(Y_loop, Y_einsum)}")
print(f"✓ Matmul match: {np.allclose(Y_einsum, Y_matmul)}")

print(f"\n--- Index Analysis ---")
print(f"W_ij:  indices i,j (no batch)")
print(f"x_bj:  indices b,j (batched)")
print(f"y_bi:  indices b,i (batched output)")
print(f"b: FREE (survives — one output per sample)")
print(f"j: DUMMY (summed — input dimension contracted)")
print(f"i: FREE (survives — output dimension)")

# ══════════════════════════════════════════════════════════════════
# 8.2 BATCHED MATRIX MULTIPLY: C_bik = A_bij B_bjk
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("8.2 BATCHED MATRIX-MATRIX MULTIPLY")
print("  C_bik = A_bij B_bjk  (independent matmul per batch)")
print("=" * 70)

batch_size = 3
m, n, p = 2, 4, 3

A = np.random.randn(batch_size, m, n)
B = np.random.randn(batch_size, n, p)

# Raw loop
C_loop = np.zeros((batch_size, m, p))
for b in range(batch_size):
    for i in range(m):
        for k in range(p):
            for j in range(n):
                C_loop[b, i, k] += A[b, i, j] * B[b, j, k]

C_einsum = np.einsum('bij,bjk->bik', A, B)

# Also: np.matmul handles batches
C_matmul = np.matmul(A, B)  # or A @ B

print(f"A: {A.shape},  B: {B.shape}")
print(f"C = A@B per batch: {C_einsum.shape}")
print(f"\n✓ Loop match:   {np.allclose(C_loop, C_einsum)}")
print(f"✓ Matmul match: {np.allclose(C_einsum, C_matmul)}")

# Verify independence: each batch is separate
for b in range(batch_size):
    single = A[b] @ B[b]
    print(f"  Batch {b}: independent matmul matches: {np.allclose(C_einsum[b], single)}")

# ══════════════════════════════════════════════════════════════════
# 8.3 MULTI-HEAD ATTENTION: S_bhij = Q_bhid K_bhjd
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("8.3 MULTI-HEAD ATTENTION SCORES")
print("  S_bhij = Q_bhid K_bhjd / √d_k")
print("  b=batch, h=head, i=query_pos, j=key_pos, d=head_dim")
print("=" * 70)

B, H, S, D = 2, 4, 8, 16  # batch, heads, seq_len, head_dim

Q = np.random.randn(B, H, S, D) * 0.1
K = np.random.randn(B, H, S, D) * 0.1

# Raw loop (4 free indices, 1 dummy)
scores_loop = np.zeros((B, H, S, S))
for b in range(B):
    for h in range(H):
        for i in range(S):
            for j in range(S):
                for d in range(D):
                    scores_loop[b, h, i, j] += Q[b, h, i, d] * K[b, h, j, d]
scores_loop /= np.sqrt(D)

# Einsum
scores_einsum = np.einsum('bhid,bhjd->bhij', Q, K) / np.sqrt(D)

print(f"Q: {Q.shape}  (batch, heads, seq, head_dim)")
print(f"K: {K.shape}")
print(f"Scores: {scores_einsum.shape}  (batch, heads, seq, seq)")

print(f"\n✓ Match: {np.allclose(scores_loop, scores_einsum)}")

print(f"\n--- Index Classification ---")
print(f"b: batch     — FREE (independent per sample)")
print(f"h: head      — FREE (independent per head)")
print(f"i: query pos — FREE (each query attends differently)")
print(f"j: key pos   — FREE (each key receives differently)")
print(f"d: head dim  — DUMMY (contracted = dot product)")
print(f"\n4 free indices → 4D output tensor")
print(f"1 dummy index → summation = dot product similarity")

# Softmax (simplified)
print(f"\n--- Full Attention Pattern ---")
print("1. Scores:    S_bhij = Q_bhid K_bhjd / √d")
print("2. Weights:   α_bhij = softmax_j(S_bhij)")
print("3. Output:    O_bhid = α_bhij V_bhjd")
print("Step 3 contracts over j (key positions) → weighted sum of values")

V = np.random.randn(B, H, S, D) * 0.1
# Simple softmax
exp_scores = np.exp(scores_einsum - np.max(scores_einsum, axis=-1, keepdims=True))
attn_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

# Attention output: O_bhid = α_bhij V_bhjd (contract j)
O_einsum = np.einsum('bhij,bhjd->bhid', attn_weights, V)

# Raw loop verification for one batch/head
O_loop_single = np.zeros((S, D))
for i in range(S):
    for d_idx in range(D):
        for j in range(S):
            O_loop_single[i, d_idx] += attn_weights[0, 0, i, j] * V[0, 0, j, d_idx]

print(f"\nAttention output: {O_einsum.shape}")
print(f"✓ Loop verification (batch=0, head=0): {np.allclose(O_loop_single, O_einsum[0, 0])}")

# ══════════════════════════════════════════════════════════════════
# 8.4 BATCH-WISE REDUCTION: Per-sample losses
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("8.4 BATCH-WISE REDUCTIONS")
print("  Keeping batch index free while contracting feature indices")
print("=" * 70)

B, D = 8, 5
Y_pred = np.random.randn(B, D)
Y_true = np.random.randn(B, D)

# Per-sample squared error: L_b = (Y_pred - Y_true)_bi (Y_pred - Y_true)_bi
# Index i is summed (dummy), b is kept (free)
diff = Y_pred - Y_true

# Raw loop
loss_loop = np.zeros(B)
for b in range(B):
    for i in range(D):
        loss_loop[b] += diff[b, i] ** 2

loss_einsum = np.einsum('bi,bi->b', diff, diff)
loss_numpy = np.sum(diff**2, axis=1)

print(f"Predictions: {Y_pred.shape}")
print(f"Targets:     {Y_true.shape}")
print(f"\nPer-sample MSE = (y-t)_bi (y-t)_bi → L_b")
print(f"  b: FREE (one loss per sample)")
print(f"  i: DUMMY (sum over features)")
print(f"\nLosses (per sample):")
print(f"  loop:   {np.round(loss_loop, 4)}")
print(f"  einsum: {np.round(loss_einsum, 4)}")
print(f"  numpy:  {np.round(loss_numpy, 4)}")
print(f"\n✓ Match: {np.allclose(loss_loop, loss_einsum) and np.allclose(loss_loop, loss_numpy)}")

# Total loss: L = L_b / B  (average over batch — contract b too)
total_loss = np.einsum('b->', loss_einsum) / B
print(f"\nMean loss = (1/B) Σ_b L_b = {total_loss:.4f}")
print(f"np.mean:  {np.mean(loss_numpy):.4f}")
print(f"✓ Match: {np.allclose(total_loss, np.mean(loss_numpy))}")

9. Einsum in Practice: NumPy and PyTorch

Now we move from theory to practical mastery. np.einsum and torch.einsum follow explicit Einstein notation: you write the index pattern explicitly, and the function executes it.

Syntax: einsum('subscripts', *operands)

'ij,jk->ik'     means  C_ik = A_ij B_jk  (matmul)
 ↑↑  ↑↑   ↑↑
 A   B    output

Rules:

  1. Each operand gets comma-separated subscripts
  2. -> separates inputs from output
  3. Indices in inputs but NOT in output → summed (dummy)
  4. Indices in output → kept (free)
  5. If -> is omitted, all indices appearing once are output indices (implicit mode)

Performance Tips

  • np.einsum(..., optimize=True) finds optimal contraction order
  • np.einsum_path() shows the contraction plan
  • For 2-operand cases, specialized functions (matmul, dot) are often faster
  • PyTorch einsum supports autograd — gradients flow automatically

Code cell 34

# ══════════════════════════════════════════════════════════════════
# 9.1 EINSUM CHEAT SHEET — Every Common Operation
# ══════════════════════════════════════════════════════════════════

print("=" * 70)
print("9.1 EINSUM CHEAT SHEET")
print("=" * 70)

np.random.seed(42)

# Setup tensors
v = np.random.randn(4)
w = np.random.randn(4)
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
C = np.random.randn(3, 4)
M = np.random.randn(4, 4)

operations = [
    # (description, einsum_string, operands, equivalent)
    ("Vector dot product",    'i,i->',     (v, w),   np.dot(v, w)),
    ("Vector outer product",  'i,j->ij',   (v, w),   np.outer(v, w)),
    ("Vector element-wise",   'i,i->i',    (v, w),   v * w),
    ("Vector sum",            'i->',       (v,),     np.sum(v)),
    ("Matrix transpose",      'ij->ji',    (A,),     A.T),
    ("Matrix trace",          'ii->',      (M,),     np.trace(M)),
    ("Matrix diagonal",       'ii->i',     (M,),     np.diag(M)),
    ("Matrix sum all",        'ij->',      (A,),     np.sum(A)),
    ("Matrix row sum",        'ij->i',     (A,),     np.sum(A, axis=1)),
    ("Matrix col sum",        'ij->j',     (A,),     np.sum(A, axis=0)),
    ("Matrix-vector",         'ij,j->i',   (A, v),   A @ v),
    ("Matrix multiply",       'ij,jk->ik', (A, B),   A @ B),
    ("Hadamard product",      'ij,ij->ij', (A, C),   A * C),
    ("Frobenius inner",       'ij,ij->',   (A, C),   np.sum(A * C)),
    ("Bilinear form",         'i,ij,j->',  (v, M, w), v @ M @ w),
]

print(f"\n{'Operation':<22} {'Einsum':<16} {'Result Shape':<14} {'✓'}")
print("-" * 60)
for desc, estr, ops, equiv in operations:
    result = np.einsum(estr, *ops)
    match = np.allclose(result, equiv)
    shape = str(result.shape) if isinstance(result, np.ndarray) else 'scalar'
    print(f"{desc:<22} {estr:<16} {shape:<14} {'✓' if match else '✗'}")

# ══════════════════════════════════════════════════════════════════
# 9.2 IMPLICIT vs EXPLICIT MODE
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("9.2 IMPLICIT vs EXPLICIT MODE")
print("=" * 70)

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)

# Explicit mode: you specify the output
result_explicit = np.einsum('ij,jk->ik', A, B)

# Implicit mode: output indices = alphabetically sorted unique non-repeated
# 'ij,jk' → j appears twice (summed), i and k appear once → output is 'ik'
result_implicit = np.einsum('ij,jk', A, B)

print(f"Explicit: einsum('ij,jk->ik', A, B)")
print(f"Implicit: einsum('ij,jk', A, B)")
print(f"✓ Same: {np.allclose(result_explicit, result_implicit)}")

# Where they differ: trace
M = np.random.randn(4, 4)
print(f"\nTrace:")
print(f"  Explicit: einsum('ii->', M)  = {np.einsum('ii->', M):.4f}  (scalar)")
print(f"  Implicit: einsum('ii', M)    = {np.einsum('ii', M):.4f}  (scalar)")
print(f"  ✓ Same: {np.allclose(np.einsum('ii->', M), np.einsum('ii', M))}")

# Diagonal extraction
print(f"\nDiagonal:")
print(f"  Explicit: einsum('ii->i', M) = {np.einsum('ii->i', M)}")
print(f"  Implicit: einsum('ii', M)    = {np.einsum('ii', M):.4f}  ← NOT diagonal!")
print(f"  Implicit 'ii' takes the TRACE (sums), not the diagonal!")
print(f"\n⚠ RECOMMENDATION: Always use explicit mode ('->') for clarity.")

# ══════════════════════════════════════════════════════════════════
# 9.3 EINSUM_PATH — OPTIMIZING CONTRACTION ORDER
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("9.3 EINSUM_PATH — FINDING OPTIMAL CONTRACTION ORDER")
print("=" * 70)

# Three matrices: A @ B @ C
A = np.random.randn(100, 10)
B = np.random.randn(10, 200)
C = np.random.randn(200, 50)

# Naive: left to right
path_naive, info_naive = np.einsum_path('ij,jk,kl->il', A, B, C, optimize=False)
# Optimal
path_opt, info_opt = np.einsum_path('ij,jk,kl->il', A, B, C, optimize='optimal')

print(f"Computing A({A.shape}) @ B({B.shape}) @ C({C.shape})")
print(f"\n--- Naive (left to right) ---")
print(info_naive)
print(f"\n--- Optimal ---")
print(info_opt)

# Time comparison
import time

t0 = time.perf_counter()
for _ in range(100):
    r1 = np.einsum('ij,jk,kl->il', A, B, C, optimize=False)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(100):
    r2 = np.einsum('ij,jk,kl->il', A, B, C, optimize=True)
t_opt = time.perf_counter() - t0

print(f"Time (naive):    {t_naive:.4f}s")
print(f"Time (optimal):  {t_opt:.4f}s")
print(f"Speedup: {t_naive/t_opt:.1f}x")
print(f"✓ Same result: {np.allclose(r1, r2)}")

# ══════════════════════════════════════════════════════════════════
# 9.4 PYTORCH EINSUM — WITH AUTOGRAD
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("9.4 PYTORCH EINSUM WITH AUTOMATIC DIFFERENTIATION")
print("=" * 70)

if HAS_TORCH:
    # PyTorch einsum with gradient tracking
    W = torch.randn(3, 4, requires_grad=True)
    x = torch.randn(4, requires_grad=True)
    
    # Forward: y = Wx
    y = torch.einsum('ij,j->i', W, x)
    
    # Loss
    loss = torch.sum(y ** 2)
    
    # Backward — gradients flow through einsum!
    loss.backward()
    
    print(f"W shape: {W.shape}")
    print(f"x shape: {x.shape}")
    print(f"y = einsum('ij,j->i', W, x): {y.detach().numpy()}")
    print(f"L = sum(y²) = {loss.item():.4f}")
    print(f"\n∂L/∂x (via autograd): {x.grad.numpy()}")
    print(f"∂L/∂x (analytical = 2·W^T·y): {(2 * W.T @ y).detach().numpy()}")
    print(f"✓ Match: {torch.allclose(x.grad, 2 * W.T @ y.detach())}")
    
    print(f"\n∂L/∂W shape: {W.grad.shape}")
    # ∂L/∂W_ij = 2 y_i x_j (outer product)
    grad_analytical = 2 * torch.einsum('i,j->ij', y.detach(), x.detach())
    print(f"∂L/∂W (autograd):\n{W.grad.numpy()}")
    print(f"∂L/∂W (2·y⊗x):\n{grad_analytical.numpy()}")
    print(f"✓ Match: {torch.allclose(W.grad, grad_analytical)}")
    
    print(f"\n--- Key Advantage ---")
    print("torch.einsum supports autograd automatically.")
    print("Write ANY index expression in forward pass →")
    print("PyTorch computes gradients for free via backprop.")
    print("No need to derive gradient formulas manually!")
else:
    print("PyTorch not available. Install with: pip install torch")
    print("The key point: torch.einsum('ij,j->i', W, x) supports autograd.")
    print("Gradients flow through einsum operations automatically.")

10. Tensor Products and Decompositions

Tensor products build higher-order tensors from lower-order ones. Decompositions do the reverse: break a tensor into structured components. Both are naturally expressed in index notation.

10.1 Tensor (Outer/Kronecker) Products

$$\text{Outer: } C_{ij} = u_i v_j \qquad \text{(vector ⊗ vector → matrix)}$$

$$\text{General: } T_{ijkl} = A_{ij} B_{kl} \qquad \text{(matrix ⊗ matrix → 4th-order)}$$

10.2 Kronecker Product $A \otimes B$

For $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$:

$$A \otimes B \in \mathbb{R}^{mp \times nq}$$

$$(A \otimes B)_{(i-1)p+k,\,(j-1)q+l} = A_{ij} B_{kl}$$

10.3 Connection to SVD and CP Decomposition

  • SVD: $A_{ij} = \sum_r \sigma_r u_{ir} v_{jr}$ (sum of rank-1 outer products)
  • CP: $T_{ijk} = \sum_r a_{ir} b_{jr} c_{kr}$ (sum of rank-1 3-way products)
  • Tucker: $T_{ijk} = G_{pqr} U^{(1)}_{ip} U^{(2)}_{jq} U^{(3)}_{kr}$

Code cell 36

# ══════════════════════════════════════════════════════════════════
# 10.1 TENSOR PRODUCTS IN INDEX NOTATION
# ══════════════════════════════════════════════════════════════════

print("=" * 70)
print("10.1 TENSOR PRODUCTS")
print("  No contraction — all indices are free → higher rank")
print("=" * 70)

# Vector ⊗ Vector → Matrix
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0])

print("\n--- Vector ⊗ Vector → Matrix ---")
print(f"u ∈ R³, v ∈ R²")
C = np.einsum('i,j->ij', u, v)
print(f"C_ij = u_i v_j (rank 1+1 = 2):")
print(C)

# Matrix ⊗ Vector → 3rd-order tensor
A = np.array([[1, 2], [3, 4]])
w = np.array([10, 20, 30])

print(f"\n--- Matrix ⊗ Vector → 3rd-Order Tensor ---")
T = np.einsum('ij,k->ijk', A, w)
print(f"T_ijk = A_ij w_k (rank 2+1 = 3):")
print(f"Shape: {T.shape}")
for k in range(len(w)):
    print(f"  T[:,:,{k}] = A × w[{k}] =\n{T[:,:,k]}")

# Matrix ⊗ Matrix → 4th-order tensor
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(f"\n--- Matrix ⊗ Matrix → 4th-Order Tensor ---")
T4 = np.einsum('ij,kl->ijkl', A, B)
print(f"T_ijkl = A_ij B_kl (rank 2+2 = 4)")
print(f"Shape: {T4.shape}")

# Verify: T[i,j,k,l] = A[i,j] * B[k,l]
all_match = True
for i in range(2):
    for j in range(2):
        for k in range(2):
            for l in range(2):
                if T4[i,j,k,l] != A[i,j] * B[k,l]:
                    all_match = False
print(f"✓ T[i,j,k,l] = A[i,j]·B[k,l]: {all_match}")

# ══════════════════════════════════════════════════════════════════
# 10.2 KRONECKER PRODUCT
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("10.2 KRONECKER PRODUCT A ⊗ B")
print("  Block matrix where each a_ij is replaced by a_ij·B")
print("=" * 70)

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 5], [6, 7]])

# Build Kronecker product from raw loops
m, n = A.shape
p, q = B.shape
K = np.zeros((m*p, n*q))
for i in range(m):
    for j in range(n):
        for k in range(p):
            for l in range(q):
                K[i*p + k, j*q + l] = A[i, j] * B[k, l]

K_numpy = np.kron(A, B)

print(f"\nA ({m}×{n}):\n{A}")
print(f"\nB ({p}×{q}):\n{B}")
print(f"\nA ⊗ B ({m*p}×{n*q}):")
print(f"via loop:\n{K}")
print(f"\nvia np.kron:\n{K_numpy}")
print(f"\n✓ Match: {np.allclose(K, K_numpy)}")

# Key property: (A⊗B)(C⊗D) = (AC)⊗(BD) — the mixed-product property
print(f"\n--- Mixed Product Property ---")
print("(A⊗B)(C⊗D) = (AC)⊗(BD)")
C_mat = np.random.randn(2, 2)
D = np.random.randn(2, 2)

lhs = np.kron(A, B) @ np.kron(C_mat, D)
rhs = np.kron(A @ C_mat, B @ D)
print(f"✓ (A⊗B)(C⊗D) = (AC)⊗(BD): {np.allclose(lhs, rhs)}")
print("This property is key in quantum computing and tensor networks.")

# ══════════════════════════════════════════════════════════════════
# 10.3 SVD IN INDEX NOTATION: A_ij = Σ_r σ_r u_ir v_jr
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("10.3 SVD AS SUM OF RANK-1 OUTER PRODUCTS")
print("  A_ij = Σ_r σ_r U_ir V_jr")
print("=" * 70)

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

U, S, Vt = np.linalg.svd(A)
V = Vt.T  # V_jr = Vt[r,j]

print(f"A:\n{A}")
print(f"\nSingular values: {S}")
print(f"\nSVD: A_ij = Σ_r σ_r U_ir V_jr")

# Reconstruct term by term
A_reconstructed = np.zeros_like(A)
for r in range(len(S)):
    # Rank-1 term: σ_r · (u_r ⊗ v_r)
    rank1 = S[r] * np.einsum('i,j->ij', U[:, r], V[:, r])
    A_reconstructed += rank1
    
    error = np.linalg.norm(A - A_reconstructed, 'fro')
    print(f"\n  r={r}: σ={S[r]:.4f}")
    print(f"  rank-1 term:\n{np.round(rank1, 4)}")
    print(f"  Cumulative error: {error:.6f}")

# Full reconstruction via einsum
A_einsum = np.einsum('ir,r,jr->ij', U, S, V)
print(f"\n✓ Full SVD reconstruction: {np.allclose(A, A_einsum)}")

# Low-rank approximation
print(f"\n--- Low-Rank Approximation ---")
for k in range(1, len(S) + 1):
    A_k = np.einsum('ir,r,jr->ij', U[:, :k], S[:k], V[:, :k])
    error = np.linalg.norm(A - A_k, 'fro')
    energy = np.sum(S[:k]**2) / np.sum(S**2) * 100
    print(f"  Rank-{k}: error={error:.4f}, energy={energy:.1f}%")

# ══════════════════════════════════════════════════════════════════
# 10.4 CP DECOMPOSITION: T_ijk = Σ_r a_ir b_jr c_kr
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("10.4 CP (CANONICAL POLYADIC) DECOMPOSITION")
print("  T_ijk = Σ_r a_ir b_jr c_kr  (sum of rank-1 tensors)")
print("=" * 70)

# Build a rank-R tensor from factors
R = 3  # number of components
dims = (4, 5, 6)  # tensor dimensions

# Random factor matrices
np.random.seed(42)
a = np.random.randn(dims[0], R)  # a_ir
b = np.random.randn(dims[1], R)  # b_jr
c = np.random.randn(dims[2], R)  # c_kr

# Build tensor: T_ijk = Σ_r a_ir b_jr c_kr
# Raw loop
T_loop = np.zeros(dims)
for i in range(dims[0]):
    for j in range(dims[1]):
        for k in range(dims[2]):
            for r in range(R):
                T_loop[i, j, k] += a[i, r] * b[j, r] * c[k, r]

# Einsum
T_einsum = np.einsum('ir,jr,kr->ijk', a, b, c)

print(f"Factor shapes: a={a.shape}, b={b.shape}, c={c.shape}")
print(f"Tensor shape: {T_einsum.shape}")
print(f"CP rank: {R}")
print(f"\n✓ Loop vs einsum match: {np.allclose(T_loop, T_einsum)}")

# Show individual rank-1 components
print(f"\nRank-1 components:")
for r in range(R):
    component = np.einsum('i,j,k->ijk', a[:, r], b[:, r], c[:, r])
    print(f"  Component {r}: shape {component.shape}, "
          f"norm = {np.linalg.norm(component):.4f}")

# Storage comparison
full_storage = np.prod(dims)
cp_storage = sum(d * R for d in dims)
print(f"\nStorage comparison:")
print(f"  Full tensor:   {full_storage} elements")
print(f"  CP factors:    {cp_storage} elements")
print(f"  Compression:   {full_storage / cp_storage:.1f}x")
print(f"\nCP decomposition is how tensor networks compress neural network weights.")

11. Einstein Notation in AI Architectures

Every major AI architecture can be expressed entirely in index notation. This section shows how — from MLPs through attention to convolutions and graph neural networks.

Architecture → Index Pattern Map

| Architecture | Core Operation | Index Form |
|---|---|---|
| MLP | Linear layer | $y_{bi} = W_{ij} x_{bj} + b_i$ |
| CNN | Convolution | $y_{boc} = \sum_{i,j,k} W_{oijk}\, x_{b,\,i+c_h,\,j+c_w,\,k}$ |
| Transformer | Attention | $O_{bhid} = \alpha_{bhij} V_{bhjd}$ |
| RNN/LSTM | Gated recurrence | $f_{bi} = \sigma(W^f_{ij} h_{b,t-1,j} + U^f_{ij} x_{btj})$ |
| GNN | Message passing | $h'_{vi} = \sum_{u \in N(v)} W_{ij} h_{uj}$ |

Understanding these patterns lets you read research papers fluently and implement custom architectures.

Code cell 38

# ══════════════════════════════════════════════════════════════════
# 11.1 MULTI-LAYER PERCEPTRON — FULL INDEX FORM
# ══════════════════════════════════════════════════════════════════

print("=" * 70)
print("11.1 MLP IN FULL INDEX NOTATION")
print("=" * 70)

np.random.seed(42)
batch = 4
d_in, d_h, d_out = 5, 8, 3

# Weights and biases
W1 = np.random.randn(d_h, d_in) * np.sqrt(2/d_in)   # He initialization
b1 = np.zeros(d_h)
W2 = np.random.randn(d_out, d_h) * np.sqrt(2/d_h)
b2 = np.zeros(d_out)

X = np.random.randn(batch, d_in)

# Forward pass in index notation
# Layer 1: h_bi = ReLU(W1_ij x_bj + b1_i)
z1 = np.einsum('ij,bj->bi', W1, X) + b1  # z1_bi = W1_ij x_bj + b1_i
h1 = np.maximum(z1, 0)  # ReLU (element-wise, no index magic)

# Layer 2: y_bk = W2_ki h_bi + b2_k  (no activation for regression)
y = np.einsum('ki,bi->bk', W2, h1) + b2

print(f"Input X:  {X.shape}  (batch, features)")
print(f"W1: {W1.shape}, b1: {b1.shape}")
print(f"W2: {W2.shape}, b2: {b2.shape}")
print(f"\nLayer 1: z1_bi = W1_ij x_bj + b1_i")
print(f"         h1_bi = max(0, z1_bi)")
print(f"  z1: {z1.shape}, h1: {h1.shape}")
print(f"\nLayer 2: y_bk = W2_ki h1_bi + b2_k")
print(f"  y: {y.shape}")

# Verify against standard matmul
z1_check = X @ W1.T + b1
y_check = np.maximum(z1_check, 0) @ W2.T + b2
print(f"\n✓ Standard matmul match: {np.allclose(y, y_check)}")

# Index flow diagram
print(f"\n--- Index Flow ---")
print(f"x_bj  →  W1_ij contracts j  →  h_bi")
print(f"h_bi  →  W2_ki contracts i  →  y_bk")
print(f"b index: flows through unchanged (batch)")
print(f"j→i→k: each layer changes the feature dimension index")

# ══════════════════════════════════════════════════════════════════
# 11.2 FULL TRANSFORMER ATTENTION — EVERY STEP
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("11.2 TRANSFORMER MULTI-HEAD ATTENTION")
print("  Complete implementation in Einstein notation")
print("=" * 70)

B, S, D = 2, 6, 32    # batch, seq_len, model_dim
H, Dk = 4, 8          # heads, head_dim (D = H * Dk)

# Weight matrices
W_Q = np.random.randn(H, D, Dk) * 0.1  # W^Q_hdk
W_K = np.random.randn(H, D, Dk) * 0.1  # W^K_hdk
W_V = np.random.randn(H, D, Dk) * 0.1  # W^V_hdk
W_O = np.random.randn(H, Dk, D) * 0.1  # W^O_hkd

# Input
X = np.random.randn(B, S, D) * 0.1

print(f"Input X: {X.shape}  (batch, seq, model_dim)")
print(f"W_Q: {W_Q.shape}  (heads, model_dim, head_dim)")
print(f"Heads={H}, Head_dim={Dk}, Model_dim={D}={H}×{Dk}")

# Step 1: Project to Q, K, V
# Q_bhid = X_bid' W^Q_hd'k  → but we want Q_bhsd
# More precisely: Q_{b,h,s,k} = X_{b,s,d} W^Q_{h,d,k}
Q = np.einsum('bsd,hdk->bhsk', X, W_Q)  # contract d
K = np.einsum('bsd,hdk->bhsk', X, W_K)
V = np.einsum('bsd,hdk->bhsk', X, W_V)

print(f"\nStep 1: Q/K/V projection")
print(f"  Q_{'{bhsk}'} = X_{'{bsd}'} W^Q_{'{hdk}'}")
print(f"  Q: {Q.shape}, K: {K.shape}, V: {V.shape}")

# Step 2: Attention scores
# S_bhij = Q_bhik K_bhjk / √dk  (contract k = head_dim)
scores = np.einsum('bhik,bhjk->bhij', Q, K) / np.sqrt(Dk)

print(f"\nStep 2: Attention scores")
print(f"  S_{'{bhij}'} = Q_{'{bhik}'} K_{'{bhjk}'} / √d_k")
print(f"  Scores: {scores.shape}")

# Step 3: Softmax over j (key positions)
exp_s = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
attn = exp_s / np.sum(exp_s, axis=-1, keepdims=True)

print(f"\nStep 3: Softmax")
print(f"  α_{'{bhij}'} = softmax_j(S_{'{bhij}'})")
print(f"  Attention weights: {attn.shape}")
print(f"  Row sums ≈ 1: {np.allclose(np.sum(attn, axis=-1), 1.0)}")

# Step 4: Weighted sum of values
# O_bhik = α_bhij V_bhjk  (contract j = key positions)
O_heads = np.einsum('bhij,bhjk->bhik', attn, V)

print(f"\nStep 4: Attention output")
print(f"  O_{'{bhik}'} = α_{'{bhij}'} V_{'{bhjk}'}  (contract j)")
print(f"  Per-head output: {O_heads.shape}")

# Step 5: Project back to model dimension
# Y_bsd = O_bhsk W^O_hkd  (contract h and k)
Y = np.einsum('bhsk,hkd->bsd', O_heads, W_O)

print(f"\nStep 5: Output projection")
print(f"  Y_{'{bsd}'} = O_{'{bhsk}'} W^O_{'{hkd}'}  (contract h,k)")
print(f"  Output: {Y.shape}")
print(f"  ✓ Output shape matches input: {Y.shape == X.shape}")

# Summary
print(f"\n--- Complete Attention in 4 Einsum Calls ---")
print(f"1. Q = einsum('bsd,hdk->bhsk', X, W_Q)")
print(f"2. S = einsum('bhik,bhjk->bhij', Q, K) / √d")
print(f"3. O = einsum('bhij,bhjk->bhik', softmax(S), V)")
print(f"4. Y = einsum('bhsk,hkd->bsd', O, W_O)")

# ══════════════════════════════════════════════════════════════════
# 11.3 CONVOLUTION AS EINSUM
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("11.3 1D CONVOLUTION IN INDEX NOTATION")
print("=" * 70)

# 1D convolution: y_bo[t] = Σ_i Σ_k W_oik x_b,i[t+k]
# Simplified: no padding, stride=1
B, C_in, L = 2, 3, 10   # batch, channels_in, length
C_out, K = 4, 3          # channels_out, kernel_size

X = np.random.randn(B, C_in, L)
W = np.random.randn(C_out, C_in, K)

L_out = L - K + 1

# Raw loop implementation
Y_loop = np.zeros((B, C_out, L_out))
for b in range(B):
    for o in range(C_out):
        for t in range(L_out):
            for i in range(C_in):
                for k in range(K):
                    Y_loop[b, o, t] += W[o, i, k] * X[b, i, t + k]

# Using einsum: need to build patch matrix first (im2col approach)
# Extract patches: P_btik = X_b,i,t+k
patches = np.zeros((B, L_out, C_in, K))
for t in range(L_out):
    patches[:, t, :, :] = X[:, :, t:t+K]

# Now convolution = einsum over patches
# Y_bot = W_oik P_btik  (contract i and k)
Y_einsum = np.einsum('oik,btik->bot', W, patches)

print(f"Input: {X.shape}  (batch, channels, length)")
print(f"Kernel: {W.shape}  (out_ch, in_ch, kernel_size)")
print(f"Output: {Y_loop.shape}  (batch, out_ch, length_out)")

print(f"\nConvolution: Y_bot = W_oik X_b,i,t+k")
print(f"  o: output channel (FREE)")
print(f"  b: batch (FREE)")
print(f"  t: output position (FREE)")
print(f"  i: input channel (DUMMY — summed)")
print(f"  k: kernel position (DUMMY — summed)")

print(f"\n✓ Loop vs einsum match: {np.allclose(Y_loop, Y_einsum)}")

print(f"\n--- Key Insight ---")
print("Convolution = matrix multiply after unfolding (im2col).")
print("In index notation: it's just a double contraction (sum over i,k).")
print("The 'sliding window' is encoded in the index shift t+k.")

12. Symmetries and Equivariances in Index Notation

Symmetry means a tensor is unchanged under index permutation. Equivariance means a function's output transforms predictably when its input transforms. Both are cleanly expressed with indices.

Symmetry Types

| Type | Condition | Example |
|---|---|---|
| Symmetric | $T_{ij} = T_{ji}$ | Covariance matrix, Hessian |
| Antisymmetric | $T_{ij} = -T_{ji}$ | Cross product, curl |
| Totally symmetric | $T_{ijk} = T_{\sigma(ijk)}$ for all permutations $\sigma$ | Elasticity (partially) |
| Cyclic | $T_{ijk} = T_{jki} = T_{kij}$ | Structure constants |

Equivariance in Index Notation

A function $f$ is equivariant to group $G$ if:

$$f(R_{ij} x_j)_i = R_{ij} f(x)_j$$

"Transform the input, apply $f$" = "Apply $f$, transform the output"

Code cell 40

# ══════════════════════════════════════════════════════════════════
# 12.1 SYMMETRIC TENSORS AND THEIR PROPERTIES
# ══════════════════════════════════════════════════════════════════

print("=" * 70)
print("12.1 SYMMETRIC TENSORS IN AI")
print("=" * 70)

np.random.seed(42)
n = 4

# Build symmetric matrix (covariance matrix)
X = np.random.randn(100, n)
Sigma = np.einsum('bi,bj->ij', X, X) / len(X)  # Σ_ij = (1/N) Σ_b x_bi x_bj

print("Covariance matrix from data:")
print(f"  Σ_ij = (1/N) Σ_b x_bi x_bj")
print(f"  Shape: {Sigma.shape}")
print(f"  Symmetric: Σ_ij = Σ_ji? {np.allclose(Sigma, Sigma.T)}")

# Symmetry proof in index notation:
# Σ_ij = (1/N) Σ_b x_bi x_bj
# Σ_ji = (1/N) Σ_b x_bj x_bi  (swap i↔j)
#       = (1/N) Σ_b x_bi x_bj  (multiplication is commutative)
#       = Σ_ij  ✓

# Consequence: only need n(n+1)/2 parameters
total = n * n
unique = n * (n + 1) // 2
print(f"\n  Total entries: {total}")
print(f"  Unique (upper triangle): {unique}")
print(f"  Redundancy ratio: {total/unique:.2f}x")

# Hessian is also symmetric
print(f"\n--- Hessian Matrix (2nd derivatives) ---")
print(f"H_ij = ∂²L/∂θ_i∂θ_j = ∂²L/∂θ_j∂θ_i = H_ji")
print(f"(By Schwarz's theorem, if L is C²)")
print(f"This symmetry halves the computation for 2nd-order methods.")

# ══════════════════════════════════════════════════════════════════
# 12.2 ROTATION EQUIVARIANCE
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("12.2 ROTATION EQUIVARIANCE")
print("  f(Rx) = R f(x)  — the function commutes with rotation")
print("=" * 70)

# 2D rotation matrix
theta = np.pi / 4  # 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(f"Rotation by {np.degrees(theta):.0f}°:")
print(f"R_ij:\n{np.round(R, 4)}")
print(f"R is orthogonal: R^T R = I? {np.allclose(R.T @ R, np.eye(2))}")

# Test: is f(x) = Ax equivariant to rotation?
# f(Rx) = A(Rx) = A_ij R_jk x_k
# R f(x) = R(Ax) = R_ij A_jk x_k
# Equal if A_ij R_jk = R_ij A_jk → A commutes with R

# Equivariant example: A = scalar * I (isotropic)
s = 2.5
A_equivariant = s * np.eye(2)  # A_ij = s δ_ij
x = np.array([1.0, 0.0])

# f(Rx)
Rx = np.einsum('ij,j->i', R, x)
f_Rx = np.einsum('ij,j->i', A_equivariant, Rx)

# R f(x)
fx = np.einsum('ij,j->i', A_equivariant, x)
R_fx = np.einsum('ij,j->i', R, fx)

print(f"\n--- Equivariant: A = {s}I ---")
print(f"f(Rx) = {np.round(f_Rx, 4)}")
print(f"R·f(x) = {np.round(R_fx, 4)}")
print(f"✓ Equivariant: {np.allclose(f_Rx, R_fx)}")

# Non-equivariant example: general A
A_general = np.array([[1.0, 2.0], [0.0, 1.0]])
f_Rx2 = np.einsum('ij,j->i', A_general, Rx)
fx2 = np.einsum('ij,j->i', A_general, x)
R_fx2 = np.einsum('ij,j->i', R, fx2)

print(f"\n--- Non-equivariant: A = [[1,2],[0,1]] ---")
print(f"f(Rx) = {np.round(f_Rx2, 4)}")
print(f"R·f(x) = {np.round(R_fx2, 4)}")
print(f"✗ Not equivariant: {not np.allclose(f_Rx2, R_fx2)}")

# Invariance: scalar output unchanged by rotation
print(f"\n--- Invariance: f(Rx) = f(x) ---")
print(f"Dot product is rotation-invariant:")
y = np.array([0.0, 1.0])
Ry = np.einsum('ij,j->i', R, y)
dot_original = np.einsum('i,i->', x, y)
dot_rotated = np.einsum('i,i->', Rx, Ry)
print(f"  x·y = {dot_original:.4f}")
print(f"  (Rx)·(Ry) = {dot_rotated:.4f}")
print(f"  ✓ Invariant: {np.allclose(dot_original, dot_rotated)}")
print(f"  Proof: (Rx)_i(Ry)_i = R_ij x_j R_ik y_k = δ_jk x_j y_k = x_j y_j ■")

13. Advanced Index Manipulations

These techniques appear in research papers and advanced implementations. Mastering them lets you work with cutting-edge architectures and optimization methods.

Topics

  • Tensor reshaping as index regrouping
  • Mode-n products for Tucker decomposition
  • Index stacking for multi-modal data
  • Einstein summation with broadcasting (see the sketch below)
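
On the last point, broadcasting in np.einsum is written with an ellipsis: '...' matches any leading dimensions, so one subscript string covers unbatched, batched, and nested-batch inputs alike. A minimal sketch:

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))

x_single = rng.standard_normal(4)           # (4,)
x_batch  = rng.standard_normal((8, 4))      # (batch, 4)
x_nested = rng.standard_normal((2, 8, 4))   # (group, batch, 4)

# y_...i = x_...j W_ji : the '...' broadcasts over whatever leading axes exist
y_single = np.einsum('...j,ji->...i', x_single, W)   # (3,)
y_batch  = np.einsum('...j,ji->...i', x_batch,  W)   # (8, 3)
y_nested = np.einsum('...j,ji->...i', x_nested, W)   # (2, 8, 3)

print(y_single.shape, y_batch.shape, y_nested.shape)
print(np.allclose(y_batch, x_batch @ W))   # expected: True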

Code cell 42

# ══════════════════════════════════════════════════════════════════
# 13.1 TENSOR RESHAPING AS INDEX REGROUPING
# ══════════════════════════════════════════════════════════════════

print("=" * 70)
print("13.1 RESHAPING = INDEX REGROUPING")
print("=" * 70)

# In index notation, reshape is regrouping indices into compound indices
# T_ijk → M_(ij)(k) means grouping i,j into a single "row" index
T = np.random.randn(3, 4, 5)

# Mode-0 unfolding: T_ijk → M_i,(jk) where row=i, col=(j,k)
M0 = T.reshape(3, 4*5)
# Mode-1 unfolding: T_ijk → M_j,(ik)
M1 = T.transpose(1, 0, 2).reshape(4, 3*5)
# Mode-2 unfolding: T_ijk → M_k,(ij)
M2 = T.transpose(2, 0, 1).reshape(5, 3*4)

print(f"Tensor T: {T.shape}  (i=3, j=4, k=5)")
print(f"\nMode-0 unfolding: T_ijk → M_i,(jk): {M0.shape}")
print(f"Mode-1 unfolding: T_ijk → M_j,(ik): {M1.shape}")
print(f"Mode-2 unfolding: T_ijk → M_k,(ij): {M2.shape}")

# Verify: element access
i, j, k = 1, 2, 3
print(f"\nT[{i},{j},{k}] = {T[i,j,k]:.4f}")
print(f"M0[{i}, {j*5+k}] = {M0[i, j*5+k]:.4f}")
print(f"M1[{j}, {i*5+k}] = {M1[j, i*5+k]:.4f}")
print(f"M2[{k}, {i*4+j}] = {M2[k, i*4+j]:.4f}")
print(f"✓ All access same element: "
      f"{T[i,j,k] == M0[i, j*5+k] == M1[j, i*5+k] == M2[k, i*4+j]}")

# AI connection: reshaping in multi-head attention
print(f"\n--- Reshape in Multi-Head Attention ---")
B, S, D, H = 2, 8, 32, 4
Dk_head = D // H  # 8

X = np.random.randn(B, S, D)
# Reshape [B, S, D] → [B, S, H, Dk] → [B, H, S, Dk]
# This is index regrouping: X_bs(hk) → X_bshk → X_bhsk
X_heads = X.reshape(B, S, H, Dk_head).transpose(0, 2, 1, 3)
print(f"Input: {X.shape} → reshape/transpose → {X_heads.shape}")
print(f"Index view: X_bs(hk) → X_bhsk")
print(f"The compound index d=(h,k) is split into separate indices h and k.")

# ══════════════════════════════════════════════════════════════════
# 13.2 MODE-N PRODUCT (TUCKER DECOMPOSITION)
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("13.2 MODE-N PRODUCT")
print("  Multiply a tensor by a matrix along one mode")
print("=" * 70)

# Mode-n product: T ×_n U
# (T ×_1 U)_ijk = U_ip T_pjk  (contract mode-0 with matrix)
T = np.random.randn(3, 4, 5)
U = np.random.randn(6, 3)  # maps mode-0: 3→6

# Mode-0 product: contract first index
result_mode0 = np.einsum('ip,pjk->ijk', U, T)

# Raw loop verification
result_loop = np.zeros((6, 4, 5))
for i in range(6):
    for j in range(4):
        for k in range(5):
            for p in range(3):
                result_loop[i, j, k] += U[i, p] * T[p, j, k]

print(f"T: {T.shape}")
print(f"U: {U.shape}")
print(f"T ×₀ U: {result_mode0.shape}")
print(f"✓ Match: {np.allclose(result_mode0, result_loop)}")

# Mode-1 product
V = np.random.randn(7, 4)  # maps mode-1: 4→7
result_mode1 = np.einsum('jq,iqk->ijk', V, T)
print(f"\nV: {V.shape}")
print(f"T ×₁ V: {result_mode1.shape}  (mode-1: 4→7)")

# Tucker decomposition: T_ijk = G_pqr U^(0)_ip U^(1)_jq U^(2)_kr
print(f"\n--- Tucker Decomposition ---")
print(f"T_ijk = G_pqr U⁰_ip U¹_jq U²_kr")
print(f"= core tensor G contracted with factor matrices along each mode")

r1, r2, r3 = 2, 3, 2  # core dimensions
G = np.random.randn(r1, r2, r3)
U0 = np.random.randn(3, r1)
U1 = np.random.randn(4, r2)
U2 = np.random.randn(5, r3)

T_tucker = np.einsum('pqr,ip,jq,kr->ijk', G, U0, U1, U2)
print(f"\nCore G: {G.shape}")
print(f"U⁰: {U0.shape}, U¹: {U1.shape}, U²: {U2.shape}")
print(f"Reconstructed T: {T_tucker.shape}")

storage_full = 3 * 4 * 5
storage_tucker = G.size + U0.size + U1.size + U2.size
print(f"\nFull storage: {storage_full}")
print(f"Tucker storage: {storage_tucker} (G:{G.size} + factors:{U0.size+U1.size+U2.size})")
print(f"Compression: {storage_full / storage_tucker:.1f}x")

# ══════════════════════════════════════════════════════════════════
# 13.3 EINSUM WITH COMPLEX EXPRESSIONS
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("13.3 COMPLEX EINSUM PATTERNS")
print("=" * 70)

np.random.seed(42)

# Pattern 1: Bilinear form — x^T A y
x = np.random.randn(4)
A = np.random.randn(4, 4)
y = np.random.randn(4)

bil_einsum = np.einsum('i,ij,j->', x, A, y)
bil_matmul = x @ A @ y
print(f"Bilinear form: x_i A_ij y_j = {bil_einsum:.4f}")
print(f"x @ A @ y = {bil_matmul:.4f}")
print(f"✓ Match: {np.allclose(bil_einsum, bil_matmul)}")

# Pattern 2: Batch outer product
B = 3
u_batch = np.random.randn(B, 4)
v_batch = np.random.randn(B, 5)
outer_batch = np.einsum('bi,bj->bij', u_batch, v_batch)
print(f"\nBatch outer: einsum('bi,bj->bij') → {outer_batch.shape}")

# Pattern 3: Multi-head dot product (no batch)
H, S, D = 4, 6, 8
Q = np.random.randn(H, S, D)
K = np.random.randn(H, S, D)
scores = np.einsum('hid,hjd->hij', Q, K)
print(f"Multi-head scores: einsum('hid,hjd->hij') → {scores.shape}")

# Pattern 4: Tensor network contraction
A = np.random.randn(3, 4, 5)
B = np.random.randn(5, 6, 7)
C = np.random.randn(7, 3, 8)

# Contract: A_ijk B_klm C_min → result_jl (sum over i, k, m; n is also summed since it is absent from the output)
result = np.einsum('ijk,klm,min->jl', A, B, C)
print(f"\nTensor network: A_ijk B_klm C_min → D_jl")
print(f"  {A.shape} × {B.shape} × {C.shape}{result.shape}")
print(f"  Contracted: i,k,m  |  Free: j,l")

# Show optimal contraction path
path, info = np.einsum_path('ijk,klm,min->jl', A, B, C, optimize='optimal')
print(f"\nOptimal path: {path}")
print(f"  This determines the order of contractions for efficiency.")

14. Common Mistakes and Debugging

Index notation has sharp edges. These are the mistakes that trip up beginners and experts alike, with explicit demonstrations of each error and its fix.

Error Taxonomy

| Error | Example | Problem |
|---|---|---|
| Triple index | $A_{ij} B_{jk} C_{jl}$ | Index $j$ appears in 3 tensors: undefined |
| Shape mismatch | $A_{ij} B_{jk}$ with incompatible $j$ | Dimension mismatch |
| Missing output | $A_{ij} B_{jk}$ without `->` | Ambiguous output |
| Wrong contraction | `'ij,jk->jk'` | $j$ both contracted and free |
| Forgotten batch | `'ij,j->i'` on batched data | Missing batch index |
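
Two of the traps in the table above are easy to see in a few lines. Here is a minimal sketch with throwaway arrays (not one of the notebook's numbered cells) showing NumPy's implicit-output mode and the 'ij,jk->jk' trap:

import numpy as np

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)

# Missing '->': implicit mode picks the output for you (indices appearing once, alphabetical)
implicit = np.einsum('ij,jk', A, B)        # NumPy infers '->ik' here
print(np.allclose(implicit, A @ B))        # True; relying on inference invites surprises

# Wrong contraction: keeping j in the output changes the meaning entirely
odd = np.einsum('ij,jk->jk', A, B)         # (Σ_i A_ij) * B_jk, elementwise in j, not a matmul
print(odd.shape, np.allclose(odd, A.sum(axis=0)[:, None] * B))   # (4, 5) True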

Code cell 44

# ══════════════════════════════════════════════════════════════════
# 14.1 MISTAKE: CONFUSING FREE AND DUMMY INDICES
# ══════════════════════════════════════════════════════════════════

print("=" * 70)
print("14.1 MISTAKE: FREE vs DUMMY CONFUSION")
print("=" * 70)

A = np.random.randn(3, 4)
B = np.random.randn(3, 4)

# ✗ WRONG: Trying to do element-wise multiply but forgetting output indices
wrong = np.einsum('ij,ij->', A, B)   # This gives a SCALAR (Frobenius)
right = np.einsum('ij,ij->ij', A, B)  # This gives a MATRIX (Hadamard)

print(f"\nA: {A.shape},  B: {B.shape}")
print(f"\n✗ einsum('ij,ij->'):  {wrong:.4f}  ← SCALAR (Frobenius inner product)")
print(f"  i,j not in output → they're DUMMY → summed away!")
print(f"\n✓ einsum('ij,ij->ij'): shape {right.shape}  ← MATRIX (element-wise)")
print(f"  i,j IN output → they're FREE → preserved!")
print(f"\n--- Rule: Check what's after '->' ---")
print(f"  Indices AFTER '->' = free (kept)")
print(f"  Indices NOT after '->' = dummy (summed)")

# ══════════════════════════════════════════════════════════════════
# 14.2 MISTAKE: DIMENSION MISMATCH ON CONTRACTED INDEX
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("14.2 MISTAKE: DIMENSION MISMATCH")
print("=" * 70)

A = np.random.randn(3, 4)  # A_ij: j ranges over 4
B = np.random.randn(5, 6)  # B_jk: j ranges over 5 ← MISMATCH!

print(f"A: {A.shape}  (i=3, j=4)")
print(f"B: {B.shape}  (j=5, k=6)")
print(f"\nA_ij B_jk requires j to have SAME range in both tensors!")
print(f"Here j=4 in A but j=5 in B → ERROR")

try:
    result = np.einsum('ij,jk->ik', A, B)
    print(f"Result: {result.shape}")  # Won't reach here
except ValueError as e:
    print(f"\n✗ ValueError: {e}")
    print(f"\n✓ Fix: Ensure contracted index dimensions match.")
    B_fixed = np.random.randn(4, 6)
    result = np.einsum('ij,jk->ik', A, B_fixed)
    print(f"   A(3×4) @ B(4×6) → C(3×6): {result.shape} ✓")

# ══════════════════════════════════════════════════════════════════
# 14.3 MISTAKE: FORGETTING BATCH DIMENSION
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("14.3 MISTAKE: FORGETTING BATCH DIMENSION")
print("=" * 70)

W = np.random.randn(5, 3)    # Weight matrix (no batch)
X = np.random.randn(8, 3)    # Batched input

print(f"W: {W.shape}  (d_out=5, d_in=3)")
print(f"X: {X.shape}  (batch=8, d_in=3)")

# ✗ WRONG: Reusing the unbatched expression on batched data
try:
    wrong = np.einsum('ij,ij->', W, X)  # pairs the batch axis (8) with d_out (5)
    print(f"\n✗ einsum('ij,ij->') = {wrong:.4f}  ← silently summed the batch away as if it were a feature!")
except ValueError as e:
    print(f"\n✗ einsum('ij,ij->') raises here because batch=8 ≠ d_out=5: {e}")
    print(f"  Worse: if the shapes happened to match, it would silently sum over the batch.")

# ✓ RIGHT: Include batch index
right = np.einsum('ij,bj->bi', W, X)
print(f"\n✓ einsum('ij,bj->bi'): {right.shape}")
print(f"  b is FREE (preserved), j is DUMMY (contracted)")
print(f"  Equivalent to X @ W.T: {np.allclose(right, X @ W.T)}")

# ══════════════════════════════════════════════════════════════════
# 14.4 MISTAKE: INDEX APPEARING THREE TIMES
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("14.4 MISTAKE: INDEX IN THREE OR MORE TENSORS")
print("=" * 70)

print(f"\nIn standard Einstein notation, a dummy index appears exactly TWICE.")
print(f"An index in 3+ positions is undefined!")
print(f"\n✗ A_ij B_jk C_jl  ← j appears 3 times!")
print(f"  Is j summed? Over which pair? Ambiguous!")

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
C = np.random.randn(4, 6)

# NumPy einsum DOES handle this (sums over all occurrences)
# But the mathematical meaning is not standard Einstein convention
try:
    result = np.einsum('ij,jk,jl->ikl', A, B, C)
    print(f"\nnp.einsum('ij,jk,jl->ikl'): {result.shape}")
    print(f"NumPy handles this by summing j={A.shape[1]} over all three tensors.")
    print(f"But this is NOT standard Einstein summation!")
except Exception as e:
    print(f"Error: {e}")

# ✓ One proper way: keep j in an intermediate, then contract it pairwise
#   (each einsum call uses j at most twice, as the standard convention requires)
E = np.einsum('ij,jk->ijk', A, B)               # keep j: E_ijk = A_ij B_jk (no sum yet)
result_proper = np.einsum('ijk,jl->ikl', E, C)  # now contract j against C

# Explicit loop version of the same 3-way sum over j:
result_explicit = np.zeros((3, 5, 6))
for i in range(3):
    for k in range(5):
        for l in range(6):
            for j in range(4):
                result_explicit[i, k, l] += A[i, j] * B[j, k] * C[j, l]

print(f"\n✓ Explicit 3-way contraction matches einsum: "
      f"{np.allclose(result, result_explicit)}")
print(f"\nNote: einsum extends beyond Einstein's original convention.")
print(f"It sums any repeated index, even across 3+ tensors.")

# ══════════════════════════════════════════════════════════════════
# 14.5 DEBUGGING TOOLKIT
# ══════════════════════════════════════════════════════════════════

print("\n" + "=" * 70)
print("14.5 DEBUGGING EINSUM EXPRESSIONS")
print("=" * 70)

def debug_einsum(subscripts, *operands):
    """Debug helper: analyze an einsum expression before running it."""
    parts = subscripts.split('->')
    inputs = parts[0].split(',')
    output = parts[1] if len(parts) > 1 else None
    
    # Collect all indices
    all_input_indices = set(''.join(inputs))
    
    # Count occurrences
    index_count = {}
    for idx_str in inputs:
        for c in idx_str:
            index_count[c] = index_count.get(c, 0) + 1
    
    # Classify
    if output is not None:
        free = set(output)
        dummy = all_input_indices - free
    else:
        # Implicit: indices appearing once are free
        free = {c for c, cnt in index_count.items() if cnt == 1}
        dummy = {c for c, cnt in index_count.items() if cnt > 1}
    
    print(f"\n  Expression: '{subscripts}'")
    print(f"  Input subscripts: {inputs}")
    print(f"  Output subscripts: {output}")
    print(f"  Index counts: {index_count}")
    print(f"  FREE indices (kept): {sorted(free)}")
    print(f"  DUMMY indices (summed): {sorted(dummy)}")
    
    for i, (idx_str, op) in enumerate(zip(inputs, operands)):
        print(f"  Operand {i}: '{idx_str}' → shape {op.shape}")
    
    # Check dimension consistency
    index_dims = {}
    for idx_str, op in zip(inputs, operands):
        for j_idx, c in enumerate(idx_str):
            dim = op.shape[j_idx]
            if c in index_dims and index_dims[c] != dim:
                print(f"  ⚠ MISMATCH: index '{c}' has dim {index_dims[c]} "
                      f"and {dim}!")
            index_dims[c] = dim
    
    # Compute result
    result = np.einsum(subscripts, *operands)
    print(f"  Result shape: {result.shape}")
    return result

# Demo the debug tool
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
x = np.random.randn(5)

print("--- Debug Example 1: Matrix chain ---")
debug_einsum('ij,jk,k->i', A, B, x)

print("\n--- Debug Example 2: Batched attention ---")
Q = np.random.randn(2, 4, 6, 8)  # batch, heads, seq, dim
K = np.random.randn(2, 4, 6, 8)
debug_einsum('bhid,bhjd->bhij', Q, K)

print("\n--- Debug Example 3: Intentional error ---")
A_bad = np.random.randn(3, 4)
B_bad = np.random.randn(5, 6)  # j=4 vs j=5 mismatch
try:
    debug_einsum('ij,jk->ik', A_bad, B_bad)
except ValueError as e:
    print(f"  ✗ Caught: {e}")

15. Practice Exercises

Test your understanding by classifying indices, writing loop implementations, and verifying with einsum. A worked warm-up applying all four steps follows the guide below.

Exercise Guide

  1. Classify: Identify free and dummy indices
  2. Predict: What is the output shape?
  3. Implement: Write raw loop version
  4. Verify: Check against np.einsum
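
As a warm-up before the worked solutions, here is the four-step workflow applied to a fresh expression, 'bij,bj->bi' (a batched matrix-vector product); the shapes and names in this sketch are illustrative only:

import numpy as np

# Step 1, classify: b and i are free (kept); j is dummy (summed)
# Step 2, predict: output shape = (batch, i) = (4, 3)
M = np.random.randn(4, 3, 5)   # M_bij
v = np.random.randn(4, 5)      # v_bj

# Step 3, implement with raw loops
out_loop = np.zeros((4, 3))
for b in range(4):
    for i in range(3):
        for j in range(5):
            out_loop[b, i] += M[b, i, j] * v[b, j]

# Step 4, verify against np.einsum
out_einsum = np.einsum('bij,bj->bi', M, v)
print(out_einsum.shape, np.allclose(out_loop, out_einsum))   # (4, 3) True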

Code cell 46

# ══════════════════════════════════════════════════════════════════
# 15. PRACTICE EXERCISES WITH SOLUTIONS
# ══════════════════════════════════════════════════════════════════

print("=" * 70)
print("15. PRACTICE EXERCISES")
print("=" * 70)

np.random.seed(42)

# ─── Exercise 1: Classify and compute ───
print("\n" + "-" * 50)
print("Exercise 1: What does 'ij,jk,kl->il' compute?")
print("-" * 50)

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
C = np.random.randn(5, 6)

print("A(3×4), B(4×5), C(5×6)")
print("\nSolution:")
print("  Free: i, l  |  Dummy: j, k")
print("  Output shape: (3, 6)")
print("  Operation: D_il = A_ij B_jk C_kl = (ABC)_il (matrix chain multiply)")

D = np.einsum('ij,jk,kl->il', A, B, C)
D_matmul = A @ B @ C
print(f"  einsum result shape: {D.shape}")
print(f"  ✓ = A @ B @ C: {np.allclose(D, D_matmul)}")

# ─── Exercise 2: Batch trace ───
print("\n" + "-" * 50)
print("Exercise 2: Compute trace of each matrix in a batch")
print("-" * 50)

M = np.random.randn(5, 4, 4)  # 5 matrices, each 4×4
print(f"M: {M.shape}  (batch of 5 matrices, each 4×4)")
print("\nSolution:")
print("  einsum('bii->b', M) — i is dummy (trace), b is free (batch)")
traces = np.einsum('bii->b', M)
traces_loop = np.array([np.trace(M[b]) for b in range(5)])
print(f"  Traces: {np.round(traces, 3)}")
print(f"  ✓ Match loop: {np.allclose(traces, traces_loop)}")

# ─── Exercise 3: Weighted sum ───
print("\n" + "-" * 50)
print("Exercise 3: Weighted combination w_j X_bj → scalar per batch")
print("-" * 50)

X = np.random.randn(8, 5)  # 8 samples, 5 features
w = np.random.randn(5)      # weights

print(f"X: {X.shape}, w: {w.shape}")
print("\nSolution:")
print("  einsum('j,bj->b', w, X) — j is dummy, b is free")
scores = np.einsum('j,bj->b', w, X)
scores_check = X @ w
print(f"  Scores: {np.round(scores, 3)}")
print(f"  ✓ = X @ w: {np.allclose(scores, scores_check)}")

# ─── Exercise 4: Gram matrix ───
print("\n" + "-" * 50)
print("Exercise 4: Compute Gram matrix G_ij = <x_i, x_j>")
print("-" * 50)

X = np.random.randn(6, 10)  # 6 samples, 10 features
print(f"X: {X.shape}")
print("\nSolution:")
print("  G_ij = X_ik X_jk  → einsum('ik,jk->ij', X, X)")
print("  Free: i,j  |  Dummy: k")
G = np.einsum('ik,jk->ij', X, X)
G_check = X @ X.T
print(f"  G shape: {G.shape}")
print(f"  ✓ = X X^T: {np.allclose(G, G_check)}")
print(f"  Symmetric: {np.allclose(G, G.T)}")

# ─── Exercise 5: Attention score gradient ───
print("\n" + "-" * 50)
print("Exercise 5: Given S_ij = Q_ik K_jk, find ∂S_ij/∂Q_mn")
print("-" * 50)

print("\nSolution:")
print("  ∂S_ij/∂Q_mn = ∂(Q_ik K_jk)/∂Q_mn")
print("              = (∂Q_ik/∂Q_mn) K_jk")
print("              = δ_im δ_kn K_jk")
print("              = δ_im K_jn")
print("  → Non-zero only when i=m, and the value is K_jn")

Q = np.random.randn(4, 3)
K = np.random.randn(4, 3)
S = np.einsum('ik,jk->ij', Q, K)

# Verify: ∂S_ij/∂Q_mn at m=1, n=2
m, n_idx = 1, 2
eps = 1e-7
Q_plus = Q.copy(); Q_plus[m, n_idx] += eps
Q_minus = Q.copy(); Q_minus[m, n_idx] -= eps
dS_numerical = (np.einsum('ik,jk->ij', Q_plus, K) - 
                np.einsum('ik,jk->ij', Q_minus, K)) / (2 * eps)

print(f"\n  Numerical ∂S/∂Q[{m},{n_idx}]:")
print(f"  {np.round(dS_numerical, 4)}")
print(f"\n  Analytical: row {m} should be K[:, {n_idx}] = {K[:, n_idx]}")
print(f"  Other rows should be 0")
print(f"  ✓ Row {m} matches K[:, {n_idx}]: {np.allclose(dS_numerical[m], K[:, n_idx])}")
print(f"  ✓ Other rows ≈ 0: {np.allclose(dS_numerical[0], 0) and np.allclose(dS_numerical[2:], 0)}")

16. Why This Matters for AI/ML

Einstein summation notation is not just compact writing — it is the universal language for describing tensor operations in modern AI. Here is precisely why mastering it is essential.

Impact on Your AI Career

| Skill | Without Index Notation | With Index Notation |
|---|---|---|
| Reading papers | Lost at equations | Fluent comprehension |
| Implementing models | Copy-paste from GitHub | Build from scratch |
| Debugging gradients | Print and pray | Derive and verify |
| Custom architectures | Limited to existing layers | Design new operations |
| Optimization | Black-box tuning | Understanding the math |

Where You'll Use This

  1. Transformer attention: Every attention paper uses index notation
  2. Tensor decompositions: Model compression, efficient inference
  3. Physics-informed ML: Equivariant networks, field theory
  4. Geometric deep learning: Gauge equivariance, manifold operations
  5. Diffusion models: Score matching, Fokker-Planck equations

Code cell 48

# ══════════════════════════════════════════════════════════════════
# 16. COMPREHENSIVE SUMMARY — THE COMPLETE PICTURE
# ══════════════════════════════════════════════════════════════════

print("=" * 70)
print("16. EINSTEIN SUMMATION — THE COMPLETE PICTURE")
print("=" * 70)

print("""
╔══════════════════════════════════════════════════════════════════╗
║  EINSTEIN SUMMATION CONVENTION — MASTER REFERENCE               ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  CORE RULE:                                                      ║
║    Repeated index → automatic summation                          ║
║    C_ik = A_ij B_jk  means  C_ik = Σ_j A_ij B_jk               ║
║                                                                  ║
║  INDEX TYPES:                                                    ║
║    FREE:  appears once per term → survives in output             ║
║    DUMMY: appears twice → summed away (contracted)               ║
║                                                                  ║
║  SHAPE RULE:                                                     ║
║    Output shape = dimensions of free indices                     ║
║    If all indices are dummy → scalar output                      ║
║                                                                  ║
║  KEY OPERATIONS (einsum strings):                                ║
║    Dot product:     'i,i->'        (contract all)                ║
║    Outer product:   'i,j->ij'      (no contraction)             ║
║    Matrix multiply: 'ij,jk->ik'    (contract j)                 ║
║    Transpose:       'ij->ji'       (relabel)                    ║
║    Trace:           'ii->'         (self-contract)               ║
║    Batch matmul:    'bij,bjk->bik' (b=batch, j=contract)        ║
║    Attention:       'bhid,bhjd->bhij' (d=contract)              ║
║                                                                  ║
║  SPECIAL TENSORS:                                                ║
║    δ_ij: Identity/substitution (Kronecker delta)                 ║
║    ε_ijk: Orientation/cross products (Levi-Civita)               ║
║                                                                  ║
║  GRADIENT RULE:                                                  ║
║    ∂(A_ij x_j)/∂x_k = A_ij δ_jk = A_ik                        ║
║    Backprop = chain of index contractions                        ║
║                                                                  ║
║  CONTRACTION ORDER MATTERS FOR EFFICIENCY:                       ║
║    Use np.einsum_path() or optimize=True                         ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝
""")

# Final comprehensive verification
print("=" * 70)
print("VERIFICATION: All major operations in one cell")
print("=" * 70)

np.random.seed(42)

# Setup
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
v = np.random.randn(4)
w = np.random.randn(4)
M = np.random.randn(4, 4)

results = {
    'Dot product (i,i->)':        (np.einsum('i,i->', v, w), np.dot(v, w)),
    'Outer product (i,j->ij)':    (np.einsum('i,j->ij', v, w), np.outer(v, w)),
    'Matrix-vec (ij,j->i)':       (np.einsum('ij,j->i', A, v), A @ v),
    'Matmul (ij,jk->ik)':         (np.einsum('ij,jk->ik', A, B), A @ B),
    'Transpose (ij->ji)':         (np.einsum('ij->ji', A), A.T),
    'Trace (ii->)':               (np.einsum('ii->', M), np.trace(M)),
    'Diagonal (ii->i)':           (np.einsum('ii->i', M), np.diag(M)),
    'Frobenius (ij,ij->)':        (np.einsum('ij,ij->', A, M[:3, :]), np.sum(A * M[:3, :])),
    'Row sums (ij->i)':           (np.einsum('ij->i', A), np.sum(A, axis=1)),
}

all_pass = True
for name, (ein, ref) in results.items():
    match = np.allclose(ein, ref)
    all_pass &= match
    print(f"  {'✓' if match else '✗'} {name}")

print(f"\n{'='*70}")
print(f"ALL OPERATIONS VERIFIED: {'✓ PASS' if all_pass else '✗ FAIL'}")
print(f"{'='*70}")

print(f"""
╔══════════════════════════════════════════════════════════════════╗
║  "I have made a great discovery in mathematics; I have           ║
║   suppressed the summation sign every time that the              ║
║   summation must be made over an index which occurs twice."      ║
║                                                                  ║
║                — Albert Einstein, letter to Marcel Grossmann,    ║
║                  October 1916                                    ║
║                                                                  ║
║  What started as a notational shortcut became the universal      ║
║  language of tensor computation — from general relativity to     ║
║  modern AI. Every time you call einsum(), you're using           ║
║  Einstein's gift to mathematics.                                 ║
╚══════════════════════════════════════════════════════════════════╝
""")