Orthogonality and Orthonormality: Part 9: Orthogonality in Function Spaces to Conceptual Bridge
9. Orthogonality in Function Spaces
9.1 The L^2 Inner Product
Orthogonality extends naturally from finite-dimensional vector spaces to infinite-dimensional function spaces. The space $L^2([a,b])$ of square-integrable functions carries the inner product:
$$\langle f, g \rangle = \int_a^b f(x)\,g(x)\,dx.$$
Orthogonality of functions: $f \perp g$ if $\langle f, g \rangle = 0$.
Example: On $[-\pi, \pi]$, the functions $\{1, \cos x, \sin x, \cos 2x, \sin 2x, \dots\}$ form a mutually orthogonal family: $\int_{-\pi}^{\pi} \cos(mx)\cos(nx)\,dx = 0$ for $m \ne n$, and likewise for sine-sine and sine-cosine pairs.
This is the foundation of Fourier series - a basis expansion for functions.
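As a quick numerical sanity check (a minimal sketch: the grid resolution and the particular frequency pairs tested are arbitrary choices, not part of the lesson's example), a Riemann-sum approximation of the $L^2$ inner product confirms that trigonometric functions of distinct frequencies are orthogonal on $[-\pi, \pi]$:

```python
import numpy as np

# Approximate the L^2 inner product on [-pi, pi] by a Riemann sum.
x = np.linspace(-np.pi, np.pi, 20001)
dx = x[1] - x[0]

def l2_inner(f, g):
    return np.sum(f(x) * g(x)) * dx

print(l2_inner(lambda t: np.sin(2 * t), np.sin))                    # ~0  (sin 2x vs sin x)
print(l2_inner(np.cos, np.sin))                                     # ~0  (cos x vs sin x)
print(l2_inner(lambda t: np.sin(2 * t), lambda t: np.sin(2 * t)))   # ~pi (squared norm)
```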
9.2 Fourier Series as Orthonormal Expansion
Orthonormal basis for $L^2([-\pi,\pi])$:
$$\left\{ \tfrac{1}{\sqrt{2\pi}},\ \tfrac{\cos(nx)}{\sqrt{\pi}},\ \tfrac{\sin(nx)}{\sqrt{\pi}} \ :\ n \ge 1 \right\}.$$
The Fourier series of $f \in L^2([-\pi,\pi])$ is its expansion in this ONB:
$$f = \sum_k \langle f, e_k \rangle\, e_k,$$
where the Fourier coefficients $c_k = \langle f, e_k \rangle$ are orthogonal projections onto the basis functions.
Parseval's theorem for Fourier series:
$$\|f\|^2 = \sum_k |\langle f, e_k \rangle|^2.$$
This is the infinite-dimensional analog of $\|x\|^2 = \sum_i |\langle x, q_i \rangle|^2$ for an orthonormal basis $\{q_i\}$ of $\mathbb{R}^n$.
9.3 Discrete Fourier Transform as a Unitary Matrix
The Discrete Fourier Transform (DFT) maps $x \in \mathbb{C}^N$ to $\hat{x} \in \mathbb{C}^N$:
$$\hat{x}_k = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x_n\, e^{-2\pi i kn/N}.$$
In matrix form: $\hat{x} = Fx$, where $F_{kn} = \frac{1}{\sqrt{N}}\, e^{-2\pi i kn/N}$.
Key fact: $F$ is unitary (the complex analog of orthogonal): $F^* F = F F^* = I$.
This is because the DFT basis vectors are orthonormal in $\mathbb{C}^N$:
$$\langle f_k, f_l \rangle = \frac{1}{N} \sum_{n=0}^{N-1} e^{2\pi i (k-l)n/N} = \delta_{kl}.$$
(The last equality is the geometric series identity for roots of unity.)
Parseval's theorem for DFT: $\|\hat{x}\| = \|x\|$ (with the $1/\sqrt{N}$ normalization).
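The unitarity of the DFT and Parseval's theorem are easy to verify numerically. A minimal NumPy sketch (the size $N = 8$ is an arbitrary choice):

```python
import numpy as np

# Unitary DFT matrix F_{kn} = exp(-2*pi*i*k*n/N) / sqrt(N), built explicitly.
N = 8
k = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(k, k) / N) / np.sqrt(N)

print(np.allclose(F.conj().T @ F, np.eye(N)))                    # True: F is unitary

x = np.random.randn(N) + 1j * np.random.randn(N)
x_hat = F @ x
print(np.allclose(np.linalg.norm(x_hat), np.linalg.norm(x)))     # True: Parseval
# Same transform via the FFT (NumPy's FFT is unnormalized, so rescale by sqrt(N)).
print(np.allclose(x_hat, np.fft.fft(x) / np.sqrt(N)))            # True
```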
For AI: The DFT/FFT appears in:
- Signal processing: Feature extraction from audio (MFCCs, spectrograms)
- Efficient convolutions: by the convolution theorem, $\mathrm{DFT}(x * w) = \mathrm{DFT}(x) \odot \mathrm{DFT}(w)$, so convolution becomes elementwise multiplication in the frequency domain - used in FFT-based convolution implementations
- Positional encodings: Sinusoidal positional embeddings in the original Transformer are Fourier features
- SSMs: State space models (Mamba, S4) use structured orthogonal/unitary matrices
10. Applications in Machine Learning
10.1 Orthogonal Weight Initialization
The problem with random initialization. If weight matrices have singular values spread over a wide range, gradients either explode or vanish during backpropagation through many layers.
Saxe et al. (2013) showed that initializing with orthogonal matrices preserves gradient norms through linear networks:
Theorem. For a deep linear network $f(x) = W_L \cdots W_2 W_1 x$ with orthogonal weight matrices (each $W_i^\top W_i = I$), the Jacobian of the full network is $J = W_L \cdots W_1$, which is also orthogonal (a product of orthogonal matrices is orthogonal). Hence $\|J^\top g\| = \|g\|$ for every backpropagated gradient $g$ - gradients neither explode nor vanish.
Practical implementation:
- Generate a random matrix $A \in \mathbb{R}^{n \times n}$ with i.i.d. standard normal entries: $A_{ij} \sim \mathcal{N}(0,1)$
- Compute its QR decomposition: $A = QR$
- Use $Q$ (or $Q\,\mathrm{diag}(\mathrm{sign}(r_{11}), \dots, \mathrm{sign}(r_{nn}))$ for the uniform (Haar) distribution over $O(n)$) as the initial weight matrix
This is implemented as torch.nn.init.orthogonal_() in PyTorch.
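A minimal NumPy sketch of this recipe (the helper name `orthogonal_init` and the matrix size are illustrative choices; `torch.nn.init.orthogonal_` performs the analogous computation on tensors):

```python
import numpy as np

def orthogonal_init(n, seed=0):
    """Random orthogonal matrix: QR of a Gaussian matrix, with the sign fix
    that makes the result Haar (uniformly) distributed over O(n)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))        # flip column signs so diag(R) > 0

W = orthogonal_init(256)
print(np.allclose(W.T @ W, np.eye(256)))  # True: orthonormal columns
x = np.random.randn(256)
print(np.allclose(np.linalg.norm(W @ x), np.linalg.norm(x)))   # True: isometry
```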
Why it works beyond linear networks: Even in nonlinear networks, orthogonal initialization places the initial weights in a "good" part of weight space where gradients are well-conditioned at initialization. Empirically, networks with orthogonal initialization often converge faster on deep architectures.
10.2 Rotary Position Embeddings (RoPE)
Standard self-attention lacks inherent position information. The original Transformer adds sinusoidal absolute position embeddings. RoPE (Su et al. 2021, used in LLaMA, GPT-NeoX, Falcon) takes a different approach: it encodes position by rotating query and key vectors.
Construction (2D case). For position $m$ and frequency $\theta$, the rotation matrix is:
$$R_m = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}.$$
For a $d$-dimensional embedding, pair up dimensions and apply a $2 \times 2$ rotation to each pair, using a different frequency $\theta_i$ for each pair.
Key property. The inner product of rotated queries and keys depends only on the relative position $n - m$:
$$\langle R_m q, R_n k \rangle = q^\top R_m^\top R_n k = q^\top R_{n-m}\, k,$$
since $R_m^\top R_n = R_{-m} R_n = R_{n-m}$ (rotation matrices satisfy $R_m^\top = R_m^{-1} = R_{-m}$, because the inverse of a rotation is a rotation in the opposite direction).
Orthogonality is central: the key identity uses the fact that $R_m$ is orthogonal ($R_m^\top R_m = I$), so $R_m^\top = R_m^{-1} = R_{-m}$.
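A two-line check of the relative-position property (a minimal sketch; the frequency $\theta = 0.1$ and the positions 5 and 9 are arbitrary choices):

```python
import numpy as np

def rot(m, theta):
    """2x2 rotation by angle m*theta (RoPE applied to one dimension pair)."""
    a = m * theta
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

theta = 0.1
q, k = np.random.randn(2), np.random.randn(2)

# <R_m q, R_n k> depends only on the relative offset n - m:
lhs = (rot(5, theta) @ q) @ (rot(9, theta) @ k)
rhs = q @ (rot(9 - 5, theta) @ k)
print(np.allclose(lhs, rhs))   # True
```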
10.3 Spectral Normalization
Spectral normalization (Miyato et al. 2018) stabilizes GAN training by constraining weight matrices to have spectral norm (= largest singular value) equal to 1.
Implementation: At each forward pass, divide the weight matrix by its spectral norm: $\bar{W} = W / \sigma_1(W)$.
Connection to orthogonality: A matrix all of whose singular values equal 1 is exactly an orthogonal matrix. Spectral normalization enforces $\sigma_1 = 1$ while letting the other singular values vary freely - it's a "one-sided" orthogonality constraint.
Why it helps GANs: The discriminator satisfies a Lipschitz condition when all weight layers are spectrally normalized. This relates to the Wasserstein GAN objective, which requires a 1-Lipschitz discriminator.
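A minimal sketch of the normalization step, estimating $\sigma_{\max}$ by power iteration (the helper name, matrix size, and iteration count are illustrative choices):

```python
import numpy as np

def spectral_normalize(W, n_iter=50):
    """Divide W by an estimate of its largest singular value, obtained by
    alternating power iteration on W and W^T."""
    u = np.random.default_rng(0).standard_normal(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u; v /= np.linalg.norm(v)
        u = W @ v;  u /= np.linalg.norm(u)
    sigma = u @ W @ v                      # estimated sigma_max(W)
    return W / sigma

W = np.random.randn(64, 32)
W_bar = spectral_normalize(W)
print(np.linalg.norm(W_bar, 2))            # ~1.0: largest singular value is now 1
```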
10.4 QR in Optimization: Orthogonal Gradients
Several modern optimization techniques use orthogonality:
Shampoo optimizer (Gupta et al. 2018): Maintains a Kronecker-factored preconditioner and periodically computes inverse matrix roots of its factors via eigendecomposition, which whitens (approximately orthogonalizes) the gradient update directions.
Muon optimizer (2024): Orthogonalizes gradient matrices via Newton-Schulz iterations (a matrix iteration that approximates the orthogonal factor of the polar decomposition), then applies the orthogonalized gradient as an update. This ensures each weight matrix sees updates with balanced singular values, preventing the accumulation of rank deficiency (a simplified sketch follows this list).
Gradient orthogonality in continual learning: Methods like Gradient Episodic Memory (GEM) and Orthogonal Gradient Descent (OGD) project gradients for new tasks onto the orthogonal complement of the gradient space for previous tasks - preserving past performance while learning new tasks.
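To make the orthogonalization step concrete, here is a simplified sketch using the classical cubic Newton-Schulz iteration; Muon itself uses a tuned polynomial variant and operates on momentum-averaged gradients, so treat this as an illustration of the idea rather than the actual optimizer:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Approximate the orthogonal polar factor of G with the cubic Newton-Schulz
    iteration X <- 1.5 X - 0.5 X X^T X (singular values are pushed toward 1)."""
    X = G / np.linalg.norm(G, 2)           # scale so all singular values are <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

G = np.random.randn(32, 16)                # a stand-in "gradient matrix"
Q = newton_schulz_orthogonalize(G)
print(np.allclose(Q.T @ Q, np.eye(16), atol=1e-6))    # True: orthonormal columns
U, _, Vt = np.linalg.svd(G, full_matrices=False)
print(np.allclose(Q, U @ Vt, atol=1e-5))               # matches the exact polar factor U V^T
```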
10b. Bessel's Inequality and Completeness
10b.1 Bessel's Inequality
When we expand a vector $x$ in terms of an orthonormal set $\{q_1, \dots, q_m\}$ that is not necessarily a complete basis for the entire space, we get Bessel's inequality:
$$\sum_{k=1}^{m} |\langle x, q_k \rangle|^2 \le \|x\|^2.$$
Proof. Let $p = \sum_{k=1}^{m} \langle x, q_k \rangle\, q_k$ be the projection of $x$ onto $\mathrm{span}\{q_1, \dots, q_m\}$. Then:
$$0 \le \|x - p\|^2 = \|x\|^2 - \sum_{k=1}^{m} |\langle x, q_k \rangle|^2.$$
The inequality follows immediately.
Equality holds if and only if $x = p$ - i.e., when $x$ itself lies in the subspace spanned by the chosen orthonormal vectors.
Parseval's identity is the limiting case: when $\{q_k\}$ is a complete ONB,
$$\sum_{k} |\langle x, q_k \rangle|^2 = \|x\|^2.$$
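A small numerical illustration of the Bessel/Parseval dichotomy (the dimension and the choice of a 3-vector partial basis are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Random orthonormal basis of R^5 (QR of a Gaussian matrix).
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
x = rng.standard_normal(5)

coeffs = Q.T @ x                                 # <x, q_k> for all 5 basis vectors
print(np.sum(coeffs[:3] ** 2) <= x @ x)          # True: Bessel (partial ONB, 3 vectors)
print(np.isclose(np.sum(coeffs ** 2), x @ x))    # True: Parseval (complete ONB)
```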
10b.2 Completeness of Orthonormal Sets
An orthonormal set $\{e_k\}$ in a Hilbert space $\mathcal{H}$ is complete (or maximal) if it satisfies any of the following equivalent conditions:
- $\{e_k\}$ is an ONB: every $x \in \mathcal{H}$ can be written $x = \sum_k \langle x, e_k \rangle\, e_k$
- Parseval's identity holds: $\|x\|^2 = \sum_k |\langle x, e_k \rangle|^2$ for all $x \in \mathcal{H}$
- $\langle x, e_k \rangle = 0$ for all $k$ implies $x = 0$ (no non-zero vector is orthogonal to all $e_k$)
In finite dimensions, any ONB is automatically complete (by dimension counting). In infinite dimensions, completeness is a non-trivial requirement - this is why Hilbert space theory requires careful treatment.
For ML context: In kernel methods, the RKHS is an infinite-dimensional Hilbert space. A kernel function defines an inner product, and the question of whether the kernel features span the whole space (completeness/universality) determines the expressivity of the corresponding kernel machine.
10c. Detailed Worked Examples
10c.1 Projection in Least-Squares Regression
Setup. We observe data points $(x_i, y_i)$, $i = 1, \dots, n$, and want to fit a line $y = \beta_0 + \beta_1 x$.
Build the design matrix and target vector:
$$A = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad b = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.$$
Gram-Schmidt on the columns of $A$:
Column 1: $q_1 = \mathbf{1}/\sqrt{n}$.
Column 2: $a_2 = (x_1, \dots, x_n)^\top$. Project out $q_1$: $v_2 = a_2 - (q_1^\top a_2)\, q_1 = a_2 - \bar{x}\,\mathbf{1}$, then $q_2 = v_2 / \|v_2\|$.
The projection (fitted values): $\hat{y} = (q_1 q_1^\top + q_2 q_2^\top)\, b = QQ^\top b$.
This is the projection of $b$ onto $\mathrm{col}(A)$ = the space of all affine functions of $x$. The residual $b - \hat{y}$ is orthogonal to every affine function.
Verification of the normal equations: The least-squares solution satisfies $A^\top A \hat{\beta} = A^\top b$.
Solving this $2 \times 2$ system yields the intercept $\hat{\beta}_0$ and slope $\hat{\beta}_1$ of the fitted line (see the numerical sketch below).
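The specific data points of this example are not reproduced above, so the following sketch uses made-up points; it runs the same pipeline (design matrix, thin QR, projection, normal-equation check):

```python
import numpy as np

# Illustrative data (not the specific points of the worked example above).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

A = np.column_stack([np.ones_like(x), x])     # design matrix [1, x]
Q, R = np.linalg.qr(A)                         # thin QR (Gram-Schmidt in exact arithmetic)

y_hat = Q @ (Q.T @ y)                          # projection of y onto col(A)
beta = np.linalg.solve(R, Q.T @ y)             # least-squares coefficients
print(beta)                                    # [intercept, slope]
print(np.allclose(A.T @ (y - y_hat), 0))       # True: residual is orthogonal to col(A)
print(np.allclose(beta, np.linalg.lstsq(A, y, rcond=None)[0]))   # matches np.linalg.lstsq
```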
10c.2 Full Worked Example: Gram-Schmidt in
Input vectors:
Step 1: $q_1 = a_1 / \|a_1\|$.
Step 2: Remove the component of $a_2$ along $q_1$: $v_2 = a_2 - (q_1^\top a_2)\, q_1$, then $q_2 = v_2 / \|v_2\|$.
Step 3: Remove components of $a_3$ along $q_1$ and $q_2$: $v_3 = a_3 - (q_1^\top a_3)\, q_1 - (q_2^\top a_3)\, q_2$, then $q_3 = v_3 / \|v_3\|$.
Verification: $q_1^\top q_2 = 0$ OK, $q_1^\top q_3 = 0$ OK, $q_2^\top q_3 = 0$ OK, and each $\|q_i\| = 1$.
QR factorization: $A = QR$, where the matrix $R = Q^\top A$ is upper triangular with entries $r_{ij} = q_i^\top a_j$ for $i \le j$.
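For reference, a reusable Modified Gram-Schmidt routine; the input matrix below is a placeholder, not the example's original vectors:

```python
import numpy as np

def modified_gram_schmidt(A):
    """Column-by-column MGS: returns Q with orthonormal columns and upper-triangular R
    such that A = Q R (A must have linearly independent columns)."""
    A = A.astype(float).copy()
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j]
        for i in range(j):
            R[i, j] = Q[:, i] @ v
            v = v - R[i, j] * Q[:, i]          # subtract projections one at a time
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

# Placeholder input (the example's specific vectors are not reproduced here).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
Q, R = modified_gram_schmidt(A)
print(np.allclose(Q.T @ Q, np.eye(3)), np.allclose(Q @ R, A), np.allclose(np.triu(R), R))
```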
10c.3 Spectral Decomposition Example
Let (symmetric).
Eigenvalues: or .
Eigenvectors:
- :
- :
Verify orthogonality: $u_1^\top u_2 = 0$ OK
Spectral decomposition: $A = \lambda_1 u_1 u_1^\top + \lambda_2 u_2 u_2^\top = Q \Lambda Q^\top$.
Rayleigh quotient at a test vector $x$: $R(x) = \dfrac{x^\top A x}{x^\top x}$.
Indeed $\lambda_{\min} \le R(x) \le \lambda_{\max}$, as the Rayleigh quotient bounds require.
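The same computation in NumPy, using an illustrative symmetric matrix (not the specific matrix of the example above):

```python
import numpy as np

# Illustrative symmetric matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

lam, Q = np.linalg.eigh(A)                 # eigh: real eigenvalues, orthonormal eigenvectors
print(lam)                                 # [1. 3.]
print(np.allclose(Q.T @ Q, np.eye(2)))     # True: eigenvectors are orthonormal
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))   # True: A = Q Lambda Q^T

x = np.array([1.0, 0.0])
rayleigh = (x @ A @ x) / (x @ x)
print(lam[0] <= rayleigh <= lam[1])        # True: Rayleigh quotient lies in [lambda_min, lambda_max]
```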
10c.4 Householder Reflector: Concrete Computation
Problem: Find a Householder reflector $H$ such that $Hx = \alpha\, e_1$ for a given vector $x$ (so $|\alpha| = \|x\|$).
Step 1: Compute $\|x\|$.
Step 2: Choose the sign to match $x_1$ so that no cancellation occurs: $v = x + \mathrm{sign}(x_1)\, \|x\|\, e_1$.
Step 3: Normalize: $u = v / \|v\|$.
Step 4: Build the reflector: $H = I - 2uu^\top$.
Verify: $Hx = -\mathrm{sign}(x_1)\, \|x\|\, e_1$.
This gives $Hx = \mp\|x\|\, e_1$ (a valid Householder result; the sign depends on the convention), and $\|Hx\| = \|x\|$ as expected. OK
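The construction as code (a minimal sketch; the vector $x = (3, 4)$ is an illustrative choice, and the helper assumes $x_1 \ne 0$):

```python
import numpy as np

def householder_reflector(x):
    """Build H = I - 2 u u^T with H x proportional to e1, choosing the sign
    of the shift to avoid cancellation (assumes x[0] != 0)."""
    x = np.asarray(x, dtype=float)
    e1 = np.zeros_like(x); e1[0] = 1.0
    v = x + np.sign(x[0]) * np.linalg.norm(x) * e1
    u = v / np.linalg.norm(v)
    return np.eye(len(x)) - 2.0 * np.outer(u, u)

x = np.array([3.0, 4.0])                      # illustrative vector, norm 5
H = householder_reflector(x)
print(H @ x)                                   # ~ [-5, 0]: all weight pushed onto e1
print(np.allclose(H @ H.T, np.eye(2)), np.allclose(H, H.T))   # orthogonal and symmetric
```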
10d. Orthogonality and the Geometry of Neural Networks
10d.1 The Geometry of Weight Matrices
A single linear layer $x \mapsto Wx$ transforms the input geometry. Understanding this transformation requires decomposing $W = U \Sigma V^\top$ via its singular value decomposition (preview):
- $V^\top$: rotate the input space (first orthogonal transformation)
- $\Sigma$: scale along the new axes by the singular values $\sigma_i$
- $U$: rotate the output space (second orthogonal transformation)
For orthogonal $W$: All singular values are 1, so $\|Wx\| = \|x\|$ for every $x$. The layer is a pure rotation - it doesn't compress or expand the input distribution. Distances and angles are perfectly preserved.
Implication for gradient flow. For $y = Wx$, the gradient of the loss with respect to the input is:
$$\frac{\partial \mathcal{L}}{\partial x} = W^\top \frac{\partial \mathcal{L}}{\partial y}.$$
For orthogonal $W$: $\left\| W^\top \frac{\partial \mathcal{L}}{\partial y} \right\| = \left\| \frac{\partial \mathcal{L}}{\partial y} \right\|$ - gradients are neither amplified nor attenuated. This is the formal reason orthogonal initialization prevents gradient vanishing/explosion in deep linear networks.
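A short numerical check of both claims: the SVD factors of a generic weight matrix are orthogonal, and a QR-generated orthogonal layer has unit singular values and preserves gradient norms (the sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
U, s, Vt = np.linalg.svd(W)

print(np.allclose(U.T @ U, np.eye(4)), np.allclose(Vt @ Vt.T, np.eye(4)))  # both orthogonal
print(np.allclose(U @ np.diag(s) @ Vt, W))      # W = U Sigma V^T

# An orthogonal layer has all singular values equal to 1 and preserves norms exactly.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
print(np.linalg.svd(Q, compute_uv=False))       # [1. 1. 1. 1.]
g = rng.standard_normal(4)                      # a "gradient" flowing backward
print(np.isclose(np.linalg.norm(Q.T @ g), np.linalg.norm(g)))   # True: no growth or decay
```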
10d.2 Attention Heads as Orthogonal Projections
In a Transformer's multi-head attention with $h$ heads:
$$\mathrm{MultiHead}(X) = [\,\mathrm{head}_1; \dots; \mathrm{head}_h\,]\, W^O, \qquad \mathrm{head}_i = \mathrm{softmax}\!\left( \frac{(X W_i^Q)(X W_i^K)^\top}{\sqrt{d_k}} \right) X W_i^V,$$
where $W_i^Q, W_i^K, W_i^V$ are the query/key/value projection matrices.
Mechanistic interpretability perspective (Elhage et al. 2021): Each attention head computes a projection of the residual stream onto a low-dimensional subspace, adds information from that subspace back, and (in some sense) different heads should attend to different "directions" of information.
Mathematical idealization: If we require $(W_i^V)^\top W_j^V = 0$ for $i \ne j$ (the value matrices read from mutually orthogonal subspaces of the residual stream), then different heads genuinely operate on independent information. In practice, heads aren't exactly orthogonal, but studying the overlap quantifies how much heads interfere with each other.
QK circuit analysis. The effective attention pattern for head $i$ involves the matrix $W_i^Q (W_i^K)^\top$ (or rather its action on the residual stream). This matrix's singular values determine how "sharp" vs "diffuse" the attention pattern is.
10d.3 Layer Normalization and Orthogonal Complements
Layer normalization (Ba et al. 2016) operates on a vector $x \in \mathbb{R}^d$:
$$\mathrm{LN}(x) = \gamma \odot \frac{x - \mu \mathbf{1}}{\sigma} + \beta,$$
where $\mu = \frac{1}{d} \sum_i x_i$ and $\sigma^2 = \frac{1}{d} \sum_i (x_i - \mu)^2$.
Orthogonal decomposition view. The centering step $x \mapsto x - \mu \mathbf{1}$ is an orthogonal projection onto the subspace $\mathrm{span}(\mathbf{1})^\perp = \{ v : \mathbf{1}^\top v = 0 \}$:
$$x - \mu \mathbf{1} = \left( I - \tfrac{1}{d}\, \mathbf{1}\mathbf{1}^\top \right) x.$$
This is a rank-$(d-1)$ orthogonal projector. Layer normalization first projects out the "mean direction" $\mathbf{1}$, then rescales to a fixed norm on the remaining $(d-1)$-dimensional subspace.
Consequence for representational geometry: After layer normalization (before the learned affine transform $\gamma, \beta$), the model's internal representations always satisfy $\mathbf{1}^\top z = 0$. The model can only represent information in the orthogonal complement of the constant vector - this is why all-zero inputs and constant inputs are indistinguishable after layer norm.
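A quick verification that the centering matrix is a symmetric, idempotent, rank-$(d-1)$ projector and that the normalized output is orthogonal to $\mathbf{1}$ (the dimension $d = 8$ is arbitrary; the learned affine parameters are omitted):

```python
import numpy as np

d = 8
ones = np.ones(d)
P = np.eye(d) - np.outer(ones, ones) / d       # centering matrix: projector onto span(1)^perp

print(np.allclose(P @ P, P), np.allclose(P, P.T))    # idempotent and symmetric
print(np.isclose(np.trace(P), d - 1))                # rank d - 1

x = np.random.default_rng(0).standard_normal(d)
centered = x - x.mean()
print(np.allclose(P @ x, centered))                  # centering = applying the projector
print(np.isclose(ones @ (centered / centered.std()), 0.0))   # layer-norm output is orthogonal to 1
```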
10d.4 Orthogonal Regularization in Weight Matrices
Several training techniques maintain (near-)orthogonality in weight matrices throughout training:
Orthogonal regularization: Add a penalty $\lambda \| W^\top W - I \|_F^2$ to the loss. This penalizes deviation from orthogonality without enforcing it exactly (a minimal sketch follows this list).
Spectral regularization: Penalize $\sum_i (\sigma_i(W) - 1)^2$ - all singular values are pulled toward 1. Stronger than spectral normalization, which only rescales the largest singular value to 1.
Riemannian optimization on the Stiefel manifold: The set of $n \times k$ matrices with orthonormal columns, $\mathrm{St}(n, k) = \{ W \in \mathbb{R}^{n \times k} : W^\top W = I_k \}$, is a smooth Riemannian manifold. Gradient descent on $\mathrm{St}(n, k)$ uses the Riemannian gradient (the projection of the Euclidean gradient onto the tangent space) and retracts to the manifold via QR decomposition or the Cayley transform after each step.
Applications: Orthogonal RNNs, rotation-equivariant neural networks, invertible neural networks, and normalizing flows all benefit from exact or approximate orthogonality constraints.
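A minimal sketch of the soft orthogonality penalty from the first item above (the helper name and the weight $\lambda$ are illustrative choices):

```python
import numpy as np

def orthogonal_penalty(W, lam=1e-3):
    """Soft orthogonality regularizer: lam * ||W^T W - I||_F^2 (added to the training loss)."""
    k = W.shape[1]
    D = W.T @ W - np.eye(k)
    return lam * np.sum(D ** 2)

rng = np.random.default_rng(0)
W_random = rng.standard_normal((64, 32))
W_orth, _ = np.linalg.qr(rng.standard_normal((64, 32)))
print(orthogonal_penalty(W_random))   # large: far from orthonormal columns
print(orthogonal_penalty(W_orth))     # ~0: exactly orthonormal columns
```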
11. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Confusing "orthogonal" and "orthonormal" | Orthogonal means $\langle q_i, q_j \rangle = 0$ for $i \ne j$; orthonormal additionally requires unit length. An orthogonal matrix has orthonormal columns, not just orthogonal ones. | Check: orthogonal <-> zero inner products; orthonormal <-> zero inner products AND $\|q_i\| = 1$ |
| 2 | Assuming $Q^\top Q = I$ implies $QQ^\top = I$ without checking that $Q$ is square | For non-square $Q \in \mathbb{R}^{m \times n}$ ($m > n$), $Q^\top Q = I_n$ but $QQ^\top \ne I_m$ (it's a rank-$n$ projection). Only square orthogonal matrices satisfy both. | Specify dimensions: "thin Q" ($m \times n$) vs "full Q" ($m \times m$) |
| 3 | Assuming Gram-Schmidt preserves span ordering in every variant | Plain Gram-Schmidt does: $\mathrm{span}\{q_1, \dots, q_k\} = \mathrm{span}\{a_1, \dots, a_k\}$ for every $k$. But with pivoting or reordering, this property is lost. | Track column ordering explicitly when pivoted QR is used |
| 4 | Using CGS when numerical stability matters | Classical Gram-Schmidt's loss of orthogonality grows roughly like $\kappa(A)^2 \varepsilon$ vs MGS's $\kappa(A)\,\varepsilon$. For ill-conditioned $A$, CGS can produce non-orthogonal "orthonormal" vectors. | Use Modified Gram-Schmidt (MGS) or Householder QR for production code |
| 5 | Projecting onto a non-orthogonal basis | The formula $P = QQ^\top$ assumes $Q^\top Q = I$. If the columns of $A$ are merely linearly independent (not orthonormal), the correct formula is $P = A(A^\top A)^{-1}A^\top$, which involves the inverse of the Gram matrix. | Always check $Q^\top Q = I$ before using the simplified projection formula |
| 6 | Thinking $P^2 = P$ is sufficient for orthogonal projection | Idempotence only guarantees $P$ is some projection. Orthogonal projections additionally require $P^\top = P$ (symmetry). Oblique projections satisfy idempotence but not symmetry. | An orthogonal projection must be both idempotent ($P^2 = P$) and symmetric ($P^\top = P$) |
| 7 | Claiming eigenvectors are always orthogonal | Eigenvectors of different eigenvalues are orthogonal only for symmetric matrices. For general matrices, eigenvectors can be arbitrary non-orthogonal vectors (or even non-real). | The spectral theorem applies to symmetric/Hermitian matrices; general matrices need Schur decomposition |
| 8 | Forgetting the sign convention in Householder | When computing $v = x \pm \|x\| e_1$, the sign should be chosen as $+\mathrm{sign}(x_1)$ to avoid cancellation when $x$ is nearly parallel to $e_1$. | Choose the sign so that no subtraction of nearly equal quantities occurs: $v = x + \mathrm{sign}(x_1)\, \|x\|\, e_1$ |
| 9 | Misinterpreting Parseval's identity as an approximation | $\|x\|^2 = \sum_k |\langle x, q_k \rangle|^2$ is an equality when $\{q_k\}$ is a complete orthonormal basis for the whole space. If it's only a basis for a subspace, you get Bessel's inequality: $\sum_k |\langle x, q_k \rangle|^2 \le \|x\|^2$. | Distinguish: complete ONB -> Parseval's equality; partial ONB -> Bessel's inequality |
| 10 | Assuming QR is always better than Cholesky for normal equations | QR avoids squaring the condition number but costs more flops (about $2mn^2$ vs about $mn^2 + n^3/3$ for forming $A^\top A$ plus Cholesky). For well-conditioned problems with large $m$, Cholesky on $A^\top A$ may be faster. | Choose based on the condition number: if $\kappa(A)$ is large, use QR; if $\kappa(A)$ is modest, Cholesky is acceptable and faster |
| 11 | Using $\det Q = \pm 1$ as the definition of orthogonal | $\det Q = \pm 1$ is a consequence of orthogonality, not the definition. A matrix can have determinant 1 without being orthogonal (e.g., any upper triangular matrix with diagonal product 1). | Definition is $Q^\top Q = QQ^\top = I$. The determinant property follows from this. |
| 12 | Thinking orthogonal initialization prevents gradient problems throughout training | Orthogonal initialization helps at initialization. As training proceeds, weights deviate from orthogonality. Maintaining orthogonality throughout training requires explicit regularization (e.g., spectral normalization, orthogonal regularization loss). | Use orthogonal init as a starting point; for persistent orthogonality through training, use explicit constraints |
12. Exercises
Exercise 1 * - Projection and Decomposition
Given a vector $v$ and the subspace $W$ spanned by the given vectors:
(a) Apply Gram-Schmidt to the spanning vectors to obtain an ONB for $W$.
(b) Compute the projection $\hat{v} = \mathrm{proj}_W(v)$ using the ONB.
(c) Verify: $\hat{v} \in W$ and $v - \hat{v} \perp W$.
(d) Compute $\|v - \hat{v}\|$ and verify $\|v\|^2 = \|\hat{v}\|^2 + \|v - \hat{v}\|^2$ (Pythagorean theorem).
Exercise 2 * - Gram-Schmidt and QR
Let .
(a) Apply Gram-Schmidt to the columns of $A$ to produce $Q$ and $R$ such that $A = QR$.
(b) Verify that $Q^\top Q = I$ and $QR = A$ numerically.
(c) Solve the least-squares problem $\min_x \|Ax - b\|_2$ for the given $b$ by solving $Rx = Q^\top b$ via back substitution.
(d) Verify the solution against np.linalg.lstsq.
Exercise 3 * - Orthogonal Matrices
(a) Verify that the rotation matrix $R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$ is orthogonal for every $\theta$.
(b) Show that the composition of two rotations is a rotation, $R(\alpha)R(\beta) = R(\alpha + \beta)$, by matrix multiplication.
(c) Construct a Householder reflector $H$ such that $Hx$ is a multiple of $e_1$ for the given $x$. Verify $H = H^\top$ and $H^2 = I$.
(d) What is ? Explain geometrically.
Exercise 4 ** - Modified Gram-Schmidt
(a) Implement both Classical and Modified Gram-Schmidt.
(b) Test on the Hilbert matrix $H_n$ for several values of $n$. Compute the maximum off-diagonal entry of $Q^\top Q$ for each method.
(c) Plot the orthogonality error ($\max_{i \ne j} |(Q^\top Q)_{ij}|$) vs matrix size for both methods. What do you observe?
(d) Explain why MGS is more stable using the error analysis from 5.2.
Exercise 5 ** - Least Squares Stability
Given the Vandermonde system fitting a degree-$p$ polynomial through $n$ points $(x_i, y_i)$:
(a) Build the design matrix $A$ with $A_{ij} = x_i^{\,j}$ for $i = 1, \dots, n$ and $j = 0, \dots, p$.
(b) Solve the least-squares problem via: (i) the normal equations $A^\top A \beta = A^\top y$, (ii) QR factorization $A = QR$ followed by $R\beta = Q^\top y$.
(c) Compute the condition numbers $\kappa(A)$ and $\kappa(A^\top A)$. What is their ratio?
(d) Perturb $y$ by small noise and compare the residuals from both methods. Which is more stable?
Exercise 6 ** - Spectral Theorem
Let .
(a) Find all eigenvalues and eigenvectors of $A$ analytically.
(b) Verify that the eigenvectors are orthogonal. Normalize them to form an orthogonal matrix $Q$.
(c) Verify the spectral decomposition $A = Q\Lambda Q^\top$, where $\Lambda$ is the diagonal matrix of eigenvalues.
(d) Compute the Rayleigh quotient $R(x) = \frac{x^\top A x}{x^\top x}$ for several test vectors $x$. Verify each value lies in $[\lambda_{\min}, \lambda_{\max}]$.
(e) Is $A$ positive definite? Verify using the eigenvalues and by direct computation of $x^\top A x$ for a few vectors $x$.
Exercise 7 *** - Orthogonal Weight Initialization
(a) Generate 100 random matrices with i.i.d. Gaussian entries (standard init). Compute the spectral norm of their product $W_L \cdots W_1$ for increasing depth $L$.
(b) Repeat with each $W_i$ initialized as a random orthogonal matrix (via QR of a Gaussian matrix). Plot $\|W_L \cdots W_1\|_2$ vs $L$ for both initializations.
(c) Generate synthetic "gradient" vectors and propagate them backward through the chains. Compare gradient norms for the Gaussian vs orthogonal initializations.
(d) Explain the result using the isometry property of orthogonal matrices.
Exercise 8 *** - Rayleigh Quotient Iteration
Rayleigh Quotient Iteration is a method for finding eigenvectors with cubic convergence: given $x_k$, compute the shift $\mu_k = \frac{x_k^\top A x_k}{x_k^\top x_k}$, solve $(A - \mu_k I)\, y = x_k$, and set $x_{k+1} = y / \|y\|$.
(a) Implement Rayleigh Quotient Iteration for a symmetric matrix.
(b) Test on a symmetric matrix $A$ with a random initial $x_0$. Track the shift $\mu_k$ over iterations.
(c) Plot the convergence on a log scale and estimate the convergence rate (should be cubic: $e_{k+1} = O(e_k^3)$).
(d) Compare convergence speed vs power iteration and inverse iteration with a fixed shift.
(e) Explain why Rayleigh quotient (not a fixed shift) is needed for cubic convergence.
13. Why This Matters for AI (2026 Perspective)
| Concept | AI Application | Impact |
|---|---|---|
| Orthogonal initialization | torch.nn.init.orthogonal_() | Prevents gradient explosion/vanishing at initialization; standard practice for deep linear networks (Saxe et al.) and RNNs |
| QR decomposition | Muon optimizer; orthogonal gradient updates | Orthogonalizing gradient matrices via QR/Newton-Schulz stabilizes LLM training; avoids rank collapse of weight matrices |
| RoPE embeddings | LLaMA, Mistral, GPT-NeoX, Falcon, Gemma | Rotary position encoding uses orthogonal rotation matrices to encode relative position in self-attention; rotation composition gives relative position property |
| Spectral normalization | GANs, discriminator regularization | Constrains $\sigma_{\max}(W) = 1$ for a 1-Lipschitz discriminator; stabilizes Wasserstein GAN training |
| Gram-Schmidt / QR | Numerically stable least squares | Used in regression, feature selection, basis pursuit; avoiding normal equations prevents condition number squaring |
| Orthogonal complement | Continual learning (OGD, GEM) | New task gradients are projected onto the orthogonal complement of old task gradient subspaces; prevents catastrophic forgetting |
| Spectral theorem | Hessian analysis, SAM optimizer | Eigendecomposition of the loss Hessian reveals sharp/flat directions; Sharpness-Aware Minimization (SAM) seeks minima with small $\lambda_{\max}(\nabla^2\mathcal{L})$ |
| Orthonormal bases | Attention head analysis | Queries, keys, values in each attention head span subspaces; orthogonality between heads corresponds to specialization; mechanistic interpretability studies this |
| DFT (unitary matrix) | Audio models, efficient convolutions | Wav2Vec, Whisper process spectrograms (DFT magnitude); convolutional layers in frequency domain use FFT = matrix multiply by unitary |
| Householder reflectors | QR in neural network training | Used in computing QR factorizations during optimizer steps (Shampoo); Householder product form compresses orthogonal matrices for efficient multiplication |
| Rayleigh quotient | Eigenvalue estimation in training | Stochastic Lanczos quadrature (Ghorbani et al. 2019) approximates the Hessian spectrum via Rayleigh quotients on random probes |
| Gram matrix | Gram-matrix style loss, NTK | The Neural Tangent Kernel $\Theta = JJ^\top$ (where $J$ is the Jacobian of the network outputs with respect to the parameters) is a Gram matrix; its eigenstructure determines training dynamics |
14. Advanced Topics
14.1 Polar Decomposition
Every matrix $A \in \mathbb{R}^{m \times n}$ ($m \ge n$) has a polar decomposition:
$$A = QH,$$
where $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns ($Q^\top Q = I_n$) and $H \in \mathbb{R}^{n \times n}$ is symmetric positive semidefinite.
Uniqueness: If $A$ has full column rank, $Q$ and $H$ are unique, with $H = (A^\top A)^{1/2}$ (the matrix square root) and $Q = AH^{-1}$.
Geometric meaning: $H$ captures the "stretching" and $Q$ captures the "rotation" - this is the matrix analog of writing a complex number as $z = re^{i\theta}$.
Relation to SVD: If $A = U\Sigma V^\top$ (thin SVD), then:
$$A = (UV^\top)(V\Sigma V^\top).$$
So $Q = UV^\top$ (the "nearest orthogonal matrix" to $A$) and $H = V\Sigma V^\top$.
For AI: The Muon optimizer computes $UV^\top$ from the SVD $G = U\Sigma V^\top$ of the gradient matrix (in practice approximated by Newton-Schulz iterations), then applies $UV^\top$ as the weight update - this is exactly the orthogonal factor of the polar decomposition of $G$.
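A direct SVD-based computation of the polar decomposition (a minimal sketch; the matrix size is arbitrary, and Muon replaces the explicit SVD with Newton-Schulz iterations):

```python
import numpy as np

def polar(A):
    """Polar decomposition A = Q H via the SVD: Q = U V^T (orthogonal factor), H = V Sigma V^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Q = U @ Vt
    H = Vt.T @ np.diag(s) @ Vt
    return Q, H

G = np.random.default_rng(0).standard_normal((5, 3))   # e.g. a gradient matrix
Q, H = polar(G)
print(np.allclose(Q @ H, G))                 # A = Q H
print(np.allclose(Q.T @ Q, np.eye(3)))       # Q has orthonormal columns
print(np.allclose(H, H.T), np.all(np.linalg.eigvalsh(H) >= -1e-12))   # H symmetric PSD
```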
14.2 Orthogonality in Hilbert Spaces
All finite-dimensional results extend to Hilbert spaces - complete inner product spaces, possibly infinite-dimensional.
Orthogonal Projection Theorem (Hilbert spaces). If $\mathcal{H}$ is a Hilbert space and $M \subseteq \mathcal{H}$ is a closed subspace, then every $x \in \mathcal{H}$ has a unique decomposition $x = p + r$ with $p \in M$ and $r \in M^\perp$.
Completeness is essential: Without it, the projection may not exist. (The "closed" hypothesis ensures that the minimizing sequence defining the projection converges to a point of $M$.)
Examples of Hilbert spaces in ML:
- $L^2$: the space of square-integrable functions (kernel methods, RKHS)
- $\ell^2$: square-summable sequences (infinite-dimensional limits of neural networks)
- Reproducing Kernel Hilbert Spaces (RKHS): the function space for kernel machines, Gaussian processes
14.3 Gram Matrices and the Kernel Trick
Given vectors $x_1, \dots, x_n \in \mathbb{R}^d$, the Gram matrix is:
$$G_{ij} = \langle x_i, x_j \rangle.$$
In matrix form: $G = X^\top X$, where $X = [\, x_1 \ \cdots \ x_n \,] \in \mathbb{R}^{d \times n}$.
Properties: $G$ is symmetric positive semidefinite. Its rank equals $\dim \mathrm{span}\{x_1, \dots, x_n\}$.
Kernel trick: Replace $\langle x_i, x_j \rangle$ with $k(x_i, x_j)$, where $k$ is a positive definite kernel. The resulting kernel matrix is a Gram matrix in a (possibly infinite-dimensional) RKHS.
For attention: The attention matrix $\mathrm{softmax}(QK^\top / \sqrt{d_k})$ is built from the Gram-style matrix of inner products between queries and keys - a kernelized similarity matrix. Its spectral properties determine how information flows between tokens.
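A small sketch constructing linear and RBF kernel Gram matrices and checking that both are symmetric positive semidefinite (the data and the RBF bandwidth are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))                       # 6 points in R^3 (one per row)

G_linear = X @ X.T                                     # plain Gram matrix <x_i, x_j>
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
G_rbf = np.exp(-0.5 * sq_dists)                        # RBF kernel k(x_i, x_j)

for G in (G_linear, G_rbf):
    print(np.allclose(G, G.T), np.all(np.linalg.eigvalsh(G) >= -1e-10))  # symmetric PSD
print(np.linalg.matrix_rank(G_linear))                 # = rank(X) = 3
```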
Conceptual Bridge
Looking Backward: What Made This Possible
The theory developed in this section builds directly on foundations from earlier chapters:
- Vectors and inner products (01: Vectors and Spaces): The inner product is the primitive notion from which orthogonality, angles, and projections all derive.
- Linear transformations (04: Linear Transformations): Orthogonal matrices are the isometric linear maps - those that preserve both angles and distances. The kernel and image of projection operators are orthogonal complements.
- Matrix rank (02-LA: Matrix Rank): The rank-nullity theorem and the four fundamental subspaces are unified by orthogonality: $\mathrm{row}(A) \perp \mathrm{null}(A)$ and $\mathrm{col}(A) \perp \mathrm{null}(A^\top)$.
- Eigenvalues (01: Eigenvalues): The spectral theorem tells us that symmetric matrices are diagonalizable with an orthonormal eigenbasis - the connection between orthogonality and spectral theory.
Looking Forward: What This Enables
Orthogonality and orthonormality are not endpoints but foundations:
- Singular Value Decomposition (02: SVD): The SVD is built from two orthogonal matrices. The right singular vectors form an ONB for the row space; the left singular vectors form an ONB for the column space. Without the theory developed here, SVD cannot be understood.
- Matrix Decompositions (08: Matrix Decompositions): LU, QR, Cholesky, Schur - QR is the central algorithm among these, and the QR algorithm for eigenvalues iterates orthogonal similarity transformations.
- Matrix Norms (06: Matrix Norms): The spectral norm is orthogonally invariant ($\|UAV^\top\|_2 = \|A\|_2$ for orthogonal $U, V$). The condition number $\kappa_2(Q) = 1$ for orthogonal $Q$.
- Optimization (Chapter 08): Gradient methods, second-order methods, and their variants all navigate spaces equipped with inner products. The geometry of the loss landscape is described using the spectral theory developed here.
CURRICULUM POSITION: ORTHOGONALITY AND ORTHONORMALITY
===========================================================================
FOUNDATIONS THIS SECTION FORWARD
----------- ------------ -------
+-----------------+ +-------------------------+ +------------------+
| Inner Products |----------> | Orthogonality (05) |-->| SVD (02) |
| Vectors/Spaces | | +- Gram-Schmidt | | low-rank approx |
+-----------------+ | +- QR Decomposition | +------------------+
| +- Orthogonal Matrices |
+-----------------+ | +- Projections | +------------------+
| Eigenvalues |----------> | +- Spectral Theorem |-->| Matrix Norms |
| (01) | | +- ML Applications | | (06) |
+-----------------+ +-------------------------+ +------------------+
|
+-----------------+ | +------------------+
| Linear Transf. |---------->----------+ -->| Decompositions |
| (04) | | (08) QR alg. |
+-----------------+ +------------------+
AI CONNECTIONS:
+------------------------------------------------------------------+
| RoPE embeddings -> Gram-Schmidt -> Muon optimizer -> OGD/GEM |
| Orthogonal init -> Spectral norm -> Rayleigh quotient -> SAM |
| QR least squares -> DFT (unitary) -> Kernel Gram matrix -> NTK |
+------------------------------------------------------------------+
===========================================================================
<- Back to Advanced Linear Algebra | Next: Matrix Norms ->