Inner products turn size into geometry. Completeness makes that geometry stable under limits.
Overview
Hilbert spaces are complete inner product spaces. They keep the geometry of Euclidean vectors while allowing infinite-dimensional objects such as functions, signals, square-summable sequences, random variables, and feature maps. Normed spaces let us measure size. Hilbert spaces add angle, orthogonality, projection, Fourier coordinates, and self-duality.
This section is the bridge between the normed-space foundations of the previous section and the kernel methods of the next section. The central message is practical: whenever a learning algorithm uses dot products, cosine similarity, least squares, PCA, attention scores, Fourier features, Gaussian processes, or kernelized optimization, it is leaning on Hilbert-space structure.
We focus on the core Hilbert toolkit:
- inner products and induced norms
- Cauchy-Schwarz, Pythagoras, and orthogonality
- projection theorem and least squares
- orthonormal systems, Bessel inequality, Parseval identity, and Fourier-Bessel expansion
- Riesz representation and gradients as vectors
- adjoints, self-adjoint operators, positive operators, compact operators, and spectral decomposition
- careful bridges to RKHS, kernels, Fourier analysis, PCA, and neural tangent kernels
Kernel methods, positive definite kernels, support vector machines, Gaussian processes, Mercer expansions, and scalable kernel approximations are only previewed here. They are developed in detail in Kernel Methods.
Prerequisites
- Normed spaces and completeness - Normed Spaces
- Vector spaces, bases, and linear maps - Linear Algebra Basics
- Orthogonality in finite dimensions - Orthogonality and Orthonormality
- Matrix norms and singular values - Matrix Norms
- Least squares and convex optimization - Convex Optimization
- Random variables and expectations - useful for $L^2$ spaces and Gaussian-process intuition
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive Hilbert geometry, projections, Gram-Schmidt, Parseval checks, Riesz gradients, adjoints, PCA, and kernel previews |
| exercises.ipynb | 8 graded exercises covering inner products, projections, bases, Riesz representation, operators, PCA, and RKHS previews |
Learning Objectives
After completing this section, you will be able to:
- Define real and complex inner products and the norm they induce
- Distinguish pre-Hilbert spaces from Hilbert spaces
- Prove and apply Cauchy-Schwarz, Pythagoras, and the parallelogram law
- Use orthogonal complements to decompose Hilbert spaces
- Apply the projection theorem to closed subspaces and least-squares problems
- Build orthonormal systems with Gram-Schmidt
- Use Bessel inequality, Parseval identity, and Fourier-Bessel coordinates
- State and apply the Riesz representation theorem
- Interpret gradients as Riesz representatives of differentials
- Work with adjoints, self-adjoint operators, positive operators, and compact operators
- Explain how Hilbert geometry supports attention, PCA, ridge regression, Gaussian processes, RKHS theory, and infinite-width neural-network limits
Table of Contents
- 1. Intuition
- 2. Formal Definitions
- 3. Core Theory I: Inner Product Geometry
- 4. Core Theory II: Projection Theorem
- 5. Core Theory III: Orthonormal Systems and Bases
- 6. Core Theory IV: Riesz Representation and Duality
- 7. Core Theory V: Operators on Hilbert Spaces
- 8. Advanced Topics and Bridges
- 9. Applications in Machine Learning
- 10. Common Mistakes
- 11. Exercises
- 12. Why This Matters for AI
- 13. Conceptual Bridge
- References
1. Intuition
1.1 From Norms to Angles
A normed space tells us how large a vector is. A Hilbert space tells us not only how large vectors are, but also how they face each other.
The extra structure is the inner product:
In $\mathbb{R}^n$, the standard inner product is
$$\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i.$$
This single operation gives length,
$$\|x\| = \sqrt{\langle x, x \rangle},$$
angle,
$$\cos \theta = \frac{\langle x, y \rangle}{\|x\|\,\|y\|},$$
and orthogonality,
$$x \perp y \iff \langle x, y \rangle = 0.$$
The philosophical shift is small but powerful:
normed space:
vector + size + convergence
Hilbert space:
vector + size + convergence + angle + projection + coordinates
In machine learning, this is why dot products can act as similarity scores, why least squares has a clean geometric solution, why PCA is an orthogonal coordinate system, and why Fourier analysis can decompose functions into energy-preserving frequency components.
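The three measurements above can be computed from a single `np.dot` call. A minimal NumPy sketch (the vectors are illustrative):

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([4.0, -3.0])

# Length from the inner product: ||x|| = sqrt(<x, x>)
length_x = float(np.sqrt(np.dot(x, x)))

# Angle from the inner product: cos(theta) = <x, y> / (||x|| ||y||)
cos_theta = float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Orthogonality: <x, y> = 0
is_orthogonal = bool(np.isclose(np.dot(x, y), 0.0))
```

Here `x` and `y` are orthogonal, so the cosine is zero even though both vectors have length 5.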
1.2 Why Completeness Matters
A Hilbert space is not just an inner product space. It is a complete inner product space. Completeness means every Cauchy sequence converges to a point inside the same space.
This matters because optimization, approximation, and learning algorithms often produce limits. If the space is incomplete, an algorithm can converge toward an object that does not live in the space being studied.
Let $\mathcal{P}$ be the polynomials on $[a, b]$ with inner product
$$\langle p, q \rangle = \int_a^b p(t)\, q(t)\, dt.$$
This is an inner product space, but it is not complete. A sequence of polynomials can be Cauchy in the induced norm and converge to a square-integrable function that is not a polynomial. The completion is $L^2([a, b])$.
For AI, incompleteness shows up whenever we approximate functions by finite models but reason about limiting function classes. A Hilbert space is the stable mathematical container for those limits.
1.3 Why Hilbert Spaces Matter for AI
Hilbert-space ideas are everywhere in modern ML:
| ML concept | Hilbert-space idea |
|---|---|
| attention score | inner product as alignment |
| cosine similarity | normalized Hilbert angle |
| least squares | orthogonal projection |
| ridge regression | projected or regularized Hilbert problem |
| PCA | spectral theorem for self-adjoint covariance operators |
| Fourier features | coordinates in an orthonormal system |
| Gaussian processes | covariance kernels and function-space geometry |
| kernel methods | implicit inner products in feature Hilbert spaces |
| gradient descent in function space | Riesz representation of differentials |
| neural tangent kernel | kernel gradient flow in a Hilbert-like feature geometry |
The key operational pattern is: express similarity as an inner product, express best fit as an orthogonal projection, and express coordinates as inner products with an orthonormal system.
1.4 Historical Timeline
- 1900s: Hilbert and Schmidt formalize infinite systems of equations and spectral methods.
- 1920s-1930s: Hilbert spaces become the language of quantum mechanics.
- 1940s-1950s: Functional analysis develops projection, duality, and operator theory.
- 1950s-1970s: RKHS theory connects kernels with Hilbert spaces of functions.
- 1990s-2000s: Support vector machines and Gaussian processes bring kernels into mainstream ML.
- 2010s-2020s: Neural tangent kernels and infinite-width limits reconnect deep learning with Hilbert-space and kernel viewpoints.
2. Formal Definitions
2.1 Real and Complex Inner Products
Let $V$ be a vector space over $\mathbb{F} = \mathbb{R}$ or $\mathbb{C}$. An inner product is a map
$$\langle \cdot, \cdot \rangle : V \times V \to \mathbb{F}$$
such that for all $x, y, z \in V$ and scalars $\alpha, \beta$:
- Linearity in the first argument: $\langle \alpha x + \beta y, z \rangle = \alpha \langle x, z \rangle + \beta \langle y, z \rangle$
- Conjugate symmetry: $\langle x, y \rangle = \overline{\langle y, x \rangle}$
- Positive definiteness: $\langle x, x \rangle \ge 0$, with equality if and only if $x = 0$
Some books choose linearity in the second argument for complex spaces. This repository uses linearity in the first argument in this section. The formulas involving adjoints and Riesz representatives should be read consistently with that convention.
2.2 Induced Norm and Metric
Every inner product induces a norm:
$$\|x\| = \sqrt{\langle x, x \rangle}.$$
The induced metric is
$$d(x, y) = \|x - y\|.$$
So every inner product space is a normed space, and hence a metric space. The reverse is false. Many norms do not come from any inner product.
2.3 Hilbert and Pre-Hilbert Spaces
An inner product space is also called a pre-Hilbert space when we want to emphasize that it may not be complete.
A Hilbert space is a complete inner product space. Equivalently, $H$ is Hilbert if every sequence $(x_n)$ satisfying
$$\|x_m - x_n\| \to 0 \quad \text{as } m, n \to \infty$$
has a limit $x \in H$ with
$$\|x_n - x\| \to 0.$$
Completeness is a statement about the norm induced by the inner product.
2.4 Examples and Non-Examples
Finite-dimensional Euclidean spaces. $\mathbb{R}^n$ and $\mathbb{C}^n$ are Hilbert spaces with the usual inner product. Every finite-dimensional inner product space is complete.
Weighted Euclidean spaces. If $A \in \mathbb{R}^{n \times n}$ is symmetric positive definite, then
$$\langle x, y \rangle_A = x^\top A y$$
is an inner product on $\mathbb{R}^n$. The induced norm is a Mahalanobis-style norm:
$$\|x\|_A = \sqrt{x^\top A x}.$$
Square-summable sequences. The space
$$\ell^2 = \Big\{ (x_i)_{i=1}^{\infty} : \sum_{i=1}^{\infty} |x_i|^2 < \infty \Big\}$$
is Hilbert under
$$\langle x, y \rangle = \sum_{i=1}^{\infty} x_i \overline{y_i}.$$
Square-integrable functions. The space $L^2(\Omega)$ is Hilbert under
$$\langle f, g \rangle = \int_{\Omega} f(t)\, \overline{g(t)}\, dt.$$
Technically, elements of are equivalence classes of functions equal almost everywhere. That detail prevents zero-norm nonzero representatives from breaking positive definiteness.
Polynomials are not complete. The polynomial space with the $L^2$ inner product is a pre-Hilbert space, not a Hilbert space.
$C[a, b]$ with the $L^2$ inner product is not complete. Continuous functions are dense in $L^2([a, b])$, but an $L^2$ limit of continuous functions need not be continuous.
$\ell^p$ for $p \neq 2$ is not a Hilbert space with its usual norm. The $\ell^p$ norm does not satisfy the parallelogram law, so it cannot be induced by any inner product.
2.5 Parallelogram Law and Polarization Identity
Every inner-product norm satisfies the parallelogram law:
$$\|x + y\|^2 + \|x - y\|^2 = 2\|x\|^2 + 2\|y\|^2.$$
This law says the two diagonals of a parallelogram carry exactly the same total squared length as twice the sum of squared side lengths.
For real inner product spaces, the inner product can be recovered from the norm:
$$\langle x, y \rangle = \tfrac{1}{4}\big(\|x + y\|^2 - \|x - y\|^2\big).$$
For complex spaces, the polarization identity is
$$\langle x, y \rangle = \tfrac{1}{4} \sum_{k=0}^{3} i^k \|x + i^k y\|^2.$$
Thus a Hilbert norm is not just any norm. It is exactly a norm whose geometry secretly contains an inner product.
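Both identities are easy to verify numerically, and the same check shows the $\ell^1$ norm failing the parallelogram law. A minimal NumPy sketch with illustrative vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
y = np.array([2.0, -1.0, 0.5, 1.0, -2.0])

n2 = np.linalg.norm                      # Euclidean norm, induced by <.,.>

# Parallelogram law: ||x+y||^2 + ||x-y||^2 = 2||x||^2 + 2||y||^2
lhs = n2(x + y)**2 + n2(x - y)**2
rhs = 2 * n2(x)**2 + 2 * n2(y)**2

# Real polarization identity: <x, y> = (||x+y||^2 - ||x-y||^2) / 4
recovered = 0.25 * (n2(x + y)**2 - n2(x - y)**2)

# The l1 norm fails the parallelogram law, so no inner product induces it
n1 = lambda v: np.abs(v).sum()
lhs_l1 = n1(x + y)**2 + n1(x - y)**2
rhs_l1 = 2 * n1(x)**2 + 2 * n1(y)**2
```

For these vectors the Euclidean identities hold to machine precision while the $\ell^1$ version does not balance.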
3. Core Theory I: Inner Product Geometry
3.1 Cauchy-Schwarz Inequality
For any $x, y$ in an inner product space,
$$|\langle x, y \rangle| \le \|x\|\,\|y\|.$$
Equality holds if and only if and are linearly dependent.
Proof sketch. If $y = 0$, the claim is immediate. Otherwise, consider
$$0 \le \|x - t y\|^2.$$
Choose
$$t = \frac{\langle x, y \rangle}{\|y\|^2}.$$
Expanding gives
$$\|x\|^2 - \frac{|\langle x, y \rangle|^2}{\|y\|^2} \ge 0,$$
which rearranges to Cauchy-Schwarz.
In ML, Cauchy-Schwarz bounds the maximum possible alignment between embeddings, gradients, features, and residuals. It justifies cosine similarity and many norm-based generalization bounds.
3.2 Angles, Cosine Similarity, and Orthogonality
For nonzero $x$ and $y$ in a real Hilbert space, define the angle $\theta$ by
$$\cos \theta = \frac{\langle x, y \rangle}{\|x\|\,\|y\|}.$$
Cauchy-Schwarz guarantees the right side lies in $[-1, 1]$.
Cosine similarity is the same expression. It removes magnitude and keeps directional alignment:
$$\mathrm{cos\_sim}(x, y) = \frac{\langle x, y \rangle}{\|x\|\,\|y\|}.$$
Two vectors are orthogonal when
$$\langle x, y \rangle = 0.$$
Attention scores, embedding retrieval, contrastive learning, and spectral algorithms all use this notion of alignment.
3.3 Pythagorean Theorem
If $\langle x, y \rangle = 0$, then
$$\|x + y\|^2 = \|x\|^2 + \|y\|^2.$$
Proof:
$$\|x + y\|^2 = \|x\|^2 + 2\,\mathrm{Re}\,\langle x, y \rangle + \|y\|^2 = \|x\|^2 + \|y\|^2.$$
This identity is the energy bookkeeping behind least squares, Fourier analysis, PCA, and variance decomposition.
3.4 Orthogonal Complements and Closed Subspaces
For a subset $M \subseteq H$, its orthogonal complement is
$$M^{\perp} = \{ x \in H : \langle x, m \rangle = 0 \text{ for all } m \in M \}.$$
$M^{\perp}$ is always a closed subspace, even if $M$ is not closed.
If $M$ is a closed subspace of a Hilbert space, then
$$H = M \oplus M^{\perp}.$$
Every vector has a unique decomposition
$$x = P_M x + (x - P_M x), \qquad P_M x \in M, \quad x - P_M x \in M^{\perp}.$$
In least squares, $P_M x$ is the fitted value and $x - P_M x$ is the residual.
3.5 Best Approximation Intuition
Hilbert spaces make approximation geometric. Suppose $M$ is a model class that is a closed subspace and $x$ is a target. The best approximation problem is
$$\min_{m \in M} \|x - m\|.$$
The solution is characterized not by a complicated search condition but by orthogonality:
$$x - P_M x \perp M.$$
That is, the error left after the best approximation has no component in the model subspace.
target x
*
|\
| \
| \ residual x - P_M x
| \
| *
| P_M x
---+---------------- model subspace M
4. Core Theory II: Projection Theorem
4.1 Projection onto Closed Convex Sets
Let $C$ be a nonempty closed convex subset of a Hilbert space $H$. For every $x \in H$, there exists a unique point $P_C x \in C$ such that
$$\|x - P_C x\| = \inf_{c \in C} \|x - c\|.$$
The point $P_C x$ is the projection of $x$ onto $C$.
For convex sets, the variational characterization is
$$\langle x - P_C x,\; c - P_C x \rangle \le 0 \quad \text{for all } c \in C.$$
This inequality says every feasible direction from the projection point makes an obtuse angle with the residual.
4.2 Orthogonal Projection onto Closed Subspaces
When $M$ is a closed subspace, the projection condition simplifies:
$$\langle x - P_M x,\; m \rangle = 0 \quad \text{for all } m \in M.$$
Equivalently,
$$x - P_M x \in M^{\perp}.$$
If $\{e_1, \dots, e_k\}$ is an orthonormal basis for a finite-dimensional subspace $M$, then
$$P_M x = \sum_{i=1}^{k} \langle x, e_i \rangle\, e_i.$$
This formula is the finite-dimensional ancestor of Fourier expansion.
4.3 Least Squares as Projection
Given $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, least squares solves
$$\min_{x \in \mathbb{R}^n} \|A x - b\|^2.$$
The fitted vector $A\hat{x}$ is the projection of $b$ onto the column space $\mathrm{col}(A)$. The residual is orthogonal to every column of $A$:
$$A^\top (b - A\hat{x}) = 0.$$
Thus the normal equations are not arbitrary algebra. They are the orthogonality condition for projection:
$$A^\top A\, \hat{x} = A^\top b.$$
If $A^\top A$ is invertible,
$$\hat{x} = (A^\top A)^{-1} A^\top b.$$
If not, the minimum-norm solution is
$$\hat{x} = A^{+} b,$$
where $A^{+}$ is the Moore-Penrose pseudoinverse.
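The orthogonality condition can be checked directly: solve the normal equations, then verify that the residual is orthogonal to every column. A minimal NumPy sketch (the small system is illustrative):

```python
import numpy as np

# Fit a line through three points: the columns of A span the model subspace
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Normal equations A^T A x = A^T b (A has full column rank here)
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

fitted = A @ x_hat              # projection of b onto col(A)
residual = b - fitted

# Residual is orthogonal to every column of A
col_products = A.T @ residual
```

The same `x_hat` comes out of `np.linalg.lstsq`, which handles the rank-deficient case by returning the minimum-norm solution.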
4.4 Projection Operators
The orthogonal projection $P_M$ satisfies:
$$P_M^2 = P_M, \qquad P_M^{*} = P_M, \qquad \|P_M\| \le 1.$$
If $x \in M$, then $P_M x = x$.
For a matrix $A$ with full column rank, the projection matrix onto $\mathrm{col}(A)$ is
$$P = A (A^\top A)^{-1} A^\top.$$
It satisfies
$$P^2 = P, \qquad P^\top = P.$$
4.5 Projection Algorithms in ML
Projection appears in many algorithms:
- projected gradient descent enforces constraints by applying $P_C$ after each gradient step
- alternating projections solve feasibility problems
- least-squares layers project labels onto feature spans
- PCA projects data onto top eigenspaces
- denoising methods often estimate a projection onto a data manifold or low-dimensional signal set
- constrained decoding can be viewed as repeated projection onto feasible token or structure sets, though usually in non-Hilbert geometries
The exact Hilbert projection theorem applies to closed convex sets in Hilbert spaces. ML practice often borrows the intuition outside this perfect setting, so the theorem-level statement and engineering heuristic should not be confused.
5. Core Theory III: Orthonormal Systems and Bases
5.1 Orthonormal Sets and Gram-Schmidt
A set $\{e_i\}$ is orthonormal if
$$\langle e_i, e_j \rangle = \delta_{ij}.$$
Given linearly independent vectors $x_1, \dots, x_k$, Gram-Schmidt constructs an orthonormal set:
$$v_j = x_j - \sum_{i < j} \langle x_j, e_i \rangle\, e_i, \qquad e_j = \frac{v_j}{\|v_j\|}.$$
Numerically, classical Gram-Schmidt can be unstable. Modified Gram-Schmidt or QR factorization is preferred in computation.
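A minimal sketch of modified Gram-Schmidt, which subtracts each projection in place rather than all at once (the function name and data are illustrative):

```python
import numpy as np

def modified_gram_schmidt(X):
    """Orthonormalize the columns of X (assumed linearly independent)."""
    X = X.astype(float)
    n, k = X.shape
    Q = np.zeros((n, k))
    for j in range(k):
        v = X[:, j].copy()
        for i in range(j):
            v -= (Q[:, i] @ v) * Q[:, i]   # subtract projection onto e_i
        Q[:, j] = v / np.linalg.norm(v)
    return Q

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
Q = modified_gram_schmidt(X)
```

The columns of `Q` are orthonormal and span the same subspace as the columns of `X`; `np.linalg.qr` gives the same span with better numerical guarantees.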
5.2 Bessel Inequality
If $\{e_i\}$ is an orthonormal set, then for every $x$,
$$\sum_{i} |\langle x, e_i \rangle|^2 \le \|x\|^2.$$
For finite $F$, define
$$x_F = \sum_{i \in F} \langle x, e_i \rangle\, e_i.$$
Then $x - x_F$ is orthogonal to $x_F$, so
$$\|x\|^2 = \|x_F\|^2 + \|x - x_F\|^2 \ge \sum_{i \in F} |\langle x, e_i \rangle|^2.$$
The Bessel inequality says orthonormal coordinates cannot contain more energy than the vector itself.
5.3 Complete Orthonormal Systems and Hilbert Bases
An orthonormal set $\{e_i\}$ is complete if the only vector orthogonal to every $e_i$ is $0$:
$$\langle x, e_i \rangle = 0 \text{ for all } i \implies x = 0.$$
For separable Hilbert spaces, a countable complete orthonormal system is often called a Hilbert basis. This is not a Hamel basis. Infinite Hilbert expansions converge in norm, not as finite algebraic sums.
5.4 Parseval Identity and Fourier-Bessel Expansion
If $\{e_i\}$ is a complete orthonormal system, then
$$x = \sum_{i} \langle x, e_i \rangle\, e_i$$
with convergence in the Hilbert norm, and
$$\|x\|^2 = \sum_{i} |\langle x, e_i \rangle|^2.$$
The first identity is the Fourier-Bessel expansion. The second is the Parseval identity.
For $L^2([0, 2\pi])$, the trigonometric functions $1, \cos(kt), \sin(kt)$ form an orthonormal coordinate system after normalization. A square-integrable function can be represented by its Fourier coefficients with convergence in the $L^2$ norm, even when pointwise convergence needs additional hypotheses.
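Expansion, Parseval, and Bessel can all be checked numerically with any orthonormal basis of $\mathbb{R}^n$. A minimal NumPy sketch using a random basis obtained from QR (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8

# A random orthonormal basis of R^n: the columns of Q
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
x = rng.standard_normal(n)

# Fourier-Bessel coordinates c_i = <x, e_i>
c = Q.T @ x

# Expansion x = sum_i c_i e_i, and Parseval: ||x||^2 = sum_i |c_i|^2
x_rebuilt = Q @ c
energy_x = float(x @ x)
energy_c = float(c @ c)

# Bessel: partial coordinate energy never exceeds ||x||^2
partial_energy = float(c[:4] @ c[:4])
```

The full coordinate energy matches the vector's energy exactly; any partial sum of squared coordinates is bounded by it.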
5.5 Separability and Coordinates
A Hilbert space is separable if it has a countable dense subset. Most Hilbert spaces used in ML and signal processing are separable.
If $H$ is separable with an orthonormal system $\{e_i\}$, the map
$$x \mapsto \big( \langle x, e_i \rangle \big)_{i}$$
is an isometric isomorphism from $H$ to a closed subspace of $\ell^2$. If the system is complete, it is an isometric isomorphism onto $\ell^2$.
This is the reason $\ell^2$ is the prototype of separable infinite-dimensional Hilbert spaces.
6. Core Theory IV: Riesz Representation and Duality
6.1 Bounded Linear Functionals
A linear functional is a linear map
$$\varphi : H \to \mathbb{F}.$$
It is bounded if there exists $C \ge 0$ such that
$$|\varphi(x)| \le C \|x\| \quad \text{for all } x \in H.$$
The operator norm is
$$\|\varphi\| = \sup_{\|x\| \le 1} |\varphi(x)|.$$
Bounded linear functionals are exactly the continuous linear functionals.
6.2 Riesz Representation Theorem
For every bounded linear functional $\varphi$ on a Hilbert space $H$, there exists a unique vector $z \in H$ such that
$$\varphi(x) = \langle x, z \rangle \quad \text{for all } x \in H.$$
Moreover,
$$\|\varphi\| = \|z\|.$$
The vector $z$ is the Riesz representative of $\varphi$.
6.3 Hilbert Spaces Are Self-Dual
The Riesz theorem identifies $H$ with its dual $H^{*}$. In a general Banach space, the dual can be quite different from the original space. In a Hilbert space, continuous linear measurements are inner products with vectors inside the same space.
This is the clean mathematical reason gradients can often be represented as vectors. A derivative is naturally a linear functional. To turn it into a gradient vector, we choose an inner product and apply Riesz representation.
6.4 Gradients as Riesz Representatives
Let $f : H \to \mathbb{R}$ be differentiable. Its differential at $x$ is a bounded linear functional
$$df_x : H \to \mathbb{R}, \qquad df_x(h) = \lim_{t \to 0} \frac{f(x + t h) - f(x)}{t}.$$
The gradient is the Riesz representative $\nabla f(x)$ satisfying
$$\langle \nabla f(x), h \rangle = df_x(h) \quad \text{for all } h \in H.$$
Changing the inner product changes the gradient vector while preserving the same differential. This is why natural gradient, preconditioning, and mirror-descent-like methods can be interpreted as changing the geometry in which steepest descent is measured.
6.5 Why This Matters for Optimization and Backpropagation
Backpropagation computes derivatives. Optimizers apply gradient vectors. The bridge from derivative-as-functional to gradient-as-vector is Riesz representation plus an inner product choice.
For the squared loss
$$f(w) = \tfrac{1}{2}\|A w - b\|^2,$$
the differential is
$$df_w(h) = \langle A w - b,\; A h \rangle = \langle A^\top (A w - b),\; h \rangle.$$
Thus the Euclidean gradient is
$$\nabla f(w) = A^\top (A w - b).$$
With a different inner product $\langle u, v \rangle_M = u^\top M v$ for symmetric positive definite $M$, the gradient vector becomes
$$\nabla_M f(w) = M^{-1} A^\top (A w - b).$$
This is preconditioning in Hilbert geometry.
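The point that the gradient vector changes while the differential stays fixed can be verified numerically: both representatives produce the same pairing with any direction, and that pairing matches a finite-difference estimate of the differential. A minimal sketch (the matrix `M` is an arbitrary illustrative preconditioner):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
w = rng.standard_normal(3)
h = rng.standard_normal(3)

def f(v):
    # Squared loss f(v) = 0.5 * ||A v - b||^2
    return 0.5 * np.sum((A @ v - b)**2)

# Euclidean gradient: Riesz representative under <u, v> = u^T v
grad_euclid = A.T @ (A @ w - b)

# Under <u, v>_M = u^T M v (M symmetric positive definite), the SAME
# differential is represented by M^{-1} A^T (A w - b)
M = np.diag([1.0, 4.0, 9.0])
grad_M = np.linalg.solve(M, grad_euclid)

# Both pairings evaluate the differential df_w(h)
pairing_euclid = float(grad_euclid @ h)
pairing_M = float(grad_M @ (M @ h))

# Central finite-difference estimate of df_w(h)
eps = 1e-6
fd = (f(w + eps * h) - f(w - eps * h)) / (2 * eps)
```

Gradient descent with `grad_M` traces a different path than with `grad_euclid`, even though both are steepest descent in their own geometry.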
7. Core Theory V: Operators on Hilbert Spaces
7.1 Bounded Operators and Operator Norms
A linear operator $T : H_1 \to H_2$ is bounded if
$$\|T x\| \le C \|x\| \quad \text{for some } C \ge 0 \text{ and all } x.$$
Its operator norm is
$$\|T\| = \sup_{\|x\| \le 1} \|T x\|.$$
For matrices with Euclidean geometry, this is the spectral norm:
$$\|A\|_2 = \sigma_{\max}(A).$$
7.2 Adjoints and Self-Adjoint Operators
The adjoint $T^{*}$ is defined by
$$\langle T x, y \rangle = \langle x, T^{*} y \rangle \quad \text{for all } x, y.$$
In finite-dimensional real Euclidean space, the adjoint is the transpose. In complex Euclidean space, it is the conjugate transpose.
An operator $T$ is self-adjoint if
$$T^{*} = T.$$
Self-adjoint operators are the Hilbert-space analogue of symmetric matrices. Covariance operators, kernel integral operators, Hessians of quadratic losses, and graph Laplacians are central ML examples.
7.3 Positive Operators and Quadratic Forms
A self-adjoint operator $T$ is positive if
$$\langle T x, x \rangle \ge 0 \quad \text{for all } x.$$
In finite dimensions, this is the positive semidefinite condition $x^\top A x \ge 0$.
Positive operators define energy functionals:
$$Q(x) = \langle T x, x \rangle.$$
Examples:
- covariance matrix: $u^\top \Sigma u$ is the variance in direction $u$
- graph Laplacian: smoothness energy over graph nodes
- kernel Gram matrix: squared norm of a finite feature combination
- Hessian: local curvature of a twice-differentiable loss
7.4 Compact Operators
A bounded operator is compact if it maps bounded sets to relatively compact sets. Equivalently, every bounded sequence has a subsequence such that converges.
In finite dimensions, every bounded operator is compact. In infinite dimensions, compactness is special. Compact operators often behave like infinite matrices whose singular values decay to zero.
Integral operators with smooth kernels are typical examples:
$$(T f)(x) = \int k(x, y)\, f(y)\, dy.$$
Kernel methods and Gaussian-process covariance operators often lead to compact positive self-adjoint operators under suitable domain and regularity assumptions.
7.5 Spectral Theorem for Compact Self-Adjoint Operators
If $T$ is compact and self-adjoint on a Hilbert space, then there is an orthonormal set of eigenvectors associated with its nonzero real eigenvalues, and the nonzero spectrum can accumulate only at $0$.
In favorable cases,
$$T x = \sum_{i} \lambda_i \langle x, e_i \rangle\, e_i.$$
This is the infinite-dimensional analogue of diagonalizing a symmetric matrix:
$$A = Q \Lambda Q^\top.$$
PCA is the finite-sample version of this story. Kernel PCA and Gaussian-process covariance analysis use the same idea in feature or function spaces.
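The finite-dimensional version of the spectral theorem can be checked directly: a symmetric matrix is rebuilt exactly from its orthonormal eigenvectors and real eigenvalues. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((5, 5))
A = B @ B.T                          # symmetric positive semidefinite

# Spectral decomposition A = Q Lambda Q^T with orthonormal eigenvectors
eigvals, Q = np.linalg.eigh(A)

A_rebuilt = Q @ np.diag(eigvals) @ Q.T
```

`np.linalg.eigh` exploits symmetry, returning real eigenvalues in ascending order and an orthonormal eigenvector matrix.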
8. Advanced Topics and Bridges
8.1 Weak Convergence
Norm convergence is strong:
$$\|x_n - x\| \to 0.$$
Weak convergence asks only that all bounded linear measurements converge:
$$\langle x_n, z \rangle \to \langle x, z \rangle \quad \text{for every } z \in H.$$
Strong convergence implies weak convergence. Weak convergence does not generally imply strong convergence.
In infinite-dimensional optimization, weak compactness and lower semicontinuity are often enough to prove existence of minimizers even when norm-compactness fails.
8.2 Fourier Analysis as Hilbert-Space Coordinates
Fourier analysis is Hilbert-space coordinate expansion in $L^2$. The functions
$$e_k(t) = \frac{1}{\sqrt{2\pi}}\, e^{i k t}, \qquad k \in \mathbb{Z},$$
form an orthonormal system in $L^2([0, 2\pi])$.
Fourier coefficients are inner products:
$$c_k = \langle f, e_k \rangle = \frac{1}{\sqrt{2\pi}} \int_0^{2\pi} f(t)\, e^{-i k t}\, dt.$$
Parseval says signal energy equals coefficient energy. This is the basis of spectral signal processing, convolution analysis, and random Fourier feature intuition.
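The discrete analogue is easy to verify with the FFT: with NumPy's unnormalized DFT convention, time-domain energy equals frequency-domain energy divided by the signal length. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(64)

X = np.fft.fft(x)

# Discrete Parseval: sum |x[n]|^2 = (1/N) * sum |X[k]|^2
energy_time = float(np.sum(np.abs(x)**2))
energy_freq = float(np.sum(np.abs(X)**2) / len(x))
```

The $1/N$ factor reflects that NumPy's DFT basis vectors are orthogonal but not unit-norm.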
8.3 RKHS and Reproducing Kernels
A reproducing kernel Hilbert space is a Hilbert space $H$ of functions on a set $X$ such that point evaluation is continuous:
$$|f(x)| \le C_x \|f\|_H \quad \text{for all } f \in H.$$
By Riesz representation, for each $x \in X$ there exists $k_x \in H$ such that
$$f(x) = \langle f, k_x \rangle_H.$$
Define
$$K(x, x') = \langle k_{x'}, k_x \rangle_H = k_{x'}(x).$$
This is the reproducing kernel. The next section develops positive definite kernels, the Moore-Aronszajn theorem, SVMs, kernel ridge regression, Gaussian processes, and kernel approximations in detail.
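One hallmark of a reproducing kernel is that every Gram matrix it produces is symmetric positive semidefinite. A minimal sketch with the Gaussian RBF kernel (the `gamma` value is an arbitrary illustrative choice):

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian RBF kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))   # clamp tiny negatives

rng = np.random.default_rng(5)
X = rng.standard_normal((20, 3))
K = rbf_kernel(X)

# Gram matrices of reproducing kernels are symmetric PSD
eigvals = np.linalg.eigvalsh(K)
```

The diagonal is identically one because $K(x, x) = \|k_x\|_H^2 = e^0$ for this kernel.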
8.4 Kernel Gradient Flow and Neural Tangent Kernels
Many overparameterized models can be studied by tracking how predictions change under gradient descent. In certain infinite-width limits, the dynamics of the prediction function are governed by a kernel:
$$\frac{d}{dt} f_t(x) = -\sum_{i=1}^{n} K(x, x_i)\, \frac{\partial L}{\partial f_t(x_i)}.$$
The neural tangent kernel perspective says that, under specific assumptions, very wide neural networks behave like kernel machines during training. This is a modern bridge between deep learning and Hilbert-space geometry, but the details depend on architecture, scaling, initialization, and limiting arguments.
8.5 Continuous vs Discrete Spectra
Compact self-adjoint operators resemble diagonal matrices with eigenvalues tending to zero. General self-adjoint operators can have continuous spectrum. This matters in quantum mechanics, PDEs, and some infinite-dimensional learning problems.
For this course, the main working intuition is:
- finite-dimensional symmetric matrices have orthonormal eigenvectors
- compact self-adjoint operators retain a similar discrete spectral structure
- general self-adjoint operators require spectral measures, which are beyond this section
9. Applications in Machine Learning
9.1 Embedding Similarity and Attention Scores
The transformer attention score between query $q$ and key $k$ is
$$\mathrm{score}(q, k) = \frac{\langle q, k \rangle}{\sqrt{d}}.$$
The inner product measures alignment. Scaling by $\sqrt{d}$ controls variance when coordinates have roughly unit scale. Cosine similarity normalizes away vector length; dot-product attention keeps both angle and magnitude.
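A minimal NumPy sketch of scaled dot-product attention, showing the inner-product scores turned into a convex combination of values (shapes and data are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Softmax of pairwise alignment scores <q_i, k_j> / sqrt(d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # inner products, scaled
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(6)
Q = rng.standard_normal((4, 8))   # 4 queries in R^8
K = rng.standard_normal((5, 8))   # 5 keys
V = rng.standard_normal((5, 8))   # 5 values
out, w = scaled_dot_product_attention(Q, K, V)
```

Each output row is a weighted average of value rows, with weights determined entirely by inner-product alignment.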
9.2 Least Squares, Ridge Regression, and Orthogonal Projection
Least squares is projection onto the feature span. Ridge regression solves
$$\min_{w} \|A w - b\|^2 + \lambda \|w\|^2.$$
The regularization term adds Hilbert norm control. In kernel ridge regression, the same idea is lifted to an RKHS norm:
$$\min_{f \in H_K} \sum_{i=1}^{n} \big( f(x_i) - y_i \big)^2 + \lambda \|f\|_{H_K}^2.$$
The representer theorem then says the minimizer lies in the finite span of the kernel sections $K(\cdot, x_1), \dots, K(\cdot, x_n)$.
9.3 PCA and Spectral Decomposition
Given a centered data matrix $X \in \mathbb{R}^{n \times d}$, the sample covariance is
$$\hat{\Sigma} = \frac{1}{n} X^\top X.$$
It is positive and self-adjoint. PCA finds orthonormal directions that solve
$$\max_{\|u\| = 1} u^\top \hat{\Sigma}\, u.$$
Projecting onto the top $k$ eigenvectors gives the best rank-$k$ approximation in squared reconstruction error. This is Hilbert projection plus spectral theory.
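The projection-plus-spectrum story can be verified numerically: the squared reconstruction error of a rank-$k$ PCA projection equals the energy in the discarded singular values. A minimal sketch on synthetic near-low-rank data:

```python
import numpy as np

rng = np.random.default_rng(8)

# Data in R^5 that mostly lives in a 2-dimensional subspace
Z = rng.standard_normal((200, 2))
W = rng.standard_normal((2, 5))
X = Z @ W + 0.01 * rng.standard_normal((200, 5))
X = X - X.mean(axis=0)                     # center the data

# PCA via SVD of the centered data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
X_proj = X @ Vt[:k].T @ Vt[:k]             # projection onto top-k subspace

# Reconstruction error equals the discarded singular-value energy
err = float(np.linalg.norm(X - X_proj)**2)
tail = float(np.sum(s[k:]**2))
```

Since the data is nearly rank 2, the top-2 projection retains almost all of the variance.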
9.4 Gaussian Processes and RKHS Preview
A Gaussian process is determined by a mean function and covariance kernel. The covariance kernel is positive definite, so it defines a Hilbert geometry. The RKHS associated with the kernel is not the same as the random sample-path space in general, but it captures the deterministic smoothness geometry encoded by the kernel.
9.5 Infinite-Width Neural Networks and NTK Preview
In neural tangent kernel theory, an infinite-width network can converge to dynamics described by a kernel operator. The Hilbert-space lesson is that inner products between parameter gradients induce a geometry on functions:
$$K(x, x') = \big\langle \nabla_\theta f_\theta(x),\; \nabla_\theta f_\theta(x') \big\rangle.$$
This kernel measures how a parameter update that changes the output at $x$ also changes the output at $x'$.
10. Common Mistakes
- Confusing every normed space with a Hilbert space. A Hilbert space needs an inner product whose induced norm is complete. $\ell^p$ for $p \neq 2$ with its usual norm is Banach but not Hilbert.
- Forgetting completeness. Polynomials with the $L^2$ inner product have angles and lengths but are not complete.
- Treating a Hilbert basis like a Hamel basis. Hilbert expansions can be infinite and converge in norm.
- Assuming orthogonal projection exists onto every subspace. The subspace must be closed. Dense proper subspaces do not have nontrivial orthogonal complements.
- Using pointwise convergence when $L^2$ convergence is meant. Fourier series often converge in mean-square even when pointwise behavior is delicate.
- Assuming all ML feature spaces are finite-dimensional. Kernels often represent implicit Hilbert spaces that may be infinite-dimensional.
- Ignoring the chosen inner product when discussing gradients. Gradients depend on geometry. Differentials do not.
- Confusing PSD matrices with positive operators in all contexts. PSD matrices are finite-dimensional examples. Infinite-dimensional positive operators require domain and boundedness care.
- Assuming RKHS sample paths are typical Gaussian-process samples. The RKHS is the deterministic geometry of the kernel; GP sample paths may lie outside it with probability one in many common settings.
- Overstating NTK conclusions. NTK limits are mathematically specific and do not automatically explain all finite-width training behavior.
11. Exercises
- Prove Cauchy-Schwarz using minimization of $\|x - t y\|^2$ over $t$.
- Check whether a weighted bilinear form $\langle x, y \rangle_A = x^\top A y$ is an inner product for several matrices $A$.
- Show that the $\ell^1$ norm does not satisfy the parallelogram law.
- Compute the orthogonal projection of a vector onto a line and onto a column space.
- Derive the normal equations for least squares from residual orthogonality.
- Implement Gram-Schmidt and compare it with QR factorization.
- Verify Bessel inequality and Parseval identity for a finite orthonormal basis.
- Find the Riesz representative of $\varphi(x) = a^\top x$ under a weighted inner product $\langle x, y \rangle_M = x^\top M y$.
- Show that a symmetric PSD matrix defines a positive self-adjoint operator.
- Explain why kernel evaluation is point evaluation represented as an inner product in an RKHS.
The companion exercise notebook contains scaffolded versions with computational checks and full solutions.
12. Why This Matters for AI
Hilbert spaces are one of the quiet foundations of AI mathematics. They explain why dot products are meaningful, why squared error is geometrically special, why projections solve approximation problems, why PCA diagonalizes variance, why Fourier features preserve energy, why kernels let nonlinear learning become linear in a feature space, and why gradients become vectors once an inner product is chosen.
Three ideas are especially important:
1. Similarity is geometry. Attention, retrieval, contrastive learning, and recommendation systems all compare representations. Inner products and normalized inner products are Hilbert-space measurements of alignment.
2. Approximation is projection. Least squares, PCA, denoising, and many function-approximation problems become clearer when the learned object is viewed as a projection onto a subspace or a regularized approximation inside a Hilbert space.
3. Learning dynamics depend on inner products. A gradient is not just a list of partial derivatives. It is a Riesz representative of a differential under a chosen geometry. Changing the geometry changes the optimization path.
13. Conceptual Bridge
The next section, Kernel Methods, starts from the Hilbert idea that inner products measure similarity. A kernel is a function
$$K : X \times X \to \mathbb{R}$$
that computes an inner product $K(x, x') = \langle \varphi(x), \varphi(x') \rangle_H$ in a feature Hilbert space $H$ without explicitly constructing the feature map $\varphi$.
The key transition is:
Hilbert spaces:
inner products, projections, bases, Riesz representation
Kernel methods:
compute Hilbert-space inner products through K(x, x')
learn nonlinear functions with linear Hilbert-space geometry
Once you understand Hilbert spaces, the kernel trick is no longer a trick. It is just inner-product geometry moved into a richer feature space.
References
- R. B. Melrose, Hilbert Spaces, MIT 18.155 lecture notes, 2016. https://math.mit.edu/~rbm/18.155-F16/L10.pdf
- S. New, Hilbert Spaces, University of Waterloo PMATH 453 lecture notes. https://www.math.uwaterloo.ca/~snew/PMATH453/Chap2HilbertSpaces.pdf
- J. B. Conway, A Course in Functional Analysis, Springer, 1990.
- W. Rudin, Functional Analysis, McGraw-Hill, 1991.
- E. Kreyszig, Introductory Functional Analysis with Applications, Wiley, 1978.
- G. Strang, Introduction to Linear Algebra, Wellesley-Cambridge Press, 2016.
- T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, 2009.
- F. Cucker and S. Smale, "On the Mathematical Foundations of Learning," Bulletin of the AMS, 2002.
- N. Aronszajn, "Theory of Reproducing Kernels," Transactions of the American Mathematical Society, 1950.
- A. Jacot, F. Gabriel, and C. Hongler, "Neural Tangent Kernel: Convergence and Generalization in Neural Networks," NeurIPS, 2018. https://arxiv.org/abs/1806.07572
- Berkeley STAT 154, Reproducing Kernel Hilbert Spaces lecture material, 2025. https://stat154.berkeley.edu/spring-2025/lectures/unit5/unit5_rkhs.html
- Stanford CS229T, Kernel Basics lecture notes, 2017. https://web.stanford.edu/class/cs229t/2017/Lectures/kernel-basics.pdf