"A kernel is a promise: compute similarity in the input space, and behave as if learning happened in a richer feature space."
Overview
Kernel methods are a family of learning algorithms built around one central observation: many algorithms only need inner products between examples. If an algorithm can be written using $\langle x_i, x_j \rangle$, then we can often replace that inner product with a kernel value $k(x_i, x_j)$. The replacement makes the algorithm act as though each input has been mapped into a high-dimensional, sometimes infinite-dimensional, feature space.
This section studies the kernel trick, positive semidefinite kernels, Reproducing Kernel Hilbert Spaces, common kernel families, kernelized algorithms, and scalable approximations. The goal is not to treat kernels as a historical detour before deep learning. The goal is to understand kernels as one of the cleanest mathematical bridges between Hilbert space geometry and modern AI systems.
Kernel methods explain support vector machines, kernel ridge regression, kernel PCA, Gaussian processes, random Fourier features, and the neural tangent kernel. They also sharpen everyday ML instincts about similarity, regularization, smoothness, uncertainty, and data-dependent representation. When you ask whether two embeddings are similar, whether a function should vary smoothly, or whether a model should express uncertainty away from data, you are already asking kernel-shaped questions.
Prerequisites
- Vector spaces, span, and linear maps from Vector Spaces and Subspaces.
- Inner products, orthogonality, and projections from Hilbert Spaces.
- Norms, completeness, and operator language from Normed Spaces.
- Positive semidefinite matrices from Positive Definite Matrices.
- Eigenvalues, SVD, and matrix rank from Advanced Linear Algebra and its sections.
- Constrained optimization and Lagrange multipliers from Constrained Optimization.
- Basic supervised learning vocabulary: dataset, labels, regression, classification, loss, and regularization.
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive derivations for Gram matrices, kernels, RKHS intuition, kernel ridge regression, kernel PCA, Gaussian process prediction, random Fourier features, and NTK-style behavior. |
| exercises.ipynb | Eight graded exercises covering kernel validity, Gram matrices, centering, kernel ridge regression, kernel PCA, Gaussian processes, random Fourier features, and neural tangent kernels. |
Learning Objectives
After completing this section, you will be able to:
- Explain the kernel trick as an inner-product replacement, not as magic.
- Construct explicit feature maps for linear and polynomial kernels.
- Test a finite Gram matrix for positive semidefiniteness.
- Distinguish valid kernels from arbitrary similarity functions.
- Explain how every positive semidefinite kernel defines an RKHS.
- Use the reproducing property to interpret function evaluation as an inner product.
- Apply the representer theorem to regularized learning in RKHS.
- Implement kernel ridge regression and interpret the role of the regularization parameter $\lambda$.
- Center a kernel matrix and run kernel PCA.
- Interpret Gaussian process regression as Bayesian learning with a kernel covariance.
- Use Nystrom and random Fourier approximations to reduce kernel cost.
- Connect kernel eigenvalues to generalization, smoothness, and neural tangent kernel intuition.
Table of Contents
- 1. Intuition
- 2. Formal Definitions
- 3. RKHS Foundations
- 4. Common Kernels
- 5. Core Kernel Algorithms
- 6. Computation and Scalability
- 7. Kernel Methods in Modern AI
- 8. Common Mistakes
- 9. Exercises
- 10. Why This Matters for AI
- 11. Conceptual Bridge
- 12. References
1. Intuition
Kernel methods begin with a deceptively simple question. What if the original coordinates are the wrong coordinates for learning?
Suppose a binary classification dataset in $\mathbb{R}^2$ is not linearly separable. No line can separate the classes in the original plane. But after adding a feature such as $x_1^2 + x_2^2$, the same points may become separable by a hyperplane in the transformed feature space.
The direct feature-map story is clear: map $x$ to $\phi(x)$. Run a linear algorithm in the $\phi$-space. Return a nonlinear predictor in the original input space.
The kernel story is more subtle: do not compute $\phi(x)$ explicitly. Compute only the inner product that would have appeared after the feature map. That inner product is the kernel:
$$k(x, z) = \langle \phi(x), \phi(z) \rangle.$$
This identity is the whole game. The feature space can be very large. It can be infinite-dimensional. It can even be a function space. But if the learning algorithm only needs dot products, the explicit coordinates are unnecessary.
1.1 Why Kernels Matter for AI
Kernels matter because they convert assumptions about similarity into mathematically controlled learning algorithms. In neural networks, similarity is often learned implicitly. In kernel methods, similarity is specified by the kernel $k(x, z)$.
A kernel can encode smoothness. A kernel can encode periodicity. A kernel can encode sequence similarity. A kernel can encode graph structure. A kernel can encode invariances. A kernel can encode prior uncertainty.
This makes kernels especially useful when:
- data is limited;
- uncertainty estimates matter;
- interpretability of inductive bias matters;
- training a large neural model is unnecessary or too expensive;
- a strong handcrafted similarity is available;
- we want a clean mathematical model of generalization.
Kernels also matter in deep learning theory. The neural tangent kernel studies how wide neural networks behave during gradient descent. In the infinite-width limit, a neural network can behave like a kernel machine whose kernel is determined by architecture and initialization. This does not mean all useful neural networks are just kernels. It means kernels provide a precise baseline for understanding when representation learning does, or does not, happen.
The AI story has three levels:
| Level | Kernel role | Example |
|---|---|---|
| Algorithmic | Kernel gives a nonlinear model with convex or closed-form training. | SVM, kernel ridge regression, kernel PCA. |
| Probabilistic | Kernel is a covariance function over functions. | Gaussian process regression. |
| Theoretical | Kernel describes training dynamics of very wide networks. | Neural tangent kernel. |
1.2 Linear Learning in Nonlinear Feature Spaces
Let $\mathcal{X}$ be an input space. For tabular data, $\mathcal{X}$ might be $\mathbb{R}^d$. For text, $\mathcal{X}$ might be a set of strings. For graphs, $\mathcal{X}$ might be a set of graphs. For functions, $\mathcal{X}$ might already be infinite-dimensional.
A feature map is a function
$$\phi : \mathcal{X} \to \mathcal{H},$$
where $\mathcal{H}$ is a Hilbert space. Once inputs are mapped into $\mathcal{H}$, a linear predictor has the form
$$f(x) = \langle w, \phi(x) \rangle_{\mathcal{H}} + b.$$
This predictor is linear in feature space. It can be nonlinear in the original input coordinates.
Example: let $x = (x_1, x_2) \in \mathbb{R}^2$. Define
$$\phi(x) = \big( x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2 \big).$$
Then
$$\langle \phi(x), \phi(z) \rangle = x_1^2 z_1^2 + 2\, x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = (x^\top z)^2.$$
So the degree-2 homogeneous polynomial kernel is
$$k(x, z) = (x^\top z)^2.$$
The feature map has dimension 3 for $d = 2$. For degree $p$ in dimension $d$, the explicit feature dimension grows combinatorially. The kernel evaluation remains cheap.
For the inhomogeneous polynomial kernel,
$$k(x, z) = (x^\top z + c)^p, \qquad c > 0,$$
the implicit feature space contains monomials of degree up to $p$. This is a nonlinear feature expansion without manually writing the expanded feature vector.
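The identity between the explicit feature map and the kernel can be checked numerically. A minimal NumPy sketch (the helper name `phi2` is illustrative, not part of the text):

```python
import numpy as np

def phi2(x):
    """Explicit degree-2 homogeneous feature map for x in R^2."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

lhs = phi2(x) @ phi2(z)      # inner product in the 3-dimensional feature space
rhs = (x @ z) ** 2           # kernel evaluated directly in input space
print(np.isclose(lhs, rhs))  # expected: True
```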
Feature-space picture:
input space                         feature space
     x           ---- phi --->          phi(x)
linear separator impossible         linear separator possible
in original coordinates             after nonlinear lifting
The crucial warning: the kernel trick does not make impossible algorithms possible. It makes inner-product algorithms operate in richer spaces without explicit coordinates.
1.3 The Kernel Trick as Inner-Product Substitution
An algorithm is kernelizable when its input dependence appears only through inner products.
For a linear model trained by certain dual methods, the weight vector can often be written as
$$w = \sum_{i=1}^{n} \alpha_i\, \phi(x_i).$$
Then prediction becomes
$$f(x) = \langle w, \phi(x) \rangle.$$
By linearity,
$$f(x) = \sum_{i=1}^{n} \alpha_i\, \langle \phi(x_i), \phi(x) \rangle.$$
Replace the inner product with a kernel:
$$f(x) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x).$$
The model now needs only kernel evaluations between training examples and test examples. It does not need explicit coordinates.
This is why the kernel matrix becomes the central data structure:
$$K \in \mathbb{R}^{n \times n}, \qquad K_{ij} = k(x_i, x_j).$$
The kernel matrix is an $n \times n$ matrix of pairwise similarities. For many algorithms, the entire training set enters only through $K$.
Kernel trick checklist:
- Write the algorithm in terms of inner products.
- Replace $\langle x_i, x_j \rangle$ with $k(x_i, x_j)$.
- Ensure $k$ is a valid positive semidefinite kernel.
- Solve the resulting finite-dimensional problem over the coefficients $\alpha$.
- Predict using kernel evaluations against training points.
The phrase "trick" can be misleading. The kernel trick is not a shortcut around mathematics. It is a theorem-supported representation choice.
1.4 Historical Path
Kernel methods have a long history. They did not begin with modern SVMs. They grew out of approximation theory, potential functions, integral operators, and Hilbert-space methods.
Some milestones:
| Period | Development | Why it matters |
|---|---|---|
| 1909 | Mercer's theorem | Connects positive kernels with eigenfunction expansions. |
| 1950 | Aronszajn's RKHS theory | Provides the Hilbert-space foundation for reproducing kernels. |
| 1960s | Potential function methods | Early pattern-recognition use of kernel-like similarity functions. |
| 1990s | Support vector machines | Convex margin classifiers made kernels central in ML. |
| 2000s | Gaussian processes and kernel machines | Kernels became standard tools for Bayesian nonparametrics and nonlinear learning. |
| 2007 | Random Fourier features | Scalable approximations made large kernel methods practical. |
| 2018 | Neural tangent kernel | Kernel theory became a tool for analyzing wide neural networks. |
The deep-learning era did not erase kernels. It changed where kernels appear. Sometimes they appear as explicit algorithms. Sometimes they appear as covariance functions. Sometimes they appear as theory for training dynamics. Sometimes they appear as similarity functions inside attention, retrieval, and embedding systems.
2. Formal Definitions
This section gives the exact mathematical objects. The definitions are intentionally finite-first. Most practical kernel work begins with a finite dataset and a Gram matrix. The infinite-dimensional RKHS story comes after the finite matrix tests.
2.1 Feature Maps and Feature Spaces
Let $\mathcal{X}$ be an input set. Let $\mathcal{H}$ be a Hilbert space with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. A feature map is a function
$$\phi : \mathcal{X} \to \mathcal{H}.$$
The feature space is the target space $\mathcal{H}$. The coordinates of $\phi(x)$ may be explicit, implicit, finite, or infinite.
Examples:
- Identity feature map: $\phi(x) = x$, with $\mathcal{H} = \mathbb{R}^d$.
- Polynomial feature map: $\phi(x)$ collects monomials of the coordinates of $x$ up to a fixed degree, for example $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ for degree 2 in $\mathbb{R}^2$.
- Fourier feature map for a shift-invariant kernel: $\phi(x)$ collects components $\cos(\omega^\top x)$ and $\sin(\omega^\top x)$ over frequencies $\omega$ from the kernel's spectral measure.
- Function-space feature map: $\phi(x) = k(x, \cdot)$, which maps each input to a function of another input.
Non-examples:
- A mapping into a set with no inner product is not enough for the kernel trick.
- A similarity score with no positive semidefinite property may fail to define any Hilbert-space feature map.
- A learned embedding model is not automatically a kernel unless the induced similarity satisfies the kernel conditions.
The feature-space viewpoint explains why kernels are geometric. Learning becomes linear geometry in $\mathcal{H}$. Classification becomes hyperplane separation. Regression becomes regularized function fitting. PCA becomes variance decomposition in feature space. Gaussian process covariance becomes inner-product structure over functions.
2.2 Kernel Functions and Gram Matrices
A kernel function is a function
$$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}.$$
When a feature map exists, the kernel has the form
$$k(x, z) = \langle \phi(x), \phi(z) \rangle_{\mathcal{H}}.$$
Given data $x_1, \dots, x_n \in \mathcal{X}$, the Gram matrix is
$$K_{ij} = k(x_i, x_j).$$
In compact notation:
$$K = \big[ k(x_i, x_j) \big]_{i,j=1}^{n} \in \mathbb{R}^{n \times n}.$$
If the kernel comes from a feature map, then
$$K = \Phi\,\Phi^{\top}, \qquad \text{where row } i \text{ of } \Phi \text{ is } \phi(x_i)^{\top}.$$
For any coefficient vector $c \in \mathbb{R}^{n}$,
$$c^{\top} K c = \sum_{i=1}^{n} \sum_{j=1}^{n} c_i\, c_j\, k(x_i, x_j).$$
If $k$ comes from an inner product, then
$$c^{\top} K c = \Big\lVert \sum_{i=1}^{n} c_i\, \phi(x_i) \Big\rVert_{\mathcal{H}}^{2} \ \ge\ 0.$$
This equation is the most important finite test: a valid kernel produces positive semidefinite Gram matrices for every finite dataset.
2.3 Positive Semidefinite Kernels
A symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive semidefinite kernel if for every finite set $\{x_1, \dots, x_n\} \subset \mathcal{X}$ and every $c \in \mathbb{R}^{n}$,
$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i\, c_j\, k(x_i, x_j) \ \ge\ 0.$$
Equivalently, the Gram matrix satisfies
$$K \succeq 0.$$
There are two conditions:
- Symmetry: $k(x, z) = k(z, x)$ for all $x, z \in \mathcal{X}$.
- Positive semidefiniteness: $c^{\top} K c \ge 0$ for every finite Gram matrix $K$ and every coefficient vector $c$.
Do not confuse these with pairwise positivity. A kernel can take negative values and still be valid. The linear kernel can be negative. It is still a valid kernel.
Do not confuse positive semidefinite with positive definite. Positive semidefinite allows zero eigenvalues. This matters because duplicate points, redundant features, and finite-dimensional feature maps often produce singular Gram matrices.
Finite check: for a proposed kernel and a finite dataset, compute the eigenvalues of $K$. If any eigenvalue is significantly negative, the kernel failed on that dataset. If all eigenvalues are nonnegative on one dataset, that is evidence, not proof, that the kernel is valid.
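A minimal sketch of this finite check, assuming an RBF kernel on random data:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix with entries exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = rbf_gram(X)

print(np.allclose(K, K.T))           # symmetry
print(np.linalg.eigvalsh(K).min())   # smallest eigenvalue, nonnegative up to roundoff
```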
2.4 Valid Kernels and Counterexamples
Valid kernels:
| Kernel | Formula | Valid because |
|---|---|---|
| Linear | $k(x, z) = x^\top z$ | Direct inner product. |
| Polynomial | $k(x, z) = (x^\top z + c)^p$ with $c \ge 0$ and integer $p \ge 1$ | Explicit finite feature expansion. |
| RBF | $k(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$ with $\gamma > 0$ | Positive Fourier transform by Bochner's theorem. |
| Laplacian | $k(x, z) = \exp(-\gamma \lVert x - z \rVert_1)$ | Product of valid one-dimensional Laplace kernels. |
| Sum | $k_1(x, z) + k_2(x, z)$ | Sum of PSD matrices is PSD. |
| Product | $k_1(x, z)\, k_2(x, z)$ | Schur product theorem. |
Invalid or suspicious similarities:
| Function | Problem |
|---|---|
| $\lVert x - z \rVert^2$ used directly as a similarity | Not generally PSD as a kernel, though related to conditionally negative definite distances. |
| $\tanh(a\, x^\top z + b)$ for general $a, b$ | Not generally PSD. |
| Ad hoc mixtures of scores, distances, or edit penalties | Can produce indefinite Gram matrices. |
| Arbitrary cosine of learned embeddings | May be useful similarity but not automatically a kernel over the original input set. |
Counterexample method:
- Choose a small dataset.
- Form the Gram matrix.
- Compute eigenvalues.
- Find a vector $c$ such that $c^\top K c < 0$.
This method is not just a homework trick. Indefinite similarity matrices appear in practice when people combine scores, distances, edit penalties, or learned similarities without checking PSD structure. Some algorithms can work with indefinite similarities, but the RKHS theory no longer applies directly.
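As an illustration of the counterexample method, the sketch below treats squared Euclidean distance as a "similarity" and exhibits a negative eigenvalue and an explicit witness vector (the dataset is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))

# "Similarity" defined as squared Euclidean distance: symmetric, but not a kernel.
sq = np.sum(X**2, axis=1)
S = sq[:, None] + sq[None, :] - 2.0 * X @ X.T

eigvals, eigvecs = np.linalg.eigh(S)
print(eigvals.min())        # strictly negative: the matrix is indefinite

c = eigvecs[:, 0]           # eigenvector for the most negative eigenvalue
print(c @ S @ c)            # c^T S c < 0, an explicit witness
```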
2.5 Kernel Normalization and Centering
Kernel values can be sensitive to scale. A normalized kernel is often defined as
$$\tilde{k}(x, z) = \frac{k(x, z)}{\sqrt{k(x, x)\, k(z, z)}}.$$
If $k(x, x) > 0$ and $k(z, z) > 0$, then
$$\tilde{k}(x, x) = 1 \quad \text{and} \quad \lvert \tilde{k}(x, z) \rvert \le 1.$$
In feature-space language,
$$\tilde{k}(x, z) = \Big\langle \frac{\phi(x)}{\lVert \phi(x) \rVert},\ \frac{\phi(z)}{\lVert \phi(z) \rVert} \Big\rangle.$$
So normalized kernels are cosine similarities in feature space.
Kernel centering is different. Kernel PCA and related methods require centered feature vectors:
$$\tilde{\phi}(x_i) = \phi(x_i) - \frac{1}{n} \sum_{j=1}^{n} \phi(x_j).$$
The centered Gram matrix is
$$K_c = H K H,$$
where
$$H = I - \frac{1}{n} \mathbf{1}\mathbf{1}^{\top}.$$
Centering removes the feature-space mean. Without centering, kernel PCA can waste its first component on the mean offset rather than variance structure.
Normalization controls diagonal scale. Centering controls feature-space mean. They solve different problems.
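A minimal sketch of both operations, assuming an RBF Gram matrix on random data (helper names are illustrative):

```python
import numpy as np

def normalize_kernel(K):
    """Cosine-style normalization: K[i, j] / sqrt(K[i, i] * K[j, j])."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def center_kernel(K):
    """Feature-space centering: K_c = H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * d2)                                    # RBF Gram matrix

print(np.allclose(np.diag(normalize_kernel(K)), 1.0))    # unit diagonal after normalization
Kc = center_kernel(K)
print(np.allclose(Kc.sum(axis=0), 0.0), np.allclose(Kc.sum(axis=1), 0.0))
```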
3. RKHS Foundations
The finite Gram matrix story explains how kernels are used computationally. The RKHS story explains what functions are being learned.
A Reproducing Kernel Hilbert Space is a Hilbert space of functions where point evaluation is continuous and represented by inner product with a kernel section. That sentence is dense. This section unpacks it carefully.
3.1 Reproducing Kernel Hilbert Spaces
Let $\mathcal{X}$ be an input set. Let $\mathcal{H}$ be a Hilbert space whose elements are functions
$$f : \mathcal{X} \to \mathbb{R}.$$
For each $x \in \mathcal{X}$, define the evaluation functional
$$L_x(f) = f(x).$$
The space $\mathcal{H}$ is a Reproducing Kernel Hilbert Space if every evaluation functional is continuous. By the Riesz representation theorem for Hilbert spaces, continuity implies that for every $x \in \mathcal{X}$ there exists a unique element $k_x \in \mathcal{H}$ such that
$$f(x) = \langle f, k_x \rangle_{\mathcal{H}}$$
for every $f \in \mathcal{H}$.
The function $k(x, z) = \langle k_x, k_z \rangle_{\mathcal{H}}$, which satisfies $k_x = k(x, \cdot)$, is the reproducing kernel.
This definition has three important consequences:
- Each input $x$ corresponds to a function $k(x, \cdot) \in \mathcal{H}$.
- Evaluating $f$ at $x$ is an inner product in $\mathcal{H}$.
- The kernel is not merely a similarity score; it is the object that represents point evaluation.
The canonical feature map of an RKHS is
$$\phi(x) = k(x, \cdot).$$
Then
$$\langle \phi(x), \phi(z) \rangle_{\mathcal{H}} = \langle k(x, \cdot),\ k(z, \cdot) \rangle_{\mathcal{H}} = k(x, z).$$
This means every RKHS kernel is automatically a feature-map kernel.
3.2 The Reproducing Property
The reproducing property is
$$f(x) = \langle f,\ k(x, \cdot) \rangle_{\mathcal{H}}.$$
It says that function evaluation is inner product geometry.
If $f$ is represented as
$$f = \sum_{i=1}^{n} \alpha_i\, k(x_i, \cdot),$$
then
$$f(x) = \Big\langle \sum_{i=1}^{n} \alpha_i\, k(x_i, \cdot),\ k(x, \cdot) \Big\rangle_{\mathcal{H}}.$$
By linearity,
$$f(x) = \sum_{i=1}^{n} \alpha_i\, \big\langle k(x_i, \cdot),\ k(x, \cdot) \big\rangle_{\mathcal{H}}.$$
Using the kernel identity,
$$f(x) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x).$$
This is exactly the prediction form used by kernel machines.
The reproducing property explains why coefficients over training points are enough. The learned function lives in a Hilbert space, but prediction reduces to kernel evaluations.
3.3 Moore-Aronszajn Theorem
The Moore-Aronszajn theorem states: every positive semidefinite kernel $k$ on $\mathcal{X} \times \mathcal{X}$ uniquely determines an RKHS $\mathcal{H}_k$ of functions on $\mathcal{X}$ for which $k$ is the reproducing kernel.
This theorem closes the loop:
positive semidefinite kernel
|
v
unique RKHS of functions
|
v
feature map x -> k(x, .)
|
v
inner product recovers k(x, z)
The theorem is important because it means we do not need to explicitly construct $\mathcal{H}$ or $\phi$. If we prove that $k$ is positive semidefinite, then an RKHS exists. The feature space is guaranteed by the theorem.
This is why kernel validity matters. An invalid similarity may still produce useful numbers. But it does not automatically give a Hilbert space of functions, a reproducing property, or a standard regularization theory.
3.4 Mercer Expansion and Eigenfunctions
Mercer's theorem is one of the classical links between kernels and spectral theory. Under suitable conditions, such as a continuous positive semidefinite kernel on a compact domain with respect to a measure, the kernel has an expansion
$$k(x, z) = \sum_{m=1}^{\infty} \lambda_m\, \psi_m(x)\, \psi_m(z),$$
where $\lambda_m \ge 0$ and the $\psi_m$ are orthonormal eigenfunctions of the associated integral operator.
The integral operator is
$$(T_k f)(x) = \int k(x, z)\, f(z)\, d\mu(z).$$
The eigenfunction equation is
$$T_k \psi_m = \lambda_m \psi_m.$$
The expansion resembles matrix eigendecomposition. The kernel acts like an infinite matrix. The eigenfunctions play the role of eigenvectors. The eigenvalues tell which directions in function space are favored.
Practical interpretation:
- large eigenvalues correspond to functions the kernel represents easily;
- small eigenvalues correspond to functions strongly penalized by the RKHS norm;
- kernel ridge regression shrinks more strongly along low-eigenvalue directions;
- NTK eigenvalues can predict which target functions are learned quickly by wide networks.
Mercer is not required every time we use kernels. The finite Gram matrix definition is more general in many ML settings. But Mercer gives spectral intuition for smoothness and generalization.
3.5 RKHS Norm as Smoothness Control
The RKHS norm measures function complexity relative to the kernel. Different kernels create different notions of complexity.
For a regularized learning problem,
$$\min_{f \in \mathcal{H}_k}\ \sum_{i=1}^{n} \ell\big(y_i, f(x_i)\big) + \lambda\, \lVert f \rVert_{\mathcal{H}_k}^{2},$$
the first term fits the data. The second term penalizes functions that are complex in the RKHS.
The same function can be simple under one kernel and complex under another. For example:
- an RBF kernel favors very smooth functions;
- a periodic kernel favors periodic functions;
- a linear kernel favors linear functions;
- a string kernel favors functions expressible through substring features;
- a graph kernel favors functions aligned with graph similarity.
This is why choosing a kernel is choosing an inductive bias.
The RKHS norm is not merely a generic "size" penalty. It encodes which functions are cheap and which are expensive. In a Gaussian process, this same kernel encodes covariance. In kernel ridge regression, it encodes regularization. In SVMs, it encodes margin geometry.
4. Common Kernels
Kernel choice determines the geometry of learning. This section catalogs the most common kernels and the assumptions they encode.
4.1 Linear and Polynomial Kernels
The linear kernel is
$$k(x, z) = x^{\top} z.$$
It corresponds to the identity feature map. Its Gram matrix is
$$K = X X^{\top}$$
when the rows of $X$ are the examples.
Use the linear kernel when:
- the input features are already expressive;
- dimension is high and nonlinear expansion is unnecessary;
- interpretability and speed matter;
- the model should behave like a regularized linear method.
The polynomial kernel is
$$k(x, z) = (x^{\top} z + c)^{p},$$
where $c \ge 0$ and $p$ is a positive integer.
It represents interactions among input coordinates. For $p = 2$, it includes pairwise products. For $p = 3$, it includes cubic interactions. Higher degree increases expressivity but can overfit quickly.
Polynomial kernels are useful for:
- controlled feature interactions;
- small or medium-dimensional tabular data;
- settings where multiplicative feature relationships matter;
- teaching the kernel trick because the feature map can be written explicitly.
AI warning: large polynomial degree can behave like an unstable feature explosion. Regularization and scaling are not optional.
4.2 RBF Kernel
The radial basis function kernel, also called the Gaussian kernel, is
$$k(x, z) = \exp\!\Big( -\frac{\lVert x - z \rVert^{2}}{2\sigma^{2}} \Big),$$
where $\sigma > 0$ is the bandwidth. Equivalently,
$$k(x, z) = \exp\!\big( -\gamma\, \lVert x - z \rVert^{2} \big)$$
with $\gamma = \frac{1}{2\sigma^{2}}$.
The RBF kernel is shift-invariant:
$$k(x, z) = \kappa(x - z).$$
It is also radial: the value depends only on distance.
Bandwidth interpretation:
- small $\sigma$ means very local similarity;
- large $\sigma$ means broad similarity;
- too small a $\sigma$ can make $K$ close to the identity matrix;
- too large a $\sigma$ can make $K$ close to a constant matrix.
Both extremes are bad. If $K \approx I$, the model can memorize but generalize poorly. If $K \approx \mathbf{1}\mathbf{1}^{\top}$, the model has too little resolution.
The RBF kernel is universal on many compact domains. Informally, its RKHS is rich enough to approximate many continuous functions. But universality does not remove the need for regularization. Richness without control is overfitting.
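A short sketch of the two bandwidth extremes on an evenly spaced 1-D grid (the grid and bandwidth values are arbitrary choices):

```python
import numpy as np

x = np.linspace(0, 10, 40)
d2 = (x[:, None] - x[None, :]) ** 2

K_narrow = np.exp(-d2 / (2 * 0.01**2))     # sigma = 0.01: K is nearly the identity
K_wide   = np.exp(-d2 / (2 * 100.0**2))    # sigma = 100:  K is nearly all ones

print(np.abs(K_narrow - np.eye(40)).max())  # close to 0
print(K_wide.min())                         # close to 1
```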
4.3 Laplacian and Matern Kernels
The Laplacian kernel is
$$k(x, z) = \exp\!\big( -\gamma\, \lVert x - z \rVert_{1} \big).$$
Compared with RBF, it produces rougher functions. It can be useful when the target function has sharper changes.
The Matern family is a common Gaussian-process kernel family. For distance $r = \lVert x - z \rVert$, it is
$$k_{\nu}(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \Big( \frac{\sqrt{2\nu}\, r}{\ell} \Big)^{\nu} K_{\nu}\!\Big( \frac{\sqrt{2\nu}\, r}{\ell} \Big),$$
where $\nu > 0$ controls smoothness, $\ell > 0$ is the lengthscale, and $K_{\nu}$ is the modified Bessel function of the second kind.
Common special cases avoid Bessel functions:
$$k_{1/2}(r) = \exp\!\Big( -\frac{r}{\ell} \Big), \qquad
k_{3/2}(r) = \Big( 1 + \frac{\sqrt{3}\, r}{\ell} \Big) \exp\!\Big( -\frac{\sqrt{3}\, r}{\ell} \Big), \qquad
k_{5/2}(r) = \Big( 1 + \frac{\sqrt{5}\, r}{\ell} + \frac{5 r^{2}}{3 \ell^{2}} \Big) \exp\!\Big( -\frac{\sqrt{5}\, r}{\ell} \Big).$$
Matern smoothness interpretation:
| Kernel | Smoothness assumption |
|---|---|
| Matern $\nu = 1/2$ | Rough, continuous but not very smooth. |
| Matern $\nu = 3/2$ | Once-differentiable style behavior. |
| Matern $\nu = 5/2$ | Smoother functions. |
| Matern $\nu \to \infty$ | Approaches RBF smoothness. |
In scientific ML and Bayesian optimization, Matern kernels are often more realistic than RBF kernels. Many physical functions are smooth but not infinitely smooth.
4.4 Periodic, Rational Quadratic, String, and Graph Kernels
The periodic kernel is
$$k(x, z) = \exp\!\Big( -\frac{2 \sin^{2}\!\big( \pi \lvert x - z \rvert / p \big)}{\ell^{2}} \Big).$$
It encodes periodic structure with period $p$. It is useful for seasonal signals and cyclic processes.
The rational quadratic kernel is
$$k(x, z) = \Big( 1 + \frac{\lVert x - z \rVert^{2}}{2 \alpha \ell^{2}} \Big)^{-\alpha}.$$
It can be interpreted as a scale mixture of RBF kernels. It is useful when multiple lengthscales are present.
String kernels compare sequences by shared subsequences or substrings. They were important in bioinformatics and text classification. The general form is
$$k(s, t) = \sum_{u} w_u\, \phi_u(s)\, \phi_u(t),$$
where $\phi_u(s)$ counts or weights occurrences of subsequence $u$ in string $s$.
Graph kernels compare graphs through walks, paths, subtrees, graphlets, or spectral features. They are related to graph representation learning but are not the same as graph neural networks. For full graph neural network treatment, see Graph Neural Networks.
These structured kernels show that kernels are not limited to vectors. A kernel needs a valid PSD similarity over an input set. The input set can be strings, trees, graphs, distributions, molecules, or programs.
4.5 Kernel Composition Rules
Kernel design often uses closure rules. If $k_1$ and $k_2$ are valid kernels, then the following are valid under standard conditions:
| Construction | Formula |
|---|---|
| Nonnegative scaling | $a\, k_1(x, z)$ for $a \ge 0$ |
| Sum | $k_1(x, z) + k_2(x, z)$ |
| Product | $k_1(x, z)\, k_2(x, z)$ |
| Pointwise limit | $\lim_{m \to \infty} k_m(x, z)$ when the limit exists |
| Pullback by a map | $k_1\big(g(x), g(z)\big)$ for a map $g : \mathcal{X} \to \mathcal{X}'$ |
| Tensor-style product | $k_1(x_1, z_1)\, k_2(x_2, z_2)$ on pairs $(x_1, x_2)$ |
| Direct-sum kernel | $k_1(x_1, z_1) + k_2(x_2, z_2)$ on pairs $(x_1, x_2)$ |
Why sums work: for any finite dataset,
$$c^{\top} (K_1 + K_2)\, c = c^{\top} K_1 c + c^{\top} K_2 c \ \ge\ 0.$$
Why products work: the Gram matrix of $k_1 k_2$ is the Hadamard product
$$K_1 \circ K_2.$$
The Schur product theorem says the Hadamard product of PSD matrices is PSD.
Composition examples:
$k_{\text{RBF}}(x, z) + k_{\text{per}}(x, z)$ models smooth trend plus periodic structure.
$k_{\text{RBF}}(x, z) \cdot (x^{\top} z)$ models smooth local similarity modulated by linear alignment.
Kernel composition is inductive-bias engineering. It is the kernel-method analogue of architecture design.
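A small sketch verifying the sum and product rules numerically, assuming linear and RBF base kernels on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 3))

K_lin = X @ X.T                                           # linear kernel Gram matrix
d2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K_rbf = np.exp(-0.5 * d2)                                 # RBF kernel Gram matrix

for K in (K_lin + K_rbf, K_lin * K_rbf):                  # sum and Hadamard product
    print(np.linalg.eigvalsh(K).min() >= -1e-8)           # PSD up to roundoff
```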
5. Core Kernel Algorithms
This section shows how kernels enter algorithms. The common pattern is finite coefficient representation. The learned object may live in a function space, but training solves for a finite vector.
5.1 Kernel Perceptron
The ordinary perceptron learns a linear classifier
$$f(x) = \operatorname{sign}\!\big( w^{\top} x + b \big).$$
With labels $y_i \in \{-1, +1\}$, a mistake update is
$$w \leftarrow w + y_i\, x_i, \qquad b \leftarrow b + y_i.$$
After many updates, the weight vector is a linear combination of training examples:
$$w = \sum_{i=1}^{n} \alpha_i\, y_i\, x_i.$$
In feature space,
$$w = \sum_{i=1}^{n} \alpha_i\, y_i\, \phi(x_i).$$
Prediction becomes
$$f(x) = \operatorname{sign}\!\Big( \sum_{i=1}^{n} \alpha_i\, y_i\, k(x_i, x) + b \Big).$$
The kernel perceptron update increments $\alpha_i$ when example $i$ is misclassified.
Algorithm:
- Initialize $\alpha = 0$ and $b = 0$.
- For each example $(x_i, y_i)$, compute the score $s_i = \sum_{j=1}^{n} \alpha_j\, y_j\, k(x_j, x_i) + b$.
- If $y_i\, s_i \le 0$, set $\alpha_i \leftarrow \alpha_i + 1$ and $b \leftarrow b + y_i$.
- Repeat for multiple passes.
The kernel perceptron is not usually the best modern kernel classifier. But it is the cleanest algorithmic example of kernelization.
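A minimal kernel perceptron sketch on a toy two-ring dataset (the dataset, RBF bandwidth, and pass count are arbitrary choices):

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
n = 100
theta = rng.uniform(0, 2 * np.pi, n)
radius = np.where(np.arange(n) < n // 2, 1.0, 3.0)         # two rings: not linearly separable
X = np.c_[radius * np.cos(theta), radius * np.sin(theta)] + 0.1 * rng.normal(size=(n, 2))
y = np.where(np.arange(n) < n // 2, 1.0, -1.0)

K = rbf(X, X)
alpha, b = np.zeros(n), 0.0
for _ in range(10):                         # multiple passes over the data
    for i in range(n):
        score = np.sum(alpha * y * K[:, i]) + b
        if y[i] * score <= 0:               # mistake: increment the dual coefficient
            alpha[i] += 1.0
            b += y[i]

train_scores = K @ (alpha * y) + b
print((np.sign(train_scores) == y).mean())  # training accuracy on the toy problem
```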
5.2 Support Vector Machines
Support vector machines use kernels to build maximum-margin classifiers. For a feature map $\phi$, the hard-margin primal problem is
$$\min_{w, b}\ \tfrac{1}{2}\, \lVert w \rVert^{2}$$
subject to
$$y_i\, \big( \langle w, \phi(x_i) \rangle + b \big) \ \ge\ 1$$
for all $i$.
The dual problem is
$$\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i\, \alpha_j\, y_i\, y_j\, k(x_i, x_j)$$
subject to
$$\alpha_i \ge 0 \ \text{ for all } i, \qquad \sum_{i=1}^{n} \alpha_i\, y_i = 0.$$
The soft-margin SVM adds slack variables and a penalty parameter $C$. In the dual, this gives box constraints:
$$0 \ \le\ \alpha_i \ \le\ C.$$
Prediction is
$$f(x) = \operatorname{sign}\!\Big( \sum_{i=1}^{n} \alpha_i\, y_i\, k(x_i, x) + b \Big).$$
Only points with $\alpha_i > 0$ contribute. These are support vectors.
SVM intuition:
- the margin is measured in feature-space norm;
- the kernel determines the feature-space geometry;
- support vectors are the examples that define the boundary;
- $C$ controls the tradeoff between margin and training errors;
- kernel bandwidth controls the flexibility of the boundary.
Important scoping note: this section does not reteach constrained optimization or KKT conditions in full. For the full optimization machinery, see Constrained Optimization.
5.3 Kernel Ridge Regression and Representer Theorem
Kernel ridge regression solves
$$\min_{f \in \mathcal{H}_k}\ \sum_{i=1}^{n} \big( y_i - f(x_i) \big)^{2} + \lambda\, \lVert f \rVert_{\mathcal{H}_k}^{2}.$$
The representer theorem says the minimizer has the finite form
$$f^{*}(x) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x).$$
This is remarkable. The optimization is over an infinite-dimensional function space. The solution lies in the span of kernel sections centered at training inputs.
Let $y \in \mathbb{R}^{n}$ collect the labels. Let $K$ be the Gram matrix. Predictions on training data are
$$\hat{y} = K \alpha.$$
The objective becomes
$$\lVert y - K\alpha \rVert^{2} + \lambda\, \alpha^{\top} K \alpha.$$
The closed-form coefficient vector is
$$\alpha = (K + \lambda I)^{-1}\, y.$$
Some conventions use $n\lambda$ instead of $\lambda$. Always check the objective scaling.
Prediction for a test point $x_*$ is
$$f(x_*) = k_*^{\top} \alpha,$$
where
$$(k_*)_i = k(x_i, x_*).$$
Regularization interpretation:
- large $\lambda$ shrinks coefficients and smooths the function;
- small $\lambda$ fits training data more closely;
- $\lambda$ also improves the conditioning of $K + \lambda I$;
- when $K$ is nearly singular, the ridge term is numerically essential.
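A minimal kernel ridge regression sketch following the closed form above (the 1-D target function, bandwidth, and $\lambda$ are arbitrary choices):

```python
import numpy as np

def rbf_1d(a, b, gamma=10.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 40)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=40)

lam = 1e-2
K = rbf_1d(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(40), y_train)    # solve (K + lambda I) alpha = y

x_test = np.linspace(0, 1, 200)
k_star = rbf_1d(x_train, x_test)                          # (n_train, n_test) cross-kernel
y_pred = k_star.T @ alpha                                 # f(x_*) = k_*^T alpha

mse = np.mean((y_pred - np.sin(2 * np.pi * x_test)) ** 2)
print(mse)   # small for a reasonable (gamma, lambda); grows if lambda is too large
```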
5.4 Kernel PCA
PCA finds directions of high variance. Kernel PCA finds high-variance directions in feature space.
Let $\phi(x_1), \dots, \phi(x_n)$ be centered in $\mathcal{H}$. The feature-space covariance operator is
$$C = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \otimes \phi(x_i).$$
Instead of diagonalizing $C$ directly, kernel PCA diagonalizes the centered Gram matrix
$$K_c = H K H, \qquad H = I - \tfrac{1}{n} \mathbf{1}\mathbf{1}^{\top}.$$
If
$$K_c\, a_m = \mu_m\, a_m \quad \text{with } \mu_m > 0,$$
then $a_m$, suitably normalized, gives the coefficients for the $m$-th feature-space principal direction.
The embedding coordinate for a point is computed through kernel evaluations against training points.
Kernel PCA is useful for:
- nonlinear dimensionality reduction;
- visualizing curved manifolds;
- denoising in feature space;
- teaching how kernel methods extend linear algorithms.
Important scoping note: this section uses PCA only as a kernelized algorithm. For full PCA theory, see Principal Component Analysis.
Kernel PCA common failure: forgetting to center the kernel matrix. PCA assumes centered data. In kernel PCA, centering must happen in feature space, which is why $K_c = H K H$ appears.
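A minimal kernel PCA sketch including the centering step, assuming a concentric-circles toy dataset as in Exercise 5:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
theta = rng.uniform(0, 2 * np.pi, n)
radius = np.where(np.arange(n) < n // 2, 1.0, 3.0)        # concentric circles
X = np.c_[radius * np.cos(theta), radius * np.sin(theta)] + 0.05 * rng.normal(size=(n, 2))

d2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-0.5 * d2)                                     # RBF Gram matrix

H = np.eye(n) - np.ones((n, n)) / n
Kc = H @ K @ H                                            # centered Gram matrix

mu, A = np.linalg.eigh(Kc)                                # ascending eigenvalues
mu, A = mu[::-1], A[:, ::-1]                              # largest first
A = A[:, :2] / np.sqrt(np.maximum(mu[:2], 1e-12))         # unit-norm feature-space directions
Z = Kc @ A                                                # projections onto the first two components
print(Z.shape)                                            # (200, 2): nonlinear embedding coordinates
```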
5.5 Gaussian Processes
A Gaussian process is a distribution over functions. It is written
$$f \sim \mathcal{GP}\big( m(x),\ k(x, x') \big),$$
where $m(x)$ is the mean function and $k(x, x')$ is the covariance kernel.
For any finite set of inputs, the function values are jointly Gaussian:
$$\big( f(x_1), \dots, f(x_n) \big) \sim \mathcal{N}(\mathbf{m}, K).$$
Here
$$\mathbf{m}_i = m(x_i)$$
and
$$K_{ij} = k(x_i, x_j).$$
For noisy observations
$$y_i = f(x_i) + \varepsilon_i$$
with
$$\varepsilon_i \sim \mathcal{N}(0, \sigma_n^{2}),$$
the predictive mean at a test input $x_*$ is (taking $m \equiv 0$ for simplicity)
$$\mu(x_*) = k_*^{\top} \big( K + \sigma_n^{2} I \big)^{-1} y.$$
The predictive variance is
$$\sigma^{2}(x_*) = k(x_*, x_*) - k_*^{\top} \big( K + \sigma_n^{2} I \big)^{-1} k_*.$$
Gaussian process regression has the same algebraic core as kernel ridge regression. The GP adds probabilistic interpretation and uncertainty.
Important scoping note: this section gives the kernel-method view of Gaussian processes. For full Bayesian inference background, see Bayesian Inference.
GP kernel interpretation:
- the kernel is a covariance function;
- nearby points under the kernel have correlated function values;
- the lengthscale controls how quickly correlation decays;
- the posterior mean interpolates or smooths observations;
- the posterior variance grows away from observed data.
This uncertainty is one reason Gaussian processes remain useful in Bayesian optimization, active learning, calibration, and scientific modeling.
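A minimal GP regression sketch following the predictive equations above, using a Cholesky solve with a small jitter term (the toy function, lengthscale, and noise level are arbitrary choices):

```python
import numpy as np

def rbf_1d(a, b, lengthscale=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=15)
sigma_n = 0.1

K = rbf_1d(x_train, x_train) + (sigma_n**2 + 1e-8) * np.eye(15)   # noise variance plus jitter
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))          # (K + sigma_n^2 I)^{-1} y

x_test = np.linspace(-0.2, 1.2, 100)
k_star = rbf_1d(x_train, x_test)                                   # (n_train, n_test)
mean = k_star.T @ alpha                                            # predictive mean
v = np.linalg.solve(L, k_star)
var = 1.0 - np.sum(v**2, axis=0)                                   # k(x*, x*) = 1 for this RBF

print(mean[:3], var[:3])   # variance is largest left of the data, outside [0, 1]
```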
6. Computation and Scalability
Kernel methods are elegant, but the kernel matrix is expensive. This section treats computation as part of the mathematics, not an implementation afterthought.
6.1 Kernel Matrix Cost
For $n$ training examples, the Gram matrix has $n^{2}$ entries. Memory cost is
$$O(n^{2}).$$
Naive construction cost is
$$O(n^{2} d)$$
for vector inputs in $\mathbb{R}^{d}$ when each kernel evaluation costs $O(d)$.
Solving exact kernel ridge regression by dense linear algebra costs
$$O(n^{3})$$
in general.
Prediction for one test point costs
$$O(n)$$
kernel evaluations unless the model is sparse or approximated.
This is the core scalability problem. The feature space may be infinite-dimensional, but the training set creates an $n \times n$ object.
Practical thresholds:
- $n$ in the low thousands is usually easy;
- $n$ around $10^{4}$ is possible with care;
- $n$ around $10^{5}$ is expensive for exact kernels;
- $n$ of $10^{6}$ or more usually requires approximation, structure, or distributed methods.
The bottleneck is often memory before arithmetic. A float64 Gram matrix for $n = 10^{5}$ would need roughly 80 GB just for its entries.
6.2 Conditioning and Regularization
Kernel matrices can be ill-conditioned. This happens when:
- points are nearly duplicated;
- the bandwidth is too large;
- the bandwidth is too small;
- features are poorly scaled;
- the kernel has rapidly decaying eigenvalues.
Condition number is
$$\kappa(K) = \frac{\lambda_{\max}(K)}{\lambda_{\min}(K)}$$
when $K$ is positive definite.
For kernel ridge regression, the system is
$$(K + \lambda I)\, \alpha = y.$$
The ridge shift changes eigenvalues:
$$\lambda_i(K + \lambda I) = \lambda_i(K) + \lambda.$$
So regularization improves conditioning. This is both statistical and numerical.
Numerical practices:
- standardize input features before distance-based kernels;
- tune bandwidth and regularization jointly;
- add jitter when doing Gaussian process Cholesky factorization;
- prefer Cholesky solves over explicit matrix inverse;
- monitor the eigenvalues of $K$ or $K + \lambda I$.
In GP code, a small jitter term is often written as
$$K + \sigma_n^{2} I + \epsilon I.$$
The $\epsilon I$ term is not a probabilistic modeling claim. It is numerical stabilization.
6.3 Nystrom Approximation
The Nystrom approximation uses a subset of $m \ll n$ landmark points $z_1, \dots, z_m$. Partition the kernel matrix conceptually into
$$K_{mm} \in \mathbb{R}^{m \times m}, \qquad (K_{mm})_{ij} = k(z_i, z_j),$$
and
$$K_{nm} \in \mathbb{R}^{n \times m}, \qquad (K_{nm})_{ij} = k(x_i, z_j).$$
The approximate Gram matrix is
$$\widehat{K} = K_{nm}\, K_{mm}^{-1}\, K_{nm}^{\top}.$$
If $K_{mm}$ is regularized,
$$\widehat{K} = K_{nm}\, (K_{mm} + \epsilon I)^{-1}\, K_{nm}^{\top}.$$
The approximation has rank at most $m$. It reduces storage from $O(n^{2})$ to roughly $O(nm)$.
Landmark choices:
- uniform random sampling;
- k-means centers;
- leverage-score sampling;
- inducing points optimized by a model;
- domain-specific prototypes.
Nystrom intuition: represent all points by their similarity to landmarks. Then reconstruct the full similarity structure through the landmark-landmark matrix.
This is a low-rank approximation to kernel geometry. It is related to matrix approximation, inducing-point Gaussian processes, and scalable spectral methods.
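A minimal Nystrom sketch with uniformly sampled landmarks (the sizes $n$ and $m$ are arbitrary choices):

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
n, m = 1000, 50
X = rng.normal(size=(n, 5))
landmarks = X[rng.choice(n, size=m, replace=False)]        # uniform random landmarks

K_nm = rbf(X, landmarks)                                   # (n, m) block
K_mm = rbf(landmarks, landmarks)                           # (m, m) block
K_hat = K_nm @ np.linalg.solve(K_mm + 1e-6 * np.eye(m), K_nm.T)

K = rbf(X, X)                                              # exact Gram matrix, for comparison only
print(np.linalg.norm(K - K_hat) / np.linalg.norm(K))       # relative approximation error
```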
6.4 Random Fourier Features
Random Fourier features approximate shift-invariant kernels. Bochner's theorem says a continuous shift-invariant kernel
$$k(x, z) = \kappa(x - z)$$
is positive semidefinite if and only if $\kappa$ is the Fourier transform of a nonnegative measure.
For the RBF kernel $k(x, z) = \exp(-\gamma \lVert x - z \rVert^{2})$, sample
$$\omega_r \sim \mathcal{N}(0,\ 2\gamma I)$$
and
$$b_r \sim \mathrm{Uniform}[0, 2\pi].$$
Define random features
$$z(x) = \sqrt{\tfrac{2}{R}}\, \big( \cos(\omega_1^{\top} x + b_1), \dots, \cos(\omega_R^{\top} x + b_R) \big).$$
Then
$$z(x)^{\top} z(x') \ \approx\ k(x, x').$$
Now a kernel method can be approximated by a linear method in random features. Training cost shifts from Gram-matrix algorithms to linear-model algorithms.
Benefits:
- $O(nR)$ feature storage instead of $O(n^{2})$ kernel storage;
- $O(R)$-dimensional prediction instead of $O(n)$ kernel evaluations;
- compatibility with SGD;
- easy integration into neural networks.
Tradeoffs:
- approximation variance decreases as $R$ grows;
- poor random features need many dimensions;
- input scaling still matters;
- structured random features may be faster but more complex.
Random Fourier features are a key example of turning an implicit kernel feature map into an explicit approximate feature map.
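A minimal random Fourier feature sketch for the RBF kernel $\exp(-\gamma \lVert x - z \rVert^{2})$ (the feature count $R$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, R, gamma = 500, 4, 2000, 0.5
X = rng.normal(size=(n, d))

# Spectral sampling for exp(-gamma * ||x - z||^2): omega ~ N(0, 2 * gamma * I).
omega = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, R))
b = rng.uniform(0, 2 * np.pi, size=R)
Z = np.sqrt(2.0 / R) * np.cos(X @ omega + b)               # (n, R) random feature matrix

K_exact = np.exp(-gamma * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
K_approx = Z @ Z.T
print(np.abs(K_exact - K_approx).max())                    # error shrinks as R grows
```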
6.5 When Kernels Are Practical
Kernels are practical when the data scale and inductive bias fit the method.
Good kernel settings:
- small to medium datasets;
- expensive labels;
- tabular or scientific data;
- uncertainty matters;
- smoothness or periodicity assumptions are strong;
- a domain-specific similarity is available;
- convex training is preferred;
- interpretability of inductive bias matters.
Poor kernel settings:
- massive datasets without approximation;
- high-throughput online prediction with many support vectors;
- raw image/text tasks where representation learning dominates;
- settings where the right similarity must be learned from huge data;
- datasets with severe feature scaling problems and no preprocessing.
The practical question is not "kernels or deep learning?" The practical question is "where should similarity be specified, where should it be learned, and what scale must the algorithm handle?"
Modern hybrids answer this in mixed ways:
- deep kernel learning uses neural features inside a kernel;
- Gaussian processes can sit on top of learned embeddings;
- random features can act as a fixed feature layer;
- NTK uses kernels to analyze neural training;
- retrieval systems use learned embeddings but often kernel-like similarity.
7. Kernel Methods in Modern AI
Kernels remain relevant because similarity remains central. Deep learning changed how similarity is learned, but it did not remove the need to understand similarity geometry.
7.1 Similarity Search and Embeddings
Embedding systems map objects into vectors. Semantic search, recommendation, contrastive learning, and retrieval-augmented generation often use similarity scores such as dot product or cosine similarity.
A normalized dot product is a linear kernel on normalized embeddings:
$$\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert\, \lVert v \rVert} = \Big( \frac{u}{\lVert u \rVert} \Big)^{\!\top} \Big( \frac{v}{\lVert v \rVert} \Big).$$
This does not mean every retrieval system is a kernel machine. The embeddings are learned by a neural model, and approximate nearest-neighbor search is an engineering system. But the similarity score is still a geometric object.
Kernel thinking asks:
- what invariances should similarity respect?
- what scale makes two items close?
- is the similarity symmetric?
- is the implied Gram matrix PSD?
- does normalization help or hurt?
- does the similarity align with downstream loss?
These questions are useful even when the final system is not a classical kernel method.
7.2 Kernels and Attention
Transformer attention uses scores
$$s_{ij} = \frac{q_i^{\top} k_j}{\sqrt{d_k}}.$$
Each score is an inner product between a query vector and a key vector. Softmax then converts scores into weights.
This is not an RKHS kernel method in the classical sense because:
- queries and keys are learned and input-dependent;
- the attention matrix is usually not symmetric;
- softmax normalization is row-wise;
- attention produces weighted values, not only scalar function prediction.
Still, kernel ideas are nearby. Linear attention methods often replace softmax attention with feature maps $\phi$ satisfying
$$\mathrm{sim}(q_i, k_j) = \phi(q_i)^{\top} \phi(k_j),$$
sometimes chosen so that this product approximates the softmax numerator $\exp\!\big( q_i^{\top} k_j / \sqrt{d_k} \big)$.
This is a kernel approximation idea. It converts attention into associative linear algebra:
$$\sum_{j} \big( \phi(q_i)^{\top} \phi(k_j) \big)\, v_j^{\top} = \phi(q_i)^{\top} \Big( \sum_{j} \phi(k_j)\, v_j^{\top} \Big).$$
The kernel lesson: attention is not "just a kernel," but kernel feature maps help explain and approximate some attention mechanisms.
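A toy sketch of that reordering, assuming the positive feature map $\phi(u) = \mathrm{elu}(u) + 1$ used in some linear-attention work (any positive feature map would illustrate the same associativity):

```python
import numpy as np

def phi(U):
    """A simple positive feature map, elu(u) + 1, applied elementwise."""
    return np.where(U > 0, U + 1.0, np.exp(U))

rng = np.random.default_rng(0)
T, d = 6, 8                                   # sequence length, head dimension
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

S = phi(Q) @ phi(K).T                         # kernel-style similarity scores, (T, T)
out_quadratic = (S / S.sum(axis=1, keepdims=True)) @ V       # O(T^2) attention form

KV = phi(K).T @ V                             # (d, d) summary, computed once
norm = phi(K).sum(axis=0)                     # (d,) normalizer summary
out_linear = (phi(Q) @ KV) / (phi(Q) @ norm)[:, None]        # associative, O(T d^2) form

print(np.allclose(out_quadratic, out_linear)) # True: same output, different order of operations
```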
7.3 Neural Tangent Kernel
Consider a neural network $f(x; \theta)$ with parameters $\theta$. The neural tangent kernel is
$$k_{\mathrm{NTK}}(x, x') = \nabla_{\theta} f(x; \theta)^{\top}\, \nabla_{\theta} f(x'; \theta).$$
This is a kernel over inputs induced by parameter gradients. It measures how changing parameters to affect $f(x; \theta)$ also affects $f(x'; \theta)$.
In the infinite-width limit for certain architectures and initializations, the NTK remains nearly constant during training. Gradient descent then behaves like kernel regression with kernel $k_{\mathrm{NTK}}$.
NTK interpretation:
- if $k_{\mathrm{NTK}}(x, x')$ is large, learning on $x$ strongly changes the prediction at $x'$;
- eigenvalues of the NTK affect learning speed along target-function directions;
- wide-network training can be analyzed through kernel dynamics;
- feature learning is limited in the lazy-training regime.
Important warning: finite neural networks can learn representations in ways the fixed NTK does not capture. The NTK is a baseline and an analysis tool, not a complete theory of all deep learning.
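A minimal empirical NTK sketch for a tiny two-layer network with hand-computed gradients, in the spirit of Exercise 8 (the architecture and width are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, n = 3, 64, 20
W = rng.normal(size=(width, d))
v = rng.normal(size=width)
X = rng.normal(size=(n, d))

def param_grad(x):
    """Gradient of f(x) = v^T tanh(W x) / sqrt(width) with respect to (W, v), flattened."""
    h = np.tanh(W @ x)
    g_v = h / np.sqrt(width)
    g_W = np.outer(v * (1.0 - h**2), x) / np.sqrt(width)
    return np.concatenate([g_W.ravel(), g_v])

J = np.stack([param_grad(x) for x in X])       # (n, num_params) Jacobian at initialization
ntk = J @ J.T                                  # empirical NTK Gram matrix

print(np.allclose(ntk, ntk.T))                 # symmetric
print(np.linalg.eigvalsh(ntk).min() >= -1e-8)  # PSD up to roundoff, as any J J^T must be
```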
7.4 Deep Kernel Learning
Deep kernel learning composes a neural feature extractor with a kernel. Let
$$g_{\theta} : \mathcal{X} \to \mathbb{R}^{m}$$
be a neural network. Define
$$k_{\theta}(x, z) = k_{\mathrm{base}}\big( g_{\theta}(x),\ g_{\theta}(z) \big).$$
If $k_{\mathrm{base}}$ is a valid kernel, then $k_{\theta}$ is a valid kernel for fixed $\theta$.
Deep kernel learning separates two roles:
- the neural network learns a representation;
- the kernel provides smoothness, uncertainty, or nonparametric structure in representation space.
This hybrid is useful when:
- raw inputs need learned features;
- uncertainty estimates are still desired;
- data is too structured for a hand-designed kernel;
- the final task benefits from GP-style posterior behavior.
The same idea appears in practical systems that train embeddings and then run kernel-like methods or GP heads on top.
7.5 Uncertainty and Small-Data Learning
Deep networks often provide strong predictions but weak calibrated uncertainty without extra machinery. Gaussian processes and Bayesian kernel methods provide uncertainty by construction.
In small-data settings, kernel priors are valuable because assumptions matter more than scale. A well-chosen kernel can encode:
- smoothness in physical space;
- periodicity in time;
- conservation-inspired structure;
- similarity between molecules;
- correlation across tasks;
- graph structure;
- known invariances.
This is why kernels remain common in:
- Bayesian optimization;
- active learning;
- scientific machine learning;
- spatial statistics;
- molecular property prediction;
- calibration and uncertainty modeling;
- low-data regression.
Kernels are less dominant when raw representation learning is the bottleneck. But when the representation is known or learned elsewhere, kernel methods can be excellent final-layer learners.
8. Common Mistakes
| # | Mistake | Why it is wrong | Fix |
|---|---|---|---|
| 1 | Treating any similarity as a kernel. | RKHS theory requires positive semidefinite Gram matrices, not just intuitive similarity. | Check PSD conditions or use known valid kernel constructions. |
| 2 | Thinking pairwise positive values imply PSD. | A matrix can have all positive entries and still have a negative eigenvalue. | Test $c^{\top} K c$ or eigenvalues for finite examples. |
| 3 | Forgetting feature scaling before RBF kernels. | Distance-based kernels are dominated by large-scale coordinates. | Standardize inputs or use anisotropic lengthscales. |
| 4 | Choosing RBF bandwidth blindly. | Too small gives memorization; too large gives nearly constant similarity. | Tune bandwidth with regularization. |
| 5 | Forgetting to center for kernel PCA. | PCA requires centered features; uncentered kernels distort components. | Use $K_c = H K H$. |
| 6 | Computing explicit inverses. | Explicit inverse is slower and less stable than solving a linear system. | Use Cholesky or linear solves for $(K + \lambda I)\alpha = y$. |
| 7 | Assuming kernel methods avoid overfitting. | Rich kernels can interpolate noise. | Use validation, regularization, and bandwidth control. |
| 8 | Confusing SVM sparsity with KRR density. | SVM predictions use support vectors; KRR usually uses all training points. | Choose algorithm based on prediction cost and objective. |
| 9 | Treating GP variance as generic model confidence. | GP uncertainty is conditional on the kernel, likelihood, and training assumptions. | Interpret uncertainty relative to the model assumptions. |
| 10 | Ignoring memory. | Exact kernels become impossible at large $n$. | Use Nystrom, random features, inducing points, or linear models. |
| 11 | Reteaching PCA inside kernel PCA. | PCA has its own canonical section. | Use kernel PCA as an application and point to the PCA section. |
| 12 | Thinking NTK explains all deep learning. | NTK mainly describes lazy or infinite-width regimes. | Use NTK as a baseline, not a complete representation-learning theory. |
9. Exercises
- Exercise 1 (Easy): Build and Inspect Gram Matrices. Given a small dataset in $\mathbb{R}^{2}$, compute linear, polynomial, and RBF Gram matrices. Verify symmetry. Compute eigenvalues. Explain which matrices are PSD and why.
- Exercise 2 (Easy): Explicit Polynomial Feature Map. For $x, z \in \mathbb{R}^{2}$, construct a feature map whose inner product equals $(x^{\top} z)^{2}$. Verify the identity numerically for random points.
- Exercise 3 (Easy): Kernel Normalization and Centering. Normalize an RBF Gram matrix. Center it using $K_c = H K H$. Verify that the centered matrix has approximately zero row and column sums.
- Exercise 4 (Medium): Kernel Ridge Regression. Implement kernel ridge regression from scratch. Fit noisy samples of a nonlinear one-dimensional function. Vary $\lambda$ and the RBF bandwidth. Explain underfitting, good fit, and overfitting.
- Exercise 5 (Medium): Kernel PCA. Generate a two-dimensional nonlinear dataset such as concentric circles. Compute an RBF kernel. Center the kernel. Extract the first two kernel principal components. Compare with linear PCA intuition.
- Exercise 6 (Medium): Gaussian Process Posterior. Implement the GP posterior mean and variance for one-dimensional regression. Use an RBF kernel. Plot uncertainty away from observed points. Explain the role of noise variance.
- Exercise 7 (Hard): Random Fourier Features. Approximate the RBF kernel with random Fourier features. Measure approximation error as the number of random features grows. Train a linear ridge model on random features and compare it with exact KRR.
- Exercise 8 (Hard): Neural Tangent Kernel. Define a small two-layer neural network. Compute the empirical NTK for a finite dataset using parameter gradients or finite differences. Verify that the NTK Gram matrix is PSD up to numerical tolerance. Interpret what large NTK entries mean.
10. Why This Matters for AI
| Kernel concept | AI impact |
|---|---|
| Feature map | Explains how nonlinear representation can make linear learning powerful. |
| Gram matrix | Central object for pairwise similarity, spectral methods, and kernel algorithms. |
| PSD condition | Separates valid Hilbert-space kernels from arbitrary similarity scores. |
| RKHS | Provides the function-space setting for regularized learning. |
| Reproducing property | Turns function evaluation into inner-product geometry. |
| Representer theorem | Explains why infinite-dimensional optimization reduces to finite coefficients. |
| RBF bandwidth | Controls locality, smoothness, memorization, and generalization. |
| Kernel ridge regression | Gives closed-form nonlinear regression with regularization. |
| SVM dual | Shows how margins can be learned through kernel evaluations. |
| Kernel PCA | Extends linear spectral methods to nonlinear feature spaces. |
| Gaussian process kernels | Encode prior covariance and provide uncertainty estimates. |
| Nystrom approximation | Makes large kernel matrices manageable through landmarks. |
| Random Fourier features | Converts kernel learning into scalable linear learning. |
| Neural tangent kernel | Gives a mathematical baseline for wide-network training dynamics. |
| Deep kernel learning | Combines learned representations with kernel uncertainty or smoothness. |
The durable lesson is: kernels turn similarity assumptions into learnable geometry.
In deep learning, the representation is often learned. In kernel methods, the representation geometry is often specified. Hybrid methods do both.
For LLMs and large AI systems, kernels are most visible in theory, retrieval similarity, approximation methods, uncertainty heads, and small-data or scientific layers around larger learned systems. The concepts remain load-bearing even when the production model is not a classical SVM.
11. Conceptual Bridge
This section completes the functional-analysis arc from normed spaces to Hilbert spaces to kernel methods. Normed spaces introduced size and convergence. Hilbert spaces added inner products, orthogonality, and projection. Kernel methods use Hilbert-space geometry without explicitly constructing the Hilbert-space coordinates.
Backward connection: from Normed Spaces, we use norms and regularization. From Hilbert Spaces, we use inner products and RKHS language. From Positive Definite Matrices, we use PSD Gram matrices. From Constrained Optimization, we use duality in SVMs.
Forward connection: kernel methods prepare you for ML-specific losses, activation functions, normalization, and sampling. They also prepare you to read Gaussian-process models, spectral generalization analyses, kernelized attention approximations, and NTK papers.
Curriculum position:
Functional Analysis
01 Normed Spaces
|
v
02 Hilbert Spaces
|
v
03 Kernel Methods
|
v
ML-Specific Math
|
+-- loss functions
+-- activations
+-- normalization
+-- sampling
The conceptual shift is important: we began with spaces of vectors. We moved to spaces of functions. Kernels let us compute with those function spaces through finite matrices.
That is the deepest practical idea in this section. Infinite-dimensional thinking can produce finite algorithms.
12. References
- Aronszajn, N. (1950). Theory of Reproducing Kernels. Transactions of the American Mathematical Society.
- Cortes, C. and Vapnik, V. (1995). Support-Vector Networks. Machine Learning.
- Scholkopf, B., Herbrich, R., and Smola, A. J. (2001). A Generalized Representer Theorem.
- Scholkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press.
- Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning.
- Rahimi, A. and Recht, B. (2007). Random Features for Large-Scale Kernel Machines.
- Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks.
- Stanford CS229 notes on kernels and support vector machines.
- Cornell CS4780 lecture notes on kernel methods.
- scikit-learn documentation on kernel approximation and kernel methods.