Notes: Math for LLMs

Kernel Methods

Functional Analysis / Kernel Methods

"A kernel is a promise: compute similarity in the input space, and behave as if learning happened in a richer feature space."

Overview

Kernel methods are a family of learning methods built around one central observation: many algorithms only need inner products between examples. If an algorithm can be written using \langle \mathbf{x}, \mathbf{x}' \rangle, then we can often replace that inner product with a kernel value k(\mathbf{x}, \mathbf{x}'). The replacement makes the algorithm act as though each input has been mapped into a high-dimensional, sometimes infinite-dimensional, feature space.

This section studies the kernel trick, positive semidefinite kernels, Reproducing Kernel Hilbert Spaces, common kernel families, kernelized algorithms, and scalable approximations. The goal is not to treat kernels as a historical detour before deep learning. The goal is to understand kernels as one of the cleanest mathematical bridges between Hilbert space geometry and modern AI systems.

Kernel methods explain support vector machines, kernel ridge regression, kernel PCA, Gaussian processes, random Fourier features, and the neural tangent kernel. They also sharpen everyday ML instincts about similarity, regularization, smoothness, uncertainty, and data-dependent representation. When you ask whether two embeddings are similar, whether a function should vary smoothly, or whether a model should express uncertainty away from data, you are already asking kernel-shaped questions.

Prerequisites

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Interactive derivations for Gram matrices, kernels, RKHS intuition, kernel ridge regression, kernel PCA, Gaussian process prediction, random Fourier features, and NTK-style behavior. |
| exercises.ipynb | Eight graded exercises covering kernel validity, Gram matrices, centering, kernel ridge regression, kernel PCA, Gaussian processes, random Fourier features, and neural tangent kernels. |

Learning Objectives

After completing this section, you will be able to:

  • Explain the kernel trick as an inner-product replacement, not as magic.
  • Construct explicit feature maps for linear and polynomial kernels.
  • Test a finite Gram matrix for positive semidefiniteness.
  • Distinguish valid kernels from arbitrary similarity functions.
  • Explain how every positive semidefinite kernel defines an RKHS.
  • Use the reproducing property to interpret function evaluation as an inner product.
  • Apply the representer theorem to regularized learning in RKHS.
  • Implement kernel ridge regression and interpret the role of \lambda.
  • Center a kernel matrix and run kernel PCA.
  • Interpret Gaussian process regression as Bayesian learning with a kernel covariance.
  • Use Nystrom and random Fourier approximations to reduce kernel cost.
  • Connect kernel eigenvalues to generalization, smoothness, and neural tangent kernel intuition.


1. Intuition

Kernel methods begin with a deceptively simple question. What if the original coordinates are the wrong coordinates for learning?

Suppose a binary classification dataset in \mathbb{R}^2 is not linearly separable. No line can separate the classes in the original plane. But after adding a feature such as x_1^2 + x_2^2, the same points may become separable by a hyperplane in the transformed feature space.

The direct feature-map story is clear: map \mathbf{x} to \phi(\mathbf{x}). Run a linear algorithm in the \phi-space. Return a nonlinear predictor in the original input space.

The kernel story is more subtle: do not compute \phi(\mathbf{x}). Compute only the inner product that would have appeared after the feature map. That inner product is the kernel.

k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}.

This identity is the whole game. The feature space can be very large. It can be infinite-dimensional. It can even be a function space. But if the learning algorithm only needs dot products, the explicit coordinates are unnecessary.

1.1 Why Kernels Matter for AI

Kernels matter because they convert assumptions about similarity into mathematically controlled learning algorithms. In neural networks, similarity is often learned implicitly. In kernel methods, similarity is specified by k.

A kernel can encode smoothness. A kernel can encode periodicity. A kernel can encode sequence similarity. A kernel can encode graph structure. A kernel can encode invariances. A kernel can encode prior uncertainty.

This makes kernels especially useful when:

  • data is limited;
  • uncertainty estimates matter;
  • interpretability of inductive bias matters;
  • training a large neural model is unnecessary or too expensive;
  • a strong handcrafted similarity is available;
  • we want a clean mathematical model of generalization.

Kernels also matter in deep learning theory. The neural tangent kernel studies how wide neural networks behave during gradient descent. In the infinite-width limit, a neural network can behave like a kernel machine whose kernel is determined by architecture and initialization. This does not mean all useful neural networks are just kernels. It means kernels provide a precise baseline for understanding when representation learning does, or does not, happen.

The AI story has three levels:

| Level | Kernel role | Example |
| --- | --- | --- |
| Algorithmic | Kernel gives a nonlinear model with convex or closed-form training. | SVM, kernel ridge regression, kernel PCA. |
| Probabilistic | Kernel is a covariance function over functions. | Gaussian process regression. |
| Theoretical | Kernel describes training dynamics of very wide networks. | Neural tangent kernel. |

1.2 Linear Learning in Nonlinear Feature Spaces

Let \mathcal{X} be an input space. For tabular data, \mathcal{X} might be \mathbb{R}^d. For text, \mathcal{X} might be strings. For graphs, \mathcal{X} might be graphs. For functions, \mathcal{X} might already be infinite-dimensional.

A feature map is a function

\phi : \mathcal{X} \to \mathcal{H},

where \mathcal{H} is a Hilbert space. Once inputs are mapped into \mathcal{H}, a linear predictor has the form

f(\mathbf{x}) = \langle \mathbf{w}, \phi(\mathbf{x}) \rangle_{\mathcal{H}} + b.

This predictor is linear in feature space. It can be nonlinear in the original input coordinates.

Example: let \mathbf{x} = (x_1, x_2)^\top. Define

\phi(\mathbf{x}) = \begin{bmatrix} x_1^2 \\ \sqrt{2}x_1x_2 \\ x_2^2 \end{bmatrix}.

Then

\langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle = x_1^2z_1^2 + 2x_1x_2z_1z_2 + x_2^2z_2^2 = (\mathbf{x}^\top \mathbf{z})^2.

So the degree-2 homogeneous polynomial kernel is

k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z})^2.

The feature map has dimension 3 for \mathbb{R}^2. For degree p in dimension d, the explicit feature dimension grows combinatorially. The kernel evaluation remains cheap.
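The identity above is easy to verify numerically. The following sketch (plain NumPy, independent of the companion notebooks) compares the explicit feature-map inner product with the kernel value:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel (x^T z)^2."""
    return float(x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = phi(x) @ phi(z)      # inner product after the explicit feature map
rhs = poly2_kernel(x, z)   # kernel value computed in the input space
print(lhs, rhs)            # both equal (1*3 + 2*(-1))^2 = 1.0
```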

For the inhomogeneous polynomial kernel,

k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z} + c)^p,

the implicit feature space contains monomials of degree up to p. This is a nonlinear feature expansion without manually writing the expanded feature vector.

Feature-space picture:

input space                 feature space

    x                              phi(x)
  points       ---- phi ---->      high-dimensional points

linear separator impossible        linear separator possible
in original coordinates            after nonlinear lifting

The crucial warning: the kernel trick does not make impossible algorithms possible. It makes inner-product algorithms operate in richer spaces without explicit coordinates.

1.3 The Kernel Trick as Inner-Product Substitution

An algorithm is kernelizable when its input dependence appears only through inner products.

For a linear model trained by certain dual methods, the weight vector can often be written as

\mathbf{w} = \sum_{i=1}^n \alpha_i y^{(i)} \phi(\mathbf{x}^{(i)}).

Then prediction becomes

f(\mathbf{x}) = \left\langle \sum_{i=1}^n \alpha_i y^{(i)} \phi(\mathbf{x}^{(i)}), \phi(\mathbf{x}) \right\rangle_{\mathcal{H}} + b.

By linearity,

f(\mathbf{x}) = \sum_{i=1}^n \alpha_i y^{(i)} \langle \phi(\mathbf{x}^{(i)}), \phi(\mathbf{x}) \rangle_{\mathcal{H}} + b.

Replace the inner product with a kernel:

f(\mathbf{x}) = \sum_{i=1}^n \alpha_i y^{(i)} k(\mathbf{x}^{(i)}, \mathbf{x}) + b.

The model now needs only kernel evaluations between training examples and test examples. It does not need explicit \phi(\mathbf{x}) coordinates.

This is why the kernel matrix becomes the central data structure:

K_{ij} = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}).

The kernel matrix is an n \times n matrix of pairwise similarities. For many algorithms, the entire training set enters only through K.

Kernel trick checklist:

  1. Write the algorithm in terms of inner products.
  2. Replace \langle \mathbf{x}^{(i)}, \mathbf{x}^{(j)} \rangle with k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}).
  3. Ensure k is a valid positive semidefinite kernel.
  4. Solve the resulting finite-dimensional problem over coefficients.
  5. Predict using kernel evaluations against training points.

The phrase "trick" can be misleading. The kernel trick is not a shortcut around mathematics. It is a theorem-supported representation choice.
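The checklist can be sketched end to end. The snippet below is a minimal NumPy illustration with made-up labels and dual coefficients (an actual training procedure would set alpha); it forms the Gram matrix and then predicts using only kernel evaluations against training points:

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))       # training inputs
y = np.sign(X[:, 0])               # toy labels in {-1, +1}
alpha = rng.uniform(size=20)       # placeholder dual coefficients

K = rbf(X, X)                      # the n x n kernel matrix: the central data structure
X_test = rng.normal(size=(5, 3))

# Step 5: predict with kernel evaluations against training points,
# f(x) = sum_i alpha_i y^(i) k(x^(i), x).
f = rbf(X_test, X) @ (alpha * y)
print(K.shape, f.shape)
```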

1.4 Historical Path

Kernel methods have a long history. They did not begin with modern SVMs. They grew out of approximation theory, potential functions, integral operators, and Hilbert-space methods.

Some milestones:

| Period | Development | Why it matters |
| --- | --- | --- |
| 1909 | Mercer's theorem | Connects positive kernels with eigenfunction expansions. |
| 1950 | Aronszajn's RKHS theory | Provides the Hilbert-space foundation for reproducing kernels. |
| 1960s | Potential function methods | Early pattern-recognition use of kernel-like similarity functions. |
| 1990s | Support vector machines | Convex margin classifiers made kernels central in ML. |
| 2000s | Gaussian processes and kernel machines | Kernels became standard tools for Bayesian nonparametrics and nonlinear learning. |
| 2007 | Random Fourier features | Scalable approximations made large kernel methods practical. |
| 2018 | Neural tangent kernel | Kernel theory became a tool for analyzing wide neural networks. |

The deep-learning era did not erase kernels. It changed where kernels appear. Sometimes they appear as explicit algorithms. Sometimes they appear as covariance functions. Sometimes they appear as theory for training dynamics. Sometimes they appear as similarity functions inside attention, retrieval, and embedding systems.


2. Formal Definitions

This section gives the exact mathematical objects. The definitions are intentionally finite-first. Most practical kernel work begins with a finite dataset and a Gram matrix. The infinite-dimensional RKHS story comes after the finite matrix tests.

2.1 Feature Maps and Feature Spaces

Let \mathcal{X} be an input set. Let \mathcal{H} be a Hilbert space with inner product \langle \cdot, \cdot \rangle_{\mathcal{H}}. A feature map is a function

\phi : \mathcal{X} \to \mathcal{H}.

The feature space is the target space \mathcal{H}. The coordinates of \phi(\mathbf{x}) may be explicit, implicit, finite, or infinite.

Examples:

  1. Identity feature map:
\phi(\mathbf{x}) = \mathbf{x},

with \mathcal{H} = \mathbb{R}^d.

  2. Polynomial feature map:
\phi(\mathbf{x}) = \text{all monomials in } \mathbf{x} \text{ up to degree } p.
  3. Fourier feature map for a shift-invariant kernel:
\phi_{\boldsymbol{\omega}}(\mathbf{x}) = \sqrt{2}\cos(\boldsymbol{\omega}^\top \mathbf{x} + b).
  4. Function-space feature map:
\phi(\mathbf{x}) = k(\mathbf{x}, \cdot),

which maps each input to a function of another input.

Non-examples:

  • A mapping into a set with no inner product is not enough for the kernel trick.
  • A similarity score with no positive semidefinite property may fail to define any Hilbert-space feature map.
  • A learned embedding model is not automatically a kernel unless the induced similarity satisfies the kernel conditions.

The feature-space viewpoint explains why kernels are geometric. Learning becomes linear geometry in \mathcal{H}. Classification becomes hyperplane separation. Regression becomes regularized function fitting. PCA becomes variance decomposition in feature space. Gaussian process covariance becomes inner-product structure over functions.

2.2 Kernel Functions and Gram Matrices

A kernel function is a function

k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}.

When a feature map exists, the kernel has the form

k(\mathbf{x}, \mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle_{\mathcal{H}}.

Given data \mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}, the Gram matrix is

K = \begin{bmatrix} k(\mathbf{x}^{(1)}, \mathbf{x}^{(1)}) & \cdots & k(\mathbf{x}^{(1)}, \mathbf{x}^{(n)}) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}^{(n)}, \mathbf{x}^{(1)}) & \cdots & k(\mathbf{x}^{(n)}, \mathbf{x}^{(n)}) \end{bmatrix}.

In compact notation:

K_{ij} = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}).

If the kernel comes from a feature map, then

K_{ij} = \langle \phi(\mathbf{x}^{(i)}), \phi(\mathbf{x}^{(j)}) \rangle_{\mathcal{H}}.

For any coefficient vector \mathbf{c} \in \mathbb{R}^n,

\mathbf{c}^\top K \mathbf{c} = \sum_{i=1}^n \sum_{j=1}^n c_i c_j k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}).

If k comes from an inner product, then

\mathbf{c}^\top K \mathbf{c} = \left\lVert \sum_{i=1}^n c_i \phi(\mathbf{x}^{(i)}) \right\rVert_{\mathcal{H}}^2 \ge 0.

This equation is the most important finite test: a valid kernel produces positive semidefinite Gram matrices for every finite dataset.

2.3 Positive Semidefinite Kernels

A symmetric function k : \mathcal{X} \times \mathcal{X} \to \mathbb{R} is a positive semidefinite kernel if for every finite set \mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)} \in \mathcal{X} and every \mathbf{c} \in \mathbb{R}^n,

\sum_{i=1}^n \sum_{j=1}^n c_i c_j k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \ge 0.

Equivalently, the Gram matrix K satisfies

K \succeq 0.

There are two conditions:

  1. Symmetry:
k(\mathbf{x}, \mathbf{z}) = k(\mathbf{z}, \mathbf{x}).
  2. Positive semidefiniteness:
\mathbf{c}^\top K \mathbf{c} \ge 0 \quad \text{for all finite datasets and all } \mathbf{c}.

Do not confuse these with pairwise positivity. A kernel can take negative values and still be valid. The linear kernel k(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{z} can be negative. It is still a valid kernel.

Do not confuse positive semidefinite with positive definite. Positive semidefinite allows zero eigenvalues. This matters because duplicate points, redundant features, and finite-dimensional feature maps often produce singular Gram matrices.

Finite check: for a proposed kernel and a finite dataset, compute the eigenvalues of K. If any eigenvalue is significantly negative, the kernel failed on that dataset. If all eigenvalues are nonnegative on one dataset, that is evidence, not proof, that the kernel is valid.
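A minimal version of this finite check, assuming only NumPy:

```python
import numpy as np

def min_eigenvalue(K):
    """Smallest eigenvalue of a symmetric matrix (eigvalsh assumes symmetry)."""
    return float(np.linalg.eigvalsh(K).min())

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))

# RBF Gram matrix on one dataset: a valid kernel, so every eigenvalue
# should be nonnegative up to floating-point roundoff.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_rbf = np.exp(-0.5 * sq)
print(min_eigenvalue(K_rbf))   # tiny, possibly slightly below zero from roundoff
```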

2.4 Valid Kernels and Counterexamples

Valid kernels:

| Kernel | Formula | Valid because |
| --- | --- | --- |
| Linear | k(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{z} | Direct inner product. |
| Polynomial | k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z} + c)^p with c \ge 0 and integer p \ge 1 | Explicit finite feature expansion. |
| RBF | k(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \lVert \mathbf{x} - \mathbf{z} \rVert_2^2) with \gamma > 0 | Positive Fourier transform by Bochner's theorem. |
| Laplacian | k(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \lVert \mathbf{x} - \mathbf{z} \rVert_1) | Product of valid one-dimensional Laplace kernels. |
| Sum | k_1 + k_2 | Sum of PSD matrices is PSD. |
| Product | k_1 k_2 | Schur product theorem. |

Invalid or suspicious similarities:

| Function | Problem |
| --- | --- |
| k(\mathbf{x}, \mathbf{z}) = -\lVert \mathbf{x} - \mathbf{z} \rVert_2^2 | Not generally PSD as a kernel, though related to conditionally negative definite distances. |
| k(\mathbf{x}, \mathbf{z}) = \sin(\mathbf{x}^\top \mathbf{z}) | Not generally PSD. |
| k(\mathbf{x}, \mathbf{z}) = \mathbb{1}[\mathbf{x} \ne \mathbf{z}] | Can produce indefinite Gram matrices. |
| Arbitrary cosine of learned embeddings | May be useful similarity but not automatically a kernel over the original input set. |

Counterexample method:

  1. Choose a small dataset.
  2. Form the Gram matrix.
  3. Compute eigenvalues.
  4. Find a vector \mathbf{c} such that \mathbf{c}^\top K \mathbf{c} < 0.

This method is not just a homework trick. Indefinite similarity matrices appear in practice when people combine scores, distances, edit penalties, or learned similarities without checking PSD structure. Some algorithms can work with indefinite similarities, but the RKHS theory no longer applies directly.
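A concrete instance of the counterexample method, using the indicator similarity k(\mathbf{x}, \mathbf{z}) = \mathbb{1}[\mathbf{x} \ne \mathbf{z}] from the table above on three distinct points:

```python
import numpy as np

# Gram matrix of k(x, z) = 1[x != z] on three distinct points:
# zeros on the diagonal, ones everywhere else.
K = np.ones((3, 3)) - np.eye(3)

eigvals = np.linalg.eigvalsh(K)
print(eigvals)                 # eigenvalues 2, -1, -1: the matrix is indefinite

# Step 4 of the method: exhibit a coefficient vector with c^T K c < 0.
c = np.array([1.0, -1.0, 0.0])
print(c @ K @ c)               # -2.0
```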

2.5 Kernel Normalization and Centering

Kernel values can be sensitive to scale. A normalized kernel is often defined as

\tilde{k}(\mathbf{x}, \mathbf{z}) = \frac{k(\mathbf{x}, \mathbf{z})}{\sqrt{k(\mathbf{x}, \mathbf{x})k(\mathbf{z}, \mathbf{z})}}.

If k(\mathbf{x}, \mathbf{x}) > 0 and k(\mathbf{z}, \mathbf{z}) > 0, then

\tilde{k}(\mathbf{x}, \mathbf{x}) = 1.

In feature-space language,

\tilde{k}(\mathbf{x}, \mathbf{z}) = \frac{\langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle_{\mathcal{H}}}{\lVert \phi(\mathbf{x}) \rVert_{\mathcal{H}} \lVert \phi(\mathbf{z}) \rVert_{\mathcal{H}}}.

So normalized kernels are cosine similarities in feature space.

Kernel centering is different. Kernel PCA and related methods require centered feature vectors:

\tilde{\phi}(\mathbf{x}^{(i)}) = \phi(\mathbf{x}^{(i)}) - \frac{1}{n}\sum_{j=1}^n \phi(\mathbf{x}^{(j)}).

The centered Gram matrix is

K_c = HKH,

where

H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top.

Centering removes the feature-space mean. Without centering, kernel PCA can waste its first component on the mean offset rather than variance structure.

Normalization controls diagonal scale. Centering controls feature-space mean. They solve different problems.
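Both operations are a few lines on the Gram matrix. A small NumPy sketch, using an inhomogeneous polynomial kernel as the running example:

```python
import numpy as np

def normalize_kernel(K):
    """Cosine normalization: K_ij / sqrt(K_ii K_jj); the diagonal becomes 1."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def center_kernel(K):
    """Double centering: K_c = H K H with H = I - (1/n) 11^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
K = (X @ X.T + 1.0) ** 2       # inhomogeneous polynomial kernel, degree 2

Kn = normalize_kernel(K)
Kc = center_kernel(K)
print(np.diag(Kn))             # all ones: unit "length" in feature space
print(Kc.sum())                # ~0: the feature-space mean has been removed
```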


3. RKHS Foundations

The finite Gram matrix story explains how kernels are used computationally. The RKHS story explains what functions are being learned.

A Reproducing Kernel Hilbert Space is a Hilbert space of functions where point evaluation is continuous and represented by inner product with a kernel section. That sentence is dense. This section unpacks it carefully.

3.1 Reproducing Kernel Hilbert Spaces

Let \mathcal{X} be an input set. Let \mathcal{H} be a Hilbert space whose elements are functions

f : \mathcal{X} \to \mathbb{R}.

For each \mathbf{x} \in \mathcal{X}, define the evaluation functional

L_{\mathbf{x}}(f) = f(\mathbf{x}).

The space \mathcal{H} is a Reproducing Kernel Hilbert Space if every evaluation functional is continuous. By the Riesz representation theorem for Hilbert spaces, continuity implies that for every \mathbf{x} there exists a unique element k(\mathbf{x}, \cdot) \in \mathcal{H} such that

f(\mathbf{x}) = \langle f, k(\mathbf{x}, \cdot) \rangle_{\mathcal{H}}

for every f \in \mathcal{H}.

The function k is the reproducing kernel.

This definition has three important consequences:

  1. Each input \mathbf{x} corresponds to a function k(\mathbf{x}, \cdot).
  2. Evaluating f at \mathbf{x} is an inner product in \mathcal{H}.
  3. The kernel is not merely a similarity score; it is the object that represents point evaluation.

The canonical feature map of an RKHS is

\phi(\mathbf{x}) = k(\mathbf{x}, \cdot).

Then

k(\mathbf{x}, \mathbf{z}) = \langle k(\mathbf{x}, \cdot), k(\mathbf{z}, \cdot) \rangle_{\mathcal{H}}.

This means every RKHS kernel is automatically a feature-map kernel.

3.2 The Reproducing Property

The reproducing property is

f(\mathbf{x}) = \langle f, k(\mathbf{x}, \cdot) \rangle_{\mathcal{H}}.

It says that function evaluation is inner product geometry.

If f is represented as

f = \sum_{i=1}^n \alpha_i k(\mathbf{x}^{(i)}, \cdot),

then

f(\mathbf{x}) = \left\langle \sum_{i=1}^n \alpha_i k(\mathbf{x}^{(i)}, \cdot), k(\mathbf{x}, \cdot) \right\rangle_{\mathcal{H}}.

By linearity,

f(\mathbf{x}) = \sum_{i=1}^n \alpha_i \langle k(\mathbf{x}^{(i)}, \cdot), k(\mathbf{x}, \cdot) \rangle_{\mathcal{H}}.

Using the kernel identity,

f(\mathbf{x}) = \sum_{i=1}^n \alpha_i k(\mathbf{x}^{(i)}, \mathbf{x}).

This is exactly the prediction form used by kernel machines.

The reproducing property explains why coefficients over training points are enough. The learned function lives in a Hilbert space, but prediction reduces to kernel evaluations.

3.3 Moore-Aronszajn Theorem

The Moore-Aronszajn theorem states: every positive semidefinite kernel k on \mathcal{X} uniquely determines an RKHS \mathcal{H}_k for which k is the reproducing kernel.

This theorem closes the loop:

positive semidefinite kernel
        |
        v
unique RKHS of functions
        |
        v
feature map x -> k(x, .)
        |
        v
inner product recovers k(x, z)

The theorem is important because it means we do not need to explicitly construct \phi. If we prove that k is positive semidefinite, then an RKHS exists. The feature space is guaranteed by the theorem.

This is why kernel validity matters. An invalid similarity may still produce useful numbers. But it does not automatically give a Hilbert space of functions, a reproducing property, or a standard regularization theory.

3.4 Mercer Expansion and Eigenfunctions

Mercer's theorem is one of the classical links between kernels and spectral theory. Under suitable conditions, such as a continuous positive semidefinite kernel on a compact domain with respect to a measure, the kernel has an expansion

k(\mathbf{x}, \mathbf{z}) = \sum_{m=1}^{\infty} \lambda_m \psi_m(\mathbf{x})\psi_m(\mathbf{z}),

where \lambda_m \ge 0 and \{\psi_m\} are orthonormal eigenfunctions of the associated integral operator.

The integral operator is

(T_k f)(\mathbf{x}) = \int_{\mathcal{X}} k(\mathbf{x}, \mathbf{z})f(\mathbf{z})\,d\mu(\mathbf{z}).

The eigenfunction equation is

T_k \psi_m = \lambda_m \psi_m.

The expansion resembles matrix eigendecomposition. The kernel acts like an infinite matrix. The eigenfunctions play the role of eigenvectors. The eigenvalues tell which directions in function space are favored.

Practical interpretation:

  • large eigenvalues correspond to functions the kernel represents easily;
  • small eigenvalues correspond to functions strongly penalized by the RKHS norm;
  • kernel ridge regression shrinks more strongly along low-eigenvalue directions;
  • NTK eigenvalues can predict which target functions are learned quickly by wide networks.

Mercer is not required every time we use kernels. The finite Gram matrix definition is more general in many ML settings. But Mercer gives spectral intuition for smoothness and generalization.
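The finite analogue of the Mercer expansion is ordinary eigendecomposition of the Gram matrix, K = \sum_m \lambda_m \mathbf{u}_m \mathbf{u}_m^\top. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)              # RBF Gram matrix

lams, U = np.linalg.eigh(K)        # ascending eigenvalues, orthonormal eigenvectors
lams, U = lams[::-1], U[:, ::-1]   # reorder to descending

K_hat = (U * lams) @ U.T           # reconstruct K from its spectral components
print(np.allclose(K, K_hat))
print(lams[:4])                    # rapid eigenvalue decay is typical for smooth kernels
```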

3.5 RKHS Norm as Smoothness Control

The RKHS norm \lVert f \rVert_{\mathcal{H}_k} measures function complexity relative to the kernel. Different kernels create different notions of complexity.

For a regularized learning problem,

\min_{f \in \mathcal{H}_k} \sum_{i=1}^n \ell(f(\mathbf{x}^{(i)}), y^{(i)}) + \lambda \lVert f \rVert_{\mathcal{H}_k}^2,

the first term fits the data. The second term penalizes functions that are complex in the RKHS.

The same function can be simple under one kernel and complex under another. For example:

  • an RBF kernel favors very smooth functions;
  • a periodic kernel favors periodic functions;
  • a linear kernel favors linear functions;
  • a string kernel favors functions expressible through substring features;
  • a graph kernel favors functions aligned with graph similarity.

This is why choosing a kernel is choosing an inductive bias.

The RKHS norm is not merely a generic "size" penalty. It encodes which functions are cheap and which are expensive. In a Gaussian process, this same kernel encodes covariance. In kernel ridge regression, it encodes regularization. In SVMs, it encodes margin geometry.
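For a function in the span of kernel sections, f = \sum_i \alpha_i k(\mathbf{x}^{(i)}, \cdot), the reproducing property gives \lVert f \rVert_{\mathcal{H}_k}^2 = \boldsymbol{\alpha}^\top K \boldsymbol{\alpha}. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(15, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)                    # RBF Gram matrix, gamma = 1

alpha = rng.normal(size=15)

# ||f||^2 = sum_ij alpha_i alpha_j k(x^(i), x^(j)) = alpha^T K alpha,
# nonnegative because K is positive semidefinite.
rkhs_sq_norm = float(alpha @ K @ alpha)
print(rkhs_sq_norm)
```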


4. Common Kernels

Kernel choice determines the geometry of learning. This section catalogs the most common kernels and the assumptions they encode.

4.1 Linear and Polynomial Kernels

The linear kernel is

k(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{z}.

It corresponds to the identity feature map. Its Gram matrix is

K = XX^\top

when the rows of X are examples.

Use the linear kernel when:

  • the input features are already expressive;
  • dimension is high and nonlinear expansion is unnecessary;
  • interpretability and speed matter;
  • the model should behave like a regularized linear method.

The polynomial kernel is

k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z} + c)^p,

where c \ge 0 and p is a positive integer.

It represents interactions among input coordinates. For p = 2, it includes pairwise products. For p = 3, it includes cubic interactions. Higher degree increases expressivity but can overfit quickly.

Polynomial kernels are useful for:

  • controlled feature interactions;
  • small or medium-dimensional tabular data;
  • settings where multiplicative feature relationships matter;
  • teaching the kernel trick because the feature map can be written explicitly.

AI warning: large polynomial degree can behave like an unstable feature explosion. Regularization and scaling are not optional.

4.2 RBF Kernel

The radial basis function kernel, also called the Gaussian kernel, is

k(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \lVert \mathbf{x} - \mathbf{z} \rVert_2^2),

where \gamma > 0. Equivalently,

k(\mathbf{x}, \mathbf{z}) = \exp\left( -\frac{\lVert \mathbf{x} - \mathbf{z} \rVert_2^2}{2\sigma^2} \right),

with \gamma = 1/(2\sigma^2).

The RBF kernel is shift-invariant:

k(\mathbf{x}, \mathbf{z}) = \kappa(\mathbf{x} - \mathbf{z}).

It is also radial: the value depends only on distance.

Bandwidth interpretation:

  • small \sigma means very local similarity;
  • large \sigma means broad similarity;
  • too small \sigma can make K close to the identity;
  • too large \sigma can make K close to a constant matrix.

Both extremes are bad. If K \approx I_n, the model can memorize but generalize poorly. If K \approx \mathbf{1}\mathbf{1}^\top, the model has too little resolution.

The RBF kernel is universal on many compact domains. Informally, its RKHS is rich enough to approximate many continuous functions. But universality does not remove the need for regularization. Richness without control is overfitting.
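The two failure modes are easy to observe directly. A small NumPy sketch with deliberately extreme bandwidths:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(8, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

K_narrow = np.exp(-sq / (2 * 0.001 ** 2))   # sigma = 0.001: K collapses to the identity
K_wide = np.exp(-sq / (2 * 100.0 ** 2))     # sigma = 100:  K collapses to all-ones

print(np.abs(K_narrow - np.eye(8)).max())       # ~0: each point is similar only to itself
print(np.abs(K_wide - np.ones((8, 8))).max())   # ~0: every point looks like every other
```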

4.3 Laplacian and Matern Kernels

The Laplacian kernel is

k(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \lVert \mathbf{x} - \mathbf{z} \rVert_1).

Compared with RBF, it produces rougher functions. It can be useful when the target function has sharper changes.

The Matern family is a common Gaussian-process kernel family. For distance r = \lVert \mathbf{x} - \mathbf{z} \rVert_2, it is

k_{\nu}(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}r}{\ell} \right)^\nu K_{\nu}^{\mathrm{Bessel}} \left( \frac{\sqrt{2\nu}r}{\ell} \right),

where \nu > 0 controls smoothness, \ell > 0 is the lengthscale, and K_{\nu}^{\mathrm{Bessel}} is the modified Bessel function of the second kind.

Common special cases avoid Bessel functions:

k_{\nu=1/2}(r) = \exp(-r/\ell).

k_{\nu=3/2}(r) = \left(1 + \frac{\sqrt{3}r}{\ell}\right) \exp\left(-\frac{\sqrt{3}r}{\ell}\right).

k_{\nu=5/2}(r) = \left(1 + \frac{\sqrt{5}r}{\ell} + \frac{5r^2}{3\ell^2}\right) \exp\left(-\frac{\sqrt{5}r}{\ell}\right).

Matern smoothness interpretation:

| Kernel | Smoothness assumption |
| --- | --- |
| \nu = 1/2 | Rough, continuous but not very smooth. |
| \nu = 3/2 | Once-differentiable style behavior. |
| \nu = 5/2 | Smoother functions. |
| \nu \to \infty | Approaches RBF smoothness. |

In scientific ML and Bayesian optimization, Matern kernels are often more realistic than RBF kernels. Many physical functions are smooth but not infinitely smooth.
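The three Bessel-free special cases fit in a few lines. A sketch (the default lengthscale \ell = 1 is an arbitrary choice for illustration):

```python
import numpy as np

def matern(r, nu, ell=1.0):
    """Matern kernel at distance r for the Bessel-free cases nu in {0.5, 1.5, 2.5}."""
    r = np.asarray(r, dtype=float)
    if nu == 0.5:
        return np.exp(-r / ell)
    if nu == 1.5:
        a = np.sqrt(3.0) * r / ell
        return (1.0 + a) * np.exp(-a)
    if nu == 2.5:
        a = np.sqrt(5.0) * r / ell
        return (1.0 + a + a ** 2 / 3.0) * np.exp(-a)
    raise ValueError("only nu in {0.5, 1.5, 2.5} are implemented here")

r = np.linspace(0.0, 3.0, 7)
for nu in (0.5, 1.5, 2.5):
    print(nu, np.round(matern(r, nu), 4))   # k(0) = 1; larger nu decays more gently near 0
```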

4.4 Periodic, Rational Quadratic, String, and Graph Kernels

The periodic kernel is

k(x,z) = \exp\left( -\frac{2\sin^2(\pi \lvert x-z \rvert / p)}{\ell^2} \right).

It encodes periodic structure with period p. It is useful for seasonal signals and cyclic processes.

The rational quadratic kernel is

k(\mathbf{x}, \mathbf{z}) = \left( 1 + \frac{\lVert \mathbf{x} - \mathbf{z} \rVert_2^2}{2\alpha\ell^2} \right)^{-\alpha}.

It can be interpreted as a scale mixture of RBF kernels. It is useful when multiple lengthscales are present.

String kernels compare sequences by shared subsequences or substrings. They were important in bioinformatics and text classification. The general form is

k(s,t) = \sum_{u \in \mathcal{A}^*} \phi_u(s)\phi_u(t),

where \phi_u(s) counts or weights occurrences of subsequence u in string s.

Graph kernels compare graphs through walks, paths, subtrees, graphlets, or spectral features. They are related to graph representation learning but are not the same as graph neural networks. For full graph neural network treatment, see Graph Neural Networks.

These structured kernels show that kernels are not limited to vectors. A kernel needs a valid PSD similarity over an input set. The input set can be strings, trees, graphs, distributions, molecules, or programs.

4.5 Kernel Composition Rules

Kernel design often uses closure rules. If k_1 and k_2 are valid kernels, then the following are valid under standard conditions:

| Construction | Formula |
| --- | --- |
| Nonnegative scaling | ak_1 for a \ge 0 |
| Sum | k_1 + k_2 |
| Product | k_1 k_2 |
| Pointwise limit | \lim_m k_m when the limit exists |
| Pullback by a map | k(g(\mathbf{x}), g(\mathbf{z})) |
| Tensor-style product | k_X(\mathbf{x}, \mathbf{z})\,k_Y(\mathbf{u}, \mathbf{v}) |
| Direct-sum kernel | k_X + k_Y |

Why sums work: for any finite dataset,

\mathbf{c}^\top (K_1 + K_2)\mathbf{c} = \mathbf{c}^\top K_1\mathbf{c} + \mathbf{c}^\top K_2\mathbf{c} \ge 0.

Why products work: the Gram matrix of k_1 k_2 is the Hadamard product

K_1 \odot K_2.

The Schur product theorem says the Hadamard product of PSD matrices is PSD.

Composition examples:

k(\mathbf{x}, \mathbf{z}) = k_{\text{RBF}}(\mathbf{x}, \mathbf{z}) + k_{\text{periodic}}(\mathbf{x}, \mathbf{z})

models smooth trend plus periodic structure.

k(\mathbf{x}, \mathbf{z}) = k_{\text{RBF}}(\mathbf{x}, \mathbf{z}) \, k_{\text{linear}}(\mathbf{x}, \mathbf{z})

models smooth local similarity modulated by linear alignment.

Kernel composition is inductive-bias engineering. It is the kernel-method analogue of architecture design.
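The sum and product rules can be checked on a concrete pair of Gram matrices. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(25, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

K_rbf = np.exp(-sq)        # RBF Gram matrix
K_lin = X @ X.T            # linear Gram matrix

K_sum = K_rbf + K_lin      # sum rule
K_prod = K_rbf * K_lin     # product rule: elementwise (Hadamard) product

# Both composites remain PSD, up to floating-point roundoff.
print(np.linalg.eigvalsh(K_sum).min(), np.linalg.eigvalsh(K_prod).min())
```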


5. Core Kernel Algorithms

This section shows how kernels enter algorithms. The common pattern is finite coefficient representation. The learned object may live in a function space, but training solves for a finite vector.

5.1 Kernel Perceptron

The ordinary perceptron learns a linear classifier

f(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^\top \mathbf{x}).

With labels y^{(i)} \in \{-1, +1\}, a mistake update is

\mathbf{w} \leftarrow \mathbf{w} + y^{(i)}\mathbf{x}^{(i)}.

After many updates, the weight vector is a linear combination of training examples:

\mathbf{w} = \sum_{i=1}^n \alpha_i y^{(i)} \mathbf{x}^{(i)}.

In feature space,

\mathbf{w} = \sum_{i=1}^n \alpha_i y^{(i)} \phi(\mathbf{x}^{(i)}).

Prediction becomes

f(\mathbf{x}) = \operatorname{sign} \left( \sum_{i=1}^n \alpha_i y^{(i)} k(\mathbf{x}^{(i)}, \mathbf{x}) \right).

The kernel perceptron update increments \alpha_i when example i is misclassified.

Algorithm:

  1. Initialize \boldsymbol{\alpha} = \mathbf{0}.
  2. For each example i, compute the score s_i = \sum_{j=1}^n \alpha_j y^{(j)} k(\mathbf{x}^{(j)}, \mathbf{x}^{(i)}).
  3. If y^{(i)} s_i \le 0, set \alpha_i \leftarrow \alpha_i + 1.
  4. Repeat for multiple passes.

The kernel perceptron is not usually the best modern kernel classifier. But it is the cleanest algorithmic example of kernelization.
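As a concrete sketch, here is a minimal kernel perceptron on a toy two-ring dataset that is not linearly separable in the input space. The RBF bandwidth, ring radii, and pass count are illustrative choices, not canonical ones:

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # Pairwise RBF kernel values between rows of A and rows of B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kernel_perceptron(X, y, gamma=1.0, passes=10):
    # y must be in {-1, +1}; alpha[i] counts mistakes on example i.
    n = len(y)
    K = rbf(X, X, gamma)
    alpha = np.zeros(n)
    for _ in range(passes):
        for i in range(n):
            s = np.sum(alpha * y * K[:, i])
            if y[i] * s <= 0:
                alpha[i] += 1
    return alpha

# Two concentric rings: radius 1 (label -1) and radius 3 (label +1).
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 40)
X = np.concatenate([np.c_[np.cos(t[:20]), np.sin(t[:20])],
                    3 * np.c_[np.cos(t[20:]), np.sin(t[20:])]])
y = np.concatenate([-np.ones(20), np.ones(20)])

alpha = kernel_perceptron(X, y)
pred = np.sign(rbf(X, X) @ (alpha * y))
print(np.mean(pred == y))
```

A linear perceptron cannot separate the rings, but the same update rule expressed through kernel evaluations can.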

5.2 Support Vector Machines

Support vector machines use kernels to build maximum-margin classifiers. For a feature map \phi, the hard-margin primal problem is

\min_{\mathbf{w}, b} \frac{1}{2}\lVert \mathbf{w} \rVert_{\mathcal{H}}^2

subject to

y^{(i)} \left( \langle \mathbf{w}, \phi(\mathbf{x}^{(i)}) \rangle_{\mathcal{H}} + b \right) \ge 1

for all i.

The dual problem is

\max_{\boldsymbol{\alpha}} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y^{(i)}y^{(j)} k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}),

subject to

\alpha_i \ge 0 \quad \text{and} \quad \sum_{i=1}^n \alpha_i y^{(i)} = 0.

The soft-margin SVM adds slack variables and a penalty parameter C. In the dual, this gives box constraints:

0 \le \alpha_i \le C.

Prediction is

f(\mathbf{x}) = \operatorname{sign} \left( \sum_{i=1}^n \alpha_i y^{(i)} k(\mathbf{x}^{(i)}, \mathbf{x}) + b \right).

Only points with \alpha_i > 0 contribute. These are support vectors.

SVM intuition:

  • the margin is measured in feature-space norm;
  • the kernel determines the feature-space geometry;
  • support vectors are the examples that define the boundary;
  • C controls the tradeoff between margin and training errors;
  • kernel bandwidth controls the flexibility of the boundary.

Important scoping note: this section does not reteach constrained optimization or KKT conditions in full. For the full optimization machinery, see Constrained Optimization.

5.3 Kernel Ridge Regression and Representer Theorem

Kernel ridge regression solves

\min_{f \in \mathcal{H}_k} \sum_{i=1}^n \left( y^{(i)} - f(\mathbf{x}^{(i)}) \right)^2 + \lambda \lVert f \rVert_{\mathcal{H}_k}^2.

The representer theorem says the minimizer has the finite form

f^*(\mathbf{x}) = \sum_{i=1}^n \alpha_i k(\mathbf{x}^{(i)}, \mathbf{x}).

This is remarkable. The optimization is over an infinite-dimensional function space. The solution lies in the span of n kernel sections centered at training inputs.

Let \mathbf{y} \in \mathbb{R}^n collect the labels, and let K be the Gram matrix. Predictions on training data are

\hat{\mathbf{y}} = K\boldsymbol{\alpha}.

The objective becomes

\lVert \mathbf{y} - K\boldsymbol{\alpha} \rVert_2^2 + \lambda \boldsymbol{\alpha}^\top K\boldsymbol{\alpha}.

The closed-form coefficient vector is

\boldsymbol{\alpha} = (K + \lambda I_n)^{-1}\mathbf{y}.

Some conventions use n\lambda instead of \lambda. Always check the objective scaling.

Prediction for a test point \mathbf{x}_* is

f^*(\mathbf{x}_*) = \mathbf{k}_*^\top \boldsymbol{\alpha},

where

\mathbf{k}_* = \begin{bmatrix} k(\mathbf{x}^{(1)}, \mathbf{x}_*) \\ \vdots \\ k(\mathbf{x}^{(n)}, \mathbf{x}_*) \end{bmatrix}.

Regularization interpretation:

  • large \lambda shrinks coefficients and smooths the function;
  • small \lambda fits training data more closely;
  • \lambda also improves conditioning of K + \lambda I_n;
  • when K is nearly singular, the ridge term is numerically essential.
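The closed form above fits in a few lines. This sketch (illustrative bandwidth gamma = 10 and regularization lam = 1e-2) fits noisy samples of a one-dimensional sine function and solves the linear system rather than forming an explicit inverse:

```python
import numpy as np

def rbf(A, B, gamma=10.0):
    # Pairwise RBF kernel between rows of A and rows of B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

# Noisy samples of a nonlinear 1-D function (illustrative setup).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(50, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=50)

lam = 1e-2
K = rbf(X, X)
# Solve (K + lam I) alpha = y; never form the explicit inverse.
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)

# Predict on a test grid via the kernel vector k_*.
X_test = np.linspace(0, 1, 200)[:, None]
f_test = rbf(X_test, X) @ alpha

train_mse = np.mean((K @ alpha - y) ** 2)
print(train_mse)
```

Varying lam and gamma here reproduces the underfitting/overfitting behavior described in the regularization notes.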

5.4 Kernel PCA

PCA finds directions of high variance. Kernel PCA finds high-variance directions in feature space.

Let \phi(\mathbf{x}^{(i)}) be centered in \mathcal{H}. The feature-space covariance operator is

C = \frac{1}{n} \sum_{i=1}^n \phi(\mathbf{x}^{(i)}) \otimes \phi(\mathbf{x}^{(i)}).

Instead of diagonalizing C directly, kernel PCA diagonalizes the centered Gram matrix

K_c = HKH, \qquad H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top.

If

K_c \mathbf{v}_m = \lambda_m \mathbf{v}_m,

then \mathbf{v}_m gives coefficients for the m-th feature-space principal direction.

The embedding coordinate for a point \mathbf{x} is computed through kernel evaluations against training points.

Kernel PCA is useful for:

  • nonlinear dimensionality reduction;
  • visualizing curved manifolds;
  • denoising in feature space;
  • teaching how kernel methods extend linear algorithms.

Important scoping note: this section uses PCA only as a kernelized algorithm. For full PCA theory, see Principal Component Analysis.

Kernel PCA common failure: forgetting to center the kernel matrix. PCA assumes centered data. In kernel PCA, centering must happen in feature space, which is why K_c = HKH appears.
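A minimal kernel PCA sketch on concentric circles, with the centering step made explicit (the bandwidth and dataset are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# Two concentric circles: a classic nonlinear structure.
t = rng.uniform(0, 2 * np.pi, 100)
r = np.concatenate([np.ones(50), 3 * np.ones(50)])
X = np.c_[r * np.cos(t), r * np.sin(t)]

gamma = 0.5
d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-gamma * d2)

n = len(X)
H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
Kc = H @ K @ H                        # centering in feature space

# Top eigenvectors of the centered Gram matrix give the kernel PCs.
evals, evecs = np.linalg.eigh(Kc)
evals, evecs = evals[::-1], evecs[:, ::-1]
# Training-point coordinates on the first two components.
Z = evecs[:, :2] * np.sqrt(np.maximum(evals[:2], 0))
print(Z.shape)
```

Skipping the Kc = H @ K @ H step is exactly the "forgot to center" failure: the leading component then mostly reflects the kernel's mean, not the data's variance structure.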

5.5 Gaussian Processes

A Gaussian process is a distribution over functions. It is written

f \sim \mathcal{GP}(m, k),

where m is the mean function and k is the covariance kernel.

For any finite set of inputs, the function values are jointly Gaussian:

\begin{bmatrix} f(\mathbf{x}^{(1)}) \\ \vdots \\ f(\mathbf{x}^{(n)}) \end{bmatrix} \sim \mathcal{N}(\mathbf{m}, K).

Here

m_i = m(\mathbf{x}^{(i)})

and

K_{ij} = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}).

For noisy observations

y^{(i)} = f(\mathbf{x}^{(i)}) + \epsilon_i,

with

\epsilon_i \sim \mathcal{N}(0, \sigma_n^2),

the predictive mean at a test input \mathbf{x}_* is

\mu_* = \mathbf{k}_*^\top (K + \sigma_n^2 I_n)^{-1} \mathbf{y}.

The predictive variance is

\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I_n)^{-1} \mathbf{k}_*.

Gaussian process regression has the same algebraic core as kernel ridge regression. The GP adds probabilistic interpretation and uncertainty.

Important scoping note: this section gives the kernel-method view of Gaussian processes. For full Bayesian inference background, see Bayesian Inference.

GP kernel interpretation:

  • the kernel is a covariance function;
  • nearby points under the kernel have correlated function values;
  • the lengthscale controls how quickly correlation decays;
  • the posterior mean interpolates or smooths observations;
  • the posterior variance grows away from observed data.

This uncertainty is one reason Gaussian processes remain useful in Bayesian optimization, active learning, calibration, and scientific modeling.
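The predictive equations translate directly into code. This sketch (illustrative RBF lengthscale and noise level) computes the GP posterior mean and variance via a Cholesky factorization with a small jitter, following the formulas above:

```python
import numpy as np

def rbf(A, B, lengthscale=0.3):
    # 1-D RBF kernel with unit prior variance: k(x, x) = 1.
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, 10)
y = np.sin(X) + 0.05 * rng.normal(size=10)
sigma_n = 0.05

K = rbf(X, X) + sigma_n**2 * np.eye(10)
# Cholesky is the stable way to apply (K + sigma_n^2 I)^{-1}.
L = np.linalg.cholesky(K + 1e-10 * np.eye(10))  # tiny numerical jitter
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

X_star = np.linspace(-4, 4, 100)
K_star = rbf(X_star, X)                 # cross-covariances k_*
mu = K_star @ alpha                     # posterior mean
v = np.linalg.solve(L, K_star.T)
var = 1.0 - np.sum(v**2, axis=0)        # posterior variance, k(x*, x*) = 1
print(var.min(), var.max())
```

Near the training inputs the variance collapses toward the noise level; outside [-3, 3] it climbs back toward the prior variance of 1, which is the behavior the bullet list describes.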


6. Computation and Scalability

Kernel methods are elegant, but the kernel matrix is expensive. This section treats computation as part of the mathematics, not an implementation afterthought.

6.1 Kernel Matrix Cost

For n training examples, the Gram matrix has n^2 entries. Memory cost is

O(n^2).

Naive construction cost is

O(n^2 d)

for vector inputs in \mathbb{R}^d when each kernel evaluation costs O(d).

Solving exact kernel ridge regression by dense linear algebra costs

O(n^3)

in general.

Prediction for one test point costs

O(n)

kernel evaluations unless the model is sparse or approximated.

This is the core scalability problem. The feature space may be infinite-dimensional, but the training set creates an n \times n object.

Practical thresholds:

  • n = 10^3 is usually easy;
  • n = 10^4 is possible with care;
  • n = 10^5 is expensive for exact kernels;
  • n = 10^6 usually requires approximation, structure, or distributed methods.

The bottleneck is often memory before arithmetic. A float64 Gram matrix for n = 100{,}000 would need roughly 80 GB just for entries.

6.2 Conditioning and Regularization

Kernel matrices can be ill-conditioned. This happens when:

  • points are nearly duplicated;
  • the bandwidth is too large;
  • the bandwidth is too small;
  • features are poorly scaled;
  • the kernel has rapidly decaying eigenvalues.

Condition number is

\kappa(K) = \frac{\lambda_{\max}(K)}{\lambda_{\min}(K)}

when K is positive definite.

For kernel ridge regression, the system is

(K + \lambda I_n)\boldsymbol{\alpha} = \mathbf{y}.

The ridge shift changes eigenvalues:

\lambda_m(K + \lambda I_n) = \lambda_m(K) + \lambda.

So regularization improves conditioning. This is both statistical and numerical.

Numerical practices:

  • standardize input features before distance-based kernels;
  • tune bandwidth and regularization jointly;
  • add jitter when doing Gaussian process Cholesky factorization;
  • prefer Cholesky solves over explicit matrix inverse;
  • monitor eigenvalues of K or K + \lambda I_n.

In GP code, a small jitter term is often written as

K_{\text{stable}} = K + (\sigma_n^2 + \epsilon)I_n.

The \epsilon term is not a probabilistic modeling claim. It is numerical stabilization.
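The eigenvalue-shift effect is easy to see numerically. In this sketch (nearly duplicated inputs are an assumed worst case), the ridge shift moves every eigenvalue up by lam and sharply reduces the condition number:

```python
import numpy as np

rng = np.random.default_rng(4)
# Nearly duplicated inputs make an RBF Gram matrix nearly singular.
X = np.repeat(rng.normal(size=(5, 1)), 4, axis=0) + 1e-6 * rng.normal(size=(20, 1))
K = np.exp(-(X - X.T) ** 2)

lam = 1e-6
K_ridge = K + lam * np.eye(20)

# The ridge shift moves each eigenvalue up by exactly lam ...
shift_ok = np.allclose(np.linalg.eigvalsh(K_ridge),
                       np.linalg.eigvalsh(K) + lam)
# ... which drops the condition number by orders of magnitude here.
improved = np.linalg.cond(K_ridge) < np.linalg.cond(K)
print(shift_ok, improved)
```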

6.3 Nystrom Approximation

The Nystrom approximation uses a subset of m \ll n landmark points. Partition the kernel matrix conceptually into

K_{nm} \in \mathbb{R}^{n \times m}

and

K_{mm} \in \mathbb{R}^{m \times m}.

The approximate Gram matrix is

K \approx K_{nm}K_{mm}^{\dagger}K_{mn}.

If K_{mm} is regularized,

K \approx K_{nm}(K_{mm} + \epsilon I_m)^{-1}K_{mn}.

The approximation has rank at most m. It reduces storage from O(n^2) to roughly O(nm).

Landmark choices:

  • uniform random sampling;
  • k-means centers;
  • leverage-score sampling;
  • inducing points optimized by a model;
  • domain-specific prototypes.

Nystrom intuition: represent all points by their similarity to landmarks. Then reconstruct the full similarity structure through the landmark-landmark matrix.

This is a low-rank approximation to kernel geometry. It is related to matrix approximation, inducing-point Gaussian processes, and scalable spectral methods.
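A minimal Nystrom sketch with uniformly sampled landmarks (the dataset size, landmark count, bandwidth, and jitter eps are illustrative):

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(5)
n, m = 500, 50
X = rng.normal(size=(n, 2))
landmarks = X[rng.choice(n, m, replace=False)]  # uniform random landmarks

K_nm = rbf(X, landmarks)
K_mm = rbf(landmarks, landmarks)
# Regularized Nystrom reconstruction; eps is a numerical jitter.
eps = 1e-8
K_approx = K_nm @ np.linalg.solve(K_mm + eps * np.eye(m), K_nm.T)

K_exact = rbf(X, X)
rel_err = np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact)
print(rel_err)
```

Only the n-by-m and m-by-m blocks are ever formed, which is the storage saving the text describes.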

6.4 Random Fourier Features

Random Fourier features approximate shift-invariant kernels. Bochner's theorem says a continuous shift-invariant kernel

k(\mathbf{x}, \mathbf{z}) = \kappa(\mathbf{x} - \mathbf{z})

is positive semidefinite if and only if \kappa is the Fourier transform of a nonnegative measure.

For the RBF kernel, sample

\boldsymbol{\omega}_r \sim \mathcal{N}(\mathbf{0}, 2\gamma I_d)

and

b_r \sim \mathcal{U}(0, 2\pi).

Define random features

z_r(\mathbf{x}) = \sqrt{\frac{2}{D}} \cos(\boldsymbol{\omega}_r^\top \mathbf{x} + b_r).

Then

z(\mathbf{x})^\top z(\mathbf{z}) \approx k(\mathbf{x}, \mathbf{z}).

Now a kernel method can be approximated by a linear method in D random features. Training cost shifts from Gram-matrix algorithms to linear-model algorithms.

Benefits:

  • storage O(nD) instead of O(n^2);
  • prediction O(D) instead of O(n);
  • compatibility with SGD;
  • easy integration into neural networks.

Tradeoffs:

  • approximation variance decreases as D grows;
  • poor random features need many dimensions;
  • input scaling still matters;
  • structured random features may be faster but more complex.

Random Fourier features are a key example of turning an implicit kernel feature map into an explicit approximate feature map.
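The construction above can be checked directly. This sketch (illustrative sizes n, d, D and bandwidth gamma) compares the exact RBF Gram matrix with its random-feature approximation:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, D, gamma = 200, 3, 2000, 0.5
X = rng.normal(size=(n, d))

# Exact RBF Gram matrix: k(x, z) = exp(-gamma ||x - z||^2).
d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-gamma * d2)

# Random Fourier features for the same kernel.
W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))  # omega_r ~ N(0, 2 gamma I)
b = rng.uniform(0, 2 * np.pi, D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)

K_hat = Z @ Z.T
print(np.abs(K - K_hat).max())
```

Rerunning with larger D shrinks the error roughly like 1/sqrt(D), which is the variance tradeoff listed above.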

6.5 When Kernels Are Practical

Kernels are practical when the data scale and inductive bias fit the method.

Good kernel settings:

  • small to medium datasets;
  • expensive labels;
  • tabular or scientific data;
  • uncertainty matters;
  • smoothness or periodicity assumptions are strong;
  • a domain-specific similarity is available;
  • convex training is preferred;
  • interpretability of inductive bias matters.

Poor kernel settings:

  • massive datasets without approximation;
  • high-throughput online prediction with many support vectors;
  • raw image/text tasks where representation learning dominates;
  • settings where the right similarity must be learned from huge data;
  • datasets with severe feature scaling problems and no preprocessing.

The practical question is not "kernels or deep learning?" The practical question is "where should similarity be specified, where should it be learned, and what scale must the algorithm handle?"

Modern hybrids answer this in mixed ways:

  • deep kernel learning uses neural features inside a kernel;
  • Gaussian processes can sit on top of learned embeddings;
  • random features can act as a fixed feature layer;
  • NTK uses kernels to analyze neural training;
  • retrieval systems use learned embeddings but often kernel-like similarity.

7. Kernel Methods in Modern AI

Kernels remain relevant because similarity remains central. Deep learning changed how similarity is learned, but it did not remove the need to understand similarity geometry.

7.1 Similarity Search and Embeddings

Embedding systems map objects into vectors. Semantic search, recommendation, contrastive learning, and retrieval-augmented generation often use similarity scores such as dot product or cosine similarity.

A normalized dot product is a linear kernel on normalized embeddings:

k(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}^\top \mathbf{v}}{\lVert \mathbf{u} \rVert_2 \lVert \mathbf{v} \rVert_2}.

This does not mean every retrieval system is a kernel machine. The embeddings are learned by a neural model, and approximate nearest-neighbor search is an engineering system. But the similarity score is still a geometric object.

Kernel thinking asks:

  • what invariances should similarity respect?
  • what scale makes two items close?
  • is the similarity symmetric?
  • is the implied Gram matrix PSD?
  • does normalization help or hurt?
  • does the similarity align with downstream loss?

These questions are useful even when the final system is not a classical kernel method.

7.2 Kernels and Attention

Transformer attention uses scores

S = \frac{QK^\top}{\sqrt{d_k}}.

Each score is an inner product between a query vector and a key vector. Softmax then converts scores into weights.

This is not an RKHS kernel method in the classical sense because:

  • queries and keys are learned and input-dependent;
  • the attention matrix is usually not symmetric;
  • softmax normalization is row-wise;
  • attention produces weighted values, not only scalar function prediction.

Still, kernel ideas are nearby. Linear attention methods often replace softmax attention with feature maps satisfying

\exp(\mathbf{q}^\top \mathbf{k}) \approx \phi(\mathbf{q})^\top \phi(\mathbf{k}).

This is a kernel approximation idea. It converts attention into associative linear algebra:

\operatorname{Attn}(Q,K,V) \approx \phi(Q)(\phi(K)^\top V).

The kernel lesson: attention is not "just a kernel," but kernel feature maps help explain and approximate some attention mechanisms.
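The associativity point can be shown in a few lines. The feature map below (elu(x) + 1) is one choice used in some linear-attention work and is an assumption here, and softmax-style normalization is omitted for clarity:

```python
import numpy as np

rng = np.random.default_rng(7)
T, d = 6, 4
Q, K, V = rng.normal(size=(3, T, d))

def phi(M):
    # Illustrative positive feature map (elu(x) + 1); an assumption,
    # not the only possible choice.
    return np.where(M > 0, M + 1.0, np.exp(M))

# Quadratic-in-T order of operations: (phi(Q) phi(K)^T) V.
A = (phi(Q) @ phi(K).T) @ V
# Linear-in-T order: phi(Q) (phi(K)^T V) — same result by associativity.
B = phi(Q) @ (phi(K).T @ V)
print(np.allclose(A, B))
```

The second grouping never materializes the T-by-T attention matrix, which is the entire computational point of kernelized attention.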

7.3 Neural Tangent Kernel

Consider a neural network f(\mathbf{x}; \boldsymbol{\theta}). The neural tangent kernel is

\Theta(\mathbf{x}, \mathbf{z}) = \nabla_{\boldsymbol{\theta}} f(\mathbf{x}; \boldsymbol{\theta})^\top \nabla_{\boldsymbol{\theta}} f(\mathbf{z}; \boldsymbol{\theta}).

This is a kernel over inputs induced by parameter gradients. It measures how a parameter update driven by \mathbf{x} also changes the prediction at \mathbf{z}.

In the infinite-width limit for certain architectures and initializations, the NTK remains nearly constant during training. Gradient descent then behaves like kernel regression with kernel \Theta.

NTK interpretation:

  • if \Theta(\mathbf{x}, \mathbf{z}) is large, learning on \mathbf{x} strongly changes the prediction at \mathbf{z};
  • eigenvalues of the NTK affect learning speed along target-function directions;
  • wide-network training can be analyzed through kernel dynamics;
  • feature learning is limited in the lazy-training regime.

Important warning: finite neural networks can learn representations in ways the fixed NTK does not capture. The NTK is a baseline and an analysis tool, not a complete theory of all deep learning.
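An empirical NTK is easy to compute for a tiny network. This sketch (an assumed two-layer tanh network with hand-derived parameter gradients) builds the NTK Gram matrix and checks that it is PSD:

```python
import numpy as np

rng = np.random.default_rng(8)

# Tiny two-layer network f(x) = a^T tanh(W x); parameters theta = (W, a).
d, h, n = 2, 5, 8
W = rng.normal(size=(h, d))
a = rng.normal(size=h)
X = rng.normal(size=(n, d))

def grad_f(x):
    # Gradient of the scalar output with respect to all parameters.
    z = W @ x
    dW = np.outer(a * (1 - np.tanh(z) ** 2), x)  # df/dW
    da = np.tanh(z)                               # df/da
    return np.concatenate([dW.ravel(), da])

G = np.stack([grad_f(x) for x in X])  # rows are parameter gradients
Theta = G @ G.T                       # empirical NTK Gram matrix

eigs = np.linalg.eigvalsh(Theta)
print(eigs.min())
```

Because Theta is literally a Gram matrix of gradient vectors, its eigenvalues are nonnegative up to floating-point tolerance.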

7.4 Deep Kernel Learning

Deep kernel learning composes a neural feature extractor with a kernel. Let

g_{\boldsymbol{\theta}} : \mathcal{X} \to \mathbb{R}^m

be a neural network. Define

k_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) = k_0(g_{\boldsymbol{\theta}}(\mathbf{x}), g_{\boldsymbol{\theta}}(\mathbf{z})).

If k_0 is a valid kernel, then k_{\boldsymbol{\theta}} is a valid kernel for fixed \boldsymbol{\theta}.

Deep kernel learning separates two roles:

  • the neural network learns a representation;
  • the kernel provides smoothness, uncertainty, or nonparametric structure in representation space.

This hybrid is useful when:

  • raw inputs need learned features;
  • uncertainty estimates are still desired;
  • data is too structured for a hand-designed kernel;
  • the final task benefits from GP-style posterior behavior.

The same idea appears in practical systems that train embeddings and then run kernel-like methods or GP heads on top.

7.5 Uncertainty and Small-Data Learning

Deep networks often provide strong predictions but weak calibrated uncertainty without extra machinery. Gaussian processes and Bayesian kernel methods provide uncertainty by construction.

In small-data settings, kernel priors are valuable because assumptions matter more than scale. A well-chosen kernel can encode:

  • smoothness in physical space;
  • periodicity in time;
  • conservation-inspired structure;
  • similarity between molecules;
  • correlation across tasks;
  • graph structure;
  • known invariances.

This is why kernels remain common in:

  • Bayesian optimization;
  • active learning;
  • scientific machine learning;
  • spatial statistics;
  • molecular property prediction;
  • calibration and uncertainty modeling;
  • low-data regression.

Kernels are less dominant when raw representation learning is the bottleneck. But when the representation is known or learned elsewhere, kernel methods can be excellent final-layer learners.


8. Common Mistakes

| # | Mistake | Why it is wrong | Fix |
| --- | --- | --- | --- |
| 1 | Treating any similarity as a kernel. | RKHS theory requires positive semidefinite Gram matrices, not just intuitive similarity. | Check PSD conditions or use known valid kernel constructions. |
| 2 | Thinking pairwise positive values imply PSD. | A matrix can have all positive entries and still have a negative eigenvalue. | Test \mathbf{c}^\top K\mathbf{c} or eigenvalues for finite examples. |
| 3 | Forgetting feature scaling before RBF kernels. | Distance-based kernels are dominated by large-scale coordinates. | Standardize inputs or use anisotropic lengthscales. |
| 4 | Choosing RBF bandwidth blindly. | Too small gives memorization; too large gives nearly constant similarity. | Tune bandwidth with regularization. |
| 5 | Forgetting to center K for kernel PCA. | PCA requires centered features; uncentered kernels distort components. | Use K_c = HKH. |
| 6 | Computing explicit inverses. | Explicit inverse is slower and less stable than solving a linear system. | Use Cholesky or linear solves for K + \lambda I_n. |
| 7 | Assuming kernel methods avoid overfitting. | Rich kernels can interpolate noise. | Use validation, regularization, and bandwidth control. |
| 8 | Confusing SVM sparsity with KRR density. | SVM predictions use support vectors; KRR usually uses all training points. | Choose algorithm based on prediction cost and objective. |
| 9 | Treating GP variance as generic model confidence. | GP uncertainty is conditional on the kernel, likelihood, and training assumptions. | Interpret uncertainty relative to the model assumptions. |
| 10 | Ignoring O(n^2) memory. | Exact kernels become impossible at large n. | Use Nystrom, random features, inducing points, or linear models. |
| 11 | Reteaching PCA inside kernel PCA. | PCA has its own canonical section. | Use kernel PCA as an application and point to the PCA section. |
| 12 | Thinking NTK explains all deep learning. | NTK mainly describes lazy or infinite-width regimes. | Use NTK as a baseline, not a complete representation-learning theory. |

9. Exercises

  1. Exercise 1 (Easy): Build and Inspect Gram Matrices Given a small dataset in \mathbb{R}^2, compute linear, polynomial, and RBF Gram matrices. Verify symmetry. Compute eigenvalues. Explain which matrices are PSD and why.

  2. Exercise 2 (Easy): Explicit Polynomial Feature Map For \mathbf{x} = (x_1, x_2)^\top, construct a feature map whose inner product equals (\mathbf{x}^\top \mathbf{z} + 1)^2. Verify the identity numerically for random points.

  3. Exercise 3 (Easy): Kernel Normalization and Centering Normalize an RBF Gram matrix. Center it using H = I_n - \mathbf{1}\mathbf{1}^\top/n. Verify that the centered matrix has approximately zero row and column sums.

  4. Exercise 4 (Medium): Kernel Ridge Regression Implement kernel ridge regression from scratch. Fit noisy samples of a nonlinear one-dimensional function. Vary \lambda and the RBF bandwidth. Explain underfitting, good fit, and overfitting.

  5. Exercise 5 (Medium): Kernel PCA Generate a two-dimensional nonlinear dataset such as concentric circles. Compute an RBF kernel. Center the kernel. Extract the first two kernel principal components. Compare with linear PCA intuition.

  6. Exercise 6 (Medium): Gaussian Process Posterior Implement the GP posterior mean and variance for one-dimensional regression. Use an RBF kernel. Plot uncertainty away from observed points. Explain the role of noise variance.

  7. Exercise 7 (Hard): Random Fourier Features Approximate the RBF kernel with random Fourier features. Measure approximation error as the number of random features grows. Train a linear ridge model on random features and compare it with exact KRR.

  8. Exercise 8 (Hard): Neural Tangent Kernel Define a small two-layer neural network. Compute the empirical NTK for a finite dataset using parameter gradients or finite differences. Verify that the NTK Gram matrix is PSD up to numerical tolerance. Interpret what large NTK entries mean.


10. Why This Matters for AI

| Kernel concept | AI impact |
| --- | --- |
| Feature map | Explains how nonlinear representation can make linear learning powerful. |
| Gram matrix | Central object for pairwise similarity, spectral methods, and kernel algorithms. |
| PSD condition | Separates valid Hilbert-space kernels from arbitrary similarity scores. |
| RKHS | Provides the function-space setting for regularized learning. |
| Reproducing property | Turns function evaluation into inner-product geometry. |
| Representer theorem | Explains why infinite-dimensional optimization reduces to finite coefficients. |
| RBF bandwidth | Controls locality, smoothness, memorization, and generalization. |
| Kernel ridge regression | Gives closed-form nonlinear regression with regularization. |
| SVM dual | Shows how margins can be learned through kernel evaluations. |
| Kernel PCA | Extends linear spectral methods to nonlinear feature spaces. |
| Gaussian process kernels | Encode prior covariance and provide uncertainty estimates. |
| Nystrom approximation | Makes large kernel matrices manageable through landmarks. |
| Random Fourier features | Converts kernel learning into scalable linear learning. |
| Neural tangent kernel | Gives a mathematical baseline for wide-network training dynamics. |
| Deep kernel learning | Combines learned representations with kernel uncertainty or smoothness. |

The durable lesson is: kernels turn similarity assumptions into learnable geometry.

In deep learning, the representation is often learned. In kernel methods, the representation geometry is often specified. Hybrid methods do both.

For LLMs and large AI systems, kernels are most visible in theory, retrieval similarity, approximation methods, uncertainty heads, and small-data or scientific layers around larger learned systems. The concepts remain load-bearing even when the production model is not a classical SVM.


11. Conceptual Bridge

This section completes the functional-analysis arc from normed spaces to Hilbert spaces to kernel methods. Normed spaces introduced size and convergence. Hilbert spaces added inner products, orthogonality, and projection. Kernel methods use Hilbert-space geometry without explicitly constructing the Hilbert-space coordinates.

Backward connection: from Normed Spaces, we use norms and regularization. From Hilbert Spaces, we use inner products and RKHS language. From Positive Definite Matrices, we use PSD Gram matrices. From Constrained Optimization, we use duality in SVMs.

Forward connection: kernel methods prepare you for ML-specific losses, activation functions, normalization, and sampling. They also prepare you to read Gaussian-process models, spectral generalization analyses, kernelized attention approximations, and NTK papers.

Curriculum position:

Functional Analysis

01 Normed Spaces
    |
    v
02 Hilbert Spaces
    |
    v
03 Kernel Methods
    |
    v
ML-Specific Math
    |
    +-- loss functions
    +-- activations
    +-- normalization
    +-- sampling

The conceptual shift is important: we began with spaces of vectors. We moved to spaces of functions. Kernels let us compute with those function spaces through finite matrices.

That is the deepest practical idea in this section. Infinite-dimensional thinking can produce finite algorithms.


12. References

  • Aronszajn, N. (1950). Theory of Reproducing Kernels. Transactions of the American Mathematical Society.
  • Cortes, C. and Vapnik, V. (1995). Support-Vector Networks. Machine Learning.
  • Scholkopf, B., Herbrich, R., and Smola, A. J. (2001). A Generalized Representer Theorem.
  • Scholkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press.
  • Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning.
  • Rahimi, A. and Recht, B. (2007). Random Features for Large-Scale Kernel Machines.
  • Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks.
  • Stanford CS229 notes on kernels and support vector machines.
  • Cornell CS4780 lecture notes on kernel methods.
  • scikit-learn documentation on kernel approximation and kernel methods.