"The art of interpolation is the art of making the most of what you know to say something useful about what you don't."
- Attributed to Carl Friedrich Gauss
Overview
Given a set of data points, interpolation constructs a function that passes exactly through every point, while approximation finds the "best" function from some class that fits the data in a least-squares or minimax sense. These are among the oldest problems in numerical mathematics, yet they remain central to modern AI: positional encodings in transformers use sinusoidal interpolation, kernel methods are approximation schemes, and neural networks themselves are universal approximators whose expressive power is quantified by approximation theory.
The critical insight is that the choice of basis determines everything. Monomials ($1, x, x^2, \dots$) are algebraically simple but numerically catastrophic (Vandermonde matrices are notoriously ill-conditioned). Chebyshev polynomials, derived from trigonometry, yield interpolation nodes that minimize the worst-case node polynomial - near-minimax interpolation, and the standard cure for Runge's phenomenon. Splines trade global accuracy for local control, achieving smooth, stable interpolation at the cost of piecewise complexity. Each basis encodes a different prior about the function being approximated.
This section covers polynomial interpolation, Chebyshev theory, splines, least-squares approximation, Fourier approximation, and radial basis functions, with explicit connections to ML applications including kernel methods, positional encodings, and neural function approximation.
Prerequisites
- Real analysis: continuity, differentiability, Taylor's theorem (Section01-Mathematical-Foundations)
- Linear algebra: matrix factorization, least squares, condition numbers (Section02-Linear-Algebra-Basics, Section10-02)
- Floating-point arithmetic: rounding error, numerical stability (Section10-01)
- Norms and inner products (Section02-Linear-Algebra-Basics)
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive derivations: Lagrange interpolation, Chebyshev nodes, spline construction, least-squares fitting, Fourier series |
| exercises.ipynb | 10 graded exercises from Vandermonde conditioning to neural tangent kernel approximation |
Learning Objectives
After completing this section, you will:
- Construct polynomial interpolants using Lagrange and Newton divided-difference forms
- Explain Runge's phenomenon and why equispaced nodes are dangerous for high-degree interpolation
- Derive Chebyshev nodes as the optimal interpolation points minimizing the node polynomial
- Implement natural and clamped cubic splines by solving a tridiagonal linear system
- Apply least-squares polynomial fitting via QR decomposition and the normal equations
- Compute discrete Fourier series and use FFT for fast polynomial evaluation
- Understand radial basis function (RBF) interpolation and its connection to kernel methods
- Connect approximation theory to neural network universality and positional encodings in transformers
Table of Contents
- 1. Intuition and Motivation
- 2. Polynomial Interpolation
- 3. Chebyshev Polynomials and Optimal Nodes
- 4. Spline Interpolation
- 5. Least-Squares Approximation
- 6. Fourier Approximation
- 7. Radial Basis Functions and Kernel Methods
- 8. Applications in Machine Learning
- 9. Common Mistakes
- 10. Exercises
- 11. Why This Matters for AI (2026 Perspective)
- 12. Conceptual Bridge
1. Intuition and Motivation
1.1 The Interpolation Problem
Interpolation: Given data pairs $(x_i, y_i)$, $i = 0, \dots, n$, with distinct nodes $x_i$, find a function $f$ such that $f(x_i) = y_i$ for all $i$.
Approximation: Find a function $g$ from some class $\mathcal{G}$ that minimizes an error norm $\|f - g\|$ over the domain, possibly without passing through every data point.
The fundamental tension:
- Interpolation (zero training error) risks overfitting and instability (Runge's phenomenon)
- Approximation (nonzero training error) trades exactness for stability and generalization
This is exactly the bias-variance tradeoff in machine learning, seen through the lens of function approximation.
For AI: Every neural network layer computes a learned nonlinear function. Understanding which function classes can be represented and how efficiently is approximation theory. The expressive power of ReLU networks, the smoothness of GELU activations, and the positional encoding design in transformers all draw from this theory.
FUNCTION APPROXIMATION LANDSCAPE
========================================================================
Data: (x_0,y_0), ..., (x_n,y_n)
                  |
           What class of f?
                  |
          +-------+-------+
          |               |
        Global          Local
          |               |
     Polynomial       Piecewise
     (Lagrange,       (Splines,
      Chebyshev)      B-splines)
                          |
                    +-----+-----+
                    |           |
                 Smooth     Nonsmooth
                 (cubic)    (linear,
                             B-spline)
Special structures:
- Trigonometric: periodic functions, FFT
- RBF/kernel: scattered data, RKHS, GP
- Neural: composition, ReLU networks
========================================================================
1.2 Why Approximation Matters for AI
1. Positional encodings: The original transformer (Vaswani et al., 2017) uses sinusoidal positional encodings $PE_{(pos,2i)} = \sin(pos/10000^{2i/d})$, $PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d})$. This is a trigonometric interpolation scheme allowing the model to interpolate positions not seen during training.
2. Kernel methods and SVMs: The kernel trick computes inner products in a feature space implicitly. The representer theorem shows that the optimal solution lies in the span of the training data - exactly the RBF interpolant form.
3. Neural network expressiveness: The universal approximation theorem states that a single hidden layer with enough neurons can approximate any continuous function on a compact set. The rate of approximation depends on the function's smoothness - exactly what approximation theory quantifies.
4. Learned embeddings: Word embeddings, graph node embeddings, and molecule representations are all continuous approximations of discrete objects - the embedding space provides an interpolation structure.
5. Physics-informed neural networks (PINNs): These solve PDEs by fitting a neural network as the solution function, treating the problem as function approximation subject to differential equation constraints.
1.3 Historical Timeline
INTERPOLATION AND APPROXIMATION: HISTORICAL TIMELINE
========================================================================
1711 - Newton: divided differences; systematic polynomial interpolation
1795 - Gauss: least-squares method for orbit of Ceres
1805 - Legendre: publishes the method of least squares; orthogonal polynomials
1853 - Chebyshev: polynomials with equioscillation, minimax approximation
1901 - Runge: demonstrates divergence of high-degree equispaced interpolation
1906 - Lebesgue: Lebesgue constant quantifies interpolation error
1946 - Schoenberg: B-splines introduced for smoothing problems
1965 - Cooley-Tukey: Fast Fourier Transform algorithm
1970s - de Boor: stable B-spline evaluation algorithms
1980s - Meyer, Mallat, Daubechies: wavelet theory as multi-resolution approximation
1989 - Cybenko, Hornik: universal approximation theorems for neural nets
2017 - Vaswani et al.: sinusoidal positional encodings in transformers
2018 - Jacot et al.: neural tangent kernel as linearized approximation
2020 - Tancik et al.: Fourier features for neural radiance fields (NeRF)
2021 - Su et al.: rotary positional encodings (RoPE), later used in LLaMA
========================================================================
2. Polynomial Interpolation
2.1 Lagrange Form
Theorem (Existence and Uniqueness): For any $n+1$ distinct nodes $x_0, \dots, x_n$ and values $y_0, \dots, y_n$, there exists a unique polynomial $p_n$ (degree $\le n$) satisfying $p_n(x_i) = y_i$ for all $i$.
Proof sketch: The interpolation conditions give a linear system $V\mathbf{a} = \mathbf{y}$ where $V$ is the Vandermonde matrix with $V_{ij} = x_i^j$. Since the $x_i$ are distinct, $V$ is invertible (its determinant is $\prod_{i<j}(x_j - x_i) \neq 0$). Uniqueness follows.
Lagrange basis polynomials: Define
$$\ell_j(x) = \prod_{k \neq j} \frac{x - x_k}{x_j - x_k}$$
Each $\ell_j$ is a degree-$n$ polynomial satisfying $\ell_j(x_i) = \delta_{ij}$ (Kronecker delta). The Lagrange interpolating polynomial is:
$$p_n(x) = \sum_{j=0}^{n} y_j\,\ell_j(x)$$
Barycentric Lagrange Form: The naive Lagrange form requires $O(n^2)$ operations to evaluate. The barycentric form is numerically stable and requires only $O(n)$ per evaluation after $O(n^2)$ preprocessing:
$$p_n(x) = \frac{\sum_j \frac{w_j}{x - x_j}\,y_j}{\sum_j \frac{w_j}{x - x_j}}, \qquad w_j = \frac{1}{\prod_{k \neq j}(x_j - x_k)}$$
The division by the denominator (which equals 1 by the Lagrange interpolation of the constant function $f \equiv 1$) provides automatic normalization and excellent numerical stability - it is one of the most stable algorithms in numerical analysis.
For AI: The attention mechanism in transformers can be viewed as a weighted sum where the attention weights $\alpha_{ij}$ play a role analogous to the Lagrange basis weights $\ell_j(x)$. Both select how much each "stored value" contributes to the output.
2.2 Newton Divided Differences
The Newton form expresses $p_n$ in a basis adapted for sequential addition of data points:
$$p_n(x) = f[x_0] + f[x_0,x_1](x - x_0) + \dots + f[x_0,\dots,x_n](x - x_0)\cdots(x - x_{n-1})$$
Divided differences are defined recursively:
$$f[x_i] = y_i, \qquad f[x_i,\dots,x_{i+k}] = \frac{f[x_{i+1},\dots,x_{i+k}] - f[x_i,\dots,x_{i+k-1}]}{x_{i+k} - x_i}$$
Key properties:
- Symmetric: $f[x_0,\dots,x_k]$ is symmetric in all its arguments
- Derivative connection: $f[x_0,\dots,x_k] = f^{(k)}(\xi)/k!$ for some $\xi$ in the convex hull of the nodes
- Incremental: Adding a new data point extends the Newton form by one term - an $O(n)$ update vs $O(n^2)$ for recomputing the Lagrange weights
Newton divided difference table:
x_0 | y_0
| [y_0, y_1]
x_1 | y_1 [y_0, y_1, y_2]
| [y_1, y_2] [y_0, y_1, y_2, y_3]
x_2 | y_2 [y_1, y_2, y_3]
| [y_2, y_3]
x_3 | y_3
Horner's rule evaluates the Newton form in $O(n)$ operations:
p = coeffs[n]
for i in range(n-1, -1, -1):
p = p * (x - nodes[i]) + coeffs[i]
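A runnable sketch of the full pipeline - building the divided-difference coefficients, then applying Horner's rule (assumes NumPy; the names divided_coeffs and newton_eval are illustrative, not from the companion notebooks):
import numpy as np
def divided_coeffs(nodes, vals):
    """Top edge of the divided-difference table: f[x_0], f[x_0,x_1], ..."""
    c = np.array(vals, dtype=float)
    for k in range(1, len(nodes)):
        # After step k, c[i] holds f[x_{i-k}, ..., x_i]
        c[k:] = (c[k:] - c[k-1:-1]) / (nodes[k:] - nodes[:-k])
    return c
def newton_eval(coeffs, nodes, x):
    """Horner-style evaluation of the Newton form at x."""
    p = coeffs[-1]
    for i in range(len(coeffs) - 2, -1, -1):
        p = p * (x - nodes[i]) + coeffs[i]
    return p
nodes = np.array([0.0, 1.0, 2.0, 3.0])
c = divided_coeffs(nodes, np.sin(nodes))
print(newton_eval(c, nodes, 1.5), np.sin(1.5))  # interpolant vs true value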
2.3 The Interpolation Error Theorem
Theorem: Let $f \in C^{n+1}[a,b]$ and let $p_n$ be its interpolant at distinct nodes $x_0,\dots,x_n \in [a,b]$. Then for any $x \in [a,b]$, there exists $\xi \in (a,b)$ such that:
$$f(x) - p_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\,\omega_{n+1}(x)$$
where $\omega_{n+1}(x) = \prod_{i=0}^{n}(x - x_i)$ is the node polynomial.
Proof sketch: Define $\varphi(t) = f(t) - p_n(t) - \lambda\,\omega_{n+1}(t)$ where $\lambda$ is chosen so that $\varphi(x) = 0$. Then $\varphi$ has $n+2$ zeros: $x, x_0, \dots, x_n$. By Rolle's theorem, $\varphi'$ has $n+1$ zeros, $\varphi''$ has $n$ zeros, ..., $\varphi^{(n+1)}$ has 1 zero $\xi$. Computing $\varphi^{(n+1)}(\xi) = 0$ gives $\lambda = f^{(n+1)}(\xi)/(n+1)!$.
Implications:
- The error is controlled by $\max_x |\omega_{n+1}(x)|$ - choosing nodes to minimize this is Chebyshev's contribution
- Smooth functions ($\|f^{(n+1)}\|_\infty$ small) are approximated well
- High-degree polynomials require $f$ to have many bounded derivatives
Lebesgue constant: The interpolation error can also be bounded by:
$$\|f - p_n\|_\infty \le (1 + \Lambda_n)\,\|f - p_n^*\|_\infty$$
where $p_n^*$ is the best polynomial approximant of degree $n$ and $\Lambda_n = \max_x \sum_j |\ell_j(x)|$ is the Lebesgue constant. The Lebesgue constant quantifies how much the interpolant amplifies errors in the data values $y_i$.
- Equispaced nodes: $\Lambda_n \sim 2^{n+1}/(e\,n\log n)$ - exponential growth (Runge's phenomenon)
- Chebyshev nodes: $\Lambda_n \sim \frac{2}{\pi}\log n$ - logarithmic growth (nearly optimal)
2.4 Runge's Phenomenon
Runge's phenomenon (1901): The interpolant of $f(x) = \frac{1}{1 + 25x^2}$ at equispaced nodes on $[-1,1]$ diverges as $n \to \infty$, even though $f$ is infinitely differentiable!
The reason is not rounding error - the divergence occurs in exact arithmetic. The node polynomial $\omega_{n+1}$ for equispaced nodes is extremely large near the boundary of $[-1,1]$, and the derivatives $f^{(n+1)}$ grow fast enough that the error term $f^{(n+1)}(\xi)\,\omega_{n+1}(x)/(n+1)!$ grows without bound near the endpoints.
RUNGE'S PHENOMENON: EQUISPACED vs CHEBYSHEV NODES
========================================================================
f(x) = 1/(1+25x^2), n=15 interpolation points
Equispaced nodes (-1, ..., 1, equally spaced):
- Huge oscillations near +/-1
- Max error ~ 10^2 (function max is 1!)
- Lebesgue constant Lambda_15 ~ 5000
Chebyshev nodes (cos((2k+1)pi/(2n+2))):
- Near-optimal approximation
- Max error ~ 10^-3
- Lebesgue constant Lambda_15 ~ 3.5
Moral: Node placement is as important as degree!
========================================================================
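A short sketch reproducing this comparison (assuming SciPy's BarycentricInterpolator; exact error values depend on n and the evaluation grid):
import numpy as np
from scipy.interpolate import BarycentricInterpolator
f = lambda x: 1.0 / (1.0 + 25.0 * x**2)   # Runge's function
n = 15
x_fine = np.linspace(-1, 1, 2001)
eq_nodes = np.linspace(-1, 1, n + 1)                                 # equispaced
ch_nodes = np.cos((2 * np.arange(n + 1) + 1) * np.pi / (2 * n + 2))  # Chebyshev
for name, nodes in [("equispaced", eq_nodes), ("Chebyshev", ch_nodes)]:
    p = BarycentricInterpolator(nodes, f(nodes))
    print(f"{name}: max error = {np.abs(p(x_fine) - f(x_fine)).max():.2e}")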
For AI: Equispaced sampling is the default in practice (uniform time steps, uniform grid). Understanding why this can fail for high-degree polynomial interpolation is essential context for:
- Time-series models that use polynomial basis functions
- Finite element methods in physics-informed neural networks
- Feature engineering with polynomial features (always use regularization or orthogonal polynomials)
3. Chebyshev Polynomials and Optimal Nodes
3.1 Chebyshev Polynomials
The Chebyshev polynomials of the first kind are defined via the remarkable trigonometric identity:
$$T_n(\cos\theta) = \cos(n\theta)$$
Equivalently, on $[-1,1]$:
$$T_n(x) = \cos(n \arccos x)$$
Key properties:
| Property | Formula |
|---|---|
| Boundedness | $|T_n(x)| \le 1$ on $[-1,1]$ |
| Roots (Chebyshev points of the 1st kind) | $x_k = \cos\frac{(2k+1)\pi}{2n}$, $k = 0,\dots,n-1$ |
| Extrema (Chebyshev points of the 2nd kind) | $x_k = \cos\frac{k\pi}{n}$, $k = 0,\dots,n$ |
| Orthogonality | $\int_{-1}^{1}\frac{T_m(x)\,T_n(x)}{\sqrt{1-x^2}}\,dx = 0$ (for $m \neq n$) |
| Leading coefficient | $2^{n-1}$ (for $n \ge 1$) |
| Minimax property | Among all monic polynomials of degree $n$, $2^{1-n}T_n$ has the smallest $\infty$-norm on $[-1,1]$ |
Why the trigonometric definition works: $\cos(n\theta)$ stays in $[-1,1]$ for all $\theta$. The change of variables $x = \cos\theta$ maps $[0,\pi]$ to $[-1,1]$, compressing the uniform distribution in $\theta$ to a non-uniform (arcsine) distribution in $x$ that weights the endpoints more heavily.
Chebyshev series: Any continuous function $f$ on $[-1,1]$ can be expanded as:
$$f(x) = \sum_{k=0}^{\infty}{}' c_k\,T_k(x)$$
(with the $k=0$ term divided by 2). The Chebyshev series converges faster than any fixed power of $1/n$ for $C^\infty$ functions - exponential convergence for analytic $f$.
For AI: Chebyshev polynomials appear in:
- Spectral graph neural networks: ChebNet approximates the graph spectral filter $g_\theta(L)$ by a Chebyshev polynomial in the rescaled graph Laplacian $\tilde{L}$, avoiding eigendecomposition
- Polynomial activation functions: Some neural architectures use Chebyshev polynomials as activation functions for efficient function approximation
- Numerical PDE solvers: Spectral methods using Chebyshev expansions achieve exponential accuracy for smooth solutions
3.2 Chebyshev Nodes
The Optimal Node Placement Problem: Which nodes $x_0,\dots,x_n$ minimize $\max_{x \in [-1,1]}|\omega_{n+1}(x)|$ where $\omega_{n+1}(x) = \prod_{i=0}^{n}(x - x_i)$?
Answer: The Chebyshev points of the first kind (roots of $T_{n+1}$):
$$x_k = \cos\frac{(2k+1)\pi}{2(n+1)}, \qquad k = 0,\dots,n$$
With these nodes, $\omega_{n+1}(x) = 2^{-n}\,T_{n+1}(x)$ and:
$$\max_{x \in [-1,1]}|\omega_{n+1}(x)| = 2^{-n}$$
This is the smallest possible value (by the minimax property of $T_{n+1}$).
Error bound with Chebyshev nodes:
$$\|f - p_n\|_\infty \le \frac{\|f^{(n+1)}\|_\infty}{2^n\,(n+1)!}$$
Compare to equispaced nodes, where $\max|\omega_{n+1}|$ is exponentially larger for large $n$.
Rescaling to $[a,b]$: The Chebyshev nodes on $[a,b]$ are:
$$x_k = \frac{a+b}{2} + \frac{b-a}{2}\cos\frac{(2k+1)\pi}{2(n+1)}$$
3.3 Minimax Approximation
The minimax problem asks: among all polynomials $p$ of degree $\le n$, find $p_n^*$ minimizing $\|f - p\|_\infty$.
Chebyshev's theorem (Equioscillation theorem): $p_n^*$ is the best approximation if and only if there exist at least $n+2$ points in $[a,b]$ where the error equioscillates:
$$f(t_i) - p_n^*(t_i) = \pm(-1)^i\,\|f - p_n^*\|_\infty$$
Remez algorithm: An iterative algorithm that finds the minimax polynomial by exchanging equioscillation points. Converges quadratically.
For smooth functions: The minimax polynomial approximation converges exponentially for analytic functions - meaning $\|f - p_n^*\|_\infty = O(\rho^{-n})$ for some $\rho > 1$ depending on the domain of analyticity.
3.4 Clenshaw-Curtis Quadrature Connection
The Chebyshev points of the second kind (extrema of $T_n$) are central to Clenshaw-Curtis quadrature (see Section10-05). The key insight: integrating the Chebyshev interpolant exactly gives highly accurate quadrature weights. This is treated in detail in the Numerical Integration section.
4. Spline Interpolation
4.1 Piecewise Polynomial Interpolation
Motivation: High-degree global polynomial interpolation is unstable (Runge). Instead, use a low-degree polynomial on each subinterval - piecewise polynomials or splines.
Definition: Given nodes $a = x_0 < x_1 < \dots < x_n = b$ (breakpoints), a spline of degree $k$ is a function $s$ such that:
- On each interval $[x_i, x_{i+1}]$, $s$ is a polynomial of degree $\le k$
- $s \in C^{k-1}[a,b]$ - it has $k-1$ continuous derivatives globally
Piecewise linear interpolation ($k = 1$): Connect adjacent data points with straight lines. Simple, $O(n)$ construction, but only $C^0$ - derivatives are discontinuous. Error: $O(h^2)$ where $h = \max_i (x_{i+1} - x_i)$.
Piecewise cubic ($k = 3$): The sweet spot - 4 degrees of freedom per interval; interpolation plus $C^1$ and $C^2$ continuity at the interior nodes use up all but 2 of them, leaving 2 free conditions (boundary conditions). Error: $O(h^4)$.
4.2 Cubic Splines: Derivation and Construction
Natural cubic spline: Given $n+1$ data points, find piecewise cubic $s$ satisfying:
- $s(x_i) = y_i$ for $i = 0,\dots,n$ (interpolation)
- $s''(x_0) = s''(x_n) = 0$ (natural boundary conditions - zero curvature at endpoints)
Derivation: On $[x_i, x_{i+1}]$ with $h_i = x_{i+1} - x_i$, let $M_i = s''(x_i)$ be the unknown second derivatives. Since $s$ is cubic on each interval, $s''$ is linear:
$$s''(x) = M_i\,\frac{x_{i+1} - x}{h_i} + M_{i+1}\,\frac{x - x_i}{h_i}$$
Integrating twice and applying $s(x_i) = y_i$, $s(x_{i+1}) = y_{i+1}$ fixes the constants of integration on each interval.
Imposing $C^1$ continuity (matching first derivatives at interior nodes):
$$h_{i-1}M_{i-1} + 2(h_{i-1} + h_i)M_i + h_i M_{i+1} = 6\left(\frac{y_{i+1} - y_i}{h_i} - \frac{y_i - y_{i-1}}{h_{i-1}}\right)$$
This gives a tridiagonal linear system for $M_1,\dots,M_{n-1}$ (with $M_0 = M_n = 0$ for natural BC). Tridiagonal systems are solved in $O(n)$ with the Thomas algorithm - spectacularly efficient.
CUBIC SPLINE SYSTEM (tridiagonal)
========================================================================
+ + + + + +
| 2(h_0+h_1) h_1 | |M_1 | |r_1 |
| h_1 2(h_1+h_2) h_2 | |M_2 | |r_2 |
| h_2 2(h_2+h_3) h_3 | |M_3 | = |r_3 |
| | | | | |
| h_{n-2} 2(h_{n-2}+h_{n-1})| |M_{n-1}| |r_{n-1}|
+ + + + + +
r_i = 6[(y_{i+1}-y_i)/h_i - (y_i-y_{i-1})/h_{i-1}]   (second differences of y)
Solved by Thomas algorithm in O(n) operations.
========================================================================
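A minimal Thomas-algorithm sketch for systems like the one above (illustrative; assumes diagonal dominance, which holds for the spline matrix, so no pivoting is needed):
import numpy as np
def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system in O(n); lower/upper have length n-1."""
    d, r = diag.astype(float).copy(), rhs.astype(float).copy()
    n = len(d)
    for i in range(1, n):                  # forward elimination
        w = lower[i - 1] / d[i - 1]
        d[i] -= w * upper[i - 1]
        r[i] -= w * r[i - 1]
    x = np.empty(n)
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):         # back substitution
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x
# Sanity check against a dense solve
A = np.diag(4.0 * np.ones(4)) + np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)
b = np.array([1.0, 2.0, 3.0, 4.0])
print(np.allclose(thomas(np.ones(3), 4 * np.ones(4), np.ones(3), b),
                  np.linalg.solve(A, b)))  # True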
Error analysis: For the cubic spline:
$$\|f - s\|_\infty \le C\,h^4\,\|f^{(4)}\|_\infty$$
where $h = \max_i h_i$. The $O(h^4)$ convergence makes cubic splines highly accurate for modest grid spacing.
Boundary condition options:
- Natural: $s''(x_0) = s''(x_n) = 0$ (minimizes the total curvature $\int (s'')^2$)
- Clamped: $s'(x_0) = f'(x_0)$, $s'(x_n) = f'(x_n)$ (uses derivative information)
- Not-a-knot: $s'''$ continuous at $x_1$ and $x_{n-1}$ (used when derivatives unknown)
- Periodic: $s'(x_0) = s'(x_n)$, $s''(x_0) = s''(x_n)$ (for periodic data)
Minimum curvature property: The natural cubic spline minimizes $\int_a^b (g'')^2\,dx$ among all twice-differentiable interpolants $g$. This is the thin plate spline principle: the spline is the smoothest interpolant.
For AI:
- Neural ODEs use cubic spline interpolation to densify trajectory predictions
- Temporal point processes use spline bases to model intensity functions
- The minimum curvature property connects to L2 regularization: the RKHS norm for the Matern-3/2 kernel is equivalent to a Sobolev $H^2$ norm, which penalizes $\int (f'')^2$ - the same functional the natural spline minimizes
4.3 B-Splines
B-splines (basis splines) provide a numerically stable, local basis for the spline space. Rather than working with the second-derivative representation, B-splines use recursively defined local basis functions.
Definition (Cox-de Boor recursion): Given a knot vector $t_0 \le t_1 \le \dots \le t_m$, the B-spline basis functions of degree $p$ are:
$$B_{i,0}(x) = \begin{cases} 1 & t_i \le x < t_{i+1} \\ 0 & \text{otherwise} \end{cases}, \qquad B_{i,p}(x) = \frac{x - t_i}{t_{i+p} - t_i}\,B_{i,p-1}(x) + \frac{t_{i+p+1} - x}{t_{i+p+1} - t_{i+1}}\,B_{i+1,p-1}(x)$$
Key properties:
- Local support: $B_{i,p} = 0$ outside $[t_i, t_{i+p+1}]$ - changing one control point affects only a local region
- Partition of unity: $\sum_i B_{i,p}(x) = 1$ for all $x$ - control points enter as convex combinations
- Non-negativity: $B_{i,p}(x) \ge 0$
- Convex hull property: The spline curve lies within the convex hull of its control points
B-spline curve: $s(x) = \sum_i c_i\,B_{i,p}(x)$ where $c_i$ are control point coefficients.
For AI: B-splines appear in:
- Computer graphics: NURBS (Non-Uniform Rational B-Splines) for representing curved surfaces in 3D rendering used in diffusion model outputs
- Kolmogorov-Arnold Networks (KAN): Replace MLP neurons with learned B-spline functions - a direct application of spline approximation theory in neural networks
- Time-series modeling: Spline bases for smooth covariate functions in survival analysis and temporal models
4.4 Tension Splines and Shape Preservation
Standard cubic splines can overshoot - producing local extrema not present in the data. Monotone splines (Fritsch-Carlson algorithm) adjust slopes to preserve monotonicity:
- Compute candidate slopes $d_i$ from the divided differences $\Delta_i = (y_{i+1} - y_i)/h_i$
- If $d_i$ has opposite sign to the adjacent differences $\Delta_{i-1}, \Delta_i$, set $d_i = 0$
- Rescale slopes so that $(d_i/\Delta_i)^2 + (d_{i+1}/\Delta_i)^2 \le 9$ on each interval
For AI: Monotone spline constraints appear in calibration curves, cumulative distribution function modeling, and isotonic regression.
5. Least-Squares Approximation
5.1 Polynomial Least Squares
Given $m$ data points with $m > n+1$, find the degree-$n$ polynomial minimizing:
$$\min_c \|Vc - y\|_2^2$$
where $V \in \mathbb{R}^{m \times (n+1)}$ is the Vandermonde matrix $V_{ij} = x_i^j$.
The Vandermonde conditioning problem: Even for moderate $n$, $\kappa(V)$ grows exponentially, like $c^n$ for some $c > 1$. Solving the normal equations $V^T V c = V^T y$ squares the condition number: $\kappa(V^T V) = \kappa(V)^2$.
Solution via QR decomposition: Compute $V = QR$ (thin QR), then solve $Rc = Q^T y$. This is stable and avoids squaring the condition number: the method works with $\kappa(V)$, not $\kappa(V)^2$.
For AI: This is the same condition-number-squaring problem as in Section10-02. Linear regression with polynomial features should always use QR or SVD, never the normal equations with the Vandermonde matrix directly.
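A sketch of the comparison (illustrative; np.linalg.lstsq solves the least-squares problem by an orthogonal factorization rather than the normal equations):
import numpy as np
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.01 * rng.standard_normal(x.size)
V = np.vander(x, 16, increasing=True)              # degree-15 Vandermonde
print(f"cond(V)     = {np.linalg.cond(V):.1e}")
print(f"cond(V^T V) = {np.linalg.cond(V.T @ V):.1e}")  # roughly squared
c_stable, *_ = np.linalg.lstsq(V, y, rcond=None)   # orthogonal factorization
c_normal = np.linalg.solve(V.T @ V, V.T @ y)       # squares the conditioning
print(f"max coefficient discrepancy: {np.abs(c_stable - c_normal).max():.2e}")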
5.2 Orthogonal Polynomials and Gram-Schmidt
The Vandermonde ill-conditioning arises because the monomial basis is nearly linearly dependent for large $n$ and typical node distributions. The fix: use a basis of polynomials orthogonal with respect to the weight function.
Orthogonal polynomial construction (three-term recurrence): For a weight function $w(x)$ on $[a,b]$:
$$p_{k+1}(x) = (x - \alpha_k)\,p_k(x) - \beta_k\,p_{k-1}(x)$$
where the coefficients are:
$$\alpha_k = \frac{\langle x\,p_k, p_k\rangle}{\langle p_k, p_k\rangle}, \qquad \beta_k = \frac{\langle p_k, p_k\rangle}{\langle p_{k-1}, p_{k-1}\rangle}$$
and $p_{-1} = 0$, $p_0 = 1$.
Standard families:
| Weight $w(x)$ | Interval | Polynomials |
|---|---|---|
| $1$ | $[-1,1]$ | Legendre |
| $(1-x^2)^{-1/2}$ | $[-1,1]$ | Chebyshev |
| $e^{-x}$ | $[0,\infty)$ | Laguerre |
| $e^{-x^2}$ | $(-\infty,\infty)$ | Hermite |
Discrete orthogonal polynomials: For a finite set of nodes with weights $w_i$, the discrete inner product $\langle f, g\rangle = \sum_i w_i f(x_i) g(x_i)$ gives discrete orthogonal polynomials. Using these as the basis makes the least-squares problem diagonal: $c_k = \langle f, p_k\rangle / \langle p_k, p_k\rangle$.
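A sketch contrasting the two bases on the same fit (numpy.polynomial handles the basis change; the test function and degree are arbitrary choices):
import numpy as np
from numpy.polynomial import chebyshev as C
x = np.linspace(-1, 1, 200)
y = np.exp(x) * np.sin(5 * x)
c = C.chebfit(x, y, deg=20)                    # LS fit in the Chebyshev basis
print(f"max residual: {np.abs(C.chebval(x, c) - y).max():.2e}")
# The design matrix of Chebyshev polynomials at these nodes is far better
# conditioned than the degree-20 monomial Vandermonde matrix.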
For AI: Hermite polynomials appear in quantum mechanics (wavefunction basis) and in the theory of Gaussian integrals. In deep learning theory, Hermite expansions of activation functions are the standard tool for computing the kernels of infinite-width networks (NNGP and NTK analyses).
5.3 Minimax vs Least-Squares
Minimax (Chebyshev) approximation: Minimizes $\|f - p\|_\infty = \max_x |f(x) - p(x)|$.
- Equioscillation at $n+2$ points (Chebyshev's theorem)
- Found by the Remez algorithm
- Best for: applications where the worst case matters (safety-critical, numerical tables)
Least-squares ($L^2$) approximation: Minimizes $\|f - p\|_2$.
- Found by orthogonal projection: $c_k = \langle f, p_k\rangle / \langle p_k, p_k\rangle$
- Best for: statistical fitting, machine learning (minimizes MSE)
Comparison:
| Criterion | Minimax | Least-Squares |
|---|---|---|
| Error norm | $\|\cdot\|_\infty$ | $\|\cdot\|_2$ |
| Algorithm | Remez (iterative) | QR / orthogonal projection |
| Sensitivity to outliers | High (dominated by the worst point) | Lower |
| Computation | iterative exchange steps | one-shot |
| ML analogy | Max-margin (SVM) | MSE (regression) |
6. Fourier Approximation
6.1 Trigonometric Interpolation
For periodic functions on $[0, 2\pi)$, the natural interpolating functions are trigonometric polynomials. Given $n$ equispaced nodes $x_j = 2\pi j/n$ for $j = 0,\dots,n-1$, the trigonometric interpolant is:
$$p(x) = \sum_{k} c_k\,e^{ikx} \qquad (\text{sum over } |k| \le n/2)$$
The coefficients $c_k$ are exactly the Discrete Fourier Transform (DFT) of the data $y_j$ (up to scaling).
Error for periodic functions: For a function with Fourier series $f(x) = \sum_k \hat{f}_k\,e^{ikx}$, the interpolation error is controlled by the tail $\sum_{|k| > n/2}|\hat{f}_k|$ - the unresolved modes alias onto resolved ones.
For smooth periodic functions, the Fourier coefficients decay rapidly (exponentially for analytic functions), so this error is very small - spectral accuracy.
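A sketch of spectral accuracy for a smooth periodic function (illustrative; the interpolant is evaluated directly from its DFT coefficients):
import numpy as np
n = 32
xj = 2 * np.pi * np.arange(n) / n
f = lambda x: np.exp(np.sin(x))            # smooth and 2*pi-periodic
c = np.fft.fft(f(xj)) / n                  # interpolation coefficients
x = np.linspace(0, 2 * np.pi, 1001)
k = np.fft.fftfreq(n, d=1.0 / n)           # signed integer wavenumbers
p = np.real(sum(c[m] * np.exp(1j * k[m] * x) for m in range(n)))
print(f"max interpolation error (n={n}): {np.abs(p - f(x)).max():.2e}")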
6.2 Discrete Fourier Transform
Definition: The DFT of a vector $y \in \mathbb{C}^n$ is:
$$Y_k = \sum_{j=0}^{n-1} y_j\,\omega^{-jk}, \qquad k = 0,\dots,n-1$$
where $\omega = e^{2\pi i/n}$ is the primitive $n$th root of unity.
Matrix form: $Y = Fy$ where $F_{kj} = \omega^{-jk}$.
The matrix $F$ is unitary (up to scaling): $F^*F = nI$, so the inverse DFT (IDFT) is:
$$y_j = \frac{1}{n}\sum_{k=0}^{n-1} Y_k\,\omega^{jk}$$
Properties:
- Parseval's theorem: $\sum_j |y_j|^2 = \frac{1}{n}\sum_k |Y_k|^2$ (energy conservation)
- Convolution theorem: DFT of circular convolution = pointwise product of DFTs
- Real signals: If $y \in \mathbb{R}^n$, then $Y_{n-k} = \overline{Y_k}$ (conjugate symmetry)
For AI: The convolution theorem is the mathematical foundation of CNNs: a convolution in the spatial domain equals pointwise multiplication in the frequency domain. FFT-based convolution has complexity $O(n\log n)$ vs $O(n^2)$ for direct convolution.
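A minimal FFT-convolution sketch (zero-padding converts the DFT's circular convolution into the ordinary linear one):
import numpy as np
def fft_convolve(x, h):
    """Linear convolution via FFT; zero-pad to avoid circular wrap-around."""
    n = len(x) + len(h) - 1
    N = 1 << (n - 1).bit_length()          # next power of two
    return np.real(np.fft.ifft(np.fft.fft(x, N) * np.fft.fft(h, N)))[:n]
x = np.random.default_rng(1).standard_normal(1000)
h = np.exp(-np.linspace(-3, 3, 51) ** 2)   # Gaussian filter
print(np.allclose(fft_convolve(x, h), np.convolve(x, h)))  # True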
6.3 Fast Fourier Transform
Naive DFT: $O(n^2)$ - computing each of $n$ output values requires summing $n$ terms.
Cooley-Tukey FFT (1965): Exploits the factorization of $n$ (assuming $n = 2^m$) to reduce complexity to $O(n\log n)$.
Radix-2 Decimation-in-Time: Split into even/odd indices:
$$Y_k = E_k + \omega^{-k}O_k, \qquad Y_{k+n/2} = E_k - \omega^{-k}O_k$$
Since $\omega_n^2 = \omega_{n/2}$, both $E$ (DFT of the even samples) and $O$ (DFT of the odd samples) are DFTs of size $n/2$. The butterfly pattern:
DFT BUTTERFLY STRUCTURE (n=8)
========================================================================
Input (bit-reversed) Output (in-order)
y_0 ----+- butterfly --------- Y_0
y_4 ----- butterfly --------- Y_1
y_2 ----- butterfly --------- Y_2
y_6 ----- butterfly --------- Y_3
y_1 ----- butterfly --------- Y_4
y_5 ----- butterfly --------- Y_5
y_3 ----- butterfly --------- Y_6
y_7 ----+- butterfly --------- Y_7
3 stages x 4 butterflies = 12 operations vs 64 for naive DFT.
Speedup: O(n^2) -> O(n log n).
========================================================================
Complexity: $T(n) = 2T(n/2) + O(n)$ solves to $T(n) = O(n\log n)$. For $n = 2^{20} \approx 10^6$, FFT requires roughly $n\log_2 n \approx 2\times 10^7$ operations vs $n^2 \approx 10^{12}$ for naive DFT - a factor of $\sim 5\times 10^4$ speedup.
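A minimal recursive radix-2 sketch of this recursion (for illustration only - np.fft.fft is the production tool):
import numpy as np
def fft_radix2(y):
    """Recursive Cooley-Tukey FFT; len(y) must be a power of two."""
    n = len(y)
    if n == 1:
        return np.asarray(y, dtype=complex)
    E = fft_radix2(y[0::2])                # size-n/2 DFT of even samples
    O = fft_radix2(y[1::2])                # size-n/2 DFT of odd samples
    tw = np.exp(-2j * np.pi * np.arange(n // 2) / n) * O   # twiddle factors
    return np.concatenate([E + tw, E - tw])
y = np.random.default_rng(2).standard_normal(8)
print(np.allclose(fft_radix2(y), np.fft.fft(y)))  # True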
For AI: FFT is used in:
- Spectral convolution layers (FNet, S4, Hyena)
- Long-range attention in Performer (via random Fourier features)
- Frequency-domain training of neural networks
6.4 Aliasing and the Nyquist Theorem
Aliasing: When sampling at rate $f_s$, frequencies above $f_s/2$ (the Nyquist frequency) are indistinguishable from lower frequencies. The frequency $f$ aliases to $f - kf_s$ for the integer $k$ that brings the result into $[-f_s/2, f_s/2]$.
Nyquist-Shannon sampling theorem: A signal with bandwidth $B$ (no frequencies above $B$) can be perfectly reconstructed from samples taken at rate $f_s > 2B$.
For AI: Aliasing matters in:
- Audio processing (e.g., sample rate 44.1 kHz supports up to 22 kHz - just above human hearing)
- Image super-resolution: downsampling without anti-aliasing introduces artifacts
- Periodic positional encodings: the geometric frequency range $10000^{-2i/d}$ in transformer PEs avoids aliasing for typical sequence lengths
7. Radial Basis Functions and Kernel Methods
7.1 RBF Interpolation
For scattered data (non-uniform nodes in $\mathbb{R}^d$), polynomial interpolation becomes problematic. Radial basis function (RBF) interpolation uses:
$$s(x) = \sum_{i=1}^{n} \alpha_i\,\phi(\|x - x_i\|)$$
where $\phi$ is a radial function. Common choices:
| Name | $\phi(r)$ | Smoothness |
|---|---|---|
| Gaussian | $e^{-(\varepsilon r)^2}$ | $C^\infty$ |
| Multiquadric | $\sqrt{1 + (\varepsilon r)^2}$ | $C^\infty$ |
| Inverse multiquadric | $1/\sqrt{1 + (\varepsilon r)^2}$ | $C^\infty$ |
| Thin plate spline | $r^2\log r$ | $C^1$ |
| Matern-3/2 | $(1 + \sqrt{3}\,\varepsilon r)\,e^{-\sqrt{3}\,\varepsilon r}$ | $C^1$ |
| Wendland | piecewise polynomial, compact support | varies |
Interpolation system: The coefficients $\alpha$ solve the linear system:
$$A\alpha = y, \qquad A_{ij} = \phi(\|x_i - x_j\|)$$
For positive definite RBFs (Gaussian, Matern), $A$ is positive definite - guaranteed unique solution.
For AI: The Gaussian RBF kernel $k(x, x') = \exp(-\varepsilon^2\|x - x'\|^2)$ is the most common kernel in kernel machines (SVM, Gaussian processes). The shape parameter $\varepsilon$ (or lengthscale $\ell \propto 1/\varepsilon$) controls the smoothness/locality tradeoff - a hyperparameter as consequential as the learning rate in neural networks.
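A bare-bones Gaussian RBF interpolation sketch (the shape parameter value is an arbitrary assumption for the demo):
import numpy as np
rng = np.random.default_rng(3)
xi = np.sort(rng.uniform(-1, 1, 20))       # scattered 1D nodes
yi = np.sin(3 * xi)
eps = 3.0                                  # shape parameter (assumed)
A = np.exp(-(eps * (xi[:, None] - xi[None, :])) ** 2)  # interpolation matrix
alpha = np.linalg.solve(A, yi)
x = np.linspace(-1, 1, 500)
s = np.exp(-(eps * (x[:, None] - xi[None, :])) ** 2) @ alpha
print(f"cond(A) = {np.linalg.cond(A):.1e}, "
      f"max error = {np.abs(s - np.sin(3 * x)).max():.2e}")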
7.2 Connection to Reproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Space (RKHS): For a symmetric positive definite kernel $k$, the RKHS $\mathcal{H}_k$ is the completion of $\mathrm{span}\{k(\cdot, x)\}$ with the inner product satisfying the reproducing property:
$$\langle f, k(\cdot, x)\rangle_{\mathcal{H}_k} = f(x)$$
Representer theorem: The minimizer of $\min_{f \in \mathcal{H}_k} \sum_i L(y_i, f(x_i)) + \lambda\,\|f\|_{\mathcal{H}_k}^2$ has the form:
$$f^*(x) = \sum_{i=1}^{n} \alpha_i\,k(x, x_i)$$
This is exactly the RBF interpolant form! The RKHS norm acts as a regularizer controlling smoothness.
For AI: The representer theorem is foundational for:
- Kernel SVMs: The decision boundary is a linear combination of support vectors in RKHS
- Gaussian process regression: The posterior mean is the RKHS interpolant; posterior variance is the RKHS error
- Neural tangent kernel (NTK): Infinite-width neural networks converge to kernel regression with the NTK kernel
7.3 Gaussian Processes as Bayesian Approximation
A Gaussian process $f \sim \mathcal{GP}(m, k)$ is a distribution over functions where any finite collection $(f(x_1),\dots,f(x_n))$ is multivariate Gaussian with mean $m(x_i)$ and covariance matrix $K_{ij} = k(x_i, x_j)$.
GP posterior: Given observations $y = f(X) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$:
$$\mu_*(x) = k_*(x)^T(K + \sigma^2 I)^{-1}y, \qquad \sigma_*^2(x) = k(x, x) - k_*(x)^T(K + \sigma^2 I)^{-1}k_*(x)$$
where $k_*(x) = (k(x, x_1),\dots,k(x, x_n))^T$.
The posterior mean is precisely the kernel ridge regression solution - Bayesian inference gives both a point estimate and uncertainty quantification "for free".
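A compact GP-posterior sketch implementing these formulas (kernel, lengthscale, and noise level are illustrative choices):
import numpy as np
def k_rbf(a, b, ell=0.3):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)
rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 1, 15)); sigma = 0.05
y = np.sin(6 * X) + sigma * rng.standard_normal(X.size)
Xs = np.linspace(0, 1, 200)
K = k_rbf(X, X) + sigma**2 * np.eye(X.size)
Ks = k_rbf(Xs, X)
mean = Ks @ np.linalg.solve(K, y)                      # posterior mean
cov = k_rbf(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)    # posterior covariance
std = np.sqrt(np.clip(np.diag(cov), 0, None))          # pointwise uncertainty
print(mean[:3], std[:3])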
For AI: Gaussian processes are used for:
- Bayesian optimization: Surrogate model for expensive black-box functions (hyperparameter tuning)
- Neural architecture search (NAS): GP surrogate for network performance prediction
- Uncertainty quantification: GP posterior variance estimates prediction confidence
8. Applications in Machine Learning
8.1 Positional Encodings: Sinusoidal Interpolation
The original transformer uses sinusoidal positional encodings (Vaswani et al., 2017):
$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
Approximation theory perspective: These are trigonometric interpolation functions. The frequencies $\omega_i = 10000^{-2i/d}$ span from $1$ down to $10^{-4}$ - about 4 decades - covering both local (high frequency) and global (low frequency) position information.
Why sinusoidal? For any offset $k$, $PE(pos + k)$ is a linear function of $PE(pos)$ - the relative position can be expressed via a rotation matrix. This allows the model to attend to relative positions.
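A sinusoidal-PE sketch following the formula above (the dimension values are arbitrary):
import numpy as np
def sinusoidal_pe(num_pos, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(num_pos)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000.0 ** (2 * i / d))
    pe = np.zeros((num_pos, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
pe = sinusoidal_pe(512, 64)
# Dot products between PE rows depend on the positional offset,
# which is what lets attention pick up relative position.
print(pe.shape, pe[100] @ pe[105], pe[200] @ pe[205])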
RoPE (Rotary Position Embedding, Su et al. 2021): Used in LLaMA, Mistral, GPT-NeoX. Applies a position-dependent rotation to query and key vectors:
$$q_m \mapsto R_{\Theta, m}\,q_m, \qquad k_n \mapsto R_{\Theta, n}\,k_n$$
The inner product $\langle R_{\Theta,m}\,q, R_{\Theta,n}\,k\rangle$ depends only on the relative position $m - n$ - exact relative position encoding via complex multiplication.
For AI: The interpolation challenge arises when extending context length. A model trained with RoPE on 2048 tokens needs to generalize to 8192 - trigonometric extrapolation, not interpolation. Methods like YaRN (Peng et al., 2023) modify the frequency range to enable longer context via interpolation of the position indices.
8.2 Neural Networks as Universal Approximators
Cybenko's theorem (1989): Any continuous function on a compact set $K \subset \mathbb{R}^d$ can be approximated uniformly by a single hidden layer network:
$$g(x) = \sum_{j=1}^{N} a_j\,\sigma(w_j^T x + b_j)$$
for any continuous sigmoidal $\sigma$. As $N \to \infty$, the approximation error goes to zero.
Modern variants:
- Barron's theorem (1993): Functions with finite Barron complexity $C_f$ can be approximated with squared error $O(C_f^2/N)$ using $N$ neurons - a rate independent of dimension $d$. This avoids the curse of dimensionality for this function class.
- ReLU depth separation: Deep ReLU networks of depth $L$ can represent functions that would require exponentially many neurons in a shallow network - depth exponentially increases expressiveness
- Approximation vs optimization: Universal approximation guarantees existence of weights; finding them via gradient descent is a separate (hard) problem
KAN (Kolmogorov-Arnold Networks, 2024): Replace the node-wise nonlinearity with edge-wise learned univariate functions (represented as B-splines):
$$f(x) \approx \sum_{q} \Phi_q\!\left(\sum_{p} \phi_{q,p}(x_p)\right)$$
Based on the Kolmogorov-Arnold representation theorem: every continuous function of $d$ variables can be written as a composition of univariate functions and addition. KANs directly implement this with learnable splines.
8.3 Feature Maps and Kernel Approximation
Random Fourier Features (Rahimi-Recht, 2007): Any shift-invariant kernel $k(x - y)$ can be written (Bochner's theorem) as:
$$k(x - y) = \int p(\omega)\,e^{i\omega^T(x - y)}\,d\omega$$
where $p(\omega)$ is the spectral density. Approximating with $D$ random frequencies $\omega_j \sim p$:
$$k(x - y) \approx \frac{1}{D}\sum_{j=1}^{D} z_{\omega_j}(x)\,z_{\omega_j}(y), \qquad z_\omega(x) = \sqrt{2}\cos(\omega^T x + b)$$
This turns a kernel method into a linear model in feature space - $O(nD)$ work instead of $O(n^2)$ for the kernel matrix.
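A random-Fourier-features sketch for the Gaussian kernel (the kernel width and feature count are illustrative):
import numpy as np
rng = np.random.default_rng(5)
d, D = 5, 2000                             # input dim, number of features
X = rng.standard_normal((100, d))
gamma = 0.5                                # k(x,y) = exp(-gamma * ||x-y||^2)
W = rng.normal(0.0, np.sqrt(2 * gamma), (D, d))  # spectral-density samples
b = rng.uniform(0.0, 2 * np.pi, D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)       # feature map, shape (100, D)
K_exact = np.exp(-gamma * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
print(f"max kernel approximation error: {np.abs(Z @ Z.T - K_exact).max():.3f}")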
For AI: Random Fourier features are the foundation of:
- Performer (Choromanski et al., 2020): Approximates softmax attention with complexity using random feature maps
- Neural Tangent Kernel implementations: Computing NTK efficiently for large networks
9. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using equispaced nodes for high-degree polynomial interpolation | Runge's phenomenon: error grows exponentially near endpoints | Use Chebyshev nodes or splines |
| 2 | Solving least-squares via the normal equations with Vandermonde matrix | Condition number squared: $\kappa(V^TV) = \kappa(V)^2$, and $\kappa(V)$ already grows exponentially with degree | Use QR decomposition or orthogonal polynomial basis |
| 3 | Extrapolating polynomial interpolants | All polynomial interpolants diverge outside $[x_0, x_n]$ unless the function is actually polynomial | Use local models (splines) or explicit extrapolation assumptions |
| 4 | Setting spline boundary conditions to natural ($s'' = 0$) when data is not natural | Introduces artificial inflection points if the true function has nonzero curvature at endpoints | Use clamped or not-a-knot conditions when derivative information is available |
| 5 | Confusing interpolation degree with approximation quality | High degree interpolation is not always more accurate (Runge) | Match the basis to the data's smoothness and node count |
| 6 | Forgetting that DFT assumes periodic extension | Discontinuity at boundaries causes Gibbs phenomenon - oscillations near jumps | Apply windowing (Hann, Hamming) or use non-uniform FFT |
| 7 | Using Gaussian RBF with too small a shape parameter $\varepsilon$ (too flat) | Nearly flat Gaussians make the matrix nearly singular: $\kappa(A)$ explodes | Tune $\varepsilon$ via cross-validation or use stable algorithms (Contour-Pade, RBF-QR) |
| 8 | Treating the Lebesgue constant as just a theoretical concept | The Lebesgue constant directly bounds interpolation error amplification - if $\Lambda_n$ is large, small errors in the data cause large errors in the interpolant | Always check the Lebesgue constant for your node choice |
| 9 | Assuming neural network universality implies easy learning | UAT says an approximating network exists, not that gradient descent finds it | Understand the approximation-optimization gap |
| 10 | Misinterpreting sinusoidal PE frequencies as sampling frequencies | The factor $10000^{2i/d}$ controls the wavelength (period), not a sampling rate | Think in terms of wavelengths: smallest $= 2\pi$, largest $= 10000\cdot 2\pi$ |
| 11 | Using fixed-$\varepsilon$ RBF for high-dimensional data | RBF interpolation is ill-conditioned in high dimensions without careful $\varepsilon$ selection | Use compactly supported RBFs (Wendland) or regularized regression |
| 12 | Evaluating Lagrange basis polynomials naively | Numerical cancellation in $\prod_{k\neq j}(x - x_k)$ for large $n$ | Use the barycentric Lagrange form: numerically stable and $O(n)$ per evaluation |
10. Exercises
Exercise 1 * - Newton Divided Differences
Implement the Newton divided difference algorithm and evaluate the interpolating polynomial using Horner's rule.
(a) Compute the divided difference table for a small data set (e.g., $f(x) = \sin x$ at $x = 0, 1, 2, 3$).
(b) Implement newton_interp(nodes, vals, x) using the divided difference table and Horner's method.
(c) Verify the result matches the Lagrange interpolant.
Exercise 2 * - Runge's Phenomenon
Demonstrate Runge's phenomenon numerically and show that Chebyshev nodes tame it.
(a) For $f(x) = 1/(1 + 25x^2)$, compute the interpolation error on $[-1,1]$ with equispaced nodes for increasing $n$ (e.g., $n = 5, 10, 15, 20$).
(b) Repeat with Chebyshev nodes. Plot both error curves on a log scale.
(c) Compute the Lebesgue constant for both node sets at each $n$.
Exercise 3 * - Natural Cubic Spline
Build a cubic spline solver from scratch using the tridiagonal system.
(a) Derive the tridiagonal system for the second derivatives $M_i$.
(b) Implement the Thomas algorithm for tridiagonal systems.
(c) Test on a smooth function (e.g., $f(x) = \sin x$) with 10 equispaced nodes. Plot the spline and its error.
Exercise 4 ** - Vandermonde Conditioning and QR
Show that QR decomposition is numerically superior to the normal equations for polynomial fitting.
(a) Construct the Vandermonde matrix for equispaced nodes. Compute $\kappa(V)$ and $\kappa(V^TV)$.
(b) Fit a degree-15 polynomial to noisy data by: (i) normal equations, (ii) QR factorization.
(c) Compare the residuals and coefficient sensitivity to noise.
Exercise 5 ** - Chebyshev Approximation
Compute the Chebyshev series and compare it to the minimax polynomial.
(a) Compute the first 20 Chebyshev coefficients of a smooth function (e.g., $f(x) = e^x$) using the DCT.
(b) Show that the truncated Chebyshev series achieves near-minimax approximation.
(c) Plot the exact function, the Chebyshev approximation, and the best polynomial approximation. Compute the difference in $\infty$-norm.
Exercise 6 ** - FFT and the Convolution Theorem
Use the FFT to compute a convolution efficiently.
(a) Implement a 1D discrete convolution via FFT: fft_convolve(x, h).
(b) Verify correctness against np.convolve(x, h) for a Gaussian filter.
(c) Measure and compare the time complexity of FFT vs direct convolution over a range of sizes $n$.
Exercise 7 ** - RBF Interpolation and Kernel Ridge Regression
Connect RBF interpolation to kernel methods.
(a) Implement Gaussian RBF interpolation for 1D scattered data.
(b) Show that as the regularization $\lambda \to 0$, kernel ridge regression approaches RBF interpolation.
(c) Demonstrate the shape parameter effect: for several values of $\varepsilon$, plot the RBF interpolant and the condition number of $A$.
Exercise 8 *** - Positional Encoding Interpolation
Analyze sinusoidal positional encodings from an approximation theory perspective.
(a) Implement transformer sinusoidal PEs for a chosen dimension (e.g., $d = 64$).
(b) Show that $PE(pos + k)$ is a linear function of $PE(pos)$ (derive the rotation matrix).
(c) Simulate context length extrapolation: train a simple model on short positions and evaluate on longer ones. Compare original PEs, linearly interpolated PEs, and NTK-scaled PEs.
11. Why This Matters for AI (2026 Perspective)
| Concept | AI/LLM Impact |
|---|---|
| Chebyshev nodes and minimax | ChebNet spectral GNNs avoid explicit eigendecomposition; used in molecular property prediction |
| Cubic splines | Neural ODEs use spline interpolation for continuous-time dynamics; used in physics-informed networks |
| B-splines | KAN networks (2024) parameterize learnable functions as B-splines; emerging alternative to MLPs |
| FFT and convolution theorem | FNet, S4, Hyena use FFT-based long-range mixing as efficient attention alternative ($O(n\log n)$) |
| Random Fourier features | Performer achieves linear-complexity attention via kernel approximation; NTK analysis of neural nets |
| Sinusoidal interpolation | RoPE positional encodings in LLaMA, Mistral, GPT-4 class models |
| Context length extrapolation | YaRN, LongRoPE interpolate position indices to extend 4K-trained models to 128K+ tokens |
| Gaussian processes | Bayesian optimization for hyperparameter search in AutoML; uncertainty quantification in autonomous vehicles |
| RKHS and representer theorem | Foundation for kernel SVMs, support vector regression, and theoretical analysis of neural networks |
| Universal approximation | Theoretical basis for depth/width scaling laws; understanding which architectures can represent what functions |
12. Conceptual Bridge
Looking back: This section builds on floating-point arithmetic (Section10-01) - all numerical interpolation is subject to rounding, and the condition numbers of Vandermonde and RBF matrices determine how much rounding error is amplified. From numerical linear algebra (Section10-02), we use QR decomposition for stable least-squares fitting and tridiagonal solvers for spline construction. The optimization perspective (Section10-03) underlies minimax approximation (Remez algorithm) and least-squares fitting. Earlier linear algebra (Section02) provided the theory of least squares, orthogonality, and projections.
Looking forward: The next section (Section10-05: Numerical Integration) uses interpolation directly - quadrature rules are derived by integrating interpolating polynomials. Gaussian quadrature nodes are the zeros of orthogonal polynomials. Clenshaw-Curtis quadrature uses Chebyshev nodes and the DCT for efficient weight computation. Everything built here - Chebyshev theory, orthogonal polynomials, FFT - feeds directly into numerical integration.
The connection to the broader curriculum:
- Graph theory (Section11): Spectral graph convolutions approximate the spectral filter via Chebyshev polynomials in the graph Laplacian eigenvalues
- Probability (Section07): Gaussian processes are probabilistic interpolants; Fourier analysis of stochastic processes uses spectral density
- Optimization (Section08): The minimax approximation problem is solved by Chebyshev equioscillation, connecting to constrained optimization and duality
INTERPOLATION AND APPROXIMATION IN THE CURRICULUM
========================================================================
Section10-01 Floating-Point Section02 Linear Algebra
(rounding errors) --> (least squares, QR)
| |
+------------+-----------+
v
Section10-04 INTERPOLATION
AND APPROXIMATION
+------------------+
| Lagrange / Newton|
| Chebyshev / Runge|
| Cubic splines |
| Least squares |
| FFT / DFT |
| RBF / RKHS |
+--------+---------+
|
+------------+------------+
v v v
Section10-05 Section11 GNNs AI Apps
Numerical (ChebNet, (RoPE, KAN,
Integration SpectralGCN) Performer)
========================================================================
Key insight: The choice of basis is the choice of inductive bias. Monomials assume no structure; Chebyshev assumes bounded derivatives; splines assume local smoothness; trigonometric functions assume periodicity; RBFs assume isotropy. Neural networks learn their basis from data - but the theoretical guarantees come from understanding which function classes can be represented and approximated efficiently.
Appendix A: Barycentric Lagrange Interpolation - Algorithm and Analysis
The barycentric form is the preferred implementation of polynomial interpolation.
A.1 Derivation
Define the node polynomial $\omega(x) = \prod_j (x - x_j)$ and barycentric weights $w_j = 1/\prod_{k \neq j}(x_j - x_k)$.
Then $\ell_j(x) = \omega(x)\,\frac{w_j}{x - x_j}$ and:
$$p(x) = \omega(x)\sum_j \frac{w_j}{x - x_j}\,y_j \qquad \text{(first barycentric form)}$$
Since interpolating $f \equiv 1$ implies $1 = \omega(x)\sum_j \frac{w_j}{x - x_j}$, dividing:
$$p(x) = \frac{\sum_j \frac{w_j}{x - x_j}\,y_j}{\sum_j \frac{w_j}{x - x_j}} \qquad \text{(second barycentric form)}$$
A.2 Numerical Stability
Theorem (Higham 2004): The barycentric formula is backward stable: it computes the exact interpolant of slightly perturbed data $\tilde{y}_j = y_j(1 + \delta_j)$ with $|\delta_j|$ a modest multiple of machine epsilon.
Practical implications:
- Pre-compute weights once: $O(n^2)$ preprocessing
- Evaluate at any $x$: $O(n)$ per evaluation
- Adding a new node $x_{n+1}$: $O(n)$ update to the weights
- No evaluation of $\omega(x)$ is needed - it cancels between numerator and denominator, so the huge node polynomial is never formed
A.3 Barycentric Weights for Special Nodes
Equispaced nodes on $[a,b]$: $w_j = (-1)^j\binom{n}{j}$ (up to a common factor) - grows like $2^n$.
Chebyshev nodes of the 2nd kind $x_j = \cos(j\pi/n)$: $w_j = (-1)^j\delta_j$ where $\delta_j = 1/2$ if $j \in \{0, n\}$, otherwise $\delta_j = 1$.
Chebyshev nodes of the 1st kind: $w_j = (-1)^j\sin\frac{(2j+1)\pi}{2n+2}$.
The Chebyshev weights are computed in $O(n)$ and are moderate in size - explaining their superior numerical properties.
def barycentric_weights_cheb2(n):
    """Barycentric weights for Chebyshev points of the 2nd kind."""
    w = np.ones(n + 1)
    w[1::2] = -1.0   # alternate signs first...
    w[0] *= 0.5      # ...then halve the endpoint weights, so the sign
    w[n] *= 0.5      # of w[n] stays correct when n is odd
    return w
def barycentric_eval(x, nodes, values, weights):
"""Evaluate barycentric interpolant at x."""
diffs = x - nodes
# Handle case x == node exactly
exact = np.where(diffs == 0)[0]
if len(exact) > 0:
return values[exact[0]]
w_over_d = weights / diffs
return (w_over_d @ values) / w_over_d.sum()
Appendix B: Divided Difference Properties and Connections
B.1 The Divided Difference as a Derivative
Theorem: If $f \in C^k$ and $x_0,\dots,x_k$ are distinct, then:
$$f[x_0,\dots,x_k] = \frac{f^{(k)}(\xi)}{k!}$$
for some $\xi$ in the convex hull of $\{x_0,\dots,x_k\}$.
Proof: Apply Rolle's theorem $k$ times, as in the interpolation error theorem. The $k$-th divided difference telescopes to $f^{(k)}(\xi)/k!$.
Corollary: For equally spaced nodes $x_i = x_0 + ih$:
$$f[x_0,\dots,x_k] = \frac{\Delta^k f(x_0)}{k!\,h^k}$$
where $\Delta$ is the forward difference operator: $\Delta f(x) = f(x + h) - f(x)$, $\Delta^2 f = \Delta(\Delta f)$, and so on.
B.2 Newton's Forward Difference Formula
For equispaced nodes $x_i = x_0 + ih$, with $s = (x - x_0)/h$:
$$p_n(x) = \sum_{k=0}^{n}\binom{s}{k}\,\Delta^k f(x_0)$$
where $\binom{s}{k} = \frac{s(s-1)\cdots(s-k+1)}{k!}$ is the generalized binomial coefficient. This is Newton's forward difference interpolation formula - the precursor to finite difference schemes in PDE solvers.
B.3 Connection to Finite Difference Schemes
The first divided difference is exactly the finite difference approximation to the derivative:
$$f[x_0, x_1] = \frac{f(x_1) - f(x_0)}{x_1 - x_0} \approx f'(x_0)$$
For equispaced nodes: $f[x_0, x_1] = \frac{f(x_0 + h) - f(x_0)}{h}$ (first-order forward difference).
For AI: Numerical differentiation (Section10-03) uses finite differences - they are divided differences with equispaced nodes. The second divided difference ($f'' \approx 2f[x_0, x_1, x_2]$) gives the second derivative approximation needed for Newton's method and Hessian-vector products.
Appendix C: Chebyshev Polynomial Deep Dive
C.1 Chebyshev Expansion Coefficients via DCT
The Chebyshev coefficients of $f$ on $[-1,1]$:
$$c_k = \frac{2}{\pi}\int_{-1}^{1}\frac{f(x)\,T_k(x)}{\sqrt{1 - x^2}}\,dx$$
can be computed via the Discrete Cosine Transform (DCT) of the values at the Chebyshev nodes:
$$c_k \approx \frac{2}{n}\sum_{j=0}^{n-1} f(x_j)\cos\frac{k(2j+1)\pi}{2n}$$
This is a DCT-II computation, achievable in $O(n\log n)$ using FFT.
Algorithm:
- Evaluate $f$ at the Chebyshev nodes
- Apply DCT-II (via FFT: reflect and take FFT, or use scipy.fft.dct)
- Normalize: divide by $n$, and halve $c_0$
C.2 Clenshaw Recurrence for Evaluation
To evaluate $p(x) = \sum_{k=0}^{n} c_k T_k(x)$ without computing each $T_k$ explicitly:
b[n+2] = 0, b[n+1] = 0
for k = n, n-1, ..., 1:
b[k] = 2x * b[k+1] - b[k+2] + c[k]
result = x * b[1] - b[2] + c[0]
This is the Clenshaw algorithm - stable evaluation of Chebyshev series.
C.3 Chebyshev Approximation Convergence Rates
For functions with various smoothness:
| Smoothness | Coefficient decay | Convergence of best approx |
|---|---|---|
| Analytic in an ellipse $E_\rho$ | $|c_k| = O(\rho^{-k})$ | $O(\rho^{-n})$ (exponential) |
| $C^p$ (with $f^{(p)}$ of bounded variation) | $|c_k| = O(k^{-p-1})$ | $O(n^{-p})$ |
| Lipschitz continuous | $|c_k| = O(k^{-2})$ | $O(n^{-1})$ |
| Discontinuous | $|c_k| = O(k^{-1})$ | no uniform convergence (Gibbs) |
For AI: The exponential convergence for analytic functions is why spectral methods achieve machine precision with far fewer points than finite differences. This matters for PINN solvers - use spectral collocation (Chebyshev) not finite differences when the solution is smooth.
Appendix D: B-Spline Implementation Details
D.1 De Boor's Algorithm
Evaluating a B-spline $s(x) = \sum_i c_i B_{i,p}(x)$ at a point $x$ in knot interval $[t_j, t_{j+1})$ (a runnable version of the standard recurrence):
def de_boor(x, t, c, p, j):
    """Evaluate the degree-p B-spline at x, where t[j] <= x < t[j+1]."""
    d = [c[i] for i in range(j - p, j + 1)]   # local control points
    for r in range(1, p + 1):
        for i in range(p, r - 1, -1):
            alpha = (x - t[i + j - p]) / (t[i + 1 + j - r] - t[i + j - p])
            d[i] = (1 - alpha) * d[i - 1] + alpha * d[i]
    return d[p]
This is the de Boor algorithm - $O(p^2)$ per evaluation, numerically stable (it uses only convex combinations).
D.2 Knot Vectors
The knot vector $t_0 \le t_1 \le \dots \le t_m$ controls the spline:
- Uniform: equal knot spacing - best for smooth periodic data
- Clamped/open: first and last knots repeated $p+1$ times - spline passes through the first and last control points
- Multiple knots: repeating a knot reduces continuity (a knot of multiplicity $m$ gives only $C^{p-m}$ continuity there)
D.3 B-Splines in KAN Architecture
Kolmogorov-Arnold Networks (KAN, Liu et al. 2024) represent each network edge as a learnable B-spline:
$$\phi(x) = w_b\,b(x) + w_s\,\mathrm{spline}(x), \qquad \mathrm{spline}(x) = \sum_i c_i\,B_i(x)$$
where $b(x) = x\,\sigma(x)$ (SiLU) is a base function and the spline term has learnable coefficients $c_i$ and an optionally adaptive grid.
Training: The grid is fixed initially; coefficients are learned by gradient descent. Periodically, the grid is updated to better cover the range of activations - grid extension.
Advantages over MLP: KANs can represent functions with simple structure (e.g., ) with far fewer parameters than MLPs. The spline representation is interpretable - you can read off the learned function shape.
Appendix E: Spline Interpolation - Error Analysis and Conditioning
E.1 Cubic Spline Error Bounds
Theorem (Hall-Meyer, 1976): Let $f \in C^4[a,b]$ and let $s$ be the clamped cubic spline on a mesh with spacing $h$. Then:
$$\|f - s\|_\infty \le \tfrac{5}{384}\,h^4\,\|f^{(4)}\|_\infty, \qquad \|f' - s'\|_\infty \le \tfrac{1}{24}\,h^3\,\|f^{(4)}\|_\infty, \qquad \|f'' - s''\|_\infty \le \tfrac{3}{8}\,h^2\,\|f^{(4)}\|_\infty$$
Practical significance:
- Halving the mesh spacing reduces position error by a factor of 16
- The derivative errors converge at $O(h^3)$ and $O(h^2)$ - you get accurate derivatives, not just accurate values
E.2 Condition Number of the Spline System
The tridiagonal matrix for cubic splines has diagonal entries $2(h_{i-1} + h_i)$ and off-diagonal entries $h_i$. For equispaced nodes with $h_i = h$:
$$T = h\cdot\mathrm{tridiag}(1, 4, 1)$$
This is a diagonally dominant tridiagonal matrix with $\kappa_\infty(T) \le 3$ - the condition number is bounded independent of $n$! This is why cubic spline computation is numerically stable.
E.3 Why Natural Splines Minimize Curvature
Proof of minimum curvature property: Let $g$ be any twice-differentiable interpolant. Write $g = s + e$ where $s$ is the natural cubic spline. Then $e(x_i) = 0$ at all nodes, and
$$\int (g'')^2 = \int (s'')^2 + 2\int s''\,e'' + \int (e'')^2$$
The cross term: integrate by parts twice, using $s'''' = 0$ on each interval (cubic), $e = 0$ at nodes, and $s''(a) = s''(b) = 0$. The cross term vanishes, and $\int (e'')^2 \ge 0$, so $\int (g'')^2 \ge \int (s'')^2$.
Appendix F: The Fast Fourier Transform - Detailed Analysis
F.1 Cooley-Tukey Algorithm Complexity
The radix-2 FFT recursion $T(n) = 2T(n/2) + O(n)$ with $n = 2^m$:
- Unrolling: $\frac{n}{2}\log_2 n$ complex multiplications and $n\log_2 n$ complex additions
- For $n = 2^{20}$: roughly $10^7$ multiplications vs $\sim 10^{12}$ operations for naive DFT
General $n$: When $n = p_1 p_2\cdots p_k$, mixed-radix FFT achieves $O(n\sum_i p_i)$. For prime $n$, Rader's algorithm reduces the DFT to a convolution of size $n - 1$. The "FFT is fast" claim requires $n$ with small prime factors - choose $n = 2^m$ in practice.
F.2 Numerical Issues in FFT
Rounding error: The FFT error grows like $O(u\log n)$ ($u$ = unit roundoff) - logarithmic growth, much better than the direct DFT's $O(u\,n)$.
Sign conventions: The sign in the exponent $e^{\pm 2\pi i jk/n}$ differs between fields. NumPy uses $e^{-2\pi i jk/n}$ for the forward transform np.fft.fft (the engineering convention).
Bit-reversal permutation: The in-place Cooley-Tukey FFT requires reordering inputs by bit-reversal. For output-order index $k$, the input index is the bit-reversal of $k$ in $\log_2 n$ bits.
F.3 FFT Applications in Deep Learning
Spectral convolution (S4, Hyena): The state-space model layer computes:
$$y = k * u$$
where $k$ is a learned convolutional kernel. Using FFT:
$$y = \mathrm{iFFT}\big(\mathrm{FFT}(k)\odot\mathrm{FFT}(u)\big)$$
Complexity: $O(n\log n)$ vs $O(n^2)$ for direct attention. For sequence length $n = 10^5$, this is a speedup of several thousand times.
FNet (Lee-Thorp et al., 2021): Replace self-attention with 2D FFT:
x = torch.fft.fft2(x, norm='ortho').real
Unparameterized - no learned weights - yet achieves 92-97% of BERT accuracy on GLUE. Demonstrates that long-range mixing doesn't require learned attention patterns.
Appendix G: Radial Basis Functions - Stability and Algorithms
G.1 The Shape Parameter Dilemma
For Gaussian RBFs $\phi(r) = e^{-(\varepsilon r)^2}$, the interpolation matrix $A_{ij} = \phi(\|x_i - x_j\|)$:
- Large $\varepsilon$ (narrow bumps): $A \approx I$ (well-conditioned), but poor approximation quality
- Small $\varepsilon$ (flat bumps): near-flat Gaussians, $\kappa(A)$ grows exponentially (catastrophically ill-conditioned)
The uncertainty principle for RBFs: You cannot simultaneously have a well-conditioned system and high approximation accuracy. This is Schaback's uncertainty principle.
Solutions:
- Regularized RBF: Minimize $\|A\alpha - y\|^2 + \lambda\,\alpha^T A\alpha$ - ridge regression with the RBF kernel
- Compactly supported RBFs (Wendland): Sparse $A$, cheap to solve, but less smooth
- Contour-Pade / RBF-QR algorithms: Reformulate the flat-limit evaluation using contour integration, avoiding catastrophic cancellation
G.2 Matern Kernels and Sobolev Spaces
The Matern kernel of order $\nu$ corresponds to an RKHS equal to the Sobolev space $H^{\nu + d/2}$:
$$k_\nu(r) = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\,r}{\ell}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\,r}{\ell}\right)$$
where $K_\nu$ is the modified Bessel function. Special cases:
- $\nu = 1/2$: exponential kernel $e^{-r/\ell}$ (continuous, non-differentiable functions)
- $\nu = 3/2$: $(1 + \sqrt{3}\,r/\ell)\,e^{-\sqrt{3}\,r/\ell}$ (once-differentiable functions)
- $\nu = 5/2$: $(1 + \sqrt{5}\,r/\ell + 5r^2/(3\ell^2))\,e^{-\sqrt{5}\,r/\ell}$ (twice-differentiable functions)
- $\nu \to \infty$: Gaussian ($C^\infty$ functions)
For ML: The Matern-5/2 kernel is the default in most GP libraries (scikit-learn, GPyTorch) because it provides enough smoothness without the extreme sensitivity of the Gaussian kernel.
G.3 Sparse Gaussian Processes
Inducing point methods: For $n$ data points and $m \ll n$ inducing points:
$$p(f) \approx \int p(f \mid u)\,q(u)\,du$$
where $u$ are function values at the inducing points. Complexity: $O(nm^2)$ instead of $O(n^3)$.
Variational inducing methods (SVGP, Hensman et al.): Optimize inducing point locations jointly with model parameters via variational lower bound. Enables stochastic mini-batch training for GPs - analogous to stochastic gradient descent for neural networks.
Appendix H: Fourier Features and Neural Tangent Kernel
H.1 Bochner's Theorem and Random Fourier Features
Bochner's theorem: A continuous, shift-invariant kernel $k(x - y)$ is positive definite if and only if it is the Fourier transform of a non-negative measure $\mu$:
$$k(x - y) = \int_{\mathbb{R}^d} e^{i\omega^T(x - y)}\,d\mu(\omega)$$
Random Fourier features approximate this as a Monte Carlo integral:
$$k(x - y) \approx \frac{1}{D}\sum_{j=1}^{D} z_{\omega_j}(x)\,z_{\omega_j}(y), \qquad z_\omega(x) = \sqrt{2}\cos(\omega^T x + b)$$
where $\omega_j \sim p(\omega) \propto \mu$ and $b_j \sim \mathrm{Uniform}[0, 2\pi]$.
Convergence: With $D$ random features, the approximation error satisfies:
$$|\hat{k}(x, y) - k(x, y)| = O(1/\sqrt{D})$$
with high probability (uniformly over compact sets, up to logarithmic factors).
H.2 The Neural Tangent Kernel
An infinite-width neural network trained by gradient descent with squared loss implements kernel regression with the neural tangent kernel (NTK):
$$K_{\mathrm{NTK}}(x, x') = \langle \nabla_\theta f(x;\theta),\,\nabla_\theta f(x';\theta)\rangle$$
Key results:
- At initialization, $K_{\mathrm{NTK}}$ becomes deterministic as width $\to \infty$ (independent of the random draw)
- For infinite width, $K_{\mathrm{NTK}}$ stays constant throughout training
- The trained model converges to the kernel regression solution:
$$f(x) = K_{\mathrm{NTK}}(x, X)\,K_{\mathrm{NTK}}(X, X)^{-1}\,y$$
Implications:
- Infinite-width networks are not superior to kernel SVMs (they implement kernel regression)
- The expressiveness advantage of neural networks comes from finite width and feature learning
- NTK theory breaks down for finite-width networks with large learning rates (the "feature learning regime")
For AI: The NTK framework explains:
- Why overparameterized networks can fit any training data (kernel interpolation)
- Why neural scaling laws exist (larger networks have richer NTK kernels)
- The transition from "kernel regime" to "feature learning regime" in modern LLM training
Appendix I: Practical Implementation Guide
I.1 Python Ecosystem for Interpolation
from scipy import interpolate
# Lagrange/Barycentric interpolation
interp = interpolate.BarycentricInterpolator(nodes, values)
y_new = interp(x_new)
# Cubic spline
cs = interpolate.CubicSpline(x, y, bc_type='natural')
y_pred = cs(x_new)
y_deriv = cs(x_new, 1) # First derivative
# B-spline fitting
from scipy.interpolate import make_interp_spline
bspline = make_interp_spline(x, y, k=3) # cubic
y_pred = bspline(x_new)
# Smoothing spline (least-squares B-spline)
tck = interpolate.splrep(x, y, s=0.1) # s = smoothing factor
y_pred = interpolate.splev(x_new, tck)
# RBF interpolation
rbf = interpolate.RBFInterpolator(X_2d, y, kernel='gaussian', epsilon=1.0)
y_new = rbf(X_test)
I.2 NumPy/SciPy FFT
import numpy as np
from scipy.fft import fft, ifft, fftfreq, dct, idct
# FFT of a real signal
n = 1024
t = np.linspace(0, 1, n, endpoint=False)
y = np.sin(2*np.pi*5*t) + 0.5*np.sin(2*np.pi*13*t)
Y = fft(y)
freqs = fftfreq(n, d=1/n) # Frequencies in cycles/unit
# Power spectral density
psd = np.abs(Y)**2 / n
# Inverse FFT
y_recovered = np.real(ifft(Y))
# DCT for Chebyshev coefficients
from scipy.fft import dct
coeffs = dct(f_values, type=2, norm='ortho') / n # Normalized
I.3 Chebyshev Interpolation with Chebfun-style Code
def cheb_nodes(n, a=-1, b=1):
"""Chebyshev nodes of the 2nd kind on [a,b]."""
k = np.arange(n+1)
x = np.cos(np.pi * k / n) # on [-1, 1]
return (a + b)/2 + (b - a)/2 * x
def cheb_coeffs(f_values):
    """Chebyshev expansion coefficients via DCT-I.
    f_values must be samples at cheb_nodes(n), which are already in
    descending order x_k = cos(pi*k/n), so no reordering is needed."""
    n = len(f_values) - 1
    from scipy.fft import dct
    c = dct(f_values, type=1) / n  # DCT-I
    c[0] /= 2; c[-1] /= 2
    return c
def cheb_eval(c, x):
"""Clenshaw algorithm for Chebyshev series evaluation."""
n = len(c) - 1
b2, b1 = 0.0, 0.0
for k in range(n, 0, -1):
b2, b1 = b1, 2*x*b1 - b2 + c[k]
return x*b1 - b2 + c[0]
Appendix J: Worked Examples - Interpolation in Practice
J.1 Fitting a Binding Energy Curve
In computational chemistry, the binding energy as a function of inter-atomic distance is often known at a few points from DFT calculations. We want a smooth interpolant for geometry optimization.
Problem: r (A):  1.8   2.0   2.2   2.5   3.0   4.0
         E (eV): -2.5  -3.1  -3.2  -3.0  -2.7  -2.3
Method: Cubic spline with not-a-knot BC
Result: Minimum at r* ~ 2.24 A, E* ~ -3.21 eV
Derivatives: E'(r*) ~ 0 (equilibrium), E''(r*) > 0 (stable)
This is used in molecular dynamics force fields and materials design ML models.
J.2 Positional Encoding Extrapolation
Problem: A model trained with sinusoidal PEs on positions $0$-$511$ needs to handle position 1024.
Analysis: At PE dimension $i$, the wavelength is $2\pi\cdot 10000^{2i/d}$:
- High-frequency dimensions (small wavelength): position 1024 wraps through many cycles already seen during training
- Low-frequency dimensions (large wavelength): the period is much larger than 512, so position 1024 lies in a part of the "first cycle" never seen
YaRN solution: Scale position indices by the ratio (training length / target length):
$$pos' = pos\cdot\frac{512}{1024}$$
This maps position 1024 in the target context to position 512 in the training range - guaranteed to be seen. At the cost of reduced resolution at low frequencies.
J.3 Kernel Method vs Neural Network
For a 1D regression problem with $n = 100$ noisy samples of a smooth function:
| Method | Parameters | Training error | Test error | Time |
|---|---|---|---|---|
| Degree-5 polynomial (QR) | 6 | 0.09 | 0.11 | 1ms |
| Cubic spline (natural) | 100 | 0.0 (exact) | 0.10 | 1ms |
| Gaussian RBF (interpolation) | 100 | 0.0 (exact) | 0.09 | 10ms |
| Gaussian RBF + ridge (small $\lambda$) | 100 | 0.008 | 0.09 | 10ms |
| 2-layer MLP (100 units) | 10200 | 0.002 | 0.10 | 1s |
For smooth 1D functions with modest amounts of data, splines and kernel methods match or outperform neural networks in accuracy and efficiency. Neural networks win for high-dimensional data, compositional structure, and when priors about the function class are unknown.
Appendix K: Connections to Other Chapters
| Topic | Connection | Reference |
|---|---|---|
| Eigenvalue decomposition | Spectral methods diagonalize differentiation/integration operators | Section03-Advanced-Linear-Algebra |
| Fourier series | Spectral analysis of signals, convolutional layers | Section07-Probability (spectral density) |
| Taylor series | Polynomial approximation of smooth functions | Section01-Mathematical-Foundations |
| Optimization | Remez algorithm, RKHS optimization (representer theorem) | Section08-Optimization, Section10-03 |
| Graph theory (spectral GNNs) | Chebyshev polynomial filters on graph Laplacian | Section11-Graph-Theory |
| Numerical integration | Gaussian quadrature uses zeros of orthogonal polynomials | Section10-05 |
| Floating-point | Vandermonde ill-conditioning; Chebyshev stability | Section10-01 |
| Linear algebra | QR for least squares; tridiagonal solver for splines | Section10-02 |
Appendix L: Lebesgue Constants and Interpolation Stability
L.1 Definition and Computation
The Lebesgue constant for a set of nodes is:
$$\Lambda_n = \max_{x \in [a,b]}\sum_{j=0}^{n}|\ell_j(x)|$$
It measures the worst-case amplification of errors:
$$\|f - p_n\|_\infty \le (1 + \Lambda_n)\,E_n^*(f)$$
where $E_n^*(f)$ is the best approximation error.
Growth rates:
| Node set | $\Lambda_n$ growth |
|---|---|
| Equispaced on $[-1,1]$ | $\sim 2^{n+1}/(e\,n\log n)$ (exponential) |
| Chebyshev points (1st kind) | $\sim \frac{2}{\pi}\log n$ (logarithmic) |
| Chebyshev points (2nd kind) | $\sim \frac{2}{\pi}\log n$ (logarithmic) |
| Optimal nodes | $\frac{2}{\pi}\log n + O(1)$ (optimal up to constant) |
| Gauss-Legendre nodes | $O(\sqrt{n})$ |
Practical consequence: At $n = 30$, equispaced $\Lambda_{30} \approx 10^7$ while Chebyshev $\Lambda_{30} \approx 3$. A $10^{-8}$ perturbation of the data values could cause an $O(0.1)$ error in the interpolant with equispaced nodes but only an error of order $10^{-8}$ with Chebyshev.
L.2 Lebesgue Function Visualization
LEBESGUE FUNCTION  Lambda(x) = sum_j |l_j(x)|   (n=10)
========================================================================
20 - Equispaced nodes
|
15 -
|
10 -
|
5 -
|
1 ----------------------------------- Chebyshev (~2.0 everywhere)
-1 1
Equispaced Lebesgue function peaks near +/-1 (endpoint Runge zones).
Chebyshev function is nearly constant across [-1,1].
========================================================================
Appendix M: Multidimensional Interpolation
M.1 Tensor Product Methods
For functions on $[0,1]^d$, a tensor product approach uses one-dimensional bases in each dimension:
$$f(x_1,\dots,x_d) \approx \sum_{i_1}\cdots\sum_{i_d} c_{i_1\cdots i_d}\,\phi_{i_1}(x_1)\cdots\phi_{i_d}(x_d)$$
Curse of dimensionality: Requires $N^d$ basis functions - exponential in $d$. For $N = 10$, $d = 10$: $10^{10}$ coefficients.
Sparse grids (Smolyak): Use only tensor products of 1D bases whose total resolution level is bounded. Complexity: $O(N(\log N)^{d-1})$ - much better. Used in high-dimensional integration and interpolation (quantum chemistry, finance).
M.2 Scattered Data in High Dimensions
For scattered data in with , polynomial interpolation and tensor product methods fail. Alternatives:
RBF interpolation: $s(x) = \sum_i \alpha_i\,\phi(\|x - x_i\|)$ - mesh-free, works for any $d$. But the Gaussian RBF system becomes severely ill-conditioned without careful shape-parameter selection.
Nearest-neighbor interpolation: $s(x) = y_{i^*}$ where $i^* = \arg\min_i \|x - x_i\|$. Discontinuous, $O(\log n)$ per query with KD-tree preprocessing.
Inverse distance weighting (IDW): $s(x) = \frac{\sum_i w_i y_i}{\sum_i w_i}$ where $w_i = \|x - x_i\|^{-p}$. Continuous but not smooth.
For AI: High-dimensional function approximation is precisely the task of neural networks. The kernel trick (kernel regression) extends RBF to high dimensions through the kernel matrix - but computing and inverting this matrix is $O(n^2)$ in storage and $O(n^3)$ in time.
M.3 Triangulation-Based Methods
For scattered 2D data, Delaunay triangulation partitions the domain into non-overlapping triangles. Piecewise linear or cubic interpolation on each triangle:
- Linear (barycentric): $C^0$, $O(\log n)$ per evaluation after triangulation
- Clough-Tocher cubic: $C^1$, splits each triangle into 3 subtriangles
scipy.interpolate.LinearNDInterpolator and CloughTocher2DInterpolator implement these.
Appendix N: Approximation Theory Connections to Deep Learning
N.1 Width vs Depth in Approximation
Shallow networks (width $N$, depth 1):
$$\inf_{g}\|f - g\| = O(N^{-s/d})$$
for $f$ in a Sobolev space of smoothness $s$ - a polynomial rate, cursed by dimension.
Deep ReLU networks: For functions with compositional structure $f = f_L \circ \cdots \circ f_1$ where each $f_i$ depends on at most $d_{\max}$ variables:
$$\inf_{g}\|f - g\| = O(N^{-s/d_{\max}})$$
where $d_{\max} \ll d$ - the approximation rate depends on the intrinsic dimension of the composition, not the ambient dimension $d$.
Implication: Deep networks exploit compositional structure to avoid the curse of dimensionality. A function like $g(x_1 + x_2,\,x_3 x_4)$ has intrinsic dimension 2 even though the ambient dimension may be large - deep networks can approximate it at the two-dimensional rate.
N.2 The Kolmogorov-Arnold Representation Theorem
Theorem (Kolmogorov, 1957): Every continuous function $f: [0,1]^d \to \mathbb{R}$ can be written as:
$$f(x_1,\dots,x_d) = \sum_{q=0}^{2d}\Phi_q\!\left(\sum_{p=1}^{d}\phi_{q,p}(x_p)\right)$$
for continuous functions $\Phi_q$ and $\phi_{q,p}$.
Significance: Every multivariate continuous function can be written using only univariate functions and addition. No curse of dimensionality in this representation.
Practical issue: The inner functions $\phi_{q,p}$ may be highly irregular (even nowhere differentiable) - impossible to learn efficiently with gradient methods. The theorem is an existence result, not a constructive algorithm.
KAN circumvents this: By restricting to learned B-splines (differentiable, finitely parameterized), KAN trades the full generality of Kolmogorov for a practical learnable version.
N.3 Function Classes and Approximation Rates
APPROXIMATION RATE BY FUNCTION CLASS AND METHOD
========================================================================
Function class            Best method            Rate
---------------------     ------------------     ------------------
Polynomial of deg <= k    Exact polynomial       0 (exact)
Analytic (ellipse rho)    Chebyshev truncation   O(rho^-n) (exponential)
C^p on [a,b]              Poly degree n          O(n^-p)
Sobolev W^{s,2}           Spline, mesh h         O(h^s)
Barron class              Shallow network        O(N^-1/2) (dim-free!)
Compositional             Deep network           O(N^-s/d_max)
Arbitrary continuous      Universal approx.      eps-approx, no rate
N = number of neurons, n = polynomial degree, h = mesh size
========================================================================
Appendix O: Summary Reference Tables
O.1 Interpolation Methods Comparison
| Method | Nodes | Error order | Stability | AI use |
|---|---|---|---|---|
| Lagrange (equispaced) | Any | $O(h^{n+1})$, but constant grows fast | Poor (Runge) | Avoid |
| Lagrange (Chebyshev) | Chebyshev | $O(\rho^{-n})$ for analytic $f$ | Good ($\Lambda_n \sim \tfrac{2}{\pi}\log n$) | ChebNet |
| Newton divided diff. | Any | Same as Lagrange | Moderate | Incremental fitting |
| Cubic spline (natural) | Any | $O(h^4)$ | Excellent | Neural ODEs |
| B-spline | Arbitrary knots | $O(h^{p+1})$ | Excellent | KAN |
| Trigonometric | Equispaced | Exponential (smooth periodic) | Good | FFT, PE |
| RBF (Gaussian) | Scattered | Exponential (analytic $f$) | Sensitive to $\varepsilon$ | Kernels, GP |
| Polynomial LS (QR) | Any | Near-best approximation | Good | Feature eng. |
O.2 Key Theorems Quick Reference
| Theorem | Statement | Location |
|---|---|---|
| Existence/uniqueness | Unique degree-$n$ interpolant for $n+1$ distinct nodes | Section 2.1 |
| Interpolation error | $f(x) - p_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\,\omega_{n+1}(x)$ | Section 2.3 |
| Chebyshev minimax | $2^{1-n}T_n$ has smallest $\infty$-norm among monic degree-$n$ polys | Section 3.1 |
| Equioscillation | Best approximation iff error equioscillates at $n+2$ points | Section 3.3 |
| Spline min. curvature | Natural cubic spline minimizes $\int (g'')^2\,dx$ among interpolants | Section 4 |
| Representer theorem | Regularized RKHS problem solved by finite kernel expansion | Section 7.2 |
| Universal approximation | Shallow networks dense in $C(K)$ | Section 8.2 |
| Nyquist theorem | Reconstruct a $B$-band-limited signal from samples at rate $\ge 2B$ | Section 6.4 |
| Bochner's theorem | PD shift-invariant kernels = Fourier transforms of non-negative measures | Section H.1 |
| Kolmogorov-Arnold | All continuous $f$ = compositions of univariate functions and addition | Section N.2 |
Appendix P: Extended Exercises with Solutions
P.1 Barycentric Interpolation - Full Implementation
import numpy as np

def cheb2_nodes(n, a=-1.0, b=1.0):
    """Chebyshev nodes of the 2nd kind (n+1 nodes) on [a, b]."""
    k = np.arange(n + 1)
    x = np.cos(np.pi * k / n)  # on [-1, 1]
    return 0.5*(a+b) + 0.5*(b-a)*x

def barycentric_weights(nodes):
    """Compute barycentric weights for arbitrary nodes (O(n^2))."""
    n = len(nodes) - 1
    w = np.ones(n + 1)
    for j in range(n + 1):
        for k in range(n + 1):
            if k != j:
                w[j] *= (nodes[j] - nodes[k])
    return 1.0 / w

def cheb2_weights(n):
    """Barycentric weights for Chebyshev 2nd-kind nodes (O(n) formula)."""
    w = (-1.0) ** np.arange(n + 1)  # alternating signs first ...
    w[0] *= 0.5                     # ... then halve the endpoint weights,
    w[n] *= 0.5                     # so the formula is correct for odd and even n
    return w

def bary_eval(x, nodes, values, weights):
    """Evaluate the barycentric interpolant at scalar x."""
    diffs = x - nodes
    mask = diffs == 0.0
    if np.any(mask):
        return values[mask][0]  # x coincides with a node
    w_d = weights / diffs
    return np.dot(w_d, values) / w_d.sum()

# Example: interpolate f(x) = 1/(1+25x^2) with Chebyshev nodes
n = 20
nodes = cheb2_nodes(n)
f = lambda x: 1.0 / (1 + 25*x**2)
values = f(nodes)
weights = cheb2_weights(n)
x_test = np.linspace(-1, 1, 500)
y_interp = np.array([bary_eval(xi, nodes, values, weights) for xi in x_test])
y_exact = f(x_test)
print(f"Max error (Chebyshev n=20): {np.abs(y_interp - y_exact).max():.2e}")
P.2 Cubic Spline - Full Implementation
def natural_cubic_spline(x, y):
    """
    Compute natural cubic spline data: second derivatives M at the nodes.
    Natural boundary conditions: M[0] = M[n] = 0.
    Returns (M, h) for use with eval_spline.
    """
    n = len(x) - 1
    h = np.diff(x)
    # Right-hand side: scaled second divided differences of the data
    dd = np.zeros(n - 1)
    for i in range(1, n):
        dd[i-1] = 6 * ((y[i+1] - y[i])/h[i] - (y[i] - y[i-1])/h[i-1])
    # Tridiagonal system for the interior second derivatives M[1..n-1]
    diag = 2*(h[:-1] + h[1:])   # main diagonal
    off = h[1:-1]               # sub-/super-diagonal (system is symmetric)
    # Thomas algorithm (tridiagonal solver)
    n_sys = n - 1
    c_diag = diag.copy()
    c_rhs = dd.copy()
    # Forward sweep
    for i in range(1, n_sys):
        m = off[i-1] / c_diag[i-1]
        c_diag[i] -= m * off[i-1]
        c_rhs[i] -= m * c_rhs[i-1]
    # Back substitution
    M = np.zeros(n + 1)  # second derivatives at nodes; M[0] = M[n] = 0 (natural BC)
    M[n_sys] = c_rhs[n_sys-1] / c_diag[n_sys-1]
    for i in range(n_sys-2, -1, -1):
        M[i+1] = (c_rhs[i] - off[i] * M[i+2]) / c_diag[i]
    return M, h

def eval_spline(x_eval, x_nodes, y_nodes, M, h):
    """Evaluate the natural cubic spline at x_eval (scalar or array)."""
    i = np.searchsorted(x_nodes, x_eval, side='right') - 1
    i = np.clip(i, 0, len(x_nodes) - 2)
    hi = h[i]
    S = (M[i] * (x_nodes[i+1] - x_eval)**3 / (6*hi) +
         M[i+1] * (x_eval - x_nodes[i])**3 / (6*hi) +
         (y_nodes[i] - M[i]*hi**2/6) * (x_nodes[i+1] - x_eval)/hi +
         (y_nodes[i+1] - M[i+1]*hi**2/6) * (x_eval - x_nodes[i])/hi)
    return S
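A quick usage check of the two functions above; the node count is an arbitrary illustrative choice (note that $\sin$ happens to satisfy the natural boundary conditions exactly, since $\sin''$ vanishes at $0$ and $2\pi$).

```python
# Usage sketch: interpolate sin(x) on [0, 2*pi] with 12 nodes.
x_nodes = np.linspace(0, 2*np.pi, 12)
y_nodes = np.sin(x_nodes)
M, h = natural_cubic_spline(x_nodes, y_nodes)
x_fine = np.linspace(0, 2*np.pi, 400)
y_spline = eval_spline(x_fine, x_nodes, y_nodes, M, h)
print(f"Max spline error: {np.abs(y_spline - np.sin(x_fine)).max():.2e}")  # ~O(h^4)
```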
P.3 Chebyshev Spectral Differentiation Matrix
A key tool in spectral methods is the differentiation matrix $D$ such that $(D\mathbf{f})_i \approx f'(x_i)$ at the Chebyshev nodes $x_i$:
def cheb_diff_matrix(n):
    """
    Chebyshev spectral differentiation matrix for n+1 nodes.
    D @ f_values approximates f' at the Chebyshev nodes.
    """
    if n == 0:
        return np.array([[0.]])
    x = np.cos(np.pi * np.arange(n+1) / n)
    c = np.ones(n+1)
    c[0] = 2
    c[n] = 2
    c *= (-1)**np.arange(n+1)
    D = np.zeros((n+1, n+1))
    for i in range(n+1):
        for j in range(n+1):
            if i != j:
                D[i, j] = c[i] / (c[j] * (x[i] - x[j]))
    # Diagonal via the negative-sum trick: each row must annihilate constants
    D -= np.diag(D.sum(axis=1))
    return D

# Example: differentiate f(x) = sin(pi*x) on [-1, 1]
n = 32
D = cheb_diff_matrix(n)
x = np.cos(np.pi * np.arange(n+1) / n)
f_vals = np.sin(np.pi * x)
f_prime_approx = D @ f_vals
f_prime_exact = np.pi * np.cos(np.pi * x)
print(f"Spectral diff error: {np.abs(f_prime_approx - f_prime_exact).max():.2e}")
Appendix Q: Advanced Topics in Function Approximation
Q.1 Pade Approximants
Definition: A Pade approximant $[m/n]_f(x) = P_m(x)/Q_n(x)$ is a rational function that agrees with the Taylor series of $f$ up to order $m + n$:
$$
f(x) - \frac{P_m(x)}{Q_n(x)} = O(x^{m+n+1})
$$
Advantages over polynomials:
- Can represent poles and asymptotic behavior
- Often much more accurate than same-degree Taylor polynomials
- Diagonal Pade approximants $[n/n]$ often converge in larger domains than the Taylor series
Example: the $[1/1]$ Pade approximant of the exponential function is
$$
e^x \approx \frac{2 + x}{2 - x}
$$
This is the foundation of Pade-based matrix exponential algorithms (used in ODE solvers and in graph neural networks for diffusion operators).
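A minimal sketch comparing a Pade approximant against the same-order Taylor polynomial, built with `scipy.interpolate.pade`; the evaluation point $x = 1.5$ is an arbitrary illustrative choice.

```python
import numpy as np
from math import factorial
from scipy.interpolate import pade

# [2/2] Pade approximant of e^x from its Taylor coefficients up to x^4.
an = [1.0 / factorial(k) for k in range(5)]
p, q = pade(an, 2)                            # numerator and denominator, degree 2 each
x = 1.5
taylor = sum(a * x**k for k, a in enumerate(an))
print(np.exp(x), p(x) / q(x), taylor)         # Pade value is closer to exp(x)
```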
Q.2 Wavelet Approximation
Wavelets provide multi-resolution approximation: approximate a function at coarse scale first, then add fine-scale details.
The Daubechies wavelet basis $\{\psi_{j,k}(x) = 2^{j/2}\,\psi(2^j x - k)\}$ provides:
- Orthogonal decomposition
- Local support: $\psi_{j,k}$ has support of width $O(2^{-j})$ around $k\,2^{-j}$
- Vanishing moments: $\int x^m\, \psi(x)\, dx = 0$ for $m = 0, \ldots, p-1$
- Adaptive sparsity: functions with local irregularities (edges, singularities) are represented sparsely
For AI: Wavelets appear in:
- Image compression (JPEG 2000 uses wavelet decomposition)
- Multi-scale neural architectures (UNet uses hierarchical scales)
- Graph wavelets for multi-resolution graph signals
Q.3 The NUFFT and Non-Uniform Sampling
The standard FFT assumes uniform sampling. For non-uniform samples $x_j$, the Non-Uniform FFT (NUFFT) computes
$$
F_k = \sum_{j=1}^{N} f_j\, e^{-i k x_j}
$$
in $O(N \log N)$ by:
- Convolving point masses at $x_j$ with a Gaussian kernel onto a uniform grid
- Applying the FFT on the uniform grid
- Deconvolving the Gaussian in the frequency domain
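A minimal direct $O(NM)$ evaluation of the transform the NUFFT accelerates; on uniform nodes $x_j = 2\pi j/N$ it reduces to the standard DFT, which gives a convenient sanity check against `np.fft.fft` (the sizes here are illustrative).

```python
import numpy as np

def nudft(x, f, ks):
    """F_k = sum_j f_j * exp(-i k x_j), evaluated directly (no gridding)."""
    return np.array([np.sum(f * np.exp(-1j * k * x)) for k in ks])

N = 64
x_uniform = 2 * np.pi * np.arange(N) / N          # uniform special case
f = np.random.default_rng(0).standard_normal(N)
print(np.allclose(nudft(x_uniform, f, np.arange(N)), np.fft.fft(f)))  # True
```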
For AI: NUFFT is used in:
- MRI reconstruction: k-space data is sampled along non-uniform trajectories
- Molecular dynamics: Computing electrostatic interactions via particle-mesh Ewald (PME)
- Astronomy: Radio telescope arrays measure non-uniform Fourier samples
Appendix R: Reproducible Experiments
All code in this section is reproducible with numpy, scipy, and matplotlib. The key experiments:
# Experiment 1: Runge's phenomenon comparison
import numpy as np
from scipy.interpolate import BarycentricInterpolator

f = lambda x: 1/(1+25*x**2)
x_plot = np.linspace(-1, 1, 500)
for n in [5, 10, 15]:
    # Equispaced nodes
    nodes_eq = np.linspace(-1, 1, n+1)
    p_eq = BarycentricInterpolator(nodes_eq, f(nodes_eq))
    # Chebyshev (2nd kind) nodes
    nodes_ch = np.cos(np.pi * np.arange(n+1) / n)
    p_ch = BarycentricInterpolator(nodes_ch, f(nodes_ch))
    err_eq = np.abs(f(x_plot) - p_eq(x_plot)).max()
    err_ch = np.abs(f(x_plot) - p_ch(x_plot)).max()
    print(f"n={n}: equispaced error={err_eq:.2e}, Chebyshev error={err_ch:.2e}")
# Experiment 2: Lebesgue constants at Chebyshev 2nd-kind nodes
for n in range(1, 20):
    nodes = np.cos(np.pi * np.arange(n+1) / n)
    x_test = np.linspace(-1, 1, 1000)
    # Barycentric weights: alternating signs, endpoints halved
    # (this ordering is correct for both odd and even n)
    weights = (-1.0) ** np.arange(n+1)
    weights[0] *= 0.5
    weights[n] *= 0.5
    # Lebesgue function: Lambda(x) = sum_j |ell_j(x)|
    lebesgue = []
    for x in x_test:
        diffs = x - nodes
        if np.any(diffs == 0):
            lebesgue.append(1.0)  # at a node, exactly one basis function equals 1
        else:
            w_d = weights / diffs
            # Lagrange basis values: ell_j(x) = (w_j/(x-x_j)) / sum_k (w_k/(x-x_k))
            total = w_d.sum()
            lebesgue.append(np.abs(w_d / total).sum())
    print(f"n={n}: Lambda_n = {max(lebesgue):.4f}")
Appendix S: Historical Notes on Interpolation
S.1 Newton's Contribution
Isaac Newton introduced the method of divided differences in "Methodus Differentialis" (1711). His motivation was astronomical: computing planetary positions at arbitrary times from a table of observed positions at discrete times. The divided difference table allowed him to update the interpolant incrementally as new observations came in - an $O(n)$ update per new data point, remarkable for 1711.
S.2 Chebyshev's Insight
Pafnuty Chebyshev (1853) asked: which monic polynomial of degree $n$ has the smallest maximum absolute value on $[-1, 1]$? His answer, $2^{1-n} T_n(x)$, was the first systematic use of the minimax criterion in approximation theory. Chebyshev was motivated by mechanism design: he wanted to minimize the deviation from ideal motion in steam engine linkages.
The connection between the trigonometric definition $T_n(x) = \cos(n \arccos x)$ and the polynomial form was noticed by Chebyshev himself - applying the cosine addition formula iteratively.
S.3 Runge's Warning
Carl Runge published his famous counterexample in 1901. The function $f(x) = 1/(1 + 25x^2)$ is analytic everywhere in the complex plane except at $x = \pm i/5$ - poles that lie inside the critical region for equispaced interpolation on $[-1, 1]$. This means the function's Taylor series has radius of convergence $1/5$, and polynomial interpolation at equispaced nodes tries to capture behavior that is "barely analytic" in the relevant region.
Runge's result stood as an important warning against blindly increasing the polynomial degree on equispaced nodes. The modern remedy (Chebyshev nodes) emerged gradually through the 20th century.
S.4 Cooley-Tukey and the FFT
The Cooley-Tukey FFT algorithm (1965) is often cited as one of the most important algorithms of the 20th century. Its impact was immediate: it made digital signal processing practical, enabled the digital revolution in audio/communications, and later became the backbone of scientific computing.
Remarkably, the core idea (exploiting DFT factorizations) was known to Gauss around 1805, appearing only posthumously in his collected works. The algorithm was independently rediscovered multiple times before Cooley and Tukey's landmark paper.
S.5 The Representer Theorem and RKHS
The representer theorem (Kimeldorf-Wahba, 1971) established that regularized function estimation over an RKHS has a finite-dimensional solution - transforming an infinite-dimensional optimization into a finite linear algebra problem. This unified spline smoothing, kernel regression, and support vector machines under one mathematical framework. The same theorem, applied to the neural tangent kernel, explains why overparameterized neural networks trained by gradient descent converge to global minima that generalize well.
Appendix T: Rational Approximation and Pade Theory
T.1 Pade Table
For a function $f(x) = \sum_{k=0}^{\infty} c_k x^k$, the Pade approximant $[m/n]$ satisfies:
$$
f(x) - \frac{P_m(x)}{Q_n(x)} = O(x^{m+n+1}), \qquad \deg P_m \le m,\ \deg Q_n \le n,\ Q_n(0) = 1
$$
The Pade table arranges these approximants in a grid indexed by $(m, n)$:
PADE TABLE for e^x
========================================================================
n\m |    0                 1                     2
----+-------------------------------------------------------------------
 0  |    1                 1+x                   1+x+x^2/2
 1  |    1/(1-x)           (2+x)/(2-x)           (6+4x+x^2)/(6-2x)
 2  |    1/(1-x+x^2/2)     (6+2x)/(6-4x+x^2)     (12+6x+x^2)/(12-6x+x^2)
Diagonal [n/n] entries converge fastest for analytic functions.
========================================================================
Convergence: Diagonal Pade approximants $[n/n]$ often converge to $f$ well beyond the disk $|x| < \rho$ in which the Taylor series converges - in contrast to the Taylor series itself, which necessarily diverges for $|x| > \rho$, where $\rho$ is the radius of convergence (the distance to the nearest singularity).
T.2 Matrix Pade and the Matrix Exponential
The matrix exponential $e^A$ is computed using Pade approximants in most numerical libraries. The diagonal Pade approximant:
$$
e^A \approx Q_m(A)^{-1} P_m(A)
$$
where $P_m$ and $Q_m$ are matrix polynomials of degree $m$. Combined with scaling and squaring (using $e^A = (e^{A/2^s})^{2^s}$):
- Scale: $B = A/2^s$ with $\|B\|$ small
- Compute the Pade approximant $r_m(B) \approx e^B$
- Square: $e^A \approx r_m(B)^{2^s}$
This is used in scipy.linalg.expm and is essential for:
- Continuous-time Markov chain transition matrices
- Neural ODE solvers (computing $e^{tA}$ exactly for linear dynamics)
- Graph diffusion for GNNs (heat kernel $e^{-tL}$)
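A minimal sanity check of scipy's Pade-based `expm` against a truncated Taylor series; the random $4\times 4$ matrix (scaled small so the series converges quickly) is an illustrative assumption.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
A = 0.5 * rng.standard_normal((4, 4))
E = expm(A)                    # scaling-and-squaring + diagonal Pade internally
S, term = np.eye(4), np.eye(4)
for k in range(1, 30):         # Taylor series, adequate here since ||A|| is small
    term = term @ A / k
    S = S + term
print(np.abs(E - S).max())     # agreement to near machine precision
```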
Appendix U: Connection to Modern LLM Architectures
U.1 RoPE as Trigonometric Interpolation
Rotary Position Embedding (RoPE) represents positions as rotations in 2D subspaces of the embedding:
$$
q_m = R_{\Theta, m}\, q
$$
where $R_{\Theta, m}$ is a block-diagonal rotation matrix with $2 \times 2$ blocks $\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}$.
The attention score:
$$
\langle R_{\Theta, m}\, q,\; R_{\Theta, n}\, k \rangle = \langle q,\; R_{\Theta, n-m}\, k \rangle
$$
depends only on the relative position $n - m$. The frequencies $\theta_i = 10000^{-2i/d}$ match the sinusoidal PE frequencies - this is the same trigonometric interpolation scheme; a minimal numerical check follows below.
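A minimal RoPE sketch (the dimension 8 and positions are arbitrary illustrative choices): rotate consecutive coordinate pairs by position-dependent angles and verify that the dot product depends only on the relative offset.

```python
import numpy as np

def rope(v, pos, base=10000.0):
    """Apply RoPE: rotate consecutive pairs of v by angles pos * theta_i."""
    d = v.shape[0]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # per-pair frequencies
    ang = pos * theta
    c, s = np.cos(ang), np.sin(ang)
    v2 = v.reshape(-1, 2)
    out = np.empty_like(v2)
    out[:, 0] = c * v2[:, 0] - s * v2[:, 1]          # 2x2 rotation per pair
    out[:, 1] = s * v2[:, 0] + c * v2[:, 1]
    return out.reshape(-1)

rng = np.random.default_rng(1)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# Relative-position property: offset is 4 in both cases, so the scores match.
print(np.isclose(rope(q, 3) @ rope(k, 7), rope(q, 10) @ rope(k, 14)))  # True
```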
LongRoPE (2024): For context-length extension, it rescales the per-dimension frequencies non-uniformly (with scale factors found by search):
- High-frequency dimensions (small $i$, fast rotation): rescaled the least, preserving local positional resolution
- Low-frequency dimensions (large $i$, slow rotation): interpolated the most, extending the usable range
This is essentially frequency-domain interpolation of the positional encoding.
U.2 Flash Attention and Numerical Integration
The attention matrix $\mathrm{softmax}(QK^\top/\sqrt{d})$ can be viewed as a matrix of weights for a kernel regression:
$$
\hat{v}(q_i) = \sum_j \frac{\exp(q_i^\top k_j/\sqrt{d})}{\sum_l \exp(q_i^\top k_l/\sqrt{d})}\; v_j
$$
This is a discrete approximation to the integral:
$$
\hat{v}(q) = \int \kappa(q, k)\, v(k)\, dk
$$
where $\kappa$ is a normalized kernel. Understanding attention as kernel regression connects to RBF interpolation theory - the "query point" $q$ selects a weighted combination of "function values" $v_j$ based on "proximity" $q^\top k_j$.
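A minimal sketch of attention as Nadaraya-Watson kernel regression; the shapes and random inputs are illustrative assumptions.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Each output row is a kernel-regression estimate over the (k_j, v_j) pairs."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                    # "proximity" scores
    W = np.exp(S - S.max(axis=1, keepdims=True))
    W /= W.sum(axis=1, keepdims=True)           # rows sum to 1: normalized kernel weights
    return W @ V                                # weighted combination of "function values"

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((4, 16)), rng.standard_normal((10, 16)), rng.standard_normal((10, 3))
print(softmax_attention(Q, K, V).shape)         # (4, 3): one regressed value per query
```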
U.3 NeRF and Neural Radiance Fields
Neural Radiance Fields (NeRF, Mildenhall et al. 2020) represent a 3D scene as a learned function
$$
F_\Theta : (x, y, z, \theta, \phi) \;\mapsto\; (\text{RGB}, \sigma)
$$
mapping position and viewing direction to color and density. The key challenge: standard MLPs cannot represent high-frequency spatial details. Fourier feature encoding solves this:
$$
\gamma(p) = \big(\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\big)
$$
This frequency embedding is equivalent to the first $2L$ Fourier basis functions. The Fourier features lift the input to a space where the target function is smooth, enabling efficient approximation by the MLP.
Connection to RFF: Tancik et al. (2020) showed that Fourier features are equivalent to sampling the random Fourier feature map at specific fixed frequencies rather than random ones - targeted at the frequencies of the scene being rendered.
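A minimal sketch of the frequency embedding $\gamma$ for a scalar coordinate; the number of octaves `L` is the usual hyperparameter (the value 10 is an illustrative default).

```python
import numpy as np

def fourier_features(p, L=10):
    """NeRF-style positional encoding with frequencies 2^0*pi, ..., 2^(L-1)*pi."""
    freqs = (2.0 ** np.arange(L)) * np.pi
    ang = np.outer(np.atleast_1d(p), freqs)        # (inputs, L) angle matrix
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

print(fourier_features(0.37).shape)   # (1, 20): 2L features per input coordinate
```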
Appendix V: Error Analysis Summary and Complexity Reference
V.1 Computational Complexity
| Operation | Complexity | Notes |
|---|---|---|
| Lagrange interpolation setup | $O(n^2)$ | Computing basis polynomials |
| Barycentric weights setup | $O(n^2)$ | General nodes; $O(n)$ for Chebyshev |
| Barycentric evaluation | $O(n)$ per point | Most efficient form |
| Newton divided differences | $O(n^2)$ | Table construction |
| Newton evaluation (Horner) | $O(n)$ per point | |
| Cubic spline setup | $O(n)$ | Thomas algorithm on tridiagonal system |
| Cubic spline evaluation | $O(\log n)$ | Binary search + local eval |
| FFT of length $n$ | $O(n \log n)$ | Cooley-Tukey |
| Least-squares QR | $O(mn^2)$ | $m$ data points, degree $n$ |
| RBF interpolation setup | $O(n^3)$ | Dense system solve |
| GP posterior | $O(n^3)$ | Cholesky of kernel matrix |
| Sparse GP (SVGP) | $O(nm^2)$ | $m$ inducing points |
V.2 Error Summary
| Method | Error bound | When tight |
|---|---|---|
| Lagrange (equispaced, $n+1$ nodes) | Can grow exponentially in $n$ | Runge function, large $n$ |
| Lagrange (Chebyshev, $n+1$ nodes) | $O(\rho^{-n})$ | Analytic $f$ |
| Cubic spline | $O(h^4)\,\lVert f^{(4)} \rVert_\infty$ | Smooth $f$, equispaced |
| Least-squares (degree $n$) | $\le (1 + \Lambda)\, E_n(f)$ | Depends on node Lebesgue constant |
| Trigonometric ($n$ terms) | $2\sum_{\vert k \vert > n} \vert \hat{f}_k \vert$ | Determined by decay of $\hat{f}_k$ |
| RBF (Gaussian) | Exponential in $1/h$ for analytic $f$ | Sensitive to shape parameter $\varepsilon$ |
V.3 Choosing the Right Method
DECISION TREE: WHICH INTERPOLATION METHOD?
========================================================================
Data type?
+-- Periodic function on [0, 2pi)?
|   +-- USE: Trigonometric interpolation (FFT)
|       Best accuracy for smooth periodic functions
|
+-- Scattered data in R^d (d >= 2)?
|   +-- Need uncertainty estimate?
|   |   +-- USE: Gaussian Process regression
|   +-- No uncertainty needed?
|       +-- USE: RBF interpolation (Matern kernel)
|
+-- 1D data on [a, b]?
    +-- Few points (n <= 30), need exact interpolation?
    |   +-- USE: Barycentric Lagrange (Chebyshev nodes)
    |
    +-- Many points (n > 30) or need local control?
    |   +-- USE: Cubic splines (natural or not-a-knot BC)
    |
    +-- More data than parameters (m > n)?
        +-- Noisy data, want a smooth curve?
        |   +-- USE: Smoothing spline
        +-- Polynomial features?
            +-- USE: QR least-squares (NOT normal equations)
========================================================================
Appendix W: Notation Reference
Throughout this section, the following notation is used consistently:
| Symbol | Meaning |
|---|---|
| $x_j$ or $t_j$ | Interpolation nodes / knots |
| $y_j$ | Data values |
| $n$ | Degree of polynomial interpolant (using $n+1$ nodes) |
| $\ell_j(x)$ | $j$-th Lagrange basis polynomial |
| $w_j$ | Barycentric weight for node $x_j$ |
| $\omega_{n+1}(x)$ | Node polynomial $\prod_{j=0}^{n} (x - x_j)$ |
| $\Lambda_n$ | Lebesgue constant |
| $T_n(x)$ | Chebyshev polynomial of degree $n$ |
| $f[x_0, \ldots, x_k]$ | $k$-th divided difference |
| $S(x)$ | Spline function |
| $M_i$ | Second derivative of the spline at node $i$: $M_i = S''(x_i)$ |
| $h$ | Mesh spacing |
| $B_{j,p}(x)$ | $j$-th B-spline basis function of degree $p$ |
| $E_n(f)$ | Best degree-$n$ polynomial approximation error |
| $c_k$ (or $\hat{f}_k$) | $k$-th Fourier coefficient |
| $a_k$ | $k$-th Chebyshev coefficient |
| $k(x, x')$ | Kernel function |
| $\mathcal{H}_k$ | Reproducing kernel Hilbert space with kernel $k$ |
| $\phi(\lVert x - x_j \rVert)$ | Radial basis function centered at $x_j$, evaluated at $x$ |
| $\varepsilon$ | RBF shape parameter |
| $\ell$ | RBF lengthscale |