"Numerical integration is the art of turning hard integrals into easy sums."
- Attributed to Philip Davis and Philip Rabinowitz
Overview
Numerical integration (quadrature) computes $I[f] = \int_a^b f(x)\,dx$ when no antiderivative is available, the function is known only at discrete points, or the exact integral is too expensive. The core idea is simple: approximate $f$ by an interpolant and integrate the interpolant exactly.
The quality of this approximation depends entirely on the choice of interpolation nodes. Equispaced nodes give the Newton-Cotes rules (trapezoidal, Simpson's), which are easy to derive but have bounded accuracy. Optimal node placement - the zeros of orthogonal polynomials - gives Gaussian quadrature, which achieves "twice the accuracy for free" by allowing the nodes to vary. Adaptive quadrature refines the mesh where the integrand is rough, achieving near-machine accuracy automatically. Monte Carlo methods trade deterministic error bounds for a stochastic $O(n^{-1/2})$ rate that is dimension-independent - crucial for high-dimensional integrals.
For AI, numerical integration appears in: computing normalizing constants for probability distributions, training variational autoencoders (ELBO involves an expectation), Gaussian process marginal likelihoods, normalizing flows, and reinforcement learning value estimation. Understanding which quadrature method to use - and why - is essential for building numerically stable, efficient ML systems.
Prerequisites
- Calculus: definite integrals, Riemann sums, Taylor's theorem (Section01-Mathematical-Foundations)
- Interpolation: Lagrange polynomials, Chebyshev nodes, orthogonal polynomials (Section10-04)
- Linear algebra: solving linear systems (Section02-Linear-Algebra-Basics, Section10-02)
- Floating-point arithmetic: rounding error accumulation (Section10-01)
- Probability: random variables, expected values, variance (Section07-Probability)
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive: Newton-Cotes, Gaussian quadrature derivation, adaptive quadrature, Monte Carlo, multidimensional integration |
| exercises.ipynb | 10 graded exercises from trapezoidal rule to normalizing flow partition functions |
Learning Objectives
After completing this section, you will:
- Derive Newton-Cotes rules by integrating Lagrange interpolants
- Understand the error analysis of composite rules via the Euler-Maclaurin formula
- Derive Gaussian quadrature by choosing nodes to maximize the degree of exactness
- Implement adaptive Gauss-Kronrod quadrature with error control
- Apply Richardson extrapolation / Romberg integration to accelerate convergence
- Use Monte Carlo integration for high-dimensional problems and understand variance reduction
- Connect quadrature to normalizing constants, ELBO, and variational inference in ML
- Recognize when each method is appropriate based on dimensionality, smoothness, and accuracy requirements
Table of Contents
- 1. Intuition and the Basic Setup
- 2. Newton-Cotes Rules
- 3. Gaussian Quadrature
- 4. Richardson Extrapolation and Romberg Integration
- 5. Adaptive Quadrature
- 6. Monte Carlo Integration
- 7. Multidimensional Integration
- 8. Applications in Machine Learning
- 9. Common Mistakes
- 10. Exercises
- 11. Why This Matters for AI (2026 Perspective)
- 12. Conceptual Bridge
1. Intuition and the Basic Setup
1.1 The Quadrature Problem
A quadrature rule approximates $I[f] = \int_a^b f(x)\,dx$ by a weighted sum:
$Q_n[f] = \sum_{i=1}^{n} w_i\, f(x_i)$
where $x_i$ are the nodes (quadrature points) and $w_i$ are the weights.
The error is $E_n[f] = I[f] - Q_n[f]$.
What determines quality?
- The number of nodes $n$ (more nodes = higher accuracy, in general)
- The placement of the nodes (free or equispaced?)
- The smoothness of $f$ (smooth $f$ allows high-order rules)
- The dimensionality $d$ (high-dimensional integrals need Monte Carlo)
For AI: Many ML objectives involve integrals:
- Marginal likelihood $\log p(x) = \log \int p(x|z)\,p(z)\,dz$ - the variational ELBO lower bounds this
- Policy gradient $\nabla_\theta\, E_{\tau\sim\pi_\theta}[R(\tau)]$ - estimated by Monte Carlo
- Gaussian process marginal likelihood - involves a log-determinant rather than an integral, but it is evaluated by quadrature on the spectrum (Lanczos quadrature, Section 8.3)
1.2 Why Integration is Hard
- Indefinite integrals of elementary functions: Only a small class have closed forms. $e^{-x^2}$ and $\sin(x)/x$ have no elementary antiderivative.
- High-dimensional integrals: The curse of dimensionality makes grid-based methods infeasible beyond a handful of dimensions.
- Singularities: $\int_0^1 x^{-1/2}\,dx = 2$ - a finite integral even though $f(x)\to\infty$ as $x\to 0$.
- Oscillatory integrands: $\int_0^1 \sin(1/x)\,dx$ - many sign changes, standard rules need many nodes.
- Stochastic integrals: $E_{x\sim p}[f(x)]$ - typically estimated by sampling.
1.3 Historical Timeline
NUMERICAL INTEGRATION: HISTORICAL TIMELINE
========================================================================
1711 - Newton: Newton-Cotes formulas from polynomial interpolation
1722 - Cotes: systematic derivation of Newton-Cotes weights
1814 - Gauss: Gaussian quadrature with 2n-1 degree of exactness
1852 - Chebyshev: Gauss-Chebyshev quadrature with explicit node/weight formulas
1901 - Richardson: Richardson extrapolation to accelerate convergence
1949 - Metropolis-Ulam: Monte Carlo method for nuclear physics calculations
1953 - Metropolis: Markov chain Monte Carlo (Metropolis algorithm)
1955 - Romberg: Romberg integration (repeated Richardson extrapolation)
1960 - Clenshaw-Curtis: Chebyshev-based quadrature via DCT
1964 - Kronrod (extended by Patterson): Gauss-Kronrod embedded rules for adaptive quadrature
1967 - Sobol: quasi-Monte Carlo sequences for multidimensional integration
1990 - Gelfand-Smith: MCMC for Bayesian inference becomes mainstream
2015 - Deep learning era: stochastic gradient estimation (REINFORCE, VAE ELBO)
2020 - Normalizing flows: exact likelihood via change-of-variables theorem
2022 - Score-based diffusion: sampling as integration of stochastic DEs
========================================================================
2. Newton-Cotes Rules
2.1 Midpoint, Trapezoidal, and Simpson's Rules
Derivation principle: Integrate the $n$-point Lagrange interpolant exactly.
Midpoint rule (1 point):
$\int_a^b f(x)\,dx \approx (b-a)\, f\!\left(\frac{a+b}{2}\right)$
Integrates polynomials of degree 1 exactly (a 1-point rule with degree of exactness 1).
Trapezoidal rule (2 points):
$\int_a^b f(x)\,dx \approx \frac{b-a}{2}\,\big(f(a) + f(b)\big)$
Same order as midpoint but twice the error constant (midpoint is twice as accurate as trapezoidal for convex/concave functions).
Simpson's rule (3 points):
$\int_a^b f(x)\,dx \approx \frac{b-a}{6}\,\Big(f(a) + 4 f\!\left(\tfrac{a+b}{2}\right) + f(b)\Big)$
Integrates polynomials of degree 3 exactly (degree of exactness 3, not 2 - rules with an odd number of nodes gain one extra order via symmetry).
NEWTON-COTES RULES: GEOMETRIC INTERPRETATION
========================================================================
Midpoint: Trapezoidal: Simpson's:
--------- ------------ ---------
-- /\ +--+
/ \ / \ / \
/ f \ / f \ / f \
/ x \ / x x \ / x x x \
-------- ---------- ----------
(b-a) (b-a)/2 (b-a)/6, 4/6, 1/6
x = quadrature node
Approximation: rectangle, trapezoid, parabola through 3 points
========================================================================
General Newton-Cotes with equispaced nodes $x_i = a + ih$, $h = (b-a)/n$, $i = 0,\dots,n$:
$Q_n[f] = \sum_{i=0}^{n} w_i\, f(x_i)$
The weights are determined by integrating the Lagrange basis:
$w_i = \int_a^b \ell_i(x)\,dx$
Newton-Cotes weights for small $n$:
| $n$ | Rule | Weights ($\times\,(b-a)$) | Degree |
|---|---|---|---|
| 1 | Trapezoidal | $\tfrac12, \tfrac12$ | 1 |
| 2 | Simpson's | $\tfrac16, \tfrac46, \tfrac16$ | 3 |
| 3 | Simpson's 3/8 | $\tfrac18, \tfrac38, \tfrac38, \tfrac18$ | 3 |
| 4 | Boole's | $\tfrac{7}{90}, \tfrac{32}{90}, \tfrac{12}{90}, \tfrac{32}{90}, \tfrac{7}{90}$ | 5 |
Key observation: Even-$n$ rules (Simpson's with $n = 2$, Boole's with $n = 4$) automatically achieve one higher degree of exactness than expected - a consequence of symmetry (checked numerically in the sketch below).
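A quick numerical check of these degrees of exactness (a minimal sketch; the weights are the normalized table values above):
import numpy as np
def nc_rule(weights, f, a, b):
    # Apply a Newton-Cotes rule with normalized weights on equispaced nodes
    x = np.linspace(a, b, len(weights))
    return (b - a) * np.dot(weights, f(x))
simpson = np.array([1, 4, 1]) / 6
boole = np.array([7, 32, 12, 32, 7]) / 90
for deg in range(7):
    exact = 1 / (deg + 1)                                 # integral of x^deg over [0, 1]
    e_s = abs(nc_rule(simpson, lambda x: x**deg, 0, 1) - exact)
    e_b = abs(nc_rule(boole, lambda x: x**deg, 0, 1) - exact)
    print(f"x^{deg}: Simpson err = {e_s:.1e}, Boole err = {e_b:.1e}")
Simpson's errors vanish (to rounding) through degree 3 and Boole's through degree 5, confirming the extra order from symmetry.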
2.2 Composite Rules and Error Analysis
Composite trapezoidal rule: Divide $[a,b]$ into $m$ subintervals of width $h = (b-a)/m$:
$T_m = h\left(\tfrac12 f(x_0) + \sum_{j=1}^{m-1} f(x_j) + \tfrac12 f(x_m)\right)$
Composite Simpson's rule: Requires even $m$:
$S_m = \tfrac{h}{3}\Big(f(x_0) + 4f(x_1) + 2f(x_2) + \cdots + 4f(x_{m-1}) + f(x_m)\Big)$
Error comparison:
| Rule | Function evaluations | Error | For ~$10^{-8}$ accuracy, need |
|---|---|---|---|
| Composite trapezoidal | $m+1$ | $O(h^2)$: $-\tfrac{(b-a)h^2}{12} f''(\xi)$ | $m \sim 10^4$ |
| Composite Simpson's | $m+1$ ($m$ even) | $O(h^4)$: $-\tfrac{(b-a)h^4}{180} f^{(4)}(\xi)$ | $m \sim 10^2$ |
| Composite Boole's | $m+1$ ($4 \mid m$) | $O(h^6)$ | $m \sim 30$ |
| Gauss-Legendre, $n$ pts | $n$ | $O(e^{-cn})$ for analytic $f$ | $n \sim 10$ |
2.3 Euler-Maclaurin Formula
The Euler-Maclaurin formula gives the exact relationship between the trapezoidal rule and the true integral:
$T_m - \int_a^b f(x)\,dx = \sum_{k=1}^{K} \frac{B_{2k}}{(2k)!}\, h^{2k}\, \Big(f^{(2k-1)}(b) - f^{(2k-1)}(a)\Big) + O\!\left(h^{2K+2}\right)$
where $B_{2k}$ are the Bernoulli numbers ($B_2 = \tfrac16$, $B_4 = -\tfrac1{30}$, $B_6 = \tfrac1{42}$, ...).
Key implications:
- The error has an asymptotic expansion in even powers of $h$ - this is the foundation for Richardson extrapolation
- For periodic functions where all derivatives of $f$ match at $a$ and $b$, the boundary terms vanish for all $k$ - the composite trapezoidal rule achieves spectral accuracy for smooth periodic functions!
- The formula predicts exactly how much each extrapolation step reduces the error
Periodic functions example: For a smooth periodic $f$ (e.g., $e^{\sin x}$) on $[0, 2\pi]$ with $m$ equal intervals, the trapezoidal rule is exact to machine precision even for small $m$ (see the sketch below). This is why the DFT (which uses uniform sampling of a periodic function) is exact!
For AI: The Euler-Maclaurin formula explains why the FFT is so accurate for bandlimited signals: the trapezoidal rule applied to a periodic function with finitely many nonzero Fourier coefficients gives the exact DFT.
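A minimal demonstration of this spectral accuracy (the periodic test function $e^{\sin x}$ on $[0, 2\pi]$ is an illustrative choice):
import numpy as np
from scipy.integrate import quad
f = lambda x: np.exp(np.sin(x))
exact, _ = quad(f, 0, 2*np.pi, epsabs=1e-14)             # reference value
for m in [4, 8, 12, 16, 20]:
    x = np.linspace(0, 2*np.pi, m, endpoint=False)       # periodic: endpoint omitted
    trap = (2*np.pi/m) * f(x).sum()                      # trapezoidal rule = uniform sum here
    print(f"m={m:>2}: error = {abs(trap - exact):.2e}")
The error falls to machine precision by m of roughly 16-20, far faster than the generic O(h^2) rate.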
2.4 Limitations of Newton-Cotes
Negative weights for large $n$: Newton-Cotes weights become negative for $n \ge 8$. This means small perturbations in $f$ values can cancel to give a large error - numerical instability.
Runge's phenomenon for integration: Unlike interpolation, integration is a smoothing operation - integrating the Runge interpolant gives reasonable results for moderate $n$. But for large $n$ with equispaced nodes, the interpolant oscillates wildly and the integral of the oscillations can be large.
Conclusion: Use composite low-order rules (trapezoidal, Simpson's) rather than high-degree Newton-Cotes on a single interval.
3. Gaussian Quadrature
3.1 Degree of Exactness and Optimal Nodes
Definition: A quadrature rule has degree of exactness $d$ if $Q_n[p] = \int_a^b p(x)\,dx$ for all polynomials $p$ of degree $\le d$, but there exists a degree-$(d+1)$ polynomial where it fails.
Newton-Cotes with $n+1$ equispaced nodes achieves degree of exactness $n$ (or $n+1$ if $n$ is even).
Question: Can we do better by choosing the nodes freely?
Counting argument: A rule with $n$ free nodes has $2n$ free parameters ($n$ nodes + $n$ weights). Matching the $2n$ monomials $1, x, \dots, x^{2n-1}$ imposes $2n$ conditions. We might hope to achieve degree of exactness $2n-1$.
Theorem (Gaussian quadrature): This is achievable. With optimally chosen nodes, an $n$-point rule achieves degree of exactness $2n-1$ - exactly twice what equispaced nodes give.
How to find the nodes: The optimal nodes for $w(x) = 1$ on $[-1,1]$ are the zeros of the $n$-th Legendre polynomial $P_n(x)$.
Proof sketch: For any polynomial $p$ of degree $\le 2n-1$, write $p = q\,P_n + r$ where $q$ has degree $\le n-1$ and $r$ has degree $\le n-1$. Then:
- $\int_{-1}^1 p = \int_{-1}^1 q\,P_n + \int_{-1}^1 r = \int_{-1}^1 r$ - the first integral is 0 by orthogonality of $P_n$ to lower-degree polynomials
- $Q_n[p] = Q_n[q\,P_n] + Q_n[r] = Q_n[r]$ - the first term is 0 because $q\,P_n$ vanishes at all zeros of $P_n$ (the nodes)
- $Q_n[r] = \int_{-1}^1 r$ because the interpolatory rule is exact for polynomials of degree $\le n-1$
3.2 Gauss-Legendre Quadrature
For $w(x) = 1$ on $[-1, 1]$, the $n$-point Gauss-Legendre rule uses:
- Nodes: zeros of $P_n(x)$ (Legendre polynomial of degree $n$)
- Weights: $w_i = \dfrac{2}{(1 - x_i^2)\,[P_n'(x_i)]^2}$
Gauss-Legendre nodes and weights for small $n$:
| $n$ | Nodes | Weights |
|---|---|---|
| 1 | $0$ | $2$ |
| 2 | $\pm 1/\sqrt{3}$ | $1, 1$ |
| 3 | $0, \pm\sqrt{3/5}$ | $8/9, 5/9, 5/9$ |
| 5 | (5 values in $(-1,1)$) | (5 values summing to 2) |
Error: For $f \in C^{2n}[-1,1]$:
$E_n[f] = \frac{2^{2n+1}\,(n!)^4}{(2n+1)\,[(2n)!]^3}\; f^{(2n)}(\xi)$
Scaling to $[a,b]$: Change variables $x = \frac{a+b}{2} + \frac{b-a}{2}\,t$, $dx = \frac{b-a}{2}\,dt$:
$\int_a^b f(x)\,dx \approx \frac{b-a}{2}\sum_{i=1}^n w_i\, f\!\left(\frac{a+b}{2} + \frac{b-a}{2}\,t_i\right)$
Computing GL nodes: Use the eigenvalue method - the nodes are the eigenvalues of the Golub-Welsch symmetric tridiagonal matrix with zero diagonal and off-diagonal entries $\beta_k = k/\sqrt{4k^2-1}$.
The eigenvalues are the GL nodes, and the squared first components of the normalized eigenvectors (times 2) give the weights.
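In practice numpy's leggauss returns these nodes and weights directly; a minimal sketch (using the built-in rather than the eigenvalue code in Appendix A.2) verifying the $2n-1$ degree of exactness:
import numpy as np
x, w = np.polynomial.legendre.leggauss(5)            # 5-point Gauss-Legendre on [-1, 1]
for deg in [8, 9, 10]:
    exact = 2/(deg + 1) if deg % 2 == 0 else 0.0     # integral of x^deg over [-1, 1]
    print(f"x^{deg}: error = {abs(np.dot(w, x**deg) - exact):.2e}")
The errors vanish (to rounding) through degree 9 = 2n-1 and become small but clearly nonzero at degree 10.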
For AI: Gauss-Legendre quadrature is used in:
- Normalizing flows: Computing normalizing constants for complex distributions via quadrature
- Physics-informed networks: Computing weak-form integrals exactly
- ODE solvers: Gauss collocation methods are implicit RK methods with GL nodes - they achieve the highest possible order for a given number of stages
3.3 Gauss-Chebyshev Quadrature
For the weight $w(x) = \frac{1}{\sqrt{1-x^2}}$ on $[-1,1]$ (the Chebyshev weight), the nodes and weights have explicit formulas:
$x_i = \cos\!\left(\frac{(2i-1)\pi}{2n}\right), \qquad w_i = \frac{\pi}{n}, \qquad i = 1,\dots,n$
Key properties:
- Nodes are equispaced in $\theta = \arccos x$ - no eigenvalue computation needed
- All weights equal $\pi/n$ - uniform weights!
- These are exactly the Chebyshev nodes of the first kind
- Degree of exactness: $2n-1$ (with respect to the Chebyshev weight)
Type-II Gauss-Chebyshev: For weight $\sqrt{1-x^2}$:
$x_i = \cos\!\left(\frac{i\pi}{n+1}\right), \qquad w_i = \frac{\pi}{n+1}\,\sin^2\!\left(\frac{i\pi}{n+1}\right)$
3.4 Other Gauss Rules
Gauss-Laguerre: weight $e^{-x}$ on $[0,\infty)$ - nodes are zeros of the Laguerre polynomials $L_n(x)$.
Gauss-Hermite: weight $e^{-x^2}$ on $(-\infty,\infty)$ - nodes are zeros of the Hermite polynomials $H_n(x)$.
For AI: Gauss-Hermite quadrature is used for computing expectations over Gaussian distributions:
$E_{x\sim N(\mu,\sigma^2)}[f(x)] = \frac{1}{\sqrt{\pi}}\sum_{i=1}^{n} w_i\, f\!\left(\mu + \sqrt{2}\,\sigma\,t_i\right)$
This is used in computing the ELBO exactly in certain VAE architectures and in moment-matching for Gaussian approximations (expectation propagation).
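A minimal sketch of this recipe, mapping the Hermite nodes $t_i$ to $\mu + \sqrt{2}\,\sigma\, t_i$ (the softplus test function is an illustrative choice):
import numpy as np
t, w = np.polynomial.hermite.hermgauss(20)
mu, sigma = 0.5, 2.0
f = lambda x: np.log1p(np.exp(x))                      # smooth test function (softplus)
gh = np.dot(w, f(mu + np.sqrt(2)*sigma*t)) / np.sqrt(np.pi)
mc = f(mu + sigma*np.random.randn(100000)).mean()      # Monte Carlo cross-check
print(f"GH-20: {gh:.8f}   MC-1e5: {mc:.6f}")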
Gauss-Jacobi: weight $(1-x)^\alpha(1+x)^\beta$ on $[-1,1]$ - used for Beta-distributed random variables and Jacobi polynomial bases.
3.5 Gauss-Kronrod Rules for Error Estimation
The problem: Gaussian quadrature doesn't naturally provide an error estimate.
Gauss-Kronrod: Extend an $n$-point GL rule to a $(2n+1)$-point rule by adding $n+1$ Kronrod nodes interleaved with the GL nodes, choosing them to maximize the degree of exactness of the combined rule.
The G7-K15 rule (7-point GL + 15-point Kronrod) is the standard in scipy.integrate.quad:
- G7: degree of exactness 13 (7 nodes)
- K15: degree of exactness 22 (15 nodes, reuses the G7 nodes)
- Error estimate: $\approx |Q_{K15} - Q_{G7}|$ (QUADPACK uses the more conservative $(200\,|Q_{K15} - Q_{G7}|)^{3/2}$)
Adaptivity: If the error estimate is below the tolerance, accept the interval. Otherwise, bisect and recurse.
4. Richardson Extrapolation and Romberg Integration
4.1 Richardson Extrapolation
Setup: If we know that $I = Q(h) + c\,h^p + O(h^{p+2})$ (error expansion in powers of $h$), then computing $Q(h)$ and $Q(h/2)$:
$I = Q(h/2) + c\,(h/2)^p + O(h^{p+2})$
Eliminating the $h^p$ term:
$I = \frac{2^p\,Q(h/2) - Q(h)}{2^p - 1} + O(h^{p+2})$
This is one step of Richardson extrapolation - combining two estimates to cancel the leading error term.
General Richardson extrapolation: If $I = Q(h) + c_1 h^{p_1} + c_2 h^{p_2} + \cdots$ with $p_1 < p_2 < \cdots$, repeated combinations cancel successive terms.
For AI: Richardson extrapolation is the foundation of gradient estimation via finite differences - taking linear combinations of function evaluations to cancel lower-order error terms. The centered difference formula is itself such a combination: averaging the forward and backward differences cancels the $O(h)$ term.
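One Richardson step on the composite trapezoidal rule, as a minimal sketch (trap is a local helper, not a library routine):
import numpy as np
def trap(f, a, b, m):
    x = np.linspace(a, b, m + 1)
    y = f(x)
    return (b - a) / m * (y[0]/2 + y[1:-1].sum() + y[-1]/2)
f, exact = np.exp, np.e - 1
Q1, Q2 = trap(f, 0, 1, 64), trap(f, 0, 1, 128)           # step h and h/2
richardson = (4*Q2 - Q1) / 3                             # cancel the h^2 term (p = 2)
print(f"trap(h):    {abs(Q1 - exact):.2e}")
print(f"trap(h/2):  {abs(Q2 - exact):.2e}")
print(f"Richardson: {abs(richardson - exact):.2e}")      # now O(h^4)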
4.2 Romberg's Method
Romberg's method applies repeated Richardson extrapolation to the composite trapezoidal rule.
Romberg table: Starting from $R_{k,0}$ = composite trapezoidal with $2^k$ intervals, and halving $h$ repeatedly:
$R_{k,j} = \frac{4^j\,R_{k,j-1} - R_{k-1,j-1}}{4^j - 1}$
The diagonal entry $R_{k,k}$ approximates $I$ with error $O(h_k^{2k+2})$, where $h_k = (b-a)/2^k$.
Algorithm:
R[0][0] = (b-a)/2 * (f(a) + f(b)) # Trapezoidal with h = b-a
for k = 1, 2, ..., max_iter:
    h = (b-a) / 2^k
    # Composite trapezoidal with 2^k intervals (reuses previous evaluations)
    R[k][0] = 0.5*R[k-1][0] + h * sum(f(a + (2j-1)*h) for j in 1..2^(k-1))
    # Richardson extrapolation
    for j = 1, ..., k:
        R[k][j] = (4^j * R[k][j-1] - R[k-1][j-1]) / (4^j - 1)
    if |R[k][k] - R[k-1][k-1]| < tol: break
Convergence: For smooth, non-periodic functions, the diagonal $R_{k,k}$ converges faster than any fixed order $O(h^p)$ - faster than any fixed-order method!
4.3 Clenshaw-Curtis Quadrature
Clenshaw-Curtis integrates the Chebyshev interpolant of $f$ on $[-1,1]$:
$\int_{-1}^1 f(x)\,dx \approx \int_{-1}^1 p_n(x)\,dx$
where $p_n$ is the degree-$n$ Chebyshev interpolant at the Chebyshev points of the 2nd kind, $x_j = \cos(j\pi/n)$.
Weights via DCT: The Chebyshev expansion integrates term by term using
$\int_{-1}^1 T_k(x)\,dx = \frac{2}{1-k^2}$ for even $k$, and $0$ for odd $k$.
So the CC weights are computed in $O(n\log n)$ via a DCT.
CC vs GL: For smooth functions, CC and GL achieve similar accuracy. GL uses $n$ nodes for degree of exactness $2n-1$; CC uses $n+1$ nodes for degree $n$, but with DCT acceleration. For adaptive applications where nodes are reused on subintervals, CC is often preferred because its nodes nest (the $2n+1$-point grid contains the $n+1$-point grid).
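A sketch of the integrate-the-interpolant idea using numpy's Chebyshev tools (note: numpy's chebinterpolate samples at Chebyshev points of the first kind, so this is strictly the Fejér flavor of the same construction):
import numpy as np
from numpy.polynomial import chebyshev as C
def cc_integrate(f, n):
    # Integrate f over [-1, 1] by integrating its degree-n Chebyshev interpolant
    c = C.chebinterpolate(f, n)              # Chebyshev coefficients of the interpolant
    k = np.arange(0, n + 1, 2)               # only even-degree terms contribute
    return np.dot(c[::2], 2.0 / (1 - k**2))  # integral of T_k is 2/(1-k^2) for even k
f, exact = np.exp, np.e - 1/np.e
for n in [4, 8, 16]:
    print(f"n={n:>2}: error = {abs(cc_integrate(f, n) - exact):.2e}")
The error decays exponentially in n for this analytic integrand.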
5. Adaptive Quadrature
5.1 Adaptive Strategies
Idea: Concentrate function evaluations where the integrand is rough (large error per unit interval), not where it is smooth.
Error indicator: Estimate the local error on $[a_i, b_i]$ using:
$e_i = |Q_{\text{fine}} - Q_{\text{coarse}}|$
where $Q_{\text{fine}}$ uses more points (e.g., Gauss-Kronrod K15) and $Q_{\text{coarse}}$ uses fewer (e.g., G7).
Algorithm (recursive):
- Evaluate $Q_{\text{coarse}}$ and $Q_{\text{fine}}$ on $[a_i, b_i]$
- If $e_i < \text{tol}\cdot\frac{b_i - a_i}{b - a}$ (local tolerance proportional to interval width): accept $Q_{\text{fine}}$
- Otherwise: bisect into $[a_i, m]$ and $[m, b_i]$, recurse on each
Global vs local tolerance: The final error satisfies $|I - Q| \lesssim \text{tol}$ only if each leaf interval satisfies its local tolerance - this requires careful accounting.
5.2 Error Control and Termination
Termination conditions:
- Local tolerance met: $|Q_{\text{fine}} - Q_{\text{coarse}}| < \text{tol}_i$ on every active subinterval
- Maximum evaluations reached: Return with warning
- Interval too small: Machine epsilon prevents further bisection
Priority queue (non-recursive): For efficiency, maintain a heap of subintervals sorted by error estimate. Always split the highest-error interval:
import heapq
# Sketch: gk(f, a, b) is a placeholder returning (estimate, error) for one interval, e.g. G7-K15
est, err = gk(f, a, b)
queue = [(-err, a, b, est)]
total, err_sum = est, err
while err_sum > tol and queue:
    neg_err, l, r, q_old = heapq.heappop(queue)   # worst interval first
    err_sum += neg_err                            # remove its error contribution
    total -= q_old
    m = (l + r) / 2
    for lo, hi in ((l, m), (m, r)):
        q, e = gk(f, lo, hi)
        total += q
        err_sum += e
        heapq.heappush(queue, (-e, lo, hi, q))
5.3 scipy.integrate API
from scipy import integrate
# Basic quadrature (adaptive Gauss-Kronrod)
result, error = integrate.quad(f, a, b)
# With tolerances
result, error = integrate.quad(f, a, b, epsabs=1e-12, epsrel=1e-12)
# Improper integrals
result, error = integrate.quad(f, 0, np.inf)
# Singularity handling
result, error = integrate.quad(f, 0, 1, points=[0.5]) # Inform about singularities
# Fixed-order quadrature (returns the value; the second element is always None)
result, _ = integrate.fixed_quad(f, a, b, n=10) # Gauss-Legendre, n points
# Romberg integration (deprecated in recent SciPy releases; prefer quad)
result = integrate.romberg(f, a, b)
6. Monte Carlo Integration
6.1 Basic Monte Carlo
The fundamental idea: Rewrite the integral as an expectation:
$\int_{[0,1]^d} f(x)\,dx = E_{x\sim U([0,1]^d)}[f(x)]$
Estimate by sampling $x_1,\dots,x_n \sim U([0,1]^d)$:
$\hat{I}_n = \frac{1}{n}\sum_{i=1}^{n} f(x_i)$
Error analysis: By the Central Limit Theorem:
$\hat{I}_n - I \approx N\!\left(0, \tfrac{\sigma^2}{n}\right), \qquad \sigma^2 = \mathrm{Var}[f(x)]$
Key properties:
- Dimension-independent: $O(n^{-1/2})$ regardless of $d$
- No smoothness requirement: Works for discontinuous $f$
- Slow convergence: Need $n \approx 10^8$ for 4 decimal digits of accuracy - much slower than deterministic quadrature for smooth low-dimensional integrands
- Easily parallelizable: All samples are independent
When to use Monte Carlo:
- $d \gtrsim 4$ (deterministic methods degrade exponentially with dimension)
- $f$ is discontinuous or has singularities
- Accuracy requirement is modest ($10^{-2}$-$10^{-4}$)
- Sampling from the underlying distribution is easier than computing the integral directly
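A sketch of the $O(n^{-1/2})$ rate on a 1D example (quarter-circle integrand; the exact value is $\pi/4$):
import numpy as np
rng = np.random.default_rng(0)
f = lambda x: np.sqrt(1 - x**2)                      # integral over [0,1] is pi/4
for n in [10**2, 10**4, 10**6]:
    est = f(rng.random(n)).mean()
    print(f"n={n:>7}: error = {abs(est - np.pi/4):.1e}")
Each 100x increase in samples buys roughly one extra digit - exactly the n^{-1/2} prediction.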
6.2 Variance Reduction Techniques
1. Importance sampling: Instead of sampling from $p$, sample from a proposal distribution $q$:
$I = E_{x\sim q}\!\left[\frac{f(x)\,p(x)}{q(x)}\right]$
Optimal proposal $q^*(x) \propto |f(x)|\,p(x)$ - for nonnegative $f$ it reduces the variance to zero (but constructing it requires knowing the integral!). In practice, choose $q$ to be large where $|f|\,p$ is large.
Estimator variance: $\mathrm{Var} = \frac{1}{n}\left(\int \frac{(f p)^2}{q}\,dx - I^2\right)$ - minimized when $q \propto |f|\,p$.
2. Control variates: Find $g$ with known integral and high correlation with $f$:
$\hat{I} = \frac{1}{n}\sum_i \big[f(x_i) - c\,g(x_i)\big] + c\int g$
The variance is $\mathrm{Var}[f - c\,g]/n$ - reduced when $f$ and $g$ are highly correlated.
3. Antithetic variates: For symmetric domains, pair each sample $x$ with its mirror $x'$ (e.g., $1-x$ on $[0,1]$):
$\hat{I} = \frac{1}{n}\sum_i \frac{f(x_i) + f(x_i')}{2}$
Variance reduction: $\mathrm{Var}\!\left[\tfrac{f(x)+f(x')}{2}\right] = \tfrac{\sigma^2}{2}(1+\rho)$ - a gain whenever the correlation $\rho < 0$.
4. Stratified sampling: Divide $[0,1]^d$ into strata and sample $n_k$ points from each. Variance:
$\mathrm{Var} = \sum_k \frac{p_k^2\,\sigma_k^2}{n_k}$
where $p_k$ is the probability mass of stratum $k$.
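A minimal importance-sampling sketch for $\int_0^1 e^{-5x}\,dx$ (the proposal here is a truncated exponential chosen proportional to the integrand, so the weights come out constant - the zero-variance optimal-proposal case in action):
import numpy as np
rng = np.random.default_rng(1)
f = lambda x: np.exp(-5*x)
n = 10000
plain = f(rng.random(n))                               # plain MC: x ~ U(0,1)
u = rng.random(n)                                      # IS proposal q(x) = 5 e^{-5x} / (1 - e^{-5})
x = -np.log(1 - u*(1 - np.exp(-5))) / 5                # inverse-CDF sample from q
w = f(x) / (5*np.exp(-5*x) / (1 - np.exp(-5)))         # f/q: constant, since q is proportional to f
print(f"plain MC: {plain.mean():.6f} +/- {plain.std()/np.sqrt(n):.1e}")
print(f"IS:       {w.mean():.6f} +/- {w.std()/np.sqrt(n):.1e}")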
For AI: All of these techniques appear in ML:
- Importance sampling underlies off-policy policy gradients: expectations under a target policy are estimated from trajectories sampled under the current behavior policy, reweighted by the likelihood ratio
- Control variates are used as baselines in policy gradient: subtract a baseline to reduce variance
- Stratified sampling is used in mini-batch selection to ensure balanced class representation
6.3 Quasi-Monte Carlo and Low-Discrepancy Sequences
Discrepancy: Measures how uniformly a point set covers the domain. A sequence with low discrepancy fills the space more uniformly than random points.
Koksma-Hlawka inequality:
$\big|\hat{I}_n - I\big| \le V_{HK}(f)\; D_n^*$
where $V_{HK}(f)$ is the (Hardy-Krause) variation of $f$ and $D_n^*$ is the star discrepancy.
Sobol sequences: Quasi-random sequences with discrepancy $D_n^* = O\!\left(\frac{(\log n)^d}{n}\right)$ - much better than random sampling's $O(n^{-1/2})$ error for low to moderate $d$.
Convergence comparison:
| Method | Rate | Good for |
|---|---|---|
| Monte Carlo | $O(n^{-1/2})$ | Any $d$, discontinuous $f$ |
| Quasi-MC (Sobol) | $O((\log n)^d/n)$ | $d \lesssim 20$, smooth $f$ |
| Gauss-Legendre | $O(e^{-cn})$ | $d \le 3$, smooth $f$ |
| Clenshaw-Curtis | $O(e^{-cn})$ | $d \le 3$, analytic $f$ |
For AI: Quasi-MC is used in:
- Low-discrepancy random features: Quasi-random Fourier features for kernel approximation are more accurate than random ones
- Batch sampling in reinforcement learning: Sobol-like sequences for diverse trajectory coverage
- Hyperparameter search: Quasi-random search outperforms grid search and plain random search for a moderate number of hyperparameters
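A sketch comparing scrambled Sobol against plain MC on a smooth 5-dimensional product integrand (exact value $\sin(1)^5$):
import numpy as np
from scipy.stats import qmc
f = lambda x: np.cos(x).prod(axis=1)                   # integral over [0,1]^5 is sin(1)^5
exact = np.sin(1.0)**5
rng = np.random.default_rng(2)
for m in [8, 12, 16]:                                  # n = 2^m samples
    n = 2**m
    err_mc = abs(f(rng.random((n, 5))).mean() - exact)
    sob = qmc.Sobol(d=5, scramble=True, seed=2)
    err_qmc = abs(f(sob.random_base2(m)).mean() - exact)
    print(f"n=2^{m:<2}: MC err = {err_mc:.1e}, Sobol err = {err_qmc:.1e}")
The Sobol error decays at close to O(1/n) here, versus O(n^{-1/2}) for plain MC.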
7. Multidimensional Integration
7.1 Tensor Product Quadrature
For $[0,1]^d$, the tensor product rule applies 1D quadrature in each dimension:
$Q[f] = \sum_{i_1=1}^{n}\cdots\sum_{i_d=1}^{n} w_{i_1}\cdots w_{i_d}\; f(x_{i_1},\dots,x_{i_d})$
Curse of dimensionality: Requires $n^d$ function evaluations. For $n = 10$ points per dimension and $d = 10$ dimensions: $10^{10}$ evaluations. Infeasible!
Error: If each 1D rule has error $O(n^{-p})$:
The total error is $O(d\,n^{-p})$ - it scales linearly with $d$, not exponentially, so the convergence rate is preserved. But the number of evaluations grows as $n^d$.
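A 2D tensor-product Gauss-Legendre sketch on $[-1,1]^2$:
import numpy as np
x, w = np.polynomial.legendre.leggauss(5)
X, Y = np.meshgrid(x, x)                               # 5 x 5 tensor grid of nodes
W = np.outer(w, w)                                     # product weights
f = lambda x, y: np.exp(x + y)                         # exact integral: (e - 1/e)^2
exact = (np.e - 1/np.e)**2
print(f"25-point tensor GL error: {abs((W * f(X, Y)).sum() - exact):.2e}")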
7.2 Sparse Grids
Smolyak's construction: Use tensor products only of "low-level" 1D rules. The level-$\ell$ sparse grid uses $O\!\big(2^\ell\,\ell^{\,d-1}\big)$ points instead of $(2^\ell)^d$:
$Q_\ell^{\text{sparse}} = \sum_{|\mathbf{k}|_1 \le \ell + d - 1} \Delta_{k_1}\otimes\cdots\otimes\Delta_{k_d}$
where $\Delta_k = Q_k - Q_{k-1}$ is the difference between the $k$-th and $(k-1)$-th 1D rules.
Error for smooth $f$: $O\!\big(n^{-p}\,(\log n)^{(d-1)(p+1)}\big)$ - polynomial in the logarithm, not exponential.
For AI: Sparse grid integration is used in:
- High-dimensional numerical PDEs (atmospheric modeling, option pricing)
- Bayesian quadrature for computing GP marginal likelihoods
- Efficient moment computation in uncertainty quantification
7.3 Monte Carlo vs Deterministic in High Dimensions
| $d$ | Best deterministic method | MC (unbiased, $n$ samples) | Break-even |
|---|---|---|---|
| 1 | GL-5 (5 pts, near machine precision) | $O(n^{-1/2})$ | deterministic wins outright |
| 5 | Sparse grid ($\sim 10^3$ pts) | $O(n^{-1/2})$ | deterministic usually wins |
| 10 | Sparse grid ($\sim 10^5$ pts) | $O(n^{-1/2})$ | roughly comparable |
| 20 | MC wins clearly | $O(n^{-1/2})$ | - |
| 100 | MC only | $O(n^{-1/2})$ | - |
Practical advice: For $d \le 3$, use Gauss-Legendre or Clenshaw-Curtis. For $4 \lesssim d \lesssim 20$, consider quasi-MC or sparse grids. For $d \gtrsim 20$, use Monte Carlo with variance reduction.
8. Applications in Machine Learning
8.1 Normalizing Constants and Partition Functions
The partition function problem: In energy-based models, the probability is:
$p(x) = \frac{e^{-E(x)}}{Z}, \qquad Z = \int e^{-E(x)}\,dx$
Computing $Z$ exactly is intractable for complex $E$. Solutions:
- Thermodynamic integration: $\frac{d}{d\beta}\log Z(\beta) = -E_{p_\beta}[E(x)]$, so $\log Z$ is recovered by a 1D quadrature over the inverse temperature $\beta$
- Annealed importance sampling (AIS): Monte Carlo estimate of the ratio $Z_1/Z_0$
- Score matching: Avoid $Z$ entirely by training on $\nabla_x \log p(x)$, which does not depend on $Z$
For diffusion models: The ELBO for a Gaussian diffusion process decomposes into per-timestep terms $\sum_t \mathrm{KL}\big(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t)\big)$.
Each KL divergence term between Gaussians is computable analytically - no numerical integration needed. This is why diffusion models are tractable.
8.2 ELBO and Variational Inference
Variational Autoencoder (VAE): The ELBO lower bound on $\log p(x)$:
$\log p(x) \ge E_{q(z|x)}[\log p(x|z)] - \mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)$
The reparameterization trick: When $z = \mu + \sigma\odot\epsilon$ with $\epsilon\sim N(0, I)$:
$\nabla_\phi\, E_{q_\phi(z|x)}[f(z)] = E_{\epsilon}\big[\nabla_\phi f(\mu_\phi + \sigma_\phi\odot\epsilon)\big]$
This converts a score function estimator (high variance) to a pathwise estimator (low variance) - essentially converting a derivative of an expectation into an expectation of a derivative.
Gauss-Hermite for exact ELBO: When $q(z|x) = N(\mu, \sigma^2)$ (per dimension):
$E_q[f(z)] = \frac{1}{\sqrt{\pi}}\sum_i w_i\, f(\mu + \sqrt{2}\,\sigma\,t_i)$
With $\sim 20$ Gauss-Hermite points, this is nearly exact for smooth $f$ - eliminating the Monte Carlo noise in the ELBO gradient.
8.3 Gaussian Process Marginal Likelihood
The GP log marginal likelihood is:
$\log p(y|X) = -\tfrac12\, y^\top K^{-1} y - \tfrac12 \log\det K - \tfrac{n}{2}\log 2\pi$
Computing $\log\det K$ requires the determinant of an $n\times n$ matrix - $O(n^3)$ via Cholesky. For large $n$:
Stochastic trace estimation: $\mathrm{tr}(A) \approx \frac{1}{m}\sum_{i=1}^m z_i^\top A z_i$ for random probe vectors $z_i$ (Hutchinson's estimator).
Lanczos quadrature: Estimate $\log\det K = \mathrm{tr}(\log K)$ using Lanczos iteration to build a tridiagonal approximation of $K$ on a Krylov subspace, then compute $z^\top(\log K)z$ via Gauss quadrature on the spectrum. This combines matrix-vector products ($O(n^2)$ per product) with Gaussian quadrature.
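A sketch of Hutchinson's estimator applied to $\log\det K = \mathrm{tr}(\log K)$ (the exact matrix logarithm from scipy stands in for the Lanczos approximation that large-scale libraries such as GPyTorch actually use):
import numpy as np
from scipy.linalg import logm
rng = np.random.default_rng(3)
n = 200
A = rng.standard_normal((n, n))
K = A @ A.T / n + np.eye(n)                            # SPD test matrix
L = np.real(logm(K))                                   # stand-in for Lanczos log-matvecs
m = 100
z = rng.choice([-1.0, 1.0], size=(n, m))               # Rademacher probe vectors
est = np.einsum('im,im->m', z, L @ z).mean()           # average of z^T log(K) z
print(f"exact logdet = {np.trace(L):.3f}, Hutchinson estimate = {est:.3f}")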
8.4 Expected Value Estimation in RL
Policy gradient (REINFORCE):
$\nabla_\theta J(\theta) = E_{\tau\sim\pi_\theta}\big[\nabla_\theta\log\pi_\theta(\tau)\,R(\tau)\big]$
This is a Monte Carlo estimate - simulate trajectories $\tau_i$, compute returns $R(\tau_i)$. The estimator is unbiased but high-variance. Variance reduction:
- Baseline subtraction: Replace $R(\tau)$ with $R(\tau) - b(s)$ (leaves the gradient unbiased, reduces variance if $b$ is correlated with $R$)
- Actor-Critic: Use the advantage $A(s,a) = Q(s,a) - V(s)$ - reduces variance dramatically
TD learning approximates $V(s) = E[r + \gamma V(s')]$ via the Bellman recursion rather than direct Monte Carlo:
This is a recursive quadrature - approximating the integral of the value function over the next-state distribution using a single sample.
9. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using Simpson's rule with an odd number of intervals | Simpson's requires even $m$ (pairs of intervals) | Use even $m$ or use the 3/8 rule for the odd remainder |
| 2 | Applying high-degree Newton-Cotes on a single large interval | Negative weights for $n \ge 8$ -> catastrophic cancellation | Use composite low-order rules |
| 3 | Underestimating Monte Carlo variance | $O(n^{-1/2})$ convergence means 100x more samples for 10x accuracy | Use deterministic methods in low dimensions, or variance reduction |
| 4 | Ignoring singularities in adaptive quadrature | Standard Gauss-Kronrod fails near singularities | Report singularity locations to quad() via the points=[] parameter |
| 5 | Claiming Monte Carlo is "exact in expectation" justifies any sample size | Unbiased \neq accurate; confidence intervals scale as $\hat\sigma/\sqrt{n}$ | Always report MC standard errors alongside estimates |
| 6 | Using composite trapezoidal for periodic functions with wrong period | If the function period doesn't match the integration interval, the boundary terms don't cancel | Ensure the interval covers exactly an integer number of periods |
| 7 | Forgetting the Jacobian when changing variables | For $x = g(u)$: $\int f(x)\,dx = \int f(g(u))\,|g'(u)|\,du$ | Always include the $|g'(u)|$ factor |
| 8 | Using GL quadrature with cached nodes/weights outside $[-1,1]$ | Precomputed nodes are for $[-1,1]$; rescaling is required for $[a,b]$ | Apply the affine transformation $x = \frac{a+b}{2} + \frac{b-a}{2}t$, $w \mapsto \frac{b-a}{2}w$ |
| 9 | Stopping Romberg integration too early | The Romberg table diagonal may not show the true error until many rows | Always check at least 2 successive diagonal entries for convergence |
| 10 | Using importance sampling with a poor proposal | If $q \ll |f|\,p$ in some region, the ratio $fp/q$ can be huge -> infinite variance | Choose $q$ proportional to $|f|\,p$ with heavier tails |
| 11 | Applying MC to 1D integrals where deterministic is available | MC needs $\sim 10^6$ samples for 3 digits; Gauss-Legendre needs 5 points | Use MC only when deterministic methods are infeasible |
| 12 | Forgetting that quasi-MC Sobol sequences need scrambling for correctness | Unscrambled Sobol sequences can have correlated estimators | Use scrambled Sobol from scipy.stats.qmc |
10. Exercises
Exercise 1 * - Newton-Cotes Rules from Scratch
(a) Implement the composite trapezoidal rule trap(f, a, b, m) and verify the $O(h^2)$ convergence rate on a smooth test integral (e.g., $\int_0^1 e^x\,dx$).
(b) Implement composite Simpson's rule simpson(f, a, b, m) and verify $O(h^4)$ convergence.
(c) Show the midpoint rule is twice as accurate as the trapezoidal rule (same $O(h^2)$ order but half the error constant) by computing and comparing errors on a convex function.
Exercise 2 * - Romberg Integration (a) Implement Romberg's method, building the triangular table $R_{k,j}$. (b) Apply it to a smooth integrand and compare convergence to composite trapezoidal. (c) Show that for smooth $f$, the diagonal converges super-geometrically - the number of correct digits roughly doubles each step.
Exercise 3 * - Gauss-Legendre Quadrature (a) Compute the 5-point GL nodes and weights using the eigenvalue method (Golub-Welsch matrix). (b) Verify: the rule integrates all polynomials of degree $\le 9$ exactly. (c) Compare GL-5 accuracy on a smooth integrand vs composite Simpson's with 50 subintervals.
Exercise 4 ** - Adaptive Quadrature Implementation
(a) Implement adaptive_simpson(f, a, b, tol) using recursive bisection with the 3-point Simpson estimate and the 5-point refined estimate for error control.
(b) Test on an integrand with an integrable singularity (e.g., $x^{-1/2}$, singular at 0) and an oscillatory one (e.g., $\sin(1/x)$).
(c) Plot the adaptive mesh (where it placed evaluation points) for a function with a sharp peak.
Exercise 5 ** - Monte Carlo Integration (a) Estimate $\pi$ via the quarter-disc indicator using plain Monte Carlo with increasing $n$. Show the $O(n^{-1/2})$ error convergence. (b) Implement importance sampling for a rapidly decaying integrand with a half-normal proposal. Compare variance vs uniform sampling. (c) Implement antithetic variates for the 2D integral in (a) and measure the variance reduction factor.
Exercise 6 ** - Euler-Maclaurin and Periodic Functions (a) Numerically verify the Euler-Maclaurin formula for the trapezoidal rule on a smooth non-periodic integrand: show the error matches the leading boundary term $\frac{h^2}{12}\big(f'(b) - f'(a)\big)$. (b) Demonstrate spectral accuracy: for a smooth periodic integrand on $[0, 2\pi]$, show the composite trapezoidal rule achieves machine precision with just 20 points. (c) Explain why this makes the DFT exact: use the Euler-Maclaurin formula to show that integrating the DFT basis functions $e^{ikx}$ over $[0, 2\pi]$ with the trapezoidal rule is exact for integer $k$.
Exercise 7 ** - Gauss-Hermite for Gaussian Expectations (a) Implement $n$-point Gauss-Hermite quadrature for $E_{x\sim N(\mu,\sigma^2)}[f(x)]$ using the Golub-Welsch algorithm. (b) Compute $E\big[\log(1 + e^{-x})\big]$ (the binary cross-entropy expectation) using GH quadrature with 20 points. Compare to Monte Carlo with $10^5$ samples. (c) Implement the reparameterization trick to compute the ELBO gradient for a 1D Gaussian posterior using both GH quadrature and the pathwise estimator.
Exercise 8 *** - Quasi-Monte Carlo and Low-Discrepancy Sequences (a) Compare plain Monte Carlo vs Sobol quasi-MC for a smooth integrand on $[0,1]^d$ for $d = 2, 5, 10$. (b) Plot the convergence rate (log-log error vs log $n$) for both methods. Show MC gives slope $-1/2$ and Sobol gives slope close to $-1$ for small $d$. (c) For $d = 5$, how many samples do you need for MC to achieve $10^{-4}$ relative accuracy? For Sobol? Compute the ratio.
11. Why This Matters for AI (2026 Perspective)
| Concept | AI/LLM Impact |
|---|---|
| Monte Carlo integration | Foundation of all stochastic gradient methods; ELBO, policy gradient, score matching |
| Reparameterization trick | VAE training, diffusion model score matching, normalizing flows - the key to backpropagating through stochastic variables |
| Gauss-Hermite quadrature | Exact ELBO computation in structured VAEs; expectation propagation for approximate inference |
| Importance sampling | Off-policy RL (PPO uses importance weights $\pi_\theta/\pi_{\theta_{\text{old}}}$); rejection sampling for LLM alignment |
| Quasi-Monte Carlo | Faster hyperparameter search (Optuna, Ray Tune use quasi-random sampling); faster random feature approximations |
| Adaptive quadrature | Error-controlled numerical ODEs (Dormand-Prince, VODE) used in neural ODE/SDE training |
| Thermodynamic integration | Computing log partition functions for energy-based models; AIS for evaluating generative models |
| Variance reduction (baselines) | Policy gradient training stability (REINFORCE with baseline, A2C, PPO); reduces gradient variance by 10-100x |
| Gaussian process marginal likelihood | Bayesian optimization for hyperparameter tuning; GP-based surrogate models in AutoML |
| Lanczos-based matrix quadrature | Efficient log-determinant computation for GP likelihoods at large scale (GPyTorch) |
12. Conceptual Bridge
Looking back: Numerical integration is built directly on the foundations of this chapter and the broader curriculum:
- Floating-point arithmetic (Section10-01): All quadrature involves finite-precision arithmetic; catastrophic cancellation in the sum must be avoided
- Interpolation (Section10-04): Quadrature rules are derived by integrating interpolants exactly. Every quadrature rule has a corresponding interpolation scheme
- Orthogonal polynomials (Section10-04, Section03-Advanced-Linear-Algebra): Gaussian quadrature nodes are zeros of orthogonal polynomials; the Golub-Welsch algorithm is an eigenvalue computation
- Probability (Section07): Monte Carlo integration is equivalent to expectation estimation; variance reduction techniques exploit probability theory directly
Looking forward: Numerical integration connects to advanced topics throughout the curriculum:
- Ordinary and partial differential equations: ODE solvers (RK4, Dormand-Prince) are quadrature formulas applied to $y(t+h) = y(t) + \int_t^{t+h} f(s, y(s))\,ds$; Gauss collocation methods achieve the highest possible order
- Markov Chain Monte Carlo: When direct sampling is unavailable, MCMC generates correlated samples whose empirical average still converges to the integral - at a slower effective rate $O(\sqrt{\tau/n})$, where $\tau$ is the autocorrelation time
- Signal processing: The DFT is a quadrature formula; the connection between Fourier analysis and quadrature (Clenshaw-Curtis via DCT) is deep
NUMERICAL INTEGRATION IN THE CURRICULUM
========================================================================
Section10-01 Floating-Point Section10-04 Interpolation
(rounding in sums) --> (integrate interpolant)
| |
+------------+-----------+
v
Section10-05 NUMERICAL INTEGRATION
+--------------------------+
| Newton-Cotes rules |
| Gaussian quadrature |
| Romberg / Richardson |
| Adaptive (Gauss-Kronrod) |
| Monte Carlo + QMC |
| Multidimensional |
+------------+-------------+
|
+----------------+----------------+
v v v
ODE Solvers VAE/ELBO Bayesian
(SectionAdvCalc) (reparam trick) Optimization
Diffusion models (GP marginal
Policy gradient likelihood)
========================================================================
The unifying theme: Every integration problem in ML is either a low-dimensional, smooth integral (use Gaussian quadrature - exact and fast) or a high-dimensional, stochastic integral (use Monte Carlo - dimension-independent convergence). The art is recognizing which regime you're in and applying the appropriate technique. The reparameterization trick is the bridge: it converts a "hard" stochastic integral into an "easy" pathwise integral, enabling efficient gradient computation.
Appendix A: Proof of Gaussian Quadrature Optimality
A.1 The 2n-1 Degree of Exactness Theorem
Theorem: The -point Gauss-Legendre rule integrates all polynomials of degree exactly.
Proof: Let $p$ be any polynomial of degree $\le 2n-1$. Divide by the node polynomial $\omega(x) = \prod_{i=1}^n (x - x_i)$ (degree $n$):
$p = q\,\omega + r$
where $\deg q \le n-1$ and $\deg r \le n-1$.
Step 1: $\int_{-1}^1 q\,\omega\,dx = 0$. Since $\omega = c\,P_n$ for some constant $c$ (the nodes are the zeros of $P_n$), and $\deg q \le n-1$, orthogonality of $P_n$ to all lower-degree polynomials gives 0.
Step 2: $Q_n[q\,\omega] = 0$. Since $\omega(x_i) = 0$ for each node $x_i$.
Step 3: $Q_n[r] = \int_{-1}^1 r\,dx$ exactly. Since $\deg r \le n-1$, and the interpolatory weights are chosen to integrate the Lagrange basis - hence every polynomial of degree $\le n-1$ - exactly.
Step 4: Combining:
$Q_n[p] = Q_n[r] = \int_{-1}^1 r\,dx = \int_{-1}^1 p\,dx$
Why $2n-1$ is optimal: Any $n$-point rule fails for $p(x) = \omega(x)^2 = \prod_i (x - x_i)^2$ (degree $2n$), since $p(x_i) = 0$ at all nodes (so $Q_n[p] = 0$) but $\int p > 0$.
A.2 The Golub-Welsch Algorithm
The GL nodes and weights are eigenvalues and eigenvectors of the Jacobi tridiagonal matrix $J_n$:
$J_n = \begin{pmatrix} \alpha_1 & \beta_1 & & \\ \beta_1 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_{n-1} \\ & & \beta_{n-1} & \alpha_n \end{pmatrix}$
where for Legendre polynomials: $\alpha_k = 0$ (by symmetry), $\beta_k = \dfrac{k}{\sqrt{4k^2 - 1}}$.
Algorithm:
- Build $J_n$ as a symmetric tridiagonal matrix
- Compute the eigendecomposition $J_n = V\Lambda V^\top$
- Nodes: $x_i = \lambda_i$ (eigenvalues)
- Weights: $w_i = 2\,v_{1,i}^2$ (squared first component of each normalized eigenvector, times 2 for the integral of 1 over $[-1,1]$)
This reduces GL computation to a standard symmetric eigenvalue problem - $O(n^2)$ using QR iteration, faster with divide-and-conquer.
import numpy as np
def gauss_legendre(n):
"""Compute n-point Gauss-Legendre nodes and weights on [-1, 1]."""
k = np.arange(1, n)
beta = k / np.sqrt(4*k**2 - 1)
J = np.diag(beta, -1) + np.diag(beta, 1) # Symmetric tridiagonal
eigvals, eigvecs = np.linalg.eigh(J)
# Sort by node
idx = np.argsort(eigvals)
nodes = eigvals[idx]
weights = 2 * eigvecs[0, idx]**2 # First row of eigenvector matrix
return nodes, weights
Appendix B: Richardson Extrapolation - Complete Analysis
B.1 The Error Expansion Assumption
Richardson extrapolation requires the error to have the asymptotic expansion:
$I - Q(h) = c_1 h^{p_1} + c_2 h^{p_2} + \cdots, \qquad p_1 < p_2 < \cdots$
When does this hold?
- Composite trapezoidal: Yes (Euler-Maclaurin guarantees the $h^2, h^4, h^6, \dots$ terms)
- Composite Simpson's: Yes ($h^4, h^6, \dots$ terms)
- For periodic functions: All the power-law coefficients vanish and the error is exponentially small, $O(e^{-c/h})$ - there are no $h^{2k}$ terms left to cancel, so Richardson extrapolation gains nothing (and isn't needed)
B.2 Romberg Table Construction
The Romberg table is computed as:
$R_{k,0} = T_{2^k}, \qquad R_{k,j} = \frac{4^j\,R_{k,j-1} - R_{k-1,j-1}}{4^j - 1}$
Incremental trapezoidal computation: $T_{2m}$ can be computed from $T_m$ using only the new midpoints:
$T_{2m} = \tfrac12\,T_m + h_{2m}\sum_{\text{new midpoints } x_j} f(x_j)$
Total function evaluations after $k$ levels: $2^k + 1$ (not $\sum_{j\le k}(2^j + 1)$ - the old evaluations are reused).
B.3 Convergence of Romberg
For smooth $f$, the diagonal satisfies:
$|I - R_{k,k}| = O\!\left(h_k^{2k+2}\right), \qquad h_k = \frac{b-a}{2^k}$
The convergence is faster than any fixed power of $h$ - supergeometric.
Practical stopping criterion: Compare $|R_{k,k} - R_{k-1,k-1}|$. If this is smaller than the tolerance, stop. But be careful: the estimate can be misleading for non-smooth functions.
Appendix C: Monte Carlo - Detailed Variance Analysis
C.1 Central Limit Theorem for MC
Let $y_i = f(x_i)$ be i.i.d. with variance $\sigma^2$. The MC estimator satisfies:
$\sqrt{n}\,\big(\hat{I}_n - I\big) \xrightarrow{d} N(0, \sigma^2)$
95% confidence interval: $\hat{I}_n \pm 1.96\,\hat\sigma/\sqrt{n}$, where $\hat\sigma^2 = \frac{1}{n-1}\sum_i (y_i - \hat{I}_n)^2$.
How many samples for accuracy $\epsilon$? $n \approx (1.96\,\sigma/\epsilon)^2$.
For $\epsilon = 10^{-3}$ and $\sigma = 1$: $n \approx 3.8\times 10^6$ - about 4 million samples.
C.2 Importance Sampling: Optimal Proposal
The variance of the importance sampling estimator is:
$\mathrm{Var}\big[\hat{I}_{IS}\big] = \frac{1}{n}\left(\int \frac{f(x)^2\,p(x)^2}{q(x)}\,dx - I^2\right)$
Minimize over $q$ subject to $\int q = 1$: By Cauchy-Schwarz, the minimum is attained when $q^*(x) \propto |f(x)|\,p(x)$.
With $q = q^*$ (and $f \ge 0$): The estimator is constant and equal to $I$ - zero variance! But normalizing $q^*$ requires knowing the integral.
Practical IS: Choose a tractable $q$ whose shape is close to $|f|\,p$ and whose tails are at least as heavy.
C.3 Quasi-MC Convergence Analysis
The Koksma-Hlawka inequality for quasi-MC:
$\big|\hat{I}_n - I\big| \le V_{HK}(f)\; D_n^*$
where $V_{HK}(f)$ is the Hardy-Krause variation and $D_n^*$ is the star discrepancy.
Sobol sequences: $D_n^* = O\!\left(\frac{(\log n)^d}{n}\right)$.
Convergence rate comparison:
- MC: $O(n^{-1/2})$
- Sobol: $O\!\left(\frac{(\log n)^d}{n}\right)$ - asymptotically faster by nearly a factor $\sqrt{n}$
- Break-even (Sobol provably faster): requires $n^{1/2} > (\log n)^d$ - for $d = 10$ this demands astronomically large $n$... not practical!
But: The constant in front matters. For moderate $n$ (thousands to millions), Sobol consistently outperforms MC in practice for $d \lesssim 20$.
Appendix D: Integration and Machine Learning - Extended Connections
D.1 The REINFORCE Gradient and Score Function Estimator
The policy gradient is a Monte Carlo estimate of $\nabla_\theta\, E_{x\sim p_\theta}[f(x)]$. The score function estimator (also called REINFORCE or the likelihood-ratio estimator):
$\nabla_\theta\, E_{x\sim p_\theta}[f(x)] = E_{x\sim p_\theta}\big[f(x)\,\nabla_\theta\log p_\theta(x)\big]$
Proof:
$\nabla_\theta \int f(x)\,p_\theta(x)\,dx = \int f(x)\,\nabla_\theta p_\theta(x)\,dx = \int f(x)\,p_\theta(x)\,\nabla_\theta\log p_\theta(x)\,dx$
This requires only the ability to sample from $p_\theta$ and evaluate $\nabla_\theta\log p_\theta$ - but has high variance because $f(x)$ fluctuates wildly.
Variance of REINFORCE:
$\mathrm{Var}\big[f(x)\,\nabla_\theta\log p_\theta(x)\big]$
This can be $O(R^2)$ - proportional to the square of the reward, which is large. Hence the need for baselines and actor-critic methods.
D.2 The Reparameterization Trick as Pathwise Derivative
When $x = g_\theta(\epsilon)$ with $\epsilon\sim p(\epsilon)$ (independent of $\theta$):
$\nabla_\theta\, E_x[f(x)] = E_{\epsilon}\big[\nabla_\theta f(g_\theta(\epsilon))\big]$
Variance comparison: The reparameterization estimator's variance is governed by $E\big[\|\nabla_\theta f(g_\theta(\epsilon))\|^2\big]$ - the gradient of $f$ along the path, which is typically much smaller than the REINFORCE variance. Empirically, 10-100x lower variance.
When does reparameterization apply?
- Gaussian: $x = \mu + \sigma\epsilon$, $\epsilon\sim N(0,1)$ [ok]
- Uniform: $x = a + (b-a)\epsilon$, $\epsilon\sim U(0,1)$ [ok]
- Categorical (Gumbel-Softmax): $z = \mathrm{softmax}\big((\log\pi + g)/\tau\big)$, $g_i\sim\mathrm{Gumbel}(0,1)$ - continuous relaxation [ok]
- Any discrete distribution without continuous relaxation: not reparameterizable
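A sketch comparing the two estimators for $\nabla_\mu\, E_{x\sim N(\mu,1)}[x^2]$, whose exact value is $2\mu$:
import numpy as np
rng = np.random.default_rng(4)
mu, n = 1.5, 10000
eps = rng.standard_normal(n)
x = mu + eps
score = x**2 * (x - mu)              # REINFORCE: f(x) * d/dmu log N(x; mu, 1)
pathwise = 2 * (mu + eps)            # reparameterization: d/dmu f(mu + eps)
print(f"exact:    {2*mu:.3f}")
print(f"score:    {score.mean():.3f}  (per-sample std {score.std():.2f})")
print(f"pathwise: {pathwise.mean():.3f}  (per-sample std {pathwise.std():.2f})")
Both are unbiased, but the pathwise estimator's per-sample standard deviation is several times smaller - and the gap widens for higher-variance objectives.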
D.3 Numerical Integration in Normalizing Flows
Change of variables theorem: For an invertible $z = f(x)$ with Jacobian $J_f$:
$p_X(x) = p_Z(f(x))\;\big|\det J_f(x)\big|$
The log-likelihood is:
$\log p_X(x) = \log p_Z(f(x)) + \log\big|\det J_f(x)\big|$
Computing $\det J_f$ in general requires the determinant of a $d\times d$ Jacobian matrix - $O(d^3)$. Normalizing flows use structured architectures (coupling layers, autoregressive flows) where $J_f$ is triangular and $\log|\det J_f| = \sum_i \log|(J_f)_{ii}|$ - $O(d)$.
This is numerical integration in disguise: the change-of-variables formula evaluates analytically, for invertible flows, what would otherwise be an intractable marginalization over $z$-space.
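A toy affine coupling layer illustrating the $O(d)$ log-determinant (the scale and shift "networks" here are stand-in functions, not trained models):
import numpy as np
rng = np.random.default_rng(6)
d = 6
x = rng.standard_normal(d)
x1, x2 = x[:d//2], x[d//2:]
s, t = np.tanh(x1), x1**2                      # toy scale / shift networks
z = np.concatenate([x1, x2*np.exp(s) + t])     # coupling: z1 = x1, z2 = x2*e^s + t
log_det = s.sum()                              # triangular Jacobian: sum of log-scales, O(d)
print(f"log|det J| = {log_det:.4f}")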
Appendix E: Software Reference
E.1 scipy.integrate
from scipy import integrate
# Adaptive Gauss-Kronrod (default: G7-K15)
result, abserr = integrate.quad(f, a, b)
result, abserr = integrate.quad(f, a, b, epsabs=1.49e-8, epsrel=1.49e-8, limit=50)
# Improper integrals
result, abserr = integrate.quad(f, 0, np.inf) # Semi-infinite
result, abserr = integrate.quad(f, -np.inf, np.inf) # Doubly infinite
# With singularity points
result, abserr = integrate.quad(f, 0, 1, points=[0.5])
# Fixed Gauss-Legendre (returns the value; the second element is always None)
result, _ = integrate.fixed_quad(f, a, b, n=5)
# Romberg (deprecated in recent SciPy releases; prefer quad)
result = integrate.romberg(f, a, b, tol=1.48e-8, rtol=1.48e-8)
# 2D integration
result, abserr = integrate.dblquad(f, a, b, lambda x: c, lambda x: d)
E.2 numpy for Quadrature
import numpy as np
# Gauss-Legendre points and weights (numpy built-in)
x, w = np.polynomial.legendre.leggauss(n)
I = np.dot(w, f(x)) # For integral over [-1, 1]
# Gauss-Hermite
x, w = np.polynomial.hermite.hermgauss(n)
I = np.dot(w, f(x)) # For exp(-x^2) weight
# Gauss-Laguerre
x, w = np.polynomial.laguerre.laggauss(n)
I = np.dot(w, f(x)) # For exp(-x) weight
E.3 Quasi-Monte Carlo with scipy.stats.qmc
from scipy.stats import qmc
# Sobol sequence (scrambled)
sampler = qmc.Sobol(d=5, scramble=True, seed=42)
x = sampler.random(n=1024) # n must be power of 2 for Sobol
# x has shape (1024, 5), values in [0, 1]^5
# Halton sequence
sampler = qmc.Halton(d=5, scramble=True, seed=42)
x = sampler.random(n=1000)
# Latin Hypercube Sampling
sampler = qmc.LatinHypercube(d=5, seed=42)
x = sampler.random(n=100)
# Compute integral of f over [0,1]^5
I_qmc = f(x).mean()
I_mc = f(np.random.rand(1000, 5)).mean()
Appendix F: Common Integrals in Machine Learning
| Integral | Closed form | Context |
|---|---|---|
| $\int_{-\infty}^{\infty} N(x;\mu,\sigma^2)\,dx$ | $1$ | Normalization |
| $\int x\,N(x;\mu,\sigma^2)\,dx$ | $\mu$ | Gaussian mean |
| $\int x^2\,N(x;\mu,\sigma^2)\,dx$ | $\mu^2 + \sigma^2$ | Gaussian second moment |
| $\mathrm{KL}\big(N(\mu,\sigma^2)\,\|\,N(0,1)\big)$ | $\tfrac12(\mu^2 + \sigma^2 - \log\sigma^2 - 1)$ | VAE ELBO term |
| $\int \sigma(x)\,N(x;\mu,s^2)\,dx$ | $\approx \sigma\!\big(\mu/\sqrt{1 + \pi s^2/8}\big)$ (probit approx.) | Expected sigmoid |
| $\int_{-\infty}^{\infty} e^{-x^2}\,dx$ | $\sqrt{\pi}$ | Gaussian normalization |
| $\int_0^{\infty} x^{s-1}e^{-x}\,dx$ | $\Gamma(s)$ | Gamma function |
| $\int_0^1 x^{a-1}(1-x)^{b-1}\,dx$ | $B(a,b)$ | Beta function |
| $\int_{-\infty}^{\infty} e^{i\omega x}\,dx$ | $2\pi\,\delta(\omega)$ | Fourier delta |
| $\int_a^b u'v\,dx$ | $[uv]_a^b - \int_a^b u v'\,dx$ (by parts) | PDE weak form |
Appendix G: Error Bounds Quick Reference
| Method | Nodes | Error bound | Smoothness needed |
|---|---|---|---|
| Midpoint (1-panel) | 1 | $\tfrac{(b-a)^3}{24}f''(\xi)$ | $C^2$ |
| Trapezoidal (1-panel) | 2 | $-\tfrac{(b-a)^3}{12}f''(\xi)$ | $C^2$ |
| Simpson's (1-panel) | 3 | $-\tfrac{(b-a)^5}{2880}f^{(4)}(\xi)$ | $C^4$ |
| Composite trap. ($m$ panels) | $m+1$ | $-\tfrac{(b-a)h^2}{12}f''(\xi)$ | $C^2$ |
| Composite Simpson's ($m$ even) | $m+1$ | $-\tfrac{(b-a)h^4}{180}f^{(4)}(\xi)$ | $C^4$ |
| GL-$n$ (1 panel) | $n$ | $\tfrac{(b-a)^{2n+1}(n!)^4}{(2n+1)[(2n)!]^3}f^{(2n)}(\xi)$ | $C^{2n}$ |
| Clenshaw-Curtis ($n$ pts) | $n$ | $O(n^{-k})$ or $O(\rho^{-n})$ | $C^k$ or analytic |
| Romberg ($k$ steps) | $2^k+1$ | $O(h^{2k+2})$ | $C^{2k+2}$ |
| Monte Carlo | $n$ samples | $\sigma\,n^{-1/2}$ (RMS) | Any measurable |
| Quasi-MC (Sobol) | $n$ pts | $O\big((\log n)^d/n\big)$ | Bounded variation |
Appendix H: Extended Newton-Cotes Theory
H.1 Derivation of Newton-Cotes Weights
The weights for an $(n+1)$-point Newton-Cotes rule on $[0, n]$ with unit spacing are:
$w_i = \int_0^n \prod_{j\ne i}\frac{t - j}{i - j}\,dt$
For $n = 2$ (Simpson's rule, 3 nodes at $0, 1, 2$):
$w_0 = \int_0^2 \frac{(t-1)(t-2)}{(0-1)(0-2)}\,dt = \frac13$
Similarly $w_1 = \frac43$, $w_2 = \frac13$. With spacing $h$: the weights become $\frac{h}{3}(1, 4, 1)$.
H.2 Negative Weights for High-Order Newton-Cotes
The weights for Newton-Cotes rules with $n \ge 8$ become negative. This is a fundamental instability: if the values $f(x_i)$ carry rounding errors of magnitude $\epsilon$, the error in $Q_n$ is bounded by $\epsilon\sum_i |w_i|$.
With all-positive weights, $\sum_i |w_i| = b - a$; once weights change sign, the sum grows with $n$ - roughly exponentially for high-order Newton-Cotes - so rounding errors are amplified, catastrophically for large $n$.
In contrast, Gaussian quadrature weights are always positive - the sum is exactly $b - a$ and rounding errors are not amplified.
H.3 Peano Kernel Theorem
For a rule with error $E[f] = \int_a^b f - Q[f]$ that is exact on polynomials of degree $\le m$, the Peano kernel theorem states:
$E[f] = \int_a^b K_m(t)\,f^{(m+1)}(t)\,dt, \qquad K_m(t) = \frac{1}{m!}\,E_x\big[(x - t)_+^m\big]$
where $m$ is the degree of exactness. The Peano kernel $K_m$ is zero when $t$ is outside $[a,b]$ and has a specific shape depending on the rule.
Implications:
- The error depends on $f^{(m+1)}$ - the function's $(m+1)$-th derivative
- If $K_m$ has constant sign, we can write $E[f] = f^{(m+1)}(\xi)\int_a^b K_m(t)\,dt$ for some $\xi\in(a,b)$ (mean value theorem for integrals)
- For symmetric rules (midpoint, trapezoidal, GL), the Peano kernel has definite sign for appropriate
Appendix I: Adaptive Quadrature - Implementation Details
I.1 Gauss-Kronrod G7-K15 Rule
The 15-point Gauss-Kronrod rule (K15) extends the 7-point Gauss-Legendre (G7):
G7 nodes (Gauss-Legendre, 7 points): $0, \pm 0.4058, \pm 0.7415, \pm 0.9491$ (to four digits).
K15 nodes (Kronrod extension, 15 points): The G7 nodes plus 8 additional Kronrod nodes.
Error estimate: QUADPACK uses the heuristic $(200\,|Q_{G7} - Q_{K15}|)^{3/2}$; if it is below the requested tolerance, the K15 value is accepted.
Degree of exactness: G7 integrates polynomials exactly up to degree 13; K15 up to degree 22.
I.2 Recursive vs Priority-Queue Adaptive
Recursive adaptive:
def adaptive_quad(f, a, b, tol, depth=0, max_depth=50):
Q_coarse = coarse_rule(f, a, b)
Q_fine = fine_rule(f, a, b)
if abs(Q_fine - Q_coarse) < tol or depth >= max_depth:
return Q_fine
m = (a + b) / 2
return (adaptive_quad(f, a, m, tol/2, depth+1) +
adaptive_quad(f, m, b, tol/2, depth+1))
Priority-queue adaptive: Better for functions with many singularities - always refines the worst subinterval first.
I.3 Handling Singularities
Integrable singularities: $f(x) = x^{-\alpha}$ with $0 < \alpha < 1$ gives a finite $\int_0^1 f$.
Strategy 1 - Change of variables: $x = u^2$ smooths a square-root singularity:
$\int_0^1 \frac{g(x)}{\sqrt{x}}\,dx = 2\int_0^1 g(u^2)\,du$
Strategy 2 - Gauss-Jacobi: Use the weight function $(1-x)^\alpha(1+x)^\beta$ to absorb the singularity. scipy.integrate.quad handles declared algebraic singularities via its weight options.
Strategy 3 - IMT / double exponential transformation: A change of variables whose derivative - along with all higher derivatives - vanishes at the endpoints:
$x = \tanh\!\left(\frac{\pi}{2}\sinh t\right)$
This is the basis for the tanh-sinh (double exponential) quadrature method of Takahashi and Mori.
Appendix J: Monte Carlo in Practice - Recipes
J.1 Estimating via Monte Carlo
Classic example: $\pi = 4\,P\big[x^2 + y^2 \le 1\big]$ with $x, y \sim U(0, 1)$.
import numpy as np
np.random.seed(42)
for n in [100, 1000, 10000, 1000000]:
    x, y = np.random.rand(n), np.random.rand(n)
    inside = (x**2 + y**2 <= 1)
    p = inside.mean()                          # fraction inside the quarter disc
    pi_est = 4 * p
    std_err = 4 * np.sqrt(p * (1 - p) / n)     # binomial standard error, scaled by 4
    print(f"n={n:>8}: pi = {pi_est:.6f} +/- {std_err:.6f}")
Output: the estimate converges to $\pi$ with the expected $O(n^{-1/2})$ standard error.
J.2 Gaussian Expectation via GH Quadrature
def gauss_hermite_expect(f, mu, sigma, n=20):
    """
    Compute E_{x~N(mu,sigma^2)}[f(x)] via n-point Gauss-Hermite quadrature.
    """
    x, w = np.polynomial.hermite.hermgauss(n)
    x_transformed = np.sqrt(2) * sigma * x + mu   # Map GH nodes to N(mu, sigma^2)
    return np.dot(w, f(x_transformed)) / np.sqrt(np.pi)
# Example: E[log(1 + e^x)] for x ~ N(0, 1)
f = lambda x: np.log(1 + np.exp(x))
gh_approx = gauss_hermite_expect(f, mu=0, sigma=1, n=20)
samples = f(np.random.randn(100000))
print(f"GH-20:  {gh_approx:.8f}")
print(f"MC-1e5: {samples.mean():.6f} +/- {samples.std()/np.sqrt(samples.size):.6f}")
J.3 ELBO Computation
def vae_elbo(x, encoder, decoder, n_samples=10):
"""
Compute ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)).
Using reparameterization trick.
"""
mu, log_sigma = encoder(x)
sigma = np.exp(log_sigma)
# KL(N(mu, sigma^2) || N(0, 1)) = 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1)
kl = 0.5 * np.sum(mu**2 + sigma**2 - 2*log_sigma - 1)
    # E_q[log p(x|z)] via reparameterization (decoder(z) is assumed to return log p(x|z))
eps = np.random.randn(n_samples, len(mu))
z_samples = mu + sigma * eps # Shape: (n_samples, latent_dim)
recon = np.mean([decoder(z) for z in z_samples])
return recon - kl
Appendix K: Connections to Differential Equations and Physics
K.1 ODE Solvers as Quadrature
The ODE initial value problem $y' = f(t, y)$, $y(t_0) = y_0$ has solution:
$y(t) = y_0 + \int_{t_0}^{t} f(s, y(s))\,ds$
Runge-Kutta methods approximate this integral using quadrature:
- Euler: left-endpoint rule (order 1)
- RK4: a Simpson's-like rule on $[t, t+h]$ (order 4)
- Gauss collocation: implicit RK methods built on Gauss-Legendre nodes - order $2s$ with $s$ stages, the highest possible for a given number of stages
Dormand-Prince (RK45): Embedded 4th/5th order pair for adaptive step control - the default ODE solver in scipy.integrate.solve_ivp.
K.2 Feynman Path Integrals
In quantum mechanics, the probability amplitude is formally:
$\langle x_b |\, e^{-iHT/\hbar}\, | x_a \rangle = \int \mathcal{D}[x(t)]\; e^{iS[x]/\hbar}$
where $S[x]$ is the action. This is an infinite-dimensional integral - a functional integral over all paths.
Numerical discretization: Replace the path integral by a finite product of ordinary integrals over the path's values at discrete time slices:
This is a multi-dimensional integral evaluated by Monte Carlo - the lattice QCD approach uses this for nuclear physics calculations.
K.3 Numerical Integration in Finite Element Methods
Finite element method (FEM): Solve $-\Delta u = f$ in $\Omega$ by finding $u_h$ in a finite-dimensional subspace $V_h$ such that:
$\int_\Omega \nabla u_h\cdot\nabla v_h\,dx = \int_\Omega f\,v_h\,dx \quad \text{for all } v_h \in V_h$
Each element integral is computed via Gaussian quadrature on the reference element $\hat{K}$:
$\int_K g(x)\,dx = \int_{\hat{K}} g(x(\hat{x}))\,|\det J|\,d\hat{x} \approx \sum_i w_i\, g(x(\hat{x}_i))\,|\det J(\hat{x}_i)|$
where $J$ is the Jacobian of the mapping from $\hat{K}$ to $K$.
For AI: Physics-informed neural networks (PINNs) solve PDEs by:
- Sampling "collocation points" inside the domain (random or Gauss-like)
- Computing the PDE residual at each point
- Minimizing the sum of residuals squared
This is meshless quadrature - and the quality of the quadrature points (random vs quasi-random vs Gauss) directly affects training accuracy.
Appendix L: Notation Reference
| Symbol | Meaning |
|---|---|
| $Q_n[f]$ | Quadrature rule with $n$ nodes |
| $x_i$ | Quadrature nodes |
| $w_i$ | Quadrature weights |
| $E_n[f]$ | Quadrature error |
| $h$ | Mesh spacing |
| $d$ | Degree of exactness of a rule |
| $P_n$ | Legendre polynomial of degree $n$ |
| $T_n$ | Chebyshev polynomial of degree $n$ |
| $B_{2k}$ | Bernoulli numbers |
| $R_{k,j}$ | Romberg table entry |
| $\hat{I}_n$ | Monte Carlo estimator |
| $D_n^*$ | Star discrepancy |
| $V_{HK}(f)$ | Hardy-Krause variation |
| $Z$ | Partition function |
| $\mathcal{L}$ | Evidence lower bound |
| $\tau$ | Autocorrelation time in MCMC |
Appendix M: Advanced Topics in Quadrature
M.1 Clenshaw-Curtis Quadrature - Full Derivation
Setup: Integrate the Chebyshev interpolant of $f$ at the points $x_j = \cos(j\pi/n)$, $j = 0,\dots,n$:
$\int_{-1}^1 f(x)\,dx \approx \int_{-1}^1 p_n(x)\,dx = \sum_{k=0}^{n} c_k \int_{-1}^1 T_k(x)\,dx$
Since $\int_{-1}^1 T_k\,dx = 0$ for odd $k$ and $\frac{2}{1-k^2}$ for even $k$:
$\int_{-1}^1 p_n(x)\,dx = \sum_{k\ \text{even}} \frac{2\,c_k}{1 - k^2}$
Computing the weights: contracting this formula against the interpolation map gives the CC weights - computed efficiently via the DCT applied to the moment vector $\big(\frac{2}{1-k^2}\big)_{k\ \text{even}}$.
Why CC is competitive with GL: For analytic functions, both achieve exponential convergence. CC uses $n+1$ nodes for degree-$n$ exactness while GL uses $n$ nodes for degree-$(2n-1)$ exactness - GL is more "efficient" per node. However:
- CC nodes nest: the nodes of CC-$n$ are a subset of the nodes of CC-$2n$ - ideal for adaptive schemes
- CC weights are computed via DCT in $O(n\log n)$ - GL requires $O(n^2)$ via the eigenvalue method
- CC nodes are the Chebyshev nodes - the same points used in near-optimal polynomial approximation
M.2 Gauss-Kronrod Derivation
Problem: Given $n$ GL nodes, find $n+1$ additional Kronrod points that maximize the degree of exactness of the combined $(2n+1)$-point rule.
Solution: The Kronrod points are the zeros of the Stieltjes polynomial $E_{n+1}$ satisfying:
$\int_{-1}^1 P_n(x)\,E_{n+1}(x)\,x^k\,dx = 0, \qquad k = 0, 1, \dots, n$
These polynomials exist and have real zeros in $(-1,1)$ interleaving with the GL nodes, but only for certain weight functions. The G7-K15 pair is the standard choice.
Degree of exactness of Kronrod: The combined G7-K15 rule is exact for polynomials up to degree 22 (G7 alone is exact up to degree 13; the Kronrod extension pushes it to 22).
M.3 Double Exponential (Tanh-Sinh) Quadrature
The tanh-sinh or double exponential transformation:
$x = \tanh\!\left(\frac{\pi}{2}\sinh t\right)$
maps $t \in (-\infty,\infty)$ to $x \in (-1, 1)$ with:
$\frac{dx}{dt} = \frac{\pi}{2}\,\frac{\cosh t}{\cosh^2\!\big(\frac{\pi}{2}\sinh t\big)}$
The derivative decays double exponentially as $|t|\to\infty$: faster than any power.
Key property: All derivatives of the transformed integrand vanish double-exponentially at $t = \pm\infty$. By Euler-Maclaurin, the trapezoidal rule in $t$ is exponentially convergent.
Convergence: For $f$ analytic on the open interval (singularities permitted at the endpoints $\pm 1$):
$|E_N| = O\!\left(e^{-c\,N/\log N}\right)$
where $c$ depends on the width of the strip of analyticity and $N$ is the number of trapezoidal points in $t$-space. This is nearly exponential in $N$ - faster than any other method for this problem class.
For AI: Tanh-sinh quadrature is used in high-precision computations (arbitrary precision arithmetic), computing special functions in ML kernels (Matern covariance, Bayesian quadrature), and computing KL divergences between complex distributions.
Appendix N: Worked Examples
N.1 Integrating a Smooth Function on $[0, 1]$
Take $f(x) = e^x$ as a representative smooth, analytic integrand.
Exact value: $\int_0^1 e^x\,dx = e - 1 \approx 1.71828$
Method comparison:
- Composite trapezoidal, $m = 1000$: error $\approx 10^{-7}$ ($O(h^2)$ with $h = 10^{-3}$)
- Composite Simpson's, $m = 100$: error $\approx 10^{-10}$ ($O(h^4)$, far fewer evaluations)
- GL-5 on $[0,1]$: error $\approx 10^{-12}$ (exact for degree 9 polynomials)
- GL-10: machine precision for this smooth function
Lesson: GL-10 uses only 10 function evaluations and achieves machine precision; composite trapezoidal needs tens of millions of evaluations for similar accuracy.
N.2 Computing the ELBO Expectation
For a smooth integrand $f(z)$ (e.g., a log-likelihood term) and $q(z) = N(\mu, \sigma^2)$:
$E_q[f(z)] = \frac{1}{\sqrt{\pi}}\sum_{i=1}^{n} w_i\, f(\mu + \sqrt{2}\,\sigma\,t_i)$
GH-5 approximation: Using the 5-point GH nodes $t_i$ and weights $w_i$, the transformed nodes $\mu + \sqrt{2}\,\sigma\,t_i$ probe $q$ exactly where its mass lies.
With 5 GH points, the relative error for smooth $f$ is typically many orders of magnitude below Monte Carlo's; 1000 MC samples are limited to relative errors of order $n^{-1/2}\approx 3\%$.
Appendix O: Summary Table - When to Use What
QUADRATURE METHOD SELECTION GUIDE
========================================================================
Key: d = dimension, n = nodes/samples, f = function
1D INTEGRATION (d=1):
---------------------
f smooth + analytic -> Gauss-Legendre (5-10 pts) or CC [best]
f smooth (C^inf) -> Romberg / Richardson extrapolation
f piecewise smooth -> Adaptive G7-K15 (scipy.integrate.quad)
f has singularities -> Adaptive + singularity declaration
f periodic -> Trapezoidal rule (spectral accuracy!)
f analytic + endpoint singularity -> Tanh-sinh
MULTIDIMENSIONAL (d > 1):
-------------------------
d <= 3, f smooth -> Tensor product GL or CC
d <= 15, f smooth -> Sparse grids (Smolyak)
d <= 20 -> Quasi-MC (scrambled Sobol)
d > 20 -> Monte Carlo with variance reduction
f = E_p[g(X)] -> Gauss-Hermite (if p Gaussian)
Singular / complex -> MCMC (Metropolis-Hastings)
ML SPECIFIC:
------------
ELBO gradient -> Reparameterization + 1 MC sample
GP marginal lik. -> Cholesky + Lanczos quadrature
Policy gradient -> REINFORCE + baseline (MC)
Normalizing constant -> AIS or thermodynamic integration
Neural ODE -> Adaptive RK (Dormand-Prince)
Posterior E[f(theta)] -> MCMC (HMC/NUTS for continuous theta)
========================================================================
Appendix P: Historical Notes and Key Papers
P.1 Gauss's Contribution
Carl Friedrich Gauss derived Gaussian quadrature in 1814 while working on geodesy and orbit computation. His insight was to treat both the nodes and weights as free variables - giving $2n$ degrees of freedom for an $n$-point rule, which (as he showed) is sufficient to integrate all polynomials of degree $\le 2n-1$.
Gauss computed the first several GL rules by hand, finding the nodes as zeros of Legendre polynomials. The connection between optimal quadrature nodes and orthogonal polynomial zeros is now understood through the theory of Gaussian quadrature, but Gauss discovered the pattern empirically.
P.2 The Monte Carlo Method's Origins
The Monte Carlo method was invented by Stanislaw Ulam while recovering from illness in 1946. Playing solitaire, he wondered: what is the probability of success? Unable to compute it analytically, he realized he could estimate it by simulating many games. He mentioned the idea to John von Neumann, who immediately recognized its potential for nuclear physics computations (neutron diffusion through fissile material) - and the Los Alamos Monte Carlo method was born.
The name "Monte Carlo" was suggested by Nicholas Metropolis, after the famous casino (and the element of chance in the method). The first large-scale application was computing the properties of materials for the hydrogen bomb in 1947.
P.3 Markov Chain Monte Carlo
The Metropolis-Hastings algorithm (1953, 1970) enabled sampling from distributions known only up to a normalizing constant. The key idea: propose a new state from the current state , then accept with probability . The Markov chain converges to as its stationary distribution.
MCMC became mainstream in statistics in the 1990s with the work of Gelfand and Smith (1990) showing how Gibbs sampling could solve Bayesian inference problems computationally. Today, HMC (Hamiltonian Monte Carlo) and NUTS (No-U-Turn Sampler) are the state-of-the-art for continuous distributions in packages like Stan, PyMC, and NumPyro.
P.4 Quasi-Monte Carlo's Development
Ilya Sobol introduced his low-discrepancy sequences in 1967. The key insight: random points have discrepancy $O(n^{-1/2})$ while Sobol sequences have discrepancy $O((\log n)^d/n)$. For moderate dimensions ($d \lesssim 20$) and practical sample sizes ($n \sim 10^3$-$10^6$), Sobol sequences consistently outperform random sampling.
Modern quasi-Monte Carlo uses scrambled Sobol sequences (Owen, 1995) that preserve the low-discrepancy property while also yielding unbiased estimators with variance that can be computed from multiple independent scramblings.
Appendix Q: Practical Code Recipes
Q.1 Composite Rules in NumPy
import numpy as np
def composite_trap(f, a, b, m):
"""Composite trapezoidal rule with m intervals."""
h = (b - a) / m
x = np.linspace(a, b, m + 1)
y = f(x)
return h * (y[0]/2 + y[1:-1].sum() + y[-1]/2)
def composite_simpson(f, a, b, m):
"""Composite Simpson's rule (m must be even)."""
assert m % 2 == 0, "m must be even"
h = (b - a) / m
x = np.linspace(a, b, m + 1)
y = f(x)
return h/3 * (y[0] + 4*y[1:-1:2].sum() + 2*y[2:-2:2].sum() + y[-1])
# Test: integral of e^x from 0 to 1 = e - 1
f = np.exp
exact = np.e - 1
for m in [10, 100, 1000]:
err_t = abs(composite_trap(f, 0, 1, m) - exact)
err_s = abs(composite_simpson(f, 0, 1, m) - exact)
print(f"m={m}: trap={err_t:.2e}, simp={err_s:.2e}")
Q.2 Romberg Integration
def romberg(f, a, b, max_level=8, tol=1e-10):
"""Romberg integration using Richardson extrapolation."""
R = np.zeros((max_level, max_level))
h = b - a
R[0, 0] = h/2 * (f(a) + f(b))
for k in range(1, max_level):
h /= 2
n_new = 2**(k-1)
x_new = a + h * (2*np.arange(1, n_new+1) - 1)
R[k, 0] = R[k-1, 0]/2 + h * f(x_new).sum()
for j in range(1, k+1):
R[k, j] = (4**j * R[k, j-1] - R[k-1, j-1]) / (4**j - 1)
if k > 0 and abs(R[k, k] - R[k-1, k-1]) < tol:
return R[k, k], k
return R[max_level-1, max_level-1], max_level
result, levels = romberg(np.exp, 0, 1)
print(f"Romberg result: {result:.15f}, levels: {levels}")
print(f"Exact e-1: {np.e-1:.15f}")
Q.3 Monte Carlo with Variance Reduction
def monte_carlo_antithetic(f, n, seed=42):
"""
Monte Carlo estimate of E[f(U)] where U ~ Uniform(0,1).
Uses antithetic variates: pairs (u, 1-u).
"""
np.random.seed(seed)
u = np.random.rand(n // 2)
f_vals = (f(u) + f(1 - u)) / 2 # Antithetic pairs
return f_vals.mean(), f_vals.std() / np.sqrt(n // 2)
def monte_carlo_plain(f, n, seed=42):
    """Standard Monte Carlo."""
    np.random.seed(seed)
    u = np.random.rand(n)
    vals = f(u)
    return vals.mean(), vals.std() / np.sqrt(n)
# Compare for f(u) = sqrt(u)
f = np.sqrt
for n in [100, 1000, 10000]:
est_plain, se_plain = monte_carlo_plain(f, n)
est_anti, se_anti = monte_carlo_antithetic(f, n)
print(f"n={n:>6}: plain se={se_plain:.4f}, antithetic se={se_anti:.4f}, "
f"ratio={se_plain/se_anti:.2f}x")
Appendix R: Common Integrals and Their Quadrature Suitability
| Integral | Exact value | Best method | Notes |
|---|---|---|---|
| $\int_0^1 e^x\,dx$ | $e - 1$ | GL-5 | Analytic, smooth |
| $\int_0^1 \sqrt{x}\,dx$ | $2/3$ | GL-10 | Smooth, mild singularity at 0 |
| $\int_0^1 x^{-1/2}\,dx$ | $2$ | Adaptive (Gauss-Jacobi) | Singularity at 0 |
| $\int_{-\infty}^{\infty} e^{-x^2}\,dx$ | $\sqrt{\pi}$ | Gauss-Hermite | Infinite domain |
| $\int_{-\infty}^{\infty} x^2 e^{-x^2}\,dx$ | $\sqrt{\pi}/2$ | Gauss-Hermite | Infinite domain |
| $\int_0^1 \sin(50x)\,dx$ | $(1-\cos 50)/50$ | Adaptive (many panels) | Oscillatory |
| $\int_0^1 \sin(1/x)\,dx$ | $\approx 0.504$ | Adaptive + change of vars | Oscillatory near 0 |
| $\int_{[0,1]^{10}} f\,dx$ | Varies | Quasi-MC Sobol | High-dimensional |
| $E_{x\sim N(\mu,\sigma^2)}[f(x)]$ | Varies | GH quadrature | Gaussian expectation |
| $\mathrm{KL}$ between Gaussians | Closed form | Analytical | No quadrature needed! |
Appendix S: Extended Theory - Bayesian Quadrature
S.1 Bayesian Quadrature Framework
Bayesian quadrature (BQ) models the integrand as a Gaussian process:
$f \sim \mathcal{GP}(0, k)$
and computes the posterior over the integral $I = \int f(x)\,\pi(x)\,dx$:
Posterior mean (the quadrature estimate):
$\hat{I} = z^\top K^{-1}\mathbf{f}, \qquad z_i = \int k(x, x_i)\,\pi(x)\,dx$
where $z$ is the "kernel mean" - computable analytically for many kernels - and $\mathbf{f} = (f(x_1),\dots,f(x_n))$.
Posterior variance (the uncertainty):
$\mathrm{Var}[I \mid \mathbf{f}] = \iint k(x,x')\,\pi(x)\,\pi(x')\,dx\,dx' - z^\top K^{-1} z$
Key properties:
- The BQ posterior mean is itself a weighted sum $\sum_i w_i f(x_i)$ with $w = K^{-1}z$ - BQ reduces to kernel ridge regression with kernel $k$
- With a Gaussian kernel and Gaussian $\pi$, the kernel mean is available in closed form; in an appropriate limit, BQ is closely related to Gauss-Hermite quadrature
- BQ provides principled uncertainty - the posterior variance tells us how uncertain we are about the integral given current samples
S.2 Connection to Gauss-Legendre
When $k(x, x') = \min(x, x')$ (the Brownian-motion kernel, corresponding to a first-order Sobolev RKHS) and $\pi$ is uniform, BQ recovers the trapezoidal rule.
When $k$ is a higher-order Sobolev kernel, BQ recovers the corresponding spline quadrature rule. In the limit in which the RKHS approaches a space of polynomials, BQ recovers Gaussian quadrature.
For AI: BQ is used for:
- Bayesian optimization loop: compute the expected improvement acquisition function via BQ when the integrand is also an ML model
- Uncertainty-aware neural ODEs: BQ estimates the uncertainty in the solution trajectory
- Model evidence computation: via BQ with GP prior on the likelihood
S.3 Active Learning for BQ
Rather than using fixed quadrature nodes, active BQ selects the next evaluation point to maximize the reduction in posterior variance:
$x_{n+1} = \arg\max_x\;\big(\mathrm{Var}_n[I] - \mathrm{Var}_n[I \mid \text{evaluating at } x]\big)$
This is an information-theoretic quadrature rule. For rough kernels (low-smoothness Matern), the selected points behave like adaptive bisection; for smoother kernels, it places more points where $f$ is uncertain - naturally handling functions with local features.
Appendix T: Stochastic Integration - SDEs and Neural SDEs
T.1 Ito Integral
The Ito integral is a stochastic integral with respect to Brownian motion $W_t$:
$\int_0^T g(t)\,dW_t = \lim_{n\to\infty} \sum_i g(t_i)\,\big(W_{t_{i+1}} - W_{t_i}\big)$
(left-endpoint rule - the Ito convention). The Euler-Maruyama scheme for the SDE $dX_t = f(X_t)\,dt + g(X_t)\,dW_t$ is:
$X_{t+h} = X_t + f(X_t)\,h + g(X_t)\,\sqrt{h}\,\xi, \qquad \xi\sim N(0,1)$
Euler-Maruyama has strong order $1/2$ (pathwise error $O(\sqrt{h})$) and weak order $1$ (error in expectations $O(h)$).
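A minimal Euler-Maruyama sketch for the Ornstein-Uhlenbeck SDE $dX_t = -\theta X_t\,dt + \sigma\,dW_t$ (parameters chosen for illustration):
import numpy as np
rng = np.random.default_rng(5)
theta, sigma, T, n = 1.0, 0.5, 5.0, 1000
h = T / n
X = np.empty(n + 1)
X[0] = 1.0
for k in range(n):
    X[k+1] = X[k] - theta*X[k]*h + sigma*np.sqrt(h)*rng.standard_normal()  # left endpoint (Ito)
print(f"X_T = {X[-1]:.4f}  (stationary std ~ {sigma/np.sqrt(2*theta):.3f})")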
T.2 Neural SDEs and Diffusion Models
Neural SDE: Replace the drift and diffusion with neural networks $f_\theta$ and $g_\theta$:
$dX_t = f_\theta(X_t, t)\,dt + g_\theta(X_t, t)\,dW_t$
The ELBO for a neural SDE involves a time integral of a drift-matching (KL-type) penalty along the trajectory; this integral over $t$ is computed via numerical integration (e.g., Euler-Maruyama or higher-order SDE solvers). This is the foundation of score-based diffusion models (Song et al., 2021).
DDPM (Ho et al., 2020) discretizes the diffusion trajectory at $T$ timesteps and computes the ELBO as a sum of Gaussian KL terms - discrete numerical integration over the diffusion trajectory.
Continuous diffusion (Song et al., 2021) uses the continuous-time SDE and numerical ODE solvers (Dormand-Prince, DDIM) for sampling - the sampling procedure is adaptive quadrature of the reverse SDE.
Appendix U: Cross-Reference Table
| Topic | Connection | Reference |
|---|---|---|
| Legendre polynomials | GL nodes; Legendre-Gauss rule | Section10-04 (orthogonal polynomials) |
| Chebyshev polynomials | Clenshaw-Curtis; CC weights via DCT | Section10-04 (Chebyshev approximation) |
| Condition numbers | Vandermonde in NC; GL weights always positive | Section10-01, Section10-02 |
| Finite differences | Richardson extrapolation; gradient checking | Section10-03 |
| Fourier analysis | DFT as trapezoidal rule; CC via DCT | Section10-04 |
| Monte Carlo | Stochastic gradient, ELBO, policy gradient | Section07 (Probability) |
| Linear algebra | Golub-Welsch tridiagonal eigenvalue | Section10-02 |
| Optimization | MCMC as sampling from Bayesian posterior | Section08, Section10-03 |
| SDEs/Neural SDEs | Stochastic integration; Ito calculus | Advanced calculus |
| Gaussian processes | GP marginal likelihood; Bayesian quadrature | Section07, Section10-04 |
Appendix V: Worked Examples - Comparing Methods
V.1 An Integral with Endpoint Singularities
Consider an integrand with integrable singularities at both endpoints: $f(x) \to \infty$ at the endpoints, but $\int f\,dx < \infty$.
Methods and errors:
| Method | Evaluations | Behavior |
|---|---|---|
| Composite trapezoidal | 1000 | Slow convergence due to the singularity |
| Composite Simpson's | 1000 | Same - the singularity destroys the higher-order rate! |
| Gauss-Legendre (20 nodes) | 20 | Singularity still causes slow convergence |
| Gauss-Jacobi | 10 | Exact - the Jacobi weight absorbs the singularity! |
| Tanh-sinh | ~50 | Handles endpoint singularities automatically |
Lesson: Match the quadrature rule to the weight function. Gauss-Jacobi absorbs the singularity, tanh-sinh handles it automatically.
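A concrete illustration of this lesson, using a hedged stand-in integral (not necessarily the one tabulated above): $\int_{-1}^{1} e^x (1 - x^2)^{-1/2}\,dx = \pi I_0(1)$, where the Jacobi weight with $\alpha = \beta = -1/2$ absorbs the endpoint singularities exactly.

```python
import numpy as np
from scipy.special import roots_jacobi, iv

# Gauss-Jacobi with alpha = beta = -1/2 builds the 1/sqrt(1-x^2)
# endpoint singularity into the weight, so only the smooth factor e^x
# is ever evaluated.
x, w = roots_jacobi(10, -0.5, -0.5)
approx = np.dot(w, np.exp(x))
exact = np.pi * iv(0, 1.0)        # int_{-1}^{1} e^x / sqrt(1-x^2) dx
print(abs(approx - exact))        # near machine precision with 10 nodes
```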
V.2 High-Dimensional Monte Carlo
For a smooth integrand on the unit cube $[0, 1]^d$:
| Method | Evaluations | Error behavior | Time |
|---|---|---|---|
| MC (random) | $N$ | $O(N^{-1/2})$, dimension-independent | Fast |
| QMC (Sobol) | $N$ | $O(N^{-1} (\log N)^d)$ for smooth integrands | Fast |
| Tensor product GL-3 | $3^d$ | Spectral in each coordinate | Fast |
| MC (random) | $N' \gg N$ | $O(N'^{-1/2})$ | Moderate |
Lesson: For a smooth function in moderate dimension, tensor-product GL still wins with fewer evaluations. But a tensor-product GL-$n$ rule needs $n^d$ evaluations, so for large $d$ it is infeasible.
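A sketch of the MC-vs-QMC comparison using scipy.stats.qmc; the product integrand (exact value 1) is an illustrative assumption, not the function tabulated above.

```python
import numpy as np
from scipy.stats import qmc

d, N = 10, 2**12                                   # N a power of 2 for Sobol
# Each factor integrates to 1 over [0,1], so the exact integral is 1.
f = lambda X: np.prod(1 + (X - 0.5) / np.arange(2, d + 2), axis=1)

mc_est = f(np.random.default_rng(0).random((N, d))).mean()
qmc_est = f(qmc.Sobol(d, scramble=True, seed=0).random(N)).mean()
print(abs(mc_est - 1), abs(qmc_est - 1))           # QMC typically 1-2 digits better
```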
V.3 A Gaussian Expectation That Diverges
Not every Gaussian expectation exists. If $f$ grows as fast as $e^{x^2/2}$ or faster - for instance $f(x) = e^{x^2}$ - then
$$\mathbb{E}_{x \sim \mathcal{N}(0,1)}[f(x)] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(x)\,e^{-x^2/2}\,dx$$
diverges: with $f(x) = e^{x^2}$ the integrand is $e^{x^2/2}/\sqrt{2\pi}$, which is not integrable. Gauss-Hermite quadrature will still return a finite number - it cannot detect the divergence.
Warning for AI: Computing expectations over Gaussian distributions requires checking that the function is integrable. In VAE training, large activations can cause the reconstruction term to be numerically infinite - a symptom of a poor variational posterior $q(z \mid x)$.
V.4 ELBO Term Computation
The KL divergence between a diagonal Gaussian posterior and a standard normal prior is
$$D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, 1)\right) = \tfrac{1}{2}\left(\mu^2 + \sigma^2 - \log \sigma^2 - 1\right).$$
This is analytical - no quadrature needed. The reconstruction term $\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)]$ requires Monte Carlo (or, for a scalar latent, GH quadrature):
import numpy as np

def elbo_reconstruction(x_data, mu_z, sigma_z, decoder, n_gh=20):
    """Compute E_q[log p(x|z)] via Gauss-Hermite quadrature, for scalar z.
    q(z|x) = N(mu_z, sigma_z^2); decoder.log_prob(x, z) returns log p(x|z)."""
    t, w = np.polynomial.hermite.hermgauss(n_gh)
    z_samples = np.sqrt(2) * sigma_z * t + mu_z   # GH nodes mapped to q
    log_likelihood = np.array([decoder.log_prob(x_data, z) for z in z_samples])
    return np.dot(w, log_likelihood) / np.sqrt(np.pi)   # GH normalization
With $n_{\mathrm{gh}} = 20$ nodes this achieves near-machine precision for smooth likelihoods and a scalar latent - far superior to 1-sample MC.
Appendix W: The Connection Between Quadrature and Finite Element Methods
W.1 Galerkin FEM Requires Integration
The finite element method reduces a PDE (e.g., Poisson's equation) to a linear system $A\mathbf{u} = \mathbf{b}$ where
$$A_{ij} = \int_\Omega \nabla\phi_i \cdot \nabla\phi_j \,dx, \qquad b_i = \int_\Omega f\,\phi_i\,dx.$$
Each entry requires numerical integration over the domain. With $P_1$ (piecewise linear) elements, the stiffness integrands are piecewise constant, so the midpoint rule computes them exactly. With $P_3$ (cubic) elements, 4-point GL is needed for exactness.
Rule of thumb: For $P_k$ elements in $d$ dimensions, use a GL rule on each element that is exact for polynomials of degree $2k$ (products of two degree-$k$ basis functions).
W.2 Spectral Element Methods
Spectral element methods combine the geometric flexibility of FEM with the spectral accuracy of GL/CC quadrature:
- Divide the domain into elements
- On each element, use a high-degree polynomial basis of order $p$
- Integrate with the matching Gauss-Lobatto-Legendre (GLL) quadrature rule
GLL rule: a Gauss-type rule constrained to include both endpoints. The $n$-point GLL nodes are $\pm 1$ plus the interior zeros of $P'_{n-1}$ (the derivative of the degree-$(n-1)$ Legendre polynomial); the endpoint constraint costs two degrees of exactness ($2n - 3$ instead of $2n - 1$).
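A sketch that builds the GLL rule directly from this characterization, using the standard closed-form weights $w_i = 2 / \big(n(n-1)\,P_{n-1}(x_i)^2\big)$:

```python
import numpy as np
from numpy.polynomial import legendre as leg

def gauss_lobatto_legendre(n):
    """n-point GLL rule on [-1, 1]: endpoints +-1 plus the zeros of P'_{n-1};
    weights w_i = 2 / (n (n-1) P_{n-1}(x_i)^2)."""
    c = np.zeros(n)
    c[-1] = 1.0                                 # coefficient vector of P_{n-1}
    interior = leg.legroots(leg.legder(c))      # zeros of P'_{n-1}
    x = np.concatenate(([-1.0], interior, [1.0]))
    w = 2.0 / (n * (n - 1) * leg.legval(x, c)**2)
    return x, w

x, w = gauss_lobatto_legendre(5)
print(np.dot(w, x**6), 2 / 7)   # exact through degree 2n-3 = 7
```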
For AI: Spectral element methods are used in:
- Atmospheric and ocean modeling (the standard in climate science)
- Spectral graph convolution implementations (spectral approximation of graph filters)
- Neural operators (FNO, DeepONet) for learning PDE solution operators
Appendix X: Self-Test Questions
1. Why is the $n$-point Gauss-Legendre rule exact for polynomials up to degree $2n - 1$ but not $2n$?
2. Explain why the composite trapezoidal rule achieves spectral accuracy for smooth periodic functions, even though its error is formally $O(h^2)$ for non-periodic functions.
3. The midpoint rule has a smaller error constant than the trapezoidal rule at the same $O(h^2)$ rate. Why? (Hint: think about whether $f$ is convex or concave between nodes.)
4. Why does Monte Carlo integration have dimension-independent $O(N^{-1/2})$ convergence while deterministic quadrature rates degrade exponentially with dimension?
5. The reparameterization trick reduces gradient variance in VAE training. Why is the score function estimator (REINFORCE) higher variance? (Hint: compare what each estimator directly estimates.)
6. In the Euler-Maclaurin formula (Appendix AA), if $f$ is periodic with $f^{(k)}(a) = f^{(k)}(b)$ for all $k$, what does this imply for the trapezoidal error?
Answers: See theory.ipynb for numerical verification of each.
Appendix Y: Extended ML Applications
Y.1 Normalizing Flows and the Change of Variables
A normalizing flow defines a sequence of invertible transformations:
$$z_0 \sim p_0(z_0), \qquad z_k = f_k(z_{k-1}), \quad k = 1, \dots, K, \qquad x = z_K.$$
The log-likelihood is computed via the change-of-variables formula - a sequence of integration-by-substitution steps:
$$\log p(x) = \log p_0(z_0) - \sum_{k=1}^{K} \log \left| \det J_{f_k} \right|, \qquad J_{f_k} = \frac{\partial f_k}{\partial z_{k-1}}.$$
Each Jacobian determinant represents the local volume change under the transformation $f_k$. For a $d$-dimensional transformation:
$$\log |\det J_{f_k}| = \sum_{i=1}^{d} \log \left| [J_{f_k}]_{ii} \right|$$
(only valid when $J_{f_k}$ is triangular - the key design choice in autoregressive flows).
Integration interpretation: $p(x)$ is the pushforward of $p_0$ through $f_K \circ \cdots \circ f_1$. Computing $\log p(x)$ is exact - no numerical quadrature needed! The integration is performed analytically by the change-of-variables formula.
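A minimal numerical check of the change-of-variables formula for a toy one-layer affine "flow" $x = az + b$ with $z \sim \mathcal{N}(0, 1)$ (the parameters are illustrative): the resulting density must integrate to 1.

```python
import numpy as np

a, b = 2.0, 1.0                                   # toy affine flow x = a*z + b
log_p0 = lambda z: -0.5 * z**2 - 0.5 * np.log(2 * np.pi)   # N(0,1) base density
# Change of variables: log p(x) = log p0(f^{-1}(x)) - log |det J_f|
log_px = lambda x: log_p0((x - b) / a) - np.log(abs(a))

x = np.linspace(-15.0, 17.0, 20001)               # covers essentially all mass
dx = x[1] - x[0]
print(np.exp(log_px(x)).sum() * dx)               # ~1.0: a valid density
```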
Y.2 Diffusion Models as Numerical ODE Solvers
The DDIM sampler (Song et al., 2020) solves the reverse diffusion (probability-flow) ODE:
$$\frac{dx}{dt} = f(x, t) - \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x)$$
using a learned score function $s_\theta(x, t) \approx \nabla_x \log p_t(x)$.
Multi-step sampling is 2nd-order Adams-Bashforth: the multi-step sampler reuses the previous score evaluation to achieve 2nd-order accuracy per step, schematically
$$x_{n+1} = x_n + h \left( \tfrac{3}{2}\, s_\theta(x_n, t_n) - \tfrac{1}{2}\, s_\theta(x_{n-1}, t_{n-1}) \right)$$
(up to the time-dependent diffusion coefficients), where $s_\theta$ is the score. This is exactly the 2-step Adams-Bashforth ODE solver - applied to the reverse dynamics.
Implication: Faster diffusion sampling (fewer NFEs) is equivalent to using higher-order quadrature for the reverse SDE. Methods like DPM-Solver++ (Lu et al., 2022) use exponential integrators and 3rd-order Adams-type solvers to reduce from 1000 steps to 10-20 steps.
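A generic 2-step Adams-Bashforth sketch on a toy ODE ($y' = -y$, an illustrative test problem; the diffusion-specific coefficients are omitted):

```python
import numpy as np

def adams_bashforth2(f, y0, t0, t1, n):
    """y_{k+1} = y_k + h (3/2 f_k - 1/2 f_{k-1}), bootstrapped with one Euler step."""
    h = (t1 - t0) / n
    y = np.empty(n + 1)
    y[0] = y0
    f_prev = f(t0, y[0])
    y[1] = y[0] + h * f_prev                       # Euler step to start the 2-step method
    for k in range(1, n):
        f_curr = f(t0 + k * h, y[k])
        y[k + 1] = y[k] + h * (1.5 * f_curr - 0.5 * f_prev)
        f_prev = f_curr
    return y

y = adams_bashforth2(lambda t, y: -y, 1.0, 0.0, 1.0, 100)
print(abs(y[-1] - np.exp(-1)))                     # O(h^2) global error
```

Note that each step reuses the previously computed right-hand side, so the cost per step is one function evaluation - exactly why multi-step samplers reduce NFEs in diffusion sampling.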
Y.3 GP Marginal Likelihood via Lanczos Quadrature
The GP log marginal likelihood requires $\log\det(K + \sigma^2 I)$, where $K$ is the $n \times n$ kernel matrix.
Stochastic Lanczos quadrature (SLQ): Estimate $\log\det A = \mathrm{tr}(\log A)$ for $A = K + \sigma^2 I$:
- Sample $q$ random probe vectors $z_1, \dots, z_q$ (e.g., Rademacher, so $\|z_j\|^2 = n$)
- For each $z_j$, run $m$ steps of Lanczos iteration on $A$ started from $z_j$ to build a tridiagonal matrix $T_j \in \mathbb{R}^{m \times m}$
- Approximate $z_j^\top (\log A)\, z_j \approx n \sum_i (U_j)_{1i}^2 \log \theta_{ji}$ from the eigendecomposition $T_j = U_j \Theta_j U_j^\top$ (this is Gauss quadrature with respect to the spectral measure of $A$)
- Average: $\log\det A \approx \frac{1}{q} \sum_{j=1}^{q} z_j^\top (\log A)\, z_j$
Complexity: $O(qm)$ matrix-vector products vs $O(n^3)$ for Cholesky. For large $n$, SLQ is the only tractable approach.
GPyTorch (Gardner et al., 2018) implements SLQ for GP training with GPU acceleration - enabling GP regression at the scale of deep learning datasets.
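A self-contained SLQ sketch on a small dense SPD test matrix. The probe count, Lanczos depth, and test matrix are illustrative, and production implementations (e.g., GPyTorch) add reorthogonalization, preconditioning, and batching:

```python
import numpy as np

def lanczos_tridiag(A, v, m):
    """m-step Lanczos for symmetric A from start vector v: returns (alpha, beta)."""
    alpha, beta = np.zeros(m), np.zeros(m - 1)
    q_prev, q, b = np.zeros(len(v)), v / np.linalg.norm(v), 0.0
    for j in range(m):
        w = A @ q - b * q_prev
        alpha[j] = q @ w
        w -= alpha[j] * q
        if j < m - 1:
            b = np.linalg.norm(w)
            if b < 1e-12:                          # rare breakdown: stop early
                return alpha[:j + 1], beta[:j]
            beta[j] = b
            q_prev, q = q, w / b
    return alpha, beta

def slq_logdet(A, n_probes=20, m=25, seed=0):
    """log det A = tr(log A), estimated by Hutchinson probes + Lanczos quadrature."""
    rng, n, total = np.random.default_rng(seed), A.shape[0], 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)        # Rademacher probe, ||z||^2 = n
        a, b = lanczos_tridiag(A, z, m)
        T = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)
        theta, U = np.linalg.eigh(T)               # Ritz values = quadrature nodes
        total += n * np.sum(U[0, :]**2 * np.log(theta))
    return total / n_probes

B = np.random.default_rng(1).normal(size=(200, 200))
A = B @ B.T + 200 * np.eye(200)                    # well-conditioned SPD test matrix
print(slq_logdet(A), np.linalg.slogdet(A)[1])      # the two should agree closely
```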
Appendix Z: Final Reference - Decision Tree for Integration in ML
ML INTEGRATION DECISION TREE
========================================================================
What are you computing?
|
+-- E_q[f(z)] with q Gaussian?
|   +-- Low-dimensional z (d <= 3), need high accuracy?
|   |   +-- -> Gauss-Hermite quadrature (n = 20 nodes)
|   +-- High-dimensional z, gradient needed?
|       +-- -> Reparameterization + single MC sample
|
+-- log p(x) = -E(x) - log Z (energy-based model)?
|   +-- Density has known structure (flow, Gaussian)?
|   |   +-- -> Analytical formula (change of variables)
|   +-- Intractable Z?
|       +-- Need an unbiased estimate of Z?
|       |   +-- -> AIS (Annealed Importance Sampling)
|       +-- Need to train the model?
|           +-- -> Score matching / contrastive divergence
|
+-- Policy gradient E_pi[G]?
|   +-- Continuous action, differentiable reward?
|   |   +-- -> Reparameterization (pathwise gradient)
|   +-- Discrete action?
|       +-- -> REINFORCE with baseline (score function)
|
+-- GP marginal likelihood with log det(K + sigma^2 I)?
|   +-- n <= 5000?
|   |   +-- -> Cholesky factorization (exact, O(n^3))
|   +-- n > 5000?
|       +-- -> Stochastic Lanczos Quadrature (O(nqm))
|
+-- ODE/SDE simulation?
    +-- Deterministic ODE?
    |   +-- -> Adaptive Dormand-Prince (RK45)
    +-- Stochastic SDE?
        +-- -> Euler-Maruyama or Milstein scheme
========================================================================
Appendix AA: Error Analysis for the Composite Rules - Detailed
AA.1 Euler-Maclaurin for Composite Trapezoidal
The complete Euler-Maclaurin formula for the composite trapezoidal rule $T(h)$ with $h = (b - a)/m$:
$$T(h) - \int_a^b f(x)\,dx = \sum_{k=1}^{p} \frac{B_{2k}}{(2k)!}\, h^{2k} \left[ f^{(2k-1)}(b) - f^{(2k-1)}(a) \right] + O(h^{2p+2}).$$
Bernoulli numbers: $B_2 = \tfrac{1}{6}$, $B_4 = -\tfrac{1}{30}$, $B_6 = \tfrac{1}{42}$, $B_8 = -\tfrac{1}{30}$.
Leading error term (the standard $O(h^2)$ result): $\frac{h^2}{12}\left[f'(b) - f'(a)\right] = \frac{(b-a)\,h^2}{12} f''(\xi)$ for some mean value $\xi \in (a, b)$.
For Richardson extrapolation: taking $T(h)$ and $T(h/2)$, the combination
$$S = \frac{4\,T(h/2) - T(h)}{3}$$
cancels the $h^2$ term, leaving an $O(h^4)$ error. So $S$ is exactly the composite Simpson's rule!
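A quick numerical confirmation of the identity $S_{2m} = \big(4\,T_{2m} - T_m\big)/3$, with $e^x$ as an arbitrary test integrand:

```python
import numpy as np

f, a, b, m = np.exp, 0.0, 1.0, 8

def trap(k):                       # composite trapezoid, k subintervals
    x = np.linspace(a, b, k + 1); y = f(x)
    return (b - a) / k * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)

def simpson(k):                    # composite Simpson, k subintervals (k even)
    x = np.linspace(a, b, k + 1); y = f(x)
    return (b - a) / k / 3 * (y[0] + 4 * y[1:-1:2].sum() + 2 * y[2:-2:2].sum() + y[-1])

print(abs((4 * trap(2 * m) - trap(m)) / 3 - simpson(2 * m)))   # ~1e-16: same rule
```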
AA.2 Superconvergence of Periodic Functions
For $f$ smooth and periodic on $[a, b]$ with period $b - a$: $f^{(k)}(a) = f^{(k)}(b)$ for all $k$, so every Euler-Maclaurin boundary term vanishes.
The trapezoidal rule is therefore exact to all orders in $h$ for smooth periodic functions. In practice, the error is governed by the truncation of the Fourier series - exponentially small for analytic periodic functions.
Example: a smooth periodic integrand such as $e^{\sin x}$ on $[0, 2\pi]$ (a stand-in example, verified in the sketch below):
- Composite trapezoidal with $m = 8$: error already around $10^{-6}$
- Composite trapezoidal with $m = 16$: error at machine precision
Compare to a smooth but non-periodic integrand such as $e^x$ on $[0, 1]$:
- Composite trapezoidal error merely quarters each time $m$ doubles ($O(h^2)$)
This dramatic difference is the key insight behind the DFT's exact representation of bandlimited periodic signals.
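The sketch below verifies the contrast, using $e^{\sin x}$ (periodic, exact value $2\pi I_0(1)$) and $e^x$ (non-periodic) as the stand-in integrands:

```python
import numpy as np
from scipy.special import iv

def trap(f, a, b, m):
    x = np.linspace(a, b, m + 1); y = f(x)
    return (b - a) / m * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)

exact_periodic = 2 * np.pi * iv(0, 1.0)        # int_0^{2pi} e^{sin x} dx
for m in (8, 16, 32):
    e_per = abs(trap(lambda x: np.exp(np.sin(x)), 0, 2 * np.pi, m) - exact_periodic)
    e_non = abs(trap(np.exp, 0, 1, m) - (np.e - 1))
    print(f"m={m:3d}  periodic: {e_per:.2e}   non-periodic: {e_non:.2e}")
# Periodic errors collapse geometrically (machine precision by m = 16);
# non-periodic errors merely quarter as m doubles: O(h^2).
```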
Appendix BB: Python Code - Complete Implementations
"""
Complete reference implementations for Section10-05 Numerical Integration.
All functions tested against scipy.integrate.quad.
"""
import numpy as np
from scipy import integrate
def composite_trap(f, a, b, m):
"""Composite trapezoidal: O(h^2) for smooth f."""
h = (b-a)/m
x = np.linspace(a, b, m+1)
y = f(x)
return h * (y[0]/2 + y[1:-1].sum() + y[-1]/2)
def composite_simpson(f, a, b, m):
"""Composite Simpson's: O(h^4). m must be even."""
assert m % 2 == 0
h = (b-a)/m
x = np.linspace(a, b, m+1)
y = f(x)
return h/3 * (y[0] + 4*y[1:-1:2].sum() + 2*y[2:-2:2].sum() + y[-1])
def gauss_legendre(n):
"""n-point GL nodes and weights via Golub-Welsch."""
k = np.arange(1, n)
J = np.diag(k / np.sqrt(4*k**2 - 1), -1)
J += J.T
evals, evecs = np.linalg.eigh(J)
idx = np.argsort(evals)
return evals[idx], 2 * evecs[0, idx]**2
def gl_integrate(f, a, b, n=10):
"""GL quadrature on [a,b] with n nodes."""
x, w = gauss_legendre(n)
x_mapped = (a+b)/2 + (b-a)/2 * x
return (b-a)/2 * np.dot(w, f(x_mapped))
def romberg_integrate(f, a, b, max_k=8, tol=1e-12):
"""Romberg's method via repeated Richardson extrapolation."""
R = np.zeros((max_k, max_k))
h = b - a
R[0, 0] = h/2 * (f(a) + f(b))
for k in range(1, max_k):
h /= 2
n_new = 2**(k-1)
x_new = a + h * (2*np.arange(1, n_new+1) - 1)
R[k, 0] = R[k-1, 0]/2 + h * f(x_new).sum()
for j in range(1, k+1):
R[k, j] = (4**j * R[k, j-1] - R[k-1, j-1]) / (4**j - 1)
if abs(R[k, k] - R[k-1, k-1]) < tol:
return R[k, k], k
return R[max_k-1, max_k-1], max_k
def monte_carlo(f, domain_sampler, volume, n, seed=42):
    """Basic Monte Carlo: returns (estimate, standard error of the estimate)."""
    np.random.seed(seed)
    x = domain_sampler(n)
    fx = f(x)  # evaluate once; reuse for both the mean and the spread
    return volume * fx.mean(), volume * fx.std() / np.sqrt(n)
def gauss_hermite_expect(f, mu, sigma, n=20):
"""E_{x~N(mu,sigma^2)}[f(x)] via GH quadrature."""
t, w = np.polynomial.hermite.hermgauss(n)
return np.dot(w, f(np.sqrt(2)*sigma*t + mu)) / np.sqrt(np.pi)
# Verification
if __name__ == '__main__':
    f = np.exp
    exact = np.e - 1
    print(f"Trapezoidal (m=100): error = {abs(composite_trap(f,0,1,100) - exact):.2e}")
    print(f"Simpson's (m=100): error = {abs(composite_simpson(f,0,1,100) - exact):.2e}")
    print(f"GL-10: error = {abs(gl_integrate(f,0,1,10) - exact):.2e}")
    result, levels = romberg_integrate(f, 0, 1)
    print(f"Romberg: error = {abs(result - exact):.2e} (levels={levels})")
    # E[x^2] under N(0,1) equals exactly 1 - a clean GH sanity check
    gh_val = gauss_hermite_expect(lambda x: x**2, mu=0, sigma=1, n=20)
    print(f"GH E[x^2] under N(0,1): {gh_val:.10f} (expected 1.0)")