
Numerical Integration



"Numerical integration is the art of turning hard integrals into easy sums."

  • Attributed to Philip Davis and Philip Rabinowitz

Overview

Numerical integration (quadrature) computes \int_a^b f(x) w(x) dx when no antiderivative is available, the function is known only at discrete points, or the exact integral is too expensive. The core idea is simple: approximate f by an interpolant and integrate the interpolant exactly.

The quality of this approximation depends entirely on the choice of interpolation nodes. Equispaced nodes give the Newton-Cotes rules (trapezoidal, Simpson's), which are easy to derive but become unstable at high order. Optimal node placement - the zeros of orthogonal polynomials - gives Gaussian quadrature, which achieves "twice the accuracy for free" by allowing the nodes to vary. Adaptive quadrature refines the mesh where the integrand is rough, achieving near-machine accuracy automatically. Monte Carlo methods trade the deterministic O(h^p) error for a stochastic O(1/\sqrt{n}) rate that is dimension-independent - crucial for high-dimensional integrals.

For AI, numerical integration appears in: computing normalizing constants for probability distributions, training variational autoencoders (ELBO involves an expectation), Gaussian process marginal likelihoods, normalizing flows, and reinforcement learning value estimation. Understanding which quadrature method to use - and why - is essential for building numerically stable, efficient ML systems.

Prerequisites

  • Calculus: definite integrals, Riemann sums, Taylor's theorem (Section01-Mathematical-Foundations)
  • Interpolation: Lagrange polynomials, Chebyshev nodes, orthogonal polynomials (Section10-04)
  • Linear algebra: solving linear systems (Section02-Linear-Algebra-Basics, Section10-02)
  • Floating-point arithmetic: rounding error accumulation (Section10-01)
  • Probability: random variables, expected values, variance (Section07-Probability)

Companion Notebooks

| Notebook | Description |
|---|---|
| theory.ipynb | Interactive: Newton-Cotes, Gaussian quadrature derivation, adaptive quadrature, Monte Carlo, multidimensional integration |
| exercises.ipynb | 10 graded exercises from trapezoidal rule to normalizing flow partition functions |

Learning Objectives

After completing this section, you will:

  • Derive Newton-Cotes rules by integrating Lagrange interpolants
  • Understand the error analysis of composite rules via the Euler-Maclaurin formula
  • Derive Gaussian quadrature by choosing nodes to maximize the degree of exactness
  • Implement adaptive Gauss-Kronrod quadrature with error control
  • Apply Richardson extrapolation / Romberg integration to accelerate convergence
  • Use Monte Carlo integration for high-dimensional problems and understand variance reduction
  • Connect quadrature to normalizing constants, ELBO, and variational inference in ML
  • Recognize when each method is appropriate based on dimensionality, smoothness, and accuracy requirements



1. Intuition and the Basic Setup

1.1 The Quadrature Problem

A quadrature rule approximates I(f) = \int_a^b f(x) w(x) dx by a weighted sum:

Q_n(f) = \sum_{i=1}^{n} w_i f(x_i)

where x_i \in [a, b] are nodes (quadrature points) and w_i > 0 are weights.

The error is E_n(f) = I(f) - Q_n(f).

What determines quality?

  1. The number of nodes n (more nodes = higher accuracy in general)
  2. The placement of nodes x_i (free or equispaced?)
  3. The smoothness of f (smooth f allows high-order rules)
  4. The dimensionality d (high-dimensional integrals need Monte Carlo)

For AI: Many ML objectives involve integrals:

  • \log p(x) = \log \int p(x|z) p(z) dz - the variational ELBO lower-bounds this
  • Policy gradient \nabla_\theta J = \mathbb{E}_\pi[\nabla_\theta \log \pi_\theta(a|s) R] - estimated by Monte Carlo
  • Gaussian process likelihood \log p(y) = -\frac{1}{2} y^\top K^{-1} y - \frac{1}{2}\log|K| - involves a log-determinant rather than an integral, but it is estimated with stochastic trace estimation and Lanczos quadrature (Section 8.3)

1.2 Why Integration is Hard

  • Indefinite integrals of elementary functions: Only a small class have closed forms. \int e^{-x^2} dx and \int \sin(x)/x \, dx have no elementary antiderivative.
  • High-dimensional integrals: The curse of dimensionality makes grid-based methods infeasible for d > 5-10.
  • Singularities: \int_0^1 x^{-1/2} dx = 2 - a finite integral even though f(0) = \infty.
  • Oscillatory integrands: \int_0^{1000} \sin(x^2) dx - many sign changes, so standard rules need many nodes.
  • Expectations: \mathbb{E}[f(X)] = \int f(x) p(x) dx - typically estimated by sampling.

1.3 Historical Timeline

NUMERICAL INTEGRATION: HISTORICAL TIMELINE
========================================================================

  1700  - Newton (1711): Newton-Cotes formulas from polynomial interpolation
  1720  - Cotes: systematic derivation of Newton-Cotes weights
  1800  - Gauss (1814): Gaussian quadrature with 2n-1 degree of exactness
  1852  - Chebyshev: Gauss-Chebyshev quadrature with explicit node/weight formulas
  1901  - Richardson: Richardson extrapolation to accelerate convergence
  1949  - Metropolis-Ulam: Monte Carlo method for nuclear physics calculations
  1953  - Metropolis: Markov chain Monte Carlo (Metropolis algorithm)
  1955  - Romberg: Romberg integration (repeated Richardson extrapolation)
  1960  - Clenshaw-Curtis: Chebyshev-based quadrature via DCT
  1965  - Sobol: quasi-Monte Carlo sequences for multidimensional integration
  1970  - Patterson, Kronrod: Gauss-Kronrod embedded rules for adaptive quadrature
  1990  - Gelfand-Smith: MCMC for Bayesian inference becomes mainstream
  2015  - Deep learning era: stochastic gradient estimation (REINFORCE, VAE ELBO)
  2020  - Normalizing flows: exact likelihood via change-of-variables theorem
  2022  - Score-based diffusion: sampling as integration of stochastic DEs

========================================================================

2. Newton-Cotes Rules

2.1 Midpoint, Trapezoidal, and Simpson's Rules

Derivation principle: Integrate the n-point Lagrange interpolant exactly.

Midpoint rule (1 point):

M(f) = (b-a) f\!\left(\frac{a+b}{2}\right), \quad E_M = \frac{(b-a)^3}{24} f''(\xi)

Integrates polynomials of degree \leq 1 exactly (a 1-point rule with degree of exactness 1).

Trapezoidal rule (2 points):

T(f) = \frac{b-a}{2}[f(a) + f(b)], \quad E_T = -\frac{(b-a)^3}{12} f''(\xi)

Same order as midpoint but with twice the error constant (the midpoint rule is twice as accurate as the trapezoidal rule for convex/concave functions).

Simpson's rule (3 points):

S(f) = \frac{b-a}{6}\left[f(a) + 4f\!\left(\frac{a+b}{2}\right) + f(b)\right], \quad E_S = -\frac{(b-a)^5}{2880} f^{(4)}(\xi)

Integrates polynomials of degree \leq 3 exactly (degree of exactness 3, not 2 - symmetric rules with an odd number of nodes gain one extra degree).

NEWTON-COTES RULES: GEOMETRIC INTERPRETATION
========================================================================

  Midpoint:        Trapezoidal:        Simpson's:
  ---------        ------------        ---------
    --             /\                  +--+
   /  \           /  \               /    \
  / f  \         / f  \             /  f   \
 /  x  \        / x x \           / x x x \
 --------       ----------         ----------
   (b-a)        (b-a)/2           (b-a)/6, 4/6, 1/6

  x = quadrature node
  Approximation: rectangle, trapezoid, parabola through 3 points

========================================================================

General Newton-Cotes with n+1 equispaced nodes x_i = a + ih, h = (b-a)/n:

Q_n(f) = h \sum_{i=0}^{n} w_i^{\text{NC}} f(x_i)

The weights w_i^{\text{NC}} are determined by integrating the Lagrange basis:

w_i^{\text{NC}} = \frac{1}{h} \int_a^b \ell_i(x) \, dx = \int_0^n \prod_{j \neq i} \frac{s - j}{i - j} \, ds

Newton-Cotes weights for small n:

| n | Rule | Weights | Degree of exactness |
|---|---|---|---|
| 1 | Trapezoidal | \frac{h}{2}[1, 1] | 1 |
| 2 | Simpson's | \frac{h}{3}[1, 4, 1] | 3 |
| 3 | Simpson's 3/8 | \frac{3h}{8}[1, 3, 3, 1] | 3 |
| 4 | Boole's | \frac{2h}{45}[7, 32, 12, 32, 7] | 5 |

Key observation: Even-n rules (Simpson's with n=2, Boole's with n=4) automatically achieve one higher degree of exactness than expected - a consequence of symmetry.
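The weights in the table can be reproduced directly from the Lagrange-basis integral above. A minimal sketch (the helper name newton_cotes_weights is ours; SciPy ships a comparable utility as scipy.integrate.newton_cotes):

import numpy as np
from numpy.polynomial import polynomial as P

def newton_cotes_weights(n):
    """Weights w_i (in units of h) of the closed (n+1)-point Newton-Cotes rule."""
    w = np.zeros(n + 1)
    for i in range(n + 1):
        num, denom = np.array([1.0]), 1.0
        for j in range(n + 1):
            if j != i:
                num = P.polymul(num, [-j, 1.0])        # factor (s - j) of the Lagrange basis
                denom *= (i - j)
        antideriv = P.polyint(num / denom)             # antiderivative of ell_i(s)
        w[i] = P.polyval(n, antideriv) - P.polyval(0, antideriv)   # integral over [0, n]
    return w

print(newton_cotes_weights(2))   # [0.333..., 1.333..., 0.333...] -> Simpson's h/3 * [1, 4, 1]
print(newton_cotes_weights(4))   # 2/45 * [7, 32, 12, 32, 7]      -> Boole's rule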

2.2 Composite Rules and Error Analysis

Composite trapezoidal rule: Divide [a,b] into m subintervals of width h = (b-a)/m:

T_m(f) = h\left[\frac{f(a)}{2} + f(a+h) + f(a+2h) + \cdots + f(b-h) + \frac{f(b)}{2}\right]

E_{T_m} = -\frac{(b-a)h^2}{12} f''(\xi) = O(h^2)

Composite Simpson's rule: Requires even m:

S_m(f) = \frac{h}{3}[f(a) + 4f(a+h) + 2f(a+2h) + 4f(a+3h) + \cdots + f(b)]

E_{S_m} = -\frac{(b-a)h^4}{180} f^{(4)}(\xi) = O(h^4)

Error comparison:

| Rule | Function evaluations | Error | For accuracy 10^{-8} |
|---|---|---|---|
| Composite trapezoidal | m+1 | O(h^2) | h \approx 10^{-4}, m \approx 10^4 |
| Composite Simpson's | m+1 (m even) | O(h^4) | h \approx 10^{-2}, m \approx 100 |
| Composite Boole's | m+1 (m divisible by 4) | O(h^6) | - |
| Gauss-Legendre, n points | n | O(h^{2n}) | n \approx 5 |
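A minimal sketch verifying these rates on \int_0^1 e^x dx = e - 1 (the function names are ours):

import numpy as np

def composite_trapezoid(f, a, b, m):
    x = np.linspace(a, b, m + 1)
    h = (b - a) / m
    return h * (0.5 * f(x[0]) + f(x[1:-1]).sum() + 0.5 * f(x[-1]))

def composite_simpson(f, a, b, m):                      # m must be even
    x = np.linspace(a, b, m + 1)
    h = (b - a) / m
    return h / 3 * (f(x[0]) + 4 * f(x[1:-1:2]).sum() + 2 * f(x[2:-1:2]).sum() + f(x[-1]))

exact = np.e - 1.0
for m in (8, 16, 32, 64):
    eT = abs(composite_trapezoid(np.exp, 0.0, 1.0, m) - exact)
    eS = abs(composite_simpson(np.exp, 0.0, 1.0, m) - exact)
    print(f"m={m:3d}  trapezoid error={eT:.2e}  Simpson error={eS:.2e}")

Halving h should cut the trapezoidal error by about 4 (O(h^2)) and the Simpson error by about 16 (O(h^4)).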

2.3 Euler-Maclaurin Formula

The Euler-Maclaurin formula gives the exact relationship between the trapezoidal rule and the true integral:

T_m(f) = \int_a^b f(x) dx + \sum_{k=1}^{p} \frac{B_{2k}}{(2k)!} h^{2k} \left[f^{(2k-1)}(b) - f^{(2k-1)}(a)\right] + O(h^{2p+2})

where B_{2k} are Bernoulli numbers (B_2 = 1/6, B_4 = -1/30, B_6 = 1/42, ...).

Key implications:

  1. The error has an asymptotic expansion in powers of h^2 - this is the foundation for Richardson extrapolation
  2. For periodic functions, where all derivatives of f match at a and b, the boundary terms vanish for every k - the composite trapezoidal rule achieves spectral accuracy for smooth periodic functions!
  3. The formula predicts exactly how much extrapolation reduces the error

Periodic functions example: For f(x) = \cos(2\pi x) on [0,1] with m equal intervals, the trapezoidal rule is exact to machine precision even for small m. This is why the DFT (which uses uniform sampling of a periodic function) is exact!

For AI: The Euler-Maclaurin formula explains why the FFT is so accurate for bandlimited signals: the trapezoidal rule applied to a periodic function with finitely many nonzero Fourier coefficients gives the exact DFT.
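A quick numerical check of this spectral behavior on a smooth periodic integrand (we use \int_0^1 e^{\sin(2\pi x)} dx, whose reference value is the modified Bessel function I_0(1); the setup is ours):

import numpy as np
from scipy.special import i0

f = lambda x: np.exp(np.sin(2 * np.pi * x))
exact = i0(1.0)                                # reference value of the integral
for m in (4, 8, 16, 32):
    x = np.linspace(0.0, 1.0, m + 1)
    trap = (0.5 * f(x[0]) + f(x[1:-1]).sum() + 0.5 * f(x[-1])) / m
    print(f"m={m:2d}  error={abs(trap - exact):.2e}")

The error drops geometrically and reaches machine precision by m of roughly 16-20, in contrast with the O(h^2) rate seen for non-periodic integrands.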

2.4 Limitations of Newton-Cotes

Negative weights for large n: Newton-Cotes weights become negative for n \geq 8. Small perturbations in the f values are then amplified rather than averaged out, so the computed sum can carry a large error - numerical instability.

Runge's phenomenon for integration: Unlike interpolation, integration is a smoothing operation - integrating the Runge interpolant gives reasonable results for moderate n. But for large n with equispaced nodes, the interpolant oscillates wildly and the integral of the oscillations can be large.

Conclusion: Use composite low-order rules (trapezoidal, Simpson's) rather than high-degree Newton-Cotes on a single interval.


3. Gaussian Quadrature

3.1 Degree of Exactness and Optimal Nodes

Definition: A quadrature rule Q_n has degree of exactness d if Q_n(p) = I(p) for all polynomials p of degree \leq d, but it fails for some polynomial of degree d+1.

Newton-Cotes with n+1 equispaced nodes achieves degree of exactness n (or n+1 if n is even).

Question: Can we do better by choosing the nodes freely?

Counting argument: A rule with n free nodes has 2n free parameters (n nodes + n weights). Matching the 2n monomials 1, x, \ldots, x^{2n-1} imposes 2n conditions, so we might hope to achieve degree of exactness 2n-1.

Theorem (Gaussian quadrature): This is achievable. With n optimally chosen nodes, we can achieve degree of exactness 2n-1 - exactly twice what equispaced nodes give.

How to find the nodes: The optimal nodes for \int_{-1}^1 f(x) dx are the zeros of the n-th Legendre polynomial P_n(x).

Proof sketch: For any polynomial f of degree \leq 2n-1, write f = q \cdot P_n + r where q has degree \leq n-1 and r has degree \leq n-1. Then:

  • \int_{-1}^1 f = \int q \cdot P_n + \int r - the first integral is 0 by orthogonality of P_n to lower-degree polynomials
  • Q_n(f) = Q_n(q \cdot P_n) + Q_n(r) = 0 + Q_n(r) - the first term is 0 because q \cdot P_n vanishes at all zeros of P_n
  • Q_n(r) = \int r because Q_n is exact for polynomials of degree \leq n-1 < 2n-1 \square

3.2 Gauss-Legendre Quadrature

For \int_{-1}^1 f(x) dx, the n-point Gauss-Legendre rule uses:

  • Nodes: the zeros of P_n(x) (the Legendre polynomial of degree n)
  • Weights: w_i = \frac{2}{(1-x_i^2)\,[P_n'(x_i)]^2}

Gauss-Legendre nodes and weights for small n:

| n | Nodes x_i | Weights w_i |
|---|---|---|
| 1 | 0 | 2 |
| 2 | \pm 1/\sqrt{3} | 1, 1 |
| 3 | 0, \pm\sqrt{3/5} | 8/9, 5/9, 5/9 |
| 5 | 0, \pm\frac{1}{3}\sqrt{5 - 2\sqrt{10/7}}, \pm\frac{1}{3}\sqrt{5 + 2\sqrt{10/7}} | 128/225, \frac{322+13\sqrt{70}}{900}, \frac{322-13\sqrt{70}}{900} |

Error: For f \in C^{2n}[-1,1]:

E_n = \frac{(n!)^4}{(2n+1)[(2n)!]^3} (b-a)^{2n+1} f^{(2n)}(\xi) = O(h^{2n})

Scaling to [a,b]: Change variables x = \frac{a+b}{2} + \frac{b-a}{2} t, t \in [-1,1]:

\int_a^b f(x) dx = \frac{b-a}{2} \int_{-1}^1 f\!\left(\frac{a+b}{2} + \frac{b-a}{2} t\right) dt \approx \frac{b-a}{2} \sum_{i=1}^n w_i f\!\left(\frac{a+b}{2} + \frac{b-a}{2} x_i\right)

Computing GL nodes: Use the eigenvalue method - the nodes are the eigenvalues of the Golub-Welsch tridiagonal matrix:

J_n = \begin{pmatrix} 0 & \beta_1 & & \\ \beta_1 & 0 & \beta_2 & \\ & \beta_2 & 0 & \ddots \\ & & \ddots & \ddots \end{pmatrix}, \quad \beta_k = \frac{k}{2\sqrt{k^2 - 1/4}}

The eigenvalues are the GL nodes, and the first components of the eigenvectors give the weights.

For AI: Gauss-Legendre quadrature is used in:

  • Normalizing flows: Computing normalizing constants for complex distributions via quadrature
  • Physics-informed networks: Computing weak-form integrals uv\int \nabla u \cdot \nabla v exactly
  • ODE solvers: Gauss collocation methods are implicit RK methods with GL nodes - they achieve the highest possible order for a given number of stages
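A minimal sketch of the affine rescaling to [a,b] using NumPy's built-in Gauss-Legendre nodes (see also Appendix E.2; the helper name is ours):

import numpy as np

def gauss_legendre_ab(f, a, b, n):
    """n-point Gauss-Legendre approximation of int_a^b f(x) dx."""
    t, w = np.polynomial.legendre.leggauss(n)          # nodes and weights on [-1, 1]
    x = 0.5 * (a + b) + 0.5 * (b - a) * t              # affine map to [a, b]
    return 0.5 * (b - a) * np.dot(w, f(x))

exact = np.e - 1.0
for n in (2, 3, 4, 5):
    print(n, abs(gauss_legendre_ab(np.exp, 0.0, 1.0, n) - exact))

Five nodes already reach roughly 1e-14 error on this smooth integrand, whereas composite Simpson's would need hundreds of evaluations for comparable accuracy.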

3.3 Gauss-Chebyshev Quadrature

For \int_{-1}^1 f(x) (1-x^2)^{-1/2} dx (the Chebyshev weight), the nodes and weights have explicit formulas:

x_k = \cos\!\left(\frac{(2k-1)\pi}{2n}\right), \quad w_k = \frac{\pi}{n}, \quad k = 1, \ldots, n

Key properties:

  • The nodes are equispaced in \theta = \arccos(x) - no eigenvalue computation needed
  • All weights equal \pi/n - uniform weights!
  • These are exactly the Chebyshev nodes of the first kind
  • Degree of exactness: 2n-1

Type-II Gauss-Chebyshev: For the weight (1-x^2)^{1/2}:

x_k = \cos\!\left(\frac{k\pi}{n+1}\right), \quad w_k = \frac{\pi}{n+1} \sin^2\!\left(\frac{k\pi}{n+1}\right)

3.4 Other Gauss Rules

Gauss-Laguerre: \int_0^\infty f(x) e^{-x} dx \approx \sum_i w_i f(x_i) - the nodes are the zeros of the Laguerre polynomial L_n(x).

Gauss-Hermite: \int_{-\infty}^\infty f(x) e^{-x^2} dx \approx \sum_i w_i f(x_i) - the nodes are the zeros of the Hermite polynomial H_n(x).

For AI: Gauss-Hermite quadrature is used for computing expectations over Gaussian distributions:

\mathbb{E}_{x \sim \mathcal{N}(\mu, \sigma^2)}[f(x)] = \frac{1}{\sqrt{\pi}} \int_{-\infty}^\infty f(\sqrt{2}\sigma t + \mu) e^{-t^2} dt \approx \frac{1}{\sqrt{\pi}} \sum_i w_i^{\text{GH}} f(\sqrt{2}\sigma x_i^{\text{GH}} + \mu)

This is used in computing the ELBO exactly in certain VAE architectures and in moment-matching for Gaussian approximations (expectation propagation).
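A sketch of this substitution for a generic \mathbb{E}_{x \sim \mathcal{N}(\mu,\sigma^2)}[f(x)] using NumPy's hermgauss (the helper name is ours):

import numpy as np

def gauss_hermite_expectation(f, mu, sigma, n=20):
    """Approximate E_{x ~ N(mu, sigma^2)}[f(x)] with n Gauss-Hermite points."""
    t, w = np.polynomial.hermite.hermgauss(n)          # nodes/weights for the e^{-t^2} weight
    return np.dot(w, f(np.sqrt(2.0) * sigma * t + mu)) / np.sqrt(np.pi)

mu, sigma = 0.5, 2.0
print(gauss_hermite_expectation(lambda x: x, mu, sigma))          # ~ mu
print(gauss_hermite_expectation(lambda x: x**2, mu, sigma))       # ~ mu^2 + sigma^2
print(gauss_hermite_expectation(lambda x: np.log1p(np.exp(x)), mu, sigma))  # smooth nonlinearity

The first two calls reproduce the known Gaussian moments (Appendix F), which is a convenient sanity check.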

Gauss-Jacobi: \int_{-1}^1 f(x) (1-x)^\alpha (1+x)^\beta dx - used for Beta-distributed random variables and Jacobi polynomial bases.

3.5 Gauss-Kronrod Rules for Error Estimation

The problem: Gaussian quadrature doesn't naturally provide an error estimate.

Gauss-Kronrod: Extend an n-point GL rule to a (2n+1)-point rule by adding n+1 Kronrod nodes interleaved with the GL nodes, choosing them to maximize the degree of exactness of the combined rule.

The G7-K15 pair (a 7-point GL rule embedded in a 15-point Kronrod rule) is the classic building block of QUADPACK, the library behind scipy.integrate.quad:

  • G7: degree of exactness 13 (7 nodes)
  • K15: degree of exactness 23 (15 nodes, reusing the 7 G7 nodes)
  • Error estimate: |K15 - G7|

Adaptivity: If |K15 - G7| < \text{tol}, accept the interval. Otherwise, bisect and recurse.


4. Richardson Extrapolation and Romberg Integration

4.1 Richardson Extrapolation

Setup: If we know that Q(h) = I + c_2 h^2 + c_4 h^4 + \cdots (an error expansion in powers of h^2), then computing Q(h) and Q(h/2):

Q(h) = I + c_2 h^2 + O(h^4)

Q(h/2) = I + c_2 h^2/4 + O(h^4)

Eliminating the h^2 term:

I = \frac{4 Q(h/2) - Q(h)}{3} + O(h^4)

This is one step of Richardson extrapolation - combining two O(h^2) estimates to get O(h^4).

General Richardson extrapolation: If Q_k(h) = I + c_p h^p + O(h^{p+2}):

Q_{k+1}(h) = \frac{2^p Q_k(h/2) - Q_k(h)}{2^p - 1} + O(h^{p+2})

For AI: Richardson extrapolation is the foundation of gradient estimation via finite differences - taking linear combinations of function evaluations to cancel lower-order error terms. The centered difference formula is a 2-step Richardson extrapolation.
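A minimal sketch of a single Richardson step applied to the composite trapezoidal rule - which is exactly one Romberg step (Section 4.2); the helper name and test integral are ours:

import numpy as np

def trap(f, a, b, m):
    x = np.linspace(a, b, m + 1)
    return (b - a) / m * (0.5 * f(x[0]) + f(x[1:-1]).sum() + 0.5 * f(x[-1]))

f, a, b, exact = np.exp, 0.0, 1.0, np.e - 1.0
Qh  = trap(f, a, b, 16)                   # step size h
Qh2 = trap(f, a, b, 32)                   # step size h/2
richardson = (4 * Qh2 - Qh) / 3           # eliminates the c_2 h^2 term
print(abs(Qh - exact), abs(Qh2 - exact), abs(richardson - exact))

The extrapolated value is O(h^4)-accurate: its error is several orders of magnitude smaller than either trapezoidal estimate.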

4.2 Romberg's Method

Romberg's method applies repeated Richardson extrapolation to the composite trapezoidal rule.

Romberg table: Starting from T(h_0) and halving h repeatedly:

R_{0,0} = T(h_0), \quad R_{k,0} = T(h_0/2^k), \quad R_{k,j} = \frac{4^j R_{k,j-1} - R_{k-1,j-1}}{4^j - 1}, \quad j = 1, \ldots, k

The diagonal entry R_{k,k} approximates I with error O(h_0^{2k+2}/2^{2k^2}).

Algorithm:

import numpy as np

def romberg(f, a, b, max_iter=10, tol=1e-12):
    R = np.zeros((max_iter + 1, max_iter + 1))
    R[0, 0] = (b - a) / 2 * (f(a) + f(b))              # trapezoidal with h = b - a
    for k in range(1, max_iter + 1):
        h = (b - a) / 2**k
        # composite trapezoidal with 2^k intervals: reuse old points, add the new midpoints
        R[k, 0] = 0.5 * R[k-1, 0] + h * sum(f(a + (2*j - 1) * h) for j in range(1, 2**(k-1) + 1))
        for j in range(1, k + 1):                      # Richardson extrapolation across the row
            R[k, j] = (4**j * R[k, j-1] - R[k-1, j-1]) / (4**j - 1)
        if abs(R[k, k] - R[k-1, k-1]) < tol:
            return R[k, k]
    return R[max_iter, max_iter]

Convergence: For smooth, non-periodic functions, R_{k,k} converges like O(h_0^{2k+2}) - faster than any fixed-order method!

4.3 Clenshaw-Curtis Quadrature

Clenshaw-Curtis integrates the Chebyshev interpolant of f on [-1,1]:

\int_{-1}^1 f(x) dx \approx \int_{-1}^1 p_n(x) dx

where p_n is the degree-n Chebyshev interpolant at the n+1 Chebyshev nodes of the 2nd kind.

Weights via DCT: The Chebyshev expansion p_n(x) = \sum_k c_k T_k(x) integrates term by term using

\int_{-1}^1 T_k(x) \, dx = \begin{cases} \dfrac{2}{1-k^2} & k \text{ even} \\ 0 & k \text{ odd} \end{cases}

So the CC weights are computed in O(n \log n) via the DCT.

CC vs GL: For smooth functions, CC and GL achieve similar accuracy. GL uses n nodes for degree of exactness 2n-1; CC uses n+1 nodes for degree n but with DCT acceleration. For adaptive applications where nodes are reused on subintervals, CC is often preferred because its nodes nest (the n=4 rule's nodes include the n=2 rule's nodes).
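A small sketch of the construction, fitting the Chebyshev interpolant with NumPy and integrating it term by term (the helper name is ours; production implementations build the weights with an FFT/DCT instead of a least-squares fit):

import numpy as np

def clenshaw_curtis(f, n):
    """Approximate int_{-1}^{1} f(x) dx with the (n+1)-point Clenshaw-Curtis rule."""
    x = np.cos(np.pi * np.arange(n + 1) / n)              # Chebyshev points of the 2nd kind
    c = np.polynomial.chebyshev.chebfit(x, f(x), n)       # coefficients of the interpolant
    k = np.arange(n + 1)
    tk_integral = np.zeros(n + 1)
    even = (k % 2 == 0)
    tk_integral[even] = 2.0 / (1.0 - k[even]**2)           # int_{-1}^{1} T_k(x) dx (odd k give 0)
    return np.dot(c, tk_integral)

f = lambda x: np.exp(x) * np.cos(x)
exact = (np.exp(1) * (np.sin(1) + np.cos(1)) - np.exp(-1) * (np.sin(-1) + np.cos(-1))) / 2
for n in (4, 8, 16):
    print(n, abs(clenshaw_curtis(f, n) - exact))

For this analytic integrand the error decays geometrically in n, consistent with the comparison to GL above.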


5. Adaptive Quadrature

5.1 Adaptive Strategies

Idea: Concentrate function evaluations where the integrand is rough (large error per unit interval), not where it is smooth.

Error indicator: Estimate the local error on a subinterval [a,b] using:

\hat{E} = |Q_{\text{fine}}(f) - Q_{\text{coarse}}(f)|

where Q_{\text{fine}} uses more points (e.g., the Gauss-Kronrod K15) and Q_{\text{coarse}} uses fewer (e.g., G7).

Algorithm (recursive):

  1. Evaluate Q_{\text{fine}} and Q_{\text{coarse}} on [a,b]
  2. If \hat{E} < \text{tol} \cdot (b-a)/(B-A), where [A,B] is the original interval (the local tolerance is proportional to the subinterval width): accept
  3. Otherwise: bisect [a,b] into [a,(a+b)/2] and [(a+b)/2, b] and recurse on each

Global vs local tolerance: The final error satisfies |I - Q| \leq \sum_j |E_j| \leq \text{tol} if each leaf interval satisfies its local tolerance - this requires careful accounting.

5.2 Error Control and Termination

Termination conditions:

  1. Local tolerance met: |Q_{\text{fine}} - Q_{\text{coarse}}| < \text{atol} + \text{rtol} \cdot |Q_{\text{fine}}|
  2. Maximum evaluations reached: Return with warning
  3. Interval too small: Machine epsilon prevents further bisection

Priority queue (non-recursive): For efficiency, maintain a heap of subintervals sorted by error estimate. Always split the highest-error interval:

import heapq

# Max-heap of subintervals keyed by (negated) error estimate; track the running global error.
queue = [(-initial_error, a, b)]
total_error = initial_error
while total_error > tol and queue:
    neg_err, l, r = heapq.heappop(queue)
    total_error += neg_err                      # remove this interval's contribution
    m = (l + r) / 2
    # evaluate the coarse/fine rules on [l, m] and [m, r] -> new_err_left, new_err_right
    heapq.heappush(queue, (-new_err_left, l, m));  total_error += new_err_left
    heapq.heappush(queue, (-new_err_right, m, r)); total_error += new_err_right

5.3 scipy.integrate API

from scipy import integrate

# Basic quadrature (adaptive Gauss-Kronrod)
result, error = integrate.quad(f, a, b)

# With tolerances
result, error = integrate.quad(f, a, b, epsabs=1e-12, epsrel=1e-12)

# Improper integrals
result, error = integrate.quad(f, 0, np.inf)

# Singularity handling
result, error = integrate.quad(f, 0, 1, points=[0.5])  # Inform about singularities

# Fixed-order quadrature
result, _ = integrate.fixed_quad(f, a, b, n=10)  # Fixed-order Gauss-Legendre, n points; returns (value, None)

# Romberg integration
result = integrate.romberg(f, a, b)

6. Monte Carlo Integration

6.1 Basic Monte Carlo

The fundamental idea: Rewrite the integral as an expectation:

\int_\Omega f(x) dx = |\Omega| \cdot \mathbb{E}_{x \sim \text{Uniform}(\Omega)}[f(x)]

Estimate by sampling x_1, \ldots, x_n \sim \text{Uniform}(\Omega):

I \approx \hat{I}_n = \frac{|\Omega|}{n} \sum_{i=1}^{n} f(x_i)

Error analysis: By the Central Limit Theorem:

\hat{I}_n - I \sim \mathcal{N}\!\left(0, \frac{|\Omega|^2 \text{Var}(f)}{n}\right), \quad \text{RMSE} = \frac{|\Omega| \sqrt{\text{Var}(f)}}{\sqrt{n}} = O(1/\sqrt{n})

Key properties:

  1. Dimension-independent: O(1/\sqrt{n}) regardless of d
  2. No smoothness requirement: Works for discontinuous f
  3. Slow convergence: Need n \sim 10^8 for 4 decimal digits of accuracy - much slower than deterministic quadrature for smooth low-dimensional integrands
  4. Easily parallelizable: All samples are independent

When to use Monte Carlo:

  • d > 5-10 (deterministic methods are exponentially worse)
  • f is discontinuous or has singularities
  • The accuracy requirement is modest (\sim 1\%-5\%)
  • Sampling from p(x) is easier than computing the integral directly
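A sketch of the plain estimator together with its CLT-based standard error (Appendix C.1), applied to the 2D integral \int_{[0,1]^2} \sqrt{x^2+y^2}\, dx\, dy from Exercise 5; the helper name is ours:

import numpy as np

rng = np.random.default_rng(0)

def mc_integrate(f, d, n):
    """Plain Monte Carlo over [0,1]^d, returning the estimate and its standard error."""
    x = rng.random((n, d))
    vals = f(x)
    return vals.mean(), vals.std(ddof=1) / np.sqrt(n)

f = lambda x: np.sqrt((x**2).sum(axis=1))
for n in (10**3, 10**4, 10**5):
    est, se = mc_integrate(f, d=2, n=n)
    print(f"n={n:6d}  estimate={est:.5f}  95% half-width={1.96*se:.5f}")

The confidence-interval half-width shrinks like 1/\sqrt{n}: 10x more samples buys only about a 3.2x tighter interval.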

6.2 Variance Reduction Techniques

1. Importance sampling: Instead of sampling from \text{Uniform}(\Omega), sample from a proposal distribution q(x):

I = \int f(x) dx = \int \frac{f(x)}{q(x)} q(x) dx = \mathbb{E}_{x \sim q}\left[\frac{f(x)}{q(x)}\right]

The optimal proposal is q^*(x) = |f(x)| / \int |f| - it reduces the variance to zero when f has one sign (but requires knowing the integral!). In practice, choose q to be large wherever |f| is large.

Estimator variance: \text{Var}_{q}\!\left(\frac{f}{q}\right) = \int \frac{f^2}{q} \, dx - I^2 - minimized when q \propto |f|.

2. Control variates: Find g with known integral I_g and high correlation with f:

\hat{I} = \hat{I}_n(f) - \hat{I}_n(g) + I_g

The variance is \text{Var}(f - g) = \text{Var}(f) + \text{Var}(g) - 2\,\text{Cov}(f,g) - reduced when f and g are highly correlated.

3. Antithetic variates: For symmetric domains, pair each sample x_i with its mirror 2\mu - x_i:

\hat{I} = \frac{1}{2n} \sum_{i=1}^{n} [f(x_i) + f(2\mu - x_i)]

Variance reduction: \text{Var}\!\left(\frac{1}{2}(f(X)+f(2\mu-X))\right) = \frac{1}{2}\text{Var}(f) + \frac{1}{2}\text{Cov}(f(X), f(2\mu-X)).

4. Stratified sampling: Divide \Omega into k strata and sample n/k points from each. Variance:

\text{Var}_{\text{strat}} = \frac{1}{n} \sum_j p_j \text{Var}_j(f) \leq \text{Var}_{\text{plain}}

where p_j is the probability mass of stratum j.

For AI: All of these techniques appear in ML:

  • The likelihood-ratio idea behind importance sampling underlies REINFORCE/policy gradient: \nabla J = \mathbb{E}_\pi[\nabla \log \pi \cdot R], sampled under the current policy
  • Control variates are used as baselines in policy gradient: subtract a baseline b(s) to reduce variance
  • Stratified sampling is used in mini-batch selection to ensure balanced class representation
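A compact sketch comparing plain MC, antithetic variates, and a control variate on \int_0^1 e^x dx (the control g(x) = 1 + x has known integral 3/2; the setup is ours, and the regression coefficient beta below generalizes the simple difference estimator given above):

import numpy as np

rng = np.random.default_rng(1)
n, exact = 20_000, np.e - 1.0
u = rng.random(n)

plain = np.exp(u)                                    # plain MC samples of f(U)
anti  = 0.5 * (np.exp(u) + np.exp(1.0 - u))          # antithetic pairs (U, 1 - U)
g, Ig = 1.0 + u, 1.5                                 # control variate with known integral
beta  = np.cov(np.exp(u), g)[0, 1] / g.var(ddof=1)   # estimated optimal coefficient
ctrl  = np.exp(u) - beta * (g - Ig)

for name, s in [("plain", plain), ("antithetic", anti), ("control variate", ctrl)]:
    print(f"{name:16s} estimate={s.mean():.5f}  per-sample std={s.std(ddof=1):.4f}")

All three estimators are unbiased for e - 1; the antithetic and control-variate samples have a much smaller per-sample standard deviation, so far fewer samples reach a given accuracy.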

6.3 Quasi-Monte Carlo and Low-Discrepancy Sequences

Discrepancy: Measures how uniformly a point set covers the domain. A sequence with low discrepancy fills the space more uniformly than random points.

Koksma-Hlawka inequality:

\left|\int f - \frac{1}{n}\sum_i f(x_i)\right| \leq V(f) \cdot D^*(x_1, \ldots, x_n)

where V(f) is the variation of f and D^* is the star discrepancy.

Sobol sequences: Quasi-random sequences with discrepancy O((\log n)^d / n) - much better than the O(1/\sqrt{n}) rate of random sampling for low to moderate d.

Convergence comparison:

| Method | Rate | Good for |
|---|---|---|
| Monte Carlo | O(n^{-1/2}) | Any d, discontinuous f |
| Quasi-MC (Sobol) | O(n^{-1} (\log n)^d) | d \leq 20, smooth f |
| Gauss-Legendre (tensor product) | O(\rho^{-n^{1/d}}) for analytic f | d \leq 3, smooth f |
| Clenshaw-Curtis | O(e^{-cn}) | d = 1, analytic f |

For AI: Quasi-MC is used in:

  • Low-discrepancy random features: Quasi-random Fourier features for kernel approximation are more accurate than random ones
  • Batch sampling in reinforcement learning: Sobol-like sequences for diverse trajectory coverage
  • Hyperparameter search: Quasi-random search outperforms grid search and random search for moderate dd

7. Multidimensional Integration

7.1 Tensor Product Quadrature

For \int_{[a,b]^d} f(x_1, \ldots, x_d)\, dx, the tensor product rule applies a 1D quadrature in each dimension:

Q_n^d(f) = \sum_{i_1=1}^{n} \cdots \sum_{i_d=1}^{n} w_{i_1} \cdots w_{i_d} f(x_{i_1}, \ldots, x_{i_d})

Curse of dimensionality: This requires n^d function evaluations. For n=10 points per dimension and d=10 dimensions: 10^{10} evaluations. Infeasible!

Error: If each 1D rule has error O(h^p):

|E| = O(d\, h^p) \quad (\text{not } O(h^{pd}))

The error scales linearly with d, not exponentially - the convergence rate O(h^p) is preserved. But the number of evaluations grows as n^d.
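A sketch of a 2D tensor-product Gauss-Legendre rule on [0,1]^2 (names are ours), showing both the accuracy and the n^d evaluation count:

import numpy as np

def gl_2d(f, n):
    """Tensor-product n x n Gauss-Legendre rule for int_{[0,1]^2} f(x, y) dx dy."""
    t, w = np.polynomial.legendre.leggauss(n)
    x = 0.5 * (t + 1.0)                         # map nodes from [-1,1] to [0,1]
    wx = 0.5 * w                                # rescale weights accordingly
    X, Y = np.meshgrid(x, x, indexing="ij")
    W = np.outer(wx, wx)                        # n^2 product weights
    return float(np.sum(W * f(X, Y)))

f = lambda x, y: np.exp(x + y)
exact = (np.e - 1.0) ** 2
for n in (2, 4, 6):
    print(f"{n*n:3d} evaluations, error = {abs(gl_2d(f, n) - exact):.2e}")

Accuracy is excellent in two dimensions, but the same construction in d = 10 dimensions would already need n^10 evaluations.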

7.2 Sparse Grids

Smolyak's construction: Use tensor products only of "low-level" 1D rules. The level-q sparse grid uses O(n (\log n)^{d-1}) points instead of n^d:

Q_q^d = \sum_{|i|_1 \leq q+d-1} (-1)^{q+d-1-|i|_1} \binom{d-1}{q+d-1-|i|_1} \, \bigotimes_{j=1}^d \Delta_{i_j}^{(1)}

where \Delta_k^{(1)} is the difference between the k-th and (k-1)-th 1D rules.

Error for smooth f: O((\log n)^{d-1}/n^p) - the dimension enters only through the logarithmic factor, not exponentially.

For AI: Sparse grid integration is used in:

  • High-dimensional numerical PDEs (atmospheric modeling, option pricing)
  • Bayesian quadrature for computing GP marginal likelihoods
  • Efficient moment computation in uncertainty quantification

7.3 Monte Carlo vs Deterministic in High Dimensions

| d | Best deterministic method | MC (unbiased, n samples) | Break-even n |
|---|---|---|---|
| 1 | GL-5 (5 pts, O(h^{10})) | O(n^{-1/2}) | n \approx 100 |
| 5 | Sparse grid (\sim n^3) | O(n^{-1/2}) | n \approx 10^4 |
| 10 | Sparse grid (\sim n^5) | O(n^{-1/2}) | n \approx 10^6 |
| 20 | MC wins clearly | O(n^{-1/2}) | - |
| 100 | MC only | O(n^{-1/2}) | - |

Practical advice: For d \leq 3, use Gauss-Legendre or Clenshaw-Curtis. For d \leq 15, consider quasi-MC or sparse grids. For d > 15, use Monte Carlo with variance reduction.


8. Applications in Machine Learning

8.1 Normalizing Constants and Partition Functions

The partition function problem: In energy-based models, the probability is:

p(x; \theta) = \frac{\exp(-E_\theta(x))}{Z(\theta)}, \quad Z(\theta) = \int \exp(-E_\theta(x))\, dx

Computing Z(\theta) exactly is intractable for complex E_\theta. Solutions:

  • Thermodynamic integration: \log Z(\theta_1) - \log Z(\theta_0) = \int_0^1 \mathbb{E}_{p(\cdot;\theta_t)}[E_{\theta_0}(x) - E_{\theta_1}(x)]\, dt - a 1D quadrature over the interpolation (inverse-temperature) parameter
  • Annealed importance sampling (AIS): a Monte Carlo estimate of the ratio Z(\theta_1)/Z(\theta_0)
  • Score matching: Avoid Z(\theta) entirely by training on \nabla_x \log p = -\nabla_x E_\theta

For diffusion models: The ELBO for a Gaussian diffusion process involves:

\mathcal{L} = \mathbb{E}_q\left[\log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right]

Each KL divergence term between Gaussians is computable analytically - no numerical integration needed. This is why diffusion models are tractable.

8.2 ELBO and Variational Inference

Variational Autoencoder (VAE): The ELBO lower-bounds \log p(x):

\text{ELBO}(x; \phi, \theta) = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \,\|\, p(z))

The reparameterization trick: When q_\phi(z|x) = \mathcal{N}(\mu_\phi, \sigma_\phi^2):

z = \mu_\phi + \sigma_\phi \odot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)

\nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \nabla_\phi \mathbb{E}_\varepsilon[f(\mu_\phi + \sigma_\phi \varepsilon)] = \mathbb{E}_\varepsilon[\nabla_\phi f(\mu_\phi + \sigma_\phi \varepsilon)]

This converts a score function estimator (high variance) to a pathwise estimator (low variance) - essentially converting a derivative of an expectation into an expectation of a derivative.

Gauss-Hermite for an exact ELBO: When q_\phi(z|x) = \mathcal{N}(\mu, \sigma^2):

\mathbb{E}_q[f(z)] = \frac{1}{\sqrt{\pi}} \int f(\sqrt{2}\sigma t + \mu) e^{-t^2} dt \approx \frac{1}{\sqrt{\pi}} \sum_{k=1}^n w_k f(\sqrt{2}\sigma x_k + \mu)

With n = 10 Gauss-Hermite points this is nearly exact for smooth f - eliminating the Monte Carlo noise in the ELBO gradient.
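A small numerical comparison of the score-function and pathwise estimators of \nabla_\mu \mathbb{E}_{z \sim \mathcal{N}(\mu,1)}[z^2] (whose exact value is 2\mu), illustrating the variance gap discussed above; the toy setup is ours:

import numpy as np

rng = np.random.default_rng(2)
mu, n = 1.5, 10_000
eps = rng.standard_normal(n)
z = mu + eps                                   # reparameterized samples z ~ N(mu, 1)

score_samples    = z**2 * (z - mu)             # f(z) * d/dmu log N(z; mu, 1)
pathwise_samples = 2.0 * z                     # d/dmu f(mu + eps) with f(z) = z^2

for name, s in [("score function", score_samples), ("pathwise", pathwise_samples)]:
    print(f"{name:15s} mean={s.mean():.3f} (exact {2*mu})  variance={s.var(ddof=1):.2f}")

Both estimators are unbiased, but the pathwise (reparameterization) samples have a far smaller variance - consistent with the 10-100x reduction quoted in Appendix D.2.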

8.3 Gaussian Process Marginal Likelihood

The GP log marginal likelihood is:

\log p(\mathbf{y} \mid X) = -\frac{1}{2} \mathbf{y}^\top (K + \sigma^2 I)^{-1} \mathbf{y} - \frac{1}{2}\log|K + \sigma^2 I| - \frac{n}{2}\log 2\pi

Computing \log|K + \sigma^2 I| requires the determinant of an n \times n matrix - O(n^3) via Cholesky. For large n:

Stochastic trace estimation: \log|A| = \text{tr}(\log A) \approx \frac{1}{m} \sum_i z_i^\top (\log A)\, z_i for random z_i \sim \mathcal{N}(0, I).

Lanczos quadrature: Estimate \mathbf{v}^\top \log(A)\, \mathbf{v} using Lanczos iteration to build a small tridiagonal approximation of A, then evaluate the logarithm of that tridiagonal matrix via Gauss quadrature. This combines cheap matrix-vector products with Gaussian quadrature on the spectral measure.
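A toy sketch of the stochastic trace identity on a small SPD matrix, using a dense matrix logarithm only to check the identity (the Lanczos approach replaces logm with quadrature on a small tridiagonal matrix); the setup is ours:

import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(3)
n, m = 50, 200
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                     # well-conditioned SPD matrix

logA = np.real(logm(A))                         # dense log(A): only feasible for small n
z = rng.standard_normal((n, m))                 # m Gaussian probe vectors
estimate = np.mean(np.einsum("im,im->m", z, logA @ z))

print("exact log|A| =", np.linalg.slogdet(A)[1], " Hutchinson estimate =", estimate)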

8.4 Expected Value Estimation in RL

Policy gradient (REINFORCE):

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]

This is a Monte Carlo estimate - simulate trajectories and compute the returns G_t. The estimator is unbiased but high-variance. Variance reduction:

  • Baseline subtraction: Replace G_t with G_t - b(s_t) (leaves the gradient unbiased, reduces variance if b is correlated with G_t)
  • Actor-Critic: Use the advantage A_t = G_t - V^\pi(s_t) - the advantage function reduces variance dramatically

TD learning approximates \mathbb{E}[G_t \mid s_t] via the Bellman recursion rather than direct Monte Carlo:

V(s_t) \approx r_t + \gamma V(s_{t+1})

This is a recursive quadrature - approximating the integral of the value function over the next-state distribution using a single sample.


9. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using Simpson's rule with an odd number of intervals | Simpson's requires m even (pairs of intervals) | Use even m or use the 3/8 rule for the odd remainder |
| 2 | Applying high-degree Newton-Cotes on a single large interval | Negative weights for n \geq 8 -> catastrophic cancellation | Use composite low-order rules |
| 3 | Underestimating Monte Carlo variance | O(1/\sqrt{n}) convergence means 100x more samples for 10x accuracy | Use deterministic methods for d \leq 5, or variance reduction |
| 4 | Ignoring singularities in adaptive quadrature | Standard Gauss-Kronrod fails near singularities | Report singularity locations to quad() via the points=[] parameter |
| 5 | Claiming Monte Carlo is "exact in expectation" justifies any sample size | Unbiased \neq accurate; confidence intervals require n \gg 1 | Always report MC standard errors alongside estimates |
| 6 | Using composite trapezoidal for periodic functions with the wrong period | If the function period doesn't match the integration interval, the boundary terms don't cancel | Ensure \int_a^b covers exactly an integer number of periods |
| 7 | Forgetting the Jacobian when changing variables | For x = g(u): \int f(x) dx = \int f(g(u)) g'(u) du - omitting g'(u) gives the wrong value | Always include the Jacobian factor g'(u) |
| 8 | Using GL quadrature with cached nodes/weights outside [-1,1] | Precomputed nodes are for [-1,1]; rescaling is required for [a,b] | Apply the affine transformation x = (a+b)/2 + (b-a)/2 \cdot t and scale the weights by (b-a)/2 |
| 9 | Stopping Romberg integration too early | The Romberg table diagonal may not show the true error until many rows | Always check at least 2 successive diagonal entries for convergence |
| 10 | Using importance sampling with a poor proposal | If q(x) \ll f(x) p(x) in some region, the ratio f(x)/q(x) can be huge -> infinite variance | Choose q roughly proportional to the magnitude of f \cdot p, with heavier tails than the integrand |
| 11 | Applying MC to 1D integrals where deterministic is available | MC needs 10^6 samples for 3 digits; Gauss-Legendre needs 5 | Use MC only when deterministic methods are infeasible |
| 12 | Forgetting that quasi-MC Sobol sequences need scrambling for correctness | Unscrambled Sobol sequences can have correlated estimators | Use scrambled Sobol from scipy.stats.qmc |

10. Exercises

Exercise 1 * - Newton-Cotes Rules from Scratch
(a) Implement the composite trapezoidal rule trap(f, a, b, m) and verify O(h^2) convergence on \int_0^1 e^x dx = e - 1. (b) Implement composite Simpson's rule simpson(f, a, b, m) and verify O(h^4) convergence. (c) Show the midpoint rule is twice as accurate as the trapezoidal rule (same O(h^2) but half the constant) by computing and comparing errors on a convex function.

Exercise 2 * - Romberg Integration
(a) Implement Romberg's method, building the triangular table R_{k,j} for k = 0, \ldots, 8. (b) Apply it to \int_0^1 \sin(x^2) dx and compare convergence to the composite trapezoidal rule. (c) Show that for f(x) = e^x, the diagonal R_{k,k} converges super-geometrically - the number of correct digits roughly doubles each step.

Exercise 3 * - Gauss-Legendre Quadrature
(a) Compute the 5-point GL nodes and weights using the eigenvalue method (Golub-Welsch matrix). (b) Verify: the rule integrates all polynomials of degree \leq 9 exactly. (c) Compare GL-5 accuracy on \int_0^1 e^x dx vs composite Simpson's with 50 subintervals.

Exercise 4 ** - Adaptive Quadrature Implementation
(a) Implement adaptive_simpson(f, a, b, tol) using recursive bisection with the 3-point Simpson estimate and the 5-point refined estimate for error control. (b) Test on \int_0^1 x^{-1/2} dx = 2 (singular at 0) and \int_0^{50} \sin(x) dx (oscillatory). (c) Plot the adaptive mesh (where it placed evaluation points) for a function with a sharp peak.

Exercise 5 ** - Monte Carlo Integration
(a) Estimate \int_0^1 \int_0^1 \sqrt{x^2 + y^2} \, dx\, dy using plain Monte Carlo with n = 10^3, 10^4, 10^5 samples. Show the O(1/\sqrt{n}) error convergence. (b) Implement importance sampling for \int_0^\infty x e^{-x^2/2} dx with the proposal q(x) = xe^{-x^2/2}/C (a Rayleigh density). Compare the variance against uniform sampling. (c) Implement antithetic variates for the 2D integral in (a) and measure the variance reduction factor.

Exercise 6 ** - Euler-Maclaurin and Periodic Functions
(a) Numerically verify the Euler-Maclaurin formula for the trapezoidal rule on \int_0^1 x^4 dx: show the error matches \sum_k B_{2k} h^{2k} [f^{(2k-1)}(1) - f^{(2k-1)}(0)]/(2k)!. (b) Demonstrate spectral accuracy: for f(x) = e^{\sin(2\pi x)} (periodic), show the composite trapezoidal rule achieves machine precision with just 20 points. (c) Explain why this makes the DFT exact: use the Euler-Maclaurin formula to show that integrating the DFT basis functions e^{2\pi i k x} over [0,1] with the trapezoidal rule is exact for integer k.

Exercise 7 ** - Gauss-Hermite for Gaussian Expectations
(a) Implement n-point Gauss-Hermite quadrature for \int_{-\infty}^\infty f(x) e^{-x^2} dx using the Golub-Welsch algorithm. (b) Compute \mathbb{E}_{x \sim \mathcal{N}(0,1)}[\log(1 + e^x)] (the binary cross-entropy expectation) using GH quadrature with n = 5, 10, 20 points. Compare to Monte Carlo with 10^4 samples. (c) Implement the reparameterization trick to compute the ELBO gradient \nabla_\mu \mathbb{E}_{z \sim \mathcal{N}(\mu, 1)}[\log p(z)] for p(z) = \mathcal{N}(0, 1) using both GH quadrature and the pathwise estimator.

Exercise 8 *** - Quasi-Monte Carlo and Low-Discrepancy Sequences
(a) Compare plain Monte Carlo vs Sobol quasi-MC for \int_{[0,1]^d} \prod_{i=1}^d \sin(\pi x_i)\, dx = (2/\pi)^d for d = 5, 10, 20. (b) Plot the convergence rate (log-log error vs \log n) for both methods. Show that MC gives slope -0.5 and Sobol gives slope \approx -1 for small d. (c) For d = 10, how many samples do you need for MC to achieve 10^{-3} relative accuracy? For Sobol? Compute the ratio.


11. Why This Matters for AI (2026 Perspective)

| Concept | AI/LLM Impact |
|---|---|
| Monte Carlo integration | Foundation of all stochastic gradient methods; ELBO, policy gradient, score matching |
| Reparameterization trick | VAE training, diffusion model score matching, normalizing flows - the key to backpropagating through stochastic variables |
| Gauss-Hermite quadrature | Exact ELBO computation in structured VAEs; expectation propagation for approximate inference |
| Importance sampling | Off-policy RL (PPO uses importance weights r_t = \pi_\theta/\pi_{\text{old}}); rejection sampling for LLM alignment |
| Quasi-Monte Carlo | Faster hyperparameter search (Optuna, Ray Tune use quasi-random sampling); faster random feature approximations |
| Adaptive quadrature | Error-controlled numerical ODE solvers (Dormand-Prince, VODE) used in neural ODE/SDE training |
| Thermodynamic integration | Computing log partition functions for energy-based models; AIS for evaluating generative models |
| Variance reduction (baselines) | Policy gradient training stability (REINFORCE with baseline, A2C, PPO); reduces gradient variance by 10-100x |
| Gaussian process marginal likelihood | Bayesian optimization for hyperparameter tuning; GP-based surrogate models in AutoML |
| Lanczos-based matrix quadrature | Efficient log-determinant computation for GP likelihoods at large scale (GPyTorch) |

12. Conceptual Bridge

Looking back: Numerical integration is built directly on the foundations of this chapter and the broader curriculum:

  • Floating-point arithmetic (Section10-01): All quadrature involves finite-precision arithmetic; catastrophic cancellation in the sum \sum w_i f(x_i) must be avoided
  • Interpolation (Section10-04): Quadrature rules are derived by integrating interpolants exactly. Every quadrature rule has a corresponding interpolation scheme
  • Orthogonal polynomials (Section10-04, Section03-Advanced-Linear-Algebra): Gaussian quadrature nodes are zeros of orthogonal polynomials; the Golub-Welsch algorithm is an eigenvalue computation
  • Probability (Section07): Monte Carlo integration is equivalent to expectation estimation; variance reduction techniques exploit probability theory directly

Looking forward: Numerical integration connects to advanced topics throughout the curriculum:

  • Ordinary and partial differential equations: ODE solvers (RK4, Dormand-Prince) are quadrature formulas for \int_{t_0}^{t_1} f(t, y(t))\, dt; Gauss collocation methods achieve the highest possible order
  • Markov chain Monte Carlo: When direct sampling is unavailable, MCMC generates correlated samples whose empirical average still converges to the integral - at a slower O(1/\sqrt{n/\tau}) rate, where \tau is the autocorrelation time
  • Signal processing: The DFT is a quadrature formula; the connection between Fourier analysis and quadrature (Clenshaw-Curtis via the DCT) runs deep

NUMERICAL INTEGRATION IN THE CURRICULUM
========================================================================

  Section10-01 Floating-Point      Section10-04 Interpolation
  (rounding in sums)   -->   (integrate interpolant)
           |                        |
           +------------+-----------+
                        v
              Section10-05 NUMERICAL INTEGRATION
              +--------------------------+
              | Newton-Cotes rules        |
              | Gaussian quadrature       |
              | Romberg / Richardson      |
              | Adaptive (Gauss-Kronrod)  |
              | Monte Carlo + QMC         |
              | Multidimensional          |
              +------------+-------------+
                           |
          +----------------+----------------+
          v                v                v
     ODE Solvers        VAE/ELBO         Bayesian
     (SectionAdvCalc)       (reparam trick)   Optimization
                      Diffusion models  (GP marginal
                      Policy gradient   likelihood)

========================================================================

The unifying theme: Every integration problem in ML is either a low-dimensional, smooth integral (use Gaussian quadrature - exact and fast) or a high-dimensional, stochastic integral (use Monte Carlo - dimension-independent convergence). The art is recognizing which regime you're in and applying the appropriate technique. The reparameterization trick is the bridge: it converts a "hard" stochastic integral into an "easy" pathwise integral, enabling efficient gradient computation.


Appendix A: Proof of Gaussian Quadrature Optimality

A.1 The 2n-1 Degree of Exactness Theorem

Theorem: The n-point Gauss-Legendre rule Q_n(f) = \sum_{i=1}^n w_i f(x_i) integrates all polynomials of degree \leq 2n-1 exactly.

Proof: Let f \in \mathcal{P}_{2n-1} (a polynomial of degree \leq 2n-1). Divide by the node polynomial \omega_n(x) = \prod_{i=1}^n (x-x_i) (degree n):

f(x) = q(x)\, \omega_n(x) + r(x)

where q, r \in \mathcal{P}_{n-1}.

Step 1: \int_{-1}^1 q(x)\, \omega_n(x)\, dx = 0. Since \omega_n = c P_n for some constant c (the nodes are the zeros of P_n) and q \in \mathcal{P}_{n-1}, orthogonality of P_n to all lower-degree polynomials gives \int q P_n = 0.

Step 2: Q_n(q \omega_n) = \sum_i w_i q(x_i)\, \omega_n(x_i) = 0, since \omega_n(x_i) = 0 at each node x_i.

Step 3: Q_n(r) = \int r exactly. The weights w_i are obtained by integrating the Lagrange basis on the nodes x_i, so the rule is exact for every r \in \mathcal{P}_{n-1}.

Step 4: Combining:

Q_n(f) = Q_n(q\omega_n) + Q_n(r) = 0 + \int r = \int (q\omega_n + r) = \int f

\square

Why 2n-1 is optimal: Any n-point rule fails for p(x) = \omega_n(x)^2 \in \mathcal{P}_{2n}, since p(x_i) = 0 for all nodes but \int p > 0.

A.2 The Golub-Welsch Algorithm

The GL nodes and weights are the eigenvalues and eigenvectors of the Jacobi tridiagonal matrix J_n:

J_n = \begin{pmatrix} \alpha_0 & \beta_1 & & \\ \beta_1 & \alpha_1 & \beta_2 & \\ & \beta_2 & \alpha_2 & \ddots \\ & & \ddots & \ddots \end{pmatrix}

where for Legendre polynomials: \alpha_k = 0 (by symmetry), \beta_k = k / \sqrt{4k^2 - 1}.

Algorithm:

  1. Build J_n as a symmetric tridiagonal matrix
  2. Compute the eigendecomposition J_n = Q \Lambda Q^\top
  3. Nodes: x_i = \Lambda_{ii} (the eigenvalues)
  4. Weights: w_i = 2 (Q_{1i})^2 (squared first component of each eigenvector, times 2 = \int_{-1}^1 1\, dx)

This reduces GL computation to a standard symmetric eigenvalue problem - O(n^2) using QR iteration, or O(n \log n) with divide-and-conquer.

import numpy as np

def gauss_legendre(n):
    """Compute n-point Gauss-Legendre nodes and weights on [-1, 1]."""
    k = np.arange(1, n)
    beta = k / np.sqrt(4*k**2 - 1)
    J = np.diag(beta, -1) + np.diag(beta, 1)  # Symmetric tridiagonal
    eigvals, eigvecs = np.linalg.eigh(J)
    # Sort by node
    idx = np.argsort(eigvals)
    nodes = eigvals[idx]
    weights = 2 * eigvecs[0, idx]**2  # First row of eigenvector matrix
    return nodes, weights
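A quick check of the degree-of-exactness claim, assuming the gauss_legendre function above (with n = 5, all monomials up to degree 9 should be integrated exactly):

nodes, weights = gauss_legendre(5)
for p in range(12):
    approx = np.dot(weights, nodes**p)
    exact = 2.0 / (p + 1) if p % 2 == 0 else 0.0    # int_{-1}^{1} x^p dx
    print(p, abs(approx - exact))

The errors sit at machine precision for p <= 9 and become clearly nonzero from p = 10 onward.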

Appendix B: Richardson Extrapolation - Complete Analysis

B.1 The Error Expansion Assumption

Richardson extrapolation requires the error to have the asymptotic expansion:

Q(h) = I + c_p h^p + c_{p+2} h^{p+2} + c_{p+4} h^{p+4} + \cdots

When does this hold?

  • Composite trapezoidal: Yes (Euler-Maclaurin guarantees h^2, h^4, h^6, \ldots terms)
  • Composite Simpson's: Yes (h^4, h^6, h^8, \ldots terms)
  • Smooth periodic functions: The h^{2k} boundary terms all vanish and the error is already geometric, O(e^{-cn}), so Richardson extrapolation offers essentially no further gain

B.2 Romberg Table Construction

The Romberg table is computed as:

R_{k,0} = T_{2^k}(f) \quad (\text{composite trapezoidal with } 2^k \text{ intervals}), \qquad R_{k,j} = \frac{4^j R_{k, j-1} - R_{k-1, j-1}}{4^j - 1}

Incremental trapezoidal computation: R_{k,0} can be computed from R_{k-1,0} using only the new midpoints:

T_{2m}(f) = \frac{1}{2} T_m(f) + h \sum_{j=1}^{m} f(a + (2j-1)h), \quad h = \frac{b-a}{2m}

Total function evaluations: 2^k + 1 (not 2^0 + 2^1 + \cdots + 2^k = 2^{k+1} - 1 - the old evaluations are reused).

B.3 Convergence of Romberg

For f \in C^\infty[a,b], the diagonal R_{k,k} satisfies:

|I - R_{k,k}| \leq C_k h_0^{2k+2}

where h_0 = b-a and C_k = O(1/(2k+2)!). The convergence is faster than any fixed power of h_0 - supergeometric.

Practical stopping criterion: Compare |R_{k,k} - R_{k-1,k-1}|. If this is smaller than the tolerance, stop. But be careful: the estimate can be misleading for non-smooth functions.


Appendix C: Monte Carlo - Detailed Variance Analysis

C.1 Central Limit Theorem for MC

Let X_1, \ldots, X_n \sim \text{Uniform}(\Omega) i.i.d. The MC estimator \hat{I}_n = |\Omega| \frac{1}{n} \sum_i f(X_i) satisfies:

\sqrt{n}(\hat{I}_n - I) \xrightarrow{d} \mathcal{N}(0, |\Omega|^2 \text{Var}(f))

95% confidence interval: I \in \hat{I}_n \pm 1.96 \frac{|\Omega| \hat{\sigma}}{\sqrt{n}}, where \hat{\sigma}^2 = \frac{1}{n-1}\sum_i (f_i - \bar{f})^2.

How many samples for accuracy \varepsilon?

n = \left(\frac{1.96\, |\Omega|\, \hat{\sigma}}{\varepsilon}\right)^2

For \varepsilon = 10^{-3} and \hat{\sigma} = 1: n \approx 4 \times 10^6 - about 4 million samples.

C.2 Importance Sampling: Optimal Proposal

The variance of the importance sampling estimator is:

\text{Var}_q\!\left(\frac{f(X)}{q(X)}\right) = \int \frac{f(x)^2}{q(x)}\, dx - I^2

Minimize over q subject to \int q = 1: By Cauchy-Schwarz, \int \frac{f^2}{q} \geq \left(\int |f|\right)^2, with equality when q^* = |f| / \int |f|.

With q = q^*: the ratio f(X)/q^*(X) = \text{sign}(f(X)) \int |f| is constant whenever f does not change sign - zero variance! But computing q^* requires knowing \int |f|.

Practical IS: Choose q close to |f|. When estimating \int f(x) p(x)\, dx, a good choice in practice is q(x) \propto |f(x)|\, p(x), where p is the true weight.

C.3 Quasi-MC Convergence Analysis

The Koksma-Hlawka inequality for quasi-MC:

\left|\frac{1}{n}\sum_i f(x_i) - \int_{[0,1]^d} f\, dx\right| \leq V_{\text{HK}}(f) \cdot D^*(x_1, \ldots, x_n)

where V_{\text{HK}} is the Hardy-Krause variation and D^* is the star discrepancy.

Sobol sequences: D^* = O((\log n)^d / n).

Convergence rate comparison:

  • MC: O(n^{-1/2})
  • Sobol: O(n^{-1} (\log n)^d) - faster by a factor of O(n^{1/2} / (\log n)^d)
  • Break-even (Sobol provably faster): n \gg (\log n)^{2d} - for d=10: n \gg 10^{12}... not practical!

But: the constant in front matters. For moderate n (thousands to millions), Sobol consistently outperforms MC in practice for d \leq 20.


Appendix D: Integration and Machine Learning - Extended Connections

D.1 The REINFORCE Gradient and Score Function Estimator

The policy gradient is a Monte Carlo estimate of θEπ[R]\nabla_\theta \mathbb{E}_\pi[R]. The score function estimator (also called REINFORCE or likelihood-ratio estimator):

θEpθ(x)[f(x)]=Epθ(x)[f(x)θlogpθ(x)]\nabla_\theta \mathbb{E}_{p_\theta(x)}[f(x)] = \mathbb{E}_{p_\theta(x)}[f(x) \nabla_\theta \log p_\theta(x)]

Proof:

θf(x)pθ(x)dx=f(x)θpθ(x)dx=f(x)θpθ(x)pθ(x)pθ(x)dx=Epθ[fθlogpθ]\nabla_\theta \int f(x) p_\theta(x) dx = \int f(x) \nabla_\theta p_\theta(x) dx = \int f(x) \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)} p_\theta(x) dx = \mathbb{E}_{p_\theta}[f \nabla_\theta \log p_\theta]

This requires only the ability to sample from pθp_\theta and evaluate logpθ\log p_\theta - but has high variance because f(x)θlogpθ(x)f(x) \nabla_\theta \log p_\theta(x) fluctuates wildly.

Variance of REINFORCE: Var[f(x)θlogpθ(x)]=E[f2θlogpθ2]θE[f]2\text{Var}[f(x) \nabla_\theta \log p_\theta(x)] = \mathbb{E}[f^2 \|\nabla_\theta \log p_\theta\|^2] - \|\nabla_\theta \mathbb{E}[f]\|^2

This can be O(f2)O(\|f\|^2) - proportional to the square of the reward, which is large. Hence the need for baselines and actor-critic methods.

D.2 The Reparameterization Trick as Pathwise Derivative

When x = g_\theta(\varepsilon) with \varepsilon \sim p(\varepsilon) (independent of \theta):

\nabla_\theta \mathbb{E}_{p_\theta(x)}[f(x)] = \mathbb{E}_{p(\varepsilon)}[\nabla_\theta f(g_\theta(\varepsilon))]

Variance comparison: The reparameterization estimator has variance \text{Var}[\nabla_\theta f(g_\theta(\varepsilon))] - governed by the gradient of f with respect to the parameters, which is typically much smaller than the REINFORCE variance. Empirically, 10-100x lower variance.

When does reparameterization apply?

  • Gaussian: z = \mu + \sigma \varepsilon, \varepsilon \sim \mathcal{N}(0,1) [ok]
  • Uniform: z = a + (b-a)\varepsilon, \varepsilon \sim \text{Uniform}(0,1) [ok]
  • Categorical (Gumbel-Softmax): z = \text{softmax}((\log \pi + g)/\tau), g \sim \text{Gumbel}(0,1) - continuous relaxation [ok]
  • Any discrete distribution without a continuous relaxation: not reparameterizable - fall back to the score function estimator (REINFORCE)

D.3 Numerical Integration in Normalizing Flows

Change of variables theorem: For x=f(z)x = f(z) with zp(z)z \sim p(z):

p_x(x) = p_z(f^{-1}(x)) \cdot |\det J_{f^{-1}}(x)|

The log-likelihood is:

\log p_x(x) = \log p_z(f^{-1}(x)) + \log |\det J_{f^{-1}}(x)|

Computing \log |\det J| requires the determinant of an n \times n Jacobian matrix - O(n^3) in general. Normalizing flows use structured architectures (coupling layers, autoregressive flows) where J is triangular, so \log |\det J| = \sum_i \log |J_{ii}| - O(n).

This is numerical integration in disguise: we are computing \log p_x(x) = \log \int \delta(x - f(z))\, p(z)\, dz - a singular integral over z-space that the change-of-variables formula evaluates analytically for invertible flows.


Appendix E: Software Reference

E.1 scipy.integrate

from scipy import integrate

# Adaptive Gauss-Kronrod (default: G7-K15)
result, abserr = integrate.quad(f, a, b)
result, abserr = integrate.quad(f, a, b, epsabs=1.49e-8, epsrel=1.49e-8, limit=50)

# Improper integrals
result, abserr = integrate.quad(f, 0, np.inf)  # Semi-infinite
result, abserr = integrate.quad(f, -np.inf, np.inf)  # Doubly infinite

# With singularity points
result, abserr = integrate.quad(f, 0, 1, points=[0.5])

# Fixed Gauss-Legendre
result, _ = integrate.fixed_quad(f, a, b, n=5)  # returns (value, None) - no error estimate

# Romberg
result = integrate.romberg(f, a, b, tol=1.48e-8, rtol=1.48e-8)

# 2D integration
result, abserr = integrate.dblquad(f, a, b, lambda x: c, lambda x: d)

E.2 numpy for Quadrature

import numpy as np

# Gauss-Legendre points and weights (numpy built-in)
x, w = np.polynomial.legendre.leggauss(n)
I = np.dot(w, f(x))  # For integral over [-1, 1]

# Gauss-Hermite
x, w = np.polynomial.hermite.hermgauss(n)
I = np.dot(w, f(x))  # For exp(-x^2) weight

# Gauss-Laguerre
x, w = np.polynomial.laguerre.laggauss(n)
I = np.dot(w, f(x))  # For exp(-x) weight

E.3 Quasi-Monte Carlo with scipy.stats.qmc

from scipy.stats import qmc

# Sobol sequence (scrambled)
sampler = qmc.Sobol(d=5, scramble=True, seed=42)
x = sampler.random(n=1024)  # n must be power of 2 for Sobol
# x has shape (1024, 5), values in [0, 1]^5

# Halton sequence
sampler = qmc.Halton(d=5, scramble=True, seed=42)
x = sampler.random(n=1000)

# Latin Hypercube Sampling
sampler = qmc.LatinHypercube(d=5, seed=42)
x = sampler.random(n=100)

# Compute integral of f over [0,1]^5
I_qmc = f(x).mean()
I_mc = f(np.random.rand(1000, 5)).mean()

Appendix F: Common Integrals in Machine Learning

| Integral | Closed form | Context |
|---|---|---|
| \int \mathcal{N}(x; \mu, \sigma^2)\, dx | 1 | Normalization |
| \mathbb{E}_{x \sim \mathcal{N}(\mu,\sigma^2)}[x] | \mu | Gaussian mean |
| \mathbb{E}_{x \sim \mathcal{N}(\mu,\sigma^2)}[x^2] | \mu^2 + \sigma^2 | Gaussian second moment |
| \text{KL}(\mathcal{N}(\mu,\sigma^2) \,\|\, \mathcal{N}(0,1)) | \frac{1}{2}(\mu^2+\sigma^2-\log\sigma^2-1) | VAE ELBO term |
| \int \sigma(x)\, \mathcal{N}(x;\mu,\sigma^2)\, dx | Probit approximation | Expected sigmoid |
| \int_{-\infty}^\infty e^{-x^2}\, dx | \sqrt{\pi} | Gaussian normalization |
| \int_0^\infty x^{n-1} e^{-x}\, dx | \Gamma(n) | Gamma function |
| \int_0^1 x^{a-1}(1-x)^{b-1}\, dx | B(a,b) = \Gamma(a)\Gamma(b)/\Gamma(a+b) | Beta function |
| \int e^{ixt}\, dt | 2\pi\,\delta(x) | Fourier delta |
| \int \|\nabla u\|^2\, dx | \int u\,(-\Delta u)\, dx (by parts) | PDE weak form |

Appendix G: Error Bounds Quick Reference

| Method | Nodes | Error bound | Smoothness needed |
|---|---|---|---|
| Midpoint (1 panel) | 1 | \frac{(b-a)^3}{24}\|f''\| | f \in C^2 |
| Trapezoidal (1 panel) | 2 | \frac{(b-a)^3}{12}\|f''\| | f \in C^2 |
| Simpson's (1 panel) | 3 | \frac{(b-a)^5}{2880}\|f^{(4)}\| | f \in C^4 |
| Composite trapezoidal (m panels) | m+1 | \frac{(b-a)h^2}{12}\|f''\| | f \in C^2 |
| Composite Simpson's (m even) | m+1 | \frac{(b-a)h^4}{180}\|f^{(4)}\| | f \in C^4 |
| Gauss-Legendre, n points (1 panel) | n | O(h^{2n}) | f \in C^{2n} |
| Clenshaw-Curtis (n+1 pts) | n+1 | O(n^{-p}) or O(\rho^{-n}) | f \in C^p or analytic |
| Romberg (k steps) | 2^k+1 | O(h_0^{2k+2}) | f \in C^\infty |
| Monte Carlo | n samples | O(n^{-1/2}) | Any measurable f |
| Quasi-MC (Sobol) | n pts | O(n^{-1}(\log n)^d) | Bounded variation |

Appendix H: Extended Newton-Cotes Theory

H.1 Derivation of Newton-Cotes Weights

The weights for an (n+1)(n+1)-point Newton-Cotes rule on [0,n][0, n] with unit spacing are:

wiNC=0nj=0,jintjijdtw_i^{\text{NC}} = \int_0^n \prod_{j=0, j\neq i}^{n} \frac{t-j}{i-j} dt

For n=2n=2 (Simpson's rule, 3 nodes at t=0,1,2t=0,1,2):

w0=02(t1)(t2)(01)(02)dt=1202(t23t+2)dt=12[836+4]=13w_0 = \int_0^2 \frac{(t-1)(t-2)}{(0-1)(0-2)} dt = \frac{1}{2}\int_0^2 (t^2-3t+2) dt = \frac{1}{2}\left[\frac{8}{3} - 6 + 4\right] = \frac{1}{3}

Similarly w1=4/3w_1 = 4/3, w2=1/3w_2 = 1/3. With h=(ba)/2h=(b-a)/2: weights are h/3×[1,4,1]h/3 \times [1, 4, 1]. \square
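A quick numerical check of this derivation, integrating each Lagrange basis polynomial over $[0, 2]$ with numpy's polynomial class (a sketch added for illustration, not part of the original derivation):

import numpy as np
from numpy.polynomial import Polynomial

nodes = [0, 1, 2]
weights = []
for i, xi in enumerate(nodes):
    # Build the Lagrange basis polynomial l_i(t) = prod_{j != i} (t - x_j) / (x_i - x_j)
    li = Polynomial([1.0])
    for j, xj in enumerate(nodes):
        if j != i:
            li = li * Polynomial([-xj, 1.0]) * (1.0 / (xi - xj))
    antideriv = li.integ()
    weights.append(antideriv(2) - antideriv(0))

print(weights)   # ~ [0.333, 1.333, 0.333] = [1/3, 4/3, 1/3]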

H.2 Negative Weights for High-Order Newton-Cotes

The weights for Newton-Cotes rules with n8n \geq 8 become negative. This is a fundamental instability: if ff has rounding errors of magnitude ε\varepsilon, the error in Qn(f)Q_n(f) is:

E(iwi)ε=ΛNCε|E| \leq \left(\sum_i |w_i|\right) \varepsilon = \Lambda_{\text{NC}} \varepsilon

For $n=10$: $\Lambda_{\text{NC}} = \sum |w_i| \approx 3.4$ - already a noticeable amplification. The sum grows roughly exponentially with $n$, so high-order Newton-Cotes rules become catastrophically unstable in floating point.

In contrast, Gaussian quadrature weights are always positive - the sum is exactly bab-a and rounding errors are not amplified.
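A small experiment makes the instability visible. scipy.integrate.newton_cotes returns the weights for the rule on $n+1$ equally spaced nodes; watching the minimum weight and the ratio $\sum|w_i| / \sum w_i$ as $n$ grows shows where the rule turns unstable (a sketch for illustration):

import numpy as np
from scipy.integrate import newton_cotes

for n in [2, 4, 6, 8, 10, 12, 14]:
    w, _ = newton_cotes(n, 1)                     # weights on nodes 0, 1, ..., n
    amplification = np.abs(w).sum() / w.sum()     # equals 1 while all weights are positive
    print(f"n={n:>2}: min weight = {w.min():+.3f}, sum|w|/sum w = {amplification:.2f}")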

H.3 Peano Kernel Theorem

For a rule with error E(f)=I(f)Q(f)E(f) = I(f) - Q(f), the Peano kernel K(t)K(t) satisfies:

E(f)=abK(t)f(p+1)(t)dtE(f) = \int_a^b K(t) f^{(p+1)}(t) dt

where pp is the degree of exactness. The Peano kernel is zero when tt is outside [a,b][a,b] and has a specific shape depending on the rule.

Implications:

  1. The error depends on f(p+1)f^{(p+1)} - the function's (p+1)(p+1)-th derivative
  2. If $K(t)$ has constant sign on $[a,b]$, the mean value theorem for integrals gives $E(f) = f^{(p+1)}(\xi) \int_a^b K(t)\, dt$ for some $\xi \in (a,b)$ (checked numerically in the sketch after this list)
  3. For symmetric rules (midpoint, trapezoidal, GL), the Peano kernel has definite sign for appropriate pp
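For the single-panel trapezoidal rule on $[0,1]$ the Peano kernel is $K(t) = t(t-1)/2$ (degree of exactness $p = 1$), so $E(f) = \int_0^1 K(t) f''(t)\, dt$. A short numerical check of this representation (scipy.integrate.quad is used only to evaluate the two integrals; the test function is arbitrary):

import numpy as np
from scipy import integrate

f, d2f = np.exp, np.exp                       # test function and its second derivative
I, _ = integrate.quad(f, 0, 1)
E_direct = I - (f(0) + f(1)) / 2              # error of the one-panel trapezoidal rule
E_peano, _ = integrate.quad(lambda t: t * (t - 1) / 2 * d2f(t), 0, 1)
print(E_direct, E_peano)                      # the two numbers should agree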

Appendix I: Adaptive Quadrature - Implementation Details

I.1 Gauss-Kronrod G7-K15 Rule

The 15-point Gauss-Kronrod rule (K15) extends the 7-point Gauss-Legendre (G7):

G7 nodes (Gauss-Legendre, 7 points): ±0.9491079,±0.7415312,±0.4058452,0\pm 0.9491079, \pm 0.7415312, \pm 0.4058452, 0

K15 nodes (Kronrod extension, 15 points): The G7 nodes plus 8 additional Kronrod nodes.

Error estimate: the difference $|K15 - G7|$ between the two embedded rules. Since K15 is far more accurate than G7, this difference essentially measures the error of G7, and it serves as a conservative estimate of the error in the returned K15 value (QUADPACK reports a rescaled version of it).

Degree of exactness: G7 integrates polynomials exactly up to degree 13; K15 up to degree 23.

I.2 Recursive vs Priority-Queue Adaptive

Recursive adaptive (sketched here with Simpson's rule on the whole interval as the coarse estimate and on the two halves as the fine estimate):

def adaptive_quad(f, a, b, tol, depth=0, max_depth=50):
    simpson = lambda lo, hi: (hi - lo) / 6 * (f(lo) + 4*f((lo + hi)/2) + f(hi))
    m = (a + b) / 2
    Q_coarse = simpson(a, b)                 # coarse rule on [a, b]
    Q_fine = simpson(a, m) + simpson(m, b)   # fine rule: two half-panels
    if abs(Q_fine - Q_coarse) < tol or depth >= max_depth:
        return Q_fine
    return (adaptive_quad(f, a, m, tol/2, depth+1) +
            adaptive_quad(f, m, b, tol/2, depth+1))

Priority-queue adaptive: Better for functions with many singularities - always refines the worst subinterval first.
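A minimal sketch of the priority-queue variant using Python's heapq, with the same Simpson coarse/fine pair as above (the interval cap and tolerance handling are illustrative):

import heapq

def adaptive_quad_pq(f, a, b, tol, max_intervals=10_000):
    simpson = lambda lo, hi: (hi - lo) / 6 * (f(lo) + 4*f((lo + hi)/2) + f(hi))

    def make_entry(lo, hi):
        m = (lo + hi) / 2
        coarse, fine = simpson(lo, hi), simpson(lo, m) + simpson(m, hi)
        return (-abs(fine - coarse), lo, hi, fine)   # max-heap on the local error estimate

    heap = [make_entry(a, b)]
    for _ in range(max_intervals):
        total_err = sum(-entry[0] for entry in heap)
        if total_err < tol:
            break
        _, lo, hi, _ = heapq.heappop(heap)           # refine the worst subinterval first
        m = (lo + hi) / 2
        heapq.heappush(heap, make_entry(lo, m))
        heapq.heappush(heap, make_entry(m, hi))
    return sum(entry[3] for entry in heap)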

I.3 Handling Singularities

Integrable singularities: f(x)C(xa)αf(x) \sim C (x-a)^\alpha with α>1\alpha > -1 gives finite abf\int_a^b f.

Strategy 1 - Change of variables: $x = a + (b-a) t^{1/(\alpha+1)}$ smooths the singularity: the factors $(x-a)^\alpha$ and $dx/dt$ combine into a constant, so

$$\int_a^b C(x-a)^\alpha dx = \frac{C(b-a)^{\alpha+1}}{\alpha+1} \int_0^1 dt$$

Strategy 2 - Gauss-Jacobi: Use the $(1-x)^\alpha (1+x)^\beta$ weight function to absorb the singularity. scipy.integrate.quad exposes this through its weight='alg' option (QUADPACK's QAWS routine) for algebraic singularities at the endpoints.

Strategy 3 - IMT transformation: substitute $x = \varphi(t)$, where $\varphi(t) = \frac{1}{Q}\int_0^t \exp(-1/s - 1/(1-s))\, ds$ and $Q$ normalizes so that $\varphi(1) = 1$. All derivatives of $\varphi$ vanish at $t = 0$ and $t = 1$, so the transformed integrand is infinitely flat at the endpoints and the trapezoidal rule converges rapidly (every Euler-Maclaurin boundary term vanishes).

The same idea - an endpoint-flattening change of variables - underlies the tanh-sinh (double exponential) quadrature method of Takahashi and Mori.


Appendix J: Monte Carlo in Practice - Recipes

J.1 Estimating π\pi via Monte Carlo

Classic example: π/4=Prob(X2+Y21)\pi/4 = \text{Prob}(X^2 + Y^2 \leq 1) with X,YUniform(0,1)X, Y \sim \text{Uniform}(0,1).

import numpy as np
np.random.seed(42)

for n in [100, 1000, 10000, 1000000]:
    x, y = np.random.rand(n), np.random.rand(n)
    p = (x**2 + y**2 <= 1).mean()           # estimate of pi/4
    pi_est = 4 * p
    std_err = 4 * np.sqrt(p * (1 - p) / n)  # binomial standard error, scaled by 4
    print(f"n={n:>8}: pi = {pi_est:.6f} +/- {std_err:.6f}")

Output: Shows O(1/n)O(1/\sqrt{n}) standard error.

J.2 Gaussian Expectation via GH Quadrature

def gauss_hermite_expect(f, mu, sigma, n=20):
    """
    Compute E_{x~N(mu,sigma^2)}[f(x)] via n-point Gauss-Hermite quadrature.
    """
    x, w = np.polynomial.hermite.hermgauss(n)
    x_transformed = np.sqrt(2) * sigma * x + mu
    return np.dot(w, f(x_transformed)) / np.sqrt(np.pi)

# Example: E[log(1 + e^x)] for x ~ N(0, 1)
f = lambda x: np.logaddexp(0, x)   # numerically stable log(1 + e^x)
gh_approx = gauss_hermite_expect(f, mu=0, sigma=1, n=20)
mc_vals = f(np.random.randn(100_000))
print(f"GH-20:  {gh_approx:.8f}")
print(f"MC-1e5: {mc_vals.mean():.6f} +/- {mc_vals.std()/np.sqrt(len(mc_vals)):.6f}")

J.3 ELBO Computation

def vae_elbo(x, encoder, decoder, n_samples=10):
    """
    Compute ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)).
    Using reparameterization trick.
    """
    mu, log_sigma = encoder(x)
    sigma = np.exp(log_sigma)

    # KL(N(mu, sigma^2) || N(0, 1)) = 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1)
    kl = 0.5 * np.sum(mu**2 + sigma**2 - 2*log_sigma - 1)

    # E_q[log p(x|z)] via reparameterization
    # (decoder.log_prob(x, z) is assumed to return the log-likelihood log p(x|z))
    eps = np.random.randn(n_samples, len(mu))
    z_samples = mu + sigma * eps  # Shape: (n_samples, latent_dim)
    recon = np.mean([decoder.log_prob(x, z) for z in z_samples])

    return recon - kl

Appendix K: Connections to Differential Equations and Physics

K.1 ODE Solvers as Quadrature

The ODE initial value problem y=f(t,y)y' = f(t, y), y(t0)=y0y(t_0) = y_0 has solution:

y(t)=y0+t0tf(s,y(s))dsy(t) = y_0 + \int_{t_0}^{t} f(s, y(s)) ds

Runge-Kutta methods approximate this integral using quadrature:

  • Euler: yn+1=yn+hf(tn,yn)y_{n+1} = y_n + h f(t_n, y_n) - left-endpoint rule (O(h)O(h))
  • RK4: yn+1=yn+h(k1+2k2+2k3+k4)/6y_{n+1} = y_n + h(k_1 + 2k_2 + 2k_3 + k_4)/6 - Simpson's-like rule (O(h4)O(h^4))
  • Gauss collocation: Gauss-Legendre-based - A-stable, with the underlying quadrature exact for polynomials of degree $2s-1$ ($s$ stages)

Dormand-Prince (RK45): Embedded 4th/5th order pair for adaptive step control - the default ODE solver in scipy.integrate.solve_ivp.
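The quadrature view is easy to check on a pure-quadrature ODE $y' = \cos(t)$, $y(0) = 0$, whose solution is $y(t) = \sin(t)$; halving the step should cut the Euler error by about 2 and the RK4 error by about 16. A small sketch (step sizes and interval are illustrative):

import numpy as np

def solve(f, y0, t0, t1, h, step):
    t, y = t0, y0
    while t < t1 - 1e-12:
        y = step(f, t, y, h)
        t += h
    return y

euler = lambda f, t, y, h: y + h * f(t, y)   # left-endpoint rule

def rk4(f, t, y, h):                          # Simpson's-like rule
    k1 = f(t, y)
    k2 = f(t + h/2, y + h/2 * k1)
    k3 = f(t + h/2, y + h/2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h * (k1 + 2*k2 + 2*k3 + k4) / 6

f = lambda t, y: np.cos(t)
exact = np.sin(2.0)
for h in [0.1, 0.05]:
    print(f"h={h}: Euler error = {abs(solve(f, 0.0, 0.0, 2.0, h, euler) - exact):.2e}, "
          f"RK4 error = {abs(solve(f, 0.0, 0.0, 2.0, h, rk4) - exact):.2e}")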

K.2 Feynman Path Integrals

In quantum mechanics, the probability amplitude is formally:

xfeiHT/xi=D[x(t)]eiS[x]/\langle x_f | e^{-iHT/\hbar} | x_i \rangle = \int \mathcal{D}[x(t)] e^{iS[x]/\hbar}

where S[x]=L(x,x˙)dtS[x] = \int L(x, \dot{x}) dt is the action. This is an infinite-dimensional integral - a functional integral over all paths.

Numerical discretization: Replace the path integral by a finite-dimensional integral over the intermediate positions $x_1, \ldots, x_N$:

$$\left(\frac{m}{2\pi i\hbar \varepsilon}\right)^{N/2} \int \prod_{k=1}^N dx_k\, \exp\!\left(\frac{i\varepsilon}{\hbar} \sum_k L\big(x_k, (x_{k+1}-x_k)/\varepsilon\big)\right)$$

This is a multi-dimensional integral evaluated by Monte Carlo - the lattice QCD approach uses this for nuclear physics calculations.

K.3 Numerical Integration in Finite Element Methods

Finite element method (FEM): Solve 2u=f-\nabla^2 u = f in Ω\Omega by finding uu in a finite-dimensional subspace VhV_h:

Ωuv=ΩfvvVh\int_\Omega \nabla u \cdot \nabla v = \int_\Omega f v \quad \forall v \in V_h

Each element integral Kϕiϕj\int_K \nabla \phi_i \cdot \nabla \phi_j is computed via Gaussian quadrature on the reference element K^\hat{K}:

$$\int_K \nabla \phi_i \cdot \nabla \phi_j = |\det J_K| \sum_m w_m \left(J_K^{-T}\nabla \hat{\phi}_i(x_m)\right) \cdot \left(J_K^{-T}\nabla \hat{\phi}_j(x_m)\right)$$

where JKJ_K is the Jacobian of the mapping from K^\hat{K} to KK.

For AI: Physics-informed neural networks (PINNs) solve PDEs by:

  1. Sampling "collocation points" inside the domain (random or Gauss-like)
  2. Computing the PDE residual at each point
  3. Minimizing the sum of residuals squared

This is meshless quadrature - and the quality of the quadrature points (random vs quasi-random vs Gauss) directly affects training accuracy.
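The effect of point quality can be seen on a toy residual-type integral. Here we estimate $\int_{[0,1]^2} \big(2\pi^2\sin(\pi x)\sin(\pi y)\big)^2\, dx\, dy = \pi^4$ - a stand-in for a squared PDE residual, not an actual trained PINN - with random vs scrambled-Sobol collocation points (a sketch under those assumptions):

import numpy as np
from scipy.stats import qmc

residual_sq = lambda p: (2*np.pi**2 * np.sin(np.pi*p[:, 0]) * np.sin(np.pi*p[:, 1]))**2
exact = np.pi**4

rng = np.random.default_rng(0)
for n in [256, 1024, 4096]:
    mc_est = residual_sq(rng.random((n, 2))).mean()
    qmc_est = residual_sq(qmc.Sobol(d=2, scramble=True, seed=0).random(n)).mean()
    print(f"n={n:>5}: |MC error| = {abs(mc_est - exact):.4f}, "
          f"|QMC error| = {abs(qmc_est - exact):.4f}")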


Appendix L: Notation Reference

| Symbol | Meaning |
| --- | --- |
| $Q_n$ | Quadrature rule with $n$ nodes |
| $x_i$ | Quadrature nodes |
| $w_i$ | Quadrature weights |
| $E_n(f)$ | Quadrature error $= I(f) - Q_n(f)$ |
| $h$ | Mesh spacing $(b-a)/m$ |
| $d$ | Degree of exactness of a rule |
| $P_n$ | Legendre polynomial of degree $n$ |
| $T_n$ | Chebyshev polynomial of degree $n$ |
| $B_k$ | Bernoulli numbers |
| $R_{k,j}$ | Romberg table entry |
| $\hat{I}_n$ | Monte Carlo estimator |
| $D^*$ | Star discrepancy |
| $V_{\text{HK}}$ | Hardy-Krause variation |
| $Z(\theta)$ | Partition function |
| $\mathcal{L}_{\text{ELBO}}$ | Evidence lower bound |
| $\tau$ | Autocorrelation time in MCMC |

Appendix M: Advanced Topics in Quadrature

M.1 Clenshaw-Curtis Quadrature - Full Derivation

Setup: Integrate the Chebyshev interpolant of ff at the n+1n+1 points xk=cos(kπ/n)x_k = \cos(k\pi/n):

11f(x)dx11pn(x)dx=11k=0nckTk(x)dx\int_{-1}^1 f(x) dx \approx \int_{-1}^1 p_n(x) dx = \int_{-1}^1 \sum_{k=0}^n c_k T_k(x) dx

Since $\int_{-1}^1 T_k(x)\, dx = 0$ for odd $k$ and $\int_{-1}^1 T_k(x)\, dx = \frac{2}{1-k^2}$ for even $k$ (in particular $2$ for $k=0$):

$$\int_{-1}^1 p_n(x)\, dx = 2c_0 - \sum_{j=1}^{\lfloor n/2 \rfloor} \frac{2c_{2j}}{4j^2-1}$$

Computing the weights: the Chebyshev coefficients $c_k$ are themselves linear combinations of the samples $f(x_j)$ (a discrete cosine transform of the sampled values), so substituting that relation into the formula above and collecting the terms that multiply each $f(x_j)$ yields the weights $w_j$. In practice the whole weight vector is obtained in $O(n \log n)$ by applying a DCT to the moment vector $m = (2, 0, -2/3, 0, -2/15, 0, \ldots)$.
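A compact way to obtain the same weights without working out the DCT algebra is to impose exactness on $T_0, \ldots, T_n$ directly: solve the linear system $\sum_j w_j T_k(x_j) = m_k$ with the moment vector above. A sketch (fine for moderate $n$; the $O(n\log n)$ DCT route is preferable for large $n$):

import numpy as np

def clenshaw_curtis_weights(n):
    """Weights for the n+1 Chebyshev points x_j = cos(j*pi/n) on [-1, 1]."""
    theta = np.pi * np.arange(n + 1) / n
    x = np.cos(theta)
    k = np.arange(n + 1)
    moments = np.zeros(n + 1)
    moments[::2] = 2.0 / (1.0 - k[::2]**2)   # int_{-1}^1 T_k dx, zero for odd k
    A = np.cos(np.outer(k, theta))           # A[k, j] = T_k(x_j)
    w = np.linalg.solve(A, moments)          # enforce exactness on T_0, ..., T_n
    return x, w

x, w = clenshaw_curtis_weights(16)
print(np.dot(w, np.exp(x)), np.e - 1/np.e)   # both ~ 2.3504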

Why CC is competitive with GL: For analytic functions, both achieve exponential convergence. CC uses n+1n+1 nodes for degree-nn exactness while GL uses nn nodes for degree-2n12n-1 exactness - GL is more "efficient" per node. However:

  1. CC nodes nest: the n/2+1n/2+1 nodes of CC-n/2n/2 are a subset of the n+1n+1 nodes of CC-nn - ideal for adaptive schemes
  2. CC weights are computed via DCT in O(nlogn)O(n \log n) - GL requires O(n2)O(n^2) via eigenvalue method
  3. CC nodes are the Chebyshev nodes - same as used in optimal polynomial approximation

M.2 Gauss-Kronrod Derivation

Problem: Given nn GL nodes, find n+1n+1 additional Kronrod points that maximize the degree of exactness of the combined (2n+1)(2n+1)-point rule.

Solution: The Kronrod points are zeros of the Stieltjes polynomial En+1(x)E_{n+1}(x) satisfying:

11Pn(x)En+1(x)xkdx=0,k=0,1,,n\int_{-1}^1 P_n(x) E_{n+1}(x) x^k dx = 0, \quad k = 0, 1, \ldots, n

These polynomials exist and have real zeros in (1,1)(-1,1) interleaving with the GL nodes, but only for specific nn values. The G7-K15 pair is the standard choice.

Degree of exactness of Kronrod: The combined G7-K15 rule is exact for polynomials up to degree 2323 (G7 is exact up to degree 1313, the Kronrod extension pushes to 2323).

M.3 Double Exponential (Tanh-Sinh) Quadrature

The tanh-sinh or double exponential transformation:

x=tanh ⁣(π2sinh(t)),t(,)x = \tanh\!\left(\frac{\pi}{2}\sinh(t)\right), \quad t \in (-\infty, \infty)

maps tRt \in \mathbb{R} to x(1,1)x \in (-1, 1) with:

dxdt=π2cosh(t)cosh2(π2sinht)\frac{dx}{dt} = \frac{\pi}{2} \frac{\cosh(t)}{\cosh^2(\frac{\pi}{2}\sinh t)}

The derivative decays double exponentially as t|t| \to \infty: faster than any power.

Key property: All derivatives of the integrand f(x(t))dx/dtf(x(t)) \cdot |dx/dt| vanish double-exponentially at t=±t = \pm\infty. By Euler-Maclaurin, the trapezoidal rule is exponentially convergent.

Convergence: For analytic ff with singularities at x=±1x = \pm 1:

E=O(eπd/h)|E| = O(e^{-\pi d / h})

where dd is the half-width of the analytic strip and hh is the step size in tt-space. This is double-exponential in 1/h1/h - faster than any other method for this problem class.

For AI: Tanh-sinh quadrature is used in high-precision computations (arbitrary precision arithmetic), computing special functions in ML kernels (Matern covariance, Bayesian quadrature), and computing KL divergences between complex distributions.
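A minimal tanh-sinh sketch for $\int_{-1}^1 f(x)\, dx$, using the transformation above with a uniform $t$-grid and simple truncation of the tails (the step size and cutoff are illustrative):

import numpy as np

def tanh_sinh(f, h=0.1, t_max=4.0):
    """Approximate int_{-1}^{1} f(x) dx with the double exponential substitution."""
    t = np.arange(-t_max, t_max + h/2, h)
    x = np.tanh(0.5 * np.pi * np.sinh(t))
    dxdt = 0.5 * np.pi * np.cosh(t) / np.cosh(0.5 * np.pi * np.sinh(t))**2
    return h * np.sum(f(x) * dxdt)   # trapezoidal rule in t; tail terms are negligible

# Endpoint derivative singularities are handled automatically:
print(tanh_sinh(lambda x: np.sqrt(1 - x**2)), np.pi / 2)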


Appendix N: Worked Examples

N.1 Integrating f(x)=exsin(x)f(x) = e^x \sin(x) on [0,π][0, \pi]

Exact value: 0πexsin(x)dx=12(eπ+1)12.0703...\int_0^\pi e^x \sin(x) dx = \frac{1}{2}(e^\pi + 1) \approx 12.0703...

Method comparison:

  • Composite trap, m=100m=100: E0.0013|E| \approx 0.0013 (O(h2)O(h^2) with h=π/100h = \pi/100)
  • Composite Simpson's, m=10m=10: E2×107|E| \approx 2 \times 10^{-7} (O(h4)O(h^4), fewer evaluations)
  • GL-5 on [0,π][0,\pi]: E108|E| \approx 10^{-8} (exact for degree 9 polynomials)
  • GL-10: E1016|E| \approx 10^{-16} (machine precision for this smooth function)

Lesson: GL-10 uses only 10 function evaluations and reaches machine precision; the composite trapezoidal rule, converging at $O(h^2)$, would need on the order of $10^8$ evaluations to approach machine precision - at which point rounding error dominates anyway.
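A sketch that reproduces this comparison with a few lines of numpy (exact error magnitudes will vary slightly from the figures quoted above):

import numpy as np

f = lambda x: np.exp(x) * np.sin(x)
a, b = 0.0, np.pi
exact = (np.exp(np.pi) + 1) / 2

def trap(m):
    x = np.linspace(a, b, m + 1); y = f(x)
    return (b - a)/m * (y[0]/2 + y[1:-1].sum() + y[-1]/2)

def simpson(m):
    x = np.linspace(a, b, m + 1); y = f(x)
    return (b - a)/(3*m) * (y[0] + 4*y[1:-1:2].sum() + 2*y[2:-2:2].sum() + y[-1])

def gl(n):
    t, w = np.polynomial.legendre.leggauss(n)
    return (b - a)/2 * np.dot(w, f((a + b)/2 + (b - a)/2 * t))

print(f"trap m=100:   {abs(trap(100) - exact):.2e}")
print(f"simpson m=10: {abs(simpson(10) - exact):.2e}")
print(f"GL-5:         {abs(gl(5) - exact):.2e}")
print(f"GL-10:        {abs(gl(10) - exact):.2e}")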

N.2 Computing the ELBO Expectation

For q(z)=N(z;1.5,0.52)q(z) = \mathcal{N}(z; 1.5, 0.5^2) and logp(z)=z2/2log2π\log p(z) = -z^2/2 - \log\sqrt{2\pi}:

Eq[logp(z)]=Eq ⁣[z2212log(2π)]=μ2+σ2212log(2π)\mathbb{E}_q[\log p(z)] = \mathbb{E}_q\!\left[-\frac{z^2}{2} - \frac{1}{2}\log(2\pi)\right] = -\frac{\mu^2 + \sigma^2}{2} - \frac{1}{2}\log(2\pi)

Analytical: $-(1.5^2 + 0.25)/2 - \tfrac{1}{2}\log(2\pi) = -1.25 - 0.919 = -2.169$

GH-5 approximation: Using 5-point GH nodes $\{t_k\}$ and weights $\{w_k\}$, with $z_k = \sqrt{2} \times 0.5 \times t_k + 1.5$:

$$\frac{1}{\sqrt{\pi}} \sum_{k=1}^5 w_k \log p(\sqrt{2}\sigma t_k + \mu) = -2.169...$$

With 5 GH points: relative error <1012< 10^{-12}. With 1000 MC samples: relative error 0.1%\sim 0.1\%.


Appendix O: Summary Table - When to Use What

QUADRATURE METHOD SELECTION GUIDE
========================================================================

  Key: d = dimension, n = nodes/samples, f = function

  1D INTEGRATION (d=1):
  ---------------------
  f smooth + analytic  -> Gauss-Legendre (5-10 pts) or CC  [best]
  f smooth, C^infinity -> Romberg / Richardson extrapolation
  f piecewise smooth   -> Adaptive G7-K15 (scipy.integrate.quad)
  f has singularities  -> Adaptive + singularity declaration
  f periodic           -> Trapezoidal rule (spectral accuracy!)
  f analytic + endpoint singularity -> Tanh-sinh

  MULTIDIMENSIONAL (d > 1):
  -------------------------
  d <= 3, f smooth     -> Tensor product GL or CC
  d <= 15, f smooth    -> Sparse grids (Smolyak)
  d <= 20              -> Quasi-MC (scrambled Sobol)
  d > 20               -> Monte Carlo with variance reduction
  f = E_p[g(X)]        -> Gauss-Hermite (if p Gaussian)
  Singular / complex   -> MCMC (Metropolis-Hastings)

  ML SPECIFIC:
  ------------
  ELBO gradient        -> Reparameterization + 1 MC sample
  GP marginal lik.     -> Cholesky + Lanczos quadrature
  Policy gradient      -> REINFORCE + baseline (MC)
  Normalizing constant -> AIS or thermodynamic integration
  Neural ODE           -> Adaptive RK (Dormand-Prince)
  Posterior E[f(theta)] -> MCMC (HMC/NUTS for continuous theta)

========================================================================

Appendix P: Historical Notes and Key Papers

P.1 Gauss's Contribution

Carl Friedrich Gauss derived Gaussian quadrature in 1814 while working on geodesy and orbit computation. His insight was to treat both the nodes and weights as free variables - giving 2n2n degrees of freedom for an nn-point rule, which (as he showed) is sufficient to integrate all polynomials of degree 2n1\leq 2n-1.

Gauss computed the first several GL rules by hand, finding the nodes as zeros of Legendre polynomials. The connection between optimal quadrature nodes and orthogonal polynomial zeros is now understood through the theory of Gaussian quadrature, but Gauss discovered the pattern empirically.

P.2 The Monte Carlo Method's Origins

The Monte Carlo method was invented by Stanislaw Ulam while recovering from illness in 1946. Playing solitaire, he wondered: what is the probability of success? Unable to compute it analytically, he realized he could estimate it by simulating many games. He mentioned the idea to John von Neumann, who immediately recognized its potential for nuclear physics computations (neutron diffusion through fissile material) - and the Los Alamos Monte Carlo method was born.

The name "Monte Carlo" was suggested by Nicholas Metropolis, after the famous casino (and the element of chance in the method). The first large-scale application was computing the properties of materials for the hydrogen bomb in 1947.

P.3 Markov Chain Monte Carlo

The Metropolis-Hastings algorithm (1953, 1970) enabled sampling from distributions known only up to a normalizing constant. The key idea: propose a new state $x'$ from the current state $x$, then accept with probability $\min(1, p(x')/p(x))$ (for a symmetric proposal; the general Hastings version also multiplies by the ratio of proposal densities). The Markov chain converges to $p$ as its stationary distribution.

MCMC became mainstream in statistics in the 1990s with the work of Gelfand and Smith (1990) showing how Gibbs sampling could solve Bayesian inference problems computationally. Today, HMC (Hamiltonian Monte Carlo) and NUTS (No-U-Turn Sampler) are the state-of-the-art for continuous distributions in packages like Stan, PyMC, and NumPyro.

P.4 Quasi-Monte Carlo's Development

Ilya Sobol introduced his low-discrepancy sequences in 1967. The key insight: random points have discrepancy O(n1/2)O(n^{-1/2}) while Sobol sequences have discrepancy O((logn)d/n)O((\log n)^d / n). For moderate dimensions (d20d \leq 20) and practical sample sizes (n106n \leq 10^6), Sobol sequences consistently outperform random sampling.

Modern quasi-Monte Carlo uses scrambled Sobol sequences (Owen, 1995) that preserve the low-discrepancy property while also yielding unbiased estimators with variance that can be computed from multiple independent scramblings.


Appendix Q: Practical Code Recipes

Q.1 Composite Rules in NumPy

import numpy as np

def composite_trap(f, a, b, m):
    """Composite trapezoidal rule with m intervals."""
    h = (b - a) / m
    x = np.linspace(a, b, m + 1)
    y = f(x)
    return h * (y[0]/2 + y[1:-1].sum() + y[-1]/2)

def composite_simpson(f, a, b, m):
    """Composite Simpson's rule (m must be even)."""
    assert m % 2 == 0, "m must be even"
    h = (b - a) / m
    x = np.linspace(a, b, m + 1)
    y = f(x)
    return h/3 * (y[0] + 4*y[1:-1:2].sum() + 2*y[2:-2:2].sum() + y[-1])

# Test: integral of e^x from 0 to 1 = e - 1
f = np.exp
exact = np.e - 1
for m in [10, 100, 1000]:
    err_t = abs(composite_trap(f, 0, 1, m) - exact)
    err_s = abs(composite_simpson(f, 0, 1, m) - exact)
    print(f"m={m}: trap={err_t:.2e}, simp={err_s:.2e}")

Q.2 Romberg Integration

def romberg(f, a, b, max_level=8, tol=1e-10):
    """Romberg integration using Richardson extrapolation."""
    R = np.zeros((max_level, max_level))
    h = b - a
    R[0, 0] = h/2 * (f(a) + f(b))

    for k in range(1, max_level):
        h /= 2
        n_new = 2**(k-1)
        x_new = a + h * (2*np.arange(1, n_new+1) - 1)
        R[k, 0] = R[k-1, 0]/2 + h * f(x_new).sum()

        for j in range(1, k+1):
            R[k, j] = (4**j * R[k, j-1] - R[k-1, j-1]) / (4**j - 1)

        if k > 0 and abs(R[k, k] - R[k-1, k-1]) < tol:
            return R[k, k], k

    return R[max_level-1, max_level-1], max_level

result, levels = romberg(np.exp, 0, 1)
print(f"Romberg result: {result:.15f}, levels: {levels}")
print(f"Exact e-1:      {np.e-1:.15f}")

Q.3 Monte Carlo with Variance Reduction

def monte_carlo_antithetic(f, n, seed=42):
    """
    Monte Carlo estimate of E[f(U)] where U ~ Uniform(0,1).
    Uses antithetic variates: pairs (u, 1-u).
    """
    np.random.seed(seed)
    u = np.random.rand(n // 2)
    f_vals = (f(u) + f(1 - u)) / 2  # Antithetic pairs
    return f_vals.mean(), f_vals.std() / np.sqrt(n // 2)

def monte_carlo_plain(f, n, seed=42):
    """Standard Monte Carlo."""
    np.random.seed(seed)
    u = np.random.rand(n)
    vals = f(u)
    return vals.mean(), vals.std() / np.sqrt(n)

# Compare for f(u) = sqrt(u)
f = np.sqrt
for n in [100, 1000, 10000]:
    est_plain, se_plain = monte_carlo_plain(f, n)
    est_anti, se_anti = monte_carlo_antithetic(f, n)
    print(f"n={n:>6}: plain se={se_plain:.4f}, antithetic se={se_anti:.4f}, "
          f"ratio={se_plain/se_anti:.2f}x")

Appendix R: Common Integrals and Their Quadrature Suitability

| Integral | Exact value | Best method | Notes |
| --- | --- | --- | --- |
| $\int_0^1 e^x dx$ | $e-1 \approx 1.718$ | GL-5 | Analytic, smooth |
| $\int_0^1 \sqrt{x}\, dx$ | $2/3$ | GL-10 | Derivative singularity at 0 |
| $\int_0^1 x^{-1/2} dx$ | $2$ | Adaptive (Gauss-Jacobi) | Singularity at 0 |
| $\int_0^\infty e^{-x^2} dx$ | $\sqrt{\pi}/2$ | Gauss-Hermite | Infinite domain |
| $\int_{-\infty}^\infty e^{-x^2} dx$ | $\sqrt{\pi}$ | Gauss-Hermite | Infinite domain |
| $\int_0^{100} \sin(x)\, dx$ | $1-\cos(100)$ | Adaptive (many panels) | Oscillatory |
| $\int_0^1 \sin(1/x)\, dx$ | $\approx 0.504$ | Adaptive + change of vars | Oscillatory near 0 |
| $\int_{[0,1]^{10}} f\, dx$ | Varies | Quasi-MC (Sobol) | High-dimensional |
| $\mathbb{E}_{X \sim \mathcal{N}(0,1)}[\log(1+e^X)]$ | $\approx 0.81$ | GH quadrature | Gaussian expectation |
| $\mathrm{KL}(\mathcal{N}(\mu,\sigma^2)\,\Vert\,\mathcal{N}(0,1))$ | $\frac{\mu^2+\sigma^2-\log\sigma^2-1}{2}$ | Analytical | No quadrature needed! |

Appendix S: Extended Theory - Bayesian Quadrature

S.1 Bayesian Quadrature Framework

Bayesian quadrature (BQ) models the integrand ff as a Gaussian process:

fGP(μ0,k)f \sim \mathcal{GP}(\mu_0, k)

and computes the posterior over the integral I=f(x)p(x)dxI = \int f(x) p(x) dx:

IfnN(I^n,V^n)I | \mathbf{f}_n \sim \mathcal{N}(\hat{I}_n, \hat{V}_n)

Posterior mean (the quadrature estimate):

I^n=z(K+σ2I)1fn\hat{I}_n = \mathbf{z}^\top (K + \sigma^2 I)^{-1} \mathbf{f}_n

where zi=k(x,xi)p(x)dxz_i = \int k(x, x_i) p(x) dx (the "kernel mean" - computable analytically for many kernels).

Posterior variance (the uncertainty):

V^n= ⁣ ⁣k(x,x)p(x)p(x)dxdxz(K+σ2I)1z\hat{V}_n = \int\!\!\int k(x, x') p(x) p(x') dx\, dx' - \mathbf{z}^\top (K + \sigma^2 I)^{-1} \mathbf{z}

Key properties:

  1. BQ reduces to kernel ridge regression with kernel kk
  2. With the Gaussian kernel, BQ is equivalent to Gauss-Hermite quadrature in the limit
  3. BQ provides principled uncertainty - the posterior variance tells us how uncertain we are about the integral given current samples
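A minimal 1D sketch with an RBF kernel and $p = \mathcal{N}(0,1)$, where the kernel mean $z_i$ and the double integral have closed forms; the test integrand, node grid, and length-scale below are illustrative assumptions:

import numpy as np

ell, jitter = 1.0, 1e-10
kernel = lambda a, b: np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

# Closed-form kernel embeddings for the RBF kernel and p = N(0, 1)
kernel_mean = lambda x: ell / np.sqrt(ell**2 + 1) * np.exp(-x**2 / (2 * (ell**2 + 1)))
double_integral = ell / np.sqrt(ell**2 + 2)

f = lambda x: np.sin(x) + x**2          # E_{x~N(0,1)}[f(x)] = 1 exactly
x_nodes = np.linspace(-4, 4, 15)
K = kernel(x_nodes, x_nodes) + jitter * np.eye(len(x_nodes))
z = kernel_mean(x_nodes)

bq_mean = z @ np.linalg.solve(K, f(x_nodes))          # posterior mean of the integral
bq_var = double_integral - z @ np.linalg.solve(K, z)  # posterior variance
print(f"BQ estimate = {bq_mean:.4f} +/- {np.sqrt(max(bq_var, 0)):.1e} (exact 1.0)")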

S.2 Connection to Gauss-Legendre

When k(x,y)=min(x,y)k(x, y) = \min(x,y) (the Brownian motion kernel, corresponding to H1H^1 Sobolev RKHS) and p=Uniform[1,1]p = \text{Uniform}[-1,1], BQ recovers the trapezoidal rule.

When kk corresponds to the HmH^m Sobolev kernel, BQ recovers the mm-th order spline quadrature rule. In the limit as the RKHS becomes the space of polynomials, BQ recovers Gaussian quadrature.

For AI: BQ is used for:

  • Bayesian optimization loop: compute the expected improvement acquisition function via BQ when the integrand is also an ML model
  • Uncertainty-aware neural ODEs: BQ estimates the uncertainty in the solution trajectory
  • Model evidence computation: p(y)=p(yθ)p(θ)dθp(y) = \int p(y|\theta) p(\theta) d\theta via BQ with GP prior on the likelihood

S.3 Active Learning for BQ

Rather than using fixed quadrature nodes, active BQ selects the next evaluation point xn+1x_{n+1} to maximize the reduction in posterior variance:

xn+1=argmaxxV^nV^n+1(x)x_{n+1} = \arg\max_x \hat{V}_n - \hat{V}_{n+1}(x)

This is equivalent to minimizing the posterior variance, giving an information-theoretic quadrature rule. For the Matern-1/21/2 kernel, this recovers the bisection rule (adaptive quadrature). For smoother kernels, it places more points where ff is uncertain - naturally handling functions with local features.


Appendix T: Stochastic Integration - SDEs and Neural SDEs

T.1 Ito Integral

The Ito integral 0THtdWt\int_0^T H_t dW_t is a stochastic integral with respect to Brownian motion WtW_t:

0THtdWt=limnkHtk(Wtk+1Wtk)\int_0^T H_t dW_t = \lim_{n \to \infty} \sum_k H_{t_k} (W_{t_{k+1}} - W_{t_k})

(left-endpoint rule - the Ito convention). The Euler-Maruyama scheme for SDEs dx=f(x)dt+g(x)dWdx = f(x)dt + g(x)dW is:

xt+hxt+f(xt)h+g(xt)hεt,εtN(0,1)x_{t+h} \approx x_t + f(x_t) h + g(x_t) \sqrt{h} \, \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0,1)

Euler-Maruyama has strong order $1/2$ (pathwise error $O(\sqrt{h})$) and weak order $1$ (error $O(h)$ when estimating expectations).
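A minimal Euler-Maruyama sketch for the Ornstein-Uhlenbeck SDE $dx = -\theta x\, dt + \sigma\, dW$, checking the weak error against the exact mean $\mathbb{E}[x_T] = x_0 e^{-\theta T}$ (parameter values and path counts are illustrative):

import numpy as np

def euler_maruyama_mean(theta, sigma, x0, T, h, n_paths, seed=0):
    rng = np.random.default_rng(seed)
    n_steps = int(round(T / h))
    x = np.full(n_paths, x0, dtype=float)
    for _ in range(n_steps):
        x += -theta * x * h + sigma * np.sqrt(h) * rng.standard_normal(n_paths)
    return x.mean()

theta, sigma, x0, T = 1.0, 0.5, 1.0, 1.0
exact_mean = x0 * np.exp(-theta * T)
for h in [0.1, 0.05, 0.025]:
    est = euler_maruyama_mean(theta, sigma, x0, T, h, n_paths=200_000)
    print(f"h={h}: weak error in E[x_T] ~ {abs(est - exact_mean):.4f}")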

T.2 Neural SDEs and Diffusion Models

Neural SDE: Replace ff and gg with neural networks fθf_\theta and gθg_\theta:

dx=fθ(t,x)dt+gθ(t,x)dWtdx = f_\theta(t, x) dt + g_\theta(t, x) dW_t

The ELBO for a neural SDE is:

logp(xT)KL(qϕ(xT)p(x0))+Eq ⁣[0TLSDE(xt,t)dt]\log p(x_T) \geq -\text{KL}(q_\phi(x_T) \| p(x_0)) + \mathbb{E}_q\!\left[\int_0^T \mathcal{L}_{\text{SDE}}(x_t, t) dt\right]

The integral over tt is computed via numerical integration (e.g., Euler-Maruyama or higher-order SDE solvers). This is the foundation of score-based diffusion models (Song et al., 2021).

DDPM (Ho et al., 2020) discretizes the diffusion trajectory at T=1000T = 1000 timesteps and computes the ELBO as a sum of Gaussian KL terms - discrete numerical integration over the diffusion trajectory.

Continuous diffusion (Song et al., 2021) works with the continuous-time SDE; sampling integrates the reverse-time SDE or the equivalent probability-flow ODE with numerical solvers (e.g., adaptive RK45; DDIM corresponds to a particular discretization) - so the sampling procedure is itself numerical integration of the reverse dynamics.


Appendix U: Cross-Reference Table

| Topic | Connection | Reference |
| --- | --- | --- |
| Legendre polynomials | GL nodes; Legendre-Gauss rule | Section10-04 (orthogonal polynomials) |
| Chebyshev polynomials | Clenshaw-Curtis; CC weights via DCT | Section10-04 (Chebyshev approximation) |
| Condition numbers | Vandermonde in NC; GL weights always positive | Section10-01, Section10-02 |
| Finite differences | Richardson extrapolation; gradient checking | Section10-03 |
| Fourier analysis | DFT as trapezoidal rule; CC via DCT | Section10-04 |
| Monte Carlo | Stochastic gradient, ELBO, policy gradient | Section07 (Probability) |
| Linear algebra | Golub-Welsch tridiagonal eigenvalue | Section10-02 |
| Optimization | MCMC as sampling from Bayesian posterior | Section08, Section10-03 |
| SDEs/Neural SDEs | Stochastic integration; Ito calculus | Advanced calculus |
| Gaussian processes | GP marginal likelihood; Bayesian quadrature | Section07, Section10-04 |

Appendix V: Worked Examples - Comparing Methods

V.1 Computing 01x(1x)dx=π/8\int_0^1 \sqrt{x(1-x)} dx = \pi/8

This integral has integrable singularities at both endpoints (f(0)=f(1)=0f(0) = f(1) = 0 but f(0+)=f(1)=±f'(0^+) = f'(1^-) = \pm\infty).

Methods and errors:

| Method | $n$ evaluations | Error |
| --- | --- | --- |
| Composite trapezoidal | 1000 | $3.6 \times 10^{-3}$ ($O(h^{3/2})$ - slow due to singularity) |
| Composite Simpson's | 1000 | $1.8 \times 10^{-3}$ ($O(h^{3/2})$ - same!) |
| GL-20 on $[0,1]$ | 20 | $2.4 \times 10^{-3}$ (singularity causes slow convergence) |
| Gauss-Jacobi $\alpha=\beta=1/2$ | 10 | $< 10^{-14}$ (exact for weight $(x(1-x))^{1/2}$!) |
| Tanh-sinh, $h=0.1$ | ~50 | $10^{-12}$ (handles endpoint singularities) |

Lesson: Match the quadrature rule to the weight function. Gauss-Jacobi absorbs the singularity, tanh-sinh handles it automatically.
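A sketch of the Gauss-Jacobi row using scipy.special.roots_jacobi: mapping $[0,1]$ to $[-1,1]$ via $x = (t+1)/2$ turns $\sqrt{x(1-x)}\,dx$ into $\tfrac{1}{4}\sqrt{(1-t)(1+t)}\,dt$, i.e. the Jacobi weight with $\alpha=\beta=1/2$ times the constant function 1:

import numpy as np
from scipy.special import roots_jacobi

nodes, weights = roots_jacobi(10, 0.5, 0.5)   # weight (1-t)^{1/2}(1+t)^{1/2} on [-1, 1]
estimate = 0.25 * weights.sum()               # integrand / weight = 1 after the mapping
print(estimate, np.pi / 8)                    # agree to machine precision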

V.2 High-Dimensional Monte Carlo: [0,1]10ixi2dx=10/3\int_{[0,1]^{10}} \sum_i x_i^2 dx = 10/3

| Method | $n$ evaluations | Error | Time |
| --- | --- | --- | --- |
| MC (random) | $10^4$ | $\approx 0.02$ | Fast |
| QMC (Sobol) | $10^4$ | $\approx 0.003$ | Fast |
| Tensor product GL-3 | $3^{10} = 59049$ | $< 10^{-10}$ | Fast |
| MC (random) | $10^6$ | $\approx 0.002$ | Moderate |

Lesson: For this smooth function in d=10d=10, tensor product GL still wins with fewer evaluations. But for d=50d=50, GL requires 3507×10233^{50} \approx 7 \times 10^{23} evaluations - infeasible.

V.3 Gaussian Expectation: ExN(0,1)[ex2/2]\mathbb{E}_{x \sim \mathcal{N}(0,1)}[e^{x^2/2}]

This integral diverges! ex2/2N(x;0,1)dx=ex2/2ex2/22πdx=12πdx=\int_{-\infty}^\infty e^{x^2/2} \mathcal{N}(x;0,1) dx = \int e^{x^2/2} \frac{e^{-x^2/2}}{\sqrt{2\pi}} dx = \frac{1}{\sqrt{2\pi}} \int dx = \infty.

Warning for AI: Computing expectations over Gaussian distributions requires checking that the function is integrable. In VAE training, large activations can cause Eq[f(z)]\mathbb{E}_q[f(z)] to be numerically infinite - a symptom of a poor variational posterior qq.

V.4 ELBO Term Computation

The KL divergence KL(N(μ,σ2)N(0,1))\text{KL}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0, 1)):

KL=12(σ2+μ21logσ2)\text{KL} = \frac{1}{2}\left(\sigma^2 + \mu^2 - 1 - \log \sigma^2\right)

This is analytical - no quadrature needed. The reconstruction term Eq[logp(xz)]\mathbb{E}_q[\log p(x|z)] requires Monte Carlo (or GH quadrature):

def elbo_reconstruction(x_data, mu_z, sigma_z, decoder, n_gh=20):
    """Compute E_q[log p(x|z)] via Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite.hermgauss(n_gh)
    z_samples = np.sqrt(2) * sigma_z * t + mu_z  # GH nodes mapped to q (scalar latent z assumed)
    log_likelihood = np.array([decoder.log_prob(x_data, z) for z in z_samples])
    return np.dot(w, log_likelihood) / np.sqrt(np.pi)

With nGH=20n_{\text{GH}} = 20, this achieves near-machine precision for smooth likelihoods - far superior to 1-sample MC.


Appendix W: The Connection Between Quadrature and Finite Element Methods

W.1 Galerkin FEM Requires Integration

The finite element method reduces a PDE to a linear system Ku=fKu = f where:

Kij=Ωϕiϕjdx,fi=ΩfϕidxK_{ij} = \int_\Omega \nabla \phi_i \cdot \nabla \phi_j dx, \quad f_i = \int_\Omega f \phi_i dx

Each entry requires numerical integration over the domain. With P1\mathcal{P}_1 (piecewise linear) elements, the stiffness matrix integrals are computed exactly by the midpoint rule. With P3\mathcal{P}_3 (cubic) elements, 4-point GL is needed for exactness.

Rule of thumb: For Pk\mathcal{P}_k elements in dd dimensions, use GL quadrature of degree 2k1\geq 2k-1 on each element.

W.2 Spectral Element Methods

Spectral element methods combine the geometric flexibility of FEM with the spectral accuracy of GL/CC quadrature:

  1. Divide the domain into KK elements
  2. On each element, use a high-degree polynomial basis (pp-th order, typically p=5p = 5-1616)
  3. Integrate with the matching Gauss-Lobatto-Legendre (GLL) quadrature rule

GLL rule: GL rule with the constraint that the endpoints x=±1x=\pm 1 are always included. The n+1n+1 GLL nodes consist of ±1\pm 1 plus the n1n-1 interior zeros of Pn(x)P_n'(x).

For AI: Spectral element methods are used in:

  • Atmospheric and ocean modeling (the standard in climate science)
  • Spectral graph convolution implementations (spectral approximation of graph filters)
  • Neural operators (FNO, DeepONet) for learning PDE solution operators

Appendix X: Self-Test Questions

  1. Why is the nn-point Gauss-Legendre rule exact for polynomials up to degree 2n12n-1 but not 2n2n?

  2. Explain why the composite trapezoidal rule achieves spectral accuracy for smooth periodic functions, even though its error is formally O(h2)O(h^2) for non-periodic functions.

  3. The midpoint rule has a smaller error constant than the trapezoidal rule for the same O(h2)O(h^2) rate. Why? (Hint: think about whether ff is convex or concave between nodes.)

  4. Why does Monte Carlo integration have dimension-independent O(1/n)O(1/\sqrt{n}) convergence while deterministic quadrature rates degrade exponentially with dimension?

  5. The reparameterization trick reduces gradient variance in VAE training. Why is the score function estimator (REINFORCE) higher variance? (Hint: compare what each estimator directly estimates.)

  6. For the Euler-Maclaurin formula Tm(f)=I+c2h2+c4h4+T_m(f) = I + c_2 h^2 + c_4 h^4 + \cdots, if ff is periodic with f(k)(a)=f(k)(b)f^{(k)}(a) = f^{(k)}(b) for all kk, what does this imply for the trapezoidal error?

Answers: See theory.ipynb for numerical verification of each.


Appendix Y: Extended ML Applications

Y.1 Normalizing Flows and the Change of Variables

A normalizing flow defines a sequence of invertible transformations:

z0f1z1f2fKzK=xz_0 \xrightarrow{f_1} z_1 \xrightarrow{f_2} \cdots \xrightarrow{f_K} z_K = x

The log-likelihood is computed via the change-of-variables formula - a sequence of integration by substitution steps:

logpx(x)=logpz0(z0)k=1KlogdetJfk(zk1)\log p_x(x) = \log p_{z_0}(z_0) - \sum_{k=1}^K \log |\det J_{f_k}(z_{k-1})|

Each Jacobian determinant represents the local volume change under the transformation fkf_k. For a dd-dimensional transformation:

$$\log |\det J_{f_k}| = \sum_{i=1}^d \log \left|\left[\partial f_k / \partial z_{k-1}\right]_{ii}\right|$$

(only valid when JfkJ_{f_k} is triangular - the key design choice in autoregressive flows).

Integration interpretation: px(x)p_x(x) is the pushforward of pz0p_{z_0} through f=fKf1f = f_K \circ \cdots \circ f_1. Computing logpx(x)\log p_x(x) is exact - no numerical quadrature needed! The integration is performed analytically by the change-of-variables formula.

Y.2 Diffusion Models as Numerical ODE Solvers

The DDIM sampler (Song et al., 2020) solves the reverse diffusion ODE:

dxdt=f(x,t)12g(t)2xlogpt(x)\frac{dx}{dt} = f(x, t) - \frac{1}{2} g(t)^2 \nabla_x \log p_t(x)

using the score function xlogpt(x)εθ(xt,t)/σt\nabla_x \log p_t(x) \approx -\varepsilon_\theta(x_t, t)/\sigma_t.

Multistep samplers are Adams-Bashforth: DDIM itself is a first-order (Euler-type) discretization of this ODE; multistep samplers such as PNDM/PLMS reuse previous score evaluations to reach 2nd-order accuracy per step:

$$x_{t_{i-1}} \approx x_{t_i} + \frac{\Delta t}{2}\left[3 s_\theta(x_{t_i}, t_i) - s_\theta(x_{t_{i+1}}, t_{i+1})\right]$$

where $s_\theta \approx \nabla_x \log p_t$ is the score. This is exactly the 2-step Adams-Bashforth ODE solver - applied to the reverse-time dynamics.

Implication: Faster diffusion sampling (fewer NFEs) is equivalent to using higher-order quadrature for the reverse SDE. Methods like DPM-Solver++ (Lu et al., 2022) use exponential integrators and 3rd-order Adams-type solvers to reduce from 1000 steps to 10-20 steps.

Y.3 GP Marginal Likelihood via Lanczos Quadrature

The GP log marginal likelihood requires logK+σ2I\log |K + \sigma^2 I| where KRn×nK \in \mathbb{R}^{n \times n} is the kernel matrix.

Stochastic Lanczos quadrature (SLQ): Estimate logA=tr(logA)\log |A| = \text{tr}(\log A):

  1. Sample random probe vectors v1,,vmN(0,I/n)v_1, \ldots, v_m \sim \mathcal{N}(0, I/n)
  2. For each vjv_j, run qq steps of Lanczos iteration to build a tridiagonal matrix TqT_q with AvjQTqQvjA v_j \approx Q T_q Q^\top v_j
  3. Approximate $v_j^\top \log(A) v_j \approx \|v_j\|^2 e_1^\top \log(T_q) e_1$ (this is Gauss quadrature with respect to the spectral measure of $A$ as seen from $v_j$; the eigenvalues of $T_q$ are the nodes)
  4. Average: logAn1mjvjlog(A)vj\log|A| \approx n \cdot \frac{1}{m} \sum_j v_j^\top \log(A) v_j

Complexity: O(nqm)O(n q m) matrix-vector products vs O(n3)O(n^3) for Cholesky. For large nn, SLQ is the only tractable approach.

GPyTorch (Gardner et al., 2018) implements SLQ for GP training with n>104n > 10^4 - enabling GP regression at the scale of deep learning datasets.
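A compact sketch of SLQ for the log-determinant of a dense SPD matrix, following the four steps above (the probe count, Lanczos depth, and test matrix are illustrative; no reorthogonalization is done, which is usually adequate at this depth):

import numpy as np

def lanczos_tridiag(matvec, v, q):
    """q steps of Lanczos from start vector v; returns the q x q tridiagonal T."""
    alphas, betas = [], []
    v = v / np.linalg.norm(v)
    v_prev, beta = np.zeros_like(v), 0.0
    for j in range(q):
        w = matvec(v) - beta * v_prev
        alpha = v @ w
        w -= alpha * v
        alphas.append(alpha)
        if j < q - 1:
            beta = np.linalg.norm(w)
            betas.append(beta)
            v_prev, v = v, w / beta
    return np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)

def slq_logdet(A, n_probes=30, q=30, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    total = 0.0
    for _ in range(n_probes):
        v = rng.standard_normal(n)
        T = lanczos_tridiag(lambda x: A @ x, v, q)
        evals, evecs = np.linalg.eigh(T)
        # Gauss quadrature on the spectral measure: nodes = eigenvalues of T,
        # weights = squared first components of its eigenvectors
        total += np.sum(evecs[0, :]**2 * np.log(evals))
    return n * total / n_probes

B = np.random.default_rng(1).standard_normal((500, 500))
A = B @ B.T + 500 * np.eye(500)                 # well-conditioned SPD test matrix
print(slq_logdet(A), np.linalg.slogdet(A)[1])   # SLQ estimate vs exact log-determinant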


Appendix Z: Final Reference - Decision Tree for Integration in ML

ML INTEGRATION DECISION TREE
========================================================================

  What are you computing?
  |
  +-- E_q[f(z)] with q Gaussian?
  |   +-- Low-dimensional z (d <= 3), need high accuracy?
  |   |   +-- -> Gauss-Hermite quadrature (n=20 nodes)
  |   +-- High-dimensional z, gradient needed?
  |       +-- -> Reparameterization + single MC sample
  |
  +-- log p(x) = -E(x) - log Z (density with normalizing constant)?
  |   +-- p(x) has known structure (flow, Gaussian)?
  |   |   +-- -> Analytical formula (change of variables)
  |   +-- Intractable Z?
  |       +-- Need unbiased estimate of Z?
  |       |   +-- -> AIS (Annealed Importance Sampling)
  |       +-- Need to train model?
  |           +-- -> Score matching / contrastive divergence
  |
  +-- Policy gradient E_pi[G]?
  |   +-- Continuous action, differentiable reward?
  |   |   +-- -> Reparameterization (pathwise gradient)
  |   +-- Discrete action?
  |       +-- -> REINFORCE with baseline (score function)
  |
  +-- GP marginal likelihood log|K + sigma^2 I|?
  |   +-- n <= 5000?
  |   |   +-- -> Cholesky factorization (exact, O(n^3))
  |   +-- n > 5000?
  |       +-- -> Stochastic Lanczos Quadrature (O(nqm))
  |
  +-- ODE/SDE simulation?
      +-- Deterministic ODE?
      |   +-- -> Adaptive Dormand-Prince (RK45)
      +-- Stochastic SDE?
          +-- -> Euler-Maruyama or Milstein scheme

========================================================================

Appendix AA: Error Analysis for the Composite Rules - Detailed

AA.1 Euler-Maclaurin for Composite Trapezoidal

The complete Euler-Maclaurin formula for the composite trapezoidal rule Tm(f)=h[f(a)2+k=1m1f(a+kh)+f(b)2]T_m(f) = h[\frac{f(a)}{2} + \sum_{k=1}^{m-1} f(a+kh) + \frac{f(b)}{2}] with h=(ba)/mh = (b-a)/m:

Tm(f)=abf(x)dx+k=1pB2k(2k)!h2k[f(2k1)(b)f(2k1)(a)]+O(h2p+2)T_m(f) = \int_a^b f(x) dx + \sum_{k=1}^{p} \frac{B_{2k}}{(2k)!} h^{2k} [f^{(2k-1)}(b) - f^{(2k-1)}(a)] + O(h^{2p+2})

Bernoulli numbers: B2=1/6B_2 = 1/6, B4=1/30B_4 = -1/30, B6=1/42B_6 = 1/42, B8=1/30B_8 = -1/30.

Leading error term: $\frac{h^2}{12}[f'(b) - f'(a)] = \frac{(b-a)h^2}{12} \bar{f}''$ for the mean value $\bar{f}''$ of $f''$, so the quadrature error satisfies $I - T_m \approx -\frac{(b-a)h^2}{12} \bar{f}''$.

For p=1p=1 (standard result): Tm(f)=I+h212[f(b)f(a)]+O(h4)T_m(f) = I + \frac{h^2}{12}[f'(b) - f'(a)] + O(h^4)

For Richardson extrapolation: Taking TmT_m and T2mT_{2m}:

T2m=I+h24112[f(b)f(a)]+O(h4)T_{2m} = I + \frac{h^2}{4}\frac{1}{12}[f'(b)-f'(a)] + O(h^4)

So 4T2mTm3=I+O(h4)\frac{4T_{2m} - T_m}{3} = I + O(h^4) - this is exactly the composite Simpson's rule!

AA.2 Superconvergence of Periodic Functions

For ff periodic on [a,b][a,b] with period bab-a: f(k)(a)=f(k)(b)f^{(k)}(a) = f^{(k)}(b) for all kk. Every Euler-Maclaurin boundary term vanishes:

Tm(f)=abf(x)dx+k=1B2k(2k)!h2k0=IT_m(f) = \int_a^b f(x) dx + \sum_{k=1}^{\infty} \frac{B_{2k}}{(2k)!} h^{2k} \cdot 0 = I

The trapezoidal rule is exact to all orders for smooth periodic functions. In practice, the error is bounded by the truncation of the Fourier series - exponentially small for analytic periodic functions.

Example: f(x)=ecos(2πx)f(x) = e^{\cos(2\pi x)} on [0,1][0,1]:

  • Composite trapezoidal with m=5m=5: error 108\approx 10^{-8}
  • Composite trapezoidal with m=10m=10: error 1016\approx 10^{-16}

Compare to $f(x) = x^2$ on $[0,1]$ (not periodic):

  • Composite trapezoidal with $m=5$: error $\approx 6.7 \times 10^{-3}$ ($O(h^2)$)
  • Composite trapezoidal with $m=10$: error $\approx 1.7 \times 10^{-3}$ ($O(h^2)$)

This dramatic difference is the key insight behind the DFT's exact representation of bandlimited periodic signals.
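A sketch contrasting the two cases; scipy.special.i0 supplies the exact value $\int_0^1 e^{\cos(2\pi x)}\, dx = I_0(1)$:

import numpy as np
from scipy.special import i0

def trap(f, m):
    x = np.linspace(0, 1, m + 1); y = f(x)
    return (y[0]/2 + y[1:-1].sum() + y[-1]/2) / m

periodic = lambda x: np.exp(np.cos(2 * np.pi * x))
poly = lambda x: x**2

for m in [5, 10, 20]:
    print(f"m={m:>2}:  periodic error = {abs(trap(periodic, m) - i0(1)):.2e},"
          f"  x^2 error = {abs(trap(poly, m) - 1/3):.2e}")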


Appendix BB: Python Code - Complete Implementations

"""
Complete reference implementations for Section10-05 Numerical Integration.
All functions tested against scipy.integrate.quad.
"""
import numpy as np
from scipy import integrate

def composite_trap(f, a, b, m):
    """Composite trapezoidal: O(h^2) for smooth f."""
    h = (b-a)/m
    x = np.linspace(a, b, m+1)
    y = f(x)
    return h * (y[0]/2 + y[1:-1].sum() + y[-1]/2)

def composite_simpson(f, a, b, m):
    """Composite Simpson's: O(h^4). m must be even."""
    assert m % 2 == 0
    h = (b-a)/m
    x = np.linspace(a, b, m+1)
    y = f(x)
    return h/3 * (y[0] + 4*y[1:-1:2].sum() + 2*y[2:-2:2].sum() + y[-1])

def gauss_legendre(n):
    """n-point GL nodes and weights via Golub-Welsch."""
    k = np.arange(1, n)
    J = np.diag(k / np.sqrt(4*k**2 - 1), -1)
    J += J.T
    evals, evecs = np.linalg.eigh(J)
    idx = np.argsort(evals)
    return evals[idx], 2 * evecs[0, idx]**2

def gl_integrate(f, a, b, n=10):
    """GL quadrature on [a,b] with n nodes."""
    x, w = gauss_legendre(n)
    x_mapped = (a+b)/2 + (b-a)/2 * x
    return (b-a)/2 * np.dot(w, f(x_mapped))

def romberg_integrate(f, a, b, max_k=8, tol=1e-12):
    """Romberg's method via repeated Richardson extrapolation."""
    R = np.zeros((max_k, max_k))
    h = b - a
    R[0, 0] = h/2 * (f(a) + f(b))
    for k in range(1, max_k):
        h /= 2
        n_new = 2**(k-1)
        x_new = a + h * (2*np.arange(1, n_new+1) - 1)
        R[k, 0] = R[k-1, 0]/2 + h * f(x_new).sum()
        for j in range(1, k+1):
            R[k, j] = (4**j * R[k, j-1] - R[k-1, j-1]) / (4**j - 1)
        if abs(R[k, k] - R[k-1, k-1]) < tol:
            return R[k, k], k
    return R[max_k-1, max_k-1], max_k

def monte_carlo(f, domain_sampler, volume, n, seed=42):
    """Basic Monte Carlo integration."""
    np.random.seed(seed)
    vals = f(domain_sampler(n))
    return volume * vals.mean(), volume * vals.std() / np.sqrt(n)

def gauss_hermite_expect(f, mu, sigma, n=20):
    """E_{x~N(mu,sigma^2)}[f(x)] via GH quadrature."""
    t, w = np.polynomial.hermite.hermgauss(n)
    return np.dot(w, f(np.sqrt(2)*sigma*t + mu)) / np.sqrt(np.pi)

# Verification
if __name__ == '__main__':
    f = np.exp
    exact = np.e - 1
    print(f"Trapezoidal (m=100): error = {abs(composite_trap(f,0,1,100) - exact):.2e}")
    print(f"Simpson's (m=100):   error = {abs(composite_simpson(f,0,1,100) - exact):.2e}")
    print(f"GL-10:               error = {abs(gl_integrate(f,0,1,10) - exact):.2e}")
    result, levels = romberg_integrate(f, 0, 1)
    print(f"Romberg:             error = {abs(result - exact):.2e} (levels={levels})")
    gh_val = gauss_hermite_expect(lambda x: np.exp(-x**2/2),
                                   mu=0, sigma=1, n=20) * np.sqrt(2*np.pi)
    print(f"GH integral of exp(-x^2): {gh_val:.10f} (expected sqrt(pi) = {np.sqrt(np.pi):.10f})")