"Integration is not just finding areas - it is the mathematics of accumulation, and accumulation is the mathematics of learning."
Overview
Integration is the second pillar of calculus, dual to differentiation. Where the derivative asks "how fast is this changing right now?", the integral asks "how much has accumulated over this interval?" The Fundamental Theorem of Calculus (FTC) reveals these two operations as inverses of each other - one of the most profound connections in mathematics.
Every core operation in machine learning involves integration. The expected loss E[L] is an integral over a data distribution. KL divergence and entropy - the foundations of information-theoretic learning - are improper integrals of probability densities. Stochastic gradient descent is a Monte Carlo estimator of a gradient integral. Normalizing flows use the change-of-variables formula from integration theory. Understanding integration at the level of definitions and proofs, not just formulas, separates practitioners who can innovate from those who can only apply recipes.
This section builds the complete single-variable integration theory: Riemann sums, the FTC, all standard techniques (substitution, parts, partial fractions), improper integrals, numerical methods (trapezoid, Simpson's, Monte Carlo), and integration in probability theory. The treatment is rigorous but always anchored to concrete ML applications.
Prerequisites
Limits and continuity - 01-Limits-and-Continuity: limit definition, squeeze theorem, continuity on [a,b]
The derivative measures instantaneous rate of change. The integral measures total accumulation. These are two sides of the same coin - and the FTC is the theorem that makes this precise.
Physical intuition. If v(t) is velocity at time t, then ∫_a^b v(t) dt is the total distance traveled from time a to b. Each infinitesimal time slice dt contributes a tiny distance v(t) dt; the integral sums them all.
Geometric intuition. ∫_a^b f(x) dx is the signed area between the curve y = f(x) and the x-axis over [a,b]. Area above the axis is positive; area below is negative.
Probabilistic intuition. If p(x) is a probability density, ∫_a^b p(x) dx is the probability that a random draw falls in [a,b]. Integration is the language of probability.
1.2 Historical Motivation
| Year | Development |
| --- | --- |
| ~250 BCE | Archimedes uses the method of exhaustion to compute areas of parabolic segments |
| 1670s | Newton and Leibniz independently develop calculus; Leibniz introduces the ∫ notation |
| 1823 | Cauchy gives a rigorous definition of the definite integral |
| 1854 | Riemann formalizes the integral via sums - the definition used today |
| 1875 | Darboux introduces upper/lower sums, simplifying Riemann's theory |
| 1902 | Lebesgue generalizes the integral to a much broader class of functions |
The ∫ symbol is an elongated "S" for summa (Latin: sum) - Leibniz's notation for the limit of infinitely many infinitesimal summands.
1.3 Why Integration Is Central to AI
Expected loss. The training objective of every probabilistic model is:
L(θ) = E_{(x,y)∼p}[ℓ(f_θ(x), y)] = ∫ ℓ(f_θ(x), y) p(x,y) dx dy
This is an integral. Stochastic gradient descent estimates it via a Monte Carlo sum over a mini-batch.
KL divergence. The loss function used in variational autoencoders (VAEs), diffusion models, and language model alignment (RLHF):
KL(q∥p) = ∫ q(x) ln(q(x)/p(x)) dx
Entropy. The Shannon entropy of a continuous distribution:
H(p)=−∫p(x)lnp(x)dx
This integral appears in cross-entropy loss, mutual information, and the information-theoretic analysis of generalization.
Normalizing flows. Change-of-variables: if z=f(x) is a bijection, the density transforms as:
p_X(x) = p_Z(f(x)) ∣det ∂f/∂x∣
The absolute Jacobian determinant is the multidimensional version of the substitution rule from this section.
Gaussian integral. The normalization constant of every Gaussian distribution relies on ∫_{−∞}^{∞} e^{−x²} dx = √π - an improper integral computed in 8.4.
For AI: Every backpropagation pass computes a stochastic estimate of the gradient of an integral (the expected loss) - integration and differentiation are inseparable in machine learning.
2. The Definite Integral - Riemann's Definition
2.1 Partitions and Riemann Sums
A partition of [a,b] is a finite set of points a = x₀ < x₁ < ⋯ < xₙ = b. The mesh (or norm) is ‖P‖ = max_k (x_k − x_{k−1}).
For each subinterval [x_{k−1}, x_k] of width Δx_k = x_k − x_{k−1}, choose a sample point x_k* ∈ [x_{k−1}, x_k]. The Riemann sum is:
S(P, f) = Σ_{k=1}^{n} f(x_k*) Δx_k
Three standard choices of x_k*:
Left endpoint: x_k* = x_{k−1} -> left Riemann sum L_n
Right endpoint: x_k* = x_k -> right Riemann sum R_n
Midpoint: x_k* = (x_{k−1} + x_k)/2 -> midpoint Riemann sum M_n
Example. f(x) = x² on [0,1], uniform partition with n intervals, right endpoints:
R_n = Σ_{k=1}^{n} (k/n)² · (1/n) = (1/n³) · n(n+1)(2n+1)/6 = (n+1)(2n+1)/(6n²) → 1/3
2.2 Definition of the Definite Integral
The definite integral of f over [a,b] is the limit of Riemann sums as the mesh shrinks:
∫_a^b f(x) dx = lim_{‖P‖→0} S(P, f)
provided this limit exists and is the same for every choice of sample points. When this limit exists, f is Riemann integrable on [a,b].
For uniform partitions (Δx=(b−a)/n, right endpoints):
∫_a^b f(x) dx = lim_{n→∞} Σ_{k=1}^{n} f(a + k·(b−a)/n) · (b−a)/n
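The uniform-partition formula above can be checked numerically. A minimal sketch (function names are my own) computing right-endpoint Riemann sums for f(x) = x² on [0,1], which should approach the exact value 1/3 as n grows:

```python
# Right-endpoint Riemann sums for f(x) = x^2 on [0, 1].
# The sums should converge to the exact integral 1/3 as n increases.

def right_riemann(f, a, b, n):
    """Right-endpoint Riemann sum with n equal subintervals."""
    dx = (b - a) / n
    return sum(f(a + k * dx) for k in range(1, n + 1)) * dx

f = lambda x: x * x
approx = {n: right_riemann(f, 0.0, 1.0, n) for n in (10, 100, 1000)}
# For a monotone integrand the error shrinks roughly like 1/n.
errors = {n: abs(v - 1 / 3) for n, v in approx.items()}
```

For increasing f, the right sum overestimates and the left sum underestimates, so the two bracket the true value.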
2.3 Geometric Interpretation: Signed Area
The integral computes signed area: regions where f(x)>0 contribute positively; regions where f(x)<0 contribute negatively.
∫_0^{2π} sin x dx = 0 (positive and negative areas cancel); ∫_0^{2π} ∣sin x∣ dx = 4 (total unsigned area)
2.4 Properties of the Definite Integral
For integrable f,g on [a,b]:
| Property | Formula |
| --- | --- |
| Linearity | ∫_a^b [c·f(x) + g(x)] dx = c∫_a^b f(x) dx + ∫_a^b g(x) dx |
| Additivity | ∫_a^b f dx = ∫_a^c f dx + ∫_c^b f dx for any c ∈ [a,b] |
| Monotonicity | f ≤ g ⇒ ∫_a^b f dx ≤ ∫_a^b g dx |
| Reverse limits | ∫_b^a f dx = −∫_a^b f dx |
| Zero-width | ∫_a^a f dx = 0 |
| Bound | ∣∫_a^b f dx∣ ≤ ∫_a^b ∣f(x)∣ dx |
Mean Value Theorem for Integrals. If f is continuous on [a,b], there exists c∈(a,b) with:
f(c) = (1/(b−a)) ∫_a^b f(x) dx
The integral equals the function value at some interior point times the interval length.
2.5 Integrability Conditions
Sufficient condition 1: If f is continuous on [a,b], then f is Riemann integrable. (This covers all smooth functions and all activation functions in ML.)
Sufficient condition 2: If f is bounded and monotone on [a,b], then f is Riemann integrable.
Sufficient condition 3: If f is bounded on [a,b] and has only finitely many discontinuities, then f is Riemann integrable. (This covers step functions; ReLU itself is continuous - only its derivative jumps at 0 - so condition 1 already covers it.)
3. Antiderivatives and the Indefinite Integral
3.1 Definition and the +C Convention
A function F is an antiderivative of f on an interval I if F′(x)=f(x) for all x∈I.
Theorem (Uniqueness up to constant). If F and G are both antiderivatives of f on I, then F(x)−G(x)=C for some constant C.
Proof. Let H=F−G. Then H′(x)=F′(x)−G′(x)=f(x)−f(x)=0 for all x∈I. By the MVT corollary, H′≡0 implies H is constant. □
The indefinite integral encodes the entire family of antiderivatives:
∫f(x)dx=F(x)+C
where C is an arbitrary constant. The +C is not optional - it represents a genuinely different function for each value of C.
3.2 Basic Antiderivative Table
| f(x) | ∫f(x)dx | Condition |
| --- | --- | --- |
| xⁿ | xⁿ⁺¹/(n+1) + C | n ≠ −1 |
| x⁻¹ = 1/x | ln∣x∣ + C | x ≠ 0 |
| eˣ | eˣ + C | |
| aˣ | aˣ/ln a + C | a > 0, a ≠ 1 |
| sin x | −cos x + C | |
| cos x | sin x + C | |
| sec²x | tan x + C | |
| 1/√(1−x²) | arcsin x + C | ∣x∣ < 1 |
| 1/(1+x²) | arctan x + C | |
| sinh x | cosh x + C | |
| cosh x | sinh x + C | |
Verification: Every row can be checked by differentiating the right side.
3.3 Linearity
∫[cf(x)+g(x)]dx=c∫f(x)dx+∫g(x)dx
This follows immediately from the linearity of differentiation.
Example. ∫(3x² − 5 cos x + eˣ) dx = x³ − 5 sin x + eˣ + C.
3.4 Initial Value Problems
An initial value problem (IVP) specifies f′(x) and an initial condition f(x0)=y0, and asks for f(x).
Procedure:
Find the general antiderivative: f(x) = ∫f′(x) dx = F(x) + C
Apply the initial condition: y₀ = F(x₀) + C ⇒ C = y₀ − F(x₀)
Example. f′(x) = 3x² − 2x, f(1) = 5.
f(x) = x³ − x² + C. At x = 1: 1 − 1 + C = 5 ⇒ C = 5.
So f(x) = x³ − x² + 5.
For AI. Solving θ̇ = −∇L(θ) as an ODE gives the continuous-time analogue of gradient descent. The solution is an integral: θ(T) = θ(0) − ∫_0^T ∇L(θ(t)) dt. Neural ODEs (Chen et al., 2018) make this explicit by parameterizing the dynamics as a neural network and solving the integral numerically via an ODE solver.
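The gradient-flow integral can be approximated with a forward-Euler step, and each Euler step is exactly a gradient descent update. A minimal sketch on the toy loss L(θ) = θ²/2 (my own choice; its flow has the exact solution θ(t) = θ(0)·e^{−t}):

```python
import math

# Forward-Euler discretization of gradient flow dθ/dt = -L'(θ) for the toy
# loss L(θ) = θ²/2, so L'(θ) = θ. Each Euler step "θ -= dt * grad" is a
# gradient descent update with learning rate dt.

def gradient_flow_euler(theta0, grad, T, steps):
    dt = T / steps
    theta = theta0
    for _ in range(steps):
        theta -= dt * grad(theta)   # one gradient descent step
    return theta

theta_T = gradient_flow_euler(2.0, lambda t: t, T=1.0, steps=1000)
exact = 2.0 * math.exp(-1.0)   # exact flow: θ(t) = θ(0)·e^{-t}
```

As the step count grows, the discrete trajectory converges to the continuous flow; this is the sense in which gradient descent "integrates" the ODE.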
4. The Fundamental Theorem of Calculus
The FTC is the central theorem of calculus - it reveals that differentiation and integration are inverse operations and provides a practical method for evaluating definite integrals.
4.1 FTC Part 1
Theorem (FTC Part 1). Let f be continuous on [a,b] and define:
G(x) = ∫_a^x f(t) dt, x ∈ [a,b]
Then G is differentiable on (a,b) and G′(x)=f(x).
Proof. For h>0:
(G(x+h) − G(x))/h = (1/h) ∫_x^{x+h} f(t) dt
By the MVT for integrals, there exists c_h ∈ (x, x+h) with (1/h) ∫_x^{x+h} f(t) dt = f(c_h).
As h → 0⁺: c_h → x, so by continuity of f: f(c_h) → f(x). A symmetric argument handles h → 0⁻. Therefore G′(x) = f(x). □
Interpretation. The area-accumulation function G(x) = ∫_a^x f(t) dt has derivative f(x) - the rate of growth of accumulated area at x equals the function value f(x). This is obvious geometrically: adding a thin strip of height f(x) and width h gives area ≈ f(x)·h.
Generalization (Leibniz rule):
d/dx ∫_{g(x)}^{h(x)} f(t) dt = f(h(x))·h′(x) − f(g(x))·g′(x)
4.2 FTC Part 2
Theorem (FTC Part 2). If f is continuous on [a,b] and F is any antiderivative of f, then:
∫_a^b f(x) dx = F(b) − F(a) ≡ [F(x)]_a^b
Proof. Let G(x) = ∫_a^x f(t) dt. By Part 1, G′(x) = f(x) = F′(x). So G(x) − F(x) = C (constant). At x = a: G(a) − F(a) = 0 − F(a), giving C = −F(a). At x = b: G(b) = ∫_a^b f(t) dt = F(b) − F(a). □
4.3 The Bridge
Why FTC Part 2 is powerful. Before the FTC, computing ∫_a^b f(x) dx required constructing Riemann sums and taking limits - laborious for any non-trivial function. The FTC reduces this to: find any antiderivative F, evaluate at b and a, subtract.
Example. ∫_0^1 x² dx = [x³/3]_0^1 = 1/3 − 0 = 1/3. (Recall: directly computing via Riemann sums required summing Σk².)
4.4 Worked Examples
Example 1. ∫_1^e (1/x) dx = [ln x]_1^e = ln e − ln 1 = 1 − 0 = 1.
Example 2. ∫_0^π sin x dx = [−cos x]_0^π = (−cos π) − (−cos 0) = 1 + 1 = 2.
Example 3. ∫_{−1}^1 eˣ dx = [eˣ]_{−1}^1 = e − e⁻¹ = e − 1/e.
Example 4 (net displacement vs distance). A particle has velocity v(t) = t² − 4. On [0,3]:
Net displacement = ∫_0^3 (t² − 4) dt = [t³/3 − 4t]_0^3 = 9 − 12 = −3
Distance = ∫_0^3 ∣t² − 4∣ dt = ∫_0^2 (4 − t²) dt + ∫_2^3 (t² − 4) dt = 16/3 + 7/3 = 23/3
4.5 FTC and Automatic Differentiation
FTC Part 1 has a direct counterpart in modern ML: the adjoint method for training neural ODEs. The derivative of a loss L with respect to the initial state z(0), where z(T) = z(0) + ∫_0^T f(z(t), t; θ) dt, is computed via an integral that runs backward in time. This is FTC Part 1 applied to the adjoint state - the computational core of the torchdiffeq library.
5. Integration by Substitution
5.1 The Reverse Chain Rule
Theorem. If u=g(x) is differentiable and f is continuous on the range of g, then:
∫f(g(x)) g′(x) dx = ∫f(u) du (evaluated at u = g(x))
Proof. Let F be an antiderivative of f, so F′=f. By the chain rule:
d/dx [F(g(x))] = F′(g(x))·g′(x) = f(g(x))·g′(x)
Therefore F(g(x)) is an antiderivative of f(g(x))⋅g′(x). □
Procedure:
Identify a function u=g(x) whose derivative g′(x) appears (or nearly appears) in the integrand.
Compute du=g′(x)dx.
Substitute: replace g(x) with u and g′(x)dx with du.
Integrate in u.
Back-substitute: replace u with g(x).
5.2 Definite Integrals: Changing Limits
For ∫_a^b f(g(x)) g′(x) dx with u = g(x): change the limits to g(a) and g(b):
∫_a^b f(g(x)) g′(x) dx = ∫_{g(a)}^{g(b)} f(u) du
This avoids back-substitution at the end.
5.3 Worked Examples
Example 1. ∫ sin(x²)·2x dx. Let u = x², du = 2x dx:
= ∫ sin u du = −cos u + C = −cos(x²) + C
Example 2. ∫_0^1 e^{−x²}·(−2x) dx. Let u = −x², limits: u(0) = 0, u(1) = −1:
= ∫_0^{−1} eᵘ du = [eᵘ]_0^{−1} = e⁻¹ − 1
Example 3. ∫ x/(x²+1) dx. Let u = x² + 1, du = 2x dx:
= (1/2) ∫ du/u = (1/2) ln∣u∣ + C = (1/2) ln(x²+1) + C
Example 4. ∫ tan x dx = ∫ (sin x / cos x) dx. Let u = cos x, du = −sin x dx:
= −∫ du/u = −ln∣cos x∣ + C = ln∣sec x∣ + C
Example 5 (softmax normalization). For z ∈ ℝᴷ:
Z = Σ_{k=1}^{K} e^{z_k} = e^{z_max} Σ_{k=1}^{K} e^{z_k − z_max}
The subtraction of z_max is a discrete analogue of the substitution u = z − z_max, making all terms ≤ 1 for numerical stability.
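The stability trick above is easy to demonstrate. A minimal sketch (function name is my own): a softmax that subtracts z_max before exponentiating, so that logits like 1000 do not overflow even though math.exp(1000) would:

```python
import math

# Numerically stable softmax: subtract z_max before exponentiating so every
# exponent is <= 0. Mathematically a no-op (the factor e^{z_max} cancels in
# the ratio); computationally it prevents overflow for large logits.

def softmax(z):
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([1000.0, 1001.0, 1002.0])   # naive math.exp(1000) overflows
```

Because softmax is invariant to a constant shift of all logits, softmax([1000, 1001, 1002]) equals softmax([0, 1, 2]).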
5.4 For AI: Change of Variables in Normalizing Flows
A normalizing flow defines a bijection x = f(z) where z ∼ p_Z. The change-of-variables formula:
p_X(x) = p_Z(f⁻¹(x)) ∣det J_{f⁻¹}(x)∣
is the multivariate version of u-substitution: ∫_a^b f(g(x)) g′(x) dx = ∫_{g(a)}^{g(b)} f(u) du, where ∣g′(x)∣ becomes the absolute Jacobian determinant. Real NVP, Glow, and FFJORD all implement variants of this transformation to learn complex densities from simple base distributions.
6. Integration by Parts
6.1 The Reverse Product Rule
Theorem. If u(x) and v(x) are differentiable, then:
∫udv=uv−∫vdu
Proof. The product rule gives (uv)′=u′v+uv′. Integrating both sides:
uv=∫u′vdx+∫uv′dx
Rearranging: ∫uv′dx=uv−∫u′vdx. Writing dv=v′dx and du=u′dx gives the formula. □
For definite integrals:
∫_a^b u dv = [uv]_a^b − ∫_a^b v du
6.2 Choosing u and dv - LIATE
The acronym LIATE gives a priority order for choosing u (choose the type that comes first):
Logarithms: ln x, log_a x
Inverse trig: arctan x, arcsin x
Algebraic: polynomials xⁿ, √x
Trigonometric: sin x, cos x
Exponential: eˣ, aˣ
The complementary factor becomes dv and must be something we can integrate.
6.3 Reduction Formulas and Tabular Method
For integrals like ∫xⁿeˣ dx, repeated integration by parts produces a reduction formula:
∫xⁿeˣ dx = xⁿeˣ − n∫xⁿ⁻¹eˣ dx
The tabular method (also called "tic-tac-toe" or "successive differentiation") organizes repeated parts efficiently:
| Sign | Differentiate (u) | Integrate (dv) |
| --- | --- | --- |
| + | x³ | eˣ |
| − | 3x² | eˣ |
| + | 6x | eˣ |
| − | 6 | eˣ |
| + | 0 | eˣ |
Result: ∫x³eˣ dx = eˣ(x³ − 3x² + 6x − 6) + C.
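Any antiderivative claim can be checked by differentiating it back, as the table verification note suggests. A minimal sketch (helper names are my own) confirming the tabular-method result with a central finite difference:

```python
import math

# Check ∫x³eˣdx = eˣ(x³ - 3x² + 6x - 6) + C by verifying F'(x) ≈ x³eˣ
# numerically at a few sample points.

def F(x):
    """Claimed antiderivative from the tabular method."""
    return math.exp(x) * (x**3 - 3 * x**2 + 6 * x - 6)

def f(x):
    """Original integrand."""
    return x**3 * math.exp(x)

def central_diff(g, x, h=1e-6):
    """Symmetric difference quotient, O(h²) accurate."""
    return (g(x + h) - g(x - h)) / (2 * h)

max_err = max(abs(central_diff(F, x) - f(x)) for x in (-1.0, 0.5, 2.0))
```

If the tabular signs were wrong at any step, the mismatch would show up immediately at these sample points.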
6.4 Worked Examples
Example 1. ∫x eˣ dx. Let u = x, dv = eˣ dx:
= x eˣ − ∫eˣ dx = x eˣ − eˣ + C = eˣ(x − 1) + C
Example 2. ∫ln x dx. Let u = ln x, dv = dx:
= x ln x − ∫x·(1/x) dx = x ln x − x + C = x(ln x − 1) + C
Example 3. ∫x² sin x dx. Apply the tabular method (alternating signs):
= −x² cos x + 2x sin x + 2 cos x + C
Example 4 (cyclic). ∫eˣ sin x dx. Let I = ∫eˣ sin x dx. Integrate by parts twice; a copy of I reappears:
I = eˣ sin x − ∫eˣ cos x dx = eˣ sin x − eˣ cos x − I ⇒ I = (1/2)eˣ(sin x − cos x) + C
6.5 For AI: The Log-Derivative Trick
The REINFORCE algorithm (Williams, 1992) estimates ∇θEτ∼pθ[R(τ)] where τ is a trajectory and R is return. Integration by parts (in the form of the log-derivative trick) gives:
∇_θ E_{x∼p_θ}[f(x)] = E_{x∼p_θ}[f(x) ∇_θ log p_θ(x)]
This identity - differentiating through an expectation - is the continuous-distribution version of integration by parts, and is the core of policy gradient methods in reinforcement learning (PPO, GRPO, RLHF fine-tuning).
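The identity can be sanity-checked on a toy case. A minimal sketch under my own assumptions: for x ∼ N(μ, 1) we have ∇_μ E[x²] = 2μ and ∇_μ log p(x) = x − μ, so the Monte Carlo average of f(x)·(x − μ) should approach 2μ:

```python
import math, random

# Score-function (log-derivative / REINFORCE) gradient estimator, sketched
# for a 1-D Gaussian policy: x ~ N(mu, 1), f(x) = x².
# True gradient: d/dmu E[x²] = d/dmu (mu² + 1) = 2·mu.

random.seed(0)

def reinforce_grad(mu, f, n=200_000):
    total = 0.0
    for _ in range(n):
        x = random.gauss(mu, 1.0)
        total += f(x) * (x - mu)   # f(x) · ∇_mu log N(x; mu, 1)
    return total / n

mu = 0.5
estimate = reinforce_grad(mu, lambda x: x * x)   # should be ≈ 2·mu = 1.0
```

Note the estimator never differentiates f itself - only the log-density - which is why it works for non-differentiable rewards in RL, at the cost of high variance.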
7. Partial Fractions
7.1 Decomposing Rational Functions
Partial fraction decomposition writes a rational function P(x)/Q(x) (where degP<degQ) as a sum of simpler fractions. This converts integrals of rational functions into sums of basic integrals.
Setup:
Factor Q(x) completely over R.
Write the partial fraction decomposition.
Solve for unknown constants (cover-up method or comparing coefficients).
Integrate each term.
7.2 Cases and Worked Examples
Case 1: Distinct linear factors. Q(x) = (x−a₁)(x−a₂)⋯(x−aₙ):
P(x)/Q(x) = A₁/(x−a₁) + A₂/(x−a₂) + ⋯ + Aₙ/(x−aₙ)
Example. ∫ 1/(x²−1) dx = ∫ 1/((x−1)(x+1)) dx.
Decompose: 1/((x−1)(x+1)) = A/(x−1) + B/(x+1).
Multiply both sides by (x−1)(x+1): 1 = A(x+1) + B(x−1). Setting x = 1 gives A = 1/2; setting x = −1 gives B = −1/2. Therefore:
∫ 1/(x²−1) dx = (1/2) ln∣x−1∣ − (1/2) ln∣x+1∣ + C = (1/2) ln∣(x−1)/(x+1)∣ + C
For AI. Partial fractions appear in z-transform analysis (discrete-time signal processing), computing closed-form solutions to linear recurrences (relevant to LSTM and S4 state space models), and in Laplace transform computations for control-theoretic analysis of learning dynamics.
8. Improper Integrals
8.1 Type I: Infinite Limits
Definition. For f continuous on [a,∞):
∫_a^∞ f(x) dx = lim_{b→∞} ∫_a^b f(x) dx
If the limit exists and is finite, the integral converges; otherwise it diverges.
Similarly ∫_{−∞}^b f(x) dx = lim_{a→−∞} ∫_a^b f(x) dx, and ∫_{−∞}^{∞} f dx = ∫_{−∞}^c f dx + ∫_c^∞ f dx for any c (both pieces must converge separately).
Example 1 (exponential decay).
∫_0^∞ e⁻ˣ dx = lim_{b→∞} [−e⁻ˣ]_0^b = lim_{b→∞} (1 − e⁻ᵇ) = 1
Example 2 (p-integral).
∫_1^∞ (1/xᵖ) dx = 1/(p−1) if p > 1; diverges if p ≤ 1
Proof. For p ≠ 1: ∫_1^b x⁻ᵖ dx = (b¹⁻ᵖ − 1)/(1−p). As b → ∞: converges iff 1−p < 0 iff p > 1. For p = 1 the integral is ln b → ∞. □
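The dichotomy in the p-integral is visible from the partial integrals themselves. A minimal sketch (function name is my own) evaluating ∫_1^b x⁻ᵖ dx = (b¹⁻ᵖ − 1)/(1−p) for growing b:

```python
# Partial integrals of the p-test on [1, ∞): for p = 2 they approach
# 1/(p-1) = 1; for p = 1/2 they grow without bound.

def partial_integral(p, b):
    """∫_1^b x^{-p} dx in closed form, valid for p != 1."""
    return (b ** (1 - p) - 1) / (1 - p)

converging = [partial_integral(2.0, b) for b in (10, 100, 1000)]   # -> 1
diverging = [partial_integral(0.5, b) for b in (10, 100, 1000)]    # -> ∞
```

For p = 2 the values are 1 − 1/b, visibly saturating at 1; for p = 1/2 they are 2(√b − 1), growing like √b.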
Consequence. The standard Gaussian N(0,1) normalizes:
∫_{−∞}^{∞} (1/√(2π)) e^{−x²/2} dx = 1
(the substitution u = x/√2 converts this to the Gaussian integral ∫_{−∞}^{∞} e^{−u²} du = √π).
8.5 For AI: Entropy, KL Divergence, and Expected Loss
Entropy of a Gaussian. For X ∼ N(μ, σ²):
H(X) = −∫_{−∞}^{∞} p(x) ln p(x) dx = (1/2) ln(2πeσ²)
This improper integral converges because p(x)lnp(x)→0 faster than any polynomial as x→±∞.
KL divergence between Gaussians. For p = N(μ₁, σ₁²), q = N(μ₂, σ₂²):
KL(p∥q) = ln(σ₂/σ₁) + (σ₁² + (μ₁−μ₂)²)/(2σ₂²) − 1/2
This closed form follows from evaluating ∫_{−∞}^{∞} p(x)[ln p(x) − ln q(x)] dx using the Gaussian integral. It appears as the regularization term in the VAE ELBO objective.
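The closed form can be cross-checked against a direct Monte Carlo estimate of E_{x∼p}[ln p(x) − ln q(x)]. A minimal sketch (function names are my own):

```python
import math, random

# Closed-form KL between two Gaussians vs. a Monte Carlo estimate of
# E_{x~p}[ln p(x) - ln q(x)]. They should agree up to sampling noise.

def kl_gauss(mu1, s1, mu2, s2):
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

def log_pdf(x, mu, s):
    return -0.5 * math.log(2 * math.pi * s * s) - (x - mu) ** 2 / (2 * s * s)

random.seed(0)
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
closed = kl_gauss(mu1, s1, mu2, s2)   # ln 2 + 2/8 - 1/2 ≈ 0.4431

n = 100_000
mc = sum(log_pdf(x, mu1, s1) - log_pdf(x, mu2, s2)
         for x in (random.gauss(mu1, s1) for _ in range(n))) / n
```

The Monte Carlo route is what you fall back on when no closed form exists - e.g. KL between a Gaussian and a mixture.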
Expected cross-entropy loss. The population risk:
R(θ)=−∫p(x,y)logqθ(y∣x)dxdy
is an improper integral over the joint distribution. SGD computes an unbiased Monte Carlo estimate using a mini-batch.
9. Numerical Integration
When an antiderivative cannot be expressed in closed form (e.g., ∫e^{−x²} dx, ∫sin(x²) dx), numerical methods approximate the definite integral.
9.1 Trapezoid Rule
Approximate the integrand on each subinterval by a straight line (trapezoid). With h = (b−a)/n and nodes x_k = a + kh:
T_n = h[f(x₀)/2 + f(x₁) + ⋯ + f(x_{n−1}) + f(x_n)/2]
The error is O(h²): halving h reduces the error by a factor of 4.
9.2 Simpson's Rule
Approximate the integrand by parabolas over pairs of subintervals (n even):
S_n = (h/3)[f(x₀) + 4f(x₁) + 2f(x₂) + 4f(x₃) + ⋯ + 4f(x_{n−1}) + f(x_n)]
Pattern: 1, 4, 2, 4, 2, ..., 4, 1, with coefficients summing to 3n (so that (h/3)·3n = b − a).
Error bound. If ∣f⁽⁴⁾(x)∣ ≤ M:
∣E_S∣ ≤ M(b−a)⁵/(180n⁴) = O(h⁴)
Simpson's rule is fourth-order - halving h reduces error by a factor of 16.
Example. Estimate ∫_0^1 eˣ dx (true value: e − 1 ≈ 1.71828) with n = 4:
h = 0.25, x_k ∈ {0, 0.25, 0.5, 0.75, 1}.
S₄ = (0.25/3)[e⁰ + 4e^{0.25} + 2e^{0.5} + 4e^{0.75} + e¹] ≈ 1.71832 (error ≈ 3.7×10⁻⁵ - four-decimal accuracy from just five function evaluations).
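Both composite rules are a few lines each. A minimal sketch (function names are my own) reproducing the example above and showing Simpson's fourth-order advantage:

```python
import math

# Composite trapezoid and Simpson's rules, demonstrated on ∫_0^1 eˣ dx = e - 1.

def trapezoid(f, a, b, n):
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + k * h) for k in range(1, n)) + 0.5 * f(b))

def simpson(f, a, b, n):
    """Composite Simpson's rule; n must be even."""
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + k * h) for k in range(1, n, 2))   # odd nodes: weight 4
    s += 2 * sum(f(a + k * h) for k in range(2, n, 2))   # even interior: weight 2
    return s * h / 3

true_val = math.e - 1
t4 = trapezoid(math.exp, 0.0, 1.0, 4)
s4 = simpson(math.exp, 0.0, 1.0, 4)   # already ≈ 4-decimal accurate
```

With the same five function evaluations, Simpson's error is two orders of magnitude smaller than the trapezoid's.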
9.3 Gaussian Quadrature
Instead of equally spaced nodes, choose n optimal nodes {x_k} and weights {w_k} to exactly integrate all polynomials of degree ≤ 2n−1:
∫_{−1}^{1} f(x) dx ≈ Σ_{k=1}^{n} w_k f(x_k)
The nodes are the roots of the Legendre polynomial P_n(x). Gauss-Legendre quadrature is the most accurate standard quadrature rule for smooth functions - its error term involves f⁽²ⁿ⁾, so n nodes buy 2n orders of accuracy.
9.4 Monte Carlo Integration
Idea. For ∫_a^b f(x) dx: draw n i.i.d. samples X₁, …, Xₙ ∼ Uniform(a,b) and estimate:
Î_n = ((b−a)/n) Σ_{k=1}^{n} f(X_k)
By the Law of Large Numbers: Î_n → ∫_a^b f(x) dx almost surely.
Error. By the CLT:
√n (Î_n − I) →d N(0, (b−a)² Var[f(X)])
Standard error: SE = (b−a) Std[f(X)]/√n = O(1/√n).
Key property. The O(1/√n) convergence rate is dimension-independent. Trapezoid and Simpson's rules suffer from the curse of dimensionality (error O(n^{−2/d}) with n points in d dimensions), but Monte Carlo's rate stays O(1/√n) regardless of dimension. This is why high-dimensional integration in ML (expectations over data distributions, latent variables, trajectories) always uses Monte Carlo.
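The estimator and its standard error fit in a few lines. A minimal sketch (function name is my own), again on ∫_0^1 eˣ dx so the answer is known:

```python
import math, random

# Plain Monte Carlo for ∫_a^b f(x) dx with the O(1/√n) standard error
# estimated from the sample variance.

random.seed(0)

def mc_integrate(f, a, b, n):
    vals = [f(random.uniform(a, b)) for _ in range(n)]
    mean = sum(vals) / n
    var = sum((v - mean) ** 2 for v in vals) / (n - 1)   # sample variance
    estimate = (b - a) * mean
    se = (b - a) * math.sqrt(var / n)                    # standard error
    return estimate, se

est, se = mc_integrate(math.exp, 0.0, 1.0, 100_000)
true_val = math.e - 1   # estimate should land within a few SEs of this
```

Reporting the estimate as est ± 1.96·se gives the 95% confidence interval the CLT promises.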
Variance reduction. Importance sampling: draw X_k ∼ q(x) instead of uniform and estimate:
Î_n = (1/n) Σ_{k=1}^{n} f(X_k)/q(X_k)
Choosing q ∝ ∣f∣ minimizes variance - the basis of importance-weighted autoencoders (IWAE).
9.5 For AI: SGD as Monte Carlo Expectation
The true gradient update is:
∇_θ L(θ) = ∫ ∇_θ ℓ(f_θ(x), y) p(x,y) dx dy
SGD with a mini-batch B of size ∣B∣ = B estimates this as:
∇̂_θ L(θ) = (1/B) Σ_{i∈B} ∇_θ ℓ(f_θ(x_i), y_i)
This is a Monte Carlo estimate of the gradient integral. The mini-batch is drawn i.i.d. from the data distribution (uniformly at random from the training set), so it is an unbiased estimator with variance O(1/B). Larger batch sizes reduce variance but increase compute per step.
10. Integration in Probability
Forward reference to 06-Probability Theory. The full treatment of random variables, distributions, expectation, and probabilistic reasoning is in 06-Probability-Theory. Here we cover the integration mechanics - how to compute expectations, variances, KL divergence, and entropy as definite or improper integrals.
10.1 Probability Density Functions
A probability density function (PDF) p:R→[0,∞) satisfies:
∫_{−∞}^{∞} p(x) dx = 1, p(x) ≥ 0 for all x
The probability that X falls in [a,b] is Pr(a ≤ X ≤ b) = ∫_a^b p(x) dx.
Common PDFs and their normalization integrals:
| Distribution | PDF | Normalization relies on |
| --- | --- | --- |
| Uniform on [a,b] | 1/(b−a) | Elementary |
| Gaussian N(μ, σ²) | (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)} | Gaussian integral (8.4) |
| Exponential(λ) | λe^{−λx}, x ≥ 0 | ∫_0^∞ λe^{−λx} dx = 1 |
| Laplace(μ, b) | (1/(2b)) e^{−∣x−μ∣/b} | Exponential integral on each half-line |
10.2 CDF and FTC
The cumulative distribution function (CDF) is:
F(x) = Pr(X ≤ x) = ∫_{−∞}^{x} p(t) dt
By FTC Part 1: F′(x)=p(x) - the PDF is the derivative of the CDF. This connects probability theory directly to the FTC: the CDF is the "area accumulation function" of the PDF, and differentiating it recovers the density.
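The relation F′ = p can be checked numerically for the standard normal, whose CDF has a closed form via the error function. A minimal sketch (function names are my own):

```python
import math

# FTC Part 1 in probability: the standard normal CDF Φ (expressed with erf)
# should have numerical derivative equal to the PDF φ.

def Phi(x):
    """Standard normal CDF: Φ(x) = (1 + erf(x/√2)) / 2."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def phi(x):
    """Standard normal PDF."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

h = 1e-6
max_err = max(abs((Phi(x + h) - Phi(x - h)) / (2 * h) - phi(x))
              for x in (-2.0, 0.0, 1.5))
```

The central difference of Φ matches φ to high precision at every test point - the PDF really is the derivative of the CDF.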
Properties of the CDF:
F(−∞)=0, F(+∞)=1
F is non-decreasing: x₁ < x₂ ⇒ F(x₁) ≤ F(x₂)
F is right-continuous
Pr(a<X≤b)=F(b)−F(a) (FTC Part 2)
10.3 Expectation as a Weighted Integral
E[X] = ∫_{−∞}^{∞} x p(x) dx
E[g(X)] = ∫_{−∞}^{∞} g(x) p(x) dx (Law of the Unconscious Statistician)
Gaussian expectation. For X ∼ N(μ, σ²):
E[X] = ∫_{−∞}^{∞} x · (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)} dx = μ
Proof. Substitute u = (x−μ)/σ: E[X] = ∫_{−∞}^{∞} (σu + μ) (1/√(2π)) e^{−u²/2} du. The σu term integrates to 0 (odd function); the μ term gives μ·1. □
10.4 Variance and Second Moments
Var(X) = E[(X−μ)²] = ∫_{−∞}^{∞} (x−μ)² p(x) dx = E[X²] − (E[X])²
Gaussian variance. For X ∼ N(μ, σ²):
Var(X) = ∫_{−∞}^{∞} (x−μ)² (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)} dx = σ²
Proven via the substitution u = (x−μ)/σ and the identity ∫_{−∞}^{∞} u² e^{−u²/2} du = √(2π) (integration by parts with u · u e^{−u²/2}).
10.5 KL Divergence
The Kullback-Leibler divergence from q to p:
KL(p∥q) = ∫_{−∞}^{∞} p(x) ln(p(x)/q(x)) dx
Properties:
KL(p∥q)≥0 (Gibbs' inequality - proven via Jensen's inequality and the concavity of ln)
KL(p∥q)=0 iff p=q a.e.
Not symmetric: KL(p∥q) ≠ KL(q∥p) in general
Forward vs. reverse KL:
KL(p∥q) (forward/inclusive): forces q to cover all modes of p (mass-covering). This is what maximum likelihood estimation minimizes.
KL(q∥p) (reverse/exclusive): lets q concentrate on one mode of p (mode-seeking). Used in variational inference (ELBO = −KL(q∥p) + const).
Gibbs' inequality proof. By concavity of ln: ln t ≤ t − 1 for all t > 0. Apply with t = q(x)/p(x):
−KL(p∥q) = ∫ p(x) ln(q(x)/p(x)) dx ≤ ∫ p(x) (q(x)/p(x) − 1) dx = ∫ q(x) dx − ∫ p(x) dx = 1 − 1 = 0
so KL(p∥q) ≥ 0, with equality iff q = p almost everywhere. □
10.6 Differential Entropy
The differential entropy of a continuous random variable X with PDF p:
H(X) = −∫_{−∞}^{∞} p(x) ln p(x) dx = −E[ln p(X)]
Gaussian entropy. For X ∼ N(μ, σ²):
H(X) = (1/2) ln(2πeσ²) = (1/2)[1 + ln(2πσ²)]
Maximum entropy principle. Among all distributions with mean μ and variance σ2, the Gaussian maximizes entropy. This makes the Gaussian the natural distribution for uncertainty - used throughout Bayesian deep learning (Gaussian priors, Gaussian posteriors in VAEs).
Connection to cross-entropy loss. For a model qθ trained on data from p:
E_{x∼p}[−ln q_θ(x)] = H(p) + KL(p∥q_θ)
Minimizing cross-entropy loss minimizes KL(p∥qθ) (since H(p) is constant w.r.t. θ). This is why cross-entropy training is equivalent to maximum likelihood estimation.
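The identity holds exactly in the discrete case too, where it can be verified by direct arithmetic. A minimal sketch with two hypothetical categorical distributions (values are my own):

```python
import math

# Discrete check of the identity: cross-entropy(p, q) = H(p) + KL(p‖q).

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.5, 0.3, 0.2]   # model distribution
identity_gap = abs(cross_entropy(p, q) - (entropy(p) + kl(p, q)))
```

Since H(p) does not depend on q, driving the cross-entropy down can only come from shrinking the KL term - the algebraic content of "cross-entropy training is maximum likelihood".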
11. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
| --- | --- | --- | --- |
| 1 | Forgetting +C in indefinite integrals | Every antiderivative family has a free constant; omitting it loses solutions to IVPs | Always write +C and determine it from initial conditions |
| 2 | Assuming ∫f(x)g(x) dx = (∫f dx)·(∫g dx) | Integration does NOT distribute over products | Use substitution or integration by parts |
| 3 | Using ∫_a^b f dx = F(b) − F(a) without checking F′ = f | If F is wrong, the evaluation is wrong | Always verify the antiderivative by differentiating |
| 4 | Not changing limits in definite u-substitution | If you substitute u = g(x), the limits must become g(a) and g(b) | Either change limits or back-substitute |
| 5 | ∫(1/x²) dx = ln(x²) + C | Incorrect - ∫x⁻² dx = −x⁻¹ + C; the ln antiderivative only applies to 1/x | Power rule: ∫xⁿ dx = xⁿ⁺¹/(n+1) + C for n ≠ −1 |
| 6 | ∫_0^1 (1/x) dx = [ln x]_0^1 = 0 | This is an improper integral - ln x → −∞ as x → 0⁺; the integral diverges | Always check for discontinuities before applying FTC Part 2 |
| 7 | ∫_{−1}^1 (1/x) dx = 0 by symmetry | The integrand is odd, but the integral diverges - the symmetry argument fails for divergent integrals | Confirm convergence before using symmetry |
| 8 | Wrong LIATE choice causes circular integration | Choosing u = exponential in ∫x eˣ dx gives no simplification | Follow the LIATE priority; pick u so that du is simpler |
| 9 | Monte Carlo O(1/√n) confused with deterministic O(h²) | Monte Carlo error is stochastic, stated in expectation/variance; it is not a uniform bound | Report ±1.96·SE confidence intervals for MC estimates |
| 10 | Treating KL(p∥q) and KL(q∥p) as equal | KL divergence is not symmetric | Know the difference: forward KL is mode-covering, reverse KL is mode-seeking |
| 11 | ∫u dv = uv + ∫v du (sign error in parts) | The formula has a minus sign: ∫u dv = uv − ∫v du | Derive it from the product rule to remember the minus sign |
12. Exercises
Exercise 1 - Riemann Sums. For f(x) = x² on [0,2] with n = 8 equal subintervals:
(a) Compute the left Riemann sum L₈.
(b) Compute the right Riemann sum R₈.
(c) Compute the exact value ∫_0^2 x² dx via the FTC and verify L₈ ≤ exact ≤ R₈.
Exercise 4 - Improper Integrals. Determine convergence and evaluate if finite:
(a) ∫_1^∞ x^{−3/2} dx (b) ∫_0^1 (ln x)/x dx (c) ∫_{−∞}^{∞} x e^{−x²} dx
Exercise 5 - Numerical Integration. For f(x) = e^{−x²} on [0,2]:
(a) Compute T_n and S_n for n ∈ {4, 8, 16}.
(b) Compare to scipy.integrate.quad result.
(c) Plot the absolute error vs. n for both methods on a log-log scale. Measure the observed convergence rates.
Exercise 6 - Monte Carlo Integration. Estimate ∫_0^1 sin(πx²) dx:
(a) Implement basic Monte Carlo with n ∈ {100, 1000, 10000, 100000}.
(b) Plot the estimate and the ±2·SE confidence band vs. n.
(c) Verify the O(1/√n) convergence by checking that n·Var[Î_n] stays roughly constant as n grows.
Exercise 7 - KL Divergence. Let p = N(0,1) and q = N(μ, σ²):
(a) Derive the closed-form KL(p∥q) by evaluating the integral.
(b) Implement numerical KL via Monte Carlo with 10⁵ samples. Verify against the closed form.
(c) Plot KL(p∥q) as a function of μ ∈ [−3,3] (fixed σ = 1) and as a function of σ ∈ [0.1,3] (fixed μ = 0). Observe the asymmetry.
Exercise 8 - ELBO and Variational Inference. The ELBO (Evidence Lower BOund) is:
ELBO(q) = E_{z∼q(z)}[ln p(x∣z)] − KL(q(z)∥p_z(z))
(a) Show that ln p(x) = ELBO(q) + KL(q∥p(⋅∣x)) using the definition of conditional probability and KL divergence.
(b) Explain why maximizing the ELBO is equivalent to maximizing a lower bound on ln p(x).
(c) For p_z = N(0,1) and q = N(μ, σ²), compute KL(q∥p_z) and implement the reparameterization trick: z = μ + σε, ε ∼ N(0,1).
13. Why This Matters for AI (2026 Perspective)
| Concept | AI/ML Impact |
| --- | --- |
| Riemann sums | Conceptual foundation of all numerical expectation estimates; a discrete sum over a dataset approximates the continuous integral over the data-generating distribution |
| FTC Part 2 | Evaluating KL divergences, entropies, and moments in closed form; normalizing flow log-likelihoods |
| FTC Part 1 | Adjoint method for neural ODEs; sensitivity analysis of dynamical systems; continuous-time RL |
| u-Substitution | Change-of-variables formula in normalizing flows (Real NVP, Glow, FFJORD); reparameterization trick in VAEs |
| Integration by parts | Log-derivative (REINFORCE) trick for policy gradient methods (PPO, GRPO, DPO in RLHF) |
| Improper integrals | Convergence of expected losses over infinite data; entropy and KL divergence for continuous distributions |
| Gaussian integral | Normalization of all Gaussian distributions; closed-form KL divergences between Gaussians (VAE regularizer) |
| Trapezoid/Simpson | Numerical integration in scientific ML; ODE solver stepping (Euler, RK4 in neural ODEs) |
| Monte Carlo | SGD and mini-batch training; IWAE (importance-weighted); MCMC in Bayesian neural networks |
| Expectation E[·] | Every loss function; ELBO; reward expectation in RL; attention weights as expectation over keys |
| KL divergence | VAE regularizer; DPO alignment loss; diffusion model score function; maximum entropy RL |
| Entropy | Information bottleneck principle; attention entropy regularization; exploration in RL |
| Cross-entropy <-> KL | Cross-entropy training = MLE = minimizing KL from true distribution to model |
14. Conceptual Bridge
Looking back. This section built on two pillars from previous sections:
Limits (01) - the Riemann integral is defined as a limit of sums; the FTC proof uses the MVT for integrals; convergence of improper integrals is a limit.
Derivatives (02) - antiderivatives are "reverse derivatives"; FTC Part 1 says the area-accumulation function has derivative equal to the integrand; u-substitution reverses the chain rule; integration by parts reverses the product rule.
Every integration technique is the reverse of a differentiation technique. The FTC is the theorem that makes this reversal exact.
Looking forward.
04-Series-and-Sequences - Taylor series are derived using higher-order derivatives; the Taylor remainder formula involves a definite integral of the (n+1)-th derivative. Power series can be integrated term-by-term.
05-Multivariate Calculus - double and triple integrals extend the Riemann definition to higher dimensions; Fubini's theorem allows iterated integration; the substitution rule becomes the Jacobian change-of-variables formula.
06-Probability Theory - all continuous probability is integration: distributions, expectations, variances, moment-generating functions, characteristic functions, conditional distributions.
08-Optimization - population risk is an integral; SGD is a Monte Carlo gradient estimator; natural gradient uses the Fisher information matrix, which is defined via an integral.
Position in the curriculum:
CHAPTER 4 - CALCULUS FUNDAMENTALS
01 Limits and Continuity
(limit foundations, epsilon-delta, continuity)
02 Derivatives and Differentiation
(chain rule, product rule, activation derivatives)
03 Integration YOU ARE HERE
(FTC links 02 <-> 03; substitution reverses chain rule)
04 Series and Sequences
(Taylor series uses 02 derivatives + 03 integration)
05 Multivariate Calculus
(double integrals, Fubini, Jacobians - extends 03)
06 Probability Theory
(all continuous probability IS integration from 03)
Appendix A: Extended Substitution Examples and Patterns
A.1 Recognising the Pattern
The hardest part of u-substitution is identifying the right u. The pattern to look for: one function is the derivative of another part of the integrand.
| Integrand pattern | Choose u | Because |
| --- | --- | --- |
| f(xⁿ)·xⁿ⁻¹ | u = xⁿ | du = n xⁿ⁻¹ dx |
| f(eˣ)·eˣ | u = eˣ | du = eˣ dx |
| f(ln x)·(1/x) | u = ln x | du = (1/x) dx |
| f(sin x)·cos x | u = sin x | du = cos x dx |
| f(cos x)·sin x | u = cos x | du = −sin x dx |
| f(√x)·(1/√x) | u = √x | du = dx/(2√x) |
A.2 Rationalizing Substitutions
For integrals involving √(ax+b): let u = √(ax+b), so u² = ax + b and 2u du = a dx.
Example. ∫x√(2x+1) dx. Let u = √(2x+1), u² = 2x+1, x = (u²−1)/2, dx = u du:
= ∫ ((u²−1)/2)·u·u du = (1/2)∫(u⁴ − u²) du = u⁵/10 − u³/6 + C, with u = √(2x+1)
A.3 Trigonometric Substitutions
For integrands involving √(a²−x²), √(a²+x²), or √(x²−a²):
| Form | Substitution | Identity used |
| --- | --- | --- |
| √(a²−x²) | x = a sin θ | 1 − sin²θ = cos²θ |
| √(a²+x²) | x = a tan θ | 1 + tan²θ = sec²θ |
| √(x²−a²) | x = a sec θ | sec²θ − 1 = tan²θ |
Example. ∫ 1/√(4−x²) dx. Let x = 2 sin θ, dx = 2 cos θ dθ:
∫ 2 cos θ/√(4 − 4 sin²θ) dθ = ∫ (2 cos θ)/(2 cos θ) dθ = θ + C = arcsin(x/2) + C
For AI. Trigonometric substitutions appear when integrating radial functions in high-dimensional probability (e.g., volumes of spherical shells used in d-dimensional Gaussian integrals).
A.4 The Softmax Integral - Partition Function
The softmax denominator (partition function) is a sum Z=∑kezk, the discrete analogue of the integral Z=∫ef(x)dx that appears in energy-based models. In the continuous case, computing Z is intractable in general - this is the core computational challenge of energy-based models. Variational autoencoders and diffusion models avoid computing Z directly by working with ratios or lower bounds.
Appendix B: Integration by Parts - Extended Examples and Theory
B.1 The Cyclic Trick
When integration by parts produces I = (something) − cI for a constant c ≠ −1, solve for I algebraically: I = (something)/(1 + c). For I = ∫eˣ sin x dx this yields I = (1/2)eˣ(sin x − cos x) + C.
Verification: Differentiate the right side to confirm.
B.2 Integration by Parts for Definite Integrals
Example. Show ∫_0^∞ x e⁻ˣ dx = 1.
Let u = x, dv = e⁻ˣ dx:
∫_0^∞ x e⁻ˣ dx = [−x e⁻ˣ]_0^∞ + ∫_0^∞ e⁻ˣ dx
The boundary term vanishes: lim_{x→∞} x e⁻ˣ = 0 (L'Hôpital: x/eˣ → 0), and at x = 0 it is 0.
= 0 + [−e⁻ˣ]_0^∞ = 0 − (−1) = 1
B.3 The Gamma Function
The Gamma function generalizes factorials to real arguments:
Γ(s) = ∫_0^∞ x^{s−1} e⁻ˣ dx, s > 0
Key properties (proven by integration by parts):
Γ(s+1) = sΓ(s) (reduction formula via parts)
Γ(n) = (n−1)! for positive integers n
Γ(1/2) = √π (from the Gaussian integral)
Proof of reduction. Γ(s+1) = ∫_0^∞ xˢ e⁻ˣ dx. Let u = xˢ, dv = e⁻ˣ dx:
= [−xˢ e⁻ˣ]_0^∞ + s ∫_0^∞ x^{s−1} e⁻ˣ dx = 0 + sΓ(s) □
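These identities can be spot-checked with the standard library's implementation of Γ. A minimal sketch:

```python
import math

# Numerical checks of the Gamma function identities via math.gamma:
# the recurrence Γ(s+1) = s·Γ(s), the factorial connection Γ(n) = (n-1)!,
# and Γ(1/2) = √π.

s = 2.7   # arbitrary non-integer argument
recurrence_gap = abs(math.gamma(s + 1) - s * math.gamma(s))
factorial_gap = abs(math.gamma(6) - math.factorial(5))   # Γ(6) = 5! = 120
half_gap = abs(math.gamma(0.5) - math.sqrt(math.pi))
```

All three gaps are at floating-point rounding level, consistent with the integration-by-parts proof above.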
For AI. The Gamma function appears in the normalizing constants of many probability distributions used in Bayesian ML: the Gamma distribution (prior for precision in Gaussian models), the Beta distribution (prior for probabilities in Dirichlet-multinomial models), the Student-t distribution (robust regression).
B.4 Wallis's Formula
From repeated integration by parts on ∫_0^{π/2} sinⁿx dx:
π/2 = (2·2·4·4·6·6⋯)/(1·3·3·5·5·7⋯) = ∏_{n=1}^{∞} 4n²/(4n²−1)
This is one of the earliest infinite product formulas for π - an unexpected connection between integration and π.
Appendix C: Improper Integrals - Convergence Tests in Detail
C.1 The p-Test
∫_1^∞ x⁻ᵖ dx converges iff p > 1; ∫_0^1 x⁻ᵖ dx converges iff p < 1.
The boundary p = 1 always diverges (∫(1/x) dx = ln x, which diverges at both ends: ln x → −∞ as x → 0⁺ and ln x → ∞ as x → ∞).
C.2 Comparison Test - Worked Examples
Example 1. Does ∫_1^∞ 1/(x²+x) dx converge?
For x ≥ 1: x² + x ≥ x², so 1/(x²+x) ≤ 1/x².
Since ∫_1^∞ x⁻² dx = 1 converges, by comparison the original converges. □
Example 2. Does ∫_1^∞ (ln x)/x dx converge?
For x ≥ e: ln x ≥ 1, so (ln x)/x ≥ 1/x.
Since ∫_e^∞ x⁻¹ dx diverges, by comparison the original diverges. □
C.3 Absolute vs Conditional Convergence
∫_a^∞ f(x) dx is absolutely convergent if ∫_a^∞ ∣f(x)∣ dx < ∞.
∫_a^∞ f(x) dx is conditionally convergent if it converges but not absolutely.
Example. ∫_0^∞ (sin x)/x dx = π/2 (the Dirichlet integral) - it converges conditionally but NOT absolutely (∫_0^∞ ∣sin x∣/x dx = ∞).
For ML. Absolutely convergent integrals behave nicely: they can be split, reordered, and approximated by truncated versions. The expected loss E[L] is absolutely convergent (non-negative integrand) - this is why expectation estimates via Monte Carlo are reliable.
C.4 Laplace Transform Preview
The Laplace transform is an improper integral parametrized by s:
L{f}(s) = ∫_0^∞ f(t) e^{−st} dt
It converts differential equations to algebraic equations (used in control theory). For neural networks, the Laplace transform of the loss trajectory L{t↦L(θt)} is related to the training dynamics in Laplace domain - an emerging tool in the theoretical analysis of gradient descent.
Appendix D: Numerical Integration - Error Analysis and Advanced Methods
D.1 Trapezoid Rule - Derivation from Scratch
On each subinterval $[x_{k-1},x_k]$, the trapezoid rule approximates $f$ by the linear interpolant through $(x_{k-1},f(x_{k-1}))$ and $(x_k,f(x_k))$, whose integral is $\frac{h}{2}\left[f(x_{k-1})+f(x_k)\right]$.
Error derivation. By Taylor expansion on each subinterval:
$\int_{x_{k-1}}^{x_k} f\,dx = \frac{h}{2}\left[f(x_{k-1})+f(x_k)\right] - \frac{h^3}{12}f''(\xi_k)$
Summing: $E_T = -\frac{h^2(b-a)}{12}\bar f''$, where $\bar f''$ is some average of $f''$ on $[a,b]$ (MVT). So $|E_T|\le \frac{M_2(b-a)^3}{12n^2}$, where $M_2=\max|f''|$.
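The $O(h^2)=O(1/n^2)$ rate can be observed directly: doubling $n$ should cut the error by about 4. A minimal sketch (the `trapezoid` helper is ours, illustrated on $\int_0^1 e^x\,dx = e-1$):

```python
import numpy as np

def trapezoid(f, a, b, n):
    """Composite trapezoid rule with n subintervals."""
    x = np.linspace(a, b, n + 1)
    y = f(x)
    h = (b - a) / n
    return h * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)

# Error should drop by ~4x when n doubles: O(h^2) = O(1/n^2)
exact = np.exp(1) - 1                     # integral_0^1 e^x dx
e_n  = abs(trapezoid(np.exp, 0, 1, 50)  - exact)
e_2n = abs(trapezoid(np.exp, 0, 1, 100) - exact)
assert 3.8 < e_n / e_2n < 4.2
```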
D.2 Simpson's Rule - Parabolic Approximation
On each pair of subintervals $[x_{2k-2},x_{2k}]$, use the unique parabola through three points:
$S_{2k} = \frac{h}{3}\left[f(x_{2k-2}) + 4f(x_{2k-1}) + f(x_{2k})\right]$
The coefficient pattern comes from integrating the Lagrange interpolating polynomial. The error:
$E_S = -\frac{h^4(b-a)}{180}\bar f^{(4)} \implies |E_S| \le \frac{M_4(b-a)^5}{180 n^4}$
D.3 Richardson Extrapolation
If $T_n$ has error $E_T = c_2/n^2 + c_4/n^4 + \cdots$, then combining $T_n$ and $T_{2n}$:
$\frac{4T_{2n} - T_n}{3}$
eliminates the $O(1/n^2)$ term, giving a method with error $O(1/n^4)$ - equal to Simpson's! This idea, applied recursively, gives Romberg integration with error $O(h^{2k})$ for any $k$.
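In fact $(4T_{2n}-T_n)/3$ reproduces Simpson's rule on the same grid exactly, not just to the same order. A quick check, with small trapezoid/Simpson helpers of our own:

```python
import numpy as np

def trap(f, a, b, n):
    x = np.linspace(a, b, n + 1)
    y = f(x)
    h = (b - a) / n
    return h * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)

def simpson(f, a, b, n):  # n must be even
    x = np.linspace(a, b, n + 1)
    y = f(x)
    h = (b - a) / n
    return h / 3 * (y[0] + y[-1] + 4 * y[1:-1:2].sum() + 2 * y[2:-1:2].sum())

f, a, b, n = np.sin, 0.0, np.pi, 10
richardson = (4 * trap(f, a, b, 2 * n) - trap(f, a, b, n)) / 3

# Richardson extrapolation of the trapezoid rule IS Simpson's rule:
assert abs(richardson - simpson(f, a, b, 2 * n)) < 1e-12
```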
D.4 Quasi-Monte Carlo
Standard Monte Carlo has error $O(n^{-1/2})$ regardless of dimension. Quasi-Monte Carlo (QMC) uses low-discrepancy sequences (Halton, Sobol) instead of random points:
Random points clump and leave gaps
Low-discrepancy sequences fill space more uniformly
QMC achieves $O((\log n)^d/n)$ error for smooth integrands in $d$ dimensions - much better than $O(n^{-1/2})$ when $d$ is small. Used in financial derivatives pricing and high-dimensional integration in Bayesian neural networks.
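A 1D illustration, using a hand-rolled base-2 van der Corput sequence (the simplest low-discrepancy construction; the helper is ours) against plain random sampling for $\int_0^1 e^x\,dx$:

```python
import numpy as np

def van_der_corput(n, base=2):
    """First n points of the base-2 van der Corput low-discrepancy sequence."""
    pts = np.empty(n)
    for i in range(n):
        q, denom, x = i + 1, base, 0.0
        while q > 0:
            q, r = divmod(q, base)
            x += r / denom
            denom *= base
        pts[i] = x
    return pts

f = lambda x: np.exp(x)              # integral_0^1 e^x dx = e - 1
exact = np.e - 1
n = 4096

err_qmc = abs(f(van_der_corput(n)).mean() - exact)
errs_mc = [abs(f(np.random.default_rng(s).random(n)).mean() - exact)
           for s in range(5)]

assert err_qmc < 5e-4                # near-O(1/n) accuracy from QMC
assert err_qmc < np.mean(errs_mc)    # beats the average random-sampling error
```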
D.5 Adaptive Integration
scipy.integrate.quad uses Gauss-Kronrod quadrature with adaptive refinement: if the error estimate on a subinterval exceeds tolerance, subdivide and integrate each piece separately. This automatically focuses computational effort on regions where $f$ varies rapidly - crucial for integrands with sharp peaks (e.g., probability densities in high-dimensional tails).
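A small illustration with a sharply peaked integrand (the width and tolerances are illustrative): a Gaussian bump of width $10^{-3}$ that a fixed coarse grid would easily miss, but which adaptive refinement resolves:

```python
import numpy as np
from scipy.integrate import quad

# Narrow Gaussian bump at x = 0.5: adaptive quadrature detects the large
# local error estimate near the peak and subdivides there, leaving the
# flat regions coarse.
sigma = 1e-3
peak = lambda x: np.exp(-((x - 0.5) ** 2) / (2 * sigma**2))

value, err_est = quad(peak, 0, 1, limit=200)
exact = sigma * np.sqrt(2 * np.pi)   # total Gaussian mass; tails are negligible
assert abs(value - exact) < 1e-8
```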
Appendix E: Probability Integration - Extended Topics
E.1 Moment-Generating Functions
The moment-generating function (MGF) of X is:
$M_X(t) = \mathbb{E}[e^{tX}] = \int_{-\infty}^{\infty} e^{tx}\,p(x)\,dx$
provided this integral converges for t in a neighborhood of 0.
Why it generates moments: differentiate $k$ times at $t=0$:
$M_X^{(k)}(0) = \mathbb{E}\!\left[X^k e^{0\cdot X}\right] = \mathbb{E}[X^k]$
So $\mathbb{E}[X] = M'(0)$, $\mathbb{E}[X^2] = M''(0)$, $\mathrm{Var}(X) = M''(0) - [M'(0)]^2$.
Gaussian MGF. For $X\sim N(\mu,\sigma^2)$:
$M_X(t) = e^{\mu t + \sigma^2 t^2/2}$
Proven by completing the square in the exponent under the integral and using the Gaussian integral.
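A Monte Carlo sanity check of the closed form (the parameter values are illustrative): estimate $\mathbb{E}[e^{tX}]$ from samples and compare to $e^{\mu t+\sigma^2 t^2/2}$:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, t = 1.0, 2.0, 0.3
x = rng.normal(mu, sigma, size=1_000_000)

mgf_mc = np.exp(t * x).mean()                       # Monte Carlo E[e^{tX}]
mgf_closed = np.exp(mu * t + sigma**2 * t**2 / 2)   # closed-form Gaussian MGF
assert abs(mgf_mc - mgf_closed) / mgf_closed < 0.01
```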
E.2 Characteristic Functions and Fourier Transforms
The characteristic function $\phi_X(t) = \mathbb{E}[e^{itX}] = \int e^{itx}\,p(x)\,dx$ is the Fourier transform of the PDF. Unlike the MGF, the characteristic function always exists (since $|e^{itx}| = 1$).
The inverse Fourier transform recovers the PDF:
$p(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\,\phi_X(t)\,dt$
For AI. Fourier transforms appear in:
Random Fourier features (Rahimi & Recht, 2007): approximate shift-invariant kernels via $\hat k(x-y) = \mathbb{E}\!\left[e^{i\omega^\top(x-y)}\right]$ where $\omega\sim p(\omega)$
Frequency-domain attention: Fourier attention mechanisms (FNet) replace self-attention with a 2D FFT
Spectral normalization: the spectral norm $\|W\|_2$ of a weight matrix (its largest singular value), computed via power iteration
E.3 Conditional Expectations as Integrals
The conditional expectation $\mathbb{E}[Y\,|\,X=x]$ is defined via the conditional density $p(y|x)$:
$\mathbb{E}[Y\,|\,X=x] = \int y\,p(y|x)\,dy$
Law of total expectation: $\mathbb{E}[Y] = \mathbb{E}_X\!\left[\mathbb{E}[Y\,|\,X]\right] = \int \mathbb{E}[Y\,|\,X=x]\,p(x)\,dx$
For AI. Diffusion model training minimizes $\mathbb{E}_t\,\mathbb{E}_{x_0}\,\mathbb{E}_{x_t|x_0}\!\left[\|\epsilon_\theta(x_t,t)-\epsilon\|^2\right]$ - a triple nested expectation. Each $\mathbb{E}$ is an integral; Monte Carlo (sampling) handles them all.
E.4 Integration and Maximum Likelihood
Maximum likelihood estimation (MLE) maximizes $\prod_{i=1}^n p(x_i;\theta)$ - equivalently, it maximizes the log-likelihood $\sum_i \log p(x_i;\theta)$.
This sum (scaled by $1/n$) is a Monte Carlo estimate of the integral:
$\frac{1}{n}\sum_{i=1}^n \log p(x_i;\theta) \approx \mathbb{E}_{x\sim p_{\text{true}}}[\log p(x;\theta)] = \int p_{\text{true}}(x)\log p(x;\theta)\,dx$
Maximizing the likelihood is equivalent to minimizing $\mathrm{KL}(p_{\text{true}}\,\|\,p_\theta)$ - the forward KL divergence from the true data distribution to the model. This is a fundamental connection between MLE, integration, and information theory.
E.5 Score Matching
The score function of a distribution $p$ is $s(x) = \nabla_x \log p(x)$. Score matching (Hyvärinen, 2005) estimates $p$ without computing its normalization constant $Z = \int e^{f(x)}\,dx$:
$J(\theta) = \mathbb{E}_{x\sim p}\!\left[\mathrm{tr}\!\left(\nabla_x s_\theta(x)\right) + \tfrac{1}{2}\|s_\theta(x)\|^2\right]$
Integration by parts shows this equals $\mathbb{E}_{x\sim p}\!\left[\tfrac{1}{2}\|s_\theta(x)-s(x)\|^2\right]$ plus a constant - so minimizing $J(\theta)$ fits the model score to the true score without ever integrating $p$. Denoising variants of this objective are what denoising diffusion probabilistic models (DDPMs) train on.
Appendix G: FTC - Applications in Machine Learning
G.1 Neural ODEs and the Adjoint Method
A neural ODE (Chen et al., 2018) defines the hidden state dynamics as:
$\frac{dh(t)}{dt} = f(h(t),t;\theta)$
The hidden state at time T is:
$h(T) = h(0) + \int_0^T f(h(t),t;\theta)\,dt$
This is FTC Part 2 in reverse: given the "derivative" f, integrate to get the accumulated change.
Training requires $\frac{\partial L}{\partial \theta}$. The adjoint method computes this via a backward ODE:
$\frac{da(t)}{dt} = -a(t)^\top \frac{\partial f}{\partial h}$
where $a(t) = \partial L/\partial h(t)$ is the adjoint state. FTC Part 1 guarantees that integrating this backward ODE recovers the gradient exactly - with $O(1)$ memory (no storing of intermediate states).
G.2 Attention as Expectation
Softmax attention computes:
$\mathrm{Attn}(q,K,V) = \sum_k \alpha_k v_k, \qquad \alpha_k = \frac{e^{q\cdot k_k/\sqrt{d}}}{\sum_j e^{q\cdot k_j/\sqrt{d}}}$
This is a discrete expectation $\mathbb{E}_\alpha[V]$ where $\alpha$ is the softmax distribution over keys. In the continuous limit (as the number of keys grows and positions become dense), this becomes an integral:
$\mathrm{Attn}(q)(s) = \int v(s')\,\frac{e^{q(s)\cdot k(s')/\sqrt{d}}}{\int e^{q(s)\cdot k(s'')/\sqrt{d}}\,ds''}\,ds'$
This connection motivates kernel attention approximations (Performer, Random Feature Attention) that use random features to approximate the exponential kernel via Monte Carlo integration of a Gaussian integral: $e^{q\cdot k} = \int e^{\omega\cdot q}\,e^{\omega\cdot k}\,p(\omega)\,d\omega$ (up to Gaussian normalization factors).
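To make the expectation reading concrete, here is a minimal NumPy sketch (dimensions and random data are purely illustrative): the attention output is a convex combination of value vectors, i.e. an expectation under the softmax distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_keys = 8, 16
q = rng.normal(size=d)
K = rng.normal(size=(n_keys, d))
V = rng.normal(size=(n_keys, d))

scores = K @ q / np.sqrt(d)                    # scaled dot-product scores
alpha = np.exp(scores) / np.exp(scores).sum()  # softmax distribution over keys
attn_out = alpha @ V                           # sum_k alpha_k v_k = E_alpha[V]

assert abs(alpha.sum() - 1.0) < 1e-12          # alpha is a probability vector
assert np.all(alpha > 0)
# An expectation lies in the convex hull of the values, componentwise:
assert np.all(attn_out >= V.min(axis=0) - 1e-12)
assert np.all(attn_out <= V.max(axis=0) + 1e-12)
```

(A production implementation would subtract `scores.max()` before exponentiating for numerical stability; it is omitted here for clarity.)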
G.3 Diffusion Models - Score Matching via Integration by Parts
The denoising score matching objective (Vincent, 2011; Song & Ermon, 2019):
$\mathbb{E}_{t,x_0,\epsilon}\!\left[\lambda(t)\,\|\epsilon_\theta(x_t,t) - \epsilon\|^2\right]$
where $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, $\epsilon\sim N(0,I)$.
The equivalence to score matching follows from integration by parts in function space. Specifically, $\nabla_{x_t}\log p(x_t\,|\,x_0) = -\epsilon/\sqrt{1-\bar\alpha_t}$, so the network learns the score of the noisy distribution - an integral relationship between the score function and the data density.
The VAE ELBO likewise decomposes into two integrals:
$\mathbb{E}_{q_\phi}[\log p_\theta(x|z)] = \int q_\phi(z|x)\log p_\theta(x|z)\,dz$ - estimated via Monte Carlo (reparameterization trick)
$\mathrm{KL}(q_\phi\,\|\,p) = \int q_\phi(z|x)\log\frac{q_\phi(z|x)}{p(z)}\,dz$ - computed in closed form for Gaussian $q_\phi$ and $p$
The reparameterization trick ($z = \mu_\phi(x) + \sigma_\phi(x)\odot\varepsilon$, $\varepsilon\sim N(0,I)$) is a change of variables (substitution rule) that makes the Monte Carlo estimator differentiable w.r.t. $\phi$.
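A minimal numerical sketch of why reparameterization yields differentiable estimators, assuming the toy choice $f(z)=z^2$, for which $\mathbb{E}[f(z)] = \mu^2 + \sigma^2$ has known gradients $2\mu$ and $2\sigma$:

```python
import numpy as np

# Reparameterization: z = mu + sigma * eps with eps ~ N(0,1). The sample z is
# now a differentiable function of (mu, sigma), so gradients pass through it.
rng = np.random.default_rng(1)
mu, sigma = 0.7, 1.3
eps = rng.normal(size=2_000_000)

z = mu + sigma * eps
grad_mu = (2 * z).mean()           # d f(z)/d mu    = f'(z) * dz/dmu    = 2z * 1
grad_sigma = (2 * z * eps).mean()  # d f(z)/d sigma = f'(z) * dz/dsigma = 2z * eps

assert abs(grad_mu - 2 * mu) < 0.01        # true gradient: 2*mu
assert abs(grad_sigma - 2 * sigma) < 0.01  # true gradient: 2*sigma
```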
Appendix H: Notation Reference and Quick-Reference Tables
This is a conditionally convergent improper integral (not absolutely convergent). One proof uses the Laplace transform: define $F(s) = \int_0^\infty e^{-sx}\,\frac{\sin x}{x}\,dx$ and differentiate w.r.t. $s$:
$F'(s) = -\int_0^\infty e^{-sx}\sin x\,dx = -\frac{1}{1+s^2}$
Integrating: $F(s) = -\arctan s + C$. As $s\to\infty$: $F(s)\to 0$, so $C = \pi/2$. At $s=0$: $F(0) = \pi/2$.
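The identity $F(s) = \pi/2 - \arctan s$ can be checked numerically for $s > 0$ (using `np.sinc`, since $\mathrm{sinc}(x/\pi) = \sin x / x$ is well-defined at $x=0$):

```python
import numpy as np
from scipy.integrate import quad

# F(s) = integral_0^inf e^{-s x} sin(x)/x dx should equal pi/2 - arctan(s).
def F(s):
    return quad(lambda x: np.exp(-s * x) * np.sinc(x / np.pi), 0, np.inf)[0]

for s in (0.5, 1.0, 2.0):
    assert abs(F(s) - (np.pi / 2 - np.arctan(s))) < 1e-7
```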
Appendix M: Monte Carlo Methods - Extended Analysis
M.1 Variance of the Monte Carlo Estimator
Let $X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Uniform}(a,b)$. The estimator is $\hat I_n = \frac{b-a}{n}\sum_{k=1}^n f(X_k)$.
To estimate $I = \int f(x)\,dx$, sample $X_k\sim q(x)$ (a proposal distribution) and use:
$\hat I_n^{\mathrm{IS}} = \frac{1}{n}\sum_{k=1}^n \frac{f(X_k)}{q(X_k)}$
Since $\int f(x)\,dx = \int \frac{f(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_q\!\left[\frac{f(X)}{q(X)}\right]$, this is also unbiased.
Optimal $q$. $\mathrm{Var}[\hat I^{\mathrm{IS}}]$ is minimized when $q(x)\propto |f(x)|$. With this choice, $\mathrm{Var} = 0$ if $f\ge 0$ everywhere (one-sample exact!).
In practice. Choose q to concentrate samples where ∣f(x)∣ is large - focusing effort on the important region. Used in:
IWAE (Importance Weighted Autoencoders) - tighter ELBO via importance sampling
Particle filters - sequential importance sampling for state estimation
MCMC - Metropolis-Hastings acceptance-rejection is importance sampling on steroids
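A sketch of the optimal-proposal claim, using the illustrative target $\int_0^\infty x\,e^{-x}\,dx = \Gamma(2) = 1$: with $q\propto f$ (here a Gamma(2) proposal) every importance weight equals 1, so the estimator has zero variance:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x * np.exp(-x)          # target: integral_0^inf f = Gamma(2) = 1
n = 100_000

# Proposal q1 = Exp(1), density e^{-x}: weights w = f(x)/q1(x) = x. Unbiased
# but noisy (Var = 1).
x1 = rng.exponential(1.0, n)
w1 = f(x1) / np.exp(-x1)

# Optimal proposal q* proportional to f, i.e. Gamma(shape=2), density x e^{-x}:
# weights are identically 1.
x2 = rng.gamma(2.0, 1.0, n)
w2 = f(x2) / (x2 * np.exp(-x2))

assert abs(w1.mean() - 1.0) < 0.02    # unbiased, but with sampling noise
assert np.allclose(w2, 1.0)           # zero variance: one sample is exact
assert w2.std() < 1e-10 < w1.std()
```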
M.3 Central Limit Theorem for Monte Carlo
$\frac{\hat I_n - I}{\mathrm{SE}(\hat I_n)} \xrightarrow{\ d\ } N(0,1)$
This gives a 95% confidence interval: $\hat I_n \pm 1.96\cdot\mathrm{SE}(\hat I_n)$.
The SE is estimated from the sample standard deviation:
$\mathrm{SE} = \frac{(b-a)\,\hat\sigma_f}{\sqrt{n}}, \qquad \hat\sigma_f^2 = \frac{1}{n-1}\sum_{k=1}^n \left(f(X_k) - \hat\mu_f\right)^2$
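Putting the estimator and its standard error together in code (the integrand $e^x$ on $[0,1]$ is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.exp(x)
a, b, n = 0.0, 1.0, 10_000
exact = np.e - 1                              # integral_0^1 e^x dx

x = rng.uniform(a, b, n)
fx = f(x)
I_hat = (b - a) * fx.mean()                   # Monte Carlo estimate
se = (b - a) * fx.std(ddof=1) / np.sqrt(n)    # estimated standard error
ci = (I_hat - 1.96 * se, I_hat + 1.96 * se)   # 95% CI; covers ~95% of the time

# 4 SEs fails with probability ~6e-5, so this is a safe sanity check:
assert abs(I_hat - exact) < 4 * se
assert se < 0.01
```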
M.4 Quasi-Monte Carlo - Discrepancy
The error of quasi-Monte Carlo is bounded by the Koksma-Hlawka inequality:
$|\hat I_n - I| \le D_n^*(x_1,\dots,x_n)\cdot V(f)$
where $D_n^*$ is the star discrepancy of the point set and $V(f)$ is the total variation of $f$. Low-discrepancy sequences (Halton, Sobol) achieve $D_n^* = O((\log n)^d/n)$, giving error $O((\log n)^d/n)$ vs. $O(n^{-1/2})$ for random points.
For $d=1$: QMC is always better than Monte Carlo (for smooth integrands). For large $d$, the $(\log n)^d$ factor can dominate.
Appendix N: The Fundamental Theorem - Historical and Conceptual Depth
N.1 Why the FTC Is Deep
Before the FTC, two problems seemed completely unrelated:
The tangent problem: find the slope of a curve at a point -> derivative
The area problem: find the area under a curve -> integral
Newton and Leibniz discovered they are inverse operations. This is not obvious. There is no reason, a priori, to expect that summing infinitesimally thin rectangles (integration) should be related to measuring instantaneous slope (differentiation).
The FTC says: accumulation is the reverse of rate. If you know how fast something is accumulating at every instant, you can find the total accumulation - just by finding an antiderivative. This is one of the most non-obvious and deep theorems in all of mathematics.
N.2 What Makes the FTC Work
The key ingredients:
Continuity of f: ensures f can be approximated uniformly by step functions (Riemann integrability)
MVT for integrals: $\frac{1}{h}\int_x^{x+h} f(t)\,dt = f(c_h)$ for some $c_h\in(x,x+h)$
Continuity of f at x: ensures f(ch)→f(x) as h→0
If f has discontinuities, Part 1 may fail: G′(x)=f(x) only holds at points of continuity of f. The Lebesgue integral extends the FTC to a much broader class of functions.
N.3 The FTC in Multiple Dimensions
The FTC generalizes to higher dimensions in several forms:
1D FTC: $\int_a^b f'(x)\,dx = f(b) - f(a)$. Its higher-dimensional versions:
Green's theorem: $\oint_C \mathbf{F}\cdot d\mathbf{r} = \iint_D \mathrm{curl}\,\mathbf{F}\;dA$
Stokes' theorem: $\oint_{\partial S} \mathbf{F}\cdot d\mathbf{r} = \iint_S (\nabla\times\mathbf{F})\cdot d\mathbf{S}$
Divergence theorem: $\iint_{\partial V} \mathbf{F}\cdot d\mathbf{S} = \iiint_V \nabla\cdot\mathbf{F}\,dV$
All are special cases of Stokes' theorem on manifolds: $\int_M d\omega = \int_{\partial M}\omega$.
For ML, the divergence theorem underlies computations with the divergence of a vector field, as used in:
Flow matching (training vector fields for generative models)
Continuous normalizing flows with trace of the Jacobian
Optimal transport with the continuity equation
N.4 Antiderivatives That Cannot Be Expressed in Closed Form
Some elementary functions have no elementary antiderivative. Famous examples:
$e^{-x^2}$ | $\frac{\sqrt{\pi}}{2}\,\mathrm{erf}(x)$ | Error function - not elementary
$\frac{\sin x}{x}$ | $\mathrm{Si}(x)$ (sine integral) | Not elementary
$\frac{e^x}{x}$ | $\mathrm{Ei}(x)$ (exponential integral) | Not elementary
$\frac{1}{\ln x}$ | $\mathrm{Li}(x)$ (logarithmic integral) | Counts primes! Not elementary
$\sqrt{1 - k^2\sin^2 x}$ | Elliptic integral | Not elementary
These "special functions" appear constantly in ML:
scipy.special.erf, scipy.special.erfinv - in GELU, quantile functions, CDFs
scipy.special.gammaln - in Dirichlet, Beta, Gamma distributions
scipy.special.iv (modified Bessel function) - in von Mises distributions (circular statistics in positional encoding)
Appendix O: Lebesgue vs. Riemann Integration
O.1 Why a Different Theory?
The Riemann integral is sufficient for continuous functions and most smooth ML applications. But the Lebesgue integral is the foundation of modern probability theory and is essential for rigorous statements about expected values, convergence theorems, and measure theory.
The key difference: Riemann integration partitions the domain (x-axis); Lebesgue integration partitions the range (y-axis).
RIEMANN vs. LEBESGUE INTEGRATION
Riemann: slice the domain into intervals $[x_i, x_{i+1}]$ and sum $f(x_i)\,\Delta x$ - "partition the x-axis."
Lebesgue: slice the range into levels $[y_i, y_{i+1}]$ and multiply each level by the measure of its preimage $\{x : f(x)\in[y_i,y_{i+1}]\}$ ("how much $x$ gives $f(x)\approx y$?") - "partition the y-axis."
O.2 Key Theorem (Lebesgue vs. Riemann)
Theorem (Lebesgue, 1901): A bounded function on [a,b] is Riemann integrable if and only if it is continuous almost everywhere (i.e., the set of discontinuities has Lebesgue measure zero).
For ML, this means:
ReLU is Riemann integrable (discontinuous only at 0, a set of measure zero)
Indicator functions 1[x>0] are Riemann integrable for the same reason
Pathological functions like the Dirichlet function ($\mathbf{1}_{\mathbb{Q}}$) are NOT Riemann integrable but ARE Lebesgue integrable
O.3 Convergence Theorems (The Real Advantage)
The Lebesgue integral enables three fundamental convergence theorems that Riemann lacks:
Monotone Convergence Theorem (MCT): If $0\le f_1\le f_2\le\cdots$ and $f_n\to f$ pointwise, then:
$\lim_{n\to\infty}\int f_n\,d\mu = \int \lim_{n\to\infty} f_n\,d\mu = \int f\,d\mu$
Dominated Convergence Theorem (DCT): If $f_n\to f$ pointwise and $|f_n|\le g$ for an integrable $g$, then:
$\lim_{n\to\infty}\int f_n\,d\mu = \int f\,d\mu$
Fatou's Lemma: $\int \liminf_{n\to\infty} f_n\,d\mu \le \liminf_{n\to\infty}\int f_n\,d\mu$
Why ML cares: The DCT justifies differentiating under the integral sign - the key step in computing gradients of expected values:
$\frac{\partial}{\partial\theta}\,\mathbb{E}_{p_\theta}[f(x)] = \frac{\partial}{\partial\theta}\int f(x)\,p_\theta(x)\,dx = \int f(x)\,\frac{\partial p_\theta}{\partial\theta}\,dx$
This step is legal when $f\cdot\left|\partial p_\theta/\partial\theta\right|$ is dominated by an integrable function - which is why REINFORCE and the reparameterization trick both carry regularity conditions.
O.4 Probability as Measure Theory
Modern probability uses the Lebesgue framework directly:
Measure theory | Probability
Measure space $(\Omega,\mathcal{F},\mu)$ | Probability space $(\Omega,\mathcal{F},P)$
Measurable function $f:\Omega\to\mathbb{R}$ | Random variable $X:\Omega\to\mathbb{R}$
$\int f\,d\mu$ | $\mathbb{E}[X]$
$\mu(A) = \int_A 1\,d\mu$ | $P(A)$
Radon-Nikodym derivative $dP/dQ$ | Likelihood ratio
The Radon-Nikodym theorem - which says that if $P\ll Q$ ($P$ is absolutely continuous w.r.t. $Q$), there exists a measurable function $\frac{dP}{dQ}$ such that $P(A) = \int_A \frac{dP}{dQ}\,dQ$ - is the rigorous foundation for:
KL divergence: $D_{\mathrm{KL}}(P\,\|\,Q) = \int \log\frac{dP}{dQ}\,dP$
Change of variables in normalizing flows
Importance sampling weights
For day-to-day ML, Riemann integration suffices. But when reading probability theory papers or understanding why certain gradient estimators require regularity conditions, the Lebesgue framework is the right language.
WHICH TECHNIQUE?
Is there a composite function f(g(x))?
YES -> try u-substitution: u = g(x)
Is the integrand a product of two "different" types?
YES -> try integration by parts (LIATE order)
Is the integrand a rational function P(x)/Q(x)?
YES -> try partial fractions (after polynomial division if deg P >= deg Q)
Does the integrand contain sqrt(a^2-x^2), sqrt(a^2+x^2), or sqrt(x^2-a^2)?
YES -> try trig substitution (x = a sin(theta), a tan(theta), a sec(theta), respectively)
Does the integrand contain e^x times polynomial or trig?
YES -> try tabular integration by parts
Nothing works? -> check integral tables / computer algebra system
P.4 Numerical Integration Comparison
Method | Error | Formula | When to use
Midpoint | $O(h^2)$ | $h\sum f(x_i + h/2)$ | Smooth functions, simple implementation
Trapezoid | $O(h^2)$ | $h\left[\frac{f(a)+f(b)}{2} + \sum f(x_i)\right]$ | Periodic functions (converges exponentially fast)
Simpson's | $O(h^4)$ | $\frac{h}{3}\left[f_0 + 4f_1 + 2f_2 + 4f_3 + \cdots + f_n\right]$ | Smooth functions, better accuracy
Gauss-Legendre | $O(h^{2n})$ | Optimal nodes & weights | High accuracy, smooth integrands
Monte Carlo | $O(n^{-1/2})$ | $\frac{b-a}{n}\sum f(x_i)$, $x_i\sim U[a,b]$ | High dimensions, discontinuous $f$
Quasi-MC | $O((\log n)^d/n)$ | Low-discrepancy sequences | High dimensions with structure
P.5 Key ML Formulas Involving Integration
$\mathbb{E}_{x\sim p}[f(x)] = \int f(x)\,p(x)\,dx \approx \frac{1}{n}\sum_{i=1}^n f(x_i)$ (Monte Carlo / mini-batch)
$D_{\mathrm{KL}}(p\,\|\,q) = \int p(x)\ln\frac{p(x)}{q(x)}\,dx \ge 0$ (Gibbs' inequality)
$\mathcal{L}_{\mathrm{VAE}} = \mathbb{E}_{q_\phi(z|x)}[\ln p_\theta(x|z)] - D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z))$ (ELBO)
$\nabla_\theta\,\mathbb{E}_{p_\theta}[f(x)] = \mathbb{E}_{p_\theta}[f(x)\,\nabla_\theta \ln p_\theta(x)]$ (REINFORCE / score gradient)
$\frac{dz(t)}{dt} = f_\theta(z(t),t) \implies z(T) = z(0) + \int_0^T f_\theta(z(t),t)\,dt$ (Neural ODE)
Appendix Q: Exercises - Worked Solutions (Continued)
Q.1 Exercise 5 - Numerical Integration Full Walkthrough
Problem: Compare trapezoid vs. Simpson's for ∫01e−x2dx.
True value: using the error function, $\int_0^1 e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2}\,\mathrm{erf}(1) \approx 0.746824$.
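A sketch of the comparison using SciPy's built-in composite rules (scipy.integrate.trapezoid and scipy.integrate.simpson) on sampled grids:

```python
import numpy as np
from scipy.integrate import trapezoid, simpson
from scipy.special import erf

f = lambda x: np.exp(-x**2)
exact = np.sqrt(np.pi) / 2 * erf(1.0)   # ~ 0.746824

for n in (4, 8, 16):                    # n subintervals -> n + 1 sample points
    x = np.linspace(0, 1, n + 1)
    y = f(x)
    err_trap = abs(trapezoid(y, x) - exact)
    err_simp = abs(simpson(y, x=x) - exact)
    assert err_simp < err_trap          # Simpson is more accurate at every n

x = np.linspace(0, 1, 17)
assert abs(simpson(f(x), x=x) - exact) < 1e-6   # O(h^4) pays off quickly
```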
Interpretation: the KL divergence $\mathrm{KL}\!\left(N(\mu,1)\,\|\,N(0,1)\right)$ is exactly $\mu^2/2$. This is the term in the VAE ELBO that penalizes the encoder mean for drifting far from zero.
Q.3 Exercise 7 - Gradient of Expected Loss
Problem: Compute $\nabla_\theta\,\mathbb{E}_{x\sim p_\theta}[\ell(x)]$ via the log-derivative trick.
Solution: Using the score function / REINFORCE estimator:
$\nabla_\theta\,\mathbb{E}_{p_\theta}[\ell(x)] = \nabla_\theta \int \ell(x)\,p_\theta(x)\,dx$
Assuming we can differentiate under the integral (DCT applies):
$= \int \ell(x)\,\nabla_\theta p_\theta(x)\,dx = \int \ell(x)\,p_\theta(x)\,\nabla_\theta \ln p_\theta(x)\,dx = \mathbb{E}_{p_\theta}\!\left[\ell(x)\,\nabla_\theta \ln p_\theta(x)\right]$
This is the REINFORCE gradient estimator. It requires only samples from $p_\theta$ and the ability to evaluate $\ln p_\theta$ - no reparameterization needed. The cost is high variance, motivating control variates (baselines) to reduce $\mathrm{Var}\!\left[\ell(x)\,\nabla_\theta \ln p_\theta(x)\right]$.
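A toy check of the estimator, assuming the illustrative choice $p_\theta = N(\theta, 1)$ and $\ell(x) = x^2$, so that $\mathbb{E}[\ell(x)] = \theta^2 + 1$ and the exact gradient is $2\theta$; the score is $\nabla_\theta \ln p_\theta(x) = x - \theta$:

```python
import numpy as np

rng = np.random.default_rng(7)
theta = 0.5
n = 1_000_000
x = rng.normal(theta, 1.0, n)

# REINFORCE: E[l(x) * score] with score = (x - theta) for N(theta, 1)
grad_reinforce = (x**2 * (x - theta)).mean()
assert abs(grad_reinforce - 2 * theta) < 0.05   # true gradient is 2*theta

# Baseline b = E[l(x)] as a control variate: same mean, lower variance
b = (x**2).mean()
grad_baseline = ((x**2 - b) * (x - theta)).mean()
assert abs(grad_baseline - 2 * theta) < 0.05
```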
Appendix R: Historical Timeline Extended
R.1 Chronological Development
Era | Contributor | Achievement
~250 BCE | Archimedes | Method of exhaustion for area of parabolic segment; bounds on π
~1640s | Cavalieri | Cavalieri's principle; "method of indivisibles"
1665-1666 | Newton | Inverse tangent problem; "method of fluxions"; FTC discovered privately
1675-1684 | Leibniz | Independent discovery; modern ∫ notation; publication of FTC (1684)
1696 | L'Hôpital | First calculus textbook (based on Bernoulli's lectures)
1734 | Bishop Berkeley | The Analyst - criticism of infinitesimals as "ghosts of departed quantities"
1748 | Euler | Introductio in Analysin Infinitorum - systematic treatment of functions
1821-1823 | Cauchy | Rigorous definition of limit; definite integral via Riemann-style sums
1854 | Riemann | Rigorous Riemann integral; characterization of integrable functions
1875 | Darboux | Upper/lower sums; cleaner formulation of the Riemann integral
1894-1902 | Lebesgue | Measure theory; Lebesgue integral; MCT and DCT
1900s | Hilbert | $L^2$ spaces; integration as inner product; functional analysis
1920s-1940s | Kolmogorov | Probability as measure theory; rigorous foundation for ML
1940s-1950s | Ulam, von Neumann, Metropolis | Monte Carlo - stochastic integration for physics
1986 | Rumelhart et al. | Backpropagation - systematic chain rule through networks (Jacobians)
2018 | Chen et al. | Neural ODEs - continuous-depth networks via ODE integration
2020s | Diffusion models | Score matching, DDPM, flow matching - integration at the core of generation
R.2 The Newton-Leibniz Priority Dispute
The FTC was discovered independently by Newton (1666, unpublished) and Leibniz (1675-1684, published). The Royal Society's official investigation (1712) wrongly accused Leibniz of plagiarism, damaging his reputation and creating a rift between British and Continental mathematicians. British mathematicians, loyal to Newton's notation, fell behind Continental Europe for ~100 years. The lesson: notation matters - Leibniz's ∫ and d/dx notation won out and is the notation used universally today.
R.3 Why Newton Called Integration "Quadrature"
Newton's original term for integration was "quadrature" - from the Latin for "making a square." The original problem was: given a curve, find a square with the same area. This geometric framing persisted for centuries. Leibniz's more algebraic approach (anti-differentiation) is what we use today, but the term "quadrature" survives in "Gaussian quadrature" - numerical integration using optimal node placement.
Appendix S: Advanced Topics Preview
This section previews topics that build directly on integration and appear in subsequent sections:
S.1 -> 04 Sequences and Series
Taylor series represents a function as an infinite sum:
$f(x) = \sum_{n=0}^{\infty}\frac{f^{(n)}(a)}{n!}(x-a)^n$
The remainder term in Taylor's theorem is an integral:
$R_n(x) = \frac{1}{n!}\int_a^x (x-t)^n f^{(n+1)}(t)\,dt$
Integration and series interact via term-by-term integration - valid when the series converges uniformly: $\int \sum_n f_n(x)\,dx = \sum_n \int f_n(x)\,dx$.
Continuous random variables live entirely in the integration framework. The CDF is an integral of the PDF; expectation is an integral; the central limit theorem involves convergence of distribution functions; characteristic functions are Fourier transforms - all integration.