Math for LLMs

Integration

Calculus Fundamentals / Integration

"Integration is not just finding areas - it is the mathematics of accumulation, and accumulation is the mathematics of learning."

Overview

Integration is the second pillar of calculus, dual to differentiation. Where the derivative asks "how fast is this changing right now?", the integral asks "how much has accumulated over this interval?" The Fundamental Theorem of Calculus (FTC) reveals these two operations as inverses of each other - one of the most profound connections in mathematics.

Every core operation in machine learning involves integration. The expected loss E[L]\mathbb{E}[\mathcal{L}] is an integral over a data distribution. KL divergence and entropy - the foundations of information-theoretic learning - are improper integrals of probability densities. Stochastic gradient descent is a Monte Carlo estimator of a gradient integral. Normalizing flows use the change-of-variables formula from integration theory. Understanding integration at the level of definitions and proofs, not just formulas, separates practitioners who can innovate from those who can only apply recipes.

This section builds the complete single-variable integration theory: Riemann sums, the FTC, all standard techniques (substitution, parts, partial fractions), improper integrals, numerical methods (trapezoid, Simpson's, Monte Carlo), and integration in probability theory. The treatment is rigorous but always anchored to concrete ML applications.

Prerequisites

  • Limits and continuity - 01-Limits-and-Continuity: limit definition, squeeze theorem, continuity on [a,b][a,b]
  • Derivatives and differentiation - 02-Derivatives-and-Differentiation: chain rule, product rule, all elementary derivatives
  • Algebra: polynomial long division, factoring, partial fraction setup
  • Trigonometry: sin\sin, cos\cos, tan\tan identities; inverse trig derivatives

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Riemann sums, FTC, all techniques, numerical methods, probability integration |
| exercises.ipynb | 10 graded exercises from basic antiderivatives to Monte Carlo and KL divergence |

Learning Objectives

After completing this section, you will:

  • Define the definite integral via Riemann sums and compute it as a limit
  • State and prove both parts of the Fundamental Theorem of Calculus
  • Apply u-substitution, integration by parts, and partial fractions to evaluate integrals
  • Classify and evaluate improper integrals using limit definitions and convergence tests
  • Implement trapezoid, Simpson's, and Monte Carlo numerical integration
  • Express expectation, variance, KL divergence, and entropy as integrals
  • Explain why stochastic gradient descent is a Monte Carlo estimator of the gradient integral
  • Verify numerical integration accuracy and analyze error bounds

1. Intuition

1.1 Accumulation as the Core Idea

The derivative measures instantaneous rate of change. The integral measures total accumulation. These are two sides of the same coin - and the FTC is the theorem that makes this precise.

Physical intuition. If v(t)v(t) is velocity at time tt, then abv(t)dt\int_a^b v(t)\,dt is the total distance traveled from time aa to bb. Each infinitesimal time slice dtdt contributes a tiny distance v(t)dtv(t)\,dt; the integral sums them all.

Geometric intuition. abf(x)dx\int_a^b f(x)\,dx is the signed area between the curve y=f(x)y = f(x) and the xx-axis over [a,b][a,b]. Area above the axis is positive; area below is negative.

Probabilistic intuition. If p(x)p(x) is a probability density, abp(x)dx\int_a^b p(x)\,dx is the probability that a random draw falls in [a,b][a,b]. Integration is the language of probability.

1.2 Historical Motivation

| Year | Development |
| --- | --- |
| ~250 BCE | Archimedes uses the method of exhaustion to compute areas of parabolas |
| 1670s | Newton and Leibniz independently develop calculus; Leibniz introduces the $\int$ notation |
| 1823 | Cauchy gives a rigorous definition of the definite integral |
| 1854 | Riemann formalizes the integral via sums - the definition used today |
| 1875 | Darboux introduces upper/lower sums, simplifying Riemann's theory |
| 1902 | Lebesgue generalizes the integral to a much broader class of functions |

The \int symbol is an elongated "S" for summa (Latin: sum) - Leibniz's notation for the limit of infinitely many infinitesimal summands.

1.3 Why Integration Is Central to AI

Expected loss. The training objective of every probabilistic model is:

L(θ)=E(x,y)pdata[(fθ(x),y)]=(fθ(x),y)p(x,y)dxdy\mathcal{L}(\theta) = \mathbb{E}_{(\mathbf{x},y)\sim p_{\text{data}}}[\ell(f_\theta(\mathbf{x}), y)] = \int \ell(f_\theta(\mathbf{x}), y)\,p(\mathbf{x},y)\,d\mathbf{x}\,dy

This is an integral. Stochastic gradient descent estimates it via a Monte Carlo sum over a mini-batch.

KL divergence. The loss function used in variational autoencoders (VAEs), diffusion models, and language model alignment (RLHF):

KL(qp)=q(x)lnq(x)p(x)dx\text{KL}(q\|p) = \int q(x)\ln\frac{q(x)}{p(x)}\,dx

Entropy. The Shannon entropy of a continuous distribution:

H(p)=p(x)lnp(x)dxH(p) = -\int p(x)\ln p(x)\,dx

This integral appears in cross-entropy loss, mutual information, and the information-theoretic analysis of generalization.

Normalizing flows. Change-of-variables: if z=f(x)\mathbf{z} = f(\mathbf{x}) is a bijection, the density transforms as:

pX(x)=pZ(f(x))detfxp_X(\mathbf{x}) = p_Z(f(\mathbf{x}))\left|\det\frac{\partial f}{\partial \mathbf{x}}\right|

The absolute Jacobian determinant is the multidimensional version of the substitution rule from this section.

Gaussian integral. The normalization constant of every Gaussian distribution relies on ex2dx=π\int_{-\infty}^\infty e^{-x^2}\,dx = \sqrt{\pi} - an improper integral computed in 8.4.

For AI: Every backpropagation pass computes a stochastic estimate of the gradient of an integral (the expected loss) - integration and differentiation are inseparable in machine learning.


2. The Definite Integral - Riemann's Definition

2.1 Partitions and Riemann Sums

A partition of [a,b][a,b] is a finite set of points a=x0<x1<<xn=ba = x_0 < x_1 < \cdots < x_n = b. The mesh (or norm) is P=maxk(xkxk1)\|\mathcal{P}\| = \max_k(x_k - x_{k-1}).

For each subinterval [xk1,xk][x_{k-1}, x_k] of width Δxk=xkxk1\Delta x_k = x_k - x_{k-1}, choose a sample point xk[xk1,xk]x_k^* \in [x_{k-1}, x_k]. The Riemann sum is:

S(P,f)=k=1nf(xk)ΔxkS(\mathcal{P}, f) = \sum_{k=1}^n f(x_k^*)\,\Delta x_k

Three standard choices of xkx_k^*:

  • Left endpoint: xk=xk1x_k^* = x_{k-1} -> left Riemann sum LnL_n
  • Right endpoint: xk=xkx_k^* = x_k -> right Riemann sum RnR_n
  • Midpoint: xk=(xk1+xk)/2x_k^* = (x_{k-1}+x_k)/2 -> midpoint Riemann sum MnM_n

Example. f(x)=x2f(x) = x^2 on [0,1][0,1], uniform partition with nn intervals, right endpoints:

Rn=k=1n(kn)21n=1n3k=1nk2=1n3n(n+1)(2n+1)6n13R_n = \sum_{k=1}^n \left(\frac{k}{n}\right)^2 \cdot \frac{1}{n} = \frac{1}{n^3}\sum_{k=1}^n k^2 = \frac{1}{n^3}\cdot\frac{n(n+1)(2n+1)}{6} \xrightarrow{n\to\infty} \frac{1}{3}
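The limit can be watched numerically. A minimal sketch (function and variable names are illustrative) comparing left and right sums, which bracket the exact value $1/3$:

```python
# Left/right/midpoint Riemann sums for f(x) = x^2 on [0, 1].
import numpy as np

def riemann_sum(f, a, b, n, rule="right"):
    x = np.linspace(a, b, n + 1)            # partition nodes x_0, ..., x_n
    h = (b - a) / n                         # uniform width Delta x
    if rule == "left":
        pts = x[:-1]
    elif rule == "right":
        pts = x[1:]
    else:                                   # midpoint
        pts = (x[:-1] + x[1:]) / 2
    return h * np.sum(f(pts))

for n in [10, 100, 1000]:
    L = riemann_sum(lambda x: x**2, 0, 1, n, "left")
    R = riemann_sum(lambda x: x**2, 0, 1, n, "right")
    print(f"n={n:5d}  L_n={L:.6f}  R_n={R:.6f}")
# L_n and R_n squeeze the exact value 1/3 from below and above.
```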

2.2 The Limit Definition

The definite integral of ff over [a,b][a,b] is:

abf(x)dx=limP0S(P,f)\int_a^b f(x)\,dx = \lim_{\|\mathcal{P}\|\to 0} S(\mathcal{P}, f)

provided this limit exists and is the same for every choice of sample points. When this limit exists, ff is Riemann integrable on [a,b][a,b].

For uniform partitions (Δx=(ba)/n\Delta x = (b-a)/n, right endpoints):

abf(x)dx=limnk=1nf ⁣(a+kban)ban\int_a^b f(x)\,dx = \lim_{n\to\infty}\sum_{k=1}^n f\!\left(a + k\cdot\frac{b-a}{n}\right)\cdot\frac{b-a}{n}

2.3 Geometric Interpretation: Signed Area

The integral computes signed area: regions where f(x)>0f(x) > 0 contribute positively; regions where f(x)<0f(x) < 0 contribute negatively.

02πsinxdx=0(positive and negative areas cancel)\int_0^{2\pi}\sin x\,dx = 0 \quad \text{(positive and negative areas cancel)} 02πsinxdx=4(total unsigned area)\int_0^{2\pi}|\sin x|\,dx = 4 \quad \text{(total unsigned area)}

2.4 Properties of the Definite Integral

For integrable f,gf, g on [a,b][a,b]:

| Property | Formula |
| --- | --- |
| Linearity | $\int_a^b [cf(x)+g(x)]\,dx = c\int_a^b f(x)\,dx + \int_a^b g(x)\,dx$ |
| Additivity | $\int_a^b f\,dx = \int_a^c f\,dx + \int_c^b f\,dx$ for any $c \in [a,b]$ |
| Monotonicity | $f \leq g \Rightarrow \int_a^b f\,dx \leq \int_a^b g\,dx$ |
| Reverse limits | $\int_b^a f\,dx = -\int_a^b f\,dx$ |
| Zero width | $\int_a^a f\,dx = 0$ |
| Bound | $\left\lvert \int_a^b f\,dx \right\rvert \leq \int_a^b \lvert f\rvert\,dx$ |

Mean Value Theorem for Integrals. If ff is continuous on [a,b][a,b], there exists c(a,b)c \in (a,b) with:

f(c)=1baabf(x)dxf(c) = \frac{1}{b-a}\int_a^b f(x)\,dx

The integral equals the function value at some interior point times the interval length.

2.5 Integrability Conditions

Sufficient condition 1: If ff is continuous on [a,b][a,b], then ff is Riemann integrable. (This covers all smooth functions and all activation functions in ML.)

Sufficient condition 2: If ff is bounded and monotone on [a,b][a,b], then ff is Riemann integrable.

Sufficient condition 3: If ff is bounded on [a,b][a,b] and has only finitely many discontinuities, then ff is Riemann integrable. (ReLU itself is continuous everywhere; it is its derivative - a step function with a single jump at 0 - that this condition covers.)


3. Antiderivatives and the Indefinite Integral

3.1 Definition and the +C Convention

A function FF is an antiderivative of ff on an interval II if F(x)=f(x)F'(x) = f(x) for all xIx \in I.

Theorem (Uniqueness up to constant). If FF and GG are both antiderivatives of ff on II, then F(x)G(x)=CF(x) - G(x) = C for some constant CC.

Proof. Let H=FGH = F - G. Then H(x)=F(x)G(x)=f(x)f(x)=0H'(x) = F'(x) - G'(x) = f(x) - f(x) = 0 for all xIx \in I. By the MVT corollary, H0H' \equiv 0 implies HH is constant. \square

The indefinite integral encodes the entire family of antiderivatives:

f(x)dx=F(x)+C\int f(x)\,dx = F(x) + C

where CC is an arbitrary constant. The +C+C is not optional - it represents a genuinely different function for each value of CC.

3.2 Basic Antiderivative Table

| $f(x)$ | $\int f(x)\,dx$ | Condition |
| --- | --- | --- |
| $x^n$ | $\dfrac{x^{n+1}}{n+1} + C$ | $n \neq -1$ |
| $x^{-1} = 1/x$ | $\ln\lvert x\rvert + C$ | $x \neq 0$ |
| $e^x$ | $e^x + C$ | |
| $a^x$ | $\dfrac{a^x}{\ln a} + C$ | $a > 0,\ a\neq 1$ |
| $\sin x$ | $-\cos x + C$ | |
| $\cos x$ | $\sin x + C$ | |
| $\sec^2 x$ | $\tan x + C$ | |
| $1/\sqrt{1-x^2}$ | $\arcsin x + C$ | $\lvert x\rvert < 1$ |
| $1/(1+x^2)$ | $\arctan x + C$ | |
| $\sinh x$ | $\cosh x + C$ | |
| $\cosh x$ | $\sinh x + C$ | |

Verification: Every row can be checked by differentiating the right side.
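A small sympy sketch of that verification (the positive-$x$ assumption sidesteps the $\ln\lvert x\rvert$ branch; rows chosen are illustrative):

```python
# Spot-check table rows by differentiating the claimed antiderivative.
import sympy as sp

x = sp.symbols('x', positive=True)      # positivity avoids ln|x| branch issues
rows = [
    (x**3, x**4 / 4),                   # power rule, n = 3
    (1 / x, sp.log(x)),                 # the n = -1 exception
    (sp.sin(x), -sp.cos(x)),
    (1 / (1 + x**2), sp.atan(x)),
    (sp.cosh(x), sp.sinh(x)),
]
for f, F in rows:
    assert sp.simplify(sp.diff(F, x) - f) == 0
print("all rows verified")
```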

3.3 Linearity

[cf(x)+g(x)]dx=cf(x)dx+g(x)dx\int [cf(x) + g(x)]\,dx = c\int f(x)\,dx + \int g(x)\,dx

This follows immediately from the linearity of differentiation.

Example. (3x25cosx+ex)dx=x35sinx+ex+C\int(3x^2 - 5\cos x + e^x)\,dx = x^3 - 5\sin x + e^x + C.

3.4 Initial Value Problems

An initial value problem (IVP) specifies f(x)f'(x) and an initial condition f(x0)=y0f(x_0) = y_0, and asks for f(x)f(x).

Procedure:

  1. Find the general antiderivative: f(x)=f(x)dx=F(x)+Cf(x) = \int f'(x)\,dx = F(x) + C
  2. Apply the initial condition: y0=F(x0)+CC=y0F(x0)y_0 = F(x_0) + C \Rightarrow C = y_0 - F(x_0)

Example. f(x)=3x22xf'(x) = 3x^2 - 2x, f(1)=5f(1) = 5.

f(x)=x3x2+Cf(x) = x^3 - x^2 + C. At x=1x=1: 11+C=5C=51 - 1 + C = 5 \Rightarrow C = 5.

So f(x)=x3x2+5f(x) = x^3 - x^2 + 5.

For AI. Solving θ˙=L(θ)\dot{\theta} = -\nabla\mathcal{L}(\theta) as an ODE gives the continuous-time analogue of gradient descent. The solution is an integral: θ(T)=θ(0)0TL(θ(t))dt\theta(T) = \theta(0) - \int_0^T \nabla\mathcal{L}(\theta(t))\,dt. Neural ODEs (Chen et al., 2018) make this explicit by parameterizing the dynamics as a neural network and solving the integral numerically via an ODE solver.


4. The Fundamental Theorem of Calculus

The FTC is the central theorem of calculus - it reveals that differentiation and integration are inverse operations and provides a practical method for evaluating definite integrals.

4.1 FTC Part 1

Theorem (FTC Part 1). Let ff be continuous on [a,b][a,b] and define:

G(x)=axf(t)dt,x[a,b]G(x) = \int_a^x f(t)\,dt, \quad x \in [a,b]

Then GG is differentiable on (a,b)(a,b) and G(x)=f(x)G'(x) = f(x).

Proof. For h>0h > 0:

G(x+h)G(x)h=1hxx+hf(t)dt\frac{G(x+h) - G(x)}{h} = \frac{1}{h}\int_x^{x+h} f(t)\,dt

By the MVT for integrals, there exists ch(x,x+h)c_h \in (x, x+h) with 1hxx+hf(t)dt=f(ch)\frac{1}{h}\int_x^{x+h} f(t)\,dt = f(c_h).

As h0+h \to 0^+: chxc_h \to x, so by continuity of ff: f(ch)f(x)f(c_h) \to f(x). A symmetric argument handles h0h \to 0^-. Therefore G(x)=f(x)G'(x) = f(x). \square

Interpretation. The area-accumulation function G(x)=axf(t)dtG(x) = \int_a^x f(t)\,dt has derivative f(x)f(x) - the rate of growth of accumulated area at xx equals the function value f(x)f(x). This is obvious geometrically: adding a thin strip of height f(x)f(x) and width hh gives area f(x)h\approx f(x) \cdot h.

Generalization (Leibniz rule):

ddxg(x)h(x)f(t)dt=f(h(x))h(x)f(g(x))g(x)\frac{d}{dx}\int_{g(x)}^{h(x)} f(t)\,dt = f(h(x))\cdot h'(x) - f(g(x))\cdot g'(x)

4.2 FTC Part 2

Theorem (FTC Part 2). If ff is continuous on [a,b][a,b] and FF is any antiderivative of ff, then:

abf(x)dx=F(b)F(a)[F(x)]ab\int_a^b f(x)\,dx = F(b) - F(a) \equiv \Big[F(x)\Big]_a^b

Proof. Let G(x)=axf(t)dtG(x) = \int_a^x f(t)\,dt. By Part 1, G(x)=f(x)=F(x)G'(x) = f(x) = F'(x). So G(x)F(x)=CG(x) - F(x) = C (constant). At x=ax = a: G(a)F(a)=0F(a)G(a) - F(a) = 0 - F(a), giving C=F(a)C = -F(a). At x=bx = b: G(b)=abf(t)dt=F(b)F(a)G(b) = \int_a^b f(t)\,dt = F(b) - F(a). \square

4.3 The Bridge

Why FTC Part 2 is powerful. Before the FTC, computing abf(x)dx\int_a^b f(x)\,dx required constructing Riemann sums and taking limits - laborious for any non-trivial function. The FTC reduces this to: find any antiderivative FF, evaluate at bb and aa, subtract.

Example. 01x2dx=[x33]01=130=13\int_0^1 x^2\,dx = \Big[\frac{x^3}{3}\Big]_0^1 = \frac{1}{3} - 0 = \frac{1}{3}. (Recall: directly computing via Riemann sums required summing k2\sum k^2.)

4.4 Worked Examples

Example 1. 1e1xdx=[lnx]1e=lneln1=10=1\int_1^e \frac{1}{x}\,dx = [\ln x]_1^e = \ln e - \ln 1 = 1 - 0 = 1.

Example 2. 0πsinxdx=[cosx]0π=(cosπ)(cos0)=1+1=2\int_0^\pi \sin x\,dx = [-\cos x]_0^\pi = (-\cos\pi) - (-\cos 0) = 1 + 1 = 2.

Example 3. 11exdx=[ex]11=ee1=e1/e\int_{-1}^{1} e^x\,dx = [e^x]_{-1}^1 = e - e^{-1} = e - 1/e.

Example 4 (net displacement vs distance). A particle has velocity v(t)=t24v(t) = t^2 - 4. On [0,3][0,3]:

Net displacement=03(t24)dt=[t334t]03=912=3\text{Net displacement} = \int_0^3 (t^2-4)\,dt = \left[\frac{t^3}{3} - 4t\right]_0^3 = 9-12 = -3 Distance=03t24dt=02(4t2)dt+23(t24)dt=163+73=233\text{Distance} = \int_0^3 |t^2-4|\,dt = \int_0^2(4-t^2)\,dt + \int_2^3(t^2-4)\,dt = \frac{16}{3} + \frac{7}{3} = \frac{23}{3}
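A quick numerical cross-check of Example 4, assuming scipy is available (the `points` argument tells `quad` about the kink of $|v|$ at $t=2$):

```python
# Net displacement vs total distance for v(t) = t^2 - 4 on [0, 3].
from scipy.integrate import quad

v = lambda t: t**2 - 4
net, _ = quad(v, 0, 3)                                 # -> -3 (signed area)
dist, _ = quad(lambda t: abs(v(t)), 0, 3, points=[2])  # kink of |v| at t = 2
print(net, dist, 23 / 3)                               # -3.0  7.666...  7.666...
```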

4.5 FTC and Automatic Differentiation

FTC Part 1 has a direct counterpart in modern ML: the adjoint method for training neural ODEs. The derivative of a loss L\mathcal{L} with respect to the initial state z(0)\mathbf{z}(0), where z(T)=z(0)+0Tf(z(t),t;θ)dt\mathbf{z}(T) = \mathbf{z}(0) + \int_0^T f(\mathbf{z}(t),t;\theta)\,dt, is computed via an integral that runs backward in time. This is FTC Part 1 applied to the adjoint state - the computational core of the torchdiffeq library.


5. Integration by Substitution

5.1 The Reverse Chain Rule

Theorem. If u=g(x)u = g(x) is differentiable and ff is continuous on the range of gg, then:

f(g(x))g(x)dx=f(u)du(evaluated at u=g(x))\int f(g(x))\,g'(x)\,dx = \int f(u)\,du \quad \text{(evaluated at } u = g(x)\text{)}

Proof. Let FF be an antiderivative of ff, so F=fF' = f. By the chain rule:

ddx[F(g(x))]=F(g(x))g(x)=f(g(x))g(x)\frac{d}{dx}[F(g(x))] = F'(g(x))\cdot g'(x) = f(g(x))\cdot g'(x)

Therefore F(g(x))F(g(x)) is an antiderivative of f(g(x))g(x)f(g(x))\cdot g'(x). \square

Procedure:

  1. Identify a function u=g(x)u = g(x) whose derivative g(x)g'(x) appears (or nearly appears) in the integrand.
  2. Compute du=g(x)dxdu = g'(x)\,dx.
  3. Substitute: replace g(x)g(x) with uu and g(x)dxg'(x)\,dx with dudu.
  4. Integrate in uu.
  5. Back-substitute: replace uu with g(x)g(x).

5.2 Definite Integrals: Changing Limits

For abf(g(x))g(x)dx\int_a^b f(g(x))g'(x)\,dx with u=g(x)u = g(x): change limits to u(a)u(a) and u(b)u(b):

abf(g(x))g(x)dx=g(a)g(b)f(u)du\int_a^b f(g(x))g'(x)\,dx = \int_{g(a)}^{g(b)} f(u)\,du

This avoids back-substitution at the end.

5.3 Worked Examples

Example 1. sin(x2)2xdx\int \sin(x^2)\cdot 2x\,dx. Let u=x2u = x^2, du=2xdxdu = 2x\,dx:

=sinudu=cosu+C=cos(x2)+C= \int \sin u\,du = -\cos u + C = -\cos(x^2) + C

Example 2. 01ex2(2x)dx\int_0^1 e^{-x^2}\cdot(-2x)\,dx. Let u=x2u = -x^2, limits: u(0)=0u(0)=0, u(1)=1u(1)=-1:

=01eudu=[eu]01=e11= \int_0^{-1} e^u\,du = [e^u]_0^{-1} = e^{-1} - 1

Example 3. xx2+1dx\int \frac{x}{x^2+1}\,dx. Let u=x2+1u = x^2+1, du=2xdxdu = 2x\,dx:

=12duu=12lnu+C=12ln(x2+1)+C= \frac{1}{2}\int\frac{du}{u} = \frac{1}{2}\ln|u| + C = \frac{1}{2}\ln(x^2+1) + C

Example 4. tanxdx=sinxcosxdx\int \tan x\,dx = \int \frac{\sin x}{\cos x}\,dx. Let u=cosxu = \cos x, du=sinxdxdu = -\sin x\,dx:

=duu=lncosx+C=lnsecx+C= -\int\frac{du}{u} = -\ln|\cos x| + C = \ln|\sec x| + C
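Example 2's limit change can also be confirmed numerically - a minimal sketch:

```python
# Both sides of the substitution in Example 2 agree.
import numpy as np
from scipy.integrate import quad

lhs, _ = quad(lambda x: np.exp(-x**2) * (-2 * x), 0, 1)  # original x-integral
rhs = -quad(np.exp, -1, 0)[0]     # int_0^{-1} e^u du = -int_{-1}^0 e^u du
print(lhs, rhs, np.exp(-1) - 1)   # all three ~ -0.63212
```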

Example 5 (softmax normalization). For zRK\mathbf{z} \in \mathbb{R}^K:

Z=k=1Kezk=ezmaxk=1KezkzmaxZ = \sum_{k=1}^K e^{z_k} = e^{z_{\max}}\sum_{k=1}^K e^{z_k - z_{\max}}

The subtraction of zmaxz_{\max} is a discrete analog of substitution u=zzmaxu = z - z_{\max}, making all terms 1\leq 1 for numerical stability.

5.4 For AI: Change of Variables in Normalizing Flows

A normalizing flow defines a bijection x=f(z)\mathbf{x} = f(\mathbf{z}) where zpZ\mathbf{z} \sim p_Z. The change-of-variables formula:

pX(x)=pZ(f1(x))detJf1(x)p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x}))\left|\det J_{f^{-1}}(\mathbf{x})\right|

is the multivariate version of uu-substitution: abf(g(x))g(x)dx=g(a)g(b)f(u)du\int_a^b f(g(x))g'(x)\,dx = \int_{g(a)}^{g(b)}f(u)\,du, where g(x)|g'(x)| becomes the absolute Jacobian determinant. Real NVP, Glow, and FFJORD all implement variants of this transformation to learn complex densities from simple base distributions.
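A scalar sketch of the formula, assuming the illustrative flow $x = \tanh(z)$ with base density $\mathcal{N}(0,1)$: the transformed density should integrate to 1.

```python
# Scalar change of variables: z ~ N(0,1), x = tanh(z) (illustrative flow).
# p_X(x) = p_Z(arctanh(x)) * |d/dx arctanh(x)|, with |(f^{-1})'(x)| = 1/(1-x^2).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def p_x(x):
    return norm.pdf(np.arctanh(x)) / (1 - x**2)

total, _ = quad(p_x, -1 + 1e-12, 1 - 1e-12)
print(total)   # ~ 1.0: the transformed density is correctly normalized
```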


6. Integration by Parts

6.1 The Reverse Product Rule

Theorem. If u(x)u(x) and v(x)v(x) are differentiable, then:

udv=uvvdu\int u\,dv = uv - \int v\,du

Proof. The product rule gives (uv)=uv+uv(uv)' = u'v + uv'. Integrating both sides:

uv=uvdx+uvdxuv = \int u'v\,dx + \int uv'\,dx

Rearranging: uvdx=uvuvdx\int uv'\,dx = uv - \int u'v\,dx. Writing dv=vdxdv = v'dx and du=udxdu = u'dx gives the formula. \square

For definite integrals:

abudv=[uv]ababvdu\int_a^b u\,dv = \Big[uv\Big]_a^b - \int_a^b v\,du

6.2 Choosing u and dv - LIATE

The acronym LIATE gives a priority order for choosing uu (choose the type that comes first):

  • Logarithms: lnx\ln x, logax\log_a x
  • Inverse trig: arctanx\arctan x, arcsinx\arcsin x
  • Algebraic: polynomials xnx^n, x\sqrt{x}
  • Trigonometric: sinx\sin x, cosx\cos x
  • Exponential: exe^x, axa^x

The complementary factor becomes dvdv and must be something we can integrate.

6.3 Reduction Formulas and Tabular Method

For integrals like xnexdx\int x^n e^x\,dx, repeated integration by parts produces a reduction formula:

xnexdx=xnexnxn1exdx\int x^n e^x\,dx = x^n e^x - n\int x^{n-1}e^x\,dx

The tabular method (also called "tic-tac-toe" or "successive differentiation") organizes repeated parts efficiently:

| Sign | Differentiate ($u$) | Integrate ($dv$) |
| --- | --- | --- |
| $+$ | $x^3$ | $e^x$ |
| $-$ | $3x^2$ | $e^x$ |
| $+$ | $6x$ | $e^x$ |
| $-$ | $6$ | $e^x$ |
| $+$ | $0$ | $e^x$ |

Result: x3exdx=ex(x33x2+6x6)+C\int x^3 e^x\,dx = e^x(x^3 - 3x^2 + 6x - 6) + C.
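The result is verified like every antiderivative - differentiate it (sympy sketch):

```python
# Differentiating the tabular-method result recovers the integrand.
import sympy as sp

x = sp.symbols('x')
F = sp.exp(x) * (x**3 - 3 * x**2 + 6 * x - 6)
assert sp.simplify(sp.diff(F, x) - x**3 * sp.exp(x)) == 0
print("F'(x) = x^3 e^x confirmed")
```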

6.4 Worked Examples

Example 1. xexdx\int x e^x\,dx. Let u=xu = x, dv=exdxdv = e^x\,dx:

=xexexdx=xexex+C=ex(x1)+C= xe^x - \int e^x\,dx = xe^x - e^x + C = e^x(x-1) + C

Example 2. lnxdx\int \ln x\,dx. Let u=lnxu = \ln x, dv=dxdv = dx:

=xlnxx1xdx=xlnxx+C=x(lnx1)+C= x\ln x - \int x\cdot\frac{1}{x}\,dx = x\ln x - x + C = x(\ln x - 1) + C

Example 3. x2sinxdx\int x^2\sin x\,dx. Apply tabular method (alternating signs):

=x2cosx+2xsinx+2cosx+C= -x^2\cos x + 2x\sin x + 2\cos x + C

Example 4 (cyclic). exsinxdx\int e^x\sin x\,dx. Let I=exsinxdxI = \int e^x\sin x\,dx. Integrate by parts twice:

I=exsinxexcosxI    2I=ex(sinxcosx)    I=ex(sinxcosx)2+CI = e^x\sin x - e^x\cos x - I \implies 2I = e^x(\sin x - \cos x) \implies I = \frac{e^x(\sin x - \cos x)}{2} + C

6.5 For AI: REINFORCE Gradient Estimator

The REINFORCE algorithm (Williams, 1992) estimates θEτpθ[R(τ)]\nabla_\theta \mathbb{E}_{\tau\sim p_\theta}[R(\tau)], where τ\tau is a trajectory and RR is its return. The log-derivative trick - a close relative of integration by parts (compare the score-matching derivation in Appendix E.5) - gives:

θExpθ[f(x)]=Expθ[f(x)θlogpθ(x)]\nabla_\theta\mathbb{E}_{x\sim p_\theta}[f(x)] = \mathbb{E}_{x\sim p_\theta}[f(x)\nabla_\theta\log p_\theta(x)]

This identity follows from differentiating under the integral sign and rewriting $\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$. It is the core of policy gradient methods in reinforcement learning (PPO, GRPO, RLHF fine-tuning).
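A minimal sketch of the estimator on a toy case with a known answer: for $x \sim \mathcal{N}(\theta, 1)$ and $f(x) = x^2$ (both illustrative choices), $\mathbb{E}[x^2] = \theta^2 + 1$, so the true gradient is $2\theta$.

```python
# Score-function (REINFORCE) estimate of d/dtheta E_{x~N(theta,1)}[x^2].
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(theta, 1.0, size=200_000)

score = x - theta                    # grad_theta log N(x; theta, 1)
grad_est = np.mean(x**2 * score)     # f(x) * score, averaged over samples
print(grad_est, 2 * theta)           # ~ 3.0, up to Monte Carlo noise
```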


7. Partial Fractions

7.1 Decomposing Rational Functions

Partial fraction decomposition writes a rational function P(x)/Q(x)P(x)/Q(x) (where degP<degQ\deg P < \deg Q) as a sum of simpler fractions. This converts integrals of rational functions into sums of basic integrals.

Setup:

  1. Factor Q(x)Q(x) completely over R\mathbb{R}.
  2. Write the partial fraction decomposition.
  3. Solve for unknown constants (cover-up method or comparing coefficients).
  4. Integrate each term.

7.2 Cases and Worked Examples

Case 1: Distinct linear factors. Q(x)=(xa1)(xa2)(xan)Q(x) = (x-a_1)(x-a_2)\cdots(x-a_n):

P(x)Q(x)=A1xa1+A2xa2++Anxan\frac{P(x)}{Q(x)} = \frac{A_1}{x-a_1} + \frac{A_2}{x-a_2} + \cdots + \frac{A_n}{x-a_n}

Example. 1x21dx=1(x1)(x+1)dx\int\frac{1}{x^2-1}\,dx = \int\frac{1}{(x-1)(x+1)}\,dx.

Decompose: 1(x1)(x+1)=Ax1+Bx+1\frac{1}{(x-1)(x+1)} = \frac{A}{x-1}+\frac{B}{x+1}.

Multiply both sides by (x1)(x+1)(x-1)(x+1): 1=A(x+1)+B(x1)1 = A(x+1) + B(x-1).

At x=1x=1: 1=2AA=1/21 = 2A \Rightarrow A = 1/2. At x=1x=-1: 1=2BB=1/21 = -2B \Rightarrow B = -1/2.

1x21dx=12lnx112lnx+1+C=12lnx1x+1+C\int\frac{1}{x^2-1}\,dx = \frac{1}{2}\ln|x-1| - \frac{1}{2}\ln|x+1| + C = \frac{1}{2}\ln\left|\frac{x-1}{x+1}\right| + C

Case 2: Repeated linear factors. For (xa)m(x-a)^m:

A1xa+A2(xa)2++Am(xa)m\frac{A_1}{x-a} + \frac{A_2}{(x-a)^2} + \cdots + \frac{A_m}{(x-a)^m}

Example. x(x1)2dx\int\frac{x}{(x-1)^2}\,dx.

x(x1)2=Ax1+B(x1)2\frac{x}{(x-1)^2} = \frac{A}{x-1} + \frac{B}{(x-1)^2}. Multiply: x=A(x1)+Bx = A(x-1) + B.

At x=1x=1: B=1B=1. Compare xx-coefficients: 1=A1 = A, so A=1A=1.

x(x1)2dx=lnx11x1+C\int\frac{x}{(x-1)^2}\,dx = \ln|x-1| - \frac{1}{x-1} + C

Case 3: Irreducible quadratic factors. For x2+bx+cx^2 + bx + c (no real roots):

Ax+Bx2+bx+c\frac{Ax+B}{x^2+bx+c}

Example. 1x(x2+1)dx\int\frac{1}{x(x^2+1)}\,dx.

1x(x2+1)=Ax+Bx+Cx2+1\frac{1}{x(x^2+1)} = \frac{A}{x} + \frac{Bx+C}{x^2+1}.

Multiply: 1=A(x2+1)+(Bx+C)x1 = A(x^2+1) + (Bx+C)x. At x=0x=0: A=1A=1. Comparing x2x^2: 0=A+BB=10=A+B \Rightarrow B=-1. Comparing x1x^1: 0=C0=C.

1x(x2+1)dx=lnx12ln(x2+1)+C=12lnx2x2+1+C\int\frac{1}{x(x^2+1)}\,dx = \ln|x| - \frac{1}{2}\ln(x^2+1) + C = \frac{1}{2}\ln\frac{x^2}{x^2+1} + C
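All three cases can be reproduced mechanically with sympy's `apart` (a quick sketch):

```python
# Partial fraction decomposition and integration for all three cases.
import sympy as sp

x = sp.symbols('x')
for expr in [1 / (x**2 - 1),          # distinct linear factors
             x / (x - 1)**2,          # repeated linear factor
             1 / (x * (x**2 + 1))]:   # irreducible quadratic factor
    print(sp.apart(expr), "  ->  ", sp.integrate(expr, x))
```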

For AI. Partial fractions appear in z-transform analysis (discrete-time signal processing), computing closed-form solutions to linear recurrences (relevant to LSTM and S4 state space models), and in Laplace transform computations for control-theoretic analysis of learning dynamics.


8. Improper Integrals

8.1 Type I: Infinite Limits

Definition. For ff continuous on [a,)[a,\infty):

af(x)dx=limbabf(x)dx\int_a^\infty f(x)\,dx = \lim_{b\to\infty}\int_a^b f(x)\,dx

If the limit exists and is finite, the integral converges; otherwise it diverges.

Similarly bf(x)dx=limaabf(x)dx\int_{-\infty}^b f(x)\,dx = \lim_{a\to-\infty}\int_a^b f(x)\,dx and fdx=cfdx+cfdx\int_{-\infty}^\infty f\,dx = \int_{-\infty}^c f\,dx + \int_c^\infty f\,dx for any cc.

Example 1 (exponential decay).

0exdx=limb[ex]0b=limb(1eb)=1\int_0^\infty e^{-x}\,dx = \lim_{b\to\infty}[-e^{-x}]_0^b = \lim_{b\to\infty}(1-e^{-b}) = 1

Example 2 (pp-integral).

11xpdx={1p1p>1p1\int_1^\infty \frac{1}{x^p}\,dx = \begin{cases} \dfrac{1}{p-1} & p > 1 \\ \infty & p \leq 1 \end{cases}

Proof. For p1p \neq 1: 1bxpdx=b1p11p\int_1^b x^{-p}\,dx = \frac{b^{1-p}-1}{1-p}. As bb\to\infty: converges iff 1p<01-p < 0 iff p>1p > 1. \square

8.2 Type II: Unbounded Integrands

Definition. If ff has a vertical asymptote at x=ax = a:

abf(x)dx=limε0+a+εbf(x)dx\int_a^b f(x)\,dx = \lim_{\varepsilon\to 0^+}\int_{a+\varepsilon}^b f(x)\,dx

Example. 011xdx=limε0+[2x]ε1=20=2\int_0^1 \frac{1}{\sqrt{x}}\,dx = \lim_{\varepsilon\to 0^+}[2\sqrt{x}]_\varepsilon^1 = 2 - 0 = 2. Converges.

011xdx=limε0+[lnx]ε1=0()=\int_0^1 \frac{1}{x}\,dx = \lim_{\varepsilon\to 0^+}[\ln x]_\varepsilon^1 = 0 - (-\infty) = \infty. Diverges.

8.3 Convergence Tests

Comparison test. If 0f(x)g(x)0 \leq f(x) \leq g(x) for xax \geq a:

  • agdx\int_a^\infty g\,dx converges \Rightarrow afdx\int_a^\infty f\,dx converges
  • afdx\int_a^\infty f\,dx diverges \Rightarrow agdx\int_a^\infty g\,dx diverges

Limit comparison test. If f,g>0f, g > 0 and limxf(x)/g(x)=L(0,)\lim_{x\to\infty} f(x)/g(x) = L \in (0,\infty), then af\int_a^\infty f and ag\int_a^\infty g both converge or both diverge.

Absolute convergence. If af(x)dx<\int_a^\infty |f(x)|\,dx < \infty, then af(x)dx\int_a^\infty f(x)\,dx converges.

8.4 The Gaussian Integral

The most important improper integral in probability and machine learning:

ex2dx=π\int_{-\infty}^\infty e^{-x^2}\,dx = \sqrt{\pi}

Proof (polar coordinates trick). Let I=ex2dxI = \int_{-\infty}^\infty e^{-x^2}\,dx. Then:

I2=ex2dxey2dy=e(x2+y2)dxdyI^2 = \int_{-\infty}^\infty e^{-x^2}\,dx\cdot\int_{-\infty}^\infty e^{-y^2}\,dy = \int_{-\infty}^\infty\int_{-\infty}^\infty e^{-(x^2+y^2)}\,dx\,dy

Convert to polar (x=rcosθx = r\cos\theta, y=rsinθy = r\sin\theta, dxdy=rdrdθdx\,dy = r\,dr\,d\theta):

I2=02π0er2rdrdθ=2π0rer2dr=2π12=πI^2 = \int_0^{2\pi}\int_0^\infty e^{-r^2}r\,dr\,d\theta = 2\pi\int_0^\infty re^{-r^2}\,dr = 2\pi\cdot\frac{1}{2} = \pi

Therefore I=πI = \sqrt{\pi}. \square

Consequence. The standard Gaussian N(0,1)\mathcal{N}(0,1) normalizes:

12πex2/2dx=1\int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx = 1

(substitution u=x/2u = x/\sqrt{2} converts to the Gaussian integral).
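Both statements are easy to confirm numerically (`quad` handles the infinite limits):

```python
# The Gaussian integral and the N(0,1) normalization.
import numpy as np
from scipy.integrate import quad

I, _ = quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
Z, _ = quad(lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi), -np.inf, np.inf)
print(I, np.sqrt(np.pi))   # both ~ 1.7724539
print(Z)                   # ~ 1.0
```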

8.5 For AI: Entropy, KL Divergence, and Expected Loss

Entropy of a Gaussian. For XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2):

H(X)=p(x)lnp(x)dx=12ln(2πeσ2)H(X) = -\int_{-\infty}^\infty p(x)\ln p(x)\,dx = \frac{1}{2}\ln(2\pi e\sigma^2)

This improper integral converges because p(x)lnp(x)0p(x)\ln p(x) \to 0 faster than any polynomial as x±x\to\pm\infty.

KL divergence between Gaussians. For p=N(μ1,σ12)p = \mathcal{N}(\mu_1,\sigma_1^2), q=N(μ2,σ22)q = \mathcal{N}(\mu_2,\sigma_2^2):

KL(pq)=lnσ2σ1+σ12+(μ1μ2)22σ2212\text{KL}(p\|q) = \ln\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}

This closed form follows from evaluating p(x)[lnp(x)lnq(x)]dx\int_{-\infty}^\infty p(x)[\ln p(x) - \ln q(x)]\,dx using the Gaussian integral. It appears as the regularization term in the VAE ELBO objective.
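A sketch comparing the closed form against direct numerical integration of the defining integral (parameter values are illustrative):

```python
# Closed-form Gaussian KL vs direct numerical evaluation.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0          # illustrative parameters
closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

integrand = lambda x: norm.pdf(x, mu1, s1) * (norm.logpdf(x, mu1, s1)
                                              - norm.logpdf(x, mu2, s2))
numeric, _ = quad(integrand, -np.inf, np.inf)
print(closed, numeric)                          # agree to quad's tolerance
```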

Expected cross-entropy loss. The population risk:

R(θ)=p(x,y)logqθ(yx)dxdyR(\theta) = -\int p(\mathbf{x},y)\log q_\theta(y|\mathbf{x})\,d\mathbf{x}\,dy

is an improper integral over the joint distribution. SGD computes an unbiased Monte Carlo estimate using a mini-batch.


9. Numerical Integration

When an antiderivative cannot be expressed in closed form (e.g., ex2dx\int e^{-x^2}\,dx, sin(x2)dx\int \sin(x^2)\,dx), numerical methods approximate the definite integral.

9.1 Trapezoid Rule

Approximate the integrand on each subinterval by a straight line (trapezoid).

For uniform step h=(ba)/nh = (b-a)/n and nodes xk=a+khx_k = a + kh:

abf(x)dxTn=h2[f(x0)+2f(x1)+2f(x2)++2f(xn1)+f(xn)]\int_a^b f(x)\,dx \approx T_n = \frac{h}{2}\left[f(x_0) + 2f(x_1) + 2f(x_2) + \cdots + 2f(x_{n-1}) + f(x_n)\right]

Error bound. If f(x)M|f''(x)| \leq M on [a,b][a,b]:

ET=abfdxTnM(ba)312n2=O(h2)|E_T| = \left|\int_a^b f\,dx - T_n\right| \leq \frac{M(b-a)^3}{12n^2} = O(h^2)

The trapezoid rule is second-order accurate - halving hh reduces error by a factor of 4.

9.2 Simpson's Rule

Approximate the integrand on each pair of subintervals by a quadratic (parabola). Requires nn even:

Sn=h3[f(x0)+4f(x1)+2f(x2)+4f(x3)++4f(xn1)+f(xn)]S_n = \frac{h}{3}\left[f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + \cdots + 4f(x_{n-1}) + f(x_n)\right]

Pattern: 1, 4, 2, 4, 2, ..., 2, 4, 1, with the $n+1$ coefficients summing to $3n$ (so that $S_n \approx \frac{h}{3}\cdot 3n \cdot \bar{f} = (b-a)\bar{f}$, as it should).

Error bound. If f(4)(x)M|f^{(4)}(x)| \leq M:

ESM(ba)5180n4=O(h4)|E_S| \leq \frac{M(b-a)^5}{180n^4} = O(h^4)

Simpson's rule is fourth-order - halving hh reduces error by a factor of 16.

Example. Estimate 01exdx\int_0^1 e^x\,dx (true value: e11.71828e-1 \approx 1.71828) with n=4n = 4:

h=0.25h = 0.25, xk{0,0.25,0.5,0.75,1}x_k \in \{0, 0.25, 0.5, 0.75, 1\}.

$S_4 = \frac{0.25}{3}[e^0 + 4e^{0.25} + 2e^{0.5} + 4e^{0.75} + e^1] \approx 1.718319$ - within $4\times 10^{-5}$ of the true value, from only five function evaluations.
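A minimal composite implementation of both rules, reproducing this example and the predicted $O(h^2)$ vs $O(h^4)$ error decay:

```python
# Composite trapezoid and Simpson rules with observed error orders.
import numpy as np

def trapezoid(f, a, b, n):
    x = np.linspace(a, b, n + 1)
    h = (b - a) / n
    return h * (f(x[0]) / 2 + f(x[1:-1]).sum() + f(x[-1]) / 2)

def simpson(f, a, b, n):                 # n must be even
    x = np.linspace(a, b, n + 1)
    h = (b - a) / n
    return h / 3 * (f(x[0]) + 4 * f(x[1:-1:2]).sum()
                    + 2 * f(x[2:-1:2]).sum() + f(x[-1]))

true = np.e - 1
for n in [4, 8, 16]:
    print(n, abs(trapezoid(np.exp, 0, 1, n) - true),
             abs(simpson(np.exp, 0, 1, n) - true))
# Trapezoid error shrinks ~4x per doubling of n; Simpson's shrinks ~16x.
```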

9.3 Gaussian Quadrature

Instead of equally spaced nodes, choose nn optimal nodes {xk}\{x_k\} and weights {wk}\{w_k\} to exactly integrate all polynomials of degree 2n1\leq 2n-1:

11f(x)dxk=1nwkf(xk)\int_{-1}^1 f(x)\,dx \approx \sum_{k=1}^n w_k f(x_k)

The nodes are roots of the Legendre polynomial Pn(x)P_n(x). Gauss-Legendre quadrature is the most accurate quadrature rule for smooth functions - nn nodes achieve O(h2n)O(h^{2n}) error.
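numpy ships the Gauss-Legendre nodes and weights; a sketch showing 5 nodes integrating a degree-8 polynomial exactly (the polynomial is an arbitrary illustrative choice):

```python
# 5-point Gauss-Legendre: exact on polynomials of degree <= 9.
import numpy as np

nodes, weights = np.polynomial.legendre.leggauss(5)
poly = lambda x: x**8 - 3 * x**4 + 2            # degree 8
approx = np.sum(weights * poly(nodes))
exact = 2 / 9 - 6 / 5 + 4                       # analytic value on [-1, 1]
print(approx, exact)                            # equal to machine precision
```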

9.4 Monte Carlo Integration

Idea. For abf(x)dx\int_a^b f(x)\,dx: draw nn i.i.d. samples X1,,XnUniform(a,b)X_1,\ldots,X_n \sim \text{Uniform}(a,b) and estimate:

I^n=bank=1nf(Xk)\hat{I}_n = \frac{b-a}{n}\sum_{k=1}^n f(X_k)

By the Law of Large Numbers: I^na.s.abf(x)dx\hat{I}_n \xrightarrow{a.s.} \int_a^b f(x)\,dx.

Error. By the CLT:

n(I^nI)dN(0,(ba)2Var[f(X)])\sqrt{n}(\hat{I}_n - I) \xrightarrow{d} \mathcal{N}(0, (b-a)^2\text{Var}[f(X)])

Standard error: SE=(ba)Std[f(X)]n=O(1/n)\text{SE} = \frac{(b-a)\,\text{Std}[f(X)]}{\sqrt{n}} = O(1/\sqrt{n}).

Key property. The O(1/n)O(1/\sqrt{n}) convergence rate is dimension-independent. Grid-based rules suffer from the curse of dimensionality (error $O(n^{-2/d})$ for the trapezoid rule and $O(n^{-4/d})$ for Simpson's when $n$ points fill a $d$-dimensional grid), but Monte Carlo's rate stays O(1/n)O(1/\sqrt{n}) regardless of dimension. This is why high-dimensional integration in ML (expectations over data distributions, latent variables, trajectories) always uses Monte Carlo.

Variance reduction. Importance sampling: draw Xkq(x)X_k \sim q(x) instead of uniform, estimate:

$\hat{I} = \frac{1}{n}\sum_{k=1}^n \frac{f(X_k)}{q(X_k)}$, which is unbiased since $\mathbb{E}_q[f(X)/q(X)] = \int f(x)\,dx$.

Choosing qfq \propto |f| minimizes variance - the basis of importance-weighted autoencoders (IWAE).
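A sketch of both estimators for $\int_0^1 e^{-x^2}\,dx \approx 0.74682$ (the proposal $q(x) \propto e^{-x}$ is an illustrative choice that roughly tracks the integrand's decay):

```python
# Uniform MC and importance sampling for int_0^1 exp(-x^2) dx.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.exp(-x**2)
n = 100_000

# Plain Monte Carlo with a CLT-based standard error:
vals = f(rng.uniform(0, 1, n))
print(f"uniform MC: {vals.mean():.5f} +/- {1.96 * vals.std(ddof=1) / np.sqrt(n):.5f}")

# Importance sampling from q(x) = c e^{-x} on [0, 1], via inverse-CDF sampling:
c = 1 / (1 - np.exp(-1))                 # normalizer of q
u = rng.uniform(0, 1, n)
xq = -np.log(1 - u / c)                  # X ~ q
w = f(xq) / (c * np.exp(-xq))            # f/q weights, nearly constant here
print(f"importance: {w.mean():.5f} +/- {1.96 * w.std(ddof=1) / np.sqrt(n):.5f}")
```

The importance-sampling confidence band comes out tighter because $f/q$ varies far less than $f$ itself.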

9.5 For AI: SGD as Monte Carlo Expectation

The true gradient update is:

θL(θ)=θ(fθ(x),y)p(x,y)dxdy\nabla_\theta \mathcal{L}(\theta) = \int \nabla_\theta \ell(f_\theta(\mathbf{x}), y)\,p(\mathbf{x},y)\,d\mathbf{x}\,dy

SGD with mini-batch B\mathcal{B} of size BB estimates this as:

^θL(θ)=1BiBθ(fθ(xi),yi)\widehat{\nabla}_\theta\mathcal{L}(\theta) = \frac{1}{B}\sum_{i\in\mathcal{B}} \nabla_\theta\ell(f_\theta(\mathbf{x}_i), y_i)

This is a Monte Carlo estimate of the gradient integral. The mini-batch is drawn i.i.d. from the data distribution (uniformly at random from the training set), so it is an unbiased estimator with variance O(1/B)O(1/B). Larger batch sizes reduce variance but increase compute per step.
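A sketch of this unbiasedness on synthetic least-squares data (all names and sizes are illustrative):

```python
# Mini-batch gradients are unbiased MC estimates of the full-data gradient.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=10_000)
w = np.zeros(5)                                  # evaluate gradients at w = 0

def grad(Xb, yb, w):
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)    # gradient of mean squared error

full = grad(X, y, w)                             # "integral" over the dataset
ests = np.array([grad(X[idx], y[idx], w)
                 for idx in (rng.integers(0, 10_000, 32) for _ in range(2000))])
print(np.linalg.norm(ests.mean(axis=0) - full))  # ~ 0: unbiased
print(ests.std(axis=0))                          # batch noise, O(1/sqrt(B))
```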


10. Integration in Probability

Forward reference to 06-Probability Theory. The full treatment of random variables, distributions, expectation, and probabilistic reasoning is in 06-Probability-Theory. Here we cover the integration mechanics - how to compute expectations, variances, KL divergence, and entropy as definite or improper integrals.

10.1 Probability Density Functions

A probability density function (PDF) p:R[0,)p: \mathbb{R} \to [0,\infty) satisfies:

p(x)dx=1,p(x)0 for all x\int_{-\infty}^\infty p(x)\,dx = 1, \qquad p(x) \geq 0 \text{ for all } x

The probability that XX falls in [a,b][a,b] is Pr(aXb)=abp(x)dx\Pr(a \leq X \leq b) = \int_a^b p(x)\,dx.

Common PDFs and their normalization integrals:

| Distribution | PDF | Normalization relies on |
| --- | --- | --- |
| Uniform on $[a,b]$ | $\frac{1}{b-a}$ | Elementary |
| Gaussian $\mathcal{N}(\mu,\sigma^2)$ | $\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}$ | Gaussian integral (8.4) |
| Exponential($\lambda$) | $\lambda e^{-\lambda x}$, $x\geq 0$ | $\int_0^\infty \lambda e^{-\lambda x}\,dx = 1$ |
| Laplace($\mu, b$) | $\frac{1}{2b}e^{-\lvert x-\mu\rvert/b}$ | Splitting at $x=\mu$ into two elementary exponential integrals |

10.2 CDF and FTC

The cumulative distribution function (CDF) is:

F(x)=Pr(Xx)=xp(t)dtF(x) = \Pr(X \leq x) = \int_{-\infty}^x p(t)\,dt

By FTC Part 1: F(x)=p(x)F'(x) = p(x) - the PDF is the derivative of the CDF. This connects probability theory directly to the FTC: the CDF is the "area accumulation function" of the PDF, and differentiating it recovers the density.

Properties of the CDF:

  • F()=0F(-\infty) = 0, F(+)=1F(+\infty) = 1
  • FF is non-decreasing: x1<x2F(x1)F(x2)x_1 < x_2 \Rightarrow F(x_1) \leq F(x_2)
  • FF is right-continuous
  • Pr(a<Xb)=F(b)F(a)\Pr(a < X \leq b) = F(b) - F(a) (FTC Part 2)

10.3 Expectation as a Weighted Integral

E[X]=xp(x)dx\mathbb{E}[X] = \int_{-\infty}^\infty x\,p(x)\,dx E[g(X)]=g(x)p(x)dx(Law of the Unconscious Statistician)\mathbb{E}[g(X)] = \int_{-\infty}^\infty g(x)\,p(x)\,dx \quad \text{(Law of the Unconscious Statistician)}

Gaussian expectation. For XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2):

E[X]=x1σ2πe(xμ)2/(2σ2)dx=μ\mathbb{E}[X] = \int_{-\infty}^\infty x\cdot\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}\,dx = \mu

Proof. Substitute u=(xμ)/σu = (x-\mu)/\sigma: E[X]=(σu+μ)12πeu2/2du\mathbb{E}[X] = \int_{-\infty}^\infty (\sigma u + \mu)\frac{1}{\sqrt{2\pi}}e^{-u^2/2}\,du. The σu\sigma u term integrates to 0 (odd function); the μ\mu term gives μ1\mu \cdot 1. \square

10.4 Variance and Second Moments

Var(X)=E[(Xμ)2]=(xμ)2p(x)dx=E[X2](E[X])2\text{Var}(X) = \mathbb{E}[(X-\mu)^2] = \int_{-\infty}^\infty (x-\mu)^2\,p(x)\,dx = \mathbb{E}[X^2] - (\mathbb{E}[X])^2

Gaussian variance. For XN(μ,σ2)X \sim \mathcal{N}(\mu,\sigma^2):

Var(X)=(xμ)21σ2πe(xμ)2/(2σ2)dx=σ2\text{Var}(X) = \int_{-\infty}^\infty (x-\mu)^2 \frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}\,dx = \sigma^2

Proven via substitution u=(xμ)/σu=(x-\mu)/\sigma and the identity u2eu2/2du=2π\int_{-\infty}^\infty u^2 e^{-u^2/2}\,du = \sqrt{2\pi} (integration by parts with uueu2/2u \cdot ue^{-u^2/2}).

10.5 KL Divergence

The Kullback-Leibler divergence from qq to pp:

KL(pq)=p(x)lnp(x)q(x)dx\text{KL}(p\|q) = \int_{-\infty}^\infty p(x)\ln\frac{p(x)}{q(x)}\,dx

Properties:

  • KL(pq)0\text{KL}(p\|q) \geq 0 (Gibbs' inequality - proven via Jensen's inequality and the concavity of ln\ln)
  • KL(pq)=0\text{KL}(p\|q) = 0 iff p=qp = q a.e.
  • Not symmetric: KL(pq)KL(qp)\text{KL}(p\|q) \neq \text{KL}(q\|p) in general

Forward vs. reverse KL:

  • $\text{KL}(p\|q)$ (forward, inclusive): forces $q$ to cover all modes of $p$ - mass-covering. This is what maximum likelihood estimation minimizes.
  • $\text{KL}(q\|p)$ (reverse, exclusive): lets $q$ concentrate on a single mode of $p$ - mode-seeking. Used in variational inference (ELBO $= -\text{KL}(q\|p) + \text{const}$).

Gibbs' inequality proof. By concavity of ln\ln: lntt1\ln t \leq t - 1 for all t>0t > 0. Apply with t=q(x)/p(x)t = q(x)/p(x):

KL(pq)=p(x)lnq(x)p(x)dxp(x)(q(x)p(x)1)dx=q(x)dxp(x)dx=11=0-\text{KL}(p\|q) = \int p(x)\ln\frac{q(x)}{p(x)}\,dx \leq \int p(x)\left(\frac{q(x)}{p(x)}-1\right)\,dx = \int q(x)\,dx - \int p(x)\,dx = 1-1=0

10.6 Entropy

The differential entropy of a continuous random variable XX with PDF pp:

H(X)=p(x)lnp(x)dx=E[lnp(X)]H(X) = -\int_{-\infty}^\infty p(x)\ln p(x)\,dx = -\mathbb{E}[\ln p(X)]

Gaussian entropy. For XN(μ,σ2)X \sim \mathcal{N}(\mu,\sigma^2):

H(X)=12ln(2πeσ2)=12[1+ln(2πσ2)]H(X) = \frac{1}{2}\ln(2\pi e\sigma^2) = \frac{1}{2}[1 + \ln(2\pi\sigma^2)]

Maximum entropy principle. Among all distributions with mean μ\mu and variance σ2\sigma^2, the Gaussian maximizes entropy. This makes the Gaussian the natural distribution for uncertainty - used throughout Bayesian deep learning (Gaussian priors, Gaussian posteriors in VAEs).

Connection to cross-entropy loss. For a model qθq_\theta trained on data from pp:

Exp[lnqθ(x)]=H(p)+KL(pqθ)\mathbb{E}_{x\sim p}[-\ln q_\theta(x)] = H(p) + \text{KL}(p\|q_\theta)

Minimizing cross-entropy loss minimizes KL(pqθ)\text{KL}(p\|q_\theta) (since H(p)H(p) is constant w.r.t. θ\theta). This is why cross-entropy training is equivalent to maximum likelihood estimation.


11. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
| --- | --- | --- | --- |
| 1 | Forgetting $+C$ in indefinite integrals | Every antiderivative family has a free constant; omitting it loses solutions to IVPs | Always write $+C$ and determine it from initial conditions |
| 2 | $\int f(x)g(x)\,dx = \int f\,dx \cdot \int g\,dx$ | Integration does NOT distribute over products | Use substitution or integration by parts |
| 3 | $\int_a^b f\,dx = F(b) - F(a)$ without checking $F' = f$ | If $F$ is wrong, the evaluation is wrong | Always verify the antiderivative by differentiating |
| 4 | Not changing limits in definite $u$-substitution | If you substitute $u = g(x)$, the limits must become $g(a)$ and $g(b)$ | Either change limits or back-substitute |
| 5 | $\int \frac{1}{x^2}\,dx = \ln(x^2)+C$ | Incorrect: $\int x^{-2}\,dx = -x^{-1}+C$; the $\ln$ antiderivative applies only to $1/x$ | Power rule: $\int x^n\,dx = x^{n+1}/(n+1)+C$ for $n\neq -1$ |
| 6 | $\int_0^1 \frac{1}{x}\,dx = [\ln x]_0^1 = 0$ | This is an improper integral: $\ln x \to -\infty$ as $x \to 0^+$, and the integral diverges | Always check for discontinuities before applying FTC Part 2 |
| 7 | $\int_{-1}^{1}\frac{1}{x}\,dx = 0$ by symmetry | The integrand is odd, but the integral diverges; symmetry arguments fail for divergent integrals | Confirm convergence before using symmetry |
| 8 | Wrong LIATE choice causes circular integration | Choosing $u =$ exponential in $\int xe^x\,dx$ yields no simplification | Let $u$ be LIATE-first: $u=x$, $dv=e^x\,dx$ |
| 9 | Treating $\int_a^\infty f\,dx$ as finite without checking | Infinite integration limits require an explicit convergence check | Write as $\lim_{b\to\infty}\int_a^b$ and evaluate the limit |
| 10 | Confusing Monte Carlo $O(1/\sqrt{n})$ with deterministic $O(h^2)$ | Monte Carlo error is stochastic (in expectation/variance), not a uniform bound | Report $\pm 1.96\,\text{SE}$ confidence intervals for MC estimates |
| 11 | $\text{KL}(p\Vert q) = \text{KL}(q\Vert p)$ | KL divergence is not symmetric | Know the difference: forward KL is mass-covering, reverse KL is mode-seeking |
| 12 | $\int u\,dv = uv + \int v\,du$ (sign error in parts) | The formula has a minus sign: $\int u\,dv = uv - \int v\,du$ | Derive it from the product rule to remember the minus sign |

12. Exercises

Exercise 1 - Riemann Sums. For f(x)=x2f(x) = x^2 on [0,2][0,2] with n=8n = 8 equal subintervals: (a) Compute the left Riemann sum L8L_8. (b) Compute the right Riemann sum R8R_8. (c) Compute the exact value 02x2dx\int_0^2 x^2\,dx via FTC and verify L8L_8 \leq exact R8\leq R_8.

Exercise 2 - FTC and Antiderivatives. Evaluate: (a) 1e(lnx)2xdx\int_1^e \frac{(\ln x)^2}{x}\,dx (b) 0π/2sin3xcosxdx\int_0^{\pi/2}\sin^3 x\cos x\,dx (c) 0ln2ex1+exdx\int_0^{\ln 2}e^x\sqrt{1+e^x}\,dx

Exercise 3 - Integration by Parts. Compute: (a) x2exdx\int x^2 e^{-x}\,dx (b) ln(x2+1)dx\int \ln(x^2+1)\,dx (c) excosxdx\int e^x\cos x\,dx

Exercise 4 - Improper Integrals. Determine convergence and evaluate if finite: (a) 11x3/2dx\int_1^\infty \frac{1}{x^{3/2}}\,dx (b) 01lnxxdx\int_0^1 \frac{\ln x}{\sqrt{x}}\,dx (c) xex2dx\int_{-\infty}^\infty xe^{-x^2}\,dx

Exercise 5 - Numerical Integration. For f(x)=ex2f(x) = e^{-x^2} on [0,2][0,2]: (a) Compute TnT_n and SnS_n for n{4,8,16}n \in \{4, 8, 16\}. (b) Compare to scipy.integrate.quad result. (c) Plot the absolute error vs. nn for both methods on a log-log scale. Measure the observed convergence rates.

Exercise 6 - Monte Carlo Integration. Estimate 01sin(πx2)dx\int_0^1 \sin(\pi x^2)\,dx: (a) Implement basic Monte Carlo with n{100,1000,10000,100000}n \in \{100, 1000, 10000, 100000\}. (b) Plot the estimate and ±2SE\pm 2\,\text{SE} confidence band vs. nn. (c) Verify the O(1/n)O(1/\sqrt{n}) convergence by plotting nVar[I^n]n \cdot \text{Var}[\hat{I}_n] vs. nn.

Exercise 7 - KL Divergence. Let p=N(0,1)p = \mathcal{N}(0,1) and q=N(μ,σ2)q = \mathcal{N}(\mu, \sigma^2): (a) Derive the closed-form KL(pq)\text{KL}(p\|q) by evaluating the integral. (b) Implement numerical KL via Monte Carlo with 10510^5 samples. Verify against the closed form. (c) Plot KL(pq)\text{KL}(p\|q) as a function of μ[3,3]\mu \in [-3,3] (fixed σ=1\sigma=1) and as a function of σ[0.1,3]\sigma \in [0.1, 3] (fixed μ=0\mu=0). Observe the asymmetry.

Exercise 8 - ELBO and Variational Inference. The ELBO (Evidence Lower BOund) is:

ELBO(q)=Ezq[lnp(x,z)]Ezq[lnq(z)]=Ezq[lnp(xz)]KL(qpz)\text{ELBO}(q) = \mathbb{E}_{z\sim q}[\ln p(x,z)] - \mathbb{E}_{z\sim q}[\ln q(z)] = \mathbb{E}_{z\sim q}[\ln p(x|z)] - \text{KL}(q\|p_z)

(a) Show that lnp(x)=ELBO(q)+KL(qp(x))\ln p(x) = \text{ELBO}(q) + \text{KL}(q\|p(\cdot|x)) using the definition of conditional probability and KL divergence. (b) Explain why maximizing the ELBO is equivalent to maximizing a lower bound on lnp(x)\ln p(x). (c) For pz=N(0,1)p_z = \mathcal{N}(0,1) and q=N(μ,σ2)q = \mathcal{N}(\mu,\sigma^2), compute KL(qpz)\text{KL}(q\|p_z) and implement the reparameterization trick: z=μ+σεz = \mu + \sigma\varepsilon, εN(0,1)\varepsilon\sim\mathcal{N}(0,1).


13. Why This Matters for AI (2026 Perspective)

ConceptAI/ML Impact
Riemann sumsConceptual foundation of all numerical expectation estimates; discrete sum over a dataset approximates the continuous integral over the data-generating distribution
FTC Part 2Evaluating KL divergences, entropies, and moments in closed form; normalizing flow log-likelihoods
FTC Part 1Adjoint method for neural ODEs; sensitivity analysis of dynamical systems; continuous-time RL
u-SubstitutionChange-of-variables formula in normalizing flows (Real NVP, Glow, FFJORD); reparameterization trick in VAEs
Integration by partsLog-derivative (REINFORCE) trick for policy gradient methods (PPO, GRPO, DPO in RLHF)
Improper integralsConvergence of expected losses over infinite data; entropy and KL divergence for continuous distributions
Gaussian integralNormalization of all Gaussian distributions; closed-form KL divergences between Gaussians (VAE regularizer)
Trapezoid/SimpsonNumerical integration in scientific ML; ODE solver stepping (Euler, RK4 in neural ODEs)
Monte CarloSGD and mini-batch training; IWAE (importance-weighted); MCMC in Bayesian neural networks
PDF/CDF via FTCScore matching (diffusion model training); CDF inversion sampling; cumulative reward functions
Expectation as integralEvery loss function; ELBO; reward expectation in RL; attention weights as expectation over keys
KL divergenceVAE regularizer; DPO alignment loss; diffusion model score function; maximum entropy RL
EntropyInformation bottleneck principle; attention entropy regularization; exploration in RL
Cross-entropy <-> KLCross-entropy training = MLE = minimizing KL from true distribution to model

14. Conceptual Bridge

Looking back. This section built on two pillars from previous sections:

  1. Limits (01) - the Riemann integral is defined as a limit of sums; the FTC proof uses the MVT for integrals; convergence of improper integrals is a limit.
  2. Derivatives (02) - antiderivatives are "reverse derivatives"; FTC Part 1 says the area-accumulation function has derivative equal to the integrand; u-substitution reverses the chain rule; integration by parts reverses the product rule.

Every integration technique is the reverse of a differentiation technique. The FTC is the theorem that makes this reversal exact.

Looking forward.

  • 04-Series-and-Sequences - Taylor series are derived using higher-order derivatives; the Taylor remainder formula involves a definite integral of the (n+1)(n+1)-th derivative. Power series can be integrated term-by-term.
  • 05-Multivariate Calculus - double and triple integrals extend the Riemann definition to higher dimensions; Fubini's theorem allows iterated integration; the substitution rule becomes the Jacobian change-of-variables formula.
  • 06-Probability Theory - all continuous probability is integration: distributions, expectations, variances, moment-generating functions, characteristic functions, conditional distributions.
  • 08-Optimization - population risk is an integral; SGD is a Monte Carlo gradient estimator; natural gradient uses the Fisher information matrix, which is defined via an integral.

Position in the curriculum:

CHAPTER 4 - CALCULUS FUNDAMENTALS


  01 Limits and Continuity 
           (limit foundations, epsilon-delta, continuity)
         
  02 Derivatives and Differentiation 
           (chain rule, product rule, activation derivatives)
         
  03 Integration   YOU ARE HERE 
           (FTC links 02 <-> 03; substitution reverses chain rule)
         
  04 Series and Sequences 
           (Taylor series uses 02 derivatives + 03 integration)
         
  05 Multivariate Calculus 
         (double integrals, Fubini, Jacobians - extends 03)
         
  06 Probability Theory 
         (all continuous probability IS integration from 03)



<- Back to Calculus Fundamentals | Next: Series and Sequences ->


Appendix A: Extended Substitution Examples and Patterns

A.1 Recognising the Pattern

The hardest part of uu-substitution is identifying the right uu. The pattern to look for: one function is the derivative of another part of the integrand.

| Integrand pattern | Choose $u$ | Because |
| --- | --- | --- |
| $f(x^n)\cdot x^{n-1}$ | $u = x^n$ | $du = nx^{n-1}\,dx$ |
| $f(e^x)\cdot e^x$ | $u = e^x$ | $du = e^x\,dx$ |
| $f(\ln x)\cdot \frac{1}{x}$ | $u = \ln x$ | $du = \frac{1}{x}\,dx$ |
| $f(\sin x)\cdot \cos x$ | $u = \sin x$ | $du = \cos x\,dx$ |
| $f(\cos x)\cdot \sin x$ | $u = \cos x$ | $du = -\sin x\,dx$ |
| $f(\sqrt{x})\cdot \frac{1}{\sqrt{x}}$ | $u = \sqrt{x}$ | $du = \frac{1}{2\sqrt{x}}\,dx$ |

A.2 Rationalizing Substitutions

For integrals involving ax+b\sqrt{ax+b}: let u=ax+bu = \sqrt{ax+b} so u2=ax+bu^2 = ax+b and 2udu=adx2u\,du = a\,dx.

Example. x2x+1dx\int x\sqrt{2x+1}\,dx. Let u=2x+1u = \sqrt{2x+1}, u2=2x+1u^2 = 2x+1, x=(u21)/2x = (u^2-1)/2, dx=ududx = u\,du:

u212uudu=12(u4u2)du=u510u36+C=(2x+1)5/210(2x+1)3/26+C\int \frac{u^2-1}{2}\cdot u\cdot u\,du = \frac{1}{2}\int(u^4-u^2)\,du = \frac{u^5}{10} - \frac{u^3}{6} + C = \frac{(2x+1)^{5/2}}{10} - \frac{(2x+1)^{3/2}}{6} + C

A.3 Trigonometric Substitutions

For integrands involving a2x2\sqrt{a^2-x^2}, a2+x2\sqrt{a^2+x^2}, or x2a2\sqrt{x^2-a^2}:

| Form | Substitution | Identity used |
| --- | --- | --- |
| $\sqrt{a^2-x^2}$ | $x = a\sin\theta$ | $1-\sin^2\theta = \cos^2\theta$ |
| $\sqrt{a^2+x^2}$ | $x = a\tan\theta$ | $1+\tan^2\theta = \sec^2\theta$ |
| $\sqrt{x^2-a^2}$ | $x = a\sec\theta$ | $\sec^2\theta-1 = \tan^2\theta$ |

Example. 14x2dx\int\frac{1}{\sqrt{4-x^2}}\,dx. Let x=2sinθx = 2\sin\theta, dx=2cosθdθdx = 2\cos\theta\,d\theta:

2cosθdθ44sin2θ=2cosθ2cosθdθ=θ+C=arcsinx2+C\int\frac{2\cos\theta\,d\theta}{\sqrt{4-4\sin^2\theta}} = \int\frac{2\cos\theta}{2\cos\theta}\,d\theta = \theta + C = \arcsin\frac{x}{2} + C

For AI. Trigonometric substitutions appear when integrating radial functions in high-dimensional probability (e.g., volumes of spherical shells used in dd-dimensional Gaussian integrals).

A.4 The Softmax Integral - Partition Function

The softmax denominator (partition function) is a sum Z=kezkZ = \sum_k e^{z_k}, the discrete analogue of the integral Z=ef(x)dxZ = \int e^{f(x)}\,dx that appears in energy-based models. In the continuous case, computing ZZ is intractable in general - this is the core computational challenge of energy-based models. Variational autoencoders and diffusion models avoid computing ZZ directly by working with ratios or lower bounds.


Appendix B: Integration by Parts - Extended Examples and Theory

B.1 The Cyclic Trick

When integration by parts produces I=(something)cII = (\text{something}) - cI for a constant c1c \neq -1, solve for II:

I+cI=something    I=something1+cI + cI = \text{something} \implies I = \frac{\text{something}}{1+c}

This works for eaxcos(bx)dx\int e^{ax}\cos(bx)\,dx and eaxsin(bx)dx\int e^{ax}\sin(bx)\,dx:

eaxcos(bx)dx=eax(acos(bx)+bsin(bx))a2+b2+C\int e^{ax}\cos(bx)\,dx = \frac{e^{ax}(a\cos(bx) + b\sin(bx))}{a^2+b^2} + C eaxsin(bx)dx=eax(asin(bx)bcos(bx))a2+b2+C\int e^{ax}\sin(bx)\,dx = \frac{e^{ax}(a\sin(bx) - b\cos(bx))}{a^2+b^2} + C

Verification: Differentiate the right side to confirm.

B.2 Integration by Parts for Definite Integrals

Example. Show 0xexdx=1\int_0^\infty xe^{-x}\,dx = 1.

Let u=xu=x, dv=exdxdv=e^{-x}\,dx:

0xexdx=[xex]0+0exdx\int_0^\infty xe^{-x}\,dx = \Big[-xe^{-x}\Big]_0^\infty + \int_0^\infty e^{-x}\,dx

The boundary term: limxxex=0\lim_{x\to\infty} xe^{-x} = 0 (L'Hôpital: x/ex0x/e^x \to 0) and at x=0x=0: 00.

=0+[ex]0=0(1)=1= 0 + [-e^{-x}]_0^\infty = 0 - (-1) = 1

B.3 The Gamma Function

The Gamma function generalizes factorials to real arguments:

Γ(s)=0xs1exdx,s>0\Gamma(s) = \int_0^\infty x^{s-1}e^{-x}\,dx, \quad s > 0

Key properties (proven by integration by parts):

  • Γ(s+1)=sΓ(s)\Gamma(s+1) = s\,\Gamma(s) (reduction formula via parts)
  • Γ(n)=(n1)!\Gamma(n) = (n-1)! for positive integers nn
  • Γ(1/2)=π\Gamma(1/2) = \sqrt{\pi} (from the Gaussian integral)

Proof of reduction. Γ(s+1)=0xsexdx\Gamma(s+1) = \int_0^\infty x^s e^{-x}\,dx. Let u=xsu = x^s, dv=exdxdv = e^{-x}\,dx:

=[xsex]0+s0xs1exdx=0+sΓ(s)= [-x^s e^{-x}]_0^\infty + s\int_0^\infty x^{s-1}e^{-x}\,dx = 0 + s\,\Gamma(s) \quad \square
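A numerical sketch of the definition and both identities, using `scipy.special.gamma` as the reference implementation:

```python
# Gamma: integral definition vs scipy, plus the two key identities.
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

for s in [0.5, 2.5, 5.0]:
    val, _ = quad(lambda x, s=s: x**(s - 1) * np.exp(-x), 0, np.inf)
    print(s, val, gamma(s))              # integral matches gamma(s)
print(gamma(3.5), 2.5 * gamma(2.5))      # Gamma(s+1) = s * Gamma(s)
print(gamma(0.5)**2, np.pi)              # Gamma(1/2)^2 = pi
```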

For AI. The Gamma function appears in the normalizing constants of many probability distributions used in Bayesian ML: the Gamma distribution (prior for precision in Gaussian models), the Beta distribution (prior for probabilities in Dirichlet-multinomial models), the Student-t distribution (robust regression).

B.4 Wallis's Formula

From repeated integration by parts on 0π/2sinnxdx\int_0^{\pi/2}\sin^n x\,dx:

π2=224466133557=n=14n24n21\frac{\pi}{2} = \frac{2\cdot 2\cdot 4\cdot 4\cdot 6\cdot 6\cdots}{1\cdot 3\cdot 3\cdot 5\cdot 5\cdot 7\cdots} = \prod_{n=1}^\infty\frac{4n^2}{4n^2-1}

This is one of the earliest infinite product formulas for π\pi - an unexpected connection between integration and π\pi.


Appendix C: Improper Integrals - Convergence Tests in Detail

C.1 The p-Test Summary

11xpdx{=1p1p>1=p1011xpdx{=11pp<1=p1\int_1^\infty \frac{1}{x^p}\,dx \begin{cases} = \dfrac{1}{p-1} & p > 1 \\ = \infty & p \leq 1 \end{cases} \qquad \int_0^1 \frac{1}{x^p}\,dx \begin{cases} = \dfrac{1}{1-p} & p < 1 \\ = \infty & p \geq 1 \end{cases}

The boundary p=1p = 1 always diverges (1/xdx=lnx\int 1/x\,dx = \ln x, which diverges at both limits).

C.2 Comparison Test - Worked Examples

Example 1. Does 11x2+xdx\int_1^\infty \frac{1}{x^2+\sqrt{x}}\,dx converge?

For x1x \geq 1: x2+xx2x^2 + \sqrt{x} \geq x^2, so 1x2+x1x2\frac{1}{x^2+\sqrt{x}} \leq \frac{1}{x^2}.

Since 1x2dx=1\int_1^\infty x^{-2}\,dx = 1 converges, by comparison the original converges. \square

Example 2. Does 1lnxxdx\int_1^\infty \frac{\ln x}{x}\,dx converge?

For xex \geq e: lnx1\ln x \geq 1, so lnxx1x\frac{\ln x}{x} \geq \frac{1}{x}.

Since ex1dx\int_e^\infty x^{-1}\,dx diverges, by comparison the original diverges. \square

C.3 Absolute vs Conditional Convergence

af(x)dx\int_a^\infty f(x)\,dx is absolutely convergent if af(x)dx<\int_a^\infty |f(x)|\,dx < \infty.

af(x)dx\int_a^\infty f(x)\,dx is conditionally convergent if it converges but not absolutely.

Example. 0sinxxdx=π2\int_0^\infty \frac{\sin x}{x}\,dx = \frac{\pi}{2} (Dirichlet integral) - converges conditionally but NOT absolutely (0sinx/xdx=\int_0^\infty |\sin x|/x\,dx = \infty).

For ML. Absolutely convergent integrals behave nicely: they can be split, reordered, and approximated by truncated versions. The expected loss E[L]\mathbb{E}[\mathcal{L}] is absolutely convergent (non-negative integrand) - this is why expectation estimates via Monte Carlo are reliable.

C.4 Laplace Transform Preview

The Laplace transform is an improper integral parametrized by ss:

L{f}(s)=0f(t)estdt\mathcal{L}\{f\}(s) = \int_0^\infty f(t)e^{-st}\,dt

It converts differential equations to algebraic equations (used in control theory). For neural networks, the Laplace transform of the loss trajectory L{tL(θt)}\mathcal{L}\{t \mapsto \mathcal{L}(\theta_t)\} is related to the training dynamics in Laplace domain - an emerging tool in the theoretical analysis of gradient descent.


Appendix D: Numerical Integration - Error Analysis and Advanced Methods

D.1 Trapezoid Rule - Derivation from Scratch

On each subinterval [xk1,xk][x_{k-1}, x_k], the trapezoid rule approximates ff by the linear interpolant:

f(x)f(xk1)+f(xk)f(xk1)h(xxk1)f(x) \approx f(x_{k-1}) + \frac{f(x_k)-f(x_{k-1})}{h}(x-x_{k-1})

Integrating from xk1x_{k-1} to xkx_k:

xk1xkfdxf(xk1)h+f(xk)f(xk1)hh22=h2[f(xk1)+f(xk)]\int_{x_{k-1}}^{x_k} f\,dx \approx f(x_{k-1})\cdot h + \frac{f(x_k)-f(x_{k-1})}{h}\cdot\frac{h^2}{2} = \frac{h}{2}[f(x_{k-1})+f(x_k)]

Summing nn subintervals with telescoping:

Tn=h2[f(x0)+2f(x1)+2f(x2)++2f(xn1)+f(xn)]T_n = \frac{h}{2}[f(x_0) + 2f(x_1) + 2f(x_2) + \cdots + 2f(x_{n-1}) + f(x_n)]

Error derivation. By Taylor expansion on each subinterval:

xk1xkfdx=h2[f(xk1)+f(xk)]h312f(ξk)\int_{x_{k-1}}^{x_k} f\,dx = \frac{h}{2}[f(x_{k-1})+f(x_k)] - \frac{h^3}{12}f''(\xi_k)

Summing: ET=h2(ba)12fˉE_T = -\frac{h^2(b-a)}{12}\bar{f}'' where fˉ\bar{f}'' is some average of ff'' on [a,b][a,b] (MVT). So ETM2(ba)312n2|E_T| \leq \frac{M_2(b-a)^3}{12n^2} where M2=maxfM_2 = \max|f''|.

D.2 Simpson's Rule - Parabolic Approximation

On each pair of subintervals [x2k2,x2k][x_{2k-2}, x_{2k}], use the unique parabola through three points:

S2k=h3[f(x2k2)+4f(x2k1)+f(x2k)]S_{2k} = \frac{h}{3}[f(x_{2k-2}) + 4f(x_{2k-1}) + f(x_{2k})]

The coefficient pattern comes from integrating the Lagrange interpolating polynomial. The error:

ES=h4(ba)180fˉ(4)    ESM4(ba)5180n4E_S = -\frac{h^4(b-a)}{180}\bar{f}^{(4)} \implies |E_S| \leq \frac{M_4(b-a)^5}{180n^4}

D.3 Richardson Extrapolation

If TnT_n has error ET=c2/n2+c4/n4+E_T = c_2/n^2 + c_4/n^4 + \cdots, then combining TnT_n and T2nT_{2n}:

4T2nTn3\frac{4T_{2n} - T_n}{3}

eliminates the O(1/n2)O(1/n^2) term, giving a method with error O(1/n4)O(1/n^4) - equal to Simpson's! This idea, applied recursively, gives Romberg integration with error O(h2k)O(h^{2k}) for any kk.
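A sketch of the extrapolation on $\int_0^1 e^x\,dx$, reusing the composite trapezoid rule from 9.1:

```python
# Richardson: (4*T_{2n} - T_n)/3 upgrades the trapezoid rule to O(1/n^4).
import numpy as np

def trapezoid(f, a, b, n):
    x = np.linspace(a, b, n + 1)
    return (b - a) / n * (f(x[0]) / 2 + f(x[1:-1]).sum() + f(x[-1]) / 2)

true = np.e - 1
for n in [4, 8, 16]:
    Tn, T2n = trapezoid(np.exp, 0, 1, n), trapezoid(np.exp, 0, 1, 2 * n)
    print(n, abs(Tn - true), abs((4 * T2n - Tn) / 3 - true))
# The extrapolated error shrinks ~16x per doubling, matching Simpson's rule.
```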

D.4 Quasi-Monte Carlo

Standard Monte Carlo has error O(1/n)O(1/\sqrt{n}) regardless of dimension. Quasi-Monte Carlo (QMC) uses low-discrepancy sequences (Halton, Sobol) instead of random points:

  • Random points clump and leave gaps
  • Low-discrepancy sequences fill space more uniformly

QMC achieves O((logn)d/n)O((\log n)^d / n) error for smooth integrands in dd dimensions - much better than O(1/n)O(1/\sqrt{n}) when dd is small. Used in financial derivatives pricing and high-dimensional integration in Bayesian neural networks.
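A sketch comparing scrambled Sobol points (via `scipy.stats.qmc`) against plain Monte Carlo on a smooth 2-D integrand with a known value (the integrand is an illustrative choice):

```python
# Scrambled Sobol vs plain MC on f(x,y) = exp(-x^2 - y^2) over [0,1]^2.
import numpy as np
from scipy.stats import qmc
from scipy.special import erf

f = lambda p: np.exp(-p[:, 0]**2 - p[:, 1]**2)
true = (np.sqrt(np.pi) / 2 * erf(1.0))**2        # separable closed form

rng = np.random.default_rng(0)
for m in [8, 10, 12]:                            # n = 2^m points
    n = 2**m
    mc = f(rng.uniform(size=(n, 2))).mean()
    sob = f(qmc.Sobol(d=2, scramble=True, seed=0).random_base2(m=m)).mean()
    print(n, abs(mc - true), abs(sob - true))
# The Sobol error decays close to O(1/n), vs O(1/sqrt(n)) for plain MC.
```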

D.5 Adaptive Integration

scipy.integrate.quad uses Gaussian-Kronrod quadrature with adaptive refinement: if the error estimate on a subinterval exceeds tolerance, subdivide and integrate each piece separately. This automatically focuses computational effort on regions where ff varies rapidly - crucial for integrands with sharp peaks (e.g., probability densities in high-dimensional tails).


Appendix E: Probability Integration - Extended Topics

E.1 Moment-Generating Functions

The moment-generating function (MGF) of XX is:

MX(t)=E[etX]=etxp(x)dxM_X(t) = \mathbb{E}[e^{tX}] = \int_{-\infty}^\infty e^{tx}p(x)\,dx

provided this integral converges for tt in a neighborhood of 0.

Why it generates moments: Differentiate kk times at t=0t=0:

MX(k)(0)=E[Xke0X]=E[Xk]M_X^{(k)}(0) = \mathbb{E}[X^k e^{0\cdot X}] = \mathbb{E}[X^k]

So E[X]=M(0)\mathbb{E}[X] = M'(0), E[X2]=M(0)\mathbb{E}[X^2] = M''(0), Var(X)=M(0)[M(0)]2\text{Var}(X) = M''(0) - [M'(0)]^2.

Gaussian MGF. For XN(μ,σ2)X \sim \mathcal{N}(\mu,\sigma^2):

MX(t)=eμt+σ2t2/2M_X(t) = e^{\mu t + \sigma^2 t^2/2}

Proven by completing the square in the exponent under the integral and using the Gaussian integral.

E.2 Characteristic Functions and Fourier Transforms

The characteristic function ϕX(t)=E[eitX]=eitxp(x)dx\phi_X(t) = \mathbb{E}[e^{itX}] = \int e^{itx}p(x)\,dx is the Fourier transform of the PDF. Unlike the MGF, the characteristic function always exists (since eitx=1|e^{itx}| = 1).

The inverse Fourier transform recovers the PDF:

p(x)=12πeitxϕX(t)dtp(x) = \frac{1}{2\pi}\int_{-\infty}^\infty e^{-itx}\phi_X(t)\,dt

For AI. Fourier transforms appear in:

  • Random Fourier features (Rahimi & Recht, 2007): approximate shift-invariant kernels via k^(xy)=E[eiω(xy)]\hat{k}(x-y) = \mathbb{E}[e^{i\omega(x-y)}] where ωp(ω)\omega \sim p(\omega)
  • Frequency domain attention: Fourier attention mechanisms (FNet) replace self-attention with 2D FFT
  • Spectral normalization: the spectral norm $\|W\|_2$ of a weight matrix (its largest singular value) is computed via power iteration
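A minimal sketch of random Fourier features for the RBF kernel $e^{-\|x-y\|^2/2}$ (the dimension and feature count are arbitrary; `z` is the random feature map):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 2000                        # input dimension, number of random features

omega = rng.standard_normal((D, d))   # spectral samples omega ~ N(0, I) for the RBF kernel
b = rng.uniform(0, 2 * np.pi, D)      # random phases
z = lambda x: np.sqrt(2.0 / D) * np.cos(omega @ x + b)

x, y = rng.standard_normal(d), rng.standard_normal(d)
print(z(x) @ z(y))                                # Monte Carlo kernel estimate
print(np.exp(-np.linalg.norm(x - y)**2 / 2))      # exact RBF kernel value
```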

E.3 Conditional Expectations as Integrals

The conditional expectation E[YX=x]\mathbb{E}[Y|X=x] is defined via the conditional density p(yx)p(y|x):

E[YX=x]=yp(yx)dy\mathbb{E}[Y|X=x] = \int y\,p(y|x)\,dy

Law of total expectation: E[Y]=EX[E[YX]]=E[YX=x]p(x)dx\mathbb{E}[Y] = \mathbb{E}_X[\mathbb{E}[Y|X]] = \int \mathbb{E}[Y|X=x]\,p(x)\,dx

For AI. Diffusion model training minimizes EtEx0Extx0[ϵθ(xt,t)ϵ2]\mathbb{E}_t\mathbb{E}_{x_0}\mathbb{E}_{x_t|x_0}[\|\epsilon_\theta(x_t,t) - \epsilon\|^2] - a triple nested expectation. Each E\mathbb{E} is an integral; Monte Carlo (sampling) handles them all.

E.4 Integration and Maximum Likelihood

Maximum likelihood estimation (MLE) maximizes i=1np(xi;θ)\prod_{i=1}^n p(x_i;\theta) - equivalently, maximizes the log-likelihood ilogp(xi;θ)\sum_i \log p(x_i;\theta).

This sum is a Monte Carlo estimate of the integral:

1ni=1nlogp(xi;θ)nlogp(x;θ)ptrue(x)dx=H(ptrue)KL(ptruepθ)\frac{1}{n}\sum_{i=1}^n \log p(x_i;\theta) \xrightarrow{n\to\infty} \int \log p(x;\theta)\,p_{\text{true}}(x)\,dx = -H(p_{\text{true}}) - \text{KL}(p_{\text{true}}\|p_\theta)

Maximizing the likelihood is therefore (asymptotically) equivalent to minimizing $\text{KL}(p_{\text{true}}\|p_\theta)$ - the forward KL divergence from the true data distribution to the model. This is a fundamental connection between MLE, integration, and information theory.

E.5 Score Matching

The score function of a distribution $p$ is $s(x) = \nabla_x \log p(x)$. Score matching (Hyvärinen, 2005) estimates $p$ without computing its normalization constant $Z = \int e^{f(x)}\,dx$:

J(θ)=Exp[tr(xsθ(x))+12sθ(x)2]J(\theta) = \mathbb{E}_{x\sim p}\left[\text{tr}(\nabla_x s_\theta(x)) + \frac{1}{2}\|s_\theta(x)\|^2\right]

Integration by parts shows this equals $\frac{1}{2}\mathbb{E}_{x\sim p}[\|s_\theta(x) - s(x)\|^2]$ plus a constant - so minimizing $J(\theta)$ fits the model score to the true score without ever integrating $p$. Denoising variants of this objective are what denoising diffusion probabilistic models (DDPMs) train on.


Appendix F: Key Proofs and Derivations

F.1 Proof: Linearity of the Definite Integral

Theorem. ab[f(x)+g(x)]dx=abf(x)dx+abg(x)dx\int_a^b [f(x)+g(x)]\,dx = \int_a^b f(x)\,dx + \int_a^b g(x)\,dx.

Proof. By definition:

ab[f+g]dx=limnk=1n[f(xk)+g(xk)]Δxk=limn[k=1nf(xk)Δxk+k=1ng(xk)Δxk]\int_a^b [f+g]\,dx = \lim_{n\to\infty}\sum_{k=1}^n [f(x_k^*)+g(x_k^*)]\Delta x_k = \lim_{n\to\infty}\left[\sum_{k=1}^n f(x_k^*)\Delta x_k + \sum_{k=1}^n g(x_k^*)\Delta x_k\right]

Since both limits exist separately: =abfdx+abgdx= \int_a^b f\,dx + \int_a^b g\,dx. \square

F.2 Proof: Substitution Rule for Definite Integrals

Theorem. If u=g(x)u = g(x), gg differentiable, ff continuous:

abf(g(x))g(x)dx=g(a)g(b)f(u)du\int_a^b f(g(x))g'(x)\,dx = \int_{g(a)}^{g(b)} f(u)\,du

Proof. Let FF be an antiderivative of ff. By the chain rule, ddx[F(g(x))]=f(g(x))g(x)\frac{d}{dx}[F(g(x))] = f(g(x))g'(x). By FTC Part 2:

abf(g(x))g(x)dx=[F(g(x))]ab=F(g(b))F(g(a))=g(a)g(b)f(u)du\int_a^b f(g(x))g'(x)\,dx = [F(g(x))]_a^b = F(g(b)) - F(g(a)) = \int_{g(a)}^{g(b)} f(u)\,du \quad \square

F.3 Proof: Integration by Parts for Definite Integrals

Theorem. abuvdx=[uv]ababuvdx\int_a^b u\,v'\,dx = [uv]_a^b - \int_a^b u'v\,dx.

Proof. Product rule: (uv)=uv+uv(uv)' = u'v + uv'. Integrate: ab(uv)dx=abuvdx+abuvdx\int_a^b(uv)'\,dx = \int_a^b u'v\,dx + \int_a^b uv'\,dx.

FTC: [uv]ab=abuvdx+abuvdx[uv]_a^b = \int_a^b u'v\,dx + \int_a^b uv'\,dx. Rearrange. \square

F.4 Proof: 1xpdx\int_1^\infty x^{-p}\,dx Converges iff p>1p > 1

For p1p \neq 1:

1bxpdx=[x1p1p]1b=b1p11p\int_1^b x^{-p}\,dx = \left[\frac{x^{1-p}}{1-p}\right]_1^b = \frac{b^{1-p}-1}{1-p}

As bb \to \infty: b1p0b^{1-p} \to 0 if 1p<01-p < 0 (i.e., p>1p > 1), giving =1/(p1)\int = 1/(p-1).

If p<1p < 1: b1pb^{1-p} \to \infty, so integral diverges.

For p=1p = 1: 1bx1dx=lnb\int_1^b x^{-1}\,dx = \ln b \to \infty. \square

F.5 Proof: Gibbs' Inequality - KL(pq)0\text{KL}(p\|q) \geq 0

Theorem. For probability densities pp and qq: p(x)lnp(x)q(x)dx0\int p(x)\ln\frac{p(x)}{q(x)}\,dx \geq 0.

Proof. Since ln\ln is concave, lntt1\ln t \leq t - 1 for all t>0t > 0 (equality at t=1t=1). Apply to t=q(x)/p(x)t = q(x)/p(x) where p(x)>0p(x) > 0:

lnq(x)p(x)q(x)p(x)1\ln\frac{q(x)}{p(x)} \leq \frac{q(x)}{p(x)} - 1

Multiply by p(x)>0p(x) > 0 and integrate:

p(x)lnq(x)p(x)dx[q(x)p(x)]dx=11=0\int p(x)\ln\frac{q(x)}{p(x)}\,dx \leq \int [q(x)-p(x)]\,dx = 1 - 1 = 0

Therefore KL(pq)0KL(pq)0-\text{KL}(p\|q) \leq 0 \Rightarrow \text{KL}(p\|q) \geq 0. Equality holds iff q=pq = p a.e. \square


Appendix G: FTC - Applications in Machine Learning

G.1 Neural ODEs and the Adjoint Method

A neural ODE (Chen et al., 2018) defines the hidden state dynamics as:

dh(t)dt=f(h(t),t;θ)\frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t), t; \theta)

The hidden state at time TT is:

h(T)=h(0)+0Tf(h(t),t;θ)dt\mathbf{h}(T) = \mathbf{h}(0) + \int_0^T f(\mathbf{h}(t),t;\theta)\,dt

This is FTC Part 2 in reverse: given the "derivative" ff, integrate to get the accumulated change.

Training requires Lθ\frac{\partial \mathcal{L}}{\partial \theta}. The adjoint method computes this via a backward ODE:

da(t)dt=a(t)fh\frac{d\mathbf{a}(t)}{dt} = -\mathbf{a}(t)^\top \frac{\partial f}{\partial \mathbf{h}}

where a(t)=L/h(t)\mathbf{a}(t) = \partial\mathcal{L}/\partial\mathbf{h}(t) is the adjoint state. FTC Part 1 guarantees that integrating this backward ODE recovers the gradient exactly - with O(1)O(1) memory (no storing intermediate states).

G.2 Attention as Expectation

Softmax attention computes:

$$\text{Attn}(q, K, V) = \sum_k \alpha_k v_k, \qquad \alpha_k = \frac{e^{q\cdot k_k/\sqrt{d}}}{\sum_j e^{q\cdot k_j/\sqrt{d}}}$$

This is a discrete expectation Eα[V]\mathbb{E}_{\alpha}[V] where α\alpha is the softmax distribution over keys. In the continuous limit (as the number of keys grows and positions become dense), this becomes an integral:

$$\text{Attn}(q) = \int v(s)\,\frac{e^{q\cdot k(s)/\sqrt{d}}}{\int e^{q\cdot k(s')/\sqrt{d}}\,ds'}\,ds$$

This connection motivates kernel attention approximations (Performer, Random Feature Attention) that approximate the exponential kernel by Monte Carlo integration of a Gaussian integral: $e^{q\cdot k} = e^{-\|q\|^2/2}\,e^{-\|k\|^2/2}\,\mathbb{E}_{\omega\sim\mathcal{N}(0,I)}\big[e^{q\cdot\omega}\,e^{k\cdot\omega}\big]$.

G.3 Diffusion Models - Score Matching via Integration by Parts

The denoising score matching objective (Vincent, 2011; Song & Ermon, 2019):

Et,x0,ϵ[λ(t)ϵθ(xt,t)ϵ2]\mathbb{E}_{t,x_0,\epsilon}\left[\lambda(t)\|\epsilon_\theta(x_t,t) - \epsilon\|^2\right]

where xt=αˉtx0+1αˉtϵx_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon, ϵN(0,I)\epsilon \sim \mathcal{N}(0,I).

The equivalence to score matching follows from integration by parts in function space. Specifically, for the Gaussian forward kernel, $\nabla_{x_t}\log p(x_t|x_0) = -\epsilon/\sqrt{1-\bar{\alpha}_t}$, so the network learns (a conditional version of) the score of the noisy distribution - an integral relationship between the score function and the data density.

G.4 Variational Autoencoder ELBO

The VAE training objective is the ELBO:

LELBO=Ezqϕ(zx)[logpθ(xz)]KL(qϕ(zx)p(z))\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{z\sim q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x)\|p(z))

Each term is an integral:

  • Eqϕ[logpθ(xz)]=qϕ(zx)logpθ(xz)dz\mathbb{E}_{q_\phi}[\log p_\theta(x|z)] = \int q_\phi(z|x)\log p_\theta(x|z)\,dz - estimated via Monte Carlo (reparameterization trick)
  • KL(qϕp)=qϕ(zx)logqϕ(zx)p(z)dz\text{KL}(q_\phi\|p) = \int q_\phi(z|x)\log\frac{q_\phi(z|x)}{p(z)}\,dz - computed in closed form for Gaussian qϕq_\phi and pp

The reparameterization trick (z=μϕ(x)+σϕ(x)εz = \mu_\phi(x) + \sigma_\phi(x)\odot\varepsilon, εN(0,I)\varepsilon \sim \mathcal{N}(0,I)) is a change of variables (substitution rule) that makes the Monte Carlo estimator differentiable w.r.t. ϕ\phi.
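A minimal NumPy sketch of both pieces for a diagonal Gaussian encoder (the mean/scale values are arbitrary; the KL uses the closed form from Appendix H.4 with $\mu_2 = 0$, $\sigma_2 = 1$):

```python
import numpy as np

def gaussian_kl(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions."""
    return float(np.sum(np.log(1.0 / sigma) + (sigma**2 + mu**2) / 2 - 0.5))

rng = np.random.default_rng(0)
mu, sigma = np.array([0.5, -1.0]), np.array([0.8, 1.2])

eps = rng.standard_normal((1000, 2))   # eps ~ N(0, I)
z = mu + sigma * eps                   # reparameterized samples of q(z|x)
print(gaussian_kl(mu, sigma), z.mean(axis=0))   # KL term; sample mean ~ mu
```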


Appendix H: Notation Reference and Quick-Reference Tables

H.1 Integration Notation Comparison

| Notation | Meaning | Context |
|---|---|---|
| $\int_a^b f(x)\,dx$ | Definite integral from $a$ to $b$ | Riemann, most contexts |
| $\int f(x)\,dx$ | Indefinite integral (antiderivative) | General antiderivative |
| $[F(x)]_a^b$ | $F(b) - F(a)$ | FTC shorthand |
| $\int f\,d\mu$ | Lebesgue integral w.r.t. measure $\mu$ | Measure theory, probability |
| $\mathbb{E}_{x\sim p}[f(x)]$ | $\int f(x)p(x)\,dx$ | Probabilistic expectation |
| $\hat{I}_n = \frac{1}{n}\sum f(x_i)$ | Monte Carlo estimate | Stochastic approximation |

H.2 Standard Antiderivatives - Extended Table

| $f(x)$ | $\int f(x)\,dx$ | Notes |
|---|---|---|
| $x^n$, $n\neq -1$ | $x^{n+1}/(n+1)+C$ | Power rule |
| $1/x$ | $\ln\lvert x\rvert + C$ | |
| $e^x$ | $e^x+C$ | |
| $e^{ax}$ | $e^{ax}/a+C$ | |
| $a^x$ | $a^x/\ln a+C$ | |
| $\sin x$ | $-\cos x+C$ | |
| $\cos x$ | $\sin x+C$ | |
| $\tan x$ | $\ln\lvert\sec x\rvert+C$ | |
| $\sec x$ | $\ln\lvert\sec x+\tan x\rvert+C$ | |
| $\sec^2 x$ | $\tan x+C$ | |
| $1/\sqrt{1-x^2}$ | $\arcsin x+C$ | |
| $1/(1+x^2)$ | $\arctan x+C$ | |
| $1/(a^2+x^2)$ | $\frac{1}{a}\arctan(x/a)+C$ | |
| $\sinh x$ | $\cosh x+C$ | |
| $\cosh x$ | $\sinh x+C$ | |
| $\ln x$ | $x\ln x - x + C$ | via parts |

H.3 Numerical Methods Comparison

| Method | Formula | Error | Evaluations |
|---|---|---|---|
| Left Riemann | $h\sum_{k=0}^{n-1}f(x_k)$ | $O(h)$ | $n$ |
| Trapezoid | $h[f_0/2 + f_1+\cdots+f_{n-1}+f_n/2]$ | $O(h^2)$ | $n+1$ |
| Simpson's | $\frac{h}{3}[f_0+4f_1+2f_2+\cdots+4f_{n-1}+f_n]$ | $O(h^4)$ | $n+1$ ($n$ even) |
| Gauss-Legendre ($n$ pts) | $\sum w_k f(x_k)$ | $O(h^{2n})$ | $n$ |
| Monte Carlo | $(b-a)\frac{1}{n}\sum f(X_k)$ | $O(1/\sqrt{n})$, stochastic | $n$ |

H.4 Key Integration Formulas for AI

$$\int_{-\infty}^\infty e^{-x^2}\,dx = \sqrt{\pi} \qquad \int_{-\infty}^\infty e^{-ax^2}\,dx = \sqrt{\pi/a}$$

$$\int_{-\infty}^\infty xe^{-ax^2}\,dx = 0 \qquad \int_{-\infty}^\infty x^2 e^{-ax^2}\,dx = \frac{\sqrt{\pi}}{2a^{3/2}}$$

$$\text{KL}(\mathcal{N}(\mu_1,\sigma_1^2)\|\mathcal{N}(\mu_2,\sigma_2^2)) = \ln\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

$$H(\mathcal{N}(\mu,\sigma^2)) = \frac{1}{2}\ln(2\pi e\sigma^2) \qquad H(\text{Uniform}(a,b)) = \ln(b-a)$$

$$\mathbb{E}[X] = \int_0^\infty \Pr(X > t)\,dt \quad \text{(for } X \geq 0\text{)}$$

Appendix I: Worked Solutions - Section 12 Exercises

I.1 Exercise 1 - Riemann Sums

f(x)=x2f(x) = x^2 on [0,2][0,2], n=8n=8, h=0.25h = 0.25, nodes xk=khx_k = kh:

$$L_8 = h\sum_{k=0}^7 f(kh) = 0.25[0^2+0.25^2+0.5^2+0.75^2+1^2+1.25^2+1.5^2+1.75^2]$$

$$= 0.25[0+0.0625+0.25+0.5625+1+1.5625+2.25+3.0625] = 0.25\times 8.75 = 2.1875$$

$$R_8 = h\sum_{k=1}^8 f(kh) = 0.25[0.25^2+0.5^2+\cdots+2^2] = 0.25\times 12.75 = 3.1875$$

Exact: 02x2dx=[x3/3]02=8/32.66\int_0^2 x^2\,dx = [x^3/3]_0^2 = 8/3 \approx 2.6\overline{6}.

Verify: $L_8 = 2.1875 \leq 8/3 \leq 3.1875 = R_8$. ✓
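The arithmetic checks out numerically (a throwaway NumPy verification):

```python
import numpy as np

f = lambda x: x**2
a, b, n = 0.0, 2.0, 8
h = (b - a) / n
k = np.arange(n)

L8 = h * f(a + k * h).sum()          # left endpoints  -> 2.1875
R8 = h * f(a + (k + 1) * h).sum()    # right endpoints -> 3.1875
print(L8, R8, 8 / 3)                 # L8 <= 8/3 <= R8
```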

I.2 Exercise 2a - 1e(lnx)2xdx\int_1^e \frac{(\ln x)^2}{x}\,dx

Let u=lnxu = \ln x, du=dx/xdu = dx/x. Limits: u(1)=0u(1) = 0, u(e)=1u(e) = 1.

=01u2du=[u3/3]01=13= \int_0^1 u^2\,du = [u^3/3]_0^1 = \frac{1}{3}

I.2b - 0π/2sin3xcosxdx\int_0^{\pi/2}\sin^3 x\cos x\,dx

Let u=sinxu = \sin x, du=cosxdxdu = \cos x\,dx. Limits: 0 to 1.

=01u3du=[u4/4]01=14= \int_0^1 u^3\,du = [u^4/4]_0^1 = \frac{1}{4}

I.2c - 0ln2ex1+exdx\int_0^{\ln 2}e^x\sqrt{1+e^x}\,dx

Let u=1+exu = 1+e^x, du=exdxdu = e^x\,dx. Limits: u(0)=2u(0) = 2, u(ln2)=3u(\ln 2) = 3.

=23udu=[2u3/2/3]23=23(3322)= \int_2^3 \sqrt{u}\,du = [2u^{3/2}/3]_2^3 = \frac{2}{3}(3\sqrt{3}-2\sqrt{2})

I.3 Exercise 3a - x2exdx\int x^2 e^{-x}\,dx

Tabular method (differentiate x2x^2, integrate exe^{-x}):

| Sign | D | I |
|---|---|---|
| $+$ | $x^2$ | $e^{-x}$ |
| $-$ | $2x$ | $-e^{-x}$ |
| $+$ | $2$ | $e^{-x}$ |
| $-$ | $0$ | $-e^{-x}$ |
x2exdx=x2ex2xex2ex+C=ex(x2+2x+2)+C\int x^2 e^{-x}\,dx = -x^2 e^{-x} - 2xe^{-x} - 2e^{-x} + C = -e^{-x}(x^2+2x+2) + C

I.3b - ln(x2+1)dx\int \ln(x^2+1)\,dx

Let u=ln(x2+1)u = \ln(x^2+1), dv=dxdv = dx:

=xln(x2+1)x2xx2+1dx=xln(x2+1)2x2x2+1dx= x\ln(x^2+1) - \int x\cdot\frac{2x}{x^2+1}\,dx = x\ln(x^2+1) - 2\int\frac{x^2}{x^2+1}\,dx

Since x2x2+1=11x2+1\frac{x^2}{x^2+1} = 1 - \frac{1}{x^2+1}:

=xln(x2+1)2x+2arctanx+C= x\ln(x^2+1) - 2x + 2\arctan x + C

I.4 Exercise 4a - 1x3/2dx\int_1^\infty x^{-3/2}\,dx

=limb[2x1/2]1b=limb(2b+2)=2= \lim_{b\to\infty}[-2x^{-1/2}]_1^b = \lim_{b\to\infty}\left(-\frac{2}{\sqrt{b}}+2\right) = 2

I.4c - xex2dx\int_{-\infty}^\infty xe^{-x^2}\,dx

The integrand f(x)=xex2f(x) = xe^{-x^2} is an odd function (f(x)=f(x)f(-x) = -f(x)). Since 0xex2dx=[ex2/2]0=1/2<\int_0^\infty xe^{-x^2}\,dx = [-e^{-x^2}/2]_0^\infty = 1/2 < \infty, the integral converges absolutely and:

xex2dx=0(by symmetry)\int_{-\infty}^\infty xe^{-x^2}\,dx = 0 \quad \text{(by symmetry)}
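Both results are easy to confirm with scipy.integrate.quad, which accepts infinite limits directly:

```python
import numpy as np
from scipy.integrate import quad

print(quad(lambda x: x**-1.5, 1, np.inf)[0])                    # -> 2.0
print(quad(lambda x: x * np.exp(-x**2), -np.inf, np.inf)[0])    # -> 0.0
```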

Appendix J: Glossary

| Term | Definition |
|---|---|
| Antiderivative | $F$ such that $F'(x) = f(x)$; the indefinite integral is $F(x) + C$ |
| Riemann sum | $\sum_{k=1}^n f(x_k^*)\Delta x_k$, a finite approximation to the integral |
| Definite integral | $\int_a^b f\,dx = \lim_{\|\mathcal{P}\|\to 0}S(\mathcal{P},f)$ |
| Indefinite integral | $\int f\,dx = F(x) + C$, the family of all antiderivatives |
| FTC Part 1 | $\frac{d}{dx}\int_a^x f(t)\,dt = f(x)$ |
| FTC Part 2 | $\int_a^b f(x)\,dx = F(b) - F(a)$ for any antiderivative $F$ |
| Improper integral | Integral with infinite limits or unbounded integrand; defined via limits |
| Convergent integral | Improper integral whose limit exists and is finite |
| u-Substitution | $\int f(g(x))g'(x)\,dx = \int f(u)\,du$, the reversal of the chain rule |
| Integration by parts | $\int u\,dv = uv - \int v\,du$, the reversal of the product rule |
| Partial fractions | Decompose $P/Q$ into simpler rational terms before integrating |
| PDF | $p(x) \geq 0$ with $\int p\,dx = 1$; probability density function |
| CDF | $F(x) = \int_{-\infty}^x p(t)\,dt$; cumulative distribution function |
| Expectation | $\mathbb{E}[X] = \int x\,p(x)\,dx$; weighted average |
| KL divergence | $\text{KL}(p\|q) = \int p\ln(p/q)\,dx \geq 0$ |
| Entropy | $H(p) = -\int p\ln p\,dx$; information content |
| Trapezoid rule | Numerical integration with $O(h^2)$ error |
| Simpson's rule | Numerical integration with $O(h^4)$ error (parabolic approximation) |
| Monte Carlo | Stochastic integration via random sampling; $O(1/\sqrt{n})$ error |
| ELBO | Evidence lower bound: $\mathcal{L}_{\text{ELBO}} = \mathbb{E}_q[\log p(x\mid z)] - \text{KL}(q\,\|\,p)$ |
| Score function | $\nabla_x\log p(x)$, gradient of the log-density; used in diffusion models |
| Reparameterization trick | $z = \mu + \sigma\varepsilon$, $\varepsilon\sim\mathcal{N}(0,I)$; makes the MC estimator differentiable |

Appendix K: Connections to Adjacent Sections

K.1 What 04-Series-and-Sequences Needs from This Section

  • Integration term-by-term: n=0anxndx=n=0anxn+1n+1\int\sum_{n=0}^\infty a_n x^n\,dx = \sum_{n=0}^\infty \frac{a_n x^{n+1}}{n+1} - requires uniform convergence
  • Taylor remainder as integral: Rn(x)=1n!ax(xt)nf(n+1)(t)dtR_n(x) = \frac{1}{n!}\int_a^x (x-t)^n f^{(n+1)}(t)\,dt
  • Integral test for series: n=1f(n)\sum_{n=1}^\infty f(n) converges iff 1f(x)dx\int_1^\infty f(x)\,dx converges (for decreasing f0f \geq 0)

K.2 What 05-Multivariate Calculus Needs from This Section

  • Fubini's theorem: f(x,y)dxdy=(f(x,y)dx)dy\int\int f(x,y)\,dx\,dy = \int\left(\int f(x,y)\,dx\right)\,dy - iterated integration reduces a 2D integral to two 1D integrals
  • Change of variables: Rf(x)dx=Sf(g(u))detJg(u)du\int_R f(\mathbf{x})\,d\mathbf{x} = \int_S f(\mathbf{g}(\mathbf{u}))|\det J_\mathbf{g}(\mathbf{u})|\,d\mathbf{u} - the Jacobian generalizes g(x)|g'(x)| from substitution
  • Line integrals: Cfds\int_C f\,ds along a curve - generalization of abf(x)dx\int_a^b f(x)\,dx

K.3 What 06-Probability Theory Needs from This Section

All of continuous probability theory is integration. That section needs:

  • PDF normalization: pdx=1\int p\,dx = 1
  • CDF definition and FTC connection
  • Expectation: E[g(X)]=g(x)p(x)dx\mathbb{E}[g(X)] = \int g(x)p(x)\,dx
  • Moment computations: Gaussian moments via Gaussian integral and substitution
  • KL divergence and entropy as improper integrals over R\mathbb{R}

Appendix L: Additional Worked Examples - FTC and Techniques

L.1 Leibniz Rule Examples

Example 1. ddxxx2sin(t2)dt\frac{d}{dx}\int_x^{x^2} \sin(t^2)\,dt.

Upper limit h(x)=x2h(x) = x^2, lower limit g(x)=xg(x) = x. Leibniz rule:

=sin((x2)2)2xsin(x2)1=2xsin(x4)sin(x2)= \sin((x^2)^2)\cdot 2x - \sin(x^2)\cdot 1 = 2x\sin(x^4) - \sin(x^2)

Example 2. F(x)=0xt21t4+1dtF(x) = \int_0^x \frac{t^2-1}{t^4+1}\,dt. Find F(x)F'(x).

By FTC Part 1: F(x)=x21x4+1F'(x) = \frac{x^2-1}{x^4+1}.

Critical points of FF: F(x)=0x2=1x=±1F'(x) = 0 \Rightarrow x^2 = 1 \Rightarrow x = \pm 1.

$F'$ changes sign: $-$ for $|x|<1$, $+$ for $|x|>1$. So $F$ has a local max at $x=-1$ and a local min at $x=1$.

Example 3 (FTC + u-sub). ddx1exlntdt\frac{d}{dx}\int_1^{e^x}\ln t\,dt.

Let u=exu = e^x: ddx1exlntdt=ln(ex)ex=xex\frac{d}{dx}\int_1^{e^x}\ln t\,dt = \ln(e^x)\cdot e^x = x\cdot e^x.
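A finite-difference spot check of Example 1 (the evaluation point $x = 1.3$ and step $h$ are arbitrary; agreement is limited by quad's tolerance):

```python
import numpy as np
from scipy.integrate import quad

F = lambda x: quad(lambda t: np.sin(t**2), x, x**2)[0]

x, h = 1.3, 1e-5
numeric = (F(x + h) - F(x - h)) / (2 * h)        # central difference of F
leibniz = 2 * x * np.sin(x**4) - np.sin(x**2)    # Leibniz rule result
print(numeric, leibniz)                          # should agree closely
```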

L.2 Trigonometric Integrals

Strategy for sinmxcosnxdx\int\sin^m x\cos^n x\,dx:

  • If mm odd: save one sinx\sin x, convert rest to cosx\cos x via sin2=1cos2\sin^2 = 1-\cos^2, substitute u=cosxu = \cos x.
  • If nn odd: save one cosx\cos x, convert, substitute u=sinxu = \sin x.
  • If both even: use double-angle formulas sin2x=(1cos2x)/2\sin^2 x = (1-\cos 2x)/2, cos2x=(1+cos2x)/2\cos^2 x = (1+\cos 2x)/2.

Example. sin3xcos4xdx\int\sin^3 x\cos^4 x\,dx. Odd power of sin\sin:

sin2xcos4xsinxdx=(1cos2x)cos4xsinxdx\int\sin^2 x\cos^4 x\cdot\sin x\,dx = \int(1-\cos^2 x)\cos^4 x\cdot\sin x\,dx

Let u=cosxu = \cos x: =(1u2)u4du=(u4u6)du=u55+u77+C=cos5x5+cos7x7+C= -\int(1-u^2)u^4\,du = -\int(u^4-u^6)\,du = -\frac{u^5}{5}+\frac{u^7}{7}+C = -\frac{\cos^5 x}{5}+\frac{\cos^7 x}{7}+C.

L.3 Integrals of Rational Functions - Full Pipeline

Example. x34x+1x2x2dx\int\frac{x^3-4x+1}{x^2-x-2}\,dx.

Step 1. Long division (numerator degree > denominator degree):

x34x+1=(x2x2)(x+1)+(x+3)x^3-4x+1 = (x^2-x-2)\cdot(x+1) + (-x+3)

Step 2. Partial fractions of remainder:

x+3x2x2=x+3(x2)(x+1)=Ax2+Bx+1\frac{-x+3}{x^2-x-2} = \frac{-x+3}{(x-2)(x+1)} = \frac{A}{x-2}+\frac{B}{x+1}

At $x=2$: $-2+3 = 1 = 3A$, so $A = 1/3$. At $x=-1$: $1+3 = 4 = -3B$, so $B = -4/3$.

Step 3. Integrate:

$$\int\frac{x^3-4x+1}{x^2-x-2}\,dx = \int\left(x+1+\frac{1/3}{x-2}-\frac{4/3}{x+1}\right)\,dx$$

$$= \frac{x^2}{2}+x+\frac{1}{3}\ln|x-2|-\frac{4}{3}\ln|x+1|+C$$

L.4 The Dirichlet Integral

0sinxxdx=π2\int_0^\infty \frac{\sin x}{x}\,dx = \frac{\pi}{2}

This is a conditionally convergent improper integral (not absolutely convergent). One proof uses the Laplace transform: define F(s)=0esxsinxxdxF(s) = \int_0^\infty e^{-sx}\frac{\sin x}{x}\,dx and differentiate w.r.t. ss:

F(s)=0esxsinxdx=11+s2F'(s) = -\int_0^\infty e^{-sx}\sin x\,dx = -\frac{1}{1+s^2}

Integrating: F(s)=arctans+CF(s) = -\arctan s + C. As ss\to\infty: F(s)0F(s)\to 0, so C=π/2C = \pi/2. At s=0s=0: F(0)=π/2F(0) = \pi/2.


Appendix M: Monte Carlo Methods - Extended Analysis

M.1 Variance of the Monte Carlo Estimator

Let X1,,Xni.i.d.Uniform(a,b)X_1,\ldots,X_n \overset{i.i.d.}{\sim} \text{Uniform}(a,b). The estimator I^n=bank=1nf(Xk)\hat{I}_n = \frac{b-a}{n}\sum_{k=1}^n f(X_k).

$$\mathbb{E}[\hat{I}_n] = (b-a)\,\mathbb{E}[f(X)] = (b-a)\cdot\frac{1}{b-a}\int_a^b f(x)\,dx = \int_a^b f(x)\,dx \quad \text{(unbiased)}$$

$$\text{Var}[\hat{I}_n] = \frac{(b-a)^2}{n}\text{Var}[f(X)] = \frac{(b-a)^2}{n}\left[\frac{1}{b-a}\int_a^b f(x)^2\,dx - \left(\frac{1}{b-a}\int_a^b f(x)\,dx\right)^2\right]$$

Standard error: SE(I^n)=Var[I^n]=O(1/n)\text{SE}(\hat{I}_n) = \sqrt{\text{Var}[\hat{I}_n]} = O(1/\sqrt{n}).
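A minimal estimator together with its estimated standard error (the integrand and $n$ are arbitrary; the 95% interval anticipates M.3 below):

```python
import numpy as np

rng = np.random.default_rng(0)
f, a, b, n = (lambda x: np.exp(-x**2)), 0.0, 1.0, 100_000

vals = (b - a) * f(rng.uniform(a, b, n))
I_hat = vals.mean()                        # unbiased estimate of the integral
se = vals.std(ddof=1) / np.sqrt(n)         # O(1/sqrt(n)) standard error
print(f"{I_hat:.5f} +/- {1.96 * se:.5f}")  # ~0.74682 +/- ~0.001
```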

M.2 Importance Sampling

To estimate I=f(x)dxI = \int f(x)\,dx, sample Xkq(x)X_k \sim q(x) (a proposal distribution) and use:

I^nIS=1nk=1nf(Xk)q(Xk)\hat{I}^{\text{IS}}_n = \frac{1}{n}\sum_{k=1}^n \frac{f(X_k)}{q(X_k)}

Since f(x)dx=f(x)q(x)q(x)dx=Eq[f(X)q(X)]\int f(x)\,dx = \int \frac{f(x)}{q(x)}\cdot q(x)\,dx = \mathbb{E}_q\left[\frac{f(X)}{q(X)}\right], this is also unbiased.

Optimal qq. Var[I^IS]\text{Var}[\hat{I}^{\text{IS}}] is minimized when q(x)f(x)q(x) \propto |f(x)|. With this choice, Var=0\text{Var} = 0 if f0f \geq 0 everywhere (one-sample exact!).

In practice. Choose $q$ to concentrate samples where $|f(x)|$ is large - focusing effort on the important region (a minimal sketch follows the list below). Used in:

  • IWAE (Importance Weighted Autoencoders) - tighter ELBO via importance sampling
  • Particle filters - sequential importance sampling for state estimation
  • MCMC - Metropolis-Hastings acceptance-rejection is importance sampling on steroids
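A minimal sketch: estimating $\int_{-\infty}^\infty e^{-x^2}\,dx = \sqrt{\pi}$ with a standard normal proposal (this $q$ is a reasonable but arbitrary choice, not the optimal one):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)       # samples from the proposal q = N(0, 1)

f = lambda t: np.exp(-t**2)
weights = f(x) / norm.pdf(x)           # f(X_k) / q(X_k)
print(weights.mean(), np.sqrt(np.pi))  # unbiased IS estimate vs. exact value
```

Here $f/q \propto e^{-x^2/2}$ is bounded, so the weights are well behaved; a badly mismatched proposal would instead produce a few huge weights and high variance.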

M.3 Central Limit Theorem for Monte Carlo

I^nISE(I^n)dN(0,1)\frac{\hat{I}_n - I}{\text{SE}(\hat{I}_n)} \xrightarrow{d} \mathcal{N}(0,1)

This gives a 95% confidence interval: I^n±1.96SE(I^n)\hat{I}_n \pm 1.96\cdot\text{SE}(\hat{I}_n).

The SE\text{SE} is estimated from the sample standard deviation:

SE^=(ba)σ^fn,σ^f2=1n1k=1n(f(Xk)μ^f)2\widehat{\text{SE}} = \frac{(b-a)\hat{\sigma}_f}{\sqrt{n}}, \qquad \hat{\sigma}_f^2 = \frac{1}{n-1}\sum_{k=1}^n\left(f(X_k) - \hat{\mu}_f\right)^2

M.4 Quasi-Monte Carlo - Discrepancy

The error of quasi-Monte Carlo is bounded by the Koksma-Hlawka inequality:

I^nIDn(x)V(f)|\hat{I}_n - I| \leq D_n^*(\mathbf{x})\cdot V(f)

where DnD_n^* is the star discrepancy of the point set and V(f)V(f) is the total variation of ff. Low-discrepancy sequences (Halton, Sobol) achieve Dn=O((logn)d/n)D_n^* = O((\log n)^d/n), giving error O((logn)d/n)O((\log n)^d/n) vs. O(1/n)O(1/\sqrt{n}) for random.

For d=1d=1: QMC is always better than Monte Carlo (for smooth integrands). For large dd, the (logn)d(\log n)^d factor can dominate.


Appendix N: The Fundamental Theorem - Historical and Conceptual Depth

N.1 Why the FTC Is Deep

Before the FTC, two problems seemed completely unrelated:

  1. The tangent problem: find the slope of a curve at a point -> derivative
  2. The area problem: find the area under a curve -> integral

Newton and Leibniz discovered they are inverse operations. This is not obvious. There is no reason, a priori, to expect that summing infinitesimally thin rectangles (integration) should be related to measuring instantaneous slope (differentiation).

The FTC says: accumulation is the reverse of rate. If you know how fast something is accumulating at every instant, you can find the total accumulation - just by finding an antiderivative. This is one of the most non-obvious and deep theorems in all of mathematics.

N.2 What Makes the FTC Work

The key ingredients:

  1. Continuity of ff: ensures ff can be approximated uniformly by step functions (Riemann integrability)
  2. MVT for integrals: 1hxx+hf(t)dt=f(ch)\frac{1}{h}\int_x^{x+h}f(t)\,dt = f(c_h) for some ch(x,x+h)c_h \in (x, x+h)
  3. Continuity of ff at xx: ensures f(ch)f(x)f(c_h) \to f(x) as h0h \to 0

If ff has discontinuities, Part 1 may fail: G(x)=f(x)G'(x) = f(x) only holds at points of continuity of ff. The Lebesgue integral extends the FTC to a much broader class of functions.

N.3 The FTC in Multiple Dimensions

The FTC generalizes to higher dimensions in several forms:

| 1D FTC | Higher-dimensional version |
|---|---|
| $\int_a^b f'(x)\,dx = f(b)-f(a)$ | Green's theorem: $\oint_C \mathbf{F}\cdot d\mathbf{r} = \iint_D \operatorname{curl}\mathbf{F}\,dA$ |
| | Stokes' theorem: $\oint_{\partial S}\mathbf{F}\cdot d\mathbf{r} = \iint_S (\nabla\times\mathbf{F})\cdot d\mathbf{S}$ |
| | Divergence theorem: $\oiint_{\partial V}\mathbf{F}\cdot d\mathbf{S} = \iiint_V \nabla\cdot\mathbf{F}\,dV$ |

All are special cases of Stokes' theorem on manifolds: Mdω=Mω\int_M d\omega = \int_{\partial M}\omega.

For ML, the divergence theorem underlies the use of the divergence operator $\nabla\cdot$ in:

  • Flow matching (training vector fields for generative models)
  • Continuous normalizing flows with trace of the Jacobian
  • Optimal transport with the continuity equation

N.4 Antiderivatives That Cannot Be Expressed in Closed Form

Some elementary functions have no elementary antiderivative. Famous examples:

| Integrand | "Antiderivative" | Notes |
|---|---|---|
| $e^{-x^2}$ | $\frac{\sqrt{\pi}}{2}\text{erf}(x)$ | Error function; not elementary |
| $\frac{\sin x}{x}$ | $\text{Si}(x)$ (sine integral) | Not elementary |
| $\frac{e^x}{x}$ | $\text{Ei}(x)$ (exponential integral) | Not elementary |
| $\frac{1}{\ln x}$ | $\text{Li}(x)$ (logarithmic integral) | Counts primes! Not elementary |
| $\sqrt{1-k^2\sin^2 x}$ | Elliptic integral | Not elementary |

These "special functions" appear constantly in ML:

  • scipy.special.erf, scipy.special.erfinv - in GELU, quantile functions, CDFs
  • scipy.special.gammaln - in Dirichlet, Beta, Gamma distributions
  • scipy.special.iv (modified Bessel function) - in von Mises distributions (circular statistics in positional encoding)

Appendix O: Lebesgue vs. Riemann Integration

O.1 Why a Different Theory?

The Riemann integral is sufficient for continuous functions and most smooth ML applications. But the Lebesgue integral is the foundation of modern probability theory and is essential for rigorous statements about expected values, convergence theorems, and measure theory.

The key difference: Riemann integration partitions the domain (x-axis); Lebesgue integration partitions the range (y-axis).

Riemann vs. Lebesgue, schematically:

  • Riemann: slice the domain into intervals $[x_i, x_{i+1}]$ and sum $f(x_i)\,\Delta x$ - "partition the x-axis".
  • Lebesgue: slice the range into levels $[y_i, y_{i+1}]$ and weight each level by the measure of its preimage $\{x : f(x) \in [y_i, y_{i+1}]\}$ - "partition the y-axis".


O.2 Key Theorem (Lebesgue vs. Riemann)

Theorem (Lebesgue, 1901): A bounded function on [a,b][a, b] is Riemann integrable if and only if it is continuous almost everywhere (i.e., the set of discontinuities has Lebesgue measure zero).

For ML, this means:

  • ReLU is Riemann integrable (discontinuous only at 0, a set of measure zero)
  • Indicator functions 1[x>0]\mathbf{1}[x > 0] are Riemann integrable for the same reason
  • Pathological functions like the Dirichlet function (1Q\mathbf{1}_{\mathbb{Q}}) are NOT Riemann integrable but ARE Lebesgue integrable

O.3 Convergence Theorems (The Real Advantage)

The Lebesgue integral enables three fundamental convergence theorems that Riemann lacks:

Monotone Convergence Theorem (MCT): If 0f1f20 \le f_1 \le f_2 \le \cdots and fnff_n \to f pointwise, then:

limnfndμ=limnfndμ=fdμ\lim_{n\to\infty} \int f_n \, d\mu = \int \lim_{n\to\infty} f_n \, d\mu = \int f \, d\mu

Dominated Convergence Theorem (DCT): If fnff_n \to f pointwise and fng|f_n| \le g for an integrable gg, then:

limnfndμ=fdμ\lim_{n\to\infty} \int f_n \, d\mu = \int f \, d\mu

Fatou's Lemma: lim infnfndμlim infnfndμ\int \liminf_{n\to\infty} f_n \, d\mu \le \liminf_{n\to\infty} \int f_n \, d\mu

Why ML cares: The DCT justifies differentiating under the integral sign - the key step in computing gradients of expected values:

θEpθ[f(x)]=θf(x)pθ(x)dx=f(x)pθθdx\frac{\partial}{\partial\theta} \mathbb{E}_{p_\theta}[f(x)] = \frac{\partial}{\partial\theta} \int f(x)\,p_\theta(x)\,dx = \int f(x)\,\frac{\partial p_\theta}{\partial\theta}\,dx

This step is legal when fpθ/θf \cdot |\partial p_\theta/\partial\theta| is dominated by an integrable function - which is why REINFORCE and the reparameterization trick both have regularity conditions.

O.4 Probability as Measure Theory

Modern probability uses the Lebesgue framework directly:

| Measure theory | Probability |
|---|---|
| Measure space $(\Omega, \mathcal{F}, \mu)$ | Probability space $(\Omega, \mathcal{F}, P)$ |
| Measurable function $f: \Omega \to \mathbb{R}$ | Random variable $X: \Omega \to \mathbb{R}$ |
| $\int f \, d\mu$ | $\mathbb{E}[X]$ |
| $\mu(A) = \int_A 1 \, d\mu$ | $P(A)$ |
| Radon-Nikodym derivative $dP/dQ$ | Likelihood ratio |

The Radon-Nikodym theorem - which says that if PQP \ll Q (P is absolutely continuous w.r.t. Q), there exists a measurable function dPdQ\frac{dP}{dQ} such that P(A)=AdPdQdQP(A) = \int_A \frac{dP}{dQ} \, dQ - is the rigorous foundation for:

  • KL divergence: DKL(PQ)=logdPdQdPD_{KL}(P \| Q) = \int \log\frac{dP}{dQ} \, dP
  • Change of variables in normalizing flows
  • Importance sampling weights

For day-to-day ML, Riemann integration suffices. But when reading probability theory papers or understanding why certain gradient estimators require regularity conditions, the Lebesgue framework is the right language.


Appendix P: Quick Reference Card

P.1 Standard Antiderivatives

| $f(x)$ | $\int f(x)\,dx$ | Notes |
|---|---|---|
| $x^n$ | $\frac{x^{n+1}}{n+1} + C$ | $n \ne -1$ |
| $x^{-1}$ | $\ln\lvert x\rvert + C$ | |
| $e^x$ | $e^x + C$ | |
| $a^x$ | $\frac{a^x}{\ln a} + C$ | $a > 0$, $a \ne 1$ |
| $\ln x$ | $x\ln x - x + C$ | by parts |
| $\sin x$ | $-\cos x + C$ | |
| $\cos x$ | $\sin x + C$ | |
| $\tan x$ | $-\ln\lvert\cos x\rvert + C$ | |
| $\sec^2 x$ | $\tan x + C$ | |
| $\frac{1}{1+x^2}$ | $\arctan x + C$ | |
| $\frac{1}{\sqrt{1-x^2}}$ | $\arcsin x + C$ | |
| $\sinh x$ | $\cosh x + C$ | |
| $\cosh x$ | $\sinh x + C$ | |
| $\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$ | $\Phi(x) + C$ | CDF of $N(0,1)$ |

P.2 Key Definite Integrals

$$\int_{-\infty}^{\infty} e^{-x^2} \, dx = \sqrt{\pi} \qquad \int_0^\infty x^{n-1}e^{-x}\,dx = \Gamma(n) \qquad \int_0^1 x^{a-1}(1-x)^{b-1}\,dx = B(a,b)$$

$$\int_0^\infty \frac{\sin x}{x}\,dx = \frac{\pi}{2} \qquad \int_{-\pi}^{\pi}\sin(mx)\cos(nx)\,dx = 0 \quad \text{(orthogonality)}$$

P.3 Integration Strategies Flowchart

WHICH TECHNIQUE?


  Is there a composite function f(g(x))?
    YES -> try u-substitution: u = g(x)

  Is the integrand a product of two "different" types?
    YES -> try integration by parts (LIATE order)

  Is the integrand a rational function P(x)/Q(x)?
    YES -> try partial fractions (after polynomial division if deg P >= deg Q)

  Does the integrand contain sqrt(a^2-x^2), sqrt(a^2+x^2), or sqrt(x^2-a^2)?
    YES -> try trig substitution (x = a sin(theta), a tan(theta), a sec(theta))

  Does the integrand contain e^x times polynomial or trig?
    YES -> try tabular integration by parts

  Nothing works? -> check integral tables / computer algebra system


P.4 Numerical Integration Comparison

| Method | Error | Formula | When to use |
|---|---|---|---|
| Midpoint | $O(h^2)$ | $h\sum f(x_i + h/2)$ | Smooth functions, simple implementation |
| Trapezoid | $O(h^2)$ | $h[\frac{f(a)+f(b)}{2} + \sum f(x_i)]$ | Periodic functions (converges exponentially fast) |
| Simpson's | $O(h^4)$ | $\frac{h}{3}[f_0 + 4f_1 + 2f_2 + 4f_3 + \cdots + f_n]$ | Smooth functions, better accuracy |
| Gauss-Legendre | $O(h^{2n})$ | Optimal nodes & weights | High accuracy, smooth integrands |
| Monte Carlo | $O(n^{-1/2})$ | $\frac{1}{n}\sum f(x_i)$, $x_i \sim U[a,b]$ | High dimensions, discontinuous $f$ |
| Quasi-MC | $O((\log n)^d / n)$ | Low-discrepancy sequences | High dimensions with structure |

P.5 Key ML Formulas Involving Integration

$$\mathbb{E}_{x\sim p}[f(x)] = \int f(x)\,p(x)\,dx \approx \frac{1}{n}\sum_{i=1}^n f(x_i) \quad \text{(Monte Carlo mini-batch)}$$

$$D_{KL}(p\|q) = \int p(x)\ln\frac{p(x)}{q(x)}\,dx \ge 0 \quad \text{(Gibbs' inequality)}$$

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_\phi(z|x)}[\ln p_\theta(x|z)] - D_{KL}(q_\phi(z|x)\|p(z)) \quad \text{(ELBO)}$$

$$\nabla_\theta \mathbb{E}_{p_\theta}[f(x)] = \mathbb{E}_{p_\theta}[f(x)\,\nabla_\theta\ln p_\theta(x)] \quad \text{(REINFORCE / score gradient)}$$

$$\frac{d\mathbf{z}(t)}{dt} = f_\theta(\mathbf{z}(t), t) \implies \mathbf{z}(T) = \mathbf{z}(0) + \int_0^T f_\theta(\mathbf{z}(t),t)\,dt \quad \text{(Neural ODE)}$$

Appendix Q: Exercises - Worked Solutions (Continued)

Q.1 Exercise 5 - Numerical Integration Full Walkthrough

Problem: Compare trapezoid vs. Simpson's for 01ex2dx\int_0^1 e^{-x^2}\,dx.

True value: Using the error function, 01ex2dx=π2erf(1)0.746824\int_0^1 e^{-x^2}\,dx = \frac{\sqrt\pi}{2}\,\text{erf}(1) \approx 0.746824.

Trapezoid with n=4n=4 (h=0.25h = 0.25):

xx-values: 0,0.25,0.5,0.75,1.00, 0.25, 0.5, 0.75, 1.0

ff-values: 1, e0.06250.93941, e0.250.77880, e0.56250.56978, e10.367881,\ e^{-0.0625} \approx 0.93941,\ e^{-0.25} \approx 0.77880,\ e^{-0.5625} \approx 0.56978,\ e^{-1} \approx 0.36788

T4=0.25[1+0.367882+0.93941+0.77880+0.56978]=0.25[0.68394+2.28799]=0.74298T_4 = 0.25\left[\frac{1 + 0.36788}{2} + 0.93941 + 0.77880 + 0.56978\right] = 0.25[0.68394 + 2.28799] = 0.74298

Error: 0.742980.74682=0.00384|0.74298 - 0.74682| = 0.00384 - O(h2)O(h^2) as expected.

Simpson's with n=4n=4 (need even nn):

$$S_4 = \frac{0.25}{3}[f_0 + 4f_1 + 2f_2 + 4f_3 + f_4]$$

$$= \frac{0.25}{3}[1 + 4(0.93941) + 2(0.77880) + 4(0.56978) + 0.36788]$$

$$= \frac{0.25}{3}[1 + 3.75764 + 1.55760 + 2.27912 + 0.36788] = \frac{0.25}{3}(8.96224) = 0.74685$$

Error: 0.746850.74682=0.00003|0.74685 - 0.74682| = 0.00003 - vastly smaller. Simpson's is O(h4)O(h^4).
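The same comparison in a few lines of SciPy (array-based trapezoid and simpson routines; $n = 4$ subintervals as above):

```python
import numpy as np
from scipy.integrate import trapezoid, simpson
from scipy.special import erf

x = np.linspace(0, 1, 5)               # 5 nodes -> n = 4, h = 0.25
y = np.exp(-x**2)
exact = np.sqrt(np.pi) / 2 * erf(1.0)

print(abs(trapezoid(y, x=x) - exact))  # ~3.8e-3, matching T_4 above
print(abs(simpson(y, x=x) - exact))    # ~3e-5, matching S_4 above
```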

Q.2 Exercise 6 - Expected Value and KL Divergence

Problem: Compute DKL(pq)D_{KL}(p \| q) where p=N(μ,1)p = N(\mu, 1) and q=N(0,1)q = N(0, 1).

Solution:

DKL(pq)=p(x)lnp(x)q(x)dxD_{KL}(p\|q) = \int_{-\infty}^\infty p(x)\ln\frac{p(x)}{q(x)}\,dx

With p(x)=12πe(xμ)2/2p(x) = \frac{1}{\sqrt{2\pi}}e^{-(x-\mu)^2/2} and q(x)=12πex2/2q(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}:

$$\ln\frac{p(x)}{q(x)} = -\frac{(x-\mu)^2}{2} + \frac{x^2}{2} = \frac{x^2 - (x-\mu)^2}{2} = \frac{2\mu x - \mu^2}{2} = \mu x - \frac{\mu^2}{2}$$

$$D_{KL}(p\|q) = \mathbb{E}_p\!\left[\mu x - \frac{\mu^2}{2}\right] = \mu\,\mathbb{E}_p[x] - \frac{\mu^2}{2} = \mu^2 - \frac{\mu^2}{2} = \frac{\mu^2}{2}$$

Interpretation: The KL divergence $D_{KL}(\mathcal{N}(\mu,1)\,\|\,\mathcal{N}(0,1))$ is exactly $\mu^2/2$ - it grows quadratically in the mean offset. This is the term in the VAE ELBO that penalizes the encoder mean for drifting far from zero.
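A numerical cross-check against the closed form, integrating the KL integrand with quad ($\mu = 1.7$ is arbitrary; logpdf differences avoid underflow in the tails):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu = 1.7
integrand = lambda x: norm.pdf(x, mu, 1) * (norm.logpdf(x, mu, 1) - norm.logpdf(x, 0, 1))

kl, _ = quad(integrand, -np.inf, np.inf)
print(kl, mu**2 / 2)    # both ~1.445
```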

Q.3 Exercise 7 - Gradient of Expected Loss

Problem: Compute θExpθ[(x)]\nabla_\theta \mathbb{E}_{x \sim p_\theta}[\ell(x)] via the log-derivative trick.

Solution: Using the score function / REINFORCE estimator:

θEpθ[(x)]=θ(x)pθ(x)dx\nabla_\theta \mathbb{E}_{p_\theta}[\ell(x)] = \nabla_\theta \int \ell(x)\,p_\theta(x)\,dx

Assuming we can differentiate under the integral (DCT applies):

=(x)θpθ(x)dx=(x)pθ(x)θpθ(x)pθ(x)dx=Epθ[(x)θlnpθ(x)]= \int \ell(x)\,\nabla_\theta p_\theta(x)\,dx = \int \ell(x)\,p_\theta(x)\,\frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}\,dx = \mathbb{E}_{p_\theta}[\ell(x)\,\nabla_\theta\ln p_\theta(x)]

This is the REINFORCE gradient estimator. It requires only samples from pθp_\theta and the ability to evaluate lnpθ\ln p_\theta - no reparameterization needed. The cost is high variance, motivating control variates (baselines) to reduce Var[(x)θlnpθ(x)]\text{Var}[\ell(x)\nabla_\theta\ln p_\theta(x)].
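A toy check of the estimator, assuming $x \sim \mathcal{N}(\theta, 1)$ and loss $\ell(x) = x^2$, where the exact gradient is $\frac{d}{d\theta}(\theta^2 + 1) = 2\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 0.8, 1_000_000

x = theta + rng.standard_normal(n)    # samples from p_theta = N(theta, 1)
score = x - theta                     # grad_theta log p_theta(x) for unit variance
grad_est = (x**2 * score).mean()      # REINFORCE estimate of the gradient

print(grad_est, 2 * theta)            # both ~1.6; estimator variance shrinks as 1/n
```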


Appendix R: Historical Timeline Extended

R.1 Chronological Development

| Era | Contributor | Achievement |
|---|---|---|
| ~250 BCE | Archimedes | Method of exhaustion for area of a parabolic segment; bounds on $\pi$ |
| ~1640s | Cavalieri | Cavalieri's principle; "method of indivisibles" |
| 1665-1666 | Newton | Inverse tangent problem; "method of fluxions"; FTC discovered privately |
| 1675-1684 | Leibniz | Independent discovery; modern $\int$ notation; publication of FTC (1684) |
| 1696 | L'Hôpital | First calculus textbook (based on Bernoulli's lectures) |
| 1734 | Bishop Berkeley | The Analyst: criticism of infinitesimals as "ghosts of departed quantities" |
| 1748 | Euler | Introductio in Analysin Infinitorum: systematic treatment of functions |
| 1821-1823 | Cauchy | Rigorous definition of limit; definite integral via Riemann-style sums |
| 1854 | Riemann | Rigorous Riemann integral; characterization of integrable functions |
| 1875 | Darboux | Upper/lower sums; cleaner formulation of the Riemann integral |
| 1894-1902 | Lebesgue | Measure theory; Lebesgue integral; MCT and DCT |
| 1900s | Hilbert | $L^2$ spaces; integration as inner product; functional analysis |
| 1920s-1940s | Kolmogorov | Probability as measure theory; rigorous foundation for ML |
| 1940s-1950s | Monte Carlo | Ulam, von Neumann, Metropolis: stochastic integration for physics |
| 1986 | Rumelhart et al. | Backpropagation: systematic chain rule through computation graphs (Jacobians) |
| 2018 | Chen et al. | Neural ODEs: continuous-depth networks via ODE integration |
| 2020s | Diffusion models | Score matching, DDPM, flow matching: integration at the core of generation |

R.2 The Newton-Leibniz Priority Dispute

The FTC was discovered independently by Newton (1666, unpublished) and Leibniz (1675-1684, published). The Royal Society's official investigation (1712) wrongly accused Leibniz of plagiarism, damaging his reputation and creating a rift between British and Continental mathematicians. British mathematicians, loyal to Newton's notation, fell behind Continental Europe for ~100 years. The lesson: notation matters - Leibniz's \int and d/dxd/dx notation won out and is the notation used universally today.

R.3 Why Newton Called Integration "Quadrature"

Newton's original term for integration was "quadrature" - from the Latin for "making a square." The original problem was: given a curve, find a square with the same area. This geometric framing persisted for centuries. Leibniz's more algebraic approach (anti-differentiation) is what we use today, but the term "quadrature" survives in "Gaussian quadrature" - numerical integration using optimal node placement.


Appendix S: Advanced Topics Preview

This section previews topics that build directly on integration and appear in subsequent sections:

S.1 -> 04 Sequences and Series

Taylor series represents a function as an infinite sum:

f(x)=n=0f(n)(a)n!(xa)nf(x) = \sum_{n=0}^\infty \frac{f^{(n)}(a)}{n!}(x-a)^n

The remainder term in Taylor's theorem is an integral:

Rn(x)=1n!ax(xt)nf(n+1)(t)dtR_n(x) = \frac{1}{n!}\int_a^x (x-t)^n f^{(n+1)}(t)\,dt

Integration and series interact via term-by-term integration - valid when a series converges uniformly:

n=0anxndx=n=0anxn+1n+1\int \sum_{n=0}^\infty a_n x^n \, dx = \sum_{n=0}^\infty \frac{a_n x^{n+1}}{n+1}

-> Full treatment: Sequences and Series

S.2 -> 05 Multivariable Calculus

Double and triple integrals extend the 1D theory:

Df(x,y)dA=abg(x)h(x)f(x,y)dydx(Fubini’s theorem)\iint_D f(x,y)\,dA = \int_a^b \int_{g(x)}^{h(x)} f(x,y)\,dy\,dx \quad \text{(Fubini's theorem)}

The change of variables formula with Jacobian J|J|:

Df(x,y)dA=Df(x(u,v),y(u,v))Jdudv\iint_D f(x,y)\,dA = \iint_{D'} f(x(u,v), y(u,v))\,|J|\,du\,dv

-> Full treatment: Multivariable Calculus

S.3 -> 06 Probability and Statistics

Continuous random variables live entirely in the integration framework. The CDF is an integral of the PDF; expectation is an integral; the central limit theorem involves convergence of distribution functions; characteristic functions are Fourier transforms - all integration.

-> Full treatment: Probability and Statistics


Appendix T: Self-Assessment Checklist

After completing this section, you should be able to:

Conceptual Understanding

  • Explain integration as accumulated change (area, probability mass, expected value)
  • State both parts of the FTC and explain why they are profound
  • Distinguish when a function is/is not Riemann integrable
  • Explain why a1xpdx\int_a^\infty \frac{1}{x^p}\,dx converges iff p>1p > 1

Computational Skills

  • Compute integrals using power rule, u-substitution, and integration by parts
  • Decompose rational functions into partial fractions
  • Evaluate improper integrals or determine their convergence
  • Implement the trapezoid rule, Simpson's rule, and basic Monte Carlo integration

Probability Connections

  • Verify that a given function is a valid PDF (non-negative, integrates to 1)
  • Compute E[X]\mathbb{E}[X], Var[X]\text{Var}[X] as integrals for common distributions
  • Compute KL divergence between two Gaussian distributions
  • Derive the cross-entropy loss as negative log-likelihood

ML Applications

  • Explain Monte Carlo as an approximation to an integral
  • State the REINFORCE gradient estimator and its derivation via log-derivative trick
  • Explain the VAE ELBO as a sum of two integrals (reconstruction + KL)
  • Describe the neural ODE forward pass as numerical ODE integration
  • Sketch how score matching connects to integration in diffusion models