Integration

"Integration is not just finding areas - it is the mathematics of accumulation, and accumulation is the mathematics of learning."

Overview

Integration is the second pillar of calculus, dual to differentiation. Where the derivative asks "how fast is this changing right now?", the integral asks "how much has accumulated over this interval?" The Fundamental Theorem of Calculus (FTC) reveals these two operations as inverses of each other - one of the most profound connections in mathematics.

Every core operation in machine learning involves integration. The expected loss $\mathbb{E}[\mathcal{L}]$ is an integral over a data distribution. KL divergence and entropy - the foundations of information-theoretic learning - are improper integrals of probability densities. Stochastic gradient descent is a Monte Carlo estimator of a gradient integral. Normalizing flows use the change-of-variables formula from integration theory. Understanding integration at the level of definitions and proofs, not just formulas, separates practitioners who can innovate from those who can only apply recipes.

This section builds the complete single-variable integration theory: Riemann sums, the FTC, all standard techniques (substitution, parts, partial fractions), improper integrals, numerical methods (trapezoid, Simpson's, Monte Carlo), and integration in probability theory. The treatment is rigorous but always anchored to concrete ML applications.

Prerequisites

Limits and continuity - 01-Limits-and-Continuity: limit definition, squeeze theorem, continuity on $[a,b]$
Derivatives and differentiation - 02-Derivatives-and-Differentiation: chain rule, product rule, all elementary derivatives
Algebra: polynomial long division, factoring, partial fraction setup
Trigonometry: $\sin$, $\cos$, $\tan$ identities; inverse trig derivatives

Learning Objectives

After completing this section, you will:

Define the definite integral via Riemann sums and compute it as a limit
State and prove both parts of the Fundamental Theorem of Calculus
Apply u-substitution, integration by parts, and partial fractions to evaluate integrals
Classify and evaluate improper integrals using limit definitions and convergence tests
Implement trapezoid, Simpson's, and Monte Carlo numerical integration
Express expectation, variance, KL divergence, and entropy as integrals
Explain why stochastic gradient descent is a Monte Carlo estimator of the gradient integral
Verify numerical integration accuracy and analyze error bounds

1. Intuition
1.1 Accumulation as the Core Idea
1.2 Historical Motivation
1.3 Why Integration Is Central to AI
2. The Definite Integral - Riemann's Definition
2.1 Partitions and Riemann Sums
2.2 The Limit Definition
2.3 Geometric Interpretation: Signed Area
2.4 Properties of the Definite Integral
2.5 Integrability Conditions
3. Antiderivatives and the Indefinite Integral
3.1 Definition and the +C Convention
3.2 Basic Antiderivative Table
3.3 Linearity
3.4 Initial Value Problems
4. The Fundamental Theorem of Calculus
4.1 FTC Part 1
4.2 FTC Part 2
4.3 The Bridge
4.4 Worked Examples
4.5 FTC and Automatic Differentiation
5. Integration by Substitution
5.1 The Reverse Chain Rule
5.2 Definite Integrals: Changing Limits
5.3 Worked Examples
5.4 For AI: Change of Variables in Flows
6. Integration by Parts
6.1 The Reverse Product Rule
6.2 Choosing u and dv - LIATE
6.3 Reduction Formulas and Tabular Method
6.4 Worked Examples
6.5 For AI: REINFORCE Gradient Estimator
7. Partial Fractions
7.1 Decomposing Rational Functions
7.2 Cases and Worked Examples
8. Improper Integrals
8.1 Type I: Infinite Limits
8.2 Type II: Unbounded Integrands
8.3 Convergence Tests
8.4 The Gaussian Integral
8.5 For AI: Entropy, KL Divergence, Expected Loss
9. Numerical Integration
9.1 Trapezoid Rule
9.2 Simpson's Rule
9.3 Gaussian Quadrature
9.4 Monte Carlo Integration
9.5 For AI: SGD as Monte Carlo Expectation
10. Integration in Probability
10.1 Probability Density Functions
10.2 CDF and FTC
10.3 Expectation as a Weighted Integral
10.4 Variance and Second Moments
10.5 KL Divergence
10.6 Entropy
11. Common Mistakes
12. Exercises
13. Why This Matters for AI (2026 Perspective)
14. Conceptual Bridge

1. Intuition

1.1 Accumulation as the Core Idea

The derivative measures instantaneous rate of change. The integral measures total accumulation. These are two sides of the same coin - and the FTC is the theorem that makes this precise.

Physical intuition. If $v(t)$ is velocity at time $t$, then $\int_a^b v(t)\,dt$ is the total distance traveled from time $a$ to $b$. Each infinitesimal time slice $dt$ contributes a tiny distance $v(t)\,dt$; the integral sums them all.

Geometric intuition. $\int_a^b f(x)\,dx$ is the signed area between the curve $y = f(x)$ and the $x$-axis over $[a,b]$. Area above the axis is positive; area below is negative.

Probabilistic intuition. If $p(x)$ is a probability density, $\int_a^b p(x)\,dx$ is the probability that a random draw falls in $[a,b]$. Integration is the language of probability.

1.2 Historical Motivation

Year	Development
~250 BCE	Archimedes uses method of exhaustion to compute areas of parabolas
1670s	Newton and Leibniz independently develop calculus; Leibniz introduces $\int$ notation
1823	Cauchy gives rigorous definition of the definite integral
1854	Riemann formalizes the integral via sums - the definition used today
1875	Darboux introduces upper/lower sums, simplifying Riemann's theory
1902	Lebesgue generalizes the integral to a much broader class of functions

The $\int$ symbol is an elongated "S" for summa (Latin: sum) - Leibniz's notation for the limit of infinitely many infinitesimal summands.

1.3 Why Integration Is Central to AI

Expected loss. The training objective of every probabilistic model is:

$$\mathcal{L}(\theta) = \mathbb{E}_{(\mathbf{x},y)\sim p_{\text{data}}}[\ell(f_\theta(\mathbf{x}), y)] = \int \ell(f_\theta(\mathbf{x}), y)\,p(\mathbf{x},y)\,d\mathbf{x}\,dy$$

This is an integral. Stochastic gradient descent estimates it via a Monte Carlo sum over a mini-batch.

KL divergence. The loss function used in variational autoencoders (VAEs), diffusion models, and language model alignment (RLHF):

$$\text{KL}(q\|p) = \int q(x)\ln\frac{q(x)}{p(x)}\,dx$$

Entropy. The Shannon entropy of a continuous distribution:

$$H(p) = -\int p(x)\ln p(x)\,dx$$

This integral appears in cross-entropy loss, mutual information, and the information-theoretic analysis of generalization.

Normalizing flows. Change-of-variables: if $\mathbf{z} = f(\mathbf{x})$ is a bijection, the density transforms as:

$$p_X(\mathbf{x}) = p_Z(f(\mathbf{x}))\left|\det\frac{\partial f}{\partial \mathbf{x}}\right|$$

The absolute Jacobian determinant is the multidimensional version of the substitution rule from this section.

Gaussian integral. The normalization constant of every Gaussian distribution relies on $\int_{-\infty}^\infty e^{-x^2}\,dx = \sqrt{\pi}$ - an improper integral computed in 8.4.

For AI: Every backpropagation pass computes a stochastic estimate of the gradient of an integral (the expected loss) - integration and differentiation are inseparable in machine learning.

2. The Definite Integral - Riemann's Definition

2.1 Partitions and Riemann Sums

A partition of $[a,b]$ is a finite set of points $a = x_0 < x_1 < \cdots < x_n = b$. The mesh (or norm) is $\|\mathcal{P}\| = \max_k(x_k - x_{k-1})$.

For each subinterval $[x_{k-1}, x_k]$ of width $\Delta x_k = x_k - x_{k-1}$, choose a sample point $x_k^* \in [x_{k-1}, x_k]$. The Riemann sum is:

$$S(\mathcal{P}, f) = \sum_{k=1}^n f(x_k^*)\,\Delta x_k$$

Three standard choices of $x_k^*$: - Left endpoint: $x_k^* = x_{k-1}$ -> left Riemann sum $L_n$ - Right endpoint: $x_k^* = x_k$ -> right Riemann sum $R_n$ - Midpoint: $x_k^* = (x_{k-1}+x_k)/2$ -> midpoint Riemann sum $M_n$

Example. $f(x) = x^2$ on $[0,1]$, uniform partition with $n$ intervals, right endpoints:

$$R_n = \sum_{k=1}^n \left(\frac{k}{n}\right)^2 \cdot \frac{1}{n} = \frac{1}{n^3}\sum_{k=1}^n k^2 = \frac{1}{n^3}\cdot\frac{n(n+1)(2n+1)}{6} \xrightarrow{n\to\infty} \frac{1}{3}$$

2.2 The Limit Definition

The definite integral of $f$ over $[a,b]$ is:

$$\int_a^b f(x)\,dx = \lim_{\|\mathcal{P}\|\to 0} S(\mathcal{P}, f)$$

provided this limit exists and is the same for every choice of sample points. When this limit exists, $f$ is Riemann integrable on $[a,b]$.

For uniform partitions ($\Delta x = (b-a)/n$, right endpoints):

$$\int_a^b f(x)\,dx = \lim_{n\to\infty}\sum_{k=1}^n f\!\left(a + k\cdot\frac{b-a}{n}\right)\cdot\frac{b-a}{n}$$

2.3 Geometric Interpretation: Signed Area

The integral computes signed area: regions where $f(x) > 0$ contribute positively; regions where $f(x) < 0$ contribute negatively.

$$\int_0^{2\pi}\sin x\,dx = 0 \quad \text{(positive and negative areas cancel)}$$ $$\int_0^{2\pi}|\sin x|\,dx = 4 \quad \text{(total unsigned area)}$$

2.4 Properties of the Definite Integral

For integrable $f, g$ on $[a,b]$:

Property	Formula
Linearity	$\int_a^b [cf(x)+g(x)]\,dx = c\int_a^b f(x)\,dx + \int_a^b g(x)\,dx$
Additivity	$\int_a^b f\,dx = \int_a^c f\,dx + \int_c^b f\,dx$ for any $c \in [a,b]$
Monotonicity	$f \leq g \Rightarrow \int_a^b f\,dx \leq \int_a^b g\,dx$
Reverse limits	$\int_b^a f\,dx = -\int_a^b f\,dx$
Zero-width	$\int_a^a f\,dx = 0$
Bound	$\left\|\int_a^b f\,dx\right\| \leq \int_a^b \|f\|\,dx \leq M(b-a)$ if $\|f\| \leq M$

Mean Value Theorem for Integrals. If $f$ is continuous on $[a,b]$, there exists $c \in (a,b)$ with:

$$f(c) = \frac{1}{b-a}\int_a^b f(x)\,dx$$

The integral equals the function value at some interior point times the interval length.

2.5 Integrability Conditions

Sufficient condition 1: If $f$ is continuous on $[a,b]$, then $f$ is Riemann integrable. (This covers all smooth functions and all activation functions in ML.)

Sufficient condition 2: If $f$ is bounded and monotone on $[a,b]$, then $f$ is Riemann integrable.

Sufficient condition 3: If $f$ is bounded on $[a,b]$ and has only finitely many discontinuities, then $f$ is Riemann integrable. (This covers ReLU, which is discontinuous in derivative at 0 but not in value.)

3. Antiderivatives and the Indefinite Integral

3.1 Definition and the +C Convention

A function $F$ is an antiderivative of $f$ on an interval $I$ if $F'(x) = f(x)$ for all $x \in I$.

Theorem (Uniqueness up to constant). If $F$ and $G$ are both antiderivatives of $f$ on $I$, then $F(x) - G(x) = C$ for some constant $C$.

Proof. Let $H = F - G$. Then $H'(x) = F'(x) - G'(x) = f(x) - f(x) = 0$ for all $x \in I$. By the MVT corollary, $H' \equiv 0$ implies $H$ is constant. $\square$

The indefinite integral encodes the entire family of antiderivatives:

$$\int f(x)\,dx = F(x) + C$$

where $C$ is an arbitrary constant. The $+C$ is not optional - it represents a genuinely different function for each value of $C$.

3.2 Basic Antiderivative Table

$f(x)$	$\int f(x)\,dx$	Condition
$x^n$	$\dfrac{x^{n+1}}{n+1} + C$	$n \neq -1$
$x^{-1} = 1/x$	$\ln\|x\| + C$	$x \neq 0$
$e^x$	$e^x + C$
$a^x$	$\dfrac{a^x}{\ln a} + C$	$a > 0, a\neq 1$
$\sin x$	$-\cos x + C$
$\cos x$	$\sin x + C$
$\sec^2 x$	$\tan x + C$
$1/\sqrt{1-x^2}$	$\arcsin x + C$	$\|x\| < 1$
$1/(1+x^2)$	$\arctan x + C$
$\sinh x$	$\cosh x + C$
$\cosh x$	$\sinh x + C$

Verification: Every row can be checked by differentiating the right side.

3.3 Linearity

$$\int [cf(x) + g(x)]\,dx = c\int f(x)\,dx + \int g(x)\,dx$$

This follows immediately from the linearity of differentiation.

Example. $\int(3x^2 - 5\cos x + e^x)\,dx = x^3 - 5\sin x + e^x + C$.

3.4 Initial Value Problems

An initial value problem (IVP) specifies $f'(x)$ and an initial condition $f(x_0) = y_0$, and asks for $f(x)$.

Procedure: 1. Find the general antiderivative: $f(x) = \int f'(x)\,dx = F(x) + C$ 2. Apply the initial condition: $y_0 = F(x_0) + C \Rightarrow C = y_0 - F(x_0)$

Example. $f'(x) = 3x^2 - 2x$, $f(1) = 5$.

$f(x) = x^3 - x^2 + C$

. At : .

So $f(x) = x^3 - x^2 + 5$.

For AI. Solving $\dot{\theta} = -\nabla\mathcal{L}(\theta)$ as an ODE gives the continuous-time analogue of gradient descent. The solution is an integral: $\theta(T) = \theta(0) - \int_0^T \nabla\mathcal{L}(\theta(t))\,dt$. Neural ODEs (Chen et al., 2018) make this explicit by parameterizing the dynamics as a neural network and solving the integral numerically via an ODE solver.

4. The Fundamental Theorem of Calculus

The FTC is the central theorem of calculus - it reveals that differentiation and integration are inverse operations and provides a practical method for evaluating definite integrals.

4.1 FTC Part 1

Theorem (FTC Part 1). Let $f$ be continuous on $[a,b]$ and define:

$$G(x) = \int_a^x f(t)\,dt, \quad x \in [a,b]$$

Then $G$ is differentiable on $(a,b)$ and $G'(x) = f(x)$.

Proof. For $h > 0$:

$$\frac{G(x+h) - G(x)}{h} = \frac{1}{h}\int_x^{x+h} f(t)\,dt$$

By the MVT for integrals, there exists $c_h \in (x, x+h)$ with $\frac{1}{h}\int_x^{x+h} f(t)\,dt = f(c_h)$.

As $h \to 0^+$: $c_h \to x$, so by continuity of $f$: $f(c_h) \to f(x)$. A symmetric argument handles $h \to 0^-$. Therefore $G'(x) = f(x)$. $\square$

Interpretation. The area-accumulation function $G(x) = \int_a^x f(t)\,dt$ has derivative $f(x)$ - the rate of growth of accumulated area at $x$ equals the function value $f(x)$. This is obvious geometrically: adding a thin strip of height $f(x)$ and width $h$ gives area $\approx f(x) \cdot h$.

Generalization (Leibniz rule):

$$\frac{d}{dx}\int_{g(x)}^{h(x)} f(t)\,dt = f(h(x))\cdot h'(x) - f(g(x))\cdot g'(x)$$

4.2 FTC Part 2

Theorem (FTC Part 2). If $f$ is continuous on $[a,b]$ and $F$ is any antiderivative of $f$, then:

$$\int_a^b f(x)\,dx = F(b) - F(a) \equiv \Big[F(x)\Big]_a^b$$

Proof. Let $G(x) = \int_a^x f(t)\,dt$. By Part 1, $G'(x) = f(x) = F'(x)$. So $G(x) - F(x) = C$ (constant). At $x = a$: $G(a) - F(a) = 0 - F(a)$, giving $C = -F(a)$. At $x = b$: $G(b) = \int_a^b f(t)\,dt = F(b) - F(a)$. $\square$

4.3 The Bridge

Why FTC Part 2 is powerful. Before the FTC, computing $\int_a^b f(x)\,dx$ required constructing Riemann sums and taking limits - laborious for any non-trivial function. The FTC reduces this to: find any antiderivative $F$, evaluate at $b$ and $a$, subtract.

Example. $\int_0^1 x^2\,dx = \Big[\frac{x^3}{3}\Big]_0^1 = \frac{1}{3} - 0 = \frac{1}{3}$. (Recall: directly computing via Riemann sums required summing $\sum k^2$.)

4.4 Worked Examples

Example 1. $\int_1^e \frac{1}{x}\,dx = [\ln x]_1^e = \ln e - \ln 1 = 1 - 0 = 1$.

Example 2. $\int_0^\pi \sin x\,dx = [-\cos x]_0^\pi = (-\cos\pi) - (-\cos 0) = 1 + 1 = 2$.

Example 3. $\int_{-1}^{1} e^x\,dx = [e^x]_{-1}^1 = e - e^{-1} = e - 1/e$.

Example 4 (net displacement vs distance). A particle has velocity $v(t) = t^2 - 4$. On $[0,3]$:

$$\text{Net displacement} = \int_0^3 (t^2-4)\,dt = \left[\frac{t^3}{3} - 4t\right]_0^3 = 9-12 = -3$$ $$\text{Distance} = \int_0^3 |t^2-4|\,dt = \int_0^2(4-t^2)\,dt + \int_2^3(t^2-4)\,dt = \frac{16}{3} + \frac{7}{3} = \frac{23}{3}$$

4.5 FTC and Automatic Differentiation

FTC Part 1 has a direct counterpart in modern ML: the adjoint method for training neural ODEs. The derivative of a loss $\mathcal{L}$ with respect to the initial state $\mathbf{z}(0)$, where $\mathbf{z}(T) = \mathbf{z}(0) + \int_0^T f(\mathbf{z}(t),t;\theta)\,dt$, is computed via an integral that runs backward in time. This is FTC Part 1 applied to the adjoint state - the computational core of the torchdiffeq library.

5. Integration by Substitution

5.1 The Reverse Chain Rule

Theorem. If $u = g(x)$ is differentiable and $f$ is continuous on the range of $g$, then:

$$\int f(g(x))\,g'(x)\,dx = \int f(u)\,du \quad \text{(evaluated at } u = g(x)\text{)}$$

Proof. Let $F$ be an antiderivative of $f$, so $F' = f$. By the chain rule:

$$\frac{d}{dx}[F(g(x))] = F'(g(x))\cdot g'(x) = f(g(x))\cdot g'(x)$$

Therefore $F(g(x))$ is an antiderivative of $f(g(x))\cdot g'(x)$. $\square$

Procedure: 1. Identify a function $u = g(x)$ whose derivative $g'(x)$ appears (or nearly appears) in the integrand. 2. Compute $du = g'(x)\,dx$. 3. Substitute: replace $g(x)$ with $u$ and $g'(x)\,dx$ with $du$. 4. Integrate in $u$. 5. Back-substitute: replace $u$ with $g(x)$.

5.2 Definite Integrals: Changing Limits

For $\int_a^b f(g(x))g'(x)\,dx$ with $u = g(x)$: change limits to $u(a)$ and $u(b)$:

$$\int_a^b f(g(x))g'(x)\,dx = \int_{g(a)}^{g(b)} f(u)\,du$$

This avoids back-substitution at the end.

5.3 Worked Examples

Example 1. $\int \sin(x^2)\cdot 2x\,dx$. Let $u = x^2$, $du = 2x\,dx$:

$$= \int \sin u\,du = -\cos u + C = -\cos(x^2) + C$$

Example 2. $\int_0^1 e^{-x^2}\cdot(-2x)\,dx$. Let $u = -x^2$, limits: $u(0)=0$, $u(1)=-1$:

$$= \int_0^{-1} e^u\,du = [e^u]_0^{-1} = e^{-1} - 1$$

Example 3. $\int \frac{x}{x^2+1}\,dx$. Let $u = x^2+1$, $du = 2x\,dx$:

$$= \frac{1}{2}\int\frac{du}{u} = \frac{1}{2}\ln|u| + C = \frac{1}{2}\ln(x^2+1) + C$$

Example 4. $\int \tan x\,dx = \int \frac{\sin x}{\cos x}\,dx$. Let $u = \cos x$, $du = -\sin x\,dx$:

$$= -\int\frac{du}{u} = -\ln|\cos x| + C = \ln|\sec x| + C$$

Example 5 (softmax normalization). For $\mathbf{z} \in \mathbb{R}^K$:

$$Z = \sum_{k=1}^K e^{z_k} = e^{z_{\max}}\sum_{k=1}^K e^{z_k - z_{\max}}$$

The subtraction of $z_{\max}$ is a discrete analog of substitution $u = z - z_{\max}$, making all terms $\leq 1$ for numerical stability.

5.4 For AI: Change of Variables in Normalizing Flows

A normalizing flow defines a bijection $\mathbf{x} = f(\mathbf{z})$ where $\mathbf{z} \sim p_Z$. The change-of-variables formula:

$$p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x}))\left|\det J_{f^{-1}}(\mathbf{x})\right|$$

is the multivariate version of $u$-substitution: $\int_a^b f(g(x))g'(x)\,dx = \int_{g(a)}^{g(b)}f(u)\,du$, where $|g'(x)|$ becomes the absolute Jacobian determinant. Real NVP, Glow, and FFJORD all implement variants of this transformation to learn complex densities from simple base distributions.

6. Integration by Parts

6.1 The Reverse Product Rule

Theorem. If $u(x)$ and $v(x)$ are differentiable, then:

$$\int u\,dv = uv - \int v\,du$$

Proof. The product rule gives $(uv)' = u'v + uv'$. Integrating both sides:

$$uv = \int u'v\,dx + \int uv'\,dx$$

Rearranging: $\int uv'\,dx = uv - \int u'v\,dx$. Writing $dv = v'dx$ and $du = u'dx$ gives the formula. $\square$

For definite integrals:

$$\int_a^b u\,dv = \Big[uv\Big]_a^b - \int_a^b v\,du$$

6.2 Choosing u and dv - LIATE

The acronym LIATE gives a priority order for choosing $u$ (choose the type that comes first): - Logarithms: $\ln x$, $\log_a x$ - Inverse trig: $\arctan x$, $\arcsin x$ - Algebraic: polynomials $x^n$, $\sqrt{x}$ - Trigonometric: $\sin x$, $\cos x$ - Exponential: $e^x$, $a^x$

The complementary factor becomes $dv$ and must be something we can integrate.

6.3 Reduction Formulas and Tabular Method

For integrals like $\int x^n e^x\,dx$, repeated integration by parts produces a reduction formula:

$$\int x^n e^x\,dx = x^n e^x - n\int x^{n-1}e^x\,dx$$

The tabular method (also called "tic-tac-toe" or "successive differentiation") organizes repeated parts efficiently:

Sign	Differentiate ($u$)	Integrate ($dv$)
$+$	$x^3$	$e^x$
$-$	$3x^2$	$e^x$
$+$	$6x$	$e^x$
$-$	$6$	$e^x$
$+$	$0$	$e^x$

Result: $\int x^3 e^x\,dx = e^x(x^3 - 3x^2 + 6x - 6) + C$.

6.4 Worked Examples

Example 1. $\int x e^x\,dx$. Let $u = x$, $dv = e^x\,dx$:

$$= xe^x - \int e^x\,dx = xe^x - e^x + C = e^x(x-1) + C$$

Example 2. $\int \ln x\,dx$. Let $u = \ln x$, $dv = dx$:

$$= x\ln x - \int x\cdot\frac{1}{x}\,dx = x\ln x - x + C = x(\ln x - 1) + C$$

Example 3. $\int x^2\sin x\,dx$. Apply tabular method (alternating signs):

$$= -x^2\cos x + 2x\sin x + 2\cos x + C$$

Example 4 (cyclic). $\int e^x\sin x\,dx$. Let $I = \int e^x\sin x\,dx$. Integrate by parts twice:

$$I = e^x\sin x - e^x\cos x - I \implies 2I = e^x(\sin x - \cos x) \implies I = \frac{e^x(\sin x - \cos x)}{2} + C$$

6.5 For AI: REINFORCE Gradient Estimator

The REINFORCE algorithm (Williams, 1992) estimates $\nabla_\theta \mathbb{E}_{\tau\sim p_\theta}[R(\tau)]$ where $\tau$ is a trajectory and $R$ is return. Integration by parts (in the form of the log-derivative trick) gives:

$$\nabla_\theta\mathbb{E}_{x\sim p_\theta}[f(x)] = \mathbb{E}_{x\sim p_\theta}[f(x)\nabla_\theta\log p_\theta(x)]$$

This identity - differentiating through an expectation - is the continuous-distribution version of integration by parts, and is the core of policy gradient methods in reinforcement learning (PPO, GRPO, RLHF fine-tuning).

7. Partial Fractions

7.1 Decomposing Rational Functions

Partial fraction decomposition writes a rational function $P(x)/Q(x)$ (where $\deg P < \deg Q$) as a sum of simpler fractions. This converts integrals of rational functions into sums of basic integrals.

Setup: 1. Factor $Q(x)$ completely over $\mathbb{R}$. 2. Write the partial fraction decomposition. 3. Solve for unknown constants (cover-up method or comparing coefficients). 4. Integrate each term.

7.2 Cases and Worked Examples

Case 1: Distinct linear factors. $Q(x) = (x-a_1)(x-a_2)\cdots(x-a_n)$:

$$\frac{P(x)}{Q(x)} = \frac{A_1}{x-a_1} + \frac{A_2}{x-a_2} + \cdots + \frac{A_n}{x-a_n}$$

Example. $\int\frac{1}{x^2-1}\,dx = \int\frac{1}{(x-1)(x+1)}\,dx$.

Decompose: $\frac{1}{(x-1)(x+1)} = \frac{A}{x-1}+\frac{B}{x+1}$.

Multiply both sides by $(x-1)(x+1)$: $1 = A(x+1) + B(x-1)$.

At $x=1$: $1 = 2A \Rightarrow A = 1/2$. At $x=-1$: $1 = -2B \Rightarrow B = -1/2$.

$$\int\frac{1}{x^2-1}\,dx = \frac{1}{2}\ln|x-1| - \frac{1}{2}\ln|x+1| + C = \frac{1}{2}\ln\left|\frac{x-1}{x+1}\right| + C$$

Case 2: Repeated linear factors. For $(x-a)^m$:

$$\frac{A_1}{x-a} + \frac{A_2}{(x-a)^2} + \cdots + \frac{A_m}{(x-a)^m}$$

Example. $\int\frac{x}{(x-1)^2}\,dx$.

$\frac{x}{(x-1)^2} = \frac{A}{x-1} + \frac{B}{(x-1)^2}$

. Multiply: .

At $x=1$: $B=1$. Compare $x$-coefficients: $1 = A$, so $A=1$.

$$\int\frac{x}{(x-1)^2}\,dx = \ln|x-1| - \frac{1}{x-1} + C$$

Case 3: Irreducible quadratic factors. For $x^2 + bx + c$ (no real roots):

$$\frac{Ax+B}{x^2+bx+c}$$

Example. $\int\frac{1}{x(x^2+1)}\,dx$.

$\frac{1}{x(x^2+1)} = \frac{A}{x} + \frac{Bx+C}{x^2+1}$

Multiply: $1 = A(x^2+1) + (Bx+C)x$. At $x=0$: $A=1$. Comparing $x^2$: $0=A+B \Rightarrow B=-1$. Comparing $x^1$: $0=C$.

$$\int\frac{1}{x(x^2+1)}\,dx = \ln|x| - \frac{1}{2}\ln(x^2+1) + C = \frac{1}{2}\ln\frac{x^2}{x^2+1} + C$$

For AI. Partial fractions appear in z-transform analysis (discrete-time signal processing), computing closed-form solutions to linear recurrences (relevant to LSTM and S4 state space models), and in Laplace transform computations for control-theoretic analysis of learning dynamics.

8. Improper Integrals

8.1 Type I: Infinite Limits

Definition. For $f$ continuous on $[a,\infty)$:

$$\int_a^\infty f(x)\,dx = \lim_{b\to\infty}\int_a^b f(x)\,dx$$

If the limit exists and is finite, the integral converges; otherwise it diverges.

Similarly $\int_{-\infty}^b f(x)\,dx = \lim_{a\to-\infty}\int_a^b f(x)\,dx$ and $\int_{-\infty}^\infty f\,dx = \int_{-\infty}^c f\,dx + \int_c^\infty f\,dx$ for any $c$.

Example 1 (exponential decay).

$$\int_0^\infty e^{-x}\,dx = \lim_{b\to\infty}[-e^{-x}]_0^b = \lim_{b\to\infty}(1-e^{-b}) = 1$$

Example 2 ($p$-integral).

$$\int_1^\infty \frac{1}{x^p}\,dx = $$

Proof. For $p \neq 1$: $\int_1^b x^{-p}\,dx = \frac{b^{1-p}-1}{1-p}$. As $b\to\infty$: converges iff $1-p < 0$ iff $p > 1$. $\square$

8.2 Type II: Unbounded Integrands

Definition. If $f$ has a vertical asymptote at $x = a$:

$$\int_a^b f(x)\,dx = \lim_{\varepsilon\to 0^+}\int_{a+\varepsilon}^b f(x)\,dx$$

Example. $\int_0^1 \frac{1}{\sqrt{x}}\,dx = \lim_{\varepsilon\to 0^+}[2\sqrt{x}]_\varepsilon^1 = 2 - 0 = 2$. Converges.

$\int_0^1 \frac{1}{x}\,dx = \lim_{\varepsilon\to 0^+}[\ln x]_\varepsilon^1 = 0 - (-\infty) = \infty$

. Diverges.

8.3 Convergence Tests

Comparison test. If $0 \leq f(x) \leq g(x)$ for $x \geq a$: - $\int_a^\infty g\,dx$ converges $\Rightarrow$ $\int_a^\infty f\,dx$ converges - $\int_a^\infty f\,dx$ diverges $\Rightarrow$ $\int_a^\infty g\,dx$ diverges

Limit comparison test. If $f, g > 0$ and $\lim_{x\to\infty} f(x)/g(x) = L \in (0,\infty)$, then $\int_a^\infty f$ and $\int_a^\infty g$ both converge or both diverge.

Absolute convergence. If $\int_a^\infty |f(x)|\,dx < \infty$, then $\int_a^\infty f(x)\,dx$ converges.

8.4 The Gaussian Integral

The most important improper integral in probability and machine learning:

$$\int_{-\infty}^\infty e^{-x^2}\,dx = \sqrt{\pi}$$

Proof (polar coordinates trick). Let $I = \int_{-\infty}^\infty e^{-x^2}\,dx$. Then:

$$I^2 = \int_{-\infty}^\infty e^{-x^2}\,dx\cdot\int_{-\infty}^\infty e^{-y^2}\,dy = \int_{-\infty}^\infty\int_{-\infty}^\infty e^{-(x^2+y^2)}\,dx\,dy$$

Convert to polar ($x = r\cos\theta$, $y = r\sin\theta$, $dx\,dy = r\,dr\,d\theta$):

$$I^2 = \int_0^{2\pi}\int_0^\infty e^{-r^2}r\,dr\,d\theta = 2\pi\int_0^\infty re^{-r^2}\,dr = 2\pi\cdot\frac{1}{2} = \pi$$

Therefore $I = \sqrt{\pi}$. $\square$

Consequence. The standard Gaussian $\mathcal{N}(0,1)$ normalizes:

$$\int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx = 1$$

(substitution $u = x/\sqrt{2}$ converts to the Gaussian integral).

8.5 For AI: Entropy, KL Divergence, and Expected Loss

Entropy of a Gaussian. For $X \sim \mathcal{N}(\mu, \sigma^2)$:

$$H(X) = -\int_{-\infty}^\infty p(x)\ln p(x)\,dx = \frac{1}{2}\ln(2\pi e\sigma^2)$$

This improper integral converges because $p(x)\ln p(x) \to 0$ faster than any polynomial as $x\to\pm\infty$.

KL divergence between Gaussians. For $p = \mathcal{N}(\mu_1,\sigma_1^2)$, $q = \mathcal{N}(\mu_2,\sigma_2^2)$:

$$\text{KL}(p\|q) = \ln\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

This closed form follows from evaluating $\int_{-\infty}^\infty p(x)[\ln p(x) - \ln q(x)]\,dx$ using the Gaussian integral. It appears as the regularization term in the VAE ELBO objective.

Expected cross-entropy loss. The population risk:

$$R(\theta) = -\int p(\mathbf{x},y)\log q_\theta(y|\mathbf{x})\,d\mathbf{x}\,dy$$

is an improper integral over the joint distribution. SGD computes an unbiased Monte Carlo estimate using a mini-batch.

9. Numerical Integration

When an antiderivative cannot be expressed in closed form (e.g., $\int e^{-x^2}\,dx$, $\int \sin(x^2)\,dx$), numerical methods approximate the definite integral.

9.1 Trapezoid Rule

Approximate the integrand on each subinterval by a straight line (trapezoid).

For uniform step $h = (b-a)/n$ and nodes $x_k = a + kh$:

$$\int_a^b f(x)\,dx \approx T_n = \frac{h}{2}\left[f(x_0) + 2f(x_1) + 2f(x_2) + \cdots + 2f(x_{n-1}) + f(x_n)\right]$$

Error bound. If $|f''(x)| \leq M$ on $[a,b]$:

$$|E_T| = \left|\int_a^b f\,dx - T_n\right| \leq \frac{M(b-a)^3}{12n^2} = O(h^2)$$

The trapezoid rule is second-order accurate - halving $h$ reduces error by a factor of 4.

9.2 Simpson's Rule

Approximate the integrand on each pair of subintervals by a quadratic (parabola). Requires $n$ even:

$$S_n = \frac{h}{3}\left[f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + \cdots + 4f(x_{n-1}) + f(x_n)\right]$$

Pattern: 1, 4, 2, 4, 2, ..., 4, 1 with coefficients summing to $2n/3 \cdot 3 = 2n$.

Error bound. If $|f^{(4)}(x)| \leq M$:

$$|E_S| \leq \frac{M(b-a)^5}{180n^4} = O(h^4)$$

Simpson's rule is fourth-order - halving $h$ reduces error by a factor of 16.

Example. Estimate $\int_0^1 e^x\,dx$ (true value: $e-1 \approx 1.71828$) with $n = 4$:

$h = 0.25$

, .

$S_4 = \frac{0.25}{3}[e^0 + 4e^{0.25} + 2e^{0.5} + 4e^{0.75} + e^1] \approx 1.71828$

(accurate to 7 decimal places).

9.3 Gaussian Quadrature

Instead of equally spaced nodes, choose $n$ optimal nodes $\{x_k\}$ and weights $\{w_k\}$ to exactly integrate all polynomials of degree $\leq 2n-1$:

$$\int_{-1}^1 f(x)\,dx \approx \sum_{k=1}^n w_k f(x_k)$$

The nodes are roots of the Legendre polynomial $P_n(x)$. Gauss-Legendre quadrature is the most accurate quadrature rule for smooth functions - $n$ nodes achieve $O(h^{2n})$ error.

9.4 Monte Carlo Integration

Idea. For $\int_a^b f(x)\,dx$: draw $n$ i.i.d. samples $X_1,\ldots,X_n \sim \text{Uniform}(a,b)$ and estimate:

$$\hat{I}_n = \frac{b-a}{n}\sum_{k=1}^n f(X_k)$$

By the Law of Large Numbers: $\hat{I}_n \xrightarrow{a.s.} \int_a^b f(x)\,dx$.

Error. By the CLT:

$$\sqrt{n}(\hat{I}_n - I) \xrightarrow{d} \mathcal{N}(0, (b-a)^2\text{Var}[f(X)])$$

Standard error: $\text{SE} = \frac{(b-a)\,\text{Std}[f(X)]}{\sqrt{n}} = O(1/\sqrt{n})$.

Key property. The $O(1/\sqrt{n})$ convergence rate is dimension-independent. Trapezoid and Simpson's rules suffer from the curse of dimensionality ($O(n^{-2/d})$ for $d$-dimensional integrals), but Monte Carlo's rate stays $O(1/\sqrt{n})$ regardless of dimension. This is why high-dimensional integration in ML (expectation over data distributions, latent variables, trajectories) always uses Monte Carlo.

Variance reduction. Importance sampling: draw $X_k \sim q(x)$ instead of uniform, estimate:

$$\hat{I} = \frac{1}{n}\sum_{k=1}^n \frac{f(X_k)}{q(X_k)/((b-a)^{-1})} = \frac{1}{n}\sum_{k=1}^n \frac{f(X_k)(b-a)q_{\text{uniform}}}{q(X_k)}$$

Choosing $q \propto |f|$ minimizes variance - the basis of importance-weighted autoencoders (IWAE).

9.5 For AI: SGD as Monte Carlo Expectation

The true gradient update is:

$$\nabla_\theta \mathcal{L}(\theta) = \int \nabla_\theta \ell(f_\theta(\mathbf{x}), y)\,p(\mathbf{x},y)\,d\mathbf{x}\,dy$$

SGD with mini-batch $\mathcal{B}$ of size $B$ estimates this as:

$$\widehat{\nabla}_\theta\mathcal{L}(\theta) = \frac{1}{B}\sum_{i\in\mathcal{B}} \nabla_\theta\ell(f_\theta(\mathbf{x}_i), y_i)$$

This is a Monte Carlo estimate of the gradient integral. The mini-batch is drawn i.i.d. from the data distribution (uniformly at random from the training set), so it is an unbiased estimator with variance $O(1/B)$. Larger batch sizes reduce variance but increase compute per step.

10. Integration in Probability

Forward reference to 06-Probability Theory. The full treatment of random variables, distributions, expectation, and probabilistic reasoning is in 06-Probability-Theory. Here we cover the integration mechanics - how to compute expectations, variances, KL divergence, and entropy as definite or improper integrals.

10.1 Probability Density Functions

A probability density function (PDF) $p: \mathbb{R} \to [0,\infty)$ satisfies:

$$\int_{-\infty}^\infty p(x)\,dx = 1, \qquad p(x) \geq 0 \text{ for all } x$$

The probability that $X$ falls in $[a,b]$ is $\Pr(a \leq X \leq b) = \int_a^b p(x)\,dx$.

Common PDFs and their normalization integrals:

Distribution	PDF	Normalization relies on
Uniform on $[a,b]$	$\frac{1}{b-a}$	Elementary
Gaussian $\mathcal{N}(\mu,\sigma^2)$	$\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}$	Gaussian integral (8.4)
Exponential($\lambda$)	$\lambda e^{-\lambda x}$, $x\geq 0$	$\int_0^\infty \lambda e^{-\lambda x}\,dx = 1$
Laplace($\mu, b$)	$\frac{1}{2b}e^{-\|x-\mu\|/b}$	Symmetric exponential

10.2 CDF and FTC

The cumulative distribution function (CDF) is:

$$F(x) = \Pr(X \leq x) = \int_{-\infty}^x p(t)\,dt$$

By FTC Part 1: $F'(x) = p(x)$ - the PDF is the derivative of the CDF. This connects probability theory directly to the FTC: the CDF is the "area accumulation function" of the PDF, and differentiating it recovers the density.

Properties of the CDF: - $F(-\infty) = 0$, $F(+\infty) = 1$ - $F$ is non-decreasing: $x_1 < x_2 \Rightarrow F(x_1) \leq F(x_2)$ - $F$ is right-continuous - $\Pr(a < X \leq b) = F(b) - F(a)$ (FTC Part 2)

10.3 Expectation as a Weighted Integral

$$\mathbb{E}[X] = \int_{-\infty}^\infty x\,p(x)\,dx$$ $$\mathbb{E}[g(X)] = \int_{-\infty}^\infty g(x)\,p(x)\,dx \quad \text{(Law of the Unconscious Statistician)}$$

Gaussian expectation. For $X \sim \mathcal{N}(\mu, \sigma^2)$:

$$\mathbb{E}[X] = \int_{-\infty}^\infty x\cdot\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}\,dx = \mu$$

Proof. Substitute $u = (x-\mu)/\sigma$: $\mathbb{E}[X] = \int_{-\infty}^\infty (\sigma u + \mu)\frac{1}{\sqrt{2\pi}}e^{-u^2/2}\,du$. The $\sigma u$ term integrates to 0 (odd function); the $\mu$ term gives $\mu \cdot 1$. $\square$

10.4 Variance and Second Moments

$$\text{Var}(X) = \mathbb{E}[(X-\mu)^2] = \int_{-\infty}^\infty (x-\mu)^2\,p(x)\,dx = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

Gaussian variance. For $X \sim \mathcal{N}(\mu,\sigma^2)$:

$$\text{Var}(X) = \int_{-\infty}^\infty (x-\mu)^2 \frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}\,dx = \sigma^2$$

Proven via substitution $u=(x-\mu)/\sigma$ and the identity $\int_{-\infty}^\infty u^2 e^{-u^2/2}\,du = \sqrt{2\pi}$ (integration by parts with $u \cdot ue^{-u^2/2}$).

10.5 KL Divergence

The Kullback-Leibler divergence from $q$ to $p$:

$$\text{KL}(p\|q) = \int_{-\infty}^\infty p(x)\ln\frac{p(x)}{q(x)}\,dx$$

Properties: - $\text{KL}(p\|q) \geq 0$ (Gibbs' inequality - proven via Jensen's inequality and the concavity of $\ln$) - $\text{KL}(p\|q) = 0$ iff $p = q$ a.e. - Not symmetric: $\text{KL}(p\|q) \neq \text{KL}(q\|p)$ in general

Forward vs. reverse KL: - $\text{KL}(p\|q)$ (forward/exclusive): forces $q$ to cover all modes of $p$. Used in maximum likelihood estimation. - $\text{KL}(q\|p)$ (reverse/inclusive): forces $q$ to fit one mode of $p$ well. Used in variational inference (ELBO = $-\text{KL}(q\|p) + \text{const}$).

Gibbs' inequality proof. By concavity of $\ln$: $\ln t \leq t - 1$ for all $t > 0$. Apply with $t = q(x)/p(x)$:

$$-\text{KL}(p\|q) = \int p(x)\ln\frac{q(x)}{p(x)}\,dx \leq \int p(x)\left(\frac{q(x)}{p(x)}-1\right)\,dx = \int q(x)\,dx - \int p(x)\,dx = 1-1=0$$

10.6 Entropy

The differential entropy of a continuous random variable $X$ with PDF $p$:

$$H(X) = -\int_{-\infty}^\infty p(x)\ln p(x)\,dx = -\mathbb{E}[\ln p(X)]$$

Gaussian entropy. For $X \sim \mathcal{N}(\mu,\sigma^2)$:

$$H(X) = \frac{1}{2}\ln(2\pi e\sigma^2) = \frac{1}{2}[1 + \ln(2\pi\sigma^2)]$$

Maximum entropy principle. Among all distributions with mean $\mu$ and variance $\sigma^2$, the Gaussian maximizes entropy. This makes the Gaussian the natural distribution for uncertainty - used throughout Bayesian deep learning (Gaussian priors, Gaussian posteriors in VAEs).

Connection to cross-entropy loss. For a model $q_\theta$ trained on data from $p$:

$$\mathbb{E}_{x\sim p}[-\ln q_\theta(x)] = H(p) + \text{KL}(p\|q_\theta)$$

Minimizing cross-entropy loss minimizes $\text{KL}(p\|q_\theta)$ (since $H(p)$ is constant w.r.t. $\theta$). This is why cross-entropy training is equivalent to maximum likelihood estimation.

11. Common Mistakes

#	Mistake	Why It's Wrong	Fix
1	Forgetting $+C$ in indefinite integrals	Every antiderivative family has a free constant; omitting it loses solutions to IVPs	Always write $+C$ and determine it from initial conditions
2	$\int f(x)g(x)\,dx = \int f\,dx \cdot \int g\,dx$	Integration does NOT distribute over products	Use substitution or integration by parts
3	$\int_a^b f\,dx = F(b) - F(a)$ without checking $F' = f$	If $F$ is wrong, the evaluation is wrong	Always verify the antiderivative by differentiating
4	Not changing limits in definite $u$-substitution	If you substitute $u = g(x)$, the limits must become $g(a)$ and $g(b)$	Either change limits or back-substitute
5	$\int \frac{1}{x^2}\,dx = \ln(x^2)+C$	Incorrect - $\int x^{-2}\,dx = -x^{-1}+C$; $\ln$ antiderivative only applies to $1/x$	Power rule: $\int x^n\,dx = x^{n+1}/(n+1)+C$ for $n\neq -1$
6	$\int_0^1 \frac{1}{x}\,dx = [\ln x]_0^1 = 0$	This is an improper integral - $\ln(0) = -\infty$; the integral diverges	Always check for discontinuities before applying FTC Part 2
7	$\int_{-1}^{1}\frac{1}{x}\,dx = 0$ by symmetry	The integrand is odd, but the integral diverges - symmetry argument fails for divergent integrals	Confirm convergence before using symmetry
8	Wrong LIATE choice causes circular integration	Choosing $u =$ exponential in $\int xe^x\,dx$ -> no simplification	Let $u$ be LIATE-first: $u=x$, $dv=e^x\,dx$
9	$\int_a^\infty f\,dx$ treated as finite without checking	Infinite integration limits require explicit convergence check	Write as $\lim_{b\to\infty}\int_a^b$ and evaluate the limit
10	Monte Carlo $O(1/\sqrt{n})$ confused with deterministic $O(h^2)$	Monte Carlo error is stochastic, in expectation/variance; not a uniform bound	Report $\pm 1.96\,\text{SE}$ confidence intervals for MC estimates
11	$\text{KL}(p\\|q) = \text{KL}(q\\|p)$	KL divergence is not symmetric	Know the difference: forward KL is mode-covering, reverse KL is mode-seeking
12	$\int u\,dv = uv + \int v\,du$ (sign error in parts)	The formula has a minus sign: $\int u\,dv = uv - \int v\,du$	Derive from product rule to remember the minus sign

12. Exercises

Exercise 1 - Riemann Sums. For $f(x) = x^2$ on $[0,2]$ with $n = 8$ equal subintervals: (a) Compute the left Riemann sum $L_8$. (b) Compute the right Riemann sum $R_8$. (c) Compute the exact value $\int_0^2 x^2\,dx$ via FTC and verify $L_8 \leq$ exact $\leq R_8$.

Exercise 2 - FTC and Antiderivatives. Evaluate: (a) $\int_1^e \frac{(\ln x)^2}{x}\,dx$ (b) $\int_0^{\pi/2}\sin^3 x\cos x\,dx$ (c) $\int_0^{\ln 2}e^x\sqrt{1+e^x}\,dx$

Exercise 3 - Integration by Parts. Compute: (a) $\int x^2 e^{-x}\,dx$ (b) $\int \ln(x^2+1)\,dx$ (c) $\int e^x\cos x\,dx$

Exercise 4 - Improper Integrals. Determine convergence and evaluate if finite: (a) $\int_1^\infty \frac{1}{x^{3/2}}\,dx$ (b) $\int_0^1 \frac{\ln x}{\sqrt{x}}\,dx$ (c) $\int_{-\infty}^\infty xe^{-x^2}\,dx$

Exercise 5 - Numerical Integration. For $f(x) = e^{-x^2}$ on $[0,2]$: (a) Compute $T_n$ and $S_n$ for $n \in \{4, 8, 16\}$. (b) Compare to scipy.integrate.quad result. (c) Plot the absolute error vs. $n$ for both methods on a log-log scale. Measure the observed convergence rates.

Exercise 6 - Monte Carlo Integration. Estimate $\int_0^1 \sin(\pi x^2)\,dx$: (a) Implement basic Monte Carlo with $n \in \{100, 1000, 10000, 100000\}$. (b) Plot the estimate and $\pm 2\,\text{SE}$ confidence band vs. $n$. (c) Verify the $O(1/\sqrt{n})$ convergence by plotting $n \cdot \text{Var}[\hat{I}_n]$ vs. $n$.

Exercise 7 - KL Divergence. Let $p = \mathcal{N}(0,1)$ and $q = \mathcal{N}(\mu, \sigma^2)$: (a) Derive the closed-form $\text{KL}(p\|q)$ by evaluating the integral. (b) Implement numerical KL via Monte Carlo with $10^5$ samples. Verify against the closed form. (c) Plot $\text{KL}(p\|q)$ as a function of $\mu \in [-3,3]$ (fixed $\sigma=1$) and as a function of $\sigma \in [0.1, 3]$ (fixed $\mu=0$). Observe the asymmetry.

Exercise 8 - ELBO and Variational Inference. The ELBO (Evidence Lower BOund) is:

$$\text{ELBO}(q) = \mathbb{E}_{z\sim q}[\ln p(x,z)] - \mathbb{E}_{z\sim q}[\ln q(z)] = \mathbb{E}_{z\sim q}[\ln p(x|z)] - \text{KL}(q\|p_z)$$

(a) Show that $\ln p(x) = \text{ELBO}(q) + \text{KL}(q\|p(\cdot|x))$ using the definition of conditional probability and KL divergence. (b) Explain why maximizing the ELBO is equivalent to maximizing a lower bound on $\ln p(x)$. (c) For $p_z = \mathcal{N}(0,1)$ and $q = \mathcal{N}(\mu,\sigma^2)$, compute $\text{KL}(q\|p_z)$ and implement the reparameterization trick: $z = \mu + \sigma\varepsilon$, $\varepsilon\sim\mathcal{N}(0,1)$.

13. Why This Matters for AI (2026 Perspective)

Concept	AI/ML Impact
Riemann sums	Conceptual foundation of all numerical expectation estimates; discrete sum over a dataset approximates the continuous integral over the data-generating distribution
FTC Part 2	Evaluating KL divergences, entropies, and moments in closed form; normalizing flow log-likelihoods
FTC Part 1	Adjoint method for neural ODEs; sensitivity analysis of dynamical systems; continuous-time RL
u-Substitution	Change-of-variables formula in normalizing flows (Real NVP, Glow, FFJORD); reparameterization trick in VAEs
Integration by parts	Log-derivative (REINFORCE) trick for policy gradient methods (PPO, GRPO, DPO in RLHF)
Improper integrals	Convergence of expected losses over infinite data; entropy and KL divergence for continuous distributions
Gaussian integral	Normalization of all Gaussian distributions; closed-form KL divergences between Gaussians (VAE regularizer)
Trapezoid/Simpson	Numerical integration in scientific ML; ODE solver stepping (Euler, RK4 in neural ODEs)
Monte Carlo	SGD and mini-batch training; IWAE (importance-weighted); MCMC in Bayesian neural networks
PDF/CDF via FTC	Score matching (diffusion model training); CDF inversion sampling; cumulative reward functions
Expectation as integral	Every loss function; ELBO; reward expectation in RL; attention weights as expectation over keys
KL divergence	VAE regularizer; DPO alignment loss; diffusion model score function; maximum entropy RL
Entropy	Information bottleneck principle; attention entropy regularization; exploration in RL
Cross-entropy <-> KL	Cross-entropy training = MLE = minimizing KL from true distribution to model

14. Conceptual Bridge

Looking back. This section built on two pillars from previous sections:

Limits (01) - the Riemann integral is defined as a limit of sums; the FTC proof uses the MVT for integrals; convergence of improper integrals is a limit.
Derivatives (02) - antiderivatives are "reverse derivatives"; FTC Part 1 says the area-accumulation function has derivative equal to the integrand; u-substitution reverses the chain rule; integration by parts reverses the product rule.

Every integration technique is the reverse of a differentiation technique. The FTC is the theorem that makes this reversal exact.

Looking forward.

04-Series-and-Sequences - Taylor series are derived using higher-order derivatives; the Taylor remainder formula involves a definite integral of the $(n+1)$-th derivative. Power series can be integrated term-by-term.
05-Multivariate Calculus - double and triple integrals extend the Riemann definition to higher dimensions; Fubini's theorem allows iterated integration; the substitution rule becomes the Jacobian change-of-variables formula.
06-Probability Theory - all continuous probability is integration: distributions, expectations, variances, moment-generating functions, characteristic functions, conditional distributions.
08-Optimization - population risk is an integral; SGD is a Monte Carlo gradient estimator; natural gradient uses the Fisher information matrix, which is defined via an integral.

Position in the curriculum:

CHAPTER 4 - CALCULUS FUNDAMENTALS


  01 Limits and Continuity 
           (limit foundations, epsilon-delta, continuity)

  02 Derivatives and Differentiation 
           (chain rule, product rule, activation derivatives)

  03 Integration   YOU ARE HERE 
           (FTC links 02 <-> 03; substitution reverses chain rule)

  04 Series and Sequences 
           (Taylor series uses 02 derivatives + 03 integration)

  05 Multivariate Calculus 
         (double integrals, Fubini, Jacobians - extends 03)

  06 Probability Theory 
         (all continuous probability IS integration from 03)

Appendix A: Extended Substitution Examples and Patterns

A.1 Recognising the Pattern

The hardest part of $u$-substitution is identifying the right $u$. The pattern to look for: one function is the derivative of another part of the integrand.

Integrand pattern	Choose $u$	Because
$f(x^n)\cdot x^{n-1}$	$u = x^n$	$du = nx^{n-1}\,dx$
$f(e^x)\cdot e^x$	$u = e^x$	$du = e^x\,dx$
$f(\ln x)\cdot \frac{1}{x}$	$u = \ln x$	$du = \frac{1}{x}\,dx$
$f(\sin x)\cdot \cos x$	$u = \sin x$	$du = \cos x\,dx$
$f(\cos x)\cdot \sin x$	$u = \cos x$	$du = -\sin x\,dx$
$f(\sqrt{x})\cdot \frac{1}{\sqrt{x}}$	$u = \sqrt{x}$	$du = \frac{1}{2\sqrt{x}}\,dx$

A.2 Rationalizing Substitutions

For integrals involving $\sqrt{ax+b}$: let $u = \sqrt{ax+b}$ so $u^2 = ax+b$ and $2u\,du = a\,dx$.

Example. $\int x\sqrt{2x+1}\,dx$. Let $u = \sqrt{2x+1}$, $u^2 = 2x+1$, $x = (u^2-1)/2$, $dx = u\,du$:

$$\int \frac{u^2-1}{2}\cdot u\cdot u\,du = \frac{1}{2}\int(u^4-u^2)\,du = \frac{u^5}{10} - \frac{u^3}{6} + C = \frac{(2x+1)^{5/2}}{10} - \frac{(2x+1)^{3/2}}{6} + C$$

A.3 Trigonometric Substitutions

For integrands involving $\sqrt{a^2-x^2}$, $\sqrt{a^2+x^2}$, or $\sqrt{x^2-a^2}$:

Form	Substitution	Identity used
$\sqrt{a^2-x^2}$	$x = a\sin\theta$	$1-\sin^2\theta = \cos^2\theta$
$\sqrt{a^2+x^2}$	$x = a\tan\theta$	$1+\tan^2\theta = \sec^2\theta$
$\sqrt{x^2-a^2}$	$x = a\sec\theta$	$\sec^2\theta-1 = \tan^2\theta$

Example. $\int\frac{1}{\sqrt{4-x^2}}\,dx$. Let $x = 2\sin\theta$, $dx = 2\cos\theta\,d\theta$:

$$\int\frac{2\cos\theta\,d\theta}{\sqrt{4-4\sin^2\theta}} = \int\frac{2\cos\theta}{2\cos\theta}\,d\theta = \theta + C = \arcsin\frac{x}{2} + C$$

For AI. Trigonometric substitutions appear when integrating radial functions in high-dimensional probability (e.g., volumes of spherical shells used in $d$-dimensional Gaussian integrals).

A.4 The Softmax Integral - Partition Function

The softmax denominator (partition function) is a sum $Z = \sum_k e^{z_k}$, the discrete analogue of the integral $Z = \int e^{f(x)}\,dx$ that appears in energy-based models. In the continuous case, computing $Z$ is intractable in general - this is the core computational challenge of energy-based models. Variational autoencoders and diffusion models avoid computing $Z$ directly by working with ratios or lower bounds.

Appendix B: Integration by Parts - Extended Examples and Theory

B.1 The Cyclic Trick

When integration by parts produces $I = (\text{something}) - cI$ for a constant $c \neq -1$, solve for $I$:

$$I + cI = \text{something} \implies I = \frac{\text{something}}{1+c}$$

This works for $\int e^{ax}\cos(bx)\,dx$ and $\int e^{ax}\sin(bx)\,dx$:

$$\int e^{ax}\cos(bx)\,dx = \frac{e^{ax}(a\cos(bx) + b\sin(bx))}{a^2+b^2} + C$$ $$\int e^{ax}\sin(bx)\,dx = \frac{e^{ax}(a\sin(bx) - b\cos(bx))}{a^2+b^2} + C$$

Verification: Differentiate the right side to confirm.

B.2 Integration by Parts for Definite Integrals

Example. Show $\int_0^\infty xe^{-x}\,dx = 1$.

Let $u=x$, $dv=e^{-x}\,dx$:

$$\int_0^\infty xe^{-x}\,dx = \Big[-xe^{-x}\Big]_0^\infty + \int_0^\infty e^{-x}\,dx$$

The boundary term: $\lim_{x\to\infty} xe^{-x} = 0$ (L'Hpital: $x/e^x \to 0$) and at $x=0$: $0$.

$$= 0 + [-e^{-x}]_0^\infty = 0 - (-1) = 1$$

B.3 The Gamma Function

The Gamma function generalizes factorials to real arguments:

$$\Gamma(s) = \int_0^\infty x^{s-1}e^{-x}\,dx, \quad s > 0$$

Key properties (proven by integration by parts): - $\Gamma(s+1) = s\,\Gamma(s)$ (reduction formula via parts) - $\Gamma(n) = (n-1)!$ for positive integers $n$ - $\Gamma(1/2) = \sqrt{\pi}$ (from the Gaussian integral)

Proof of reduction. $\Gamma(s+1) = \int_0^\infty x^s e^{-x}\,dx$. Let $u = x^s$, $dv = e^{-x}\,dx$:

$$= [-x^s e^{-x}]_0^\infty + s\int_0^\infty x^{s-1}e^{-x}\,dx = 0 + s\,\Gamma(s) \quad \square$$

For AI. The Gamma function appears in the normalizing constants of many probability distributions used in Bayesian ML: the Gamma distribution (prior for precision in Gaussian models), the Beta distribution (prior for probabilities in Dirichlet-multinomial models), the Student-t distribution (robust regression).

B.4 Wallis's Formula

From repeated integration by parts on $\int_0^{\pi/2}\sin^n x\,dx$:

$$\frac{\pi}{2} = \frac{2\cdot 2\cdot 4\cdot 4\cdot 6\cdot 6\cdots}{1\cdot 3\cdot 3\cdot 5\cdot 5\cdot 7\cdots} = \prod_{n=1}^\infty\frac{4n^2}{4n^2-1}$$

This is one of the earliest infinite product formulas for $\pi$ - an unexpected connection between integration and $\pi$.

Appendix C: Improper Integrals - Convergence Tests in Detail

C.1 The p-Test Summary

$$\int_1^\infty \frac{1}{x^p}\,dx \qquad \int_0^1 \frac{1}{x^p}\,dx $$

The boundary $p = 1$ always diverges ($\int 1/x\,dx = \ln x$, which diverges at both limits).

C.2 Comparison Test - Worked Examples

Example 1. Does $\int_1^\infty \frac{1}{x^2+\sqrt{x}}\,dx$ converge?

For $x \geq 1$: $x^2 + \sqrt{x} \geq x^2$, so $\frac{1}{x^2+\sqrt{x}} \leq \frac{1}{x^2}$.

Since $\int_1^\infty x^{-2}\,dx = 1$ converges, by comparison the original converges. $\square$

Example 2. Does $\int_1^\infty \frac{\ln x}{x}\,dx$ converge?

For $x \geq e$: $\ln x \geq 1$, so $\frac{\ln x}{x} \geq \frac{1}{x}$.

Since $\int_e^\infty x^{-1}\,dx$ diverges, by comparison the original diverges. $\square$

C.3 Absolute vs Conditional Convergence

$\int_a^\infty f(x)\,dx$

is absolutely convergent if .

$\int_a^\infty f(x)\,dx$

is conditionally convergent if it converges but not absolutely.

Example. $\int_0^\infty \frac{\sin x}{x}\,dx = \frac{\pi}{2}$ (Dirichlet integral) - converges conditionally but NOT absolutely ($\int_0^\infty |\sin x|/x\,dx = \infty$).

For ML. Absolutely convergent integrals behave nicely: they can be split, reordered, and approximated by truncated versions. The expected loss $\mathbb{E}[\mathcal{L}]$ is absolutely convergent (non-negative integrand) - this is why expectation estimates via Monte Carlo are reliable.

C.4 Laplace Transform Preview

The Laplace transform is an improper integral parametrized by $s$:

$$\mathcal{L}\{f\}(s) = \int_0^\infty f(t)e^{-st}\,dt$$

It converts differential equations to algebraic equations (used in control theory). For neural networks, the Laplace transform of the loss trajectory $\mathcal{L}\{t \mapsto \mathcal{L}(\theta_t)\}$ is related to the training dynamics in Laplace domain - an emerging tool in the theoretical analysis of gradient descent.

Appendix D: Numerical Integration - Error Analysis and Advanced Methods

D.1 Trapezoid Rule - Derivation from Scratch

On each subinterval $[x_{k-1}, x_k]$, the trapezoid rule approximates $f$ by the linear interpolant:

$$f(x) \approx f(x_{k-1}) + \frac{f(x_k)-f(x_{k-1})}{h}(x-x_{k-1})$$

Integrating from $x_{k-1}$ to $x_k$:

$$\int_{x_{k-1}}^{x_k} f\,dx \approx f(x_{k-1})\cdot h + \frac{f(x_k)-f(x_{k-1})}{h}\cdot\frac{h^2}{2} = \frac{h}{2}[f(x_{k-1})+f(x_k)]$$

Summing $n$ subintervals with telescoping:

$$T_n = \frac{h}{2}[f(x_0) + 2f(x_1) + 2f(x_2) + \cdots + 2f(x_{n-1}) + f(x_n)]$$

Error derivation. By Taylor expansion on each subinterval:

$$\int_{x_{k-1}}^{x_k} f\,dx = \frac{h}{2}[f(x_{k-1})+f(x_k)] - \frac{h^3}{12}f''(\xi_k)$$

Summing: $E_T = -\frac{h^2(b-a)}{12}\bar{f}''$ where $\bar{f}''$ is some average of $f''$ on $[a,b]$ (MVT). So $|E_T| \leq \frac{M_2(b-a)^3}{12n^2}$ where $M_2 = \max|f''|$.

D.2 Simpson's Rule - Parabolic Approximation

On each pair of subintervals $[x_{2k-2}, x_{2k}]$, use the unique parabola through three points:

$$S_{2k} = \frac{h}{3}[f(x_{2k-2}) + 4f(x_{2k-1}) + f(x_{2k})]$$

The coefficient pattern comes from integrating the Lagrange interpolating polynomial. The error:

$$E_S = -\frac{h^4(b-a)}{180}\bar{f}^{(4)} \implies |E_S| \leq \frac{M_4(b-a)^5}{180n^4}$$

D.3 Richardson Extrapolation

If $T_n$ has error $E_T = c_2/n^2 + c_4/n^4 + \cdots$, then combining $T_n$ and $T_{2n}$:

$$\frac{4T_{2n} - T_n}{3}$$

eliminates the $O(1/n^2)$ term, giving a method with error $O(1/n^4)$ - equal to Simpson's! This idea, applied recursively, gives Romberg integration with error $O(h^{2k})$ for any $k$.

D.4 Quasi-Monte Carlo

Standard Monte Carlo has error $O(1/\sqrt{n})$ regardless of dimension. Quasi-Monte Carlo (QMC) uses low-discrepancy sequences (Halton, Sobol) instead of random points:

Random points clump and leave gaps
Low-discrepancy sequences fill space more uniformly

QMC achieves $O((\log n)^d / n)$ error for smooth integrands in $d$ dimensions - much better than $O(1/\sqrt{n})$ when $d$ is small. Used in financial derivatives pricing and high-dimensional integration in Bayesian neural networks.

D.5 Adaptive Integration

scipy.integrate.quad uses Gaussian-Kronrod quadrature with adaptive refinement: if the error estimate on a subinterval exceeds tolerance, subdivide and integrate each piece separately. This automatically focuses computational effort on regions where $f$ varies rapidly - crucial for integrands with sharp peaks (e.g., probability densities in high-dimensional tails).

Appendix E: Probability Integration - Extended Topics

E.1 Moment-Generating Functions

The moment-generating function (MGF) of $X$ is:

$$M_X(t) = \mathbb{E}[e^{tX}] = \int_{-\infty}^\infty e^{tx}p(x)\,dx$$

provided this integral converges for $t$ in a neighborhood of 0.

Why it generates moments: Differentiate $k$ times at $t=0$:

$$M_X^{(k)}(0) = \mathbb{E}[X^k e^{0\cdot X}] = \mathbb{E}[X^k]$$

So $\mathbb{E}[X] = M'(0)$, $\mathbb{E}[X^2] = M''(0)$, $\text{Var}(X) = M''(0) - [M'(0)]^2$.

Gaussian MGF. For $X \sim \mathcal{N}(\mu,\sigma^2)$:

$$M_X(t) = e^{\mu t + \sigma^2 t^2/2}$$

Proven by completing the square in the exponent under the integral and using the Gaussian integral.

E.2 Characteristic Functions and Fourier Transforms

The characteristic function $\phi_X(t) = \mathbb{E}[e^{itX}] = \int e^{itx}p(x)\,dx$ is the Fourier transform of the PDF. Unlike the MGF, the characteristic function always exists (since $|e^{itx}| = 1$).

The inverse Fourier transform recovers the PDF:

$$p(x) = \frac{1}{2\pi}\int_{-\infty}^\infty e^{-itx}\phi_X(t)\,dt$$

For AI. Fourier transforms appear in: - Random Fourier features (Rahimi & Recht, 2007): approximate shift-invariant kernels via $\hat{k}(x-y) = \mathbb{E}[e^{i\omega(x-y)}]$ where $\omega \sim p(\omega)$ - Frequency domain attention: Fourier attention mechanisms (FNet) replace self-attention with 2D FFT - Spectral normalization: weight matrix spectral norm computed via power iteration of $\|W\|_2 = $ largest singular value

E.3 Conditional Expectations as Integrals

The conditional expectation $\mathbb{E}[Y|X=x]$ is defined via the conditional density $p(y|x)$:

$$\mathbb{E}[Y|X=x] = \int y\,p(y|x)\,dy$$

Law of total expectation: $\mathbb{E}[Y] = \mathbb{E}_X[\mathbb{E}[Y|X]] = \int \mathbb{E}[Y|X=x]\,p(x)\,dx$

For AI. Diffusion model training minimizes $\mathbb{E}_t\mathbb{E}_{x_0}\mathbb{E}_{x_t|x_0}[\|\epsilon_\theta(x_t,t) - \epsilon\|^2]$ - a triple nested expectation. Each $\mathbb{E}$ is an integral; Monte Carlo (sampling) handles them all.

E.4 Integration and Maximum Likelihood

Maximum likelihood estimation (MLE) maximizes $\prod_{i=1}^n p(x_i;\theta)$ - equivalently, maximizes the log-likelihood $\sum_i \log p(x_i;\theta)$.

This sum is a Monte Carlo estimate of the integral:

$$\frac{1}{n}\sum_{i=1}^n \log p(x_i;\theta) \xrightarrow{n\to\infty} \int \log p(x;\theta)\,p_{\text{true}}(x)\,dx = -H(p_{\text{true}}) - \text{KL}(p_{\text{true}}\|p_\theta)$$

Maximizing MLE is equivalent to minimizing $\text{KL}(p_{\text{true}}\|p_\theta)$ - the forward KL divergence from the true data distribution to the model. This is a fundamental connection between MLE, integration, and information theory.

E.5 Score Matching

The score function of a distribution $p$ is $s(x) = \nabla_x \log p(x)$. Score matching (Hyvrinen, 2005) estimates $p$ without computing its normalization constant $Z = \int e^{f(x)}\,dx$:

$$J(\theta) = \mathbb{E}_{x\sim p}\left[\text{tr}(\nabla_x s_\theta(x)) + \frac{1}{2}\|s_\theta(x)\|^2\right]$$

Integration by parts shows this equals $\mathbb{E}_{x\sim p}[\|s_\theta(x) - s(x)\|^2]$ plus a constant - so minimizing $J(\theta)$ fits the model score to the true score without integrating $p$. This is the training objective of denoising diffusion probabilistic models (DDPMs).

Appendix F: Key Proofs and Derivations

F.1 Proof: Linearity of the Definite Integral

Theorem. $\int_a^b [f(x)+g(x)]\,dx = \int_a^b f(x)\,dx + \int_a^b g(x)\,dx$.

Proof. By definition:

$$\int_a^b [f+g]\,dx = \lim_{n\to\infty}\sum_{k=1}^n [f(x_k^*)+g(x_k^*)]\Delta x_k = \lim_{n\to\infty}\left[\sum_{k=1}^n f(x_k^*)\Delta x_k + \sum_{k=1}^n g(x_k^*)\Delta x_k\right]$$

Since both limits exist separately: $= \int_a^b f\,dx + \int_a^b g\,dx$. $\square$

F.2 Proof: Substitution Rule for Definite Integrals

Theorem. If $u = g(x)$, $g$ differentiable, $f$ continuous:

$$\int_a^b f(g(x))g'(x)\,dx = \int_{g(a)}^{g(b)} f(u)\,du$$

Proof. Let $F$ be an antiderivative of $f$. By the chain rule, $\frac{d}{dx}[F(g(x))] = f(g(x))g'(x)$. By FTC Part 2:

$$\int_a^b f(g(x))g'(x)\,dx = [F(g(x))]_a^b = F(g(b)) - F(g(a)) = \int_{g(a)}^{g(b)} f(u)\,du \quad \square$$

F.3 Proof: Integration by Parts for Definite Integrals

Theorem. $\int_a^b u\,v'\,dx = [uv]_a^b - \int_a^b u'v\,dx$.

Proof. Product rule: $(uv)' = u'v + uv'$. Integrate: $\int_a^b(uv)'\,dx = \int_a^b u'v\,dx + \int_a^b uv'\,dx$.

FTC: $[uv]_a^b = \int_a^b u'v\,dx + \int_a^b uv'\,dx$. Rearrange. $\square$

F.4 Proof: $\int_1^\infty x^{-p}\,dx$ Converges iff $p > 1$

For $p \neq 1$:

$$\int_1^b x^{-p}\,dx = \left[\frac{x^{1-p}}{1-p}\right]_1^b = \frac{b^{1-p}-1}{1-p}$$

As $b \to \infty$: $b^{1-p} \to 0$ if $1-p < 0$ (i.e., $p > 1$), giving $\int = 1/(p-1)$.

If $p < 1$: $b^{1-p} \to \infty$, so integral diverges.

For $p = 1$: $\int_1^b x^{-1}\,dx = \ln b \to \infty$. $\square$

F.5 Proof: Gibbs' Inequality - $\text{KL}(p\|q) \geq 0$

Theorem. For probability densities $p$ and $q$: $\int p(x)\ln\frac{p(x)}{q(x)}\,dx \geq 0$.

Proof. Since $\ln$ is concave, $\ln t \leq t - 1$ for all $t > 0$ (equality at $t=1$). Apply to $t = q(x)/p(x)$ where $p(x) > 0$:

$$\ln\frac{q(x)}{p(x)} \leq \frac{q(x)}{p(x)} - 1$$

Multiply by $p(x) > 0$ and integrate:

$$\int p(x)\ln\frac{q(x)}{p(x)}\,dx \leq \int [q(x)-p(x)]\,dx = 1 - 1 = 0$$

Therefore $-\text{KL}(p\|q) \leq 0 \Rightarrow \text{KL}(p\|q) \geq 0$. Equality holds iff $q = p$ a.e. $\square$

Appendix G: FTC - Applications in Machine Learning

G.1 Neural ODEs and the Adjoint Method

A neural ODE (Chen et al., 2018) defines the hidden state dynamics as:

$$\frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t), t; \theta)$$

The hidden state at time $T$ is:

$$\mathbf{h}(T) = \mathbf{h}(0) + \int_0^T f(\mathbf{h}(t),t;\theta)\,dt$$

This is FTC Part 2 in reverse: given the "derivative" $f$, integrate to get the accumulated change.

Training requires $\frac{\partial \mathcal{L}}{\partial \theta}$. The adjoint method computes this via a backward ODE:

$$\frac{d\mathbf{a}(t)}{dt} = -\mathbf{a}(t)^\top \frac{\partial f}{\partial \mathbf{h}}$$

where $\mathbf{a}(t) = \partial\mathcal{L}/\partial\mathbf{h}(t)$ is the adjoint state. FTC Part 1 guarantees that integrating this backward ODE recovers the gradient exactly - with $O(1)$ memory (no storing intermediate states).

G.2 Attention as Expectation

Softmax attention computes:

$$\text{Attn}(q, K, V) = \sum_k \alpha_k v_k, \qquad \alpha_k = \frac{e^{q\cdot k_j/\sqrt{d}}}{\sum_j e^{q\cdot k_j/\sqrt{d}}}$$

This is a discrete expectation $\mathbb{E}_{\alpha}[V]$ where $\alpha$ is the softmax distribution over keys. In the continuous limit (as the number of keys grows and positions become dense), this becomes an integral:

$$\text{Attn}(q) = \int v(s)\,\frac{e^{q(s)\cdot k(s)/\sqrt{d}}}{\int e^{q(s)\cdot k(s')/\sqrt{d}}\,ds'}\,ds$$

This connection motivates kernel attention approximations (Performer, Random Feature Attention) that use random features to approximate the exponential kernel via Monte Carlo integration of the Gaussian integral: $e^{q\cdot k} = \int e^{q\cdot\omega}\cdot e^{k\cdot\omega}\,p(\omega)\,d\omega$.

G.3 Diffusion Models - Score Matching via Integration by Parts

The denoising score matching objective (Vincent, 2011; Song & Ermon, 2019):

$$\mathbb{E}_{t,x_0,\epsilon}\left[\lambda(t)\|\epsilon_\theta(x_t,t) - \epsilon\|^2\right]$$

where $x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$, $\epsilon \sim \mathcal{N}(0,I)$.

The equivalence to score matching follows from integration by parts in function space. Specifically, $\nabla_{x_t}\log p(x_t) = -\epsilon/\sqrt{1-\bar{\alpha}_t}$, so the network learns the score of the noisy distribution - an integral relationship between the score function and the data density.

G.4 Variational Autoencoder ELBO

The VAE training objective is the ELBO:

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{z\sim q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x)\|p(z))$$

Each term is an integral:

$\mathbb{E}_{q_\phi}[\log p_\theta(x|z)] = \int q_\phi(z|x)\log p_\theta(x|z)\,dz$ - estimated via Monte Carlo (reparameterization trick)
$\text{KL}(q_\phi\|p) = \int q_\phi(z|x)\log\frac{q_\phi(z|x)}{p(z)}\,dz$ - computed in closed form for Gaussian $q_\phi$ and $p$

The reparameterization trick ($z = \mu_\phi(x) + \sigma_\phi(x)\odot\varepsilon$, $\varepsilon \sim \mathcal{N}(0,I)$) is a change of variables (substitution rule) that makes the Monte Carlo estimator differentiable w.r.t. $\phi$.

Appendix H: Notation Reference and Quick-Reference Tables

H.1 Integration Notation Comparison

Notation	Meaning	Context
$\int_a^b f(x)\,dx$	Definite integral from $a$ to $b$	Riemann, most contexts
$\int f(x)\,dx$	Indefinite integral (antiderivative)	General antiderivative
$[F(x)]_a^b$	$F(b) - F(a)$	FTC shorthand
$\int f\,d\mu$	Lebesgue integral w.r.t. measure $\mu$	Measure theory, probability
$\mathbb{E}_{x\sim p}[f(x)]$	$\int f(x)p(x)\,dx$	Probabilistic expectation
$\hat{I}_n = \frac{1}{n}\sum f(x_i)$	Monte Carlo estimate	Stochastic approximation

H.2 Standard Antiderivatives - Extended Table

$f(x)$	$\int f(x)\,dx$	Notes
$x^n$, $n\neq-1$	$x^{n+1}/(n+1)+C$	Power rule
$1/x$	$\ln\|x\|+C$	$x \neq 0$
$e^x$	$e^x+C$
$e^{ax}$	$e^{ax}/a+C$
$a^x$	$a^x/\ln a+C$
$\sin x$	$-\cos x+C$
$\cos x$	$\sin x+C$
$\tan x$	$\ln\|\sec x\|+C$	via substitution
$\sec x$	$\ln\|\sec x+\tan x\|+C$
$\sec^2 x$	$\tan x+C$
$1/\sqrt{1-x^2}$	$\arcsin x+C$
$1/(1+x^2)$	$\arctan x+C$
$1/(a^2+x^2)$	$\frac{1}{a}\arctan(x/a)+C$
$\sinh x$	$\cosh x+C$
$\cosh x$	$\sinh x+C$
$x\ln x - x$	$\int\ln x\,dx$	via parts

H.3 Numerical Methods Comparison

Method	Formula	Error	Evaluations
Left Riemann	$h\sum_{k=0}^{n-1}f(x_k)$	$O(h)$	$n$
Trapezoid	$h[f_0/2 + f_1+\cdots+f_{n-1}+f_n/2]$	$O(h^2)$	$n+1$
Simpson's	$h/3[f_0+4f_1+2f_2+\cdots+4f_{n-1}+f_n]$	$O(h^4)$	$n+1$ (n even)
Gauss-Legendre ($n$ pts)	$\sum w_k f(x_k)$	$O(h^{2n})$	$n$
Monte Carlo	$(b-a)\frac{1}{n}\sum f(X_k)$	$O(1/\sqrt{n})$ stochastic	$n$

H.4 Key Integration Formulas for AI

$$\int_{-\infty}^\infty e^{-x^2}\,dx = \sqrt{\pi} \qquad \int_{-\infty}^\infty e^{-ax^2}\,dx = \sqrt{\pi/a}$$ $$\int_{-\infty}^\infty xe^{-ax^2}\,dx = 0 \qquad \int_{-\infty}^\infty x^2 e^{-ax^2}\,dx = \frac{\sqrt{\pi}}{2a^{3/2}}$$ $$\text{KL}(\mathcal{N}(\mu_1,\sigma_1^2)\|\mathcal{N}(\mu_2,\sigma_2^2)) = \ln\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$ $$H(\mathcal{N}(\mu,\sigma^2)) = \frac{1}{2}\ln(2\pi e\sigma^2) \qquad H(\text{Uniform}(a,b)) = \ln(b-a)$$ $$\mathbb{E}[X] = \int_0^\infty \Pr(X > t)\,dt \quad \text{(for } X \geq 0\text{)}$$

Appendix I: Worked Solutions - Section 12 Exercises

I.1 Exercise 1 - Riemann Sums

$f(x) = x^2$

on , , , nodes :

$$L_8 = h\sum_{k=0}^7 f(kh) = 0.25[0^2+0.25^2+0.5^2+0.75^2+1^2+1.25^2+1.5^2+1.75^2]$$ $$= 0.25[0+0.0625+0.25+0.5625+1+1.5625+2.25+3.0625] = 0.25\times 8.75 = 2.1875$$ $$R_8 = h\sum_{k=1}^8 f(kh) = 0.25[0.25^2+0.5^2+\cdots+2^2] = 0.25\times 11.75 = 2.9375$$

Exact: $\int_0^2 x^2\,dx = [x^3/3]_0^2 = 8/3 \approx 2.6\overline{6}$.

Verify: $L_8 = 2.1875 \leq 8/3 \leq 2.9375 = R_8$.

I.2 Exercise 2a - $\int_1^e \frac{(\ln x)^2}{x}\,dx$

Let $u = \ln x$, $du = dx/x$. Limits: $u(1) = 0$, $u(e) = 1$.

$$= \int_0^1 u^2\,du = [u^3/3]_0^1 = \frac{1}{3}$$

I.2b - $\int_0^{\pi/2}\sin^3 x\cos x\,dx$

Let $u = \sin x$, $du = \cos x\,dx$. Limits: 0 to 1.

$$= \int_0^1 u^3\,du = [u^4/4]_0^1 = \frac{1}{4}$$

I.2c - $\int_0^{\ln 2}e^x\sqrt{1+e^x}\,dx$

Let $u = 1+e^x$, $du = e^x\,dx$. Limits: $u(0) = 2$, $u(\ln 2) = 3$.

$$= \int_2^3 \sqrt{u}\,du = [2u^{3/2}/3]_2^3 = \frac{2}{3}(3\sqrt{3}-2\sqrt{2})$$

I.3 Exercise 3a - $\int x^2 e^{-x}\,dx$

Tabular method (differentiate $x^2$, integrate $e^{-x}$):

Sign	$D$	$I$
$+$	$x^2$	$e^{-x}$
$-$	$2x$	$-e^{-x}$
$+$	$2$	$e^{-x}$
$-$	$0$	$-e^{-x}$

$$\int x^2 e^{-x}\,dx = -x^2 e^{-x} - 2xe^{-x} - 2e^{-x} + C = -e^{-x}(x^2+2x+2) + C$$

I.3b - $\int \ln(x^2+1)\,dx$

Let $u = \ln(x^2+1)$, $dv = dx$:

$$= x\ln(x^2+1) - \int x\cdot\frac{2x}{x^2+1}\,dx = x\ln(x^2+1) - 2\int\frac{x^2}{x^2+1}\,dx$$

Since $\frac{x^2}{x^2+1} = 1 - \frac{1}{x^2+1}$:

$$= x\ln(x^2+1) - 2x + 2\arctan x + C$$

I.4 Exercise 4a - $\int_1^\infty x^{-3/2}\,dx$

$$= \lim_{b\to\infty}[-2x^{-1/2}]_1^b = \lim_{b\to\infty}\left(-\frac{2}{\sqrt{b}}+2\right) = 2$$

I.4c - $\int_{-\infty}^\infty xe^{-x^2}\,dx$

The integrand $f(x) = xe^{-x^2}$ is an odd function ($f(-x) = -f(x)$). Since $\int_0^\infty xe^{-x^2}\,dx = [-e^{-x^2}/2]_0^\infty = 1/2 < \infty$, the integral converges absolutely and:

$$\int_{-\infty}^\infty xe^{-x^2}\,dx = 0 \quad \text{(by symmetry)}$$

Appendix J: Glossary

Term	Definition
Antiderivative	$F$ such that $F'(x) = f(x)$; the indefinite integral $F(x) + C$
Riemann sum	$\sum_{k=1}^n f(x_k^*)\Delta x_k$ - finite approximation to the integral
Definite integral	$\int_a^b f\,dx = \lim_{\\|\mathcal{P}\\|\to 0}S(\mathcal{P},f)$
Indefinite integral	$\int f\,dx = F(x) + C$ - the family of all antiderivatives
FTC Part 1	$\frac{d}{dx}\int_a^x f(t)\,dt = f(x)$
FTC Part 2	$\int_a^b f(x)\,dx = F(b) - F(a)$ for antiderivative $F$
Improper integral	Integral with infinite limits or unbounded integrand; defined via limits
Convergent integral	Improper integral whose limit exists and is finite
u-Substitution	$\int f(g(x))g'(x)\,dx = \int f(u)\,du$ - reversal of chain rule
Integration by parts	$\int u\,dv = uv - \int v\,du$ - reversal of product rule
Partial fractions	Decompose $P/Q$ into simpler rational terms before integrating
PDF	$p(x) \geq 0$ with $\int p\,dx = 1$ - probability density function
CDF	$F(x) = \int_{-\infty}^x p(t)\,dt$ - cumulative distribution function
Expectation	$\mathbb{E}[X] = \int x\,p(x)\,dx$ - weighted average
KL divergence	$\text{KL}(p\\|q) = \int p\ln(p/q)\,dx \geq 0$
Entropy	$H(p) = -\int p\ln p\,dx$ - information content
Trapezoid rule	Numerical integration with $O(h^2)$ error
Simpson's rule	Numerical integration with $O(h^4)$ error (parabolic approximation)
Monte Carlo	Stochastic integration via random sampling; $O(1/\sqrt{n})$ error
ELBO	Evidence lower bound: $\mathcal{L}_{\text{ELBO}} = \mathbb{E}_q[\log p(x\|z)] - \text{KL}(q\\|p_z)$
Score function	$\nabla_x\log p(x)$ - gradient of log-density; used in diffusion models
Reparameterization trick	$z = \mu + \sigma\varepsilon$, $\varepsilon\sim\mathcal{N}(0,I)$ - makes MC estimator differentiable

Appendix K: Connections to Adjacent Sections

K.1 What 04-Series-and-Sequences Needs from This Section

Integration term-by-term: $\int\sum_{n=0}^\infty a_n x^n\,dx = \sum_{n=0}^\infty \frac{a_n x^{n+1}}{n+1}$ - requires uniform convergence
Taylor remainder as integral: $R_n(x) = \frac{1}{n!}\int_a^x (x-t)^n f^{(n+1)}(t)\,dt$
Integral test for series: $\sum_{n=1}^\infty f(n)$ converges iff $\int_1^\infty f(x)\,dx$ converges (for decreasing $f \geq 0$)

K.2 What 05-Multivariate Calculus Needs from This Section

Fubini's theorem: $\int\int f(x,y)\,dx\,dy = \int\left(\int f(x,y)\,dx\right)\,dy$ - iterated integration reduces a 2D integral to two 1D integrals
Change of variables: $\int_R f(\mathbf{x})\,d\mathbf{x} = \int_S f(\mathbf{g}(\mathbf{u}))|\det J_\mathbf{g}(\mathbf{u})|\,d\mathbf{u}$ - the Jacobian generalizes $|g'(x)|$ from substitution
Line integrals: $\int_C f\,ds$ along a curve - generalization of $\int_a^b f(x)\,dx$

K.3 What 06-Probability Theory Needs from This Section

All of continuous probability theory is integration. The sections needs: - PDF normalization: $\int p\,dx = 1$ - CDF definition and FTC connection - Expectation: $\mathbb{E}[g(X)] = \int g(x)p(x)\,dx$ - Moment computations: Gaussian moments via Gaussian integral and substitution - KL divergence and entropy as improper integrals over $\mathbb{R}$

Appendix L: Additional Worked Examples - FTC and Techniques

L.1 Leibniz Rule Examples

Example 1. $\frac{d}{dx}\int_x^{x^2} \sin(t^2)\,dt$.

Upper limit $h(x) = x^2$, lower limit $g(x) = x$. Leibniz rule:

$$= \sin((x^2)^2)\cdot 2x - \sin(x^2)\cdot 1 = 2x\sin(x^4) - \sin(x^2)$$

Example 2. $F(x) = \int_0^x \frac{t^2-1}{t^4+1}\,dt$. Find $F'(x)$.

By FTC Part 1: $F'(x) = \frac{x^2-1}{x^4+1}$.

Critical points of $F$: $F'(x) = 0 \Rightarrow x^2 = 1 \Rightarrow x = \pm 1$.

$F'$

changes sign: for , for . So has a local max at and local min at .

Example 3 (FTC + u-sub). $\frac{d}{dx}\int_1^{e^x}\ln t\,dt$.

Let $u = e^x$: $\frac{d}{dx}\int_1^{e^x}\ln t\,dt = \ln(e^x)\cdot e^x = x\cdot e^x$.

L.2 Trigonometric Integrals

Strategy for $\int\sin^m x\cos^n x\,dx$:

If $m$ odd: save one $\sin x$, convert rest to $\cos x$ via $\sin^2 = 1-\cos^2$, substitute $u = \cos x$.
If $n$ odd: save one $\cos x$, convert, substitute $u = \sin x$.
If both even: use double-angle formulas $\sin^2 x = (1-\cos 2x)/2$, $\cos^2 x = (1+\cos 2x)/2$.

Example. $\int\sin^3 x\cos^4 x\,dx$. Odd power of $\sin$:

$\int\sin^2 x\cos^4 x\cdot\sin x\,dx = \int(1-\cos^2 x)\cos^4 x\cdot\sin x\,dx$

Let $u = \cos x$: $= -\int(1-u^2)u^4\,du = -\int(u^4-u^6)\,du = -\frac{u^5}{5}+\frac{u^7}{7}+C = -\frac{\cos^5 x}{5}+\frac{\cos^7 x}{7}+C$.

L.3 Integrals of Rational Functions - Full Pipeline

Example. $\int\frac{x^3-4x+1}{x^2-x-2}\,dx$.

Step 1. Long division (numerator degree > denominator degree):

$x^3-4x+1 = (x^2-x-2)\cdot(x+1) + (-x+3)$

Step 2. Partial fractions of remainder:

$\frac{-x+3}{x^2-x-2} = \frac{-x+3}{(x-2)(x+1)} = \frac{A}{x-2}+\frac{B}{x+1}$

At $x=2$: $1/(3) = A/3 \Rightarrow A = 1/3$. Wait - $-2+3=1$ and at $x=2$: $A = 1/3$. At $x=-1$: $1+3 = B(-3) \Rightarrow B = -4/3$.

Step 3. Integrate:

$$\int\frac{x^3-4x+1}{x^2-x-2}\,dx = \int\left(x+1+\frac{1/3}{x-2}-\frac{4/3}{x+1}\right)\,dx$$ $$= \frac{x^2}{2}+x+\frac{1}{3}\ln|x-2|-\frac{4}{3}\ln|x+1|+C$$

L.4 The Dirichlet Integral

$$\int_0^\infty \frac{\sin x}{x}\,dx = \frac{\pi}{2}$$

This is a conditionally convergent improper integral (not absolutely convergent). One proof uses the Laplace transform: define $F(s) = \int_0^\infty e^{-sx}\frac{\sin x}{x}\,dx$ and differentiate w.r.t. $s$:

$$F'(s) = -\int_0^\infty e^{-sx}\sin x\,dx = -\frac{1}{1+s^2}$$

Integrating: $F(s) = -\arctan s + C$. As $s\to\infty$: $F(s)\to 0$, so $C = \pi/2$. At $s=0$: $F(0) = \pi/2$.

Appendix M: Monte Carlo Methods - Extended Analysis

M.1 Variance of the Monte Carlo Estimator

Let $X_1,\ldots,X_n \overset{i.i.d.}{\sim} \text{Uniform}(a,b)$. The estimator $\hat{I}_n = \frac{b-a}{n}\sum_{k=1}^n f(X_k)$.

$$\mathbb{E}[\hat{I}_n] = (b-a)\mathbb{E}[f(X)] = (b-a)\cdot\frac{1}{b-a}\int_a^b f(x)\,dx = \int_a^b f(x)\,dx \quad \text{(unbiased)}$$ $$\text{Var}[\hat{I}_n] = \frac{(b-a)^2}{n}\text{Var}[f(X)] = \frac{(b-a)^2}{n}\left[\frac{1}{b-a}\int_a^b f(x)^2\,dx - \left(\frac{\int_a^b f(x)\,dx}{b-a}\right)^2\right]$$

Standard error: $\text{SE}(\hat{I}_n) = \sqrt{\text{Var}[\hat{I}_n]} = O(1/\sqrt{n})$.

M.2 Importance Sampling

To estimate $I = \int f(x)\,dx$, sample $X_k \sim q(x)$ (a proposal distribution) and use:

$$\hat{I}^{\text{IS}}_n = \frac{1}{n}\sum_{k=1}^n \frac{f(X_k)}{q(X_k)}$$

Since $\int f(x)\,dx = \int \frac{f(x)}{q(x)}\cdot q(x)\,dx = \mathbb{E}_q\left[\frac{f(X)}{q(X)}\right]$, this is also unbiased.

Optimal $q$. $\text{Var}[\hat{I}^{\text{IS}}]$ is minimized when $q(x) \propto |f(x)|$. With this choice, $\text{Var} = 0$ if $f \geq 0$ everywhere (one-sample exact!).

In practice. Choose $q$ to concentrate samples where $|f(x)|$ is large - focusing effort on the important region. Used in: - IWAE (Importance Weighted Autoencoders) - tighter ELBO via importance sampling - Particle filters - sequential importance sampling for state estimation - MCMC - Metropolis-Hastings acceptance-rejection is importance sampling on steroids

M.3 Central Limit Theorem for Monte Carlo

$$\frac{\hat{I}_n - I}{\text{SE}(\hat{I}_n)} \xrightarrow{d} \mathcal{N}(0,1)$$

This gives a 95% confidence interval: $\hat{I}_n \pm 1.96\cdot\text{SE}(\hat{I}_n)$.

The $\text{SE}$ is estimated from the sample standard deviation:

$$\widehat{\text{SE}} = \frac{(b-a)\hat{\sigma}_f}{\sqrt{n}}, \qquad \hat{\sigma}_f^2 = \frac{1}{n-1}\sum_{k=1}^n\left(f(X_k) - \hat{\mu}_f\right)^2$$

M.4 Quasi-Monte Carlo - Discrepancy

The error of quasi-Monte Carlo is bounded by the Koksma-Hlawka inequality:

$$|\hat{I}_n - I| \leq D_n^*(\mathbf{x})\cdot V(f)$$

where $D_n^*$ is the star discrepancy of the point set and $V(f)$ is the total variation of $f$. Low-discrepancy sequences (Halton, Sobol) achieve $D_n^* = O((\log n)^d/n)$, giving error $O((\log n)^d/n)$ vs. $O(1/\sqrt{n})$ for random.

For $d=1$: QMC is always better than Monte Carlo (for smooth integrands). For large $d$, the $(\log n)^d$ factor can dominate.

Appendix N: The Fundamental Theorem - Historical and Conceptual Depth

N.1 Why the FTC Is Deep

Before the FTC, two problems seemed completely unrelated: 1. The tangent problem: find the slope of a curve at a point -> derivative 2. The area problem: find the area under a curve -> integral

Newton and Leibniz discovered they are inverse operations. This is not obvious. There is no reason, a priori, to expect that summing infinitesimally thin rectangles (integration) should be related to measuring instantaneous slope (differentiation).

The FTC says: accumulation is the reverse of rate. If you know how fast something is accumulating at every instant, you can find the total accumulation - just by finding an antiderivative. This is one of the most non-obvious and deep theorems in all of mathematics.

N.2 What Makes the FTC Work

The key ingredients: 1. Continuity of $f$: ensures $f$ can be approximated uniformly by step functions (Riemann integrability) 2. MVT for integrals: $\frac{1}{h}\int_x^{x+h}f(t)\,dt = f(c_h)$ for some $c_h \in (x, x+h)$ 3. Continuity of $f$ at $x$: ensures $f(c_h) \to f(x)$ as $h \to 0$

If $f$ has discontinuities, Part 1 may fail: $G'(x) = f(x)$ only holds at points of continuity of $f$. The Lebesgue integral extends the FTC to a much broader class of functions.

N.3 The FTC in Multiple Dimensions

The FTC generalizes to higher dimensions in several forms:

1D FTC	Higher-dimensional version
$\int_a^b f'(x)\,dx = f(b)-f(a)$	Green's theorem: $\oint_C \mathbf{F}\cdot d\mathbf{r} = \iint_D \text{curl}\,\mathbf{F}\,dA$
	Stokes' theorem: $\oint_{\partial S}\mathbf{F}\cdot d\mathbf{r} = \iint_S (\nabla\times\mathbf{F})\cdot d\mathbf{S}$
	Divergence theorem: $\oiiint_{\partial V}\mathbf{F}\cdot d\mathbf{S} = \iiint_V \nabla\cdot\mathbf{F}\,dV$

All are special cases of Stokes' theorem on manifolds: $\int_M d\omega = \int_{\partial M}\omega$.

For ML, the divergence theorem underlies the divergence of a vector field used in: - Flow matching (training vector fields for generative models) - Continuous normalizing flows with trace of the Jacobian - Optimal transport with the continuity equation

N.4 Antiderivatives That Cannot Be Expressed in Closed Form

Some elementary functions have no elementary antiderivative. Famous examples:

Integrand	"Antiderivative"	Notes
$e^{-x^2}$	$\frac{\sqrt{\pi}}{2}\text{erf}(x)$	Error function - not elementary
$\frac{\sin x}{x}$	$\text{Si}(x)$ (sine integral)	Not elementary
$\frac{e^x}{x}$	$\text{Ei}(x)$ (exponential integral)	Not elementary
$\frac{1}{\ln x}$	$\text{Li}(x)$ (logarithmic integral)	Counts primes! Not elementary
$\sqrt{1-k^2\sin^2 x}$	Elliptic integral	Not elementary

These "special functions" appear constantly in ML: - scipy.special.erf, scipy.special.erfinv - in GELU, quantile functions, CDFs - scipy.special.gammaln - in Dirichlet, Beta, Gamma distributions - scipy.special.bessel - in von Mises distributions (circular statistics in positional encoding)

Appendix O: Lebesgue vs. Riemann Integration

O.1 Why a Different Theory?

The Riemann integral is sufficient for continuous functions and most smooth ML applications. But the Lebesgue integral is the foundation of modern probability theory and is essential for rigorous statements about expected values, convergence theorems, and measure theory.

The key difference: Riemann integration partitions the domain (x-axis); Lebesgue integration partitions the range (y-axis).

RIEMANN vs. LEBESGUE INTEGRATION


  Riemann: slice domain into [x_i, x_{i+1}], sum f(x_i)*Deltax

     f(x)                                      



       x                  
        "partition the x-axis"                 


  Lebesgue: slice range into [y_i, y_{i+1}], multiply by measure
            of preimage {x : f(x) in [y_i, y_{i+1}]}

     y                                         
             <- how much x gives f(x) ~= 4?    
    4          
    3       
       x                  
        "partition the y-axis"

O.2 Key Theorem (Lebesgue vs. Riemann)

Theorem (Lebesgue, 1901): A bounded function on $[a, b]$ is Riemann integrable if and only if it is continuous almost everywhere (i.e., the set of discontinuities has Lebesgue measure zero).

For ML, this means: - ReLU is Riemann integrable (discontinuous only at 0, a set of measure zero) - Indicator functions $\mathbf{1}[x > 0]$ are Riemann integrable for the same reason - Pathological functions like the Dirichlet function ($\mathbf{1}_{\mathbb{Q}}$) are NOT Riemann integrable but ARE Lebesgue integrable

O.3 Convergence Theorems (The Real Advantage)

The Lebesgue integral enables three fundamental convergence theorems that Riemann lacks:

Monotone Convergence Theorem (MCT): If $0 \le f_1 \le f_2 \le \cdots$ and $f_n \to f$ pointwise, then:

$$\lim_{n\to\infty} \int f_n \, d\mu = \int \lim_{n\to\infty} f_n \, d\mu = \int f \, d\mu$$

Dominated Convergence Theorem (DCT): If $f_n \to f$ pointwise and $|f_n| \le g$ for an integrable $g$, then:

$$\lim_{n\to\infty} \int f_n \, d\mu = \int f \, d\mu$$

Fatou's Lemma: $\int \liminf_{n\to\infty} f_n \, d\mu \le \liminf_{n\to\infty} \int f_n \, d\mu$

Why ML cares: The DCT justifies differentiating under the integral sign - the key step in computing gradients of expected values:

$$\frac{\partial}{\partial\theta} \mathbb{E}_{p_\theta}[f(x)] = \frac{\partial}{\partial\theta} \int f(x)\,p_\theta(x)\,dx = \int f(x)\,\frac{\partial p_\theta}{\partial\theta}\,dx$$

This step is legal when $f \cdot |\partial p_\theta/\partial\theta|$ is dominated by an integrable function - which is why REINFORCE and the reparameterization trick both have regularity conditions.

O.4 Probability as Measure Theory

Modern probability uses the Lebesgue framework directly:

Measure theory	Probability
Measure space $(\Omega, \mathcal{F}, \mu)$	Probability space $(\Omega, \mathcal{F}, P)$
Measurable function $f: \Omega \to \mathbb{R}$	Random variable $X: \Omega \to \mathbb{R}$
$\int f \, d\mu$	$\mathbb{E}[X]$
$\mu(A) = \int_A 1 \, d\mu$	$P(A)$
Radon-Nikodym derivative $dP/dQ$	Likelihood ratio

The Radon-Nikodym theorem - which says that if $P \ll Q$ (P is absolutely continuous w.r.t. Q), there exists a measurable function $\frac{dP}{dQ}$ such that $P(A) = \int_A \frac{dP}{dQ} \, dQ$ - is the rigorous foundation for: - KL divergence: $D_{KL}(P \| Q) = \int \log\frac{dP}{dQ} \, dP$ - Change of variables in normalizing flows - Importance sampling weights

For day-to-day ML, Riemann integration suffices. But when reading probability theory papers or understanding why certain gradient estimators require regularity conditions, the Lebesgue framework is the right language.

Appendix P: Quick Reference Card

P.1 Standard Antiderivatives

$f(x)$	$\int f(x)\,dx$	Notes
$x^n$	$\frac{x^{n+1}}{n+1} + C$	$n \ne -1$
$x^{-1}$	$\ln\|x\| + C$
$e^x$	$e^x + C$
$a^x$	$\frac{a^x}{\ln a} + C$	$a > 0, a \ne 1$
$\ln x$	$x\ln x - x + C$	by parts
$\sin x$	$-\cos x + C$
$\cos x$	$\sin x + C$
$\tan x$	$-\ln\|\cos x\| + C$
$\sec^2 x$	$\tan x + C$
$\frac{1}{1+x^2}$	$\arctan x + C$
$\frac{1}{\sqrt{1-x^2}}$	$\arcsin x + C$
$\sinh x$	$\cosh x + C$
$\cosh x$	$\sinh x + C$
$\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$	$\Phi(x) + C$	CDF of $N(0,1)$

P.2 Key Definite Integrals

$$\int_{-\infty}^{\infty} e^{-x^2} \, dx = \sqrt{\pi} \qquad \int_0^\infty x^{n-1}e^{-x}\,dx = \Gamma(n) \qquad \int_0^1 x^{a-1}(1-x)^{b-1}\,dx = B(a,b)$$ $$\int_0^\infty \frac{\sin x}{x}\,dx = \frac{\pi}{2} \qquad \int_{-\pi}^{\pi}\sin(mx)\cos(nx)\,dx = 0 \quad \text{(orthogonality)}$$

P.3 Integration Strategies Flowchart

WHICH TECHNIQUE?


  Is there a composite function f(g(x))?
    YES -> try u-substitution: u = g(x)

  Is the integrand a product of two "different" types?
    YES -> try integration by parts (LIATE order)

  Is the integrand a rational function P(x)/Q(x)?
    YES -> try partial fractions (after polynomial division if deg P >= deg Q)

  Does the integrand contain sqrt(a^2-x^2), sqrt(a^2+x^2), or sqrt(x^2-a^2)?
    YES -> try trig substitution (x = a sintheta, a tantheta, a sectheta)

  Does the integrand contain e^x times polynomial or trig?
    YES -> try tabular integration by parts

  Nothing works? -> check integral tables / computer algebra system

P.4 Numerical Integration Comparison

Method	Error	Formula	When to use
Midpoint	$O(h^2)$	$h\sum f(x_i + h/2)$	Smooth functions, simple implementation
Trapezoid	$O(h^2)$	$h[\frac{f(a)+f(b)}{2} + \sum f(x_i)]$	Periodic functions (exponentially fast)
Simpson's	$O(h^4)$	$\frac{h}{3}[f_0 + 4f_1 + 2f_2 + 4f_3 + \cdots + f_n]$	Smooth functions, better accuracy
Gauss-Legendre	$O(h^{2n})$	Optimal nodes & weights	High accuracy, smooth integrands
Monte Carlo	$O(n^{-1/2})$	$\frac{1}{n}\sum f(x_i)$, $x_i \sim U[a,b]$	High dimensions, discontinuous $f$
Quasi-MC	$O((\log n)^d / n)$	Low-discrepancy sequences	High dimensions with structure

P.5 Key ML Formulas Involving Integration

$$\mathbb{E}_{x\sim p}[f(x)] = \int f(x)\,p(x)\,dx \approx \frac{1}{n}\sum_{i=1}^n f(x_i) \quad \text{(Monte Carlo mini-batch)}$$ $$D_{KL}(p\|q) = \int p(x)\ln\frac{p(x)}{q(x)}\,dx \ge 0 \quad \text{(Gibbs inequality)}$$ $$\mathcal{L}_{VAE} = \mathbb{E}_{q_\phi(z|x)}[\ln p_\theta(x|z)] - D_{KL}(q_\phi(z|x)\|p(z)) \quad \text{(ELBO)}$$ $$\nabla_\theta \mathbb{E}_{p_\theta}[f(x)] = \mathbb{E}_{p_\theta}[f(x)\,\nabla_\theta\ln p_\theta(x)] \quad \text{(REINFORCE / score gradient)}$$ $$\frac{d\mathbf{z}(t)}{dt} = f_\theta(\mathbf{z}(t), t) \implies \mathbf{z}(T) = \mathbf{z}(0) + \int_0^T f_\theta(\mathbf{z}(t),t)\,dt \quad \text{(Neural ODE)}$$

Appendix Q: Exercises - Worked Solutions (Continued)

Q.1 Exercise 5 - Numerical Integration Full Walkthrough

Problem: Compare trapezoid vs. Simpson's for $\int_0^1 e^{-x^2}\,dx$.

True value: Using the error function, $\int_0^1 e^{-x^2}\,dx = \frac{\sqrt\pi}{2}\,\text{erf}(1) \approx 0.746824$.

Trapezoid with $n=4$ ($h = 0.25$):

$x$

-values:

$f$

-values:

$$T_4 = 0.25\left[\frac{1 + 0.36788}{2} + 0.93941 + 0.77880 + 0.56978\right] = 0.25[0.68394 + 2.28799] = 0.74298$$

Error: $|0.74298 - 0.74682| = 0.00384$ - $O(h^2)$ as expected.

Simpson's with $n=4$ (need even $n$):

$$S_4 = \frac{0.25}{3}[f_0 + 4f_1 + 2f_2 + 4f_3 + f_4]$$ $$= \frac{0.25}{3}[1 + 4(0.93941) + 2(0.77880) + 4(0.56978) + 0.36788]$$ $$= \frac{0.25}{3}[1 + 3.75764 + 1.55760 + 2.27912 + 0.36788] = \frac{0.25}{3}(8.96224) = 0.74685$$

Error: $|0.74685 - 0.74682| = 0.00003$ - vastly smaller. Simpson's is $O(h^4)$.

Q.2 Exercise 6 - Expected Value and KL Divergence

Problem: Compute $D_{KL}(p \| q)$ where $p = N(\mu, 1)$ and $q = N(0, 1)$.

Solution:

$$D_{KL}(p\|q) = \int_{-\infty}^\infty p(x)\ln\frac{p(x)}{q(x)}\,dx$$

With $p(x) = \frac{1}{\sqrt{2\pi}}e^{-(x-\mu)^2/2}$ and $q(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$:

$$\ln\frac{p(x)}{q(x)} = -\frac{(x-\mu)^2}{2} + \frac{x^2}{2} = \frac{x^2 - (x-\mu)^2}{2} = \frac{2\mu x - \mu^2}{2} = \mu x - \frac{\mu^2}{2}$$ $$D_{KL}(p\|q) = \mathbb{E}_p\!\left[\mu x - \frac{\mu^2}{2}\right] = \mu\,\mathbb{E}_p[x] - \frac{\mu^2}{2} = \mu^2 - \frac{\mu^2}{2} = \frac{\mu^2}{2}$$

Interpretation: The KL divergence from $N(0,1)$ to $N(\mu,1)$ is exactly $\mu^2/2$. This is the term in the VAE ELBO that penalizes the encoder mean from drifting far from zero.

Q.3 Exercise 7 - Gradient of Expected Loss

Problem: Compute $\nabla_\theta \mathbb{E}_{x \sim p_\theta}[\ell(x)]$ via the log-derivative trick.

Solution: Using the score function / REINFORCE estimator:

$$\nabla_\theta \mathbb{E}_{p_\theta}[\ell(x)] = \nabla_\theta \int \ell(x)\,p_\theta(x)\,dx$$

Assuming we can differentiate under the integral (DCT applies):

$$= \int \ell(x)\,\nabla_\theta p_\theta(x)\,dx = \int \ell(x)\,p_\theta(x)\,\frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}\,dx = \mathbb{E}_{p_\theta}[\ell(x)\,\nabla_\theta\ln p_\theta(x)]$$

This is the REINFORCE gradient estimator. It requires only samples from $p_\theta$ and the ability to evaluate $\ln p_\theta$ - no reparameterization needed. The cost is high variance, motivating control variates (baselines) to reduce $\text{Var}[\ell(x)\nabla_\theta\ln p_\theta(x)]$.

Appendix R: Historical Timeline Extended

R.1 Chronological Development

Era	Contributor	Achievement
~250 BCE	Archimedes	Method of exhaustion for area of parabolic segment; $\pi$ bounds
~1640s	Cavalieri	Cavalieri's principle; "method of indivisibles"
1665-1666	Newton	Inverse tangent problem; "method of fluxions"; FTC discovered privately
1675-1684	Leibniz	Independent discovery; modern $\int$ notation; publication of FTC (1684)
1696	L'Hpital	First calculus textbook (based on Bernoulli's lectures)
1734	Bishop Berkeley	The Analyst - criticism of infinitesimals as "ghosts of departed quantities"
1748	Euler	Introductio in Analysin Infinitorum - systematic treatment of functions
1821-1823	Cauchy	Rigorous definition of limit; definite integral via Riemann-style sums
1854	Riemann	Rigorous Riemann integral; characterization of integrable functions
1875	Darboux	Upper/lower sums; cleaner formulation of Riemann integral
1894-1902	Lebesgue	Measure theory; Lebesgue integral; MCT and DCT
1900s	Hilbert	$L^2$ spaces; integration as inner product; functional analysis
1920s-1940s	Kolmogorov	Probability as measure theory; rigorous foundation for ML
1940s-1950s	Monte Carlo	Ulam, von Neumann, Metropolis - stochastic integration for physics
1986	Rumelhart et al.	Backpropagation as chain rule for integrals (Jacobians)
2018	Chen et al.	Neural ODEs - continuous-depth networks via ODE integration
2020s	Diffusion models	Score matching, DDPM, flow matching - integration at the core of generation

R.2 The Newton-Leibniz Priority Dispute

The FTC was discovered independently by Newton (1666, unpublished) and Leibniz (1675-1684, published). The Royal Society's official investigation (1712) wrongly accused Leibniz of plagiarism, damaging his reputation and creating a rift between British and Continental mathematicians. British mathematicians, loyal to Newton's notation, fell behind Continental Europe for ~100 years. The lesson: notation matters - Leibniz's $\int$ and $d/dx$ notation won out and is the notation used universally today.

R.3 Why Newton Called Integration "Quadrature"

Newton's original term for integration was "quadrature" - from the Latin for "making a square." The original problem was: given a curve, find a square with the same area. This geometric framing persisted for centuries. Leibniz's more algebraic approach (anti-differentiation) is what we use today, but the term "quadrature" survives in "Gaussian quadrature" - numerical integration using optimal node placement.

Appendix S: Advanced Topics Preview

This section previews topics that build directly on integration and appear in subsequent sections:

S.1 -> 04 Sequences and Series

Taylor series represents a function as an infinite sum:

$$f(x) = \sum_{n=0}^\infty \frac{f^{(n)}(a)}{n!}(x-a)^n$$

The remainder term in Taylor's theorem is an integral:

$$R_n(x) = \frac{1}{n!}\int_a^x (x-t)^n f^{(n+1)}(t)\,dt$$

Integration and series interact via term-by-term integration - valid when a series converges uniformly:

$$\int \sum_{n=0}^\infty a_n x^n \, dx = \sum_{n=0}^\infty \frac{a_n x^{n+1}}{n+1}$$

-> Full treatment: Sequences and Series

S.2 -> 05 Multivariable Calculus

Double and triple integrals extend the 1D theory:

$$\iint_D f(x,y)\,dA = \int_a^b \int_{g(x)}^{h(x)} f(x,y)\,dy\,dx \quad \text{(Fubini's theorem)}$$

The change of variables formula with Jacobian $|J|$:

$$\iint_D f(x,y)\,dA = \iint_{D'} f(x(u,v), y(u,v))\,|J|\,du\,dv$$

-> Full treatment: Multivariable Calculus

S.3 -> 06 Probability and Statistics

Continuous random variables live entirely in the integration framework. The CDF is an integral of the PDF; expectation is an integral; the central limit theorem involves convergence of distribution functions; characteristic functions are Fourier transforms - all integration.

-> Full treatment: Probability and Statistics

Appendix T: Self-Assessment Checklist

After completing this section, you should be able to:

Conceptual Understanding - [ ] Explain integration as accumulated change (area, probability mass, expected value) - [ ] State both parts of the FTC and explain why they are profound - [ ] Distinguish when a function is/is not Riemann integrable - [ ] Explain why $\int_a^\infty \frac{1}{x^p}\,dx$ converges iff $p > 1$

Computational Skills - [ ] Compute integrals using power rule, u-substitution, and integration by parts - [ ] Decompose rational functions into partial fractions - [ ] Evaluate improper integrals or determine their convergence - [ ] Implement the trapezoid rule, Simpson's rule, and basic Monte Carlo integration

Probability Connections - [ ] Verify that a given function is a valid PDF (non-negative, integrates to 1) - [ ] Compute $\mathbb{E}[X]$, $\text{Var}[X]$ as integrals for common distributions - [ ] Compute KL divergence between two Gaussian distributions - [ ] Derive the cross-entropy loss as negative log-likelihood

ML Applications - [ ] Explain Monte Carlo as an approximation to an integral - [ ] State the REINFORCE gradient estimator and its derivation via log-derivative trick - [ ] Explain the VAE ELBO as a sum of two integrals (reconstruction + KL) - [ ] Describe the neural ODE forward pass as numerical ODE integration - [ ] Sketch how score matching connects to integration in diffusion models

$x=1$$1 - 1 + C = 5 \Rightarrow C = 5$$x = A(x-1) + B$$x_k \in \{0, 0.25, 0.5, 0.75, 1\}$$\int_a^\infty |f(x)|\,dx < \infty$$[0,2]$$n=8$$h = 0.25$$x_k = kh$$+$$|x|<1$$-$$|x|>1$$F$$x=1$$x=-1$$0, 0.25, 0.5, 0.75, 1.0$$1,\ e^{-0.0625} \approx 0.93941,\ e^{-0.25} \approx 0.77880,\ e^{-0.5625} \approx 0.56978,\ e^{-1} \approx 0.36788$