Math for LLMs — Calculus Fundamentals / Series and Sequences

Notes

"The series 1+12+14+18+=21 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \cdots = 2 may seem paradoxical - an infinite number of terms summing to a finite value. But this is exactly the miracle that makes calculus possible."

  • Augustin-Louis Cauchy, on the foundations of analysis

Overview

Sequences and series are the mathematics of infinity made tractable. A sequence is an ordered list without end; a series is the sum of such a list. The central question - does an infinite process converge to something finite? - is one of the deepest and most practically consequential questions in mathematics.

For machine learning, this is not abstract. Every gradient descent trajectory is a sequence of parameter vectors. Every optimizer update rule is derived via Taylor series truncation. Every attention mechanism approximation exploits the Taylor expansion of the exponential function. Every positional encoding scheme (RoPE, ALiBi, sinusoidal) draws on Fourier series theory. The Laplace approximation - used in Bayesian deep learning - is a quadratic Taylor expansion of the log-posterior. Understanding why these approximations work, when they fail, and how accurate they are requires the theory in this section.

This section builds from sequences through convergence tests through power series to Taylor series, culminating in concrete ML applications with rigorous error analysis.

Prerequisites

  • Limits and continuity - 04/01: epsilon-delta definition, limit laws, continuity
  • Derivatives - 04/02: differentiation rules, higher-order derivatives
  • Integration - 04/03: definite integrals, improper integrals, p-integral test

Companion Notebooks

| Notebook | Description |
|---|---|
| theory.ipynb | Interactive: convergence visualization, partial sums, Taylor approximation error, ML applications |
| exercises.ipynb | 10 graded exercises from geometric series to linear attention approximations |

Learning Objectives

After completing this section, you will be able to:

  1. Define a sequence formally and determine convergence using the epsilon-N definition
  2. State and prove the Monotone Convergence Theorem for sequences
  3. Determine convergence of infinite series using all standard tests
  4. Distinguish absolute from conditional convergence and explain the rearrangement theorem
  5. Compute the radius and interval of convergence of a power series
  6. Differentiate and integrate power series term-by-term within the radius of convergence
  7. Derive Taylor series for $e^x$, $\sin x$, $\cos x$, $\ln(1+x)$, and $(1+x)^\alpha$
  8. Bound approximation error using the Lagrange remainder
  9. Apply series theory to ML: softmax temperature, linear attention, GELU, Adam, Laplace
  10. Recognize Fourier series as orthogonal function expansion and connect to positional encodings

1. Intuition

1.1 What Sequences and Series Represent

A sequence is a function from the natural numbers to the real numbers: an infinite, ordered list $a_1, a_2, a_3, \ldots$ The key question is always: where does this list go? Does it home in on a specific value, grow without bound, or oscillate forever?

A series adds up all the terms: $\sum_{n=1}^\infty a_n = a_1 + a_2 + a_3 + \cdots$ This is more subtle. An infinite sum might converge to a finite value - or it might grow forever. The sum $1 + \frac{1}{2} + \frac{1}{4} + \cdots = 2$ converges. The sum $1 + \frac{1}{2} + \frac{1}{3} + \cdots$ - the harmonic series - diverges to infinity, even though each term goes to zero. This counterintuitive behavior is why convergence tests exist.

SEQUENCES vs. SERIES - VISUAL INTUITION


  SEQUENCE (a_n): where do the terms go?

  a_n = 1/n:    1, 1/2, 1/3, 1/4, ...  ->  0         (converges)
  a_n = n:      1,   2,   3,   4, ...  ->  infinity  (diverges)
  a_n = (-1)^n: -1,  1,  -1,  1, ...   ->  ?         (oscillates, diverges)

  SERIES Sigma a_n: where does the cumulative sum go?

  Sigma (1/2)^n:  1/2 + 1/4 + 1/8 + ...  ->  1        (converges, geometric)
  Sigma 1/n:      1 + 1/2 + 1/3 + ...    ->  infinity (diverges! harmonic)
  Sigma 1/n^2:    1 + 1/4 + 1/9 + ...    ->  pi^2/6   (converges, Basel problem)

  Key insight: a_n -> 0 is NECESSARY but NOT SUFFICIENT for Sigma a_n to converge.


The fundamental distinction between sequences and series is that sequences ask about the terms themselves; series ask about the cumulative totals. A sequence can converge to zero while its series diverges - the harmonic series is the canonical example.

1.2 Historical Context

The history of series is a history of infinity being tamed:

| Era | Mathematician | Achievement |
|---|---|---|
| ~250 BCE | Archimedes | Geometric series sum $\sum r^n$ for quadrature of the parabola |
| 1350 | Nicole Oresme | First proof that the harmonic series diverges |
| 1600s | Newton | Binomial series $(1+x)^\alpha$; power series for $\sin$, $\cos$, $\ln$ |
| 1715 | Brook Taylor | Taylor series named after his 1715 publication, though known to Gregory by 1668 |
| 1735 | Euler | Basel problem: $\sum 1/n^2 = \pi^2/6$; $e^{i\pi} + 1 = 0$ |
| 1748 | Euler | Introductio: systematic treatment of infinite series |
| 1821 | Cauchy | Rigorous convergence theory; ratio test; uniform convergence concept |
| 1826 | Abel | Abel's theorem on power series at the boundary |
| 1837 | Dirichlet | Fourier series convergence conditions; conditional vs. absolute convergence |
| 1854 | Riemann | Rearrangement theorem: conditionally convergent series can be rearranged to any sum |

The key conceptual revolution was Cauchy's insistence on rigor. Before him, mathematicians manipulated series freely, often getting wrong answers. The definition of convergence as a limit of partial sums made the subject precise and allowed the rich theory we use today.

1.3 Why Series Are Central to AI

Series appear throughout modern machine learning - often hidden in plain sight:

| ML Context | Series Connection |
|---|---|
| Softmax function | $\text{softmax}(\mathbf{x}/T)_i = e^{x_i/T}/\sum_j e^{x_j/T}$ - the denominator is a sum over all classes; as $T \to 0$, only the max survives (argmax); as $T \to \infty$, the distribution becomes uniform (the Taylor expansion $e^x \approx 1+x$ dominates) |
| Linear attention | Approximates $\exp(qk^\top) \approx 1 + qk^\top$ (first-order Taylor) to reduce complexity from $O(n^2)$ to $O(n)$ |
| GELU activation | Defined via the error function, approximated by a Taylor-Hermite expansion |
| Adam optimizer | Bias correction $\hat{m}_t = m_t/(1-\beta^t)$ uses the geometric series $\sum \beta^k$; step size derivation uses Taylor expansion |
| Positional encodings | Sinusoidal (original Transformer), RoPE - both are Fourier series evaluated at discrete positions |
| Laplace approximation | Quadratic Taylor expansion of the log-posterior $\log p(\theta \mid \mathcal{D})$ around the MAP estimate |
| Neural ODEs | Euler method is first-order Taylor; higher-order Runge-Kutta methods use more Taylor terms |
| Residual networks | $\text{ResNet}(x) = x + F(x) \approx e^{F}(x)$ - the skip connections are the partial sums of a Taylor series for the exponential of the network function |

2. Sequences

2.1 Formal Definition

Definition: A sequence is a function $a: \mathbb{N} \to \mathbb{R}$. We write the sequence as $(a_n)_{n=1}^\infty$ or $\{a_n\}$, where $a_n = a(n)$ is the $n$-th term.

This formal view - sequence as function - is the right way to think about it. Everything we know about functions (domain, range, composition) applies to sequences.

Examples:

  • $a_n = \frac{1}{n}$: terms $1, \frac{1}{2}, \frac{1}{3}, \frac{1}{4}, \ldots$ - decreasing, converges to $0$
  • $a_n = (-1)^n$: terms $-1, 1, -1, 1, \ldots$ - oscillates, does not converge
  • $a_n = \left(1 + \frac{1}{n}\right)^n$: terms $2, 2.25, 2.370\ldots, 2.441\ldots$ - converges to $e$
  • $a_n = \frac{n!}{n^n}$: terms decrease rapidly to $0$ (Stirling: $n! \approx \sqrt{2\pi n}\,(n/e)^n$)
  • $a_n = r^n$ for $|r| < 1$: geometric decay, converges to $0$

Non-examples (not standard sequences):

  • $a_n = \frac{1}{n - 5}$: undefined at $n = 5$; need to restrict the domain to $n > 5$
  • $a_n = \sin(n)$: well-defined for all $n \in \mathbb{N}$, but its limit behavior is subtle (it does not converge)

2.2 Convergence: epsilon-N Definition

Definition: A sequence $(a_n)$ converges to the limit $L \in \mathbb{R}$ if for every $\epsilon > 0$ there exists $N \in \mathbb{N}$ such that for all $n > N$:

$$|a_n - L| < \epsilon$$

We write $\lim_{n\to\infty} a_n = L$ or $a_n \to L$. If no such $L$ exists, the sequence diverges.

Reading the definition: For any desired accuracy $\epsilon$ (no matter how small), there is a threshold $N$ beyond which every term is within $\epsilon$ of $L$. The threshold $N$ may depend on $\epsilon$ - tighter accuracy requires waiting longer.

Worked example: Prove $\lim_{n\to\infty} \frac{1}{n} = 0$.

Proof: Given $\epsilon > 0$, choose $N = \lceil 1/\epsilon \rceil$ (the smallest integer $\ge 1/\epsilon$). Then for all $n > N$:

$$\left|\frac{1}{n} - 0\right| = \frac{1}{n} < \frac{1}{N} \le \epsilon$$

Therefore $\frac{1}{n} \to 0$. $\square$

Uniqueness: If $(a_n)$ converges, its limit is unique. (Proof: if $a_n \to L$ and $a_n \to M$, then for any $\epsilon > 0$, $|L - M| \le |L - a_n| + |a_n - M| < \epsilon + \epsilon = 2\epsilon$ for large $n$; since $\epsilon$ is arbitrary, $|L-M| = 0$.)

Limit laws for sequences: If $a_n \to L$ and $b_n \to M$, then:

  • $a_n + b_n \to L + M$
  • $a_n b_n \to LM$
  • $a_n/b_n \to L/M$ provided $M \ne 0$ and $b_n \ne 0$ eventually
  • $f(a_n) \to f(L)$ if $f$ is continuous at $L$ (sequential characterization of continuity)

Squeeze theorem for sequences: If $a_n \le b_n \le c_n$ and $a_n \to L$ and $c_n \to L$, then $b_n \to L$.

2.3 Bounded and Monotone Sequences

Definition: A sequence $(a_n)$ is:

  • Bounded above if there exists $M$ such that $a_n \le M$ for all $n$
  • Bounded below if there exists $m$ such that $a_n \ge m$ for all $n$
  • Bounded if both: $|a_n| \le B$ for some $B$ and all $n$
  • Monotone increasing if $a_{n+1} \ge a_n$ for all $n$ (strictly increasing if $>$)
  • Monotone decreasing if $a_{n+1} \le a_n$ for all $n$

Theorem (Monotone Convergence Theorem - MCT for sequences): Every bounded monotone sequence converges.

More precisely: If $(a_n)$ is monotone increasing and bounded above, then $a_n \to \sup\{a_n : n \in \mathbb{N}\}$.

Proof sketch: Let $L = \sup\{a_n\}$, which exists by the completeness of $\mathbb{R}$ (every non-empty set bounded above has a supremum). Given $\epsilon > 0$, by definition of the supremum there exists $N$ with $a_N > L - \epsilon$. Since the sequence is increasing, $a_n \ge a_N > L - \epsilon$ for all $n \ge N$. Also $a_n \le L < L + \epsilon$. So $|a_n - L| < \epsilon$ for all $n \ge N$. $\square$

Key consequence: To prove convergence without knowing the limit, it suffices to show the sequence is monotone and bounded. This is widely used in analysis to establish existence of limits.

Example: $a_n = (1 + 1/n)^n$ is increasing and bounded above by $3$ (provable by AM-GM). Therefore it converges. Its limit is defined to be $e \approx 2.71828$.

For ML: Gradient descent sequences $(\theta_t)$ are not monotone in parameter space, but the loss sequence $(\mathcal{L}(\theta_t))$ is often bounded below (losses are non-negative) and may be monotone decreasing. The MCT guarantees loss convergence for many convex problems.

2.4 Subsequences and Bolzano-Weierstrass

Definition: A subsequence of $(a_n)$ is a sequence of the form $(a_{n_k})_{k=1}^\infty$ where $n_1 < n_2 < n_3 < \cdots$ is a strictly increasing sequence of indices.

Theorem: If $a_n \to L$, then every subsequence also converges to $L$.

Useful consequence (contrapositive): If a sequence has two subsequences converging to different limits, the original sequence diverges. This proves $(-1)^n$ diverges: the subsequences $(-1)^{2k} = 1$ and $(-1)^{2k+1} = -1$ converge to $1$ and $-1$ respectively.

Theorem (Bolzano-Weierstrass): Every bounded sequence has a convergent subsequence.

Proof idea (bisection): Let $(a_n)$ be bounded in $[m, M]$. Bisect the interval into $[m, \frac{m+M}{2}]$ and $[\frac{m+M}{2}, M]$. At least one half contains infinitely many terms - pick that half and pick any term in it as $a_{n_1}$. Repeat: bisect the chosen half, pick the half with infinitely many terms, pick $a_{n_2}$ from it (with $n_2 > n_1$). The resulting intervals have lengths $\frac{M-m}{2^k} \to 0$, so the chosen terms form a Cauchy sequence and hence converge. $\square$

For ML: Bolzano-Weierstrass underlies the proof that compact sets in $\mathbb{R}^n$ always contain convergent subsequences - which is why SGD on compact parameter spaces always has convergent subsequences (though full convergence requires additional conditions).

2.5 Important Sequences in ML

The following sequences appear constantly in machine learning. Understanding their convergence is practically important:

ML SEQUENCES - CONVERGENCE ANALYSIS


  Learning rate schedules:

  Step decay:    alpha_t = alpha_0 * gamma^(t/s)           geometric, -> 0
  Cosine:        alpha_t = alpha_0/2 * (1 + cos(pi*t/T))   bounded, periodic
  Warmup:        alpha_t = t*alpha_0/T for t <= T          linear ramp, then decay
  Polynomial:    alpha_t = alpha_0/(1 + beta*t)^p          -> 0 if p > 0

  Exponential moving average (EMA / Adam momentum):

  m_t = beta*m_{t-1} + (1-beta)*g_t
      = (1-beta) Sigma_{k=0}^{t-1} beta^k * g_{t-k}        geometric series weights!
  As t -> infinity: weights sum to 1 (geometric series identity)
  Bias-corrected: m_hat_t = m_t / (1 - beta^t)             corrects for zero init

  Convergence of gradient norms in Adam:

  Required for convergence: Sigma alpha_t = infinity   (enough total movement)
  And:                      Sigma alpha_t^2 < infinity (steps shrink fast enough)
  This is the Robbins-Monro condition from stochastic approximation.


Exponential moving average as geometric series: The Adam first moment $m_t = \beta m_{t-1} + (1-\beta)g_t$ unfolds to:

$$m_t = (1-\beta)\sum_{k=0}^{t-1} \beta^k g_{t-k} + \beta^t m_0$$

For $\beta = 0.9$ and $m_0 = 0$: the weights $(1-\beta)\beta^k$ form a geometric series summing to $1 - \beta^t$. The bias correction $\hat{m}_t = m_t/(1-\beta^t)$ normalizes this to 1, giving a proper weighted average of past gradients. As $t \to \infty$, $\beta^t \to 0$ and the correction becomes negligible.
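A minimal numerical sketch of this identity (plain Python with NumPy; the synthetic gradient stream and constants are illustrative assumptions):

```python
import numpy as np

beta = 0.9
rng = np.random.default_rng(0)
g = rng.normal(loc=1.0, scale=0.1, size=50)   # noisy gradients with true mean ~1.0

m = 0.0
for t in range(1, 51):
    m = beta * m + (1 - beta) * g[t - 1]      # EMA recurrence
    m_hat = m / (1 - beta**t)                 # geometric-series bias correction
    if t in (1, 5, 50):
        # the EMA weights (1-beta)*beta^k over the last t gradients sum to 1 - beta^t
        print(f"t={t:2d}  m={m:.4f}  m_hat={m_hat:.4f}  weight_sum={1 - beta**t:.4f}")
```

At $t = 1$ the raw moment is only about $0.1$ (10% of the true mean), while the corrected value is already near $1.0$ - exactly the early-training bias the correction removes.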


3. Infinite Series

3.1 Partial Sums and Convergence

Definition: Given a sequence $(a_n)_{n=1}^\infty$, the $N$-th partial sum is:

$$S_N = \sum_{n=1}^N a_n = a_1 + a_2 + \cdots + a_N$$

The infinite series $\sum_{n=1}^\infty a_n$ converges to $S$ if the sequence of partial sums $(S_N)$ converges to $S$:

$$\sum_{n=1}^\infty a_n = S \iff \lim_{N\to\infty} S_N = S$$

If $(S_N)$ diverges, the series diverges.

Properties (inherited from sequence limit laws):

  • Linearity: $\sum (ca_n + db_n) = c\sum a_n + d\sum b_n$ (if both converge)
  • Tail behavior: $\sum_{n=1}^\infty a_n$ converges iff $\sum_{n=k}^\infty a_n$ converges for any fixed $k$ (convergence is about the tail, not finitely many initial terms)
  • Term rearrangement: For absolutely convergent series, any rearrangement converges to the same sum. For conditionally convergent series, rearrangement can change or destroy convergence (Riemann's theorem).

3.2 Geometric Series

The geometric series is the most important series in mathematics:

$$\sum_{n=0}^\infty r^n = 1 + r + r^2 + r^3 + \cdots$$

Theorem: The geometric series converges if and only if $|r| < 1$, and:

$$\sum_{n=0}^\infty r^n = \frac{1}{1-r}$$

Proof: The $N$-th partial sum satisfies $S_N = 1 + r + \cdots + r^N$. Multiply by $r$: $rS_N = r + r^2 + \cdots + r^{N+1}$. Subtract: $S_N - rS_N = 1 - r^{N+1}$, so $S_N = \frac{1-r^{N+1}}{1-r}$. As $N \to \infty$: if $|r| < 1$ then $r^{N+1} \to 0$, so $S_N \to \frac{1}{1-r}$. If $|r| \ge 1$, $r^{N+1}$ does not go to zero, so $S_N$ diverges. $\square$

Shifted and scaled versions:

$$\sum_{n=0}^\infty ar^n = \frac{a}{1-r}, \qquad \sum_{n=k}^\infty r^n = \frac{r^k}{1-r}, \qquad \sum_{n=1}^\infty nr^{n-1} = \frac{1}{(1-r)^2}$$

The last identity comes from differentiating $\sum r^n = \frac{1}{1-r}$ term-by-term.

ML applications of geometric series:

  1. Discounted future rewards (RL): $G_t = \sum_{k=0}^\infty \gamma^k r_{t+k} = r_t + \gamma G_{t+1}$. The total return is a geometric series with ratio $\gamma \in [0,1)$; it is bounded by $r_{\text{max}}/(1-\gamma)$ when rewards are bounded.

  2. Adam bias correction: $m_t = (1-\beta)\sum_{k=0}^{t-1}\beta^k g_{t-k}$. Sum of weights: $(1-\beta)\sum_{k=0}^{t-1}\beta^k = 1 - \beta^t$. Correction factor: $1/(1-\beta^t)$.

  3. Infinite-horizon value function: $V^\pi(s) = \mathbb{E}[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s]$, which is the foundation of the Bellman equation.

  4. Softmax denominator: for identical logits $x_j = c$ for all $j$, $\text{softmax}(c)_i = e^c/\sum_j e^c = 1/n$ - the denominator is a finite sum of equal terms, the degenerate finite geometric sum.
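A quick check of item 1 under the stated assumptions (constant reward $r = 1$, $\gamma = 0.95$; plain Python, values illustrative): the partial sums approach the closed form $r/(1-\gamma) = 20$.

```python
# Partial sums of the discounted return G_0 = sum_t gamma^t * r with r = 1,
# converging to the closed form r / (1 - gamma).
gamma, r = 0.95, 1.0
closed_form = r / (1 - gamma)                 # = 20.0
partial = 0.0
for t in range(200):
    partial += (gamma**t) * r
    if t in (9, 49, 199):
        print(f"T={t+1:3d}  partial sum={partial:8.4f}  closed form={closed_form}")
```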

3.3 The Harmonic Series Diverges

The harmonic series $\sum_{n=1}^\infty \frac{1}{n} = 1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \cdots$ diverges despite $\frac{1}{n} \to 0$.

Proof (Oresme's grouping argument, ~1350): Group terms in blocks of doubling size:

$$1 + \frac{1}{2} + \underbrace{\frac{1}{3} + \frac{1}{4}}_{\ge 2 \cdot \frac{1}{4} = \frac{1}{2}} + \underbrace{\frac{1}{5} + \frac{1}{6} + \frac{1}{7} + \frac{1}{8}}_{\ge 4 \cdot \frac{1}{8} = \frac{1}{2}} + \underbrace{\frac{1}{9} + \cdots + \frac{1}{16}}_{\ge 8 \cdot \frac{1}{16} = \frac{1}{2}} + \cdots$$

Each block contributes at least $\frac{1}{2}$. Adding infinitely many blocks of $\frac{1}{2}$ gives infinity. $\square$

Why this matters: The harmonic series illustrates that $a_n \to 0$ alone does not guarantee convergence. Every convergence test addresses this problem: it must establish decay fast enough to beat the harmonic benchmark.

The harmonic series in ML: The divergence of $\sum 1/n$ is why the learning rate schedule $\alpha_t = \alpha_0/t$ satisfies the Robbins-Monro condition $\sum \alpha_t = \infty$ - it decreases slowly enough to ensure the optimizer explores sufficiently. The companion condition $\sum \alpha_t^2 < \infty$ holds since $\sum 1/t^2 < \infty$ (p-series with $p = 2 > 1$).
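A small sketch contrasting the two behaviors numerically (plain Python; the cutoffs $N$ are arbitrary): partial sums of $1/n$ track $\ln N$, while partial sums of $1/n^2$ settle near $\pi^2/6$.

```python
import math

# Harmonic partial sums grow like ln(N) + 0.577... (divergence);
# sum 1/n^2 approaches pi^2/6 ~ 1.644934 (convergence).
for N in (10, 1_000, 100_000):
    harmonic = sum(1.0 / n for n in range(1, N + 1))
    basel = sum(1.0 / n**2 for n in range(1, N + 1))
    print(f"N={N:6d}  sum 1/n = {harmonic:8.4f} (ln N = {math.log(N):7.4f})"
          f"  sum 1/n^2 = {basel:.6f} (pi^2/6 = {math.pi**2 / 6:.6f})")
```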

3.4 p-Series

The p-series is $\sum_{n=1}^\infty \frac{1}{n^p}$ for $p > 0$.

Theorem: The p-series converges if and only if $p > 1$.

| $p$ | Series | Sum | Converges? |
|---|---|---|---|
| $p = 1$ | Harmonic | $\infty$ | No |
| $p = 2$ | $\sum 1/n^2$ | $\pi^2/6$ (Basel) | Yes |
| $p = 3$ | $\sum 1/n^3$ | $\approx 1.202$ (Apéry's constant) | Yes |
| $p = 1/2$ | $\sum 1/\sqrt{n}$ | $\infty$ | No |
| $p = 0.99$ | $\sum 1/n^{0.99}$ | $\infty$ | No (barely!) |

Proof: By the integral test (proven in 4.2): $\sum_{n=1}^\infty \frac{1}{n^p}$ converges iff $\int_1^\infty \frac{1}{x^p}\,dx$ converges. This integral equals $\frac{x^{1-p}}{1-p}\Big|_1^\infty$ for $p \ne 1$. For $p > 1$: $x^{1-p} \to 0$ as $x \to \infty$, so the integral equals $\frac{1}{p-1}$ (converges). For $p < 1$: $x^{1-p} \to \infty$, so the integral diverges. For $p = 1$: $\int_1^\infty \frac{1}{x}\,dx = \ln x\big|_1^\infty = \infty$. $\square$

3.5 Telescoping Series

A telescoping series has a form that allows consecutive cancellation:

$$\sum_{n=1}^N (b_n - b_{n+1}) = b_1 - b_{N+1} \to b_1 - \lim_{n\to\infty} b_n$$

Example: $\sum_{n=1}^\infty \frac{1}{n(n+1)} = \sum_{n=1}^\infty \left(\frac{1}{n} - \frac{1}{n+1}\right) = 1 - 0 = 1$

Proof via partial fractions: $\frac{1}{n(n+1)} = \frac{1}{n} - \frac{1}{n+1}$. Then $S_N = (1 - \frac{1}{2}) + (\frac{1}{2} - \frac{1}{3}) + \cdots + (\frac{1}{N} - \frac{1}{N+1}) = 1 - \frac{1}{N+1} \to 1$.

Another example: $\sum_{n=1}^\infty \frac{1}{\sqrt{n+1} + \sqrt{n}} = \sum_{n=1}^\infty (\sqrt{n+1} - \sqrt{n})$ diverges (it telescopes to $\sqrt{N+1} - 1 \to \infty$).


4. Convergence Tests

The fundamental challenge: given a series $\sum a_n$, does it converge? The convergence tests below address different structural properties of $(a_n)$.

CONVERGENCE TEST SELECTION GUIDE


  a_n -> 0?  ->  NO  ->  DIVERGES (divergence test)

  Is a_n = f(n) with f decreasing, positive?
    -> Use INTEGRAL TEST: compare to the integral of f(x) dx

  Is a_n comparable to a known series b_n?
    -> a_n <= C*b_n and Sigma b_n converges -> CONVERGES
    -> a_n >= C*b_n and Sigma b_n diverges  -> DIVERGES (comparison)
    -> lim a_n/b_n = L in (0, infinity) -> same behavior (limit comparison)

  Does a_n involve factorials or exponentials?
    -> Try RATIO TEST: L = lim |a_{n+1}/a_n|
       L < 1 -> converges; L > 1 -> diverges; L = 1 -> inconclusive

  Does a_n involve n-th powers?
    -> Try ROOT TEST: L = lim |a_n|^{1/n}
       L < 1 -> converges; L > 1 -> diverges; L = 1 -> inconclusive

  Does the series alternate signs?
    -> Try ALTERNATING SERIES TEST: terms decreasing in magnitude, -> 0 -> converges


4.1 Divergence Test

Theorem (Divergence Test): If $\sum_{n=1}^\infty a_n$ converges, then $a_n \to 0$.

Equivalently (contrapositive): If $a_n \not\to 0$, then $\sum a_n$ diverges.

Proof: If $\sum a_n = S$, then $a_n = S_n - S_{n-1} \to S - S = 0$. $\square$

Important: The converse is FALSE. $a_n \to 0$ does NOT imply $\sum a_n$ converges. The harmonic series is the canonical counterexample.

Examples:

  • $\sum \frac{n}{n+1}$: since $\frac{n}{n+1} \to 1 \ne 0$, diverges immediately.
  • $\sum (-1)^n$: since $(-1)^n$ does not converge to $0$, diverges.
  • $\sum \frac{1}{n}$: $\frac{1}{n} \to 0$, so the test is inconclusive. Must use another test.

4.2 Integral Test

Theorem (Integral Test): Suppose $f: [1, \infty) \to \mathbb{R}$ is continuous, positive, and decreasing, with $a_n = f(n)$. Then:

$$\sum_{n=1}^\infty a_n \text{ converges} \iff \int_1^\infty f(x)\,dx \text{ converges}$$

Proof sketch: Since $f$ is decreasing, $f(n) \ge f(x) \ge f(n+1)$ for $x \in [n, n+1]$, so:

$$f(n) = \int_n^{n+1} f(n)\,dx \ge \int_n^{n+1} f(x)\,dx \ge \int_n^{n+1} f(n+1)\,dx = f(n+1)$$

Summing: $\sum_{n=1}^N f(n) \ge \int_1^{N+1} f(x)\,dx \ge \sum_{n=2}^{N+1} f(n)$. As $N \to \infty$, the partial sums and the integral bound each other, so both converge or both diverge. $\square$

Applications:

  • p-series: $\int_1^\infty x^{-p}\,dx$ converges iff $p > 1$ -> the p-series converges iff $p > 1$
  • $\sum \frac{1}{n\ln n}$: $\int_2^\infty \frac{1}{x\ln x}\,dx = \ln(\ln x)\big|_2^\infty = \infty$ -> diverges
  • $\sum \frac{1}{n(\ln n)^2}$: $\int_2^\infty \frac{1}{x(\ln x)^2}\,dx = \frac{-1}{\ln x}\Big|_2^\infty = \frac{1}{\ln 2}$ -> converges

Error bound from integral test: If the series converges,

$$\int_{N+1}^\infty f(x)\,dx \le \sum_{n=N+1}^\infty a_n \le \int_N^\infty f(x)\,dx$$

This gives a computable bound on the tail error after $N$ terms.
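A sketch of the tail bound for $\sum 1/n^2$ with $N = 100$ (plain Python; the brute-force cutoff is an arbitrary illustration): since $\int_N^\infty x^{-2}\,dx = 1/N$, the tail is squeezed between $1/(N+1)$ and $1/N$.

```python
# Integral-test bounds on the tail sum_{n > N} 1/n^2.
N = 100
tail = sum(1.0 / n**2 for n in range(N + 1, 2_000_000))  # brute-force approximation
lower, upper = 1.0 / (N + 1), 1.0 / N                    # integral bounds
print(f"lower={lower:.6f}  tail~{tail:.6f}  upper={upper:.6f}")
# lower=0.009901  tail~0.009950  upper=0.010000
```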

4.3 Comparison and Limit Comparison Tests

Theorem (Direct Comparison): Suppose $0 \le a_n \le b_n$ for all sufficiently large $n$.

  • If $\sum b_n$ converges, then $\sum a_n$ converges.
  • If $\sum a_n$ diverges, then $\sum b_n$ diverges.

Proof: For the first part: the partial sums of $\sum a_n$ are bounded above by the (convergent) sum $\sum b_n$, and they are increasing, so they converge by the MCT. $\square$

Theorem (Limit Comparison): Suppose $a_n, b_n > 0$ and $\lim_{n\to\infty} \frac{a_n}{b_n} = L$ with $0 < L < \infty$. Then $\sum a_n$ and $\sum b_n$ either both converge or both diverge.

Why: For large $n$, $a_n \approx Lb_n$, so $a_n$ and $b_n$ have the same order of magnitude.

Examples:

  • $\sum \frac{1}{n^2 + 1}$: compare to $\sum \frac{1}{n^2}$ (p-series, converges). Limit comparison: $\frac{1/(n^2+1)}{1/n^2} = \frac{n^2}{n^2+1} \to 1 \in (0,\infty)$. Conclusion: converges.
  • $\sum \frac{3n^2 - 1}{n^4 + 2n}$: the dominant behavior is $\frac{3}{n^2}$. Limit comparison with $1/n^2$: the ratio $\to 3$. Converges.

4.4 Ratio Test

Theorem (Ratio Test, d'Alembert): Let $L = \lim_{n\to\infty} \left|\frac{a_{n+1}}{a_n}\right|$.

  • If $L < 1$: $\sum a_n$ converges absolutely
  • If $L > 1$ (or $L = \infty$): $\sum a_n$ diverges
  • If $L = 1$: inconclusive

Proof (convergence case): If $L < 1$, pick $r$ with $L < r < 1$. Then $|a_{n+1}| \le r|a_n|$ for all large $n$. By induction, $|a_n| \le C r^n$ for some constant $C$. So $\sum |a_n| \le C \sum r^n < \infty$ (geometric series). $\square$

Best for: Factorials ($n!$), exponentials ($c^n$), or products involving $n$.

Examples:

  • $\sum \frac{n!}{n^n}$: ratio $= \frac{(n+1)!/(n+1)^{n+1}}{n!/n^n} = \frac{(n+1)\,n^n}{(n+1)^{n+1}} = \frac{n^n}{(n+1)^n} = \left(\frac{n}{n+1}\right)^n \to e^{-1} < 1$ -> converges
  • $\sum \frac{2^n}{n!}$: ratio $= \frac{2^{n+1}/(n+1)!}{2^n/n!} = \frac{2}{n+1} \to 0 < 1$ -> converges

4.5 Root Test

Theorem (Root Test, Cauchy): Let $L = \lim_{n\to\infty} |a_n|^{1/n}$.

  • If $L < 1$: $\sum a_n$ converges absolutely
  • If $L > 1$: diverges
  • If $L = 1$: inconclusive

Proof: Same structure as the ratio test - if $L < 1$, eventually $|a_n|^{1/n} \le r < 1$, so $|a_n| \le r^n$, a convergent geometric series. $\square$

Best for: Terms of the form $f(n)^n$.

Example: $\sum \left(\frac{n}{2n+1}\right)^n$: the root test gives $L = \lim \frac{n}{2n+1} = \frac{1}{2} < 1$ -> converges.

Relationship to ratio test: If the ratio test limit exists, the root test limit also exists and equals the same value. The root test is stronger (applicable when the ratio test fails), but often harder to compute.

4.6 Alternating Series Test

Definition: An alternating series has the form $\sum (-1)^{n+1} b_n = b_1 - b_2 + b_3 - b_4 + \cdots$ where $b_n > 0$.

Theorem (Alternating Series Test / Leibniz criterion): If:

  1. $b_n > 0$ for all $n$
  2. $b_{n+1} \le b_n$ (decreasing)
  3. $b_n \to 0$

Then $\sum (-1)^{n+1} b_n$ converges.

Proof idea: The even and odd partial sums are monotone and bounded, converging to a common limit. $\square$

Error bound: The error in approximating the sum by $S_N$ satisfies $|S - S_N| \le b_{N+1}$.

Key example: The alternating harmonic series $\sum_{n=1}^\infty \frac{(-1)^{n+1}}{n} = 1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \cdots = \ln 2$.

This is conditionally convergent - it converges, but $\sum |(-1)^{n+1}/n| = \sum 1/n = \infty$ (the harmonic series diverges).

4.7 Absolute vs. Conditional Convergence

Definition: A series $\sum a_n$ is:

  • Absolutely convergent if $\sum |a_n|$ converges
  • Conditionally convergent if $\sum a_n$ converges but $\sum |a_n|$ diverges

Theorem: Absolute convergence implies convergence.

Proof: $0 \le a_n + |a_n| \le 2|a_n|$, so $\sum (a_n + |a_n|)$ converges by comparison. Then $\sum a_n = \sum(a_n + |a_n|) - \sum|a_n|$ converges as a difference of convergent series. $\square$

Riemann Rearrangement Theorem: If $\sum a_n$ is conditionally convergent, then for any $L \in \mathbb{R}$ (or $\pm\infty$), there exists a rearrangement of the terms that converges to $L$.

This shocking result means: the value of a conditionally convergent series depends on the order of summation! This is not a curiosity - it has direct implications for:

  • Stochastic gradient descent: the order in which mini-batches are processed affects the optimization path (though not the limit for convergent algorithms)
  • Floating-point summation: the order of floating-point additions affects the accumulated rounding error

For ML: Practically, all series in ML are absolutely convergent (finite sums, or convergent geometric/p-series). But the theory of conditional convergence explains why naive summation of large arrays can lose precision.
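A minimal sketch of the floating-point point above (NumPy, float32; the array size is illustrative): accumulating $1/n^2$ with the large terms first loses more precision than accumulating the small terms first.

```python
import numpy as np

# Order of floating-point addition changes the result: once the accumulator
# is large, tiny terms fall below float32 resolution and are dropped.
n = np.arange(1, 1_000_001, dtype=np.float64)
terms = (1.0 / n**2).astype(np.float32)

acc_big_first = np.float32(0.0)
for t in terms:               # n = 1, 2, 3, ...: largest terms first
    acc_big_first += t
acc_small_first = np.float32(0.0)
for t in terms[::-1]:         # smallest terms first
    acc_small_first += t

exact = float(np.sum(terms.astype(np.float64)))
print(f"big-first={acc_big_first:.7f}  small-first={acc_small_first:.7f}  float64={exact:.7f}")
```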


5. Power Series

5.1 Definition and Convergence Interval

Definition: A power series centered at $a$ is a series of the form:

$$\sum_{n=0}^\infty c_n (x-a)^n = c_0 + c_1(x-a) + c_2(x-a)^2 + \cdots$$

where the $c_n$ are the coefficients and $a$ is the center. When $a = 0$ we get a Maclaurin (or standard) power series $\sum c_n x^n$.

A power series is fundamentally different from the series we've studied: it is a function of $x$, not just a number. For each value of $x$, we get a numerical series that may or may not converge.

Convergence trichotomy (cf. Abel's theorem for endpoint behavior): For any power series $\sum c_n(x-a)^n$, exactly one of three things holds:

  1. The series converges only at $x = a$
  2. The series converges for all $x \in \mathbb{R}$
  3. There exists $R > 0$ (the radius of convergence) such that the series converges absolutely for $|x - a| < R$ and diverges for $|x - a| > R$

The interval of convergence is $(a-R, a+R)$, possibly together with one or both endpoints.

5.2 Radius of Convergence

Theorem (Radius of Convergence via Ratio Test): If $\lim_{n\to\infty} \left|\frac{c_{n+1}}{c_n}\right| = L$, then $R = \frac{1}{L}$ (with $R = \infty$ when $L = 0$).

Derivation: Apply the ratio test to $\sum c_n(x-a)^n$:

$$\left|\frac{c_{n+1}(x-a)^{n+1}}{c_n(x-a)^n}\right| = |x-a| \cdot \left|\frac{c_{n+1}}{c_n}\right| \to |x-a| \cdot L$$

The ratio test gives convergence when $|x-a| \cdot L < 1$, i.e., $|x-a| < 1/L = R$. $\square$

Hadamard formula (root test version):

$$R = \frac{1}{\limsup_{n\to\infty} |c_n|^{1/n}}$$

This formula always works, even when the ratio test limit doesn't exist.

Key cases:

  • $R = 0$: converges only at the center (e.g., $\sum n! x^n$)
  • $R = \infty$: converges for all $x$ (e.g., $\sum x^n/n!$, the exponential series)
  • $R = 1$: converges in $(-1, 1)$, possibly with endpoints (e.g., $\sum x^n$, the geometric series)

Examples:

$\sum_{n=0}^\infty \frac{x^n}{n!}$: ratio $= \frac{|x|^{n+1}/(n+1)!}{|x|^n/n!} = \frac{|x|}{n+1} \to 0$ for all $x$. So $R = \infty$.

$\sum_{n=0}^\infty n! \cdot x^n$: ratio $= (n+1)|x| \to \infty$ unless $x = 0$. So $R = 0$.

$\sum_{n=1}^\infty \frac{x^n}{n}$: ratio $\to |x|$, so $R = 1$. Interval: $(-1, 1)$ - check endpoints separately.

5.3 Behavior at Endpoints

The ratio test is inconclusive at $|x - a| = R$ (the endpoints). Endpoints must be checked individually using series convergence tests.

Example: $\sum_{n=1}^\infty \frac{x^n}{n}$ has $R = 1$.

  • At $x = 1$: $\sum \frac{1}{n}$ - harmonic series, diverges
  • At $x = -1$: $\sum \frac{(-1)^n}{n}$ - alternating harmonic series, converges (to $-\ln 2$)

So the interval of convergence is $[-1, 1)$.

Example: $\sum_{n=1}^\infty \frac{x^n}{n^2}$ has $R = 1$.

  • At $x = 1$: $\sum \frac{1}{n^2}$ - p-series with $p=2$, converges
  • At $x = -1$: $\sum \frac{(-1)^n}{n^2}$ - absolutely convergent (compare to $\sum 1/n^2$), converges

Interval of convergence: $[-1, 1]$ (closed).

5.4 Term-by-Term Differentiation and Integration

Theorem: If $f(x) = \sum_{n=0}^\infty c_n(x-a)^n$ has radius of convergence $R > 0$, then:

  1. Term-by-term differentiation: $f'(x) = \sum_{n=1}^\infty nc_n(x-a)^{n-1}$, with radius of convergence $R$

  2. Term-by-term integration: $\int_a^x f(t)\,dt = \sum_{n=0}^\infty \frac{c_n}{n+1}(x-a)^{n+1}$, with radius of convergence $R$

Both operations preserve the radius of convergence (though endpoint behavior may change).

Why this works: Inside the interval of convergence, power series converge uniformly on compact subsets (see Appendix B), which allows the interchange of limit and derivative/integral. This is the key technical condition.

Application - deriving new series from known ones:

From $\frac{1}{1-x} = \sum_{n=0}^\infty x^n$ (geometric series, $|x| < 1$):

Differentiate: $\frac{1}{(1-x)^2} = \sum_{n=1}^\infty nx^{n-1} = \sum_{n=0}^\infty (n+1)x^n$

Integrate from $0$ to $x$: $-\ln(1-x) = \sum_{n=1}^\infty \frac{x^n}{n}$, so $\ln(1-x) = -\sum_{n=1}^\infty \frac{x^n}{n}$

Substitute $x \to -x$: $\ln(1+x) = \sum_{n=1}^\infty \frac{(-1)^{n+1}x^n}{n} = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots$

This is the Taylor series for $\ln(1+x)$, derived purely from the geometric series plus integration.
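A short sketch of this series in code (plain Python; the truncation length is arbitrary): inside the radius the truncated series matches math.log1p, while at $x = 2 > R$ the terms blow up and the sum is garbage.

```python
import math

def log1p_series(x, terms=100):
    """Truncated Maclaurin series for ln(1+x): sum of (-1)^(n+1) x^n / n."""
    return sum((-1) ** (n + 1) * x**n / n for n in range(1, terms + 1))

for x in (0.1, 0.9, 2.0):   # R = 1: the last point is outside the radius
    print(f"x={x}: series={log1p_series(x):+.6f}   math.log1p={math.log1p(x):+.6f}")
```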

5.5 Uniqueness of Power Series Representations

Theorem: If $f(x) = \sum_{n=0}^\infty c_n(x-a)^n$ for all $x$ in some interval around $a$, then the coefficients are uniquely determined:

$$c_n = \frac{f^{(n)}(a)}{n!}$$

Proof: Set $x = a$: $f(a) = c_0$. Differentiate: $f'(a) = c_1$. Differentiate again: $f''(a) = 2c_2$, so $c_2 = f''(a)/2$. Inductively: $c_n = f^{(n)}(a)/n!$. $\square$

Consequence: The Taylor series of $f$ at $a$ is the only power series centered at $a$ that can represent $f$. If you find any power series for $f$, it must be the Taylor series.


6. Taylor and Maclaurin Series

6.1 Deriving the Taylor Series Formula

The question: given a smooth function $f$, can we represent it as a power series near a point $a$? If so, the coefficients must be $c_n = f^{(n)}(a)/n!$ (by 5.5). This leads to:

Definition (Taylor Series): The Taylor series of $f$ at $x = a$ is:

$$f(x) \stackrel{?}{=} \sum_{n=0}^\infty \frac{f^{(n)}(a)}{n!}(x-a)^n = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \cdots$$

The question mark is crucial: even if $f$ has derivatives of all orders, the Taylor series may not converge to $f$. The conditions for equality are given in 6.4.

Intuition via polynomial fitting: The $n$-th Taylor polynomial $T_n(x)$ is the unique polynomial of degree $\le n$ that matches $f$ and its first $n$ derivatives at $x = a$:

$$T_n(a) = f(a), \quad T_n'(a) = f'(a), \quad \ldots, \quad T_n^{(n)}(a) = f^{(n)}(a)$$

As $n \to \infty$, we hope $T_n(x) \to f(x)$ - this is the convergence question.

TAYLOR POLYNOMIAL APPROXIMATION


  f(x) = sin(x) at a = 0:

  T_1(x) = x                               (tangent line)
  T_3(x) = x - x^3/6                       (cubic approximation)
  T_5(x) = x - x^3/6 + x^5/120             (quintic approximation)
  T_7(x) = x - x^3/6 + x^5/120 - x^7/5040

  Error |sin(x) - T_5(x)| at x = 0.1:  ~2.0 x 10^-11  (tiny!)
  Error |sin(x) - T_5(x)| at x = 2.0:  ~2.4 x 10^-2   (small but visible)
  Error |sin(x) - T_5(x)| at x = pi:   ~5.2 x 10^-1   (large near pi!)

  The polynomial is most accurate near x = 0 (the expansion point).
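The error figures above can be reproduced with a few lines (plain Python; taylor_sin is an illustrative helper, not a library function):

```python
import math

def taylor_sin(x, order=5):
    """Taylor polynomial of sin at 0 up to the given odd order."""
    total, sign = 0.0, 1
    for k in range(0, order + 1, 2):        # odd powers: x, x^3, x^5, ...
        n = k + 1
        total += sign * x**n / math.factorial(n)
        sign = -sign
    return total

for x in (0.1, 2.0, math.pi):
    err = abs(math.sin(x) - taylor_sin(x, order=5))
    print(f"x={x:5.3f}  |sin(x) - T_5(x)| = {err:.3e}")
```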


6.2 Standard Maclaurin Series

The most important Maclaurin series (Taylor series at $a = 0$):

Exponential: $e^x = \sum_{n=0}^\infty \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$, $R = \infty$

Derivation: $f^{(n)}(x) = e^x$ for all $n$, so $f^{(n)}(0) = 1$, giving $c_n = 1/n!$.

Sine: $\sin x = \sum_{n=0}^\infty \frac{(-1)^n x^{2n+1}}{(2n+1)!} = x - \frac{x^3}{6} + \frac{x^5}{120} - \cdots$, $R = \infty$

Cosine: $\cos x = \sum_{n=0}^\infty \frac{(-1)^n x^{2n}}{(2n)!} = 1 - \frac{x^2}{2} + \frac{x^4}{24} - \cdots$, $R = \infty$

Note: $\sin$ has only odd powers; $\cos$ has only even powers - reflecting their symmetry properties.

Natural logarithm: $\ln(1+x) = \sum_{n=1}^\infty \frac{(-1)^{n+1}x^n}{n} = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots$, $R = 1$

Geometric series: $\frac{1}{1-x} = \sum_{n=0}^\infty x^n = 1 + x + x^2 + \cdots$, $R = 1$

Binomial series: $(1+x)^\alpha = \sum_{n=0}^\infty \binom{\alpha}{n} x^n$, $R = 1$ for non-integer $\alpha$ (for $\alpha \in \mathbb{N}$ the sum terminates: it is a polynomial)

where $\binom{\alpha}{n} = \frac{\alpha(\alpha-1)\cdots(\alpha-n+1)}{n!}$ are generalized binomial coefficients.

Arctangent: $\arctan x = \sum_{n=0}^\infty \frac{(-1)^n x^{2n+1}}{2n+1} = x - \frac{x^3}{3} + \frac{x^5}{5} - \cdots$, $R = 1$

Interesting consequence: At $x=1$: $\pi/4 = 1 - 1/3 + 1/5 - \cdots$ (the Leibniz formula for $\pi$ - it converges very slowly).

6.3 Lagrange Remainder and Error Bounds

Theorem (Taylor's Theorem with Lagrange Remainder): If $f$ has $n+1$ continuous derivatives on an interval containing $a$ and $x$, then:

$$f(x) = T_n(x) + R_n(x)$$

where the Lagrange remainder is:

$$R_n(x) = \frac{f^{(n+1)}(c)}{(n+1)!}(x-a)^{n+1}$$

for some $c$ strictly between $a$ and $x$.

Error bound: If $|f^{(n+1)}(t)| \le M$ for all $t$ between $a$ and $x$, then:

$$|R_n(x)| \le \frac{M}{(n+1)!}|x-a|^{n+1}$$

Proof sketch: Apply the Cauchy Mean Value Theorem repeatedly, or integrate the $(n+1)$-th derivative $n+1$ times. $\square$

Worked example: Bound the error in approximating $e^{0.1}$ by $T_3(x) = 1 + x + x^2/2 + x^3/6$ at $x = 0.1$, $a = 0$.

$f^{(4)}(x) = e^x$. On $[0, 0.1]$: $|e^x| \le e^{0.1} < 1.11$, so take $M = 1.11$.

$$|R_3(0.1)| \le \frac{1.11}{4!}(0.1)^4 = \frac{1.11 \times 0.0001}{24} \approx 4.6 \times 10^{-6}$$

True value: $e^{0.1} \approx 1.1051709$; $T_3(0.1) = 1 + 0.1 + 0.005 + 0.000167 = 1.1051667$. Actual error $\approx 4.3 \times 10^{-6}$ - smaller than our bound, as expected.

For ML: The Lagrange remainder quantifies how many Taylor terms are needed for a desired approximation accuracy. This determines, e.g., how many terms of the linear attention approximation are needed for a given accuracy trade-off.
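A two-line numerical check of the worked example (plain Python):

```python
import math

x = 0.1
t3 = 1 + x + x**2 / 2 + x**3 / 6
actual_error = abs(math.exp(x) - t3)
bound = 1.11 / math.factorial(4) * x**4     # M = 1.11 bounds e^t on [0, 0.1]
print(f"actual error = {actual_error:.2e}, Lagrange bound = {bound:.2e}")
# actual error ~ 4.3e-06 <= bound ~ 4.6e-06
```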

6.4 Analytic Functions and Convergence Domain

A function is analytic at $a$ if its Taylor series at $a$ converges to $f(x)$ in some neighborhood of $a$. Essentially all functions we encounter in ML are analytic wherever they are smooth.

However: A function can be infinitely differentiable (smooth) without being analytic. The classic example:

$$f(x) = \begin{cases} e^{-1/x^2} & x \ne 0 \\ 0 & x = 0 \end{cases}$$

This function has $f^{(n)}(0) = 0$ for all $n$, so its Taylor series is the zero series - which does NOT converge to $f$ (except at $x = 0$).

For ML functions: The standard smooth activations and losses are analytic on their domains. The Taylor series of $\sigma(x) = 1/(1+e^{-x})$ and $\text{GELU}(x)$ converge to the function on appropriate intervals; $\text{ReLU}(x)$ is piecewise linear, hence analytic everywhere except at $x = 0$, where it is not differentiable.

Domain of convergence: For $\ln(1+x)$, the Taylor series has $R = 1$ - it converges on $(-1, 1]$ but not at $x = -1$ or for $|x| > 1$. Computing $\ln(1 + x)$ for large $x$ via a truncated Taylor series will give wrong results. This is why implementations use $\ln(1 + e^x) \approx x + e^{-x}$ for large $x$ (the log-sum-exp trick).

6.5 Algebraic Manipulation of Taylor Series

Taylor series can be combined algebraically to produce new series without recomputing from scratch.

Substitution: Replace $x$ with a multiple, power, or function:

  • $e^{-x^2} = \sum \frac{(-x^2)^n}{n!} = \sum \frac{(-1)^n x^{2n}}{n!}$ (substitute $x \to -x^2$ in $e^x$)
  • $\cos(x^2) = \sum \frac{(-1)^n x^{4n}}{(2n)!}$ (substitute $x \to x^2$ in $\cos x$)

Multiplication: Cauchy product of two absolutely convergent series:

$$\left(\sum_{n=0}^\infty a_n x^n\right)\left(\sum_{n=0}^\infty b_n x^n\right) = \sum_{n=0}^\infty \left(\sum_{k=0}^n a_k b_{n-k}\right) x^n$$

Composition: $\sin(e^x - 1) = \sin\left(x + x^2/2 + \cdots\right) = x + x^2/2 + \cdots - \frac{(x+\cdots)^3}{6} + \cdots$

Collect terms by power of $x$.

Example - softmax Taylor expansion: At high temperature $T$, $e^{x_i/T} \approx 1 + x_i/T + \frac{1}{2}(x_i/T)^2 + \cdots$. To first order:

$$\text{softmax}(x/T)_i \approx \frac{1 + x_i/T}{\sum_j (1 + x_j/T)} = \frac{1 + x_i/T}{n + \sum_j x_j/T}$$

As $T \to \infty$, this $\to 1/n$ (uniform). The correction term $x_i/T$ shows that even at high temperature, the softmax is not exactly uniform - it favors larger $x_i$ by $O(1/T)$.


7. Key Series in Machine Learning

7.1 Softmax and Temperature Scaling

The softmax function with temperature $T > 0$:

$$\text{softmax}(\mathbf{x}/T)_i = \frac{e^{x_i/T}}{\sum_{j=1}^n e^{x_j/T}}$$

Low temperature ($T \to 0$): Let $x_{\max} = \max_j x_j$. Factor out $e^{x_{\max}/T}$:

$$\text{softmax}(x/T)_i = \frac{e^{(x_i - x_{\max})/T}}{\sum_j e^{(x_j - x_{\max})/T}}$$

As $T \to 0$: terms with $x_j < x_{\max}$ have $e^{(x_j - x_{\max})/T} \to 0$; the $i^*$ term (the max) stays at $1$. So $\text{softmax}(x/T) \to \mathbf{e}_{i^*}$ (one-hot on the argmax). This is the temperature $\to 0$ limit used in hard attention and Gumbel-Softmax.

High temperature ($T \to \infty$): Taylor expand $e^{x_i/T} = 1 + x_i/T + O(1/T^2)$:

$$\text{softmax}(x/T)_i \approx \frac{1 + x_i/T}{n + \sum_j x_j/T} \to \frac{1}{n}$$

The distribution becomes uniform. This is used in knowledge distillation (Hinton 2015): training the student on soft targets at high temperature captures the "dark knowledge" in the teacher's probability distribution.

Series representation of the denominator: For the standard softmax, $Z = \sum_j e^{x_j}$ is the partition function - a finite sum here, but for continuous distributions the analogous normalizer becomes a genuine integral or infinite series.
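A compact sketch of both temperature limits (NumPy; the softmax helper and logits are illustrative):

```python
import numpy as np

def softmax(x, T=1.0):
    """Numerically stable softmax with temperature (illustrative helper)."""
    z = (x - x.max()) / T          # subtract the max: the log-sum-exp trick
    e = np.exp(z)
    return e / e.sum()

x = np.array([2.0, 1.0, 0.5, 0.0])
for T in (0.01, 1.0, 100.0):
    print(f"T={T:6.2f} ->", np.round(softmax(x, T), 4))
# T -> 0 approaches one-hot on the argmax; T -> infinity approaches uniform (0.25 each)
```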

7.2 GELU via Hermite Polynomials

The GELU (Gaussian Error Linear Unit) activation is:

$$\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

where $\Phi$ is the standard Gaussian CDF. Since $\text{erf}$ is not an elementary function, GELU has no elementary closed form, but the approximation:

$$\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right]\right)$$

uses a polynomial approximation inside $\tanh$. This is the approximation used in practice (GPT-2, BERT).

Taylor series of erf: Since $\text{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$, substitute $e^{-t^2} = \sum \frac{(-1)^n t^{2n}}{n!}$ and integrate term-by-term:

$$\text{erf}(x) = \frac{2}{\sqrt{\pi}} \sum_{n=0}^\infty \frac{(-1)^n x^{2n+1}}{n!(2n+1)}$$

Near $x = 0$: $\text{erf}(x) \approx \frac{2x}{\sqrt\pi} - \frac{2x^3}{3\sqrt\pi} + \cdots$

So, substituting $x/\sqrt{2}$: $\text{GELU}(x) \approx x \cdot \frac{1}{2}\left(1 + \sqrt{\frac{2}{\pi}}\,x\right) = \frac{x}{2} + \frac{x^2}{\sqrt{2\pi}}$ near $x = 0$.

Connection to Hermite polynomials: The derivatives of GELU at $0$ involve the Hermite polynomials $H_n$ - the orthogonal polynomials with respect to the Gaussian weight $e^{-x^2/2}$. This connection explains why GELU interacts well with Gaussian-distributed inputs.
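A sketch comparing the exact GELU, the tanh approximation above, and the leading Taylor terms (plain Python; the helper names are illustrative):

```python
import math

def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """The tanh approximation used by GPT-2/BERT."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x**3)))

def gelu_quadratic(x):
    """Leading Taylor terms near 0: x/2 + x^2 / sqrt(2*pi)."""
    return x / 2 + x**2 / math.sqrt(2 * math.pi)

for x in (0.1, 0.5, 1.0, 2.0):
    print(f"x={x}: exact={gelu_exact(x):.5f}  tanh={gelu_tanh(x):.5f}  "
          f"quadratic={gelu_quadratic(x):.5f}")
```

The quadratic truncation tracks the exact value near $0$ and drifts as $|x|$ grows - exactly the radius-of-accuracy behavior the Lagrange remainder predicts.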

7.3 Optimizer Update Rules as Taylor Approximations

Newton's method (second-order optimization) applies a quadratic Taylor approximation to the loss:

$$\mathcal{L}(\theta + \delta) \approx \mathcal{L}(\theta) + \nabla\mathcal{L}^\top \delta + \frac{1}{2}\delta^\top H \delta$$

Minimizing over $\delta$: $\delta^* = -H^{-1}\nabla\mathcal{L}$ (the Newton step). This is exact for quadratic losses.

SGD discards the Hessian ($H \approx I/\alpha$): $\delta = -\alpha \nabla\mathcal{L}$ - first-order Taylor only.

Adam's bias correction via geometric series:

The Adam first moment is $m_t = (1-\beta_1)\sum_{k=0}^{t-1} \beta_1^k g_{t-k}$. Its expectation:

$$\mathbb{E}[m_t] = (1-\beta_1)\sum_{k=0}^{t-1}\beta_1^k\, \mathbb{E}[g_{t-k}] = (1-\beta_1^t)\,\mathbb{E}[g]$$

using the geometric series partial sum $\sum_{k=0}^{t-1}\beta_1^k = (1-\beta_1^t)/(1-\beta_1)$ and assuming a stationary gradient mean $\mathbb{E}[g_{t-k}] = \mathbb{E}[g]$. The bias correction $\hat{m}_t = m_t/(1-\beta_1^t)$ restores $\mathbb{E}[\hat{m}_t] = \mathbb{E}[g]$.

Gradient clipping via Taylor: For very large gradients, the loss is approximated by its first-order Taylor, and clipping bounds the step size to stay within a trust region where the linear approximation is valid.

7.4 Linear Attention via Taylor Expansion of exp

Standard self-attention has complexity $O(n^2 d)$ because it computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

The key operation is $\exp(q_i^\top k_j / \sqrt{d})$ for all $i, j$ pairs.

Linear attention (Katharopoulos et al., 2020) replaces $\exp(\cdot)$ with its first-order Taylor approximation:

$$\exp(q^\top k/\sqrt{d}) \approx 1 + q^\top k/\sqrt{d} = \phi(q)^\top \phi(k)$$

where $\phi(x) = [1,\, x/d^{1/4}]$ (feature map; the $d^{1/4}$ scaling makes $\phi(q)^\top\phi(k) = 1 + q^\top k/\sqrt{d}$). Then:

$$\text{LinearAttn}(q_i) = \frac{\sum_j \phi(q_i)^\top\phi(k_j)\, v_j}{\sum_j \phi(q_i)^\top\phi(k_j)} = \frac{\phi(q_i)^\top \left(\sum_j \phi(k_j)v_j^\top\right)}{\phi(q_i)^\top \left(\sum_j \phi(k_j)\right)}$$

The sums $\sum_j \phi(k_j)v_j^\top$ and $\sum_j \phi(k_j)$ can be computed once in $O(nd)$, reducing the overall complexity to $O(n)$ in sequence length.

Error of the first-order approximation: By Taylor's theorem, $|e^x - (1+x)| \le \frac{x^2}{2}e^{|x|}$. For $x = q^\top k/\sqrt{d}$ small (as when $q, k$ have bounded norm), the error is $O(1/d)$ - small in high dimension.

Higher-order approximations use $e^x \approx 1 + x + x^2/2$, giving Performer-style random-feature approximations via the polynomial kernel $1 + q^\top k + (q^\top k)^2/2$.
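A toy sketch of the two computations (NumPy; shapes and scaling are illustrative assumptions, and phi follows the first-order expansion above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8
Q = rng.normal(size=(n, d)) / np.sqrt(d)   # small-norm queries/keys so the
K = rng.normal(size=(n, d)) / np.sqrt(d)   # first-order expansion is accurate
V = rng.normal(size=(n, d))

# Softmax attention: O(n^2 d) pairwise scores
scores = np.exp(Q @ K.T / np.sqrt(d))
soft_out = (scores / scores.sum(axis=1, keepdims=True)) @ V

# Linear attention: phi(x) = [1, x / d**0.25], two shared summaries, O(n d^2)
phi = lambda X: np.concatenate([np.ones((n, 1)), X / d**0.25], axis=1)
Qf, Kf = phi(Q), phi(K)
S = Kf.T @ V                      # sum_j phi(k_j) v_j^T
z = Kf.sum(axis=0)                # sum_j phi(k_j)
lin_out = (Qf @ S) / (Qf @ z)[:, None]

print("max deviation:", np.abs(soft_out - lin_out).max())
```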

7.5 Log-Sum-Exp and Numerical Stability

The LogSumExp operation appears in softmax computation:

$$\text{LSE}(x_1, \ldots, x_n) = \ln\left(\sum_{i=1}^n e^{x_i}\right)$$

Numerical problem: For large $x_i$ (e.g., $x_i = 1000$), $e^{x_i}$ overflows float32 ($\max \approx 3.4 \times 10^{38}$).

Stable computation: Let $m = \max_i x_i$. Then:

$$\text{LSE}(x) = m + \ln\left(\sum_i e^{x_i - m}\right)$$

Each $e^{x_i - m} \le 1$, so no overflow.

Taylor analysis: Why does this work? Write $x_i = m + \delta_i$ where $\delta_i \le 0$:

$$\ln\sum_i e^{x_i} = m + \ln\left(1 + \sum_{i \ne i^*} e^{\delta_i}\right)$$

For the dominant term ($\delta_{i^*} = 0$, the max): the contribution is $e^0 = 1$. Other terms are exponentially small: $e^{\delta_i} \ll 1$. The Taylor expansion $\ln(1 + \epsilon) \approx \epsilon - \epsilon^2/2 + \cdots$ shows the correction is small when all other $x_i$ are much smaller than $m$.
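A minimal sketch of the stable computation (NumPy; the helper is illustrative):

```python
import numpy as np

def logsumexp(x):
    """Stable LSE: subtract the max before exponentiating."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 999.0, 995.0])
print(logsumexp(x))               # ~1000.32, computed without overflow
print(np.log(np.sum(np.exp(x))))  # naive version: exp(1000) overflows to inf
```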

7.6 Laplace Approximation

Setup: Given a posterior $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\,p(\theta)$, we want a Gaussian approximation around the MAP estimate $\hat\theta$.

Method: Taylor-expand $\log p(\theta \mid \mathcal{D})$ around $\hat\theta$ (where the gradient is zero):

$$\log p(\theta \mid \mathcal{D}) \approx \log p(\hat\theta \mid \mathcal{D}) - \frac{1}{2}(\theta - \hat\theta)^\top H (\theta - \hat\theta)$$

where $H = -\nabla^2 \log p(\hat\theta \mid \mathcal{D})$ (the negative Hessian at the MAP). Exponentiating this second-order Taylor expansion gives:

$$p(\theta \mid \mathcal{D}) \approx \mathcal{N}(\hat\theta,\, H^{-1})$$

This is the Laplace approximation - a Gaussian centered at the MAP with covariance equal to the inverse of the negative log-posterior Hessian.

Where Taylor truncation is valid: The approximation is accurate when the log-posterior is nearly quadratic near $\hat\theta$, i.e., when the third-order remainder $O(\|\theta - \hat\theta\|^3)$ is small. This holds when the posterior is concentrated (many data points, by the Bernstein-von Mises theorem).

Applications in ML (2026):

  • Bayesian neural networks: Laplace approximation used in Laplace-Redux (Immer et al., 2021), PostHoc Laplace (Daxberger et al., 2021)
  • Uncertainty quantification: the covariance $H^{-1}$ estimates predictive uncertainty
  • Last-layer Laplace: apply the Laplace approximation only to the last linear layer (computationally tractable for large models)
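A one-dimensional sketch of the method (NumPy; the Gamma(3, 1) target and all numbers are illustrative assumptions, not from the cited papers): the Gaussian $\mathcal{N}(\hat\theta, H^{-1})$ tracks the posterior near its mode.

```python
import numpy as np
from math import gamma as gamma_fn

# Target density Gamma(3, 1): log p(theta) = 2 ln(theta) - theta + const.
# Mode: d/dtheta (2 ln t - t) = 2/t - 1 = 0  ->  theta_hat = 2.
theta_hat = 2.0
H = 2.0 / theta_hat**2            # -(d^2/dtheta^2) log p = 2/theta^2 at the mode
sigma = 1.0 / np.sqrt(H)          # Laplace variance is H^{-1}

for theta in (1.0, 2.0, 3.0):
    exact = theta**2 * np.exp(-theta) / gamma_fn(3)
    laplace = (np.exp(-0.5 * (theta - theta_hat) ** 2 / sigma**2)
               / (sigma * np.sqrt(2 * np.pi)))
    print(f"theta={theta}: exact={exact:.4f}  laplace={laplace:.4f}")
```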

8. Fourier Series (Preview)

Scope note: The full theory of Fourier series - convergence theorems, Parseval's identity, Fourier transform, and their ML applications - belongs in a dedicated treatment. Here we give the essential concepts needed to understand positional encodings and frequency decomposition in transformers.

8.1 Periodic Functions as Trigonometric Series

A Fourier series represents a periodic function $f$ with period $2\pi$ as an infinite sum of sines and cosines:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^\infty \left(a_n \cos(nx) + b_n \sin(nx)\right)$$

Unlike Taylor series (which expand around a single point using polynomials), Fourier series use global basis functions - sine and cosine waves of different frequencies - to represent the function over its entire period.

The key difference from Taylor series:

| Feature | Taylor Series | Fourier Series |
|---|---|---|
| Basis functions | $1, x, x^2, x^3, \ldots$ (powers) | $1, \cos x, \sin x, \cos 2x, \ldots$ (sinusoids) |
| Expansion point | Local (around $a$) | Global (over the full period) |
| Best for | Smooth, analytic functions | Periodic functions, possibly discontinuous |
| Convergence | In a neighborhood of $a$ | In $L^2$ norm over the period |
| Derivatives | Easy (power rule) | Easy (differentiation rotates sin/cos) |

Key property: The derivative of a Fourier series is obtained by differentiating term-by-term:

$$f'(x) = \sum_{n=1}^\infty n\left(-a_n \sin(nx) + b_n \cos(nx)\right)$$

Differentiation multiplies each frequency-$n$ component by $n$ - high-frequency components dominate the derivative.

8.2 Orthogonality and Fourier Coefficients

The key property that makes Fourier series work is orthogonality: different basis functions are orthogonal with respect to the inner product $\langle f, g\rangle = \frac{1}{\pi}\int_{-\pi}^\pi f(x)g(x)\,dx$:

$$\frac{1}{\pi}\int_{-\pi}^\pi \cos(mx)\cos(nx)\,dx = \begin{cases} 0 & m \ne n \\ 1 & m = n \ne 0 \\ 2 & m = n = 0 \end{cases}$$

$$\frac{1}{\pi}\int_{-\pi}^\pi \sin(mx)\sin(nx)\,dx = \begin{cases} 0 & m \ne n \\ 1 & m = n \end{cases}$$

$$\frac{1}{\pi}\int_{-\pi}^\pi \cos(mx)\sin(nx)\,dx = 0 \quad \text{for all } m, n$$

These orthogonality relations are provable using the integration formulas for products of sines and cosines (from 03-Integration).

Deriving the Fourier coefficients: Multiply both sides of the Fourier series by $\cos(nx)$ and integrate over $[-\pi, \pi]$. All terms vanish by orthogonality except the $a_n$ term:

$$a_n = \frac{1}{\pi}\int_{-\pi}^\pi f(x)\cos(nx)\,dx, \qquad b_n = \frac{1}{\pi}\int_{-\pi}^\pi f(x)\sin(nx)\,dx$$

These are exactly the projection coefficients of $f$ onto the orthogonal basis $\{\cos(nx), \sin(nx)\}$ - the Fourier series is the best $L^2$ approximation of $f$ in the sinusoidal basis.

Complex form: Using Euler's formula $e^{inx} = \cos(nx) + i\sin(nx)$:

$$f(x) = \sum_{n=-\infty}^\infty \hat{f}(n)\,e^{inx}, \qquad \hat{f}(n) = \frac{1}{2\pi}\int_{-\pi}^\pi f(x)e^{-inx}\,dx$$

The complex coefficients $\hat{f}(n)$ encode both the amplitude and the phase of the $n$-th frequency component.

8.3 ML Relevance: Positional Encodings and RoPE

Original Transformer positional encodings (Vaswani et al., 2017) use sinusoidal functions evaluated at position $p$ and embedding dimension index $2k$ (of $d$ total):

$$PE_{p,2k} = \sin\left(\frac{p}{10000^{2k/d}}\right), \qquad PE_{p,2k+1} = \cos\left(\frac{p}{10000^{2k/d}}\right)$$

This is a Fourier-like encoding: position $p$ is represented by a superposition of sinusoids at $d/2$ different frequencies $\omega_k = 1/10000^{2k/d}$.

Why sinusoidal? Any relative shift in position corresponds to a linear transformation of the encoding vector - provable because the Fourier basis diagonalizes the shift operator. This allows the attention mechanism to learn relative positions.

Rotary Position Embeddings (RoPE) (Su et al., 2021; used in LLaMA, GPT-NeoX) rotate each two-dimensional block of the query and key vectors in the complex plane by an angle proportional to the position $m$:

$$q_m = q \cdot e^{im\theta}, \qquad k_n = k \cdot e^{in\theta}$$

The attention score $\text{Re}\left(q_m \overline{k_n}\right) = \text{Re}\left(q\bar{k}\,e^{i(m-n)\theta}\right)$ depends only on $m-n$ (the relative position). This mirrors the Fourier multiplication theorem: a shift in position space acts as a pointwise phase multiplication in frequency space.

Full Fourier theory: Fourier transform, Parseval's theorem, Gibbs phenomenon, and non-periodic functions are treated in the Probability chapter (characteristic functions) and an advanced analysis appendix.
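A tiny numerical sketch of the RoPE relative-position property (NumPy, treating one 2-D pair as a complex number; names and values are illustrative):

```python
import numpy as np

theta = 0.3
q, k = 1.0 + 0.5j, 0.8 - 0.2j      # one query/key pair as complex numbers

def rotate(z, pos):
    return z * np.exp(1j * pos * theta)   # RoPE rotation at a given position

for (m, n) in [(3, 1), (7, 5), (10, 8)]:  # every pair has m - n = 2
    score = (rotate(q, m) * np.conj(rotate(k, n))).real
    print(f"m={m:2d}, n={n:2d}: score = {score:.6f}")
# identical scores: the attention logit depends only on the offset m - n
```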


9. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Concluding convergence from $a_n \to 0$ | Necessary but not sufficient: the harmonic series has $1/n \to 0$ but diverges | Use a proper convergence test (ratio, comparison, integral) |
| 2 | Using the ratio test when the ratio $\to 1$ | $L = 1$ is the inconclusive case: for every p-series the ratio $\to 1$ | Switch to the integral test or a comparison test |
| 3 | Forgetting to check endpoints separately | Ratio/root tests use strict inequalities: they say nothing at $\lvert x-a\rvert = R$ | Test each endpoint with a dedicated series test |
| 4 | Rearranging conditionally convergent series | Riemann's theorem: rearrangement can change the sum to any value | Only rearrange absolutely convergent series freely |
| 5 | Applying a Taylor series outside its radius of convergence | The series may diverge or converge to the wrong value | Check $\lvert x - a\rvert < R$ before using the series |
| 6 | Confusing the Taylor series existing with converging to $f$ | A smooth function may have a Taylor series that doesn't equal $f$ (e.g., $e^{-1/x^2}$) | Check analyticity; use the Lagrange remainder to bound the error |
| 7 | Assuming differentiation changes the radius of convergence | The radius is preserved under differentiation/integration - only endpoint behavior can change | Re-check endpoints only; the radius stays the same |
| 8 | Treating $\sum (-1)^n/n$ as the negative of $\sum 1/n$ | The alternating series converges to $-\ln 2$; the non-alternating one diverges entirely | Absolute and conditional convergence are fundamentally different |
| 9 | Applying the integral test to non-monotone functions | The test requires $f$ to be eventually decreasing and positive | Verify monotonicity; if $f$ is not monotone, use comparison instead |
| 10 | Assuming $\sum a_n$ and $\sum b_n$ both diverging implies $\sum (a_n + b_n)$ diverges | Both can diverge while their sum converges: $a_n = 1/n$, $b_n = -1/n$ | Convergence of sums is NOT additive for divergent series |
| 11 | Using $e^x \approx 1 + x$ for $x$ not small | First-order Taylor has error $O(x^2)$: for $x = 1$, the error is $e - 2 \approx 0.72$ | Check that $\lvert x\rvert$ is small, or add higher-order terms |
| 12 | Confusing $R$ (radius of convergence) with the interval of convergence | $R$ is the radius; the interval is $(a-R, a+R)$, and endpoints need separate analysis | Always state both $R$ and the interval (with endpoints explicitly checked) |

10. Exercises

Exercise 1 - Geometric Series Compute the sum of each geometric series or state that it diverges: (a) $\sum_{n=0}^\infty \left(\frac{2}{3}\right)^n$ (b) $\sum_{n=2}^\infty \left(\frac{3}{2}\right)^n$ (c) $\sum_{n=0}^\infty \frac{(-1)^n}{4^n}$ (d) In reinforcement learning, the discounted return is $G_0 = \sum_{t=0}^\infty \gamma^t r$ where $r = 1$ (constant reward) and $\gamma = 0.95$. Compute $G_0$.

Exercise 2 - Convergence Tests Determine whether each series converges or diverges. State which test you use: (a) $\sum_{n=1}^\infty \frac{1}{\sqrt{n^3}}$ (b) $\sum_{n=1}^\infty \frac{n}{e^n}$ (c) $\sum_{n=1}^\infty \frac{(-1)^n}{\sqrt{n}}$ (d) $\sum_{n=1}^\infty \frac{n!}{2^n}$ (e) $\sum_{n=2}^\infty \frac{1}{n \ln n}$

Exercise 3 - Radius of Convergence Find the radius of convergence $R$ and the interval of convergence (including endpoint checks) for each power series: (a) $\sum_{n=0}^\infty \frac{x^n}{3^n}$ (b) $\sum_{n=1}^\infty \frac{(-1)^n x^n}{n}$ (c) $\sum_{n=0}^\infty n! \cdot x^n$ (d) $\sum_{n=0}^\infty \frac{(x-2)^n}{n^2 + 1}$

Exercise 4 - Taylor Series Derivation (a) Derive the Maclaurin series for f(x)=exf(x) = e^{-x} from scratch (compute coefficients via cn=f(n)(0)/n!c_n = f^{(n)}(0)/n!). (b) Use the series for exe^x to derive the series for ex2e^{x^2}. (c) Using integration of the series for et2e^{-t^2} from 0 to xx, express 0xet2dt\int_0^x e^{-t^2}\,dt as a power series. (d) How many terms are needed to approximate 00.5et2dt\int_0^{0.5} e^{-t^2}\,dt to within 10810^{-8}?

Exercise 5 - Lagrange Remainder and Error Bounds (a) Find the 4th-order Taylor polynomial for f(x)=ln(1+x)f(x) = \ln(1+x) at a=0a = 0. (b) Use the Lagrange remainder to bound f(0.2)T4(0.2)|f(0.2) - T_4(0.2)|. (c) Compute the actual error and verify your bound holds. (d) How many terms of the Taylor series for ln(1+x)\ln(1+x) are needed to compute ln(1.1)\ln(1.1) to 10 decimal places?

Exercise 6 - GELU Taylor Analysis (a) Show that erf(x)=2πn=0(1)nx2n+1n!(2n+1)\text{erf}(x) = \frac{2}{\sqrt{\pi}}\sum_{n=0}^\infty \frac{(-1)^n x^{2n+1}}{n!(2n+1)} by integrating et2e^{-t^2} term-by-term. (b) Write out the first 4 terms of the GELU series: GELU(x)=x12(1+erf(x/2))\text{GELU}(x) = x \cdot \frac{1}{2}(1 + \text{erf}(x/\sqrt{2})). (c) Compare the GELU Taylor approximation to T3(x)=xT_3(x) = x (ReLU-like) and T5(x)T_5(x) for x[2,2]x \in [-2, 2]. (d) At what x|x| does the 5th-order approximation exceed 1% relative error?

Exercise 7 - Linear Attention Approximation (a) Show that eqk1+qke^{q^\top k} \approx 1 + q^\top k to first order, and bound the error eqk(1+qk)|e^{q^\top k} - (1 + q^\top k)| in terms of qk|q^\top k|. (b) If queries and keys are unit vectors in Rd\mathbb{R}^d (so qk1|q^\top k| \le 1), bound the maximum first-order approximation error. (c) Write the standard attention formula and show how the linear approximation reduces complexity from O(n2d)O(n^2 d) to O(nd2)O(n d^2). (d) Would using the second-order approximation ex1+x+x2/2e^x \approx 1 + x + x^2/2 improve accuracy? What is the trade-off?

Exercise 8 - Geometric Series in Adam (a) Show that the Adam first moment mt=(1β)k=0t1βkgtkm_t = (1-\beta)\sum_{k=0}^{t-1}\beta^k g_{t-k} (for m0=0m_0 = 0) satisfies the recurrence mt=βmt1+(1β)gtm_t = \beta m_{t-1} + (1-\beta) g_t. (b) Compute E[mt]\mathbb{E}[m_t] assuming all gradients have the same mean μ\mu (i.e., E[gk]=μ\mathbb{E}[g_k] = \mu). Show it equals (1βt)μ(1-\beta^t)\mu. (c) Why is the bias correction m^t=mt/(1βt)\hat{m}_t = m_t/(1-\beta^t) needed in early training (tt small)? (d) For β=0.9\beta = 0.9, at what step tt is the bias correction less than 1% (i.e., βt<0.01\beta^t < 0.01)?


11. Why This Matters for AI (2026 Perspective)

| Concept | AI/ML Application | Where in Modern Systems |
|---------|-------------------|-------------------------|
| Geometric series | Discounted return in RL; Adam bias correction | PPO, SAC, A3C; all Adam-family optimizers |
| Convergence tests | Proving SGD convergence; Robbins-Monro conditions | Convergence proofs in NeurIPS/ICML papers |
| Taylor series (1st order) | Linear attention; Performer approximation | Linformer, Performer, FNet |
| Taylor series (2nd order) | Newton-CG optimizer; K-FAC; Gauss-Newton | Natural gradient methods, second-order optimizers |
| Lagrange remainder | Quantifying approximation error in linearized attention | Guarantees in approximation papers |
| Power series / radius of convergence | Convergence radius of optimizer iterates; spectral radius | Convergence theory for RNNs, ResNets |
| Maclaurin series for $e^x$ | Deriving the exponential map on Lie groups (rotation matrices in equivariant networks) | SE(3)-equivariant networks, FrameDiff |
| Log-sum-exp stability | Numerically stable softmax in all transformer implementations | PyTorch, JAX - used in every attention block |
| Fourier series / sinusoidal PE | Position encoding in original Transformer; period-length encoding | Vaswani et al. 2017; long-context models |
| RoPE (Rotary PE) | Relative position encoding in LLaMA, GPT-NeoX, Falcon | Most open-source LLMs >= 2023 |
| Laplace approximation | Bayesian uncertainty quantification for LLMs | Laplace-Redux; post-hoc Laplace for safety |
| Alternating series (MCMC) | Controlling numerical error in annealed importance sampling | Bayesian inference; diffusion model likelihoods |
| $e^{i\theta} = \cos\theta + i\sin\theta$ | Complex rotations in RoPE; frequency analysis of loss landscapes | Mechanistic interpretability of sinusoidal features |
| Binomial series $(1+x)^\alpha$ | Legendre polynomials; polynomial feature maps in kernels | Polynomial attention, spherical harmonics |

Conceptual Bridge

Where you came from: This section is the culmination of single-variable calculus. Limits (04/01) are the foundation - sequence convergence IS a limit. Derivatives (04/02) provide the Taylor coefficients f(n)(a)/n!f^{(n)}(a)/n!. Integration (04/03) allows deriving new power series from old ones via term-by-term integration. Every idea in this section rests on the three preceding sections.

What this unlocks forward: Series are the bridge from single-variable to infinite-dimensional mathematics:

  • -> 05 Multivariate Calculus: The multivariable Taylor theorem uses the same structure: f(x+h)=f(x)+fh+12hHh+f(\mathbf{x}+\mathbf{h}) = f(\mathbf{x}) + \nabla f^\top \mathbf{h} + \frac{1}{2}\mathbf{h}^\top H \mathbf{h} + \cdots where HH is the Hessian matrix. Every gradient descent convergence proof uses the second-order Taylor expansion.

  • -> 06 Probability and Statistics: Generating functions G(x)=pnxnG(x) = \sum p_n x^n encode probability distributions as power series. Characteristic functions ϕ(t)=E[eitX]\phi(t) = \mathbb{E}[e^{itX}] are Fourier transforms of probability densities. The cumulant generating function K(t)=logE[etX]K(t) = \log\mathbb{E}[e^{tX}] has Taylor coefficients equal to the cumulants (mean, variance, skewness, kurtosis).

  • -> Functional Analysis (advanced): Functions in L2[a,b]L^2[a,b] can be expanded in any orthonormal basis (Legendre polynomials, Fourier series, wavelets). This is the infinite-dimensional generalization of expressing a vector in an orthonormal basis. Neural networks implicitly learn such basis decompositions.

POSITION IN THE CURRICULUM


  04/01 Limits -> epsilon-N convergence used in 04/04 sequence theory
  04/02 Derivatives -> Taylor coefficients f(a)/n! in 04/04
  04/03 Integration -> term-by-term integration; integral test
         
  04/04 SERIES AND SEQUENCES  <- YOU ARE HERE
         
         -> 05 Multivariable Taylor theorem
                  (gradient descent convergence)
         
         -> 06 Probability: generating functions,
                  characteristic functions, CLT proof
         
         -> Advanced: Fourier analysis, functional
                   analysis, spectral methods in ML


The single most important takeaway: Taylor series and the geometric series are not just abstract tools - they are the mathematical engine behind every major optimization algorithm and attention approximation in modern AI. Every time a researcher writes ex1+xe^x \approx 1 + x, they are using the first-order Maclaurin approximation with a remainder bounded by the Lagrange error formula.


Appendix A: Extended Convergence Proofs

A.1 Proof of the Integral Test

Theorem: Let f:[1,)[0,)f: [1, \infty) \to [0,\infty) be continuous and decreasing, with an=f(n)a_n = f(n). Then n=1an\sum_{n=1}^\infty a_n converges iff 1f(x)dx<\int_1^\infty f(x)\,dx < \infty.

Proof: Since ff is decreasing, for x[n,n+1]x \in [n, n+1]:

f(n)f(x)f(n+1)f(n) \ge f(x) \ge f(n+1)

Integrating over [n,n+1][n, n+1]:

f(n)nn+1f(x)dxf(n+1)f(n) \ge \int_n^{n+1} f(x)\,dx \ge f(n+1)

Sum from n=1n = 1 to NN:

n=1Nf(n)1N+1f(x)dxn=2N+1f(n)=n=1Nf(n+1)\sum_{n=1}^N f(n) \ge \int_1^{N+1} f(x)\,dx \ge \sum_{n=2}^{N+1} f(n) = \sum_{n=1}^N f(n+1)

If 1fdx<\int_1^\infty f\,dx < \infty: The partial sums SN=n=1NanS_N = \sum_{n=1}^N a_n are bounded above by a1+1fdx<a_1 + \int_1^\infty f\,dx < \infty. Since SNS_N is increasing (terms are positive) and bounded above, SNS_N converges by MCT.

If an<\sum a_n < \infty: Then 1N+1fdxn=1Nf(n)n=1an<\int_1^{N+1} f\,dx \le \sum_{n=1}^N f(n) \to \sum_{n=1}^\infty a_n < \infty. Since 1N+1fdx\int_1^{N+1} f\,dx is increasing and bounded, it converges. \square

A.2 Proof of the Ratio Test

Theorem: If L=liman+1/an<1L = \lim |a_{n+1}/a_n| < 1, then an\sum a_n converges absolutely.

Proof: Choose rr with L<r<1L < r < 1. Since an+1/anL|a_{n+1}/a_n| \to L, there exists NN such that an+1/anr|a_{n+1}/a_n| \le r for all nNn \ge N. Then for nNn \ge N:

$$|a_{n+1}| \le r|a_n| \le r^2|a_{n-1}| \le \cdots \le r^{n-N+1}|a_N|$$

So anaNrNrn=Crn|a_{n}| \le |a_N| \cdot r^{-N} \cdot r^n = C r^n for C=aNrNC = |a_N| r^{-N}. Therefore:

n=NanCn=Nrn=CrN1r<\sum_{n=N}^\infty |a_n| \le C \sum_{n=N}^\infty r^n = \frac{Cr^N}{1-r} < \infty

Since the tail converges, so does the full series. \square

If $L > 1$: choose $r$ with $1 < r < L$. Eventually $|a_{n+1}| \ge r|a_n|$, so $|a_n| \ge C r^n \to \infty$, violating the necessary condition $a_n \to 0$. The series diverges. $\square$

A.3 Proof of the Alternating Series Test (Leibniz)

Theorem: If bn0b_n \ge 0, bn+1bnb_{n+1} \le b_n, and bn0b_n \to 0, then (1)n+1bn\sum (-1)^{n+1}b_n converges.

Proof: Consider even and odd partial sums:

S2m=(b1b2)+(b3b4)++(b2m1b2m)S_{2m} = (b_1 - b_2) + (b_3 - b_4) + \cdots + (b_{2m-1} - b_{2m})

Each parenthesis b2k1b2k0b_{2k-1} - b_{2k} \ge 0 (since bnb_n is decreasing), so S2mS_{2m} is increasing.

Also: S2m=b1(b2b3)(b4b5)b2mb1S_{2m} = b_1 - (b_2 - b_3) - (b_4 - b_5) - \cdots - b_{2m} \le b_1. So S2mS_{2m} is bounded above.

By MCT, S2mLS_{2m} \to L for some Lb1L \le b_1.

Now S2m+1=S2m+b2m+1L+0=LS_{2m+1} = S_{2m} + b_{2m+1} \to L + 0 = L (since bn0b_n \to 0).

Both subsequences of partial sums converge to LL, so SNLS_N \to L. \square

Error bound proof: Note S2mLS2m+1S_{2m} \le L \le S_{2m+1} (odd sums overshoot from above). So:

LSNSN+1SN=bN+1|L - S_N| \le |S_{N+1} - S_N| = b_{N+1} \qquad \square

Appendix B: Uniform Convergence

B.1 Pointwise vs. Uniform Convergence

Definition (Pointwise convergence): A sequence of functions fn:DRf_n: D \to \mathbb{R} converges pointwise to ff if for each fixed xDx \in D:

limnfn(x)=f(x)\lim_{n\to\infty} f_n(x) = f(x)

Definition (Uniform convergence): fnf_n converges uniformly to ff on DD if:

limnsupxDfn(x)f(x)=0\lim_{n\to\infty} \sup_{x \in D} |f_n(x) - f(x)| = 0

Uniform convergence is stronger: the same NN works for ALL xx simultaneously.

The critical distinction:

POINTWISE vs. UNIFORM CONVERGENCE


  f_n(x) = x^n on [0,1]:

  At x = 0.5: (0.5)^n -> 0  (converges)
  At x = 1.0: 1^n = 1 -> 1  (converges)
  Pointwise limit: f(x) = 0 for x in [0,1), f(1) = 1

  But: sup_{x in [0,1]} |x^n - f(x)| = sup_{x in [0,1)} x^n = 1 for all n
  -> NOT uniformly convergent on [0,1]

  On [0, 0.9]: sup x^n = 0.9^n -> 0 -> uniformly convergent here.


B.2 Why Uniform Convergence Enables Term-by-Term Operations

Theorem: If fnff_n \to f uniformly on [a,b][a,b] and each fnf_n is continuous:

  1. ff is continuous
  2. limnabfn(x)dx=abf(x)dx\lim_{n\to\infty} \int_a^b f_n(x)\,dx = \int_a^b f(x)\,dx (limit and integral interchange)

Theorem: If fnff_n \to f pointwise and fngf_n' \to g uniformly, then f=gf' = g (limit and derivative interchange).

These theorems are exactly what justify term-by-term differentiation and integration of power series inside the radius of convergence: the partial sums k=0nck(xa)k\sum_{k=0}^n c_k(x-a)^k converge uniformly on any closed interval [ar,a+r][a-r, a+r] with r<Rr < R.

B.3 Failure Without Uniform Convergence

Without uniform convergence, swapping limits and integrals leads to errors:

Example: $f_n(x) = 2nx\,e^{-nx^2}$ on $[0,1]$.

  • Pointwise: $f_n(x) \to 0$ for all $x \ge 0$ (the exponential factor beats the linear one; at $x = 0$ every $f_n(0) = 0$)
  • But: $\int_0^1 f_n(x)\,dx = 1 - e^{-n} \to 1 \ne \int_0^1 0\,dx = 0$ (substitute $u = nx^2$)

The limit and integral don't commute! Convergence is NOT uniform: $\sup_x f_n(x) = \sqrt{2n/e} \to \infty$ (the maximum sits at $x = 1/\sqrt{2n}$).
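A numerical check of this failure, using SciPy's quadrature (a sketch; any integration routine would do):

```python
import numpy as np
from scipy.integrate import quad

f = lambda x, n: 2 * n * x * np.exp(-n * x**2)

for n in [1, 10, 100, 1000]:
    integral, _ = quad(f, 0, 1, args=(n,))
    # The pointwise value at any fixed x shrinks to 0, yet the integral -> 1.
    print(f"n={n:>5}: f_n(0.5)={f(0.5, n):.2e}  integral={integral:.6f}")
```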

For ML: This is why convergence arguments for neural network training must be careful about the order of limits. The Dominated Convergence Theorem (from integration theory, 03/Appendix O) provides the right condition under which limits and integrals commute.


Appendix C: Complex Power Series and Euler's Formula

C.1 Extending Power Series to Complex Numbers

Power series work equally well for complex zCz \in \mathbb{C}. The exponential series ez=zn/n!e^z = \sum z^n/n! converges for all zCz \in \mathbb{C} (same ratio test argument).

Euler's formula: Setting z=iθz = i\theta (purely imaginary):

eiθ=n=0(iθ)nn!=n=0inθnn!e^{i\theta} = \sum_{n=0}^\infty \frac{(i\theta)^n}{n!} = \sum_{n=0}^\infty \frac{i^n \theta^n}{n!}

Using i0=1i^0=1, i1=ii^1=i, i2=1i^2=-1, i3=ii^3=-i, i4=1i^4=1, \ldots (period 4):

eiθ=(1θ22!+θ44!)+i(θθ33!+θ55!)=cosθ+isinθe^{i\theta} = \left(1 - \frac{\theta^2}{2!} + \frac{\theta^4}{4!} - \cdots\right) + i\left(\theta - \frac{\theta^3}{3!} + \frac{\theta^5}{5!} - \cdots\right) = \cos\theta + i\sin\theta

Euler's identity (θ=π\theta = \pi): eiπ=cosπ+isinπ=1+0i=1e^{i\pi} = \cos\pi + i\sin\pi = -1 + 0i = -1, giving eiπ+1=0e^{i\pi} + 1 = 0.

This remarkable identity connects the five most fundamental constants of mathematics: ee, ii, π\pi, 11, 00.

C.2 Rotations as Complex Exponentials

Multiplying by eiθe^{i\theta} rotates a complex number by angle θ\theta:

eiθreiϕ=rei(θ+ϕ)e^{i\theta} \cdot re^{i\phi} = r e^{i(\theta+\phi)}

This is the mathematical foundation of RoPE (Rotary Position Embeddings):

In RoPE, each attention head dimension pair (q2k,q2k+1)(q_{2k}, q_{2k+1}) is treated as a complex number q2k+iq2k+1q_{2k} + i q_{2k+1}. Multiplying by eimθke^{im\theta_k} (rotation by position mm times frequency θk\theta_k) encodes position:

q~m=qeimθ    q~mk~n=Re[qkˉei(mn)θ]\tilde{q}_m = q \cdot e^{im\theta} \implies \tilde{q}_m^\top \tilde{k}_n = \text{Re}[q\bar{k} e^{i(m-n)\theta}]

The inner product depends only on the relative position mnm-n - the rotation encodes position via complex multiplication, making it equivariant to sequence shifts.
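A minimal numerical sketch of this shift-equivariance (toy 2-dimensional head and an arbitrary frequency; not a production RoPE implementation):

```python
import numpy as np

def rotate(v, angle):
    """Treat the pair (v[0], v[1]) as the complex number v[0] + i*v[1] and rotate it."""
    z = complex(v[0], v[1]) * np.exp(1j * angle)
    return np.array([z.real, z.imag])

theta = 0.1                      # one frequency; real RoPE uses a bank of frequencies
q = np.array([0.8, -0.3])        # one (q_{2k}, q_{2k+1}) pair
k = np.array([0.2, 0.5])

# Attention scores for two position pairs with the same offset m - n = 3:
s1 = rotate(q, 7 * theta) @ rotate(k, 4 * theta)    # m=7,  n=4
s2 = rotate(q, 10 * theta) @ rotate(k, 7 * theta)   # m=10, n=7
print(np.isclose(s1, s2))  # True: the score depends only on m - n
```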

C.3 Hyperbolic Functions

The hyperbolic functions are the even and odd parts of $e^x$:

coshx=ex+ex2=n=0x2n(2n)!,sinhx=exex2=n=0x2n+1(2n+1)!\cosh x = \frac{e^x + e^{-x}}{2} = \sum_{n=0}^\infty \frac{x^{2n}}{(2n)!}, \qquad \sinh x = \frac{e^x - e^{-x}}{2} = \sum_{n=0}^\infty \frac{x^{2n+1}}{(2n+1)!}

The tanh activation function: tanh(x)=sinhx/coshx\tanh(x) = \sinh x/\cosh x. Its Taylor series at 0:

tanhx=xx33+2x515\tanh x = x - \frac{x^3}{3} + \frac{2x^5}{15} - \cdots

This shows tanh is approximately linear (x\approx x) for small xx and saturates to ±1\pm 1 for large xx.


Appendix D: Generating Functions

D.1 Ordinary Generating Functions

A generating function packages a sequence (a0,a1,a2,)(a_0, a_1, a_2, \ldots) as a formal power series:

G(x)=n=0anxnG(x) = \sum_{n=0}^\infty a_n x^n

The sequence (an)(a_n) can be recovered as an=G(n)(0)/n!a_n = G^{(n)}(0)/n! (or via coefficient extraction). The power of generating functions is that algebraic operations on G(x)G(x) correspond to combinatorial operations on the sequence.

Key examples:

| Sequence | Generating function | Notes |
|----------|---------------------|-------|
| $1, 1, 1, \ldots$ | $\frac{1}{1-x}$ | Geometric series |
| $\binom{n}{k}$ (fixed $k$) | $\frac{x^k}{(1-x)^{k+1}}$ | Binomial coefficients |
| Fibonacci: $0,1,1,2,3,5,\ldots$ | $\frac{x}{1-x-x^2}$ | Closed form via partial fractions |
| $n^2$ | $\frac{x(1+x)}{(1-x)^3}$ | Derived by differentiating the geometric series |

Operations:

  • Shift: if G(x)=anxnG(x) = \sum a_n x^n, then xG(x)=anxn+1xG(x) = \sum a_n x^{n+1} (sequence shifted right by 1)
  • Differentiation: G(x)=nanxn1G'(x) = \sum na_n x^{n-1} (sequence (nan)(na_n))
  • Multiplication: $G(x)H(x) = \sum_n \left(\sum_{k=0}^n a_k b_{n-k}\right)x^n$ (convolution) - verified numerically in the sketch below
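The convolution rule is easy to verify numerically, since multiplying truncated generating functions is exactly polynomial multiplication. A sketch (assuming NumPy):

```python
import numpy as np

# Truncated coefficient sequences of two generating functions:
a = np.array([1, 1, 1, 1, 1])    # G(x) = 1/(1-x)   -> a_n = 1
b = np.array([1, 2, 3, 4, 5])    # H(x) = 1/(1-x)^2 -> b_n = n+1

# Product of the series = convolution of the coefficient sequences
prod = np.convolve(a, b)[:5]
expected = [sum(a[k] * b[n - k] for k in range(n + 1)) for n in range(5)]
print(prod.tolist(), expected)   # [1, 3, 6, 10, 15] = coefficients of 1/(1-x)^3
```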

D.2 Moment Generating Functions (MGF) in ML

The moment generating function of a random variable XX:

MX(t)=E[etX]=n=0E[Xn]n!tnM_X(t) = \mathbb{E}[e^{tX}] = \sum_{n=0}^\infty \frac{\mathbb{E}[X^n]}{n!} t^n

The Taylor coefficients encode the moments: MX(n)(0)=E[Xn]M_X^{(n)}(0) = \mathbb{E}[X^n].

For Gaussian XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2): MX(t)=eμt+σ2t2/2M_X(t) = e^{\mu t + \sigma^2 t^2/2}. This gives:

  • M(0)=μM'(0) = \mu (mean)
  • M(0)=μ2+σ2M''(0) = \mu^2 + \sigma^2 (second moment)
  • Variance: M(0)(M(0))2=σ2M''(0) - (M'(0))^2 = \sigma^2

Cumulant generating function: KX(t)=logMX(t)=μt+σ22t2K_X(t) = \log M_X(t) = \mu t + \frac{\sigma^2}{2}t^2 for Gaussian. Taylor coefficients of KXK_X are the cumulants: κ1=μ\kappa_1 = \mu (mean), κ2=σ2\kappa_2 = \sigma^2 (variance), κ3\kappa_3 (skewness), κ4\kappa_4 (excess kurtosis). For Gaussian, κn=0\kappa_n = 0 for n3n \ge 3 - a Gaussian is completely characterized by its first two cumulants. This is why many normality tests check higher cumulants.


Appendix E: Multivariable Taylor (Preview)

Forward reference: Full treatment in 05 Multivariate Calculus.

The Taylor series extends naturally to functions f:RnRf: \mathbb{R}^n \to \mathbb{R}. Expanding around a\mathbf{a}:

f(x)=f(a)+f(a)(xa)+12(xa)H(a)(xa)+O(xa3)f(\mathbf{x}) = f(\mathbf{a}) + \nabla f(\mathbf{a})^\top(\mathbf{x}-\mathbf{a}) + \frac{1}{2}(\mathbf{x}-\mathbf{a})^\top H(\mathbf{a})(\mathbf{x}-\mathbf{a}) + O(\|\mathbf{x}-\mathbf{a}\|^3)

where f(a)\nabla f(\mathbf{a}) is the gradient vector and H(a)H(\mathbf{a}) is the Hessian matrix.

Reading the terms:

  • 0th order: f(a)f(\mathbf{a}) - the function value at the expansion point
  • 1st order: $\nabla f^\top \delta$, writing $\delta = \mathbf{x} - \mathbf{a}$ - the linear (gradient) term; this is the basis of SGD
  • 2nd order: 12δHδ\frac{1}{2}\delta^\top H\delta - the curvature term; this is why quadratic forms appear in optimization

Gradient descent as first-order Taylor: Setting x=aαf(a)\mathbf{x} = \mathbf{a} - \alpha\nabla f(\mathbf{a}) in the 1st-order expansion:

f(aαf)f(a)αf(a)2f(\mathbf{a} - \alpha\nabla f) \approx f(\mathbf{a}) - \alpha\|\nabla f(\mathbf{a})\|^2

The loss decreases by αf2\alpha\|\nabla f\|^2 per step - valid when α\alpha is small enough that the linear approximation holds. The "right" α\alpha is determined by the second-order term.

Newton's method as second-order Taylor: Minimize the 2nd-order Taylor model exactly:

δ=H1f,f(a+δ)f(a)12fH1f\delta^* = -H^{-1}\nabla f, \qquad f(\mathbf{a} + \delta^*) \approx f(\mathbf{a}) - \frac{1}{2}\nabla f^\top H^{-1}\nabla f

For a quadratic ff, Newton's method is exact in one step. This is why L-BFGS and other quasi-Newton methods (which approximate H1H^{-1}) converge much faster than SGD on smooth losses.
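A sketch contrasting the two updates on an ill-conditioned 2-D quadratic (illustrative numbers only):

```python
import numpy as np

# f(x) = 0.5 x^T H x, minimized at the origin
H = np.array([[10.0, 0.0], [0.0, 0.1]])       # condition number 100
grad = lambda x: H @ x

x0 = np.array([1.0, 1.0])
x_newton = x0 - np.linalg.solve(H, grad(x0))  # exact in one step for a quadratic

x_gd = x0.copy()
for _ in range(100):
    x_gd -= 0.09 * grad(x_gd)                 # needs lr < 2/lambda_max = 0.2

print("Newton (1 step):", x_newton)           # [0. 0.]
print("GD (100 steps): ", x_gd)               # still far along the flat direction
```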


Appendix F: Fourier Series Extended

F.1 Parseval's Identity

For a function ff with Fourier series f(x)=a0/2+(ancosnx+bnsinnx)f(x) = a_0/2 + \sum(a_n\cos nx + b_n\sin nx):

1πππf(x)2dx=a022+n=1(an2+bn2)\frac{1}{\pi}\int_{-\pi}^\pi |f(x)|^2\,dx = \frac{a_0^2}{2} + \sum_{n=1}^\infty (a_n^2 + b_n^2)

Interpretation: The L2L^2 norm squared of ff equals the sum of squared Fourier coefficients. This is an infinite-dimensional Pythagorean theorem - the orthonormal basis {1/2,cosnx,sinnx}\{1/\sqrt{2}, \cos nx, \sin nx\} satisfies the same relationship as a standard orthonormal basis in Rn\mathbb{R}^n.

For ML: In the theory of neural networks as function approximators, the Fourier perspective shows that learning high-frequency functions requires exponentially more training data (the "spectral bias" of neural networks - they learn low frequencies first).

F.2 Gibbs Phenomenon

For a discontinuous function (like a square wave), the Fourier partial sum $S_N(x)$ overshoots near the discontinuity by approximately $8.9\%$ of the jump - the constant is $\frac{1}{\pi}\int_0^\pi \frac{\sin t}{t}\,dt - \frac{1}{2} \approx 0.0895$ - regardless of $N$.

This Gibbs phenomenon does not disappear as NN \to \infty - it is an intrinsic property of Fourier series near discontinuities. The partial sum does converge to the function's average at the discontinuity.
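A numerical check using the square wave $\operatorname{sign}(x)$ on $[-\pi,\pi]$, whose Fourier series is $\frac{4}{\pi}\sum_{k\ge 0}\frac{\sin((2k+1)x)}{2k+1}$ (a sketch; grid resolution is arbitrary):

```python
import numpy as np

def partial_sum(x, N):
    """First N terms of the Fourier series of sign(x) on [-pi, pi]."""
    k = np.arange(N)
    return (4 / np.pi) * np.sum(np.sin((2 * k + 1) * x[:, None]) / (2 * k + 1), axis=1)

x = np.linspace(1e-4, 0.5, 4000)   # sample just to the right of the jump at 0
for N in [10, 100, 1000]:
    overshoot = partial_sum(x, N).max() - 1.0
    # stays near 0.179, i.e. ~8.95% of the jump of size 2, for every N
    print(f"N={N:>5}: overshoot = {overshoot:.4f}")
```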

For ML: Neural networks trained on discontinuous target functions (e.g., step functions, indicator functions) exhibit analogous ringing artifacts near edges. Regularization and smooth activations mitigate this.

F.3 Convergence in L2L^2 vs. Pointwise

Theorem (Riesz-Fischer): Every square-integrable function fL2[π,π]f \in L^2[-\pi,\pi] equals the limit of its Fourier partial sums in L2L^2 norm:

fSNL2=(ππf(x)SN(x)2dx)1/20\left\|f - S_N\right\|_{L^2} = \left(\int_{-\pi}^\pi |f(x) - S_N(x)|^2\,dx\right)^{1/2} \to 0

However, pointwise convergence is more subtle - Carleson's theorem (1966) states that the Fourier series of any L2L^2 function converges pointwise almost everywhere, but this is a very deep result.

The L2L^2 convergence is sufficient for ML applications - it means the Fourier approximation is the best L2L^2 approximation of the function in the sinusoidal basis, analogous to how the truncated SVD is the best low-rank approximation (Eckart-Young theorem).


Appendix G: ML Derivations in Full

G.1 Full Derivation: Adam Bias Correction

The Adam update rule:

m0=0,mt=β1mt1+(1β1)gtm_0 = 0, \qquad m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t

Unrolling the recurrence:

mt=(1β1)k=0t1β1kgtk+β1tm0=(1β1)k=0t1β1kgtkm_t = (1-\beta_1)\sum_{k=0}^{t-1}\beta_1^k g_{t-k} + \beta_1^t m_0 = (1-\beta_1)\sum_{k=0}^{t-1}\beta_1^k g_{t-k}

(since m0=0m_0 = 0). Taking expectation with E[gk]=g\mathbb{E}[g_k] = g (true gradient, stationary assumption):

E[mt]=(1β1)(k=0t1β1k)g=(1β1)1β1t1β1g=(1β1t)g\mathbb{E}[m_t] = (1-\beta_1)\left(\sum_{k=0}^{t-1}\beta_1^k\right)g = (1-\beta_1) \cdot \frac{1-\beta_1^t}{1-\beta_1} \cdot g = (1-\beta_1^t) g

The geometric series sum k=0t1β1k=1β1t1β1\sum_{k=0}^{t-1}\beta_1^k = \frac{1-\beta_1^t}{1-\beta_1} is the key step.

Bias correction: m^t=mt/(1β1t)\hat{m}_t = m_t/(1-\beta_1^t) ensures E[m^t]=g\mathbb{E}[\hat{m}_t] = g. For β1=0.9\beta_1 = 0.9:

  • t=1t=1: 1β1t=0.11-\beta_1^t = 0.1, correction factor = 10 (large!)
  • t=10t=10: 10.9100.6511-0.9^{10} \approx 0.651, correction 1.54\approx 1.54
  • t=100t=100: 10.91001.01-0.9^{100} \approx 1.0, correction negligible

Similarly for the second moment vt=β2vt1+(1β2)gt2v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2 with correction v^t=vt/(1β2t)\hat{v}_t = v_t/(1-\beta_2^t).
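A quick simulation confirming the bias and its correction (a sketch with synthetic noisy gradients of true mean 1):

```python
import numpy as np

rng = np.random.default_rng(0)
beta, steps, trials = 0.9, 50, 5000

avg_m = np.zeros(steps)
for _ in range(trials):
    m = 0.0
    for t in range(steps):
        g = 1.0 + rng.normal(0.0, 0.1)      # gradient with true mean 1
        m = beta * m + (1 - beta) * g
        avg_m[t] += m / trials

for t in [1, 10, 50]:
    print(f"t={t:>2}: E[m_t]~{avg_m[t-1]:.3f}  theory (1-beta^t)={1 - beta**t:.3f}")
```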

G.2 Full Derivation: Linear Attention Complexity

Standard attention for nn tokens and dd-dimensional heads:

A=softmax(QK/d)Rn×n,Output=AVA = \text{softmax}(QK^\top/\sqrt{d}) \in \mathbb{R}^{n\times n}, \quad \text{Output} = AV

Cost: QKQK^\top is O(n2d)O(n^2 d); applying AA to VV is O(n2d)O(n^2 d). Total: O(n2d)O(n^2 d).

Linear attention with ϕ(x)=[1,x]\phi(x) = [1, x] (first-order Taylor):

LinearAttn(qi)=jϕ(qi)ϕ(kj)vjjϕ(qi)ϕ(kj)\text{LinearAttn}(q_i) = \frac{\sum_j \phi(q_i)^\top\phi(k_j)v_j}{\sum_j \phi(q_i)^\top\phi(k_j)}

The key insight: define $S = \sum_j \phi(k_j)v_j^\top \in \mathbb{R}^{(d+1) \times d}$ and $z = \sum_j \phi(k_j) \in \mathbb{R}^{d+1}$ (the feature map $\phi(x) = [1, x]$ has dimension $d+1$). These are computed ONCE in $O(nd^2)$. Then for each query:

outputi=ϕ(qi)Sϕ(qi)z\text{output}_i = \frac{\phi(q_i)^\top S}{\phi(q_i)^\top z}

Cost per query: O(d2)O(d^2). Total: O(nd2)O(nd^2). For fixed dd, this is O(n)O(n) vs. O(n2)O(n^2).

Trade-off: The first-order Taylor approximation introduces error bounded by $\frac{(q^\top k)^2}{2}e^{|q^\top k|}$ (Lagrange). For unit vectors in $\mathbb{R}^d$, $|q^\top k| \le 1$, so the bound is $e/2 \approx 1.36$; the actual worst case is $e - 2 \approx 0.72$ at $q^\top k = 1$. In practice, the error is managed by choosing $d$ large and using random feature approximations (Performer).
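A minimal single-head sketch of the $O(nd^2)$ computation (toy sizes; real implementations add scaling, masking, and positive feature maps):

```python
import numpy as np

def linear_attention(Q, K, V):
    """Attention with the first-order feature map phi(x) = [1, x]."""
    phi = lambda X: np.concatenate([np.ones((X.shape[0], 1)), X], axis=1)
    phiQ, phiK = phi(Q), phi(K)               # (n, d+1)
    S = phiK.T @ V                            # (d+1, d): computed once, O(n d^2)
    z = phiK.sum(axis=0)                      # (d+1,)
    return (phiQ @ S) / (phiQ @ z)[:, None]   # O(d^2) per query

rng = np.random.default_rng(1)
n, d = 6, 4
Q, K, V = (rng.normal(size=(n, d)) / np.sqrt(d) for _ in range(3))

scores = np.exp(Q @ K.T)                      # exact (unscaled) softmax attention
exact = (scores / scores.sum(axis=1, keepdims=True)) @ V
print(np.abs(linear_attention(Q, K, V) - exact).max())  # small when |q.k| is small
```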

G.3 Full Derivation: Softmax Temperature and Entropy

For logits xRn\mathbf{x} \in \mathbb{R}^n and temperature T>0T > 0, let pi=exi/T/jexj/Tp_i = e^{x_i/T}/\sum_j e^{x_j/T}.

Entropy H=ipilogpiH = -\sum_i p_i \log p_i:

As T0T \to 0: peip \to \mathbf{e}_{i^*} (one-hot), H0H \to 0 (minimum entropy).

As TT \to \infty: p(1/n,,1/n)p \to (1/n, \ldots, 1/n), HlognH \to \log n (maximum entropy). Proof via Taylor:

$$p_i \approx \frac{1 + x_i/T}{n(1 + \bar{x}/T)} \approx \frac{1}{n}\left(1 + \frac{x_i - \bar{x}}{T}\right), \qquad \bar{x} = \frac{1}{n}\sum_j x_j$$

Writing $\varepsilon_i = (x_i - \bar{x})/T$ (so $\sum_i \varepsilon_i = 0$) and expanding $\log p_i \approx -\log n + \varepsilon_i - \varepsilon_i^2/2$:

$$H = -\sum_i p_i \log p_i \approx \log n - \frac{1}{2n}\sum_i \varepsilon_i^2 = \log n - \frac{\operatorname{Var}(x)}{2T^2} \to \log n$$

The $O(1/T)$ terms cancel exactly because $\sum_i \varepsilon_i = 0$. So entropy increases toward its maximum $\log n$ as $T$ grows, with a leading correction of order $O(1/T^2)$ proportional to the variance of the logits.
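A numerical check of both limits and the $1/T^2$ rate (a sketch):

```python
import numpy as np

def entropy(logits, T):
    p = np.exp(logits / T - np.max(logits / T))   # stable softmax at temperature T
    p /= p.sum()
    return -np.sum(p * np.log(p))

x = np.array([1.0, 2.0, 0.5, -1.0])
prediction = lambda T: np.log(len(x)) - np.var(x) / (2 * T**2)
for T in [0.1, 1.0, 10.0, 100.0]:
    print(f"T={T:>6}: H={entropy(x, T):.6f}  log(n)-Var/(2T^2)={prediction(T):.6f}")
```

The asymptotic formula matches closely for large $T$ and fails (as expected) for small $T$, where the softmax is nearly one-hot.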


Appendix H: Numerical Summation Methods

H.1 Euler-Maclaurin Formula

The Euler-Maclaurin formula connects sums to integrals with correction terms:

n=abf(n)=abf(x)dx+f(a)+f(b)2+k=1pB2k(2k)![f(2k1)(b)f(2k1)(a)]+Rp\sum_{n=a}^b f(n) = \int_a^b f(x)\,dx + \frac{f(a)+f(b)}{2} + \sum_{k=1}^p \frac{B_{2k}}{(2k)!}\left[f^{(2k-1)}(b) - f^{(2k-1)}(a)\right] + R_p

where B2kB_{2k} are Bernoulli numbers (B2=1/6B_2 = 1/6, B4=1/30B_4 = -1/30, B6=1/42B_6 = 1/42, \ldots).

Application: Euler used this formula to evaluate partial sums and tails of $\sum_{n=1}^N 1/n^2$ to roughly 20 digits - precise enough to recognize the value $\pi^2/6$ numerically before he could prove it (see Appendix K.1).

For ML: Euler-Maclaurin makes precise the bridge between discrete sums and continuous integrals - the same discrete-to-continuous correspondence exploited when residual networks are reinterpreted as Neural ODEs.

H.2 Kahan Summation Algorithm

Naive summation of floating-point numbers accumulates O(nϵ)O(n\epsilon) rounding error where ϵ\epsilon is machine epsilon (~10710^{-7} for float32). Kahan summation reduces this to O(ϵ)O(\epsilon):

```python
def kahan_sum(arr):
    total = 0.0
    c = 0.0  # compensation: running estimate of lost low-order bits
    for x in arr:
        y = x - c            # corrected term
        t = total + y        # new sum (may lose low bits of y)
        c = (t - total) - y  # recover the lost low bits
        total = t
    return total
```

This is why numerically careful libraries avoid naive left-to-right accumulation: NumPy's sum uses pairwise summation (not Kahan, but far better than naive), and deep learning frameworks such as PyTorch use similar blocked accumulation for numerical stability.

H.3 Richardson Extrapolation

If S(h)S(h) approximates SS with error S(h)=S+c1hp+c2hp+1+S(h) = S + c_1 h^p + c_2 h^{p+1} + \cdots, then combining two approximations eliminates the leading error:

S2pS(h/2)S(h)2p1S \approx \frac{2^p S(h/2) - S(h)}{2^p - 1}

Example: Trapezoidal rule has error O(h2)O(h^2). Richardson extrapolation on two trapezoidal estimates gives Simpson's rule (O(h4)O(h^4)). Repeated extrapolation gives Romberg integration with O(h2k)O(h^{2k}) accuracy.

For series, Richardson extrapolation can accelerate convergence of slowly converging sequences - a technique used in numerical analysis for computing π\pi, ee, and other constants.
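A sketch applying one Richardson step to the trapezoidal rule (the $p = 2$ case), recovering Simpson-level accuracy:

```python
import numpy as np

def trapezoid(f, a, b, n):
    x = np.linspace(a, b, n + 1)
    y = f(x)
    return (b - a) / n * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)

f, a, b, exact = np.sin, 0.0, np.pi, 2.0
coarse = trapezoid(f, a, b, 16)           # error O(h^2)
fine = trapezoid(f, a, b, 32)             # same rule at h/2
richardson = (4 * fine - coarse) / 3      # cancels the h^2 error term (p = 2)

print(f"coarse:     {abs(coarse - exact):.2e}")
print(f"fine:       {abs(fine - exact):.2e}")
print(f"richardson: {abs(richardson - exact):.2e}")   # ~O(h^4): Simpson's rule
```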


Appendix I: Glossary and Reference Tables

I.1 Key Definitions

| Term | Definition |
|------|------------|
| Sequence | Function $a: \mathbb{N} \to \mathbb{R}$; an ordered list $(a_1, a_2, \ldots)$ |
| Converges (sequence) | $\lim_{n\to\infty} a_n = L$ exists and is finite |
| Bounded sequence | $\exists B: \lvert a_n\rvert \le B$ for all $n$ |
| Monotone sequence | Either always increasing ($a_{n+1} \ge a_n$) or always decreasing |
| Infinite series | $\sum_{n=1}^\infty a_n = \lim_{N\to\infty} \sum_{n=1}^N a_n$ |
| Absolutely convergent | $\sum \lvert a_n\rvert$ converges |
| Conditionally convergent | $\sum a_n$ converges but $\sum \lvert a_n\rvert$ diverges |
| Power series | $\sum c_n (x-a)^n$ - a series whose terms are polynomial monomials |
| Radius of convergence | Largest $R$ such that $\sum c_n (x-a)^n$ converges absolutely for $\lvert x-a\rvert < R$ |
| Taylor series | $\sum \frac{f^{(n)}(a)}{n!}(x-a)^n$ - power series with coefficients from derivatives |
| Maclaurin series | Taylor series at $a = 0$ |
| Analytic function | Function that equals its Taylor series in a neighborhood of each point |
| Uniform convergence | $\sup_x \lvert f_n(x) - f(x)\rvert \to 0$ |
| Fourier series | Expansion of a periodic function in the sinusoidal basis |
| Lagrange remainder | $R_n(x) = \frac{f^{(n+1)}(c)}{(n+1)!}(x-a)^{n+1}$ - exact error in Taylor approximation |

I.2 Standard Series Table

| Series | Sum | Domain |
|--------|-----|--------|
| $\sum_{n=0}^\infty x^n$ | $\frac{1}{1-x}$ | $\lvert x\rvert < 1$ |
| $\sum_{n=0}^\infty (-1)^n x^n$ | $\frac{1}{1+x}$ | $\lvert x\rvert < 1$ |
| $\sum_{n=0}^\infty \frac{x^n}{n!}$ | $e^x$ | All $x$ |
| $\sum_{n=0}^\infty \frac{(-1)^n x^{2n+1}}{(2n+1)!}$ | $\sin x$ | All $x$ |
| $\sum_{n=0}^\infty \frac{(-1)^n x^{2n}}{(2n)!}$ | $\cos x$ | All $x$ |
| $\sum_{n=1}^\infty \frac{(-1)^{n+1}x^n}{n}$ | $\ln(1+x)$ | $-1 < x \le 1$ |
| $\sum_{n=0}^\infty \binom{\alpha}{n}x^n$ | $(1+x)^\alpha$ | $\lvert x\rvert < 1$ |
| $\sum_{n=0}^\infty \frac{(-1)^n x^{2n+1}}{2n+1}$ | $\arctan x$ | $\lvert x\rvert \le 1$ |
| $\sum_{n=0}^\infty \frac{x^{2n}}{(2n)!}$ | $\cosh x$ | All $x$ |
| $\sum_{n=0}^\infty \frac{x^{2n+1}}{(2n+1)!}$ | $\sinh x$ | All $x$ |
| $\sum_{n=1}^\infty \frac{1}{n^2}$ | $\frac{\pi^2}{6}$ | (Basel problem) |
| $\sum_{n=0}^\infty \frac{(-1)^n}{2n+1}$ | $\frac{\pi}{4}$ | (Leibniz) |

I.3 Convergence Test Decision Table

| Test | Best For | Condition for Convergence | Condition for Divergence | Inconclusive |
|------|----------|---------------------------|--------------------------|--------------|
| Divergence | Any series (first check) | - | $a_n \not\to 0$ | $a_n \to 0$ |
| Geometric | $\sum r^n$ | $\lvert r\rvert < 1$ | $\lvert r\rvert \ge 1$ | - |
| p-series | $\sum 1/n^p$ | $p > 1$ | $p \le 1$ | - |
| Integral | Decreasing positive $f(n)$ | $\int f\,dx < \infty$ | $\int f\,dx = \infty$ | $f$ not monotone |
| Comparison | $0 \le a_n \le b_n$ | $\sum b_n$ converges | $\sum a_n$ diverges | - |
| Limit Comparison | $a_n, b_n > 0$ | $L \in (0,\infty)$ -> same as comparison series | same | $L = 0$ or $\infty$ |
| Ratio | Factorials, exponentials | $L < 1$ | $L > 1$ | $L = 1$ |
| Root | $n$-th powers $(f(n))^n$ | $L < 1$ | $L > 1$ | $L = 1$ |
| Alternating | $\sum(-1)^n b_n$ | $b_n \ge 0$ decreasing, $\to 0$ | - | $b_n$ not decreasing |

I.4 Self-Assessment Checklist

After completing this section:

Sequences

  • Can give epsilon-N proof that 1/n01/n \to 0
  • Can state and use Monotone Convergence Theorem
  • Can explain Bolzano-Weierstrass and its ML relevance

Series and Convergence Tests

  • Can prove harmonic series diverges (Oresme)
  • Can apply all the standard convergence tests and know when each applies
  • Can explain absolute vs. conditional convergence and Riemann rearrangement

Power Series

  • Can find radius of convergence via ratio test
  • Can differentiate and integrate power series term-by-term
  • Can derive ln(1+x)\ln(1+x) series from geometric series via integration

Taylor Series

  • Can derive Taylor series for exe^x, sinx\sin x, cosx\cos x, ln(1+x)\ln(1+x)
  • Can bound Taylor approximation error via Lagrange remainder
  • Can manipulate series algebraically (substitution, multiplication, composition)

ML Applications

  • Can derive Adam bias correction using geometric series
  • Can explain linear attention as first-order Taylor approximation
  • Can explain softmax temperature behavior at T0T \to 0 and TT \to \infty
  • Can describe Laplace approximation as quadratic Taylor expansion

Appendix J: Worked Solutions to Selected Exercises

J.1 Exercise 1 - Geometric Series

(a) n=0(2/3)n\sum_{n=0}^\infty (2/3)^n: geometric series with a=1a=1, r=2/3r = 2/3. Since r=2/3<1|r| = 2/3 < 1:

n=0(2/3)n=112/3=11/3=3\sum_{n=0}^\infty (2/3)^n = \frac{1}{1-2/3} = \frac{1}{1/3} = 3

(b) n=2(3/2)n\sum_{n=2}^\infty (3/2)^n: ratio r=3/2>1r = 3/2 > 1. Diverges.

(c) n=0(1)n/4n=n=0(1/4)n\sum_{n=0}^\infty (-1)^n/4^n = \sum_{n=0}^\infty (-1/4)^n: geometric with r=1/4r = -1/4, r=1/4<1|r| = 1/4 < 1:

11(1/4)=15/4=45\frac{1}{1-(-1/4)} = \frac{1}{5/4} = \frac{4}{5}

(d) G0=t=00.95t1=110.95=10.05=20G_0 = \sum_{t=0}^\infty 0.95^t \cdot 1 = \frac{1}{1-0.95} = \frac{1}{0.05} = 20.

J.2 Exercise 3 - Radius of Convergence

(a) (x/3)n\sum (x/3)^n: geometric series with ratio x/3x/3. Converges when x/3<1|x/3| < 1, i.e., x<3|x| < 3. So R=3R = 3.

  • At x=3x = 3: 1n=1\sum 1^n = \sum 1 diverges.
  • At x=3x = -3: (1)n\sum (-1)^n diverges (oscillates).
  • Interval of convergence: (3,3)(-3, 3).

(b) (1)nxn/n\sum (-1)^n x^n / n: ratio test: an+1/an=nxn+1x|a_{n+1}/a_n| = \frac{n|x|}{n+1} \to |x|. So R=1R = 1.

  • At x=1x = 1: (1)n/n\sum (-1)^n/n - alternating harmonic, converges to ln2-\ln 2.
  • At x=1x = -1: (1)n(1)n/n=1/n\sum (-1)^n(-1)^n/n = \sum 1/n - harmonic, diverges.
  • Interval of convergence: (1,1](-1, 1].

(c) n!xn\sum n! x^n: ratio =(n+1)x= (n+1)|x| \to \infty for any x0x \ne 0. So R=0R = 0; converges only at x=0x = 0.

(d) (x2)n/(n2+1)\sum (x-2)^n/(n^2+1): ratio x2\to |x-2|. So R=1R = 1, centered at a=2a = 2.

  • At x=3x = 3 (x2=1x-2 = 1): 1/(n2+1)\sum 1/(n^2+1) converges (compare to p-series 1/n2\sum 1/n^2).
  • At x=1x = 1 (x2=1x-2 = -1): (1)n/(n2+1)\sum (-1)^n/(n^2+1) converges absolutely.
  • Interval of convergence: [1,3][1, 3].

J.3 Exercise 5 - Lagrange Remainder

(a) ln(1+x)\ln(1+x): derivatives at a=0a=0:

  • f(x)=ln(1+x)f(x) = \ln(1+x), f(0)=0f(0) = 0
  • f(x)=1/(1+x)f'(x) = 1/(1+x), f(0)=1f'(0) = 1
  • f(x)=1/(1+x)2f''(x) = -1/(1+x)^2, f(0)=1f''(0) = -1
  • f(x)=2/(1+x)3f'''(x) = 2/(1+x)^3, f(0)=2f'''(0) = 2
  • f(4)(x)=6/(1+x)4f^{(4)}(x) = -6/(1+x)^4, f(4)(0)=6f^{(4)}(0) = -6
T4(x)=xx22+x33x44T_4(x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4}

(b) At x=0.2x = 0.2: f(5)(x)=24/(1+x)5f^{(5)}(x) = 24/(1+x)^5. On [0,0.2][0, 0.2]: f(5)24/(1+0)5=24|f^{(5)}| \le 24/(1+0)^5 = 24. So:

R4(0.2)245!(0.2)5=241200.00032=0.2×0.00032=6.4×105|R_4(0.2)| \le \frac{24}{5!}(0.2)^5 = \frac{24}{120} \cdot 0.00032 = 0.2 \times 0.00032 = 6.4 \times 10^{-5}

(c) True value: $\ln(1.2) \approx 0.182322$. $T_4(0.2) = 0.2 - 0.02 + 0.002667 - 0.0004 = 0.182267$. Actual error $\approx 5.5 \times 10^{-5}$ - within our bound of $6.4 \times 10^{-5}$.

(d) Need $|R_n(0.1)| \le \frac{(0.1)^{n+1}}{n+1} \le 10^{-10}$ (alternating series: error bounded by the first omitted term). For $n = 8$ the bound is $\frac{1}{9}(0.1)^9 \approx 1.1 \times 10^{-10}$ - just above target; for $n = 9$ it is $\frac{1}{10}(0.1)^{10} = 10^{-11}$. So 9 terms suffice.


Appendix K: Historical Deep Dives

K.1 The Basel Problem (Euler, 1735)

The Basel problem asks: what is n=11/n2\sum_{n=1}^\infty 1/n^2?

This series was known to converge (p-series with p=2p=2), but its exact value was unknown until Euler's stunning proof in 1735.

Euler's argument (not fully rigorous by modern standards, but correct):

The function sin(πx)/(πx)\sin(\pi x)/(\pi x) has zeros at x=±1,±2,±3,x = \pm 1, \pm 2, \pm 3, \ldots So it factors as:

sin(πx)πx=n=1(1x2n2)\frac{\sin(\pi x)}{\pi x} = \prod_{n=1}^\infty \left(1 - \frac{x^2}{n^2}\right)

Expanding the left side via the Taylor series for sin\sin:

sin(πx)πx=1π2x26+π4x4120\frac{\sin(\pi x)}{\pi x} = 1 - \frac{\pi^2 x^2}{6} + \frac{\pi^4 x^4}{120} - \cdots

Expanding the right side (infinite product), the coefficient of x2x^2 is:

n=11n2-\sum_{n=1}^\infty \frac{1}{n^2}

Equating: 1/n2=π2/6-\sum 1/n^2 = -\pi^2/6, so n=11n2=π26\sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}.

This approach - identifying a function by its zeros and comparing Taylor series - is now a standard technique in complex analysis. Euler's original argument was made rigorous by Weierstrass's factorization theorem (1876).

K.2 Ramanujan's Enigmatic Formulas

Srinivasa Ramanujan (1887-1920) discovered series identities that seemed impossible:

1π=229801n=0(4n)!(n!)426390n+11033964n\frac{1}{\pi} = \frac{2\sqrt{2}}{9801}\sum_{n=0}^\infty \frac{(4n)!}{(n!)^4} \cdot \frac{26390n + 1103}{396^{4n}}

This series was discovered empirically and later proven using the theory of elliptic integrals and modular forms. It converges incredibly fast - each term adds about 8 decimal digits to π\pi.

Modern use: The Chudnovsky algorithm (1988), based on a Ramanujan-like formula, is used for computing π\pi to trillions of digits and is implemented in the mpmath Python library.

K.3 Taylor vs. Fourier: Two Ways to Decompose a Function

The Taylor and Fourier series represent two fundamentally different decomposition strategies:

| | Taylor | Fourier |
|---|--------|---------|
| Basis | Powers $\{1, x, x^2, \ldots\}$ | Sinusoids $\{\sin nx, \cos nx\}$ |
| Locality | Local: describes $f$ near $a$ | Global: describes $f$ over $[-\pi, \pi]$ |
| Best for | Analytic functions, approximation | Periodic functions, signal processing |
| Convergence | In the disk $\lvert x-a\rvert < R$ | In $L^2$ norm (pointwise a.e. for $L^2$ functions) |
| Derivatives | Shift polynomial degrees | Multiply coefficients by $n$ (frequency) |
| ML use | Optimizer derivation, GELU, attention | Positional encoding, spectral analysis |

Both are special cases of harmonic analysis on different groups:

  • Taylor series: the group (R,+)(\mathbb{R}, +) near 0 (polynomial approximation)
  • Fourier series: the circle group T=R/2πZ\mathbb{T} = \mathbb{R}/2\pi\mathbb{Z} (periodic structure)
  • Fourier transform (on R\mathbb{R}): the full translation group

Appendix L: Advanced Convergence Topics

L.1 Cauchy Sequences and Completeness

Definition: A sequence (an)(a_n) is a Cauchy sequence if for every ϵ>0\epsilon > 0, there exists NN such that for all m,n>Nm, n > N: aman<ϵ|a_m - a_n| < \epsilon.

Intuitively: the terms get arbitrarily close to each other as nn \to \infty.

Theorem (Completeness of R\mathbb{R}): Every Cauchy sequence in R\mathbb{R} converges. (Equivalently: R\mathbb{R} is a complete metric space.)

This is the formal statement of the fact that R\mathbb{R} has "no gaps". It is what distinguishes R\mathbb{R} from Q\mathbb{Q}: the sequence 3,3.1,3.14,3.141,3, 3.1, 3.14, 3.141, \ldots (rational approximations to π\pi) is Cauchy in Q\mathbb{Q} but does not converge in Q\mathbb{Q} (since πQ\pi \notin \mathbb{Q}). It does converge in R\mathbb{R}.

For ML: Gradient descent sequences are sometimes analyzed as Cauchy sequences - if parameter updates become arbitrarily small, the sequence is Cauchy in the complete space Rd\mathbb{R}^d and therefore converges (to some local minimum or saddle point).

L.2 Absolute Convergence and Rearrangements

Theorem (Absolute convergence -> any rearrangement converges to same sum): If an\sum a_n converges absolutely to SS, then any rearrangement aσ(n)\sum a_{\sigma(n)} also converges to SS.

Proof sketch: For any ϵ>0\epsilon > 0, choose NN so that n>Nan<ϵ/2\sum_{n>N}|a_n| < \epsilon/2. Any rearrangement that keeps the first NN terms (in possibly different order) agrees with the original sum to within ϵ\epsilon. \square

Theorem (Riemann Rearrangement): If an\sum a_n converges conditionally, then for any L[,]L \in [-\infty, \infty], there exists a rearrangement converging to LL.

Proof idea: Separate positive and negative parts: pn=max(an,0)p_n = \max(a_n, 0) and qn=max(an,0)q_n = \max(-a_n, 0). Conditional convergence implies pn=qn=\sum p_n = \sum q_n = \infty (both diverge). To get sum LL: greedily add positive terms until partial sum exceeds LL, then negative terms until below LL, and repeat. Since pn,qn0p_n, q_n \to 0, the oscillations decrease and the series converges to LL. \square

Concrete example: The alternating harmonic series 11/2+1/31/4+=ln21 - 1/2 + 1/3 - 1/4 + \cdots = \ln 2.

A rearrangement (two positives, one negative): 1+1/31/2+1/5+1/71/4+=32ln21 + 1/3 - 1/2 + 1/5 + 1/7 - 1/4 + \cdots = \frac{3}{2}\ln 2.

This is not a curiosity - it shows that the value of a conditionally convergent series is an artifact of the order of summation.
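The greedy construction from the proof is easy to simulate. A sketch steering the rearranged alternating harmonic series toward an arbitrary target:

```python
import math

# Rearrange 1 - 1/2 + 1/3 - ... (= ln 2) to converge to 0.3 instead.
target = 0.3
pos = iter(1.0 / n for n in range(1, 10**7, 2))    # positive terms 1, 1/3, 1/5, ...
neg = iter(-1.0 / n for n in range(2, 10**7, 2))   # negative terms -1/2, -1/4, ...

s = 0.0
for _ in range(100_000):
    s += next(pos) if s <= target else next(neg)   # greedy: overshoot, then correct

print(f"rearranged sum ~ {s:.6f} (target 0.3; original order gives ln 2 = {math.log(2):.6f})")
```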

L.3 Power Series at the Boundary: Abel's Theorem

Theorem (Abel): If n=0cnRn\sum_{n=0}^\infty c_n R^n converges (where RR is the radius of convergence), then:

limxRn=0cnxn=n=0cnRn\lim_{x \to R^-} \sum_{n=0}^\infty c_n x^n = \sum_{n=0}^\infty c_n R^n

In other words: if the power series converges at an endpoint, the function is continuous there from the inside.

Example: ln(1+x)=(1)n+1xn/n\ln(1+x) = \sum (-1)^{n+1}x^n/n for x<1|x| < 1. This series converges at x=1x=1 (alternating harmonic). By Abel's theorem:

ln2=limx1n=1(1)n+1xnn=n=1(1)n+1n=112+1314+\ln 2 = \lim_{x\to 1^-} \sum_{n=1}^\infty \frac{(-1)^{n+1}x^n}{n} = \sum_{n=1}^\infty \frac{(-1)^{n+1}}{n} = 1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \cdots

This justifies the formula ln2=11/2+1/31/4+\ln 2 = 1 - 1/2 + 1/3 - 1/4 + \cdots from the power series perspective.

L.4 Summability Methods

Sometimes a divergent series can be assigned a meaningful sum via a summability method:

Cesàro summability: $\lim_{N\to\infty} \frac{1}{N}\sum_{n=1}^N S_n = L$. The series $1 - 1 + 1 - 1 + \cdots$ has partial sums $1, 0, 1, 0, \ldots$ with running average $\to 1/2$. So it is Cesàro summable to $1/2$.

Abel summability: limx1anxn=L\lim_{x\to 1^-} \sum a_n x^n = L. For (1)n\sum (-1)^n: (x)n=1/(1+x)1/2\sum (-x)^n = 1/(1+x) \to 1/2 as x1x \to 1^-. Abel sum =1/2= 1/2.

For ML: The value 11+11+=1/21 - 1 + 1 - 1 + \cdots = 1/2 appears in zeta function regularization, used in string theory and occasionally in neural network theory. The regularized sum n=1n=1/12\sum_{n=1}^\infty n = -1/12 (via analytic continuation of the Riemann zeta function) appears in Casimir effect calculations and dimensional regularization in physics.


Appendix M: Worked Examples - Taylor Series Applications

M.1 Computing ee to High Precision

The series e=n=01/n!e = \sum_{n=0}^\infty 1/n! converges extremely fast (ratio test: ratio =1/(n+1)0= 1/(n+1) \to 0).

Error after NN terms: eTN=n=N+11/n!1(N+1)!111/(N+2)1(N+1)!|e - T_N| = \sum_{n=N+1}^\infty 1/n! \le \frac{1}{(N+1)!}\cdot\frac{1}{1 - 1/(N+2)} \approx \frac{1}{(N+1)!}

For N=10N = 10: error 1/11!2.5×108\le 1/11! \approx 2.5 \times 10^{-8}. For N=20N = 20: error 1/21!1019\le 1/21! \approx 10^{-19}.

This explains why float64 (15 significant digits) is enough: just N=17N=17 terms suffice.
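A sketch verifying the remainder bound term by term:

```python
import math

def e_partial(N):
    term, s = 1.0, 1.0
    for n in range(1, N + 1):
        term /= n              # term is now 1/n!
        s += term
    return s

for N in [5, 10, 17]:
    err = abs(e_partial(N) - math.e)
    bound = 1 / math.factorial(N + 1)
    print(f"N={N:>2}: error={err:.3e}  bound 1/(N+1)! = {bound:.3e}")
```

At $N = 17$ the truncation error falls at or below float64 resolution, so the measured error reflects rounding rather than truncation.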

M.2 Taylor Series for Numerical Integration

Integrals like 01ex2dx\int_0^1 e^{-x^2}\,dx have no elementary antiderivative, but the series gives:

01ex2dx=01n=0(1)nx2nn!dx=n=0(1)nn!(2n+1)\int_0^1 e^{-x^2}\,dx = \int_0^1 \sum_{n=0}^\infty \frac{(-1)^n x^{2n}}{n!}\,dx = \sum_{n=0}^\infty \frac{(-1)^n}{n!(2n+1)}

This alternating series has error bounded by the first omitted term. The first 10 terms (through $n = 9$):

$$1 - \tfrac{1}{3} + \tfrac{1}{10} - \tfrac{1}{42} + \tfrac{1}{216} - \tfrac{1}{1320} + \tfrac{1}{9360} - \tfrac{1}{75600} + \tfrac{1}{685440} - \tfrac{1}{6894720} \approx 0.74682412$$

give 8-digit accuracy (error below $\frac{1}{10!\cdot 21} \approx 1.3 \times 10^{-8}$). For 10 significant digits, the first omitted term must satisfy $\frac{1}{n!(2n+1)} < 10^{-10}$, which first holds at $n = 12$ - so 12 terms suffice.

True value: $\frac{\sqrt\pi}{2}\,\text{erf}(1) \approx 0.7468241328$.
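A sketch of the term-by-term computation:

```python
import math

def series_integral(n_terms):
    """Partial sum of sum_n (-1)^n / (n! (2n+1)) for the integral of e^{-x^2} on [0,1]."""
    return sum((-1) ** n / (math.factorial(n) * (2 * n + 1)) for n in range(n_terms))

true = math.sqrt(math.pi) / 2 * math.erf(1.0)
for terms in [5, 10, 12]:
    approx = series_integral(terms)
    print(f"{terms:>2} terms: {approx:.12f}  error {abs(approx - true):.1e}")
```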

M.3 L'Hôpital's Rule via Taylor Series

Limits of the form 0/00/0 are often cleaner via Taylor series:

limx0sinxxx3=limx0(xx3/6+x5/120)xx3=limx0x3/6+O(x5)x3=16\lim_{x\to 0}\frac{\sin x - x}{x^3} = \lim_{x\to 0}\frac{(x - x^3/6 + x^5/120 - \cdots) - x}{x^3} = \lim_{x\to 0}\frac{-x^3/6 + O(x^5)}{x^3} = -\frac{1}{6} limx0ex1xx2=limx0x2/2+O(x3)x2=12\lim_{x\to 0}\frac{e^x - 1 - x}{x^2} = \lim_{x\to 0}\frac{x^2/2 + O(x^3)}{x^2} = \frac{1}{2} limx01cosxx2=limx0x2/2x4/24+x2=12\lim_{x\to 0}\frac{1-\cos x}{x^2} = \lim_{x\to 0}\frac{x^2/2 - x^4/24 + \cdots}{x^2} = \frac{1}{2}

The Taylor series approach reveals the rate of vanishing explicitly - it shows not just that the limit exists but also the leading behavior, which L'Hôpital's rule alone does not.

For ML: This technique is used to analyze the behavior of loss functions near saddle points and to derive learning rate bounds from second-order Taylor expansions.

M.4 Stability of Softmax Computed via log-sum-exp

Standard softmax: softmax(x)[i] = exp(x[i]) / sum(exp(x)).

For x=[1000,1001,1002]x = [1000, 1001, 1002] in float32: exp(1002) overflows. Result: nan.

Log-sum-exp stable version: subtract m=max(x)=1002m = \max(x) = 1002:

LSE(x)=1002+ln(e2+e1+e0)=1002+ln(e2+e1+1)\text{LSE}(x) = 1002 + \ln(e^{-2} + e^{-1} + e^0) = 1002 + \ln(e^{-2} + e^{-1} + 1)

Numerically: e20.135e^{-2} \approx 0.135, e10.368e^{-1} \approx 0.368, e0=1e^0 = 1. Sum 1.503\approx 1.503. ln(1.503)0.408\ln(1.503) \approx 0.408.

$\text{LSE}(x) \approx 1002.408$. Softmax values: $e^{x_i - \text{LSE}(x)} = e^{-2.408}, e^{-1.408}, e^{-0.408} \approx 0.090, 0.245, 0.665$.

The Taylor connection: LSE(x)=m+ln(1+iiexim)m+iiexim\text{LSE}(x) = m + \ln(1 + \sum_{i \ne i^*} e^{x_i - m}) \approx m + \sum_{i \ne i^*} e^{x_i - m} (first-order Taylor of ln(1+ϵ)\ln(1+\epsilon) for small ϵ\epsilon) - valid when the max logit dominates by a large margin.
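A sketch of the stable computation next to the naive one:

```python
import numpy as np

x = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)

naive = np.exp(x) / np.exp(x).sum()              # exp(1000) overflows -> inf/inf = nan
m = x.max()
stable = np.exp(x - m) / np.exp(x - m).sum()     # subtract the max first

print("naive: ", naive)                          # [nan nan nan] (plus overflow warnings)
print("stable:", stable)                         # [0.090 0.245 0.665]
```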


Appendix N: Connections to Advanced Topics

N.1 ζ\zeta Function and the Riemann Hypothesis

The Riemann zeta function extends the p-series to complex ss:

ζ(s)=n=11ns,Re(s)>1\zeta(s) = \sum_{n=1}^\infty \frac{1}{n^s}, \quad \text{Re}(s) > 1

This converges for $\text{Re}(s) > 1$ (by the real p-series result). Riemann's 1859 paper showed that $\zeta(s)$ can be analytically continued to all of $\mathbb{C} \setminus \{1\}$ via its functional equation.

Special values:

  • ζ(2)=π2/6\zeta(2) = \pi^2/6 (Basel problem)
  • ζ(4)=π4/90\zeta(4) = \pi^4/90
  • ζ(2n)=(1)n+1(2π)2nB2n2(2n)!\zeta(2n) = (-1)^{n+1}\frac{(2\pi)^{2n}B_{2n}}{2(2n)!} (Bernoulli numbers)
  • ζ(1)=1/12\zeta(-1) = -1/12 (regularized value; related to n=1/12\sum n = -1/12 by analytic continuation)

The Riemann Hypothesis: all non-trivial zeros of ζ(s)\zeta(s) have Re(s)=1/2\text{Re}(s) = 1/2. Unsolved since 1859, it is one of the Millennium Prize Problems (worth $1M).

Connections to ML: Spectral methods in deep learning analysis (neural tangent kernel theory) involve the spectrum of infinite-width limits, which connect to LL-function theory. The random matrix theory of neural network weight spectra has connections to the distribution of Riemann zeta zeros.

N.2 Spectral Bias of Neural Networks

Theorem (Rahaman et al., 2019): Neural networks with smooth activations learn low-frequency components of the target function before high-frequency components.

Fourier analysis explanation: If the target function has Fourier decomposition f(x)=kckeikxf(x) = \sum_k c_k e^{ikx}, a neural network with gradient flow tfNN=NNL\partial_t f_{NN} = -\nabla_{NN}\mathcal{L} learns the kk-th frequency component at rate proportional to the kk-th Fourier coefficient of the NTK (Neural Tangent Kernel).

For standard activations, the NTK decays with frequency kk, so low frequencies kk are learned first. This spectral bias (also called "frequency principle") has practical implications:

  • High-frequency functions (sharp edges, textures) require more training
  • Random Fourier features can overcome this for specific tasks
  • This is why position encodings must explicitly inject high-frequency sinusoids (the transformer couldn't learn high-frequency positional patterns from data alone)

N.3 Nonlinear Series: Fixed-Point Theorems

Many iterative algorithms in ML produce sequences converging to a fixed point. The key tool is:

Banach Fixed-Point Theorem (Contraction Mapping): If T:XXT: X \to X is a contraction on a complete metric space (i.e., d(Tx,Ty)qd(x,y)d(Tx, Ty) \le q \cdot d(x,y) for q<1q < 1), then TT has a unique fixed point x=Txx^* = Tx^*, and the iteration xn+1=Txnx_{n+1} = Tx_n converges to xx^* with geometric rate qnq^n.

Application - Bellman equations: The Bellman backup operator TT in reinforcement learning satisfies TVTWγVW\|TV - TW\|_\infty \le \gamma\|V - W\|_\infty with contraction rate γ<1\gamma < 1. By Banach's theorem, the value iteration Vn+1=TVnV_{n+1} = TV_n converges to the unique optimal value function VV^*.

The convergence rate γn\gamma^n is itself a geometric series - each policy evaluation step reduces the error by factor γ\gamma.
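A sketch of this geometric convergence on a toy single-action MDP (hypothetical rewards and transition matrix):

```python
import numpy as np

gamma = 0.9
R = np.array([1.0, 0.0])                   # reward in each of 2 states
P = np.array([[0.5, 0.5], [0.2, 0.8]])     # transition probabilities

# With one action, the Bellman operator T V = R + gamma P V is linear,
# so its fixed point solves (I - gamma P) V* = R.
V_star = np.linalg.solve(np.eye(2) - gamma * P, R)

V = np.zeros(2)
for it in range(1, 31):
    V = R + gamma * P @ V                  # one Bellman backup
    if it % 10 == 0:
        err = np.abs(V - V_star).max()
        bound = gamma**it * np.abs(V_star).max()
        print(f"iter {it:>2}: error={err:.2e}  bound gamma^k*||V*||={bound:.2e}")
```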

N.4 Matrix Exponential and Lie Groups

For a square matrix AA, the matrix exponential is defined by the same power series:

eA=n=0Ann!=I+A+A22!+A33!+e^A = \sum_{n=0}^\infty \frac{A^n}{n!} = I + A + \frac{A^2}{2!} + \frac{A^3}{3!} + \cdots

This series always converges (operator norm satisfies AnAn\|A^n\| \le \|A\|^n, so the series is bounded by the scalar exponential eAe^{\|A\|}).
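A sketch computing $e^A$ by truncating the series and checking it against SciPy (which uses the more robust scaling-and-squaring method):

```python
import numpy as np
from scipy.linalg import expm

def expm_series(A, n_terms=30):
    """Partial sums of sum_n A^n / n! - fine for small ||A||."""
    result = np.eye(len(A))
    term = np.eye(len(A))
    for n in range(1, n_terms):
        term = term @ A / n          # term is now A^n / n!
        result = result + term
    return result

theta = 0.5
A = np.array([[0.0, -theta], [theta, 0.0]])    # skew-symmetric generator of so(2)
print(expm_series(A))                          # [[cos t, -sin t], [sin t, cos t]]
print(np.allclose(expm_series(A), expm(A)))    # True
```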

Applications in ML:

  • Equivariant networks: The group of rotations SO(3)SO(3) has a Lie algebra so(3)\mathfrak{so}(3) (skew-symmetric matrices). The exponential map exp:so(3)SO(3)\exp: \mathfrak{so}(3) \to SO(3) parameterizes rotations via power series. Used in SE(3)-equivariant networks (FrameDiff, AlphaFold2 structure module).
  • Continuous normalizing flows: The flow z(t)=etAz(0)z(t) = e^{tA}z(0) for a learned AA; the trace tr(A)=iλi\text{tr}(A) = \sum_i \lambda_i controls the log-determinant.
  • Mamba / state space models: The discrete-time recurrence (Ak,Bk)(A_k, B_k) is derived from continuous matrices via matrix exponential with zero-order hold.

Appendix O: Final Reference and Checklist

O.1 Essential Formulas at a Glance

Geometric series:

n=0rn=11r(r<1),n=0N1rn=1rN1r\sum_{n=0}^\infty r^n = \frac{1}{1-r} \quad (|r|<1), \qquad \sum_{n=0}^{N-1} r^n = \frac{1-r^N}{1-r}

p-series criterion: 1/np\sum 1/n^p converges     p>1\iff p > 1

Radius of convergence: R=limncncn+1R = \lim_{n\to\infty} \left|\frac{c_n}{c_{n+1}}\right| (ratio method); R=1/lim supcn1/nR = 1/\limsup |c_n|^{1/n} (root method)

Taylor series: f(x)=n=0f(n)(a)n!(xa)nf(x) = \sum_{n=0}^\infty \frac{f^{(n)}(a)}{n!}(x-a)^n

Lagrange error bound: Rn(x)M(n+1)!xan+1|R_n(x)| \le \frac{M}{(n+1)!}|x-a|^{n+1} where M=maxf(n+1)M = \max|f^{(n+1)}|

Euler's formula: eiθ=cosθ+isinθe^{i\theta} = \cos\theta + i\sin\theta

Key Maclaurin series:

  • ex=xn/n!e^x = \sum x^n/n! (all xx)
  • sinx=(1)nx2n+1/(2n+1)!\sin x = \sum (-1)^n x^{2n+1}/(2n+1)! (all xx)
  • cosx=(1)nx2n/(2n)!\cos x = \sum (-1)^n x^{2n}/(2n)! (all xx)
  • ln(1+x)=(1)n+1xn/n\ln(1+x) = \sum (-1)^{n+1}x^n/n (1<x1-1 < x \le 1)
  • 1/(1x)=xn1/(1-x) = \sum x^n (x<1|x| < 1)

Adam bias correction (geometric series):

E[mt]=(1βt)g,m^t=mt1βt\mathbb{E}[m_t] = (1-\beta^t)g, \quad \hat{m}_t = \frac{m_t}{1-\beta^t}

Linear attention (Taylor approximation):

eqk1+qk,error(qk)22eqke^{q^\top k} \approx 1 + q^\top k, \quad \text{error} \le \frac{(q^\top k)^2}{2} e^{|q^\top k|}

Laplace approximation:

p(θD)N ⁣(θ^,[2logp(θ^D)]1)p(\theta|\mathcal{D}) \approx \mathcal{N}\!\left(\hat\theta, \left[-\nabla^2\log p(\hat\theta|\mathcal{D})\right]^{-1}\right)

O.2 Final Self-Assessment

After completing this entire section, you should be able to:

Prove from scratch:

  • Harmonic series diverges (Oresme grouping)
  • Ratio test (convergence case)
  • Alternating series test (Leibniz)
  • Taylor's theorem with Lagrange remainder
  • Adam bias correction derivation using geometric series

Compute fluently:

  • Sums of geometric and telescoping series
  • Radius and interval of convergence for power series
  • Taylor/Maclaurin series for standard functions
  • Approximation errors using Lagrange remainder
  • New series via substitution, differentiation, integration

Explain to a colleague:

  • Why conditional convergence allows Riemann rearrangement
  • How linear attention reduces complexity using Taylor series
  • Why softmax becomes uniform at high temperature
  • What the Laplace approximation approximates and when it's valid
  • How RoPE uses complex exponentials for relative position encoding

Appendix P: Connections to Adjacent Sections

P.1 <- 04/01 Limits and Continuity

The entire theory of sequences is the theory of limits restricted to integer inputs. Every result about sequence convergence is a special case of the limit definition from 01:

  • limnan=L\lim_{n\to\infty} a_n = L is the epsilon-delta limit with xx \to \infty and δ\delta replaced by NN
  • The Squeeze Theorem for sequences is the Squeeze Theorem from 01 restricted to integers
  • Continuous functions preserve sequence limits: anL    f(an)f(L)a_n \to L \implies f(a_n) \to f(L) - this is the sequential characterization of continuity from 01

Formal link: A function f:RRf: \mathbb{R} \to \mathbb{R} is continuous at aa if and only if for every sequence xnax_n \to a, we have f(xn)f(a)f(x_n) \to f(a). This makes sequences a fundamental tool for studying continuous functions.

P.2 <- 04/02 Derivatives and Differentiation

Taylor series are the direct children of differentiation:

  • The nn-th Taylor coefficient is f(n)(a)/n!f^{(n)}(a)/n!, requiring nn derivatives at aa
  • The Lagrange remainder involves the (n+1)(n+1)-th derivative
  • L'Hôpital's rule (for $0/0$ limits) is superseded by Taylor series analysis, which gives the limit value AND the rate of approach
  • The derivative of a power series is a power series (via term-by-term differentiation) - justified by uniform convergence

P.3 <- 04/03 Integration

Integration provides two key tools for series theory:

  1. Integral test: an\sum a_n converges iff fdx\int f\,dx converges - directly using the improper integrals from 03
  2. Term-by-term integration: cnxndx=cnxn+1/(n+1)\int \sum c_n x^n\,dx = \sum c_n x^{n+1}/(n+1) - used to derive ln(1+x)\ln(1+x) from 1/(1+x)1/(1+x) and arctanx\arctan x from 1/(1+x2)1/(1+x^2)

The connection is deep: both integration and summation are instances of the same "linear functional on functions" idea from functional analysis.

P.4 -> 05 Multivariate Calculus

The multivariable Taylor theorem (the most important formula in optimization theory) is the direct extension of the univariate Taylor series:

f(x+h)=f(x)+f(x)h+12hH(x)h+O(h3)f(\mathbf{x} + \mathbf{h}) = f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \mathbf{h} + \frac{1}{2}\mathbf{h}^\top H(\mathbf{x})\mathbf{h} + O(\|\mathbf{h}\|^3)

Every gradient descent convergence proof, every Newton method derivation, and every trust region method is an application of this formula. The radius of convergence concept extends to a "trust region" around the expansion point.

-> Full treatment: 05 Multivariate Calculus

P.5 -> 06 Probability and Statistics

Generating functions, characteristic functions, and moment generating functions are all power series / Fourier transforms in disguise:

  • MGF: MX(t)=E[etX]=nE[Xn]tn/n!M_X(t) = \mathbb{E}[e^{tX}] = \sum_n \mathbb{E}[X^n]t^n/n!
  • PGF: GX(z)=E[zX]=npnznG_X(z) = \mathbb{E}[z^X] = \sum_n p_n z^n (discrete distributions)
  • Characteristic function: ϕX(t)=E[eitX]\phi_X(t) = \mathbb{E}[e^{itX}] (Fourier transform of density)

The Central Limit Theorem proof via characteristic functions uses Taylor expansion of logϕX(t)\log\phi_X(t) and the fact that the Gaussian characteristic function is et2/2e^{-t^2/2}.

-> Full treatment: 06 Probability and Statistics


Appendix Q: Additional Exercises with Full Solutions

Q.1 Power Series Manipulation

Problem: Starting from 11x=n=0xn\frac{1}{1-x} = \sum_{n=0}^\infty x^n, derive: (a) 1(1x)2=n=1nxn1\frac{1}{(1-x)^2} = \sum_{n=1}^\infty n x^{n-1} (b) x(1x)2=n=1nxn\frac{x}{(1-x)^2} = \sum_{n=1}^\infty n x^n (c) ln(2)=11/2+1/31/4+\ln(2) = 1 - 1/2 + 1/3 - 1/4 + \cdots (alternating harmonic series)

Solutions:

(a) Differentiate xn=1/(1x)\sum x^n = 1/(1-x) term-by-term (valid for x<1|x| < 1):

ddxn=0xn=n=1nxn1=ddx11x=1(1x)2\frac{d}{dx}\sum_{n=0}^\infty x^n = \sum_{n=1}^\infty nx^{n-1} = \frac{d}{dx}\frac{1}{1-x} = \frac{1}{(1-x)^2}

(b) Multiply (a) by xx: n=1nxn=x(1x)2\sum_{n=1}^\infty nx^n = \frac{x}{(1-x)^2}.

(c) From ln(1+x)=n=1(1)n+1xnn\ln(1+x) = \sum_{n=1}^\infty \frac{(-1)^{n+1}x^n}{n} (derived by integrating 1/(1+x)1/(1+x)). At x=1x=1: the series (1)n+1/n\sum (-1)^{n+1}/n converges by the alternating series test (1/n1/n is decreasing to 0). By Abel's theorem (continuity at boundary), ln(1+1)=ln2=n=1(1)n+1/n=11/2+1/3\ln(1+1) = \ln 2 = \sum_{n=1}^\infty (-1)^{n+1}/n = 1 - 1/2 + 1/3 - \cdots.

Q.2 Convergence Test Practice

Determine convergence of n=1(3)nn22n\sum_{n=1}^\infty \frac{(-3)^n}{n^2 \cdot 2^n}.

Solution: Rewrite as $\sum (-3/2)^n / n^2$. First check absolute convergence: does $\sum (3/2)^n/n^2$ converge?

Ratio test: $\frac{(3/2)^{n+1}/(n+1)^2}{(3/2)^n/n^2} = \frac{3}{2} \cdot \frac{n^2}{(n+1)^2} \to \frac{3}{2} > 1$.

So the series does not converge absolutely. There is no conditional convergence to salvage either: $b_n = (3/2)^n/n^2 \to \infty$ (exponential growth beats the polynomial), so the terms do not tend to 0 at all. By the divergence test, the series diverges.

Q.3 Full Taylor Expansion of arctan(x)\arctan(x)

Derive: arctanx=n=0(1)nx2n+12n+1\arctan x = \sum_{n=0}^\infty \frac{(-1)^n x^{2n+1}}{2n+1} for x1|x| \le 1.

Solution: Start from 11+t2=n=0(1)nt2n\frac{1}{1+t^2} = \sum_{n=0}^\infty (-1)^n t^{2n} (geometric series with r=t2r = -t^2, valid for t<1|t| < 1).

Integrate from 0 to xx:

arctanx=0xdt1+t2=0xn=0(1)nt2ndt=n=0(1)nx2n+12n+1\arctan x = \int_0^x \frac{dt}{1+t^2} = \int_0^x \sum_{n=0}^\infty (-1)^n t^{2n}\,dt = \sum_{n=0}^\infty (-1)^n \frac{x^{2n+1}}{2n+1}

Valid for x<1|x| < 1. At x=1x = 1: (1)n/(2n+1)\sum (-1)^n/(2n+1) converges by alternating series test. At x=1x = -1: also converges. By Abel's theorem, the formula holds on [1,1][-1, 1].

Leibniz formula for π\pi: Setting x=1x = 1:

π4=113+1517+\frac{\pi}{4} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots

This converges, but VERY slowly (error after NN terms 1/(2N+1)\approx 1/(2N+1); need N5×109N \approx 5 \times 10^9 for 10 digits). In practice, π\pi is computed via faster-converging formulas (Machin's: π/4=4arctan(1/5)arctan(1/239)\pi/4 = 4\arctan(1/5) - \arctan(1/239)).
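A sketch comparing the two convergence rates:

```python
import math

def leibniz_pi(n_terms):
    return 4 * sum((-1) ** n / (2 * n + 1) for n in range(n_terms))

def arctan_series(x, n_terms):
    return sum((-1) ** n * x ** (2 * n + 1) / (2 * n + 1) for n in range(n_terms))

def machin_pi(n_terms):
    return 4 * (4 * arctan_series(1 / 5, n_terms) - arctan_series(1 / 239, n_terms))

print(f"Leibniz, 10^6 terms: error {abs(leibniz_pi(10**6) - math.pi):.1e}")
print(f"Machin,  10 terms:   error {abs(machin_pi(10) - math.pi):.1e}")
```

Each Machin term shrinks by at least a factor of 25, so ten terms already reach float precision.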


Appendix R: Series in Numerical Computing

R.1 Float Arithmetic and Series Truncation

Every floating-point number is represented with finite precision. When computing a power series numerically, two errors compete:

  1. Truncation error: stopping at NN terms instead of \infty. Bounded by Lagrange remainder.
  2. Rounding error: each term cnxnc_n x^n is rounded to nearest float. Accumulates with NN terms.

The optimal NN balances these - adding more terms reduces truncation error but eventually increases rounding error. For float64 with ϵmach1016\epsilon_{\text{mach}} \approx 10^{-16}:

Optimal $N$ for $e^x$ near $x=0$: truncation error $\approx |x|^{N+1}/(N+1)!$; rounding error $\approx N\epsilon_{\text{mach}} e^{|x|}$. Setting the two equal for $|x| = 1$ gives optimal $N \approx 17$ (consistent with float64's 15-17 significant digits).

For large |x|: the terms x^n/n! grow in magnitude before shrinking, and for negative x they alternate in sign, so many large terms of opposite sign must cancel - catastrophic cancellation. Solution: argument reduction, e.g. repeated use of exp(x) = exp(x/2)^2, to bring the argument near 0 before applying the series.
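
A sketch of the failure mode and one simple fix: sum the series at +|x|, where all terms are positive, then take the reciprocal (the squaring reduction above works similarly by shrinking the argument). Direct summation at x = -30 gives garbage because the terms peak near 8 x 10^11 before cancelling:

import math

def exp_series(x, N=200):
    # Taylor partial sum for e^x; N=200 is far past where terms vanish for |x| = 30
    term = s = 1.0
    for n in range(1, N + 1):
        term *= x / n
        s += term
    return s

x = -30.0
print(exp_series(x))          # wrong: catastrophic cancellation
print(1.0 / exp_series(-x))   # ~9.36e-14, correct to full precision
print(math.exp(x))            # reference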

R.2 Compensated Summation (Kahan Algorithm)

When summing many floating-point numbers, naive summation accumulates O(n\varepsilon) error. The Kahan compensated sum tracks the lost low-order bits:

def kahan_sum(values):
    """Sum values with O(eps) error, independent of len(values)."""
    s = 0.0
    c = 0.0              # running compensation (lost low-order bits)
    for v in values:
        y = v - c        # apply the correction carried from the last step
        t = s + y        # add; the low-order bits of y may be lost here
        c = (t - s) - y  # (t - s) is what was actually added; c is the error
        s = t
    return s

This reduces the error to O(\varepsilon) regardless of n. Libraries like NumPy use pairwise summation (divide-and-conquer), achieving O(\varepsilon \log n) error - a good balance.
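
A quick check, using kahan_sum from above (math.fsum serves as an exactly rounded reference; the array size is illustrative):

import math

values = [0.1] * 1_000_000                # 0.1 is not exactly representable in binary
naive = sum(values)
exact = math.fsum(values)                 # correctly rounded sum

print(abs(naive - exact))                 # ~1e-6: visible drift from accumulated rounding
print(abs(kahan_sum(values) - exact))     # 0.0 (or a single ulp)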

For ML: in mixed-precision training (float16 weights, float32 accumulation), the summation algorithm in the optimizer matters. Kahan summation in the gradient accumulation step can significantly reduce precision loss.

R.3 Euler's Transform for Alternating Series Acceleration

The alternating harmonic series \sum (-1)^{n+1}/n = \ln 2 converges slowly. Euler's transform accelerates convergence:

\sum_{n=0}^\infty (-1)^n a_n = \sum_{n=0}^\infty (-1)^n \frac{\Delta^n a_0}{2^{n+1}}

where \Delta^n a_0 are forward finite differences: \Delta^0 a_k = a_k, \Delta^{n+1} a_k = \Delta^n a_{k+1} - \Delta^n a_k.

For a_n = 1/(n+1) (the alternating harmonic series):

  • \Delta^0 a_0 = 1, \Delta^1 a_0 = 1/2 - 1 = -1/2, \Delta^2 a_0 = \Delta^1 a_1 - \Delta^1 a_0 = (1/3 - 1/2) - (1/2 - 1) = 1/3; in general, \Delta^n a_0 = (-1)^n/(n+1).

The transformed series becomes \sum_{n=0}^\infty \frac{1}{(n+1)2^{n+1}}, which converges geometrically - roughly one bit of accuracy per term, versus error \approx 1/N for the original. Euler-type acceleration is used in high-precision computation of \ln 2, \pi, and other constants.
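
A minimal sketch for ln 2, using the closed form \Delta^n a_0 = (-1)^n/(n+1) derived above (the term count is illustrative):

import math

N = 20

# Direct partial sum of the alternating harmonic series: error ~ 1/(2N)
direct = sum((-1)**(n + 1) / n for n in range(1, N + 1))

# Euler transform: each term is (-1)^n * Delta^n a_0 / 2^(n+1) = 1/((n+1) * 2^(n+1))
euler = sum(1.0 / ((n + 1) * 2**(n + 1)) for n in range(N))

print(abs(direct - math.log(2)))   # ~2e-2
print(abs(euler - math.log(2)))    # ~2e-8 - geometric convergence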


Appendix S: Topic Map and Reading Guide

S.1 Minimum Viable Reading

For an ML practitioner who needs only the most practically relevant material:

  1. 1.3: Why series matter for AI (overview table)
  2. 3.2: Geometric series (Adam bias correction, discounted rewards)
  3. 4.4: Ratio test (to understand radius of convergence)
  4. 5.2: Radius of convergence (when to trust a Taylor approximation)
  5. 6.1-6.3: Taylor series derivation and Lagrange error bounds
  6. 7.1-7.6: All ML applications in full
  7. 8.3: RoPE and positional encodings

S.2 Theoretical Deep Dive

For a reader interested in rigorous analysis:

  1. 2.2: epsilon-N definition of sequence convergence (in full)
  2. 2.3: Monotone Convergence Theorem (proof)
  3. 4: All convergence tests (with proofs in Appendix A)
  4. 5: Power series theory, radius of convergence, uniform convergence (App. B)
  5. 6.4: When does the Taylor series converge to f? (Analytic functions)
  6. Appendix L: Cauchy sequences, Riemann rearrangement, Abel's theorem
  7. Appendix N: Zeta function, spectral bias, Banach fixed-point theorem

S.3 Prerequisites by Section

Section              Requires
2 Sequences          04/01 Limits (epsilon-delta definition)
3 Infinite Series    2 (convergence of sequences)
4 Convergence Tests  3.1 (partial sums), 04/03 Integration (integral test)
5 Power Series       4 (convergence tests), especially ratio test
6 Taylor Series      04/02 Derivatives (all orders), 5 (power series)
7 ML Applications    6 (Taylor series), 04/03 (integration)
8 Fourier Preview    04/03 (integration, orthogonality integrals)
Appendix B           5 (uniform convergence motivation)
Appendix C           Complex numbers (basic)
Appendix D           6 (generating functions), 04/03 probability
Appendix N           6 (Taylor), 04/01 (limits), some complex analysis