Math for LLMs

Calculus Fundamentals / Limits and Continuity

Limits and Continuity

"The concept of the limit is the cornerstone on which the whole of mathematical analysis ultimately rests."

  • Aleksandr Khinchin, Mathematical Foundations of Information Theory (1957)

Overview

Limits formalize the intuition of "approaching" - what value does a function tend toward as its input gets arbitrarily close to some point? This deceptively simple question required two centuries to answer rigorously, from Newton and Leibniz's intuitive infinitesimals to Cauchy's sequential definition to Weierstrass's definitive epsilon-delta framework. The result is the foundation upon which all of calculus - and by extension, all of continuous optimization - is built.

Continuity is the qualitative property that emerges from limits behaving well: a function is continuous at a point if its limit there equals its actual value. Continuous functions are the "nice" functions of analysis - they preserve neighborhoods, satisfy the Intermediate Value Theorem, and attain extrema on compact sets. In machine learning, continuity is not just a mathematical nicety; it is a design criterion for activation functions, loss landscapes, and optimization trajectories.

For AI practitioners, limits appear in at least five critical contexts: (1) the definition of the derivative as a limit of difference quotients, which underlies all backpropagation; (2) softmax temperature annealing, where $T \to 0$ recovers argmax and $T \to \infty$ recovers uniform; (3) the vanishing gradient problem, where $\sigma'(x) \to 0$ as $x \to \pm\infty$ kills gradient signal; (4) numerical stability, where naive limit computations suffer catastrophic cancellation; and (5) learning rate schedules, where the Robbins-Monro conditions $\sum \alpha_t = \infty$, $\sum \alpha_t^2 < \infty$ guarantee convergence.

Prerequisites

  • Functions: domain, codomain, composition, inverse - 01-Mathematical-Foundations
  • Algebra: polynomial factoring, rationalization, conjugate multiplication
  • Exponential and logarithmic functions: $e^x$, $\ln x$, basic identities
  • Trigonometry: $\sin$, $\cos$, $\tan$ and their basic properties
  • Real number system: $\mathbb{R}$, absolute value $\lvert x \rvert$, the Archimedean property

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Interactive examples: epsilon-delta visualization, limit laws, IVT, discontinuity types, ML applications |
| exercises.ipynb | 10 graded exercises from basic limit computation to gradient-as-limit and cross-entropy |

Learning Objectives

After completing this section, you will be able to:

  1. State the epsilon-delta definition of a limit and verify it for elementary functions
  2. Compute limits using algebraic manipulation, L'Hôpital's Rule, and the Squeeze Theorem
  3. Distinguish one-sided from two-sided limits and determine when each exists
  4. Identify and classify discontinuities as removable, jump, or essential
  5. State and apply the Intermediate Value Theorem and Extreme Value Theorem
  6. Prove the Squeeze Theorem and use it to evaluate $\lim_{x \to 0} \frac{\sin x}{x}$
  7. Recognize indeterminate forms and choose the appropriate resolution technique
  8. Implement numerically stable limit computations using expm1, log1p, and log_softmax
  9. Analyze softmax temperature, sigmoid saturation, and ReLU continuity via limits
  10. Connect the limit definition to the derivative as preparation for 02

Table of Contents

  1. Intuition
  2. Formal Definitions
  3. Fundamental Limits
  4. Computation Techniques
  5. Continuity
  6. Key Theorems
  7. Asymptotic Behavior
  8. Numerical Stability Near Limits
  9. Machine Learning Applications
  10. Common Mistakes
  11. Exercises
  12. Why This Matters for AI (2026 Perspective)
  Appendix A: Extended Proofs and Derivations

1. Intuition

1.1 Approaching Without Arriving

Consider the function $f(x) = \frac{x^2 - 4}{x - 2}$. At $x = 2$, this function is undefined: both numerator and denominator are zero, and division by zero is not permitted. Yet if you evaluate $f$ at values close to $2$ - say $x = 1.9, 1.99, 1.999$ and $x = 2.1, 2.01, 2.001$ - you observe that $f(x)$ gets arbitrarily close to $4$. The function is not defined at $x = 2$, but its behavior near $x = 2$ is perfectly predictable.
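
A quick numeric check (a minimal sketch in plain Python; the helper name `f` is ours) makes this concrete:

```python
# Evaluate f(x) = (x^2 - 4)/(x - 2) at points approaching 2 from both
# sides; f(2) itself would raise ZeroDivisionError.
def f(x):
    return (x**2 - 4) / (x - 2)

for x in [1.9, 1.99, 1.999, 2.001, 2.01, 2.1]:
    print(f"x = {x:<6}  f(x) = {f(x):.6f}")
# Values approach 4 from below and above: 3.9, 3.99, 3.999, 4.001, ...
```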

This is the key insight: a limit describes the behavior of $f(x)$ as $x$ approaches $a$, regardless of what $f(a)$ is (or whether $f(a)$ exists at all). We write:

$$\lim_{x \to a} f(x) = L$$

and read it as: "the limit of $f(x)$ as $x$ approaches $a$ equals $L$."

For the example above:

$$\lim_{x \to 2} \frac{x^2 - 4}{x - 2} = \lim_{x \to 2} \frac{(x-2)(x+2)}{x-2} = \lim_{x \to 2} (x + 2) = 4$$

The cancellation of $(x - 2)$ is valid because we are considering $x$ close to (but not equal to) $2$.

LIMITS: THE CORE PICTURE

[Figure: graph of $f(x) = (x^2-4)/(x-2)$, a straight line with a hole at $x = 2$ and a dashed line marking the limit value $4$. The function is undefined at $x = 2$, but the limit is $4$: as $x \to 2$ from either side, $f(x) \to 4$.]

The informal statement is: $\lim_{x \to a} f(x) = L$ means "we can make $f(x)$ as close to $L$ as we like by taking $x$ sufficiently close to $a$."

The crucial distinction: a limit is about neighborhood behavior, not about what happens at the point itself. This distinction is what makes calculus work: the derivative of $f$ at $a$ is defined as the limit of difference quotients, even though the difference quotient $\frac{f(a+h) - f(a)}{h}$ is never evaluated at $h = 0$.

1.2 Historical Necessity: From Newton to Weierstrass

The limit concept was not discovered all at once - it was forced on mathematicians by the need to make calculus rigorous.

Newton (1666) and Leibniz (1675) independently invented calculus using "infinitesimals" - quantities smaller than any real number but not zero. Their methods worked in practice but were logically incoherent: you cannot simultaneously treat $h$ as nonzero (to divide) and as zero (to drop $h^2$ terms). Bishop Berkeley famously mocked infinitesimals as "ghosts of departed quantities."

Cauchy (1821) introduced the sequential approach: $\lim_{x \to a} f(x) = L$ if for every sequence $x_n \to a$ (with $x_n \neq a$), the sequence $f(x_n) \to L$. This was much cleaner, but still relied on intuitive notions of "approaching."

Weierstrass (1861) gave the definitive formulation, now universal in analysis: the epsilon-delta definition (Section 2.1). This eliminated all appeals to intuition and motion - limits became purely a statement about inequalities between real numbers.

Chronological summary:

| Year | Mathematician | Contribution |
| --- | --- | --- |
| 1666/1675 | Newton / Leibniz | Calculus via infinitesimals (intuitive, not rigorous) |
| 1797 | Lagrange | Attempted algebraic foundation via power series |
| 1821 | Cauchy | Sequential definition; "sum formula" for continuity |
| 1861 | Weierstrass | epsilon-delta definition; made analysis fully rigorous |
| 1960s | Robinson | Nonstandard analysis: infinitesimals made rigorous via model theory |

The modern formulation we use today is Weierstrass's. Interestingly, nonstandard analysis (Abraham Robinson, 1966) later vindicated Newton's intuition by making infinitesimals rigorous through model theory - but the epsilon-delta approach remains standard in mathematical practice.

1.3 Why Limits Matter for AI

Limits are not just historical curiosities - they are load-bearing mathematics in modern AI systems.

Gradient computation: Every gradient in a neural network is a limit. The gradient of loss $\mathcal{L}$ with respect to weight $w$ is:

$$\frac{\partial \mathcal{L}}{\partial w} = \lim_{h \to 0} \frac{\mathcal{L}(w + h) - \mathcal{L}(w)}{h}$$

Automatic differentiation implements this limit algebraically rather than numerically, but the underlying mathematics is limit-theoretic.

Softmax temperature: The softmax function with temperature $T$ is:

$$\text{softmax}_T(\mathbf{z})_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$$

Taking $\lim_{T \to 0^+}$ recovers the argmax (one-hot); $\lim_{T \to \infty}$ recovers the uniform distribution. Understanding these limits is essential for temperature-scaled inference in LLMs (Radford et al., 2019; Ouyang et al., 2022).

Vanishing gradients: The sigmoid $\sigma(x) = 1/(1 + e^{-x})$ satisfies $\lim_{x \to \pm\infty} \sigma(x) \in \{0, 1\}$, and $\lim_{x \to \pm\infty} \sigma'(x) = 0$. Deep networks with sigmoid activations suffer gradient vanishing precisely because of this limit behavior (Hochreiter & Schmidhuber, 1997).

Numerical stability: Computing $(e^x - 1)/x$ naively near $x = 0$ loses up to 16 digits of precision due to catastrophic cancellation. The function numpy.expm1 computes $e^x - 1$ directly to avoid this - a practical implementation of numerically stable limits.

Convergence guarantees: Stochastic gradient descent converges (in expectation) if the learning rate sequence $\{\alpha_t\}$ satisfies the Robbins-Monro conditions $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ (Robbins & Monro, 1951). These are conditions on the limiting behavior of a series - pure limit theory applied to optimization.


2. Formal Definitions

2.1 The epsilon-delta Definition

Definition (Limit). Let $f$ be a function defined on some open interval containing $a$, except possibly at $a$ itself. We say

$$\lim_{x \to a} f(x) = L$$

if for every $\varepsilon > 0$, there exists $\delta > 0$ such that

$$0 < \lvert x - a \rvert < \delta \implies \lvert f(x) - L \rvert < \varepsilon$$

Unpacking the definition:

  • "ε>0\varepsilon > 0" - your challenger specifies how close to LL you must get
  • "there exists δ>0\delta > 0" - you respond with a proximity to aa
  • "0<xa<δ0 < \lvert x - a \rvert < \delta" - xx is within δ\delta of aa, but xax \neq a
  • "f(x)L<ε\lvert f(x) - L \rvert < \varepsilon" - then f(x)f(x) is within ε\varepsilon of LL

The definition is a game: for every precision challenge $\varepsilon$ your opponent poses, you must produce a $\delta$ that wins. If you can always win, the limit is $L$.

epsilon-delta VISUALIZATION

[Figure: a horizontal $\varepsilon$-strip $(L-\varepsilon, L+\varepsilon)$ around the limit value $L$, and a vertical $\delta$-window $(a-\delta, a+\delta)$ around $a$. For $x \in (a-\delta, a+\delta) \setminus \{a\}$, $f(x)$ must lie in $(L-\varepsilon, L+\varepsilon)$.]

Example (proving a limit with epsilon-delta): Prove $\lim_{x \to 3} (2x - 1) = 5$.

We need: given $\varepsilon > 0$, find $\delta > 0$ such that $0 < \lvert x - 3 \rvert < \delta \implies \lvert (2x-1) - 5 \rvert < \varepsilon$.

Note: $\lvert (2x-1) - 5 \rvert = \lvert 2x - 6 \rvert = 2\lvert x - 3 \rvert$.

So if $\lvert x - 3 \rvert < \delta$, then $\lvert (2x-1) - 5 \rvert = 2\lvert x - 3 \rvert < 2\delta$.

Choose $\delta = \varepsilon/2$. Then $2\delta = \varepsilon$, so $\lvert (2x-1) - 5 \rvert < \varepsilon$. $\square$

Example (limit does not exist): $f(x) = \sin(1/x)$ as $x \to 0$. For any $\delta > 0$, the interval $(0, \delta)$ contains points where $\sin(1/x) = 1$ and points where $\sin(1/x) = -1$. No single $L$ can satisfy the epsilon-delta condition with $\varepsilon = 1/2$. The limit does not exist.

Non-examples (limit fails to exist):

  • $f(x) = \lvert x \rvert / x$ as $x \to 0$: left limit is $-1$, right limit is $+1$ - unequal one-sided limits
  • $f(x) = 1/x^2$ as $x \to 0$: $f(x) \to +\infty$ - no finite $L$
  • $f(x) = \sin(1/x)$ as $x \to 0$: oscillates between $-1$ and $1$ without settling

2.2 Limit Laws

Limits interact well with algebraic operations. If $\lim_{x \to a} f(x) = L$ and $\lim_{x \to a} g(x) = M$, then:

| Law | Statement |
| --- | --- |
| Sum | $\lim_{x \to a} [f(x) + g(x)] = L + M$ |
| Difference | $\lim_{x \to a} [f(x) - g(x)] = L - M$ |
| Constant Multiple | $\lim_{x \to a} [c \cdot f(x)] = c \cdot L$ |
| Product | $\lim_{x \to a} [f(x) \cdot g(x)] = L \cdot M$ |
| Quotient | $\lim_{x \to a} \frac{f(x)}{g(x)} = \frac{L}{M}$, provided $M \neq 0$ |
| Power | $\lim_{x \to a} [f(x)]^n = L^n$ for positive integer $n$ |
| Root | $\lim_{x \to a} \sqrt[n]{f(x)} = \sqrt[n]{L}$ (for even $n$, require $L \geq 0$) |
| Composition | If $g$ is continuous at $L$: $\lim_{x \to a} g(f(x)) = g(L)$ |

Proof sketch (Sum Law): Given $\varepsilon > 0$. Since $f(x) \to L$ and $g(x) \to M$, choose $\delta_1, \delta_2$ such that $\lvert f(x) - L \rvert < \varepsilon/2$ and $\lvert g(x) - M \rvert < \varepsilon/2$ within their respective $\delta$-windows. Set $\delta = \min(\delta_1, \delta_2)$. Then:

$$\lvert f(x) + g(x) - (L+M) \rvert \leq \lvert f(x) - L \rvert + \lvert g(x) - M \rvert < \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon \qquad \square$$

Direct substitution: If $f$ is a polynomial, rational function (with nonzero denominator), or composed of continuous functions, then $\lim_{x \to a} f(x) = f(a)$. This is valid whenever $f$ is continuous at $a$ - essentially the definition of continuity (Section 5).

2.3 One-Sided Limits

For functions that behave differently approaching from the left versus the right, we introduce one-sided limits.

Definition (Right-hand limit). $\lim_{x \to a^+} f(x) = L$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that $a < x < a + \delta \implies \lvert f(x) - L \rvert < \varepsilon$.

Definition (Left-hand limit). $\lim_{x \to a^-} f(x) = L$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that $a - \delta < x < a \implies \lvert f(x) - L \rvert < \varepsilon$.

Key theorem: $\lim_{x \to a} f(x) = L$ if and only if both one-sided limits exist and are equal:

$$\lim_{x \to a^-} f(x) = L = \lim_{x \to a^+} f(x)$$

Examples:

Sign function $\text{sgn}(x) = x / \lvert x \rvert$ for $x \neq 0$:

$$\lim_{x \to 0^-} \text{sgn}(x) = -1, \quad \lim_{x \to 0^+} \text{sgn}(x) = +1$$

Since $-1 \neq 1$, the two-sided limit $\lim_{x \to 0} \text{sgn}(x)$ does not exist.

Floor function $\lfloor x \rfloor$: at any integer $n$,

$$\lim_{x \to n^-} \lfloor x \rfloor = n - 1, \quad \lim_{x \to n^+} \lfloor x \rfloor = n$$

Jump discontinuity at every integer.

Absolute value $\lvert x \rvert$:

$$\lim_{x \to 0^-} \lvert x \rvert = 0 = \lim_{x \to 0^+} \lvert x \rvert$$

Two-sided limit exists and equals $0 = \lvert 0 \rvert$, so $\lvert x \rvert$ is continuous at $0$.

For AI: $\text{ReLU}(x) = \max(0, x)$ has well-defined one-sided limits everywhere. At $x = 0$:

$$\lim_{x \to 0^-} \text{ReLU}(x) = 0, \quad \lim_{x \to 0^+} \text{ReLU}(x) = 0$$

Both equal $\text{ReLU}(0) = 0$, so ReLU is continuous at the origin. However, its derivative (Section 9.3) has a jump discontinuity there.

2.4 Limits at Infinity

Definition. $\lim_{x \to +\infty} f(x) = L$ means: for every $\varepsilon > 0$, there exists $M > 0$ such that $x > M \implies \lvert f(x) - L \rvert < \varepsilon$.

Similarly for $\lim_{x \to -\infty} f(x) = L$ with $x < -M$.

Standard examples:

$$\lim_{x \to \infty} \frac{1}{x} = 0, \quad \lim_{x \to \infty} \frac{1}{x^2} = 0, \quad \lim_{x \to \infty} e^{-x} = 0$$

$$\lim_{x \to \infty} \arctan(x) = \frac{\pi}{2}, \quad \lim_{x \to -\infty} \arctan(x) = -\frac{\pi}{2}$$

Rational function limits at infinity: For $p(x)/q(x)$ with $\deg(p) = m$, $\deg(q) = n$:

  • If $m < n$: limit is $0$
  • If $m = n$: limit is the ratio of leading coefficients
  • If $m > n$: limit is $\pm\infty$

Example:

$$\lim_{x \to \infty} \frac{3x^2 + x - 1}{2x^2 - 5} = \lim_{x \to \infty} \frac{3 + 1/x - 1/x^2}{2 - 5/x^2} = \frac{3 + 0 - 0}{2 - 0} = \frac{3}{2}$$

For AI: Sigmoid saturation at infinity:

$$\lim_{x \to +\infty} \sigma(x) = 1, \quad \lim_{x \to -\infty} \sigma(x) = 0$$

This is why sigmoid-based networks suffer from vanishing gradients: for large $\lvert x \rvert$, the sigmoid is essentially constant, so its derivative is essentially zero.

2.5 Infinite Limits and Vertical Asymptotes

Definition. $\lim_{x \to a} f(x) = +\infty$ means: for every $M > 0$, there exists $\delta > 0$ such that $0 < \lvert x - a \rvert < \delta \implies f(x) > M$.

Note: $+\infty$ is not a real number - saying the limit is $+\infty$ means the function grows without bound, not that it converges.

Vertical asymptote: $x = a$ is a vertical asymptote of $f$ if any one-sided limit as $x \to a$ is $\pm\infty$.

Examples:

$$\lim_{x \to 0^+} \frac{1}{x} = +\infty, \quad \lim_{x \to 0^-} \frac{1}{x} = -\infty$$

$$\lim_{x \to 0} \frac{1}{x^2} = +\infty \quad \text{(both sides)}$$

$$\lim_{x \to (\pi/2)^-} \tan(x) = +\infty, \quad \lim_{x \to (\pi/2)^+} \tan(x) = -\infty$$

For AI - log barrier: Interior point methods use log barriers $-\sum_i \log(s_i)$ where $s_i > 0$. As $s_i \to 0^+$, the barrier $\to +\infty$, keeping the optimization in the feasible interior. This is a vertical asymptote engineered as a constraint.


3. Fundamental Limits

These limits appear repeatedly in analysis, ML, and numerical methods. Every practitioner should know them cold.

3.1 The Sine Limit: sin(x)/x -> 1

$$\lim_{x \to 0} \frac{\sin x}{x} = 1$$

Geometric proof: Consider a unit circle. For $0 < x < \pi/2$, the areas satisfy:

$$\text{Area}(\triangle OAP) \leq \text{Area(sector } OAP) \leq \text{Area}(\triangle OAT)$$

In terms of $x$ (the angle):

$$\frac{\sin x}{2} \leq \frac{x}{2} \leq \frac{\tan x}{2}$$

Dividing by $\sin x / 2 > 0$:

$$1 \leq \frac{x}{\sin x} \leq \frac{1}{\cos x}$$

Taking reciprocals (reversing inequalities):

$$\cos x \leq \frac{\sin x}{x} \leq 1$$

Since $\lim_{x \to 0} \cos x = 1$, by the Squeeze Theorem: $\lim_{x \to 0} \frac{\sin x}{x} = 1$. $\square$

Consequence:

$$\lim_{x \to 0} \frac{1 - \cos x}{x} = 0, \quad \lim_{x \to 0} \frac{1 - \cos x}{x^2} = \frac{1}{2}$$

The first follows from $\frac{1-\cos x}{x} = \frac{1-\cos x}{x} \cdot \frac{1+\cos x}{1+\cos x} = \frac{\sin^2 x}{x(1+\cos x)} \to \frac{0}{2} = 0$.

For AI - attention scores: The sinusoidal positional encoding in transformers (Vaswani et al., 2017) uses $\sin(k/10000^{2i/d})$. The smooth interpolation behavior of sine (controlled by this limit) ensures smooth positional representations.

3.2 Exponential and Logarithmic Limits

The fundamental exponential limit:

$$\lim_{x \to 0} \frac{e^x - 1}{x} = 1$$

This is the definition of the derivative of $e^x$ at $x = 0$, and it says the exponential grows at rate $1$ near the origin.

Proof (via series): $e^x = 1 + x + x^2/2! + \cdots$, so $\frac{e^x - 1}{x} = 1 + x/2! + x^2/3! + \cdots \to 1$.

Consequence:

$$\lim_{x \to 0} \frac{a^x - 1}{x} = \ln a \quad \text{for any } a > 0$$

Logarithmic limits:

$$\lim_{x \to 0} \frac{\ln(1+x)}{x} = 1, \quad \lim_{x \to \infty} \frac{\ln x}{x^p} = 0 \text{ for any } p > 0$$

The second says the logarithm grows slower than any positive power - critical for understanding why $O(\log n)$ algorithms are so efficient.

For AI - cross-entropy: The cross-entropy loss $\mathcal{L} = -\sum_i y_i \log p_i$ uses $\lim_{p \to 0^+} p \log p = 0$ (Section 3.4) to handle the convention $0 \log 0 = 0$.

3.3 Euler's Number as a Limit

$$e = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n = \lim_{x \to 0} (1 + x)^{1/x}$$

Both forms define the same constant $e \approx 2.71828\ldots$

Motivation: If interest on principal $P$ is compounded $n$ times per year at annual rate $r$, after one year the amount is $P(1 + r/n)^n$. As $n \to \infty$ (continuous compounding), this approaches $Pe^r$.

General form:

$$\lim_{n \to \infty} \left(1 + \frac{r}{n}\right)^n = e^r, \quad \lim_{x \to 0} (1 + rx)^{1/x} = e^r$$

Proof sketch: Let $f(n) = (1 + 1/n)^n$. Taking the logarithm: $\ln f(n) = n \ln(1 + 1/n)$. Setting $h = 1/n$: $\ln f = \frac{\ln(1+h)}{h} \to 1$ as $h \to 0$. So $f \to e^1 = e$. $\square$

For AI - learning rate warm-up: The cosine schedule and exponential decay are engineering analogs of $(1 + 1/n)^n$ - discrete compounding that approaches a continuous exponential in the limit of fine schedules.

3.4 Logarithmic Limits and xln(x)

$$\lim_{x \to 0^+} x \ln x = 0$$

This is a $0 \cdot (-\infty)$ indeterminate form. Rewrite:

$$\lim_{x \to 0^+} x \ln x = \lim_{x \to 0^+} \frac{\ln x}{1/x}$$

This is $-\infty/+\infty$, so L'Hôpital applies:

$$= \lim_{x \to 0^+} \frac{1/x}{-1/x^2} = \lim_{x \to 0^+} (-x) = 0$$

For AI - entropy: The Shannon entropy $H(p) = -\sum_i p_i \log p_i$ uses the convention $0 \log 0 = 0$, justified by $\lim_{p \to 0^+} p \log p = 0$. This makes entropy a continuous function of the probability distribution even at the boundary of the simplex.
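
In code this convention can be honored directly; a minimal sketch (assuming scipy is available) uses scipy.special.xlogy, which returns $0$ at $p = 0$ instead of nan:

```python
# Entropy with the 0*log(0) = 0 convention.
# xlogy(p, p) computes p*log(p) and returns 0 where p == 0, matching
# the limit lim_{p -> 0+} p log p = 0.
import numpy as np
from scipy.special import xlogy

def entropy(p):
    """Shannon entropy in nats; continuous at the simplex boundary."""
    return -np.sum(xlogy(p, p))

print(entropy(np.array([0.5, 0.5])))   # ln 2 ~ 0.6931
print(entropy(np.array([1.0, 0.0])))   # 0 -- no nan despite a zero probability
```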


4. Computation Techniques

4.1 Algebraic Techniques

When direct substitution gives $0/0$ or other indeterminate forms, algebraic manipulation often resolves the indeterminacy.

Factoring:

$$\lim_{x \to 2} \frac{x^3 - 8}{x - 2} = \lim_{x \to 2} \frac{(x-2)(x^2+2x+4)}{x-2} = \lim_{x \to 2} (x^2 + 2x + 4) = 4 + 4 + 4 = 12$$

Rationalization (conjugate multiplication):

$$\lim_{x \to 0} \frac{\sqrt{x+4} - 2}{x} = \lim_{x \to 0} \frac{(\sqrt{x+4}-2)(\sqrt{x+4}+2)}{x(\sqrt{x+4}+2)} = \lim_{x \to 0} \frac{x}{x(\sqrt{x+4}+2)} = \frac{1}{4}$$

Common factor in rational functions:

$$\lim_{x \to -3} \frac{x^2 + 5x + 6}{x^2 - x - 12} = \lim_{x \to -3} \frac{(x+2)(x+3)}{(x+3)(x-4)} = \lim_{x \to -3} \frac{x+2}{x-4} = \frac{-1}{-7} = \frac{1}{7}$$

Multiplying by $1/x^n$ (for rational functions at infinity): already shown in Section 2.4.

4.2 L'Hôpital's Rule

Theorem (L'Hôpital's Rule). Suppose $\lim_{x \to a} f(x) = 0$ and $\lim_{x \to a} g(x) = 0$ (the $0/0$ form), or both limits are $\pm\infty$ (the $\infty/\infty$ form). If $f$ and $g$ are differentiable near $a$ (but not necessarily at $a$), $g'(x) \neq 0$ near $a$, and $\lim_{x \to a} f'(x)/g'(x)$ exists (or is $\pm\infty$), then:

$$\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}$$

The same holds for one-sided limits and for $a = \pm\infty$.

Critical: L'Hôpital applies only to $0/0$ or $\infty/\infty$. Never apply it to other forms. Always verify the indeterminate form first.

Examples:

$\lim_{x \to 0} \frac{\sin x}{x}$: This is $0/0$. L'Hôpital: $\frac{\cos x}{1} \to 1$.

$\lim_{x \to 0} \frac{e^x - 1 - x}{x^2}$: This is $0/0$. L'Hôpital: $\frac{e^x - 1}{2x}$, still $0/0$. Apply again: $\frac{e^x}{2} \to \frac{1}{2}$.

$\lim_{x \to \infty} x e^{-x}$: Rewrite as $\frac{x}{e^x}$ ($\infty/\infty$). L'Hôpital: $\frac{1}{e^x} \to 0$.

Other indeterminate forms - convert before applying L'Hôpital:

| Form | Conversion |
| --- | --- |
| $0 \cdot \infty$ | Write as $\frac{f}{1/g}$ or $\frac{g}{1/f}$ (giving $0/0$ or $\infty/\infty$) |
| $\infty - \infty$ | Factor or find a common denominator |
| $1^\infty$, $0^0$, $\infty^0$ | Take $\ln$: $\lim \ln(f^g) = \lim g \ln f$ |

Example ($1^\infty$): $\lim_{x \to 0} (1+x)^{1/x}$. Let $y = (1+x)^{1/x}$, so $\ln y = \frac{\ln(1+x)}{x}$. This is $0/0$; L'Hôpital gives $\frac{1/(1+x)}{1} \to 1$. So $y \to e^1 = e$.

Warning: L'Hôpital can fail to terminate or produce incorrect answers if misapplied. Always check that the form is truly indeterminate, and stop as soon as the limit becomes evaluable by direct substitution.
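
These computations can be cross-checked symbolically; the sketch below (assuming sympy is installed) uses sympy.limit, which evaluates limits exactly:

```python
# Symbolic sanity checks for the L'Hôpital examples above.
import sympy as sp

x = sp.symbols('x')
print(sp.limit(sp.sin(x) / x, x, 0))                # 1
print(sp.limit((sp.exp(x) - 1 - x) / x**2, x, 0))   # 1/2
print(sp.limit(x * sp.exp(-x), x, sp.oo))           # 0
print(sp.limit((1 + x)**(1 / x), x, 0))             # E
```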

4.3 The Squeeze Theorem

Theorem (Squeeze / Sandwich Theorem). Suppose $h(x) \leq f(x) \leq g(x)$ for all $x$ near $a$ (but not necessarily at $a$), and:

$$\lim_{x \to a} h(x) = \lim_{x \to a} g(x) = L$$

Then $\lim_{x \to a} f(x) = L$.

Proof: Given $\varepsilon > 0$. Choose $\delta_1, \delta_2 > 0$ so that:

  • $0 < \lvert x - a \rvert < \delta_1 \implies \lvert h(x) - L \rvert < \varepsilon$, i.e., $L - \varepsilon < h(x) < L + \varepsilon$
  • $0 < \lvert x - a \rvert < \delta_2 \implies \lvert g(x) - L \rvert < \varepsilon$, i.e., $L - \varepsilon < g(x) < L + \varepsilon$

Set $\delta = \min(\delta_1, \delta_2)$. For $0 < \lvert x - a \rvert < \delta$:

$$L - \varepsilon < h(x) \leq f(x) \leq g(x) < L + \varepsilon$$

So $\lvert f(x) - L \rvert < \varepsilon$. $\square$

Example 1: $\lim_{x \to 0} x^2 \sin(1/x)$. Since $-1 \leq \sin(1/x) \leq 1$: $-x^2 \leq x^2 \sin(1/x) \leq x^2$. Both $-x^2 \to 0$ and $x^2 \to 0$, so $\lim_{x \to 0} x^2 \sin(1/x) = 0$.

Example 2: $\lim_{x \to 0} \frac{\sin x}{x} = 1$ (proved geometrically in Section 3.1 using the squeeze).

Example 3: $\lim_{n \to \infty} \sqrt[n]{n}$. We know $1 \leq \sqrt[n]{n} \leq 1 + \sqrt{2/n}$ (provable via AM-GM). Since $1 + \sqrt{2/n} \to 1$, the squeeze gives $\sqrt[n]{n} \to 1$.

For AI: The Squeeze Theorem is the theoretical justification for proving that "soft" approximations converge to "hard" decisions. For example, if $L_\text{smooth} \leq L_\text{target} \leq L_\text{upper}$ and both bounds converge to the same limit, then $L_\text{target}$ converges - used in proving convergence of temperature-annealed sampling.
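
A numeric view of Example 1 (a minimal sketch): the oscillating middle term is trapped between $-x^2$ and $x^2$:

```python
# The bounds -x^2 <= x^2 sin(1/x) <= x^2 pinch the middle term to 0.
import numpy as np

for x in [0.1, 0.01, 0.001, 1e-6]:
    mid = x**2 * np.sin(1 / x)
    print(f"x = {x:<8}  -x^2 = {-x**2:.3e}   x^2*sin(1/x) = {mid:.3e}   x^2 = {x**2:.3e}")
# The middle column oscillates in sign but stays trapped between the
# outer columns, both of which tend to 0.
```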

4.4 Substitution and Composition

Theorem. If $g$ is continuous at $L$ and $\lim_{x \to a} f(x) = L$, then:

$$\lim_{x \to a} g(f(x)) = g\!\left(\lim_{x \to a} f(x)\right) = g(L)$$

This allows moving limits inside continuous functions.

Examples:

$$\lim_{x \to 0} e^{\sin x} = e^{\lim_{x \to 0} \sin x} = e^0 = 1$$

$$\lim_{x \to \pi} \cos(x + \pi/2) = \cos\!\left(\lim_{x \to \pi}(x + \pi/2)\right) = \cos(3\pi/2) = 0$$

$$\lim_{x \to 0^+} \ln(\sin x / x) = \ln\!\left(\lim_{x \to 0^+} \frac{\sin x}{x}\right) = \ln(1) = 0$$

5. Continuity

5.1 The Three-Part Definition

Definition (Continuity at a Point). A function $f$ is continuous at $a$ if:

  1. $f(a)$ is defined (the function exists at $a$)
  2. $\lim_{x \to a} f(x)$ exists (the limit exists at $a$)
  3. $\lim_{x \to a} f(x) = f(a)$ (the limit equals the function value)

All three conditions must hold. Failure of any one gives a discontinuity.

Equivalent epsilon-delta formulation: $f$ is continuous at $a$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that $\lvert x - a \rvert < \delta \implies \lvert f(x) - f(a) \rvert < \varepsilon$. (Note: unlike the limit definition, $x = a$ is now permitted.)

Examples:

  • $f(x) = x^2$: continuous at every $a \in \mathbb{R}$ (polynomial)
  • $f(x) = \sin x$: continuous at every $a \in \mathbb{R}$
  • $f(x) = 1/x$: continuous at every $a \neq 0$; not defined (hence not continuous) at $0$
  • $f(x) = \lfloor x \rfloor$: continuous at every non-integer; discontinuous at every integer

Non-examples (failing each condition):

  • Condition 1 fails: $f(x) = (x^2-4)/(x-2)$ at $x = 2$ (undefined)
  • Condition 2 fails: $f(x) = \sin(1/x)$ at $x = 0$ (limit doesn't exist)
  • Condition 3 fails: $f(x) = x^2$ for $x \neq 1$ with $f(1) = 5$: the limit at $1$ is $1$ but $f(1) = 5$

5.2 Types of Discontinuity

DISCONTINUITY TAXONOMY

| | Removable (hole in graph) | Jump (step) | Essential / infinite (blows up) |
| --- | --- | --- | --- |
| Limit behavior | $\lim$ exists, but $\neq f(a)$ or $f(a)$ undefined | $\lim$ doesn't exist (one-sided limits exist but differ) | $\lim = \pm\infty$ (vertical asymptote) |
| Fixable? | Yes: redefine $f(a)$ | Cannot fix: jump | Cannot fix: infinity |
| Example | $(x^2-4)/(x-2)$ at $x=2$ | $\text{sgn}(x)$ at $x=0$ | $1/x$ at $0$ |

Removable discontinuity: $\lim_{x \to a} f(x)$ exists but either $f(a)$ is undefined or $f(a) \neq \lim_{x\to a}f(x)$. Fixable by redefining $f(a) = \lim_{x \to a} f(x)$.

Jump discontinuity: Both one-sided limits exist but $\lim_{x \to a^-} f(x) \neq \lim_{x \to a^+} f(x)$. The function "jumps" at $a$.

Essential (infinite) discontinuity: At least one one-sided limit is $\pm\infty$. Also called an infinite discontinuity.

Oscillatory discontinuity (a subtype of essential): $\sin(1/x)$ at $x = 0$ - neither one-sided limit exists (even as $\pm\infty$).

For AI:

  • ReLU has no discontinuities (continuous everywhere), but its derivative has a jump at $0$
  • Hard attention (argmax) has a jump discontinuity at ties - made differentiable via softmax temperature annealing
  • Quantized activations (INT8) introduce jump discontinuities - handled with the straight-through estimator in training

5.3 Continuity on Intervals

Definition. $f$ is continuous on the open interval $(a,b)$ if $f$ is continuous at every $c \in (a,b)$.

$f$ is continuous on the closed interval $[a,b]$ if:

  • $f$ is continuous on $(a,b)$
  • $\lim_{x \to a^+} f(x) = f(a)$ (right-continuous at $a$)
  • $\lim_{x \to b^-} f(x) = f(b)$ (left-continuous at $b$)

Standard continuous functions: Polynomials, rational functions (on their domains), $\sin x$, $\cos x$, $e^x$, $\ln x$ (on $(0,\infty)$), $\sqrt{x}$ (on $[0,\infty)$), and any composition thereof.

For AI: Neural networks with continuous activation functions (sigmoid, tanh, GELU, SiLU) define continuous functions $\mathbb{R}^n \to \mathbb{R}^m$. Networks with ReLU are piecewise linear and continuous. Continuity of the network function is necessary (though not sufficient) for gradient-based optimization to work reliably.

5.4 Operations Preserving Continuity

If $f$ and $g$ are continuous at $a$, then so are:

  • $f + g$, $f - g$, $f \cdot g$, $c \cdot f$ (for any constant $c$)
  • $f / g$, provided $g(a) \neq 0$
  • $f \circ g$, provided $f$ is continuous at $g(a)$

Consequence: Every function built by composing, adding, and multiplying elementary continuous functions is continuous on its natural domain. This covers virtually all standard functions in analysis.


6. Key Theorems

6.1 Intermediate Value Theorem

Theorem (IVT). Let $f$ be continuous on the closed interval $[a, b]$. If $k$ is any value strictly between $f(a)$ and $f(b)$, then there exists $c \in (a, b)$ such that $f(c) = k$.

Informally: a continuous function cannot skip values - if it goes from $f(a)$ to $f(b)$, it must pass through every intermediate value.

INTERMEDIATE VALUE THEOREM

[Figure: a continuous curve rising from $(a, f(a))$ to $(b, f(b))$; a horizontal line at height $k$ between $f(a)$ and $f(b)$ crosses the curve at some $c$. If $f$ is continuous on $[a,b]$, then for any $k$ between $f(a)$ and $f(b)$, there exists $c \in (a,b)$ with $f(c) = k$.]

Proof sketch (requires completeness of $\mathbb{R}$): Let $S = \{x \in [a,b] : f(x) < k\}$. If $f(a) < k < f(b)$, then $S$ is nonempty (contains $a$) and bounded above (by $b$). Let $c = \sup S$. By continuity of $f$ at $c$, one can show $f(c) = k$. $\square$

Applications:

  1. Root finding (bisection method): If $f(a) < 0 < f(b)$ and $f$ is continuous, then $f$ has a root in $(a, b)$. Bisect: evaluate $f((a+b)/2)$ and recurse into the half-interval where the sign changes. Converges in $O(\log_2(1/\varepsilon))$ steps.

  2. Fixed-point existence: If $f: [0,1] \to [0,1]$ is continuous, then $f$ has a fixed point $f(c) = c$. (Apply IVT to $g(x) = f(x) - x$: $g(0) = f(0) \geq 0$ and $g(1) = f(1) - 1 \leq 0$.)

  3. Equilibrium existence: Used in proving existence of Nash equilibria (via the Kakutani fixed-point theorem), stopping times for Brownian motion, and zero crossings of residuals in iterative solvers.

For AI: The bisection method is the simplest root-finding algorithm. Modern AI uses line search (Wolfe conditions) and trust-region methods - all of which rely on IVT guarantees to find step sizes satisfying sufficient decrease conditions.
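
The bisection recursion from application 1 is short enough to show in full; below is a minimal sketch (the function name bisect and the $\cos x - x$ example are ours, not from the notebooks):

```python
# Bisection root finding. Assumes f is continuous and f(a), f(b) have
# opposite signs, so the IVT guarantees a root in (a, b).
import math

def bisect(f, a, b, tol=1e-12):
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "need a sign change on [a, b]"
    while b - a > tol:
        m = (a + b) / 2
        fm = f(m)
        if fm == 0:
            return m
        if fa * fm < 0:        # root lies in [a, m]
            b, fb = m, fm
        else:                  # root lies in [m, b]
            a, fa = m, fm
    return (a + b) / 2

# Example: root of cos(x) - x on [0, 1].
print(bisect(lambda x: math.cos(x) - x, 0.0, 1.0))  # ~0.7390851332
```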

6.2 Extreme Value Theorem

Theorem (EVT). Let $f$ be continuous on the closed interval $[a, b]$. Then $f$ attains its maximum and minimum on $[a, b]$: there exist $c, d \in [a, b]$ such that $f(d) \leq f(x) \leq f(c)$ for all $x \in [a, b]$.

Why compactness matters: The EVT requires $[a,b]$ to be closed and bounded (compact). Counterexamples on non-compact domains:

  • $f(x) = x$ on $(0, 1)$: no maximum (approaches $1$ but never attains it)
  • $f(x) = 1/x$ on $(0, 1]$: no maximum (blows up as $x \to 0^+$)

For AI: Loss functions trained over bounded parameter spaces (after gradient clipping or weight constraints) attain their minimum. This theoretical guarantee underpins the well-posedness of constrained optimization problems in ML.

6.3 Uniform Continuity

Definition. $f$ is uniformly continuous on $S$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that for all $x, y \in S$: $\lvert x - y \rvert < \delta \implies \lvert f(x) - f(y) \rvert < \varepsilon$.

Key distinction: In pointwise continuity, $\delta$ can depend on both $\varepsilon$ and the point $a$. In uniform continuity, $\delta$ depends only on $\varepsilon$ - the same $\delta$ works everywhere.

Theorem (Heine-Cantor). If $f$ is continuous on a closed bounded interval $[a,b]$, then $f$ is uniformly continuous on $[a,b]$.

Examples:

  • $f(x) = x^2$ is uniformly continuous on $[0, 1]$ but not on $\mathbb{R}$ (the derivative grows without bound)
  • $f(x) = \sin x$ is uniformly continuous on $\mathbb{R}$ (bounded derivative)
  • $f(x) = 1/x$ is not uniformly continuous on $(0, 1)$ (the derivative blows up near $0$)

For AI: Lipschitz continuity (a stronger condition: $\lvert f(x) - f(y) \rvert \leq L \lvert x - y \rvert$) implies uniform continuity. Lipschitz constraints appear in spectral normalization (Miyato et al., 2018) for GANs and in Wasserstein GAN training - all are quantitative versions of uniform continuity.


7. Asymptotic Behavior

7.1 Horizontal and Vertical Asymptotes

Horizontal asymptote: $y = L$ is a horizontal asymptote if $\lim_{x \to +\infty} f(x) = L$ or $\lim_{x \to -\infty} f(x) = L$.

Vertical asymptote: $x = a$ is a vertical asymptote if any one-sided limit as $x \to a$ is $\pm\infty$.

Oblique (slant) asymptote: $y = mx + b$ is an oblique asymptote if $\lim_{x \to \pm\infty} [f(x) - (mx+b)] = 0$. Occurs when $\deg(\text{numerator}) = \deg(\text{denominator}) + 1$.

Example: $f(x) = \frac{x^2 + 1}{x - 1}$. Polynomial division: $f(x) = x + 1 + \frac{2}{x-1}$. As $x \to \pm\infty$: $f(x) - (x+1) = \frac{2}{x-1} \to 0$. So $y = x + 1$ is an oblique asymptote and $x = 1$ is a vertical asymptote.

7.2 Polynomial vs. Exponential Growth

Growth hierarchy (as $x \to +\infty$):

$$\ln x \ll x^p \ll e^x \ll x^x$$

Formally:

$$\lim_{x \to \infty} \frac{\ln x}{x^p} = 0 \quad (p > 0), \qquad \lim_{x \to \infty} \frac{x^n}{e^x} = 0 \quad (n > 0)$$

Every polynomial is dominated by every exponential; every logarithm is dominated by every positive power.

Proof of $x^n/e^x \to 0$: Apply L'Hôpital $n$ times: $\frac{x^n}{e^x} \to \frac{n!}{e^x} \to 0$.

For AI - complexity: Algorithm complexity classes: $O(\log n)$ (logarithmic) $\ll O(n^k)$ (polynomial) $\ll O(e^n)$ (exponential). This hierarchy, grounded in limit theory, governs which algorithms scale to LLM-scale data (e.g., transformers are $O(n^2 d)$ for sequence length $n$ and dimension $d$).

7.3 Big-O and Little-o as Limit Statements

Definition. $f(x) = O(g(x))$ as $x \to a$ means there exist $C, \delta > 0$ such that $\lvert f(x) \rvert \leq C \lvert g(x) \rvert$ for $\lvert x - a \rvert < \delta$.

Definition. $f(x) = o(g(x))$ as $x \to a$ means $\lim_{x \to a} \frac{f(x)}{g(x)} = 0$.

In words: $O$ means "bounded by a multiple"; $o$ means "negligible compared to."

Examples:

  • $\sin x = O(x)$ as $x \to 0$ (since $\lvert \sin x \rvert \leq \lvert x \rvert$)
  • $x^2 = o(x)$ as $x \to 0$ (since $x^2/x = x \to 0$)
  • $e^x - 1 = x + O(x^2)$ as $x \to 0$ (Taylor expansion)

For AI: The Adam optimizer has per-parameter learning rate $\alpha / (\sqrt{\hat{v}} + \varepsilon)$. The $\varepsilon$ term prevents division by zero when $\hat{v} \approx 0$. In the limit $\hat{v} \to 0$, the effective learning rate is $O(\alpha/\varepsilon)$; for large $\hat{v}$, it is $O(\alpha/\sqrt{\hat{v}})$. This piecewise asymptotic analysis (via big-O) explains Adam's behavior in both sparse and dense gradient regimes.


8. Numerical Stability Near Limits

8.1 Catastrophic Cancellation

When computing $(e^x - 1)/x$ for $x$ close to $0$, the numerator involves subtracting two nearly equal quantities. In IEEE 754 double precision (64-bit float, $u \approx 1.1 \times 10^{-16}$), numbers are stored to $\sim 16$ significant digits. If $e^x = 1 + x + x^2/2 + \ldots$ and $x = 10^{-8}$, then:

  • $e^x \approx 1.0000000100000000050\ldots$
  • $1.0$ in floating point: exactly $1.0$
  • $e^x - 1$ in floating point: only the digits beyond the leading $1$ are meaningful, losing $\sim 8$ digits

The error in the subtraction propagates, giving up to 8 digits of lost precision. The limit value (which is $1$) is computed incorrectly.

CATASTROPHIC CANCELLATION EXAMPLE

  x = 1e-8 (IEEE 754 double)

  True value: (e^x - 1)/x  ~=  1.000000005000000017...

  Naive computation:
    e^x   = 1.00000001000000002   (15-16 sig digits, ok)
    e^x-1 = 0.00000001000000002   (only ~9 sig digits now!)
    /x    ~=  1.000000002          (lost ~8 digits of precision)

  Stable computation (expm1):
    expm1(x) = 1.0000000050000000e-8   (full precision maintained)
    /x       = 1.0000000050000000      (correct!)
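
The same comparison in numpy (a minimal sketch; the exact garbage digits vary with the platform's libm):

```python
# Naive vs. stable evaluation of (e^x - 1)/x near x = 0.
import numpy as np

x = 1e-8
naive = (np.exp(x) - 1.0) / x   # subtraction cancels ~8 leading digits
stable = np.expm1(x) / x        # expm1 computes e^x - 1 without cancellation
print("naive :", naive)         # correct only to ~8 significant digits
print("stable:", stable)        # ~1.000000005, correct to full precision
```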


8.2 Numerically Stable Alternatives

numpy.expm1(x) computes $e^x - 1$ using a special algorithm that avoids cancellation for small $x$.

numpy.log1p(x) computes $\ln(1 + x)$, similarly avoiding cancellation when $x \approx 0$.

scipy.special.log_softmax(x) avoids overflow in $e^{x_i}$ by computing $x_i - \log\sum_j e^{x_j}$ via the log-sum-exp trick:

$$\log\sum_j e^{x_j} = m + \log\sum_j e^{x_j - m}, \quad m = \max_j x_j$$

All $e^{x_j - m} \leq 1$, preventing overflow.

Stable sigmoid: $\sigma(x) = 1/(1+e^{-x})$ can overflow $e^{-x}$ for large negative $x$. Stable version:

$$\sigma(x) = \begin{cases} 1/(1+e^{-x}) & x \geq 0 \\ e^x/(1+e^x) & x < 0 \end{cases}$$

For AI - loss computation: Cross-entropy loss computed with torch.nn.CrossEntropyLoss internally uses log_softmax for numerical stability. Implementing naive log(softmax(x)) instead causes overflow/underflow for large logits.
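
The two stable formulas above translate directly into code; a minimal numpy sketch (the function names are ours, not library APIs):

```python
# Stable sigmoid and log-softmax, as derived above.
import numpy as np

def stable_sigmoid(x):
    """Piecewise sigmoid: never exponentiates a large positive argument."""
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))                # e^{-x} <= 1 here
    out[~pos] = np.exp(x[~pos]) / (1.0 + np.exp(x[~pos]))   # e^{x} < 1 here
    return out

def log_softmax(x):
    """log softmax via the log-sum-exp trick: subtract the max first."""
    m = np.max(x)
    return x - (m + np.log(np.sum(np.exp(x - m))))

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))  # [0. 0.5 1.], no overflow
print(log_softmax(np.array([1000.0, 1001.0, 1002.0])))   # finite values, no inf
```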

8.3 Machine Epsilon and Floating-Point Limits

Machine epsilon $u = 2^{-52} \approx 2.22 \times 10^{-16}$ (double precision): the smallest $\varepsilon$ such that $\text{fl}(1 + \varepsilon) > 1$ in IEEE 754.

In floating-point arithmetic, the limit $\lim_{h \to 0}$ cannot be taken exactly - there is a practical lower bound on $h$ below which finite difference approximations deteriorate:

  • For first-order finite differences: optimal $h \approx \sqrt{u} \approx 10^{-8}$
  • For second-order (centered) differences: optimal $h \approx u^{1/3} \approx 10^{-5}$

This is why gradient checking in ML uses $h = 10^{-5}$ or $10^{-7}$ rather than $10^{-15}$.
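
A short experiment (a minimal sketch) makes the optimal-$h$ tradeoff visible for the forward difference:

```python
# Forward-difference error for f(x) = sin(x) at x = 1: truncation error
# O(h) dominates for large h, rounding error O(u/h) for small h, with a
# minimum near h ~ sqrt(u) ~ 1e-8.
import math

f, fprime, x = math.sin, math.cos, 1.0
for k in range(1, 16):
    h = 10.0 ** (-k)
    approx = (f(x + h) - f(x)) / h
    print(f"h = 1e-{k:02d}   error = {abs(approx - fprime(x)):.3e}")
# The error falls until h ~ 1e-8, then grows again as cancellation in
# f(x+h) - f(x) takes over.
```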


9. Machine Learning Applications

9.1 Softmax Temperature Limits

The softmax with temperature $T > 0$ is:

$$\text{softmax}_T(\mathbf{z})_i = \frac{e^{z_i / T}}{\sum_{j=1}^n e^{z_j / T}}$$

Limit as $T \to 0^+$ (cold / sharp):

$$\lim_{T \to 0^+} \text{softmax}_T(\mathbf{z})_i = \begin{cases} 1 & \text{if } i = \arg\max_j z_j \\ 0 & \text{otherwise} \end{cases}$$

(assuming a unique argmax). The distribution collapses to a point mass on the highest-scoring class. This is why $T \to 0$ is called the "deterministic" or "greedy" limit.

Limit as $T \to \infty$ (hot / uniform):

$$\lim_{T \to \infty} \text{softmax}_T(\mathbf{z})_i = \frac{1}{n} \quad \text{for all } i$$

All logits become irrelevant; the distribution becomes uniform.

Proof (cold limit): WLOG $z_1 > z_j$ for $j > 1$. Factor out $e^{z_1/T}$:

$$\text{softmax}_T(\mathbf{z})_1 = \frac{1}{1 + \sum_{j>1} e^{(z_j - z_1)/T}}$$

As $T \to 0^+$: $(z_j - z_1)/T \to -\infty$ (since $z_j - z_1 < 0$), so $e^{(z_j-z_1)/T} \to 0$. Thus the denominator $\to 1$.

For AI: In LLM inference (GPT, LLaMA, etc.), temperature controls creativity vs. determinism. $T = 1$ samples from the trained distribution; $T < 1$ sharpens it (more predictable); $T > 1$ flattens it (more diverse/random). The limits above show $T \to 0$ recovers greedy (top-1) decoding, and $T \to \infty$ recovers pure random sampling.
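
Both limits are easy to observe numerically; a minimal sketch (softmax_T is our helper, reusing the max-shift from Section 8.2 for stability):

```python
# Temperature-scaled softmax: T -> 0 approaches one-hot argmax,
# T -> infinity approaches the uniform distribution 1/n.
import numpy as np

def softmax_T(z, T):
    a = z / T
    a -= a.max()               # log-sum-exp shift for stability
    e = np.exp(a)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.5])
for T in [100.0, 1.0, 0.1, 0.01]:
    print(f"T = {T:<6} -> {np.round(softmax_T(z, T), 4)}")
# T = 100  -> near-uniform (each entry close to 1/3)
# T = 0.01 -> essentially one-hot on the argmax index 0
```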

9.2 Sigmoid Saturation and Vanishing Gradients

The sigmoid function $\sigma: \mathbb{R} \to (0, 1)$:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Limits at infinity:

$$\lim_{x \to +\infty} \sigma(x) = 1, \quad \lim_{x \to -\infty} \sigma(x) = 0$$

Derivative:

$$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$

Limits of the derivative:

$$\lim_{x \to \pm\infty} \sigma'(x) = 0$$

The derivative at saturation is zero in the limit, not just small - the function is completely flat in the limit.

Vanishing gradient mechanism: In a deep network with sigmoid activations, the gradient of the loss with respect to layer $l$'s weights involves products of $\sigma'(x^{[l]})$ across all layers $l+1, \ldots, L$. If each factor is $\leq 0.25$ (the maximum of $\sigma(1-\sigma)$), then for $L - l = 20$ layers: $(0.25)^{20} \approx 10^{-12}$. The gradient effectively vanishes - Hochreiter & Schmidhuber (1997) identified this as the key failure mode of early RNNs.
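
The exponential decay is easy to reproduce; a minimal sketch (best case: every pre-activation sits at $x = 0$, the maximum of $\sigma'$):

```python
# Product of sigmoid-derivative factors across depth, at the best-case
# pre-activation x = 0 where sigma'(0) = 0.25 is maximal.
import numpy as np

def sigmoid_prime(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

for depth in [5, 10, 20, 40]:
    print(f"depth {depth:>2}: product of sigma' factors <= {sigmoid_prime(0.0)**depth:.3e}")
# depth 20: <= 9.095e-13 -- the ~10^-12 figure quoted above
```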

Fix: ReLU activations have $\lim_{x \to +\infty} \text{ReLU}'(x) = 1$ (no saturation for positive inputs). GELU and SiLU have smoother saturation profiles. Layer normalization and residual connections also mitigate the issue at the architecture level.

9.3 ReLU and GELU: Continuity at the Origin

ReLU: $\text{ReLU}(x) = \max(0, x) = \begin{cases} x & x > 0 \\ 0 & x \leq 0 \end{cases}$

Continuity check at $x = 0$:

$$\lim_{x \to 0^-} \text{ReLU}(x) = 0, \quad \lim_{x \to 0^+} \text{ReLU}(x) = 0, \quad \text{ReLU}(0) = 0$$

All equal: ReLU is continuous everywhere.

Derivative at $x = 0$: $\lim_{h \to 0^+} \frac{\text{ReLU}(h) - 0}{h} = 1$ but $\lim_{h \to 0^-} \frac{0 - 0}{h} = 0$. The one-sided derivatives differ: ReLU is not differentiable at $0$. This is a manageable issue in practice - a subgradient value of $0$ or $1$ (or $1/2$) is used.

GELU: $\text{GELU}(x) = x \cdot \Phi(x)$ where $\Phi$ is the standard normal CDF.

$$\lim_{x \to 0} \text{GELU}(x) = 0 \cdot \Phi(0) = 0 \cdot \frac{1}{2} = 0 = \text{GELU}(0)$$

$$\text{GELU}'(x) = \Phi(x) + x\phi(x)$$

$$\lim_{x \to 0} \text{GELU}'(x) = \frac{1}{2} + 0 = \frac{1}{2}$$

GELU is smooth ($C^\infty$) at the origin - no kink, unlike ReLU. This is why GELU-based networks (BERT, GPT-2 onward) train more smoothly in practice.

GELU vs. ReLU continuity comparison:

| Property | ReLU | GELU |
| --- | --- | --- |
| Continuous | Yes (everywhere) | Yes (everywhere) |
| Differentiable at 0 | No (jump in derivative) | Yes ($C^\infty$) |
| Saturates for $x \to -\infty$ | Gradient $= 0$ (dying ReLU) | Gradient $\to 0$ (soft) |
| Saturates for $x \to +\infty$ | Gradient $= 1$ (no saturation) | Gradient $\to 1$ (no saturation) |

9.4 Learning Rate Decay: Robbins-Monro Conditions

The stochastic gradient descent update $\theta_{t+1} = \theta_t - \alpha_t \nabla_\theta \mathcal{L}_t(\theta_t)$ converges (under convexity and bounded variance assumptions) if the learning rate schedule $\{\alpha_t\}$ satisfies:

$$\sum_{t=1}^\infty \alpha_t = \infty \qquad \text{and} \qquad \sum_{t=1}^\infty \alpha_t^2 < \infty$$

(Robbins & Monro, 1951)

Interpretation:

  • $\sum \alpha_t = \infty$: the total step size is infinite - SGD can travel arbitrarily far and cannot get stuck permanently
  • $\sum \alpha_t^2 < \infty$: the noise contribution (proportional to $\alpha_t^2$) is summable - gradient noise eventually becomes negligible

Standard schedules:

| Schedule | $\alpha_t$ | $\sum \alpha_t$ | $\sum \alpha_t^2$ | Satisfies RM? |
| --- | --- | --- | --- | --- |
| Constant | $\alpha$ | $\infty$ | $\infty$ | No (second fails) |
| $1/t$ | $\alpha / t$ | $\infty$ (harmonic) | $\alpha^2 \pi^2/6 < \infty$ | Yes |
| $1/\sqrt{t}$ | $\alpha / \sqrt{t}$ | $\infty$ | $\infty$ | No (second fails) |
| Exponential | $\alpha \cdot c^t$, $0 < c < 1$ | $\alpha/(1-c) < \infty$ | Finite | No (first fails) |

Only $1/t$ among these satisfies both - which is why it is the canonical theoretical schedule, even though practitioners prefer others (cosine, warmup) for practical reasons.
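
A quick check of the table (a minimal sketch; the infinite sums are truncated at $T = 10^6$ steps):

```python
# Partial sums of alpha_t and alpha_t^2 for the schedules in the table.
# Robbins-Monro wants the first to diverge and the second to stay bounded.
import numpy as np

T = 10**6
t = np.arange(1, T + 1, dtype=float)
schedules = {
    "constant": np.full(T, 0.1),
    "1/t":      1.0 / t,
    "1/sqrt t": 1.0 / np.sqrt(t),
    "0.99^t":   0.99 ** t,
}
for name, a in schedules.items():
    print(f"{name:>9}: sum = {a.sum():>12.2f}   sum of squares = {(a**2).sum():.4f}")
# Only 1/t combines a (slowly, ~ln T) diverging sum with a bounded sum
# of squares (-> pi^2/6 ~ 1.6449).
```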

9.5 Gradient as a Limit

The derivative (full treatment in 02-Derivatives) is defined as a limit:

$$f'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h}$$

This limit must exist (be finite, with equal one-sided limits) for $f$ to be differentiable at $a$.

For AI - automatic differentiation: PyTorch's autograd and JAX's grad do not compute this limit numerically. Instead, they apply the chain rule symbolically/computationally. But the mathematical object they compute is precisely this limit - evaluated at each intermediate node in the computation graph.

Numerical gradient checking: To verify a gradient implementation, compute:

$$\frac{f(\theta + h \mathbf{e}_i) - f(\theta - h \mathbf{e}_i)}{2h} \approx \frac{\partial f}{\partial \theta_i}$$

with $h = 10^{-5}$. The centered finite difference has error $O(h^2)$ vs. $O(h)$ for the one-sided version - both are limit approximations, with the centered form being more accurate.
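
A minimal gradient-check sketch (the toy loss $f(\theta) = \sum_i \theta_i^3$ and the helper names are ours):

```python
# Centered finite differences vs. an analytic gradient, h = 1e-5.
import numpy as np

def f(theta):
    return np.sum(theta**3)          # toy scalar loss

def grad_f(theta):
    return 3 * theta**2              # analytic gradient to verify

theta = np.array([0.5, -1.2, 2.0])
h = 1e-5
num = np.zeros_like(theta)
for i in range(theta.size):
    e = np.zeros_like(theta)
    e[i] = h
    num[i] = (f(theta + e) - f(theta - e)) / (2 * h)   # O(h^2) error

print("max abs error:", np.max(np.abs(num - grad_f(theta))))  # ~1e-10
```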

Preview: Derivatives. The limit $f'(a) = \lim_{h\to 0}[f(a+h)-f(a)]/h$ defines the derivative - the instantaneous rate of change. All differentiation rules (product, chain, etc.) and backpropagation follow from this limit.

-> Full treatment: Derivatives and Differentiation


10. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
| --- | --- | --- | --- |
| 1 | Evaluating $\lim_{x \to a} f(x)$ by computing $f(a)$ when $f$ is discontinuous at $a$ | The limit is about neighborhood behavior, not the point value; $f(a)$ may not exist or may differ | Always check whether $f$ is continuous at $a$ first; if not, compute the limit separately |
| 2 | Concluding the limit doesn't exist because $f(a)$ is undefined | Limits are independent of $f(a)$; a removable discontinuity has a well-defined limit | Compute the limit algebraically; undefined $f(a)$ does not imply no limit |
| 3 | Applying L'Hôpital's Rule to non-indeterminate forms | L'Hôpital is only valid for $0/0$ or $\pm\infty/\pm\infty$; other forms give wrong answers | Check: substitute and verify the form is indeterminate before applying L'Hôpital |
| 4 | Applying L'Hôpital repeatedly without checking if the form is still indeterminate | After one application, direct substitution may be possible; repeated L'Hôpital may cycle | After each application, try direct substitution before applying again |
| 5 | Confusing one-sided and two-sided limits | The two-sided limit requires both one-sided limits to exist and be equal | Compute $\lim^-$ and $\lim^+$ separately; the two-sided limit exists iff they agree |
| 6 | Asserting $\lim_{x\to\infty}f(x) = \infty$ means the limit exists | $\pm\infty$ are not real numbers; saying the limit is $\infty$ means divergence, not existence | Distinguish between "limit exists (finite)" and "limit is $\pm\infty$" |
| 7 | Assuming all three parts of continuity hold because $f$ looks smooth | Continuity requires all three conditions; a piecewise-defined function may fail condition 3 | Verify all three: $f(a)$ defined, limit exists, limit equals $f(a)$ |
| 8 | Computing $(e^x - 1)/x$ naively near $x = 0$ in code | Catastrophic cancellation destroys 8+ digits of precision for small $x$ | Use numpy.expm1(x) / x for numerically stable computation |
| 9 | Applying the Squeeze Theorem without verifying the inequalities hold near $a$ | The bounds must hold in a punctured neighborhood of $a$, not just at one point | Carefully verify $h(x) \leq f(x) \leq g(x)$ near $a$; sketch or prove each bound |
| 10 | Assuming sigmoid saturation means the gradient is small but nonzero | $\sigma'(x) \to 0$ exactly as $x \to \pm\infty$, not just approximately | For deep networks, the gradient multiplied across many saturated layers goes to zero exponentially fast |

11. Exercises

Eight graded exercises with worked solutions in exercises.ipynb.

| # | Difficulty | Topic | Parts |
| --- | --- | --- | --- |
| 1 | | Basic limit computation: factoring, trig, conjugates | (a)-(c) |
| 2 | | One-sided limits and existence | (a)-(b) |
| 3 | | L'Hôpital's Rule: three indeterminate forms | (a)-(c) |
| 4 | | Continuity analysis: classify discontinuities | (a)-(c) |
| 5 | | Squeeze Theorem: prove $x^2\sin(1/x)\to 0$ and verify IVT | (a)-(b) |
| 6 | | epsilon-delta proof: verify $\lim_{x\to 2}(3x-1)=5$ and find explicit $\delta(\varepsilon)$ | (a)-(c) |
| 7 | | Gradient as a limit: numerical vs. analytic finite differences | (a)-(c) |
| 8 | | Cross-entropy limit: $\lim_{p\to 0^+} p\log p$ and entropy continuity | (a)-(c) |

12. Why This Matters for AI (2026 Perspective)

| Concept | AI / LLM Application | Specific Example |
| --- | --- | --- |
| epsilon-delta limits | Foundation of automatic differentiation | PyTorch autograd, JAX grad implement the limit $f'(a) = \lim_{h\to 0}(f(a+h)-f(a))/h$ algebraically |
| One-sided limits | Derivative of ReLU at the origin | Subgradient at $x=0$ chosen from $[0,1]$; training is stable because ReLU is continuous |
| Squeeze Theorem | Convergence proofs for SGD | Bounding optimization error between upper and lower envelopes of the loss landscape |
| Continuity | Activation function design | GELU / SiLU chosen over hard ReLU for $C^\infty$ smoothness; improves gradient flow |
| IVT | Loss surface root finding | Line search algorithms (backtracking, Wolfe conditions) rely on IVT to guarantee sufficient decrease |
| EVT | Well-posedness of optimization | Constrained optimization over compact parameter sets attains its minimum by EVT |
| Limits at infinity | Vanishing/exploding gradients | $\sigma'(x)\to 0$ as $x\to\pm\infty$: the mathematical root of the LSTM/ResNet motivation |
| $\lim_{T\to 0}\text{softmax}_T$ | Temperature-scaled LLM decoding | Temperature in GPT-4, LLaMA-3 inference controls creativity vs. accuracy |
| $\lim_{x\to 0^+} x\ln x = 0$ | Shannon entropy convention | $0\log 0 = 0$ makes cross-entropy continuous; used in every classification loss |
| Catastrophic cancellation | Numerical stability in training | torch.nn.CrossEntropyLoss uses log_softmax to avoid overflow - a direct fix for limit instability |
| Robbins-Monro conditions | SGD learning rate theory | Cosine decay + warmup approximates RM conditions while being practically efficient |
| Big-O at infinity | Transformer complexity | Attention is $O(n^2 d)$; linear attention approximations are $O(nd)$ - limit theory quantifies the gap |

Conceptual Bridge

Looking backward: Limits formalize the intuition of "approaching" that appears throughout earlier sections. The real number system (01-Mathematical-Foundations) provides the completeness axiom - every nonempty set bounded above has a supremum - which is what makes limits well-defined. Without $\mathbb{R}$ being complete, Cauchy sequences might not converge, and the entire limit framework would collapse. Linear algebra (02-03) introduced matrix norms $\lVert A \rVert$; limit-based analysis of norm sequences $\lVert A^k \rVert$ underlies the study of matrix powers and iterative methods.

Looking forward: Limits are the gateway to all of calculus. The derivative (02-Derivatives-and-Differentiation) is defined as $\lim_{h\to 0}[f(a+h)-f(a)]/h$ - every differentiation rule follows from properties of limits. The integral (03-Integration) is a limit of Riemann sums. Series (04-Series-and-Sequences) are limits of partial sums; their convergence is governed by limit tests (ratio, root, comparison). In multivariate calculus (05), limits extend to functions of several variables, and continuity becomes the prerequisite for partial derivatives and the multivariable chain rule - the mathematical core of backpropagation.

Beyond calculus, functional analysis (12) studies limits in infinite-dimensional spaces (Banach and Hilbert spaces), and the operator norm $\lVert T \rVert$ is defined as $\sup_{\lVert x \rVert \leq 1} \lVert Tx \rVert$ - a supremum, hence a limit of approximations. Measure theory (24) defines integration via limits of simple functions. The entire edifice of continuous mathematics rests on the foundation laid here.

POSITION IN CURRICULUM

  01-Mathematical-Foundations (number systems, functions)
  02-Linear-Algebra-Basics (vectors, matrices, norms)
        |
        v
  04-01: LIMITS AND CONTINUITY (YOU ARE HERE)
    epsilon-delta limits * Limit laws * One-sided limits
    L'Hôpital * Squeeze Theorem * Fundamental limits
    Continuity * IVT * EVT * Asymptotic behavior
    Numerical stability * ML applications
        |
        v
  04-02: Derivatives (derivative as limit, chain rule, backprop)
  04-04: Series & Sequences (partial sums as limits, Taylor series, convergence)
        |
        v
  05: Multivariate Calculus (partial derivatives, gradient, chain rule in R^n, Jacobian)
        |
        v
  08: Optimization (gradient descent, Newton's method, convergence guarantees)


<- Back to Calculus Fundamentals | Next: Derivatives and Differentiation ->


Appendix A: Extended Proofs and Derivations

A.1 Proving Limit Laws from epsilon-delta

Theorem (Product Law). If $\lim_{x \to a} f(x) = L$ and $\lim_{x \to a} g(x) = M$, then $\lim_{x \to a} f(x)g(x) = LM$.

Proof. The key trick is to write:

$$f(x)g(x) - LM = f(x)g(x) - Lg(x) + Lg(x) - LM = g(x)[f(x) - L] + L[g(x) - M]$$

Given $\varepsilon > 0$. Since $g(x) \to M$, $g$ is bounded near $a$: choose $\delta_0$ so that $0 < \lvert x - a \rvert < \delta_0 \implies \lvert g(x) - M \rvert < 1$, and hence $\lvert g(x) \rvert \leq \lvert M \rvert + 1$.

Set $B = \lvert M \rvert + 1$. Choose:

  • $\delta_1$: $0 < \lvert x - a \rvert < \delta_1 \implies \lvert f(x) - L \rvert < \frac{\varepsilon}{2B}$
  • $\delta_2$: $0 < \lvert x - a \rvert < \delta_2 \implies \lvert g(x) - M \rvert < \frac{\varepsilon}{2(\lvert L \rvert + 1)}$

Set $\delta = \min(\delta_0, \delta_1, \delta_2)$. Then for $0 < \lvert x-a \rvert < \delta$:

$$\lvert f(x)g(x) - LM \rvert \leq \lvert g(x) \rvert \lvert f(x) - L \rvert + \lvert L \rvert \lvert g(x) - M \rvert < B \cdot \frac{\varepsilon}{2B} + \lvert L \rvert \cdot \frac{\varepsilon}{2(\lvert L \rvert+1)} < \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon \quad \square$$

Theorem (Composition Law). If limxaf(x)=L\lim_{x \to a} f(x) = L and gg is continuous at LL, then limxag(f(x))=g(L)\lim_{x \to a} g(f(x)) = g(L).

Proof. Given ε>0\varepsilon > 0. Since gg is continuous at LL: η>0\exists \eta > 0 such that uL<η    g(u)g(L)<ε\lvert u - L \rvert < \eta \implies \lvert g(u) - g(L) \rvert < \varepsilon.

Since limxaf(x)=L\lim_{x \to a} f(x) = L: δ>0\exists \delta > 0 such that 0<xa<δ    f(x)L<η0 < \lvert x - a \rvert < \delta \implies \lvert f(x) - L \rvert < \eta.

Combining: 0<xa<δ    f(x)L<η    g(f(x))g(L)<ε0 < \lvert x - a \rvert < \delta \implies \lvert f(x) - L \rvert < \eta \implies \lvert g(f(x)) - g(L) \rvert < \varepsilon. \square

A.2 The Cauchy Criterion for Limits

An equivalent characterization that avoids specifying the limit value:

Theorem (Cauchy Criterion). limxaf(x)\lim_{x \to a} f(x) exists (as a finite number) if and only if: for every ε>0\varepsilon > 0 there exists δ>0\delta > 0 such that for all x,yx, y with 0<xa<δ0 < \lvert x-a \rvert < \delta and 0<ya<δ0 < \lvert y-a \rvert < \delta:

f(x)f(y)<ε\lvert f(x) - f(y) \rvert < \varepsilon

This is useful when you suspect a limit exists but don't know its value.

A.3 Sequential Characterization of Limits

Theorem (Heine's Theorem). limxaf(x)=L\lim_{x \to a} f(x) = L if and only if for every sequence (xn)(x_n) with xnax_n \to a and xnax_n \neq a for all nn, we have f(xn)Lf(x_n) \to L.

Uses of sequential characterization:

  1. Proving limits exist: Find sequences that suggest the limit, then verify
  2. Proving limits don't exist: Find two sequences xnax_n \to a and ynay_n \to a such that f(xn)L1L2f(yn)f(x_n) \to L_1 \neq L_2 \leftarrow f(y_n)

Example (limit doesn't exist via sequences): limx0sin(1/x)\lim_{x \to 0} \sin(1/x).

  • Take xn=12πnx_n = \frac{1}{2\pi n}: then sin(1/xn)=sin(2πn)=00\sin(1/x_n) = \sin(2\pi n) = 0 \to 0
  • Take yn=1π/2+2πny_n = \frac{1}{\pi/2 + 2\pi n}: then sin(1/yn)=sin(π/2+2πn)=11\sin(1/y_n) = \sin(\pi/2 + 2\pi n) = 1 \to 1

Since 010 \neq 1, the limit does not exist. \square
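The same two sequences can be checked numerically - a minimal NumPy sketch of the argument above, nothing more:

import numpy as np

n = np.arange(1, 6)
x_n = 1 / (2 * np.pi * n)               # sin(1/x_n) = sin(2*pi*n) = 0
y_n = 1 / (np.pi / 2 + 2 * np.pi * n)   # sin(1/y_n) = sin(pi/2 + 2*pi*n) = 1

print(np.sin(1 / x_n))   # ~0 for every n, up to float rounding
print(np.sin(1 / y_n))   # ~1 for every n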

A.4 Proof of the Intermediate Value Theorem

Full Proof. Let f:[a,b]Rf: [a,b] \to \mathbb{R} be continuous, f(a)<k<f(b)f(a) < k < f(b). We show c(a,b)\exists c \in (a,b) with f(c)=kf(c) = k.

Define S={x[a,b]:f(x)<k}S = \{x \in [a,b] : f(x) < k\}. Then:

  • SS \neq \emptyset: aSa \in S since f(a)<kf(a) < k
  • SS is bounded above: by bb

Let c=supSc = \sup S. We claim f(c)=kf(c) = k.

Case f(c)<kf(c) < k: By continuity, δ1>0\exists \delta_1 > 0 such that xc<δ1    f(x)<k\lvert x - c \rvert < \delta_1 \implies f(x) < k. So (c,c+δ1)[a,b]S(c, c + \delta_1) \cap [a,b] \subseteq S, meaning cc is not an upper bound for SS - contradiction.

Case f(c)>kf(c) > k: By continuity, δ2>0\exists \delta_2 > 0 such that xc<δ2    f(x)>k\lvert x - c \rvert < \delta_2 \implies f(x) > k. So (cδ2,c)[a,b](c - \delta_2, c) \cap [a,b] contains no elements of SS, meaning cδ2/2c - \delta_2/2 is a smaller upper bound - contradiction.

Therefore f(c)=kf(c) = k. Note cac \neq a (since f(a)<kf(a) < k) and cbc \neq b (since f(b)>kf(b) > k), so c(a,b)c \in (a,b). \square

Key insight: The proof uses the completeness of R\mathbb{R} (every nonempty bounded set has a supremum) - IVT fails for functions f:QQf: \mathbb{Q} \to \mathbb{Q} on rationals because Q\mathbb{Q} is not complete.

A.5 Limit Superior and Limit Inferior

For a deeper understanding of oscillating limits, we introduce:

Definition. The limit superior is:

lim supxaf(x)=limδ0+sup0<xa<δf(x)\limsup_{x \to a} f(x) = \lim_{\delta \to 0^+} \sup_{0 < \lvert x-a \rvert < \delta} f(x)

The limit inferior is:

lim infxaf(x)=limδ0+inf0<xa<δf(x)\liminf_{x \to a} f(x) = \lim_{\delta \to 0^+} \inf_{0 < \lvert x-a \rvert < \delta} f(x)

Theorem. limxaf(x)\lim_{x \to a} f(x) exists and equals LL if and only if:

lim supxaf(x)=lim infxaf(x)=L\limsup_{x \to a} f(x) = \liminf_{x \to a} f(x) = L

Example: For f(x)=sin(1/x)f(x) = \sin(1/x):

  • $\limsup_{x \to 0} f(x) = 1$ (achieved along $x_n = 1/(\pi/2 + 2\pi n)$)
  • lim infx0f(x)=1\liminf_{x \to 0} f(x) = -1 (achieved along yn=1/(π/2+2πn)y_n = 1/(-\pi/2 + 2\pi n) for large nn)

Since 111 \neq -1, the limit does not exist - confirming our earlier result.

For AI: Limsup and liminf appear in the analysis of optimization algorithms. The limsup of gradient norms determines whether an algorithm has "bounded oscillation" - a prerequisite for convergence. In AdaGrad and Adam, the effective learning rate limsup is controlled by the accumulated second moment.


Appendix B: Worked Examples with Full Solutions

B.1 Twelve Limit Computations

Work through each, identifying the technique before computing.

1. limx0tanxx\displaystyle\lim_{x \to 0} \frac{\tan x}{x}

Form: 0/00/0. Since tanx=sinx/cosx\tan x = \sin x / \cos x:

tanxx=sinxx1cosx111=1\frac{\tan x}{x} = \frac{\sin x}{x} \cdot \frac{1}{\cos x} \to 1 \cdot \frac{1}{1} = 1

2. limx(x2+xx)\displaystyle\lim_{x \to \infty} \left(\sqrt{x^2 + x} - x\right)

Form: \infty - \infty. Rationalize:

x2+xx=(x2+x)x2x2+x+x=xx2+x+x=11+1/x+112\sqrt{x^2+x} - x = \frac{(x^2+x) - x^2}{\sqrt{x^2+x}+x} = \frac{x}{\sqrt{x^2+x}+x} = \frac{1}{\sqrt{1+1/x}+1} \to \frac{1}{2}

3. limx0+xx\displaystyle\lim_{x \to 0^+} x^x

Form: 000^0. Let y=xx=exlnxy = x^x = e^{x\ln x}. We showed xlnx0x\ln x \to 0, so ye0=1y \to e^0 = 1.

4. limxπ/2cosxxπ/2\displaystyle\lim_{x \to \pi/2} \frac{\cos x}{x - \pi/2}

Substitute u=xπ/2u = x - \pi/2 (so cosx=cos(u+π/2)=sinu\cos x = \cos(u + \pi/2) = -\sin u):

limu0sinuu=1\lim_{u \to 0} \frac{-\sin u}{u} = -1

5. limnn1/n\displaystyle\lim_{n \to \infty} n^{1/n}

Let y=n1/n=e(lnn)/ny = n^{1/n} = e^{(\ln n)/n}. Since (lnn)/n0(\ln n)/n \to 0 (log grows slower than linear):

ye0=1y \to e^0 = 1

6. limx0e3xe2xx\displaystyle\lim_{x \to 0} \frac{e^{3x} - e^{2x}}{x}

Factor: $e^{3x} - e^{2x} = e^{2x}(e^x - 1)$, so

$$\lim_{x \to 0} \frac{e^{2x}(e^x - 1)}{x} = \lim_{x \to 0} e^{2x} \cdot \frac{e^x - 1}{x} = e^0 \cdot 1 = 1$$

7. limx0ln(1+sinx)x\displaystyle\lim_{x \to 0} \frac{\ln(1 + \sin x)}{x}

Since sinxx\sin x \approx x near 00 and ln(1+u)/u1\ln(1+u)/u \to 1 as u0u \to 0:

ln(1+sinx)x=ln(1+sinx)sinxsinxx11=1\frac{\ln(1+\sin x)}{x} = \frac{\ln(1+\sin x)}{\sin x} \cdot \frac{\sin x}{x} \to 1 \cdot 1 = 1

8. limx01cosxx2\displaystyle\lim_{x \to 0} \frac{1 - \cos x}{x^2}

Multiply by conjugate:

1cosxx2=(1cosx)(1+cosx)x2(1+cosx)=sin2xx2(1+cosx)=(sinxx)211+cosx112=12\frac{1-\cos x}{x^2} = \frac{(1-\cos x)(1+\cos x)}{x^2(1+\cos x)} = \frac{\sin^2 x}{x^2(1+\cos x)} = \left(\frac{\sin x}{x}\right)^2 \cdot \frac{1}{1+\cos x} \to 1 \cdot \frac{1}{2} = \frac{1}{2}

9. limx0+sinxx\displaystyle\lim_{x \to 0^+} \frac{\sin x}{\sqrt{x}}

sinxx=xsinxx01=0\frac{\sin x}{\sqrt{x}} = \sqrt{x} \cdot \frac{\sin x}{x} \to 0 \cdot 1 = 0

10. limxlnxx\displaystyle\lim_{x \to \infty} \frac{\ln x}{x}

L'Hôpital ($\infty/\infty$): $\frac{1/x}{1} = \frac{1}{x} \to 0$.

11. limx1xn1x1\displaystyle\lim_{x \to 1} \frac{x^n - 1}{x - 1}

Factor: xn1=(x1)(xn1+xn2++1)x^n - 1 = (x-1)(x^{n-1} + x^{n-2} + \cdots + 1). Cancel (x1)(x-1): limit is nn.

Alternatively: this is the derivative of xnx^n at x=1x = 1, which is n1n1=nn \cdot 1^{n-1} = n.

12. limx0arcsinxx\displaystyle\lim_{x \to 0} \frac{\arcsin x}{x}

Let y=arcsinxy = \arcsin x, so x=sinyx = \sin y and y0y \to 0 as x0x \to 0:

arcsinxx=ysiny=(sinyy)111=1\frac{\arcsin x}{x} = \frac{y}{\sin y} = \left(\frac{\sin y}{y}\right)^{-1} \to 1^{-1} = 1
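Several of the $x \to 0$ limits above can be sanity-checked numerically by evaluating at shrinking inputs - a rough check, not a proof; for very small $x$, catastrophic cancellation eventually corrupts the quotients, as discussed in the numerical stability material:

import numpy as np

limits = [
    ("tan(x)/x",       lambda x: np.tan(x) / x,            1.0),   # problem 1
    ("ln(1+sin x)/x",  lambda x: np.log1p(np.sin(x)) / x,  1.0),   # problem 7
    ("(1-cos x)/x^2",  lambda x: (1 - np.cos(x)) / x**2,   0.5),   # problem 8
    ("arcsin(x)/x",    lambda x: np.arcsin(x) / x,         1.0),   # problem 12
]
for name, f, L in limits:
    print(name, [float(f(10.0**-k)) for k in (1, 3, 5)], "-> expected", L)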

B.2 Continuity Verification Walkthrough

Problem: Determine where f(x)=x21x1f(x) = \frac{x^2 - 1}{\lvert x - 1 \rvert} is continuous and classify any discontinuities.

Solution:

For x1x \neq 1:

f(x)=(x1)(x+1)x1f(x) = \frac{(x-1)(x+1)}{\lvert x-1 \rvert}

Case x>1x > 1: x1=x1\lvert x-1 \rvert = x-1, so f(x)=x+1f(x) = x + 1. Continuous for x>1x > 1.

Case x<1x < 1: x1=(x1)=1x\lvert x-1 \rvert = -(x-1) = 1-x, so f(x)=(x+1)f(x) = -(x+1). Continuous for x<1x < 1.

At x=1x = 1: f(1)f(1) is undefined (division by zero). Check one-sided limits:

limx1+f(x)=1+1=2,limx1f(x)=(1+1)=2\lim_{x \to 1^+} f(x) = 1 + 1 = 2, \quad \lim_{x \to 1^-} f(x) = -(1+1) = -2

One-sided limits differ: 222 \neq -2. This is a jump discontinuity at x=1x = 1 with jump of 44.

The discontinuity is not removable (both limits are finite but unequal).

B.3 epsilon-delta Proof for a Nonlinear Limit

Claim: limx2x2=4\lim_{x \to 2} x^2 = 4.

Proof: Given ε>0\varepsilon > 0. We need x24<ε\lvert x^2 - 4 \rvert < \varepsilon when x2<δ\lvert x - 2 \rvert < \delta.

Factor: x24=x2x+2\lvert x^2 - 4 \rvert = \lvert x-2 \rvert \cdot \lvert x+2 \rvert.

Control x+2\lvert x+2 \rvert: Assume x2<1\lvert x - 2 \rvert < 1, so 1<x<31 < x < 3, giving 3<x+2<53 < x + 2 < 5, so x+2<5\lvert x+2 \rvert < 5.

Then: x24=x2x+2<5x2\lvert x^2 - 4 \rvert = \lvert x-2 \rvert \cdot \lvert x+2 \rvert < 5\lvert x-2 \rvert.

Choose δ=min ⁣(1,ε5)\delta = \min\!\left(1, \frac{\varepsilon}{5}\right). Then:

x2<δ    x24<5ε5=ε\lvert x-2 \rvert < \delta \implies \lvert x^2 - 4 \rvert < 5 \cdot \frac{\varepsilon}{5} = \varepsilon \quad \square

Pattern: For polynomial limits, always: (1) factor out xa\lvert x-a \rvert, (2) bound the remaining factor using xa<1\lvert x-a \rvert < 1, (3) choose δ=min(1,ε/bound)\delta = \min(1, \varepsilon/\text{bound}).


Appendix C: Connections to Advanced Mathematics

C.1 Topological Perspective

In topology, continuity is defined without reference to metrics. A function f:XYf: X \to Y between topological spaces is continuous if for every open set VYV \subseteq Y, the preimage f1(V)={xX:f(x)V}f^{-1}(V) = \{x \in X : f(x) \in V\} is open in XX.

For R\mathbb{R}: Open sets are unions of open intervals. The topological definition is equivalent to the epsilon-delta definition because epsilon-balls are the open sets of R\mathbb{R}.

Consequence: The continuous image of a compact set is compact. For $[a,b]$ this gives: $f([a,b])$ is closed and bounded, so $f$ attains its maximum and minimum - the Extreme Value Theorem.

C.2 Uniform Continuity and Lipschitz Maps in ML

A function f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m is LL-Lipschitz if:

f(x)f(y)Lxyx,y\lVert f(\mathbf{x}) - f(\mathbf{y}) \rVert \leq L \lVert \mathbf{x} - \mathbf{y} \rVert \quad \forall \mathbf{x}, \mathbf{y}

Lipschitz continuity implies uniform continuity (take δ=ε/L\delta = \varepsilon/L).

In ML:

  • Spectral normalization (Miyato et al., 2018): constrains each weight matrix to have spectral norm 1\leq 1, making each layer 1-Lipschitz, making the whole network O(1)O(1)-Lipschitz
  • Wasserstein GAN (Arjovsky et al., 2017): the discriminator must be 1-Lipschitz (enforced via gradient penalty or spectral norm), which makes the Wasserstein distance well-defined
  • Gradient clipping: bounding LC\lVert \nabla \mathcal{L} \rVert \leq C ensures the loss is CC-Lipschitz in parameters - controls training instability
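As a concrete illustration of the first bullet: the Lipschitz constant of a linear layer $\mathbf{x} \mapsto W\mathbf{x}$ (in the Euclidean norm) is its largest singular value, so spectral normalization amounts to dividing by it. A minimal NumPy sketch, with a randomly generated stand-in for a real weight matrix:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128)) / np.sqrt(128)   # hypothetical weight matrix

L = np.linalg.svd(W, compute_uv=False)[0]   # spectral norm = Lipschitz constant
print("Lipschitz constant of x -> Wx:", L)

W_sn = W / L                                 # spectral normalization
print(np.linalg.svd(W_sn, compute_uv=False)[0])   # ~1.0: the layer is 1-Lipschitz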

C.3 Limits in Metric Spaces

The epsilon-delta definition generalizes directly to metric spaces: replace xa\lvert x - a \rvert with d(x,a)d(x, a) for any metric dd. This covers:

  • p\ell^p spaces: xap<δ\lVert \mathbf{x} - \mathbf{a} \rVert_p < \delta for vector limits
  • Function spaces: convergence of sequences of functions (pointwise, uniform, L2L^2)
  • Matrix sequences: convergence in Frobenius norm AnAF<ε\lVert A_n - A \rVert_F < \varepsilon

Convergence of matrix series: The matrix exponential eA=k=0Ak/k!e^A = \sum_{k=0}^\infty A^k / k! is defined as a limit of partial sums in the matrix operator norm - a direct generalization of the scalar exponential limit.
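A quick check of this convergence in the Frobenius norm - a sketch that uses scipy.linalg.expm as the reference value:

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
A = 0.5 * rng.normal(size=(4, 4))

target = expm(A)
partial = np.eye(4)
term = np.eye(4)
for k in range(1, 21):
    term = term @ A / k               # term = A^k / k!
    partial = partial + term
    if k % 5 == 0:
        print(k, np.linalg.norm(partial - target, "fro"))   # -> 0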

C.4 Dirichlet and Thomae Functions: Pathological Examples

Dirichlet function:

D(x)={1xQ0xQD(x) = \begin{cases} 1 & x \in \mathbb{Q} \\ 0 & x \notin \mathbb{Q} \end{cases}

This function has no limit at any point (rationals and irrationals are dense in R\mathbb{R}, so any neighborhood of aa contains both). It is continuous nowhere.

Thomae's function (popcorn function):

T(x)={1/qx=p/q in lowest terms,xQ0xQT(x) = \begin{cases} 1/q & x = p/q \text{ in lowest terms}, x \in \mathbb{Q} \\ 0 & x \notin \mathbb{Q} \end{cases}

This function satisfies limxaT(x)=0\lim_{x \to a} T(x) = 0 for every aa (irrationals accumulate near every point, and the rational values 1/q1/q become small for large qq). So TT is continuous at every irrational and discontinuous at every rational - a function continuous exactly on the irrationals.

These pathological examples show that the epsilon-delta framework is necessary: intuition from smooth curves does not predict the behavior of all functions.


Appendix D: Numerical Methods Grounded in Limits

D.1 Bisection Method

The bisection method exploits the IVT to find roots of f(x)=0f(x) = 0.

Algorithm:

  1. Start with [a0,b0][a_0, b_0] where f(a0)f(b0)<0f(a_0) \cdot f(b_0) < 0
  2. At step nn: let mn=(an+bn)/2m_n = (a_n + b_n)/2
    • If f(an)f(mn)<0f(a_n) \cdot f(m_n) < 0: set [an+1,bn+1]=[an,mn][a_{n+1}, b_{n+1}] = [a_n, m_n]
    • Else: set [an+1,bn+1]=[mn,bn][a_{n+1}, b_{n+1}] = [m_n, b_n]
  3. By IVT, ff has a root in every [an,bn][a_n, b_n]
  4. limnan=limnbn=c\lim_{n \to \infty} a_n = \lim_{n \to \infty} b_n = c where f(c)=0f(c) = 0

Convergence: After nn steps, the interval has length (b0a0)/2n(b_0 - a_0)/2^n. To achieve accuracy ε\varepsilon: need nlog2((b0a0)/ε)n \geq \log_2((b_0-a_0)/\varepsilon) steps. This is O(log1/ε)O(\log 1/\varepsilon) - linear in the number of bits of precision.

Connection to limits: Bisection computes limnmn\lim_{n \to \infty} m_n numerically. The sequence (mn)(m_n) is Cauchy (differences go to zero), so it converges in R\mathbb{R} by completeness. The limit is the root - IVT guarantees existence, completeness guarantees the numerical process converges.
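A direct translation of the algorithm - a minimal sketch, where f is any continuous function with a sign change on [a, b]:

import math

def bisect(f, a, b, tol=1e-12):
    """Shrink [a, b] until its length is below tol; the IVT guarantees a root inside."""
    assert f(a) * f(b) < 0, "need a sign change on [a, b]"
    while b - a > tol:
        m = (a + b) / 2
        if f(a) * f(m) < 0:
            b = m     # sign change in [a, m]
        else:
            a = m     # sign change in [m, b]
    return (a + b) / 2

# Example: the unique root of cos(x) - x in [0, 1]
print(bisect(lambda x: math.cos(x) - x, 0.0, 1.0))   # ~0.7390851332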

D.2 Newton's Method as a Limit Process

Newton's method computes:

xn+1=xnf(xn)f(xn)x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}

This converges (quadratically near a simple root) to cc where f(c)=0f(c) = 0.

The update formula comes from the linear approximation: f(x)f(xn)+f(xn)(xxn)f(x) \approx f(x_n) + f'(x_n)(x - x_n), set to zero: x=xnf(xn)/f(xn)x = x_n - f(x_n)/f'(x_n). This linear approximation is itself a limit statement - the tangent line is limh0\lim_{h \to 0} of secant lines.

For AI: Quasi-Newton methods (BFGS, L-BFGS) approximate f(xn)1f'(x_n)^{-1} (the Hessian inverse) via finite differences of gradients. At the heart of each update is a finite difference approximation of the second derivative - a discrete limit.
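For comparison with bisection, a minimal sketch of the iteration - quadratic convergence near a simple root, versus bisection's one bit of accuracy per step:

import math

def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Iterate x <- x - f(x)/f'(x) until the step size falls below tol."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Same root as the bisection example, reached in a handful of iterations
print(newton(lambda x: math.cos(x) - x, lambda x: -math.sin(x) - 1.0, 1.0))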

D.3 Gradient Checking Implementation

To verify that an implemented gradient g^i\hat{g}_i matches the true gradient f/θi\partial f / \partial \theta_i, use the centered finite difference:

giFD=f(θ+hei)f(θhei)2hg_i^{\text{FD}} = \frac{f(\theta + h \mathbf{e}_i) - f(\theta - h \mathbf{e}_i)}{2h}

The relative error check:

g^gFDg^+gFD<105\frac{\lVert \hat{g} - g^{\text{FD}} \rVert}{\lVert \hat{g} \rVert + \lVert g^{\text{FD}} \rVert} < 10^{-5}

is a heuristic validation that the implementation of the analytic gradient is correct.

Why centered? The centered difference approximates ff' with error O(h2)O(h^2) (from Taylor: f(x+h)f(xh)=2hf(x)+(h3/3)f(x)+f(x+h) - f(x-h) = 2hf'(x) + (h^3/3)f'''(x) + \ldots), while the one-sided approximation has error O(h)O(h). The limit is the same but the rate of approach is faster for centered differences.
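Putting the pieces together on a toy loss whose analytic gradient is known - a sketch (the function and threshold are illustrative; Appendix S collects the individual utilities):

import numpy as np

def f(theta):                                   # toy "loss"
    return 0.5 * np.sum(theta**2) + np.sin(theta[0])

def analytic_grad(theta):
    g = theta.copy()
    g[0] += np.cos(theta[0])
    return g

theta, h = np.array([0.3, -1.2, 2.0]), 1e-5
g_fd = np.array([(f(theta + h * e) - f(theta - h * e)) / (2 * h)
                 for e in np.eye(theta.size)])
g_hat = analytic_grad(theta)

rel_err = np.linalg.norm(g_hat - g_fd) / (np.linalg.norm(g_hat) + np.linalg.norm(g_fd))
print(rel_err)   # far below 1e-5: the analytic gradient passes the check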


Appendix E: Limits in Probability and Statistics

E.1 Law of Large Numbers

The weak law of large numbers states: if X1,X2,X_1, X_2, \ldots are i.i.d. with mean μ\mu, then for every ε>0\varepsilon > 0:

limnP ⁣(1ni=1nXiμ>ε)=0\lim_{n \to \infty} P\!\left(\left\lvert \frac{1}{n}\sum_{i=1}^n X_i - \mu \right\rvert > \varepsilon\right) = 0

This is a limit statement about a sequence of probabilities - a "convergence in probability." The expected loss over the training distribution is μ=E[L]\mu = \mathbb{E}[\mathcal{L}]; the empirical loss L^n=1niL(xi)\hat{\mathcal{L}}_n = \frac{1}{n}\sum_i \mathcal{L}(\mathbf{x}_i) converges to it in the limit of infinite data.
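A simulation of the running mean makes the limit visible - a sketch with exponential samples of mean $\mu = 2$:

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)     # i.i.d. with mu = 2
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
for n in (10, 1_000, 100_000, 1_000_000):
    print(f"n = {n:>9}: |mean - mu| = {abs(running_mean[n - 1] - 2.0):.6f}")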

E.2 Central Limit Theorem as a Limit

The CLT states: for i.i.d. XiX_i with mean μ\mu and variance σ2\sigma^2:

n(Xˉnμσ)dN(0,1)\sqrt{n}\left(\frac{\bar{X}_n - \mu}{\sigma}\right) \xrightarrow{d} \mathcal{N}(0, 1)

This "convergence in distribution" is a limit of CDFs: for every continuity point tt of the standard normal CDF Φ\Phi:

limnP ⁣(n(Xˉnμσ)t)=Φ(t)\lim_{n \to \infty} P\!\left(\sqrt{n}\left(\frac{\bar{X}_n - \mu}{\sigma}\right) \leq t\right) = \Phi(t)

For AI: Mini-batch gradient estimates converge to the full-batch gradient in distribution as batch size grows - the CLT justifies treating mini-batch gradients as Gaussian perturbations of the true gradient, underpinning stochastic optimization theory (Mandt, Hoffman, Blei, 2017).

E.3 KL Divergence and Limit Continuity

The KL divergence DKL(PQ)=ipilog(pi/qi)D_{\text{KL}}(P \| Q) = \sum_i p_i \log(p_i / q_i) requires the convention 0log0=00 \log 0 = 0 (using limp0+plogp=0\lim_{p \to 0^+} p \log p = 0) and is undefined when qi=0q_i = 0 but pi>0p_i > 0.

Continuity: With the convention 0log0=00 \log 0 = 0, DKLD_{\text{KL}} is lower-semicontinuous: for any sequence PnPP_n \to P, lim infnDKL(PnQ)DKL(PQ)\liminf_{n} D_{\text{KL}}(P_n \| Q) \geq D_{\text{KL}}(P \| Q). This is a limit property that ensures KL minimization is well-posed.

For AI: RLHF training (Ouyang et al., 2022) includes a KL penalty DKL(πθπref)D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) to prevent the model from drifting too far from the reference policy. The continuity of KL guarantees that small policy changes produce small KL penalties - essential for stable RLHF training.
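A minimal sketch of the convention in code; scipy.special.rel_entr computes the elementwise $p_i \log(p_i/q_i)$ with the same $0 \log 0 = 0$ convention and serves as a cross-check:

import numpy as np
from scipy.special import rel_entr

p = np.array([0.5, 0.5, 0.0])    # p_3 = 0: the 0*log(0) = 0 convention applies
q = np.array([0.4, 0.5, 0.1])

mask = p > 0                     # drop the 0*log(0) terms explicitly
kl_manual = np.sum(p[mask] * np.log(p[mask] / q[mask]))
print(kl_manual, rel_entr(p, q).sum())   # the two agree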


Appendix F: Glossary of Limit-Related Terms

| Term | Definition |
| --- | --- |
| Limit | $\lim_{x\to a}f(x) = L$: $f(x)$ can be made arbitrarily close to $L$ by taking $x$ sufficiently close to (but not equal to) $a$ |
| epsilon-delta definition | Formal definition: $\forall\varepsilon>0,\ \exists\delta>0:\ 0<\lvert x-a\rvert<\delta \implies \lvert f(x)-L\rvert<\varepsilon$ |
| One-sided limit | $\lim_{x\to a^+}$ (right) or $\lim_{x\to a^-}$ (left) - approaches $a$ from one side only |
| Limit at infinity | $\lim_{x\to\infty}f(x)=L$: $f(x)\to L$ as $x$ grows without bound |
| Infinite limit | $\lim_{x\to a}f(x)=\pm\infty$: $f(x)$ grows without bound as $x\to a$; not a real limit |
| Continuity | $f$ is continuous at $a$ if $f(a)$ exists, $\lim_{x\to a}f(x)$ exists, and they are equal |
| Removable discontinuity | $\lim_{x\to a}f(x)$ exists but $f(a)$ is undefined or $\neq$ the limit |
| Jump discontinuity | One-sided limits exist but differ: $\lim_{x\to a^-}f(x)\neq\lim_{x\to a^+}f(x)$ |
| Essential discontinuity | At least one one-sided limit is $\pm\infty$ or doesn't exist |
| Indeterminate form | Forms like $0/0$, $\infty/\infty$, $0\cdot\infty$, $1^\infty$ that require further analysis |
| L'Hôpital's Rule | For $0/0$ or $\pm\infty/\pm\infty$: $\lim f/g = \lim f'/g'$ |
| Squeeze Theorem | $h\leq f\leq g$ and $h,g\to L$ implies $f\to L$ |
| IVT | Continuous $f$ on $[a,b]$ attains every value between $f(a)$ and $f(b)$ |
| EVT | Continuous $f$ on $[a,b]$ attains its maximum and minimum |
| Uniform continuity | One $\delta$ works uniformly for all points (not point-dependent) |
| Lipschitz continuity | $\lvert f(x)-f(y)\rvert\leq L\lvert x-y\rvert$ - quantitative uniform continuity |
| Machine epsilon | Smallest $\varepsilon$ with $\text{fl}(1+\varepsilon)>1$ in floating point ($\approx 10^{-16}$ for float64) |
| Catastrophic cancellation | Loss of significant digits when subtracting nearly equal floating-point numbers |
| limsup / liminf | Largest/smallest accumulation point of $f(x)$ as $x\to a$ |
| Big-O | $f=O(g)$: $\lvert f\rvert\leq C\lvert g\rvert$ near $a$ for some $C>0$ |
| Little-o | $f=o(g)$: $f/g\to 0$ as $x\to a$; $f$ is negligible compared to $g$ |

Appendix G: Further Reading and References

Textbooks

  1. Stewart, J. (2015). Calculus: Early Transcendentals (8th ed.). Cengage. - Standard undergraduate reference; clear motivation and worked examples.

  2. Spivak, M. (2006). Calculus (4th ed.). Publish or Perish. - Rigorous treatment; complete proofs of all major theorems including IVT and EVT.

  3. Rudin, W. (1976). Principles of Mathematical Analysis (3rd ed.). McGraw-Hill. - The graduate-level standard; epsilon-delta proofs throughout; metric space generalization.

  4. Apostol, T. M. (1974). Mathematical Analysis (2nd ed.). Addison-Wesley. - Thorough treatment of limits, continuity, and the Riemann integral.

Machine Learning Connections

  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Ch. 4 (Numerical computation), Ch. 6 (Activation functions). - Numerical stability, sigmoid/ReLU analysis.

  2. Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735-1780. - Identifies vanishing gradient via limit behavior of sigmoid.

  3. Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22(3), 400-407. - Original convergence conditions for SGD.

  4. Vaswani, A., et al. (2017). Attention is All You Need. NeurIPS. - Softmax temperature in attention.

  5. Miyato, T., et al. (2018). Spectral Normalization for Generative Adversarial Networks. ICLR. - Lipschitz continuity in deep learning.

  6. Mandt, S., Hoffman, M. D., & Blei, D. M. (2017). Stochastic Gradient Descent as Approximate Bayesian Inference. JMLR. - CLT applied to SGD noise.

Online Resources

  1. 3Blue1Brown - "Essence of Calculus" (YouTube). Visual intuition for limits and derivatives.

  2. Paul's Online Math Notes (tutorial.math.lamar.edu). Comprehensive worked examples at undergraduate level.

  3. MIT OpenCourseWare 18.01 (Single Variable Calculus). Full lecture notes and problem sets.


Appendix H: Detailed ML Worked Examples

H.1 Softmax Numerical Stability: Full Derivation

The naive softmax computation:

def softmax_naive(z):
    return np.exp(z) / np.sum(np.exp(z))

fails for large zi\lvert z_i \rvert due to overflow (e1000=e^{1000} = \infty in float64) or underflow (e1000=0e^{-1000} = 0).

The log-sum-exp trick exploits the limit-preserving shift:

softmax(z)i=ezijezj=ezimjezjm,m=maxjzj\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}} = \frac{e^{z_i - m}}{\sum_j e^{z_j - m}}, \quad m = \max_j z_j

This is valid because ezim/jezjme^{z_i - m} / \sum_j e^{z_j - m} equals ezi/jezje^{z_i}/\sum_j e^{z_j} (both numerator and denominator are multiplied by eme^{-m}). After shifting, all exponents are 0\leq 0, so no overflow. The maximum term contributes e0=1e^0 = 1, preventing underflow of the sum.

Derivation of stability bound: Let ai=zim0a_i = z_i - m \leq 0. Then eai(0,1]e^{a_i} \in (0, 1] for all ii. The sum jeaj[1,n]\sum_j e^{a_j} \in [1, n] (at least eaargmax=1e^{a_{\text{argmax}}} = 1, at most nn terms each 1\leq 1). No overflow or underflow.

Log-softmax (needed for cross-entropy):

logsoftmax(z)i=zimlogjezjm\log\text{softmax}(\mathbf{z})_i = z_i - m - \log\sum_j e^{z_j - m}

This is the numerically stable form used in torch.nn.CrossEntropyLoss.

Limit interpretation: The log-sum-exp function LSE(z)=logjezj\text{LSE}(\mathbf{z}) = \log\sum_j e^{z_j} is a smooth approximation to the max:

limβ1βlogjeβzj=maxjzj\lim_{\beta \to \infty} \frac{1}{\beta} \log\sum_j e^{\beta z_j} = \max_j z_j

This is another limit - the "softmax" smoothly approaches the argmax as temperature T=1/β0T = 1/\beta \to 0.
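Both points can be demonstrated in a few lines - a sketch reusing the definitions above, with logits chosen to force overflow in the naive form:

import numpy as np

def softmax_stable(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1000.0, 999.0, 0.0])
with np.errstate(over="ignore", invalid="ignore"):
    print(np.exp(z) / np.sum(np.exp(z)))   # naive: nan from inf/inf
print(softmax_stable(z))                    # [0.731, 0.269, 0.000]

def lse(v):                                 # stable log-sum-exp
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

z = np.array([3.0, 1.0, 2.0])
for beta in (1, 10, 100, 1000):
    print(beta, lse(beta * z) / beta)       # -> max(z) = 3 as beta grows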

H.2 Vanishing Gradient: Quantitative Analysis via Limits

Consider an LL-layer network with sigmoid activations. The gradient of loss L\mathcal{L} with respect to weights in layer ll is:

LW[l]=La[L]k=l+1La[k]a[k1]\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[L]}} \cdot \prod_{k=l+1}^{L} \frac{\partial \mathbf{a}^{[k]}}{\partial \mathbf{a}^{[k-1]}}

Each factor a[k]a[k1]=diag(σ(z[k]))W[k]\frac{\partial \mathbf{a}^{[k]}}{\partial \mathbf{a}^{[k-1]}} = \text{diag}(\sigma'(\mathbf{z}^{[k]})) \cdot W^{[k]} involves σ(z)=σ(z)(1σ(z))1/4\sigma'(z) = \sigma(z)(1-\sigma(z)) \leq 1/4.

If we model each factor as a scalar 1/4\leq 1/4:

LW[l]C(14)Ll\left\lVert \frac{\partial \mathcal{L}}{\partial W^{[l]}} \right\rVert \leq C \cdot \left(\frac{1}{4}\right)^{L-l}

As LlL - l \to \infty (very deep network, gradient flows from the last layer to layer ll):

limL(14)Ll=0\lim_{L \to \infty} \left(\frac{1}{4}\right)^{L-l} = 0

The gradient vanishes exponentially in the depth. For Ll=10L - l = 10: (1/4)10106(1/4)^{10} \approx 10^{-6}. For Ll=20L - l = 20: (1/4)201012(1/4)^{20} \approx 10^{-12}.

Contrast with ReLU: ReLU(x)=1[x>0]{0,1}\text{ReLU}'(x) = \mathbf{1}[x > 0] \in \{0, 1\}. For positive pre-activations, each factor is 11 (no attenuation). The limit is 1Ll=11^{L-l} = 1 - gradient passes unchanged. This is the key advantage of ReLU for deep networks (He et al., 2015).

ResNet correction: Residual connections add a skip path: a[k]=F(a[k1])+a[k1]\mathbf{a}^{[k]} = \mathcal{F}(\mathbf{a}^{[k-1]}) + \mathbf{a}^{[k-1]}. The gradient becomes:

a[k]a[k1]=Fa[k1]+I\frac{\partial \mathbf{a}^{[k]}}{\partial \mathbf{a}^{[k-1]}} = \frac{\partial \mathcal{F}}{\partial \mathbf{a}^{[k-1]}} + I

The identity term II prevents the product from decaying to zero - the residual "highway" carries gradients regardless of the activation. Mathematically: even if F/a0\lVert \partial\mathcal{F}/\partial\mathbf{a}\rVert \to 0 (saturated activations), the full Jacobian stays bounded away from zero.
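A scalar caricature of the product of derivative factors - a sketch only, since real networks multiply Jacobians, but the geometric decay is the same:

import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=30)                  # hypothetical pre-activations, depth 30

sig = 1 / (1 + np.exp(-z))
print(np.prod(sig * (1 - sig)))          # <= (1/4)^30 ~ 1e-18: vanished
print(np.prod((z > 0).astype(float)))    # ReLU: 1.0 while every z > 0,
                                         # but exactly 0 once any unit is dead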

H.3 Learning Rate Schedules: Limit Analysis

Cosine annealing:

αt=αmin+12(αmaxαmin)(1+cos ⁣(πtT))\alpha_t = \alpha_\text{min} + \frac{1}{2}(\alpha_\text{max} - \alpha_\text{min})\left(1 + \cos\!\left(\frac{\pi t}{T}\right)\right)

As tTt \to T: cos(πt/T)cos(π)=1\cos(\pi t/T) \to \cos(\pi) = -1, so αtαmin\alpha_t \to \alpha_\text{min}.

The schedule is continuous (cosine is continuous) and smooth, avoiding the discontinuous jumps of step-decay schedules. The limit αT=αmin>0\alpha_T = \alpha_\text{min} > 0 means it does not satisfy the Robbins-Monro conditions - in practice this is acceptable because modern large-batch training typically runs for a fixed number of steps, not until convergence.

Warmup + linear decay:

αt={αmaxt/TwarmtTwarmαmax(Tt)/(TTwarm)t>Twarm\alpha_t = \begin{cases} \alpha_\text{max} \cdot t/T_\text{warm} & t \leq T_\text{warm} \\ \alpha_\text{max} \cdot (T - t)/(T - T_\text{warm}) & t > T_\text{warm} \end{cases}

At $t = T_\text{warm}$, both branches give $\alpha_\text{max}$, so the schedule is continuous there. Because the horizon $T$ is fixed, $\sum_t \alpha_t$ and $\sum_t \alpha_t^2$ are finite sums, and the Robbins-Monro conditions - which concern infinite schedules - do not strictly apply; like cosine annealing, this schedule is justified empirically rather than by asymptotic convergence theory.

GPT-3 / LLaMA learning rate schedule: Uses cosine decay with warmup, which empirically outperforms the theoretically optimal 1/t1/t schedule for large-batch transformer training. The theory is underdeveloped; the practice is empirically justified.
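Both schedules in code, checking continuity at the warmup boundary - a sketch; the horizon and peak rate are illustrative:

import numpy as np

def cosine_lr(t, T, lr_max, lr_min=0.0):
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * t / T))

def warmup_linear_lr(t, T, T_warm, lr_max):
    t = np.asarray(t, dtype=float)
    return np.where(t <= T_warm,
                    lr_max * t / T_warm,
                    lr_max * (T - t) / (T - T_warm))

T, T_warm, lr_max = 10_000, 1_000, 3e-4
print(warmup_linear_lr([T_warm - 1, T_warm, T_warm + 1], T, T_warm, lr_max))
# ~[2.997e-4, 3.0e-4, 2.9997e-4]: continuous through the warmup boundary
print(cosine_lr(np.array([0, T // 2, T]), T, lr_max))   # [lr_max, lr_max/2, 0]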

H.4 Adam Optimizer: Limit Behavior

The Adam update (Kingma & Ba, 2014):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t \qquad \text{(first moment)}$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \qquad \text{(second moment)}$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \qquad \text{(bias correction)}$$

$$\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$$

Limit as $t \to \infty$: The bias-correction factors $1/(1-\beta_1^t) \to 1$ and $1/(1-\beta_2^t) \to 1$ (since $\beta_1^t, \beta_2^t \to 0$), so for large $t$ the corrected and uncorrected moment estimates coincide; the correction matters only in early training.

Limit for sparse gradients: If gt=0g_t = 0 for many steps, then vt0v_t \to 0 (exponential decay). The effective learning rate α/(vt+ε)α/ε\alpha/(\sqrt{v_t} + \varepsilon) \to \alpha/\varepsilon - Adam gives a large step for a parameter that hasn't received gradients recently. This is the "adaptive" feature: rarely-updated parameters get large steps when they finally do receive gradients. This is controlled by the limit behavior of the second moment estimator.
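The sparse-gradient limit is easy to see numerically: supply one gradient, then none, and track the effective step scale $\alpha/(\sqrt{\hat{v}_t} + \varepsilon)$ - a sketch of the second-moment recursion only:

import numpy as np

alpha, b2, eps = 1e-3, 0.999, 1e-8
v = 0.0
for t in range(1, 20_001):
    g = 1.0 if t == 1 else 0.0        # one gradient, then a long quiet period
    v = b2 * v + (1 - b2) * g**2
    v_hat = v / (1 - b2**t)
    if t in (1, 100, 5_000, 20_000):
        # effective step scale grows toward alpha/eps = 1e5 as v_hat -> 0
        print(t, alpha / (np.sqrt(v_hat) + eps))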


Appendix I: Self-Assessment Questions

Conceptual Questions

  1. Explain in your own words why the epsilon-delta definition requires 0<xa0 < \lvert x - a \rvert (strict inequality, excluding x=ax = a) but the continuity definition does not exclude x=ax = a.

  2. Give an example of a function ff such that:

    • ff is defined at aa
    • limxaf(x)\lim_{x \to a} f(x) exists
    • But $f$ is not continuous at $a$

    Explain which of the three continuity conditions fails.
  3. Why does the Squeeze Theorem require the inequality h(x)f(x)g(x)h(x) \leq f(x) \leq g(x) to hold in a neighborhood of aa, not just at aa?

  4. L'Hôpital's Rule is often stated as: "just differentiate numerator and denominator separately." What is wrong with this description, and what are the actual conditions for the rule to apply?

  5. Why does the IVT require the function to be continuous on a closed interval [a,b][a,b] and not just on the open interval (a,b)(a,b)? Give a counterexample showing what can go wrong on an open interval.

  6. Explain why limxf(x)=L\lim_{x \to \infty} f(x) = L (finite) and limxf(x)=\lim_{x \to \infty} f(x) = \infty represent fundamentally different situations, even though both involve xx \to \infty.

  7. In what sense is the derivative f(a)f'(a) "just a limit"? What property of the limit (one-sided vs. two-sided) determines whether ff is differentiable at aa?

Computational Practice

  1. Compute $\lim_{x \to 0} \frac{e^{2x} - 2e^x + 1}{x^2}$ using three different methods: (a) L'Hôpital twice, (b) Taylor expansion, (c) substitution.

  2. Find all discontinuities of f(x)=sin(πx)x21f(x) = \frac{\sin(\pi x)}{x^2 - 1} and classify each.

  3. Prove from the epsilon-delta definition that limx3x=3\lim_{x \to 3} \sqrt{x} = \sqrt{3}. (Hint: rationalize x3\lvert \sqrt{x} - \sqrt{3} \rvert.)

  4. Use the Squeeze Theorem to show limx0+x1/x=1\lim_{x \to 0^+} x \lfloor 1/x \rfloor = 1. (Hint: 1/x1<1/x1/x1/x - 1 < \lfloor 1/x \rfloor \leq 1/x.)

  5. Determine whether f(x)=n=0xn/n!f(x) = \sum_{n=0}^\infty x^n / n! (the Taylor series for exe^x) is continuous on all of R\mathbb{R}. (Hint: uniform convergence on compact sets.)

ML Application Questions

  1. Write Python code to compute limT0+softmaxT([3,1,2])\lim_{T \to 0^+} \text{softmax}_T([3, 1, 2]) numerically for T=1,0.1,0.01,0.001,0.0001T = 1, 0.1, 0.01, 0.001, 0.0001. What do you observe? At what temperature does floating-point underflow cause problems with naive implementation?

  2. The learning rate αt=C/tγ\alpha_t = C/t^\gamma satisfies the Robbins-Monro conditions for which values of γ\gamma? Prove your answer by computing αt\sum \alpha_t and αt2\sum \alpha_t^2.

  3. Implement the stable sigmoid σstable(x)\sigma_{\text{stable}}(x) that avoids overflow for both large positive and large negative xx. Verify that it matches scipy.special.expit(x) to full precision for x{1000,100,1,0,1,100,1000}x \in \{-1000, -100, -1, 0, 1, 100, 1000\}.


Appendix J: Connection Map - Limits Throughout the Curriculum

This section appears at the foundation of calculus, but limits permeate the entire curriculum.

WHERE LIMITS APPEAR ACROSS THE CURRICULUM


04-01  LIMITS AND CONTINUITY (here) ──────────────────── All of Analysis
  │
  ├─ 04-02 Derivatives          f'(a) = lim_{h->0} [f(a+h) - f(a)] / h
  ├─ 04-03 Integration          ∫ f = lim_{n->∞} Σ f(x_i*) Δx  (Riemann sums)
  ├─ 04-04 Series               Σ a_n = lim_{N->∞} S_N  (partial sums)
  ├─ 05  Multivariate Calculus  ∂f/∂x_i = lim_{h->0} [f(x + h·e_i) - f(x)] / h
  ├─ 06  Probability            P(A) = lim_{n->∞} #{outcomes in A}/n  (frequentist)
  ├─ 08  Optimization           Convergence: lim_{t->∞} ||∇L(θ_t)|| = 0
  ├─ 10  Numerical Methods      Finite differences, iterative convergence
  ├─ 12  Functional Analysis    Operator limits, Banach/Hilbert spaces
  └─ 24  Measure Theory         Integration as a limit of simple functions


Every subsequent section in this curriculum builds on the limit concept introduced here. The epsilon-delta definition is the seed; the rest of mathematics is the tree.


Appendix K: Proofs of Fundamental Limits

K.1 Proof that lim(n->infinity)(1 + 1/n)^n = e

We show the sequence an=(1+1/n)na_n = (1 + 1/n)^n is increasing and bounded, hence convergent, and define ee as its limit.

Step 1: ana_n is increasing.

By AM-GM: for positive numbers x1,,xn+1x_1, \ldots, x_{n+1}:

x1++xn+1n+1(x1xn+1)1/(n+1)\frac{x_1 + \cdots + x_{n+1}}{n+1} \geq (x_1 \cdots x_{n+1})^{1/(n+1)}

Apply with nn copies of (1+1/n)(1 + 1/n) and one copy of 11:

n(1+1/n)+1n+1[(1+1/n)n1]1/(n+1)\frac{n(1+1/n) + 1}{n+1} \geq \left[(1+1/n)^n \cdot 1\right]^{1/(n+1)} n+2n+1=1+1n+1an1/(n+1)\frac{n+2}{n+1} = 1 + \frac{1}{n+1} \geq a_n^{1/(n+1)}

Raising both sides to the (n+1)(n+1)-th power: an+1ana_{n+1} \geq a_n. \square

Step 2: ana_n is bounded above by 33.

By the binomial theorem:

an=(1+1n)n=k=0n(nk)1nk=k=0nn(n1)(nk+1)k!nka_n = \left(1+\frac{1}{n}\right)^n = \sum_{k=0}^n \binom{n}{k}\frac{1}{n^k} = \sum_{k=0}^n \frac{n(n-1)\cdots(n-k+1)}{k! \cdot n^k}

Each term n(n1)(nk+1)k!nk=1k!nnn1n<1k!\frac{n(n-1)\cdots(n-k+1)}{k! \cdot n^k} = \frac{1}{k!}\cdot\frac{n}{n}\cdot\frac{n-1}{n}\cdots < \frac{1}{k!}.

Since $k! \geq 2^{k-1}$ for $k \geq 1$, each $1/k! \leq 1/2^{k-1}$, so

$$a_n < \sum_{k=0}^n \frac{1}{k!} \leq 1 + \sum_{k=1}^{\infty} \frac{1}{2^{k-1}} = 1 + 2 = 3$$

By the monotone convergence theorem, the increasing bounded sequence ana_n converges. We define e=limn(1+1/n)n2.718e = \lim_{n\to\infty}(1+1/n)^n \approx 2.718\ldots \square

K.2 Proof that lim(x->0) sin(x)/x = 1

We give the geometric proof in detail.

Setup: Consider the unit circle centered at the origin. Let O=(0,0)O = (0,0), A=(1,0)A = (1,0), P=(cosx,sinx)P = (\cos x, \sin x) for 0<x<π/20 < x < \pi/2, and T=(1,tanx)T = (1, \tan x) (the tangent line at AA meets the line OPOP at TT).

Area inequalities:

Area(OAP)Area(sector OAP)Area(OAT)\text{Area}(\triangle OAP) \leq \text{Area(sector }OAP) \leq \text{Area}(\triangle OAT)

Computing each:

  • Area(OAP)=121sinx=sinx2\text{Area}(\triangle OAP) = \frac{1}{2} \cdot 1 \cdot \sin x = \frac{\sin x}{2} (base OA=1OA = 1, height =sinx= \sin x)
  • Area(sector)=x2ππ(12)=x2\text{Area(sector)} = \frac{x}{2\pi} \cdot \pi(1^2) = \frac{x}{2} (fraction x/2πx/2\pi of unit circle area π\pi)
  • Area(OAT)=121tanx=tanx2\text{Area}(\triangle OAT) = \frac{1}{2} \cdot 1 \cdot \tan x = \frac{\tan x}{2}

So: sinx2x2tanx2\frac{\sin x}{2} \leq \frac{x}{2} \leq \frac{\tan x}{2}

Multiply by 2sinx>0\frac{2}{\sin x} > 0:

1xsinx1cosx1 \leq \frac{x}{\sin x} \leq \frac{1}{\cos x}

Take reciprocals (reverse inequalities):

cosxsinxx1\cos x \leq \frac{\sin x}{x} \leq 1

Since limx0+cosx=1\lim_{x\to 0^+} \cos x = 1 and limx0+1=1\lim_{x\to 0^+} 1 = 1, by Squeeze: limx0+sinxx=1\lim_{x\to 0^+} \frac{\sin x}{x} = 1.

For x0x \to 0^-: use sin(x)/(x)=sin(x)/x1\sin(-x)/(-x) = \sin(x)/x \to 1 (same by symmetry). \square

K.3 Proof of L'Hôpital's Rule (0/0 case)

Theorem. Let f,gf, g be differentiable on (ar,a+r){a}(a - r, a + r) \setminus \{a\} for some r>0r > 0. Suppose limxaf(x)=limxag(x)=0\lim_{x\to a} f(x) = \lim_{x\to a} g(x) = 0, g(x)0g'(x) \neq 0 near aa, and limxaf(x)/g(x)=L\lim_{x\to a} f'(x)/g'(x) = L. Then limxaf(x)/g(x)=L\lim_{x\to a} f(x)/g(x) = L.

Proof (using Cauchy's MVT): Extend ff and gg by f(a)=g(a)=0f(a) = g(a) = 0 (so they are continuous at aa).

Cauchy's Mean Value Theorem: For any xax \neq a (say x>ax > a), there exists cx(a,x)c_x \in (a, x) such that:

f(x)f(a)g(x)g(a)=f(cx)g(cx)\frac{f(x) - f(a)}{g(x) - g(a)} = \frac{f'(c_x)}{g'(c_x)}

Since f(a)=g(a)=0f(a) = g(a) = 0: f(x)g(x)=f(cx)g(cx)\frac{f(x)}{g(x)} = \frac{f'(c_x)}{g'(c_x)}.

As xa+x \to a^+: cx(a,x)ac_x \in (a, x) \to a, so cxa+c_x \to a^+.

Therefore: limxa+f(x)g(x)=limxa+f(cx)g(cx)=limca+f(c)g(c)=L\lim_{x\to a^+} \frac{f(x)}{g(x)} = \lim_{x\to a^+} \frac{f'(c_x)}{g'(c_x)} = \lim_{c\to a^+} \frac{f'(c)}{g'(c)} = L.

Similarly from the left. \square


Appendix L: Extended ML Applications - Advanced Topics

L.1 Limits in Attention Mechanisms

The scaled dot-product attention:

Attention(Q,K,V)=softmax ⁣(QKdk)V\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V

The scaling by dk\sqrt{d_k} prevents the inner products QKQK^\top from growing large in magnitude. Without scaling, as dkd_k \to \infty, the logits QKQK^\top grow like O(dk)O(\sqrt{d_k}) (random initialization variance), pushing softmax into the saturation regime:

limdksoftmax ⁣(QKdk)(depends on structure)\lim_{d_k \to \infty} \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) \to \text{(depends on structure)}

With scaling: QK/dkQK^\top / \sqrt{d_k} remains O(1)O(1) as dkd_k \to \infty (assuming unit-variance queries and keys), keeping softmax in its linear regime where gradients are large.

Formal statement: If q,kRdkq, k \in \mathbb{R}^{d_k} with independent N(0,1)\mathcal{N}(0,1) components:

E[qk]=0,Var(qk)=dk\mathbb{E}[q \cdot k] = 0, \quad \text{Var}(q \cdot k) = d_k

So qk/dkq \cdot k / \sqrt{d_k} has variance 11 regardless of dkd_k - this is the limit limdkVar(qk/dk)=1\lim_{d_k\to\infty} \text{Var}(q\cdot k/\sqrt{d_k}) = 1.
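This variance calculation is easy to reproduce - a sketch drawing 10,000 random query/key pairs per dimension:

import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256, 1024):
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))
    dots = np.sum(q * k, axis=1)
    print(d_k, round(dots.var(), 1), round((dots / np.sqrt(d_k)).var(), 3))
    # unscaled variance ~ d_k; scaled variance ~ 1 for every d_k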

L.2 Token Probability Limits in Language Modeling

An autoregressive language model assigns probability:

P(wtw1,,wt1)=softmax(Wht1)wtP(w_t \mid w_1, \ldots, w_{t-1}) = \text{softmax}(W \mathbf{h}_{t-1})_{w_t}

Limit as context grows: What happens to P(wtw1,,wt1)P(w_t \mid w_1, \ldots, w_{t-1}) as tt \to \infty? Under standard ergodicity assumptions on the language distribution, the model's uncertainty should approach the entropy of the distribution:

limtH(wtw1,,wt1)=H\lim_{t \to \infty} H(w_t \mid w_1, \ldots, w_{t-1}) = H_\infty

where HH_\infty is the entropy rate of the language. This limit (which exists for stationary ergodic processes) is the asymptotic uncertainty per token - the information-theoretic lower bound on perplexity.

L.3 Neural Tangent Kernel and Infinite Width Limits

As the width nn \to \infty of a neural network, the network function converges (in distribution) to a Gaussian process with a specific kernel called the Neural Tangent Kernel (Jacot et al., 2018):

limnfθ(x)GP(0,KNTK)\lim_{n \to \infty} f_\theta(\mathbf{x}) \sim \mathcal{GP}(0, K_{\text{NTK}})

Moreover, in this limit, training with gradient descent is equivalent to kernel regression with KNTKK_{\text{NTK}}:

limnθt=θ0η0tθfθsθLsds\lim_{n \to \infty} \theta_t = \theta_0 - \eta \int_0^t \nabla_\theta f_{\theta_s} \nabla_\theta \mathcal{L}_s \, ds

This infinite-width limit is an active research area connecting neural network training to the well-understood theory of kernel methods and Gaussian processes - all via limit theory applied to neural networks as a function of width.

L.4 Grokking as a Limit Phenomenon

Grokking (Power et al., 2022): the phenomenon where a neural network first overfits (near-zero training loss, high test loss), then after continued training, suddenly generalizes (both losses drop).

From a limit perspective: the training loss is minimized early (the optimization trajectory approaches a global minimum of the training objective), but generalization requires the model to reach a different basin. The sudden transition can be understood as the solution's "complexity" (measured by norms or effective rank) crossing a critical threshold.

The eventual test loss $\lim_{t\to\infty} \mathcal{L}_{\text{test}}(\theta_t)$ may be much smaller than the test loss $\mathcal{L}_{\text{test}}(\theta_{T_1})$ at an intermediate time $T_1$ - the limit of the process is only revealed by running it long enough. This is a reminder that limits of optimization trajectories are not always approached quickly.


Appendix M: Quick Reference Card

LIMITS AND CONTINUITY - QUICK REFERENCE


DEFINITION            lim_{x->a} f(x) = L
                      ∀ε>0, ∃δ>0: 0 < |x-a| < δ ⟹ |f(x)-L| < ε

ONE-SIDED             lim_{x->a^+} and lim_{x->a^-}
                      Two-sided limit exists ⟺ both one-sided limits agree

LIMIT LAWS            lim(f±g) = lim f ± lim g
                      lim(fg) = (lim f)(lim g)
                      lim(f/g) = (lim f)/(lim g) if lim g != 0

KEY LIMITS            lim_{x->0} sin(x)/x = 1
                      lim_{x->0} (e^x - 1)/x = 1
                      lim_{n->∞} (1 + 1/n)^n = e
                      lim_{x->0^+} x·ln(x) = 0

TECHNIQUES            - Direct substitution (if continuous)
                      - Factoring / cancellation
                      - Rationalization (conjugate)
                      - L'Hôpital (0/0 or ∞/∞ only!)
                      - Squeeze Theorem

CONTINUITY            f continuous at a ⟺
                      (1) f(a) defined
                      (2) lim_{x->a} f(x) exists
                      (3) lim_{x->a} f(x) = f(a)

DISCONTINUITIES       Removable: limit exists, != f(a) or f(a) undefined
                      Jump: one-sided limits exist but differ
                      Essential: at least one one-sided limit is ±∞ or DNE

IVT                   f continuous on [a,b], k between f(a) and f(b)
                      ⟹ ∃c ∈ (a,b): f(c) = k

EVT                   f continuous on [a,b] ⟹ f attains max and min

SQUEEZE               h <= f <= g near a, lim h = lim g = L ⟹ lim f = L

STABILITY             (e^x - 1)/x near x=0: use numpy.expm1(x)/x
                      log(1+x)/x near x=0: use numpy.log1p(x)/x
                      log(softmax(z)): use log_softmax (LSE trick)

ML CONNECTIONS        softmax_T -> argmax as T->0, uniform as T->∞
                      σ(x) -> 1 as x->+∞, -> 0 as x->-∞ (vanishing gradient)
                      gradient = lim_{h->0} [f(x+h) - f(x)]/h
                      Robbins-Monro: Σα_t = ∞ AND Σα_t² < ∞



Appendix N: Problem Set - Additional Exercises

N.1 Computation Problems

N1. Compute the following limits without L'Hôpital's Rule:

(a) limx4x2x4\displaystyle\lim_{x \to 4} \frac{\sqrt{x} - 2}{x - 4}

(b) limh0(2+h)38h\displaystyle\lim_{h \to 0} \frac{(2+h)^3 - 8}{h}

(c) limx0sin(3x)5x\displaystyle\lim_{x \to 0} \frac{\sin(3x)}{5x}

(d) limx4x32x+17x3+x23\displaystyle\lim_{x \to \infty} \frac{4x^3 - 2x + 1}{7x^3 + x^2 - 3}

Solutions: (a) 1/41/4 (rationalize), (b) 1212 (derivative of x3x^3 at x=2x=2), (c) 3/53/5 (fundamental limit), (d) 4/74/7 (leading coefficients).

N2. Classify the discontinuities of f(x)=x23x+2x21f(x) = \frac{x^2 - 3x + 2}{x^2 - 1} on R\mathbb{R}.

Factor: numerator =(x1)(x2)= (x-1)(x-2), denominator =(x1)(x+1)= (x-1)(x+1).

At x=1x = 1: f(x)=x2x+1f(x) = \frac{x-2}{x+1} after cancellation; limx1f(x)=1/2\lim_{x\to 1} f(x) = -1/2. Removable discontinuity (redefine f(1)=1/2f(1) = -1/2).

At x=1x = -1: denominator 0\to 0, numerator (2)(3)=60\to (-2)(-3) = 6 \neq 0; f(x)±f(x) \to \pm\infty. Vertical asymptote / essential discontinuity.

N3. For which values of cc is g(x)={cx2+2xx<2x3cxx2g(x) = \begin{cases} cx^2 + 2x & x < 2 \\ x^3 - cx & x \geq 2 \end{cases} continuous at x=2x = 2?

Require: c(4)+4=82cc(4) + 4 = 8 - 2c, so 4c+4=82c4c + 4 = 8 - 2c, giving 6c=46c = 4, c=2/3c = 2/3.

N.2 Theory Problems

N4. Prove using the epsilon-delta definition that limx0xsin(1/x)=0\lim_{x \to 0} x \sin(1/x) = 0.

Proof: For any ε>0\varepsilon > 0, choose δ=ε\delta = \varepsilon. Then for 0<x<δ0 < \lvert x \rvert < \delta:

xsin(1/x)0=xsin(1/x)x1<δ=ε\lvert x \sin(1/x) - 0 \rvert = \lvert x \rvert \cdot \lvert \sin(1/x) \rvert \leq \lvert x \rvert \cdot 1 < \delta = \varepsilon \quad \square

N5. Show that the converse of the Extreme Value Theorem is false: give an example of $f$ attaining its maximum and minimum on $[0,1]$ without $f$ being continuous.

Example: $f: [0,1] \to \mathbb{R}$ with $f(0) = f(1) = 1$ and $f(x) = 0$ for $x \in (0,1)$ attains its maximum ($1$) and its minimum ($0$) but is discontinuous at $0$ and $1$. Attaining extrema does not imply continuity.

N6. (A one-dimensional fixed-point theorem - the $[0,1]$ case of Brouwer's theorem.) Let $f: [0,1] \to [0,1]$ be continuous. Show $f$ has at least one fixed point.

Define g(x)=f(x)xg(x) = f(x) - x. Then g(0)=f(0)0=f(0)0g(0) = f(0) - 0 = f(0) \geq 0 and g(1)=f(1)10g(1) = f(1) - 1 \leq 0. By IVT applied to gg (continuous) on [0,1][0,1]: c[0,1]\exists c \in [0,1] with g(c)=0g(c) = 0, i.e., f(c)=cf(c) = c. \square

N.3 ML Application Problems

N7. (Numerical stability.) Implement both naive and stable computations of log(1+ex)\log(1 + e^x) (the softplus function). For x=100,50,10,0,10,50x = 100, 50, 10, 0, -10, -50, compare results and relative errors.

Analysis: For large x>0x > 0: log(1+ex)x\log(1+e^x) \approx x. Naive exe^x overflows for x>709x > 709 (float64). Stable version: log(1+ex)=x+log(1+ex)\log(1+e^x) = x + \log(1 + e^{-x}) for x>0x > 0 (the exe^{-x} term is small and doesn't overflow). For x<0x < 0: use naive form since ex<1e^x < 1.
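A sketch of the two implementations; np.logaddexp(0, x) is NumPy's built-in equivalent of the stable form:

import numpy as np

def softplus_naive(x):
    return np.log(1 + np.exp(x))      # overflows once x exceeds ~709

def softplus_stable(x):
    # log(1 + e^x) = max(x, 0) + log(1 + e^{-|x|}); the exponent is never positive
    return np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))

for x in (100.0, 50.0, 10.0, 0.0, -10.0, -50.0):
    print(x, softplus_stable(x), np.logaddexp(0.0, x))   # the stable forms agree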

N8. (Learning rate theory.) Consider the schedule αt=α0/(1+βt)\alpha_t = \alpha_0 / (1 + \beta t) for constants α0,β>0\alpha_0, \beta > 0.

(a) Show this satisfies the first Robbins-Monro condition: t=1αt=\sum_{t=1}^\infty \alpha_t = \infty.

(b) Show this satisfies the second condition: t=1αt2<\sum_{t=1}^\infty \alpha_t^2 < \infty.

(c) Compare to αt=α0/t\alpha_t = \alpha_0 / \sqrt{t}: does this satisfy both conditions?

Solutions: (a) αtα0/(βt)\alpha_t \sim \alpha_0/(\beta t); 1/t=\sum 1/t = \infty (harmonic series). (b) αt2α02/(β2t2)\alpha_t^2 \sim \alpha_0^2/(\beta^2 t^2); 1/t2=π2/6<\sum 1/t^2 = \pi^2/6 < \infty. (c) 1/t1/\sqrt{t}: 1/t=\sum 1/\sqrt{t} = \infty (first holds), 1/t=\sum 1/t = \infty (second fails).


Appendix O: Typesetting Reference for LaTeX

When writing mathematical content on limits in LaTeX (following the notation guide):

% Limit notation
\lim_{x \to a} f(x) = L           % standard limit
\lim_{x \to a^+} f(x)             % right-hand limit  
\lim_{x \to a^-} f(x)             % left-hand limit
\lim_{x \to +\infty} f(x)         % limit at +infinity

% epsilon-delta definition
\forall \varepsilon > 0, \; \exists \delta > 0 : \;
  0 < \lvert x - a \rvert < \delta \implies \lvert f(x) - L \rvert < \varepsilon

% Fundamental limits
\lim_{x \to 0} \frac{\sin x}{x} = 1
\lim_{x \to 0} \frac{e^x - 1}{x} = 1
\lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n = e

% Continuity
f \text{ continuous at } a \iff
  \lim_{x \to a} f(x) = f(a)

% Big-O notation
f(x) = O(g(x)) \text{ as } x \to a
f(x) = o(g(x)) \text{ as } x \to a

Common errors in notation:

  • Use \lvert \cdot \rvert not |\cdot| for absolute value in LaTeX
  • Use \varepsilon not \epsilon (matches standard analysis texts)
  • Use \to not : for limit variable (i.e., xax \to a, not x:ax:a)
  • Use \infty not inf or Inf
  • Subscripts of limit use _ correctly: \lim_{x \to 0} not \lim{x \to 0}

Appendix P: Historical Problems and Their Solutions

P.1 Zeno's Paradoxes: The Ancient Limit Problem

Zeno of Elea (~450 BCE) posed paradoxes about motion that are precisely limit problems in disguise.

Achilles and the Tortoise: Achilles (speed vv) chases a tortoise (speed u<vu < v) with head start dd. Zeno argued: first Achilles must reach the tortoise's initial position (time d/vd/v); by then the tortoise has moved du/vd \cdot u/v further; then Achilles must cover that gap (time du/v2du/v^2); and so on. The paradox claims this infinite sequence of tasks cannot be completed.

Resolution via limits: The total time is:

T=dv+duv2+du2v3+=dvn=0(uv)n=dv11u/v=dvuT = \frac{d}{v} + \frac{du}{v^2} + \frac{du^2}{v^3} + \cdots = \frac{d}{v} \cdot \sum_{n=0}^\infty \left(\frac{u}{v}\right)^n = \frac{d}{v} \cdot \frac{1}{1 - u/v} = \frac{d}{v-u}

This is the correct finite time - the geometric series converges. The limit limNn=0N(d/v)(u/v)n=d/(vu)\lim_{N\to\infty} \sum_{n=0}^N (d/v)(u/v)^n = d/(v-u) is finite and gives the time Achilles catches the tortoise. Zeno's error was assuming an infinite series must have an infinite sum - a mistake corrected by the theory of limits.

Lesson: Infinite processes can have finite limits. The notion limN\lim_{N\to\infty} is precisely the mathematical resolution to Zeno's paradox.

P.2 Berkeley's Objection: "Ghosts of Departed Quantities"

Bishop George Berkeley (1734), in The Analyst, critiqued Newton's infinitesimals:

"And what are these Fluxions? The Velocities of evanescent Increments. And what are these same evanescent Increments? They are neither finite Quantities, nor Quantities infinitely small, nor yet nothing. May we not call them the Ghosts of departed Quantities?"

Berkeley's critique was valid: Newton used h0h \neq 0 to divide, then set h=0h = 0 to drop higher-order terms. This is logically inconsistent.

The resolution: Weierstrass's epsilon-delta definition never actually "sets h=0h = 0" - instead, it asks: for every ε>0\varepsilon > 0, can we find δ>0\delta > 0 (with h0h \neq 0) such that the difference quotient is within ε\varepsilon of f(a)f'(a)? This makes no appeal to h=0h = 0 - only to inequalities between real numbers. Berkeley's ghost is exorcised by phrasing everything as "hh approaches but never reaches 00."

P.3 Dirichlet's Proof of the Convergence of Fourier Series

Dirichlet (1829) gave the first rigorous proof that the Fourier series of a "reasonable" function converges to the function at points of continuity. The key ingredient was precise limit analysis: showing that the Dirichlet kernel DN(x)=sin((N+1/2)x)sin(x/2)D_N(x) = \frac{\sin((N+1/2)x)}{\sin(x/2)} satisfies limNDN(x)f(ax)dx=f(a)\lim_{N\to\infty} \int D_N(x) f(a - x) dx = f(a) at continuity points.

This required careful epsilon-delta arguments for limits of integrals - exactly the kind of rigorous limit theory that Weierstrass would later systematize.

For AI: Fourier analysis (20) is the mathematical foundation of signal processing. The convergence of Fourier series is a limit statement - the same limit theory studied here extends to function spaces and ensures that truncated Fourier representations converge to the original signal at points of continuity.

P.4 Cauchy's Error and Its Correction

Cauchy (1821) stated (incorrectly): "The limit of a sum of continuous functions is continuous." This is false in general - the pointwise limit of continuous functions need not be continuous.

Counterexample: fn(x)=xnf_n(x) = x^n on [0,1][0,1]. Each fnf_n is continuous. The pointwise limit:

f(x)=limnxn={00x<11x=1f(x) = \lim_{n\to\infty} x^n = \begin{cases} 0 & 0 \leq x < 1 \\ 1 & x = 1 \end{cases}

is discontinuous at x=1x = 1. Cauchy's error!

Correction: If the convergence is uniform (i.e., supxfn(x)f(x)0\sup_x \lvert f_n(x) - f(x) \rvert \to 0), then the limit is continuous. This stronger notion - uniform convergence - is the correct condition, discovered by Stokes and Seidel (1847).

For AI: Neural network functions fθf_\theta trained on finite data may not converge uniformly to the true function - only pointwise (or in some norm). The distinction between pointwise and uniform convergence explains generalization gaps: the network may correctly learn the training points but fail elsewhere.


Appendix Q: Summary of All Proof Techniques

| Technique | When to Use | Key Idea |
| --- | --- | --- |
| Direct substitution | $f$ continuous at $a$ | $\lim_{x\to a}f(x) = f(a)$ |
| Factoring | $0/0$ form with polynomials | Cancel the common factor $(x-a)$ |
| Rationalization | Square roots in numerator or denominator | Multiply by the conjugate |
| epsilon-delta construction | Proving a limit rigorously | Bound $\lvert f(x)-L\rvert$ in terms of $\lvert x-a\rvert$; choose $\delta$ as a function of $\varepsilon$ |
| Squeeze Theorem | $f$ is bounded between two functions with known limits | Find $h \leq f \leq g$ with $h,g \to L$ |
| L'Hôpital's Rule | $0/0$ or $\infty/\infty$ form | Replace $f/g$ by $f'/g'$ |
| Cauchy's MVT | Proving L'Hôpital, comparing rates | $[f(x)-f(a)]/[g(x)-g(a)] = f'(c)/g'(c)$ for some $c$ |
| Series expansion | Polynomial-like behavior near $0$ | Expand $e^x, \sin x, \ln(1+x)$ in Taylor series |
| Substitution | Simplify the variable | Replace $x$ by $u = \varphi(x)$ |
| Sequential argument | Proving a limit DNE | Find two sequences approaching $a$ with different limits |
| Limsup/liminf | Oscillating functions | Compute $\limsup$ and $\liminf$; the limit exists iff they agree |

Appendix R: Connections to Linear Algebra

R.1 Limits of Matrix Sequences

The theory of limits extends to matrices via matrix norms. For a sequence of matrices {Ak}k=1\{A_k\}_{k=1}^\infty:

Definition. limkAk=A\lim_{k\to\infty} A_k = A (in norm) means limkAkA=0\lim_{k\to\infty} \lVert A_k - A \rVert = 0 for any matrix norm \lVert \cdot \rVert.

Matrix exponential as a limit:

eA=limNk=0NAkk!=I+A+A22!+A33!+e^A = \lim_{N\to\infty} \sum_{k=0}^N \frac{A^k}{k!} = I + A + \frac{A^2}{2!} + \frac{A^3}{3!} + \cdots

This series converges absolutely (in any norm) for all matrices ARn×nA \in \mathbb{R}^{n\times n}.

Power iteration: The sequence vk+1=Avk/Avk\mathbf{v}_{k+1} = A\mathbf{v}_k / \lVert A\mathbf{v}_k \rVert converges (under mild conditions) to the eigenvector corresponding to the largest eigenvalue:

limkvk=u1(dominant eigenvector)\lim_{k\to\infty} \mathbf{v}_k = \mathbf{u}_1 \quad \text{(dominant eigenvector)}

This is a limit of a vector sequence - continuity of the norm and eigenvalue structure determine convergence.
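A minimal sketch of power iteration on a random symmetric matrix, compared against NumPy's eigensolver:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
A = (A + A.T) / 2                      # symmetric, so eigenvalues are real

v = rng.normal(size=50)
for _ in range(1000):
    v = A @ v
    v /= np.linalg.norm(v)             # normalize each step, as in the text

print(abs(v @ A @ v))                           # Rayleigh quotient at the limit
print(np.max(np.abs(np.linalg.eigvalsh(A))))    # dominant |eigenvalue|: agrees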

R.2 Spectral Radius and Stability

Spectral radius: ρ(A)=maxiλi\rho(A) = \max_i \lvert\lambda_i\rvert (largest magnitude eigenvalue).

Theorem (Gelfand's formula):

ρ(A)=limkAk1/k\rho(A) = \lim_{k\to\infty} \lVert A^k \rVert^{1/k}

This limit always exists and equals the spectral radius - a remarkable result connecting limits of norms to eigenvalue structure.

Stability of linear systems: The iteration xk+1=Axk\mathbf{x}_{k+1} = A\mathbf{x}_k converges to 0\mathbf{0} for any initial x0\mathbf{x}_0 if and only if ρ(A)<1\rho(A) < 1:

ρ(A)<1    limkAk=0\rho(A) < 1 \iff \lim_{k\to\infty} A^k = 0

For AI - gradient descent as a linear system: Near a minimum θ\theta^*, gradient descent θk+1=θkα2L(θ)(θkθ)\theta_{k+1} = \theta_k - \alpha \nabla^2\mathcal{L}(\theta^*)(\theta_k - \theta^*) is a linear iteration with matrix IαHI - \alpha H (where HH is the Hessian). Convergence requires ρ(IαH)<1\rho(I - \alpha H) < 1, i.e., 1αλi<1\lvert 1 - \alpha\lambda_i \rvert < 1 for all eigenvalues λi>0\lambda_i > 0 of HH. This gives the condition 0<α<2/λmax0 < \alpha < 2/\lambda_{\max} - a limit-theory result on convergence of gradient descent.
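Gelfand's formula can be watched converging numerically - a sketch, using the fact that the matrix 2-norm of $A^k$ is its largest singular value:

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 20)) / 5.0
rho = np.max(np.abs(np.linalg.eigvals(A)))      # spectral radius

for k in (1, 5, 20, 80):
    Ak = np.linalg.matrix_power(A, k)
    print(k, np.linalg.norm(Ak, 2) ** (1 / k), "->", rho)   # approaches rho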

R.3 Condition Number and Sensitivity Analysis

The condition number κ(A)=AA1\kappa(A) = \lVert A \rVert \lVert A^{-1} \rVert measures the sensitivity of Ax=bAx = b to perturbations in bb:

δxxκ(A)δbb\frac{\lVert \delta x \rVert}{\lVert x \rVert} \leq \kappa(A) \frac{\lVert \delta b \rVert}{\lVert b \rVert}

In the limit κ(A)\kappa(A) \to \infty (ill-conditioned), a tiny relative perturbation δb\delta b causes an arbitrarily large relative change in the solution xx. This is the matrix-level analog of the numerical instability (catastrophic cancellation) discussed in Section 8 - both arise from near-singularity, and both are limit phenomena.


Appendix S: Python and NumPy Reference for Limit Computations

import numpy as np

# === Numerically stable computations near limits ===

# (e^x - 1)/x near x=0: use expm1
def f_stable(x):
    """Compute (e^x - 1)/x stably for small x."""
    return np.expm1(x) / x  # NOT (np.exp(x) - 1) / x

# log(1 + x)/x near x=0: use log1p
def g_stable(x):
    """Compute log(1+x)/x stably for small x."""
    return np.log1p(x) / x  # NOT np.log(1+x) / x

# Stable softmax
def softmax_stable(z):
    z_shifted = z - np.max(z)
    exp_z = np.exp(z_shifted)
    return exp_z / exp_z.sum()

# Stable log-softmax (for cross-entropy)
def log_softmax_stable(z):
    m = np.max(z)
    return z - m - np.log(np.sum(np.exp(z - m)))

# Stable sigmoid
def sigmoid_stable(x):
    # Pick the algebraically equivalent branch that never exponentiates a large
    # positive number. Note: np.where evaluates both branches, so overflow
    # *warnings* can still fire even though the selected values are correct.
    return np.where(x >= 0,
                    1 / (1 + np.exp(-x)),
                    np.exp(x) / (1 + np.exp(x)))

# === Numerical limit approximation ===

def numerical_limit(f, a, h_values=None):
    """Numerically approximate lim_{x->a} f(x) via centered differences."""
    if h_values is None:
        h_values = [1e-1, 1e-2, 1e-4, 1e-6, 1e-8, 1e-10]
    print(f"{'h':>12} | {'f(a+h)':>15} | {'f(a-h)':>15} | {'avg':>15}")
    print("-" * 65)
    for h in h_values:
        fph = f(a + h)
        fmh = f(a - h)
        avg = (fph + fmh) / 2
        print(f"{h:>12.2e} | {fph:>15.10f} | {fmh:>15.10f} | {avg:>15.10f}")

# Example: lim_{x->0} sin(x)/x = 1
numerical_limit(lambda x: np.sin(x)/x if x != 0 else 1.0, a=0.0)

# === Gradient checking ===

def grad_check(f, theta, i, h=1e-5):
    """Centered finite difference for d/d(theta_i) f(theta)."""
    theta_plus = theta.copy(); theta_plus[i] += h
    theta_minus = theta.copy(); theta_minus[i] -= h
    return (f(theta_plus) - f(theta_minus)) / (2 * h)

Key takeaway: The choice of h in grad_check uses h=105h = 10^{-5} - the optimal value from Section 8.3 (centered finite difference has error O(h2)O(h^2), minimized around hu1/3h \approx u^{1/3} where u=252u = 2^{-52}).


End of Appendices. See theory.ipynb for interactive examples and exercises.ipynb for graded problems.


Appendix T: Summary of Key Results

| Result | Statement | Where Proved |
| --- | --- | --- |
| epsilon-delta limit definition | $\forall\varepsilon>0,\exists\delta>0: 0<\lvert x-a\rvert<\delta\implies\lvert f(x)-L\rvert<\varepsilon$ | 2.1 |
| Limit laws | Sum, product, quotient of limits | 2.2, App. A.1 |
| Squeeze Theorem | $h\leq f\leq g$, $h,g\to L$ $\Rightarrow$ $f\to L$ | 4.3, App. K.2 |
| $\lim_{x\to 0}\sin(x)/x=1$ | Geometric area argument | 3.1, App. K.2 |
| $\lim_{x\to 0}(e^x-1)/x=1$ | Taylor series / definition of the derivative of $e^x$ | 3.2 |
| $\lim(1+1/n)^n = e$ | Monotone convergence + binomial theorem | 3.3, App. K.1 |
| $\lim_{x\to 0^+}x\ln x=0$ | L'Hôpital on $\ln x/(1/x)$ | 3.4 |
| L'Hôpital's Rule | $\lim f/g = \lim f'/g'$ for $0/0$ or $\infty/\infty$ | 4.2, App. K.3 |
| Continuity definition | All three conditions: existence, limit, equality | 5.1 |
| Intermediate Value Theorem | Continuous $f$ on $[a,b]$ hits every intermediate value | 6.1, App. A.4 |
| Extreme Value Theorem | Continuous $f$ on $[a,b]$ attains max and min | 6.2 |
| Heine-Cantor Theorem | Continuous on $[a,b]$ $\Rightarrow$ uniformly continuous | 6.3 |
| Gelfand's formula | $\rho(A)=\lim_k\lVert A^k\rVert^{1/k}$ | App. R.2 |
| Robbins-Monro conditions | SGD converges when $\sum\alpha_t=\infty$ and $\sum\alpha_t^2<\infty$ | 9.4 |
| Softmax temperature limits | $T\to 0$: argmax; $T\to\infty$: uniform | 9.1 |
| Vanishing gradient rate | Gradient norm decays like $(1/4)^{L-l}$ through $L-l$ sigmoid layers | App. H.2 |

<- Back to Calculus Fundamentals | Next: Derivatives and Differentiation ->