"The concept of the limit is the cornerstone on which the whole of mathematical analysis ultimately rests."
- Aleksandr Khinchin, Mathematical Foundations of Information Theory (1957)
Overview
Limits formalize the intuition of "approaching" - what value does a function tend toward as its input gets arbitrarily close to some point? This deceptively simple question required two centuries to answer rigorously, from Newton and Leibniz's intuitive infinitesimals to Cauchy's sequential definition to Weierstrass's definitive epsilon-delta framework. The result is the foundation upon which all of calculus - and by extension, all of continuous optimization - is built.
Continuity is the qualitative property that emerges from limits behaving well: a function is continuous at a point if its limit there equals its actual value. Continuous functions are the "nice" functions of analysis - they preserve neighborhoods, satisfy the Intermediate Value Theorem, and attain extrema on compact sets. In machine learning, continuity is not just a mathematical nicety; it is a design criterion for activation functions, loss landscapes, and optimization trajectories.
For AI practitioners, limits appear in at least five critical contexts: (1) the definition of the derivative as a limit of difference quotients, which underlies all backpropagation; (2) softmax temperature annealing, where $T \to 0$ recovers argmax and $T \to \infty$ recovers the uniform distribution; (3) the vanishing gradient problem, where $\sigma'(x) \to 0$ as $|x| \to \infty$ kills the gradient signal; (4) numerical stability, where naive limit computations suffer catastrophic cancellation; and (5) learning rate schedules, where the Robbins-Monro conditions $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$ guarantee convergence.
Prerequisites
- Functions: domain, codomain, composition, inverse - 01-Mathematical-Foundations
- Algebra: polynomial factoring, rationalization, conjugate multiplication
- Exponential and logarithmic functions: $e^x$, $\ln x$, basic identities
- Trigonometry: $\sin x$, $\cos x$, and their basic properties
- Real number system: $\mathbb{R}$, absolute value $|x|$, the Archimedean property
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive examples: epsilon-delta visualization, limit laws, IVT, discontinuity types, ML applications |
| exercises.ipynb | 10 graded exercises from basic limit computation to gradient-as-limit and cross-entropy |
Learning Objectives
After completing this section, you will be able to:
- State the epsilon-delta definition of a limit and verify it for elementary functions
- Compute limits using algebraic manipulation, L'Hôpital's Rule, and the Squeeze Theorem
- Distinguish one-sided from two-sided limits and determine when each exists
- Identify and classify discontinuities as removable, jump, or essential
- State and apply the Intermediate Value Theorem and Extreme Value Theorem
- Prove the Squeeze Theorem and use it to evaluate limits such as $\lim_{x \to 0} x^{2}\sin(1/x)$
- Recognize indeterminate forms and choose the appropriate resolution technique
- Implement numerically stable limit computations using `expm1`, `log1p`, and `log_softmax`
- Analyze softmax temperature, sigmoid saturation, and ReLU continuity via limits
- Connect the limit definition to the derivative as preparation for 02
Table of Contents
- 1. Intuition
- 2. Formal Definitions
- 3. Fundamental Limits
- 4. Computation Techniques
- 5. Continuity
- 6. Key Theorems
- 7. Asymptotic Behavior
- 8. Numerical Stability Near Limits
- 9. Machine Learning Applications
- 10. Common Mistakes
- 11. Exercises
- 12. Why This Matters for AI (2026 Perspective)
- Conceptual Bridge
1. Intuition
1.1 Approaching Without Arriving
Consider the function $f(x) = \dfrac{x^{2} - 4}{x - 2}$. At $x = 2$, this function is undefined: both numerator and denominator are zero, and division by zero is not permitted. Yet if you evaluate $f$ at values close to $2$ - say $x = 1.999$ and $x = 2.001$ - you observe that $f(x)$ gets arbitrarily close to $4$. The function is not defined at $x = 2$, but its behavior near $x = 2$ is perfectly predictable.
This is the key insight: a limit describes the behavior of $f(x)$ as $x$ approaches $a$, regardless of what $f(a)$ is (or whether $f(a)$ exists at all). We write:
$$\lim_{x \to a} f(x) = L$$
and read it as: "the limit of $f(x)$ as $x$ approaches $a$ equals $L$."
For the example above:
$$\lim_{x \to 2}\frac{x^{2} - 4}{x - 2} = \lim_{x \to 2}\frac{(x - 2)(x + 2)}{x - 2} = \lim_{x \to 2}(x + 2) = 4$$
The cancellation of $(x - 2)$ is valid because we are considering $x$ close to $2$ (but not equal to $2$), so $x - 2 \neq 0$.
LIMITS: THE CORE PICTURE
f(x)
5
4 - - - - - - - - - <- limit value (hole at x=2)
3
2
1
x
1 1.5 1.9 2 2.1 2.5 3
f(x) = (x^2-4)/(x-2) Undefined at x=2, but the limit is 4.
As x -> 2 from either side, f(x) -> 4.
The informal statement is: $\lim_{x \to a} f(x) = L$ means "we can make $f(x)$ as close to $L$ as we like by taking $x$ sufficiently close to $a$."
The crucial distinction: a limit is about neighborhood behavior, not about what happens at the point itself. This distinction is what makes calculus work: the derivative of $f$ at $a$ is defined as the limit of the difference quotients $\frac{f(a + h) - f(a)}{h}$ as $h \to 0$, even though the difference quotient is never evaluated at $h = 0$.
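A quick numerical sanity check of this picture (a minimal Python sketch; the probe points are illustrative, not part of the original text):
```python
def f(x):
    # f(x) = (x^2 - 4)/(x - 2): undefined at x = 2, but its values near x = 2 approach 4
    return (x**2 - 4) / (x - 2)

for h in [0.5, 0.1, 0.01, 0.001, 1e-6]:
    print(f"f(2 - {h}) = {f(2 - h):.8f}    f(2 + {h}) = {f(2 + h):.8f}")
# Both columns approach 4.0 as h -> 0, even though f(2) itself raises ZeroDivisionError.
```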
1.2 Historical Necessity: From Newton to Weierstrass
The limit concept was not discovered all at once - it was forced on mathematicians by the need to make calculus rigorous.
Newton (1666) and Leibniz (1675) independently invented calculus using "infinitesimals" - quantities smaller than any real number but not zero. Their methods worked in practice but were logically incoherent: you cannot simultaneously treat as nonzero (to divide) and as zero (to drop terms). Bishop Berkeley famously mocked infinitesimals as "ghosts of departed quantities."
Cauchy (1821) introduced the sequential approach: $f(x) \to L$ as $x \to a$ if for every sequence $x_n \to a$ (with $x_n \neq a$), the sequence $f(x_n) \to L$. This was much cleaner, but still relied on intuitive notions of "approaching."
Weierstrass (1861) gave the definitive formulation, now universal in analysis: the epsilon-delta definition (Section 2.1). This eliminated all appeals to intuition and motion - limits became purely a statement about inequalities between real numbers.
Chronological summary:
| Year | Mathematician | Contribution |
|---|---|---|
| 1666/1675 | Newton / Leibniz | Calculus via infinitesimals (intuitive, not rigorous) |
| 1797 | Lagrange | Attempted algebraic foundation via power series |
| 1821 | Cauchy | Sequential definition; "sum formula" for continuity |
| 1861 | Weierstrass | epsilon-delta definition; made analysis fully rigorous |
| 1960s | Robinson | Nonstandard analysis: infinitesimals made rigorous via model theory |
The modern formulation we use today is Weierstrass's. Interestingly, nonstandard analysis (Abraham Robinson, 1966) later vindicated Newton's intuition by making infinitesimals rigorous through model theory - but the epsilon-delta approach remains standard in mathematical practice.
1.3 Why Limits Matter for AI
Limits are not just historical curiosities - they are load-bearing mathematics in modern AI systems.
Gradient computation: Every gradient in a neural network is a limit. The gradient of the loss $L$ with respect to a weight $w$ (holding the other weights fixed) is:
$$\frac{\partial L}{\partial w} = \lim_{h \to 0}\frac{L(w + h) - L(w)}{h}$$
Automatic differentiation implements this limit algebraically rather than numerically, but the underlying mathematics is limit-theoretic.
Softmax temperature: The softmax function with temperature $T$ is:
$$\mathrm{softmax}_T(z)_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$$
Taking $T \to 0^{+}$ recovers the argmax (one-hot); $T \to \infty$ recovers the uniform distribution. Understanding these limits is essential for temperature-scaled inference in LLMs (Radford et al., 2019; Ouyang et al., 2022).
Vanishing gradients: The sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$ satisfies $\lim_{x \to \infty}\sigma(x) = 1$, $\lim_{x \to -\infty}\sigma(x) = 0$, and $\lim_{|x| \to \infty}\sigma'(x) = 0$. Deep networks with sigmoid activations suffer gradient vanishing precisely because of this limit behavior (Hochreiter & Schmidhuber, 1997).
Numerical stability: Computing $\frac{e^{x} - 1}{x}$ naively near $x = 0$ loses up to 16 digits of precision due to catastrophic cancellation. The function numpy.expm1 computes $e^{x} - 1$ directly to avoid this - a practical implementation of numerically stable limits.
Convergence guarantees: Stochastic gradient descent converges (in expectation) if the learning rate sequence $\{\alpha_t\}$ satisfies the Robbins-Monro conditions $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^{2} < \infty$ (Robbins & Monro, 1951). These are conditions on the limiting behavior of a series - pure limit theory applied to optimization.
2. Formal Definitions
2.1 The epsilon-delta Definition
Definition (Limit). Let $f$ be a function defined on some open interval containing $a$, except possibly at $a$ itself. We say
$$\lim_{x \to a} f(x) = L$$
if for every $\varepsilon > 0$, there exists $\delta > 0$ such that
$$0 < |x - a| < \delta \ \implies\ |f(x) - L| < \varepsilon$$
Unpacking the definition:
- "" - your challenger specifies how close to you must get
- "there exists " - you respond with a proximity to
- "" - is within of , but
- "" - then is within of
The definition is a game: for every precision challenge your opponent poses, you must produce a that wins. If you can always win, the limit is .
epsilon-delta VISUALIZATION
f(x)
L+epsilon <- epsilon-strip (top)
f(x) must stay inside
L - - - - - - - - - - - - -- - - - <- limit value L
L-epsilon <- epsilon-strip (bottom)
a-delta a a+delta
-> x
<- 2delta ->
delta-window: x must stay inside this
For x in (a-delta, a+delta) \ {a}, f(x) must lie in (L-epsilon, L+epsilon).
Example (proving a limit with epsilon-delta): Prove .
We need: given , find such that .
Note: .
So if , then .
Choose . Then , so .
Example (limit does not exist): as . For any , the interval contains points where and points where . No single can satisfy the epsilon-delta condition with . The limit does not exist.
Non-examples (limit fails to exist):
- $\mathrm{sgn}(x)$ as $x \to 0$: left limit is $-1$, right limit is $+1$ - unequal one-sided limits
- $1/x^{2}$ as $x \to 0$: $f(x) \to +\infty$ - no finite $L$
- $\sin(1/x)$ as $x \to 0$: oscillates between $-1$ and $1$ without settling
2.2 Limit Laws
Limits interact well with algebraic operations. If $\lim_{x \to a} f(x) = L$ and $\lim_{x \to a} g(x) = M$, then:
| Law | Statement |
|---|---|
| Sum | $\lim_{x \to a}[f(x) + g(x)] = L + M$ |
| Difference | $\lim_{x \to a}[f(x) - g(x)] = L - M$ |
| Constant Multiple | $\lim_{x \to a}[c\,f(x)] = cL$ |
| Product | $\lim_{x \to a}[f(x)\,g(x)] = LM$ |
| Quotient | $\lim_{x \to a}\dfrac{f(x)}{g(x)} = \dfrac{L}{M}$, provided $M \neq 0$ |
| Power | $\lim_{x \to a}[f(x)]^{n} = L^{n}$ for integer $n \ge 1$ |
| Root | $\lim_{x \to a}\sqrt[n]{f(x)} = \sqrt[n]{L}$ (for even $n$, require $L > 0$) |
| Composition | If $g$ is continuous at $L$: $\lim_{x \to a} g(f(x)) = g(L)$ |
Proof sketch (Sum Law): Given $\varepsilon > 0$. Since $L$ and $M$ are limits, choose $\delta_1, \delta_2 > 0$ such that $|f(x) - L| < \varepsilon/2$ and $|g(x) - M| < \varepsilon/2$ within their respective $\delta$-windows. Set $\delta = \min(\delta_1, \delta_2)$. Then:
$$\bigl|(f(x) + g(x)) - (L + M)\bigr| \le |f(x) - L| + |g(x) - M| < \tfrac{\varepsilon}{2} + \tfrac{\varepsilon}{2} = \varepsilon$$
Direct substitution: If $f$ is a polynomial, a rational function (with nonzero denominator), or composed of continuous functions, then $\lim_{x \to a} f(x) = f(a)$. This is valid whenever $f$ is continuous at $a$ - essentially the definition of continuity (Section 5).
2.3 One-Sided Limits
For functions that behave differently approaching from the left versus the right, we introduce one-sided limits.
Definition (Right-hand limit). $\lim_{x \to a^{+}} f(x) = L$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that $a < x < a + \delta \implies |f(x) - L| < \varepsilon$.
Definition (Left-hand limit). $\lim_{x \to a^{-}} f(x) = L$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that $a - \delta < x < a \implies |f(x) - L| < \varepsilon$.
Key theorem: $\lim_{x \to a} f(x) = L$ if and only if both one-sided limits exist and are equal:
$$\lim_{x \to a^{-}} f(x) = \lim_{x \to a^{+}} f(x) = L$$
Examples:
Sign function $\mathrm{sgn}(x) = x/|x|$ for $x \neq 0$:
$$\lim_{x \to 0^{-}}\mathrm{sgn}(x) = -1, \qquad \lim_{x \to 0^{+}}\mathrm{sgn}(x) = +1$$
Since $-1 \neq +1$, the two-sided limit does not exist.
Floor function $\lfloor x\rfloor$: at any integer $n$,
$$\lim_{x \to n^{-}}\lfloor x\rfloor = n - 1, \qquad \lim_{x \to n^{+}}\lfloor x\rfloor = n$$
Jump discontinuity at every integer.
Absolute value $|x|$:
$$\lim_{x \to 0^{-}}|x| = 0 = \lim_{x \to 0^{+}}|x|$$
Two-sided limit exists and equals $0$, so $|x|$ is continuous at $0$.
For AI: ReLU $= \max(0, x)$ has well-defined one-sided limits everywhere. At $x = 0$:
$$\lim_{x \to 0^{-}}\mathrm{ReLU}(x) = 0, \qquad \lim_{x \to 0^{+}}\mathrm{ReLU}(x) = 0$$
Both equal $\mathrm{ReLU}(0) = 0$, so ReLU is continuous at the origin. However, its derivative (Section 9.3) has a jump discontinuity there.
2.4 Limits at Infinity
Definition. $\lim_{x \to \infty} f(x) = L$ means: for every $\varepsilon > 0$, there exists $M > 0$ such that $x > M \implies |f(x) - L| < \varepsilon$.
Similarly for $x \to -\infty$, with $x < -M$.
Standard examples:
Rational function limits at infinity: For $\frac{p(x)}{q(x)}$ with $\deg p = m$, $\deg q = n$:
- If $m < n$: limit is $0$
- If $m = n$: limit is the ratio of leading coefficients
- If $m > n$: limit is $\pm\infty$
Example:
For AI: Sigmoid saturation at infinity:
$$\lim_{x \to \infty}\sigma(x) = 1, \qquad \lim_{x \to -\infty}\sigma(x) = 0$$
This is why sigmoid-based networks suffer from vanishing gradients: for large $|x|$, the sigmoid is essentially constant, so its derivative is essentially zero.
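The saturation is easy to observe numerically. A small sketch (NumPy assumed); note that in float64 the limit values are reached exactly for moderately large inputs:
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 5.0, 10.0, 20.0, 40.0]:
    s = sigmoid(x)
    print(f"x = {x:5.1f}   sigma(x) = {s:.16f}   sigma'(x) = {s * (1 - s):.3e}")
# By x = 40, sigma(x) rounds to exactly 1.0 and sigma'(x) to exactly 0.0:
# the limits at infinity are hit numerically, which is the vanishing-gradient mechanism.
```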
2.5 Infinite Limits and Vertical Asymptotes
Definition. $\lim_{x \to a} f(x) = +\infty$ means: for every $M > 0$, there exists $\delta > 0$ such that $0 < |x - a| < \delta \implies f(x) > M$.
Note: $\infty$ is not a real number - saying the limit is $\infty$ means the function grows without bound, not that it converges.
Vertical asymptote: $x = a$ is a vertical asymptote of $f$ if any one-sided limit as $x \to a$ is $\pm\infty$.
Examples:
For AI - log barrier: Interior point methods use log-barrier terms of the form $-\mu\ln(s)$, where $s > 0$ measures the distance to a constraint boundary. As $s \to 0^{+}$, the barrier $\to +\infty$, keeping the optimization in the feasible interior. This is a vertical asymptote engineered as a constraint.
3. Fundamental Limits
These limits appear repeatedly in analysis, ML, and numerical methods. Every practitioner should know them cold.
3.1 The Sine Limit: sin(x)/x -> 1
Geometric proof: Consider a unit circle. For $0 < x < \pi/2$, the areas of the inscribed triangle, the circular sector, and the circumscribed triangle satisfy:
$$\text{area(small triangle)} \ \le\ \text{area(sector)} \ \le\ \text{area(large triangle)}$$
In terms of $x$ (the angle, in radians):
$$\tfrac{1}{2}\sin x \ \le\ \tfrac{1}{2}x \ \le\ \tfrac{1}{2}\tan x$$
Dividing by $\tfrac{1}{2}\sin x > 0$:
$$1 \ \le\ \frac{x}{\sin x} \ \le\ \frac{1}{\cos x}$$
Taking reciprocals (reversing inequalities):
$$\cos x \ \le\ \frac{\sin x}{x} \ \le\ 1$$
Since $\lim_{x \to 0}\cos x = 1$, by the Squeeze Theorem: $\lim_{x \to 0}\dfrac{\sin x}{x} = 1$.
Consequences:
$$\lim_{x \to 0}\frac{1 - \cos x}{x} = 0, \qquad \lim_{x \to 0}\frac{\tan x}{x} = 1$$
The first follows from $\dfrac{1 - \cos x}{x} = \dfrac{\sin x}{x}\cdot\dfrac{\sin x}{1 + \cos x} \to 1 \cdot 0 = 0$.
For AI - attention scores: The sinusoidal positional encoding in transformers (Vaswani et al., 2017) uses $\sin\!\left(\mathrm{pos}/10000^{2i/d}\right)$. The smooth interpolation behavior of sine (controlled by this limit) ensures smooth positional representations.
3.2 Exponential and Logarithmic Limits
The fundamental exponential limit:
$$\lim_{x \to 0}\frac{e^{x} - 1}{x} = 1$$
This is the definition of the derivative of $e^{x}$ at $x = 0$, and it says the exponential grows at rate $1$ near the origin.
Proof (via series): $e^{x} = 1 + x + \frac{x^{2}}{2!} + \frac{x^{3}}{3!} + \cdots$, so $\frac{e^{x} - 1}{x} = 1 + \frac{x}{2!} + \frac{x^{2}}{3!} + \cdots \to 1$ as $x \to 0$.
Consequence: $\lim_{x \to 0}\dfrac{a^{x} - 1}{x} = \ln a$ for any $a > 0$ (write $a^{x} = e^{x\ln a}$).
Logarithmic limits:
$$\lim_{x \to 0}\frac{\ln(1 + x)}{x} = 1, \qquad \lim_{x \to \infty}\frac{\ln x}{x^{p}} = 0 \quad (p > 0)$$
The second says the logarithm grows slower than any positive power - critical for understanding why $O(\log n)$ algorithms are so efficient.
For AI - cross-entropy: The cross-entropy loss $-\sum_i p_i\log q_i$ uses $\lim_{x \to 0^{+}} x\ln x = 0$ (Section 3.4) to handle the convention $0\log 0 = 0$.
3.3 Euler's Number as a Limit
$$e = \lim_{n \to \infty}\left(1 + \frac{1}{n}\right)^{n} = \lim_{x \to 0}\,(1 + x)^{1/x}$$
Both forms define the same constant $e \approx 2.71828$.
Motivation: If interest on principal $P$ is compounded $n$ times per year at annual rate $r$, after one year the amount is $P\left(1 + \frac{r}{n}\right)^{n}$. As $n \to \infty$ (continuous compounding), this approaches $Pe^{r}$.
General form:
$$\lim_{n \to \infty}\left(1 + \frac{x}{n}\right)^{n} = e^{x}$$
Proof sketch: Let $y = \left(1 + \frac{x}{n}\right)^{n}$. Taking the logarithm: $\ln y = n\ln\left(1 + \frac{x}{n}\right)$. Setting $t = x/n$: $\ln y = x\cdot\frac{\ln(1 + t)}{t} \to x$ as $t \to 0$. So $y \to e^{x}$.
For AI - learning rate warm-up: The cosine schedule and exponential decay $\alpha_t = \alpha_0\gamma^{t}$ are engineering analogs of $\left(1 + \frac{x}{n}\right)^{n}$ - discrete compounding that approaches a continuous exponential in the limit of fine schedules.
3.4 Logarithmic Limits and x ln(x)
$$\lim_{x \to 0^{+}} x\ln x = 0$$
This is a $0 \cdot (-\infty)$ indeterminate form. Rewrite:
$$x\ln x = \frac{\ln x}{1/x}$$
This is $\frac{-\infty}{\infty}$, so L'Hôpital applies:
$$\lim_{x \to 0^{+}}\frac{\ln x}{1/x} = \lim_{x \to 0^{+}}\frac{1/x}{-1/x^{2}} = \lim_{x \to 0^{+}}(-x) = 0$$
For AI - entropy: The Shannon entropy $H(p) = -\sum_i p_i\log p_i$ uses the convention $0\log 0 = 0$, justified by $\lim_{x \to 0^{+}} x\log x = 0$. This makes entropy a continuous function of the probability distribution even at the boundary of the simplex.
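One way to apply this convention in code (a sketch assuming SciPy is available; `scipy.special.xlogy` returns 0 when both arguments are 0, matching the $x\ln x \to 0$ convention):
```python
import numpy as np
from scipy.special import xlogy  # xlogy(0, 0) == 0

def entropy(p):
    # Shannon entropy in nats, safe on the boundary of the probability simplex
    p = np.asarray(p, dtype=float)
    return -np.sum(xlogy(p, p))

print(entropy([0.5, 0.5]))       # ln 2 ~= 0.693
print(entropy([1.0, 0.0]))       # 0.0 - no NaN at the boundary
print(entropy([0.7, 0.2, 0.1]))  # ~0.802
```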
4. Computation Techniques
4.1 Algebraic Techniques
When direct substitution gives or other indeterminate forms, algebraic manipulation often resolves the indeterminacy.
Factoring:
Rationalization (conjugate multiplication):
Common factor in rational functions:
Multiplying by (for rational functions at infinity): already shown in Section 2.4.
4.2 L'Hôpital's Rule
Theorem (L'Hôpital's Rule). Suppose $\lim_{x \to a} f(x) = 0$ and $\lim_{x \to a} g(x) = 0$ (the $\frac{0}{0}$ form), or both limits are $\pm\infty$ (the $\frac{\infty}{\infty}$ form). If $f$ and $g$ are differentiable near $a$ (but not necessarily at $a$), $g'(x) \neq 0$ near $a$, and $\lim_{x \to a}\frac{f'(x)}{g'(x)}$ exists (or is $\pm\infty$), then:
$$\lim_{x \to a}\frac{f(x)}{g(x)} = \lim_{x \to a}\frac{f'(x)}{g'(x)}$$
The same holds for one-sided limits and for $x \to \pm\infty$.
Critical: L'Hôpital applies only to $\frac{0}{0}$ or $\frac{\infty}{\infty}$. Never apply it to other forms. Always verify the indeterminate form first.
Examples:
: This is . L'Hpital: .
: This is . L'Hpital: , still . Apply again: .
: Rewrite as (). L'Hpital: .
Other indeterminate forms - convert before applying L'Hôpital:
| Form | Conversion |
|---|---|
| $0 \cdot \infty$ | Write the product $fg$ as $\frac{f}{1/g}$ or $\frac{g}{1/f}$ (0/0 or infinity/infinity) |
| $\infty - \infty$ | Factor or find a common denominator |
| $0^{0}$, $1^{\infty}$, $\infty^{0}$ | Take $\ln$: $\lim f^{g} = \exp\!\left(\lim g\ln f\right)$ |
Example (): . Let , so . This is ; L'Hpital gives . So .
Warning: L'Hôpital can fail to terminate or produce incorrect answers if misapplied. Always check that the form is truly indeterminate, and stop as soon as the limit becomes evaluable by direct substitution.
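When a limit resists hand computation, a computer algebra system is a useful cross-check on the indeterminate-form bookkeeping. A minimal sketch using SymPy (assumed installed; any of the fundamental limits from Section 3 could be substituted):
```python
import sympy as sp

x = sp.symbols('x')

print(sp.limit(sp.sin(x) / x, x, 0))           # 1
print(sp.limit((sp.exp(x) - 1) / x, x, 0))     # 1
print(sp.limit(x**2 / sp.exp(x), x, sp.oo))    # 0   (polynomial vs. exponential)
print(sp.limit(x * sp.log(x), x, 0, '+'))      # 0   (the 0 * (-infinity) form)
print(sp.limit((1 + 1/x)**x, x, sp.oo))        # E   (Euler's number)
```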
4.3 The Squeeze Theorem
Theorem (Squeeze / Sandwich Theorem). Suppose $h(x) \le f(x) \le g(x)$ for all $x$ near $a$ (but not necessarily at $a$), and:
$$\lim_{x \to a} h(x) = \lim_{x \to a} g(x) = L$$
Then $\lim_{x \to a} f(x) = L$.
Proof: Given $\varepsilon > 0$. Choose $\delta_1, \delta_2 > 0$ so that:
- $0 < |x - a| < \delta_1 \implies |h(x) - L| < \varepsilon$, i.e., $L - \varepsilon < h(x) < L + \varepsilon$
- $0 < |x - a| < \delta_2 \implies |g(x) - L| < \varepsilon$, i.e., $L - \varepsilon < g(x) < L + \varepsilon$
Set $\delta = \min(\delta_1, \delta_2)$. For $0 < |x - a| < \delta$:
$$L - \varepsilon < h(x) \le f(x) \le g(x) < L + \varepsilon$$
So $|f(x) - L| < \varepsilon$.
Example 1: . Since : . Both and , so .
Example 2: (proved geometrically in Section 3.1 using squeeze).
Example 3: . We know (can be proved via AM-GM). Since , the squeeze gives .
For AI: The squeeze theorem is the theoretical justification for proving that "soft" approximations converge to "hard" decisions. For example, if and both bounds converge, then converges - used in proving convergence of temperature-annealed sampling.
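A minimal numerical illustration of the squeeze idea (Python sketch; the bound $-x^{2} \le x^{2}\sin(1/x) \le x^{2}$ is the classic textbook case and is used here as an assumed example):
```python
import numpy as np

# sin(1/x) is trapped in [-1, 1], so x^2 * sin(1/x) is trapped in [-x^2, x^2].
for x in [0.1, 0.01, 0.001, 1e-5]:
    f = x**2 * np.sin(1.0 / x)
    print(f"x = {x:8.0e}   -x^2 = {-x**2:10.3e} <= f(x) = {f:10.3e} <= x^2 = {x**2:10.3e}")
# Both envelopes -> 0, so the squeezed value -> 0 as well.
```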
4.4 Substitution and Composition
Theorem. If $g$ is continuous at $L$ and $\lim_{x \to a} f(x) = L$, then:
$$\lim_{x \to a} g(f(x)) = g\!\left(\lim_{x \to a} f(x)\right) = g(L)$$
This allows moving limits inside continuous functions.
Examples:
5. Continuity
5.1 The Three-Part Definition
Definition (Continuity at a Point). A function $f$ is continuous at $a$ if:
- $f(a)$ is defined (the function exists at $a$)
- $\lim_{x \to a} f(x)$ exists (the limit exists at $a$)
- $\lim_{x \to a} f(x) = f(a)$ (limit equals the function value)
All three conditions must hold. Failure of any one gives a discontinuity.
Equivalent epsilon-delta formulation: $f$ is continuous at $a$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that $|x - a| < \delta \implies |f(x) - f(a)| < \varepsilon$. (Note: unlike the limit definition, $x = a$ is now permitted.)
Examples:
- : continuous at every (polynomial)
- : continuous at every
- : continuous at every ; not defined (hence not continuous) at
- : continuous at every non-integer; discontinuous at every integer
Non-examples (failing each condition):
- Condition 1 fails: at (undefined)
- Condition 2 fails: at (limit doesn't exist)
- Condition 3 fails: for , (limit is but )
5.2 Types of Discontinuity
DISCONTINUITY TAXONOMY
REMOVABLE JUMP ESSENTIAL (INFINITE)
(Hole in graph) (Step) (Blows up)
-
lim exists, lim doesn't exist lim = +/-infinity
!= f(a) or (one-sided limits (vertical asymptote)
f(a) undefined exist but differ)
Fix: redefine f(a) Cannot fix: jump Cannot fix: infinity
Example: (x^2-4)/(x-2) Example: sgn(x) Example: 1/x at 0
at x=2 at x=0
Removable discontinuity: $\lim_{x \to a} f(x)$ exists, but either $f(a)$ is undefined or $f(a) \neq \lim_{x \to a} f(x)$. Fixable by redefining $f(a)$.
Jump discontinuity: Both one-sided limits exist but differ: $\lim_{x \to a^{-}} f(x) \neq \lim_{x \to a^{+}} f(x)$. The function "jumps" at $a$.
Essential (infinite) discontinuity: At least one one-sided limit is $\pm\infty$. Also called an infinite discontinuity.
Oscillatory discontinuity (a subtype of essential): $\sin(1/x)$ at $x = 0$ - neither one-sided limit exists (even as $\pm\infty$).
For AI:
- ReLU has no discontinuities (continuous everywhere), but its derivative has a jump at $x = 0$
- Hard attention (argmax) is a jump discontinuity at ties - made differentiable via softmax temperature annealing
- Quantized activations (INT8) introduce jump discontinuities - handled with straight-through estimator in training
5.3 Continuity on Intervals
Definition. $f$ is continuous on the open interval $(a, b)$ if $f$ is continuous at every point of $(a, b)$.
$f$ is continuous on the closed interval $[a, b]$ if:
- $f$ is continuous on $(a, b)$
- $\lim_{x \to a^{+}} f(x) = f(a)$ (right-continuous at $a$)
- $\lim_{x \to b^{-}} f(x) = f(b)$ (left-continuous at $b$)
Standard continuous functions: Polynomials, rational functions (on their domains), $\sin x$, $\cos x$, $e^{x}$, $\ln x$ (on $(0, \infty)$), $\sqrt{x}$ (on $[0, \infty)$), and any composition thereof.
For AI: Neural networks with continuous activation functions (sigmoid, tanh, GELU, SiLU) define continuous functions $\mathbb{R}^{n} \to \mathbb{R}^{m}$. Networks with ReLU are piecewise linear and continuous. Continuity of the network function is necessary (though not sufficient) for gradient-based optimization to work reliably.
5.4 Operations Preserving Continuity
If and are continuous at , then so are:
- $f + g$, $f - g$, $fg$, and $cf$ (for any constant $c$)
- $f/g$, provided $g(a) \neq 0$
- $g \circ f$, provided $g$ is continuous at $f(a)$
Consequence: Every function built by composing, adding, multiplying elementary continuous functions is continuous on its natural domain. This covers virtually all standard functions in analysis.
6. Key Theorems
6.1 Intermediate Value Theorem
Theorem (IVT). Let $f$ be continuous on the closed interval $[a, b]$. If $k$ is any value strictly between $f(a)$ and $f(b)$, then there exists $c \in (a, b)$ such that $f(c) = k$.
Informally: a continuous function cannot skip values - if it goes from $f(a)$ to $f(b)$, it must pass through every intermediate value.
INTERMEDIATE VALUE THEOREM
f(x)
f(b) <- f(b)
k - - - <- f(c) = k (guaranteed to exist)
f(a) <- f(a)
c
-> x
a b
If f is continuous on [a,b], then for any k between f(a) and f(b),
there exists c in (a,b) with f(c) = k.
Proof sketch (requires completeness of $\mathbb{R}$): Let $S = \{x \in [a, b] : f(x) < k\}$. If $f(a) < k < f(b)$, then $S$ is nonempty (contains $a$) and bounded above (by $b$). Let $c = \sup S$. By continuity of $f$ at $c$, one can show $f(c) = k$.
Applications:
- Root finding (bisection method): If $f(a)f(b) < 0$ and $f$ is continuous, then $f$ has a root in $(a, b)$. Bisect: evaluate $f$ at the midpoint and recurse into the half-interval where the sign changes. Converges in $O\!\left(\log_2\frac{b-a}{\varepsilon}\right)$ steps.
- Fixed-point existence: If $f : [0, 1] \to [0, 1]$ is continuous, then $f$ has a fixed point $c$ with $f(c) = c$. (Apply IVT to $g(x) = f(x) - x$: $g(0) \ge 0$ and $g(1) \le 0$.)
- Equilibrium existence: Used in proving existence of Nash equilibria (via the Kakutani fixed-point theorem), stopping times for Brownian motion, and zero crossings of residuals in iterative solvers.
For AI: The bisection method is the simplest root-finding algorithm. Modern AI uses line search (Wolfe conditions) and trust-region methods - all of which rely on IVT guarantees to find step sizes satisfying sufficient decrease conditions.
6.2 Extreme Value Theorem
Theorem (EVT). Let $f$ be continuous on the closed interval $[a, b]$. Then $f$ attains its maximum and minimum on $[a, b]$: there exist $c, d \in [a, b]$ such that $f(c) \le f(x) \le f(d)$ for all $x \in [a, b]$.
Why compactness matters: The EVT requires the domain to be closed and bounded (compact). Counterexamples on non-compact domains:
- $f(x) = x$ on $(0, 1)$: no maximum (approaches $1$ but never attains it)
- $f(x) = 1/x$ on $(0, 1]$: no maximum (blows up as $x \to 0^{+}$)
For AI: Loss functions trained over bounded parameter spaces (after gradient clipping or weight constraints) attain their minimum. This theoretical guarantee underpins the well-posedness of constrained optimization problems in ML.
6.3 Uniform Continuity
Definition. $f$ is uniformly continuous on a set $S$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that for all $x, y \in S$: $|x - y| < \delta \implies |f(x) - f(y)| < \varepsilon$.
Key distinction: In pointwise continuity, $\delta$ can depend on both $\varepsilon$ and the point $a$. In uniform continuity, $\delta$ depends only on $\varepsilon$ - the same $\delta$ works everywhere.
Theorem (Heine-Cantor). If $f$ is continuous on a closed bounded interval $[a, b]$, then $f$ is uniformly continuous on $[a, b]$.
Examples:
- $f(x) = x^{2}$ is uniformly continuous on $[0, 1]$ but not on $\mathbb{R}$ (the derivative grows without bound)
- $f(x) = \sin x$ is uniformly continuous on $\mathbb{R}$ (bounded derivative)
- $f(x) = 1/x$ is not uniformly continuous on $(0, 1)$ (the derivative blows up near $0$)
For AI: Lipschitz continuity (a stronger condition: $|f(x) - f(y)| \le K|x - y|$ for all $x, y$) implies uniform continuity. Lipschitz constraints appear in spectral normalization (Miyato et al., 2018) for GANs and in Wasserstein GAN training - all are quantitative versions of uniform continuity.
7. Asymptotic Behavior
7.1 Horizontal and Vertical Asymptotes
Horizontal asymptote: $y = L$ is a horizontal asymptote if $\lim_{x \to \infty} f(x) = L$ or $\lim_{x \to -\infty} f(x) = L$.
Vertical asymptote: $x = a$ is a vertical asymptote if any one-sided limit as $x \to a$ is $\pm\infty$.
Oblique (slant) asymptote: $y = mx + b$ is an oblique asymptote if $\lim_{x \to \pm\infty}\bigl[f(x) - (mx + b)\bigr] = 0$. Occurs when the numerator's degree exceeds the denominator's by exactly one.
Example: . Polynomial division: . As : . So is an oblique asymptote and is a vertical asymptote.
7.2 Polynomial vs. Exponential Growth
Growth hierarchy (as $x \to \infty$):
$$\ln x \ \ll\ x^{p} \ \ll\ a^{x} \qquad (p > 0,\ a > 1)$$
Formally:
$$\lim_{x \to \infty}\frac{\ln x}{x^{p}} = 0, \qquad \lim_{x \to \infty}\frac{x^{n}}{a^{x}} = 0 \qquad (p > 0,\ n \in \mathbb{N},\ a > 1)$$
Every polynomial is dominated by every exponential; every logarithm is dominated by every positive power.
Proof of $\lim_{x \to \infty} x^{n}/e^{x} = 0$: Apply L'Hôpital $n$ times: $\lim_{x \to \infty}\frac{x^{n}}{e^{x}} = \lim_{x \to \infty}\frac{n x^{n-1}}{e^{x}} = \cdots = \lim_{x \to \infty}\frac{n!}{e^{x}} = 0$.
For AI - complexity: Algorithm complexity classes: $O(\log n)$ (logarithm) $\ll O(n^{k})$ (polynomial) $\ll O(2^{n})$ (exponential). This hierarchy, grounded in limit theory, governs which algorithms scale to LLM-scale data (e.g., transformers are $O(n^{2}d)$ for sequence length $n$ and dimension $d$).
7.3 Big-O and Little-o as Limit Statements
Definition. $f(x) = O(g(x))$ as $x \to \infty$ means there exist constants $C > 0$ and $x_0$ such that $|f(x)| \le C|g(x)|$ for $x > x_0$.
Definition. $f(x) = o(g(x))$ as $x \to \infty$ means $\lim_{x \to \infty}\frac{f(x)}{g(x)} = 0$.
In words: $O$ means "bounded by a multiple"; $o$ means "negligible compared to."
Examples:
- as (since )
- as (since )
- as (Taylor expansion)
For AI: The Adam optimizer has per-parameter learning rate $\frac{\alpha}{\sqrt{\hat v_t} + \epsilon}$. The $\epsilon$ term prevents division by zero when $\hat v_t \to 0$. In the limit $\hat v_t \to 0$, the effective learning rate is $\alpha/\epsilon$; for large $\hat v_t$, it is approximately $\alpha/\sqrt{\hat v_t}$. This piecewise asymptotic analysis (via big-O) explains Adam's behavior in both sparse and dense gradient regimes.
8. Numerical Stability Near Limits
8.1 Catastrophic Cancellation
When computing $\frac{e^{x} - 1}{x}$ for $x$ close to $0$, the numerator involves subtracting two nearly equal quantities. In IEEE 754 double precision (64-bit float, machine epsilon $\approx 2.2 \times 10^{-16}$), numbers carry roughly 15-16 significant digits. If $x = 10^{-8}$ and $e^{x} \approx 1.00000001$, then:
- $e^{x}$ in floating point: $\approx 1.00000001$, stored accurately to full precision
- $e^{x} - 1$ in floating point: only the digits beyond the leading $1$ survive, losing roughly 8 significant digits
The error in the subtraction propagates, giving up to 8 digits of lost precision. The limit value (which is $1$) is computed incorrectly.
CATASTROPHIC CANCELLATION EXAMPLE
x = 1e-8 (IEEE 754 double)
True value: (e^x - 1)/x ~= 1.000000005000000017...
Naive computation:
e^x = 1.00000001000000002 (15-16 sig digits, ok)
e^x-1 = 0.00000001000000002 (only ~9 sig digits now!)
/x ~= 1.000000002 (lost ~8 digits of precision)
Stable computation (expm1):
expm1(x) = 1.0000000050000000... (full precision maintained)
/x = 1.0000000050000000 (correct!)
8.2 Numerically Stable Alternatives
numpy.expm1(x) $= e^{x} - 1$: Computed using a special algorithm that avoids cancellation for small $x$.
numpy.log1p(x) $= \ln(1 + x)$: Similarly avoids cancellation when $x \approx 0$.
scipy.special.log_softmax(x): Avoids overflow in $e^{x_i}$ by computing via the log-sum-exp trick:
$$\log\mathrm{softmax}(x)_i = (x_i - m) - \log\sum_j e^{x_j - m}, \qquad m = \max_k x_k$$
All shifted exponents $x_j - m \le 0$, preventing overflow.
Stable sigmoid: evaluating $\sigma(x) = \frac{1}{1 + e^{-x}}$ naively can overflow for large negative $x$ (where $e^{-x}$ blows up). A stable version branches on the sign of $x$:
$$\sigma(x) = \begin{cases} \dfrac{1}{1 + e^{-x}} & x \ge 0 \\[1mm] \dfrac{e^{x}}{1 + e^{x}} & x < 0 \end{cases}$$
For AI - loss computation: Cross-entropy loss using torch.nn.CrossEntropyLoss internally uses log_softmax for numerical stability. Implementing naive log(softmax(x)) instead causes overflow/underflow for large logits.
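A short sketch (NumPy assumed) comparing the naive and stable forms described above, plus one way to write the branch-on-sign sigmoid:
```python
import numpy as np

x = 1e-8
print((np.exp(x) - 1.0) / x)   # naive: ~0.99999999 - wrong in the 8th digit (cancellation)
print(np.expm1(x) / x)         # stable: 1.000000005 (correct)
print(np.log1p(x) / x)         # stable log(1+x)/x: ~0.999999995 (correct)

def sigmoid_stable(z):
    """Overflow-free sigmoid: never exponentiates a large positive number."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

print(sigmoid_stable(np.array([-1000.0, 0.0, 1000.0])))  # [0.  0.5  1.] with no overflow warning
```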
8.3 Machine Epsilon and Floating-Point Limits
Machine epsilon (double precision): the smallest $\varepsilon$ such that $1 + \varepsilon \neq 1$ in IEEE 754; $\varepsilon_{\text{mach}} \approx 2.22 \times 10^{-16}$ for float64.
In floating-point arithmetic, the limit $h \to 0$ cannot be taken exactly - there is a practical lower bound on $h$ below which finite difference approximations deteriorate:
- For first-order (one-sided) finite differences: optimal $h \approx \sqrt{\varepsilon_{\text{mach}}} \approx 10^{-8}$
- For second-order (centered) differences: optimal $h \approx \varepsilon_{\text{mach}}^{1/3} \approx 6 \times 10^{-6}$
This is why gradient checking in ML uses $h$ around $10^{-5}$ or $10^{-4}$ rather than a value near machine epsilon.
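A quick error sweep makes the trade-off concrete (a sketch, NumPy assumed; $f(x) = \sin x$ at $x = 1$ is just a convenient test function):
```python
import numpy as np

f, x0, true = np.sin, 1.0, np.cos(1.0)

print(" h         one-sided err   centered err")
for h in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12]:
    one_sided = (f(x0 + h) - f(x0)) / h
    centered = (f(x0 + h) - f(x0 - h)) / (2 * h)
    print(f"{h:8.0e}    {abs(one_sided - true):.2e}        {abs(centered - true):.2e}")
# The error first shrinks (truncation error ~h or ~h^2), then grows again as
# rounding error ~eps/h takes over - the practical lower bound on h described above.
```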
9. Machine Learning Applications
9.1 Softmax Temperature Limits
The softmax with temperature $T > 0$ is:
$$\mathrm{softmax}_T(z)_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$$
Limit as $T \to 0^{+}$ (cold / sharp):
$$\lim_{T \to 0^{+}}\mathrm{softmax}_T(z)_i = \begin{cases} 1 & i = \arg\max_j z_j \\ 0 & \text{otherwise}\end{cases}$$
(assuming a unique argmax). The distribution collapses to a point mass on the highest-scoring class. This is why $T \to 0$ is called the "deterministic" or "greedy" limit.
Limit as $T \to \infty$ (hot / uniform):
$$\lim_{T \to \infty}\mathrm{softmax}_T(z)_i = \frac{1}{n}$$
All logits become irrelevant; the distribution becomes uniform.
Proof (cold limit): WLOG $z_1 > z_j$ for all $j \neq 1$. Factor out $e^{z_1/T}$:
$$\mathrm{softmax}_T(z)_1 = \frac{1}{1 + \sum_{j \neq 1} e^{(z_j - z_1)/T}}$$
As $T \to 0^{+}$: $(z_j - z_1)/T \to -\infty$ (since $z_j - z_1 < 0$), so each $e^{(z_j - z_1)/T} \to 0$. Thus the denominator $\to 1$ and $\mathrm{softmax}_T(z)_1 \to 1$.
For AI: In LLM inference (GPT, LLaMA, etc.), temperature controls creativity vs. determinism. Temperature $T = 1$ samples from the trained distribution; $T < 1$ sharpens it (more predictable); $T > 1$ flattens it (more diverse/random). The limits above show $T \to 0$ recovers greedy top-1 decoding, and $T \to \infty$ recovers pure random sampling.
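Both limits are easy to watch numerically. A sketch (NumPy assumed; the logits are arbitrary):
```python
import numpy as np

def softmax_T(z, T):
    """Temperature-scaled softmax, computed stably via the max-shift (Section 8.2)."""
    a = np.asarray(z, dtype=float) / T
    a -= a.max()
    e = np.exp(a)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
for T in [100.0, 10.0, 1.0, 0.5, 0.1, 0.01]:
    print(f"T = {T:6.2f}   {np.round(softmax_T(logits, T), 4)}")
# T large -> probabilities approach the uniform [1/3, 1/3, 1/3]
# T small -> probabilities approach the one-hot  [1, 0, 0] (argmax)
```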
9.2 Sigmoid Saturation and Vanishing Gradients
The sigmoid function $\sigma(x) = \dfrac{1}{1 + e^{-x}}$:
Limits at infinity:
$$\lim_{x \to \infty}\sigma(x) = 1, \qquad \lim_{x \to -\infty}\sigma(x) = 0$$
Derivative:
$$\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)$$
Limits of the derivative:
$$\lim_{x \to \infty}\sigma'(x) = 0, \qquad \lim_{x \to -\infty}\sigma'(x) = 0$$
The derivative at saturation is zero in the limit, not just small - the function becomes completely flat.
Vanishing gradient mechanism: In a deep network with sigmoid activations, the gradient of the loss with respect to layer $\ell$'s weights involves a product of $\sigma'$ factors across all layers above $\ell$. Each factor is at most $1/4$ (the maximum of $\sigma'$), so across $L$ saturated layers the product is at most $(1/4)^{L}$. The gradient effectively vanishes - Hochreiter & Schmidhuber (1997) identified this as the key failure mode of early RNNs.
Fix: ReLU activations have derivative $1$ for positive inputs (no saturation there). GELU and SiLU have smoother saturation profiles. Layer normalization and residual connections also mitigate the issue at the architecture level.
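The exponential decay is easy to quantify. A back-of-the-envelope sketch (the $1/4$ factor is the maximum of $\sigma'$, so this is the most optimistic case):
```python
# Worst-case attenuation per sigmoid layer: max sigma'(x) = 1/4, attained at x = 0.
# Model the backward pass through L saturating layers as a product of such factors.
for L in [5, 10, 20, 50]:
    print(f"L = {L:3d}   (1/4)^L = {0.25**L:.3e}")
# Even at the sigmoid's best point, 20 layers shrink the gradient by ~1e-12;
# ReLU's factor of 1 for positive pre-activations avoids this attenuation entirely.
```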
9.3 ReLU and GELU: Continuity at the Origin
ReLU: $\mathrm{ReLU}(x) = \max(0, x)$
Continuity check at $x = 0$:
$$\lim_{x \to 0^{-}}\mathrm{ReLU}(x) = 0, \qquad \lim_{x \to 0^{+}}\mathrm{ReLU}(x) = 0, \qquad \mathrm{ReLU}(0) = 0$$
All equal: ReLU is continuous everywhere.
Derivative at $0$: $\lim_{h \to 0^{-}}\frac{\mathrm{ReLU}(h)}{h} = 0$ but $\lim_{h \to 0^{+}}\frac{\mathrm{ReLU}(h)}{h} = 1$. One-sided derivatives differ: ReLU is not differentiable at $0$. This is a removable issue in practice - a subgradient of $0$ or $1$ (or any value in $[0, 1]$) is used.
GELU: $\mathrm{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the standard normal CDF.
GELU is smooth at the origin - no kink, unlike ReLU. This is why GELU-based networks (BERT, GPT-2 onward) train more smoothly in practice.
GELU vs. ReLU continuity comparison:
| Property | ReLU | GELU |
|---|---|---|
| Continuous | Yes (everywhere) | Yes (everywhere) |
| Differentiable at 0 | No (jump in derivative) | Yes (smooth) |
| Saturates for $x \ll 0$ | Gradient $= 0$ (dying ReLU) | Gradient $\to 0$ smoothly (soft) |
| Saturates for $x \gg 0$ | Gradient $= 1$ (no saturation) | Gradient $\to 1$ (no saturation) |
9.4 Learning Rate Decay: Robbins-Monro Conditions
The stochastic gradient descent update $\theta_{t+1} = \theta_t - \alpha_t\,\hat\nabla L(\theta_t)$ converges (under convexity and bounded-variance assumptions) if the learning rate schedule satisfies:
$$\sum_{t=1}^{\infty}\alpha_t = \infty \qquad \text{and} \qquad \sum_{t=1}^{\infty}\alpha_t^{2} < \infty$$
(Robbins & Monro, 1951)
Interpretation:
- $\sum_t \alpha_t = \infty$: the total step size is infinite - SGD can travel arbitrarily far and cannot get stuck permanently
- $\sum_t \alpha_t^{2} < \infty$: the noise contribution (proportional to $\alpha_t^{2}$) is summable - gradient noise eventually becomes negligible
Standard schedules:
| Schedule | $\sum_t \alpha_t$ | $\sum_t \alpha_t^{2}$ | Satisfies RM? |
|---|---|---|---|
| Constant $\alpha_t = \alpha_0$ | $\infty$ | $\infty$ | No (second condition fails) |
| $\alpha_t = \alpha_0/t$ (harmonic) | $\infty$ | Finite | Yes |
| $\alpha_t = \alpha_0/\sqrt{t}$ | $\infty$ | $\infty$ | No (second condition fails) |
| Exponential $\alpha_t = \alpha_0\gamma^{t}$ | Finite | Finite | No (first condition fails) |
Only the $1/t$-type schedule satisfies both - which is why it's the canonical theoretical schedule, even though practitioners prefer others (cosine, warmup) for practical reasons.
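The conditions can be checked empirically by accumulating partial sums over a long horizon. A sketch (NumPy assumed; the horizon and constants are arbitrary):
```python
import numpy as np

t = np.arange(1, 1_000_001, dtype=float)
schedules = {
    "constant 0.1": np.full_like(t, 0.1),
    "1/t (harmonic)": 1.0 / t,
    "1/sqrt(t)": 1.0 / np.sqrt(t),
    "exponential 0.99^t": 0.99**t,
}
for name, a in schedules.items():
    print(f"{name:20s}  sum = {a.sum():12.2f}   sum of squares = {(a**2).sum():12.4f}")
# constant:    both sums grow with the horizon          -> second condition fails
# 1/t:         sum ~ ln(T) (diverges), sum^2 -> 1.645   -> both conditions hold
# 1/sqrt(t):   sum of squares is sum of 1/t (diverges)  -> second condition fails
# exponential: sum converges (to ~99 here)              -> first condition fails
```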
9.5 Gradient as a Limit
The derivative (forward reference: 02-Derivatives) is defined as a limit:
$$f'(a) = \lim_{h \to 0}\frac{f(a + h) - f(a)}{h}$$
This limit must exist (be finite, with equal one-sided limits) for $f$ to be differentiable at $a$.
For AI - automatic differentiation: PyTorch's autograd and JAX's grad do not compute this limit numerically. Instead, they apply the chain rule symbolically/computationally. But the mathematical object they compute is precisely this limit - evaluated at each intermediate node in the computation graph.
Numerical gradient checking: To verify a gradient implementation, compute the centered difference
$$\frac{f(x + h) - f(x - h)}{2h} \approx f'(x)$$
with a small step such as $h \approx 10^{-5}$. The centered finite difference has error $O(h^{2})$ vs. $O(h)$ for the one-sided version - both are limit approximations, with the centered form being more accurate.
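A minimal gradient-check sketch against a toy function whose gradient is known exactly (the test function, step size, and threshold here are illustrative, not prescriptive):
```python
import numpy as np

def numerical_grad(f, x, h=1e-5):
    """Centered finite-difference gradient of f: R^n -> R at x (error O(h^2))."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

f = lambda x: np.sum(x**2)          # analytic gradient: 2x
x = np.array([1.0, -2.0, 3.5])
num, ana = numerical_grad(f, x), 2 * x
rel_err = np.linalg.norm(num - ana) / (np.linalg.norm(num) + np.linalg.norm(ana))
print(num, rel_err)                  # rel_err is tiny, far below a typical 1e-7 acceptance threshold
```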
Preview: Derivatives. The limit $f'(a) = \lim_{h \to 0}\frac{f(a + h) - f(a)}{h}$ defines the derivative - the instantaneous rate of change. All differentiation rules (product, chain, etc.) and backpropagation follow from this limit.
-> Full treatment: Derivatives and Differentiation
10. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Evaluating $\lim_{x \to a} f(x)$ by computing $f(a)$ when $f$ is discontinuous at $a$ | The limit is about neighborhood behavior, not the point value; $f(a)$ may not exist or may differ | Always check whether $f$ is continuous at $a$ first; if not, compute the limit separately |
| 2 | Concluding the limit doesn't exist because $f(a)$ is undefined | Limits are independent of $f(a)$; a removable discontinuity has a well-defined limit | Compute the limit algebraically; an undefined $f(a)$ does not imply no limit |
| 3 | Applying L'Hôpital's Rule to non-indeterminate forms | L'Hôpital is only valid for $0/0$ or $\infty/\infty$; other forms give wrong answers | Check: substitute and verify the form is indeterminate before applying L'Hôpital |
| 4 | Applying L'Hôpital repeatedly without checking if the form is still indeterminate | After one application, direct substitution may be possible; repeated L'Hôpital may cycle | After each application, try direct substitution before applying again |
| 5 | Confusing one-sided and two-sided limits | The two-sided limit requires both one-sided limits to exist and be equal | Compute $\lim_{x \to a^{-}}$ and $\lim_{x \to a^{+}}$ separately; the two-sided limit exists iff they agree |
| 6 | Asserting $\lim_{x \to a} f(x) = \infty$ means the limit exists | $\pm\infty$ are not real numbers; saying the limit is $\infty$ means divergence, not existence | Distinguish between "limit exists (finite)" and "limit is $\pm\infty$" |
| 7 | Assuming all three parts of continuity hold because $f$ looks smooth | Continuity requires all three conditions; a piecewise-defined function may fail condition 3 | Verify all three: $f(a)$ defined, limit exists, limit equals $f(a)$ |
| 8 | Computing $(e^{x} - 1)/x$ naively near $x = 0$ in code | Catastrophic cancellation destroys 8+ digits of precision for small $x$ | Use numpy.expm1(x) / x for numerically stable computation |
| 9 | Applying the Squeeze Theorem without verifying the inequalities hold near $a$ | The bounds must hold in a punctured neighborhood of $a$, not just at one point | Carefully verify $h(x) \le f(x) \le g(x)$ near $a$; sketch or prove each bound |
| 10 | Assuming sigmoid saturation means the gradient is small but nonzero | $\sigma'(x) \to 0$ exactly as $|x| \to \infty$, not just approximately | For deep networks, the gradient multiplied across many saturated layers goes to zero exponentially fast |
11. Exercises
Eight graded exercises with worked solutions in exercises.ipynb.
| # | Topic | Parts |
|---|---|---|
| 1 | Basic limit computation: factoring, trig, conjugates | (a)-(c) |
| 2 | One-sided limits and existence | (a)-(b) |
| 3 | L'Hôpital's Rule: three indeterminate forms | (a)-(c) |
| 4 | Continuity analysis: classify discontinuities | (a)-(c) |
| 5 | Squeeze Theorem: prove and verify IVT | (a)-(b) |
| 6 | epsilon-delta proof: verify and find an explicit $\delta$ | (a)-(c) |
| 7 | Gradient as a limit: numerical vs. analytic finite differences | (a)-(c) |
| 8 | Cross-entropy limit: $x\ln x \to 0$ and entropy continuity | (a)-(c) |
12. Why This Matters for AI (2026 Perspective)
| Concept | AI / LLM Application | Specific Example |
|---|---|---|
| epsilon-delta limits | Foundation of automatic differentiation | PyTorch autograd, JAX grad implement the limit algebraically |
| One-sided limits | Derivative of ReLU at origin | Subgradient at $0$ chosen from $[0, 1]$; training stable because ReLU is continuous |
| Squeeze Theorem | Convergence proofs for SGD | Bounding optimization error between upper and lower envelopes of the loss landscape |
| Continuity | Activation function design | GELU / SiLU chosen over hard ReLU for smoothness; improves gradient flow |
| IVT | Loss surface root finding | Line search algorithms (backtracking, Wolfe conditions) rely on IVT to guarantee sufficient decrease |
| EVT | Well-posedness of optimization | Constrained optimization over compact parameter sets attains minimum by EVT |
| Limits at infinity | Vanishing/exploding gradients | $\sigma'(x) \to 0$ as $|x| \to \infty$: mathematical root of LSTM/ResNet motivation |
| Softmax temperature limits | Temperature-scaled LLM decoding | Temperature in GPT-4, LLaMA-3 inference controls creativity vs. accuracy |
| $\lim_{x \to 0^{+}} x\ln x = 0$ | Shannon entropy convention | $0\log 0 = 0$ makes cross-entropy continuous; used in every classification loss |
| Catastrophic cancellation | Numerical stability in training | torch.nn.CrossEntropyLoss uses log_softmax to avoid overflow - a direct fix for limit instability |
| Robbins-Monro conditions | SGD learning rate theory | Cosine decay + warmup approximates RM conditions while being practically efficient |
| Big-O at infinity | Transformer complexity | Attention is ; linear attention approximations are - limit theory quantifies the gap |
Conceptual Bridge
Looking backward: Limits formalize the intuition of "approaching" that appears throughout earlier sections. The real number system (01-Mathematical-Foundations) provides the completeness axiom - every nonempty bounded set has a supremum - which is what makes limits well-defined. Without $\mathbb{R}$ being complete, Cauchy sequences might not converge, and the entire limit framework would collapse. Linear algebra (02-03) introduced matrix norms; limit-based analysis of norm sequences underlies the study of matrix powers and iterative methods.
Looking forward: Limits are the gateway to all of calculus. The derivative (02-Derivatives-and-Differentiation) is defined as $f'(a) = \lim_{h \to 0}\frac{f(a+h) - f(a)}{h}$ - every differentiation rule follows from properties of limits. The integral (03-Integration) is a limit of Riemann sums. Series (04-Series-and-Sequences) are limits of partial sums; their convergence is governed by limit tests (ratio, root, comparison). In multivariate calculus (05), limits extend to functions of several variables, and continuity becomes the prerequisite for partial derivatives and the multivariable chain rule - the mathematical core of backpropagation.
Beyond calculus, functional analysis (12) studies limits in infinite-dimensional spaces (Banach and Hilbert spaces), and the operator norm is defined as $\|T\| = \sup_{\|x\| = 1}\|Tx\|$ - a supremum, hence reached through limits. Measure theory (24) defines integration via limits of simple functions. The entire edifice of continuous mathematics rests on the foundation laid here.
POSITION IN CURRICULUM
01-Mathematical-Foundations 02-Linear-Algebra-Basics
(Number systems, functions) (Vectors, matrices, norms)
04-01: LIMITS AND CONTINUITY (YOU ARE HERE)
epsilon-delta limits * Limit laws * One-sided limits
L'Hpital * Squeeze Theorem * Fundamental limits
Continuity * IVT * EVT * Asymptotic behavior
Numerical stability * ML applications
04-02: Derivatives 04-04: Series & Sequences
(Derivative as limit, (Partial sums as limits,
chain rule, backprop) Taylor series, convergence)
05: Multivariate Calculus
(Partial derivatives, gradient,
chain rule in R, Jacobian)
08: Optimization
(Gradient descent, Newton's method,
convergence guarantees)
<- Back to Calculus Fundamentals | Next: Derivatives and Differentiation ->
Appendix A: Extended Proofs and Derivations
A.1 Proving Limit Laws from epsilon-delta
Theorem (Product Law). If and , then .
Proof. The key trick is to write:
Given . Since , is bounded near : choose so that implies for .
Set . Choose:
- :
- :
Set . Then for :
Theorem (Composition Law). If and is continuous at , then .
Proof. Given . Since is continuous at : such that .
Since : such that .
Combining: .
A.2 The Cauchy Criterion for Limits
An equivalent characterization that avoids specifying the limit value:
Theorem (Cauchy Criterion). exists (as a finite number) if and only if: for every there exists such that for all with and :
This is useful when you suspect a limit exists but don't know its value.
A.3 Sequential Characterization of Limits
Theorem (Heine's Theorem). $\lim_{x \to a} f(x) = L$ if and only if for every sequence $(x_n)$ with $x_n \to a$ and $x_n \neq a$ for all $n$, we have $f(x_n) \to L$.
Uses of sequential characterization:
- Proving limits exist: Find sequences that suggest the limit, then verify
- Proving limits don't exist: Find two sequences and such that
Example (limit doesn't exist via sequences): $\lim_{x \to 0}\sin(1/x)$.
- Take $x_n = \frac{1}{2\pi n}$: then $\sin(1/x_n) = \sin(2\pi n) = 0 \to 0$
- Take $y_n = \frac{1}{2\pi n + \pi/2}$: then $\sin(1/y_n) = \sin(2\pi n + \pi/2) = 1 \to 1$
Since $0 \neq 1$, the limit does not exist.
A.4 Proof of the Intermediate Value Theorem
Full Proof. Let be continuous, . We show with .
Define . Then:
- : since
- is bounded above: by
Let . We claim .
Case : By continuity, such that . So , meaning is not an upper bound for - contradiction.
Case : By continuity, such that . So contains no elements of , meaning is a smaller upper bound - contradiction.
Therefore . Note (since ) and (since ), so .
Key insight: The proof uses the completeness of $\mathbb{R}$ (every nonempty bounded set has a supremum) - the IVT fails for functions defined only on the rationals because $\mathbb{Q}$ is not complete.
A.5 Limit Superior and Limit Inferior
For a deeper understanding of oscillating limits, we introduce:
Definition. The limit superior is:
The limit inferior is:
Theorem. exists and equals if and only if:
Example: For :
- (achieved along )
- (achieved along for large )
Since , the limit does not exist - confirming our earlier result.
For AI: Limsup and liminf appear in the analysis of optimization algorithms. The limsup of gradient norms determines whether an algorithm has "bounded oscillation" - a prerequisite for convergence. In AdaGrad and Adam, the effective learning rate limsup is controlled by the accumulated second moment.
Appendix B: Worked Examples with Full Solutions
B.1 Twelve Limit Computations
Work through each, identifying the technique before computing.
1.
Form: . Since :
2.
Form: . Rationalize:
3.
Form: . Let . We showed , so .
4.
Substitute (so ):
5.
Let . Since (log grows slower than linear):
6.
Factor: .
Wait: check again. .
7.
Since near and as :
8.
Multiply by conjugate:
9.
10.
L'Hpital (): .
11.
Factor: . Cancel : limit is .
Alternatively: this is the derivative of at , which is .
12.
Let , so and as :
B.2 Continuity Verification Walkthrough
Problem: Determine where is continuous and classify any discontinuities.
Solution:
For :
Case : , so . Continuous for .
Case : , so . Continuous for .
At : is undefined (division by zero). Check one-sided limits:
One-sided limits differ: . This is a jump discontinuity at with jump of .
The discontinuity is not removable (both limits are finite but unequal).
B.3 epsilon-delta Proof for a Nonlinear Limit
Claim: .
Proof: Given . We need when .
Factor: .
Control : Assume , so , giving , so .
Then: .
Choose . Then:
Pattern: For polynomial limits, always: (1) factor out , (2) bound the remaining factor using , (3) choose .
Appendix C: Connections to Advanced Mathematics
C.1 Topological Perspective
In topology, continuity is defined without reference to metrics. A function between topological spaces is continuous if for every open set , the preimage is open in .
For : Open sets are unions of open intervals. The topological definition is equivalent to the epsilon-delta definition because epsilon-balls are the open sets of .
Consequence: The image of a compact set under a continuous function is compact (the continuous image of a compact set is compact). For , this gives: is closed and bounded, i.e., attains its maximum and minimum - the Extreme Value Theorem.
C.2 Uniform Continuity and Lipschitz Maps in ML
A function is -Lipschitz if:
Lipschitz continuity implies uniform continuity (take ).
In ML:
- Spectral normalization (Miyato et al., 2018): constrains each weight matrix to have spectral norm , making each layer 1-Lipschitz, making the whole network -Lipschitz
- Wasserstein GAN (Arjovsky et al., 2017): the discriminator must be 1-Lipschitz (enforced via gradient penalty or spectral norm), which makes the Wasserstein distance well-defined
- Gradient clipping: bounding ensures the loss is -Lipschitz in parameters - controls training instability
C.3 Limits in Metric Spaces
The epsilon-delta definition generalizes directly to metric spaces: replace with for any metric . This covers:
- spaces: for vector limits
- Function spaces: convergence of sequences of functions (pointwise, uniform, )
- Matrix sequences: convergence in Frobenius norm
Convergence of matrix series: The matrix exponential is defined as a limit of partial sums in the matrix operator norm - a direct generalization of the scalar exponential limit.
C.4 Dirichlet and Thomae Functions: Pathological Examples
Dirichlet function:
This function has no limit at any point (rationals and irrationals are dense in , so any neighborhood of contains both). It is continuous nowhere.
Thomae's function (popcorn function):
This function satisfies for every (irrationals accumulate near every point, and the rational values become small for large ). So is continuous at every irrational and discontinuous at every rational - a function continuous exactly on the irrationals.
These pathological examples show that the epsilon-delta framework is necessary: intuition from smooth curves does not predict the behavior of all functions.
Appendix D: Numerical Methods Grounded in Limits
D.1 Bisection Method
The bisection method exploits the IVT to find roots of .
Algorithm:
- Start with $[a_0, b_0]$ where $f(a_0)f(b_0) < 0$
- At step $k$: let $m_k = \frac{a_k + b_k}{2}$
- If $f(a_k)f(m_k) < 0$: set $[a_{k+1}, b_{k+1}] = [a_k, m_k]$
- Else: set $[a_{k+1}, b_{k+1}] = [m_k, b_k]$
- By IVT, $f$ has a root in every $[a_k, b_k]$
- $m_k \to c$ where $f(c) = 0$
Convergence: After $k$ steps, the interval has length $(b_0 - a_0)/2^{k}$. To achieve accuracy $\varepsilon$: need $k \approx \log_2\frac{b_0 - a_0}{\varepsilon}$ steps. This is $O(\log(1/\varepsilon))$ - linear in the number of bits of precision.
Connection to limits: Bisection computes the root as $c = \lim_{k \to \infty} m_k$, numerically. The sequence $(m_k)$ is Cauchy (successive differences go to zero), so it converges in $\mathbb{R}$ by completeness. The limit is the root - the IVT guarantees existence, completeness guarantees the numerical process converges.
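A compact implementation of the algorithm above (a sketch; the tolerance, iteration cap, and test function are arbitrary choices):
```python
def bisect(f, a, b, tol=1e-12, max_iter=200):
    """Root of a continuous f on [a, b], assuming f(a) and f(b) have opposite signs (IVT)."""
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iter):
        m = 0.5 * (a + b)
        fm = f(m)
        if fm == 0.0 or (b - a) / 2 < tol:
            return m
        if fa * fm < 0:      # sign change in [a, m]
            b, fb = m, fm
        else:                # sign change in [m, b]
            a, fa = m, fm
    return 0.5 * (a + b)

root = bisect(lambda x: x**3 - 2.0, 1.0, 2.0)   # cube root of 2
print(root, root**3)                             # ~1.2599, ~2.0
```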
D.2 Newton's Method as a Limit Process
Newton's method computes:
$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$$
This converges (quadratically near a simple root) to $x^{*}$ where $f(x^{*}) = 0$.
The update formula comes from the linear approximation $f(x) \approx f(x_k) + f'(x_k)(x - x_k)$, set to zero: $x = x_k - \frac{f(x_k)}{f'(x_k)}$. This linear approximation is itself a limit statement - the tangent line is the limit of secant lines.
For AI: Quasi-Newton methods (BFGS, L-BFGS) approximate the inverse Hessian via finite differences of gradients. At the heart of each update is a finite-difference approximation of a second derivative - a discrete limit.
D.3 Gradient Checking Implementation
To verify that an implemented gradient $g_{\text{impl}}(x)$ matches the true gradient $\nabla f(x)$, use the centered finite difference:
$$g_{\text{num},i} = \frac{f(x + h e_i) - f(x - h e_i)}{2h}$$
The relative error check
$$\frac{\|g_{\text{num}} - g_{\text{impl}}\|}{\|g_{\text{num}}\| + \|g_{\text{impl}}\|} \ \text{being small (e.g., below } 10^{-7}\text{)}$$
is a heuristic validation that the implementation of the analytic gradient is correct.
Why centered? The centered difference approximates $f'(x)$ with error $O(h^{2})$ (from the Taylor expansion $f(x \pm h) = f(x) \pm h f'(x) + \frac{h^{2}}{2}f''(x) \pm \cdots$), while the one-sided approximation has error $O(h)$. The limit is the same, but the rate of approach is faster for centered differences.
Appendix E: Limits in Probability and Statistics
E.1 Law of Large Numbers
The weak law of large numbers states: if are i.i.d. with mean , then for every :
This is a limit statement about a sequence of probabilities - a "convergence in probability." The expected loss over the training distribution is ; the empirical loss converges to it in the limit of infinite data.
E.2 Central Limit Theorem as a Limit
The CLT states: for i.i.d. with mean and variance :
This "convergence in distribution" is a limit of CDFs: for every continuity point of the standard normal CDF :
For AI: Mini-batch gradient estimates converge to the full-batch gradient in distribution as batch size grows - the CLT justifies treating mini-batch gradients as Gaussian perturbations of the true gradient, underpinning stochastic optimization theory (Mandt, Hoffman, Blei, 2017).
E.3 KL Divergence and Limit Continuity
The KL divergence requires the convention (using ) and is undefined when but .
Continuity: With the convention , is lower-semicontinuous: for any sequence , . This is a limit property that ensures KL minimization is well-posed.
For AI: RLHF training (Ouyang et al., 2022) includes a KL penalty to prevent the model from drifting too far from the reference policy. The continuity of KL guarantees that small policy changes produce small KL penalties - essential for stable RLHF training.
Appendix F: Glossary of Limit-Related Terms
| Term | Definition |
|---|---|
| Limit | : can be made arbitrarily close to by taking sufficiently close to (but not equal to) |
| epsilon-delta definition | Formal definition: |
| One-sided limit | (right) or (left) - approaches from one side only |
| Limit at infinity | : as grows without bound |
| Infinite limit | : grows without bound as ; not a real limit |
| Continuity | is continuous at if exists, exists, and they are equal |
| Removable discontinuity | exists but is undefined or limit |
| Jump discontinuity | One-sided limits exist but differ: |
| Essential discontinuity | At least one one-sided limit is or doesn't exist |
| Indeterminate form | Forms like , , , that require further analysis |
| L'Hpital's Rule | For or : |
| Squeeze Theorem | and implies |
| IVT | Continuous on attains every value between and |
| EVT | Continuous on attains its maximum and minimum |
| Uniform continuity | works uniformly for all points (not point-dependent) |
| Lipschitz continuity | - quantitative uniform continuity |
| Machine epsilon | Smallest with in floating point ( for float64) |
| Catastrophic cancellation | Loss of significant digits when subtracting nearly equal floating-point numbers |
| limsup / liminf | Largest/smallest accumulation point of as |
| Big-O | : near for some |
| Little-o | : as ; is negligible compared to |
Appendix G: Further Reading and References
Textbooks
-
Stewart, J. (2015). Calculus: Early Transcendentals (8th ed.). Cengage. - Standard undergraduate reference; clear motivation and worked examples.
-
Spivak, M. (2006). Calculus (4th ed.). Publish or Perish. - Rigorous treatment; complete proofs of all major theorems including IVT and EVT.
-
Rudin, W. (1976). Principles of Mathematical Analysis (3rd ed.). McGraw-Hill. - The graduate-level standard; epsilon-delta proofs throughout; metric space generalization.
-
Apostol, T. M. (1974). Mathematical Analysis (2nd ed.). Addison-Wesley. - Thorough treatment of limits, continuity, and the Riemann integral.
Machine Learning Connections
-
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Ch. 4 (Numerical computation), Ch. 6 (Activation functions). - Numerical stability, sigmoid/ReLU analysis.
-
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735-1780. - Identifies vanishing gradient via limit behavior of sigmoid.
-
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22(3), 400-407. - Original convergence conditions for SGD.
-
Vaswani, A., et al. (2017). Attention is All You Need. NeurIPS. - Softmax temperature in attention.
-
Miyato, T., et al. (2018). Spectral Normalization for Generative Adversarial Networks. ICLR. - Lipschitz continuity in deep learning.
-
Mandt, S., Hoffman, M. D., & Blei, D. M. (2017). Stochastic Gradient Descent as Approximate Bayesian Inference. JMLR. - CLT applied to SGD noise.
Online Resources
-
3Blue1Brown - "Essence of Calculus" (YouTube). Visual intuition for limits and derivatives.
-
Paul's Online Math Notes (tutorial.math.lamar.edu). Comprehensive worked examples at undergraduate level.
-
MIT OpenCourseWare 18.01 (Single Variable Calculus). Full lecture notes and problem sets.
Appendix H: Detailed ML Worked Examples
H.1 Softmax Numerical Stability: Full Derivation
The naive softmax computation:
```python
import numpy as np

def softmax_naive(z):
    # Exponentiates the raw logits directly - overflows/underflows for extreme z.
    return np.exp(z) / np.sum(np.exp(z))
```
fails for large $|z_j|$ due to overflow ($e^{z}$ exceeds the float64 maximum once $z \gtrsim 709$) or underflow ($e^{z}$ rounds to $0$ for very negative $z$).
The log-sum-exp trick exploits a limit-preserving shift by $m = \max_j z_j$:
$$\mathrm{softmax}(z)_i = \frac{e^{z_i - m}}{\sum_j e^{z_j - m}}$$
This is valid because the shifted expression equals the original (both numerator and denominator are multiplied by $e^{-m}$). After shifting, all exponents are $\le 0$, so no overflow. The maximum term contributes $e^{0} = 1$, preventing underflow of the sum.
Derivation of the stability bound: Let $m = \max_j z_j$. Then $z_j - m \le 0$ for all $j$, so $\sum_j e^{z_j - m} \in [1, n]$ (at least $1$ from the maximum term, at most $n$ terms each $\le 1$). No overflow or underflow.
Log-softmax (needed for cross-entropy):
$$\log\mathrm{softmax}(z)_i = (z_i - m) - \log\sum_j e^{z_j - m}$$
This is the numerically stable form used in torch.nn.CrossEntropyLoss.
Limit interpretation: The log-sum-exp function is a smooth approximation to the max:
$$\max_i z_i \ \le\ \log\sum_i e^{z_i} \ \le\ \max_i z_i + \log n$$
This is another limit - the temperature-scaled version $T\log\sum_i e^{z_i/T}$ approaches $\max_i z_i$ as $T \to 0^{+}$.
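For completeness, a stable counterpart to softmax_naive above, written as a sketch of the shift-by-max idea (NumPy only; torch.nn.functional.log_softmax plays the same role in PyTorch):
```python
import numpy as np

def softmax_stable(z):
    """Softmax via the max-shift: all exponents are <= 0, so no overflow."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()
    e = np.exp(shifted)
    return e / e.sum()

def log_softmax_stable(z):
    """log softmax(z) = (z - m) - log(sum(exp(z - m))) with m = max(z)."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()
    return shifted - np.log(np.sum(np.exp(shifted)))

z = np.array([1000.0, 1001.0, 1002.0])   # softmax_naive(z) would return [nan nan nan]
print(softmax_stable(z))                  # [0.0900 0.2447 0.6652]
print(log_softmax_stable(z))              # [-2.4076 -1.4076 -0.4076]
```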
H.2 Vanishing Gradient: Quantitative Analysis via Limits
Consider an -layer network with sigmoid activations. The gradient of loss with respect to weights in layer is:
Each factor involves .
If we model each factor as a scalar :
As (very deep network, gradient flows from the last layer to layer ):
The gradient vanishes exponentially in the depth. For : . For : .
Contrast with ReLU: . For positive pre-activations, each factor is (no attenuation). The limit is - gradient passes unchanged. This is the key advantage of ReLU for deep networks (He et al., 2015).
ResNet correction: Residual connections add a skip path: . The gradient becomes:
The identity term prevents the product from decaying to zero - the residual "highway" carries gradients regardless of the activation. Mathematically: even if (saturated activations), the full Jacobian stays bounded away from zero.
H.3 Learning Rate Schedules: Limit Analysis
Cosine annealing:
As : , so .
The schedule is continuous (cosine is continuous) and smooth, avoiding the discontinuous jumps of step-decay schedules. The limit means it does not satisfy the Robbins-Monro conditions - in practice this is acceptable because modern large-batch training typically runs for a fixed number of steps, not until convergence.
Warmup + linear decay:
At , both formulas give : continuity is maintained. The limit satisfies (roughly ) but may be finite or not depending on the schedule length.
GPT-3 / LLaMA learning rate schedule: Uses cosine decay with warmup, which empirically outperforms the theoretically optimal schedule for large-batch transformer training. The theory is underdeveloped; the practice is empirically justified.
H.4 Adam Optimizer: Limit Behavior
The Adam update (Kingma & Ba, 2014):
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2)g_t^{2}, \quad \hat m_t = \frac{m_t}{1 - \beta_1^{t}}, \quad \hat v_t = \frac{v_t}{1 - \beta_2^{t}}, \quad \theta_{t+1} = \theta_t - \frac{\alpha\,\hat m_t}{\sqrt{\hat v_t} + \epsilon}$$
Limit as $t \to \infty$: The bias-correction factors $1 - \beta_1^{t} \to 1$ and $1 - \beta_2^{t} \to 1$ (since $\beta_1, \beta_2 < 1$). So for large $t$, Adam behaves like the version without bias correction.
Limit for sparse gradients: If $g_t = 0$ for many steps, then $v_t \to 0$ (exponential decay). The effective learning rate $\alpha/(\sqrt{\hat v_t} + \epsilon) \to \alpha/\epsilon$ - Adam gives a large step for a parameter that hasn't received gradients recently. This is the "adaptive" feature: rarely-updated parameters get large steps when they finally do receive gradients. This is controlled by the limit behavior of the second-moment estimator.
Appendix I: Self-Assessment Questions
Conceptual Questions
-
Explain in your own words why the epsilon-delta definition requires (strict inequality, excluding ) but the continuity definition does not exclude .
-
Give an example of a function such that:
- is defined at
- exists
- But is not continuous at Explain which of the three continuity conditions fails.
-
Why does the Squeeze Theorem require the inequality to hold in a neighborhood of , not just at ?
-
L'Hpital's Rule is often stated as: "just differentiate numerator and denominator separately." What is wrong with this description, and what are the actual conditions for the rule to apply?
-
Why does the IVT require the function to be continuous on a closed interval and not just on the open interval ? Give a counterexample showing what can go wrong on an open interval.
-
Explain why (finite) and represent fundamentally different situations, even though both involve .
-
In what sense is the derivative "just a limit"? What property of the limit (one-sided vs. two-sided) determines whether is differentiable at ?
Computational Practice
-
Compute using three different methods: (a) L'Hpital twice, (b) Taylor expansion, (c) substitution.
-
Find all discontinuities of and classify each.
-
Prove from the epsilon-delta definition that . (Hint: rationalize .)
-
Use the Squeeze Theorem to show . (Hint: .)
-
Determine whether (the Taylor series for ) is continuous on all of . (Hint: uniform convergence on compact sets.)
ML Application Questions
-
Write Python code to compute numerically for . What do you observe? At what temperature does floating-point underflow cause problems with naive implementation?
-
The learning rate satisfies the Robbins-Monro conditions for which values of ? Prove your answer by computing and .
-
Implement the stable sigmoid that avoids overflow for both large positive and large negative . Verify that it matches
scipy.special.expit(x)to full precision for .
Appendix J: Connection Map - Limits Throughout the Curriculum
This section appears at the foundation of calculus, but limits permeate the entire curriculum.
WHERE LIMITS APPEAR ACROSS THE CURRICULUM
04-01 LIMITS (here)
All of Analysis
04-02 Derivatives f'(a) = lim_{h->0} [f(a+h)-f(a)]/h
04-03 Integration integral f = lim_{n->infinity} Sigma f(x*)Deltax
04-04 Series Sigmaa = lim_{N->infinity} S (partial sums)
05 Multivariate partialf/partialx = lim_{h->0} [f(x+he)-f(x)]/h
06 Probability P(A) = lim_{n->infinity} #{outcomes in A}/n (freq.)
08 Optimization Convergence: lim_{t->infinity} ||nablaL(theta)|| = 0
10 Numerical Methods Finite differences, iterative convergence
12 Functional Analysis Operator limits, Banach/Hilbert spaces
24 Measure Theory Integration as limit of simple functions
Every subsequent section in this curriculum builds on the limit concept introduced here. The epsilon-delta definition is the seed; the rest of mathematics is the tree.
Appendix K: Proofs of Fundamental Limits
K.1 Proof that lim(n->infinity)(1 + 1/n)^n = e
We show the sequence is increasing and bounded, hence convergent, and define as its limit.
Step 1: is increasing.
By AM-GM: for positive numbers :
Apply with copies of and one copy of :
Raising both sides to the -th power: .
Step 2: is bounded above by .
By the binomial theorem:
Each term .
So ...
More carefully: .
By the monotone convergence theorem, the increasing bounded sequence converges. We define
K.2 Proof that lim(x->0) sin(x)/x = 1
We give the geometric proof in detail.
Setup: Consider the unit circle centered at the origin. Let , , for , and (the tangent line at meets the line at ).
Area inequalities:
Computing each:
- (base , height )
- (fraction of unit circle area )
So:
Multiply by :
Take reciprocals (reverse inequalities):
Since and , by Squeeze: .
For : use (same by symmetry).
K.3 Proof of L'Hôpital's Rule (0/0 case)
Theorem. Let be differentiable on for some . Suppose , near , and . Then .
Proof (using Cauchy's MVT): Extend and by (so they are continuous at ).
Cauchy's Mean Value Theorem: For any (say ), there exists such that:
Since : .
As : , so .
Therefore: .
Similarly from the left.
Appendix L: Extended ML Applications - Advanced Topics
L.1 Limits in Attention Mechanisms
The scaled dot-product attention:
The scaling by prevents the inner products from growing large in magnitude. Without scaling, as , the logits grow like (random initialization variance), pushing softmax into the saturation regime:
With scaling: remains as (assuming unit-variance queries and keys), keeping softmax in its linear regime where gradients are large.
Formal statement: If with independent components:
So has variance regardless of - this is the limit .
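A small simulation of this variance calculation (a sketch with NumPy; the sample count and dimensions are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(0)
for d in [16, 64, 256, 1024]:
    q = rng.standard_normal((2000, d))
    k = rng.standard_normal((2000, d))
    dots = np.sum(q * k, axis=1)                 # 2000 independent query-key scores
    print(f"d = {d:5d}   var(q.k) = {dots.var():8.1f}   var(q.k / sqrt(d)) = {(dots / np.sqrt(d)).var():.3f}")
# Unscaled scores have variance ~d (logits blow up with dimension);
# dividing by sqrt(d) keeps the variance ~1 and softmax out of its saturated regime.
```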
L.2 Token Probability Limits in Language Modeling
An autoregressive language model assigns probability:
Limit as context grows: What happens to as ? Under standard ergodicity assumptions on the language distribution, the model's uncertainty should approach the entropy of the distribution:
where is the entropy rate of the language. This limit (which exists for stationary ergodic processes) is the asymptotic uncertainty per token - the information-theoretic lower bound on perplexity.
L.3 Neural Tangent Kernel and Infinite Width Limits
As the width of a neural network, the network function converges (in distribution) to a Gaussian process with a specific kernel called the Neural Tangent Kernel (Jacot et al., 2018):
Moreover, in this limit, training with gradient descent is equivalent to kernel regression with :
This infinite-width limit is an active research area connecting neural network training to the well-understood theory of kernel methods and Gaussian processes - all via limit theory applied to neural networks as a function of width.
L.4 Grokking as a Limit Phenomenon
Grokking (Power et al., 2022): the phenomenon where a neural network first overfits (near-zero training loss, high test loss), then after continued training, suddenly generalizes (both losses drop).
From a limit perspective: the training loss is minimized early (the limit of the optimization trajectory is a global minimum), but the test loss requires the model to find a different basin that generalizes. The sudden transition can be understood as crossing a threshold where the solution's "complexity" (measured by norms or effective rank) crosses a critical value.
The limit may be much smaller than for an intermediate time - the limit of the process depends on running it long enough. This is a reminder that limits of optimization trajectories are not always achieved quickly.
Appendix M: Quick Reference Card
LIMITS AND CONTINUITY - QUICK REFERENCE
DEFINITION lim_{x->a} f(x) = L
for all epsilon>0, exists delta>0: 0<|x-a|<delta => |f(x)-L|<epsilon
ONE-SIDED lim_{x->a+} f(x) and lim_{x->a-} f(x)
Two-sided limit exists <=> both one-sided limits exist and agree
LIMIT LAWS lim(f+/-g) = lim f +/- lim g
lim(fg) = (lim f)(lim g)
lim(f/g) = (lim f)/(lim g) if lim g != 0
KEY LIMITS lim_{x->0} sin(x)/x = 1
lim_{x->0} (e^x-1)/x = 1
lim_{n->infinity} (1+1/n)^n = e
lim_{x->0+} x*ln(x) = 0
TECHNIQUES - Direct substitution (if continuous)
- Factoring / cancellation
- Rationalization (conjugate)
- L'Hôpital (0/0 or infinity/infinity only!)
- Squeeze Theorem
CONTINUITY f continuous at a <=>
(1) f(a) defined
(2) lim_{x->a} f(x) exists
(3) lim_{x->a} f(x) = f(a)
DISCONTINUITIES Removable: limit exists, != f(a) or f(a) undefined
Jump: one-sided limits exist but differ
Essential: a one-sided limit is infinite or does not exist
IVT f continuous on [a,b], k between f(a) and f(b)
=> exists c in (a,b): f(c) = k
EVT f continuous on [a,b] => f attains max and min
SQUEEZE h<=f<=g near a, lim h = lim g = L => lim f = L
STABILITY (e^x-1)/x near x=0: use numpy.expm1(x)/x
log(1+x)/x near x=0: use numpy.log1p(x)/x
log(softmax(z)): use log_softmax (LSE trick)
ML CONNECTIONS softmax_T -> argmax as T->0, uniform as T->infinity
sigma(x) -> 1 as x->+infinity, sigma(x) -> 0 as x->-infinity (vanishing gradient)
gradient = lim_{h->0}[f(x+h)-f(x)]/h
Robbins-Monro: Sigma alpha_t = infinity AND Sigma alpha_t^2 < infinity
Appendix N: Problem Set - Additional Exercises
N.1 Computation Problems
N1. Compute the following limits without L'Hôpital's Rule:
(a)
(b)
(c)
(d)
Solutions: (a) (rationalize), (b) (derivative of at ), (c) (fundamental limit), (d) (leading coefficients).
N2. Classify the discontinuities of on .
Factor: numerator , denominator .
At : after cancellation; . Removable discontinuity (redefine ).
At : denominator , numerator ; . Vertical asymptote / essential discontinuity.
N3. For which values of is continuous at ?
Require: , so , giving , .
N.2 Theory Problems
N4. Prove using the epsilon-delta definition that .
Proof: For any , choose . Then for :
N5. Show that the converse of the Extreme Value Theorem is false: give an example of a function attaining its maximum and minimum on $[0,1]$ without being continuous.
One attempt - a function defined only on part of $[0,1]$ with a single constant value - attains its max and min (both equal to that constant) but is not a function on all of $[0,1]$, so the example is vacuous. A cleaner example: $f(0) = 0$, $f(1) = 1$, and $f(x) = \tfrac{1}{2}$ for $x \in (0,1)$ - this $f$ attains its maximum ($1$, at $x = 1$) and minimum ($0$, at $x = 0$) but is discontinuous at $x = 0$ and $x = 1$.
N6. (A fixed-point theorem via the IVT - the one-dimensional case of Brouwer's theorem.) Let $f: [0,1] \to [0,1]$ be continuous. Show $f$ has at least one fixed point.
Define $g(x) = f(x) - x$. Then $g(0) = f(0) \ge 0$ and $g(1) = f(1) - 1 \le 0$. If $g(0) = 0$ or $g(1) = 0$ we are done; otherwise $g(0) > 0 > g(1)$, and the IVT applied to $g$ (continuous) on $[0,1]$ gives $c \in (0,1)$ with $g(c) = 0$, i.e., $f(c) = c$.
N.3 ML Application Problems
N7. (Numerical stability.) Implement both naive and stable computations of $\mathrm{softplus}(x) = \log(1 + e^x)$. For a range of large negative and large positive $x$, compare results and relative errors.
Analysis: For large $x$: $\log(1 + e^x) \approx x$. The naive form overflows once $e^x$ overflows (around $x \approx 710$ in float64). Stable version: for $x > 0$, compute $x + \operatorname{log1p}(e^{-x})$ (the $e^{-x}$ term is small and does not overflow). For $x \le 0$: use $\operatorname{log1p}(e^x)$ directly, since $e^x \le 1$.
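A minimal sketch of the comparison (assuming NumPy; the function names and test values are ours, and the stable form folds both branches into $\max(x, 0) + \operatorname{log1p}(e^{-\lvert x \rvert})$, which is algebraically equivalent to the case split above):

```python
import numpy as np

def softplus_naive(x):
    """log(1 + e^x) computed directly; overflows once e^x overflows."""
    return np.log(1.0 + np.exp(x))

def softplus_stable(x):
    """Stable softplus: max(x, 0) + log1p(exp(-|x|))."""
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

with np.errstate(over="ignore"):
    for x in [-50.0, 0.0, 30.0, 800.0]:
        print(f"x = {x:>6}:  naive = {softplus_naive(x):<12.6g}  stable = {softplus_stable(x):.6g}")
```

The naive form returns `inf` once $e^x$ overflows and loses all precision for large negative $x$; the stable form stays finite and accurate in both tails.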
N8. (Learning rate theory.) Consider the schedule $\alpha_t = \dfrac{a}{b + t}$ for constants $a, b > 0$.
(a) Show this satisfies the first Robbins-Monro condition: $\sum_{t=1}^{\infty} \alpha_t = \infty$.
(b) Show this satisfies the second condition: $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$.
(c) Compare to $\alpha_t = \dfrac{a}{\sqrt{t}}$: does this satisfy both conditions?
Solutions: (a) $\sum_t \frac{a}{b+t}$ diverges by comparison with the harmonic series. (b) $\sum_t \frac{a^2}{(b+t)^2} \le a^2 \sum_t \frac{1}{t^2} < \infty$. (c) $\alpha_t = \frac{a}{\sqrt{t}}$: $\sum_t \frac{a}{\sqrt{t}} = \infty$ (first holds), but $\sum_t \frac{a^2}{t} = \infty$ (second fails).
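Finite partial sums can only suggest the limiting behavior, but a quick check out to $T = 10^6$ (with the arbitrary choices $a = 1$, $b = 10$) matches the analysis: for the first schedule $\sum \alpha_t$ keeps growing while $\sum \alpha_t^2$ stays small, whereas for $\alpha_t = a/\sqrt{t}$ both partial sums keep growing (the second only logarithmically), which is why the second condition fails.

```python
import numpy as np

t = np.arange(1, 1_000_001, dtype=np.float64)
schedules = {"a/(b + t)": 1.0 / (10.0 + t),     # schedule from parts (a)-(b)
             "a/sqrt(t)": 1.0 / np.sqrt(t)}     # comparison schedule from part (c)

for name, alpha in schedules.items():
    print(f"{name:>10}:  sum alpha ~ {alpha.sum():10.2f}   sum alpha^2 ~ {(alpha ** 2).sum():8.4f}")
```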
Appendix O: Typesetting Reference for LaTeX
When writing mathematical content on limits in LaTeX (following the notation guide):
% Limit notation
\lim_{x \to a} f(x) = L % standard limit
\lim_{x \to a^+} f(x) % right-hand limit
\lim_{x \to a^-} f(x) % left-hand limit
\lim_{x \to +\infty} f(x) % limit at +infinity
% epsilon-delta definition
\forall \varepsilon > 0, \; \exists \delta > 0 : \;
0 < \lvert x - a \rvert < \delta \implies \lvert f(x) - L \rvert < \varepsilon
% Fundamental limits
\lim_{x \to 0} \frac{\sin x}{x} = 1
\lim_{x \to 0} \frac{e^x - 1}{x} = 1
\lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n = e
% Continuity
f \text{ continuous at } a \iff
\lim_{x \to a} f(x) = f(a)
% Big-O notation
f(x) = O(g(x)) \text{ as } x \to a
f(x) = o(g(x)) \text{ as } x \to a
Common errors in notation:
- Use `\lvert \cdot \rvert`, not `|\cdot|`, for absolute value in LaTeX
- Use `\varepsilon`, not `\epsilon` (matches standard analysis texts)
- Use `\to`, not `:`, for the limit variable (i.e., `x \to a`, not `x : a`)
- Use `\infty`, not `inf` or `Inf`
- Subscript the limit correctly: `\lim_{x \to 0}`, not `\lim{x \to 0}`
Appendix P: Historical Problems and Their Solutions
P.1 Zeno's Paradoxes: The Ancient Limit Problem
Zeno of Elea (~450 BCE) posed paradoxes about motion that are precisely limit problems in disguise.
Achilles and the Tortoise: Achilles (speed $v_A$) chases a tortoise (speed $v_T < v_A$) with head start $d$. Zeno argued: first Achilles must reach the tortoise's initial position (time $t_1 = d / v_A$); by then the tortoise has moved further; then Achilles must cover that gap (time $t_2 = (v_T / v_A)\, t_1$); and so on. The paradox claims this infinite sequence of tasks cannot be completed.
Resolution via limits: The total time is: $\sum_{k=1}^{\infty} t_k = \frac{d}{v_A} \sum_{k=0}^{\infty} \left(\frac{v_T}{v_A}\right)^k = \frac{d}{v_A} \cdot \frac{1}{1 - v_T / v_A} = \frac{d}{v_A - v_T}$
This is the correct finite time - the geometric series converges. The limit is finite and gives the time Achilles catches the tortoise. Zeno's error was assuming an infinite series must have an infinite sum - a mistake corrected by the theory of limits.
Lesson: Infinite processes can have finite limits. The notion of a convergent infinite series - the limit of its partial sums - is precisely the mathematical resolution to Zeno's paradox.
P.2 Berkeley's Objection: "Ghosts of Departed Quantities"
Bishop George Berkeley (1734), in The Analyst, critiqued Newton's infinitesimals:
"And what are these Fluxions? The Velocities of evanescent Increments. And what are these same evanescent Increments? They are neither finite Quantities, nor Quantities infinitely small, nor yet nothing. May we not call them the Ghosts of departed Quantities?"
Berkeley's critique was valid: Newton treated the increment $o$ as nonzero in order to divide by it, then set $o = 0$ to drop the higher-order terms. This is logically inconsistent.
The resolution: Weierstrass's epsilon-delta definition never actually "sets $h = 0$" - instead, it asks: for every $\varepsilon > 0$, can we find $\delta > 0$ such that for every $h$ with $0 < \lvert h \rvert < \delta$ the difference quotient is within $\varepsilon$ of the proposed limit $L$? This makes no appeal to $h = 0$ - only to inequalities between real numbers. Berkeley's ghost is exorcised by phrasing everything as "$h$ approaches $0$ but never reaches $0$."
P.3 Dirichlet's Proof of the Convergence of Fourier Series
Dirichlet (1829) gave the first rigorous proof that the Fourier series of a "reasonable" function converges to the function at points of continuity. The key ingredient was precise limit analysis: showing that the partial sums, written as a convolution with the Dirichlet kernel $D_N$, satisfy $\lim_{N \to \infty} (D_N * f)(x) = f(x)$ at continuity points.
This required careful epsilon-delta arguments for limits of integrals - exactly the kind of rigorous limit theory that Weierstrass would later systematize.
For AI: Fourier analysis (20) is the mathematical foundation of signal processing. The convergence of Fourier series is a limit statement - the same limit theory studied here extends to function spaces and ensures that truncated Fourier representations converge to the original signal at points of continuity.
P.4 Cauchy's Error and Its Correction
Cauchy (1821) stated (incorrectly): "The limit of a sum of continuous functions is continuous." This is false in general - the pointwise limit of continuous functions need not be continuous.
Counterexample: $f_n(x) = x^n$ on $[0, 1]$. Each $f_n$ is continuous. The pointwise limit:
$f(x) = 0$ for $0 \le x < 1$ and $f(1) = 1$, which is discontinuous at $x = 1$. Cauchy's error!
Correction: If the convergence is uniform (i.e., $\sup_{x} \lvert f_n(x) - f(x) \rvert \to 0$), then the limit is continuous. This stronger notion - uniform convergence - is the correct condition, discovered by Stokes and Seidel (1847).
For AI: Neural network functions trained on finite data may not converge uniformly to the true function - only pointwise (or in some norm). The distinction between pointwise and uniform convergence explains generalization gaps: the network may correctly learn the training points but fail elsewhere.
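A short numerical illustration of the failure of uniform convergence (a sketch; a finite grid only approximates the supremum over $[0, 1)$):

```python
import numpy as np

# f_n(x) = x^n on [0,1]: pointwise limit is 0 on [0,1) and 1 at x = 1,
# but sup_x |f_n(x) - f(x)| stays near 1 for every n - no uniform convergence.
x = np.linspace(0.0, 1.0, 100_001)
f_limit = np.where(x < 1.0, 0.0, 1.0)

for n in [1, 10, 100, 1000]:
    sup_gap = np.max(np.abs(x ** n - f_limit))
    print(f"n = {n:>5}:  sup |x^n - f(x)| ~ {sup_gap:.4f}")
```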
Appendix Q: Summary of All Proof Techniques
| Technique | When to Use | Key Idea |
|---|---|---|
| Direct substitution | $f$ continuous at $a$ | $\lim_{x \to a} f(x) = f(a)$ |
| Factoring | $0/0$ form with polynomials | Cancel the common factor |
| Rationalization | Square roots in numerator or denominator | Multiply by the conjugate |
| epsilon-delta construction | Proving a limit rigorously | Bound $\lvert f(x) - L \rvert$ in terms of $\lvert x - a \rvert$; choose $\delta(\varepsilon)$ |
| Squeeze Theorem | $f$ is bounded between two functions with known limit | Find $h \le f \le g$ with $\lim h = \lim g = L$ |
| L'Hôpital's Rule | $0/0$ or $\infty/\infty$ form | Replace $f/g$ by $f'/g'$ |
| Cauchy's MVT | Proving L'Hôpital, comparing rates | $[f(b) - f(a)]\, g'(c) = [g(b) - g(a)]\, f'(c)$ for some $c$ |
| Series expansion | Polynomial-like behavior near the point | Expand in Taylor series |
| Substitution | Simplify the limit variable | Replace $x$ by a new variable $u$ |
| Sequential argument | Proving a limit does not exist | Find two sequences approaching $a$ with different limits |
| Limsup/liminf | Oscillating functions | Compute $\limsup$ and $\liminf$; the limit exists iff they agree |
Appendix R: Connections to Linear Algebra
R.1 Limits of Matrix Sequences
The theory of limits extends to matrices via matrix norms. For a sequence of matrices $A_k \in \mathbb{R}^{n \times n}$:
Definition. $A_k \to A$ (in norm) means $\lim_{k \to \infty} \lVert A_k - A \rVert = 0$ for any matrix norm $\lVert \cdot \rVert$ (all norms on a finite-dimensional space are equivalent, so the choice does not matter).
Matrix exponential as a limit: $e^A = \sum_{n=0}^{\infty} \frac{A^n}{n!} = \lim_{N \to \infty} \sum_{n=0}^{N} \frac{A^n}{n!}$
This series converges absolutely (in any norm) for all matrices $A$.
Power iteration: The sequence $x_{k+1} = \frac{A x_k}{\lVert A x_k \rVert}$ converges (under mild conditions) to the eigenvector corresponding to the largest-magnitude eigenvalue of $A$.
This is a limit of a vector sequence - continuity of the norm and the eigenvalue structure of $A$ determine the convergence.
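A minimal power-iteration sketch (the symmetric $2 \times 2$ matrix is an arbitrary example, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Power iteration: x_{k+1} = A x_k / ||A x_k||  ->  dominant eigenvector of A.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])          # symmetric, eigenvalues (5 +/- sqrt(5)) / 2
x = rng.standard_normal(2)

for _ in range(50):
    x = A @ x
    x /= np.linalg.norm(x)

rayleigh = x @ A @ x                 # Rayleigh quotient -> largest eigenvalue
print("power iteration estimate:", rayleigh)
print("numpy eigenvalues:       ", np.linalg.eigvalsh(A))
```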
R.2 Spectral Radius and Stability
Spectral radius: $\rho(A) = \max_i \lvert \lambda_i(A) \rvert$ (the largest eigenvalue magnitude).
Theorem (Gelfand's formula): $\rho(A) = \lim_{k \to \infty} \lVert A^k \rVert^{1/k}$
This limit always exists and equals the spectral radius - a remarkable result connecting limits of norms to eigenvalue structure.
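Gelfand's formula can be checked numerically on a small non-normal matrix (an arbitrary example), where $\lVert A^k \rVert^{1/k}$ overestimates $\rho(A)$ for small $k$ but approaches it as $k \to \infty$:

```python
import numpy as np

A = np.array([[0.5, 2.0],
              [0.0, 0.3]])                          # non-normal: ||A||_2 > rho(A)
rho = np.max(np.abs(np.linalg.eigvals(A)))          # spectral radius = 0.5

for k in [1, 5, 20, 100, 500]:
    norm_k = np.linalg.norm(np.linalg.matrix_power(A, k), 2)   # spectral norm of A^k
    print(f"k = {k:>4}:  ||A^k||^(1/k) = {norm_k ** (1.0 / k):.6f}   (rho(A) = {rho:.4f})")
```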
Stability of linear systems: The iteration $x_{k+1} = A x_k$ converges to $0$ for every initial $x_0$ if and only if $\rho(A) < 1$: $\lim_{k \to \infty} A^k = 0 \iff \rho(A) < 1$.
For AI - gradient descent as a linear system: Near a minimum $\theta^*$, gradient descent on a (locally) quadratic loss is a linear iteration with matrix $I - \eta H$ (where $H$ is the Hessian at $\theta^*$). Convergence requires $\rho(I - \eta H) < 1$, i.e., $\lvert 1 - \eta \lambda \rvert < 1$ for all eigenvalues $\lambda$ of $H$. This gives the condition $0 < \eta < 2 / \lambda_{\max}$ - a limit-theory result on the convergence of gradient descent.
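A two-dimensional quadratic makes the threshold $\eta < 2 / \lambda_{\max}$ visible (a sketch with an arbitrary diagonal Hessian):

```python
import numpy as np

# Quadratic loss f(theta) = 1/2 theta^T H theta:
# gradient descent is the linear iteration theta_{k+1} = (I - eta H) theta_k.
H = np.diag([1.0, 10.0])                 # eigenvalues 1 and 10  ->  need eta < 2/10 = 0.2
theta0 = np.array([1.0, 1.0])

def final_norm(eta, steps=200):
    M = np.eye(2) - eta * H
    theta = theta0.copy()
    for _ in range(steps):
        theta = M @ theta
    return np.linalg.norm(theta)

for eta in [0.05, 0.19, 0.21]:           # 0.21 > 2 / lambda_max  ->  the iteration diverges
    print(f"eta = {eta}:  ||theta_200|| = {final_norm(eta):.3e}")
```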
R.3 Condition Number and Sensitivity Analysis
The condition number $\kappa(A) = \lVert A \rVert \, \lVert A^{-1} \rVert$ measures the sensitivity of the solution of $Ax = b$ to perturbations in $b$: $\frac{\lVert \delta x \rVert}{\lVert x \rVert} \le \kappa(A) \, \frac{\lVert \delta b \rVert}{\lVert b \rVert}$
In the limit $\kappa(A) \to \infty$ (ill-conditioned), a tiny relative perturbation in $b$ can cause an arbitrarily large relative change in the solution $x$. This is the matrix-level analog of the numerical instability (catastrophic cancellation) discussed in Section 8 - both arise from near-singularity, and both are limit phenomena.
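A small sketch of this sensitivity on a nearly singular $2 \times 2$ system (the matrix and perturbation sizes are arbitrary illustrations):

```python
import numpy as np

eps = 1e-8
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + eps]])                           # nearly singular
b = np.array([2.0, 2.0 + eps])                             # exact solution x = (1, 1)

x = np.linalg.solve(A, b)
x_pert = np.linalg.solve(A, b + np.array([0.0, 1e-10]))    # tiny perturbation of b

print(f"cond(A)              = {np.linalg.cond(A):.2e}")
print(f"relative change in b = {1e-10 / np.linalg.norm(b):.2e}")
print(f"relative change in x = {np.linalg.norm(x_pert - x) / np.linalg.norm(x):.2e}")
```

The relative change in $x$ exceeds the relative change in $b$ by a factor on the order of $\kappa(A)$.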
Appendix S: Python and NumPy Reference for Limit Computations
import numpy as np
# === Numerically stable computations near limits ===
# (e^x - 1)/x near x=0: use expm1
def f_stable(x):
"""Compute (e^x - 1)/x stably for small x."""
return np.expm1(x) / x # NOT (np.exp(x) - 1) / x
# log(1 + x)/x near x=0: use log1p
def g_stable(x):
"""Compute log(1+x)/x stably for small x."""
return np.log1p(x) / x # NOT np.log(1+x) / x
# Stable softmax
def softmax_stable(z):
z_shifted = z - np.max(z)
exp_z = np.exp(z_shifted)
return exp_z / exp_z.sum()
# Stable log-softmax (for cross-entropy)
def log_softmax_stable(z):
m = np.max(z)
return z - m - np.log(np.sum(np.exp(z - m)))
# Stable sigmoid
def sigmoid_stable(x):
return np.where(x >= 0,
1 / (1 + np.exp(-x)),
np.exp(x) / (1 + np.exp(x)))
# === Numerical limit approximation ===
def numerical_limit(f, a, h_values=None):
"""Numerically approximate lim_{x->a} f(x) via centered differences."""
if h_values is None:
h_values = [1e-1, 1e-2, 1e-4, 1e-6, 1e-8, 1e-10]
print(f"{'h':>12} | {'f(a+h)':>15} | {'f(a-h)':>15} | {'avg':>15}")
print("-" * 65)
for h in h_values:
fph = f(a + h)
fmh = f(a - h)
avg = (fph + fmh) / 2
print(f"{h:>12.2e} | {fph:>15.10f} | {fmh:>15.10f} | {avg:>15.10f}")
# Example: lim_{x->0} sin(x)/x = 1
numerical_limit(lambda x: np.sin(x)/x if x != 0 else 1.0, a=0.0)
# === Gradient checking ===
def grad_check(f, theta, i, h=1e-5):
"""Centered finite difference for d/d(theta_i) f(theta)."""
theta_plus = theta.copy(); theta_plus[i] += h
theta_minus = theta.copy(); theta_minus[i] -= h
return (f(theta_plus) - f(theta_minus)) / (2 * h)
Key takeaway: The choice h = 1e-5 in grad_check is close to the optimal value from Section 8.3: the centered finite difference has total error $O(h^2) + O(\varepsilon_{\text{mach}} / h)$, which is minimized around $h \approx \varepsilon_{\text{mach}}^{1/3} \approx 6 \times 10^{-6}$ for float64 ($\varepsilon_{\text{mach}} \approx 2.2 \times 10^{-16}$).
End of Appendices. See theory.ipynb for interactive examples and exercises.ipynb for graded problems.
Appendix T: Summary of Key Results
| Result | Statement | Where Proved |
|---|---|---|
| epsilon-delta limit definition | $\forall \varepsilon > 0, \exists \delta > 0: 0 < \lvert x - a \rvert < \delta \implies \lvert f(x) - L \rvert < \varepsilon$ | 2 |
| Limit laws | Sum, product, quotient of limits | 2.2 |
| Squeeze Theorem | $h \le f \le g$ near $a$, $\lim h = \lim g = L \implies \lim f = L$ | 4.3, App. K.2 |
| $\lim_{x \to 0} \frac{\sin x}{x} = 1$ | Geometric area argument + Squeeze | 3.1, App. K.2 |
| $\lim_{x \to 0} \frac{e^x - 1}{x} = 1$ | Taylor series / definition of derivative | 3.2 |
| $\lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n = e$ | Monotone convergence + binomial theorem | 3.3, App. K.1 |
| $\lim_{x \to 0^+} x \ln x = 0$ | L'Hôpital on $\frac{\ln x}{1/x}$ | 3.4 |
| L'Hôpital's Rule | $\lim \frac{f}{g} = \lim \frac{f'}{g'}$ for $0/0$ or $\infty/\infty$ | 4.2, App. K.3 |
| Continuity definition | All three conditions: existence, limit, equality | 5.1 |
| Intermediate Value Theorem | Continuous on $[a,b]$ hits every intermediate value | 6.1, App. A.4 |
| Extreme Value Theorem | Continuous on $[a,b]$ attains max and min | 6.2 |
| Heine-Cantor Theorem | Continuous on $[a,b]$ $\implies$ uniformly continuous | 6.3 |
| Gelfand's formula | $\rho(A) = \lim_{k \to \infty} \lVert A^k \rVert^{1/k}$ | App. R.2 |
| Robbins-Monro conditions | SGD converges if $\sum \alpha_t = \infty$ and $\sum \alpha_t^2 < \infty$ | 9.4 |
| Softmax temperature limits | $T \to 0$: argmax; $T \to \infty$: uniform | 9.1 |
| Vanishing gradient rate | Gradient magnitude $\lesssim (1/4)^L$ for $L$-deep sigmoid networks | App. H.2 |
<- Back to Calculus Fundamentals | Next: Derivatives and Differentiation ->