Floating-Point Arithmetic, Part 1: Intuition to Appendix E (Proofs and Derivations)

1. Intuition

1.1 Why Computers Cannot Represent Most Real Numbers

The real numbers form an uncountably infinite set. A computer, by contrast, stores information in a finite number of bits. With $k$ bits you can represent at most $2^k$ distinct values. No finite $k$ is large enough to cover even the rationals between 0 and 1 - there are infinitely many of them.

The resolution: floating-point numbers are a carefully chosen, finite subset of the rationals - dense near zero (where scientific quantities tend to cluster) and increasingly sparse at large magnitudes. The IEEE 754 standard specifies exactly which rationals are representable and exactly how arithmetic on them is performed. Every GPU, CPU, and TPU on the planet implements this standard.

The consequence for computation: Every real-number operation - addition, multiplication, transcendentals - must be rounded to the nearest representable floating-point number. This introduces a tiny error at each step. Over millions of operations (a single transformer forward pass involves billions), these errors can accumulate in structured, predictable ways. Understanding and controlling this accumulation is numerical analysis.

A concrete example: The number $0.1$ cannot be represented exactly in binary floating-point. In fp32 it is stored as approximately $0.100000001490116119384765625$ - an error of $\approx 1.5 \times 10^{-9}$. This seems negligible, but errors compound. Summing 10 million copies of this value one at a time in fp32 rounds the running total at every addition; the worst-case relative error bound for such a naive sum is $n\varepsilon_{\text{mach}} \approx 10^7 \times 1.2 \times 10^{-7} \approx 1$, so the computed result can miss the exact answer $10^6$ by a substantial margin (the sketch below measures it).
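A minimal NumPy sketch of this effect (the exact drift you observe depends on the accumulation order; the loop below is deliberately naive):

```python
import numpy as np

x = np.float32(0.1)
print(f"{x:.20f}")              # encoding error: 0.10000000149011611938...

s = np.float32(0.0)             # naive sequential fp32 accumulation
for _ in range(10_000_000):
    s = np.float32(s + x)
print(s, abs(s - 1e6))          # the running total drifts away from 1,000,000

print(np.sum(np.full(10_000_000, 0.1, dtype=np.float32), dtype=np.float64))
# accumulating in fp64 leaves only the tiny encoding error of 0.1
```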

1.2 Why This Matters for AI

Floating-point arithmetic is not an abstraction for AI practitioners - it is the daily substrate:

Training stability: fp16 training requires loss scaling to prevent gradient underflow. Without it, gradients smaller than the fp16 minimum normal ($\approx 6 \times 10^{-5}$) are rounded to zero, silently stopping learning for those parameters.

Model quality: Quantizing from fp32 to int8 for inference can degrade accuracy by 1-5% on challenging tasks if done naively. Understanding the rounding error model tells you which operations are most sensitive and how to compensate.

Numerical bugs: NaN (Not a Number) propagation is one of the most common silent failures in deep learning. A single NaN anywhere in the computation graph - caused by $\log(0)$, $0/0$, or overflow - propagates through every subsequent operation, turning the entire gradient to NaN.

Speed and memory: The transition from fp32 to bf16 for training LLMs (GPT-4, Llama, Gemini all use bf16) halves memory and roughly doubles arithmetic throughput on modern hardware. Understanding why bf16 preserves training stability when fp16 does not requires knowing that bf16 preserves the exponent range of fp32 while sacrificing mantissa precision.

For AI: Every tensor in PyTorch, JAX, or TensorFlow is a floating-point array. torch.finfo(torch.float16).eps gives you $\varepsilon_{\text{mach}}$ for fp16. Knowing this number - $9.77 \times 10^{-4}$ - immediately tells you the precision limit of every fp16 computation.

1.3 A Day in the Life of a Floating-Point Number

Follow a real number $x = 0.3$ through a computation:

Birth (encoding): $x = 0.3$ is mapped to the nearest fp32-representable value: $x_{\text{fp32}} = 0.30000001192092896$. Error: $\approx 1.2 \times 10^{-8}$.

Arithmetic (rounding): Compute $x_{\text{fp32}} + 0.6_{\text{fp32}}$. Each input has already been rounded, and the sum is rounded again if necessary. The result, $\approx 0.90000003576278687$, differs from the true $0.9$ by $\approx 3.6 \times 10^{-8}$ - and it is not the same value as $0.9_{\text{fp32}} = 0.89999997615814209$.

Comparison (danger): Testing x_fp32 + 0.6_fp32 == 0.9 returns False in Python, because the computed sum and the rounded literal 0.9 land on different representable values. This is the source of countless bugs in numerical code.

Accumulation (catastrophe): Subtract $0.8_{\text{fp32}}$ from the result: $0.90000003576278687 - 0.80000001192092896 = 0.10000002384185791$ vs the true $0.1$. Relative error: $\approx 2.4 \times 10^{-7}$ - several times larger than the relative error of the operands, because subtracting nearly-equal numbers shrinks the result while leaving the absolute errors intact. This is catastrophic cancellation.

Death (overflow/underflow): Multiply by $10^{38}$ in fp32: overflow to $+\infty$. Multiply by $10^{-46}$: underflow to $0$ (or a subnormal). Once a value becomes $\pm\infty$ or NaN, all subsequent operations inherit the corruption.
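The same walkthrough can be reproduced in NumPy; a small sketch (printed values are approximate, and the 0.9 on the comparison line is a Python float64 literal):

```python
import numpy as np

x = np.float32(0.3)
print(f"{x:.20f}")                         # birth: 0.30000001192092895508

y = x + np.float32(0.6)                    # arithmetic on already-rounded inputs
print(f"{y:.20f}")                         # close to, but not exactly, 0.9

print(y == 0.9)                            # comparison: False
print(np.isclose(y, 0.9, rtol=1e-6))       # True: compare with a tolerance instead

z = y - np.float32(0.8)                    # cancellation amplifies relative error
print(f"{z:.20f}", abs(z - 0.1) / 0.1)

print(np.float32(1e38) * np.float32(10))   # overflow: inf (with a RuntimeWarning)
print(np.float32(1e-45) / np.float32(100)) # underflow: 0.0
```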

1.4 Historical Timeline: 1985-2024

FLOATING-POINT ARITHMETIC: KEY MILESTONES
========================================================================

  1980  Intel 8087 math coprocessor implements a draft of the standard
  1985  IEEE 754-1985 published - first universal FP standard
        (sign, exponent, mantissa; 80-bit extended precision;
        round-to-nearest-even default)
  2008  IEEE 754-2008 - adds fp16 (half precision), decimal formats
  2010  NVIDIA Fermi GPU: first IEEE-compliant fp64 on GPU
  2017  Mixed-precision training (Micikevicius et al., NVIDIA) -
        fp16 forward/backward, fp32 master weights
  2018  bf16 (Brain Float 16) introduced by Google Brain for TPUs
  2020  NVIDIA Ampere: bf16 and TF32 tensor cores
  2022  fp8 training (NVIDIA H100 Transformer Engine) - 8-bit FP
        with dynamic scaling; used for LLM training
  2023  Llama-2, GPT-4, Gemini: bf16 training standard
  2024  fp4 experimental; int4/int8 GGUF quantization standard
        for LLM inference (llama.cpp ecosystem)

========================================================================

2. Formal Definitions: IEEE 754

2.1 The IEEE 754 Standard

The IEEE 754 standard represents a floating-point number $x$ in sign-magnitude-exponent form:

$$x = (-1)^s \cdot (1.m_1 m_2 \ldots m_p)_2 \cdot 2^{e - \text{bias}}$$

where:

  • $s \in \{0, 1\}$ is the sign bit
  • $m_1 m_2 \ldots m_p$ is the mantissa (or significand) in binary, with an implicit leading 1
  • $e$ is the stored exponent (biased)
  • $\text{bias} = 2^{q-1} - 1$ for a $q$-bit exponent field

Format specifications:

| Format | Total bits | Sign | Exponent | Mantissa | $\varepsilon_{\text{mach}}$ | Max value | Min normal |
|---|---|---|---|---|---|---|---|
| fp64 (double) | 64 | 1 | 11 | 52 | $2.2 \times 10^{-16}$ | $1.8 \times 10^{308}$ | $2.2 \times 10^{-308}$ |
| fp32 (single) | 32 | 1 | 8 | 23 | $1.2 \times 10^{-7}$ | $3.4 \times 10^{38}$ | $1.2 \times 10^{-38}$ |
| bf16 | 16 | 1 | 8 | 7 | $7.8 \times 10^{-3}$ | $3.4 \times 10^{38}$ | $1.2 \times 10^{-38}$ |
| fp16 | 16 | 1 | 5 | 10 | $9.8 \times 10^{-4}$ | $6.6 \times 10^{4}$ | $6.1 \times 10^{-5}$ |
| fp8 E4M3 | 8 | 1 | 4 | 3 | $0.125$ | $448$ | $1.5 \times 10^{-2}$ |
| fp8 E5M2 | 8 | 1 | 5 | 2 | $0.25$ | $57344$ | $6.1 \times 10^{-5}$ |

Key insight: bf16 has the same exponent range as fp32 (both use 8 exponent bits, bias 127) but only 7 mantissa bits vs 23. This means bf16 can represent the same scale of numbers as fp32, but with far less precision per value. This is why bf16 is training-stable (no overflow, no gradient underflow) while fp16 (5-bit exponent, max value $\approx 65504$) overflows easily during training.

Bit-layout diagram:

IEEE 754 BIT LAYOUTS
========================================================================

  fp32 (32 bits):
  [S][EEEEEEEE][MMMMMMMMMMMMMMMMMMMMMMM]
   1     8              23

  fp16 (16 bits):
  [S][EEEEE][MMMMMMMMMM]
   1    5        10

  bf16 (16 bits):
  [S][EEEEEEEE][MMMMMMM]
   1     8         7      <- Same exponent field as fp32!

  fp8 E4M3 (8 bits):
  [S][EEEE][MMM]
   1    4     3

========================================================================

2.2 Machine Epsilon

Definition (Machine Epsilon): The machine epsilon $\varepsilon_{\text{mach}}$ is the gap between $1$ and the next larger representable floating-point number - equivalently, the smallest $\varepsilon$ such that $1 + \varepsilon$ is representable and exceeds $1$. Explicitly:

$$\varepsilon_{\text{mach}} = 2^{-p}$$

where $p$ is the number of mantissa bits (the "precision").

For fp32: $\varepsilon_{\text{mach}} = 2^{-23} \approx 1.19 \times 10^{-7}$.

Unit in the last place (ULP): The ULP of a floating-point number $x$ is the spacing between $x$ and the next representable number. For a number in the range $[2^e, 2^{e+1})$:

$$\text{ULP}(x) = 2^{e-p}$$

Fundamental property: For any real $x$ in the normal range, the nearest floating-point number $\text{fl}(x)$ satisfies:

$$|\text{fl}(x) - x| \le \tfrac{1}{2} \text{ULP}(x) \le \tfrac{\varepsilon_{\text{mach}}}{2} |x|$$

This is the relative rounding error bound: $|\text{fl}(x) - x| / |x| \le \varepsilon_{\text{mach}}/2$.

Non-example: Machine epsilon is NOT the smallest positive floating-point number. The smallest normal fp32 number is $\approx 1.18 \times 10^{-38}$. Machine epsilon is the smallest perturbation to $1$ that is detectable - a very different concept.
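A short sketch of both ideas, assuming the default round-to-nearest-even mode:

```python
import numpy as np

eps = np.float32(1.0)                 # find eps by repeated halving
while np.float32(1.0) + eps / np.float32(2.0) > np.float32(1.0):
    eps = eps / np.float32(2.0)
print(eps, np.finfo(np.float32).eps)  # both 2**-23 ~ 1.19e-07

print(np.finfo(np.float32).tiny)      # smallest normal float, ~1.18e-38: a different concept
print(np.float32(1.0) + np.float32(1e-10))  # 1.0 -- representable, but too small to perturb 1
```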

2.3 The Floating-Point Number Line

The floating-point numbers are not uniformly spaced. They cluster densely near zero and become increasingly sparse at large magnitudes:

  • In $[1, 2)$: spacing $= 2^{-23}$ (fp32) - $\approx 8.4 \times 10^6$ numbers
  • In $[2, 4)$: spacing $= 2^{-22}$ - $\approx 8.4 \times 10^6$ numbers
  • In $[1024, 2048)$: spacing $= 2^{-13}$ - same count, but 1000x coarser
  • Near $10^{37}$: spacing $\approx 10^{30}$ - individual values separated by $10^{30}$!

Consequence for AI: Neural network weights are typically initialized in $[-1, 1]$, where fp32 provides $\sim 10^7$ distinct values per unit. As training progresses and some weights grow large, precision degrades - but this rarely matters because the loss landscape is smooth there. The dangerous zone is small values: an update below $\sim 10^{-7}$ times a weight's magnitude is rounded away entirely in fp32, and in fp16 gradients below the minimum normal ($\approx 6 \times 10^{-5}$) underflow outright.
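np.spacing returns the ULP at a given value, which makes the uneven spacing easy to see (a small sketch):

```python
import numpy as np

for v in [1.0, 2.0, 1024.0, 1e6, 1e37]:
    print(f"{v:10.1e}  ULP = {np.spacing(np.float32(v)):.3e}")

# A "small" addend can vanish entirely next to a large value:
print(np.float32(1e8) + np.float32(1.0) == np.float32(1e8))  # True: 1 < ULP(1e8)/2
```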

2.4 Special Values: Infinity, NaN, Subnormals

Positive/negative infinity $(\pm\infty)$: Stored with all-ones exponent, zero mantissa. Result of overflow (e.g., $3.4 \times 10^{38} \times 2$ in fp32) or division by zero. All arithmetic on $\infty$ is defined: $1/\infty = 0$, $\infty + \text{finite} = \infty$, but $\infty - \infty = \text{NaN}$.

NaN (Not a Number): All-ones exponent, nonzero mantissa. Results from $0/0$, $\infty - \infty$, $\sqrt{-1}$. NaN is contagious: any arithmetic involving NaN produces NaN. A single NaN in a gradient will propagate backward through the entire computation graph, turning all parameter updates to NaN.

Subnormal numbers: When the stored exponent is all-zeros, the implicit leading 1 becomes a leading 0, allowing numbers smaller than the normal minimum. fp32 subnormals cover $[\approx 1.4 \times 10^{-45}, 1.18 \times 10^{-38})$ with reduced precision. Many GPUs flush subnormals to zero ("FTZ mode") for performance, which can cause unexpected underflow.

Practical rule: In deep learning code, if torch.isnan(loss) is True, trace backward: either the loss computation involves log(0), a division by a near-zero value, or upstream activations have overflowed.
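A few lines of NumPy show where NaNs come from and why equality tests cannot find them (the RuntimeWarnings are expected):

```python
import numpy as np

zero = np.float32(0.0)
print(np.log(zero))                              # -inf
print(zero / zero)                               # nan: 0/0 is undefined
print(np.float32(np.inf) - np.float32(np.inf))   # nan: inf - inf is undefined

g = np.array([1.0, np.nan, 3.0], dtype=np.float32)
print(g.sum(), g.mean())                         # nan nan -- one NaN poisons the reduction
print(np.nan == np.nan)                          # False -- NaN is not even equal to itself
print(np.isnan(g).any())                         # True -- the correct way to detect it
```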

2.5 Rounding Modes

IEEE 754 defines five rounding modes (two round-to-nearest variants and three directed modes). The default - used in virtually all ML training - is round-to-nearest-even (banker's rounding). The table below lists the most important modes, plus stochastic rounding, which is not part of the standard but matters for low-precision training:

| Mode | Description | When used |
|---|---|---|
| Round-to-nearest-even | Round to nearest; on a tie, round to the even mantissa | Default (IEEE 754) |
| Round toward $+\infty$ | Always round up | Interval arithmetic upper bounds |
| Round toward $-\infty$ | Always round down | Interval arithmetic lower bounds |
| Round toward zero | Truncate | Hardware division |
| Stochastic rounding | Randomly round up/down, proportional to distance | Low-precision training (fp8) |

Round-to-nearest-even is the default because it is unbiased: on average, half of tie-breaks round up and half round down, preventing systematic accumulation of rounding errors in long sums. This is critical for gradient accumulation over millions of steps.


3. Floating-Point Arithmetic

3.1 The Fundamental Rounding Error Model

Theorem (Rounding Error Model): For any IEEE 754-compliant basic operation $\circ \in \{+, -, \times, \div\}$ on floating-point numbers $x, y$:

$$\text{fl}(x \circ y) = (x \circ y)(1 + \delta), \quad |\delta| \le \varepsilon_{\text{mach}}$$

The computed result equals the exact result times a factor $(1+\delta)$ where $\delta$ is at most one machine epsilon in magnitude. This is the single most important fact in numerical analysis.

Proof sketch: By definition of $\varepsilon_{\text{mach}}$, rounding any real $z$ in the normal range gives $\text{fl}(z) = z(1+\delta)$ with $|\delta| \le \varepsilon_{\text{mach}}/2$ (round-to-nearest). IEEE 754 mandates that each basic operation is computed as if exact and then rounded, so $\text{fl}(x \circ y) = \text{fl}(x \circ_{\text{exact}} y) = (x \circ_{\text{exact}} y)(1 + \delta)$.

Chaining errors: For a sequence of $n$ operations, the cumulative error is bounded by approximately $n\varepsilon_{\text{mach}}$ in relative terms. For a dot product $\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^n x_i y_i$:

$$|\text{fl}(\mathbf{x}^\top \mathbf{y}) - \mathbf{x}^\top \mathbf{y}| \le (n-1)\,\varepsilon_{\text{mach}}\, |\mathbf{x}|^\top |\mathbf{y}| \cdot (1 + O(\varepsilon_{\text{mach}}))$$

For AI: The attention softmax computes $\text{softmax}(\mathbf{q}^\top K / \sqrt{d_k})$ - a dot product of length $d_k$ (typically 64-128). In fp16, $\varepsilon_{\text{mach}} \approx 10^{-3}$, and the dot product can accumulate up to $\sim 128 \times 10^{-3} = 12.8\%$ relative error if done naively in fp16. This is why FlashAttention accumulates the softmax denominator in fp32 even when computing in fp16.
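A rough sketch of the effect using random vectors (the exact error depends on the random draw; the point is the gap between accumulating in fp16 and accumulating in fp32):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float16)
y = rng.standard_normal(4096).astype(np.float16)

exact = np.dot(x.astype(np.float64), y.astype(np.float64))   # fp64 reference

acc16 = np.float16(0.0)          # sequential accumulation entirely in fp16
for a, b in zip(x, y):
    acc16 = np.float16(acc16 + a * b)

acc32 = np.float32(0.0)          # same products, partial sums kept in fp32
for a, b in zip(x, y):
    acc32 += np.float32(a) * np.float32(b)

print(abs(acc16 - exact) / abs(exact))   # fp16 accumulation: typically much larger error
print(abs(acc32 - exact) / abs(exact))   # fp32 accumulation: typically orders of magnitude smaller
```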

3.2 Catastrophic Cancellation

Catastrophic cancellation occurs when two nearly-equal floating-point numbers are subtracted, causing the leading significant digits to cancel and leaving only the noisy low-order bits.

Example: Compute $f(x) = \sqrt{x+1} - \sqrt{x}$ for large $x$.

For $x = 10^8$ in fp32:

  • $\sqrt{10^8 + 1} \approx 10000.00005$ (fp32: $10000.0$)
  • $\sqrt{10^8} = 10000.0$ (fp32: $10000.0$)
  • $f(10^8)_{\text{fp32}} = 0.0$ - complete loss of all digits!

True value: $f(10^8) = 1/(\sqrt{10^8+1} + \sqrt{10^8}) \approx 5 \times 10^{-5}$.

Algebraic fix: Multiply and divide by the conjugate:

$$\sqrt{x+1} - \sqrt{x} = \frac{(\sqrt{x+1} - \sqrt{x})(\sqrt{x+1} + \sqrt{x})}{\sqrt{x+1} + \sqrt{x}} = \frac{1}{\sqrt{x+1} + \sqrt{x}}$$

The denominator is computed by addition (not subtraction), avoiding cancellation entirely.
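A quick check of both forms in fp32 (a sketch; printed values are approximate):

```python
import numpy as np

x = np.float32(1e8)
naive  = np.sqrt(x + np.float32(1.0)) - np.sqrt(x)                  # cancels to 0.0
stable = np.float32(1.0) / (np.sqrt(x + np.float32(1.0)) + np.sqrt(x))
print(naive, stable)                                                # 0.0 vs ~5e-05
```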

General rule: When a subtraction $a - b$ results in a value much smaller than $|a|$ and $|b|$, suspect catastrophic cancellation. The relative error in the result can be as large as $\frac{|a|}{|a-b|}\, \varepsilon_{\text{mach}}$ - the ratio of operand magnitude to result magnitude times machine epsilon.

In AI: The log-sum-exp function $\log(\sum_i e^{x_i})$ overflows when some $x_i$ are large and loses small contributions to underflow when the $x_i$ differ greatly in magnitude. The numerically stable form subtracts the maximum first - an algebraic rearrangement that avoids both overflow and loss of contributions (see Section 6.2).

3.3 Error Accumulation in Long Computations

Forward error analysis: Track how errors in inputs propagate to errors in outputs. For a function $f: \mathbb{R}^n \to \mathbb{R}^m$ evaluated via a sequence of operations, the forward error bound expresses $\|\hat{f}(\mathbf{x}) - f(\mathbf{x})\|$ in terms of $\varepsilon_{\text{mach}}$ and problem data.

Backward error analysis (Wilkinson): Ask a different question: for what perturbed input $\hat{\mathbf{x}}$ does the computed output $\hat{f}(\mathbf{x})$ equal the exact output? The backward error is $\|\hat{\mathbf{x}} - \mathbf{x}\|/\|\mathbf{x}\|$ - how much was the input effectively perturbed?

Why backward analysis is preferred: If an algorithm's backward error is of size $\varepsilon_{\text{mach}}$ (i.e., the computed answer is the exact answer for a slightly perturbed input), the algorithm is backward stable - as good as you can expect from finite-precision arithmetic.

Summation error: For the sum $S = \sum_{i=1}^n x_i$, naive sequential summation has forward error $O(n \varepsilon_{\text{mach}}) \|\mathbf{x}\|_1$. Pairwise summation (binary tree) reduces this to $O(\log n \cdot \varepsilon_{\text{mach}})$. Kahan summation reduces it to $O(\varepsilon_{\text{mach}})$, independent of $n$.

3.4 Kahan Compensated Summation

Kahan (1965) showed that a small modification to sequential summation reduces the accumulated error from $O(n\varepsilon_{\text{mach}})$ to $O(\varepsilon_{\text{mach}})$, regardless of $n$:

KAHAN SUMMATION ALGORITHM
========================================================================

  Input: x_1, x_2, ..., x_n
  Output: S ≈ sum(x_i)

  s = 0.0       # running sum
  c = 0.0       # compensation (tracks lost digits)

  for each x_i:
    y = x_i - c         # compensated addend
    t = s + y           # sum so far (y is small, loses digits)
    c = (t - s) - y     # recover the lost digits
    s = t               # update sum

  return s

========================================================================

Why it works: The compensation variable c tracks the rounding error at each step. It accumulates the "lost" low-order bits and feeds them back into the next iteration. The net effect is that the error behaves as $O(\varepsilon_{\text{mach}})$ regardless of $n$ - equivalent to computing in twice the precision.
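A direct Python translation of the pseudocode above, kept in fp32 so the compensation actually matters (a sketch for illustration; Exercise 3 asks you to build and test this yourself):

```python
import numpy as np

def kahan_sum(values):
    s = np.float32(0.0)
    c = np.float32(0.0)            # compensation: the low-order bits lost so far
    for x in values:
        y = np.float32(x) - c
        t = s + y
        c = (t - s) - y            # algebraically zero; in floats it recovers the lost bits
        s = t
    return s

vals = np.full(1_000_000, 0.1, dtype=np.float32)

naive = np.float32(0.0)
for v in vals:                     # naive sequential accumulation for comparison
    naive = np.float32(naive + v)

print(naive, kahan_sum(vals), 1_000_000 * 0.1)
```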

In PyTorch: torch.sum() uses pairwise summation (tree reduction), which achieves $O(\log n \cdot \varepsilon_{\text{mach}})$ error. For high-precision accumulation (e.g., loss over a large batch), consider accumulating in fp64 then converting back.

3.5 Interval Arithmetic and Verified Computation

Interval arithmetic replaces each number $x$ with an interval $[\underline{x}, \overline{x}]$ guaranteed to contain the true value. All operations are extended to operate on intervals, widening the bounds to account for rounding errors.

Example: $[1.0, 1.0] + [0.3, 0.3_{\text{fp32}}^{+}] = [1.3, 1.3_{\text{fp32}}^{+}]$, where the superscript $+$ denotes rounding up.

Practical use: Interval arithmetic gives verified bounds on computation results - you know with certainty that the true answer lies in the computed interval. Used in: formal verification of safety-critical software, reliable computing, and bound propagation in neural network verification (e.g., $\alpha$-$\beta$-CROWN for adversarial robustness).

Limitation: Interval widths grow rapidly in long computations (dependency problem), making intervals pessimistic for iterative algorithms. More sophisticated tools (affine arithmetic, Taylor models) improve the dependency problem.


4. Condition Numbers and Stability

4.1 Condition Number of a Scalar Problem

The condition number of a computational problem measures the sensitivity of the output to perturbations in the input - it quantifies how much the problem amplifies input errors, independent of any particular algorithm.

Definition (absolute condition number): For a function $f: \mathbb{R} \to \mathbb{R}$:

$$\kappa_{\text{abs}}(x) = \lim_{\delta \to 0} \sup_{|\Delta x| \le \delta} \frac{|f(x + \Delta x) - f(x)|}{|\Delta x|} = |f'(x)|$$

Definition (relative condition number):

$$\kappa_{\text{rel}}(x) = \left|\frac{x \, f'(x)}{f(x)}\right|$$

This measures relative output change per unit relative input change - the more natural quantity.

Examples:

| Function | $\kappa_{\text{rel}}(x)$ | Well-conditioned? |
|---|---|---|
| $f(x) = \sqrt{x}$ | $1/2$ | Yes - always |
| $f(x) = \ln x$ | $\lvert 1/\ln x \rvert$ | No near $x = 1$, where $\ln x \to 0$ |
| $f(x) = e^x$ | $\lvert x \rvert$ | No for large $\lvert x \rvert$ |
| $f(x) = \sin x$ | $\lvert x \cos x / \sin x \rvert$ | No near $x = k\pi$, where $\sin x \to 0$ |
| $a - b$ (subtraction) | $\max(\lvert a \rvert, \lvert b \rvert)/\lvert a - b \rvert$ | No when $a \approx b$ (cancellation) |

Key insight: Condition number $\kappa_{\text{rel}} \gg 1$ means the problem is inherently ill-conditioned - even the best algorithm operating in exact arithmetic cannot give accurate results given inaccurate inputs. No numerical technique can fix a fundamentally ill-conditioned problem; it requires reformulation.

4.2 Forward vs Backward Error Analysis

Given a computed result $\hat{y}$ for a true answer $y = f(x)$:

Forward error: $|y - \hat{y}|$ (or relative: $|y - \hat{y}|/|y|$) - how far the computed answer is from the truth.

Backward error: The smallest $|\Delta x|$ such that $f(x + \Delta x) = \hat{y}$ - what perturbation to the input makes our computed answer exact. Symbolically: $\hat{y} = f(\hat{x})$ for some $\hat{x}$; the backward error is $|\hat{x} - x|/|x|$.

Connection: Forward error $\lesssim$ condition number $\times$ backward error:

$$\frac{|\hat{y} - y|}{|y|} \lesssim \kappa_{\text{rel}}(x) \cdot \frac{|\hat{x} - x|}{|x|}$$

If an algorithm has backward error $\sim \varepsilon_{\text{mach}}$, then its forward error is at most $\sim \kappa_{\text{rel}}\, \varepsilon_{\text{mach}}$ - the best possible for that problem.

4.3 Forward and Backward Stability of Algorithms

Definition (backward stable): An algorithm for computing $f(x)$ is backward stable if the computed result $\hat{f}(x)$ satisfies:

$$\hat{f}(x) = f(x + \Delta x) \text{ for some } \|\Delta x\| = O(\varepsilon_{\text{mach}}) \|x\|$$

A backward-stable algorithm computes the exact answer for a slightly perturbed problem. Combined with a well-conditioned problem ($\kappa \sim 1$), this guarantees the forward error is $O(\varepsilon_{\text{mach}})$.

Definition (forward stable): An algorithm is forward stable if the computed result satisfies:

$$\|\hat{f}(x) - f(x)\| = O(\varepsilon_{\text{mach}}) \|f(x)\|$$

Hierarchy:

backward stable  =>   forward stable   (when κ is moderate)
forward stable  =/=>  backward stable  (in general)

Examples:

  • Backward stable: Householder QR, Gaussian elimination with partial pivoting
  • Forward stable but not backward stable: Gram-Schmidt orthogonalization (modified version)
  • Unstable: Naive Gaussian elimination without pivoting (exponential error growth possible)

4.4 Condition Number of a Matrix

For a linear system $A\mathbf{x} = \mathbf{b}$, the matrix condition number is:

$$\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$$

where $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values.

Geometric interpretation: $\kappa(A)$ measures how much $A$ can stretch unit vectors relative to how much it shrinks them. A unit sphere gets mapped to an ellipsoid with semi-axes $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_n$; the condition number is the ratio $\sigma_1/\sigma_n$.

Error amplification: If $A\hat{\mathbf{x}} = \hat{\mathbf{b}}$ and $\|\hat{\mathbf{b}} - \mathbf{b}\|/\|\mathbf{b}\| = \varepsilon$ (small right-hand-side perturbation):

$$\frac{\|\hat{\mathbf{x}} - \mathbf{x}\|}{\|\mathbf{x}\|} \le \kappa(A) \cdot \varepsilon$$

The condition number is the worst-case amplification factor from $\mathbf{b}$-error to $\mathbf{x}$-error.
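A quick illustration with the notoriously ill-conditioned Hilbert matrix (scipy.linalg.hilbert is used only to build it; exact numbers vary, but the accuracy loss tracks $\kappa(A)\,\varepsilon_{\text{mach}}$ rather than $\varepsilon_{\text{mach}}$):

```python
import numpy as np
from scipy.linalg import hilbert

A = hilbert(8)                        # 8x8 Hilbert matrix, cond ~ 1e10
print(np.linalg.cond(A))

x_true = np.ones(8)
b = A @ x_true
x_hat = np.linalg.solve(A, b)         # backward-stable fp64 solve

print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
# roughly cond(A) * eps, far from the ~1e-16 a well-conditioned system would give
```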

Values to know:

  • $\kappa(I) = 1$ - perfectly conditioned (identity)
  • $\kappa(A) \approx 10^k$ - you lose about $k$ decimal digits of accuracy
  • $\kappa(A) > 1/\varepsilon_{\text{mach}}$ - $A$ is numerically singular; solutions are meaningless

For AI: The Hessian matrix of a neural network loss has condition number $\kappa(H) = \lambda_{\max}/\lambda_{\min}$. When $\kappa(H)$ is large ($10^5$-$10^{10}$ in practice), gradient descent converges slowly and is sensitive to the learning rate. Adam's parameter-wise scaling approximates diagonal preconditioning, reducing the effective condition number. Richer preconditioners (e.g., the Shampoo optimizer) capture more of the curvature structure than a diagonal approximation.


5. Numerical Formats for AI

5.1 fp32, fp16, bf16 Compared

FLOATING-POINT FORMAT COMPARISON
========================================================================

  Property           fp32           fp16           bf16
  ---------------------------------------------------------------------
  Bits               32             16             16
  Exponent bits       8              5              8
  Mantissa bits      23             10              7
  Max value       3.4e+38       6.5e+04        3.4e+38
  Min normal      1.2e-38       6.1e-05        1.2e-38
  Machine eps     1.2e-07       9.8e-04        7.8e-03
  ---------------------------------------------------------------------
  Training use    Reference     Legacy (APEX)  Standard (2023+)
  Gradient risk   None          Underflow      None
  Memory (1B params) 4 GB         2 GB           2 GB
  Throughput         Baseline     ~2x fp32       ~2x fp32

========================================================================

bf16 vs fp16: Both are 16-bit. bf16 keeps the 8-bit exponent field of fp32, giving the same dynamic range. fp16 uses only 5 exponent bits - max value $\approx 65504$ - which causes overflow for activations in large models. bf16 carries 3 fewer mantissa bits than fp16 (roughly one fewer decimal digit of precision), but training remains stable because values don't overflow or underflow.

Why bf16 won for LLMs: GPT-2 trained in fp16 with careful loss scaling. GPT-3, Llama, Mistral, Gemini all train in bf16 - no loss scaling needed, simpler codebase, same memory.

5.2 fp8 and Extreme Quantization

fp8 uses only 8 bits. NVIDIA's H100 supports two fp8 formats:

  • E4M3 (4 exponent, 3 mantissa): higher precision, smaller range (max 448)
  • E5M2 (5 exponent, 2 mantissa): lower precision, larger range (max 57344)

fp8 training workflow: Forward pass in fp8 E4M3; backward pass in fp8 E5M2; weight updates in fp32 or bf16. The Transformer Engine (NVIDIA) automates format selection and scaling.

Post-training quantization: Most LLM inference today uses int8 or int4 weight quantization (not fp8): weights stored as low-bit integers, dequantized to fp16/bf16 before matmul. Tools: GPTQ, AWQ, bitsandbytes. The quantization error is $O(2^{-b})$ where $b$ is the bit-width.
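A sketch of the simplest such scheme, symmetric per-tensor int8 quantization (real tools like GPTQ and AWQ are considerably more sophisticated; the point here is the size of the rounding error):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)   # stand-in for a weight tensor

scale = np.abs(w).max() / 127.0                    # map [-max|w|, max|w|] onto [-127, 127]
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale               # dequantize before the matmul

print(np.abs(w - w_hat).max())    # worst-case error ~ scale / 2
print(np.abs(w - w_hat).mean())   # average error, roughly scale / 4
```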

5.3 Mixed-Precision Training

Mixed-precision training (Micikevicius et al., 2018) is the standard approach for training large models:

MIXED-PRECISION TRAINING WORKFLOW
========================================================================

  Master weights:  fp32  ---------------------------------------------+
                          | cast down                      ^ update    |
  Forward pass:   fp16/bf16 (activations, weights)                    |
  Loss:           fp32 accumulation                                    |
  Backward pass:  fp16/bf16 gradients                                 |
  Gradient:       fp16/bf16 -> fp32 (before weight update)  ----------+

  With loss scaling (fp16 only):
  loss_scaled = loss x scale_factor   (e.g., 2^15 = 32768)
  Backward through loss_scaled
  gradients_fp32 /= scale_factor before optimizer step

========================================================================

Key elements:

  1. fp32 master weights: Full-precision copy of weights for optimizer states (momentum, variance) and weight updates
  2. fp16/bf16 forward/backward: Halves memory for activations; 2x throughput on tensor cores
  3. Loss scaling (fp16 only): Multiplies loss by a large factor before backward pass to prevent gradient underflow; divides gradients before applying them
  4. fp32 accumulation: Matrix multiplications accumulate partial sums in fp32 on modern hardware, even when inputs are in fp16

5.4 Stochastic Rounding

Standard rounding (round-to-nearest) is deterministic: a given value always rounds the same way. In low precision this means an update smaller than half a ULP is always discarded, so many tiny updates pointing in the same direction can be systematically lost, shifting the computed trajectory.

Stochastic rounding: Round $x$ down to $\lfloor x \rfloor_{\text{fp}}$ with probability $(\lceil x \rceil_{\text{fp}} - x) / \text{ULP}(x)$, otherwise up to $\lceil x \rceil_{\text{fp}}$. This introduces random noise but ensures $\mathbb{E}[\text{fl}(x)] = x$ - the rounding is unbiased.

Why this matters for low-precision training: In fp8/fp4 training, rounding errors are large enough to systematically bias weight updates. Stochastic rounding prevents this systematic bias, allowing effective training at very low precision. Used in Graphcore IPUs and experimental fp4 training.
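A sketch of the idea on a coarse artificial grid (real hardware applies this per rounding step at fp8/fp4 precision; the grid spacing ulp and the helper below are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x, ulp):
    lo = np.floor(x / ulp) * ulp
    p_up = (x - lo) / ulp                        # probability of rounding up
    return lo + ulp * (rng.random(np.shape(x)) < p_up)

x = np.full(100_000, 0.3)
print(stochastic_round(x, ulp=0.25).mean())      # ~0.3: unbiased in expectation
print(np.round(0.3 / 0.25) * 0.25)               # 0.25: round-to-nearest always loses the same 0.05
```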


6. Applications in ML

6.1 Gradient Vanishing/Exploding Through a Floating-Point Lens

Vanishing gradients in fp16: If a gradient $g$ satisfies $|g| < 6.1 \times 10^{-5}$ (fp16 minimum normal), it is represented as 0 (or a subnormal with reduced precision). The parameter receives no gradient - vanishing is literal in floating point.

Loss scaling prevents this: multiplying the loss by $s$ before backprop multiplies all gradients by $s$. A gradient of $10^{-6}$ with scale $s = 2^{15} = 32768$ becomes $\approx 0.03$ - comfortably within fp16 range.

Gradient clipping addresses exploding gradients: if $\|\mathbf{g}\|_2 > G_{\max}$, scale $\mathbf{g} \leftarrow G_{\max}\, \mathbf{g}/\|\mathbf{g}\|_2$. This prevents overflow and stabilizes training. Used by default in transformer training (clip norm $= 1.0$ is standard).

6.2 Numerically Stable Softmax and Log-Sum-Exp

Naive softmax: $\text{softmax}(\mathbf{x})_i = e^{x_i} / \sum_j e^{x_j}$

Problem: For large $x_i$ (e.g., $x_i = 100$), $e^{100} \approx 2.7 \times 10^{43}$ - overflow in fp32. For very negative $x_i$, $e^{x_i} = 0$ - underflow, losing the contribution entirely.

Stable softmax: Subtract the maximum before exponentiation:

$$\text{softmax}(\mathbf{x})_i = \frac{e^{x_i - \max_j x_j}}{\sum_j e^{x_j - \max_j x_j}}$$

Because softmax is invariant to adding a constant to every input (Appendix E.2), this gives identical output. Now the largest exponent is $e^0 = 1$ - no overflow. All other terms are in $(0, 1]$ - no harmful underflow for reasonable inputs.

Log-sum-exp: $\text{logsumexp}(\mathbf{x}) = \log(\sum_j e^{x_j})$. Stable form:

$$\text{logsumexp}(\mathbf{x}) = m + \log\sum_j e^{x_j - m}, \quad m = \max_j x_j$$

Used in: numerically stable cross-entropy, log-probabilities, normalizing constants.
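A compact NumPy sketch of both stable forms next to the naive one (the overflow warnings from the naive version are expected):

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    e = np.exp(x - x.max())          # largest exponent becomes exp(0) = 1
    return e / e.sum()

def logsumexp_stable(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

x = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)
print(softmax_naive(x))              # [nan nan nan]: exp overflowed to inf
print(softmax_stable(x))             # ~[0.090, 0.245, 0.665]
print(logsumexp_stable(x))           # ~1002.41
```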

6.3 Stable Cross-Entropy Loss

Naive implementation:

import numpy as np

# UNSTABLE: loss = -sum(y * log(softmax(logits)))
probs = np.exp(logits) / np.sum(np.exp(logits))  # overflow risk for large logits
loss = -np.sum(y * np.log(probs))                # log(0) risk when probs underflow

Stable implementation (PyTorch's F.cross_entropy):

from scipy.special import logsumexp

# Combines log-softmax and NLL in one numerically stable pass:
# loss = -logits[true_class] + logsumexp(logits)
log_probs = logits - logsumexp(logits)  # stable log-softmax
loss = -log_probs[true_class]           # negative log-likelihood (NLL)

This avoids both overflow in exp and log(0) in the entropy term.

6.4 Loss Scaling for fp16 Training

Dynamic loss scaling (PyTorch GradScaler):

  1. Start with scale $s = 2^{15}$
  2. Compute scaled_loss = loss * s
  3. Call scaled_loss.backward() - gradients are scaled by $s$
  4. If any gradient is inf or nan: reduce $s \leftarrow s/2$; skip the optimizer step
  5. Otherwise: unscale gradients by $1/s$; take the optimizer step; every $N$ steps try $s \leftarrow 2s$

This adaptive scheme maintains the gradient scale in fp16 range, maximizing utilization of fp16 precision while detecting and recovering from overflow.


7. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Testing equality: a + b == c for floats | Rounding makes exact equality almost never true | Use abs(a+b-c) < tol * max(abs(a+b), abs(c), 1) |
| 2 | Assuming 0.1 + 0.2 == 0.3 | Neither 0.1, 0.2, nor 0.3 is exactly representable | Always use relative tolerance for float comparisons |
| 3 | Using fp16 for loss accumulation | fp16 has only 10 bits of mantissa; a running sum loses precision fast | Accumulate loss in fp32, even when training in fp16 |
| 4 | Ignoring NaN propagation | One NaN in a gradient turns ALL gradients to NaN | Add a torch.isnan(loss).any() check before backward; trace to the source |
| 5 | Conflating machine epsilon with the smallest float | $\varepsilon_{\text{mach}} = 2^{-23}$ for fp32; the smallest normal is $2^{-126}$; completely different | Know both values; use torch.finfo to look them up |
| 6 | Computing log(softmax(x)) in two steps | softmax(x) first rounds, then log can see zeros | Use log_softmax(x), a single numerically stable operation |
| 7 | Not checking the condition number before solving $Ax = b$ | High $\kappa(A)$ means the solution is meaningless in floating point | Compute np.linalg.cond(A) first; if $> 10^{12}$ in fp64, reformulate |
| 8 | Using fp16 without loss scaling | Gradients of magnitude $< 6 \times 10^{-5}$ flush to zero | Use torch.amp.GradScaler() or switch to bf16 |
| 9 | Naive large-number subtraction | $10^{10} - (10^{10} - 1)$ loses all digits in fp32 | Rearrange algebraically to avoid large-magnitude cancellation |
| 10 | Computing $e^x$ for large $x$ before dividing | Overflow before normalization | Use the log-sum-exp trick; never compute raw $e^x$ without max-subtraction |
| 11 | Assuming GPU arithmetic is associative | $(a+b)+c \ne a+(b+c)$ in floating point; GPU thread ordering varies | Don't rely on associativity; use torch.use_deterministic_algorithms(True) for reproducibility |
| 12 | Ignoring subnormal flushing on GPU | CUDA default: FTZ (flush-to-zero) mode for subnormals | Be aware of subnormal behavior; check whether FTZ affects your gradient magnitudes |

8. Exercises

Exercise 1 * - Machine Epsilon and Floating-Point Formats

(a) Without using np.finfo, write code to experimentally determine machine epsilon for fp32 by finding the smallest $\varepsilon$ such that $\text{fl}(1 + \varepsilon) > 1$ (start from 1.0 and halve repeatedly).

(b) Compare to np.finfo(np.float32).eps. Also compute for fp64 and fp16. What is the ratio between fp32 and fp64 epsilon?

(c) Compute the maximum representable value for fp32 without using np.finfo.max. Start from 1.0 and double until overflow.

(d) Explain: why does np.float16(65504) + np.float16(1) == np.float16(65504) return True?


Exercise 2 * - Catastrophic Cancellation

(a) Compute $f(x) = x - \sin(x)$ for $x = 10^{-7}$ in fp32 and fp64. Compare to the Taylor series approximation $x - (x - x^3/6 + x^5/120) = x^3/6 - x^5/120 \approx x^3/6$.

(b) Compute $\sqrt{x^2 + 1} - 1$ for $x = 10^{-4}$ in fp32 and fp64. Derive an algebraically equivalent form without cancellation and verify numerically.

(c) For the quadratic formula $x = (-b \pm \sqrt{b^2 - 4ac}) / (2a)$ with $a = 1$, $b = -10^8$, $c = 1$: compute both roots in fp32. One root is catastrophically inaccurate. Identify which, and implement the numerically stable form using the Vieta relation $x_1 x_2 = c/a$.


Exercise 3 * - Kahan Compensated Summation

(a) Implement naive summation and Kahan summation in pure NumPy (no special functions).

(b) Sum 10 million copies of $1/3$ using both methods in fp32. Compare to the exact answer $10^7/3 \approx 3{,}333{,}333.33$. What is the absolute error for each method?

(c) Repeat for a sequence of alternating $+1$ and $-1$ of length $10^6$. The true sum is 0. What do both methods give?

(d) Verify that Kahan summation gives error $O(\varepsilon_{\text{mach}})$ independent of $n$, while naive summation gives error $O(n \varepsilon_{\text{mach}})$.


Exercise 4 ** - Condition Numbers

(a) Compute the relative condition number of $f(x) = e^x$ at $x = -30$, $x = 0$, $x = 30$. At which value is $f$ most sensitive to input perturbation?

(b) Construct a $4 \times 4$ Hilbert matrix $H_{ij} = 1/(i+j-1)$ (notoriously ill-conditioned). Compute $\kappa_2(H)$ using np.linalg.cond. Solve $H\mathbf{x} = \mathbf{b}$ for $\mathbf{b} = H\mathbf{1}$ (true solution $\mathbf{x} = \mathbf{1}$). What is the relative error in the computed solution?

(c) For the $4 \times 4$ Hilbert matrix, how many decimal digits of accuracy do you expect in the solution? Verify by computing $\lfloor \log_{10}(\varepsilon_{\text{mach}} \cdot \kappa(H)) \rfloor$ and comparing to the actual error.


Exercise 5 ** - Numerically Stable Softmax and Log-Sum-Exp

(a) Implement naive softmax and stable softmax. Test both on $\mathbf{x} = (1000, 1001, 1002)$ in fp32. Which produces NaN? What does the stable version give?

(b) Implement the log-sum-exp function: naive (log(sum(exp(x)))) and stable. Test on $\mathbf{x} = (-1000, 0, 1000)$.

(c) Implement numerically stable cross-entropy loss $\ell(\mathbf{z}, y) = -z_y + \text{logsumexp}(\mathbf{z})$ where $y$ is the true class. Test on logits $(0.1, 0.2, 0.3, 100.0)$ with true class $= 2$. Compare to naively computing log(softmax(z)[y]).

(d) Profile the two implementations for $10^5$ calls. Is there a speed difference?


Exercise 6 ** - Mixed-Precision Comparison

(a) Compute the inner product $\mathbf{x}^\top \mathbf{y}$ for $n = 1024$-dimensional random unit vectors in fp16, bf16, fp32, and fp64. Use the fp64 result as ground truth. Report absolute and relative errors.

(b) Simulate gradient underflow: generate a random gradient vector in fp32 with entries drawn from $\mathcal{N}(0, 10^{-6})$. Cast to fp16. What fraction of gradient entries are exactly zero (underflowed)? Repeat with scale factor $s = 2^{15}$ applied before casting.

(c) Implement simple loss scaling: wrap a function f(x) that returns a simulated loss; scale the loss, compute a "gradient" via finite differences in fp16, unscale. Compare gradient quality to fp32 finite differences.


Exercise 7 *** - Floating-Point Conditioning of a Linear System

(a) Generate random $n \times n$ matrices with condition number $\kappa \in \{10, 10^4, 10^8, 10^{12}\}$ by constructing $A = U \Sigma V^\top$ with $U, V$ random orthogonal and $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_n)$ with $\sigma_1/\sigma_n = \kappa$.

(b) For each, solve $A\mathbf{x} = \mathbf{b}$ (true solution $\mathbf{x} = \mathbf{1}$) in fp32 and fp64. Plot relative error vs $\kappa$ on a log-log scale. Verify the predicted slope of 1.

(c) For $\kappa = 10^8$ in fp32: add a tiny perturbation $\|\Delta \mathbf{b}\| / \|\mathbf{b}\| = 10^{-7}$ to $\mathbf{b}$. How much does the solution change? Compare to $\kappa \cdot 10^{-7}$.


Exercise 8 *** - Stable Algorithm Design

(a) The sample variance formula $\sigma^2 = \frac{1}{n} \sum_i (x_i - \bar{x})^2$ can be expanded as $\sigma^2 = \frac{1}{n}\sum_i x_i^2 - \bar{x}^2$ (one-pass formula). Show that the one-pass formula suffers catastrophic cancellation for $\mathbf{x} = (10^8, 10^8 + 1, 10^8 + 2)$ in fp32. Verify using Welford's online algorithm as the stable alternative.

(b) Implement Welford's online algorithm for computing running mean and variance.

(c) Compare naive, one-pass, and Welford for correctness on $\mathbf{x} = 10^8 + \{0, 1, 2, \ldots, 99\}$ in fp32.


9. Why This Matters for AI (2026 Perspective)

| Concept | AI Impact |
|---|---|
| Machine epsilon / format choice | LLM training: bf16 is now the universal default (Llama-3, Gemma, Mistral) - understanding why requires knowing that bf16's exponent range matches fp32 |
| Catastrophic cancellation | Naive attention score computation can cancel digits; FlashAttention's online softmax reordering avoids this |
| Loss scaling | Required for stable fp16 training; not needed with bf16; integral to torch.amp.GradScaler |
| Stable log-sum-exp | Every transformer's softmax in the forward pass uses this; also in all language modeling losses and perplexity calculations |
| Condition numbers | The Hessian condition number $\kappa(H)$ determines the gradient descent convergence rate; Adam/Adafactor implicitly precondition to reduce the effective $\kappa$ |
| fp8 training | NVIDIA H100 Transformer Engine enables fp8 matmuls for LLM training; requires per-tensor scaling factors |
| Gradient underflow | Without scaling, fp16 gradient magnitudes below $6 \times 10^{-5}$ flush to zero - silent learning failure |
| Stochastic rounding | Enables effective fp4/fp8 training by preventing systematic rounding bias; used in Graphcore IPUs and emerging hardware |
| Subnormal flushing | CUDA FTZ mode flushes subnormals to zero for speed; can cause unexpected behavior in extreme low-precision regimes |
| Kahan summation | PyTorch attention kernels accumulate in fp32 even in bf16 forward passes, serving the same purpose as compensated summation |

10. Conceptual Bridge

Backward: What This Builds On

Floating-point arithmetic is the computational manifestation of the real number system. The real numbers ($\mathbb{R}$) are dense, complete, and infinite - studied in Mathematical Foundations Section 01. Floating-point numbers are a finite approximation: the map $\text{fl}: \mathbb{R} \to \mathbb{F}$ introduces the rounding errors analyzed here.

The condition number of a linear system - introduced as $\kappa(A) = \|A\|\,\|A^{-1}\|$ - is defined using matrix norms from Linear Algebra Basics Section 06 and singular values from Advanced Linear Algebra Section 02. The interpretation $\kappa(A) = \sigma_{\max}/\sigma_{\min}$ requires the full SVD theory developed there.

Forward: What This Enables

Numerical Linear Algebra (Section 02, this chapter): Every linear system solver, least-squares method, and eigenvalue algorithm is analyzed through the lens of backward error and condition numbers introduced here. The pivoting strategies in Gaussian elimination and the choice between QR and normal equations for least squares are direct applications of the stability concepts from Section 4.

Numerical Optimization (Section 03, this chapter): Gradient checking via finite differences relies on the finite-difference approximation error analysis - the step size $h$ must balance truncation error $O(h^2)$ against cancellation error $O(\varepsilon_{\text{mach}}/h)$, a trade-off that requires understanding both. Mixed-precision training is the union of Section 5 (format choices) and optimization algorithms.

All of Applied Mathematics: Every numerical computation in this curriculum - ODE solvers, PDE methods, Fourier transforms - is ultimately floating-point arithmetic analyzed by the tools developed here.

POSITION IN CURRICULUM
========================================================================

  Section01-Mathematical-Foundations  (real numbers, binary representation)
       |
       v
  Section02-Linear-Algebra-Basics  (matrix norms, condition numbers prelim.)
  Section03-Advanced-Linear-Algebra  (SVD -> κ(A) = σ_max/σ_min)
       |
       v
  Section10-Numerical-Methods
    +-- Section01-Floating-Point-Arithmetic  === YOU ARE HERE
    |        |  ε_mach, κ, stability
    |        v
    +-- Section02-Numerical-Linear-Algebra  (stable solvers, iterative methods)
    +-- Section03-Numerical-Optimization   (AD, line search, finite diffs)
    +-- Section04-Interpolation            (polynomial, spline, RBF)
    +-- Section05-Numerical-Integration    (quadrature, Monte Carlo)
       |
       v
  Section08-Optimization  (gradient descent, Adam - uses mixed-precision)
  Section13-ML-Specific-Math  (numerical stability of transformers)

========================================================================

The foundations laid in this section - machine epsilon, rounding error models, condition numbers, backward stability - are the vocabulary of all numerical computation. Every section that follows assumes fluency with these concepts.


Appendix A: IEEE 754 Bit-Level Encoding

A.1 Decoding a 32-bit Float by Hand

Every fp32 number $x$ is stored as three fields:

Bit 31  | Bits 30-23  | Bits 22-0
Sign S  | Exponent E  | Mantissa M

The value is:

$$x = (-1)^S \times 2^{E - 127} \times (1.M_1 M_2 \ldots M_{23})_2$$

Example: Decode the fp32 hex 0x3FC00000:

  • Binary: 0 01111111 10000000000000000000000
  • $S = 0$ (positive)
  • $E = 01111111_2 = 127$; exponent $= 127 - 127 = 0$
  • Mantissa $= 1.10000\ldots_2 = 1 + 2^{-1} = 1.5$
  • Value: $(-1)^0 \times 2^0 \times 1.5 = 1.5$ [ok]

Encoding 1.0: $1.0 = 1.000\ldots_2 \times 2^0$; $E = 127 = 01111111_2$; $M = 0$; hex 0x3F800000.

Encoding 0.1: $0.1 = 1.10011001100\ldots_2 \times 2^{-4}$ (repeating binary). Stored exponent $= 127 - 4 = 123$; mantissa rounded to 23 bits: 10011001100110011001101; the stored value is $\approx 0.100000001490116$.

A.2 Special Patterns

| Pattern | Meaning |
|---|---|
| $E = 00000000$, $M = 0$ | $\pm 0$ (zero) |
| $E = 00000000$, $M \ne 0$ | Subnormal: $(-1)^S \times 2^{-126} \times (0.M)_2$ |
| $E = 11111111$, $M = 0$ | $\pm\infty$ |
| $E = 11111111$, $M \ne 0$ | NaN (quiet or signaling) |
| $E = 11111111$, $M_{22} = 1$ | Quiet NaN (qNaN) - most common |
| $E = 11111111$, $M_{22} = 0$, $M \ne 0$ | Signaling NaN (sNaN) - triggers an exception |

A.3 Python Bit Manipulation

import struct
import numpy as np

def fp32_to_bits(x):
    """Reinterpret a float as its 32-bit IEEE 754 pattern."""
    return struct.unpack('I', struct.pack('f', x))[0]

def bits_to_fp32(bits):
    """Reinterpret a 32-bit integer pattern as an fp32 value."""
    return struct.unpack('f', struct.pack('I', bits))[0]

def decode_fp32(x):
    """Split an fp32 value into sign, biased exponent, mantissa, and decoded value."""
    bits = fp32_to_bits(x)
    S = (bits >> 31) & 1
    E = (bits >> 23) & 0xFF
    M = bits & 0x7FFFFF
    if E == 0:                                   # zero or subnormal: no implicit leading 1
        val = (-1)**S * 2**(-126) * M / 2**23
    elif E == 255:                               # all-ones exponent: infinity or NaN
        val = (-1)**S * float('inf') if M == 0 else float('nan')
    else:                                        # normal: implicit leading 1, biased exponent
        val = (-1)**S * 2**(E - 127) * (1 + M / 2**23)
    return {'S': S, 'E': E, 'M': M, 'value': val}
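Example usage (the outputs shown are what the functions above produce for these inputs):

```python
>>> decode_fp32(1.5)
{'S': 0, 'E': 127, 'M': 4194304, 'value': 1.5}
>>> hex(fp32_to_bits(1.5))
'0x3fc00000'
>>> decode_fp32(np.float32(0.1))['value']
0.10000000149011612
```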

Appendix B: Error Analysis Worked Examples

B.1 Quadratic Formula Cancellation

The quadratic $x^2 - 10^8 x + 1 = 0$ has roots:

$$x_{1,2} = \frac{10^8 \pm \sqrt{10^{16} - 4}}{2}$$

In fp32, $\sqrt{10^{16} - 4}$ evaluates to exactly $10^8$ (the $-4$ is far below the resolution at this magnitude). So:

  • $x_1 = (10^8 + 10^8)/2 = 10^8$ - fine
  • $x_2 = (10^8 - 10^8)/2 = 0/2 = 0$ - catastrophically wrong (true: $x_2 \approx 10^{-8}$)

Stable form: Use $x_1 x_2 = c/a = 1$, so $x_2 = 1/x_1 = 10^{-8}$.

B.2 Dot Product Error Bound

For $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, the computed dot product $\hat{s}$ satisfies:

$$|\hat{s} - \mathbf{x}^\top \mathbf{y}| \le \gamma_n\, |\mathbf{x}|^\top |\mathbf{y}|$$

where $\gamma_n = n\varepsilon_{\text{mach}} / (1 - n\varepsilon_{\text{mach}}) \approx n\varepsilon_{\text{mach}}$.

For $n = 512$ in fp16: $\gamma_{512} \approx 512 \times 10^{-3} = 0.512$ - a potential 50% error! This is why transformer attention uses fp32 accumulation inside bf16/fp16 matmuls.

B.3 Gram-Schmidt vs Householder

Modified Gram-Schmidt (MGS) is forward stable but not backward stable: given columns $A = [\mathbf{a}_1, \ldots, \mathbf{a}_n]$, the computed $\hat{Q}$ satisfies $\|\hat{Q}^\top \hat{Q} - I\| = O(\kappa(A)\, \varepsilon_{\text{mach}})$. For ill-conditioned $A$, the computed columns are not orthogonal.

Householder QR is backward stable: $\|A - \hat{Q}\hat{R}\| = O(\varepsilon_{\text{mach}}) \|A\|$ and $\hat{Q}$ is orthogonal to machine precision. Always prefer Householder for numerical QR.


Appendix C: Floating-Point Arithmetic in PyTorch

C.1 Default Dtypes

import torch
torch.get_default_dtype()     # float32 by default
torch.set_default_dtype(torch.float64)  # change for scientific computing

# Check format properties:
torch.finfo(torch.float16)    # eps, min, max, tiny
torch.finfo(torch.bfloat16)
torch.finfo(torch.float32)
torch.finfo(torch.float64)

C.2 Mixed-Precision with AMP

from torch.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    with autocast(device_type='cuda', dtype=torch.float16):
        output = model(batch)
        loss = criterion(output, target)
    # Scales loss, calls backward, updates scaler
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

C.3 Numerically Stable Operations

# Stable log-softmax (use this, not log(softmax(x))):
torch.nn.functional.log_softmax(x, dim=-1)

# Stable cross-entropy (combines log-softmax + NLL):
torch.nn.functional.cross_entropy(logits, targets)

# Stable sigmoid cross-entropy:
torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
# NEVER: binary_cross_entropy(sigmoid(logits), targets)  <- unstable

# Numerically stable softplus:
torch.nn.functional.softplus(x)  # log(1 + exp(x)), stable

C.4 Debugging NaN Issues

# Enable anomaly detection (slow, for debugging only):
torch.autograd.set_detect_anomaly(True)

# Check for NaN:
assert not torch.isnan(loss).any(), f"NaN loss at step {step}"
assert not torch.isinf(loss).any(), f"Inf loss at step {step}"

# Gradient NaN check:
for name, param in model.named_parameters():
    if param.grad is not None:
        if torch.isnan(param.grad).any():
            print(f"NaN gradient in {name}")

Appendix D: Numerical Formats Reference Card

QUICK REFERENCE: FLOATING-POINT FORMATS FOR AI
========================================================================

  Format   Bits  eps_mach   Max         Use case
  ----------------------------------------------------------------------
  fp64      64   2.2e-16   1.8e+308    Scientific computing, reference
  fp32      32   1.2e-07   3.4e+038    Training master weights, debug
  bf16      16   7.8e-03   3.4e+038    LLM training (2023+ standard)
  fp16      16   9.8e-04   6.5e+004    Older mixed-precision training
  fp8 E4M3   8   1.2e-01     448       H100 forward pass
  fp8 E5M2   8   2.5e-01   57344       H100 backward pass
  int8       8   1/128      127        Inference quantization (weights)
  int4       4   1/8          7        GGUF LLM inference
  ----------------------------------------------------------------------

  Rule of thumb:
  - Training large models:  bf16 compute + fp32 master weights
  - Inference large models: int8 or int4 weights, fp16 activations
  - Scientific computing:   fp64 throughout
  - Edge inference:         int4/int8 quantized (ONNX, TFLite, GGUF)

========================================================================



Appendix E: Proofs and Derivations

E.1 Proof: Kahan Summation Error Bound

Theorem: Kahan compensated summation computes $\hat{S}$ for $S = \sum_{i=1}^n x_i$ with error:

$$|\hat{S} - S| \le \big(2 + O(n\varepsilon_{\text{mach}})\big)\, \varepsilon_{\text{mach}} \sum_{i=1}^n |x_i|$$

This is $O(\varepsilon_{\text{mach}})$ rather than the $O(n\varepsilon_{\text{mach}})$ of naive summation.

Proof sketch: At each step, Kahan maintains a compensation $c$ that captures the rounding error from the previous step to full precision. Formally, after step $k$:

$$s_k + c_k = \sum_{i=1}^k x_i + O\Big(\varepsilon_{\text{mach}}^2 \sum_i |x_i|\Big)$$

The accumulated error is $O(\varepsilon_{\text{mach}}^2)$ per step rather than $O(\varepsilon_{\text{mach}})$, so after $n$ steps the total is $O(n\varepsilon_{\text{mach}}^2)$, which is $O(\varepsilon_{\text{mach}})$ as long as $n\varepsilon_{\text{mach}} \lesssim 1$. $\square$

E.2 Derivation: Stable Softmax

Claim: $\text{softmax}(\mathbf{x})_i = \text{softmax}(\mathbf{x} - m\mathbf{1})_i$ for any scalar $m$.

Proof:

$$\frac{e^{x_i - m}}{\sum_j e^{x_j - m}} = \frac{e^{x_i} e^{-m}}{\sum_j e^{x_j} e^{-m}} = \frac{e^{x_i}}{\sum_j e^{x_j}} = \text{softmax}(\mathbf{x})_i \quad \square$$

Setting $m = \max_j x_j$ ensures the largest exponent is $e^0 = 1$, bounding all terms in $(0, 1]$ - within the range of any floating-point format.

E.3 Wilkinson's Backward Error for GE with Partial Pivoting

Gaussian elimination with partial pivoting (GEPP) on $A\mathbf{x} = \mathbf{b}$ computes $\hat{\mathbf{x}}$ satisfying:

$$(A + \Delta A)\hat{\mathbf{x}} = \mathbf{b}$$

where $\|\Delta A\|_\infty \le 8n^3 \rho\, \varepsilon_{\text{mach}} \|A\|_\infty$ and $\rho$ is the growth factor: $\rho = \max_{i,j,k} |a_{ij}^{(k)}| / \|A\|_\infty$ (ratio of the largest element appearing during elimination to the largest initial element).

Key point: With partial pivoting, $\rho \le 2^{n-1}$ in the worst case (Wilkinson, 1961) but is almost always $O(n^{1/2})$ in practice. This is why GEPP is considered backward stable for practical purposes even though its worst-case bound is exponential.

E.4 Condition Number and Relative Error

Theorem: Let $A$ be invertible, $\mathbf{b} \ne \mathbf{0}$. If $A\mathbf{x} = \mathbf{b}$ and $(A + \Delta A)(\mathbf{x} + \Delta\mathbf{x}) = \mathbf{b} + \Delta\mathbf{b}$, then:

$$\frac{\|\Delta\mathbf{x}\|}{\|\mathbf{x}\|} \le \frac{\kappa(A)}{1 - \kappa(A)\|\Delta A\|/\|A\|} \left(\frac{\|\Delta A\|}{\|A\|} + \frac{\|\Delta\mathbf{b}\|}{\|\mathbf{b}\|}\right)$$

When $\|\Delta A\| / \|A\|$ is small (as for rounding errors $\sim \varepsilon_{\text{mach}}$), this simplifies to:

$$\frac{\|\Delta\mathbf{x}\|}{\|\mathbf{x}\|} \lesssim \kappa(A) \left(\frac{\|\Delta A\|}{\|A\|} + \frac{\|\Delta\mathbf{b}\|}{\|\mathbf{b}\|}\right)$$

Proof: Subtract the two equations, use $\Delta\mathbf{x} = -A^{-1}\Delta A(\mathbf{x} + \Delta\mathbf{x}) + A^{-1}\Delta\mathbf{b}$, take norms, and apply the triangle inequality. The factor $1 - \kappa(A)\|\Delta A\|/\|A\|$ comes from a Neumann series argument. $\square$

