Floating-Point Arithmetic: Appendices F (Numerical Experiments Reference) through W (Format Selection Guide for AI Practitioners)
Appendix F: Numerical Experiments Reference
F.1 Key Numbers to Remember
| Quantity | fp16 | bf16 | fp32 | fp64 |
|---|---|---|---|---|
| Max finite value | 65504 | ~3.4 x 10^38 | ~3.4 x 10^38 | ~1.8 x 10^308 |
| Min positive normal | ~6.1 x 10^-5 | ~1.2 x 10^-38 | ~1.2 x 10^-38 | ~2.2 x 10^-308 |
| Decimal digits | ~3 | ~2-3 | ~7 | ~15-16 |
| Bits | 16 | 16 | 32 | 64 |
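These constants can be checked directly from library metadata; a quick sketch using NumPy's and PyTorch's finfo (bf16 is queried through torch because NumPy has no native bfloat16):

import numpy as np
import torch

# fp16 / fp32 / fp64 limits from NumPy
for dt in [np.float16, np.float32, np.float64]:
    fi = np.finfo(dt)
    print(f"{dt.__name__:8s} max={fi.max:.3e}  min_normal={fi.tiny:.3e}  eps={fi.eps:.3e}")

# bf16 limits from PyTorch
fi = torch.finfo(torch.bfloat16)
print(f"bfloat16 max={fi.max:.3e}  min_normal={fi.tiny:.3e}  eps={fi.eps:.3e}")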
F.2 Common Condition Numbers in ML
| Matrix type | Typical κ | Implication |
|---|---|---|
| Random Gaussian | Small (grows roughly linearly with n) | Well-conditioned; easy to solve |
| Gram matrix X^T X for correlated features | Large - κ(X)^2 | May need regularization |
| Hessian of neural loss at sharp minimum | Large along the sharp directions | Slow gradient descent; use Adam |
| Hessian at flat region (saddle) | Near-zero curvature on some axes | Ill-conditioned; Newton fails |
| Hilbert matrix (n = 10) | ~10^13 | Numerically singular in fp32 |
| Discrete Laplacian (PDE, grid spacing h) | O(1/h^2) | Requires preconditioning |
F.3 Loss Scaling Reference Values
| Scale | When to use | Notes |
|---|---|---|
| 2^16 = 65536 | Default initial scale for fp16 | PyTorch GradScaler default |
| Smaller powers of two (e.g., 2^8) | After repeated overflow reductions | Still keeps most gradients in fp16 range |
| 1 (no scaling) | bf16 training | Not needed; bf16 has the same exponent range as fp32 |
| Dynamic | Best practice | GradScaler adjusts automatically |
Appendix G: Glossary
| Term | Definition |
|---|---|
| ULP | Unit in the Last Place: spacing between consecutive floats near a given value |
| Machine epsilon | Spacing between 1 and the next representable float; equals 2^(1-p) for a p-bit significand |
| Catastrophic cancellation | Loss of significant digits when subtracting nearly-equal numbers |
| Backward error | Size of input perturbation that makes computed output exact |
| Forward error | Distance from computed output to true output |
| Backward stable | Algorithm whose backward error is O(ε_mach) |
| Condition number | Worst-case ratio of relative output change to relative input change |
| Overflow | Result exceeds maximum representable value; becomes ±∞ |
| Underflow | Result is too small to represent normally; becomes subnormal or 0 |
| NaN | Not a Number: result of undefined operations (0/0, ∞ − ∞, sqrt(-1)) |
| Subnormal | Floating-point number smaller than the minimum normal; loses precision |
| Rounding mode | Rule for mapping exact values to the nearest representable float |
| Stochastic rounding | Random rounding, unbiased in expectation; used in low-precision training |
| Loss scaling | Multiplying the loss by a large factor before backprop to prevent gradient underflow |
| Mixed precision | Using different formats for different parts of training (e.g., fp16 compute + fp32 weights) |
| bf16 | Brain Float 16: 8-bit exponent (same range as fp32) + 7-bit mantissa |
| fp8 | 8-bit floating point; two variants: E4M3 (more precision) and E5M2 (more range) |
| Kahan summation | Compensated summation algorithm achieving error O(ε) independent of n |
| Pairwise summation | Binary-tree summation achieving error O(ε log n) |
| FTZ | Flush to Zero: hardware mode that maps subnormals to 0 for speed |
| GradScaler | PyTorch class implementing dynamic loss scaling for fp16 training |
Appendix H: Connections to Other Sections
H.1 Floating-Point and Optimization
The choice of floating-point format directly impacts optimizer behavior:
Adam in fp16: Adam maintains a first moment m_t and a second moment v_t (moving averages of gradients and squared gradients). These accumulators change slowly over training. In fp16, small updates can underflow to zero. This is why Adam's optimizer states are almost always kept in fp32, even when gradients are computed in fp16.
Gradient clipping and numerical overflow: When gradients overflow to ±∞ in fp16, the global norm also becomes infinity. Clipping then scales each gradient by max_norm / total_norm, which is 0 when the norm is ∞ (and NaN when the norm itself is NaN). PyTorch's clip_grad_norm_ can guard against this by checking for non-finite norms (error_if_nonfinite) before clipping.
Numerical second-order methods: Computing derivatives via finite differences requires choosing a step size h, balancing truncation error O(h) against cancellation error O(ε/h). For fp32, the optimal step is h ≈ √ε ≈ 3 x 10^-4.
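A minimal sketch of the truncation-vs-cancellation trade-off for a forward difference (the test function sin and the step values are illustrative choices, not from the original text):

import numpy as np

def fd_derivative(f, x, h, dtype=np.float32):
    """Forward-difference derivative evaluated entirely in the given dtype."""
    x, h = dtype(x), dtype(h)
    return (f(x + h) - f(x)) / h

x0 = 1.0
true_grad = np.cos(x0)  # exact derivative of sin at x0
for h in [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]:
    approx = fd_derivative(np.sin, x0, h)
    print(f"h={h:.0e}  error={abs(float(approx) - true_grad):.2e}")
# The error is smallest near h ~ sqrt(eps_fp32) ~ 3e-4, then grows again
# as cancellation in f(x+h) - f(x) dominates.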
H.2 Floating-Point and Neural Network Architecture
Layer normalization: LayerNorm computes y = γ · (x − μ) / √(σ² + ε) + β, where ε (often 10^-5 or 10^-6) prevents division by zero when σ² ≈ 0. This is a numerical safeguard, not a mathematical constant.
Softmax temperature scaling: Attention logits are divided by √d_k. Without this divisor, dot products of d_k unit-variance components grow as √d_k, softmax becomes peaky (near one-hot), and gradients vanish. The 1/√d_k scaling keeps logits in a numerically well-conditioned range.
Residual connections: y = x + F(x) prevents the vanishing-gradient problem: ∂y/∂x = I + ∂F(x)/∂x. The identity term ensures at least one gradient path whose multiplier is exactly 1 - no floating-point underflow across layers.
H.3 Floating-Point and Transformer Efficiency
FlashAttention: The standard attention computation requires materializing the full n x n attention matrix. FlashAttention (Dao et al., 2022) computes it in chunks, maintaining online softmax statistics in fp32 while storing intermediate results in fp16/bf16. This avoids both overflow and numerical drift across the n tokens.
KV cache quantization: In inference, key and value matrices are cached across decoding steps. Quantizing this cache from fp16 to int8 (or int4) reduces memory by 2-4x. The quantization error adds noise to the attention scores - the rounding error model tells you this noise is bounded by half a quantization step per element, tolerable for typical sequence lengths.
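A rough sketch of per-tensor symmetric int8 quantization and its round-trip error; the scaling scheme and tensor shapes are simplified assumptions, not the exact scheme any particular KV-cache quantizer uses:

import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: x is approximated by scale * q."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
k = rng.standard_normal((64, 128)).astype(np.float32)  # stand-in key vectors
q, scale = quantize_int8(k)
err = np.abs(dequantize(q, scale) - k)
print(f"max abs error = {err.max():.4f}, mean abs error = {err.mean():.4f}")
# Per-element error is bounded by scale/2 (half a quantization step).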
Appendix I: Further Reading
Foundational Papers
- Goldberg, D. (1991). What Every Computer Scientist Should Know About Floating-Point Arithmetic. ACM Computing Surveys, 23(1), 5-48. - The definitive reference; free online.
- Higham, N.J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.). SIAM. - Chapters 2-3: rounding errors; Chapter 9: Gaussian elimination; the definitive modern reference.
- Wilkinson, J.H. (1963). Rounding Errors in Algebraic Processes. Prentice-Hall. - Original backward error analysis.
- Kahan, W. (1965). Pracniques: Further Remarks on Reducing Truncation Errors. Communications of the ACM, 8(1), 40.
ML-Specific Papers
- Micikevicius, P. et al. (2018). Mixed Precision Training. ICLR 2018. - The paper that established fp16 mixed-precision training as standard.
- Kalamkar, D. et al. (2019). A Study of BFLOAT16 for Deep Learning Training. arXiv:1905.12322. - Why bf16 works better than fp16 for training.
- Noune, B. et al. (2022). 8-bit Numerical Formats for Deep Neural Networks. arXiv:2206.02915. - fp8 training foundations.
- Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022. - Numerically stable attention at scale.
Textbooks
- Trefethen, L.N. & Bau, D. (1997). Numerical Linear Algebra. SIAM. - Excellent treatment of backward error; Lectures 12-15 on floating-point and stability.
- Press, W.H. et al. (2007). Numerical Recipes: The Art of Scientific Computing (3rd ed.). Cambridge University Press. - Practical algorithms with implementation details.
Appendix J: Extended Examples and Case Studies
J.1 Case Study: NaN Debugging in a Transformer
A common production scenario: you launch training, the loss drops for 100 steps, then suddenly loss = nan. Here is a systematic debugging procedure grounded in floating-point theory.
Step 1: Identify when NaN first appears.
for step, batch in enumerate(dataloader):
with torch.autocast('cuda', torch.bfloat16):
logits = model(batch['input_ids'])
loss = criterion(logits, batch['labels'])
if torch.isnan(loss):
print(f"NaN loss at step {step}")
break
Step 2: Check for NaN in individual components. Common sources in order of frequency:
- log(0): occurs when the model outputs exactly 0 probability for the target class
- 0/0: LayerNorm with σ² = 0 (all-constant hidden states)
- Overflow followed by ∞ − ∞ or ∞/∞: attention logits too large (missing 1/√d_k scaling)
- Exploding gradients: accumulated for many steps without clipping
Step 3: Numerical fixes.
- Replace log(softmax(x)[y]) with F.cross_entropy(x, y) (stable)
- Add eps=1e-5 to LayerNorm (the PyTorch default)
- Add gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
- Switch from fp16 to bf16 (essentially eliminates overflow, since bf16 has the same range as fp32)
J.2 Case Study: Ill-Conditioned Normal Equations
Linear regression via normal equations: solve (X^T X) β = X^T y.
The condition number of X^T X is κ(X)^2. If X has condition number κ(X) ≈ 10^3-10^4 (common for correlated features), then κ(X^T X) ≈ 10^6-10^8 - near or beyond the threshold of fp32 accuracy (1/ε ≈ 10^7).
Safer alternative: Use the QR decomposition X = QR; then solve Rβ = Q^T y. The relevant condition number equals κ(X), not κ(X)^2 - half the digits lost.
Safest: Use scipy.linalg.lstsq, which internally uses an SVD-based pseudo-inverse with automatic rank determination. The effective condition number is σ_max/σ_min over the retained singular values, with singular values below rcond * sigma_max set to zero.
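A small sketch comparing the two routes on a synthetic problem with a prescribed condition number of about 10^4 (np.linalg.lstsq stands in for scipy.linalg.lstsq; exact numbers will vary):

import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
# Build X with condition number ~1e4 by prescribing its singular values
U, _ = np.linalg.qr(rng.standard_normal((n, p)))
V, _ = np.linalg.qr(rng.standard_normal((p, p)))
s = np.logspace(0, -4, p)                       # singular values 1 ... 1e-4
X = ((U * s) @ V.T).astype(np.float32)
beta_true = np.ones(p, dtype=np.float32)
y = X @ beta_true

# Normal equations in fp32: effective condition number is kappa(X)^2 ~ 1e8
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares: works with kappa(X) ~ 1e4
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print("kappa(X)        =", np.linalg.cond(X.astype(np.float64)))
print("normal-eq error =", np.linalg.norm(beta_ne - beta_true))
print("lstsq error     =", np.linalg.norm(beta_ls - beta_true))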
J.3 Case Study: fp16 vs bf16 in Attention
In a transformer with d_model = 4096 and d_k = 128 (32 heads):
Attention logits (before softmax): q·k / √d_k. Without the √d_k factor, the dot product of 128 unit-variance components has standard deviation √128 ≈ 11 in expectation - fine. With random initialization, however, dot products can reach magnitudes of order 100 before training stabilizes.
fp16 concern: exponentiating logits that large overflows fp16 and fp32 - severe overflow. But with the 1/√d_k factor the logits stay of order 1-10 - safely within the fp16 max of 65504, and max-subtraction inside softmax keeps every exponent non-positive. The gradient of the attention weights passes through softmax's Jacobian, whose entries lie in [-1/4, 1/4] - fine for fp16.
bf16: Identical analysis; bf16's wider exponent range means even moderate overflow during early training is handled gracefully, whereas fp16 needs careful initialization.
J.4 Numerical Stability of Common Activation Functions
| Activation | Naive formula | Stable implementation | Issue |
|---|---|---|---|
| Sigmoid | 1/(1 + e^(-x)) | 1/(1 + e^(-x)) for x ≥ 0; e^x/(1 + e^x) for x < 0 | Overflow of e^(-x) for large negative x |
| Softplus | log(1 + e^x) | x + log(1 + e^(-x)) for x > 0; log1p(e^x) otherwise | Overflow of e^x for large x |
| GELU | x · Φ(x) (error function) | Use erf directly, precomputed tables, or the tanh polynomial approximation | erf computation accuracy |
| Swish | x · σ(x) | Standard; no overflow for normal x | None in typical range |
| log-softmax | log(e^(x_i) / Σ_j e^(x_j)) | x_i − max(x) − log Σ_j e^(x_j − max(x)) | Overflow in naive softmax |
J.5 Gradient Checkpointing and Numerical Precision
Gradient checkpointing (Chen et al., 2016) saves memory by not storing all activations during the forward pass; it recomputes them during backward. Numerical concern: The recomputed activations may differ slightly from the original computation due to floating-point non-associativity (different thread ordering on GPU).
For deterministic behavior with gradient checkpointing, use:
torch.use_deterministic_algorithms(True)
torch.backends.cuda.matmul.allow_tf32 = False # Disable TF32 (approximate matmul)
TF32 (TensorFloat-32) is NVIDIA's format that keeps the fp32 sign and exponent but rounds matmul inputs to 10 mantissa bits (accumulation stays in fp32) - faster but less precise. It was enabled by default for matmuls from PyTorch 1.7 through 1.11; since 1.12 the matmul default is off, while cuDNN convolutions still use it. For reproducible science, disable it; for production training, leaving it enabled is usually fine.
Appendix K: Self-Assessment Questions
- What is the machine epsilon of fp32? Of bf16? Of fp16?
- Why is bf16 preferred over fp16 for training large language models?
- What is catastrophic cancellation? Give an example involving subtraction.
- State the fundamental rounding error model fl(a op b) = (a op b)(1 + δ). What does δ represent?
- What is the backward error of an algorithm? Why is backward stability more useful than forward stability?
- If a matrix has condition number κ(A) and you solve Ax = b in fp32 (ε ≈ 10^-7), what is the expected relative error in the solution?
- Describe the Kahan summation algorithm. What is its error guarantee?
- Why does log(softmax(x)) fail for large inputs? How is log_softmax implemented stably?
- What is loss scaling and when is it needed? When is it not needed?
- Define κ(A) in terms of singular values. What does a large κ(A) mean geometrically?
- What is the difference between overflow and underflow? Which causes NaN?
- Why does FlashAttention use fp32 for softmax accumulation even in bf16 mode?
Appendix L: Numerical Analysis in PyTorch - Implementation Patterns
L.1 Computing Machine Epsilon Empirically
The standard algorithm to find machine epsilon without library calls:
import numpy as np

def find_machine_epsilon(dtype):
    """Find machine epsilon for a given dtype by repeated halving."""
    eps = dtype(1.0)
    # Keep halving while 1 + eps/2 is still distinguishable from 1
    while dtype(1.0) + eps / dtype(2.0) > dtype(1.0):
        eps = eps / dtype(2.0)
    return eps

for dt in [np.float16, np.float32, np.float64]:
    emp_eps = find_machine_epsilon(dt)
    lib_eps = np.finfo(dt).eps
    print(f"{dt.__name__:12s}: empirical={emp_eps:.3e}, library={lib_eps:.3e}")
This gives:
- float16: 2^-10 ≈ 9.77 x 10^-4
- float32: 2^-23 ≈ 1.19 x 10^-7
- float64: 2^-52 ≈ 2.22 x 10^-16
L.2 Visualizing the Floating-Point Number Line
import numpy as np
import matplotlib.pyplot as plt
# Generate all positive fp16 normals in [1, 2)
# fp16: bias=15, so exponent bits = 01111 = 15, mantissa = 10 bits
mantissas = np.arange(0, 2**10)
values = 1.0 + mantissas / 2**10 # fp16 values in [1, 2)
spacings = np.diff(values)
# Plot: density of fp16 values in different ranges
ranges = [(0.5, 1.0), (1.0, 2.0), (2.0, 4.0), (4.0, 8.0)]
for lo, hi in ranges:
# All fp16 values in this range
count = 2**10 # same mantissa count per exponent
spacing = (hi - lo) / count
print(f"[{lo}, {hi}): {count} values, spacing = {spacing:.6f}")
Output shows: every power-of-2 interval contains exactly 1024 fp16 values (for normals), but the spacing doubles with each interval - geometric, not arithmetic, distribution.
L.3 Detecting and Fixing Common Numerical Issues
import torch

class NumericalGuard:
"""Context manager for detecting numerical issues during training."""
def __init__(self, model, check_inputs=True, check_outputs=True):
self.model = model
self.hooks = []
self.check_inputs = check_inputs
self.check_outputs = check_outputs
def __enter__(self):
def hook(module, input, output):
name = type(module).__name__
if self.check_inputs:
for i, x in enumerate(input):
if isinstance(x, torch.Tensor):
if torch.isnan(x).any():
raise ValueError(f"NaN in {name} input {i}")
if torch.isinf(x).any():
raise ValueError(f"Inf in {name} input {i}")
if self.check_outputs:
if isinstance(output, torch.Tensor):
if torch.isnan(output).any():
raise ValueError(f"NaN in {name} output")
for module in self.model.modules():
self.hooks.append(module.register_forward_hook(hook))
return self
def __exit__(self, *args):
for hook in self.hooks:
hook.remove()
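A possible usage pattern, with a toy model standing in for a real network:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
x = torch.randn(4, 8)
with NumericalGuard(model):
    out = model(x)   # would raise ValueError naming the first module that sees NaN/Inf
print(out.shape)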
L.4 Comparing Summation Methods
def naive_sum(arr):
"""Standard left-to-right accumulation."""
s = type(arr[0])(0)
for x in arr:
s += x
return s
def kahan_sum(arr):
"""Kahan compensated summation."""
s = type(arr[0])(0)
c = type(arr[0])(0)
for x in arr:
y = x - c
t = s + y
c = (t - s) - y
s = t
return s
def pairwise_sum(arr):
"""Recursive pairwise (binary tree) summation."""
n = len(arr)
if n == 1:
return arr[0]
mid = n // 2
return pairwise_sum(arr[:mid]) + pairwise_sum(arr[mid:])
# Test: sum 1 million copies of 1/3 in fp32
n = 1_000_000
x = np.full(n, 1/3, dtype=np.float32)
true_sum = n / 3  # reference value computed in float64
for name, fn in [('Naive', naive_sum), ('Kahan', kahan_sum), ('Pairwise', pairwise_sum)]:
result = fn(list(x))
error = abs(float(result) - true_sum)
rel_error = error / abs(true_sum)
print(f"{name:8s}: result={result:.10f}, rel_error={rel_error:.2e}")
L.5 Stable Implementations of Common Functions
def log_sum_exp(x):
"""Numerically stable log(sum(exp(x))) for array x."""
m = np.max(x)
return m + np.log(np.sum(np.exp(x - m)))
def softmax(x):
"""Numerically stable softmax."""
x_shifted = x - np.max(x)
e = np.exp(x_shifted)
return e / np.sum(e)
def log_softmax(x):
"""Numerically stable log-softmax."""
return x - log_sum_exp(x)
def cross_entropy(logits, target):
"""Numerically stable cross-entropy: -log(softmax(logits)[target])."""
return -log_softmax(logits)[target]
def sigmoid(x):
    """Numerically stable sigmoid avoiding overflow for large |x|."""
    # np.where evaluates both branches, so exponentiate -|x| only:
    # neither branch can then overflow.
    z = np.exp(-np.abs(x))
    return np.where(x >= 0, 1 / (1 + z), z / (1 + z))
def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    # np.where evaluates both branches, so cap the exp argument to avoid
    # overflow warnings; for x > 20, log(1 + exp(x)) ~ x to double precision.
    return np.where(x > 20,
                    x,
                    np.log1p(np.exp(np.minimum(x, 20))))  # log1p is more accurate than log(1 + ...)
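A quick check of these implementations on inputs that break the naive formulas (the input values are chosen only to force overflow):

import numpy as np

x = np.array([1000.0, 0.0, -1000.0])
print("stable softmax    :", softmax(x))        # ~[1, 0, 0], no overflow
print("stable log-softmax:", log_softmax(x))    # [0, -1000, -2000]
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(x) / np.sum(np.exp(x))       # exp(1000) overflows -> inf/inf -> nan
print("naive softmax     :", naive)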
L.6 Loss Scaling Implementation
import torch

class SimpleLossScaler:
    """Minimal loss scaler for fp16 training."""

    def __init__(self, init_scale=2**15, scale_factor=2.0, scale_window=2000):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self._successful_steps = 0

    def scale_loss(self, loss):
        return loss * self.scale

    def step(self, optimizer, params):
        """Unscale gradients and step, or skip the step if overflow is detected."""
        # Check for overflow in any gradient
        overflow = any(
            torch.isnan(p.grad).any() or torch.isinf(p.grad).any()
            for p in params if p.grad is not None
        )
        if overflow:
            # Reduce scale, skip this step, and restart the growth window
            self.scale /= self.scale_factor
            self._successful_steps = 0
            print(f"Overflow detected! Scale reduced to {self.scale}")
        else:
            # Unscale gradients before the optimizer sees them
            for p in params:
                if p.grad is not None:
                    p.grad /= self.scale
            optimizer.step()
            self._successful_steps += 1
            # Increase scale every scale_window successful steps
            if self._successful_steps % self.scale_window == 0:
                self.scale *= self.scale_factor
                print(f"Scale increased to {self.scale}")
Appendix M: Historical Perspective - The Road to IEEE 754
M.1 Before IEEE 754 (Pre-1985)
Before the IEEE 754 standard, every computer manufacturer used its own floating-point format. IBM mainframes used base-16 (hexadecimal) floating point. DEC VAX used its own 32-bit format. CDC 6600 used 60-bit words with 48-bit mantissa. Programs written for one machine could not be reliably ported to another - the same computation produced different results depending on the hardware.
The chaos had real consequences: scientific simulations gave different answers on different machines; numerical algorithms required hand-tuning for each platform; software bugs were indistinguishable from floating-point differences. The 1970s saw a growing recognition that a universal standard was needed.
M.2 The IEEE 754 Committee (1977-1985)
The IEEE 754 working group spent eight years developing the standard; its core draft was the "K-C-S" proposal from William Kahan, Jerome Coonen, and Harold Stone. Kahan - who would win the 1989 Turing Award partly for this work - insisted on several features that were controversial but proved crucial:
- Round-to-nearest-even as default: Reduces systematic bias in long computations
- Gradual underflow (subnormals): Prevents sudden loss of precision near zero
- Signed zeros: Enables correct complex analysis identities (e.g., 1/(+0) = +∞ vs 1/(-0) = -∞ across branch cuts)
- NaN as a value (not an exception): Allows computation to continue past undefined operations
The Intel 8087 coprocessor (1980) was the first commercial implementation, years before the standard was formally published in 1985. Its success validated the design.
M.3 IEEE 754-2008 and Beyond
The 2008 revision added:
- fp16 (half precision): Initially for graphics/ML; the 5-bit exponent limits the range to ±65504
- Decimal floating-point: For financial computations where values like 0.1 must be represented exactly
- Fused multiply-add (FMA): a*b + c computed with only one rounding error, not two
FMA is particularly important for ML: matrix multiplication decomposes into FMA operations, and modern GPU tensor cores implement FMA in mixed precision (fp16/bf16 multiply + fp32 accumulate) for maximum throughput with minimum precision loss.
M.4 The ML Formats (2017-2024)
The ML revolution created demand for formats the IEEE committee never anticipated:
bf16 (2018): Google Brain developed Brain Float 16 for TPUs. The key insight: training stability requires dynamic range (exponent bits), not precision (mantissa bits). Keep all 8 exponent bits of fp32, sacrifice 16 mantissa bits. The result trains as stably as fp32 at half the cost.
tf32 (2020): NVIDIA's TensorFloat-32 keeps fp32's 8-bit exponent and rounds matmul inputs to a 10-bit mantissa (same as fp16), accumulating in fp32, giving up to ~3x speedup with minimal accuracy loss. On Ampere and newer GPUs it is used by default for cuDNN convolutions in PyTorch; for matmuls it has been opt-in since PyTorch 1.12.
fp8 E4M3 / E5M2 (2022): For H100 training at extreme throughput. Requires per-tensor scaling factors and careful format selection per layer (E4M3 for weights/activations, E5M2 for gradients).
fp4 / int4 (experimental): Used for inference quantization (GGUF format, llama.cpp). Weights stored as 4-bit integers with a shared scale factor per block. 4-bit reduces model size by 8x vs fp32, enabling 70B-parameter models on a single consumer GPU.
Appendix N: Quick Reference and Formulas
N.1 Key Inequalities
- Rounding model: fl(a op b) = (a op b)(1 + δ), with |δ| ≤ ε_mach
- Naive summation of n terms: |error| ≤ (n − 1) ε_mach Σ|x_i| + O(ε²)
- Kahan summation: |error| ≤ 2 ε_mach Σ|x_i| + O(ε²), independent of n
- Solving Ax = b with a backward-stable algorithm: relative error ≲ κ(A) ε_mach
N.2 Algorithms Reference
Stable softmax: softmax(x)_i = exp(x_i − m) / Σ_j exp(x_j − m), with m = max_j x_j
Stable log-sum-exp: LSE(x) = m + log Σ_j exp(x_j − m)
Kahan step: y = x_i − c; t = s + y; c = (t − s) − y; s = t
Optimal finite-difference step: h* ≈ √(ε_mach) for a forward difference (≈ 3 x 10^-4 in fp32)
Condition number bounds: κ(A) = σ_max / σ_min; forward error ≲ κ(A) x backward error
N.3 Checklist: Numerically Stable Implementation
Before deploying any numerical computation, verify:
- All log() calls are protected: log(x + eps) or log_softmax() instead of log(softmax())
- All divisions are protected: x / (y + eps) for division by potentially-zero values
- Softmax is computed with max-subtraction (stable_softmax), not the naive formula
- Cross-entropy uses F.cross_entropy(logits, labels), not F.nll_loss(F.log_softmax(logits), labels) (equivalent but one extra call)
- Float comparisons use torch.isclose with appropriate atol and rtol, not ==
- Accumulation of small differences is done in higher precision (float64 or Kahan)
- Gradient clipping is enabled: clip_grad_norm_(params, max_norm=1.0)
- NaN detection: torch.isnan(loss) checked at training start
- Matrix condition number checked before solving: np.linalg.cond(A) < 1/eps
- bf16 used instead of fp16 for new training runs
Appendix O: Worked Problems with Full Solutions
O.1 Problem: Compute e^x − 1 for Small x
Problem: Compute e^x − 1 accurately for small x (say x = 10^-8) in fp32.
Naive approach: compute exp(x), then subtract 1. In fp32 (only ~7 decimal digits), exp(10^-8) rounds to exactly 1.0, since 10^-8 is below the fp32 spacing at magnitude 1 (≈ 1.2 x 10^-7). Result: exp(x) - 1 = 0. Relative error: 100%.
Stable approach: Use the function expm1(x) = e^x − 1 directly, implemented as numpy.expm1(x):
The expm1 function avoids the catastrophic cancellation - it evaluates e^x − 1 directly (e.g., via a series for small x) without subtracting nearly-equal numbers. Result: ≈ 1.0 x 10^-8. Relative error: on the order of machine precision.
Rule: Use np.expm1(x) instead of np.exp(x) - 1 whenever |x| ≪ 1. Similarly, use np.log1p(x) instead of np.log(1 + x) for small x.
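A quick numerical check of the two routes, using x = 10^-8 as above:

import numpy as np

x = np.float32(1e-8)
naive = np.exp(x) - np.float32(1.0)   # exp(x) rounds to exactly 1.0 in fp32 -> result 0.0
stable = np.expm1(x)                  # ~1.0e-08, correct to fp32 precision
print(naive, stable)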
O.2 Problem: Running Mean and Variance
Problem: Compute the mean and variance of a stream of numbers without storing all values.
Naive: Store all values and compute mean = (1/n) Σ x_i, then variance = (1/(n−1)) Σ (x_i − mean)². Requires O(n) memory.
One-pass (unstable): variance ≈ (1/n) Σ x_i² − mean². Catastrophically unstable when the mean is large relative to the standard deviation (the two terms nearly cancel).
Welford's online algorithm (stable):
Initialize: M_1 = x_1, S_1 = 0, n = 1
For k = 2, 3, ...:
n += 1
delta = x_k - M_{k-1}
M_k = M_{k-1} + delta / n
delta2 = x_k - M_k
S_k = S_{k-1} + delta * delta2
Variance = S_n / (n - 1) (sample variance)
This is used in: PyTorch BatchNorm (online mean/variance), Adam optimizer (running mean of gradients), gradient clipping (running norm estimation).
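A minimal Python sketch of Welford's recurrence, checked against NumPy on a stream with a large mean (the data parameters are illustrative):

import numpy as np

def welford(stream):
    """Single-pass mean and sample variance via Welford's recurrence."""
    mean, m2, n = 0.0, 0.0, 0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)      # note: uses the *updated* mean
    return mean, m2 / (n - 1)

data = np.random.default_rng(0).normal(loc=1e6, scale=1.0, size=100_000)
m, v = welford(data)
print(m, v)   # should closely match np.mean(data) and np.var(data, ddof=1)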
O.3 Problem: Matrix-Vector Product Error Bound
Problem: Bound the error in y = Ax computed in fp32, where A is n x n and x is a length-n vector.
Each entry of the product is a dot product: y_i = Σ_j a_ij x_j.
Each such dot product of n terms has forward error at most n ε_mach Σ_j |a_ij| |x_j| (to first order).
In matrix form:
|ŷ − y| ≤ n ε_mach |A| |x|,
where |·| denotes the entry-wise absolute value.
For a GPU matmul using FMA with fp32 accumulation: the observed error is typically smaller by roughly a factor of √n because individual rounding errors partially cancel, but the worst-case bound above still applies.
Implication for attention: The attention score q·k is a dot product of length d_k. In fp16 (ε ≈ 10^-3) without fp32 accumulation: error ≈ d_k · ε x (product magnitude). For d_k = 128: up to 12.8% relative error - unacceptably large. With fp32 accumulation (as in FlashAttention and tensor cores): error ≈ 128 x 10^-7 ≈ 10^-5 - perfectly fine.
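A sketch of the effect for a length-128 dot product, simulating fp16 and fp32 accumulation with NumPy scalars (actual GPU kernels differ, but the trend matches the bound):

import numpy as np

rng = np.random.default_rng(0)
d_k = 128
q = rng.standard_normal(d_k).astype(np.float16)
k = rng.standard_normal(d_k).astype(np.float16)
ref = float(np.dot(q.astype(np.float64), k.astype(np.float64)))   # high-precision reference

acc16 = np.float16(0.0)   # fp16 running sum
acc32 = np.float32(0.0)   # fp32 running sum (what tensor cores accumulate in)
for a, b in zip(q, k):
    acc16 = np.float16(acc16 + a * b)
    acc32 = np.float32(acc32 + np.float32(a) * np.float32(b))

print("reference        :", ref)
print("fp16-accum error :", abs(float(acc16) - ref))
print("fp32-accum error :", abs(float(acc32) - ref))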
O.4 Problem: Condition Number of the Attention Matrix
Problem: What is the condition number of the softmax output p = softmax(z)?
The Jacobian of softmax is J = diag(p) − p p^T, which has eigenvalue 0 (in the direction of the all-ones vector) and remaining eigenvalues of order p_i(1 − p_i). This matrix is singular (rank n − 1), so in the strictest sense the softmax map has infinite condition number (it maps from R^n onto the probability simplex, a lower-dimensional manifold).
Practical question: How sensitive are the softmax outputs to perturbations in the logits? The answer depends on the sharpness of the softmax. A "peaky" distribution (one p_i ≈ 1) has Jacobian entries near 0 - small gradient flow, effectively zero sensitivity. A "flat" distribution (all p_i ≈ 1/n) has maximum sensitivity: the Frobenius norm of J is maximized.
For training: Sharp attention (peaky softmax) causes vanishing gradients through the attention weights. This is prevented by: (1) temperature scaling (dividing logits by √d_k or a temperature); (2) attention dropout (randomly zeroing some weights, softening the distribution); (3) label smoothing (softening the target distribution, preventing overconfident predictions).
Appendix P: Summary Statistics of Chapter
CHAPTER Section01 SUMMARY
========================================================================
Core concepts: 10 (IEEE 754, ε_mach, rounding, cancellation,
Kahan, condition number, stability, formats,
mixed precision, stable implementations)
Key formulas: 12 (rounding model, Kahan step, stable softmax,
logsumexp, κ(A), error bounds, ...)
Numerical formats: 8 (fp64, fp32, bf16, fp16, fp8 E4M3, fp8 E5M2,
int8, int4)
AI connections: 15 (loss scaling, NaN debugging, FlashAttention,
TF32, bf16 training, gradient clipping, ...)
Exercises: 8 (* to ***), covering all major topics
Appendices: 16 (A through P)
Notes length: ~1700 lines
Theory cells: 50+
Exercises: 10 graded problems (24 cells)
========================================================================
Appendix Q: Deep Dive - IEEE 754 Special Value Arithmetic
Understanding how special values propagate is essential for debugging AI training runs.
Q.1 NaN Propagation Rules
NaN (Not a Number) is sticky: any arithmetic operation with a NaN produces a NaN.
NaN + x = NaN for any x (including NaN)
NaN * 0 = NaN (not 0, despite "anything times zero is zero")
NaN < x = False for any x
NaN == NaN = False (IEEE mandates this - NaN is not equal to itself)
The self-inequality of NaN is the canonical way to detect it:
def is_nan(x):
return x != x # True only when x is NaN
In PyTorch: torch.isnan(x) or x != x.
NaN in gradient computation: If any parameter gradient is NaN, the entire parameter update step is corrupted. A single NaN in one layer's weight will propagate backward through the chain rule to corrupt all earlier layer gradients.
Debugging strategy for NaN gradients:
- Register forward hooks to check activations for NaN after each layer
- Register backward hooks to check gradients for NaN
- Binary search over layers to find the first occurrence
- Check for invalid operations in custom loss functions (log(0), 0/0)
- Check for overflow in exponentials (exp(large_number))
Q.2 Infinity Arithmetic
Infinity follows extended real arithmetic:
+∞ + (+∞) = +∞
+∞ + (−∞) = NaN (indeterminate form)
+∞ * 0 = NaN (indeterminate form)
+∞ * (+∞) = +∞
1 / 0 = +∞ (for positive numerator)
-1 / 0 = −∞
0 / 0 = NaN
x / ±∞ = 0 for finite x
Loss spike analysis: When a loss value becomes inf, trace backward:
- log(0) - zero probability assigned to the correct class
- exp(x) for large x before normalization (use log-sum-exp)
- Division by a very small denominator (batch norm with near-zero variance)
- Accumulated gradients that overflow the fp16 range before gradient clipping
Q.3 Signed Zero
IEEE 754 has both +0 and -0. They compare equal (+0 == -0 is True) but differ in division:
1 / (+0) = +∞
1 / (−0) = −∞
Signed zero matters in:
- Complex number arithmetic: branch cuts depend on sign of zero imaginary part
- Sorting algorithms: sort([+0, -0]) may or may not preserve order
- Gradient of relu at zero: the subgradient convention can produce +0 or -0
Q.4 Subnormal Performance Impact
On most hardware, operations involving subnormal numbers (also called denormals) execute 10-100x slower than normal operations, because they require software emulation or special hardware paths.
In training:
- Very small weight values gradually entering subnormal range -> sudden throughput drop
- Gradient values approaching zero -> subnormal gradients -> training slows
- Fix: Set the flush-to-zero (FTZ) flag, which replaces subnormals with zero
- PyTorch: enabled by default in CUDA
- NumPy: np.seterr(under='ignore') + platform FTZ setting
- Downside: FTZ introduces larger underflow errors but avoids the performance cliff
Appendix R: Floating-Point Error Case Studies from AI Research
Case Study 1: The fp16 Loss Explosion Problem (2017-2018)
When researchers first attempted to train large language models in fp16, they observed frequent loss explosions. Investigation revealed:
- Gradient magnitudes of ~10^-4 to 10^-3 during stable training
- fp16 minimum positive normal: ~6.1 x 10^-5
- Many gradients underflowed to zero -> effectively frozen parameters
- Sudden instability from accumulated representation errors in weights
Solution (Micikevicius et al., 2017): Mixed-precision training with loss scaling:
- Maintain fp32 master copy of all weights
- Cast to fp16 for forward and backward pass (4x speedup)
- Scale the loss by a factor S (typically 2^8 to 2^15) before backward
- Check for overflow (any gradient is Inf or NaN)
- If no overflow: unscale, apply gradient clipping, update fp32 weights
- If overflow: skip the step and halve S
This framework is now standard in torch.cuda.amp.
Case Study 2: Attention Score Overflow in Early Transformers
In the original Transformer (Vaswani et al., 2017), attention scores are:
softmax(Q K^T / √d_k) V
Without the √d_k scaling, with d_k = 64:
- q·k can have magnitude ~√d_k = 8 (assuming unit-variance features)
- After softmax, extreme values dominate: exp(8) carries roughly 3000x the weight of exp(0)
- Gradient of softmax becomes nearly zero -> vanishing gradients
The 1/√d_k factor ensures dot products have variance ~1, keeping softmax in its informative regime.
fp16 version: With larger d_k and input std above 1, raw scores grow quickly; once a score exceeds ~11, exp(score) > 65504 (the fp16 max), which happens with non-negligible probability when activations are large. FlashAttention computes attention in blocks, using online softmax with numerical rescaling to avoid materializing the full attention matrix.
Case Study 3: Layer Normalization Stability
Layer normalization computes:
y = γ · (x − μ) / √(σ² + ε) + β
where ε (typically 10^-6 to 10^-5) prevents division by zero.
Numerical pitfall: If all activations in a layer are identical (common during initialization or after a bad update), σ² = 0 exactly in floating point. Without ε, we'd compute 0/0 = NaN.
Stability consideration: The choice of ε matters:
- Too small (e.g., 10^-12): offers essentially no protection when σ² underflows or is represented as exact zero
- Too large (e.g., 10^-2): changes the effective normalization for small-variance activations
- The standard choice (around 10^-5) balances stability with fidelity
Catastrophic cancellation in variance: Computing σ² = E[x²] − (E[x])² suffers catastrophic cancellation when E[x²] ≈ (E[x])² (small variance). Welford's online algorithm avoids this (see Appendix O).
Case Study 4: Gradient Checkpointing and Recomputation Errors
Gradient checkpointing saves memory by recomputing intermediate activations during the backward pass instead of storing them. This recomputation uses the same inputs but may use different precision if the precision mode changes between forward and backward.
Source of error: In mixed-precision training with autocast, the recomputed forward pass during backward may use slightly different floating-point operations than the original forward. The resulting gradient is mathematically correct (same bit pattern inputs) but the error can accumulate differently.
Industry practice: PyTorch's torch.utils.checkpoint.checkpoint handles this correctly by default. The key requirement is that the checkpointed function must be deterministic - stochastic operations (like Dropout) must use the same random seed in forward and recomputation.
Appendix S: Floating-Point Benchmarks and Profiling
Measuring fp32 vs bf16 Throughput
import torch
import time
def benchmark_matmul(dtype, size=4096, n_warmup=5, n_trials=20):
"""Benchmark matrix multiplication throughput."""
device = 'cuda' if torch.cuda.is_available() else 'cpu'
A = torch.randn(size, size, dtype=dtype, device=device)
B = torch.randn(size, size, dtype=dtype, device=device)
# Warmup
for _ in range(n_warmup):
C = torch.mm(A, B)
if device == 'cuda':
torch.cuda.synchronize()
# Benchmark
start = time.perf_counter()
for _ in range(n_trials):
C = torch.mm(A, B)
if device == 'cuda':
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
# FLOPS: 2 * N^3 for N x N matmul
flops = 2 * size**3 * n_trials
tflops = flops / elapsed / 1e12
return tflops
# Typical results on A100 GPU:
# fp32: ~19.5 TFLOPS
# bf16: ~77.0 TFLOPS (4x speedup with Tensor Cores)
# fp16: ~77.0 TFLOPS (similar to bf16)
Detecting Numerical Issues in Training
def register_nan_hooks(model):
"""Register hooks to detect NaN/Inf in forward and backward passes."""
hooks = []
def forward_hook(name):
def hook(module, input, output):
if isinstance(output, torch.Tensor):
if torch.isnan(output).any():
print(f"NaN detected in forward: {name}")
if torch.isinf(output).any():
print(f"Inf detected in forward: {name}")
return hook
def backward_hook(name):
def hook(module, grad_input, grad_output):
for i, g in enumerate(grad_input):
if g is not None and torch.isnan(g).any():
print(f"NaN in grad_input[{i}] at: {name}")
return hook
for name, module in model.named_modules():
hooks.append(module.register_forward_hook(forward_hook(name)))
hooks.append(module.register_full_backward_hook(backward_hook(name)))
return hooks # Call hook.remove() to deregister
# Usage:
# hooks = register_nan_hooks(model)
# train_step(model, batch)
# for h in hooks: h.remove()
Condition Number Monitoring
def monitor_weight_conditioning(model, threshold=1e4):
"""Monitor condition numbers of weight matrices."""
ill_conditioned = []
for name, param in model.named_parameters():
if param.dim() >= 2:
# Use fast approximation via max/min singular values
try:
sv = torch.linalg.svdvals(param.view(param.shape[0], -1))
kappa = sv[0] / sv[-1]
if kappa > threshold:
ill_conditioned.append((name, kappa.item()))
except Exception:
pass
return ill_conditioned
Appendix T: Quick Derivations
T.1 Why Machine Epsilon Is 2^(1-p)
A floating-point number with a precision of p bits (significand) represents values of the form:
x = ±(1.b_1 b_2 ... b_(p-1))_2 x 2^e
The spacing between consecutive representable numbers near 1 is:
2^(1-p)
This is the unit in the last place (ULP) at 1, which equals machine epsilon ε_mach.
For fp32: p = 24 (1 implicit + 23 explicit mantissa bits), so ε_mach = 2^-23 ≈ 1.19 x 10^-7.
T.2 Why Kahan Summation Has O(ε) Error
Naive summation of n numbers has an error bound of roughly n ε_mach Σ|x_i| - it grows with n.
Kahan summation maintains a compensation variable tracking the lost low-order bits:
c = 0
for each x_i:
y = x_i - c # Corrected input
t = sum + y # sum is large, y small
c = (t - sum) - y # Algebraically zero, but captures rounding error
sum = t
The compensation captures the bits lost in sum + y and feeds them into the next iteration. The net error is about 2 ε_mach Σ|x_i|, independent of n.
T.3 Why Softmax Is Numerically Stable With Shifting
For any constant c, softmax satisfies translation invariance: softmax(x − c) = softmax(x).
Proof: softmax(x − c)_i = e^(x_i − c) / Σ_j e^(x_j − c) = (e^(-c) e^(x_i)) / (e^(-c) Σ_j e^(x_j)) = softmax(x)_i.
Setting c = max_j x_j ensures all exponents x_i − c ≤ 0, preventing overflow while preserving the mathematical value.
T.4 Relative Backward Stability of Householder QR
The Householder QR algorithm produces computed factors Q and R such that:
Q R = A + ΔA, with ||ΔA|| = O(ε_mach) ||A||
This is backward stable: the computed result is the exact answer for a slightly perturbed problem. The forward error in solving Ax = b via QR is then O(κ(A) ε_mach), which is optimal - no algorithm can do better without additional information about the problem structure.
Appendix U: Cross-Reference with Other Sections
This section connects to several other parts of the curriculum:
| Topic Introduced Here | Full Treatment |
|---|---|
| Condition number | Section02 Numerical Linear Algebra - condition number theory, backward stability proofs, perturbation analysis |
| Stable matrix algorithms (LU, QR, Cholesky) | Section03-Advanced-Linear-Algebra/08-Matrix-Decompositions - full decomposition theory |
| Floating-point in optimization (gradient descent) | Section03 Numerical Optimization - learning rate selection, gradient accumulation, precision effects on convergence |
| Interpolation with finite precision | Section04 Interpolation and Approximation - Runge's phenomenon, Chebyshev nodes, numerical stability of polynomial evaluation |
| Numerical quadrature errors | Section05 Numerical Integration - error analysis of quadrature rules in finite precision |
| Probabilistic error analysis | Section05-Probability-and-Statistics - probabilistic numerics, Gaussian process approximations |
Notation cross-references:
- ε_mach defined here -> used in all Section10 sections
- κ(A) introduced here -> defined formally in Section02
- fl(·) rounding operator defined here -> used throughout Section10
Appendix V: Extended Exercises with Hints
These supplementary problems extend the main exercises for students who want deeper practice.
V.1 Sterbenz's Lemma (**)
Statement: If a and b are floating-point numbers with b/2 ≤ a ≤ 2b, then fl(a − b) = a − b exactly (no rounding error).
Proof sketch: When a and b are within a factor of 2 of each other, the difference a − b is small. Since both a and b are exactly representable and their exponents differ by at most 1, the subtraction can be performed within the significand without any rounding.
Application: In Kahan summation, the step c = (t - sum) - y uses Sterbenz's lemma to capture rounding errors exactly when the values have similar magnitudes.
Exercise: Verify Sterbenz's lemma numerically for fp32 arithmetic. Generate pairs (a, b) with b/2 ≤ a ≤ 2b and confirm that the fp32 result of a - b equals the exact difference computed via Fraction (exact rational arithmetic). Compare with pairs outside this range.
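One possible verification sketch, using Python's Fraction as the exact reference (the pair-generation scheme is just one convenient choice):

import numpy as np
from fractions import Fraction

rng = np.random.default_rng(0)
violations = 0
for _ in range(10_000):
    a = np.float32(rng.uniform(1.0, 2.0))
    # choose b so that b/2 <= a <= 2b, i.e. a/2 <= b <= 2a
    b = np.float32(rng.uniform(float(a) / 2, 2 * float(a)))
    fp_diff = a - b                                   # float32 subtraction
    exact = Fraction(float(a)) - Fraction(float(b))   # exact rational difference
    if Fraction(float(fp_diff)) != exact:
        violations += 1
print("violations of Sterbenz's lemma:", violations)  # expected: 0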
V.2 The Table Maker's Dilemma (***)
When evaluating an elementary function such as exp(x) or sin(x) to a correctly-rounded result, we may need to know the true value to much higher precision - potentially arbitrary precision. This is because the correctly-rounded result depends on which side of a rounding boundary the true value falls. For a p-bit significand, the worst case can require computing to roughly 2p-3p bits of precision (the Table Maker's Dilemma; Kahan and Ziv, 1990s).
Modern resolution: The CRlibm library (INRIA) and Intel SVML use multi-stage argument reduction + polynomial approximation with double-double arithmetic to guarantee correctly-rounded elementary functions.
For AI: This is why torch.sin() may give slightly different results than math.sin() for the same input - different levels of accuracy guarantees.
V.3 Floating-Point Reproducibility (**)
Training a deep neural network on the same hardware, same code, same random seed should produce the same result. But in practice, results often differ between runs. Sources of non-reproducibility:
- Non-associative reductions: GPU CUDA reduction operations may sum in different orders depending on thread scheduling
- Atomic operations: Multi-threaded gradient accumulation uses atomics, which arrive in non-deterministic order
- cuDNN algorithm selection: cuDNN may select different convolution algorithms with different rounding behaviors
PyTorch reproducibility settings:
import torch
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
torch.backends.cudnn.deterministic = True # Slower but deterministic
torch.backends.cudnn.benchmark = False # Disable algorithm search
Tradeoff: Deterministic mode can be 10-30% slower due to avoiding certain non-deterministic fast paths.
V.4 Error-Free Transformations (***)
An error-free transformation (EFT) computes not just s = fl(a + b) but also the exact rounding error e such that a + b = s + e exactly.
TwoSum algorithm (Knuth, 1969):
function TwoSum(a, b):
s = fl(a + b)
v = fl(s - a)
e = fl((a - fl(s - v)) + fl(b - v))
return (s, e) # a + b = s + e exactly
This costs 6 floating-point operations to compute both the sum and the exact rounding error.
Application - Double-Double Arithmetic: By representing each number as a pair (hi, lo) whose sum hi + lo is the true value (with |lo| at most half an ulp of hi), we effectively double the working precision from p to roughly 2p bits using only hardware arithmetic. This is how long double is emulated on some platforms and how Kahan summation achieves its precision.
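A direct Python rendering of TwoSum; Python floats are IEEE doubles, so fl(.) is ordinary arithmetic here:

def two_sum(a: float, b: float):
    """Knuth's TwoSum: returns (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    v = s - a
    e = (a - (s - v)) + (b - v)
    return s, e

s, e = two_sum(1.0, 1e-16)
print(s, e)   # s = 1.0 (the rounded sum), e = 1e-16 (the exactly recovered rounding error)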
Appendix W: Format Selection Guide for AI Practitioners
Decision Tree for Format Selection
CHOOSING FLOATING-POINT FORMAT
========================================================================
Is this inference or training?
|
+-> INFERENCE
| Is accuracy critical?
| +-> YES: fp32 (or bf16 if model supports it)
| +-> NO: int8 quantization (post-training quant, GPTQ, AWQ)
| -> 4-8x memory reduction, ~1-2% accuracy loss
|
+-> TRAINING
Hardware with Tensor Cores? (A100/H100/RTX 30xx/40xx)
+-> YES: bf16 + AMP (torch.cuda.amp)
| - Weight master copy in fp32
| - Compute in bf16 (4x throughput)
| - Dynamic loss scaling
+-> NO (V100 or CPU):
fp32 safe baseline
fp16 + AMP if V100 (needs loss scaling, unstable for LLMs)
========================================================================
Format Comparison Summary
| Format | Bits | Dynamic Range | Precision | Best Use Case |
|---|---|---|---|---|
| fp64 | 64 | +/-10^308 | ~15 digits | Scientific computing, debugging |
| fp32 | 32 | +/-10^38 | ~7 digits | Training baseline, optimizer states |
| tf32 | 19* | +/-10^38 | ~3 digits | A100 matmul (automatic, transparent) |
| bf16 | 16 | +/-10^38 | ~2 digits | LLM training (same range as fp32) |
| fp16 | 16 | +/-65504 | ~3 digits | Vision models, inference |
| fp8 E4M3 | 8 | +/-448 | ~1 digit | Forward pass (H100+) |
| fp8 E5M2 | 8 | +/-57344 | <1 digit | Gradient accumulation (H100+) |
| int8 | 8 | +/-127 | Integer | Inference quantization |
*tf32 uses 10-bit mantissa for compute but stores as fp32.
Choosing Loss Scale Initial Value
# Typical AMP GradScaler configuration
scaler = torch.cuda.amp.GradScaler(
init_scale=2**16, # Start with 65536 - large enough for fp16
growth_factor=2.0, # Double scale every growth_interval steps
backoff_factor=0.5, # Halve scale on overflow
growth_interval=2000, # Steps between scale increases
enabled=True
)
Heuristic: Start with init_scale = 2^16 for most models. If you see frequent overflow warnings in the first 100 steps, reduce to 2^12. If no overflow for 10,000+ steps, the scaler will auto-increase.
End of Appendix W