Number Systems - Part 1: sections 1 (Intuition) through 7 (Floating-Point Arithmetic Deep Dive)
1. Intuition
1.1 What Are Number Systems?
A number system is a formal framework for representing, storing, and computing with numerical quantities using a finite set of symbols. At its core, a number system answers four questions:
- What values can be represented? - The range of the system
- How precisely? - The resolution or granularity between representable values
- At what memory cost? - The number of bits required per value
- How fast can arithmetic be performed? - Hardware throughput for the format
Every number stored in a computer is an approximation. Real numbers (\mathbb{R}) are uncountably infinite - there are infinitely many real numbers between any two distinct real numbers. But a computer register has a fixed number of bits. A 32-bit register can represent at most 2^32 \approx 4.3 billion distinct values. A 16-bit register: 2^16 = 65,536. An 8-bit register: 2^8 = 256. A 4-bit register: 2^4 = 16.
The art of number system design is choosing which values to represent - how to distribute those discrete points across the real number line to minimise error for the intended application.
THE NUMBER SYSTEM DESIGN PROBLEM
=======================================================================
Real number line (continuous, infinite):
------------------------------------------------------------------->
4-bit representation (only 16 values available):
--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*------------------>
Where do you place those 16 dots?
Uniform spacing (INT4): *--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*
Dense near zero (NF4): *******--*--*--*----*----*----*-----*------*
Logarithmic spacing (FP4): *****--*--*--*----*----*----*--------*--------*
Each choice optimises for different distributions of real-world data.
For neural networks, the data we need to represent - weights, activations, gradients - follows specific statistical distributions. Transformer weights are approximately normally distributed with small variance. Activations have heavier tails. Gradients span many orders of magnitude. The "right" number system is the one that best matches the statistical distribution of the data it represents.
1.2 Why Number Systems Matter for AI
The impact of number system choice on LLM development is immediate and quantifiable:
Memory impact (LLaMA-3 70B model, weights only):
| Format | Bytes/param | Total Weight Memory | GPU Requirement |
|---|---|---|---|
| FP32 | 4 | 280 GB | 4\times A100 80 GB |
| BF16 | 2 | 140 GB | 2\times A100 80 GB |
| INT8 | 1 | 70 GB | 1\times A100 80 GB |
| INT4 | 0.5 | 35 GB | 1\times RTX 4090 24 GB |
| INT2 | 0.25 | 17.5 GB | 1\times RTX 3090 24 GB |
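The bytes-per-parameter arithmetic behind this table is worth making concrete. A minimal sketch (the helper name and format dictionary are illustrative, not from any library):

```python
# Illustrative helper: weight-only memory for a model at different precisions,
# reproducing the table above (1 GB = 10^9 bytes).
FORMATS = {"FP32": 4.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5, "INT2": 0.25}

def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight-only memory in GB; excludes activations, KV cache, optimizer."""
    return n_params * bytes_per_param / 1e9

for name, bpp in FORMATS.items():
    print(f"{name}: {weight_memory_gb(70e9, bpp):.1f} GB")
```

Running this for 70e9 parameters reproduces the 280/140/70/35/17.5 GB column.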
Training cost impact:
- Training a 70B model in FP32: ~280 GB just for weights; optimizer states (Adam m and v) add another 560 GB; total ~840 GB - requires a full node of 8\times A100s or more
- Training the same model in BF16 mixed precision: ~140 GB weights + 560 GB optimizer (FP32) + 140 GB gradients; total ~840 GB still (optimizer dominates), but forward/backward pass is 2\times faster due to BF16 matmul throughput
- Inference in INT4: 35 GB; fits on a single consumer GPU; enables local AI on laptops
What goes wrong with the wrong number system:
- FP16 training without loss scaling -> gradient underflow -> weights stop updating -> training stalls silently
- BF16 accumulation instead of FP32 -> precision loss in gradient sums -> training diverges after millions of steps
- Naive INT8 quantization of transformers -> activation outliers clip -> catastrophic quality degradation
- INT4 quantization of first/last layers -> disproportionate quality loss -> model produces gibberish
The history of AI progress is partly a history of number system engineering:
EVOLUTION OF NUMBER FORMATS IN DEEP LEARNING
=======================================================================
2012 --- FP32 -------- AlexNet; all computation in single precision
|
2017 --- FP16 -------- NVIDIA Volta tensor cores; first hardware-accelerated
| reduced precision; required loss scaling
|
2018 --- BF16 -------- Google Brain introduces Brain Float 16; same range
| as FP32; no loss scaling needed; game changer
|
2020 --- Mixed ------- BF16 forward + FP32 master weights becomes standard
| Precision for all large-scale training
|
2022 --- FP8 --------- NVIDIA H100 adds FP8 tensor cores; 2\times throughput
| vs BF16; per-tensor scaling required
|
2023 --- INT4 -------- GPTQ, AWQ enable high-quality 4-bit inference;
| 70B models on consumer GPUs for the first time
|
2024 --- Ternary ----- BitNet b1.58: {-1,0,+1} weights trained from scratch;
| eliminates multiplication entirely
|
2024 --- FP8 Train --- DeepSeek-V3: entire training in FP8; commercial
| frontier model; massive cost reduction
|
2025-26 - MXFP/Sub-4 - Microscaling formats; sub-4-bit active research;
hardware co-designed with number formats
1.3 The Precision-Efficiency Frontier
The fundamental trade-off: more bits -> higher precision -> better model quality, but fewer bits -> less memory -> faster compute -> wider hardware access.
The engineering challenge is finding the minimum precision that preserves acceptable quality for each operation. The key insight is that different operations have vastly different precision requirements:
PRECISION REQUIREMENTS BY OPERATION
=======================================================================
More Bits Needed
^
|
| Gradient accumulation (FP32)
| # Small errors compound over millions
| of steps; must be high precision
|
| Optimizer states (FP32)
| # Adam m, v track subtle gradient
| statistics; precision critical
|
| Loss computation (FP32)
| # Cross-entropy involves log/exp;
| overflow/underflow risk
|
| Weight storage - training (BF16)
| # Weights change slowly; moderate
| precision sufficient
|
| Activation computation (BF16/FP8)
| # Errors localised per forward pass;
| don't compound across steps
|
| KV cache storage (INT8/INT4)
| # Small reconstruction error per token;
| acceptable quality trade-off
|
| Weight-only inference (INT4/INT2)
| # Weights static; error bounded;
| no accumulation across steps
|
v
Fewer Bits Needed
Quantitative precision requirements:
| Operation | Minimum Format | Machine Epsilon | Why This Precision |
|---|---|---|---|
| Gradient accumulation | FP32 | 2^-23 \approx 1.2\times10^-7 | Must resolve tiny updates over millions of steps |
| Optimizer states (Adam m, v) | FP32 | 2^-23 \approx 1.2\times10^-7 | Tracks subtle exponential moving averages of gradient statistics |
| Weight master copy | FP32 | 2^-23 \approx 1.2\times10^-7 | Single source of truth; must accumulate tiny updates |
| Forward/backward matmul | BF16 | 2^-7 \approx 7.8\times10^-3 | Errors don't compound across training steps |
| Inference weights | INT4 | n/a (step depends on scale) | Static; bounded error; no accumulation |
| KV cache | INT8/FP8 | n/a (per-tensor scale) | Per-token error; minor impact on generation quality |
1.4 Levels of Number System Usage in LLMs
A modern LLM uses multiple number systems simultaneously in different parts of the computation:
MIXED-PRECISION LLM ARCHITECTURE (2026 Standard)
=======================================================================
TRAINING:
+-----------------------------------------------------------------+
| |
| +------------------+ +------------------+ |
| | Master Weights | | Adam m (moment) | |
| | FP32 | | FP32 | |
| | (4 bytes/param) | | (4 bytes/param) | |
| +--------+---------+ +------------------+ |
| | cast |
| v +------------------+ |
| +------------------+ | Adam v (variance)| |
| | Working Weights | | FP32 | |
| | BF16 | | (4 bytes/param) | |
| | (2 bytes/param) | +------------------+ |
| +--------+---------+ |
| | |
| v |
| +------------------+ +------------------+ |
| | Forward Pass |---->| Activations | |
| | BF16 | | BF16 | |
| | matmul in BF16 | | (for backward) | |
| | accum in FP32 | +------------------+ |
| +--------+---------+ |
| | |
| v |
| +------------------+ +------------------+ |
| | Backward Pass |---->| Gradient Accum | |
| | BF16 | | FP32 | |
| | gradient compute| | (critical!) | |
| +------------------+ +--------+---------+ |
| | |
| v |
| +------------------+ |
| | Weight Update | |
| | FP32 | |
|        | \theta <- \theta - \eta*m/sqrt(v)|                          |
| +------------------+ |
| |
| Total per parameter: 4 + 4 + 4 + 2 = 14 bytes (FP32 master |
| + FP32 Adam m + FP32 Adam v + BF16 working copy) |
| |
+-----------------------------------------------------------------+
INFERENCE:
+-----------------------------------------------------------------+
| |
| +------------------+ +------------------+ |
| | Weights | | KV Cache | |
| | INT4 (GPTQ/AWQ) | | INT8 or FP8 | |
| | 0.5 bytes/param | | 1 byte/scalar | |
| | dequant -> BF16 | | | |
| +--------+---------+ +------------------+ |
| | |
| v |
| +------------------+ +------------------+ |
| | Matmul | | Softmax | |
| | BF16 compute | | FP32 (stable) | |
| | FP32 accum | | | |
| +------------------+ +------------------+ |
| |
+-----------------------------------------------------------------+
1.5 Historical Timeline
| Year | Event | Significance for AI |
|---|---|---|
| 1940s | ENIAC: 10-digit decimal fixed-point | First electronic general-purpose computer; decimal arithmetic |
| 1954 | IBM 704: first commercial floating-point hardware (36-bit) | Floating-point becomes accessible; enables scientific computing |
| 1985 | IEEE 754 standard published: defines FP32 and FP64 | Universal standard; all modern hardware and software agrees on FP format |
| 2008 | IEEE 754-2008 revision: adds FP16 (binary16) | Half-precision officially standardised; enables GPU compute |
| 2012 | AlexNet wins ImageNet using FP32 on GPUs | Deep learning revolution begins; all training in FP32 |
| 2017 | NVIDIA Volta GPU: FP16 tensor cores | First hardware acceleration for reduced precision ML; 8\times throughput vs FP32 |
| 2018 | Google Brain introduces BF16 (Brain Float 16) | Same exponent range as FP32 with 7-bit mantissa; eliminates loss scaling |
| 2018 | Mixed precision training paper (Micikevicius et al.) | FP16 forward + FP32 master weights formalised as standard practice |
| 2019 | INT8 inference widely deployed | LLM.int8() predecessor methods; production quantization begins |
| 2020 | A100 GPU: native BF16 tensor cores | BF16 becomes the default training format for all large models |
| 2022 | FP8 proposed for training (Micikevicius et al. 2022) | Two complementary 8-bit formats: E4M3 (precision) and E5M2 (range) |
| 2022 | NVIDIA H100: first hardware with FP8 tensor cores | 4\times throughput vs BF16; enables FP8 training at scale |
| 2023 | GPTQ, AWQ: INT4 post-training quantization for LLMs | 70B models on consumer GPUs; democratises large model access |
| 2023 | QLoRA introduces NF4 (Normal Float 4) | Quantile-based 4-bit format optimal for normally-distributed weights |
| 2023 | OCP Microscaling (MX) standard published | Industry-wide block floating-point standard (AMD, ARM, Intel, Meta, Microsoft, NVIDIA, Qualcomm) |
| 2024 | BitNet b1.58: ternary weights {-1,0,+1} (1.58 bits) | Eliminates multiply; addition/subtraction only; trained from scratch |
| 2024 | DeepSeek-V3: FP8 training at scale | Commercial frontier model trained entirely in FP8; massive cost reduction |
| 2025 | NVIDIA B200 (Blackwell): MXFP8 native support | Hardware designed around block floating-point; format and silicon co-evolution |
| 2025-26 | FP8 training standard; sub-4-bit quantization active research | Industry converges on FP8 for training; INT4/INT2/ternary for inference |
1.6 The Role of Hardware
Number systems are not purely mathematical abstractions - they are hardware capabilities. A format that has no hardware support executes in software at full precision, gaining no speed advantage.
GPU tensor core throughput by format (NVIDIA H100 SXM):
| Format | Throughput | Relative to FP32 | Hardware Unit |
|---|---|---|---|
| FP64 | 67 TFLOPS | 1\times | CUDA cores (double precision) |
| FP32 | 67 TFLOPS | 1\times | CUDA cores |
| TF32 | 989 TFLOPS | 15\times | Tensor cores |
| BF16 | 989 TFLOPS | 15\times | Tensor cores |
| FP16 | 989 TFLOPS | 15\times | Tensor cores |
| FP8 | 1,979 TFLOPS | 30\times | Tensor cores |
| INT8 | 1,979 TOPS | 30\times | Tensor cores |
The throughput gap is enormous: FP8 operations are 30\times faster than FP32 on the same hardware. This means a format change from FP32 to FP8 can potentially deliver a 30\times speedup for matmul-bound operations - far more impactful than any algorithmic optimisation.
Hardware design co-evolves with number format needs:
- NVIDIA added BF16 tensor cores to A100 because the ML community needed wider dynamic range than FP16
- NVIDIA added FP8 tensor cores to H100 specifically because 8-bit training research showed viability
- NVIDIA added MXFP support to B200 because block floating-point emerged as the optimal scaling strategy
- Google designed TPU v2+ with native BF16 from the start - BF16 was literally invented for TPUs
The implication: when choosing a number format, you must check whether your target hardware has native support. Running FP8 arithmetic on A100 (which has no FP8 tensor cores) gains nothing - the computation falls back to BF16 or FP32.
HARDWARE-FORMAT SUPPORT MATRIX
=======================================================================
| FP64 | FP32 | TF32 | BF16 | FP16 | FP8 | INT8 | INT4
==============+======+======+======+======+======+======+======+=====
V100 (2017) | OK | OK | NO | NO | OKTC | NO | NO | NO
A100 (2020) | OK | OK | OKTC | OKTC | OKTC | NO | OKTC | NO
H100 (2022) | OK | OK | OKTC | OKTC | OKTC | OKTC | OKTC | NO
B200 (2025) | OK | OK | OKTC | OKTC | OKTC | OKTC | OKTC | OKTC
RTX 4090 | OK | OK | OKTC | OKTC | OKTC | OKTC | OKTC | OKTC
TC = Tensor Core accelerated (high throughput)
OK = Supported via CUDA cores (standard throughput)
NO = Not supported at hardware level
2. Positional Number Systems
2.1 The Positional Representation Principle
A positional number system represents any number as a weighted sum of powers of a fixed base (or radix) b:

x = \sum_{i=-m}^{n} d_i \times b^i, with digits 0 \le d_i \le b - 1

where:
- d_i: the digit at position i
- n: the position of the most significant digit
- The radix point separates the integer part (i \ge 0) from the fractional part (i < 0)
- A real number may require infinitely many fractional digits to represent exactly
The value of a digit depends on its position - this is what "positional" means. The digit 3 in position 2 of a base-10 number represents 3 \times 10^2 = 300, not simply 3.
Example in base 10:
POSITIONAL VALUE IN BASE 10
=======================================================================
Position: 3 2 1 0 . -1 -2
Power:      10^3   10^2   10^1   10^0     10^-1  10^-2
Weight: 1000 100 10 1 0.1 0.01
Digit: 4 7 2 5 . 3 8
Value: 4000 700 20 5 0.3 0.08
= 4725.38
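The table above can be checked in a couple of lines - a sketch evaluating the same weighted sum directly:

```python
# Evaluate 4725.38 as a weighted sum of powers of 10, mirroring the table above.
digits = {3: 4, 2: 7, 1: 2, 0: 5, -1: 3, -2: 8}  # position -> digit
value = sum(d * 10.0 ** p for p, d in digits.items())
print(value)  # ~4725.38 (up to binary floating-point rounding)
```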
Why positional systems matter for computing: every digital computer uses a positional system (base 2) because transistors have two states. Understanding how positional encoding works is prerequisite to understanding how floating-point numbers allocate bits.
2.2 Binary (Base 2)
Binary is the foundational number system for all digital computation:
- Base b = 2; digits {0, 1}; each digit is one bit (binary digit)
- Natural representation for digital hardware: a transistor is either on (1) or off (0)
- Every number stored in a computer - every weight, every activation, every token ID - is ultimately a binary string
Integer binary representation: (b_{n-1} ... b_1 b_0)_2 = \sum_{i=0}^{n-1} b_i \times 2^i
Worked example:
BINARY TO DECIMAL CONVERSION (1011_2 -> 11_10)
=======================================================================
Bit position: 3 2 1 0
Power of 2: 2^3 2^2 2^1 2^0
Weight: 8 4 2 1
Bit value: 1 0 1 1
Contribution: 8 0 2 1 -> Total: 11_10
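The worked conversion can be verified directly, both from the weighted sum and with Python's built-in base parsing:

```python
# Verify the worked example: 1011_2 -> 11_10.
bits = [1, 0, 1, 1]  # most significant bit first
value = sum(b * 2 ** (len(bits) - 1 - i) for i, b in enumerate(bits))
assert value == int("1011", 2)  # Python parses base-2 strings natively
print(value)  # 11
```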
Binary fractions extend the same principle to negative powers of 2: (0.b_1 b_2 b_3 ...)_2 = \sum_{i \ge 1} b_i \times 2^-i. For example, 0.101_2 = 1/2 + 1/8 = 0.625.
The critical limitation - infinite binary expansions:
Most decimal fractions have infinite binary representations. This is the root cause of the floating-point "errors" that every programmer encounters:

0.1_10 = 0.000110011001100..._2 (the bit pattern 0011 repeats forever)

This is not a bug - it is a fundamental mathematical fact. The number 0.1 cannot be represented exactly in any finite number of binary digits, just as 1/3 cannot be represented exactly in any finite number of decimal digits.
AI implication: when a learning rate is set to lr = 0.001, the actual value stored in memory is the nearest binary floating-point approximation, not exactly 0.001. For FP32, this is approximately 0.001000000047 - close enough that it doesn't matter, but the principle underlies all numerical precision analysis.
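The inexactness is easy to observe. Python floats are FP64 (binary64), but the phenomenon is identical to the FP32 case described above:

```python
from decimal import Decimal

# The float literal 0.1 stores the nearest binary value, not exactly 1/10.
# Decimal(float) exposes the exact value actually stored.
print(Decimal(0.1))      # 0.1000000000000000055511151231257827021181583404541015625
print(0.1 + 0.2 == 0.3)  # False: both operands carry binary rounding error
```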
Common powers of 2 (essential for parameter counting and memory estimation):
| Power | Value | Common Usage |
|---|---|---|
| 2^10 | 1,024 \approx 1K | Kilobyte |
| 2^20 | 1,048,576 \approx 1M | Megabyte |
| 2^30 | 1,073,741,824 \approx 1B | Gigabyte |
| 2^32 | 4,294,967,296 | Number of UINT32 values; max token count for most frameworks |
| 2^40 | 1,099,511,627,776 \approx 1T | Terabyte; large training datasets |
2.3 Hexadecimal (Base 16)
Hexadecimal (hex) provides a compact human-readable representation of binary data:
- Base b = 16; digits {0-9, A, B, C, D, E, F}, where A = 10 through F = 15
- Each hex digit corresponds to exactly 4 bits - this is why hex is used
- A 32-bit value requires only 8 hex digits instead of 32 binary digits
- Standard for representing memory addresses, byte patterns, and floating-point bit representations
Conversion - binary to hex: group binary digits in groups of 4 from the right; convert each group:
BINARY-HEX CONVERSION TABLE
=======================================================================
Binary | Hex | Decimal Binary | Hex | Decimal
-------+-----+--------- -------+-----+---------
0000 | 0 | 0 1000 | 8 | 8
0001 | 1 | 1 1001 | 9 | 9
0010 | 2 | 2 1010 | A | 10
0011 | 3 | 3 1011 | B | 11
0100 | 4 | 4 1100 | C | 12
0101 | 5 | 5 1101 | D | 13
0110 | 6 | 6 1110 | E | 14
0111 | 7 | 7 1111 | F | 15
AI usage: when inspecting raw model weight files (.safetensors, .bin), memory dumps, or CUDA kernel outputs, values are displayed in hex. For example, the FP32 bit pattern for 1.0 is 0x3F800000.
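A bit pattern like this can be extracted with the standard library - a sketch using struct to reinterpret an FP32 value's bytes as a 32-bit unsigned integer (the helper name is illustrative):

```python
import struct

def fp32_bits(x: float) -> str:
    """Hex dump of the IEEE 754 FP32 encoding of x (big-endian)."""
    (u,) = struct.unpack(">I", struct.pack(">f", x))
    return f"0x{u:08X}"

print(fp32_bits(1.0))   # 0x3F800000
print(fp32_bits(-2.0))  # 0xC0000000
```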
2.4 Two's Complement for Signed Integers
Two's complement is the universal standard for representing negative integers in hardware:
Principle: represent negative numbers using modular arithmetic. For an n-bit system:
- Non-negative: 0 to 2^(n-1) - 1; represented in standard binary
- Negative: -x is represented as 2^n - x
- Or equivalently: flip all bits, then add 1
Range: [-2^(n-1), 2^(n-1) - 1]
| Bit Width | Range | AI Usage |
|---|---|---|
| INT8 (8-bit) | [-128, 127] | Quantized weights, activations |
| INT16 (16-bit) | [-32,768, 32,767] | Token IDs (UINT16 covers vocabularies < 65K) |
| INT32 (32-bit) | [-2^31, 2^31 - 1] | Accumulator for INT8 matmul |
| INT64 (64-bit) | [-2^63, 2^63 - 1] | Dataset sizes, file offsets |
Worked example - representing -42 in INT8:
STEP 1: Start with +42
42_10 = 00101010_2
STEP 2: Flip all bits (one's complement)
00101010 -> 11010101
STEP 3: Add 1
11010101 + 1 = 11010110
RESULT: -42_10 = 11010110_2
VERIFICATION: 42 + (-42) should equal 0 (mod 256)
00101010
+ 11010110
---------
100000000 -> Discard carry bit -> 00000000 = 0 OK
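The three steps above collapse to a single mask operation in Python, where integers are unbounded and `& 0xFF` keeps the low 8 bits:

```python
# Two's complement of -42 in 8 bits: Python's -42 & 0xFF computes 256 - 42.
neg42 = -42 & 0xFF
print(f"{neg42:08b}")  # 11010110, matching the worked example

# Verification: 42 + (-42) = 0 modulo 256 (the discarded carry).
assert (42 + neg42) & 0xFF == 0
```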
Key properties for hardware design:
- Same hardware for signed and unsigned addition - the adder doesn't need to know about signs; two's complement arithmetic "just works" with the same circuit
- Asymmetry: there is one more negative value than positive - -128 has no positive counterpart in INT8. This matters for symmetric quantization: the symmetric range [-127, 127] wastes one code point
- Overflow detection: carry into the sign bit \neq carry out of the sign bit indicates overflow
AI relevance - quantization: when quantizing FP32 weights to INT8, we map continuous values to [-128, 127]. Symmetric quantization uses [-127, 127] (wastes one level) for simplicity; asymmetric uses all 256 levels for better coverage.
2.5 Fixed-Point Representation
Fixed-point represents fractional numbers with a fixed number of bits allocated to the integer and fractional parts:
Format Qm.n: m integer bits + n fractional bits; total 1 + m + n bits (including the sign bit in two's complement)
Value interpretation: read the bit pattern as a two's complement integer and divide by 2^n:

value = (two's complement integer value of the bits) / 2^n

Properties:
- Range: [-2^m, 2^m - 2^-n]
- Precision: uniform spacing of 2^-n across the entire range - every representable value is exactly 2^-n apart from its neighbours
- No exponent: unlike floating-point, there is no exponent field; the radix point position is fixed and implicit
Example - Q3.4 format (8-bit):
Q3.4 FIXED-POINT (8-BIT)
=======================================================================
Bit layout: [S][I_2 I_1 I_0][F_3 F_2 F_1 F_0]
| | |
Sign Integer Fraction
(1) (3 bits) (4 bits)
Range: [-8, 7.9375] (= [-8, 8 - 1/16])
Resolution: 1/16 = 0.0625
Example: 01011100_2
Sign = 0 (positive)
Integer = 101_2 = 5
Fraction = 1100_2 = 12/16 = 0.75
Value = 5.75
REPRESENTABLE VALUES NEAR ZERO:
... -0.1875 -0.125 -0.0625 0 0.0625 0.125 0.1875 ...
| | | | | | |
Uniform spacing of 0.0625 everywhere
Contrast with floating-point: fixed-point has uniform spacing everywhere - the gap between representable values near zero is the same as between large values. Floating-point has non-uniform spacing - very dense near zero, sparse for large values. For neural network weights (which cluster near zero), floating-point is generally better.
Fixed-point in AI:
- Hardware quantization: INT8 with an implicit scale factor is effectively fixed-point: real_value \approx scale \times int8_value
- DSP and edge inference: some microcontrollers (ARM Cortex-M) lack floating-point units entirely; all neural network inference runs in fixed-point
- Advantages: simpler hardware; exact representation for dyadic fractions (k/2^n); deterministic timing
- Disadvantages: cannot represent values spanning many orders of magnitude simultaneously; poor for holding tiny gradients and much larger losses in the same format
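A minimal encode/decode sketch for the Q3.4 example from above (function names are illustrative; real quantizers add calibration and per-channel scales):

```python
# Q3.4 fixed-point (8-bit signed): encode by scaling and rounding,
# decode by dividing back out. Resolution is 1/16 = 0.0625.
FRAC_BITS = 4
SCALE = 1 << FRAC_BITS  # 16

def q34_encode(x: float) -> int:
    raw = round(x * SCALE)
    return max(-128, min(127, raw))  # saturate to the 8-bit signed range

def q34_decode(raw: int) -> float:
    return raw / SCALE

print(q34_decode(q34_encode(5.75)))  # 5.75: exactly representable (92/16)
print(q34_decode(q34_encode(0.1)))   # 0.125: nearest multiple of 0.0625
```

Note how 0.1 lands on the nearest grid point - the uniform 2^-4 spacing is the whole story, unlike floating-point's magnitude-dependent spacing described in the next section.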
3. IEEE 754 Floating-Point Standard
3.1 Floating-Point Representation Principle
Floating-point is scientific notation for binary. Instead of allocating a fixed number of bits to the integer and fractional parts, floating-point uses a mantissa (significand) and an exponent to represent numbers across a vast range of magnitudes with consistent relative precision:

x = (-1)^s \times m \times 2^e

where:
- Sign bit s: 0 for positive, 1 for negative
- Mantissa (significand) m: normalised so that 1 \le m < 2 - the leading digit is always 1 (and therefore can be stored implicitly - the "hidden bit")
- Exponent e: a signed integer, stored with a bias so that both negative and positive exponents can be represented as unsigned integers
The key property: floating-point provides the same number of significant bits regardless of value magnitude. A tiny number and a huge number both have the same relative precision. This is exactly what neural networks need - weights near zero and weights near the maximum both get the same relative accuracy.
Contrast with fixed-point:
FIXED-POINT vs FLOATING-POINT DISTRIBUTION
=======================================================================
Fixed-point (Q7.8, 16-bit):
Representable values uniformly spaced by 1/256:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
0 128
Near zero: same spacing 1/256 -> relative precision: high
Near 128: same spacing 1/256 -> relative precision: low
Floating-point (FP16):
Representable values DENSE near zero, SPARSE far from zero:
||||||||| | | | | | | | | | | | | | | |
0 65504
Near zero: spacing ~2^-24 -> very fine resolution
Near 65504: spacing ~32 -> coarse but same relative precision
Same NUMBER of representable values between 1 and 2
as between 1024 and 2048 (mantissa bits = 10 for FP16)
3.2 IEEE 754 FP32 (Single Precision)
FP32 has been the default numerical format for deep learning since the field began using GPUs. Understanding its bit layout is essential.
Bit layout: 32 bits total
FP32 BIT LAYOUT (32 BITS)
=======================================================================
Bit: 31 30--------23 22------------------------------------0
+--+------------+------------------------------------------+
| s| Exponent | Mantissa |
| | (8 bits) | (23 bits) |
+--+------------+------------------------------------------+
1 8 bits 23 bits = 32 bits
Decoding formula (normal numbers):

x = (-1)^s \times (1.m)_2 \times 2^(E - 127)

where:
- E is the stored exponent (unsigned 8-bit integer, 0 \le E \le 255)
- Exponent bias = 127: the true exponent is e = E - 127, giving e \in [-126, 127] for normal numbers
- E = 0 and E = 255 are reserved for special values (see 3.3)
- The leading 1 before the mantissa is implicit (not stored) - this gives 24 bits of significand precision using only 23 stored bits
Numerical properties:
| Property | Value |
|---|---|
| Total bits | 32 |
| Sign bits | 1 |
| Exponent bits | 8 |
| Mantissa bits | 23 (24 with implicit leading 1) |
| Exponent bias | 127 |
| Exponent range (true) | -126 to +127 |
| Max positive value | (2 - 2^-23) \times 2^127 \approx 3.4\times10^38 |
| Min positive normal | 2^-126 \approx 1.18\times10^-38 |
| Machine epsilon | 2^-23 \approx 1.19\times10^-7 |
| Decimal precision | ~7.2 significant digits |
| Min positive subnormal | 2^-149 \approx 1.4\times10^-45 |
Machine epsilon (\epsilon) is the gap between 1.0 and the next representable value. For FP32, \epsilon = 2^-23 \approx 1.19\times10^-7. This determines the relative precision - when you add a value smaller than \epsilon/2 to 1.0, the result rounds back to 1.0 and the addition is lost.
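NumPy exposes these constants directly, which makes the round-back behaviour easy to demonstrate:

```python
import numpy as np

# Machine epsilon for FP32: the gap between 1.0 and the next representable value.
eps = np.finfo(np.float32).eps
print(eps)  # 1.1920929e-07, i.e. exactly 2**-23

# Adding a value well below eps to 1.0 rounds straight back to 1.0:
print(np.float32(1.0) + np.float32(1e-8) == np.float32(1.0))  # True
```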
Worked example - decoding a bit pattern:
Decode: 0 10000001 01000000000000000000000
STEP-BY-STEP FP32 DECODING
=======================================================================
Given: 0 10000001 01000000000000000000000
Step 1: Sign bit
s = 0 -> positive
Step 2: Exponent
E = 10000001_2 = 128 + 1 = 129
True exponent: e = 129 - 127 = 2
Step 3: Mantissa (implicit leading 1)
m = 1.01000000000000000000000_2
= 1 + 0\times2^-1 + 1\times2^-2 + 0\times2^-3 + ...
= 1 + 0.25
= 1.25
Step 4: Combine
x = (-1)^0 \times 1.25 \times 2^2
= 1 \times 1.25 \times 4
= 5.0
ANSWER: The bit pattern represents 5.0
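The walkthrough can be double-checked by handing the same bit pattern to the hardware decoder via struct:

```python
import struct

# The bit pattern from the walkthrough: sign 0, exponent 10000001, mantissa 01...
bits = 0b0_10000001_01000000000000000000000  # 32 bits total
(x,) = struct.unpack(">f", struct.pack(">I", bits))
print(x)  # 5.0
```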
AI usage of FP32:
- Training master weights: always stored in FP32; the authoritative copy of all parameters
- Gradient accumulation: partial gradient sums accumulated in FP32 to prevent precision loss
- Optimizer states: Adam m (first moment) and v (second moment) stored in FP32
- Loss and metrics: cross-entropy loss computed in FP32 to avoid overflow in exp/log
- Cannot be used alone for large model training: LLaMA-3 70B in pure FP32 requires ~280 GB for weights alone
3.3 Special Values in IEEE 754
IEEE 754 reserves certain exponent-mantissa combinations for special mathematical values:
IEEE 754 SPECIAL VALUES
=======================================================================
Exponent E | Mantissa m | Value | Purpose
===========+============+====================+==========================
0 | 0 | \pm0 | Signed zero; +0 = -0
0 | \neq 0 | \pmsubnormal | Gradual underflow
1-254 | any | \pmnormal number | Standard representation
255 | 0 | \pm\infty (infinity) | Overflow result
255 | \neq 0 | NaN (Not a Number) | Invalid operations
Zero (\pm0):
- Exponent E = 0, mantissa m = 0; sign bit determines +0 or -0
- +0 == -0 in all comparisons (IEEE 754 mandates this)
- Both exist because the sign matters for some functions (e.g., 1/+0 = +\infty but 1/-0 = -\infty)
- AI relevance: ReLU outputs exact zeros; signed zero rarely matters in practice
Subnormal (denormal) numbers:
- Exponent E = 0, mantissa m \neq 0
- Value: x = (-1)^s \times (0.m)_2 \times 2^-126 (no implicit leading 1; exponent fixed at -126)
- Purpose: gradual underflow - fills the gap between zero and the smallest normal number
- Without subnormals: there would be a gap from 0 to 2^-126 with nothing in between
- With subnormals: smallest representable positive value = 2^-149 \approx 1.4\times10^-45
- Hardware cost: subnormal arithmetic is slower on GPUs (10-100\times penalty); many GPU "fast-math" modes flush subnormals to zero (--ftz=true)
- AI implication: gradients below the subnormal threshold become exactly zero; this is a form of gradient underflow
Infinity (\pm\infty):
- Exponent E = 255, mantissa m = 0
- Result of overflow (e.g. exp of a large value) or division of a nonzero number by zero (1/0 = +\infty)
- Arithmetic with infinity: \infty + x = \infty; x/\infty = 0 (for finite x); \infty - \infty = NaN
- AI relevance: loss explosion during training produces \infty; must detect and handle (skip step, reduce learning rate)
NaN (Not a Number):
- Exponent E = 255, mantissa m \neq 0
- Two types:
- Quiet NaN (qNaN): propagates silently through all arithmetic - NaN + x = NaN, NaN \times 0 = NaN, every comparison with NaN is false. Extremely dangerous in training - loss appears to be a number but all weights are corrupted
- Signalling NaN (sNaN): raises a hardware exception when used in arithmetic; useful for debugging but rarely used in GPU code
- Produced by: 0/0, \infty - \infty, sqrt(-1), log(-1)
- AI relevance: NaN in loss -> all subsequent gradients are NaN -> all weights become NaN -> model destroyed. Must check for NaN explicitly; PyTorch torch.autograd.detect_anomaly() helps
Ordering of all special values: -\infty < negative normals < -0 = +0 < positive normals < +\infty
NaN is unordered - NaN \neq NaN (NaN is not equal to itself!). This is the standard way to check for NaN: x != x is true only if x is NaN.
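The self-inequality test takes one line to verify:

```python
import math

nan = float("nan")
print(nan == nan)       # False: NaN is not equal to itself
print(nan != nan)       # True: the classic x != x NaN test
print(math.isnan(nan))  # True: the more readable stdlib equivalent
```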
3.4 FP32 Arithmetic Properties
Floating-point arithmetic violates several "obvious" algebraic properties that hold for real numbers. These violations have direct consequences for neural network training:
1. Associativity failure - (a + b) + c \neq a + (b + c) in general:

In FP32, (1.0 + 10^8) - 10^8 = 0.0, because 1.0 + 10^8 rounds to 10^8. Near 10^8 the gap between adjacent FP32 values is 8, so 1.0 IS below the rounding threshold. The intermediate result is 10^8, and then 10^8 - 10^8 = 0.
But:
1.0 + (10^8 - 10^8) = 1.0 + 0.0 = 1.0
Here 10^8 - 10^8 = 0 is exact, and 1.0 + 0 = 1.0 is exact.
Result: (1.0 + 10^8) - 10^8 = 0 but 1.0 + (10^8 - 10^8) = 1. The "correct" mathematical answer is 1.
AI impact: multi-GPU training performs parallel reduction (summing gradients across GPUs) in different orders depending on timing -> different results each run. This is why multi-GPU training is non-deterministic even with the same seed.
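The associativity failure can be reproduced with NumPy float32 scalars (1e8 is exactly representable in FP32, and its neighbouring values are 8 apart):

```python
import numpy as np

a, b, c = np.float32(1.0), np.float32(1e8), np.float32(-1e8)
print((a + b) + c)  # 0.0: 1.0 + 1e8 rounds back to 1e8 (ulp near 1e8 is 8)
print(a + (b + c))  # 1.0: b + c cancels exactly, leaving a untouched
```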
2. Commutativity - always holds: and in IEEE 754. The order of two operands does not matter. (But the order of three or more does, because associativity fails.)
3. Distributivity failure - a \times (b + c) \neq a \times b + a \times c in general. Each intermediate operation rounds independently, accumulating different errors.
3.5 Rounding Modes
IEEE 754 defines five rounding modes. The choice of rounding mode affects the bias and variance of numerical errors:
1. Round to Nearest, Ties to Even (RNE) - the default:
- Round to the nearest representable value
- When the value is exactly midway between two representable values, round to the one whose last mantissa bit is 0 (even)
- Why ties-to-even? Eliminates statistical bias - ties-to-up would systematically inflate values over many operations; ties-to-even has zero expected bias
- This is the default for FP32, BF16, and all standard GPU computation
2. Round Toward Zero (Truncation):
- Simply discard all bits beyond the mantissa width
- Always rounds toward zero: positive values round down, negative values round up
- Used in: integer conversion (int(3.7) = 3); some fixed-point hardware
3. Round Toward +\infty (Ceiling):
- Always round up (toward positive infinity)
- Used in interval arithmetic for computing upper bounds
4. Round Toward -\infty (Floor):
- Always round down (toward negative infinity)
- Used in interval arithmetic for computing lower bounds
5. Round to Nearest, Ties Away from Zero:
- What most people think "normal" rounding is: 0.5 always rounds up
- NOT the IEEE default; less common in hardware
- Introduces slight positive bias over many operations
AI-specific: Stochastic Rounding:
- Not part of IEEE 754 but increasingly important for low-precision training
- Round up or down randomly with probability proportional to the fractional position
- For unit spacing: P(round up to \lceil x \rceil) = x - \lfloor x \rfloor; P(round down to \lfloor x \rfloor) = \lceil x \rceil - x
- Unbiased: E[round(x)] = x - preserves small gradient updates in expectation
- Used in: some FP8 training implementations; Graphcore IPU hardware
- Cost: requires random number generation per operation
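A minimal sketch of stochastic rounding to the integer grid (the function name is illustrative; real FP8 implementations round to the target format's grid instead):

```python
import numpy as np

def stochastic_round(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Round up with probability equal to the fractional part of x."""
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))

rng = np.random.default_rng(0)
x = np.full(100_000, 0.1)
print(stochastic_round(x, rng).mean())  # close to 0.1: unbiased in expectation
# Deterministic round-to-nearest would give 0.0 for every element,
# silently dropping the small update entirely.
```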
3.6 Catastrophic Cancellation
The problem: subtracting two nearly equal floating-point numbers catastrophically reduces the number of significant bits in the result.
Mechanism:
Consider a = 1.2345678 and b = 1.2345671 stored in FP32 (which has ~7 decimal digits of precision):

a - b = 0.0000007 = 7\times10^-7

Both a and b have 7 significant digits. Their difference has only 1 significant digit - the other 6 significant digits cancelled. The result is represented with the remaining mantissa bits filled with garbage (the hardware doesn't know what the true value would have been beyond the stored precision).
Formal statement: if a and b agree in their first k significant bits, then a - b has at most 24 - k meaningful significand bits in FP32 (24-bit significand):
- If a and b agree to 20 bits: a - b has only 4 significant bits
- If a and b agree to 23 bits: a - b has only 1 significant bit
- If a and b agree to all 24 bits: a - b = 0 (complete cancellation)
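The effect shows up immediately in FP32 (assuming NumPy is available); the difference of two values that agree to ~7 digits carries a relative error orders of magnitude larger than FP32's usual ~1e-7:

```python
import numpy as np

# a and b agree in their first 7 significant digits; their true difference
# is 7e-7, but the FP32 subtraction keeps only the bits that did not cancel.
a = np.float32(1.2345678)
b = np.float32(1.2345671)
diff = a - b
print(diff)  # an exact multiple of 2**-23, noticeably off from 7e-7
print(abs(float(diff) - 7e-7) / 7e-7)  # relative error ~1e-2, not ~1e-7
```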
Where cancellation occurs in neural networks:
- Attention logit differences: softmax computes x_i - max_j(x_j) where logits are close to each other
- Layer normalisation: x - \mu where x is close to the mean \mu
- Gradient computation: chain rule products where terms nearly cancel
- Residual connections: followed by subtraction in backward pass
Mitigations:
- Log-sum-exp trick for softmax (8.4): avoids computing exp of very large/small numbers
- Kahan summation (3.7): maintains error correction term to recover lost precision
- RMSNorm instead of LayerNorm (8.6): avoids mean subtraction entirely
- Reordering operations: rewrite expressions algebraically to avoid subtracting nearly equal values, e.g. compute sqrt(x+1) - sqrt(x) as 1/(sqrt(x+1) + sqrt(x)) for large x
3.7 Kahan Summation Algorithm
The problem: summing n floating-point numbers with naive sequential addition accumulates rounding error of O(n\epsilon), where \epsilon is machine epsilon. For a training step with millions of gradient contributions, this error can become significant.
Kahan's solution (1965): maintain a running compensation term that captures the rounding error from each addition and feeds it back into the next:
KAHAN SUMMATION ALGORITHM
=======================================================================
Input: values x_1, x_2, ..., x_n
Output: sum with O(ε) error (instead of O(n·ε))
sum = 0.0   // Running total
c   = 0.0   // Compensation for lost low-order bits
for each x_i:
    y = x_i - c        // Compensate: add back what was lost last time
    t = sum + y        // Tentative new sum (rounding happens here)
    c = (t - sum) - y  // Recover rounding error: what was lost
    sum = t            // Update running total
return sum
How it works - step by step with FP32 (7 decimal digits):
KAHAN SUMMATION TRACE
=======================================================================
Sum: [1.0, 1e-7, 1e-7, 1e-7, 1e-7]
NAIVE SUMMATION:
sum = 1.0
sum = 1.0 + 1e-7 = 1.0000001 <- barely fits in FP32
sum = 1.0000001 + 1e-7 = 1.0000002 <- may round
sum = 1.0000002 + 1e-7 = 1.0000003 <- each step loses precision
sum = 1.0000003 + 1e-7 = 1.0000004
Result: 1.0000004 (limited by FP32 precision near 1.0)
KAHAN SUMMATION:
Step 1: x_1 = 1.0
y = 1.0 - 0 = 1.0
t = 0 + 1.0 = 1.0
c = (1.0 - 0) - 1.0 = 0 // no error
sum = 1.0
Step 2: x_2 = 1e-7
y = 1e-7 - 0 = 1e-7
t = 1.0 + 1e-7 = 1.0000001
c = (1.0000001 - 1.0) - 1e-7 // recovers rounding error
sum = 1.0000001
Step 3: x_3 = 1e-7
y = 1e-7 - c // compensates for error from step 2
t = 1.0000001 + y
c = (t - 1.0000001) - y // captures new error
sum = t
... continues, accumulating with compensation ...
Result: more accurate than naive sum
Error analysis:
- Naive summation: error $= O(n\epsilon)$ - grows linearly with the number of terms
- Kahan summation: error $= O(\epsilon)$ - independent of $n$; a dramatic improvement for large sums
- Cost: approximately 4 FLOPs per element instead of 1; worth the overhead for numerical stability
AI applications:
- Gradient accumulation across micro-batches: sum thousands of small gradient contributions
- Loss computation over large batches: sum cross-entropy losses across many tokens
- Weight norm computation: sum of squares of millions of parameters
- In practice: PyTorch and JAX gradient accumulation uses FP32 (which provides sufficient precision without Kahan); Kahan is used in specialised scenarios where even FP32 is insufficient
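The trace above can be reproduced directly. A minimal FP32 demonstration, using numpy to force float32 rounding at every step (the test values are illustrative):

```python
import numpy as np

def naive_sum(values):
    s = np.float32(0.0)
    for x in values:
        s = np.float32(s + x)        # each addition rounds to FP32
    return s

def kahan_sum(values):
    s = np.float32(0.0)
    c = np.float32(0.0)              # compensation for lost low-order bits
    for x in values:
        y = np.float32(x - c)        # add back what was lost last time
        t = np.float32(s + y)        # tentative sum (rounding happens here)
        c = np.float32(np.float32(t - s) - y)  # recover the rounding error
        s = t
    return s

# 1.0 plus ten thousand tiny contributions, each below FP32's half-ulp at 1.0
vals = [np.float32(1.0)] + [np.float32(1e-8)] * 10000
```

Here `naive_sum(vals)` stays at exactly 1.0 (every 1e-8 is rounded away), while `kahan_sum(vals)` recovers the true total of ~1.0001.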
4. Floating-Point Formats for AI
This section covers every floating-point format relevant to modern AI systems, from the rarely-used FP64 down to the frontier FP8 formats that power 2024-2026 era training.
4.1 FP64 (Double Precision)
Bit layout: 64 = 1 sign + 11 exponent + 52 mantissa
| Property | Value |
|---|---|
| Exponent bias | 1023 |
| Exponent range (true) | $-1022$ to $+1023$ |
| Max value | $\approx 1.8 \times 10^{308}$ |
| Min positive normal | $\approx 2.2 \times 10^{-308}$ |
| Machine epsilon | $2^{-52} \approx 2.2 \times 10^{-16}$ |
| Decimal precision | ~15-17 significant digits |
AI relevance - mostly irrelevant for neural networks:
- GPU throughput: A100 FP64 tensor core = 19.5 TFLOPS vs FP32 = 19.5 TFLOPS (scalar) vs BF16 = 312 TFLOPS (tensor core). FP64 is 16× slower than BF16.
- No neural network operation requires 15-digit precision. The signal-to-noise ratio of gradient estimates (due to mini-batch sampling) is far larger than FP32 precision errors.
- Niche uses: eigenvalue decomposition in research; high-precision statistical hypothesis tests; numerical ODE/SDE solvers for diffusion models; Gram matrix condition number analysis
- Rule of thumb: if you think you need FP64 for neural networks, you probably have a numerical stability bug that should be fixed structurally (better algorithm) rather than with more bits
4.2 FP32 (Single Precision) - Detailed
FP32 was the unquestioned standard for deep learning from 2012 to approximately 2018. All early frameworks (Theano, Caffe, early TensorFlow, early PyTorch) used FP32 by default.
FP32 ROLE IN MODERN LLM TRAINING (2026)
=======================================================================
Still used for:
OK Master weights (authoritative copy)
OK Optimizer states (Adam m, v)
OK Gradient accumulation (sum of gradients)
OK Loss and metrics computation
OK Learning rate and hyperparameters
OK Numerical stability fallback (softmax intermediate)
NO LONGER used for (replaced by BF16/FP8):
NO Forward pass matmul computation
NO Backward pass gradient computation
NO Activation storage
NO Weight communication across GPUs
NO Inference of any kind
Why FP32 persists for certain operations: gradient accumulation sums millions of small values. With BF16 ($\epsilon = 2^{-7}$), a gradient contribution smaller than about $2^{-8}$ of the current sum is rounded away entirely. Over millions of steps, these lost contributions compound. FP32 ($\epsilon = 2^{-23}$) resolves contributions as small as $\approx 2^{-24}$ of the sum - $2^{16} \approx 65{,}000\times$ more precise.
Memory cost for a 70B model in pure FP32:
- Weights: $70 \times 10^9 \times 4$ bytes = 280 GB
- Adam : 280 GB
- Adam : 280 GB
- Total optimizer + weights: 840 GB
- Impractical on any single GPU; requires aggressive model parallelism
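The bullet arithmetic above generalises to any parameter count. A tiny helper (the 4-bytes-per-value layout is standard FP32 + Adam; the function name is illustrative):

```python
def fp32_training_memory_gb(n_params: float) -> dict:
    """Memory for weights plus Adam optimizer states, all stored in FP32
    (4 bytes per value); decimal GB, as HBM capacities are usually quoted."""
    gb = 1e9
    return {
        "weights_gb": n_params * 4 / gb,
        "adam_m_gb":  n_params * 4 / gb,   # first-moment estimates
        "adam_v_gb":  n_params * 4 / gb,   # second-moment estimates
        "total_gb":   n_params * 12 / gb,  # 12 bytes per parameter overall
    }

mem = fp32_training_memory_gb(70e9)   # the 70B example from the text
```

For 70B parameters this reproduces the figures above: 280 GB of weights and 840 GB total.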
4.3 TF32 (TensorFloat-32)
TF32 is an NVIDIA-proprietary format that exists only inside tensor core hardware. It is not an IEEE standard and cannot be stored in memory - it is a computational format.
Bit layout: 19 bits = 1 sign + 8 exponent + 10 mantissa
TF32 - A HYBRID FORMAT
=======================================================================
FP32: [1 sign][8 exponent][23 mantissa] <- 32 bits
TF32: [1 sign][8 exponent][10 mantissa] <- 19 bits
BF16: [1 sign][8 exponent][ 7 mantissa] <- 16 bits
FP16: [1 sign][5 exponent][10 mantissa] <- 16 bits
TF32 takes:
- Exponent from FP32 (8 bits -> same range as FP32)
- Mantissa from FP16 (10 bits -> same precision as FP16)
- Result: FP32 range with FP16 precision
How it works in practice:
- When you call a FP32 matmul on an A100 or H100, cuBLAS automatically uses TF32 internally
- Inputs are FP32; tensor core truncates mantissa to 10 bits for multiply; accumulates in FP32
- Output is FP32
- Throughput: A100 TF32 = 156 TFLOPS vs FP32 scalar = 19.5 TFLOPS - 8× faster
Implication: if you're running "FP32 training" on an A100/H100, your matmuls are actually TF32 matmuls unless you explicitly disable it (torch.backends.cuda.matmul.allow_tf32 = False). This is fine for virtually all training - the precision loss from 23->10 mantissa bits in the multiply step is negligible because accumulation is still FP32.
4.4 FP16 (Half Precision)
IEEE 754 binary16. The first reduced-precision format to gain widespread ML use, starting with NVIDIA Volta (2017).
Bit layout: 16 = 1 sign + 5 exponent + 10 mantissa
| Property | Value |
|---|---|
| Exponent bias | 15 |
| Exponent range (true) | $-14$ to $+15$ |
| Max value | 65,504 |
| Min positive normal | $2^{-14} \approx 6.1 \times 10^{-5}$ |
| Machine epsilon | $2^{-10} \approx 9.8 \times 10^{-4}$ |
| Decimal precision | ~3.3 significant digits |
The critical limitation - narrow dynamic range:
FP16 can represent normal values from $\approx 6.1 \times 10^{-5}$ up to 65,504 (subnormals extend down to $\approx 6.0 \times 10^{-8}$). This range is far too narrow for LLM training:
FP16 FAILURE MODES IN LLM TRAINING
=======================================================================
OVERFLOW: values > 65,504 become Inf
- Loss values in early training are often > 10
- Attention logits (before softmax) can reach 100+
- If any intermediate value exceeds 65,504 -> Inf -> NaN -> training dead
UNDERFLOW: values < 6.1e-5 become subnormal, then 0
- Gradients are often ~1e-7
- In FP16, these gradients -> 0
- Zero gradients -> weights don't update -> training stalls
- This is SILENT - loss appears stable but the model isn't learning
                       FP16 representable range
                     +----------------------+
  gradient zone      |                      |      loss zone
  (1e-7 to 1e-4)     |  6.1e-5  ->  65,504  |      (1 to 100+)
     ########        |######################|       ########
    underflow!       |                      |       overflow!
Loss scaling - the FP16 workaround:
- Before the backward pass: multiply the loss by a large constant $S$ (e.g., 128, 1024, or dynamic)
- All gradients are scaled by $S$ -> shifted into the representable range
- After gradient computation: divide all gradients by $S$ before the optimizer step
- Dynamic loss scaling: start with a large $S$; if overflow (Inf/NaN) is detected, halve $S$; if no overflow occurs for $N$ steps, double $S$
2026 status: FP16 is largely replaced by BF16 for training. Still used in some inference engines and legacy code. The loss scaling machinery made FP16 training possible but fragile - BF16 eliminates the need entirely.
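The dynamic-scaling bullet points amount to a small state machine. A minimal sketch (the class name and default constants are illustrative; PyTorch's `GradScaler` implements a similar halve-on-overflow / grow-after-clean-run scheme):

```python
class DynamicLossScaler:
    """Sketch of dynamic loss scaling: halve the scale on overflow,
    double it after a run of clean steps. Defaults are illustrative."""

    def __init__(self, init_scale: float = 2.0**16, growth_interval: int = 2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def update(self, grads_have_inf_or_nan: bool) -> bool:
        """Call once per step. Returns True if the optimizer step should run."""
        if grads_have_inf_or_nan:
            self.scale /= 2            # back off and skip this step
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps >= self.growth_interval:
            self.scale *= 2            # grow cautiously after a clean run
            self.clean_steps = 0
        return True
```

In use: scale the loss by `scaler.scale` before `backward()`, check the gradients for Inf/NaN, call `update(...)`, and unscale the gradients only when it returns True.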
4.5 BF16 (Brain Float 16)
BF16 is the most important number format innovation for deep learning in the 2018-2026 era. Invented by Google Brain for TPU hardware, it was specifically designed for neural network training.
Bit layout: 16 = 1 sign + 8 exponent + 7 mantissa
| Property | Value |
|---|---|
| Exponent bits | 8 (same as FP32) |
| Mantissa bits | 7 |
| Exponent bias | 127 (same as FP32) |
| Max value | $\approx 3.4 \times 10^{38}$ (same as FP32) |
| Min positive normal | $\approx 1.2 \times 10^{-38}$ (same as FP32) |
| Machine epsilon | $2^{-7} \approx 7.8 \times 10^{-3}$ |
| Decimal precision | ~2.4 significant digits |
Why BF16 dominates - the key insight:
BF16 has the same 8-bit exponent as FP32, giving it identical dynamic range ($\approx 1.2 \times 10^{-38}$ to $3.4 \times 10^{38}$). This single design choice eliminates both the overflow and underflow problems that plague FP16:
BF16 vs FP16 - THE RANGE ADVANTAGE
=======================================================================
FP16 range                                               FAILS for AI
        +-----------------------+
        6e-5             65,504
        |#######################|
BF16 range                                               WORKS for AI
+---------------------------------------------------------------+
1.2e-38                                                   3.4e38
|###############################################################|
FP32 range (identical to BF16!)
+---------------------------------------------------------------+
1.2e-38                                                   3.4e38
|###############################################################|
Gradients at 1e-7?   In BF16 range OK    In FP16? UNDERFLOW NO
Large logits at 200? In BF16 range OK    In FP16? OVERFLOW  NO
The precision trade-off:
- BF16 has only 7 mantissa bits vs FP32's 23 - it is less precise in relative terms
- Machine epsilon $2^{-7} \approx 0.0078$: any value smaller than ~1/256 of the current number is lost when added to it
- This means BF16 can represent only ~2-3 decimal digits of precision
Why this is acceptable for ML: the noise from mini-batch gradient estimation (which typically has variance on the order of the gradient magnitude) far exceeds BF16 precision error. The gradient itself is a noisy estimate - adding numerical noise to a signal that already has statistical noise is negligible.
FP32 <-> BF16 conversion:
- BF16 to FP32: pad 16 zero bits to the mantissa -> trivial; free in hardware
- FP32 to BF16: drop the lower 16 mantissa bits (with rounding) -> trivial; 1 cycle
- This simplicity is by design: the exponent field is identical, so no exponent conversion needed
Hardware support:
- Google TPU v2+ (2017): BF16 native from inception
- NVIDIA A100 (2020): BF16 tensor cores at 312 TFLOPS
- NVIDIA H100 (2022): BF16 tensor cores at 989 TFLOPS
- All modern training hardware supports BF16 at full tensor core speed
BF16 matmul accumulation: NVIDIA tensor cores compute BF16 × BF16 products but accumulate the intermediate sums in FP32. The result is then optionally converted back to BF16 for storage. This means the matmul output quality is much better than naive BF16 arithmetic - the FP32 accumulation prevents precision loss in long dot products.
4.6 FP16 vs BF16 Head-to-Head
| Property | FP16 | BF16 |
|---|---|---|
| Total bits | 16 | 16 |
| Exponent bits | 5 | 8 |
| Mantissa bits | 10 | 7 |
| Max value | 65,504 | $\approx 3.4 \times 10^{38}$ |
| Min positive normal | $\approx 6.1 \times 10^{-5}$ | $\approx 1.2 \times 10^{-38}$ |
| Machine epsilon | $2^{-10} \approx 9.8 \times 10^{-4}$ | $2^{-7} \approx 7.8 \times 10^{-3}$ |
| Needs loss scaling | Yes (critical) | No |
| Gradient underflow risk | High (gradients lost) | None (range matches FP32) |
| Precision per value | Higher (10-bit mantissa) | Lower (7-bit mantissa) |
| FP32 conversion cost | Non-trivial (exponent remapping) | Trivial (truncate 16 LSBs) |
| GPU tensor core support | V100+, A100+, H100+ | A100+, H100+ (TPU: v2+) |
| 2026 training status | Legacy | Standard |
| 2026 inference status | Some engines (TensorRT) | Standard |
The verdict: BF16 is strictly preferred for LLM training. FP16 wins only in niche inference scenarios where the extra 3 mantissa bits matter and loss scaling overhead is acceptable.
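The "trivial conversion" row in the table can be shown at the bit level. A sketch of round-to-nearest-even FP32 -> BF16 using stdlib `struct` (this is the standard truncate-with-rounding-bias trick; it does not special-case NaN, which production code should):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """FP32 -> BF16 with round-to-nearest-even; returns a 16-bit pattern.
    Note: a signaling NaN can round to Inf here - real code checks NaN first."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # RNE: add 0x7FFF plus the lowest kept bit, then drop the low 16 bits
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_fp32(b: int) -> float:
    """BF16 -> FP32 is exact: pad 16 zero bits onto the mantissa."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]
```

Because the exponent field is identical, the widening direction is exact, and the narrowing direction is a one-line rounding - exactly why the table calls this conversion trivial.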
4.7 FP8 Formats (E4M3 and E5M2)
FP8 is the frontier training format as of 2024-2026. NVIDIA H100 was the first GPU with FP8 tensor cores, enabling 2× throughput over BF16. There are two complementary 8-bit floating-point formats:
FP8 E4M3 - optimised for precision (forward pass):
| Property | Value |
|---|---|
| Bit layout | 1 sign + 4 exponent + 3 mantissa |
| Exponent bias | 7 |
| Max value | 448 |
| Machine epsilon | $2^{-3} = 0.125$ (12.5% relative error!) |
| Decimal precision | ~1 significant digit |
| Use case | Weights and activations in forward pass |
| Specials | No $\pm\infty$; NaN is the only special value (S=1, E=1111, M=111) |
FP8 E5M2 - optimised for range (gradients):
| Property | Value |
|---|---|
| Bit layout | 1 sign + 5 exponent + 2 mantissa |
| Exponent bias | 15 |
| Max value | 57,344 |
| Machine epsilon | $2^{-2} = 0.25$ (25% relative error!) |
| Decimal precision | less than 1 significant digit |
| Use case | Gradients in backward pass |
| Specials | Has $\pm\infty$ and NaN, like IEEE 754 FP16 |
FP8 FORMAT COMPARISON
=======================================================================
E4M3 - more mantissa bits -> better precision
+--+------+-----+
| s| EEEE | MMM |    4 exponent + 3 mantissa
+--+------+-----+    Range: ±448; good precision for weights
Use: forward pass (activations, weights)
E5M2 - more exponent bits -> wider range
+--+-------+----+
| s| EEEEE | MM |    5 exponent + 2 mantissa
+--+-------+----+    Range: ±57,344; wider range for gradients
Use: backward pass (gradients)
WHY TWO FORMATS?
Forward pass: values clustered in known range; precision matters more
Backward pass: gradients span many orders of magnitude; range matters more
The FP8 challenge - extreme quantization noise:
FP8 E4M3 has $\epsilon = 2^{-3}$: every value carries up to 12.5% relative error. This is enormous - for comparison, BF16 has 0.78%. At this precision level, per-tensor or per-block scaling is mandatory to keep values within the representable range.
Hardware throughput: H100 FP8 = 1,979 TFLOPS dense (3,958 with structured sparsity) - 2× BF16 tensor core throughput and roughly 30× FP32 scalar throughput. This massive speed advantage drives the push toward FP8 training.
DeepSeek-V3 (2024): trained the entire model (forward and backward) in FP8 with tile-wise scaling. This was the first commercial frontier model to demonstrate FP8 training at scale, achieving massive cost reduction vs BF16 baseline.
4.8 FP8 Scaling Strategies
Because FP8 has such limited precision and range, every tensor needs an associated scale factor that maps the tensor's actual value range into the FP8 representable range:
1. Per-tensor scaling - simplest:
- One scalar per entire tensor
- Fast; minimal overhead; poor when tensor has outliers
- If one element is much larger than the rest, all other elements lose precision because the scale $S$ is set by the outlier
2. Per-block (tile) scaling - optimal for training:
- One scale per block of the matrix (e.g., $128 \times 128$ tiles)
- Each block has its own scale, matching its local value range
- Overhead: one extra scale factor per block per matmul
- DeepSeek-V3 uses tile-wise scaling ($128 \times 128$ blocks for weights, $1 \times 128$ tiles for activations): this was the key innovation that made FP8 training work at frontier scale
PER-TENSOR vs PER-BLOCK SCALING
=======================================================================
Matrix with outlier:
+-----------------------------------------+
| 0.01 0.02 0.01 0.03 0.02 | 100.0 | <- outlier
| 0.02 0.01 0.03 0.01 0.02 | 0.01 |
| 0.01 0.03 0.02 0.01 0.01 | 0.02 |
| 0.02 0.01 0.01 0.02 0.03 | 0.01 |
+-----------------------------------------+
Per-tensor scaling: S = 100.0 / 448 ≈ 0.223
  0.01 -> 0.01/0.223 ≈ 0.045 -> lands at the very bottom of the E4M3
  range, where the grid is coarse relative to the values; still smaller
  entries underflow to 0. Severe precision loss for small values!
Per-block scaling (2×3 blocks shown):
  Block 1 (top-left 2×3): max = 0.03; S_1 = 0.03/448 ≈ 6.7e-5
    0.01 -> 0.01/6.7e-5 ≈ 149 -> E4M3: 144   <- sits mid-range; good precision
  Block 2 (top-right 2×3): max = 100; S_2 = 100/448 ≈ 0.223
    100.0 -> 448 -> E4M3: 448                <- full range for the outlier
3. Delayed scaling:
- Use the scale factor computed from the previous iteration (slightly stale)
- Avoids an extra pass over the tensor to compute the current max
- Works because tensor statistics change slowly between iterations
4. Just-in-time scaling:
- Compute scale fresh each iteration by scanning the tensor
- Accurate but requires an extra read pass over the data
- Used when tensor distributions change rapidly
5. Stochastic rounding for FP8:
- Standard round-to-nearest introduces systematic bias at low precision
- Stochastic rounding: round up with probability = fractional position; round down otherwise
- $\mathbb{E}[\text{round}(x)] = x$ - unbiased in expectation
- Critical for gradient accumulation in FP8: small gradient updates are preserved probabilistically
- Cost: requires hardware random number generation per operation
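The per-tensor vs per-block trade-off can be checked numerically. Below is a deliberately crude E4M3 emulator (round to a 1+3-bit significand, clamp to ±448; subnormals, exponent limits, and NaN are ignored) used only to compare the two scaling schemes - a sketch, not bit-exact FP8:

```python
import math

E4M3_MAX = 448.0

def to_e4m3(x: float) -> float:
    """Crude E4M3 emulation: keep a 4-bit significand, clamp to +-448.
    Good enough for an error comparison, not a faithful FP8 model."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(abs(x))          # |x| = m * 2**e with m in [0.5, 1)
    m = round(m * 16) / 16             # 4 significand bits (1 implicit + 3)
    return math.copysign(min(m * 2.0**e, E4M3_MAX), x)

def quant_error(values, scale):
    """Mean absolute error after quantize -> dequantize with one scale."""
    return sum(abs(v - to_e4m3(v / scale) * scale) for v in values) / len(values)

small = [0.01, 0.02, 0.03, 0.01, 0.02]     # a "block" of typical values
outlier_tensor = small + [100.0]           # same values plus one outlier

per_tensor_scale = max(abs(v) for v in outlier_tensor) / E4M3_MAX
per_block_scale = max(abs(v) for v in small) / E4M3_MAX

err_tensor = quant_error(small, per_tensor_scale)  # scale set by the outlier
err_block = quant_error(small, per_block_scale)    # scale set locally
```

With the outlier-driven scale the small values are pushed to the bottom of the format's range; with the local block scale they use its full dynamic range, and the measured error drops.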
5. Integer Formats for AI
Integer formats have become central to AI inference through quantization. Unlike floating-point, integers have no exponent - all values are uniformly spaced. This simplicity makes integer arithmetic faster, cheaper, and more energy-efficient.
5.1 INT32 - Full Precision Integer
- 32-bit two's complement; range $-2^{31}$ to $2^{31} - 1$ ($\approx \pm 2.1 \times 10^9$)
- AI role: accumulator. When multiplying two INT8 values, the product can be as large as $128 \times 128 = 16{,}384$ - this fits in INT16. But a dot product of 4096-dimensional INT8 vectors produces partial sums up to $4096 \times 16{,}384 \approx 67$ million - requires INT32 to avoid overflow.
- Pipeline: INT8 matmul -> INT32 accumulation -> scale to FP32 -> output as BF16 or INT8
- GPU supports INT32 natively; standard for index operations, loop counters, and token IDs in CUDA kernels
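The accumulator-width argument is easy to verify with numpy (sizes as in the text; the widen-then-accumulate step mirrors what INT8 tensor cores do in hardware):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=4096).astype(np.int8)
b = rng.integers(-128, 128, size=4096).astype(np.int8)

# Correct pipeline: widen to INT32 *before* accumulating the dot product
acc32 = int(np.dot(a.astype(np.int32), b.astype(np.int32)))

# Exact reference using Python's unbounded integers
exact = sum(int(x) * int(y) for x, y in zip(a, b))
assert acc32 == exact

# Worst-case partial sum: 4096 * 128 * 128 ~= 67 million.
# That overflows INT16 (max 32,767) but fits comfortably in INT32.
assert 4096 * 128 * 128 < 2**31
```

Accumulating the same products in INT16 would wrap around after just a couple of terms, which is exactly why the pipeline widens to INT32 first.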
5.2 INT16
- 16-bit two's complement; range $-32{,}768$ to $32{,}767$
- Rare in AI - too narrow for accumulation (INT8 × INT8 products summed over more than a couple of elements can overflow) and too wide for weight storage (twice the memory of INT8)
- Usage: token IDs can be stored in 16 bits when the vocabulary size is below 65,536 (using UINT16). GPT-2's vocabulary of 50,257 tokens fits in UINT16.
- Some DSP-based inference engines use INT16 for activations
5.3 INT8
INT8 is the workhorse format for production LLM inference in 2026:
- 8-bit two's complement; range $-128$ to $127$; 1 byte per value
- GPU throughput: A100 INT8 = 624 TOPS; 2× FP16/BF16 tensor core throughput
- Memory: 2× compression vs BF16; 4× vs FP32
Quantization - mapping float to INT8:
$q = \mathrm{clamp}(\mathrm{round}(x / S) + Z,\ -128,\ 127)$, where $S$ (scale) and $Z$ (zero point) define the mapping:
- Symmetric quantization ($Z = 0$): maps $[-\alpha, \alpha]$ to $[-127, 127]$, where $\alpha = \max|x|$; $S = \alpha / 127$
- Asymmetric quantization ($Z \neq 0$): maps $[\beta_{\min}, \beta_{\max}]$ to $[-128, 127]$; $S = (\beta_{\max} - \beta_{\min}) / 255$; $Z = -128 - \mathrm{round}(\beta_{\min} / S)$
Dequantization: $\hat{x} = S \cdot (q - Z)$
Quantization granularity:
- Per-tensor: one $S$, $Z$ for the entire weight matrix - simple, but outliers dominate
- Per-channel (per-row/column): one $S$, $Z$ per output channel - standard for weight matrices
- Per-group: one $S$, $Z$ per group of values (e.g., groups of 128) - better for activations
- Per-token: one $S$, $Z$ per token's activation vector - handles dynamic range well
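The symmetric per-tensor case of these formulas, as a numpy sketch (function names are illustrative):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization (zero point Z = 0)."""
    alpha = float(np.max(np.abs(x)))             # clipping range [-alpha, alpha]
    scale = alpha / 127.0 if alpha > 0 else 1.0  # S = alpha / 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """x_hat = S * q (Z = 0 in the symmetric case)."""
    return q.astype(np.float32) * scale

w = np.array([0.1, -0.5, 0.25, 0.9, -1.2], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)   # round-trip error bounded by scale / 2
```

Per-channel or per-group variants just apply the same two functions along rows or fixed-size groups, each with its own `scale`.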
Activation quantization is harder than weight quantization:
- Weights are static (computed once during quantization); activations change every forward pass
- Transformer activations have outlier channels - a few channels have values 10-100× larger than the others
- SmoothQuant (Xiao et al. 2023): migrate quantization difficulty from activations to weights via per-channel scaling:
  $Y = (X \operatorname{diag}(s)^{-1}) \cdot (\operatorname{diag}(s) W)$, where $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ with $\alpha \approx 0.5$
LLM.int8() (Dettmers et al. 2022): handles outlier features by computing them in FP16 while the rest uses INT8 - mixed-precision decomposition at the feature level
5.4 UINT8 - Unsigned Integer 8-bit
- Range: ; no negative values
- AI uses:
- Activations after ReLU: always non-negative; UINT8 matches perfectly
- Image pixel values: standard image format (0-255 per channel)
- Asymmetric quantization zero point: stored as UINT8
- Asymmetric activation quantization often uses UINT8: maps $[\beta_{\min}, \beta_{\max}]$ to $[0, 255]$ with the conceptual zero mapped to the UINT8 value $Z$
5.5 INT4
INT4 is the primary format for consumer-GPU LLM inference in 2026:
- 4-bit two's complement; range $-8$ to $7$; 0.5 bytes per value
- Memory: 4× compression vs BF16; 8× vs FP32
- LLaMA-3 70B in INT4: ~35 GB - fits on a single RTX 4090 (24 GB VRAM, with offloading) or 2× RTX 3090
Hardware status:
- Not natively supported for compute by standard H100 tensor cores - must be dequantized to BF16 before the matmul
- NVIDIA Ada Lovelace (RTX 4090): INT4 tensor cores with 2× INT8 throughput
- Typical deployment: W4A16 (weights INT4, activations BF16) - weights stored in INT4, dequantized to BF16 on the fly before each matmul
INT4 DEQUANTIZE-ON-THE-FLY PIPELINE
=======================================================================
Memory (HBM) GPU Compute (SM)
+------------+ +------------------------------+
| INT4 weight|--load-->| Unpack INT4 -> INT8 |
| (0.5 B/val)| | Dequantize: INT8 × S -> BF16 |
| | | BF16 matmul with activation |
| | | FP32 accumulation |
| | | Output BF16 |
+------------+ +------------------------------+
Benefit: 4× less memory bandwidth (bandwidth-bound -> 4× faster)
Cost: dequantization overhead (small; amortised over the matmul compute)
INT4 quantization methods:
- GPTQ (Frantar et al. 2023): post-training quantization using approximate second-order information (Hessian inverse); groups of 128; deterministic
- AWQ (Lin et al. 2023, MLSys 2024 Best Paper): activation-aware weight quantization; protects salient channels based on activation magnitudes
- Symmetric INT4: 15 levels $\{-7, \dots, 7\}$; $S = \max|w| / 7$
- Asymmetric INT4 (UINT4): 16 levels $\{0, \dots, 15\}$ with a zero point; allows asymmetric weight distributions
5.6 INT2 and INT1 (Binary)
INT2 - 2-bit quantization:
- Range: $-2$ to $1$ (signed) or $0$ to $3$ (unsigned)
- 4× compression vs INT8; 16× vs FP32
- QuIP# (Tseng et al. 2024): achieves usable 2-bit quantization using incoherence processing (random orthogonal transforms to distribute weight magnitudes more uniformly)
- Quality: significant degradation; 2-8 points PPL increase over BF16; research-grade only
INT1 - binary neural networks:
- 1 bit per weight: $\{-1, +1\}$; 8× compression vs INT8; 32× vs FP32
- XNOR-popcount matmul: binary weights transform multiply-accumulate into bitwise XNOR followed by popcount (counting set bits); an extremely fast hardware implementation:
- Standard matmul: $y = \sum_i w_i x_i$ (multiply + add)
- Binary matmul: $y = 2 \cdot \mathrm{popcount}(\mathrm{XNOR}(w, x)) - n$ (bitwise + count)
- 64 binary multiplies per single 64-bit XNOR instruction; then one popcount
- BitNet (Wang et al. 2023): INT1 weights trained from scratch; competitive with FP16 at large scale when combined with proper training methodology
- Practical challenge: model quality degrades severely below INT4 without training from scratch with quantization-aware methods
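The XNOR-popcount identity can be sketched in pure Python (bit `i` set means +1, clear means -1; `n` is the vector length - this emulates in software what binary hardware does in one instruction):

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two {-1,+1}^n vectors packed as n-bit masks.
    XNOR marks positions where signs match; each match contributes +1
    and each mismatch -1, so dot = matches - (n - matches) = 2*matches - n."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask
    matches = bin(xnor).count("1")       # popcount
    return 2 * matches - n

def reference_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Naive check: decode each bit to +-1 and multiply-accumulate."""
    to_pm1 = lambda bits, i: 1 if (bits >> i) & 1 else -1
    return sum(to_pm1(a_bits, i) * to_pm1(b_bits, i) for i in range(n))
```

A single 64-bit XNOR plus one popcount replaces 64 multiplies and 63 adds, which is the entire hardware appeal of binary networks.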
5.7 Packed Integer Formats
Low-bit integers are packed into larger registers for efficient storage and SIMD processing:
INT4 packing:
- 8 INT4 values packed into a single 32-bit word; 16 values per 64-bit word
- Layout: little-endian; first value in least significant bits
INT4 PACKING IN A 32-BIT REGISTER
=======================================================================
32-bit word:  [v_7][v_6][v_5][v_4][v_3][v_2][v_1][v_0]
              MSB                                  LSB
Each v_i is 4 bits (one INT4 value)
Extraction in Python/CUDA:
value_i = (int32_word >> (4 * i)) & 0xF   # Extract i-th INT4
# For signed INT4, sign-extend:
if value_i >= 8:
    value_i -= 16   # Convert [0, 15] -> [-8, 7]
INT2 packing: 16 values per 32-bit word
INT1 packing: 32 values per 32-bit word - an entire binary weight vector in one register
SIMD unpacking:
- Modern GPUs and CPUs can unpack and process multiple packed values per instruction
- AVX-512: 512-bit register -> 128 packed INT4 values per register
- Key performance consideration: unpacking overhead must be amortised over computation; fine-grained per-element unpacking kills performance
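Wrapping the extraction snippet from 5.7 into a full round-trip helper (a sketch; real kernels do this with vectorised shifts rather than a Python loop):

```python
def pack_int4(values):
    """Pack 8 signed INT4 values (each in [-8, 7]) into one 32-bit word.
    Little-endian layout: values[0] occupies the least significant nibble."""
    assert len(values) == 8
    word = 0
    for i, v in enumerate(values):
        assert -8 <= v <= 7
        word |= (v & 0xF) << (4 * i)   # store the low 4 bits of each value
    return word

def unpack_int4(word):
    """Recover the 8 signed INT4 values from a packed 32-bit word."""
    out = []
    for i in range(8):
        nibble = (word >> (4 * i)) & 0xF
        out.append(nibble - 16 if nibble >= 8 else nibble)  # sign-extend
    return out

packed = pack_int4([-8, -1, 0, 1, 7, 3, -4, 2])
```

The sign-extension branch is the step that converts the stored code in [0, 15] back to the two's-complement value in [-8, 7].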
6. Non-Uniform and Specialised Formats
Standard integer and floating-point formats use uniform or logarithmic spacing. Specialised formats can achieve better accuracy per bit by matching the spacing to the statistical distribution of the data.
6.1 Normal Float 4-bit (NF4)
NF4 is an information-theoretically optimal 4-bit format for normally distributed data. Introduced by Dettmers et al. (2023) as part of QLoRA (Quantization-aware Low-Rank Adaptation).
Key insight: transformer weights are approximately normally distributed, $w \sim \mathcal{N}(0, \sigma^2)$. A uniform quantization grid (INT4) places equal numbers of levels across the range, but most weights cluster near zero. NF4 places more levels near zero (where the density is highest) and fewer levels at the tails (where few weights exist).
Construction - quantile-based level placement:
- Compute the quantiles of the standard normal distribution that divide the probability mass into 16 equal regions
- The level for each region is the expected value (centroid) of the distribution within that region
- Normalise all levels to $[-1, 1]$
NF4 levels (16 values, including an exact 0, denser near zero):
NF4 vs INT4 LEVEL DISTRIBUTION
=======================================================================
Normal distribution of weights:
####
########
############
################
#####################
###############################
------------------------------------------------------
-1.0 0 1.0
INT4 levels (uniform spacing):
* * * * * * * * * *
-1.0 -0.78 -0.56 -0.33 -0.11 0.11 0.33 0.56 0.78 1.0
Wasted precision in tails; insufficient detail near zero
NF4 levels (quantile-based):
* * * * * ******* * * * * *
-1.0 0 1.0
Dense near zero (where most weights are); sparse in tails
Quantization: nearest-level assignment
For each weight $w$, find the NF4 level $q_j$ minimising $|w - q_j|$ and store the 4-bit index $j$.
Why NF4 is better than INT4 for weights:
- NF4 minimises the expected quantization mean squared error (MSE) for $w \sim \mathcal{N}(0, \sigma^2)$
- Theory: for a Gaussian source, quantile-based quantization approaches the Lloyd-Max optimum
- Practice: NF4 achieves 0.3-1.5 PPL increase over BF16 vs 0.5-2.0 for INT4 AWQ on equivalent models
Usage: QLoRA fine-tuning - base model weights stored in NF4 (frozen); LoRA adapter weights in BF16 (trainable). This enables fine-tuning a 65B model on a single 48 GB GPU.
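The quantile construction can be sketched with the stdlib's `statistics.NormalDist`. This reproduces the idea (equal-probability regions, normalised to [-1, 1]) but not the exact NF4 table - QLoRA's construction differs in details, such as guaranteeing an exact zero level:

```python
from statistics import NormalDist

def quantile_levels(k: int = 16):
    """Quantile-based levels for N(0, 1): one representative per
    equal-probability region, normalised to [-1, 1]. An NF4-like sketch."""
    nd = NormalDist()
    # quantile midpoints of k equal-probability regions of the Gaussian
    qs = [nd.inv_cdf((i + 0.5) / k) for i in range(k)]
    m = max(abs(q) for q in qs)
    return [q / m for q in qs]
```

The resulting grid is visibly non-uniform: the gap between the two central levels is several times smaller than the gap between the two outermost ones, matching the diagram above.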
6.2 Log Number System (LNS)
The log number system represents a number $x$ by its sign and the logarithm of its magnitude: $(\mathrm{sign}(x),\ \ell_x)$ with $\ell_x = \log_2 |x|$.
Multiplication becomes addition: $\ell_{xy} = \ell_x + \ell_y$.
A single addition in log space replaces a multiplication in linear space - a major hardware simplification.
Addition becomes complex: for $x, y > 0$, $\ell_{x+y} = \ell_x + \log_2\!\left(1 + 2^{\ell_y - \ell_x}\right)$.
This requires a lookup table or approximation for the function $s(d) = \log_2(1 + 2^d)$, known as the Gaussian logarithm. Mitchell's approximation, $\log_2(1 + z) \approx z$ for small $z$, enables fast LNS addition.
Properties:
- Range: theoretically unbounded in both directions (limited only by the range of the stored logarithm field)
- Precision: uniform in log space - constant relative precision across all magnitudes
- Contrast with floating-point: FP also has approximately logarithmic spacing, but LNS makes this exact
AI relevance: LNS is rarely used in mainstream ML. Research interest exists for extremely low-bit hardware where multiplication is the dominant energy cost. Some custom accelerator designs use LNS internally.
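The two identities above, in code (positive values only; numbers live in the `log2` domain, and function names are illustrative):

```python
import math

def lns_mul(lx: float, ly: float) -> float:
    """Multiply two positive numbers given as log2 magnitudes: just add."""
    return lx + ly

def lns_add(lx: float, ly: float) -> float:
    """Add two positive numbers in the log2 domain via the Gaussian
    logarithm s(d) = log2(1 + 2**d), ordering so that d <= 0."""
    if lx < ly:
        lx, ly = ly, lx
    d = ly - lx
    return lx + math.log2(1.0 + 2.0**d)

# 3 * 5 = 15 and 3 + 5 = 8, computed entirely in the log domain:
l3, l5 = math.log2(3), math.log2(5)
```

Hardware replaces the `math.log2(1.0 + 2.0**d)` term with a small lookup table or Mitchell's approximation - that term is the only expensive part of LNS addition.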
6.3 Posit Number System (Unum Type III)
The posit system (Gustafson, 2017) is an alternative to IEEE 754 that claims superior accuracy per bit for certain applications:
Structure: variable-precision encoding with four fields:
- Regime: unary encoding of the exponent range - runs of 0s or 1s terminated by the opposite bit
- Variable allocation: bits not used by the regime are available for the exponent and fraction
- Near $\pm 1$: regime is short -> more bits for the fraction -> higher precision near 1.0
- Far from $\pm 1$: regime is long -> fewer fraction bits -> lower precision but wider range
Properties:
- Only one representation of zero (no $-0$)
- Only one NaN (called "Not-a-Real", or NaR)
- No gradual underflow or overflow - tapers smoothly
- Higher accuracy per bit than IEEE 754 for values near 1.0; worse for extremes
AI evaluation: multiple research groups evaluated posits for neural network training:
- Marginal accuracy improvement over BF16 at the same bit width (16-bit posit vs BF16)
- No significant quality advantage that justifies the hardware redesign cost
- Custom posit hardware exists (Positron chip) but no mainstream GPU supports posits
2026 status: academic interest only. Not deployed in any production ML system.
6.4 Microscaling Formats (MX - OCP Standard 2023)
Microscaling (MX) is an industry-standard block floating-point format published by the Open Compute Project (OCP) in 2023, co-authored by AMD, ARM, Intel, Meta, Microsoft, NVIDIA, and Qualcomm.
Key idea: share a single exponent (scale) across a block of values, giving each value more effective precision per bit:
MICROSCALING BLOCK STRUCTURE
=======================================================================
Block of N values (e.g., N = 32):
+-----------------+   +---+---+---+---+---+---+---+---+-----+----+----+
| Shared Exponent |   |v_1|v_2|v_3|v_4|v_5|v_6|v_7|v_8| ... |v_31|v_32|
|    (8 bits)     |   |   |   |   |   |   |   |   |   |     |    |    |
+-----------------+   +---+---+---+---+---+---+---+---+-----+----+----+
Each v_i: sign + mantissa (element format)
value_i = v_i × 2^shared_exponent
MX format variants:
| Format | Element Bits | Element Layout | Block Size | Effective Bits/Value |
|---|---|---|---|---|
| MXFP8 E4M3 | 8 | 1s + 4e + 3m | 32 | 8.25 |
| MXFP8 E5M2 | 8 | 1s + 5e + 2m | 32 | 8.25 |
| MXFP6 E2M3 | 6 | 1s + 2e + 3m | 32 | 6.25 |
| MXFP6 E3M2 | 6 | 1s + 3e + 2m | 32 | 6.25 |
| MXFP4 E2M1 | 4 | 1s + 2e + 1m | 32 | 4.25 |
| MXINT8 | 8 | 1s + 7 integer | 32 | 8.25 |
Why MX is better than per-element floating-point:
- The exponent storage is amortised over the block: instead of each value needing its own 4-8 bit exponent, 32 values share one 8-bit exponent
- Effective bits per value: element bits + (8 shared exponent bits / 32 values) = element bits + 0.25
- Result: more bits available for mantissa -> better precision per total storage bit
Hardware adoption:
- NVIDIA B200 (Blackwell, 2025): native MXFP8 support
- AMD, Intel, Qualcomm: committed to MX hardware support
- Training: MXFP8 block scaling for both forward and backward - better than per-tensor FP8
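The shared-exponent idea can be sketched with an MXINT8-style toy (one power-of-two scale per block, integer elements; a simplification of the OCP spec, which also constrains how the shared scale is encoded):

```python
import math

def mx_block_quantize(block, elem_bits: int = 8):
    """Toy MXINT8-style block quantization: one shared power-of-two scale
    for the whole block, signed integer elements."""
    amax = max(abs(v) for v in block)
    qmax = 2 ** (elem_bits - 1) - 1                # 127 for 8-bit elements
    # smallest power-of-two scale that brings amax within [-qmax, qmax]
    shared_exp = math.ceil(math.log2(amax / qmax)) if amax > 0 else 0
    scale = 2.0 ** shared_exp
    q = [max(-qmax, min(qmax, round(v / scale))) for v in block]
    return q, shared_exp

def mx_block_dequantize(q, shared_exp):
    return [v * 2.0 ** shared_exp for v in q]
```

Because the scale is a power of two, dequantization is a pure exponent shift in hardware - no multiplier needed - and the 8 bits of scale cost only 8/32 = 0.25 bits per value.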
6.5 Ternary Weights {-1, 0, +1}
Ternary quantization represents weights using only three values: $w_i \in \{-1, 0, +1\}$ (times a per-tensor scale).
Information content: $\log_2 3 \approx 1.58$ bits per weight - hence the name "1.58-bit" quantization.
The arithmetic revolution - no multiplication:
- Standard matmul: $y = \sum_i w_i x_i$ - requires a multiply + accumulate per element
- Ternary matmul: no multiplies - only additions, subtractions, and skips
- If $w_i = 0$: skip entirely (natural sparsity, typically ~50% of weights)
- If $w_i = +1$: add $x_i$
- If $w_i = -1$: subtract $x_i$
TERNARY vs STANDARD MATMUL
=======================================================================
Standard FP16 matmul (one output element):
y = w_1·x_1 + w_2·x_2 + w_3·x_3 + w_4·x_4 + w_5·x_5 + w_6·x_6
Operations: 6 multiplies + 5 adds = 11 FLOPs
Ternary matmul (w = [+1, 0, -1, +1, 0, -1]):
y = x_1 + 0 - x_3 + x_4 + 0 - x_6
y = (x_1 + x_4) - (x_3 + x_6)
Operations: 2 adds + 1 subtract = 3 integer ops (NO multiplies)
Storage:
- Efficient packing: 5 ternary values per 8 bits ($3^5 = 243 \le 256$, i.e. 1.6 bits per value)
- Simple packing: 2 bits per value (one wasted code point); each value encoded as $\{0, 1, 2\}$ mapping to $\{-1, 0, +1\}$
BitNet b1.58 (Ma et al., 2024):
- Ternary weights trained from scratch (not quantized from a float model)
- Competitive with FP16 baselines at 3B+ parameter scale
- Key innovation: uses absmean quantization - scale weights by $\gamma = \frac{1}{n}\sum_i |w_i|$ before rounding each to the nearest of $\{-1, 0, +1\}$
- Energy advantage: a ternary accumulate is a plain integer add, which costs a small fraction of the energy of a floating-point multiply - roughly 4× energy savings per operation
Limitation: ternary models must be trained from scratch with ternarization-aware methods. Post-training ternarization of float models produces unusable quality.
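The multiply-free inner loop from the diagram above, as plain Python (a sketch; real kernels operate on packed 2-bit codes):

```python
def ternary_dot(w, x):
    """Dot product with ternary weights: adds, subtracts, and skips only."""
    pos = 0.0
    neg = 0.0
    for wi, xi in zip(w, x):
        if wi == +1:
            pos += xi            # add
        elif wi == -1:
            neg += xi            # accumulate, subtracted once at the end
        # wi == 0: skip entirely (free sparsity)
    return pos - neg

w = [+1, 0, -1, +1, 0, -1]
x = [2.0, 9.0, 1.0, 4.0, 9.0, 3.0]
# (2 + 4) - (1 + 3) = 2, with no multiplications anywhere
```

Grouping the +1 and -1 positions into two running sums with one final subtraction is exactly the rearrangement shown in the diagram.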
7. Floating-Point Arithmetic Deep Dive
Understanding how floating-point arithmetic works at the bit level explains why certain operations lose precision, why GPU matmul results differ between runs, and why specific numerical tricks (FMA, compensated summation) matter for training stability.
7.1 Floating-Point Addition
Adding two floating-point numbers is more complex than integer addition because the operands may have different exponents. The hardware must align them before adding:
Algorithm for $x + y$ (assuming $e_x \ge e_y$):
FLOATING-POINT ADDITION - STEP BY STEP
=======================================================================
Input: x = 1.000 × 2^3 and y = 1.011 × 2^0
Step 1: ALIGN EXPONENTS
Shift y's significand RIGHT by (e_x - e_y) = 3 - 0 = 3 positions:
y = 1.011 × 2^0  ->  0.001011 × 2^3
                          ^^^  these bits shifted out become the
                          GRS (Guard, Round, Sticky) bits
Step 2: ADD ALIGNED SIGNIFICANDS
  1.000000 × 2^3   (x)
+ 0.001011 × 2^3   (y, aligned)
--------------
  1.001011 × 2^3
Step 3: NORMALISE
Already normalised (leading 1 present) -> no adjustment needed
If needed: shift significand and adjust exponent
Step 4: ROUND
Result has more bits than mantissa allows
Apply rounding mode (default: RNE) to fit into mantissa width
Round 1.001011 to 24 bits (FP32) -> 1.00101100000000000000000 \times 2^3
RESULT: x + y = 1.001011 \times 2^3 = 9.375_{10}
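The worked example can be replayed in C (a quick check, assuming FP32; `add_example` is just an illustrative helper): both 1.000_2 \times 2^3 = 8.0 and 1.011_2 \times 2^0 = 1.375 are exactly representable, and so is their sum.

```c
#include <math.h>

/* 1.000_2 x 2^3 = 8.0 and 1.011_2 x 2^0 = 1.375: both exact in FP32,
   and so is their sum 1.001011_2 x 2^3 = 9.375 - no rounding occurs. */
float add_example(void) {
    float x = ldexpf(1.0f, 3);         /* 1.000 x 2^3 = 8.0   */
    float y = 1.0f + 0.25f + 0.125f;   /* 1.011 x 2^0 = 1.375 */
    return x + y;                      /* 9.375 exactly       */
}
```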
Guard, Round, and Sticky bits - critical for rounding accuracy:
When shifting y's significand during alignment, bits shift beyond the mantissa width. The hardware keeps three extra bits to improve rounding decisions:
| Bit | Name | Purpose |
|---|---|---|
| G | Guard | First bit beyond mantissa width |
| R | Round | Second bit beyond mantissa width |
| S | Sticky | OR of all remaining shifted-out bits (indicates if any were non-zero) |
These three bits provide enough information for the hardware to implement all IEEE 754 rounding modes correctly. Without them, rounding accuracy would be significantly worse.
Precision loss during alignment: when exponents differ significantly, shifting right discards y's low-order bits. If e_x - e_y > 24 (for FP32), then y's entire significand is shifted away and x + y = x. This is why adding a small number to a large number does nothing - the small number is below the precision of the large number.
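A one-line FP32 illustration of this swamping effect (a sketch; near 10^8 the gap between consecutive FP32 values is 8, so adding 1.0 cannot move the result):

```c
/* At 1.0e8 the FP32 ulp is 2^(26-23) = 8.0, so 1.0f is below the precision
   of the large operand: its aligned significand shifts entirely past the
   24-bit mantissa and the sum rounds back to 1.0e8f. */
float swamp(void) {
    float big = 1.0e8f;
    return big + 1.0f;
}
```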
7.2 Floating-Point Multiplication
Multiplication is simpler than addition because no alignment is needed:
Algorithm for x \times y:
FLOATING-POINT MULTIPLICATION
=======================================================================
Input: x = (-1)^s_x \times m_x \times 2^e_x
y = (-1)^s_y \times m_y \times 2^e_y
Step 1: XOR SIGN BITS
s_result = s_x XOR s_y
(positive \times positive = positive; positive \times negative = negative)
Step 2: ADD EXPONENTS
e_result = e_x + e_y - bias
(Subtract bias once because both inputs had bias added;
without correction, result would have double bias)
Step 3: MULTIPLY SIGNIFICANDS
m_result = m_x \times m_y
Each significand has p bits -> product has 2p bits
(For FP32: 24 \times 24 = 48-bit product)
Step 4: NORMALISE AND ROUND
If m_result \geq 2.0: shift right by 1; increment exponent
Round to fit mantissa width (24 bits for FP32)
Hardware cost: the significand multiplier is the largest component - a p \times p-bit multiplier requires O(p^2) hardware area. This is why reducing mantissa width (BF16: 7 bits vs FP32: 23 bits) dramatically reduces multiplication hardware cost and energy.
Key advantage over addition: no alignment step -> simpler circuit -> faster in hardware. This is why GPU throughput for multiply is typically the same or better than for addition.
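These steps can be checked against real FP32 bit patterns — a small sketch (the function names here are invented for illustration) using memcpy for the bit reinterpretation:

```c
#include <stdint.h>
#include <string.h>

/* Extract the biased exponent field (bits 30..23) of an FP32 value. */
static uint32_t biased_exp(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* well-defined bit reinterpretation */
    return (u >> 23) & 0xFF;
}

/* For exact powers of two the significands are 1.0, so the product's
   biased exponent is exactly e_x + e_y - bias (bias = 127 for FP32):
   8.0 = 2^3 (biased 130), 16.0 = 2^4 (biased 131),
   product 128.0 = 2^7 (biased 134 = 130 + 131 - 127). */
uint32_t product_exp(float x, float y) {
    return biased_exp(x * y);
}
```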
7.3 Floating-Point Division and Square Root
Division:
- Compute significand quotient: m_result = m_x / m_y
- Subtract exponents: e_result = e_x - e_y + bias (add the bias back, since it cancelled in the subtraction)
- Iterative algorithms: Newton-Raphson (converges quadratically; computes 1/y then multiplies by x) or SRT algorithm (produces one quotient digit per cycle)
- Hardware: slower than multiply; separate functional unit on GPU; typically 4-20\times slower
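The Newton-Raphson approach can be sketched in C (illustrative only — hardware dividers use table lookups for the initial guess; `nr_recip` is a name invented here and assumes a positive, finite y):

```c
#include <math.h>

/* Newton-Raphson reciprocal: r_{k+1} = r_k * (2 - y * r_k).
   Convergence is quadratic - correct digits double each iteration. */
double nr_recip(double y) {             /* assumes y > 0 and finite */
    int e;
    double m = frexp(y, &e);            /* y = m * 2^e, m in [0.5, 1) */
    double r = 48.0/17.0 - 32.0/17.0 * m;  /* linear initial guess, max
                                              relative error 1/17 */
    for (int k = 0; k < 4; k++)         /* 4 iterations reach double precision */
        r = r * (2.0 - m * r);
    return ldexp(r, -e);                /* 1/y = (1/m) * 2^(-e) */
}

/* Division as "compute 1/y, then multiply": x / y = x * (1/y). */
double nr_div(double x, double y) { return x * nr_recip(y); }
```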
Square root:
- Even exponent: straightforward - halve it (\sqrt{m \times 2^{2k}} = \sqrt{m} \times 2^k)
- Odd exponent: double the significand (making the exponent even), then halve the adjusted exponent
- Significand square root: Newton-Raphson iteration
- 2 iterations sufficient for FP32; 3 for FP64
Fast inverse square root - the famous Quake III trick (historical interest):
#include <stdint.h>
#include <string.h>

// Shown with a fixed-width type and memcpy: the original's long pointer
// casts are undefined behaviour in modern C and break where long is 64-bit.
float Q_rsqrt(float number) {
    uint32_t i;
    float x2, y;
    x2 = number * 0.5F;
    y  = number;
    memcpy(&i, &y, sizeof i);        // Interpret float bits as integer
    i  = 0x5f3759df - (i >> 1);      // Initial approximation (magic!)
    memcpy(&y, &i, sizeof y);        // Convert back to float
    y  = y * (1.5F - (x2 * y * y));  // One Newton-Raphson refinement
    return y;
}
The "magic number" 0x5f3759df exploits the fact that the integer representation of a float is approximately its logarithm. Shifting right by 1 halves the logarithm (approximating square root), and subtracting from the magic constant computes the inverse. The Newton-Raphson step then refines to full FP32 precision.
Modern relevance: modern GPUs have dedicated hardware for rsqrt and sqrt; the Q_rsqrt trick is no longer needed for performance. But understanding it demonstrates the deep connection between integer and floating-point bit representations.
7.4 Fused Multiply-Add (FMA)
FMA computes a \times b + c as a single operation with a single rounding at the end, rather than two separate operations with two roundings:
FMA:     result = round(a \times b + c)          (one rounding)
Compare to the unfused version:
Unfused: result = round(round(a \times b) + c)   (two roundings)
Why FMA matters:
- More accurate: one rounding error instead of two. For a single operation, the difference is small. But a dot product of length n performs n FMAs - the error savings compound.
- Foundation of dot products: the inner product x \cdot y = \sum_i x_i y_i is computed as a chain of FMAs:
  acc = 0
  acc = FMA(x_1, y_1, acc)   // acc = x_1*y_1 + 0
  acc = FMA(x_2, y_2, acc)   // acc = x_2*y_2 + x_1*y_1
  acc = FMA(x_3, y_3, acc)   // acc = x_3*y_3 + (x_2*y_2 + x_1*y_1)
  ...
- Hardware unit: GPU tensor cores and CPU FPUs implement FMA as a single hardware functional unit - no intermediate register write, no intermediate rounding.
- Kahan summation: FMA can implement compensated summation more efficiently by exploiting the exact product computation.
CUDA intrinsics: fmaf(a, b, c) for FP32 FMA; __fmaf_rn() for round-to-nearest FMA; __fmaf_rd() for round-down FMA.
7.5 Dot Product Accumulation
The dot product (inner product) is the fundamental operation underlying all matrix multiplications in neural networks. Its numerical accuracy determines the quality of every matmul output.
Error analysis:
For a naive dot product s = \sum_{i=1}^{n} x_i y_i computed in floating-point, the standard worst-case bound is:
|fl(x \cdot y) - x \cdot y| \leq n \times \epsilon \times \sum_i |x_i y_i| + O(\epsilon^2)
where \epsilon is machine epsilon. The relative error grows linearly with the dimension n.
For BF16 (\epsilon = 2^{-8} \approx 0.0039) with n = 8192 (typical hidden dimension):
n \times \epsilon = 8192 \times 2^{-8} = 32
This is a 3,200% relative error bound - completely unusable for any computation. This is why BF16 dot products must accumulate in FP32.
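A CPU-side analogue of this effect (a sketch: C has no native BF16, so an FP32 accumulator stands in for BF16 and FP64 stands in for FP32; the value 0.1f is chosen because it is not exactly representable):

```c
/* Summing 0.1f ten million times. The true sum is ~1,000,000.015
   (0.1f is really 0.100000001490116...). */
double sum_high(int n) {                   /* high-precision accumulator */
    double acc = 0.0;
    for (int i = 0; i < n; i++) acc += 0.1f;   /* every add here is exact */
    return acc;
}
float sum_low(int n) {                     /* low-precision accumulator */
    float acc = 0.0f;
    for (int i = 0; i < n; i++) acc += 0.1f;   /* rounding error compounds:
                                                  once acc is large, each add
                                                  is systematically biased */
    return acc;
}
```

The high-precision accumulator lands within a few hundredths of the true sum; the low-precision one is off by thousands — the same reason tensor cores keep BF16 inputs but FP32 partial sums.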
GPU practice - mixed-precision accumulation:
| Input Format | Accumulation Format | Output Format | Error Bound |
|---|---|---|---|
| INT8 \times INT8 | INT32 -> FP32 | BF16 or INT8 | Exact integer, then scale |
| BF16 \times BF16 | FP32 | BF16 | \sim n \times \epsilon_{FP32} |
| FP8 \times FP8 | FP32 | BF16 | \sim n \times \epsilon_{FP32} + input quantization error |
The key insight: even though inputs are low-precision, FP32 accumulation ensures the sum is computed accurately. The accumulated result is then converted back to the output format. This is why GPU matmul quality is far better than naive BF16 arithmetic suggests.
Compensated dot product (Ogita-Rump-Oishi, 2005):
- Uses FMA to capture the exact error from each multiplication
- Accumulates the errors separately and adds them at the end
- Result: relative error \approx \epsilon regardless of n (up to an n^2 \epsilon^2 term) - a quadratic improvement
- Not yet standard in GPU matmul but used in scientific computing
7.6 Mixed-Precision Matmul
The actual computation inside NVIDIA tensor cores for a BF16 matmul:
MIXED-PRECISION TENSOR CORE MATMUL (BF16)
=======================================================================
Input: A \in \mathbb{R}^{m \times k} (BF16), B \in \mathbb{R}^{k \times n} (BF16)
Output: C \in \mathbb{R}^{m \times n} (FP32 or BF16)
Hardware tile: 16\times16\times16 (m_tile \times n_tile \times k_tile)
Each warp (32 threads) computes one tile output
For each K-dimension chunk of 16:
1. Load A tile (16\times16 BF16) and B tile (16\times16 BF16) into registers
2. Compute 16\times16 partial products: BF16 \times BF16
3. Accumulate partial sums in FP32 registers (one FP32 per output element)
After all K chunks:
4. FP32 accumulated result in registers
5. Optionally convert to BF16 for storage
Key: The multiplication is BF16, but accumulation is FP32
This is what makes mixed-precision training work!
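The BF16-multiply/FP32-accumulate scheme can be emulated on a CPU — a sketch (`bf16_round` and the dot routine are illustrative helpers invented here; NaN/overflow handling is omitted):

```c
#include <stdint.h>
#include <string.h>

/* Round an FP32 value to the nearest BF16 (round-to-nearest-even on the
   low 16 bits). NaN and overflow handling omitted for clarity. */
static float bf16_round(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    u += 0x7FFFu + ((u >> 16) & 1u);   /* RNE: add half-ulp, break ties to even */
    u &= 0xFFFF0000u;                  /* keep sign, exponent, 7 mantissa bits */
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Tensor-core-style dot product: BF16 inputs and products, FP32 accumulator. */
float bf16_dot_fp32_acc(const float *x, const float *y, int n) {
    float acc = 0.0f;                  /* FP32 partial sum, as on the GPU */
    for (int i = 0; i < n; i++)
        acc += bf16_round(x[i]) * bf16_round(y[i]);
    return acc;                        /* optionally bf16_round(acc) for storage */
}

float bf16_dot_demo(void) {
    float x[] = {1.0f, 2.0f}, y[] = {3.0f, 4.0f};   /* all exact in BF16 */
    return bf16_dot_fp32_acc(x, y, 2);               /* 11.0 */
}
```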
FP8 tensor core operation (H100):
FP8 TENSOR CORE MATMUL
=======================================================================
Input: A \in \mathbb{R}^{m \times k} (FP8 E4M3), B \in \mathbb{R}^{k \times n} (FP8 E4M3)
Scales: S_A (per-tensor or per-block), S_B (per-tensor or per-block)
1. Dequantize: A_f = A_fp8 \times S_A, B_f = B_fp8 \times S_B
2. Multiply: BF16-equivalent element products
3. Accumulate: FP32 partial sums
4. Scale output: C = A_f \times B_f = (S_A \times S_B) \times (A_{fp8} \times B_{fp8})
5. Output: FP32 or BF16
Throughput: H100 FP8 = 3,958 TOPS (4\times BF16)
Throughput comparison (H100 SXM):
| Input | Accumulation | TFLOPS/TOPS | Notes |
|---|---|---|---|
| BF16 \times BF16 | FP32 | 989 | Standard for training |
| FP8 \times FP8 | FP32 | 3,958 | 4\times BF16; note: 2\times from data density + 2\times from simpler hardware |
| INT8 \times INT8 | INT32 | 3,958 | Same throughput as FP8 |
The 4\times throughput gain from BF16 -> FP8 is the driving force behind the industry's push toward FP8 training and inference.