Quantization compresses tensors by changing their numeric representation. Distillation compresses behavior by training a smaller or cheaper model to imitate a stronger teacher. Both are central to making LLMs cheaper to store, serve, and deploy.
Overview
The simplest uniform quantizer is:
$q=\mathrm{clip}(\mathrm{round}(x/s)+z,\ q_\min,\ q_\max),\qquad \hat x = s\,(q-z)$
Here $s$ is the scale and $z$ is the zero point. Quantization asks how much error this approximation introduces and whether the hardware can exploit the smaller representation.
Distillation uses a teacher distribution:
$p_T(y\mid x)=\mathrm{softmax}(z_T(x)/T)$
The student learns from teacher probabilities, generated sequences, features, or preferences. Quantization and distillation can be combined: distill to a smaller model, then quantize it for deployment.
Prerequisites
- Logits, softmax, KL divergence, and cross-entropy
- Matrix multiplication and tensor shapes
- Inference memory and bandwidth constraints
- Fine-tuning and LoRA basics
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Demonstrates uniform quantization, clipping, group-wise scales, error versus bits, SmoothQuant intuition, distillation temperature, KL loss, and memory savings. |
| exercises.ipynb | Ten practice problems for quantizer formulas, memory savings, clipping, KL distillation, and deployment checks. |
Learning Objectives
After this section, you should be able to:
- Implement affine quantization and dequantization.
- Explain scale, zero point, clipping, and quantization error.
- Compare per-tensor, per-channel, and group-wise quantization.
- Explain PTQ, QAT, GPTQ intuition, AWQ intuition, and QLoRA.
- Compute memory savings from lower bit widths.
- Compute distillation KL loss with temperature.
- Explain logit, sequence, feature, and preference distillation.
- Build a quality and latency evaluation checklist for compressed LLMs.
Table of Contents
- Compression Goals
- 1.1 Memory reduction
- 1.2 Bandwidth reduction
- 1.3 Compute support
- 1.4 Quality preservation
- 1.5 Pareto frontier
- Uniform Quantization
- 2.1 Affine quantizer
- 2.2 Dequantization
- 2.3 Scale
- 2.4 Zero point
- 2.5 Quantization error
- Granularity
- 3.1 Per-tensor
- 3.2 Per-channel
- 3.3 Group-wise
- 3.4 Activation quantization
- 3.5 KV-cache quantization
- Post-Training Quantization
- 4.1 Calibration data
- 4.2 Clipping
- 4.3 Weighted error
- 4.4 GPTQ intuition
- 4.5 AWQ intuition
- Quantization-Aware Training and QLoRA
- 5.1 Fake quantization
- 5.2 Straight-through estimator
- 5.3 QLoRA pattern
- 5.4 Optimizer memory
- 5.5 Dequantization path
- Distillation Basics
- 6.1 Teacher and student
- 6.2 Soft targets
- 6.3 Temperature
- 6.4 KL distillation loss
- 6.5 Hard-label mixture
- LLM Distillation Types
- Error and Evaluation
- 8.1 Perplexity shift
- 8.2 Task score shift
- 8.3 Calibration shift
- 8.4 Layer sensitivity
- 8.5 Outlier channels
- Deployment Choices
- Debugging Checklist
- 10.1 Calibration representativeness
- 10.2 Layer-by-layer error
- 10.3 Reference comparisons
- 10.4 Latency measurement
- 10.5 Quality gates
Compression Map
| Method | Changes | Needs data? | Main benefit | Main risk |
|---|---|---|---|---|
| PTQ | Numeric format after training | Calibration data | Fast compression | Calibration mismatch |
| QAT | Training simulates quantization | Training data | Better low-bit robustness | More compute |
| QLoRA | Quantized frozen base plus LoRA | Fine-tune data | Cheap adaptation | Activation memory remains |
| Logit distillation | Student matches teacher probabilities | Teacher outputs | Smaller model behavior transfer | Teacher errors transfer |
| Sequence distillation | Student trains on teacher completions | Teacher generations | Simple data pipeline | Diversity loss |
1. Compression Goals
This part studies compression goals as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Memory reduction | store fewer bytes per parameter or cache entry | $\text{weight bytes}\approx \text{params}\times\text{bits}/8$ |
| Bandwidth reduction | read fewer bytes per generated token | $\text{bytes/token}\approx \text{weight bytes}+\text{KV-cache bytes}$ |
| Compute support | low precision helps only when kernels and hardware support it | — |
| Quality preservation | compressed outputs should stay close to reference outputs | $\Delta\mathrm{PPL}$, $\Delta$ task score |
| Pareto frontier | compression is a quality-cost tradeoff | $\mathrm{quality}$ versus $(\mathrm{memory},\ \mathrm{latency},\ \mathrm{cost})$ |
1.1 Memory reduction
Main idea. Store fewer bytes per parameter or cache entry.
Core relation:
$\text{weight bytes} \approx \text{params} \times \text{bits per weight} / 8$
Quantization changes the numerical representation of tensors. Distillation changes the training signal so a smaller or cheaper model imitates a stronger teacher. Both are compression tools, but they fail in different ways: quantization can introduce numerical error, while distillation can omit teacher capabilities that are not present in the distillation data.
Worked micro-example. If weights lie in a symmetric range $[-a, a]$ and we use signed 4-bit integers with values from $-8$ to $7$, a symmetric step size is roughly $s \approx a/8$. A real weight $x$ maps to integer $\mathrm{round}(x/s)$ and dequantizes back to the nearest grid point. Smaller $s$ improves resolution near zero but clips large values if the range is too narrow.
Implementation check. Always compare base and compressed logits on the same inputs. Then check held-out loss, task quality, calibration, memory, and latency. Compression is successful only if the target tradeoff improves.
AI connection. Bits per weight is the most direct control variable for LLM weight memory and, during decoding, for the bytes moved per token.
Common mistake. Do not report "4-bit" without saying what is quantized, the granularity, the calibration data, and the serving kernel.
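The memory relation is easy to check numerically. The sketch below (plain Python, no external dependencies; the 7B parameter count and 2-byte scales are illustrative assumptions) compares weight storage at several bit widths, including a rough allowance for group-wise scales.

```python
def weight_bytes(n_params: float, bits: float, group_size: int | None = None,
                 scale_bytes: int = 2) -> float:
    """Approximate bytes to store weights at a given bit width.

    If group_size is given, add one scale (e.g. FP16, 2 bytes) per group,
    which is the usual overhead of group-wise quantization.
    """
    total = n_params * bits / 8
    if group_size is not None:
        total += (n_params / group_size) * scale_bytes
    return total

n = 7e9  # illustrative 7B-parameter model
for bits, group in [(16, None), (8, None), (4, 128), (3, 128)]:
    gb = weight_bytes(n, bits, group) / 1e9
    print(f"{bits:>2}-bit weights: ~{gb:.1f} GB")
```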
1.2 Bandwidth reduction
Main idea. Read fewer bytes per generated token.
Core relation:
$\text{bytes per token} \approx \text{weight bytes read} + \text{KV-cache bytes read}$
During decoding, generating one token requires reading essentially all of the weights plus the growing KV cache, so small-batch decoding is usually bandwidth-bound rather than compute-bound. Lower bit widths cut the bytes moved per token, which is often where the latency win actually comes from.
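A back-of-the-envelope roofline check makes the bandwidth argument concrete. This sketch uses assumed numbers (roughly 1 TB/s of usable memory bandwidth, a 2 GB KV cache, and 7B-class weight sizes, all illustrative) to estimate a bandwidth-only ceiling on single-stream decode speed.

```python
def decode_tokens_per_sec_ceiling(weight_bytes: float, kv_bytes: float,
                                  bandwidth_bytes_per_sec: float) -> float:
    """Bandwidth-only ceiling on decode speed: every generated token must
    stream the weights plus the KV cache through memory at least once."""
    bytes_per_token = weight_bytes + kv_bytes
    return bandwidth_bytes_per_sec / bytes_per_token

bw = 1.0e12   # ~1 TB/s usable HBM bandwidth (assumption)
kv = 2e9      # ~2 GB KV cache at this context length (assumption)
for label, wbytes in [("FP16", 14e9), ("INT8", 7e9), ("INT4", 3.6e9)]:
    print(label, round(decode_tokens_per_sec_ceiling(wbytes, kv, bw), 1), "tok/s ceiling")
```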
1.3 Compute support
Main idea. Low precision helps only when kernels and hardware support it.
Core relation:
realized speedup $\le$ the speedup the serving kernels actually expose for the chosen format
A 4-bit weight tensor that is dequantized to FP16 before every matmul still saves storage and bandwidth, but the arithmetic itself is no faster; true compute gains need kernels that consume the low-precision format directly (INT8, FP8, INT4) on the target hardware.
1.4 Quality preservation
Main idea. Compressed outputs should stay close to reference outputs.
Core relation:
$\Delta\mathrm{PPL}$, $\Delta$ task scores, and logit differences on shared inputs should stay within an agreed budget
Quality preservation is always relative to the uncompressed reference: compare the base and compressed models on the same prompts, and track held-out loss, task accuracy, and calibration rather than a single headline number.
1.5 Pareto frontier
Main idea. Compression is a quality-cost tradeoff.
Core relation:
$\mathrm{quality}$ versus $(\mathrm{memory},\ \mathrm{latency},\ \mathrm{cost})$
Each configuration (bit width, granularity, student size) is one point in this space. The goal is not a single best model but a frontier: for a given quality budget, pick the cheapest point; for a given cost budget, pick the highest-quality point.
2. Uniform Quantization
This part studies uniform quantization as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Affine quantizer | map real values to integer grid points | $q=\mathrm{clip}(\mathrm{round}(x/s)+z,\ q_\min,\ q_\max)$ |
| Dequantization | recover an approximate real value | $\hat x=s\,(q-z)$ |
| Scale | the step size controls resolution | $s=(x_\max-x_\min)/(q_\max-q_\min)$ |
| Zero point | asymmetric quantization uses an integer offset | $z=q_\min-\mathrm{round}(x_\min/s)$ |
| Quantization error | rounding error is bounded by half a step before clipping | $\lvert x-\hat x\rvert \le s/2$ |
2.1 Affine quantizer
Main idea. Map real values to integer grid points.
Core relation:
$q=\mathrm{clip}(\mathrm{round}(x/s)+z,\ q_\min,\ q_\max)$
Rounding snaps each real value to the nearest point of a uniform grid with step $s$, and clipping keeps the result inside the representable integer range $[q_\min, q_\max]$.
AI connection. This tiny formula is the bridge between real model weights and integer storage.
2.2 Dequantization
Main idea. Recover an approximate real value.
Core relation:
$\hat x = s\,(q-z)$
Dequantization maps the stored integer back to a real value on the quantization grid; the gap $x-\hat x$ is the quantization error the rest of this part analyzes.
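A minimal NumPy sketch of the affine quantizer and its inverse, matching the formulas above (the per-tensor granularity and the bit widths are illustrative choices; nothing here is tied to a particular library's quantization API).

```python
import numpy as np

def affine_quantize(x: np.ndarray, n_bits: int = 8):
    """Asymmetric (affine) quantization with one scale and zero point per tensor."""
    q_min, q_max = 0, 2 ** n_bits - 1            # unsigned integer grid
    x_min, x_max = float(x.min()), float(x.max())
    s = (x_max - x_min) / (q_max - q_min)         # step size
    z = int(round(q_min - x_min / s))             # zero point
    q = np.clip(np.round(x / s) + z, q_min, q_max).astype(np.int32)
    return q, s, z

def affine_dequantize(q: np.ndarray, s: float, z: int) -> np.ndarray:
    """Map stored integers back onto the real-valued quantization grid."""
    return s * (q.astype(np.float32) - z)

x = np.random.randn(4096).astype(np.float32)
q, s, z = affine_quantize(x, n_bits=4)
x_hat = affine_dequantize(q, s, z)
print("max |x - x_hat|:", np.abs(x - x_hat).max(), "   step/2:", s / 2)
```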
2.3 Scale
Main idea. The step size controls resolution.
Core relation:
$s=(x_\max-x_\min)/(q_\max-q_\min)$
The scale is the real-valued width of one integer step. A wider captured range means a larger $s$ and coarser resolution; a narrower range means finer resolution but more clipping of values outside it.
2.4 Zero point
Main idea. Asymmetric quantization uses an integer offset.
Core relation:
$z=q_\min-\mathrm{round}(x_\min/s)$
The zero point is the integer that represents the real value $0$, which lets an asymmetric range (such as post-ReLU activations) use the full integer grid. Symmetric quantization fixes $z=0$ and is the common choice for weights.
2.5 Quantization error
Main idea. Rounding error is bounded by half a step before clipping.
Core relation:
$\lvert x-\hat x\rvert \le s/2$ for values inside the clipping range
Values outside the range incur clipping error that can be much larger than $s/2$, which is why range selection, and therefore calibration, matters more than the rounding itself.
3. Granularity
This part studies granularity as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Per-tensor | one scale for the whole tensor | $s$ shared by all entries |
| Per-channel | one scale per output channel or column | $s_j$ per channel $j$ |
| Group-wise | one scale per block of weights | $s_g$ per group of $g$ consecutive weights |
| Activation quantization | activation ranges depend on input data | $[x_\min, x_\max]$ estimated from calibration batches |
| KV-cache quantization | cache precision affects long-context memory and attention quality | $\text{KV bytes} \propto \text{layers}\times\text{seq len}\times\text{bits}$ |
3.1 Per-tensor
Main idea. One scale for the whole tensor.
Core relation:
one scale $s$ (and zero point) shared by every entry of the tensor
Per-tensor scales are the cheapest to store and the friendliest to kernels, but a single outlier anywhere in the tensor stretches the range and wastes resolution for everything else.
3.2 Per-channel
Main idea. One scale per output channel or column.
Core relation:
$s_j$ per output channel or column $j$
Weight magnitudes differ substantially between channels, so per-channel scales recover much of the accuracy lost by a single global scale at almost no storage cost.
3.3 Group-wise
Main idea. One scale per block of weights.
Core relation:
$s_g$ per group of $g$ consecutive weights (typical group sizes are 64 or 128)
Group-wise scales sit between per-tensor and per-element: each block of weights gets its own scale, which isolates outliers to their block at the cost of one extra scale per group.
AI connection. Group scales are one reason modern low-bit LLM quantization can work better than one global scale.
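A sketch of symmetric group-wise weight quantization in NumPy, to make the one-scale-per-block idea concrete (the 4-bit width, group size of 128, and planted outlier are illustrative, not a statement about any particular packed format).

```python
import numpy as np

def groupwise_quantize(w: np.ndarray, n_bits: int = 4, group_size: int = 128):
    """Symmetric group-wise quantization of a 1-D weight vector.

    Each block of `group_size` weights gets its own scale, so an outlier
    only degrades resolution inside its own block.
    """
    q_max = 2 ** (n_bits - 1) - 1                 # e.g. 7 for signed 4-bit
    w = w.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / q_max
    scales = np.maximum(scales, 1e-12)            # avoid division by zero
    q = np.clip(np.round(w / scales), -q_max - 1, q_max)
    return q.astype(np.int8), scales

def groupwise_dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
w[10] = 8.0                                       # plant one outlier
q, s = groupwise_quantize(w, n_bits=4, group_size=128)
err = np.abs(w - groupwise_dequantize(q, s))
print("mean abs error:", err.mean(), " max abs error:", err.max())
```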
3.4 Activation quantization
Main idea. Activation ranges depend on input data.
Core relation:
activation ranges $[x_\min, x_\max]$ vary with the input and must be estimated from data or computed on the fly
Unlike weights, activations are not known ahead of time, and a few channels often carry large outliers. This is why activation quantization is usually harder than weight quantization, and why techniques such as SmoothQuant shift difficulty from activations into weights.
3.5 KV-cache quantization
Main idea. Cache precision affects long-context memory and attention quality.
Core relation:
$\text{KV bytes} \approx 2 \times \text{layers} \times \text{kv heads} \times d_\text{head} \times \text{seq len} \times \text{batch} \times \text{bytes per value}$
The factor of 2 counts keys and values. For long contexts and large batches the cache, not the weights, dominates memory, so reducing cache precision directly buys context length or batch size.
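The cache relation is again easy to check with arithmetic. The sketch below uses illustrative architecture numbers (32 layers, 8 KV heads, head dimension 128, roughly a 7B-class decoder with grouped-query attention); the point is the scaling, not the exact figures.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_value: float) -> float:
    """Approximate KV-cache size: keys and values for every layer and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

cfg = dict(layers=32, kv_heads=8, head_dim=128, seq_len=32_768, batch=8)
for label, b in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(label, round(kv_cache_bytes(**cfg, bytes_per_value=b) / 1e9, 1), "GB")
```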
4. Post-Training Quantization
This part studies post-training quantization as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Calibration data | estimate activation or weight ranges from representative examples | $x_\min, x_\max$ (or percentiles) over a calibration set |
| Clipping | smaller range improves resolution but clips outliers | choose clip threshold $c$ to minimize $\mathbb{E}[(x-\hat x)^2]$ |
| Weighted error | some weights matter more because activations amplify them | $\Vert (W-\hat W)X \Vert^2$ instead of $\Vert W-\hat W\Vert^2$ |
| GPTQ intuition | quantize weights while compensating error using approximate second-order information | $H\approx X X^\top$ guides column-by-column rounding |
| AWQ intuition | protect activation-salient channels during weight quantization | per-channel $\lvert x_j\rvert$ indicates sensitivity |
4.1 Calibration data
Main idea. Estimate activation or weight ranges from representative examples.
Core relation:
$x_\min, x_\max$ (or a high percentile of $\lvert x\rvert$) estimated over a set of representative sequences
PTQ quality depends heavily on the calibration set: if it does not resemble deployment traffic (language, length, domain), the estimated ranges and the resulting scales will be wrong in exactly the places that matter.
4.2 Clipping
Main idea. Smaller range improves resolution but clips outliers.
Core relation:
pick a clip threshold $c<\max\lvert x\rvert$ that minimizes total error $\mathbb{E}[(x-\hat x)^2]$, trading rounding error against clipping error
Clipping shrinks the step size for the bulk of the distribution at the cost of saturating the rare large values; the best threshold depends on how heavy the tails are.
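A small experiment shows the rounding-versus-clipping tradeoff. The sketch sweeps a symmetric clip threshold for 4-bit quantization of a heavy-tailed sample and reports the mean squared error at each setting (the Student-t sample and the candidate thresholds are illustrative).

```python
import numpy as np

def quantize_with_clip(x: np.ndarray, clip: float, n_bits: int = 4) -> np.ndarray:
    """Symmetric quantization of x after clipping to [-clip, clip]."""
    q_max = 2 ** (n_bits - 1) - 1
    s = clip / q_max
    q = np.clip(np.round(x / s), -q_max - 1, q_max)
    return q * s

rng = np.random.default_rng(0)
x = rng.standard_t(df=4, size=100_000).astype(np.float32)   # heavy-tailed sample

for frac in [1.0, 0.5, 0.25, 0.1]:
    clip = frac * np.abs(x).max()
    mse = np.mean((x - quantize_with_clip(x, clip)) ** 2)
    print(f"clip = {frac:>4} * max|x|  ->  mse = {mse:.5f}")
```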
4.3 Weighted error
Main idea. Some weights matter more because activations amplify them.
Core relation:
minimize the output error $\Vert (W-\hat W)X \Vert^2$ on calibration inputs $X$, not the raw weight error $\Vert W-\hat W\Vert^2$
Two weights with the same rounding error can have very different effects on the layer output, because the activations that multiply them have very different magnitudes.
AI connection. Quantizing a weight is more harmful when common activations magnify that weight's error.
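The sketch below contrasts the plain weight error with the activation-weighted (output) error for a toy linear layer, which is the quantity GPTQ-style methods actually try to control (random matrices throughout; the shapes, the planted high-magnitude channel, and the per-tensor 4-bit setting are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_tokens = 256, 256, 1024
W = rng.normal(size=(d_out, d_in)).astype(np.float32)
X = rng.normal(size=(d_in, n_tokens)).astype(np.float32)
X[5] *= 20.0                          # one high-magnitude input channel

# simple symmetric per-tensor 4-bit quantization of W
q_max = 7
s = np.abs(W).max() / q_max
W_hat = np.clip(np.round(W / s), -8, 7) * s

weight_err = np.linalg.norm(W - W_hat) ** 2
output_err = np.linalg.norm((W - W_hat) @ X) ** 2
print("plain weight error      :", weight_err)
print("activation-weighted err :", output_err)
```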
4.4 GPTQ intuition
Main idea. Quantize weights while compensating error using approximate second-order information.
Core relation:
quantize weights one column at a time, using $H \approx X X^\top$ from calibration data to update the not-yet-quantized weights so they absorb the error just introduced
This is the GPTQ recipe in one sentence: greedy rounding with second-order error compensation, which is why it needs calibration activations but no gradient training.
4.5 AWQ intuition
Main idea. Protect activation-salient channels during weight quantization.
Core relation:
per-channel activation magnitude $\lvert x_j\rvert$ indicates sensitivity; rescale salient channels before quantizing
AWQ observes that a small fraction of weight channels sit on large activations; protecting those channels with a per-channel scaling, rather than keeping them in higher precision, preserves most of the quality at a uniform low bit width.
5. Quantization-Aware Training and QLoRA
This part studies quantization-aware training and QLoRA as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Fake quantization | simulate quantization during training while keeping gradients useful | $\hat W=Q(W)$ in forward |
| Straight-through estimator | treat rounding as identity in backward | $\partial\,\mathrm{round}(u)/\partial u \approx 1$ |
| QLoRA pattern | freeze a quantized base and train low-rank adapters | $y=\hat W x + BA\,x$ |
| Optimizer memory | optimizer states are needed for trainable adapters, not frozen base weights | optimizer bytes $\propto$ trainable params |
| Dequantization path | many kernels dequantize blocks on the fly for matmul | $y=\mathrm{dequant}(W_q)\,x$ |
5.1 Fake quantization
Main idea. Simulate quantization during training while keeping gradients useful.
Core relation:
$\hat W=Q(W)$ in the forward pass, while the master weights $W$ stay in full precision
Fake quantization exposes the network to quantization error during training so the weights can adapt to the grid, without actually storing low-bit weights until export.
5.2 Straight-through estimator
Main idea. Treat rounding as identity in backward.
Core relation:
$\dfrac{\partial\,\mathrm{round}(u)}{\partial u} \approx 1$ in the backward pass
Rounding has zero gradient almost everywhere, so the straight-through estimator passes the gradient through as if rounding were the identity, which keeps a training signal flowing to the full-precision master weights.
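A minimal PyTorch sketch of fake quantization with a straight-through estimator, written with the standard `detach` trick rather than a custom autograd function (the symmetric 4-bit setting and tensor sizes are illustrative assumptions).

```python
import torch

def fake_quantize_ste(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Forward: symmetric uniform quantization. Backward: identity (STE).

    (w_q - w).detach() + w equals the quantized value in the forward pass,
    but its gradient with respect to w is exactly 1.
    """
    q_max = 2 ** (n_bits - 1) - 1
    scale = w.detach().abs().max() / q_max
    w_q = torch.clamp(torch.round(w / scale), -q_max - 1, q_max) * scale
    return (w_q - w).detach() + w

w = torch.randn(256, 256, requires_grad=True)
x = torch.randn(32, 256)
y = x @ fake_quantize_ste(w).t()
y.pow(2).mean().backward()
print("gradient is dense and finite:", torch.isfinite(w.grad).all().item())
```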
5.3 QLoRA pattern
Main idea. Freeze a quantized base and train low-rank adapters.
Core relation:
$y = \hat W x + B A\,x$, with $\hat W$ a frozen low-bit base weight (dequantized on the fly) and the low-rank factors $A, B$ trained in higher precision
QLoRA keeps the base model at roughly 4 bits per weight while fine-tuning only the small adapter matrices, so a large model can be adapted on modest hardware.
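A structural sketch of the QLoRA pattern in PyTorch: a frozen, simulated-quantized linear layer plus a trainable low-rank adapter. Real implementations store the base weights in a packed 4-bit format such as NF4 and dequantize blocks inside the kernel; here the quantization is simulated with plain tensors to keep the example self-contained, and the rank and scaling are illustrative choices.

```python
import torch
import torch.nn as nn

def simulate_int4(w: torch.Tensor) -> torch.Tensor:
    """Stand-in for a 4-bit base weight: quantize once, keep the dequantized copy."""
    q_max = 7
    scale = w.abs().max() / q_max
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

class QLoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        base = torch.randn(d_out, d_in) * 0.02
        # frozen "quantized" base weight: no gradient, no optimizer state
        self.register_buffer("w_base", simulate_int4(base))
        # trainable low-rank adapter
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.w_base.t() + self.scaling * (x @ self.A.t() @ self.B.t())

layer = QLoRALinear(512, 512)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print("trainable adapter params:", trainable, "vs frozen base:", layer.w_base.numel())
```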
5.4 Optimizer memory
Main idea. Optimizer states are needed for trainable adapters, not frozen base weights.
Core relation:
optimizer memory $\propto$ trainable parameters (for Adam, roughly two extra states per trainable parameter); the frozen base contributes none
This is the main reason the QLoRA pattern fits on small hardware: the expensive optimizer states exist only for the adapters, while the base model costs only its quantized storage plus activations.
5.5 Dequantization path
Main idea. Many kernels dequantize blocks on the fly for matmul.
Core relation:
$y=\mathrm{dequant}(W_q)\,x$, with blocks of $W_q$ dequantized inside the matmul kernel rather than materialized as a full-precision copy
The low-bit tensor is what lives in GPU memory; whether you also get a speedup depends on how efficiently the kernel fuses dequantization with the matrix multiply.
6. Distillation Basics
This part studies distillation basics as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Teacher and student | train a smaller model to match a larger model | $p_S \to p_T$ |
| Soft targets | teacher probabilities contain similarity information beyond one-hot labels | $p_T(y\mid x)$ instead of a one-hot label |
| Temperature | soften probability distributions | $p_i^{(T)}=\mathrm{softmax}(z_i/T)$ |
| KL distillation loss | minimize divergence from teacher to student | $L_{\mathrm{KD}}=T^2\,\mathrm{KL}(p_T^{(T)}\,\Vert\,p_S^{(T)})$ |
| Hard-label mixture | combine task loss with distillation loss | $L=\alpha L_{\mathrm{CE}}+(1-\alpha)L_{\mathrm{KD}}$ |
6.1 Teacher and student
Main idea. Train a smaller model to match a larger model.
Core relation:
train the student distribution $p_S$ to stay close to the teacher distribution $p_T$ on a shared set of inputs
The teacher can be a larger model, an ensemble, or the same model before compression; what matters is that it provides a richer training signal than the raw labels alone.
6.2 Soft targets
Main idea. Teacher probabilities contain similarity information beyond one-hot labels.
Core relation:
$p_T(y\mid x)$ as the target instead of a one-hot label
The teacher's near-miss probabilities encode which wrong tokens are plausible, and that similarity structure is exactly what a small student cannot easily discover from hard labels alone.
6.3 Temperature
Main idea. Soften probability distributions.
Core relation:
$p_i^{(T)}=\dfrac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$
Temperature $T>1$ flattens both distributions so the small probabilities carry gradient signal; the conventional $T^2$ factor in the loss keeps gradient magnitudes comparable across temperatures.
6.4 KL distillation loss
Main idea. Minimize divergence from teacher to student.
Core relation:
$L_{\mathrm{KD}} = T^2\,\mathrm{KL}\!\left(p_T^{(T)} \,\Vert\, p_S^{(T)}\right) = T^2 \sum_i p_{T,i}^{(T)} \log\frac{p_{T,i}^{(T)}}{p_{S,i}^{(T)}}$
The forward KL direction penalizes the student for assigning low probability where the teacher assigns high probability, which pushes the student to cover the teacher's modes.
AI connection. Distillation trains the student on the teacher's distribution, not only the final answer.
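A compact PyTorch version of the temperature-scaled KL distillation loss, applied to per-token logits (batch, sequence, and vocabulary sizes are illustrative; the `T**2` factor follows the standard Hinton-style convention).

```python
import torch
import torch.nn.functional as F

def distillation_kl(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    temperature: float = 2.0) -> torch.Tensor:
    """T^2 * KL(p_teacher^(T) || p_student^(T)), averaged over token positions."""
    t = temperature
    v = student_logits.size(-1)
    log_p_student = F.log_softmax(student_logits.reshape(-1, v) / t, dim=-1)
    p_teacher = F.softmax(teacher_logits.reshape(-1, v) / t, dim=-1)
    # F.kl_div expects log-probabilities for the input and probabilities for the target
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return (t ** 2) * kl

student_logits = torch.randn(4, 32, 32000)   # (batch, seq, vocab)
teacher_logits = torch.randn(4, 32, 32000)
print(distillation_kl(student_logits, teacher_logits).item())
```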
6.5 Hard-label mixture
Main idea. Combine task loss with distillation loss.
Core relation:
$L=\alpha\,L_{\mathrm{CE}}(\text{student, labels}) + (1-\alpha)\,L_{\mathrm{KD}}(\text{student, teacher})$
The hard-label term keeps the student anchored to ground truth where the teacher is wrong, while the soft term transfers the teacher's distributional knowledge; $\alpha$ is tuned on a validation set.
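The mixture is one line on top of the standard cross-entropy. This sketch deliberately overlaps the previous one so it runs on its own; `alpha` and the temperature are tunable assumptions, not recommended values.

```python
import torch
import torch.nn.functional as F

def distillation_mixture(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         labels: torch.Tensor,
                         alpha: float = 0.5,
                         temperature: float = 2.0) -> torch.Tensor:
    """alpha * cross-entropy on hard labels + (1 - alpha) * T^2-scaled KL to the teacher."""
    v = student_logits.size(-1)
    flat_student = student_logits.reshape(-1, v)
    flat_teacher = teacher_logits.reshape(-1, v)
    # hard-label task loss
    ce = F.cross_entropy(flat_student, labels.reshape(-1))
    # soft-label distillation loss at temperature T
    t = temperature
    kd = F.kl_div(F.log_softmax(flat_student / t, dim=-1),
                  F.softmax(flat_teacher / t, dim=-1),
                  reduction="batchmean") * (t ** 2)
    return alpha * ce + (1.0 - alpha) * kd

student_logits = torch.randn(4, 32, 32000)
teacher_logits = torch.randn(4, 32, 32000)
labels = torch.randint(0, 32000, (4, 32))
print(distillation_mixture(student_logits, teacher_logits, labels).item())
```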
7. LLM Distillation Types
This part studies LLM distillation types as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Logit distillation | match next-token distributions | $\mathrm{KL}(p_T(\cdot\mid x_{<t})\,\Vert\,p_S(\cdot\mid x_{<t}))$ per position |
| Sequence distillation | train on teacher-generated completions | $L_{\mathrm{CE}}$ on $(x, \hat y)$ with $\hat y \sim p_T$ |
| Feature distillation | match hidden states or attention maps | $\Vert h_S - W_{\mathrm{proj}} h_T\Vert^2$ |
| Preference distillation | transfer teacher comparisons or judge preferences | teacher-labeled pairs $y^{+}\succ y^{-}$ |
| Reasoning trace distillation | train on teacher-produced intermediate reasoning when appropriate | $L_{\mathrm{CE}}$ on (prompt, trace, answer) |
7.1 Logit distillation
Main idea. Match next-token distributions.
Core relation:
per-position $\mathrm{KL}\!\left(p_T(\cdot\mid x_{<t}) \,\Vert\, p_S(\cdot\mid x_{<t})\right)$ summed over the sequence
Logit distillation needs access to teacher logits (or at least top-$k$ probabilities) under a shared tokenizer, which makes it the richest but also the most infrastructure-heavy signal.
7.2 Sequence distillation
Main idea. Train on teacher-generated completions.
Core relation:
sample $\hat y \sim p_T(\cdot\mid x)$, then train the student with ordinary cross-entropy on $(x, \hat y)$
This is the simplest pipeline because it only needs teacher text, not teacher logits, but the student sees one sampled completion per prompt and can lose the diversity of the teacher's distribution.
7.3 Feature distillation
Main idea. Match hidden states or attention maps.
Core relation:
$\Vert h_S(x) - W_{\mathrm{proj}}\,h_T(x)\Vert^2$ on selected layers, where $W_{\mathrm{proj}}$ maps between student and teacher hidden sizes
Feature distillation supervises intermediate representations (hidden states or attention maps) and is most useful when the student shares an architecture family and tokenizer with the teacher.
7.4 Preference distillation
Main idea. Transfer teacher comparisons or judge preferences.
Core relation:
teacher or judge preferences $y^{+}\succ y^{-}$ used as training pairs for a preference objective (for example a DPO-style loss)
Instead of imitating exact outputs, the student learns which of two responses the teacher prefers, which transfers ranking behavior even when exact token matching is not desirable.
7.5 Reasoning trace distillation
Main idea. Train on teacher-produced intermediate reasoning when appropriate.
Core relation:
cross-entropy on (prompt, teacher trace, final answer) triples
Training on intermediate reasoning can transfer multi-step behavior to a smaller model; the qualifier "when appropriate" matters, since traces should be filtered for correctness and any licensing or policy constraints on teacher outputs apply.
8. Error and Evaluation
This part studies error and evaluation as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Perplexity shift | quantization can be measured by held-out NLL change | $\Delta\mathrm{PPL}=\mathrm{PPL}_{\text{compressed}}-\mathrm{PPL}_{\text{base}}$ |
| Task score shift | compression should be checked on downstream tasks | $\Delta$ score per benchmark |
| Calibration shift | probabilities may become miscalibrated | $\Delta\mathrm{ECE}$ |
| Layer sensitivity | some layers or projections tolerate fewer bits poorly | per-layer error when quantizing one layer at a time |
| Outlier channels | activation outliers often dominate low-bit error | $\max_j \lvert x_j\rvert$ sets the range |
8.1 Perplexity shift
Main idea. Quantization can be measured by held-out NLL change.
Core relation:
$\mathrm{PPL}=\exp\!\left(\tfrac{1}{N}\sum_t -\log p(x_t\mid x_{<t})\right)$, reported as $\Delta\mathrm{PPL}$ against the uncompressed base on the same data
Perplexity is a cheap, sensitive smoke test: a large jump almost always means something is broken, but a small shift does not guarantee that downstream tasks are unaffected.
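A minimal sketch of measuring the perplexity shift between a base and a compressed model from per-token negative log-likelihoods (the hard-coded NLL lists are hypothetical stand-ins for whatever evaluation loop produces token NLLs in practice).

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# hypothetical per-token NLLs from the same held-out text under both models
base_nlls       = [2.31, 1.07, 0.45, 3.10, 2.02, 0.88]
compressed_nlls = [2.35, 1.10, 0.47, 3.25, 2.08, 0.90]

ppl_base = perplexity(base_nlls)
ppl_comp = perplexity(compressed_nlls)
print(f"base PPL {ppl_base:.2f}  compressed PPL {ppl_comp:.2f}  shift {ppl_comp - ppl_base:+.2f}")
```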
8.2 Task score shift
Main idea. Compression should be checked on downstream tasks.
Core relation:
$\Delta$ score per downstream benchmark, compared against the run-to-run noise of that benchmark
Different tasks degrade at different rates under compression; generation-heavy and multi-step tasks are often reported to degrade earlier than perplexity suggests, so pick evaluations that match the deployment workload.
8.3 Calibration shift
Main idea. Probabilities may become miscalibrated.
Core relation:
change in expected calibration error, $\Delta\mathrm{ECE}$, or a reliability-curve comparison between base and compressed models
Even when accuracy holds, compressed models can become over- or under-confident, which matters whenever downstream systems consume the probabilities rather than just the argmax.
8.4 Layer sensitivity
Main idea. Some layers or projections tolerate fewer bits poorly.
Core relation:
per-layer sensitivity: quantize one layer (or one projection type) at a time and record the quality drop
Embeddings, the output head, and certain attention projections are commonly more sensitive than MLP weights, which is why mixed-precision schemes keep the most sensitive tensors in more bits.
8.5 Outlier channels
Main idea. Activation outliers often dominate low-bit error.
Core relation:
$\max_j \lvert x_j\rvert$ sets the quantization range, so a handful of outlier channels can dominate the error budget for every other channel
Outlier-aware tricks (per-channel scales, SmoothQuant-style rescaling, keeping a few channels in higher precision) exist precisely to contain this effect.
9. Deployment Choices
This part studies deployment choices as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Weight-only quantization | reduce weight bandwidth while leaving activations higher precision | e.g. W4A16 |
| Weight-activation quantization | quantize both weights and activations for more kernel speed | e.g. W8A8 |
| KV-cache quantization | increase context or batch capacity | cache bytes $\propto$ bits per entry |
| Distill then quantize | a smaller student can also be quantized | teacher $\to$ student $\to$ quantized student |
| Hardware format | INT4, INT8, FP8, and NF4 need matching kernels | — |
9.1 Weight-only quantization
Main idea. Reduce weight bandwidth while leaving activations higher precision.
Core relation: decode is usually memory-bandwidth bound, so loading 4-bit weights instead of 16-bit weights cuts the bytes moved per token by roughly $4\times$, even though the multiply-accumulate still happens in fp16 or bf16 after dequantization.
Implementation check. Confirm that the serving kernel fuses dequantization into the matmul; a separate dequantize-then-matmul path can erase the bandwidth win.
AI connection. Weight-only INT4 or NF4 is a common way to fit a large model on a single consumer GPU with modest quality loss.
Common mistake. A smaller checkpoint is not automatically faster; weight-only quantization mainly helps when decode is bandwidth bound, not when prefill compute dominates.
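A minimal sketch of group-wise weight-only quantization with the activations kept in float; the group size, shapes, and function names are illustrative, and a real kernel would fuse the dequantization rather than materializing the full matrix.

```python
import numpy as np

def quantize_groupwise(w, bits=4, group=64):
    """Quantize W in groups of `group` rows per output column, each with its own scale."""
    qmax = 2 ** (bits - 1) - 1
    k, n = w.shape
    assert k % group == 0
    wg = w.reshape(k // group, group, n)
    scales = np.max(np.abs(wg), axis=1, keepdims=True) / qmax
    q = np.clip(np.round(wg / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def weight_only_matmul(x, q, scales):
    """Activations stay float; weights are dequantized just before the matmul."""
    w_hat = (q * scales).reshape(-1, q.shape[-1]).astype(np.float32)
    return x @ w_hat

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(512, 256)).astype(np.float32)
x = rng.normal(size=(8, 512)).astype(np.float32)
q, scales = quantize_groupwise(w)
err = float(np.mean((weight_only_matmul(x, q, scales) - x @ w) ** 2))
print("output MSE from 4-bit weight-only quantization:", err)
```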
9.2 Weight-activation quantization
Main idea. Quantize both weights and activations for more kernel speed.
Core relation: with symmetric scales $s_x$ and $s_w$, the product $Y \approx (s_x s_w)\,(Q_x Q_w)$ can be computed with integer matrix multiplies and a single rescale, which is what INT8 and FP8 tensor cores accelerate.
Implementation check. Activation quantization is where most of the quality risk lives; calibrate activation ranges on representative prompts and watch for the outlier channels discussed in Section 8.5.
Common mistake. Do not assume weight-activation quantization is always faster end to end; if activations must be quantized and dequantized around every matmul without kernel fusion, the overhead can cancel the speedup.
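A minimal W8A8-style sketch: both tensors get per-tensor symmetric INT8 scales, the matmul accumulates in int32, and one rescale maps the result back to float. The shapes and helper names are illustrative.

```python
import numpy as np

def quant_int8(t):
    """Per-tensor symmetric INT8 quantization: returns int8 values and a scale."""
    scale = np.max(np.abs(t)) / 127.0
    q = np.clip(np.round(t / scale), -128, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 256)).astype(np.float32)
w = rng.normal(0.0, 0.02, size=(256, 128)).astype(np.float32)

qx, sx = quant_int8(x)
qw, sw = quant_int8(w)

# Integer matmul with int32 accumulation, then a single rescale back to float.
acc = qx.astype(np.int32) @ qw.astype(np.int32)
y_hat = acc.astype(np.float32) * (sx * sw)

y = x @ w
print("relative error:", float(np.linalg.norm(y - y_hat) / np.linalg.norm(y)))
```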
9.3 KV-cache quantization
Main idea. Increase context or batch capacity.
Core relation: KV-cache memory grows as $2 \times \text{layers} \times \text{kv heads} \times d_{\text{head}} \times \text{sequence length} \times \text{batch} \times \text{bytes per value}$, so halving bytes per value roughly doubles the context length or batch size that fits in the same memory.
Implementation check. Quantize keys and values with per-head or per-channel scales and evaluate long-context tasks specifically; short-prompt benchmarks can hide cache-quantization damage.
Common mistake. Do not tune KV-cache precision on short sequences only; attention over a long, coarsely quantized cache is where errors accumulate.
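The sketch below just evaluates the core relation for a hypothetical 7B-class configuration (32 layers, 8 KV heads, head dimension 128); the numbers are illustrative and not tied to any particular model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value):
    """Keys and values: 2 tensors per layer of shape [batch, kv_heads, seq_len, head_dim]."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical 7B-class config, 8192-token context, batch of 4 sequences.
for name, bytes_per_value in [("fp16", 2), ("int8", 1), ("int4 (packed)", 0.5)]:
    gib = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=4,
                         bytes_per_value=bytes_per_value) / 2**30
    print(f"{name}: {gib:.2f} GiB of KV cache")
```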
9.4 Distill then quantize
Main idea. A smaller student can also be quantized.
Core relation: the two error sources compose; roughly, the shipped quality gap $\approx$ (distillation gap of the student) $+$ (quantization error on the student), so both must be measured on the final artifact.
Implementation check. Distill first, then quantize the student, and evaluate the quantized student directly; evaluating only the full-precision student overstates the shipped quality.
Common mistake. Do not assume a student that distilled well is automatically robust to low-bit quantization; small students can be more sensitive per layer than the teacher.
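For the distillation half of this pipeline, the sketch below computes the temperature-scaled KL loss the student would be trained with before quantization, following the $T^2$ scaling of Hinton et al. (2015); the logits here are random stand-ins.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)      # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_kl(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) at temperature T, scaled by T^2."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return (T ** 2) * float(kl.mean())

rng = np.random.default_rng(0)
t = rng.normal(size=(4, 10))                    # teacher logits (toy)
s = t + rng.normal(scale=0.5, size=(4, 10))     # imperfect student logits
print("distillation loss:", distill_kl(t, s, T=2.0))
```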
9.5 Hardware format
Main idea. INT4, INT8, FP8, and NF4 need matching kernels.
Core relation: end-to-end speed depends on storage format, kernel support, and hardware generation together, not on bit width alone; a format the accelerator cannot multiply natively must be dequantized before compute.
INT8 and FP8 have native tensor-core paths on recent accelerators, while INT4 and NF4 are usually weight-only storage formats that are dequantized to fp16 or bf16 inside the kernel.
Implementation check. Verify that the chosen format has a maintained kernel in the serving stack you actually deploy, and benchmark on the target GPU rather than extrapolating from another device.
Common mistake. Do not pick a format from a paper's accuracy table alone; without kernel support the smaller file gives no latency benefit.
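One reason formats need matching kernels is that sub-byte values must be packed and unpacked; the sketch below packs two signed 4-bit values per byte and recovers them, which is the storage-side half of what an INT4 kernel does before multiplying. The packing layout (low nibble first) is a choice made for the demo.

```python
import numpy as np

def pack_int4(q):
    """Pack signed 4-bit values in [-8, 7] two per byte, low nibble first."""
    assert q.size % 2 == 0
    u = (q.astype(np.int8) & 0x0F).astype(np.uint8)       # two's-complement nibbles
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Recover signed 4-bit values from packed bytes."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return np.where(out > 7, out - 16, out)                # sign-extend each nibble

q = np.array([-8, -1, 0, 3, 7, -5], dtype=np.int8)
packed = pack_int4(q)
print("bytes used:", packed.nbytes, "for", q.size, "values")
print("round trip ok:", bool(np.array_equal(unpack_int4(packed), q)))
```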
10. Debugging Checklist
This part studies debugging checklist as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Calibration representativeness | Calibration data should match deployment prompts | |
| Layer-by-layer error | Inspect where compression hurts | |
| Reference comparisons | Compare logits before and after compression | $\max \lvert \Delta \text{logits} \rvert$ |
| Latency measurement | Confirm the chosen format is actually faster | |
| Quality gates | Do not ship a compressed model without task and safety checks | $\lvert \Delta S \rvert$ bounded |
10.1 Calibration representativeness
Main idea. Calibration data should match deployment prompts.
Core relation: quantization parameters are fitted to the statistics of the calibration set, so if deployment activations have a different range or outlier pattern, the chosen scales and clipping points are wrong for the traffic that actually matters.
Implementation check. Build the calibration set from real or realistic deployment prompts, covering the languages, lengths, domains, and system prompts you serve, and spot-check activation ranges on both sets.
Common mistake. Do not reuse one calibration set across very different deployments; a model calibrated on English web text can degrade disproportionately on code or long multilingual prompts.
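One cheap check, sketched below with synthetic activations: compare per-channel activation ranges on the calibration batch against a sample of deployment traffic. Ratios well above 1 mean the calibrated scales will clip. The function name and the 1.5 threshold are illustrative choices.

```python
import numpy as np

def range_mismatch(calib_acts, deploy_acts):
    """Ratio of deployment to calibration per-channel ranges (>1 means clipping risk)."""
    calib_max = np.abs(calib_acts).max(axis=0)
    deploy_max = np.abs(deploy_acts).max(axis=0)
    return deploy_max / np.maximum(calib_max, 1e-8)

rng = np.random.default_rng(0)
calib = rng.normal(0.0, 1.0, size=(1024, 64))     # stand-in for calibration activations
deploy = rng.normal(0.0, 1.0, size=(1024, 64))    # stand-in for deployment activations
deploy[:, 10] *= 5.0                              # one channel behaves differently in deployment
ratio = range_mismatch(calib, deploy)
print("channels at clipping risk:", np.where(ratio > 1.5)[0])
```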
10.2 Layer-by-layer error
Main idea. Inspect where compression hurts.
Core relation: per-layer error $e_\ell = \lVert y_\ell - \hat{y}_\ell \rVert$, the distance between the base layer's output $y_\ell$ and the compressed layer's output $\hat{y}_\ell$ on the same inputs.
Implementation check. Hook both models, feed identical calibration batches, and record $e_\ell$ for every layer; a single layer that dominates the total error is the first candidate for higher precision or re-quantization.
Common mistake. Do not inspect weight-space error alone; a small weight perturbation can still produce a large output error when the layer's inputs contain outliers.
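The sketch below runs a full-precision and a fully quantized toy stack side by side on the same inputs and prints the relative output error after each layer, which is how accumulation shows up; the stack and shapes are stand-ins for real transformer blocks.

```python
import numpy as np

def quantize_dequantize(w, bits=4):
    """Symmetric round-trip quantization used to build the compressed copy."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
layers = [rng.normal(0.0, 0.05, size=(256, 256)) for _ in range(4)]
q_layers = [quantize_dequantize(w, bits=4) for w in layers]
x = rng.normal(size=(32, 256))

h_base, h_quant = x, x
for i, (w, wq) in enumerate(zip(layers, q_layers)):
    h_base = np.maximum(h_base @ w, 0.0)      # full-precision path
    h_quant = np.maximum(h_quant @ wq, 0.0)   # fully quantized path
    rel = float(np.linalg.norm(h_base - h_quant) / np.linalg.norm(h_base))
    print(f"after layer {i}: relative output error {rel:.3e}")
```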
10.3 Reference comparisons
Main idea. Compare logits before and after compression.
Core relation: on identical inputs, track $\max \lvert \Delta \text{logits} \rvert$, the mean KL divergence between base and compressed next-token distributions, and top-1 agreement; logits are a far more sensitive probe than downstream task scores.
Implementation check. Run both models on the same fixed prompt set with identical tokenization under teacher forcing, and store the base logits once so every new compression run compares against the same reference.
Common mistake. Do not use sampled generations as the first debugging signal; sampling noise hides small but systematic logit drift.
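A minimal sketch of such a reference comparison; the base logits and the additive noise standing in for quantization error are synthetic, and the summary statistics are the ones named above.

```python
import numpy as np

def compare_logits(base_logits, comp_logits):
    """Summary statistics for a base-vs-compressed logit comparison on shared inputs."""
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    p, q = softmax(base_logits), softmax(comp_logits)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return {
        "max_abs_logit_diff": float(np.max(np.abs(base_logits - comp_logits))),
        "mean_kl": float(np.mean(kl)),
        "top1_agreement": float(np.mean(base_logits.argmax(-1) == comp_logits.argmax(-1))),
    }

rng = np.random.default_rng(0)
base = rng.normal(size=(64, 32000))                       # stand-in for base-model logits
comp = base + rng.normal(scale=0.05, size=base.shape)     # stand-in for quantization noise
print(compare_logits(base, comp))
```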
10.4 Latency measurement
Main idea. Confirm the chosen format is actually faster.
Core relation: decode latency is roughly bytes moved per token divided by memory bandwidth, plus kernel overhead, so a smaller format only helps if the serving kernel actually reads the smaller representation.
Implementation check. Measure prefill and decode separately on the target hardware, with warmup, realistic batch sizes, and the production serving stack; report tokens per second, not just wall-clock time for a single prompt.
AI connection. A smaller file is not automatically a faster model if kernels do not support the format well.
Common mistake. Do not benchmark at batch size 1 only; quantization kernels that win at small batches can lose at the large batches a production server runs.
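Real latency numbers should come from the actual serving stack on the target GPU; the sketch below only shows the warmup-then-average measurement pattern, with a projection-sized matmul standing in for one decode step.

```python
import time
import numpy as np

def steps_per_second(step_fn, warmup=10, iters=50):
    """Crude throughput estimate: time repeated calls to step_fn after a warmup."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    return iters / (time.perf_counter() - start)

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 4096)).astype(np.float32)   # stand-in for one projection
x = rng.normal(size=(1, 4096)).astype(np.float32)      # single-token decode input
print(f"{steps_per_second(lambda: x @ w):.1f} matmuls/s")
```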
10.5 Quality gates
Main idea. Do not ship a compressed model without task and safety checks.
Core relation: $\lvert \Delta S \rvert$ bounded, where $S$ is each tracked task or safety score and the bound is agreed before compression begins.
Implementation check. Gate the release on a fixed evaluation suite covering task quality, calibration, long-context behavior, and safety checks, and require the compressed model to stay within the agreed bound on every gate, not just on average.
Common mistake. Do not treat a small average perplexity shift as clearance to ship; averages can hide large regressions on specific domains, languages, or safety-relevant prompts.
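A minimal sketch of such a gate: compare per-metric scores of the base and compressed models and refuse to ship if any tracked metric drops by more than an agreed amount. The metric names and the 0.01 threshold are placeholders, not recommendations.

```python
def passes_quality_gates(base_scores, comp_scores, max_drop=0.01):
    """Reject a compressed model if any tracked metric drops by more than `max_drop`."""
    failures = {
        name: (base_scores[name], comp_scores[name])
        for name in base_scores
        if base_scores[name] - comp_scores[name] > max_drop
    }
    return len(failures) == 0, failures

base = {"task_accuracy": 0.71, "safety_refusal_rate": 0.98}   # placeholder scores
comp = {"task_accuracy": 0.70, "safety_refusal_rate": 0.93}
ok, failed = passes_quality_gates(base, comp)
print("ship?", ok, "| failing gates:", failed)
```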
Practice Exercises
- Quantize and dequantize scalar values with an affine quantizer.
- Compute symmetric INT4 scale and error.
- Compare per-tensor and per-channel quantization error.
- Compute memory reduction from bf16 to 4-bit weights.
- Sweep clipping ranges and choose the lowest MSE.
- Compute distillation probabilities at temperature.
- Compute KL distillation loss for teacher and student distributions.
- Combine hard-label CE and distillation loss.
- Estimate QLoRA optimizer-state memory.
- Write a compression deployment checklist.
Why This Matters for AI
Compression determines who can run a model, how much serving costs, and which devices can host useful AI locally. The math matters because bad compression can keep a model small but silently damage probabilities, calibration, long-context behavior, or safety behavior.
Bridge to RAG Math and Retrieval
Compression makes a model cheaper. Retrieval can make a model more informed without changing all its weights. The next section studies embedding retrieval, similarity search, ranking, context packing, and how retrieval changes the conditional distribution used by an LLM.
References
- Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the Knowledge in a Neural Network", 2015: https://arxiv.org/abs/1503.02531
- Benoit Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference", 2017: https://arxiv.org/abs/1712.05877
- Tim Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale", 2022: https://arxiv.org/abs/2208.07339
- Elias Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers", 2022: https://arxiv.org/abs/2210.17323
- Guangxuan Xiao et al., "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models", 2022: https://arxiv.org/abs/2211.10438
- Ji Lin et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration", 2023: https://arxiv.org/abs/2306.00978
- Tim Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs", 2023: https://arxiv.org/abs/2305.14314