Math for LLMs

Quantization and Distillation


Notes

Quantization compresses tensors by changing their numeric representation. Distillation compresses behavior by training a smaller or cheaper model to imitate a stronger teacher. Both are central to making LLMs cheaper to store, serve, and deploy.

Overview

The simplest uniform quantizer is:

$q=\mathrm{round}(x/s)+z,\qquad \hat x=s(q-z).$

Here $s$ is the scale and $z$ is the zero point. Quantization asks how much error this approximation introduces and whether the hardware can exploit the smaller representation.
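
Code sketch. A minimal affine quantize and dequantize round trip in NumPy. The signed 4-bit range and the random weight tensor are illustrative assumptions, not values prescribed by the text.

```python
import numpy as np

def quantize(x, s, z, q_min=-8, q_max=7):
    """q = round(x / s) + z, clipped to the integer range."""
    return np.clip(np.round(x / s) + z, q_min, q_max).astype(np.int8)

def dequantize(q, s, z):
    """x_hat = s * (q - z)."""
    return s * (q.astype(np.float32) - z)

x = np.random.uniform(-1.0, 1.0, size=1024).astype(np.float32)
s = np.abs(x).max() / 7.0   # symmetric scale for a signed 4-bit grid
z = 0                        # symmetric quantization: zero point is 0
x_hat = dequantize(quantize(x, s, z), s, z)
print("max abs error:", np.abs(x - x_hat).max(), "half step:", s / 2)
```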

Distillation uses a teacher distribution:

$L_\mathrm{KD}=\tau^2 D_\mathrm{KL}(p_T^{(\tau)}\Vert p_S^{(\tau)}).$

The student learns from teacher probabilities, generated sequences, features, or preferences. Quantization and distillation can be combined: distill to a smaller model, then quantize it for deployment.

Prerequisites

  • Logits, softmax, KL divergence, and cross-entropy
  • Matrix multiplication and tensor shapes
  • Inference memory and bandwidth constraints
  • Fine-tuning and LoRA basics

Companion Notebooks

| Notebook | Purpose |
| --- | --- |
| theory.ipynb | Demonstrates uniform quantization, clipping, group-wise scales, error versus bits, SmoothQuant intuition, distillation temperature, KL loss, and memory savings. |
| exercises.ipynb | Ten practice problems for quantizer formulas, memory savings, clipping, KL distillation, and deployment checks. |

Learning Objectives

After this section, you should be able to:

  • Implement affine quantization and dequantization.
  • Explain scale, zero point, clipping, and quantization error.
  • Compare per-tensor, per-channel, and group-wise quantization.
  • Explain PTQ, QAT, GPTQ intuition, AWQ intuition, and QLoRA.
  • Compute memory savings from lower bit widths.
  • Compute distillation KL loss with temperature.
  • Explain logit, sequence, feature, and preference distillation.
  • Build a quality and latency evaluation checklist for compressed LLMs.

Table of Contents

  1. Compression Goals
  2. Uniform Quantization
  3. Granularity
  4. Post-Training Quantization
  5. Quantization-Aware Training and QLoRA
  6. Distillation Basics
  7. LLM Distillation Types
  8. Error and Evaluation
  9. Deployment Choices
  10. Debugging Checklist

Compression Map

| Method | Changes | Needs data? | Main benefit | Main risk |
| --- | --- | --- | --- | --- |
| PTQ | Numeric format after training | Calibration data | Fast compression | Calibration mismatch |
| QAT | Training simulates quantization | Training data | Better low-bit robustness | More compute |
| QLoRA | Quantized frozen base plus LoRA | Fine-tune data | Cheap adaptation | Activation memory remains |
| Logit distillation | Student matches teacher probabilities | Teacher outputs | Smaller model behavior transfer | Teacher errors transfer |
| Sequence distillation | Student trains on teacher completions | Teacher generations | Simple data pipeline | Diversity loss |

1. Compression Goals

This part studies compression goals as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Memory reduction | Store fewer bytes per parameter or cache entry | $M = P \cdot b / 8$ |
| Bandwidth reduction | Read fewer bytes per generated token | $T_\mathrm{read} \approx M / \mathrm{bandwidth}$ |
| Compute support | Low precision helps only when kernels and hardware support it | $T_\mathrm{kernel}$ |
| Quality preservation | Compressed outputs should stay close to reference outputs | $D_\mathrm{KL}(p_\mathrm{ref}\Vert p_\mathrm{comp})$ |
| Pareto frontier | Compression is a quality-cost tradeoff | $\mathrm{quality}$ versus $\mathrm{memory}$, $\mathrm{latency}$, $\mathrm{cost}$ |

1.1 Memory reduction

Main idea. Store fewer bytes per parameter or cache entry.

Core relation:

$M = P \cdot b / 8$, where $P$ is the number of stored values and $b$ is the bit width.

Quantization changes the numerical representation of tensors. Distillation changes the training signal so a smaller or cheaper model imitates a stronger teacher. Both are compression tools, but they fail in different ways: quantization can introduce numerical error, while distillation can omit teacher capabilities that are not present in the distillation data.

Worked micro-example. If weights lie in $[-1,1]$ and we use signed 4-bit integers with values from $-8$ to $7$, a symmetric step size is roughly $s=1/7$. A real weight $0.33$ maps to integer $\mathrm{round}(0.33/s)$ and dequantizes back to the nearest grid point. Smaller $s$ improves resolution near zero but clips large values if the range is too narrow.

Implementation check. Always compare base and compressed logits on the same inputs. Then check held-out loss, task quality, calibration, memory, and latency. Compression is successful only if the target tradeoff improves.

AI connection. Bit width is a practical compression control variable: halving $b$ halves both the bytes stored and the bytes read per generated token.

Common mistake. Do not report "4-bit" without saying what is quantized, the granularity, the calibration data, and the serving kernel.
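
Code sketch. A quick check of the relation above; the 7-billion-parameter count is a hypothetical example.

```python
def weight_memory_gib(num_params: float, bits: int) -> float:
    """M = P * b / 8 bytes, reported in GiB (weights only)."""
    return num_params * bits / 8 / 1024**3

P = 7e9  # hypothetical 7B-parameter model
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gib(P, bits):.1f} GiB")
# Roughly 13.0, 6.5, and 3.3 GiB; KV cache and activations are extra.
```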

1.2 Bandwidth reduction

Main idea. Read fewer bytes per generated token.

Core relation:

$T_\mathrm{read} \approx M / \mathrm{bandwidth}$
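
Code sketch. A rough decode-time estimate, assuming generation is memory-bandwidth bound so each generated token reads roughly all weight bytes once; the parameter count and 1 TB/s bandwidth are illustrative assumptions.

```python
def seconds_per_token(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """T_read ~ M / bandwidth: time to stream the weights once per generated token."""
    return weight_bytes / bandwidth_bytes_per_s

P = 7e9            # assumed parameter count
bandwidth = 1e12   # assumed 1 TB/s of accelerator memory bandwidth
for bits in (16, 8, 4):
    m = P * bits / 8
    print(f"{bits}-bit weights: ~{1e3 * seconds_per_token(m, bandwidth):.1f} ms/token")
# Ignores KV-cache reads and compute; the point is that fewer bytes means fewer milliseconds.
```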


1.3 Compute support

Main idea. Low precision helps only when kernels and hardware support it.

Core relation:

$T_\mathrm{kernel}$


1.4 Quality preservation

Main idea. Compressed outputs should stay close to reference outputs.

Core relation:

$D_\mathrm{KL}(p_\mathrm{ref}\Vert p_\mathrm{comp})$
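
Code sketch. Comparing base and compressed next-token distributions on the same input; the toy logit vectors stand in for real model outputs.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def kl_divergence(logits_ref, logits_comp):
    """D_KL(p_ref || p_comp) for one next-token distribution."""
    log_p = log_softmax(logits_ref)
    log_q = log_softmax(logits_comp)
    p = np.exp(log_p)
    return float((p * (log_p - log_q)).sum())

# Toy logits standing in for the base and the compressed model on the same prompt.
logits_ref = np.array([2.0, 1.0, 0.1, -1.0])
logits_comp = logits_ref + np.random.normal(0, 0.05, size=4)  # small perturbation
print("KL(ref || comp):", kl_divergence(logits_ref, logits_comp))
```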


1.5 Pareto frontier

Main idea. Compression is a quality-cost tradeoff.

Core relation:

$\mathrm{quality}$ versus $\mathrm{memory}$, $\mathrm{latency}$, $\mathrm{cost}$


2. Uniform Quantization

This part studies uniform quantization as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Affine quantizer | Map real values to integer grid points | $q=\mathrm{round}(x/s)+z$ |
| Dequantization | Recover an approximate real value | $\hat x=s(q-z)$ |
| Scale | The step size controls resolution | $s=(x_\max-x_\min)/(q_\max-q_\min)$ |
| Zero point | Asymmetric quantization uses an integer offset | $z=q_\min-\mathrm{round}(x_\min/s)$ |
| Quantization error | Rounding error is bounded by half a step before clipping | $\lvert x-\hat x\rvert\le s/2$ |

2.1 Affine quantizer

Main idea. Map real values to integer grid points.

Core relation:

$q=\mathrm{round}(x/s)+z$


AI connection. This tiny formula is the bridge between real model weights and integer storage.


2.2 Dequantization

Main idea. Recover an approximate real value.

Core relation:

$\hat x=s(q-z)$


2.3 Scale

Main idea. The step size controls resolution.

Core relation:

$s=(x_\max-x_\min)/(q_\max-q_\min)$


2.4 Zero point

Main idea. Asymmetric quantization uses an integer offset.

Core relation:

$z=q_\min-\mathrm{round}(x_\min/s)$
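
Code sketch. Deriving the scale and zero point for an asymmetric unsigned 8-bit grid and round-tripping a tensor; the [0, 255] range and the toy data are illustrative assumptions.

```python
import numpy as np

def affine_params(x, q_min=0, q_max=255):
    """s = (x_max - x_min) / (q_max - q_min), z = q_min - round(x_min / s)."""
    x_min, x_max = float(x.min()), float(x.max())
    s = (x_max - x_min) / (q_max - q_min)
    z = int(q_min - round(x_min / s))
    return s, z

x = np.random.uniform(-0.2, 1.3, size=512).astype(np.float32)  # asymmetric range
s, z = affine_params(x)
q = np.clip(np.round(x / s) + z, 0, 255)
x_hat = s * (q - z)
print("scale:", s, "zero point:", z, "max error:", np.abs(x - x_hat).max())
```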


2.5 Quantization error

Main idea. Rounding error is bounded by half a step before clipping.

Core relation:

$|x-\hat x|\le s/2$
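
Code sketch. A sweep of worst-case rounding error against bit width, assuming symmetric quantization of values drawn from $[-1,1]$ with no clipping.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=10_000)

for bits in (8, 6, 4, 3, 2):
    q_max = 2 ** (bits - 1) - 1          # e.g. 7 for signed 4-bit
    s = np.abs(x).max() / q_max          # symmetric scale covering the full range
    q = np.clip(np.round(x / s), -q_max - 1, q_max)
    err = np.abs(x - s * q).max()
    print(f"{bits}-bit: step {s:.4f}, max |x - x_hat| = {err:.4f} (bound {s/2:.4f})")
```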


3. Granularity

This part studies granularity as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Per-tensor | One scale for the whole tensor | $s$ shared by all entries |
| Per-channel | One scale per output channel or column | $s_c$ |
| Group-wise | One scale per block of weights | $s_g$ |
| Activation quantization | Activation ranges depend on input data | $x=x(\mathrm{batch})$ |
| KV-cache quantization | Cache precision affects long-context memory and attention quality | $M_\mathrm{KV}\propto b_\mathrm{KV}$ |

3.1 Per-tensor

Main idea. One scale for the whole tensor.

Core relation:

$s$ shared by all entries


3.2 Per-channel

Main idea. One scale per output channel or column.

Core relation:

$s_c$


3.3 Group-wise

Main idea. One scale per block of weights.

Core relation:

$s_g$


AI connection. Group scales are one reason modern low-bit LLM quantization can work better than one global scale.
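
Code sketch. Group-wise symmetric quantization along the last axis; the group size of 64 and the toy weight matrix are illustrative assumptions.

```python
import numpy as np

def quantize_groupwise(w, group_size=64, bits=4):
    """One scale per contiguous block of `group_size` weights along the last axis."""
    q_max = 2 ** (bits - 1) - 1
    rows, cols = w.shape
    w_blocks = w.reshape(rows, cols // group_size, group_size)
    scales = np.abs(w_blocks).max(axis=-1, keepdims=True) / q_max   # s_g per block
    q = np.clip(np.round(w_blocks / scales), -q_max - 1, q_max)
    return q.astype(np.int8), scales

def dequantize_groupwise(q, scales):
    return (q * scales).reshape(q.shape[0], -1)

w = np.random.normal(0, 0.02, size=(16, 256)).astype(np.float32)
q, s = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, s)
print("mean abs error:", np.abs(w - w_hat).mean())
```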


3.4 Activation quantization

Main idea. Activation ranges depend on input data.

Core relation:

$x=x(\mathrm{batch})$


3.5 KV-cache quantization

Main idea. Cache precision affects long-context memory and attention quality.

Core relation:

$M_\mathrm{KV}\propto b_\mathrm{KV}$
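
Code sketch. A rough KV-cache memory estimate; the layer count, KV-head count, head dimension, and context length below are hypothetical architecture numbers, not values from the text.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bits):
    """Keys and values: two cached tensors per layer, each layers x kv_heads x head_dim x seq_len."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bits / 8

cfg = dict(layers=32, kv_heads=8, head_dim=128, seq_len=32_768, batch=1)
for bits in (16, 8, 4):
    gib = kv_cache_bytes(**cfg, bits=bits) / 1024**3
    print(f"{bits}-bit KV cache: {gib:.1f} GiB")
```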


4. Post-Training Quantization

This part studies post-training quantization as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Calibration data | Estimate activation or weight ranges from representative examples | $D_\mathrm{cal}$ |
| Clipping | Smaller range improves resolution but clips outliers | $x\leftarrow\mathrm{clip}(x,-c,c)$ |
| Weighted error | Some weights matter more because activations amplify them | $\Vert X(W-\hat W)\Vert^2$ |
| GPTQ intuition | Quantize weights while compensating error using approximate second-order information | $\Delta L\approx \frac12\Delta w^\top H\Delta w$ |
| AWQ intuition | Protect activation-salient channels during weight quantization | $\lvert x_j\rvert$ indicates sensitivity |

4.1 Calibration data

Main idea. Estimate activation or weight ranges from representative examples.

Core relation:

$D_\mathrm{cal}$


4.2 Clipping

Main idea. Smaller range improves resolution but clips outliers.

Core relation:

$x\leftarrow\mathrm{clip}(x,-c,c)$
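
Code sketch. The clipping tradeoff: sweep the clip threshold $c$ on data with a few planted outliers and measure total quantization error; the data distribution and thresholds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 0.1, 10_000), rng.normal(0, 2.0, 20)])  # bulk plus outliers

def quant_error(x, c, bits=4):
    q_max = 2 ** (bits - 1) - 1
    xc = np.clip(x, -c, c)                  # clip first ...
    s = c / q_max                            # ... then quantize the clipped range
    x_hat = s * np.clip(np.round(xc / s), -q_max - 1, q_max)
    return np.mean((x - x_hat) ** 2)

for c in (0.2, 0.5, 1.0, np.abs(x).max()):
    print(f"clip at {c:.2f}: MSE = {quant_error(x, c):.6f}")
# A moderate clip usually beats covering the full outlier range: the bulk gets a smaller step.
```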


4.3 Weighted error

Main idea. Some weights matter more because activations amplify them.

Core relation:

$\Vert X(W-\hat W)\Vert^2$


AI connection. Quantizing a weight is more harmful when common activations magnify that weight's error.
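
Code sketch. Comparing plain weight error with activation-weighted error on calibration activations; the shapes and the planted high-magnitude input channels are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1.0, size=(256, 512))           # calibration activations
X[:, :8] *= 20.0                                   # a few high-magnitude input channels
W = rng.normal(0, 0.02, size=(512, 512))

def quantize_sym(w, bits=4):
    q_max = 2 ** (bits - 1) - 1
    s = np.abs(w).max() / q_max
    return s * np.clip(np.round(w / s), -q_max - 1, q_max)

W_hat = quantize_sym(W)
weight_err = np.linalg.norm(W - W_hat) ** 2
output_err = np.linalg.norm(X @ (W - W_hat)) ** 2   # ||X (W - W_hat)||^2
print(f"||W - W_hat||^2 = {weight_err:.3f}, ||X(W - W_hat)||^2 = {output_err:.1f}")
# The same weight error hurts more on channels that calibration activations amplify.
```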


4.4 GPTQ intuition

Main idea. Quantize weights while compensating error using approximate second-order information.

Core relation:

$\Delta L\approx \frac12\Delta w^\top H\Delta w$


4.5 AWQ intuition

Main idea. Protect activation-salient channels during weight quantization.

Core relation:

$|x_j|$ indicates sensitivity


5. Quantization-Aware Training and QLoRA

This part studies quantization-aware training and QLoRA as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Fake quantization | Simulate quantization during training while keeping gradients useful | $\hat W=Q(W)$ in the forward pass |
| Straight-through estimator | Treat rounding as identity in backward | $\partial\,\mathrm{round}(x)/\partial x\approx 1$ |
| QLoRA pattern | Freeze a quantized base and train low-rank adapters | $W\approx Q(W_0)+(\alpha/r)BA$ |
| Optimizer memory | Optimizer states are needed for trainable adapters, not frozen base weights | $M_\mathrm{opt}\propto P_\mathrm{trainable}$ |
| Dequantization path | Many kernels dequantize blocks on the fly for matmul | $q,s\rightarrow \hat W$ |

5.1 Fake quantization

Main idea. Simulate quantization during training while keeping gradients useful.

Core relation:

$\hat W=Q(W)$ in the forward pass


5.2 Straight-through estimator

Main idea. Treat rounding as identity in backward.

Core relation:

$\partial\,\mathrm{round}(x)/\partial x\approx 1$
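
Code sketch. A minimal straight-through estimator in PyTorch: round in the forward pass, pass the gradient through unchanged in the backward pass. The fake-quantize wrapper is an illustrative assumption, not a specific library's implementation.

```python
import torch

class RoundSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output            # treat round as identity: d round(x)/dx ~ 1

def fake_quantize(w, bits=4):
    q_max = 2 ** (bits - 1) - 1
    s = w.detach().abs().max() / q_max
    return s * RoundSTE.apply(w / s).clamp(-q_max - 1, q_max)

w = torch.randn(8, 8, requires_grad=True)
loss = fake_quantize(w).pow(2).sum()
loss.backward()
print(w.grad.abs().mean())            # nonzero: gradients flow through the rounding
```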


5.3 QLoRA pattern

Main idea. Freeze a quantized base and train low-rank adapters.

Core relation:

$W\approx Q(W_0)+(\alpha/r)BA$
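
Code sketch. The structural pattern only: a frozen fake-quantized base weight plus trainable low-rank factors. The shapes, rank, and simple 4-bit fake-quantizer are assumptions; this is not the NF4 format or the exact QLoRA recipe.

```python
import torch
import torch.nn as nn

def fake_quantize(w, bits=4):
    q_max = 2 ** (bits - 1) - 1
    s = w.abs().max() / q_max
    return s * torch.clamp(torch.round(w / s), -q_max - 1, q_max)

class QLoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        w0 = torch.randn(d_out, d_in) * 0.02
        # Frozen base, fake-quantized; real QLoRA stores 4-bit blocks plus scales.
        self.register_buffer("w_q", fake_quantize(w0))
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))      # zero init: adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return x @ self.w_q.T + self.scaling * (x @ self.A.T @ self.B.T)

layer = QLoRALinear(512, 512)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print("trainable params:", trainable)   # only A and B: 2 * 512 * 8 = 8192
```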


5.4 Optimizer memory

Main idea. Optimizer states are needed for trainable adapters, not frozen base weights.

Core relation:

$M_\mathrm{opt}\propto P_\mathrm{trainable}$


5.5 Dequantization path

Main idea. Many kernels dequantize blocks on the fly for matmul.

Core relation:

$q,s\rightarrow \hat W$


6. Distillation Basics

This part studies distillation basics as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Teacher and student | Train a smaller model to match a larger model | $p_T(y\mid x),\ p_S(y\mid x)$ |
| Soft targets | Teacher probabilities contain similarity information beyond one-hot labels | $p_T$ |
| Temperature | Soften probability distributions | $p_i^{(\tau)}=\mathrm{softmax}(z_i/\tau)$ |
| KL distillation loss | Minimize divergence from teacher to student | $L_\mathrm{KD}=\tau^2 D_\mathrm{KL}(p_T^{(\tau)}\Vert p_S^{(\tau)})$ |
| Hard-label mixture | Combine task loss with distillation loss | $L=\alpha L_\mathrm{CE}+(1-\alpha)L_\mathrm{KD}$ |

6.1 Teacher and student

Main idea. Train a smaller model to match a larger model.

Core relation:

$p_T(y\mid x),\ p_S(y\mid x)$


6.2 Soft targets

Main idea. Teacher probabilities contain similarity information beyond one-hot labels.

Core relation:

$p_T$


6.3 Temperature

Main idea. Soften probability distributions.

Core relation:

$p_i^{(\tau)}=\mathrm{softmax}(z_i/\tau)$


6.4 KL distillation loss

Main idea. Minimize divergence from teacher to student.

Core relation:

$L_\mathrm{KD}=\tau^2 D_\mathrm{KL}(p_T^{(\tau)}\Vert p_S^{(\tau)})$


AI connection. Distillation trains the student on the teacher's distribution, not only the final answer.
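
Code sketch. The temperature-softened KL distillation loss in PyTorch, including the $\tau^2$ factor; the logit shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, tau=2.0):
    """L_KD = tau^2 * KL(p_T^(tau) || p_S^(tau)), averaged over the batch."""
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    return tau ** 2 * F.kl_div(log_p_s, p_t, reduction="batchmean")

student_logits = torch.randn(4, 32_000)   # batch of next-token logits
teacher_logits = torch.randn(4, 32_000)
print(kd_loss(student_logits, teacher_logits))
```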


6.5 Hard-label mixture

Main idea. Combine task loss with distillation loss.

Core relation:

$L=\alpha L_\mathrm{CE}+(1-\alpha)L_\mathrm{KD}$
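
Code sketch. The hard-label mixture, combining cross-entropy on gold labels with the distillation term; the weight $\alpha=0.5$ and the random tensors are illustrative.

```python
import torch
import torch.nn.functional as F

def mixed_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=2.0):
    """L = alpha * CE(student, labels) + (1 - alpha) * L_KD(student, teacher)."""
    ce = F.cross_entropy(student_logits, labels)
    kd = tau ** 2 * F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    )
    return alpha * ce + (1 - alpha) * kd

labels = torch.randint(0, 32_000, (4,))
print(mixed_loss(torch.randn(4, 32_000), torch.randn(4, 32_000), labels))
```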


7. LLM Distillation Types

This part studies LLM distillation types as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Logit distillation | Match next-token distributions | $D_\mathrm{KL}(p_T(\cdot\mid x)\Vert p_S(\cdot\mid x))$ |
| Sequence distillation | Train on teacher-generated completions | $y\sim p_T(\cdot\mid x)$ |
| Feature distillation | Match hidden states or attention maps | $\Vert h_T-h_S\Vert^2$ |
| Preference distillation | Transfer teacher comparisons or judge preferences | $y^+\succ_T y^-$ |
| Reasoning trace distillation | Train on teacher-produced intermediate reasoning when appropriate | $p_S(r,y\mid x)$ |

7.1 Logit distillation

Main idea. Match next-token distributions.

Core relation:

$D_\mathrm{KL}(p_T(\cdot\mid x)\Vert p_S(\cdot\mid x))$


7.2 Sequence distillation

Main idea. Train on teacher-generated completions.

Core relation:

$y\sim p_T(\cdot\mid x)$


7.3 Feature distillation

Main idea. Match hidden states or attention maps.

Core relation:

$\Vert h_T-h_S\Vert^2$


7.4 Preference distillation

Main idea. Transfer teacher comparisons or judge preferences.

Core relation:

$y^+\succ_T y^-$


7.5 Reasoning trace distillation

Main idea. Train on teacher-produced intermediate reasoning when appropriate.

Core relation:

$p_S(r,y\mid x)$


8. Error and Evaluation

This part studies error and evaluation as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Perplexity shift | Quantization can be measured by held-out NLL change | $\Delta L=L_\mathrm{quant}-L_\mathrm{base}$ |
| Task score shift | Compression should be checked on downstream tasks | $\Delta S=S_\mathrm{comp}-S_\mathrm{base}$ |
| Calibration shift | Probabilities may become miscalibrated | $\mathrm{ECE}$ |
| Layer sensitivity | Some layers or projections tolerate fewer bits poorly | $\Delta L_\ell$ |
| Outlier channels | Activation outliers often dominate low-bit error | $\max_j \lvert x_j\rvert$ |

8.1 Perplexity shift

Main idea. Quantization can be measured by the change in held-out NLL.

Core relation:

$\Delta L=L_\mathrm{quant}-L_\mathrm{base}$
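
Code sketch. Measuring the shift as the difference in average next-token negative log-likelihood on the same held-out tokens; the per-token log-probability arrays are stand-ins for real model outputs.

```python
import numpy as np

def mean_nll(token_logprobs):
    """Average negative log-likelihood per token; exp of this is perplexity."""
    return -np.mean(token_logprobs)

# Stand-ins: per-token log-probabilities from the base and the quantized model
# evaluated on the same held-out text.
logp_base = np.log(np.random.uniform(0.2, 0.9, size=5000))
logp_quant = logp_base - np.abs(np.random.normal(0, 0.02, size=5000))

delta = mean_nll(logp_quant) - mean_nll(logp_base)
print(f"delta NLL: {delta:.4f}, perplexity ratio: {np.exp(delta):.4f}")
```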


8.2 Task score shift

Main idea. Compression should be checked on downstream tasks.

Core relation:

$\Delta S=S_\mathrm{comp}-S_\mathrm{base}$

A small perplexity shift can still hide task-level regressions, especially on structured outputs such as code, math, and long instruction-following chains. Re-run the downstream evaluations that were used to qualify the base model, with identical prompts and decoding settings, and report the score difference per task rather than one averaged number.

Common mistake. Changing two things at once, for example comparing a greedy-decoded compressed model against a sampled baseline. ΔS should isolate the effect of compression, not the decoding configuration.
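A sketch of a paired comparison with a bootstrap interval, assuming per-example 0/1 correctness is available for both models on the same prompts; the function name and toy data are illustrative:

```python
import numpy as np

def score_shift(correct_base, correct_comp, n_boot=2000, seed=0):
    """Delta S with a paired bootstrap interval, from per-example 0/1 correctness."""
    rng = np.random.default_rng(seed)
    base = np.asarray(correct_base, dtype=float)
    comp = np.asarray(correct_comp, dtype=float)
    delta = comp.mean() - base.mean()
    n = len(base)
    idx = rng.integers(0, n, size=(n_boot, n))           # same resample for both models
    boots = comp[idx].mean(axis=1) - base[idx].mean(axis=1)
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return delta, (lo, hi)

# Toy example: 200 prompts, the compressed model loses a few.
base = np.ones(200); base[:40] = 0        # 80% accuracy
comp = np.ones(200); comp[:48] = 0        # 76% accuracy
print(score_shift(base, comp))
```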

8.3 Calibration shift

Main idea. Probabilities may become miscalibrated.

Core relation:

ECE\mathrm{ECE}

Compression can shift probabilities even when the argmax token is unchanged, so accuracy can look fine while the model's confidence becomes unreliable. Expected calibration error bins predictions by confidence and averages the gap between confidence and accuracy inside each bin; comparing ECE before and after compression on the same examples shows whether probabilities are still trustworthy for downstream uses such as thresholding or routing.

Common mistake. Judging calibration from accuracy alone. A quantized model can keep its top-1 predictions while systematically overstating or understating how sure it is.
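A small ECE implementation under the usual equal-width-bin definition; the toy confidences stand in for real top-1 probabilities from the base and compressed model on the same examples:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = len(conf)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(conf[mask].mean() - corr[mask].mean())
            err += mask.sum() / total * gap
    return err

# Same answers, but the compressed model is more confident than its accuracy warrants.
correct   = np.array([1, 1, 0, 1, 0, 1, 0, 1])            # 62.5% accuracy
conf_base = np.array([0.72, 0.75, 0.71, 0.78, 0.74, 0.76, 0.73, 0.77])
conf_comp = np.array([0.93, 0.96, 0.92, 0.99, 0.95, 0.97, 0.94, 0.98])
print(ece(conf_base, correct), ece(conf_comp, correct))
```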

8.4 Layer sensitivity

Main idea. Some layers or projections are especially sensitive to low bit widths.

Core relation:

ΔL\Delta L_\ell

Not all layers degrade equally. Embedding and output projections and the attention projections that feed the softmax are frequently more fragile than the large MLP weights, and the first and last blocks are often more sensitive than the middle ones. A sensitivity sweep quantizes one layer or one projection type at a time, measures ΔL_ℓ on a small held-out set, and keeps the worst offenders in higher precision.

Common mistake. Assigning bit widths from weight statistics alone. Sensitivity is a property of the whole forward pass, so it has to be measured with real inputs.
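A hedged sketch of such a sweep; quantize_one_layer and heldout_nll are hypothetical helpers standing in for whatever quantization routine and evaluation loop the project uses:

```python
def sensitivity_sweep(model, layer_names, quantize_one_layer, heldout_nll):
    """Quantize one layer at a time and record the resulting NLL shift.

    quantize_one_layer(model, name) -> copy of the model with only `name` quantized
    heldout_nll(model) -> mean NLL on a fixed calibration set
    Both are assumed helpers, not library functions.
    """
    base_nll = heldout_nll(model)
    report = {}
    for name in layer_names:
        q_model = quantize_one_layer(model, name)
        report[name] = heldout_nll(q_model) - base_nll   # Delta L for this layer
    # Largest shifts first: candidates to keep in higher precision.
    return dict(sorted(report.items(), key=lambda kv: -kv[1]))
```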

8.5 Outlier channels

Main idea. Activation outliers often dominate low-bit error.

Core relation:

maxxj\max |x_j|

Large LLM activations contain a few channels whose magnitudes sit far above the rest. A per-tensor scale has to stretch to cover those channels, which wastes most of the integer grid on values that rarely occur and crushes the resolution available to ordinary channels. This is why low-bit activation quantization usually needs per-channel scales, an outlier-aware decomposition, or a rescaling trick such as SmoothQuant that shifts the difficulty from activations into the weights.

Common mistake. Profiling outliers on weights only. The problematic outliers live in activations, are concentrated in specific channels, and are input-dependent, so they must be measured on representative prompts.
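A sketch of the profiling step, assuming activations have been collected into a (tokens, channels) array, for example via forward hooks; the 20× ratio below is an illustrative threshold, not a standard:

```python
import numpy as np

def outlier_channels(activations, ratio=20.0):
    """Flag channels whose max |x_j| is far above the median channel max."""
    x = np.asarray(activations)
    per_channel_max = np.abs(x).max(axis=0)
    threshold = ratio * np.median(per_channel_max)
    flagged = np.where(per_channel_max > threshold)[0]
    return flagged, per_channel_max

# Toy activations: channel 3 is an outlier channel.
rng = np.random.default_rng(0)
acts = rng.normal(scale=0.5, size=(1024, 8))
acts[:, 3] *= 60.0
print(outlier_channels(acts))
```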

9. Deployment Choices

This part studies deployment choices as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.

SubtopicQuestionFormula
Weight-only quantizationreduce weight bandwidth while leaving activations higher precisionWq, xfpW_q,\ x_\mathrm{fp}
Weight-activation quantizationquantize both weights and activations for more kernel speedWq, xqW_q,\ x_q
KV-cache quantizationincrease context or batch capacitybKVb_\mathrm{KV}\downarrow
Distill then quantizea smaller student can also be quantizedSQ(S)S\rightarrow Q(S)
Hardware formatINT4, INT8, FP8, and NF4 need matching kernelsformatkernel\mathrm{format}\rightarrow\mathrm{kernel}

9.1 Weight-only quantization

Main idea. Reduce weight bandwidth while leaving activations higher precision.

Core relation:

Wq, xfpW_q,\ x_\mathrm{fp}

Small-batch decoding is usually memory-bandwidth-bound: every generated token streams each weight matrix from memory once. Storing weights in 4 or 8 bits and dequantizing them inside the matmul kernel cuts that traffic roughly in proportion to the bit width, while activations stay in bf16 or fp16, so there is no activation-outlier problem to solve.

Worked micro-example. A 7B-parameter model stored in bf16 needs about 14 GB for weights; at 4 bits with group-wise scales it needs roughly 3.5 to 4 GB, so each decoded token reads about a quarter of the bytes.

Common mistake. Expecting weight-only quantization to speed up compute-bound workloads such as large-batch prefill, where the matmuls are limited by arithmetic rather than by weight bandwidth.
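A minimal NumPy sketch of symmetric per-output-channel weight-only quantization with dequantization inside the matmul; real serving kernels fuse the dequantization, but the arithmetic is the same:

```python
import numpy as np

def quantize_weight_per_channel(W, bits=4):
    """Symmetric per-output-channel weight quantization. W has shape (out, in)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax   # one scale per output row
    scale = np.where(scale == 0, 1.0, scale)
    Wq = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return Wq, scale

def weight_only_matmul(x, Wq, scale):
    """Activations stay in float; weights are dequantized inside the matmul."""
    W_hat = Wq.astype(np.float32) * scale
    return x @ W_hat.T

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32)).astype(np.float32)
x = rng.normal(size=(4, 32)).astype(np.float32)
Wq, s = quantize_weight_per_channel(W, bits=4)
print(np.abs(x @ W.T - weight_only_matmul(x, Wq, s)).max())   # quantization error
```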

9.2 Weight-activation quantization

Main idea. Quantize both weights and activations for more kernel speed.

Core relation:

Wq, xqW_q,\ x_q

Quantizing activations as well as weights lets the matrix multiplications themselves run in integer or FP8 arithmetic, which is where the kernel speedups for compute-bound workloads come from. The cost is that activation ranges vary with the input and contain outlier channels, so W8A8 setups typically use per-token or per-channel activation scales, and anything below 8 bits usually needs explicit outlier handling.

Common mistake. Benchmarking a weight-activation scheme with a kernel that silently falls back to fp16 matmuls. The speedup only exists when the low-precision GEMM actually runs on the hardware.
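A minimal W8A8 sketch: both operands are quantized symmetrically, accumulation happens in int32, and the result is rescaled by the product of the two scales. Production kernels do this on tensor cores rather than in NumPy:

```python
import numpy as np

def quantize_symmetric(t, bits=8, axis=None):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(t).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(t / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def w8a8_matmul(x, W):
    """Quantize activations per token and weights per output channel, then rescale."""
    xq, sx = quantize_symmetric(x, axis=1)        # per-token activation scale
    Wq, sw = quantize_symmetric(W, axis=1)        # per-output-channel weight scale
    acc = xq.astype(np.int32) @ Wq.T.astype(np.int32)   # int32 accumulation
    return acc.astype(np.float32) * sx * sw.T

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 64)).astype(np.float32)
W = rng.normal(size=(32, 64)).astype(np.float32)
print(np.abs(x @ W.T - w8a8_matmul(x, W)).max())
```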

9.3 KV-cache quantization

Main idea. Quantizing the KV cache increases context or batch capacity.

Core relation:

bKVb_\mathrm{KV}\downarrow

At long contexts and large batch sizes the KV cache, not the weights, becomes the dominant memory consumer, and it is re-read at every decoding step. Quantizing cached keys and values, usually per head or per channel, lets the same GPU hold longer contexts or more concurrent requests, at the cost of some attention-score error that should be checked on long-context tasks specifically.

Worked micro-example. Per token, the cache stores 2 · layers · kv_heads · head_dim elements (keys and values). Halving the bytes per element doubles the number of cached tokens, and therefore the context length or batch size, that fits in a fixed memory budget.

Common mistake. Validating KV-cache quantization only on short prompts. Cache error accumulates with sequence length, so the risk sits exactly where the technique is most useful.
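A small calculator for the KV-cache size; the model shapes below are Llama-2-7B-like and used only for illustration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bits):
    """Total KV-cache size: keys and values for every layer, head, and token."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch   # 2 = K and V
    return elems * bits / 8

# Assumed shapes for illustration: 32 layers, 32 KV heads, head_dim 128.
for bits in (16, 8, 4):
    gb = kv_cache_bytes(32, 32, 128, seq_len=8192, batch=8, bits=bits) / 1e9
    print(f"{bits}-bit cache: {gb:.1f} GB")
```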

9.4 Distill then quantize

Main idea. A smaller student can also be quantized.

Core relation:

SQ(S)S\rightarrow Q(S)

The two compressions are complementary and are usually applied in this order: first distill the teacher into a smaller student so the architecture itself shrinks, then apply PTQ or QAT to the student for deployment. Doing it in this order also gives a clean reference point, because the full-precision student defines the behavior the quantized student is supposed to reproduce.

Common mistake. Evaluating the final model only against the teacher. Report both gaps, teacher-to-student and student-to-quantized-student, so a regression can be attributed to the right step.
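A sketch of the temperature-softened KL distillation loss from the overview formula, applied per position and averaged; the logits are toy values:

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=np.float64) / tau
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits, student_logits, tau=2.0):
    """L_KD = tau^2 * KL(p_T^(tau) || p_S^(tau)), averaged over positions."""
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    kl = (p * (np.log(p) - np.log(q))).sum(axis=-1)
    return tau ** 2 * kl.mean()

teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student = np.array([[3.0, 1.5, 0.5], [0.5, 2.0, 0.5]])
print(kd_loss(teacher, student, tau=2.0))
```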

9.5 Hardware format

Main idea. INT4, INT8, FP8, and NF4 need matching kernels.

Core relation:

formatkernel\mathrm{format}\rightarrow\mathrm{kernel}

A storage format only pays off when the serving stack has kernels for it. FP8 matmuls require hardware tensor-core support; INT4 weight-only formats rely on fused dequantize-and-multiply kernels; NF4 is a lookup-table storage format (used by QLoRA) that is dequantized to bf16 before compute. The right choice therefore depends on the target GPU and inference framework, not just on the bit count.

Common mistake. Measuring memory on one stack and latency on another. Format, kernel, and hardware must be evaluated together, on the deployment configuration.

10. Debugging Checklist

This part studies debugging checklist as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.

SubtopicQuestionFormula
Calibration representativenesscalibration data should match deployment promptsDcalDdeployD_\mathrm{cal}\approx D_\mathrm{deploy}
Layer-by-layer errorinspect where compression hurtsX(WW^)\Vert X(W-\hat W)\Vert
Reference comparisonscompare logits before and after compressionmaxzz^\max|z-\hat z|
Latency measurementconfirm the chosen format is actually fasterTcomp<TbaseT_\mathrm{comp}<T_\mathrm{base}
Quality gatesdo not ship a compressed model without task and safety checksΔS\Delta S bounded

10.1 Calibration representativeness

Main idea. Calibration data should match deployment prompts.

Core relation:

DcalDdeployD_\mathrm{cal}\approx D_\mathrm{deploy}

PTQ methods choose scales, clipping ranges, and error-compensating updates from a small calibration set, so the compressed model is implicitly tuned to those inputs. If deployment traffic looks different, other languages, much longer contexts, code instead of prose, or chat templates with a system prompt, the chosen ranges can be wrong exactly where it matters.

Common mistake. Calibrating on generic web text and deploying on chat-formatted prompts whose structure the calibration data never contained.

10.2 Layer-by-layer error

Main idea. Inspect where compression hurts.

Core relation:

X(WW^)\Vert X(W-\hat W)\Vert

When end-to-end quality drops, localize the damage before changing the recipe. The standard proxy is the output-space error ∥X(W − Ŵ)∥ computed on a calibration batch X for each quantized weight W; it weights the error by what the layer actually sees, unlike the raw weight error ∥W − Ŵ∥. Layers with unusually large output error are candidates for more bits, finer groups, or exclusion from quantization.

Common mistake. Ranking layers by weight-space error alone. A small perturbation along a direction the activations use heavily can hurt far more than a large perturbation along a direction they never excite.
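A sketch of the output-space error for a single linear layer; W_hat stands in for the dequantized quantized weight:

```python
import numpy as np

def layer_output_error(X, W, W_hat):
    """Relative output error ||X (W - W_hat)||_F / ||X W||_F for one linear layer.

    X: calibration activations (tokens, in); W and W_hat: (out, in).
    """
    ref = X @ W.T
    err = X @ (W - W_hat).T
    return np.linalg.norm(err) / np.linalg.norm(ref)

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64)).astype(np.float32)
W = rng.normal(size=(32, 64)).astype(np.float32)
W_hat = W + rng.normal(scale=0.02, size=W.shape).astype(np.float32)  # stand-in for a quantized weight
print(layer_output_error(X, W, W_hat))
```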

10.3 Reference comparisons

Main idea. Compare logits before and after compression.

Core relation:

maxzz^\max|z-\hat z|

Before running any benchmark, compare raw logits of the base and compressed model on a handful of identical inputs. The maximum absolute logit difference, the top-1 agreement rate, and the KL divergence between next-token distributions catch wiring bugs such as wrong scales, transposed weights, or skipped layers in seconds, whereas a benchmark only reports that the score went down.

Common mistake. Comparing sampled generations instead of logits. Sampling noise hides moderate numerical errors and makes real regressions look like random variation.
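A sketch of such a comparison; the thresholds you gate on, for example top-1 agreement above 99%, are project choices rather than universal constants:

```python
import numpy as np

def logit_diff_report(z_base, z_comp):
    """Quick sanity check between base and compressed logits on the same inputs."""
    z_base = np.asarray(z_base, dtype=np.float64)
    z_comp = np.asarray(z_comp, dtype=np.float64)
    max_abs = np.abs(z_base - z_comp).max()
    top1_agree = (z_base.argmax(-1) == z_comp.argmax(-1)).mean()

    def logsoftmax(z):
        z = z - z.max(-1, keepdims=True)
        return z - np.log(np.exp(z).sum(-1, keepdims=True))

    lp, lq = logsoftmax(z_base), logsoftmax(z_comp)
    kl = (np.exp(lp) * (lp - lq)).sum(-1).mean()
    return {"max_abs_logit_diff": max_abs, "top1_agreement": top1_agree, "mean_kl": kl}

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 1000))
print(logit_diff_report(z, z + rng.normal(scale=0.05, size=z.shape)))
```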

10.4 Latency measurement

Main idea. Confirm the chosen format is actually faster.

Core relation:

Tcomp<TbaseT_\mathrm{comp}<T_\mathrm{base}

Smaller weights do not automatically mean faster tokens. Decode throughput depends on whether the kernel for the chosen format is fused and well optimized, whether dequantization overhead eats the bandwidth savings, and whether the workload is bandwidth-bound (small-batch decode) or compute-bound (large-batch prefill). Measure tokens per second in the actual serving configuration, with the production batch size, context length, and framework.

AI connection. A smaller file is not automatically a faster model if kernels do not support the format well.

Common mistake. Timing a single cold request, or quoting prefill throughput as if it predicted decode latency. Warm up, repeat the measurement, and report both phases.

10.5 Quality gates

Main idea. Do not ship a compressed model without task and safety checks.

Core relation:

ΔS\Delta S bounded

Treat compression like any other model change: define acceptance thresholds before compressing, and block the release if they fail. A reasonable gate covers the held-out NLL shift, scores on the tasks the model is actually used for, calibration, long-context behavior, refusal and safety evaluations, and latency and memory on the target hardware.

Common mistake. Gating only on averages. Compression damage is often concentrated in a few capabilities, such as long contexts, rare languages, or numerical reasoning, so gates should include the known weak spots.
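A sketch of a gate check; the metric names and thresholds are illustrative and should come from the project's own acceptance criteria:

```python
def passes_quality_gates(metrics, gates):
    """Return (ok, failures) for a compressed-model release candidate."""
    failures = []
    for name, (kind, limit) in gates.items():
        value = metrics[name]
        ok = value <= limit if kind == "max" else value >= limit
        if not ok:
            failures.append(f"{name}={value} violates {kind} {limit}")
    return not failures, failures

gates = {
    "delta_nll":        ("max", 0.05),   # nats per token
    "task_score_drop":  ("max", 0.01),   # absolute accuracy drop
    "ece":              ("max", 0.05),
    "safety_pass_rate": ("min", 0.99),
}
metrics = {"delta_nll": 0.03, "task_score_drop": 0.02, "ece": 0.04, "safety_pass_rate": 0.995}
print(passes_quality_gates(metrics, gates))
```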


Practice Exercises

  1. Quantize and dequantize scalar values with an affine quantizer.
  2. Compute symmetric INT4 scale and error.
  3. Compare per-tensor and per-channel quantization error.
  4. Compute memory reduction from bf16 to 4-bit weights.
  5. Sweep clipping ranges and choose the lowest MSE.
  6. Compute distillation probabilities at temperature.
  7. Compute KL distillation loss for teacher and student distributions.
  8. Combine hard-label CE and distillation loss.
  9. Estimate QLoRA optimizer-state memory.
  10. Write a compression deployment checklist.

Why This Matters for AI

Compression determines who can run a model, how much serving costs, and which devices can host useful AI locally. The math matters because bad compression can keep a model small but silently damage probabilities, calibration, long-context behavior, or safety behavior.

Bridge to RAG Math and Retrieval

Compression makes a model cheaper. Retrieval can make a model more informed without changing all its weights. The next section studies embedding retrieval, similarity search, ranking, context packing, and how retrieval changes the conditional distribution used by an LLM.

References