Quantization compresses tensors by changing their numeric representation. Distillation compresses behavior by training a smaller or cheaper model to imitate a stronger teacher. Both are central to making LLMs cheaper to store, serve, and deploy.
Overview
The simplest uniform quantizer is:
$q=\mathrm{clip}(\mathrm{round}(x/s)+z,\ q_\min,\ q_\max),\qquad \hat x = s\,(q-z)$
Here $s$ is the scale and $z$ is the zero point. Quantization asks how much error this approximation introduces and whether the hardware can exploit the smaller representation.
Distillation uses a teacher distribution:
$p_T(y\mid x)=\mathrm{softmax}(z_T(x)/T)$
The student learns from teacher probabilities, generated sequences, features, or preferences. Quantization and distillation can be combined: distill to a smaller model, then quantize it for deployment.
Prerequisites
- Logits, softmax, KL divergence, and cross-entropy
- Matrix multiplication and tensor shapes
- Inference memory and bandwidth constraints
- Fine-tuning and LoRA basics
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Demonstrates uniform quantization, clipping, group-wise scales, error versus bits, SmoothQuant intuition, distillation temperature, KL loss, and memory savings. |
| exercises.ipynb | Ten practice problems for quantizer formulas, memory savings, clipping, KL distillation, and deployment checks. |
Learning Objectives
After this section, you should be able to:
- Implement affine quantization and dequantization.
- Explain scale, zero point, clipping, and quantization error.
- Compare per-tensor, per-channel, and group-wise quantization.
- Explain PTQ, QAT, GPTQ intuition, AWQ intuition, and QLoRA.
- Compute memory savings from lower bit widths.
- Compute distillation KL loss with temperature.
- Explain logit, sequence, feature, and preference distillation.
- Build a quality and latency evaluation checklist for compressed LLMs.
Table of Contents
- Compression Goals
- 1.1 Memory reduction
- 1.2 Bandwidth reduction
- 1.3 Compute support
- 1.4 Quality preservation
- 1.5 Pareto frontier
- Uniform Quantization
- 2.1 Affine quantizer
- 2.2 Dequantization
- 2.3 Scale
- 2.4 Zero point
- 2.5 Quantization error
- Granularity
- 3.1 Per-tensor
- 3.2 Per-channel
- 3.3 Group-wise
- 3.4 Activation quantization
- 3.5 KV-cache quantization
- Post-Training Quantization
- 4.1 Calibration data
- 4.2 Clipping
- 4.3 Weighted error
- 4.4 GPTQ intuition
- 4.5 AWQ intuition
- Quantization-Aware Training and QLoRA
- 5.1 Fake quantization
- 5.2 Straight-through estimator
- 5.3 QLoRA pattern
- 5.4 Optimizer memory
- 5.5 Dequantization path
- Distillation Basics
- 6.1 Teacher and student
- 6.2 Soft targets
- 6.3 Temperature
- 6.4 KL distillation loss
- 6.5 Hard-label mixture
- LLM Distillation Types
- Error and Evaluation
- 8.1 Perplexity shift
- 8.2 Task score shift
- 8.3 Calibration shift
- 8.4 Layer sensitivity
- 8.5 Outlier channels
- Deployment Choices
- Debugging Checklist
- 10.1 Calibration representativeness
- 10.2 Layer-by-layer error
- 10.3 Reference comparisons
- 10.4 Latency measurement
- 10.5 Quality gates
Compression Map
| Method | Changes | Needs data? | Main benefit | Main risk |
|---|---|---|---|---|
| PTQ | Numeric format after training | Calibration data | Fast compression | Calibration mismatch |
| QAT | Training simulates quantization | Training data | Better low-bit robustness | More compute |
| QLoRA | Quantized frozen base plus LoRA | Fine-tune data | Cheap adaptation | Activation memory remains |
| Logit distillation | Student matches teacher probabilities | Teacher outputs | Smaller model behavior transfer | Teacher errors transfer |
| Sequence distillation | Student trains on teacher completions | Teacher generations | Simple data pipeline | Diversity loss |
1. Compression Goals
This part studies compression goals as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Memory reduction | store fewer bytes per parameter or cache entry | $\text{weight bytes}\approx \text{params}\times\text{bits}/8$ |
| Bandwidth reduction | read fewer bytes per generated token | $\text{bytes/token}\approx \text{weight bytes}+\text{KV-cache bytes}$ |
| Compute support | low precision helps only when kernels and hardware support it | — |
| Quality preservation | compressed outputs should stay close to reference outputs | $\Delta\mathrm{PPL}$, $\Delta$ task score |
| Pareto frontier | compression is a quality-cost tradeoff | $\mathrm{quality}$ versus $(\mathrm{memory},\ \mathrm{latency},\ \mathrm{cost})$ |
1.1 Memory reduction
Main idea. Store fewer bytes per parameter or cache entry.
Core relation:
$\text{weight bytes} \approx \text{params} \times \text{bits per weight} / 8$
Quantization changes the numerical representation of tensors. Distillation changes the training signal so a smaller or cheaper model imitates a stronger teacher. Both are compression tools, but they fail in different ways: quantization can introduce numerical error, while distillation can omit teacher capabilities that are not present in the distillation data.
Worked micro-example. If weights lie in a symmetric range $[-a, a]$ and we use signed 4-bit integers with values from $-8$ to $7$, a symmetric step size is roughly $s \approx a/8$. A real weight $x$ maps to integer $\mathrm{round}(x/s)$ and dequantizes back to the nearest grid point. Smaller $s$ improves resolution near zero but clips large values if the range is too narrow.
Implementation check. Always compare base and compressed logits on the same inputs. Then check held-out loss, task quality, calibration, memory, and latency. Compression is successful only if the target tradeoff improves.
AI connection. Bits per weight is the most direct control variable for LLM weight memory and, during decoding, for the bytes moved per token.
Common mistake. Do not report "4-bit" without saying what is quantized, the granularity, the calibration data, and the serving kernel.
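The memory relation is easy to check numerically. The sketch below (plain Python, no external dependencies; the 7B parameter count and 2-byte scales are illustrative assumptions) compares weight storage at several bit widths, including a rough allowance for group-wise scales.

```python
def weight_bytes(n_params: float, bits: float, group_size: int | None = None,
                 scale_bytes: int = 2) -> float:
    """Approximate bytes to store weights at a given bit width.

    If group_size is given, add one scale (e.g. FP16, 2 bytes) per group,
    which is the usual overhead of group-wise quantization.
    """
    total = n_params * bits / 8
    if group_size is not None:
        total += (n_params / group_size) * scale_bytes
    return total

n = 7e9  # illustrative 7B-parameter model
for bits, group in [(16, None), (8, None), (4, 128), (3, 128)]:
    gb = weight_bytes(n, bits, group) / 1e9
    print(f"{bits:>2}-bit weights: ~{gb:.1f} GB")
```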
1.2 Bandwidth reduction
Main idea. Read fewer bytes per generated token.
Core relation:
$\text{bytes per token} \approx \text{weight bytes read} + \text{KV-cache bytes read}$
During decoding, generating one token requires reading essentially all of the weights plus the growing KV cache, so small-batch decoding is usually bandwidth-bound rather than compute-bound. Lower bit widths cut the bytes moved per token, which is often where the latency win actually comes from.
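A back-of-the-envelope roofline check makes the bandwidth argument concrete. This sketch uses assumed numbers (roughly 1 TB/s of usable memory bandwidth, a 2 GB KV cache, and 7B-class weight sizes, all illustrative) to estimate a bandwidth-only ceiling on single-stream decode speed.

```python
def decode_tokens_per_sec_ceiling(weight_bytes: float, kv_bytes: float,
                                  bandwidth_bytes_per_sec: float) -> float:
    """Bandwidth-only ceiling on decode speed: every generated token must
    stream the weights plus the KV cache through memory at least once."""
    bytes_per_token = weight_bytes + kv_bytes
    return bandwidth_bytes_per_sec / bytes_per_token

bw = 1.0e12   # ~1 TB/s usable HBM bandwidth (assumption)
kv = 2e9      # ~2 GB KV cache at this context length (assumption)
for label, wbytes in [("FP16", 14e9), ("INT8", 7e9), ("INT4", 3.6e9)]:
    print(label, round(decode_tokens_per_sec_ceiling(wbytes, kv, bw), 1), "tok/s ceiling")
```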
1.3 Compute support
Main idea. Low precision helps only when kernels and hardware support it.
Core relation:
realized speedup $\le$ the speedup the serving kernels actually expose for the chosen format
A 4-bit weight tensor that is dequantized to FP16 before every matmul still saves storage and bandwidth, but the arithmetic itself is no faster; true compute gains need kernels that consume the low-precision format directly (INT8, FP8, INT4) on the target hardware.
1.4 Quality preservation
Main idea. Compressed outputs should stay close to reference outputs.
Core relation:
$\Delta\mathrm{PPL}$, $\Delta$ task scores, and logit differences on shared inputs should stay within an agreed budget
Quality preservation is always relative to the uncompressed reference: compare the base and compressed models on the same prompts, and track held-out loss, task accuracy, and calibration rather than a single headline number.
1.5 Pareto frontier
Main idea. Compression is a quality-cost tradeoff.
Core relation:
$\mathrm{quality}$ versus $(\mathrm{memory},\ \mathrm{latency},\ \mathrm{cost})$
Each configuration (bit width, granularity, student size) is one point in this space. The goal is not a single best model but a frontier: for a given quality budget, pick the cheapest point; for a given cost budget, pick the highest-quality point.
2. Uniform Quantization
This part studies uniform quantization as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Affine quantizer | map real values to integer grid points | $q=\mathrm{clip}(\mathrm{round}(x/s)+z,\ q_\min,\ q_\max)$ |
| Dequantization | recover an approximate real value | $\hat x=s\,(q-z)$ |
| Scale | the step size controls resolution | $s=(x_\max-x_\min)/(q_\max-q_\min)$ |
| Zero point | asymmetric quantization uses an integer offset | $z=q_\min-\mathrm{round}(x_\min/s)$ |
| Quantization error | rounding error is bounded by half a step before clipping | $\lvert x-\hat x\rvert \le s/2$ |
2.1 Affine quantizer
Main idea. Map real values to integer grid points.
Core relation:
$q=\mathrm{clip}(\mathrm{round}(x/s)+z,\ q_\min,\ q_\max)$
Rounding snaps each real value to the nearest point of a uniform grid with step $s$, and clipping keeps the result inside the representable integer range $[q_\min, q_\max]$.
AI connection. This tiny formula is the bridge between real model weights and integer storage.
2.2 Dequantization
Main idea. Recover an approximate real value.
Core relation:
$\hat x = s\,(q-z)$
Dequantization maps the stored integer back to a real value on the quantization grid; the gap $x-\hat x$ is the quantization error the rest of this part analyzes.
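A minimal NumPy sketch of the affine quantizer and its inverse, matching the formulas above (the per-tensor granularity and the bit widths are illustrative choices; nothing here is tied to a particular library's quantization API).

```python
import numpy as np

def affine_quantize(x: np.ndarray, n_bits: int = 8):
    """Asymmetric (affine) quantization with one scale and zero point per tensor."""
    q_min, q_max = 0, 2 ** n_bits - 1            # unsigned integer grid
    x_min, x_max = float(x.min()), float(x.max())
    s = (x_max - x_min) / (q_max - q_min)         # step size
    z = int(round(q_min - x_min / s))             # zero point
    q = np.clip(np.round(x / s) + z, q_min, q_max).astype(np.int32)
    return q, s, z

def affine_dequantize(q: np.ndarray, s: float, z: int) -> np.ndarray:
    """Map stored integers back onto the real-valued quantization grid."""
    return s * (q.astype(np.float32) - z)

x = np.random.randn(4096).astype(np.float32)
q, s, z = affine_quantize(x, n_bits=4)
x_hat = affine_dequantize(q, s, z)
print("max |x - x_hat|:", np.abs(x - x_hat).max(), "   step/2:", s / 2)
```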
2.3 Scale
Main idea. The step size controls resolution.
Core relation:
$s=(x_\max-x_\min)/(q_\max-q_\min)$
The scale is the real-valued width of one integer step. A wider captured range means a larger $s$ and coarser resolution; a narrower range means finer resolution but more clipping of values outside it.
2.4 Zero point
Main idea. Asymmetric quantization uses an integer offset.
Core relation:
$z=q_\min-\mathrm{round}(x_\min/s)$
The zero point is the integer that represents the real value $0$, which lets an asymmetric range (such as post-ReLU activations) use the full integer grid. Symmetric quantization fixes $z=0$ and is the common choice for weights.
2.5 Quantization error
Main idea. Rounding error is bounded by half a step before clipping.
Core relation:
$\lvert x-\hat x\rvert \le s/2$ for values inside the clipping range
Values outside the range incur clipping error that can be much larger than $s/2$, which is why range selection, and therefore calibration, matters more than the rounding itself.
3. Granularity
This part studies granularity as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Per-tensor | one scale for the whole tensor | $s$ shared by all entries |
| Per-channel | one scale per output channel or column | $s_j$ per channel $j$ |
| Group-wise | one scale per block of weights | $s_g$ per group of $g$ consecutive weights |
| Activation quantization | activation ranges depend on input data | $[x_\min, x_\max]$ estimated from calibration batches |
| KV-cache quantization | cache precision affects long-context memory and attention quality | $\text{KV bytes} \propto \text{layers}\times\text{seq len}\times\text{bits}$ |
3.1 Per-tensor
Main idea. One scale for the whole tensor.
Core relation:
one scale $s$ (and zero point) shared by every entry of the tensor
Per-tensor scales are the cheapest to store and the friendliest to kernels, but a single outlier anywhere in the tensor stretches the range and wastes resolution for everything else.
3.2 Per-channel
Main idea. One scale per output channel or column.
Core relation:
$s_j$ per output channel or column $j$
Weight magnitudes differ substantially between channels, so per-channel scales recover much of the accuracy lost by a single global scale at almost no storage cost.
3.3 Group-wise
Main idea. One scale per block of weights.
Core relation:
$s_g$ per group of $g$ consecutive weights (typical group sizes are 64 or 128)
Group-wise scales sit between per-tensor and per-element: each block of weights gets its own scale, which isolates outliers to their block at the cost of one extra scale per group.
AI connection. Group scales are one reason modern low-bit LLM quantization can work better than one global scale.
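A sketch of symmetric group-wise weight quantization in NumPy, to make the one-scale-per-block idea concrete (the 4-bit width, group size of 128, and planted outlier are illustrative, not a statement about any particular packed format).

```python
import numpy as np

def groupwise_quantize(w: np.ndarray, n_bits: int = 4, group_size: int = 128):
    """Symmetric group-wise quantization of a 1-D weight vector.

    Each block of `group_size` weights gets its own scale, so an outlier
    only degrades resolution inside its own block.
    """
    q_max = 2 ** (n_bits - 1) - 1                 # e.g. 7 for signed 4-bit
    w = w.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / q_max
    scales = np.maximum(scales, 1e-12)            # avoid division by zero
    q = np.clip(np.round(w / scales), -q_max - 1, q_max)
    return q.astype(np.int8), scales

def groupwise_dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
w[10] = 8.0                                       # plant one outlier
q, s = groupwise_quantize(w, n_bits=4, group_size=128)
err = np.abs(w - groupwise_dequantize(q, s))
print("mean abs error:", err.mean(), " max abs error:", err.max())
```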
3.4 Activation quantization
Main idea. Activation ranges depend on input data.
Core relation:
activation ranges $[x_\min, x_\max]$ vary with the input and must be estimated from data or computed on the fly
Unlike weights, activations are not known ahead of time, and a few channels often carry large outliers. This is why activation quantization is usually harder than weight quantization, and why techniques such as SmoothQuant shift difficulty from activations into weights.
3.5 KV-cache quantization
Main idea. Cache precision affects long-context memory and attention quality.
Core relation:
$\text{KV bytes} \approx 2 \times \text{layers} \times \text{kv heads} \times d_\text{head} \times \text{seq len} \times \text{batch} \times \text{bytes per value}$
The factor of 2 counts keys and values. For long contexts and large batches the cache, not the weights, dominates memory, so reducing cache precision directly buys context length or batch size.
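The cache relation is again easy to check with arithmetic. The sketch below uses illustrative architecture numbers (32 layers, 8 KV heads, head dimension 128, roughly a 7B-class decoder with grouped-query attention); the point is the scaling, not the exact figures.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_value: float) -> float:
    """Approximate KV-cache size: keys and values for every layer and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

cfg = dict(layers=32, kv_heads=8, head_dim=128, seq_len=32_768, batch=8)
for label, b in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(label, round(kv_cache_bytes(**cfg, bytes_per_value=b) / 1e9, 1), "GB")
```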
4. Post-Training Quantization
This part studies post-training quantization as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Calibration data | estimate activation or weight ranges from representative examples | $x_\min, x_\max$ (or percentiles) over a calibration set |
| Clipping | smaller range improves resolution but clips outliers | choose clip threshold $c$ to minimize $\mathbb{E}[(x-\hat x)^2]$ |
| Weighted error | some weights matter more because activations amplify them | $\Vert (W-\hat W)X \Vert^2$ instead of $\Vert W-\hat W\Vert^2$ |
| GPTQ intuition | quantize weights while compensating error using approximate second-order information | $H\approx X X^\top$ guides column-by-column rounding |
| AWQ intuition | protect activation-salient channels during weight quantization | per-channel $\lvert x_j\rvert$ indicates sensitivity |
4.1 Calibration data
Main idea. Estimate activation or weight ranges from representative examples.
Core relation:
$x_\min, x_\max$ (or a high percentile of $\lvert x\rvert$) estimated over a set of representative sequences
PTQ quality depends heavily on the calibration set: if it does not resemble deployment traffic (language, length, domain), the estimated ranges and the resulting scales will be wrong in exactly the places that matter.
4.2 Clipping
Main idea. Smaller range improves resolution but clips outliers.
Core relation:
pick a clip threshold $c<\max\lvert x\rvert$ that minimizes total error $\mathbb{E}[(x-\hat x)^2]$, trading rounding error against clipping error
Clipping shrinks the step size for the bulk of the distribution at the cost of saturating the rare large values; the best threshold depends on how heavy the tails are.
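A small experiment shows the rounding-versus-clipping tradeoff. The sketch sweeps a symmetric clip threshold for 4-bit quantization of a heavy-tailed sample and reports the mean squared error at each setting (the Student-t sample and the candidate thresholds are illustrative).

```python
import numpy as np

def quantize_with_clip(x: np.ndarray, clip: float, n_bits: int = 4) -> np.ndarray:
    """Symmetric quantization of x after clipping to [-clip, clip]."""
    q_max = 2 ** (n_bits - 1) - 1
    s = clip / q_max
    q = np.clip(np.round(x / s), -q_max - 1, q_max)
    return q * s

rng = np.random.default_rng(0)
x = rng.standard_t(df=4, size=100_000).astype(np.float32)   # heavy-tailed sample

for frac in [1.0, 0.5, 0.25, 0.1]:
    clip = frac * np.abs(x).max()
    mse = np.mean((x - quantize_with_clip(x, clip)) ** 2)
    print(f"clip = {frac:>4} * max|x|  ->  mse = {mse:.5f}")
```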
4.3 Weighted error
Main idea. Some weights matter more because activations amplify them.
Core relation:
minimize the output error $\Vert (W-\hat W)X \Vert^2$ on calibration inputs $X$, not the raw weight error $\Vert W-\hat W\Vert^2$
Two weights with the same rounding error can have very different effects on the layer output, because the activations that multiply them have very different magnitudes.
AI connection. Quantizing a weight is more harmful when common activations magnify that weight's error.
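The sketch below contrasts the plain weight error with the activation-weighted (output) error for a toy linear layer, which is the quantity GPTQ-style methods actually try to control (random matrices throughout; the shapes, the planted high-magnitude channel, and the per-tensor 4-bit setting are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_tokens = 256, 256, 1024
W = rng.normal(size=(d_out, d_in)).astype(np.float32)
X = rng.normal(size=(d_in, n_tokens)).astype(np.float32)
X[5] *= 20.0                          # one high-magnitude input channel

# simple symmetric per-tensor 4-bit quantization of W
q_max = 7
s = np.abs(W).max() / q_max
W_hat = np.clip(np.round(W / s), -8, 7) * s

weight_err = np.linalg.norm(W - W_hat) ** 2
output_err = np.linalg.norm((W - W_hat) @ X) ** 2
print("plain weight error      :", weight_err)
print("activation-weighted err :", output_err)
```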
4.4 GPTQ intuition
Main idea. Quantize weights while compensating error using approximate second-order information.
Core relation:
quantize weights one column at a time, using $H \approx X X^\top$ from calibration data to update the not-yet-quantized weights so they absorb the error just introduced
This is the GPTQ recipe in one sentence: greedy rounding with second-order error compensation, which is why it needs calibration activations but no gradient training.
4.5 AWQ intuition
Main idea. Protect activation-salient channels during weight quantization.
Core relation:
per-channel activation magnitude $\lvert x_j\rvert$ indicates sensitivity; rescale salient channels before quantizing
AWQ observes that a small fraction of weight channels sit on large activations; protecting those channels with a per-channel scaling, rather than keeping them in higher precision, preserves most of the quality at a uniform low bit width.
5. Quantization-Aware Training and QLoRA
This part studies quantization-aware training and QLoRA as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Fake quantization | simulate quantization during training while keeping gradients useful | $\hat W=Q(W)$ in forward |
| Straight-through estimator | treat rounding as identity in backward | $\partial\,\mathrm{round}(u)/\partial u \approx 1$ |
| QLoRA pattern | freeze a quantized base and train low-rank adapters | $y=\hat W x + BA\,x$ |
| Optimizer memory | optimizer states are needed for trainable adapters, not frozen base weights | optimizer bytes $\propto$ trainable params |
| Dequantization path | many kernels dequantize blocks on the fly for matmul | $y=\mathrm{dequant}(W_q)\,x$ |
5.1 Fake quantization
Main idea. Simulate quantization during training while keeping gradients useful.
Core relation:
$\hat W=Q(W)$ in the forward pass, while the master weights $W$ stay in full precision
Fake quantization exposes the network to quantization error during training so the weights can adapt to the grid, without actually storing low-bit weights until export.
5.2 Straight-through estimator
Main idea. Treat rounding as identity in backward.
Core relation:
$\dfrac{\partial\,\mathrm{round}(u)}{\partial u} \approx 1$ in the backward pass
Rounding has zero gradient almost everywhere, so the straight-through estimator passes the gradient through as if rounding were the identity, which keeps a training signal flowing to the full-precision master weights.
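A minimal PyTorch sketch of fake quantization with a straight-through estimator, written with the standard `detach` trick rather than a custom autograd function (the symmetric 4-bit setting and tensor sizes are illustrative assumptions).

```python
import torch

def fake_quantize_ste(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Forward: symmetric uniform quantization. Backward: identity (STE).

    (w_q - w).detach() + w equals the quantized value in the forward pass,
    but its gradient with respect to w is exactly 1.
    """
    q_max = 2 ** (n_bits - 1) - 1
    scale = w.detach().abs().max() / q_max
    w_q = torch.clamp(torch.round(w / scale), -q_max - 1, q_max) * scale
    return (w_q - w).detach() + w

w = torch.randn(256, 256, requires_grad=True)
x = torch.randn(32, 256)
y = x @ fake_quantize_ste(w).t()
y.pow(2).mean().backward()
print("gradient is dense and finite:", torch.isfinite(w.grad).all().item())
```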
5.3 QLoRA pattern
Main idea. Freeze a quantized base and train low-rank adapters.
Core relation:
$y = \hat W x + B A\,x$, with $\hat W$ a frozen low-bit base weight (dequantized on the fly) and the low-rank factors $A, B$ trained in higher precision
QLoRA keeps the base model at roughly 4 bits per weight while fine-tuning only the small adapter matrices, so a large model can be adapted on modest hardware.
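A structural sketch of the QLoRA pattern in PyTorch: a frozen, simulated-quantized linear layer plus a trainable low-rank adapter. Real implementations store the base weights in a packed 4-bit format such as NF4 and dequantize blocks inside the kernel; here the quantization is simulated with plain tensors to keep the example self-contained, and the rank and scaling are illustrative choices.

```python
import torch
import torch.nn as nn

def simulate_int4(w: torch.Tensor) -> torch.Tensor:
    """Stand-in for a 4-bit base weight: quantize once, keep the dequantized copy."""
    q_max = 7
    scale = w.abs().max() / q_max
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

class QLoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        base = torch.randn(d_out, d_in) * 0.02
        # frozen "quantized" base weight: no gradient, no optimizer state
        self.register_buffer("w_base", simulate_int4(base))
        # trainable low-rank adapter
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.w_base.t() + self.scaling * (x @ self.A.t() @ self.B.t())

layer = QLoRALinear(512, 512)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print("trainable adapter params:", trainable, "vs frozen base:", layer.w_base.numel())
```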
5.4 Optimizer memory
Main idea. Optimizer states are needed for trainable adapters, not frozen base weights.
Core relation:
optimizer memory $\propto$ trainable parameters (for Adam, roughly two extra states per trainable parameter); the frozen base contributes none
This is the main reason the QLoRA pattern fits on small hardware: the expensive optimizer states exist only for the adapters, while the base model costs only its quantized storage plus activations.
5.5 Dequantization path
Main idea. Many kernels dequantize blocks on the fly for matmul.
Core relation:
$y=\mathrm{dequant}(W_q)\,x$, with blocks of $W_q$ dequantized inside the matmul kernel rather than materialized as a full-precision copy
The low-bit tensor is what lives in GPU memory; whether you also get a speedup depends on how efficiently the kernel fuses dequantization with the matrix multiply.
6. Distillation Basics
This part studies distillation basics as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Teacher and student | train a smaller model to match a larger model | $p_S \to p_T$ |
| Soft targets | teacher probabilities contain similarity information beyond one-hot labels | $p_T(y\mid x)$ instead of a one-hot label |
| Temperature | soften probability distributions | $p_i^{(T)}=\mathrm{softmax}(z_i/T)$ |
| KL distillation loss | minimize divergence from teacher to student | $L_{\mathrm{KD}}=T^2\,\mathrm{KL}(p_T^{(T)}\,\Vert\,p_S^{(T)})$ |
| Hard-label mixture | combine task loss with distillation loss | $L=\alpha L_{\mathrm{CE}}+(1-\alpha)L_{\mathrm{KD}}$ |
6.1 Teacher and student
Main idea. Train a smaller model to match a larger model.
Core relation:
train the student distribution $p_S$ to stay close to the teacher distribution $p_T$ on a shared set of inputs
The teacher can be a larger model, an ensemble, or the same model before compression; what matters is that it provides a richer training signal than the raw labels alone.
6.2 Soft targets
Main idea. Teacher probabilities contain similarity information beyond one-hot labels.
Core relation:
$p_T(y\mid x)$ as the target instead of a one-hot label
The teacher's near-miss probabilities encode which wrong tokens are plausible, and that similarity structure is exactly what a small student cannot easily discover from hard labels alone.
6.3 Temperature
Main idea. Soften probability distributions.
Core relation:
$p_i^{(T)}=\dfrac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$
Temperature $T>1$ flattens both distributions so the small probabilities carry gradient signal; the conventional $T^2$ factor in the loss keeps gradient magnitudes comparable across temperatures.
6.4 KL distillation loss
Main idea. Minimize divergence from teacher to student.
Core relation:
$L_{\mathrm{KD}} = T^2\,\mathrm{KL}\!\left(p_T^{(T)} \,\Vert\, p_S^{(T)}\right) = T^2 \sum_i p_{T,i}^{(T)} \log\frac{p_{T,i}^{(T)}}{p_{S,i}^{(T)}}$
The forward KL direction penalizes the student for assigning low probability where the teacher assigns high probability, which pushes the student to cover the teacher's modes.
AI connection. Distillation trains the student on the teacher's distribution, not only the final answer.
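A compact PyTorch version of the temperature-scaled KL distillation loss, applied to per-token logits (batch, sequence, and vocabulary sizes are illustrative; the `T**2` factor follows the standard Hinton-style convention).

```python
import torch
import torch.nn.functional as F

def distillation_kl(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    temperature: float = 2.0) -> torch.Tensor:
    """T^2 * KL(p_teacher^(T) || p_student^(T)), averaged over token positions."""
    t = temperature
    v = student_logits.size(-1)
    log_p_student = F.log_softmax(student_logits.reshape(-1, v) / t, dim=-1)
    p_teacher = F.softmax(teacher_logits.reshape(-1, v) / t, dim=-1)
    # F.kl_div expects log-probabilities for the input and probabilities for the target
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return (t ** 2) * kl

student_logits = torch.randn(4, 32, 32000)   # (batch, seq, vocab)
teacher_logits = torch.randn(4, 32, 32000)
print(distillation_kl(student_logits, teacher_logits).item())
```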
6.5 Hard-label mixture
Main idea. Combine task loss with distillation loss.
Core relation:
$L=\alpha\,L_{\mathrm{CE}}(\text{student, labels}) + (1-\alpha)\,L_{\mathrm{KD}}(\text{student, teacher})$
The hard-label term keeps the student anchored to ground truth where the teacher is wrong, while the soft term transfers the teacher's distributional knowledge; $\alpha$ is tuned on a validation set.
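The mixture is one line on top of the standard cross-entropy. This sketch deliberately overlaps the previous one so it runs on its own; `alpha` and the temperature are tunable assumptions, not recommended values.

```python
import torch
import torch.nn.functional as F

def distillation_mixture(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         labels: torch.Tensor,
                         alpha: float = 0.5,
                         temperature: float = 2.0) -> torch.Tensor:
    """alpha * cross-entropy on hard labels + (1 - alpha) * T^2-scaled KL to the teacher."""
    v = student_logits.size(-1)
    flat_student = student_logits.reshape(-1, v)
    flat_teacher = teacher_logits.reshape(-1, v)
    # hard-label task loss
    ce = F.cross_entropy(flat_student, labels.reshape(-1))
    # soft-label distillation loss at temperature T
    t = temperature
    kd = F.kl_div(F.log_softmax(flat_student / t, dim=-1),
                  F.softmax(flat_teacher / t, dim=-1),
                  reduction="batchmean") * (t ** 2)
    return alpha * ce + (1.0 - alpha) * kd

student_logits = torch.randn(4, 32, 32000)
teacher_logits = torch.randn(4, 32, 32000)
labels = torch.randint(0, 32000, (4, 32))
print(distillation_mixture(student_logits, teacher_logits, labels).item())
```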
7. LLM Distillation Types
This part studies LLM distillation types as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Logit distillation | match next-token distributions | $\mathrm{KL}(p_T(\cdot\mid x_{<t})\,\Vert\,p_S(\cdot\mid x_{<t}))$ per position |
| Sequence distillation | train on teacher-generated completions | $L_{\mathrm{CE}}$ on $(x, \hat y)$ with $\hat y \sim p_T$ |
| Feature distillation | match hidden states or attention maps | $\Vert h_S - W_{\mathrm{proj}} h_T\Vert^2$ |
| Preference distillation | transfer teacher comparisons or judge preferences | teacher-labeled pairs $y^{+}\succ y^{-}$ |
| Reasoning trace distillation | train on teacher-produced intermediate reasoning when appropriate | $L_{\mathrm{CE}}$ on (prompt, trace, answer) |
7.1 Logit distillation
Main idea. Match next-token distributions.
Core relation:
per-position $\mathrm{KL}\!\left(p_T(\cdot\mid x_{<t}) \,\Vert\, p_S(\cdot\mid x_{<t})\right)$ summed over the sequence
Logit distillation needs access to teacher logits (or at least top-$k$ probabilities) under a shared tokenizer, which makes it the richest but also the most infrastructure-heavy signal.
7.2 Sequence distillation
Main idea. Train on teacher-generated completions.
Core relation:
sample $\hat y \sim p_T(\cdot\mid x)$, then train the student with ordinary cross-entropy on $(x, \hat y)$
This is the simplest pipeline because it only needs teacher text, not teacher logits, but the student sees one sampled completion per prompt and can lose the diversity of the teacher's distribution.
7.3 Feature distillation
Main idea. Match hidden states or attention maps.
Core relation:
$\Vert h_S(x) - W_{\mathrm{proj}}\,h_T(x)\Vert^2$ on selected layers, where $W_{\mathrm{proj}}$ maps between student and teacher hidden sizes
Feature distillation supervises intermediate representations (hidden states or attention maps) and is most useful when the student shares an architecture family and tokenizer with the teacher.
7.4 Preference distillation
Main idea. Transfer teacher comparisons or judge preferences.
Core relation:
teacher or judge preferences $y^{+}\succ y^{-}$ used as training pairs for a preference objective (for example a DPO-style loss)
Instead of imitating exact outputs, the student learns which of two responses the teacher prefers, which transfers ranking behavior even when exact token matching is not desirable.
7.5 Reasoning trace distillation
Main idea. Train on teacher-produced intermediate reasoning when appropriate.
Core relation:
cross-entropy on (prompt, teacher trace, final answer) triples
Training on intermediate reasoning can transfer multi-step behavior to a smaller model; the qualifier "when appropriate" matters, since traces should be filtered for correctness and any licensing or policy constraints on teacher outputs apply.
8. Error and Evaluation
This part studies error and evaluation as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Perplexity shift | quantization can be measured by held-out NLL change | $\Delta\mathrm{PPL}=\mathrm{PPL}_{\text{compressed}}-\mathrm{PPL}_{\text{base}}$ |
| Task score shift | compression should be checked on downstream tasks | $\Delta$ score per benchmark |
| Calibration shift | probabilities may become miscalibrated | $\Delta\mathrm{ECE}$ |
| Layer sensitivity | some layers or projections tolerate fewer bits poorly | per-layer error when quantizing one layer at a time |
| Outlier channels | activation outliers often dominate low-bit error | $\max_j \lvert x_j\rvert$ sets the range |
8.1 Perplexity shift
Main idea. Quantization can be measured by held-out NLL change.
Core relation:
$\mathrm{PPL}=\exp\!\left(\tfrac{1}{N}\sum_t -\log p(x_t\mid x_{<t})\right)$, reported as $\Delta\mathrm{PPL}$ against the uncompressed base on the same data
Perplexity is a cheap, sensitive smoke test: a large jump almost always means something is broken, but a small shift does not guarantee that downstream tasks are unaffected.
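A minimal sketch of measuring the perplexity shift between a base and a compressed model from per-token negative log-likelihoods (the hard-coded NLL lists are hypothetical stand-ins for whatever evaluation loop produces token NLLs in practice).

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# hypothetical per-token NLLs from the same held-out text under both models
base_nlls       = [2.31, 1.07, 0.45, 3.10, 2.02, 0.88]
compressed_nlls = [2.35, 1.10, 0.47, 3.25, 2.08, 0.90]

ppl_base = perplexity(base_nlls)
ppl_comp = perplexity(compressed_nlls)
print(f"base PPL {ppl_base:.2f}  compressed PPL {ppl_comp:.2f}  shift {ppl_comp - ppl_base:+.2f}")
```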
8.2 Task score shift
Main idea. Compression should be checked on downstream tasks.
Core relation:
$\Delta$ score per downstream benchmark, compared against the run-to-run noise of that benchmark
Different tasks degrade at different rates under compression; generation-heavy and multi-step tasks are often reported to degrade earlier than perplexity suggests, so pick evaluations that match the deployment workload.
8.3 Calibration shift
Main idea. Probabilities may become miscalibrated.
Core relation:
change in expected calibration error, $\Delta\mathrm{ECE}$, or a reliability-curve comparison between base and compressed models
Even when accuracy holds, compressed models can become over- or under-confident, which matters whenever downstream systems consume the probabilities rather than just the argmax.
8.4 Layer sensitivity
Main idea. Some layers or projections tolerate fewer bits poorly.
Core relation:
per-layer sensitivity: quantize one layer (or one projection type) at a time and record the quality drop
Embeddings, the output head, and certain attention projections are commonly more sensitive than MLP weights, which is why mixed-precision schemes keep the most sensitive tensors in more bits.
8.5 Outlier channels
Main idea. Activation outliers often dominate low-bit error.
Core relation:
$\max_j \lvert x_j\rvert$ sets the quantization range, so a handful of outlier channels can dominate the error budget for every other channel
Outlier-aware tricks (per-channel scales, SmoothQuant-style rescaling, keeping a few channels in higher precision) exist precisely to contain this effect.
9. Deployment Choices
This part studies deployment choices as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Weight-only quantization | reduce weight bandwidth while leaving activations higher precision | e.g. W4A16 |
| Weight-activation quantization | quantize both weights and activations for more kernel speed | e.g. W8A8 |
| KV-cache quantization | increase context or batch capacity | cache bytes $\propto$ bits per entry |
| Distill then quantize | a smaller student can also be quantized | teacher $\to$ student $\to$ quantized student |
| Hardware format | INT4, INT8, FP8, and NF4 need matching kernels | — |
9.1 Weight-only quantization
Main idea. Reduce weight bandwidth while leaving activations higher precision.
Core relation: decode is usually memory-bandwidth bound, so loading 4-bit weights instead of 16-bit weights cuts the bytes moved per token by roughly $4\times$, even though the multiply-accumulate still happens in fp16 or bf16 after dequantization.
Implementation check. Confirm that the serving kernel fuses dequantization into the matmul; a separate dequantize-then-matmul path can erase the bandwidth win.
AI connection. Weight-only INT4 or NF4 is a common way to fit a large model on a single consumer GPU with modest quality loss.
Common mistake. A smaller checkpoint is not automatically faster; weight-only quantization mainly helps when decode is bandwidth bound, not when prefill compute dominates.
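A minimal sketch of group-wise weight-only quantization with the activations kept in float; the group size, shapes, and function names are illustrative, and a real kernel would fuse the dequantization rather than materializing the full matrix.

```python
import numpy as np

def quantize_groupwise(w, bits=4, group=64):
    """Quantize W in groups of `group` rows per output column, each with its own scale."""
    qmax = 2 ** (bits - 1) - 1
    k, n = w.shape
    assert k % group == 0
    wg = w.reshape(k // group, group, n)
    scales = np.max(np.abs(wg), axis=1, keepdims=True) / qmax
    q = np.clip(np.round(wg / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def weight_only_matmul(x, q, scales):
    """Activations stay float; weights are dequantized just before the matmul."""
    w_hat = (q * scales).reshape(-1, q.shape[-1]).astype(np.float32)
    return x @ w_hat

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(512, 256)).astype(np.float32)
x = rng.normal(size=(8, 512)).astype(np.float32)
q, scales = quantize_groupwise(w)
err = float(np.mean((weight_only_matmul(x, q, scales) - x @ w) ** 2))
print("output MSE from 4-bit weight-only quantization:", err)
```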
9.2 Weight-activation quantization
Main idea. Quantize both weights and activations for more kernel speed.
Core relation: with symmetric scales $s_x$ and $s_w$, the product $Y \approx (s_x s_w)\,(Q_x Q_w)$ can be computed with integer matrix multiplies and a single rescale, which is what INT8 and FP8 tensor cores accelerate.
Implementation check. Activation quantization is where most of the quality risk lives; calibrate activation ranges on representative prompts and watch for the outlier channels discussed in Section 8.5.
Common mistake. Do not assume weight-activation quantization is always faster end to end; if activations must be quantized and dequantized around every matmul without kernel fusion, the overhead can cancel the speedup.
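A minimal W8A8-style sketch: both tensors get per-tensor symmetric INT8 scales, the matmul accumulates in int32, and one rescale maps the result back to float. The shapes and helper names are illustrative.

```python
import numpy as np

def quant_int8(t):
    """Per-tensor symmetric INT8 quantization: returns int8 values and a scale."""
    scale = np.max(np.abs(t)) / 127.0
    q = np.clip(np.round(t / scale), -128, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 256)).astype(np.float32)
w = rng.normal(0.0, 0.02, size=(256, 128)).astype(np.float32)

qx, sx = quant_int8(x)
qw, sw = quant_int8(w)

# Integer matmul with int32 accumulation, then a single rescale back to float.
acc = qx.astype(np.int32) @ qw.astype(np.int32)
y_hat = acc.astype(np.float32) * (sx * sw)

y = x @ w
print("relative error:", float(np.linalg.norm(y - y_hat) / np.linalg.norm(y)))
```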
9.3 KV-cache quantization
Main idea. Increase context or batch capacity.
Core relation: KV-cache memory grows as $2 \times \text{layers} \times \text{kv heads} \times d_{\text{head}} \times \text{sequence length} \times \text{batch} \times \text{bytes per value}$, so halving bytes per value roughly doubles the context length or batch size that fits in the same memory.
Implementation check. Quantize keys and values with per-head or per-channel scales and evaluate long-context tasks specifically; short-prompt benchmarks can hide cache-quantization damage.
Common mistake. Do not tune KV-cache precision on short sequences only; attention over a long, coarsely quantized cache is where errors accumulate.
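The sketch below just evaluates the core relation for a hypothetical 7B-class configuration (32 layers, 8 KV heads, head dimension 128); the numbers are illustrative and not tied to any particular model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value):
    """Keys and values: 2 tensors per layer of shape [batch, kv_heads, seq_len, head_dim]."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical 7B-class config, 8192-token context, batch of 4 sequences.
for name, bytes_per_value in [("fp16", 2), ("int8", 1), ("int4 (packed)", 0.5)]:
    gib = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=4,
                         bytes_per_value=bytes_per_value) / 2**30
    print(f"{name}: {gib:.2f} GiB of KV cache")
```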
9.4 Distill then quantize
Main idea. A smaller student can also be quantized.
Core relation: the two error sources compose; roughly, the shipped quality gap $\approx$ (distillation gap of the student) $+$ (quantization error on the student), so both must be measured on the final artifact.
Implementation check. Distill first, then quantize the student, and evaluate the quantized student directly; evaluating only the full-precision student overstates the shipped quality.
Common mistake. Do not assume a student that distilled well is automatically robust to low-bit quantization; small students can be more sensitive per layer than the teacher.
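For the distillation half of this pipeline, the sketch below computes the temperature-scaled KL loss the student would be trained with before quantization, following the $T^2$ scaling of Hinton et al. (2015); the logits here are random stand-ins.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)      # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_kl(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) at temperature T, scaled by T^2."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return (T ** 2) * float(kl.mean())

rng = np.random.default_rng(0)
t = rng.normal(size=(4, 10))                    # teacher logits (toy)
s = t + rng.normal(scale=0.5, size=(4, 10))     # imperfect student logits
print("distillation loss:", distill_kl(t, s, T=2.0))
```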
9.5 Hardware format
Main idea. INT4, INT8, FP8, and NF4 need matching kernels.
Core relation: end-to-end speed depends on storage format, kernel support, and hardware generation together, not on bit width alone; a format the accelerator cannot multiply natively must be dequantized before compute.
INT8 and FP8 have native tensor-core paths on recent accelerators, while INT4 and NF4 are usually weight-only storage formats that are dequantized to fp16 or bf16 inside the kernel.
Implementation check. Verify that the chosen format has a maintained kernel in the serving stack you actually deploy, and benchmark on the target GPU rather than extrapolating from another device.
Common mistake. Do not pick a format from a paper's accuracy table alone; without kernel support the smaller file gives no latency benefit.
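One reason formats need matching kernels is that sub-byte values must be packed and unpacked; the sketch below packs two signed 4-bit values per byte and recovers them, which is the storage-side half of what an INT4 kernel does before multiplying. The packing layout (low nibble first) is a choice made for the demo.

```python
import numpy as np

def pack_int4(q):
    """Pack signed 4-bit values in [-8, 7] two per byte, low nibble first."""
    assert q.size % 2 == 0
    u = (q.astype(np.int8) & 0x0F).astype(np.uint8)       # two's-complement nibbles
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Recover signed 4-bit values from packed bytes."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return np.where(out > 7, out - 16, out)                # sign-extend each nibble

q = np.array([-8, -1, 0, 3, 7, -5], dtype=np.int8)
packed = pack_int4(q)
print("bytes used:", packed.nbytes, "for", q.size, "values")
print("round trip ok:", bool(np.array_equal(unpack_int4(packed), q)))
```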
10. Debugging Checklist
This part studies debugging checklist as compression math. The useful habit is to separate storage format, dequantized computation, approximation error, and evaluation.
| Subtopic | Main idea | Formula |
|---|---|---|
| Calibration representativeness | Calibration data should match deployment prompts | |
| Layer-by-layer error | Inspect where compression hurts | |
| Reference comparisons | Compare logits before and after compression | $\max \lvert \Delta \text{logits} \rvert$ |
| Latency measurement | Confirm the chosen format is actually faster | |
| Quality gates | Do not ship a compressed model without task and safety checks | $\lvert \Delta S \rvert$ bounded |
10.1 Calibration representativeness
Main idea. Calibration data should match deployment prompts.
Core relation: quantization parameters are fitted to the statistics of the calibration set, so if deployment activations have a different range or outlier pattern, the chosen scales and clipping points are wrong for the traffic that actually matters.
Implementation check. Build the calibration set from real or realistic deployment prompts, covering the languages, lengths, domains, and system prompts you serve, and spot-check activation ranges on both sets.
Common mistake. Do not reuse one calibration set across very different deployments; a model calibrated on English web text can degrade disproportionately on code or long multilingual prompts.
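One cheap check, sketched below with synthetic activations: compare per-channel activation ranges on the calibration batch against a sample of deployment traffic. Ratios well above 1 mean the calibrated scales will clip. The function name and the 1.5 threshold are illustrative choices.

```python
import numpy as np

def range_mismatch(calib_acts, deploy_acts):
    """Ratio of deployment to calibration per-channel ranges (>1 means clipping risk)."""
    calib_max = np.abs(calib_acts).max(axis=0)
    deploy_max = np.abs(deploy_acts).max(axis=0)
    return deploy_max / np.maximum(calib_max, 1e-8)

rng = np.random.default_rng(0)
calib = rng.normal(0.0, 1.0, size=(1024, 64))     # stand-in for calibration activations
deploy = rng.normal(0.0, 1.0, size=(1024, 64))    # stand-in for deployment activations
deploy[:, 10] *= 5.0                              # one channel behaves differently in deployment
ratio = range_mismatch(calib, deploy)
print("channels at clipping risk:", np.where(ratio > 1.5)[0])
```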
10.2 Layer-by-layer error
Main idea. Inspect where compression hurts.
Core relation: per-layer error $e_\ell = \lVert y_\ell - \hat{y}_\ell \rVert$, the distance between the base layer's output $y_\ell$ and the compressed layer's output $\hat{y}_\ell$ on the same inputs.
Implementation check. Hook both models, feed identical calibration batches, and record $e_\ell$ for every layer; a single layer that dominates the total error is the first candidate for higher precision or re-quantization.
Common mistake. Do not inspect weight-space error alone; a small weight perturbation can still produce a large output error when the layer's inputs contain outliers.
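The sketch below runs a full-precision and a fully quantized toy stack side by side on the same inputs and prints the relative output error after each layer, which is how accumulation shows up; the stack and shapes are stand-ins for real transformer blocks.

```python
import numpy as np

def quantize_dequantize(w, bits=4):
    """Symmetric round-trip quantization used to build the compressed copy."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
layers = [rng.normal(0.0, 0.05, size=(256, 256)) for _ in range(4)]
q_layers = [quantize_dequantize(w, bits=4) for w in layers]
x = rng.normal(size=(32, 256))

h_base, h_quant = x, x
for i, (w, wq) in enumerate(zip(layers, q_layers)):
    h_base = np.maximum(h_base @ w, 0.0)      # full-precision path
    h_quant = np.maximum(h_quant @ wq, 0.0)   # fully quantized path
    rel = float(np.linalg.norm(h_base - h_quant) / np.linalg.norm(h_base))
    print(f"after layer {i}: relative output error {rel:.3e}")
```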
10.3 Reference comparisons
Main idea. Compare logits before and after compression.
Core relation: on identical inputs, track $\max \lvert \Delta \text{logits} \rvert$, the mean KL divergence between base and compressed next-token distributions, and top-1 agreement; logits are a far more sensitive probe than downstream task scores.
Implementation check. Run both models on the same fixed prompt set with identical tokenization under teacher forcing, and store the base logits once so every new compression run compares against the same reference.
Common mistake. Do not use sampled generations as the first debugging signal; sampling noise hides small but systematic logit drift.
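A minimal sketch of such a reference comparison; the base logits and the additive noise standing in for quantization error are synthetic, and the summary statistics are the ones named above.

```python
import numpy as np

def compare_logits(base_logits, comp_logits):
    """Summary statistics for a base-vs-compressed logit comparison on shared inputs."""
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    p, q = softmax(base_logits), softmax(comp_logits)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return {
        "max_abs_logit_diff": float(np.max(np.abs(base_logits - comp_logits))),
        "mean_kl": float(np.mean(kl)),
        "top1_agreement": float(np.mean(base_logits.argmax(-1) == comp_logits.argmax(-1))),
    }

rng = np.random.default_rng(0)
base = rng.normal(size=(64, 32000))                       # stand-in for base-model logits
comp = base + rng.normal(scale=0.05, size=base.shape)     # stand-in for quantization noise
print(compare_logits(base, comp))
```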
10.4 Latency measurement
Main idea. Confirm the chosen format is actually faster.
Core relation: decode latency is roughly bytes moved per token divided by memory bandwidth, plus kernel overhead, so a smaller format only helps if the serving kernel actually reads the smaller representation.
Implementation check. Measure prefill and decode separately on the target hardware, with warmup, realistic batch sizes, and the production serving stack; report tokens per second, not just wall-clock time for a single prompt.
AI connection. A smaller file is not automatically a faster model if kernels do not support the format well.
Common mistake. Do not benchmark at batch size 1 only; quantization kernels that win at small batches can lose at the large batches a production server runs.
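Real latency numbers should come from the actual serving stack on the target GPU; the sketch below only shows the warmup-then-average measurement pattern, with a projection-sized matmul standing in for one decode step.

```python
import time
import numpy as np

def steps_per_second(step_fn, warmup=10, iters=50):
    """Crude throughput estimate: time repeated calls to step_fn after a warmup."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    return iters / (time.perf_counter() - start)

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 4096)).astype(np.float32)   # stand-in for one projection
x = rng.normal(size=(1, 4096)).astype(np.float32)      # single-token decode input
print(f"{steps_per_second(lambda: x @ w):.1f} matmuls/s")
```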
10.5 Quality gates
Main idea. Do not ship a compressed model without task and safety checks.
Core relation: $\lvert \Delta S \rvert$ bounded, where $S$ is each tracked task or safety score and the bound is agreed before compression begins.
Implementation check. Gate the release on a fixed evaluation suite covering task quality, calibration, long-context behavior, and safety checks, and require the compressed model to stay within the agreed bound on every gate, not just on average.
Common mistake. Do not treat a small average perplexity shift as clearance to ship; averages can hide large regressions on specific domains, languages, or safety-relevant prompts.
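A minimal sketch of such a gate: compare per-metric scores of the base and compressed models and refuse to ship if any tracked metric drops by more than an agreed amount. The metric names and the 0.01 threshold are placeholders, not recommendations.

```python
def passes_quality_gates(base_scores, comp_scores, max_drop=0.01):
    """Reject a compressed model if any tracked metric drops by more than `max_drop`."""
    failures = {
        name: (base_scores[name], comp_scores[name])
        for name in base_scores
        if base_scores[name] - comp_scores[name] > max_drop
    }
    return len(failures) == 0, failures

base = {"task_accuracy": 0.71, "safety_refusal_rate": 0.98}   # placeholder scores
comp = {"task_accuracy": 0.70, "safety_refusal_rate": 0.93}
ok, failed = passes_quality_gates(base, comp)
print("ship?", ok, "| failing gates:", failed)
```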
Practice Exercises
- Quantize and dequantize scalar values with an affine quantizer.
- Compute symmetric INT4 scale and error.
- Compare per-tensor and per-channel quantization error.
- Compute memory reduction from bf16 to 4-bit weights.
- Sweep clipping ranges and choose the lowest MSE.
- Compute distillation probabilities at temperature.
- Compute KL distillation loss for teacher and student distributions.
- Combine hard-label CE and distillation loss.
- Estimate QLoRA optimizer-state memory.
- Write a compression deployment checklist.
Why This Matters for AI
Compression determines who can run a model, how much serving costs, and which devices can host useful AI locally. The math matters because bad compression can keep a model small but silently damage probabilities, calibration, long-context behavior, or safety behavior.
Bridge to RAG Math and Retrieval
Compression makes a model cheaper. Retrieval can make a model more informed without changing all its weights. The next section studies embedding retrieval, similarity search, ranking, context packing, and how retrieval changes the conditional distribution used by an LLM.
References
- Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the Knowledge in a Neural Network", 2015: https://arxiv.org/abs/1503.02531
- Benoit Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference", 2017: https://arxiv.org/abs/1712.05877
- Tim Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale", 2022: https://arxiv.org/abs/2208.07339
- Elias Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers", 2022: https://arxiv.org/abs/2210.17323
- Guangxuan Xiao et al., "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models", 2022: https://arxiv.org/abs/2211.10438
- Ji Lin et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration", 2023: https://arxiv.org/abs/2306.00978
- Tim Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs", 2023: https://arxiv.org/abs/2305.14314