Fine-Tuning Math
Fine-tuning is controlled movement away from a pretrained model. The mathematics asks four questions: what loss is optimized, which parameters can move, how large is the update space, and how do we measure useful adaptation without damaging the base model.
Overview
Pretraining learns a broad conditional distribution. Fine-tuning shifts that distribution toward a task, domain, instruction style, preference pattern, or deployment constraint. The shift can be full-rank and full-model, or it can be restricted to a small trainable subspace such as adapters, soft prompts, prefixes, or LoRA matrices.
The central decomposition is $\theta = \theta_0 + \Delta\theta$.
The base weights carry pretrained capability. The update carries adaptation. Fine-tuning math is the study of how to choose, constrain, optimize, and evaluate that update.
Prerequisites
- Cross-entropy and answer-only token loss
- Training memory and optimizer-state accounting
- Matrix ranks, low-rank factorization, and SVD intuition
- KL divergence and conditional language-model scoring
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Executable demos for SFT masking, LoRA parameter counts, low-rank updates, adapter counts, prefix counts, DPO loss, forgetting tradeoffs, and merge checks. |
| exercises.ipynb | Ten practice problems for the update geometry and objective bookkeeping used in real fine-tuning runs. |
Learning Objectives
After this section, you should be able to:
- Write the fine-tuning objective as local adaptation from pretrained weights.
- Compute answer-only SFT loss with prompt and padding masks.
- Explain full fine-tuning, linear probing, adapters, soft prompts, prefix tuning, and LoRA.
- Count trainable parameters for LoRA, adapters, and prefix methods.
- Derive the LoRA update and merge it for inference.
- Explain why PEFT reduces optimizer-state memory but not all activation memory.
- Compute a DPO preference loss from model and reference log probabilities.
- Diagnose forgetting, overfitting, mask bugs, and incorrect parameter freezing.
Table of Contents
- Fine-Tuning as Local Adaptation
- Supervised Fine-Tuning Objective
- 2.1 Instruction-response pairs
- 2.2 Answer-only loss
- 2.3 Teacher forcing
- 2.4 Label smoothing
- 2.5 Dataset mixture weights
- Full Fine-Tuning
- 3.1 All parameters trainable
- 3.2 Memory cost
- 3.3 Layer-wise learning rates
- 3.4 Weight decay
- 3.5 When full tuning helps
- Parameter-Efficient Fine-Tuning
- 4.1 Linear probing
- 4.2 Adapters
- 4.3 Soft prompt tuning
- 4.4 Prefix tuning
- 4.5 Low-rank adaptation
- LoRA Algebra
- 5.1 Rank constraint
- 5.2 Parameter count
- 5.3 Scaling
- 5.4 Merge for inference
- 5.5 Target modules
- Quantized and Memory-Aware Tuning
- Preference Fine-Tuning
- Evaluation and Diagnostics
- 8.1 Train and validation loss
- 8.2 Base-task retention
- 8.3 Task quality
- 8.4 Distribution shift
- 8.5 Adapter sanity checks
- Choosing a Method
- 9.1 No-update methods
- 9.2 PEFT methods
- 9.3 Full tuning
- 9.4 Preference tuning
- 9.5 Deployment constraints
- Implementation Checklist
- 10.1 Masking
- 10.2 Parameter freeze audit
- 10.3 Learning-rate groups
- 10.4 Reference model
- 10.5 Ablations
Method Map
| Method | Base weights | Trainable object | Main advantage | Main risk |
|---|---|---|---|---|
| Prompting/RAG | Frozen | None | Zero training | May not enforce consistent behavior |
| Linear probing | Frozen | Output head | Cheap diagnostic | Limited generation adaptation |
| Adapters | Frozen mostly | Bottleneck modules | Multi-task modularity | Extra inference modules |
| Prompt/prefix tuning | Frozen | Continuous prompt or KV prefix | Very small parameter count | Capacity can be limited |
| LoRA | Frozen mostly | Low-rank matrix updates | Strong PEFT default | Rank/target choice matters |
| Full fine-tuning | Trainable | All weights | Maximum capacity | Expensive and can forget |
| Preference tuning | Usually partial or full | Policy update | Aligns comparative behavior | Over-optimization and reward hacking |
1. Fine-Tuning as Local Adaptation
This part treats fine-tuning as local adaptation around the pretrained weights and as a measurable design decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.
| Subtopic | Main idea | Formula |
|---|---|---|
| Starting from pretrained weights | fine-tuning begins near a useful parameter point | $\theta=\theta_0+\Delta\theta$ |
| Task loss | the target distribution changes from broad pretraining to task behavior | $\mathcal{L}_{\text{task}}(\theta)=\mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{task}}}[-\log p_\theta(y\mid x)]$ |
| Regularized adaptation | penalize moving too far from the base model | $\mathcal{L}_{\text{task}}(\theta)+\lambda\lVert\theta-\theta_0\rVert_2^2$ |
| Function-space shift | behavioral movement is better measured on outputs than raw parameters | $\mathbb{E}_x\,D_{\mathrm{KL}}\big(p_{\theta_0}(\cdot\mid x)\,\Vert\,p_\theta(\cdot\mid x)\big)$ |
| Catastrophic forgetting | task learning can reduce general ability | $\Delta_{\text{forget}}=\mathcal{L}_{\text{pre}}(\theta)-\mathcal{L}_{\text{pre}}(\theta_0)$ |
1.1 Starting from pretrained weights
Main idea. Fine-tuning begins near a useful parameter point.
Core relation: $\theta = \theta_0 + \Delta\theta$, with $\theta_0$ the frozen pretrained weights and $\Delta\theta$ the learned update.
Fine-tuning is local learning around a pretrained solution. The base model already has useful representations, so the adaptation method decides how much freedom the update receives. Full fine-tuning gives maximum freedom. PEFT constrains the update to a small module, prompt vector, prefix, or low-rank subspace. Preference tuning changes the objective from imitation to comparative behavior.
Worked micro-example. A projection matrix with $d_{\text{in}} = d_{\text{out}} = 4096$ has about 16.8 million weights. A rank-8 LoRA update for the same matrix has $8 \times (4096 + 4096) = 65{,}536$ trainable weights, before optimizer states. The base matrix can stay frozen while the low-rank update learns the task movement.
Implementation check. Print the number of trainable parameters, inspect masks, run one batch, and verify that loss changes. Then disable the adapter or reload the base model and confirm the difference is caused by the fine-tuning method.
AI connection. The split between $\theta_0$ and $\Delta\theta$ is the central control knob in fine-tuning: every method in this section is a different constraint on $\Delta\theta$.
Common mistake. Do not treat fine-tuning quality as one number. Track task quality, base-skill retention, calibration, refusal or safety behavior if relevant, and deployment cost.
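The implementation check above can be scripted. A minimal sketch, assuming a PyTorch `nn.Module` whose `requires_grad` flags have already been set by the chosen method:

```python
# Trainable-parameter audit: run before the first step and after any
# freeze/unfreeze change; the percentage should match the method's budget.
import torch

def audit_trainable(model: torch.nn.Module) -> None:
    trainable, frozen = 0, 0
    for name, p in model.named_parameters():
        if p.requires_grad:
            trainable += p.numel()
        else:
            frozen += p.numel()
    total = trainable + frozen
    print(f"trainable: {trainable:,} / {total:,} "
          f"({100 * trainable / max(total, 1):.3f}%)")
```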
1.2 Task loss
Main idea. The target distribution changes from broad pretraining to task behavior.
Core relation: $\mathcal{L}_{\text{task}}(\theta)=\mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{task}}}\big[-\log p_\theta(y\mid x)\big]$
The loss form is the same cross-entropy used in pretraining; what changes is the data distribution, from broad text to the task's prompts and targets.
1.3 Regularized adaptation
Main idea. Penalize moving too far from the base model.
Core relation: $\mathcal{L}(\theta)=\mathcal{L}_{\text{task}}(\theta)+\lambda\,\lVert\theta-\theta_0\rVert_2^2$
An explicit distance penalty keeps the solution near the pretrained point; PEFT methods achieve a similar effect implicitly by restricting the update to a small subspace.
1.4 Function-space shift
Main idea. Behavioral movement is better measured on outputs than raw parameters.
Core relation: $\mathbb{E}_{x}\,D_{\mathrm{KL}}\big(p_{\theta_0}(\cdot\mid x)\,\Vert\,p_\theta(\cdot\mid x)\big)$
A small parameter distance can hide a large behavioral change, and a large parameter distance can leave behavior almost unchanged, so measure the shift on model outputs over representative prompts.
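A minimal sketch of measuring the shift in function space, assuming two HuggingFace-style causal LMs whose outputs expose `.logits`; the names `base`, `tuned`, and the probe batch are placeholders:

```python
# Average next-token KL between base and tuned distributions on probe prompts.
import torch
import torch.nn.functional as F

@torch.no_grad()
def next_token_kl(base, tuned, input_ids: torch.Tensor) -> float:
    logp_base = F.log_softmax(base(input_ids).logits, dim=-1)
    logp_tuned = F.log_softmax(tuned(input_ids).logits, dim=-1)
    # KL(base || tuned), averaged over batch and positions.
    kl = (logp_base.exp() * (logp_base - logp_tuned)).sum(-1)
    return kl.mean().item()
```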
1.5 Catastrophic forgetting
Main idea. Task learning can reduce general ability.
Core relation: $\Delta_{\text{forget}}=\mathcal{L}_{\text{pre}}(\theta)-\mathcal{L}_{\text{pre}}(\theta_0)$, one simple way to quantify lost general ability on held-out pretraining-style data.
Common mitigations: smaller learning rates, fewer trainable parameters, mixing general data into the fine-tuning set, and regularizing toward the base model in parameter or function space.
2. Supervised Fine-Tuning Objective
This part treats the supervised fine-tuning objective as a measurable adaptation decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.
| Subtopic | Main idea | Formula |
|---|---|---|
| Instruction-response pairs | train on prompt and desired completion pairs | $\mathcal{L}=-\sum_t\log p_\theta(y_t\mid x,y_{<t})$ |
| Answer-only loss | mask prompt tokens when the objective is response imitation | $\mathcal{L}=\frac{-\sum_t m_t\log p_\theta(y_t\mid x,y_{<t})}{\sum_t m_t}$ |
| Teacher forcing | condition on the gold prefix during training | condition on gold $y_{<t}$, not sampled tokens |
| Label smoothing | soften one-hot targets when useful | $q=(1-\epsilon)y+\epsilon/\lvert V\rvert$ |
| Dataset mixture weights | combine multiple tasks by weighted expectation | $\mathcal{L}=\sum_k w_k\,\mathcal{L}_k,\ \sum_k w_k=1$ |
2.1 Instruction-response pairs
Main idea. Train on prompt and desired completion pairs.
Core relation: $\mathcal{L}(\theta)=-\sum_{t=1}^{|y|}\log p_\theta(y_t\mid x,y_{<t})$
Each example concatenates an instruction or prompt $x$ with a desired response $y$, and the loss is autoregressive cross-entropy over the response given the prompt.
2.2 Answer-only loss
Main idea. Mask prompt tokens when the objective is response imitation.
Core relation: $\mathcal{L}=\dfrac{-\sum_t m_t\,\log p_\theta(y_t\mid x,y_{<t})}{\sum_t m_t}$, with $m_t=1$ on response tokens and $m_t=0$ on prompt and padding tokens.
AI connection. This is the difference between teaching the model to imitate the response and wasting loss on tokens it was handed in the prompt.
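A minimal sketch of the answer-only loss, assuming `logits` of shape `(B, T, V)` and `labels`/`mask` of shape `(B, T)` with `mask = 1` on response tokens; names and shapes are illustrative:

```python
# Masked cross-entropy: prompt and padding positions carry zero weight.
import torch
import torch.nn.functional as F

def answer_only_loss(logits, labels, mask):
    # Shift so position t predicts token t+1 (standard causal alignment).
    logits, labels, mask = logits[:, :-1], labels[:, 1:], mask[:, 1:].float()
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).view(labels.shape)
    # Average over response tokens only.
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```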
2.3 Teacher forcing
Main idea. Condition on the gold prefix during training.
Core relation: at step $t$ the model conditions on the gold prefix $y_{<t}$, never on its own sampled tokens.
Teacher forcing lets all per-token losses be computed in one parallel forward pass, at the cost of never exposing the model to its own mistakes during training.
2.4 Label smoothing
Main idea. Soften one-hot targets when useful.
Core relation: $q=(1-\epsilon)\,y+\epsilon/\lvert V\rvert$
Smoothing moves $\epsilon$ of the target mass from the gold token to a uniform distribution over the vocabulary, discouraging extreme logits; for SFT where exact imitation is the goal, $\epsilon=0$ is a common choice.
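A minimal sketch of the smoothed target and its cross-entropy for one batch of positions; `eps` is the smoothing weight $\epsilon$:

```python
# Smoothed target q = (1 - eps) * one_hot + eps / |V|, then CE under q.
import torch
import torch.nn.functional as F

def smoothed_ce(logits: torch.Tensor, target: torch.Tensor, eps: float = 0.1):
    V = logits.size(-1)
    q = torch.full_like(logits, eps / V)
    # Gold token gets (1 - eps) plus its uniform share eps / V.
    q.scatter_(-1, target.unsqueeze(-1), (1 - eps) + eps / V)
    return -(q * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```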
2.5 Dataset mixture weights
Main idea. Combine multiple tasks by weighted expectation.
Core relation: $\mathcal{L}(\theta)=\sum_k w_k\,\mathbb{E}_{(x,y)\sim\mathcal{D}_k}\big[\ell_\theta(x,y)\big]$ with $\sum_k w_k=1$.
Mixture weights decide how strongly each dataset pulls on the update; keeping some general or pretraining-style data in the mixture is a standard hedge against forgetting. A sampling sketch follows below.
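One way to realize the weighted expectation is to draw each example's source dataset with probability $w_k$. A minimal sketch, with `datasets` and `weights` as hypothetical dicts keyed by task name:

```python
# Sampling task k with probability w_k realizes the weighted expectation
# over many batches.
import random

def sample_task(datasets: dict, weights: dict):
    names = list(datasets)
    w = [weights[n] for n in names]  # assumed to sum to 1
    name = random.choices(names, weights=w, k=1)[0]
    return name, datasets[name]
```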
3. Full Fine-Tuning
This part treats full fine-tuning as a measurable adaptation decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.
| Subtopic | Main idea | Formula |
|---|---|---|
| All parameters trainable | full fine-tuning updates every weight tensor | $\Delta\theta\in\mathbb{R}^{\lvert\theta\rvert}$ |
| Memory cost | trainable parameters require gradients and optimizer states | $M\approx M_{\text{weights}}+M_{\text{grads}}+M_{\text{opt}}+M_{\text{act}}$ |
| Layer-wise learning rates | lower layers can move more slowly than higher layers | $\eta_\ell=\eta_{\text{top}}\,\gamma^{L-\ell}$ |
| Weight decay | regularize parameter norm during adaptation | $\theta\leftarrow\theta-\eta(\nabla\mathcal{L}+\lambda\theta)$ |
| When full tuning helps | large domain shift or maximum quality can justify the cost | cost-benefit, no single formula |
3.1 All parameters trainable
Main idea. Full fine-tuning updates every weight tensor.
Core relation: $\Delta\theta\in\mathbb{R}^{\lvert\theta\rvert}$
Every weight tensor is trainable, so the update has full capacity; the only constraints are the optimizer, the data, and any explicit regularization.
3.2 Memory cost
Main idea. Trainable parameters require gradients and optimizer states.
Core relation: $M\approx M_{\text{weights}}+M_{\text{grads}}+M_{\text{opt}}+M_{\text{act}}$
Under a common mixed-precision Adam accounting (fp16 weights and gradients, fp32 master weights, and two fp32 moment tensors), that is about 16 bytes per trainable parameter before activations, roughly 112 GB for a 7B-parameter model.
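A minimal sketch of that accounting; the 16 bytes per parameter figure is the common mixed-precision Adam estimate, not a universal constant:

```python
# fp16 weights (2) + fp16 grads (2) + fp32 master weights (4)
# + two fp32 Adam moments (4 + 4) = 16 bytes per trainable parameter,
# excluding activations.
def parameter_state_gib(n_params: float, bytes_per_param: int = 16) -> float:
    return n_params * bytes_per_param / 2**30

print(f"{parameter_state_gib(7e9):.0f} GiB for 7B params")  # ~104 GiB
```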
3.3 Layer-wise learning rates
Main idea. Lower layers can move more slowly than higher layers.
Core relation: $\eta_\ell=\eta_{\text{top}}\,\gamma^{\,L-\ell}$ for layer $\ell$ of $L$, with decay factor $\gamma\in(0,1]$.
Lower layers tend to hold general features, so giving them smaller learning rates than the top layers preserves pretrained structure while the task-specific layers move more.
3.4 Weight decay
Main idea. Regularize parameter norm during adaptation.
Core relation: $\theta\leftarrow\theta-\eta\big(\nabla_\theta\mathcal{L}+\lambda\theta\big)$, applied as decoupled decay in AdamW.
Note that weight decay pulls parameters toward zero, not toward $\theta_0$; if the goal is to stay near the base model, penalize $\lVert\theta-\theta_0\rVert$ instead.
3.5 When full tuning helps
Main idea. Large domain shift or maximum quality can justify the cost.
Core relation: there is no single formula; the decision weighs expected quality gain against compute, memory, and forgetting risk.
Full tuning tends to pay off under large domain shift, such as heavy jargon, unfamiliar formats, or a new language, or when the task demands the last few points of quality that a constrained update cannot reach.
4. Parameter-Efficient Fine-Tuning
This part treats parameter-efficient fine-tuning as a measurable adaptation decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.
| Subtopic | Main idea | Formula |
|---|---|---|
| Linear probing | freeze the backbone and train only a head | $\hat{y}=W_{\text{head}}\,f_{\theta_0}(x)$ |
| Adapters | insert small bottleneck modules inside layers | $h\leftarrow h+W_{\text{up}}\,\sigma(W_{\text{down}}h)$ |
| Soft prompt tuning | learn continuous input embeddings | prepend $P\in\mathbb{R}^{m\times d}$ to $E(x)$ |
| Prefix tuning | learn virtual key-value prefixes for attention | learned $(K^{(p)}_\ell,V^{(p)}_\ell)$ per layer |
| Low-rank adaptation | learn a low-rank update while freezing the base matrix | $W=W_0+\frac{\alpha}{r}BA$ |
4.1 Linear probing
Main idea. Freeze the backbone and train only a head.
Core relation: $\hat{y}=W_{\text{head}}\,f_{\theta_0}(x)$ with the backbone $f_{\theta_0}$ frozen.
Probing is a cheap diagnostic of how linearly separable the task is in the frozen representation; it adapts a readout, not the model's generation behavior.
4.2 Adapters
Main idea. Insert small bottleneck modules inside layers.
Core relation: $h\leftarrow h+W_{\text{up}}\,\sigma(W_{\text{down}}h)$, with bottleneck width $b\ll d$.
Each adapter adds roughly $2bd$ parameters per insertion point, and adapters for different tasks can be swapped in and out of the same frozen backbone.
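A minimal bottleneck-adapter sketch in PyTorch. Zero-initializing the up-projection makes the module an identity at step 0, one common initialization choice:

```python
# Bottleneck adapter: down-project, nonlinearity, up-project, residual add.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)  # identity at start of training
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))
```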
4.3 Soft prompt tuning
Main idea. Learn continuous input embeddings.
Core relation: replace the input embeddings $E(x)$ with $[P;E(x)]$, where $P\in\mathbb{R}^{m\times d}$ is learned and everything else is frozen.
The trainable budget is $m\times d$ regardless of model depth, which makes soft prompts tiny but also limits how much behavior they can steer.
4.4 Prefix tuning
Main idea. Learn virtual key-value prefixes for attention.
Core relation: prepend learned key-value pairs $(K^{(p)}_\ell,V^{(p)}_\ell)$ to the attention inputs of every layer $\ell$.
Because the prefix acts inside each layer rather than only at the input, prefix tuning typically has more steering capacity than input-level soft prompts at a similar parameter budget.
4.5 Low-rank adaptation
Main idea. Learn a low-rank update while freezing the base matrix.
Core relation: $W=W_0+\frac{\alpha}{r}BA$, with $B\in\mathbb{R}^{d_{\text{out}}\times r}$, $A\in\mathbb{R}^{r\times d_{\text{in}}}$, rank $r\ll\min(d_{\text{in}},d_{\text{out}})$, and $W_0$ frozen.
AI connection. LoRA works because many useful task updates can be represented well inside a small trainable subspace.
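A minimal LoRA wrapper around a frozen `nn.Linear`, following the core relation above; the initialization scales are illustrative:

```python
# y = W0 x + (alpha/r) B A x. B starts at zero so the adapted layer equals
# the base layer at step 0.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze W0 (and bias)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```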
5. LoRA Algebra
This part treats LoRA algebra as a measurable adaptation decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.
| Subtopic | Main idea | Formula |
|---|---|---|
| Rank constraint | the update matrix has rank at most $r$ | $\operatorname{rank}(BA)\le r$ |
| Parameter count | a $d_{\text{out}}\times d_{\text{in}}$ matrix gets $r(d_{\text{in}}+d_{\text{out}})$ trainable parameters | $r(d_{\text{in}}+d_{\text{out}})$ vs $d_{\text{in}}d_{\text{out}}$ |
| Scaling | the $\alpha/r$ factor controls update magnitude | $\Delta W=\frac{\alpha}{r}BA$ |
| Merge for inference | after training, add the low-rank update into the base matrix | $W_{\text{merged}}=W_0+\frac{\alpha}{r}BA$ |
| Target modules | attention and MLP projections can receive separate adapters | one $(B^{(i)},A^{(i)})$ per targeted $W^{(i)}$ |
5.1 Rank constraint
Main idea. The update matrix has rank at most r.
Core relation: $\operatorname{rank}(\Delta W)=\operatorname{rank}(BA)\le r$
Whatever the gradients do, the realized update can only move the layer inside an at-most-$r$-dimensional subspace; this capacity constraint is exactly what makes LoRA cheap.
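A quick numerical confirmation of the rank bound with NumPy:

```python
# For B (d_out x r) and A (r x d_in), the product BA has rank at most r,
# whatever values the factors take.
import numpy as np

d_out, d_in, r = 256, 512, 8
B = np.random.randn(d_out, r)
A = np.random.randn(r, d_in)
print(np.linalg.matrix_rank(B @ A))  # 8 (almost surely, for random factors)
```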
5.2 Parameter count
Main idea. A d_out by d_in matrix gets r(d_in+d_out) trainable parameters.
Core relation: $N_{\text{LoRA}}=r\,(d_{\text{in}}+d_{\text{out}})$ per adapted matrix, versus $d_{\text{in}}\,d_{\text{out}}$ for the full matrix.
For $d_{\text{in}}=d_{\text{out}}=4096$ and $r=8$ that is $65{,}536$ versus $16{,}777{,}216$ weights, a $256\times$ reduction.
5.3 Scaling
Main idea. The alpha over r factor controls update magnitude.
Core relation: $\Delta W=\frac{\alpha}{r}\,BA$
Dividing by $r$ decouples the update magnitude from the rank, so sweeping $r$ does not silently rescale the update; $\alpha$ is then a separate knob for how strongly the adapter acts.
5.4 Merge for inference
Main idea. After training, add the low-rank update into the base matrix.
Core relation: $W_{\text{merged}}=W_0+\frac{\alpha}{r}\,BA$
AI connection. This is why LoRA can add no extra matrix multiplication at serving time after merging.
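A minimal merge sketch, reusing the `LoRALinear` wrapper from the sketch in 4.5; after merging, a plain linear layer reproduces the adapted outputs:

```python
# Fold W_merged = W0 + (alpha/r) B A into a standard linear layer.
import torch

def merge(lora) -> torch.nn.Linear:
    merged = torch.nn.Linear(lora.base.in_features, lora.base.out_features,
                             bias=lora.base.bias is not None)
    with torch.no_grad():
        merged.weight.copy_(lora.base.weight + lora.scale * lora.B @ lora.A)
        if lora.base.bias is not None:
            merged.bias.copy_(lora.base.bias)
    return merged
```

A quick equivalence check: for a random batch `x`, `torch.allclose(merge(lora)(x), lora(x), atol=1e-5)` should hold.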
5.5 Target modules
Main idea. Attention and MLP projections can receive separate adapters.
Core relation: each targeted matrix $W^{(i)}$ gets its own factor pair $(B^{(i)},A^{(i)})$.
Common practice adapts the attention projections $W_q$, $W_k$, $W_v$, $W_o$, and often the MLP projections as well; which modules to target is an empirical choice worth ablating.
6. Quantized and Memory-Aware Tuning
This part treats quantized and memory-aware tuning as a measurable adaptation decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.
| Subtopic | Main idea | Formula |
|---|---|---|
| Frozen quantized base | store base weights in low precision and train small adapters | $y=\mathrm{dequant}(W_q)\,x+\frac{\alpha}{r}BAx$ |
| Adapter optimizer states | optimizer states are needed only for trainable adapter weights | $M_{\text{opt}}\propto N_{\text{trainable}}$ |
| Activation memory remains | PEFT reduces parameter-state memory but still backpropagates through the model | $M_{\text{act}}$ can dominate |
| Gradient checkpointing | recompute activations to fit longer sequences or larger batches | trade compute for $M_{\text{act}}$ |
| Rank-memory tradeoff | higher rank improves capacity but increases trainable parameters | $N_{\text{LoRA}}=r(d_{\text{in}}+d_{\text{out}})$ |
6.1 Frozen quantized base
Main idea. Store base weights in low precision and train small adapters.
Core relation: $y=\mathrm{dequant}(W_q)\,x+\frac{\alpha}{r}\,BAx$, where $W_q$ is a low-precision (for example 4-bit) frozen copy of the base matrix and the adapter stays in higher precision, as in QLoRA.
Gradients flow only to $B$ and $A$, so the quantized path never needs gradients or optimizer states.
6.2 Adapter optimizer states
Main idea. Optimizer states are needed only for trainable adapter weights.
Core relation: $M_{\text{opt}}\propto N_{\text{trainable}}$, not $\lvert\theta\rvert$.
With Adam, the two moment tensors exist only for adapter weights, which is where most of the memory saving over full fine-tuning comes from.
6.3 Activation memory remains
Main idea. PEFT reduces parameter-state memory but still backpropagates through the model.
Core relation: $M_{\text{act}}$ can dominate, and it scales with batch size, sequence length, depth, and width regardless of how few parameters are trainable.
Frozen layers still participate in the backward pass, so their activations (or a recomputation plan) are still needed to deliver gradients to adapters deeper in the network.
6.4 Gradient checkpointing
Main idea. Recompute activations to fit longer sequences or larger batches.
Core relation: store activations only at segment boundaries and recompute the rest during backward, trading roughly one extra forward pass for a large cut in $M_{\text{act}}$.
This is often what makes long-sequence fine-tuning fit on a single accelerator at all.
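A minimal sketch using PyTorch's built-in checkpoint utility; the block and shapes are placeholders:

```python
# The block's activations are recomputed during backward instead of stored.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024),
)
x = torch.randn(8, 1024, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # recompute in backward
y.sum().backward()
```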
6.5 Rank-memory tradeoff
Main idea. Higher rank improves capacity but increases trainable parameters.
Core relation: $N_{\text{LoRA}}=r\,(d_{\text{in}}+d_{\text{out}})$ grows linearly in $r$, and so do its gradients and optimizer states.
Start small; $r$ of 8 or 16 is a common default, and increasing it is only worthwhile if task quality is demonstrably rank-limited.
7. Preference Fine-Tuning
This part treats preference fine-tuning as a measurable adaptation decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.
| Subtopic | Main idea | Formula |
|---|---|---|
| Preference pairs | learn from chosen and rejected responses | $(x,y_w,y_l)$ |
| Reward-model view | RLHF trains or uses a reward signal for completions | $p(y_w\succ y_l)=\sigma\big(r(x,y_w)-r(x,y_l)\big)$ |
| KL-regularized policy update | keep the tuned model near a reference model | $\mathbb{E}[r]-\beta\,D_{\mathrm{KL}}(\pi\,\Vert\,\pi_{\text{ref}})$ |
| DPO loss | optimize preference likelihood directly with a reference model | $-\log\sigma\big(\beta(\Delta_w-\Delta_l)\big)$, $\Delta=\log\frac{\pi_\theta}{\pi_{\text{ref}}}$ |
| Preference over-optimization | too much preference pressure can reduce diversity or factuality | $r\uparrow$ does not guarantee all qualities improve |
7.1 Preference pairs
Main idea. Learn from chosen and rejected responses.
Core relation: the data are triples $(x,y_w,y_l)$, a prompt with a chosen response $y_w$ and a rejected response $y_l$.
Preference pairs carry comparative information, which of two behaviors is better, that plain imitation data cannot express.
7.2 Reward-model view
Main idea. RLHF trains or uses a reward signal for completions.
Core relation (Bradley-Terry): $p(y_w\succ y_l\mid x)=\sigma\big(r_\phi(x,y_w)-r_\phi(x,y_l)\big)$
A reward model $r_\phi$ is fit to the preference pairs and then supplies the training signal for policy optimization.
7.3 KL-regularized policy update
Main idea. Keep the tuned model near a reference model.
Core relation: $\max_{\pi}\ \mathbb{E}_{y\sim\pi(\cdot\mid x)}\big[r(x,y)\big]-\beta\,D_{\mathrm{KL}}\big(\pi(\cdot\mid x)\,\Vert\,\pi_{\text{ref}}(\cdot\mid x)\big)$
The KL term anchors the tuned policy to the reference model; $\beta$ sets how much reward improvement is worth per unit of divergence.
7.4 DPO loss
Main idea. Optimize preference likelihood directly with a reference model.
Core relation: $\mathcal{L}_{\text{DPO}}=-\log\sigma\!\Big(\beta\Big[\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)}-\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\Big]\Big)$
AI connection. DPO turns a preference pair into a logistic loss over relative log probabilities.
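A minimal DPO-loss sketch from four per-example summed response log-probabilities, each of shape `(batch,)`; obtaining those log probabilities from the policy and reference models is assumed done elsewhere:

```python
# Logistic loss over the relative log-probability margin.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```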
7.5 Preference over-optimization
Main idea. Too much preference pressure can reduce diversity or factuality.
Core relation: $r\uparrow$ does not guarantee all qualities improve; pushing measured reward past a point degrades properties the reward does not capture, such as diversity, factuality, and calibration.
Monitor the KL to the reference model and held-out quality metrics alongside the preference objective, and stop before the reward model is the only thing still improving.
8. Evaluation and Diagnostics
This part treats evaluation and diagnostics as a measurable adaptation decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.
| Subtopic | Main idea | Formula |
|---|---|---|
| Train and validation loss | memorization shows up as train loss improving while validation stalls | gap $\mathcal{L}_{\text{val}}-\mathcal{L}_{\text{train}}$ grows |
| Base-task retention | evaluate old skills after adaptation | $\text{metric}(\theta)-\text{metric}(\theta_0)$ on base evals |
| Task quality | use task-specific automatic and human checks | task metrics plus human review |
| Distribution shift | fine-tune data should match deployment use | training distribution vs deployment distribution |
| Adapter sanity checks | disable the adapter to confirm measured change comes from the adapter | adapter on vs off comparison |
8.1 Train and validation loss
Main idea. Memorization shows up as train loss improving while validation stalls.
Core relation: a growing gap $\mathcal{L}_{\text{val}}-\mathcal{L}_{\text{train}}$ while validation loss stalls signals memorization.
Small SFT datasets overfit within a few epochs; use early stopping on validation loss or on a task metric rather than training to a fixed step count.
8.2 Base-task retention
Main idea. Evaluate old skills after adaptation.
Core relation: retention $=\text{metric}_{\text{base}}(\theta)-\text{metric}_{\text{base}}(\theta_0)$ over a fixed suite of pre-adaptation evaluations.
Run the same base-skill benchmarks before and after tuning; a task win that silently costs general ability is usually a bad trade.
8.3 Task quality
Main idea. Use task-specific automatic and human checks.
Core relation: no universal formula; pair automatic task metrics, such as exact match or pass rates, with targeted human review.
Automatic metrics catch regressions cheaply and continuously; human checks catch the failure modes the metrics were never designed to see.
8.4 Distribution shift
Main idea. Fine-tune data should match deployment use.
Core relation: $p_{\text{train}}(x) \approx p_{\text{deploy}}(x)$. The further the fine-tuning prompts sit from deployment prompts, the less the measured gains transfer.
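One cheap shift probe is to compare token histograms of the fine-tuning prompts against deployment prompts. A toy sketch with made-up strings; the unigram statistic and Laplace smoothing are illustrative choices, not a standard diagnostic:

```python
import math
from collections import Counter

train_tokens = "translate this sentence into french".split()
deploy_tokens = "summarize this legal contract into bullet points".split()

def dist(tokens, vocab):
    c, n = Counter(tokens), len(tokens)
    # Laplace smoothing keeps every probability positive, so KL is finite.
    return {w: (c[w] + 1) / (n + len(vocab)) for w in vocab}

vocab = sorted(set(train_tokens) | set(deploy_tokens))
p, q = dist(train_tokens, vocab), dist(deploy_tokens, vocab)
kl = sum(p[w] * math.log(p[w] / q[w]) for w in vocab)
print(f"KL(train || deploy) over unigrams: {kl:.4f}")
```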
8.5 Adapter sanity checks
Main idea. Disable the adapter to confirm measured change comes from the adapter.
Core relation: with the adapter disabled, the model must reduce exactly to the base model, $f_{\theta_0 + \Delta\theta}\big|_{\Delta\theta = 0} = f_{\theta_0}$. If disabling the adapter still changes outputs, something else moved.
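A toy version of the on/off check, using a hand-rolled LoRA-style layer rather than any particular PEFT library; the names `W`, `A`, `B`, and `scale` are local to the sketch:

```python
import torch

torch.manual_seed(0)
d, r = 16, 4
W = torch.randn(d, d)            # frozen base weight
A = torch.randn(r, d) * 0.01
B = torch.zeros(d, r)            # standard LoRA init: B = 0
scale = 1.0
x = torch.randn(2, d)

def forward(x, adapter_on: bool):
    h = x @ W.T
    if adapter_on:
        h = h + scale * (x @ A.T) @ B.T
    return h

# With B = 0 the adapter path is exactly zero, so on and off must match.
assert torch.allclose(forward(x, True), forward(x, False))

B = torch.randn(d, r)  # after "training", the adapter should change outputs
assert not torch.allclose(forward(x, True), forward(x, False))
print("adapter on/off sanity check passed")
```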
9. Choosing a Method
This part treats choosing a method as a measurable adaptation decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.
| Subtopic | Guideline | Formula |
|---|---|---|
| No-update methods | Prompting and retrieval are cheapest when they work | $\Delta\theta = 0$ |
| PEFT methods | Adapters and LoRA are the default when cost and deployment flexibility matter | $\Delta W = BA$, $r \ll \min(d_{\text{in}}, d_{\text{out}})$ |
| Full tuning | Use when the task requires broad representation movement | $\|\Delta\theta\|$ can be large |
| Preference tuning | Use when the target is comparative behavior rather than exact demonstrations | DPO loss (Section 9.4) |
| Deployment constraints | Latency, adapter routing, merging, and safety evaluation are part of the method choice | $\mathrm{quality}/\mathrm{cost}$ is the real metric |
9.1 No-update methods
Main idea. Prompting and retrieval are cheapest when they work.
Core relation: $\Delta\theta = 0$. All behavior change comes through the context, so there is nothing to train, nothing to forget, and nothing to merge.
9.2 PEFT methods
Main idea. Adapters and LoRA are the default when cost and deployment flexibility matter.
Core relation: $\Delta W = BA$ with $B \in \mathbb{R}^{d_{\text{out}} \times r}$, $A \in \mathbb{R}^{r \times d_{\text{in}}}$, and $r \ll \min(d_{\text{in}}, d_{\text{out}})$, so the trainable count is $r(d_{\text{in}} + d_{\text{out}})$ per targeted matrix.
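Because the LoRA update is an ordinary matrix, it can be folded into the base weight for serving. A merge check confirms the folded weight reproduces the unmerged forward pass; a sketch on toy shapes, with the usual $\alpha / r$ scaling convention:

```python
import torch

torch.manual_seed(0)
d, r, alpha = 16, 4, 8
W = torch.randn(d, d)
A = torch.randn(r, d) * 0.01
B = torch.randn(d, r) * 0.01
scale = alpha / r
x = torch.randn(2, d)

runtime = x @ W.T + scale * (x @ A.T) @ B.T   # adapter kept separate
merged_W = W + scale * (B @ A)                # fold the update into the base
merged = x @ merged_W.T

# The two paths are algebraically identical; only float error remains.
assert torch.allclose(runtime, merged, atol=1e-6)
print("merge check passed")
```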
9.3 Full tuning
Main idea. Use when the task requires broad representation movement.
Core relation:
|\Delta\theta|$ can be largeFine-tuning is local learning around a pretrained solution. The base model already has useful representations, so the adaptation method decides how much freedom the update receives. Full fine-tuning gives maximum freedom. PEFT constrains the update to a small module, prompt vector, prefix, or low-rank subspace. Preference tuning changes the objective from imitation to comparative behavior.
Worked micro-example. A projection matrix with and has about 16.8 million weights. A rank-8 LoRA update for the same matrix has trainable weights, before optimizer states. The base matrix can stay frozen while the low-rank update learns the task movement.
Implementation check. Print the number of trainable parameters, inspect masks, run one batch, and verify that loss changes. Then disable the adapter or reload the base model and confirm the difference is caused by the fine-tuning method.
AI connection. This is a concrete control knob in fine-tuning.
Common mistake. Do not treat fine-tuning quality as one number. Track task quality, base-skill retention, calibration, refusal or safety behavior if relevant, and deployment cost.
9.4 Preference tuning
Main idea. Use when the target is comparative behavior rather than exact demonstrations.
Core relation: the DPO loss for a prompt $x$ with preferred response $y_w$ and rejected response $y_l$ is $-\log \sigma\!\left(\beta \left[\log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right]\right)$.
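The loss reduces to four summed log probabilities and one sigmoid. A worked sketch with illustrative numbers, not outputs from any real model:

```python
import math

# DPO loss for one preference pair, from summed answer log probabilities
# under the policy and the frozen reference model.
beta = 0.1
logp_chosen, logp_rejected = -12.0, -15.0   # policy
ref_chosen,  ref_rejected  = -13.0, -14.0   # reference

margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
print(f"DPO loss: {loss:.4f}")  # below log 2 exactly when the margin is positive
```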
9.5 Deployment constraints
Main idea. Latency, adapter routing, merging, and safety evaluation are part of the method choice.
Core relation:
\mathrm{quality}/\mathrm{cost}$ is the real metricFine-tuning is local learning around a pretrained solution. The base model already has useful representations, so the adaptation method decides how much freedom the update receives. Full fine-tuning gives maximum freedom. PEFT constrains the update to a small module, prompt vector, prefix, or low-rank subspace. Preference tuning changes the objective from imitation to comparative behavior.
Worked micro-example. A projection matrix with and has about 16.8 million weights. A rank-8 LoRA update for the same matrix has trainable weights, before optimizer states. The base matrix can stay frozen while the low-rank update learns the task movement.
Implementation check. Print the number of trainable parameters, inspect masks, run one batch, and verify that loss changes. Then disable the adapter or reload the base model and confirm the difference is caused by the fine-tuning method.
AI connection. This is a concrete control knob in fine-tuning.
Common mistake. Do not treat fine-tuning quality as one number. Track task quality, base-skill retention, calibration, refusal or safety behavior if relevant, and deployment cost.
10. Implementation Checklist
This part turns implementation into a checklist of measurable checks. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.
| Subtopic | Guideline | Formula |
|---|---|---|
| Masking | Prompt, padding, and answer masks must match the intended objective | $\mathcal{L} = -\sum_t m_t \log p_\theta(y_t \mid y_{<t}, x)$ |
| Parameter freeze audit | Verify only intended tensors require gradients | trainable set $=$ intended set |
| Learning-rate groups | Base weights and adapters need different optimizer groups if both train | $\eta_{\text{base}} \neq \eta_{\text{adapter}}$ |
| Reference model | Preference losses require a stable reference distribution | $\pi_{\mathrm{ref}}$ frozen |
| Ablations | Compare base, prompt-only, PEFT, and full-tune when feasible | one change per comparison |
10.1 Masking
Main idea. Prompt, padding, and answer masks must match the intended objective.
Core relation: $\mathcal{L} = -\sum_t m_t \log p_\theta(y_t \mid y_{<t}, x)$ with $m_t = 1$ on answer tokens and $m_t = 0$ on prompt and padding tokens.
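A minimal masked-loss sketch in PyTorch; the logits and labels are random stand-ins, and the mask marks two answer tokens between prompt and padding positions:

```python
import torch
import torch.nn.functional as F

# Answer-only SFT loss: cross-entropy over answer tokens only.
# Positions with mask 0 (prompt, padding) contribute nothing.
torch.manual_seed(0)
vocab, T = 32, 6
logits = torch.randn(1, T, vocab)                  # (batch, time, vocab)
targets = torch.randint(vocab, (1, T))             # next-token labels
mask = torch.tensor([[0., 0., 0., 1., 1., 0.]])    # prompt=0, answer=1, pad=0

per_tok = F.cross_entropy(
    logits.view(-1, vocab), targets.view(-1), reduction="none"
).view(1, T)
loss = (per_tok * mask).sum() / mask.sum()         # mean over answer tokens only
print(f"answer-only loss: {loss.item():.4f}")
```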
10.2 Parameter freeze audit
Main idea. Verify only intended tensors require gradients.
Core relation: the set of tensors with requires_grad set must match the intended trainable set exactly, in both directions.
AI connection. A PEFT run with the wrong tensors trainable is just an expensive surprise.
Common mistake. Do not treat fine-tuning quality as one number. Track task quality, base-skill retention, calibration, refusal or safety behavior if relevant, and deployment cost.
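A freeze-audit sketch on a toy module; `model` is a stand-in for the real network, and the one-batch loss check from the implementation notes above slots in after it:

```python
import torch.nn as nn

# List trainable tensors and count parameters after freezing.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
for p in model[0].parameters():
    p.requires_grad = False          # simulate a frozen base layer

trainable = [(n, p.numel()) for n, p in model.named_parameters() if p.requires_grad]
for name, count in trainable:
    print(f"trainable: {name} ({count} params)")
print(f"total trainable: {sum(c for _, c in trainable):,}")
```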
10.3 Learning-rate groups
Main idea. Base weights and adapters need different optimizer groups if both train.
Core relation: when base weights and adapters train together, give them separate learning rates; freshly initialized adapters typically tolerate much larger steps than pretrained weights.
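Optimizer parameter groups express the split directly. The module names and learning rates below are illustrative, not recommendations:

```python
import torch
import torch.nn as nn

# Two optimizer groups: a small LR for base weights, a larger one for adapters.
model = nn.ModuleDict({
    "base": nn.Linear(8, 8),
    "adapter": nn.Linear(8, 8),
})
optimizer = torch.optim.AdamW([
    {"params": model["base"].parameters(), "lr": 1e-5},
    {"params": model["adapter"].parameters(), "lr": 1e-4},
])
for group in optimizer.param_groups:
    print(f"group lr = {group['lr']}")
```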
10.4 Reference model
Main idea. Preference losses require a stable reference distribution.
Core relation: $\pi_{\mathrm{ref}}$ stays frozen for the entire preference run. If the reference drifts, the implicit reward in the DPO loss drifts with it, and the objective stops being comparable across steps.
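A minimal sketch of the reference snapshot, assuming the reference is a frozen copy of the policy at initialization (the common DPO setup); the toy linear layer stands in for the language model:

```python
import copy
import torch
import torch.nn as nn

policy = nn.Linear(8, 8)                 # stand-in for the tuned model
ref = copy.deepcopy(policy).eval()       # snapshot before training starts
for p in ref.parameters():
    p.requires_grad = False              # the reference never updates

x = torch.randn(2, 8)
with torch.no_grad():
    ref_out = ref(x)                     # reference scores: no gradients
policy_out = policy(x)                   # policy scores: gradients flow
print(ref_out.requires_grad, policy_out.requires_grad)  # False True
```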
10.5 Ablations
Main idea. Compare base, prompt-only, PEFT, and full-tune variants when feasible.
Core relation: hold the data and the evaluation fixed and vary only the adaptation method, so any quality difference can be attributed to the update itself.
Practice Exercises
- Compute answer-only SFT loss using a binary mask.
- Compute KL shift between base and tuned next-token distributions.
- Count LoRA trainable parameters for a projection matrix.
- Verify the shape of a LoRA update.
- Approximate a matrix update with truncated SVD.
- Count adapter bottleneck parameters.
- Count prefix-tuning parameters for a given number of layers and heads.
- Compute DPO loss for one preference pair.
- Build a forgetting versus task-quality scorecard.
- Write a parameter-freeze and masking checklist.
Why This Matters for AI
Most applied LLM work is not pretraining from scratch. It is adaptation: making a capable base model behave correctly in a domain, workflow, policy regime, or product setting. Fine-tuning math keeps that adaptation honest. It tells you whether you are training the intended tokens, moving the intended parameters, using enough rank, preserving base capability, and measuring the right behavior.
Bridge to Scaling Laws
The next section studies how loss changes with model size, data size, and compute. Fine-tuning adds another axis: adaptation capacity. A small low-rank update may be enough for format and style, while deeper domain shifts may require more data, rank, layers, or full-model movement.
References
- Jeremy Howard and Sebastian Ruder, "Universal Language Model Fine-tuning for Text Classification", 2018: https://arxiv.org/abs/1801.06146
- Neil Houlsby et al., "Parameter-Efficient Transfer Learning for NLP", 2019: https://proceedings.mlr.press/v97/houlsby19a.html
- Xiang Lisa Li and Percy Liang, "Prefix-Tuning: Optimizing Continuous Prompts for Generation", 2021: https://arxiv.org/abs/2101.00190
- Edward J. Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models", 2021: https://arxiv.org/abs/2106.09685
- Long Ouyang et al., "Training language models to follow instructions with human feedback", 2022: https://arxiv.org/abs/2203.02155
- Tim Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs", 2023: https://arxiv.org/abs/2305.14314
- Rafael Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", 2023: https://arxiv.org/abs/2305.18290