Fine-Tuning Math

Fine-tuning is controlled movement away from a pretrained model. The mathematics asks four questions: what loss is optimized, which parameters can move, how large is the update space, and how do we measure useful adaptation without damaging the base model.

Overview

Pretraining learns a broad conditional distribution. Fine-tuning shifts that distribution toward a task, domain, instruction style, preference pattern, or deployment constraint. The shift can be full-rank and full-model, or it can be restricted to a small trainable subspace such as adapters, soft prompts, prefixes, or LoRA matrices.

The central decomposition is:

$\theta = \theta_0 + \Delta\theta$.

The base weights $\theta_0$ carry pretrained capability. The update $\Delta\theta$ carries adaptation. Fine-tuning math is the study of how to choose, constrain, optimize, and evaluate that update.

Prerequisites

  • Cross-entropy and answer-only token loss
  • Training memory and optimizer-state accounting
  • Matrix ranks, low-rank factorization, and SVD intuition
  • KL divergence and conditional language-model scoring

Companion Notebooks

| Notebook | Purpose |
| --- | --- |
| theory.ipynb | Executable demos for SFT masking, LoRA parameter counts, low-rank updates, adapter counts, prefix counts, DPO loss, forgetting tradeoffs, and merge checks. |
| exercises.ipynb | Ten practice problems for the update geometry and objective bookkeeping used in real fine-tuning runs. |

Learning Objectives

After this section, you should be able to:

  • Write the fine-tuning objective as local adaptation from pretrained weights.
  • Compute answer-only SFT loss with prompt and padding masks.
  • Explain full fine-tuning, linear probing, adapters, soft prompts, prefix tuning, and LoRA.
  • Count trainable parameters for LoRA, adapters, and prefix methods.
  • Derive the LoRA update $W'=W+(\alpha/r)BA$ and merge it for inference.
  • Explain why PEFT reduces optimizer-state memory but not all activation memory.
  • Compute a DPO preference loss from model and reference log probabilities.
  • Diagnose forgetting, overfitting, mask bugs, and incorrect parameter freezing.

Table of Contents

  1. Fine-Tuning as Local Adaptation
  2. Supervised Fine-Tuning Objective
  3. Full Fine-Tuning
  4. Parameter-Efficient Fine-Tuning
  5. LoRA Algebra
  6. Quantized and Memory-Aware Tuning
  7. Preference Fine-Tuning
  8. Evaluation and Diagnostics
  9. Choosing a Method
  10. Implementation Checklist

Method Map

| Method | Base weights | Trainable object | Main advantage | Main risk |
| --- | --- | --- | --- | --- |
| Prompting/RAG | Frozen | None | Zero training | May not enforce consistent behavior |
| Linear probing | Frozen | Output head | Cheap diagnostic | Limited generation adaptation |
| Adapters | Mostly frozen | Bottleneck modules | Multi-task modularity | Extra inference modules |
| Prompt/prefix tuning | Frozen | Continuous prompt or KV prefix | Very small parameter count | Capacity can be limited |
| LoRA | Mostly frozen | Low-rank matrix updates | Strong PEFT default | Rank/target choice matters |
| Full fine-tuning | Trainable | All weights | Maximum capacity | Expensive and can forget |
| Preference tuning | Usually partial or full | Policy update | Aligns comparative behavior | Over-optimization and reward hacking |

1. Fine-Tuning as Local Adaptation

This part treats fine-tuning as a measurable local-adaptation decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Starting from pretrained weights | Fine-tuning begins near a useful parameter point | $\theta=\theta_0+\Delta\theta$ |
| Task loss | The target distribution changes from broad pretraining to task behavior | $\min_\theta L_\mathrm{task}(\theta)$ |
| Regularized adaptation | Penalize moving too far from the base model | $L(\theta)=L_\mathrm{task}(\theta)+\lambda R(\theta,\theta_0)$ |
| Function-space shift | Behavioral movement is better measured on outputs than raw parameters | $D_\mathrm{KL}(p_\theta(\cdot\mid x)\,\Vert\,p_{\theta_0}(\cdot\mid x))$ |
| Catastrophic forgetting | Task learning can reduce general ability | $\Delta_\mathrm{forget}=S_\mathrm{base}(\theta_0)-S_\mathrm{base}(\theta)$ |

1.1 Starting from pretrained weights

Main idea. Fine-tuning begins near a useful parameter point.

Core relation:

$\theta=\theta_0+\Delta\theta$

Fine-tuning is local learning around a pretrained solution. The base model already has useful representations, so the adaptation method decides how much freedom the update receives. Full fine-tuning gives maximum freedom. PEFT constrains the update to a small module, prompt vector, prefix, or low-rank subspace. Preference tuning changes the objective from imitation to comparative behavior.

Worked micro-example. A projection matrix with $d_\mathrm{in}=4096$ and $d_\mathrm{out}=4096$ has about 16.8 million weights. A rank-8 LoRA update for the same matrix has $8(4096+4096)=65{,}536$ trainable weights, before optimizer states. The base matrix can stay frozen while the low-rank update learns the task movement.
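
The count is easy to verify by hand. A minimal sketch in plain Python; the numbers match the micro-example above:

```python
# Parameter count for one 4096x4096 projection: full update vs. rank-8 LoRA.
d_in, d_out, r = 4096, 4096, 8

full = d_in * d_out        # every entry of W is trainable: 16,777,216
lora = r * (d_in + d_out)  # A is (r, d_in), B is (d_out, r): 65,536

print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")  # 256x
```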

Implementation check. Print the number of trainable parameters, inspect masks, run one batch, and verify that loss changes. Then disable the adapter or reload the base model and confirm the difference is caused by the fine-tuning method.

AI connection. This is a concrete control knob in fine-tuning.

Common mistake. Do not treat fine-tuning quality as one number. Track task quality, base-skill retention, calibration, refusal or safety behavior if relevant, and deployment cost.

1.2 Task loss

Main idea. The target distribution changes from broad pretraining to task behavior.

Core relation:

$\min_\theta L_\mathrm{task}(\theta)$

1.3 Regularized adaptation

Main idea. Penalize moving too far from the base model.

Core relation:

$L(\theta)=L_\mathrm{task}(\theta)+\lambda R(\theta,\theta_0)$

1.4 Function-space shift

Main idea. Behavioral movement is better measured on outputs than raw parameters.

Core relation:

$D_\mathrm{KL}\left(p_\theta(\cdot\mid x)\,\Vert\,p_{\theta_0}(\cdot\mid x)\right)$
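
A minimal sketch of measuring this shift, using synthetic logits as stand-ins for the base and tuned models' next-token outputs on a probe prompt; in a real check both logit tensors would come from model forward passes:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for base and fine-tuned next-token logits on one probe prompt.
# In practice these would come from model(x).logits[:, -1, :] for each model.
logits_base = torch.randn(1, 50_000)
logits_tuned = logits_base + 0.1 * torch.randn(1, 50_000)  # small behavioral shift

# KL(p_tuned || p_base), summed over the vocabulary.
log_p_tuned = F.log_softmax(logits_tuned, dim=-1)
log_p_base = F.log_softmax(logits_base, dim=-1)
kl = F.kl_div(log_p_base, log_p_tuned, log_target=True, reduction="sum")
print(float(kl))  # near 0 for small shifts, grows as behavior moves
```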

1.5 Catastrophic forgetting

Main idea. Task learning can reduce general ability.

Core relation:

$\Delta_\mathrm{forget}=S_\mathrm{base}(\theta_0)-S_\mathrm{base}(\theta)$

2. Supervised Fine-Tuning Objective

This part treats the supervised fine-tuning objective as a measurable adaptation decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Instruction-response pairs | Train on prompt and desired completion pairs | $(x,y)\sim D_\mathrm{sft}$ |
| Answer-only loss | Mask prompt tokens when the objective is response imitation | $L_\mathrm{sft}=-\sum_j m_j\log p_\theta(y_j\mid x,y_{<j})\,/\,\sum_j m_j$ |
| Teacher forcing | Condition on the gold prefix during training | $p_\theta(y_j\mid x,y_{<j}^\star)$ |
| Label smoothing | Soften one-hot targets when useful | $q=(1-\epsilon)\,y+\epsilon/\lvert V\rvert$ |
| Dataset mixture weights | Combine multiple tasks by weighted expectation | $L=\sum_k \alpha_k\,E_{D_k}[\ell]$ |

2.1 Instruction-response pairs

Main idea. Train on prompt and desired completion pairs.

Core relation:

$(x,y)\sim D_\mathrm{sft}$

2.2 Answer-only loss

Main idea. Mask prompt tokens when the objective is response imitation.

Core relation:

$L_\mathrm{sft}=-\sum_j m_j\log p_\theta(y_j\mid x,y_{<j})\,\big/\,\sum_j m_j$

AI connection. This is the difference between teaching the model to imitate the response and wasting loss on tokens it was handed in the prompt.
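
A minimal sketch of the answer-only loss, assuming toy logits and a hand-built mask; the tensor names (`logits`, `labels`, `mask`) are illustrative, not tied to any particular training library:

```python
import torch
import torch.nn.functional as F

# Toy batch: 1 sequence of 6 positions over a 10-token vocabulary.
# Positions 0-2 are prompt tokens, 3-4 are answer tokens, 5 is padding.
logits = torch.randn(1, 6, 10)                     # model outputs (B, T, V)
labels = torch.tensor([[4, 2, 7, 1, 3, 0]])        # next-token targets (B, T)
mask = torch.tensor([[0., 0., 0., 1., 1., 0.]])    # 1 only on answer tokens

# Per-token cross-entropy, then mask and normalize by answer length:
# L_sft = -sum_j m_j log p(y_j | x, y_<j) / sum_j m_j
per_tok = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
loss = (per_tok * mask).sum() / mask.sum()
print(float(loss))
```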

2.3 Teacher forcing

Main idea. Condition on the gold prefix during training.

Core relation:

$p_\theta(y_j\mid x,y_{<j}^\star)$

2.4 Label smoothing

Main idea. Soften one-hot targets when useful.

Core relation:

$q=(1-\epsilon)\,y+\epsilon/|V|$

2.5 Dataset mixture weights

Main idea. Combine multiple tasks by weighted expectation.

Core relation:

$L=\sum_k \alpha_k\,E_{D_k}[\ell]$
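
A minimal sketch of weighted mixture sampling; the dataset names, examples, and weights are hypothetical:

```python
import random

# Hypothetical mixture: dataset name -> (examples, weight alpha_k).
mixture = {
    "instructions": (["ex_a", "ex_b"], 0.6),
    "code":         (["ex_c"],         0.3),
    "safety":       (["ex_d"],         0.1),
}

names = list(mixture)
weights = [mixture[n][1] for n in names]

def sample_batch(batch_size=8, seed=0):
    """Draw examples so each dataset contributes in proportion to alpha_k."""
    rng = random.Random(seed)
    picks = rng.choices(names, weights=weights, k=batch_size)
    return [(n, rng.choice(mixture[n][0])) for n in picks]

print(sample_batch())
```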

3. Full Fine-Tuning

This part treats full fine-tuning as a measurable adaptation decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.

| Subtopic | Question | Formula |
| --- | --- | --- |
| All parameters trainable | Full fine-tuning updates every weight tensor | $\Delta\theta\in\mathbb{R}^{\lvert\theta\rvert}$ |
| Memory cost | Trainable parameters require gradients and optimizer states | $M\approx M_\mathrm{weights}+M_\mathrm{grads}+M_\mathrm{opt}+M_\mathrm{act}$ |
| Layer-wise learning rates | Lower layers can move more slowly than higher layers | $\eta_\ell=\eta_0\,\gamma^{L-\ell}$ |
| Weight decay | Regularize parameter norm during adaptation | $\theta\leftarrow\theta-\eta(\nabla L+\lambda\theta)$ |
| When full tuning helps | Large domain shift or maximum quality can justify the cost | $\mathrm{benefit}>\mathrm{compute}+\mathrm{forgetting\ risk}$ |

3.1 All parameters trainable

Main idea. Full fine-tuning updates every weight tensor.

Core relation:

$\Delta\theta\in\mathbb{R}^{|\theta|}$

3.2 Memory cost

Main idea. Trainable parameters require gradients and optimizer states.

Core relation:

$M\approx M_\mathrm{weights}+M_\mathrm{grads}+M_\mathrm{opt}+M_\mathrm{act}$
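
A rough budget sketch for a 7B-parameter model, assuming bf16 weights and gradients with fp32 Adam states; this is one common mixed-precision recipe, and exact numbers vary with the optimizer and precision choices:

```python
# Rough memory budget for full fine-tuning a 7B-parameter model with Adam,
# ignoring activations (which depend on batch size and sequence length).
P = 7e9

weights = 2 * P            # bf16 parameters
grads = 2 * P              # bf16 gradients
opt = (4 + 4 + 4) * P      # fp32 master copy + Adam first/second moments

total_gb = (weights + grads + opt) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~112 GB
```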

3.3 Layer-wise learning rates

Main idea. Lower layers can move more slowly than higher layers.

Core relation:

$\eta_\ell=\eta_0\,\gamma^{L-\ell}$
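
A minimal sketch using per-layer parameter groups; the toy linear layers stand in for transformer blocks:

```python
import torch

# Toy stack of linear "layers"; real models would group transformer blocks.
layers = [torch.nn.Linear(8, 8) for _ in range(4)]

eta0, gamma, L = 1e-4, 0.8, len(layers)
param_groups = [
    # eta_l = eta0 * gamma**(L - l): earlier layers move more slowly.
    {"params": layer.parameters(), "lr": eta0 * gamma ** (L - l)}
    for l, layer in enumerate(layers, start=1)
]
opt = torch.optim.AdamW(param_groups)
for g in opt.param_groups:
    print(g["lr"])
```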

3.4 Weight decay

Main idea. Regularize parameter norm during adaptation.

Core relation:

$\theta\leftarrow\theta-\eta(\nabla L+\lambda\theta)$

3.5 When full tuning helps

Main idea. Large domain shift or maximum quality can justify the cost.

Core relation:

$\mathrm{benefit}>\mathrm{compute}+\mathrm{forgetting\ risk}$

4. Parameter-Efficient Fine-Tuning

This part treats parameter-efficient fine-tuning as a measurable adaptation decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Linear probing | Freeze the backbone and train only a head | $h=f_{\theta_0}(x),\quad \hat y=Wh+b$ |
| Adapters | Insert small bottleneck modules inside layers | $h'=h+W_\mathrm{up}\,\sigma(W_\mathrm{down}h)$ |
| Soft prompt tuning | Learn continuous input embeddings | $[p_1,\ldots,p_m,x_1,\ldots,x_T]$ |
| Prefix tuning | Learn virtual key-value prefixes for attention | $K'=[K_\mathrm{prefix};K],\quad V'=[V_\mathrm{prefix};V]$ |
| Low-rank adaptation | Learn a low-rank update while freezing the base matrix | $W'=W+\frac{\alpha}{r}BA$ |

4.1 Linear probing

Main idea. Freeze the backbone and train only a head.

Core relation:

$h=f_{\theta_0}(x),\quad \hat y=Wh+b$
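
A minimal sketch with a stand-in backbone; only the head receives gradients:

```python
import torch

# Minimal sketch: freeze a stand-in backbone and train only a linear head.
backbone = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU())
head = torch.nn.Linear(32, 3)  # the only trainable module

for p in backbone.parameters():
    p.requires_grad_(False)    # h = f_{theta_0}(x) stays fixed

opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
x, y = torch.randn(4, 16), torch.tensor([0, 2, 1, 0])

h = backbone(x)                # frozen features
loss = torch.nn.functional.cross_entropy(head(h), y)
loss.backward()
opt.step()
print(sum(p.numel() for p in head.parameters()), "trainable parameters")
```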

4.2 Adapters

Main idea. Insert small bottleneck modules inside layers.

Core relation:

$h'=h+W_\mathrm{up}\,\sigma(W_\mathrm{down}h)$
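
A minimal adapter sketch; the bottleneck width and the zero-initialized up-projection (so the module starts as an identity residual) are common choices, not requirements:

```python
import torch

class Adapter(torch.nn.Module):
    """Bottleneck adapter: h' = h + W_up sigma(W_down h)."""
    def __init__(self, d_model: int, d_bottleneck: int):
        super().__init__()
        self.down = torch.nn.Linear(d_model, d_bottleneck)
        self.up = torch.nn.Linear(d_bottleneck, d_model)
        torch.nn.init.zeros_(self.up.weight)   # start as identity residual
        torch.nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(torch.nn.functional.gelu(self.down(h)))

h = torch.randn(2, 5, 768)
adapter = Adapter(d_model=768, d_bottleneck=64)
print(adapter(h).shape)               # torch.Size([2, 5, 768])
print(torch.allclose(adapter(h), h))  # True at initialization
```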

4.3 Soft prompt tuning

Main idea. Learn continuous input embeddings.

Core relation:

$[p_1,\ldots,p_m,x_1,\ldots,x_T]$

4.4 Prefix tuning

Main idea. Learn virtual key-value prefixes for attention.

Core relation:

$K'=[K_\mathrm{prefix};K],\quad V'=[V_\mathrm{prefix};V]$
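
A minimal sketch of the key-value concatenation inside one attention call; the shapes and prefix length are illustrative:

```python
import torch
import torch.nn.functional as F

# Sketch of one attention call with m learned prefix key/value pairs.
B, H, T, m, d = 2, 4, 10, 5, 64   # batch, heads, tokens, prefix len, head dim
q = torch.randn(B, H, T, d)
k = torch.randn(B, H, T, d)
v = torch.randn(B, H, T, d)

# Trainable prefixes, broadcast across the batch.
k_prefix = torch.nn.Parameter(torch.randn(H, m, d))
v_prefix = torch.nn.Parameter(torch.randn(H, m, d))

k2 = torch.cat([k_prefix.expand(B, -1, -1, -1), k], dim=2)  # K' = [K_prefix; K]
v2 = torch.cat([v_prefix.expand(B, -1, -1, -1), v], dim=2)  # V' = [V_prefix; V]

out = F.scaled_dot_product_attention(q, k2, v2)
print(out.shape)  # torch.Size([2, 4, 10, 64]); every query can attend to the prefix
```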

4.5 Low-rank adaptation

Main idea. Learn a low-rank update while freezing the base matrix.

Core relation:

$W'=W+\frac{\alpha}{r}BA$

AI connection. LoRA works because many useful task updates can be represented well inside a small trainable subspace.
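
A minimal LoRA layer sketch; the initialization (small random $A$, zero $B$ so the update starts at zero) follows common practice:

```python
import torch

class LoRALinear(torch.nn.Module):
    """Frozen W plus trainable low-rank update: y = Wx + (alpha/r) B A x."""
    def __init__(self, base: torch.nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.A = torch.nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(d_out, r))  # BA = 0 at init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(torch.nn.Linear(4096, 4096), r=8)
n = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(n)  # 65536 = 8 * (4096 + 4096)
```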

5. LoRA Algebra

This part treats LoRA algebra as a measurable adaptation decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Rank constraint | The update matrix has rank at most $r$ | $\mathrm{rank}(BA)\le r$ |
| Parameter count | A $d_\mathrm{out}\times d_\mathrm{in}$ matrix gets $r(d_\mathrm{in}+d_\mathrm{out})$ trainable parameters | $P_\mathrm{LoRA}=r(d_\mathrm{in}+d_\mathrm{out})$ |
| Scaling | The $\alpha/r$ factor controls update magnitude | $\Delta W=(\alpha/r)BA$ |
| Merge for inference | After training, add the low-rank update into the base matrix | $W_\mathrm{merged}=W+\Delta W$ |
| Target modules | Attention and MLP projections can receive separate adapters | $W_q,W_k,W_v,W_o,W_\mathrm{up},W_\mathrm{down}$ |

5.1 Rank constraint

Main idea. The update matrix has rank at most r.

Core relation:

$\mathrm{rank}(BA)\le r$

5.2 Parameter count

Main idea. A $d_\mathrm{out}\times d_\mathrm{in}$ matrix gets $r(d_\mathrm{in}+d_\mathrm{out})$ trainable parameters.

Core relation:

$P_\mathrm{LoRA}=r(d_\mathrm{in}+d_\mathrm{out})$

5.3 Scaling

Main idea. The $\alpha/r$ factor controls update magnitude.

Core relation:

$\Delta W=(\alpha/r)BA$

5.4 Merge for inference

Main idea. After training, add the low-rank update into the base matrix.

Core relation:

$W_\mathrm{merged}=W+\Delta W$

AI connection. This is why LoRA can add no extra matrix multiplication at serving time after merging.
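
A quick numerical check that the merged matrix reproduces the adapter-style forward pass; matrix sizes are arbitrary:

```python
import torch

d, r, alpha = 64, 4, 8.0
W = torch.randn(d, d)
A = torch.randn(r, d) * 0.01
B = torch.randn(d, r) * 0.01

W_merged = W + (alpha / r) * (B @ A)  # fold the update into the base matrix

x = torch.randn(3, d)
y_adapter = x @ W.T + (alpha / r) * (x @ A.T @ B.T)  # base + side branch
y_merged = x @ W_merged.T                            # single matmul at serving time
print(torch.allclose(y_adapter, y_merged, atol=1e-5))  # True
```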

5.5 Target modules

Main idea. Attention and MLP projections can receive separate adapters.

Core relation:

$W_q,\;W_k,\;W_v,\;W_o,\;W_\mathrm{up},\;W_\mathrm{down}$

6. Quantized and Memory-Aware Tuning

This part treats quantized and memory-aware tuning as a measurable adaptation decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Frozen quantized base | Store base weights in low precision and train small adapters | $W_0\approx Q(W_0)$ |
| Adapter optimizer states | Optimizer states are needed only for trainable adapter weights | $M_\mathrm{opt}\propto P_\mathrm{trainable}$ |
| Activation memory remains | PEFT reduces parameter-state memory but still backpropagates through the model | $M_\mathrm{act}$ can dominate |
| Gradient checkpointing | Recompute activations to fit longer sequences or larger batches | $\mathrm{memory}\downarrow,\quad\mathrm{compute}\uparrow$ |
| Rank-memory tradeoff | Higher rank improves capacity but increases trainable parameters | $P_\mathrm{LoRA}\propto r$ |

6.1 Frozen quantized base

Main idea. Store base weights in low precision and train small adapters.

Core relation:

$W_0\approx Q(W_0)$

6.2 Adapter optimizer states

Main idea. Optimizer states are needed only for trainable adapter weights.

Core relation:

$M_\mathrm{opt}\propto P_\mathrm{trainable}$

6.3 Activation memory remains

Main idea. PEFT reduces parameter-state memory but still backpropagates through the model.

Core relation:

$M_\mathrm{act}$ can dominate

6.4 Gradient checkpointing

Main idea. Recompute activations to fit longer sequences or larger batches.

Core relation:

$\mathrm{memory}\downarrow,\quad\mathrm{compute}\uparrow$
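
A minimal sketch with torch.utils.checkpoint; the small MLP block stands in for a transformer layer:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Recompute a block's activations in the backward pass instead of storing
# them. Memory drops; the forward for that block runs twice.
block = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 512)
)

x = torch.randn(8, 512, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # activations not kept
loss = y.square().mean()
loss.backward()                                # block is re-run here
print(x.grad.shape)
```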

6.5 Rank-memory tradeoff

Main idea. Higher rank improves capacity but increases trainable parameters.

Core relation:

$P_\mathrm{LoRA}\propto r$

7. Preference Fine-Tuning

This part treats preference fine-tuning as a measurable adaptation decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Preference pairs | Learn from chosen and rejected responses | $(x,y^+,y^-)$ |
| Reward-model view | RLHF trains or uses a reward signal for completions | $r_\phi(x,y)$ |
| KL-regularized policy update | Keep the tuned model near a reference model | $E[r]-\beta\,D_\mathrm{KL}(\pi_\theta\,\Vert\,\pi_\mathrm{ref})$ |
| DPO loss | Optimize preference likelihood directly with a reference model | $-\log\sigma(\beta[(\log\pi_\theta^+-\log\pi_\theta^-)-(\log\pi_\mathrm{ref}^+-\log\pi_\mathrm{ref}^-)])$ |
| Preference over-optimization | Too much preference pressure can reduce diversity or factuality | $r\uparrow$ does not guarantee all qualities improve |

7.1 Preference pairs

Main idea. Learn from chosen and rejected responses.

Core relation:

$(x,y^+,y^-)$

7.2 Reward-model view

Main idea. RLHF trains or uses a reward signal for completions.

Core relation:

$r_\phi(x,y)$

7.3 KL-regularized policy update

Main idea. Keep the tuned model near a reference model.

Core relation:

$E[r]-\beta\,D_\mathrm{KL}(\pi_\theta\,\Vert\,\pi_\mathrm{ref})$

7.4 DPO loss

Main idea. Optimize preference likelihood directly with a reference model.

Core relation:

$-\log\sigma\big(\beta[(\log\pi_\theta^+-\log\pi_\theta^-)-(\log\pi_\mathrm{ref}^+-\log\pi_\mathrm{ref}^-)]\big)$

AI connection. DPO turns a preference pair into a logistic loss over relative log probabilities.

Common mistake. Do not treat fine-tuning quality as one number. Track task quality, base-skill retention, calibration, refusal or safety behavior if relevant, and deployment cost.
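
A minimal sketch of the loss for one preference pair, using sequence-level log probabilities (sums over answer tokens); the values are hypothetical:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta: float = 0.1):
    # Margin: how much more the policy prefers y+ over y- than the reference does.
    margin = (logp_pos - logp_neg) - (ref_logp_pos - ref_logp_neg)
    return -F.logsigmoid(beta * margin)

# The tuned model prefers y+ by 3 nats; the reference prefers it by 1.5 nats,
# so the margin is positive and the loss falls below log 2 (about 0.693).
loss = dpo_loss(torch.tensor(-42.0), torch.tensor(-45.0),
                torch.tensor(-43.0), torch.tensor(-44.5))
print(loss.item())  # about 0.62
```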

7.5 Preference over-optimization

Main idea. Too much preference pressure can reduce diversity or factuality.

Core relation:

$r\uparrow$ does not guarantee all qualities improve

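One practical guard is to track reward alongside the KL shift from the reference and flag runs that buy reward with large distribution movement. A minimal sketch; the budget and logged values are illustrative assumptions:

```python
def over_optimization_alerts(history, kl_budget: float = 10.0):
    # history: list of (reward, kl_from_reference) pairs, one per eval step.
    return [(step, reward, kl) for step, (reward, kl) in enumerate(history)
            if kl > kl_budget]

history = [(0.2, 1.0), (0.5, 4.0), (0.7, 12.0)]  # hypothetical logged values
print(over_optimization_alerts(history))  # [(2, 0.7, 12.0)]
```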

8. Evaluation and Diagnostics

This part treats evaluation and diagnostics as measurable adaptation decisions. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Train and validation loss | memorization shows up as train loss improving while validation stalls | $L_\mathrm{train}\downarrow,\ L_\mathrm{val}\not\downarrow$ |
| Base-task retention | evaluate old skills after adaptation | $S_\mathrm{retain}(\theta)$ |
| Task quality | use task-specific automatic and human checks | $S_\mathrm{task}(\theta)$ |
| Distribution shift | fine-tune data should match deployment use | $p_\mathrm{train}(x,y)\approx p_\mathrm{deploy}(x,y)$ |
| Adapter sanity checks | disable the adapter to confirm measured change comes from the adapter | $f_{W+\Delta W}(x)-f_W(x)$ |

8.1 Train and validation loss

Main idea. Memorization shows up as train loss improving while validation stalls.

Core relation:

L_\mathrm{train}\downarrow,\quad L_\mathrm{val}\not\downarrow

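A minimal sketch of the memorization check: stop when validation loss stalls while train loss keeps falling. The patience and improvement threshold are illustrative:

```python
def should_stop(train_losses, val_losses, patience: int = 3, min_improve: float = 1e-3):
    if len(val_losses) <= patience:
        return False
    recent_best = min(val_losses[-patience:])
    earlier_best = min(val_losses[:-patience])
    val_stalled = recent_best > earlier_best - min_improve
    train_improving = train_losses[-1] < train_losses[-patience - 1]
    return val_stalled and train_improving

train = [2.0, 1.6, 1.3, 1.1, 0.9, 0.8]
val = [2.1, 1.8, 1.7, 1.7, 1.7, 1.71]
print(should_stop(train, val))  # True: the memorization signature
```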

8.2 Base-task retention

Main idea. Evaluate old skills after adaptation.

Core relation:

S_\mathrm{retain}(\theta)

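A minimal sketch of a retention check: score the same held-out base-skill suite before and after adaptation and flag regressions. Task names and scores are hypothetical:

```python
def retention_report(before: dict, after: dict, tol: float = 0.02) -> dict:
    # Flag any base task whose score dropped by more than the tolerance.
    return {task: {"before": b, "after": after[task],
                   "regressed": after[task] < b - tol}
            for task, b in before.items()}

before = {"arithmetic": 0.81, "summarization": 0.74}
after = {"arithmetic": 0.80, "summarization": 0.65}
print(retention_report(before, after))  # summarization regressed
```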

8.3 Task quality

Main idea. Use task-specific automatic and human checks.

Core relation:

S_\mathrm{task}(\theta)

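A minimal sketch of one automatic check, normalized exact match; real evaluations usually pair metrics like this with rubric-based human review:

```python
def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so formatting noise does not count.
    return " ".join(text.lower().split())

def exact_match(predictions, references) -> float:
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match(["Paris ", "42"], ["paris", "41"]))  # 0.5
```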

8.4 Distribution shift

Main idea. Fine-tune data should match deployment use.

Core relation:

p_\mathrm{train}(x,y)\approx p_\mathrm{deploy}(x,y)

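A coarse sketch of a shift check: compare smoothed unigram distributions of the fine-tuning data and deployment traffic. Whitespace tokenization and the epsilon smoothing are simplifying assumptions:

```python
import math
from collections import Counter

def unigram_kl(train_texts, deploy_texts, eps: float = 1e-9) -> float:
    p, q = Counter(), Counter()
    for t in train_texts:
        p.update(t.split())
    for t in deploy_texts:
        q.update(t.split())
    vocab = set(p) | set(q)
    n_p, n_q = sum(p.values()), sum(q.values())
    # Smoothed KL(train || deploy); large values signal vocabulary mismatch.
    return sum((p[w] / n_p + eps) * math.log((p[w] / n_p + eps) / (q[w] / n_q + eps))
               for w in vocab)

print(unigram_kl(["translate this sentence"], ["translate this phrase"]))
```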

8.5 Adapter sanity checks

Main idea. Disable the adapter to confirm measured change comes from the adapter.

Core relation:

f_{W+\Delta W}(x)-f_W(x)

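A minimal sketch of the on/off check for a peft-style model; `disable_adapter` is the context manager peft exposes on `PeftModel`, but treat the exact API and the `.logits` attribute as assumptions to verify against your versions:

```python
import torch

def adapter_delta(model, input_ids) -> float:
    # Max absolute logit difference with the adapter enabled vs disabled.
    with torch.no_grad():
        logits_on = model(input_ids).logits
        with model.disable_adapter():
            logits_off = model(input_ids).logits
    return (logits_on - logits_off).abs().max().item()

# A delta of exactly 0.0 usually means the adapter never attached or never trained.
```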

9. Choosing a Method

This part treats choosing a method as a measurable adaptation decision. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.

| Subtopic | Question | Formula |
| --- | --- | --- |
| No-update methods | prompting and retrieval are cheapest when they work | $\Delta\theta=0$ |
| PEFT methods | adapters and LoRA are the default when cost and deployment flexibility matter | $\lvert\Delta\theta\rvert\ll\lvert\theta\rvert$ |
| Full tuning | use when the task requires broad representation movement | $\lvert\Delta\theta\rvert$ can be large |
| Preference tuning | use when the target is comparative behavior rather than exact demonstrations | $y^+\succ y^-$ |
| Deployment constraints | latency, adapter routing, merging, and safety evaluation are part of the method choice | $\mathrm{quality}/\mathrm{cost}$ is the real metric |

9.1 No-update methods

Main idea. Prompting and retrieval are cheapest when they work.

Core relation:

\Delta\theta=0


9.2 PEFT methods

Main idea. Adapters and LoRA are the default when cost and deployment flexibility matter.

Core relation:

|\Delta\theta|\ll|\theta|


9.3 Full tuning

Main idea. Use when the task requires broad representation movement.

Core relation:

$|\Delta\theta|$ can be large


9.4 Preference tuning

Main idea. Use when the target is comparative behavior rather than exact demonstrations.

Core relation:

y^+\succ y^-


9.5 Deployment constraints

Main idea. Latency, adapter routing, merging, and safety evaluation are part of the method choice.

Core relation:

$\mathrm{quality}/\mathrm{cost}$ is the real metric


10. Implementation Checklist

This part treats the implementation checklist as a set of measurable adaptation decisions. The goal is to know what is updated, what objective is optimized, how much capacity the update has, and how to detect damage to the base model.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Masking | prompt, padding, and answer masks must match the intended objective | $m_j\in\{0,1\}$ |
| Parameter freeze audit | verify only intended tensors require gradients | $\mathrm{requires\_grad}$ |
| Learning-rate groups | base weights and adapters need different optimizer groups if both train | $\eta_\mathrm{base}\ne\eta_\mathrm{adapter}$ |
| Reference model | preference losses require a stable reference distribution | $\pi_\mathrm{ref}$ |
| Ablations | compare base, prompt-only, PEFT, and full-tune when feasible | $\Delta S=S_\mathrm{method}-S_\mathrm{base}$ |

10.1 Masking

Main idea. Prompt, padding, and answer masks must match the intended objective.

Core relation:

m_j\in\{0,1\}

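A minimal sketch of answer-only labels, using -100 as the ignore index that PyTorch cross-entropy skips; the token ids are hypothetical:

```python
import torch

IGNORE = -100  # positions with this label contribute no loss

def answer_only_labels(input_ids, prompt_len: int, pad_id: int):
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE          # mask the prompt tokens
    labels[input_ids == pad_id] = IGNORE  # mask the padding tokens
    return labels

ids = torch.tensor([11, 12, 13, 21, 22, 0, 0])  # hypothetical prompt + answer + pad
print(answer_only_labels(ids, prompt_len=3, pad_id=0))
# tensor([-100, -100, -100,   21,   22, -100, -100])
```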

10.2 Parameter freeze audit

Main idea. Verify only intended tensors require gradients.

Core relation:

\mathrm{requires\_grad}

AI connection. A PEFT run with the wrong tensors trainable is just an expensive surprise.
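
A minimal sketch of the audit; run it after attaching adapters and before the first optimizer step:

```python
import torch.nn as nn

def freeze_audit(model: nn.Module) -> None:
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")

model = nn.Linear(4096, 4096)
model.weight.requires_grad_(False)  # freeze the base matrix, leave the bias trainable
freeze_audit(model)                 # trainable: 4,096 / 16,781,312 (0.02%)
```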

10.3 Learning-rate groups

Main idea. Base weights and adapters need different optimizer groups if both train.

Core relation:

\eta_\mathrm{base}\ne\eta_\mathrm{adapter}

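A minimal sketch of two optimizer groups; the split-by-name rule ("lora" in the parameter name) is an assumption about how adapter tensors are named in your model:

```python
import torch
import torch.nn as nn

model = nn.ModuleDict({
    "base": nn.Linear(64, 64),
    "lora_A": nn.Linear(64, 8, bias=False),
    "lora_B": nn.Linear(8, 64, bias=False),
})
adapter = [p for n, p in model.named_parameters() if "lora" in n]
base = [p for n, p in model.named_parameters() if "lora" not in n]
optimizer = torch.optim.AdamW([
    {"params": base, "lr": 1e-5},      # small steps for pretrained weights
    {"params": adapter, "lr": 1e-4},   # larger steps for the fresh adapter
])
print([g["lr"] for g in optimizer.param_groups])  # [1e-05, 0.0001]
```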

10.4 Reference model

Main idea. Preference losses require a stable reference distribution.

Core relation:

\pi_\mathrm{ref}

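A minimal sketch: snapshot the policy before preference updates, put the copy in eval mode, and freeze every tensor so the reference distribution never moves:

```python
import copy
import torch.nn as nn

def make_reference(policy_model: nn.Module) -> nn.Module:
    ref = copy.deepcopy(policy_model)  # snapshot before any preference updates
    ref.eval()                         # fix dropout and normalization behavior
    for p in ref.parameters():
        p.requires_grad_(False)        # never optimized
    return ref

policy = nn.Linear(8, 8)
ref = make_reference(policy)
print(any(p.requires_grad for p in ref.parameters()))  # False
```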

10.5 Ablations

Main idea. Compare base, prompt-only, PEFT, and full-tune when feasible.

Core relation:

\Delta S=S_\mathrm{method}-S_\mathrm{base}

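A minimal sketch of the scorecard; the method scores are hypothetical placeholders, all measured on the same eval set:

```python
scores = {"base": 0.61, "prompt_only": 0.66, "peft": 0.74, "full_tune": 0.76}

# Delta over the frozen base for every adaptation method.
deltas = {m: round(s - scores["base"], 3) for m, s in scores.items() if m != "base"}
print(deltas)  # {'prompt_only': 0.05, 'peft': 0.13, 'full_tune': 0.15}
```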


Practice Exercises

  1. Compute answer-only SFT loss using a binary mask.
  2. Compute KL shift between base and tuned next-token distributions.
  3. Count LoRA trainable parameters for a projection matrix.
  4. Verify the shape of a LoRA update.
  5. Approximate a matrix update with truncated SVD.
  6. Count adapter bottleneck parameters.
  7. Count prefix-tuning parameters for layers and heads.
  8. Compute DPO loss for one preference pair.
  9. Build a forgetting versus task-quality scorecard.
  10. Write a parameter-freeze and masking checklist.

Why This Matters for AI

Most applied LLM work is not pretraining from scratch. It is adaptation: making a capable base model behave correctly in a domain, workflow, policy regime, or product setting. Fine-tuning math keeps that adaptation honest. It tells you whether you are training the intended tokens, moving the intended parameters, using enough rank, preserving base capability, and measuring the right behavior.

Bridge to Scaling Laws

The next section studies how loss changes with model size, data size, and compute. Fine-tuning adds another axis: adaptation capacity. A small low-rank update may be enough for format and style, while deeper domain shifts may require more data, rank, layers, or full-model movement.
