RAG Math and Retrieval

Retrieval-augmented generation adds an external memory system to an LLM. The math is a pipeline: embed, search, rank, pack context, generate, and verify attribution.

Overview

The central RAG conditional is:

p_\theta(y\mid q,R_k(q)),

where $q$ is the query and $R_k(q)$ is the set of retrieved chunks. RAG succeeds only when the right information is retrieved, ranked highly, packed into context, and used by the generator.

Prerequisites

  • Embedding-space geometry and cosine similarity
  • Conditional language-model probability
  • Efficient inference and context-window constraints
  • Evaluation metrics and error analysis

Companion Notebooks

Notebook | Purpose
theory.ipynb | Demonstrates cosine search, BM25 intuition, contrastive retrieval loss, recall@k, MMR, reranking, context packing, and RAG failure decomposition.
exercises.ipynb | Ten practice problems for retrieval scores, recall, MRR, chunk packing, MMR, RRF, and RAG diagnostics.

Learning Objectives

After this section, you should be able to:

  • Define RAG as conditional generation with retrieved non-parametric memory.
  • Compute dot-product and cosine retrieval scores.
  • Explain sparse, dense, hybrid, late-interaction, and cross-encoder retrieval.
  • Write the contrastive loss for dense retriever training.
  • Compute recall@k, MRR, and simple nDCG.
  • Explain chunk length, overlap, MMR, and context packing.
  • Explain ANN recall-latency tradeoffs.
  • Diagnose RAG failures with traces and ablations.

Table of Contents

  1. RAG as Conditional Generation
  2. Similarity Spaces
  3. Sparse and Dense Retrieval
  4. Retriever Training
  5. Approximate Nearest Neighbor Search
  6. Chunking and Context Packing
  7. Reranking and Fusion
  8. RAG Evaluation
  9. Failure Modes
  10. Implementation Checklist

Pipeline Diagram

query -> query encoder -> vector search -> top-k chunks -> reranker -> context packer -> LLM -> answer + citations

Each arrow can fail. The math gives you probes for each arrow.

1. RAG as Conditional Generation

This part studies RAG as conditional generation, treated as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

Subtopic | Question | Formula
Parametric memory | knowledge stored in model weights | $p_\theta(y\mid q)$
Non-parametric memory | knowledge stored in an external corpus | $D=\{d_i\}_{i=1}^n$
Retriever | select relevant documents for a query | $R_k(q)=\mathrm{TopK}_{d_i\in D}\ s(q,d_i)$
Generator | answer conditioned on query and retrieved context | $p_\theta(y\mid q,R_k(q))$
Failure decomposition | RAG can fail at retrieval, ranking, context packing, or generation | $P(\mathrm{correct})=P(\mathrm{retrieve})P(\mathrm{use})P(\mathrm{generate})$

1.1 Parametric memory

Main idea. Knowledge stored in model weights.

Core relation:

p_\theta(y\mid q)

RAG changes the conditional distribution by adding retrieved evidence to the prompt. The retrieval system and generator should be evaluated separately and together. A high-quality generator cannot compensate for missing evidence, and a high-recall retriever can still fail if it returns noisy chunks or if the prompt buries the useful span.

Worked micro-example. If a query vector and document vectors are unit-normalized, cosine similarity is just a dot product. Retrieval by top-k dot product then selects the documents whose embeddings are most aligned with the query. If vectors are not normalized, high-norm documents can win even when their direction is less relevant.

Implementation check. Log query text, query embedding norm, top-k document ids, scores, chunk text, reranker scores, final prompt, answer, and citations. RAG without traces is guesswork.

Common mistake. Do not evaluate only final answers. Measure retrieval recall, reranker precision, context-packing quality, and answer faithfulness separately.

1.2 Non-parametric memory

Main idea. Knowledge stored in an external corpus.

Core relation:

D=\{d_i\}_{i=1}^n


1.3 Retriever

Main idea. Select relevant documents for a query.

Core relation:

R_k(q)=\mathrm{TopK}_{d_i\in D}\ s(q,d_i)

AI connection. The generator cannot use evidence that retrieval never returns.

1.4 Generator

Main idea. Answer conditioned on query and retrieved context.

Core relation:

p_\theta(y\mid q,R_k(q))


1.5 Failure decomposition

Main idea. RAG can fail at retrieval, ranking, context packing, or generation.

Core relation:

P(\mathrm{correct})=P(\mathrm{retrieve})P(\mathrm{use})P(\mathrm{generate})

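To make the decomposition operational, estimate each factor from logged traces. Below is a minimal sketch, assuming you have labeled each trace with whether the gold evidence was retrieved, whether the generator used it, and whether the answer was correct; the field names are hypothetical labels from error analysis, not a standard schema.

```python
def failure_decomposition(traces):
    """Estimate P(retrieve), P(use | retrieve), P(generate | use) from labeled traces."""
    retrieved = [t for t in traces if t["retrieved"]]
    used = [t for t in retrieved if t["used"]]
    correct = [t for t in used if t["correct"]]
    p_retrieve = len(retrieved) / len(traces)
    p_use = len(used) / max(len(retrieved), 1)
    p_generate = len(correct) / max(len(used), 1)
    return p_retrieve, p_use, p_generate

# Hypothetical labeled traces from an error-analysis pass.
traces = [
    {"retrieved": True, "used": True, "correct": True},
    {"retrieved": True, "used": False, "correct": False},
    {"retrieved": False, "used": False, "correct": False},
    {"retrieved": True, "used": True, "correct": False},
]
p_r, p_u, p_g = failure_decomposition(traces)
print(f"P(retrieve)={p_r:.2f}  P(use)={p_u:.2f}  P(generate)={p_g:.2f}")
print(f"P(correct) ~= {p_r * p_u * p_g:.2f}")
```

Whichever factor is smallest tells you which stage of the pipeline to fix first.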

2. Similarity Spaces

This part studies similarity spaces as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

Subtopic | Question | Formula
Embedding functions | map queries and documents into a shared vector space | $u=f_q(q),\ v_i=f_d(d_i)$
Dot product | inner product rewards alignment and vector norm | $s(u,v)=u^\top v$
Cosine similarity | normalize away vector length | $s(u,v)=u^\top v/(\Vert u\Vert\Vert v\Vert)$
Maximum inner product search | retrieve highest-scoring vectors | $\arg\max_i u^\top v_i$
Normalization | for unit vectors, dot product and cosine are the same | $\Vert u\Vert=\Vert v\Vert=1$

2.1 Embedding functions

Main idea. Map queries and documents into a shared vector space.

Core relation:

u=f_q(q),\ v_i=f_d(d_i)


2.2 Dot product

Main idea. Inner product rewards alignment and vector norm.

Core relation:

s(u,v)=u^\top v


2.3 Cosine similarity

Main idea. Normalize away vector length.

Core relation:

s(u,v)=u^\top v/(\Vert u\Vert\Vert v\Vert)

AI connection. Most RAG bugs start with misunderstanding what the vector store is scoring.

2.4 Maximum inner product search

Main idea. Retrieve highest-scoring vectors.

Core relation:

\arg\max_i u^\top v_i


2.5 Normalization

Main idea. For unit vectors, dot product and cosine are the same.

Core relation:

\Vert u\Vert=\Vert v\Vert=1

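A minimal numpy sketch of the identities above, with illustrative toy vectors: raw dot product lets a high-norm document win, cosine decides by direction, and after unit normalization the two scores coincide.

```python
import numpy as np

# Toy embeddings: d0 points the same way as q; d1 is less aligned but much longer.
q  = np.array([1.0, 0.0])
d0 = np.array([0.9, 0.1])
d1 = np.array([5.0, 5.0])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print("dot:", q @ d0, q @ d1)                 # 0.9 vs 5.0 -- norm dominates
print("cos:", cosine(q, d0), cosine(q, d1))   # ~0.994 vs ~0.707 -- direction decides

unit = lambda v: v / np.linalg.norm(v)
print("unit dot:", unit(q) @ unit(d0), unit(q) @ unit(d1))  # equals cosine
```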

3. Sparse and Dense Retrieval

This part studies sparse and dense retrieval as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

Subtopic | Question | Formula
Sparse lexical retrieval | match query terms to document terms | $\mathrm{BM25}(q,d)$
Dense bi-encoder retrieval | encode query and document separately | $s(q,d)=f_q(q)^\top f_d(d)$
Hybrid retrieval | combine sparse and dense scores | $s=\lambda s_\mathrm{dense}+(1-\lambda)s_\mathrm{sparse}$
Late interaction | score token embeddings after independent encoding | $\sum_{i\in q}\max_{j\in d} e_i^\top e_j$
Cross-encoder reranking | jointly encode query and document for a slower, stronger score | $s=g(q,d)$

3.1 Sparse lexical retrieval

Main idea. Match query terms to document terms.

Core relation:

\mathrm{BM25}(q,d)

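The notes leave BM25 as a black box, so here is a standard Okapi BM25 scorer as a reference sketch; k1 and b are the usual free parameters, and the toy corpus and whitespace tokenization are assumptions for illustration only.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Standard Okapi BM25 for one query-document pair."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)            # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        denom = tf[t] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom          # length-normalized TF
    return score

corpus = [["vector", "search"], ["sparse", "lexical", "retrieval"], ["dense", "retrieval"]]
print(bm25_score(["retrieval"], corpus[1], corpus))
```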

3.2 Dense bi-encoder retrieval

Main idea. Encode query and document separately.

Core relation:

s(q,d)=f_q(q)^\top f_d(d)


3.3 Hybrid retrieval

Main idea. Combine sparse and dense scores.

Core relation:

s=\lambda s_\mathrm{dense}+(1-\lambda)s_\mathrm{sparse}

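Dense and sparse scores live on different scales, so a common move is to standardize each candidate list before mixing. A sketch, assuming both lists score the same candidates; the choice of lambda is a tuning decision, not a prescribed value.

```python
import numpy as np

def zscore(s):
    s = np.asarray(s, dtype=float)
    return (s - s.mean()) / (s.std() + 1e-8)  # epsilon guards constant lists

def hybrid_scores(dense, sparse, lam=0.5):
    """Blend z-normalized dense and sparse scores for the same candidates."""
    return lam * zscore(dense) + (1 - lam) * zscore(sparse)

dense  = [0.82, 0.79, 0.40]   # cosine similarities
sparse = [12.1, 3.3, 9.8]     # BM25 scores on a different scale
print(hybrid_scores(dense, sparse))
```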

3.4 Late interaction

Main idea. Score token embeddings after independent encoding.

Core relation:

\sum_{i\in q}\max_{j\in d} e_i^\top e_j

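A minimal MaxSim sketch in numpy, assuming token embeddings are already unit-normalized (as ColBERT-style systems do); the random embeddings stand in for real encoder outputs.

```python
import numpy as np

def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: for each query token, take its best-matching
    document token, then sum (rows are unit-normalized embeddings)."""
    sims = query_tokens @ doc_tokens.T   # (n_q, n_d) token-level similarities
    return sims.max(axis=1).sum()        # best doc token per query token

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(20, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))
```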

3.5 Cross-encoder reranking

Main idea. Jointly encode query and document for a slower stronger score.

Core relation:

s=g(q,d)


4. Retriever Training

This part studies retriever training as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

Subtopic | Question | Formula
Positive pairs | train query and relevant document to be close | $(q,d^+)$
Negative pairs | train irrelevant or hard-negative documents to score lower | $(q,d^-)$
Contrastive loss | softmax over one positive and many negatives | $L=-\log\frac{e^{s(q,d^+)}}{e^{s(q,d^+)}+\sum_j e^{s(q,d_j^-)}}$
In-batch negatives | other examples in a batch become negatives | $B-1$ negatives per query
Hard negatives | near misses improve ranking training | $s(q,d^-)$ high but label negative

4.1 Positive pairs

Main idea. Train query and relevant document to be close.

Core relation:

(q,d^+)


4.2 Negative pairs

Main idea. Train irrelevant or hard-negative documents to score lower.

Core relation:

(q,d^-)


4.3 Contrastive loss

Main idea. Softmax over one positive and negatives.

Core relation:

L=-\log\frac{e^{s(q,d^+)}}{e^{s(q,d^+)}+\sum_j e^{s(q,d_j^-)}}

AI connection. This is the training objective behind many dense retrievers.
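
A numpy sketch of this loss with in-batch negatives: each row's diagonal entry is the positive pair, and the other $B-1$ documents in the batch serve as negatives. The temperature value is an illustrative assumption.

```python
import numpy as np

def in_batch_contrastive_loss(q_emb, d_emb, temperature=0.05):
    """InfoNCE over a batch: positives on the diagonal, B-1 in-batch negatives."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature             # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # mean of -log p(d_i^+ | q_i)

rng = np.random.default_rng(0)
B, dim = 8, 32
print(in_batch_contrastive_loss(rng.normal(size=(B, dim)), rng.normal(size=(B, dim))))
```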

4.4 In-batch negatives

Main idea. Other examples in a batch become negatives.

Core relation:

$B-1$ negatives per query


4.5 Hard negatives

Main idea. Near misses improve ranking training.

Core relation:

$s(q,d^-)$ high but label negative


5. Approximate Nearest Neighbor Search

This part studies approximate nearest neighbor search as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

Subtopic | Question | Formula
Exact search | compare a query to every vector | $O(nd)$
ANN search | trade exactness for latency and memory | $\hat R_k(q)\approx R_k(q)$
Recall at k | measure whether relevant docs appear in the top k | $\mathrm{Recall@}k=|\mathrm{Rel}\cap R_k|/|\mathrm{Rel}|$
Index compression | quantize or cluster vectors to reduce memory | $V\rightarrow \hat V$
Latency-recall tradeoff | faster search can miss relevant documents | $T\downarrow,\ \mathrm{recall}\downarrow$

5.1 Exact search

Main idea. Compare a query to every vector.

Core relation:

O(nd)


5.2 ANN search

Main idea. Trade exactness for latency and memory.

Core relation:

\hat R_k(q)\approx R_k(q)


5.3 Recall at k

Main idea. Measure whether relevant docs appear in top k.

Core relation:

\mathrm{Recall@}k=|\mathrm{Rel}\cap R_k|/|\mathrm{Rel}|

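Recall@k as code, directly mirroring the formula; the document ids are illustrative.

```python
def recall_at_k(relevant_ids, retrieved_ids, k):
    """|Rel ∩ R_k| / |Rel| for a single query."""
    top_k = set(retrieved_ids[:k])
    return len(set(relevant_ids) & top_k) / len(relevant_ids)

print(recall_at_k({"d2", "d7"}, ["d7", "d1", "d4", "d2"], k=3))  # 0.5
```

Averaging this over a query set gives the retrieval-stage number to track when tuning an ANN index.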

5.4 Index compression

Main idea. Quantize or cluster vectors to reduce memory.

Core relation:

V\rightarrow \hat V


5.5 Latency-recall tradeoff

Main idea. Faster search can miss relevant documents.

Core relation:

T\downarrow,\ \mathrm{recall}\downarrow


6. Chunking and Context Packing

This part studies chunking and context packing as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

Subtopic | Question | Formula
Chunk length | split documents into retrievable units | $d\rightarrow c_1,\ldots,c_m$
Overlap | repeat boundary tokens to avoid cutting facts | $c_i=[t_a,\ldots,t_b],\ c_{i+1}=[t_{b-o},\ldots]$
Packing budget | retrieved chunks must fit the context window | $\sum_i |c_i|\le T_\mathrm{context}$
Diversity | avoid filling context with near-duplicates | $\mathrm{MMR}=\lambda s(q,d)-(1-\lambda)\max_{d'\in S}s(d,d')$
Lost-in-context risk | retrieved text must be ordered and summarized so the generator can use it | $p(y\mid q,c_{1:k})$ depends on packing

6.1 Chunk length

Main idea. Split documents into retrievable units.

Core relation:

d\rightarrow c_1,\ldots,c_m


6.2 Overlap

Main idea. Repeat boundary tokens to avoid cutting facts.

Core relation:

c_i=[t_a,\ldots,t_b],\ c_{i+1}=[t_{b-o},\ldots]

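A sketch of fixed-length chunking with overlap, operating on a plain token list for simplicity; real systems would use the model's own tokenizer.

```python
def chunk_with_overlap(tokens, size=200, overlap=40):
    """Split a token list into chunks of `size` tokens, repeating `overlap`
    boundary tokens so facts at chunk edges are not cut in half."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

tokens = [f"t{i}" for i in range(10)]
print(chunk_with_overlap(tokens, size=4, overlap=1))
# [['t0','t1','t2','t3'], ['t3','t4','t5','t6'], ['t6','t7','t8','t9'], ['t9']]
```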

6.3 Packing budget

Main idea. Retrieved chunks must fit context.

Core relation:

\sum_i |c_i|\le T_\mathrm{context}

AI connection. Retrieval success still fails if the evidence is packed badly into context.
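
A greedy packing sketch: take chunks in rank order while they fit the budget. Token counting here is whitespace-based for illustration; swap in a real tokenizer in practice.

```python
def pack_context(ranked_chunks, budget_tokens):
    """Greedy packing: keep chunks in rank order while they fit the budget."""
    packed, used = [], 0
    for chunk in ranked_chunks:
        n = len(chunk.split())        # stand-in for a real tokenizer
        if used + n > budget_tokens:
            continue                  # skip chunks that would overflow
        packed.append(chunk)
        used += n
    return packed, used

chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
print(pack_context(chunks, budget_tokens=5))
# (['alpha beta gamma', 'delta epsilon'], 5)
```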

6.4 Diversity

Main idea. Avoid filling context with near-duplicates.

Core relation:

\mathrm{MMR}=\lambda s(q,d)-(1-\lambda)\max_{d'\in S}s(d,d')

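Greedy MMR selection as code, assuming unit-normalized embeddings so dot products are cosine similarities; the random vectors stand in for real chunk embeddings.

```python
import numpy as np

def mmr_select(query, docs, k, lam=0.7):
    """Greedy MMR: trade relevance to the query against redundancy with
    already-selected documents (docs are unit-normalized row vectors)."""
    rel = docs @ query                  # relevance of each doc to the query
    selected, remaining = [], list(range(len(docs)))
    while remaining and len(selected) < k:
        def mmr(i):
            redundancy = max((docs[i] @ docs[j] for j in selected), default=0.0)
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)  # highest MMR among unselected docs
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
docs = rng.normal(size=(6, 8)); docs /= np.linalg.norm(docs, axis=1, keepdims=True)
q = docs[0] + 0.1 * rng.normal(size=8); q /= np.linalg.norm(q)
print(mmr_select(q, docs, k=3))
```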

6.5 Lost-in-context risk

Main idea. Retrieved text must be ordered and summarized so the generator can use it.

Core relation:

$p(y\mid q,c_{1:k})$ depends on packing


7. Reranking and Fusion

This part studies reranking and fusion as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

Subtopic | Question | Formula
First-stage recall | retrieve broad candidates cheaply | $K_\mathrm{first}\gg K_\mathrm{final}$
Reranker precision | rerank candidates with a stronger model | $s_\mathrm{rerank}(q,d)$
Reciprocal rank fusion | combine ranked lists robustly | $\mathrm{RRF}(d)=\sum_m 1/(k+r_m(d))$
Score calibration | dense, sparse, and reranker scores may live on different scales | $z=(s-\mu)/\sigma$
Citation selection | answer citations should correspond to evidence actually used | $d_i\rightarrow \mathrm{claim}_j$

7.1 First-stage recall

Main idea. Retrieve broad candidates cheaply.

Core relation:

K_\mathrm{first}\gg K_\mathrm{final}


7.2 Reranker precision

Main idea. Rerank candidates with a stronger model.

Core relation:

s_\mathrm{rerank}(q,d)


7.3 Reciprocal rank fusion

Main idea. Combine ranked lists robustly.

Core relation:

\mathrm{RRF}(d)=\sum_m 1/(k+r_m(d))

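Reciprocal rank fusion as code. The constant k=60 is the value commonly used in the RRF literature, but it is a tunable parameter; the document ids are illustrative.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1/(k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_list  = ["d3", "d1", "d7"]
sparse_list = ["d1", "d9", "d3"]
print(rrf([dense_list, sparse_list]))  # 'd1' and 'd3' rise to the top
```

Because RRF uses only ranks, it sidesteps the score-calibration problem in the next subtopic.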

7.4 Score calibration

Main idea. Dense, sparse, and reranker scores may live on different scales.

Core relation:

z=(s-\mu)/\sigma


7.5 Citation selection

Main idea. Answer citations should correspond to evidence actually used.

Core relation:

d_i\rightarrow \mathrm{claim}_j


8. RAG Evaluation

This part studies RAG evaluation as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

Subtopic | Question | Formula
Retrieval metrics | measure search independently of generation | $\mathrm{Recall@}k,\ \mathrm{MRR},\ \mathrm{nDCG}$
Answer metrics | measure final response correctness and faithfulness | $S_\mathrm{answer}$
Attribution | claims should be supported by retrieved evidence | $\mathrm{support}(\mathrm{claim},d_i)$
Ablations | compare no-retrieval, sparse, dense, hybrid, and reranked variants | $\Delta S$
Dataset drift | retrieval quality changes when the corpus or query distribution changes | $p_\mathrm{query}$, $D$ drift

8.1 Retrieval metrics

Main idea. Measure search independently of generation.

Core relation:

\mathrm{Recall@}k,\ \mathrm{MRR},\ \mathrm{nDCG}

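MRR and a simple binary-relevance nDCG as code; these are the standard textbook formulas, not tied to any particular library, and the ids are illustrative.

```python
import math

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant document per query."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked_ids, relevant, k):
    """Binary-relevance nDCG: DCG of the ranking over DCG of the ideal ranking."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, d in enumerate(ranked_ids[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

print(mrr([["d2", "d5"], ["d9", "d1"]], [{"d5"}, {"d9"}]))  # (1/2 + 1) / 2 = 0.75
print(ndcg_at_k(["d5", "d2", "d7"], {"d2", "d7"}, k=3))
```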

8.2 Answer metrics

Main idea. Measure final response correctness and faithfulness.

Core relation:

S_\mathrm{answer}


8.3 Attribution

Main idea. Claims should be supported by retrieved evidence.

Core relation:

\mathrm{support}(\mathrm{claim},d_i)


8.4 Ablations

Main idea. Compare no retrieval, sparse, dense, hybrid, and reranked variants.

Core relation:

\Delta S


8.5 Dataset drift

Main idea. Retrieval quality changes when corpus or query distribution changes.

Core relation:

$p_\mathrm{query}$, $D$ drift

Retrieval quality is not stationary. New documents shift score distributions and index statistics, and new user behavior shifts the query distribution. Re-run the gold retrieval set on a schedule and watch recall@k and score histograms: a drop without a code change usually means the corpus or the queries drifted.

Common mistake. Benchmarking once at launch. Drift monitoring must be recurring, or regressions surface as user complaints.
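One cheap drift probe, as a sketch: compare this week's top-1 retrieval scores against a frozen reference window. The normal samples here are synthetic placeholders for logged scores.

```python
# Standardized mean shift of top-1 scores between a reference window and
# the current window. A large absolute value flags drift worth investigating.
import numpy as np

def score_drift(ref_scores, new_scores):
    ref, new = np.asarray(ref_scores), np.asarray(new_scores)
    pooled_std = np.sqrt((ref.var() + new.var()) / 2) + 1e-8
    return (new.mean() - ref.mean()) / pooled_std

ref = np.random.default_rng(0).normal(0.72, 0.05, 500)   # toy reference scores
new = np.random.default_rng(1).normal(0.61, 0.07, 500)   # toy current scores
print(f"drift statistic: {score_drift(ref, new):+.2f}")
```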

9. Failure Modes

This part catalogs how RAG pipelines fail and pairs each failure with a checkable condition. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Missed retrieval | the answer document is not in top k | d^+\notin R_k(q) |
| Bad chunk | the right document is retrieved but not the right span | c^+\notin R_k(q) |
| Distractor context | irrelevant high-scoring chunks pull generation away | s(q,d^-)>s(q,d^+) |
| Generator ignores evidence | the answer is not grounded even with good retrieval | p_\theta(y\mid q,R_k) uses parametric prior |
| Citation mismatch | the cited chunk does not support the claim | \mathrm{claim}\not\subset d_\mathrm{cited} |
9.1 Missed retrieval

Main idea. The answer document is not in top k.

Core relation:

d^+\notin R_k(q)

This is the failure to rule out first: if the supporting document never enters the top k, nothing downstream can recover it. Check recall@k on the gold set before touching the reranker or the prompt. Common causes are vocabulary mismatch between queries and documents, a stale index, inconsistent normalization, or k set too small.
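A sketch of the first diagnostic: find where the gold document actually ranks under dot-product search, using synthetic unit-normalized embeddings.

```python
# Rank of the gold document under dot-product retrieval. If the rank
# exceeds k, the failure is missed retrieval, not ranking or generation.
import numpy as np

def gold_rank(query_vec, doc_vecs, gold_idx):
    scores = doc_vecs @ query_vec
    order = np.argsort(-scores)
    return int(np.where(order == gold_idx)[0][0]) + 1   # 1-based rank

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)     # unit-normalize
q = docs[17] + 0.1 * rng.normal(size=8)                 # query near doc 17
q /= np.linalg.norm(q)
rank = gold_rank(q, docs, gold_idx=17)
print(f"gold rank = {rank}; missed at k=5: {rank > 5}")
```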

9.2 Bad chunk

Main idea. The right document is retrieved but not the right span.

Core relation:

c^+\notin R_k(q)

Document-level recall can look fine while chunk-level recall fails: the chunker may split the answer span across two chunks, or attach it to a chunk whose embedding is dominated by unrelated content. Audit chunk boundaries around gold spans; increasing overlap or shrinking chunk length often fixes this without touching the retriever.
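A sketch showing how overlap can rescue a span that a fixed-size chunker would otherwise split; `chunk_text` is a toy character-level chunker, not a library function.

```python
# Check whether a gold answer span survives chunking intact under
# different chunk sizes and overlaps.

def chunk_text(text: str, size: int, overlap: int):
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

span = "the answer span lives here"
doc = "alpha " * 18 + span + " omega " * 20   # span straddles the 120-char boundary
for size, overlap in [(120, 0), (120, 40)]:
    chunks = chunk_text(doc, size, overlap)
    intact = any(span in c for c in chunks)
    print(f"size={size} overlap={overlap}: span intact in some chunk = {intact}")
```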

9.3 Distractor context

Main idea. Irrelevant high-scoring chunks pull generation away.

Core relation:

s(q,d^-)>s(q,d^+)

Even when the supporting document is retrieved, a distractor that scores higher can dominate the prompt. The diagnostic is the score margin between the gold document and the best distractor: small or negative margins predict fluent but wrong answers. Rerankers and MMR exist precisely to push distractors down or out.
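A minimal margin check, assuming you already have retrieval scores and know which index is gold:

```python
# Score margin between the gold document and the best-scoring distractor.
# A negative margin means a distractor outranks the gold document.
import numpy as np

def gold_margin(scores, gold_idx):
    scores = np.asarray(scores, dtype=float)
    best_distractor = np.max(np.delete(scores, gold_idx))
    return scores[gold_idx] - best_distractor

scores = [0.81, 0.77, 0.84, 0.42]   # toy retrieval scores; gold is index 0
print(f"margin = {gold_margin(scores, gold_idx=0):+.2f}")  # -0.03: distractor wins
```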

9.4 Generator ignores evidence

Main idea. The answer is not grounded even with good retrieval.

Core relation:

p_\theta(y\mid q,R_k) uses the parametric prior

Retrieval can be perfect and the answer still ungrounded: the generator may trust its parametric prior over the context, especially when the evidence contradicts facts that were common in pretraining. The cheapest probe is counterfactual: edit the retrieved evidence and check whether the answer changes. If it does not, the generator is not reading the context.
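A sketch of the counterfactual probe; `generate` is a hypothetical placeholder for your LLM call, and any real API will look different.

```python
# Counterfactual evidence probe: if editing the evidence does not change
# the answer, the generator is ignoring the context.

def generate(query: str, context: str) -> str:
    # Placeholder: a real call to your model goes here.
    return f"answer derived from: {context[:40]}..."

def evidence_sensitive(query, context, edited_context):
    """True if editing the evidence changes the answer."""
    return generate(query, context) != generate(query, edited_context)

ctx = "The capital of Atlantis is Poseidonia."
edited = "The capital of Atlantis is Nereidopolis."
print(evidence_sensitive("capital of Atlantis?", ctx, edited))
```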

9.5 Citation mismatch

Main idea. The cited chunk does not support the claim.

Core relation:

\mathrm{claim}\not\subset d_\mathrm{cited}

A citation is a promise, and citation mismatch breaks it: the answer points at a chunk that neither contains nor entails the claim. This is the most corrosive failure mode because the answer looks verified. Spot-check cited pairs with the same support score used for attribution, and treat low-support citations as hallucinations with footnotes.
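As a sketch reusing the token-overlap proxy from the attribution example, with an arbitrary 0.5 cutoff for flagging weak citations:

```python
# Flag cited chunks whose lexical support for the claim is weak.
# Token overlap is a cheap proxy; entailment models are stronger.

def support_score(claim, chunk):
    c, d = set(claim.lower().split()), set(chunk.lower().split())
    return len(c & d) / max(len(c), 1)

citations = [
    ("revenue grew 12% in 2023", "Revenue grew 12% year over year in 2023"),
    ("the CEO resigned in March", "Quarterly results beat analyst expectations"),
]
for claim, chunk in citations:
    s = support_score(claim, chunk)
    flag = "MISMATCH" if s < 0.5 else "ok"
    print(f"{flag:>8}  support={s:.2f}  claim={claim!r}")
```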

10. Implementation Checklist

This part turns the preceding math into a deployment checklist. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Embedding normalization | know whether the index expects normalized vectors | v\leftarrow v/\Vert v\Vert |
| Chunk audit | inspect chunks before blaming the retriever | c_i |
| Gold retrieval set | keep queries with known supporting documents | D_\mathrm{gold} |
| Context budget tests | evaluate different k, chunk length, and overlap | k, o, \vert c\vert |
| End-to-end traces | log query, retrieved docs, reranker scores, prompt, answer, and citations | \mathrm{trace} |

10.1 Embedding normalization

Main idea. Know whether the index expects normalized vectors.

Core relation:

v\leftarrow v/\Vert v\Vert

Cosine indexes assume unit vectors; inner-product indexes do not. Mixing the two silently changes the ranking: under raw dot product, a long generic document with a large-norm embedding can outrank a short precise one. Normalize at index time and query time, or at neither, and assert the choice in code.
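A two-vector demonstration of why the choice matters: the same pair ranks differently under raw dot product and cosine similarity.

```python
# The same vectors, ranked by raw dot product vs. cosine similarity.
import numpy as np

q = np.array([1.0, 0.0])
d_aligned = np.array([0.9, 0.1])   # well-aligned, small norm
d_big = np.array([2.0, 2.0])       # poorly aligned, large norm

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print("dot:    aligned =", q @ d_aligned, " big =", q @ d_big)   # big norm wins
print("cosine: aligned =", round(cosine(q, d_aligned), 3),
      " big =", round(cosine(q, d_big), 3))                      # alignment wins
```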

10.2 Chunk audit

Main idea. Inspect chunks before blaming the retriever.

Core relation:

c_i

Many "retrieval" bugs are chunking bugs: boilerplate headers embedded as content, tables flattened into noise, answer spans cut in half. Before tuning the retriever, read a random sample of chunks plus the chunks retrieved for failing queries. Five minutes of reading often saves a week of index tuning.
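A sketch of a chunk audit that samples chunks and surfaces cheap red flags; the thresholds (length under 40 characters, alphabetic ratio under 0.5) are arbitrary starting points.

```python
# Sample chunks and flag obvious junk before blaming the retriever.
import random

def audit(chunks, n=5, seed=0):
    random.seed(seed)
    seen = set()
    for c in random.sample(chunks, min(n, len(chunks))):
        alpha = sum(ch.isalpha() for ch in c) / max(len(c), 1)
        flags = []
        if len(c) < 40:
            flags.append("short")
        if alpha < 0.5:
            flags.append("non-text")
        if c in seen:
            flags.append("duplicate")
        seen.add(c)
        print(f"[{','.join(flags) or 'ok'}] {c[:60]!r}")

audit(["Page 3 of 12", "=== === === ===",
       "Revenue grew 12% in 2023 driven by cloud."])
```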

10.3 Gold retrieval set

Main idea. Keep queries with known supporting documents.

Core relation:

D_\mathrm{gold}

A gold set of query and supporting-document pairs is the regression test for retrieval. Keep it small enough to run on every index rebuild and embedding-model change, and diverse enough to cover the query types you actually see. Recall@k on the gold set is the first number to check when anything upstream changes.
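A minimal recall@k evaluation over a gold set; `retrieve`, the document ids, and the queries are hypothetical stand-ins.

```python
# Recall@k over a gold set of (query, gold_doc_id) pairs.

def recall_at_k(retrieve, gold_pairs, k=5):
    hits = sum(gold in retrieve(q)[:k] for q, gold in gold_pairs)
    return hits / len(gold_pairs)

gold_pairs = [("how to reset password", "doc_authn"),
              ("refund policy", "doc_billing")]
retrieve = lambda q: ["doc_authn", "doc_faq"] if "password" in q else ["doc_faq"]
print(f"recall@5 = {recall_at_k(retrieve, gold_pairs):.2f}")  # 0.50: billing doc missed
```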

10.4 Context budget tests

Main idea. Evaluate different k, chunk length, and overlap.

Core relation:

k, o, |c|

Answer quality is not monotone in k: more chunks raise recall but also add distractors and push evidence toward the middle of the prompt, where models attend least. Sweep k, chunk length, and overlap jointly on the gold set rather than tuning each knob in isolation.
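A sketch of a joint sweep; `evaluate_config` is a placeholder returning a synthetic score, where a real hook would run retrieval, generation, and grading over the gold set.

```python
# Grid-sweep the context-budget knobs against a fixed evaluation hook.
import itertools

def evaluate_config(k, chunk_len, overlap):
    # Placeholder score; replace with an end-to-end gold-set evaluation.
    return 0.6 + 0.02 * k - 0.0001 * chunk_len - 0.001 * abs(overlap - 32)

grid = itertools.product([3, 5, 10], [256, 512], [0, 32, 64])
best = max(grid, key=lambda cfg: evaluate_config(*cfg))
print("best (k, chunk_len, overlap):", best)
```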

10.5 End-to-end traces

Main idea. Log query, retrieved docs, reranker scores, prompt, answer, and citations.

Core relation:

\mathrm{trace}

A trace ties one query to everything the pipeline did with it: the query text and embedding norm, top-k document ids and scores, chunk text, reranker scores, the packed prompt, the answer, and the citations.

AI connection. A RAG trace is the fastest way to locate whether failure came from search, ranking, packing, or generation. Without traces, debugging is guesswork.
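A sketch of a trace record; the field names mirror the checklist above but are illustrative, not a fixed schema.

```python
# One trace per query, capturing every stage the pipeline ran.
from dataclasses import dataclass, field

@dataclass
class RagTrace:
    query: str
    query_norm: float
    top_k_ids: list
    scores: list
    reranker_scores: list = field(default_factory=list)
    prompt: str = ""
    answer: str = ""
    citations: list = field(default_factory=list)

trace = RagTrace(query="refund policy", query_norm=1.0,
                 top_k_ids=["doc_billing", "doc_faq"], scores=[0.81, 0.64])
print(trace)
```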


Practice Exercises

  1. Normalize vectors and compute cosine similarities.
  2. Retrieve top-k documents by dot product.
  3. Compute a toy BM25-style lexical score.
  4. Compute dense contrastive loss with one positive and negatives.
  5. Compute recall@k and MRR.
  6. Use MMR to select diverse chunks.
  7. Pack chunks into a context budget.
  8. Combine rankings with reciprocal rank fusion.
  9. Decompose an end-to-end RAG failure.
  10. Write a RAG trace checklist.

Why This Matters for AI

RAG is often the cheapest way to update knowledge, cite sources, and ground answers. But RAG is not magic. Retrieval can miss the answer, rank distractors above evidence, split the useful span across chunks, or feed the generator context it ignores. Good RAG work is measurement-heavy.

Bridge to Serving and Systems Tradeoffs

The final LLM math section studies the system-level tradeoffs around serving: batching, latency, throughput, memory, routing, caching, and cost. RAG adds another system layer because retrieval latency and context length feed directly into serving latency.
