Retrieval-augmented generation adds an external memory system to an LLM. The math is a pipeline: embed, search, rank, pack context, generate, and verify attribution.
Overview
The central RAG conditional is $p_\theta(y \mid q, R_k) = \prod_{t} p_\theta(y_t \mid y_{<t}, q, c_1, \dots, c_k)$, where $q$ is the query and $R_k = \{c_1, \dots, c_k\}$ is the set of retrieved chunks. RAG succeeds only when the right information is retrieved, ranked highly, packed into context, and used by the generator.
Prerequisites
- Embedding-space geometry and cosine similarity
- Conditional language-model probability
- Efficient inference and context-window constraints
- Evaluation metrics and error analysis
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Demonstrates cosine search, BM25 intuition, contrastive retrieval loss, recall@k, MMR, reranking, context packing, and RAG failure decomposition. |
| exercises.ipynb | Ten practice problems for retrieval scores, recall, MRR, chunk packing, MMR, RRF, and RAG diagnostics. |
Learning Objectives
After this section, you should be able to:
- Define RAG as conditional generation with retrieved non-parametric memory.
- Compute dot-product and cosine retrieval scores.
- Explain sparse, dense, hybrid, late-interaction, and cross-encoder retrieval.
- Write the contrastive loss for dense retriever training.
- Compute recall@k, MRR, and simple nDCG.
- Explain chunk length, overlap, MMR, and context packing.
- Explain ANN recall-latency tradeoffs.
- Diagnose RAG failures with traces and ablations.
Table of Contents
- RAG as Conditional Generation
- 1.1 Parametric memory
- 1.2 Non-parametric memory
- 1.3 Retriever
- 1.4 Generator
- 1.5 Failure decomposition
- Similarity Spaces
- 2.1 Embedding functions
- 2.2 Dot product
- 2.3 Cosine similarity
- 2.4 Maximum inner product search
- 2.5 Normalization
- Sparse and Dense Retrieval
- Retriever Training
- 4.1 Positive pairs
- 4.2 Negative pairs
- 4.3 Contrastive loss
- 4.4 In-batch negatives
- 4.5 Hard negatives
- Approximate Nearest Neighbor Search
- 5.1 Exact search
- 5.2 ANN search
- 5.3 Recall at k
- 5.4 Index compression
- 5.5 Latency recall tradeoff
- Chunking and Context Packing
- 6.1 Chunk length
- 6.2 Overlap
- 6.3 Packing budget
- 6.4 Diversity
- 6.5 Lost-in-context risk
- Reranking and Fusion
- 7.1 First-stage recall
- 7.2 Reranker precision
- 7.3 Reciprocal rank fusion
- 7.4 Score calibration
- 7.5 Citation selection
- RAG Evaluation
- 8.1 Retrieval metrics
- 8.2 Answer metrics
- 8.3 Attribution
- 8.4 Ablations
- 8.5 Dataset drift
- Failure Modes
- 9.1 Missed retrieval
- 9.2 Bad chunk
- 9.3 Distractor context
- 9.4 Generator ignores evidence
- 9.5 Citation mismatch
- Implementation Checklist
- 10.1 Embedding normalization
- 10.2 Chunk audit
- 10.3 Gold retrieval set
- 10.4 Context budget tests
- 10.5 End-to-end traces
Pipeline Diagram
query -> query encoder -> vector search -> top-k chunks -> reranker -> context packer -> LLM -> answer + citations
Each arrow can fail. The math gives you probes for each arrow.
1. RAG as Conditional Generation
This part frames RAG as conditional generation over retrieved evidence. Throughout, keep separate the embedding space, the search algorithm, the ranking metric, the context budget, and the generator's behavior.
| Subtopic | Role | Formula |
|---|---|---|
| Parametric memory | knowledge stored in model weights | $p_\theta(y \mid q)$ |
| Non-parametric memory | knowledge stored in an external corpus | $D = \{d_1, \dots, d_N\}$ |
| Retriever | select relevant documents for a query | $R_k = \operatorname{top-k}_{d \in D}\, s(q, d)$ |
| Generator | answer conditioned on query and retrieved context | $p_\theta(y \mid q, R_k)$ |
| Failure decomposition | RAG can fail at retrieval, ranking, context packing, or generation | (qualitative) |
1.1 Parametric memory
Main idea. Knowledge stored in model weights.
Core relation: $p_\theta(y \mid q)$, the model answering from its weights alone, with no external evidence.
RAG changes the conditional distribution by adding retrieved evidence to the prompt. The retrieval system and generator should be evaluated separately and together. A high-quality generator cannot compensate for missing evidence, and a high-recall retriever can still fail if it returns noisy chunks or if the prompt buries the useful span.
1.2 Non-parametric memory
Main idea. Knowledge stored in an external corpus.
Core relation: an external corpus $D = \{d_1, \dots, d_N\}$ that can be searched, edited, and grown without retraining the model.
1.3 Retriever
Main idea. Select relevant documents for a query.
Core relation: $R_k = \operatorname{top-k}_{d \in D}\, s(q, d)$ under a similarity score $s$.
AI connection. The generator cannot use evidence that retrieval never returns.
1.4 Generator
Main idea. Answer conditioned on query and retrieved context.
Core relation: $p_\theta(y \mid q, c_1, \dots, c_k)$, generation conditioned on the query and the packed evidence.
1.5 Failure decomposition
Main idea. RAG can fail at retrieval, ranking, context packing, or generation.
Core relation: the answer is correct only if every stage succeeds; as a rough chain, $\Pr[\mathrm{correct}] \approx \Pr[\mathrm{retrieved}] \cdot \Pr[\mathrm{packed} \mid \mathrm{retrieved}] \cdot \Pr[\mathrm{grounded} \mid \mathrm{packed}]$.
Implementation check. Log query text, query embedding norm, top-k document ids, scores, chunk text, reranker scores, final prompt, answer, and citations. RAG without traces is guesswork.
Common mistake. Do not evaluate only final answers. Measure retrieval recall, reranker precision, context-packing quality, and answer faithfulness separately.
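Code sketch. A minimal way to structure the per-query trace named in the implementation check; the field names mirror that list and are illustrative, not a fixed schema.

```python
from dataclasses import dataclass, field

@dataclass
class RagTrace:
    """One end-to-end RAG trace: everything needed to localize a failure."""
    query: str
    query_embedding_norm: float
    top_k_ids: list[str] = field(default_factory=list)
    retrieval_scores: list[float] = field(default_factory=list)
    chunk_texts: list[str] = field(default_factory=list)
    reranker_scores: list[float] = field(default_factory=list)
    final_prompt: str = ""
    answer: str = ""
    citations: list[str] = field(default_factory=list)
```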
2. Similarity Spaces
This part studies the similarity spaces that dense retrieval searches.
| Subtopic | Role | Formula |
|---|---|---|
| Embedding functions | map queries and documents into a shared vector space | $e_q = f(q),\ e_d = g(d) \in \mathbb{R}^n$ |
| Dot product | inner product rewards alignment and vector norm | $s(q, d) = e_q^\top e_d$ |
| Cosine similarity | normalize away vector length | $\cos(e_q, e_d) = e_q^\top e_d / (\lVert e_q \rVert \lVert e_d \rVert)$ |
| Maximum inner product search | retrieve highest-scoring vectors | $\arg\max_{d \in D} e_q^\top e_d$ |
| Normalization | for unit vectors, dot product and cosine coincide | $\lVert e \rVert = 1 \Rightarrow e_q^\top e_d = \cos(e_q, e_d)$ |
2.1 Embedding functions
Main idea. Map queries and documents into a shared vector space.
Core relation: $e_q = f(q)$ and $e_d = g(d)$ in a shared space $\mathbb{R}^n$; queries and documents may use different encoders, but the scores only make sense in the shared space.
2.2 Dot product
Main idea. Inner product rewards alignment and vector norm.
Core relation: $s(q, d) = e_q^\top e_d = \lVert e_q \rVert \lVert e_d \rVert \cos\theta$, so both alignment and norm contribute to the score.
Worked micro-example. If a query vector and document vectors are unit-normalized, cosine similarity is just a dot product. Retrieval by top-k dot product then selects the documents whose embeddings are most aligned with the query. If vectors are not normalized, high-norm documents can win even when their direction is less relevant.
2.3 Cosine similarity
Main idea. Normalize away vector length.
Core relation: $\cos(e_q, e_d) = \dfrac{e_q^\top e_d}{\lVert e_q \rVert \lVert e_d \rVert}$.
AI connection. Most RAG bugs start with misunderstanding what the vector store is scoring.
2.4 Maximum inner product search
Main idea. Retrieve highest-scoring vectors.
Core relation: $d^\ast = \arg\max_{d \in D} e_q^\top e_d$ (MIPS), the search problem vector indexes actually solve.
2.5 Normalization
Main idea. For unit vectors, dot product and cosine are the same.
Core relation: if $\lVert e_q \rVert = \lVert e_d \rVert = 1$, then $e_q^\top e_d = \cos(e_q, e_d)$, so cosine search reduces to MIPS on normalized vectors.
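Code sketch. A numpy check of the normalization identity above: after unit-normalizing, top-k by dot product and top-k by cosine agree, and the inflated-norm document shows why raw dot product can mis-rank.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=64)
D = rng.normal(size=(1000, 64))
D[7] *= 10.0  # inflate one document's norm

def top_k(scores, k=5):
    return np.argsort(-scores)[:k]

dot = D @ q
cos = dot / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))

print(top_k(dot))   # document 7 can win on norm alone
print(top_k(cos))   # norm no longer matters

# After unit-normalizing both sides, dot product equals cosine.
Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
qn = q / np.linalg.norm(q)
assert np.allclose(Dn @ qn, cos)
```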
3. Sparse and Dense Retrieval
This part studies sparse and dense retrieval and how to combine them.
| Subtopic | Role | Formula |
|---|---|---|
| Sparse lexical retrieval | match query terms to document terms | $\mathrm{BM25}(q, d)$ |
| Dense bi-encoder retrieval | encode query and document separately | $s(q, d) = f(q)^\top g(d)$ |
| Hybrid retrieval | combine sparse and dense scores | $s = \alpha s_{\mathrm{dense}} + (1 - \alpha) s_{\mathrm{sparse}}$ |
| Late interaction | score token embeddings after independent encoding | $s(q, d) = \sum_i \max_j q_i^\top d_j$ |
| Cross-encoder reranking | jointly encode query and document for a slower, stronger score | $s(q, d) = h_\phi(q \oplus d)$ |
3.1 Sparse lexical retrieval
Main idea. Match query terms to document terms.
Core relation: $\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t)\, \dfrac{f(t, d)(k_1 + 1)}{f(t, d) + k_1(1 - b + b \lvert d \rvert / \mathrm{avgdl})}$, a term-frequency score with saturation and length normalization.
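Code sketch. A minimal BM25 scorer to make the relation concrete; tokenization is naive whitespace splitting and `k1`, `b` use common defaults, so treat this as intuition, not a production scorer.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 over whitespace-tokenized docs; returns one score per doc."""
    toks = [doc.lower().split() for doc in docs]
    N = len(docs)
    avgdl = sum(len(t) for t in toks) / N
    df = Counter()                      # document frequency per term
    for t in toks:
        df.update(set(t))
    scores = []
    for t in toks:
        tf = Counter(t)                 # term frequency in this doc
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores
```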
3.2 Dense bi-encoder retrieval
Main idea. Encode query and document separately.
Core relation: $s(q, d) = f(q)^\top g(d)$, with document embeddings precomputed and indexed so only the query is encoded at search time.
3.3 Hybrid retrieval
Main idea. Combine sparse and dense scores.
Core relation: $s(q, d) = \alpha\, s_{\mathrm{dense}}(q, d) + (1 - \alpha)\, s_{\mathrm{sparse}}(q, d)$, after per-list score normalization.
3.4 Late interaction
Main idea. Score token embeddings after independent encoding.
Core relation: $s(q, d) = \sum_i \max_j q_i^\top d_j$ over query token embeddings $q_i$ and document token embeddings $d_j$ (ColBERT-style MaxSim).
3.5 Cross-encoder reranking
Main idea. Jointly encode query and document for a slower stronger score.
Core relation: $s(q, d) = h_\phi(q \oplus d)$, a joint encoding of the concatenated pair scored by a learned head; too slow for first-stage search, strong for reranking.
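Code sketch. One way to mix sparse and dense scores: min-max normalize each list, then take a weighted sum. The `alpha` value is a tunable weight, not a recommendation.

```python
def minmax(scores):
    """Rescale a score list to [0, 1]; constant lists map to 0."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def hybrid_scores(dense, sparse, alpha=0.5):
    """Combine per-document dense and sparse scores after normalization."""
    d, s = minmax(dense), minmax(sparse)
    return [alpha * di + (1 - alpha) * si for di, si in zip(d, s)]
```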
4. Retriever Training
This part studies retriever training with contrastive objectives.
| Subtopic | Role | Formula |
|---|---|---|
| Positive pairs | train query and relevant document to be close | $(q, d^+)$ |
| Negative pairs | train irrelevant or hard-negative documents to score lower | $(q, d^-)$ |
| Contrastive loss | softmax over one positive and many negatives | $\mathcal{L} = -\log \frac{e^{s(q, d^+)/\tau}}{e^{s(q, d^+)/\tau} + \sum_j e^{s(q, d_j^-)/\tau}}$ |
| In-batch negatives | other examples in a batch become negatives | $B - 1$ negatives per query |
| Hard negatives | near misses improve ranking training | $s(q, d^-)$ high but label negative |
4.1 Positive pairs
Main idea. Train query and relevant document to be close.
Core relation: push $s(q, d^+)$ up for annotated or mined relevant pairs $(q, d^+)$.
4.2 Negative pairs
Main idea. Train irrelevant or hard-negative documents to score lower.
Core relation: push $s(q, d^-)$ down; negatives can be random, in-batch, or mined hard negatives.
4.3 Contrastive loss
Main idea. Softmax over one positive and negatives.
Core relation: $\mathcal{L} = -\log \dfrac{\exp(s(q, d^+)/\tau)}{\exp(s(q, d^+)/\tau) + \sum_j \exp(s(q, d_j^-)/\tau)}$, with temperature $\tau$.
AI connection. This is the training objective behind many dense retrievers.
4.4 In-batch negatives
Main idea. Other examples in a batch become negatives.
Core relation: a batch of $B$ positive pairs gives each query $B - 1$ in-batch negatives at no extra encoding cost.
4.5 Hard negatives
Main idea. Near misses improve ranking training.
Core relation: hard negatives are documents with $s(q, d^-)$ high but labeled negative; training on them sharpens the ranking boundary where it matters.
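Code sketch. The contrastive loss with in-batch negatives in PyTorch: row $i$ of the $B \times B$ score matrix treats document $i$ as the positive and the other $B - 1$ documents as negatives, which is exactly cross-entropy against the diagonal.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb, d_emb, tau=0.05):
    """q_emb, d_emb: (B, n) aligned positive pairs. Returns scalar InfoNCE loss."""
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    logits = q @ d.T / tau                                  # (B, B) similarities
    targets = torch.arange(logits.shape[0], device=logits.device)
    return F.cross_entropy(logits, targets)                 # positive = diagonal
```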
5. Approximate Nearest Neighbor Search
This part studies approximate nearest neighbor (ANN) search.
| Subtopic | Role | Formula |
|---|---|---|
| Exact search | compare a query to every vector | $O(Nn)$ for $N$ vectors in $\mathbb{R}^n$ |
| ANN search | trade exactness for latency and memory | approximate top-$k$ |
| Recall at k | measure whether relevant docs appear in top $k$ | $\mathrm{Recall@}k = \lvert R_k \cap \mathrm{Rel} \rvert / \lvert \mathrm{Rel} \rvert$ |
| Index compression | quantize or cluster vectors to reduce memory | e.g. product quantization |
| Latency recall tradeoff | faster search can miss relevant documents | recall vs. queries per second |
5.1 Exact search
Main idea. Compare a query to every vector.
Core relation: brute-force top-$k$ costs $O(Nn)$ per query for $N$ vectors in $\mathbb{R}^n$; exact, and perfectly feasible for small corpora.
5.2 ANN search
Main idea. Trade exactness for latency and memory.
Core relation: graph, tree, or cluster indexes (e.g. HNSW, IVF) return an approximate top-$k$ with sublinear query cost and recall below 1.
5.3 Recall at k
Main idea. Measure whether relevant docs appear in top k.
Core relation: $\mathrm{Recall@}k = \lvert R_k \cap \mathrm{Rel} \rvert / \lvert \mathrm{Rel} \rvert$, the fraction of relevant documents that appear in the top $k$.
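Code sketch. Recall@k exactly as defined above.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)
```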
5.4 Index compression
Main idea. Quantize or cluster vectors to reduce memory.
Core relation: compressed codes (product quantization, scalar quantization, clustering) approximate $e_d$ with fewer bytes, trading score fidelity for memory.
5.5 Latency recall tradeoff
Main idea. Faster search can miss relevant documents.
Core relation: index parameters (probe count, search depth) move you along a recall-latency curve; measure both before shipping a setting.
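Code sketch. Placing an ANN index on the recall-latency curve requires exact search as ground truth; a brute-force numpy baseline over sampled queries is enough. Here `ann_top_k` is a stand-in for whatever index you are testing.

```python
import numpy as np

def exact_top_k(q, D, k):
    """Brute-force top-k by dot product; the ground truth for ANN recall."""
    scores = D @ q
    idx = np.argpartition(-scores, k)[:k]
    return set(idx[np.argsort(-scores[idx])].tolist())

def ann_recall(queries, D, ann_top_k, k=10):
    """Mean fraction of the exact top-k recovered by the ANN index."""
    total = 0.0
    for q in queries:
        truth = exact_top_k(q, D, k)
        total += len(truth & set(ann_top_k(q, k))) / k
    return total / len(queries)
```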
6. Chunking and Context Packing
This part studies chunking and context packing.
| Subtopic | Role | Formula |
|---|---|---|
| Chunk length | split documents into retrievable units | $L$ tokens per chunk |
| Overlap | repeat boundary tokens to avoid cutting facts | stride $L - o$, overlap $o$ |
| Packing budget | retrieved chunks must fit the context window | $\sum_i \lvert c_i \rvert \le B$ |
| Diversity | avoid filling context with near-duplicates | MMR |
| Lost-in-context risk | order and place evidence so the generator can use it | $p(y \mid q, c_{1:k})$ depends on packing |
6.1 Chunk length
Main idea. Split documents into retrievable units.
Core relation: chunk length $L$ trades retrieval precision (short chunks) against context completeness (long chunks).
6.2 Overlap
Main idea. Repeat boundary tokens to avoid cutting facts.
Core relation: slide a window of length $L$ with stride $L - o$; the overlap $o$ keeps facts that straddle a boundary intact in at least one chunk.
6.3 Packing budget
Main idea. Retrieved chunks must fit context.
Core relation: $\sum_i \lvert c_i \rvert \le B$, where $B$ is the token budget left after the system prompt, the query, and the answer reserve.
AI connection. Retrieval success still fails if the evidence is packed badly into context.
6.4 Diversity
Main idea. Avoid filling context with near-duplicates.
Core relation: MMR selects $c^\ast = \arg\max_{c \in C \setminus S} \big[\lambda\, s(q, c) - (1 - \lambda) \max_{c' \in S} s(c, c')\big]$, balancing relevance against redundancy with the already-selected set $S$.
6.5 Lost-in-context risk
Main idea. Retrieved text must be ordered and summarized so the generator can use it.
Core relation: $p(y \mid q, c_{1:k})$ depends on the order and position of chunks, not just their identity; evidence buried mid-context is used less reliably.
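Code sketch. MMR selection followed by greedy packing under a token budget, assuming unit-normalized embeddings so dot product is cosine; `lam` and the budget are illustrative values.

```python
import numpy as np

def mmr_select(q_emb, cand_embs, k, lam=0.7):
    """Pick k candidates balancing query relevance against redundancy (MMR)."""
    sims_q = cand_embs @ q_emb
    selected, remaining = [], list(range(len(cand_embs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            red = max((cand_embs[i] @ cand_embs[j] for j in selected), default=0.0)
            return lam * sims_q[i] - (1 - lam) * red
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

def pack_greedy(order, lengths, budget):
    """Keep chunks in ranked order while the running token count fits the budget."""
    packed, used = [], 0
    for i in order:
        if used + lengths[i] <= budget:
            packed.append(i)
            used += lengths[i]
    return packed
```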
7. Reranking and Fusion
This part studies reranking and rank fusion.
| Subtopic | Role | Formula |
|---|---|---|
| First-stage recall | retrieve broad candidates cheaply | top-$N$ with $N \gg k$ |
| Reranker precision | rerank candidates with a stronger model | $s_{\mathrm{CE}}(q, c)$ |
| Reciprocal rank fusion | combine ranked lists robustly | $\mathrm{RRF}(d) = \sum_r 1/(k_0 + \mathrm{rank}_r(d))$ |
| Score calibration | dense, sparse, and reranker scores live on different scales | per-list normalization |
| Citation selection | citations should correspond to evidence actually used | cited $c$ entails the claim |
7.1 First-stage recall
Main idea. Retrieve broad candidates cheaply.
Core relation: the first stage returns $N$ candidates with $N \gg k$, optimized for recall; precision is the next stage's job.
7.2 Reranker precision
Main idea. Rerank candidates with a stronger model.
Core relation: a cross-encoder score $s_{\mathrm{CE}}(q, c)$ reorders the candidate list; too slow for the full corpus, affordable for $N$ candidates.
7.3 Reciprocal rank fusion
Main idea. Combine ranked lists robustly.
Core relation: $\mathrm{RRF}(d) = \sum_{r \in \mathrm{lists}} \dfrac{1}{k_0 + \mathrm{rank}_r(d)}$, commonly with $k_0 = 60$; rank-based fusion avoids comparing raw scores across systems.
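Code sketch. Reciprocal rank fusion over any number of ranked lists; `k0=60` is the conventional constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k0=60):
    """Fuse ranked lists of doc ids; returns ids sorted by fused score, best first."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k0 + rank)
    return sorted(fused, key=fused.get, reverse=True)
```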
7.4 Score calibration
Main idea. Dense, sparse, and reranker scores may live on different scales.
Core relation: normalize each list before mixing, e.g. min-max or z-score $\tilde{s} = (s - \mu)/\sigma$; raw dense, sparse, and reranker scores are not comparable.
7.5 Citation selection
Main idea. Answer citations should correspond to evidence actually used.
Core relation: each citation should point to a chunk that entails the claim it is attached to; select citations from the evidence the answer actually used, not merely from what was retrieved.
8. RAG Evaluation
This part studies RAG evaluation.
| Subtopic | Role | Formula |
|---|---|---|
| Retrieval metrics | measure search independently of generation | Recall@k, MRR, nDCG |
| Answer metrics | measure final response correctness and faithfulness | exact match, F1, judged faithfulness |
| Attribution | claims should be supported by retrieved evidence | each claim maps to some $c_i$ |
| Ablations | compare no-retrieval, sparse, dense, hybrid, and reranked variants | fixed generator, varied retrieval |
| Dataset drift | retrieval quality changes when corpus or query distribution changes | $p_{\mathrm{query}}$ and $D$ drift |
8.1 Retrieval metrics
Main idea. Measure search independently of generation.
Core relation: $\mathrm{MRR} = \frac{1}{\lvert Q \rvert} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}$ over a gold retrieval set, alongside Recall@k and nDCG.
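Code sketch. MRR and a simple binary-relevance nDCG@k, matching the learning objectives; ranks are 1-based.

```python
import math

def mrr(all_retrieved, all_relevant):
    """Mean reciprocal rank of the first relevant doc per query (0 if absent)."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG@k: discounted hits over the ideal ordering."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```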
8.2 Answer metrics
Main idea. Measure final response correctness and faithfulness.
Core relation: score the final answer for correctness (exact match, F1, or judged accuracy) and for faithfulness to the retrieved context; the two can disagree.
8.3 Attribution
Main idea. Claims should be supported by retrieved evidence.
Core relation: every atomic claim in the answer should be entailed by at least one retrieved chunk $c_i$; unsupported claims count against attribution even when they happen to be correct.
8.4 Ablations
Main idea. Compare no retrieval, sparse, dense, hybrid, and reranked variants.
Core relation: hold the generator and prompt fixed, vary one retrieval component at a time, and compare end-to-end accuracy; the deltas localize where quality comes from.
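Code sketch. The shape of an ablation harness: `variants` maps names to retriever callables (with `None` for the no-retrieval baseline), while `generate` and `answer_correct` are stand-ins for your fixed generator and answer judge. Only the loop structure is the point.

```python
def run_ablation(variants, eval_set, generate, answer_correct, k=5):
    """variants: name -> retrieve(query, k) -> chunks. Generator stays fixed."""
    results = {}
    for name, retrieve in variants.items():
        correct = 0
        for query, gold in eval_set:
            chunks = retrieve(query, k) if retrieve else []
            answer = generate(query, chunks)
            correct += answer_correct(answer, gold)
        results[name] = correct / len(eval_set)
    return results  # e.g. {"no_retrieval": ..., "dense": ..., "hybrid+rerank": ...}
```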
8.5 Dataset drift
Main idea. Retrieval quality changes when corpus or query distribution changes.
Core relation: when the query distribution $p_{\mathrm{query}}$ or the corpus $D$ drifts, frozen evaluation sets overstate quality; re-measure after corpus updates and traffic shifts.
9. Failure Modes
This part catalogs failure modes and how to tell them apart.
| Subtopic | Symptom | Formula |
|---|---|---|
| Missed retrieval | the answer document is not in the top $k$ | $d^\ast \notin R_k$ |
| Bad chunk | the right document is retrieved but not the right span | span not in any $c_i$ |
| Distractor context | irrelevant high-scoring chunks pull generation away | high $s(q, c)$, irrelevant $c$ |
| Generator ignores evidence | the answer is not grounded even with good retrieval | $p_\theta(y \mid q, R_k)$ uses the parametric prior |
| Citation mismatch | the cited chunk does not support the claim | cited $c$ does not entail the claim |
9.1 Missed retrieval
Main idea. The answer document is not in top k.
Core relation: $d^\ast \notin R_k$. Fixes: better embeddings, hybrid retrieval, query rewriting, or a larger $k$.
9.2 Bad chunk
Main idea. The right document is retrieved but not the right span.
Core relation:
document-level $\mathrm{recall@}k$ can be 1 while span-level recall is 0: $d^* \in R_k$, but the chunk of $d^*$ containing the answer span was never retrieved.
This failure hides inside good-looking document metrics, and chunking, not the retriever, is usually at fault: the answer sentence straddles a chunk boundary, or the chunk that scores well is the document's abstract rather than the section holding the answer.
Worked micro-example. With 200-token chunks and zero overlap, an answer sentence spanning tokens 195 to 210 of a section is split across two chunks, and neither half scores highly alone. An overlap of 30 to 50 tokens keeps the sentence intact in at least one chunk.
Implementation check. Score retrieval at span level: does any retrieved chunk contain the gold answer string?
AI connection. Span-level recall is the quantity the generator actually depends on; document-level recall is only a proxy for it.
Common mistake. Reporting document recall@k and concluding retrieval is fine when the supporting span never entered the context.
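A minimal span-level check, using plain substring matching as an admittedly crude heuristic (real pipelines may need text normalization or fuzzy matching):

```python
# Does any top-k chunk contain the gold answer span verbatim?
def span_recall_at_k(gold_answer, chunks, k=10):
    """gold_answer: answer string; chunks: ranked list of chunk texts."""
    return any(gold_answer.lower() in c.lower() for c in chunks[:k])
```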
9.3 Distractor context
Main idea. Irrelevant high-scoring chunks pull generation away.
Core relation:
adding a distractor chunk $c'$ with high $s(q, c')$ but no real relevance shifts $p_\theta(y \mid q, R_k)$ toward the content of $c'$.
Generators condition on everything in the prompt; they do not know which chunks the retriever scored highly for the wrong reasons. Lexically or topically similar chunks about the wrong entity, wrong version, or wrong date are the usual culprits.
Worked micro-example. For the query "Python 3.12 release date", a high-scoring chunk about Python 3.11's release leads the generator to answer with the 3.11 date even though the correct chunk is also in context.
Implementation check. Measure answer accuracy with and without the suspected distractor chunk in the prompt; a large gap confirms the distractor is driving generation.
AI connection. Precision of the packed context matters as much as recall; MMR and rerankers exist largely to keep distractors out.
Common mistake. Maximizing recall by raising k without checking how much noise the extra chunks add.
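A sketch of the ablation above, assuming a hypothetical `generate(query, chunks)` wrapper around the LLM call:

```python
# Remove one suspect chunk and see whether the answer changes.
def distractor_ablation(query, chunks, suspect_idx, generate):
    full = generate(query, chunks)
    without = generate(query, chunks[:suspect_idx] + chunks[suspect_idx + 1:])
    return full, without  # differing answers mean the suspect chunk drove generation
```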
9.4 Generator ignores evidence
Main idea. The answer is not grounded even with good retrieval.
Core relation:
$p_\theta(y \mid q, R_k) \approx p_\theta(y \mid q)$: the generator falls back on its parametric prior, and the retrieved evidence barely moves the output distribution.
This happens when the model's pretrained belief conflicts with the evidence, when the useful span is buried mid-context, or when the prompt never instructs the model to ground its answer.
Worked micro-example. The context states a company's 2024 revenue, but the model answers with the 2022 figure it memorized during pretraining. Removing $R_k$ from the prompt leaves the answer unchanged, which is the signature of this failure.
Implementation check. Run the no-retrieval ablation: if $p_\theta(y \mid q)$ and $p_\theta(y \mid q, R_k)$ give the same answer, retrieval is being ignored.
AI connection. Grounding is a generator property, not a retriever property; faithfulness tuning and citation-forcing prompts target exactly this failure.
Common mistake. Improving the retriever when the trace shows the right span was already in context and ignored.
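The same ablation as a one-function probe, again with the hypothetical `generate` wrapper; identical answers with and without evidence suggest the parametric prior is winning.

```python
# No-retrieval ablation: does the evidence change the answer at all?
def grounding_probe(query, chunks, generate):
    with_evidence = generate(query, chunks)
    without_evidence = generate(query, [])
    return with_evidence == without_evidence  # True -> evidence likely ignored
```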
9.5 Citation mismatch
Main idea. The cited chunk does not support the claim.
Core relation:
the answer cites chunk $c$, but $c$ does not entail the cited claim; attribution fails even when the claim itself happens to be correct.
Citation mismatch breaks the trust contract of RAG: a reader following the citation finds no support. It often co-occurs with correct answers, which makes it invisible to answer-only evaluation.
Worked micro-example. The generator states the right figure but cites a neighboring chunk that merely mentions the topic; the chunk actually containing the figure was in context but uncited.
Implementation check. For each (claim, citation) pair, verify that the cited chunk's text supports the claim, via string overlap as a cheap first pass or an NLI model as a stronger check.
AI connection. Attribution metrics are answer metrics about evidence, the final verification step of the pipeline.
Common mistake. Scoring answers correct or incorrect without ever opening the cited chunks.
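A cheap first-pass citation check using token overlap; this is a stand-in heuristic, not an entailment model, and the cutoff for escalating a pair to an NLI check is an assumption left to the reader.

```python
# Fraction of claim tokens that appear in the cited chunk.
def citation_overlap(claim, cited_chunk):
    claim_tokens = set(claim.lower().split())
    chunk_tokens = set(cited_chunk.lower().split())
    return len(claim_tokens & chunk_tokens) / max(len(claim_tokens), 1)

# Low overlap (say, below 0.3) flags the (claim, citation) pair for review.
```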
10. Implementation Checklist
This part studies implementation checklist as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.
| Subtopic | Question | Formula |
|---|---|---|
| Embedding normalization | know whether the index expects normalized vectors | $\|e\| = 1$ |
| Chunk audit | inspect chunks before blaming the retriever | answer span intact in some chunk |
| Gold retrieval set | keep queries with known supporting documents | pairs $(q, d^*)$ |
| Context budget tests | evaluate different k, chunk length, and overlap | $k, L, o$ with $kL \le B$ |
| End-to-end traces | log query, retrieved docs, reranker scores, prompt, answer, and citations | one record per pipeline arrow |
10.1 Embedding normalization
Main idea. Know whether the index expects normalized vectors.
Core relation:
$\cos(q, d) = \dfrac{q \cdot d}{\|q\|\,\|d\|}$; an inner-product index returns cosine scores only if every stored vector and every query vector is unit-normalized.
Mixing conventions is a silent failure: scores still come back and rankings still look plausible, but high-norm documents dominate regardless of direction.
Worked micro-example. A re-embedding job forgets the normalization step for newly added documents. Their norms average several times the old ones, so new documents crowd the top-k for every query until the index is rebuilt.
Implementation check. Assert $\|e\| \approx 1$ at index-build time and at query time, and record which convention (cosine vs. raw inner product) the index was built with.
AI connection. Normalization is the cheapest retrieval bug to prevent and one of the most expensive to discover in production.
Common mistake. Normalizing queries but not documents, or vice versa, after a pipeline refactor.
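A build-time guard as a minimal sketch (the tolerance value is an arbitrary choice):

```python
import numpy as np

# Fail fast if any embedding deviates from unit norm before indexing.
def assert_unit_norm(embeddings, tol=1e-3):
    norms = np.linalg.norm(embeddings, axis=1)
    bad = np.where(np.abs(norms - 1.0) > tol)[0]
    if bad.size:
        raise ValueError(
            f"{bad.size} embeddings are not unit-normalized, "
            f"e.g. index {bad[0]} with norm {norms[bad[0]]:.3f}"
        )
```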
10.2 Chunk audit
Main idea. Inspect chunks before blaming the retriever.
Core relation:
the generator can only use text that survives chunking; if the answer span is not intact in some chunk, no retriever setting can fix it.
Audit chunks as text a human would read. Truncated sentences, tables flattened into word soup, boilerplate headers repeated in every chunk, and code split mid-function are all chunking bugs that masquerade as retrieval failures.
Worked micro-example. A PDF pipeline emits chunks where table rows are interleaved with footer text; the retriever ranks them well by keyword, but the generator cannot reconstruct the table values.
Implementation check. Sample and read chunks regularly; for a gold set, verify the answer string appears intact in at least one chunk of the source document.
AI connection. Chunk quality upper-bounds everything downstream, which is why the audit comes before any model tuning.
Common mistake. Retraining the retriever to fix what is actually a document-parsing bug.
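A toy audit sampler with two cheap red flags, near-empty chunks and truncated endings; the thresholds here are hypothetical, not recommended values.

```python
import random

# Sample chunks and print cheap quality flags for manual review.
def audit_chunks(chunks, n=20, min_chars=80):
    for c in random.sample(chunks, min(n, len(chunks))):
        flags = []
        if len(c) < min_chars:
            flags.append("too short")
        if not c.rstrip().endswith((".", "!", "?", '"')):
            flags.append("truncated?")
        print(flags or ["ok"], c[:120].replace("\n", " "))
```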
10.3 Gold retrieval set
Main idea. Keep queries with known supporting documents.
Core relation:
a gold retrieval set is a list of pairs $(q, d^*)$, or better $(q, \text{span})$, on which $\mathrm{recall@}k$ and MRR can be recomputed after every pipeline change.
Without a gold set, every retriever tweak is evaluated by anecdote. Even 50 to 100 labeled pairs covering the main query types make regressions visible within minutes of a change.
Worked micro-example. After switching embedding models, recall@10 on the gold set drops on how-to queries while factoid queries are unaffected, pointing at the new model's weakness rather than at a chunking change.
Implementation check. Store gold pairs under version control next to the corpus snapshot they refer to, and re-run them in CI on every index rebuild.
AI connection. The gold set is to retrieval what a unit-test suite is to code.
Common mistake. Labeling gold documents but not gold spans, which hides the bad-chunk failure mode of Section 9.2.
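One possible gold-set format and evaluation loop; the JSONL file name, record fields, and `retrieve` function are assumptions for illustration, not a fixed schema.

```python
import json

# Each JSONL record: {"query": ..., "gold_ids": [...]}
def load_gold(path="gold_retrieval.jsonl"):
    with open(path) as f:
        return [json.loads(line) for line in f]

def evaluate(gold, retrieve, k=10):
    hits = sum(
        bool(set(g["gold_ids"]) & set(retrieve(g["query"])[:k])) for g in gold
    )
    return hits / len(gold)
```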
10.4 Context budget tests
Main idea. Evaluate different k, chunk length, and overlap.
Core relation:
sweep $(k, L, o)$, the number of chunks, chunk length, and overlap, subject to the packing budget $k \cdot L \le B$ tokens, and measure answer quality per configuration.
The optimum is task-dependent: long chunks preserve local coherence but waste budget on irrelevant text, while many short chunks raise recall but increase the chance the useful span is buried mid-context.
Worked micro-example. With a budget of B = 4000 tokens, compare (k=20, L=200), (k=10, L=400), and (k=5, L=800) at overlaps of 0 and 50 tokens; the best configuration often differs between factoid and summarization queries.
Implementation check. Grid-search a small set of $(k, L, o)$ values on the gold set rather than tuning them by intuition.
AI connection. The context budget is where retrieval math meets serving cost: every configuration trades recall against latency and tokens billed.
Common mistake. Tuning k alone while leaving chunk length and overlap at whatever the first prototype used.
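A minimal sweep harness under the constraint $kL \le B$; the budget value and the `quality_fn` callback are placeholders for your own evaluation.

```python
from itertools import product

BUDGET = 4000  # assumed context budget in tokens

# Enumerate configurations that fit the budget and score each one.
def sweep(quality_fn, ks=(5, 10, 20), lens=(200, 400, 800), overlaps=(0, 50)):
    results = {}
    for k, chunk_len, overlap in product(ks, lens, overlaps):
        if k * chunk_len > BUDGET:
            continue  # violates the packing budget k * L <= B
        results[(k, chunk_len, overlap)] = quality_fn(k, chunk_len, overlap)
    return results

# Example with a stand-in quality function:
# print(sweep(lambda k, L, o: 0.0))
```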
10.5 End-to-end traces
Main idea. Log query, retrieved docs, reranker scores, prompt, answer, and citations.
Core relation:
every arrow in the pipeline diagram has an observable intermediate; a trace records all of them for one query, so a failure can be assigned to exactly one arrow.
A complete trace holds the query text, the query embedding norm, the top-k ids and scores from first-stage search, the retrieved chunk texts, the reranker scores, the final packed prompt, the model answer, and the citations. With that record in hand, the failure modes of Section 9 become lookups instead of investigations.
Worked micro-example. A user reports a wrong answer. The trace shows the gold chunk at first-stage rank 3, reranked down to rank 9, and dropped by the packer: the bug is in the reranker, and neither the retriever nor the generator needs touching.
Implementation check. Store traces under a per-query id so any reported answer can be replayed end to end.
AI connection. A RAG trace is the fastest way to locate whether failure came from search, ranking, packing, or generation.
Common mistake. Do not evaluate only final answers. Measure retrieval recall, reranker precision, context-packing quality, and answer faithfulness separately.
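One way to structure such a record, as a sketch; the field names mirror the checklist above and are illustrative, not a library API.

```python
from dataclasses import dataclass, field

# One trace object per query, populated as the request moves through the pipeline.
@dataclass
class RagTrace:
    query: str
    query_norm: float = 0.0
    topk_ids: list = field(default_factory=list)
    topk_scores: list = field(default_factory=list)
    chunks: list = field(default_factory=list)
    rerank_scores: list = field(default_factory=list)
    prompt: str = ""
    answer: str = ""
    citations: list = field(default_factory=list)
```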
Practice Exercises
- Normalize vectors and compute cosine similarities.
- Retrieve top-k documents by dot product.
- Compute a toy BM25-style lexical score.
- Compute dense contrastive loss with one positive and negatives.
- Compute recall@k and MRR.
- Use MMR to select diverse chunks.
- Pack chunks into a context budget.
- Combine rankings with reciprocal rank fusion.
- Decompose an end-to-end RAG failure.
- Write a RAG trace checklist.
Why This Matters for AI
RAG is often the cheapest way to update knowledge, cite sources, and ground answers. But RAG is not magic. Retrieval can miss the answer, rank distractors above evidence, split the useful span across chunks, or feed the generator context it ignores. Good RAG work is measurement-heavy.
Bridge to Serving and Systems Tradeoffs
The final LLM math section studies the system-level tradeoffs around serving: batching, latency, throughput, memory, routing, caching, and cost. RAG adds another system layer because retrieval latency and context length feed directly into serving latency.
References
- Patrick Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", 2020: https://arxiv.org/abs/2005.11401
- Vladimir Karpukhin et al., "Dense Passage Retrieval for Open-Domain Question Answering", 2020: https://arxiv.org/abs/2004.04906
- Omar Khattab and Matei Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT", 2020: https://arxiv.org/abs/2004.12832
- Jeff Johnson, Matthijs Douze, and Herve Jegou, "Billion-scale similarity search with GPUs", 2017: https://arxiv.org/abs/1702.08734
- Stephen Robertson and Hugo Zaragoza, "The Probabilistic Relevance Framework: BM25 and Beyond", 2009: https://www.nowpublishers.com/article/Details/INR-019