RAG Math and Retrieval

Retrieval-augmented generation adds an external memory system to an LLM. The math is a pipeline: embed, search, rank, pack context, generate, and verify attribution.

Overview

The central RAG conditional is:

p_\theta(y\mid q,R_k(q)),

where $q$ is the query and $R_k(q)$ is the set of retrieved chunks. RAG succeeds only when the right information is retrieved, ranked highly, packed into context, and used by the generator.

Prerequisites

  • Embedding-space geometry and cosine similarity
  • Conditional language-model probability
  • Efficient inference and context-window constraints
  • Evaluation metrics and error analysis

Companion Notebooks

Notebook | Purpose
theory.ipynb | Demonstrates cosine search, BM25 intuition, contrastive retrieval loss, recall@k, MMR, reranking, context packing, and RAG failure decomposition.
exercises.ipynb | Ten practice problems for retrieval scores, recall, MRR, chunk packing, MMR, RRF, and RAG diagnostics.

Learning Objectives

After this section, you should be able to:

  • Define RAG as conditional generation with retrieved non-parametric memory.
  • Compute dot-product and cosine retrieval scores.
  • Explain sparse, dense, hybrid, late-interaction, and cross-encoder retrieval.
  • Write the contrastive loss for dense retriever training.
  • Compute recall@k, MRR, and simple nDCG.
  • Explain chunk length, overlap, MMR, and context packing.
  • Explain ANN recall-latency tradeoffs.
  • Diagnose RAG failures with traces and ablations.

Table of Contents

  1. RAG as Conditional Generation
  2. Similarity Spaces
  3. Sparse and Dense Retrieval
  4. Retriever Training
  5. Approximate Nearest Neighbor Search
  6. Chunking and Context Packing
  7. Reranking and Fusion
  8. RAG Evaluation
  9. Failure Modes
  10. Implementation Checklist

Pipeline Diagram

query -> query encoder -> vector search -> top-k chunks -> reranker -> context packer -> LLM -> answer + citations

Each arrow can fail. The math gives you probes for each arrow.

1. RAG as Conditional Generation

This part studies RAG as conditional generation, treated as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

Subtopic | Question | Formula
Parametric memory | knowledge stored in model weights | $p_\theta(y\mid q)$
Non-parametric memory | knowledge stored in an external corpus | $D=\{d_i\}_{i=1}^n$
Retriever | select relevant documents for a query | $R_k(q)=\mathrm{TopK}_{d_i\in D}\ s(q,d_i)$
Generator | answer conditioned on query and retrieved context | $p_\theta(y\mid q,R_k(q))$
Failure decomposition | RAG can fail at retrieval, ranking, context packing, or generation | $P(\mathrm{correct})=P(\mathrm{retrieve})P(\mathrm{use})P(\mathrm{generate})$

1.1 Parametric memory

Main idea. Knowledge stored in model weights.

Core relation:

p_\theta(y\mid q)

RAG changes the conditional distribution by adding retrieved evidence to the prompt. The retrieval system and generator should be evaluated separately and together. A high-quality generator cannot compensate for missing evidence, and a high-recall retriever can still fail if it returns noisy chunks or if the prompt buries the useful span.

Worked micro-example. If a query vector and document vectors are unit-normalized, cosine similarity is just a dot product. Retrieval by top-k dot product then selects the documents whose embeddings are most aligned with the query. If vectors are not normalized, high-norm documents can win even when their direction is less relevant.

Implementation check. Log query text, query embedding norm, top-k document ids, scores, chunk text, reranker scores, final prompt, answer, and citations. RAG without traces is guesswork.

Common mistake. Do not evaluate only final answers. Measure retrieval recall, reranker precision, context-packing quality, and answer faithfulness separately.

1.2 Non-parametric memory

Main idea. Knowledge stored in an external corpus.

Core relation:

D=\{d_i\}_{i=1}^n


1.3 Retriever

Main idea. Select relevant documents for a query.

Core relation:

R_k(q)=\mathrm{TopK}_{d_i\in D}\ s(q,d_i)

AI connection. The generator cannot use evidence that retrieval never returns.

1.4 Generator

Main idea. Answer conditioned on query and retrieved context.

Core relation:

p_\theta(y\mid q,R_k(q))


1.5 Failure decomposition

Main idea. RAG can fail at retrieval, ranking, context packing, or generation.

Core relation:

P(\mathrm{correct})=P(\mathrm{retrieve})P(\mathrm{use})P(\mathrm{generate})

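To make the decomposition operational, estimate each factor from logged traces. Below is a minimal sketch, assuming you have labeled each trace with whether the gold evidence was retrieved, whether the generator used it, and whether the answer was correct; the field names are hypothetical labels from error analysis, not a standard schema.

```python
def failure_decomposition(traces):
    """Estimate P(retrieve), P(use | retrieve), P(generate | use) from labeled traces."""
    retrieved = [t for t in traces if t["retrieved"]]
    used = [t for t in retrieved if t["used"]]
    correct = [t for t in used if t["correct"]]
    p_retrieve = len(retrieved) / len(traces)
    p_use = len(used) / max(len(retrieved), 1)
    p_generate = len(correct) / max(len(used), 1)
    return p_retrieve, p_use, p_generate

# Hypothetical labeled traces from an error-analysis pass.
traces = [
    {"retrieved": True, "used": True, "correct": True},
    {"retrieved": True, "used": False, "correct": False},
    {"retrieved": False, "used": False, "correct": False},
    {"retrieved": True, "used": True, "correct": False},
]
p_r, p_u, p_g = failure_decomposition(traces)
print(f"P(retrieve)={p_r:.2f}  P(use)={p_u:.2f}  P(generate)={p_g:.2f}")
print(f"P(correct) ~= {p_r * p_u * p_g:.2f}")
```

Whichever factor is smallest tells you which stage of the pipeline to fix first.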

2. Similarity Spaces

This part studies similarity spaces as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

Subtopic | Question | Formula
Embedding functions | map queries and documents into a shared vector space | $u=f_q(q),\ v_i=f_d(d_i)$
Dot product | inner product rewards alignment and vector norm | $s(u,v)=u^\top v$
Cosine similarity | normalize away vector length | $s(u,v)=u^\top v/(\Vert u\Vert\Vert v\Vert)$
Maximum inner product search | retrieve highest-scoring vectors | $\arg\max_i u^\top v_i$
Normalization | for unit vectors, dot product and cosine are the same | $\Vert u\Vert=\Vert v\Vert=1$

2.1 Embedding functions

Main idea. Map queries and documents into a shared vector space.

Core relation:

u=f_q(q),\ v_i=f_d(d_i)


2.2 Dot product

Main idea. Inner product rewards alignment and vector norm.

Core relation:

s(u,v)=u^\top v


2.3 Cosine similarity

Main idea. Normalize away vector length.

Core relation:

s(u,v)=u^\top v/(\Vert u\Vert\Vert v\Vert)

AI connection. Most RAG bugs start with misunderstanding what the vector store is scoring.

2.4 Maximum inner product search

Main idea. Retrieve highest-scoring vectors.

Core relation:

\arg\max_i u^\top v_i


2.5 Normalization

Main idea. For unit vectors, dot product and cosine are the same.

Core relation:

\Vert u\Vert=\Vert v\Vert=1

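A minimal numpy sketch of the identities above, with illustrative toy vectors: raw dot product lets a high-norm document win, cosine decides by direction, and after unit normalization the two scores coincide.

```python
import numpy as np

# Toy embeddings: d0 points the same way as q; d1 is less aligned but much longer.
q  = np.array([1.0, 0.0])
d0 = np.array([0.9, 0.1])
d1 = np.array([5.0, 5.0])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print("dot:", q @ d0, q @ d1)                 # 0.9 vs 5.0 -- norm dominates
print("cos:", cosine(q, d0), cosine(q, d1))   # ~0.994 vs ~0.707 -- direction decides

unit = lambda v: v / np.linalg.norm(v)
print("unit dot:", unit(q) @ unit(d0), unit(q) @ unit(d1))  # equals cosine
```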

3. Sparse and Dense Retrieval

This part studies sparse and dense retrieval as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

Subtopic | Question | Formula
Sparse lexical retrieval | match query terms to document terms | $\mathrm{BM25}(q,d)$
Dense bi-encoder retrieval | encode query and document separately | $s(q,d)=f_q(q)^\top f_d(d)$
Hybrid retrieval | combine sparse and dense scores | $s=\lambda s_\mathrm{dense}+(1-\lambda)s_\mathrm{sparse}$
Late interaction | score token embeddings after independent encoding | $\sum_{i\in q}\max_{j\in d} e_i^\top e_j$
Cross-encoder reranking | jointly encode query and document for a slower, stronger score | $s=g(q,d)$

3.1 Sparse lexical retrieval

Main idea. Match query terms to document terms.

Core relation:

\mathrm{BM25}(q,d)

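The notes leave BM25 as a black box, so here is a standard Okapi BM25 scorer as a reference sketch; k1 and b are the usual free parameters, and the toy corpus and whitespace tokenization are assumptions for illustration only.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Standard Okapi BM25 for one query-document pair."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)            # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        denom = tf[t] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom          # length-normalized TF
    return score

corpus = [["vector", "search"], ["sparse", "lexical", "retrieval"], ["dense", "retrieval"]]
print(bm25_score(["retrieval"], corpus[1], corpus))
```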

3.2 Dense bi-encoder retrieval

Main idea. Encode query and document separately.

Core relation:

s(q,d)=f_q(q)^\top f_d(d)


3.3 Hybrid retrieval

Main idea. Combine sparse and dense scores.

Core relation:

s=\lambda s_\mathrm{dense}+(1-\lambda)s_\mathrm{sparse}

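Dense and sparse scores live on different scales, so a common move is to standardize each candidate list before mixing. A sketch, assuming both lists score the same candidates; the choice of lambda is a tuning decision, not a prescribed value.

```python
import numpy as np

def zscore(s):
    s = np.asarray(s, dtype=float)
    return (s - s.mean()) / (s.std() + 1e-8)  # epsilon guards constant lists

def hybrid_scores(dense, sparse, lam=0.5):
    """Blend z-normalized dense and sparse scores for the same candidates."""
    return lam * zscore(dense) + (1 - lam) * zscore(sparse)

dense  = [0.82, 0.79, 0.40]   # cosine similarities
sparse = [12.1, 3.3, 9.8]     # BM25 scores on a different scale
print(hybrid_scores(dense, sparse))
```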

3.4 Late interaction

Main idea. Score token embeddings after independent encoding.

Core relation:

\sum_{i\in q}\max_{j\in d} e_i^\top e_j

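A minimal MaxSim sketch in numpy, assuming token embeddings are already unit-normalized (as ColBERT-style systems do); the random embeddings stand in for real encoder outputs.

```python
import numpy as np

def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: for each query token, take its best-matching
    document token, then sum (rows are unit-normalized embeddings)."""
    sims = query_tokens @ doc_tokens.T   # (n_q, n_d) token-level similarities
    return sims.max(axis=1).sum()        # best doc token per query token

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(20, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))
```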

3.5 Cross-encoder reranking

Main idea. Jointly encode query and document for a slower stronger score.

Core relation:

s=g(q,d)


4. Retriever Training

This part studies retriever training as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

Subtopic | Question | Formula
Positive pairs | train query and relevant document to be close | $(q,d^+)$
Negative pairs | train irrelevant or hard-negative documents to score lower | $(q,d^-)$
Contrastive loss | softmax over one positive and many negatives | $L=-\log\frac{e^{s(q,d^+)}}{e^{s(q,d^+)}+\sum_j e^{s(q,d_j^-)}}$
In-batch negatives | other examples in a batch become negatives | $B-1$ negatives per query
Hard negatives | near misses improve ranking training | $s(q,d^-)$ high but label negative

4.1 Positive pairs

Main idea. Train query and relevant document to be close.

Core relation:

(q,d^+)


4.2 Negative pairs

Main idea. Train irrelevant or hard-negative documents to score lower.

Core relation:

(q,d^-)


4.3 Contrastive loss

Main idea. Softmax over one positive and negatives.

Core relation:

L=-\log\frac{e^{s(q,d^+)}}{e^{s(q,d^+)}+\sum_j e^{s(q,d_j^-)}}

AI connection. This is the training objective behind many dense retrievers.
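
A numpy sketch of this loss with in-batch negatives: each row's diagonal entry is the positive pair, and the other $B-1$ documents in the batch serve as negatives. The temperature value is an illustrative assumption.

```python
import numpy as np

def in_batch_contrastive_loss(q_emb, d_emb, temperature=0.05):
    """InfoNCE over a batch: positives on the diagonal, B-1 in-batch negatives."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature             # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # mean of -log p(d_i^+ | q_i)

rng = np.random.default_rng(0)
B, dim = 8, 32
print(in_batch_contrastive_loss(rng.normal(size=(B, dim)), rng.normal(size=(B, dim))))
```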

4.4 In-batch negatives

Main idea. Other examples in a batch become negatives.

Core relation:

$B-1$ negatives per query


4.5 Hard negatives

Main idea. Near misses improve ranking training.

Core relation:

$s(q,d^-)$ high but label negative


5. Approximate Nearest Neighbor Search

This part studies approximate nearest neighbor search as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

Subtopic | Question | Formula
Exact search | compare a query to every vector | $O(nd)$
ANN search | trade exactness for latency and memory | $\hat R_k(q)\approx R_k(q)$
Recall at k | measure whether relevant docs appear in the top k | $\mathrm{Recall@}k=|\mathrm{Rel}\cap R_k|/|\mathrm{Rel}|$
Index compression | quantize or cluster vectors to reduce memory | $V\rightarrow \hat V$
Latency-recall tradeoff | faster search can miss relevant documents | $T\downarrow,\ \mathrm{recall}\downarrow$

5.1 Exact search

Main idea. Compare a query to every vector.

Core relation:

O(nd)


5.2 ANN search

Main idea. Trade exactness for latency and memory.

Core relation:

\hat R_k(q)\approx R_k(q)


5.3 Recall at k

Main idea. Measure whether relevant docs appear in top k.

Core relation:

\mathrm{Recall@}k=|\mathrm{Rel}\cap R_k|/|\mathrm{Rel}|

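Recall@k as code, directly mirroring the formula; the document ids are illustrative.

```python
def recall_at_k(relevant_ids, retrieved_ids, k):
    """|Rel ∩ R_k| / |Rel| for a single query."""
    top_k = set(retrieved_ids[:k])
    return len(set(relevant_ids) & top_k) / len(relevant_ids)

print(recall_at_k({"d2", "d7"}, ["d7", "d1", "d4", "d2"], k=3))  # 0.5
```

Averaging this over a query set gives the retrieval-stage number to track when tuning an ANN index.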

5.4 Index compression

Main idea. Quantize or cluster vectors to reduce memory.

Core relation:

V\rightarrow \hat V


5.5 Latency-recall tradeoff

Main idea. Faster search can miss relevant documents.

Core relation:

T\downarrow,\ \mathrm{recall}\downarrow


6. Chunking and Context Packing

This part studies chunking and context packing as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

Subtopic | Question | Formula
Chunk length | split documents into retrievable units | $d\rightarrow c_1,\ldots,c_m$
Overlap | repeat boundary tokens to avoid cutting facts | $c_i=[t_a,\ldots,t_b],\ c_{i+1}=[t_{b-o},\ldots]$
Packing budget | retrieved chunks must fit the context window | $\sum_i |c_i|\le T_\mathrm{context}$
Diversity | avoid filling context with near-duplicates | $\mathrm{MMR}=\lambda s(q,d)-(1-\lambda)\max_{d'\in S}s(d,d')$
Lost-in-context risk | retrieved text must be ordered and summarized so the generator can use it | $p(y\mid q,c_{1:k})$ depends on packing

6.1 Chunk length

Main idea. Split documents into retrievable units.

Core relation:

d\rightarrow c_1,\ldots,c_m


6.2 Overlap

Main idea. Repeat boundary tokens to avoid cutting facts.

Core relation:

c_i=[t_a,\ldots,t_b],\ c_{i+1}=[t_{b-o},\ldots]

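A sketch of fixed-length chunking with overlap, operating on a plain token list for simplicity; real systems would use the model's own tokenizer.

```python
def chunk_with_overlap(tokens, size=200, overlap=40):
    """Split a token list into chunks of `size` tokens, repeating `overlap`
    boundary tokens so facts at chunk edges are not cut in half."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

tokens = [f"t{i}" for i in range(10)]
print(chunk_with_overlap(tokens, size=4, overlap=1))
# [['t0','t1','t2','t3'], ['t3','t4','t5','t6'], ['t6','t7','t8','t9'], ['t9']]
```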

6.3 Packing budget

Main idea. Retrieved chunks must fit context.

Core relation:

\sum_i |c_i|\le T_\mathrm{context}

AI connection. Retrieval success still fails if the evidence is packed badly into context.
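
A greedy packing sketch: take chunks in rank order while they fit the budget. Token counting here is whitespace-based for illustration; swap in a real tokenizer in practice.

```python
def pack_context(ranked_chunks, budget_tokens):
    """Greedy packing: keep chunks in rank order while they fit the budget."""
    packed, used = [], 0
    for chunk in ranked_chunks:
        n = len(chunk.split())        # stand-in for a real tokenizer
        if used + n > budget_tokens:
            continue                  # skip chunks that would overflow
        packed.append(chunk)
        used += n
    return packed, used

chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
print(pack_context(chunks, budget_tokens=5))
# (['alpha beta gamma', 'delta epsilon'], 5)
```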

6.4 Diversity

Main idea. Avoid filling context with near-duplicates.

Core relation:

\mathrm{MMR}=\lambda s(q,d)-(1-\lambda)\max_{d'\in S}s(d,d')

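Greedy MMR selection as code, assuming unit-normalized embeddings so dot products are cosine similarities; the random vectors stand in for real chunk embeddings.

```python
import numpy as np

def mmr_select(query, docs, k, lam=0.7):
    """Greedy MMR: trade relevance to the query against redundancy with
    already-selected documents (docs are unit-normalized row vectors)."""
    rel = docs @ query                  # relevance of each doc to the query
    selected, remaining = [], list(range(len(docs)))
    while remaining and len(selected) < k:
        def mmr(i):
            redundancy = max((docs[i] @ docs[j] for j in selected), default=0.0)
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)  # highest MMR among unselected docs
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
docs = rng.normal(size=(6, 8)); docs /= np.linalg.norm(docs, axis=1, keepdims=True)
q = docs[0] + 0.1 * rng.normal(size=8); q /= np.linalg.norm(q)
print(mmr_select(q, docs, k=3))
```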

6.5 Lost-in-context risk

Main idea. Retrieved text must be ordered and summarized so the generator can use it.

Core relation:

$p(y\mid q,c_{1:k})$ depends on packing


7. Reranking and Fusion

This part studies reranking and fusion as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

Subtopic | Question | Formula
First-stage recall | retrieve broad candidates cheaply | $K_\mathrm{first}\gg K_\mathrm{final}$
Reranker precision | rerank candidates with a stronger model | $s_\mathrm{rerank}(q,d)$
Reciprocal rank fusion | combine ranked lists robustly | $\mathrm{RRF}(d)=\sum_m 1/(k+r_m(d))$
Score calibration | dense, sparse, and reranker scores may live on different scales | $z=(s-\mu)/\sigma$
Citation selection | answer citations should correspond to evidence actually used | $d_i\rightarrow \mathrm{claim}_j$

7.1 First-stage recall

Main idea. Retrieve broad candidates cheaply.

Core relation:

K_\mathrm{first}\gg K_\mathrm{final}


7.2 Reranker precision

Main idea. Rerank candidates with a stronger model.

Core relation:

s_\mathrm{rerank}(q,d)


7.3 Reciprocal rank fusion

Main idea. Combine ranked lists robustly.

Core relation:

\mathrm{RRF}(d)=\sum_m 1/(k+r_m(d))

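Reciprocal rank fusion as code. The constant k=60 is the value commonly used in the RRF literature, but it is a tunable parameter; the document ids are illustrative.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1/(k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_list  = ["d3", "d1", "d7"]
sparse_list = ["d1", "d9", "d3"]
print(rrf([dense_list, sparse_list]))  # 'd1' and 'd3' rise to the top
```

Because RRF uses only ranks, it sidesteps the score-calibration problem in the next subtopic.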

7.4 Score calibration

Main idea. Dense, sparse, and reranker scores may live on different scales.

Core relation:

z=(s-\mu)/\sigma


7.5 Citation selection

Main idea. Answer citations should correspond to evidence actually used.

Core relation:

d_i\rightarrow \mathrm{claim}_j


8. RAG Evaluation

This part studies RAG evaluation as retrieval math for LLM systems. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

Subtopic | Question | Formula
Retrieval metrics | measure search independently of generation | $\mathrm{Recall@}k,\ \mathrm{MRR},\ \mathrm{nDCG}$
Answer metrics | measure final response correctness and faithfulness | $S_\mathrm{answer}$
Attribution | claims should be supported by retrieved evidence | $\mathrm{support}(\mathrm{claim},d_i)$
Ablations | compare no-retrieval, sparse, dense, hybrid, and reranked variants | $\Delta S$
Dataset drift | retrieval quality changes when the corpus or query distribution changes | $p_\mathrm{query}$, $D$ drift

8.1 Retrieval metrics

Main idea. Measure search independently of generation.

Core relation:

\mathrm{Recall@}k,\ \mathrm{MRR},\ \mathrm{nDCG}

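MRR and a simple binary-relevance nDCG as code; these are the standard textbook formulas, not tied to any particular library, and the ids are illustrative.

```python
import math

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant document per query."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked_ids, relevant, k):
    """Binary-relevance nDCG: DCG of the ranking over DCG of the ideal ranking."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, d in enumerate(ranked_ids[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

print(mrr([["d2", "d5"], ["d9", "d1"]], [{"d5"}, {"d9"}]))  # (1/2 + 1) / 2 = 0.75
print(ndcg_at_k(["d5", "d2", "d7"], {"d2", "d7"}, k=3))
```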

8.2 Answer metrics

Main idea. Measure final response correctness and faithfulness.

Core relation:

S_\mathrm{answer}


8.3 Attribution

Main idea. Claims should be supported by retrieved evidence.

Core relation:

\mathrm{support}(\mathrm{claim},d_i)


8.4 Ablations

Main idea. Compare no retrieval, sparse, dense, hybrid, and reranked variants.

Core relation:

\Delta S


8.5 Dataset drift

Main idea. Retrieval quality changes when corpus or query distribution changes.

Core relation:

$p_\mathrm{query}$, $D$ drift

Retrieval quality is not stationary. New documents shift score distributions and index statistics, and new user behavior shifts the query distribution. Re-run the gold retrieval set on a schedule and watch recall@k and score histograms: a drop without a code change usually means the corpus or the queries drifted.

Common mistake. Benchmarking once at launch. Drift monitoring must be recurring, or regressions surface as user complaints.
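One cheap drift probe, as a sketch: compare this week's top-1 retrieval scores against a frozen reference window. The normal samples here are synthetic placeholders for logged scores.

```python
# Standardized mean shift of top-1 scores between a reference window and
# the current window. A large absolute value flags drift worth investigating.
import numpy as np

def score_drift(ref_scores, new_scores):
    ref, new = np.asarray(ref_scores), np.asarray(new_scores)
    pooled_std = np.sqrt((ref.var() + new.var()) / 2) + 1e-8
    return (new.mean() - ref.mean()) / pooled_std

ref = np.random.default_rng(0).normal(0.72, 0.05, 500)   # toy reference scores
new = np.random.default_rng(1).normal(0.61, 0.07, 500)   # toy current scores
print(f"drift statistic: {score_drift(ref, new):+.2f}")
```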

9. Failure Modes

This part catalogs how RAG pipelines fail and pairs each failure with a checkable condition. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Missed retrieval | the answer document is not in top k | d^+\notin R_k(q) |
| Bad chunk | the right document is retrieved but not the right span | c^+\notin R_k(q) |
| Distractor context | irrelevant high-scoring chunks pull generation away | s(q,d^-)>s(q,d^+) |
| Generator ignores evidence | the answer is not grounded even with good retrieval | p_\theta(y\mid q,R_k) uses parametric prior |
| Citation mismatch | the cited chunk does not support the claim | \mathrm{claim}\not\subset d_\mathrm{cited} |
9.1 Missed retrieval

Main idea. The answer document is not in top k.

Core relation:

d^+\notin R_k(q)

This is the failure to rule out first: if the supporting document never enters the top k, nothing downstream can recover it. Check recall@k on the gold set before touching the reranker or the prompt. Common causes are vocabulary mismatch between queries and documents, a stale index, inconsistent normalization, or k set too small.
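A sketch of the first diagnostic: find where the gold document actually ranks under dot-product search, using synthetic unit-normalized embeddings.

```python
# Rank of the gold document under dot-product retrieval. If the rank
# exceeds k, the failure is missed retrieval, not ranking or generation.
import numpy as np

def gold_rank(query_vec, doc_vecs, gold_idx):
    scores = doc_vecs @ query_vec
    order = np.argsort(-scores)
    return int(np.where(order == gold_idx)[0][0]) + 1   # 1-based rank

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)     # unit-normalize
q = docs[17] + 0.1 * rng.normal(size=8)                 # query near doc 17
q /= np.linalg.norm(q)
rank = gold_rank(q, docs, gold_idx=17)
print(f"gold rank = {rank}; missed at k=5: {rank > 5}")
```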

9.2 Bad chunk

Main idea. The right document is retrieved but not the right span.

Core relation:

c^+\notin R_k(q)

Document-level recall can look fine while chunk-level recall fails: the chunker may split the answer span across two chunks, or attach it to a chunk whose embedding is dominated by unrelated content. Audit chunk boundaries around gold spans; increasing overlap or shrinking chunk length often fixes this without touching the retriever.
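A sketch showing how overlap can rescue a span that a fixed-size chunker would otherwise split; `chunk_text` is a toy character-level chunker, not a library function.

```python
# Check whether a gold answer span survives chunking intact under
# different chunk sizes and overlaps.

def chunk_text(text: str, size: int, overlap: int):
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

span = "the answer span lives here"
doc = "alpha " * 18 + span + " omega " * 20   # span straddles the 120-char boundary
for size, overlap in [(120, 0), (120, 40)]:
    chunks = chunk_text(doc, size, overlap)
    intact = any(span in c for c in chunks)
    print(f"size={size} overlap={overlap}: span intact in some chunk = {intact}")
```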

9.3 Distractor context

Main idea. Irrelevant high-scoring chunks pull generation away.

Core relation:

s(q,d^-)>s(q,d^+)

Even when the supporting document is retrieved, a distractor that scores higher can dominate the prompt. The diagnostic is the score margin between the gold document and the best distractor: small or negative margins predict fluent but wrong answers. Rerankers and MMR exist precisely to push distractors down or out.
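A minimal margin check, assuming you already have retrieval scores and know which index is gold:

```python
# Score margin between the gold document and the best-scoring distractor.
# A negative margin means a distractor outranks the gold document.
import numpy as np

def gold_margin(scores, gold_idx):
    scores = np.asarray(scores, dtype=float)
    best_distractor = np.max(np.delete(scores, gold_idx))
    return scores[gold_idx] - best_distractor

scores = [0.81, 0.77, 0.84, 0.42]   # toy retrieval scores; gold is index 0
print(f"margin = {gold_margin(scores, gold_idx=0):+.2f}")  # -0.03: distractor wins
```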

9.4 Generator ignores evidence

Main idea. The answer is not grounded even with good retrieval.

Core relation:

p_\theta(y\mid q,R_k) uses the parametric prior

Retrieval can be perfect and the answer still ungrounded: the generator may trust its parametric prior over the context, especially when the evidence contradicts facts that were common in pretraining. The cheapest probe is counterfactual: edit the retrieved evidence and check whether the answer changes. If it does not, the generator is not reading the context.
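A sketch of the counterfactual probe; `generate` is a hypothetical placeholder for your LLM call, and any real API will look different.

```python
# Counterfactual evidence probe: if editing the evidence does not change
# the answer, the generator is ignoring the context.

def generate(query: str, context: str) -> str:
    # Placeholder: a real call to your model goes here.
    return f"answer derived from: {context[:40]}..."

def evidence_sensitive(query, context, edited_context):
    """True if editing the evidence changes the answer."""
    return generate(query, context) != generate(query, edited_context)

ctx = "The capital of Atlantis is Poseidonia."
edited = "The capital of Atlantis is Nereidopolis."
print(evidence_sensitive("capital of Atlantis?", ctx, edited))
```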

9.5 Citation mismatch

Main idea. The cited chunk does not support the claim.

Core relation:

\mathrm{claim}\not\subset d_\mathrm{cited}

A citation is a promise, and citation mismatch breaks it: the answer points at a chunk that neither contains nor entails the claim. This is the most corrosive failure mode because the answer looks verified. Spot-check cited pairs with the same support score used for attribution, and treat low-support citations as hallucinations with footnotes.
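As a sketch reusing the token-overlap proxy from the attribution example, with an arbitrary 0.5 cutoff for flagging weak citations:

```python
# Flag cited chunks whose lexical support for the claim is weak.
# Token overlap is a cheap proxy; entailment models are stronger.

def support_score(claim, chunk):
    c, d = set(claim.lower().split()), set(chunk.lower().split())
    return len(c & d) / max(len(c), 1)

citations = [
    ("revenue grew 12% in 2023", "Revenue grew 12% year over year in 2023"),
    ("the CEO resigned in March", "Quarterly results beat analyst expectations"),
]
for claim, chunk in citations:
    s = support_score(claim, chunk)
    flag = "MISMATCH" if s < 0.5 else "ok"
    print(f"{flag:>8}  support={s:.2f}  claim={claim!r}")
```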

10. Implementation Checklist

This part turns the preceding math into a deployment checklist. Keep separate the embedding space, search algorithm, ranking metric, context budget, and generator behavior.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Embedding normalization | know whether the index expects normalized vectors | v\leftarrow v/\Vert v\Vert |
| Chunk audit | inspect chunks before blaming the retriever | c_i |
| Gold retrieval set | keep queries with known supporting documents | D_\mathrm{gold} |
| Context budget tests | evaluate different k, chunk length, and overlap | k, o, \vert c\vert |
| End-to-end traces | log query, retrieved docs, reranker scores, prompt, answer, and citations | \mathrm{trace} |

10.1 Embedding normalization

Main idea. Know whether the index expects normalized vectors.

Core relation:

v\leftarrow v/\Vert v\Vert

Cosine indexes assume unit vectors; inner-product indexes do not. Mixing the two silently changes the ranking: under raw dot product, a long generic document with a large-norm embedding can outrank a short precise one. Normalize at index time and query time, or at neither, and assert the choice in code.
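A two-vector demonstration of why the choice matters: the same pair ranks differently under raw dot product and cosine similarity.

```python
# The same vectors, ranked by raw dot product vs. cosine similarity.
import numpy as np

q = np.array([1.0, 0.0])
d_aligned = np.array([0.9, 0.1])   # well-aligned, small norm
d_big = np.array([2.0, 2.0])       # poorly aligned, large norm

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print("dot:    aligned =", q @ d_aligned, " big =", q @ d_big)   # big norm wins
print("cosine: aligned =", round(cosine(q, d_aligned), 3),
      " big =", round(cosine(q, d_big), 3))                      # alignment wins
```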

10.2 Chunk audit

Main idea. Inspect chunks before blaming the retriever.

Core relation:

c_i

Many "retrieval" bugs are chunking bugs: boilerplate headers embedded as content, tables flattened into noise, answer spans cut in half. Before tuning the retriever, read a random sample of chunks plus the chunks retrieved for failing queries. Five minutes of reading often saves a week of index tuning.
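A sketch of a chunk audit that samples chunks and surfaces cheap red flags; the thresholds (length under 40 characters, alphabetic ratio under 0.5) are arbitrary starting points.

```python
# Sample chunks and flag obvious junk before blaming the retriever.
import random

def audit(chunks, n=5, seed=0):
    random.seed(seed)
    seen = set()
    for c in random.sample(chunks, min(n, len(chunks))):
        alpha = sum(ch.isalpha() for ch in c) / max(len(c), 1)
        flags = []
        if len(c) < 40:
            flags.append("short")
        if alpha < 0.5:
            flags.append("non-text")
        if c in seen:
            flags.append("duplicate")
        seen.add(c)
        print(f"[{','.join(flags) or 'ok'}] {c[:60]!r}")

audit(["Page 3 of 12", "=== === === ===",
       "Revenue grew 12% in 2023 driven by cloud."])
```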

10.3 Gold retrieval set

Main idea. Keep queries with known supporting documents.

Core relation:

D_\mathrm{gold}

A gold set of query and supporting-document pairs is the regression test for retrieval. Keep it small enough to run on every index rebuild and embedding-model change, and diverse enough to cover the query types you actually see. Recall@k on the gold set is the first number to check when anything upstream changes.
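A minimal recall@k evaluation over a gold set; `retrieve`, the document ids, and the queries are hypothetical stand-ins.

```python
# Recall@k over a gold set of (query, gold_doc_id) pairs.

def recall_at_k(retrieve, gold_pairs, k=5):
    hits = sum(gold in retrieve(q)[:k] for q, gold in gold_pairs)
    return hits / len(gold_pairs)

gold_pairs = [("how to reset password", "doc_authn"),
              ("refund policy", "doc_billing")]
retrieve = lambda q: ["doc_authn", "doc_faq"] if "password" in q else ["doc_faq"]
print(f"recall@5 = {recall_at_k(retrieve, gold_pairs):.2f}")  # 0.50: billing doc missed
```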

10.4 Context budget tests

Main idea. Evaluate different k, chunk length, and overlap.

Core relation:

k, o, |c|

Answer quality is not monotone in k: more chunks raise recall but also add distractors and push evidence toward the middle of the prompt, where models attend least. Sweep k, chunk length, and overlap jointly on the gold set rather than tuning each knob in isolation.
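A sketch of a joint sweep; `evaluate_config` is a placeholder returning a synthetic score, where a real hook would run retrieval, generation, and grading over the gold set.

```python
# Grid-sweep the context-budget knobs against a fixed evaluation hook.
import itertools

def evaluate_config(k, chunk_len, overlap):
    # Placeholder score; replace with an end-to-end gold-set evaluation.
    return 0.6 + 0.02 * k - 0.0001 * chunk_len - 0.001 * abs(overlap - 32)

grid = itertools.product([3, 5, 10], [256, 512], [0, 32, 64])
best = max(grid, key=lambda cfg: evaluate_config(*cfg))
print("best (k, chunk_len, overlap):", best)
```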

10.5 End-to-end traces

Main idea. Log query, retrieved docs, reranker scores, prompt, answer, and citations.

Core relation:

\mathrm{trace}

A trace ties one query to everything the pipeline did with it: the query text and embedding norm, top-k document ids and scores, chunk text, reranker scores, the packed prompt, the answer, and the citations.

AI connection. A RAG trace is the fastest way to locate whether failure came from search, ranking, packing, or generation. Without traces, debugging is guesswork.
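A sketch of a trace record; the field names mirror the checklist above but are illustrative, not a fixed schema.

```python
# One trace per query, capturing every stage the pipeline ran.
from dataclasses import dataclass, field

@dataclass
class RagTrace:
    query: str
    query_norm: float
    top_k_ids: list
    scores: list
    reranker_scores: list = field(default_factory=list)
    prompt: str = ""
    answer: str = ""
    citations: list = field(default_factory=list)

trace = RagTrace(query="refund policy", query_norm=1.0,
                 top_k_ids=["doc_billing", "doc_faq"], scores=[0.81, 0.64])
print(trace)
```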


Practice Exercises

  1. Normalize vectors and compute cosine similarities.
  2. Retrieve top-k documents by dot product.
  3. Compute a toy BM25-style lexical score.
  4. Compute dense contrastive loss with one positive and negatives.
  5. Compute recall@k and MRR.
  6. Use MMR to select diverse chunks.
  7. Pack chunks into a context budget.
  8. Combine rankings with reciprocal rank fusion.
  9. Decompose an end-to-end RAG failure.
  10. Write a RAG trace checklist.

Why This Matters for AI

RAG is often the cheapest way to update knowledge, cite sources, and ground answers. But RAG is not magic. Retrieval can miss the answer, rank distractors above evidence, split the useful span across chunks, or feed the generator context it ignores. Good RAG work is measurement-heavy.

Bridge to Serving and Systems Tradeoffs

The final LLM math section studies the system-level tradeoffs around serving: batching, latency, throughput, memory, routing, caching, and cost. RAG adds another system layer because retrieval latency and context length feed directly into serving latency.
