Embedding Space Math: Part 7: Attention Bridge to References
7. Attention Bridge
Attention Bridge connects token ids to continuous vectors and prepares the exact geometry used by attention, language-model logits, and dense retrieval.
7.1 Query key value projections
Purpose. Query, key, and value projections focus on the embedding space after learned linear maps. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Embeddings become contextual through projections, attention, MLPs, and residual additions. Attention compares projected vectors, not raw token ids.
Worked reading.
The query and key projections turn hidden states into vectors whose dot products define attention weights.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | selects rows of the embedding table |
| embedding table | $\lvert\mathcal{V}\rvert \times d$ | maps each token id to a continuous vector |
| hidden states | contextual vectors after lookup and layers | input to attention and the LM head |
| LM head | $d \times \lvert\mathcal{V}\rvert$ | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- QKV projections.
- residual stream states.
- dense retrieval vectors.
Non-examples:
- nearest neighbors over integer ids.
- one fixed meaning for a token in every context.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
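Below is a minimal NumPy sketch of single-head query, key, and value projections followed by scaled dot-product attention weights. The batch size, sequence length, widths, and random weight matrices are illustrative assumptions standing in for trained parameters, not any particular model's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

B, T, d, d_head = 2, 4, 8, 8            # batch, sequence length, model width, head width (toy sizes)
hidden = rng.normal(size=(B, T, d))     # residual-stream states, shape (B, T, d)

# Learned linear maps; random stand-ins for trained weights.
W_q = rng.normal(size=(d, d_head)) / np.sqrt(d)
W_k = rng.normal(size=(d, d_head)) / np.sqrt(d)
W_v = rng.normal(size=(d, d_head)) / np.sqrt(d)

Q = hidden @ W_q                        # (B, T, d_head)
K = hidden @ W_k                        # (B, T, d_head)
V = hidden @ W_v                        # (B, T, d_head)

# Scaled dot products between queries and keys define the attention weights.
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (B, T, T)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)          # softmax over keys
context = weights @ V                                    # (B, T, d_head)

print(Q.shape, scores.shape, context.shape)
```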
7.2 Attention as soft nearest-neighbor lookup
Purpose. Attention as soft nearest-neighbor lookup focuses on dot products over projected embeddings. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Embeddings become contextual through projections, attention, MLPs, and residual additions. Attention compares projected vectors, not raw token ids.
Worked reading.
The query and key projections turn hidden states into vectors whose dot products define attention weights.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | selects rows of the embedding table |
| embedding table | $\lvert\mathcal{V}\rvert \times d$ | maps each token id to a continuous vector |
| hidden states | contextual vectors after lookup and layers | input to attention and the LM head |
| LM head | $d \times \lvert\mathcal{V}\rvert$ | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- QKV projections.
- residual stream states.
- dense retrieval vectors.
Non-examples:
- nearest neighbors over integer ids.
- one fixed meaning for a token in every context.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
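The soft-lookup view can be made concrete in a few lines of NumPy: a hard nearest-neighbor lookup picks exactly one value, while softmax-weighted attention blends all values by query-key similarity. The toy dimensions and random vectors below are assumptions, not outputs of any real model.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
query = rng.normal(size=d)           # one projected query vector
keys = rng.normal(size=(5, d))       # five projected key vectors
values = rng.normal(size=(5, d))     # values aligned with the keys

scores = keys @ query / np.sqrt(d)   # similarity of the query to every key

# Hard nearest neighbor: select the single best-matching value.
hard = values[np.argmax(scores)]

# Soft lookup: softmax turns scores into weights and blends all values.
w = np.exp(scores - scores.max())
w /= w.sum()
soft = w @ values

print("attention weights:", np.round(w, 3))
print("hard and soft outputs differ:", not np.allclose(hard, soft))
```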
7.3 Layer-wise contextualization
Purpose. Layer-wise contextualization focuses on how representations change through residual blocks. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
This concept explains how discrete language symbols become continuous vectors with trainable geometry.
Worked reading.
The operational question is what shape the vector has, how it is compared, and how training changes it.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | selects rows of the embedding table |
| embedding table | $\lvert\mathcal{V}\rvert \times d$ | maps each token id to a continuous vector |
| hidden states | contextual vectors after lookup and layers | input to attention and the LM head |
| LM head | $d \times \lvert\mathcal{V}\rvert$ | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- embedding rows.
- hidden states.
- similarity search.
Non-examples:
- raw text in linear algebra.
- ids treated as distances.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
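A rough sketch of layer-wise contextualization under a pre-norm residual layout: each block only adds an attention update and an MLP update to the stream, so the state keeps its shape while accumulating context. The random weights, widths, and the pre-norm choice are assumptions for illustration, not a specific architecture.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 4, 8
x = rng.normal(size=(T, d))                      # layer-0 state: embeddings plus position signal

def layer_norm(h, eps=1e-5):
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def toy_attention(h):
    # Single-head self-attention with random projections standing in for trained weights.
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    s = q @ k.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def toy_mlp(h):
    W1 = rng.normal(size=(d, 4 * d)) / np.sqrt(d)
    W2 = rng.normal(size=(4 * d, d)) / np.sqrt(4 * d)
    return np.maximum(h @ W1, 0) @ W2

# Pre-norm residual blocks: the stream is only ever added to, never replaced.
for _ in range(2):
    x = x + toy_attention(layer_norm(x))
    x = x + toy_mlp(layer_norm(x))

print(x.shape)   # still (T, d): same coordinate system, now contextual
```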
7.4 Dimensionality reduction diagnostics
Purpose. Dimensionality reduction diagnostics focus on PCA views of local geometry. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
This concept explains how discrete language symbols become continuous vectors with trainable geometry.
Worked reading.
The operational question is what shape the vector has, how it is compared, and how training changes it.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | selects rows of the embedding table |
| embedding table | $\lvert\mathcal{V}\rvert \times d$ | maps each token id to a continuous vector |
| hidden states | contextual vectors after lookup and layers | input to attention and the LM head |
| LM head | $d \times \lvert\mathcal{V}\rvert$ | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- embedding rows.
- hidden states.
- similarity search.
Non-examples:
- raw text in linear algebra.
- ids treated as distances.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
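A small PCA diagnostic sketch, assuming a synthetic embedding table with one injected dominant direction; with a real model you would substitute the actual table. The explained-variance ratios and the 2D projection are diagnostics, not proof of semantic structure.

```python
import numpy as np

rng = np.random.default_rng(3)
V, d = 1000, 64
E = rng.normal(size=(V, d))
E += np.outer(3.0 * rng.normal(size=V), rng.normal(size=d))   # inject one dominant direction

# PCA via SVD of the centered matrix.
mu = E.mean(axis=0)
U, S, Vt = np.linalg.svd(E - mu, full_matrices=False)
explained = S**2 / np.sum(S**2)

print("top-5 explained variance ratios:", np.round(explained[:5], 3))

# A 2D view for plotting: project onto the first two principal directions.
coords_2d = (E - mu) @ Vt[:2].T
print(coords_2d.shape)   # (V, 2) -- a diagnostic view, not a complete picture
```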
7.5 Embedding geometry in RAG
Purpose. Embedding geometry in RAG focuses on dense retrieval and semantic neighborhoods. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Embeddings become contextual through projections, attention, MLPs, and residual additions. Attention compares projected vectors, not raw token ids.
Worked reading.
The query and key projections turn hidden states into vectors whose dot products define attention weights.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | selects rows of the embedding table |
| embedding table | $\lvert\mathcal{V}\rvert \times d$ | maps each token id to a continuous vector |
| hidden states | contextual vectors after lookup and layers | input to attention and the LM head |
| LM head | $d \times \lvert\mathcal{V}\rvert$ | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- QKV projections.
- residual stream states.
- dense retrieval vectors.
Non-examples:
- nearest neighbors over integer ids.
- one fixed meaning for a token in every context.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
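A minimal cosine-search sketch for dense retrieval, assuming random stand-ins for pooled document and query embeddings from a trained encoder. Normalizing first makes the inner product equal cosine similarity, matching the advice above.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_docs = 64, 500

# Pooled sequence embeddings for documents and one query (random stand-ins for a trained encoder).
doc_vecs = rng.normal(size=(n_docs, d))
query_vec = rng.normal(size=d)

# Normalize so the inner product equals cosine similarity; skip this only if the
# training objective deliberately uses vector norm as a signal.
doc_unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_unit = query_vec / np.linalg.norm(query_vec)

scores = doc_unit @ query_unit
top_k = np.argsort(-scores)[:5]
print("top-5 doc ids:", top_k)
print("top-5 cosine scores:", np.round(scores[top_k], 3))
```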
8. Scale and Diagnostics
Scale and Diagnostics treats the embedding table as a systems object: it counts parameters, estimates memory, monitors geometry during training, and checks tokenizer compatibility during migration.
8.1 Parameter counting
Purpose. Parameter counting focuses on vocabulary size times model width. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Embedding tables are systems objects too: they consume memory, depend on tokenizer ids, and must handle special rows carefully.
Worked reading.
Changing a tokenizer changes which row each token id selects, so old weights no longer mean the same thing without migration.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | selects rows of the embedding table |
| embedding table | $\lvert\mathcal{V}\rvert \times d$ | maps each token id to a continuous vector |
| hidden states | contextual vectors after lookup and layers | input to attention and the LM head |
| LM head | $d \times \lvert\mathcal{V}\rvert$ | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- vocabulary resizing.
- special token initialization.
- embedding quantization.
Non-examples:
- renaming token ids without changing weights.
- ignoring padding row behavior.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
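Parameter counting reduces to a couple of products, as the sketch below shows; the vocabulary size and model width are illustrative assumptions, not any particular model's numbers.

```python
# Parameter counts for the embedding table and LM head (illustrative sizes).
vocab_size = 50_000
d_model = 1_024

embedding_params = vocab_size * d_model          # |V| x d input table
lm_head_params = d_model * vocab_size            # d x |V| output projection

untied_total = embedding_params + lm_head_params
tied_total = embedding_params                    # weight tying shares one matrix

print(f"embedding table: {embedding_params:,}")
print(f"untied total:    {untied_total:,}")
print(f"tied total:      {tied_total:,}")
```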
8.2 Memory and quantization
Purpose. Memory and quantization focus on the storage cost of embedding rows. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Embedding tables are systems objects too: they consume memory, depend on tokenizer ids, and must handle special rows carefully.
Worked reading.
Changing a tokenizer changes which row each token id selects, so old weights no longer mean the same thing without migration.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | selects rows of the embedding table |
| embedding table | $\lvert\mathcal{V}\rvert \times d$ | maps each token id to a continuous vector |
| hidden states | contextual vectors after lookup and layers | input to attention and the LM head |
| LM head | $d \times \lvert\mathcal{V}\rvert$ | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- vocabulary resizing.
- special token initialization.
- embedding quantization.
Non-examples:
- renaming token ids without changing weights.
- ignoring padding row behavior.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
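A back-of-the-envelope memory estimate for the embedding table under a few common precisions; the table sizes are illustrative assumptions.

```python
# Rough memory footprint of the embedding table at different precisions.
vocab_size = 50_000
d_model = 1_024
n_params = vocab_size * d_model

bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "int8": 1}
for fmt, b in bytes_per_param.items():
    print(f"{fmt:>10}: {n_params * b / 2**20:8.1f} MiB")
```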
8.3 Norm and similarity dashboards
Purpose. Norm and similarity dashboards focus on monitoring geometry during training. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Embedding spaces have geometry: norms, directions, subspaces, clusters, and dominant components. These structures can encode useful features and dataset artifacts.
Worked reading.
Centering an embedding cloud removes the mean direction; whitening rescales dominant axes so cosine neighborhoods are less dominated by global components.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | selects rows of the embedding table |
| embedding table | $\lvert\mathcal{V}\rvert \times d$ | maps each token id to a continuous vector |
| hidden states | contextual vectors after lookup and layers | input to attention and the LM head |
| LM head | $d \times \lvert\mathcal{V}\rvert$ | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- feature probes.
- bias directions.
- PCA diagnostics.
Non-examples:
- assuming every axis has semantic meaning.
- judging geometry from one 2D plot only.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
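A sketch of the dashboard quantities discussed above, computed on a synthetic table: norm range and mean pairwise cosine before and after centering. The synthetic construction, which adds a shared direction to mimic anisotropy, is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
V, d = 2000, 64

# Synthetic embedding table with a shared dominant direction to mimic anisotropy.
E = rng.normal(size=(V, d)) + 4.0 * rng.normal(size=d)

def mean_pairwise_cosine(X, n_pairs=5000, rng=rng):
    i = rng.integers(0, len(X), size=n_pairs)
    j = rng.integers(0, len(X), size=n_pairs)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return float(np.mean(np.sum(Xn[i] * Xn[j], axis=1)))

norms = np.linalg.norm(E, axis=1)
print("norm range:", round(float(norms.min()), 2), "to", round(float(norms.max()), 2))
print("mean pairwise cosine (raw):     ", round(mean_pairwise_cosine(E), 3))
print("mean pairwise cosine (centered):", round(mean_pairwise_cosine(E - E.mean(axis=0)), 3))
```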
8.4 Outlier tokens and special tokens
Purpose. Outlier tokens and special tokens focus on why control rows need inspection. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Embedding tables are systems objects too: they consume memory, depend on tokenizer ids, and must handle special rows carefully.
Worked reading.
Changing a tokenizer changes which row each token id selects, so old weights no longer mean the same thing without migration.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | selects rows of the embedding table |
| embedding table | $\lvert\mathcal{V}\rvert \times d$ | maps each token id to a continuous vector |
| hidden states | contextual vectors after lookup and layers | input to attention and the LM head |
| LM head | $d \times \lvert\mathcal{V}\rvert$ | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- vocabulary resizing.
- special token initialization.
- embedding quantization.
Non-examples:
- renaming token ids without changing weights.
- ignoring padding row behavior.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
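A small inspection sketch for control rows, assuming hypothetical PAD/BOS/EOS ids, a typical initialization scale, and a synthetic table with one injected outlier; with a real checkpoint you would load the actual embedding matrix instead.

```python
import numpy as np

rng = np.random.default_rng(6)
V, d = 1000, 64
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2         # hypothetical special-token ids

E = rng.normal(size=(V, d)) * 0.02       # assumed initialization scale
E[PAD_ID] = 0.0                          # padding row is often kept at zero
E[BOS_ID] *= 50.0                        # simulate an outlier control row

norms = np.linalg.norm(E, axis=1)
median = np.median(norms)
outliers = np.where(norms > 5 * median)[0]

print("median norm:", round(float(median), 4))
print("outlier rows:", outliers)                         # expect BOS_ID to show up
print("padding row is all zeros:", bool(np.all(E[PAD_ID] == 0)))
```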
8.5 Migration and compatibility tests
Purpose. Migration and compatibility tests focus on why the tokenizer and embedding table are coupled. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Embedding tables are systems objects too: they consume memory, depend on tokenizer ids, and must handle special rows carefully.
Worked reading.
Changing a tokenizer changes which row each token id selects, so old weights no longer mean the same thing without migration.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | selects rows of the embedding table |
| embedding table | $\lvert\mathcal{V}\rvert \times d$ | maps each token id to a continuous vector |
| hidden states | contextual vectors after lookup and layers | input to attention and the LM head |
| LM head | $d \times \lvert\mathcal{V}\rvert$ | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- vocabulary resizing.
- special token initialization.
- embedding quantization.
Non-examples:
- renaming token ids without changing weights.
- ignoring padding row behavior.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
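A sketch of a migration check when the vocabulary grows, assuming hypothetical new special tokens and a mean-vector initialization heuristic for the added rows. The key assertions are that old ids keep their rows and the new shape matches the new tokenizer.

```python
import numpy as np

rng = np.random.default_rng(7)
old_vocab, d = 1000, 64
E_old = rng.normal(size=(old_vocab, d)) * 0.02

# Adding new tokens: old ids must keep their rows, new rows need an explicit init.
new_tokens = ["<tool_call>", "<tool_result>"]          # hypothetical added specials
new_vocab = old_vocab + len(new_tokens)

E_new = np.empty((new_vocab, d), dtype=E_old.dtype)
E_new[:old_vocab] = E_old                              # preserve existing rows
E_new[old_vocab:] = E_old.mean(axis=0)                 # one common heuristic: mean init for new rows

# Compatibility checks a migration test might assert.
assert E_new.shape == (new_vocab, d)
assert np.array_equal(E_new[:old_vocab], E_old)        # old ids still select the same vectors
print("migration ok:", E_new.shape)
```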
9. Common Mistakes
| # | Mistake | Why it is wrong | Fix |
|---|---|---|---|
| 1 | Treating token ids as numeric magnitudes | Ids are arbitrary labels, not ordered measurements. | Use embedding lookup or one-hot selection. |
| 2 | Using dot product and cosine interchangeably | Dot product includes norm effects. | State whether magnitude should matter. |
| 3 | Assuming static token rows are final meaning | Transformer layers contextualize representations. | Distinguish input embeddings from hidden states. |
| 4 | Ignoring position information | Self-attention alone is permutation-equivariant. | Add or rotate positional information before attention. |
| 5 | Changing vocabulary without resizing embeddings | New ids need rows and output logits. | Resize, initialize, and train new rows explicitly. |
| 6 | Interpreting PCA plots too literally | Two dimensions can hide high-dimensional structure. | Use PCA as a diagnostic, not proof. |
| 7 | Forgetting anisotropy | Dominant directions can distort cosine neighbors. | Inspect mean vector, norm distribution, and centered similarities. |
| 8 | Assuming analogies always work | Linear offsets are empirical, domain-dependent approximations. | Validate with held-out relations. |
| 9 | Confusing retrieval embeddings with LM token embeddings | Retriever embeddings are usually pooled sequence vectors. | Name the embedding type and training objective. |
| 10 | Ignoring tied embeddings | Input and output tables may share parameters. | Check whether the LM head is tied to token embeddings. |
10. Exercises
- (*) Perform embedding lookup for a batch of token ids.
  - (a) State the shape of every object.
  - (b) Compute the numeric result.
  - (c) Explain the LLM architecture consequence.
- (*) Show one-hot lookup equivalence.
  - (a) State the shape of every object.
  - (b) Compute the numeric result.
  - (c) Explain the LLM architecture consequence.
- (*) Compute cosine similarity and nearest neighbors.
  - (a) State the shape of every object.
  - (b) Compute the numeric result.
  - (c) Explain the LLM architecture consequence.
- (**) Verify a synthetic analogy direction.
  - (a) State the shape of every object.
  - (b) Compute the numeric result.
  - (c) Explain the LLM architecture consequence.
- (**) Measure anisotropy before and after centering.
  - (a) State the shape of every object.
  - (b) Compute the numeric result.
  - (c) Explain the LLM architecture consequence.
- (**) Compute a softmax output-row gradient.
  - (a) State the shape of every object.
  - (b) Compute the numeric result.
  - (c) Explain the LLM architecture consequence.
- (**) Build sinusoidal position encodings.
  - (a) State the shape of every object.
  - (b) Compute the numeric result.
  - (c) Explain the LLM architecture consequence.
- (***) Apply a RoPE rotation and check norm preservation.
  - (a) State the shape of every object.
  - (b) Compute the numeric result.
  - (c) Explain the LLM architecture consequence.
- (***) Build an ALiBi bias matrix.
  - (a) State the shape of every object.
  - (b) Compute the numeric result.
  - (c) Explain the LLM architecture consequence.
- (***) Count embedding parameters and explain tied embeddings.
  - (a) State the shape of every object.
  - (b) Compute the numeric result.
  - (c) Explain the LLM architecture consequence.
11. Why This Matters for AI
| Concept | AI impact |
|---|---|
| Embedding lookup | Transforms token ids into vectors that the transformer can optimize over. |
| Similarity metrics | Support nearest neighbors, retrieval, clustering, probing, and semantic diagnostics. |
| Analogy directions | Reveal when relational information is approximately linear. |
| Anisotropy | Explains why raw embedding spaces can have poor neighborhood structure. |
| Training gradients | Show how token frequency and prediction errors move embedding rows. |
| Position encodings | Let attention use sequence order and relative distance. |
| QKV projections | Turn embeddings into attention queries, keys, and values. |
| Parameter counts | Tie vocabulary, tokenizer choice, model width, and serving memory together. |
12. Conceptual Bridge
The backward bridge is tokenization. Token ids are arbitrary labels until an embedding table gives them trainable vectors. The tokenizer and embedding table therefore form one coupled interface.
The forward bridge is attention. Queries, keys, and values are learned projections of hidden states that begin as embeddings plus position information. Attention is not separate from embedding geometry; it is built on top of it.
```
+------------+      +------------------+      +-----------------------+
| token ids  | ---> | embedding rows   | ---> | contextual hidden     |
|   B x T    |      |   B x T x d      |      | states and attention  |
+------------+      +------------------+      +-----------------------+
```
A strong mental model is to treat embeddings as the model's input coordinate system. If that coordinate system is distorted, anisotropic, incompatible with the tokenizer, or poorly initialized for new tokens, every downstream layer inherits the problem.
References
- Mikolov et al. Efficient Estimation of Word Representations in Vector Space. https://arxiv.org/abs/1301.3781
- Pennington, Socher, and Manning. GloVe: Global Vectors for Word Representation. https://aclanthology.org/D14-1162/
- Vaswani et al. Attention Is All You Need. https://arxiv.org/abs/1706.03762
- Su et al. RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/abs/2104.09864
- Press et al. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. https://arxiv.org/abs/2108.12409