Positional Encodings: Part 5 (Relative Position Representations) to Part 9 (Common Mistakes)
5. Relative Position Representations
Relative Position Representations explains how transformer sequence order is represented in hidden states or attention scores.
5.1 Relative bias matrices
Purpose. Relative bias matrices focus on score additions based on distance. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
ALiBi-style methods add a distance-dependent bias directly to attention scores before softmax.
Worked reading.
Older keys can receive a linear penalty while causal masking still controls which keys are visible.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- long-context extrapolation.
- head-specific slopes.
- score-space position signals.
Non-examples:
- residual-stream position embeddings.
- bias added after softmax.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
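The score-addition idea above can be sketched directly. This is a minimal illustration, not a production implementation: `bias_table` stands in for a learned per-offset scalar table, and offsets beyond `max_dist` share the clipped endpoint entries.

```python
import numpy as np

def relative_bias(seq_len, bias_table, max_dist):
    # bias_table: (2*max_dist + 1,) scalars, one per clipped offset in
    # [-max_dist, max_dist]. The result is added to attention scores
    # before softmax; masking remains a separate step.
    offsets = np.clip(np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :],
                      -max_dist, max_dist)      # i - j, clipped to the table range
    return bias_table[offsets + max_dist]       # (seq_len, seq_len) score additions
```

Because the table is indexed by offset rather than absolute index, the same bias value reappears along every diagonal of the score matrix.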
5.2 Shaw-style relative keys
Purpose. Shaw-style relative keys focus on relative vectors inside attention compatibility. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Relative position methods make attention depend on pairwise distance instead of only absolute index identity.
Worked reading.
The same offset can share parameters across many positions, which is useful when patterns repeat across the sequence.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- relative bias.
- relative keys.
- bucketed distance bins.
Non-examples:
- one independent vector per absolute index.
- no order signal at all.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
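A compact sketch of the Shaw-style compatibility term, where a relative key vector per clipped offset is added inside the dot product. Names and shapes here are illustrative; the original formulation also has a value-side table, which is omitted for brevity.

```python
import numpy as np

def shaw_scores(q, k, rel_k, max_dist):
    # q, k: (n, d). rel_k: (2*max_dist + 1, d), one vector per clipped
    # offset in [-max_dist, max_dist], shared across all positions.
    n, d = q.shape
    offsets = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None],
                      -max_dist, max_dist)      # j - i, clipped
    a = rel_k[offsets + max_dist]               # (n, n, d): relative key per pair
    # e_ij = q_i . (k_j + a_ij) / sqrt(d): position enters inside compatibility
    return (q @ k.T + np.einsum('id,ijd->ij', q, a)) / np.sqrt(d)
```

Because the score depends on content plus offset only, any two query-key pairs with the same content and the same offset receive the same score.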
5.3 Transformer-XL intuition
Purpose. Transformer-XL intuition focuses on segment recurrence and relative positions. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
5.4 Bucketed distances
Purpose. Bucketed distances focus on sharing parameters across far offsets. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Relative position methods make attention depend on pairwise distance instead of only absolute index identity.
Worked reading.
The same offset can share parameters across many positions, which is useful when patterns repeat across the sequence.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- relative bias.
- relative keys.
- bucketed distance bins.
Non-examples:
- one independent vector per absolute index.
- no order signal at all.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
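Bucketing can be sketched as a small mapping from offsets to bucket ids. This is a simplified log-spaced scheme, loosely in the spirit of T5's bucketing but not its exact formula: nearby offsets get their own bucket, while far offsets share coarse buckets.

```python
import numpy as np

def distance_bucket(rel_pos, num_buckets=8, max_distance=64):
    # Map each relative offset to a bucket id in [0, num_buckets).
    # Small distances are exact; larger distances fall into log-spaced bins,
    # and anything past max_distance shares the last bucket.
    d = np.abs(rel_pos)
    exact = num_buckets // 2                    # small distances: one bucket each
    is_small = d < exact
    log_bucket = exact + (
        np.log(np.maximum(d, 1) / exact) / np.log(max_distance / exact)
        * (num_buckets - exact)
    ).astype(int)
    log_bucket = np.minimum(log_bucket, num_buckets - 1)
    return np.where(is_small, d, log_bucket)
```

The payoff is parameter sharing: a bias or embedding table needs only `num_buckets` rows no matter how long the sequence is.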
5.5 When relative position helps
Purpose. This subsection covers when relative position helps, focusing on translation invariance and long sequences. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Relative position methods make attention depend on pairwise distance instead of only absolute index identity.
Worked reading.
The same offset can share parameters across many positions, which is useful when patterns repeat across the sequence.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- relative bias.
- relative keys.
- bucketed distance bins.
Non-examples:
- one independent vector per absolute index.
- no order signal at all.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
6. Rotary Positional Embeddings
Rotary Positional Embeddings explains how transformer sequence order is represented in hidden states or attention scores.
6.1 RoPE rotations
Purpose. RoPE rotations focus on rotating query and key coordinate pairs. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
RoPE encodes position by rotating query and key coordinate pairs with position-dependent angles.
Worked reading.
Because rotations compose by angle differences, a query at position m and a key at position n can expose the relative offset m - n through their dot product.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- decoder-only LLMs.
- relative dot-product behavior.
- long-context scaling variants.
Non-examples:
- adding a position vector to the hidden states.
- learned absolute position rows.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
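The pairwise rotation can be written in a few lines. This sketch assumes an interleaved pair layout (coordinates 2i and 2i+1 form pair i); real implementations sometimes use a split-halves layout instead, and the two are not interchangeable.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate consecutive coordinate pairs of x by position-dependent angles.
    # x: (d,) with d even; pos: integer position.
    d = x.shape[0]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # one angular speed per pair
    theta = pos * inv_freq
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out
```

Since each pair undergoes a plane rotation, position 0 is the identity and every position preserves the vector norm.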
6.2 Relative dot-product property
Purpose. Relative dot-product property focuses on dependence on position difference. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
RoPE encodes position by rotating query and key coordinate pairs with position-dependent angles.
Worked reading.
Because rotations compose by angle differences, a query at position m and a key at position n can expose the relative offset m - n through their dot product.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- decoder-only LLMs.
- relative dot-product behavior.
- long-context scaling variants.
Non-examples:
- adding a position vector to the hidden states.
- learned absolute position rows.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
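The relative dot-product property can be checked numerically: the score between a rotated query and key should depend only on the position difference, not the absolute positions. This is a small self-contained probe using an illustrative interleaved-pair RoPE.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Minimal interleaved-pair RoPE, for demonstration only.
    d = x.shape[0]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    theta = pos * inv_freq
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

q = np.array([0.3, -1.2, 0.7, 2.0])
k = np.array([1.1, 0.4, -0.5, 0.9])
# The same offset (3) at very different absolute positions:
s1 = rope(q, 10) @ rope(k, 7)
s2 = rope(q, 110) @ rope(k, 107)
```

Equality of `s1` and `s2` (up to floating-point error) is exactly the property that makes RoPE behave relatively despite being applied per absolute position.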
6.3 Frequency base and dimensions
Purpose. Frequency base and dimensions focus on how angular speeds are assigned. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
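The frequency assignment itself is one line of math. A sketch of the standard schedule, where pair i rotates with speed base^(-2i/d): early pairs rotate fast and encode fine local order, later pairs rotate slowly and encode coarse long-range order, and raising `base` slows the slow pairs further.

```python
import numpy as np

def rope_frequencies(d_head, base=10000.0):
    # theta_i = base^(-2i/d_head) for pair i in [0, d_head/2).
    # Monotonically decreasing: pair 0 is fastest, the last pair slowest.
    return base ** (-2.0 * np.arange(d_head // 2) / d_head)

f_default = rope_frequencies(8, base=10000.0)
f_large_base = rope_frequencies(8, base=500000.0)
```

Long-context recipes often change `base` precisely because it rescales only these angular speeds, leaving the rest of the attention computation untouched.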
6.4 Long-context RoPE scaling
Purpose. Long-context RoPE scaling focuses on why interpolation and base changes are used. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
RoPE encodes position by rotating query and key coordinate pairs with position-dependent angles.
Worked reading.
Because rotations compose by angle differences, a query at position m and a key at position n can expose the relative offset m - n through their dot product.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- decoder-only LLMs.
- relative dot-product behavior.
- long-context scaling variants.
Non-examples:
- adding a position vector to the hidden states.
- learned absolute position rows.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
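Position interpolation, one common scaling recipe, can be sketched in a few lines: positions are compressed by train_len / new_len so a longer context reuses the angle range seen during training. This is a schematic of the idea, not any particular library's implementation.

```python
import numpy as np

def interpolated_angles(pos, d_head, train_len, new_len, base=10000.0):
    # Scale positions so new_len tokens span the angle range that
    # train_len tokens spanned during training.
    scale = train_len / new_len
    inv_freq = base ** (-2.0 * np.arange(d_head // 2) / d_head)
    return (pos * scale) * inv_freq

# The last position of a 4096-token context, interpolated back into
# the 2048-token training range:
a = interpolated_angles(4095, 64, train_len=2048, new_len=4096)
b = interpolated_angles(2047.5, 64, train_len=2048, new_len=2048)
```

The cost is resolution: adjacent tokens now differ by smaller angles, which is why interpolation is usually paired with a short fine-tune. Base-change variants (such as NTK-style scaling) instead adjust `base` so different pairs are stretched unevenly.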
6.5 Implementation checks
Purpose. Implementation checks focus on norm preservation and pairwise rotations. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
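Two of these checks are cheap enough to keep as a unit test. The sketch below wraps them in a helper that takes any RoPE application function; `rope` here is an illustrative interleaved-pair reference, and the invariants (identity at position 0, norm preservation everywhere) hold for any correct variant.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Reference interleaved-pair rotation, for demonstration.
    d = x.shape[0]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    theta = pos * inv_freq
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

def check_rope_apply(apply_fn, d=8, tol=1e-6):
    # Invariant 1: position 0 must be the identity.
    # Invariant 2: every position must preserve the vector norm.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(d)
    ok = bool(np.allclose(apply_fn(x, 0), x, atol=tol))
    for pos in (1, 17, 4096):
        ok = ok and abs(np.linalg.norm(apply_fn(x, pos)) - np.linalg.norm(x)) < tol
    return ok
```

These checks do not catch a swapped pair layout on their own (a wrong layout can still rotate and preserve norms), so they belong alongside the shifted-input comparison described above.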
7. ALiBi and Bias Methods
ALiBi and Bias Methods explains how transformer sequence order is represented in hidden states or attention scores.
7.1 Linear attention biases
Purpose. Linear attention biases focus on a distance penalty added to scores. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
ALiBi-style methods add a distance-dependent bias directly to attention scores before softmax.
Worked reading.
Older keys can receive a linear penalty while causal masking still controls which keys are visible.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- long-context extrapolation.
- head-specific slopes.
- score-space position signals.
Non-examples:
- residual-stream position embeddings.
- bias added after softmax.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
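The linear penalty can be sketched end to end for one head. This is a minimal single-head illustration: the bias is added in score space before softmax, and the causal mask stays a separate step, exactly as the worked reading above describes.

```python
import numpy as np

def alibi_attention_scores(q, k, slope):
    # q, k: (n, d). ALiBi adds -slope * (i - j) to each visible score,
    # penalizing distant keys linearly; no learned position table is needed.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    dist = np.arange(n)[:, None] - np.arange(n)[None, :]   # i - j >= 0 when visible
    scores = scores - slope * np.maximum(dist, 0)
    scores[dist < 0] = -np.inf                             # causal mask, kept separate
    return scores
```

Because the penalty is a fixed function of distance, it extends to any sequence length without new parameters, which is the source of ALiBi's extrapolation behavior.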
7.2 Head-specific slopes
Purpose. Head-specific slopes focus on multiple distance scales across heads. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
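The slope schedule itself is a short geometric sequence. The sketch below follows the ALiBi paper's recipe for power-of-two head counts (head i uses slope 2^(-8(i+1)/num_heads)); non-power-of-two counts use an interleaved extension not shown here.

```python
import numpy as np

def alibi_slopes(num_heads):
    # Geometric slope schedule: the first head decays steeply with distance
    # (sharply local attention), the last head barely decays at all
    # (nearly distance-blind attention).
    return np.array([2.0 ** (-8.0 * (i + 1) / num_heads)
                     for i in range(num_heads)])
```

Spanning several decay rates across heads is what lets one fixed prior serve both local syntax-like patterns and long-range retrieval within the same layer.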
7.3 Extrapolation behavior
Purpose. Extrapolation behavior focuses on why no learned position table is required. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
7.4 Bias versus embedding
Purpose. Bias versus embedding focuses on a score-space signal rather than a residual-stream signal. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
ALiBi-style methods add a distance-dependent bias directly to attention scores before softmax.
Worked reading.
Older keys can receive a linear penalty while causal masking still controls which keys are visible.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- long-context extrapolation.
- head-specific slopes.
- score-space position signals.
Non-examples:
- residual-stream position embeddings.
- bias added after softmax.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
7.5 Mask interaction
Purpose. Mask interaction focuses on how biases still respect causal visibility. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
8. Diagnostics and LLM Practice
Diagnostics and LLM Practice explains how transformer sequence order is represented in hidden states or attention scores.
8.1 Position-sensitivity tests
Purpose. Position-sensitivity tests focus on permutation and shift tests. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position diagnostics test whether a model uses order correctly and remains reliable at target context lengths.
Worked reading.
A correct serving stack must assign decode position ids consistently with the KV cache and any RoPE or bias scheme.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- shift tests.
- needle retrieval by position.
- KV-cache position id checks.
Non-examples:
- only testing short prompts.
- assuming max context length implies uniform quality.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
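One shift test can be made concrete with a toy relative scheme. A minimal sketch, assuming an ALiBi-style distance bias (`alibi_scores` is an illustrative toy, not a library function): a relative scheme's attention logits should be unchanged when every position id is shifted by the same offset.

```python
import numpy as np

def alibi_scores(q, k, positions, slope=0.5):
    """Causal attention logits with an ALiBi-style distance penalty."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    dist = positions[:, None] - positions[None, :]  # i - j, >= 0 for past keys
    scores = scores - slope * np.abs(dist)          # penalty grows with distance
    return np.where(dist < 0, -np.inf, scores)      # mask future keys

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 16))
k = rng.normal(size=(6, 16))

base = alibi_scores(q, k, np.arange(6))
shifted = alibi_scores(q, k, np.arange(6) + 100)    # same tokens, shifted ids
assert np.allclose(base, shifted)                   # relative scheme: shift-invariant
```

An absolute scheme would fail this invariance, which is exactly what the test is meant to distinguish.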
8.2 Long-context degradation
Purpose. Long-context degradation focuses on lost-in-the-middle and recency effects. This is the part of practice that checks whether quality holds at every depth of the context window, not just near its edges.
Operational definition.
Position diagnostics test whether a model uses order correctly and remains reliable at target context lengths.
Worked reading.
Even inside the nominal window, answers that depend on facts placed mid-context can degrade sharply while facts near the beginning or end stay retrievable.
Examples:
- needle retrieval at several depths.
- accuracy curves by context position.
- recency-effect probes.
Non-examples:
- only testing short prompts.
- assuming max context length implies uniform quality.
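Needle-retrieval probes can be sketched as a tiny harness. Everything here is hypothetical scaffolding: `generate` stands in for whatever call produces text from your model, and the needle and filler strings are arbitrary placeholders.

```python
def make_probe(needle, filler, depth, total_lines=200):
    """Place one needle line at a fractional depth inside filler context."""
    lines = [filler] * total_lines
    lines[int(depth * (total_lines - 1))] = needle
    return "\n".join(lines) + "\nWhat is the secret code?"

def retrieval_by_depth(generate, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Score needle retrieval at several positions in the context window."""
    hits = {}
    for d in depths:
        prompt = make_probe("The secret code is 7194.", "Filler sentence.", d)
        hits[d] = "7194" in generate(prompt)
    return hits

# Sanity check with a stub "model" that simply echoes its prompt.
assert all(retrieval_by_depth(lambda p: p).values())
```

With a real model, plotting hit rate against depth makes a lost-in-the-middle dip visible immediately.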
8.3 Attention pattern inspection
Purpose. Attention pattern inspection focuses on distance distributions and entropy. This is the part of practice that shows where attention mass actually goes.
Operational definition.
Attention pattern inspection measures where attention mass goes: how far back each query looks and how concentrated each distribution is.
Worked reading.
Per-head distance histograms and per-row entropies reveal whether a head behaves locally, positionally, or content-driven, and whether supposed long-range heads actually attend at long range.
Examples:
- per-head attention distance histograms.
- attention entropy by layer.
- identifying local versus long-range heads.
Non-examples:
- conclusions drawn from a single prompt.
- treating a high attention weight as proof the model used that token.
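The distance and entropy statistics above can be computed directly from an attention matrix. A minimal numpy sketch (`attention_diagnostics` is an illustrative helper, not a library function):

```python
import numpy as np

def attention_diagnostics(attn):
    """attn: (T, T) row-stochastic attention weights (each row sums to 1)."""
    T = attn.shape[0]
    idx = np.arange(T)
    dist = idx[:, None] - idx[None, :]                 # query minus key position
    # Average lookback per query (only past/self distances count for causal rows).
    mean_dist = (attn * np.clip(dist, 0, None)).sum(axis=-1)
    # Entropy per query row; clip avoids log(0) for exact zeros.
    entropy = -(attn * np.log(np.clip(attn, 1e-12, None))).sum(axis=-1)
    return mean_dist, entropy

# A purely self-attending (diagonal) pattern has zero distance and zero entropy.
md, ent = attention_diagnostics(np.eye(5))
assert np.allclose(md, 0) and np.allclose(ent, 0)
```

Tracking these per head across layers is a cheap way to spot heads that collapse to the diagonal or smear uniformly.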
8.4 Serving and KV cache
Purpose. Serving and KV cache focuses on position ids during decode. This is the part of practice that keeps cached decoding consistent with the position scheme.
Operational definition.
Position diagnostics test whether a model uses order correctly and remains reliable at target context lengths.
Worked reading.
A correct serving stack must assign decode position ids consistently with the KV cache and any RoPE or bias scheme.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- KV-cache position id checks.
- prefix-length plus generated-step accounting during decode.
- comparing cached decode logits against a full recompute.
Non-examples:
- testing only the full-prompt recompute path.
- assuming matching tensor shapes imply correct position ids.
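Position-id accounting during decode can be made explicit. A minimal sketch with hypothetical names (`DecodeState` is not a real library class): each generated token's position id must equal the current cache length, and the cache grows by one per step.

```python
class DecodeState:
    """Tracks the position id of the next generated token against a KV cache."""

    def __init__(self, prefix_len):
        self.cache_len = prefix_len      # keys/values already cached for the prompt

    def next_position_id(self):
        pos = self.cache_len             # new token sits right after cached entries
        self.cache_len += 1              # cache grows by one after this step
        return pos

state = DecodeState(prefix_len=10)
assert [state.next_position_id() for _ in range(3)] == [10, 11, 12]
```

A common bug this catches: reusing position id 0 (or the prompt length minus one) for every generated token, which shapes alone will never reveal.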
8.5 Choosing a scheme
Purpose. Choosing a scheme focuses on tradeoffs among learned absolute, sinusoidal, relative bias, RoPE, and ALiBi. This is the part of practice that matches a position scheme to target context lengths and serving constraints.
Operational definition.
Position diagnostics test whether a model uses order correctly and remains reliable at target context lengths.
Worked reading.
The right choice depends on whether the model must extrapolate beyond its training length, how decoding is cached, and how much in-range fit matters relative to simplicity.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- ALiBi when simple length extrapolation matters most.
- RoPE with a scaling strategy when extended context must preserve relative dot products.
- learned absolute rows for fixed-length encoder workloads.
Non-examples:
- choosing a scheme by popularity alone.
- assuming any scheme extrapolates without testing at target lengths.
9. Common Mistakes
| # | Mistake | Why it is wrong | Fix |
|---|---|---|---|
| 1 | Assuming attention knows order automatically | Dot-product attention over a set is order-agnostic without position information. | Add or inject a positional signal. |
| 2 | Mixing absolute and relative claims | Absolute ids and relative offsets behave differently. | State where the position enters the formula. |
| 3 | Adding RoPE like a vector | RoPE rotates Q/K pairs rather than adding rows to hidden states. | Implement pairwise rotations. |
| 4 | Forgetting norm preservation | RoPE should preserve pair norms. | Unit-test norms after rotation. |
| 5 | Using learned rows beyond their range | A learned table has finite trained positions. | Resize/interpolate or choose an extrapolating scheme. |
| 6 | Applying ALiBi after softmax | Bias must affect logits before normalization. | Add bias to scores before softmax. |
| 7 | Ignoring decode position ids | KV cache requires consistent positions for new tokens. | Track prefix length and generated step. |
| 8 | Judging long context by max length only | Quality may degrade inside the window. | Evaluate retrieval by position and distance. |
| 9 | Confusing padding masks with position encodings | Masks control visibility; encodings provide order. | Use both when needed. |
| 10 | Treating extrapolation as guaranteed | All schemes can fail outside training distribution. | Test at target lengths. |
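Mistake 6 in particular is easy to unit-test. A minimal sketch, assuming the common ALiBi geometric slope schedule 2^(-8i/n) for n heads; `alibi_slopes` and `attn_weights` are illustrative names, not library APIs. The bias is added to the logits before softmax, and the causal mask stays a separate concern.

```python
import numpy as np

def alibi_slopes(n_heads):
    """Head-specific slopes 2^(-8i/n), a common geometric schedule."""
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attn_weights(scores, slope):
    """ALiBi-biased causal attention weights for a (T, T) score matrix."""
    T = scores.shape[-1]
    dist = np.arange(T)[None, :] - np.arange(T)[:, None]  # j - i, <= 0 for past keys
    biased = scores + slope * np.minimum(dist, 0)         # penalty grows with lookback
    biased = np.where(dist > 0, -np.inf, biased)          # causal mask, kept separate
    return softmax(biased)                                # bias applied BEFORE softmax

w = attn_weights(np.zeros((4, 4)), slope=1.0)
assert np.allclose(w.sum(axis=-1), 1.0)                   # rows remain distributions
```

Adding the same bias after softmax would produce rows that no longer sum to one, which is why the row-sum assertion doubles as a regression test for this mistake.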