Language Model Probability Math
An LLM is a machine for assigning probabilities to token continuations. The transformer stack builds a contextual vector; this section explains how that vector becomes a normalized next-token distribution, how training improves that distribution, and how decoding turns it back into text.
Overview
Language-model probability is the center of the LLM learning loop. During training, the model receives a true prefix and is penalized when the observed next token has low probability. During evaluation, the same probabilities produce negative log-likelihood, cross-entropy, perplexity, calibration curves, and answer scores. During generation, the probabilities are transformed by decoding rules such as temperature, top-k, top-p, beam search, or constraints.
The most important idea is simple but deep: the probability of a whole token sequence factorizes exactly into next-token conditionals,

$$p_\theta(t_1, \dots, t_n) = \prod_{i=1}^{n} p_\theta(t_i \mid t_{<i}).$$

This chain-rule factorization is exact; no independence assumption is made. Autoregressive language modeling becomes a practical learning problem because a neural network can estimate each conditional distribution.
Prerequisites
- Conditional probability and the product rule
- Logarithms, exponentials, and gradients
- Entropy, KL divergence, and cross-entropy
- Vector and matrix shapes from embeddings and attention
- The previous LLM sections on tokenization, embeddings, attention, and positions
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Builds the probability machinery with executable toy LMs, stable softmax, cross-entropy gradients, perplexity, calibration, and decoding demos. |
| exercises.ipynb | Gives ten practice problems with scaffolds and complete solutions for sequence probability, loss, decoding, calibration, and conditional scoring. |
Learning Objectives
After this section, you should be able to:
- Define a language model as a distribution over token sequences with an EOS convention.
- Derive the autoregressive factorization from the probability chain rule.
- Convert hidden states into logits, logits into probabilities, and probabilities into log-likelihoods.
- Implement numerically stable softmax and log-softmax.
- Derive the cross-entropy gradient with respect to logits.
- Compute entropy, KL divergence, cross-entropy, perplexity, bits per token, and length-normalized scores.
- Explain why decoding changes the sampling distribution without changing the trained model.
- Diagnose common probability bugs in LLM training and evaluation code.
Table of Contents
- Language Models as Distributions
- Autoregressive Factorization
- 2.1 Chain rule derivation
- 2.2 Teacher forcing
- 2.3 Causal masking
- 2.4 Sequence log probability
- 2.5 Prefix scoring
- Logits, Softmax, and Log-Sum-Exp
- 3.1 LM head
- 3.2 Softmax normalization
- 3.3 Shift invariance
- 3.4 Stable log-softmax
- 3.5 Temperature
- Maximum Likelihood and Cross-Entropy
- Entropy, KL, Perplexity, and Bits
- 5.1 Entropy
- 5.2 Cross-entropy decomposition
- 5.3 Perplexity
- 5.4 Bits per token
- 5.5 Tokenizer caveat
- Sequence Scoring and Calibration
- 6.1 Length bias
- 6.2 Average log probability
- 6.3 Calibration
- 6.4 Expected calibration error
- 6.5 Temperature scaling
- Decoding as Distribution Transformation
- 7.1 Greedy decoding
- 7.2 Sampling
- 7.3 Top-k filtering
- 7.4 Nucleus top-p filtering
- 7.5 Beam search and length penalty
- From Count Models to Neural LMs
- 8.1 N-gram models
- 8.2 Smoothing
- 8.3 Neural probabilistic LM
- 8.4 Recurrent LM
- 8.5 Transformer LM
- Conditional Generation and Constraints
- Diagnostics and Learning Practice
- 10.1 Normalization checks
- 10.2 Likelihood sanity checks
- 10.3 Distribution shift
- 10.4 Generation versus evaluation
- 10.5 Bridge to training at scale
Notation
| Symbol | Meaning |
|---|---|
| $\mathcal{V}$ | Vocabulary of ordinary tokens |
| $\mathcal{V}_{\mathrm{ext}}$ | Vocabulary plus special tokens such as EOS |
| $t_i$ | Token at position $i$ |
| $t_{<i}$ | Prefix of tokens before position $i$ |
| $h_i$ | Transformer hidden state at position $i$ |
| $z_i$ | Logit vector for the next-token distribution |
| $p_\theta(\cdot \mid t_{<i})$ | Model distribution over the next token |
| $y_i$ | One-hot target vector |
| $m_i$ | Mask indicating whether a token contributes to the loss |
The standard decoder-only training path is:
tokens -> embeddings + positions -> causal transformer -> hidden state -> LM head -> logits -> log-softmax -> NLL
1. Language Models as Distributions
This part treats a language model as an operational object: a distribution over token sequences that an LLM computes for every prefix during training, scoring, and generation. The formulas below are small enough to derive by hand, but they are also exactly the formulas used inside large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| Finite vocabulary and EOS | make sequence probability normalize by ending strings explicitly | $\mathcal{V}_{\mathrm{ext}} = \mathcal{V} \cup \{\mathrm{EOS}\}$ |
| Probability over finite strings | a language model assigns mass to every complete token string | $\sum p_\theta(t_{1:n}) = 1$ over complete strings |
| Conditional next-token distributions | generation uses one normalized categorical distribution at each prefix | $\sum_{v} p_\theta(v \mid t_{<i}) = 1$ |
| Support and impossible events | zero probability is a dangerous modeling choice | $p_\theta(v \mid h) > 0$ after softmax |
| Why next-token prediction is enough | the chain rule converts local predictions into a full joint model | $p_\theta(t_{1:n}) = \prod_i p_\theta(t_i \mid t_{<i})$ |
1.1 Finite vocabulary and EOS
Main idea. Make sequence probability normalize by ending strings explicitly.
The useful formula is:

$$\mathcal{V}_{\mathrm{ext}} = \mathcal{V} \cup \{\mathrm{EOS}\}, \qquad \text{and every complete string ends with } t_n = \mathrm{EOS}.$$
The probability object should always be read with its conditioning context. A token is not likely or unlikely in isolation; it is likely or unlikely after a prefix, under a vocabulary, with a particular tokenizer and model state. This is why the same word can be high probability in one prompt and nearly impossible in another.
Worked micro-example. Suppose the current prefix is The capital of France is. A reasonable model should place more mass on Paris than on banana, but it should still maintain a normalized distribution over every token in the vocabulary. If the logits for three candidate tokens are, say, $z = (5.0,\ 2.0,\ 0.0)$, the probability of the first token is large because exponentiation magnifies logit differences:

$$p_1 = \frac{e^{5.0}}{e^{5.0} + e^{2.0} + e^{0.0}} \approx \frac{148.4}{148.4 + 7.4 + 1.0} \approx 0.95.$$
The model is not storing a sentence list. It is using a conditional distribution whose parameters are generated from the prefix representation.
Implementation check. In code, check shapes and normalization. If logits have shape (batch, time, vocab), then softmax(logits, axis=-1) should sum to one along the vocabulary axis for every batch and time position. When scoring labels, gather only the probability of the observed next token and mask padding positions before averaging.
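The sketch below runs these checks in NumPy. The shapes, random logits, token ids, and mask values are illustrative, not taken from any particular model.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Stable log-softmax: subtract the max before exponentiating.
    z = logits - logits.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 10))          # (batch, time, vocab), illustrative
targets = rng.integers(0, 10, size=(2, 4))    # observed next tokens
mask = np.array([[1, 1, 1, 0],                # 0 marks padded target positions
                 [1, 1, 1, 1]])

# Normalization check: probabilities sum to one along the vocab axis.
probs = np.exp(log_softmax(logits))
assert np.allclose(probs.sum(axis=-1), 1.0)

# Gather only the log probability of each observed next token.
logp = np.take_along_axis(log_softmax(logits), targets[..., None], axis=-1)[..., 0]

# Masked mean negative log-likelihood: padding contributes nothing.
nll = -(logp * mask).sum() / mask.sum()
print("masked NLL:", nll)
```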
AI connection. The EOS convention is not decoration: it gives generation a principled stopping point, and it is what makes the probabilities of complete strings sum to one so that sequence-level scores are comparable at all.
Common mistake. Do not compare raw sequence probabilities across different output lengths without a length convention. Products of probabilities shrink as strings get longer, so a two-token answer often has a higher raw probability than a better twenty-token answer.
1.2 Probability over finite strings
Main idea. A language model assigns mass to every complete token string.
The useful formula is:

$$p_\theta(t_{1:n}) \ge 0 \quad \text{and} \quad \sum_{\text{complete strings } t_{1:n}} p_\theta(t_{1:n}) = 1.$$
1.3 Conditional next-token distributions
Main idea. Generation uses one normalized categorical distribution at each prefix.
The useful formula is:

$$\sum_{v \in \mathcal{V}_{\mathrm{ext}}} p_\theta(v \mid t_{<i}) = 1 \quad \text{for every prefix } t_{<i}.$$
1.4 Support and impossible events
Main idea. Zero probability is a dangerous modeling choice.
The useful formula is:

$$p_\theta(v \mid h) = \frac{\exp(z_v)}{\sum_{u} \exp(z_u)} > 0 \quad \text{for every } v \in \mathcal{V}_{\mathrm{ext}} \text{ after softmax},$$

so the model can make a token arbitrarily unlikely but never exactly impossible.
1.5 Why next-token prediction is enough
Main idea. The chain rule converts local predictions into a full joint model.
The useful formula is:

$$p_\theta(t_{1:n}) = \prod_{i=1}^{n} p_\theta(t_i \mid t_{<i}).$$
2. Autoregressive Factorization
This part studies autoregressive factorization as an operational object: something an LLM computes for every prefix during training, scoring, and generation. The formulas below are small enough to derive by hand, but they are also exactly the formulas used inside large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| Chain rule derivation | no independence assumption is needed | $p_\theta(t_{1:n}) = \prod_{i=1}^{n} p_\theta(t_i \mid t_{<i})$ |
| Teacher forcing | training conditions on the true prefix, not sampled model history | $\mathcal{L} = -\sum_i \log p_\theta(t_i \mid t_{<i})$ with $t_{<i}$ from the data |
| Causal masking | attention must not leak future targets into the conditional | $M_{ij}=0$ for $j\le i$ and $M_{ij}=-\infty$ otherwise |
| Sequence log probability | products become sums for stable scoring | $\log p_\theta(t_{1:n}) = \sum_i \log p_\theta(t_i \mid t_{<i})$ |
| Prefix scoring | prompt likelihood and answer likelihood are different objects | $\log p_\theta(y \mid x) = \sum_j \log p_\theta(y_j \mid x, y_{<j})$ |
2.1 Chain rule derivation
Main idea. No independence assumption is needed.
The useful formula is:

$$p_\theta(t_1, \dots, t_n) = \prod_{i=1}^{n} p_\theta(t_i \mid t_1, \dots, t_{i-1}),$$

which is an identity of probability theory, not an approximation.
2.2 Teacher forcing
Main idea. Training conditions on the true prefix, not sampled model history.
The useful formula is:

$$\mathcal{L}(\theta) = -\sum_{i} \log p_\theta(t_i \mid t_{<i}), \qquad \text{with } t_{<i} \text{ taken from the data, not from model samples}.$$
AI connection. This is why training can be massively parallel while generation is sequential.
2.3 Causal masking
Main idea. Attention must not leak future targets into the conditional.
The useful formula is:

$$\mathrm{Attention} = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right)V, \qquad M_{ij} = 0 \text{ for } j \le i \text{ and } M_{ij} = -\infty \text{ otherwise}.$$
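Below is a minimal NumPy sketch of the additive causal mask. The score matrix stands in for $QK^{\top}/\sqrt{d}$, and the sequence length is illustrative.

```python
import numpy as np

def causal_mask(n):
    # M[i, j] = 0 where j <= i (visible past), -inf where j > i (hidden future).
    m = np.zeros((n, n))
    m[np.triu_indices(n, k=1)] = -np.inf
    return m

def masked_softmax(scores):
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.random.default_rng(1).normal(size=(n, n))   # stand-in for QK^T / sqrt(d)
weights = masked_softmax(scores + causal_mask(n))

# Each row i attends only to positions j <= i; future positions get exactly zero weight.
print(np.round(weights, 3))
assert np.allclose(np.triu(weights, k=1), 0.0)
```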
2.4 Sequence log probability
Main idea. Products become sums for stable scoring.
The useful formula is:

$$\log p_\theta(t_{1:n}) = \sum_{i=1}^{n} \log p_\theta(t_i \mid t_{<i}).$$
2.5 Prefix scoring
Main idea. Prompt likelihood and answer likelihood are different objects.
The useful formula is:

$$\log p_\theta(y \mid x) = \sum_{j=1}^{\lvert y\rvert} \log p_\theta(y_j \mid x, y_{<j}),$$

which sums only over answer tokens; the prompt $x$ is conditioning context, not something being scored.
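The following sketch scores an answer conditionally on a prompt, assuming the usual convention that the logits at position $i$ predict token $i+1$. The token ids, vocabulary size, and the `answer_logprob` helper are illustrative stand-ins, not a real model API.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def answer_logprob(logits, tokens, prompt_len):
    # Convention: logits[i] is the distribution over tokens[i + 1] given tokens[: i + 1].
    logp = log_softmax(logits)
    total = 0.0
    for i in range(prompt_len, len(tokens)):
        total += logp[i - 1, tokens[i]]   # score answer tokens only; the prompt is conditioning
    return total

rng = np.random.default_rng(2)
vocab = 12
tokens = [3, 7, 1, 9, 4, 2]               # illustrative token ids; the first 3 are the prompt
logits = rng.normal(size=(len(tokens), vocab))
print("log p(answer | prompt):", answer_logprob(logits, tokens, prompt_len=3))
```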
3. Logits, Softmax, and Log-Sum-Exp
This part studies logits, softmax, and log-sum-exp as an operational object: something an LLM computes for every prefix during training, scoring, and generation. The formulas below are small enough to derive by hand, but they are also exactly the formulas used inside large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| LM head | the hidden state becomes one score per vocabulary token | $z_i = W_{\mathrm{LM}} h_i + b$ |
| Softmax normalization | logits are unnormalized log probabilities | $p_\theta(v\mid t_{<i}) = \exp(z_{i,v})/\sum_u \exp(z_{i,u})$ |
| Shift invariance | adding a constant to every logit changes no probability | $\mathrm{softmax}(z + c\mathbf{1}) = \mathrm{softmax}(z)$ |
| Stable log-softmax | subtract the maximum before exponentiating | $\log\mathrm{softmax}(z)_v = z_v - m - \log\sum_u e^{z_u - m}$, $m=\max_u z_u$ |
| Temperature | divide logits before softmax to control entropy | $p_T(v) \propto \exp(z_v/T)$ |
3.1 LM head
Main idea. The hidden state becomes one score per vocabulary token.
The useful formula is:

$$z_i = W_{\mathrm{LM}}\, h_i + b, \qquad z_i \in \mathbb{R}^{\lvert\mathcal{V}_{\mathrm{ext}}\rvert}.$$
3.2 Softmax normalization
Main idea. Logits are unnormalized log probabilities.
The useful formula is:

$$p_\theta(v \mid t_{<i}) = \frac{\exp(z_{i,v})}{\sum_{u \in \mathcal{V}_{\mathrm{ext}}} \exp(z_{i,u})}.$$
AI connection. This is the final probability gate between hidden-state geometry and token choice.
3.3 Shift invariance
Main idea. Adding a constant to every logit changes no probability.
The useful formula is:

$$\mathrm{softmax}(z + c\mathbf{1}) = \mathrm{softmax}(z) \quad \text{for any constant } c.$$
3.4 Stable log-softmax
Main idea. Subtract the maximum before exponentiating.
The useful formula is:

$$\log \mathrm{softmax}(z)_v = z_v - m - \log \sum_{u} \exp(z_u - m), \qquad m = \max_u z_u.$$
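A small demonstration of why the max shift matters; the oversized logit values are chosen only to overflow the naive computation.

```python
import numpy as np

z = np.array([1000.0, 999.0, 998.0])   # large logits, chosen to break the naive form

# Naive softmax overflows: exp(1000) is inf in float64, and inf/inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(z) / np.exp(z).sum()
print("naive:", naive)

# Stable form: subtract the max first; the shift changes no probability.
m = z.max()
stable = np.exp(z - m) / np.exp(z - m).sum()
log_probs = (z - m) - np.log(np.exp(z - m).sum())
print("stable:", stable)
print("log-probs:", log_probs)
```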
3.5 Temperature
Main idea. Divide logits before softmax to control entropy.
The useful formula is:

$$p_T(v \mid t_{<i}) = \frac{\exp(z_{i,v} / T)}{\sum_{u} \exp(z_{i,u} / T)}.$$
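A short sketch of how temperature reshapes a single next-token distribution; the three logit values are illustrative.

```python
import numpy as np

def softmax_T(z, T):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

z = [5.0, 2.0, 0.0]                       # illustrative logits
for T in (0.5, 1.0, 2.0):
    p = softmax_T(z, T)
    print(f"T={T}: probs={np.round(p, 3)}, entropy={entropy(p):.3f} nats")
# Lower T sharpens the distribution (lower entropy); higher T flattens it.
```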
4. Maximum Likelihood and Cross-Entropy
This part studies maximum likelihood and cross-entropy as an operational object: something an LLM computes for every prefix during training, scoring, and generation. The formulas below are small enough to derive by hand, but they are also exactly the formulas used inside large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| Dataset likelihood | training maximizes probability assigned to observed tokens | $\max_\theta \sum_i \log p_\theta(t_i \mid t_{<i})$ |
| Negative log-likelihood | loss is surprise under the model | $\ell_i = -\log p_\theta(t_i \mid t_{<i})$ |
| Cross-entropy | one-hot labels reduce cross-entropy to NLL | $H(y_i, p_i) = -\sum_v y_{i,v}\log p_{i,v}$ |
| Gradient with respect to logits | the key derivative is predicted minus target | $\partial \ell_i / \partial z_{i,v} = p_{i,v} - y_{i,v}$ |
| Padding masks | only real target tokens should contribute to the average loss | $\mathcal{L} = \sum_i m_i \ell_i \,/\, \sum_i m_i$ |
4.1 Dataset likelihood
Main idea. Training maximizes probability assigned to observed tokens.
The useful formula is:

$$\theta^{\star} = \arg\max_{\theta} \sum_{\text{sequences}} \sum_{i} \log p_\theta(t_i \mid t_{<i}).$$
4.2 Negative log-likelihood
Main idea. Loss is surprise under the model.
The useful formula is:

$$\ell_i = -\log p_\theta(t_i \mid t_{<i}).$$
4.3 Cross-entropy
Main idea. One-hot labels reduce cross-entropy to NLL.
The useful formula is:

$$H(y_i, p_i) = -\sum_{v} y_{i,v} \log p_{i,v} = -\log p_\theta(t_i \mid t_{<i}) \quad \text{for a one-hot target } y_i.$$
4.4 Gradient with respect to logits
Main idea. The key derivative is predicted minus target.
The useful formula is:

$$\frac{\partial \ell_i}{\partial z_{i,v}} = p_{i,v} - y_{i,v}.$$
AI connection. This derivative is the reason cross-entropy is so convenient for transformer training.
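A finite-difference check that the gradient of the cross-entropy loss with respect to the logits is softmax minus one-hot; the logits and target index are arbitrary illustrative values.

```python
import numpy as np

def ce_loss(z, target):
    # -log softmax(z)[target], computed stably.
    m = z.max()
    return -(z[target] - m - np.log(np.exp(z - m).sum()))

z = np.array([2.0, -1.0, 0.5, 0.0])      # illustrative logits
target = 2

# Analytic gradient: softmax(z) - one_hot(target).
p = np.exp(z - z.max()); p /= p.sum()
analytic = p.copy()
analytic[target] -= 1.0

# Numerical gradient by central differences.
eps = 1e-6
numeric = np.zeros_like(z)
for v in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[v] += eps; zm[v] -= eps
    numeric[v] = (ce_loss(zp, target) - ce_loss(zm, target)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # expected: True
```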
4.5 Padding masks
Main idea. Only real target tokens should contribute to the average loss.
The useful formula is:

$$\mathcal{L} = \frac{\sum_i m_i\, \ell_i}{\sum_i m_i}, \qquad m_i \in \{0, 1\}.$$
5. Entropy, KL, Perplexity, and Bits
This part studies entropy, KL divergence, perplexity, and bits as operational objects: quantities an LLM computes for every prefix during training, scoring, and generation. The formulas below are small enough to derive by hand, but they are also exactly the formulas used inside large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| Entropy | intrinsic uncertainty of a distribution | $H(p) = -\sum_v p(v)\log p(v)$ |
| Cross-entropy decomposition | model loss equals data entropy plus mismatch | $H(p^{*}, p_\theta) = H(p^{*}) + \mathrm{KL}(p^{*} \,\Vert\, p_\theta)$ |
| Perplexity | exponentiated average NLL | $\mathrm{PPL} = \exp(\mathrm{NLL})$ |
| Bits per token | use base-2 logs for compression interpretation | $\mathrm{BPT} = \mathrm{NLL} / \ln 2$ |
| Tokenizer caveat | perplexity depends on the tokenization unit | $\mathrm{BPB}$ and $\mathrm{BPC}$ are more comparable |
5.1 Entropy
Main idea. Intrinsic uncertainty of a distribution.
The useful formula is:

$$H(p) = -\sum_{v} p(v) \log p(v).$$
5.2 Cross-entropy decomposition
Main idea. Model loss equals data entropy plus mismatch.
The useful formula is:

$$H(p^{*}, p_\theta) = H(p^{*}) + \mathrm{KL}(p^{*} \,\Vert\, p_\theta).$$
5.3 Perplexity
Main idea. Exponentiated average NLL.
The useful formula is:

$$\mathrm{PPL} = \exp\!\left(\frac{1}{N} \sum_{i=1}^{N} -\log p_\theta(t_i \mid t_{<i})\right).$$
AI connection. This is the most common intrinsic language-model score, but it must be interpreted with tokenizer and dataset context.
5.4 Bits per token
Main idea. Use base-2 logs for compression interpretation.
The useful formula is:

$$\mathrm{BPT} = \frac{1}{N} \sum_{i=1}^{N} -\log_2 p_\theta(t_i \mid t_{<i}) = \frac{\mathrm{NLL}}{\ln 2}.$$
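A tiny sketch converting per-token probabilities into average NLL, perplexity, and bits per token; the probabilities are invented for illustration.

```python
import numpy as np

# Illustrative probabilities the model assigned to the observed tokens.
token_probs = np.array([0.5, 0.25, 0.1, 0.05])

nll = -np.log(token_probs).mean()          # average NLL in nats
ppl = np.exp(nll)                          # perplexity
bpt = nll / np.log(2)                      # bits per token

print(f"NLL = {nll:.3f} nats, PPL = {ppl:.2f}, BPT = {bpt:.3f} bits")
# Interpretation: a model that always assigned probability 1/PPL to the
# observed token would have the same average loss.
```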
5.5 Tokenizer caveat
Main idea. Perplexity depends on the tokenization unit.
The useful formula is:

$$\mathrm{BPB} = \mathrm{BPT} \cdot \frac{N_{\mathrm{tokens}}}{N_{\mathrm{bytes}}},$$

and $\mathrm{BPB}$ and $\mathrm{BPC}$ are more comparable across tokenizers than token-level perplexity, because they normalize by a tokenizer-independent unit.
6. Sequence Scoring and Calibration
This part studies sequence scoring and calibration as an operational object: something an LLM computes for every prefix during training, scoring, and generation. The formulas below are small enough to derive by hand, but they are also exactly the formulas used inside large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| Length bias | raw log probability favors shorter strings | $\log p(y\mid x)$ decreases with length |
| Average log probability | length-normalized scores compare answers of different lengths | $\frac{1}{\lvert y\rvert}\log p(y\mid x)$ |
| Calibration | confidence should match empirical correctness | $\Pr(\text{correct}\mid \hat{p} = c)\approx c$ |
| Expected calibration error | bin confidence gaps to summarize miscalibration | $\mathrm{ECE}=\sum_b\frac{\lvert B_b\rvert}{N}\,\lvert\mathrm{acc}(B_b)-\mathrm{conf}(B_b)\rvert$ |
| Temperature scaling | post-hoc calibration rescales logits without changing class order | $p_T = \mathrm{softmax}(z/T)$ |
6.1 Length bias
Main idea. Raw log probability favors shorter strings.
The useful formula is:

$$\log p(y \mid x) = \sum_{j=1}^{\lvert y\rvert} \log p(y_j \mid x, y_{<j}) \le 0,$$

so $\log p(y \mid x)$ decreases with length: every extra token adds a non-positive term.
6.2 Average log probability
Main idea. Length-normalized scores compare answers of different lengths.
The useful formula is:

$$\mathrm{score}(y) = \frac{1}{\lvert y\rvert} \log p(y \mid x).$$
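A sketch of the length bias and its fix; the per-token probabilities for the two candidate answers are invented to make the effect visible, not taken from a model.

```python
import numpy as np

# Invented per-token probabilities for a short answer and a longer, better one.
short_answer = np.array([0.4, 0.4])        # 2 tokens, mediocre per-token probability
long_answer = np.array([0.7] * 8)          # 8 tokens, higher per-token probability

for name, probs in [("short", short_answer), ("long", long_answer)]:
    logp = np.log(probs).sum()             # raw sequence log probability
    avg = logp / len(probs)                # length-normalized score
    print(f"{name}: raw log p = {logp:.3f}, average log p = {avg:.3f}")

# Raw log probability favors the short answer simply because it has fewer terms;
# the per-token average favors the answer the model is actually more confident in.
```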
6.3 Calibration
Main idea. Confidence should match empirical correctness.
The useful formula is:

$$\Pr(\text{prediction correct} \mid \hat{p} = c) \approx c \quad \text{for all } c \in [0, 1],$$

where $\hat{p}$ is the probability the model assigns to its chosen answer.
6.4 Expected calibration error
Main idea. Bin confidence gaps to summarize miscalibration.
The useful formula is:

$$\mathrm{ECE} = \sum_{b} \frac{\lvert B_b\rvert}{N} \,\bigl|\mathrm{acc}(B_b) - \mathrm{conf}(B_b)\bigr|.$$
The probability object should always be read with its conditioning context. A token is not likely or unlikely in isolation; it is likely or unlikely after a prefix, under a vocabulary, with a particular tokenizer and model state. This is why the same word can be high probability in one prompt and nearly impossible in another.
Worked micro-example. Suppose the current prefix is The capital of France is. A reasonable model should place more mass on Paris than on banana, but it should still maintain a normalized distribution over every token in the vocabulary. If the logits for three candidate tokens are , the probability of the first token is large because exponentiation magnifies logit differences:
The model is not storing a sentence list. It is using a conditional distribution whose parameters are generated from the prefix representation.
Implementation check. In code, check shapes and normalization. If logits have shape (batch, time, vocab), then softmax(logits, axis=-1) should sum to one along the vocabulary axis for every batch and time position. When scoring labels, gather only the probability of the observed next token and mask padding positions before averaging.
AI connection. For LLMs, this is not decoration; it controls a concrete training, scoring, or decoding behavior.
Common mistake. Do not compare raw sequence probabilities across different output lengths without a length convention. Products of probabilities shrink as strings get longer, so a two-token answer often has a higher raw probability than a better twenty-token answer.
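The sketch below implements the binned ECE formula directly; the five hand-written confidence values are illustrative placeholders chosen to show how overconfidence inflates the score.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    conf:    (n,) top-1 confidences in [0, 1]
    correct: (n,) 1.0 if the top-1 prediction matched the target, else 0.0
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(conf)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if not in_bin.any():
            continue
        acc = correct[in_bin].mean()        # empirical accuracy in this bin
        avg_conf = conf[in_bin].mean()      # average stated confidence in this bin
        ece += (in_bin.sum() / n) * abs(acc - avg_conf)
    return ece

# Hypothetical values: confident-but-often-wrong predictions inflate ECE.
conf = np.array([0.9, 0.9, 0.9, 0.6, 0.6])
correct = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
print(expected_calibration_error(conf, correct, n_bins=5))
```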
6.5 Temperature scaling
Main idea. Post-hoc calibration rescales logits without changing class order.
The useful formula divides every logit by a single scalar temperature $T>0$ fitted on held-out data:

$$p_T(v\mid \text{prefix})=\frac{\exp(z_v/T)}{\sum_{v'}\exp(z_{v'}/T)}.$$

Because all logits are divided by the same positive constant, the argmax and the full ranking are unchanged; only the sharpness of the distribution moves. $T>1$ softens an overconfident model, $T<1$ sharpens an underconfident one, and $T$ is usually chosen to minimize negative log-likelihood on a validation set. A grid-search sketch follows.
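A minimal sketch of fitting the temperature by held-out NLL. The grid search and the randomly generated "validation" logits are illustrative simplifications; a production implementation would typically optimize $T$ with a few Newton or LBFGS steps on real validation data.

```python
import numpy as np

def nll_at_temperature(logits, targets, T):
    """Average negative log-likelihood of targets after dividing logits by T."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)                       # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def fit_temperature(logits, targets, grid=np.linspace(0.5, 3.0, 26)):
    """Pick the temperature that minimizes held-out NLL (simple grid search)."""
    losses = [nll_at_temperature(logits, targets, T) for T in grid]
    return grid[int(np.argmin(losses))]

# Hypothetical held-out logits and labels; the sharp random logits mimic
# an overconfident model, so the fitted temperature comes out large.
rng = np.random.default_rng(1)
val_logits = 3.0 * rng.normal(size=(500, 20))
val_targets = rng.integers(0, 20, size=500)
print(fit_temperature(val_logits, val_targets))
```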
7. Decoding as Distribution Transformation
This part studies decoding as a transformation applied to the model's next-token distribution at every step of generation. The model supplies probabilities; the decoding rule decides what to do with them. The rules below are small enough to derive by hand, and they are exactly the rules used inside large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| Greedy decoding | choose the highest-probability next token | $y_t=\arg\max_v\, p_\theta(v\mid y_{<t})$ |
| Sampling | draw from the categorical distribution | $y_t\sim p_\theta(\cdot\mid y_{<t})$ |
| Top-k filtering | keep only the k highest probability tokens | renormalize $p_\theta$ over the top-$k$ set $V_k$ |
| Nucleus top-p filtering | keep the smallest set whose mass exceeds p | smallest $V_p$ with $\sum_{v\in V_p}p_\theta(v\mid y_{<t})\ge p$ |
| Beam search and length penalty | approximate sequence MAP with multiple partial hypotheses | $s(y)=\log p(y\mid x)/\lvert y\rvert^{\alpha}$ |
7.1 Greedy decoding
Main idea. Choose the highest-probability next token.
The useful formula is

$$y_t=\arg\max_{v\in V}\,p_\theta(v\mid x,\,y_{<t}),$$

applied step by step until an EOS token or a length limit. Greedy decoding is deterministic and cheap, but maximizing each step separately does not maximize the probability of the whole sequence, and on open-ended prompts it tends to produce repetitive text. A minimal decoding loop follows.
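The sketch below writes the greedy loop against a hypothetical `next_token_logits` callable that stands in for a real model's forward pass; `toy_logits` is only a toy stand-in used to make the loop runnable.

```python
import numpy as np

def greedy_decode(next_token_logits, prefix_ids, eos_id, max_new_tokens=32):
    """Greedy decoding loop.

    next_token_logits: hypothetical callable mapping a list of token ids to a
                       1-D logits array over the vocabulary.
    prefix_ids:        prompt token ids.
    """
    out = list(prefix_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(out)
        next_id = int(np.argmax(logits))      # highest-probability token
        out.append(next_id)
        if next_id == eos_id:
            break
    return out

# Toy stand-in model: always prefers token (last_id + 1) mod vocab; EOS is id 0.
def toy_logits(ids, vocab=10):
    z = np.zeros(vocab)
    z[(ids[-1] + 1) % vocab] = 5.0
    return z

print(greedy_decode(toy_logits, prefix_ids=[3], eos_id=0))
```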
7.2 Sampling
Main idea. Draw from the categorical distribution.
The useful formula is

$$y_t\sim p_\theta(\cdot\mid x,\,y_{<t}),$$

usually after dividing the logits by a temperature $\tau$: $\tau<1$ concentrates mass on the most likely tokens, $\tau>1$ spreads it out, and $\tau\to 0$ recovers greedy decoding. Pure sampling reproduces the model's distribution exactly, which also means it occasionally emits tokens from the long low-probability tail. A sketch follows.
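A minimal temperature-sampling sketch. The four-element logit vector is an illustrative toy; the printed frequencies show how lower temperature concentrates the draws on the top token.

```python
import numpy as np

def sample_next(logits, temperature=1.0, rng=None):
    """Sample one token id from softmax(logits / temperature)."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                                  # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))

# The same logits sampled at different temperatures.
logits = np.array([3.0, 1.5, 0.5, -1.0])
rng = np.random.default_rng(0)
for tau in (0.5, 1.0, 2.0):
    draws = [sample_next(logits, temperature=tau, rng=rng) for _ in range(1000)]
    print(tau, np.bincount(draws, minlength=4) / 1000.0)
```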
7.3 Top-k filtering
Main idea. Keep only the k highest probability tokens.
The useful formula keeps only the $k$ highest-probability tokens and renormalizes:

$$p_k(v\mid \cdot)=\frac{p_\theta(v\mid\cdot)\,\mathbf{1}[v\in V_k]}{\sum_{v'\in V_k}p_\theta(v'\mid\cdot)},$$

where $V_k$ is the set of the $k$ largest-probability tokens. A fixed $k$ is simple but blunt: when the model is confident, $k$ tokens is too many; when the distribution is flat, $k$ may be too few. A sketch that filters logits before the softmax follows.
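A minimal top-k sketch: set the excluded logits to negative infinity so the softmax assigns them exactly zero mass, which renormalizes the survivors automatically.

```python
import numpy as np

def top_k_filter(logits, k):
    """Set all but the k largest logits to -inf, so softmax ignores them."""
    logits = np.asarray(logits, dtype=float).copy()
    if k < len(logits):
        kth = np.sort(logits)[-k]                 # smallest logit that survives
        logits[logits < kth] = -np.inf
    return logits

def softmax(z):
    z = z - np.max(z[np.isfinite(z)])             # stabilize using finite entries only
    e = np.exp(z)                                 # exp(-inf) -> 0
    return e / e.sum()

logits = np.array([4.0, 3.0, 1.0, 0.5, -2.0])
print(softmax(top_k_filter(logits, k=2)))         # mass only on the top two tokens
```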
7.4 Nucleus top-p filtering
Main idea. Keep the smallest set whose mass exceeds p.
The useful formula sorts tokens by probability and keeps the smallest prefix set $V_p$ whose cumulative mass reaches the threshold:

$$V_p=\min\Bigl\{S : \sum_{v\in S}p_\theta(v\mid\cdot)\ge p\Bigr\},\qquad p_{\text{nuc}}(v)\propto p_\theta(v\mid\cdot)\,\mathbf{1}[v\in V_p].$$

Unlike top-k, the size of the kept set adapts to the shape of the distribution: a confident step may keep one or two tokens, a flat step may keep hundreds (Holtzman et al., 2019).

AI connection. This is the default practical compromise between deterministic decoding and unrestricted sampling. A sketch follows.
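A minimal nucleus-filtering sketch operating on already-normalized probabilities. The five-element probability vector is illustrative; with threshold 0.8 only the first two tokens survive.

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Zero out tokens outside the nucleus and renormalize.

    probs: 1-D array of next-token probabilities (already softmaxed).
    """
    order = np.argsort(probs)[::-1]               # most probable first
    sorted_probs = probs[order]
    cum = np.cumsum(sorted_probs)
    cutoff = np.searchsorted(cum, p) + 1          # smallest prefix with mass >= p
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

probs = np.array([0.55, 0.25, 0.10, 0.06, 0.04])
print(top_p_filter(probs, p=0.80))                # keeps the first two tokens
```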
7.5 Beam search and length penalty
Main idea. Approximate the sequence MAP with multiple partial hypotheses.
The useful formula scores each hypothesis by a length-normalized log probability; one common choice of penalty is

$$s(y)=\frac{\log p_\theta(y\mid x)}{\mathrm{lp}(\lvert y\rvert)},\qquad \mathrm{lp}(\lvert y\rvert)=\lvert y\rvert^{\alpha},$$

so that longer correct answers are not automatically beaten by shorter ones. At each step the beam keeps the $B$ highest-scoring partial sequences, extends each by candidate tokens, and prunes back to $B$. Beam search is standard for tasks with a well-defined target such as translation and speech recognition, and less common for open-ended chat, where it tends to amplify repetition. A compact sketch follows.
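A minimal beam-search sketch with the length-normalized score above. `next_token_logits` is again a hypothetical stand-in for a model forward pass, and `toy_logits` is a toy model used only to make the example runnable; real implementations expand over the full vocabulary rather than only the top `beam_size` tokens per hypothesis.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def beam_search(next_token_logits, prefix_ids, eos_id, beam_size=3,
                max_new_tokens=20, alpha=0.7):
    """Small beam search returning the best ids under logprob / len**alpha."""
    prefix_len = len(prefix_ids)

    def score(ids, logprob):
        gen_len = max(len(ids) - prefix_len, 1)
        return logprob / gen_len ** alpha

    beams = [(list(prefix_ids), 0.0, False)]          # (ids, total logprob, finished)
    for _ in range(max_new_tokens):
        candidates = []
        for ids, lp, done in beams:
            if done:
                candidates.append((ids, lp, True))    # carry finished hypotheses
                continue
            logp = log_softmax(next_token_logits(ids))
            for tok in np.argsort(logp)[::-1][:beam_size]:
                candidates.append((ids + [int(tok)], lp + float(logp[tok]),
                                   int(tok) == eos_id))
        beams = sorted(candidates, key=lambda c: score(c[0], c[1]),
                       reverse=True)[:beam_size]      # prune to the beam width
        if all(done for _, _, done in beams):
            break
    best = max(beams, key=lambda c: score(c[0], c[1]))
    return best[0]

# Toy stand-in model: prefers token (last id + 1) mod vocab; EOS is id 0.
def toy_logits(ids, vocab=10):
    z = np.zeros(vocab)
    z[(ids[-1] + 1) % vocab] = 5.0
    return z

print(beam_search(toy_logits, prefix_ids=[3], eos_id=0, beam_size=2))
```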
8. From Count Models to Neural LMs
This part studies the path from count-based models to neural language models as a sequence of estimators for the same conditional distribution $p(w_t\mid w_{<t})$. The formulas below are small enough to derive by hand, and the last one is exactly the computation inside large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| N-gram models | approximate history by the last n minus one tokens | $P(w_t\mid w_{<t})\approx P(w_t\mid w_{t-n+1:t-1})$ |
| Smoothing | reserve probability for unseen events | $P_\alpha(w\mid h)=\frac{c(h,w)+\alpha}{c(h)+\alpha V}$ |
| Neural probabilistic LM | learn distributed word representations and probability together | $p(w_t\mid h)=\mathrm{softmax}(W[e_{w_{t-n+1}};\dots;e_{w_{t-1}}]+b)$ |
| Recurrent LM | compress history into a state | $h_t=f(h_{t-1},e_{w_t})$, $p(w_{t+1}\mid w_{\le t})=\mathrm{softmax}(Wh_t)$ |
| Transformer LM | causal self-attention computes context-dependent hidden states in parallel | $p(w_{t+1}\mid w_{\le t})=\mathrm{softmax}(Wh_t)$ with $h_t$ from masked self-attention |
8.1 N-gram models
Main idea. Approximate history by the last n minus one tokens.
The useful formula replaces the full history with the last $n-1$ tokens and estimates the conditional from counts:

$$P(w_t\mid w_{<t})\approx P(w_t\mid w_{t-n+1:t-1})=\frac{c(w_{t-n+1:t-1},\,w_t)}{c(w_{t-n+1:t-1})}.$$

The Markov truncation makes estimation tractable, but any n-gram unseen in the training corpus receives probability zero, which makes whole test sentences impossible and perplexity infinite. That failure motivates smoothing in the next subsection. A bigram sketch follows.
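A minimal maximum-likelihood bigram model on a toy corpus, showing both the count-based estimate and the zero-probability problem for unseen pairs.

```python
from collections import Counter, defaultdict

def train_bigram(corpus_tokens):
    """Maximum-likelihood bigram model: P(w | prev) = c(prev, w) / c(prev)."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        pair_counts[prev][nxt] += 1
    return pair_counts

def bigram_prob(pair_counts, prev, w):
    total = sum(pair_counts[prev].values())
    if total == 0:
        return 0.0                                 # unseen history
    return pair_counts[prev][w] / total

tokens = "the cat sat on the mat the cat ran".split()
counts = train_bigram(tokens)
print(bigram_prob(counts, "the", "cat"))   # 2/3: "the" is followed by cat, mat, cat
print(bigram_prob(counts, "the", "ran"))   # 0.0: unseen bigram, the zero-count problem
```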
8.2 Smoothing
Main idea. Reserve probability for unseen events.
The useful formula is add-$\alpha$ (Lidstone) smoothing over a vocabulary of size $V$:

$$P_\alpha(w\mid h)=\frac{c(h,w)+\alpha}{c(h)+\alpha V},$$

which shifts a small amount of probability mass from seen events to every unseen event, so the model never returns exact zeros. Add-$\alpha$ is the simplest choice; interpolation and Kneser-Ney smoothing distribute the reserved mass more intelligently, and in a neural LM the softmax over a dense hidden state plays the analogous role of never assigning zero probability. A sketch follows.
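A minimal add-alpha sketch on the same toy corpus as 8.1: the smoothed conditional discounts seen bigrams, assigns nonzero mass to unseen ones, and still sums to one over the vocabulary.

```python
from collections import Counter, defaultdict

def add_alpha_prob(pair_counts, prev, w, vocab, alpha=1.0):
    """Add-alpha smoothed bigram probability over a fixed vocabulary."""
    total = sum(pair_counts[prev].values())
    return (pair_counts[prev][w] + alpha) / (total + alpha * len(vocab))

tokens = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(tokens))
pair_counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    pair_counts[prev][nxt] += 1

print(add_alpha_prob(pair_counts, "the", "cat", vocab))  # discounted from 2/3
print(add_alpha_prob(pair_counts, "the", "ran", vocab))  # unseen bigram, nonzero now
# The smoothed conditional still sums to one over the vocabulary:
print(sum(add_alpha_prob(pair_counts, "the", w, vocab) for w in vocab))
```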
8.3 Neural probabilistic LM
Main idea. Learn distributed word representations and probability together.
The useful formula (Bengio et al., 2003) feeds the concatenated embeddings of the last $n-1$ tokens through a small network and a softmax:

$$p(w_t\mid w_{t-n+1:t-1})=\mathrm{softmax}\bigl(W\,\tanh(H\,[e_{w_{t-n+1}};\dots;e_{w_{t-1}}]+d)+b\bigr)_{w_t}.$$

Two ideas survive into modern LLMs: tokens are represented by learned dense vectors, and the representations and the probability are trained jointly by maximizing likelihood. What changed later is only how the context vector is computed. A forward-pass sketch with random weights follows.
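A minimal forward-pass sketch of a Bengio-style model with randomly initialized parameters; training would fit these by the same cross-entropy objective used throughout this section. Sizes are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_emb, d_hid, n_ctx = 50, 16, 32, 3        # vocab, embedding, hidden, context size

E = rng.normal(scale=0.1, size=(V, d_emb))           # token embeddings
H = rng.normal(scale=0.1, size=(d_hid, n_ctx * d_emb))
d = np.zeros(d_hid)
W = rng.normal(scale=0.1, size=(V, d_hid))
b = np.zeros(V)

def neural_lm_probs(context_ids):
    """Next-token distribution from the last n_ctx token ids."""
    x = np.concatenate([E[i] for i in context_ids])   # concatenated embeddings
    h = np.tanh(H @ x + d)                            # hidden layer
    z = W @ h + b                                     # logits over the vocabulary
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

probs = neural_lm_probs([4, 17, 9])
print(probs.shape, probs.sum())                       # (50,) 1.0
```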
8.4 Recurrent LM
Main idea. Compress history into a state.
The useful formula compresses the entire history into a fixed-size state that is updated one token at a time:

$$h_t=f(h_{t-1},\,e_{w_t}),\qquad p(w_{t+1}\mid w_{\le t})=\mathrm{softmax}(W h_t)_{w_{t+1}}.$$

Unlike an n-gram model, the conditioning window is unbounded, but all of the history must pass through one vector and the tokens must be processed sequentially, which limits long-range retention and training parallelism. A single-layer sketch follows.
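A minimal vanilla-RNN language-model step with random weights; again, the point is the shape of the computation, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_emb, d_hid = 50, 16, 32

E   = rng.normal(scale=0.1, size=(V, d_emb))   # embeddings
Wxh = rng.normal(scale=0.1, size=(d_hid, d_emb))
Whh = rng.normal(scale=0.1, size=(d_hid, d_hid))
Why = rng.normal(scale=0.1, size=(V, d_hid))

def rnn_lm_probs(token_ids):
    """Run a vanilla RNN over the prefix; return p(next token | prefix)."""
    h = np.zeros(d_hid)
    for t in token_ids:
        h = np.tanh(Wxh @ E[t] + Whh @ h)      # state update compresses the history
    z = Why @ h
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

print(rnn_lm_probs([3, 7, 1]).sum())           # 1.0: a normalized distribution
```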
8.5 Transformer LM
Main idea. Causal self-attention computes context-dependent hidden states in parallel.
The useful formula computes every position's hidden state in parallel with causally masked self-attention, then applies the same LM head at every position:

$$h_t=\mathrm{TransformerBlock}(e_{w_1},\dots,e_{w_t}),\qquad p(w_{t+1}\mid w_{\le t})=\mathrm{softmax}(W h_t + b)_{w_{t+1}}.$$

The causal mask guarantees that position $t$ attends only to positions $\le t$, so every row of the output is a legitimate conditional of the autoregressive factorization, and all $T$ conditionals of a training sequence are computed in one forward pass. A single-head attention sketch that checks the mask follows.
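A minimal single-head causal self-attention sketch, stripped of query/key/value projections for clarity. The final check verifies the property that makes teacher forcing valid: changing a future token must not change earlier hidden states.

```python
import numpy as np

def causal_self_attention(X):
    """One head of causal self-attention (no projections, for clarity).

    X: (T, d) token representations. Position t may only attend to positions <= t.
    """
    T, d = X.shape
    scores = X @ X.T / np.sqrt(d)                       # (T, T) attention logits
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)    # True strictly above diagonal
    scores = np.where(mask, -np.inf, scores)            # block attention to the future
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X                                  # (T, d) contextual states

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
H = causal_self_attention(X)
X2 = X.copy()
X2[4] += 10.0                                           # perturb only the last token
H2 = causal_self_attention(X2)
print(np.allclose(H[:4], H2[:4]))                       # True: earlier states unchanged
```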
9. Conditional Generation and Constraints
This part studies conditional generation and constraints: how prompts, retrieved documents, logit manipulations, and output policies change what gets generated without changing the probability calculus itself. The formulas below are small enough to derive by hand, but they are exactly the mechanisms used around large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| Prompt conditioning | a prompt is the observed prefix in the conditional | $p_\theta(y\mid x)=\prod_t p_\theta(y_t\mid x, y_{<t})$ |
| RAG conditioning | retrieved text changes the conditioning information | $p_\theta(y\mid x, d)$ for retrieved documents $d$ |
| Logit bias and constraints | decoding can restrict or reweight the support | $\tilde z_v = z_v + b_v$, with $b_v=-\infty$ for banned tokens |
| Guidance by logit mixing | combine distributions in logit space when a control model exists | $\tilde z = z_{\text{base}} + \gamma\,(z_{\text{cond}} - z_{\text{base}})$ |
| Safety filters | post-processing policies are not the same as model probability | $\Pr(\text{emit } y)$ may differ from $p_\theta(y\mid x)$ |
9.1 Prompt conditioning
Main idea. A prompt is the observed prefix in the conditional.
The useful formula is the conditional factorization with the prompt fixed as observed context:

$$p_\theta(y\mid x)=\prod_{t=1}^{|y|}p_\theta(y_t\mid x,\,y_{<t}).$$

Nothing special happens at the prompt boundary inside the model; the prompt tokens simply occupy earlier positions. The practical consequence is in scoring: when evaluating an answer, sum or average the log probabilities of the answer tokens only, and do not let the prompt tokens leak into the answer loss. A sketch follows.
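A minimal answer-scoring sketch under the usual shift convention (logits at position t predict the token at position t + 1). The random logits are placeholders; the point is the index arithmetic that keeps prompt positions out of the answer score.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def answer_logprob(logits, token_ids, prompt_len):
    """Sum and mean log probability of the answer tokens given the prompt.

    logits:     (T, vocab) next-token logits; logits[t] predicts token_ids[t + 1]
    token_ids:  (T + 1,)   prompt followed by answer token ids
    prompt_len: number of prompt tokens at the front of token_ids
    """
    logp = log_softmax(logits)
    targets = token_ids[1:]                                  # next-token labels
    per_pos = logp[np.arange(len(targets)), targets]         # label log probs
    answer_part = per_pos[prompt_len - 1:]                   # predictions of answer tokens
    return answer_part.sum(), answer_part.mean()

# Toy check: 3 prompt tokens, 2 answer tokens, vocabulary of 5.
rng = np.random.default_rng(0)
ids = np.array([1, 4, 2, 3, 0])
logits = rng.normal(size=(len(ids) - 1, 5))
print(answer_logprob(logits, ids, prompt_len=3))
```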
9.2 RAG conditioning
Main idea. Retrieved text changes the conditioning information.
The useful formula simply enlarges the conditioning context with retrieved documents $d$:

$$p_\theta(y\mid x, d)=\prod_t p_\theta\bigl(y_t\mid [d;x],\,y_{<t}\bigr),$$

where $[d;x]$ denotes the retrieved text concatenated into the prompt.

AI connection. Retrieval does not change the probability rules; it changes what information the conditional can see. For evaluation this matters for attribution: a likelihood gain on factual questions may come from the retrieved passage rather than from the parameters, so score the same question with and without $d$ to separate the two effects.
9.3 Logit bias and constraints
Main idea. Decoding can restrict or reweight the support.
The useful formula adds a bias vector to the logits before the softmax,

$$\tilde z_v=z_v+b_v,\qquad p(v)\propto\exp(\tilde z_v),$$

with $b_v=-\infty$ (in practice a very large negative number) to ban a token and a finite positive or negative $b_v$ to encourage or discourage it. Constrained decoding, grammar- or JSON-constrained output, and stop-token handling all work this way: the support is restricted step by step, and the softmax renormalizes whatever remains. A sketch follows.
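A minimal sketch combining a soft logit bias with a hard restriction of the support to an allowed-token set. The four-token vocabulary is a toy.

```python
import numpy as np

def constrained_probs(logits, allowed_ids=None, bias=None):
    """Apply a token-level logit bias and/or restrict support to allowed_ids."""
    z = np.asarray(logits, dtype=float).copy()
    if bias is not None:
        z = z + bias                                  # soft reweighting
    if allowed_ids is not None:
        mask = np.full_like(z, -np.inf)
        mask[list(allowed_ids)] = 0.0
        z = z + mask                                  # hard constraint on support
    z = z - z[np.isfinite(z)].max()                   # stable softmax over survivors
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.0, -1.0])
print(constrained_probs(logits, allowed_ids=[1, 2]))                    # only tokens 1, 2 survive
print(constrained_probs(logits, bias=np.array([0.0, 2.0, 0.0, 0.0])))   # token 1 boosted
```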
9.4 Guidance by logit mixing
Main idea. Combine distributions in logit space when a control model exists.
The useful formula combines two sets of logits computed for the same step, typically an unconditioned (or weakly conditioned) run and a fully conditioned run:

$$\tilde z = z_{\text{base}} + \gamma\,\bigl(z_{\text{cond}} - z_{\text{base}}\bigr),$$

followed by $\mathrm{softmax}(\tilde z)$. With $\gamma=1$ this reduces to the conditioned model, while $\gamma>1$ exaggerates whatever the conditioning adds. This classifier-free-guidance style of mixing requires two forward passes per step; it is standard in diffusion models and appears in some LLM decoding setups. A sketch follows.
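A minimal logit-mixing sketch with hand-written toy logit vectors; varying the guidance weight shows how the conditioned signal is suppressed, reproduced, or amplified.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def guided_probs(z_base, z_cond, gamma=1.5):
    """Mix two logit vectors for the same decoding step in logit space."""
    z = z_base + gamma * (z_cond - z_base)
    return softmax(z)

z_base = np.array([1.0, 1.0, 1.0, 1.0])        # e.g. logits without the control prompt
z_cond = np.array([1.0, 3.0, 1.0, 0.0])        # logits with the control prompt
for g in (0.0, 1.0, 2.0):
    print(g, guided_probs(z_base, z_cond, gamma=g))
```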
9.5 Safety filters
Main idea. Post-processing policies are not the same as model probability.
The useful relation is an inequality rather than a formula: the probability that the system emits a string is not the model probability of that string,

$$\Pr(\text{emit }y\mid x)\ne p_\theta(y\mid x)\ \text{in general},$$

because refusal classifiers, blocklists, and rewriting stages sit between the sampled tokens and the user. When auditing behavior, measure both quantities separately: the model may assign high probability to content the pipeline never shows, and estimating "what the model believes" from filtered outputs alone will be biased.
10. Diagnostics and Learning Practice
This part collects diagnostics and learning practice: quick checks that the probability object your code manipulates is the one the math defines, plus habits for interpreting likelihood numbers. The checks below are small enough to run by hand, but they catch the same bugs that derail large training and evaluation runs.
| Subtopic | Core question | Formula |
|---|---|---|
| Normalization checks | probabilities must sum to one at each prefix | $\sum_{v\in V} p_\theta(v\mid \text{prefix}) = 1$ |
| Likelihood sanity checks | known continuations should score above random continuations | $\log p_\theta(\text{true}) \gg \log p_\theta(\text{shuffled})$ on average |
| Distribution shift | low likelihood can indicate unfamiliar domain or bad tokenization | $\mathrm{NLL}_{\text{new domain}} \gg \mathrm{NLL}_{\text{in-domain}}$ |
| Generation versus evaluation | sampling quality and held-out NLL measure different properties | $\arg\min \mathrm{NLL}$ is not always the best user experience |
| Bridge to training at scale | the same loss drives distributed training and scaling laws | $\mathcal{L}(\theta)=-\tfrac{1}{N}\sum_t \log p_\theta(w_t\mid w_{<t})$ |
10.1 Normalization checks
Main idea. Probabilities must sum to one at each prefix.
The useful formula is the normalization identity at every prefix:

$$\sum_{v\in V}p_\theta(v\mid \text{prefix})=1.$$

Violations almost always mean a mechanical bug: softmax applied over the wrong axis, probabilities mixed with logits, log-softmax applied twice, or a masked vocabulary that was never renormalized. Checking the sum along the vocabulary axis for a few batch and time positions catches these cheaply. A sketch follows.
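A minimal normalization check over a (batch, time, vocab) logits tensor; the random input stands in for a real LM head output.

```python
import numpy as np

def check_normalization(logits, atol=1e-5):
    """Assert softmax over the vocabulary axis sums to one at every position.

    logits: (batch, time, vocab) array, as produced by an LM head.
    """
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    sums = probs.sum(axis=-1)                       # (batch, time)
    assert np.allclose(sums, 1.0, atol=atol), sums
    return True

rng = np.random.default_rng(0)
print(check_normalization(rng.normal(size=(2, 7, 100))))   # True
```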
10.2 Likelihood sanity checks
Main idea. Known continuations should score above random continuations.
The useful check compares the average log probability of true continuations against shuffled or random continuations of the same length:

$$\frac{1}{N}\sum_t \log p_\theta(\text{true token}_t) \;\gg\; \frac{1}{N}\sum_t \log p_\theta(\text{shuffled token}_t).$$

If a trained model does not clearly prefer real text to scrambled text, suspect the scoring code before the model: common causes are off-by-one label alignment, scoring the prompt instead of the answer, or a tokenizer mismatch between training and evaluation. A sketch follows.
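A minimal true-versus-shuffled sanity check. The synthetic logits, which boost the true labels by a constant, stand in for a trained model so the gap comes out clearly positive.

```python
import numpy as np

def avg_label_logprob(logits, targets):
    """Average log probability the model assigns to the given targets."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return logp[np.arange(len(targets)), targets].mean()

def sanity_gap(logits, targets, rng=None):
    """True-vs-shuffled likelihood gap; should be clearly positive for a real LM."""
    rng = rng or np.random.default_rng(0)
    shuffled = rng.permutation(targets)
    return avg_label_logprob(logits, targets) - avg_label_logprob(logits, shuffled)

# Synthetic logits that favor the true labels, standing in for a trained model.
rng = np.random.default_rng(0)
targets = rng.integers(0, 50, size=200)
logits = rng.normal(size=(200, 50))
logits[np.arange(200), targets] += 3.0
print(sanity_gap(logits, targets, rng))     # clearly positive
```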
10.3 Distribution shift
Main idea. Low likelihood can indicate unfamiliar domain or bad tokenization.
The useful signal is a gap in held-out negative log-likelihood between in-domain text and the new domain:

$$\mathrm{NLL}_{\text{new domain}} \gg \mathrm{NLL}_{\text{in-domain}}.$$

A large gap can mean genuinely unfamiliar content, but it can also mean the tokenizer fragments the new text badly (code, other languages, unusual formatting), which inflates per-token loss for mechanical reasons. Before concluding the model is weak on a domain, inspect tokens per character and the worst-scoring positions to separate modeling failure from tokenization failure.
10.4 Generation versus evaluation
Main idea. Sampling quality and held-out NLL measure different properties.
The useful caution is that $\arg\min_\theta \mathrm{NLL}$ is not always the best user experience. The model with the lowest held-out cross-entropy is the best next-token predictor on average, but users judge sampled outputs, which also depend on the decoding rule, the prompt distribution, and preferences that likelihood never sees. Track both: NLL or perplexity for the probability object, and human or task-based evaluation for the generated text.
10.5 Bridge to training at scale
Main idea. The same loss drives distributed training and scaling laws.
The useful formula is the same token-averaged cross-entropy that has appeared throughout this section,

$$\mathcal{L}(\theta)=-\frac{1}{N}\sum_{t}\log p_\theta(w_t\mid w_{<t}),$$

now evaluated over enormous corpora with distributed data loading, mixed precision, and gradient accumulation. Scale changes the engineering around the estimate, not the target: loss curves, scaling laws, and checkpoint comparisons are all statements about this one quantity.
Practice Exercises
- Factorize a three-token sentence probability using the chain rule.
- Compute stable softmax probabilities from a logit vector by subtracting the maximum.
- Derive the gradient of one-hot cross-entropy with respect to logits.
- Compute a masked average loss for a padded batch.
- Convert average NLL from nats to perplexity and bits per token.
- Compare raw and length-normalized scores for two candidate answers.
- Apply temperature, top-k, and top-p filtering to a toy distribution.
- Compute expected calibration error for binned confidence values.
- Score a conditional answer without including the prompt tokens in the average answer loss.
- Write a short checklist for debugging an LM probability implementation.
Why This Matters for AI
The probability layer is the contract between representation learning and text behavior. If logits are unstable, loss is masked incorrectly, probabilities are compared across incompatible tokenizations, or decoding filters are misunderstood, the model may appear to improve while the measured probability object is wrong. Strong LLM work requires comfort with this layer because it appears in pretraining, supervised fine-tuning, preference optimization, retrieval evaluation, calibration, long-context tests, and deployment decoding.
Bridge to Training at Scale
The next section studies what happens when the same cross-entropy objective is optimized with billions of tokens, large batches, distributed hardware, mixed precision, gradient accumulation, learning-rate schedules, and checkpointing. Nothing in the training-at-scale story changes the probability target. Scale changes how expensive and fragile it is to estimate the same conditional distributions.
References
- Claude Shannon, "A Mathematical Theory of Communication", 1948.
- Claude Shannon, "Prediction and Entropy of Printed English", 1951.
- Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin, "A Neural Probabilistic Language Model", JMLR, 2003: https://jmlr.csail.mit.edu/papers/v3/bengio03a.html
- Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur, "Recurrent neural network based language model", Interspeech, 2010: https://www.isca-archive.org/interspeech_2010/mikolov10_interspeech.html
- Dan Jurafsky and James H. Martin, "Speech and Language Processing", Chapter 3 on n-gram language models: https://web.stanford.edu/~jurafsky/slp3/
- Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu, "Exploring the Limits of Language Modeling", 2016: https://arxiv.org/abs/1602.02410
- Ashish Vaswani et al., "Attention Is All You Need", 2017: https://arxiv.org/abs/1706.03762
- Ari Holtzman et al., "The Curious Case of Neural Text Degeneration", 2019: https://arxiv.org/abs/1904.09751
- Ofir Press et al., "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation", 2021: https://arxiv.org/abs/2108.12409