Language Model Probability Math
An LLM is a machine for assigning probabilities to token continuations. The transformer stack builds a contextual vector; this section explains how that vector becomes a normalized next-token distribution, how training improves that distribution, and how decoding turns it back into text.
Overview
Language-model probability is the center of the LLM learning loop. During training, the model receives a true prefix and is penalized when the observed next token has low probability. During evaluation, the same probabilities produce negative log-likelihood, cross-entropy, perplexity, calibration curves, and answer scores. During generation, the probabilities are transformed by decoding rules such as temperature, top-k, top-p, beam search, or constraints.
The most important idea is simple but deep: the probability of a whole token sequence factorizes exactly into next-token conditionals,

$$p_\theta(t_1, \dots, t_n) = \prod_{i=1}^{n} p_\theta(t_i \mid t_{<i}).$$

This chain-rule factorization is exact; no independence assumption is made. Autoregressive language modeling becomes a practical learning problem because a neural network can estimate each conditional distribution.
Prerequisites
- Conditional probability and the product rule
- Logarithms, exponentials, and gradients
- Entropy, KL divergence, and cross-entropy
- Vector and matrix shapes from embeddings and attention
- The previous LLM sections on tokenization, embeddings, attention, and positions
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Builds the probability machinery with executable toy LMs, stable softmax, cross-entropy gradients, perplexity, calibration, and decoding demos. |
| exercises.ipynb | Gives ten practice problems with scaffolds and complete solutions for sequence probability, loss, decoding, calibration, and conditional scoring. |
Learning Objectives
After this section, you should be able to:
- Define a language model as a distribution over token sequences with an EOS convention.
- Derive the autoregressive factorization from the probability chain rule.
- Convert hidden states into logits, logits into probabilities, and probabilities into log-likelihoods.
- Implement numerically stable softmax and log-softmax.
- Derive the cross-entropy gradient with respect to logits.
- Compute entropy, KL divergence, cross-entropy, perplexity, bits per token, and length-normalized scores.
- Explain why decoding changes the sampling distribution without changing the trained model.
- Diagnose common probability bugs in LLM training and evaluation code.
Table of Contents
- Language Models as Distributions
- Autoregressive Factorization
- 2.1 Chain rule derivation
- 2.2 Teacher forcing
- 2.3 Causal masking
- 2.4 Sequence log probability
- 2.5 Prefix scoring
- Logits, Softmax, and Log-Sum-Exp
- 3.1 LM head
- 3.2 Softmax normalization
- 3.3 Shift invariance
- 3.4 Stable log-softmax
- 3.5 Temperature
- Maximum Likelihood and Cross-Entropy
- Entropy, KL, Perplexity, and Bits
- 5.1 Entropy
- 5.2 Cross-entropy decomposition
- 5.3 Perplexity
- 5.4 Bits per token
- 5.5 Tokenizer caveat
- Sequence Scoring and Calibration
- 6.1 Length bias
- 6.2 Average log probability
- 6.3 Calibration
- 6.4 Expected calibration error
- 6.5 Temperature scaling
- Decoding as Distribution Transformation
- 7.1 Greedy decoding
- 7.2 Sampling
- 7.3 Top-k filtering
- 7.4 Nucleus top-p filtering
- 7.5 Beam search and length penalty
- From Count Models to Neural LMs
- 8.1 N-gram models
- 8.2 Smoothing
- 8.3 Neural probabilistic LM
- 8.4 Recurrent LM
- 8.5 Transformer LM
- Conditional Generation and Constraints
- Diagnostics and Learning Practice
- 10.1 Normalization checks
- 10.2 Likelihood sanity checks
- 10.3 Distribution shift
- 10.4 Generation versus evaluation
- 10.5 Bridge to training at scale
Notation
| Symbol | Meaning |
|---|---|
| $\mathcal{V}$ | Vocabulary of ordinary tokens |
| $\mathcal{V}_{\mathrm{ext}}$ | Vocabulary plus special tokens such as EOS |
| $t_i$ | Token at position $i$ |
| $t_{<i}$ | Prefix of tokens before position $i$ |
| $h_i$ | Transformer hidden state at position $i$ |
| $z_i$ | Logit vector for the next-token distribution |
| $p_\theta(\cdot \mid t_{<i})$ | Model distribution over the next token |
| $y_i$ | One-hot target vector |
| $m_i$ | Mask indicating whether a token contributes to the loss |
The standard decoder-only training path is:
tokens -> embeddings + positions -> causal transformer -> hidden state -> LM head -> logits -> log-softmax -> NLL
1. Language Models as Distributions
This part treats a language model as an operational object: a distribution over token sequences that an LLM computes for every prefix during training, scoring, and generation. The formulas below are small enough to derive by hand, but they are also exactly the formulas used inside large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| Finite vocabulary and EOS | make sequence probability normalize by ending strings explicitly | $\mathcal{V}_{\mathrm{ext}} = \mathcal{V} \cup \{\mathrm{EOS}\}$ |
| Probability over finite strings | a language model assigns mass to every complete token string | $\sum p_\theta(t_{1:n}) = 1$ over complete strings |
| Conditional next-token distributions | generation uses one normalized categorical distribution at each prefix | $\sum_{v} p_\theta(v \mid t_{<i}) = 1$ |
| Support and impossible events | zero probability is a dangerous modeling choice | $p_\theta(v \mid h) > 0$ after softmax |
| Why next-token prediction is enough | the chain rule converts local predictions into a full joint model | $p_\theta(t_{1:n}) = \prod_i p_\theta(t_i \mid t_{<i})$ |
1.1 Finite vocabulary and EOS
Main idea. Make sequence probability normalize by ending strings explicitly.
The useful formula is:

$$\mathcal{V}_{\mathrm{ext}} = \mathcal{V} \cup \{\mathrm{EOS}\}, \qquad \text{and every complete string ends with } t_n = \mathrm{EOS}.$$
The probability object should always be read with its conditioning context. A token is not likely or unlikely in isolation; it is likely or unlikely after a prefix, under a vocabulary, with a particular tokenizer and model state. This is why the same word can be high probability in one prompt and nearly impossible in another.
Worked micro-example. Suppose the current prefix is The capital of France is. A reasonable model should place more mass on Paris than on banana, but it should still maintain a normalized distribution over every token in the vocabulary. If the logits for three candidate tokens are, say, $z = (5.0,\ 2.0,\ 0.0)$, the probability of the first token is large because exponentiation magnifies logit differences:

$$p_1 = \frac{e^{5.0}}{e^{5.0} + e^{2.0} + e^{0.0}} \approx \frac{148.4}{148.4 + 7.4 + 1.0} \approx 0.95.$$
The model is not storing a sentence list. It is using a conditional distribution whose parameters are generated from the prefix representation.
Implementation check. In code, check shapes and normalization. If logits have shape (batch, time, vocab), then softmax(logits, axis=-1) should sum to one along the vocabulary axis for every batch and time position. When scoring labels, gather only the probability of the observed next token and mask padding positions before averaging.
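The sketch below runs these checks in NumPy. The shapes, random logits, token ids, and mask values are illustrative, not taken from any particular model.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Stable log-softmax: subtract the max before exponentiating.
    z = logits - logits.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 10))          # (batch, time, vocab), illustrative
targets = rng.integers(0, 10, size=(2, 4))    # observed next tokens
mask = np.array([[1, 1, 1, 0],                # 0 marks padded target positions
                 [1, 1, 1, 1]])

# Normalization check: probabilities sum to one along the vocab axis.
probs = np.exp(log_softmax(logits))
assert np.allclose(probs.sum(axis=-1), 1.0)

# Gather only the log probability of each observed next token.
logp = np.take_along_axis(log_softmax(logits), targets[..., None], axis=-1)[..., 0]

# Masked mean negative log-likelihood: padding contributes nothing.
nll = -(logp * mask).sum() / mask.sum()
print("masked NLL:", nll)
```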
AI connection. The EOS convention is not decoration: it gives generation a principled stopping point, and it is what makes the probabilities of complete strings sum to one so that sequence-level scores are comparable at all.
Common mistake. Do not compare raw sequence probabilities across different output lengths without a length convention. Products of probabilities shrink as strings get longer, so a two-token answer often has a higher raw probability than a better twenty-token answer.
1.2 Probability over finite strings
Main idea. A language model assigns mass to every complete token string.
The useful formula is:

$$p_\theta(t_{1:n}) \ge 0 \quad \text{and} \quad \sum_{\text{complete strings } t_{1:n}} p_\theta(t_{1:n}) = 1.$$
1.3 Conditional next-token distributions
Main idea. Generation uses one normalized categorical distribution at each prefix.
The useful formula is:

$$\sum_{v \in \mathcal{V}_{\mathrm{ext}}} p_\theta(v \mid t_{<i}) = 1 \quad \text{for every prefix } t_{<i}.$$
1.4 Support and impossible events
Main idea. Zero probability is a dangerous modeling choice.
The useful formula is:

$$p_\theta(v \mid h) = \frac{\exp(z_v)}{\sum_{u} \exp(z_u)} > 0 \quad \text{for every } v \in \mathcal{V}_{\mathrm{ext}} \text{ after softmax},$$

so the model can make a token arbitrarily unlikely but never exactly impossible.
1.5 Why next-token prediction is enough
Main idea. The chain rule converts local predictions into a full joint model.
The useful formula is:

$$p_\theta(t_{1:n}) = \prod_{i=1}^{n} p_\theta(t_i \mid t_{<i}).$$
2. Autoregressive Factorization
This part studies autoregressive factorization as an operational object: something an LLM computes for every prefix during training, scoring, and generation. The formulas below are small enough to derive by hand, but they are also exactly the formulas used inside large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| Chain rule derivation | no independence assumption is needed | $p_\theta(t_{1:n}) = \prod_{i=1}^{n} p_\theta(t_i \mid t_{<i})$ |
| Teacher forcing | training conditions on the true prefix, not sampled model history | $\mathcal{L} = -\sum_i \log p_\theta(t_i \mid t_{<i})$ with $t_{<i}$ from the data |
| Causal masking | attention must not leak future targets into the conditional | $M_{ij}=0$ for $j\le i$ and $M_{ij}=-\infty$ otherwise |
| Sequence log probability | products become sums for stable scoring | $\log p_\theta(t_{1:n}) = \sum_i \log p_\theta(t_i \mid t_{<i})$ |
| Prefix scoring | prompt likelihood and answer likelihood are different objects | $\log p_\theta(y \mid x) = \sum_j \log p_\theta(y_j \mid x, y_{<j})$ |
2.1 Chain rule derivation
Main idea. No independence assumption is needed.
The useful formula is:

$$p_\theta(t_1, \dots, t_n) = \prod_{i=1}^{n} p_\theta(t_i \mid t_1, \dots, t_{i-1}),$$

which is an identity of probability theory, not an approximation.
2.2 Teacher forcing
Main idea. Training conditions on the true prefix, not sampled model history.
The useful formula is:

$$\mathcal{L}(\theta) = -\sum_{i} \log p_\theta(t_i \mid t_{<i}), \qquad \text{with } t_{<i} \text{ taken from the data, not from model samples}.$$
AI connection. This is why training can be massively parallel while generation is sequential.
2.3 Causal masking
Main idea. Attention must not leak future targets into the conditional.
The useful formula is:

$$\mathrm{Attention} = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right)V, \qquad M_{ij} = 0 \text{ for } j \le i \text{ and } M_{ij} = -\infty \text{ otherwise}.$$
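Below is a minimal NumPy sketch of the additive causal mask. The score matrix stands in for $QK^{\top}/\sqrt{d}$, and the sequence length is illustrative.

```python
import numpy as np

def causal_mask(n):
    # M[i, j] = 0 where j <= i (visible past), -inf where j > i (hidden future).
    m = np.zeros((n, n))
    m[np.triu_indices(n, k=1)] = -np.inf
    return m

def masked_softmax(scores):
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.random.default_rng(1).normal(size=(n, n))   # stand-in for QK^T / sqrt(d)
weights = masked_softmax(scores + causal_mask(n))

# Each row i attends only to positions j <= i; future positions get exactly zero weight.
print(np.round(weights, 3))
assert np.allclose(np.triu(weights, k=1), 0.0)
```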
2.4 Sequence log probability
Main idea. Products become sums for stable scoring.
The useful formula is:

$$\log p_\theta(t_{1:n}) = \sum_{i=1}^{n} \log p_\theta(t_i \mid t_{<i}).$$
2.5 Prefix scoring
Main idea. Prompt likelihood and answer likelihood are different objects.
The useful formula is:

$$\log p_\theta(y \mid x) = \sum_{j=1}^{\lvert y\rvert} \log p_\theta(y_j \mid x, y_{<j}),$$

which sums only over answer tokens; the prompt $x$ is conditioning context, not something being scored.
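The following sketch scores an answer conditionally on a prompt, assuming the usual convention that the logits at position $i$ predict token $i+1$. The token ids, vocabulary size, and the `answer_logprob` helper are illustrative stand-ins, not a real model API.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def answer_logprob(logits, tokens, prompt_len):
    # Convention: logits[i] is the distribution over tokens[i + 1] given tokens[: i + 1].
    logp = log_softmax(logits)
    total = 0.0
    for i in range(prompt_len, len(tokens)):
        total += logp[i - 1, tokens[i]]   # score answer tokens only; the prompt is conditioning
    return total

rng = np.random.default_rng(2)
vocab = 12
tokens = [3, 7, 1, 9, 4, 2]               # illustrative token ids; the first 3 are the prompt
logits = rng.normal(size=(len(tokens), vocab))
print("log p(answer | prompt):", answer_logprob(logits, tokens, prompt_len=3))
```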
3. Logits, Softmax, and Log-Sum-Exp
This part studies logits, softmax, and log-sum-exp as an operational object: something an LLM computes for every prefix during training, scoring, and generation. The formulas below are small enough to derive by hand, but they are also exactly the formulas used inside large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| LM head | the hidden state becomes one score per vocabulary token | $z_i = W_{\mathrm{LM}} h_i + b$ |
| Softmax normalization | logits are unnormalized log probabilities | $p_\theta(v\mid t_{<i}) = \exp(z_{i,v})/\sum_u \exp(z_{i,u})$ |
| Shift invariance | adding a constant to every logit changes no probability | $\mathrm{softmax}(z + c\mathbf{1}) = \mathrm{softmax}(z)$ |
| Stable log-softmax | subtract the maximum before exponentiating | $\log\mathrm{softmax}(z)_v = z_v - m - \log\sum_u e^{z_u - m}$, $m=\max_u z_u$ |
| Temperature | divide logits before softmax to control entropy | $p_T(v) \propto \exp(z_v/T)$ |
3.1 LM head
Main idea. The hidden state becomes one score per vocabulary token.
The useful formula is:

$$z_i = W_{\mathrm{LM}}\, h_i + b, \qquad z_i \in \mathbb{R}^{\lvert\mathcal{V}_{\mathrm{ext}}\rvert}.$$
3.2 Softmax normalization
Main idea. Logits are unnormalized log probabilities.
The useful formula is:

$$p_\theta(v \mid t_{<i}) = \frac{\exp(z_{i,v})}{\sum_{u \in \mathcal{V}_{\mathrm{ext}}} \exp(z_{i,u})}.$$
AI connection. This is the final probability gate between hidden-state geometry and token choice.
3.3 Shift invariance
Main idea. Adding a constant to every logit changes no probability.
The useful formula is:

$$\mathrm{softmax}(z + c\mathbf{1}) = \mathrm{softmax}(z) \quad \text{for any constant } c.$$
3.4 Stable log-softmax
Main idea. Subtract the maximum before exponentiating.
The useful formula is:

$$\log \mathrm{softmax}(z)_v = z_v - m - \log \sum_{u} \exp(z_u - m), \qquad m = \max_u z_u.$$
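A small demonstration of why the max shift matters; the oversized logit values are chosen only to overflow the naive computation.

```python
import numpy as np

z = np.array([1000.0, 999.0, 998.0])   # large logits, chosen to break the naive form

# Naive softmax overflows: exp(1000) is inf in float64, and inf/inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(z) / np.exp(z).sum()
print("naive:", naive)

# Stable form: subtract the max first; the shift changes no probability.
m = z.max()
stable = np.exp(z - m) / np.exp(z - m).sum()
log_probs = (z - m) - np.log(np.exp(z - m).sum())
print("stable:", stable)
print("log-probs:", log_probs)
```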
3.5 Temperature
Main idea. Divide logits before softmax to control entropy.
The useful formula is:

$$p_T(v \mid t_{<i}) = \frac{\exp(z_{i,v} / T)}{\sum_{u} \exp(z_{i,u} / T)}.$$
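A short sketch of how temperature reshapes a single next-token distribution; the three logit values are illustrative.

```python
import numpy as np

def softmax_T(z, T):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

z = [5.0, 2.0, 0.0]                       # illustrative logits
for T in (0.5, 1.0, 2.0):
    p = softmax_T(z, T)
    print(f"T={T}: probs={np.round(p, 3)}, entropy={entropy(p):.3f} nats")
# Lower T sharpens the distribution (lower entropy); higher T flattens it.
```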
4. Maximum Likelihood and Cross-Entropy
This part studies maximum likelihood and cross-entropy as an operational object: something an LLM computes for every prefix during training, scoring, and generation. The formulas below are small enough to derive by hand, but they are also exactly the formulas used inside large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| Dataset likelihood | training maximizes probability assigned to observed tokens | $\max_\theta \sum_i \log p_\theta(t_i \mid t_{<i})$ |
| Negative log-likelihood | loss is surprise under the model | $\ell_i = -\log p_\theta(t_i \mid t_{<i})$ |
| Cross-entropy | one-hot labels reduce cross-entropy to NLL | $H(y_i, p_i) = -\sum_v y_{i,v}\log p_{i,v}$ |
| Gradient with respect to logits | the key derivative is predicted minus target | $\partial \ell_i / \partial z_{i,v} = p_{i,v} - y_{i,v}$ |
| Padding masks | only real target tokens should contribute to the average loss | $\mathcal{L} = \sum_i m_i \ell_i \,/\, \sum_i m_i$ |
4.1 Dataset likelihood
Main idea. Training maximizes probability assigned to observed tokens.
The useful formula is:

$$\theta^{\star} = \arg\max_{\theta} \sum_{\text{sequences}} \sum_{i} \log p_\theta(t_i \mid t_{<i}).$$
4.2 Negative log-likelihood
Main idea. Loss is surprise under the model.
The useful formula is:

$$\ell_i = -\log p_\theta(t_i \mid t_{<i}).$$
4.3 Cross-entropy
Main idea. One-hot labels reduce cross-entropy to NLL.
The useful formula is:

$$H(y_i, p_i) = -\sum_{v} y_{i,v} \log p_{i,v} = -\log p_\theta(t_i \mid t_{<i}) \quad \text{for a one-hot target } y_i.$$
4.4 Gradient with respect to logits
Main idea. The key derivative is predicted minus target.
The useful formula is:

$$\frac{\partial \ell_i}{\partial z_{i,v}} = p_{i,v} - y_{i,v}.$$
AI connection. This derivative is the reason cross-entropy is so convenient for transformer training.
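A finite-difference check that the gradient of the cross-entropy loss with respect to the logits is softmax minus one-hot; the logits and target index are arbitrary illustrative values.

```python
import numpy as np

def ce_loss(z, target):
    # -log softmax(z)[target], computed stably.
    m = z.max()
    return -(z[target] - m - np.log(np.exp(z - m).sum()))

z = np.array([2.0, -1.0, 0.5, 0.0])      # illustrative logits
target = 2

# Analytic gradient: softmax(z) - one_hot(target).
p = np.exp(z - z.max()); p /= p.sum()
analytic = p.copy()
analytic[target] -= 1.0

# Numerical gradient by central differences.
eps = 1e-6
numeric = np.zeros_like(z)
for v in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[v] += eps; zm[v] -= eps
    numeric[v] = (ce_loss(zp, target) - ce_loss(zm, target)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # expected: True
```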
4.5 Padding masks
Main idea. Only real target tokens should contribute to the average loss.
The useful formula is:

$$\mathcal{L} = \frac{\sum_i m_i\, \ell_i}{\sum_i m_i}, \qquad m_i \in \{0, 1\}.$$
5. Entropy, KL, Perplexity, and Bits
This part studies entropy, KL divergence, perplexity, and bits as operational objects: quantities an LLM computes for every prefix during training, scoring, and generation. The formulas below are small enough to derive by hand, but they are also exactly the formulas used inside large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| Entropy | intrinsic uncertainty of a distribution | $H(p) = -\sum_v p(v)\log p(v)$ |
| Cross-entropy decomposition | model loss equals data entropy plus mismatch | $H(p^{*}, p_\theta) = H(p^{*}) + \mathrm{KL}(p^{*} \,\Vert\, p_\theta)$ |
| Perplexity | exponentiated average NLL | $\mathrm{PPL} = \exp(\mathrm{NLL})$ |
| Bits per token | use base-2 logs for compression interpretation | $\mathrm{BPT} = \mathrm{NLL} / \ln 2$ |
| Tokenizer caveat | perplexity depends on the tokenization unit | $\mathrm{BPB}$ and $\mathrm{BPC}$ are more comparable |
5.1 Entropy
Main idea. Intrinsic uncertainty of a distribution.
The useful formula is:

$$H(p) = -\sum_{v} p(v) \log p(v).$$
5.2 Cross-entropy decomposition
Main idea. Model loss equals data entropy plus mismatch.
The useful formula is:

$$H(p^{*}, p_\theta) = H(p^{*}) + \mathrm{KL}(p^{*} \,\Vert\, p_\theta).$$
5.3 Perplexity
Main idea. Exponentiated average NLL.
The useful formula is:

$$\mathrm{PPL} = \exp\!\left(\frac{1}{N} \sum_{i=1}^{N} -\log p_\theta(t_i \mid t_{<i})\right).$$
AI connection. This is the most common intrinsic language-model score, but it must be interpreted with tokenizer and dataset context.
5.4 Bits per token
Main idea. Use base-2 logs for compression interpretation.
The useful formula is:

$$\mathrm{BPT} = \frac{1}{N} \sum_{i=1}^{N} -\log_2 p_\theta(t_i \mid t_{<i}) = \frac{\mathrm{NLL}}{\ln 2}.$$
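A tiny sketch converting per-token probabilities into average NLL, perplexity, and bits per token; the probabilities are invented for illustration.

```python
import numpy as np

# Illustrative probabilities the model assigned to the observed tokens.
token_probs = np.array([0.5, 0.25, 0.1, 0.05])

nll = -np.log(token_probs).mean()          # average NLL in nats
ppl = np.exp(nll)                          # perplexity
bpt = nll / np.log(2)                      # bits per token

print(f"NLL = {nll:.3f} nats, PPL = {ppl:.2f}, BPT = {bpt:.3f} bits")
# Interpretation: a model that always assigned probability 1/PPL to the
# observed token would have the same average loss.
```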
5.5 Tokenizer caveat
Main idea. Perplexity depends on the tokenization unit.
The useful formula is:

$$\mathrm{BPB} = \mathrm{BPT} \cdot \frac{N_{\mathrm{tokens}}}{N_{\mathrm{bytes}}},$$

and $\mathrm{BPB}$ and $\mathrm{BPC}$ are more comparable across tokenizers than token-level perplexity, because they normalize by a tokenizer-independent unit.
6. Sequence Scoring and Calibration
This part studies sequence scoring and calibration as an operational object: something an LLM computes for every prefix during training, scoring, and generation. The formulas below are small enough to derive by hand, but they are also exactly the formulas used inside large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| Length bias | raw log probability favors shorter strings | $\log p(y\mid x)$ decreases with length |
| Average log probability | length-normalized scores compare answers of different lengths | $\frac{1}{\lvert y\rvert}\log p(y\mid x)$ |
| Calibration | confidence should match empirical correctness | $\Pr(\text{correct}\mid \hat{p} = c)\approx c$ |
| Expected calibration error | bin confidence gaps to summarize miscalibration | $\mathrm{ECE}=\sum_b\frac{\lvert B_b\rvert}{N}\,\lvert\mathrm{acc}(B_b)-\mathrm{conf}(B_b)\rvert$ |
| Temperature scaling | post-hoc calibration rescales logits without changing class order | $p_T = \mathrm{softmax}(z/T)$ |
6.1 Length bias
Main idea. Raw log probability favors shorter strings.
The useful formula is:

$$\log p(y \mid x) = \sum_{j=1}^{\lvert y\rvert} \log p(y_j \mid x, y_{<j}) \le 0,$$

so $\log p(y \mid x)$ decreases with length: every extra token adds a non-positive term.
6.2 Average log probability
Main idea. Length-normalized scores compare answers of different lengths.
The useful formula is:

$$\mathrm{score}(y) = \frac{1}{\lvert y\rvert} \log p(y \mid x).$$
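A sketch of the length bias and its fix; the per-token probabilities for the two candidate answers are invented to make the effect visible, not taken from a model.

```python
import numpy as np

# Invented per-token probabilities for a short answer and a longer, better one.
short_answer = np.array([0.4, 0.4])        # 2 tokens, mediocre per-token probability
long_answer = np.array([0.7] * 8)          # 8 tokens, higher per-token probability

for name, probs in [("short", short_answer), ("long", long_answer)]:
    logp = np.log(probs).sum()             # raw sequence log probability
    avg = logp / len(probs)                # length-normalized score
    print(f"{name}: raw log p = {logp:.3f}, average log p = {avg:.3f}")

# Raw log probability favors the short answer simply because it has fewer terms;
# the per-token average favors the answer the model is actually more confident in.
```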
6.3 Calibration
Main idea. Confidence should match empirical correctness.
The useful formula is:

$$\Pr(\text{prediction correct} \mid \hat{p} = c) \approx c \quad \text{for all } c \in [0, 1],$$

where $\hat{p}$ is the probability the model assigns to its chosen answer.
6.4 Expected calibration error
Main idea. Bin confidence gaps to summarize miscalibration.
The useful formula is:

$$\mathrm{ECE} = \sum_{b} \frac{\lvert B_b\rvert}{N} \,\bigl|\mathrm{acc}(B_b) - \mathrm{conf}(B_b)\bigr|.$$
The probability object should always be read with its conditioning context. A token is not likely or unlikely in isolation; it is likely or unlikely after a prefix, under a vocabulary, with a particular tokenizer and model state. This is why the same word can be high probability in one prompt and nearly impossible in another.
Worked micro-example. Suppose the current prefix is The capital of France is. A reasonable model should place more mass on Paris than on banana, but it should still maintain a normalized distribution over every token in the vocabulary. If the logits for three candidate tokens are , the probability of the first token is large because exponentiation magnifies logit differences:
The model is not storing a sentence list. It is using a conditional distribution whose parameters are generated from the prefix representation.
Implementation check. In code, check shapes and normalization. If logits have shape (batch, time, vocab), then softmax(logits, axis=-1) should sum to one along the vocabulary axis for every batch and time position. When scoring labels, gather only the probability of the observed next token and mask padding positions before averaging.
AI connection. For LLMs, this is not decoration; it controls a concrete training, scoring, or decoding behavior.
Common mistake. Do not compare raw sequence probabilities across different output lengths without a length convention. Products of probabilities shrink as strings get longer, so a two-token answer often has a higher raw probability than a better twenty-token answer.
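The sketch below implements the binned ECE formula directly; the five hand-written confidence values are illustrative placeholders chosen to show how overconfidence inflates the score.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    conf:    (n,) top-1 confidences in [0, 1]
    correct: (n,) 1.0 if the top-1 prediction matched the target, else 0.0
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(conf)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if not in_bin.any():
            continue
        acc = correct[in_bin].mean()        # empirical accuracy in this bin
        avg_conf = conf[in_bin].mean()      # average stated confidence in this bin
        ece += (in_bin.sum() / n) * abs(acc - avg_conf)
    return ece

# Hypothetical values: confident-but-often-wrong predictions inflate ECE.
conf = np.array([0.9, 0.9, 0.9, 0.6, 0.6])
correct = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
print(expected_calibration_error(conf, correct, n_bins=5))
```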
6.5 Temperature scaling
Main idea. Post-hoc calibration rescales logits without changing class order.
The useful formula divides every logit by a single scalar temperature $T>0$ fitted on held-out data:

$$p_T(v\mid \text{prefix})=\frac{\exp(z_v/T)}{\sum_{v'}\exp(z_{v'}/T)}.$$

Because all logits are divided by the same positive constant, the argmax and the full ranking are unchanged; only the sharpness of the distribution moves. $T>1$ softens an overconfident model, $T<1$ sharpens an underconfident one, and $T$ is usually chosen to minimize negative log-likelihood on a validation set. A grid-search sketch follows.
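A minimal sketch of fitting the temperature by held-out NLL. The grid search and the randomly generated "validation" logits are illustrative simplifications; a production implementation would typically optimize $T$ with a few Newton or LBFGS steps on real validation data.

```python
import numpy as np

def nll_at_temperature(logits, targets, T):
    """Average negative log-likelihood of targets after dividing logits by T."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)                       # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def fit_temperature(logits, targets, grid=np.linspace(0.5, 3.0, 26)):
    """Pick the temperature that minimizes held-out NLL (simple grid search)."""
    losses = [nll_at_temperature(logits, targets, T) for T in grid]
    return grid[int(np.argmin(losses))]

# Hypothetical held-out logits and labels; the sharp random logits mimic
# an overconfident model, so the fitted temperature comes out large.
rng = np.random.default_rng(1)
val_logits = 3.0 * rng.normal(size=(500, 20))
val_targets = rng.integers(0, 20, size=500)
print(fit_temperature(val_logits, val_targets))
```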
7. Decoding as Distribution Transformation
This part studies decoding as a transformation applied to the model's next-token distribution at every step of generation. The model supplies probabilities; the decoding rule decides what to do with them. The rules below are small enough to derive by hand, and they are exactly the rules used inside large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| Greedy decoding | choose the highest-probability next token | $y_t=\arg\max_v\, p_\theta(v\mid y_{<t})$ |
| Sampling | draw from the categorical distribution | $y_t\sim p_\theta(\cdot\mid y_{<t})$ |
| Top-k filtering | keep only the k highest probability tokens | renormalize $p_\theta$ over the top-$k$ set $V_k$ |
| Nucleus top-p filtering | keep the smallest set whose mass exceeds p | smallest $V_p$ with $\sum_{v\in V_p}p_\theta(v\mid y_{<t})\ge p$ |
| Beam search and length penalty | approximate sequence MAP with multiple partial hypotheses | $s(y)=\log p(y\mid x)/\lvert y\rvert^{\alpha}$ |
7.1 Greedy decoding
Main idea. Choose the highest-probability next token.
The useful formula is

$$y_t=\arg\max_{v\in V}\,p_\theta(v\mid x,\,y_{<t}),$$

applied step by step until an EOS token or a length limit. Greedy decoding is deterministic and cheap, but maximizing each step separately does not maximize the probability of the whole sequence, and on open-ended prompts it tends to produce repetitive text. A minimal decoding loop follows.
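The sketch below writes the greedy loop against a hypothetical `next_token_logits` callable that stands in for a real model's forward pass; `toy_logits` is only a toy stand-in used to make the loop runnable.

```python
import numpy as np

def greedy_decode(next_token_logits, prefix_ids, eos_id, max_new_tokens=32):
    """Greedy decoding loop.

    next_token_logits: hypothetical callable mapping a list of token ids to a
                       1-D logits array over the vocabulary.
    prefix_ids:        prompt token ids.
    """
    out = list(prefix_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(out)
        next_id = int(np.argmax(logits))      # highest-probability token
        out.append(next_id)
        if next_id == eos_id:
            break
    return out

# Toy stand-in model: always prefers token (last_id + 1) mod vocab; EOS is id 0.
def toy_logits(ids, vocab=10):
    z = np.zeros(vocab)
    z[(ids[-1] + 1) % vocab] = 5.0
    return z

print(greedy_decode(toy_logits, prefix_ids=[3], eos_id=0))
```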
7.2 Sampling
Main idea. Draw from the categorical distribution.
The useful formula is

$$y_t\sim p_\theta(\cdot\mid x,\,y_{<t}),$$

usually after dividing the logits by a temperature $\tau$: $\tau<1$ concentrates mass on the most likely tokens, $\tau>1$ spreads it out, and $\tau\to 0$ recovers greedy decoding. Pure sampling reproduces the model's distribution exactly, which also means it occasionally emits tokens from the long low-probability tail. A sketch follows.
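A minimal temperature-sampling sketch. The four-element logit vector is an illustrative toy; the printed frequencies show how lower temperature concentrates the draws on the top token.

```python
import numpy as np

def sample_next(logits, temperature=1.0, rng=None):
    """Sample one token id from softmax(logits / temperature)."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                                  # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))

# The same logits sampled at different temperatures.
logits = np.array([3.0, 1.5, 0.5, -1.0])
rng = np.random.default_rng(0)
for tau in (0.5, 1.0, 2.0):
    draws = [sample_next(logits, temperature=tau, rng=rng) for _ in range(1000)]
    print(tau, np.bincount(draws, minlength=4) / 1000.0)
```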
7.3 Top-k filtering
Main idea. Keep only the k highest probability tokens.
The useful formula keeps only the $k$ highest-probability tokens and renormalizes:

$$p_k(v\mid \cdot)=\frac{p_\theta(v\mid\cdot)\,\mathbf{1}[v\in V_k]}{\sum_{v'\in V_k}p_\theta(v'\mid\cdot)},$$

where $V_k$ is the set of the $k$ largest-probability tokens. A fixed $k$ is simple but blunt: when the model is confident, $k$ tokens is too many; when the distribution is flat, $k$ may be too few. A sketch that filters logits before the softmax follows.
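A minimal top-k sketch: set the excluded logits to negative infinity so the softmax assigns them exactly zero mass, which renormalizes the survivors automatically.

```python
import numpy as np

def top_k_filter(logits, k):
    """Set all but the k largest logits to -inf, so softmax ignores them."""
    logits = np.asarray(logits, dtype=float).copy()
    if k < len(logits):
        kth = np.sort(logits)[-k]                 # smallest logit that survives
        logits[logits < kth] = -np.inf
    return logits

def softmax(z):
    z = z - np.max(z[np.isfinite(z)])             # stabilize using finite entries only
    e = np.exp(z)                                 # exp(-inf) -> 0
    return e / e.sum()

logits = np.array([4.0, 3.0, 1.0, 0.5, -2.0])
print(softmax(top_k_filter(logits, k=2)))         # mass only on the top two tokens
```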
7.4 Nucleus top-p filtering
Main idea. Keep the smallest set whose mass exceeds p.
The useful formula sorts tokens by probability and keeps the smallest prefix set $V_p$ whose cumulative mass reaches the threshold:

$$V_p=\min\Bigl\{S : \sum_{v\in S}p_\theta(v\mid\cdot)\ge p\Bigr\},\qquad p_{\text{nuc}}(v)\propto p_\theta(v\mid\cdot)\,\mathbf{1}[v\in V_p].$$

Unlike top-k, the size of the kept set adapts to the shape of the distribution: a confident step may keep one or two tokens, a flat step may keep hundreds (Holtzman et al., 2019).

AI connection. This is the default practical compromise between deterministic decoding and unrestricted sampling. A sketch follows.
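A minimal nucleus-filtering sketch operating on already-normalized probabilities. The five-element probability vector is illustrative; with threshold 0.8 only the first two tokens survive.

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Zero out tokens outside the nucleus and renormalize.

    probs: 1-D array of next-token probabilities (already softmaxed).
    """
    order = np.argsort(probs)[::-1]               # most probable first
    sorted_probs = probs[order]
    cum = np.cumsum(sorted_probs)
    cutoff = np.searchsorted(cum, p) + 1          # smallest prefix with mass >= p
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

probs = np.array([0.55, 0.25, 0.10, 0.06, 0.04])
print(top_p_filter(probs, p=0.80))                # keeps the first two tokens
```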
7.5 Beam search and length penalty
Main idea. Approximate the sequence MAP with multiple partial hypotheses.
The useful formula scores each hypothesis by a length-normalized log probability; one common choice of penalty is

$$s(y)=\frac{\log p_\theta(y\mid x)}{\mathrm{lp}(\lvert y\rvert)},\qquad \mathrm{lp}(\lvert y\rvert)=\lvert y\rvert^{\alpha},$$

so that longer correct answers are not automatically beaten by shorter ones. At each step the beam keeps the $B$ highest-scoring partial sequences, extends each by candidate tokens, and prunes back to $B$. Beam search is standard for tasks with a well-defined target such as translation and speech recognition, and less common for open-ended chat, where it tends to amplify repetition. A compact sketch follows.
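A minimal beam-search sketch with the length-normalized score above. `next_token_logits` is again a hypothetical stand-in for a model forward pass, and `toy_logits` is a toy model used only to make the example runnable; real implementations expand over the full vocabulary rather than only the top `beam_size` tokens per hypothesis.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def beam_search(next_token_logits, prefix_ids, eos_id, beam_size=3,
                max_new_tokens=20, alpha=0.7):
    """Small beam search returning the best ids under logprob / len**alpha."""
    prefix_len = len(prefix_ids)

    def score(ids, logprob):
        gen_len = max(len(ids) - prefix_len, 1)
        return logprob / gen_len ** alpha

    beams = [(list(prefix_ids), 0.0, False)]          # (ids, total logprob, finished)
    for _ in range(max_new_tokens):
        candidates = []
        for ids, lp, done in beams:
            if done:
                candidates.append((ids, lp, True))    # carry finished hypotheses
                continue
            logp = log_softmax(next_token_logits(ids))
            for tok in np.argsort(logp)[::-1][:beam_size]:
                candidates.append((ids + [int(tok)], lp + float(logp[tok]),
                                   int(tok) == eos_id))
        beams = sorted(candidates, key=lambda c: score(c[0], c[1]),
                       reverse=True)[:beam_size]      # prune to the beam width
        if all(done for _, _, done in beams):
            break
    best = max(beams, key=lambda c: score(c[0], c[1]))
    return best[0]

# Toy stand-in model: prefers token (last id + 1) mod vocab; EOS is id 0.
def toy_logits(ids, vocab=10):
    z = np.zeros(vocab)
    z[(ids[-1] + 1) % vocab] = 5.0
    return z

print(beam_search(toy_logits, prefix_ids=[3], eos_id=0, beam_size=2))
```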
8. From Count Models to Neural LMs
This part studies the path from count-based models to neural language models as a sequence of estimators for the same conditional distribution $p(w_t\mid w_{<t})$. The formulas below are small enough to derive by hand, and the last one is exactly the computation inside large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| N-gram models | approximate history by the last n minus one tokens | $P(w_t\mid w_{<t})\approx P(w_t\mid w_{t-n+1:t-1})$ |
| Smoothing | reserve probability for unseen events | $P_\alpha(w\mid h)=\frac{c(h,w)+\alpha}{c(h)+\alpha V}$ |
| Neural probabilistic LM | learn distributed word representations and probability together | $p(w_t\mid h)=\mathrm{softmax}(W[e_{w_{t-n+1}};\dots;e_{w_{t-1}}]+b)$ |
| Recurrent LM | compress history into a state | $h_t=f(h_{t-1},e_{w_t})$, $p(w_{t+1}\mid w_{\le t})=\mathrm{softmax}(Wh_t)$ |
| Transformer LM | causal self-attention computes context-dependent hidden states in parallel | $p(w_{t+1}\mid w_{\le t})=\mathrm{softmax}(Wh_t)$ with $h_t$ from masked self-attention |
8.1 N-gram models
Main idea. Approximate history by the last n minus one tokens.
The useful formula replaces the full history with the last $n-1$ tokens and estimates the conditional from counts:

$$P(w_t\mid w_{<t})\approx P(w_t\mid w_{t-n+1:t-1})=\frac{c(w_{t-n+1:t-1},\,w_t)}{c(w_{t-n+1:t-1})}.$$

The Markov truncation makes estimation tractable, but any n-gram unseen in the training corpus receives probability zero, which makes whole test sentences impossible and perplexity infinite. That failure motivates smoothing in the next subsection. A bigram sketch follows.
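A minimal maximum-likelihood bigram model on a toy corpus, showing both the count-based estimate and the zero-probability problem for unseen pairs.

```python
from collections import Counter, defaultdict

def train_bigram(corpus_tokens):
    """Maximum-likelihood bigram model: P(w | prev) = c(prev, w) / c(prev)."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        pair_counts[prev][nxt] += 1
    return pair_counts

def bigram_prob(pair_counts, prev, w):
    total = sum(pair_counts[prev].values())
    if total == 0:
        return 0.0                                 # unseen history
    return pair_counts[prev][w] / total

tokens = "the cat sat on the mat the cat ran".split()
counts = train_bigram(tokens)
print(bigram_prob(counts, "the", "cat"))   # 2/3: "the" is followed by cat, mat, cat
print(bigram_prob(counts, "the", "ran"))   # 0.0: unseen bigram, the zero-count problem
```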
8.2 Smoothing
Main idea. Reserve probability for unseen events.
The useful formula is add-$\alpha$ (Lidstone) smoothing over a vocabulary of size $V$:

$$P_\alpha(w\mid h)=\frac{c(h,w)+\alpha}{c(h)+\alpha V},$$

which shifts a small amount of probability mass from seen events to every unseen event, so the model never returns exact zeros. Add-$\alpha$ is the simplest choice; interpolation and Kneser-Ney smoothing distribute the reserved mass more intelligently, and in a neural LM the softmax over a dense hidden state plays the analogous role of never assigning zero probability. A sketch follows.
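A minimal add-alpha sketch on the same toy corpus as 8.1: the smoothed conditional discounts seen bigrams, assigns nonzero mass to unseen ones, and still sums to one over the vocabulary.

```python
from collections import Counter, defaultdict

def add_alpha_prob(pair_counts, prev, w, vocab, alpha=1.0):
    """Add-alpha smoothed bigram probability over a fixed vocabulary."""
    total = sum(pair_counts[prev].values())
    return (pair_counts[prev][w] + alpha) / (total + alpha * len(vocab))

tokens = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(tokens))
pair_counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    pair_counts[prev][nxt] += 1

print(add_alpha_prob(pair_counts, "the", "cat", vocab))  # discounted from 2/3
print(add_alpha_prob(pair_counts, "the", "ran", vocab))  # unseen bigram, nonzero now
# The smoothed conditional still sums to one over the vocabulary:
print(sum(add_alpha_prob(pair_counts, "the", w, vocab) for w in vocab))
```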
8.3 Neural probabilistic LM
Main idea. Learn distributed word representations and probability together.
The useful formula (Bengio et al., 2003) feeds the concatenated embeddings of the last $n-1$ tokens through a small network and a softmax:

$$p(w_t\mid w_{t-n+1:t-1})=\mathrm{softmax}\bigl(W\,\tanh(H\,[e_{w_{t-n+1}};\dots;e_{w_{t-1}}]+d)+b\bigr)_{w_t}.$$

Two ideas survive into modern LLMs: tokens are represented by learned dense vectors, and the representations and the probability are trained jointly by maximizing likelihood. What changed later is only how the context vector is computed. A forward-pass sketch with random weights follows.
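A minimal forward-pass sketch of a Bengio-style model with randomly initialized parameters; training would fit these by the same cross-entropy objective used throughout this section. Sizes are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_emb, d_hid, n_ctx = 50, 16, 32, 3        # vocab, embedding, hidden, context size

E = rng.normal(scale=0.1, size=(V, d_emb))           # token embeddings
H = rng.normal(scale=0.1, size=(d_hid, n_ctx * d_emb))
d = np.zeros(d_hid)
W = rng.normal(scale=0.1, size=(V, d_hid))
b = np.zeros(V)

def neural_lm_probs(context_ids):
    """Next-token distribution from the last n_ctx token ids."""
    x = np.concatenate([E[i] for i in context_ids])   # concatenated embeddings
    h = np.tanh(H @ x + d)                            # hidden layer
    z = W @ h + b                                     # logits over the vocabulary
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

probs = neural_lm_probs([4, 17, 9])
print(probs.shape, probs.sum())                       # (50,) 1.0
```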
8.4 Recurrent LM
Main idea. Compress history into a state.
The useful formula compresses the entire history into a fixed-size state that is updated one token at a time:

$$h_t=f(h_{t-1},\,e_{w_t}),\qquad p(w_{t+1}\mid w_{\le t})=\mathrm{softmax}(W h_t)_{w_{t+1}}.$$

Unlike an n-gram model, the conditioning window is unbounded, but all of the history must pass through one vector and the tokens must be processed sequentially, which limits long-range retention and training parallelism. A single-layer sketch follows.
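A minimal vanilla-RNN language-model step with random weights; again, the point is the shape of the computation, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_emb, d_hid = 50, 16, 32

E   = rng.normal(scale=0.1, size=(V, d_emb))   # embeddings
Wxh = rng.normal(scale=0.1, size=(d_hid, d_emb))
Whh = rng.normal(scale=0.1, size=(d_hid, d_hid))
Why = rng.normal(scale=0.1, size=(V, d_hid))

def rnn_lm_probs(token_ids):
    """Run a vanilla RNN over the prefix; return p(next token | prefix)."""
    h = np.zeros(d_hid)
    for t in token_ids:
        h = np.tanh(Wxh @ E[t] + Whh @ h)      # state update compresses the history
    z = Why @ h
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

print(rnn_lm_probs([3, 7, 1]).sum())           # 1.0: a normalized distribution
```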
8.5 Transformer LM
Main idea. Causal self-attention computes context-dependent hidden states in parallel.
The useful formula computes every position's hidden state in parallel with causally masked self-attention, then applies the same LM head at every position:

$$h_t=\mathrm{TransformerBlock}(e_{w_1},\dots,e_{w_t}),\qquad p(w_{t+1}\mid w_{\le t})=\mathrm{softmax}(W h_t + b)_{w_{t+1}}.$$

The causal mask guarantees that position $t$ attends only to positions $\le t$, so every row of the output is a legitimate conditional of the autoregressive factorization, and all $T$ conditionals of a training sequence are computed in one forward pass. A single-head attention sketch that checks the mask follows.
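A minimal single-head causal self-attention sketch, stripped of query/key/value projections for clarity. The final check verifies the property that makes teacher forcing valid: changing a future token must not change earlier hidden states.

```python
import numpy as np

def causal_self_attention(X):
    """One head of causal self-attention (no projections, for clarity).

    X: (T, d) token representations. Position t may only attend to positions <= t.
    """
    T, d = X.shape
    scores = X @ X.T / np.sqrt(d)                       # (T, T) attention logits
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)    # True strictly above diagonal
    scores = np.where(mask, -np.inf, scores)            # block attention to the future
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X                                  # (T, d) contextual states

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
H = causal_self_attention(X)
X2 = X.copy()
X2[4] += 10.0                                           # perturb only the last token
H2 = causal_self_attention(X2)
print(np.allclose(H[:4], H2[:4]))                       # True: earlier states unchanged
```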
9. Conditional Generation and Constraints
This part studies conditional generation and constraints: how prompts, retrieved documents, logit manipulations, and output policies change what gets generated without changing the probability calculus itself. The formulas below are small enough to derive by hand, but they are exactly the mechanisms used around large decoder-only models.
| Subtopic | Core question | Formula |
|---|---|---|
| Prompt conditioning | a prompt is the observed prefix in the conditional | $p_\theta(y\mid x)=\prod_t p_\theta(y_t\mid x, y_{<t})$ |
| RAG conditioning | retrieved text changes the conditioning information | $p_\theta(y\mid x, d)$ for retrieved documents $d$ |
| Logit bias and constraints | decoding can restrict or reweight the support | $\tilde z_v = z_v + b_v$, with $b_v=-\infty$ for banned tokens |
| Guidance by logit mixing | combine distributions in logit space when a control model exists | $\tilde z = z_{\text{base}} + \gamma\,(z_{\text{cond}} - z_{\text{base}})$ |
| Safety filters | post-processing policies are not the same as model probability | $\Pr(\text{emit } y)$ may differ from $p_\theta(y\mid x)$ |
9.1 Prompt conditioning
Main idea. A prompt is the observed prefix in the conditional.
The useful formula is the conditional factorization with the prompt fixed as observed context:

$$p_\theta(y\mid x)=\prod_{t=1}^{|y|}p_\theta(y_t\mid x,\,y_{<t}).$$

Nothing special happens at the prompt boundary inside the model; the prompt tokens simply occupy earlier positions. The practical consequence is in scoring: when evaluating an answer, sum or average the log probabilities of the answer tokens only, and do not let the prompt tokens leak into the answer loss. A sketch follows.
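A minimal answer-scoring sketch under the usual shift convention (logits at position t predict the token at position t + 1). The random logits are placeholders; the point is the index arithmetic that keeps prompt positions out of the answer score.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def answer_logprob(logits, token_ids, prompt_len):
    """Sum and mean log probability of the answer tokens given the prompt.

    logits:     (T, vocab) next-token logits; logits[t] predicts token_ids[t + 1]
    token_ids:  (T + 1,)   prompt followed by answer token ids
    prompt_len: number of prompt tokens at the front of token_ids
    """
    logp = log_softmax(logits)
    targets = token_ids[1:]                                  # next-token labels
    per_pos = logp[np.arange(len(targets)), targets]         # label log probs
    answer_part = per_pos[prompt_len - 1:]                   # predictions of answer tokens
    return answer_part.sum(), answer_part.mean()

# Toy check: 3 prompt tokens, 2 answer tokens, vocabulary of 5.
rng = np.random.default_rng(0)
ids = np.array([1, 4, 2, 3, 0])
logits = rng.normal(size=(len(ids) - 1, 5))
print(answer_logprob(logits, ids, prompt_len=3))
```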
9.2 RAG conditioning
Main idea. Retrieved text changes the conditioning information.
The useful formula simply enlarges the conditioning context with retrieved documents $d$:

$$p_\theta(y\mid x, d)=\prod_t p_\theta\bigl(y_t\mid [d;x],\,y_{<t}\bigr),$$

where $[d;x]$ denotes the retrieved text concatenated into the prompt.

AI connection. Retrieval does not change the probability rules; it changes what information the conditional can see. For evaluation this matters for attribution: a likelihood gain on factual questions may come from the retrieved passage rather than from the parameters, so score the same question with and without $d$ to separate the two effects.
9.3 Logit bias and constraints
Main idea. Decoding can restrict or reweight the support.
The useful formula adds a bias vector to the logits before the softmax,

$$\tilde z_v=z_v+b_v,\qquad p(v)\propto\exp(\tilde z_v),$$

with $b_v=-\infty$ (in practice a very large negative number) to ban a token and a finite positive or negative $b_v$ to encourage or discourage it. Constrained decoding, grammar- or JSON-constrained output, and stop-token handling all work this way: the support is restricted step by step, and the softmax renormalizes whatever remains. A sketch follows.
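A minimal sketch combining a soft logit bias with a hard restriction of the support to an allowed-token set. The four-token vocabulary is a toy.

```python
import numpy as np

def constrained_probs(logits, allowed_ids=None, bias=None):
    """Apply a token-level logit bias and/or restrict support to allowed_ids."""
    z = np.asarray(logits, dtype=float).copy()
    if bias is not None:
        z = z + bias                                  # soft reweighting
    if allowed_ids is not None:
        mask = np.full_like(z, -np.inf)
        mask[list(allowed_ids)] = 0.0
        z = z + mask                                  # hard constraint on support
    z = z - z[np.isfinite(z)].max()                   # stable softmax over survivors
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.0, -1.0])
print(constrained_probs(logits, allowed_ids=[1, 2]))                    # only tokens 1, 2 survive
print(constrained_probs(logits, bias=np.array([0.0, 2.0, 0.0, 0.0])))   # token 1 boosted
```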
9.4 Guidance by logit mixing
Main idea. Combine distributions in logit space when a control model exists.
The useful formula combines two sets of logits computed for the same step, typically an unconditioned (or weakly conditioned) run and a fully conditioned run:

$$\tilde z = z_{\text{base}} + \gamma\,\bigl(z_{\text{cond}} - z_{\text{base}}\bigr),$$

followed by $\mathrm{softmax}(\tilde z)$. With $\gamma=1$ this reduces to the conditioned model, while $\gamma>1$ exaggerates whatever the conditioning adds. This classifier-free-guidance style of mixing requires two forward passes per step; it is standard in diffusion models and appears in some LLM decoding setups. A sketch follows.
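A minimal logit-mixing sketch with hand-written toy logit vectors; varying the guidance weight shows how the conditioned signal is suppressed, reproduced, or amplified.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def guided_probs(z_base, z_cond, gamma=1.5):
    """Mix two logit vectors for the same decoding step in logit space."""
    z = z_base + gamma * (z_cond - z_base)
    return softmax(z)

z_base = np.array([1.0, 1.0, 1.0, 1.0])        # e.g. logits without the control prompt
z_cond = np.array([1.0, 3.0, 1.0, 0.0])        # logits with the control prompt
for g in (0.0, 1.0, 2.0):
    print(g, guided_probs(z_base, z_cond, gamma=g))
```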
9.5 Safety filters
Main idea. Post-processing policies are not the same as model probability.
The useful relation is an inequality rather than a formula: the probability that the system emits a string is not the model probability of that string,

$$\Pr(\text{emit }y\mid x)\ne p_\theta(y\mid x)\ \text{in general},$$

because refusal classifiers, blocklists, and rewriting stages sit between the sampled tokens and the user. When auditing behavior, measure both quantities separately: the model may assign high probability to content the pipeline never shows, and estimating "what the model believes" from filtered outputs alone will be biased.
10. Diagnostics and Learning Practice
This part collects diagnostics and learning practice: quick checks that the probability object your code manipulates is the one the math defines, plus habits for interpreting likelihood numbers. The checks below are small enough to run by hand, but they catch the same bugs that derail large training and evaluation runs.
| Subtopic | Core question | Formula |
|---|---|---|
| Normalization checks | probabilities must sum to one at each prefix | $\sum_{v\in V} p_\theta(v\mid \text{prefix}) = 1$ |
| Likelihood sanity checks | known continuations should score above random continuations | $\log p_\theta(\text{true}) \gg \log p_\theta(\text{shuffled})$ on average |
| Distribution shift | low likelihood can indicate unfamiliar domain or bad tokenization | $\mathrm{NLL}_{\text{new domain}} \gg \mathrm{NLL}_{\text{in-domain}}$ |
| Generation versus evaluation | sampling quality and held-out NLL measure different properties | $\arg\min \mathrm{NLL}$ is not always the best user experience |
| Bridge to training at scale | the same loss drives distributed training and scaling laws | $\mathcal{L}(\theta)=-\tfrac{1}{N}\sum_t \log p_\theta(w_t\mid w_{<t})$ |
10.1 Normalization checks
Main idea. Probabilities must sum to one at each prefix.
The useful formula is the normalization identity at every prefix:

$$\sum_{v\in V}p_\theta(v\mid \text{prefix})=1.$$

Violations almost always mean a mechanical bug: softmax applied over the wrong axis, probabilities mixed with logits, log-softmax applied twice, or a masked vocabulary that was never renormalized. Checking the sum along the vocabulary axis for a few batch and time positions catches these cheaply. A sketch follows.
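A minimal normalization check over a (batch, time, vocab) logits tensor; the random input stands in for a real LM head output.

```python
import numpy as np

def check_normalization(logits, atol=1e-5):
    """Assert softmax over the vocabulary axis sums to one at every position.

    logits: (batch, time, vocab) array, as produced by an LM head.
    """
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    sums = probs.sum(axis=-1)                       # (batch, time)
    assert np.allclose(sums, 1.0, atol=atol), sums
    return True

rng = np.random.default_rng(0)
print(check_normalization(rng.normal(size=(2, 7, 100))))   # True
```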
10.2 Likelihood sanity checks
Main idea. Known continuations should score above random continuations.
The useful check compares the average log probability of true continuations against shuffled or random continuations of the same length:

$$\frac{1}{N}\sum_t \log p_\theta(\text{true token}_t) \;\gg\; \frac{1}{N}\sum_t \log p_\theta(\text{shuffled token}_t).$$

If a trained model does not clearly prefer real text to scrambled text, suspect the scoring code before the model: common causes are off-by-one label alignment, scoring the prompt instead of the answer, or a tokenizer mismatch between training and evaluation. A sketch follows.
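A minimal true-versus-shuffled sanity check. The synthetic logits, which boost the true labels by a constant, stand in for a trained model so the gap comes out clearly positive.

```python
import numpy as np

def avg_label_logprob(logits, targets):
    """Average log probability the model assigns to the given targets."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return logp[np.arange(len(targets)), targets].mean()

def sanity_gap(logits, targets, rng=None):
    """True-vs-shuffled likelihood gap; should be clearly positive for a real LM."""
    rng = rng or np.random.default_rng(0)
    shuffled = rng.permutation(targets)
    return avg_label_logprob(logits, targets) - avg_label_logprob(logits, shuffled)

# Synthetic logits that favor the true labels, standing in for a trained model.
rng = np.random.default_rng(0)
targets = rng.integers(0, 50, size=200)
logits = rng.normal(size=(200, 50))
logits[np.arange(200), targets] += 3.0
print(sanity_gap(logits, targets, rng))     # clearly positive
```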
10.3 Distribution shift
Main idea. Low likelihood can indicate unfamiliar domain or bad tokenization.
The useful signal is a gap in held-out negative log-likelihood between in-domain text and the new domain:

$$\mathrm{NLL}_{\text{new domain}} \gg \mathrm{NLL}_{\text{in-domain}}.$$

A large gap can mean genuinely unfamiliar content, but it can also mean the tokenizer fragments the new text badly (code, other languages, unusual formatting), which inflates per-token loss for mechanical reasons. Before concluding the model is weak on a domain, inspect tokens per character and the worst-scoring positions to separate modeling failure from tokenization failure.
10.4 Generation versus evaluation
Main idea. Sampling quality and held-out NLL measure different properties.
The useful caution is that $\arg\min_\theta \mathrm{NLL}$ is not always the best user experience. The model with the lowest held-out cross-entropy is the best next-token predictor on average, but users judge sampled outputs, which also depend on the decoding rule, the prompt distribution, and preferences that likelihood never sees. Track both: NLL or perplexity for the probability object, and human or task-based evaluation for the generated text.
10.5 Bridge to training at scale
Main idea. The same loss drives distributed training and scaling laws.
The useful formula is the same token-averaged cross-entropy that has appeared throughout this section,

$$\mathcal{L}(\theta)=-\frac{1}{N}\sum_{t}\log p_\theta(w_t\mid w_{<t}),$$

now evaluated over enormous corpora with distributed data loading, mixed precision, and gradient accumulation. Scale changes the engineering around the estimate, not the target: loss curves, scaling laws, and checkpoint comparisons are all statements about this one quantity.
Practice Exercises
- Factorize a three-token sentence probability using the chain rule.
- Compute stable softmax probabilities from a logit vector by subtracting the maximum.
- Derive the gradient of one-hot cross-entropy with respect to logits.
- Compute a masked average loss for a padded batch.
- Convert average NLL from nats to perplexity and bits per token.
- Compare raw and length-normalized scores for two candidate answers.
- Apply temperature, top-k, and top-p filtering to a toy distribution.
- Compute expected calibration error for binned confidence values.
- Score a conditional answer without including the prompt tokens in the average answer loss.
- Write a short checklist for debugging an LM probability implementation.
Why This Matters for AI
The probability layer is the contract between representation learning and text behavior. If logits are unstable, loss is masked incorrectly, probabilities are compared across incompatible tokenizations, or decoding filters are misunderstood, the model may appear to improve while the measured probability object is wrong. Strong LLM work requires comfort with this layer because it appears in pretraining, supervised fine-tuning, preference optimization, retrieval evaluation, calibration, long-context tests, and deployment decoding.
Bridge to Training at Scale
The next section studies what happens when the same cross-entropy objective is optimized with billions of tokens, large batches, distributed hardware, mixed precision, gradient accumulation, learning-rate schedules, and checkpointing. Nothing in the training-at-scale story changes the probability target. Scale changes how expensive and fragile it is to estimate the same conditional distributions.
References
- Claude Shannon, "A Mathematical Theory of Communication", 1948.
- Claude Shannon, "Prediction and Entropy of Printed English", 1951.
- Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin, "A Neural Probabilistic Language Model", JMLR, 2003: https://jmlr.csail.mit.edu/papers/v3/bengio03a.html
- Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur, "Recurrent neural network based language model", Interspeech, 2010: https://www.isca-archive.org/interspeech_2010/mikolov10_interspeech.html
- Dan Jurafsky and James H. Martin, "Speech and Language Processing", Chapter 3 on n-gram language models: https://web.stanford.edu/~jurafsky/slp3/
- Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu, "Exploring the Limits of Language Modeling", 2016: https://arxiv.org/abs/1602.02410
- Ashish Vaswani et al., "Attention Is All You Need", 2017: https://arxiv.org/abs/1706.03762
- Ari Holtzman et al., "The Curious Case of Neural Text Degeneration", 2019: https://arxiv.org/abs/1904.09751
- Ofir Press et al., "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation", 2021: https://arxiv.org/abs/2108.12409