"Information is the resolution of uncertainty." - Claude Shannon
Overview
Entropy is the central quantity of information theory - a precise, mathematical measure of uncertainty, surprise, and information content. Introduced by Claude Shannon in his landmark 1948 paper A Mathematical Theory of Communication, entropy transformed our understanding of communication, computation, and inference. Where nineteenth-century thermodynamics defined entropy as a measure of physical disorder, Shannon showed that the same mathematical structure governs the information carried by any probabilistic source, from telegraph signals to protein sequences to transformer language models.
For machine learning practitioners, entropy is ubiquitous. The cross-entropy loss that trains every classifier is entropy in disguise. The perplexity that evaluates every language model is an entropy estimate. The information gain that splits every decision tree is an entropy difference. The soft policy entropy bonus in soft actor-critic prevents premature convergence. Entropy appears not as an abstract concept but as a concrete, computable quantity that governs the fundamental limits of compression, inference, and learning.
This section builds entropy from first principles: self-information (Section 2), the Shannon entropy formula and its standard properties (Section 3), joint and conditional entropy with the chain rule (Section 4), the maximum entropy principle (Section 5), source coding and data compression (Section 6), entropy rates of stochastic processes including language models (Section 7), differential entropy for continuous variables (Section 8), and Renyi and Tsallis generalizations (Section 9). Every concept is developed mathematically and connected to its role in modern AI systems.
Scope note. This section is the canonical home for entropy. The closely related quantities - KL divergence (Section 02), mutual information (Section 03), cross-entropy loss (Section 04), and Fisher information (Section 05) - each have their own dedicated sections in this chapter. Where they appear here, they appear as brief previews with forward references.
Prerequisites
- Probability distributions - PMFs, PDFs, expectations - Chapter 6: Probability Theory
- Logarithm rules - $\log(ab) = \log a + \log b$, $\log a^b = b\log a$ - Chapter 1
- Expected values - $\mathbb{E}[f(X)] = \sum_x p(x)\,f(x)$ - Chapter 6 Section 02
- Jensen's inequality - for convex $f$: $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$ - Chapter 8 Section 01
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive derivations: entropy computations, MaxEnt proofs, entropy rate of Markov chains, perplexity, differential entropy |
| exercises.ipynb | 10 graded exercises from binary entropy to MaxEnt RL entropy bonus |
Learning Objectives
After completing this section, you will:
- Define self-information and explain why the logarithm is the unique choice satisfying the additivity axiom
- Compute Shannon entropy for discrete distributions and verify the bounds $0 \le H(X) \le \log|\mathcal{X}|$
- Prove that entropy is concave in the probability vector and derive the maximum-entropy distribution for given constraints
- Compute joint entropy $H(X,Y)$ and conditional entropy $H(Y \mid X)$, and verify the chain rule
- Apply the maximum entropy principle to derive the Gaussian, uniform, and exponential distributions from moment constraints
- State Shannon's source coding theorem and construct a Huffman code for a given source
- Compute the entropy rate of a Markov chain and connect it to language model perplexity
- Define differential entropy $h(X) = -\int f(x)\log f(x)\,dx$, compute it for Gaussian and uniform distributions, and identify its key differences from discrete entropy
- Parametrize the Renyi entropy family $H_\alpha$ and identify the limiting cases (Hartley, Shannon, min-entropy)
- Explain how entropy appears in decision trees, MaxEnt RL (SAC), confidence calibration, and LLM perplexity evaluation
Table of Contents
- 1. Intuition
- 2. Self-Information and Shannon Entropy
- 3. Properties of Shannon Entropy
- 4. Joint Entropy, Conditional Entropy, and the Chain Rule
- 5. Maximum Entropy Principle
- 6. Source Coding and Data Compression
- 7. Entropy Rate of Stochastic Processes
- 8. Differential Entropy
- 9. Renyi Entropy and Generalizations
- 10. Applications in Machine Learning
- 11. Common Mistakes
- 12. Exercises
- 13. Why This Matters for AI (2026 Perspective)
- 14. Conceptual Bridge
1. Intuition
1.1 What Is Entropy?
Imagine you are handed a message before reading it. How surprised will you be? If the message is from a source that always sends the same symbol - say, a machine that outputs A with probability 1 - you learn nothing; no surprise is possible. If the source sends A or B with equal probability $1/2$, you learn exactly one bit of information when the symbol arrives. If the source can send any of 256 characters with equal probability, you learn 8 bits per symbol.
Entropy quantifies this intuition precisely. It is the average amount of information (or equivalently, the average surprise) produced by a probabilistic source. Rare events are more surprising than common ones - learning that a 1-in-a-million event occurred tells you far more than learning that a coin came up heads. The entropy of a distribution is the expected value of this surprise, averaged over all possible outcomes.
Three equivalent ways to think about entropy:
- Uncertainty before the fact. $H(X)$ measures how uncertain you are about $X$ before observing it. A uniform distribution over $n$ outcomes has maximum entropy $\log n$; a deterministic distribution has entropy $0$.
- Information gained after the fact. $H(X)$ measures how much information (in bits, nats, or hartleys) you gain on average when you observe $X$. This is why entropy equals the minimum average code length for transmitting $X$.
- Compression limit. $H(X)$ bits per symbol is the theoretical minimum average code length for losslessly encoding a source with distribution $p$. No code can do better, and Huffman coding comes within 1 bit of this bound.
The formal definition makes all three views precise:

$$H(X) = -\sum_{x} p(x)\log p(x)$$

The minus sign appears because $\log p(x) \le 0$ for $p(x) \in (0, 1]$, and we want entropy to be non-negative.
Intuitive examples. A fair coin: $H = 1$ bit. A heavily biased coin (e.g., $p(\text{heads}) = 0.9$): $H \approx 0.47$ bits - less uncertainty because heads is much more likely. A fair 6-sided die: $H = \log_2 6 \approx 2.58$ bits. A fair deck of 52 cards: $H = \log_2 52 \approx 5.70$ bits.
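A quick numerical check of these values (a minimal NumPy sketch; the `entropy` helper is ours, not a library function):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a probability vector p, skipping zero entries (0 log 0 = 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(base))

print(entropy([0.5, 0.5]))           # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))           # heavily biased coin: ~0.469 bits
print(entropy(np.full(6, 1 / 6)))    # fair die: ~2.585 bits
print(entropy(np.full(52, 1 / 52)))  # fair 52-card deck: ~5.700 bits
```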
The key insight for AI. Language is a stochastic source. Each token in a sequence is drawn from a distribution over the vocabulary. The entropy of this distribution measures how uncertain (or creative) the language model is. A model that always predicts the next token with certainty has zero entropy and is not a useful language model. A model with moderate entropy is generating diverse, informative text.
1.2 Why Entropy Matters for AI
Entropy is not merely a theoretical construct - it is load-bearing mathematics in the most common ML operations:
Cross-entropy loss. The most common training objective in classification and language modeling is:

$$\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\log q_\theta(y_i \mid x_i)$$

This is the empirical estimate of the cross-entropy $H(p, q_\theta)$. Minimizing it minimizes the KL divergence from the data distribution to the model distribution. Every GPT, LLaMA, Claude, and Gemini model is trained by minimizing cross-entropy. The connection to entropy is developed fully in Section 04 Cross-Entropy.
Perplexity. The standard metric for evaluating language models is:

$$\text{PPL} = \exp\Big(-\frac{1}{T}\sum_{t=1}^{T}\log q_\theta(x_t \mid x_{<t})\Big)$$

This is $e^{H}$ (with $H$ the average negative log-likelihood in nats) - the exponentiated per-token entropy of the model's predictions. A perplexity of 10 means the model is, on average, as uncertain as a uniform distribution over 10 tokens. Lower perplexity = lower entropy = better compression = better language model.
Decision trees. The information gain criterion for splitting a node is:

$$IG(Y; A) = H(Y) - H(Y \mid A)$$

This chooses the feature $A$ that maximally reduces the entropy of the target $Y$. ID3, C4.5, and CART (entropy variant) all use this criterion.
Entropy regularization. Soft Actor-Critic (SAC) and other maximum-entropy RL algorithms add $\alpha\,\mathcal{H}(\pi(\cdot \mid s_t))$ to the reward, encouraging the policy to maintain high entropy (explore diverse actions). This prevents premature collapse to deterministic policies and substantially improves sample efficiency.
Confidence calibration. The entropy of a model's output distribution is a measure of predictive uncertainty. High-entropy outputs signal that the model is unsure; low-entropy outputs signal confidence. Temperature scaling adjusts this entropy calibration.
1.3 Historical Timeline
HISTORY OF ENTROPY
1865 Rudolf Clausius coins "entropy" for thermodynamics (disorder)
1877 Ludwig Boltzmann: S = k log W (statistical mechanics)
1929 Leo Szilard: information and thermodynamic entropy connected
1948 Claude Shannon: H(X) = -sum p log p (information theory born)
"A Mathematical Theory of Communication", Bell System Tech. J.
1951 Kullback & Leibler: KL divergence as entropy-relative measure
1952 David Huffman: optimal prefix codes (with related coding work by Peter Elias in the 1950s)
1957 Edwin Jaynes: maximum entropy principle as an inference framework
1961 Alfred Renyi: generalized entropies H_alpha
1991 Thomas Cover & Joy Thomas: "Elements of Information Theory"
(still the standard graduate textbook)
2000s Arithmetic coding / range coding for LLM compression
2018+ Perplexity as universal LLM benchmark; entropy loss as
training objective for every major language model
2018 SAC (Haarnoja et al.): maximum-entropy RL goes mainstream
2022+ LLM inference with speculative decoding; token entropy
as early-exit signal; entropy-based uncertainty quantification
1.4 Units and the Base Question
The formula depends on the base of the logarithm:
| Base | Unit | Symbol | Conversion |
|---|---|---|---|
| 2 | bit (binary digit) | b | 1 nat $\approx$ 1.443 bits |
| $e$ | nat (natural unit) | nat | 1 bit $\approx$ 0.693 nats |
| 10 | hartley (dit) | Hart | 1 hartley $\approx$ 3.322 bits; rarely used in ML |
Which base does ML use?
- Theory and proofs: nats ($\log = \ln$) - calculus is cleaner, no constant factors in derivatives
- Compression and coding theory: bits ($\log_2$) - matches binary representation
- Perplexity computation: nats in code, reported as a base-$e$ exponent, but "bits per token" uses base 2
In this section, $\log$ denotes the natural logarithm unless otherwise specified. When working with bits, we write $\log_2$. The choice of base scales entropy by a constant factor and does not affect any mathematical property or ordering of distributions.
For AI: PyTorch's nn.CrossEntropyLoss and F.cross_entropy use the natural logarithm internally (bits are not relevant for gradient computation). Perplexity is reported as $e^{\bar{\ell}}$ where $\bar{\ell}$ is the average negative log-likelihood in nats. The distinction matters when comparing reported perplexity values across papers - verify whether they use $e$ or $2$ as the base.
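A small sketch of the base bookkeeping (the `avg_nll_nats` value below is illustrative, standing in for a model's mean per-token negative log-likelihood):

```python
import numpy as np

# Illustrative value: a model's mean per-token negative log-likelihood in nats
# (PyTorch-style losses use natural logs).
avg_nll_nats = 2.3

bits_per_token = avg_nll_nats / np.log(2)  # nats -> bits
ppl_from_nats = np.exp(avg_nll_nats)       # perplexity via base e
ppl_from_bits = 2.0 ** bits_per_token      # same number via base 2

print(f"{bits_per_token:.3f} bits/token")             # ~3.318
print(f"{ppl_from_nats:.2f} vs {ppl_from_bits:.2f}")  # both ~9.97: PPL itself is base-free
```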
2. Self-Information and Shannon Entropy
2.1 Self-Information
Before defining entropy, we need to define the information content of a single outcome.
Definition (Self-Information). The self-information (or surprise) of an outcome $x$ with probability $p(x)$ is:

$$I(x) = -\log p(x)$$

Self-information is the unique function satisfying three natural axioms:
- Monotonicity. $I(p)$ is a decreasing function of $p$: rarer events carry more information.
- Non-negativity. $I(p) \ge 0$ for all $p \in (0, 1]$.
- Additivity for independent events. If $p(x, y) = p(x)\,p(y)$, then $I(x, y) = I(x) + I(y)$.
The third axiom forces the logarithm. If $I(pq) = I(p) + I(q)$ for all $p, q \in (0, 1]$ and $I$ is monotone decreasing, then $I(p) = -c\log p$ for some constant $c > 0$. The constant sets the unit; choosing the base-2 logarithm gives bits.
Examples:
- Fair coin, outcome heads: $I = -\log_2(1/2) = 1$ bit
- Fair die, outcome 6: $I = \log_2 6 \approx 2.58$ bits
- English letter 'e' (frequency $\approx 0.13$): $I \approx 3$ bits
- English letter 'z' (frequency $\approx 0.0007$): $I \approx 10.5$ bits
- Token " the" in a GPT vocabulary (very common): low self-information, on the order of a few nats at most
- A rare technical token: high self-information
The surprise interpretation. If your language model assigns probability $p$ to the next token, and that token occurs, the model is "surprised" by $-\ln p$ nats. If $p = 0.9$, the surprise is only about $0.1$ nats; if $p = 0.01$ and the token occurs, the surprise is about $4.6$ nats. The negative log-likelihood of a prediction is literally how surprised the model is.
2.2 Shannon Entropy: Definition
Definition (Shannon Entropy). Let $X$ be a discrete random variable with probability mass function $p(x)$. The Shannon entropy of $X$ is:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x)\log p(x)$$

with the convention $0\log 0 = 0$ (since $\lim_{p \to 0^+} p\log p = 0$).
Equivalently, entropy is the expected self-information:

$$H(X) = \mathbb{E}[I(X)] = \mathbb{E}[-\log p(X)]$$

This is a key identity: entropy is the expected surprise of the distribution. A high-entropy distribution is one where you are, on average, very surprised by outcomes. A low-entropy distribution is one where outcomes are mostly predictable.
The convention $0\log 0 = 0$. This is justified by continuity: $\lim_{p \to 0^+} p\log p = 0$. Outcomes with probability zero do not contribute to entropy, which makes sense - an impossible outcome carries infinite self-information, but it never occurs, so its contribution to the expected surprise is zero.
Entropy depends only on the probabilities. $H(X)$ depends only on the probability vector $(p_1, \dots, p_n)$, not on the actual values $x_i$. Renaming outcomes does not change entropy.
Functional notation. When we want to emphasize dependence on the probability vector, we write $H(p)$ or $H(p_1, \dots, p_n)$ where $p_i = P(X = x_i)$.
2.3 Computing Entropy for Standard Distributions
Bernoulli distribution. Let $X \sim \mathrm{Bernoulli}(p)$: $P(X{=}1) = p$, $P(X{=}0) = 1 - p$.

$$H(X) = -p\log_2 p - (1-p)\log_2(1-p)$$

This is the binary entropy function $H_b(p)$. Key values:
- $H_b(0) = H_b(1) = 0$ (deterministic - no uncertainty)
- $H_b(1/2) = 1$ bit (maximum - fair coin)
- e.g. $H_b(0.9) = H_b(0.1) \approx 0.47$ bits; $H_b(0.99) \approx 0.08$ bits
The function $H_b(p)$ is symmetric about $p = 1/2$, concave, and achieves its unique maximum of 1 bit at $p = 1/2$.
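A minimal sketch of the binary entropy function (plain NumPy; the function name is ours):

```python
import numpy as np

def binary_entropy(p):
    """H_b(p) in bits, with H_b(0) = H_b(1) = 0 by the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    inside = (p > 0) & (p < 1)
    q = p[inside]
    out[inside] = -q * np.log2(q) - (1 - q) * np.log2(1 - q)
    return out

ps = np.array([0.0, 0.01, 0.1, 0.5, 0.9, 0.99, 1.0])
for p, h in zip(ps, binary_entropy(ps)):
    print(f"H_b({p:.2f}) = {h:.3f} bits")  # symmetric about 0.5, maximum 1.0 bit
```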
Categorical distribution. Let $X \sim \mathrm{Categorical}(p_1, \dots, p_K)$ with $K$ outcomes.

$$H(X) = -\sum_{k=1}^{K} p_k\log p_k$$

Special cases:
- Uniform: $p_k = 1/K$ for all $k$ -> $H = \log K$ (maximum)
- Deterministic: $p_k = 1$ for some $k$ -> $H = 0$ (minimum)
Geometric distribution. Let $X \sim \mathrm{Geometric}(p)$: $P(X = k) = (1-p)^{k-1}p$ for $k = 1, 2, \dots$. Its entropy is $H(X) = H_b(p)/p$.
For small $p$ (rare successes), $H(X)$ grows like $\log(1/p)$ -> large entropy (many trials needed).
Poisson distribution. Let $X \sim \mathrm{Poisson}(\lambda)$. No closed form, but:

$$H(X) \approx \tfrac{1}{2}\log(2\pi e\lambda) \quad \text{for large } \lambda.$$

Uniform distribution over $K$ elements. $H = \log K$ - the unique distribution achieving the upper bound. This is one reason uniform distributions arise in MaxEnt problems.
For AI - Token Distribution Entropy. In language model inference, each forward pass produces a categorical distribution $q(\cdot \mid x_{<t})$ over vocabulary tokens (vocabularies typically run from tens of thousands to over a hundred thousand entries). The entropy of this output distribution is:

$$H = -\sum_{v \in \mathcal{V}} q(v \mid x_{<t})\log q(v \mid x_{<t})$$

- Low when the model is confident (e.g., completing "The capital of France is _")
- High when the model is uncertain (e.g., completing "The best approach to _")
Monitoring token entropy during inference is used for uncertainty quantification, early exit, and speculative decoding.
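A sketch of this computation from raw logits, assuming hypothetical logit vectors over a 1000-token vocabulary (the stable log-softmax route avoids overflow):

```python
import numpy as np

def token_entropy_from_logits(logits):
    """Entropy (nats) of softmax(logits), via a numerically stable log-softmax."""
    z = logits - logits.max()
    log_q = z - np.log(np.exp(z).sum())
    q = np.exp(log_q)
    return float(-(q * log_q).sum())

rng = np.random.default_rng(0)
vocab = 1000
confident = np.zeros(vocab); confident[0] = 10.0   # one dominant logit
uncertain = rng.normal(0.0, 0.1, size=vocab)       # nearly flat logits

print(token_entropy_from_logits(confident))  # small: model is fairly sure of the next token
print(token_entropy_from_logits(uncertain))  # near ln(1000) ~ 6.9 nats: model is unsure
```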
2.4 Non-Examples and Edge Cases
Non-example 1: Entropy cannot be negative (for discrete distributions). Since $0 \le p(x) \le 1$ and $-\log p(x) \ge 0$, each term $-p(x)\log p(x) \ge 0$, so $H(X) \ge 0$ always. (Differential entropy for continuous distributions can be negative - see Section 8.)
Non-example 2: High cardinality does not guarantee high entropy. A distribution over many outcomes that places almost all of its mass on one outcome (with the tiny remainder spread equally over the rest) has very low entropy despite having many possible values. Entropy measures the shape of the distribution, not just its support size.
Non-example 3: Equal entropy does not mean equal distributions. Two distributions can have the same entropy but completely different shapes. For example, $(1/2, 1/2, 0, 0)$ and $(0, 0, 1/2, 1/2)$ both have entropy $1$ bit, but they place their mass on entirely different outcomes. (To compare distributions, use KL divergence - see Section 02.)
Non-example 4: $H(X) = 0$ does not mean $X$ is constant. It means $X$ is almost surely constant - i.e., some outcome has probability 1 and all others have probability 0. $X$ is deterministic.
Edge case: Infinite alphabet. For countably infinite $\mathcal{X}$, $H(X)$ can be infinite. For example, $p(k) = 2^{-k}$ has finite entropy, while $p(k) \propto 1/(k\log^2 k)$ does not (its entropy series diverges like a harmonic-type sum). In practice, vocabulary sizes are finite so this edge case does not arise in LLMs.
Edge case: $p(x) = 0$. By convention, $0\log 0 = 0$. This is the correct limit and ensures that adding probability-zero outcomes to a distribution does not change its entropy.
3. Properties of Shannon Entropy
3.1 Non-Negativity and Boundedness
Theorem. For any discrete random variable $X$ with $|\mathcal{X}| = n$ possible outcomes:

$$0 \le H(X) \le \log n$$

Lower bound ($H(X) \ge 0$). Each term $-p(x)\log p(x) \ge 0$ since $0 < p(x) \le 1$ implies $\log p(x) \le 0$. Equality holds iff $p(x) \in \{0, 1\}$ for all $x$, i.e., $X$ is deterministic.
Upper bound ($H(X) \le \log n$). This follows from the log-sum inequality or from the non-negativity of KL divergence. Let $u(x) = 1/n$ be the uniform distribution. Then:

$$\log n - H(X) = \sum_x p(x)\log\frac{p(x)}{1/n} = D_{\mathrm{KL}}(p\,\|\,u) \ge 0$$

The last inequality holds because KL divergence is always non-negative (proven in Section 02). Therefore $H(X) \le \log n$. Equality holds iff $p = u$, i.e., $X$ is uniform.
Implications for ML:
- A language model's output entropy over a vocabulary $\mathcal{V}$ is bounded above by $\log|\mathcal{V}|$. A model outputting the uniform distribution has maximum entropy and zero confidence.
- Entropy approaching $0$ means the model is becoming deterministic - either very confident (good) or overfit to a degenerate output (bad, e.g., mode collapse in generation).
3.2 Concavity
Theorem. The entropy function $H(p)$ is strictly concave on the probability simplex $\Delta_{n-1} = \{p \in \mathbb{R}^n : p_i \ge 0,\ \sum_i p_i = 1\}$.
Proof. We must show that for any $p, q \in \Delta_{n-1}$ and $\lambda \in [0, 1]$:

$$H(\lambda p + (1-\lambda)q) \ge \lambda H(p) + (1-\lambda)H(q)$$

This follows because $f(t) = -t\log t$ is concave on $[0, 1]$ (since $f''(t) = -1/t < 0$), and $H(p) = \sum_i f(p_i)$. A sum of concave functions is concave.
Strict concavity: equality holds iff $p = q$ (or $\lambda \in \{0, 1\}$).
Consequence: Mixing increases entropy. If a mixture source outputs from distribution with probability and from distribution with probability , the entropy of the mixture is at least as large as the weighted average of the individual entropies. Intuitively, mixing introduces additional uncertainty about which sub-source generated the output.
For ML: The concavity of entropy means that the maximum entropy distribution over a convex constraint set is unique. This underpins the MaxEnt principle (Section 5). It also means that gradient-based optimization of entropy (as in SAC) is well-behaved - there is a unique maximum.
3.3 Subadditivity
Theorem (Subadditivity of Entropy). For any two jointly distributed random variables $X$, $Y$:

$$H(X, Y) \le H(X) + H(Y)$$

with equality if and only if $X$ and $Y$ are independent.
Proof. Using the definition of joint entropy:

$$H(X) + H(Y) - H(X, Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)} = D_{\mathrm{KL}}\big(p_{XY}\,\|\,p_X p_Y\big)$$

Since $D_{\mathrm{KL}} \ge 0$, we have $H(X, Y) \le H(X) + H(Y)$. Equality holds iff $p(x,y) = p(x)\,p(y)$ for all $x, y$, i.e., $X \perp Y$.
Intuition. The joint entropy is the total uncertainty about the pair $(X, Y)$. If $X$ and $Y$ share information (are dependent), then knowing $X$ reduces uncertainty about $Y$, so the total joint uncertainty is less than the sum of individual uncertainties.
The quantity $I(X;Y) = H(X) + H(Y) - H(X, Y)$ is the mutual information - the amount of information $X$ and $Y$ share. This is the subject of Section 03 Mutual Information. Subadditivity is equivalent to $I(X;Y) \ge 0$.
For ML: In multi-task learning, the subadditivity of entropy bounds how much information the combined tasks share. In representation learning (e.g., InfoNCE objectives in contrastive learning), maximizing mutual information between views is equivalent to reducing the gap in .
3.4 Invariance Under Bijections
Theorem. If $g$ is a bijection (one-to-one and onto), then $H(g(X)) = H(X)$.
Proof. A bijection preserves the probability values: $P(g(X) = g(x)) = P(X = x)$. Since entropy depends only on the multiset of probability values, it is unchanged.
Non-bijective case. If $g$ is not injective (many-to-one), then $H(g(X)) \le H(X)$ - collapsing outcomes reduces entropy. This is the data processing inequality for functions of $X$: processing cannot increase entropy.
For ML: Lossless tokenization is a bijection on the original character sequence, so the entropy of the token-level distribution is determined by the character-level entropy plus the log of the average number of characters per token. This connects token-level perplexity to character-level perplexity.
3.5 Continuity
Theorem. $H(p)$ is a continuous function of the probability vector $p$.
The proof uses the fact that $t \mapsto -t\log t$ is continuous on $[0, 1]$ (with value $0$ at $t = 0$ by convention), and a finite sum of continuous functions is continuous.
Fano's inequality. A quantitative version of continuity: if $\hat{X}$ is an estimate of $X$ with error probability $P_e = P(\hat{X} \ne X)$, then:

$$H(X \mid \hat{X}) \le H_b(P_e) + P_e\log\big(|\mathcal{X}| - 1\big)$$

This bounds the conditional entropy (residual uncertainty) given the probability of error. Fano's inequality is a fundamental tool for proving converse channel capacity theorems, and it is covered in depth in Section 03 Mutual Information.
This bounds the conditional entropy (residual uncertainty) given the probability of error. Fano's inequality is a fundamental tool for proving converse channel capacity theorems, and it is covered in depth in Section 03 Mutual Information.
4. Joint Entropy, Conditional Entropy, and the Chain Rule
4.1 Joint Entropy
Definition (Joint Entropy). For jointly distributed random variables $X$ and $Y$ with joint PMF $p(x, y)$:

$$H(X, Y) = -\sum_{x,y} p(x, y)\log p(x, y)$$

Joint entropy measures the total uncertainty about the pair $(X, Y)$.
Properties:
- $H(X, Y) \ge \max\{H(X), H(Y)\}$ - the joint entropy is at least as large as either marginal
- $H(X, Y) \le H(X) + H(Y)$ - subadditivity (proven above)
- $H(X, Y) = H(X) + H(Y)$ iff $X \perp Y$
- $H(X, Y) = H(X)$ iff $Y = f(X)$ for some deterministic function $f$ - if $Y$ is determined by $X$, knowing $X$ gives you $Y$ for free
Generalization. For $n$ jointly distributed variables $X_1, \dots, X_n$:

$$H(X_1, \dots, X_n) = -\sum_{x_1, \dots, x_n} p(x_1, \dots, x_n)\log p(x_1, \dots, x_n)$$

By subadditivity: $H(X_1, \dots, X_n) \le \sum_{i=1}^{n} H(X_i)$, with equality iff all $X_i$ are mutually independent.
For AI: In a language model over token sequences $x_1, \dots, x_T$, the joint entropy $H(X_1, \dots, X_T)$ is the total information in a sequence of $T$ tokens. The entropy rate (Section 7) is $H(X_1, \dots, X_T)/T$ in the limit - the per-token information content.
4.2 Conditional Entropy
Definition (Conditional Entropy). The conditional entropy of $Y$ given $X$ is:

$$H(Y \mid X) = \sum_x p(x)\,H(Y \mid X = x) = -\sum_{x,y} p(x, y)\log p(y \mid x)$$

This is the expected entropy of $Y$ after observing $X$ - the residual uncertainty about $Y$ that remains after knowing $X$.
Key identity:

$$H(X, Y) = H(X) + H(Y \mid X)$$

Proof:

$$H(X, Y) = -\sum_{x,y} p(x,y)\log\big[p(x)\,p(y \mid x)\big] = -\sum_{x,y} p(x,y)\log p(x) - \sum_{x,y} p(x,y)\log p(y \mid x) = H(X) + H(Y \mid X)$$

Properties:
- $H(Y \mid X) \le H(Y)$ - conditioning never increases entropy
- $H(Y \mid X) = H(Y)$ iff $X \perp Y$ - knowing $X$ tells you nothing about $Y$
- $H(Y \mid X) = 0$ iff $Y$ is a deterministic function of $X$ - knowing $X$ determines $Y$ completely
- $H(X \mid X) = 0$ - knowing $X$ eliminates all uncertainty about $X$
Important subtlety: $H(Y \mid X) \ne H(Y \mid X = x)$ in general. The conditional entropy $H(Y \mid X)$ is an expectation over $x$; the quantity $H(Y \mid X = x)$ is the entropy of the conditional distribution $p(y \mid x)$ for a specific value $x$.
For AI - Next-Token Prediction: In a language model, $H(X_t \mid X_{<t})$ is the conditional entropy of the $t$-th token given the context. This is exactly the per-token uncertainty that the model must compress. The average of this over $t$ is the entropy rate. The model's prediction $q_\theta(x_t \mid x_{<t})$ attempts to approximate this conditional distribution.
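A small sketch verifying these identities on a hand-picked joint PMF (the table values are illustrative, not from the text):

```python
import numpy as np

def H(p):
    """Entropy (bits) of any array of probabilities, ignoring zeros."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Illustrative joint PMF p(x, y): rows index x, columns index y.
pxy = np.array([[0.30, 0.10],
                [0.20, 0.40]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

H_xy, H_x, H_y = H(pxy), H(px), H(py)
H_y_given_x = H_xy - H_x           # chain rule: H(X,Y) = H(X) + H(Y|X)
mutual_info = H_x + H_y - H_xy     # I(X;Y) >= 0, the subadditivity gap

print(f"H(X)={H_x:.3f}  H(Y)={H_y:.3f}  H(X,Y)={H_xy:.3f}")
print(f"H(Y|X)={H_y_given_x:.3f} <= H(Y)={H_y:.3f}")  # conditioning never increases entropy
print(f"I(X;Y)={mutual_info:.3f}")
```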
4.3 Chain Rule for Entropy
Theorem (Chain Rule for Entropy).

$$H(X_1, X_2, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1})$$

with the convention $H(X_1 \mid \varnothing) = H(X_1)$ (no conditioning for the first term).
Proof. By induction using $H(X_1, X_2) = H(X_1) + H(X_2 \mid X_1)$, which follows from the identity $p(x_1, x_2) = p(x_1)\,p(x_2 \mid x_1)$. The general case follows by repeatedly applying this identity.
Special cases:
- $n = 2$: $H(X_1, X_2) = H(X_1) + H(X_2 \mid X_1)$
- i.i.d. case: all $X_i$ i.i.d., so $H(X_i \mid X_{<i}) = H(X_1)$ and $H(X_1, \dots, X_n) = n\,H(X_1)$
For AI - Language Model Factorization: Every autoregressive language model uses the chain rule for entropy (or rather, for probability via the chain rule of probability):

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t}) \quad\Longleftrightarrow\quad -\log p(x_{1:T}) = \sum_{t=1}^{T} -\log p(x_t \mid x_{<t})$$

The average negative log-likelihood $-\frac{1}{T}\sum_t \log q_\theta(x_t \mid x_{<t})$ is the empirical estimate of the entropy rate; as $T \to \infty$ it converges to the true entropy rate of the language distribution when $q_\theta = p$ (and to the cross-entropy rate otherwise).
4.4 Venn Diagram of Information Quantities
The relationships among entropy quantities can be visualized using a Venn diagram where the area of each region represents a quantity:
VENN DIAGRAM OF INFORMATION QUANTITIES
H(X,Y) (joint entropy)
H(X)
H(X|Y) I(X;Y) H(Y|X)
H(Y)
H(X) = H(X|Y) + I(X;Y) [X's total entropy]
H(Y) = H(Y|X) + I(X;Y) [Y's total entropy]
H(X,Y) = H(X) + H(Y|X) [chain rule]
= H(Y) + H(X|Y) [chain rule, reversed]
= H(X) + H(Y) - I(X;Y) [via mutual information]
I(X;Y) = H(X) + H(Y) - H(X,Y) [mutual information]
= H(X) - H(X|Y) [= reduction in uncertainty]
= H(Y) - H(Y|X) [symmetric]
The quantity in the overlap is the mutual information $I(X;Y)$ - the information shared between $X$ and $Y$. When $I(X;Y) = 0$ ($X$ and $Y$ independent), the circles do not overlap: $H(X, Y) = H(X) + H(Y)$. This quantity is the subject of Section 03 Mutual Information.
5. Maximum Entropy Principle
5.1 MaxEnt Statement
The maximum entropy principle (Jaynes, 1957) states:
Among all probability distributions consistent with given constraints (known information), choose the one with maximum entropy.
The MaxEnt distribution is the "most uninformative" distribution given what you know - it makes no assumptions beyond the specified constraints. This is not a heuristic but a theorem: the MaxEnt distribution is the unique distribution consistent with the constraints that arises as the uniform distribution over microstate descriptions given macroscopic constraints (Jaynes' maximum entropy formalism).
Why maximum entropy?
- Any distribution with lower entropy is implicitly asserting additional structure not in the constraints
- The MaxEnt distribution is the unique one that is invariant under all symmetries of the constraint set
- It minimizes the KL divergence from the uniform distribution (proof in Section 02)
Examples of MaxEnt distributions:
- Constraint: support on a finite set only -> MaxEnt: uniform distribution
- Constraint: support on $[0, \infty)$, fixed mean $\mathbb{E}[X] = \mu$ -> MaxEnt: exponential distribution
- Constraint: support on $\mathbb{R}$, fixed mean and variance -> MaxEnt: Gaussian
5.2 MaxEnt Derivation via Lagrange Multipliers
Problem. Maximize $H(p) = -\sum_x p(x)\log p(x)$ subject to:
- $\sum_x p(x) = 1$ (normalization)
- $\sum_x p(x)\,f_k(x) = \mu_k$ for $k = 1, \dots, K$ (moment constraints)
Method: Lagrange multipliers. Form the Lagrangian:

$$\mathcal{L} = -\sum_x p(x)\log p(x) + \lambda_0\Big(\sum_x p(x) - 1\Big) + \sum_{k=1}^{K}\lambda_k\Big(\sum_x p(x)\,f_k(x) - \mu_k\Big)$$

Setting $\partial\mathcal{L}/\partial p(x) = 0$:

$$-\log p(x) - 1 + \lambda_0 + \sum_k \lambda_k f_k(x) = 0$$

Solving:

$$p^*(x) = \frac{1}{Z}\exp\Big(\sum_k \lambda_k f_k(x)\Big)$$

where $Z = \sum_x \exp\big(\sum_k \lambda_k f_k(x)\big)$ is the partition function (normalization constant). The $\lambda_k$ are chosen to satisfy the moment constraints $\mathbb{E}_{p^*}[f_k(X)] = \mu_k$.
Result. The MaxEnt distribution given moment constraints is always of the exponential family form:

$$p^*(x) \propto \exp\Big(\sum_k \lambda_k f_k(x)\Big)$$

Worked example - Gaussian as MaxEnt. Maximize $h(f)$ subject to $\mathbb{E}[X] = \mu$ and $\mathrm{Var}(X) = \sigma^2$. The constraint statistics are $f_1(x) = x$ and $f_2(x) = x^2$. The MaxEnt distribution is:

$$f^*(x) \propto \exp\big(\lambda_1 x + \lambda_2 x^2\big)$$

Completing the square (with $\lambda_2 < 0$): $\lambda_1 x + \lambda_2 x^2 = \lambda_2\big(x + \tfrac{\lambda_1}{2\lambda_2}\big)^2 + \text{const}$, which is exactly the Gaussian $\mathcal{N}(\mu, \sigma^2)$ once the constraints fix $\lambda_1$ and $\lambda_2$.
Worked example - Uniform as MaxEnt. With no moment constraints (only normalization), the MaxEnt distribution is uniform: $p^*(x) = 1/|\mathcal{X}|$. Any other distribution would assign higher probability to some outcomes than others, asserting structure not in the constraints.
5.3 Exponential Family from MaxEnt
The MaxEnt derivation shows that the exponential family:

$$p(x \mid \eta) = h(x)\exp\big(\eta^\top T(x) - A(\eta)\big)$$

arises naturally whenever we impose moment constraints on the sufficient statistics $T(x)$. Conversely, every exponential family distribution is the MaxEnt distribution for some set of moment constraints.
| Distribution | Constraints | Natural parameter |
|---|---|---|
| Uniform | support only, no moments | none |
| Exponential | $x \ge 0$, $\mathbb{E}[X] = 1/\lambda$ | $-\lambda$ |
| Gaussian | $\mathbb{E}[X] = \mu$, $\mathbb{E}[X^2] = \mu^2 + \sigma^2$ | $\big(\mu/\sigma^2,\ -1/(2\sigma^2)\big)$ |
| Poisson | $x \in \{0, 1, 2, \dots\}$, $\mathbb{E}[X] = \lambda$ | $\log\lambda$ |
| Bernoulli | $x \in \{0, 1\}$, $\mathbb{E}[X] = p$ | $\log\frac{p}{1-p}$ |
Boltzmann distribution. In statistical physics, the MaxEnt distribution under fixed mean energy $\mathbb{E}[E(x)]$ is the Boltzmann distribution $p(x) \propto e^{-\beta E(x)}$, where $\beta$ is the inverse temperature. Shannon entropy is formally identical to the Gibbs-Boltzmann thermodynamic entropy.
5.4 MaxEnt in Machine Learning
Temperature in language model sampling. The softmax operation in LLMs:

$$\pi_\tau(y) = \frac{\exp(z_y/\tau)}{\sum_{y'}\exp(z_{y'}/\tau)}$$

is the MaxEnt distribution under a fixed mean logit $\mathbb{E}_\pi[z_Y]$, where $\tau$ is the temperature. High $\tau$ -> uniform (maximum entropy); low $\tau$ -> argmax (minimum entropy). This is why temperature is the natural control knob for generation diversity.
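A minimal sketch of the temperature-entropy relationship, using illustrative logits (not from any real model):

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=float) / tau
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy_nats(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

logits = np.array([4.0, 2.0, 1.0, 0.5, 0.0])   # illustrative logits over 5 tokens
for tau in [0.1, 0.5, 1.0, 2.0, 10.0]:
    p = softmax(logits, tau)
    print(f"tau={tau:5.1f}  entropy={entropy_nats(p):.3f} nats  top-1 prob={p.max():.3f}")
# Low tau -> near-argmax (entropy ~ 0); high tau -> near-uniform (entropy ~ ln 5 ~ 1.609).
```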
MaxEnt language models. Before neural LLMs, MaxEnt models were the dominant NLP approach. They model $p(y \mid x) \propto \exp\big(w^\top\phi(x, y)\big)$ where $\phi(x, y)$ is a feature vector. This is logistic regression with hand-crafted features - still the MaxEnt distribution given feature expectations.
Entropy regularization in optimization. Adding the negative entropy $-H(p)$ as a regularizer to an optimization problem (mirror descent, natural gradient) corresponds to constraining the solution to stay close to the MaxEnt (uniform) distribution. Softmax normalization in attention can be viewed as MaxEnt regularization of the attention weights.
6. Source Coding and Data Compression
6.1 Shannon's Source Coding Theorem
The most fundamental theorem of information theory connects entropy to the minimum number of bits needed to describe a random variable.
Setup. We have a source producing i.i.d. symbols from distribution $p$ over alphabet $\mathcal{X}$. A code $C$ assigns a binary string to each symbol. The average code length is $\bar{L}(C) = \sum_x p(x)\,\ell(x)$ where $\ell(x)$ is the length of the codeword for $x$.
Theorem (Shannon 1948). For any uniquely decodable code $C$:

$$\bar{L}(C) \ge H_2(X)$$

where $H_2(X)$ is the entropy in bits. Furthermore, there exists a uniquely decodable code achieving:

$$\bar{L}(C) < H_2(X) + 1$$

The lower bound says: no code can do better than entropy. The upper bound says: entropy is achievable within 1 bit per symbol. The gap of 1 bit can be made arbitrarily small by coding blocks of $n$ symbols (achieving $H_2(X) + 1/n$ per symbol).
Proof of lower bound. By Kraft's inequality (Section 6.2), any uniquely decodable code satisfies $c = \sum_x 2^{-\ell(x)} \le 1$. Define $q(x) = 2^{-\ell(x)}/c$ as a probability distribution. Then:

$$\bar{L}(C) - H_2(X) = \sum_x p(x)\big(\ell(x) + \log_2 p(x)\big) = \sum_x p(x)\log_2\frac{p(x)}{q(x)} - \log_2 c \ge 0,$$

using $D_{\mathrm{KL}}(p\,\|\,q) \ge 0$ and $-\log_2 c \ge 0$ (since $c \le 1$ by Kraft's inequality for decodable codes).
Shannon code. Assign code length $\ell(x) = \lceil\log_2(1/p(x))\rceil$. This achieves $\ell(x) < \log_2(1/p(x)) + 1$, giving $\bar{L} < H_2(X) + 1$.
For AI: Every neural network that outputs log-probabilities implicitly defines a code. The cross-entropy loss is the code length of outcome under the model's code. Minimizing cross-entropy minimizes the expected code length - i.e., the model is learning to compress the data.
6.2 Prefix Codes and Kraft's Inequality
A prefix code (or instantaneous code) is a code where no codeword is a prefix of another. Prefix codes are desirable because they can be decoded without reading ahead (instantaneous decoding).
Theorem (Kraft's Inequality). A prefix code with codeword lengths $\ell_1, \dots, \ell_m$ over a binary alphabet exists if and only if:

$$\sum_{i=1}^{m} 2^{-\ell_i} \le 1$$

The inequality is tight for complete prefix codes (every binary string has a unique prefix that is a codeword).
Intuition. Each codeword of length $\ell$ "uses up" a fraction $2^{-\ell}$ of the binary tree's budget. Kraft's inequality says you cannot exceed the available budget. The integer lengths $\ell(x) = \lceil\log_2(1/p(x))\rceil$ satisfy Kraft's inequality since $\sum_x 2^{-\lceil\log_2(1/p(x))\rceil} \le \sum_x p(x) = 1$.
Connection to entropy. The optimal code length for outcome $x$ is $\log_2(1/p(x))$ (generally not an integer). Rounding up to $\lceil\log_2(1/p(x))\rceil$ gives the Shannon code. The entropy is the minimum average length achievable by any prefix code.
6.3 Huffman Coding
Huffman's algorithm (1952) constructs the optimal prefix code for a given distribution - the code that minimizes exactly (over integer code lengths).
Algorithm:
- Create a leaf node for each symbol with weight
- While more than one node remains: combine the two lowest-weight nodes into a parent node with weight equal to their sum; assign bit
0to one child,1to the other - Read off codewords from the resulting binary tree
Example. Distribution: $p(A) = 0.5$, $p(B) = 0.25$, $p(C) = 0.125$, $p(D) = 0.125$.
HUFFMAN CODE CONSTRUCTION
Symbols: A(0.5) B(0.25) C(0.125) D(0.125)
Step 1: Combine C and D -> CD(0.25)
Step 2: Combine B and CD -> BCD(0.5)
Step 3: Combine A and BCD -> root(1.0)
Tree: root(1.0)
/ \
A(0.5) BCD(0.5)
0 / 1
B(0.25) CD(0.25)
0 / 1
C(0.125) D(0.125)
0 1
Codewords: A=0, B=10, C=110, D=111
Average length: 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75 bits
Entropy: H = 0.5*log2(2) + 0.25*log2(4) + 0.125*log2(8) + 0.125*log2(8) = 1.75 bits (exact!)
This distribution has $p(x) = 2^{-\ell(x)}$ exactly (a dyadic distribution), so Huffman coding achieves the entropy exactly with no rounding loss.
Huffman optimality. Huffman coding achieves the minimum average code length over all prefix codes for the given distribution. The average length satisfies $H_2(X) \le \bar{L}_{\text{Huffman}} < H_2(X) + 1$.
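A compact Huffman-coding sketch (a heap-based variant of the algorithm above; the function name and tie-breaking details are ours), reproducing the worked example:

```python
import heapq

def huffman_code(probs):
    """Return a dict symbol -> binary codeword for the distribution `probs`."""
    # Heap entries: (weight, tie-breaker, partial code table for that subtree).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)            # two lowest-weight subtrees
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c0.items()}
        merged.update({s: "1" + code for s, code in c1.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1
    return heap[0][2]

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code)      # e.g. {'A': '0', 'B': '10', 'C': '110', 'D': '111'} (up to 0/1 relabeling)
print(avg_len)   # 1.75 bits = H(p): exact, because the distribution is dyadic
```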
6.4 Range Coding in LLM Inference
Modern LLM inference systems use arithmetic coding (and its approximation range coding) to losslessly compress the model's output token sequence to the model's predicted entropy per token.
Key idea. Arithmetic coding encodes a sequence of symbols by narrowing an interval based on the predicted probabilities of each symbol. The final interval can be encoded in roughly $-\log_2 P(\text{sequence}) + 2$ bits - within 2 bits of the information content of the entire sequence.
Perplexity and compression. A language model with perplexity $\text{PPL}$ can compress text to approximately $\log_2(\text{PPL})$ bits per token. A model with $\text{PPL} = 10$ compresses to ~3.32 bits/token; with $\text{PPL} = 3$ to ~1.58 bits/token. This is why perplexity is a measure of compression efficiency: lower perplexity = better compressor.
For AI: The equivalence between learning (cross-entropy minimization), prediction (perplexity), and compression (code length) is one of the deepest connections in information theory for ML. Chinchilla scaling laws can be derived from the information-theoretic view of language modeling as optimal compression.
7. Entropy Rate of Stochastic Processes
7.1 Entropy Rate: Definition
For a sequence of random variables $X_1, X_2, \dots$, the entropy rate measures the per-symbol entropy in the limit:
Definition. The entropy rate of a stochastic process $\{X_t\}$ is:

$$H(\mathcal{X}) = \lim_{n \to \infty}\frac{1}{n}H(X_1, X_2, \dots, X_n)$$

when the limit exists.
Equivalent formulation. For a stationary process, the entropy rate also equals:

$$H'(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \dots, X_1),$$

the conditional entropy of the next symbol given an infinitely long past. This limit exists and equals the previous one for stationary processes.
For i.i.d. processes: $H(\mathcal{X}) = H(X_1)$ - each symbol contributes its full individual entropy.
For deterministic processes: $H(\mathcal{X}) = 0$ - once you know the rule, you can predict every future symbol.
For natural language: The entropy rate of English is estimated at ~1-1.5 bits per character (Shannon estimated it at 0.6-1.3 bits/character through human prediction experiments). This is much less than $\log_2 27 \approx 4.75$ bits/character (uniform over 26 letters plus space), reflecting the strong statistical regularities in language.
7.2 Entropy Rate of Markov Chains
For a stationary ergodic Markov chain with state space $\mathcal{S}$, transition matrix $P$ (where $P_{ij} = P(X_{t+1} = j \mid X_t = i)$), and stationary distribution $\pi$ satisfying $\pi P = \pi$:

$$H(\mathcal{X}) = -\sum_{i \in \mathcal{S}}\pi_i\sum_{j \in \mathcal{S}} P_{ij}\log P_{ij}$$

Derivation. For a stationary Markov chain, $H(X_n \mid X_{n-1}, \dots, X_1) = H(X_n \mid X_{n-1})$ (Markov property). In stationarity, this equals:

$$H(X_n \mid X_{n-1}) = \sum_i \pi_i\,H(X_n \mid X_{n-1} = i) = -\sum_i \pi_i\sum_j P_{ij}\log P_{ij}$$

Example - Bigram language model. A bigram language model is a first-order Markov chain over the vocabulary $\mathcal{V}$. Its entropy rate is:

$$H(\mathcal{X}) = -\sum_{v \in \mathcal{V}}\pi_v\sum_{w \in \mathcal{V}} P(w \mid v)\log P(w \mid v)$$

where $\pi_v$ is the stationary token frequency and $P(w \mid v)$ is the bigram probability.
Worked example. Two-state Markov chain: states $\{0, 1\}$, symmetric transition matrix $P = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}$ with switching probability $p$ (this symmetric parametrization is an illustrative choice).
Stationary distribution: $\pi_0 = 1/2$, $\pi_1 = 1/2$. Entropy rate:

$$H(\mathcal{X}) = H_b(p),$$

where $H_b$ is the binary entropy function.
When $p = 1/2$ (memoryless): $H(\mathcal{X}) = 1$ bit (matches i.i.d.).
When $p$ is small, e.g. $p = 0.01$ (sticky): $H(\mathcal{X}) = H_b(0.01) \approx 0.08$ bits - very predictable.
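A sketch of this computation (NumPy; the switching probability `p = 0.1` is illustrative and the helper names are ours):

```python
import numpy as np

def stationary_distribution(P):
    """Stationary pi with pi P = pi (left eigenvector for eigenvalue 1)."""
    evals, evecs = np.linalg.eig(P.T)
    v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return v / v.sum()

def entropy_rate(P, base=2):
    """Entropy rate of a stationary ergodic Markov chain with transition matrix P."""
    pi = stationary_distribution(P)
    row_entropies = [
        float(-(row[row > 0] * np.log(row[row > 0])).sum() / np.log(base)) for row in P
    ]
    return float(np.dot(pi, row_entropies))

p = 0.1   # illustrative switching probability for the symmetric two-state chain
P = np.array([[1 - p, p],
              [p, 1 - p]])
print(entropy_rate(P))   # = H_b(0.1) ~ 0.469 bits/symbol; p = 0.5 would give 1.0 bit
```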
7.3 Perplexity as Entropy Rate Estimate
Definition (Perplexity). Given a language model $q_\theta$ and a test sequence $x_1, \dots, x_T$:

$$\text{PPL}(x_{1:T}) = \exp\Big(-\frac{1}{T}\sum_{t=1}^{T}\ln q_\theta(x_t \mid x_{<t})\Big)$$

(Using natural log; for base-2 bits, replace $\exp$ and $\ln$ with $2^{(\cdot)}$ and $\log_2$.)
Connection to entropy rate. By the ergodic theorem, as $T \to \infty$:

$$-\frac{1}{T}\sum_{t=1}^{T}\ln q_\theta(x_t \mid x_{<t}) \;\longrightarrow\; \lim_{T\to\infty}\frac{1}{T}\,\mathbb{E}_p\big[-\ln q_\theta(X_{1:T})\big] \quad\text{(the cross-entropy rate)}$$

If $q_\theta = p$ (perfect model), this equals the true entropy rate $H(\mathcal{X})$, so $\text{PPL} \to e^{H(\mathcal{X})}$ - a perfect model's perplexity equals the exponentiated entropy rate of the language.
Interpretation table:
| Perplexity | Bits/token | Interpretation |
|---|---|---|
| 2 | 1 bit | Near-perfect prediction; approximately 2 equally likely choices per token |
| ~7.5 | 2.9 bits | Good model; ~7 equally likely choices |
| ~20 | 4.3 bits | Moderate model |
| 32,768 | 15 bits | Uniform distribution over a 32k-token vocabulary; random model |
For AI - Benchmark context (2026): Rough orders of magnitude for LLMs on standard benchmarks:
- Frontier models (GPT-4- and Claude-class): far lower perplexities than GPT-2-era models; exact values depend on the tokenizer and evaluation protocol and are not always publicly reported
- GPT-2 (1.5B): WikiText-103 perplexity of roughly 17-18 in the original evaluation (about 4.1 bits/token)
- A random model over a 50k vocabulary: PPL = 50,000
Cross-entropy rate vs true entropy rate. The perplexity of a model is always at least $e^{H(\mathcal{X})}$ (the perplexity of a perfect model), because $\mathbb{E}_p[-\ln q_\theta] = H(p) + D_{\mathrm{KL}}(p\,\|\,q_\theta) \ge H(p)$. Improving a language model reduces the KL gap to the true distribution, pushing perplexity toward the fundamental entropy rate limit.
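A minimal perplexity computation from per-token probabilities (the probability values are illustrative, standing in for what a model assigned to the observed tokens):

```python
import numpy as np

# Illustrative per-token probabilities a hypothetical model q assigned to the
# tokens that actually occurred in a short held-out sequence.
token_probs = np.array([0.40, 0.05, 0.60, 0.10, 0.25, 0.90, 0.02, 0.30])

nll_nats = -np.log(token_probs)       # per-token surprise in nats
avg_nll = nll_nats.mean()             # estimate of the cross-entropy rate
perplexity = np.exp(avg_nll)          # PPL = exp(average NLL in nats)
bits_per_token = avg_nll / np.log(2)

print(f"avg NLL = {avg_nll:.3f} nats ({bits_per_token:.3f} bits/token)")
print(f"perplexity = {perplexity:.2f}")
```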
8. Differential Entropy
8.1 Definition and Subtleties
For continuous random variables, the analogue of Shannon entropy is differential entropy.
Definition. Let $X$ be a continuous random variable with probability density function $f(x)$. The differential entropy is:

$$h(X) = -\int f(x)\ln f(x)\,dx$$
Key differences from discrete entropy:
- Can be negative. Unlike $H(X) \ge 0$, differential entropy satisfies no such lower bound. For example, $X \sim \mathrm{Uniform}(0, 1/2)$ has $h(X) = \ln(1/2) < 0$. A very concentrated distribution has $h(X) \to -\infty$.
- Not invariant under bijections. Under a bijection $g$:

$$h(g(X)) = h(X) + \mathbb{E}\big[\ln|g'(X)|\big]$$

The Jacobian term appears. This means differential entropy depends on the scale of the variable.
- Not a probability. $f(x)$ can exceed 1 for densities, so $-\ln f(x)$ can be negative for individual points.
- Limit of discrete entropy. For a discrete approximation of $X$ with bin width $\Delta$: $H(X_\Delta) \approx h(X) - \log\Delta$. As $\Delta \to 0$, the discrete entropy diverges (infinite precision requires infinite bits), but the differential entropy is the finite part of this divergence.
Despite these subtleties, differential entropy is meaningful for comparing distributions of the same type and for measuring the "entropy-like" uncertainty of continuous variables.
8.2 Differential Entropy of Standard Distributions
Uniform distribution $X \sim \mathrm{Uniform}(a, b)$:

$$h(X) = \ln(b - a)$$

Note: $h(X) < 0$ when $b - a < 1$; $h(X) = 0$ when $b - a = 1$.
Gaussian distribution $X \sim \mathcal{N}(\mu, \sigma^2)$:

$$h(X) = \tfrac{1}{2}\ln(2\pi e\sigma^2)$$

Derivation: $h(X) = \mathbb{E}\big[-\ln f(X)\big] = \mathbb{E}\Big[\tfrac{(X-\mu)^2}{2\sigma^2} + \tfrac{1}{2}\ln(2\pi\sigma^2)\Big] = \tfrac{1}{2} + \tfrac{1}{2}\ln(2\pi\sigma^2) = \tfrac{1}{2}\ln(2\pi e\sigma^2)$.
Multivariate Gaussian $X \sim \mathcal{N}(\mu, \Sigma)$ in $d$ dimensions:

$$h(X) = \tfrac{1}{2}\ln\big((2\pi e)^d\det\Sigma\big)$$

The differential entropy of a multivariate Gaussian depends on the covariance matrix only through its determinant (product of eigenvalues). A degenerate covariance ($\det\Sigma = 0$) gives $h(X) = -\infty$.
Laplace distribution with scale $b$:

$$h(X) = 1 + \ln(2b)$$

Exponential distribution $X \sim \mathrm{Exp}(\lambda)$:

$$h(X) = 1 - \ln\lambda$$
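A quick Monte Carlo sanity check of the Gaussian formula (this evaluates the known density at samples, so it is a consistency check rather than a general entropy estimator; see Appendix E for estimation from data):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, size=200_000)

# Monte Carlo estimate of h(X) = E[-ln f(X)], using the known Gaussian density.
log_f = -0.5 * (x / sigma) ** 2 - 0.5 * np.log(2 * np.pi * sigma**2)
h_monte_carlo = float(-log_f.mean())
h_closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

print(h_monte_carlo, h_closed_form)   # both ~2.112 nats for sigma = 2
```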
For AI - VAE latent space. The VAE (Variational Autoencoder) encodes inputs $x$ into a distribution $q_\phi(z \mid x)$. The ELBO involves the differential entropy of this distribution:

$$\mathrm{ELBO} = \mathbb{E}_{q_\phi}\big[\ln p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$

The KL term equals $-h\big(q_\phi(z \mid x)\big) - \mathbb{E}_{q_\phi}[\ln p(z)]$, so optimizing the ELBO maximizes the differential entropy of the encoder distribution subject to staying close to the prior.
8.3 Maximum Differential Entropy: The Gaussian
Theorem. Among all continuous distributions with fixed variance $\sigma^2$, the Gaussian maximizes the differential entropy:

$$h(X) \le \tfrac{1}{2}\ln(2\pi e\sigma^2)$$

with equality iff $X \sim \mathcal{N}(\mu, \sigma^2)$.
Proof. Let $f$ be any density with variance $\sigma^2$ and let $g = \mathcal{N}(\mu, \sigma^2)$ be the Gaussian with the same mean and variance. Then:

$$0 \le D_{\mathrm{KL}}(f\,\|\,g) = \int f\ln\frac{f}{g} = -h(f) - \int f\ln g$$

Now $\ln g(x) = -\tfrac{(x-\mu)^2}{2\sigma^2} - \tfrac{1}{2}\ln(2\pi\sigma^2)$ depends on $x$ only through its first two moments, and $f$ and $g$ share those moments, so:

$$\int f\ln g = \int g\ln g = -h(g)$$

Therefore $0 \le -h(f) + h(g)$, giving $h(f) \le h(g) = \tfrac{1}{2}\ln(2\pi e\sigma^2)$.
Implication. The Gaussian is the MaxEnt distribution for continuous variables with fixed variance. This justifies the Gaussian prior in Bayesian models: it encodes only the constraint on variance and makes no additional assumptions. It also explains why many natural phenomena follow Gaussian distributions - they are consistent with a variance constraint and maximal uncertainty otherwise.
9. Renyi Entropy and Generalizations
9.1 Renyi Entropy
Shannon entropy is one member of a parametric family introduced by Alfred Renyi in 1961.
Definition (Renyi Entropy). For $\alpha \ge 0$, $\alpha \ne 1$, the Renyi entropy of order $\alpha$ is:

$$H_\alpha(X) = \frac{1}{1-\alpha}\log\sum_x p(x)^\alpha$$
Limiting cases:
| Order | Name | Formula |
|---|---|---|
| $\alpha = 0$ | Hartley entropy (max-entropy) | $H_0 = \log|\{x : p(x) > 0\}|$ |
| $\alpha \to 1$ | Shannon entropy | $H_1 = -\sum_x p(x)\log p(x)$ |
| $\alpha = 2$ | Collision entropy | $H_2 = -\log\sum_x p(x)^2$ |
| $\alpha \to \infty$ | Min-entropy | $H_\infty = -\log\max_x p(x)$ |
Ordering property. Renyi entropy is a non-increasing function of $\alpha$:

$$H_0(X) \ge H_1(X) \ge H_2(X) \ge \dots \ge H_\infty(X)$$

All orders coincide for the uniform distribution.
Shannon entropy as the limit. Using L'Hopital's rule:

$$\lim_{\alpha \to 1} H_\alpha(X) = \lim_{\alpha \to 1}\frac{\log\sum_x p(x)^\alpha}{1 - \alpha} = -\sum_x p(x)\log p(x) = H(X)$$

Additivity. Renyi entropy satisfies $H_\alpha(X, Y) = H_\alpha(X) + H_\alpha(Y)$ when $X \perp Y$, for all $\alpha$. Shannon entropy satisfies a stronger additivity: the chain rule.
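A sketch of the family, with the Shannon and min-entropy limits handled explicitly (the distribution is illustrative):

```python
import numpy as np

def renyi_entropy(p, alpha, base=2):
    """Renyi entropy H_alpha; alpha=1 falls back to Shannon, alpha=inf to min-entropy."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isinf(alpha):
        return float(-np.log(p.max()) / np.log(base))
    if np.isclose(alpha, 1.0):
        return float(-(p * np.log(p)).sum() / np.log(base))
    return float(np.log((p ** alpha).sum()) / ((1 - alpha) * np.log(base)))

p = np.array([0.5, 0.25, 0.125, 0.125])   # illustrative distribution
for a in [0, 0.5, 1, 2, np.inf]:
    print(f"H_{a}: {renyi_entropy(p, a):.4f} bits")
# Values are non-increasing in alpha: H_0 >= H_1 >= H_2 >= H_inf.
```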
9.2 Min-Entropy and Security
Definition. The min-entropy is:

$$H_\infty(X) = -\log\max_x p(x)$$
Min-entropy is the worst-case uncertainty - it measures how concentrated the distribution is at its most likely outcome.
Security interpretation. is the guessing advantage - the probability that an adversary can correctly guess in one try by always guessing the most likely outcome. A cryptographic key with high min-entropy is hard to guess; one with low min-entropy (concentrated on a few values) is insecure.
Connection to ML. In adversarial robustness, min-entropy of the model's output distribution bounds the probability that an adversary can predict the top-1 output without information about the input. Models with collapsed output distributions (mode seeking) have low min-entropy and are more vulnerable to certain attacks.
Relationship to other quantities:
- $H_\infty(X) \le H_2(X) \le H_1(X) = H(X)$
- For the uniform distribution over $n$ outcomes: all Renyi entropies equal $\log n$
9.3 Tsallis Entropy
The Tsallis entropy of order $q$ (Tsallis 1988) is:

$$S_q(X) = \frac{1}{q-1}\Big(1 - \sum_x p(x)^q\Big)$$

with $q \to 1$ (Shannon entropy) as the limit.
Differences from Renyi:
- Tsallis entropy is not additive for independent variables: $S_q(X, Y) = S_q(X) + S_q(Y) + (1-q)\,S_q(X)\,S_q(Y)$
- Maximized by the $q$-exponential distribution under energy constraints
$q$-exponential family and ML. Under a variance constraint, Tsallis entropy's maximizer is the $q$-Gaussian, a generalization of the Gaussian: $q < 1$ gives distributions with bounded support; $q > 1$ gives power-law tails. The $q$-softmax (replacing the exponential with the $q$-exponential) is used in some modern attention variants and sparse softmax approaches as an alternative to the standard exponential softmax, giving exactly-zero probabilities to low-scoring tokens rather than near-zero.
10. Applications in Machine Learning
10.1 Decision Tree Splits
Decision trees recursively split the training data to reduce prediction uncertainty. Information gain is the entropy-based splitting criterion:

$$IG(Y; A) = H(Y) - H(Y \mid A)$$

This is the reduction in uncertainty about the target $Y$ achieved by knowing feature $A$. The ID3 and C4.5 algorithms choose the split with maximum information gain at each node.
Binary information gain. For a binary split producing a left subset of size $n_L$ and a right subset of size $n_R$ (with $n = n_L + n_R$):

$$IG = H(Y) - \frac{n_L}{n}H(Y_L) - \frac{n_R}{n}H(Y_R)$$

where $H(Y_L)$, $H(Y_R)$ are the class entropies in the left and right partitions.
Gini vs entropy. The Gini impurity $G = \sum_k p_k(1 - p_k) = 1 - \sum_k p_k^2$ is often used as an approximation to entropy (it arises from the first-order approximation $-\ln p \approx 1 - p$). For binary classification, $G = 2p(1-p)$ while $H_b(p) = -p\log p - (1-p)\log(1-p)$. Both yield similar splits in practice.
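A minimal information-gain computation on toy labels (data and threshold are illustrative; real tree libraries search over many candidate splits):

```python
import numpy as np

def entropy_bits(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(y, mask):
    """Entropy reduction from splitting the labels y with a boolean mask."""
    n, n_left = len(y), int(mask.sum())
    h_parent = entropy_bits(y)
    h_children = (n_left / n) * entropy_bits(y[mask]) + ((n - n_left) / n) * entropy_bits(y[~mask])
    return h_parent - h_children

# Illustrative toy data: thresholding the feature x separates the classes perfectly.
x = np.array([0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90, 0.95])
y = np.array([0,    0,    0,    0,    1,    1,    1,    1])
print(information_gain(y, x < 0.5))   # 1.0 bit: parent entropy 1 bit, children 0
```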
Gradient boosted trees. XGBoost and LightGBM use second-order Taylor expansions of the loss function to compute split gains, but the connection to entropy remains: the gain measures uncertainty reduction in the residual distribution.
Random forests. The diversity of trees in a random forest is controlled by the randomized feature selection, which limits the information gain at each split. Higher randomization -> higher entropy of the ensemble -> better generalization.
10.2 Entropy Regularization in RL
Standard RL seeks a policy that maximizes expected cumulative reward. Maximum entropy RL (Ziebart et al., 2008; Haarnoja et al., 2018) adds an entropy bonus:

$$J(\pi) = \mathbb{E}_\pi\Big[\sum_t r(s_t, a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big]$$

where $\alpha$ is the temperature parameter controlling the trade-off between reward and entropy.
Why entropy regularization helps:
- Exploration: High-entropy policies try diverse actions, avoiding premature convergence to suboptimal deterministic policies.
- Robustness: Maintaining entropy provides a hedge against environment uncertainty and distributional shift.
- Multiple modes: Standard RL tends to find a single optimal policy; MaxEnt RL discovers the full manifold of near-optimal behaviors.
Soft Actor-Critic (SAC, Haarnoja et al. 2018). The dominant maximum entropy RL algorithm for continuous control. Key relations (soft value function and soft-optimal policy):

$$V(s) = \mathbb{E}_{a \sim \pi}\big[Q(s, a) - \alpha\ln\pi(a \mid s)\big], \qquad \pi^*(a \mid s) \propto \exp\big(Q(s, a)/\alpha\big)$$

The optimal policy is the Boltzmann/softmax policy over Q-values - exactly the MaxEnt distribution over actions under a fixed mean Q-value constraint. SAC is used in robotics, LLM RLHF exploration stages, and game AI.
Entropy in RLHF. In RLHF (Reinforcement Learning from Human Feedback) for LLMs, a KL penalty term enforces that the fine-tuned policy $\pi_\theta$ stays close to the reference policy $\pi_{\text{ref}}$:

$$\max_\theta\ \mathbb{E}\big[r(x, y)\big] - \beta\,D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big)$$

The KL penalty is a relative-entropy regularizer: it plays the role of an entropy bonus measured relative to $\pi_{\text{ref}}$. Keeping this term small keeps the output distribution close to the reference (and hence reasonably high-entropy), preventing the model from collapsing to reward-hacking degenerate outputs.
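A tabular sketch of an entropy-regularized policy gradient on a toy bandit (rewards, temperature, and step counts are illustrative; this is a simplification of the MaxEnt RL idea, not SAC itself):

```python
import numpy as np

# Entropy-regularized policy gradient on a toy 3-armed bandit.
rewards = np.array([1.0, 0.8, 0.2])      # illustrative per-arm rewards
alpha, lr, steps = 0.1, 0.1, 3000        # entropy temperature, learning rate, iterations
theta = np.zeros(3)                      # policy logits

for _ in range(steps):
    z = theta - theta.max()
    pi = np.exp(z) / np.exp(z).sum()
    g = rewards - alpha * np.log(pi)     # per-action "soft" payoff r_k - alpha*log pi_k
    # Gradient of E_pi[r] + alpha*H(pi) w.r.t. the logits: pi_k * (g_k - E_pi[g]).
    theta += lr * pi * (g - np.dot(pi, g))

entropy = float(-(pi * np.log(pi)).sum())
print(np.round(pi, 3))   # soft preference for the best arm (approaches softmax(rewards/alpha))
print(entropy)           # residual policy entropy stays strictly positive
```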
10.3 Confidence Calibration
A model is calibrated if its predicted probability for an event equals the empirical frequency of that event. The entropy of the model's output distribution is a key calibration signal:

$$H\big(\hat{p}(\cdot \mid x)\big) = -\sum_k \hat{p}(k \mid x)\log\hat{p}(k \mid x)$$
Predictive entropy decomposes into two components:
- Aleatoric uncertainty - inherent randomness in the data (irreducible)
- Epistemic uncertainty - uncertainty from lack of data or model misspecification (reducible)
Temperature scaling (Guo et al., 2017): The most effective post-hoc calibration method. Fit a single scalar $T$ such that $\mathrm{softmax}(z/T)$ is calibrated. $T > 1$ increases entropy (reduces overconfidence); $T < 1$ decreases entropy (sharpens predictions).
Entropy-based uncertainty detection:
- Out-of-distribution inputs tend to produce high-entropy output distributions (model is uncertain)
- In-distribution inputs the model knows well produce low-entropy outputs
- Threshold for anomaly detection
For LLM generation: Entropy of the output distribution at each step can signal when the model is about to hallucinate (high entropy) vs. when it is confident (low entropy). Token-level entropy is used in speculative decoding, early exit, and uncertainty-guided sampling.
10.4 Preview: Cross-Entropy, KL, Mutual Information, Fisher Information
This chapter (Section 09) develops four closely related concepts that each deserve a full section:
Cross-Entropy
Cross-entropy $H(p, q) = -\sum_x p(x)\log q(x)$ is the expected code length under code $q$ when the true distribution is $p$. It decomposes as $H(p, q) = H(p) + D_{\mathrm{KL}}(p\,\|\,q)$: the sum of the true entropy and the KL divergence. The cross-entropy loss in ML trains models to minimize the KL divergence between the data distribution and the model. Minimizing cross-entropy $\equiv$ minimizing KL divergence $\equiv$ maximum likelihood estimation.
-> Full treatment: Section 04 Cross-Entropy
KL Divergence
KL divergence $D_{\mathrm{KL}}(p\,\|\,q)$ measures how much $q$ fails to match $p$. It is non-negative ($D_{\mathrm{KL}}(p\,\|\,q) \ge 0$, with equality iff $p = q$), asymmetric ($D_{\mathrm{KL}}(p\,\|\,q) \ne D_{\mathrm{KL}}(q\,\|\,p)$ in general), and appears everywhere: VAE regularizer, knowledge distillation, RLHF KL penalty, variational inference.
-> Full treatment: Section 02 KL Divergence
Mutual Information
Mutual information $I(X;Y) = H(X) - H(X \mid Y)$ measures the statistical dependence between $X$ and $Y$ - how much knowing $Y$ reduces uncertainty about $X$. It is symmetric, non-negative, and zero iff $X \perp Y$. Key applications: InfoNCE contrastive loss (SimCLR, CLIP), information bottleneck in DNNs, feature selection.
-> Full treatment: Section 03 Mutual Information
Fisher Information
Fisher information $I(\theta)$ measures how much a random variable tells you about the parameter $\theta$ generating it. The Cramer-Rao bound says no unbiased estimator can have variance less than $1/I(\theta)$. The natural gradient in optimization uses the Fisher information matrix as a Riemannian metric, leading to K-FAC and Shampoo.
-> Full treatment: Section 05 Fisher Information
11. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using base 2 and base $e$ inconsistently in the same derivation | Mixing bases introduces a factor of $\ln 2$ that invalidates gradient computations | Always use natural log for analysis and derivatives; convert to bits only for reporting |
| 2 | Claiming $H(X) < 0$ for a discrete distribution | Discrete entropy is always $\ge 0$ | Check: $-p\log p \ge 0$ for $p \in [0, 1]$; negative entropy only arises for differential entropy |
| 3 | Omitting the convention $0\log 0 = 0$ | Outcomes with $p(x) = 0$ would contribute $0 \cdot (-\infty)$, which is undefined | Always apply the convention; outcomes with zero probability do not contribute to entropy |
| 4 | Confusing $H(Y \mid X)$ with $H(Y \mid X = x)$ | $H(Y \mid X)$ is the expectation of $H(Y \mid X = x)$ over $x$; they differ unless $H(Y \mid X = x)$ is constant in $x$ | Write $H(Y \mid X) = \sum_x p(x)\,H(Y \mid X = x)$ |
| 5 | Confusing cross-entropy with entropy | $H(p, q) \ne H(p)$ in general; they agree only when $q = p$ | Remember $H(p, q) = H(p) + D_{\mathrm{KL}}(p\,\|\,q)$ |
| 6 | Assuming "higher entropy = better model" | High-entropy output distributions mean uncertain predictions; a perfect model should have low entropy at high-frequency tokens | Perplexity = exponentiated per-token entropy; the model's entropy should match the data distribution's entropy |
| 7 | Thinking differential entropy is always positive | $h(X) = \ln(b - a)$ for $\mathrm{Uniform}(a, b)$ is negative when $b - a < 1$ | Differential entropy has no lower bound; it can be $-\infty$ for degenerate distributions |
| 8 | Interpreting differential entropy as a probability | $h(X)$ is not a probability; $f(x)$ can exceed 1 for densities | Differential entropy is a real number measuring uncertainty, not a probability |
| 9 | Confusing entropy with diversity | A distribution can have high entropy yet concentrate most of its mass near a few values if many low-probability values remain near-uniform | Entropy measures uncertainty, not diversity of interesting outputs; check the full distribution shape |
| 10 | Applying Shannon's source coding theorem to non-i.i.d. sources | The theorem applies to i.i.d. sequences; for correlated sources, the entropy rate (not $H(X_1)$) is the relevant bound | Use the entropy rate $H(\mathcal{X})$ for dependent sources |
| 11 | Confusing $H(X, Y)$ with $H(X)\,H(Y)$ | Joint entropy is not the product; entropy is additive (not multiplicative) for independent variables | For independent $X, Y$: $H(X, Y) = H(X) + H(Y)$ (sum, not product) |
| 12 | Using a single Renyi order for all purposes | Different Renyi orders emphasize different parts of the distribution; Shannon entropy ($\alpha = 1$) is the unique one satisfying the chain rule | Choose $\alpha = \infty$ for security, $\alpha = 1$ for compression/learning, $\alpha = 2$ for collision probability |
12. Exercises
Exercise 1 * - Binary Entropy Function and Distributions
(a) Compute in bits for: (i) a fair coin (), (ii) a biased coin (), (iii) a fair 6-sided die, (iv) the distribution .
(b) Plot the binary entropy function for . Verify and bit. Identify the unique maximum.
(c) A language model outputs a distribution over a vocabulary of tokens. What is the maximum possible entropy in nats? What is the entropy if the distribution is uniform? What if the top-1 token has probability 0.95 and all others are uniform?
Exercise 2 * - Maximum Entropy via Lagrange Multipliers
(a) Show that the uniform distribution maximizes subject to , .
(b) Find the maximum entropy distribution over subject to . Show this is the exponential distribution .
(c) Find the maximum entropy distribution over subject to and . Show this is .
Exercise 3 * - Joint and Conditional Entropy
(a) Given joint distribution with table: , , , . Compute , , , , .
(b) Verify the chain rule: .
(c) Compute the mutual information . Verify .
(d) Modify the joint distribution so that . Verify that and .
Exercise 4 ** - Entropy Rate of a Markov Chain
Consider a 3-state Markov chain with states and transition matrix:
(a) Find the stationary distribution satisfying and .
(b) Compute the entropy rate in nats.
(c) Compare to the entropy rate if all states were independent: . Is the Markov chain entropy rate higher or lower than the i.i.d. rate? Explain intuitively.
(d) Compute the perplexity . If this Markov chain were a language model, what would this perplexity tell you?
Exercise 5 ** - Shannon Source Coding and Huffman Code
(a) Given source distribution :
- Compute in bits.
- Construct the optimal Huffman code. Draw the code tree.
- Compute the average code length . Verify .
(b) Write a function huffman_encode(probs) that returns a dict mapping symbols to codewords. Verify on the distribution above.
(c) For a distribution where for all and (geometric): show that the Huffman code achieves exactly (a dyadic distribution).
Exercise 6 ** - Differential Entropy and Gaussian MaxEnt
(a) Compute the differential entropy for: (i) , (ii) , (iii) , (iv) , (v) .
(b) Show that for any scalar . What happens to entropy when you scale a Gaussian by ?
(c) Prove directly (without citing the MaxEnt theorem) that for any with using the non-negativity of where .
Exercise 7 *** - Renyi Entropy Family
(a) Implement a function renyi_entropy(p, alpha) that computes for and falls back to Shannon entropy for (handle the limit numerically).
(b) For the categorical distribution , compute for . Plot as a function of and verify the ordering .
(c) Compute the min-entropy directly. Verify it matches your renyi_entropy(p, alpha) as .
(d) What is the guessing advantage for this distribution? (The probability that an adversary guessing optimally in one shot succeeds.)
Exercise 8 *** - MaxEnt RL: Entropy Bonus in Policy Optimization
(a) Implement a tabular policy gradient for a simple 5-action bandit problem with reward vector . Train for 500 steps with learning rate .
(b) Add an entropy bonus: objective = for . Train each for 500 steps.
(c) Plot the policy entropy over training for each . Show that higher maintains higher entropy and finds the softmax-optimal policy rather than the greedy argmax.
(d) Verify that as , the MaxEnt RL solution approaches the greedy policy; as , it approaches the uniform policy.
13. Why This Matters for AI (2026 Perspective)
| Concept | AI Impact |
|---|---|
| Shannon entropy | Foundation of cross-entropy loss; used in training every LLM, classifier, and generative model |
| Perplexity = | Universal language model benchmark; lower perplexity = better compression = better model |
| Maximum entropy principle | Justifies softmax temperature, Gaussian priors, exponential family likelihoods in all Bayesian ML |
| Entropy rate of Markov chains | Connects LLM training loss to the fundamental information-theoretic limit of language compression |
| Source coding (Huffman, arithmetic) | LLM inference compression; range coding for GPU-efficient inference; Chinchilla scaling via compression |
| Conditional entropy | Minimum irreducible error in any supervised learning problem; Bayes error rate bound |
| Entropy regularization (MaxEnt RL) | SAC for robotics, game AI, RLHF exploration; prevents reward hacking; used in Gemini/Grok fine-tuning |
| Temperature scaling | Post-hoc calibration for classification; diverse vs. greedy decoding in LLMs; top- sampling |
| Renyi / min-entropy | Security of cryptographic tokens; private inference; differential privacy (Renyi DP) |
| Tsallis / sparse softmax | -softmax gives exact sparsity, used in sparse transformer attention variants |
| Differential entropy of Gaussian | VAE ELBO; diffusion model score matching; optimal quantization in quantization-aware training |
| Binary entropy function | VC dimension and generalization bounds; PAC-Bayes theory; Fano's inequality in lower bounds |
| Chain rule for entropy | Autoregressive factorization: every GPT/LLaMA/Claude uses this identity to factorize |
| Decision tree information gain | Gradient-boosted trees (XGBoost, LightGBM): the core split criterion for tabular ML |
14. Conceptual Bridge
Looking backward. Entropy builds directly on probability theory (Chapter 6) and statistics (Chapter 7). The fundamental input is a probability distribution over an alphabet - the core object of Chapter 6. The convention uses the expectation operator. The maximum entropy principle invokes Lagrange multipliers from optimization (Chapter 8). The source coding theorem connects to combinatorics and counting (Chapter 1). Everything in this section is built from probability theory and the logarithm.
Looking forward within Chapter 9. Entropy is the foundation from which the other Section 09 concepts branch:
- KL divergence (Section 02) $= H(p, q) - H(p)$ - the extra cost of using code $q$ when the truth is $p$; also proved non-negative from Jensen's inequality
- Mutual information (Section 03) $= H(X) - H(X \mid Y)$ - the reduction in $X$'s entropy from observing $Y$
- Cross-entropy (Section 04) $= H(p) + D_{\mathrm{KL}}(p\,\|\,q)$ - the training loss for classification and language modeling
- Fisher information (Section 05) = the local curvature of the KL divergence; a second-order entropy quantity
Looking forward to advanced topics. The entropy rate connects to ergodic theory and the Asymptotic Equipartition Property (AEP), which underpins all of lossless data compression. Renyi entropy connects to hypothesis testing and Renyi differential privacy. Differential entropy connects to rate-distortion theory (how much distortion is unavoidable at a given bit rate) and Gaussian channels (capacity $\tfrac{1}{2}\log(1 + P/N)$). These are all covered in advanced information theory courses (Cover & Thomas, 2006).
CURRICULUM POSITION
Chapter 6: Probability Theory
Distributions, expectations, independence
Chapter 7: Statistics
MLE, MAP, sufficient statistics, exponential family
Chapter 8: Optimization
Lagrange multipliers for MaxEnt derivations
Chapter 9: Information Theory
Section 01 ENTROPY You are here
Section 02 KL Divergence
(H(p,q) = H(p) + D_KL)
Section 03 Mutual Information
(I(X;Y) = H(X) - H(X|Y))
Section 04 Cross-Entropy
(training loss for all LLMs)
Section 05 Fisher Information
(geometry of the KL ball)
Chapter 10: Numerical Methods
Computing entropy stably (log-sum-exp trick),
perplexity computation, information geometry
The central unifying theme. Every quantity in information theory can be expressed in terms of entropy. KL divergence, mutual information, cross-entropy, Fisher information, channel capacity - all reduce to sums of terms. Mastering entropy means mastering the mathematical language that unifies probabilistic inference, data compression, optimal coding, and modern machine learning.
<- Back to Information Theory | Next: KL Divergence ->
Appendix A: Axiomatic Characterization of Entropy
Shannon (1948) derived entropy from a small set of natural axioms. Understanding these axioms illuminates why is the unique reasonable measure of uncertainty.
The Khinchin Axioms (1953 formulation):
Let be a measure of uncertainty for a distribution . We require:
-
Continuity. is continuous in all arguments.
-
Symmetry. is invariant under any permutation of : the labeling of outcomes does not matter.
-
Maximum. : the uniform distribution maximizes uncertainty.
-
Branching / Chainability. Adding a zero-probability outcome does not change uncertainty:
And a two-stage experiment is consistent with the joint:
Theorem (Khinchin 1953, Shannon 1948). The unique family of functions satisfying these axioms is:

$$H(p_1, \dots, p_n) = -c\sum_{i=1}^{n} p_i\log p_i$$

for some constant $c > 0$ (which sets the unit: $c = 1$ with base-2 logarithms gives bits).
Proof sketch. From Axiom 4 applied inductively, the uniform-distribution value $L(n) = H(1/n, \dots, 1/n)$ must satisfy:

$$L(mn) = L(m) + L(n)$$

(additivity) by the branching axiom applied to two uniform distributions. By continuity, the unique solution is $L(n) = c\log n$. Extending to rational probabilities by the branching axiom and then to all probabilities by continuity gives the unique formula $H(p) = -c\sum_i p_i\log p_i$.
Why this matters. The axiomatic derivation shows that there is no alternative: if you want a measure of uncertainty that is symmetric, continuous, maximal for uniform distributions, and consistent with hierarchical decomposition of outcomes, you must use Shannon entropy. Any other formula would violate at least one of these natural requirements.
The role of continuity. Without the continuity axiom, many pathological functions satisfy the remaining axioms. Continuity ensures that nearly-certain events have nearly-zero entropy - a tiny probability cannot create a large information contribution.
Variants. Dropping the symmetry axiom gives directed entropies (measuring asymmetric uncertainty in sequential decisions). Replacing the maximum axiom with a different normalization gives Renyi entropies. The Khinchin axioms are therefore the minimal set characterizing Shannon entropy specifically.
Appendix B: Asymptotic Equipartition Property (AEP)
The Asymptotic Equipartition Property is the information-theoretic analogue of the Law of Large Numbers. It is the mathematical foundation of all lossless compression and channel coding.
Theorem (AEP, Shannon 1948). Let $X_1, X_2, \dots$ be i.i.d. random variables with distribution $p$ and entropy $H(X)$. Then:

$$-\frac{1}{n}\log p(X_1, X_2, \dots, X_n) \;\longrightarrow\; H(X) \quad\text{in probability.}$$

Proof. By the law of large numbers applied to the i.i.d. terms $-\log p(X_i)$:

$$-\frac{1}{n}\log p(X_1, \dots, X_n) = \frac{1}{n}\sum_{i=1}^{n} -\log p(X_i) \;\longrightarrow\; \mathbb{E}[-\log p(X)] = H(X).$$

Interpretation. The AEP says that for large $n$, all long typical sequences have approximately the same probability: roughly $2^{-nH}$ (in bits). The set of "typical sequences" - those with probability close to $2^{-nH}$ - has total probability approaching 1.
Typical Set. Define the typical set $A_\epsilon^{(n)}$ as:

$$A_\epsilon^{(n)} = \Big\{(x_1, \dots, x_n) : \Big|-\tfrac{1}{n}\log p(x_1, \dots, x_n) - H(X)\Big| \le \epsilon\Big\}$$

Properties of the typical set:
- $P\big(A_\epsilon^{(n)}\big) > 1 - \epsilon$ for $n$ large enough
- $|A_\epsilon^{(n)}| \le 2^{n(H + \epsilon)}$ (not too many typical sequences)
- $|A_\epsilon^{(n)}| \ge (1 - \epsilon)\,2^{n(H - \epsilon)}$ (not too few)
Consequence for compression. To compress symbols losslessly:
- Assign short codewords to sequences in : need bits to index them
- Assign a flag + enumerate for atypical sequences: needed with vanishing probability
This achieves an average code rate of bits per symbol, converging to as , . This proves the achievability direction of Shannon's source coding theorem for i.i.d. sources.
AEP for Markov chains. The AEP generalizes to stationary ergodic processes:

$$-\frac{1}{n}\log p(X_1, \dots, X_n) \;\longrightarrow\; H(\mathcal{X}) \quad\text{almost surely,}$$

where $H(\mathcal{X})$ is the entropy rate. This is the Shannon-McMillan-Breiman theorem. For language models, it says that the per-token loss converges almost surely to the entropy rate of the true language distribution.
For AI - Training Loss Convergence. The training cross-entropy loss of an LLM:
converges (by the ergodic theorem) to - the cross-entropy rate. When the model is perfect, this equals the true entropy rate of the language, which is the theoretical minimum achievable loss. The AEP tells us that this convergence is almost sure, not just in probability.
Appendix C: Entropy and Channel Capacity
Shannon's information theory has two fundamental theorems: the source coding theorem (Appendix B and Section 6) and the channel coding theorem (noisy channel coding). While channel capacity is beyond the scope of this section (it is a mutual information concept, covered in Section 03), we briefly preview the connection to entropy.
Binary Symmetric Channel (BSC). A BSC with crossover probability $p$ flips each transmitted bit with probability $p$. The maximum rate of reliable communication (channel capacity) is:
$$C = 1 - H_2(p),$$
where $H_2(p) = -p \log_2 p - (1-p)\log_2(1-p)$ is the binary entropy function. This is the entropy of the output minus the entropy of the noise: you can reliably communicate $1 - H_2(p)$ bits per channel use even though the channel is noisy.
Entropy of noise limits communication. The intuition: the channel introduces $H_2(p)$ bits of noise entropy per use. Of the 1 bit per channel use "capacity," $H_2(p)$ bits are consumed by noise uncertainty, leaving $1 - H_2(p)$ for reliable information. At $p = 1/2$ (complete randomness): $C = 0$. At $p = 0$ or $p = 1$ (deterministic channel): $C = 1$.
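A small sketch of the capacity curve $C(p) = 1 - H_2(p)$ (the crossover values are illustrative):

```python
import numpy as np

def binary_entropy(p):
    """H_2(p) in bits, using the 0 log 0 = 0 convention."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for p in [0.0, 0.01, 0.1, 0.25, 0.5]:
    print(f"crossover p = {p:4.2f}   capacity C = {1 - binary_entropy(p):.3f} bits/use")
```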
Shannon's Channel Coding Theorem (preview). For any rate $R < C$, there exists a code of rate $R$ bits per channel use that achieves arbitrarily small error probability. Conversely, for any rate $R > C$, the error probability is bounded away from zero. The channel capacity $C$ is therefore a sharp threshold - another fundamental limit set by entropy.
Full treatment in: Cover & Thomas (2006), Elements of Information Theory, Chapter 7.
Appendix D: Entropy in Quantum Information
Quantum information theory generalizes Shannon entropy to quantum states.
Von Neumann entropy. For a density matrix $\rho$ (the quantum analogue of a probability distribution):
$$S(\rho) = -\operatorname{Tr}(\rho \log \rho) = -\sum_i \lambda_i \log \lambda_i,$$
where $\lambda_i$ are the eigenvalues of $\rho$. For a pure state ($\rho = |\psi\rangle\langle\psi|$, rank 1): $S(\rho) = 0$. For the maximally mixed state ($\rho = I/d$): $S(\rho) = \log d$.
Connection to Shannon entropy. Von Neumann entropy reduces to Shannon entropy when $\rho$ is diagonal: $\rho = \operatorname{diag}(p_1, \dots, p_d)$ gives $S(\rho) = -\sum_i p_i \log p_i = H(p)$.
Entanglement entropy. For a bipartite quantum system $AB$ in a pure state $|\psi\rangle_{AB}$, the entanglement entropy $S(\rho_A)$ (entropy of the reduced density matrix $\rho_A = \operatorname{Tr}_B |\psi\rangle\langle\psi|$ of subsystem $A$) measures quantum entanglement between $A$ and $B$. When $A$ and $B$ are maximally entangled, $S(\rho_A) = \log d$ where $d$ is the dimension of $A$. This is a purely quantum phenomenon with no classical analogue.
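A minimal NumPy check of these special cases - von Neumann entropy from the eigenvalues of a density matrix (the two states below are standard textbook examples, chosen here for illustration):

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -Tr(rho log2 rho), computed from the eigenvalues of rho."""
    lam = np.linalg.eigvalsh(rho)
    lam = lam[lam > 1e-12]                    # 0 log 0 = 0 convention
    return -np.sum(lam * np.log2(lam))

pure = np.array([[1.0, 0.0], [0.0, 0.0]])     # |0><0|, a pure state
mixed = np.eye(2) / 2                         # maximally mixed qubit, I/2
print(von_neumann_entropy(pure))              # 0.0
print(von_neumann_entropy(mixed))             # 1.0 bit = log2(2)
```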
For AI - Quantum ML. While quantum computing for ML remains largely theoretical (2026), the von Neumann entropy is used in:
- Analyzing the information geometry of quantum circuits
- Quantum kernel methods (quantum feature maps via density matrices)
- Understanding the entanglement structure of tensor network representations of neural networks
Appendix E: Entropy Estimation from Data
In practice, the distribution $p$ is unknown and must be estimated from a finite sample $x_1, \dots, x_n$.
Plug-in (naive) estimator. Use empirical frequencies $\hat{p}_i = n_i / n$, where $n_i$ counts occurrences of symbol $i$:
$$\hat{H}_{\text{plug-in}} = -\sum_i \hat{p}_i \log \hat{p}_i.$$
Bias. The naive estimator is biased downward:
$$\mathbb{E}[\hat{H}_{\text{plug-in}}] \le H(X).$$
The bias is approximately $-\frac{K - 1}{2n}$, where $K$ is the alphabet size (first-order correction from Miller & Madow, 1955). For large alphabets (e.g., an LLM-scale vocabulary) with small samples, this bias can be severe.
Miller-Madow correction. Add $\frac{\hat{K} - 1}{2n}$, where $\hat{K}$ is the number of distinct symbols observed:
$$\hat{H}_{\text{MM}} = \hat{H}_{\text{plug-in}} + \frac{\hat{K} - 1}{2n}.$$
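A quick simulation of the downward bias and its correction (a sketch; the alphabet size, prior, and sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 50
p = rng.dirichlet(np.ones(K))                       # illustrative true distribution
H_true = -np.sum(p * np.log(p))                     # true entropy in nats

for n in [20, 100, 1_000]:
    counts = np.bincount(rng.choice(K, size=n, p=p), minlength=K)
    phat = counts[counts > 0] / n
    H_plugin = -np.sum(phat * np.log(phat))         # biased downward
    H_mm = H_plugin + (np.count_nonzero(counts) - 1) / (2 * n)
    print(f"n={n:5d}   plug-in={H_plugin:.3f}   Miller-Madow={H_mm:.3f}   true={H_true:.3f}")
```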
Bayesian estimators. With a Dirichlet prior on the probability vector, the posterior mean entropy can be computed via the posterior predictive distribution. The NSB estimator (Nemenman-Shafee-Bialek, 2002) is a mixture of Dirichlet priors that gives near-optimal entropy estimates for undersampled alphabets.
Nonparametric and neural entropy estimation. For continuous random variables, entropy can be estimated via:
- k-nearest-neighbor estimators (Kozachenko-Leonenko, 1987): the estimate is built from $\frac{d}{n}\sum_i \log \epsilon_i$ plus known constants, where $\epsilon_i$ is the distance from sample $i$ to its $k$-th nearest neighbor and $d$ is the dimension
- MINE (Mutual Information Neural Estimation): variational lower bounds on mutual information via neural networks, which can be repurposed for entropy estimation
For LLM evaluation. Estimating the entropy rate of natural language from finite text corpora is done by fitting a language model and measuring its perplexity. The model's cross-entropy rate upper-bounds the true entropy rate $H(\mathcal{X})$; as the model improves, the bound tightens. Shannon's original estimate of English entropy used human prediction experiments, not language models.
Appendix F: Log-Sum-Exp and Numerical Stability
Computing entropy in practice requires stable numerical computation of .
The challenge. In log-space (when you have $\log p_i$ or logits $z_i$, not $p_i$ directly), computing entropy requires evaluating
$$H = -\sum_i p_i \log p_i = -\sum_i e^{\log p_i} \log p_i.$$
If any $\log p_i$ is very negative (small probability), $e^{\log p_i}$ can underflow to 0. If any logit $z_i$ is large (large unnormalized probability), $e^{z_i}$ can overflow in intermediate steps.
Numerical entropy computation in PyTorch:

```python
import torch
import torch.nn.functional as F

def stable_entropy(logits):
    """Compute entropy of softmax distribution from logits."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)  # H = -sum(p log p)
    return entropy

# Equivalent: F.cross_entropy(logits, probs) where probs = softmax(logits)
# This uses the log-sum-exp trick internally via log_softmax
```
Log-sum-exp trick. To compute $\log \sum_i e^{z_i}$ stably:
$$\log \sum_i e^{z_i} = m + \log \sum_i e^{z_i - m}, \qquad m = \max_i z_i.$$
The shifted sum has its largest term equal to 1 ($e^{m - m} = 1$), preventing both overflow (the largest term is 1) and underflow (the other terms lie in $[0, 1]$ and the sum is at least 1).
Entropy via log-sum-exp:
$$H = -\sum_i p_i \log p_i = \operatorname{LSE}(z) - \sum_i p_i z_i.$$
For neural network outputs (logits) $z$, the entropy of $p = \operatorname{softmax}(z)$ is computed stably using log-softmax:
$$\log p_i = z_i - \operatorname{LSE}(z), \qquad \operatorname{LSE}(z) = \log \sum_j e^{z_j},$$
where the computation is done entirely in log-space.
For AI - Perplexity Computation. Computing perplexity from a sequence of log-probabilities:

```python
def perplexity(log_probs):
    """log_probs: tensor of shape (T,) with log p(x_t | x_{<t})"""
    return torch.exp(-log_probs.mean())
```

This is numerically stable because averaging log-probabilities avoids the product $\prod_{t=1}^{T} p(x_t \mid x_{<t})$, which underflows exponentially in $T$.
Appendix G: Entropy in Coding Theory and Compression Algorithms
G.1 Entropy Coding in Practice
Modern lossless compression combines two components:
- Statistical model: estimates $p(x_t \mid x_{<t})$ - a probability model (fixed, adaptive, or neural)
- Entropy coder: converts probability estimates into bits - arithmetic coding or range coding
The theoretical compression rate is $H(X)$ bits per symbol. The gap between theory and practice comes from:
- Imperfect probability model: adds $D_{KL}(p \,\|\, q)$ bits per symbol
- Integer rounding in prefix codes: adds up to 1 bit per symbol (avoided by arithmetic coding)
- Block-wise vs. symbol-wise coding: arithmetic coding handles this
G.2 LLM-Based Compression
A striking recent result: large language models are powerful compressors. By using an LLM as the probability model in arithmetic coding:
- Chinchilla 70B achieves ~0.9 bits/byte on text (vs. ~2 bits/byte for gzip, ~1.8 bits/byte for zstd)
- This is near the estimated entropy rate of English (~0.6-1.0 bits/char = ~1-1.5 bits/byte counting spaces/punctuation)
Deletang et al. (2023), "Language Modeling is Compression." This paper establishes:
- A predictor (LLM) can be transformed into a lossless compressor (arithmetic coding on its outputs)
- A compressor can be transformed into a predictor (context tree weighting, Lempel-Ziv)
- Compression performance equals prediction performance: lower perplexity = higher compression ratio
Implication for scaling laws. Chinchilla scaling laws (Hoffmann et al., 2022) can be derived from information-theoretic principles: the optimal compute allocation follows from minimizing the cross-entropy loss subject to FLOPs and data constraints, which is equivalent to maximizing compression ratio at fixed compute budget.
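A minimal sketch of the prediction-compression equivalence: under a model $q$, the ideal arithmetic-coding length is $-\sum_t \log_2 q(x_t)$ bits, so lower cross-entropy means fewer bits. The byte string and the smoothed unigram model below are purely illustrative, not the paper's setup:

```python
import numpy as np
from collections import Counter

data = b"the quick brown fox jumps over the lazy dog " * 20

def bits_per_byte(model_probs, data):
    """Ideal (arithmetic-coding) code length under the model, in bits per byte."""
    return -sum(np.log2(model_probs[b]) for b in data) / len(data)

# No model: every byte equally likely -> 8 bits/byte.
uniform = {b: 1 / 256 for b in range(256)}
# Smoothed unigram byte model fit on the data itself (illustration only).
counts = Counter(data)
unigram = {b: (counts.get(b, 0) + 0.5) / (len(data) + 128) for b in range(256)}

print(bits_per_byte(uniform, data))   # 8.0
print(bits_per_byte(unigram, data))   # much lower: better prediction = better compression
```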
G.3 Brotli, Zstandard, and Neural Compression
Modern production compressors (Brotli in HTTP, Zstandard/zstd in databases) use:
- Context mixing: Combine multiple probability models via a meta-predictor (similar to ensemble learning)
- ANS (Asymmetric Numeral Systems): Modern replacement for arithmetic coding; same compression ratio, faster decode; used in Apple, Meta, NVIDIA compression stacks
Neural network-based learned compression (BPG, WebP2, neural image codecs):
- Train an encoder-decoder pair to minimize $D + \lambda R$, where $D$ is distortion and $R$ is the rate - the entropy of the compressed representation
- The entropy term is approximated by entropy coding the quantized latent with a learned entropy model; this is exactly Shannon entropy in the rate-distortion objective
Appendix H: Entropy in Cryptography and Privacy
Shannon entropy and secrecy. Shannon (1949) introduced the concept of perfect secrecy: a cipher is perfectly secret if $H(M \mid C) = H(M)$, i.e., the ciphertext $C$ is statistically independent of the plaintext $M$. The one-time pad achieves perfect secrecy: if the key is uniform and independent of the message, then $I(M; C) = 0$ - knowing the ciphertext tells you nothing about the message.
Entropy and passwords. Password strength is often measured in "bits of entropy" - but this means the log of the number of equally likely passwords in the password space, which is the Hartley entropy $H_0$ (the $\log_2$ of the support size), not Shannon entropy. A password scheme can have high Hartley entropy (many possible values) but low min-entropy $H_\infty$ (if certain patterns are much more common than others), and it is the min-entropy that bounds an attacker's best single-guess success probability.
Differential privacy and Renyi entropy. Differential privacy (DP) protects individual records in a dataset. The standard $(\epsilon, \delta)$-DP accounting has been largely replaced by:
- Renyi DP (Mironov 2017): a mechanism $M$ is $(\alpha, \epsilon)$-Renyi DP if $D_\alpha\big(M(D) \,\|\, M(D')\big) \le \epsilon$ for all adjacent datasets $D, D'$, where $D_\alpha$ is the Renyi divergence of order $\alpha$. Composing an $(\alpha, \epsilon_1)$- and an $(\alpha, \epsilon_2)$-Renyi-DP mechanism gives $(\alpha, \epsilon_1 + \epsilon_2)$-Renyi DP - much tighter than standard DP composition.
- Privacy amplification by sampling: random subsampling of the dataset amplifies privacy; the analysis uses Renyi divergence properties.
For LLM training with DP: DP-SGD (Abadi et al., 2016) clips and noises gradients per sample. The privacy accounting uses Renyi DP moments for tight composition. Understanding Renyi entropy is necessary to analyze the privacy guarantees of differentially private language model training.
Appendix I: Connection to Statistical Mechanics
Shannon did not derive entropy from scratch - he was informed by thermodynamic entropy. The connection is deep and precise.
Boltzmann entropy. For a physical system with $W$ equally likely microstates corresponding to a given macrostate:
$$S = k_B \ln W,$$
where $k_B$ is Boltzmann's constant. This is Hartley entropy (Shannon entropy with a uniform distribution) scaled by $k_B$.
Gibbs entropy. For a system with probability distribution $p_i$ over microstates:
$$S = -k_B \sum_i p_i \ln p_i.$$
This is exactly Shannon entropy scaled by $k_B$. Shannon was aware of Gibbs' work; John von Neumann reportedly told him to call the new quantity "entropy" because "nobody knows what entropy really is, so in a debate you will always have the advantage."
Partition function = normalizing constant. The partition function $Z$ in statistical mechanics is the same normalizing constant that appears in MaxEnt probability models (Section 5.2). Temperature in physics corresponds to the temperature parameter in softmax: high temperature = high entropy.
Free energy and ELBO. The Helmholtz free energy $F = U - TS$ is the variational objective in statistical mechanics. The ELBO in variational inference is:
$$\operatorname{ELBO}(q) = \mathbb{E}_{q(z \mid x)}[\log p(x, z)] + H\big(q(z \mid x)\big).$$
Setting $U = -\mathbb{E}_{q}[\log p(x, z)]$ (reconstruction energy) and $S = H(q)$ (encoder entropy), with $T = 1$:
$$\operatorname{ELBO} = -(U - S) = -F.$$
Maximizing the ELBO is equivalent to minimizing the variational free energy - the same variational principle as statistical mechanics. The VAE is literally learning the statistical mechanics of the data distribution.
Appendix J: Further Reading and References
Canonical Textbooks
- Cover & Thomas (2006) - Elements of Information Theory, 2nd ed. The definitive graduate textbook. Chapters 1-3 cover everything in this section rigorously. URL: https://www.wiley.com/en-us/Elements+of+Information+Theory%2C+2nd+Edition-p-9780471241959
- MacKay (2003) - Information Theory, Inference, and Learning Algorithms. Free online; excellent for ML practitioners. Chapters 1-8 cover entropy and coding. URL: http://www.inference.org.uk/itila/
- Shannon (1948) - A Mathematical Theory of Communication. The original paper. Remarkably readable.
- Jaynes (2003) - Probability Theory: The Logic of Science. The MaxEnt perspective; Bayesian interpretation of entropy.
Papers for AI/ML Context
- Haarnoja et al. (2018) - Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv:1801.01290.
- Deletang et al. (2023) - Language Modeling is Compression. ICLR 2024.
- Mironov (2017) - Renyi Differential Privacy. arXiv:1702.07476.
- Guo et al. (2017) - On Calibration of Modern Neural Networks. ICML 2017.
- Nemenman, Shafee & Bialek (2002) - Entropy and Inference, Revisited. NeurIPS 2001.
Online Resources
- MIT OpenCourseWare 6.441: Information Theory lectures - covers all material in this chapter
- Stanford EE376A: Information Theory - excellent problem sets on entropy and coding
Appendix K: Worked Examples and Computations
K.1 Entropy of English Text
Letter-level entropy of English. The distribution of English letters has been estimated from large corpora. Using Brown corpus frequencies:
| Letter $x$ | $p(x)$ | $-\log_2 p(x)$ | $-p(x)\log_2 p(x)$ |
|---|---|---|---|
| e | 0.1270 | 2.977 | 0.378 |
| t | 0.0906 | 3.464 | 0.314 |
| a | 0.0817 | 3.615 | 0.295 |
| ... | ... | ... | ... |
| z | 0.0007 | 10.48 | 0.007 |
The letter-level entropy of English (26 letters, case-insensitive) is approximately
$$H \approx 4.2 \text{ bits per letter.}$$
This is noticeably less than $\log_2 26 \approx 4.70$ bits (uniform), reflecting that some letters (e, t, a) are far more common than others (q, z, x).
Including spaces and punctuation: roughly 4.0-4.5 bits/character at the single-character level.
Word-level entropy: With a 10,000-word vocabulary, a rough estimate gives on the order of 10-12 bits/word.
Shannon's estimate of English entropy rate: Shannon (1951) estimated the entropy rate at approximately 1.0-1.3 bits/character using a human prediction experiment. More recent estimates with n-gram models give 1.5-2 bits/character; with modern LLMs the estimate decreases as models capture more long-range structure.
Perplexity of a letter-frequency model: $2^{H} \approx 2^{4.2} \approx 18$ - a model knowing only letter frequencies is as uncertain as a uniform distribution over about 18 letters.
LLM perplexity comparison. A GPT-4-class model on English text achieves a per-token cross-entropy of roughly 2.5 bits/token. This means the model is, on average, as uncertain as a uniform distribution over about 6 equally likely next tokens - vastly better than a letter-frequency model.
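A small computation in the spirit of the letter table (the frequencies below are rounded, illustrative values; grouping the remaining letters into "other" coarsens the distribution, so the result is a lower bound on the full 26-letter entropy):

```python
import numpy as np

freqs = {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070,
         "n": 0.067, "s": 0.063, "h": 0.061, "r": 0.060}
freqs["other"] = 1.0 - sum(freqs.values())          # remaining probability mass

p = np.array(list(freqs.values()))
H = -np.sum(p * np.log2(p))
print(f"entropy of the coarsened letter distribution: {H:.2f} bits")
print(f"uniform over 26 letters:                      {np.log2(26):.2f} bits")
```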
K.2 Entropy Computation for Joint Distributions
Example. Two correlated binary random variables $X$ (weather: sunny/rainy) and $Y$ (activity: indoor/outdoor):
| $p(x, y)$ | $Y = $ indoor | $Y = $ outdoor | $p(x)$ |
|---|---|---|---|
| $X = $ rainy | 0.40 | 0.05 | 0.45 |
| $X = $ sunny | 0.10 | 0.45 | 0.55 |
| $p(y)$ | 0.50 | 0.50 | 1.00 |
Computations (in nats):
- $H(X) = -(0.45 \ln 0.45 + 0.55 \ln 0.55) \approx 0.688$
- $H(Y) = -(0.5 \ln 0.5 + 0.5 \ln 0.5) = \ln 2 \approx 0.693$
- $H(X, Y) = -(0.40 \ln 0.40 + 0.05 \ln 0.05 + 0.10 \ln 0.10 + 0.45 \ln 0.45) \approx 1.106$
- $H(X \mid Y) = H(X, Y) - H(Y) \approx 0.413$
- $H(Y \mid X) = H(X, Y) - H(X) \approx 0.418$
- $I(X; Y) = H(X) + H(Y) - H(X, Y) \approx 0.275$
Verification:
- $H(X, Y) \approx 1.106 \le H(X) + H(Y) \approx 1.381$ PASS (subadditivity)
- $H(X \mid Y) \approx 0.413 \le H(X) \approx 0.688$ PASS (conditioning reduces entropy)
- $I(X; Y) \approx 0.275 \ge 0$ PASS (mutual information non-negative)
If $X \perp Y$: $H(X, Y) = H(X) + H(Y) \approx 1.381$ nats; the actual $H(X, Y) \approx 1.106$ nats confirms dependence.
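The same computation with NumPy, directly from the joint table (scipy's entr handles the $0 \log 0 = 0$ convention):

```python
import numpy as np
from scipy.special import entr

joint = np.array([[0.40, 0.05],    # rows: X = rainy, sunny
                  [0.10, 0.45]])   # cols: Y = indoor, outdoor

H_xy = entr(joint).sum()                 # H(X,Y)
H_x = entr(joint.sum(axis=1)).sum()      # H(X) from the row marginal
H_y = entr(joint.sum(axis=0)).sum()      # H(Y) from the column marginal
print(f"H(X)={H_x:.3f}  H(Y)={H_y:.3f}  H(X,Y)={H_xy:.3f}")
print(f"H(X|Y)={H_xy - H_y:.3f}  I(X;Y)={H_x + H_y - H_xy:.3f}")
```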
K.3 Computing Entropy Rate of a Markov Chain Numerically
Two-state chain. States: $\{0, 1\}$. Transition matrix:
$$P = \begin{pmatrix} 0.8 & 0.2 \\ 0.3 & 0.7 \end{pmatrix}$$
Step 1: Stationary distribution. Solve $\mu P = \mu$ with $\mu_0 + \mu_1 = 1$:
- Solution: $\mu_0 = 0.6$, $\mu_1 = 0.4$
Step 2: Per-state conditional entropies:
- $H(P_{0\,\cdot}) = -(0.8 \ln 0.8 + 0.2 \ln 0.2) \approx 0.500$ nats
- $H(P_{1\,\cdot}) = -(0.7 \ln 0.7 + 0.3 \ln 0.3) \approx 0.611$ nats
Step 3: Entropy rate:
$$H(\mathcal{X}) = \sum_i \mu_i\, H(P_{i\,\cdot}) = 0.6 \times 0.500 + 0.4 \times 0.611 \approx 0.545 \text{ nats}$$
Comparison to i.i.d.: If each state were drawn independently from $\mu = (0.6, 0.4)$:
$$H(\mu) = -(0.6 \ln 0.6 + 0.4 \ln 0.4) \approx 0.673 \text{ nats}$$
The Markov chain has the lower entropy rate ($0.545 < 0.673$) because there is temporal structure: state 0 tends to follow state 0, state 1 tends to follow state 1.
Perplexity: $e^{0.545} \approx 1.72$ - the Markov chain is approximately as uncertain as a uniform distribution over 1.72 symbols, a very predictable process.
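A quick NumPy check of the three steps above, using the same transition matrix:

```python
import numpy as np
from scipy.special import entr

P = np.array([[0.8, 0.2],
              [0.3, 0.7]])

# Step 1: stationary distribution = left eigenvector of P with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
mu = mu / mu.sum()                                         # approx (0.6, 0.4)

# Steps 2-3: entropy rate = sum_i mu_i * H(P[i, :]).
rate = sum(mu[i] * entr(P[i]).sum() for i in range(len(mu)))
print(f"entropy rate = {rate:.3f} nats, perplexity = {np.exp(rate):.2f}")
```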
K.4 Maximum Entropy Distributions: A Summary Table
| Support | Constraints | MaxEnt distribution | Entropy |
|---|---|---|---|
| $\{1, \dots, n\}$ (discrete) | None | Uniform | $\log n$ |
| $\{0, 1\}$ (binary) | $\mathbb{E}[X] = p$ | Bernoulli($p$) | $H_2(p)$ |
| $\{0, 1, 2, \dots\}$ (integers) | $\mathbb{E}[X] = \mu$ | Geometric (modified) | - |
| $[0, \infty)$ (non-negative reals) | $\mathbb{E}[X] = \mu$ | Exponential($1/\mu$) | $1 + \ln \mu$ |
| $[a, b]$ (bounded interval) | None | Uniform$[a, b]$ | $\log(b - a)$ |
| $(-\infty, \infty)$ (all reals) | $\mathbb{E}[X] = \mu$, $\operatorname{Var}(X) = \sigma^2$ | $\mathcal{N}(\mu, \sigma^2)$ | $\tfrac{1}{2}\log(2\pi e \sigma^2)$ |
| $\mathbb{R}^d$ (multivariate) | mean $\mu$, covariance $\Sigma$ | $\mathcal{N}(\mu, \Sigma)$ | $\tfrac{1}{2}\log\det(2\pi e \Sigma)$ |
| $(-\infty, \infty)$ with $\mathbb{E}[\lvert X \rvert] = b$ | Sign symmetric | Laplace$(0, b)$ | $1 + \ln(2b)$ |
This table is a fundamental reference: the Gaussian arises from constraining only the first two moments, nothing more. Assuming any other continuous distribution under a variance constraint is asserting additional structure not supported by the constraints.
Appendix L: Entropy Regularization: Mathematical Details
L.1 Mirror Descent and Entropy
Regularized optimization. Consider minimizing $f(p)$ subject to $p \in \Delta_n$ (the probability simplex). Adding entropy regularization:
$$\min_{p \in \Delta_n} \; f(p) - \tau H(p).$$
The $-\tau H(p)$ term penalizes distributions far from uniform. This is equivalent to the negative entropy regularizer in mirror descent.
Mirror descent update. The mirror descent algorithm with the entropy mirror map gives the update:
$$p_{t+1}(i) \;\propto\; p_t(i)\, \exp\!\big(-\eta\, \nabla_i f(p_t)\big).$$
This is exactly the multiplicative weights algorithm - the natural "gradient descent" on the simplex with negative entropy as the Bregman potential.
Softmax as MaxEnt regularization. The softmax attention mechanism:
$$p_i = \operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
is the unique distribution that maximizes the entropy-regularized linear objective $\langle z, p \rangle + H(p)$ (expected score plus entropy) over the simplex. Without entropy regularization, the optimal distribution would be the argmax (hardmax); the entropy term smooths this to softmax.
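A quick numerical check of this claim (the scores are arbitrary): the softmax distribution attains a higher entropy-regularized objective than random points on the simplex.

```python
import numpy as np
from scipy.special import softmax, entr

rng = np.random.default_rng(0)
z = np.array([2.0, 0.5, -1.0, 1.0])               # arbitrary scores

def objective(p):
    return p @ z + entr(p).sum()                   # <z, p> + H(p)

p_star = softmax(z)
best_random = max(objective(rng.dirichlet(np.ones(4))) for _ in range(10_000))
print(objective(p_star), ">=", best_random)        # softmax wins (up to sampling noise)
```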
L.2 The Information Bottleneck
The Information Bottleneck (Tishby, Pereira & Bialek, 1999) provides an information-theoretic framework for representation learning:
Find a compressed representation $T$ of the input $X$ that:
- Minimizes mutual information $I(X; T)$ (compress $X$, reduce information stored in $T$)
- Maximizes mutual information $I(T; Y)$ (preserve information about the label $Y$)
The IB objective is $\min_{p(t \mid x)} \; I(X; T) - \beta\, I(T; Y)$. The parameter $\beta$ trades off compression vs. prediction. At $\beta \to 0$: compress maximally (throw away all information). At $\beta \to \infty$: preserve all information about $Y$ (no compression).
Connection to entropy. The IB objective is a function of entropy quantities:
$$I(X; T) = H(T) - H(T \mid X), \qquad I(T; Y) = H(T) - H(T \mid Y).$$
The IB Lagrangian can be solved analytically when $(X, Y)$ are jointly Gaussian: a noisy linear projection of $X$ is optimal for Gaussian data (the Gaussian Information Bottleneck).
Deep learning as IB. Shwartz-Ziv & Tishby (2017) proposed that deep neural networks implicitly solve the IB: an early fitting phase of training increases $I(T; Y)$, while a later compression phase decreases $I(X; T)$ (compression/generalization). This remains an active research area.
L.3 Entropy and Generalization in PAC Learning
The PAC (Probably Approximately Correct) framework uses entropy to bound generalization:
Occam's Razor / PAC-Bayes Bound. For a hypothesis class $\mathcal{H}$, a prior $P$ over hypotheses, and any learned (posterior) distribution $Q$ over hypotheses, with probability at least $1 - \delta$ over the draw of $m$ training samples (in one standard form):
$$\mathbb{E}_{h \sim Q}\big[\operatorname{err}(h)\big] \;\le\; \mathbb{E}_{h \sim Q}\big[\widehat{\operatorname{err}}(h)\big] + \sqrt{\frac{D_{KL}(Q \,\|\, P) + \ln(m/\delta)}{2(m-1)}},$$
where $\widehat{\operatorname{err}}$ is the empirical (training) error. This is the PAC-Bayes bound (McAllester 1999), and the KL divergence term penalizes complexity of the learned distribution relative to the prior.
Mutual information bound (Xu & Raginsky 2017). For a $\sigma$-sub-Gaussian loss, the expected generalization gap is bounded by:
$$\big|\mathbb{E}[\operatorname{gen}(W, S)]\big| \;\le\; \sqrt{\frac{2\sigma^2}{n}\, I(W; S)},$$
where $W$ are the model weights, $S$ is the training dataset of $n$ samples, and $I(W; S)$ is their mutual information. Mutual information measures how much the weights "remember" the training data - a form of entropy-based overfitting measure.
These connections show that entropy and information theory provide rigorous foundations for understanding why deep learning generalizes - not just heuristic explanations.
Appendix M: Entropy in Generative Models
M.1 Diffusion Models and Entropy
Denoising diffusion models (Ho et al., 2020; Song et al., 2020) define a forward process that gradually adds Gaussian noise:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\big).$$
The forward process converges to $\mathcal{N}(0, I)$ - the maximum differential entropy distribution under a unit-variance constraint. The total entropy increases monotonically along the forward process:
$$h(x_0) \le h(x_1) \le \dots \le h(x_T) \approx \frac{d}{2}\log(2\pi e),$$
where $d$ is the dimension. The reverse process (denoising) decreases entropy back toward $h(x_0)$.
ELBO for diffusion. The training objective is a variational lower bound on $\log p_\theta(x_0)$ that can be written as a sum of KL divergences:
$$\mathcal{L} = \mathbb{E}_q\!\left[ D_{KL}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big) + \sum_{t > 1} D_{KL}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1) \right].$$
Each KL divergence measures the gap between the true reverse step and the learned reverse step.
M.2 Normalizing Flows and Entropy
Normalizing flows learn a bijection $f: z \mapsto x$ from a simple base distribution $p_Z(z)$ to the data distribution $p_X(x)$. The change-of-variables formula gives:
$$\log p_X(x) = \log p_Z\big(f^{-1}(x)\big) + \log\left|\det J_{f^{-1}}(x)\right|,$$
where $J_{f^{-1}}$ is the Jacobian of $f^{-1}$. The Jacobian term adds or removes entropy according to how much the flow expands or contracts volumes: $h(X) = h(Z) + \mathbb{E}\big[\log\lvert\det J_f(Z)\rvert\big]$. Training maximizes $\sum_i \log p_X(x_i)$ - a combination of the base log-density and the volume scaling.
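A one-dimensional sketch of this relation for an affine flow $x = az + b$ (the parameters are arbitrary): the Monte Carlo entropy of $X$ matches $h(Z) + \log|a|$.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 3.0, 1.0
z = rng.standard_normal(200_000)
x = a * z + b                                      # affine "flow"

h_z = 0.5 * np.log(2 * np.pi * np.e)               # differential entropy of N(0,1) in nats
# Change-of-variables density: log p_X(x) = log p_Z((x - b)/a) - log|a|.
log_px = -0.5 * ((x - b) / a) ** 2 - 0.5 * np.log(2 * np.pi) - np.log(abs(a))
print(h_z + np.log(abs(a)))                        # theory: h(X) = h(Z) + log|a|
print(-log_px.mean())                              # Monte Carlo estimate, approximately equal
```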
M.3 GANs and Entropy
Generative Adversarial Networks (Goodfellow et al., 2014) train a generator $G$ and a discriminator $D$. At the optimal discriminator, the generator minimizes the Jensen-Shannon divergence:
$$\operatorname{JSD}(p_{\text{data}} \,\|\, p_g) = \tfrac{1}{2} D_{KL}\!\left(p_{\text{data}} \,\Big\|\, \tfrac{p_{\text{data}} + p_g}{2}\right) + \tfrac{1}{2} D_{KL}\!\left(p_g \,\Big\|\, \tfrac{p_{\text{data}} + p_g}{2}\right).$$
The JSD is a symmetrized, bounded version of KL divergence. It equals zero iff $p_g = p_{\text{data}}$.
Mode collapse = entropy collapse. A GAN suffering from mode collapse has a generator with low entropy: $H(p_g) \ll H(p_{\text{data}})$. The generator produces only a small number of distinct outputs - it has learned a low-entropy distribution. Mitigations include entropy-aware regularization of the generator (minibatch discrimination, Unrolled GAN, etc.).
Appendix N: Han's Inequality and Shearer's Lemma
N.1 Han's Inequality
Theorem (Han's Inequality, 1978). For jointly distributed random variables $X_1, \dots, X_n$:
$$H(X_1, \dots, X_n) \;\le\; \frac{1}{n-1} \sum_{i=1}^{n} H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n).$$
In words: the joint entropy is at most $\frac{1}{n-1}$ times the sum of the $n$ leave-one-out joint entropies.
Proof sketch. Write $H(X_1, \dots, X_n) = H(X_{\setminus i}) + H(X_i \mid X_{\setminus i})$, where $X_{\setminus i}$ denotes all variables except $X_i$, and note that $H(X_i \mid X_{\setminus i}) \le H(X_i \mid X_1, \dots, X_{i-1})$. Summing over $i$ and using the chain rule $\sum_i H(X_i \mid X_1, \dots, X_{i-1}) = H(X_1, \dots, X_n)$ gives $n\,H(X_1, \dots, X_n) \le \sum_i H(X_{\setminus i}) + H(X_1, \dots, X_n)$, which rearranges to the result.
For ML. Han's inequality is closely tied to the submodularity of entropy: the set function $F(S) = H(X_S)$ (where $S$ is a subset of the variables) is submodular:
$$H(X_{S \cup \{i\}}) - H(X_S) \;\ge\; H(X_{T \cup \{i\}}) - H(X_T) \quad \text{for all } S \subseteq T \text{ and } i \notin T.$$
Submodularity is exploited in feature selection algorithms (greedy feature selection maximizing information gain is approximately optimal due to submodularity), sensor placement, and document summarization.
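A brute-force check of Han's inequality on a random joint distribution of three binary variables (a minimal sketch):

```python
import numpy as np
from scipy.special import entr

rng = np.random.default_rng(0)
joint = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)       # random p(x1, x2, x3)

def H(p):
    return entr(p).sum()

H_full = H(joint)
leave_one_out = [H(joint.sum(axis=i)) for i in range(3)]  # two-variable marginals
print(H_full, "<=", sum(leave_one_out) / (3 - 1))          # Han's inequality holds
```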
N.2 Shearer's Lemma
Lemma (Shearer 1986). Let $\mathcal{S} = \{S_1, \dots, S_m\}$ be a collection of subsets of $\{1, \dots, n\}$ such that each index $i$ appears in at least $k$ of the sets in $\mathcal{S}$. Then:
$$k \cdot H(X_1, \dots, X_n) \;\le\; \sum_{j=1}^{m} H(X_{S_j}).$$
Shearer's lemma is a generalization of subadditivity (take the $S_j$ to be the singletons with $k = 1$) and is used in combinatorics to prove counting inequalities. In information theory, it provides tight bounds on joint entropy from the entropies of subsets of variables.
For ML - Federated Learning. In federated learning with clients, each holding data subset , Shearer's lemma bounds the amount of information the combined model can learn from the distributed data relative to what each client knows individually.
Appendix O: Summary of Key Entropy Inequalities
```
ENTROPY INEQUALITY REFERENCE

DISCRETE ENTROPY:
  Non-negativity / range:  0 <= H(X) <= log |X|
  MaxEnt:                  H(X) = log n  iff  X ~ Uniform over n outcomes
  Conditioning:            H(X|Y) <= H(X), equality iff X ⊥ Y
  Subadditivity:           H(X,Y) <= H(X) + H(Y), equality iff X ⊥ Y
  Chain rule:              H(X,Y) = H(X) + H(Y|X)
  Chain rule (n vars):     H(X1,...,Xn) = Σ_i H(Xi | X1,...,X_{i-1})
  Data processing:         H(f(X)) <= H(X)  (equality if f is a bijection)
  Han's inequality:        H(X1,...,Xn) <= (1/(n-1)) Σ_i H(X1,...,X_{i-1},X_{i+1},...,Xn)

DIFFERENTIAL ENTROPY:
  Scaling:                 h(aX) = h(X) + log|a|
  Gaussian MaxEnt:         h(X) <= (1/2) log(2πe σ²)  given Var(X) = σ²
  Multivariate MaxEnt:     h(X) <= (1/2) log det(2πe Σ)  given Cov(X) = Σ
  Conditioning:            h(X|Y) <= h(X)  (differential version)

RENYI ENTROPY:
  Ordering:                H_0(X) >= H_1(X) >= H_2(X) >= ... >= H_inf(X)
  Shannon limit:           lim_{α→1} H_α(X) = H(X)
  Min-entropy:             H_inf(X) = -log2 max_x p(x)
  Guessing:                P(best single guess of X is correct) = 2^{-H_inf(X)}

KL DIVERGENCE (preview):
  Non-negativity:          D_KL(p||q) >= 0, equality iff p = q
  Cross-entropy:           H(p,q) = H(p) + D_KL(p||q) >= H(p)
  MaxEnt via KL:           H(X) = log n - D_KL(p || Uniform)  (so H(X) <= log n)
```
This inequality table is a key reference for the entire information theory chapter. Every inequality here will be used in Section 02 (KL divergence derivations), Section 03 (mutual information bounds), Section 04 (cross-entropy loss analysis), and Section 05 (Fisher information and Cramer-Rao).
Appendix P: Implementation Recipes
P.1 Entropy Computation - Production Patterns
```python
import numpy as np
from scipy.special import entr, rel_entr

# Pattern 1: From probability vector (numerically stable via scipy)
def entropy_from_probs(p):
    """H(X) in nats. entr(p) = -p*log(p) with 0*log(0)=0 convention."""
    return np.sum(entr(p))

# Pattern 2: From logits (e.g., neural network outputs)
def entropy_from_logits(logits, temperature=1.0):
    """H(softmax(logits/T)) in nats. Numerically stable."""
    logits = logits / temperature
    log_Z = np.log(np.sum(np.exp(logits - logits.max()))) + logits.max()
    log_probs = logits - log_Z            # log softmax
    probs = np.exp(log_probs)
    return -np.sum(probs * log_probs)

# Pattern 3: Conditional entropy H(X|Y) from joint table
def conditional_entropy(joint):
    """joint[i,j] = p(X=i, Y=j). Returns H(X|Y) in nats."""
    py = joint.sum(axis=0)                # p(Y)
    hxy = np.sum(entr(joint))             # H(X,Y)
    hy = np.sum(entr(py))                 # H(Y)
    return hxy - hy                       # chain rule

# Pattern 4: Entropy rate of Markov chain
def markov_entropy_rate(P):
    """P: transition matrix. Returns entropy rate in nats."""
    n = P.shape[0]
    # Stationary distribution via eigenvector
    eigvals, eigvecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(eigvals - 1.0))
    mu = np.real(eigvecs[:, idx])
    mu = mu / mu.sum()
    # Entropy rate = sum_i mu_i * H(P[i,:])
    return np.sum(mu * np.array([np.sum(entr(P[i, :])) for i in range(n)]))

# Pattern 5: Perplexity from log-probabilities
def perplexity(log_probs):
    """log_probs: array of log p(x_t | x_{<t}). Returns PPL."""
    return np.exp(-np.mean(log_probs))

# Pattern 6: Renyi entropy
def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha in nats."""
    if np.isclose(alpha, 1.0):
        return np.sum(entr(p))            # Shannon limit
    if np.isinf(alpha):
        return -np.log(np.max(p))         # Min-entropy
    if np.isclose(alpha, 0.0):
        return np.log(np.sum(p > 0))      # Hartley
    return (1.0 / (1.0 - alpha)) * np.log(np.sum(p ** alpha))
```
P.2 PyTorch One-Liners
```python
import torch
import torch.nn.functional as F

# Entropy from logits (vectorized over batch)
def batch_entropy(logits, dim=-1):
    """Returns entropy H(softmax(logits)) for each sample. Shape: [batch]"""
    log_p = F.log_softmax(logits, dim=dim)
    p = log_p.exp()
    return -(p * log_p).sum(dim=dim)

# Cross-entropy loss (= H(p_data, p_model))
loss = F.cross_entropy(logits, targets)            # = -log p_model(y_true)

# KL divergence: D_KL(p || q) where p, q are probability vectors
kl = F.kl_div(q.log(), p, reduction='batchmean')   # Note: F.kl_div(log_q, p)

# Per-token perplexity
log_probs = F.log_softmax(logits, dim=-1)
token_nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
ppl = token_nll.mean().exp()
```
P.3 Common Pitfalls in Code
```python
# WRONG: Does not handle p=0 (nan from log(0))
H_wrong = -np.sum(p * np.log(p))             # nan when p[i] == 0

# CORRECT: Use scipy.special.entr or add epsilon
from scipy.special import entr
H_correct = np.sum(entr(p))                  # handles p=0 -> 0

# WRONG: Mixing log bases
H_bits = -np.sum(p * np.log2(p))             # bits
H_nats = -np.sum(p * np.log(p))              # nats
# Cross-entropy with mixed bases will give wrong numerical values

# WRONG: Computing PPL from probability products (underflows)
ppl_wrong = np.prod(probs) ** (-1.0 / T)     # underflows for T > ~100

# CORRECT: Compute in log space
ppl_correct = np.exp(-np.mean(np.log(probs)))  # stable
```
Appendix Q: Entropy in Transformer Attention
Q.1 Attention Entropy as a Diagnostic
Each attention head in a transformer computes attention weights over the $T$ key positions:
$$\alpha_{ij} = \operatorname{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right), \qquad \text{output}_i = \sum_{j} \alpha_{ij}\, v_j.$$
The attention entropy for query position $i$ in head $h$ is:
$$H_i^{(h)} = -\sum_{j=1}^{T} \alpha_{ij} \log \alpha_{ij}.$$
This ranges from $0$ (attending to exactly one position - argmax attention) to $\log T$ (attending uniformly to all $T$ positions).
Attention entropy diagnostics:
- Low entropy heads (sharp, local attention): typically handle syntactic dependencies (subject-verb agreement, coreference)
- High entropy heads (diffuse, global attention): often handle semantic averaging or positional patterns
- Entropy collapse: when fine-tuning collapses all heads to very low entropy, the model loses diversity and may overfit
- Dead heads: entropy near $\log T$ (uniform) indicates a head that does not attend meaningfully; such heads can often be pruned safely
For interpretability. Mechanistic interpretability studies (Olsson et al., 2022 - "induction heads"; Elhage et al., 2021 - "Mathematical Framework for Transformer Circuits") characterize attention heads by their entropy patterns. High-entropy heads tend to be backup circuits; low-entropy heads implement specific algorithms.
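A minimal PyTorch sketch of the attention-entropy diagnostic; the (batch, heads, queries, keys) layout of the weight tensor is an assumption about how the attention maps are stored:

```python
import torch

def attention_entropy(attn):
    """attn: attention weights of shape (batch, heads, queries, keys), rows summing to 1.
    Returns per-query entropies of shape (batch, heads, queries), in nats."""
    return -(attn * torch.log(attn.clamp_min(1e-12))).sum(dim=-1)

attn = torch.softmax(torch.randn(1, 8, 16, 16), dim=-1)   # fake attention maps
H = attention_entropy(attn)
print(H.mean(dim=(0, 2)))    # mean entropy per head; the maximum possible is log(16) ~ 2.77
```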
Q.2 Temperature and Entropy in Decoding
During LLM generation, several common sampling strategies adjust the output entropy:
| Strategy | Formula | Effect on entropy |
|---|---|---|
| Greedy | $x_t = \arg\max_i p_i$ | $H = 0$ (deterministic) |
| Temperature $T$ | $p_i \propto \exp(z_i / T)$ | Higher $T$ -> higher entropy |
| Top-$k$ | Restrict to the top-$k$ tokens, renormalize | $H \le \log k$ |
| Top-$p$ (nucleus) | Restrict to the smallest set with cumulative probability $\ge p$ | Varies with context |
| Min-$p$ | Restrict to tokens with $p_i \ge p_{\min} \cdot \max_j p_j$ | Adaptive entropy bound |
The key insight: all sampling strategies are entropy control mechanisms. They interpolate between $H = 0$ (deterministic, no creativity) and the full entropy $H(p)$ of the model's output distribution (unrestricted sampling). Good generation requires entropy calibration: enough uncertainty to be interesting, not so much as to be incoherent.
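A short sketch of this entropy-control view: the entropy of the sampling distribution as a function of temperature and top-$k$ truncation (the logits are random, for illustration only):

```python
import torch
import torch.nn.functional as F

def entropy(logits):
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(-1)

logits = torch.randn(50_000)                 # fake next-token logits over a large vocabulary
for T in [0.1, 0.7, 1.0, 1.5]:
    print(f"T={T}:  H = {entropy(logits / T):.3f} nats")

k = 50                                       # top-k truncation bounds the entropy by log k
print(f"top-{k}:  H = {entropy(logits.topk(k).values):.3f}  <=  log k = {torch.log(torch.tensor(float(k))):.3f}")
```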
Entropy-adaptive sampling. Recent work (Entropix, 2024) uses the token-level entropy and the varentropy (the variance of the surprisal $-\log p$ under the output distribution) to dynamically adjust the sampling strategy:
- Low entropy + low varentropy -> greedy (model is confident)
- High entropy + high varentropy -> sample with high temperature (genuine uncertainty)
- Low entropy + high varentropy -> sample carefully (multiple plausible branches)
Appendix R: Varentropy and Higher-Order Entropy Statistics
Varentropy (the variance of the surprisal) is a recently popularized concept in LLM decoding:
$$V(X) = \operatorname{Var}\big[-\log p(X)\big] = \sum_x p(x)\,\big(\log p(x) + H(X)\big)^2.$$
While $H(X)$ measures average surprise, varentropy measures the spread of the surprise distribution. A high-entropy, low-varentropy distribution is one where all outcomes have nearly equal probability (uniform-like). A high-entropy, high-varentropy distribution has a few very likely outcomes and many very unlikely ones - a different shape that calls for different sampling strategies.
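A minimal computation of entropy and varentropy from logits (a PyTorch sketch; the two example logit vectors are illustrative):

```python
import torch
import torch.nn.functional as F

def entropy_varentropy(logits):
    """Mean and variance of the surprisal -log p(X) under p = softmax(logits), in nats."""
    log_p = F.log_softmax(logits, dim=-1)
    p = log_p.exp()
    H = -(p * log_p).sum(-1)
    V = (p * (log_p + H.unsqueeze(-1)) ** 2).sum(-1)
    return H, V

vocab = 1000
uniform = torch.zeros(vocab)          # all tokens equally likely: high entropy, zero varentropy
spiky = torch.zeros(vocab)
spiky[0] = 8.0                        # one likely token plus a long tail of unlikely ones
print(entropy_varentropy(uniform))    # roughly (log 1000 ~ 6.9, 0)
print(entropy_varentropy(spiky))      # lower entropy, much higher varentropy
```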
Third and fourth moments. The full distribution of the surprisal $-\log p(X)$ (called the "information spectrum" or "entropy density") characterizes how information is distributed across outcomes. The skewness and kurtosis of this distribution are used in information spectrum analysis and Renyi entropy comparisons.
For LLMs. The entropy and varentropy of the output distribution at each generation step provide a 2D signal - the entropy-varentropy decision space:
|  | Low entropy | High entropy |
|---|---|---|
| High varentropy | Uncertain (branching) -> careful sampling | Noisy -> sample broadly / high temperature |
| Low varentropy | Confident (clear choice) -> greedy / low temperature | Uniform (stuck) -> restart / probe |
Entropix sampling (Xjdr et al., 2024) implements entropy-adaptive sampling using both metrics. It represents an application of higher-order information theory to practical LLM inference quality improvement.
Appendix S: Notation Quick Reference
| Symbol | Meaning | Definition |
|---|---|---|
| $I(x)$ | Self-information | $-\log p(x)$ |
| $H(X)$ | Shannon entropy | $-\sum_x p(x) \log p(x)$ |
| $H(p)$ | Entropy of distribution $p$ | Same as $H(X)$, emphasis on the distribution |
| $H_2(p)$ | Binary entropy function | $-p \log_2 p - (1-p)\log_2(1-p)$ |
| $H(X, Y)$ | Joint entropy | $-\sum_{x,y} p(x,y) \log p(x,y)$ |
| $H(X \mid Y)$ | Conditional entropy | $H(X, Y) - H(Y)$ |
| $H(\mathcal{X})$ | Entropy rate | $\lim_{n\to\infty} \frac{1}{n} H(X_1, \dots, X_n)$ |
| $h(X)$ | Differential entropy | $-\int f(x) \log f(x)\, dx$ |
| $H_\alpha(X)$ | Renyi entropy | $\frac{1}{1-\alpha} \log \sum_x p(x)^\alpha$ |
| $H_\infty(X)$ | Min-entropy | $-\log \max_x p(x)$ |
| $T_q(X)$ | Tsallis entropy | $\frac{1}{q-1}\big(1 - \sum_x p(x)^q\big)$ |
| $\operatorname{PPL}$ | Perplexity | $2^{H}$ (bits) or $e^{H}$ (nats); the exponential of the per-token cross-entropy |
| $I(X; Y)$ | Mutual information | $H(X) - H(X \mid Y)$ (preview: Section 03) |
| $D_{KL}(p \,\|\, q)$ | KL divergence | $\sum_x p(x) \log \frac{p(x)}{q(x)}$ (preview: Section 02) |
| $H(p, q)$ | Cross-entropy | $-\sum_x p(x) \log q(x)$ (preview: Section 04) |