Cross-Entropy

"The right question is not whether a model predicts the correct label once, but whether it assigns the right probabilities every time."

Overview

Cross-entropy is one of the rare mathematical quantities that is simultaneously an information measure, a coding cost, a statistical objective, and a practical training loss. In information theory it measures the average number of bits or nats required when data generated by a true distribution p is encoded using a code optimized for some other distribution q. In machine learning the same quantity appears as the expected negative log-likelihood, which is why it became the default objective for classification, language modeling, and many forms of probabilistic prediction.

That dual identity is what makes cross-entropy so important for AI. When we train a classifier with categorical cross-entropy, we are not merely punishing wrong answers. We are asking the model to match the full target distribution as closely as possible in log-loss terms. When we train an autoregressive language model, we are minimizing average next-token surprise. When we evaluate a model with perplexity, we are exponentiating a token-level cross-entropy. When we do knowledge distillation, label smoothing, masked sequence training, or soft supervision, we are still working inside the same cross-entropy geometry.

This section gives cross-entropy its own careful treatment rather than burying it inside a generic loss-function survey. The neighboring sections already cover entropy, KL divergence, and mutual information in their canonical homes. Here we focus on what is unique about cross-entropy itself: its decomposition, its coding meaning, its role as negative log-likelihood, its stable numerical implementation, and its modern use in 2026-era AI systems.

Prerequisites

Companion Notebooks

| Notebook | Description |
|---|---|
| theory.ipynb | Interactive derivations, coding examples, stable softmax/log-sum-exp demos, and perplexity experiments |
| exercises.ipynb | 10 graded exercises covering theory identities, BCE/CE derivations, stable computation, and modern AI use cases |

Learning Objectives

After completing this section, you will be able to:

  • Define discrete, conditional, continuous, and sequence cross-entropy cleanly
  • Derive the identity H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q)
  • Explain cross-entropy as wrong-code expected length and as expected negative log-likelihood
  • Distinguish entropy from cross-entropy and explain why the latter is model-dependent
  • Derive binary cross-entropy from the Bernoulli likelihood
  • Derive categorical cross-entropy from softmax logits and explain the fused logit-space form
  • Compute sequence cross-entropy and convert it into perplexity
  • Explain why cross-entropy is a strictly proper scoring rule for probabilistic prediction
  • Derive the logit-space gradient \hat{\mathbf{p}} - \mathbf{y} and interpret it geometrically
  • Implement numerically stable log_softmax, binary CE from logits, and masked token loss
  • Understand label smoothing, soft targets, and knowledge distillation as target-distribution modifications
  • Recognize when plain cross-entropy is the right tool and when it needs augmentation

1. Intuition

1.1 Cross-Entropy as Wrong-Code Cost

The cleanest intuition for cross-entropy comes from coding. Suppose a source really emits symbols according to a true distribution p, but we design our code as if the source followed another distribution q. The code lengths suggested by q are approximately -\log q(x). If the source actually emits x \sim p, then the expected code length becomes

H(p,q) = -\sum_x p(x)\log q(x).

That is cross-entropy.

Entropy H(p) is the coding cost when the code is optimized for the true source. Cross-entropy H(p,q) is the coding cost when we compress data from p using a code optimized for q. The gap between them is the cost of model mismatch.

This is already enough to understand why cross-entropy became central in machine learning. A predictive model is a probabilistic code for future data. If the model assigns high probability to what actually happens, the code is short and the cross-entropy is low. If the model assigns low probability to what happens, the code is long and the cross-entropy is high.

CODING VIEW OF CROSS-ENTROPY
===============================================================

True source:
  data are generated from p(x)

Model we use:
  code lengths are based on q(x)

Per-symbol code length under q:
  L_q(x) ≈ -log q(x)

Expected length on true data:
  E_{x ~ p}[-log q(x)] = H(p, q)

Special case:
  if q = p, then H(p, q) = H(p)

Interpretation:
  cross-entropy = average surprise under the model
===============================================================
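To make the wrong-code cost concrete, here is a minimal NumPy sketch; the three-symbol source p and model q are made-up values for illustration. It compares the optimal average code length H(p) with the mismatched length H(p,q):

```python
import numpy as np

# Source distribution p and a mismatched model q over three symbols.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.8, 0.1, 0.1])

code_len_q = -np.log2(q)              # code lengths the model q suggests, in bits
H_p  = -(p * np.log2(p)).sum()        # best achievable average length
H_pq = (p * code_len_q).sum()         # average length actually paid under q

print(f"H(p)   = {H_p:.3f} bits/symbol")
print(f"H(p,q) = {H_pq:.3f} bits/symbol")   # always >= H(p); the gap is KL(p||q)
```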

There is a deep practical moral here: cross-entropy does not care whether the model outputs the correct class index alone. It cares whether the model allocates probability mass in the right way. That makes it stronger than simple accuracy as a training signal, because accuracy ignores confidence and only judges the final argmax.

1.2 From Entropy to Cross-Entropy

Entropy asks:

H(p) = -\sum_x p(x)\log p(x).

Cross-entropy asks:

H(p,q) = -\sum_x p(x)\log q(x).

The difference is small in notation and enormous in meaning.

In entropy, the same distribution appears twice. The source both generates the data and defines the log-cost. In cross-entropy, the source p generates the data but the model q assigns the costs. So cross-entropy is not purely a property of the source. It is a property of the pair (p,q).

That pairwise dependence is what makes cross-entropy useful for learning. If the learner changes q, the cross-entropy changes. Entropy H(p) is fixed by the data-generating process; cross-entropy H(p,q) is the quantity we can optimize by improving the model.

The basic story can be phrased three equivalent ways:

  1. Coding view: extra expected code length from using the wrong code.
  2. Prediction view: average negative log-probability assigned to observed outcomes.
  3. Optimization view: the objective minimized by maximum likelihood.

These three viewpoints are mathematically identical, but each highlights a different piece of intuition:

  • coding explains why the quantity is called "cross-entropy"
  • prediction explains why low-probability mistakes are punished so sharply
  • optimization explains why nearly every probabilistic classifier minimizes it
ENTROPY VS CROSS-ENTROPY
===============================================================

Entropy:
  H(p)   = average surprise under the true source itself

Cross-entropy:
  H(p,q) = average surprise of true data under model q

If q is good:
  H(p,q) is close to H(p)

If q is poor:
  H(p,q) is much larger than H(p)

So:
  entropy  = intrinsic uncertainty
  cross-entropy = uncertainty + model mismatch
===============================================================

This is also why the sign and the logarithm matter. The logarithm turns probability multiplication into addition over sequences, and the negative sign makes high-probability events cheap while making low-probability events expensive. A model that assigns 10^{-6} probability to a true event is not slightly wrong. Under log-loss, it is catastrophically wrong.

1.3 Why Cross-Entropy Matters for AI

Cross-entropy appears almost everywhere modern AI systems need calibrated probabilistic prediction.

| AI setting | Random objects | Role of cross-entropy |
|---|---|---|
| Binary classification | label Y \in \{0,1\}, model score q_\theta(Y \mid X) | binary CE is the Bernoulli negative log-likelihood |
| Multiclass classification | class Y \in [K] | categorical CE trains the model to match class probabilities |
| Language modeling | next token X_t given context X_{<t} | sequence CE is the average next-token NLL |
| Distillation | teacher distribution, student distribution | soft cross-entropy transfers dark knowledge |
| Label smoothing | hard target, softened target | CE with softened targets discourages overconfidence |
| Masked training | padded or ignored positions | masked CE trains only on valid positions |
| Calibration | predicted confidence vs reality | CE is a proper scoring rule, but raw CE minimization alone does not guarantee perfect post-hoc calibration |

For LLMs specifically, cross-entropy is the pretraining objective that turned next-token prediction into a general-purpose foundation-model paradigm. A model that minimizes sequence cross-entropy on a broad corpus must learn syntax, world knowledge, code regularities, discourse patterns, and many latent structure constraints, because each of those reduces token surprise.

Cross-entropy also sits at the heart of evaluation:

  • token-level CE underlies perplexity
  • masked or weighted CE appears in seq2seq training
  • binary CE appears in reward-model and preference-model components
  • soft CE appears in distillation and teacher-guided fine-tuning

The quantity is therefore not a classroom loss used only in toy classifiers. It is one of the main mathematical interfaces between information theory and large scale model training.

1.4 Historical Timeline

CROSS-ENTROPY -- KEY MILESTONES
===============================================================

1948  Shannon
      Entropy and coding length become central objects in
      communication theory.

1951  Kullback and Leibler
      Relative entropy clarifies the decomposition
      cross-entropy = entropy + KL divergence.

1960s-1980s
      Log-likelihood and log-loss become standard tools in
      statistics and probabilistic classification.

1980s-1990s
      Neural networks adopt cross-entropy because it aligns
      naturally with sigmoid and softmax outputs.

2014-2015
      Distillation and soft targets use cross-entropy between
      teacher and student distributions.

2017  Transformer
      Label-smoothed sequence cross-entropy becomes a standard
      training recipe in large sequence models.

2020-2026
      Cross-entropy remains the default pretraining objective
      for autoregressive and many supervised AI systems, while
      perplexity remains a core LM evaluation metric.
===============================================================

The historical pattern is important. Cross-entropy was not invented as a neural-network trick. It was inherited from information theory and statistics, then rediscovered as exactly the right loss for probabilistic prediction. That heritage explains why the same quantity appears in code design, likelihood, calibration, classification, and sequence modeling.


2. Formal Definitions

2.1 Discrete Cross-Entropy

Let p and q be probability mass functions on the same discrete support \mathcal{X}. The discrete cross-entropy of p relative to q is

H(p,q) = -\sum_{x \in \mathcal{X}} p(x)\log q(x),

provided the sum is well defined.

This formula should be read carefully:

  • the expectation is taken under p
  • the logarithmic score is computed using q

So p is the source of the data and q is the model being judged.

If X \sim p, we can write the same quantity more compactly as

H(p,q) = \mathbb{E}_{X \sim p}[-\log q(X)].

This expectation form is often the more useful one in machine learning because training datasets are empirical samples from the underlying data distribution.

Example 1: Bernoulli source. If p = \operatorname{Bern}(0.8) and q = \operatorname{Bern}(0.6), then

H(p,q) = -0.8\log 0.6 - 0.2\log 0.4.

The source emits ones much more often than zeros, so the penalty paid by q depends more heavily on how much mass it gives to the event 1.

Example 2: One-hot target. If the true label is a deterministic class c, then p is a point mass \delta_c, and the cross-entropy becomes

H(\delta_c, q) = -\log q(c).

This is exactly the common supervised-learning loss on a single example.
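As a quick check of both examples, here is a minimal NumPy sketch; cross_entropy is an illustrative helper, not a library routine:

```python
import numpy as np

def cross_entropy(p, q):
    """Discrete cross-entropy H(p, q) in nats: -sum_x p(x) log q(x)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

# Example 1: Bernoulli source p = Bern(0.8), model q = Bern(0.6).
p = [0.2, 0.8]   # P(X = 0), P(X = 1)
q = [0.4, 0.6]
print(cross_entropy(p, q))   # -0.8 log 0.6 - 0.2 log 0.4 ~ 0.592 nats
print(cross_entropy(p, p))   # entropy H(p) ~ 0.500 nats, the lower bound

# Example 2: a one-hot target delta_c reduces H(delta_c, q) to -log q(c).
c = 1
one_hot = np.eye(2)[c]
print(cross_entropy(one_hot, q), -np.log(q[c]))   # identical values
```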

Non-example. It is wrong to write "cross-entropy between two arbitrary vectors" unless the vectors are valid probability distributions or are clearly interpreted as such. Raw logits are not probabilities. Applying the CE formula directly to logits is nonsensical.

2.2 Conditional and Empirical Cross-Entropy

In supervised learning the model is conditional. Instead of a single distribution q, we have q_\theta(y \mid x). The population cross-entropy is then

\mathcal{L}(\theta) = \mathbb{E}_{(X,Y)\sim p}[-\log q_\theta(Y \mid X)].

This is just conditional cross-entropy averaged over inputs:

H_p(Y \mid X; q_\theta) = \mathbb{E}_{(X,Y)\sim p}[-\log q_\theta(Y \mid X)].

The notation varies in the literature, but the substance does not: the model is penalized according to the log-probability it assigns to the observed label conditional on the input.

Given a dataset

\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^n,

the empirical cross-entropy is

\widehat{\mathcal{L}}_n(\theta) = \frac{1}{n}\sum_{i=1}^n -\log q_\theta(y^{(i)} \mid \mathbf{x}^{(i)}).

This is the quantity we actually minimize in training.

There are two layers of approximation here:

  1. the finite dataset approximates the true population distribution
  2. the model class \{q_\theta\} may or may not be rich enough to represent the true conditional law

That distinction matters. The theoretical object is population cross-entropy. The implemented object is empirical cross-entropy. Their gap is part of generalization theory.

2.3 Continuous Cross-Entropy

If p and q are densities on a continuous support, the continuous cross-entropy is

H(p,q) = -\int p(x)\log q(x)\,dx,

whenever the integral exists.

This is the direct analog of the discrete formula, but it deserves caution:

  • densities can exceed 1, so -\log q(x) can be negative
  • continuous cross-entropy need not share all the intuitions of discrete code lengths unless we interpret it through discretization limits or relative density comparisons
  • support mismatch can make the integral infinite

Even so, continuous cross-entropy is still the expected negative log-density under the source p, and it still satisfies the same decomposition with KL divergence when the relevant quantities are finite.

Example. If p is a Gaussian data law and q_\theta is a Gaussian model with parameterized mean and variance, minimizing cross-entropy is equivalent to maximum likelihood estimation of the Gaussian parameters.

This continuous version is especially important in density modeling, diffusion components, and variational objectives, even if the pure discrete version is the one most learners first encounter in classification.

2.4 Sequence Cross-Entropy and Cross-Entropy Rate

Suppose a sequence source emits

X_{1:T} = (X_1, \dots, X_T).

If the model factorizes autoregressively as

q(x_{1:T}) = \prod_{t=1}^T q(x_t \mid x_{<t}),

then the sequence cross-entropy is

H(p,q) = \mathbb{E}_{X_{1:T}\sim p}\left[-\log q(X_{1:T})\right] = \mathbb{E}_{X_{1:T}\sim p}\left[-\sum_{t=1}^T \log q(X_t \mid X_{<t})\right].

This additive decomposition is why language-model training looks like "sum the token losses." It is not a heuristic. It is exactly the log-factorization of the joint model.

The average per-token cross-entropy is

\frac{1}{T}\,\mathbb{E}_{X_{1:T}\sim p}\left[-\sum_{t=1}^T \log q(X_t \mid X_{<t})\right].

For stationary sources, one often studies the limiting quantity

\overline{H}(p,q) = \lim_{T\to\infty} \frac{1}{T} H(p_{1:T}, q_{1:T}),

when the limit exists. This is the cross-entropy rate.

For LLMs, the finite-sample per-token quantity is the practical object. It is reported in nats or bits per token, and its exponentiation yields perplexity after the usual normalization conventions are fixed.
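A minimal sketch of the finite-sample per-token computation, assuming some hypothetical per-token model probabilities:

```python
import numpy as np

# Hypothetical per-token model probabilities q(x_t | x_{<t}) for one sequence.
token_probs = np.array([0.20, 0.05, 0.60, 0.30, 0.10])

token_nll = -np.log(token_probs)   # per-token surprise, in nats
seq_nll   = token_nll.sum()        # -log q(x_{1:T}), the joint NLL
avg_nll   = token_nll.mean()       # per-token cross-entropy estimate
ppl       = np.exp(avg_nll)        # perplexity under natural logs

print(f"sequence NLL = {seq_nll:.3f} nats")
print(f"per-token CE = {avg_nll:.3f} nats, perplexity = {ppl:.2f}")
```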

2.5 Support Mismatch, Zero Probabilities, and Log Bases

Cross-entropy looks harmless until the model assigns zero probability to an event that actually occurs.

If p(x) > 0 for some x but q(x) = 0, then

-\log q(x) = \infty,

so the cross-entropy is infinite.

This is not a technical annoyance. It is conceptually correct. A model that declares an actually possible event to be impossible deserves infinite logarithmic penalty.

This has several practical consequences:

  • smoothed probabilities are often used to avoid literal zeros
  • stable implementations work with logits to avoid underflow to zero
  • support assumptions matter whenever one compares distributions

We also need to specify the logarithm base:

  • \log_2 gives cross-entropy in bits
  • the natural logarithm \log gives cross-entropy in nats

In optimization, changing the log base only rescales the loss by a constant factor, so the minimizer is unchanged. In information-theoretic interpretation, the base determines the measurement unit.

Important edge case. Cross-entropy is only meaningful when p and q live on the same event space. Comparing a word-level distribution to a character-level distribution without careful alignment is not valid. This becomes especially relevant when people compare perplexities across different tokenizers.


3. Core Theory I: Identities, Bounds, and Coding

3.1 Cross-Entropy = Entropy + KL Divergence

The defining identity of this section is

H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q).

It is the cleanest mathematical statement of what cross-entropy measures: intrinsic uncertainty plus model mismatch.

Derivation. Start from the definition:

H(p,q) = -\sum_x p(x)\log q(x).

Add and subtract \log p(x) inside the sum:

H(p,q) = -\sum_x p(x)\log p(x) + \sum_x p(x)\log \frac{p(x)}{q(x)}.

The first term is H(p). The second is D_{\mathrm{KL}}(p\|q). Hence

H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q).

This identity matters so much because it immediately turns cross-entropy minimization into KL minimization whenever p is fixed.

In learning problems, p is the data-generating distribution and does not depend on the model parameters \theta. Therefore

\arg\min_\theta H(p, q_\theta) = \arg\min_\theta D_{\mathrm{KL}}(p\|q_\theta).

This is the theoretical bridge from information theory to maximum likelihood.
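A quick numerical verification of the decomposition, using randomly generated distributions p and q:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / v.sum()

p = normalize(rng.random(5))
q = normalize(rng.random(5))

H_p  = -np.sum(p * np.log(p))        # entropy H(p)
H_pq = -np.sum(p * np.log(q))        # cross-entropy H(p, q)
KL   =  np.sum(p * np.log(p / q))    # D_KL(p || q)

assert np.isclose(H_pq, H_p + KL)    # H(p,q) = H(p) + KL(p||q)
assert H_pq >= H_p                   # the cross-entropy lower bound
```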

3.2 Lower Bounds and Equality Cases

Because KL divergence is nonnegative,

D_{\mathrm{KL}}(p\|q) \ge 0,

the decomposition immediately gives

H(p,q) \ge H(p).

Equality holds if and only if p = q almost everywhere on the support of p.

This is the core lower-bound fact:

  • the true entropy H(p) is the irreducible baseline
  • every imperfect model pays an extra KL tax

In machine-learning language, this says a model cannot beat the entropy of the true label/source distribution when evaluated under the true probabilistic task. Any extra loss above that baseline is mismatch.

There are three common special cases:

  1. Perfect modeling: if q = p, then H(p,q) = H(p).
  2. Mild mismatch: if q is close to p, the cross-entropy is only a little larger than the entropy.
  3. Support error: if q assigns zero probability where p assigns positive probability, cross-entropy becomes infinite.

The bound is often used implicitly in language modeling. People say "the model cannot push perplexity below the entropy rate of the data source," which is the sequence version of the same statement.

3.3 Mismatched Coding Interpretation

Cross-entropy is often introduced through coding, but the coding interpretation deserves more than a slogan.

If a code optimized for distribution q uses approximate codelengths -\log q(x), then the average codelength when the true source is p is

\mathbb{E}_{X\sim p}[-\log q(X)] = H(p,q).

The optimal codelength under the true source is instead H(p). Therefore the excess expected codelength is

H(p,q) - H(p) = D_{\mathrm{KL}}(p\|q).

That difference is not merely "some statistical gap." It is the exact expected coding inefficiency incurred by modeling the source incorrectly.

MISMATCHED CODING
===============================================================

Best possible code for source p:
  expected length = H(p)

Code built from model q:
  expected length = H(p, q)

Extra cost of using q instead of p:
  H(p, q) - H(p) = D_KL(p || q)

Meaning:
  KL divergence is the coding penalty for model mismatch
===============================================================

This picture makes the name "cross-entropy" less mysterious. We take the entropy-style expectation under one distribution and cross it with logarithmic costs defined by another.

It also explains why cross-entropy remains meaningful when training models with soft targets. A teacher distribution is effectively a richer probabilistic code than a one-hot label, so matching it with soft cross-entropy means preserving a more detailed coding structure.

3.4 Conditional Chain Rules and Factorizations

Cross-entropy inherits the same additive structure that makes log-likelihoods so useful.

If p(x,y) = p(x)\,p(y \mid x) and q(x,y) = q(x)\,q(y \mid x), then

H(p_{XY}, q_{XY}) = \mathbb{E}_{(X,Y)\sim p}\left[-\log q(X,Y)\right] = \mathbb{E}_{(X,Y)\sim p}\left[-\log q(X) - \log q(Y \mid X)\right] = H(p_X, q_X) + \mathbb{E}_{X\sim p}\left[H(p_{Y\mid X}, q_{Y\mid X})\right].

So conditional cross-entropy decomposes naturally:

H(p_{XY}, q_{XY}) = H(p_X, q_X) + H_p(Y \mid X; q).

This is the formal reason that sequence losses, masked losses, and structured prediction losses can be summed over steps, positions, or factors.

In autoregressive modeling, the factorization

q(x_{1:T}) = \prod_{t=1}^T q(x_t \mid x_{<t})

turns the joint cross-entropy into a sum of token-level terms. In conditional classification, the same logic turns an expectation over labels into a sample average of negative log-conditional probabilities.

The chain-rule viewpoint is especially useful when debugging training setups. If a loss is an average over positions, one should ask:

  • which positions are included?
  • which are masked?
  • are we averaging over tokens, sequences, or batches?

Those choices change the empirical estimator, even though the underlying population quantity is still a conditional cross-entropy.

3.5 Cross-Entropy Rate for Sources

For long sequences produced by a stationary source, one is often less interested in the total code length than in the long-run average cost per symbol. This is the cross-entropy rate:

\overline{H}(p,q) = \lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}_{X_{1:T}\sim p}\left[-\log q(X_{1:T})\right].

If both source and model are i.i.d., this reduces to the ordinary one-step cross-entropy. But for dependent sources such as language, the rate absorbs long-range conditional structure.

This is why next-token prediction is not just a convenient engineering trick. The autoregressive cross-entropy rate measures the average uncertainty the model still has about the next symbol once it sees the past.

In practice, finite-context models only approximate this ideal because they cannot condition on an infinite past. Hugging Face's perplexity documentation explicitly warns that fixed-context models need sliding-window approximations if we want a better estimate of the fully factorized sequence probability. That is an implementation detail with theoretical teeth: context truncation changes the conditional distribution and therefore changes the measured cross-entropy.


4. Core Theory II: From Information Measure to Learning Objective

4.1 Negative Log-Likelihood and Maximum Likelihood

Let a probabilistic model assign conditional probabilities q_\theta(y \mid \mathbf{x}). For a dataset

\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^n,

the likelihood is

L(\theta; \mathcal{D}) = \prod_{i=1}^n q_\theta(y^{(i)} \mid \mathbf{x}^{(i)}).

Taking logs gives

\log L(\theta; \mathcal{D}) = \sum_{i=1}^n \log q_\theta(y^{(i)} \mid \mathbf{x}^{(i)}).

Maximum likelihood therefore solves

\max_\theta \sum_{i=1}^n \log q_\theta(y^{(i)} \mid \mathbf{x}^{(i)}),

or equivalently

\min_\theta \frac{1}{n}\sum_{i=1}^n -\log q_\theta(y^{(i)} \mid \mathbf{x}^{(i)}).

That last expression is empirical cross-entropy.

So the standard supervised classification loss is not an arbitrary choice. It is exactly the negative log-likelihood of the observed labels under the model.

This explains a lot at once:

  • why the objective is additive over examples
  • why labels with low predicted probability incur large penalties
  • why the output layer must define a valid probability distribution

It also clarifies what changes under different tasks:

  • binary classification uses a Bernoulli conditional law
  • multiclass classification uses a categorical conditional law
  • sequence models use an autoregressive product of categorical laws
  • soft targets replace the point-mass target distribution by a non-degenerate target law

The unifying object is still expected negative log-probability.

4.2 Binary Cross-Entropy from a Bernoulli Model

Suppose Y \in \{0,1\} and the model predicts

q_\theta(Y=1 \mid \mathbf{x}) = \hat{p}, \qquad q_\theta(Y=0 \mid \mathbf{x}) = 1 - \hat{p}.

The Bernoulli likelihood for one example is

q_\theta(y \mid \mathbf{x}) = \hat{p}^{\,y}(1-\hat{p})^{1-y}.

Taking negative logs gives

\ell_{\mathrm{BCE}}(y,\hat{p}) = -\log q_\theta(y \mid \mathbf{x}) = -\left[\, y\log \hat{p} + (1-y)\log(1-\hat{p}) \,\right].

This is binary cross-entropy.

Two edge cases explain its shape:

  • if y = 1, then \ell = -\log \hat{p}
  • if y = 0, then \ell = -\log(1-\hat{p})

So confident correct predictions produce small loss, while confident wrong predictions produce very large loss.

BINARY CROSS-ENTROPY INTUITION
===============================================================

True label y = 1:
  loss = -log(p_hat)
  cheap if p_hat is close to 1
  huge if p_hat is close to 0

True label y = 0:
  loss = -log(1 - p_hat)
  cheap if p_hat is close to 0
  huge if p_hat is close to 1

Meaning:
  BCE punishes confident mistakes much more than hesitant ones
===============================================================

This is one reason BCE often trains better than mean-squared error on probabilistic binary prediction. It delivers much sharper corrective gradients when the model is confidently wrong.

In practice, the prediction \hat{p} usually comes from a logit z through the sigmoid

\hat{p} = \sigma(z) = \frac{1}{1+e^{-z}}.

Stable implementations therefore work directly with the logit z, not with \hat{p} after explicit sigmoid evaluation.
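One standard rearrangement (a sketch, not PyTorch's exact internal implementation) evaluates max(z, 0) - z y + log(1 + e^{-|z|}), which is algebraically equal to the BCE but never exponentiates a large positive number:

```python
import numpy as np

def bce_with_logits(z, y):
    """Stable binary cross-entropy computed from the logit z.

    Algebraically equal to -[y log sigmoid(z) + (1 - y) log(1 - sigmoid(z))],
    but it never evaluates the sigmoid near floating-point extremes.
    """
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z, y = -100.0, 1.0                     # confidently wrong prediction, label 1
print(bce_with_logits(z, y))           # ~100.0, finite and correct

# The naive route underflows: sigmoid(-100) rounds to exactly 0.0.
with np.errstate(over="ignore", divide="ignore"):
    p_hat = 1.0 / (1.0 + np.exp(-z))
    print(-y * np.log(p_hat))          # inf -- the loss is unusable
```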

It is also useful to compare BCE with plain accuracy on a single example. If the true label is 1, then predicting 0.51 and predicting 0.99 both count as "correct" under accuracy. BCE distinguishes them sharply:

  • -\log 0.51 is still relatively large
  • -\log 0.99 is very small

So BCE rewards not merely correctness, but confident correctness backed by probability mass. That extra resolution is exactly what makes probabilistic learning work.

4.3 Categorical Cross-Entropy from Softmax Logits

Now suppose the label Y takes values in [K] = \{1,\dots,K\}. Let the model output logits \mathbf{z} \in \mathbb{R}^K and probabilities

\hat{\mathbf{p}} = \operatorname{softmax}(\mathbf{z}), \qquad \hat{p}_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}.

If the target is one-hot with true class c, then the categorical cross-entropy is

\ell_{\mathrm{CE}}(\mathbf{z}, c) = -\log \hat{p}_c.

Expanding \hat{p}_c gives the fused logit-space form

\ell_{\mathrm{CE}}(\mathbf{z}, c) = -z_c + \log\sum_{j=1}^K e^{z_j}.

This is the form used in numerically stable implementations because it avoids explicitly computing softmax and then taking a logarithm.

There are several things to notice:

  1. Only the true-class logit appears with a negative sign directly.
  2. All logits influence the partition term \log\sum_j e^{z_j}.
  3. Adding the same constant to all logits leaves the loss unchanged, because softmax is shift-invariant.

That last property is the foundation of the log-sum-exp stabilization trick.

One-hot target interpretation. If the target distribution is \mathbf{y} = \mathbf{e}_c, then

-\sum_{k=1}^K y_k \log \hat{p}_k = -\log \hat{p}_c.

So the common multiclass loss is a special case of full distribution-level cross-entropy where the target distribution is degenerate.

There is another important geometric interpretation. The model cannot improve the loss for class c merely by increasing z_c in isolation; the partition term couples every class to every other class. This is why the loss produces dense gradients over the full vocabulary in language modeling: even one observed token creates pressure on all logits through normalization.
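A small numeric check of the fused form and the shift invariance just described, using made-up logits:

```python
import numpy as np

z = np.array([2.0, -1.0, 0.5, 3.0])   # logits for K = 4 classes
c = 3                                  # true class index

# Route 1: softmax then -log (fine at this scale, unsafe in general).
p_hat = np.exp(z) / np.exp(z).sum()
loss_naive = -np.log(p_hat[c])

# Route 2: fused logit-space form, -z_c + log-sum-exp(z), with max shift.
def fused_ce(z, c):
    m = z.max()
    return -z[c] + m + np.log(np.exp(z - m).sum())

assert np.isclose(loss_naive, fused_ce(z, c))

# Shift invariance: adding a constant to all logits leaves the loss unchanged.
assert np.isclose(fused_ce(z, c), fused_ce(z + 1000.0, c))
```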

4.4 Soft Targets and Target Distributions

One-hot labels are not the only valid targets. If the target itself is a probability distribution \mathbf{t} \in \Delta^{K-1}, then the natural loss is

\ell(\mathbf{t}, \hat{\mathbf{p}}) = -\sum_{k=1}^K t_k \log \hat{p}_k.

This is still just cross-entropy.

Soft targets appear in several important settings:

  • label smoothing: replace a point mass by a slightly spread target
  • knowledge distillation: use teacher probabilities as targets
  • weak supervision: labels encode uncertainty rather than certainty
  • annotator aggregation: multiple human labels produce a soft empirical distribution over classes

The interpretation changes from "probability on the correct class" to "probability mass allocated across the whole target distribution."

This distinction is crucial. With one-hot labels, all non-true classes are treated identically. With soft targets, the loss can encode structure among incorrect classes:

  • class A may be more plausible than class B
  • synonyms may deserve nonzero mass
  • near-equivalent translations may share credit

This richer target geometry is part of why distillation can transfer "dark knowledge" that is invisible in hard labels.

4.5 Sequence Cross-Entropy, Token Averaging, and Perplexity

For autoregressive language models, the target is a token sequence x_{1:T}. The model defines

q_\theta(x_{1:T}) = \prod_{t=1}^T q_\theta(x_t \mid x_{<t}).

The negative log-likelihood is

-\log q_\theta(x_{1:T}) = \sum_{t=1}^T -\log q_\theta(x_t \mid x_{<t}).

Dividing by T gives the average token cross-entropy:

\frac{1}{T}\sum_{t=1}^T -\log q_\theta(x_t \mid x_{<t}).

This token average is what practitioners usually report during LLM training.

Perplexity is then

\operatorname{PPL} = \exp\left( \frac{1}{T}\sum_{t=1}^T -\log q_\theta(x_t \mid x_{<t}) \right)

when natural logarithms are used.

Hugging Face's language-modeling documentation makes two important points that are easy to miss:

  1. perplexity is specifically natural for autoregressive or causal language models, not masked language models such as BERT
  2. tokenizer choice changes the event space, so perplexities across different tokenizations are not directly comparable

This is one of the most common evaluation mistakes in modern LLM work. Lower perplexity is meaningful only relative to the same tokenization and evaluation protocol.

Another subtle issue is aggregation choice. People sometimes average token losses per sequence and then average again across a batch, which weights short and long sequences equally. In other pipelines, all valid tokens are pooled into one denominator, which weights sequences proportionally to length. Both choices can be legitimate, but they are not numerically equivalent. When comparing training curves or benchmark results, one should always ask which averaging scheme was used.
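The difference between the two aggregation schemes is easy to see on made-up numbers:

```python
import numpy as np

# Hypothetical per-token losses for two sequences of different lengths.
seq_a = np.array([2.0, 2.0])                       # short sequence
seq_b = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])   # long sequence

# Scheme 1: average per sequence, then average the sequence means.
per_seq = np.mean([seq_a.mean(), seq_b.mean()])    # (2.0 + 1.0) / 2 = 1.5

# Scheme 2: pool all valid tokens into one denominator.
pooled = np.concatenate([seq_a, seq_b]).mean()     # 10.0 / 8 = 1.25

print(per_seq, pooled)   # 1.5 vs 1.25 -- same data, different "average CE"
```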


5. Core Theory III: Gradients, Geometry, and Statistical Meaning

5.1 Stable Log-Sum-Exp and Fused Implementations

Naively computing

\log\operatorname{softmax}(\mathbf{z})_c = \log\frac{e^{z_c}}{\sum_j e^{z_j}}

is numerically dangerous because large logits can overflow the exponentials.

The stable remedy is to subtract the maximum logit

m = \max_j z_j.

Then

\log\sum_j e^{z_j} = m + \log\sum_j e^{z_j - m}.

Since z_j - m \le 0, the exponentials are at most 1, preventing overflow.

Therefore the stable log-softmax is

\log\operatorname{softmax}(\mathbf{z})_c = z_c - m - \log\sum_j e^{z_j - m}.

This identity is not optional implementation polish. It is the reason real training code works reliably.

PyTorch's official documentation reflects this design:

  • CrossEntropyLoss expects logits, not probabilities
  • BCEWithLogitsLoss fuses sigmoid with binary cross-entropy
  • the fused versions exploit the log-sum-exp trick for numerical stability

The practical rule is simple:

NEVER DO THIS IN PRODUCTION
===============================================================

probs = softmax(logits)
loss  = -log(probs[target])

Reason:
  overflow, underflow, and avoidable precision loss

DO THIS INSTEAD
  use log-softmax or a fused cross-entropy-from-logits routine
===============================================================

This is one of the most important implementation lessons in the chapter because the mathematical formula alone does not tell you how to compute it safely.
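A minimal sketch of the stable routine and of the failure mode the box above warns about; log_softmax and ce_from_logits are illustrative helpers, not library calls:

```python
import numpy as np

def log_softmax(z):
    """Stable log-softmax: z - m - log(sum exp(z - m)) with m = max(z)."""
    m = np.max(z)
    return z - m - np.log(np.sum(np.exp(z - m)))

def ce_from_logits(z, c):
    """Cross-entropy for true class c, computed entirely in logit space."""
    return -log_softmax(z)[c]

z = np.array([1000.0, 999.0, 0.0])    # logits large enough to overflow exp
print(ce_from_logits(z, 0))           # ~0.313, computed safely

# The "never do this" route fails on the same input:
with np.errstate(over="ignore", invalid="ignore"):
    probs = np.exp(z) / np.exp(z).sum()   # exp(1000) overflows to inf
print(-np.log(probs[0]))                  # nan, not a usable loss
```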

For BCE, the same principle appears in a slightly different algebraic form. Instead of computing

-y\log \sigma(z) - (1-y)\log(1-\sigma(z)),

stable implementations rewrite the expression in logit space so they do not need to evaluate \sigma(z) near floating-point extremes. This is exactly why BCEWithLogitsLoss exists as a distinct API rather than as a trivial wrapper around sigmoid and BCELoss.

5.2 The Gradient: Predicted Minus Target

For categorical cross-entropy with one-hot target \mathbf{y} and softmax prediction \hat{\mathbf{p}} = \operatorname{softmax}(\mathbf{z}), the gradient with respect to the logits is

\nabla_{\mathbf{z}}\ell = \hat{\mathbf{p}} - \mathbf{y}.

This is one of the most famous formulas in deep learning.

Its importance is hard to overstate:

  • it is simple
  • it is dense over all classes
  • it has a clean probabilistic interpretation

The true class receives gradient \hat{p}_c - 1, which is negative unless the model already assigns probability 1 to the correct class. Every other class receives gradient \hat{p}_k, which pushes those probabilities down.

So the gradient does exactly what we want:

  • increase the correct probability
  • decrease the incorrect probabilities
  • do so in proportion to the model's own confidence structure

This is one reason softmax plus cross-entropy is such an elegant pair. The Jacobian of softmax and the derivative of log-loss cancel into a beautifully simple residual form.

Forward reference: the full softmax Jacobian derivation belongs in Jacobians and Hessians and in the more output-activation-centered treatment in Activation Functions. Here we focus on what that derivative means for cross-entropy itself.

This gradient also explains why CE continues to produce useful updates even when the model is badly wrong. If the true class probability is tiny, then \hat{p}_c - 1 is close to -1, which yields a strong correction. By contrast, losses with flatter tails can provide weaker learning signals for the same mistake.
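A finite-difference check of the \hat{\mathbf{p}} - \mathbf{y} formula on made-up logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce(z, c):
    return -np.log(softmax(z)[c])

z = np.array([0.5, -1.2, 2.0])
c = 0
y = np.eye(3)[c]

analytic = softmax(z) - y            # the p_hat - y formula

# Central finite differences of the same gradient.
eps = 1e-6
numeric = np.array([
    (ce(z + eps * np.eye(3)[k], c) - ce(z - eps * np.eye(3)[k], c)) / (2 * eps)
    for k in range(3)
])
assert np.allclose(analytic, numeric, atol=1e-6)
```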

5.3 Hessian, Curvature, and a Preview of Fisher Information

The Hessian of softmax cross-entropy with respect to logits is positive semidefinite and closely tied to the covariance structure of the predicted distribution:

H = \operatorname{diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}}\hat{\mathbf{p}}^\top.

This matrix has several key properties:

  • it is symmetric
  • it is positive semidefinite
  • its rows sum to zero because softmax is shift-invariant

Geometrically, this tells us that cross-entropy is locally curved in the directions that change relative class probabilities, but flat along the all-ones direction that shifts all logits equally.

That local curvature structure is one reason CE behaves so nicely for probabilistic classification. Near a well-fit solution, the Hessian captures how sensitive the loss is to perturbations in the predicted distribution.

It also previews Fisher information. For log-likelihood objectives, the expected Hessian and the Fisher matrix are deeply connected. That connection is one reason Fisher information belongs right after this section in the chapter.

There is also a numerical lesson here. In large-vocabulary models, the Hessian is enormous, but its structure is not arbitrary. It is the covariance matrix of the softmax distribution in logit space. That covariance view helps explain why second-order approximations, natural gradients, and curvature-aware methods keep returning to the same family of matrices when derived from CE objectives.
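A small sketch verifying the three listed properties of H = diag(p̂) - p̂ p̂^⊤ numerically:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(np.array([1.0, 0.0, -2.0, 0.5]))
H = np.diag(p) - np.outer(p, p)     # Hessian of CE w.r.t. the logits

assert np.allclose(H, H.T)                        # symmetric
assert np.allclose(H.sum(axis=1), 0.0)            # rows sum to zero (shift invariance)
assert np.all(np.linalg.eigvalsh(H) >= -1e-12)    # positive semidefinite
```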

5.4 Log-Loss as a Strictly Proper Scoring Rule

Cross-entropy is more than a convenient optimization objective. It is also a strictly proper scoring rule.

The idea is simple:

  • a scoring rule evaluates probabilistic forecasts
  • it is proper if the expected score is minimized by reporting the true distribution
  • it is strictly proper if that minimizer is unique

For log-loss, the expected score under the true law p is exactly

\mathbb{E}_{X\sim p}[-\log q(X)] = H(p,q).

By the identity

H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q),

this is minimized uniquely at q = p.

That theorem is why CE is the standard loss for probabilistic prediction. If the model class is expressive enough and optimization succeeds, minimizing expected CE pushes the model toward the true conditional distribution, not just toward correct argmax decisions.

This also highlights a contrast with accuracy:

  • accuracy only cares whether the highest-probability class is correct
  • cross-entropy cares whether the full probability distribution is honest

That is a much stronger learning signal, and it is the right one whenever calibrated probabilities matter.

The "strictly proper" language also protects against a common misunderstanding. Cross-entropy is not preferred merely because it is differentiable. It is preferred because it is truth-inducing: in expectation, the best forecast under this rule is the true distribution itself. That is a much deeper reason.

5.5 Calibration, Overconfidence, and Temperature Scaling

A model can achieve low cross-entropy and still be miscalibrated in practice. This is a subtle but important point.

At the population optimum, cross-entropy is minimized by the true conditional distribution. But finite data, finite model classes, optimization bias, label noise, and overparameterization can still produce overconfident or underconfident predictions.

Modern neural networks are often overconfident:

  • they assign probabilities too close to 0 or 1
  • their predicted confidence exceeds empirical correctness frequency

Temperature scaling is a standard post-hoc fix. If \mathbf{z} are logits, replace them by \mathbf{z}/\tau with \tau > 0, then choose \tau on a validation set to minimize the negative log-likelihood.

Effects:

  • \tau > 1 softens the distribution and increases entropy
  • \tau < 1 sharpens the distribution and decreases entropy

This matters because cross-entropy is both a training objective and an evaluation objective. A model that trains well under CE may still need temperature adjustment to report probabilities that are useful for downstream decision-making.
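A minimal sketch of temperature selection by validation NLL. The data here are synthetic and constructed so that the "model" logits are exactly three times too sharp; all names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical validation set: base scores are calibrated by construction,
# and the "model" reports logits 3x too sharp (overconfident).
base = rng.normal(size=(500, 10))
probs = np.exp(base) / np.exp(base).sum(axis=1, keepdims=True)
labels = np.array([rng.choice(10, p=row) for row in probs])
logits = 3.0 * base

def val_nll(logits, labels, tau):
    z = logits / tau
    m = z.max(axis=1, keepdims=True)
    log_Z = (m + np.log(np.exp(z - m).sum(axis=1, keepdims=True))).squeeze(1)
    return np.mean(log_Z - z[np.arange(len(labels)), labels])

# Temperature is a single scalar, so a grid search is enough for a sketch.
taus = np.linspace(0.5, 6.0, 100)
best = taus[np.argmin([val_nll(logits, labels, t) for t in taus])]
print(f"selected temperature ~ {best:.2f}")   # should land near 3
```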

The original Transformer paper famously noted a related tension: label smoothing improved accuracy and BLEU while hurting perplexity, precisely because it makes the model more uncertain. That is not a contradiction. It is a reminder that cross-entropy, calibration, and task metrics interact in subtle ways.

In deployment, this matters whenever probabilities are acted on directly:

  • abstention thresholds
  • medical or legal risk alerts
  • reranking pipelines
  • active learning
  • safety filters

A model with slightly worse raw CE but much better calibration may be more useful in downstream decision systems than a model with slightly better CE and pathological overconfidence.


6. Advanced Topics

6.1 Label Smoothing as Entropy Injection

Label smoothing replaces a hard one-hot target \mathbf{e}_c by a softened distribution. A common version is

\mathbf{t}^{(\varepsilon)} = (1-\varepsilon)\mathbf{e}_c + \frac{\varepsilon}{K}\mathbf{1},

or a closely related variant that spreads the \varepsilon mass only across the incorrect classes.

The cross-entropy then becomes

\ell_{\mathrm{LS}} = -\sum_{k=1}^K t_k^{(\varepsilon)} \log \hat{p}_k.

This can be read in two complementary ways:

  1. target-side view: the supervision no longer demands absolute certainty
  2. regularization view: the model is penalized for driving the predicted distribution toward zero entropy too aggressively

Why does this help?

  • it reduces logit saturation
  • it prevents the model from treating every non-target class as equally and infinitely wrong
  • it often improves calibration and generalization

But there is a tradeoff. If the target is smoothed, the minimum achievable loss relative to one-hot labels changes. The model is being asked to remain slightly uncertain, even on examples with unambiguous labels.

This is exactly why the Transformer paper could report that label smoothing with \varepsilon_{\mathrm{ls}} = 0.1 improved task metrics while hurting perplexity. A lower perplexity corresponds to more concentrated next-token distributions. Smoothing deliberately resists that concentration.

LABEL SMOOTHING
===============================================================

Hard target:
  [0, 0, 1, 0, 0]

Smoothed target:
  [0.02, 0.02, 0.92, 0.02, 0.02]

Effect:
  - less pressure for infinite confidence
  - smaller penalty for near-miss classes
  - often better calibration and robustness
===============================================================

One important nuance from later empirical work is that label smoothing is not a universal free lunch. The NeurIPS 2019 study "When Does Label Smoothing Help?" showed that its benefits depend on task structure, class relationships, and the evaluation metric of interest. In some distillation settings, naive smoothing can even interfere with knowledge transfer.

So the right conclusion is not "always smooth labels." It is:

  • label smoothing changes the target distribution
  • changing the target distribution changes what cross-entropy asks the model to learn
  • that can be beneficial or harmful depending on the problem

Another useful way to think about smoothing is that it injects a tiny amount of target entropy by hand. Hard labels say "the world is deterministic at this example." Smoothed labels say "the world may still have ambiguity, annotation noise, or semantic neighborhood structure." That philosophical shift is small in notation and large in effect.
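A minimal sketch of the smoothed target and its effect on the loss, matching the box above; smoothed_target and soft_ce are illustrative helpers:

```python
import numpy as np

def smoothed_target(c, K, eps):
    """(1 - eps) * one_hot(c) + eps / K * ones -- the variant from the text."""
    return (1 - eps) * np.eye(K)[c] + eps / K * np.ones(K)

def soft_ce(t, p_hat):
    return -np.sum(t * np.log(p_hat))

p_hat = np.array([0.01, 0.02, 0.95, 0.01, 0.01])
hard = np.eye(5)[2]
soft = smoothed_target(2, 5, eps=0.1)

print(soft)                      # [0.02, 0.02, 0.92, 0.02, 0.02]
print(soft_ce(hard, p_hat))      # rewards driving p_hat[2] toward 1
print(soft_ce(soft, p_hat))      # penalizes zero-entropy predictions
```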

6.2 Knowledge Distillation and Soft Cross-Entropy

In knowledge distillation we train a student model to match a teacher distribution rather than only the hard class label. If the teacher outputs \mathbf{t} and the student predicts \hat{\mathbf{p}}, the natural loss is

\ell_{\mathrm{KD}} = -\sum_{k=1}^K t_k \log \hat{p}_k.

This is just cross-entropy again, but now the target is itself a learned distribution.

Hinton, Vinyals, and Dean's distillation work made this idea famous because the teacher's non-argmax probabilities carry extra structure:

  • which wrong classes are close to the right one
  • how uncertain the teacher is
  • how semantic similarity shows up in the probability tail

This extra structure is often called dark knowledge.

Temperature plays a special role in distillation. Teacher and student probabilities are often softened by a temperature \tau > 1:

t_k^{(\tau)} = \frac{e^{z_k^{\text{teacher}}/\tau}}{\sum_j e^{z_j^{\text{teacher}}/\tau}}, \qquad \hat{p}_k^{(\tau)} = \frac{e^{z_k^{\text{student}}/\tau}}{\sum_j e^{z_j^{\text{student}}/\tau}}.

The loss is then applied to the softened distributions. Why soften?

  • it reveals class relationships hidden by an almost one-hot teacher output
  • it gives the student a denser gradient signal
  • it emphasizes relative logit structure rather than only the top class

Distillation is therefore not some separate object outside cross-entropy. It is an advanced target-design strategy inside the same CE framework.
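A minimal sketch of temperature-softened distillation targets, with made-up teacher and student logits. Note that Hinton et al. additionally scale the soft term by \tau^2 and mix in a hard-label CE term, which this sketch omits:

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

teacher_logits = np.array([6.0, 3.5, 1.0, -2.0])   # hypothetical teacher
student_logits = np.array([4.0, 1.0, 2.0, -1.0])   # hypothetical student

for tau in (1.0, 4.0):
    t = softmax(teacher_logits, tau)   # softened teacher target
    s = softmax(student_logits, tau)   # softened student prediction
    kd = -np.sum(t * np.log(s))        # soft cross-entropy between them
    print(f"tau = {tau}: teacher = {np.round(t, 3)}, KD loss = {kd:.3f}")
```

At \tau = 1 the teacher is nearly one-hot; at \tau = 4 the near-miss classes carry visible mass, which is exactly the dark knowledge the text describes.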

6.3 Weighted, Masked, and Cost-Sensitive Cross-Entropy

Real training pipelines rarely use plain unweighted CE without modification. Three common variations matter a lot in practice.

Class weighting

If some classes are rare, we may weight the loss:

\ell_{\mathrm{wCE}} = -\sum_{k=1}^K w_k y_k \log \hat{p}_k.

This tells the optimizer to care more about mistakes on minority classes.

Binary positive weighting

For binary or multi-label BCE, a common implementation uses a pos_weight factor. PyTorch's BCEWithLogitsLoss documentation explicitly describes this as a recall/precision tradeoff mechanism under imbalance.

Masking and ignored positions

Sequence models often contain padding tokens or positions that should not contribute to the loss. Then we compute

\ell_{\mathrm{masked}} = \frac{\sum_{t=1}^T m_t \,\big(-\log q(x_t \mid x_{<t})\big)}{\sum_{t=1}^T m_t},

where m_t \in \{0,1\} is a mask.

This normalized masked average matters. If we divide by the wrong denominator, we change the scale of the loss and therefore the effective optimization dynamics.
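A made-up example of how the denominator choice changes the reported loss:

```python
import numpy as np

token_nll = np.array([2.3, 1.1, 0.7, 0.0, 0.0])   # last two entries are padding
mask      = np.array([1,   1,   1,   0,   0  ])

masked_loss = (mask * token_nll).sum() / mask.sum()   # valid tokens: 4.1 / 3
wrong_loss  = (mask * token_nll).sum() / len(mask)    # wrong denominator: 4.1 / 5

print(masked_loss, wrong_loss)   # ~1.367 vs 0.82 -- same model, different scale
```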

Cost-sensitive CE

Sometimes the real decision problem is asymmetric. False negatives may be far more expensive than false positives. In such cases, CE can be adapted with weights, though at some point the right object may become a different loss or a decision-thresholding scheme on top of probabilistic predictions.

The important conceptual point is that these are not new unrelated losses. They are modifications of how the same logarithmic scoring rule is aggregated over classes, examples, or positions.

This is worth emphasizing because engineers often treat weighting, masking, and ignoring as "just implementation details." They are implementation details, but they are also modeling decisions. They define which errors the empirical cross-entropy estimator treats as important.

6.4 When Cross-Entropy Fails

Cross-entropy is powerful, but it is not a universal answer.

Failure mode 1: label noise

If labels are corrupted, CE can aggressively fit the wrong target because it treats the observed label distribution as the truth to be matched.

Failure mode 2: extreme class imbalance

On heavily imbalanced datasets, unweighted CE may optimize average log-loss while producing poor minority-class recall.

Failure mode 3: overconfidence under distribution shift

A model can achieve good in-distribution CE and still assign unjustifiably high confidence to out-of-distribution inputs.

Failure mode 4: task-metric mismatch

A small gain in CE does not always correspond to a meaningful gain in the task metric. Translation quality, factuality, ranking quality, or calibrated risk may require more than plain next-step NLL.

Failure mode 5: structured dependence ignored

If the output has combinatorial or relational structure, tokenwise or classwise cross-entropy may fail to capture global consistency constraints.

WHEN PLAIN CE IS NOT ENOUGH
===============================================================

Problem                  What CE misses
---------------------------------------------------------------
Noisy labels             fits corrupted supervision too hard
Class imbalance          minority classes under-emphasized
OOD shift                can still be overconfident
Structured outputs       local token loss misses global structure
Decision asymmetry       equal log-loss is not equal real-world cost
===============================================================

This is not a criticism of cross-entropy so much as a boundary statement. CE is the right objective when the goal is honest probabilistic prediction under the given target distribution. If the problem itself differs from that objective, the loss may need to be modified or supplemented.

That framing helps avoid the unproductive question "Is CE good or bad?" It is better to ask:

  • what target distribution is being treated as truth?
  • what downstream decision problem actually matters?
  • which failure mode is dominant in this setting?

Those questions turn loss selection from ritual into principled design.

6.5 Preview of Alternative Losses

Cross-entropy sits inside a larger loss landscape.

  • Focal loss reweights CE to emphasize hard and rare examples.
  • Hinge loss prioritizes margin over probabilistic calibration.
  • Dice / Jaccard losses are used in segmentation when overlap is what matters most.
  • Bradley-Terry style binary CE reappears in preference-model training.
  • DPO-style preference objectives build on log-probability comparisons and KL-regularized preference optimization.

Forward reference: the broad taxonomy of these alternatives belongs in Loss Functions for Machine Learning. Here we keep the focus on cross-entropy as the canonical log-loss baseline.

The pedagogical point is worth keeping clear: many modern losses are easier to understand once cross-entropy is fully internalized. They often modify CE, add regularization to CE, or contrast CE against another objective.


7. Applications in Machine Learning

7.1 Probabilistic Classification

Cross-entropy is the default loss for probabilistic classification because it matches the problem formulation exactly. If the model outputs class probabilities, the log-loss directly evaluates how much probability the model assigns to the observed class.

This has several consequences that are better than accuracy-only training:

  • the model is encouraged to learn calibrated probabilities
  • errors are differentiated by confidence
  • uncertainty can be propagated into downstream systems

In binary logistic regression, BCE is the Bernoulli negative log-likelihood. In multiclass softmax regression, categorical CE is the categorical negative log-likelihood. Deep classifiers simply parameterize the same probabilistic form with more expressive feature extractors.

From a statistical standpoint, CE is the natural bridge from parametric conditional modeling to empirical risk minimization. From an engineering standpoint, it is the loss that turns logits into meaningful probabilistic feedback.

This is why so many mature classification stacks report both accuracy and CE. Accuracy says whether the decision was correct. CE says how much probability the model placed on what happened. The pair is far more informative than either metric alone.

7.2 Transformer Training and Label-Smoothed Sequence Loss

The original Transformer recipe made cross-entropy a central part of modern sequence-model engineering. During training, token predictions were supervised with label-smoothed categorical CE rather than hard one-hot CE.

That decision encodes a practical philosophy:

  • next-token prediction is probabilistic
  • excessive certainty is often harmful
  • small target-side entropy can regularize the training dynamics

In sequence-to-sequence and language-model settings, CE is usually computed over all valid output positions and averaged over non-padding tokens. Teacher forcing means the model is conditioned on the ground-truth past tokens while predicting the next token. The resulting loss is still just cross-entropy over conditional categorical distributions, but the conditioning context now includes a long history.

This is one place where the information-theory view is especially clarifying. The loss is not "token accuracy with differentiability attached." It is average surprise under the model's conditional next-token distribution.

That perspective also explains why improvements in CE can feel small numerically but large behaviorally. Shaving a few hundredths off token-level CE across billions of predictions corresponds to a systematic reduction in surprise over an enormous event space. In large-scale language modeling, tiny changes in average log-loss can reflect meaningful improvements in predictive structure.

7.3 LLM Pretraining, Tokenization, and Perplexity

Autoregressive LLM pretraining is next-token cross-entropy minimization at scale. Every token position contributes a term

-\log q_\theta(x_t \mid x_{<t}).

The model is rewarded for allocating high probability to the actual next token in context. Over trillions of tokens, this becomes a massive empirical estimate of sequence cross-entropy.

Why does this single objective produce such broad capability? Because reducing next-token surprise requires modeling many kinds of structure at once:

  • grammar
  • discourse
  • facts
  • code syntax
  • style
  • local and long-range dependencies

Cross-entropy itself does not "know" these categories. It only measures whether the right token received enough probability mass. But minimizing that criterion across diverse corpora forces the model to learn the latent regularities that make the corpus predictable.

Perplexity is the most common scalar evaluation summary of this process:

\operatorname{PPL} = \exp(\text{average token cross-entropy}).

But perplexity has caveats:

  • it depends on tokenization
  • it depends on evaluation context length
  • it need not correlate perfectly with downstream usefulness
  • it is natural for causal language modeling but not for masked-LM objectives

So cross-entropy gives us a principled training objective and perplexity gives us a principled proxy metric, but both must be interpreted in context.

There is also a social reason this matters in 2026. Public discussion of model quality often compresses everything into one number. Perplexity is tempting for that role because it is grounded and easy to compute. But a mature evaluation culture treats it as one signal among many, not as a universal summary of intelligence or usefulness.

7.4 Teacher-Student Compression

When a student model is trained to imitate a teacher distribution, soft cross-entropy becomes an information-transfer channel.

Why can a student learn more from the teacher probabilities than from hard labels alone?

  • the teacher reveals confusion structure
  • near-miss classes get nonzero probability
  • similarity relations among classes become visible

For example, a teacher image classifier may assign

[0.72, 0.20, 0.05, 0.03]

to four classes rather than

[1, 0, 0, 0].

The second vector says only "class 1 is correct." The first says "class 2 is meaningfully plausible, classes 3 and 4 much less so." That relative structure is often exactly what a smaller student needs in order to generalize well.

Cross-entropy is the right tool here because it compares full distributions, not just argmax labels.

7.5 Preference and Ranking Loss Previews

Cross-entropy reappears outside ordinary classification.

In reward modeling for preference learning, binary or pairwise CE is often used to distinguish preferred responses from dispreferred ones. A Bradley-Terry style form models the probability that response aa is preferred to response bb and then trains with binary log-loss on the observed comparison.

In ranking systems, related CE-style objectives supervise the probability that relevant items outrank irrelevant ones. In contrastive systems, normalized softmax objectives are close relatives of categorical CE.

The broader lesson is that cross-entropy is the native loss whenever the target of learning is a probability over alternatives:

  • classes
  • tokens
  • pairwise preferences
  • teacher-guided candidate distributions

Once you see that pattern, a large part of modern AI objective design starts to look like special-purpose packaging around the same logarithmic core.

That recognition is empowering. It means that new objectives become easier to learn because you can ask:

  1. what are the alternatives?
  2. what probability distribution is being modeled over them?
  3. where is the cross-entropy hidden inside the objective?

Often the answer is "very close to the surface."


8. Common Mistakes

| # | Mistake | Why It Is Wrong | Fix |
|---|---------|-----------------|-----|
| 1 | Treating cross-entropy and entropy as the same thing | Entropy depends on one distribution; cross-entropy depends on a source-model pair (p, q) | Keep the model mismatch explicitly visible |
| 2 | Saying CE is "just classification loss" | CE is an information-theoretic quantity that becomes a loss under probabilistic modeling | Learn the coding and NLL interpretations too |
| 3 | Applying softmax and then calling CrossEntropyLoss | PyTorch CE expects logits and does log-softmax internally for stability | Pass raw logits to fused CE routines |
| 4 | Using sigmoid and then BCELoss when logits are available | This is less numerically stable than a fused logits-based version | Use BCEWithLogitsLoss |
| 5 | Comparing perplexity across different tokenizers | Different tokenizers define different event spaces and token counts | Compare PPL only under matched tokenization/eval protocol |
| 6 | Assuming low CE implies perfect calibration | CE is strictly proper in theory, but finite models and finite data can still be miscalibrated | Check calibration separately and use temperature scaling if needed |
| 7 | Thinking label smoothing is always good | It changes the target distribution and can interfere with some goals | Use smoothing deliberately, not by reflex |
| 8 | Ignoring support mismatch | If q(x)=0 where p(x)>0, cross-entropy is infinite | Smooth or stabilize models so valid events retain support |
| 9 | Using hard labels when soft targets are available | Hard labels discard structure among alternatives | Use soft CE for distillation or uncertain supervision |
| 10 | Averaging masked sequence loss with the wrong denominator | This changes effective gradient scale and can bias comparisons | Normalize by the number of valid tokens |
| 11 | Treating CE gains as task-metric guarantees | Lower CE does not always imply better BLEU, calibration, fairness, or retrieval quality | Evaluate with the task metric too |
| 12 | Calling CE symmetric | H(p,q) \neq H(q,p) in general | Keep argument order explicit |
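Mistakes 3, 4, and 7 are easy to show in a few lines. A sketch of the intended API usage (random tensors are illustrative; label_smoothing is a real CrossEntropyLoss argument in modern PyTorch):

```python
# Pass raw logits to fused losses; opt in to label smoothing deliberately.
import torch
import torch.nn as nn

logits = torch.randn(4, 10)        # multiclass logits: no softmax applied
labels = torch.randint(10, (4,))

ce = nn.CrossEntropyLoss()         # applies log-softmax internally
ce_smooth = nn.CrossEntropyLoss(label_smoothing=0.1)  # deliberate smoothing
print(ce(logits, labels), ce_smooth(logits, labels))

bin_logits = torch.randn(4)        # binary logits: no sigmoid applied
bin_targets = torch.randint(2, (4,)).float()
bce = nn.BCEWithLogitsLoss()       # fused sigmoid + BCE, numerically stable
print(bce(bin_logits, bin_targets))
```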

9. Exercises

  1. Exercise 1 (*): Discrete Cross-Entropy by Hand
    Let p=(0.7,0.2,0.1) and q=(0.5,0.3,0.2). Compute H(p), H(p,q), and D_{\mathrm{KL}}(p\|q), then verify the identity H(p,q)=H(p)+D_{\mathrm{KL}}(p\|q).

  2. Exercise 2 (*): One-Hot Targets and Log-Loss
    Show that if the target is a point mass on class c, then categorical cross-entropy reduces to -\log \hat{p}_c. Explain why this means CE punishes confident mistakes so strongly.

  3. Exercise 3 (*): Bernoulli Likelihood to BCE
    Starting from the Bernoulli model q_\theta(y \mid \mathbf{x})=\hat{p}^{\,y}(1-\hat{p})^{1-y}, derive the binary cross-entropy formula.

  4. Exercise 4 (**): Softmax Cross-Entropy Gradient
    Derive \nabla_{\mathbf{z}}\ell = \hat{\mathbf{p}}-\mathbf{y} for the one-hot categorical case.

  5. Exercise 5 (**): Sequence Cross-Entropy and Perplexity
    Given token log-likelihoods for a short sequence, compute the average cross-entropy per token and convert it to perplexity. Then explain why the same model can have different perplexities under different tokenizers.

  6. Exercise 6 (**): Stable Log-Sum-Exp
    Implement naive softmax -> log and stable log_softmax for large logits, then demonstrate why the naive version fails numerically.

  7. Exercise 7 (**): Label Smoothing or Distillation
    On a toy multiclass example, compare one-hot CE against smoothed-target CE or teacher-target CE. Explain what supervision information the soft target adds.

  8. Exercise 8 (**): Weighted or Masked Cross-Entropy in Practice
    Construct either:

    • a class-imbalanced classification example and compare weighted vs unweighted CE, or
    • a padded sequence example and compare correct masked averaging to naive averaging.

10. Why This Matters for AI (2026 Perspective)

| Concept | AI Impact |
|---------|-----------|
| Cross-entropy as NLL | The default training objective for probabilistic classification and next-token prediction |
| H(p,q)=H(p)+D_{\mathrm{KL}}(p\Vert q) | Turns model fitting into distribution matching with a clear irreducible baseline |
| Binary CE | Standard loss for binary and multi-label prediction, reward-model heads, and many ranking tasks |
| Categorical CE | Standard loss for multiclass prediction and vocab-size softmax training |
| Sequence CE | The core pretraining loss of autoregressive LLMs |
| Perplexity | A direct exponential transform of token-level CE for causal language models |
| Stable logits-based CE | Prevents overflow/underflow in real training systems |
| Strictly proper scoring rule | Justifies CE when the goal is honest probabilistic prediction |
| Label smoothing | Reduces overconfidence and often improves generalization or calibration |
| Soft-target CE | Enables distillation and richer supervision than hard labels alone |
| Weighted and masked CE | Makes CE usable in imbalanced datasets and padded sequence pipelines |
| CE failure modes | Reminds us when other objectives or evaluation metrics are needed |

Cross-entropy matters in 2026 because frontier AI systems are increasingly judged not only by whether they pick the right answer, but by how they distribute probability mass, how stable they train, how useful their uncertainty is, and how efficiently they compress predictive structure from vast datasets. CE is the language that connects all of those concerns.

It is also one of the few quantities whose theoretical meaning survives contact with engineering reality. The same mathematics explains textbook coding, PyTorch loss APIs, Transformer training recipes, distillation pipelines, and perplexity reports on LLM benchmarks. That level of continuity is rare and valuable.

For someone building AI systems in 2026, this matters because cross-entropy is not a niche theorem you memorize and forget. It is a daily design object:

  • when reading training logs
  • when choosing a loss API
  • when interpreting perplexity changes
  • when debugging overconfidence
  • when deciding whether soft targets are worth the complexity

11. Conceptual Bridge

Cross-entropy is the natural continuation of the first three information-theory sections. Entropy introduced intrinsic uncertainty. KL Divergence introduced mismatch between distributions. Mutual Information introduced uncertainty reduction between variables. Cross-entropy combines the first two directly:

H(p,q)=H(p)+D_{\mathrm{KL}}(p\|q).

That equation is not just a neat identity. It tells you exactly what model training is doing: preserving the unavoidable uncertainty term while shrinking the mismatch term as much as the model class allows.
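The decomposition is easy to check numerically. A quick sketch on a toy pair of distributions (the probability values are arbitrary illustrative choices):

```python
# Verify H(p,q) = H(p) + KL(p||q) on a toy example; values are illustrative.
import torch

p = torch.tensor([0.6, 0.3, 0.1])
q = torch.tensor([0.4, 0.4, 0.2])

H_p = -(p * p.log()).sum()      # entropy: the irreducible uncertainty term
H_pq = -(p * q.log()).sum()     # cross-entropy: uncertainty under the model
KL = (p * (p / q).log()).sum()  # mismatch term

print(H_pq, H_p + KL)           # the two sides agree
```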

Looking forward, the next conceptual step is Fisher Information. Cross-entropy tells us how costly model mismatch is on average. Fisher information tells us how sharply the log-likelihood changes under infinitesimal parameter perturbations. So CE gives the global objective and Fisher gives the local geometry.

Cross-entropy also connects outward to other parts of the curriculum:

CROSS-ENTROPY IN THE CURRICULUM
===============================================================

Entropy
  intrinsic uncertainty
        |
        v
KL divergence
  mismatch between distributions
        |
        v
Cross-entropy
  uncertainty measured under a model
        |
        +--> maximum likelihood
        +--> classification loss
        +--> sequence modeling
        +--> perplexity
        +--> distillation / label smoothing
        |
        v
Fisher information
  local curvature of log-likelihood geometry
===============================================================

If you keep only one idea from this section, let it be this:

Cross-entropy is the quantitative price we pay when our predicted distribution differs from reality.

That sentence unifies coding, likelihood, classification, language modeling, calibration, and distillation in one line.

It also sets up the next section naturally. Once we know how costly model mismatch is in expectation, the next mathematical question is:

How sensitive is that cost to tiny changes in the model parameters?

That is exactly the doorway into Fisher information.

There is also a backward-looking lesson here. Entropy, KL divergence, mutual information, and cross-entropy are not four disconnected formulas to memorize. They are a tightly linked family:

  • entropy measures intrinsic uncertainty
  • KL divergence measures mismatch
  • mutual information measures uncertainty reduction
  • cross-entropy measures uncertainty viewed through a model

Seeing those relationships clearly is part of becoming fluent in the language of modern probabilistic machine learning.

That fluency is exactly what this chapter is building.

Cross-entropy is one of its central dialects.

References

  1. Shannon, C. E. (1948). "A Mathematical Theory of Communication." Bell System Technical Journal.
  2. Kullback, S., and Leibler, R. A. (1951). "On Information and Sufficiency." Annals of Mathematical Statistics.
  3. Cover, T. M., and Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley.
  4. MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
  5. Duchi, J. C. Statistics 311 / EE377 Lecture Notes.
  6. Stanford EE276. "Lecture 2: Entropy and Related Quantities."
  7. PyTorch Documentation. torch.nn.CrossEntropyLoss.
  8. PyTorch Documentation. torch.nn.BCEWithLogitsLoss.
  9. Hugging Face Transformers Documentation. "Perplexity of fixed-length models."
  10. Vaswani, A., et al. (2017). "Attention Is All You Need." arXiv.
  11. Hinton, G., Vinyals, O., and Dean, J. (2015). "Distilling the Knowledge in a Neural Network." arXiv.
  12. Müller, R., Kornblith, S., and Hinton, G. (2019). "When Does Label Smoothing Help?" NeurIPS.
  13. Dawid, A. P., and Musio, M. (2014). "Theory and Applications of Proper Scoring Rules." Metron.
  14. The Book of Statistical Proofs. "The log probability scoring rule is a strictly proper scoring rule."