Cross-Entropy

"The right question is not whether a model predicts the correct label once, but whether it assigns the right probabilities every time."

Overview

Cross-entropy is one of the rare mathematical quantities that is simultaneously an information measure, a coding cost, a statistical objective, and a practical training loss. In information theory it measures the average number of bits or nats required when data generated by a true distribution p is encoded using a code optimized for some other distribution q. In machine learning the same quantity appears as the expected negative log-likelihood, which is why it became the default objective for classification, language modeling, and many forms of probabilistic prediction.

That dual identity is what makes cross-entropy so important for AI. When we train a classifier with categorical cross-entropy, we are not merely punishing wrong answers. We are asking the model to match the full target distribution as closely as possible in log-loss terms. When we train an autoregressive language model, we are minimizing average next-token surprise. When we evaluate a model with perplexity, we are exponentiating a token-level cross-entropy. When we do knowledge distillation, label smoothing, masked sequence training, or soft supervision, we are still working inside the same cross-entropy geometry.

This section gives cross-entropy its own careful treatment rather than burying it inside a generic loss-function survey. The neighboring sections already cover entropy, KL divergence, and mutual information in their canonical homes. Here we focus on what is unique about cross-entropy itself: its decomposition, its coding meaning, its role as negative log-likelihood, its stable numerical implementation, and its modern use in 2026-era AI systems.

Prerequisites

Companion Notebooks

| Notebook | Description |
|---|---|
| theory.ipynb | Interactive derivations, coding examples, stable softmax/log-sum-exp demos, and perplexity experiments |
| exercises.ipynb | 10 graded exercises covering theory identities, BCE/CE derivations, stable computation, and modern AI use cases |

Learning Objectives

After completing this section, you will be able to:

  • Define discrete, conditional, continuous, and sequence cross-entropy cleanly
  • Derive the identity H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q)
  • Explain cross-entropy as wrong-code expected length and as expected negative log-likelihood
  • Distinguish entropy from cross-entropy and explain why the latter is model-dependent
  • Derive binary cross-entropy from the Bernoulli likelihood
  • Derive categorical cross-entropy from softmax logits and explain the fused logit-space form
  • Compute sequence cross-entropy and convert it into perplexity
  • Explain why cross-entropy is a strictly proper scoring rule for probabilistic prediction
  • Derive the logit-space gradient \hat{\mathbf{p}} - \mathbf{y} and interpret it geometrically
  • Implement numerically stable log_softmax, binary CE from logits, and masked token loss
  • Understand label smoothing, soft targets, and knowledge distillation as target-distribution modifications
  • Recognize when plain cross-entropy is the right tool and when it needs augmentation

1. Intuition

1.1 Cross-Entropy as Wrong-Code Cost

The cleanest intuition for cross-entropy comes from coding. Suppose a source really emits symbols according to a true distribution p, but we design our code as if the source followed another distribution q. The code lengths suggested by q are approximately -\log q(x). If the source actually emits x \sim p, then the expected code length becomes

H(p,q) = -\sum_x p(x)\log q(x).

That is cross-entropy.

Entropy H(p) is the coding cost when the code is optimized for the true source. Cross-entropy H(p,q) is the coding cost when we compress data from p using a code optimized for q. The gap between them is the cost of model mismatch.

This is already enough to understand why cross-entropy became central in machine learning. A predictive model is a probabilistic code for future data. If the model assigns high probability to what actually happens, the code is short and the cross-entropy is low. If the model assigns low probability to what happens, the code is long and the cross-entropy is high.

CODING VIEW OF CROSS-ENTROPY
===============================================================

True source:
  data are generated from p(x)

Model we use:
  code lengths are based on q(x)

Per-symbol code length under q:
  L_q(x) ≈ -log q(x)

Expected length on true data:
  E_{x ~ p}[-log q(x)] = H(p, q)

Special case:
  if q = p, then H(p, q) = H(p)

Interpretation:
  cross-entropy = average surprise under the model
===============================================================
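To make the wrong-code cost concrete, here is a minimal NumPy sketch; the three-symbol source p and model q are made-up values for illustration. It compares the optimal average code length H(p) with the mismatched length H(p,q):

```python
import numpy as np

# Source distribution p and a mismatched model q over three symbols.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.8, 0.1, 0.1])

code_len_q = -np.log2(q)              # code lengths the model q suggests, in bits
H_p  = -(p * np.log2(p)).sum()        # best achievable average length
H_pq = (p * code_len_q).sum()         # average length actually paid under q

print(f"H(p)   = {H_p:.3f} bits/symbol")
print(f"H(p,q) = {H_pq:.3f} bits/symbol")   # always >= H(p); the gap is KL(p||q)
```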

There is a deep practical moral here: cross-entropy does not care whether the model outputs the correct class index alone. It cares whether the model allocates probability mass in the right way. That makes it stronger than simple accuracy as a training signal, because accuracy ignores confidence and only judges the final argmax.

1.2 From Entropy to Cross-Entropy

Entropy asks:

H(p) = -\sum_x p(x)\log p(x).

Cross-entropy asks:

H(p,q) = -\sum_x p(x)\log q(x).

The difference is small in notation and enormous in meaning.

In entropy, the same distribution appears twice. The source both generates the data and defines the log-cost. In cross-entropy, the source p generates the data but the model q assigns the costs. So cross-entropy is not purely a property of the source. It is a property of the pair (p,q).

That pairwise dependence is what makes cross-entropy useful for learning. If the learner changes q, the cross-entropy changes. Entropy H(p) is fixed by the data-generating process; cross-entropy H(p,q) is the quantity we can optimize by improving the model.

The basic story can be phrased three equivalent ways:

  1. Coding view: extra expected code length from using the wrong code.
  2. Prediction view: average negative log-probability assigned to observed outcomes.
  3. Optimization view: the objective minimized by maximum likelihood.

These three viewpoints are mathematically identical, but each highlights a different piece of intuition:

  • coding explains why the quantity is called "cross-entropy"
  • prediction explains why low-probability mistakes are punished so sharply
  • optimization explains why nearly every probabilistic classifier minimizes it
ENTROPY VS CROSS-ENTROPY
===============================================================

Entropy:
  H(p)   = average surprise under the true source itself

Cross-entropy:
  H(p,q) = average surprise of true data under model q

If q is good:
  H(p,q) is close to H(p)

If q is poor:
  H(p,q) is much larger than H(p)

So:
  entropy  = intrinsic uncertainty
  cross-entropy = uncertainty + model mismatch
===============================================================

This is also why the sign and the logarithm matter. The logarithm turns probability multiplication into addition over sequences, and the negative sign makes high-probability events cheap while making low-probability events expensive. A model that assigns 10^{-6} probability to a true event is not slightly wrong. Under log-loss, it is catastrophically wrong.

1.3 Why Cross-Entropy Matters for AI

Cross-entropy appears almost everywhere modern AI systems need calibrated probabilistic prediction.

| AI setting | Random objects | Role of cross-entropy |
|---|---|---|
| Binary classification | label Y \in \{0,1\}, model score q_\theta(Y \mid X) | binary CE is the Bernoulli negative log-likelihood |
| Multiclass classification | class Y \in [K] | categorical CE trains the model to match class probabilities |
| Language modeling | next token X_t given context X_{<t} | sequence CE is the average next-token NLL |
| Distillation | teacher distribution, student distribution | soft cross-entropy transfers dark knowledge |
| Label smoothing | hard target, softened target | CE with softened targets discourages overconfidence |
| Masked training | padded or ignored positions | masked CE trains only on valid positions |
| Calibration | predicted confidence vs reality | CE is a proper scoring rule, but raw CE minimization alone does not guarantee perfect post-hoc calibration |

For LLMs specifically, cross-entropy is the pretraining objective that turned next-token prediction into a general-purpose foundation-model paradigm. A model that minimizes sequence cross-entropy on a broad corpus must learn syntax, world knowledge, code regularities, discourse patterns, and many latent structure constraints, because each of those reduces token surprise.

Cross-entropy also sits at the heart of evaluation:

  • token-level CE underlies perplexity
  • masked or weighted CE appears in seq2seq training
  • binary CE appears in reward-model and preference-model components
  • soft CE appears in distillation and teacher-guided fine-tuning

The quantity is therefore not a classroom loss used only in toy classifiers. It is one of the main mathematical interfaces between information theory and large scale model training.

1.4 Historical Timeline

CROSS-ENTROPY -- KEY MILESTONES
===============================================================

1948  Shannon
      Entropy and coding length become central objects in
      communication theory.

1951  Kullback and Leibler
      Relative entropy clarifies the decomposition
      cross-entropy = entropy + KL divergence.

1960s-1980s
      Log-likelihood and log-loss become standard tools in
      statistics and probabilistic classification.

1980s-1990s
      Neural networks adopt cross-entropy because it aligns
      naturally with sigmoid and softmax outputs.

2014-2015
      Distillation and soft targets use cross-entropy between
      teacher and student distributions.

2017  Transformer
      Label-smoothed sequence cross-entropy becomes a standard
      training recipe in large sequence models.

2020-2026
      Cross-entropy remains the default pretraining objective
      for autoregressive and many supervised AI systems, while
      perplexity remains a core LM evaluation metric.
===============================================================

The historical pattern is important. Cross-entropy was not invented as a neural-network trick. It was inherited from information theory and statistics, then rediscovered as exactly the right loss for probabilistic prediction. That heritage explains why the same quantity appears in code design, likelihood, calibration, classification, and sequence modeling.


2. Formal Definitions

2.1 Discrete Cross-Entropy

Let p and q be probability mass functions on the same discrete support \mathcal{X}. The discrete cross-entropy of p relative to q is

H(p,q) = -\sum_{x \in \mathcal{X}} p(x)\log q(x),

provided the sum is well defined.

This formula should be read carefully:

  • the expectation is taken under p
  • the logarithmic score is computed using q

So p is the source of the data and q is the model being judged.

If X \sim p, we can write the same quantity more compactly as

H(p,q) = \mathbb{E}_{X \sim p}[-\log q(X)].

This expectation form is often the more useful one in machine learning because training datasets are empirical samples from the underlying data distribution.

Example 1: Bernoulli source. If p = \operatorname{Bern}(0.8) and q = \operatorname{Bern}(0.6), then

H(p,q) = -0.8\log 0.6 - 0.2\log 0.4.

The source emits ones much more often than zeros, so the penalty paid by q depends more heavily on how much mass it gives to the event 1.

Example 2: One-hot target. If the true label is a deterministic class c, then p is a point mass \delta_c, and the cross-entropy becomes

H(\delta_c, q) = -\log q(c).

This is exactly the common supervised-learning loss on a single example.
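As a quick check of both examples, here is a minimal NumPy sketch; cross_entropy is an illustrative helper, not a library routine:

```python
import numpy as np

def cross_entropy(p, q):
    """Discrete cross-entropy H(p, q) in nats: -sum_x p(x) log q(x)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

# Example 1: Bernoulli source p = Bern(0.8), model q = Bern(0.6).
p = [0.2, 0.8]   # P(X = 0), P(X = 1)
q = [0.4, 0.6]
print(cross_entropy(p, q))   # -0.8 log 0.6 - 0.2 log 0.4 ~ 0.592 nats
print(cross_entropy(p, p))   # entropy H(p) ~ 0.500 nats, the lower bound

# Example 2: a one-hot target delta_c reduces H(delta_c, q) to -log q(c).
c = 1
one_hot = np.eye(2)[c]
print(cross_entropy(one_hot, q), -np.log(q[c]))   # identical values
```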

Non-example. It is wrong to write "cross-entropy between two arbitrary vectors" unless the vectors are valid probability distributions or are clearly interpreted as such. Raw logits are not probabilities. Applying the CE formula directly to logits is nonsensical.

2.2 Conditional and Empirical Cross-Entropy

In supervised learning the model is conditional. Instead of a single distribution q, we have q_\theta(y \mid x). The population cross-entropy is then

\mathcal{L}(\theta) = \mathbb{E}_{(X,Y)\sim p}[-\log q_\theta(Y \mid X)].

This is just conditional cross-entropy averaged over inputs:

H_p(Y \mid X; q_\theta) = \mathbb{E}_{(X,Y)\sim p}[-\log q_\theta(Y \mid X)].

The notation varies in the literature, but the substance does not: the model is penalized according to the log-probability it assigns to the observed label conditional on the input.

Given a dataset

\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^n,

the empirical cross-entropy is

\widehat{\mathcal{L}}_n(\theta) = \frac{1}{n}\sum_{i=1}^n -\log q_\theta(y^{(i)} \mid \mathbf{x}^{(i)}).

This is the quantity we actually minimize in training.

There are two layers of approximation here:

  1. the finite dataset approximates the true population distribution
  2. the model class \{q_\theta\} may or may not be rich enough to represent the true conditional law

That distinction matters. The theoretical object is population cross-entropy. The implemented object is empirical cross-entropy. Their gap is part of generalization theory.

2.3 Continuous Cross-Entropy

If p and q are densities on a continuous support, the continuous cross-entropy is

H(p,q) = -\int p(x)\log q(x)\,dx,

whenever the integral exists.

This is the direct analog of the discrete formula, but it deserves caution:

  • densities can exceed 1, so -\log q(x) can be negative
  • continuous cross-entropy need not share all the intuitions of discrete code lengths unless we interpret it through discretization limits or relative density comparisons
  • support mismatch can make the integral infinite

Even so, continuous cross-entropy is still the expected negative log-density under the source p, and it still satisfies the same decomposition with KL divergence when the relevant quantities are finite.

Example. If p is a Gaussian data law and q_\theta is a Gaussian model with parameterized mean and variance, minimizing cross-entropy is equivalent to maximum likelihood estimation of the Gaussian parameters.

This continuous version is especially important in density modeling, diffusion components, and variational objectives, even if the pure discrete version is the one most learners first encounter in classification.

2.4 Sequence Cross-Entropy and Cross-Entropy Rate

Suppose a sequence source emits

X_{1:T} = (X_1, \dots, X_T).

If the model factorizes autoregressively as

q(x_{1:T}) = \prod_{t=1}^T q(x_t \mid x_{<t}),

then the sequence cross-entropy is

H(p,q) = \mathbb{E}_{X_{1:T}\sim p}\left[-\log q(X_{1:T})\right] = \mathbb{E}_{X_{1:T}\sim p}\left[-\sum_{t=1}^T \log q(X_t \mid X_{<t})\right].

This additive decomposition is why language-model training looks like "sum the token losses." It is not a heuristic. It is exactly the log-factorization of the joint model.

The average per-token cross-entropy is

\frac{1}{T}\,\mathbb{E}_{X_{1:T}\sim p}\left[-\sum_{t=1}^T \log q(X_t \mid X_{<t})\right].

For stationary sources, one often studies the limiting quantity

\overline{H}(p,q) = \lim_{T\to\infty} \frac{1}{T} H(p_{1:T}, q_{1:T}),

when the limit exists. This is the cross-entropy rate.

For LLMs, the finite-sample per-token quantity is the practical object. It is reported in nats or bits per token, and its exponentiation yields perplexity after the usual normalization conventions are fixed.
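A minimal sketch of the finite-sample per-token computation, assuming some hypothetical per-token model probabilities:

```python
import numpy as np

# Hypothetical per-token model probabilities q(x_t | x_{<t}) for one sequence.
token_probs = np.array([0.20, 0.05, 0.60, 0.30, 0.10])

token_nll = -np.log(token_probs)   # per-token surprise, in nats
seq_nll   = token_nll.sum()        # -log q(x_{1:T}), the joint NLL
avg_nll   = token_nll.mean()       # per-token cross-entropy estimate
ppl       = np.exp(avg_nll)        # perplexity under natural logs

print(f"sequence NLL = {seq_nll:.3f} nats")
print(f"per-token CE = {avg_nll:.3f} nats, perplexity = {ppl:.2f}")
```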

2.5 Support Mismatch, Zero Probabilities, and Log Bases

Cross-entropy looks harmless until the model assigns zero probability to an event that actually occurs.

If p(x) > 0 for some x but q(x) = 0, then

-\log q(x) = \infty,

so the cross-entropy is infinite.

This is not a technical annoyance. It is conceptually correct. A model that declares an actually possible event to be impossible deserves infinite logarithmic penalty.

This has several practical consequences:

  • smoothed probabilities are often used to avoid literal zeros
  • stable implementations work with logits to avoid underflow to zero
  • support assumptions matter whenever one compares distributions

We also need to specify the logarithm base:

  • \log_2 gives cross-entropy in bits
  • the natural logarithm \log gives cross-entropy in nats

In optimization, changing the log base only rescales the loss by a constant factor, so the minimizer is unchanged. In information-theoretic interpretation, the base determines the measurement unit.

Important edge case. Cross-entropy is only meaningful when p and q live on the same event space. Comparing a word-level distribution to a character-level distribution without careful alignment is not valid. This becomes especially relevant when people compare perplexities across different tokenizers.


3. Core Theory I: Identities, Bounds, and Coding

3.1 Cross-Entropy = Entropy + KL Divergence

The defining identity of this section is

H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q).

It is the cleanest mathematical statement of what cross-entropy measures: intrinsic uncertainty plus model mismatch.

Derivation. Start from the definition:

H(p,q) = -\sum_x p(x)\log q(x).

Add and subtract \log p(x) inside the sum:

H(p,q) = -\sum_x p(x)\log p(x) + \sum_x p(x)\log \frac{p(x)}{q(x)}.

The first term is H(p). The second is D_{\mathrm{KL}}(p\|q). Hence

H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q).

This identity matters so much because it immediately turns cross-entropy minimization into KL minimization whenever p is fixed.

In learning problems, p is the data-generating distribution and does not depend on the model parameters \theta. Therefore

\arg\min_\theta H(p, q_\theta) = \arg\min_\theta D_{\mathrm{KL}}(p\|q_\theta).

This is the theoretical bridge from information theory to maximum likelihood.
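A quick numerical verification of the decomposition, using randomly generated distributions p and q:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / v.sum()

p = normalize(rng.random(5))
q = normalize(rng.random(5))

H_p  = -np.sum(p * np.log(p))        # entropy H(p)
H_pq = -np.sum(p * np.log(q))        # cross-entropy H(p, q)
KL   =  np.sum(p * np.log(p / q))    # D_KL(p || q)

assert np.isclose(H_pq, H_p + KL)    # H(p,q) = H(p) + KL(p||q)
assert H_pq >= H_p                   # the cross-entropy lower bound
```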

3.2 Lower Bounds and Equality Cases

Because KL divergence is nonnegative,

D_{\mathrm{KL}}(p\|q) \ge 0,

the decomposition immediately gives

H(p,q) \ge H(p).

Equality holds if and only if p = q almost everywhere on the support of p.

This is the core lower-bound fact:

  • the true entropy H(p) is the irreducible baseline
  • every imperfect model pays an extra KL tax

In machine-learning language, this says a model cannot beat the entropy of the true label/source distribution when evaluated under the true probabilistic task. Any extra loss above that baseline is mismatch.

There are three common special cases:

  1. Perfect modeling: if q = p, then H(p,q) = H(p).
  2. Mild mismatch: if q is close to p, the cross-entropy is only a little larger than the entropy.
  3. Support error: if q assigns zero probability where p assigns positive probability, cross-entropy becomes infinite.

The bound is often used implicitly in language modeling. People say "the model cannot push perplexity below the entropy rate of the data source," which is the sequence version of the same statement.

3.3 Mismatched Coding Interpretation

Cross-entropy is often introduced through coding, but the coding interpretation deserves more than a slogan.

If a code optimized for distribution q uses approximate codelengths -\log q(x), then the average codelength when the true source is p is

\mathbb{E}_{X\sim p}[-\log q(X)] = H(p,q).

The optimal codelength under the true source is instead H(p). Therefore the excess expected codelength is

H(p,q) - H(p) = D_{\mathrm{KL}}(p\|q).

That difference is not merely "some statistical gap." It is the exact expected coding inefficiency incurred by modeling the source incorrectly.

MISMATCHED CODING
===============================================================

Best possible code for source p:
  expected length = H(p)

Code built from model q:
  expected length = H(p, q)

Extra cost of using q instead of p:
  H(p, q) - H(p) = D_KL(p || q)

Meaning:
  KL divergence is the coding penalty for model mismatch
===============================================================

This picture makes the name "cross-entropy" less mysterious. We take the entropy-style expectation under one distribution and cross it with logarithmic costs defined by another.

It also explains why cross-entropy remains meaningful when training models with soft targets. A teacher distribution is effectively a richer probabilistic code than a one-hot label, so matching it with soft cross-entropy means preserving a more detailed coding structure.

3.4 Conditional Chain Rules and Factorizations

Cross-entropy inherits the same additive structure that makes log-likelihoods so useful.

If p(x,y) = p(x)\,p(y \mid x) and q(x,y) = q(x)\,q(y \mid x), then

H(p_{XY}, q_{XY}) = \mathbb{E}_{(X,Y)\sim p}\left[-\log q(X,Y)\right] = \mathbb{E}_{(X,Y)\sim p}\left[-\log q(X) - \log q(Y \mid X)\right] = H(p_X, q_X) + \mathbb{E}_{X\sim p}\left[H(p_{Y\mid X}, q_{Y\mid X})\right].

So conditional cross-entropy decomposes naturally:

H(p_{XY}, q_{XY}) = H(p_X, q_X) + H_p(Y \mid X; q).

This is the formal reason that sequence losses, masked losses, and structured prediction losses can be summed over steps, positions, or factors.

In autoregressive modeling, the factorization

q(x_{1:T}) = \prod_{t=1}^T q(x_t \mid x_{<t})

turns the joint cross-entropy into a sum of token-level terms. In conditional classification, the same logic turns an expectation over labels into a sample average of negative log-conditional probabilities.

The chain-rule viewpoint is especially useful when debugging training setups. If a loss is an average over positions, one should ask:

  • which positions are included?
  • which are masked?
  • are we averaging over tokens, sequences, or batches?

Those choices change the empirical estimator, even though the underlying population quantity is still a conditional cross-entropy.

3.5 Cross-Entropy Rate for Sources

For long sequences produced by a stationary source, one is often less interested in the total code length than in the long-run average cost per symbol. This is the cross-entropy rate:

\overline{H}(p,q) = \lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}_{X_{1:T}\sim p}\left[-\log q(X_{1:T})\right].

If both source and model are i.i.d., this reduces to the ordinary one-step cross-entropy. But for dependent sources such as language, the rate absorbs long-range conditional structure.

This is why next-token prediction is not just a convenient engineering trick. The autoregressive cross-entropy rate measures the average uncertainty the model still has about the next symbol once it sees the past.

In practice, finite-context models only approximate this ideal because they cannot condition on an infinite past. Hugging Face's perplexity documentation explicitly warns that fixed-context models need sliding-window approximations if we want a better estimate of the fully factorized sequence probability. That is an implementation detail with theoretical teeth: context truncation changes the conditional distribution and therefore changes the measured cross-entropy.


4. Core Theory II: From Information Measure to Learning Objective

4.1 Negative Log-Likelihood and Maximum Likelihood

Let a probabilistic model assign conditional probabilities q_\theta(y \mid \mathbf{x}). For a dataset

\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^n,

the likelihood is

L(\theta; \mathcal{D}) = \prod_{i=1}^n q_\theta(y^{(i)} \mid \mathbf{x}^{(i)}).

Taking logs gives

\log L(\theta; \mathcal{D}) = \sum_{i=1}^n \log q_\theta(y^{(i)} \mid \mathbf{x}^{(i)}).

Maximum likelihood therefore solves

\max_\theta \sum_{i=1}^n \log q_\theta(y^{(i)} \mid \mathbf{x}^{(i)}),

or equivalently

\min_\theta \frac{1}{n}\sum_{i=1}^n -\log q_\theta(y^{(i)} \mid \mathbf{x}^{(i)}).

That last expression is empirical cross-entropy.

So the standard supervised classification loss is not an arbitrary choice. It is exactly the negative log-likelihood of the observed labels under the model.

This explains a lot at once:

  • why the objective is additive over examples
  • why labels with low predicted probability incur large penalties
  • why the output layer must define a valid probability distribution

It also clarifies what changes under different tasks:

  • binary classification uses a Bernoulli conditional law
  • multiclass classification uses a categorical conditional law
  • sequence models use an autoregressive product of categorical laws
  • soft targets replace the point-mass target distribution by a non-degenerate target law

The unifying object is still expected negative log-probability.

4.2 Binary Cross-Entropy from a Bernoulli Model

Suppose Y \in \{0,1\} and the model predicts

q_\theta(Y=1 \mid \mathbf{x}) = \hat{p}, \qquad q_\theta(Y=0 \mid \mathbf{x}) = 1 - \hat{p}.

The Bernoulli likelihood for one example is

q_\theta(y \mid \mathbf{x}) = \hat{p}^{\,y}(1-\hat{p})^{1-y}.

Taking negative logs gives

\ell_{\mathrm{BCE}}(y,\hat{p}) = -\log q_\theta(y \mid \mathbf{x}) = -\left[\, y\log \hat{p} + (1-y)\log(1-\hat{p}) \,\right].

This is binary cross-entropy.

Two edge cases explain its shape:

  • if y = 1, then \ell = -\log \hat{p}
  • if y = 0, then \ell = -\log(1-\hat{p})

So confident correct predictions produce small loss, while confident wrong predictions produce very large loss.

BINARY CROSS-ENTROPY INTUITION
===============================================================

True label y = 1:
  loss = -log(p_hat)
  cheap if p_hat is close to 1
  huge if p_hat is close to 0

True label y = 0:
  loss = -log(1 - p_hat)
  cheap if p_hat is close to 0
  huge if p_hat is close to 1

Meaning:
  BCE punishes confident mistakes much more than hesitant ones
===============================================================

This is one reason BCE often trains better than mean-squared error on probabilistic binary prediction. It delivers much sharper corrective gradients when the model is confidently wrong.

In practice, the prediction \hat{p} usually comes from a logit z through the sigmoid

\hat{p} = \sigma(z) = \frac{1}{1+e^{-z}}.

Stable implementations therefore work directly with the logit z, not with \hat{p} after explicit sigmoid evaluation.
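One standard rearrangement (a sketch, not PyTorch's exact internal implementation) evaluates max(z, 0) - z y + log(1 + e^{-|z|}), which is algebraically equal to the BCE but never exponentiates a large positive number:

```python
import numpy as np

def bce_with_logits(z, y):
    """Stable binary cross-entropy computed from the logit z.

    Algebraically equal to -[y log sigmoid(z) + (1 - y) log(1 - sigmoid(z))],
    but it never evaluates the sigmoid near floating-point extremes.
    """
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z, y = -100.0, 1.0                     # confidently wrong prediction, label 1
print(bce_with_logits(z, y))           # ~100.0, finite and correct

# The naive route underflows: sigmoid(-100) rounds to exactly 0.0.
with np.errstate(over="ignore", divide="ignore"):
    p_hat = 1.0 / (1.0 + np.exp(-z))
    print(-y * np.log(p_hat))          # inf -- the loss is unusable
```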

It is also useful to compare BCE with plain accuracy on a single example. If the true label is 1, then predicting 0.51 and predicting 0.99 both count as "correct" under accuracy. BCE distinguishes them sharply:

  • -\log 0.51 is still relatively large
  • -\log 0.99 is very small

So BCE rewards not merely correctness, but confident correctness backed by probability mass. That extra resolution is exactly what makes probabilistic learning work.

4.3 Categorical Cross-Entropy from Softmax Logits

Now suppose the label Y takes values in [K] = \{1,\dots,K\}. Let the model output logits \mathbf{z} \in \mathbb{R}^K and probabilities

\hat{\mathbf{p}} = \operatorname{softmax}(\mathbf{z}), \qquad \hat{p}_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}.

If the target is one-hot with true class c, then the categorical cross-entropy is

\ell_{\mathrm{CE}}(\mathbf{z}, c) = -\log \hat{p}_c.

Expanding \hat{p}_c gives the fused logit-space form

\ell_{\mathrm{CE}}(\mathbf{z}, c) = -z_c + \log\sum_{j=1}^K e^{z_j}.

This is the form used in numerically stable implementations because it avoids explicitly computing softmax and then taking a logarithm.

There are several things to notice:

  1. Only the true-class logit appears with a negative sign directly.
  2. All logits influence the partition term \log\sum_j e^{z_j}.
  3. Adding the same constant to all logits leaves the loss unchanged, because softmax is shift-invariant.

That last property is the foundation of the log-sum-exp stabilization trick.

One-hot target interpretation. If the target distribution is \mathbf{y} = \mathbf{e}_c, then

-\sum_{k=1}^K y_k \log \hat{p}_k = -\log \hat{p}_c.

So the common multiclass loss is a special case of full distribution-level cross-entropy where the target distribution is degenerate.

There is another important geometric interpretation. The model cannot improve the loss for class c merely by increasing z_c in isolation; the partition term couples every class to every other class. This is why the loss produces dense gradients over the full vocabulary in language modeling: even one observed token creates pressure on all logits through normalization.
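A small numeric check of the fused form and the shift invariance just described, using made-up logits:

```python
import numpy as np

z = np.array([2.0, -1.0, 0.5, 3.0])   # logits for K = 4 classes
c = 3                                  # true class index

# Route 1: softmax then -log (fine at this scale, unsafe in general).
p_hat = np.exp(z) / np.exp(z).sum()
loss_naive = -np.log(p_hat[c])

# Route 2: fused logit-space form, -z_c + log-sum-exp(z), with max shift.
def fused_ce(z, c):
    m = z.max()
    return -z[c] + m + np.log(np.exp(z - m).sum())

assert np.isclose(loss_naive, fused_ce(z, c))

# Shift invariance: adding a constant to all logits leaves the loss unchanged.
assert np.isclose(fused_ce(z, c), fused_ce(z + 1000.0, c))
```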

4.4 Soft Targets and Target Distributions

One-hot labels are not the only valid targets. If the target itself is a probability distribution \mathbf{t} \in \Delta^{K-1}, then the natural loss is

\ell(\mathbf{t}, \hat{\mathbf{p}}) = -\sum_{k=1}^K t_k \log \hat{p}_k.

This is still just cross-entropy.

Soft targets appear in several important settings:

  • label smoothing: replace a point mass by a slightly spread target
  • knowledge distillation: use teacher probabilities as targets
  • weak supervision: labels encode uncertainty rather than certainty
  • annotator aggregation: multiple human labels produce a soft empirical distribution over classes

The interpretation changes from "probability on the correct class" to "probability mass allocated across the whole target distribution."

This distinction is crucial. With one-hot labels, all non-true classes are treated identically. With soft targets, the loss can encode structure among incorrect classes:

  • class A may be more plausible than class B
  • synonyms may deserve nonzero mass
  • near-equivalent translations may share credit

This richer target geometry is part of why distillation can transfer "dark knowledge" that is invisible in hard labels.

4.5 Sequence Cross-Entropy, Token Averaging, and Perplexity

For autoregressive language models, the target is a token sequence x_{1:T}. The model defines

q_\theta(x_{1:T}) = \prod_{t=1}^T q_\theta(x_t \mid x_{<t}).

The negative log-likelihood is

-\log q_\theta(x_{1:T}) = \sum_{t=1}^T -\log q_\theta(x_t \mid x_{<t}).

Dividing by T gives the average token cross-entropy:

\frac{1}{T}\sum_{t=1}^T -\log q_\theta(x_t \mid x_{<t}).

This token average is what practitioners usually report during LLM training.

Perplexity is then

\operatorname{PPL} = \exp\left( \frac{1}{T}\sum_{t=1}^T -\log q_\theta(x_t \mid x_{<t}) \right)

when natural logarithms are used.

Hugging Face's language-modeling documentation makes two important points that are easy to miss:

  1. perplexity is specifically natural for autoregressive or causal language models, not masked language models such as BERT
  2. tokenizer choice changes the event space, so perplexities across different tokenizations are not directly comparable

This is one of the most common evaluation mistakes in modern LLM work. Lower perplexity is meaningful only relative to the same tokenization and evaluation protocol.

Another subtle issue is aggregation choice. People sometimes average token losses per sequence and then average again across a batch, which weights short and long sequences equally. In other pipelines, all valid tokens are pooled into one denominator, which weights sequences proportionally to length. Both choices can be legitimate, but they are not numerically equivalent. When comparing training curves or benchmark results, one should always ask which averaging scheme was used.
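The difference between the two aggregation schemes is easy to see on made-up numbers:

```python
import numpy as np

# Hypothetical per-token losses for two sequences of different lengths.
seq_a = np.array([2.0, 2.0])                       # short sequence
seq_b = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])   # long sequence

# Scheme 1: average per sequence, then average the sequence means.
per_seq = np.mean([seq_a.mean(), seq_b.mean()])    # (2.0 + 1.0) / 2 = 1.5

# Scheme 2: pool all valid tokens into one denominator.
pooled = np.concatenate([seq_a, seq_b]).mean()     # 10.0 / 8 = 1.25

print(per_seq, pooled)   # 1.5 vs 1.25 -- same data, different "average CE"
```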


5. Core Theory III: Gradients, Geometry, and Statistical Meaning

5.1 Stable Log-Sum-Exp and Fused Implementations

Naively computing

\log\operatorname{softmax}(\mathbf{z})_c = \log\frac{e^{z_c}}{\sum_j e^{z_j}}

is numerically dangerous because large logits can overflow the exponentials.

The stable remedy is to subtract the maximum logit

m = \max_j z_j.

Then

\log\sum_j e^{z_j} = m + \log\sum_j e^{z_j - m}.

Since z_j - m \le 0, the exponentials are at most 1, preventing overflow.

Therefore the stable log-softmax is

\log\operatorname{softmax}(\mathbf{z})_c = z_c - m - \log\sum_j e^{z_j - m}.

This identity is not optional implementation polish. It is the reason real training code works reliably.

PyTorch's official documentation reflects this design:

  • CrossEntropyLoss expects logits, not probabilities
  • BCEWithLogitsLoss fuses sigmoid with binary cross-entropy
  • the fused versions exploit the log-sum-exp trick for numerical stability

The practical rule is simple:

NEVER DO THIS IN PRODUCTION
===============================================================

probs = softmax(logits)
loss  = -log(probs[target])

Reason:
  overflow, underflow, and avoidable precision loss

DO THIS INSTEAD
  use log-softmax or a fused cross-entropy-from-logits routine
===============================================================

This is one of the most important implementation lessons in the chapter because the mathematical formula alone does not tell you how to compute it safely.
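A minimal sketch of the stable routine and of the failure mode the box above warns about; log_softmax and ce_from_logits are illustrative helpers, not library calls:

```python
import numpy as np

def log_softmax(z):
    """Stable log-softmax: z - m - log(sum exp(z - m)) with m = max(z)."""
    m = np.max(z)
    return z - m - np.log(np.sum(np.exp(z - m)))

def ce_from_logits(z, c):
    """Cross-entropy for true class c, computed entirely in logit space."""
    return -log_softmax(z)[c]

z = np.array([1000.0, 999.0, 0.0])    # logits large enough to overflow exp
print(ce_from_logits(z, 0))           # ~0.313, computed safely

# The "never do this" route fails on the same input:
with np.errstate(over="ignore", invalid="ignore"):
    probs = np.exp(z) / np.exp(z).sum()   # exp(1000) overflows to inf
print(-np.log(probs[0]))                  # nan, not a usable loss
```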

For BCE, the same principle appears in a slightly different algebraic form. Instead of computing

-y\log \sigma(z) - (1-y)\log(1-\sigma(z)),

stable implementations rewrite the expression in logit space so they do not need to evaluate \sigma(z) near floating-point extremes. This is exactly why BCEWithLogitsLoss exists as a distinct API rather than as a trivial wrapper around sigmoid and BCELoss.

5.2 The Gradient: Predicted Minus Target

For categorical cross-entropy with one-hot target \mathbf{y} and softmax prediction \hat{\mathbf{p}} = \operatorname{softmax}(\mathbf{z}), the gradient with respect to the logits is

\nabla_{\mathbf{z}}\ell = \hat{\mathbf{p}} - \mathbf{y}.

This is one of the most famous formulas in deep learning.

Its importance is hard to overstate:

  • it is simple
  • it is dense over all classes
  • it has a clean probabilistic interpretation

The true class receives gradient \hat{p}_c - 1, which is negative unless the model already assigns probability 1 to the correct class. Every other class receives gradient \hat{p}_k, which pushes those probabilities down.

So the gradient does exactly what we want:

  • increase the correct probability
  • decrease the incorrect probabilities
  • do so in proportion to the model's own confidence structure

This is one reason softmax plus cross-entropy is such an elegant pair. The Jacobian of softmax and the derivative of log-loss cancel into a beautifully simple residual form.

Forward reference: the full softmax Jacobian derivation belongs in Jacobians and Hessians and in the more output-activation-centered treatment in Activation Functions. Here we focus on what that derivative means for cross-entropy itself.

This gradient also explains why CE continues to produce useful updates even when the model is badly wrong. If the true class probability is tiny, then \hat{p}_c - 1 is close to -1, which yields a strong correction. By contrast, losses with flatter tails can provide weaker learning signals for the same mistake.
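A finite-difference check of the \hat{\mathbf{p}} - \mathbf{y} formula on made-up logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce(z, c):
    return -np.log(softmax(z)[c])

z = np.array([0.5, -1.2, 2.0])
c = 0
y = np.eye(3)[c]

analytic = softmax(z) - y            # the p_hat - y formula

# Central finite differences of the same gradient.
eps = 1e-6
numeric = np.array([
    (ce(z + eps * np.eye(3)[k], c) - ce(z - eps * np.eye(3)[k], c)) / (2 * eps)
    for k in range(3)
])
assert np.allclose(analytic, numeric, atol=1e-6)
```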

5.3 Hessian, Curvature, and a Preview of Fisher Information

The Hessian of softmax cross-entropy with respect to logits is positive semidefinite and closely tied to the covariance structure of the predicted distribution:

H = \operatorname{diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}}\hat{\mathbf{p}}^\top.

This matrix has several key properties:

  • it is symmetric
  • it is positive semidefinite
  • its rows sum to zero because softmax is shift-invariant

Geometrically, this tells us that cross-entropy is locally curved in the directions that change relative class probabilities, but flat along the all-ones direction that shifts all logits equally.

That local curvature structure is one reason CE behaves so nicely for probabilistic classification. Near a well-fit solution, the Hessian captures how sensitive the loss is to perturbations in the predicted distribution.

It also previews Fisher information. For log-likelihood objectives, the expected Hessian and the Fisher matrix are deeply connected. That connection is one reason Fisher information belongs right after this section in the chapter.

There is also a numerical lesson here. In large-vocabulary models, the Hessian is enormous, but its structure is not arbitrary. It is the covariance matrix of the softmax distribution in logit space. That covariance view helps explain why second-order approximations, natural gradients, and curvature-aware methods keep returning to the same family of matrices when derived from CE objectives.
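A small sketch verifying the three listed properties of H = diag(p̂) - p̂ p̂^⊤ numerically:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(np.array([1.0, 0.0, -2.0, 0.5]))
H = np.diag(p) - np.outer(p, p)     # Hessian of CE w.r.t. the logits

assert np.allclose(H, H.T)                        # symmetric
assert np.allclose(H.sum(axis=1), 0.0)            # rows sum to zero (shift invariance)
assert np.all(np.linalg.eigvalsh(H) >= -1e-12)    # positive semidefinite
```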

5.4 Log-Loss as a Strictly Proper Scoring Rule

Cross-entropy is more than a convenient optimization objective. It is also a strictly proper scoring rule.

The idea is simple:

  • a scoring rule evaluates probabilistic forecasts
  • it is proper if the expected score is minimized by reporting the true distribution
  • it is strictly proper if that minimizer is unique

For log-loss, the expected score under the true law p is exactly

\mathbb{E}_{X\sim p}[-\log q(X)] = H(p,q).

By the identity

H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q),

this is minimized uniquely at q = p.

That theorem is why CE is the standard loss for probabilistic prediction. If the model class is expressive enough and optimization succeeds, minimizing expected CE pushes the model toward the true conditional distribution, not just toward correct argmax decisions.

This also highlights a contrast with accuracy:

  • accuracy only cares whether the highest-probability class is correct
  • cross-entropy cares whether the full probability distribution is honest

That is a much stronger learning signal, and it is the right one whenever calibrated probabilities matter.

The "strictly proper" language also protects against a common misunderstanding. Cross-entropy is not preferred merely because it is differentiable. It is preferred because it is truth-inducing: in expectation, the best forecast under this rule is the true distribution itself. That is a much deeper reason.

5.5 Calibration, Overconfidence, and Temperature Scaling

A model can achieve low cross-entropy and still be miscalibrated in practice. This is a subtle but important point.

At the population optimum, cross-entropy is minimized by the true conditional distribution. But finite data, finite model classes, optimization bias, label noise, and overparameterization can still produce overconfident or underconfident predictions.

Modern neural networks are often overconfident:

  • they assign probabilities too close to 0 or 1
  • their predicted confidence exceeds empirical correctness frequency

Temperature scaling is a standard post-hoc fix. If \mathbf{z} are logits, replace them by \mathbf{z}/\tau with \tau > 0, then choose \tau on a validation set to minimize the negative log-likelihood.

Effects:

  • \tau > 1 softens the distribution and increases entropy
  • \tau < 1 sharpens the distribution and decreases entropy

This matters because cross-entropy is both a training objective and an evaluation objective. A model that trains well under CE may still need temperature adjustment to report probabilities that are useful for downstream decision-making.
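A minimal sketch of temperature selection by validation NLL. The data here are synthetic and constructed so that the "model" logits are exactly three times too sharp; all names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical validation set: base scores are calibrated by construction,
# and the "model" reports logits 3x too sharp (overconfident).
base = rng.normal(size=(500, 10))
probs = np.exp(base) / np.exp(base).sum(axis=1, keepdims=True)
labels = np.array([rng.choice(10, p=row) for row in probs])
logits = 3.0 * base

def val_nll(logits, labels, tau):
    z = logits / tau
    m = z.max(axis=1, keepdims=True)
    log_Z = (m + np.log(np.exp(z - m).sum(axis=1, keepdims=True))).squeeze(1)
    return np.mean(log_Z - z[np.arange(len(labels)), labels])

# Temperature is a single scalar, so a grid search is enough for a sketch.
taus = np.linspace(0.5, 6.0, 100)
best = taus[np.argmin([val_nll(logits, labels, t) for t in taus])]
print(f"selected temperature ~ {best:.2f}")   # should land near 3
```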

The original Transformer paper famously noted a related tension: label smoothing improved accuracy and BLEU while hurting perplexity, precisely because it makes the model more uncertain. That is not a contradiction. It is a reminder that cross-entropy, calibration, and task metrics interact in subtle ways.

In deployment, this matters whenever probabilities are acted on directly:

  • abstention thresholds
  • medical or legal risk alerts
  • reranking pipelines
  • active learning
  • safety filters

A model with slightly worse raw CE but much better calibration may be more useful in downstream decision systems than a model with slightly better CE and pathological overconfidence.


6. Advanced Topics

6.1 Label Smoothing as Entropy Injection

Label smoothing replaces a hard one-hot target \mathbf{e}_c by a softened distribution. A common version is

\mathbf{t}^{(\varepsilon)} = (1-\varepsilon)\mathbf{e}_c + \frac{\varepsilon}{K}\mathbf{1},

or a closely related variant that spreads the \varepsilon mass only across the incorrect classes.

The cross-entropy then becomes

\ell_{\mathrm{LS}} = -\sum_{k=1}^K t_k^{(\varepsilon)} \log \hat{p}_k.

This can be read in two complementary ways:

  1. target-side view: the supervision no longer demands absolute certainty
  2. regularization view: the model is penalized for driving the predicted distribution toward zero entropy too aggressively

Why does this help?

  • it reduces logit saturation
  • it prevents the model from treating every non-target class as equally and infinitely wrong
  • it often improves calibration and generalization

But there is a tradeoff. If the target is smoothed, the minimum achievable loss relative to one-hot labels changes. The model is being asked to remain slightly uncertain, even on examples with unambiguous labels.

This is exactly why the Transformer paper could report that label smoothing with \varepsilon_{\mathrm{ls}} = 0.1 improved task metrics while hurting perplexity. A lower perplexity corresponds to more concentrated next-token distributions. Smoothing deliberately resists that concentration.

LABEL SMOOTHING
===============================================================

Hard target:
  [0, 0, 1, 0, 0]

Smoothed target:
  [0.02, 0.02, 0.92, 0.02, 0.02]

Effect:
  - less pressure for infinite confidence
  - smaller penalty for near-miss classes
  - often better calibration and robustness
===============================================================

One important nuance from later empirical work is that label smoothing is not a universal free lunch. The NeurIPS 2019 study "When Does Label Smoothing Help?" showed that its benefits depend on task structure, class relationships, and the evaluation metric of interest. In some distillation settings, naive smoothing can even interfere with knowledge transfer.

So the right conclusion is not "always smooth labels." It is:

  • label smoothing changes the target distribution
  • changing the target distribution changes what cross-entropy asks the model to learn
  • that can be beneficial or harmful depending on the problem

Another useful way to think about smoothing is that it injects a tiny amount of target entropy by hand. Hard labels say "the world is deterministic at this example." Smoothed labels say "the world may still have ambiguity, annotation noise, or semantic neighborhood structure." That philosophical shift is small in notation and large in effect.
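A minimal sketch of the smoothed target and its effect on the loss, matching the box above; smoothed_target and soft_ce are illustrative helpers:

```python
import numpy as np

def smoothed_target(c, K, eps):
    """(1 - eps) * one_hot(c) + eps / K * ones -- the variant from the text."""
    return (1 - eps) * np.eye(K)[c] + eps / K * np.ones(K)

def soft_ce(t, p_hat):
    return -np.sum(t * np.log(p_hat))

p_hat = np.array([0.01, 0.02, 0.95, 0.01, 0.01])
hard = np.eye(5)[2]
soft = smoothed_target(2, 5, eps=0.1)

print(soft)                      # [0.02, 0.02, 0.92, 0.02, 0.02]
print(soft_ce(hard, p_hat))      # rewards driving p_hat[2] toward 1
print(soft_ce(soft, p_hat))      # penalizes zero-entropy predictions
```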

6.2 Knowledge Distillation and Soft Cross-Entropy

In knowledge distillation we train a student model to match a teacher distribution rather than only the hard class label. If the teacher outputs \mathbf{t} and the student predicts \hat{\mathbf{p}}, the natural loss is

\ell_{\mathrm{KD}} = -\sum_{k=1}^K t_k \log \hat{p}_k.

This is just cross-entropy again, but now the target is itself a learned distribution.

Hinton, Vinyals, and Dean's distillation work made this idea famous because the teacher's non-argmax probabilities carry extra structure:

  • which wrong classes are close to the right one
  • how uncertain the teacher is
  • how semantic similarity shows up in the probability tail

This extra structure is often called dark knowledge.

Temperature plays a special role in distillation. Teacher and student probabilities are often softened by a temperature \tau > 1:

t_k^{(\tau)} = \frac{e^{z_k^{\text{teacher}}/\tau}}{\sum_j e^{z_j^{\text{teacher}}/\tau}}, \qquad \hat{p}_k^{(\tau)} = \frac{e^{z_k^{\text{student}}/\tau}}{\sum_j e^{z_j^{\text{student}}/\tau}}.

The loss is then applied to the softened distributions. Why soften?

  • it reveals class relationships hidden by an almost one-hot teacher output
  • it gives the student a denser gradient signal
  • it emphasizes relative logit structure rather than only the top class

Distillation is therefore not some separate object outside cross-entropy. It is an advanced target-design strategy inside the same CE framework.
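A minimal sketch of temperature-softened distillation targets, with made-up teacher and student logits. Note that Hinton et al. additionally scale the soft term by \tau^2 and mix in a hard-label CE term, which this sketch omits:

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

teacher_logits = np.array([6.0, 3.5, 1.0, -2.0])   # hypothetical teacher
student_logits = np.array([4.0, 1.0, 2.0, -1.0])   # hypothetical student

for tau in (1.0, 4.0):
    t = softmax(teacher_logits, tau)   # softened teacher target
    s = softmax(student_logits, tau)   # softened student prediction
    kd = -np.sum(t * np.log(s))        # soft cross-entropy between them
    print(f"tau = {tau}: teacher = {np.round(t, 3)}, KD loss = {kd:.3f}")
```

At \tau = 1 the teacher is nearly one-hot; at \tau = 4 the near-miss classes carry visible mass, which is exactly the dark knowledge the text describes.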

6.3 Weighted, Masked, and Cost-Sensitive Cross-Entropy

Real training pipelines rarely use plain unweighted CE without modification. Three common variations matter a lot in practice.

Class weighting

If some classes are rare, we may weight the loss:

\ell_{\mathrm{wCE}} = -\sum_{k=1}^K w_k y_k \log \hat{p}_k.

This tells the optimizer to care more about mistakes on minority classes.

Binary positive weighting

For binary or multi-label BCE, a common implementation uses a pos_weight factor. PyTorch's BCEWithLogitsLoss documentation explicitly describes this as a recall/precision tradeoff mechanism under imbalance.

Masking and ignored positions

Sequence models often contain padding tokens or positions that should not contribute to the loss. Then we compute

\ell_{\mathrm{masked}} = \frac{\sum_{t=1}^T m_t \,\big(-\log q(x_t \mid x_{<t})\big)}{\sum_{t=1}^T m_t},

where m_t \in \{0,1\} is a mask.

This normalized masked average matters. If we divide by the wrong denominator, we change the scale of the loss and therefore the effective optimization dynamics.
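A made-up example of how the denominator choice changes the reported loss:

```python
import numpy as np

token_nll = np.array([2.3, 1.1, 0.7, 0.0, 0.0])   # last two entries are padding
mask      = np.array([1,   1,   1,   0,   0  ])

masked_loss = (mask * token_nll).sum() / mask.sum()   # valid tokens: 4.1 / 3
wrong_loss  = (mask * token_nll).sum() / len(mask)    # wrong denominator: 4.1 / 5

print(masked_loss, wrong_loss)   # ~1.367 vs 0.82 -- same model, different scale
```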

Cost-sensitive CE

Sometimes the real decision problem is asymmetric. False negatives may be far more expensive than false positives. In such cases, CE can be adapted with weights, though at some point the right object may become a different loss or a decision-thresholding scheme on top of probabilistic predictions.

The important conceptual point is that these are not new unrelated losses. They are modifications of how the same logarithmic scoring rule is aggregated over classes, examples, or positions.

This is worth emphasizing because engineers often treat weighting, masking, and ignoring as "just implementation details." They are implementation details, but they are also modeling decisions. They define which errors the empirical cross-entropy estimator treats as important.

6.4 When Cross-Entropy Fails

Cross-entropy is powerful, but it is not a universal answer.

Failure mode 1: label noise

If labels are corrupted, CE can aggressively fit the wrong target because it treats the observed label distribution as the truth to be matched.

Failure mode 2: extreme class imbalance

On heavily imbalanced datasets, unweighted CE may optimize average log-loss while producing poor minority-class recall.

Failure mode 3: overconfidence under distribution shift

A model can achieve good in-distribution CE and still assign unjustifiably high confidence to out-of-distribution inputs.

Failure mode 4: task-metric mismatch

A small gain in CE does not always correspond to a meaningful gain in the task metric. Translation quality, factuality, ranking quality, or calibrated risk may require more than plain next-step NLL.

Failure mode 5: structured dependence ignored

If the output has combinatorial or relational structure, tokenwise or classwise cross-entropy may fail to capture global consistency constraints.

WHEN PLAIN CE IS NOT ENOUGH
===============================================================

Problem                  What CE misses
---------------------------------------------------------------
Noisy labels             fits corrupted supervision too hard
Class imbalance          minority classes under-emphasized
OOD shift                can still be overconfident
Structured outputs       local token loss misses global structure
Decision asymmetry       equal log-loss is not equal real-world cost
===============================================================

This is not a criticism of cross-entropy so much as a boundary statement. CE is the right objective when the goal is honest probabilistic prediction under the given target distribution. If the problem itself differs from that objective, the loss may need to be modified or supplemented.

That framing helps avoid the unproductive question "Is CE good or bad?" It is better to ask:

  • what target distribution is being treated as truth?
  • what downstream decision problem actually matters?
  • which failure mode is dominant in this setting?

Those questions turn loss selection from ritual into principled design.

6.5 Preview of Alternative Losses

Cross-entropy sits inside a larger loss landscape.

  • Focal loss reweights CE to emphasize hard and rare examples.
  • Hinge loss prioritizes margin over probabilistic calibration.
  • Dice / Jaccard losses are used in segmentation when overlap is what matters most.
  • Bradley-Terry style binary CE reappears in preference-model training.
  • DPO-style preference objectives build on log-probability comparisons and KL-regularized preference optimization.

Forward reference: the broad taxonomy of these alternatives belongs in Loss Functions for Machine Learning. Here we keep the focus on cross-entropy as the canonical log-loss baseline.

The pedagogical point is worth keeping clear: many modern losses are easier to understand once cross-entropy is fully internalized. They often modify CE, add regularization to CE, or contrast CE against another objective.


7. Applications in Machine Learning

7.1 Probabilistic Classification

Cross-entropy is the default loss for probabilistic classification because it matches the problem formulation exactly. If the model outputs class probabilities, the log-loss directly evaluates how much probability the model assigns to the observed class.

This has several consequences that are better than accuracy-only training:

  • the model is encouraged to learn calibrated probabilities
  • errors are differentiated by confidence
  • uncertainty can be propagated into downstream systems

In binary logistic regression, BCE is the Bernoulli negative log-likelihood. In multiclass softmax regression, categorical CE is the categorical negative log-likelihood. Deep classifiers simply parameterize the same probabilistic form with more expressive feature extractors.

From a statistical standpoint, CE is the natural bridge from parametric conditional modeling to empirical risk minimization. From an engineering standpoint, it is the loss that turns logits into meaningful probabilistic feedback.

This is why so many mature classification stacks report both accuracy and CE. Accuracy says whether the decision was correct. CE says how much probability the model placed on what happened. The pair is far more informative than either metric alone.

7.2 Transformer Training and Label-Smoothed Sequence Loss

The original Transformer recipe made cross-entropy a central part of modern sequence-model engineering. During training, token predictions were supervised with label-smoothed categorical CE rather than hard one-hot CE.

That decision encodes a practical philosophy:

  • next-token prediction is probabilistic
  • excessive certainty is often harmful
  • small target-side entropy can regularize the training dynamics

In sequence-to-sequence and language-model settings, CE is usually computed over all valid output positions and averaged over non-padding tokens. Teacher forcing means the model is conditioned on the ground-truth past tokens while predicting the next token. The resulting loss is still just cross-entropy over conditional categorical distributions, but the conditioning context now includes a long history.

This is one place where the information-theory view is especially clarifying. The loss is not "token accuracy with differentiability attached." It is average surprise under the model's conditional next-token distribution.

That perspective also explains why improvements in CE can feel small numerically but large behaviorally. Shaving a few hundredths off token-level CE across billions of predictions corresponds to a systematic reduction in surprise over an enormous event space. In large-scale language modeling, tiny changes in average log-loss can reflect meaningful improvements in predictive structure.

7.3 LLM Pretraining, Tokenization, and Perplexity

Autoregressive LLM pretraining is next-token cross-entropy minimization at scale. Every token position contributes a term

-\log q_\theta(x_t \mid x_{<t}).

The model is rewarded for allocating high probability to the actual next token in context. Over trillions of tokens, this becomes a massive empirical estimate of sequence cross-entropy.

Why does this single objective produce such broad capability? Because reducing next-token surprise requires modeling many kinds of structure at once:

  • grammar
  • discourse
  • facts
  • code syntax
  • style
  • local and long-range dependencies

Cross-entropy itself does not "know" these categories. It only measures whether the right token received enough probability mass. But minimizing that criterion across diverse corpora forces the model to learn the latent regularities that make the corpus predictable.

Perplexity is the most common scalar evaluation summary of this process:

\operatorname{PPL} = \exp(\text{average token cross-entropy}).

But perplexity has caveats:

  • it depends on tokenization
  • it depends on evaluation context length
  • it need not correlate perfectly with downstream usefulness
  • it is natural for causal language modeling but not for masked-LM objectives

So cross-entropy gives us a principled training objective and perplexity gives us a principled proxy metric, but both must be interpreted in context.

There is also a social reason this matters in 2026. Public discussion of model quality often compresses everything into one number. Perplexity is tempting for that role because it is grounded and easy to compute. But a mature evaluation culture treats it as one signal among many, not as a universal summary of intelligence or usefulness.

7.4 Teacher-Student Compression

When a student model is trained to imitate a teacher distribution, soft cross-entropy becomes an information-transfer channel.

Why can a student learn more from the teacher probabilities than from hard labels alone?

  • the teacher reveals confusion structure
  • near-miss classes get nonzero probability
  • similarity relations among classes become visible

For example, a teacher image classifier may assign

[0.72, 0.20, 0.05, 0.03]

to four classes rather than

[1, 0, 0, 0].

The second vector says only "class 1 is correct." The first says "class 2 is meaningfully plausible, classes 3 and 4 much less so." That relative structure is often exactly what a smaller student needs in order to generalize well.

Cross-entropy is the right tool here because it compares full distributions, not just argmax labels.

7.5 Preference and Ranking Loss Previews

Cross-entropy reappears outside ordinary classification.

In reward modeling for preference learning, binary or pairwise CE is often used to distinguish preferred responses from dispreferred ones. A Bradley-Terry style form models the probability that response aa is preferred to response bb and then trains with binary log-loss on the observed comparison.

In ranking systems, related CE-style objectives supervise the probability that relevant items outrank irrelevant ones. In contrastive systems, normalized softmax objectives are close relatives of categorical CE.

The broader lesson is that cross-entropy is the native loss whenever the target of learning is a probability over alternatives:

  • classes
  • tokens
  • pairwise preferences
  • teacher-guided candidate distributions

Once you see that pattern, a large part of modern AI objective design starts to look like special-purpose packaging around the same logarithmic core.

That recognition is empowering. It means that new objectives become easier to learn because you can ask:

  1. what are the alternatives?
  2. what probability distribution is being modeled over them?
  3. where is the cross-entropy hidden inside the objective?

Often the answer is "very close to the surface."


8. Common Mistakes

| # | Mistake | Why It Is Wrong | Fix |
|---|---------|-----------------|-----|
| 1 | Treating cross-entropy and entropy as the same thing | Entropy depends on one distribution; cross-entropy depends on a source-model pair (p, q) | Keep the model mismatch explicitly visible |
| 2 | Saying CE is "just classification loss" | CE is an information-theoretic quantity that becomes a loss under probabilistic modeling | Learn the coding and NLL interpretations too |
| 3 | Applying softmax and then calling CrossEntropyLoss | PyTorch CE expects logits and does log-softmax internally for stability | Pass raw logits to fused CE routines |
| 4 | Using sigmoid and then BCELoss when logits are available | This is less numerically stable than a fused logits-based version | Use BCEWithLogitsLoss |
| 5 | Comparing perplexity across different tokenizers | Different tokenizers define different event spaces and token counts | Compare PPL only under matched tokenization/eval protocol |
| 6 | Assuming low CE implies perfect calibration | CE is strictly proper in theory, but finite models and finite data can still be miscalibrated | Check calibration separately and use temperature scaling if needed |
| 7 | Thinking label smoothing is always good | It changes the target distribution and can interfere with some goals | Use smoothing deliberately, not by reflex |
| 8 | Ignoring support mismatch | If q(x)=0 where p(x)>0, cross-entropy is infinite | Smooth or stabilize models so valid events retain support |
| 9 | Using hard labels when soft targets are available | Hard labels discard structure among alternatives | Use soft CE for distillation or uncertain supervision |
| 10 | Averaging masked sequence loss with the wrong denominator | This changes effective gradient scale and can bias comparisons | Normalize by the number of valid tokens |
| 11 | Treating CE gains as task-metric guarantees | Lower CE does not always imply better BLEU, calibration, fairness, or retrieval quality | Evaluate with the task metric too |
| 12 | Calling CE symmetric | H(p,q) \neq H(q,p) in general | Keep argument order explicit |
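Mistakes 3, 4, and 7 are easy to show in a few lines. A sketch of the intended API usage (random tensors are illustrative; label_smoothing is a real CrossEntropyLoss argument in modern PyTorch):

```python
# Pass raw logits to fused losses; opt in to label smoothing deliberately.
import torch
import torch.nn as nn

logits = torch.randn(4, 10)        # multiclass logits: no softmax applied
labels = torch.randint(10, (4,))

ce = nn.CrossEntropyLoss()         # applies log-softmax internally
ce_smooth = nn.CrossEntropyLoss(label_smoothing=0.1)  # deliberate smoothing
print(ce(logits, labels), ce_smooth(logits, labels))

bin_logits = torch.randn(4)        # binary logits: no sigmoid applied
bin_targets = torch.randint(2, (4,)).float()
bce = nn.BCEWithLogitsLoss()       # fused sigmoid + BCE, numerically stable
print(bce(bin_logits, bin_targets))
```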

9. Exercises

  1. Exercise 1 (*): Discrete Cross-Entropy by Hand
    Let p=(0.7,0.2,0.1) and q=(0.5,0.3,0.2). Compute H(p), H(p,q), and D_{\mathrm{KL}}(p\|q), then verify the identity H(p,q)=H(p)+D_{\mathrm{KL}}(p\|q).

  2. Exercise 2 (*): One-Hot Targets and Log-Loss
    Show that if the target is a point mass on class c, then categorical cross-entropy reduces to -\log \hat{p}_c. Explain why this means CE punishes confident mistakes so strongly.

  3. Exercise 3 (*): Bernoulli Likelihood to BCE
    Starting from the Bernoulli model q_\theta(y \mid \mathbf{x})=\hat{p}^{\,y}(1-\hat{p})^{1-y}, derive the binary cross-entropy formula.

  4. Exercise 4 (**): Softmax Cross-Entropy Gradient
    Derive \nabla_{\mathbf{z}}\ell = \hat{\mathbf{p}}-\mathbf{y} for the one-hot categorical case.

  5. Exercise 5 (**): Sequence Cross-Entropy and Perplexity
    Given token log-likelihoods for a short sequence, compute the average cross-entropy per token and convert it to perplexity. Then explain why the same model can have different perplexities under different tokenizers.

  6. Exercise 6 (**): Stable Log-Sum-Exp
    Implement naive softmax -> log and stable log_softmax for large logits, then demonstrate why the naive version fails numerically.

  7. Exercise 7 (**): Label Smoothing or Distillation
    On a toy multiclass example, compare one-hot CE against smoothed-target CE or teacher-target CE. Explain what supervision information the soft target adds.

  8. Exercise 8 (**): Weighted or Masked Cross-Entropy in Practice
    Construct either:

    • a class-imbalanced classification example and compare weighted vs unweighted CE, or
    • a padded sequence example and compare correct masked averaging to naive averaging.

10. Why This Matters for AI (2026 Perspective)

| Concept | AI Impact |
|---------|-----------|
| Cross-entropy as NLL | The default training objective for probabilistic classification and next-token prediction |
| H(p,q)=H(p)+D_{\mathrm{KL}}(p\Vert q) | Turns model fitting into distribution matching with a clear irreducible baseline |
| Binary CE | Standard loss for binary and multi-label prediction, reward-model heads, and many ranking tasks |
| Categorical CE | Standard loss for multiclass prediction and vocab-size softmax training |
| Sequence CE | The core pretraining loss of autoregressive LLMs |
| Perplexity | A direct exponential transform of token-level CE for causal language models |
| Stable logits-based CE | Prevents overflow/underflow in real training systems |
| Strictly proper scoring rule | Justifies CE when the goal is honest probabilistic prediction |
| Label smoothing | Reduces overconfidence and often improves generalization or calibration |
| Soft-target CE | Enables distillation and richer supervision than hard labels alone |
| Weighted and masked CE | Makes CE usable in imbalanced datasets and padded sequence pipelines |
| CE failure modes | Reminds us when other objectives or evaluation metrics are needed |

Cross-entropy matters in 2026 because frontier AI systems are increasingly judged not only by whether they pick the right answer, but by how they distribute probability mass, how stable they train, how useful their uncertainty is, and how efficiently they compress predictive structure from vast datasets. CE is the language that connects all of those concerns.

It is also one of the few quantities whose theoretical meaning survives contact with engineering reality. The same mathematics explains textbook coding, PyTorch loss APIs, Transformer training recipes, distillation pipelines, and perplexity reports on LLM benchmarks. That level of continuity is rare and valuable.

For someone building AI systems in 2026, this matters because cross-entropy is not a niche theorem you memorize and forget. It is a daily design object:

  • when reading training logs
  • when choosing a loss API
  • when interpreting perplexity changes
  • when debugging overconfidence
  • when deciding whether soft targets are worth the complexity

11. Conceptual Bridge

Cross-entropy is the natural continuation of the first three information-theory sections. Entropy introduced intrinsic uncertainty. KL Divergence introduced mismatch between distributions. Mutual Information introduced uncertainty reduction between variables. Cross-entropy combines the first two directly:

H(p,q)=H(p)+D_{\mathrm{KL}}(p\|q).

That equation is not just a neat identity. It tells you exactly what model training is doing: preserving the unavoidable uncertainty term while shrinking the mismatch term as much as the model class allows.
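The decomposition is easy to check numerically. A quick sketch on a toy pair of distributions (the probability values are arbitrary illustrative choices):

```python
# Verify H(p,q) = H(p) + KL(p||q) on a toy example; values are illustrative.
import torch

p = torch.tensor([0.6, 0.3, 0.1])
q = torch.tensor([0.4, 0.4, 0.2])

H_p = -(p * p.log()).sum()      # entropy: the irreducible uncertainty term
H_pq = -(p * q.log()).sum()     # cross-entropy: uncertainty under the model
KL = (p * (p / q).log()).sum()  # mismatch term

print(H_pq, H_p + KL)           # the two sides agree
```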

Looking forward, the next conceptual step is Fisher Information. Cross-entropy tells us how costly model mismatch is on average. Fisher information tells us how sharply the log-likelihood changes under infinitesimal parameter perturbations. So CE gives the global objective and Fisher gives the local geometry.

Cross-entropy also connects outward to other parts of the curriculum:

CROSS-ENTROPY IN THE CURRICULUM
===============================================================

Entropy
  intrinsic uncertainty
        |
        v
KL divergence
  mismatch between distributions
        |
        v
Cross-entropy
  uncertainty measured under a model
        |
        +--> maximum likelihood
        +--> classification loss
        +--> sequence modeling
        +--> perplexity
        +--> distillation / label smoothing
        |
        v
Fisher information
  local curvature of log-likelihood geometry
===============================================================

If you keep only one idea from this section, let it be this:

Cross-entropy is the quantitative price we pay when our predicted distribution differs from reality.

That sentence unifies coding, likelihood, classification, language modeling, calibration, and distillation in one line.

It also sets up the next section naturally. Once we know how costly model mismatch is in expectation, the next mathematical question is:

How sensitive is that cost to tiny changes in the model parameters?

That is exactly the doorway into Fisher information.

There is also a backward-looking lesson here. Entropy, KL divergence, mutual information, and cross-entropy are not four disconnected formulas to memorize. They are a tightly linked family:

  • entropy measures intrinsic uncertainty
  • KL divergence measures mismatch
  • mutual information measures uncertainty reduction
  • cross-entropy measures uncertainty viewed through a model

Seeing those relationships clearly is part of becoming fluent in the language of modern probabilistic machine learning.

That fluency is exactly what this chapter is building.

Cross-entropy is one of its central dialects.

References

  1. Shannon, C. E. (1948). "A Mathematical Theory of Communication." Bell System Technical Journal.
  2. Kullback, S., and Leibler, R. A. (1951). "On Information and Sufficiency." Annals of Mathematical Statistics.
  3. Cover, T. M., and Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley.
  4. MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
  5. Duchi, J. C. Statistics 311 / EE377 Lecture Notes.
  6. Stanford EE276. "Lecture 2: Entropy and Related Quantities."
  7. PyTorch Documentation. torch.nn.CrossEntropyLoss.
  8. PyTorch Documentation. torch.nn.BCEWithLogitsLoss.
  9. Hugging Face Transformers Documentation. "Perplexity of fixed-length models."
  10. Vaswani, A., et al. (2017). "Attention Is All You Need." arXiv.
  11. Hinton, G., Vinyals, O., and Dean, J. (2015). "Distilling the Knowledge in a Neural Network." arXiv.
  12. Müller, R., Kornblith, S., and Hinton, G. (2019). "When Does Label Smoothing Help?" NeurIPS.
  13. Dawid, A. P., and Musio, M. (2014). "Theory and Applications of Proper Scoring Rules." Metron.
  14. The Book of Statistical Proofs. "The log probability scoring rule is a strictly proper scoring rule."