Math for LLMs: Notes

Probability Theory / Common Distributions

Section 6.2: Common Distributions

"God does not play dice - but physicists, statisticians, and machine learning engineers do. The distributions in this chapter are the dice they use."

Overview

Every probabilistic model makes a choice: what distribution describes the data? A Bernoulli for a coin flip, a Gaussian for continuous measurements, a Categorical for language model token prediction, a Dirichlet for topic proportions. These are not arbitrary choices - each distribution encapsulates a specific data-generating story, a set of assumptions, and a collection of mathematical properties that make inference tractable.

This section gives the complete treatment of every named distribution that appears throughout this curriculum. For each distribution you will find: the PDF or PMF, the CDF where useful, the parameters and their interpretations, the moments (mean, variance, skewness), the moment generating function, the shape behaviour as parameters vary, the special cases and limiting forms, and the concrete ML applications where the distribution appears.

The section culminates in two unifying frameworks: the exponential family, which shows that Bernoulli, Gaussian, Poisson, Beta, Gamma, Dirichlet, and Categorical are all instances of a single canonical form; and conjugate priors, which explains why Bayesian inference is analytically tractable for precisely these distributions.

What this section assumes: The definitions of PDF, PMF, CDF, and the basic probability axioms from Section 01. Bernoulli and Gaussian were introduced there; this section gives their full treatment.

What this section defers: Expectation derivations are stated as facts here and derived from first principles in Section 04. The multivariate Gaussian is previewed here and fully developed in Section 03.


Prerequisites


Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Interactive visualisations of all distributions; exponential family; conjugacy updates |
| exercises.ipynb | 10 graded exercises: PMF/MGF computation, exponential family identification, conjugate Bayesian updating |

Learning Objectives

After completing this section, you will:

  • State the PMF or PDF, support, mean, variance, and MGF of every major named distribution
  • Identify which distribution models a given data-generating process
  • Explain the relationships between distributions (Binomial->Poisson, Binomial->Normal, Beta->Dirichlet)
  • Write any exponential family member in canonical form and identify its sufficient statistics and log-partition function
  • Compute the posterior in a conjugate Bayesian model (Beta-Binomial, Dirichlet-Categorical, Gamma-Poisson, Normal-Normal)
  • Derive the softmax function as the natural parameterisation of the Categorical exponential family
  • Explain how the Gaussian reparameterisation trick enables backpropagation through sampling in VAEs
  • Describe the role of each distribution in at least two concrete ML architectures

Table of Contents


1. Intuition and Overview

1.1 Why Named Distributions?

A probability distribution is completely specified by its CDF. So why do we name particular families and memorise their formulas?

Three reasons:

Sufficient statistics compress data. If you observe n coin flips, all the information about p is contained in the count of heads - not the sequence. The Binomial distribution formalises this: its PMF depends on the data only through \sum_i x_i. Named distributions arise precisely when the data-generating process has this kind of compressibility.

Tractable inference. Computing posteriors, marginals, and predictions requires integration. For most distributions this is intractable. The named distributions are the ones for which the integration can be done in closed form - either because the MGF factors, or because they belong to the exponential family where the normalisation constant has an analytic form.

Interpretable parameters. A Gaussian \mathcal{N}(\mu, \sigma^2) has a mean \mu and a standard deviation \sigma that are immediately meaningful. A Beta(2, 5) prior encodes "I've seen 2 successes and 5 failures before the experiment." Named distributions give parameters human-interpretable meaning.

For AI: Every loss function and output layer in a neural network implicitly assumes a distribution over outputs. Cross-entropy loss assumes a Categorical distribution. Mean squared error assumes a Gaussian. Poisson regression assumes a Poisson distribution. Understanding which distribution the loss corresponds to is essential for knowing when a model is misspecified.

1.2 How to Choose a Distribution

DATA GENERATING PROCESS
========================================================================

  Discrete outcomes?
  +-- Two outcomes (0/1)             -> Bernoulli(p)  [single trial]
  +-- Count of successes in n trials -> Binomial(n, p)
  +-- Trials until first success     -> Geometric(p)
  +-- Trials until r-th success      -> Negative Binomial(r, p)
  +-- Count of events in interval    -> Poisson(\lambda)
  +-- One of K categories            -> Categorical(p)
  +-- Counts across K categories     -> Multinomial(n, p)

  Continuous outcomes?
  +-- Bounded, equal weight          -> Uniform(a, b)
  +-- Unbounded, symmetric bell      -> Gaussian(\mu, \sigma^2)
  +-- Non-negative, time to event    -> Exponential(\lambda) or Gamma(\alpha, \beta)
  +-- Probability value in (0, 1)    -> Beta(\alpha, \beta)
  +-- Probability vector on simplex  -> Dirichlet(\alpha)
  +-- Heavy-tailed, small samples    -> Student-t(\nu)

========================================================================

1.3 The Distribution Family Tree

DISTRIBUTION FAMILY TREE
========================================================================

  DISCRETE                              CONTINUOUS
  ----------------------------          ------------------------------

  Bernoulli(p)                          Uniform(a,b)
      | n trials                              |
      v                                       v (generalise)
  Binomial(n,p) ------- CLT -----------> Gaussian(\mu,\sigma^2)
      |
      | n->\infty, p->0, np=\lambda
      v
  Poisson(\lambda) -- Poisson process --> Exponential(\lambda)
      |                                       | sum of \alpha i.i.d. copies
      | generalise (K categories)             v
  Multinomial(n,p)                       Gamma(\alpha,\beta)
      | n=1                                   | \alpha=k/2, \beta=1/2
      v                                       v
  Categorical(p)                         Chi-squared(k)
      |
      | conjugate prior
      v
  Dirichlet(\alpha) --- K=2 marginals --> Beta(\alpha,\beta)

  Student-t(\nu) = Gaussian / sqrt(Chi^2(\nu)/\nu)  [heavy tails; \nu->\infty -> Gaussian]

========================================================================

1.4 Historical Timeline

| Year | Discovery | Contributor |
| --- | --- | --- |
| 1713 | Bernoulli distribution, Law of Large Numbers | Jakob Bernoulli |
| 1733 | Normal approximation to the Binomial | Abraham de Moivre |
| 1809 | Normal distribution as error model (least squares) | Carl Friedrich Gauss |
| 1837 | Poisson distribution as limit of rare events | Simeon-Denis Poisson |
| 1839 | Dirichlet distribution (as Bayesian prior) | Peter Gustav Lejeune Dirichlet |
| 1860 | Exponential distribution (Maxwell's speed distribution) | James Clerk Maxwell |
| 1893 | Chi-squared distribution | Karl Pearson |
| 1908 | Student-t distribution (small samples) | William Sealy Gosset ("Student") |
| 1911 | Beta distribution (conjugate to Binomial) | Karl Pearson |
| 1922 | Sufficient statistics | Ronald A. Fisher |
| 1935 | Exponential family (unified framework) | Koopman, Pitman, Darmois |
| 2003 | Dirichlet-Categorical in LDA (topic models) | Blei, Ng, Jordan |
| 2013 | Gaussian VAE and reparameterisation trick | Kingma, Welling |
| 2020 | Dirichlet language model priors (BayesOPT) | various |

2. Discrete Distributions

A discrete random variable takes values in a countable set \mathcal{X}. Its distribution is fully described by the PMF p_X(x) = P(X = x), which satisfies \sum_{x \in \mathcal{X}} p_X(x) = 1.

2.1 Bernoulli(p)

Story: A single binary trial: success (1) with probability p, failure (0) with probability 1-p.

PMF:

p_X(x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}, \quad p \in [0, 1]

Equivalently: P(X=1) = p and P(X=0) = 1-p.

CDF:

F_X(x) = \begin{cases} 0 & x < 0 \\ 1-p & 0 \leq x < 1 \\ 1 & x \geq 1 \end{cases}

Moments:

| Quantity | Value |
| --- | --- |
| Mean \mathbb{E}[X] | p |
| Variance \operatorname{Var}(X) | p(1-p) |
| Mode | \mathbf{1}[p > 0.5] |
| Entropy H(X) | -p\log p - (1-p)\log(1-p) |
| Skewness | (1-2p)/\sqrt{p(1-p)} |

The variance p(1-p) is maximised at p = 0.5 (maximum uncertainty) and equals zero at p \in \{0, 1\} (certainty).

Moment Generating Function:

M_X(t) = \mathbb{E}[e^{tX}] = (1-p) + pe^t

Log-odds / logit: The natural parameterisation for the Bernoulli is the logit:

\eta = \log\frac{p}{1-p} \in (-\infty, \infty)

Inverting: p = \sigma(\eta) = \frac{e^\eta}{1+e^\eta} = \frac{1}{1+e^{-\eta}} - the sigmoid function. This is why logistic regression parameterises the Bernoulli distribution.

For AI:

  • Binary classification: labels y \in \{0, 1\} are Bernoulli. The cross-entropy loss -y\log\hat{p} - (1-y)\log(1-\hat{p}) is the negative log-likelihood of a Bernoulli model.
  • Dropout: each activation is independently masked with Bernoulli(1-\text{drop\_rate}). At inference, expectations replace samples: h \to (1-\text{drop\_rate}) \cdot h.
  • Stochastic depth: transformer layers are included/skipped via Bernoulli sampling during training.

Non-examples: A Bernoulli is NOT appropriate when outcomes are not binary (use Categorical), when multiple trials are involved (use Binomial), or when the "probability" varies per trial (use a hierarchical model).
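The logit/sigmoid inverse pair and the cross-entropy-as-NLL claim can be checked numerically; a minimal stdlib sketch (function names are illustrative, not a library API):

```python
import math

def sigmoid(eta):
    # p = 1 / (1 + e^{-eta}): inverts the logit map
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    # eta = log(p / (1 - p)): the natural parameter of the Bernoulli
    return math.log(p / (1.0 - p))

def bernoulli_nll(y, p_hat):
    # Binary cross-entropy = negative log-likelihood of Bernoulli(p_hat)
    return -(y * math.log(p_hat) + (1 - y) * math.log(1.0 - p_hat))

# sigmoid and logit are inverses of each other
assert abs(sigmoid(logit(0.3)) - 0.3) < 1e-12

# Observing y = 1 under p_hat = 0.8 costs exactly -log(0.8)
assert abs(bernoulli_nll(1, 0.8) + math.log(0.8)) < 1e-12
```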


2.2 Binomial(n, p)

Story: Count the number of successes in n independent Bernoulli(p) trials. If X_1, \ldots, X_n \overset{\text{iid}}{\sim} \operatorname{Bern}(p), then X = \sum_{i=1}^n X_i \sim \operatorname{Binomial}(n, p).

PMF:

p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n

The binomial coefficient \binom{n}{k} = n!/(k!(n-k)!) counts the number of ways to choose which k of the n trials succeed.

Moments:

| Quantity | Value | Derivation hint |
| --- | --- | --- |
| Mean \mathbb{E}[X] | np | Linearity: \sum_{i=1}^n \mathbb{E}[X_i] = np |
| Variance \operatorname{Var}(X) | np(1-p) | Independence: \sum_i \operatorname{Var}(X_i) |
| Mode | \lfloor (n+1)p \rfloor (or \lceil (n+1)p \rceil - 1) | |
| Skewness | (1-2p)/\sqrt{np(1-p)} | |

MGF: Since X = X_1 + \cdots + X_n with independent summands:

M_X(t) = \prod_{i=1}^n M_{X_i}(t) = (1-p+pe^t)^n

Shape behaviour:

  • p < 0.5: right-skewed (most counts are below the mean)
  • p = 0.5: symmetric
  • p > 0.5: left-skewed
  • As n grows: bell-shaped by the Central Limit Theorem (CLT preview -> Section 06)

Normal approximation (CLT preview): For large n:

\frac{X - np}{\sqrt{np(1-p)}} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty

Rule of thumb: the approximation is accurate when np \geq 5 and n(1-p) \geq 5.

Poisson limit: When n \to \infty and p \to 0 with np = \lambda fixed:

\binom{n}{k}p^k(1-p)^{n-k} \to \frac{\lambda^k e^{-\lambda}}{k!}

(Full derivation in Section 2.4.)

For AI:

  • A/B testing: the number of conversions in n visits follows Binomial(n, p).
  • Ensemble prediction: the number of ensemble members predicting class 1 follows Binomial(M, p), where M is the ensemble size.
  • Batch statistics: the number of positive examples in a minibatch of size B drawn from a dataset with fraction \rho positive is Binomial(B, \rho).
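The normal approximation and its rule of thumb can be checked numerically; a minimal stdlib sketch (helper names are illustrative):

```python
import math

def binom_pmf(n, p, k):
    # Exact Binomial PMF using Python's arbitrary-precision binomial coefficient
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, p = 100, 0.3                      # np = 30 and n(1-p) = 70, both >= 5
mean, sd = n * p, math.sqrt(n * p * (1 - p))

# Exact P(X <= 35) versus the CLT approximation (continuity correction +0.5)
exact = sum(binom_pmf(n, p, k) for k in range(36))
approx = normal_cdf((35.5 - mean) / sd)
assert abs(exact - approx) < 0.01
```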

2.3 Geometric(p) and Negative Binomial(r, p)

Geometric(p)

Story: Number of independent Bernoulli(p) trials until (and including) the first success.

PMF:

P(X = k) = (1-p)^{k-1} p, \quad k = 1, 2, 3, \ldots

The factor (1-p)^{k-1} is the probability of k-1 consecutive failures; p is the probability of the final success.

Moments:

| Quantity | Value |
| --- | --- |
| Mean \mathbb{E}[X] | 1/p |
| Variance \operatorname{Var}(X) | (1-p)/p^2 |
| Mode | 1 (always) |
| Median | \lceil -1/\log_2(1-p) \rceil |

MGF:

M_X(t) = \frac{pe^t}{1-(1-p)e^t}, \quad t < -\log(1-p)

Memoryless property: For integers s, t \geq 0:

P(X > s + t \mid X > s) = P(X > t)

Proof: P(X > k) = (1-p)^k. Then P(X > s+t \mid X > s) = P(X > s+t)/P(X > s) = (1-p)^{s+t}/(1-p)^s = (1-p)^t = P(X > t). \square

The Geometric distribution is the unique discrete memoryless distribution (just as the Exponential is the unique continuous memoryless distribution).

Variant: Some sources define X as the number of failures before the first success, giving PMF P(X=k) = (1-p)^k p for k = 0, 1, 2, \ldots - this shifts everything by 1. Always check which convention a source uses.
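The two conventions can be compared numerically; a small stdlib sketch (function names are illustrative):

```python
def geom_trials_pmf(k, p):
    # "number of trials" convention: support k = 1, 2, 3, ...
    return (1 - p) ** (k - 1) * p

def geom_failures_pmf(k, p):
    # "number of failures" convention: support k = 0, 1, 2, ...
    return (1 - p) ** k * p

p = 0.25
# The two conventions describe the same distribution, shifted by one
assert all(abs(geom_trials_pmf(k, p) - geom_failures_pmf(k - 1, p)) < 1e-15
           for k in range(1, 50))

# Their means differ by exactly that shift: 1/p versus (1-p)/p
mean_trials = sum(k * geom_trials_pmf(k, p) for k in range(1, 2000))
assert abs(mean_trials - 1 / p) < 1e-6
```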

Negative Binomial(r, p)

Story: Number of trials until the r-th success. The sum of r independent Geometric(p) random variables.

PMF:

P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}, \quad k = r, r+1, r+2, \ldots

Moments: Mean = r/p, Variance = r(1-p)/p^2.

Overdispersion: The variance-to-mean ratio is (1-p)/p, which exceeds 1 whenever p < 1/2; in the number-of-failures convention (mean r(1-p)/p), the ratio is 1/p > 1 for every p < 1. Either way, the Negative Binomial can model count data with overdispersion - variance greater than the mean - which the Poisson cannot.

For AI:

  • Sequence length modelling: the length of a sentence until a full stop follows approximately Geometric or Negative Binomial.
  • Retry modelling: number of API calls until success follows Geometric (with exponential backoff, the distribution changes).
  • Count regression: Negative Binomial regression replaces Poisson regression when data exhibits overdispersion (e.g., user activity counts, bug counts per module).
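The overdispersion can be observed by simulation. A stdlib sketch with r = 5 and p = 0.4, where the trials-convention theory gives mean 12.5 and variance 18.75 (function names are illustrative):

```python
import random

random.seed(0)

def geometric_trials(p):
    # Trials until (and including) the first success
    k = 1
    while random.random() >= p:
        k += 1
    return k

def negative_binomial(r, p):
    # Sum of r independent Geometric(p) draws
    return sum(geometric_trials(p) for _ in range(r))

r, p, n = 5, 0.4, 20000
samples = [negative_binomial(r, p) for _ in range(n)]
mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / n

# Theory: mean r/p = 12.5, variance r(1-p)/p^2 = 18.75 (p < 1/2, so var > mean)
assert abs(mean - 12.5) < 0.2
assert abs(var - 18.75) < 1.5
assert var > mean
```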

2.4 Poisson(\lambda)

Story: Count of independent rare events occurring in a fixed interval (time, space, area), where \lambda is the average rate (events per interval).

PMF:

P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots, \quad \lambda > 0

Moments:

| Quantity | Value |
| --- | --- |
| Mean \mathbb{E}[X] | \lambda |
| Variance \operatorname{Var}(X) | \lambda (mean = variance!) |
| Mode | \lfloor \lambda \rfloor (both \lambda - 1 and \lambda if \lambda is an integer) |
| Skewness | 1/\sqrt{\lambda} (right-skewed; approaches 0 as \lambda grows) |

The equal mean and variance is the defining fingerprint of Poisson data. When empirical variance significantly exceeds the mean, the data is overdispersed and the Negative Binomial is more appropriate.

MGF:

M_X(t) = e^{\lambda(e^t - 1)}

Derivation: M_X(t) = \sum_{k=0}^\infty e^{tk} \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k=0}^\infty \frac{(\lambda e^t)^k}{k!} = e^{-\lambda} \cdot e^{\lambda e^t} = e^{\lambda(e^t-1)}.

Additive property: If X \sim \operatorname{Poisson}(\lambda_1) and Y \sim \operatorname{Poisson}(\lambda_2) independently, then X + Y \sim \operatorname{Poisson}(\lambda_1 + \lambda_2).

Proof via MGFs: M_{X+Y}(t) = M_X(t) M_Y(t) = e^{\lambda_1(e^t-1)} \cdot e^{\lambda_2(e^t-1)} = e^{(\lambda_1+\lambda_2)(e^t-1)}. \square

Poisson Limit Theorem (Binomial -> Poisson):

As n \to \infty, p \to 0, with np = \lambda fixed:

\binom{n}{k} p^k (1-p)^{n-k} \to \frac{\lambda^k e^{-\lambda}}{k!}

Proof sketch:

\binom{n}{k}p^k(1-p)^{n-k} = \frac{n(n-1)\cdots(n-k+1)}{k!} \cdot \frac{\lambda^k}{n^k} \cdot \left(1-\frac{\lambda}{n}\right)^{n-k}

As n \to \infty: n(n-1)\cdots(n-k+1)/n^k \to 1, (1-\lambda/n)^n \to e^{-\lambda}, and (1-\lambda/n)^{-k} \to 1. This gives \lambda^k e^{-\lambda}/k!. \square

Poisson Process: A sequence of events where (1) events in disjoint intervals are independent, (2) the probability of exactly one event in a small interval [t, t+\delta) is \lambda\delta + o(\delta), and (3) the probability of two or more events is o(\delta). The count of events in the interval [0, T] is \operatorname{Poisson}(\lambda T).

For AI:

  • Poisson regression: predicting count outcomes (clicks, API calls, bug reports). The loss is -\log P(y \mid \lambda) = \lambda - y\log\lambda + \log(y!), minimised over \lambda at \lambda = y.
  • Attention pattern sparsity: the number of tokens attended to above a threshold in sparse attention can be approximately Poisson.
  • Dataset curation: rare-event counts in datasets (specific entity types, low-frequency tokens) follow Poisson, motivating over-sampling strategies.
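Both the mean-equals-variance fingerprint and the Poisson limit can be verified numerically; a minimal stdlib sketch (helper names are illustrative):

```python
import math

def poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

def binom_pmf(n, p, k):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

lam = 3.0
# Mean and variance of Poisson(lam) are both lam (sums truncated at k = 60,
# where the remaining tail mass is negligible)
mean = sum(k * poisson_pmf(k, lam) for k in range(60))
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in range(60))
assert abs(mean - lam) < 1e-9 and abs(var - lam) < 1e-9

# Poisson limit: Binomial(n, lam/n) approaches Poisson(lam) for large n
n = 100000
for k in range(10):
    assert abs(binom_pmf(n, lam / n, k) - poisson_pmf(k, lam)) < 1e-3
```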

2.5 Categorical(\mathbf{p})

Story: A single draw from K mutually exclusive categories, where category k has probability p_k. The multivariate generalisation of Bernoulli.

PMF: Let \mathbf{p} = (p_1, \ldots, p_K) with p_k \geq 0 and \sum_k p_k = 1. For outcome X = k:

P(X = k) = p_k, \quad k = 1, \ldots, K

In one-hot vector notation: if \mathbf{e}_k is the k-th standard basis vector:

P(X = \mathbf{e}_k) = p_k

Moments: \mathbb{E}[X_k] = p_k, \operatorname{Var}(X_k) = p_k(1-p_k), \operatorname{Cov}(X_j, X_k) = -p_j p_k for j \neq k.

Softmax Parameterisation:

The probability vector \mathbf{p} lives on the (K-1)-dimensional probability simplex \Delta^{K-1} = \{\mathbf{p} : p_k \geq 0, \sum_k p_k = 1\}. For unconstrained logits \mathbf{z} \in \mathbb{R}^K, the natural parameterisation is:

p_k = \operatorname{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}

This is the exponential family natural parameterisation of the Categorical - the logits \mathbf{z} are the natural parameters.

Temperature scaling: The Categorical can be "sharpened" or "softened" by a temperature \tau > 0:

p_k \propto e^{z_k/\tau}

  • \tau \to 0: deterministic (argmax)
  • \tau = 1: standard softmax
  • \tau \to \infty: uniform distribution

Entropy: H(X) = -\sum_k p_k \log p_k. Maximum at uniform (\mathbf{p} = \mathbf{1}/K, giving H = \log K); minimum at a deterministic distribution (H = 0).

For AI:

  • Language model output: every token prediction is a Categorical over the vocabulary. Training minimises -\log p_\theta(x_t \mid x_{<t}), which is the NLL of a Categorical.
  • Gumbel-softmax trick: the Gumbel-max trick gives exact Categorical samples: draw G_k \sim \operatorname{Gumbel}(0,1) and set X = \arg\max_k (z_k + G_k). The Gumbel-softmax approximation replaces the hard argmax with a softmax at low temperature to make sampling differentiable.
  • Top-p (nucleus) sampling: truncate the Categorical to the smallest set of tokens whose cumulative probability exceeds p, then renormalise.
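A minimal sketch of softmax temperature and the Gumbel-max trick described above, using only the stdlib (function names are illustrative):

```python
import math
import random

def softmax(z, tau=1.0):
    # Subtract the max for numerical stability; tau is the temperature
    m = max(z)
    exps = [math.exp((zi - m) / tau) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]

z = [2.0, 1.0, 0.1]
p = softmax(z)
assert abs(sum(p) - 1.0) < 1e-12

# Low temperature sharpens towards argmax; high temperature flattens to uniform
assert softmax(z, tau=0.01)[0] > 0.999
assert all(abs(pk - 1 / 3) < 0.05 for pk in softmax(z, tau=100.0))

# Gumbel-max: argmax_k(z_k + G_k) is an exact Categorical(softmax(z)) sample
random.seed(0)
def gumbel_max_sample(z):
    g = [-math.log(-math.log(random.random())) for _ in z]
    return max(range(len(z)), key=lambda k: z[k] + g[k])

n = 50000
counts = [0] * len(z)
for _ in range(n):
    counts[gumbel_max_sample(z)] += 1
assert all(abs(counts[k] / n - p[k]) < 0.01 for k in range(len(z)))
```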

2.6 Multinomial(n, \mathbf{p})

Story: Counts across K categories in n independent Categorical draws. If each trial draws from Categorical(\mathbf{p}), then the vector of counts \mathbf{X} = (X_1, \ldots, X_K) follows a Multinomial.

PMF: For non-negative integers k_1, \ldots, k_K with \sum_j k_j = n:

P(X_1 = k_1, \ldots, X_K = k_K) = \frac{n!}{k_1! \cdots k_K!} \prod_{j=1}^K p_j^{k_j}

The multinomial coefficient n!/(k_1! \cdots k_K!) counts the number of orderings.

Moments:

\mathbb{E}[X_k] = np_k, \quad \operatorname{Var}(X_k) = np_k(1-p_k), \quad \operatorname{Cov}(X_j, X_k) = -np_j p_k \; (j \neq k)

The negative covariance between categories is inevitable: if more of one category is observed, fewer of the others must be. This negative dependence structure is a fundamental constraint of fixed-total count vectors.

Marginals: Each marginal X_k \sim \operatorname{Binomial}(n, p_k) - this is why the Binomial is the two-category special case (K = 2) of the Multinomial.

For AI:

  • Topic models (LDA): document word counts follow Multinomial(|\text{doc}|, \boldsymbol{\theta}), where \boldsymbol{\theta} is the topic-word distribution.
  • Bag-of-words: the count vector representation of a document is a realisation of the Multinomial.
  • Batch class balance: expected class counts in a minibatch follow Multinomial(B, \boldsymbol{\rho}), where \boldsymbol{\rho} is the class frequency vector.
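The negative covariance between categories can be observed in simulation; a stdlib sketch (the sampler is a naive inverse-CDF loop, not an optimised routine):

```python
import random

random.seed(1)

def multinomial_draw(n, p):
    # n Categorical(p) draws via inverse CDF; returns the count vector
    counts = [0] * len(p)
    for _ in range(n):
        u, acc = random.random(), 0.0
        k = len(p) - 1  # fallback guards against floating-point rounding
        for j, pj in enumerate(p[:-1]):
            acc += pj
            if u < acc:
                k = j
                break
        counts[k] += 1
    return counts

p, n, m = [0.5, 0.3, 0.2], 10, 40000
draws = [multinomial_draw(n, p) for _ in range(m)]
mean0 = sum(d[0] for d in draws) / m
cov01 = sum((d[0] - n * p[0]) * (d[1] - n * p[1]) for d in draws) / m

assert abs(mean0 - n * p[0]) < 0.05           # E[X_1] = np_1 = 5
assert abs(cov01 - (-n * p[0] * p[1])) < 0.1  # Cov(X_1, X_2) = -np_1p_2 = -1.5
assert cov01 < 0                              # categories compete for the total
```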

3. Continuous Distributions

A continuous random variable has a probability density function (PDF) f_X(x) \geq 0 satisfying \int_{-\infty}^\infty f_X(x)\,dx = 1. Probabilities are computed as P(a \leq X \leq b) = \int_a^b f_X(x)\,dx.

3.1 Uniform(a, b)

Story: All values in [a, b] are equally likely. The maximum entropy distribution subject to being supported on a bounded interval.

PDF and CDF:

f_X(x) = \frac{1}{b-a}, \quad x \in [a, b] \qquad F_X(x) = \frac{x-a}{b-a}, \quad x \in [a, b]

Moments:

| Quantity | Value |
| --- | --- |
| Mean | (a+b)/2 |
| Variance | (b-a)^2/12 |
| Entropy | \log(b-a) |

MGF:

M_X(t) = \frac{e^{tb} - e^{ta}}{t(b-a)}, \quad t \neq 0; \qquad M_X(0) = 1

Maximum entropy: Among all distributions supported on [a, b], the Uniform has the maximum Shannon entropy. This makes it the natural "least informative" prior when only the support is known.

Universality: If X has continuous CDF F_X, then U = F_X(X) \sim \mathcal{U}(0,1). Conversely, F_X^{-1}(U) \sim F_X for U \sim \mathcal{U}(0,1). This is the inverse CDF sampling method introduced in Section 01.
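Universality can be demonstrated directly; a minimal stdlib sketch sampling an Exponential through its inverse CDF (helper name is illustrative):

```python
import math
import random

random.seed(0)

def exponential_inverse_cdf(u, lam):
    # F(x) = 1 - e^{-lam x}  =>  F^{-1}(u) = -log(1 - u) / lam
    return -math.log(1.0 - u) / lam

lam, n = 2.0, 100000
samples = [exponential_inverse_cdf(random.random(), lam) for _ in range(n)]
mean = sum(samples) / n
assert abs(mean - 1 / lam) < 0.01      # Exponential(2) has mean 1/2

# The CDF transform sends the samples back to Uniform(0, 1)
u = [1.0 - math.exp(-lam * x) for x in samples]
assert abs(sum(u) / n - 0.5) < 0.01
```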

For AI:

  • Weight initialisation: Kaiming uniform initialisation draws weights from \mathcal{U}(-\sqrt{3/\text{fan\_in}}, \sqrt{3/\text{fan\_in}}), chosen so that \operatorname{Var}(W) = 1/\text{fan\_in}.
  • Random search: hyperparameter search over a bounded range uses Uniform priors.
  • Data augmentation: random crop position, rotation angle, and colour jitter are drawn from Uniform distributions.

3.2 Gaussian \mathcal{N}(\mu, \sigma^2)

Story: The limiting distribution of the standardised sum of i.i.d. random variables (Central Limit Theorem). The maximum entropy distribution for fixed mean and variance. The distribution that makes least-squares regression optimal under Gaussian noise.

PDF:

f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R}

Normalisation proof sketch:

\int_{-\infty}^\infty e^{-x^2/2}\,dx = \sqrt{2\pi}

This follows from the Gaussian integral trick: I^2 = \iint e^{-(x^2+y^2)/2}\,dx\,dy = 2\pi\int_0^\infty re^{-r^2/2}\,dr = 2\pi, so I = \sqrt{2\pi}. (Full proof in Appendix A of Section 01.)

Parameters:

  • \mu \in \mathbb{R}: mean (location parameter - shifts the peak)
  • \sigma^2 > 0: variance (scale parameter - controls spread)
  • \sigma = \sqrt{\sigma^2}: standard deviation (same units as X)

Moments:

| Quantity | Value |
| --- | --- |
| Mean \mathbb{E}[X] | \mu |
| Variance \operatorname{Var}(X) | \sigma^2 |
| Mode | \mu (unique, at the peak) |
| Median | \mu (by symmetry) |
| Skewness | 0 (perfectly symmetric) |
| Kurtosis | 3 (excess kurtosis = 0) |
| Entropy | \frac{1}{2}\log(2\pi e\sigma^2) |

MGF:

M_X(t) = \exp\!\left(\mu t + \frac{\sigma^2 t^2}{2}\right)

Derivation:

M_X(t) = \int_{-\infty}^\infty e^{tx} \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}\,dx

Complete the square in the exponent: tx - (x-\mu)^2/(2\sigma^2) = -(x-\mu-\sigma^2 t)^2/(2\sigma^2) + \mu t + \sigma^2 t^2/2. The integral of the Gaussian part equals 1, leaving e^{\mu t + \sigma^2 t^2/2}. \square

Standard Normal: Z \sim \mathcal{N}(0,1) has PDF \phi(z) = e^{-z^2/2}/\sqrt{2\pi} and CDF \Phi(z) = P(Z \leq z).

Any Gaussian can be standardised: Z = (X-\mu)/\sigma \sim \mathcal{N}(0,1).

Key quantiles: \Phi(1.645) \approx 0.95, \Phi(1.96) \approx 0.975, \Phi(2.576) \approx 0.995.

68-95-99.7 Rule:

P(\mu - \sigma \leq X \leq \mu + \sigma) \approx 0.683
P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 0.954
P(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \approx 0.997

Stability properties:

  1. Linear stability: aX + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2)
  2. Additive stability: X_1 + X_2 \sim \mathcal{N}(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2) when X_1 \perp\!\!\!\perp X_2
  3. Closure under conditioning: the conditional distribution of a Gaussian given a linear observation is Gaussian (developed in Section 03)

Maximum entropy: Among all distributions with mean \mu and variance \sigma^2, the Gaussian maximises entropy. This is the information-theoretic justification for its ubiquity.

Forward reference - Multivariate Gaussian:

Preview: Multivariate Gaussian \mathcal{N}(\boldsymbol{\mu}, \Sigma). The d-dimensional generalisation replaces the scalar mean with \boldsymbol{\mu} \in \mathbb{R}^d and the scalar variance with a positive definite covariance matrix \Sigma \in \mathbb{R}^{d \times d}:

f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)

The Gaussian process (Section 06) extends this to infinite dimensions.

-> Full treatment: Section 03, Joint Distributions

For AI:

  • Weight initialisation: Xavier/Glorot initialisation uses \mathcal{N}(0, 2/(n_\text{in}+n_\text{out})); Kaiming uses \mathcal{N}(0, 2/n_\text{in}).
  • VAE latent prior: p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, I) is the prior on the latent code. The encoder outputs (\boldsymbol{\mu}_\phi, \log\boldsymbol{\sigma}^2_\phi) and samples \mathbf{z} = \boldsymbol{\mu}_\phi + \boldsymbol{\sigma}_\phi \odot \boldsymbol{\epsilon}, \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I).
  • Gaussian process: a prior over functions where any finite collection of function values follows a multivariate Gaussian (Section 06).
  • SGD noise: the gradient noise in stochastic gradient descent is approximately Gaussian by the CLT, with covariance proportional to the gradient covariance matrix.
  • Diffusion models: the forward noising process adds Gaussian noise q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I).
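The sampling identity behind the reparameterisation trick can be sketched in a few lines of stdlib Python (no autodiff here, just the identity; the function name is illustrative):

```python
import random

random.seed(0)

def reparameterised_sample(mu, sigma):
    # z = mu + sigma * eps with eps ~ N(0, 1): the randomness is isolated
    # in eps, so dz/dmu = 1 and dz/dsigma = eps are well defined for backprop
    eps = random.gauss(0.0, 1.0)
    return mu + sigma * eps

mu, sigma, n = 1.5, 0.5, 100000
zs = [reparameterised_sample(mu, sigma) for _ in range(n)]
mean = sum(zs) / n
var = sum((z - mean) ** 2 for z in zs) / n

assert abs(mean - mu) < 0.01         # E[z] = mu
assert abs(var - sigma**2) < 0.01    # Var(z) = sigma^2
```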

3.3 Exponential(\lambda)

Story: Time between consecutive events in a Poisson process with rate \lambda. The continuous analogue of the Geometric distribution.

PDF and CDF:

f_X(x) = \lambda e^{-\lambda x}, \quad x \geq 0 \qquad F_X(x) = 1 - e^{-\lambda x}, \quad x \geq 0

Parameters: \lambda > 0 is the rate (events per unit time). The scale is \theta = 1/\lambda (mean inter-event time).

Moments:

| Quantity | Value |
| --- | --- |
| Mean | 1/\lambda |
| Variance | 1/\lambda^2 |
| Mode | 0 (always - the mode is at the boundary) |
| Median | \log(2)/\lambda |
| Skewness | 2 (always right-skewed, regardless of \lambda) |
| Entropy | 1 - \log\lambda |

MGF:

M_X(t) = \frac{\lambda}{\lambda - t}, \quad t < \lambda

Memoryless property (continuous version): For s, t > 0:

P(X > s + t \mid X > s) = P(X > t)

Proof: P(X > s) = e^{-\lambda s}. Then P(X > s+t \mid X > s) = e^{-\lambda(s+t)}/e^{-\lambda s} = e^{-\lambda t} = P(X > t). \square

The Exponential is the unique continuous memoryless distribution. Knowing you have already waited s units gives no information about the remaining wait.

Relationship to Poisson: If events arrive as a Poisson process with rate \lambda, then the inter-arrival times are i.i.d. Exponential(\lambda). The count of events in [0, T] is Poisson(\lambda T).
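This Exponential-Poisson duality can be checked by simulation; a minimal stdlib sketch (the function name is ours):

```python
import random

random.seed(0)

def poisson_count_via_exponentials(lam, T):
    # Accumulate Exponential(lam) inter-arrival times; the number of
    # arrivals that land inside [0, T] should be Poisson(lam * T)
    t, count = 0.0, 0
    while True:
        t += random.expovariate(lam)
        if t > T:
            return count
        count += 1

lam, T, n = 4.0, 2.0, 20000
counts = [poisson_count_via_exponentials(lam, T) for _ in range(n)]
mean = sum(counts) / n
var = sum((c - mean) ** 2 for c in counts) / n

# Both should be close to lam * T = 8 (the Poisson fingerprint: mean = variance)
assert abs(mean - lam * T) < 0.1
assert abs(var - lam * T) < 0.4
```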

For AI:

  • Regularisation via sparsity: Laplace (two-sided exponential) priors on weights yield L^1 (Lasso) regularisation under MAP estimation.
  • Session modelling: time between user sessions is modelled as Exponential.
  • Sampling algorithms: the exponential distribution appears in Gillespie's algorithm (exact stochastic simulation) and in MCMC acceptance steps.

3.4 Gamma(\alpha, \beta)

Story: For integer \alpha, the sum of \alpha independent Exponential(\beta) random variables - equivalently, the time until the \alpha-th event in a Poisson process with rate \beta.

PDF:

f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0, \quad \alpha > 0, \quad \beta > 0

Here \Gamma(\alpha) = \int_0^\infty t^{\alpha-1} e^{-t}\,dt is the gamma function, satisfying \Gamma(n) = (n-1)! for positive integers n.

Parameters: \alpha > 0 is the shape (controls the behaviour near zero and the skewness), \beta > 0 is the rate (scale \theta = 1/\beta).

Moments:

| Quantity | Value |
| --- | --- |
| Mean | \alpha/\beta |
| Variance | \alpha/\beta^2 |
| Mode | (\alpha-1)/\beta for \alpha \geq 1; 0 otherwise |
| Skewness | 2/\sqrt{\alpha} (decreases as \alpha grows) |

MGF:

M_X(t) = \left(\frac{\beta}{\beta - t}\right)^\alpha, \quad t < \beta

Special cases:

| Parameters | Distribution |
| --- | --- |
| \alpha = 1 | Exponential(\beta) |
| \alpha = k/2, \beta = 1/2 | Chi-squared(k) |
| \alpha = n (integer) | Erlang(n, \beta) |
| \alpha \to \infty (standardised) | Gaussian (by the CLT) |

Additive property: If X_1 \sim \Gamma(\alpha_1, \beta) and X_2 \sim \Gamma(\alpha_2, \beta) independently, then X_1 + X_2 \sim \Gamma(\alpha_1+\alpha_2, \beta) (same rate).

Shape behaviour:

  • \alpha < 1: density unbounded at 0, heavy right tail
  • \alpha = 1: Exponential (monotone decreasing from f(0) = \beta)
  • \alpha > 1: bell-shaped, mode at (\alpha-1)/\beta, right-skewed
  • Large \alpha: approximately Gaussian (CLT)

For AI:

  • Conjugate prior for the Poisson rate: Gamma(\alpha, \beta) is conjugate to the Poisson. After observing k events in t time units, the posterior is Gamma(\alpha+k, \beta+t).
  • Conjugate prior for the Gaussian precision: Gamma is conjugate to the precision 1/\sigma^2 of a Gaussian with known mean.
  • Variational inference: the Gamma distribution is used as a variational posterior for non-negative latent variables.
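The Gamma-Poisson conjugate update is simple enough to state as code; a minimal illustration (the helper name is ours, not a library API):

```python
def gamma_poisson_update(alpha, beta, counts, times):
    # Posterior after observing counts[i] events in times[i] time units each:
    # Gamma(alpha + total events, beta + total observation time)
    return alpha + sum(counts), beta + sum(times)

# Prior Gamma(2, 1); observe 3, 5, 4 events in three unit-time windows
alpha_post, beta_post = gamma_poisson_update(2.0, 1.0, [3, 5, 4], [1.0, 1.0, 1.0])
assert (alpha_post, beta_post) == (14.0, 4.0)

# Posterior mean alpha/beta = 3.5 lies between the prior mean 2.0 and the MLE 4.0
assert alpha_post / beta_post == 3.5
```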

3.5 Beta(\alpha, \beta)

Story: A distribution over probability values in (0, 1). The natural prior for the unknown success probability of a Bernoulli experiment.

PDF:

f_X(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}, \quad x \in (0,1), \quad \alpha > 0, \quad \beta > 0

where the Beta function B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta) is the normalising constant.

Moments:

QuantityValue
Meanα/(α+β)\alpha/(\alpha+\beta)
Varianceαβ(α+β)2(α+β+1)\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}
Mode(α1)/(α+β2)(\alpha-1)/(\alpha+\beta-2) for α,β>1\alpha, \beta > 1

The mean depends only on the ratio α/(α+β)\alpha/(\alpha+\beta). The concentration κ=α+β\kappa = \alpha + \beta controls the spread: larger κ\kappa means more concentration around the mean.

Shape behaviour:

ConditionShape
α<1,β<1\alpha < 1, \beta < 1U-shaped (bimodal at 0 and 1)
α=β=1\alpha = \beta = 1Uniform(0,1)
α=β>1\alpha = \beta > 1Symmetric bell, centred at 0.5
α>β\alpha > \betaLeft-skewed (mass towards 1, tail towards 0)
α<β\alpha < \betaRight-skewed (mass towards 0, tail towards 1)
α,β\alpha, \beta \to \infty, ratio fixedConcentrates at α/(α+β)\alpha/(\alpha+\beta)

Pseudocounts interpretation: Beta(α,β)(\alpha, \beta) encodes the belief arising from having seen α1\alpha - 1 successes and β1\beta - 1 failures. Beta(1,1)=Uniform(0,1)(1,1) = \text{Uniform}(0,1) encodes no prior knowledge.

Relationship to Dirichlet: Beta(α,β)=Dirichlet(α,β)(\alpha, \beta) = \text{Dirichlet}(\alpha, \beta) - the K=2K=2 special case.

For AI:

  • RLHF preference model: the Bradley-Terry model places a Beta prior on the probability that one response is preferred over another.
  • Proportion modelling: the fraction of positive tokens, the click-through rate, and the fraction of attended tokens are all modelled with Beta distributions.
  • Conjugate prior for Binomial: after kk successes in nn trials, the posterior on pp is Beta(α+k,β+nk)(\alpha + k, \beta + n - k).
  • Beta-VAE: despite appearing here, the β\beta in β\beta-VAE is a scalar weight on the KL term (β>1\beta > 1 encourages disentanglement of latent variables), not a Beta distribution - the shared name is a coincidence.
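The moment formulas above can be verified directly (a sketch using SciPy; the chosen α\alpha, β\beta are arbitrary):

```python
import numpy as np
from scipy import stats

alpha, beta = 2.0, 8.0
B = stats.beta(alpha, beta)

mean = B.mean()   # alpha/(alpha+beta) = 0.2
var = B.var()
expected_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

# Same mean, 10x the concentration kappa = alpha + beta: tighter distribution
B_tight = stats.beta(20.0, 80.0)

# Beta(1,1) is Uniform(0,1): constant density on (0,1)
x = np.linspace(0.01, 0.99, 50)
flat = stats.beta(1, 1).pdf(x)
```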

3.6 Dirichlet(α\boldsymbol{\alpha})

Story: A distribution over probability vectors on the simplex ΔK1={p:pk0,kpk=1}\Delta^{K-1} = \{\mathbf{p} : p_k \geq 0, \sum_k p_k = 1\}. The multivariate generalisation of Beta; the natural prior for the parameter of a Categorical distribution.

PDF: For α=(α1,,αK)\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K) with αk>0\alpha_k > 0:

f(p)=1B(α)k=1Kpkαk1,pΔK1f(\mathbf{p}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^K p_k^{\alpha_k - 1}, \quad \mathbf{p} \in \Delta^{K-1}

where B(α)=Γ(α1)Γ(αK)/Γ(α0)B(\boldsymbol{\alpha}) = \Gamma(\alpha_1)\cdots\Gamma(\alpha_K)/\Gamma(\alpha_0) and α0=kαk\alpha_0 = \sum_k \alpha_k.

Moments:

QuantityValue
E[pk]\mathbb{E}[p_k]αk/α0\alpha_k / \alpha_0
Var(pk)\operatorname{Var}(p_k)αk(α0αk)α02(α0+1)\frac{\alpha_k(\alpha_0 - \alpha_k)}{\alpha_0^2(\alpha_0+1)}
Cov(pj,pk)\operatorname{Cov}(p_j, p_k)αjαkα02(α0+1)-\frac{\alpha_j\alpha_k}{\alpha_0^2(\alpha_0+1)} for jkj \neq k

The mean probability vector is α/α0\boldsymbol{\alpha}/\alpha_0 - normalised concentration parameters.

Concentration parameter α0=kαk\alpha_0 = \sum_k \alpha_k:

  • Small α0\alpha_0 (e.g., αk=0.1\alpha_k = 0.1): sparse - samples concentrate near corners of the simplex
  • α0=K\alpha_0 = K with all αk=1\alpha_k = 1: uniform over simplex
  • Large α0\alpha_0: concentrated - samples cluster near the mean α/α0\boldsymbol{\alpha}/\alpha_0

Symmetric Dirichlet: When all αk=α\alpha_k = \alpha (scalar), the distribution is exchangeable across categories. The parameter α\alpha controls concentration:

  • α<1\alpha < 1: sparse, near-one-hot samples
  • α=1\alpha = 1: uniform over simplex
  • α>1\alpha > 1: dense, near-uniform samples

Marginals: Each marginal pkBeta(αk,α0αk)p_k \sim \text{Beta}(\alpha_k, \alpha_0 - \alpha_k).

For AI:

  • LDA (Latent Dirichlet Allocation): document topic proportions θdDirichlet(α)\boldsymbol{\theta}_d \sim \text{Dirichlet}(\boldsymbol{\alpha}). A small α\alpha enforces document sparsity - each document covers few topics.
  • Token vocabulary priors: Dirichlet priors on subword token distributions encode assumptions about language patterns.
  • Bayesian categorical models: Dirichlet(α)(\boldsymbol{\alpha}) is the conjugate prior to Categorical(p)(\mathbf{p}). After observing counts c=(c1,,cK)\mathbf{c} = (c_1, \ldots, c_K), the posterior is Dirichlet(α+c)(\boldsymbol{\alpha} + \mathbf{c}).
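The effect of the concentration parameter is easiest to see by sampling (a sketch using NumPy; the seed, sample size, and thresholds are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 5, 2000

# Small alpha: samples hug the simplex corners (near one-hot);
# large alpha: samples cluster near the uniform vector (1/K, ..., 1/K)
sparse = rng.dirichlet(0.1 * np.ones(K), size=n)
dense = rng.dirichlet(10.0 * np.ones(K), size=n)

# Average largest coordinate measures "one-hot-ness" of the samples
sparse_peak = sparse.max(axis=1).mean()
dense_peak = dense.max(axis=1).mean()

# Marginal check: E[p_0] = alpha_0k / alpha_0 = 1/K = 0.2
mean_p0 = dense[:, 0].mean()
```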

3.7 Student-tt(ν\nu)

Story: The distribution of the tt-statistic when estimating the mean of a Gaussian with unknown variance from small samples. A Gaussian with heavier tails - more robust to outliers.

PDF:

fX(x)=Γ ⁣(ν+12)νπΓ ⁣(ν2)(1+x2ν)(ν+1)/2,xRf_X(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}, \quad x \in \mathbb{R}

Parameter: ν>0\nu > 0 is the degrees of freedom. Controls tail heaviness.

Moments:

QuantityValueWhen
Mean00ν>1\nu > 1
Varianceν/(ν2)\nu/(\nu-2)ν>2\nu > 2
Skewness00ν>3\nu > 3
Excess kurtosis6/(ν4)6/(\nu-4)ν>4\nu > 4

The moments do not exist for small ν\nu: variance undefined for ν2\nu \leq 2, mean undefined for ν1\nu \leq 1 (Cauchy distribution).

Tail heaviness: The tails decay as x(ν+1)|x|^{-(\nu+1)} - polynomial decay, much heavier than the Gaussian's ex2/2e^{-x^2/2} decay. For ν=1\nu = 1: Cauchy distribution (mean and variance undefined). For ν\nu \to \infty: N(0,1)\to \mathcal{N}(0,1).

Gaussian construction: T=Z/χν2/νT = Z / \sqrt{\chi^2_\nu/\nu} where ZN(0,1) ⁣ ⁣ ⁣χν2Z \sim \mathcal{N}(0,1) \perp\!\!\!\perp \chi^2_\nu. This gives TtνT \sim t_\nu.

Location-scale family: The general Student-tt with mean μ\mu and scale σ\sigma has PDF obtained by replacing xx with (xμ)/σ(x-\mu)/\sigma and multiplying by 1/σ1/\sigma.

For AI:

  • Robust regression: Student-tt likelihood replaces Gaussian when outliers are present. Heavy tails assign higher probability to extreme residuals, reducing their influence on parameter estimates.
  • Bayesian neural networks: Student-tt priors on weights provide robustness; as ν\nu increases, the prior approaches Gaussian.
  • Uncertainty estimation: predictive distributions in small-data regimes are better modelled with tνt_\nu than Gaussian to reflect parameter uncertainty.
  • Variational inference: the Student-tt arises as a scale mixture of Gaussians: XVN(0,V)X \mid V \sim \mathcal{N}(0, V) and 1/VΓ(ν/2,ν/2)1/V \sim \Gamma(\nu/2, \nu/2) gives marginally XtνX \sim t_\nu.
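A few of these tail facts can be checked numerically (a sketch using SciPy; the chosen ν\nu values and thresholds are illustrative):

```python
import numpy as np
from scipy import stats

# Variance nu/(nu-2) exists only for nu > 2
nu = 5.0
t_var = stats.t(df=nu).var()   # 5/3

# Heavy tails: P(|T| > 4) is far larger under t_3 than under N(0,1)
tail_t = 2 * stats.t(df=3).sf(4.0)
tail_z = 2 * stats.norm.sf(4.0)

# nu -> infinity recovers the standard Gaussian density
x = np.linspace(-4, 4, 81)
max_pdf_gap = np.abs(stats.t(df=1000).pdf(x) - stats.norm.pdf(x)).max()
```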

4. Distribution Relationships

4.1 Limiting Relationships

Binomial -> Poisson (rare events limit):

When nn \to \infty, p0p \to 0, npλnp \to \lambda: Binomial(n,p)Poisson(λ)\operatorname{Binomial}(n,p) \to \operatorname{Poisson}(\lambda).

Rule of thumb: approximation is good when n20n \geq 20 and p0.05p \leq 0.05.
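The rare-events limit can be checked numerically (a small sketch with SciPy; the specific nn and pp are illustrative):

```python
import numpy as np
from scipy import stats

# Rare-events regime: n large, p small, np = lambda moderate
n, p = 1000, 0.003
lam = n * p   # 3.0

k = np.arange(0, 15)
binom_pmf = stats.binom(n, p).pmf(k)
poisson_pmf = stats.poisson(lam).pmf(k)

# Pointwise agreement of the two PMFs over the bulk of the support
max_abs_err = np.abs(binom_pmf - poisson_pmf).max()
```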

Binomial -> Normal (Central Limit Theorem preview):

When nn \to \infty: (Xnp)/np(1p)N(0,1)(X - np)/\sqrt{np(1-p)} \to \mathcal{N}(0,1).

-> Full proof: Section06 Stochastic Processes

Gamma -> Normal (large shape):

When α\alpha \to \infty (fixed β\beta): (Xα/β)/(α/β)N(0,1)(X - \alpha/\beta)\,/\,(\sqrt{\alpha}/\beta) \to \mathcal{N}(0,1) for XGamma(α,β)X \sim \text{Gamma}(\alpha,\beta).

Poisson -> Normal (large rate):

When λ\lambda \to \infty: (Xλ)/λN(0,1)(X - \lambda)/\sqrt{\lambda} \to \mathcal{N}(0,1).

Beta -> Dirac (large concentration):

When α,β\alpha, \beta \to \infty with α/(α+β)=μ\alpha/(\alpha+\beta) = \mu fixed: Beta(α,β)δ(μ)(\alpha,\beta) \to \delta(\mu).

Student-tt -> Normal:

As ν\nu \to \infty: tνN(0,1)t_\nu \to \mathcal{N}(0,1). For ν30\nu \geq 30, the approximation is excellent.

LIMITING RELATIONSHIPS
========================================================================

  Binomial(n,p) --- n→∞, p→0, np→λ ---> Poisson(λ)
       |
       |  n→∞ (CLT)
       v
  Gaussian N(np, np(1-p))

  Gamma(α,β)   ---- α→∞ ---------------> Gaussian (CLT)
  Poisson(λ)   ---- λ→∞ ---------------> Gaussian (CLT)
  Student-t(ν) ---- ν→∞ ---------------> Gaussian N(0,1)
  Beta(α,β)    ---- α,β→∞ -------------> Dirac(α/(α+β))

========================================================================

4.2 Conjugate Pairs Table

LikelihoodPriorPosteriorUpdated Parameters
Bernoulli(p)(p)Beta(α,β)(\alpha, \beta)Beta(α+k,β+nk)(\alpha + k, \beta + n - k)kk successes in nn trials
Binomial(n,p)(n, p)Beta(α,β)(\alpha, \beta)Beta(α+k,β+nk)(\alpha + k, \beta + n - k)same as Bernoulli
Categorical(p)(\mathbf{p})Dirichlet(α)(\boldsymbol{\alpha})Dirichlet(α+c)(\boldsymbol{\alpha} + \mathbf{c})c\mathbf{c} = observed counts
Multinomial(n,p)(n, \mathbf{p})Dirichlet(α)(\boldsymbol{\alpha})Dirichlet(α+c)(\boldsymbol{\alpha} + \mathbf{c})same as Categorical
Poisson(λ)(\lambda)Gamma(α,β)(\alpha, \beta)Gamma(α+ki,β+n)(\alpha + \sum k_i, \beta + n)nn observations
Exponential(λ)(\lambda)Gamma(α,β)(\alpha, \beta)Gamma(α+n,β+xi)(\alpha + n, \beta + \sum x_i)nn observations
Gaussian(μ,σ2)(\mu, \sigma^2) (known σ2\sigma^2)N(μ0,τ2)\mathcal{N}(\mu_0, \tau^2)N(μn,τn2)\mathcal{N}(\mu_n, \tau_n^2)Precision-weighted average

Full derivations in Section7 (Conjugate Priors).


5. Moment Generating Functions

5.1 Definition and Uniqueness

The moment generating function (MGF) of a random variable XX is:

MX(t)=E[etX],tRM_X(t) = \mathbb{E}[e^{tX}], \quad t \in \mathbb{R}

when this expectation is finite in an open interval around t=0t = 0.

Why "moment generating": By Taylor-expanding etXe^{tX}:

MX(t)=E ⁣[k=0(tX)kk!]=k=0E[Xk]k!tkM_X(t) = \mathbb{E}\!\left[\sum_{k=0}^\infty \frac{(tX)^k}{k!}\right] = \sum_{k=0}^\infty \frac{\mathbb{E}[X^k]}{k!} t^k

Differentiating kk times and evaluating at t=0t=0:

MX(k)(0)=E[Xk]M_X^{(k)}(0) = \mathbb{E}[X^k]

The kk-th derivative of the MGF at zero is the kk-th raw moment.

Uniqueness theorem: If MX(t)M_X(t) exists in an open interval containing 0, it uniquely determines the distribution of XX. Two random variables with the same MGF have the same distribution.

Cumulant generating function (CGF): KX(t)=logMX(t)K_X(t) = \log M_X(t). Its derivatives at 0 give the cumulants: KX(0)=E[X]K_X'(0) = \mathbb{E}[X], KX(0)=Var(X)K_X''(0) = \operatorname{Var}(X), KX(0)=E[(Xμ)3]K_X'''(0) = \mathbb{E}[(X-\mu)^3] (third central moment), etc.

5.2 MGF Product Rule

Theorem: If X ⁣ ⁣ ⁣YX \perp\!\!\!\perp Y, then MX+Y(t)=MX(t)MY(t)M_{X+Y}(t) = M_X(t) \cdot M_Y(t).

Proof: MX+Y(t)=E[et(X+Y)]=E[etXetY]=E[etX]E[etY]=MX(t)MY(t)M_{X+Y}(t) = \mathbb{E}[e^{t(X+Y)}] = \mathbb{E}[e^{tX}e^{tY}] = \mathbb{E}[e^{tX}]\mathbb{E}[e^{tY}] = M_X(t)M_Y(t), where the third equality uses independence. \square

Application: Proving that sums of independent distributions stay in the same family:

  • Bernoulli(p)n=Binomial(n,p)\text{Bernoulli}(p)^{\oplus n} = \text{Binomial}(n,p): M(t)n=(1p+pet)nM(t)^n = (1-p+pe^t)^n
  • Poisson(λ1)Poisson(λ2)=Poisson(λ1+λ2)\text{Poisson}(\lambda_1) \oplus \text{Poisson}(\lambda_2) = \text{Poisson}(\lambda_1+\lambda_2): eλ1(et1)eλ2(et1)=e(λ1+λ2)(et1)e^{\lambda_1(e^t-1)} \cdot e^{\lambda_2(e^t-1)} = e^{(\lambda_1+\lambda_2)(e^t-1)}
  • Gamma(α1,β)Gamma(α2,β)=Gamma(α1+α2,β)\text{Gamma}(\alpha_1,\beta) \oplus \text{Gamma}(\alpha_2,\beta) = \text{Gamma}(\alpha_1+\alpha_2,\beta): (β/(βt))α1+α2(\beta/(\beta-t))^{\alpha_1+\alpha_2}
  • N(μ1,σ12)N(μ2,σ22)=N(μ1+μ2,σ12+σ22)\mathcal{N}(\mu_1,\sigma_1^2) \oplus \mathcal{N}(\mu_2,\sigma_2^2) = \mathcal{N}(\mu_1+\mu_2,\sigma_1^2+\sigma_2^2): product of exponential-of-quadratics
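The Poisson closure claim can be verified without MGFs, since the PMF of an independent sum is a convolution (a sketch with NumPy/SciPy; the rates are arbitrary):

```python
import numpy as np
from scipy import stats

# Poisson(l1) + Poisson(l2) should equal Poisson(l1 + l2)
l1, l2 = 1.2, 0.5
k = np.arange(0, 40)
p1 = stats.poisson(l1).pmf(k)
p2 = stats.poisson(l2).pmf(k)

# PMF of the independent sum X1 + X2 on 0..39 via discrete convolution
sum_pmf = np.convolve(p1, p2)[: len(k)]
target = stats.poisson(l1 + l2).pmf(k)
```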

5.3 MGFs of Key Distributions

DistributionMGF MX(t)M_X(t)Domain
Bernoulli(p)(p)1p+pet1-p+pe^tR\mathbb{R}
Binomial(n,p)(n,p)(1p+pet)n(1-p+pe^t)^nR\mathbb{R}
Geometric(p)(p)pet/(1(1p)et)pe^t/(1-(1-p)e^t)t<log(1p)t < -\log(1-p)
Poisson(λ)(\lambda)eλ(et1)e^{\lambda(e^t-1)}R\mathbb{R}
Uniform(a,b)(a,b)(etbeta)/(t(ba))(e^{tb}-e^{ta})/(t(b-a))R\mathbb{R}
Exponential(λ)(\lambda)λ/(λt)\lambda/(\lambda-t)t<λt < \lambda
Gamma(α,β)(\alpha,\beta)(β/(βt))α(\beta/(\beta-t))^\alphat<βt < \beta
Gaussian(μ,σ2)(\mu,\sigma^2)eμt+σ2t2/2e^{\mu t + \sigma^2t^2/2}R\mathbb{R}
Beta(α,β)(\alpha,\beta)1+k=1(r=0k1α+rα+β+r)tkk!1 + \sum_{k=1}^\infty (\prod_{r=0}^{k-1}\frac{\alpha+r}{\alpha+\beta+r})\frac{t^k}{k!}R\mathbb{R}

Note: the Beta MGF does not have a simple closed form; the Dirichlet and Categorical are typically characterised by their probability generating functions or characteristic functions instead.

-> Full treatment of MGF applications and cumulants: Section04 Expectation and Moments


6. The Exponential Family

The exponential family is the most important unifying framework in probability and statistics. It explains why Bernoulli, Gaussian, Poisson, Gamma, Beta, Dirichlet, and Categorical all share the same algorithmic properties.

6.1 Canonical Form

A distribution belongs to the exponential family if its PDF (or PMF) can be written as:

p(x;η)=h(x)exp ⁣(ηT(x)A(η))p(x; \boldsymbol{\eta}) = h(x) \exp\!\bigl(\boldsymbol{\eta}^\top T(x) - A(\boldsymbol{\eta})\bigr)

where:

  • ηRk\boldsymbol{\eta} \in \mathbb{R}^k - natural parameters (the parameterisation used in optimisation)
  • T(x)RkT(x) \in \mathbb{R}^k - sufficient statistics (captures all information about η\boldsymbol{\eta} in the data)
  • A(η)=logh(x)exp(ηT(x))dxA(\boldsymbol{\eta}) = \log \int h(x)\exp(\boldsymbol{\eta}^\top T(x))\,dx - log-partition function (normalisation constant in log space)
  • h(x)h(x) - base measure (does not depend on η\boldsymbol{\eta})

The canonical form separates the "shape" of the distribution (hh and TT) from the parameterisation (η\boldsymbol{\eta} and AA).

6.2 Members and Their Parameters

DistributionNatural param η\boldsymbol{\eta}Sufficient stat T(x)T(x)Log-partition A(η)A(\boldsymbol{\eta})Base measure h(x)h(x)
Bernoulli(p)(p)logp1p\log\frac{p}{1-p}xxlog(1+eη)\log(1+e^\eta)11
Binomial(n,p)(n,p)logp1p\log\frac{p}{1-p}xxnlog(1+eη)n\log(1+e^\eta)(nx)\binom{n}{x}
Poisson(λ)(\lambda)logλ\log\lambdaxxeηe^\eta1/x!1/x!
Exponential(λ)(\lambda)λ-\lambdaxxlog(η)-\log(-\eta)11
Gaussian(μ,σ2)(\mu,\sigma^2) (both unknown)(μ/σ2,1/(2σ2))(\mu/\sigma^2, -1/(2\sigma^2))(x,x2)(x, x^2)η12/(4η2)12log(2η2)-\eta_1^2/(4\eta_2) - \frac{1}{2}\log(-2\eta_2)1/2π1/\sqrt{2\pi}
Gamma(α,β)(\alpha,\beta)(α1,β)(\alpha-1, -\beta)(logx,x)(\log x, x)logΓ(η1+1)(η1+1)log(η2)\log\Gamma(\eta_1+1) - (\eta_1+1)\log(-\eta_2)11
Beta(α,β)(\alpha,\beta)(α1,β1)(\alpha-1, \beta-1)(logx,log(1x))(\log x, \log(1-x))logB(η1+1,η2+1)\log B(\eta_1+1,\eta_2+1)11
Categorical(p)(\mathbf{p})(logp1/pK,,logpK1/pK)(\log p_1/p_K, \ldots, \log p_{K-1}/p_K)(x1,,xK1)(x_1,\ldots,x_{K-1})log(1+k=1K1eηk)\log(1+\sum_{k=1}^{K-1}e^{\eta_k})11

6.3 The Log-Partition Function

The log-partition function A(η)A(\boldsymbol{\eta}) is convex (always, by Hölder's inequality). Its derivatives generate the cumulants:

ηA(η)=Epη[T(X)]\nabla_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \mathbb{E}_{p_{\boldsymbol{\eta}}}[T(X)]

η2A(η)=Covpη[T(X)]\nabla_{\boldsymbol{\eta}}^2 A(\boldsymbol{\eta}) = \operatorname{Cov}_{p_{\boldsymbol{\eta}}}[T(X)]

Proof: A(η)=logh(x)eηT(x)dxA(\boldsymbol{\eta}) = \log \int h(x)e^{\boldsymbol{\eta}^\top T(x)}\,dx. Differentiating under the integral (valid by dominated convergence):

Aηj=Tj(x)h(x)eηT(x)dxh(x)eηT(x)dx=E[Tj(X)]\frac{\partial A}{\partial \eta_j} = \frac{\int T_j(x) h(x)e^{\boldsymbol{\eta}^\top T(x)}\,dx}{\int h(x)e^{\boldsymbol{\eta}^\top T(x)}\,dx} = \mathbb{E}[T_j(X)]

The second derivative similarly gives the covariance. \square

Consequence: Computing moments of any exponential family member reduces to differentiating A(η)A(\boldsymbol{\eta}) - a single convex function. This is an extraordinary computational shortcut.

Example - Poisson: A(η)=eηA(\eta) = e^\eta. So E[X]=A(η)=eη=λ\mathbb{E}[X] = A'(\eta) = e^\eta = \lambda and Var(X)=A(η)=eη=λ\operatorname{Var}(X) = A''(\eta) = e^\eta = \lambda - confirming the Poisson mean-equals-variance property directly from the log-partition function.

6.4 Sufficient Statistics

Definition (Fisher 1922): A statistic T(X(1),,X(n))T(X^{(1)}, \ldots, X^{(n)}) is sufficient for η\boldsymbol{\eta} if the conditional distribution p(X(1),,X(n)T=t;η)p(X^{(1)}, \ldots, X^{(n)} \mid T = t; \boldsymbol{\eta}) does not depend on η\boldsymbol{\eta}.

Intuitively: TT captures all information in the data about η\boldsymbol{\eta}.

Fisher-Neyman factorisation theorem: TT is sufficient if and only if the likelihood factors as:

p(x;η)=g(T(x),η)h(x)p(\mathbf{x}; \boldsymbol{\eta}) = g(T(\mathbf{x}), \boldsymbol{\eta}) \cdot h(\mathbf{x})

For exponential families with nn i.i.d. observations:

p(x;η)=h(x)exp ⁣(ηi=1nT(x(i))sufficient statnA(η))p(\mathbf{x}; \boldsymbol{\eta}) = h(\mathbf{x}) \exp\!\bigl(\boldsymbol{\eta}^\top \underbrace{\sum_{i=1}^n T(x^{(i)})}_{\text{sufficient stat}} - nA(\boldsymbol{\eta})\bigr)

Examples:

  • Bernoulli: T=xiT = \sum x_i (total successes) is sufficient for pp
  • Gaussian (both params unknown): (T1,T2)=(xi,xi2)(T_1, T_2) = (\sum x_i, \sum x_i^2) is sufficient for (μ,σ2)(\mu, \sigma^2)
  • Poisson: T=xiT = \sum x_i (total count) is sufficient for λ\lambda
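Sufficiency is easy to see empirically: two Bernoulli datasets with the same total give identical likelihood functions, whatever the order of outcomes (a sketch in NumPy; the datasets are made up):

```python
import numpy as np

# Two different orderings with the same sufficient statistic sum(x) = 3
x1 = np.array([1, 1, 0, 0, 1, 0, 0, 0])
x2 = np.array([0, 0, 0, 1, 0, 1, 1, 0])

def bernoulli_log_lik(x, p):
    # log p(x | p) = sum_i [x_i log p + (1 - x_i) log(1 - p)]
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

ps = np.linspace(0.05, 0.95, 19)
ll1 = np.array([bernoulli_log_lik(x1, p) for p in ps])
ll2 = np.array([bernoulli_log_lik(x2, p) for p in ps])
```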

Pitman-Koopman-Darmois theorem: Among families whose support does not depend on the parameter, the only distributions admitting a fixed-dimension sufficient statistic (regardless of sample size nn) are the exponential family members.

6.5 ML Implications

Softmax as categorical exponential family:

The Categorical(K)(K) natural parameter is η=(logp1/pK,,logpK1/pK)\boldsymbol{\eta} = (\log p_1/p_K, \ldots, \log p_{K-1}/p_K). Inverting:

pk=eηkj=1Keηj=softmax(η)kp_k = \frac{e^{\eta_k}}{\sum_{j=1}^K e^{\eta_j}} = \operatorname{softmax}(\boldsymbol{\eta})_k

The softmax is therefore the canonical link function of the Categorical exponential family. Neural networks output logits z\mathbf{z} which are exactly the natural parameters η\boldsymbol{\eta}.

Log-sum-exp = log-partition function:

The logsumexp operation logkezk\log\sum_k e^{z_k} is the log-partition function A(η)A(\boldsymbol{\eta}) of the Categorical distribution. The numerically stable form max(z)+logkezkmax(z)\max(\mathbf{z}) + \log\sum_k e^{z_k - \max(\mathbf{z})} is directly derived from properties of AA.
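The stable form can be sketched directly (the function names below are illustrative, not from any particular library):

```python
import numpy as np

def log_partition(z):
    """Stable log-sum-exp: A(eta) = max(z) + log sum_k exp(z_k - max(z))."""
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def softmax(z):
    """Mean parameters of the Categorical: exp(eta_k - A(eta))."""
    return np.exp(z - log_partition(z))

# Naive log(sum(exp(z))) overflows for logits this large; the stable form does not
z = np.array([1000.0, 1001.0, 1002.0])
A = log_partition(z)   # finite, ~1002.41
p = softmax(z)
```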

Natural gradient: The Fisher information matrix of an exponential family is F=2A(η)=Cov[T(X)]F = \nabla^2 A(\boldsymbol{\eta}) = \operatorname{Cov}[T(X)]. The natural gradient F1ηLF^{-1}\nabla_{\boldsymbol{\eta}}\mathcal{L} accounts for the curvature of the distribution manifold, giving parameter-invariant updates. K-FAC approximates F1F^{-1} for neural networks.

MLE for exponential families: The MLE of η\boldsymbol{\eta} satisfies:

A(η^)=1ni=1nT(x(i))\nabla A(\hat{\boldsymbol{\eta}}) = \frac{1}{n}\sum_{i=1}^n T(x^{(i)})

meaning the expected sufficient statistics under the model equal the empirical sufficient statistics - a moment matching condition.


7. Conjugate Priors in Bayesian Inference

Bayesian inference requires computing the posterior p(θx)p(xθ)p(θ)p(\boldsymbol{\theta} \mid \mathbf{x}) \propto p(\mathbf{x} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta}). For most priors, this requires numerical integration. Conjugate priors are the special priors for which the posterior stays in the same family - making inference analytic.

7.1 Conjugacy Definition

Definition: A prior p(θ)p(\boldsymbol{\theta}) is conjugate to a likelihood p(xθ)p(\mathbf{x} \mid \boldsymbol{\theta}) if the posterior p(θx)p(\boldsymbol{\theta} \mid \mathbf{x}) is in the same family as the prior.

For exponential family likelihoods, conjugate priors always exist and take a canonical form. The general conjugate prior to p(x;η)=h(x)exp(ηT(x)A(η))p(x; \boldsymbol{\eta}) = h(x)\exp(\boldsymbol{\eta}^\top T(x) - A(\boldsymbol{\eta})) is:

p(η;χ,ν)exp ⁣(ηχνA(η))p(\boldsymbol{\eta}; \boldsymbol{\chi}, \nu) \propto \exp\!\bigl(\boldsymbol{\eta}^\top \boldsymbol{\chi} - \nu A(\boldsymbol{\eta})\bigr)

After observing nn data points x(1),,x(n)x^{(1)}, \ldots, x^{(n)}:

p(ηx)exp ⁣(η(χ+i=1nT(x(i)))(ν+n)A(η))p(\boldsymbol{\eta} \mid \mathbf{x}) \propto \exp\!\left(\boldsymbol{\eta}^\top \left(\boldsymbol{\chi} + \sum_{i=1}^n T(x^{(i)})\right) - (\nu + n)A(\boldsymbol{\eta})\right)

The hyperparameters update simply: χχ+iT(x(i))\boldsymbol{\chi} \to \boldsymbol{\chi} + \sum_i T(x^{(i)}) and νν+n\nu \to \nu + n.

7.2 Beta-Bernoulli/Binomial

Model: pBeta(α,β)p \sim \text{Beta}(\alpha, \beta), X1,,XnpiidBern(p)X_1, \ldots, X_n \mid p \overset{\text{iid}}{\sim} \operatorname{Bern}(p).

Likelihood: p(xp)=pk(1p)nkp(\mathbf{x} \mid p) = p^k(1-p)^{n-k} where k=ixik = \sum_i x_i.

Posterior:

p(px)pα+k1(1p)β+nk1    pxBeta(α+k,β+nk)p(p \mid \mathbf{x}) \propto p^{\alpha+k-1}(1-p)^{\beta+n-k-1} \implies p \mid \mathbf{x} \sim \text{Beta}(\alpha+k, \beta+n-k)

Pseudocounts interpretation: The prior Beta(α,β)(\alpha, \beta) encodes α1\alpha - 1 prior successes and β1\beta - 1 prior failures. After seeing kk successes in nn trials, the posterior is Beta(α+k,β+nk)(\alpha + k, \beta + n - k) - simply adding counts.

Posterior mean:

E[px]=α+kα+β+n\mathbb{E}[p \mid \mathbf{x}] = \frac{\alpha + k}{\alpha + \beta + n}

This is a shrinkage estimator between the prior mean α/(α+β)\alpha/(\alpha+\beta) and the MLE k/nk/n:

E[px]=α+βα+β+nweight on priorαα+β+nα+β+nweight on datakn\mathbb{E}[p \mid \mathbf{x}] = \underbrace{\frac{\alpha+\beta}{\alpha+\beta+n}}_{\text{weight on prior}} \cdot \frac{\alpha}{\alpha+\beta} + \underbrace{\frac{n}{\alpha+\beta+n}}_{\text{weight on data}} \cdot \frac{k}{n}

As nn \to \infty, the posterior concentrates at the MLE k/nk/n.
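The shrinkage decomposition can be checked numerically (a sketch with the hypothetical Beta(2, 8) click-through prior and counts from Exercise 4):

```python
import numpy as np

# Beta(2, 8) prior; observe k = 15 successes in n = 100 trials
alpha, beta, k, n = 2.0, 8.0, 15, 100

post_alpha, post_beta = alpha + k, beta + n - k
post_mean = post_alpha / (post_alpha + post_beta)   # 17/110

# Weighted average of prior mean and MLE with weight (alpha+beta)/(alpha+beta+n)
prior_mean, mle = alpha / (alpha + beta), k / n
w_prior = (alpha + beta) / (alpha + beta + n)
decomposed = w_prior * prior_mean + (1 - w_prior) * mle
```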

7.3 Dirichlet-Categorical/Multinomial

Model: pDir(α)\mathbf{p} \sim \text{Dir}(\boldsymbol{\alpha}), X1,,XnpiidCat(p)X_1, \ldots, X_n \mid \mathbf{p} \overset{\text{iid}}{\sim} \operatorname{Cat}(\mathbf{p}).

Posterior:

pxDir(α+c)\mathbf{p} \mid \mathbf{x} \sim \text{Dir}(\boldsymbol{\alpha} + \mathbf{c})

where c=(c1,,cK)\mathbf{c} = (c_1, \ldots, c_K) are the observed category counts (ck=i1[x(i)=k]c_k = \sum_i \mathbf{1}[x^{(i)} = k]).

Posterior mean: E[pkx]=(αk+ck)/(α0+n)\mathbb{E}[p_k \mid \mathbf{x}] = (\alpha_k + c_k)/(\alpha_0 + n).

Add-one (Laplace) smoothing: Setting α=1\boldsymbol{\alpha} = \mathbf{1} (uniform Dirichlet prior) gives the Laplace smoothed estimate p^k=(ck+1)/(n+K)\hat{p}_k = (c_k + 1)/(n + K) - the standard technique for avoiding zero probabilities in language model unigrams.
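The equality between the Dirichlet(1\mathbf{1}) posterior mean and add-one smoothing is a one-liner to confirm (a sketch with made-up counts):

```python
import numpy as np

counts = np.array([5, 3, 0, 1, 0])   # token counts over K = 5 categories
n, K = counts.sum(), len(counts)

# Posterior mean under a uniform Dirichlet(1, ..., 1) prior
alpha = np.ones(K)
posterior_mean = (alpha + counts) / (alpha.sum() + n)

# Add-one (Laplace) smoothing: no zero probabilities
laplace = (counts + 1) / (n + K)
```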

For LDA: Each document dd has topic proportions θdDir(α)\boldsymbol{\theta}_d \sim \text{Dir}(\boldsymbol{\alpha}). Each topic kk has word distribution ϕkDir(β)\boldsymbol{\phi}_k \sim \text{Dir}(\boldsymbol{\beta}). Words are drawn as wCat(ϕzd)w \sim \operatorname{Cat}(\boldsymbol{\phi}_{z_d}) where zdCat(θd)z_d \sim \operatorname{Cat}(\boldsymbol{\theta}_d).

7.4 Gamma-Poisson

Model: λΓ(α,β)\lambda \sim \Gamma(\alpha, \beta), X1,,XnλiidPoisson(λ)X_1, \ldots, X_n \mid \lambda \overset{\text{iid}}{\sim} \operatorname{Poisson}(\lambda).

Posterior:

λxΓ ⁣(α+i=1nxi,  β+n)\lambda \mid \mathbf{x} \sim \Gamma\!\left(\alpha + \sum_{i=1}^n x_i, \; \beta + n\right)

Posterior mean: (α+xi)/(β+n)(\alpha + \sum x_i)/(\beta + n) - shrinkage between prior mean α/β\alpha/\beta and MLE xˉ\bar{x}.

Interpretations: The prior Gamma(α,β)(\alpha, \beta) encodes "α\alpha events observed over β\beta prior time units." After observing xi\sum x_i events in nn new time units, the posterior is Gamma(α+xi,β+n)(\alpha + \sum x_i, \beta + n).
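A minimal sketch of the Gamma-Poisson update on simulated counts (the seed, prior, and true rate are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# True rate 4.0; prior Gamma(2, 1) encodes "2 events over 1 prior time unit"
true_lam, alpha, beta = 4.0, 2.0, 1.0
x = rng.poisson(true_lam, size=50)

# Conjugate update: add total count to alpha, number of observations to beta
post_alpha = alpha + x.sum()
post_beta = beta + len(x)
post_mean = post_alpha / post_beta   # shrinks the MLE toward alpha/beta

mle = x.mean()
prior_mean = alpha / beta
```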

7.5 Normal-Normal

Model: μN(μ0,τ2)\mu \sim \mathcal{N}(\mu_0, \tau^2), X1,,XnμiidN(μ,σ2)X_1, \ldots, X_n \mid \mu \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2) with known σ2\sigma^2.

Posterior:

μxN(μn,τn2)\mu \mid \mathbf{x} \sim \mathcal{N}(\mu_n, \tau_n^2)

where:

1τn2=1τ2+nσ2μn=τn2 ⁣(μ0τ2+nxˉσ2)\frac{1}{\tau_n^2} = \frac{1}{\tau^2} + \frac{n}{\sigma^2} \qquad \mu_n = \tau_n^2\!\left(\frac{\mu_0}{\tau^2} + \frac{n\bar{x}}{\sigma^2}\right)

Precision-weighted average: The posterior mean is:

μn=τ2τ2+nσ2μ0+nσ2τ2+nσ2xˉ\mu_n = \frac{\tau^{-2}}{\tau^{-2} + n\sigma^{-2}} \mu_0 + \frac{n\sigma^{-2}}{\tau^{-2} + n\sigma^{-2}} \bar{x}

a precision-weighted combination of prior mean and sample mean. As nn \to \infty, μnxˉ\mu_n \to \bar{x} and τn20\tau_n^2 \to 0 (posterior concentrates at MLE).
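The precision-weighted update takes a few lines to sketch (the observations below are hypothetical):

```python
import numpy as np

# Prior N(0, 1) on mu; data from N(mu, sigma^2) with known sigma^2 = 4
mu0, tau2, sigma2 = 0.0, 1.0, 4.0
x = np.array([2.1, 1.7, 2.5, 1.9, 2.3, 2.0])   # hypothetical observations
n, xbar = len(x), x.mean()

# Precisions add; posterior mean is the precision-weighted average
post_prec = 1 / tau2 + n / sigma2
tau_n2 = 1 / post_prec
mu_n = tau_n2 * (mu0 / tau2 + n * xbar / sigma2)
```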


8. ML Applications

8.1 Language Models: Categorical and Temperature

Every forward pass of a language model computes:

  1. Logits zRV\mathbf{z} \in \mathbb{R}^V over vocabulary of size VV
  2. Token probabilities p=softmax(z/τ)\mathbf{p} = \operatorname{softmax}(\mathbf{z}/\tau) at temperature τ\tau
  3. Next token xtCat(p)x_t \sim \operatorname{Cat}(\mathbf{p})

The cross-entropy training loss is logpxt=zxt/τ+logkezk/τ-\log p_{x_t} = -z_{x_t}/\tau + \log\sum_k e^{z_k/\tau}, which is exactly ηxt+A(η/τ)-\eta_{x_t} + A(\boldsymbol{\eta}/\tau) - the NLL of a Categorical exponential family member.

Perplexity: PPL=exp ⁣(1Tt=1Tlogp(xtx<t))=exp(H(p,p^))\operatorname{PPL} = \exp\!\left(-\frac{1}{T}\sum_{t=1}^T \log p(x_t \mid x_{<t})\right) = \exp(H(p, \hat{p})) - the exponentiated cross-entropy, measuring the model's effective branching factor: the number of equally likely next tokens it is choosing among.
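A toy computation of cross-entropy and perplexity from logits (the logits and targets are made up; a step with uniform logits should contribute perplexity equal to the vocabulary size):

```python
import numpy as np

def log_softmax(z):
    # Stable log-softmax via the log-partition function
    m = z.max()
    return z - (m + np.log(np.exp(z - m).sum()))

# Toy "LM": logits over a vocabulary of 4 tokens at each of 3 steps
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.0, 3.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0, 1.0]])
targets = np.array([0, 1, 2])
tau = 1.0   # temperature

log_probs = np.array([log_softmax(z / tau)[t] for z, t in zip(logits, targets)])
cross_entropy = -log_probs.mean()
perplexity = float(np.exp(cross_entropy))

# The uniform third step alone has perplexity exactly 4 (the vocab size)
ppl_uniform = float(np.exp(-log_softmax(logits[2] / tau)[2]))
```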

8.2 VAEs: Gaussian Reparameterisation and KL Term

The VAE ELBO is:

L(ϕ,θ)=Eqϕ(zx)[logpθ(xz)]DKL(qϕ(zx)p(z))\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x})}[\log p_\theta(\mathbf{x}\mid\mathbf{z})] - D_{\mathrm{KL}}(q_\phi(\mathbf{z}\mid\mathbf{x}) \| p(\mathbf{z}))

With qϕ(zx)=N(μϕ,diag(σϕ2))q_\phi(\mathbf{z}\mid\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\phi, \operatorname{diag}(\boldsymbol{\sigma}^2_\phi)) and p(z)=N(0,I)p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, I):

DKL=12j=1d(σϕ,j2+μϕ,j21logσϕ,j2)D_{\mathrm{KL}} = \frac{1}{2}\sum_{j=1}^d \left(\sigma_{\phi,j}^2 + \mu_{\phi,j}^2 - 1 - \log\sigma_{\phi,j}^2\right)

This closed-form KL follows from the Gaussian moment identities E[zj]=μj\mathbb{E}[z_j] = \mu_j and E[zj2]=μj2+σj2\mathbb{E}[z_j^2] = \mu_j^2 + \sigma_j^2. The reparameterisation trick enables backpropagation: z=μϕ+σϕϵ\mathbf{z} = \boldsymbol{\mu}_\phi + \boldsymbol{\sigma}_\phi \odot \boldsymbol{\epsilon}, ϵN(0,I)\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I).
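The closed-form KL is a one-liner; a sketch (shapes and values below are arbitrary, and the log-variance parameterisation is the common convention in VAE code):

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dims."""
    var = np.exp(log_var)
    return 0.5 * np.sum(var + mu**2 - 1.0 - log_var)

# KL is zero exactly when the posterior equals the prior N(0, I)
kl_zero = gaussian_kl(np.zeros(8), np.zeros(8))

# Any departure in mean or variance increases it
kl_shifted = gaussian_kl(np.full(8, 0.5), np.zeros(8))   # 0.5 * 8 * 0.25 = 1.0
kl_narrow = gaussian_kl(np.zeros(8), np.full(8, -1.0))
```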

8.3 Dropout as Bernoulli Masking

Dropout applies a Bernoulli(1pdrop)(1-p_\text{drop}) mask independently to each activation:

h~i=himi,miiidBern(1pdrop)\tilde{h}_i = h_i \cdot m_i, \quad m_i \overset{\text{iid}}{\sim} \operatorname{Bern}(1-p_\text{drop})

At test time, each activation is scaled by the mask expectation E[mi]=1pdrop\mathbb{E}[m_i] = 1-p_\text{drop}: h~i=(1pdrop)hi\tilde{h}_i = (1-p_\text{drop}) h_i. (Modern frameworks use inverted dropout, dividing by 1pdrop1-p_\text{drop} during training instead, so no test-time scaling is needed.)

This can be viewed as approximately training an ensemble of 2d2^d weight-sharing networks (where dd is the number of units), with one member sampled at each step.

8.4 RLHF: Bradley-Terry and Beta Prior

The Bradley-Terry model assigns probability to human preferences:

P(response A preferred over BrA,rB)=σ(rArB)P(\text{response A preferred over B} \mid r_A, r_B) = \sigma(r_A - r_B)

where rA,rBr_A, r_B are learned reward scalars and σ\sigma is the sigmoid (the Bernoulli logit link). The preference outcome is thus Bernoulli-distributed with natural parameter (log-odds) rArBr_A - r_B.

A Beta prior on the preference probability p=σ(rArB)p = \sigma(r_A - r_B) regularises the reward model toward a neutral preference (p0.5p \approx 0.5).

8.5 Diffusion Models: Gaussian Noise Schedule

The forward process adds Gaussian noise over TT steps:

q(xtxt1)=N(xt;1βtxt1,βtI)q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)

Using the Gaussian stability under sums (Section3.2), the marginal at step tt is:

q(xtx0)=N(xt;αˉtx0,(1αˉt)I),αˉt=s=1t(1βs)q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I), \quad \bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s)

The reverse process pθ(xt1xt)p_\theta(x_{t-1} \mid x_t) is also Gaussian (for small βt\beta_t), with mean predicted by the neural network.
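The marginal variance 1αˉt1-\bar{\alpha}_t can be confirmed by unrolling the single-step recursion for Var(xtx0)\operatorname{Var}(x_t \mid x_0) (a sketch with a DDPM-style linear schedule; the schedule endpoints are illustrative):

```python
import numpy as np

# Linear beta schedule over T steps (endpoint values are illustrative)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# Composing q(x_t | x_{t-1}) steps: the conditional variance given x_0 obeys
#   var_t = (1 - beta_t) * var_{t-1} + beta_t,   var_0 = 0
# which unrolls to exactly 1 - alpha_bar_t.
var = 0.0
recursed = np.empty(T)
for t in range(T):
    var = (1.0 - betas[t]) * var + betas[t]
    recursed[t] = var
```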

8.6 Topic Models (LDA)

Latent Dirichlet Allocation uses three distributions from this section:

  1. θdDir(α)\boldsymbol{\theta}_d \sim \text{Dir}(\boldsymbol{\alpha}) - topic proportions per document (Dirichlet)
  2. zdnCat(θd)z_{dn} \sim \operatorname{Cat}(\boldsymbol{\theta}_d) - topic assignment per word (Categorical)
  3. wdnCat(ϕzdn)w_{dn} \sim \operatorname{Cat}(\boldsymbol{\phi}_{z_{dn}}) - word given topic (Categorical with Dirichlet prior ϕkDir(β)\boldsymbol{\phi}_k \sim \text{Dir}(\boldsymbol{\beta}))

The full joint is:

p(W,Z,Θ,Φ)=kDir(ϕk;β)dDir(θd;α)dnCat(zdn;θd)Cat(wdn;ϕzdn)p(\mathbf{W}, \mathbf{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi}) = \prod_k \text{Dir}(\boldsymbol{\phi}_k;\boldsymbol{\beta}) \prod_d \text{Dir}(\boldsymbol{\theta}_d;\boldsymbol{\alpha}) \prod_{dn} \operatorname{Cat}(z_{dn};\boldsymbol{\theta}_d) \operatorname{Cat}(w_{dn};\boldsymbol{\phi}_{z_{dn}})

Inference uses collapsed Gibbs sampling (Dirichlet-Categorical conjugacy allows marginalising Θ\boldsymbol{\Theta} and Φ\boldsymbol{\Phi}).


9. Common Mistakes

#MistakeWhy It's WrongFix
1Confusing the Gaussian parameter σ2\sigma^2 with σ\sigmaN(μ,σ2)\mathcal{N}(\mu, \sigma^2) takes variance as second argument in most conventions, but NumPy's np.random.normal(mu, sigma) takes std devAlways state whether you use variance or std dev; in code use scale=sigma (std dev)
2Assuming Poisson when variance > meanPoisson requires Var = Mean exactly; real count data often has overdispersionCheck the variance-to-mean ratio; use Negative Binomial for overdispersed data
3Using Binomial for small pp with large nnNumerical issues with (nk)pk(1p)nk\binom{n}{k}p^k(1-p)^{n-k} when nn is largeSwitch to Poisson approximation or compute in log-space
4Interpreting Beta(α,β)(\alpha,\beta) parameters as probabilitiesα\alpha and β\beta are pseudocount concentrations, not the mean itselfMean is α/(α+β)\alpha/(\alpha+\beta); α=β=1\alpha=\beta=1 is uniform, not Beta(0.5,0.5)(0.5, 0.5)
5Forgetting the Dirichlet concentration parameter controls sparsityαk<1\alpha_k < 1 produces sparse samples; αk>1\alpha_k > 1 produces dense onesSet αk<1\alpha_k < 1 for topic models expecting sparse documents
6Confusing Categorical and one-hot encodingA Categorical sample is an integer; its one-hot encoding is a vectorIn PyTorch, Categorical.sample() returns indices, not one-hot vectors
7Applying Student-tt formulas when ν2\nu \leq 2Variance is undefined for ν2\nu \leq 2; mean undefined for ν1\nu \leq 1Always check ν\nu before using moment formulas; for ν=1\nu = 1 (Cauchy), even the mean is undefined
8Treating Exponential(λ)(\lambda) rate and scale interchangeablySome sources use rate λ\lambda (mean =1/λ= 1/\lambda); others use scale θ=1/λ\theta = 1/\lambda (mean =θ= \theta)Verify convention: SciPy stats.expon(scale=theta) uses scale; PyTorch Exponential(rate=lambda) uses rate
9Using Normal approximation to Binomial when np<5np < 5 or n(1p)<5n(1-p) < 5CLT kicks in slowly in the tails; approximation is poor for extreme ppUse exact Binomial PMF or Poisson approximation
10Claiming exponential family membership for Student-ttThe Student-tt is NOT in the exponential family (its normalising constant depends on ν\nu in a non-exponential way)Student-tt is a scale mixture of Gaussians - handle separately
11Confusing natural parameters with mean parametersThe Gaussian natural parameters are (μ/σ2,1/(2σ2))(\mu/\sigma^2, -1/(2\sigma^2)), not (μ,σ2)(\mu, \sigma^2)Distinguish mean parameterisation (human-readable) from natural parameterisation (for exponential family theory)
12Using the Beta posterior mean instead of mode for MAP estimationPosterior mean =(α+k)/(α+β+n)= (\alpha+k)/(\alpha+\beta+n); MAP (mode) =(α+k1)/(α+β+n2)= (\alpha+k-1)/(\alpha+\beta+n-2)For decisions, use posterior mode (MAP) or the full posterior, not mean by default

10. Exercises

Exercise 1 * - PMF and Moments

A biased die has faces weighted so that the probability of face kk is proportional to kk for k=1,2,3,4,5,6k = 1, 2, 3, 4, 5, 6.

(a) Find the normalising constant and write the PMF explicitly. (b) Compute E[X]\mathbb{E}[X] and Var(X)\operatorname{Var}(X). (c) Find P(X4)P(X \geq 4). (d) Is this distribution in the exponential family? Identify T(x)T(x), η\eta, and A(η)A(\eta) if so.

Exercise 2 * - Poisson Limit

A social media post receives clicks at a rate of λ=0.8\lambda = 0.8 per minute. Model the number of clicks in a 10-minute window.

(a) Write the PMF and compute P(X=5)P(X = 5), P(X=0)P(X = 0), P(X3)P(X \geq 3). (b) Suppose you model this as Binomial(n=600,p)(n=600, p) where each second either produces a click or not. Find pp and verify the Poisson limit numerically for k=5k = 5. (c) What property of the Poisson means that clicks in disjoint time windows are independent? (d) If two different posts receive λ1=1.2\lambda_1 = 1.2 and λ2=0.5\lambda_2 = 0.5 clicks/minute, what is the distribution of the total clicks per minute? Prove it using MGFs.

Exercise 3 * - Gaussian Properties

Let $X \sim \mathcal{N}(3, 4)$ (mean 3, variance 4).

(a) Standardise $X$ to obtain $Z \sim \mathcal{N}(0,1)$. (b) Compute $P(1 \leq X \leq 5)$ using $\Phi$. (c) If $Y = 2X - 1$, find the distribution of $Y$. (d) If $X_1, X_2 \overset{\text{iid}}{\sim} \mathcal{N}(3,4)$, find the distribution of $S = X_1 + X_2$. (e) Verify numerically that the MGF formula gives the correct mean and variance for $X$.

Exercise 4 ** - Beta-Binomial Conjugate Update

You are estimating the click-through rate $p$ of a button. Your prior belief is Beta$(2, 8)$.

(a) Interpret this prior: what pseudocounts does it encode? What is the prior mean? (b) You observe 15 clicks in 100 impressions. Write the posterior distribution. (c) Compute the posterior mean and compare it to the MLE $15/100 = 0.15$. (d) After how many additional clicks (keeping total impressions fixed at 100) would the posterior mean exceed $0.20$? (e) Plot the prior, likelihood (rescaled), and posterior as a function of $p$ (implement in Python).

Exercise 5 ** - Exponential Family Identification

(a) Show that the Geometric$(p)$ distribution belongs to the exponential family. Identify $\eta$, $T(x)$, $A(\eta)$, and $h(x)$. (b) For the Geometric, compute $\mathbb{E}[X]$ by differentiating $A(\eta)$. (c) Show that the Negative Binomial$(r, p)$ distribution also belongs to the exponential family. (d) The uniform distribution $\mathcal{U}(0, b)$ with unknown $b$ - does it belong to the exponential family? Explain.

Exercise 6 ** - Dirichlet-Categorical Posterior

A language model assigns log-probabilities to tokens. You use a symmetric Dirichlet$(0.1)$ prior over a vocabulary of size $K = 5$ (simplified).

(a) Sample 3 probability vectors from Dir$(0.1 \cdot \mathbf{1}_5)$ and 3 from Dir$(2 \cdot \mathbf{1}_5)$. Describe the visual difference. (b) You observe the token sequence: $[A, B, A, C, A, B, A]$ (where $\{A,B,C,D,E\}$ are the 5 tokens). Compute the posterior Dirichlet. (c) Compute the posterior mean probability for each token. (d) Compare with the Laplace-smoothed MLE estimate. Show they are equal when $\alpha_k = 1$.

Exercise 7 *** - Softmax as Exponential Family

(a) Derive the softmax function from the Categorical exponential family canonical form. Show that the log-partition function $A(\boldsymbol{\eta}) = \log\sum_k e^{\eta_k}$ leads to $\mathbb{E}[X_k] = e^{\eta_k}/\sum_j e^{\eta_j} = \operatorname{softmax}(\boldsymbol{\eta})_k$. (b) Implement log_softmax(z) in a numerically stable way (subtract the max before exponentiating). Verify it equals log(softmax(z)) but is more numerically stable for large logits. (c) The gradient of the cross-entropy loss $\mathcal{L} = -\sum_k y_k \log p_k$ with respect to the logits $\mathbf{z}$ is $\mathbf{p} - \mathbf{y}$, where $\mathbf{p} = \operatorname{softmax}(\mathbf{z})$. Derive this result. (d) Show that temperature scaling $p_k \propto e^{z_k/\tau}$ is equivalent to scaling the natural parameters, and explain why $\tau \to 0$ gives argmax and $\tau \to \infty$ gives uniform.

Exercise 8 *** - Gaussian VAE KL Term

The VAE training objective requires:

$$D_{\mathrm{KL}}\!\left(\mathcal{N}(\boldsymbol{\mu}, \operatorname{diag}(\boldsymbol{\sigma}^2)) \;\Big\|\; \mathcal{N}(\mathbf{0}, I)\right)$$

(a) Derive the closed-form expression $\frac{1}{2}\sum_j (\sigma_j^2 + \mu_j^2 - 1 - \log\sigma_j^2)$ using the Gaussian MGF. (b) Verify this equals 0 when $\mu_j = 0$ and $\sigma_j = 1$ for all $j$. (c) Implement kl_gaussian(mu, log_var) where log_var = log(sigma^2) (the standard VAE parameterisation). (d) Plot the per-dimension KL as a function of $\mu$ (with $\sigma = 1$) and as a function of $\sigma$ (with $\mu = 0$). What do the minima imply for the VAE encoder?


11. Why This Matters for AI (2026 Perspective)

| Concept | Impact on Modern AI |
|---|---|
| Categorical + softmax | Every language model output layer is a Categorical exponential family. Logits are natural parameters; cross-entropy is NLL; temperature controls the entropy of the output distribution |
| Gaussian reparameterisation | Enables end-to-end training of VAEs (2013), diffusion model denoising (2020), and Gaussian noise models in score-based generation |
| Dirichlet priors | LDA (2003) topic models still used for document understanding; Dirichlet-process priors in Bayesian nonparametric methods for adaptive-capacity models |
| Beta-Binomial conjugacy | RLHF reward modelling uses preference probability estimates; uncertainty-aware sampling uses Beta posteriors for exploration-exploitation |
| Exponential family unification | Natural gradient (K-FAC, Shampoo) uses the Fisher information matrix $= \operatorname{Cov}[T(X)]$; the link between logits and natural parameters motivates output-layer design choices |
| Student-$t$ distribution | Robust regression and uncertainty quantification in small-data regimes; $t$-distributed stochastic neighbour embedding (t-SNE) uses $t_1$ (Cauchy) tails to separate clusters |
| Poisson | Language model token position counts, API call modelling, and event-based neural networks (neuromorphic AI) |
| Gamma-Poisson conjugacy | Bayesian A/B testing for click-through rates and conversion rates; Thompson sampling for multi-armed bandits |
| Diffusion Gaussian schedules | Stable Diffusion, DALL-E, Sora all use Gaussian forward processes; the closed-form $q(x_t \mid x_0)$ enables efficient training without simulating the full Markov chain |
| Log-partition function | The logsumexp trick (numerically stable $A(\boldsymbol{\eta})$) is fundamental to FlashAttention's online softmax computation |

12. Conceptual Bridge

This section forms the vocabulary layer of probability theory. You can now think about every probabilistic model in terms of its component distributions: a Gaussian prior, a Categorical likelihood, a Dirichlet hyperprior. Without this vocabulary, reading a VAE paper or an LDA paper is like trying to read chemistry without knowing the periodic table.

Looking backward: The CDF, PDF, and PMF definitions from Section01 gave the framework; this section fills it with concrete instances. The axioms guaranteed consistency; the named distributions give tractability. Every distribution here satisfies all the axioms of Section01 - the Bernoulli is the simplest, the Dirichlet the most complex, but all obey the same rules.

Looking forward:

POSITION IN CURRICULUM
========================================================================

  Section06/01 Introduction and Random Variables
      v  (foundations: axioms, CDF, PDF, Bernoulli/Uniform preview)
  > Section06/02 Common Distributions <  <- YOU ARE HERE
      v  (full vocabulary: all named distributions, MGFs, exp family)
  Section06/03 Joint Distributions
      v  (multivariate: joint PDF, marginals, multivariate Gaussian)
  Section06/04 Expectation and Moments
      v  (derivations: LOTUS, covariance matrix, MGF applications)
  Section06/05 Concentration Inequalities
      v  (bounds: Markov, Chebyshev, Hoeffding, PAC learning)
  Section06/06 Stochastic Processes
      v  (CLT: proves the Gaussian limit relationships of Section4)
  Section06/07 Markov Chains
      (MCMC: uses conjugacy and Gaussian proposals)

========================================================================

The distributions in this section are not a list to memorise - they are a language to think in. When a practitioner says "the model is overconfident," they mean the predicted Categorical is too peaked. When they say "use a stronger prior," they mean increase $\alpha_0$ in the Dirichlet. When they say "the KL term is too large," they mean the approximate Gaussian posterior is far from the standard normal prior. Every one of these statements refers to a specific distribution from this section.

<- Back to Probability Theory | Next: Joint Distributions ->


Appendix A: Detailed Distribution Reference Cards

A.1 Bernoulli Distribution - Full Reference

BERNOULLI(p) REFERENCE CARD
========================================================================

  PMF:     P(X=x) = p^x (1-p)^(1-x),   x in {0, 1}

  CDF:     F(x) = 0          x < 0
                  1 - p      0 <= x < 1
                  1          x >= 1

  Mean:    p
  Var:     p(1-p)              [max at p=0.5]
  Mode:    1{p > 0.5}
  Entropy: -p log p - (1-p) log(1-p)

  MGF:     M(t) = 1 - p + p e^t

  Natural param:  eta = log(p/(1-p))              [logit]
  Inverse link:   p = sigma(eta) = 1/(1+e^(-eta)) [sigmoid]

  ML role:  Binary labels, dropout, stochastic depth, RLHF preferences

========================================================================

A.2 Gaussian Distribution - Full Reference

GAUSSIAN N(mu, sigma^2) REFERENCE CARD
========================================================================

  PDF:    f(x) = (1/(sigma sqrt(2 pi))) exp(-(x-mu)^2/(2 sigma^2))
  CDF:    F(x) = Phi((x-mu)/sigma)   where Phi = standard normal CDF

  Mean:       mu
  Variance:   sigma^2
  Mode:       mu (unique)
  Median:     mu (by symmetry)
  Entropy:    (1/2) log(2 pi e sigma^2)   [maximum for fixed mean/var]

  MGF:    M(t) = exp(mu t + sigma^2 t^2/2)
  CGF:    K(t) = mu t + sigma^2 t^2/2   [cumulants: k_1 = mu, k_2 = sigma^2, k_j = 0 for j >= 3]

  Standard:   Z = (X - mu)/sigma ~ N(0, 1)

  Key quantiles (standard normal):
    Phi(1.282) = 0.90,  Phi(1.645) = 0.95,  Phi(1.960) = 0.975
    Phi(2.326) = 0.99,  Phi(2.576) = 0.995, Phi(3.090) = 0.999

  68-95-99.7: P(|Z| <= 1) ≈ 0.683, P(|Z| <= 2) ≈ 0.954, P(|Z| <= 3) ≈ 0.997

  Natural params: eta_1 = mu/sigma^2, eta_2 = -1/(2 sigma^2)
  Suff stats:     T(x) = (x, x^2)

  ML role:  Weight init, VAE prior/posterior, GP, diffusion noise, batch norm

========================================================================

A.3 Beta Distribution - Full Reference

BETA(alpha, beta) REFERENCE CARD
========================================================================

  PDF:    f(x) = x^(alpha-1) (1-x)^(beta-1) / B(alpha, beta),   x in (0, 1)
          B(alpha, beta) = Gamma(alpha) Gamma(beta) / Gamma(alpha+beta)

  Mean:   alpha/(alpha+beta)
  Var:    alpha beta / [(alpha+beta)^2 (alpha+beta+1)]
  Mode:   (alpha-1)/(alpha+beta-2)   for alpha, beta > 1

  Shape patterns:
    alpha < 1, beta < 1               -> U-shaped (modes at 0 and 1)
    alpha = beta = 1                  -> Uniform(0,1)
    alpha = beta > 1                  -> Symmetric bell
    alpha > beta                      -> Skewed toward 1
    alpha < beta                      -> Skewed toward 0
    alpha, beta -> inf, ratio fixed   -> Concentrates around the mean

  Conjugate prior for: Bernoulli, Binomial
  Posterior update:    Beta(alpha, beta) + k successes, n-k failures
                       -> Beta(alpha+k, beta+n-k)

  ML role:  CTR estimation, RLHF preference priors, Beta-VAE

========================================================================

Appendix B: The Gamma Function

The gamma function $\Gamma(z)$ generalises the factorial to non-integer arguments and appears in the normalising constants of the Beta, Dirichlet, Gamma, and Student-$t$ distributions.

B.1 Definition and Key Properties

For $z > 0$:

$$\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt$$

Recurrence relation: $\Gamma(z+1) = z\Gamma(z)$

Proof: Integration by parts with $u = t^z$ and $dv = e^{-t}\,dt$:

$$\Gamma(z+1) = \left[-t^z e^{-t}\right]_0^\infty + z\int_0^\infty t^{z-1}e^{-t}\,dt = 0 + z\Gamma(z)$$

Factorial connection: $\Gamma(n) = (n-1)!$ for positive integers $n$.

Half-integer values: $\Gamma(1/2) = \sqrt{\pi}$, $\Gamma(3/2) = \sqrt{\pi}/2$, $\Gamma(5/2) = 3\sqrt{\pi}/4$.
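These identities are easy to sanity-check numerically; a minimal sketch using scipy.special.gamma:

```python
import math
from scipy.special import gamma

# Factorial connection: Gamma(n) = (n-1)!
assert math.isclose(gamma(5), math.factorial(4))   # Gamma(5) = 4! = 24

# Recurrence: Gamma(z+1) = z * Gamma(z), at a non-integer argument
z = 3.7
assert math.isclose(gamma(z + 1), z * gamma(z))

# Half-integer values
assert math.isclose(gamma(0.5), math.sqrt(math.pi))
assert math.isclose(gamma(1.5), math.sqrt(math.pi) / 2)
assert math.isclose(gamma(2.5), 3 * math.sqrt(math.pi) / 4)
```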

B.2 The Beta Function

The Beta function $B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ satisfies:

$$B(\alpha, \beta) = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\,dt$$

This integral is precisely the normalising constant of the Beta$(\alpha, \beta)$ PDF.

Symmetry: $B(\alpha, \beta) = B(\beta, \alpha)$.

B.3 Stirling's Approximation

For large $n$: $n! \approx \sqrt{2\pi n}\,(n/e)^n$.

In terms of Gamma: $\Gamma(n+1) \approx \sqrt{2\pi n}\,(n/e)^n$ for large $n$.

This is used to analyse the Poisson distribution at large $k$ and the convergence of the Binomial to the Poisson.


Appendix C: Sampling from Distributions

C.1 Inverse CDF Method

For any distribution with a continuous, invertible CDF $F$:

  1. Draw $U \sim \mathcal{U}(0,1)$
  2. Return $X = F^{-1}(U)$

Then $X \sim F$.

Explicit inverse CDFs:

| Distribution | Inverse CDF $F^{-1}(u)$ |
|---|---|
| Exponential$(\lambda)$ | $-\log(1-u)/\lambda$ |
| Geometric$(p)$ | $\lceil \log(1-u)/\log(1-p) \rceil$ |
| Cauchy$(\mu, \sigma)$ | $\mu + \sigma\tan(\pi(u-1/2))$ |
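The two-step recipe can be sketched in NumPy for the Exponential case (the function name is illustrative):

```python
import numpy as np

def sample_exponential(lam, size, rng):
    """Inverse-CDF sampling: X = -log(1-U)/lam for U ~ Uniform(0,1)."""
    u = rng.uniform(size=size)
    return -np.log1p(-u) / lam   # log1p(-u) = log(1-u), stable for small u

rng = np.random.default_rng(0)
x = sample_exponential(lam=2.0, size=200_000, rng=rng)
print(x.mean())   # close to 1/lam = 0.5
print(x.var())    # close to 1/lam^2 = 0.25
```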

Gaussian has no closed-form inverse CDF; the Box-Muller transform is used instead:

$$Z_1 = \sqrt{-2\log U_1}\cos(2\pi U_2), \quad Z_2 = \sqrt{-2\log U_1}\sin(2\pi U_2)$$

where $U_1, U_2 \sim \mathcal{U}(0,1)$ i.i.d. gives $Z_1, Z_2 \sim \mathcal{N}(0,1)$ i.i.d.
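A minimal Box-Muller sketch, assuming NumPy:

```python
import numpy as np

def box_muller(n, rng):
    """Generate 2n standard normal samples from 2n uniforms."""
    u1 = rng.uniform(size=n)
    u2 = rng.uniform(size=n)
    r = np.sqrt(-2.0 * np.log1p(-u1))    # radius: Rayleigh-distributed; log1p avoids log(0)
    z1 = r * np.cos(2.0 * np.pi * u2)
    z2 = r * np.sin(2.0 * np.pi * u2)
    return np.concatenate([z1, z2])

rng = np.random.default_rng(1)
z = box_muller(100_000, rng)
print(z.mean(), z.std())   # approximately 0 and 1
```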

C.2 Sampling the Dirichlet

To sample $\mathbf{p} \sim \text{Dir}(\boldsymbol{\alpha})$:

  1. Draw $Y_k \sim \Gamma(\alpha_k, 1)$ independently for $k = 1, \ldots, K$
  2. Return $p_k = Y_k / \sum_{j=1}^K Y_j$

This works because the normalised Gamma variables follow a Dirichlet distribution.
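A sketch of this Gamma-normalisation recipe, with the sample mean checked against the Dirichlet mean $\boldsymbol{\alpha}/\alpha_0$ (NumPy assumed; the function name is invented here):

```python
import numpy as np

def sample_dirichlet(alpha, rng):
    """Dirichlet sampling via normalised Gamma draws."""
    y = rng.gamma(shape=alpha)        # Y_k ~ Gamma(alpha_k, 1), independent
    return y / y.sum()                # normalise onto the simplex

rng = np.random.default_rng(42)
alpha = np.array([2.0, 5.0, 3.0])
samples = np.array([sample_dirichlet(alpha, rng) for _ in range(50_000)])
print(samples.mean(axis=0))           # close to alpha / alpha.sum() = [0.2, 0.5, 0.3]
```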

C.3 The Gumbel-Max Trick for Categorical

To sample from Categorical$(\mathbf{p})$:

  1. Compute logits $\mathbf{z} = \log\mathbf{p}$ (or use raw logits)
  2. Draw $G_k \sim \text{Gumbel}(0,1) = -\log(-\log U_k)$ for $U_k \sim \mathcal{U}(0,1)$
  3. Return $X = \arg\max_k (z_k + G_k)$

This gives exact categorical samples and enables the Gumbel-softmax differentiable approximation by replacing argmax with softmax at low temperature.
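The three steps can be sketched as follows (illustrative names, NumPy assumed):

```python
import numpy as np

def gumbel_max_sample(logits, rng):
    """Draw one Categorical sample via the Gumbel-max trick."""
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u))            # Gumbel(0,1) noise
    return int(np.argmax(logits + g))  # argmax of perturbed logits

rng = np.random.default_rng(0)
logits = np.log(np.array([0.1, 0.6, 0.3]))
draws = [gumbel_max_sample(logits, rng) for _ in range(30_000)]
counts = np.bincount(draws, minlength=3)
print(counts / counts.sum())           # close to [0.1, 0.6, 0.3]
```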


Appendix D: Heavy-Tailed Distributions

D.1 What Makes a Tail Heavy?

The Gaussian tail decays as $e^{-x^2/2}$ - super-exponentially fast. Distributions whose tails decay more slowly than any exponential are called heavy-tailed.

Pareto distribution (power law): $P(X > x) = (x_m/x)^\alpha$ for $x > x_m$. The tail decays as $x^{-\alpha}$.

  • $\alpha > 2$: finite variance
  • $1 < \alpha \leq 2$: finite mean, infinite variance
  • $0 < \alpha \leq 1$: infinite mean

Student-$t_\nu$: $P(X > x) \sim x^{-\nu}$ for large $x$ - a polynomial tail with exponent $\nu$.

Cauchy ($t_1$): infinite variance, undefined mean. The average of $n$ i.i.d. Cauchy variables is still Cauchy - the CLT fails completely.

D.2 Heavy Tails in ML

Gradient noise: Empirical studies show that SGD gradient noise often has heavier tails than Gaussian. This may explain generalization: heavy-tailed noise explores the loss landscape more broadly, escaping sharp minima.

Neural network weight distributions: Trained neural network weights often follow power-law distributions (Martin & Mahoney 2019, 2021). "Heavy-tailed self-regularization" is proposed as a mechanism for implicit regularization.

Attention scores: Without temperature scaling, softmax attention can produce near-degenerate distributions (almost all mass on one token) - the low-entropy extreme of a peaked Categorical rather than heavy-tailed behaviour.


Appendix E: Worked Derivations

E.1 Poisson MGF Derivation

$$M_X(t) = \sum_{k=0}^\infty e^{tk} \cdot \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k=0}^\infty \frac{(\lambda e^t)^k}{k!} = e^{-\lambda} \cdot e^{\lambda e^t} = e^{\lambda(e^t-1)}$$

Checking moments: $M'(t) = \lambda e^t M(t)$. At $t=0$: $M'(0) = \lambda \cdot 1 = \lambda = \mathbb{E}[X]$. [ok]

$M''(t) = \lambda e^t M(t) + (\lambda e^t)^2 M(t)$. At $t=0$: $M''(0) = \lambda + \lambda^2 = \mathbb{E}[X^2]$.

$\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = (\lambda + \lambda^2) - \lambda^2 = \lambda$. [ok]
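The closed form can be sanity-checked by Monte Carlo, estimating $\mathbb{E}[e^{tX}]$ from Poisson samples (a sketch, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(7)
lam, t = 2.0, 0.3
x = rng.poisson(lam, size=500_000)

# Empirical MGF vs the closed form exp(lambda (e^t - 1))
empirical_mgf = np.exp(t * x).mean()
closed_form = np.exp(lam * (np.exp(t) - 1.0))
print(empirical_mgf, closed_form)      # the two agree closely

# Moments from the derivation: E[X] = Var(X) = lambda
print(x.mean(), x.var())               # both close to 2.0
```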

E.2 Gaussian MGF Derivation (Complete)

$$M_X(t) = \int_{-\infty}^\infty e^{tx} \cdot \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}\,dx$$

Combine the exponents: $tx - \frac{(x-\mu)^2}{2\sigma^2} = -\frac{1}{2\sigma^2}\left[(x-\mu)^2 - 2\sigma^2 tx\right]$

$$= -\frac{1}{2\sigma^2}\left[x^2 - 2x(\mu + \sigma^2 t) + \mu^2\right]$$

$$= -\frac{(x - (\mu+\sigma^2 t))^2}{2\sigma^2} + \mu t + \frac{\sigma^2 t^2}{2}$$

Substituting back:

$$M_X(t) = e^{\mu t + \sigma^2 t^2/2} \int_{-\infty}^\infty \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-(\mu+\sigma^2 t))^2/(2\sigma^2)}\,dx = e^{\mu t + \sigma^2 t^2/2}$$

since the integral equals 1 (it is a Gaussian PDF with mean $\mu+\sigma^2 t$). $\square$

E.3 Beta-Bernoulli Posterior Derivation

Prior: $p(p) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)}$. Likelihood: $L(p \mid k, n) = \binom{n}{k}p^k(1-p)^{n-k}$.

Posterior:

$$p(p \mid k, n) \propto p^{\alpha-1}(1-p)^{\beta-1} \cdot p^k(1-p)^{n-k} = p^{(\alpha+k)-1}(1-p)^{(\beta+n-k)-1}$$

This is the unnormalised Beta$(\alpha+k, \beta+n-k)$ PDF, so the posterior is:

$$p \mid (k, n) \sim \text{Beta}(\alpha+k, \beta+n-k) \quad \square$$

E.4 Gamma-Poisson Posterior Derivation

Prior: $\lambda \sim \Gamma(\alpha, \beta)$, so $p(\lambda) \propto \lambda^{\alpha-1}e^{-\beta\lambda}$. Likelihood for $n$ observations with sum $S = \sum_i x_i$:

$$p(\mathbf{x} \mid \lambda) = \prod_{i=1}^n \frac{\lambda^{x_i}e^{-\lambda}}{x_i!} \propto \lambda^S e^{-n\lambda}$$

Posterior:

$$p(\lambda \mid \mathbf{x}) \propto \lambda^{\alpha-1}e^{-\beta\lambda} \cdot \lambda^S e^{-n\lambda} = \lambda^{(\alpha+S)-1}e^{-(\beta+n)\lambda}$$

This is $\Gamma(\alpha+S, \beta+n)$. $\square$


Appendix F: Exponential Family - Additional Members

F.1 Negative Binomial as Exponential Family

The Negative Binomial$(r, p)$ PMF for $k = 0, 1, 2, \ldots$:

$$P(X=k) = \binom{k+r-1}{k}(1-p)^r p^k$$

Writing $\eta = \log p$:

$$P(X=k) = \binom{k+r-1}{k}(1-p)^r \exp(\eta k)$$

Natural param: $\eta = \log p \in (-\infty, 0)$; sufficient stat: $T(x) = x$; log-partition: $A(\eta) = -r\log(1-e^\eta)$.

Note: The parameter $r$ (the number of failures at which sampling stops) must be known; the family is parameterised only by $p$ when $r$ is fixed.
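Differentiating the log-partition gives $\mathbb{E}[X] = A'(\eta) = r e^\eta/(1-e^\eta) = rp/(1-p)$, which can be checked against SciPy. A caveat: scipy.stats.nbinom counts failures before the r-th success, so its success probability corresponds to $1-p$ in this parameterisation (a sketch):

```python
import numpy as np
from scipy import stats

r, p = 5, 0.3                      # p as in the PMF above (e^eta = p)
eta = np.log(p)

# E[X] = A'(eta) with A(eta) = -r log(1 - e^eta)
mean_from_A = r * np.exp(eta) / (1.0 - np.exp(eta))   # = r p / (1-p)

# scipy's nbinom(n, p_s): failures before the r-th success with success prob p_s,
# so p_s here plays the role of (1-p) in this appendix's convention
mean_scipy = stats.nbinom(n=r, p=1.0 - p).mean()
print(mean_from_A, mean_scipy)     # both equal r*p/(1-p) = 15/7
```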

F.2 von Mises Distribution (Circular Data)

For directional/angular data (e.g., wind direction, protein torsion angles), the von Mises distribution plays the role that the Gaussian plays for linear data:

$$f(\theta; \mu, \kappa) = \frac{e^{\kappa\cos(\theta-\mu)}}{2\pi I_0(\kappa)}, \quad \theta \in [0, 2\pi)$$

where $I_0$ is the modified Bessel function. It belongs to the exponential family with natural parameters $(\kappa\cos\mu, \kappa\sin\mu)$ and sufficient statistics $(\cos\theta, \sin\theta)$ - often written compactly as the complex parameter $\kappa e^{i\mu}$.

For AI: Rotary Position Embedding (RoPE) in transformers embeds token positions as rotations in 2D planes. The von Mises distribution is the natural distribution for such circular position encodings.


Appendix G: Numerical Stability in Practice

G.1 Log-Sum-Exp

The log-partition function $A(\boldsymbol{\eta}) = \log\sum_k e^{\eta_k}$ is numerically unstable for large or small $\eta_k$ due to floating-point overflow/underflow.

Stable computation:

$$\log\sum_k e^{\eta_k} = c + \log\sum_k e^{\eta_k - c}, \quad c = \max_k \eta_k$$

Since $e^{\eta_k - c} \in (0, 1]$ for all $k$, no overflow occurs.
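A direct implementation of the shifted sum, compared against SciPy's reference logsumexp (a sketch):

```python
import numpy as np
from scipy.special import logsumexp   # reference implementation

def log_sum_exp(eta):
    """Numerically stable log-partition A(eta) = log sum_k exp(eta_k)."""
    c = np.max(eta)
    return c + np.log(np.sum(np.exp(eta - c)))

# Naive np.log(np.sum(np.exp(eta))) overflows to inf for logits this large
eta = np.array([1000.0, 1001.0, 999.0])
print(log_sum_exp(eta))                               # finite, about 1001.41
print(np.allclose(log_sum_exp(eta), logsumexp(eta)))  # matches scipy
```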

G.2 Log-Space Probability Computations

For small probabilities, work in log space throughout:

  • Multiply probabilities: add log-probabilities
  • Normalise: subtract log-sum-exp
  • Compute Bernoulli NLL: -y * log_p - (1-y) * log(1-p) should use log_p = log_sigmoid(logit) and log(1-p) = log_sigmoid(-logit) for numerical stability
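The last bullet can be sketched as follows - log_sigmoid and bernoulli_nll are illustrative names, not a library API:

```python
import numpy as np

def log_sigmoid(z):
    """log sigma(z) = -log(1 + e^(-z)), computed without overflow."""
    return -np.logaddexp(0.0, -z)     # logaddexp(0, -z) = log(1 + e^(-z))

def bernoulli_nll(y, logit):
    """Stable -[y log p + (1-y) log(1-p)] with p = sigma(logit)."""
    return -(y * log_sigmoid(logit) + (1 - y) * log_sigmoid(-logit))

# For a large logit the naive route fails: p rounds to exactly 1.0,
# so log(1-p) would be -inf; the stable version stays finite
z = 40.0
print(bernoulli_nll(y=0, logit=z))    # ~40.0 (true NLL is log(1 + e^40) ~ 40)
print(bernoulli_nll(y=1, logit=0.0))  # log 2 ~ 0.6931
```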

G.3 The Gamma Function at Large Arguments

$\log\Gamma(z)$ can be computed stably using the Lanczos approximation. For $z \gg 1$, use Stirling's log-approximation: $\log\Gamma(z+1) \approx z\log z - z + \frac{1}{2}\log(2\pi z)$.

SciPy provides scipy.special.gammaln for numerically stable logΓ\log\Gamma.


Appendix H: Distribution Identification Checklist

When fitting a distribution to data, use this diagnostic checklist:

| Check | Tool | Implication |
|---|---|---|
| Are values integers $\geq 0$? | - | Consider discrete (Poisson, NB, Geometric) |
| Are values in $(0,1)$? | - | Beta or transformed Gaussian |
| Are values on a simplex? | - | Dirichlet |
| Is variance $\approx$ mean? | Dispersion test | Poisson (if yes), NB (if var > mean) |
| Is the distribution symmetric? | Skewness test | Gaussian, $t$, symmetric Beta |
| Are tails heavier than Gaussian? | Kurtosis, QQ-plot | Student-$t$, Laplace |
| Does a QQ-plot follow the diagonal? | scipy.stats.probplot | Good fit to the reference distribution |
| Is the histogram bell-shaped? | plt.hist | Gaussian or Beta with $\alpha=\beta>1$ |
| Is the histogram exponentially decaying? | Log-scale plot | Exponential or Geometric |
| Do log-log tails follow a line? | Log-log tail plot | Power law (Pareto) |

Sample size requirements:

| Distribution | Minimum samples for a good MLE |
|---|---|
| Bernoulli | ~30 (for $p$ far from 0/1) |
| Gaussian | ~30 (CLT kicks in) |
| Poisson | ~50 (for $\lambda < 1$); ~10 for large $\lambda$ |
| Beta | ~100 (moment-matching for $\alpha, \beta$) |
| Dirichlet | ~$100K$ (scales with the dimension $K$) |
| Student-$t$ | ~50 (for a reliable $\nu$ estimate) |

Appendix I: Distribution Parameter Estimation (MLE)

I.1 Maximum Likelihood Estimation Overview

The MLE $\hat{\boldsymbol{\theta}}$ maximises the likelihood $L(\boldsymbol{\theta}) = \prod_{i=1}^n p(x^{(i)}; \boldsymbol{\theta})$, or equivalently the log-likelihood $\ell(\boldsymbol{\theta}) = \sum_i \log p(x^{(i)}; \boldsymbol{\theta})$.

For exponential family members, the MLE satisfies the moment-matching condition:

$$\nabla A(\hat{\boldsymbol{\eta}}) = \frac{1}{n}\sum_{i=1}^n T(x^{(i)})$$

The expected sufficient statistics under the model equal the empirical sufficient statistics.

I.2 MLEs of Key Distributions

Bernoulli$(p)$: $\hat{p} = \frac{1}{n}\sum_i x_i = \bar{x}$ (sample proportion)

Gaussian$(\mu, \sigma^2)$:

$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_i x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2$$

Note: $\hat{\sigma}^2_\text{MLE} = \frac{n-1}{n} S^2$ is biased (it divides by $n$, not $n-1$). The unbiased estimator uses the sample variance $S^2 = \frac{1}{n-1}\sum_i(x_i - \bar{x})^2$.

Poisson$(\lambda)$: $\hat{\lambda} = \bar{x}$ (sample mean)

Exponential$(\lambda)$: $\hat{\lambda} = 1/\bar{x}$ (reciprocal of the sample mean)

Categorical$(\mathbf{p})$: $\hat{p}_k = c_k/n$ where $c_k = \sum_i \mathbf{1}[x^{(i)}=k]$ (empirical frequencies)

Beta$(\alpha, \beta)$: no closed form. The method of moments gives starting values:

$$\tilde{\mu} = \bar{x}, \quad \tilde{s}^2 = \frac{1}{n-1}\sum_i(x_i - \bar{x})^2$$

$$\hat{\alpha} = \tilde{\mu}\!\left(\frac{\tilde{\mu}(1-\tilde{\mu})}{\tilde{s}^2} - 1\right), \quad \hat{\beta} = (1-\tilde{\mu})\!\left(\frac{\tilde{\mu}(1-\tilde{\mu})}{\tilde{s}^2} - 1\right)$$

Refine with Newton-Raphson using the digamma function $\psi = \Gamma'/\Gamma$.
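A sketch of the moment-matching starting values, checked on synthetic Beta data (the function name is invented here; SciPy/NumPy assumed):

```python
import numpy as np
from scipy import stats

def beta_method_of_moments(x):
    """Moment-matching starting values for Beta(alpha, beta)."""
    m = x.mean()
    s2 = x.var(ddof=1)                      # sample variance (n-1 denominator)
    common = m * (1.0 - m) / s2 - 1.0
    return m * common, (1.0 - m) * common

rng = np.random.default_rng(3)
x = stats.beta(a=2.0, b=5.0).rvs(size=100_000, random_state=rng)
alpha_hat, beta_hat = beta_method_of_moments(x)
print(alpha_hat, beta_hat)                  # close to the true (2, 5)
```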

I.3 Bayesian Estimation vs. MLE

Under conjugate priors, the posterior mean often provides a better estimator than MLE, especially for small samples:

| Distribution | MLE | Posterior Mean (conjugate prior) |
|---|---|---|
| Bernoulli | $k/n$ | $(\alpha+k)/(\alpha+\beta+n)$ |
| Categorical | $c_k/n$ | $(\alpha_k+c_k)/(\alpha_0+n)$ |
| Poisson | $\bar{x}$ | $(\alpha+\sum_i x_i)/(\beta+n)$ |

The posterior mean shrinks the MLE toward the prior mean - a form of regularisation that prevents overfitting to small samples.


Appendix J: Relationships Between Distributions - Extended

J.1 The Exponential Family as a Unifying Framework

ALL MEMBERS OF THE EXPONENTIAL FAMILY
========================================================================

  One-parameter families:
    Bernoulli(p)        - natural param: log(p/(1-p))
    Poisson(lambda)     - natural param: log(lambda)
    Exponential(lambda) - natural param: -lambda
    Geometric(p)        - natural param: log(1-p)

  Two-parameter families:
    Gaussian(mu,sigma^2) - natural params: (mu/sigma^2, -1/(2 sigma^2))
    Gamma(alpha,beta)    - natural params: (alpha-1, -beta)
    Beta(alpha,beta)     - natural params: (alpha-1, beta-1)
    NegBin(r,p)          - natural param:  log(p)  [r fixed]

  K-parameter families:
    Categorical(p)      - natural params: (log p_k/p_K)_{k<K}
    Multinomial(n,p)    - same as Categorical  [n fixed]
    Dirichlet(alpha)    - natural params: (alpha_k - 1)_{k=1..K}

  NOT exponential family:
    Student-t(nu)       - tail behaviour is not exponential
    Cauchy              - same
    Pareto              - support depends on the parameter

========================================================================

J.2 Scale Mixtures of Gaussians

Several heavy-tailed distributions arise as variance mixtures of Gaussians: $X \mid V \sim \mathcal{N}(0, V)$ with random variance $V$.

| Mixing distribution for $V$ | Marginal distribution of $X$ |
|---|---|
| $V = \sigma^2$ (constant) | $\mathcal{N}(0, \sigma^2)$ |
| $V \sim \text{InvGamma}(\nu/2, \nu/2)$ | Student-$t_\nu$ |
| $V \sim \text{Exponential}(\lambda^2/2)$ | Laplace$(0, 1/\lambda)$ |
| $V \sim \text{Levy}(\mu, c)$ | $\alpha$-stable distribution |

For AI: The variance-mixture representation of Student-$t$ enables efficient Gibbs sampling - alternating between sampling $X \mid V \sim \mathcal{N}$ and $V \mid X \sim \text{InvGamma}$. This technique underlies many robust Bayesian regression algorithms.

J.3 Normalising Flows: Transforming Distributions

Any differentiable, invertible function $g$ transforms a distribution: if $X \sim p_X$ and $Y = g(X)$, then:

$$p_Y(y) = p_X(g^{-1}(y)) \cdot \left|\frac{d}{dy}g^{-1}(y)\right|$$

Examples of distribution transformation chains:

  • Start with $U \sim \mathcal{U}(0,1)$, apply $g(u) = -\log(u)/\lambda$: get Exponential$(\lambda)$
  • Start with $Z \sim \mathcal{N}(0,1)$, apply $g(z) = e^{\mu + \sigma z}$: get Log-Normal$(\mu,\sigma^2)$
  • Start with $Z \sim \mathcal{N}(0,1)$, apply $\Phi$ to get $\mathcal{U}(0,1)$, then any inverse CDF: the construction behind the Gaussian copula
  • Compose $K$ invertible transforms: get a normalising flow with a complex target distribution
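The second chain can be verified directly: push standard normal samples through $g(z) = e^{\mu+\sigma z}$ and compare with scipy.stats.lognorm (a sketch; note SciPy's convention s = sigma, scale = exp(mu)):

```python
import numpy as np
from scipy import stats

# Transform Z ~ N(0,1) through g(z) = exp(mu + sigma*z) -> Log-Normal(mu, sigma^2)
mu, sigma = 0.5, 0.8
rng = np.random.default_rng(11)
z = rng.standard_normal(300_000)
y = np.exp(mu + sigma * z)

# The log-normal median is exp(mu); compare empirical vs closed form
print(np.median(y), np.exp(mu))

# scipy parameterisation: s = sigma, scale = exp(mu)
print(stats.lognorm(s=sigma, scale=np.exp(mu)).median())   # also exp(mu)
```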

-> Full treatment: Section03 Joint Distributions (change-of-variables formula)


Appendix K: Quick-Reference Probability Tables

K.1 Standard Normal Tail Probabilities

| $z$ | $P(Z > z)$ | $P(\lvert Z\rvert > z)$ |
|---|---|---|
| 1.00 | 0.1587 | 0.3174 |
| 1.28 | 0.1003 | 0.2005 |
| 1.64 | 0.0505 | 0.1010 |
| 1.96 | 0.0250 | 0.0500 |
| 2.00 | 0.0228 | 0.0455 |
| 2.33 | 0.0099 | 0.0197 |
| 2.58 | 0.0049 | 0.0099 |
| 3.00 | 0.0013 | 0.0027 |
| 3.29 | 0.0005 | 0.0010 |
| 3.89 | 0.0001 | 0.0002 |

K.2 Poisson Probabilities $P(X = k)$ for $\lambda \in \{1, 2, 5\}$

| $k$ | $\lambda=1$ | $\lambda=2$ | $\lambda=5$ |
|---|---|---|---|
| 0 | 0.3679 | 0.1353 | 0.0067 |
| 1 | 0.3679 | 0.2707 | 0.0337 |
| 2 | 0.1839 | 0.2707 | 0.0842 |
| 3 | 0.0613 | 0.1804 | 0.1404 |
| 4 | 0.0153 | 0.0902 | 0.1755 |
| 5 | 0.0031 | 0.0361 | 0.1755 |
| 6 | 0.0005 | 0.0120 | 0.1462 |
| 7 | 0.0001 | 0.0034 | 0.1044 |

K.3 Beta Distribution Moments and Shapes

| $(\alpha, \beta)$ | Mean | Std Dev | Shape Description |
|---|---|---|---|
| (1, 1) | 0.500 | 0.289 | Uniform |
| (0.5, 0.5) | 0.500 | 0.354 | U-shaped (arcsine) |
| (2, 2) | 0.500 | 0.224 | Symmetric bell |
| (2, 5) | 0.286 | 0.159 | Right-skewed |
| (5, 2) | 0.714 | 0.159 | Left-skewed |
| (10, 10) | 0.500 | 0.109 | Narrow symmetric bell |
| (1, 3) | 0.250 | 0.194 | Decreasing |
| (3, 1) | 0.750 | 0.194 | Increasing |

K.4 Binomial Cumulative Probabilities

$P(X \leq k)$ for $\text{Binomial}(n=20, p)$:

| $k$ | $p=0.2$ | $p=0.5$ | $p=0.8$ |
|---|---|---|---|
| 2 | 0.2061 | 0.0002 | 0.0000 |
| 5 | 0.8042 | 0.0207 | 0.0000 |
| 8 | 0.9900 | 0.2517 | 0.0001 |
| 10 | 0.9994 | 0.5881 | 0.0026 |
| 12 | 1.0000 | 0.8684 | 0.0321 |
| 15 | 1.0000 | 0.9941 | 0.3704 |
| 18 | 1.0000 | 1.0000 | 0.9308 |

Appendix L: Information-Theoretic Properties of Distributions

L.1 Maximum Entropy Distributions

The principle of maximum entropy states: among all distributions satisfying given constraints, choose the one with maximum entropy. This gives the "least informative" distribution consistent with the known facts.

| Constraint | Maximum Entropy Distribution |
|---|---|
| Support $\{0, 1, \ldots, N-1\}$, no other info | Discrete Uniform |
| Support $(0, \infty)$, fixed mean $\mu$ | Exponential$(1/\mu)$ |
| Support $\mathbb{R}$, fixed mean $\mu$ and variance $\sigma^2$ | Gaussian$(\mu, \sigma^2)$ |
| Support $[a, b]$, no other info | Uniform$(a, b)$ |
| Support $\{0,1\}$, fixed mean $p$ | Bernoulli$(p)$ |
| Support $\Delta^{K-1}$, fixed $\mathbb{E}[\log p_k]$ | Dirichlet |

Proof sketch for the Gaussian: maximise $H[p] = -\int p(x)\log p(x)\,dx$ subject to $\int p(x)\,dx = 1$, $\int x\,p(x)\,dx = \mu$, and $\int (x-\mu)^2 p(x)\,dx = \sigma^2$. Using Lagrange multipliers (Section05/04 of Chapter 5), the optimal density satisfies $\log p(x) = \lambda_0 + \lambda_1 x + \lambda_2(x-\mu)^2$, which is the Gaussian form.

L.2 KL Divergences Between Common Distributions

Two Gaussians:

$$D_{\mathrm{KL}}(\mathcal{N}(\mu_1,\sigma_1^2) \| \mathcal{N}(\mu_2,\sigma_2^2)) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

Gaussian from standard normal:

$$D_{\mathrm{KL}}(\mathcal{N}(\mu,\sigma^2) \| \mathcal{N}(0,1)) = \frac{1}{2}(\sigma^2 + \mu^2 - 1 - \log\sigma^2)$$

Two Bernoullis:

$$D_{\mathrm{KL}}(\operatorname{Bern}(p) \| \operatorname{Bern}(q)) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$$

Two Categoricals:

$$D_{\mathrm{KL}}(\operatorname{Cat}(\mathbf{p}) \| \operatorname{Cat}(\mathbf{q})) = \sum_k p_k \log\frac{p_k}{q_k}$$

Two Poissons:

$$D_{\mathrm{KL}}(\operatorname{Pois}(\lambda_1) \| \operatorname{Pois}(\lambda_2)) = \lambda_1\log\frac{\lambda_1}{\lambda_2} - (\lambda_1 - \lambda_2)$$

Two Betas:

$$D_{\mathrm{KL}}(\text{Beta}(\alpha_1,\beta_1) \| \text{Beta}(\alpha_2,\beta_2)) = \log\frac{B(\alpha_2,\beta_2)}{B(\alpha_1,\beta_1)} + (\alpha_1-\alpha_2)\psi(\alpha_1) + (\beta_1-\beta_2)\psi(\beta_1) + (\alpha_2-\alpha_1+\beta_2-\beta_1)\psi(\alpha_1+\beta_1)$$

where $\psi = \Gamma'/\Gamma$ is the digamma function.
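A few of these closed forms can be cross-checked numerically (a sketch; scipy.special.rel_entr computes the elementwise terms $p_k \log(p_k/q_k)$):

```python
import numpy as np
from scipy.special import rel_entr   # elementwise p*log(p/q)

# Two Categoricals: the formula vs scipy
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])
kl_formula = np.sum(p * np.log(p / q))
print(np.isclose(kl_formula, rel_entr(p, q).sum()))   # True

# Two Gaussians: the closed form vanishes when the distributions coincide
def kl_gauss(mu1, s1, mu2, s2):
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(kl_gauss(0.0, 1.0, 0.0, 1.0))          # 0.0
print(kl_gauss(1.0, 2.0, 0.0, 1.0) > 0.0)    # True (KL is non-negative)
```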

L.3 Entropy Ordering

For distributions with the same support and mean, entropy orders distributions by "spread":

$$H(\text{Bernoulli}(0.5)) = \log 2 \approx 0.693 \text{ nats}$$

$$H(\mathcal{N}(\mu, \sigma^2)) = \frac{1}{2}\log(2\pi e \sigma^2)$$

$$H(\text{Uniform}(a,b)) = \log(b-a)$$

For continuous distributions on $\mathbb{R}$ with fixed variance $\sigma^2$:

$$H(\text{Gaussian}) \geq H(\text{Logistic}) \geq H(\text{Laplace}) \geq \cdots$$

The Gaussian maximises entropy, making it the distribution that "knows least" given the mean and variance constraints - which is exactly why it appears in CLT results and maximum entropy arguments.


Appendix M: PyTorch and SciPy API Reference

M.1 SciPy Distributions

```python
from scipy import stats

# Discrete distributions
stats.bernoulli(p=0.3)          # Bernoulli(p)
stats.binom(n=10, p=0.3)        # Binomial(n, p)
stats.geom(p=0.3)               # Geometric(p), k = 1, 2, ...
stats.nbinom(n=5, p=0.3)        # NegBinomial(n, p)
stats.poisson(mu=2.5)           # Poisson(lambda)

# Continuous distributions (lambda_ and beta are example rate parameters)
lambda_, beta = 2.0, 2.0
stats.uniform(loc=0, scale=1)   # Uniform(0, 1)       [loc=a, scale=b-a]
stats.norm(loc=0, scale=1)      # Gaussian(mu, sigma) [scale=sigma, not sigma^2!]
stats.expon(scale=1/lambda_)    # Exponential(lambda) [scale=1/rate]
stats.gamma(a=2, scale=1/beta)  # Gamma(alpha, beta)  [a=shape, scale=1/rate]
stats.beta(a=2, b=5)            # Beta(alpha, beta)
stats.t(df=5)                   # Student-t(nu)

# Common methods (rv = any frozen distribution created above)
rv = stats.norm(loc=0, scale=1)
x, k, q = 0.5, 1, 0.9           # example arguments
rv.pdf(x)                       # PDF (discrete distributions: rv.pmf(k))
rv.cdf(x)                       # CDF
rv.ppf(q)                       # Quantile (inverse CDF)
rv.sf(x)                        # Survival function 1 - CDF
rv.rvs(size=100)                # Random samples
rv.mean()                       # E[X]
rv.var()                        # Var(X)
rv.std()                        # Standard deviation
rv.entropy()                    # H(X) in nats
```
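A short consistency check of the frozen-distribution API above (parameter values are arbitrary examples):

```python
from scipy import stats

# Frozen distribution: parameters are bundled with the methods.
rv = stats.gamma(a=2, scale=0.5)               # Gamma(alpha=2), rate beta=2

assert abs(rv.mean() - 2 * 0.5) < 1e-12        # E[X] = alpha * scale = alpha/beta
assert abs(rv.var() - 2 * 0.5**2) < 1e-12      # Var(X) = alpha * scale^2
assert abs(rv.cdf(rv.ppf(0.9)) - 0.9) < 1e-9   # ppf inverts cdf
```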

M.2 PyTorch torch.distributions

```python
import torch
from torch.distributions import (
    Bernoulli, Binomial, Geometric, Poisson,
    Categorical, Multinomial,
    Uniform, Normal, Exponential, Gamma, Beta, Dirichlet, StudentT
)

# Create a distribution
d = Normal(loc=torch.tensor(0.0), scale=torch.tensor(1.0))

# Common methods (x, q are example tensors)
x, q = torch.tensor(0.5), torch.tensor(0.9)
d.sample((5,))          # draw 5 samples
d.log_prob(x)           # log p(x) - used in loss functions
d.cdf(x)                # F(x)
d.icdf(q)               # F^{-1}(q)
d.entropy()             # H(X)
d.mean                  # E[X] (property, not method)
d.variance              # Var(X)
d.stddev                # Standard deviation

# Reparameterised sampling (enables gradients through sampling)
d.rsample((5,))         # only for distributions with has_rsample=True
                        # (e.g. Normal, Gamma, Beta, Dirichlet)

# Special constructors (z, alpha are example tensors)
z = torch.randn(5)
alpha = torch.ones(3)
Categorical(logits=z)           # Categorical from logits (normalised internally)
Dirichlet(concentration=alpha)  # Dirichlet(alpha)
```
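A short demo of the two patterns above that matter most in training code: `Categorical` over logits (as in token prediction) and `rsample` for pathwise gradients. This is a sketch; the tensor values are arbitrary examples.

```python
import torch
from torch.distributions import Categorical, Normal

# Categorical(logits=...) normalises internally.
cat = Categorical(logits=torch.tensor([1.0, 2.0, 0.5]))
assert torch.isclose(cat.probs.sum(), torch.tensor(1.0))

# rsample: gradients flow through the sample back to the parameters.
mu = torch.tensor(0.0, requires_grad=True)
d = Normal(loc=mu, scale=torch.tensor(1.0))
d.rsample((3,)).sum().backward()
assert mu.grad is not None      # d.sample() would break this gradient path
```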

M.3 NumPy Random Number Generation

```python
import numpy as np
rng = np.random.default_rng(seed=42)

# Discrete
rng.integers(0, 2, size=100)               # Uniform integers (Bernoulli-like)
rng.binomial(n=10, p=0.3, size=100)        # Binomial
rng.geometric(p=0.3, size=100)             # Geometric (p = success prob)
rng.poisson(lam=2.5, size=100)             # Poisson

# Continuous (lambda_ and beta are example rate parameters)
lambda_, beta = 2.0, 2.0
rng.uniform(low=0, high=1, size=100)       # Uniform
rng.normal(loc=0, scale=1, size=100)       # Gaussian     [scale=sigma]
rng.exponential(scale=1/lambda_, size=100) # Exponential  [scale=1/rate]
rng.gamma(shape=2, scale=1/beta, size=100) # Gamma
rng.beta(a=2, b=5, size=100)               # Beta

# Categorical / Dirichlet (K, probs, alpha are example values)
K = 3
probs = np.array([0.2, 0.5, 0.3])
alpha = np.ones(K)
rng.choice(K, p=probs, size=100)           # Categorical
rng.dirichlet(alpha, size=10)              # Dirichlet
rng.multinomial(n=20, pvals=probs)         # Multinomial
```

Appendix N: Distribution Parameter Estimation

Quick reference for maximum likelihood estimates (MLE) and method of moments (MOM) estimators.

Discrete Distributions

| Distribution | MLE Estimator | Method of Moments |
| --- | --- | --- |
| Bernoulli($p$) | $\hat{p} = \bar{x}$ | Same as MLE |
| Binomial($n$, $p$) | $\hat{p} = \bar{x}/n$ | Same as MLE |
| Geometric($p$) | $\hat{p} = 1/\bar{x}$ | Same as MLE |
| Poisson($\lambda$) | $\hat{\lambda} = \bar{x}$ | Same as MLE |
| Negative Binomial($r$, $p$) | $\hat{p} = r/(r + \bar{x})$ | Match mean and variance |

Continuous Distributions

| Distribution | MLE Estimator(s) | Notes |
| --- | --- | --- |
| Uniform($a$, $b$) | $\hat{a} = x_{(1)}$, $\hat{b} = x_{(n)}$ | Biased; adjust for small $n$ |
| Gaussian($\mu$, $\sigma^2$) | $\hat{\mu} = \bar{x}$, $\hat{\sigma}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$ | Biased variance; unbiased uses $n-1$ |
| Exponential($\lambda$) | $\hat{\lambda} = 1/\bar{x}$ | MOM gives the same estimator |
| Gamma($\alpha$, $\beta$) | Solve $\log\hat{\alpha} - \psi(\hat{\alpha}) = \log\bar{x} - \overline{\log x}$ | $\psi$ is digamma; numerical solution needed |
| Beta($\alpha$, $\beta$) | Method of moments: $\hat{\alpha} = \bar{x}\left(\frac{\bar{x}(1-\bar{x})}{s^2} - 1\right)$ | $s^2$ = sample variance |
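The Gamma row needs a numerical root-find. A minimal sketch with `brentq` on synthetic data (true shape $\alpha = 3$, rate $\beta = 0.5$ are example values):

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=2.0, size=200_000)   # true alpha=3, rate beta=0.5

# MLE condition: log(alpha) - psi(alpha) = log(xbar) - mean(log x)
s = np.log(x.mean()) - np.log(x).mean()             # s > 0 by Jensen's inequality
alpha_hat = brentq(lambda a: np.log(a) - digamma(a) - s, 1e-3, 1e3)
beta_hat = alpha_hat / x.mean()                     # rate = shape / mean

assert abs(alpha_hat - 3.0) < 0.15
assert abs(beta_hat - 0.5) < 0.05
```

The left-hand side $\log a - \psi(a)$ is strictly decreasing in $a$, so the bracketed root is unique.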

Bayesian Estimation with Conjugate Priors

For conjugate models, the posterior mean provides a natural estimator:

Bernoulli with Beta prior:

$$\hat{p}_{\text{Bayes}} = \frac{\alpha + \sum x_i}{\alpha + \beta + n}$$

This shrinks the MLE $\bar{x}$ toward the prior mean $\alpha/(\alpha+\beta)$.

Poisson with Gamma prior:

$$\hat{\lambda}_{\text{Bayes}} = \frac{\alpha + \sum x_i}{\beta + n}$$

Gaussian with Gaussian prior (known $\sigma^2$):

$$\hat{\mu}_{\text{Bayes}} = \frac{\sigma^2 \mu_0 / \sigma_0^2 + n\bar{x}}{n + \sigma^2/\sigma_0^2}$$

As $n \to \infty$, all Bayesian estimators converge to the MLE - the data overwhelms the prior.
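Both properties - shrinkage toward the prior mean and convergence to the MLE - follow because the posterior mean is a convex combination of prior mean and MLE with prior weight $(\alpha+\beta)/(\alpha+\beta+n)$. A sketch for the Beta-Bernoulli case (prior and $p$ values are arbitrary examples):

```python
import numpy as np

alpha, beta = 2.0, 8.0                  # prior Beta(2, 8), prior mean 0.2
p_true = 0.6
rng = np.random.default_rng(1)

for n in (10, 100, 10_000):
    x = rng.binomial(1, p_true, size=n)
    mle = x.mean()
    bayes = (alpha + x.sum()) / (alpha + beta + n)

    # Posterior mean lies between the prior mean and the MLE ...
    lo, hi = sorted((alpha / (alpha + beta), mle))
    assert lo - 1e-12 <= bayes <= hi + 1e-12
    # ... and its distance to the MLE is bounded by the prior weight.
    assert abs(bayes - mle) <= (alpha + beta) / (alpha + beta + n)
```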


Appendix O: Tail Bounds for Common Distributions

Understanding tail behaviour is essential for concentration inequalities and PAC-learning bounds.

Gaussian Tails

For $X \sim \mathcal{N}(0,1)$, the Mills ratio gives an upper bound on the tail:

$$P(X > t) \leq \frac{\phi(t)}{t} = \frac{1}{t\sqrt{2\pi}} e^{-t^2/2}, \quad t > 0$$

More precisely: $P(X > t) \sim \frac{\phi(t)}{t}$ as $t \to \infty$.

Numerically: $P(|X| > 2) \approx 0.046$, $P(|X| > 3) \approx 0.003$, $P(|X| > 6) \approx 2 \times 10^{-9}$.
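The bound and its asymptotic tightness can be checked against scipy's exact survival function (the grid of $t$ values is an arbitrary example):

```python
from scipy import stats

for t in (0.5, 1.0, 2.0, 4.0, 6.0):
    exact = stats.norm.sf(t)            # P(X > t), exact tail
    bound = stats.norm.pdf(t) / t       # Mills-ratio bound phi(t)/t
    assert exact <= bound               # valid for all t > 0
    if t >= 4.0:
        assert bound / exact < 1.1      # asymptotically tight
```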

Sub-Gaussian Distributions

A random variable $X$ is sub-Gaussian with parameter $\sigma$ if:

$$\mathbb{E}\!\left[e^{t(X - \mathbb{E}[X])}\right] \leq e^{\sigma^2 t^2/2} \quad \forall t \in \mathbb{R}$$

This implies Gaussian-like tail decay: $P(|X - \mathbb{E}[X]| \geq \epsilon) \leq 2e^{-\epsilon^2/(2\sigma^2)}$.

Examples: Gaussian($\mu$, $\sigma^2$) is sub-Gaussian with parameter $\sigma$; any variable bounded in $[a, b]$ is sub-Gaussian with parameter $(b-a)/2$.

LLM relevance: Token embeddings are often assumed sub-Gaussian for theoretical analysis of attention mechanisms and generalisation bounds.

Poisson Tails (Chernoff bounds)

For $X \sim \operatorname{Poisson}(\lambda)$:

$$P(X \geq (1+\delta)\lambda) \leq \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\lambda \leq e^{-\lambda\delta^2/3}, \quad 0 < \delta \leq 1$$
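A numerical check of both inequalities against the exact Poisson tail ($\lambda$ and the $\delta$ grid are arbitrary examples):

```python
import numpy as np
from scipy import stats

lam = 10.0
for delta in (0.25, 0.5, 1.0):
    k = int(np.ceil((1 + delta) * lam))
    exact = stats.poisson.sf(k - 1, lam)                     # P(X >= k)
    chernoff = (np.exp(delta) / (1 + delta) ** (1 + delta)) ** lam
    loose = np.exp(-lam * delta ** 2 / 3)
    # exact tail <= sharp Chernoff bound <= simplified bound
    assert exact <= chernoff <= loose
```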

Exponential Tails (Memoryless)

For $X \sim \operatorname{Exp}(\lambda)$: $P(X > t) = e^{-\lambda t}$ exactly (no approximation needed).

Forward reference: Concentration inequalities (Markov, Chebyshev, Hoeffding, Bernstein) are developed fully in Section 6.5 Concentration Inequalities.


Appendix P: Worked Example - Distribution Selection in Practice

Problem: A recommendation system logs how many times each user clicks on recommended items in a session. You observe counts $x_1, \ldots, x_n$ and want to model the distribution.

Step 1: Check support. Counts take values $\{0, 1, 2, \ldots\}$ -> discrete, non-negative integers. This eliminates all continuous distributions.

Step 2: Check bounds. No natural upper bound -> Geometric, Poisson, or Negative Binomial. (If sessions had fixed length $N$, Binomial would apply.)

Step 3: Check variance-mean relationship.

  • Poisson: $\text{Var} = \mu$ (equi-dispersed)
  • Geometric: $\text{Var} = (1-p)/p^2$ with $\mu = 1/p$, so $\text{Var}/\mu = (1-p)/p$ - over-dispersed when $p < 1/2$
  • Negative Binomial: $\text{Var} = \mu + \mu^2/r$ (over-dispersed, extra parameter)

Compute the sample mean $\hat{\mu}$ and sample variance $\hat{s}^2$. If $\hat{s}^2 \approx \hat{\mu}$, use Poisson. If $\hat{s}^2 \gg \hat{\mu}$, use Negative Binomial.
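Steps 1-3 can be sketched as a dispersion check; the 1.5 threshold below is an illustrative choice, not a formal test, and the simulated data stand in for real click logs:

```python
import numpy as np

def choose_count_model(x, threshold=1.5):
    """Pick Poisson vs Negative Binomial from the variance/mean ratio."""
    x = np.asarray(x)
    ratio = x.var(ddof=1) / x.mean()    # ~1 for Poisson-like data
    model = "poisson" if ratio < threshold else "negative_binomial"
    return model, ratio

rng = np.random.default_rng(2)
pois_data = rng.poisson(lam=3.0, size=5_000)
nb_data = rng.negative_binomial(n=2, p=0.3, size=5_000)   # over-dispersed

assert choose_count_model(pois_data)[0] == "poisson"
assert choose_count_model(nb_data)[0] == "negative_binomial"
```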

Step 4: Fit and check. Compute MLE, then use a chi-squared goodness-of-fit test or probability plot.

Step 5: Consider mixture models. If many zeros occur (zero-inflated data), consider the Zero-Inflated Poisson: $P(X=0) = \pi + (1-\pi)e^{-\lambda}$ and $P(X=k) = (1-\pi)\operatorname{Pois}(k;\lambda)$ for $k \geq 1$.

Key insight: The choice of distribution encodes assumptions about the data-generating process. Poisson assumes events occur at a constant rate independently; Negative Binomial allows the rate itself to vary (it is a Poisson-Gamma mixture). In LLM contexts, token frequency distributions are often heavy-tailed (Zipfian), motivating log-normal or power-law models rather than Poisson.


Last updated: 2026 - covers all distributions in Section 6.2 scope as defined in the Chapter README.