Section6.2 Common Distributions
"God does not play dice - but physicists, statisticians, and machine learning engineers do. The distributions in this chapter are the dice they use."
Overview
Every probabilistic model makes a choice: what distribution describes the data? A Bernoulli for a coin flip, a Gaussian for continuous measurements, a Categorical for language model token prediction, a Dirichlet for topic proportions. These are not arbitrary choices - each distribution encapsulates a specific data-generating story, a set of assumptions, and a collection of mathematical properties that make inference tractable.
This section gives the complete treatment of every named distribution that appears throughout this curriculum. For each distribution you will find: the PDF or PMF, the CDF where useful, the parameters and their interpretations, the moments (mean, variance, skewness), the moment generating function, the shape behaviour as parameters vary, the special cases and limiting forms, and the concrete ML applications where the distribution appears.
The section culminates in two unifying frameworks: the exponential family, which shows that Bernoulli, Gaussian, Poisson, Beta, Gamma, Dirichlet, and Categorical are all instances of a single canonical form; and conjugate priors, which explains why Bayesian inference is analytically tractable for precisely these distributions.
What this section assumes: The definitions of PDF, PMF, CDF, and the basic probability axioms from Section01. Bernoulli and Gaussian were introduced there; this section gives their full treatment.
What this section defers: Expectation derivations are stated as facts here and derived from first principles in Section04. The multivariate Gaussian is previewed here and fully developed in Section03.
Prerequisites
- Section01 Introduction and Random Variables - probability axioms, CDF, PDF, PMF, Bernoulli (intro)
- Integration: computing areas under curves, substitution, gamma function - Section03 Integration
- Series: power series, convergence - Section04 Series and Sequences
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive visualisations of all distributions; exponential family; conjugacy updates |
| exercises.ipynb | 10 graded exercises: PMF/MGF computation, exponential family identification, conjugate Bayesian updating |
Learning Objectives
After completing this section, you will:
- State the PMF or PDF, support, mean, variance, and MGF of every major named distribution
- Identify which distribution models a given data-generating process
- Explain the relationships between distributions (Binomial->Poisson, Binomial->Normal, Beta->Dirichlet)
- Write any exponential family member in canonical form and identify its sufficient statistics and log-partition function
- Compute the posterior in a conjugate Bayesian model (Beta-Binomial, Dirichlet-Categorical, Gamma-Poisson, Normal-Normal)
- Derive the softmax function as the natural parameterisation of the Categorical exponential family
- Explain how the Gaussian reparameterisation trick enables backpropagation through sampling in VAEs
- Describe the role of each distribution in at least two concrete ML architectures
Table of Contents
- 1. Intuition and Overview
- 2. Discrete Distributions
- 3. Continuous Distributions
- 4. Distribution Relationships
- 5. Moment Generating Functions
- 6. The Exponential Family
- 7. Conjugate Priors in Bayesian Inference
- 8. ML Applications
- 9. Common Mistakes
- 10. Exercises
- 11. Why This Matters for AI (2026 Perspective)
- 12. Conceptual Bridge
1. Intuition and Overview
1.1 Why Named Distributions?
A probability distribution is completely specified by its CDF. So why do we name particular families and memorise their formulas?
Three reasons:
Sufficient statistics compress data. If you observe $n$ coin flips, all the information about the success probability $p$ is contained in the count of heads $k$ - not the sequence. The Binomial distribution formalises this: its PMF depends on the data only through $k$. Named distributions arise precisely when the data-generating process has this kind of compressibility.
Tractable inference. Computing posteriors, marginals, and predictions requires integration. For most distributions this is intractable. The named distributions are the ones for which the integration can be done in closed form - either because the MGF factors, or because they belong to the exponential family where the normalisation constant has an analytic form.
Interpretable parameters. A Gaussian has a mean and a standard deviation that are immediately meaningful. A Beta prior encodes "I've seen 2 successes and 5 failures before the experiment." Named distributions give parameters human-interpretable meaning.
For AI: Every loss function and output layer in a neural network implicitly assumes a distribution over outputs. Cross-entropy loss assumes a Categorical distribution. Mean squared error assumes a Gaussian. Poisson regression assumes a Poisson distribution. Understanding which distribution the loss corresponds to is essential for knowing when a model is misspecified.
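A minimal sketch of this correspondence, using NumPy only (the function names `categorical_nll` and `gaussian_nll` are illustrative, not from any library): cross-entropy is the negative log-likelihood of a Categorical, and mean squared error is the Gaussian NLL up to an additive constant and a scale.

```python
import numpy as np

def categorical_nll(logits, target_index):
    """NLL of one observed class under a Categorical parameterised by logits."""
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    return -log_probs[target_index]                       # == cross-entropy with a one-hot target

def gaussian_nll(y_pred, y_true, sigma=1.0):
    """NLL of y_true under N(y_pred, sigma^2); the y-dependent part is the squared error."""
    return 0.5 * ((y_true - y_pred) / sigma) ** 2 + 0.5 * np.log(2 * np.pi * sigma**2)

logits = np.array([2.0, -1.0, 0.5])
print(categorical_nll(logits, target_index=0))   # cross-entropy loss for class 0
print(gaussian_nll(1.2, 1.5))                    # 0.5 * (0.3)^2 plus a constant
```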
1.2 How to Choose a Distribution
DATA GENERATING PROCESS
========================================================================
Discrete outcomes?
+-- Two outcomes (0/1) -> Bernoulli(p) [single trial]
+-- Count of successes in n trials -> Binomial(n, p)
+-- Trials until first success -> Geometric(p)
+-- Trials until r-th success -> Negative Binomial(r, p)
+-- Count of events in interval -> Poisson(\\lambda)
+-- One of K categories -> Categorical(p)
+-- Counts across K categories -> Multinomial(n, p)
Continuous outcomes?
+-- Bounded, equal weight -> Uniform(a, b)
+-- Unbounded, symmetric bell -> Gaussian(\\mu, \\sigma^2)
+-- Non-negative, time to event -> Exponential(\\lambda) or Gamma(\\alpha, \\beta)
+-- Probability value in (0, 1) -> Beta(\\alpha, \\beta)
+-- Probability vector on simplex -> Dirichlet(\\alpha)
+-- Heavy-tailed, small samples -> Student-t(\\nu)
========================================================================
1.3 The Distribution Family Tree
DISTRIBUTION FAMILY TREE
========================================================================
DISCRETE CONTINUOUS
---------------------------- ------------------------------
Bernoulli(p) Uniform(a,b)
| n trials |
v v (generalise)
Binomial(n,p) ------- CLT -----------> Gaussian(\\mu,\\sigma^2)
| |
| n->\\infty, p->0, np=\\lambda | \\mu=0, \\sigma^2=\\nu (\\nu->\\infty)
v v
Poisson(\\lambda) ------ Poisson -----------> Exponential(\\lambda)
| process | sum of \\alpha
| count v
Multinomial(n,p) <---- generalise -- Gamma(\\alpha,\\beta)
| | \\alpha=1
Categorical(p) Exponential(\\lambda)
| | \\alpha=k/2, \\beta=1/2
conjugate prior: Chi-squared(k)
| |
v v
Dirichlet(\\alpha) ---- marginals ---------> Beta(\\alpha,\\beta)
(K=2 case)
Student-t(\\nu) = Gaussian / \\sqrt(Chi^2(\\nu)/\\nu) [heavy tails; \\nu->\\infty -> Gaussian]
========================================================================
1.4 Historical Timeline
| Year | Discovery | Contributor |
|---|---|---|
| 1713 | Bernoulli distribution, Law of Large Numbers | Jakob Bernoulli |
| 1733 | Normal approximation to Binomial | Abraham de Moivre |
| 1809 | Normal distribution as error model (least squares) | Carl Friedrich Gauss |
| 1837 | Poisson distribution as limit of rare events | Simeon-Denis Poisson |
| 1839 | Dirichlet distribution (as Bayesian prior) | Peter Gustav Lejeune Dirichlet |
| 1860 | Exponential distribution (Maxwell's speed distribution) | James Clerk Maxwell |
| 1893 | Chi-squared distribution | Karl Pearson |
| 1908 | Student-t distribution (small samples) | William Sealy Gosset ("Student") |
| 1911 | Beta distribution (conjugate to Binomial) | Karl Pearson |
| 1922 | Sufficient statistics | Ronald A. Fisher |
| 1935 | Exponential family (unified framework) | Koopman, Pitman, Darmois |
| 2006 | Dirichlet-Categorical in LDA (topic models) | Blei, Ng, Jordan |
| 2013 | Gaussian VAE and reparameterisation trick | Kingma, Welling |
| 2020 | Dirichlet language model priors (BayesOPT) | various |
2. Discrete Distributions
A discrete random variable $X$ takes values in a countable set $\mathcal{X}$. Its distribution is fully described by the PMF $p(x) = P(X = x)$, which satisfies $\sum_{x \in \mathcal{X}} p(x) = 1$.
2.1 Bernoulli(p)
Story: A single binary trial: success (1) with probability $p$, failure (0) with probability $1 - p$.
PMF:
$$P(X = x) = p^x (1-p)^{1-x}, \qquad x \in \{0, 1\}$$
Equivalently: $P(X = 1) = p$ and $P(X = 0) = 1 - p$.
CDF:
$$F(x) = \begin{cases} 0 & x < 0 \\ 1 - p & 0 \le x < 1 \\ 1 & x \ge 1 \end{cases}$$
Moments:
| Quantity | Value |
|---|---|
| Mean | $p$ |
| Variance | $p(1-p)$ |
| Mode | $1$ if $p > 1/2$, $0$ if $p < 1/2$ (both if $p = 1/2$) |
| Entropy | $-p \log p - (1-p)\log(1-p)$ |
| Skewness | $\dfrac{1-2p}{\sqrt{p(1-p)}}$ |
The variance $p(1-p)$ is maximised at $p = 1/2$ (maximum uncertainty) and equals zero at $p \in \{0, 1\}$ (certainty).
Moment Generating Function:
$$M_X(t) = \mathbb{E}[e^{tX}] = 1 - p + p e^t$$
Log-odds / logit: The natural parameterisation for the Bernoulli is the logit:
$$\eta = \log \frac{p}{1-p}$$
Inverting: $p = \sigma(\eta) = \dfrac{1}{1 + e^{-\eta}}$ - the sigmoid function. This is why logistic regression parameterises the Bernoulli distribution.
For AI:
- Binary classification: labels are Bernoulli. The cross-entropy loss is the negative log-likelihood of a Bernoulli model.
- Dropout: each activation is independently masked with a Bernoulli gate, kept with probability $1 - p_{\text{drop}}$. At inference, expectations replace samples: $\mathbb{E}[m \cdot a] = (1 - p_{\text{drop}})\, a$.
- Stochastic depth: transformer layers are included/skipped via Bernoulli sampling during training.
Non-examples: A Bernoulli is NOT appropriate when outcomes are not binary (use Categorical), when multiple trials are involved (use Binomial), or when the "probability" varies per trial (use a hierarchical model).
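A minimal sketch, NumPy only, of the logit/sigmoid pair and of binary cross-entropy as the Bernoulli NLL; the helper names are illustrative.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def logit(p):
    return np.log(p / (1.0 - p))

def bernoulli_nll(p, y):
    """Negative log-likelihood of a label y in {0, 1} under Bernoulli(p)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

p = 0.7
assert np.isclose(sigmoid(logit(p)), p)     # the sigmoid inverts the logit
print(bernoulli_nll(p, 1), -np.log(p))      # identical: BCE is the Bernoulli NLL
```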
2.2 Binomial(n, p)
Story: Count the number of successes in $n$ independent Bernoulli($p$) trials. If $X_1, \dots, X_n \overset{\text{iid}}{\sim} \text{Bernoulli}(p)$, then $X = \sum_{i=1}^n X_i \sim \text{Binomial}(n, p)$.
PMF:
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \dots, n$$
The binomial coefficient $\binom{n}{k}$ counts the number of ways to choose which $k$ of the $n$ trials succeed.
Moments:
| Quantity | Value | Derivation hint |
|---|---|---|
| Mean | $np$ | Linearity: $\mathbb{E}[X] = \sum_i \mathbb{E}[X_i] = np$ |
| Variance | $np(1-p)$ | Independence: $\text{Var}(X) = \sum_i \text{Var}(X_i) = np(1-p)$ |
| Mode | $\lfloor (n+1)p \rfloor$ (or also $(n+1)p - 1$ if $(n+1)p$ is an integer) | |
| Skewness | $\dfrac{1-2p}{\sqrt{np(1-p)}}$ | |
MGF: Since $X = \sum_{i=1}^n X_i$ with independent Bernoulli summands:
$$M_X(t) = \left(1 - p + p e^t\right)^n$$
Shape behaviour:
- $p < 1/2$: right-skewed (most counts are below the mean)
- $p = 1/2$: symmetric
- $p > 1/2$: left-skewed
- As $n$ grows: bell-shaped by the Central Limit Theorem (CLT preview -> Section06)
Normal approximation (CLT preview): For large $n$:
$$\text{Binomial}(n, p) \approx \mathcal{N}\big(np,\; np(1-p)\big)$$
Rule of thumb: the approximation is accurate when $np \ge 10$ and $n(1-p) \ge 10$.
Poisson limit: When $n \to \infty$ and $p \to 0$ with $np = \lambda$ fixed:
$$\text{Binomial}(n, p) \to \text{Poisson}(\lambda)$$
(Full derivation in Section 2.4.)
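A quick numerical check of the two approximations above, assuming SciPy is available; the parameter values are illustrative.

```python
import numpy as np
from scipy import stats

# Poisson regime: large n, small p (lambda = np = 3)
n, p = 1000, 0.003
k = np.arange(0, 12)
exact = stats.binom.pmf(k, n, p)
poisson = stats.poisson.pmf(k, n * p)
print(np.max(np.abs(exact - poisson)))    # small: Poisson limit is accurate

# Normal regime: np and n(1-p) both large
n, p = 1000, 0.4
k = np.arange(350, 451)
exact = stats.binom.pmf(k, n, p)
normal = stats.norm.pdf(k, loc=n * p, scale=np.sqrt(n * p * (1 - p)))
print(np.max(np.abs(exact - normal)))     # small: CLT approximation is accurate
```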
For AI:
- A/B testing: number of conversions in visits follows Binomial.
- Ensemble prediction: number of ensemble members predicting class 1 follows Binomial where is ensemble size.
- Batch statistics: number of positive examples in a minibatch of size drawn from a dataset with fraction positive is Binomial.
2.3 Geometric(p) and Negative Binomial(r, p)
Geometric(p)
Story: Number of independent Bernoulli($p$) trials until (and including) the first success.
PMF:
$$P(X = k) = (1-p)^{k-1} p, \qquad k = 1, 2, 3, \dots$$
The factor $(1-p)^{k-1}$ is the probability of $k-1$ consecutive failures; $p$ is the probability of the final success.
Moments:
| Quantity | Value |
|---|---|
| Mean | $1/p$ |
| Variance | $(1-p)/p^2$ |
| Mode | $1$ (always) |
| Median | $\left\lceil \dfrac{-1}{\log_2(1-p)} \right\rceil$ |
MGF:
$$M_X(t) = \frac{p e^t}{1 - (1-p)e^t}, \qquad t < -\log(1-p)$$
Memoryless property: For integers $m, n \ge 0$:
$$P(X > m + n \mid X > m) = P(X > n)$$
Proof: $P(X > n) = (1-p)^n$. Then $P(X > m+n \mid X > m) = \dfrac{(1-p)^{m+n}}{(1-p)^m} = (1-p)^n = P(X > n)$.
The Geometric distribution is the unique discrete memoryless distribution (just as the Exponential is the unique continuous memoryless distribution).
Variant: Some sources define $Y$ as the number of failures before the first success, giving PMF $P(Y = k) = (1-p)^k p$ for $k = 0, 1, 2, \dots$ - this shifts everything by 1. Always check which convention a source uses.
Negative Binomial(r, p)
Story: Number of trials until the $r$-th success. The sum of $r$ independent Geometric($p$) random variables.
PMF:
$$P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}, \qquad k = r, r+1, r+2, \dots$$
Moments: Mean $r/p$, Variance $r(1-p)/p^2$.
Overdispersion: The variance exceeds the mean when $p < 1/2$ under this convention (and always under the failure-count convention). This makes the Negative Binomial useful for count data with overdispersion - variance greater than the mean - which the Poisson cannot model.
For AI:
- Sequence length modelling: the length of a sentence until a full stop follows approximately Geometric or Negative Binomial.
- Retry modelling: number of API calls until success follows Geometric (with exponential backoff, the distribution changes).
- Count regression: Negative Binomial regression replaces Poisson regression when data exhibits overdispersion (e.g., user activity counts, bug counts per module).
2.4 Poisson(λ)
Story: Count of independent rare events occurring in a fixed interval (time, space, area), where $\lambda > 0$ is the average rate (events per interval).
PMF:
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots$$
Moments:
| Quantity | Value |
|---|---|
| Mean | $\lambda$ |
| Variance | $\lambda$ (mean = variance!) |
| Mode | $\lfloor \lambda \rfloor$ (or both $\lambda$ and $\lambda - 1$ if $\lambda$ is an integer) |
| Skewness | $1/\sqrt{\lambda}$ (right-skewed; approaches 0 as $\lambda$ grows) |
The equal mean and variance is the defining fingerprint of Poisson data. When the empirical variance significantly exceeds the mean, the data is overdispersed and the Negative Binomial is more appropriate.
MGF:
$$M_X(t) = \exp\!\big(\lambda(e^t - 1)\big)$$
Derivation: $M_X(t) = \sum_{k=0}^{\infty} e^{tk}\,\dfrac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \dfrac{(\lambda e^t)^k}{k!} = e^{-\lambda} e^{\lambda e^t} = e^{\lambda(e^t - 1)}$.
Additive property: If $X \sim \text{Poisson}(\lambda_1)$ and $Y \sim \text{Poisson}(\lambda_2)$ independently, then $X + Y \sim \text{Poisson}(\lambda_1 + \lambda_2)$.
Proof via MGFs: $M_{X+Y}(t) = M_X(t)\,M_Y(t) = e^{\lambda_1(e^t - 1)}\, e^{\lambda_2(e^t - 1)} = e^{(\lambda_1 + \lambda_2)(e^t - 1)}$.
Poisson Limit Theorem (Binomial -> Poisson):
As $n \to \infty$, $p \to 0$, with $np = \lambda$ fixed:
$$\binom{n}{k} p^k (1-p)^{n-k} \to \frac{\lambda^k e^{-\lambda}}{k!}$$
Proof sketch:
$$\binom{n}{k}\left(\frac{\lambda}{n}\right)^{k}\left(1 - \frac{\lambda}{n}\right)^{n-k} = \frac{n(n-1)\cdots(n-k+1)}{n^k}\cdot\frac{\lambda^k}{k!}\cdot\left(1 - \frac{\lambda}{n}\right)^{n}\left(1 - \frac{\lambda}{n}\right)^{-k}$$
As $n \to \infty$: $\dfrac{n(n-1)\cdots(n-k+1)}{n^k} \to 1$, and $\left(1 - \frac{\lambda}{n}\right)^{n} \to e^{-\lambda}$, $\left(1 - \frac{\lambda}{n}\right)^{-k} \to 1$. This gives $\dfrac{\lambda^k e^{-\lambda}}{k!}$.
Poisson Process: A sequence of events where (1) events in disjoint intervals are independent, (2) the probability of exactly one event in a small interval of length $h$ is $\lambda h + o(h)$, and (3) the probability of two or more events is $o(h)$. The count of events in the interval $[0, t]$ is $\text{Poisson}(\lambda t)$.
For AI:
- Poisson regression: predicting count outcomes (clicks, API calls, bug reports). The per-example loss is the Poisson NLL $\hat{\lambda} - y\log\hat{\lambda} + \log y!$, minimised by setting $\hat{\lambda} = y$.
- Attention pattern sparsity: the number of tokens attended to above a threshold in sparse attention can follow approximately Poisson.
- Dataset curation: rare-event counts in datasets (specific entity types, low-frequency tokens) follow Poisson, motivating over-sampling strategies.
2.5 Categorical(p)
Story: A single draw from $K$ mutually exclusive categories, where category $k$ has probability $p_k$. The multivariate generalisation of Bernoulli.
PMF: Let $\mathbf{p} = (p_1, \dots, p_K)$ with $p_k \ge 0$ and $\sum_{k=1}^K p_k = 1$. For outcome $x \in \{1, \dots, K\}$:
$$P(X = x) = p_x$$
In one-hot vector notation: if $\mathbf{x}$ is the $k$-th standard basis vector:
$$P(\mathbf{X} = \mathbf{x}) = \prod_{k=1}^{K} p_k^{x_k}$$
Moments: $\mathbb{E}[\mathbf{1}\{X = k\}] = p_k$, $\text{Var}(\mathbf{1}\{X = k\}) = p_k(1-p_k)$, $\text{Cov}(\mathbf{1}\{X = j\}, \mathbf{1}\{X = k\}) = -p_j p_k$ for $j \ne k$.
Softmax Parameterisation:
The probability vector $\mathbf{p}$ lives on the $(K-1)$-dimensional probability simplex $\Delta^{K-1}$. For unconstrained logits $\mathbf{z} \in \mathbb{R}^K$, the natural parameterisation is:
$$p_k = \mathrm{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
This is the exponential family natural parameterisation of the Categorical - the logits are the natural parameters.
Temperature scaling: The Categorical can be "sharpened" or "softened" by temperature $T > 0$:
$$p_k(T) = \frac{e^{z_k / T}}{\sum_{j=1}^{K} e^{z_j / T}}$$
- $T \to 0$: deterministic (argmax)
- $T = 1$: standard softmax
- $T \to \infty$: uniform distribution
Entropy: $H(\mathbf{p}) = -\sum_{k=1}^{K} p_k \log p_k$. Maximum at uniform ($p_k = 1/K$, giving $\log K$); minimum $0$ at deterministic ($p_k = 1$ for a single $k$).
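A minimal NumPy sketch of temperature scaling and its effect on entropy (the function names are illustrative).

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

logits = np.array([3.0, 1.0, 0.2, -1.0])
for T in (0.1, 1.0, 10.0):
    p = softmax_with_temperature(logits, T)
    print(T, np.round(p, 3), round(entropy(p), 3))
# T -> 0: mass collapses onto the argmax (entropy -> 0)
# T -> infinity: distribution approaches uniform (entropy -> log K)
```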
For AI:
- Language model output: every token prediction is a Categorical over the vocabulary. Training minimises $-\log p_{x^*}$ for the observed token $x^*$, which is the NLL of a Categorical.
- Gumbel-softmax trick: the Gumbel-max trick gives exact samples from a Categorical: sample $g_k \sim \text{Gumbel}(0, 1)$, then take $\arg\max_k (z_k + g_k)$. The Gumbel-softmax approximation replaces the argmax with a softmax at low temperature to make this differentiable.
- Top-$p$ (nucleus) sampling: truncate the Categorical to the smallest set of tokens whose cumulative probability exceeds $p$, then renormalise.
2.6 Multinomial(n, p)
Story: Counts across $K$ categories in $n$ independent Categorical draws. If each trial draws from Categorical($\mathbf{p}$), then the vector of counts follows a Multinomial.
PMF: For non-negative integers $k_1, \dots, k_K$ with $\sum_{j=1}^K k_j = n$:
$$P(X_1 = k_1, \dots, X_K = k_K) = \frac{n!}{k_1!\,\cdots\,k_K!} \prod_{j=1}^{K} p_j^{k_j}$$
The multinomial coefficient $\dfrac{n!}{k_1!\,\cdots\,k_K!}$ counts the number of orderings.
Moments:
$$\mathbb{E}[X_k] = n p_k, \qquad \text{Var}(X_k) = n p_k(1 - p_k), \qquad \text{Cov}(X_j, X_k) = -n p_j p_k \ (j \ne k)$$
The negative covariance between categories is inevitable: if more of one category is observed, fewer of the others must be. This negative dependence structure is a fundamental constraint of fixed-total count vectors.
Marginals: Each marginal $X_k \sim \text{Binomial}(n, p_k)$ - this is why Binomial is the two-category special case ($K = 2$) of Multinomial.
For AI:
- Topic models (LDA): document word counts follow Multinomial where is the topic-word distribution.
- Bag-of-words: the count vector representation of a document is a realisation of the Multinomial.
- Batch class balance: expected class counts in a minibatch follow Multinomial where is the class frequency vector.
3. Continuous Distributions
A continuous random variable $X$ has a probability density function (PDF) $f(x) \ge 0$ satisfying $\int_{-\infty}^{\infty} f(x)\,dx = 1$. Probabilities are computed as $P(a \le X \le b) = \int_a^b f(x)\,dx$.
3.1 Uniform(a, b)
Story: All values in $[a, b]$ are equally likely. The maximum entropy distribution subject to being supported on a bounded interval.
PDF and CDF:
$$f(x) = \frac{1}{b-a} \ \text{ for } x \in [a, b], \qquad F(x) = \frac{x-a}{b-a} \ \text{ for } x \in [a, b]$$
Moments:
| Quantity | Value |
|---|---|
| Mean | $(a+b)/2$ |
| Variance | $(b-a)^2/12$ |
| Entropy | $\log(b-a)$ |
MGF:
$$M_X(t) = \frac{e^{tb} - e^{ta}}{t(b-a)} \ \ (t \ne 0), \qquad M_X(0) = 1$$
Maximum entropy: Among all distributions supported on $[a, b]$, the Uniform has the maximum Shannon entropy. This makes it the natural "least informative" prior when only the support is known.
Universality: If $X$ has continuous CDF $F$, then $F(X) \sim \text{Uniform}(0, 1)$. Conversely, $F^{-1}(U) \sim F$ for $U \sim \text{Uniform}(0, 1)$. This is the inverse CDF sampling method introduced in Section01.
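A minimal sketch of the universality property in action, assuming NumPy: Uniform(0,1) draws pushed through the Exponential inverse CDF yield Exponential samples with the expected moments.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
u = rng.uniform(size=100_000)
x = -np.log(1 - u) / lam              # F^{-1}(u) for Exponential(lambda)

print(x.mean(), 1 / lam)              # both close to 0.5
print(x.var(), 1 / lam**2)            # both close to 0.25
```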
For AI:
- Weight initialisation: Kaiming uniform initialisation draws weights from $U\!\left(-\sqrt{6/\text{fan\_in}},\ \sqrt{6/\text{fan\_in}}\right)$, chosen so that the variance of activations is preserved across layers.
- Random search: hyperparameter search over a bounded range uses Uniform priors.
- Data augmentation: random crop position, rotation angle, and colour jitter are drawn from Uniform distributions.
3.2 Gaussian(μ, σ²)
Story: The limiting distribution of the standardised sum of i.i.d. random variables (Central Limit Theorem). The maximum entropy distribution for fixed mean and variance. The distribution that makes least-squares regression optimal under Gaussian noise.
PDF:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Normalisation proof sketch:
$$\left(\int_{-\infty}^{\infty} e^{-x^2/2}\,dx\right)^{2} = \int\!\!\int e^{-(x^2+y^2)/2}\,dx\,dy = \int_0^{2\pi}\!\!\int_0^{\infty} e^{-r^2/2}\, r\,dr\,d\theta = 2\pi$$
This follows from the Gaussian integral trick: square the integral and switch to polar coordinates, so $\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi}$. (Full proof in Appendix A of Section01.)
Parameters:
- $\mu$: mean (location parameter - shifts the peak)
- $\sigma^2$: variance (scale parameter - controls spread)
- $\sigma$: standard deviation (same units as $x$)
Moments:
| Quantity | Value |
|---|---|
| Mean | $\mu$ |
| Variance | $\sigma^2$ |
| Mode | $\mu$ (unique, at the peak) |
| Median | $\mu$ (by symmetry) |
| Skewness | $0$ (perfectly symmetric) |
| Kurtosis | $3$ (excess kurtosis $0$) |
| Entropy | $\tfrac{1}{2}\log(2\pi e \sigma^2)$ |
MGF:
$$M_X(t) = \exp\!\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$$
Derivation:
$$M_X(t) = \int_{-\infty}^{\infty} e^{tx}\,\frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}\,dx$$
Complete the square in the exponent: $tx - \frac{(x-\mu)^2}{2\sigma^2} = \mu t + \frac{\sigma^2 t^2}{2} - \frac{(x - \mu - \sigma^2 t)^2}{2\sigma^2}$. The integral of the Gaussian part equals 1, leaving $e^{\mu t + \sigma^2 t^2/2}$.
Standard Normal: $Z \sim \mathcal{N}(0, 1)$ has PDF $\varphi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ and CDF $\Phi(z) = \int_{-\infty}^{z} \varphi(u)\,du$.
Any Gaussian can be standardised: $Z = (X - \mu)/\sigma \sim \mathcal{N}(0, 1)$.
Key quantiles: $\Phi(1.645) \approx 0.95$, $\Phi(1.960) \approx 0.975$, $\Phi(2.576) \approx 0.995$.
68-95-99.7 Rule:
$$P(|X - \mu| \le \sigma) \approx 0.683, \qquad P(|X - \mu| \le 2\sigma) \approx 0.954, \qquad P(|X - \mu| \le 3\sigma) \approx 0.997$$
Stability properties:
- Linear stability: $aX + b \sim \mathcal{N}(a\mu + b,\ a^2\sigma^2)$
- Additive stability: $X + Y \sim \mathcal{N}(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)$ when $X \perp Y$
- Closure under conditioning: the conditional distribution of a Gaussian given a linear observation is Gaussian (developed in Section03)
Maximum entropy: Among all distributions with mean $\mu$ and variance $\sigma^2$, the Gaussian maximises entropy. This is the information-theoretic justification for its ubiquity.
Forward reference - Multivariate Gaussian:
Preview: Multivariate Gaussian. The $d$-dimensional generalisation replaces the scalar mean with $\boldsymbol{\mu} \in \mathbb{R}^d$ and the scalar variance with a positive definite covariance matrix $\Sigma$:
$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
The Gaussian process (Section06) extends this to infinite dimensions.
-> Full treatment: Section03 Joint Distributions
For AI:
- Weight initialisation: Xavier/Glorot initialisation uses $\mathcal{N}\!\left(0,\ \frac{2}{\text{fan\_in} + \text{fan\_out}}\right)$; Kaiming uses $\mathcal{N}\!\left(0,\ \frac{2}{\text{fan\_in}}\right)$.
- VAE latent prior: $\mathcal{N}(\mathbf{0}, I)$ is the prior on the latent code. The encoder outputs $(\boldsymbol{\mu}, \boldsymbol{\sigma})$ and samples $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$.
- Gaussian process: a prior over functions where any finite collection of function values follows a multivariate Gaussian (Section06).
- SGD noise: the gradient noise in stochastic gradient descent is approximately Gaussian by the CLT, with covariance proportional to the gradient covariance matrix.
- Diffusion models: the forward noising process adds Gaussian noise $\mathbf{x}_t = \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1} + \sqrt{\beta_t}\,\boldsymbol{\epsilon}$.
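A minimal sketch of the reparameterisation trick mentioned above, assuming PyTorch is available; the tensors and the downstream loss are illustrative. Writing $z = \mu + \sigma\epsilon$ moves the randomness outside the parameters, so gradients with respect to $\mu$ and $\sigma$ flow through the sample.

```python
import torch

mu = torch.tensor([0.5], requires_grad=True)
log_sigma = torch.tensor([-1.0], requires_grad=True)

eps = torch.randn(1)                       # noise, independent of the parameters
z = mu + torch.exp(log_sigma) * eps        # reparameterised sample z ~ N(mu, sigma^2)
loss = (z ** 2).sum()                      # any downstream loss of the sample
loss.backward()

print(mu.grad, log_sigma.grad)             # well-defined gradients through the sampling step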
3.3 Exponential(λ)
Story: Time between consecutive events in a Poisson process with rate $\lambda$. The continuous analogue of the Geometric distribution.
PDF and CDF:
$$f(x) = \lambda e^{-\lambda x}, \qquad F(x) = 1 - e^{-\lambda x}, \qquad x \ge 0$$
Parameters: $\lambda > 0$ is the rate (events per unit time). The scale is $1/\lambda$ (mean inter-event time).
Moments:
| Quantity | Value |
|---|---|
| Mean | $1/\lambda$ |
| Variance | $1/\lambda^2$ |
| Mode | $0$ (always - the mode is at the boundary) |
| Median | $\ln 2 / \lambda$ |
| Skewness | $2$ (always right-skewed, regardless of $\lambda$) |
| Entropy | $1 - \ln\lambda$ |
MGF:
$$M_X(t) = \frac{\lambda}{\lambda - t}, \qquad t < \lambda$$
Memoryless property (continuous version): For $s, t \ge 0$:
$$P(X > s + t \mid X > s) = P(X > t)$$
Proof: $P(X > t) = e^{-\lambda t}$. Then $P(X > s + t \mid X > s) = \dfrac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t)$.
The Exponential is the unique continuous memoryless distribution. Knowing you have already waited $s$ units gives no information about the remaining wait.
Relationship to Poisson: If events arrive as a Poisson process with rate $\lambda$, then the inter-arrival times are i.i.d. Exponential($\lambda$). The count of events in $[0, t]$ is Poisson($\lambda t$).
For AI:
- Regularisation via sparsity: Laplace (double-exponential) priors on weights yield $\ell_1$ (Lasso) regularisation under MAP estimation.
- Session modelling: time between user sessions is modelled as Exponential.
- Sampling algorithms: the exponential distribution appears in Gillespie's algorithm (exact stochastic simulation) and in MCMC acceptance steps.
3.4 Gamma(α, β)
Story: The sum of $\alpha$ independent Exponential($\beta$) random variables (for integer $\alpha$). Equivalently, the time until the $\alpha$-th event in a Poisson process with rate $\beta$.
PDF:
$$f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x}, \qquad x > 0$$
Here $\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t}\,dt$ is the gamma function, satisfying $\Gamma(n) = (n-1)!$ for positive integers.
Parameters: $\alpha > 0$ is the shape (controls the number of "humps" and skewness), $\beta > 0$ is the rate (scale $= 1/\beta$).
Moments:
| Quantity | Value |
|---|---|
| Mean | $\alpha/\beta$ |
| Variance | $\alpha/\beta^2$ |
| Mode | $(\alpha - 1)/\beta$ for $\alpha \ge 1$; $0$ otherwise |
| Skewness | $2/\sqrt{\alpha}$ (decreases as $\alpha$ grows) |
MGF:
$$M_X(t) = \left(\frac{\beta}{\beta - t}\right)^{\alpha}, \qquad t < \beta$$
Special cases:
| Parameters | Distribution |
|---|---|
| $\alpha = 1$ | Exponential($\beta$) |
| $\alpha = k/2$, $\beta = 1/2$ | Chi-squared($k$) |
| $\alpha = k$ (integer) | Erlang($k$, $\beta$) |
| $\alpha \to \infty$ (normalised) | Gaussian (by CLT) |
Additive property: If $X \sim \text{Gamma}(\alpha_1, \beta)$ and $Y \sim \text{Gamma}(\alpha_2, \beta)$ independently, then $X + Y \sim \text{Gamma}(\alpha_1 + \alpha_2, \beta)$ (same rate).
Shape behaviour:
- $\alpha < 1$: unbounded at 0, heavy right tail
- $\alpha = 1$: Exponential (monotone decreasing from $\beta$)
- $\alpha > 1$: bell-shaped, mode at $(\alpha - 1)/\beta$, right-skewed
- Large $\alpha$: approximately Gaussian (CLT)
For AI:
- Conjugate prior for Poisson rate: Gamma is conjugate to Poisson. After observing $\sum_i k_i$ events in $n$ time units, the posterior is Gamma($\alpha + \sum_i k_i$, $\beta + n$).
- Conjugate prior for Gaussian precision: Gamma is conjugate to the precision ($1/\sigma^2$) of a Gaussian with known mean.
- Variational inference: the Gamma distribution is used as the variational posterior for non-negative latent variables.
3.5 Beta(α, β)
Story: A distribution over probability values in $(0, 1)$. The natural prior for the unknown success probability $p$ of a Bernoulli experiment.
PDF:
$$f(x) = \frac{x^{\alpha - 1}(1-x)^{\beta - 1}}{B(\alpha, \beta)}, \qquad x \in (0, 1)$$
where the Beta function $B(\alpha, \beta) = \dfrac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ is the normalising constant.
Moments:
| Quantity | Value |
|---|---|
| Mean | $\dfrac{\alpha}{\alpha + \beta}$ |
| Variance | $\dfrac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$ |
| Mode | $\dfrac{\alpha - 1}{\alpha + \beta - 2}$ for $\alpha, \beta > 1$ |
The mean depends only on the ratio $\alpha : \beta$. The concentration $\alpha + \beta$ controls the spread: larger $\alpha + \beta$ means more concentration around the mean.
Shape behaviour:
| Condition | Shape |
|---|---|
| $\alpha < 1,\ \beta < 1$ | U-shaped (bimodal at 0 and 1) |
| $\alpha = \beta = 1$ | Uniform(0,1) |
| $\alpha = \beta > 1$ | Symmetric bell, centred at 0.5 |
| $\alpha > \beta$ | Skewed towards 1 (most mass near 1) |
| $\alpha < \beta$ | Skewed towards 0 (most mass near 0) |
| $\alpha, \beta \to \infty$, ratio fixed | Concentrates at $\alpha/(\alpha + \beta)$ |
Pseudocounts interpretation: Beta($\alpha$, $\beta$) encodes the belief arising from having seen $\alpha$ pseudo-successes and $\beta$ pseudo-failures. Beta($1$, $1$) (the Uniform) encodes no prior knowledge.
Relationship to Dirichlet: Beta($\alpha$, $\beta$) is Dirichlet($(\alpha, \beta)$) - the $K = 2$ special case.
For AI:
- RLHF preference model: the Bradley-Terry model places a Beta prior on the probability that one response is preferred over another.
- Proportion modelling: the fraction of positive tokens, the click-through rate, and the fraction of attended tokens are all modelled with Beta distributions.
- Conjugate prior for Binomial: after $k$ successes in $n$ trials, the posterior on $p$ is Beta($\alpha + k$, $\beta + n - k$).
- Beta-VAE: the $\beta$-VAE regularises the KL term with a coefficient $\beta > 1$, encouraging disentanglement of latent variables.
3.6 Dirichlet(α)
Story: A distribution over probability vectors on the simplex $\Delta^{K-1}$. The multivariate generalisation of Beta; the natural prior for the parameter of a Categorical distribution.
PDF: For $\mathbf{p} = (p_1, \dots, p_K)$ with $p_k \ge 0$ and $\sum_{k=1}^K p_k = 1$:
$$f(\mathbf{p}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^{K} p_k^{\alpha_k - 1}$$
where $B(\boldsymbol{\alpha}) = \dfrac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma(\alpha_0)}$ and $\alpha_0 = \sum_{k=1}^K \alpha_k$.
Moments:
| Quantity | Value |
|---|---|
| Mean | $\mathbb{E}[p_k] = \alpha_k/\alpha_0$ |
| Variance | $\text{Var}(p_k) = \dfrac{\alpha_k(\alpha_0 - \alpha_k)}{\alpha_0^2(\alpha_0 + 1)}$ |
| Covariance | $\text{Cov}(p_j, p_k) = \dfrac{-\alpha_j\alpha_k}{\alpha_0^2(\alpha_0 + 1)}$ for $j \ne k$ |
The mean probability vector is $\boldsymbol{\alpha}/\alpha_0$ - normalised concentration parameters.
Concentration parameter $\alpha_0$:
- Small $\alpha_0$ (e.g., $\alpha_k = 0.1$): sparse - samples concentrate near corners of the simplex
- $\alpha_0 = K$ with all $\alpha_k = 1$: uniform over the simplex
- Large $\alpha_0$: concentrated - samples cluster near the mean
Symmetric Dirichlet: When all $\alpha_k = \alpha$ (scalar), the distribution is exchangeable across categories. The parameter $\alpha$ controls concentration:
- $\alpha < 1$: sparse, near-one-hot samples
- $\alpha = 1$: uniform over the simplex
- $\alpha > 1$: dense, near-uniform samples
Marginals: Each marginal $p_k \sim \text{Beta}(\alpha_k, \alpha_0 - \alpha_k)$.
For AI:
- LDA (Latent Dirichlet Allocation): document topic proportions $\boldsymbol{\theta}_d \sim \text{Dirichlet}(\boldsymbol{\alpha})$. A small $\alpha$ enforces document sparsity - each document covers few topics.
- Token vocabulary priors: Dirichlet priors on subword token distributions encode assumptions about language patterns.
- Bayesian categorical models: Dirichlet is the conjugate prior to Categorical. After observing counts $\mathbf{n} = (n_1, \dots, n_K)$, the posterior is Dirichlet($\boldsymbol{\alpha} + \mathbf{n}$).
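A minimal sketch, assuming NumPy, of how the symmetric concentration parameter controls the sparsity of Dirichlet samples; the values of K and alpha are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10
for alpha in (0.1, 1.0, 10.0):
    samples = rng.dirichlet(alpha * np.ones(K), size=5)
    # alpha << 1: most mass on a few coordinates (near-one-hot);
    # alpha >> 1: samples close to the uniform vector (1/K, ..., 1/K).
    print(alpha, np.round(samples.max(axis=1), 2))
```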
3.7 Student-t(ν)
Story: The distribution of the $t$-statistic when estimating the mean of a Gaussian with unknown variance from small samples. A Gaussian with heavier tails - more robust to outliers.
PDF:
$$f(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}\left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$$
Parameter: $\nu > 0$ is the degrees of freedom. Controls tail heaviness.
Moments:
| Quantity | Value | When |
|---|---|---|
| Mean | $0$ | $\nu > 1$ |
| Variance | $\dfrac{\nu}{\nu - 2}$ | $\nu > 2$ |
| Skewness | $0$ | $\nu > 3$ |
| Kurtosis | $3 + \dfrac{6}{\nu - 4}$ | $\nu > 4$ |
The moments do not exist for small $\nu$: variance undefined for $\nu \le 2$, mean undefined for $\nu \le 1$ ($\nu = 1$ is the Cauchy distribution).
Tail heaviness: The tails decay as $x^{-(\nu + 1)}$ - polynomial decay, much heavier than the Gaussian's $e^{-x^2/2}$ decay. For $\nu = 1$: Cauchy distribution (the variance does not exist). For $\nu \to \infty$: $t_\nu \to \mathcal{N}(0, 1)$.
Gaussian construction: $T = \dfrac{Z}{\sqrt{V/\nu}}$ where $Z \sim \mathcal{N}(0, 1)$ and $V \sim \chi^2_\nu$ are independent. This gives $T \sim t_\nu$.
Location-scale family: The general Student-$t$ with mean $\mu$ and scale $\sigma$ has PDF obtained by replacing $x$ with $(x - \mu)/\sigma$ and multiplying by $1/\sigma$.
For AI:
- Robust regression: Student- likelihood replaces Gaussian when outliers are present. Heavy tails assign higher probability to extreme residuals, reducing their influence on parameter estimates.
- Bayesian neural networks: Student-$t$ priors on weights provide robustness; as $\nu$ increases, the prior approaches a Gaussian.
- Uncertainty estimation: predictive distributions in small-data regimes are better modelled with than Gaussian to reflect parameter uncertainty.
- Variational inference: the Student-$t$ arises as a scale mixture of Gaussians: $X \mid \sigma^2 \sim \mathcal{N}(0, \sigma^2)$ with $1/\sigma^2 \sim \text{Gamma}(\nu/2, \nu/2)$ gives marginally $X \sim t_\nu$.
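A minimal sketch, assuming NumPy and SciPy, verifying the scale-mixture construction against direct Student-t samples; the value of ν is illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
nu, n = 5.0, 200_000

precision = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)   # Gamma(nu/2, rate nu/2)
x_mixture = rng.normal(0.0, 1.0 / np.sqrt(precision))       # N(0, 1/precision) given the precision
x_direct = stats.t.rvs(df=nu, size=n, random_state=rng)     # direct Student-t draws

# The two samples should follow the same distribution (compare quantiles).
q = [0.05, 0.25, 0.5, 0.75, 0.95]
print(np.quantile(x_mixture, q))
print(np.quantile(x_direct, q))
```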
4. Distribution Relationships
4.1 Limiting Relationships
Binomial -> Poisson (rare events limit):
When $n \to \infty$, $p \to 0$, $np = \lambda$: $\text{Binomial}(n, p) \to \text{Poisson}(\lambda)$.
Rule of thumb: the approximation is good when $n \ge 100$ and $np \le 10$.
Binomial -> Normal (Central Limit Theorem preview):
When $n \to \infty$ with $p$ fixed: $\dfrac{X - np}{\sqrt{np(1-p)}} \to \mathcal{N}(0, 1)$.
-> Full proof: Section06 Stochastic Processes
Gamma -> Normal (large shape):
When $\alpha \to \infty$ (fixed $\beta$): $\text{Gamma}(\alpha, \beta) \approx \mathcal{N}(\alpha/\beta,\ \alpha/\beta^2)$.
Poisson -> Normal (large rate):
When $\lambda \to \infty$: $\text{Poisson}(\lambda) \approx \mathcal{N}(\lambda, \lambda)$.
Beta -> Dirac (large concentration):
When $\alpha, \beta \to \infty$ with $\alpha/(\alpha + \beta)$ fixed: Beta($\alpha$, $\beta$) concentrates to a point mass at $\alpha/(\alpha + \beta)$.
Student-$t$ -> Normal:
As $\nu \to \infty$: $t_\nu \to \mathcal{N}(0, 1)$. For $\nu \ge 30$, the approximation is excellent.
LIMITING RELATIONSHIPS
========================================================================
Binomial(n,p) --- n->\\infty, p->0, np=\\lambda ---> Poisson(\\lambda)
|
| n->\\infty (CLT)
v
Gaussian N(np, np(1-p))
Gamma(\\alpha,\\beta) ---- \\alpha->\\infty -------------> Gaussian (CLT)
Poisson(\\lambda) ---- \\lambda->\\infty -------------> Gaussian (CLT)
Student-t(\\nu) -- \\nu->\\infty -------------> Gaussian N(0,1)
Beta(\\alpha,\\beta) ----- \\alpha,\\beta->\\infty -----------> Dirac(\\alpha/(\\alpha+\\beta))
========================================================================
4.2 Conjugate Pairs Table
| Likelihood | Prior | Posterior | Updated Parameters |
|---|---|---|---|
| Bernoulli | Beta($\alpha$, $\beta$) | Beta($\alpha + k$, $\beta + n - k$) | $k$ successes in $n$ trials |
| Binomial | Beta | Beta | same as Bernoulli |
| Categorical | Dirichlet($\boldsymbol{\alpha}$) | Dirichlet($\boldsymbol{\alpha} + \mathbf{n}$) | $\mathbf{n}$ = observed counts |
| Multinomial | Dirichlet | Dirichlet | same as Categorical |
| Poisson | Gamma($\alpha$, $\beta$) | Gamma($\alpha + \sum_i x_i$, $\beta + n$) | $n$ observations |
| Exponential | Gamma($\alpha$, $\beta$) | Gamma($\alpha + n$, $\beta + \sum_i x_i$) | $n$ observations |
| Gaussian (known $\sigma^2$) | Gaussian | Gaussian | precision-weighted average |
Full derivations in Section 7 (Conjugate Priors).
5. Moment Generating Functions
5.1 Definition and Uniqueness
The moment generating function (MGF) of a random variable $X$ is:
$$M_X(t) = \mathbb{E}\!\left[e^{tX}\right]$$
when this expectation is finite in an open interval around $t = 0$.
Why "moment generating": By Taylor-expanding $e^{tX}$:
$$M_X(t) = \mathbb{E}\!\left[\sum_{k=0}^{\infty} \frac{(tX)^k}{k!}\right] = \sum_{k=0}^{\infty} \frac{t^k}{k!}\,\mathbb{E}[X^k]$$
Differentiating $k$ times and evaluating at $t = 0$:
$$M_X^{(k)}(0) = \mathbb{E}[X^k]$$
The $k$-th derivative of the MGF at zero is the $k$-th raw moment.
Uniqueness theorem: If $M_X(t)$ exists in an open interval containing 0, it uniquely determines the distribution of $X$. Two random variables with the same MGF have the same distribution.
Cumulant generating function (CGF): $K_X(t) = \log M_X(t)$. Its derivatives at 0 give the cumulants: $\kappa_1 = \mathbb{E}[X]$, $\kappa_2 = \text{Var}(X)$, $\kappa_3 = \mathbb{E}[(X - \mu)^3]$ (third central moment), etc.
5.2 MGF Product Rule
Theorem: If $X \perp Y$, then $M_{X+Y}(t) = M_X(t)\,M_Y(t)$.
Proof: $M_{X+Y}(t) = \mathbb{E}\!\left[e^{t(X+Y)}\right] = \mathbb{E}\!\left[e^{tX} e^{tY}\right] = \mathbb{E}\!\left[e^{tX}\right]\mathbb{E}\!\left[e^{tY}\right] = M_X(t)\,M_Y(t)$, where the third equality uses independence.
Application: Proving that sums of independent distributions stay in the same family:
- $\text{Binomial}(n_1, p) + \text{Binomial}(n_2, p)$: $\text{Binomial}(n_1 + n_2, p)$
- $\text{Poisson}(\lambda_1) + \text{Poisson}(\lambda_2)$: $\text{Poisson}(\lambda_1 + \lambda_2)$
- $\text{Gamma}(\alpha_1, \beta) + \text{Gamma}(\alpha_2, \beta)$: $\text{Gamma}(\alpha_1 + \alpha_2, \beta)$
- $\mathcal{N}(\mu_1, \sigma_1^2) + \mathcal{N}(\mu_2, \sigma_2^2)$: $\mathcal{N}(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)$ - the product of exponentials of quadratics is again an exponential of a quadratic
5.3 MGFs of Key Distributions
| Distribution | MGF | Domain |
|---|---|---|
| Bernoulli($p$) | $1 - p + p e^t$ | all $t$ |
| Binomial($n$, $p$) | $(1 - p + p e^t)^n$ | all $t$ |
| Geometric($p$) | $\dfrac{p e^t}{1 - (1-p)e^t}$ | $t < -\log(1-p)$ |
| Poisson($\lambda$) | $e^{\lambda(e^t - 1)}$ | all $t$ |
| Uniform($a$, $b$) | $\dfrac{e^{tb} - e^{ta}}{t(b-a)}$ | all $t$ ($= 1$ at $t = 0$) |
| Exponential($\lambda$) | $\dfrac{\lambda}{\lambda - t}$ | $t < \lambda$ |
| Gamma($\alpha$, $\beta$) | $\left(\dfrac{\beta}{\beta - t}\right)^{\alpha}$ | $t < \beta$ |
| Gaussian($\mu$, $\sigma^2$) | $e^{\mu t + \sigma^2 t^2/2}$ | all $t$ |
| Beta($\alpha$, $\beta$) | no simple closed form (a confluent hypergeometric function) | all $t$ |
Note: the Beta MGF does not have a simple closed form; the Dirichlet and Categorical are typically characterised by their probability generating functions or characteristic functions instead.
-> Full treatment of MGF applications and cumulants: Section04 Expectation and Moments
6. The Exponential Family
The exponential family is the most important unifying framework in probability and statistics. It explains why Bernoulli, Gaussian, Poisson, Gamma, Beta, Dirichlet, and Categorical all share the same algorithmic properties.
6.1 Canonical Form
A distribution belongs to the exponential family if its PDF (or PMF) can be written as:
$$p(x \mid \boldsymbol{\eta}) = h(x)\,\exp\!\big(\boldsymbol{\eta}^\top T(x) - A(\boldsymbol{\eta})\big)$$
where:
- $\boldsymbol{\eta}$ - natural parameters (the parameterisation used in optimisation)
- $T(x)$ - sufficient statistics (captures all information about $\boldsymbol{\eta}$ in the data)
- $A(\boldsymbol{\eta})$ - log-partition function (normalisation constant in log space)
- $h(x)$ - base measure (does not depend on $\boldsymbol{\eta}$)
The canonical form separates the "shape" of the distribution ($h$ and $T$) from the parameterisation ($\boldsymbol{\eta}$ and $A$).
6.2 Members and Their Parameters
| Distribution | Natural param $\boldsymbol{\eta}$ | Sufficient stat $T(x)$ | Log-partition $A(\boldsymbol{\eta})$ | Base measure $h(x)$ |
|---|---|---|---|---|
| Bernoulli | $\log\frac{p}{1-p}$ | $x$ | $\log(1 + e^{\eta})$ | $1$ |
| Binomial ($n$ fixed) | $\log\frac{p}{1-p}$ | $x$ | $n\log(1 + e^{\eta})$ | $\binom{n}{x}$ |
| Poisson | $\log\lambda$ | $x$ | $e^{\eta}$ | $1/x!$ |
| Exponential | $-\lambda$ | $x$ | $-\log(-\eta)$ | $1$ |
| Gaussian (both unknown) | $\left(\frac{\mu}{\sigma^2},\ -\frac{1}{2\sigma^2}\right)$ | $(x,\ x^2)$ | $-\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)$ | $\frac{1}{\sqrt{2\pi}}$ |
| Gamma | $(\alpha - 1,\ -\beta)$ | $(\log x,\ x)$ | $\log\Gamma(\eta_1 + 1) - (\eta_1 + 1)\log(-\eta_2)$ | $1$ |
| Beta | $(\alpha - 1,\ \beta - 1)$ | $(\log x,\ \log(1-x))$ | $\log B(\eta_1 + 1,\ \eta_2 + 1)$ | $1$ |
| Categorical | $(\log p_k)_{k=1}^{K}$ (logits) | one-hot$(x)$ | $\log\sum_k e^{\eta_k}$ | $1$ |
6.3 The Log-Partition Function
The log-partition function $A(\boldsymbol{\eta})$ is convex (always, by Holder's inequality). Its derivatives generate the cumulants:
$$\nabla A(\boldsymbol{\eta}) = \mathbb{E}[T(X)], \qquad \nabla^2 A(\boldsymbol{\eta}) = \text{Cov}[T(X)]$$
Proof: $A(\boldsymbol{\eta}) = \log\int h(x)\,e^{\boldsymbol{\eta}^\top T(x)}\,dx$. Differentiating under the integral (valid by dominated convergence):
$$\nabla A(\boldsymbol{\eta}) = \frac{\int T(x)\,h(x)\,e^{\boldsymbol{\eta}^\top T(x)}\,dx}{\int h(x)\,e^{\boldsymbol{\eta}^\top T(x)}\,dx} = \mathbb{E}[T(X)]$$
The second derivative similarly gives the covariance.
Consequence: Computing moments of any exponential family member reduces to differentiating $A(\boldsymbol{\eta})$ - a single convex function. This is an extraordinary computational shortcut.
Example - Poisson: $A(\eta) = e^{\eta}$. So $A'(\eta) = e^{\eta} = \lambda$ and $A''(\eta) = e^{\eta} = \lambda$ - confirming the Poisson mean-equals-variance property directly from the log-partition function.
6.4 Sufficient Statistics
Definition (Fisher 1922): A statistic $T(X_1, \dots, X_n)$ is sufficient for $\theta$ if the conditional distribution of the data given $T$ does not depend on $\theta$.
Intuitively: $T$ captures all information in the data about $\theta$.
Fisher-Neyman factorisation theorem: $T$ is sufficient if and only if the likelihood factors as:
$$p(x_1, \dots, x_n \mid \theta) = g\big(T(x_1, \dots, x_n), \theta\big)\, h(x_1, \dots, x_n)$$
For exponential families with $n$ i.i.d. observations:
$$T(x_1, \dots, x_n) = \sum_{i=1}^{n} T(x_i)$$
Examples:
- Bernoulli: $\sum_i x_i$ (total successes) is sufficient for $p$
- Gaussian (both params unknown): $\left(\sum_i x_i,\ \sum_i x_i^2\right)$ is sufficient for $(\mu, \sigma^2)$
- Poisson: $\sum_i x_i$ (total count) is sufficient for $\lambda$
Pitman-Koopman-Darmois theorem: The only distributions with a fixed-dimension sufficient statistic (regardless of sample size ) are the exponential family members.
6.5 ML Implications
Softmax as categorical exponential family:
The Categorical natural parameters are the logits $\eta_k = \log p_k$ (defined up to a shared additive constant; the minimal parameterisation uses $\eta_k = \log(p_k/p_K)$). Inverting:
$$p_k = \frac{e^{\eta_k}}{\sum_{j=1}^{K} e^{\eta_j}} = \mathrm{softmax}(\boldsymbol{\eta})_k$$
The softmax is therefore the canonical link function of the Categorical exponential family. Neural networks output logits, which are exactly the natural parameters $\boldsymbol{\eta}$.
Log-sum-exp = log-partition function:
$$A(\boldsymbol{\eta}) = \log\sum_{k=1}^{K} e^{\eta_k}$$
The logsumexp operation is the log-partition function of the Categorical distribution. The numerically stable form is directly derived from properties of $A$ (see the sketch below).
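A minimal sketch of the stable logsumexp/log-softmax, NumPy only (function names are illustrative). The shift by $\max_j z_j$ leaves the result unchanged because $A(\mathbf{z} + c) = A(\mathbf{z}) + c$.

```python
import numpy as np

def logsumexp(z):
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))   # no overflow: z - m <= 0

def log_softmax(z):
    return z - logsumexp(z)

z = np.array([1000.0, 1001.0, 1002.0])         # naive exp(z) would overflow
print(log_softmax(z))                          # finite, correct log-probabilities
print(np.exp(log_softmax(z)).sum())            # sums to 1
```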
Natural gradient: The Fisher information matrix of an exponential family is $F(\boldsymbol{\eta}) = \nabla^2 A(\boldsymbol{\eta})$. The natural gradient $F^{-1}\nabla_{\boldsymbol{\eta}}\mathcal{L}$ accounts for the curvature of the distribution manifold, giving parameter-invariant updates. K-FAC approximates $F$ for neural networks.
MLE for exponential families: The MLE of $\boldsymbol{\eta}$ satisfies:
$$\mathbb{E}_{\hat{\boldsymbol{\eta}}}[T(X)] = \frac{1}{n}\sum_{i=1}^{n} T(x_i)$$
meaning the expected sufficient statistics under the model equal the empirical sufficient statistics - a moment matching condition.
7. Conjugate Priors in Bayesian Inference
Bayesian inference requires computing the posterior $p(\theta \mid x) \propto p(x \mid \theta)\,p(\theta)$. For most priors, this requires numerical integration. Conjugate priors are the special priors for which the posterior stays in the same family - making inference analytic.
7.1 Conjugacy Definition
Definition: A prior $p(\theta)$ is conjugate to a likelihood $p(x \mid \theta)$ if the posterior $p(\theta \mid x)$ is in the same family as the prior.
For exponential family likelihoods, conjugate priors always exist and take a canonical form. The general conjugate prior to $p(x \mid \boldsymbol{\eta}) = h(x)\exp(\boldsymbol{\eta}^\top T(x) - A(\boldsymbol{\eta}))$ is:
$$p(\boldsymbol{\eta} \mid \boldsymbol{\tau}, n_0) \propto \exp\!\big(\boldsymbol{\tau}^\top\boldsymbol{\eta} - n_0\,A(\boldsymbol{\eta})\big)$$
After observing data points $x_1, \dots, x_n$:
$$p(\boldsymbol{\eta} \mid \boldsymbol{\tau}, n_0, x_{1:n}) \propto \exp\!\Big(\big(\boldsymbol{\tau} + \textstyle\sum_i T(x_i)\big)^\top\boldsymbol{\eta} - (n_0 + n)\,A(\boldsymbol{\eta})\Big)$$
The hyperparameters update simply: $\boldsymbol{\tau} \to \boldsymbol{\tau} + \sum_i T(x_i)$ and $n_0 \to n_0 + n$.
7.2 Beta-Bernoulli/Binomial
Model: $x_1, \dots, x_n \overset{\text{iid}}{\sim} \text{Bernoulli}(\theta)$, $\theta \sim \text{Beta}(\alpha, \beta)$.
Likelihood: $p(x_{1:n} \mid \theta) = \theta^{k}(1-\theta)^{n-k}$ where $k = \sum_i x_i$.
Posterior:
$$p(\theta \mid x_{1:n}) \propto \theta^{\alpha + k - 1}(1-\theta)^{\beta + n - k - 1} \;\Longrightarrow\; \theta \mid x_{1:n} \sim \text{Beta}(\alpha + k,\ \beta + n - k)$$
Pseudocounts interpretation: The prior Beta($\alpha$, $\beta$) encodes $\alpha$ prior successes and $\beta$ prior failures. After seeing $k$ successes in $n$ trials, the posterior is Beta($\alpha + k$, $\beta + n - k$) - simply adding counts.
Posterior mean:
$$\mathbb{E}[\theta \mid x_{1:n}] = \frac{\alpha + k}{\alpha + \beta + n}$$
This is a shrinkage estimator between the prior mean $\frac{\alpha}{\alpha+\beta}$ and the MLE $\frac{k}{n}$:
$$\mathbb{E}[\theta \mid x_{1:n}] = \frac{\alpha + \beta}{\alpha + \beta + n}\cdot\frac{\alpha}{\alpha + \beta} + \frac{n}{\alpha + \beta + n}\cdot\frac{k}{n}$$
As $n \to \infty$, the posterior concentrates at the MLE $k/n$.
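A minimal sketch of the Beta-Binomial update and the resulting shrinkage; the prior and counts are illustrative.

```python
alpha0, beta0 = 2.0, 5.0                       # prior pseudocounts
prior_mean = alpha0 / (alpha0 + beta0)         # 2/7 ~= 0.286

for n, k in [(10, 3), (100, 30), (10_000, 3_000)]:
    alpha_post, beta_post = alpha0 + k, beta0 + (n - k)
    post_mean = alpha_post / (alpha_post + beta_post)
    mle = k / n
    print(n, round(post_mean, 4), round(mle, 4))
# As n grows, the posterior mean moves from the prior mean (0.286) to the MLE (0.3).
```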
7.3 Dirichlet-Categorical/Multinomial
Model: $x_1, \dots, x_n \overset{\text{iid}}{\sim} \text{Categorical}(\mathbf{p})$, $\mathbf{p} \sim \text{Dirichlet}(\boldsymbol{\alpha})$.
Posterior:
$$\mathbf{p} \mid x_{1:n} \sim \text{Dirichlet}(\alpha_1 + n_1, \dots, \alpha_K + n_K)$$
where $n_k$ are the observed category counts ($\sum_k n_k = n$).
Posterior mean: $\mathbb{E}[p_k \mid x_{1:n}] = \dfrac{\alpha_k + n_k}{\alpha_0 + n}$.
Add-one (Laplace) smoothing: Setting $\alpha_k = 1$ (uniform Dirichlet prior) gives the Laplace smoothed estimate $\hat{p}_k = \dfrac{n_k + 1}{n + K}$ - the standard technique for avoiding zero probabilities in language model unigrams.
For LDA: Each document $d$ has topic proportions $\boldsymbol{\theta}_d \sim \text{Dirichlet}(\boldsymbol{\alpha})$. Each topic $k$ has word distribution $\boldsymbol{\phi}_k \sim \text{Dirichlet}(\boldsymbol{\eta})$. Words are drawn as $w \sim \text{Categorical}(\boldsymbol{\phi}_z)$ where $z \sim \text{Categorical}(\boldsymbol{\theta}_d)$.
7.4 Gamma-Poisson
Model: $x_1, \dots, x_n \overset{\text{iid}}{\sim} \text{Poisson}(\lambda)$, $\lambda \sim \text{Gamma}(\alpha, \beta)$.
Posterior:
$$\lambda \mid x_{1:n} \sim \text{Gamma}\!\left(\alpha + \sum_{i=1}^{n} x_i,\ \beta + n\right)$$
Posterior mean: $\dfrac{\alpha + \sum_i x_i}{\beta + n}$ - shrinkage between the prior mean $\alpha/\beta$ and the MLE $\bar{x}$.
Interpretations: The prior Gamma($\alpha$, $\beta$) encodes "$\alpha$ events observed over $\beta$ prior time units." After observing $\sum_i x_i$ events in $n$ new time units, the posterior is Gamma($\alpha + \sum_i x_i$, $\beta + n$).
7.5 Normal-Normal
Model: $x_1, \dots, x_n \overset{\text{iid}}{\sim} \mathcal{N}(\theta, \sigma^2)$, $\theta \sim \mathcal{N}(\mu_0, \tau_0^2)$, with known $\sigma^2$.
Posterior:
$$\theta \mid x_{1:n} \sim \mathcal{N}(\mu_n, \tau_n^2)$$
where:
$$\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}, \qquad \mu_n = \tau_n^2\left(\frac{\mu_0}{\tau_0^2} + \frac{n\bar{x}}{\sigma^2}\right)$$
Precision-weighted average: The posterior mean is:
$$\mu_n = \frac{\frac{1}{\tau_0^2}\,\mu_0 + \frac{n}{\sigma^2}\,\bar{x}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}$$
a precision-weighted combination of prior mean and sample mean. As $n \to \infty$, $\tau_n^2 \to 0$ and $\mu_n \to \bar{x}$ (the posterior concentrates at the MLE).
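A minimal NumPy sketch of the precision-weighted update; the prior, observation variance, and data are illustrative.

```python
import numpy as np

mu0, tau0_sq = 0.0, 1.0               # prior N(mu0, tau0^2)
sigma_sq = 4.0                        # known observation variance
x = np.array([2.1, 1.7, 2.4, 1.9, 2.2])
n, xbar = len(x), x.mean()

post_precision = 1 / tau0_sq + n / sigma_sq
post_var = 1 / post_precision
post_mean = post_var * (mu0 / tau0_sq + n * xbar / sigma_sq)
print(post_mean, post_var)            # pulled from 0 toward xbar; variance shrinks with n
```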
8. ML Applications
8.1 Language Models: Categorical and Temperature
Every forward pass of a language model computes:
- Logits $\mathbf{z} \in \mathbb{R}^{V}$ over a vocabulary of size $V$
- Token probabilities $p_k = \mathrm{softmax}(\mathbf{z}/T)_k$ at temperature $T$
- Next token $x \sim \text{Categorical}(\mathbf{p})$
The cross-entropy training loss is $-\log p_{x^*}$ for the observed token $x^*$, which is exactly $-\boldsymbol{\eta}^\top T(x^*) + A(\boldsymbol{\eta})$ - the NLL of a Categorical exponential family member.
Perplexity: $\exp\!\left(\frac{1}{N}\sum_{i=1}^{N} -\log p_{x_i^*}\right)$ - the exponentiated cross-entropy, measuring the effective vocabulary size.
8.2 VAEs: Gaussian Reparameterisation and KL Term
The VAE ELBO is:
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$
With $q_\phi(z \mid x) = \mathcal{N}(\boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2))$ and $p(z) = \mathcal{N}(\mathbf{0}, I)$:
$$\mathrm{KL} = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1\right)$$
This closed-form KL follows from the Gaussian moments and entropy. The reparameterisation trick enables backpropagation: $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$.
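A minimal sketch of this closed-form KL in the usual log-variance parameterisation, assuming PyTorch; `kl_gaussian` is an illustrative name, not a library function.

```python
import torch

def kl_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1.0, dim=-1)

mu = torch.zeros(4, 8)                 # batch of 4, latent dimension 8
log_var = torch.zeros(4, 8)            # sigma = 1 everywhere
print(kl_gaussian(mu, log_var))        # exactly zero when q equals the prior
```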
8.3 Dropout as Bernoulli Masking
Dropout applies a Bernoulli mask independently to each activation:
$$\tilde{a}_i = m_i \cdot a_i, \qquad m_i \sim \text{Bernoulli}(1 - p_{\text{drop}})$$
At test time, the expectation is used: $\mathbb{E}[\tilde{a}_i] = (1 - p_{\text{drop}})\,a_i$.
This is equivalent to training an ensemble of $2^N$ networks (where $N$ is the number of units) sharing weights, with one network sampled at each step.
8.4 RLHF: Bradley-Terry and Beta Prior
The Bradley-Terry model assigns probability to human preferences:
$$P(y_1 \succ y_2) = \sigma\big(r(y_1) - r(y_2)\big)$$
where $r(\cdot)$ are learned reward scalars and $\sigma$ is the sigmoid (Bernoulli logit link). The reward difference $r(y_1) - r(y_2)$ plays the role of the Bernoulli natural parameter $\eta$.
A Beta prior on the preference probability regularises the reward model toward a neutral preference ($p = 1/2$).
8.5 Diffusion Models: Gaussian Noise Schedule
The forward process adds Gaussian noise over $T$ steps:
$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\big(\sqrt{1 - \beta_t}\,\mathbf{x}_{t-1},\ \beta_t I\big)$$
Using the Gaussian stability under sums (Section 3.2), the marginal at step $t$ is:
$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1 - \bar{\alpha}_t) I\big), \qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1 - \beta_s)$$
The reverse process is also Gaussian (for small $\beta_t$), with mean predicted by the neural network.
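A minimal sketch of sampling the closed-form marginal directly, assuming PyTorch; the linear noise schedule and tensor shapes are illustrative.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # a simple linear schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t for every step

def q_sample(x0, t):
    """Draw x_t ~ N(sqrt(alpha_bar_t) x0, (1 - alpha_bar_t) I) in one shot."""
    eps = torch.randn_like(x0)
    a = alpha_bars[t]
    return torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps

x0 = torch.randn(16, 3, 8, 8)                    # a toy batch of "images"
xt = q_sample(x0, t=500)
print(xt.shape, xt.std().item())                 # ~1 here because x0 is unit-variance
```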
8.6 Topic Models (LDA)
Latent Dirichlet Allocation uses three distributions from this section:
- $\boldsymbol{\theta}_d \sim \text{Dirichlet}(\boldsymbol{\alpha})$ - topic proportions per document (Dirichlet)
- $z_{dn} \sim \text{Categorical}(\boldsymbol{\theta}_d)$ - topic assignment per word (Categorical)
- $w_{dn} \sim \text{Categorical}(\boldsymbol{\phi}_{z_{dn}})$ - word given topic (Categorical with Dirichlet prior $\boldsymbol{\phi}_k \sim \text{Dirichlet}(\boldsymbol{\eta})$)
The full joint is:
$$p(\boldsymbol{\theta}, \mathbf{z}, \mathbf{w}, \boldsymbol{\phi}) = \prod_{k} p(\boldsymbol{\phi}_k \mid \boldsymbol{\eta})\,\prod_{d} p(\boldsymbol{\theta}_d \mid \boldsymbol{\alpha})\prod_{n} p(z_{dn} \mid \boldsymbol{\theta}_d)\,p(w_{dn} \mid \boldsymbol{\phi}_{z_{dn}})$$
Inference uses collapsed Gibbs sampling (Dirichlet-Categorical conjugacy allows marginalising $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$).
9. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Confusing the Gaussian parameter $\sigma^2$ with $\sigma$ | $\mathcal{N}(\mu, \sigma^2)$ takes the variance as second argument in most conventions, but NumPy's np.random.normal(mu, sigma) takes the std dev | Always state whether you use variance or std dev; in code use scale=sigma (std dev) |
| 2 | Assuming Poisson when variance > mean | Poisson requires Var = Mean exactly; real count data often has overdispersion | Check the variance-to-mean ratio; use Negative Binomial for overdispersed data |
| 3 | Using Binomial for small $p$ with large $n$ | Numerical issues with $\binom{n}{k}$ when $n$ is large | Switch to the Poisson approximation or compute in log-space |
| 4 | Interpreting Beta parameters as probabilities | $\alpha$ and $\beta$ are pseudocount concentrations, not the mean itself | Mean is $\alpha/(\alpha+\beta)$; Beta(1, 1) is uniform, not a point mass |
| 5 | Forgetting that the Dirichlet concentration parameter controls sparsity | $\alpha < 1$ produces sparse samples; $\alpha > 1$ produces dense ones | Set $\alpha < 1$ for topic models expecting sparse documents |
| 6 | Confusing Categorical and one-hot encoding | A Categorical sample is an integer; its one-hot encoding is a vector | In PyTorch, Categorical.sample() returns indices, not one-hot vectors |
| 7 | Applying Student-$t$ formulas when $\nu$ is small | Variance is undefined for $\nu \le 2$; mean undefined for $\nu \le 1$ | Always check $\nu$ before using moment formulas; for $\nu = 1$ (Cauchy), the variance does not exist |
| 8 | Treating Exponential rate and scale interchangeably | Some sources use rate $\lambda$ (mean $1/\lambda$); others use scale $\theta$ (mean $\theta$) | Verify convention: SciPy stats.expon(scale=theta) uses scale; PyTorch Exponential(rate=lambda) uses rate |
| 9 | Using the Normal approximation to the Binomial when $np$ or $n(1-p)$ is small | The CLT kicks in slowly in the tails; the approximation is poor for extreme $p$ | Use the exact Binomial PMF or the Poisson approximation |
| 10 | Claiming exponential family membership for Student-$t$ | The Student-$t$ is NOT in the exponential family (its density does not factor into the canonical form) | Student-$t$ is a scale mixture of Gaussians - handle separately |
| 11 | Confusing natural parameters with mean parameters | The Gaussian natural parameters are $(\mu/\sigma^2,\ -1/(2\sigma^2))$, not $(\mu, \sigma^2)$ | Distinguish mean parameterisation (human-readable) from natural parameterisation (for exponential family theory) |
| 12 | Using the Beta posterior mean instead of the mode for MAP estimation | Posterior mean $= \frac{\alpha + k}{\alpha + \beta + n}$; MAP (mode) $= \frac{\alpha + k - 1}{\alpha + \beta + n - 2}$ | For point decisions, use the posterior mode (MAP) or the full posterior, not the mean by default |
10. Exercises
Exercise 1 * - PMF and Moments
A biased die has faces weighted so that the probability of face is proportional to for .
(a) Find the normalising constant and write the PMF explicitly. (b) Compute and . (c) Find . (d) Is this distribution in the exponential family? Identify , , and if so.
Exercise 2 * - Poisson Limit
A social media post receives clicks at a rate of per minute. Model the number of clicks in a 10-minute window.
(a) Write the PMF and compute , , . (b) Suppose you model this as Binomial where each second either produces a click or not. Find and verify the Poisson limit numerically for . (c) What property of the Poisson means that clicks in disjoint time windows are independent? (d) If two different posts receive and clicks/minute, what is the distribution of the total clicks per minute? Prove it using MGFs.
Exercise 3 * - Gaussian Properties
Let (mean 3, variance 4).
(a) Standardise to obtain . (b) Compute using . (c) If , find the distribution of . (d) If , find the distribution of . (e) Verify numerically that the MGF formula gives the correct mean and variance for .
Exercise 4 ** - Beta-Binomial Conjugate Update
You are estimating the click-through rate of a button. Your prior belief is Beta.
(a) Interpret this prior: what pseudocounts does it encode? What is the prior mean? (b) You observe 15 clicks in 100 impressions. Write the posterior distribution. (c) Compute the posterior mean and compare it to the MLE . (d) After how many additional clicks (keeping total impressions fixed at 100) would the posterior mean exceed ? (e) Plot the prior, likelihood (rescaled), and posterior as a function of (implement in Python).
Exercise 5 ** - Exponential Family Identification
(a) Show that the Geometric distribution belongs to the exponential family. Identify , , , and . (b) For the Geometric, compute by differentiating . (c) Show that the Negative Binomial distribution also belongs to the exponential family. (d) The uniform distribution with unknown - does it belong to the exponential family? Explain.
Exercise 6 ** - Dirichlet-Categorical Posterior
A language model assigns log-probabilities to tokens. You use a symmetric Dirichlet prior over a vocabulary of size (simplified).
(a) Sample 3 probability vectors from Dir and 3 from Dir. Describe the visual difference. (b) You observe the token sequence: (where are the 5 tokens). Compute the posterior Dirichlet. (c) Compute the posterior mean probability for each token. (d) Compare with the Laplace-smoothed MLE estimate. Show they are equal when .
Exercise 7 *** - Softmax as Exponential Family
(a) Derive the softmax function from the Categorical exponential family canonical form. Show that the log-partition function leads to .
(b) Implement log_softmax(z) in a numerically stable way (subtract max before exponentiating). Verify it equals log(softmax(z)) but is more numerically stable for large logits.
(c) The gradient of the cross-entropy loss with respect to logits is where . Derive this result.
(d) Show that temperature scaling is equivalent to scaling the natural parameters, and explain why gives argmax and gives uniform.
Exercise 8 *** - Gaussian VAE KL Term
The VAE training objective requires:
(a) Derive the closed-form expression: using the Gaussian MGF.
(b) Verify this equals 0 when and for all .
(c) Implement kl_gaussian(mu, log_var) where log_var = log(sigma^2) (the standard VAE parameterisation).
(d) Plot the per-dimension KL as a function of (with ) and as a function of (with ). What do the minima imply for the VAE encoder?
11. Why This Matters for AI (2026 Perspective)
| Concept | Impact on Modern AI |
|---|---|
| Categorical + softmax | Every language model output layer is a Categorical exponential family. Logits are natural parameters; cross-entropy is NLL; temperature controls entropy of the output distribution |
| Gaussian reparameterisation | Enables end-to-end training of VAEs (2013), diffusion model denoising (2020), and Gaussian noise models in score-based generation |
| Dirichlet priors | LDA (2003) topic models still used for document understanding; Dirichlet-process priors in Bayesian nonparametric methods for adaptive-capacity models |
| Beta-Binomial conjugacy | RLHF reward modelling uses preference probability estimates; uncertainty-aware sampling uses Beta posteriors for exploration-exploitation |
| Exponential family unification | Natural gradient (K-FAC, Shampoo) uses the Fisher information matrix ; the link between logits and natural parameters motivates output-layer design choices |
| Student-$t$ distribution | Robust regression and uncertainty quantification in small-data regimes; $t$-distributed stochastic neighbour embedding (t-SNE) uses $\nu = 1$ (Cauchy) tails to separate clusters |
| Poisson | Language model token position counts, API call modelling, and event-based neural networks (neuromorphic AI) |
| Gamma-Poisson conjugacy | Bayesian A/B testing for click-through rates and conversion rates; Thompson sampling for multi-armed bandit |
| Diffusion Gaussian schedules | Stable Diffusion, DALL-E, Sora all use Gaussian forward processes; the closed-form enables efficient training without simulating the full Markov chain |
| Log-partition function | The logsumexp trick (numerically stable ) is fundamental to FlashAttention's online softmax computation |
12. Conceptual Bridge
This section forms the vocabulary layer of probability theory. You can now think about every probabilistic model in terms of its component distributions: a Gaussian prior, a Categorical likelihood, a Dirichlet hyperprior. Without this vocabulary, reading a VAE paper or an LDA paper is like trying to read chemistry without knowing the periodic table.
Looking backward: The CDF, PDF, and PMF definitions from Section01 gave the framework; this section fills it with concrete instances. The axioms guaranteed consistency; the named distributions give tractability. Every distribution here satisfies all the axioms of Section01 - the Bernoulli is the simplest, the Dirichlet the most complex, but all obey the same rules.
Looking forward:
- Section03 Joint Distributions extends to multiple random variables - the multivariate Gaussian, joint densities, marginalisation, and the chain rule of probability.
- Section04 Expectation and Moments derives the moments stated here from first principles, introduces the full LOTUS theorem, and develops MGFs as analytical tools.
- Section05 Concentration Inequalities bounds how far the distributions here can deviate from their means.
- Section06 Stochastic Processes proves the CLT - the theorem that explains why the Gaussian is the limiting distribution for so many of the relationships in Section4.
POSITION IN CURRICULUM
========================================================================
Section06/01 Introduction and Random Variables
v (foundations: axioms, CDF, PDF, Bernoulli/Uniform preview)
> Section06/02 Common Distributions < <- YOU ARE HERE
v (full vocabulary: all named distributions, MGFs, exp family)
Section06/03 Joint Distributions
v (multivariate: joint PDF, marginals, multivariate Gaussian)
Section06/04 Expectation and Moments
v (derivations: LOTUS, covariance matrix, MGF applications)
Section06/05 Concentration Inequalities
v (bounds: Markov, Chebyshev, Hoeffding, PAC learning)
Section06/06 Stochastic Processes
v (CLT: proves the Gaussian limit relationships of Section4)
Section06/07 Markov Chains
(MCMC: uses conjugacy and Gaussian proposals)
========================================================================
The distributions in this section are not a list to memorise - they are a language to think in. When a practitioner says "the model is overconfident," they mean the predicted Categorical is too peaked. When they say "use a stronger prior," they mean increase in the Dirichlet. When they say "the KL term is too large," they mean the approximate Gaussian posterior is far from the standard normal prior. Every one of these statements refers to a specific distribution from this section.
<- Back to Probability Theory | Next: Joint Distributions ->
Appendix A: Detailed Distribution Reference Cards
A.1 Bernoulli Distribution - Full Reference
BERNOULLI(p) REFERENCE CARD
========================================================================
PMF: P(X=x) = p^x (1-p)^(1-x), x \\in {0, 1}
CDF: F(x) = 0 x < 0
1 - p 0 \\leq x < 1
1 x \\geq 1
Mean: p
Var: p(1-p) [max at p=0.5]
Mode: 1{p > 0.5}
Entropy: -p log p - (1-p) log(1-p)
MGF: M(t) = 1 - p + pe^t
Natural param: \\eta = log(p/(1-p)) [logit]
Inverse link: p = \\sigma(\\eta) = 1/(1+e^{-\\eta}) [sigmoid]
ML role: Binary labels, dropout, stochastic depth, RLHF preferences
========================================================================
A.2 Gaussian Distribution - Full Reference
GAUSSIAN N(\\mu, \\sigma^2) REFERENCE CARD
========================================================================
PDF: f(x) = (1/\\sigma\\sqrt(2\\pi)) exp(-(x-\\mu)^2/2\\sigma^2)
CDF: F(x) = \\Phi((x-\\mu)/\\sigma) where \\Phi = standard normal CDF
Mean: \\mu
Variance: \\sigma^2
Mode: \\mu (unique)
Median: \\mu (by symmetry)
Entropy: 1/2 log(2\\pi e \\sigma^2) [maximum for fixed mean/var]
MGF: M(t) = exp(\\mu t + \\sigma^2 t^2/2)
CGF: K(t) = \\mu t + \\sigma^2 t^2/2 [cumulants: \\kappa_1=\\mu, \\kappa_2=\\sigma^2, \\kappa_k=0 for k\\geq3]
Standard: Z = (X - \\mu)/\\sigma ~ N(0, 1)
Key quantiles (standard normal):
\\Phi(1.282) = 0.90, \\Phi(1.645) = 0.95, \\Phi(1.960) = 0.975
\\Phi(2.326) = 0.99, \\Phi(2.576) = 0.995, \\Phi(3.090) = 0.999
68-95-99.7: P(|Z| \\leq 1) \\approx 0.683, P(|Z| \\leq 2) \\approx 0.954, P(|Z| \\leq 3) \\approx 0.997
Natural params: \\eta_1 = \\mu/\\sigma^2, \\eta_2 = -1/(2\\sigma^2)
Suff stats: T(x) = (x, x^2)
ML role: Weight init, VAE prior/posterior, GP, diffusion noise, batch norm
========================================================================
A.3 Beta Distribution - Full Reference
BETA(\\alpha, \\beta) REFERENCE CARD
========================================================================
PDF: f(x) = x^(\\alpha-1)(1-x)^(\\beta-1) / B(\\alpha,\\beta), x \\in (0, 1)
B(\\alpha,\\beta) = \\Gamma(\\alpha)\\Gamma(\\beta)/\\Gamma(\\alpha+\\beta)
Mean: \\alpha/(\\alpha+\\beta)
Var: \\alpha\\beta / [(\\alpha+\\beta)^2(\\alpha+\\beta+1)]
Mode: (\\alpha-1)/(\\alpha+\\beta-2) for \\alpha,\\beta > 1
Shape patterns:
\\alpha<1, \\beta<1 -> U-shaped (bimodal at 0 and 1)
\\alpha=\\beta=1 -> Uniform(0,1)
\\alpha=\\beta>1 -> Symmetric bell
\\alpha>\\beta -> Skewed toward 1
\\alpha<\\beta -> Skewed toward 0
\\alpha,\\beta -> \\infty, ratio -> Concentrates at mode
Conjugate prior for: Bernoulli, Binomial
Posterior update: Beta(\\alpha, \\beta) + k successes, n-k failures
-> Beta(\\alpha+k, \\beta+n-k)
ML role: CTR estimation, RLHF preference priors, Beta-VAE
========================================================================
Appendix B: The Gamma Function
The gamma function generalises the factorial to non-integer arguments and appears in the normalising constants of Beta, Dirichlet, Gamma, and Student- distributions.
B.1 Definition and Key Properties
For $z > 0$:
$$\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\,dt$$
Recurrence relation:
$$\Gamma(z + 1) = z\,\Gamma(z)$$
Proof: Integration by parts with $u = t^{z}$ and $dv = e^{-t}\,dt$:
$$\Gamma(z+1) = \int_0^{\infty} t^{z} e^{-t}\,dt = \big[-t^{z} e^{-t}\big]_0^{\infty} + z\int_0^{\infty} t^{z-1} e^{-t}\,dt = z\,\Gamma(z)$$
Factorial connection: $\Gamma(n) = (n-1)!$ for positive integers $n$.
Half-integer values: $\Gamma(1/2) = \sqrt{\pi}$, $\Gamma(3/2) = \tfrac{1}{2}\sqrt{\pi}$, $\Gamma(5/2) = \tfrac{3}{4}\sqrt{\pi}$.
B.2 The Beta Function
The Beta function satisfies:
$$B(\alpha, \beta) = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\,dt = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$$
This integral is precisely the normalising constant of the Beta PDF.
Symmetry: $B(\alpha, \beta) = B(\beta, \alpha)$.
B.3 Stirling's Approximation
For large $n$: $n! \approx \sqrt{2\pi n}\,\left(\dfrac{n}{e}\right)^{n}$.
In terms of Gamma: $\log\Gamma(z) \approx z\log z - z + \tfrac{1}{2}\log\dfrac{2\pi}{z}$ for large $z$.
This is used to analyse the Poisson distribution at large $\lambda$ and the convergence of the Binomial to the Poisson.
Appendix C: Sampling from Distributions
C.1 Inverse CDF Method
For any distribution with continuous, invertible CDF $F$:
- Draw $U \sim \text{Uniform}(0, 1)$
- Return $X = F^{-1}(U)$
Then $X \sim F$.
Explicit inverse CDFs:
| Distribution | Inverse CDF $F^{-1}(u)$ |
|---|---|
| Exponential($\lambda$) | $-\dfrac{\log(1-u)}{\lambda}$ |
| Geometric($p$) | $\left\lceil \dfrac{\log(1-u)}{\log(1-p)} \right\rceil$ |
| Cauchy | $\tan\!\big(\pi(u - \tfrac{1}{2})\big)$ |
Gaussian has no closed-form inverse CDF; the Box-Muller transform is used instead:
$$Z_1 = \sqrt{-2\ln U_1}\,\cos(2\pi U_2), \qquad Z_2 = \sqrt{-2\ln U_1}\,\sin(2\pi U_2)$$
where $U_1, U_2 \sim \text{Uniform}(0, 1)$ i.i.d. gives $Z_1, Z_2 \sim \mathcal{N}(0, 1)$ i.i.d.
C.2 Sampling the Dirichlet
To sample $\mathbf{p} \sim \text{Dirichlet}(\alpha_1, \dots, \alpha_K)$:
- Draw $G_k \sim \text{Gamma}(\alpha_k, 1)$ independently for $k = 1, \dots, K$
- Return $p_k = G_k \big/ \sum_{j=1}^{K} G_j$
This works because the normalised Gamma variables follow a Dirichlet distribution.
C.3 The Gumbel-Max Trick for Categorical
To sample from Categorical($\mathbf{p}$):
- Compute logits $z_k = \log p_k$ (or use raw logits)
- Draw $g_k \sim \text{Gumbel}(0, 1)$ for $k = 1, \dots, K$
- Return $\arg\max_k\,(z_k + g_k)$
This gives exact categorical samples and enables the Gumbel-softmax differentiable approximation by replacing the argmax with a softmax at low temperature.
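A minimal NumPy sketch of the Gumbel-max trick, checking the empirical frequencies against the target probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
log_p = np.log(p)

n = 100_000
g = rng.gumbel(size=(n, len(p)))             # Gumbel(0, 1) noise
samples = np.argmax(log_p + g, axis=1)       # exact Categorical(p) samples

print(np.bincount(samples) / n)              # close to [0.5, 0.3, 0.2]
```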
Appendix D: Heavy-Tailed Distributions
D.1 What Makes a Tail Heavy?
The Gaussian tail decays as $e^{-x^2/2}$ - super-exponentially fast. Distributions with tails decaying slower than any exponential are called heavy-tailed.
Pareto distribution (power law): $f(x) \propto x^{-(\alpha + 1)}$ for $x \ge x_m$. The tail $P(X > x)$ decays as $x^{-\alpha}$.
- $\alpha > 2$: finite variance
- $1 < \alpha \le 2$: finite mean, infinite variance
- $\alpha \le 1$: infinite mean
Student-$t$: $f(x) \sim x^{-(\nu + 1)}$ for large $x$ - polynomial tail with exponent $\nu + 1$.
Cauchy ($\nu = 1$): Variance infinite, mean undefined. The average of $n$ i.i.d. Cauchy variables is still Cauchy - the CLT fails completely.
D.2 Heavy Tails in ML
Gradient noise: Empirical studies show that SGD gradient noise often has heavier tails than Gaussian. This may explain generalization: heavy-tailed noise explores the loss landscape more broadly, escaping sharp minima.
Neural network weight distributions: Trained neural network weights often follow power-law distributions (Martin & Mahoney 2019, 2021). "Heavy-tailed self-regularization" is proposed as a mechanism for implicit regularization.
Attention scores: Without temperature scaling, softmax attention can produce near-degenerate distributions (mass concentrated on one token), which is the tail behavior of a peaked Categorical.
Appendix E: Worked Derivations
E.1 Poisson MGF Derivation
Checking moments: . At : . [ok]
. At : .
. [ok]
E.2 Gaussian MGF Derivation (Complete)
Combine exponents:
Substituting back:
since the integral is 1 (it's a Gaussian PDF with mean ).
E.3 Beta-Bernoulli Posterior Derivation
Prior: . Likelihood: .
Posterior:
This is the unnormalised Beta PDF, so the posterior is:
E.4 Gamma-Poisson Posterior Derivation
Prior: , so . Likelihood for observations with sum :
Posterior:
This is .
Appendix F: Exponential Family - Additional Members
F.1 Negative Binomial as Exponential Family
The Negative Binomial PMF for :
Writing :
Natural param: , sufficient stat: , log-partition: .
Note: The parameter (number of failures until stop) must be known; the family is parameterised only by when is fixed.
F.2 von Mises Distribution (Circular Data)
For directional/angular data (e.g., wind direction, protein torsion angles), the von Mises distribution plays the role that the Gaussian plays for linear data:
where is the modified Bessel function. It belongs to the exponential family with (complex natural parameter).
For AI: Rotary Position Embedding (RoPE) in transformers embeds token positions as rotations in 2D planes. The von Mises distribution is the natural distribution for such circular position encodings.
Appendix G: Numerical Stability in Practice
G.1 Log-Sum-Exp
The log-partition function $A(\mathbf{z}) = \log\sum_k e^{z_k}$ is numerically unstable for large or small $z_k$ due to floating-point overflow/underflow.
Stable computation:
$$\log\sum_{k} e^{z_k} = \max_j z_j + \log\sum_{k} e^{\,z_k - \max_j z_j}$$
Since $z_k - \max_j z_j \le 0$ for all $k$, no overflow occurs.
G.2 Log-Space Probability Computations
For small probabilities, work in log space throughout:
- Multiply probabilities: add log-probabilities
- Normalise: subtract log-sum-exp
- Compute the Bernoulli NLL: `-y * log_p - (1-y) * log_1m_p` should use `log_p = log_sigmoid(logit)` and `log_1m_p = log_sigmoid(-logit)` for numerical stability (see the sketch below)
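A minimal NumPy sketch of the stable Bernoulli NLL computed from logits (the helper names are illustrative).

```python
import numpy as np

def log_sigmoid(z):
    # log sigma(z) = -log(1 + exp(-z)), computed stably for both signs of z
    return np.where(z >= 0, -np.log1p(np.exp(-z)), z - np.log1p(np.exp(z)))

def bernoulli_nll_from_logits(logit, y):
    return -(y * log_sigmoid(logit) + (1 - y) * log_sigmoid(-logit))

print(bernoulli_nll_from_logits(np.array([0.0, 40.0, -40.0]), np.array([1, 1, 0])))
# The naive form -log(1 - sigmoid(z)) underflows to log(0) for |z| this large;
# the log-sigmoid version stays finite.
```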
G.3 The Gamma Function at Large Arguments
can be computed stably using the Lanczos approximation. For , use Stirling's log-approximation: .
SciPy provides scipy.special.gammaln for numerically stable .
Appendix H: Distribution Identification Checklist
When fitting a distribution to data, use this diagnostic checklist:
| Check | Tool | Implication |
|---|---|---|
| Are values integers \geq 0? | - | Consider discrete (Poisson, NB, Geometric) |
| Are values in (0,1)? | - | Beta or transformed Gaussian |
| Are values on a simplex? | - | Dirichlet |
| Is variance \approx mean? | Dispersion test | Poisson (if yes), NB (if var > mean) |
| Is distribution symmetric? | Skewness test | Gaussian, , symmetric Beta |
| Are tails heavier than Gaussian? | Kurtosis, QQ-plot | Student-, Laplace |
| Does a QQ-plot follow the diagonal? | scipy.stats.probplot | Good fit to reference distribution |
| Histogram bell-shaped? | plt.hist | Gaussian or Beta with |
| Histogram exponentially decaying? | Log-scale plot | Exponential or Geometric |
| Do log-log tails follow a line? | Log-log tail plot | Power law (Pareto) |
Sample size requirements:
| Distribution | Minimum samples for good MLE |
|---|---|
| Bernoulli | ~30 (for far from 0/1) |
| Gaussian | ~30 (CLT kicks in) |
| Poisson | ~50 (for ); 10 for large |
| Beta | ~100 (4 moment equations for ) |
| Dirichlet | ~ (scales with ) |
| Student- | ~50 (for reliable estimate) |
Appendix I: Distribution Parameter Estimation (MLE)
I.1 Maximum Likelihood Estimation Overview
The MLE maximises the likelihood , equivalently maximises the log-likelihood .
For exponential family members, the MLE satisfies the moment-matching condition:
The expected sufficient statistics under the model equal the empirical sufficient statistics.
I.2 MLEs of Key Distributions
Bernoulli: (sample proportion)
Gaussian:
Note: is biased (divides by not ). The unbiased estimator uses the sample variance .
Poisson: (sample mean)
Exponential: (reciprocal of sample mean)
Categorical: where (empirical frequencies)
Beta: No closed form. Method of moments gives starting values:
Refine with Newton-Raphson using the digamma function .
I.3 Bayesian Estimation vs. MLE
Under conjugate priors, the posterior mean often provides a better estimator than MLE, especially for small samples:
| Distribution | MLE | Posterior Mean (conjugate prior) |
|---|---|---|
| Bernoulli | ||
| Categorical | ||
| Poisson |
The posterior mean shrinks the MLE toward the prior mean - a form of regularisation that prevents overfitting to small samples.
Appendix J: Relationships Between Distributions - Extended
J.1 The Exponential Family as a Unifying Framework
ALL MEMBERS OF THE EXPONENTIAL FAMILY
========================================================================
One-parameter families:
Bernoulli(p) - natural param: log(p/(1-p))
Poisson(\\lambda) - natural param: log(\\lambda)
Exponential(\\lambda) - natural param: -\\lambda
Geometric(p) - natural param: log(1-p)
Two-parameter families:
Gaussian(\\mu,\\sigma^2) - natural params: (\\mu/\\sigma^2, -1/2\\sigma^2)
Gamma(\\alpha,\\beta) - natural params: (\\alpha-1, -\\beta)
Beta(\\alpha,\\beta) - natural params: (\\alpha-1, \\beta-1)
NegBin(r,p) - natural param: log(p) [r fixed]
K-parameter families:
Categorical(p) - natural params: (log p_k/p_K)_{k<K}
Multinomial(n,p) - same as Categorical [n fixed]
Dirichlet(\\alpha) - natural params: (\\alpha_k - 1)_{k=1}^K
NOT exponential family:
Student-t(\\nu) - tail behavior is not exponential
Cauchy - same
Pareto - support depends on parameter
========================================================================
J.2 Scale Mixtures of Gaussians
Several heavy-tailed distributions arise as variance mixtures of Gaussians: with random variance .
| Mixing Distribution for | Marginal Distribution of |
|---|---|
| (constant) | |
| Student- | |
| Laplace | |
| -stable distribution |
For AI: The variance mixture representation of Student- enables efficient Gibbs sampling - alternating between sampling and . This technique underlies many robust Bayesian regression algorithms.
J.3 Normalising Flows: Transforming Distributions
Any differentiable, invertible function transforms a distribution: if and , then:
Examples of distribution transformation chains:
- Start with , apply : get Exponential
- Start with , apply : get Log-Normal
- Start with , apply CDF and then categorical rounding: get Gaussian copula
- Compose invertible transforms: get a normalising flow with complex target distribution
-> Full treatment: Section03 Joint Distributions (change-of-variables formula)
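The first chain above is easy to verify empirically. This sketch (rate $\lambda = 1.5$ chosen arbitrarily) transforms uniform samples and compares the empirical CDF with the exact Exponential CDF:

```python
import numpy as np
from scipy import stats

# Sketch: change of variables. U ~ Uniform(0,1), X = -log(U)/lam  =>  X ~ Exponential(lam).
rng = np.random.default_rng(3)
lam = 1.5
u = rng.uniform(size=100_000)
x = -np.log(u) / lam

for t in np.linspace(0.5, 3.0, 6):
    print(round(t, 2), (x <= t).mean(), stats.expon(scale=1 / lam).cdf(t))
```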
Appendix K: Quick-Reference Probability Tables
K.1 Standard Normal Tail Probabilities
| $z$ | $P(Z > z)$ | $P(\lvert Z \rvert > z)$ |
|---|---|---|
| 1.00 | 0.1587 | 0.3174 |
| 1.28 | 0.1003 | 0.2005 |
| 1.64 | 0.0505 | 0.1010 |
| 1.96 | 0.0250 | 0.0500 |
| 2.00 | 0.0228 | 0.0455 |
| 2.33 | 0.0099 | 0.0197 |
| 2.58 | 0.0049 | 0.0099 |
| 3.00 | 0.0013 | 0.0027 |
| 3.29 | 0.0005 | 0.0010 |
| 3.89 | 0.0001 | 0.0002 |
K.2 Poisson Probabilities $P(X = k)$ for $\lambda \in \{1, 2, 5\}$
| $k$ | $\lambda = 1$ | $\lambda = 2$ | $\lambda = 5$ |
|---|---|---|---|
| 0 | 0.3679 | 0.1353 | 0.0067 |
| 1 | 0.3679 | 0.2707 | 0.0337 |
| 2 | 0.1839 | 0.2707 | 0.0842 |
| 3 | 0.0613 | 0.1804 | 0.1404 |
| 4 | 0.0153 | 0.0902 | 0.1755 |
| 5 | 0.0031 | 0.0361 | 0.1755 |
| 6 | 0.0005 | 0.0120 | 0.1462 |
| 7 | 0.0001 | 0.0034 | 0.1044 |
K.3 Beta Distribution Moments and Shapes
| $(\alpha, \beta)$ | Mean | Std Dev | Shape Description |
|---|---|---|---|
| (1, 1) | 0.500 | 0.289 | Uniform |
| (0.5, 0.5) | 0.500 | 0.354 | U-shaped (arcsine) |
| (2, 2) | 0.500 | 0.224 | Symmetric bell |
| (2, 5) | 0.286 | 0.159 | Right-skewed |
| (5, 2) | 0.714 | 0.159 | Left-skewed |
| (10, 10) | 0.500 | 0.106 | Narrow symmetric bell |
| (1, 3) | 0.250 | 0.194 | Decreasing |
| (3, 1) | 0.750 | 0.194 | Increasing |
K.4 Binomial Cumulative Probabilities
$P(X \le k)$ for $X \sim \text{Binomial}(n = 20,\ p)$:
| $k$ | $p = 0.2$ | $p = 0.5$ | $p = 0.8$ |
|---|---|---|---|
| 2 | 0.2061 | 0.0002 | 0.0000 |
| 5 | 0.8042 | 0.0207 | 0.0000 |
| 8 | 0.9900 | 0.2517 | 0.0001 |
| 10 | 0.9994 | 0.5881 | 0.0026 |
| 12 | 1.0000 | 0.8684 | 0.0321 |
| 15 | 1.0000 | 0.9941 | 0.3704 |
| 18 | 1.0000 | 1.0000 | 0.9308 |
Appendix L: Information-Theoretic Properties of Distributions
L.1 Maximum Entropy Distributions
The principle of maximum entropy states: among all distributions satisfying given constraints, choose the one with maximum entropy. This gives the "least informative" distribution consistent with the known facts.
| Constraint | Maximum Entropy Distribution |
|---|---|
| Support $\{1, \dots, K\}$, no other info | Discrete Uniform |
| Support $[0, \infty)$, fixed mean | Exponential |
| Support $(-\infty, \infty)$, fixed mean and variance | Gaussian |
| Support $[a, b]$, no other info | Uniform$(a, b)$ |
| Support $\{0, 1\}$, fixed mean | Bernoulli |
| Support: the probability simplex, fixed $\mathbb{E}[\log x_k]$ | Dirichlet (with matching moments) |
Proof sketch for Gaussian: Maximise $H[p] = -\int p(x)\log p(x)\,dx$ subject to $\int p(x)\,dx = 1$, $\int x\,p(x)\,dx = \mu$, and $\int (x - \mu)^2 p(x)\,dx = \sigma^2$. Using Lagrange multipliers (Section05/04 of Chapter 5), the optimal density satisfies $\log p(x) = \lambda_0 + \lambda_1 x + \lambda_2 (x - \mu)^2$, which is the Gaussian form.
L.2 KL Divergences Between Common Distributions
Two Gaussians:
$$D_{\mathrm{KL}}\bigl(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)\bigr) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$
Gaussian from standard normal:
$$D_{\mathrm{KL}}\bigl(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\bigr) = \frac{1}{2}\bigl(\mu^2 + \sigma^2 - \log\sigma^2 - 1\bigr)$$
Two Bernoullis:
$$D_{\mathrm{KL}}\bigl(\mathrm{Ber}(p) \,\|\, \mathrm{Ber}(q)\bigr) = p\log\frac{p}{q} + (1 - p)\log\frac{1 - p}{1 - q}$$
Two Categoricals:
$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{k=1}^{K} p_k \log\frac{p_k}{q_k}$$
Two Poissons:
$$D_{\mathrm{KL}}\bigl(\mathrm{Poi}(\lambda_1) \,\|\, \mathrm{Poi}(\lambda_2)\bigr) = \lambda_1 \log\frac{\lambda_1}{\lambda_2} - \lambda_1 + \lambda_2$$
Two Betas:
$$D_{\mathrm{KL}}\bigl(\mathrm{Beta}(\alpha_1, \beta_1) \,\|\, \mathrm{Beta}(\alpha_2, \beta_2)\bigr) = \log\frac{B(\alpha_2, \beta_2)}{B(\alpha_1, \beta_1)} + (\alpha_1 - \alpha_2)\psi(\alpha_1) + (\beta_1 - \beta_2)\psi(\beta_1) + (\alpha_2 - \alpha_1 + \beta_2 - \beta_1)\psi(\alpha_1 + \beta_1)$$
where $\psi$ is the digamma function.
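As a quick numerical check (a sketch; the means and standard deviations below are arbitrary example values), the closed-form Gaussian KL above can be compared against `torch.distributions.kl_divergence`:

```python
import torch
from torch.distributions import Normal, kl_divergence

# Sketch: closed-form Gaussian KL versus torch's registered KL.
mu1, s1, mu2, s2 = 0.5, 1.2, 0.0, 1.0
analytic = (torch.log(torch.tensor(s2 / s1))
            + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5)
library = kl_divergence(Normal(mu1, s1), Normal(mu2, s2))
print(analytic.item(), library.item())   # the two values agree
```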
L.3 Entropy Ordering
For distributions with the same support and mean, entropy orders distributions by "spread":
For continuous distributions on $\mathbb{R}$ with fixed variance $\sigma^2$: $H(\text{Gaussian}) > H(\text{Laplace}) > H(\text{Uniform})$ (all matched to the same variance).
The Gaussian maximises entropy, making it the distribution that "knows least" given the mean and variance constraints - which is exactly why it appears in CLT results and maximum entropy arguments.
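The ordering is easy to check numerically. The sketch below matches all three distributions to unit variance (Laplace scale $b = \sigma/\sqrt{2}$ since $\mathrm{Var} = 2b^2$; Uniform width $\sqrt{12}\,\sigma$ since $\mathrm{Var} = w^2/12$) and prints their differential entropies:

```python
import numpy as np
from scipy import stats

# Sketch: differential entropy at a common variance sigma^2 = 1.
sigma = 1.0
print("Gaussian:", stats.norm(scale=sigma).entropy())                          # ~1.419
print("Laplace :", stats.laplace(scale=sigma / np.sqrt(2)).entropy())          # ~1.347
print("Uniform :", stats.uniform(loc=0, scale=np.sqrt(12) * sigma).entropy())  # ~1.242
```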
Appendix M: PyTorch and SciPy API Reference
M.1 SciPy Distributions
```python
from scipy import stats

# Discrete distributions
stats.bernoulli(p=0.3)                   # Bernoulli(p)
stats.binom(n=10, p=0.3)                 # Binomial(n, p)
stats.geom(p=0.3)                        # Geometric(p), k = 1, 2, ...
stats.nbinom(n=5, p=0.3)                 # Negative Binomial(n, p)
stats.poisson(mu=2.5)                    # Poisson(lambda)

# Continuous distributions (lambda_ and beta are example rate parameters)
lambda_, beta = 2.0, 1.5
stats.uniform(loc=0, scale=1)            # Uniform(0, 1)        [loc=a, scale=b-a]
stats.norm(loc=0, scale=1)               # Gaussian(mu, sigma)  [scale=sigma, not sigma^2!]
stats.expon(scale=1/lambda_)             # Exponential(lambda)  [scale=1/rate]
stats.gamma(a=2, scale=1/beta)           # Gamma(alpha, beta)   [a=shape, scale=1/rate]
stats.beta(a=2, b=5)                     # Beta(alpha, beta)
stats.t(df=5)                            # Student-t(nu)

# Common methods, shown on an example frozen distribution rv
rv = stats.norm(loc=0, scale=1)
x, k, q = 0.5, 1, 0.9
rv.pdf(x)                                # PDF (discrete distributions use rv.pmf(k))
rv.cdf(x)                                # CDF
rv.ppf(q)                                # Quantile (inverse CDF)
rv.sf(x)                                 # Survival function 1 - CDF
rv.rvs(size=100)                         # Random samples
rv.mean()                                # E[X]
rv.var()                                 # Var(X)
rv.std()                                 # Std dev
rv.entropy()                             # H(X) in nats
```
M.2 PyTorch torch.distributions
```python
import torch
from torch.distributions import (
    Bernoulli, Binomial, Geometric, Poisson,
    Categorical, Multinomial,
    Uniform, Normal, Exponential, Gamma, Beta, Dirichlet, StudentT,
)

# Create a distribution
d = Normal(loc=torch.tensor(0.0), scale=torch.tensor(1.0))

# Common methods (x and q are example arguments)
x, q = torch.tensor(0.5), torch.tensor(0.9)
d.sample((5,))       # draw 5 samples
d.log_prob(x)        # log p(x) - used in loss functions
d.cdf(x)             # F(x)
d.icdf(q)            # F^{-1}(q)
d.entropy()          # H(X)
d.mean               # E[X]  (property, not method)
d.variance           # Var(X)
d.stddev             # std dev

# Reparameterised sampling (enables gradients through sampling)
d.rsample((5,))      # only for reparameterisable distributions
                     # (Normal, Gamma, Beta, Dirichlet, ...)

# Special constructors (z = logits, alpha = concentration parameters)
z = torch.randn(4)
alpha = torch.ones(3)
Categorical(logits=z)             # Categorical from logits (uses log_softmax internally)
Dirichlet(concentration=alpha)    # Dirichlet(alpha)
```
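The point of `rsample` is that the sample stays on the autograd graph, so a downstream loss can send gradients back to the distribution's parameters. A minimal sketch (the toy objective and target value 2.0 are arbitrary):

```python
import torch
from torch.distributions import Normal

# Sketch: reparameterised sampling keeps gradients flowing to mu and sigma.
mu = torch.tensor(0.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)
d = Normal(loc=mu, scale=log_sigma.exp())

z = d.rsample((1000,))            # z = mu + sigma * eps, eps ~ N(0, 1)
loss = ((z - 2.0) ** 2).mean()    # toy objective: push samples toward 2
loss.backward()
print(mu.grad, log_sigma.grad)    # both non-None: gradients reached the parameters
```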
M.3 NumPy Random Number Generation
```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Discrete
rng.integers(0, 2, size=100)                # Uniform integers in {0, 1} (Bernoulli(0.5)-like)
rng.binomial(n=10, p=0.3, size=100)         # Binomial
rng.geometric(p=0.3, size=100)              # Geometric (p = success probability)
rng.poisson(lam=2.5, size=100)              # Poisson

# Continuous (lambda_ and beta are example rate parameters)
lambda_, beta = 2.0, 1.5
rng.uniform(low=0, high=1, size=100)        # Uniform
rng.normal(loc=0, scale=1, size=100)        # Gaussian [scale=sigma]
rng.exponential(scale=1/lambda_, size=100)  # Exponential [scale=1/rate]
rng.gamma(shape=2, scale=1/beta, size=100)  # Gamma
rng.beta(a=2, b=5, size=100)                # Beta

# Categorical / Dirichlet (K categories with probabilities probs, concentrations alpha)
K = 3
probs = np.array([0.2, 0.5, 0.3])
alpha = np.array([1.0, 2.0, 3.0])
rng.choice(K, p=probs, size=100)            # Categorical
rng.dirichlet(alpha, size=10)               # Dirichlet
rng.multinomial(n=20, pvals=probs)          # Multinomial
```
Appendix N: Distribution Parameter Estimation
Quick reference for maximum likelihood estimates (MLE) and method of moments (MOM) estimators.
Discrete Distributions
| Distribution | MLE Estimator | Method of Moments |
|---|---|---|
| Bernoulli($p$) | $\hat p = \bar x$ | Same as MLE |
| Binomial($n$, $p$), $n$ known | $\hat p = \bar x / n$ | Same as MLE |
| Geometric($p$) | $\hat p = 1 / \bar x$ | Same as MLE |
| Poisson($\lambda$) | $\hat\lambda = \bar x$ | Same as MLE |
| Negative Binomial($r$, $p$) | No closed form (numerical) | Match mean and variance |
Continuous Distributions
| Distribution | MLE Estimator(s) | Notes |
|---|---|---|
| Uniform($a$, $b$) | $\hat a = \min_i x_i$, $\hat b = \max_i x_i$ | Biased; adjust for small $n$ |
| Gaussian($\mu$, $\sigma^2$) | $\hat\mu = \bar x$, $\hat\sigma^2 = \frac{1}{n}\sum_i (x_i - \bar x)^2$ | Biased variance; unbiased uses $n-1$ |
| Exponential($\lambda$) | $\hat\lambda = 1/\bar x$ | Method of moments gives the same |
| Gamma($\alpha$, $\beta$) | Solve $\log\hat\alpha - \psi(\hat\alpha) = \log\bar x - \overline{\log x}$, then $\hat\beta = \hat\alpha / \bar x$ | $\psi$ is the digamma; numerical solution needed (see the sketch below) |
| Beta($\alpha$, $\beta$) | Method of moments: $\hat\alpha = \bar x\bigl(\tfrac{\bar x(1-\bar x)}{s^2} - 1\bigr)$, $\hat\beta = (1-\bar x)\bigl(\tfrac{\bar x(1-\bar x)}{s^2} - 1\bigr)$ | $s^2$ = sample variance |
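The Gamma MLE reduces to a one-dimensional root-finding problem in the digamma equation above. A sketch of the numerical solution (synthetic data with shape 3, scale 0.5, chosen only for illustration), cross-checked against SciPy's built-in fitter:

```python
import numpy as np
from scipy import optimize, special, stats

# Sketch: Gamma MLE via log(alpha) - psi(alpha) = log(mean(x)) - mean(log(x)).
rng = np.random.default_rng(7)
x = rng.gamma(shape=3.0, scale=0.5, size=5000)   # synthetic data (illustrative)

c = np.log(x.mean()) - np.log(x).mean()          # always > 0 by Jensen's inequality
alpha_hat = optimize.brentq(lambda a: np.log(a) - special.digamma(a) - c, 1e-3, 1e3)
beta_hat = alpha_hat / x.mean()                  # rate = shape / mean
print("digamma solve:", alpha_hat, beta_hat)

# Cross-check with SciPy's fitter (returns shape, loc, scale; scale = 1/rate)
a_fit, _, scale_fit = stats.gamma.fit(x, floc=0)
print("scipy fit    :", a_fit, 1 / scale_fit)
```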
Bayesian Estimation with Conjugate Priors
For conjugate models, the posterior mean provides a natural estimator:
Bernoulli with Beta($\alpha$, $\beta$) prior: $\hat p_{\text{post}} = \dfrac{\alpha + \sum_i x_i}{\alpha + \beta + n}$
This shrinks the MLE toward the prior mean $\frac{\alpha}{\alpha + \beta}$.
Poisson with Gamma($\alpha$, $\beta$) prior: $\hat\lambda_{\text{post}} = \dfrac{\alpha + \sum_i x_i}{\beta + n}$
Gaussian with Gaussian prior $\mathcal{N}(\mu_0, \tau^2)$ (known $\sigma^2$): $\hat\mu_{\text{post}} = \dfrac{\mu_0/\tau^2 + n\bar x/\sigma^2}{1/\tau^2 + n/\sigma^2}$
As $n \to \infty$, all Bayesian estimators converge to the MLE - the data overwhelms the prior.
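A sketch of this convergence for the Gamma-Poisson pair (the prior hyperparameters $\alpha_0 = 2$, $\beta_0 = 1$ and the true rate 3.5 are illustrative choices):

```python
import numpy as np

# Sketch: Gamma(alpha0, beta0)-Poisson posterior mean converging to the MLE.
rng = np.random.default_rng(4)
alpha0, beta0, lam_true = 2.0, 1.0, 3.5
for n in (5, 50, 5000):
    x = rng.poisson(lam_true, size=n)
    mle = x.mean()                              # sample mean
    post_mean = (alpha0 + x.sum()) / (beta0 + n)
    print(f"n={n:4d}  MLE={mle:.3f}  posterior mean={post_mean:.3f}")
```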
Appendix O: Tail Bounds for Common Distributions
Understanding tail behaviour is essential for concentration inequalities and PAC-learning bounds.
Gaussian Tails
The Mills ratio gives an upper bound: for $z > 0$, $P(Z > z) \le \dfrac{\phi(z)}{z} = \dfrac{1}{z\sqrt{2\pi}}\,e^{-z^2/2}$.
More precisely: $P(Z > z) \sim \dfrac{\phi(z)}{z}$ as $z \to \infty$.
Numerically: $P(Z > 2) \approx 0.0228$, $P(Z > 3) \approx 0.0013$, $P(Z > 4) \approx 3.2 \times 10^{-5}$.
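The bound is easy to inspect numerically; the sketch below compares the Mills-ratio bound $\phi(z)/z$ with the exact tail from SciPy's survival function:

```python
from scipy import stats

# Sketch: Mills-ratio upper bound versus the exact Gaussian tail.
for z in (1.0, 2.0, 3.0, 4.0):
    exact = stats.norm.sf(z)            # P(Z > z)
    bound = stats.norm.pdf(z) / z       # phi(z) / z
    print(z, exact, bound)              # bound >= exact, and tight for large z
```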
Sub-Gaussian Distributions
A random variable $X$ is sub-Gaussian with parameter $\sigma^2$ if: $\mathbb{E}\bigl[e^{\lambda (X - \mathbb{E}[X])}\bigr] \le e^{\lambda^2 \sigma^2 / 2}$ for all $\lambda \in \mathbb{R}$.
This implies Gaussian-like tail decay: $P\bigl(\lvert X - \mathbb{E}[X] \rvert > t\bigr) \le 2\,e^{-t^2 / (2\sigma^2)}$.
Examples: Gaussian($\mu$, $\sigma^2$) is sub-Gaussian($\sigma^2$); a bounded variable $X \in [a, b]$ is sub-Gaussian with parameter $(b - a)^2 / 4$ (Hoeffding's lemma).
LLM relevance: Token embeddings are often assumed sub-Gaussian for theoretical analysis of attention mechanisms and generalisation bounds.
Poisson Tails (Chernoff bounds)
For $X \sim \mathrm{Poisson}(\lambda)$ and $x > \lambda$: $P(X \ge x) \le e^{-\lambda}\left(\dfrac{e\lambda}{x}\right)^{x}$.
Exponential Tails (Memoryless)
For $X \sim \mathrm{Exp}(\lambda)$: $P(X > t) = e^{-\lambda t}$ exactly (no approximation needed).
Forward reference: Concentration inequalities (Markov, Chebyshev, Hoeffding, Bernstein) are developed fully in Section6.5 Concentration Inequalities.
Appendix P: Worked Example - Distribution Selection in Practice
Problem: A recommendation system logs how many times each user clicks on recommended items in a session. You observe counts $x_1, \dots, x_n$ and want to model their distribution.
Step 1: Check support. Counts take values in $\{0, 1, 2, \dots\}$ -> discrete, non-negative integer. Eliminates all continuous distributions.
Step 2: Check bounds. No natural upper bound -> Geometric or Poisson family. (If sessions had a fixed length $n$, Binomial would apply.)
Step 3: Check variance-mean relationship.
- Poisson: $\mathrm{Var}(X) = \mathbb{E}[X]$
- Geometric (counting from 0): $\mathrm{Var}(X) = \mathbb{E}[X]/p > \mathbb{E}[X]$ (always over-dispersed)
- Negative Binomial: $\mathrm{Var}(X) = \mathbb{E}[X] + \mathbb{E}[X]^2/r > \mathbb{E}[X]$ (over-dispersed, extra parameter)
Compute the sample mean $\bar x$ and sample variance $s^2$. If $s^2 \approx \bar x$, use Poisson. If $s^2 > \bar x$, use Negative Binomial.
Step 4: Fit and check. Compute MLE, then use a chi-squared goodness-of-fit test or probability plot.
Step 5: Consider mixture models. If many zeros occur (zero-inflated data), consider the Zero-Inflated Poisson: $P(X = 0) = \pi + (1 - \pi)e^{-\lambda}$, and $P(X = k) = (1 - \pi)\dfrac{\lambda^k e^{-\lambda}}{k!}$ for $k \ge 1$.
Key insight: The choice of distribution encodes assumptions about the data-generating process. Poisson assumes events occur at a constant rate independently; Negative Binomial allows the rate itself to vary (it is a Poisson-Gamma mixture). In LLM contexts, token frequency distributions are often heavy-tailed (Zipfian), motivating log-normal or power-law models rather than Poisson.
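A sketch of the selection recipe on synthetic data: the counts below are generated from a Poisson-Gamma (Negative Binomial) mechanism chosen purely for illustration, so the dispersion check should flag over-dispersion and the Poisson fit should underestimate the tail.

```python
import numpy as np
from scipy import stats

# Sketch: distribution selection on synthetic over-dispersed click counts.
rng = np.random.default_rng(6)
rates = rng.gamma(shape=2.0, scale=1.5, size=2000)   # latent per-user rates (illustrative)
counts = rng.poisson(rates)                          # Poisson-Gamma mixture = Negative Binomial

# Step 3: dispersion check
xbar, s2 = counts.mean(), counts.var(ddof=1)
print("mean:", round(xbar, 2), "variance:", round(s2, 2))   # s2 > xbar -> over-dispersed

# Step 4: a naive Poisson fit (MLE = sample mean) understates the right tail
lam_hat = xbar
print("P(X >= 10) empirical     :", (counts >= 10).mean())
print("P(X >= 10) under Poisson :", stats.poisson(mu=lam_hat).sf(9))
```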
Last updated: 2026 - covers all distributions in Section6.2 scope as defined in the Chapter README.